
Elemental Embeddings

The data contained in this repository are a collection of various elemental representation/embedding schemes. We provide the literature source for each representation as well as the data source from which the files were obtained. Some representations have been obtained from the following repositories:

Linear representations

For the linear/scalar representations, the Embedding class will load these representations as one-hot vectors whose components are ordered following the scale (i.e. the atomic representation is ordered by atomic number).
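As a minimal sketch of this expansion (plain NumPy, not the package's own loader; the three-element scale is a placeholder), a linear scale can be turned into one-hot vectors like so:

```python
import numpy as np

# Hypothetical three-element slice of a linear scale, ordered by atomic number.
scale = ["H", "He", "Li"]

# Each element becomes a one-hot row vector; component order follows the scale.
one_hot = {el: np.eye(len(scale), dtype=int)[i] for i, el in enumerate(scale)}

print(one_hot["He"])  # [0 1 0]
```

For the full periodic table the scale would contain 118 symbols, giving 118-dimensional one-hot vectors.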

Modified Pettifor scale

The following paper describes the details of the modified Pettifor chemical scale: The optimal one-dimensional periodic table: a modified Pettifor chemical scale from data mining

Data source

Atomic numbers

We include atomic as a linear representation to generate one-hot vectors corresponding to the atomic numbers.

Vector representations

The following representations are all vector representations (some are local, some are distributed) and the Embedding class will load these representations as they are.
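As an illustrative sketch (hypothetical file contents and values, not the repository's actual loader), a vector representation can be stored as a mapping from element symbol to vector and loaded unchanged:

```python
import json

import numpy as np

# Hypothetical two-element, four-dimensional representation.
raw = '{"H": [0.1, -0.2, 0.3, 0.0], "He": [0.5, 0.1, -0.4, 0.2]}'

# Vector representations are loaded as-is: no one-hot expansion or reordering.
embeddings = {el: np.array(vec) for el, vec in json.loads(raw).items()}

print(embeddings["H"].shape)  # (4,)
```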

cgnf

The following paper describes the implementation of the composition graph neural fingerprint (cgnf) from the node embedding vectors of a pre-trained crystal graph convolution neural network: Synthesizability of materials stoichiometry using semi-supervised learning

Data source

crystallm

The following paper describes the details behind the generative crystal structure model based on a large language model: Crystal Structure Generation with Autoregressive Large Language Modeling

magpie

The following paper describes the details of the Materials Agnostic Platform for Informatics and Exploration (Magpie) framework: A general-purpose machine learning framework for predicting properties of inorganic materials

The source code for Magpie can be found on Bitbucket

Data source

The 22-dimensional embedding vector includes the following elemental properties:

  • Number
  • Mendeleev number
  • Atomic weight
  • Melting temperature
  • Group number
  • Period
  • Covalent radius
  • Electronegativity
  • No. of s, p, d and f valence electrons (4 features)
  • No. of valence electrons
  • No. of unfilled s, p, d and f orbitals (4 features)
  • No. of unfilled orbitals
  • GSvolume_pa (DFT volume per atom of the T=0 K ground state from the OQMD)
  • GSbandgap (DFT band gap energy of the T=0 K ground state from the OQMD)
  • GSmagmom (DFT magnetic moment of the T=0 K ground state from the OQMD)
  • Space group number

magpie_sc is a scaled version of the magpie embeddings. Data source

mat2vec

The following paper describes the implementation of mat2vec: Unsupervised word embeddings capture latent knowledge from materials science literature

Data source

matscholar

The following paper describes the natural language processing implementation of Materials Scholar (matscholar): Named Entity Recognition and Normalization Applied to Large-Scale Information Extraction from the Materials Science Literature

Data source

megnet

The following paper describes the details of the construction of the MatErials Graph Network (MEGNet): Graph Networks as a Universal Machine Learning Framework for Molecules and Crystals. The 16-dimensional vectors are drawn from the learned elemental embedding weights of a model trained to predict the formation energies of crystalline materials.

Data source

oliynyk

The following paper describes the details: High-Throughput Machine-Learning-Driven Synthesis of Full-Heusler Compounds

Data source

The 44 features of the embedding vector are formed of the following properties:

  • Number
  • Atomic_Weight
  • Period
  • Group
  • Families
  • Metal
  • Nonmetal
  • Metalliod
  • Mendeleev_Number
  • l_quantum_number
  • Atomic_Radius
  • MiracleRadius[pm]
  • Covalent_Radius
  • Zunger_radii_sum
  • Ionic_radius
  • crystal_radius
  • Pauling_Electronegativity
  • MB_electonegativity
  • Gordy_electonegativity
  • Mulliken_EN
  • Allred-Rockow_electronegativity
  • Metallic_valence
  • Number_of_valence_electrons
  • Gilmor_number_of_valence_electron
  • valence_s
  • valence_p
  • valence_d
  • valence_f
  • Number_of_unfilled_s_valence_electrons
  • Number_of_unfilled_p_valence_electrons
  • Number_of_unfilled_d_valence_electrons
  • Number_of_unfilled_f_valence_electrons
  • Outer_shell_electrons
  • 1stionization_potential(kJ/mol)
  • Polarizability(A^3)
  • Meltingpoint(K)
  • BoilingPoint(K)
  • Density_(g/mL)
  • Specificheat(J/gK)
  • Heatof_fusion(kJ/mol)_
  • Heatof_vaporization(kJ/mol)_
  • Thermalconductivity(W/(mK))
  • Heat_atomization(kJ/mol)
  • Cohesive_energy

oliynyk_sc is a scaled version of the oliynyk embeddings. Data source

random

This is a set of 200-dimensional vectors in which the components are randomly generated.

The 118 200-dimensional vectors in random_200_new were generated using the following code:

```python
import numpy as np

mu, sigma = 0, 1  # mean and standard deviation
s = np.random.default_rng(seed=42).normal(mu, sigma, (118, 200))
```

skipatom

The following paper describes the details: Distributed representations of atoms and materials for machine learning

Data source

xenonpy

The XenonPy embedding uses the 58 features that are commonly employed in publications based on the XenonPy package. See the following publications:

MLIP representations

The following embeddings are derived from Machine Learning Interatomic Potentials (MLIPs). Element-level vectors are obtained by averaging atom-level embeddings over structures from the MP-20 dataset.
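A minimal sketch of this averaging step (plain NumPy; the structures and per-atom descriptors below are toy placeholders, not actual MLIP outputs over MP-20):

```python
import numpy as np

# Toy per-structure data: each structure is a list of (element, descriptor) pairs.
# In practice the descriptors would be an MLIP's atom-level node features.
structures = [
    [("Na", np.array([1.0, 0.0])), ("Cl", np.array([0.0, 1.0]))],
    [("Na", np.array([3.0, 2.0])), ("Na", np.array([1.0, 2.0]))],
]

# Accumulate atom-level embeddings per element across all structures, then average.
sums, counts = {}, {}
for structure in structures:
    for element, vec in structure:
        sums[element] = sums.get(element, 0) + vec
        counts[element] = counts.get(element, 0) + 1

element_embeddings = {el: sums[el] / counts[el] for el in sums}
print(element_embeddings["Na"])  # mean of the three Na atom vectors
```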

mace_mp0

128-dimensional scalar invariant descriptors from the MACE-MP-0 medium foundation model, extracted from the last interaction layer via MACECalculator.get_descriptors(invariants_only=True, num_layers=1): A foundation model for atomistic simulations

sevennet

128-dimensional scalar (l=0) node features from the SevenNet (7net-0) model, captured from the output of the last equivariant gate after five interaction blocks: Scalable parallel algorithm for graph neural network interatomic potentials in molecular dynamics simulations

orb_v2

256-dimensional node features from the ORB-v2 model, captured at the input to the energy prediction head after the decoder transforms the graph neural network output: ORB: A fast, scalable neural network potential

chgnet

64-dimensional node features from the CHGNet model, captured from the output of the last atom convolution layer. CHGNet is a pretrained universal neural network potential that incorporates charge information: CHGNet as a pretrained universal neural network potential for charge-informed atomistic modelling

chemeleon

512-dimensional node features from the Chemeleon-DNG generative crystal structure model, captured after the GNN message-passing layers at t=0 (fully denoised state) and averaged over MP-20 structures: Crystal structure generation and property optimization using a generative graph neural network

LLM representations

The following embeddings are derived from Large Language Models trained on materials science text.

matscibert

768-dimensional token embeddings from MatSciBERT, a BERT model pre-trained on materials science literature. Element vectors are extracted from the token embedding layer for each element symbol: MatSciBERT: A materials domain language model for text mining and information extraction