Machine-Learning concepts

At the fundamental level, machine learning consists of a set of statistical methods that help make sense of complex, high-dimensional data. These methods can be divided into two main categories: supervised and unsupervised learning.

Unsupervised Learning


Unsupervised methods work only with the descriptors, or features, of the data, which in our case is the information that uniquely defines a material. For example, the chemical composition and the electronic structure of a catalytic material qualify as a complete set of descriptors, albeit a complex one with many dimensions. Unsupervised methods can find correlations between descriptors and thereby eliminate redundant features with low information content. Furthermore, these methods can identify common patterns in the descriptors of different materials and group the materials accordingly; this way, we can focus our efforts on the ones similar to known efficient catalysts.
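
The two unsupervised tasks described above, removing redundant descriptors and grouping similar materials, can be illustrated with a minimal sketch. Everything in it is an assumption made for illustration: the toy descriptor matrix is randomly generated rather than taken from real catalysts, the correlation cutoff of 0.95 is arbitrary, and plain k-means stands in for whichever clustering method is actually used.

```python
import numpy as np


def drop_correlated_features(X, threshold=0.95):
    """Return the column indices to keep.

    A column is dropped when its absolute Pearson correlation with an
    already kept column reaches `threshold` (an assumed cutoff).
    """
    corr = np.abs(np.corrcoef(X, rowvar=False))
    keep = []
    for j in range(X.shape[1]):
        if all(corr[j, k] < threshold for k in keep):
            keep.append(j)
    return keep


def kmeans(X, n_clusters, n_iter=50):
    """Lloyd's k-means with a deterministic farthest-point initialisation."""
    centers = [X[0]]
    for _ in range(1, n_clusters):
        # Distance of each sample to its nearest already chosen center
        nearest = np.min([np.linalg.norm(X - c, axis=1) for c in centers], axis=0)
        centers.append(X[nearest.argmax()])
    centers = np.array(centers, dtype=float)

    for _ in range(n_iter):
        # Pairwise sample-to-center distances, shape (n_samples, n_clusters)
        dists = np.linalg.norm(X[:, None, :] - centers[None, :, :], axis=2)
        labels = dists.argmin(axis=1)
        for c in range(n_clusters):
            members = X[labels == c]
            if len(members) > 0:
                centers[c] = members.mean(axis=0)
    return labels


# Toy data: two groups of 20 hypothetical materials, two independent features
rng = np.random.default_rng(42)
group_a = rng.normal(loc=[0.0, 0.0], scale=0.3, size=(20, 2))
group_b = rng.normal(loc=[5.0, 0.0], scale=0.3, size=(20, 2))
base = np.vstack([group_a, group_b])
# The third feature is a linear function of the first, hence redundant
X = np.column_stack([base, 2.0 * base[:, 0] + 1.0])

kept = drop_correlated_features(X)
labels = kmeans(X[:, kept], n_clusters=2)
print("kept feature columns:", kept)
print("cluster labels:", labels)
```

In this toy setting the redundant third column is discarded and the two groups end up in separate clusters. With real descriptors, the features would normally be rescaled first so that no single one dominates the distances.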

Supervised Learning


In many applications we want to find how the properties of a material correlate with its descriptors. When the relationship between the two is complex, simple fitting schemes can no longer be applied and we turn to machine learning. In its supervised flavor, we train a model that approximates the descriptor-property relation based on known data. Its internal parameters are adjusted to reproduce the properties of known materials as well as possible, while keeping the model's complexity low. Afterwards, we can estimate the properties of new materials by feeding their descriptors to the model; the prediction is much faster and cheaper than measuring the material's properties experimentally or computing them with accurate quantum chemistry methods.
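
As a rough illustration of this train-then-predict workflow, the sketch below uses kernel ridge regression with a Gaussian kernel. This model is chosen here only as an example and is not necessarily the one used in the project. The one-dimensional "descriptor", the sine-shaped "property", and the values of `reg` and `length_scale` are all assumptions. The regularisation term `reg` plays the role of keeping the model's complexity low.

```python
import numpy as np


def gaussian_kernel(A, B, length_scale):
    """Gaussian (RBF) kernel matrix between the rows of A and the rows of B."""
    sq_dists = ((A[:, None, :] - B[None, :, :]) ** 2).sum(axis=2)
    return np.exp(-sq_dists / (2.0 * length_scale ** 2))


def fit_krr(X_train, y_train, reg=1e-3, length_scale=0.5):
    """Solve (K + reg * I) w = y for the kernel ridge regression weights."""
    K = gaussian_kernel(X_train, X_train, length_scale)
    return np.linalg.solve(K + reg * np.eye(len(X_train)), y_train)


def predict_krr(X_new, X_train, weights, length_scale=0.5):
    """Predict properties of new samples from their descriptors."""
    return gaussian_kernel(X_new, X_train, length_scale) @ weights


# Toy "known materials": a 1D descriptor and a smooth property (assumed)
X_train = np.linspace(0.0, 3.0, 25).reshape(-1, 1)
y_train = np.sin(2.0 * X_train[:, 0])

weights = fit_krr(X_train, y_train)

# "New materials": descriptors that were not part of the training set
X_new = np.array([[0.4], [1.7], [2.6]])
y_pred = predict_krr(X_new, X_train, weights)
y_true = np.sin(2.0 * X_new[:, 0])
print("predicted:", y_pred)
print("reference:", y_true)
```

Once the weights are known, each prediction costs only one kernel evaluation per training sample, which is why it is so much cheaper than a new calculation or experiment.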


Machine-learning methods can be used to avoid large portions of the immense materials space and to perform searches and optimisations much faster than conventional methods. However, there are a few challenging points to address. The descriptor has to be carefully defined so that it contains the physics we hope to machine-learn with it. Ideally, it is invariant with respect to translations, rotations and atomic permutations. It also has to be complete and unique, so that each descriptor corresponds to only one system. Smoothness is also important, i.e. the descriptors of two similar systems should not be too different. Nanolayers is actively developing such descriptors together with novel machine-learning methods tailored for catalyst systems. The other main issue is data: more complex ML models contain more internal parameters, which demand more data to train. Large numbers of quantum chemistry calculations are therefore necessary to build a suitable database of catalysts, and these are currently being performed at TUT and Aalto.
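
To make the invariance requirements concrete, here is a small sketch of a deliberately simple descriptor: the sorted list of interatomic distances, each labelled by its element pair. It is not the descriptor developed in the project, and the geometry below is a made-up, water-like example. This descriptor is invariant to translations, rotations and atomic permutations, but it is not guaranteed to be complete for larger systems.

```python
import math
from itertools import combinations


def distance_descriptor(symbols, positions):
    """Sorted list of (element pair, distance) over all pairs of atoms."""
    entries = []
    for (s1, p1), (s2, p2) in combinations(zip(symbols, positions), 2):
        pair = tuple(sorted((s1, s2)))  # element order must not matter
        entries.append((pair, math.dist(p1, p2)))
    return sorted(entries)


def rotate_z_and_shift(positions, angle, shift):
    """Rotate all atoms about the z axis by `angle`, then translate by `shift`."""
    c, s = math.cos(angle), math.sin(angle)
    return [
        (c * x - s * y + shift[0], s * x + c * y + shift[1], z + shift[2])
        for x, y, z in positions
    ]


def same_descriptor(d1, d2, tol=1e-9):
    """Compare two descriptors up to floating-point round-off."""
    return len(d1) == len(d2) and all(
        p == q and math.isclose(a, b, abs_tol=tol)
        for (p, a), (q, b) in zip(d1, d2)
    )


# Hypothetical water-like geometry (coordinates are illustrative only)
symbols = ["O", "H", "H"]
positions = [(0.0, 0.0, 0.0), (0.96, 0.0, 0.0), (-0.24, 0.93, 0.0)]

# Same system after a rotation, a translation and a swap of the two H atoms
moved = rotate_z_and_shift(positions, angle=0.7, shift=(1.0, -2.0, 3.0))
permuted_symbols = [symbols[0], symbols[2], symbols[1]]
permuted_positions = [moved[0], moved[2], moved[1]]

original = distance_descriptor(symbols, positions)
transformed = distance_descriptor(permuted_symbols, permuted_positions)
print(same_descriptor(original, transformed))
```

Feeding raw Cartesian coordinates to a model instead would give a different input for each of these equivalent configurations, which the model would then have to learn to treat as identical.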