# Identifying domains of applicability of machine learning models for materials science

Sep 4, 2020

### Domain of applicability identification via subgroup discovery

To formally introduce the method for DA identification, we recall some notions of ML for materials. In order to apply smooth function approximation techniques like ridge regression, the materials of interest are represented as vectors in a vector space X according to some chosen representation. The more complex state-of-the-art representations evaluated in this work are defined further below. A first simple example is to use features ϕ1, …, ϕn of the isolated atoms that constitute the material (e.g., ϕi(Z) may be the “electronegativity of the species with atomic number Z” (see Supplementary Table 4) and then to lift these to representation coordinates xi for compounds ({({Z}_{j},{mu }_{j})}_{j = 1}^{k}) defined as

$${x}_{i}=mathop{sum }limits_{j=1}^{k}{mu }_{j}{phi }_{i}({Z}_{j})$$

(1)

where μj corresponds to the mixture coefficient for atomic number Zj. Moreover, let y be a numeric material property according to which screening should be performed (in this work, we focus on formation energy, which is relevant for performing a ground state search).

A predictive ML model is then a function (f:Xto {mathbb{R}}) aiming to minimize the expected error (also called prediction risk)

$$e(f)=int_{Xtimes {mathbb{R}}}l(f({boldsymbol{x}}),y)dP({boldsymbol{x}},y)$$

(2)

measured by some non-negative loss function l that quantifies the cost incurred by predicting the actual property value y with f(x). Examples for loss functions are the squared error ((l(y^{prime} ,y)={(y^{prime} -y)}^{2})), the absolute error ((l(y^{prime} ,y)=| y^{prime} -y|)), and, for non-zero properties, the relative error ((l(y^{prime} ,y)=| y^{prime} -y”https://www.nature.com/” y|)). Here P denotes some fixed probability distribution that captures how candidate materials are assumed to be sampled from the materials class (this concept, while commonly assumed in ML, is an unnecessary restriction for high-throughput screening as we discuss in more detail below). Since the true prediction risk is impossible to compute directly without perfect knowledge of the investigated materials class, models are evaluated by the test error (or empirical risk)

$$hat{e}(f)=mathop{sum }limits_{i=1}^{m}{e}_{i}(f)/m$$

(3)

defined as the average of the individual errors (losses) ei(f) = l(f(xi), yi) on some test set of m reference data points ({({x}_{i},{y}_{i})}_{i = 1}^{m}). The samples in this test set are drawn independently and identically distributed according to P and are also independent of the model—which means in practice that it is a random subset of all available reference data that has been withheld from the ML algorithm. In order to reduce the variance of this estimate, a common strategy is cross-validation, where this process is repeated multiple times based on partitioning the data into a number of non-overlapping “folds” and then to use each of these folds as test sets and the remaining data as a training set to fit the model.

This test error properly estimates the model performance globally over the whole representation space X (weighted by the distribution P used to generate the test points). This is an appropriate evaluation metric for selecting a model that is required to work well on average for arbitrary new input materials that are sampled according to the same distribution P. This is, however, not the condition of high-throughput screening. Here, rather than being presented with random inputs, we can decide which candidate materials to screen next. This observation leads to the central idea enabled by the DA analysis proposed in this work: if the employed model is particularly applicable in a specific subdomain of the materials class, and if that subdomain has a simple and interpretable shape that permits to generate new materials from it, then we can directly focus the screening there.

Such simply described DA can be identified by the descriptive data mining technique of subgroup discovery (SGD)22,23,24. This technique finds selectors in the form of logical conjunctions, i.e., Boolean functions (σX → {true, false}) of the form:

$$sigma (x)equiv {pi }_{1}(x)wedge {pi }_{2}(x)wedge ldots wedge {pi }_{p}(x)$$

where “” denotes the “and” operation and each proposition πi is a simple inequality constraint on one of the coordinates, i.e., πi(x) ≡ xj ≤ v for some constant v. Thus, these selectors describe intersections of axis-parallel half-spaces resulting in simple convex regions ({xXσ(x) = true}) in X. This allows to systematically reason about the described subdomains (e.g., it is easy to determine their differences and overlap) and also to sample novel points from them. To specifically obtain regions where a given model has a decreased error, standard SGD algorithms25,26 can be configured to yield a selector with maximum impact on the model error. The impact is defined as the product of selector coverage, i.e., the probability of the event σ(x) = true, and the selector effect on the model error, i.e., the model error minus the model error given that the features satisfy the selector.

### An illustrative example

Before describing the details of DA identification and its integration into the ML process, let us illustrate the concept and its utility via a synthetic example (see Fig. 1). We consider a simple two-dimensional representation consisting of independent features x1 and x2 that are each distributed according to a normal distribution with mean 0 and variance 2 (N(0, 2)) and a target property y that is a third-degree polynomial in x1 with an additive noise component that scales exponentially in x2:

$$y sim {x}_{1}^{3}-{x}_{1}+N(0,exp ({x}_{2}/2)).$$

That is, the y values are almost determined by the third-degree polynomial for low x2 values but are almost completely random for high x2 values. Discovering applicable domains reveals how different models cope differently with this setting even if they have a comparable average error. To show this, let us examine the error distributions obtained from three different kernelized regression models of the form

$$f(cdot )=mathop{sum }limits_{i=1}^{n}{nu }_{i}k({{boldsymbol{x}}}_{i}^{F},cdot )$$

with parameter vector ν that are fitted around a training, or fitting (F), set ({({{boldsymbol{x}}}_{i}^{F},{y}_{i}^{F})}_{i = 1}^{n}) with three different choices for the kernel function k. We observe:

• When using the linear (lin) kernel ((k(x,x^{prime} )=langle x,x^{prime} rangle)), the resulting linear model is globally incapable to trace the variation of the third-order polynomial except for a small stripe on the x1-axis where it can be approximated well by a linear function. Consequently, there is a very high error globally that is substantially reduced in the DA described by σlin(x1x2) ≡ −0.3 ≤ x1 ≤ 0.3.

• When using the Gaussian kernel (k(x,x^{prime} )=exp parallel x-x^{prime} {parallel }^{2}/ 2{epsilon }^{2})), the resulting radial basis function (rbf) model is able to represent the target property well locally unless (a) the noise component is too large and (b) the variation of the target property is too high relative to the number of training points. The second restriction is because the rbfs have non-negligible values only within a small region around the training examples. Consequently, the discovered DA is not only restricted in x2-direction but also excludes high absolute x1-values: σrbf ≡ −3.3 ≤ x11 ≤ 3.1 x2 ≤ 0.1.

• In contrast, when using the non-local third-degree polynomial (poly) kernel (k(x,x^{prime} )={(langle x,x^{prime} rangle +1)}^{3}), data sparsity does not prevent an accurate modeling of the target property along the x1-axis. However, this non-locality is counterproductive along the x2-axis where overfitting of the noise component has a global influence that results in higher prediction errors for the almost deterministic data points with low x2-values. This is reflected in the identified DA σpoly(x1x2) ≡ −3.5 ≤ x2 ≤ 0.1, which contains no restriction in x1-direction, but excludes both high and low x2-values. This highlights an important structural difference between the rbf and the polynomial model that is not reflected in their similar average errors.

### DA representation and objective function

In the illustrative example above, all evaluated models share the same simple representation. However, in practice different models are typically fitted with different and more complicated representations. For instance, for the study on formation energies of transparent oxides below, we compare models based on the n-gram, SOAP, and MBTR representations. These representations use different descriptions of the local atomic geometry, leading to high-dimensional non-linear transforms of the material configurations (e.g., 1400, 681, and 472 dimensions for MBTR, SOAP, and n-gram representations). A DA described directly in terms of these complex representations cannot easily be mapped back to intuitive conditions on the unit cell of a given material. This not only hinders interpreting the DA but also to construct novel materials from it. Finally, using different representations to describe DAs of different models makes it impossible to assess their overlap and differences. Therefore, we define a single representation comprised of features that are specifically intended for the description of insightful subdomains. A first natural group of features pertains directly to the shape of the unit cell such as the sorted lattice vectors and angles, the number of atoms in the unit cell, and the unit-cell volume. In addition, when we are interested in a fixed compositional space, we can add features describing the composition (e.g., “percentage of Al cations”) as well as structural features describing the bonding environments (e.g., “average nearest-neighbor distance between Al and O”, which we define using the effective coordination number27). The description of DAs in these simple terms of the unit-cell structure and composition allows to easily interpret, compare, and sample from them (e.g., for focused screening). However, we note that the representation space inputted into subgroup discovery can be adapted for various purposes depending on the focus of the investigation. See Table 1 for a summary of all features used.

The DA optimization and validation can be performed as a by-product from the labels and ML predictions of the test set. However, just as for the ML-model fitting itself, we can only estimate these quantities based on empirical data. For that purpose, it is sensible to also split the test data into two parts: a DA identification set for optimizing the empirical impact and a DA validation set for obtaining an unbiased performance estimate of the identified DA (see Fig. 2 for an illustration of the overall workflow). Technically, the data points withheld in the DA validation set mimic novel independent sample points that can be used to evaluate both the coverage of the DA, as well as, the reduction in model error. As an extension of this, one can also repeat the DA optimization/validation on several splits (cross-validation) to reduce the variance of the coverage and model error estimates and, moreover, to assess the stability of the DA selector elements.

For ease of notation we assume the DA identification set consists of the first k points of the test set. We end up with the following objective function for the SGD algorithm:

$${rm{impact}}(sigma )=underbrace{left(frac{s}{k}right)}_{{rm{coverage}}}underbrace{left(frac{1}{k}mathop{sum }limits_{i = 1}^{k}{l}_{i}(f)-frac{1}{s}sum _{iin I(sigma )}{l}_{i}(f)right)}_{{rm{effect}},{rm{on}},{rm{test}},{rm{error}}}$$

(4)

where s denotes the number of points in the DA identification set selected by σ and I(σ) = {i: 1 ≤ ikσ(xi)  = true} denotes the set of selected indices itself. Here, we focus on DA identification based on the relative error (l(y^{prime} ,y)=| y^{prime} -y”https://www.nature.com/” y|), as it is less correlated with the target values than the absolute error. Thus, this choice promotes DAs that contain a representative distribution of target values and, by extension, more distinct and thus more characteristic DAs for the different models (see Supplementary Note 2 for a discussion of the DAs resulting from using the absolute error).

The effect term of the objective function ensures that the model is estimated to be more accurate in the described region than in the global representation space. Thus, selectors with a large effect value describe domains of increased applicability as desired (see also Supplementary Note 4). In addition, promoting large, i.e., general, DAs through the coverage term is important as those have a higher chance to (a) contain data points of interest and (b) to have an accurate effect estimate, i.e., the empirical error reduction measured by the effect term is likely to generalize to other points in the DA that are not contained in the DA identification set. Thus, the coverage term has a similar role as a regularization term in common objective functions for model fitting. With the above objective function, we reduce the bi-criterial coverage/effect optimization problem to a uni-criterial impact optimization problem where both individual criteria are equally weighted and non-compensatory, i.e., due to the multiplicative combination, very low values of one criterion cannot be compensated by very high values in the other. The relative weight of both criteria can be re-calibrated by introducing a simple exponential weight parameter (see the Supplementary Methods section on Coverage/Effect Trade-off for a detailed discussion).

Optimizing the impact function over all conjunctive selectors that can be formed from a given set of base propositions is an NP-hard problem. This implies that there is no solver for it with worst-case polynomial time complexity (unless P = NP). However, there is a practically efficient branch-and-bound algorithm that turns out to be very fast in practice if the dimensionality of the DA representation is not too high—in particular, substantially faster than the model training process (see Methods and Supplementary Methods).

### Domains of applicability for TCO models

Equipped with the DA concept, we can now examine the ML models for the prediction of stable alloys with potential application as transparent conducting oxides (TCOs). Materials that are both transparent to visible light and electrically conductive are important for a variety of technological devices such as photovoltaic cells, light-emitting diodes for flat-panel displays, transistors, sensors, touch screens, and lasers28,29,30,31,32,33,34,35,36,37,38. However, only a small number of TCOs have been realized because typically the properties that maximize transparency are detrimental to conductivity and vice versa. Because of their promise for technologically relevant applications, a public data-analytics competition was organized by the Novel Materials Discovery Center of Excellence (NOMAD39) and hosted by the on-line platform Kaggle using a dataset of 3000 ({({{rm{Al}}}_{x}{{rm{Ga}}}_{y}{{rm{In}}}_{z})}_{2}{{rm{O}}}_{3}) sesquioxides, spanning six different space groups. The target property in this examination is the formation energy, which is a measure of the energetic stability of the specific elements in a local environment that is defined by the specific lattice structure.

Our aim is to demonstrate the ability of the proposed DA analysis to (i) differentiate the performance of models based on different representations of the local atomic information of each structure and (ii) to identify subdomains in which they can be used reliably for high-throughput screening. Specifically, we focus on the state-of-the-art representations of MBTR, SOAP, and the n-gram representation (all described in the Methods section). As an additional benchmark, we also perform DA identification for a simple representation containing just atomic properties averaged by the compositions (this corresponds to the simplistic choice of a representation given in Eq. (1); see Supplementary Table 4 for a list of atomic properties used in this representation). Since this representation is oblivious to configurational disorder (i.e., many distinct structures that are possible at a given composition), it is expected to perform poorly across all space groups and concentrations. Formally, there is no unique y-value associated with each x but rather a distribution P(yx). Thus, even the optimal prediction at each composition of the test set (the median energy) to predict the test set energies results in a mean absolute error of 32.6 meV/cation, which is the highest accuracy that can be obtained using just composition-based properties. Therefore, it is a candidate for a representation that does not have any useful DA when compared with its full domain.

MBTR, SOAP, and n-gram all display a similar test error (using the absolute error as the loss function l (see Eq. (3)); the resulting quantity we refer to as the mean absolute error, MAE) of 14.2, 14.1, and 14.7 meV/cation, respectively. This confirms previously reported virtually indistinguishable accuracies for MBTR and SOAP in the prediction of formation energies of alloys40. However, using the proposed method, key differences can be observed in the MAEs of their respective DAs (see Table 2 and Fig. 3 for a summary of all model performances). More specifically, the ML models built from MBTR, SOAP, and n-gram have an MAE (standard deviation) over the relevant DA validation sets of 7.6 (±1.5), 11.7 (±1.8), 10.2 (±0.9) meV/cation, respectively. All identified DAs for the models utilizing MBTR, SOAP, and n-gram have a large coverage (i.e., percent of samples within the DA) with an average (standard deviation) subpopulation contained within the DA validation set of 44% (±6%), 78% (±3%), and 52% (±5%), respectively.

In contrast, the atomic model is not only the worst model globally with a test error of 65.5 meV/cation, but, as anticipated, the DA error is virtually indistinguishable from the global model error (MAE = 60.2 meV/cation). This model performs worse than the MAE = 32.6 meV/cation that can be obtained by using the median energy at each composition of the test set to predict the test set energies. Therefore, this result illustrates the case of a weak representation for which no DA with substantial error reduction can be identified.

Although the reduction of the mean error for the three state-of-the-art representations is notable, the difference between the whole materials space and the DAs is even more pronounced when comparing the tails of the error distributions using the 95th percentile. For the global models, the average 95th percentile across all relevant splits is reduced by a factor of 2.9, 1.4, and 1.6 for the DA compared with the global error for MBTR, SOAP, and n-gram.

To put these error values into context, we consider the reference value of 24.9 meV/cation corresponding to half of the mean energy difference between the minimum energy and the second-to-minimum energy polymorph for all concentrations. The fraction of data points with these errors from the MBTR model above this reference value is reduced by a factor of 7.5 from 12.8% in the entire test set to 1.7% (averaged over each relevant split) within the DA. A smaller reduction in the fraction of errors is observed for the SOAP model (13.3 to 9.0%) and n-gram model (16.2 to 10.8%). For the MBTR model, the 95th percentile of the DA errors (18.8 meV/cation) lies below the reference value.

Since the restriction of features values in the DAs generally affects the distribution of target values, the observed MAE improvements might simply be a function of reduced variation in the property values. This would mean that the DAs are not actually characteristic for the models as those reduced variations would be independent of them. However, comparing the coefficient of determination (R) of the models globally and within their DA reveals that this is not the case. The R values are increased from 0.83 to 0.88 (MBTR), 0.84 to 0.85 (SOAP), and 0.83 to 0.86 (n-gram). This means, while there is a reduction in target dispersion, there is a disproportionate reduction in MAE in the model-specific predictions. Note that, matching our interest in absolute error performance, we consider here the R-value defined as one minus the sum of absolute errors over dispersion measured as the sum of absolute deviations from the median41.

The identified selectors are mostly stable, i.e., appearing in four out of six splits for MBTR and SOAP and five of six for n-gram. Interestingly, the variables that comprise the selectors of the DA are qualitatively different for each of these models. Selectors for MBTR include the number of atoms (N), the angle between the two longest lattice vectors in the unit cell (γ), and the average bond distance between aluminum and oxygen within the first coordination shell (that is defined by the effective coordination number), RAl–O:

$${sigma }_{{rm{MBTR}}}equiv Nge 50 {rm{atoms}}wedge gamma , le , 98.8{3}^{circ }wedge {R}_{{rm{Al}}-{rm{O}}}le 2.06 mathring{rm{AA}} .$$

For SOAP, selectors include features exclusively based on the unit-cell shape such as the ratio of the longest (a) and shortest (c) lattice vectors, and lattice-vector angles (β and γ):

$${sigma }_{{rm{SOAP}}}equiv frac{a}{c}le 3.87wedge gamma, <, 90.3{5}^{circ }wedge beta ge 88.6{8}^{circ }$$

The selector of the n-gram model includes both features describing the unit-cell shape [medium lattice vector (b) and angle (γ)] and structural motifs [interatomic bond distances between Al–O (RAl–O) and Ga–O (RGa–O) within the first coordination shell]:

$${sigma}_{{rm{n}}{hbox{-}}{rm{gram}}} equiv , bge 5.59 {mathring{rm{AA}}} wedge gamma ,, < , {90.35}^{circ }wedge \ {R}_{{rm{Al}}-{rm{O}}} ,le , 2.06 {mathring{rm{AA}}} wedge {R}_{{rm{Ga}}-{rm{O}}}le 2.07 {mathring{rm{AA}}}$$

It is worth noting that applying these DA selectors to the training set results in a similar reduction in error between the global and local populations and sample coverages (i.e., local population size) to what was observed for the test set: The training MAEs are reduced by factors of 1.87, 1.18, and 1.43 and the training DA coverages are 44%, 76%, and 54% for MBTR, SOAP, and n-gram models, respectively.

The qualitative differences observed in the DA selectors for these three models can be quantified by examining the overlapping samples in the DAs using the Jaccard similarity, which is the ratio of the number of overlapping samples over the total number of samples in both DAs. We find Jaccard similarities of 0.61 for n-gram vs. SOAP, 0.66 for n-gram vs. MBTR, 0.57 for SOAP vs. MBTR (computed over the whole test set). In other words, the discovered DA selectors are not only syntactically different, but, despite some overlap, they do indeed describe substantially different sub-populations of the investigated materials class.

We close this results section by an investigation of the effect of the individual DA selector elements of the SOAP-based model (details for MBTR and n-gram based models are provided in Supplementary Figs. 1 and 2). The inclusion of the attributes γ < 90.35 and β ≥ 88.68 excludes 18.3% and 1.8% samples that have irregular unit cells based on the relatively large γ and small β values compared with the rest of the data points (see Fig. 4 for the distribution of the selected a/c, γ, and β values for the SOAP-based model). The inclusion of the term a/c ≤ 3.87, which describes 86.2% of the test set, is attributed to the fact that SOAP employs a real-space radial cut-off value rcut = 10 Å in constructing the local atomic density for all samples (see above for a description of this representation). The algorithm threshold choice of a/c ≤ 3.87 separates two modes of a relatively dense region of points (see Fig. 4 left panel); however, for structures with asymmetric unit cell, the spherical radius could lead to inaccurate depiction of the local atomic environment, therefore, we repeat the procedure for two additional rcut values of 20 and 30 Å. Compared with the selector identified for rcut = 10 Å, a largely consistent selector is observed when the cut-off value is changed to a value of rcut = 20 Å:

$${sigma }_{{rm{SOAP}},{r}_{{rm{cut}}} = 20mathring{rm{AA}} }equiv frac{a}{c}le 3.89wedge gamma le 90.3{5}^{circ }$$

However, increasing rcut to a value of 30 Å—which exceeds the largest unit-cell vector length (a) of ca. 24 Å in the structures contained within this dataset—results in the selector:

$${sigma }_{{rm{SOAP}},{r}_{{rm{cut}}} = 30mathring{rm{AA}} }equiv cge 4.05 mathring{rm{AA}} wedge gamma le 90.3{5}^{circ }$$

The absence of the a/c term for the SOAP representation utilizing a rcut = 30 Å indicates that the choice of a cut-off value less than the length of the unit cell directly impacts the model performance for the larger unit cells within this dataset, and thus, directly affects the selector chosen by SGD.