We started with the 9Cr dataset containing only raw experimental data (i.e., elemental alloy compositions, processing and testing conditions, and PAGS; groups 1 and 2 in Fig. 1) to train five different ML models. Figure 2 shows the average accuracy of these models and its standard deviation over ten training runs as a function of the number of top-ranking features from Pearson’s correlation coefficient (PCC)34 and maximal information coefficient (MIC)35 analyses. Overall, the RF, NN, and SVM models exhibit high accuracy (R2 > 0.9) regardless of the number of top-ranking features. More specifically, RF was the most accurate (R2 always higher than 0.95), followed by SVM. Nevertheless, the applicability of these models to alloy design is questionable, since PAGS is the only physically measured microstructure-related feature involved in ML training. Other physically meaningful features, such as the volume fractions of key phases and phase transformation temperatures, are required to properly represent the process–structure–property relationship and to serve as physical constraints in ML.
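The feature ranking that drives these top-k experiments can be sketched as follows. This is a minimal pure-Python illustration, assuming features are ranked by the absolute value of Pearson's r (the text does not state whether the sign is ignored) and that each feature is stored as a list of values aligned with the yield strengths. `pearson_r`, `top_k_by_pcc`, and the toy column values are hypothetical, not the authors' code or data.

```python
import math

def pearson_r(x, y):
    """Pearson correlation coefficient between two equal-length sequences."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)

def top_k_by_pcc(features, target, k):
    """Return the k feature names with the largest |PCC| against the target.

    features: dict mapping a feature name to its column of values (assumed layout).
    target:   yield strengths aligned with those columns.
    """
    scores = {name: pearson_r(col, target) for name, col in features.items()}
    ranked = sorted(scores, key=lambda name: abs(scores[name]), reverse=True)
    return ranked[:k]

# Hypothetical toy columns, for illustration only.
toy = {
    "wNi": [0.1, 0.2, 0.3, 0.4],
    "Ms": [400.0, 390.0, 385.0, 370.0],
    "noise": [1.0, -1.0, 1.0, -1.0],
}
ys = [500.0, 520.0, 540.0, 560.0]
print(top_k_by_pcc(toy, ys, 2))  # ['wNi', 'Ms']
```

Sweeping `k` and retraining a model on each top-k subset reproduces the shape of the experiment summarized in Fig. 2.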
Analyses of temperature-based sub-datasets
Given the lack of physically measurable microstructure features in the 9Cr dataset, the raw experimental data were augmented with synthetically derived features from high-throughput CALPHAD calculations, i.e., groups 3 and 4 in Fig. 1 (see Table 1). Since the primary strengthening mechanisms of 9Cr steel are temperature dependent, it was essential to examine carefully whether the present dataset can represent these temperature-dependent mechanisms. We therefore divided the 9Cr dataset into several sub-datasets based on the testing temperature and performed correlation analysis on each sub-dataset.
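Splitting the data into temperature-based sub-datasets amounts to grouping records by testing temperature. The sketch below assumes each record is a dict keyed by column name, with the testing temperature stored under `TTTemp` (the feature name used for the truncated dataset); the function name and the toy records, including 25 °C standing in for room temperature, are illustrative only.

```python
from collections import defaultdict

def split_by_test_temperature(records, key="TTTemp"):
    """Group records (dicts) into sub-datasets keyed by testing temperature."""
    groups = defaultdict(list)
    for rec in records:
        groups[rec[key]].append(rec)
    # Return the sub-datasets ordered from lowest to highest testing temperature.
    return dict(sorted(groups.items()))

# Hypothetical records: testing temperature in °C, yield strength (YS) in MPa.
records = [
    {"TTTemp": 25, "YS": 650},
    {"TTTemp": 550, "YS": 480},
    {"TTTemp": 25, "YS": 640},
]
subsets = split_by_test_temperature(records)
print({t: len(rows) for t, rows in subsets.items()})  # {25: 2, 550: 1}
```

Correlation analysis is then run separately on each value of `subsets`.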
The top 10 and bottom 10 features from the PCC analysis were evaluated at representative temperatures spanning three regimes, i.e., 200 °C (low temperature), 550 and 650 °C (medium to high temperatures), and 750 °C (above the service temperature). These results are presented in Fig. 3. The closer the absolute value of the correlation coefficient is to 1, the stronger the correlation between the feature and yield strength; the sign indicates whether yield strength rises or falls as the feature increases. The features identified as positively or negatively correlated with yield strength at 200, 550, and 650 °C were consistent across these temperatures and mostly in good agreement with the generally accepted strengthening factors/mechanisms in 9Cr steel. For example, Ni content exhibited a strong positive correlation with yield strength, i.e., the higher the Ni content, the higher the yield strength. This is in accordance with the practice of adding Ni to 9Cr steel to stabilize austenite at high temperatures, lower the martensitic transformation temperatures, and consequently increase the hardenability in the normalization process. These effects generally increase the yield strength of martensitic–ferritic steels, including the 9Cr family of steels27. This result is also logically supported by the present correlation analysis, which shows a strong negative correlation between the Ms temperature and yield strength.
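For reference, the coefficient ranked in Fig. 3 is, in its standard form, Pearson's r between a feature x and yield strength y over the n data points of a sub-dataset, where bars denote sample means:

```latex
r_{x,y} = \frac{\sum_{i=1}^{n} (x_i - \bar{x})(y_i - \bar{y})}
               {\sqrt{\sum_{i=1}^{n} (x_i - \bar{x})^2}\,\sqrt{\sum_{i=1}^{n} (y_i - \bar{y})^2}},
\qquad -1 \le r_{x,y} \le 1
```

A value of +1 or -1 indicates a perfect increasing or decreasing linear relationship, and values near 0 indicate no linear relationship; a strong non-linear dependence can still give r close to 0.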
The M23C6 phase also plays an important role in strengthening 9Cr steel, both through precipitate strengthening and by stabilizing the tempered martensite microstructure, especially at elevated temperatures36. A higher volume fraction of M23C6 leads to higher yield strength, so it is reasonable that the volume fraction of M23C6 has one of the strongest positive correlations with yield strength. The elements V and N facilitate the formation of strengthening MX precipitates during tempering, which also increase yield strength by impeding dislocation motion during deformation and by stabilizing the sub-grain structure. Co is also an austenite stabilizer that suppresses δ-ferrite formation during the normalizing heat treatment. Ms and the microstructure-related features from our high-throughput calculations (e.g., volume fractions of the M23C6, hcp, and fcc phases) are highly impactful features and are critical to obtaining high-fidelity surrogate ML models. This finding also holds for the other sub-datasets up to 650 °C (see Supplementary Table 1).
For the sub-datasets above 650 °C (e.g., 750 °C in Fig. 3), the correlation coefficients are smaller than those at lower temperatures, indicating a weaker relationship between alloy features and yield strength. In addition, the feature ranking at 750 °C is counterintuitive and very different from the trends below 650 °C. For instance, Ms has a negative correlation below 650 °C but a positive one at 750 °C. Features such as wC, wCr, wW, and PAGS, which should contribute positively to yield strength, are now identified as having a negative impact at 750 °C. The MIC analysis shows a similar trend (see Supplementary Fig. 1).
At 750 °C, the correlations are therefore not only much weaker than at lower temperatures but also physically implausible: typical high-impact features, such as Temper 1, wV, wNb, wNi, wC, and T2_VPV_M23C6, are identified correctly at 200, 550, and 650 °C, but their correlations at 750 °C are counterintuitive. These findings can be put into context by recognizing that (1) the number of data points above 650 °C is insufficient to represent the effects of certain features on yield strength correctly, and (2) the microstructure changes significantly during exposure at high temperatures, which may cause variations in yield strength attributable to factors not considered in the present dataset (e.g., the heating rate and/or the holding time at temperature before tensile testing).
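The sign reversals described above (e.g., Ms changing from a negative to a positive correlation at 750 °C) can be flagged automatically once per-temperature coefficients are available. In this sketch, `pcc_by_temp` is a hypothetical mapping from testing temperature to {feature: PCC}, and every reference temperature is assumed to contain the same features; the coefficients in the example are made-up placeholders chosen only to reproduce the sign pattern described in the text, not values from Fig. 3.

```python
def sign_flips(pcc_by_temp, reference_temps, check_temp):
    """List features whose PCC sign at check_temp contradicts a sign that is
    consistent across all reference temperatures."""
    flipped = []
    for feature, r_check in pcc_by_temp[check_temp].items():
        ref_signs = {pcc_by_temp[t][feature] > 0 for t in reference_temps}
        # Only flag features whose sign agrees at every reference temperature.
        if len(ref_signs) == 1 and (r_check > 0) not in ref_signs:
            flipped.append(feature)
    return flipped

# Placeholder coefficients (illustrative, not from the paper).
pcc_by_temp = {
    200: {"Ms": -0.6, "wNi": 0.7},
    550: {"Ms": -0.5, "wNi": 0.6},
    650: {"Ms": -0.4, "wNi": 0.5},
    750: {"Ms": 0.2, "wNi": 0.1},
}
print(sign_flips(pcc_by_temp, reference_temps=[200, 550, 650], check_temp=750))  # ['Ms']
```

A non-empty result for a temperature is one quantitative signal that its sub-dataset is not capturing the same mechanisms as the others.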
We then trained five ML models (BR, LR, RF, NN, and SVM) on the sub-dataset at each temperature. Since these sub-datasets contain at most 44 data points, we limited the number of top-ranking features used in ML to 10 to avoid overfitting. The top 10 features of each sub-dataset from the correlation analysis are summarized in the Supplementary information (Supplementary Table 1). As an example, Fig. 4 shows the accuracy of the RF model trained with different numbers of top-ranking features on each temperature-based sub-dataset. The results for the entire 9Cr dataset (“All”) are included for comparison. As shown in Fig. 4a, the accuracy of ML models trained on sub-datasets is always lower than that of the model trained on the entire 9Cr dataset (i.e., “All”), which can be attributed to the smaller size of the sub-datasets. The performance of RF trained with top-ranking features from MIC does not improve with more features; the top 4 features already yield the maximal accuracy, showing that these four features are sufficient to fit the RF model well. In contrast, the top 8 features from the PCC analysis are required to reach maximal accuracy (Fig. 4b). In both cases, the maximum accuracy is always >0.8 from room temperature (RT) to 600 °C, regardless of whether the features are ranked by MIC or PCC. Above 600 °C, it decreases monotonically, in accordance with the decreasing data volume above 600 °C (see Fig. 1). Since the ranking of features at 650 °C is reasonable (see Fig. 3), the lower accuracy at 650 °C may be attributed to its slightly smaller data volume compared with the lower-temperature datasets.
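Why the number of features must be capped on a sub-dataset of at most 44 points can be illustrated with ordinary least squares, used here as a simple stand-in for the five models. With an intercept plus n - 1 features, a linear model can interpolate n training points exactly, so the in-sample R2 reaches 1 even when both the features and the target are pure random noise. The sample size and random data below are hypothetical, and NumPy is assumed to be available.

```python
import numpy as np

def in_sample_r2(X, y):
    """Fit OLS with an intercept and return R^2 evaluated on the same training data."""
    A = np.column_stack([np.ones(len(y)), X])
    coef, *_ = np.linalg.lstsq(A, y, rcond=None)
    resid = y - A @ coef
    return 1.0 - (resid @ resid) / np.sum((y - y.mean()) ** 2)

rng = np.random.default_rng(0)
n = 12                                  # hypothetical number of data points
y = rng.normal(size=n)                  # target is pure noise
for p in (2, 5, n - 1):
    X = rng.normal(size=(n, p))         # features are pure noise too
    print(p, round(float(in_sample_r2(X, y)), 3))
# With p = n - 1 the model interpolates the data, so R^2 is ~1.0 despite no real signal.
```

Tree ensembles and neural networks overfit through different mechanisms, but the same principle, that more free parameters per data point inflate apparent accuracy, motivates the cap of 10 features.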
For the sub-datasets above 650 °C, by contrast, the accuracy (R2) remains below 0.6 no matter how many top-ranking features are used in the ML models. This observation again confirms that the data above 650 °C are insufficient and that the features in the present 9Cr dataset cannot represent the microstructural instability at high temperatures. Including the data above 650 °C could therefore mislead the training of ML models and result in incorrect predictions. For this reason, data above 650 °C were removed, resulting in the truncated (≤650 °C) 9Cr dataset used for the subsequent ML modeling.
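The truncation step is a simple filter on testing temperature. This sketch assumes each record is a dict with the testing temperature under `TTTemp`, and it also reports the fraction of records removed (the truncated dataset is described as ~10% smaller than the original); the function name and toy temperatures are hypothetical.

```python
def truncate_dataset(records, max_temp=650.0, key="TTTemp"):
    """Keep records tested at or below max_temp; also return the fraction dropped."""
    kept = [rec for rec in records if rec[key] <= max_temp]
    removed_fraction = 1.0 - len(kept) / len(records)
    return kept, removed_fraction

# Hypothetical testing temperatures in °C.
temps = [25, 200, 300, 400, 550, 600, 650, 650, 700, 750]
records = [{"TTTemp": t} for t in temps]
kept, removed = truncate_dataset(records)
print(len(kept), round(removed, 2))  # 8 0.2
```

Note that the cutoff is inclusive, so the 650 °C sub-dataset is retained, matching the ≤650 °C definition.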
Truncated (≤650 °C) dataset
Figure 5 and Table 2 summarize the results of the correlation analysis for the truncated dataset. Many of the physically meaningful features that we added to the raw 9Cr yield strength dataset (i.e., phase volume fractions and Ms) have high correlation coefficients. These highly impactful features from both the PCC and MIC analyses are in good agreement with the generally accepted strengthening mechanisms, indicating that the features collected in the truncated dataset capture the strengthening mechanisms of 9Cr steel well in the given temperature range. This dataset includes the tensile testing temperature (TTTemp) as a feature, which allows the ML models to capture the temperature dependence of yield strength. TTTemp has a strong negative correlation with yield strength, consistent with the experimental observation that the higher the test temperature, the lower the yield strength.
There are discrepancies between the results of the MIC and PCC analyses; for example, MIC ranked wCo 1st (9th in PCC), while PCC ranked T2_VPV_M23C6 2nd (14th in MIC). These discrepancies arise from the different ways the two methods quantify the strength of a correlation: PCC evaluates only the strength of a linear relationship, whereas MIC has an advantage over PCC when the correlation between an input feature and the target property is non-linear. A detailed comparison of MIC and PCC analyses for different data structures is available in ref. 37. It should be emphasized that the purpose of performing both MIC and PCC analyses in this study is not to rank one method over the other. Correlation analysis is a topic in its own right, aiming to quantify the strength of the statistical relationship between two variables. It is also a category of feature selection approach that facilitates the choice of the most relevant input features for ML23. Our intent here is to demonstrate that correlation analysis is necessary to validate, by quantitatively evaluating the scores of the features considered, whether the underlying mechanisms have been captured. It can also be used to evaluate the consistency of a materials dataset. The results of different correlation analyses can be further examined to inspire alloy design experts to generate alloy hypotheses.
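The PCC/MIC discrepancy can be seen on a toy non-linear relationship. For y = x^2 sampled symmetrically about x = 0, Pearson's r is essentially zero, yet y is fully determined by x. MIC itself searches over many grid resolutions and normalizes the result; to stay dependency-free, the sketch below substitutes a single fixed-grid mutual-information estimate (in bits), which is a simpler stand-in and not MIC. The bin count and data are arbitrary choices.

```python
import math
from collections import Counter

def pearson_r(x, y):
    """Pearson correlation coefficient (linear association only)."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)

def binned_mutual_information(x, y, bins=8):
    """Mutual information (bits) estimated on one fixed equal-width bins-by-bins grid."""
    def to_bins(values):
        lo, hi = min(values), max(values)
        width = (hi - lo) / bins or 1.0  # guard against a constant column
        return [min(int((v - lo) / width), bins - 1) for v in values]

    bx, by = to_bins(x), to_bins(y)
    n = len(x)
    joint, px, py = Counter(zip(bx, by)), Counter(bx), Counter(by)
    return sum(
        (c / n) * math.log2((c / n) / ((px[i] / n) * (py[j] / n)))
        for (i, j), c in joint.items()
    )

x = [i / 50 for i in range(-50, 51)]   # symmetric grid on [-1, 1]
y = [v ** 2 for v in x]                # purely non-linear dependence
print(f"PCC = {pearson_r(x, y):.3f}, binned MI = {binned_mutual_information(x, y):.2f} bits")
# PCC is ~0 while the binned MI is well above 0, i.e., the dependence is detected.
```

A feature with this kind of relationship to yield strength would rank low by PCC but high by an information-based score, which is the type of disagreement seen for wCo.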
Five ML models (i.e., BR, LR, RF, NN, and SVM) were trained using the truncated dataset; the results are shown in Fig. 6. As before, the number of top-ranking features based on the MIC and PCC analyses was varied to train these models. The accuracies of the models using the top-ranking features from the MIC and PCC analyses show similar trends. For example, increasing the number of top-ranking features from 5 to 10 for PCC, and from 5 to 15 for MIC, increased the accuracy of these models significantly. Beyond these numbers of features, the accuracy of the BR, LR, RF, and SVM models remained almost constant, whereas that of the NN model decreased monotonically. For the models utilized, it was necessary to include at least the top 10 features from PCC and the top 15 features from MIC to obtain good accuracy.
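The procedure of varying the number of top-ranking features can be sketched with NumPy, using ordinary least squares (a plain linear regression) as a stand-in for all five models. Assumptions: features are ranked once by |Pearson r| on the full dataset, each run uses a fresh random 80/20 train/test split, and accuracy is the test-set R2 averaged over ten runs; the split ratio, seeds, and synthetic data are hypothetical. A stricter protocol would re-rank features inside each training split to avoid information leakage.

```python
import numpy as np

def r2(y_true, y_pred):
    """Coefficient of determination on held-out data."""
    ss_res = np.sum((y_true - y_pred) ** 2)
    ss_tot = np.sum((y_true - y_true.mean()) ** 2)
    return 1.0 - ss_res / ss_tot

def sweep_top_k(X, y, ks, n_runs=10, test_frac=0.2, seed=0):
    """Mean and std of test-set R^2 for OLS trained on the top-k |PCC| features."""
    pcc = np.array([np.corrcoef(X[:, j], y)[0, 1] for j in range(X.shape[1])])
    order = np.argsort(-np.abs(pcc))          # column indices, strongest first
    rng = np.random.default_rng(seed)
    n_test = max(2, int(round(test_frac * len(y))))
    results = {}
    for k in ks:
        cols = order[:k]
        scores = []
        for _ in range(n_runs):
            idx = rng.permutation(len(y))
            test, train = idx[:n_test], idx[n_test:]
            A_train = np.column_stack([np.ones(len(train)), X[train][:, cols]])
            coef, *_ = np.linalg.lstsq(A_train, y[train], rcond=None)
            A_test = np.column_stack([np.ones(len(test)), X[test][:, cols]])
            scores.append(r2(y[test], A_test @ coef))
        results[k] = (float(np.mean(scores)), float(np.std(scores)))
    return results

# Synthetic data: two informative columns plus four noise columns.
data_rng = np.random.default_rng(42)
X = data_rng.normal(size=(60, 6))
y = 3 * X[:, 0] - 2 * X[:, 1] + 0.1 * data_rng.normal(size=60)
res = sweep_top_k(X, y, ks=[1, 2, 4])
print(res)  # k = 2 should score near 1; k = 1 misses an informative feature
```

Plotting `res[k][0]` with `res[k][1]` as error bars against `k` gives a curve of the same kind as those in Figs. 2 and 6.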
Regardless of the type and number of features used from the PCC and MIC analyses, the accuracy of the trained models in predicting yield strength ranked as follows: RF > SVM > NN > BR ≈ LR. More specifically, RF, NN, and SVM exhibited very high accuracy (R2 > 0.9), while the maximum accuracy of the LR and BR models was ~0.85. For example, Fig. 7 shows the yield strength predicted by the RF model, which exhibits excellent agreement with the experimentally determined yield strength. Although the accuracy of ML models trained on the dataset augmented with synthetic features is similar to that of models trained only on raw experimental data (see Fig. 2), the fidelity of the LR, BR, and SVM models is notably enhanced. This is because the synthetic features we incorporated into the dataset were shown to be highly correlated with the yield strength of 9Cr steel. Moreover, the ML models still achieve very high accuracy even though the truncated dataset contains ~10% less data than the initial 9Cr dataset, mainly because the inconsistent data above 650 °C were eliminated. As such, we believe that the trained ML models described in this section are more accurate and can provide more realistic predictions.
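The accuracy values quoted throughout are R2 scores. Assuming the standard coefficient of determination, with measured yield strengths y_i, model predictions ŷ_i, and mean ȳ over the n evaluated points:

```latex
R^2 = 1 - \frac{\sum_{i=1}^{n} (y_i - \hat{y}_i)^2}{\sum_{i=1}^{n} (y_i - \bar{y})^2}
```

An R2 of ~0.85 for LR and BR therefore means that the residual sum of squares is about 15% of the total sum of squares of the measured yield strengths about their mean, whereas R2 > 0.9 for RF, NN, and SVM corresponds to less than 10%.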
The high-fidelity surrogate models obtained in this work will allow prediction of the yield strength of hypothetical 9Cr alloys. In this case, however, additional work on predicting PAGS is required, because PAGS was used as an input feature to predict yield strength. Among the features in groups 1 and 2 (see Fig. 1), PAGS is unique in that it can only be obtained by physical inspection, i.e., metallography. PAGS is also an essential input for predicting Ms38, which was previously identified as a highly relevant feature for yield strength and served as an important constraint in training high-fidelity surrogate models, and it depends on various details of the composition and processing conditions. Thus, following a workflow similar to that of the present study, surrogate models for PAGS were trained using the truncated dataset. The PAGS predicted by the NN, RF, and SVM models is in excellent agreement with the experimental data (see Supplementary Fig. 2 in the Supplementary materials). As an example, Fig. 8 compares the experimental and predicted PAGS of 9Cr steel using the RF model. We attribute the outstanding performance of the trained ML models to the extremely high correlation between the input features and PAGS (see the correlation scores of the high-ranking features in Supplementary Table 2). The average MIC score of the top 15 features is 0.933 ± 0.061, which is extremely high. The average PCC scores are not as high as the MIC scores, but the average score of the top 10 features, 0.660 ± 0.100, can still be regarded as high. Given the success of this approach, the PAGS of any 9Cr steel alloy can be derived and used as an input to predict yield strength via the data analytics approach demonstrated in the present study.
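The two-stage use of the surrogates described above, predicting PAGS first and then feeding it into the yield strength model, can be sketched as a simple composition. `pags_model` and `ys_model` are hypothetical callables standing in for trained models, and the linear toy functions in the example are made up. Because Ms also depends on PAGS in this workflow, a fuller pipeline would recompute Ms from the predicted PAGS before the second stage; that step is omitted here.

```python
from typing import Callable, Dict

Features = Dict[str, float]

def predict_yield_strength(alloy: Features,
                           pags_model: Callable[[Features], float],
                           ys_model: Callable[[Features], float]) -> float:
    """Two-stage prediction for a hypothetical alloy with no measured PAGS."""
    augmented = dict(alloy)                  # copy so the caller's dict is not mutated
    augmented["PAGS"] = pags_model(alloy)    # stage 1: predict PAGS from the other inputs
    return ys_model(augmented)               # stage 2: predict yield strength using PAGS

# Made-up stand-ins for trained surrogates (illustrative only).
def pags_model(a: Features) -> float:
    return 20.0 + 10.0 * a["wC"]

def ys_model(a: Features) -> float:
    return 800.0 - 2.0 * a["PAGS"] - 0.5 * a["TTTemp"]

alloy = {"wC": 0.1, "TTTemp": 550.0}
print(predict_yield_strength(alloy, pags_model, ys_model))  # 483.0
```

Keeping the two stages as separate callables lets either surrogate be swapped (e.g., RF for PAGS, SVM for yield strength) without changing the pipeline.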
In summary, we have demonstrated a workflow that incorporates highly relevant physics into ML for predicting the properties of complex heat-resistant alloys, and we have described the approach in detail using a yield strength dataset of 9–12 wt% Cr steels as an example. We augmented raw experimental data with key features that capture both the microstructure and the phase transformations of this class of alloys, i.e., the volume fractions of key phases and the A3 and martensite transformation temperatures. It is worth mentioning that the present features cannot capture the complex location- and size-specific microstructural details of the secondary phases that form in 9Cr alloys. It would be ideal to incorporate such detailed microstructural information into the data analytics workflow. However, obtaining such a large volume of high-fidelity microstructural detail for all alloy chemistries and processing conditions would be prohibitively time-consuming and costly.
We computed these synthetic features using high-fidelity thermodynamic models in a high-throughput manner. Critical evaluation of each temperature-based sub-dataset, including correlation analysis and ML training, showed that the data above 650 °C are insufficient for correctly capturing the significant factors related to the yield strength of 9Cr steel, owing to the relative lack of experimental data and relevant microstructure features. These data were therefore removed from the 9Cr dataset, and correlation analysis of the truncated dataset showed that the high-ranking features were in good agreement with the generally accepted strengthening mechanisms.
We tested the performance of representative ML models, i.e., RF, SVM, NN, BR, and LR, as a function of the number of top-ranking features. This exercise showed that the top 10 features from PCC and the top 15 features from MIC are necessary to obtain good accuracy for all models. Among the ML models tested, RF and SVM exhibited very high accuracy (R2 > 0.95) for predicting the yield strength of 9Cr steel. In conclusion, this study demonstrated that high-fidelity surrogate models can be trained with highly relevant and physically meaningful features. Such physical constraints effectively prevent erroneous predictions of the properties of hypothetical candidate alloys when trained ML models are interrogated in data-driven materials design. We anticipate that the approach demonstrated in the present work can be further extended by integrating additional physical/chemical alloy features beyond those achievable in this study.