Chemistry involving transition metals (TMs) is common, yet the errors introduced by density functional theory (DFT) and finite basis sets remain poorly understood. Because of computational constraints, large datasets of TM complexes for machine learning (ML) applications have been generated with DFT and small basis sets. This introduces errors from both the inherent approximations of the model chemistries and the multireference nature of TM complexes, which may degrade the resulting ML models. For these datasets to be used for training ML models in the future, the errors must be thoroughly understood, and higher-accuracy data should be incorporated into the training set. This talk gives an overview of a new dual-purpose dataset for machine learning and quantum chemistry benchmarking. The benchmarking further suggests methods and basis sets for the efficient creation of future machine learning datasets.
Errors in simulations of transition-metal (TM) chemistry with density functional theory (DFT) and Gaussian basis sets remain poorly understood. Yet datasets of tens of thousands of DFT results are routinely used to train machine learning (ML) models for faster simulations. The underlying model-chemistry errors and the multireference nature of TM complexes are potentially larger sources of error than the ML fitting error; without understanding them, the accuracy of these ML models is unclear, which limits their applicability. To address this, we developed a medium-accuracy dataset of 30k TM complex energies using DLPNO-CCSD(T)/cc-pVDZ and a high-accuracy dataset of 150 TM formation energies using DLPNO-CCSD(T)/CBS. Together these enable multi-fidelity ML training (a few high-accuracy results plus many moderate-accuracy ones) and benchmarking of density functional approximations with multiple basis sets. The benchmarks provide error estimates for existing datasets (and the ML models trained on them) and give model-chemistry recommendations for future dataset generation.
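As a rough illustration of multi-fidelity training with a few high-accuracy results and many moderate-accuracy ones, the sketch below uses delta-learning: a correction from the lower to the higher level of theory is fitted on a small subset and then applied to every entry. This is one common multi-fidelity scheme and not necessarily the approach used in this work. All data are synthetic, and the single descriptor `x`, the linear form of the correction, and the names `fit_linear` and `mae` are assumptions made only for the example.

```python
import random


def fit_linear(xs, ys):
    """Closed-form ordinary least squares for y = a*x + b."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    a = sxy / sxx
    return a, mean_y - a * mean_x


def mae(pred, ref):
    """Mean absolute error between two equally long lists."""
    return sum(abs(p - r) for p, r in zip(pred, ref)) / len(ref)


random.seed(0)

# Synthetic stand-ins (arbitrary units): one hypothetical descriptor per complex.
n_total, n_high = 1000, 50
x = [random.uniform(-1.0, 1.0) for _ in range(n_total)]

# "High-fidelity" energies; in practice only a small subset would be affordable.
e_high = [2.0 * xi + 0.3 for xi in x]

# "Low-fidelity" energies: a systematic, descriptor-dependent error plus noise.
e_low = [eh - (0.5 * xi + 0.05) + random.gauss(0.0, 0.01)
         for xi, eh in zip(x, e_high)]

# Fit the correction (high minus low) using only the small high-fidelity subset.
subset = random.sample(range(n_total), n_high)
a, b = fit_linear([x[i] for i in subset],
                  [e_high[i] - e_low[i] for i in subset])

# Apply the learned correction to every low-fidelity energy.
e_corrected = [el + a * xi + b for xi, el in zip(x, e_low)]

# The full e_high list is used here only to evaluate the synthetic example.
mae_low = mae(e_low, e_high)
mae_corrected = mae(e_corrected, e_high)
print(f"MAE before correction: {mae_low:.4f}")
print(f"MAE after correction:  {mae_corrected:.4f}")
```

The same `mae` helper illustrates the benchmarking use: replacing `e_low` with energies from a given density functional and basis set yields an error estimate against the reference data.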