Introduction
The majority of machine learning (ML) algorithms applied in chemistry and biology are black-box models used to predict given target properties.1–3
The model receives input features and generates an output, but how it arrived at that output is unknown or extremely difficult to understand because of the model's complexity. Extracting meaningful scientific insights from these models has therefore proven to be a challenge.1,4
Interpretable ML models that offer predictive capabilities combined with interpretable physical equations are gaining traction in many areas of science.1,2,5,6
The goal here is a so-called glass-box model, in which simple physical equations relate the input features to the target properties. In this way, relationships in the data can be understood and improved scientific insight can be gained from the model.
Figure 1: Schematic of black-box models and interpretable glass-box models
Sure Independence Screening and Sparsifying Operator – SISSO
Among the methods developed for interpretable machine learning, the Sure Independence Screening and Sparsifying Operator (SISSO) methodology has been widely applied in heterogeneous catalysis and organic chemistry.7–11 SISSO belongs to the symbolic regression class of models and searches for mathematical functions that predict the target property.
In simple terms, SISSO consists of two parts:
- Creating a large feature space by combining the feature columns (descriptors) with user-selected operators (e.g., multiplication, division, ln, sqrt).
- Using sure-independence screening (SIS) to select the descriptors with the highest correlation to the target property, then applying ℓ0 regularization to select the low-dimensional linear models with the lowest error.
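The two parts above can be illustrated with a small NumPy sketch. This is a simplified toy and not the SISSO or SISSO++ implementation: it builds only one level of combined features, screens once against the target (the published method screens iteratively against residuals), and performs the ℓ0 step as a brute-force search over feature subsets. The function names, the synthetic data and the operator set are all assumptions made for illustration.

```python
import itertools
import numpy as np


def build_features(X, names):
    """Part 1: combine primary descriptors with a few operators (one level only)."""
    feats = {name: X[:, i] for i, name in enumerate(names)}
    for i, name in enumerate(names):
        col = X[:, i]
        feats[f"({name})^2"] = col ** 2
        if np.all(col > 0):  # sqrt and ln only where they are defined
            feats[f"sqrt({name})"] = np.sqrt(col)
            feats[f"ln({name})"] = np.log(col)
    for i, j in itertools.combinations(range(len(names)), 2):
        a, b = names[i], names[j]
        feats[f"({a}*{b})"] = X[:, i] * X[:, j]
        if np.all(X[:, j] != 0):
            feats[f"({a}/{b})"] = X[:, i] / X[:, j]
        if np.all(X[:, i] != 0):
            feats[f"({b}/{a})"] = X[:, j] / X[:, i]
    return feats


def sis(feats, target, n_keep):
    """Part 2a: keep the n_keep features with the highest |Pearson r| to the target."""
    def score(col):
        if np.std(col) == 0:
            return 0.0
        return abs(np.corrcoef(col, target)[0, 1])
    ranked = sorted(feats, key=lambda k: score(feats[k]), reverse=True)
    return ranked[:n_keep]


def l0_select(feats, candidates, y, dim):
    """Part 2b: exhaustive least-squares fit over every dim-feature subset."""
    best = None
    for combo in itertools.combinations(candidates, dim):
        A = np.column_stack([feats[k] for k in combo] + [np.ones_like(y)])
        coef, *_ = np.linalg.lstsq(A, y, rcond=None)
        rmse = float(np.sqrt(np.mean((A @ coef - y) ** 2)))
        if best is None or rmse < best[0]:
            best = (rmse, combo, coef)
    return best


# Synthetic data (not from either dataset discussed in this post)
rng = np.random.default_rng(0)
names = ["x1", "x2", "x3"]
X = rng.uniform(1.0, 5.0, size=(30, 3))
y = 3.0 * X[:, 0] / X[:, 1] + 0.5

feats = build_features(X, names)
candidates = sis(feats, y, n_keep=5)
rmse, combo, coef = l0_select(feats, candidates, y, dim=1)
print(combo, coef, rmse)
```

Because the synthetic target is an exact function of x1/x2, the search recovers that ratio as the one-dimensional descriptor. On real data the screening size and model dimension become hyperparameters to tune.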
With this approach, the aim is to use SISSO to find interpretable equations that make scientific sense from a range of input features. The input columns or descriptors can be experimental and/or obtained from molecular modelling studies, including those conducted in BIOVIA Materials Studio® or BIOVIA Turbomole®.
Figure 2: Example of the types of inputs that have been used with SISSO
The original SISSO code was implemented in FORTRAN12 and does not offer direct Python support. However, a newer C++ implementation (SISSO++), which has native Python integration, has been released by the NOMAD Laboratory.13,14
Let’s say I wanted to apply the SISSO algorithm to some chemistry datasets in order to expand my scientific insight. How could I go about deploying this ML method as part of my data science pipelines?
The answer is to use BIOVIA Pipeline Pilot15 to wrap the Python code and extend access to these glass-box models.
SISSO++ integration with Pipeline Pilot
Thanks to the strong integration between BIOVIA Pipeline Pilot and Python, there is a range of options for incorporating Python code into existing data pipelines.16
In this example, we use the Jupyter Notebook components to handle the Python portion and the native Pipeline Pilot (PLP) components to read, write and clean the data ready for input.
We take two sets of data, one published by bp17 and one published by Sigman and co-workers.10 The bp data covers the use of benzaldehyde promoters in the H-ZSM-5 dehydration of methanol to dimethyl ether (DME), while the Sigman data covers a diastereoselective Rh-catalysed C–H insertion.
Both datasets are small (22 rows and 84 rows) by the standards of most AI methods, but they reflect the small, high-quality datasets typically acquired in industry and academia.
Inside the Python Jupyter Notebook component, SISSO++ can be set up by selecting the operators, target column and the desired train/test split. In addition, hyperparameters can be set and the calculation type can be toggled between regression and classification.
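As a rough illustration of the choices described above, the sketch below collects them in a plain Python settings object and derives a reproducible train/test split. The class name `SissoSettings`, its field names and the default values are hypothetical and are not the SISSO++ or Pipeline Pilot interface; the row count of 22 is used only as an example size.

```python
from dataclasses import dataclass, field

import numpy as np


@dataclass
class SissoSettings:
    """Hypothetical settings container; names are illustrative, not the SISSO++ API."""
    target_column: str
    operators: list = field(default_factory=lambda: ["mult", "div", "ln", "sqrt"])
    calc_type: str = "regression"  # or "classification"
    test_fraction: float = 0.2     # share of rows held out for testing
    max_dimension: int = 2         # example hyperparameter: terms in the final model
    seed: int = 0                  # makes the split reproducible

    def __post_init__(self):
        if self.calc_type not in ("regression", "classification"):
            raise ValueError(f"unknown calc_type: {self.calc_type!r}")
        if not 0.0 <= self.test_fraction < 1.0:
            raise ValueError("test_fraction must be in [0, 1)")


def train_test_indices(n_rows, settings):
    """Shuffle the row indices once and split them into train and test sets."""
    rng = np.random.default_rng(settings.seed)
    order = rng.permutation(n_rows)
    n_test = int(round(n_rows * settings.test_fraction))
    return np.sort(order[n_test:]), np.sort(order[:n_test])


settings = SissoSettings(target_column="DME STY")
train_idx, test_idx = train_test_indices(22, settings)
print(len(train_idx), "training rows,", len(test_idx), "test rows")
```

Keeping the settings in one validated object makes it straightforward to pass the same choices from pipeline parameters into the notebook step and to switch between regression and classification without touching the rest of the code.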
We apply the model to the bp dataset, where the target property is DME STY (space time yield, a measure of catalytic performance) and the 10 descriptor columns are density functional theory (DFT)-derived features of the organic promoter aldehydes (other reaction parameters are kept constant).
We obtain an interpretable equation that provides scientific insight, and the outputs are displayed using the Pipeline Pilot reporting components. The SISSO++ model is comparable to the reported model and makes chemical sense, as it relates steric and electronic features of the aldehyde promoter to catalytic performance.
Figure 3: Output of SISSO++ regression model for bp dataset run through BIOVIA Pipeline Pilot
One potential limitation of the SISSO++ code is that it can become computationally expensive with datasets that contain a large number of features. To that end, the Materials AI team at BIOVIA along with Felix Hanke (formerly at BIOVIA) developed a BIOVIA Pipeline Pilot native version of SISSO++ for regression problems.
Native SISSO++ in Pipeline Pilot
By using the parallelisation and simplicity of BIOVIA Pipeline Pilot we can increase the speed of finding interpretable equations for scientific datasets and run the models without any coding expertise.
The new protocol applies the same SISSO++ methodology whereby a vast number of features are generated and then parsed to give the best performing equations, but this is performed in a different way inside Pipeline Pilot.
Figure 4: Native SISSO protocol in BIOVIA Pipeline Pilot
The resulting output is comparable to that of the SISSO++ Python package, but usage is simpler for the scientist because there is no need to interact with any code. In fact, the protocol can be run through the Pipeline Pilot Web Port, with users selecting parameters through drop-down menus, making it ideal for scientists with no coding experience.
In this example we show the output for the dataset from Sigman and co-workers,10 where the target is ΔΔG‡ (a measure of diastereoselectivity) and there are 19 DFT-derived chemical descriptors. Again, we obtain an interpretable equation that is comparable to the reported model, which relates steric and electronic properties of the catalyst/ligand to the diastereoselectivity.
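For readers less familiar with ΔΔG‡ as a selectivity measure, the short sketch below converts a diastereomeric ratio (dr) into a free-energy difference using the standard relation ΔΔG‡ = RT ln(dr). The temperature, the units (kcal/mol) and the sign convention are assumptions for illustration; this post does not state how the values in the dataset were derived.

```python
import math

R_KCAL = 1.987204e-3  # gas constant in kcal/(mol*K)


def ddg_from_dr(dr, temperature_k=298.15):
    """Return ddG (kcal/mol) for a major:minor diastereomeric ratio `dr`.

    Assumes ddG = R * T * ln(dr); the temperature default is illustrative.
    """
    if dr <= 0:
        raise ValueError("dr must be positive")
    return R_KCAL * temperature_k * math.log(dr)


example = ddg_from_dr(10.0)  # a 10:1 ratio at the assumed temperature
print(f"{example:.2f} kcal/mol")
```

Expressing selectivity on this logarithmic energy scale is what makes it a natural target for linear models built from steric and electronic descriptors.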
Because Pipeline Pilot handles large amounts of data effectively, obtaining models for larger datasets (>50 billion generated features) is also possible.
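To give a feel for how quickly generated feature spaces reach this scale, the sketch below computes a rough upper bound on the number of candidate features after each round of operator application. The counting rule, the operator mix (two unary, one commutative binary, one non-commutative binary) and the starting count of 19 descriptors are assumptions for illustration; real implementations discard duplicate, ill-defined or unit-inconsistent features, so actual counts are lower, and the figures here are not those of the protocol described in this post.

```python
def feature_space_upper_bound(n_primary, n_unary, n_commutative, n_noncommutative, rounds):
    """Upper bound on feature counts when operators are applied repeatedly.

    Each round adds: every unary operator on every feature, every commutative
    binary operator on every unordered pair, and every non-commutative binary
    operator on every ordered pair. No de-duplication is attempted.
    """
    counts = [n_primary]
    n = n_primary
    for _ in range(rounds):
        new = (
            n_unary * n
            + n_commutative * n * (n - 1) // 2
            + n_noncommutative * n * (n - 1)
        )
        n += new
        counts.append(n)
    return counts


# ln and sqrt (unary), multiplication (commutative), division (non-commutative)
counts = feature_space_upper_bound(19, 2, 1, 1, rounds=3)
for level, count in enumerate(counts):
    print(f"round {level}: up to {count:,} features")
```

Under these assumptions the bound grows roughly quadratically per round, which is why screening steps that can stream over very large candidate sets matter in practice.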
Figure 5: Example of BIOVIA Pipeline Pilot Web Port being used to run the native SISSO algorithm
Conclusion
The simple integration of Python in BIOVIA Pipeline Pilot enables us to incorporate SISSO++ and other Python packages easily into new and existing data pipelines.
We can also make use of the flexibility and speed of BIOVIA Pipeline Pilot to incorporate new methods for interpretable machine learning into data science workflows. In this way, BIOVIA Pipeline Pilot can be used to help scientists gain meaningful scientific insights from predictive models. With BIOVIA Pipeline Pilot these types of models can be deployed in a low/no-code environment to aid understanding and further innovation in scientific challenges.
References
(1) Esterhuizen, J. A.; Goldsmith, B. R.; Linic, S. Interpretable Machine Learning for Knowledge Generation in Heterogeneous Catalysis. Nat Catal 2022, 5.
(2) Azodi, C. B.; Tang, J.; Shiu, S.-H. Opening the Black Box: Interpretable Machine Learning for Geneticists. Trends in Genetics 2020, 36 (6), 442–455.
(3) Jiménez-Luna, J.; Grisoni, F.; Schneider, G. Drug Discovery with Explainable Artificial Intelligence. Nat Mach Intell 2020, 573–584.
(4) Molnar, C. Interpretable Machine Learning: A Guide for Making Black Box Models Explainable; 2019. https://christophm.github.io/interpretable-ml-book/ (accessed 04/11/2025).
(5) La Cava, W. G.; Lee, P. C.; Ajmal, I.; Ding, X.; Solanki, P.; Cohen, J. B.; Moore, J. H.; Herman, D. S. A Flexible Symbolic Regression Method for Constructing Interpretable Clinical Prediction Models. NPJ Digit Med 2023, 6 (1).
(6) Rudin, C. Stop Explaining Black Box Machine Learning Models for High Stakes Decisions and Use Interpretable Models Instead. Nat Mach Intell 2019, 1 (5), 206–215.
(7) Foppa, L.; Rüther, F.; Geske, M.; Koch, G.; Girgsdies, F.; Kube, P.; Carey, S. J.; Hävecker, M.; Timpe, O.; Tarasov, A. V.; Scheffler, M.; Rosowski, F.; Schlögl, R.; Trunschke, A. Data-Centric Heterogeneous Catalysis: Identifying Rules and Materials Genes of Alkane Selective Oxidation. J Am Chem Soc 2023, 145 (6), 3427–3442.
(8) Miyazaki, R.; Belthle, K. S.; Tüysüz, H.; Foppa, L.; Scheffler, M. Materials Genes of CO2 Hydrogenation on Supported Cobalt Catalysts: An Artificial Intelligence Approach Integrating Theoretical and Experimental Data. J Am Chem Soc 2024, 146 (8), 5433–5444.
(9) Wang, J.; Xie, H.; Wang, Y.; Ouyang, R. Distilling Accurate Descriptors from Multi-Source Experimental Data for Discovering Highly Active Perovskite OER Catalysts. J Am Chem Soc 2023, 145 (20), 11457–11465.
(10) Souza, L. W.; Miller, B. R.; Cammarota, R. C.; Lo, A.; Lopez, I.; Shiue, Y.-S.; Bergstrom, B. D.; Dishman, S. N.; Fettinger, J. C.; Sigman, M. S.; Shaw, J. T. Deconvoluting Nonlinear Catalyst–Substrate Effects in the Intramolecular Dirhodium-Catalyzed C–H Insertion of Donor/Donor Carbenes Using Data Science Tools. ACS Catal 2023, 104–115.
(11) Park, J.; Oh, J.; Kim, J. S.; Shin, J. H.; Jeon, N.; Chang, H.; Yun, Y. Catalyst Discovery for Propane Dehydrogenation through Interpretable Machine Learning: Leveraging Laboratory-Scale Database and Atomic Properties. ACS Sustain Chem Eng 2024, 12 (28), 10376–10386.
(12) Ouyang, R.; Curtarolo, S.; Ahmetcik, E.; Scheffler, M.; Ghiringhelli, L. M. SISSO: A Compressed-Sensing Method for Identifying the Best Low-Dimensional Descriptor in an Immensity of Offered Candidates. Phys Rev Mater 2018, 2 (8), 1–12.
(13) Purcell, T. A. R.; Scheffler, M.; Carbogno, C.; Ghiringhelli, L. M. SISSO++: A C++ Implementation of the Sure-Independence Screening and Sparsifying Operator Approach. J Open Source Softw 2022, 7 (71), 3960.
(14) Purcell, T. A. R.; Scheffler, M.; Ghiringhelli, L. M. Recent Advances in the SISSO Method and Their Implementation in the SISSO++ Code. J Chem Phys 2023, 159 (11), 114110.
(15) Pipeline Pilot. https://www.3ds.com/products-services/biovia/products/data-science/pipeline-pilot/.
(16) Pipeline Pilot | Integration of Python and Jupyter Notebook. https://www.youtube.com/watch?v=1sFaA7Fj0oM (accessed 18/09/2024).
(17) Yang, Z.; Dennis-Smither, B. J.; Buda, C.; Easey, A.; Jackson, F.; Price, G. A.; Sainty, N.; Tan, X.; Xu, Z.; Sunley, G. J. Aromatic Aldehydes as Tuneable and Ppm Level Potent Promoters for Zeolite Catalysed Methanol Dehydration to DME. Catal Sci Technol 2023, 13 (12), 3590–3605.