Machine Learning for Fast, Quantum Mechanics-Based Approximation of Drug Lipophilicity.

Isert, Clemens; Kromann, Jimmy C; Stiefl, Nikolaus; Schneider, Gisbert; Lewis, Richard A

Affiliation

Isert C; Department of Chemistry and Applied Biosciences, ETH Zurich, Vladimir-Prelog-Weg 4, 8093 Zurich, Switzerland.
Kromann JC; Novartis Institutes for BioMedical Research, 4056 Basel, Switzerland.
Stiefl N; Novartis Institutes for BioMedical Research, 4056 Basel, Switzerland.
Schneider G; Novartis Institutes for BioMedical Research, 4056 Basel, Switzerland.
Lewis RA; Department of Chemistry and Applied Biosciences, ETH Zurich, Vladimir-Prelog-Weg 4, 8093 Zurich, Switzerland.

ACS Omega; 8(2): 2046-2056, 2023 Jan 17.

Article in English | MEDLINE | ID: mdl-36687099

ABSTRACT

Lipophilicity, as measured by the partition coefficient between octanol and water (log P), is a key parameter in early drug discovery research. However, measuring log P experimentally is difficult for specific compounds and log P ranges. The resulting lack of reliable experimental data impedes development of accurate in silico models for such compounds. In certain discovery projects at Novartis focused on such compounds, a quantum mechanics (QM)-based tool for log P estimation has emerged as a valuable supplement to experimental measurements and as a preferred alternative to existing empirical models. However, this QM-based approach incurs a substantial computational cost, limiting its applicability to small series and prohibiting quick, interactive ideation. This work explores a set of machine learning models (Random Forest, Lasso, XGBoost, Chemprop, and Chemprop3D) to learn calculated log P values on both a public data set and an in-house data set to obtain a computationally affordable, QM-based estimation of drug lipophilicity. The message-passing neural network model Chemprop emerged as the best performing model with mean absolute errors of 0.44 and 0.34 log units for scaffold split test sets of the public and in-house data sets, respectively. Analysis of learning curves suggests that a further decrease in the test set error can be achieved by increasing the training set size. While models directly trained on experimental data perform better at approximating experimentally determined log P values than models trained on calculated values, we discuss the potential advantages of using calculated log P values going beyond the limits of experimental quantitation. We analyze the impact of the data set splitting strategy and gain insights into model failure modes. Potential use cases for the presented models include pre-screening of large compound collections and prioritization of compounds for full QM calculations.
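
The scaffold split and mean absolute error mentioned in the abstract can be illustrated with a minimal sketch. This is not the authors' implementation: the scaffold identifiers below are hypothetical, precomputed labels (in practice they would be derived from the molecular structures with a cheminformatics toolkit), and the function names `scaffold_split` and `mean_absolute_error` are introduced here for illustration only. The idea shown is that all compounds sharing a scaffold land on the same side of the split, so the test error reflects performance on unseen scaffolds.

```python
from collections import defaultdict


def scaffold_split(scaffolds, test_fraction=0.2):
    """Split compound indices so that no scaffold appears in both sets.

    `scaffolds` holds one (hypothetical, precomputed) scaffold label per
    compound. Groups are assigned largest-first to the training set until
    the target training size would be exceeded; remaining groups go to test.
    """
    groups = defaultdict(list)
    for idx, scaffold in enumerate(scaffolds):
        groups[scaffold].append(idx)

    # Largest groups first; ties broken by first index for determinism.
    ordered = sorted(groups.values(), key=lambda g: (-len(g), g[0]))

    n_train_target = len(scaffolds) * (1.0 - test_fraction)
    train, test = [], []
    for group in ordered:
        if len(train) + len(group) <= n_train_target:
            train.extend(group)
        else:
            test.extend(group)
    return sorted(train), sorted(test)


def mean_absolute_error(y_true, y_pred):
    """Average absolute deviation, here in log P units."""
    return sum(abs(t - p) for t, p in zip(y_true, y_pred)) / len(y_true)


# Toy example with made-up scaffold labels and made-up log P values.
scaffolds = ["A", "A", "A", "B", "B", "C", "D", "A", "B", "E"]
train_idx, test_idx = scaffold_split(scaffolds, test_fraction=0.2)

calculated_logp = [1.0, 2.0]  # stand-in for QM-calculated values
predicted_logp = [1.5, 1.5]   # stand-in for model predictions
mae = mean_absolute_error(calculated_logp, predicted_logp)
```

Compared with a random split, a split of this kind usually gives a more conservative error estimate, because structurally similar compounds cannot appear in both the training and test sets.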

Full text: 1 Collection: 01-international Database: MEDLINE Language: En Journal: ACS Omega Year: 2023 Document type: Article Affiliation country: Switzerland Country of publication: United States