Abstract:
In medicinal chemistry, the primary objective is to rapidly optimize multiple chemical properties of
a set of compounds to yield a candidate ready for clinical trials. In recent years, two computational techniques, machine
learning (ML) and physics-based methods, have evolved substantially and are now frequently incorporated into the medicinal
chemist’s toolbox to enhance the efficiency of both hit optimization and candidate design. Both computational methods come with
their own set of limitations, and they are often used independently of each other. ML can screen extensive compound
libraries quickly, but it relies on high-quality data, which can be scarce, especially during early-stage optimization.
Conversely, physics-based approaches such as free energy perturbation (FEP) can make highly accurate binding affinity
predictions but are frequently constrained by comparatively low throughput and high cost. In this
study, we harnessed the strength of FEP to overcome data paucity in ML by generating virtual activity data sets that then inform
the training of algorithms. Here, we show that ML algorithms trained with an FEP-augmented data set could achieve predictive
accuracy comparable to that of algorithms trained on experimental data from biological assays. Throughout the paper, we emphasize key
mechanistic considerations that must be taken into account when aiming to augment data sets and lay the groundwork for successful
implementation. Ultimately, the study advocates for the synergy of physics-based methods and ML to expedite the lead optimization
process. We believe that physics-based augmentation of ML will significantly benefit drug discovery as these techniques continue
to evolve.
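
As an illustration of the augmentation idea described above, the following minimal Python sketch converts FEP relative binding free energies into predicted activities, applies a categorical cutoff, and merges the resulting virtual labels with experimental ones. It is not the workflow used in this study (which is provided as KNIME workflows in the Supporting Information). The function names, compound identifiers, numerical values, and cutoff are all hypothetical. The conversion assumes the standard relation ΔG = RT ln K, a ΔΔG sign convention of ligand minus reference, and that the measured activity behaves like a binding constant.

```python
import math

R_KCAL = 0.0019872041  # gas constant in kcal/(mol*K)


def ddg_to_pactivity(ref_pactivity, ddg_kcal, temperature=298.15):
    """Predict a pKi-like value from an FEP relative binding free energy.

    Assumes ddg_kcal = dG(ligand) - dG(reference), so a negative value
    means the ligand binds more tightly than the reference compound.
    From dG = RT ln K, a change of RT*ln(10) kcal/mol is one log unit.
    """
    return ref_pactivity - ddg_kcal / (R_KCAL * temperature * math.log(10))


def label(pactivity, cutoff):
    """Binary activity class: 1 if at or above the cutoff, else 0."""
    return 1 if pactivity >= cutoff else 0


def augment(experimental, fep_ddg, ref_pactivity, cutoff):
    """Merge experimental labels with FEP-derived virtual labels.

    experimental: dict of compound id -> measured pActivity
    fep_ddg:      dict of compound id -> FEP ddG in kcal/mol
    Returns a list of (compound id, label, source) rows.
    """
    rows = [(cid, label(p, cutoff), "experimental")
            for cid, p in experimental.items()]
    for cid, ddg in fep_ddg.items():
        if cid in experimental:
            continue  # a measured value takes precedence over a virtual one
        predicted = ddg_to_pactivity(ref_pactivity, ddg)
        rows.append((cid, label(predicted, cutoff), "fep"))
    return rows


if __name__ == "__main__":
    # Hypothetical identifiers and values, for illustration only.
    measured = {"cpd-1": 7.5}
    virtual = {"cpd-2": -1.364, "cpd-3": 2.0}
    for row in augment(measured, virtual, ref_pactivity=7.0, cutoff=7.0):
        print(row)
```

The combined rows could then be paired with molecular descriptors and passed to any classifier. Keeping the source column makes it possible to compare models trained with and without the FEP-derived labels.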
Description:
DATA AVAILABILITY STATEMENT:
All software generated for this paper is available in the Supporting Information. The KNIME analytics platform can be downloaded for free at https://www.knime.com/. All KNIME workflows are provided within the Supporting Information. All necessary data to replicate the study can be found in the public domain or within the provided Supporting Information.
SUPPORTING INFORMATION:
Comprehensive description of the methodologies and parameters employed; list of the chemicals involved in this research; outcomes for each FEP calculation; MD reports; workflow of the ML experiments, including the corresponding initial data; and ML performance at two additional categorical cutoff values (PDF)
MD reports (ZIP)
Input structure data (ZIP)
FEPML workflows (ZIP)
FEPML results (ZIP)
Compound list (ZIP)
SMILES (CSV)
Processing Data Workflow (ZIP)