Machine learning (or, for investors,
artificial intelligence) has received plenty of attention. To be successful you
need lots of data. If you’re trying to, say, train a large language model to
write limericks, you’ve got oodles of text to draw from. But to train a model
that can predict binding affinities you need lots of measurements, and they must be accurate. Among public repositories, ChEMBL is one of the most
prominent due to its size (>2.4 million compounds) and quality, achieved through
manual curation. But even here you need to be cautious, as illustrated in a
recent open-access J. Chem. Inf. Model. paper by Gregory Landrum and
Sereina Riniker (both at ETH Zurich).
The researchers were interested
in the consistency of IC50 or Ki values for the same
compound against the same target. They downloaded data for >50,000 compounds
run at least twice against the same target. Activity data were compared either directly
or after “maximal curation,” which entailed removing duplicate measurements from
the same paper, removing data against mutant proteins, separating binding vs functional
data, and several other quality checks. They used a variety of statistical
tests (R², Kendall τ, Cohen’s κ, the Matthews correlation coefficient)
to evaluate how well the data agreed, the simplest being the fraction of pairs
whose values differed by more than 0.3 or 1 log units, corresponding to roughly
two-fold or ten-fold differences.
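That simplest comparison is easy to sketch. Below is a minimal Python/pandas version, assuming a hypothetical input file chembl_activities.csv with one row per measurement and columns compound_id, target_id, doc_id (the source publication), and pvalue (the negative log10 of the measured IC50 or Ki in molar units); the file, the column names, and the single curation step shown are illustrative assumptions, not the authors’ actual code.

import itertools
import pandas as pd

# Hypothetical input: one row per measurement, with columns
# compound_id, target_id, doc_id (source publication), and
# pvalue (-log10 of the IC50 or Ki in molar units).
df = pd.read_csv("chembl_activities.csv")

# One curation step described in the paper: collapse replicate
# measurements reported in the same publication.
df = df.drop_duplicates(subset=["compound_id", "target_id", "doc_id"])

# Absolute difference for every pair of independent measurements
# of the same compound against the same target.
diffs = []
for _, grp in df.groupby(["compound_id", "target_id"]):
    for a, b in itertools.combinations(grp["pvalue"], 2):
        diffs.append(abs(a - b))

diffs = pd.Series(diffs)
print(f"pairs compared: {len(diffs)}")
print(f">0.3 log units (~2-fold):  {(diffs > 0.3).mean():.0%}")
print(f">1.0 log units (~10-fold): {(diffs > 1.0).mean():.0%}")

(The 0.3 threshold is roughly a factor of two because 10^0.3 ≈ 2.)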
The results were not encouraging.
Looking at IC50 values, 64% of pairs differed by >0.3 log units,
and 27% differed by more than 1 log unit. In other words, for more than a
quarter of measurement pairs the same molecule might test at 100 nM in one assay
and >1 µM in another.
Of course, as the researchers
note, “it is generally not scientifically valid to combine values from different
IC50 assays without knowledge of the assay conditions.” For example,
the concentration of ATP in a kinase assay can have dramatic effects on the IC50
values for an inhibitor. Surely Ki values should be more comparable.
But no, 67% of pairs differed by >0.3 log units and 30% differed by >1 log unit!
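The ATP point is the textbook Cheng–Prusoff effect: for a competitive inhibitor, the measured IC50 scales with the substrate concentration. In LaTeX notation (a standard result, not something specific to this paper):

\mathrm{IC}_{50} \;=\; K_i \left( 1 + \frac{[\mathrm{ATP}]}{K_m} \right)

An assay run at an ATP concentration well above Km will thus report a much larger IC50 than one run at or below Km even though Ki is unchanged, which is exactly why one might expect Ki values to transfer between assays better than IC50 values. The numbers above show that in practice they don’t.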
The situation improved for IC50
values with maximal curation: the fractions of pairs differing by >0.3
and >1 log units dropped to 48% and 13%, respectively. However, this came at the
expense of throwing away 99% of the data.
Surprisingly, applying maximal
curation to the Ki data actually made the situation worse. Digging
into the data, the researchers found that 32 assays reporting Ki
values for human carbonic anhydrase I, all from the same corresponding author,
include “a significant number of overlapping compounds, with results that are sometimes
inconsistent.” Scrubbing these improved the situation, but 38% of pairs still
differed by >0.3 log units, and 21% differed by >1 log unit.
This is all rather sobering, and
suggests there are limits to the quality of available data. As we noted in January,
there are all kinds of reasons assays can differ even within the same lab. Add
in subtle variations in protein activity or buffer conditions and perhaps we
should not be too surprised at log-order differences in experimental measurements.
And this assumes everyone is trying to do good science: I’m sure sloppy and fraudulent
data only make the situation worse. No matter how well we build our
computational tools, noisy data will ensure they often differ from reality,
whatever that may be.
3 comments:
A very nice paper indeed, important to stress.
I think this reading pairs well with "Comparability of Mixed IC50 Data – A Statistical Analysis" (open access, link in the name) from 2013. As stated there more than a decade ago: "... it is likely the data quality will rise over time by continuous iterative improvement of the large databases such as ChEMBL and BindingDB".
Thanks Vladimer - nice paper, though your conclusion may have been prematurely optimistic!
Thanks for highlighting this paper, it points out important things that are often overlooked!
There is a serious issue with some published carbonic anhydrase inhibitor data that has previously been exposed and discussed by Jonsson and Liljas; here is the last published comment from their initiative: Jonsson BH, Liljas A. Comments to the Editor Due to the Response by the Supuran Group to Our Article. Biophys J. 2021 Jan 5;120(1):182-183. doi: 10.1016/j.bpj.2020.11.012. Epub 2020 Dec 13. PMID: 33308476; PMCID: PMC7820732.
I think this is also a "good" example of poor science that can be used for education.