22 April 2024

The limits of published data

Machine learning (or, for investors, artificial intelligence) has received plenty of attention. To be successful, you need lots of data. If you’re trying to, say, train a large language model to write limericks, you’ve got oodles of text to draw from. But to train a model that can predict binding affinities you need lots of measurements, and they must be accurate. Among public repositories, ChEMBL is one of the most prominent due to its size (>2.4 million compounds) and quality, achieved through manual curation. But even here you need to be cautious, as illustrated in a recent open-access J. Chem. Inf. Model. paper by Gregory Landrum and Sereina Riniker (both at ETH Zurich).
 
The researchers were interested in the consistency of IC50 or Ki values for the same compound against the same target. They downloaded data for >50,000 compounds run at least twice against the same target. Activity data were compared either directly or after “maximal curation,” which entailed removing duplicate measurements from the same paper, removing data against mutant proteins, separating binding vs. functional data, and several other quality checks. They used a variety of statistical tests (R², Kendall τ, Cohen’s κ, Matthews correlation coefficient) to evaluate how well the data agreed, the simplest being the fraction of pairs in which the values differed by more than 0.3 or 1 log units, roughly two-fold or ten-fold differences, respectively.
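As a rough illustration of that simplest check, here is a minimal Python sketch (not the authors’ code; the paired values below are hypothetical) that computes the fraction of duplicate measurements differing by more than 0.3 or 1 log units, plus one of the rank-based agreement statistics the paper also reports:

# Minimal sketch: how discordant are duplicate measurements of the same
# compound/target pair? Values below are made-up pIC50s (-log10 IC50 in M).
import numpy as np
from scipy.stats import kendalltau

assay_1 = np.array([7.0, 6.2, 8.1, 5.9, 7.4])  # hypothetical first measurements
assay_2 = np.array([6.6, 6.3, 7.0, 6.8, 7.5])  # hypothetical repeat measurements

diff = np.abs(assay_1 - assay_2)
print("fraction >0.3 log units:", np.mean(diff > 0.3))  # ~2-fold disagreement
print("fraction >1.0 log units:", np.mean(diff > 1.0))  # ~10-fold disagreement

# one of the rank-based agreement statistics mentioned above
tau, p = kendalltau(assay_1, assay_2)
print("Kendall tau:", round(tau, 2))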
 
The results were not encouraging. Looking at IC50 values, 64% of pairs differed by >0.3 log units, and 27% differed by more than 1 log unit. In other words, for more than a quarter of measurements a molecule might test as 100 nM in one assay and >1 µM in another.
 
Of course, as the researchers note, “it is generally not scientifically valid to combine values from different IC50 assays without knowledge of the assay conditions.” For example, the concentration of ATP in a kinase assay can have dramatic effects on the IC50 values for an inhibitor. Surely Ki values should be more comparable. But no: 67% of pairs differed by >0.3 log units and 30% differed by >1 log unit!
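To put a rough number on the ATP point above: for an ATP-competitive inhibitor, the standard Cheng-Prusoff relationship, IC50 = Ki × (1 + [ATP]/Km), means the measured IC50 scales with the ATP concentration chosen for the assay. A quick Python sketch with hypothetical Ki and Km values (not taken from the paper):

# Why assay conditions matter: the same inhibitor gives very different IC50s
# depending on the ATP concentration, per Cheng-Prusoff for competitive inhibition.
ki_nM = 10.0       # hypothetical true Ki of the inhibitor
km_atp_uM = 20.0   # hypothetical Km of the kinase for ATP

for atp_uM in (10.0, 100.0, 1000.0):  # low, moderate, near-physiological ATP
    ic50_nM = ki_nM * (1 + atp_uM / km_atp_uM)
    print(f"[ATP] = {atp_uM:6.0f} uM  ->  IC50 = {ic50_nM:6.0f} nM")

With these made-up numbers the apparent IC50 shifts from 15 nM to over 500 nM simply by changing the ATP concentration, more than the log-order discrepancies discussed above.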
 
The situation improved for IC50 values under maximal curation, with the fraction of pairs differing by >0.3 and >1 log units dropping to 48% and 13%, respectively. However, this came at the expense of throwing away 99% of the data.
 
Surprisingly, applying maximal curation to the Ki data actually made the situation worse. Digging into the data, the researchers found that 32 assays reporting Ki values for human carbonic anhydrase I, all from the same corresponding author, included “a significant number of overlapping compounds, with results that are sometimes inconsistent.” Scrubbing these improved the situation, but 38% of pairs still differed by >0.3 log units, and 21% differed by >1 log unit.
 
This is all rather sobering, and it suggests there are limits to the quality of available data. As we noted in January, there are all kinds of reasons assays can differ even within the same lab. Add in subtle variations in protein activity or buffer conditions and perhaps we should not be too surprised at log-order differences in experimental measurements. And this assumes everyone is trying to do good science: I’m sure sloppy and fraudulent data only make the situation worse. No matter how well we build our computational tools, noisy data will ensure they often differ from reality, whatever that may be.

2 comments:

Vladimir Talibov said...

A very nice paper indeed, important to stress.

I think this reading pairs well with "Comparability of Mixed IC50 Data – A Statistical Analysis" (open access, link in the name) from 2013. As stated there more than a decade ago "... it is likely the data quality will rise over time by continuous iterative improvement of the large databases such as ChEMBL and BindingDB".

Dan Erlanson said...

Thanks Vladimir - nice paper, though your conclusion may have been prematurely optimistic!