False positives and artifacts are
a constant source of irritation – and worse – in compound screening. We’ve written
frequently about small molecule aggregation as well as generically reactive
molecules that repeatedly come up as screening hits. It is possible to weed
these out experimentally, but this can entail considerable effort, and for particularly
difficult targets, false positives may dominate. Indeed, there may be no true
hits at all, as we noted in this account of a five-year and ultimately fruitless
hunt for prion protein binders.
A computational screen to rapidly
assess small molecule hits as possible artifacts would be nice, and in fact
several have been developed. Among the most popular are computational filters
for pan-assay interference compounds, or PAINS. However, as Pete Kenny and
others have pointed out, these were developed using data from a limited number
of screens in one particular assay format. Now Alexander Tropsha and collaborators
at the University of North Carolina at Chapel Hill and the National Center for
Advancing Translational Sciences (NCATS) at the NIH have provided a broader resource
in a new J. Med. Chem. paper.
The researchers experimentally
screened around 5000 compounds, taken from the NCATS Pharmacologically Active
Chemical Toolbox, in four different assays: a fluorescence-based thiol
reactivity assay, an assay for redox activity, a firefly luciferase (FLuc)
assay, and a nanoluciferase (NLuc) assay. The latter two assays are commonly
used in cell-based screens to measure gene transcription. The thiol reactivity
assay yielded around 1000 interfering compounds, while the other three assays
each produced between 97 and 142. Interestingly, there was little overlap among
the problematic compounds.
These data were used to develop
quantitative structure-interference relationship (QSIR) models. The NCATS
library of nearly 64,000 compounds was virtually screened, and around 200
compounds were tested experimentally for interference in the four assays, with
around half predicted to interfere and the other half predicted not to
interfere. The researchers had also previously built a computational model to
predict aggregation, and this – along with the four models discussed here – has
been combined into a free web-based “Liability Predictor.”
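The paper is behind a paywall and I have not seen the modeling details, but for readers who have never built this kind of model, the general recipe usually looks something like the sketch below: molecular fingerprints fed to an off-the-shelf classifier, which is then used to score untested compounds. To be clear, this is a generic illustration, not the authors' pipeline; the descriptors, the algorithm, and the SMILES and labels are all placeholders I made up.

```python
# Generic sketch of a fingerprint-based interference classifier.
# NOT the authors' QSIR pipeline: descriptors, algorithm, and data are placeholders.
import numpy as np
from rdkit import Chem, DataStructs
from rdkit.Chem import AllChem
from sklearn.ensemble import RandomForestClassifier

def morgan_fp(smiles, n_bits=2048):
    """Morgan (ECFP4-like) fingerprint as a numpy array."""
    mol = Chem.MolFromSmiles(smiles)
    fp = AllChem.GetMorganFingerprintAsBitVect(mol, 2, nBits=n_bits)
    arr = np.zeros((n_bits,))
    DataStructs.ConvertToNumpyArray(fp, arr)
    return arr

# Hypothetical training data: SMILES plus a 0/1 "interferes in this assay" label.
train_smiles = ["c1ccccc1O", "CC(=O)Oc1ccccc1C(=O)O", "O=C1C=CC(=O)C=C1", "c1ccc2ccccc2c1"]
train_labels = [1, 0, 1, 0]

X = np.array([morgan_fp(s) for s in train_smiles])
model = RandomForestClassifier(n_estimators=100, random_state=0).fit(X, train_labels)

# "Virtual screen": predicted probability of interference for new compounds.
query = ["CCOC(=O)c1ccccc1"]
print(model.predict_proba(np.array([morgan_fp(s) for s in query]))[:, 1])
```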
So how well does it work? The researchers
calculated the sensitivity, specificity, and balanced accuracy for each of the
models and state that “they can detect around 55%-80% of interfering compounds.”
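For concreteness, balanced accuracy is simply the average of sensitivity and specificity, so a model can post a respectable-looking number while still missing a fair fraction of interferers. A quick sketch with hypothetical confusion-matrix counts (not numbers from the paper):

```python
# Toy confusion-matrix counts (hypothetical, not from the paper):
tp, fn = 70, 30   # interfering compounds caught / missed
tn, fp = 80, 20   # clean compounds passed / wrongly flagged

sensitivity = tp / (tp + fn)          # fraction of interferers detected: 0.70
specificity = tn / (tn + fp)          # fraction of clean compounds passed: 0.80
balanced_accuracy = (sensitivity + specificity) / 2   # 0.75

print(sensitivity, specificity, balanced_accuracy)
```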
This sounded encouraging, so naturally
I took it for a spin. Unfortunately, my mileage varied. Or, to pile on the
metaphors, lots of wolves successfully passed themselves off as sheep. Iniparib was recognized correctly as a possible thiol interference compound. On the other hand, the known
redox cycler toxoflavin was predicted not to be a redox
cycler – with 97.12% confidence. Similarly, curcumin, which can form adducts
with thiols as well as aggregate and redox cycle, was pronounced innocent. Quercetin was recognized as possibly thiol-reactive, but its known propensity to aggregate was not flagged. Weirdly, Walrycin B, which the researchers note interferes with all the
assays, got a clean bill of health. Perhaps the online tool is still being optimized.
At this point, perhaps the
Liability Predictor is best treated as a cautionary tool: molecules that come
up with a warning should be singled out for particular interrogation, but
passing does not mean the molecule is innocent. Laudably, the researchers have
made all the underlying data and models publicly available for others to build
on, and I hope this happens. But for now, it seems that no computational tool can
substitute for experimental (in)validation of hits.
Hi Dan, I can’t actually see the article and will make some general comments. The authors state in the abstract that they “… developed and validated quantitative structure–interference relationship (QSIR) models to predict these nuisance behaviors. The resulting models showed 58–78% external balanced accuracy for 256 external compounds per assay.” I take “quantitative” to imply that continuous data have been used to build regression models, but “external balanced accuracy” suggests the models are actually categorical. I’m assuming that the four assays that they’ve run all measure nuisance behavior directly (PAINS filters are based on assumptions that frequent-hitter behavior in the assay panel is indicative of nuisance behavior), and these will not detect nuisance behavior resulting from UV/vis absorption, fluorescence, singlet oxygen reactivity/quenching, or colloidal aggregation. Interference with read-out increases with concentration, and here’s a relevant article from my former AZ colleagues that shows how interference can be assessed and in some cases corrected for.
My view is that QSAR (or ML) models cannot make reliable predictions if they’ve not been trained with data for close structural analogs of the compounds for which predictions are being made, and it may be that there’s nothing similar to toxoflavin in the data sets used to train the models. One simple way to address this issue is to present the user with the relevant data for the compounds from the training set which are considered to be the closest neighbours of the compound for which the prediction has been made (for some reason QSAR/ML modellers appear to consider this a terrible idea). Uneven coverage of chemical space by training and test sets is a real (although rarely acknowledged) problem in QSAR/ML modelling, and my view, expressed in this 2009 article, is that some (most?) “global” models are actually ensembles of local models. Another consequence of uneven coverage of chemical space by training/test sets is that validation procedures can lead to optimistic assessments of model quality.
Hi Pete,
Unfortunately the paper is behind a paywall, but I'd be curious to get your thoughts on the Liability Predictor itself, which is not. You are correct that it won't pick out problems due to UV/vis absorption etc., but what I like is that, in theory, it shows in which assays a given compound may show false-positive behavior. As you point out, the training set may be too small, but it is odd that even compounds specifically described in the paper as being problematic seem to pass the online filter.
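Your suggestion of showing the user the closest training-set neighbours seems straightforward to bolt on, at least in principle. Here is a rough sketch of what I have in mind, purely illustrative: Morgan fingerprints and Tanimoto similarity are my arbitrary choices, and the "training set" is a made-up placeholder rather than the NCATS data.

```python
# Rough, illustrative nearest-neighbour lookup: for a query compound, report the
# most similar training-set molecules and their measured assay outcomes.
# The training set below is a placeholder, not the NCATS data.
from rdkit import Chem, DataStructs
from rdkit.Chem import AllChem

def fp(smiles):
    """Morgan fingerprint (radius 2, 2048 bits) for a SMILES string."""
    return AllChem.GetMorganFingerprintAsBitVect(Chem.MolFromSmiles(smiles), 2, nBits=2048)

# Placeholder "training set": SMILES plus a measured interference flag (0/1).
training = [("O=C1C=CC(=O)C=C1", 1),
            ("CC(=O)Oc1ccccc1C(=O)O", 0),
            ("c1ccc2ccccc2c1", 0)]
train_fps = [fp(smi) for smi, _ in training]

def nearest_neighbours(query_smiles, k=3):
    """Return the k most similar training compounds with their measured labels."""
    sims = DataStructs.BulkTanimotoSimilarity(fp(query_smiles), train_fps)
    ranked = sorted(zip(sims, training), reverse=True)[:k]
    return [(round(sim, 2), smi, label) for sim, (smi, label) in ranked]

print(nearest_neighbours("Cc1ccccc1O"))  # (similarity, training SMILES, measured label)
```

If none of the neighbours are remotely similar, at least the user would know the prediction is an extrapolation, which might go some way toward explaining the toxoflavin result.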