04 February 2013

Beware correlation inflation

Drug discovery today is replete with rules and metrics: the Rule of 5, the Rule of 3 (though perhaps not 1), not to mention ligand efficiency and friends. The hope is that these encapsulate physical trends that will guide drug hunters towards better compounds. However, there is a danger that rules will become strait-jackets; plenty of drugs, after all, lie well outside the Rule of 5 (Ro5). In a paper recently published in J. Comput. Aided Mol. Des., Peter Kenny (of FBDD-Lit fame) and Carlos Montanari argue that the correlations underlying many rules may not be as robust as they appear. The article is full of the trenchant prose we’ve come to expect of Kenny, so I’ll quote liberally.

The background:

Those who have followed the drug discovery literature over the last decade or so will have become aware of a publication genre that can be described as ‘retrospective data analysis of large proprietary data sets’ or, more succinctly, as ‘Ro5 envy’.

The problem:

Although data analysts frequently tout the statistical significance of the trends that their analysis has revealed, weak trends can be statistically significant without being remotely interesting.

This is especially likely to occur when data are “binned” into a smaller number of categories before being analyzed, thereby hiding variation and making correlations appear stronger than they really are. Since many published analyses use proprietary, unavailable data, Kenny and Montanari constructed model “noisy” data sets and looked for correlations in both the primary data and the binned data. They found that correlations in the binned data were inflated. Perhaps counter-intuitively, the effect actually gets more pronounced the larger the data set, since each bin average is then taken over more points and so scatters less around the underlying trend.
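To get a feel for the effect, here is a minimal sketch in Python. It is not the authors' procedure or data: the slope, noise level, number of bins, and the function names (pearson_r, binned_means, simulate) are all assumptions chosen for illustration. It generates a weak linear trend buried in noise, then compares the correlation coefficient for the individual points with the one obtained after averaging the points within equal-width bins.

```python
import random
from math import sqrt


def pearson_r(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    return sxy / sqrt(sxx * syy)


def binned_means(xs, ys, n_bins):
    """Average x and y within equal-width bins of x; empty bins are skipped."""
    lo, hi = min(xs), max(xs)
    width = (hi - lo) / n_bins
    buckets = [[] for _ in range(n_bins)]
    for x, y in zip(xs, ys):
        # The largest x would land in bin n_bins, so clamp it into the last bin.
        idx = min(int((x - lo) / width), n_bins - 1)
        buckets[idx].append((x, y))
    bx, by = [], []
    for bucket in buckets:
        if bucket:
            bx.append(sum(p[0] for p in bucket) / len(bucket))
            by.append(sum(p[1] for p in bucket) / len(bucket))
    return bx, by


def simulate(n_points, n_bins=5, slope=0.25, noise_sd=1.0, seed=0):
    """Return (r for the raw points, r for the bin averages).

    All parameter values are illustrative assumptions, not taken from the paper.
    """
    rng = random.Random(seed)
    xs = [rng.uniform(0.0, 4.0) for _ in range(n_points)]
    ys = [slope * x + rng.gauss(0.0, noise_sd) for x in xs]
    r_raw = pearson_r(xs, ys)
    r_binned = pearson_r(*binned_means(xs, ys, n_bins))
    return r_raw, r_binned


if __name__ == "__main__":
    for n in (100, 1_000, 10_000, 100_000):
        r_raw, r_binned = simulate(n)
        print(f"n={n:>7}  r(raw)={r_raw:.3f}  r(binned)={r_binned:.3f}")
```

With these assumed settings the underlying trend explains only a small fraction of the variance in the individual points, yet the bin averages line up almost perfectly once each bin holds many points. Changing n_points shows how the gap between the two coefficients depends on the size of the data set.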

Having described the problem, Kenny and Montanari go on to question some recent high-profile papers correlating, for example, lipophilicity with pharmacological promiscuity, or the percentage of sp3-hybridized carbons (Fsp3) with solubility (see also here). In the latter case, all the data were publicly available, and a reanalysis with the primary data as opposed to binned data caused the correlation coefficient (r) to drop from 0.972 to 0.247!

Graphical representation of data comes under heavy scrutiny too. In particular, the common practice of subdividing data points into small numbers of categories (often red, yellow, and green) can make these categories appear discrete when the underlying data are better described as a continuum.

The overall message is that weak correlations may lead to misguided strategies:

To restrict values of properties such as lipophilicity more stringently than is justified by trends in the data is to deny one’s own drug-hunting teams room to maneuver while yielding the initiative to hungrier, more agile competitors.

There is something to this, though acting on it is not without risk. As the old saying goes, nobody gets fired for buying IBM. Most drug discovery efforts fail, but if you fail making conventional compounds, you’re less likely to come under fire than if you fail by doing something outside the accepted norm.

But whatever you do, it’s worth remembering:

The human liver remains an effective antidote to the hubris of the drug designer.


Peter Kenny said...

Thanks for highlighting this, Dan, and honoured that you quoted so freely from the text! Actually one would expect correlations for averaged data to improve as larger random samples are drawn from a population. This is because the probability of a large deviation of a sample mean from the true mean decreases with the size of the sample. This is also the basis for much of statistical mechanics. We're hoping that the article will get people asking more questions and not being afraid to challenge (the institutional 'wisdom' of a Pharma company can be quite intimidating).

Dr. Teddy Z said...

This is a paper long overdue and of course Peter is the perfect person to have written it. Bravo.

Anonymous said...

Nate Silver called this one, generically.

Dr. Teddy Z said...

And it was picked up by ITP