In a previous post about heterocycles that appear chemically feasible but have not been reported, we wondered whether these molecules would show biological activity. The structure of biologically relevant chemical space – that fraction of possible molecules that will exhibit some biological effect – is of great interest, but as yet unknown. Brian Shoichet and coworkers at UCSF have just published a thought-provoking analysis in Nature Chemical Biology that is also relevant to developing new fragment libraries.
The researchers ask why it is that HTS collections of a million or so compounds, vanishingly small in comparison to the roughly 1,000,000,000,000,000,000,000,000,000,000,000,000,000,000,000,000,000,000,000,000 possible small drug-sized molecules, nonetheless so often succeed in identifying hits. A lovely paper by Tobias Fink and Jean-Louis Reymond had previously computationally enumerated all possible compounds with up to 11 C, N, O, and F atoms. Of these 26,429,328 molecules, 25,810 are commercially available. Shoichet and colleagues compared the structures of these compounds with the structures of metabolites and natural products (all of which have by definition been processed by at least one protein) and found that the commercially available compounds were much more similar to natural products and metabolites than were non-commercially available compounds.
Indeed, the more similar a molecule is to a known natural product or metabolite, the more likely that it is available for purchase; 2918 of the commercially available compounds are in fact natural products or metabolites.
The bias also increases exponentially with molecular size: a random 11-atom commercial compound is almost 1000 times more likely to resemble a natural product or metabolite than is a non-commercially available molecule, whereas the bias is only about 2-fold for 6-atom molecules. Similar results were observed with other libraries.
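To put these figures in proportion, here is a quick back-of-the-envelope sketch that uses only the numbers quoted above. The variable names are mine, not the paper's. The per-atom growth factor assumes the bias rises by a constant multiple with each added atom between the two quoted endpoints, which is my simplification rather than a fit reported by the authors.

```python
# Figures quoted in the post; variable names are illustrative only.
enumerated = 26_429_328        # enumerated molecules with up to 11 C, N, O, F atoms
commercial = 25_810            # of those, commercially available
biogenic_commercial = 2_918    # commercial compounds that are natural products or metabolites

# Share of the enumerated space that can actually be bought.
frac_commercial = commercial / enumerated

# Share of purchasable compounds that are themselves natural products or metabolites.
frac_biogenic = biogenic_commercial / commercial

# Assumption: the bias grows by a constant factor per added atom between
# the two quoted endpoints (about 2-fold at 6 atoms, almost 1000-fold at 11).
bias_small, bias_large = 2.0, 1000.0
atoms_small, atoms_large = 6, 11
per_atom_factor = (bias_large / bias_small) ** (1 / (atoms_large - atoms_small))

print(f"Commercially available: {frac_commercial:.3%} of enumerated molecules")
print(f"Natural products or metabolites: {frac_biogenic:.1%} of commercial compounds")
print(f"Implied growth in bias per added atom: about {per_atom_factor:.1f}-fold")
```

In other words, roughly one enumerated molecule in a thousand is for sale, about one in nine of those is itself a natural product or metabolite, and under the constant-factor assumption each additional atom multiplies the bias by something like three and a half. Since "almost 1000" is itself approximate, treat that last number as a rough guide only.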
The authors conclude that:
“A major reason why the screening of synthetic compounds ever finds notable hits is that our libraries are biased toward the sort of molecules that proteins have evolved to recognize.”
This resemblance is reasonable. After all, most commercially available compounds are ultimately derived from naturally occurring starting materials, so their similarity to natural products isn’t surprising. Moreover, historically much of chemistry was devoted to natural product synthesis, so many of the intermediates built up over the years resemble natural products. And of course, once you learn how to do chemistry on one moiety, you will tend to stick with it unless you have a good reason to do otherwise; each heterocycle behaves (often frustratingly) differently, so if a natural-product-like molecule does the job, why look for trouble?
But does this “biogenic bias” mean that the rest of chemical space is a biological desert? Not necessarily. I can imagine at least two alternative models of chemical-biological diversity space.
Let’s call one model “lamp posts in dark fields.” Consider a vast field of some crop that can only be harvested by night. There are lights scattered haphazardly throughout the field. One might expect that the crops immediately under the lamp posts would be harvested more intensively than crops in darker parts of the field, even if other areas are equally productive. In this scenario, the lamp posts reveal natural products and similar molecules, but much – or even most – of (unlit) chemical space may also be biologically active; it just hasn’t been sampled yet.
Another possibility is the “oil-field model.” As we are all too aware, petroleum is distributed very unevenly across the globe. In some areas, such as Texas, oil was easy to find and easy to extract. In others, such as the deep ocean or the high Arctic, oil is harder to find and more technically demanding to access. In this scenario, there are vast pockets of chemical space that are relevant to biology; they just haven’t been identified (let alone accessed) yet.
These are fun, speculative questions, but the paper provides some practical data. Specifically, 83% of core ring scaffolds found in natural products are absent from commercial libraries. In fragment or lead-sized molecules of MW < 350 with fewer than three stereocenters, 1891 ring scaffolds found in natural products are not commercially available. These could be useful additions to fragment libraries, and the paper lists 18 examples. In fact, at least one company, deCODE, is explicitly enriching its fragment collection with molecules based on natural products and metabolites.
This might be a good strategy. After all, even if the “lamp posts in dark fields” model is correct, there are plenty of brightly illuminated, unharvested chemotypes. At least for now, picking these may be more productive than venturing into the twilight zone of uncharted chemical space.