12 August 2019

Achieving maximum diversity with minimum size

One theoretical advantage of fragment-based drug discovery is the ability to efficiently explore chemical space: there are vastly fewer possible fragment-sized molecules than lead-sized molecules. That said, even fragment space is daunting; the number of possible molecules with up to 17 non-hydrogen atoms is about three orders of magnitude larger than the largest computational screen. Maximizing diversity is thus a key goal in designing fragment libraries, but how do you actually do this? A new open-access paper in Molecules by Yun Shi and Mark von Itzstein at Griffith University provides a practical new approach.

As the researchers point out, diversity itself can be a slippery concept. Functional diversity (i.e., which targets are bound) is important, but the knowledge needed to assess it is hard-won. Physicochemical diversity is by definition limited for fragments. That leaves structural diversity, as defined by “molecular fingerprints.” These can be as simple as the presence or absence of a fluorine atom, or can require complicated calculations involving, say, the distance between a hydrogen bond donor and acceptor in the lowest energy conformation of a molecule. In their paper the researchers focus on “extended-connectivity” fingerprints, which take into consideration the physical connectivity between different types of atoms.
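To make the idea of a connectivity-based fingerprint concrete, here is a deliberately simplified sketch. It is not the extended-connectivity algorithm used in the paper: it ignores bond orders, charges, and larger radii, and the function name `circular_features` and the input format are assumptions made purely for illustration. Each non-hydrogen atom contributes one feature for its element and one for its element plus the elements of its direct neighbors.

```python
from collections import defaultdict

def circular_features(atoms, bonds):
    """Return a set of toy radius-0 and radius-1 features for one molecule.

    atoms: element symbols of the non-hydrogen atoms, e.g. ["C", "C", "O"]
    bonds: (i, j) index pairs; bond orders are ignored in this toy version
    """
    neighbors = defaultdict(list)
    for i, j in bonds:
        neighbors[i].append(atoms[j])
        neighbors[j].append(atoms[i])

    features = set()
    for idx, element in enumerate(atoms):
        # Radius 0: the atom type on its own.
        features.add((element,))
        # Radius 1: the atom type plus the sorted types of its neighbors.
        features.add((element, tuple(sorted(neighbors[idx]))))
    return features

# Hypothetical inputs: heavy-atom skeletons of ethanol and fluoroethane.
ethanol = circular_features(["C", "C", "O"], [(0, 1), (1, 2)])
fluoroethane = circular_features(["C", "C", "F"], [(0, 1), (1, 2)])

# Features present in both molecules (the shared methyl end).
shared = ethanol & fluoroethane
```

In this toy representation the two molecules share only the features of the terminal carbon, while the features involving oxygen or fluorine are unique to each.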

But how can you actually quantify structural diversity? One possibility is to compare molecules pairwise to see how different they are, as is done for example in Tanimoto similarity assessments. Each additional molecule would be chosen to be least similar to those already in the library. Alternatively, one could consider “richness,” how much of chemical space is covered, by calculating how many unique structural features (such as specific bond connectivities) are represented. Each additional molecule would be chosen to provide as many new molecular fingerprints as possible. Shi and von Itzstein propose a third approach, “true diversity,” that considers the number of unique features as well as their proportional abundances. In other words, a library with a higher true diversity would have a “more even distribution of proportional abundances.” The researchers note that this approach has been used in ecology for decades.
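The three measures can be sketched in a few lines, treating each molecule as a set of structural features. This is a minimal illustration under stated assumptions, not the paper's implementation: the Tanimoto formula is the standard one for sets, but the `true_diversity` function assumes the order-1 form common in ecology (the exponential of the Shannon entropy of feature abundances), and it counts a feature's abundance as the number of molecules containing it. The paper's exact definitions may differ.

```python
import math
from collections import Counter

def tanimoto(a, b):
    """Tanimoto similarity of two feature sets: shared / total distinct."""
    union = len(a | b)
    return len(a & b) / union if union else 1.0

def feature_counts(library):
    """Count how many molecules in the library contain each feature."""
    return Counter(feature for mol in library for feature in mol)

def richness(library):
    """Number of unique features represented in the library."""
    return len(feature_counts(library))

def true_diversity(library):
    """Assumed order-1 true diversity: exp(Shannon entropy of abundances)."""
    counts = feature_counts(library)
    total = sum(counts.values())
    if total == 0:
        return 0.0
    entropy = -sum((n / total) * math.log(n / total) for n in counts.values())
    return math.exp(entropy)

# Two hypothetical libraries with the same richness (four features each).
even_library = [{"a", "b"}, {"c", "d"}]
uneven_library = [{"a", "b"}, {"b", "c"}, {"c", "d"}]
```

Both toy libraries have a richness of four, but in `uneven_library` features "b" and "c" appear twice as often as "a" and "d", so its true diversity falls below four. When every feature is equally abundant, as in `even_library`, true diversity equals richness.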

To see how their approach performs, the researchers started with a set of 227,787 commercially available fragments, all of which were roughly rule-of-3-compliant and scrubbed of undesirable functionalities. They also considered a subset of 47,708 fluorine-containing fragments. For both sets, they then assessed structural diversity as a function of increasing fragment library size using Tanimoto similarity, richness, and true diversity, as well as random sampling.

Naturally, diversity increased as the size of a fragment library rose. As expected, selecting by Tanimoto similarity or richness led to greater diversity at a smaller library size than did random sampling, and selecting by true diversity did even better. Interestingly, true diversity reached a maximum at 8.8% of the full set or 15.7% of the fluorinated set and then began to decline as more fragments were added. This makes conceptual sense: commercial compounds are unlikely to be evenly distributed across structural features, so adding more of them eventually skews the proportional abundances.
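The rise-then-decline behavior is easy to reproduce on a toy pool. The sketch below greedily adds whichever molecule gives the highest true diversity at each step and records the value as the library grows. Both the greedy strategy and the order-1 definition of true diversity (exponential of Shannon entropy) are assumptions for illustration; the paper's selection procedure may differ, and the feature sets are invented.

```python
import math
from collections import Counter

def true_diversity(library):
    """Assumed order-1 true diversity: exp(Shannon entropy of abundances)."""
    counts = Counter(feature for mol in library for feature in mol)
    total = sum(counts.values())
    if total == 0:
        return 0.0
    return math.exp(-sum((n / total) * math.log(n / total)
                         for n in counts.values()))

def greedy_curve(pool):
    """Grow a library one molecule at a time, always taking the candidate
    that maximizes true diversity, and return the diversity at each size."""
    remaining = list(pool)
    selected, curve = [], []
    while remaining:
        best = max(remaining, key=lambda mol: true_diversity(selected + [mol]))
        remaining.remove(best)
        selected.append(best)
        curve.append(true_diversity(selected))
    return curve

# Hypothetical pool in which feature "a" is heavily over-represented.
pool = [{"a", "b"}, {"c", "d"}, {"a"}, {"a"}, {"a", "b"}]
curve = greedy_curve(pool)
peak_size = curve.index(max(curve)) + 1  # library size at the maximum
```

Here the first two picks already cover all four features evenly, so true diversity peaks at a library of two molecules. Every later addition only piles up more copies of "a" and "b", and the value drops with each step.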

More importantly, just 1% or 2.5% of the fragments (for the full and fluorinated sets, respectively) were sufficient to achieve the same true diversity as the full sets. For the complete commercial set this corresponds to 2052 of the 227,787 fragments, the structures of which are provided in the supplementary material. As the researchers note, this is comparable to the size of many commonly used fragment libraries.

The method is computationally inexpensive (it runs on a desktop), and should be a useful tool for both building and curating fragment libraries, real and virtual. Of course, diversity is not everything, and it probably makes sense to include privileged pharmacophores even at the cost of lower diversity. But as Lord Kelvin said, “when you can measure what you are speaking about, and express it in numbers, you know something about it.” This paper provides a quantitative approach for measuring diversity.

2 comments:

Peter Kenny said...

Hi Dan, I would argue that it is actually coverage rather than diversity that one is trying to maximize in screening library design. Diversity maximization can select weird stuff and singletons which are typically more difficult to follow up. One key question when considering coverage is how similar two molecular structures have to be in order for one to be considered to be representing the other.

Yun Shi said...

Hi Peter, thank you for the comments.
If by 'diversity' you meant difference, I agree that trying to maximise difference during selection would actually result in not only weird structures but also poor coverage. We define coverage as a ratio: the number of features (richness) in selected cpds over the richness in all cpds available for selection, and in this paper we used radial fingerprints to describe structural features. Instead of setting an arbitrary similarity/difference cutoff value to decide if cpd A is covered/represented by cpd B, I find it easier to quantify coverage by reducing the problem to the fingerprint level. At this level, coverage becomes a yes-or-no issue, i.e. you either have this exact fingerprint or do not have it at all.
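
The coverage ratio described in this comment can be sketched as follows, with each compound represented as a set of fingerprint features. The function name `coverage` and the toy feature labels are assumptions for illustration only, not taken from the paper.

```python
def coverage(selected, pool):
    """Fraction of the pool's unique features present in the selection."""
    pool_features = set().union(*pool)
    if not pool_features:
        return 0.0
    selected_features = set().union(*selected)
    # A feature is either present in the selection or it is not.
    return len(selected_features & pool_features) / len(pool_features)

# Hypothetical pool of three compounds with four unique features in total.
pool = [{"a", "b"}, {"b", "c"}, {"d"}]
half = coverage(pool[:1], pool)  # first compound alone covers "a" and "b"
full = coverage(pool, pool)      # selecting everything covers everything
```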