Practical Fragments: July 2025

28 July 2025

Can machine learning help you avoid SCAMs?

Among the many types of artifacts that can fool screens and derail efforts to find leads, small colloidally aggregating molecules (SCAMs) are particularly pernicious. As we discussed way back in 2009, these molecules can form aggregates in aqueous buffer that interfere with a variety of assays, leading to wasted resources and embarrassing publications.

The problem is that there isn’t necessarily anything wrong with the molecules per se, and even many approved drugs can form aggregates. Thus, it is difficult to predict whether any given molecule will be a troublemaker. In a new (open-access) Angew. Chem. Int. Ed. paper, Pascal Friederich, Rebecca Davis, and collaborators at Karlsruhe Institute of Technology and the University of Manitoba in Winnipeg explore whether machine learning can help.

The researchers built a Multi-Explanation Graph Attention Network, or MEGAN, which is accessible through a simple web interface. Rather than a homicidal doll, this MEGAN represents atoms as nodes and bonds as edges in a graph, similar to the Fragment Network we wrote about here. MEGAN was trained on a set of 12,338 aggregators and 177,048 non-aggregating molecules. Importantly, the researchers used explainable AI (xAI), which colors portions of the molecule according to their importance for (non)aggregation.

Testing MEGAN on a set of 1500 aggregators and 1500 non-aggregators, none of which were included in the training set, yielded an accuracy of 82%. Because most molecules don’t aggregate, a model biased towards non-aggregators would be expected to have a high accuracy. To account for this, the researchers also assessed the “F1” score, which was similarly impressive.
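To see why accuracy alone can mislead when one class dominates, here is a minimal Python sketch. The confusion-matrix counts are hypothetical and chosen only for illustration; they are not taken from the paper, and the function name is made up for this example.

```python
def accuracy_and_f1(tp, fp, tn, fn):
    """Return (accuracy, F1), treating 'aggregator' as the positive class."""
    accuracy = (tp + tn) / (tp + fp + tn + fn)
    # F1 is the harmonic mean of precision and recall, which simplifies to
    # 2*TP / (2*TP + FP + FN). Report 0 if tp, fp, and fn are all zero.
    denom = 2 * tp + fp + fn
    f1 = 2 * tp / denom if denom else 0.0
    return accuracy, f1

# Hypothetical imbalanced set: 50 aggregators among 1000 molecules.
# The "lazy" model calls every molecule a non-aggregator.
lazy = accuracy_and_f1(tp=0, fp=0, tn=950, fn=50)
# The "useful" model finds 40 of the 50 aggregators but raises 60 false alarms.
useful = accuracy_and_f1(tp=40, fp=60, tn=890, fn=10)

print(f"lazy:   accuracy={lazy[0]:.2f}, F1={lazy[1]:.2f}")
print(f"useful: accuracy={useful[0]:.2f}, F1={useful[1]:.2f}")
```

In this made-up case the lazy model wins on accuracy (0.95 versus 0.93) yet has an F1 of zero, while the useful model scores about 0.53.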

The researchers provide several examples in which subtle variations transform a molecule from a non-aggregator to an aggregator, and show that MEGAN correctly predicts these. Furthermore, it “shows its work,” highlighting the chemical features underlying the prediction. For example, 9H-pyrido[3,4-b]indole is predicted with 92% confidence not to be an aggregator.

Just adding a methyl group flips the odds in favor of aggregation to 92%.

Exploring the molecular features that lead to aggregation can reveal general trends, such as rigid, “flat” molecules with moieties that can serve either as hydrogen bond donors or acceptors. This is consistent with a paper we discussed last year, though unfortunately the researchers do not cite it.

To further assess the tool, it was tested against a set of drugs that had been characterized as aggregators or non-aggregators. MEGAN correctly classified 15 of 30 aggregators and 24 of 28 non-aggregators. In contrast, a different program caught only 2 of the aggregators. The researchers note that most of the training data for MEGAN came from a single screen in phosphate buffer at pH 7, and aggregation can be very dependent on buffer components and pH.
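The counts above translate into per-class rates with a few lines of Python. This is only a sketch using the numbers quoted in this post; the variable names are my own, and it assumes the comparison program was scored against the same 30 aggregators.

```python
def rate(correct, total):
    """Fraction of a class that was classified correctly."""
    return correct / total

sensitivity = rate(15, 30)        # aggregators correctly flagged by MEGAN
specificity = rate(24, 28)        # non-aggregators correctly cleared by MEGAN
overall = rate(15 + 24, 30 + 28)  # all drugs correctly classified by MEGAN
other_sensitivity = rate(2, 30)   # aggregators caught by the other program

print(f"MEGAN sensitivity: {sensitivity:.0%}")
print(f"MEGAN specificity: {specificity:.0%}")
print(f"MEGAN overall accuracy: {overall:.0%}")
print(f"Other program sensitivity: {other_sensitivity:.0%}")
```

On this drug set MEGAN catches half of the aggregators and clears roughly six in seven non-aggregators, for an overall accuracy of about two-thirds.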

Practical Fragments has previously highlighted other aggregation predictors, most notably Aggregator Advisor and Liability Predictor. As for any computational model, the old chestnut “trust but verify” applies. MEGAN appears to be a useful tool, but please run physical experiments if the molecule is important.

21 July 2025

How can we house our crystallographic data?

Three years ago we highlighted a growing debate about how and where to house crystallographic fragment data. With the recent surge in high-throughput crystallography, issues including access, accuracy, and capacity have only become more urgent. An open-access perspective in Nat. Comm. by Manfred Weiss (Helmholtz-Zentrum Berlin) and multiple coauthors, including yours truly, calls on the scientific community to make some difficult decisions. Indeed, a session at the 75th annual meeting of the American Crystallographic Association going on today is devoted to the topic.

High-throughput crystallography can involve soaking more than 1000 crystals with fragments, sometimes yielding hundreds of protein-ligand structures. The paper tabulates a dozen synchrotrons around the world with current or planned high-throughput capabilities. We’ve written recently about the XChem facility at the Diamond Light Source, which is currently running about 80 fragment screens per year. Assuming similar productivity at the other synchrotrons, we might soon see 1000 fragment campaigns per year worldwide. If each of these involves 1000 crystals and we get 10% hit rates, that could mean 100,000 new fragment structures annually.

That is a big number. For reference, the Protein Data Bank (PDB) currently releases 10,000 new crystal structures each year. (Stephen Burley, director of the RCSB PDB, is one of the authors of the perspective.)
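The back-of-the-envelope estimate can be written out explicitly. This Python sketch uses only the round figures quoted in this post, and it assumes every facility matches XChem's current rate, which is a guess rather than a measurement.

```python
synchrotrons = 12              # "a dozen" facilities tabulated in the paper
screens_per_facility = 80      # XChem's current annual rate, assumed for all
crystals_per_screen = 1000
hit_rate_percent = 10
pdb_releases_per_year = 10_000

# Integer arithmetic keeps the estimate exact.
campaigns = synchrotrons * screens_per_facility
fragment_structures = campaigns * crystals_per_screen * hit_rate_percent // 100
ratio = fragment_structures / pdb_releases_per_year

print(f"{campaigns} campaigns per year")
print(f"{fragment_structures} fragment structures per year")
print(f"{ratio:.1f}x the current annual PDB release rate")
```

Without rounding, the estimate comes to 960 campaigns and 96,000 structures per year, nearly ten times the current PDB release rate.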

The problem is that, as we discussed in the 2022 blog post, most fragment structures from high-throughput screens are not refined to the level required for the PDB, a process that typically takes the researcher a day or two and a PDB biocurator up to three hours. Moreover, fragments are often identified using PanDDA (Pan-Dataset Density Analysis, which we wrote about here), a process which makes use of the many unbound structures obtained in a dataset. Ideally, these datasets should also be made available.

The challenge is balancing practicality with FAIR (Findable, Accessible, Interoperable, and Reusable) principles. The paper outlines four non-exclusive options. Very briefly, these are:

Option One: Fully refine and deposit all protein-fragment structures just as with other structures.

Option Two: Partially refine structures, and possibly flag or even segregate them from other structures in the PDB.

Option Three: Rather than treating each protein-ligand structure independently, treat each high-throughput screen as a single experiment, and archive all of the data in its entirety, including unbound structures. These data could be housed in the PDB or elsewhere.

Option Four: A hybrid approach, where fully refined structures would be deposited in the PDB and the rest of the data would be stored in a separate branch of the PDB or elsewhere entirely.

There are pros and cons for each option. At the extremes, the first option puts a tremendous burden on experimentalists and the PDB, and potentially valuable information regarding unbound structures is lost, while option three requires setting up new repositories to store vast quantities of data.

The paper intentionally avoids making a specific recommendation and instead calls for discussion within the scientific community. Personally, I favor some sort of hybrid approach such as option four. As the paper notes, no one could have foreseen AlphaFold2 when the PDB was launched in 1971. Over the next decade researchers around the world are likely to generate hundreds of thousands of protein-fragment structures. I don’t pretend to know what the artificial intelligence tools of the future will be able to make of such data, but I hope they will have access.

What do you think?

14 July 2025

The importance of specific reactivity for covalent drugs

As we noted in our thousandth post, covalent drugs are becoming increasingly popular, particularly for tackling tough targets. But finding and optimizing covalent ligands entails unique challenges, as discussed in a new paper by Bharath Srinivasan at Cancer Research UK. (Derek Lowe also recently blogged about this.)

Interactions between noncovalent drugs and their targets are characterized by dissociation or inhibition constants K_D or K_I, where lower numbers mean stronger binding. In contrast, irreversible covalent drugs are characterized by a ratio we discussed last year, k_inact/K_I, where the rate constant k_inact represents the covalent modification step. (Side note: although the term k_inact is commonly used, covalent modulators can also be activators; my company Frontier Medicines recently announced a covalent activator of p53^Y220C. Perhaps k_cov would be more general?)

To explain k_inact/K_I, Srinivasan draws a useful analogy to enzymes, which are mechanistically described by the specificity constant k_cat/K_m in Michaelis-Menten kinetics. In both cases, higher numbers mean more rapid modification or greater catalytic efficiency. A study of several thousand enzymes found the median k_cat/K_m to be around 100,000 M^-1s^-1, with 60% between 1,000 and 1,000,000 M^-1s^-1. Enzymes operate by stabilizing the transition state of the reaction, which means that the affinities for the substrates do not necessarily have to be high, particularly if the structures of the substrates differ from the transition states.

Just as catalytic efficiency for enzymes can be increased either by increasing k_cat or lowering K_m, the inactivation efficiency of covalent drugs can be optimized either by increasing k_inact or by decreasing K_I. Historically, drug hunters have focused on the latter; we previously described the discovery of TAK-020 in which the affinity of a fragment for the kinase BTK was first optimized and then a covalent warhead was appended.
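To make the two routes concrete, here is a minimal Python sketch. It assumes the standard two-step model (reversible binding followed by irreversible modification) with the compound in large excess over the protein, so that k_obs = k_inact[I]/(K_I + [I]). Both compounds and all of their parameters are hypothetical, picked so that the two share the same k_inact/K_I.

```python
import math

def efficiency(k_inact, K_I):
    """Second-order inactivation efficiency k_inact/K_I in M^-1 s^-1."""
    return k_inact / K_I

def k_obs(k_inact, K_I, conc):
    """Observed pseudo-first-order rate constant (s^-1) at compound conc (M)."""
    return k_inact * conc / (K_I + conc)

def fraction_modified(k_inact, K_I, conc, t):
    """Fraction of protein covalently modified after t seconds."""
    return 1.0 - math.exp(-k_obs(k_inact, K_I, conc) * t)

# Hypothetical compounds, both with k_inact/K_I of about 10,000 M^-1 s^-1.
tight_binder = dict(k_inact=1e-3, K_I=1e-7)   # slow warhead, 100 nM affinity
fast_reactor = dict(k_inact=1e-1, K_I=1e-5)   # fast warhead, 10 uM affinity

for name, params in [("tight binder", tight_binder),
                     ("fast reactor", fast_reactor)]:
    print(name, f"k_inact/K_I = {efficiency(**params):.0f} M^-1 s^-1")
    for conc in (1e-8, 1e-6):   # 10 nM and 1 uM compound
        frac = fraction_modified(conc=conc, t=600, **params)
        print(f"  [I] = {conc:.0e} M: {frac:.1%} modified after 10 min")
```

Well below K_I the two hypothetical compounds behave almost identically, because k_obs is approximately (k_inact/K_I)[I]. At 1 µM the tight binder is nearly saturated and limited by its slow k_inact, so the fast reactor has a k_obs ten times higher.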

However, focusing on k_inact can also be productive, and Srinivasan argues this is particularly true for challenging targets with shallow pockets where noncovalent affinity is difficult to obtain. As a case in point he discusses covalent KRAS^G12C inhibitors such as sotorasib, which I wrote about here. Just as residues within enzyme active sites stabilize the transition state of a reaction, a lysine residue in KRAS forms a hydrogen bond to the carbonyl of the acrylamide electrophile, thereby increasing its reactivity for the protein.

Srinivasan emphasizes that k_inact is specific for each particular protein-ligand pair as well as distinct from intrinsic or chemical reactivity. This is a critical point. Newcomers to the field often worry that a high k_inact value means a molecule is generically reactive and thus likely to react with many proteins, but this is not necessarily true. For example, sotorasib’s favorable k_inact/K_I is driven by a high k_inact for KRAS^G12C but it is still quite specific. Indeed, Srinivasan points out that even a chemically reactive molecule may not react with a protein if the geometry isn’t right.

A nice way of assessing specific reactivity (which unfortunately is not cited) is the reactivity enhancement factor, or REF, as defined by Alan Armstrong, David Mann, and colleagues at Imperial College London in an (open-access) 2020 ChemBioChem paper. Akin to the k_cat/k_uncat ratio used to assess rate enhancement for enzymes, REF is defined as the rate of reaction for a specific protein divided by the rate of reaction for glutathione, an abundant cellular thiol. The higher the REF score, the higher the specific reactivity for the protein of interest.
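As a rough illustration of the definition above, REF is just a ratio. In this Python sketch the function name and both sets of rate values are hypothetical, and it assumes the two rates are measured under comparable conditions and expressed in the same units.

```python
def reactivity_enhancement_factor(rate_protein, rate_glutathione):
    """REF: rate of reaction with the target protein divided by the rate
    of reaction with glutathione (both in the same units)."""
    if rate_glutathione <= 0:
        raise ValueError("glutathione rate must be positive")
    return rate_protein / rate_glutathione

# Hypothetical compounds that modify the target protein equally fast.
selective = reactivity_enhancement_factor(rate_protein=500.0,
                                          rate_glutathione=0.5)
promiscuous = reactivity_enhancement_factor(rate_protein=500.0,
                                            rate_glutathione=50.0)

print(f"selective compound REF:   {selective:.0f}")
print(f"promiscuous compound REF: {promiscuous:.0f}")
```

Both made-up compounds modify the protein at the same rate, but the first has a REF of 1000 and the second only 10, which is the distinction between specific and intrinsic reactivity.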

Srinivasan also considers tradeoffs between k_inact and K_I as k_inact/K_I approaches the rate of diffusion, suggesting that above 1,000,000 M^-1s^-1 or so any further improvement in affinity will come at the cost of specific reactivity. While this is theoretically interesting, from a practical perspective you can have a perfectly fine drug with a k_inact/K_I of just 10,000 M^-1s^-1.

Covalent drugs will only become more important as we pursue increasingly hard targets that have resisted previous efforts. For these targets in particular, focusing on specific reactivity will be rewarding.

07 July 2025

Fragment events in 2025 and 2026

For better or for worse, 2025 is halfway over. There are still some good conferences coming up, and 2026 is also starting to take shape.

September 21-24: FBLD 2025 will be held in the original Cambridge (UK), where it was supposed to be held in 2020. This will mark the ninth in an illustrious series of conferences organized by scientists for scientists. You can read impressions of FBLD 2024, FBLD 2018, FBLD 2016, FBLD 2014, FBLD 2012, FBLD 2010, and FBLD 2009.

September 22-25: You'll need to make a tough choice: FBLD 2025 or CHI’s Twenty-Third Annual Discovery on Target in Boston. As the name implies this event is more target-focused than chemistry-focused, but there are always plenty of FBDD-related talks. You can read my impressions of the 2024 meeting, the 2023 meeting, the 2022 meeting, the 2021 meeting, the 2020 virtual meeting, the 2019 meeting, and the 2018 meeting.

November 11-13: CHI holds its second Drug Discovery Chemistry Europe in beautiful Barcelona. This will include tracks on lead generation, protein-protein interactions, degraders and glues, and machine learning, with multiple fragment talks throughout.

2026

February 17-19: The Twelfth NovAliX Conference will be held for the first time in San Diego! (Please note the date and location change.) You can read my impressions of the 2018 Boston event here, the 2017 Strasbourg event here, and Teddy's impressions of the 2013 event here, here, and here.

April 13-16: CHI’s Fragment-Based Drug Discovery turns 21, old enough to legally drink in the US! The longest-running annual fragment event returns as always to San Diego. This is part of the larger Drug Discovery Chemistry meeting. You can read impressions of the 2025 meeting, the 2024 meeting, the 2023 meeting, the 2022 meeting, the 2021 virtual meeting, the 2020 virtual meeting, the 2019 meeting, the 2018 meeting, the 2017 meeting, the 2016 meeting; the 2015 meeting here, here, and here; the 2014 meeting here and here; the 2013 meeting here and here; the 2012 meeting; the 2011 meeting; and the 2010 meeting.

September 14-16: RSC-BMCS Tenth Fragment-based Drug Discovery Meeting will be held in Cambridge, UK. You can read my impressions of the 2024 meeting, the 2013 meeting, and the 2009 meeting.

Know of anything else? Please leave a comment or drop me a note.