21 July 2025

How can we house our crystallographic data?

Three years ago we highlighted a growing debate about how and where to house crystallographic fragment data. With the recent surge in high-throughput crystallography, questions of access, accuracy, and capacity have only become more urgent. An open-access perspective in Nat. Commun. by Manfred Weiss (Helmholtz-Zentrum Berlin) and multiple coauthors, including yours truly, calls on the scientific community to make some difficult decisions. Indeed, a session taking place today at the 75th annual meeting of the American Crystallographic Association is devoted to the topic.
 
High-throughput crystallography can involve soaking more than 1000 crystals with fragments, sometimes yielding hundreds of protein-ligand structures. The paper tabulates a dozen synchrotrons around the world with current or planned high-throughput capabilities. We’ve written recently about the XChem facility at the Diamond Light Source, which is currently running about 80 fragment screens per year. Assuming similar productivity at the other synchrotrons, we might soon see 1000 fragment campaigns per year worldwide. If each of these involves 1000 crystals and we get 10% hit rates, that could mean 100,000 new fragment structures annually.
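
For anyone who wants to play with the assumptions, here is a minimal back-of-envelope sketch of that estimate. The inputs are just the round numbers quoted above (a dozen synchrotrons, about 80 screens per year each, 1000 crystals per screen, a 10% hit rate); the function and variable names are made up for illustration, and none of this comes from the paper itself.

```python
# Back-of-envelope estimate of new fragment structures per year.
# All inputs are the round figures quoted in the post, not measured data.

SYNCHROTRONS = 12            # "a dozen" facilities tabulated in the paper
SCREENS_PER_FACILITY = 80    # XChem's current rate, assumed typical elsewhere
CRYSTALS_PER_SCREEN = 1000   # assumed crystals soaked per campaign
HIT_RATE = 0.10              # assumed fraction of crystals yielding a bound fragment


def estimate_structures_per_year(campaigns, crystals_per_campaign, hit_rate):
    """Return the expected number of protein-fragment structures per year."""
    return round(campaigns * crystals_per_campaign * hit_rate)


# 12 facilities x 80 screens = 960 campaigns, which the post rounds up to 1000.
campaigns_exact = SYNCHROTRONS * SCREENS_PER_FACILITY
campaigns_rounded = 1000

if __name__ == "__main__":
    for label, campaigns in [("exact", campaigns_exact), ("rounded", campaigns_rounded)]:
        total = estimate_structures_per_year(campaigns, CRYSTALS_PER_SCREEN, HIT_RATE)
        print(f"{label}: {campaigns} campaigns/year -> {total} structures/year")
```

Halving the hit rate or the number of crystals per screen halves the total, so the estimate is only as good as those two guesses, but it stays in the tens of thousands under most plausible inputs.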
 
That is a big number. For reference, the Protein Data Bank (PDB) currently releases 10,000 new crystal structures each year. (Stephen Burley, Director of the RCSB PDB, is one of the authors of the perspective.)
 
The problem is that, as we discussed in the 2022 blog post, most fragment structures from high-throughput screens are not refined to the level required for the PDB, a process that typically takes a day or two for the researcher and up to 3 hours for a biocurator at the PDB. Moreover, fragments are often identified using PanDDA (Pan-Dataset Density Analysis, which we wrote about here), a method that makes use of the many unbound structures obtained in a dataset. Ideally, these datasets should also be made available.
 
The challenge is balancing practicality with FAIR (Findable, Accessible, Interoperable, and Reusable) principles. The paper outlines four non-exclusive options. Very briefly, these are:
 
Option One: Fully refine and deposit all protein-fragment structures just as with other structures.
 
Option Two: Partially refine structures, and possibly flag or even segregate them from other structures in the PDB.
 
Option Three: Rather than treating each protein-ligand structure independently, treat each high-throughput screen as a single experiment, and archive all of the data in its entirety, including unbound structures. These data could be housed in the PDB or elsewhere.
 
Option Four: A hybrid approach, where fully refined structures would be deposited in the PDB and the rest of the data would be stored in a separate branch of the PDB or elsewhere entirely.
 
There are pros and cons for each option. At the extremes, Option One puts a tremendous burden on experimentalists and the PDB while losing potentially valuable information about unbound structures, whereas Option Three requires setting up new repositories to store vast quantities of data.
 
The paper intentionally avoids making a specific recommendation and instead calls for discussion within the scientific community. Personally, I favor some sort of hybrid approach such as option four. As the paper notes, no one could have foreseen AlphaFold2 when the PDB was launched in 1971. Over the next decade researchers around the world are likely to generate hundreds of thousands of protein-fragment structures. I don’t pretend to know what the artificial intelligence tools of the future will be able to make of such data, but I hope they will have access.
 
What do you think?

1 comment:

Anonymous said...

One of the authors of the paper, Stephen Burley from Rutgers, will address this topic in his presentation at Discovery on Target at CHI's Lead Generation Strategies track: https://www.discoveryontarget.com/lead-generation-strategies -- hopefully Dan Erlanson will be there asking questions from the audience -- should be a good discussion! (Dan is speaking at DOT about his covalent p53 activator work on the Small Molecules for Cancer track the day before).