Three years ago we highlighted a
growing debate about how and where to house crystallographic fragment data. With
the recent surge in high-throughput crystallography, issues including access, accuracy,
and capacity have only become more urgent. An open-access perspective in Nat.
Comm. by Manfred Weiss (Helmholtz-Zentrum Berlin) and multiple coauthors,
including yours truly, calls on the scientific community to make some difficult
decisions. Indeed, a session at the 75th annual meeting of the
American Crystallographic Association, going on today, is devoted to the topic.
High-throughput crystallography
can involve soaking more than 1000 crystals with fragments, sometimes yielding
hundreds of protein-ligand structures. The paper tabulates a dozen synchrotrons
around the world with current or planned high-throughput capabilities. We’ve
written recently about the XChem facility at the Diamond Light Source, which is
currently running about 80 fragment screens per year. Assuming similar productivity
at the other synchrotrons (a dozen facilities at roughly 80 screens each), we might
soon see about 1000 fragment campaigns per year worldwide. If each of these involves
1000 crystals with a 10% hit rate, that could mean 100,000 new fragment structures annually.
That is a big number. For
reference, 10,000 new crystal structures are currently being released by the Protein
Data Bank (PDB) each year. (RCSB PDB Director Stephen Burley is one of
the authors of the perspective.)
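
The back-of-the-envelope estimate above can be written out in a few lines of Python. This is only a sketch using the round figures quoted in this post; treating "a dozen" as exactly 12 facilities, and assuming each one matches XChem's 80 screens per year, are simplifying assumptions, and the variable names are purely illustrative.

```python
# Rough estimate of annual fragment structures, using the post's round numbers.
SYNCHROTRONS = 12                  # "a dozen" facilities, taken as exactly 12 (assumption)
SCREENS_PER_FACILITY = 80          # XChem's current rate, assumed for every facility
CRYSTALS_PER_SCREEN = 1000         # crystals soaked per campaign
HIT_RATE = 0.10                    # fraction of crystals yielding a bound fragment
PDB_CRYSTAL_STRUCTURES_PER_YEAR = 10_000  # reference figure quoted in the post


def estimate_structures(campaigns, crystals_per_screen=CRYSTALS_PER_SCREEN,
                        hit_rate=HIT_RATE):
    """Return the estimated number of fragment structures per year."""
    return round(campaigns * crystals_per_screen * hit_rate)


# 12 facilities x 80 screens = 960 campaigns, which the post rounds to 1000.
campaigns_raw = SYNCHROTRONS * SCREENS_PER_FACILITY
campaigns_rounded = 1000

structures = estimate_structures(campaigns_rounded)
ratio_to_pdb = structures / PDB_CRYSTAL_STRUCTURES_PER_YEAR

print(f"Campaigns per year (unrounded): {campaigns_raw}")
print(f"Estimated fragment structures per year: {structures}")
print(f"Multiple of current annual PDB crystal structures: {ratio_to_pdb:.0f}x")
```

With these inputs the estimate comes out at ten times the current annual output of crystal structures, and it scales linearly with the hit rate: halving it to 5% would still give several times the current figure.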
The problem is that, as we discussed
in the 2022 blog post, most fragment structures from high-throughput screens
are not refined to the level required for the PDB, a process that typically
takes the researcher a day or two and a PDB biocurator up to 3 hours.
Moreover, fragments are often identified using PanDDA (Pan-Dataset Density
Analysis, which we wrote about here), a process which makes use of the many unbound
structures obtained in a dataset. Ideally, these datasets should also be made
available.
The challenge is balancing
practicality with FAIR (Findable, Accessible, Interoperable, and Reusable)
principles. The paper outlines four non-exclusive options. Very briefly, these
are:
Option One: Fully refine and
deposit all protein-fragment structures just as with other structures.
Option Two: Partially refine
structures, and possibly flag or even segregate them from other structures in
the PDB.
Option Three: Rather than
treating each protein-ligand structure independently, treat each high-throughput
screen as a single experiment, and archive all of the data in its entirety,
including unbound structures. These data could be housed in the PDB or
elsewhere.
Option Four: A hybrid
approach, where fully refined structures would be deposited in the PDB and the rest
of the data would be stored in a separate branch of the PDB or elsewhere entirely.
There are pros and cons for each
option. At the extremes, option one puts a tremendous burden on experimentalists
and the PDB, and potentially valuable information regarding unbound structures is
lost, while option three requires setting up new repositories to store vast
quantities of data.
The paper intentionally avoids
making a specific recommendation and instead calls for discussion within the scientific
community. Personally, I favor some sort of
hybrid approach such as option four. As the paper notes, no one could have
foreseen AlphaFold2 when the PDB was launched in 1971. Over the next decade
researchers around the world are likely to generate hundreds of thousands of
protein-fragment structures. I don’t pretend to know what the artificial intelligence
tools of the future will be able to make of such data, but I hope they will
have access.
What do you think?
1 comment:
One of the authors of the paper, Stephen Burley from Rutgers, will address this topic in his presentation at Discovery on Target at CHI's Lead Generation Strategies track: https://www.discoveryontarget.com/lead-generation-strategies -- hopefully Dan Erlanson will be there asking questions from the audience -- should be a good discussion! (Dan is speaking at DOT about his covalent p53 activator work on the Small Molecules for Cancer track the day before).