03 July 2022

What belongs in the Protein Data Bank?

The rise of high-throughput crystallography is among the most exciting recent developments for fragment finding. Historically deemed too slow for primary screening, crystallography was reserved for select hits from an assay cascade. Now crystallographic screens up-front sometimes yield hundreds of hits. Many have been deposited in the Protein Data Bank (PDB). In a recent (open access) Protein Sci. commentary, Mariusz Jaskolski (Mickiewicz University), Bernhard Rupp (Medical University Innsbruck), and collaborators in the US question this practice.
 
In particular, the researchers ask whether molecules processed using Pan-Dataset Density Analysis (PanDDA) belong in the PDB. The method, which we have described previously, is typically used when hundreds of compounds have been soaked into crystals of the same protein. Most molecules will not bind, and these empty structures can be averaged to provide a background map that makes it easier to identify weakly bound ligands that may have only partial occupancy.
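The background-averaging idea can be illustrated with a toy calculation. The sketch below is not the PanDDA code: it treats each "map" as a short list of density values, and the function names, the four-voxel maps, and the 30% occupancy are all assumptions chosen for illustration. It averages the ligand-free maps, scores a new dataset against that background, and then subtracts a fraction of the background (often called the background density correction, here `bdc`) to leave a map dominated by the bound state.

```python
from statistics import mean, stdev


def background_stats(ground_state_maps):
    """Per-voxel mean and sample standard deviation over ligand-free maps."""
    voxels = list(zip(*ground_state_maps))
    return [mean(v) for v in voxels], [stdev(v) for v in voxels]


def z_map(dataset_map, mu, sigma):
    """Per-voxel Z-score of one dataset relative to the background."""
    return [(d - m) / s for d, m, s in zip(dataset_map, mu, sigma)]


def event_map(dataset_map, mu, bdc):
    """Subtract a fraction `bdc` of the mean background from the dataset."""
    return [d - bdc * m for d, m in zip(dataset_map, mu)]


# Hypothetical toy data: three ligand-free maps of four voxels each.
ground = [
    [1.0, 0.5, 0.2, 0.8],
    [1.1, 0.4, 0.3, 0.7],
    [0.9, 0.6, 0.1, 0.9],
]

# Hypothetical soaked crystal: 70% ground state plus 30% bound state,
# where the bound state has extra density (2.2) only at voxel index 2.
dataset = [1.0, 0.5, 0.8, 0.8]

mu, sigma = background_stats(ground)
z = z_map(dataset, mu, sigma)
event = event_map(dataset, mu, bdc=0.7)

# The ligand voxel is the strongest outlier against the background.
ligand_voxel = max(range(len(z)), key=lambda i: z[i])
print(ligand_voxel, round(z[ligand_voxel], 2), [round(e, 2) for e in event])
```

In the raw dataset the ligand voxel (0.8) is no higher than its neighbours, which is roughly why a conventional map can look unconvincing; once the background is removed, that voxel dominates. Real maps are three-dimensional, must be aligned across crystals, and the subtraction fraction has to be estimated rather than assumed.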
 
The researchers seem suspicious of this technique, referring to “supposed ligands” that may “confuse most biomedical researchers” and “degrade the PDB integrity,” the effect of which “could be disastrous.” To support their argument, they provide two examples from the PDB where the atomic models diverge from the electron density calculated using conventional methods and one with wonky statistics.
 
To avoid “contamination of the PDB by suboptimal structures,” the researchers suggest depositing structures from large-scale crystallographic screens in a separate database. Alternatively, they suggest clearer annotation. (To be fair, all three of the examples cited are already prominently marked “PanDDA analysis group deposition.”)
 
Needless to say, this is controversial. In a bioRxiv preprint, Manfred Weiss (Helmholtz-Zentrum Berlin) and collaborators in the US, Germany, Sweden, and the Netherlands, some of whom co-developed PanDDA, take a different view.
 
The researchers agree that group depositions need to be marked clearly, but they argue that such depositions belong squarely in the PDB rather than in a separate repository. Moreover, “commentaries that underestimate the knowledge of PDB users, that ignore the opportunities present in heterogenous crystallographic data, and that miss out on chances for education on structure quality do more harm than good.”
 
The three examples described by Jaskolski and colleagues are re-examined, and while it is true that two of them do show poor occupancy using conventional methods, the ligands are clearly visible when PanDDA is used. (In the third case, there was an error in the resolution cutoff during automated processing, but the data could be successfully reprocessed manually.)
 
PanDDA was developed specifically to identify small, low-occupancy ligands, so the researchers argue that these entries “cannot and should not be treated in the same way” as other ligands. Banning them from the PDB could impede future research.
 
Weiss and colleagues point to the Structural Genomics campaign of the late 1990s and early 2000s, which set out to solve myriad structures of diverse proteins, most of which were not otherwise being studied. At the time some commentators derided this effort as “stamp collecting.” Yet the number and diversity of structures thus deposited into the PDB likely contributed to the success of automated protein folding algorithms such as AlphaFold2.
 
Similarly, including structures from PanDDA processing could lead to unforeseen advances. For example, Weiss and colleagues suggest we may be able to “extract all aspects of conformational as well as of compositional heterogeneity out of all these data sets.” A better understanding of the role of protein dynamics in ligand binding is likely to require thousands of similar datasets of the kind being uploaded.
 
Personally, I believe that scientists should be wary of all published information. As the old saying goes, trust, but verify. As evidenced by my five-part series “Getting misled by crystal structures,” even conventional structures in the PDB should not necessarily be taken at face value. With that precaution, I’ll hold with the conclusion of Weiss and colleagues: “As long as the data is there, let’s embrace it and make it available!”

2 comments:

Anonymous said...

I agree with Dr. Weiss and co. The model should answer biological questions (recognition event and binding pose), and hiding the data will do more harm.

However, there is an alarming mindset here that the end user is responsible for how he uses the data. While this is a conventional truth, last month I was at a conference and talked to comp chem people - almost no one, unless coming from industry with solid structural biology expertise in-house, examines electron density. Absolutely no one examines PanDDA event maps and digs into the details there. The model is taken for granted. It feels like we need a good guideline on how to judge recognition events in MX data, and PanDDA depositions most certainly should be supplemented with a guideline on how to evaluate them. Unfortunately, such guidelines cannot rely on modern MX software environments. The end user simply will not invest time and especially resources into getting familiar with the current solutions and the bulk of excellent but expertise-heavy literature. Sad but true.

Wladek Minor said...

We are pleased to note that Weiss et al. essentially agree with our main postulate (already expressed in the title of our paper) that the results of PanDDA screening should be archived correctly in a dedicated repository rather than being inadequately presented by the current Protein Data Bank (PDB) protocol, which was developed for an entirely different purpose. Our main concern was appropriate storage, dissemination, and interpretation of fragment screening results and not explicit criticism of the method as such.

Mariusz Jaskolski, Bernhard Rupp, Alexander Wlodawer, Zbyszek Dauter