Focus on X-rays: No more data sacrifice
- Published: Jun 1, 2012
- Author: David Bradley
Don't wipe out the noise
More signal less noise
The equivalent of a new pair of spectacles has come in the form of an advanced method for analysing the so-called "discarded" X-ray data from a crystallographic study. The analytical technique could help improve structural studies on biopolymers including proteins and nucleic acids.
Typically, the discarded data include the lower-quality information from the edge regions of an X-ray pattern. These regions are important for understanding the details of the structure but often have too low a signal-to-noise ratio to allow the crystallographer to extract anything useful from the random errors and background noise. Oregon State University biophysicist Andy Karplus and his colleague Kay Diederichs of the University of Konstanz in Germany have now demonstrated that useful information can be extracted even from data that are five times noisier than the previously accepted threshold.
"The criteria that have been used in the past are way too conservative," says Karplus. "The data that people have been throwing out are actually good." The bottom line for crystallography, of course, is resolution and precision in the resulting molecular model. The better the model, the better it will predict the pattern created by the incident X-rays. Karplus and Diederichs suggest that their development may be one of the most important conceptual advances in crystallography in the last two decades because it allows them to retrieve the discarded data, overcome the noise problem and set boundaries on the experiment before the Fourier transform is done on the raw data.
"The question is: Where do we cut it off?" says Karplus. He and his colleagues added back data incrementally to show how the model might be improved at each step without pushing it over the noise limit. The results showed that what was previously discarded can be and should be retained in the structure refinement, Karplus and Diederichs say.
The team explains that R values commonly measure the agreement between observed and calculated data, whereas "Rmerge" values reflect data quality. The researchers have shown that although many crystallographers use Rmerge values to decide where to cut off their data, the numbers do not provide a reliable high-resolution limit, so potentially useful data are wasted. To circumvent this loss, the team's new statistical estimate correlates the observed data with the true, but not measurable, signal. They label this quantity CC* and suggest that it "provides a single statistically valid guide for deciding which data are useful".
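To make the idea concrete: in the Science paper cited at the end of this article, CC* is obtained from CC1/2, the correlation between intensities averaged over two random halves of the measurements, as CC* = sqrt(2 CC1/2 / (1 + CC1/2)). That relation comes from the paper rather than from the description above. The Python sketch below is only an illustration on simulated numbers, not the authors' software. The function names, the exponential distribution of "true" intensities and the noise level are all assumptions made for the example.

```python
import math
import random
from statistics import mean


def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    mx, my = mean(xs), mean(ys)
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    var_x = sum((x - mx) ** 2 for x in xs)
    var_y = sum((y - my) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)


def cc_star(cc_half):
    """Estimate CC* from CC1/2 via CC* = sqrt(2*CC1/2 / (1 + CC1/2))."""
    if not 0.0 <= cc_half <= 1.0:
        raise ValueError("this sketch only handles 0 <= CC1/2 <= 1")
    return math.sqrt(2.0 * cc_half / (1.0 + cc_half))


def simulate(n_reflections=20000, sigma=2.0, seed=0):
    """Toy data set: return (CC1/2, CC*, correlation of merged data with truth)."""
    rng = random.Random(seed)
    # Hypothetical "true" intensities; the exponential distribution is an assumption.
    true_i = [rng.expovariate(1.0) for _ in range(n_reflections)]
    # Two independent half-data-set averages, each with Gaussian noise added.
    half_a = [t + rng.gauss(0.0, sigma) for t in true_i]
    half_b = [t + rng.gauss(0.0, sigma) for t in true_i]
    # The merged data set is the average of the two halves.
    merged = [(a + b) / 2.0 for a, b in zip(half_a, half_b)]
    cc_half = pearson(half_a, half_b)
    return cc_half, cc_star(cc_half), pearson(merged, true_i)


if __name__ == "__main__":
    cc_half, estimate, actual = simulate()
    print(f"CC1/2 = {cc_half:.3f}, CC* = {estimate:.3f}, "
          f"CC with true signal = {actual:.3f}")
```

In this toy case the noise in each half is twice the spread of the true intensities, yet the CC* estimate should track the correlation between the merged data and the true signal. That last value can be computed here only because the simulation knows the truth; in a real experiment it is the unmeasurable quantity that CC* stands in for.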
Data impact factor
"The big impact on the field will be that every structure determined from here on out will be a little more accurate because people won't throw away data that are okay," Karplus explains. "If you have a crummy image of the protein, it will get a little sharper. If you have a good image of the protein, it will also get a little sharper."
"Now that we know that very noisy data are useful, this will presumably enable still further improvements as it stimulates new software development to do a better job of handling such weak data," adds Karplus. The team points out that the same data retention approach will be equally applicable to other physical sciences where information is often discarded for fear of miring the data in noise.
"No one is worried about an additional data storage problem," Karplus told SpectroscopyNOW. "To the contrary, as was written by Phil Evans in the accompanying perspective, the community right now does not generally store the raw data, and this work emphasizes why we need to do that. We need to do it so that when developments like this happen people can go back and reprocess old data to get more value from it... rather than just wishing they could."
In a paper published back to back with the Karplus and Diederichs research in Science, Qun Liu, Tassadite Dahmane, Zhen Zhang, Zahra Assur, Julia Brasch, Lawrence Shapiro, Filippo Mancia, and Wayne Hendrickson explain how averaging data from multiple crystals can give additional useful information for solving the phase problem in X-ray crystallography. Their approach relies on the sulfur atoms already present in the protein, rather than on heavy atoms that would otherwise have to be introduced to solve the structure.
"Linking Crystallographic Model and Data Quality", Science, 2012, 336, 1030-1033; DOI: 10.1126/science.1218231
"Structures from Anomalous Diffraction of Native Biological Macromolecules", Science, 2012, 336, 1033-1037; DOI: 10.1126/science.1218753