EDSTATS

NAME

edstats - Calculates per-residue real-space electron density R factors, correlation coefficients, Z(observed) metrics for the ρobs Fourier map and Z(difference) metrics for the Δρ (difference) Fourier; also computes data for the histogram, and P-P and Q-Q difference plots for the observed and difference Fourier maps.

SYNOPSIS

edstats  MAPIN1 input1.map  MAPIN2 input2.map  XYZIN input.pdb  [HISOUT output.his]  [PPDOUT output.ppd]  [QQDOUT output.qqd]  [MAPOUT1 output1.map]  [MAPOUT2 output2.map]  [OUT output.out]  [XYZOUT output.pdb]

DESCRIPTION

The program EDSTATS calculates real-space electron density R factors, correlation coefficients, Zobs and Zdiff metrics for main- (includes Cβ atom) and side-chain atoms of individual residues and/or atoms.  This integrates and replaces the functionalities of SFALL (MODE ATMMAP ATMMOD/RESMOD options) and OVERLAPMAP (CORRELATE ATOM/RESIDUE options).  In addition it recognises the chain ID and the PDB residue label insertion code (which SFALL ignores!), and so does not require a specification of the residue label mapping for each chain (CHAIN option in SFALL/OVERLAPMAP).

INPUT

The input is in 'namelist' format, i.e. it consists of 'keyword = value' pairs separated by a comma or newline.  The keyword is always case-insensitive and only the first 4 characters are significant.  The value may be a character string, a logical (true or t or false or f) or an integer or real scalar or array.  The RESLO & RESHI values, obtainable from an MTZDUMP summary table for the map coefficient columns (NOT the overall values for the file as given in the MTZ header), are required; all other input values are optional.
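The keyword rules above can be illustrated with a small sketch.  This is not EDSTATS's own parser, only a hypothetical helper (parse_namelist) that mimics the stated conventions: pairs separated by commas or newlines, case-insensitive keywords, and only the first 4 characters significant.  It assumes scalar values and keeps them as strings; array values containing commas are not handled.

```python
def parse_namelist(text):
    """Parse 'keyword = value' pairs separated by commas or newlines.

    Hypothetical illustration only: scalar values are kept as strings,
    and array values (which may contain commas) are not supported.
    """
    options = {}
    for item in text.replace("\n", ",").split(","):
        item = item.strip()
        if not item:
            continue
        keyword, _, value = item.partition("=")
        # Keywords are case-insensitive; only the first 4 characters count.
        key = keyword.strip().lower()[:4]
        options[key] = value.strip()
    return options


opts = parse_namelist("RESLO = 50, reshi=2.1\nmole=AI")
print(opts)
```

Under these rules 'RESLO', 'reslo' and 'resl' all refer to the same keyword.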

Available options:

INPUT FILES

OUTPUT FILES

ANALYSING THE STANDARD OUTPUT: OVERALL METRICS

The default run options will produce two files: the standard output from edstats (e.g. edstats.log), which contains some overall metrics, and the output file (e.g. edstats.out), which contains the table of per-residue metrics (see INTERPRETING THE PER-RESIDUE METRICS).

The supplied Perl script percent-rank.pl extracts a small subset of the overall metrics from the standard output, compares the results with a pre-calculated set in the supplied data file pdb-edstats.out, and for each metric prints out the per-cent rank (i.e. the percentage of structures in the pre-calculated set which have a worse score, so 0% is 'worst' and 100% is 'best').  This is intended to give a quick overview of the state of the difference Fourier and is not meant as a substitute for interpreting the per-residue metrics (see INTERPRETING THE PER-RESIDUE METRICS).  Generally you would probably want your structure to score above average on all measures, so at least above the median 50% rank.  But obviously not every structure can be above average!
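The per-cent rank defined above can be sketched as follows.  This is not the code of percent-rank.pl; the function name, the reference values and the treatment of ties (a tied reference structure is not counted as worse) are assumptions made for illustration.

```python
def percent_rank(value, reference, larger_is_worse=True):
    """Percentage of reference structures with a worse score than `value`.

    0 means every reference structure is at least as good ('worst');
    100 means every reference structure is worse ('best').
    """
    if not reference:
        raise ValueError("reference set is empty")
    if larger_is_worse:
        worse = sum(1 for r in reference if r > value)
    else:
        worse = sum(1 for r in reference if r < value)
    return 100.0 * worse / len(reference)


# Made-up reference scores for a metric where large values are bad.
reference = [1.0, 2.0, 3.0, 4.0]
print(percent_rank(2.5, reference))
```

Here two of the four reference values are larger (worse) than 2.5, so the rank is 50%.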

The data file pdb-edstats.out, or a link to it, must be present in the current directory; alternatively set the environment variable PDB_EDSTATS to point to it.  The data were obtained by running edstats on ~ 600 supposedly 'good' structures (anonymous!) from PDB_REDO with Rfree < 0.175 and > 100 residues (protein only).  This is not ideal, since it would clearly be much better to bin the known structures by high resolution cut-off and compare your structure only with known structures at roughly the same resolution; however this will require a much larger database than I have the resources to set up in the short term.  Hopefully this feature will be developed and improved in a future release.

The columns in the data file pdb-edstats.out contain:

  1. High resolution cut-off.

  2. Resolution-weighted average Biso (e.g. the effective average of Bisos 10 and 100 is not 55 but something much closer to 10, depending on the resolution cut-offs, since the atom with Biso = 100 only contributes significantly to the scattering at low resolution).

  3. Q-Q plot ZD- metric: this gives an overall indication of how much the distribution of all negative difference density in the asymmetric unit deviates from the expected normal distribution for purely random errors.  Significant negative density outliers giving a numerically high Q-Q plot ZD- metric probably indicate wrongly placed atoms, over-restrained B factors, problems with the bulk solvent parameters (e.g. due to low completeness at low resolution), or generally low data completeness.

  4. Q-Q plot ZD+ metric: ditto for all positive difference density.  Low per-cent ranks (large values) for this metric are not as indicative of problems as are low ranks for the Q-Q plot ZD- metric above, because it can be difficult to interpret residual density outliers due to disorder, buffer ions, cryo-protectants and other additives, so uninterpreted (and uninterpretable) density is quite common in deposited structures.  Consequently a high value for the Q-Q plot ZD+ metric does not necessarily indicate a serious problem; it would be better to check the per-residue RSZD+ scores.

  5. Percentage of residue RSZD- metrics numerically above the 3σ threshold (see also next section).

  6. Percentage of residue RSZD+ metrics above the 3σ threshold.
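The remark under column 2, that an atom with Biso = 100 contributes significantly only at low resolution, follows from the standard Debye-Waller attenuation exp(-B / 4d²) of an atom's scattering at resolution d.  The sketch below only tabulates that factor; it does not reproduce the resolution-weighted average computed by EDSTATS, whose exact formula is not given here.

```python
import math


def debye_waller(b_iso, d_spacing):
    """Attenuation of an atom's scattering amplitude: exp(-B / (4 d^2))."""
    return math.exp(-b_iso / (4.0 * d_spacing ** 2))


# Compare B = 10 and B = 100 at low, medium and high resolution (in Angstrom).
for d in (8.0, 4.0, 2.0):
    print(d, debye_waller(10.0, d), debye_waller(100.0, d))
```

At d = 2 Å the B = 100 atom is attenuated to well under 1% of its unattenuated scattering, while the B = 10 atom retains more than half, which is why a plain arithmetic mean of the two Bisos would be misleading.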

The percent-rank.pl script prints out the per-cent ranks for metrics 3-6 above.
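The idea behind the Q-Q plot metrics in columns 3 and 4 can be sketched generically: sort the normalised density values and compare them with the quantiles expected for a standard normal distribution.  The code below is an illustration of that comparison only; the plotting positions (i + 0.5)/n and the 'maximum deviation' summary are assumptions, not the definition of the ZD-/ZD+ metrics used by EDSTATS.

```python
from statistics import NormalDist


def qq_points(z_values):
    """Return (expected, observed) pairs for a normal Q-Q plot."""
    observed = sorted(z_values)
    n = len(observed)
    normal = NormalDist()
    # Plotting positions (i + 0.5) / n: one common convention, assumed here.
    expected = [normal.inv_cdf((i + 0.5) / n) for i in range(n)]
    return list(zip(expected, observed))


def max_deviation(points):
    """Largest absolute departure of the observed from the expected quantile."""
    return max(abs(obs - exp) for exp, obs in points)


# A sample lying exactly on the normal quantiles, then the same sample
# with its most negative value replaced by a strong outlier.
ideal = [NormalDist().inv_cdf((i + 0.5) / 100) for i in range(100)]
with_outlier = [-6.0] + ideal[1:]
print(max_deviation(qq_points(ideal)), max_deviation(qq_points(with_outlier)))
```

For purely random errors the points lie close to the diagonal; a single strongly negative value shows up as a large departure in the lower tail.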

Examples of usage:

percent-rank.pl edstats.log
or
percent-rank.pl *.log

Note that the overall statistics for the RSZO metrics which appear in the standard output are not listed by the percent-rank.pl script.  This is deliberate: the RSZO metric is a measure of precision and is really only meaningful when analysed at the residue level.  For example, it may be that only, say, 50% of the residues score above the threshold of the precision metric, but if these 50% tell you all that you wanted to know about the biological function, then clearly the experiment can be counted as a success (assuming of course that all residues have acceptable scores for the accuracy metrics).  So it all depends on which residues have high values of the precision metric.  On the other hand, if only 50% of residues scored above the threshold for the accuracy metric then this would be regarded as a poor result, no matter which residues they were.

INTERPRETING THE PER-RESIDUE METRICS

For the per-residue metrics listed in the output file (e.g. edstats.out) I have suggested rejection limits of < -3σ and > 3σ for the residue RSZD-/+ metrics respectively, and < 1σ for the residue RSZO metrics, though these may need to be adjusted in the light of experience.
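As a minimal sketch, the suggested limits can be applied to per-residue values like this.  The residue labels and numbers are made up, and the code does not read edstats.out (its column layout is not assumed here); you would supply the three values for each residue yourself.

```python
def flag_residue(rszd_minus, rszd_plus, rszo):
    """Return the metrics for which a residue falls outside the suggested limits."""
    flags = []
    if rszd_minus < -3.0:   # accuracy: significant negative difference density
        flags.append("RSZD-")
    if rszd_plus > 3.0:     # accuracy: significant positive difference density
        flags.append("RSZD+")
    if rszo < 1.0:          # precision: weak observed density
        flags.append("RSZO")
    return flags


# Hypothetical residues: (label, RSZD-, RSZD+, RSZO).
residues = [
    ("A 10", -3.5, 1.2, 2.0),
    ("A 11", -0.5, 0.8, 0.6),
    ("A 12", -1.0, 4.1, 3.0),
]
flagged = {label: flag_residue(zm, zp, zo) for label, zm, zp, zo in residues}
print(flagged)
```

Keeping the accuracy flags (RSZD-/+) separate from the precision flag (RSZO) matters because, as explained in this section, only the former can be addressed by adjusting the model.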

The RSZD scores are accuracy metrics, i.e. at least in theory they can be improved by adjusting the model (by eliminating the obvious difference density), so start by checking the worst offenders.  Use the Fourier and difference maps in your favourite graphics model-building program to guide any adjustments of the model that may be required, in the usual way.  Note that positive density deviations are usually more frequent than negative ones, because they represent uninterpretable, as opposed to incorrectly interpreted, density, and are therefore less symptomatic of underlying problems.

The RSZO scores are precision metrics and will be strongly correlated with the Bisos (since Biso is also a precision metric).  In other words, once you have fixed any issues with the accuracy of a residue, there is nothing more you can do about its precision, short of re-collecting the data.

The RSR and RSCC (both 'sample' and 'population') metrics are tabulated for comparison but are correlated with both accuracy and precision, so they can be useful in some circumstances, but they don't always help with telling you whether adjustment of the model is required, or whether the problem is actually an intrinsic property of the structure, or lies with the data.  Note that the RSR and RSCC metrics vary with the program used, since they depend strongly on the radius cut-off, scaling algorithm and other variables which can vary a lot between programs.

REFERENCES

C-I. Brändén & T.A. Jones Nature (1990). 343, 687-689.
J.D. Gibbons & S. Chakraborti (2003). Nonparametric statistical inference, 4th ed., New York: Marcel Dekker, Inc.
T.A. Jones, J-Y. Zou, S.W. Cowan & M. Kjeldgaard Acta Cryst. (1991). A47, 110-119.
P. Main Acta Cryst. (1979). A35, 779-785.
R.J. Read Acta Cryst. (1986). A42, 140-149.
R.R. Sokal & F.J. Rohlf (1995). Biometry, 3rd ed., New York: WH Freeman.
I.J. Tickle, R.A. Laskowski, & D.S. Moss Acta Cryst. (1998). D54, 243-252.
I.J. Tickle CCP4 Study Weekend (2011). Manuscript of presentation submitted - to be published in Acta Cryst. D.

AUTHOR

EXAMPLES

Example 1

This example illustrates how the maps must be prepared.  Failure to follow this recipe is likely to give inaccurate results!
#!/bin/tcsh
# Fix up the map coefficients: FLABEL specifies the label for Fobs &
# σ(Fobs) (defaults are F/SIGF or FOSC/SIGFOSC).  Here, 'in.mtz'
# is the output reflection file from the refinement program in MTZ
# format.

rm -f fixed.mtz
mtzfix  FLABEL FP  HKLIN in.mtz  HKLOUT fixed.mtz  >mtzfix.log
if($?) exit $?

# Good idea to check the mtzfix output before proceeding!

less mtzfix.log

# If no fix-up was needed, use the original file.

if(! -e fixed.mtz)  ln -s  in.mtz fixed.mtz

# Compute the 2mFo-DFc map; you need to specify the correct labels for
# the F and phi columns: 'FWT' & 'PHWT' should work for Refmac.
# Note that EDSTATS needs only 1 asymmetric unit (but will also work
# with more).  Grid sampling must be at least 4.

echo 'labi F1=FWT PHI=PHWT\nxyzl asu\ngrid samp 4.5'  | fft  \
HKLIN fixed.mtz  MAPOUT fo.map
if($?) exit $?

# Compute the 2(mFo-DFc) map; again you need to specify the right
# labels.

echo 'labi F1=DELFWT PHI=PHDELWT\nxyzl asu\ngrid samp 4.5'  | fft  \
HKLIN fixed.mtz  MAPOUT df.map
if($?) exit $?

Example 2

#!/bin/tcsh
# Q-Q difference plot & main- & side-chain residue statistics.

echo resl=50,resh=2.1  | edstats  XYZIN in.pdb  MAPIN1 fo.map  \
MAPIN2 df.map  QQDOUT q-q.out  OUT stats.out
if($?) exit $?

Example 3

#!/bin/tcsh
# Main- & side-chain atom statistics, using chains A & I only & writing
# PDB file with per-atom Zdiff metrics.
echo mole=AI,resl=50,resh=2.1,main=atom,side=atom  | edstats  \
XYZIN in.pdb  MAPIN1 fo.map  MAPIN2 df.map  XYZOUT out.pdb  \
OUT stats.out
if($?) exit $?