- The real-space R factor (RSR) is defined (Brändén & Jones, 1990; Jones *et al.*, 1991) as:

  $$\mathrm{RSR} = \frac{\sum |\rho_{obs} - \rho_{calc}|}{\sum |\rho_{obs} + \rho_{calc}|}$$

- The real-space correlation coefficient (RSCC) is defined as:

  $$\mathrm{RSCC} = \frac{\mathrm{cov}(\rho_{obs}, \rho_{calc})}{\sqrt{\mathrm{var}(\rho_{obs})\,\mathrm{var}(\rho_{calc})}}$$

  where cov(.,.) and var(.) are the sample covariance and variance (*i.e.* calculated with respect to the sample means of ρ_{obs} and ρ_{calc}).

- EDSTATS computes two real-space correlation coefficients: the 'sample' correlation coefficient defined above, and the 'population' correlation coefficient, *i.e.* with respect to the population (overall) means, which will be zero if the F(000) terms were not included in the map calculation (OVERLAPMAP uses only the sample means). The RSCC based on the population means seems to be better at detecting weak correlations.

- The real-space Z_{obs} metric (RSZO) is defined (Tickle, 2011) as:

  $$\mathrm{RSZO} = \frac{\mathrm{mean}(\rho_{obs})}{\sigma(\Delta\rho)}$$

  where σ(Δρ) is the standard uncertainty of the **difference Fourier** map. Note that this is the standard uncertainty of the 'Fo-Fc' map, NOT the RMS value of the '2Fo-Fc' map, which bears no relationship whatsoever to the uncertainty!

- The real-space Z_{diff} metrics (RSZD- and RSZD+) are defined (Tickle, 2011) as follows for the sets of negative and positive values respectively of Δρ at the grid points that are covered by the group of main- or side-chain atoms under consideration:

  1. Order the values in each set in increasing numerical value (*i.e.* ignoring the sign).

  2. For each of *N* subsets of size 1, 2, ..., *N*-1, *N* of the numerically highest values of the original set of size *N*, compute the cumulative probability of chi-square (χ^{2} = Σ (Δρ/σ(Δρ))^{2}) for the subset. So the subset of size 1 is simply the numerically highest value ('maximum order statistic') in the original set, the subset of size 2 consists of the 2 highest values of the set, the subset of size *N*-1 excludes the lowest value, and the subset of size *N* is just the set itself.

  3. In practice this χ^{2} cumulative probability is very difficult to compute (even by stochastic numerical integration) for subsets other than those of size 1 and *N* (it involves integrals up to dimension *N* where *N* may be anything from 10 to 1000). Note that the standard χ^{2} cumulative probability assumes that the sample is selected randomly, whereas here we are selecting the highest values. Therefore we approximate it as the product of two components: the standard cumulative probability of χ^{2} for a randomly selected subset, and a correction, the Dunn-Šidák correction (Sokal & Rohlf, 1995; Gibbons & Chakraborti, 2003), in this case the cumulative probability of the order statistic, for the fact that we are selecting the highest values.

  4. Take the highest cumulative probability over all subsets, and convert this to the corresponding normal Z-score, making the Z-score negative for the set of negative values; this is the final RSZD- or RSZD+ score.

  The program also computes a combined RSZD score which is simply the maximum of |RSZD-| and RSZD+.

- The real-space Z-scores RSZO, RSZD- and RSZD+ require estimates of the standard uncertainty σ(Δρ) and offset of the 'Fo-Fc' map (the offset arises from omission of the F(000) term, which may differ from zero since the model is not necessarily complete). The recommended procedure is to use as an initial estimate the value of the σ(Δρ) in the map header, with zero as the offset, and then rescale σ(Δρ) and the offset separately for each chain and the bulk solvent. Bulk solvent is assigned the chain ID '%' for this purpose and ordered waters are considered to belong to a chain with ID '0' whatever their actual chain IDs in the PDB file.
- The sample size correction above arises because the greater the sample size the more likely it is that high values will occur purely by chance. This correction takes into account the fact that the number of grid points is not the same for all residues: different residue types contain different numbers of atoms, and different limiting atom radii enclose different numbers of grid points, because the radius varies with atom type and B_{iso}. The correction is therefore necessary to make the metrics comparable between different residues and to be able to apply a common threshold to the metric for all residues. Note that the RSR and RSCC metrics do not apply a sample size correction: it is assumed that all sample points contribute equally to the metrics independent of the sample size.

- The number of grid points referred to above is the number of
statistically independent grid points covering the atoms; this is the actual number of grid points divided by an over-sampling correction factor. According to the Nyquist-Shannon sampling theorem, the grid spacing required for statistical independence is 1/2 the high resolution cut-off (*d*_{min}), so *e.g.* if a grid spacing of *d*_{min}/4 is used then the effective number of grid points is the actual number / 2^{3}.

- The advantage of the real-space Z-scores over the real-space R
factor and correlation coefficient scores (including the 'population' CC metric) is that the former depend purely on model accuracy (RSZD) or model precision (RSZO), whereas RSR and RSCC depend on both (*e.g.* it's obvious from the plots that RSR and RSCC are at least partially correlated with the atomic B_{iso}s); this means that it's impossible to say how much of the observed effect on the metric is due to lack of accuracy and how much to lack of precision.

  Note that model accuracy is related to the likelihood of the model (*i.e.* the consistency of the model with the data), and is what is improved by model building and refinement. The difference Fourier density is obviously a measure of any discrepancy between the model and the data, so is a direct measure of model accuracy.

  Model precision is a property of the crystal and the data (assuming the refinement is done optimally), and is related to data quality and completeness, resolution, atom type (or atomic scattering factor), occupancy and atomic B_{iso}; hence model precision can only be improved by crystallizing in a different crystal form and/or collecting better (*e.g.* more precise and/or higher resolution) data. The ρ_{obs} density, divided by its standard uncertainty (note: this is not the same as RMS(ρ_{obs})), is a measure of model precision which incorporates all the above factors correlated with precision (*e.g.* the atomic B_{iso} is also a precision metric but it doesn't take account of the variation of precision with atom type and occupancy).

- The sums and min/max functions required to compute all residue or
atom metrics are taken over all map grid points within a specified distance of each atom centre. This distance limit is naturally a function of the atom type (*via* the atomic scattering factors computed from the 5-Gaussian approximation table in $CLIBD/atomsf.lib), the atomic B_{iso} values and the resolution limits, as shown in the following table of the distance limit *r*_{max} for an O atom. Values used by SFALL are also shown for comparison: note that the latter depend only on B_{iso} and are independent of atom type and resolution:

  | *r*_{max}/Å for | *d*_{min}/Å | B/Å^{2} = 10 | 20 | 30 | 40 | 50 | 60 | 70 | 80 | 90 |
  |---|---|---|---|---|---|---|---|---|---|---|
  | SFALL: all atoms | All | 2.35 | 2.67 | 2.95 | 3.21 | 3.45 | 3.67 | 3.88 | 4.08 | 4.27 |
  | EDSTATS: O atom | 3.5 | 1.72 | 1.78 | 1.83 | 1.89 | 1.95 | 2.02 | 2.08 | 2.15 | 2.22 |
  | EDSTATS: O atom | 3.0 | 1.51 | 1.58 | 1.65 | 1.72 | 1.80 | 1.88 | 1.97 | 2.06 | 2.14 |
  | EDSTATS: O atom | 2.5 | 1.31 | 1.39 | 1.49 | 1.59 | 1.70 | 1.80 | 1.91 | 2.02 | 2.12 |
  | EDSTATS: O atom | 2.0 | 1.12 | 1.24 | 1.38 | 1.52 | 1.66 | 1.79 | 1.91 | 2.02 | 2.13 |
  | EDSTATS: O atom | 1.5 | 0.96 | 1.16 | 1.35 | 1.52 | 1.66 | 1.79 | 1.91 | 2.02 | 2.13 |
  | EDSTATS: O atom | <=1.0 | 0.91 | 1.16 | 1.35 | 1.52 | 1.66 | 1.79 | 1.91 | 2.02 | 2.13 |

  Note that the limiting high-resolution values of *r*_{max} are attained at ~ *d*_{min} = 1.5 Å.

- The resolution-dependent distance limit is computed by first
performing an analytical truncated Fourier transform of the atomic scattering factor *f*(*s*) to obtain the equation for the calculated electron density ρ(*r*) for data between specified resolution cut-offs, at distance *r* from the atom centre:

  $$\rho(r) = \mathrm{FT}(f(s)) = \frac{8}{r} \int_{s_{min}}^{s_{max}} f(s)\, \exp(-B s^{2})\, \sin(4\pi r s)\, s\, \mathrm{d}s$$

  where the integration limits are the values *s*_{min} and *s*_{max} of sin(θ)/λ at the resolution cut-offs.

  Then the ratio of the radius integral of ρ(*r*) integrated out to the outer limit *r*_{max} relative to the radius integral integrated to infinite distance is:

  $$\text{Radius integral ratio} = \frac{\int_{0}^{r_{max}} \rho(r)\, \mathrm{d}r}{\int_{0}^{\infty} \rho(r)\, \mathrm{d}r}$$

  The distance limit is the value of *r*_{max} for a radius integral ratio = 0.95 (i.e. 95% of the integral lies within distance *r*_{max} of the atom centre). The integrals with respect to *r* can be obtained analytically; the integrals with respect to *s* in general have no analytical solution and must be computed numerically (using *e.g.* the QUADPACK library). Note that ideally the volume integral of ρ(*r*):

  $$\text{Volume integral} = \int_{0}^{r_{max}} \rho(r)\, \mathrm{d}V = 4\pi \int_{0}^{r_{max}} \rho(r)\, r^{2}\, \mathrm{d}r$$

- For the RSZO metric EDSTATS uses the ρ_{obs} map with Fourier coefficients 2mF_{o}-DF_{c} for acentric reflections or mF_{o} for centrics (Main, 1979; Read, 1986); for the RSZD metric it uses the Δρ (difference Fourier) map with Fourier coefficients 2(mF_{o}-DF_{c}) for acentrics, or mF_{o}-DF_{c} for centrics. For the RSR and RSCC metrics it uses the ρ_{obs} and ρ_{calc} maps.

  However, for the latter, because we cannot rely on the correct Fourier coefficient for ρ_{calc} being present in the file of map coefficients, it is necessary to obtain it as the difference between the ρ_{obs} and Δρ coefficients. Since we have:

  $$\Delta\rho = \rho_{obs} - \rho_{calc}$$

  or:

  $$\rho_{calc} = \rho_{obs} - \Delta\rho$$

  therefore for acentrics:

  $$\rho_{calc} = F(2mF_{o} - DF_{c}) - F(2(mF_{o} - DF_{c})) = F(DF_{c})$$

  whereas for centrics:

  $$\rho_{calc} = F(mF_{o}) - F(mF_{o} - DF_{c}) = F(DF_{c})$$

  Hence the correct Fourier coefficient for ρ_{calc} is DF_{c} for all reflections. Note that it is frequently stated that the difference map coefficient for acentrics is mF_{o}-DF_{c}, but if this were used it would give completely the wrong result for the ρ_{calc} coefficient (it would give mF_{o}!).

- EDSTATS also has options to output data for the histogram, 'P-P
difference' and 'Q-Q difference' plots of the difference Fourier and observed Fourier maps. Note that the 'P-P difference' and 'Q-Q difference' plots are functionally identical to the standard 'P-P plot' (probability-probability) and 'Q-Q plot' (quantile-quantile: 'quantile' is just another name for 'normalised deviate' or 'Z-score'). The distinction is purely one of presentation: whereas the standard 'P-P' or 'Q-Q' plot plots *x vs. y*, where *x* and *y* are respectively the normal expected and observed probabilities or quantiles, the 'P-P difference' or 'Q-Q difference' plot plots *x vs. y-x*.
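The relationship between a Q-Q plot and a 'Q-Q difference' plot can be sketched in a few lines of Python. This is only an illustration of the *x vs. y-x* presentation described above, not the EDSTATS code: the function name `qq_difference`, the plotting position (*i* - 0.5)/*n* used for the expected quantiles, and the sample values are all assumptions made for the example.

```python
from statistics import NormalDist


def qq_difference(z_values):
    """Return (expected, observed - expected) pairs for a Q-Q difference plot.

    z_values are normalised densities, e.g. delta-rho / sigma(delta-rho).
    """
    n = len(z_values)
    std_normal = NormalDist()  # mean 0, standard deviation 1
    points = []
    for i, z_obs in enumerate(sorted(z_values), start=1):
        # Expected normal quantile for the i-th ordered value; the
        # (i - 0.5) / n plotting position is an assumed convention.
        z_exp = std_normal.inv_cdf((i - 0.5) / n)
        points.append((z_exp, z_obs - z_exp))
    return points


if __name__ == "__main__":
    # Hypothetical sample: exact normal quantiles with one deep trough.
    n = 9
    sample = [NormalDist().inv_cdf((i - 0.5) / n) for i in range(1, n + 1)]
    sample[0] = -6.0
    for z_exp, diff in qq_difference(sample):
        print(f"{z_exp:8.3f} {diff:8.3f}")
```

A perfectly normal sample gives differences of zero throughout (the line *y* = 0), while the artificial trough shows up as a large negative difference in the lower tail.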

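As a minimal sketch of the RSR, RSCC and RSZO definitions given at the start of this description, the Python functions below evaluate them for a short list of density values. This is not the EDSTATS implementation: the function names and the sample numbers are hypothetical, the density values are assumed to have been selected already at the grid points covering the atoms, and the 'population' coefficient is obtained here by assuming overall means of zero (no F(000) term in the maps).

```python
import math


def rsr(rho_obs, rho_calc):
    """Real-space R factor: sum|obs - calc| / sum|obs + calc|."""
    num = sum(abs(o - c) for o, c in zip(rho_obs, rho_calc))
    den = sum(abs(o + c) for o, c in zip(rho_obs, rho_calc))
    return num / den


def rscc(rho_obs, rho_calc, mean_obs=None, mean_calc=None):
    """Correlation coefficient about the given means.

    With no means supplied the sample means are used ('sample' RSCC);
    passing the overall map means gives the 'population' variant.
    """
    n = len(rho_obs)
    if mean_obs is None:
        mean_obs = sum(rho_obs) / n
    if mean_calc is None:
        mean_calc = sum(rho_calc) / n
    cov = sum((o - mean_obs) * (c - mean_calc)
              for o, c in zip(rho_obs, rho_calc))
    var_obs = sum((o - mean_obs) ** 2 for o in rho_obs)
    var_calc = sum((c - mean_calc) ** 2 for c in rho_calc)
    return cov / math.sqrt(var_obs * var_calc)


def rszo(rho_obs, sigma_delta_rho):
    """RSZO = mean(rho_obs) / sigma(delta-rho)."""
    return (sum(rho_obs) / len(rho_obs)) / sigma_delta_rho


def effective_grid_points(n_actual, sampling):
    """Over-sampling correction: sampling = d_min / grid spacing.

    A spacing of d_min/2 counts as independent, so sampling = 4
    divides the actual number of points by 2**3.
    """
    return n_actual / (sampling / 2.0) ** 3


if __name__ == "__main__":
    rho_obs = [1.0, 2.0, 3.0, 4.0]    # hypothetical values
    rho_calc = [1.5, 2.5, 2.5, 3.5]   # hypothetical values
    print("RSR            ", round(rsr(rho_obs, rho_calc), 3))
    print("RSCC sample    ", round(rscc(rho_obs, rho_calc), 3))
    print("RSCC population", round(rscc(rho_obs, rho_calc, 0.0, 0.0), 3))
    print("RSZO           ", round(rszo(rho_obs, 0.25), 3))
    print("N effective    ", effective_grid_points(80, 4))
```

Because the same normalisation appears in the covariance and both variances, it cancels in the ratio, so the sketch uses plain sums of squares and products.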
Available options:

- MAIN = *string*

  Optional specification of type of averaging used to compute main-chain (including Cβ atom) R factors and correlation coefficients (both types), where *string* is either RESI (default) or ATOM (both case-insensitive):

  RESI averages all map values for the main-chain atoms in each residue.

  ATOM averages the map values for each atom, but reports the extreme values of these as the residue metrics.

  This option has no effect on the real-space Z-scores, which are as defined in the DESCRIPTION section above.

- SIDE = *string*

  Same as MAIN, but for side-chains.

- MOLE = *string*

  Optional concatenated list of chain IDs defining the molecule for which metrics are to be calculated (default is to use all atoms). Chain IDs are case-sensitive.

- RESC = *string*

  Optional specification of type of rescaling of σ(Δρ) by Q-Q plot required: *string* may be ALL, BULK, CHAIN (default) or NONE (all are case-insensitive).

  ALL rescales using a single scale factor and offset based on all map points in the asymmetric unit.

  BULK rescales using a single scale factor and offset based only on points in the bulk solvent.

  CHAIN independently rescales each chain and the bulk solvent with a separate scale factor and offset for each group (ordered waters are treated as belonging to a single separate chain '0' regardless of their chain IDs in the PDB file). This is now the recommended procedure.

  NONE does no Q-Q plot rescaling; the value of σ(Δρ) read from the map header is used, with zero for the offset.

- RESLO = *real*

  **Required** low resolution cut-off used in map calculation.

- RESHI = *real*

  **Required** high resolution cut-off used in map calculation.

- THR1 = *real*

  Optional σ cut-off threshold for Fo map: default is no cut-off.

- THR2 = *real*

  Optional σ cut-off threshold for ΔF map: default is no cut-off.

- TEST = *integer*

  Debug flag used for testing and obsolete options: sum of debug option values as follows:

  | LS-bit | Value | Output |
  |---|---|---|
  | 0 | 1 | General debugging. |
  | 1 | 2 | P-P & Q-Q difference plots for chains. |
  | 2 | 4 | Memory allocation debugging. |
  | 3 | 8 | ZSCORE s/r debugging for RSZD values. |
  | 4 | 16 | RSZD outliers. |
  | 5 | 32 | Cumulative frequencies for RSZDs > 3 σ. |
  | 6 | 64 | Normality tests. |

- USEFO = *logical*

  A value TRUE indicates that the density histogram and Q-Q difference plots should use the Fo density instead of the ΔF density. This is only intended for demonstration purposes: the Fo density is not useful in the calculation of the RSZD metrics so with this option set, the program will stop after doing the Q-Q plot calculations.
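Since the TEST value is a sum of the debug option values listed above, each option corresponds to one bit of the integer. The short Python sketch below shows how such a sum decomposes; the dictionary simply restates the documented table, and the function name `decode_test_flag` is hypothetical (it is not part of EDSTATS).

```python
# Debug option values as documented for the TEST keyword.
DEBUG_OPTIONS = {
    1: "General debugging",
    2: "P-P & Q-Q difference plots for chains",
    4: "Memory allocation debugging",
    8: "ZSCORE s/r debugging for RSZD values",
    16: "RSZD outliers",
    32: "Cumulative frequencies for RSZDs > 3 sigma",
    64: "Normality tests",
}


def decode_test_flag(value):
    """List the debug options selected by a TEST value (bitwise test)."""
    return [label for bit, label in DEBUG_OPTIONS.items() if value & bit]


if __name__ == "__main__":
    # 18 = 2 + 16 selects the chain plots and the RSZD outlier listing.
    for label in decode_test_flag(18):
        print(label)
```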

- XYZIN - Co-ordinate file in PDB format.
- MAPIN1 - Input 2mFo-DFc map in CCP4 format.
- MAPIN2 - Input 2(mFo-DFc) map in CCP4 format: this must contain
the same header info as MAPIN1.
Both maps should be calculated with a grid spacing between 1/4 and 1/6 of the high resolution cut-off (usually 1/4 is sufficient), and the PDB file and the maps should all be from the same refinement job.

**NOTE**: it is essential that the MTZ file from the refinement job is run through the MTZFIX program before map calculation with FFT to ensure that the map coefficients are correct and consistent between programs (unfortunately different refinement programs have different conventions for the map coefficients!).

- HISOUT - Optional output file for the histogram of map values, containing 2 data columns: the observed Z-score and the observed frequency. This is used for visualising the deviations of the observed distribution of either Δρ/σ(Δρ) (if USEFO = f), or of ρ_{obs}/σ(Δρ) (USEFO = t), from the theoretical normal distribution. A normal distribution would give the Gaussian curve *y* = exp(-.5*x*^{2}) / √(2π), so deviations from this indicate deviations from normality. However a histogram does not show up outliers nearly as clearly as the P-P and Q-Q plots (see below), so is really only suitable for demonstration purposes. The output is readily visualised using a plotting program such as gnuplot, *e.g.*:

      > gnuplot
      Terminal type set to 'x11'
      gnuplot> plot 'edstats.his' w l, exp(-.5*x**2)/sqrt(2*pi)

- PPDOUT - Optional output file for the 'P-P difference' plot, containing 2 data columns: the cumulative probability for the normal distribution, and the difference (inverse normal cumulative probability of the observed quantile - normal probability). The output is readily visualised using gnuplot (see example for Q-Q difference plot below). The P-P plot is not as informative as the Q-Q plot, and generally is only used for test purposes.
- QQDOUT - Optional output file for the 'Q-Q difference' plot, containing 2 data columns: the expected quantile (or Z-score) for the normal distribution, and the difference (observed quantile - normal expected quantile). This is used for visualising the deviations of the observed distribution of either Δρ/σ(Δρ) (if USEFO = f), or of ρ_{obs}/σ(Δρ) (USEFO = t) from the normal distribution. A normal distribution would give the straight line *y* = 0, so deviations from this line indicate outliers, *i.e.* deviations from normality (note that the Q-Q plot does *not* show deviations from zero density, but rather deviations from the normal, or other assumed, distribution). The numerically highest outliers will be in the 'tails', *i.e.* the negative outliers are the troughs in Δρ/σ(Δρ) or ρ_{obs}/σ(Δρ) and the positive outliers are the peaks. The output is readily visualised (with the 'normal' *y* = 0 line) using gnuplot, *e.g.*:

      > gnuplot
      Terminal type set to 'x11'
      gnuplot> plot 'edstats.qqd' w l, 0 lt 0

- OUT - Optional output file for table of per-residue metrics suitable for plotting with *e.g.* gnuplot. If no output file is specified the data go to standard output. The columns in this table are:

  1. Residue 3-letter code.
  2. Chain ID.
  3. Residue label (including insertion code if present).
  4. Weighted average *B*_{iso} for main-chain atoms in residue (including Cβ). This is weighted according to the contribution of the atoms to the total scattering in the resolution range specified (Tickle *et al.*, 1998).
  5. Number of statistically independent grid points covered by main-chain atoms.
  6. Real-space R factor (RSR) for the main-chain atoms in the residue.
  7. Real-space correlation coefficient (RSCC).
  8. Real-space 'population' correlation coefficient.
  9. Real-space Z_{obs} metric (RSZO).
  10. Real-space Z_{diff} metric (RSZD); this is simply the maximum value of |RSZD-| and RSZD+.
  11. Real-space Z_{diff} metric for negative differences (RSZD-).
  12. Real-space Z_{diff} metric for positive differences (RSZD+).

  Columns 13-21 contain the same information as columns 4-12 above (*i.e.* add 9), but for the side-chain atoms (excluding Cβ) if present.

  To plot the RSZD- and RSZD+ metrics (in columns 11 & 12) by residue for the main-chain atoms with the suggested threshold lines at ±3σ, using gnuplot:

      > gnuplot
      Terminal type set to 'x11'
      gnuplot> set style data impulses
      gnuplot> plot 'edstats.out' u 11, '' u 12, -3 lt 0, 3 lt 0

  Similarly use columns 20 & 21 to plot the side-chain values. See separate section below on interpreting these plots.

- MAPOUT1 - Optional rescaled and normalised 2mFo-DFc map, *i.e.* a map of ρ_{obs}/σ(Δρ) where σ(Δρ) may vary between grid points.
- MAPOUT2 - Optional rescaled and normalised 2(mFo-DFc) map, *i.e.* a map of Δρ/σ(Δρ) where σ(Δρ) may vary between grid points.
- XYZOUT - Optional co-ordinate file in PDB format; if given, only the molecule(s) selected are output and the occupancy column (character columns 55-60) is overwritten with the per-atom |RSZD-| metric.
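As an alternative to plotting, the per-residue table written to OUT can be screened for residues beyond the suggested ±3σ thresholds with a short script. The Python sketch below is a rough illustration only: it assumes the table is whitespace-delimited with the column order listed above, that residue labels contain no embedded spaces, and that any line whose RSZD columns are not numeric (for instance a header line, if one is present) can be skipped. The function name `flag_outliers` and the sample lines are hypothetical.

```python
def flag_outliers(lines, threshold=3.0, first_col=11):
    """Return residues whose RSZD- or RSZD+ lies beyond the threshold.

    first_col is the 1-based column of RSZD-: 11 for main-chain atoms,
    20 for side-chain atoms; RSZD+ is taken from the next column.
    """
    flagged = []
    for line in lines:
        fields = line.split()
        if len(fields) < first_col + 1:
            continue  # too few columns (e.g. no side-chain data)
        try:
            zd_minus = float(fields[first_col - 1])
            zd_plus = float(fields[first_col])
        except ValueError:
            continue  # non-numeric entry, e.g. an assumed header line
        # RSZD- is reported as a negative Z-score, RSZD+ as a positive one.
        if zd_minus < -threshold or zd_plus > threshold:
            flagged.append((fields[0], fields[1], fields[2], zd_minus, zd_plus))
    return flagged


if __name__ == "__main__":
    # Hypothetical main-chain rows with 12 columns each.
    table = [
        "ALA A 10 15.2 12.0 0.08 0.95 0.97 4.1 1.2 -1.2 0.8",
        "LYS A 11 40.1 11.5 0.21 0.80 0.85 1.3 3.6 -3.6 2.1",
    ]
    for row in flag_outliers(table):
        print(row)
```

In practice the lines would be read from the OUT file, e.g. `flag_outliers(open('stats.out'))`, and `first_col=20` would select the side-chain values.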

The supplied Perl script percent-rank.pl extracts a small subset of the overall metrics from the standard output, compares the results with a pre-calculated set in the supplied data file pdb-edstats.out, and for each metric prints out the per-cent rank (*i.e.* the percentage of structures in the pre-calculated set which have a worse score, so 0% is 'worst' and 100% is 'best'). This is intended to give a quick overview of the state of the difference Fourier and is not meant as a substitute for interpreting the per-residue metrics (see next section). Generally you would probably want your structure to score above average on all measures, *i.e.* at least above the median (50% rank). But obviously not every structure can be above average!
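The per-cent rank described above amounts to counting how many reference structures score worse than yours. The Python sketch below illustrates the idea only; it is not a translation of percent-rank.pl. It assumes that a numerically lower value is the better score (as for the percentage of RSZD outliers), that ties are not counted as worse, and it uses made-up reference values.

```python
def percent_rank(value, reference, lower_is_better=True):
    """Percentage of reference scores that are worse than value.

    0 means every reference structure is at least as good ('worst'),
    100 means every reference structure is worse ('best').
    """
    if lower_is_better:
        worse = sum(1 for r in reference if r > value)
    else:
        worse = sum(1 for r in reference if r < value)
    return 100.0 * worse / len(reference)


if __name__ == "__main__":
    # Hypothetical percentages of residues with RSZD- beyond 3 sigma.
    reference = [1.0, 2.0, 3.0, 4.0, 5.0]
    print(percent_rank(2.0, reference))
```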

The data file pdb-edstats.out, or a link to it, must be present in the current directory; alternatively set the environment variable PDB_EDSTATS to point to it. The data were obtained by running edstats on ~ 600 supposedly 'good' structures (anonymous!) from PDB_REDO with Rfree < 0.175 and > 100 residues (protein only). This is not ideal, since it would clearly be much better to bin the known structures by high resolution cut-off and compare your structure only with known structures at roughly the same resolution; however this will require a much larger database than I have the resources to set up in the short term. Hopefully this feature will be developed and improved in a future release.

The columns in the data file pdb-edstats.out contain:

1. High resolution cut-off.
2. Resolution-weighted average B_{iso} (*e.g.* the effective average of B_{iso}s 10 and 100 is not 55 but something much closer to 10, depending on the resolution cut-offs, since the atom with B_{iso} = 100 only contributes significantly to the scattering at low resolution).
3. Q-Q plot ZD- metric: this gives an overall indication of how much the distribution of all negative difference density in the asymmetric unit deviates from the expected normal distribution for purely random errors. Significant negative density outliers giving a high numerical Q-Q plot ZD- metric probably indicate wrongly placed atoms, over-restrained B factors, problems with the bulk solvent parameters (*e.g.* due to low completeness at low resolution), or generally low data completeness.
4. Q-Q plot ZD+ metric: ditto for all positive difference density. Low per-cent ranks (large values) for this metric are not as indicative of problems as are low ranks for the Q-Q plot ZD- metric above, because it can be difficult to interpret residual density outliers due to disorder, buffer ions, cryo-protectants and other additives, so uninterpreted (and uninterpretable) density is quite common in deposited structures. Consequently a high value for the Q-Q plot ZD+ metric does not necessarily indicate a serious problem; it would be better to check the per-residue RSZD+ scores.
5. Percentage of residue RSZD- metrics numerically above the 3σ threshold (see also next section).
6. Percentage of residue RSZD+ metrics above the 3σ threshold.

The percent-rank.pl script prints out the per-cent ranks for metrics 3-6 above.

Examples of usage:

    percent-rank.pl edstats.log

or

    percent-rank.pl *.log

Note that the overall statistics for the RSZO metrics which appear in the standard output are not listed by the percent-rank.pl script; this is deliberate: the RSZO metric is a measure of precision and is really only meaningful when analysed at the residue level. For example it may be that only say 50% of the residues score above the threshold of the precision metric, but if these 50% tell you all that you wanted to know about the biological function, then clearly the experiment can be counted as a success (assuming of course that all residues have acceptable scores for the accuracy metrics). So it all depends on which residues have high values of the precision metric. On the other hand, if only 50% of residues scored above the threshold for the accuracy metric then this would be regarded as a poor result, no matter which residues they were.

The RSZD scores are accuracy metrics, *i.e.* at least in theory
they can be improved by adjusting the model (by eliminating the obvious
difference density), so start by checking the worst offenders
first. Use the Fourier and difference maps in your favourite
graphics model-building program to guide any adjustments of the model
that may be required, in the usual way. Note that positive density
deviations are usually more frequent than negative ones, because they
represent uninterpretable, as opposed to incorrectly interpreted
density, and are therefore less symptomatic of underlying problems.

The RSZO scores are precision metrics and will be strongly correlated
with the B_{iso}s (since that is also a precision metric),
*i.e.* assuming you've fixed any issues with accuracy of that
residue there's nothing you can do about the precision, short of
re-collecting the data.

The RSR and RSCC (both 'sample' and 'population') metrics are tabulated for comparison but are correlated with both accuracy and precision, so they can be useful in some circumstances, but they don't always help with telling you whether adjustment of the model is required, or whether the problem is actually an intrinsic property of the structure, or lies with the data. Note that the RSR and RSCC metrics vary with the program used, since they depend strongly on the radius cut-off, scaling algorithm and other variables which can vary a lot between programs.

J.D. Gibbons & S. Chakraborti (2003). Nonparametric statistical inference, 4th ed., New York: Marcel Dekker, Inc.

T.A. Jones, J-Y. Zou, S.W. Cowan & M. Kjeldgaard (1991).

P. Main (1979).

R.J. Read (1986).

R.R. Sokal & F.J. Rohlf (1995). Biometry, 3rd ed., New York: WH Freeman.

I.J. Tickle, R.A. Laskowski & D.S. Moss (1998).

I.J. Tickle (2011).

    #!/bin/tcsh
    # Fix up the map coefficients: FLABEL specifies the label for Fobs &
    # σ(Fobs) (defaults are F/SIGF or FOSC/SIGFOSC).  Here, 'in.mtz'
    # is the output reflection file from the refinement program in MTZ
    # format.
    rm -f fixed.mtz
    mtzfix FLABEL FP HKLIN in.mtz HKLOUT fixed.mtz >mtzfix.log
    if($?) exit $?

    # Good idea to check the mtzfix output before proceeding!
    less mtzfix.log

    # If no fix-up was needed, use the original file.
    if(! -e fixed.mtz) ln -s in.mtz fixed.mtz

    # Compute the 2mFo-DFc map; you need to specify the correct labels for
    # the F and phi columns: 'FWT' & 'PHWT' should work for Refmac.
    # Note that EDSTATS needs only 1 asymmetric unit (but will also work
    # with more).  Grid sampling must be at least 4.
    echo 'labi F1=FWT PHI=PHWT\nxyzl asu\ngrid samp 4.5' | fft \
    HKLIN fixed.mtz MAPOUT fo.map
    if($?) exit $?

    # Compute the 2(mFo-DFc) map; again you need to specify the right
    # labels.
    echo 'labi F1=DELFWT PHI=PHDELWT\nxyzl asu\ngrid samp 4.5' | fft \
    HKLIN fixed.mtz MAPOUT df.map
    if($?) exit $?

    #!/bin/tcsh
    # Q-Q difference plot & main- & side-chain residue statistics.
    echo resl=50,resh=2.1 | edstats XYZIN in.pdb MAPIN1 fo.map \
    MAPIN2 df.map QQDOUT q-q.out OUT stats.out
    if($?) exit $?

    #!/bin/tcsh
    # Main- & side-chain atom statistics, using chains A & I only & writing
    # PDB file with per-atom Z_{diff} metrics.
    echo mole=AI,resl=50,resh=2.1,main=atom,side=atom | edstats \
    XYZIN in.pdb MAPIN1 fo.map MAPIN2 df.map XYZOUT out.pdb \
    OUT stats.out
    if($?) exit $?