Processing Data Using autoPROC (Single and Multiple Crystals)

Running autoPROC at SSRL

autoPROC is very useful for processing diffraction data and for merging multiple sweeps or datasets collected from multiple crystals.

To run autoPROC, log onto one of the SSRL processing computers (pxproc## machines) either locally or through NoMachine.

autoPROC generates many files with generic filenames, such as "process.log" and "aimless.log". These files will be overwritten if autoPROC is run again from the same directory. We therefore recommend running each instance of autoPROC in its own subdirectory with an easily identifiable name.
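As a concrete sketch of this layout, per-crystal processing subdirectories can be created with a short loop. The base path and crystal names C4-C6 below are placeholders, not part of the official procedure:

```shell
# Create one processing subdirectory per crystal so that autoPROC log files
# (process.log, aimless.log, ...) from different runs never overwrite each other.
# The base path is a stand-in for your own data directory.
base=/tmp/demo_data/username/thaumatin
for crystal in C4 C5 C6; do
    mkdir -p "$base/$crystal/autoproc_processing"
done
ls -d "$base"/*/autoproc_processing
```

Each autoPROC run is then started from inside its own subdirectory, keeping the output files separated.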

Processing a Single Crystal Dataset

If the raw images are, for example, in /data/username/thaumatin/C4, first create a subfolder for processing and change into it:

mkdir /data/username/thaumatin/C4/autoproc_processing
cd /data/username/thaumatin/C4/autoproc_processing

autoPROC runs third-party programs such as xds and aimless in an iterative fashion, optimizing CC1/2 and other parameters such as I/sig(I). Use the "process" command with the input-directory keyword -Id to run the software:

process -Id /data/username/thaumatin/C4

To process only a subset of images, restrict processing to a specified image range. The filename template, including its numerical numbering format, must be given explicitly. In this example, images 5-40 are used as input (note that strings containing commas or spaces must be quoted):

process -Id "/data/username/thaumatin/C4/filename_1_#####.cbf,5,40"

In most cases autoPROC will process the data without issues and will return acceptable results.

To evaluate the results, inspect the output in the terminal window or processing directory. For a more in-depth summary of the processing results and detailed warnings and error messages, view the "summary.html" file in a web browser:

firefox summary.html

A major advantage of autoPROC is that in addition to the standard analysis, it also performs an anisotropic scattering analysis (staraniso). The analysis fits an ellipsoid to the scattering data and reports three resolution limits corresponding to the three ellipsoid axes. When the data are anisotropic, this analysis provides much improved electron density maps compared to the standard analysis. The anisotropic results are also reported in the summary.html file.

Multi-crystal Processing

Multi-crystal datasets must first be processed individually, mainly to obtain the integration results, and then scaled together using aP_scale. Filtering out certain datasets before scaling may be critical for obtaining reasonable statistics, since the datasets may vary in resolution, mosaicity, Rpim, CC1/2 and other properties that can negatively affect the merging statistics.

The overall procedure is to 1) run autoPROC on each individual dataset, 2) identify and exclude outlier datasets, and 3) run aP_scale on the remaining datasets.

1. Initial Integration Jobs using autoPROC

If the space group and cell are known, they should be included when processing the datasets to improve the success rate. However, even without prior knowledge, we have successfully indexed and integrated datasets with as few as 2 images (1 degree phi rotation each) using autoPROC and a few special keywords.

In order to keep a consistent cell and setting among datasets, a reference dataset should be used.

Below are examples of running autoPROC using the command "process": a) when no crystal information is known, b) when using a reference dataset, c) when the space group and unit cell are known, and d) when ice prevents successful indexing.

a) Initial processing command line to run on each dataset with no prior information:

This example uses the first 10 images of each dataset.

process -Id "/data/username/thaumatin/C4/filename_#####.cbf,1,10" -d /data/username/thaumatin/C4/autoproc_processing -M LowResOrTricky XdsSpotSearchNumRanges=1

process -Id "/data/username/thaumatin/C5/filename_#####.cbf,1,10" -d /data/username/thaumatin/C5/autoproc_processing -M LowResOrTricky XdsSpotSearchNumRanges=1

process -Id "/data/username/thaumatin/C6/filename_#####.cbf,1,10" -d /data/username/thaumatin/C6/autoproc_processing -M LowResOrTricky XdsSpotSearchNumRanges=1
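Since the three commands above differ only in the crystal directory, they can be generated with a loop. The sketch below only echoes the commands as a dry run; remove the echo (or pipe the output to sh) to actually execute them:

```shell
# Dry run: print one "process" command per crystal directory.
# Crystal names C4-C6 and the data path are placeholders from the examples above.
cmds=$(for crystal in C4 C5 C6; do
    echo "process -Id \"/data/username/thaumatin/$crystal/filename_#####.cbf,1,10\"" \
         "-d /data/username/thaumatin/$crystal/autoproc_processing" \
         "-M LowResOrTricky XdsSpotSearchNumRanges=1"
done)
echo "$cmds"
```

This keeps the per-dataset commands consistent and makes it easy to rerun a subset later (for example, only the datasets that failed to index).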

Important: Once a few datasets have successfully indexed and yield the same unit cell and point group symmetry, pick one of the successful datasets to use as a reference dataset and rerun processing.

b) Processing command line with a reference dataset:

The reference mtz file can be from one of the multi-crystal datasets or from an external dataset.

process -Id "/data/username/thaumatin/C4/filename_#####.cbf,1,10" -d /data/username/thaumatin/C4/autoproc_processing -M LowResOrTricky XdsSpotSearchNumRanges=1 -ref /data/username/thaumatin/C5/autoproc_processing/autoproc_processing_alldata-unique.mtz

process -Id "/data/username/thaumatin/C6/filename_#####.cbf,1,10" -d /data/username/thaumatin/C6/autoproc_processing -M LowResOrTricky XdsSpotSearchNumRanges=1 -ref /data/username/thaumatin/C5/autoproc_processing/autoproc_processing_alldata-unique.mtz

process -Id "/data/username/thaumatin/C7/filename_#####.cbf,1,10" -d /data/username/thaumatin/C7/autoproc_processing -M LowResOrTricky XdsSpotSearchNumRanges=1 -ref /data/username/thaumatin/C5/autoproc_processing/autoproc_processing_alldata-unique.mtz

c) Command line when the space group and unit cell are known but a reference dataset is not readily available:

process -Id "/data/username/thaumatin/C4/filename_#####.cbf,1,10" -d /data/username/thaumatin/C4/autoproc_processing -M LowResOrTricky XdsSpotSearchNumRanges=1 symm="P212121" cell="34.24 35.11 44.35 90 90 90"

d) Command line when ice rings prevent indexing:

Indexing could fail for some or many of the datasets. If the main cause is ice rings, a large number of the failures may be recovered:

process -Id "/data/username/thaumatin/C4/filename_#####.cbf,1,10" -d /data/username/thaumatin/C4/autoproc_processing -M LowResOrTricky XdsSpotSearchNumRanges=1 XdsExcludeIceRingsAutomatically=all

Important: Rerun processing using this keyword, but only for the datasets that failed to index the first time around.

2. Selecting Datasets to Scale

Inspect the successful integration output. The parameters that should be considered when choosing which datasets to exclude for scaling are:

  • Space Group Symmetry - reject datasets with the wrong space group
  • Unit Cell - variation of cell parameters should be 1% or less
  • Resolution - outliers will have much lower resolution
  • Mosaicity - outliers will have much higher mosaicities
  • Estimated Maximum I/sig(I) - outliers will be significantly smaller
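This triage can be done mechanically. In the sketch below, the per-dataset numbers are invented for illustration; in practice they would be read from each dataset's summary.html. The filter keeps datasets whose a-axis is within 1% of the first (reference) dataset and whose mosaicity is not an obvious outlier:

```shell
# Hypothetical per-dataset statistics: name, a-axis (Angstrom), resolution, mosaicity.
# Values are made up; collect the real ones from each summary.html.
stats='C4 57.8 1.90 0.12
C5 57.9 1.95 0.15
C6 60.1 3.40 0.55'
# Keep datasets within 1% cell deviation of the first line and with
# mosaicity below an assumed cutoff of 0.3.
keep=$(echo "$stats" | awk '
    NR == 1 { ref = $2 }
    { dev = ($2 - ref) / ref; if (dev < 0) dev = -dev }
    dev <= 0.01 && $4 < 0.3 { print $1 }')
echo "$keep"
```

Here C6 would be excluded: its cell deviates by about 4% and its mosaicity and resolution are far worse than the others.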

3. Running aP_scale

Once questionable datasets are filtered out, run aP_scale listing each hkl file for the datasets that should be included, or alternatively, use all the integrated datasets and filter out outliers after the initial scaling:

aP_scale -hkl /path/to/dataset1/XDS_ASCII.HKL -hkl /path/to/dataset2/XDS_ASCII.HKL -hkl /path/to/dataset3/XDS_ASCII.HKL -hkl /path/to/dataset4/XDS_ASCII.HKL -hkl /path/to/dataset5/XDS_ASCII.HKL EnsureConsistentIndexing=yes
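With many datasets, typing each -hkl flag by hand is error-prone. A sketch that builds the argument list from a directory list (the paths are placeholders, matching the example above) and echoes the final command as a dry run:

```shell
# Build the aP_scale -hkl argument list from a list of dataset directories.
# Remove the echo to actually run aP_scale.
args=""
for d in /path/to/dataset1 /path/to/dataset2 /path/to/dataset3; do
    args="$args -hkl $d/XDS_ASCII.HKL"
done
cmd="aP_scale$args EnsureConsistentIndexing=yes"
echo "$cmd"
```

Because the order of -hkl flags matters (the first file is the reference dataset), the directory list should be written with the intended reference first.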

Check the scaling output statistics to make sure Rpim and CC1/2 are reasonable.

To improve Rpim and CC1/2, try removing outlier datasets. Use the following key scaling output parameters that are calculated for each individual dataset in comparison to the total dataset:

  • R-rank - below 100
  • CC-rank - within 0.7-1.0
  • Scale-factor - within factors of 4*ave to ave/4
  • B-factor - within -30 to +25
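These cutoffs can be applied mechanically. In the sketch below, the per-dataset numbers are invented for illustration, and the thresholds are the ones listed above; real values come from the aP_scale output:

```shell
# Hypothetical per-dataset scaling output: name, R-rank, CC-rank,
# scale-factor, B-factor. Numbers are made up for illustration.
ave=1.0   # assumed average scale factor for this example
stats='C4 45 0.95 1.1 -5
C5 80 0.82 0.6 10
C6 140 0.40 6.2 40'
keep=$(echo "$stats" | awk -v ave="$ave" '
    $2 < 100 &&
    $3 >= 0.7 && $3 <= 1.0 &&
    $4 <= 4 * ave && $4 >= ave / 4 &&
    $5 >= -30 && $5 <= 25 { print $1 }')
echo "$keep"
```

Here C6 would be dropped from the next aP_scale round, since it fails every one of the four criteria.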

Important: When the number of datasets falls below a certain threshold, proper scaling between datasets cannot be achieved, particularly for datasets with few reflections. The result is that Rpim will hit a minimum and start to increase, and CC1/2 will hit a maximum and begin to decrease. For datasets spanning only a few degrees or with low reflection coverage, many datasets are required for proper scaling between them. The proper parameter to minimize in this specific case is "uniqueness" - ideally, all reflections should be measured more than once, preferably across several datasets (i.e., individual datasets should not contribute many unique reflections to the overall dataset).

The space group and/or unit cell can also be specified for scaling using specific keywords; otherwise a weighted average is determined:

aP_scale -hkl /path/to/dataset1/XDS_ASCII.HKL -hkl /path/to/dataset2/XDS_ASCII.HKL -hkl /path/to/dataset3/XDS_ASCII.HKL -hkl /path/to/dataset4/XDS_ASCII.HKL -hkl /path/to/dataset5/XDS_ASCII.HKL EnsureConsistentIndexing=yes symm="P43212" cell="30 40 50 90 90 90"

aP_scale uses a reference dataset and normalizes the individual parameters relative to it (for example, the scale-factor is 1 for the reference dataset). The reference dataset is always the first hkl file listed. To assign a different dataset as the reference (for example, dataset4), list it first in the aP_scale command line:

aP_scale -hkl /path/to/dataset4/XDS_ASCII.HKL -hkl /path/to/dataset1/XDS_ASCII.HKL -hkl /path/to/dataset2/XDS_ASCII.HKL -hkl /path/to/dataset3/XDS_ASCII.HKL -hkl /path/to/dataset5/XDS_ASCII.HKL EnsureConsistentIndexing=yes