Automated Data Processing
Overview
The following sections describe automated data processing on SSRL beamlines using the SSRL and XIA2 workflows (the XIA2 workflow is still in development). Whenever a run of at least five consecutive diffraction images from a single crystal is collected using the Blu-ice data control software, the data are automatically processed to provide quick feedback. Auto-indexing and integration are carried out with XDS. The data are then analyzed with POINTLESS and XTRIAGE, scaled and merged with AIMLESS, and converted to amplitudes with CTRUNCATE. A free R set is also generated using 5% of the reflections. All programs use default settings.
In general, new diffraction data should be collected into a sub-directory of the /data/your_username directory (where your_username is your Unix account username). For more information about the directory structure of the SSRL crystallography computing environment, see the SSRL computing manual.
The SSRL Automated Processing Pipeline
Workflow
The results of the automated analysis of diffraction data collected within a user-specified folder are written to the sub-folder ./imagedirectory/auto-processing-imageprefix_run#_random7digit#/
Each time the processing software runs a crystallography program, it creates a new directory and adds an increasing step number to the directory name. Each new directory contains a "run.csh" file, which holds the actual command used to run the program for that step, so the user can see what the inputs were and which options were used. The increasing step number prevents previous steps from being overwritten (preserving what has happened before) and also allows multiple steps to be run simultaneously when it makes sense to do so. In summary, the stepXX_ prefix on each directory shows the order in which the programs were run, and the run.csh file inside shows how each was run.
To illustrate, the first two steps of the pipeline are 'xds' followed by 'pointless', visualized like this:
xds ──> pointless
These two steps generate two subdirectories in the root of the processing directory:
> ls
step1_xds
step2_pointless
If the processing software determines that it needs to re-run a certain step, the software will generate more directories, increasing the step number each time. For example, if pointless disagrees with the spacegroup or cell determined by XDS, the workflow will rerun xds (forcing the spacegroup and cell). In this case, the first two steps of the pipeline become a total of four steps, and the workflow will look like this:
xds ──> pointless ──> xds (forced spacegroup and cell) ──> pointless
This will generate subdirectories like this:
> ls
step1_xds
step2_pointless
step3_xds
step4_pointless
Note: In the scenario above, a user who wants to understand why this happened can compare 'step1_xds/run.csh' against 'step3_xds/run.csh' to see that the spacegroup and cell have been forced in step3_xds.
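The comparison described in the note can be made with the standard diff utility. The following is a minimal sketch, assuming the example directory names shown above (step1_xds and step3_xds); the helper name compare_steps is hypothetical and is not part of the pipeline.

```shell
#!/bin/sh
# compare_steps is a hypothetical helper, not part of the SSRL pipeline.
# It prints a unified diff of the run.csh scripts in two step directories.
compare_steps() {
    first="$1"
    second="$2"
    for dir in "$first" "$second"; do
        if [ ! -f "$dir/run.csh" ]; then
            echo "no run.csh in $dir" >&2
            return 2
        fi
    done
    # diff exits with 0 when the scripts are identical and 1 when they differ.
    diff -u "$first/run.csh" "$second/run.csh"
}
```

From the root of the processing directory, this would be called as: compare_steps step1_xds step3_xds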
After xds and pointless, the pipeline runs aimless, xtriage, truncate, and freeR as a group, twice in parallel. This can be thought of as a branch in the workflow, and each branch gets its own subdirectory. In one branch of the workflow, there is no resolution cutoff. In the second branch, a cutoff of I/sigI = 1.5 is used. The processing program creates a new directory for each of these branches (increasing the step number on the directory names as described above). Within the two new subdirectories, the step count starts over at 1.
The complete workflow can be visualized as follows:
xds ──> pointless ──┬─[no cutoff]──────────> aimless ──> xtriage ──> truncate ──> freeR
                    │
                    └─[I/sigI=1.5 cutoff]──> aimless ──> xtriage ──> truncate ──> freeR
which may generate directories like this:
> ls
step1_xds
step2_pointless
step3_noCutoff
step4_isig1p5
Investigating the noCutoff directory shows the first branch of the workflow (notice that the step number has been reset to 1 for this branch, or sub-workflow).
> ls step3_noCutoff
step1_aimless_noCutoff
step2_truncate
step3_freeR
Similarly, within the parallel isig1p5 directory, the steps will also start from step1 to avoid confusion.
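To see the whole sequence at a glance, including the steps inside each branch, the directories can be listed in numeric step order. The following is a minimal sketch, assuming directory names of the form stepN_name as in the examples above and no whitespace in the names; the helper name list_steps is hypothetical.

```shell
#!/bin/sh
# list_steps is a hypothetical helper, not part of the SSRL pipeline.
# It prints the step directories of a processing directory in numeric step
# order and indents the steps found inside each branch directory.
list_steps() {
    (
        cd "${1:-.}" || exit 1
        # Splitting on "p" leaves the step number at the start of field 2,
        # so a numeric sort places step10 after step9 rather than after step1.
        for step in $(ls -d step[0-9]*_* 2>/dev/null | sort -t p -k 2n); do
            [ -d "$step" ] || continue
            echo "$step"
            # Branch directories hold their own step1_, step2_, ... sequence.
            ( cd "$step" && ls -d step[0-9]*_* 2>/dev/null | sort -t p -k 2n | sed 's/^/    /' )
        done
    )
}
```

Called with no argument, list_steps reports on the current directory; a path to an auto-processing directory can be given instead.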
Software run by the workflow
The following is a summary of the programs run by this workflow (note that the step number is not shown here).
xds - contains files generated by the program XDS. The input file XDS.INP supplies the default parameters to the program and is based upon information stored in the header of the diffraction images (detector type and distance, oscillation start and range, number of images in the data set, etc.). The important output files from XDS are IDXREF.LP (the results of the automated indexing, which gives the unit cell parameters and a first indication of the crystal symmetry), INTEGRATE.LP (the full log of the integration), and CORRECT.LP (which gives an indication of the data quality and resolution). The data output file is XDS_ASCII.HKL.
pointless - POINTLESS converts the output from XDS into mtz format, analyzes the data for twinning and symmetry, and identifies the correct space group. The directory contains three files: run.csh (a script which can be used to re-run POINTLESS), out.mtz (the data output in mtz format), and stdout.txt (a log file from the program).
aimless - The program AIMLESS takes the output from POINTLESS, calculates scale factors between all of the images in the data set, applies the scales, and merges all the reflection data together to give an output file containing one copy of each reflection (the unique data set). The important files in the directory are run.csh (a script which can be used to re-run AIMLESS), out.mtz (the scaled and merged unique data set in mtz format), and stdout.txt (a log file from the program). The log file is important to read, as it gives in-depth statistics on the scaling and merging and on the quality of the data. A summary table at the end of the file can be used to assess the quality of the data; it includes most of the statistics used in "Table 1" of your publication. The out.mtz file has the general format H K L I sigI (an intensity and error estimate for each reflection). There may also be information on any anomalous data (generally as I+ and I-).
xtriage - The program XTRIAGE generates statistics on the unique data set and also analyzes the data for twinning. The directory contains three files: run.csh (a script which can be used to re-run XTRIAGE), stdout.txt (a log file from the program), and xtriage.log (the same information as stdout.txt but with additional tables of statistics). This program does not output a data set; it is an analysis step only.
truncate - The program TRUNCATE reads the output from AIMLESS, attempts to put the data onto an absolute scale, and generates structure factor amplitudes (F) from the reflection intensities (I). The directory contains three files: run.csh (a script which can be used to re-run TRUNCATE), stdout.txt (a log file from the program), and out.mtz (the data output in mtz format). The out.mtz file typically contains H K L I sigI F sigF and may contain the anomalous data as I+ I- F+ F- and dano (the calculated differences between F+ and F-).
freeR - The final step involves the assignment of 5% of the unique data as free R reflections. These reflections are not used in subsequent refinement steps except as an independent gauge of how the refinement is progressing. The directory contains three files, run.csh (a script which can be used to re-run the freeR calculation), stdout.txt (a log file from the program), and out.mtz (the data output in mtz format). This is your final processed data set and can be used for structure solution steps and structure refinement.
Additional Output
The main directory also contains two other sub-directories:
../_metadata - contains some general files from the automated processing run, including a stdout.txt which is a comprehensive log file with all of the log files from each step concatenated together.
../plots - contains two files, ztest.out and ltest.out, which are the tabulated results of the twinning tests.
How to review these files: We strongly recommend that you read the log files from XDS and AIMLESS carefully to make sure that automated data processing has run correctly, rather than taking the output mtz files at face value. The scaling and merging statistics in the summary table at the end of the AIMLESS log file provide an estimate of the resolution limit of the data, which may then be applied in manual reprocessing. Note that the noCutoff branch of the automated processing applies no resolution or I/sigI cutoff.
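The summary table can be pulled out of the AIMLESS log without scrolling through the whole file. The following is a minimal sketch; it assumes that the summary begins at a line containing the text "Summary data for", which should be checked against your own stdout.txt, and the helper name aimless_summary is hypothetical.

```shell
#!/bin/sh
# aimless_summary is a hypothetical helper, not part of the SSRL pipeline.
# Assumption: the summary table starts at a line containing "Summary data for".
aimless_summary() {
    log="${1:-stdout.txt}"
    if [ ! -r "$log" ]; then
        echo "cannot read $log" >&2
        return 2
    fi
    # Print from the first matching line through the end of the file.
    sed -n '/Summary data for/,$p' "$log"
}
```

Run inside an aimless step directory, aimless_summary reads stdout.txt by default; a different log path can be given as the first argument. If nothing is printed, the marker text differs in your version of the log and the pattern needs adjusting.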
If the auto-processing fails for any reason, an Error Log is generated which lists fatal errors that occurred during execution of a program or script. It can be used to trace the root cause of a problem during processing.
How to re-do data processing
In each of the processing directories, there is a run.csh file that can be executed to re-run the selected process. You are welcome to change the input files in order to reprocess the data. However, this will only run the selected program; to follow the full progression of programs, you would need to run each one individually. If you wish to reprocess the data from scratch, it might be easier to do this with the software available on the SSRL px-proc machines (XDS, HKL3000 or DIALS). If you wish to know more, please consult the SSRL data processing environment manual.
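When experimenting with changed inputs, it can help to work on a copy of a step directory so that the original automated results stay untouched. The following is a minimal sketch; the helper name copy_step and the directory names in the usage note are hypothetical, and it assumes that the paths inside run.csh remain valid when the script is run from a sibling directory, which should be checked by reading the script first.

```shell
#!/bin/sh
# copy_step is a hypothetical helper, not part of the SSRL pipeline.
# It copies a step directory to a new name and refuses to overwrite anything.
copy_step() {
    src="$1"
    dest="$2"
    if [ ! -f "$src/run.csh" ]; then
        echo "no run.csh in $src" >&2
        return 2
    fi
    if [ -e "$dest" ]; then
        echo "$dest already exists; choose a new name" >&2
        return 3
    fi
    cp -r "$src" "$dest"
}
```

For example, after copy_step stepN_aimless manual_aimless (where stepN_aimless stands for the actual step directory), edit manual_aimless/run.csh as needed and then run it from inside the copy with: csh run.csh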
Automated Data Processing using XIA2
Workflow
Data processing based on a XIA2 pipeline is also available. The pipeline is currently in development, and the programs, input parameters, etc. are subject to change. This pipeline will be more robust and configurable than the SSRL pipeline and will also be used for multi-crystal processing when fully released. Processing starts automatically upon completion of a data collection run and requires no user input, but it runs in a customized configuration. While xia2 provides several options for indexing, integration, scaling, and merging, this workflow uses the XDS algorithm for indexing and integration, the Aimless algorithm for scaling and merging, and CTruncate to convert intensities to structure factors.
Software run by the workflow
xia2 - the data processing front end
xia2.multiplex - multi-crystal processing optimization algorithm (to be implemented in a future release)
dials - a collection of data processing back-end programs, which include implementations of XDS and Aimless algorithms.
Output
The auto-processing directories (symbolic links to the actual directories) can be found in the directory containing the raw images. The directories are named autoprocessing_xia2_{workspace_code}.
This folder contains:
xia2.txt - the overall log for this automated processing run.
xia2.html - a result summary in webpage format, which can be viewed by issuing firefox xia2.html.
./DataFiles/ - a subfolder with files in MTZ and SCA format for various stages of data processing, including {PROJECT_ID}_{CRYSTAL_ID}_free.mtz, which is the file with integrated intensities, structure factors, and the Rfree flag for selected reflections.
./LogFiles/ - a folder that contains the logs for each processing stage.
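The final reflection files can be located directly from the image directory. The following is a minimal sketch based on the naming described above (autoprocessing_xia2_{workspace_code} and a file ending in _free.mtz inside DataFiles); the helper name find_free_mtz is hypothetical.

```shell
#!/bin/sh
# find_free_mtz is a hypothetical helper, not part of the xia2 workflow.
# It prints every *_free.mtz file found under the xia2 auto-processing
# directories (symbolic links are followed) of the given image directory.
find_free_mtz() {
    image_dir="${1:-.}"
    for mtz in "$image_dir"/autoprocessing_xia2_*/DataFiles/*_free.mtz; do
        # An unmatched pattern is left as literal text, so test that it exists.
        if [ -e "$mtz" ]; then
            echo "$mtz"
        fi
    done
    return 0
}
```

Called with no argument from the directory containing the raw images, it prints one path per processing run, or nothing if no run has finished.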
How to re-do data processing
- In the processing directory, go one level up and create a new processing folder:
cd ../
mkdir new_xia2_processing
cd new_xia2_processing
Critical - if you do not use a new directory, the original processing results will be overwritten!
- Copy the run.sh file from the original processing directory:
cp ../original_xia2_processing/run.sh .
- Open run.sh in a text editor and edit the xia2 launch string:
geany run.sh
- Launch run.sh from the new processing directory:
./run.sh
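The directory setup in the first two steps can be wrapped so that an existing directory is never reused by accident. The following is a minimal sketch; the helper name new_xia2_dir is hypothetical, and the directory names in the usage note are the placeholders used in the steps above.

```shell
#!/bin/sh
# new_xia2_dir is a hypothetical helper, not part of the xia2 workflow.
# It creates a fresh processing directory and copies run.sh into it.
new_xia2_dir() {
    orig="$1"
    new="$2"
    if [ ! -f "$orig/run.sh" ]; then
        echo "no run.sh in $orig" >&2
        return 2
    fi
    # mkdir without -p fails if the directory already exists, which
    # protects earlier processing results from being overwritten.
    mkdir "$new" || return 3
    cp "$orig/run.sh" "$new/"
}
```

From one level above the original processing directory, this would be called as: new_xia2_dir original_xia2_processing new_xia2_processing. Editing and launching run.sh then proceed as described in the last two steps.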
More information on the software supported by the SSRL-SMB Macromolecular Crystallography division is available on our software webpage.