Automated Data Processing


Overview

The following sections describe automated data processing on SSRL beamlines using the SSRL and XIA2 workflows (the XIA2 workflow is still in development). Whenever a run of at least five consecutive diffraction images from a single crystal is collected using the Blu-ice data control software, the data are processed automatically to provide quick feedback. Auto-indexing and integration are carried out with XDS. The data are analyzed with POINTLESS and XTRIAGE, scaled and merged with AIMLESS, and converted to amplitudes with CTRUNCATE. A free R set is also generated using 5% of the reflections. All programs use default settings.

In general, new diffraction data should be collected into a sub-directory of the /data/your_username directory (where your_username is your unix account username). For more information about the directory structure of the SSRL crystallography computing environment, see the SSRL computing manual.


The SSRL Automated Processing Pipeline

Workflow

The results of the automated analysis of diffraction data collected within a user-specified folder are placed in the sub-folder ./imagedirectory/auto-processing-imageprefix_run#_random7digit#/

Each time the processing software runs a crystallography program, it creates a new directory with an increasing step number in its name. Each new directory contains a "run.csh" file, which holds the actual command used to run the program for that step, so the user can see the inputs to the step and the options used. The increasing step number prevents earlier steps from being overwritten (preserving what has happened before) and also allows multiple steps to run simultaneously when it makes sense to do so. In summary, the stepXX_ prefix on each directory shows the order in which the programs were run, and the run.csh file inside shows how each one was run.

To illustrate, the first two steps of the pipeline are 'xds', followed by 'pointless', visualized like this:

xds ──> pointless

These simple steps will generate two subdirectories in the root of the processing directory:

> ls
  step1_xds
  step2_pointless

If the processing software determines that it needs to re-run a certain step, the software will generate more directories, increasing the step number each time. For example, if pointless disagrees with the spacegroup or cell determined by XDS, the workflow will rerun xds (forcing the spacegroup and cell). In this case, the first two steps of the pipeline become a total of four steps, and the workflow will look like this:

xds ──> pointless ──> xds (forced spacegroup and cell) ──> pointless

This will generate subdirectories like this:

> ls
  step1_xds
  step2_pointless
  step3_xds
  step4_pointless

Note: In the scenario above, a user who wants to understand why this happened can compare 'step1_xds/run.csh' with 'step3_xds/run.csh' to see that the spacegroup and cell have been forced in step3_xds.
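
The comparison described in the note can be made with the standard diff utility. The following is only a minimal sketch: the function name compare_steps is hypothetical, and it assumes that both step directories contain a run.csh file as described above.

```shell
# compare_steps FIRST SECOND: show how the command used for one step
# differs from the command used for another.
# Hypothetical helper; assumes each step directory holds a run.csh file.
compare_steps() {
    first="$1"
    second="$2"
    for dir in "$first" "$second"; do
        if [ ! -f "$dir/run.csh" ]; then
            echo "no run.csh in $dir" >&2
            return 2
        fi
    done
    # diff exits with status 1 when the files differ; that is expected here.
    diff "$first/run.csh" "$second/run.csh"
}
```

Run from the processing directory, "compare_steps step1_xds step3_xds" would print only the lines that changed between the two XDS runs. An exit status of 0 means the two files are identical.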

After xds and pointless, the pipeline runs aimless, xtriage, truncate, and freeR as a group, twice in parallel. This can be thought of as a branch in the workflow, and each branch gets its own subdirectory. In one branch of the workflow, there is no resolution cutoff. In the second branch, a cutoff of I/sigI = 1.5 is used. The processing program will create a new directory for each of these branches (increasing the step number in the directory names as described above). Within the two new subdirectories, the step count starts over at 1.

The complete workflow can be visualized as follows:

xds ──> pointless ──┬─[no cutoff]──────────> aimless ──> xtriage ──> truncate ──> freeR
                    │
                    └─[I/sigI=1.5 cutoff]──> aimless ──> xtriage ──> truncate ──> freeR

which may generate directories like this:

>ls
  step1_xds
  step2_pointless
  step3_noCutoff
  step4_isig1p5

Investigating the noCutoff directory will show the first branch of the workflow (notice that the step number has returned to 1 for this branch, or sub-workflow).

>ls step3_noCutoff
  step1_aimless_noCutoff
  step2_truncate
  step3_freeR

Similarly, within the parallel isig1p5 directory, the steps will also start from step1 to avoid confusion.
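
Because the step numbers restart inside each branch, it can help to print the whole directory tree in step order. The following is a minimal sketch, assuming the stepN_name directory convention described above; the function names ordered_steps and show_workflow are hypothetical.

```shell
# ordered_steps DIR: print the names of the stepN_* directories directly
# inside DIR, sorted by step number (so step10 follows step9, not step1).
ordered_steps() {
    for d in "$1"/step[0-9]*_*/; do
        [ -d "$d" ] || continue          # the glob matched nothing
        name=$(basename "$d")
        num=${name#step}                 # strip the "step" prefix
        num=${num%%_*}                   # keep the digits before the first "_"
        printf '%s %s\n' "$num" "$name"
    done | sort -n | cut -d' ' -f2-
}

# show_workflow DIR: print the top-level steps and, indented beneath each,
# any steps found inside it (the branch sub-workflows).
show_workflow() {
    top="${1:-.}"
    ordered_steps "$top" | while read -r step; do
        printf '%s\n' "$step"
        ordered_steps "$top/$step" | sed 's/^/    /'
    done
}
```

Calling "show_workflow ." from the processing directory lists the top-level steps in numeric order, with the steps of each branch indented under the branch directory.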


Software run by the workflow

The following is a summary of the programs run by this workflow (note that the step number is not shown here).

  • xds - contains files generated by the program XDS. The input file XDS.INP supplies the default parameters to the program and is based upon information stored in the header of the diffraction images (detector type and distance, oscillation start and range, number of images in the data set, etc.). The important output files from XDS are IDXREF.LP (the results of the automated indexing to find the unit cell parameters and an idea of what the crystal symmetry is), INTEGRATE.LP (the full log of the processing), and CORRECT.LP (which gives an indication of the data quality and resolution). The data output file is XDS_ASCII.HKL.

  • pointless - POINTLESS is a program which converts the output from XDS into mtz format and then analyzes the data for twinning and symmetry, and then identifies the correct space group. The directory contains three files, run.csh (a script which can be used to re-run POINTLESS), out.mtz (the data output in mtz format) and stdout.txt (a log file from the program).

  • aimless - The program AIMLESS takes the output from POINTLESS, calculates scale factors between all of the images in the data set, applies the scales, and merges all the reflection data together to give an output file containing one copy of each reflection (the unique data set). The important files in the directory are run.csh (a script which can be used to re-run AIMLESS), out.mtz (the scaled and merged unique data set in mtz format), and stdout.txt (a log file from the program). The log file is important to read, as it gives in-depth statistics on the scaling and merging and on the quality of the data. A summary table at the end of the file can be used to assess the quality of the data; it includes most of the statistics used in "Table 1" of your publication. The out.mtz file has the general format H K L I sigI (an intensity and error estimate for each reflection). There may also be information on any anomalous data (generally as I+ and I-).

  • xtriage - The program XTRIAGE generates statistics on the unique data set and also analyzes the data for twinning. The directory contains three files, run.csh (a script which can be used to re-run XTRIAGE), stdout.txt (a log file from the program), and xtriage.log (the same information as stdout.txt but with additional tables of statistics). This program does not output a data set; it is an analysis step only.

  • truncate - The program TRUNCATE reads the output from AIMLESS, attempts to put the data onto an absolute scale, and generates structure factor amplitudes (F) from the reflection intensities (I). The directory contains three files, run.csh (a script which can be used to re-run TRUNCATE), stdout.txt (a log file from the program), and out.mtz (the data output in mtz format). The out.mtz file typically contains H K L I sigI F sigF and may contain the anomalous data as I+ I- F+ F- and dano (the calculated differences between F+ and F-).

  • freeR - The final step involves the assignment of 5% of the unique data as free R reflections. These reflections are not used in subsequent refinement steps except as an independent gauge of how the refinement is progressing. The directory contains three files, run.csh (a script which can be used to re-run the freeR calculation), stdout.txt (a log file from the program), and out.mtz (the data output in mtz format). This is your final processed data set and can be used for structure solution steps and structure refinement.
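
Since each branch of the workflow ends with its own freeR step, a processing directory normally holds more than one final data set. The following minimal sketch lists them; the function name final_mtz_files is hypothetical, and it assumes that the freeR step directories are named stepN_freeR as in the listing shown earlier.

```shell
# final_mtz_files DIR: list the out.mtz written by the freeR step of each
# branch under DIR (default: the current directory).
# Hypothetical helper; assumes freeR step directories are named stepN_freeR.
final_mtz_files() {
    find "${1:-.}" -type f -name out.mtz -path '*step*_freeR*' | sort
}
```

Run from the processing directory, "final_mtz_files ." prints one path per branch, which makes it easy to see which file came from the branch without a cutoff and which from the branch with one.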

Additional Output

The main directory also contains two other sub-directories:

  • ../_metadata - contains some general files from the automated processing run, including stdout.txt, a comprehensive log file with the log files from every step concatenated together.

  • ../plots - contains two files, ztest.out and ltest.out, which are the tabulated results of the twinning tests.

How to review these files: We highly recommend that you look at the log files from XDS and AIMLESS carefully, to make sure that automated data processing has run correctly, rather than take the output mtz files at face value without checking. The scaling and merging statistics in the Summary table at the end of the AIMLESS log file will provide an estimate of the resolution limit of the data, which may then be applied on manual reprocessing. Note that automatic data processing runs with no resolution or I/sigI cutoffs.
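
As a starting point for this review, the end of each AIMLESS log can be printed in one pass. This is a minimal sketch under stated assumptions: the function name aimless_summaries is hypothetical, the AIMLESS step directories are assumed to have "aimless" in their names, and the default of 60 lines is a guess that may need adjusting to show the whole Summary table.

```shell
# aimless_summaries DIR [LINES]: print the last LINES lines (default 60) of
# every AIMLESS stdout.txt found under DIR, each preceded by its path.
# Hypothetical helper; assumes step directories named stepN_aimless*.
aimless_summaries() {
    top="${1:-.}"
    lines="${2:-60}"
    find "$top" -type f -name stdout.txt -path '*step*_aimless*' | sort |
    while read -r log; do
        printf '==> %s <==\n' "$log"
        tail -n "$lines" "$log"
    done
}
```

This only locates the relevant part of each log; the full stdout.txt files should still be read to confirm that processing ran correctly.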

If the auto-processing fails for any reason, an Error Log is generated which lists the fatal errors that occurred during execution of a program or script. It can be used to trace the ultimate cause of a problem during processing.


How to re-do data processing

In each of the processing directories, there is a run.csh file that can be executed to re-run the selected process. You are welcome to change the input files in order to reprocess the data. However, this will only run the selected program. In order to follow the progression of programs, you would need to run each individually. If you wish to reprocess the data from scratch, it might be easier to do this with the software available on the SSRL px-proc machines (XDS, HKL3000 or DIALS). If you wish to know more, please consult the SSRL data processing environment manual.

Automated Data Processing using XIA2

Workflow

Data processing based on a XIA2 pipeline is also available. The pipeline is currently in development and programs, input parameters, etc. are subject to change. This pipeline will be more robust and configurable than the SSRL pipeline and will also be used for multi-crystal processing when fully released. The processing is initiated automatically upon completion of a data collection run, requiring no user input, but running in a customized configuration. While xia2 provides several options for indexing, integration, scaling, and merging, for this workflow the XDS algorithm has been selected for indexing and integration, the Aimless algorithm is used for scaling and merging, and CTruncate is used to convert intensities to structure factors.

Software run by the workflow

  • xia2 - the data processing front end
  • xia2.multiplex - multi-crystal processing optimization algorithm (to be implemented in a future release)
  • dials - a collection of data processing back-end programs, which include implementations of XDS and Aimless algorithms.

Output

The auto-processing directories (symbolic links to the actual directories) can be found in the directory containing the raw images. The directories are named autoprocessing_xia2_{workspace_code}.

This folder contains

  • xia2.txt - the overall log for this automated processing run.
  • xia2.html - result summary in webpage format, can be viewed by issuing firefox xia2.html.
  • ./DataFiles/ - a subfolder with files in MTZ and SCA format for various stages of data processing, including {PROJECT_ID}_{CRYSTAL_ID}_free.mtz, which is the file with integrated intensities, structure factors, and the Rfree flag for selected reflections
  • ./LogFiles/ - a folder that contains the logs for each processing stage

How to re-do data processing

  1. In the processing directory, go one level up and create a new processing folder:
    cd ../
    mkdir new_xia2_processing
    cd new_xia2_processing

    Critical - if you do not use a new directory, the original processing results will be overwritten!

  2. Copy the run.sh file from the original processing directory:
    cp ../original_xia2_processing/run.sh .

  3. Open run.sh in a text editor and edit the xia2 launch string:
    geany run.sh

  4. Launch run.sh from the new processing directory:
    ./run.sh
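
Steps 1 and 2 above can be wrapped in a small function that refuses to reuse an existing directory, which guards against the overwriting problem noted in step 1. This is only a sketch: the function name prepare_xia2_rerun is hypothetical, and the directory names are the same placeholders used in the steps above.

```shell
# prepare_xia2_rerun ORIG NEW: create NEW and copy run.sh from ORIG into it.
# Refuses to proceed if NEW already exists, so that earlier processing
# results are never overwritten. Hypothetical helper.
prepare_xia2_rerun() {
    orig="$1"
    new="$2"
    if [ ! -f "$orig/run.sh" ]; then
        echo "no run.sh found in $orig" >&2
        return 1
    fi
    if [ -e "$new" ]; then
        echo "$new already exists; choose a new directory name" >&2
        return 1
    fi
    mkdir "$new" && cp "$orig/run.sh" "$new/"
}
```

From the directory one level above the original processing directory, "prepare_xia2_rerun original_xia2_processing new_xia2_processing" would set up the new folder; run.sh can then be edited and launched from inside it as in steps 3 and 4.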

More information on the software supported by the SSRL-SMB Macromolecular Crystallography division is available on our software webpage.