Processing multi-crystal datasets with xia2 and xia2.multiplex at SSRL

The program xia2 [1] is useful for merging data from multiple sweeps or from datasets collected on multiple crystals. While xia2 is generally highly automated and capable of processing datasets without much human intervention, it requires a bit more hand-holding when processing datasets collected from multiple crystals, especially if these datasets contain wedges (or “sweeps”) that are of too poor quality to be included. “Poor quality” may refer to low resolution, irregular spot shapes, high mosaicity, and other properties of images that negatively affect merging statistics.


Running xia2 on the SSRL processing cluster

To run the version of xia2 used at SSRL, log onto the SSRL processing (pxproc) cluster. We recommend using NoMachine for this purpose. The instructions to do this are here:

After logging onto a pxproc machine, unload the conflicting Phenix module and load the DIALS module that contains the most current xia2 version:

module switch -f phenix dials

You should see the following output:

Loading dials/3-13-0
  WARNING: dials/3-13-0 conflicts with phenix

(If you need to use Phenix, open a separate terminal window, log into a pxproc machine, and use Phenix from there as normal.)

Housekeeping notes

Getting your processed data: xia2-processed data files can be found under the DataFiles folder in the directory in which you ran xia2. They will be named something like AUTOMATIC_NATIVE_scaled.mtz. Data from xia2.multiplex processing will be written to a scaled.mtz file in the same folder from which xia2.multiplex was run.

Be aware that xia2 generates many files with very generic filenames, such as “xia2.html” or “xia2.txt”. This means that if you run xia2 for multiple projects but stay within the same folder, you will overwrite these files. Even if you do not, dumping the output of every program into the same folder is the easiest way to lose track of your work. Thus, we strongly recommend that you organize your data carefully into separate folders, by project, by dataset, etc., so that you can easily find what you need when you are writing your paper years later.
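
One way to follow this advice is to start every run in its own, freshly created folder. The helper below is a minimal sketch: the function name new_run_dir and the project/dataset/run_<timestamp> layout are our own illustrative choices, not something xia2 requires.

```shell
# Create a uniquely named working directory for one xia2 run and print its path.
# The project/dataset/run_<timestamp> layout is only an illustrative convention.
new_run_dir() (
    project=$1
    dataset=$2
    run_dir="$project/$dataset/run_$(date +%Y%m%d_%H%M%S)"
    mkdir -p "$run_dir" && printf '%s\n' "$run_dir"
)

# Example (hypothetical project and dataset names):
#   cd "$(new_run_dir lysozyme native_set1)" && xia2 /folder/with/image/files
```

Because each run gets its own folder, generic files such as xia2.html and xia2.txt from different runs never overwrite one another.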

Processing a single crystal dataset

The simplest way to process images with xia2 is to point it at a folder that contains image files:

xia2 /folder/with/image/files

If one wishes to use a specific processing backend, xia2 supports several combinations, selected with the pipeline keyword:

pipeline=dials           tell xia2 to use DIALS and DIALS scaling (default)
pipeline=dials-aimless   tell xia2 to use DIALS and AIMLESS scaling
pipeline=3d              tell xia2 to use XDS and XSCALE scaling, indexing with peaks found in a subset of images
pipeline=3dii            tell xia2 to use XDS and XSCALE scaling, indexing with peaks found in all images

Thus, the command line becomes something like:

xia2 --pipeline=3d /folder/with/image/files

This will start xia2, turn on the XDS pipeline, process all images found in the /folder/with/image/files path and then use XSCALE for scaling and merging.

In most cases this will be all you need, as xia2 will automatically do all the processing and return an acceptable result.

To evaluate the result, inspect the output in the terminal window, or the xia2.txt log file. A Table 1-like summary of the merging statistics is found at the end of the output or log file.

A more in-depth look at the processing results can be obtained by opening the HTML report of the run:

firefox xia2.html

Running xia2 with multiple imagesets

In cases where multiple imagesets (i.e. sets of images that are named using the same template, e.g. image_file_1_#####.cbf, where the hashes represent a number) are stored in the same folder, it is sufficient to point xia2 to that folder:

xia2 --pipeline=dials /folder/with/image/files

If these imagesets are scattered over several specific subfolders, the command can point to each imageset individually by specifying the first file in the set:

xia2 --pipeline=dials image=/folder/with/images/1/filename_1_00001.cbf image=/folder/with/images/2/filename_2_00001.cbf image=/folder/with/images/3/filename_3_00001.cbf

If selecting a subset of images is desired (for example, due to radiation damage), it’s possible to restrict processing to the specified image ranges:

xia2 --pipeline=dials image=/folder/with/images/1/filename_1_00001.cbf:1:40 image=/folder/with/images/2/filename_2_00001.cbf:1:35 image=/folder/with/images/3/filename_3_00001.cbf:1:45

Finally, one can feed a script to xia2, saved in a file with an .xinfo extension, which specifies all of the paths, files, and image ranges in one place. A typical .xinfo file looks like this:


BEGIN PROJECT AUTOMATIC
BEGIN CRYSTAL DEFAULT

BEGIN WAVELENGTH NATIVE
WAVELENGTH 0.979500
END WAVELENGTH NATIVE

BEGIN SWEEP SWEEP1
WAVELENGTH NATIVE
DIRECTORY /folder/with/images/1/
IMAGE filename_1_00001.cbf
START_END 1 40
END SWEEP SWEEP1

BEGIN SWEEP SWEEP2
WAVELENGTH NATIVE
DIRECTORY /folder/with/images/2/
IMAGE filename_2_00001.cbf
START_END 1 35
END SWEEP SWEEP2

BEGIN SWEEP SWEEP3
WAVELENGTH NATIVE
DIRECTORY /folder/with/images/3/
IMAGE filename_3_00001.cbf
START_END 1 45
END SWEEP SWEEP3

END CRYSTAL DEFAULT
END PROJECT AUTOMATIC

The .xinfo file is automatically generated the first time xia2 is run; alternatively, the above template can be used to create one from scratch. The file can then be passed to xia2, which will read all of this information from it:

xia2 --info=info.xinfo
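
When many sweeps are involved, typing the .xinfo file by hand becomes error-prone. The sketch below writes a file that follows the template shown above. The function name make_xinfo and its argument format (wavelength first, then one /path/to/first_image:start:end item per sweep) are our own hypothetical conventions, and paths that contain colons are not handled.

```shell
# Print a .xinfo file that follows the template above.
# Usage: make_xinfo WAVELENGTH /path/to/first_image.cbf:START:END [...]
make_xinfo() (
    wavelength=$1
    shift
    printf 'BEGIN PROJECT AUTOMATIC\nBEGIN CRYSTAL DEFAULT\n\n'
    printf 'BEGIN WAVELENGTH NATIVE\nWAVELENGTH %s\nEND WAVELENGTH NATIVE\n\n' "$wavelength"
    n=0
    for spec in "$@"; do
        n=$((n + 1))
        last=${spec##*:}      # text after the final colon: last image number
        rest=${spec%:*}       # everything before the final colon
        first=${rest##*:}     # first image number
        path=${rest%:*}       # full path to the first image of the sweep
        printf 'BEGIN SWEEP SWEEP%d\nWAVELENGTH NATIVE\n' "$n"
        printf 'DIRECTORY %s/\nIMAGE %s\n' "${path%/*}" "${path##*/}"
        printf 'START_END %s %s\nEND SWEEP SWEEP%d\n\n' "$first" "$last" "$n"
    done
    printf 'END CRYSTAL DEFAULT\nEND PROJECT AUTOMATIC\n'
)

# Example:
#   make_xinfo 0.979500 /folder/with/images/1/filename_1_00001.cbf:1:40 > info.xinfo
```

Sweeps are numbered SWEEP1, SWEEP2, ... in the order given, which may differ from the numbering xia2 would assign on its own.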

Multi-crystal processing

For best results, a multi-crystal dataset should be processed iteratively, in several cycles, with each cycle featuring adjustments that improve the merging statistics. While the procedure varies for each dataset, below is a general approach that we have found to be useful.

1. Initial run - all data, with defaults

This run serves two purposes: to weed out the non-processable imagesets (or "sweeps") and to generate the *.xinfo file that can be used for subsequent runs. For the purposes of this manual, we are going to assume that all images are located in the same folder.

xia2 --failover=True --read_all_image_headers=False --multi_crystal=True /folder/with/image/files multiprocessing.mode=parallel multiprocessing.njob=4 multiprocessing.nproc=4

The options prior to the path to image files (designated with double dashes) are used to set xia2-specific parameters; here, the options are selected to speed up data processing, as multi-crystal sets can take a while.

failover=True                  skip sweeps that cause errors without terminating the entire run
read_all_image_headers=False   assume that one image header applies to the entire sweep; this saves time
multi_crystal=True             tell xia2 that you are using data from multiple crystals

After the path to files, you can add more options to the command line; note that these are not prefixed with a double dash. The multiprocessing options will cause xia2 to run in parallel. NOTE: running xia2 in parallel ensures that errors in processing one sweep do not terminate the whole run.

multiprocessing.mode=parallel   activate parallel processing mode
multiprocessing.njob=4          number (e.g. 4) of sweeps to be processed in parallel
multiprocessing.nproc=4         number (e.g. 4) of CPUs to use per sweep
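
Since njob sweeps each use nproc CPUs, a run occupies roughly njob x nproc CPUs in total. The sketch below picks an njob value that keeps that product within the number of CPUs on the machine. This sizing rule is our own assumption rather than a xia2 requirement, and the function name suggest_njob is hypothetical.

```shell
# Suggest a value for multiprocessing.njob so that njob * nproc does not
# exceed the CPU count (an assumed rule of thumb, not a xia2 requirement).
# Usage: suggest_njob NPROC_PER_SWEEP [TOTAL_CPUS]
suggest_njob() (
    nproc_per_sweep=$1
    total=${2:-$(getconf _NPROCESSORS_ONLN)}   # default: CPUs currently online
    njob=$((total / nproc_per_sweep))
    [ "$njob" -lt 1 ] && njob=1                # always run at least one sweep
    printf '%d\n' "$njob"
)

# Example:
#   xia2 ... multiprocessing.mode=parallel multiprocessing.nproc=4 \
#       multiprocessing.njob="$(suggest_njob 4)"
```

On a shared pxproc machine you may want to pass a smaller TOTAL_CPUS value explicitly so that other users are not starved.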

2. First round of sweep selection in case of failure

The automatic.xinfo file is generated upon the first launch of xia2 (from step 1); it is advisable to not overwrite this version, but create a working copy:

cp automatic.xinfo work.xinfo

The file can then be edited as desired using the text editor of your choice. (An editor called geany is installed on all pxproc machines and features a user-friendly GUI.)

The most common edits to the *.xinfo file include changing the image ranges within a sweep or deleting a problematic sweep altogether; after editing and saving the file, xia2 can be rerun as follows:

xia2 --info=work.xinfo

It is useful not only to delete the sweep on which xia2 failed, but also to note which sweeps were not successfully indexed. A telltale sign of such a failure can be found in the xia2 output (on screen and in the xia2.txt log file) and looks like this:

------------------- Autoindexing SWEEP8 --------------------
All possible indexing solutions:
tP  58.64  58.64 151.42  90.00  90.00  90.00
oC  82.86  82.92 151.40  90.00  90.00  90.00
oP  58.61  58.65 151.39  90.00  90.00  90.00
mC  82.69  82.84 151.21  90.00  89.91  90.00
mP  58.62  58.60 151.33  90.00  89.99  90.00
aP  58.52  58.56 151.21  89.95  89.95  90.07
Indexing solution:
tP  58.64  58.64 151.42  90.00  90.00  90.00
-------------------- Integrating SWEEP8 --------------------
Processed batches 3 to 22
Standard Deviation in pixel range: 0.47 0.55
Integration status per image:
oooooooo............
"o" => good        "%" => ok        "!" => bad rmsd
"O" => overloaded  "#" => many bad  "." => weak
"@" => abandoned
Mosaic spread: 0.060 < 0.060 < 0.060
-------------------- Spotfinding SWEEP9 --------------------
184 spots found on 20 images (max 17 / bin)
**
**
**    *
**    **
**** ***   *
**** *** * *     ***
******** **** ******
********************
********************
********************
2      image      21
------------------- Autoindexing SWEEP9 --------------------
Processing sweep SWEEP9 failed: No suitable indexing solution could be found.

You can view the reciprocal space with:
dials.reciprocal_lattice_viewer /folder/to/processing/DEFAULT/NATIVE/SWEEP9/index/108_optimised.expt /folder/to/processing/DEFAULT/NATIVE/SWEEP9/index/106_SWEEP9_strong.refl

In the above fragment of the log, notice that Sweep 8 indexed and integrated normally; conversely, Sweep 9 failed to index, with an error message. Based on this information, one could edit work.xinfo to remove Sweep 9 from further processing.
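
With many sweeps, scanning the log by eye is tedious. The sketch below prints the names of the sweeps that xia2 reported as failed. It assumes the "Processing sweep ... failed:" wording shown in the log fragment above, which may differ between xia2 versions; the function name list_failed_sweeps is hypothetical.

```shell
# Print the name of every sweep for which the log contains a line of the form
#   Processing sweep SWEEPn failed: <reason>
# Usage: list_failed_sweeps [LOGFILE]   (defaults to xia2.txt)
list_failed_sweeps() {
    sed -n 's/^Processing sweep \([^ ]*\) failed: .*/\1/p' "${1:-xia2.txt}"
}
```

The names it prints are the SWEEP blocks to consider deleting from work.xinfo.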

One could also try and see how the processing would work with more information about the crystal parameters extracted from this initial attempt.

3. Re-run with space group and unit cell information

Inspecting the output, one could see that quite a few of the sweeps (if not all of them) are indexed in the same symmetry, with highly similar unit cell parameters, e.g.:

------------------- Autoindexing SWEEP8 --------------------
All possible indexing solutions:
tP  58.64  58.64 151.42  90.00  90.00  90.00
oC  82.86  82.92 151.40  90.00  90.00  90.00
oP  58.61  58.65 151.39  90.00  90.00  90.00
mC  82.69  82.84 151.21  90.00  89.91  90.00
mP  58.62  58.60 151.33  90.00  89.99  90.00
aP  58.52  58.56 151.21  89.95  89.95  90.07
Indexing solution:
tP  58.64  58.64 151.42  90.00  90.00  90.00

It is possible to re-run xia2 with these specific crystal parameters, which will aid the processing by stabilizing the indexing of datasets and improving the chances of success.

xia2 --info=work.xinfo --failover=True --read_all_image_headers=False space_group=P4 unit_cell=58.64,58.64,151.42,90.00,90.00,90.00

(note the lack of dashes before space_group and unit_cell options!)
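
To choose the space_group and unit_cell values, it helps to see the selected indexing solution for every sweep side by side. The sketch below extracts them from the log. It assumes the "Autoindexing SWEEPn" header and the "Indexing solution:" line appear exactly as in the fragments above; the function name list_indexing_solutions is hypothetical.

```shell
# For each sweep, print the line that follows "Indexing solution:" in the log.
# Usage: list_indexing_solutions [LOGFILE]   (defaults to xia2.txt)
list_indexing_solutions() {
    awk '
        /Autoindexing SWEEP/ {
            # remember which sweep the following lines belong to
            for (i = 1; i <= NF; i++) if ($i ~ /^SWEEP/) sweep = $i
        }
        want                  { print sweep ": " $0; want = 0 }
        /^Indexing solution:/ { want = 1 }   # the solution is on the next line
    ' "${1:-xia2.txt}"
}
```

Sweeps that failed to index have no "Indexing solution:" line and are therefore simply absent from the output.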

4. Inspecting statistics by batch (sweep)

By this point, some iteration of steps 2-3 should produce a completed xia2 run, which will present the summary output on screen and write it out to the xia2.txt log file. Parts of this output can be inspected to see if any sweeps should be omitted, usually because they decrease the quality of the merging statistics for the entire dataset.

One useful piece of information is the estimated resolutions for each sweep, which can be found towards the end of the processing run, and look like this:

--------------------- Scaling DEFAULT ----------------------
  Resolution for sweep NATIVE/SWEEP7: 1.78 (cc_half > 0.3)
  Resolution for sweep NATIVE/SWEEP15: 1.76 (cc_half > 0.3)
* Resolution for sweep NATIVE/SWEEP11: 2.39 (cc_half > 0.3)
  Resolution for sweep NATIVE/SWEEP3: 1.77 (cc_half > 0.3)
  Resolution for sweep NATIVE/SWEEP5: 1.59 (cc_half > 0.3)
  Resolution for sweep NATIVE/SWEEP4: 1.53 (cc_half > 0.3)
  Resolution for sweep NATIVE/SWEEP2: 1.71 (cc_half > 0.3)
* Resolution for sweep NATIVE/SWEEP9: 3.91 (cc_half > 0.3)
  Resolution for sweep NATIVE/SWEEP14: 1.64 (cc_half > 0.3)
  Resolution for sweep NATIVE/SWEEP10: 1.96 (cc_half > 0.3)
  Resolution for sweep NATIVE/SWEEP1: 1.79 (cc_half > 0.3)
  Resolution for sweep NATIVE/SWEEP6: 1.63 (cc_half > 0.3)
  Resolution for sweep NATIVE/SWEEP13: 1.69 (cc_half > 0.3)
  Resolution for sweep NATIVE/SWEEP12: 1.50 (cc_half > 0.3)
* Resolution for sweep NATIVE/SWEEP8: 0.00

Note that for most of the sweeps, the resolution varies between ~1.5 Å and ~2 Å, but three sweeps (which we marked with asterisks) are outliers. Two of them have resolutions of 2.4 Å and 3.9 Å, while the third one seems to be of such poor quality that its resolution is not even estimated.
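
The same outlier check can be done without reading every line. The sketch below prints the sweeps whose estimated resolution is worse than a cutoff you choose, plus those reported as 0.00 (not estimated). It assumes the "Resolution for sweep" line format shown above; the function name flag_poor_sweeps and the example cutoff are our own.

```shell
# Print sweeps whose estimated resolution is numerically larger (i.e. worse)
# than CUTOFF, or is reported as 0.00 (no estimate).
# Usage: flag_poor_sweeps CUTOFF [LOGFILE]   (LOGFILE defaults to xia2.txt)
flag_poor_sweeps() {
    awk -v cutoff="$1" '/Resolution for sweep/ {
        for (i = 1; i < NF; i++)
            if ($i == "sweep") {
                name = $(i + 1); sub(/:$/, "", name)   # strip the trailing colon
                res = $(i + 2) + 0
                if (res == 0 || res > cutoff + 0) print name, $(i + 2)
            }
    }' "${2:-xia2.txt}"
}

# Example: flag_poor_sweeps 2.0 xia2.txt
```

For the listing above, a cutoff of 2.0 would report Sweeps 11, 9, and 8, the same three that are marked with asterisks.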

xia2 will also create a xia2.html file, which can be used to view the results as charts. For this type of data, the most useful chart can be found under the "Dataset NATIVE" tab. ("NATIVE" is the default name for your dataset; you can change it in the work.xinfo file.)

Figure: Scale and Rmerge vs. batch

This tab will show the Table 1 summary (same as the one under the "Summary" tab), the Xtriage warnings (if any), and below them, Analysis Plots. Clicking on "Analysis by batch" will expose several useful plots that will show if any sweeps act as obvious outliers. In the example dataset, the graph of Rmerge and scale vs. frame shows a concerningly high Rmerge for Sweep 9. (If sweep names overlap too much, mousing over any point of the chart will reveal which sweep the concerning Rmerge belongs to.)

The high multiplicity of this dataset allows us to edit the work.xinfo file to exclude Sweep 9 and re-run processing without it. The next run may reveal that more sweeps have problematic statistics, and they, too, can be removed from processing by editing the work.xinfo file. We can also remove sweeps with resolutions that are too low for our purposes, such as Sweeps 11 and 8, judging by the resolution list above. We can continue this process until we are satisfied with the merging statistics.
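
Removing sweeps from the .xinfo file can also be scripted. The sketch below copies a .xinfo file while leaving out the named SWEEP blocks. It relies only on the BEGIN SWEEP / END SWEEP structure shown in the template earlier; the function name drop_sweeps is hypothetical.

```shell
# Print XINFO with the named sweep blocks (BEGIN SWEEP x ... END SWEEP x) removed.
# Usage: drop_sweeps XINFO SWEEPNAME [SWEEPNAME ...]
drop_sweeps() (
    xinfo=$1
    shift
    awk -v drop=" $* " '
        $1 == "BEGIN" && $2 == "SWEEP" && index(drop, " " $3 " ") { skipping = 1 }
        !skipping { print }
        $1 == "END" && $2 == "SWEEP" { skipping = 0 }
    ' "$xinfo"
)

# Example (write to a new file; do not redirect onto the input file):
#   drop_sweeps work.xinfo SWEEP9 SWEEP11 SWEEP8 > work_trimmed.xinfo
```

Sweep names are matched exactly, so dropping SWEEP1 does not affect SWEEP10 or SWEEP11.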

5. Improving the merged dataset with xia2.multiplex

xia2.multiplex [2] is a program that improves a multi-crystal dataset by performing symmetry analysis, re-scaling and re-merging, as well as analyzing the various pathologies that bedevil multi-crystal datasets, such as non-isomorphism, radiation damage, and preferential crystal orientation (which results in missing cones of data). This program can be run on the entirety of the merged dataset from above, or on selected sweeps if that is desired. xia2.multiplex accepts two types of files: the reflection table file (with the extension .refl) and a file with stored experiment metadata (with the extension .expt). These files (both for each sweep and for the entire dataset) can be found in the DataFiles subfolder of your xia2 processing directory. Thus, to run xia2.multiplex on the data you just processed, run

xia2.multiplex DataFiles/AUTOMATIC_DEFAULT_scaled.expt DataFiles/AUTOMATIC_DEFAULT_scaled.refl

One could also select several sweeps to see if a better complete dataset can be assembled from those, e.g.:

xia2.multiplex DataFiles/AUTOMATIC_DEFAULT_NATIVE_SWEEP1.expt DataFiles/AUTOMATIC_DEFAULT_NATIVE_SWEEP1.refl DataFiles/AUTOMATIC_DEFAULT_NATIVE_SWEEP3.expt DataFiles/AUTOMATIC_DEFAULT_NATIVE_SWEEP3.refl DataFiles/AUTOMATIC_DEFAULT_NATIVE_SWEEP7.expt DataFiles/AUTOMATIC_DEFAULT_NATIVE_SWEEP7.refl
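
Typing the .expt/.refl pairs for many sweeps is tedious. The sketch below prints the pair of file names for each sweep number, following the DataFiles/AUTOMATIC_DEFAULT_NATIVE_SWEEPn naming used in the command above. The function name multiplex_inputs is hypothetical, and the names will differ if you renamed the project, crystal, or wavelength in your .xinfo file.

```shell
# Print the .expt and .refl file names for the given sweep numbers,
# one per line, using the default xia2 naming shown above.
# Usage: multiplex_inputs N [N ...]
multiplex_inputs() {
    for n in "$@"; do
        printf '%s\n' "DataFiles/AUTOMATIC_DEFAULT_NATIVE_SWEEP${n}.expt" \
                      "DataFiles/AUTOMATIC_DEFAULT_NATIVE_SWEEP${n}.refl"
    done
}

# Example (equivalent to the command above):
#   xia2.multiplex $(multiplex_inputs 1 3 7)
```

The unquoted $(...) is intentional here, so that the shell splits the output into separate arguments.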

xia2.multiplex will also perform a Laue check of the point group for this crystal and ensure that the dataset is isomorphous. If some sweeps are found to be outliers (i.e. if multiple crystal forms are present), this information could be used to separate the datasets.

6. Sacrificing quantity for quality

It is more useful to monitor Rpim, which reflects the precision of the merged intensities and improves as multiplicity grows, than Rmerge, which rises with multiplicity and can therefore be cosmetically lowered by reducing multiplicity, i.e. throwing data away. (Rmeas is corrected to be independent of multiplicity.) However, in cases of multi-crystal datasets, it is sometimes useful to sacrifice some multiplicity in exchange for gains in overall resolution or for tangible statistical improvements. For crystals with high-symmetry space groups, it is very easy to amass a highly redundant dataset and thus to have the option to pick out only the best sweeps (in terms of resolution, mosaicity, etc.), rather than include everything.

xia2.multiplex performs hierarchical clustering of datasets to determine which sweeps best match up with one another. This approach is very useful in cases of non-isomorphism, as it very effectively separates different crystal forms.

Cos(angle) clustering summary:
=========  ==============  ===================  ========  ==============  ==============
  Cluster    No. datasets  Datasets               Height    Multiplicity    Completeness
=========  ==============  ===================  ========  ==============  ==============
        1               2  4 5                   7.7e-09             5              0.93
        2               2  7 9                   6.8e-05             1.7            0.83
        3               3  3 7 9                 0.00022             4.3            1
        4               2  1 8                   0.00067             4.4            0.99
        5               3  2 4 5                 0.00081             7.7            0.98
        6               4  0 3 7 9               0.0014              7.2            1
        7               5  1 2 4 5 8             0.006              11.9            1
        8               6  1 2 4 5 6 8           0.014              14.8            1
        9              10  0 1 2 3 4 5 6 7 8 9   0.023              22              1
=========  ==============  ===================  ========  ==============  ==============

Clustering can also be used to identify smaller subsets that could have better merging statistics than the full set of sweeps. In this particular case, a combination of datasets #0, 3, 7 and 9 may be useful to try. As you can see, the jump in the "Height" parameter from cluster 6 to cluster 7 is rather abrupt, the overall multiplicity is still a respectable 7.2, and the dataset composed of those sweeps would be 100% complete.
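
Candidate clusters can also be shortlisted mechanically. The sketch below reads a file containing the clustering summary table (for example, a copy of the xia2.multiplex output saved to a text file) and prints the clusters that meet a minimum multiplicity and completeness. The thresholds and the function name pick_clusters are our own; the code assumes the column layout shown above, with Height, Multiplicity, and Completeness as the last three columns.

```shell
# Print clusters whose multiplicity and completeness meet the given minimums.
# Usage: pick_clusters MIN_MULTIPLICITY MIN_COMPLETENESS TABLEFILE
pick_clusters() {
    awk -v min_mult="$1" -v min_comp="$2" '
        $1 ~ /^[0-9]+$/ && NF >= 6 && $(NF - 1) + 0 >= min_mult + 0 && $NF + 0 >= min_comp + 0 {
            # fields 3 .. NF-3 are the dataset numbers in this cluster
            ids = $3
            for (i = 4; i <= NF - 3; i++) ids = ids "," $i
            printf "cluster %s: datasets %s (multiplicity %s, completeness %s)\n", $1, ids, $(NF - 1), $NF
        }
    ' "$3"
}

# Example: pick_clusters 7 1 clusters.txt
```

For the table above, a minimum multiplicity of 7 and a completeness of 1 would shortlist clusters 6 through 9; the "Height" column still has to be judged by eye.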

Unfortunately, the numbering scheme of the datasets does not directly correlate with the numbering scheme of the sweeps. To match one with the other, open the xia2.multiplex.html file in a web browser, expand the "Datasets" item, and look at which image filenames the dataset numbers correspond to. Here, dataset 0 corresponds to files named "thaulmatin_1_#####.cbf", dataset 3 to "thaulmatin_13_#####.cbf", dataset 7 to "thaulmatin_15_#####.cbf", and dataset 9 to "thaulmatin_6_#####.cbf", or to Sweeps 4, 6, 7, and 12, respectively.

Figure: List of datasets in xia2.multiplex

Once the corresponding sweeps are identified, the work.xinfo file can be edited to restrict the processing to these sweeps alone.

Tutorial with sample HEWL data

NOTE: In all steps "[username]" is a placeholder for your username. Make sure you replace all instances of it with your own username, otherwise nothing will work.

Step 1. Preparation

  1. Log onto a pxproc machine that appears the least used at the moment; we recommend pxproc1-12, as they contain more CPUs than the others.
  2. Go to your user data folder:

    cd /data/[username]/

  3. Either in your data folder or in a subfolder therein, create a xia2 tutorial subfolder:

    mkdir xia2_tutorial

  4. Go to that subfolder:

    cd xia2_tutorial

Step 2. Initial xia2 run.

  1. Point xia2 at the sample HEWL data accessible via a symlink, and run it in multiprocessing mode as above:

    xia2 --failover=False --read_all_image_headers=False --multi_crystal=True /data/[username]/templates/xia2_sample_data/ multiprocessing.mode=parallel multiprocessing.njob=10 multiprocessing.nproc=4

  2. Inspect the Table 1 summary of the data output at the end of the processing run. Strictly speaking, these data are already "good"; for the purposes of this tutorial, let's improve them a little bit more.

Step 3. Removing unhelpful sweeps

  1. Look through the xia2.txt file; note that
    1. There is an error (Processing sweep SWEEP10 failed: xia2.integrate subprocess failed with exitcode 1) reported for Sweep 10.
    2. There is very dramatic deterioration of spotfinding results for Sweep 9.
    3. The "mosaic spread" parameter for Sweep 9 is much higher than for other sweeps.
    4. The resolution for Sweep 9 is notably lower than for other sweeps.
  2. Open xia2.html using Firefox; click on the "Dataset NATIVE" tab and expand the "Analysis by batch" section. Note that
    1. The Rmerge vs. batch plot shows a huge Rmerge spike for Sweep 9.
    2. The I/sigma(I) vs. batch plot shows steep deterioration of I/sigma(I) for Sweep 1 and Sweep 8, and very poor values for Sweep 9.
  3. Create a copy of the automatic.xinfo file:

    cp automatic.xinfo work.xinfo

  4. Edit the work.xinfo file to remove Sweep 9 and Sweep 10 from processing. (We will deal with Sweeps 1 and 8 later.)
  5. Re-run xia2 using the edited work.xinfo:

    xia2 --failover=False --read_all_image_headers=False --multi_crystal=True --info=work.xinfo multiprocessing.mode=parallel multiprocessing.njob=10 multiprocessing.nproc=4

Step 4. Removing unhelpful images

  1. Inspect xia2.txt again and note that spotfinding results for all sweeps show some deterioration with exposure, but especially for Sweeps 1 and 8.
  2. Open xia2.html with Firefox, and inspect the "Analysis by batch" charts; note that
    1. Rmerge for Sweeps 2 and 4 rises rapidly about half-way through.
    2. I/sigma(I) for Sweeps 1 and 8 drops rapidly about 1/3 of the way through.
  3. Edit work.xinfo to truncate image ranges as follows:
    1. Sweep 1: 2 to 21
    2. Sweep 2: 2 to 26
    3. Sweep 4: 2 to 31
    4. Sweep 8: 2 to 26

    Note: these are semi-arbitrary ranges; you can make your own decisions here.

  4. Re-run xia2 with work.xinfo:

    xia2 --failover=False --read_all_image_headers=False --multi_crystal=True --info=work.xinfo multiprocessing.mode=parallel multiprocessing.njob=10 multiprocessing.nproc=4

  5. Inspect the Table 1 summary; note slight improvement to the merging statistics.
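
The range edits in item 3 of this step can also be scripted instead of made by hand. The sketch below rewrites the START_END line of one named sweep and leaves everything else untouched. It assumes that each SWEEP block in your work.xinfo contains a START_END line, as in the template shown earlier; the function name set_sweep_range is hypothetical.

```shell
# Print XINFO with the START_END line of one sweep replaced.
# Usage: set_sweep_range XINFO SWEEPNAME FIRST LAST
set_sweep_range() {
    awk -v sweep="$2" -v first="$3" -v last="$4" '
        $1 == "BEGIN" && $2 == "SWEEP" { current = $3 }   # entering a sweep block
        $1 == "END"   && $2 == "SWEEP" { current = "" }   # leaving it
        $1 == "START_END" && current == sweep { print "START_END", first, last; next }
        { print }
    ' "$1"
}

# Example (Sweep 1 truncated to images 2-21, as in item 3 above):
#   set_sweep_range work.xinfo SWEEP1 2 21 > tmp.xinfo && mv tmp.xinfo work.xinfo
```

If a sweep block has no START_END line, the function leaves that block unchanged, so check the result before re-running xia2.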

Step 5. Refining the merged dataset with xia2.multiplex

  1. Run xia2.multiplex with the results of this latest processing run:

    xia2.multiplex DataFiles/AUTOMATIC_DEFAULT_scaled.expt DataFiles/AUTOMATIC_DEFAULT_scaled.refl

  2. Note: the inputs to xia2.multiplex are
    1. A file with the .expt extension, containing information about the experiment
    2. A file with the .refl extension, containing the reflection table
  3. Inspect the xia2.multiplex.log file and find the "overall merging statistics". Note the improvement vs. the result of the previous xia2 processing run.
  4. The xia2.multiplex.html file can be inspected in Firefox to see more helpful statistics and charts.

REFERENCES

  1. Winter, G. (2010). J. Appl. Cryst. 43, 186–190.
  2. Gildea, R. J., et al. (2022). Acta Cryst. D78, 752–769.