Processing Data Using xia2 (Single and Multiple Crystals)

Running xia2 at SSRL
Housekeeping
Processing a Single Crystal Dataset
Running with a Subset of Images
Multi-crystal Processing

Initial run - all data, with defaults
First round of sweep selection in case of failure
Re-run with space group and unit cell information
Inspecting statistics by batch (sweep)
Improving the merged dataset with xia2.multiplex
Sacrificing quantity for quality

Tutorial with sample HEWL data
References

Running xia2 at SSRL

The program xia2 [1] is useful for processing data and for merging multiple sweep data or datasets collected from multiple crystals.

While xia2 is generally highly automated and can process single-crystal datasets without intervention, datasets collected from multiple crystals usually require inspection and reprocessing, especially if they contain wedges (or “sweeps”) that are of too poor quality to be included. “Poor quality” may refer to low resolution, irregular spot shapes, high mosaicity, and other properties of images that negatively affect merging statistics.

To run the current SSRL version of xia2, log onto one of the SSRL processing computers (pxproc## machines) either locally or through NoMachine.

Instructions to set up a remote Unix desktop using NoMachine
Accessing the data processing computers

After logging onto a pxproc## machine, unload the conflicting Phenix module and load the DIALS module that contains the most current xia2 version:

module switch -f phenix dials (once per login session)

The following output should be displayed:

Loading dials/3-13-0
  WARNING: dials/3-13-0 conflicts with phenix

If you need to use Phenix, open a separate terminal window, log into a pxproc## machine, and use Phenix as normal.

Housekeeping

xia2 generates many files. These files have generic filenames, such as “xia2.html” or “xia2.txt”, and will be overwritten if xia2 is run again from the same directory.

We recommend running each instance of xia2 in its own directory, with a name that will still be easily identifiable several years later.
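One way to follow this recommendation is sketched below. The helper name new_run_dir and the naming scheme (label, date, run counter) are illustrative assumptions, not part of xia2; the point is only that an existing directory is never reused.

```shell
# Create a uniquely named, descriptive run directory and print its path.
# The naming scheme (xia2_<label>_<date>_run<N>) is an illustrative assumption.
new_run_dir() {
    local parent="$1" label="$2"
    local stamp n=1 dir
    stamp="$(date +%Y%m%d)"
    dir="${parent}/xia2_${label}_${stamp}_run${n}"
    # Never reuse an existing directory, so earlier xia2.html/xia2.txt survive.
    while [ -e "$dir" ]; do
        n=$((n + 1))
        dir="${parent}/xia2_${label}_${stamp}_run${n}"
    done
    mkdir -p "$dir" && printf '%s\n' "$dir"
}

# Example (hypothetical label):
#   cd "$(new_run_dir /data/username/thaulatin/C4 C4_native)"
#   xia2 /data/username/thaulatin/C4
```

Including the crystal or sample name in the label makes the directory easier to recognize later than a bare counter would.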

Processing a Single Crystal Dataset

If your images are in /data/username/thaulatin/C4, for example, create a subfolder, change into it, and run xia2 on the image directory:

mkdir /data/username/thaulatin/C4/xia2_processing_1

cd /data/username/thaulatin/C4/xia2_processing_1

xia2 /data/username/thaulatin/C4

In most cases xia2 will automatically process the data and will return an acceptable result.

To evaluate the results, inspect the output in the terminal window or the xia2.txt log file. A summary table of the merging statistics can be found at the end of the output or log file.

A more in-depth look at the processing results can be obtained by viewing the summary in a web browser:

firefox xia2.html

xia2 can also be told which combination of processing backends to use through the pipeline option:

`pipeline=dials`	xia2 uses DIALS and DIALS scaling (default)
`pipeline=dials-aimless`	xia2 uses DIALS and AIMLESS scaling
`pipeline=3d`	xia2 uses XDS and XSCALE scaling, indexing with peaks found in a subset of images
`pipeline=3dii`	xia2 uses XDS and XSCALE scaling, indexing with peaks found from all images

The command line for the 3rd option would look like this:

xia2 --pipeline=3d /data/username/thaulatin/C4

This will run the XDS pipeline and then use XSCALE for scaling and merging.

Running with a Subset of Images

If only a subset of images should be processed (for example, because of radiation damage), processing can be restricted to a specified image range:

xia2 /data/username/thaulatin/C4/filename_1_00001.cbf:1:40

Multi-crystal Processing

Multi-crystal datasets may have to be processed iteratively over several cycles, with each cycle featuring adjustments that improve the merging statistics. The steps toward a successful result are provided below:

1. Initial run - all datasets using defaults

This run is to weed out the non-processable datasets (or "sweeps", in xia2 nomenclature).

Load DIALS:

module switch -f phenix dials (if it hasn't been done yet)

If the datasets are in one directory, for example, in /data/username/thaulatin/A7, create a subfolder in that directory:

mkdir /data/username/thaulatin/A7/xia2_multi_crystal_1

cd /data/username/thaulatin/A7/xia2_multi_crystal_1

The following command would be used to read all the datasets in the directory automatically and attempt to cluster similar data together:

xia2 --failover=True --read_all_image_headers=False --multi_crystal=True /data/username/thaulatin/A7 multiprocessing.mode=parallel multiprocessing.njob=4 multiprocessing.nproc=4



Options given before the path to the image files (designated with double dashes, e.g. --failover) set xia2-specific parameters; here, the options are selected to speed up data processing.


    
`failover=True`	Skips sweeps that cause errors without terminating the entire run
`read_all_image_headers=False`	Assumes that one image header applies to the entire sweep, saving time
`multi_crystal=True`	Required for clustering multiple crystal datasets
    


After the path to image files, you can add more options to the command line; note that these are not designated with the double-dash. 


    
`multiprocessing.mode=parallel`	Activates parallel processing for each sweep, which ensures that an error in one sweep (dataset) does not terminate the entire run
`multiprocessing.njob=4`	Number (e.g. 4) of sweeps to be processed in parallel
`multiprocessing.nproc=4`	Number (e.g. 4) of CPUs to use per sweep
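Since njob sweeps each use nproc CPUs, the two values multiply. The sketch below picks njob so that the product stays within the number of cores on the machine. This is a rule of thumb assumed here, not a xia2 requirement, and suggest_njob is a hypothetical helper name.

```shell
# Suggest multiprocessing.njob so that njob * nproc does not exceed the
# available cores (an assumed rule of thumb, not something xia2 enforces).
suggest_njob() {
    local total_cores="$1" nproc_per_sweep="$2"
    local njob=$(( total_cores / nproc_per_sweep ))
    if [ "$njob" -lt 1 ]; then
        njob=1   # always run at least one sweep at a time
    fi
    printf '%s\n' "$njob"
}

# Example, using the coreutils "nproc" command to count cores:
#   njob="$(suggest_njob "$(nproc)" 4)"
#   xia2 ... multiprocessing.mode=parallel multiprocessing.njob="$njob" multiprocessing.nproc=4
```

On a shared processing machine it may be considerate to pass a smaller core count than the machine actually has.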
    


2. First round of sweep selection in case of failure


The automatic.xinfo file is generated when xia2 is first launched (in step 1). It is advisable not to overwrite this version but to create a working copy:



cp automatic.xinfo work.xinfo




The file can then be edited as desired using any text editor. (geany, which has a user-friendly GUI, is installed on all pxproc machines.)


The most common edits to the *.xinfo file include changing the image ranges within a sweep or deleting a problematic sweep altogether; after editing and saving the file, xia2 can be rerun as follows:




xia2 --info=work.xinfo
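Deleting a sweep, as described above, can also be scripted rather than done by hand. The sketch below assumes that each sweep in the .xinfo file is delimited by "BEGIN SWEEP <name>" and "END SWEEP <name>" lines; check your own automatic.xinfo before relying on it. drop_sweep is a hypothetical helper name.

```shell
# Print an .xinfo file with one sweep block removed.
# Assumes sweeps are delimited by "BEGIN SWEEP <name>" / "END SWEEP <name>".
drop_sweep() {
    local xinfo="$1" sweep="$2"
    awk -v name="$sweep" '
        # Start skipping at the matching BEGIN line (exact name match,
        # so SWEEP1 does not also remove SWEEP12).
        $1 == "BEGIN" && $2 == "SWEEP" && $3 == name { skip = 1 }
        !skip { print }
        # Stop skipping after the matching END line.
        $1 == "END" && $2 == "SWEEP" && $3 == name { skip = 0 }
    ' "$xinfo"
}

# Example: write a new file rather than editing in place.
#   drop_sweep work.xinfo SWEEP9 > work_no_sweep9.xinfo
#   xia2 --info=work_no_sweep9.xinfo
```

Writing to a new file keeps the previous working copy available if the edit turns out to be a mistake.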




It is useful not only to delete the sweep on which xia2 failed, but also to note which sweeps were not successfully indexed. A telltale sign of such a failure can be found in the xia2 output (on screen and in the xia2.txt log file) and looks like this:




------------------- Autoindexing SWEEP8 --------------------
All possible indexing solutions:
tP  58.64  58.64 151.42  90.00  90.00  90.00
oC  82.86  82.92 151.40  90.00  90.00  90.00
oP  58.61  58.65 151.39  90.00  90.00  90.00
mC  82.69  82.84 151.21  90.00  89.91  90.00
mP  58.62  58.60 151.33  90.00  89.99  90.00
aP  58.52  58.56 151.21  89.95  89.95  90.07
Indexing solution:
tP  58.64  58.64 151.42  90.00  90.00  90.00
-------------------- Integrating SWEEP8 --------------------
Processed batches 3 to 22
Standard Deviation in pixel range: 0.47 0.55
Integration status per image:
oooooooo............
"o" => good        "%" => ok        "!" => bad rmsd
"O" => overloaded  "#" => many bad  "." => weak
"@" => abandoned
Mosaic spread: 0.060 < 0.060 < 0.060
-------------------- Spotfinding SWEEP9 --------------------
184 spots found on 20 images (max 17 / bin)
**
**
**    *
**    **
**** ***   *
**** *** * *     ***
******** **** ******
********************
********************
********************
2      image      21
------------------- Autoindexing SWEEP9 --------------------
Processing sweep SWEEP9 failed: No suitable indexing solution could be found.

You can view the reciprocal space with:
dials.reciprocal_lattice_viewer /folder/to/processing/DEFAULT/NATIVE/SWEEP9/index/108_optimised.expt /folder/to/processing/DEFAULT/NATIVE/SWEEP9/index/106_SWEEP9_strong.refl




In the above fragment of the log, notice that Sweep 8 indexed and integrated normally; conversely, Sweep 9 failed to index, with an error message. Based on this information, one could edit work.xinfo to remove Sweep 9 from further processing.
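With many sweeps, scanning the log by eye is tedious. A minimal sketch for listing the failed sweeps follows; it relies only on the "Processing sweep ... failed:" line format shown in the log fragment above, and list_failed_sweeps is a hypothetical helper name.

```shell
# List the names of sweeps that xia2 reported as failed.
# Matches log lines of the form "Processing sweep SWEEP9 failed: ...".
list_failed_sweeps() {
    local log="${1:-xia2.txt}"
    sed -n 's/^Processing sweep \([^ ]*\) failed:.*/\1/p' "$log" | sort -u
}

# Example:
#   list_failed_sweeps xia2.txt
```

Each name printed is a candidate for removal from work.xinfo.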



One could also try and see how the processing would work with more information about the crystal parameters extracted from this initial attempt.



3. Re-run with space group and unit cell information


Inspecting the output, one could see that quite a few of the sweeps (if not all of them) are indexed in the same symmetry, with highly similar unit cell parameters, e.g.:




------------------- Autoindexing SWEEP8 --------------------
All possible indexing solutions:
tP  58.64  58.64 151.42  90.00  90.00  90.00
oC  82.86  82.92 151.40  90.00  90.00  90.00
oP  58.61  58.65 151.39  90.00  90.00  90.00
mC  82.69  82.84 151.21  90.00  89.91  90.00
mP  58.62  58.60 151.33  90.00  89.99  90.00
aP  58.52  58.56 151.21  89.95  89.95  90.07
Indexing solution:
tP  58.64  58.64 151.42  90.00  90.00  90.00




It is possible to re-run xia2 with these specific crystal parameters, which will aid the processing by stabilizing the indexing of datasets and improving the chances of success.




xia2 --info=work.xinfo --failover=True --read_all_image_headers=False space_group=P4 unit_cell=58.64,58.64,151.42,90.00,90.00,90.00




(note the lack of dashes before space_group and unit_cell options!)
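To see which lattice and cell the sweeps agree on, the chosen indexing solutions can be collected from the log. The sketch below assumes that the chosen solution is always the line directly after "Indexing solution:", as in the fragment above; indexing_solutions is a hypothetical helper name. The space group itself still has to be chosen by the user.

```shell
# Print the chosen indexing solution of every sweep as
# "<lattice> unit_cell=a,b,c,alpha,beta,gamma".
# Assumes the solution is the line right after "Indexing solution:".
indexing_solutions() {
    local log="${1:-xia2.txt}"
    awk '
        found {
            printf "%s unit_cell=%s,%s,%s,%s,%s,%s\n", $1, $2, $3, $4, $5, $6, $7
            found = 0
        }
        /^Indexing solution:/ { found = 1 }
    ' "$log"
}

# Example: count how often each lattice was chosen.
#   indexing_solutions xia2.txt | cut -d' ' -f1 | sort | uniq -c
```

The unit_cell= part of a representative line can be pasted directly onto the xia2 command line.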




4. Inspecting statistics by batch (sweep)


By this point, some iteration of steps 2-3 should produce a completed xia2 run, which will present the summary output on screen and write it out to the xia2.txt log file. Parts of this output can be inspected to see if any sweeps should be omitted, usually because they decrease the quality of the merging statistics for the entire dataset.

One useful piece of information is the estimated resolutions for each sweep, which can be found towards the end of the processing run, and look like this:





--------------------- Scaling DEFAULT ----------------------
Resolution for sweep NATIVE/SWEEP7: 1.78 (cc_half > 0.3)
Resolution for sweep NATIVE/SWEEP15: 1.76 (cc_half > 0.3)
* Resolution for sweep NATIVE/SWEEP11: 2.39 (cc_half > 0.3)
Resolution for sweep NATIVE/SWEEP3: 1.77 (cc_half > 0.3)
Resolution for sweep NATIVE/SWEEP5: 1.59 (cc_half > 0.3)
Resolution for sweep NATIVE/SWEEP4: 1.53 (cc_half > 0.3)
Resolution for sweep NATIVE/SWEEP2: 1.71 (cc_half > 0.3)
* Resolution for sweep NATIVE/SWEEP9: 3.91 (cc_half > 0.3)
Resolution for sweep NATIVE/SWEEP14: 1.64 (cc_half > 0.3)
Resolution for sweep NATIVE/SWEEP10: 1.96 (cc_half > 0.3)
Resolution for sweep NATIVE/SWEEP1: 1.79 (cc_half > 0.3)
Resolution for sweep NATIVE/SWEEP6: 1.63 (cc_half > 0.3)
Resolution for sweep NATIVE/SWEEP13: 1.69 (cc_half > 0.3)
Resolution for sweep NATIVE/SWEEP12: 1.50 (cc_half > 0.3)
* Resolution for sweep NATIVE/SWEEP8: 0.00





Note that for most of the sweeps, the resolution varies between ~1.5Å and ~2Å, but three sweeps (which we marked with asterisks) are outliers. Two of them have resolutions of 2.4Å and 3.9Å, while the third is of such poor quality that its resolution is not even estimated (it is reported as 0.00).
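The same outliers can be picked out automatically. The sketch below reads the "Resolution for sweep" lines in the format shown above and prints every sweep whose resolution is worse than a chosen cutoff or was not estimated. The cutoff value and the helper name flag_low_resolution are assumptions for illustration.

```shell
# Print sweeps whose estimated resolution is worse (numerically larger)
# than the cutoff, or reported as 0.00 (not estimated).
flag_low_resolution() {
    local log="$1" cutoff="$2"
    awk -v cutoff="$cutoff" '
        /Resolution for sweep/ {
            # Locate the word "sweep" so a leading marker does not shift fields.
            for (i = 1; i < NF; i++) {
                if ($i == "sweep") { name = $(i + 1); res = $(i + 2) + 0 }
            }
            sub(/:$/, "", name)
            if (res == 0 || res > cutoff) printf "%s %.2f\n", name, res
        }
    ' "$log"
}

# Example: flag everything worse than 2.0 Angstrom.
#   flag_low_resolution xia2.txt 2.0
```

Whether a flagged sweep should actually be dropped still depends on how much multiplicity and completeness the remaining sweeps provide.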



xia2 will also create a xia2.html file, which can be used to view the results as charts. For this type of data, the most useful chart can be found under the "Dataset NATIVE" tab. ("NATIVE" is the default name for your dataset; you can change it in the work.xinfo file.)






This tab will show the Table 1 summary (same as the one under the "Summary" tab), the Xtriage warnings (if any), and below them, Analysis Plots. Clicking on "Analysis by batch" will expose several useful plots that will show if any sweeps act as obvious outliers. In the example dataset, the graph of Rmerge and scale vs. frame shows a concerningly high R_merge for Sweep 9.

(If sweep names overlap too much, mousing over any point of the chart will reveal which sweep the concerning R_merge belongs to.)




Because this dataset has high multiplicity, we can afford to edit the work.xinfo file to exclude Sweep 9 and re-run processing without it. The next run may reveal more sweeps with problematic statistics; they, too, can be removed from processing by editing the work.xinfo file. We can also remove sweeps with resolutions that are too low for our purposes, such as Sweeps 11 and 8, judging by the resolution list above. We can continue this process until we are satisfied with the merging statistics.




5. Improving the merged dataset with xia2.multiplex

xia2.multiplex [2] is a program that improves a multi-crystal dataset by performing symmetry analysis, re-scaling and re-merging, as well as analyzing the various pathologies that bedevil multi-crystal datasets, such as non-isomorphism, radiation damage, and cases of preferential crystal orientation (which results in missing cones of data). This program can be run on the entirety of the merged dataset from above, or on selected sweeps if that is desired.

xia2.multiplex accepts two types of files: the reflection table file (with the extension .refl) and a file with stored experiment metadata (with the extension .expt). These files (both for each sweep and for the entire dataset) can be found in the DataFiles subfolder of your xia2 processing directory. Thus, to run xia2.multiplex on the data you just processed, run:


xia2.multiplex DataFiles/AUTOMATIC_DEFAULT_scaled.expt DataFiles/AUTOMATIC_DEFAULT_scaled.refl




One could also select several sweeps to see if a better complete dataset can be assembled from those, e.g.:




xia2.multiplex DataFiles/AUTOMATIC_DEFAULT_NATIVE_SWEEP1.expt DataFiles/AUTOMATIC_DEFAULT_NATIVE_SWEEP1.refl DataFiles/AUTOMATIC_DEFAULT_NATIVE_SWEEP3.expt DataFiles/AUTOMATIC_DEFAULT_NATIVE_SWEEP3.refl DataFiles/AUTOMATIC_DEFAULT_NATIVE_SWEEP7.expt DataFiles/AUTOMATIC_DEFAULT_NATIVE_SWEEP7.refl




xia2.multiplex will also perform a Laue check of the point group for this crystal and ensure that the dataset is isomorphous. If some sweeps are found to be outliers (i.e. if multiple crystal forms are present), this information could be used to separate the datasets.
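Typing out an .expt/.refl pair for every selected sweep is error-prone. The sketch below builds the argument list from sweep numbers, assuming the per-sweep file naming seen in the command above (DataFiles/AUTOMATIC_DEFAULT_NATIVE_SWEEP<n>.expt and .refl); multiplex_args is a hypothetical helper name.

```shell
# Build the .expt/.refl argument list for a set of sweep numbers.
# The file prefix is passed in, so nothing about the naming is hard-coded.
multiplex_args() {
    local prefix="$1" out="" n
    shift
    for n in "$@"; do
        # Append "<prefix><n>.expt <prefix><n>.refl", space-separated.
        out="${out:+$out }${prefix}${n}.expt ${prefix}${n}.refl"
    done
    printf '%s\n' "$out"
}

# Example (the unquoted substitution is intentional, so that each file
# becomes a separate argument):
#   xia2.multiplex $(multiplex_args DataFiles/AUTOMATIC_DEFAULT_NATIVE_SWEEP 1 3 7)
```

This only works as written if the file paths contain no spaces.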




6. Sacrificing quantity for quality

It is more useful to monitor the multiplicity-independent R_pim rather than the multiplicity-dependent R_merge or R_meas, since the latter can be cosmetically manipulated by reducing multiplicity, i.e. throwing data away. However, in cases of multi-crystal datasets, it is sometimes useful to sacrifice some multiplicity in exchange for gains in overall resolution or for tangible statistical improvements. In cases of crystals that have high-symmetry space groups, it is very easy to amass a highly-redundant dataset and thus to have the option to pick out only the best (in terms of resolution, mosaicity, etc.) sweeps, rather than include everything.




xia2.multiplex performs hierarchical clustering of datasets to determine which sweeps best match up with one another. This approach is very useful in cases of non-isomorphism, as it very effectively separates different crystal forms.



Cos(angle) clustering summary:
=========  ==============  ===================  ========  ==============  ==============
  Cluster    No. datasets  Datasets               Height    Multiplicity    Completeness
=========  ==============  ===================  ========  ==============  ==============
        1               2  4 5                   7.7e-09             5              0.93
        2               2  7 9                   6.8e-05             1.7            0.83
        3               3  3 7 9                 0.00022             4.3            1
        4               2  1 8                   0.00067             4.4            0.99
        5               3  2 4 5                 0.00081             7.7            0.98
        6               4  0 3 7 9               0.0014              7.2            1
        7               5  1 2 4 5 8             0.006              11.9            1
        8               6  1 2 4 5 6 8           0.014              14.8            1
        9              10  0 1 2 3 4 5 6 7 8 9   0.023              22              1
=========  ==============  ===================  ========  ==============  ==============



Clustering can also be used to identify smaller subsets that may have better merging statistics than the full set of sweeps. In this particular case, the combination of datasets 0, 3, 7, and 9 (cluster 6) may be worth trying: the "Height" parameter jumps rather abruptly between cluster 6 and cluster 7, the overall multiplicity is still a respectable 7.2, and the dataset composed of those sweeps would be 100% complete.



Unfortunately, the numbering scheme of the datasets does not directly correlate with the numbering scheme of the sweeps. To match one to the other, open the xia2.multiplex.html file in a web browser, expand the "Datasets" item, and check which image filenames each dataset number corresponds to. Here, dataset 0 corresponds to files named "thaulmatin_1_#####.cbf", dataset 3 to "thaulmatin_13_#####.cbf", dataset 7 to "thaulmatin_15_#####.cbf", and dataset 9 to "thaulmatin_6_#####.cbf", that is, to Sweeps 4, 6, 7, and 12, respectively.




 Once the corresponding sweeps are identified, the work.xinfo file can be edited to restrict the processing to these sweeps alone. 


Tutorial with sample HEWL data
NOTE: In all steps "[username]" is a placeholder for your username. Make sure you replace all instances of it with your own username, otherwise nothing will work.

Step 1. Preparation


Log onto a pxproc machine that appears the least used at the moment; we recommend pxproc1-12, as they contain more CPUs than the others.
Go to your user data folder:
cd /data/[username]/
Either in your data folder or in a subfolder therein, create a xia2 tutorial subfolder:
mkdir xia2_tutorial
Go to that subfolder:
cd xia2_tutorial




Step 2. Initial xia2 run.


NOTE: the following tutorial includes multiple full runs of xia2, which might take a while depending on which system you run it on. If you'd like to shorten the experience somewhat, you can skip this step and go straight to step #3 of this tutorial.
Point xia2 at the sample HEWL data, accessible via a symlink, and run it in multiprocessing mode as above:
xia2 --failover=True --read_all_image_headers=False --multi_crystal=True /data/[username]/templates/xia2_sample_data/ multiprocessing.mode=parallel multiprocessing.njob=10 multiprocessing.nproc=4
Note that the xia2 job fails with an error. Note also that xia2 reports an indexing failure for SWEEP8; this did not terminate the entire job because of the --failover=True option. If failover were set to False, the job would stop right after the failed attempt to index SWEEP8.



Step 3. Re-running with known symmetry.


Inspect the xia2.txt log file; note how at some point xia2 had decided to use the lowest-symmetry monoclinic Bravais lattice and "eliminated" the solution that was initially found for each of the sweeps:

-------------------- Preparing DEFAULT ---------------------
Correct lattice asserted to be mP
Eliminating indexing solution:
tP  78.98  78.98  38.07  90.00  90.00  90.00


Let's insist that the initially obtained indexing solution is the correct one, seeing as it's found for every single successfully-indexed sweep, and re-run xia2 with crystal lattice parameters included.
xia2 --failover=True --read_all_image_headers=False --multi_crystal=True --space_group=P4 --unit_cell=79,79,38,90,90,90 /data/[username]/templates/xia2_sample_data/ multiprocessing.mode=parallel multiprocessing.njob=10 multiprocessing.nproc=4

Note that
    
    We are specifying the space group as "P4", which is the base space group for the primitive tetragonal lattice identified in the initial solution.
    The unit cell parameters specified aren't precise; they are, however, "close enough" for xia2 to know what we want from it.
    xia2 no longer brings up alternate indexing solutions, going with the information given to it.
    

Now the processing will conclude successfully. Inspect Table 1 from xia2.txt or terminal output. The statistics are not great, especially the merging R-factors; we can make them quite a bit better by manipulating the input.


Step 4. Removing unhelpful sweeps 

Look through the xia2.txt file; note that
    
    There is an error (Processing sweep SWEEP8 failed: xia2.integrate subprocess failed with exitcode 1) reported for Sweep 8.
    There is very dramatic deterioration of spotfinding results for Sweep 3.
    The resolution for Sweep 3 is notably lower than for other sweeps.
    

Open xia2.html using Firefox; click on the "Dataset NATIVE" tab. Note that
    
        There is a "serious warning" that the data may be assigned the wrong space group.
        There's a possible non-crystallographic symmetry.
        It's likely that, rather than NCS, the data have the higher symmetry of space group P422 instead of the P4 that we had used.
    

Clicking on "Analysis by batch" under "Analysis plots" will show several helpful plots. Inspect the R_merge vs. resolution plot and note the giant R_merge spike for Sweep 3.
Create a copy of the automatic.xinfo file:
cp automatic.xinfo work.xinfo
Edit the work.xinfo file to remove Sweep3 and Sweep8 from processing.
Also in work.xinfo, change the space group (on top of file) from "P 4" to "P 4 2 2" to reflect the suggestion in the xia2.html file.
If you wish to later compare the statistics, rename the current xia2.txt and xia2.html files, which otherwise will be overwritten.
Re-run xia2 using the edited work.xinfo:
xia2 --failover=False --read_all_image_headers=False --multi_crystal=True --info=work.xinfo multiprocessing.mode=parallel multiprocessing.njob=10 multiprocessing.nproc=4
Inspect the results in xia2.txt and xia2.html files. Note the major improvements to merging statistics as a result of removing forty (40) images from the dataset and using a more appropriate space group.
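The renaming suggested in the steps above (keeping the earlier xia2.txt and xia2.html for comparison) has to happen before the re-run. A small sketch of such a helper follows; keep_reports and the tag-based naming are assumptions for illustration.

```shell
# Rename xia2.txt and xia2.html in the current directory by adding a tag,
# e.g. xia2.txt -> xia2_P4_all_sweeps.txt, so a re-run cannot overwrite them.
keep_reports() {
    local tag="$1" f
    for f in xia2.txt xia2.html; do
        if [ -e "$f" ]; then
            # "${f%.*}" is the name without extension, "${f##*.}" the extension.
            # mv -n refuses to overwrite an already existing tagged copy.
            mv -n "$f" "${f%.*}_${tag}.${f##*.}"
        fi
    done
}

# Example, run in the xia2 processing directory before re-running xia2:
#   keep_reports P4_all_sweeps
```

The two versions of xia2.txt can then be compared side by side, for example with diff.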



Step 5. Refining the merged dataset with xia2.multiplex

Run xia2.multiplex with the results of this latest processing run:
xia2.multiplex DataFiles/AUTOMATIC_DEFAULT_scaled.expt DataFiles/AUTOMATIC_DEFAULT_scaled.refl
Note: the inputs to xia2.multiplex are
    
    File with expt extension containing information about the experiment
    File with refl extension containing the reflection table 
    

Inspect the xia2.multiplex.log file and find the "overall merging statistics". Note the improvement vs. the result of the previous xia2 processing run.
The xia2.multiplex.html file can be inspected in Firefox to see more helpful statistics and charts.




REFERENCES

[1] Winter, G. (2010). J. Appl. Cryst. 43, 186–190.
[2] Gildea, R. J., et al. (2022). Acta Cryst. D78, 752–769.