CCP4 Program Suite: blend

BLEND (CCP4: Supported Program)

NAME

blend
- management and processing of multiple crystals / multiple data sets

SYNOPSIS

blend -a foo_in.dat or /path/to/data

blend -aDO foo_in.dat or /path/to/data

blend -s cut_level_high [cut_level_low]

blend -sLCV cut_LCV_level_high [cut_LCV_level_low]

blend -saLCV cut_absolute_LCV_level_high [cut_absolute_LCV_level_low]

blend -c d1 d2 d3 d4 ... or [d1] [d2] [[d3-d4]] ..., etc.

blend -cF d1 d2 d3 d4 ... or [d1] [d2] [[d3-d4]] ..., etc.

blend -cP d1 d2 d3 d4 ... or [d1] [d2] [[d3-d4]] ..., etc.

blend -g D clN N

blend -g DO clN N

Description
Input and output files
Keyworded input
Miscellaneous and problems
References
Authors and credits
How to cite BLEND

DESCRIPTION

X-ray data collection from a single crystal is not always feasible. Very often crystallographers collect data from multiple crystals, or from multiple locations on a single crystal. The resulting datasets are normally incomplete, or show low redundancy, and none of them, used individually, allows reliable phasing, model building or refinement. It is potentially better to merge the different datasets into more complete ones. Besides solving the incompleteness issue, merging produces datasets which, although inherently less precise, tend to be more accurate, because systematic errors coming from different sources (multiple crystals) interfere destructively. This, in turn, translates into better-quality structure factors, with positive effects on phasing, model building and refinement. High redundancy can also increase the anomalous signal, when this is needed.

BLEND is a program for the management of multiple datasets. It simplifies the analysis and greatly reduces the combinatorial explosion involved in the formation of multiple groups from the original set of data. The program essentially runs in three different modes, each one including variants. In the analysis mode (option -a) it reads in multiple unmerged reflection files produced by an integration program (either MOSFLM [2], XDS [3] or DIALS [5]), and carries out cluster analysis on one or two types of statistical descriptors, extracted or calculated from each dataset. There is a variant for the analysis mode (option -aDO) which is a lighter version of the analysis mode. It should be used when there is no interest in creating multiple-datasets files, but only in observing clustering based on cell parameters. Results produced by the program in analysis mode can be used during runs in synthesis mode (option -s), or in combination mode (option -c). In synthesis mode datasets belonging to clusters previously determined are scaled together and output into individual merged reflection files in MTZ format, ready to be used in all subsequent stages of phasing, model building and refinement. Variants of the synthesis mode are the synthesis mode using LCV values and the synthesis mode using absolute LCV values (see later for meaning of LCV and absolute LCV). The combination mode (option -c) allows users to carry out the same tasks enabled by the synthesis mode, this time for any combination of datasets, not necessarily those grouped in clusters. Variants exist also for the combination mode, when automated datasets deletion or individual dataset pruning is required.

An additional mode, the graphics mode (option -g), has been added to the three main modes listed above, so as to provide some visual tools that facilitate data analysis.

Input preparation

Input for the program is a group of unmerged reflection files produced by integration programs. At present only files from the MOSFLM, XDS and DIALS integration software can be handled. Reflection files can either be collected in a single directory, or be spread across several directories. In the first scenario, only the path to the directory needs to be fed into the program. In the second scenario, the paths to all individual files have to be listed in a single ASCII file, which is then fed into BLEND. Program execution is controlled by keywords passed as standard input, as is generally the case for CCP4 programs. If no keywords are passed to the program, default values and / or procedures for the parameters connected to the keywords will be used. More on BLEND keywords later.

Some problems arising in connection with input preparation can be found in the section Miscellaneous and problems.

Running the program in analysis mode

Once the input is ready, BLEND can be executed in analysis mode. All multiple datasets will be analysed individually and tested for overall radiation damage. If any dataset is thought to be significantly affected, parts of it will be removed. The amount of data to be removed can be controlled by the keyword RADFRAC. Values range between 0 and 1; 0 means keeping everything while 1 means removing all affected parts of data. The default value is 0.75, which essentially tells the program to remove all reflections whose intensity, on average, has been dampened by radiation damage of more than 25% of its true value.

Input files for BLEND in analysis mode contain integrated (not scaled) data. They can be either MTZ files produced by MOSFLM or DIALS, or ASCII files produced by XDS ("INTEGRATE.HKL"). If these files are stored within a single directory, then simply type:

         blend -a /where/integrated/data/are/store

If files are spread across several directories, then the user will have to create an ASCII file with all files (and their exact paths) listed one after the other. The content of one such file, which we will arbitrarily name "original.dat", could for instance look like the following:

         /home/joe/data/xtal1/xl-d01.mtz
         /home/joe/data/xtal1/xf-d03.mtz
         /home/joe/data/xtal5/xl-d12.mtz
         /home/joe/data/xtal12/INTEGRATE.HKL
         /home/joe/data/xtal13/INTEGRATE.HKL

In this case BLEND can be executed as follows:

         blend -a original.dat
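
When data are spread over many directories, the list file can be generated rather than typed by hand. The following is a minimal Python sketch, not part of BLEND: the function name `write_blend_list` is hypothetical, and it assumes that candidate files are those ending in ".mtz" or named "INTEGRATE.HKL" somewhere below a root directory.

```python
from pathlib import Path


def write_blend_list(root, list_file="original.dat"):
    """Write one absolute path per line for each candidate reflection file under root."""
    root = Path(root)
    # Assumption: integrated data are "*.mtz" or "INTEGRATE.HKL" files.
    # This also picks up scaled or merged MTZ files, so check the list before use.
    found = sorted(
        p.resolve()
        for p in root.rglob("*")
        if p.is_file() and (p.suffix.lower() == ".mtz" or p.name == "INTEGRATE.HKL")
    )
    Path(list_file).write_text("".join(f"{p}\n" for p in found))
    return found
```

The resulting file can then be passed to `blend -a` exactly as in the command above.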

Several files will be produced by the program in analysis mode (see Input and Output files section). Those describing the clustering of datasets are "tree.png" (with a postscript version, "tree.ps") and "CLUSTERS.txt". Other files are needed for bookkeeping. The important binary file "BLEND.RData" contains essential information needed by the program to run in synthesis mode; it must not be deleted.

Variants of run in analysis mode

The execution time of BLEND in analysis mode can be substantially increased by the analysis of individual datasets and by the procedure for early detection of radiation damage. Quicker runs can be achieved if the dendrogram based on cell parameters is all that is required. In this case the input is the same as for the standard analysis mode, but the command line includes "-aDO" rather than "-a". Some of the output produced by the simple analysis mode will not be present if the variant "-aDO" is used.

Running the program in synthesis mode

By running BLEND in synthesis mode the user aims at producing new datasets out of two or more individual datasets. Each node in the dendrogram can give rise to a scaled dataset. The easiest option for the user is to force BLEND to produce scaled datasets for all nodes in the dendrogram. This is, though, also the lengthiest option, because the user might only be interested in part of the nodes, for example those relating to tighter clusters. In order to single out only part of the nodes in the dendrogram one or two numerical levels need to be provided for execution. Consider for instance a case corresponding to the following "CLUSTERS.txt" file, describing a dendrogram with 13 clusters:

          Cluster     Number of         Cluster         LCV      aLCV      Datasets
           Number      Datasets          Height                            ID
           
              001             2           0.173        0.03      0.02      5 6 
              002             2           0.242        0.01      0.01      9 13
              003             2           0.433        0.05      0.05      3 14
              004             2           0.518        0.08      0.07      8 10
              005             3           0.610        0.05      0.04      7 9 13
              006             4           0.702        0.13      0.11      12 7 9 13
              007             4           0.744        0.11      0.09      5 6 8 10
              008             3           0.982        0.17      0.14      4 3 14
              009             4           1.297        0.19      0.17      2 4 3 14
              010             5           1.623        0.28      0.23      11 12 7 9 13
              011             5           2.711        0.48      0.39      1 2 4 3 14
              012             9           3.343        0.44      0.30      5 6 8 10 11 12 7 9 13
              013            14          13.670        1.04      0.84      1 2 4 3 14 5 6 8 10 11 12 7 9 13

To create merged files out of all nodes below height 4 in the dendrogram we type:

         blend -s 4

This will produce 12 new datasets: the one corresponding to node (5+6), the one corresponding to node (9+13), the one corresponding to node (3+14), etc. To produce datasets for all nodes, simply type:

         blend -s 14

because the whole dendrogram is below 14 (the top height is 13.670). Suppose one needs to merge only data sets 1, 2, 4, 3, 14, because they form a rather tight cluster. With:

         blend -s 3

these data sets will be merged, but so will data sets 11, 12, 7, 9, 13, data sets 2, 4, 3, 14, and so on, because they all happen to correspond to nodes at heights lower than 3. Given that 1, 2, 4, 3, 14 form a cluster at a height of exactly 2.711, selecting two levels, one higher and one lower than 2.711, means that a scaled file will be calculated for this cluster only. For example:

         blend -s 2.712 2.710

These two numbers are arbitrary numbers that fall just above and below 2.711, and that do not include values for any other node in the dendrogram. It is important to notice that when using two values it is compulsory to type the largest one first.
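
The effect of the cut levels can be checked before running the program. The sketch below is illustrative only: it hard-codes the heights from the example "CLUSTERS.txt" above and assumes that a node is selected when its height lies strictly between the low and the high level (whether BLEND treats the boundaries as inclusive is not stated here). The names `HEIGHTS` and `select_nodes` are hypothetical.

```python
# Cluster heights copied from the example "CLUSTERS.txt" above.
HEIGHTS = {1: 0.173, 2: 0.242, 3: 0.433, 4: 0.518, 5: 0.610, 6: 0.702, 7: 0.744,
           8: 0.982, 9: 1.297, 10: 1.623, 11: 2.711, 12: 3.343, 13: 13.670}


def select_nodes(heights, high, low=0.0):
    """Return the cluster numbers whose height lies strictly between low and high."""
    if low > high:
        raise ValueError("the larger cut level must be given first")
    return sorted(n for n, h in heights.items() if low < h < high)


print(select_nodes(HEIGHTS, 4))             # clusters 1 to 12
print(select_nodes(HEIGHTS, 14))            # all 13 clusters
print(select_nodes(HEIGHTS, 2.712, 2.710))  # cluster 11 only
```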

Variants of run in synthesis mode

As previously mentioned, two variants of synthesis mode are available, synthesis mode using LCV values and synthesis mode using absolute LCV values (see later). In this case cluster selection will use LCV and absolute LCV values, rather than cluster heights (many thanks to Alkistis Mitropoulou for suggesting these variants!).

Running the program in combination mode

Cluster analysis produces a grouping of all datasets in several clusters. This makes it feasible to carry out a limited number of merging and scaling jobs among the huge number of possible dataset combinations, thus saving processing time. Clustering, though, introduces limitations, because the user is forced to calculate only datasets corresponding to nodes in the dendrogram. For example, referring to the dendrogram described previously, there is no way we could obtain scaled data out of the union of data sets 1, 4, 11 and 13, because there is no node corresponding to this combination. Such a limitation can be overcome by running the program in combination mode. In this specific case, simply type:

         blend -c 1 4 11 13

When the number of data sets and clusters is large it is very tedious to type or even cut and paste the long string of numbers forming the combination. For this reason an ad hoc syntax has been created to include groups of numerically-contiguous data sets, whole clusters or groups of clusters, and to exclude individual data sets or groups of data sets. The syntax is made up of the following rules:

         "[]"        a single-square bracket including one or more numbers means all data
                     sets in the clusters corresponding to those numbers.

         "[[]]"      a double-square bracket including one or more numbers indicates that all
                     data sets corresponding to those numbers are to be removed from the final group.

         "-"         a hyphen (minus sign) between two numbers indicates all integers between the two
                     numbers. If the first number is greater than the second, the selection is ignored.

         ","         commas between numbers are sometimes needed to separate data sets or clusters,
                     if they are inside single or double-square brackets.

         EXAMPLES (all referring to the dendrogram previously described).

         1) Combine cluster 2 with cluster 4:

                                               blend -c [2] [4]               equivalent to       blend -c 8 9 10 13 

         2) All data sets in cluster 12, with the exception of data sets 7 and 11:

                                               blend -c [12] [[7,11]]         equivalent to       blend -c 5 6 8 9 10 12 13         or    blend -c 5 6 8-10 12 13

         3) Clusters 1 and 9, with the exception of data set 14, and with the addition of data sets 1 and 7:

                                               blend -c [1] [9] [[14]] 1 7    equivalent to       blend -c 1 2 3 4 5 6 7            or    blend -c 1-7
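
The syntax rules above can be tried out with a small Python sketch. This is not BLEND's own parser: the names `CLUSTERS`, `numbers` and `expand` are hypothetical, the cluster contents are copied from the example "CLUSTERS.txt", and it is assumed that exclusions are applied after all inclusions, whatever their position on the command line.

```python
import re

# Cluster contents copied from the example "CLUSTERS.txt" shown earlier.
CLUSTERS = {1: [5, 6], 2: [9, 13], 3: [3, 14], 4: [8, 10], 5: [7, 9, 13],
            6: [12, 7, 9, 13], 7: [5, 6, 8, 10], 8: [4, 3, 14], 9: [2, 4, 3, 14],
            10: [11, 12, 7, 9, 13], 11: [1, 2, 4, 3, 14],
            12: [5, 6, 8, 10, 11, 12, 7, 9, 13],
            13: [1, 2, 4, 3, 14, 5, 6, 8, 10, 11, 12, 7, 9, 13]}


def numbers(text):
    """Expand '7,11' or '8-13' into a list of integers; reversed ranges give nothing."""
    out = []
    for item in re.split(r"[,\s]+", text.strip()):
        if not item:
            continue
        if "-" in item:
            first, last = (int(x) for x in item.split("-"))
            out.extend(range(first, last + 1))  # empty when first > last
        else:
            out.append(int(item))
    return out


def expand(args, clusters=CLUSTERS):
    """Turn command-line items into the sorted list of data sets they select."""
    keep, drop = set(), set()
    for arg in args:
        if arg.startswith("[[") and arg.endswith("]]"):
            drop.update(numbers(arg[2:-2]))      # data sets to exclude
        elif arg.startswith("[") and arg.endswith("]"):
            for c in numbers(arg[1:-1]):         # whole clusters to include
                keep.update(clusters[c])
        else:
            keep.update(numbers(arg))            # individual data sets or ranges
    return sorted(keep - drop)


print(expand(["[2]", "[4]"]))                      # [8, 9, 10, 13]
print(expand(["[1]", "[9]", "[[14]]", "1", "7"]))  # [1, 2, 3, 4, 5, 6, 7]
```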

The ability to create scaled data out of any desired combinations confers flexibility to the program.

Variants of run in combination mode

Filtering of the individual datasets forming a cluster can also be carried out in an automated way, using a variant of the combination mode called filtering (option "-cF"). In this variant individual datasets are discarded one at a time, based on Rmerge values, until a pre-defined or user-defined overall data completeness is reached. At each cycle the dataset with the highest overall Rmerge is discarded. The process terminates either when the maximum number of cycles is reached (default 5, keyword MAXCYCLE), or when removal of a dataset causes overall completeness to drop below a target value (default 95%, keyword COMPLETENESS). At the end of the process, results from the cycle with the lowest overall Rmerge are selected for output.

It is also possible to remove the terminal part (a given number of images) of each dataset forming a cluster, in an attempt to reduce the part of data affected by radiation damage. This variant is named pruning (option "-cP"). At each cycle the overall number of images that can be removed while keeping completeness at a specific value (default 95%, keyword COMPLETENESS) is counted and displayed. This overall number is partitioned across all datasets forming the specific cluster. Removal is applied to the individual dataset with the highest overall Rmerge in the current cycle. The number of images (always at the end of the rotation sweep) to be removed from this dataset is a fraction of the allocated share (default 0.5, keyword CUTFRACTION). The process terminates when the maximum number of cycles is reached (default 5, keyword MAXCYCLE), when the target completeness is reached (default 95%, keyword COMPLETENESS), or when deletion of the next chunk of images would completely remove one specific dataset. In this last case it is suggested to re-run BLEND (still with the pruning variant, if wished) without that specific dataset.
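
The logic of the filtering variant described above can be summarised with a short Python sketch. It is a reading of the text, not BLEND's code: `filter_datasets` and the `scale` callback are hypothetical names, and `scale` stands in for a POINTLESS/AIMLESS run that returns the overall Rmerge, the overall completeness and a per-dataset Rmerge table. Stopping when a single dataset is left is an additional assumption.

```python
def filter_datasets(datasets, scale, completeness=95.0, maxcycle=5):
    """Drop the worst dataset at each cycle; return (best overall Rmerge, its datasets).

    scale(group) is a hypothetical callback returning
    (overall_rmerge, overall_completeness, {dataset_id: rmerge}).
    """
    current = list(datasets)
    overall, _, per_dataset = scale(current)
    best_rmerge, best_group = overall, list(current)
    for _cycle in range(maxcycle):
        if len(current) < 2:          # assumption: never remove the last dataset
            break
        worst = max(current, key=lambda d: per_dataset[d])
        trial = [d for d in current if d != worst]
        overall, cmpl, per_dataset = scale(trial)
        if cmpl < completeness:       # this removal costs too much completeness
            break
        current = trial
        if overall < best_rmerge:     # remember the cycle with the lowest Rmerge
            best_rmerge, best_group = overall, list(current)
    return best_rmerge, best_group
```

The defaults mirror the keywords COMPLETENESS (95.0) and MAXCYCLE (5).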

Running the program in graphics mode

BLEND graphics mode has been implemented as a visual aid to assist in the selection and filtering of datasets in clusters. This is especially the case when dealing with several datasets because the dendrogram is densely populated and it is difficult to understand clusters composition. Furthermore, the graphics mode yields so-called annotated dendrograms with numbers or descriptions included in the tree. These are particularly useful to work out which clusters are good for a specific purpose, or which datasets are affecting a given group or cluster negatively.

The graphics mode is executed using the "-g" option, followed by an uppercase character indicating the type of graphics to be produced. The annotated dendrograms produced in graphics mode have the tree nodes at heights different from Ward's heights. The nodes are, instead, placed at integer levels, 1, 2, 3, ..., corresponding to merging levels. In the first level are located all nodes corresponding to the union of two datasets. At level two we find nodes corresponding to clusters with 3 datasets; these can be formed by the union of a cluster at level one and a dataset. At level three nodes corresponding to clusters with four datasets are placed. Higher and higher levels are formed with the inclusion of more datasets. The convenience of using levels, rather than Ward's height, is in highlighting similarities and differences among clusters with the same number of datasets. Another convenience of displaying dendrograms using levels rather than Ward's heights is that for densely populated dendrograms it is possible to fix the number of levels down the specific dendrogram; in this case only part of the dendrogram will be displayed. This action is roughly equivalent to zooming in the dendrogram around the specified cluster. All graphics files produced in graphics mode are stored in the directory "graphics".

Listed below are the possible graphics types with the related command lines:

type "DO"
An annotated dendrogram is produced when using the "DO" type. The annotation here includes cluster number and aLCV value. This annotated dendrogram is produced with the following command line:
```
         blend -g DO clN N
```
where clN is the cluster number (default is the top cluster), while N is the level of detail (how many levels to display, including that of cluster clN). An example could be,
```
         blend -g DO 12 3 
```
which will produce the annotated dendrogram showing cluster 12 and two more levels of clusters below it. The "DO" type can be used after running BLEND in analysis mode (both the simple and the dendrogram-only variants).

type "D"
An annotated dendrogram is produced when using the "D" type. The annotation here includes Rmeas, completeness and resolution, the latter estimated as the value at which CC1/2 = 0.3 (see AIMLESS log file). This type of graphics can only be produced after having run BLEND in synthesis mode. The command line has the same form as the one used for the "DO" type:
```
         blend -g D clN N
```

INPUT AND OUTPUT FILES

Input

BLEND can read unscaled reflection files in mtz format, or ASCII files in XDS format.
MTZ files contain, typically, integrated intensities as processed by MOSFLM [2] or DIALS [5]. XDS files are the unscaled integrated data ("INTEGRATE.HKL") produced by XDS [3].
Input can be either a file (no fixed name, but indicated here as "foo_in.dat") or a directory:

foo_in.dat (file): Each line of this ASCII file is the path to a valid unscaled reflection file, to be processed by the program

/path/to/a/valid/directory/ (directory): All valid unscaled reflection files in this directory will be processed by the program

Output

Execution of BLEND in different modes implies different output files.

(a) From analysis mode:
( WARNING!!! Files "FINAL_list_of_files.dat" and "BLEND.RData" will be slightly different from what is described below when the dendrogram-only variant (-aDO) is used. In particular, file "BLEND.RData" is, in this case, called "BLEND0.RData" )

BLEND_SUMMARY.txt: is an ASCII file with tabulated information for all datasets being processed. Each dataset is given a serial number and this same number is used throughout the whole statistical analysis
mtz_names.dat: is simply the list of files read in by BLEND. If a previous list was already present (because created by the user), this new list is a copy of it, with invalid files removed
xds_files: is a directory containing files in MTZ format, in those cases where integrated data are in XDS format. The MTZ files are obtained from the XDS files using POINTLESS. The newly created MTZ files have names like "dataset_xxx.mtz", where xxx is a number. See also "xds_lookup_table.txt"
xds_lookup_table.txt: this file can be checked in order to keep track of the original XDS files. If no XDS files are involved, neither xds_files nor xds_lookup_table.txt will be created. All logs produced by POINTLESS when converting XDS files into MTZ format will also be dumped in the xds_files directory
tree.png, tree.ps: are graphics files in PNG and POSTSCRIPT format, showing the dendrogram derived from cluster analysis of all input datasets. Individual objects (datasets) are recognizable through their serial number. If the number of datasets is relatively low (15-20 max), the dendrogram can be interpreted quite easily. For larger numbers it might be easier to refer to the ASCII counterpart of the dendrogram, which is the file called "CLUSTERS.txt"
CLUSTERS.txt: In this file the exact numerical value of the height of each merging node in the dendrogram is also reported, a feature useful for running BLEND in synthesis mode. The dendrogram is the most important outcome of BLEND analysis: the user takes decisions on merged data based on his/her interpretation of it. A "Linear Cell Variation" number is also reported. The Ward distance used to measure cluster mergings indicates the overall loss in cell variability when the number of merged datasets is increased. As cell parameter values are normalised and rotated through principal component analysis, their numerical value in the dendrogram is not immediately related to real cell variation. Therefore it is not possible to get a feeling for structural isomorphism using the Ward distance. To help with this issue a parameter directly related to unit cell differences has been introduced, the Linear Cell Variation (LCV). LCV measures the maximum linear increase or decrease of the diagonals on the 3 independent cell faces. Values below 1% in general indicate a good degree of isomorphism among different crystals. Structural differences start to be noticeable with LCV greater than 1.5%. A value in angstroms associated with LCV is provided by the absolute Linear Cell Variation (aLCV), presented together with LCV in both "CLUSTERS.txt" and the dendrogram. The isomorphism issue will, obviously, have to be considered jointly with the available data resolution
FINAL_list_of_files.dat: is an ASCII file reporting the number of batches kept and the highest resolution recommended for each dataset analysed by BLEND. Batches can be discarded because intensities in them are deemed to be severely affected by radiation damage (see keyword RADFRAC to control the amount of discarded data). The highest recommended resolution is a rough estimate of where data should be cut, if the user wishes the signal-to-noise ratio for the average intensity to be greater than a given value. This value is provided by the user with the keyword ISIGI, followed by a numerical value; the default value is 1.5. The "FINAL_list_of_files.dat" file has 6 columns. The first is the path to the input file, the second is the serial number assigned by BLEND (and used in cluster analysis), the third is the image number after which data are discarded because weakened by radiation damage, the fourth and fifth are the initial and final input image numbers, and the sixth is the resolution cutoff
BLEND.RData: is a binary file produced by the R code. It stores essential information used by all runs of BLEND in synthesis and combination modes
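
The Linear Cell Variation reported in "CLUSTERS.txt" can be illustrated with a small Python sketch. The exact formula used by BLEND is not given in this document, so the following are assumptions: each face diagonal is taken from the law of cosines with the supplementary cell angle, LCV is the largest relative difference (in percent, normalised by the smaller diagonal) over all pairs of cells and all three faces, and aLCV is the largest absolute difference in angstroms. The names `face_diagonals` and `cell_variation` are hypothetical.

```python
from itertools import combinations
from math import cos, radians, sqrt


def face_diagonals(cell):
    """Return one diagonal for each of the three independent cell faces."""
    a, b, c, alpha, beta, gamma = cell
    # Assumption: diagonal opposite the supplementary angle of each face.
    d_ab = sqrt(a * a + b * b + 2 * a * b * cos(radians(gamma)))
    d_ac = sqrt(a * a + c * c + 2 * a * c * cos(radians(beta)))
    d_bc = sqrt(b * b + c * c + 2 * b * c * cos(radians(alpha)))
    return d_ab, d_ac, d_bc


def cell_variation(cells):
    """Return (LCV in percent, aLCV in angstroms) for a list of unit cells."""
    lcv = alcv = 0.0
    for cell1, cell2 in combinations(cells, 2):
        for d1, d2 in zip(face_diagonals(cell1), face_diagonals(cell2)):
            diff = abs(d1 - d2)
            alcv = max(alcv, diff)
            lcv = max(lcv, 100.0 * diff / min(d1, d2))
    return lcv, alcv
```

Under these assumptions, two cells that differ by a uniform 1% expansion of all cell lengths give an LCV of about 1%.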

(b) From synthesis mode:

merged_files: all files produced by BLEND when executed in synthesis mode are stored within this directory, which is created if not already present, or is deleted and recreated if already present. Thus, it is important to rename this directory if more than one run of BLEND in synthesis mode is executed. This is taken care of when BLEND is executed with the CCP4 GUI
copies_of_reference_files [optional]: if a reference file is used (keyword DREF) and if such reference file is a dataset of one of the clusters processed in BLEND, then a directory with this name is created and the reference file copied into it. The reason why this is necessary is connected with how POINTLESS works with its keyword HKLREF and with reference files. Essentially the file pointed at by HKLREF cannot be the same file pointed at by any HKLIN entry. By using a copy of the reference file, HKLREF is never going to point at the same file as any of those corresponding to HKLIN.
MERGING_STATISTICS.info (inside directory "merged_files"): is an ASCII file, essentially a table listing overall merging statistics for all merged datasets produced by the specific run of BLEND. It includes Cluster number, Rmeas, Rpim, Completeness, Multiplicity, Lowest Resolution and Highest Resolution. The table is sorted according to the Rmeas column, from its lowest to its highest value. If scaling with AIMLESS has failed for some reason, NA's are inserted in the corresponding rows. This table should make it easy for the user to select the desired merged dataset, in terms of completeness, multiplicity and data quality
Rmeas_vs_Cmpl.png, Rmeas_vs_Cmpl.ps (inside directory "merged_files"): a plot of all merged datasets in terms of Rmeas vs Completeness, both as PNG and PS graphics file
CLUSTERS.info (inside directory "merged_files"): is an ASCII file listing names and number of batches of each individual dataset composing specific clusters
unscaled_001.mtz, unscaled_002.mtz, ... (inside directory "merged_files"): are unscaled files in mtz format. There are as many of these files as the number of nodes selected by the user in the execution of BLEND in synthesis mode. The number associated with each file name coincides with the cluster (or node) number. Before scaling a dataset obtained by the collation of individual datasets, all of them need to have the same space group and the same indexing (if they belong to polar groups). Also, individual images need to have unique numbers. Furthermore, some datasets can have some images discarded and resolution limited. All this bookkeeping is taken care of by a script calling POINTLESS which, by default, assigns the most likely space group. This can be changed by using the keyword CHOOSE SPACEGROUP followed by the space group name (e.g. P 21 21 21, C 2, etc). Another keyword used in BLEND which relates to POINTLESS is TOLERANCE; this keyword controls how much cell sizes are allowed to change if they are to be considered as belonging to the same structure. If the user wants to use a specific dataset as reference, so that space group and indexing convention of the reference are passed on to the processed datasets, the name of the reference file (an mtz file) can be included with the keyword DREF. The reason why merged but unscaled files are kept for the user is connected with the way subsequent scaling is carried out. At present scaling in default mode is performed by BLEND using AIMLESS. This does not always guarantee the production of final averaged intensities. For example, data could be weak, or the collection could have followed some unusual set up. A successful scaling could, then, be obtained by running AIMLESS in non-default mode, using specific keywords. The starting files for doing this are the "unscaled_xxx.mtz" files.
There is also, of course, the option to re-run BLEND in synthesis mode by adding specific scaling keywords, but, at present, not all available AIMLESS keywords can be used in BLEND.
scaled_001.mtz, scaled_002.mtz, ... (inside directory "merged_files"): are the final scaled files, for those cases that could be successfully scaled
pointless_001.log, pointless_002.log, ... (inside directory "merged_files"): log files from all POINTLESS jobs executed to produce files "unscaled_001.mtz", "unscaled_002.mtz", ...
aimless_001.log, aimless_002.log, ... (inside directory "merged_files"): full logs from the AIMLESS runs. The user can benefit from these files to find out detailed information on merging statistics and scaling in general
BLEND.RMergingStatistics: This is a binary file used by BLEND when executed in graphics mode to display annotated dendrograms with merging statistics.

(c) From combination mode:

combined_files: all files produced by BLEND when executed in combination mode are stored within this directory, which is created if not already present
copies_of_reference_files [optional]: if a reference file is used (keyword DREF) and if such reference file is a dataset of one of the groups processed in BLEND, then a directory with this name is created and the reference file copied into it. The reason why this is necessary is connected with how POINTLESS works with its keyword HKLREF and with reference files. Essentially the file pointed at by HKLREF cannot be the same file pointed at by any HKLIN entry. By using a copy of the reference file, HKLREF is never going to point at the same file as any of those corresponding to HKLIN.
MERGING_STATISTICS.info (inside directory "combined_files"): same file as the one produced inside "merged_files" when BLEND is executed in synthesis mode. Results are, in this case, not sorted according to decreasing completeness
GROUPS.info (inside directory "combined_files"): this file is the equivalent of "CLUSTERS.info" in the "merged_files" directory when BLEND is executed in synthesis mode
unscaled_001.mtz, unscaled_002.mtz, ... (inside directory "combined_files"): unscaled files corresponding to all combinations tried by the user. See equivalent files in directory "merged_files", created when BLEND is executed in synthesis mode
scaled_001.mtz, scaled_002.mtz, ... (inside directory "combined_files"): scaled files corresponding to all successful scaling jobs of files unscaled_001.mtz, unscaled_002.mtz, ...
pointless_001.log, pointless_002.log, ... (inside directory "combined_files"): log files from all POINTLESS jobs executed to produce files "unscaled_001.mtz", "unscaled_002.mtz", ...
aimless_001.log, aimless_002.log, ... (inside directory "combined_files"): full logs from the AIMLESS runs. The user can benefit from these files to find out detailed information on merging statistics and scaling in general

(d) From graphics mode:

aLCV_annotated_dendrogram_cluster_[clN]_level_[N].png, aLCV_annotated_dendrogram_cluster_[clN]_level_[N].ps (inside directory "graphics")

These plots are created when running BLEND in graphics mode, using the "DO" graphics type

stats_annotated_dendrogram_cluster_[clN]_level_[N].png, stats_annotated_dendrogram_cluster_[clN]_level_[N].ps (inside directory "graphics")

These plots are created when running BLEND in graphics mode, using the "D" graphics type

KEYWORDED INPUT

BLEND keywords can be divided into three groups, as they control essentially three different parts of the program. Keywords with their default values are summarized here:

Group 1

CPARWT              1.000
ISIGI               1.500
LAUEGROUP           (laue or space group symbol, as used in POINTLESS)
RADFRAC             0.750
COMPLETENESS        95.0
CUTFRACTION         0.5
MAXCYCLE            5

Group 2

CHOOSE SPACEGROUP   (space group, as used in POINTLESS)
DREF                ()
TOLERANCE           5 (same default value as the one used in POINTLESS)

Group 3

ANOMALOUS           OFF
RUN                 (default is to break into different runs at each discontinuity - see AIMLESS)
EXCLUDE             (individual image numbers or images range - see AIMLESS)
RESOLUTION          HIGH [smallest among highest resolutions of all composing data sets - see AIMLESS]
SCALES              ROTATION SPACING 5 SECONDARY  BFACTOR ON BROTATION SPACING 20 (see AIMLESS)
SDCORRECTION        REFINE INDIVIDUAL (see AIMLESS)

Keywords in group 1 are specific to BLEND and are used to control data preparation, analysis and clustering, and also the filtering and pruning variants of the combination mode. Keywords in group 2 are keywords used in POINTLESS [4], while keywords in group 3 are keywords used in AIMLESS [4].
In the definitions below "[]" encloses optional items, "|" delineates alternatives. All keywords are case-insensitive, but are listed below in upper-case.
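
Since keywords are read from standard input, a run can be scripted by piping the keyword lines into the program. The Python sketch below only illustrates this mechanism: `keyword_input` and `run_blend` are hypothetical helper names, and it is assumed that an executable called "blend" is on the PATH and that one "KEYWORD value" pair per line is accepted.

```python
import subprocess


def keyword_input(keywords):
    """Format a {KEYWORD: value} mapping as one 'KEYWORD value' line per entry."""
    return "".join(f"{name.upper()} {value}\n" for name, value in keywords.items())


def run_blend(args, keywords):
    """Run blend with the given command-line arguments, feeding keywords on stdin."""
    # Assumption: "blend" is on the PATH (i.e. a CCP4 environment is set up).
    return subprocess.run(["blend", *args], input=keyword_input(keywords),
                          text=True, capture_output=True)


print(keyword_input({"RADFRAC": 0.5, "ISIGI": 2.0}), end="")
# run_blend(["-a", "original.dat"], {"RADFRAC": 0.5, "ISIGI": 2.0})  # needs CCP4
```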

ANOMALOUS, CHOOSE SPACEGROUP, CPARWT, DREF, EXCLUDE, ISIGI, LAUEGROUP, RADFRAC, RUN, RESOLUTION, SCALES, SDCORRECTION, TOLERANCE, COMPLETENESS, CUTFRACTION, MAXCYCLE

ANOMALOUS [OFF | ON]

Default value is OFF. ANOMALOUS is the same keyword used in AIMLESS. By default all I+ and I- observations are averaged together in merging. If ANOMALOUS is ON there will be separate anomalous observations in the final AIMLESS output pass, both for statistics and merging. ANOMALOUS will automatically be turned ON if a substantial anomalous signal is detected.

CHOOSE SPACEGROUP

Default value is blank, i.e. the final space group for the specific group of data to be scaled is the one determined by POINTLESS. If the user wishes to fix the space group, rather than allowing POINTLESS to determine it, then this keyword should be used with the accompanying chosen space group symbol. This is advisable, for instance, when data are of poor quality and fixing the space group is necessary to prevent POINTLESS from selecting a wrong one.

CPARWT

Default value is 1.0. CPARWT is a number between 0 and 1, controlling which type of statistical descriptors are used in cluster analysis. A value of 1 means that cell parameters (known as primary descriptors) are used, while a value of 0 means that essentially averaged integrated intensities (known as secondary descriptors) are used. Numbers between 0 and 1 are possible, and they essentially mean a weighted use of both descriptors. At the present stage of research, though, the advantage of mixing the two descriptors is not clear. Primary descriptors seem to behave systematically better than secondary descriptors. Secondary descriptors can be tried, as a valid alternative, in those cases where cell parameters are known to be changing very little.

DREF

DREF, followed by a path pointing to an MTZ file, provides a reference file for indexing and space group assignment. This is normally not needed. Indexing for all datasets in a cluster or group follows the indexing of the first dataset in the cluster or group (the one with the smallest serial number). The space group is then assigned by POINTLESS after the systematic absences analysis. There might, however, be reasons for users to always use a fixed dataset (with the correct space group) as reference. In such instances the reference file is supplied to BLEND via the DREF keyword.

EXCLUDE [<batch range> | <batch list>]

Default is no exclusion of any image. This keyword, equivalent to the one used in AIMLESS, controls exclusion from the scaling process of specific images. These can be provided as a series of individual image numbers or as an image range:

         Example 1.   EXCLUDE BATCH 12 18 21 89
         Example 2.   EXCLUDE BATCH 32 TO 46

An easier alternative for excluding images from scaling jobs is to write an AIMLESS keywords file by copying and pasting input keywords found in specific AIMLESS logs included in either "merged_files" or "combined_files" directories, and adding as many EXCLUDE keywords as needed.
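When many images have to be excluded, the EXCLUDE lines can be generated rather than typed. The following Python sketch is only an illustration and is not part of BLEND: the function name `exclude_lines` and the choice to write runs of three or more consecutive images as a range are assumptions. The output uses only the two forms shown in Examples 1 and 2.

```python
from typing import Iterable, List


def exclude_lines(batches: Iterable[int]) -> List[str]:
    """Format image (batch) numbers as EXCLUDE keyword lines.

    Runs of three or more consecutive numbers become 'EXCLUDE BATCH a TO b';
    all remaining numbers are collected on one 'EXCLUDE BATCH n1 n2 ...' line.
    """
    nums = sorted(set(batches))
    range_lines: List[str] = []
    singles: List[int] = []
    i = 0
    while i < len(nums):
        # Extend j to the end of the run of consecutive numbers starting at i.
        j = i
        while j + 1 < len(nums) and nums[j + 1] == nums[j] + 1:
            j += 1
        if j - i >= 2:
            range_lines.append(f"EXCLUDE BATCH {nums[i]} TO {nums[j]}")
        else:
            singles.extend(nums[i:j + 1])
        i = j + 1
    lines: List[str] = []
    if singles:
        lines.append("EXCLUDE BATCH " + " ".join(str(n) for n in singles))
    return lines + range_lines
```

The returned lines can be pasted into a keywords file alongside the other keywords.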

ISIGI

Default value is 1.5. ISIGI controls the resolution cut. Integrated intensities and their errors are averaged in resolution shells and interpolated with a 10th-degree polynomial. Data are truncated where the signal-to-noise ratio falls below the ISIGI value. The user can assess the signal-to-noise ratio after scaling (from within the "aimless_xxx.log" files). Normally this is higher than the 1.5 value introduced by ISIGI, because the ISIGI value refers to unscaled data. If too much or too little truncation has been applied, BLEND can be executed again with a different value.
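The idea behind the ISIGI cut can be illustrated with a minimal Python sketch. This is not BLEND's implementation: the polynomial interpolation step is left out, and the function name `resolution_cut`, the binning in equal steps of 1/d^2, the number of shells and the use of mean(I)/mean(sigma) per shell are all assumptions made for illustration.

```python
from typing import List, Optional, Tuple


def resolution_cut(
    reflections: List[Tuple[float, float, float]],
    isigi: float = 1.5,
    n_shells: int = 10,
) -> Optional[float]:
    """Return a high-resolution cutoff in angstroms, or None for empty input.

    Each reflection is (d_spacing, intensity, sigma), with d_spacing > 0.
    """
    if not reflections:
        return None
    # Bin in 1/d^2 (an assumed binning scheme).
    s2 = [1.0 / (d * d) for d, _, _ in reflections]
    lo, hi = min(s2), max(s2)
    width = (hi - lo) / n_shells or 1.0  # avoid zero width when all d are equal
    # Per shell: [sum of intensities, sum of sigmas].
    shells = [[0.0, 0.0] for _ in range(n_shells)]
    for (_, intensity, sigma), s in zip(reflections, s2):
        k = min(int((s - lo) / width), n_shells - 1)
        shells[k][0] += intensity
        shells[k][1] += sigma
    cut_s2 = hi  # keep everything unless some shell is too weak
    for k, (sum_i, sum_sig) in enumerate(shells):
        if sum_sig <= 0.0:
            continue  # empty shell, nothing to judge
        # The ratio of sums equals the ratio of shell means.
        if sum_i / sum_sig < isigi:
            cut_s2 = lo + k * width  # inner edge of the first weak shell
            break
    return cut_s2 ** -0.5
```

Raising `isigi` moves the cutoff to lower resolution (larger d), while lowering it keeps weaker shells.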

LAUEGROUP [ | AUTO | Point Group | Space Group]

Default value is blank, i.e. the point group is unchanged from the one found in the original reflection file. LAUEGROUP can be used for data preparation, when reading data from "INTEGRATE.HKL" files produced by XDS. These files normally include integration data in a low-symmetry space group, typically P1. If such data are fed into BLEND directly, the program treats all 6 cell parameters as independent. This is permitted and feasible, but if the correct Laue group is known to have higher symmetry, then treating all 6 cell parameters as independent could introduce unnecessary statistical noise in the process of cluster analysis. In such cases it is advisable to input the correct Laue or space group after keyword LAUEGROUP. The resulting MTZ file includes data and cell parameters of the desired symmetry. If AUTO is used after LAUEGROUP, the conversion to an MTZ file will be carried out with POINTLESS in default mode, i.e. leaving POINTLESS to find out the correct symmetry. If LAUEGROUP is not used, then the "INTEGRATE.HKL" file will be converted into an MTZ file without changing its Laue group (default).

RADFRAC

Default value is 0.75. The program makes use of this keyword when data are found to be subject to overall radiation damage. RADFRAC controls the fraction of average intensity retained that a user is willing to accept when decay due to radiation damage occurs. When RADFRAC is equal to 1, cutting is quite severe; when RADFRAC is equal to 0 there is no cutting, even when substantial radiation damage is affecting datasets. By default (RADFRAC 0.75), when BLEND detects substantial global radiation damage, all images collected after a certain image are discarded. The discarded images, on average, include intensities that have been reduced by more than 25% of their original value.
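The effect of RADFRAC can be pictured with a short Python sketch. It is a simplification under stated assumptions, not the procedure used by BLEND: the first image is taken as the undamaged reference, raw per-image mean intensities are compared directly rather than through a fitted decay model, and the name `images_to_keep` is hypothetical.

```python
from typing import List


def images_to_keep(mean_intensity_per_image: List[float], radfrac: float = 0.75) -> int:
    """Return how many leading images are kept.

    Images are kept up to, but not including, the first image whose mean
    intensity relative to the first image falls below radfrac.
    """
    if not mean_intensity_per_image:
        return 0
    reference = mean_intensity_per_image[0]
    if reference <= 0.0:
        # Decay cannot be judged against a non-positive reference; keep all.
        return len(mean_intensity_per_image)
    for index, value in enumerate(mean_intensity_per_image):
        if value / reference < radfrac:
            return index
    return len(mean_intensity_per_image)
```

With `radfrac=0` nothing is ever cut, and with `radfrac=1` the cut falls at the first image weaker than the reference, which mirrors the two extremes described above.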

RUN <Nrun> BATCH <b1> TO <b2>

Default keys are the same used in AIMLESS. This keyword is equivalent to the one used in AIMLESS and controls the definition of "runs" (i.e. contiguous batches of data undergoing a same scaling protocol). More details can be found in AIMLESS documentation pages.

RESOLUTION [[LOW] <Resmin>] [[HIGH] <Resmax>]

Default for subkey HIGH is the biggest among the highest resolutions of all component datasets; default for subkey LOW is the smallest among the lowest resolutions of all component datasets, where resolutions are expressed in angstroms. The resolution limits computed by BLEND during analysis are determined via keyword ISIGI. When several datasets are merged together, the smallest among the high resolutions and the largest among the low resolutions are fixed for subsequent scaling. These limits can be changed by the user with the keyword RESOLUTION, exactly as in AIMLESS.

SCALES [<subkeys>]

Default keys are the same used in AIMLESS. This keyword is equivalent to the one used in AIMLESS and controls the scaling procedure followed. More details can be found in AIMLESS documentation pages.

TOLERANCE

Default value is the same used in POINTLESS, i.e. 5. TOLERANCE is equivalent to the corresponding POINTLESS keyword. Multiple crystals can have cell parameters very dissimilar from each other (non-isomorphism). When only a mid- or low-resolution electron density map is needed, POINTLESS might need instructions to avoid halting because large cell variations are encountered. Essentially the program is told to stop execution when the cell difference among all component datasets goes beyond a threshold (the TOLERANCE value). The higher the TOLERANCE, the more cell parameters are allowed to change, i.e. the more non-isomorphism is tolerated. Use high values (say 100) if you do not care about cell variability.

SDCORRECTION [[NO]REFINE] [INDIVIDUAL | SAME] [FIXSDB]

Default is REFINE INDIVIDUAL. This keyword is analogous to the one used in AIMLESS (see AIMLESS documentation pages). SDCORRECTION plays a role in the determination of each reflection's error. Errors for all reflections undergo a refinement process equivalent to the refinement used for scaling intensities, but this refinement is less stable than the one for the intensities. Thus it is possible that cycles for SD parameter estimation do not converge, ultimately causing an AIMLESS job to fail. In such circumstances it is possible to re-run BLEND using different values for the SDCORRECTION keyword, similarly to what is prescribed in AIMLESS. Quite often the provision

         SDCORRECTION SAME

is sufficient to bring failed scaling jobs to completion. If no solution is found for obtaining refined SD values, no refinement (NOREFINE) is the only option left.

MAXCYCLE

This parameter is used with the filtering and pruning variants of the combination mode. The default number of cycles for both variants is 5, so as to avoid long execution times. In general, convergence to merging statistics better than the starting ones is achieved within 5 cycles, but the number of cycles can be increased or decreased using the MAXCYCLE keyword.

CUTFRACTION

This is used in conjunction with the pruning variant of the combination mode. At the end of each cycle the total number of images that can be deleted is calculated and partitioned proportionally among all datasets forming the starting group or cluster. Then, at the start of the following cycle, the dataset with the highest overall Rmerge is selected for image removal. The number of images to be removed is CUTFRACTION times the number assigned to that dataset. Default value is 0.5. Smaller values produce a slower and more gradual change of the merging statistics, while values greater than 0.5 result in abrupt changes to the same statistics. As scaling integrated data is a non-linear process, it seems wiser to remove a smaller number of images while, at the same time, increasing the maximum number of cycles allowed.
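The arithmetic of a single pruning step can be sketched in Python as follows. This is an illustration, not BLEND's code: the function name `pruning_step`, the partition in proportion to each dataset's image count, and rounding down to a whole number of images are assumptions, since the text above does not specify them.

```python
from typing import Dict, Tuple


def pruning_step(
    n_images: Dict[str, int],
    rmerge: Dict[str, float],
    total_deletable: int,
    cutfraction: float = 0.5,
) -> Tuple[str, int]:
    """Return (dataset selected for pruning, number of images to remove)."""
    total = sum(n_images.values())
    if total <= 0:
        raise ValueError("no images available")
    # Share of the deletable images assigned to each dataset,
    # proportional to its image count (assumption).
    assigned = {name: total_deletable * n / total for name, n in n_images.items()}
    # The dataset with the highest overall Rmerge is the one pruned.
    worst = max(rmerge, key=lambda name: rmerge[name])
    # Remove CUTFRACTION times the assigned number, rounded down (assumption).
    n_remove = int(cutfraction * assigned[worst])
    return worst, min(n_remove, n_images[worst])
```

For two datasets of 100 and 300 images, 40 deletable images and the larger dataset having the worse Rmerge, the larger dataset is assigned 30 images and loses 15 of them with CUTFRACTION 0.5.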

COMPLETENESS

This keyword can be used for both the filtering and pruning variants of the combination mode. Whole datasets or portions of them, more precisely a certain number of images at the end of a run, are subtracted from a specific cluster or group in a cyclical fashion. The subtraction continues until overall data completeness has decreased to the COMPLETENESS value. Default is 95% (COMPLETENESS 95). Cycles can also stop if the maximum number of cycles allowed (keyword MAXCYCLE, default 5) is reached. Pruning cycles can also be halted if the progressive deletion of images results in the deletion of a whole dataset. This last outcome can, obviously, be achieved more quickly by filtering out the specific dataset using the combination mode.
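The three stopping conditions described for COMPLETENESS and MAXCYCLE can be summarized in a small Python sketch. It only restates the logic of the paragraph above; the function name `stop_reason`, the returned labels and the order in which the conditions are checked are assumptions.

```python
from typing import Optional


def stop_reason(
    cycle: int,
    completeness: float,
    images_left_in_pruned_dataset: Optional[int] = None,
    target_completeness: float = 95.0,
    maxcycle: int = 5,
) -> Optional[str]:
    """Return why the filtering or pruning cycles should stop, or None to continue.

    cycle is the number of cycles completed so far; completeness is the
    overall completeness (in percent) after the latest cycle.
    """
    if completeness <= target_completeness:
        return "completeness"  # completeness has dropped to the COMPLETENESS value
    if cycle >= maxcycle:
        return "maxcycle"  # MAXCYCLE cycles have been carried out
    if images_left_in_pruned_dataset is not None and images_left_in_pruned_dataset <= 0:
        return "dataset_deleted"  # pruning has consumed a whole dataset
    return None
```

The third argument is only meaningful for the pruning variant and can be left as `None` for filtering.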

MISCELLANEOUS AND PROBLEMS

Problems (and, alas, crashes!) can happen in BLEND, as is the case with any software. Some of them, and their causes, are known and described in this section.

(1) Program abrupt terminations

We have made substantial efforts to stop the program from crashing and, rather, to enable it to exit in a clean way with some kind of error message. But crashes are still to be expected. They will become less and less frequent as users report them:

Crashes in analysis mode: At present the program has been reported to crash in analysis mode if the size of the data read in exceeds the available memory. Luckily this is quite large for modern laptops and desktops, so it should not be an issue in the majority of cases. It is likely to become an issue if several very large datasets are read in one run. Other types of crashes are unknown.
Crashes in synthesis mode: These are generally a consequence of execution terminations by either the POINTLESS or AIMLESS programs. BLEND can handle several of these terminations and can execute in normal mode with an error or warning message in this case. If POINTLESS is successful, but AIMLESS fails, then the user should find that the "unscaled_xxx.mtz" type of files have been created under the directory "merged_files", but the "scaled_xxx.mtz" type of files are not created, where "xxx" refers to all clusters with successful scaling jobs. In this case it is likely that the default scaling recipe will have to be changed. Some clusters are made of datasets with different point groups or other indexing inconsistencies. Unless appropriate keywords are used for POINTLESS, the execution of BLEND in synthesis mode for these cases will return an error message, and files of type "unscaled_xxx.mtz" will not be created.

(2) How to create an ASCII list of input files

Quite often input files are not included in a single directory, but are spread across a number of directories. In this case a judicious use of the Unix commands "grep" and "find" can quickly produce the input list for BLEND. Suppose all files are spread across directories that are all under a single directory named, say, "cdir". A quick way to generate the list is to move to directory "cdir" and use "find" as follows (many thanks to Morten Groftehauge for this tip):

         find `pwd` -name "INTEGRATE.HKL" > original.dat

In this case all XDS files found in "cdir" or in its subdirectories will be listed in original.dat with their full path. Variants of the above line will produce results for specific cases.
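Where "find" is not available, the same list can be produced with a few lines of Python. This is a sketch of an equivalent, not something shipped with BLEND: the function name `list_integrate_files` is hypothetical, and sorting the paths is a design choice that "find" itself does not make.

```python
from pathlib import Path
from typing import List


def list_integrate_files(
    top_dir: str, output_file: str, name: str = "INTEGRATE.HKL"
) -> List[str]:
    """Write the full paths of all files called `name` under top_dir, one per line."""
    top = Path(top_dir).resolve()
    # rglob searches top_dir itself and all of its subdirectories.
    paths = sorted(str(p) for p in top.rglob(name) if p.is_file())
    Path(output_file).write_text("".join(f"{p}\n" for p in paths))
    return paths
```

Calling `list_integrate_files(".", "original.dat")` from inside "cdir" gives a file with the same content as the "find" command above, apart from the ordering of the lines.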

(3) Error estimation with AIMLESS

Error estimation and correction for multiple datasets is still not completely reliable in AIMLESS. If AIMLESS crashes while handling errors, or if the Mean((I)/sd(I)) has ridiculously high values, it is advisable to re-run BLEND (with either the -s option, or the -c option for the specific combination of datasets under scrutiny) using keywords "SDCORRECTION SAME" or "SDCORRECTION NOREFINE". Error estimation will be, in this case, less reliable, but this is still better than obtaining no results at all. Phil Evans (the author of AIMLESS) is constantly working to improve error estimation for difficult scaling cases (and multiple crystals are difficult!), but this is an inherently challenging theoretical and computational problem, not likely to be overcome in its entirety any time soon.

REFERENCES

[1] J. Foadi, P. Aller, Y. Alguel, A. Cameron, D. Axford, R.L. Owen, W. Armour, D. Waterman, S. Iwata and G. Evans "Clustering procedures for the optimal selection of data sets from multiple crystals in macromolecular crystallography" Acta Cryst. (2013), D69, 1617–1632
[2] A.G.W. Leslie and H.R. Powell "Processing Diffraction Data with Mosflm" in Evolving Methods for Macromolecular Crystallography (2007), 245, 41–51
[3] W. Kabsch "XDS" Acta Cryst. (2010), D66, 125–132
[4] P.R. Evans "Scaling and assessment of data quality" Acta Cryst. (2006), D62, 72–82

AUTHORS AND CREDITS

James Foadi, Diamond Light Source (james_foadi@diamond.ac.uk)
Gwyndaf Evans, Diamond Light Source (gwyndaf.evans@diamond.ac.uk)

Special thanks to David Waterman (CCP4 core team) for implementing the BLEND GUI version and Pierre Aller (Diamond Light Source) for BLEND tutorials.

HOW TO CITE BLEND

The main reference for BLEND is:

J. Foadi, P. Aller, Y. Alguel, A. Cameron, D. Axford, R.L. Owen, W. Armour, D.G. Waterman, S. Iwata and G. Evans
Clustering procedures for the optimal selection of data sets from multiple crystals in macromolecular crystallography
Acta Cryst. (2013), D69, 1617-1632

The following references should also be included when citing BLEND because the software makes frequent use of the CCP4 programs POINTLESS and AIMLESS:

For POINTLESS cite
P.R. Evans
An introduction to data reduction: space-group determination, scaling and intensity statistics
Acta Cryst. (2011), D67, 282-292
For AIMLESS cite
P.R. Evans and G.N. Murshudov
How good are my data and what is the resolution?
Acta Cryst. (2013), D69, 1204-1214