MrBUMP (CCP4: Supported Program)

NAME

MrBUMP - automated search model generation and automated molecular replacement

SYNOPSIS

Full model search, model preparation and molecular replacement:

mrbump hklin foo_in.mtz seqin foo.seq hklout foo_out.mtz xyzout foo.pdb
[Key-worded input]

Model search and preparation only:

mrbump seqin foo.seq
[Key-worded input]

Molecular replacement only (requires the output directory of a previous run of the program):

mrbump hklin foo_in.mtz prepdir <path to previous MrBUMP job output directory> hklout foo_out.mtz xyzout foo.pdb
[Key-worded input]
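
The three invocation modes above can also be driven from a script. The following is a minimal Python sketch and is not part of MrBUMP: the helper names build_mrbump_command and run_mrbump are hypothetical, and the sketch assumes that the mrbump executable is on the PATH and that keyworded input is read from standard input, as in the synopsis.

```python
import subprocess


def build_mrbump_command(hklin=None, seqin=None, prepdir=None,
                         hklout=None, xyzout=None):
    """Assemble the argument list for one of the modes shown in the synopsis."""
    cmd = ["mrbump"]
    for name, value in (("hklin", hklin), ("seqin", seqin),
                        ("prepdir", prepdir), ("hklout", hklout),
                        ("xyzout", xyzout)):
        if value is not None:
            cmd += [name, value]
    return cmd


def run_mrbump(cmd, keywords):
    """Run MrBUMP with keyword lines on stdin (assumes mrbump is installed)."""
    text = "\n".join(list(keywords) + ["END"]) + "\n"
    return subprocess.run(cmd, input=text, text=True, check=True)


# Model search and preparation only:
print(build_mrbump_command(seqin="foo.seq"))
```

Only build_mrbump_command is exercised here; run_mrbump needs a working CCP4 installation.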

DESCRIPTION

MrBUMP has three main parts:

- For a given target sequence, automated discovery of chains, domains and multimers that are possible templates for molecular replacement search models
- Preparation of actual search models using a variety of structure editing techniques
- Running molecular replacement using these search models and testing whether the resulting solutions will refine

Note that MrBUMP makes a number of calls to web-based applications. If your sequence information is in any way sensitive, it is recommended that you use the option to run the fasta search locally rather than via the OCA web application. This requires fasta34 to be installed on the user's local machine. The software can be downloaded from the EBI website.

DEPENDENCIES

Before MrBUMP can be used, the following dependencies should be installed on the local system.

Mandatory:
- CCP4 6.0 or later
- Python 2.3 or later
- one of: Mafft, Probcons, TCoffee or Clustalw

Optional:
- Fasta34
- Perl + SOAP-Lite module (for SSM search)
- Gnuplot

MrBUMP also requires that the local machine has a connection to the internet (directly or via a proxy).

INPUT AND OUTPUT FILES

HKLIN

Input structure factor file for target structure. Must include a FreeR_flag column. In general, the spacegroup in HKLIN is assumed to be correct (you should check this!). The only exception is the ENANt keyword which requests that both members of a pair of enantiomorphic spacegroups are checked by MrBUMP.

SEQIN

Input sequence file for the target structure. Can be in PIR or Fasta format or it can just contain the amino acid sequence.
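
As an illustration of the three accepted layouts, the sketch below reads a PIR, Fasta or bare-sequence file into a single sequence string. It is not MrBUMP's own reader: the function name read_sequence is hypothetical, only the first record of a multi-record file is read, and PIR input is assumed to follow the usual ">P1;name" header, description line, and "*" terminator.

```python
def read_sequence(text):
    """Return the first amino acid sequence found in PIR, Fasta or bare text."""
    lines = text.strip().splitlines()
    if lines and lines[0].startswith(">"):
        # A PIR header looks like ">P1;name" and is followed by a
        # description line; a Fasta header is a single line.
        is_pir = len(lines[0]) > 3 and lines[0][3] == ";"
        lines = lines[2:] if is_pir else lines[1:]
    residues = []
    for line in lines:
        if line.startswith(">"):
            break  # start of a second record: stop after the first one
        residues.append("".join(line.split()))
    # PIR sequences end with an asterisk.
    return "".join(residues).rstrip("*").upper()
```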

HKLOUT

MTZ file from Refmac5 refinement of the top MR solution.

XYZOUT

PDB coordinate file from Refmac5 refinement of the top MR solution.

KEYIN

Input keywords in a file rather than through stdin. (Note: you can't use environment variables as keyword arguments.)

KEYWORDED INPUT

There are a number of options for specifying parameters (e.g. number of molecules expected in the asymmetric unit) or preferences (e.g. which multiple alignment program to use). All options have sensible defaults.

Main keywords:

LABIn, JOBId, ROOTdir, RLEVel, NMASu, MRNUm, ENSEmnum, INCLude, LOCAlfile, FIXEd_xyzin, IGNOre, MRPRograms, MAPRogram, MDLDpdbclp, MDLUnmod, MDLPlyala, MDLMolrep, MDLChainsaw, MDLSculptor, SSMSearch, SCOPsearch, PQSSearch, DOFAsta, DOHHpred, DOPHmmer, HHDBpdb, HHINdex, HHSCore, PACK, PJOBS, NCYC, REFTwin, UPDAte, ONLYmodels, TRYAll, USEAcorn, ACORnres, BUCCaneer, BCYC, ARPWarp, ACYC, SHELxe, SCYC, SXREBUILD, SXRARPW, SXRBUCC, USEPHS, ENANt

Additional keywords:

PKEYword, PDBDir, PDBLocal, CLUSter, QTYPe, QSIZe, QSUBcom, CLEAn, LITE, PICKle, CHECk, DEBUg, PROXyserver

LABIN <program label>=<file label>...

This keyword tells the program which columns in the MTZ file should be used as native structure factors, sigmas, and FreeR flag. Available program labels are F, SIGF and FreeR_flag.
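
A LABIN line can be generated from a mapping of program labels to file labels. The sketch below is illustrative only: the function name labin_line is hypothetical, and requiring all three program labels is an assumption made here for safety rather than a documented rule.

```python
PROGRAM_LABELS = ("F", "SIGF", "FreeR_flag")


def labin_line(columns):
    """Format a LABIN line from {program label: MTZ column label}."""
    unknown = sorted(set(columns) - set(PROGRAM_LABELS))
    if unknown:
        raise ValueError("unknown program labels: " + ", ".join(unknown))
    missing = [label for label in PROGRAM_LABELS if label not in columns]
    if missing:
        # Assumption: all three labels are supplied together.
        raise ValueError("missing program labels: " + ", ".join(missing))
    pairs = ["{}={}".format(label, columns[label]) for label in PROGRAM_LABELS]
    return "LABIN " + " ".join(pairs)


print(labin_line({"F": "FP", "SIGF": "SIGFP", "FreeR_flag": "FREE"}))
```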

JOBID <job name>

This is a name for the job. A directory called "search_JOBID" will be created in the directory from which MrBUMP is started. This directory will contain all of the downloaded files and results.

ROOTDIR <directory>

The root directory where the search folder will be created.
[Default Current working directory]

RLEVEL [ 70 | 100 ]

This sets the level of redundancy, in terms of sequence identity, to use when performing the homologue search (using Phmmer). There are currently two options: 70, which eliminates all sequences with greater than 70% identity to each other, and 100, which uses all sequences.
[Default 70]
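
To make the meaning of the redundancy level concrete, the sketch below applies a greedy filter to a small set of pre-aligned sequences. It only illustrates what "no two sequences above 70% identity" means; it is not how MrBUMP or Phmmer implement the filtering, and the function names identity and filter_redundant are hypothetical.

```python
def identity(a, b):
    """Percentage of identical positions in two pre-aligned, equal-length strings."""
    if len(a) != len(b):
        raise ValueError("sequences must be aligned to the same length")
    pairs = [(x, y) for x, y in zip(a, b) if x != "-" and y != "-"]
    if not pairs:
        return 0.0
    return 100.0 * sum(x == y for x, y in pairs) / len(pairs)


def filter_redundant(sequences, level=70):
    """Keep a sequence only if it is no more than `level` % identical to any kept one."""
    kept = []
    for seq in sequences:
        if level >= 100 or all(identity(seq, k) <= level for k in kept):
            kept.append(seq)
    return kept
```

With level=100 every sequence is retained, which corresponds to the second RLEVEL option.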

NMASU <number>

The number of molecules in the asymmetric unit. Leave this blank for automatic calculation.
[Default Automatic]

MRNUM <number>

The number of prepared models to be used in molecular replacement.
[Default 20]

ENSEMNUM <number>

The number of prepared models to be used in a Phaser Ensemble.
[Default 5]

INCLUDE <pdb chain id 1> <pdb chain id 2>...

A list of PDB ID codes and Chain IDs to be included in the homologue search. Any specific chains entered here are automatically processed in molecular replacement regardless of how they score in the template model scoring.
Example: INCLUDE 1nio_A.

LOCALFILE <pdb filename [CHAIN chain id] >

Use this keyword to specify the location of a local PDB file to be used as a search model in MrBUMP. The full path to the file must be specified. The optional CHAIN subkeyword can be used to specify a particular chain in the PDB file. If CHAIN is not used then the program will extract chain "A" from the file.
Example: LOCALFILE /tmp/1nio.pdb CHAIN A.
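
Chain selection from a PDB file can be pictured as follows. This is a minimal sketch rather than MrBUMP's own code: the function name extract_chain is hypothetical, and it relies on the standard PDB fixed-column layout in which the chain identifier is column 22 of ATOM and HETATM records.

```python
def extract_chain(pdb_lines, chain="A"):
    """Return the ATOM/HETATM records that belong to one chain.

    The chain ID is column 22 of a PDB coordinate record (index 21).
    """
    return [line for line in pdb_lines
            if line.startswith(("ATOM", "HETATM"))
            and len(line) > 21 and line[21] == chain]
```

The default of chain "A" mirrors the behaviour described above when the CHAIN subkeyword is omitted.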

FIXED_XYZIN <pdb filename IDEN sequence identity >

The FIXED_XYZIN keyword allows the user to input a fixed component structure in the MR search. This component should already have the correct orientation. This keyword can be used multiple times if more than one fixed component is known. The sequence identity for each component against its corresponding section of the target sequence is required. The full path to the PDB file should also be specified. Fixed components are passed to both Phaser and Molrep.
Example: FIXED_XYZIN /tmp/fixed.pdb IDEN 0.43

IGNORE <pdb id 1> <pdb id 2>...

A list of PDB ID codes to be ignored in the homologue search. Used for development purposes.

MRPROGRAMS [ Molrep | Phaser ]

Names of Molecular Replacement programs to try search models in. Options are Molrep, Phaser or both. If both are selected Molrep will be used first.
[Default Molrep Phaser]

MAPROGRAM [ MAFFT | PROBCONS | T_COFFEE | CLUSTALW ]

Name of the sequence alignment program to be used to do multiple alignment of the template structure sequences and the target sequence. In good cases, these programs should give the same result. In more marginal cases (e.g. small number of sequences, low sequence identity) they may give very different results.
[Default MAFFT]

MDLUNMOD [ True | False ]

If true, unmodified search models will be passed to the MR stage. This can be useful when a user is providing their own pre-prepared search models via the LOCALFILE option.
[Default False]

MDLDPDBCLP [ True | False ]

If true, models will be prepared for MR using the PDBclip method. With this method, the waters and hydrogens are removed from the coordinate file and the most probable side-chain conformations are selected. If chain IDs are missing they are added.
[Default False]

MDLPLYALA [ True | False ]

If true, polyalanine models will be prepared for the MR step. All side-chains are removed from the PDB files.
[Default False]
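
Side-chain trimming of this kind can be sketched in a few lines. The example is illustrative and not MrBUMP's implementation: the function name trim_side_chains is hypothetical, residue names are left unchanged, and keeping the CB atom (as in an alanine residue) is an assumption; pass kept={"N", "CA", "C", "O"} to trim to the bare main chain instead.

```python
KEPT_ATOMS = {"N", "CA", "C", "O", "CB"}  # assumption: CB is retained


def trim_side_chains(pdb_lines, kept=KEPT_ATOMS):
    """Drop ATOM records whose atom name (columns 13-16) is not in `kept`.

    Non-ATOM lines (headers, HETATM, TER, END) are passed through unchanged.
    """
    trimmed = []
    for line in pdb_lines:
        if line.startswith("ATOM") and line[12:16].strip() not in kept:
            continue
        trimmed.append(line)
    return trimmed
```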

MDLMOLREP [ True | False ]

If true, models will be prepared using Molrep. Molrep does a sequence alignment of the target sequence and the template sequence and prunes the template structure file accordingly.
[Default True]

MDLCHAINSAW [ True | False ]

If true, models will be prepared using Chainsaw. Chainsaw takes in a sequence alignment of the target sequence and the template sequence and prunes the template structure file accordingly.
[Default True]

MDLSCULPTOR [ True | False ]

If true, models will be prepared using Phaser.Sculptor. Sculptor takes in a sequence alignment of the target sequence and the template sequence and prunes the template structure file accordingly.
[Default True]

SSMSEARCH [ True | False ]

If true, MrBUMP will use the top match from the sequence-based search in a secondary structure-based search to find more potential homologues. Requires Perl and the Perl SOAP-Lite module to be installed.
[Default False]

SCOPSEARCH [ True | False ]

If true, MrBUMP will use the SCOP database to look for individual domains in the template structures found in the sequence-based and secondary structure-based searches.
[Default True]

PQSSEARCH [ True | False ]

If true, MrBUMP will use the PQS service at the EBI to find more multimers based on the template structures found in the sequence-based and secondary structure-based searches.
[Default True]

DOFASTA [ True | False ]

If true, a FASTA search will be carried out to search for the possible template models. A user can turn this off and give specific chain IDs to be used as the template models. Note that at least one chain must be specified using the INCLUDE keyword if DOFASTA is set to False. Alternatively, a local file can be specified with the LOCALFILE keyword. Requires that fasta34 or fasta35 be installed on the local machine. These are available from the EBI website: http://www.ebi.ac.uk/fasta.
[Default False]

DOHHPRED [ True | False ]

If true, a sequence-based search for template models will be carried out using HHblits from the HHsuite. Using this search mechanism may give different results from the standard fasta search and produce alternative search models for MR. Note that this requires that the user has installed the hhsuite separately along with its associated pdb sequence database. If this keyword is set to true the user must also specify the path to the hhsuite pdb database file using the HHDBpdb keyword and the hhsuite uniprot database index file using the HHINdex keyword. In addition, the HHLIB environment variable must be set to point to the hhsuite "lib/hh" directory (see hhsuite setup instructions for details). The HHsuite can be downloaded from: http://toolkit.tuebingen.mpg.de/hhpred.
[Default False]

DOPHMMER [ True | False ]

If true, a phmmer sequence-based search will be carried out to search for the possible template models.
[Default True]

HHDBPDB <directory/db_basename>

Use this keyword to specify a directory where MrBUMP can find the HHsuite PDB sequence database files. This keyword is required if the user wishes to use the hhpred suite to search for template models. It is only needed if the DOHHpred keyword is set to "True". It should give the full path to the hhsuite pdb70 database directory followed by the base name for the pdb sequence file e.g. "HHDBPDB /usr/local/hhsuite/database/pdb70/pdb70_08Feb14_hhm_db"
[Default Not set]

HHINDEX <directory/uniprot index file>

Use this keyword to specify a directory where MrBUMP can find the HHsuite uniprot sequence database index file. This keyword is required if the user wishes to use the hhpred suite to search for template models. It is only needed if the DOHHpred keyword is set to "True". It should give the full path to the hhsuite uniprot database index file e.g. "HHINDEX /usr/local/hhsuite/database/uniprot20/uniprot20_hhm_db.index"
[Default Not set]
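
The requirements attached to DOHHPRED (both HHDBPDB and HHINDEX set, plus the HHLIB environment variable) can be checked before a job is launched. The following pre-flight check is a hedged sketch: the function name check_hhpred_setup and the dictionary representation of the keywords are hypothetical, and MrBUMP's own validation may differ.

```python
import os


def check_hhpred_setup(keywords, environ=None):
    """Return a list of problems with the HHsuite-related settings."""
    environ = os.environ if environ is None else environ
    problems = []
    if not keywords.get("DOHHPRED", False):
        return problems  # nothing to check when the HHblits search is off
    for key in ("HHDBPDB", "HHINDEX"):
        if not keywords.get(key):
            problems.append(key + " keyword is not set")
    if not environ.get("HHLIB"):
        problems.append("HHLIB environment variable is not set")
    return problems
```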

HHSCORE [ True | False ]

If DOHHPRED is true this keyword, when true, instructs MrBUMP to use the alignments generated in the HHblits sequence search for templates as the alignments used to score the template models (only for templates found by the HHblits search). In addition, both Sculptor and Chainsaw will be given the HHblits alignments for search model preparation. If set to false the alignments generated using the selected multiple alignment program (e.g. MAFFT) will be used for the scoring of template models.
[Default True]

PACK <number>

The number of clashes that Phaser will tolerate.
[Default 5]

PJOBS <number>

The number of processing cores that Phaser will use in parallel. Note that when the CLUSTER keyword is set to True (molecular replacement jobs submitted to a cluster queue) PJOBS is always set to 1.
[Default 2]

NCYC <number>

The number of cycles of restrained refinement to use in Refmac.
[Default 30]

REFTWIN [ True | False ]

Set this keyword to true if the indications are that your data is twinned. The "TWIN" keyword will be used in Refmac which will determine and account for the twinning in refinement. Only valid for Refmac version 5.5 or later.
[Default False]

UPDATE [ True | False ]

If true, the search database files will be tested at the start of the job to see if they are out of date with respect to those available from the EBI website. If they are found to be out of date, the latest version will be downloaded.
[Default True]

ONLYMODELS [ True | False ]

If true, only the search models will be generated. The program will exit before any Molecular Replacement is carried out.
[Default False]

TRYALL [ True | False ]

If true, the program will try all of the search models in molecular replacement. If false the program will exit when it finds the first solution.
[Default False]
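
The effect of TRYALL on the order of work can be sketched as a simple loop. This is an illustration under stated assumptions, not MrBUMP's control flow: run_search is a hypothetical name, is_solution stands in for the MR-plus-refinement test, and the mrnum argument reflects the MRNUM limit on the number of models tried.

```python
def run_search(models, is_solution, try_all=False, mrnum=20):
    """Try models in order; stop at the first solution unless try_all is set."""
    results = []
    for model in models[:mrnum]:
        solved = is_solution(model)
        results.append((model, solved))
        if solved and not try_all:
            break
    return results
```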

USEACORN [ True | False ]

If true, the program will put each positioned and refined search model through the program Acorn to try to improve the phases. The target data must extend to at least 1.7 Å resolution. Acorn is unlikely to help at lower resolutions, but this resolution limit can be changed with the ACORnres keyword.

MrBUMP prints out the correlation coefficient for medium E values from Acorn. An increase in these correlation coefficients over Acorn cycles is a good sign that you have the correct solution (the absolute value of the CC may be low, because these are not the strongest E values). The columns ECOUT, PHIOUT and WTOUT from Acorn can be used to generate high quality maps to help model re-building.
[Default False]

ACORNRES <resolution>

Resolution limit for applying the Acorn phase improvement procedure.
[Default 1.7]
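
The resolution test and the "increasing correlation coefficient" criterion described under USEACORN can be written down as two small checks. Both function names are hypothetical, and comparing the last cycle with the first is just one simple way to detect an increase; it is not necessarily the criterion MrBUMP applies.

```python
def acorn_applicable(data_resolution, acornres=1.7):
    """True if the data extend to the ACORNRES limit or better (in Angstrom)."""
    return data_resolution <= acornres


def cc_improving(cc_by_cycle):
    """True if the medium-E correlation coefficient rose over the Acorn cycles."""
    return len(cc_by_cycle) > 1 and cc_by_cycle[-1] > cc_by_cycle[0]
```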

BUCCANEER [ True | False ]

Perform automated model building using Buccaneer.
[Default True]

BCYC <number>

Number of autobuild-refine cycles to carry out in Buccaneer.
[Default 5]

ARPWARP [ True | False ]

Perform automated model building using ARP/wARP.
[Default False]

ACYC <number>

Number of autobuild-refine cycles to carry out in ARP/wARP.
[Default 5]

SHELXE [ True | False ]

Perform phase improvement and main-chain tracing using SHELXE.
[Default False]

SCYC <number>

Number of auto-tracing cycles to perform in SHELXE.
[Default 15]

SXREBUILD [ True | False ]

Perform model building with ARP/wARP and/or Buccaneer after SHELXE.
[Default False]

SXRARPW [ True | False ]

Perform model building with ARP/wARP after SHELXE (SXREBUILD must be set to True).
[Default False]

SXRBUCC [ True | False ]

Perform model building with Buccaneer after SHELXE (SXREBUILD must be set to True).
[Default True]

USEPHS [ True | False ]

Use the phases generated by SHELXE (.phs file) in the subsequent rebuild in ARP/wARP and/or Buccaneer (SXREBUILD must be set to True). If USEPHS is set to false, model building will use the C-alpha trace generated by SHELXE as a starting point.
[Default True]
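
The SHELXE-related keywords interact: SXRARPW, SXRBUCC and USEPHS only have an effect when SXREBUILD is True. The sketch below resolves a set of flag values into the rebuild steps that would follow. The function name shelxe_rebuild_plan is hypothetical, the defaults are those listed above, and the requirement that SHELXE itself is enabled is an assumption.

```python
def shelxe_rebuild_plan(shelxe=False, sxrebuild=False, sxrarpw=False,
                        sxrbucc=True, usephs=True):
    """Return (program, starting point) pairs for the post-SHELXE rebuild."""
    # Assumption: no rebuild happens unless SHELXE has been run.
    if not (shelxe and sxrebuild):
        return []
    start = "SHELXE phases (.phs)" if usephs else "SHELXE C-alpha trace"
    plan = []
    if sxrarpw:
        plan.append(("ARP/wARP", start))
    if sxrbucc:
        plan.append(("Buccaneer", start))
    return plan
```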

ENANT [ True | False ]

If true, the program will do molecular replacement for all search models in the enantiomorphic spacegroup, as well as in the HKLIN spacegroup, if an enantiomorph exists for the target data spacegroup. MrBUMP will identify the better spacegroup for each model. For good MR solutions, the correct spacegroup should be identified. For wrong or marginal solutions, it may be harder to distinguish the correct spacegroup.
[Default False]
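
For reference, the eleven enantiomorphic pairs of spacegroups can be held in a small lookup table. The sketch below is illustrative: the names ENANTIOMORPH and spacegroups_to_try are hypothetical, spacegroup names are compared with spaces removed, and MrBUMP's own handling is not shown here.

```python
_PAIRS = [
    ("P41", "P43"), ("P4122", "P4322"), ("P41212", "P43212"),
    ("P31", "P32"), ("P3112", "P3212"), ("P3121", "P3221"),
    ("P61", "P65"), ("P62", "P64"), ("P6122", "P6522"),
    ("P6222", "P6422"), ("P4132", "P4332"),
]

ENANTIOMORPH = {}
for first, second in _PAIRS:
    ENANTIOMORPH[first] = second
    ENANTIOMORPH[second] = first


def spacegroups_to_try(spacegroup, enant=False):
    """Return the HKLIN spacegroup, plus its enantiomorph when ENANT is set."""
    name = spacegroup.replace(" ", "").upper()
    partner = ENANTIOMORPH.get(name)
    if enant and partner:
        return [name, partner]
    return [name]
```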

PDBDIR <directory>

Use this keyword to specify a directory where MrBUMP can search for the PDB files it needs for generating search models. This can help reduce the number of downloads from the PDB databases on the internet. Mainly useful for users with slow connections and cases where a user wishes to run several jobs requiring similar search models. PDB files should take the form <PDB ID>.pdb (e.g. 1nio.pdb). Also, the full path to the directory should be specified.
[Default Not set]

PDBLOCAL <directory>

If you have a local mirror of the PDB available through the file system, you can instruct MrBUMP to access it for the PDB files that it needs using this keyword. Give it the full path to the top-level directory of the PDB database on the file system. Note that the PDB mirror must have the standard file hierarchy for the PDB database. The files should be stored in directories named according to the middle two characters in the PDB ID code. The files should also be gzipped and stored in the format pdb<PDB ID>.ent.gz. For example, the file for PDB code 1nio should be <path to PDB directory>/ni/pdb1nio.ent.gz.
[Default Not set]
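
The mirror layout described above maps a PDB ID to a file path in a fixed way, as the following sketch shows. The function names are hypothetical, POSIX-style paths are used for illustration, and the sketch is not taken from MrBUMP itself.

```python
import gzip
import posixpath


def pdb_mirror_path(root, pdb_id):
    """Build <root>/<middle two characters>/pdb<id>.ent.gz for a PDB ID."""
    code = pdb_id.lower()
    if len(code) != 4:
        raise ValueError("expected a four-character PDB ID: " + pdb_id)
    return posixpath.join(root, code[1:3], "pdb{}.ent.gz".format(code))


def read_mirror_entry(root, pdb_id):
    """Return the decompressed text of an entry in the local mirror."""
    with gzip.open(pdb_mirror_path(root, pdb_id), "rt") as handle:
        return handle.read()


print(pdb_mirror_path("/data/pdb", "1nio"))
```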

CLUSTER [ True | False ]

If true, the model preparation and molecular replacement jobs will be farmed out to a cluster. Currently only works for Sun Grid Engine enabled clusters.
[Default False]

QTYPE [ SGE | PBS ]

For use in combination with the CLUSTER keyword. This is the type of batch system that jobs will be submitted to. Currently, the Sun Grid Engine (SGE) and Portable Batch System (PBS) systems are supported.
[Default SGE]

QSIZE <number of cluster processes>

This is the maximum number of jobs allowed to be submitted to a cluster system at any one time. Set this value if you want to prevent MrBUMP from overloading an open cluster system.
[Default Unlimited]

QSUBCOM <cluster submission command>

Cluster submission command. Arguments to the cluster submission command can also be provided through this keyword. e.g. QSUB qsub -l vmem=500MB,walltime=02:00:00.
[Default qsub]

CLEAN [ True | False ]

If true, the program will remove the files generated for models that were marked as "Failed" solutions. Also, any files in the scratch area will be removed. This is to cut down on disk space usage.
[Default False]

LITE [ True | False ]

If true, the program will delete surplus files as it progresses. These include both Molrep and Phaser output files, scratch files, log files, downloaded PDB files and sequence alignment files. This reduces considerably the disk footprint of a MrBUMP job. For Phaser and Molrep (on unix systems) a shell script is created to allow for the re-running of the jobs should further investigation be needed.
[Default False]

PICKLE [ True | False ]

Use the python 'pickle' function to output the main data structures into a pickle file. Mainly useful for two-step runs of MrBUMP - 1. Model search, 2. Molecular Replacement using the previous model generation directory.
[Default True]

CHECK [ True | False ]

This keyword, if set to True, enables an internet connectivity check at the outset of a job. The test involves connecting to each of the PDB file servers specified and attempting to download a PDB file. If all of the download attempts fail, the process will report an error and exit the program. This is a possible indication of a network connection problem or the need for the user to set a proxy server. It can be disabled in situations where PDB files are sourced from a local folder and a network connection is not required. When set to True, this option also invokes a script that times the connection to each PDB file server. This script tests how long it takes to retrieve a PDB file from each of the commonly used PDB file servers (UK, USA and JAPAN). The quickest one is then used to retrieve files at later stages in the program.
[Default True]
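
The server check can be pictured as timing one download per server and keeping the fastest one that succeeds. The sketch below is a hedged illustration only: pick_fastest_server is a hypothetical name, the download itself is supplied by the caller as the fetch function, and the timer argument exists so that the logic can be exercised without a network.

```python
import time


def pick_fastest_server(servers, fetch, timer=time.perf_counter):
    """Time fetch(server) for each server and return the quickest that works."""
    timings = {}
    for server in servers:
        start = timer()
        try:
            fetch(server)
        except Exception:
            continue  # this server could not be reached; try the next one
        timings[server] = timer() - start
    if not timings:
        raise RuntimeError("no PDB file server could be reached; "
                           "check the network connection or PROXYSERVER")
    return min(timings, key=timings.get)
```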

DEBUG [ True | False ]

If true, MrBUMP will give a more verbose output. Also, temporary directories will not be deleted at the end of the job. For Phaser/Molrep and Refmac jobs, shell scripts to re-run these jobs on their own will be created in the mr or refine directories for the particular search model (not on Windows).
[Default False]

PROXYSERVER <http_proxy server address>

If you need to use a proxy server to access the internet you should set it using PROXYSERVER. MrBUMP uses several on-line services and databases (e.g. the PDB) and thus requires internet access. It is possible to run MrBUMP without internet access by turning off the FASTA, SSM and PQS searches and using locally stored PDB files as input search models. The proxy server is set in the environment in which MrBUMP is running. An example of a proxy server would be "http://proxy.mysite.com:8080/".
[Default not set]

PKEYWORD <Phaser keyword and value>

This keyword allows for the passing in of any Phaser keywords to the underlying call to Phaser. For example "PKEYWORD MACMR PROTOCOL OFF" will turn the refinement option off for molecular replacement in Phaser. For a list of potential keywords please see the Phaser wiki documentation page.
[Default not set]

END

End keyworded input.

EXAMPLE KEYWORD INPUT FILES

Simple example with minimal input using default values:

LABIN F=F SIGF=SIGF FreeR_flag=FreeR_flag
JOBID MY_JOB_1

A more elaborate example:

LABIN F=FP SIGF=SIGFP FreeR_flag=FREE
JOBID MY_JOB_2
MRNUM 10
ENSEMNUM 5
IGNORE 1smw 1smm 1smu
MRPROGRAM molrep phaser
MAPROGRAM mafft
DEBUG True
CLUSTER False
SCOPSEARCH True
SSMSEARCH True
PQSSEARCH True
MDLM True
MDLC True
MDLS True
MDLP False
USEACORN True
END
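
The example above uses shortened keywords such as MDLM for MDLMOLREP. The sketch below shows one way such abbreviations could be resolved when reading a keyword file. It is not MrBUMP's parser: the function names are hypothetical, only a subset of keywords is listed, and it assumes that the capitalised leading part of each name in the keyword lists marks the shortest accepted abbreviation.

```python
from itertools import takewhile

# Subset of keywords, written with the capitalisation used in the lists above.
KEYWORDS = ["LABIn", "JOBId", "MRNUm", "ENSEmnum", "IGNOre", "MRPRograms",
            "MAPRogram", "DEBUg", "CLUSter", "SCOPsearch", "SSMSearch",
            "PQSSearch", "MDLMolrep", "MDLChainsaw", "MDLSculptor",
            "MDLPlyala", "USEAcorn"]


def canonical(token):
    """Expand an abbreviated keyword to its full upper-case name."""
    word = token.upper()
    for keyword in KEYWORDS:
        minimum = "".join(takewhile(str.isupper, keyword))
        if word.startswith(minimum) and keyword.upper().startswith(word):
            return keyword.upper()
    raise ValueError("unrecognised keyword: " + token)


def parse_keywords(text):
    """Return (keyword, arguments) pairs, stopping at END."""
    parsed = []
    for line in text.splitlines():
        fields = line.split()
        if not fields:
            continue
        if fields[0].upper() == "END":
            break
        parsed.append((canonical(fields[0]), fields[1:]))
    return parsed
```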

PROGRAM OUTPUT

Once a job has been started a user may view the current status of the job via the output log file or via the results.html web page which is created in the directory <ROOTDIR>/search_<JOBID>/results and is updated after each stage in the process. A set of search models is first generated and these are fed to the MR/refinement stage in sequence where the ordering depends on the alignment score of the template sequence against the target sequence. If a suitable solution is found, i.e. a model that refines well, the job will terminate and the final results will be displayed. The resulting refined PDB model and MTZ output from Refmac are made available to the user for further model building.
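
The location of the results page follows directly from the ROOTDIR and JOBID keywords. A minimal sketch, using a hypothetical helper name and POSIX-style paths for illustration:

```python
import posixpath


def results_page(rootdir, jobid):
    """Return <ROOTDIR>/search_<JOBID>/results/results.html."""
    return posixpath.join(rootdir, "search_" + jobid, "results", "results.html")


print(results_page("/home/user", "MY_JOB_1"))
```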

AUTHORS

Ronan Keegan, Daresbury Laboratory, UK
Martyn Winn, Daresbury Laboratory, UK
Vincent Fazio, Materials Science and Engineering, CSIRO, Australia

ACKNOWLEDGEMENTS

Norman Stein, Pryank Patel.

MrBUMP Program References

Any publication arising from use of MrBUMP should include the following reference:

R.M.Keegan and M.D.Winn (2007) Acta Cryst. D63, 447-457

In addition, authors of specific programs should be referenced where applicable:

CCP4: Collaborative Computational Project, Number 4. (1994), "The CCP4 Suite: Programs for Protein Crystallography". Acta Cryst. D50, 760-763
FASTA: W. R. Pearson and D. J. Lipman (1988), "Improved Tools for Biological Sequence Analysis", PNAS 85, 2444-2448
HHPred: Remmert M., Biegert A., Hauser A., and Söding J. (2012) "HHblits: Lightning-fast iterative protein sequence searching by HMM-HMM alignment", Nat. Methods 9, 173-175
SSM: E.Krissinel and K.Henrick (2004), "Secondary-structure matching (SSM), a new tool for fast protein structure alignment in three dimensions" Acta Cryst. D60, 2256-2268
SCOP: A.G.Murzin, S.E.Brenner, T.Hubbard & C.Chothia (1995), J.Mol.Biol., 247, 536-540
MAFFT: K. Katoh, K. Kuma, H. Toh and T. Miyata (2005) "MAFFT version 5: improvement in accuracy of multiple sequence alignment" Nucleic Acids Res. 33, 511-518
PROBCONS: Do, C.B., Mahabhashyam, M.S.P., Brudno, M., and Batzoglou, S. (2005) "PROBCONS: Probabilistic Consistency-based Multiple Sequence Alignment." Genome Research 15, 330-340
T_COFFEE: C.Notredame, D. Higgins, J. Heringa (2000) "T-Coffee: A novel method for multiple sequence alignments." Journal of Molecular Biology 302, 205-217
CLUSTALW: Chenna, Ramu, Sugawara, Hideaki, Koike, Tadashi, Lopez, Rodrigo, Gibson, Toby J, Higgins, Desmond G, Thompson, Julie D. (2003) "Multiple sequence alignment with the Clustal series of programs" Nucleic Acids Res 31, 3497-500
CHAINSAW: N.D.Stein (2006) in preparation
MOLREP: A.A.Vagin & A.Teplyakov (1997) J. Appl. Cryst. 30, 1022-1025
PHASER: McCoy, A.J., Grosse-Kunstleve, R.W., Storoni, L.C. & Read, R.J. (2005). "Likelihood-enhanced fast translation functions" Acta Cryst D61, 458-464
REFMAC: G.N. Murshudov, A.A.Vagin and E.J.Dodson, (1997) "Refinement of Macromolecular Structures by the Maximum-Likelihood Method" Acta Cryst. D53, 240-255
PISA: E.Krissinel and K.Henrick (2005), "Detection of Protein Assemblies in Crystals", edited by M.R. Berthold et.al, CompLife 2005, LNBI 3695, pp. 163-174. Springer-Verlag Berlin Heidelberg
ACORN: Yao Jia-xing, Woolfson,M.M., Wilson,K.S. and Dodson,E.J. (2005) Acta. Cryst. D61, 1465-1475