CCP4i version of pdb_extract

Documentation and Examples

©2004 Rutgers, The State University of New Jersey

Research Collaboratory for Structural Bioinformatics

Questions and comments about this manual should be sent to info@rcsb.org.

The Protein Data Bank (PDB) is operated by Rutgers, The State University of New Jersey; the San Diego Supercomputer Center at the University of California, San Diego; and the Center for Advanced Research in Biotechnology/UMBI/NIST -- three members of the Research Collaboratory for Structural Bioinformatics (RCSB).

The RCSB PDB is supported by funds from the National Science Foundation (NSF), the National Institute of General Medical Sciences (NIGMS), the Office of Science, Department of Energy (DOE), the National Library of Medicine (NLM), the National Cancer Institute (NCI), the National Center for Research Resources (NCRR), the National Institute of Biomedical Imaging and Bioengineering (NIBIB), and the National Institute of Neurological Disorders and Stroke (NINDS).

Contents

1. General

1.1 Description

1.2 Credits

2. Usage

2.1 The CCP4i interface

2.2 The command line interface

2.3 The script interface

2.4 The Web interface

2.5 Example for using the different interfaces

3. Files

3.1 The data template file

3.2 The script input file

3.3 The output file

3.4 Input log files from various crystallographic applications

3.4.1 Data collection/reduction

3.4.2 Molecular replacement

3.4.3 Heavy atom phasing

3.4.4 Density modification

3.4.5 Final structure refinement

3.4.6 List of crystallographic applications supported by pdb_extract

4. Command line arguments for running pdb_extract

4.1 Arguments for running pdb_extract to prepare the coordinate files

4.1.1 Examples

4.2 Arguments for running pdb_extract_sf to prepare sf files

4.2.1 Examples

4.3 Arguments for running extract to generate data template and script input files

4.3.1 Examples

4.4 Summary of arguments

5. Appendices

5.1 Example of a data template file

5.2 Example of a script input file

6. References

7. Frequently asked questions

1. General

1.1 Description

The pdb_extract software is designed to automatically extract information and statistics about data reduction, heavy atom phasing, molecular replacement, density modification, and final structure refinement from the output and log files produced by many X-ray crystallographic applications. The program can merge all the extracted information into macromolecular Crystallographic Information File (mmCIF) data files for validation and deposition to the PDB.

Some of the advantages of using this software are listed below:

· It reduces manual intervention during the assembly and preparation of coordinate and structure factor data, making the process quicker and less error-prone.

· Files prepared using this software have more detailed information pertaining to the structure determination and quality.

· Since this application is based on the PDB mmCIF exchange dictionary, its use for structure deposition also facilitates annotation and processing of the data.

· The coordinate and structure factor files prepared by pdb_extract can be validated and deposited using ADIT (either at http://deposit.pdb.org/adit/ or http://pdbdep.protein.osaka-u.ac.jp/adit/). Alternatively, the files can be validated (http://deposit.pdb.org/validate/ or http://pdbdep.protein.osaka-u.ac.jp/validate/) and directly submitted to the PDB either via email (deposit@rcsb.rutgers.edu) or via ftp (pdb.rutgers.edu).

· The program pdb_extract can be used to separately extract and save relevant information and statistics regarding different stages of structure determination. This may be useful for situations where the structure determination process is extended over a long period of time or when different people are involved in the various steps of structure determination.

1.2 Credits

This program was developed by the RCSB-Protein Data Bank in order to facilitate deposition of structural data.

2. Usage

The various interfaces for running the pdb_extract application are explained in the following sections. A few important points to keep in mind regarding the use of this program are:

· A number of trials may have been used at each step of the structure determination. Please use the output and log files from the best or final trial of data processing, heavy atom phasing, density modification and final structure refinement for running pdb_extract.

· Multiple applications may have been used at a single step of structure determination. For example, if program A was used to locate heavy atom positions and program B was used to refine heavy atom parameters (like x, y, z, occupancy and B factors), information regarding the phasing statistics should be extracted from the output of program B.

· If multiple structures need to be deposited to the PDB, please run pdb_extract separately for each structure, since each one will be deposited as a separate entry with its own unique PDB ID.

· Once the pdb_extract program has been installed as part of the CCP4 package, it can be run in any one of the following ways:

o The CCP4i interface

o The command line interface

o The script interface

o The Web interface

2.1 The CCP4i interface

The graphical user interface for CCP4 (CCP4i) can be used to prepare structural data for deposition to the PDB. This interface is intuitive and easy to use. Steps for running pdb_extract using this interface are described below:

· In the main CCP4i window, click on the yellow button at the top left hand corner. This lists the different modules of CCP4.

· Select the ‘Validation and Deposition’ module.

· Now select the ‘Data Harvesting Management Tool’ option from the left hand menu. This opens a new window titled the 'Harvesting Manager'.

· In the 'Harvesting Manager' window, under the option ‘Run program to’, select ‘Extract additional information for deposition’.

· Now under the 'Extract information from' options, select ‘Generate a data template’. This opens boxes for specifying the input coordinate file (in either PDB or mmCIF format) and the name of an output file. You may either type the complete path of the input file in the appropriate box or select the file using the browsing function. Note that blank spaces should not be entered in any of the boxes, even in those left without a file name.

· Select the 'Run Now' option in the Run button to generate the data template file.

· Edit and complete the data template file according to the instructions included within the file. Please refer to section 3.1 for more information. An example data template file is included in section 5.1.

· Return to the 'Extract information from' category and select the ‘Generate a complete mmCIF file for PDB deposition’ option.

· Select the program and log file names used for each stage of the structure determination like data scaling, heavy atom phasing, density modification, molecular replacement and structure refinement. Also include the names of the data template file generated above, and an output file.

· Run the pdb_extract program to obtain a complete mmCIF format file that can be uploaded to the web version of ADIT (at http://deposit.pdb.org/adit/ or http://pdbdep.protein.osaka-u.ac.jp/adit/) for structure validation and deposition. Alternatively, the mmCIF format file may be validated either at http://deposit.pdb.org/validate/ or http://pdbdep.protein.osaka-u.ac.jp/validate/, corrected if necessary and submitted to the PDB either via email (deposit@rcsb.rutgers.edu) or via ftp (pdb.rutgers.edu).

· The structure factor file for the deposition should be converted to mmCIF format using the 'Structure factor for deposition' button in the main CCP4i window. Alternatively, the mtz2various application can be used to convert mtz format structure factor file(s) to mmCIF format. Note that the structure factor data should be those used for the final refinement and should contain at least h, k, l, F, SigmaF (and/or I, SigmaI), and test flags, if appropriate. Another method for preparing the structure factor file using pdb_extract is described in sections 2.2 and 2.3.

Also see section 2.5 for an example.

2.2 The command line interface

Once the CCP4 suite of programs has been completely installed, the pdb_extract application may be run using a command line interface. This allows greater flexibility for using the various options of the program.

· Obtain the data template file 'data_template.text' using one of the following commands, depending on whether the coordinate file is in PDB or mmCIF format:

extract -pdb coordinate_PDB_file_name

extract -cif coordinate_CIF_file_name

· Edit and complete the data template file according to the instructions included within the file. An example of the data template file is included in section 5.1.

· Run the pdb_extract program using the appropriate arguments to include the names of the programs and their log files in the command line, to obtain a complete mmCIF format file including coordinates and all the data statistics. Instructions for including the different filenames and a list of the commonly used arguments for running this program are included in section 4.1. They are also described in an example in section 2.5.

· Run pdb_extract_sf to convert the structure factor file to mmCIF format. If the structure factor file is in mtz format, it can also be converted to mmCIF format using the mtz2various program, available as part of CCP4. If multiple structure factor data sets were used for phasing the structure (for example, in a MAD experiment), pdb_extract_sf can be used to concatenate all the data sets into a single file. The first block of structure factors should be the one used for the refinement. Note that each block of structure factor data should contain h, k, l, F, SigmaF (and/or I, SigmaI), and test flags, if appropriate.

Also see section 2.5 for an example.
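Because a command line run names many log and output files, it may help to confirm that each file exists and is not empty before starting pdb_extract. The following is a minimal sketch of such a check; check_inputs is a hypothetical helper written for this illustration and is not part of the pdb_extract distribution, and the file names in the usage comment are placeholders.

```shell
# check_inputs FILE...: report every named file that is missing or empty.
# Hypothetical helper for illustration; not shipped with pdb_extract.
check_inputs() {
    missing=0
    for f in "$@"; do
        # -s is true only when the file exists and has a size greater than zero
        if [ ! -s "$f" ]; then
            printf 'missing or empty: %s\n' "$f" >&2
            missing=$((missing + 1))
        fi
    done
    # Succeed only when every file was found
    [ "$missing" -eq 0 ]
}

# Example use (placeholder file names):
#   check_inputs data_template.text scale1.log solve.prt && echo "inputs look complete"
```

The helper only checks that files are present; it does not verify that a log file comes from the best or final trial, which still has to be decided by the user.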

2.3 The script interface

This interface uses scripts similar to the CNS script input files. It is an easy and user friendly interface that can be executed without the use of a graphical interface. The advantage here is that it does not involve the use of specific arguments to include the names of all the programs, output and log files. All this information can be included in the script input file 'log_script.inp'.

· Obtain the data template file 'data_template.text' and the script input file 'log_script.inp' using one of the following commands, depending on whether the coordinate file is in PDB or mmCIF format:

extract -pdb coordinate_PDB_file_name

extract -cif coordinate_CIF_file_name

· Edit the data template file according to the instructions included in the file. Fill in the names of all relevant software applications, their log and output file names, and the data template file name in the 'log_script.inp' file. Examples of the data template and script input files are included in sections 5.1 and 5.2 respectively.

· Run the program using the command

extract -ext log_script.inp

Also see section 2.5 for an example.

2.4 The Web interface

This is actually not part of the CCP4 package. However, if internet access is available on the workstation running CCP4, this option is available at http://pdb-extract.rutgers.edu/. Detailed instructions and examples for running the program using this interface are available from this link.

2.5 Examples

Here is an example in which the protein structure was solved by multiwavelength anomalous diffraction (MAD). The structure determination details were as follows:

· A single crystal was used for data collection.

· Three datasets were collected for the MAD experiment, at the inflection, peak, and remote wavelengths of the selenium edge.

· The program HKL2000 was used for indexing and scaling the data sets.

· The program SOLVE was used for phase determination and refinement of heavy atom parameters. All three reflection data files were used for phasing.

· RESOLVE was used for density modification.

· REFMAC5 was used for final structure refinement.

The output and log files generated from above programs were as follows:

· The HKL2000 program generated three reflection data files (scale1.sca, scale2.sca, scale3.sca) and three log files (scale1.log, scale2.log, scale3.log) for the three data sets collected.

· The SOLVE program generated one log file (solve.prt) containing phasing statistics and one PDB file (ha.pdb) containing heavy atom (Selenium in this case) coordinates.

· The RESOLVE program generated one log file (resolve.log) containing density modification statistics.

· The REFMAC5 program generated one PDB file (refmac.pdb) containing atomic coordinates and one mmCIF file (native.refmac) containing refinement statistics. The structure factor data used for the final refinement was refmac_sf.mtz.

· The steps involved in running pdb_extract for complete data extraction for this example, using the different interfaces (sections 2.1-2.4) are described below.

Using the CCP4i interface:

· Follow the instructions (in section 2.1) to launch the 'Harvesting Manager' window and select the ‘Generate a data template’ option.

· In the box titled ‘PDB File’, upload the file refmac.pdb.

· Include the name of an output file, for example ‘data_template.text’ with the complete path.

· Select the 'Run Now' option to generate the data template file.

· Edit and complete this file according to the instructions included in it. Replace any chain breaks (denoted by question marks ‘????’ in the one-letter-code sequence listed in the ‘Sequence information’ category), with the sequence of the residues that were not modeled due to missing density etc. Also add any residues missing from the N- and C-termini and correct the sequence where the residues were modeled as Ala or Gly due to missing side chain density. If the question marks ‘????’ are not removed from the sequence the program may not run to completion. You may also complete other non-electronically produced information, like the author names, citation etc. in the data template file.

· Return to the 'Extract information from' category and select the ‘Generate a complete mmCIF file for PDB deposition’ option.

· In the data scaling section, select the scaling program HKL2000 and upload the log file scale1.log to extract scaling statistics (scale1 was the data used for the final refinement).

· In the phasing section, select phasing method MAD, program SOLVE and upload the log file solve.prt to obtain phasing statistics.

· In the density modification section, select the program RESOLVE and upload the log file resolve.log to obtain density modification statistics.

· In the structure refinement section, select the program REFMAC5 and upload the PDB coordinate file refmac.pdb and the data harvest file native.refmac to obtain the PDB coordinates and refinement statistics.

· Upload the data template file generated above (data_template.text) to obtain the sequence information for all unique polymers in the file and any other non-electronically produced information that you may have added to the file.

· Run the program to obtain a complete mmCIF format file.

· The structure factor data accompanying this file, refmac_sf.mtz, can be prepared using the 'Structure factor for deposition' button in the main CCP4i window. Alternatively, the mtz2various application can be used to convert the mtz format structure factor file(s) to mmCIF format. For details on generating a file with this structure factor data and the other data sets used for phasing the structure see instructions in the command line section below.

· The coordinate and structure factor files can be validated and deposited to the PDB as instructed in section 1.1.
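As noted above, chain-break markers (‘????’) left in the sequence may prevent the program from running to completion. A quick search of the edited data template can confirm that none remain. The sketch below assumes the markers appear literally as four consecutive question marks, as described in this section; template_has_gaps is a hypothetical helper name.

```shell
# template_has_gaps FILE: list lines of FILE that still contain a '????' marker.
# Exit status is 0 when at least one marker is found, non-zero otherwise.
# Hypothetical helper for illustration; not shipped with pdb_extract.
template_has_gaps() {
    # -F treats '????' as a fixed string, -n prefixes each hit with its line number
    grep -n -F '????' "$1"
}

# Example use:
#   if template_has_gaps data_template.text; then
#       echo "replace the ???? markers before running pdb_extract" >&2
#   fi
```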

Using the command line interface

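· Generate the data template file 'data_template.text' using one of the following commands, depending on whether the coordinate file is in PDB or mmCIF format (in this example, the PDB format file refmac.pdb):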

extract -pdb coordinate_PDB_file_name

extract -cif coordinate_CIF_file_name

· Run the pdb_extract program to obtain coordinates and statistics using the following command:

pdb_extract -e MAD \
-p SOLVE -iLOG solve.prt \
-d RESOLVE -iLOG resolve.log \
-r refmac5 -icif native.refmac -ipdb refmac.pdb \
-s HKL -iLOG scale1.log \
-sp HKL -iLOG scale1.log scale2.log scale3.log \
-iENT data_template.text \
-o output.cif

Note that the command line can be extended by using a backslash (\) at the end of a line. There should be no space after the backslash (\). Refer to section 4 for a list and explanation of arguments used to input the names of applications and their output and/or log files.

· Prepare the structure factor file for deposition. Since the structure factor file used for the final refinement was in mtz format (refmac_sf.mtz), first convert it to refmac_sf.mmcif using either the CCP4i interface or the mtz2various application. Since the structure factor data for all three wavelengths were used for phase determination, they should then be merged with the refinement data into one file for deposition. Run pdb_extract_sf as follows; it also converts the HKL format structure factors to mmCIF format:

pdb_extract_sf -rt F -rp refmac5 -idat refmac_sf.mmcif \
-dt I -dp HKL \
-c 1 -w 1 -idat scale1.sca \
-c 1 -w 2 -idat scale2.sca \
-c 1 -w 3 -idat scale3.sca \
-o output_sf.cif

In this command, the first line describes the data used for refinement and the next four lines describe the data used for phasing.

Note that each block of structure factor data should contain h, k, l, F, SigmaF (and/or I, SigmaI), and test flags, if appropriate. In this case, a test set was used for the final structure refinement, so the file refmac_sf.mmcif should include a column with the test flags. The output file (output_sf.cif) contains one reflection data block for refinement (derived from refmac_sf.mmcif) and a data block for protein phasing (derived from scale1.sca, scale2.sca and scale3.sca).

· The coordinate and structure factor files can be validated and deposited to the PDB as instructed in section 1.1.
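The note above that no space may follow a continuation backslash is easy to violate when long commands are kept in a script file. One way to catch this is to search the script for a backslash followed only by whitespace at the end of a line. This is a minimal sketch; find_bad_continuations is a hypothetical helper name, and the script file name in the usage comment is a placeholder.

```shell
# find_bad_continuations FILE: list lines where a backslash is followed by
# trailing whitespace, which stops the shell from continuing the command.
# Exit status is 0 when such a line is found, non-zero otherwise.
# Hypothetical helper for illustration; not shipped with pdb_extract.
find_bad_continuations() {
    # '\\' matches one literal backslash; [[:space:]]+$ matches the whitespace
    # between it and the end of the line
    grep -n -E '\\[[:space:]]+$' "$1"
}

# Example use (placeholder script name):
#   find_bad_continuations run_extract.sh && echo "remove the trailing spaces" >&2
```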

Using the script interface

· Generate the data template file 'data_template.text' using one of the following commands, depending on whether the coordinate file is in PDB or mmCIF format:

extract -pdb coordinate_PDB_file_name

extract -cif coordinate_CIF_file_name

Two files are generated, 'data_template.text' and 'log_script.inp'.

· Edit the log_script.inp file according to the instructions in the file to include names of all the applications used for the structure determination and the names of their output and log files.

· Run the program using the command:

extract -ext log_script.inp

The coordinate and structure factor output files generated in this run should be identical to those generated using the command line interface.

· The coordinate and structure factor files can be validated and deposited to the PDB as instructed in section 1.1.

3. Files

This section describes various input and output files that are used for running pdb_extract. Useful tips for using these files are included here. For examples of a data template file and script input file see sections 5.1 and 5.2.

3.1 The data template file

· This file is generated by running extract on a coordinate file as follows:

extract -pdb coordinate_file_name or

extract -cif coordinate_file_name

· The data template file contains the sequence information for all unique polymers (protein or nucleic acids) in the structure and other non-electronically captured information.

· Categories 1 and 2 must be completed in the file before running pdb_extract. Categories 3-18 may either be completed here or later during deposition using ADIT.

· In the data template file, only strings included between the 'less than' and 'greater than' signs (<.....>) will be parsed for evaluation by the program. Therefore, DO NOT write anything to the left of the 'less than' sign or to the right of the 'greater than' sign.

· All alphanumeric values or strings that you include in the different categories should be within double-quotes. Blank spaces or carriage returns within a pair of double quotes are ignored by the program. DO NOT use double quotes (") within strings that you enter.

· See section 5.1 for an example of a data template file.

3.2 The script input file

· This file is also generated by running extract (in addition to the data template file). This file is used only for running pdb_extract using the script interface.

· The script input file is used to enter the names of the crystallographic software used for structure determination and the log, PDB, mmCIF or other text files generated by them. Names of the coordinate, structure factor and data template file should also be included here.

· The script input file should be completed according to the type of experiment used for structure determination. The command 'extract -ext log_script.inp' is then used to obtain the completed structure data files ready for validation and deposition.

· Only strings included between the 'less than' and 'greater than' signs (<.....>) will be parsed for evaluation by the program. Therefore, DO NOT write anything to the left of the 'less than' sign or to the right of the 'greater than' sign.

· The log files used for generating the deposition should be generated from the best (usually the last) trial for each crystallographic application.

· See section 5.2 for an example of a script input file.

3.3 Output files

· The output files generated by pdb_extract (coordinate and structure factors) are in mmCIF format.

· These files are ready to be uploaded to the validation server for validation or to the ADIT tool for validation and deposition.

· mmCIF files containing information regarding different stages of structure determination can be prepared separately. Thus instead of saving all the log files generated during structure determination, pdb_extract output files containing relevant information regarding a particular step of structure determination can be saved. These output files (which are in mmCIF format) can be read later by pdb_extract and combined to create a complete mmCIF format file for validation and deposition.

3.4 Running pdb_extract on the output and log files from various crystallographic applications

pdb_extract can be run independently on the output and log files obtained from the various applications used for structure determination. Details about the information extracted from the log and output files of the different applications are described in the following sections.

3.4.1 Data collection/reduction/scaling

This is the early stage of solving a crystal structure. The statistics and details about the data integration and scaling describe the quality of the structure factor data. Information that can be extracted at this stage includes:

· Intensities (or amplitudes) and standard deviations
· Data completeness (overall, resolution shells)
· Redundancy (overall, resolution shells) and mosaicity
· R-merge, R-sym (overall, resolution shells)
· <I>/<sigmaI> (overall, resolution shells)
· Total and unique reflections collected
· Resolution range

A few data scaling programs that are commonly used are described in the following sections.

3.4.1.1 Using HKL, HKL2000, or scalepack

(http://www.lnls.br/infra/linhasluz/denzo-hkl.htm)

This package by Otwinowski and Minor is used for data collection, data reduction, and scaling. The program can be run using either a graphical interface or the scalepack scripts. The LOG file (e.g. scale1.log) contains scaling statistics for the data. pdb_extract can be run on the log files as follows:

pdb_extract -s HKL -ilog scale1.log (one dataset for refinement)

pdb_extract -sp HKL -ilog scale1.log scale2.log … (multiple datasets for phasing)

3.4.1.2 Using D*trek (http://www.msc.com/protein/dtrek.html)

This package by Rigaku/MSC is used for data collection, data reduction and scaling. The program can be run using a graphical interface to scale (or merge/average) datasets. The log file (e.g. scale1.log) contains statistics for data scaling. pdb_extract can be run on the log files as follows:

pdb_extract -s Dtrek -ilog scale1.log (one dataset for refinement)

pdb_extract -sp Dtrek -ilog scale1.log scale2.log … (multiple datasets for phasing)

3.4.1.3 Using SAINT (http://xray.chm.bris.ac.uk/facilities/smart.html)

This package by Bruker (Siemens Molecular Analytical Research Tool) is used for data collection, data reduction and scaling. The log file (e.g. scale1.ls) contains statistics for data scaling. pdb_extract can be run on the log files as follows:

pdb_extract -s SAINT -ilog scale1.ls (one dataset for refinement)

pdb_extract -sp SAINT -ilog scale1.ls scale2.ls … (multiple datasets for phasing)

3.4.1.4 Using 3DSCALE

· This program by Fu et al. is used for data scaling. pdb_extract can be run on the log file (e.g. scale1.log) as follows:

pdb_extract -s 3DSCALE -ilog scale1.log (one dataset for refinement)

pdb_extract -sp 3DSCALE -ilog scale1.log scale2.log … (multiple datasets for phasing)

3.4.1.5 Using SCALA or TRUNCATE (http://www.ccp4.ac.uk/dist/html/scala.html)

Scala and truncate are CCP4 supported programs. Scala is used to scale together multiple observations of reflections, while truncate computes structure factor amplitudes from the intensities. Both programs generate mmCIF format files containing useful statistics. When running these programs, export the data harvest file. If the output files generated by these applications are name.scala and name.truncate, pdb_extract can be run as follows:

pdb_extract -s scala -icif name.scala (one dataset for refinement)
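The scaling commands in sections 3.4.1.1 to 3.4.1.5 follow one pattern: -s for the single dataset used in refinement, -sp for the datasets used in phasing, and -ilog or -icif depending on the program. The sketch below only prints the corresponding command so the pattern can be seen in one place; scaling_command is a hypothetical helper, and it covers only the combinations shown in this manual.

```shell
# scaling_command MODE PROGRAM FILE...
#   MODE is "refinement" (one dataset, -s) or "phasing" (several datasets, -sp).
# Prints the pdb_extract command line without running it.
# Hypothetical helper for illustration; not shipped with pdb_extract.
scaling_command() {
    mode=$1
    program=$2
    shift 2
    case $mode in
        refinement) flag=-s ;;
        phasing)    flag=-sp ;;
        *) echo "unknown mode: $mode" >&2; return 1 ;;
    esac
    case $program in
        HKL|Dtrek|SAINT|3DSCALE) input=-ilog ;;
        scala)
            # This manual shows only the single-dataset form for scala
            [ "$mode" = refinement ] || { echo "scala: only the refinement form is shown" >&2; return 1; }
            input=-icif ;;
        *) echo "unsupported program: $program" >&2; return 1 ;;
    esac
    # "$*" joins the remaining file names with single spaces
    echo "pdb_extract $flag $program $input $*"
}

# Example use:
#   scaling_command phasing HKL scale1.log scale2.log scale3.log
```

Printing rather than executing keeps the sketch safe to try; the printed line can be copied into a terminal once the file names have been checked.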

3.4.2 Programs for molecular replacement

Information and statistics regarding molecular replacement that can be extracted from the log files are listed below:

· Low and high resolution used in rotation and translation.

· Rotation and translation methods

· Reflection cut off criteria, reflection completeness.

· Correlation coefficients for I or F between observed and calculated.

· R_factor, packing information, and model details.

A few molecular replacement programs that are commonly used are described in the following sections.

3.4.2.1 Using CNS/XPLOR (http://cns.csb.yale.edu/v1.1/)

In CNS, molecular replacement can be done by first running a rotation search (using cross_rotation.inp) followed by a translation search (using translation.inp). The log file, called translation.list, contains scoring information regarding the best solutions of the molecular replacement. pdb_extract can be run on the log file as follows:

pdb_extract -o test.mmcif -e MR -m CNS -ilog translation.list

3.4.2.2 Using Amore (CCP4 version 4.1-5.0)

(http://www.ccp4.ac.uk/dist/html/INDEX.html)

· Amore is a CCP4 supported program, commonly used for molecular replacement. After the rotation and translation searches, two log files, rotation.log and translation.log, are generated. pdb_extract can be run on the log files as follows:

pdb_extract -e MR -m amore -ilog rotation.log translation.log -o test.mmcif

3.4.2.3 Using Molrep (CCP4 version 4.1-5.0)

(http://www.ccp4.ac.uk/dist/html/INDEX.html)

· Molrep is another CCP4 supported program that is used for molecular replacement. All the statistics regarding the molecular replacement can be recorded in the log file, say molrep.log. pdb_extract can be run on the log file as follows:

pdb_extract -e MR -m molrep -ilog molrep.log -o test.mmcif

3.4.2.4 Using EPMR (http://www.msg.ucsf.edu/local/programs/epmr/epmr.html)

· EPMR is a command line program for molecular replacement. Write out a log file when you run the program as:

epmr [options] files > epmr.log

All the relevant statistics will be recorded in the log file. pdb_extract can be run as follows:

pdb_extract -e MR -m epmr -ilog epmr.log -o test.mmcif
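The four molecular replacement commands in section 3.4.2 differ only in the program name and the log files passed to -ilog. The sketch below collects them in one lookup and prints the resulting command; mr_command is a hypothetical helper, the log file names are the example names used in this section, and the output name defaults to test.mmcif as in the examples.

```shell
# mr_command PROGRAM [OUTPUT]: print the pdb_extract command for a molecular
# replacement program, using the example log file names from section 3.4.2.
# Hypothetical helper for illustration; not shipped with pdb_extract.
mr_command() {
    case $1 in
        CNS)    logs='translation.list' ;;
        amore)  logs='rotation.log translation.log' ;;
        molrep) logs='molrep.log' ;;
        epmr)   logs='epmr.log' ;;
        *) echo "unsupported program: $1" >&2; return 1 ;;
    esac
    # ${2:-test.mmcif} falls back to the output name used in the examples
    echo "pdb_extract -e MR -m $1 -ilog $logs -o ${2:-test.mmcif}"
}

# Example use:
#   mr_command amore mr_stats.mmcif
```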

3.4.3 Programs for heavy atom position location and protein phasing

The phase problem lies at the center of macromolecular crystallography. Heavy atom phasing may be used to solve this problem. The log files generated at this stage contain important statistics and information. pdb_extract can be used to extract the following information from the log files:

· Wavelength, f’, f”, resolution range

· FOM (acentric, centric, overall, resolution shells)

· R-Cullis (acentric, centric, overall, resolution shells)

· R-Kraut (acentric, centric, overall, resolution shells)

· Phasing power (acentric, centric, overall, resolution shells)

· Number of heavy atom sites, heavy atom type.

· Method used to locate heavy atom(s).

· Heavy atom B-factor, occupancies, and xyz coordinates.

A few commonly used programs for heavy atom phasing are described in the following sections.

3.4.3.1 Using CNS/XPLOR (http://cns.csb.yale.edu/v1.1/)

· CNS may be used for initial phase determination for a structure. The scripts for locating heavy atoms and phase refinement are ‘mad_phase.inp’ or ‘ir_phase.inp’. When you run these scripts, you will get the output files like ‘phase_final.summary’, ‘phase_final.sdb’ or ‘mad_phase.fp’. The file phase_final.summary has all the phasing statistics while the file phase_final.sdb has all the heavy atom coordinates, occupancies and B factors. The file mad_phase.fp has refined f_prime and f_double_prime, if applicable.

(Note: The refined heavy atom coordinates, B factors and occupancies can be found in a file like ‘phase_final.sdb’. This file may be converted to the PDB format, by running the script sdb_to_pdb.inp. This generates a PDB format file ‘phase_final.pdb’.)

· To extract phasing information, run the following:

pdb_extract -o test.mmcif -e MAD -p CNS \
-iLOG phase_final.summary phase_final.sdb mad_phase.fp

or, if you have the heavy atom coordinates in PDB format:

pdb_extract -o test.mmcif -e MAD -p CNS \
-iLOG phase_final.summary mad_phase.fp \
-iPDB phase_final.pdb

3.4.3.2 Using MLPHARE (http://www.ccp4.ac.uk/dist/html/INDEX.html)

· MLPHARE is a CCP4 supported program and is used for refining heavy atom parameters.

· When running the program using the CCP4i interface, select the data harvest button. When using scripts do not use the keyword NOHARV. In either case a file (say name.mlphare) is generated, which is in mmCIF format. This contains all the statistics and information regarding the heavy atom phasing refinement.

· Run the program REVISE (in CCP4) to extract wavelength information. This generates a log file say prephadata.log. Extract phasing information from these files as follows:

pdb_extract -o test.mmcif -e method -p MLPHARE \
-iCIF name.mlphare -iLOG prephadata.log

3.4.3.3 Using SOLVE (http://www.solve.lanl.gov/)

· Solve is a program used for locating heavy atoms and refining their position and occupancy. Information regarding these stages of structure solution is summarized in a file called “solve.prt” (default name used by the program). The program exports the heavy atom coordinates, in a file called “ha.pdb”.

· The pdb_extract program can be used to extract phasing information for any one of the following:

o SOLVE log file for a single SAD experiment

o SOLVE log file for a single MAD experiment

o SOLVE log file for a single MIR experiment

o SOLVE log file for phasing based on a single MIR experiment and anomalous data at the native wavelength (e.g. MIR using two different derivatives Hg, plus Fe anomalous data in the native dataset)

o SOLVE log file for phasing based on a single MAD experiment and two sets of anomalous scattering in the native dataset. (e.g. MAD using Se, combined with anomalous data for Se and Fe at the native wavelength)

o SOLVE log file for phasing based on a combination of MAD and MIR experiments

o SOLVE log file for phasing based on two MAD experiments (e.g. Using Se and Hg)

o SOLVE log file for phasing based on more than one MIR experiments (e.g. using Hg, I, Pt etc.)

The phasing information can be extracted as follows:

pdb_extract -e method -p SOLVE -iLOG solve.prt -ipdb ha.pdb -o test.mmcif

3.4.3.4 Using SHARP (http://babinet.globalphasing.com/sharp/)

· SHARP is used for finding heavy atom positions and refining the heavy atom parameters. When running SHARP or autoSHARP, the log files are saved in the directory sharpfiles/logfiles_local/dirs, where dirs refer to the subdirectories for your various structures. Please note that the location of log files generated by the program may vary depending on how the program is installed.

· Of the numerous output files generated by SHARP, the following are used for extracting information regarding this stage of structure determination:

(For version 1.3.x)

o Heavy.pdb: which contains the heavy atom coordinates.

o FOMstats.html: which contains figure of merit statistics.

o Name.sin: which is a generated input script with input information.

o Otherstat.html: which contains Rcullis, Rkraut and phasing power.

(For version 2.0 or above)

o Heavy.pdb: which contains the heavy atom coordinates.

o FOMstats.html: which contains figure of merit statistics.

o Name.sin: which is a generated input script with input information.

o RCullis_?.html: which contains Rcullis.

o PhasingPower_?.html: which contains phasing power.

· The easiest way to obtain these files is to run the program from the SUSHI interface. Review all the log files in a web browser and save them in plain text (or html) format. The phasing information can be extracted as follows:

pdb_extract -o test.mmcif -e method -p SHARP -iPDB heavy.pdb \

-iLOG FOMstats.html Otherstat.html Name.sin

3.4.3.5 Using SnB (http://www.hwi.buffalo.edu/SnB/)

· SnB is a graphical interface program that uses the Shake-and-Bake algorithm. It produces heavy atom coordinates (e.g. heavy.pdb) in PDB format. However, this program does not refine the heavy atom parameters and therefore produces no refinement statistics. The heavy atom coordinates can be extracted as follows:

pdb_extract -o test.mmcif -e method -p SNB -iPDB heavy.pdb

Note: If a program like MLPHARE or CNS was used for refining the heavy atom coordinates determined by SnB, extract the heavy atom coordinates as well as the phasing information from the MLPHARE or CNS output files (even though SnB may have been used to find the initial heavy atom positions).

3.4.3.6 Using BnP (http://www.hwi.buffalo.edu/BnP/)

· BnP is a combination of the programs SnB (described above) and PHASES by Furey (described below). Here, the heavy atom positions are located by SnB while the heavy atom parameters are refined by PHASES. The log file (for example auto.log) can be found in the directory ~/PHASES/* and contains the phasing power for each phasing set. The phasing information can be extracted as follows:

pdb_extract -o test.mmcif -e method -p BnP -ilog auto.log -iPDB heavy.pdb

pdb_extract -o test.mmcif -e method -p phases -ilog auto.log -iPDB heavy.pdb

3.4.3.7 Using SHELXD or SHELXS (http://shelx.uni-ac.gwdg.de/SHELX/)

· These programs are similar to SnB in that they also only compute the heavy atom substructure in PDB format (e.g. heavy.pdb). The heavy atom coordinates may be extracted as follows:

pdb_extract -o test.mmcif -e method -p SHELXD -iPDB heavy.pdb

pdb_extract -o test.mmcif -e method -p SHELXS -iPDB heavy.pdb

3.4.3.8 Using PHASES (http://imsb.au.dk/~mok/phases/phases.html)

· The PHASES package was developed by Furey and can be used to locate heavy atom positions and refine the heavy atom parameters. The log file (for example name.log) can be found in the directory ~/PHASES/* and contains the phasing power for each phasing set. Heavy atom coordinates and phasing information can be extracted as follows:

pdb_extract -o test.mmcif -e method -p Phases -ilog name.log -iPDB heavy.pdb

3.4.4 Programs for density modification

Density modification is normally applied after obtaining the phase information (from heavy atom coordinates, molecular replacement etc.). The application pdb_extract can be used to extract the following information from the generated log files:

· Density modification method

· FOM after density modification (overall, resolution shells)

· Solvent mask determination method

· Structure solution software

A few commonly used density modification programs are described in the following sections.

3.4.4.1 Using CNS/XPLOR (http://cns.csb.yale.edu/v1.1/)

· An input script such as ‘density_modify.inp’ runs density modification in CNS and produces a log file called ‘density_modify.list’. The density modification statistics can be extracted as follows:

pdb_extract -o test.mmcif -e method -d CNS -iLOG density_modify.list

3.4.4.2 Using DM (http://www.ccp4.ac.uk/dist/html/INDEX.html)

· DM is a density modification program supported by CCP4. It generates a log file (like dm.log), both when the program is run using the CCP4i interface and also using scripts. The density modification statistics can be extracted as follows:

pdb_extract -o test.mmcif -e method -d DM -iLOG dm.log

3.4.4.3 Using SOLOMON (http://www.ccp4.ac.uk/dist/html/INDEX.html)

· SOLOMON is also a density modification program supported by CCP4. A log file (like solomon.log) is generated when the program is run either from the CCP4i interface or from scripts. The density modification statistics can be extracted as follows:

pdb_extract -o test.mmcif -e method -d SOLOMON -iLOG solomon.log

3.4.4.4 Using RESOLVE (http://www.solve.lanl.gov/)

· RESOLVE is a density modification program in the solve/resolve package. Normally it runs together with SOLVE, but it can be run separately. Run RESOLVE so that a log file (like resolve.log) is written out using “resolve input_file > resolve.log”. The density modification statistics can be extracted as follows:

pdb_extract -o test.mmcif -e method -d RESOLVE -iLOG resolve.log

3.4.4.5 Using SHARP (http://babinet.globalphasing.com/sharp/)

· Density modification in SHARP is actually carried out by DM or SOLOMON. Running density modification in SHARP therefore generates a log file like ‘dm.log’. The density modification statistics can be extracted as follows:

pdb_extract -o test.mmcif -e method -d SHARP -iLOG dm.log

pdb_extract -o test.mmcif -e method -d dm -iLOG dm.log

3.4.5 Programs for final structure refinement

The structure refinement is performed at the end of structure determination. Normally the atom coordinates are generated in PDB format and the statistics are generated in log files. The pdb_extract program can be applied to extract the following information:

· Number of reflections used in refinement, and in R-Free set.

· Resolution range (overall, highest resolution shell)

· R-factor (overall, resolution shells)

· Number of atoms refined

· Cell parameters and space group.

· The xyz coordinates of all the atoms.

· RMS Bond Distances, Bond Angles, Chiral Volume, Torsion Angles

· Isotropic temperature factor restraints

· Non-crystallographic symmetry restraints

· Solvent model used

· Overall Average Isotropic B Factor

· Overall Anisotropic B Factor

· Overall Isotropic B Factor

· Topology/parameter data used to refine the structure

· Refinement software

A few refinement programs that are commonly used are described in the following sections

3.4.5.1 Using CNS/XPLOR (http://cns.csb.yale.edu/v1.1/)

· CNS is used for final structure refinement. After completion of the refinement, a pre-deposition file can be created which is rich in statistics regarding the refinement. This is done by running the script deposit_mmcif.inp to produce a file such as deposit.mmcif. This file should be used for extracting refinement statistics as follows:

pdb_extract -o test.mmcif -e method -r CNS -iCIF deposit.mmcif

3.4.5.2 Using REFMAC5 (http://www.ccp4.ac.uk/dist/html/INDEX.html)

· REFMAC5 is a program used for structure refinement (also supported by CCP4). When using the CCP4i interface, select the data harvest button; in script mode, do not use the keyword NOHARV. The output files generated upon running this application include an mmCIF format file (name.refmac), which contains information about the structure refinement, and a PDB format file (name.pdb), which contains the atomic coordinates. Refinement statistics can be extracted from these files as follows:

pdb_extract -o test.mmcif -e method -r REFMAC5 -iCIF \

name.refmac -iPDB name.pdb

3.4.5.3 Using SHELXL (http://shelx.uni-ac.gwdg.de/SHELX/)

· SHELXL is a program within the SHELX package and is used for structure refinement.

· After completion of structure refinement, please run the interactive program shelxpro and use option B. This program generates a PDB format file (name.pdb) with header information. Refinement statistics can be extracted from these files as follows:

pdb_extract -o test.mmcif -e method -r SHELXL -iPDB name.pdb

3.4.5.4 Using TNT (http://www.uoxray.uoregon.edu/tnt/welcome.html)

· TNT is a crystal structure refinement program. After completion of the structure refinement the command rfactor is used to generate a log file (rfactor.log) as follows:

rfactor name.cor > rfactor.log

· The to_pdb command may be used to convert coordinates in TNT format (name.cor) to the PDB format (name.pdb) as:

to_pdb name.cor

· The symmetry information must also be provided via a symmetry file (e.g. p6122.dat) in the control file name.tnt

· Complete information regarding the refinement statistics can be extracted from the output PDB file and log files as follows:

pdb_extract -r TNT -iLOG p6122.dat rfactor.log -iPDB name.pdb

3.4.5.5 Using ARP/wARP (http://www.embl-hamburg.de/ARP/)

· ARP/wARP is a program for automatic structure solution and refinement, where REFMAC5 is used for the structure refinement.

· The new version (6.0) can use the graphical interface of CCP4i. Thus the program may either be run from the CCP4i interface or using scripts. The output files include a log file (warpNtrace_refine.log) and a PDB file (warpNtrace.pdb). Information can be extracted from these files as follows:

pdb_extract -o test.mmcif -e method -r WARP \

-iLOG warpNtrace_refine.log \

-iPDB warpNtrace.pdb

3.4.5.6 Using RESTRAIN (http://www.ccp4.ac.uk/dist/html/INDEX.html)

· RESTRAIN is a CCP4-supported program used for structure refinement. When using the script, do not use the keyword NOHARV. This program generates an mmCIF format file (name.restrain), which contains information about the structure refinement, and a PDB format file (name.pdb), which contains the coordinates. The refinement statistics can be extracted as follows:

pdb_extract -o test.mmcif -e method -r RESTRAIN -iCIF name.restrain \

-iPDB name.pdb

3.4.6 Summary of crystallographic applications supported by pdb_extract

Category	Software	Versions	Authors
Data collection and reduction	HKL/SCALEPACK	1.30 - 1.96	Otwinowski & Minor (1997)
	d*TREK	7.0SSI	Pflugrath (1997)
	SAINT	V6.35A	Siemens (1994)
	SCALA	3.1.4 - 3.2.3	Evans (1997)
Molecular replacement	CNS	0.9 - 1.1	Brunger et al. (1998)
	AMORE	CCP4 (4.0 - 5.0)	Navaza (1994)
	Molrep	7.5.01	Vagin & Teplyakov (1997)
	EPMR	2.5	Kissinger et al. (1999)
Heavy atom phase determination	CNS	0.9 - 1.1	Brunger et al. (1998)
	SOLVE	2.0 - 2.06	Terwilliger & Berendzen (1999)
	MLPHARE	CCP4 (4.0 - 5.0)	CCP4 (1994)
	SHARP/autoSHARP	1.3.x - 2.02	Fortelle & Bricogne (1997)
	SHELXD/SHELXS	97	Sheldrick (1997)
	PHASES	95	Furey (1997)
	SnB	2.0 - 2.2	Weeks & Miller (1999)
	BnP	0.93 - 0.94	Weeks et al. (2002)
Density modification	CNS	0.9 - 1.1	Brunger et al. (1998)
	DM	2.0 - 2.1	Cowtan (1994)
	Solomon	CCP4 (4.0 - 5.0)	Abrahams & Leslie (1996)
	RESOLVE	2.0 - 2.06	Terwilliger (2000)
	SHELXE	97	Sheldrick (1997)
Structure refinement	CNS	0.9 - 1.1	Brunger et al. (1998)
	REFMAC5	5.0 - 5.2	Murshudov (1997)
	RESTRAIN	4.7.7	CCP4 (1994)
	SHELXL	97	Sheldrick (1997)
	TNT	5F	Tronrud (1997)
	WARP	5.0 - 6.0	Lamzin & Wilson (1997)

4. Command line arguments for running pdb_extract

There are three components of the pdb_extract application (pdb_extract, pdb_extract_sf, and extract). The following sections describe the arguments used for running each of these components from the command line. Examples for using these arguments are also included here.

4.1 Arguments and options for preparing coordinate files using pdb_extract

NAME pdb_extract

SYNOPSIS pdb_extract [OPTIONs]... [FILEs]...

DESCRIPTION

pdb_extract is used to extract information about data processing, heavy atom phasing, molecular replacement, density modification, and final structure refinement from the output files produced by many X-ray crystallographic applications. This program also merges all this information into mmCIF format files, ready for validation and deposition.

Help on how to run this program is also available by typing ‘pdb_extract -h’ or ‘pdb_extract -help’ on the command line.

OPTIONS

-o Followed by a given output file name.

example: -o outfile.mmcif

Note: if you do not provide an output file name, a default output file name (pdb_extract.mmcif) will be used.

-e Followed by one of the following experimental methods:

· MR molecular replacement.

· SAD single-wavelength anomalous diffraction.

· MAD multi-wavelength anomalous diffraction.

· SIR single isomorphous replacement.

· SIRAS single isomorphous replacement with anomalous scattering.

· MIR multiple isomorphous replacement.

· MIRAS multiple isomorphous replacement with anomalous scattering.

example: -e MAD

Note: If you have used a combination of methods to solve the structure (e.g. MR with MAD), you may extract information and details regarding both methods (e.g. -e MR -m program_mr -ilog Log_file -e MAD -p program_mad -ilog file_name). Here program_mr and program_mad are the names of the programs used for molecular replacement and MAD phasing, respectively.

-m Followed by one of the following programs for molecular replacement:

· CNS (versions 1.0 and 1.1).

· Amore from CCP4 suite (versions 4.1-5.0).

· EPMR (version 2.5).

· MOLREP from CCP4 suite (versions 4.1-5.0)

example: -m amore

-p Followed by one of the following program names for phasing:

· CNS (versions 1.0 and 1.1).

· MLPHARE from CCP4 suite (versions 4.0-5.0).

· SOLVE (versions 2.00-2.06).

· SHARP (versions 1.3.x – 2.03).

· SHELXS (version 97).

· SHELXD (version 97).

· SnB (version 2.2).

· BnP (version 0.93-0.96).

· PHASES (version 0.97).

example: -p CNS

Note: If the program that you used for phasing is not in the above list, you may still give its name when running pdb_extract (use -p program_name). If the log and/or output file is in PDB or mmCIF format, some information (like heavy atom coordinates) may still be extracted.

-d Followed by one of the following program names for density modification:

· CNS (versions 1.0 and 1.1).

· DM from CCP4 suite (CCP4 versions 4.0~5.0).

· SOLOMON from CCP4 suite (CCP4 versions 4.0~5.0).

· RESOLVE (versions 2.01~2.06).

· SHELXE (version 97).

· SHARP (versions 1.3.x-2.03, using DM version 2.2 for density modification).

example: -d CNS

-r Followed by one of the following program names for final structure refinement.

· CNS (versions 1.0 and 1.1).

· REFMAC5 from CCP4 suite version 4.1-5.0 (REFMAC version 5.2).

· RESTRAIN from CCP4 suite version 4.1-5.0 (RESTRAIN v4.6).

· SHELXL (version 97).

· TNT (version 5F).

· WARP (version 6.0; it uses REFMAC5 for refinement).

example: -r CNS

Note: If the program that you used for final structure refinement is not in the above list, you may still give the program name (use -r program_name). Some information (like atom coordinates) may still be extracted if the file produced is in PDB or mmCIF format.

-s Followed by one of the following programs used for scaling the structure refinement dataset:

· HKL/HKL2000/SCALEPACK (versions 1.30 ~ 1.96).

· SCALA (version 3.1.4 ~3.2.3) or from CCP4 suite version 4.1-5.0

· D*trek (version 7.0SSI)

· SAINT (version 6.35A)

· 3DSCALE

example: -s HKL

Note: The -s option is used to extract statistics from data reduction of the dataset that is finally used for structure refinement. This option must be used when preparing any structure factor file. If you would like to deposit additional datasets that were used for phasing the structure, please use the -sp option described below, in addition to the -s option.

-sp Followed by one of the following programs used for scaling the dataset(s) used in phasing the structure:

· HKL/HKL2000/SCALEPACK (versions 1.30 ~ 1.96).

· SCALA (version 3.1.4 ~3.2.3) or from CCP4 suite version 4.1-5.0

· D*trek (version 7.0SSI)

· SAINT (version 6.35A)

· 3DSCALE

example: -sp HKL

Note: This option is different from -s, since it is used to extract statistics from data reduction of the dataset(s) used for phasing the structure (e.g. by SAD, MAD, SIR, MIR). Normally, this option is followed by multiple datasets, as in a MAD or MIR experiment.

-iPDB Followed by an input file in PDB format.

example: -iPDB test1.pdb

Note: PDB files are usually generated from heavy atom phasing (heavy atom coordinates) or the final structure refinement.

-iCIF Followed by an input file in mmCIF format.

example: -iCIF deposit_cns.cif

Note: This option may be used to read in any mmCIF format file at different stages of structure determination. For instance, if you used MLPHARE for refining the heavy atom parameters, the output file is in mmCIF format. Another instance where mmCIF format files are produced is by running the deposit.inp script in CNS. This is run at the end of the refinement to generate a file that contains the final coordinates and refinement statistics.

-iLOG Followed by one or more input LOG files

example: -iLOG mad_sdb.dat mad_summary.dat

Note: All stages of structure determination produce log files. The specific format of the file depends on the program used. They may contain phasing statistics or heavy atom coordinates. In some cases, multiple log files may be generated, each containing a different type of information regarding that stage of structure determination. For instance, when CNS is used for heavy atom phasing, it generates log files mad_sdb.dat, which contains the heavy atom coordinates and mad_summary.dat, which contains phase refinement statistics.

-iENT Followed either by the data template file (in plain text format) or a mmCIF format file with additional information that you may wish to include in your deposition (add_info.mmcif).

example: -iENT data_template.text or -iENT add_info.mmcif

Note: The data template file is generated by the program extract (see section 4.3) and contains sequence information of all unique polymers present in the structure. It also has tokens for including other non-electronically produced information regarding the deposition. The option iENT also allows you to include any additional information regarding the deposition in a mmCIF format file. For further details regarding the mmCIF format, please consult the mmCIF dictionary at: http://pdb.rutgers.edu/mmcif.

4.1.1 Examples for using pdb_extract options

Note: You can run pdb_extract to separately extract information and statistics from each step of structure determination (data processing, heavy atom phasing, density modification, molecular replacement and final structure refinement). Alternatively, pdb_extract may be run to extract and combine information from all these stages and add non-electronically produced information for a complete deposition. A few examples of running pdb_extract are shown below:

· Command for extracting information about heavy atom phasing

pdb_extract -e experimental_method -p program_name_phasing \

-iPDB pdb_files -iLOG log_files \

-iCIF mmCIF_files -o output_file_name

(The experimental_method must be given for this step)

· Command for extracting information about density modification

pdb_extract -d program_name_for_dm -iLOG log_files -o output_file_name

· Command for extracting information about molecular replacement

pdb_extract -m program_name_for_mr -iLOG log_files -o output_file_name

· Command for extracting information from final structure refinement:

pdb_extract -r program_name_for_refinement -iPDB pdb_files \

-iLOG log_files -iCIF mmCIF_files -o output_file_name

· Command for extracting information from data scaling log files (for the dataset used for refinement):

pdb_extract -s program_name_scaling -iLOG log_file -o output_file_name

· Command for extracting information from data scaling log files (for the dataset(s) used for phasing):

pdb_extract -sp program_name_scaling -iLOG log_file1 log_file2 \

-o output_file_name

· Command for extracting information and generating a complete mmCIF file for deposition:

pdb_extract -e experimental_method -r program_name_for_refinement \

-iPDB pdb_files -iLOG log_files -iCIF mmCIF_files \

-p program_name_for_phasing -iPDB pdb_files \

-iLOG log_files -iCIF mmCIF_files \

-d program_name_for_dm -iLOG log_files \

-s program_name_for_scaling -iLOG log_files \

-sp program_name_for_scaling -iLOG log_files \

-iENT data_template.text -o output_file_name
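The complete command above is long enough that it can be easier to assemble it from one entry per stage. The following Python sketch only builds and prints such a command line; it does not run pdb_extract. The helper names (stage_args, build_pdb_extract) and the example file names are illustrative assumptions, and only options described in section 4.1 are used.

```python
import shlex


def stage_args(flag, program, pdb=(), log=(), cif=()):
    """Return the arguments for one stage: the program flag plus its input files."""
    args = [flag, program]
    if pdb:
        args += ["-iPDB", *pdb]
    if log:
        args += ["-iLOG", *log]
    if cif:
        args += ["-iCIF", *cif]
    return args


def build_pdb_extract(method, stages, template=None, output="pdb_extract.mmcif"):
    """Assemble a pdb_extract command line as a list of arguments."""
    cmd = ["pdb_extract", "-e", method]
    for stage in stages:
        cmd += stage_args(**stage)
    if template:
        cmd += ["-iENT", template]
    return cmd + ["-o", output]


# Hypothetical MAD structure: refined with REFMAC5, phased with SOLVE,
# density modified with DM. File names follow the examples in section 3.4.
cmd = build_pdb_extract(
    "MAD",
    [
        {"flag": "-r", "program": "REFMAC5", "pdb": ["name.pdb"], "cif": ["name.refmac"]},
        {"flag": "-p", "program": "SOLVE", "pdb": ["ha.pdb"], "log": ["solve.prt"]},
        {"flag": "-d", "program": "DM", "log": ["dm.log"]},
    ],
    template="data_template.text",
    output="test.mmcif",
)
print(shlex.join(cmd))
```

Keeping each stage's program flag next to its own input files mirrors the grouping used in the complete command above.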

4.2 Arguments and options for preparing structure factor files using pdb_extract_sf

NAME pdb_extract_sf

SYNOPSIS pdb_extract_sf [OPTIONs]... [FILEs]...

DESCRIPTION

This program can either be used to prepare

(a) a single reflection dataset used for final structure refinement or

(b) multiple reflection datasets (e.g. from MAD, MIR …) used for phasing the structure.

OPTIONS

-o Followed by an output file name.

example: -o outfile.cif

Note: if you do not specify the output file name, a default output file name (pdb_extract_sf.mmcif) will be used.

-dt followed by data type for initial data processing (Amplitude (F) or Intensity (I)). The data type at this step is usually intensity.

example: -dt I

-dp Data format for initial data processing.

It is followed by one of the following program names:

HKL/SCALEPACK, DTREK, SAINT, XPREP, 3DSCALE, SCALA, OTHER.

example: -dp HKL

Note1: If the program used for data scaling is not in the above list, please use “OTHER” as the program name. Please provide the reflection data in a text format file including h, k, l, F, SigmaF (and/or I, SigmaI), and test flags. These columns should be separated by spaces. Usage: -dt I -dp OTHER -idat file_name

Note2: If the structure factor data is in mtz format (processed by MOSFLM and SCALA), you must convert it to either CNS format, scalepack format or mmCIF format. This may be done using the mtz2various application of CCP4.

If the data is converted to CNS format use:

-dp CNS -idat file-name

For a scalepack format file use:

-dp HKL -idat file-name

For a mmCIF format file, use:

-dp SCALA -idat file-name

-c followed by crystal index. This is the crystal number which was used for data collection (this value is always an integer like 1,2,3, ..)

example: -c 2

(Thus the reflection dataset was collected using crystal 2)

-w followed by the wavelength index. This is the wavelength number at which the data was collected (this value is also an integer like 1, 2, 3, …)

example: -w 2

(Thus the dataset was collected at the second wavelength).

-idat followed by the reflection data file name

example: -idat scalepack.sca

Note: Take care with the order of the arguments for each reflection file. They should be given as -c i -w j -idat file_name, in that order, where i is the crystal index, j is the wavelength index, and file_name is the name of the file containing the reflections.

-rt followed by data type used for final structure refinement (Amplitude (F) or Intensity (I))

example: -rt F

-rp data format in the final structure refinement.

It is followed by the data format name: CNS/XPLOR, REFMAC5, SHELX, TNT, HKL/SCALEPACK, DTREK, SAINT, XPREP, 3DSCALE, SCALA

example: -rp CNS

Note: If you used REFMAC5 for the final structure refinement, the mtz format structure factor file should be converted to mmCIF or CNS format using the mtz2various application of CCP4.

If it is converted to mmCIF format use:

pdb_extract_sf -rt I -rp REFMAC5 -idat data-file-name

For a CNS format file use:

pdb_extract_sf -rt I -rp CNS -idat data-file-name

Note1: If the program that you used for structure refinement is not in the above list, please use “OTHER” as the program name and provide either a plain text or mmCIF format file with reflection data. The file should contain h, k, l, F, SigmaF, (and/or I, SigmaI), and test flags if appropriate. These columns should be separated by spaces. Usage: -rt F -rp OTHER -idat file_name
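As an illustration of the space-separated text file described in the note above, the following Python sketch formats a few reflections as h, k, l, F, SigmaF and a test flag. The column order follows the order in which the note lists the columns; the number of decimals, the 0/1 encoding of the test flag and the reflection values themselves are assumptions made for this example and are not specified in this manual.

```python
def format_reflections(reflections):
    """Format (h, k, l, F, SigmaF, test_flag) tuples as space-separated lines."""
    lines = []
    for h, k, l, f, sigf, flag in reflections:
        # Integer Miller indices, amplitudes with two decimals, integer flag.
        lines.append(f"{h:d} {k:d} {l:d} {f:.2f} {sigf:.2f} {flag:d}")
    return "\n".join(lines) + "\n"


# Made-up reflections, for illustration only.
example = [
    (1, 0, 0, 123.4, 5.6, 0),
    (1, 0, 1, 98.76, 4.32, 1),
]
text = format_reflections(example)
print(text, end="")

# To save the file for use with -idat:
# with open("reflections.txt", "w") as handle:
#     handle.write(text)
```

A file written this way would then be passed as: -rt F -rp OTHER -idat reflections.txt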

-imgCIF followed by input file name in imgCIF format.

example: -imgCIF example.cbf

Note: Only some header information can be extracted from the imgCIF file. This format is not commonly used.

4.2.1 Examples for using pdb_extract_sf options

· Extracting reflection data used for final structure refinement:

pdb_extract_sf -rt data-type -rp data-format-for-refinement \

-idat data-file-name -o output-file-name

This option is used to prepare the dataset used for the final refinement for deposition to the PDB. If you collected several datasets and merged them together for the structure refinement, use the merged file here.

· Extracting reflection data used for phase determination of the structure:

pdb_extract_sf -dt data_type -dp program_name_for_scaling \

-c crystal_number_1 -w wavelength_number_1 -idat data_file_name_1 \

-c crystal_number_2 -w wavelength_number_2 -idat data_file_name_2 \

…

-o output_file_name

Include the details regarding all datasets used for phasing the structure (e.g. by MAD, MIR …). The initial scaled reflection dataset files are used here.

Note: Even if only one reflection dataset is included here (e.g. in the case of a SAD experiment), structure factor data used for phasing should always be accompanied by the crystal and wavelength numbers (using -c and -w).
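Because every phasing dataset needs its own -c, -w and -idat arguments in that order, the command can be generated from a list of datasets. The Python sketch below only builds and prints the pdb_extract_sf command line; the function name build_phasing_sf_command and the three scalepack file names are hypothetical.

```python
import shlex


def build_phasing_sf_command(data_type, program, datasets, output="pdb_extract_sf.mmcif"):
    """Build a pdb_extract_sf command for one or more phasing datasets.

    datasets is a sequence of (crystal_index, wavelength_index, file_name).
    """
    if data_type not in ("I", "F"):
        raise ValueError("data type must be 'I' or 'F'")
    cmd = ["pdb_extract_sf", "-dt", data_type, "-dp", program]
    for crystal, wavelength, path in datasets:
        if crystal < 1 or wavelength < 1:
            raise ValueError("crystal and wavelength indices must be positive integers")
        # Keep the order required by the manual: -c, then -w, then -idat.
        cmd += ["-c", str(crystal), "-w", str(wavelength), "-idat", path]
    return cmd + ["-o", output]


# Hypothetical three-wavelength MAD experiment collected from one crystal.
mad = [(1, 1, "peak.sca"), (1, 2, "inflection.sca"), (1, 3, "remote.sca")]
command = build_phasing_sf_command("I", "HKL", mad, output="phasing_sf.mmcif")
print(shlex.join(command))
```

For a SAD experiment the list simply contains a single entry, which still carries its crystal and wavelength numbers.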

· Preparing a mmCIF format structure factor file with all the reflection data:

pdb_extract_sf -rt data-type_refine -rp data-format-for_refine \

-idat data-file-name_refine -dt data_type_scaling \

-dp program_name_for_scaling \

-c crystal_number_1 -w wavelength_number_1 -idat data_file_name_1 \

-c crystal_number_2 -w wavelength_number_2 -idat data_file_name_2 \

…

-o output_file_name

The output_file_name contains blocks of reflections used for the final structure refinement and for phasing.

4.3 Arguments and options for running extract

NAME extract

SYNOPSIS extract [OPTIONs] [FILE]

DESCRIPTION

This program can be used to generate the data template and script input files. Both these files are in plain text format and used for running pdb_extract.

OPTIONS

-pdb Followed by the coordinate PDB file name

example: -pdb pdb_file_name

Note: this generates two plain text files (data_template.text and log_script.inp). See sections 3.1, 3.2, 5.1 and 5.2 for more details on these files.

-cif Followed by the coordinate mmCIF file name

example: -cif mmCIF_file_name

Note: this also generates the same files as above. See sections 3.1, 3.2, 5.1 and 5.2 for more details on the data template and script input files.

-ext Followed by the completed log script file

example: -ext log_script.inp

Note: The script input file should be completed appropriately by including names of programs and their output/log files generated at different stages of structure determination. Since the name of the data template file is included in the script input file, at least the sequence information should be completed in the data template file. Use ‘extract -ext log_script.inp’ to generate complete mmCIF format coordinate and structure factor files.

4.3.1 Examples for using extract options

· Generate the data template and log script input files:

extract -pdb pdb_file_name

extract -cif cif_file_name

· Get a complete mmCIF file for deposition

extract -ext log_script.inp
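Before running ‘extract -ext log_script.inp’, it can help to confirm that the sequence information in the data template file has been completed. The data template file uses four question marks (????) to mark chain breaks that should be replaced with the missing residues (see section 5.1). The Python sketch below looks for leftover placeholders; the function name and the sample text are hypothetical, and the check covers only this one placeholder.

```python
def unresolved_chain_breaks(template_text):
    """Return the 1-based numbers of lines that still contain '????'."""
    return [
        number
        for number, line in enumerate(template_text.splitlines(), start=1)
        if "????" in line
    ]


# Made-up fragment of a data template file, for illustration only.
sample = '<molecule_entity_id="1" >\n<molecule_one_letter_sequence="MKV????LLG" >\n'

pending = unresolved_chain_breaks(sample)
if pending:
    print("Complete the sequence on line(s):", pending)
else:
    print("No chain-break placeholders left.")

# For a real file:
# with open("data_template.text") as handle:
#     pending = unresolved_chain_breaks(handle.read())
```

A single question mark, as used for unknown values elsewhere in the template, is not reported by this check.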
4.4 Summary of arguments

The command line options for the three components of pdb_extract are summarized below: pdb_extract_sf (used to capture structure factors), pdb_extract (used to capture the details of data scaling, molecular replacement, heavy atom phasing, density modification and structure refinement), and extract (used to generate the data_template.text and log_script.inp files).

pdb_extract_sf [OPTION]... [FILE]...
Option	Argument descriptions
-o	output file name (default name is pdb_extract_sf.mmCIF)
-dt	data type (I or F) after data processing at beam line
-dp	program for processing data (e.g. HKL/Scalepack, D*Trek, SCALA)
-rt	data type (I or F) used for final structure refinement
-rp	program for structure refinement (e.g. CNS|REFMAC5|SHELX|TNT)
-c	crystal number (like 1, 2, 3 …) for diffraction
-w	wavelength number (like 1, 2, 3 …) for diffraction
-idat	data file name used for phasing or structure refinement
-ilog	log file name obtained from data processing
-icif	file name obtained from data processing (in mmCIF format)

pdb_extract [OPTION]... [FILE]...
Option	Argument descriptions
-o	output file name (default name is pdb_extract.mmcif)
-e	experimental method (e.g. MR|SAD|MAD|SIR|MIR|SIRAS|MIRAS)
-m	program for molecular replacement (e.g. CNS|AMORE|MOLREP|EPMR)
-p	program for heavy atom phasing (e.g. CNS|MLPHARE|SOLVE|SHARP|SHELXD|SnB|BnP)
-d	program for density modification (e.g. CNS|DM|SOLOMON|RESOLVE)
-r	program for final structure refinement (e.g. CNS|REFMAC5|RESTRAIN|SHELXL|TNT|WARP)
-s	program for reflection data scaling (only for refinement) (e.g. HKL/Scalepack, D*Trek, SAINT, SCALA, 3DSCALE)
-sp	program for reflection data scaling (only for phasing) (e.g. HKL/Scalepack, D*Trek, SAINT, SCALA, 3DSCALE)
-ilog	the input file with format corresponding to the program used
-ipdb	the input file with PDB format
-icif	the input file with mmCIF format
-ient	the input file data_template.text (for complete sequence)

extract [OPTION] [FILE]
Option	Argument descriptions
-pdb	input coordinate file name (PDB format)
-cif	input coordinate file name (mmCIF format)
-ext	input script file name log_script.inp

5. Appendices

5.1 An example of the data template file (data_template.text)

++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++

THE DATA_TEMPLATE.TEXT FILE

++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++

NOTES AND REMINDER

The data template file contains data entries for unique chemical sequences

present in the structure and other non-electronically captured information.

PLEASE CHECK CATEGORIES 1 & 2: Before proceeding any further, make necessary

corrections here so that all information in these categories are complete

and correct.

You may choose to fill in CATEGORIES (3-18) either here or later in ADIT.

++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++

GUIDELINES FOR USING THIS FILE

1. Only strings included between the 'lesser than' and 'greater than'

signs (<.....>) will be parsed for evaluation by the program. Therefore,

DO NOT write either on the left or right of the 'less than' and 'greater

than' signs respectively.

2. All alphanumeric values or strings that you include in the different

categories should be within double-quotes. Blank spaces or carriage

returns within a pair of double quotes are ignored by the program.

DO NOT use double quotes (") within strings that you enter.

++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++

~~~~~~~~~~~~~~~~~~~~~~~~~~~~START INPUT DATA BELOW~~~~~~~~~~~~~~~~~~~~~~~

================CATEGORY 1: Crystallographic Data=======================

Enter crystallographic data

<space_group = "P3221 "> (use International Table conventions)

<space_group_number = "? ">

<unit_cell_a = " 120.831 " >

<unit_cell_b = " 120.831 " >

<unit_cell_c = " 185.222 " >

<unit_cell_alpha = " 90.00 " >

<unit_cell_beta = " 90.00 " >

<unit_cell_gamma = "120.00 " >

================CATEGORY 2: Sequence Information =======================

Enter one letter sequence for each polymeric entity in asymmetric unit

--------------------------------------------------------------------------

SOME DEFINITIONS

An ENTITY is defined as any unique molecule present in the asymmetric

unit. Each unique biological polymer (protein or nucleic acids) in the

structure is considered an entity. Thus, if there are five copies of

a single protein in the asymmetric unit, the molecular entity is still

only one. Water and non-polymers like ions, ligands and sugars are

also entities.

Here we only consider the sequences of polymeric entities (protein or

nucleic acid).

GUIDELINES FOR COMPLETING THIS CATEGORY

* In a PDB or mmCIF format file, all residues of a single polymeric

entity should have one chain ID. Multiple copies of the same entity

should each be assigned a unique chain ID. The multiple chain IDs

should be separated by commas as 'A,B,C,...'. If incorrect chain IDs

are used the entity groups extracted by this program will not be

correct. To avoid this, make necessary corrections in the PDB or mmCIF

file used to generate the data_template file and regenerate the

data_template.text file. Alternatively, edit the extracted sequence

in this file to correctly represent the sequence and chain IDs of each

polymeric entity.

* In addition to chain IDs, this program uses distance geometry to

assess if there are any breaks in the polymer sequence. These breaks

may occur due to missing residues (not included in the model due to

missing electron density) or due to poor geometry. Four question marks

'????' are used to denote these chain breaks. Replace these question

marks with the sequence of residues missing from the coordinates. Also

add any residues missing from the N- and/or C-termini here.

* If there are non-standard residues in the coordinates, this program

lists them according to the three letter code used in the coordinate

file as (ABC). If all the residues in your sequence are nonstandard,

check and edit the sequence manually to represent it correctly in this

file.

* If any residue was modeled as Ala or Gly due to lack of the side-chain

density, the sequence extracted here will represent them as A or G

respectively. Correct this to the original sequence that was present in

the crystal.

----------------------------------------------------------------------------

Below is the one letter chemical sequence extracted from your PDB

coordinate file. The molecular entities are grouped and listed

together.

PLEASE CHECK THE SEQUENCE of each entity carefully and modify it, as necessary.

Make sure that you REVIEW THE FOLLOWING:

* chain breaks due to missing residues,

* missing residues in the N- and/or C-termini,

* non-standard residues and

* cases of residues modeled as Ala or Gly due to missing

side-chain density.

<molecule_entity_id="1" >

<molecule_entity_type="polypeptide(L)" >

<molecule_one_letter_sequence="

SASFDGPKFK(MSE)TDGSYVQTKTIDVGSSTDISPYLSLIREDSILNGNRAVIFDVYWDVGF????TKTSGWSLSSV

KLSTRNLCLFLRLPKPFHDNLKDLYRFFASKFVTFVGVQIEEDLDLLRENHGLVIRNAINVGKLAAEARG

TLVLEFLGTRELAHRVLWSDLGQLDSIEAKWEKAGPEEQLEAAAIEGWLIVNVWDQLSDE" >

< molecule_chain_id="A,B,C,D,E,F" >

Copy the following template to add information regarding more entities:

<molecule_entity_id=" " >

<molecule_entity_type=" " >

<molecule_one_letter_sequence=" " >

<molecule_chain_id=" " >

================CATEGORY 3: Contact Authors=============================

Enter information about the contact authors.

Information about the Principal investigator (PI) should be given.

For principal investigator

<contact_author_PI_name = " ">

<contact_author_PI_email = " ">

<contact_author_PI_phone = " ">

<contact_author_PI_fax = " ">

<contact_author_PI_address = " ">

For other contact authors

<contact_author_name_1 = " ">

<contact_author_email_1 = " ">

<contact_author_phone_1 = " ">

<contact_author_fax_1 = " ">

<contact_author_address_1 = " ">

<contact_author_name_2 = " ">

<contact_author_email_2 = " ">

<contact_author_phone_2 = " ">

<contact_author_fax_2 = " ">

<contact_author_address_2 = " ">

...(add more if needed)...

================CATEGORY 4: Release Status==============================

Enter release status for the coordinates, structure factors and sequence

Status should be chosen from one of the following:

(release now, hold for publication, hold for 6 months,

hold for 1 year)

<Release_status_for_coordinates = " ">

<Release_status_for_structure_factor = " ">

<Release_status_for_sequence = " ">

================CATEGORY 5: Title=======================================

Enter the title for the structure

<structure_title = " ">

================CATEGORY 6: Citation Authors============================

Enter citation authors (e.g. Surname, F.M.)

The primary citation is the article in which the deposited coordinates

were first reported. Other related citations may also be provided.

For the primary citation

<primary_citation_author_name_1 = " ">

<primary_citation_author_name_2 = " ">

<primary_citation_author_name_3 = " ">

<primary_citation_author_name_4 = " ">

<primary_citation_author_name_5 = " ">

...add more if needed...

For other related citations (if applicable)

<citation_1_author_name_1 = " ">

<citation_1_author_name_2 = " ">

<citation_1_author_name_3 = " ">

<citation_1_author_name_4 = " ">

<citation_1_author_name_5 = " ">

...add more if needed...

<citation_2_author_name_1 = " ">

<citation_2_author_name_2 = " ">

<citation_2_author_name_3 = " ">

<citation_2_author_name_4 = " ">

<citation_2_author_name_5 = " ">

...add more if needed...

...(add more citations if needed)...

================CATEGORY 7: Citation Article============================

Enter citation article (journal, title, year, volume, page)

If the citation has not yet been published, use 'To be published'

for the category 'journal_abbrev'. The order of citations in this

category should correspond to that in CATEGORY 6.

For primary citation

<primary_citation_journal_abbrev = " ">

<primary_citation_title = " ">

<primary_citation_year = " ">

<primary_citation_journal_volume = " ">

<primary_citation_page_first = " ">

<primary_citation_page_last = " ">

For other related citation (if applicable)

<citation_1_journal_abbrev = " ">

<citation_1_title = " ">

<citation_1_year = " ">

<citation_1_journal_volume = " ">

<citation_1_page_first = " ">

<citation_1_page_last = " ">

<citation_2_journal_abbrev = " ">

<citation_2_title = " ">

<citation_2_year = " ">

<citation_2_journal_volume = " ">

<citation_2_page_first = " ">

<citation_2_page_last = " ">

...(add more citations if needed)...

================CATEGORY 8: Molecule Names==============================

Enter the name of the molecule for each entity

The name of the molecule should be obtained from the appropriate

sequence database reference, if available. Otherwise the gene name or

other common name of the entity may be used.

e.g. HIV-1 integrase for protein

RNA Hammerhead Ribozyme for RNA

The number of entities should be the same as in CATEGORY 2.

<molecule_name_1 = " "> (entity 1)

<molecule_name_2 = " "> (entity 2)

<molecule_name_3 = " "> (entity 3)

...(add more if needed)...

================CATEGORY 9: Molecule Details============================

Enter additional information about each entity

Additional information would include details such as fragment name

(if applicable), mutation, and E.C. number.

For entity 1

<Molecular_entity_id_1 = " "> (e.g. 1, 2, ...)

<Fragment_name_1 = " "> (e.g. ligand binding domain, hairpin)

<Specific_mutation_1 = " "> (e.g. C280S)

<Enzyme_Commission_number_1 = " "> (if known: e.g. 2.7.7.7)

For entity 2

<Molecular_entity_id_2 = " ">

<Fragment_name_2 = " ">

<Specific_mutation_2 = " ">

<Enzyme_Comission_number_2 = " ">

For entity 3

<Molecular_entity_id_3 = " ">

<Fragment_name_3 = " ">

<Specific_mutation_3 = " ">

<Enzyme_Comission_number_3 = " ">

...(add more if needed)...

================CATEGORY 10: Genetically Manipulated Source=============

Enter data in the genetically manipulated source category

If the biomolecule has been genetically manipulated, describe its

source and expression system here.

For entity 1

<Manipulated_entity_id_1 = " "> (e.g. 1, 2, ...)

<Source_organism_scientific_name_1 = " "> (e.g. Homo sapiens)

<Source_organism_gene_1 = " "> (e.g. RPOD, ALKA...)

<Expression_system_scientific_name_1 = " "> (e.g. Escherichia coli)

<Expression_system_strain_1 = " "> (e.g. BL21(DE3))

<Expression_system_vector_type_1 = " "> (e.g. plasmid)

<Expression_system_plasmid_name_1 = " "> (e.g. pET26)

<Manipulated_source_details_1 = " "> (any other relevant information)

For entity 2

<Manipulated_entity_id_2 = " ">

<Source_organism_scientific_name_2 = " ">

<Source_organism_gene_2 = " ">

<Expression_system_scientific_name_2 = " ">

<Manipulated_source_description_2 = " ">

For entity 3

<Manipulated_entity_id_3 = " ">

<Source_organism_scientific_name_3 = " ">

<Source_organism_gene_3 = " ">

<Expression_system_scientific_name_3 = " ">

<Manipulated_source_description_3 = " ">

...(add more if needed)...

================CATEGORY 11: Natural Source=============================

Enter data in the natural source category

If the biomolecule was derived from a natural source, describe

it here.

For entity 1

<natural_source_entity_id_1 = " "> (e.g. 1, 2, ...)

<natural_source_scientific_name_1 = " "> (e.g. Homo sapiens)

<natural_source_details_1 = " "> (any other relevant information

e.g. organ, tissue, cell ..)

For entity 2

<natural_source_entity_id_2 = " ">

<natural_source_scientific_name_2 = " ">

<natural_source_description_2 = " ">

For entity 3

<natural_source_entity_id_3 = " ">

<natural_source_scientific_name_3 = " ">

<natural_source_description_3 = " ">

...(add more if needed)...

================CATEGORY 12: Keywords===================================

Enter a list of keywords that describe important features of the deposited

structure.

For example, beta barrel, protein-DNA complex, double helix,

hydrolase, structural genomics etc.

<structure_keywords = " ">

================CATEGORY 13: Biological Assembly========================

Enter data in the biological assembly category

Biological assembly describes the functional unit(s) present in the

structure. The asymmetric unit may contain part of a biological

assembly, exactly one assembly, or more than one assembly.

Case 1

* If the asymmetric unit is the same as the biological assembly

nothing special needs to be noted here.

Case 2

* If the asymmetric unit does not contain a complete biological unit.

Please provide symmetry operations including translations required

to build the biological unit.

(example:

The biological assembly is a hexamer generated from the dimer

in the asymmetric unit by the operations: -y, x-y-1, z-1 and

-x+y, -x-1, z-1.)

Case 3

* If the asymmetric unit has multiple biological units

Please specify how to group the contents of the asymmetric unit into

biological units.

(example:

The biological unit is a dimer. There are 2 biological units in the

asymmetric unit (chains A & B and chains C & D).)

For biological unit 1

<biological_assembly_1 = " ">

For biological unit 2

<biological_assembly_2 = " ">

....(add more if needed)....

================CATEGORY 14: Crystals===================================

Enter the number of crystals used for diffraction

<number_of_crystals = " ">

================CATEGORY 15: Methods and Conditions=====================

Enter the crystallization conditions for each crystal

For crystal 1:

<crystal_number_1 = " "> (e.g. 1, 2, ...)

<crystallization_method_1 = " "> (e.g. vapor diffusion, hanging drop)

<crystallization_pH_1 = " "> (e.g. 7.5 ...)

<crystallization_temperature_1 = " "> (e.g. 100) (in Kelvin)

<crystallization_components_1 = " "> (e.g. PEG 4000, NaCl etc.)

For crystal 2:

<crystal_number_2 = " ">

<crystallization_method_2 = " ">

<crystallization_pH_2 = " ">

<crystallization_temperature_2 = " ">

<crystallization_components_2 = " ">

...(add more if needed)...

================CATEGORY 16: Crystal Property===========================

Enter details about the crystals used

Include additional information about the crystals used

for example: solvent content, Matthews coefficient

For crystal 1:

<crystals_number_1 = " "> (e.g. 1, 2, ...)

<crystals_solvent_content_1 = " "> (e.g. 63.7 )

<crystals_matthews_coefficient_1 = " "> (e.g. 2.5 ...)

For crystal 2:

<crystals_number_2 = " ">

<crystals_solvent_content_2 = " ">

<crystals_matthews_coefficient_2 = " ">

...(add more if needed)...

================CATEGORY 17: Radiation Source===========================

Enter the details of the source of radiation, the X-ray generator,

and the wavelength for each diffraction experiment.

For experiment 1:

<radiation_experiment_1 = " "> (e.g. 1, 2, ...)

<radiation_source_1 = " "> (e.g. rotating-anode, synchrotron ...)

<radiation_source_type_1= " "> (e.g. Rigaku RU200, CHESS Beamline A1 ...)

<radiation_wavelengths_1= " "> (e.g. 1.502 ...)

<radiation_protocol_1= " "> (e.g. MAD, SINGLE WAVELENGTH ...)

<radiation_detector_1 = " "> (e.g. CCD, IMAGE PLATE ...)

<radiation_detector_type_1= " "> (e.g. SIEMENS-NICOLET, RIGAKU RAXIS ...)

For experiment 2:

<radiation_experiment_2 = " ">

<radiation_source_2 = " ">

<radiation_source_type_2 = " ">

<radiation_wavelengths_2 = " ">

<radiation_protocol_2= " ">

<radiation_detector_2 = " ">

<radiation_detector_type_2= " ">

....(add more if needed)....

================CATEGORY 18: Collection Temperature=====================

Enter the temperature for data collection (in Kelvin)

<collection_temperature_crystal_1 = " "> (for crystal 1:)

<collection_temperature_crystal_2 = " "> (for crystal 2:)

....(add more if needed)....

=====================================END==================================
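
The two guidelines at the top of the data template (only text between '<' and '>' is parsed, and values are enclosed in double quotes) can be illustrated with a short Python sketch. This is not the parser used by pdb_extract. The function name parse_template and the regular expression are assumptions made for illustration, and collapsing blanks and line breaks inside the quotes is only one possible reading of guideline 2.

```python
import re

# Hypothetical helper, not part of pdb_extract: collect <token = "value"> pairs.
# Assumptions: a token name is a run of word characters, and a value never
# contains a double quote (the template forbids quotes inside values).
ENTRY = re.compile(r'<\s*(\w+)\s*=\s*"([^"]*)"\s*>')

def parse_template(text):
    """Return a dict mapping each token name to its cleaned value."""
    entries = {}
    for name, raw in ENTRY.findall(text):
        # Collapse runs of blanks and line breaks inside the quotes.
        entries[name] = " ".join(raw.split())
    return entries

# A few lines copied from the example template above.
sample = '''
Enter crystallographic data
<space_group = "P3221 "> (use International Table conventions)
<unit_cell_a = " 120.831 " >
<unit_cell_gamma = "120.00 " >
< molecule_chain_id="A,B,C,D,E,F" >
<structure_title = " ">
'''

fields = parse_template(sample)
for name, value in fields.items():
    print(name, "->", repr(value))
```

Text outside the angle brackets, such as the "(use International Table conventions)" hint, never reaches the dictionary. For a multi-line sequence value one would probably remove the whitespace entirely rather than collapse it to single blanks.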

5.2 An example of the script input file (log_script.inp)

++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++

THE LOG_SCRIPT.INP FILE

++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++

NOTES AND REMINDER

This script file is used to enter the names of the crystallographic

software used for structure determination and the log, PDB, mmCIF or

text files generated by them.

PLEASE COMPLETE the ENTRY FIELDS according to the type of your experiment

and use the command 'extract -ext log_script.inp' to obtain the completed

structure data ready for validation and deposition.

++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++

GUIDELINES FOR USING THIS FILE

1. Only strings enclosed between the 'less than' and 'greater than'

signs (<.....>) will be parsed for evaluation by the program. Therefore,

DO NOT write anything to the left of the 'less than' sign or to the

right of the 'greater than' sign.

2. All alphanumeric values or strings that you include in the different

categories should be within double-quotes. Blank spaces or carriage

returns within a pair of double quotes are ignored by the program.

DO NOT use double quotes (") within strings that you enter.

3. Log files used for generating the deposition should be generated from

the best (usually the last) trial for each crystallographic software.

++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++

~~~~~~~~~~~~~~~~~~~~~~~~~~~~START INPUT DATA BELOW~~~~~~~~~~~~~~~~~~~~~~~

===============PART 1: Structure Factor for Final Refinement==============

Enter reflection data file used for final structure refinement

NOTE:

* Usually the highest resolution or best data set is used for the

refinement. Use that structure factor file here.

* In some cases, it may not be possible to collect a complete dataset

from a single crystal. Thus, multiple data sets have to be scaled

and merged together for refinement. Use the merged reflection file

here.

* If the reflection data format is not one of those listed below,

please use OTHER for the data format, and provide an ASCII file

that has at least five values [H, K, L, I (or F), sigmaI (or sigmaF)]

for each reflection. Include the test flags as the sixth column

in the file (if available).

* If the reflection file is in mtz format, convert it to mmCIF format

using the mtz2various application provided by CCP4. This can be

Reflection data format:

<reflection_data_type = "F" > (enter I (intensity) or F (amplitude))

<reflection_data_format = "CNS" >

<reflection_data_file_name = " " >

==============PART 2: Structure Factors for Protein Phasing================

Enter reflection data files used for heavy atom or MAD phasing

NOTE:

* Enter this category if you have more than one complete reflection

file (e.g. in the case of MAD, SIRAS, MIR). The LOG files generated

by the data scaling software for all these data sets are also needed.

* If the scaling program is not one of those listed below

(HKL|SCALEPACK|DTREK|SAINT|3DSCALE), enter OTHER for the program

name and provide an ASCII file with at least five values

[H, K, L, I (or F), sigmaI (or sigmaF)] for each reflection.

* If the same crystal was used for collecting multiple data sets, the

crystal number will remain '1' as the wavelength numbers change.

However, if multiple crystals were used for the data collections,

the corresponding crystal numbers should be used for each data set.

<scale_data_type = "I" > (enter I (intensity) or F (amplitude))

<scale_program_name = "HKL" >

For data set 1:

<crystal_number = "1" >

<diffract_number = "1" >

<scale_data_file_name_1 = " " >

<scale_log_file_name_1 = " " >

For data set 2:

<crystal_number_2 = "1" >

<diffract_number_2 = "2" >

<scale_data_file_name_2 = " " >

<scale_log_file_name_2 = " " >

For data set 3:

<crystal_number_3 = "1" >

<diffract_number_3 = "3" >

<scale_data_file_name_3 = " " >

<scale_log_file_name_3 = " " >

==================PART 3: Statistics for Data Scaling=====================

Enter log file and software name for data scaling

NOTE:

* The log file included here should have scaling statistics of

the file used for the final structure refinement. If multiple data

sets were scaled and merged for refinement (as described in Part 1

above) use the log file generated during merging of the data sets.

* While SCALA produces a mmCIF format file with the scaling statistics,

most other software produce ASCII LOG files with this information.

Software for scaling is one of the following:

<data_scaling_software = "HKL" >

<data_scaling_LOG_file_name = " " >

<data_scaling_CIF_file_name = " " > (in mmcif format)

==============PART 4: Statistics for Molecular Replacement================

Enter log files and software name for molecular replacement

NOTE:

Software is one of the following:

(CNS|AMORE|MOLREP|EPMR)

The log file should be from the best trial of MR.

<mr_software = " " >

<mr_log_file_LOG_1 = " " >

<mr_log_file_LOG_2 = " " >

=================PART 5: Statistics for Protein Phasing===================

Enter log files and software name for heavy atom phasing

NOTE:

Software is one of the following:

The log file should be from the best trial of phasing.

<phasing_method = "MAD" > (SAD|MAD|SIR|SIRAS|MIR|MIRAS)

<phasing_software = "SOLVE" >

<phasing_log_file_LOG_1 = " " >

<phasing_log_file_PDB_1 = " " > (in PDB format)

<phasing_log_file_CIF_1 = " " > (in mmCIF format)

<phasing_log_file_LOG_2 = " " >

<phasing_log_file_PDB_2 = " " >

<phasing_log_file_CIF_2 = " " >

... add more if needed ...

===============PART 6: Statistics for Density Modification================

Enter log files and software name for density modification

NOTE:

Software is one of the following:

The log file should be from the best trial of density modification.

<dm_software = "RESOLVE " >

<dm_log_file_LOG_1 = " " >

<dm_log_file_CIF_1 = " " > (in mmCIF format)

===============PART 7: Statistics for Structure Refinement================

Enter log files and software name used for final structure refinement

NOTE:

Software is one of the following:

The log file should be from the final trial of structure refinement.

<refine_software = "REFMAC5" >

<refine_log_file_PDB_1 = " " > (coordinate file in PDB format)

<refine_log_file_CIF_1 = " " > (LOG file in mmCIF format)

<refine_log_file_LOG_1 = " " >

<refine_log_file_PDB_2 = " " >

<refine_log_file_CIF_2 = " " >

<refine_log_file_LOG_2 = " " >

=======================PART 8: Data Template File=========================

Enter file name of the data template file

NOTE:

This file 'data_template.text' was generated by using the

command 'extract -pdb pdb_file' or 'extract -cif cif_file'. It

contains the sequences of all unique polymers (protein or nucleic

acid) present in the structure. It also contains other

non-electronically captured information. Please complete the

data template file before running pdb_extract.

<data_template_file = "data_template.text" >

=====================================END==================================
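
Because every file named in log_script.inp must be present when the program runs, it can help to check the names before starting. The following Python sketch is a hypothetical pre-flight check and is not part of pdb_extract. It assumes that every token whose name contains "file" holds a path, and the file name no_such_model.pdb is invented for the example.

```python
import os
import re

# Hypothetical pre-flight check, not part of pdb_extract.
# Assumption: every token whose name contains "file" holds a file path
# that is relative to base_dir.
ENTRY = re.compile(r'<\s*(\w+)\s*=\s*"([^"]*)"\s*>')

def missing_files(script_text, base_dir="."):
    """Return (token, path) pairs for named files that do not exist."""
    missing = []
    for name, raw in ENTRY.findall(script_text):
        path = raw.strip()
        if "file" in name and path:  # skip software names and blank entries
            if not os.path.isfile(os.path.join(base_dir, path)):
                missing.append((name, path))
    return missing

# Excerpt in the style of PART 7 and PART 8; the PDB file name is made up.
script = '''
<refine_software = "REFMAC5" >
<refine_log_file_PDB_1 = "no_such_model.pdb" > (coordinate file in PDB format)
<refine_log_file_LOG_1 = " " >
<data_template_file = "data_template.text" >
'''

for name, path in missing_files(script):
    print(f"{name}: cannot find {path}")
```

Entries left blank are skipped, so only fields that were actually filled in are reported.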

6. References

Berman, H. M., Henrick, K. & Nakamura, H. (2003). Nat Struct Biol 10, 980.

Berman, H. M., Westbrook, J., Feng, Z., Gilliland, G., Bhat, T. N., Weissig, H., Shindyalov, I. N. & Bourne, P. E. (2000). Nucleic Acids Res. 28, 235-242.

Bernstein, F. C., Koetzle, T. F., Williams, G. J. B., Meyer Jr., E. F., Brice, M. D., Rodgers, J. R., Kennard, O., Shimanouchi, T. & Tasumi, M. (1977). J. Mol. Biol. 112, 535-542.

Bourne, P. E., Berman, H. M., Watenpaugh, K., Westbrook, J. D. & Fitzgerald, P. M. D. (1997). Meth. Enzymol. 277, 571-590.

Bruker Analytical X-ray Systems (1998). SMART/SAINT.

Brünger, A. T., Adams, P. D., Clore, G. M., DeLano, W. L., Gros, P., Grosse-Kunstleve, R. W., Jiang, J.-S., Kuszewski, J., Nilges, M., Pannu, N. S., Read, R. J., Rice, L. M., Simonson, T. & Warren, G. L. (1998). Acta Crystallogr. D54, 905-921.

Burley, S. K., Almo, S. C., Bonanno, J. B., Capel, M., Chance, M. R., Gaasterland, T., Lin, D., Sali, A., Studier, F. W. & Swaminathan, S. (1999). Nat. Genet. 23, 151-157.

CCP4 (1994). Acta Crystallogr. D50, 760-763.

Cowtan, K. (1994). Joint CCP4 and ESF-EACBM Newsletter on Protein Crystallography 31, 34-38.

de la Fortelle, E. & Bricogne, G. (1997). Meth. Enzymol. 276, 472-494.

Evans, P. R. (1997). Joint CCP4 and ESF-EACBM Newsletter on Protein Crystallography 33, 22-24.

Feng, Z., Westbrook, J. & Berman, H. M. (1998). Report NDB-407. Rutgers University, New Brunswick, NJ.

Furey, W. & Swaminathan, S. (1997). Editor. PHASES: a program package for the processing and analysis of diffraction data from macromolecules,

Haebel, P. W., Arcus, V. L., Baker, E. N. & Metcalf, P. (2001). Acta Crystallogr D Biol Crystallogr 57, 1341-1343.

Harris, M. & Jones, T. A. (2002). Acta Crystallogr D Biol Crystallogr 58, 1889-1891.

Hendrickson, W. A. (1991). Science 254, 51-58.

Kissinger, C. R., Gehlhaar, D. K. & Fogel, D. B. (1999). Acta Crystallogr D Biol Crystallogr 55 (Pt 2), 484-491.

Lamzin, V. S. & Wilson, K. S. (1997). Methods in Enzymology 277, 269-305.

Murshudov, G. N., Vagin, A. A., Lebedev, A., Wilson, K. S. & Dodson, E. J. (1999). Acta Crystallogr D Biol Crystallogr 55 (Pt 1), 247-255.

Navaza, J. (1994). Acta Crystallogr. A50, 157-163.

Otwinowski, Z. & Minor, W. (1997). Meth. Enzymol. 276, 307-326.

Pflugrath, J. W. (1999). Acta Crystallogr D Biol Crystallogr 55, 1718-1725.

Sheldrick, G. & Schneider, T. (1997). Methods in Enzymology 277, 319-343.

Terwilliger, T. C. (2000). Acta Crystallogr D Biol Crystallogr 56 (Pt 8), 965-972.

Terwilliger, T. C. & Berendzen, J. (1999). Acta Crystallogr D Biol Crystallogr 55 (Pt 4), 849-861.

Tronrud, D. E. (1997). Methods Enzymol 277, 306-319.

Vagin, A. & Teplyakov, A. (2000). Acta Crystallogr D Biol Crystallogr 56 (Pt 12), 1622-1624.

Weeks, C. M., Blessing, R. H., Miller, R., Mungee, S., Potter, S. A., Rappleye, A., Smith, G. D., Xu, H. & Furey, W. (2002). Z. Kristallogr. 217, 686-693.

Weeks, C. M. & Miller, R. (1999). Acta Crystallogr D Biol Crystallogr 55 (Pt 2), 492-500.

Westbrook, J., Feng, Z., Burkhardt, K. & Berman, H. M. (2003). Meth. Enz. 374, 370-385.

7. Frequently Asked Questions

Q. What does pdb_extract do?

A. pdb_extract reads log files from various crystallographic applications, together with coordinate and structure factor files, and automatically extracts relevant information about data reduction, scaling, heavy atom phasing, molecular replacement, density modification and final structure refinement. It then combines this information into an mmCIF format file for validation and deposition to the PDB.

Q. What should I do if the program that I used for solving the structure is not supported by pdb_extract?

A. If the program generates a coordinate file in the PDB format and any log files in mmCIF format, include these files and the name of the program, and pdb_extract should be able to prepare a deposition file for you. Please send the name of the unsupported program, any other relevant details about it and its log file to help@rcsb.rutgers.edu. We will add this program to our list of supported applications.

Q. I included all the appropriate file names in the log_script.inp file but the program does not run to completion. What should I do?

A. Check the data template file to make sure that there are no ‘????’ in the sequence of the polymers. This represents a break in the chain due to missing residues. Edit the sequence information appropriately to ensure that all residues that were not modeled due to missing density or residues that were modeled as Ala or Gly due to missing side chain density have been appropriately corrected.
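
The check described in this answer can be sketched in a few lines of Python. The function find_chain_breaks is a hypothetical helper and not part of pdb_extract. It assumes only that a chain break is written as four consecutive question marks, as stated in the data template guidelines.

```python
# Hypothetical check, not part of pdb_extract: find unresolved chain-break
# markers ('????') in a one-letter sequence taken from the data template.
def find_chain_breaks(sequence):
    """Return the 0-based offsets of every '????' marker in the sequence."""
    compact = "".join(sequence.split())  # drop blanks and line breaks
    offsets = []
    start = compact.find("????")
    while start != -1:
        offsets.append(start)
        start = compact.find("????", start + 4)
    return offsets

# First two lines of the example sequence from the data template.
seq = """
SASFDGPKFK(MSE)TDGSYVQTKTIDVGSSTDISPYLSLIREDSILNGNRAVIFDVYWDVGF????TKTSGWSLSSV
KLSTRNLCLFLRLPKPFHDNLKDLYRFFASKFVTFVGVQIEEDLDLLRENHGLVIRNAINVGKLAAEARG
"""

breaks = find_chain_breaks(seq)
if breaks:
    print("Unresolved chain breaks at offsets:", breaks)
else:
    print("No chain-break markers found.")
```

Offsets are counted in characters of the compacted string, so a parenthesized code such as (MSE) counts as five characters rather than one residue.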

Q. All the residues in my file are non-standard or modified. Will pdb_extract be able to extract the sequence from the coordinate file?

A. pdb_extract can recognize and extract the sequence of polymers (protein or nucleic acid), including some non-standard residues. Non-standard or modified residues are denoted by their three-letter code in parentheses, for example '(MSE)' for selenomethionine. However, if all the residues in the polymer are non-standard, the program may fail to get a correct register for the sequence. In such cases it is recommended that the entity_poly (sequence) category be edited manually in the data_template.text file to ensure that the sequence included is complete and correct.
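
When editing a sequence by hand, it can be useful to confirm that it still splits cleanly into residues. The Python sketch below is a hypothetical tokenizer and not the one used by pdb_extract. It assumes that standard residues are single upper-case letters, that non-standard residues are written as a code of up to three characters in parentheses, and that '????' marks a chain break.

```python
import re

# Hypothetical tokenizer, not part of pdb_extract. Assumptions: standard
# residues are single upper-case letters, non-standard residues appear as a
# code of one to three characters in parentheses, e.g. (MSE), and '????'
# marks a chain break.
TOKEN = re.compile(r"\(([A-Z0-9]{1,3})\)|([A-Z])|(\?{4})")

def tokenize(sequence):
    """Split a template sequence into residue tokens, left to right."""
    compact = "".join(sequence.split())  # drop blanks and line breaks
    tokens = []
    pos = 0
    while pos < len(compact):
        match = TOKEN.match(compact, pos)
        if match is None:
            raise ValueError(
                f"unexpected character at offset {pos}: {compact[pos]!r}"
            )
        tokens.append(match.group(0))
        pos = match.end()
    return tokens

# Start of the example sequence from the data template.
tokens = tokenize("SASFDGPKFK(MSE)TDGSY")
nonstandard = [t for t in tokens if t.startswith("(")]
print(len(tokens), "tokens; non-standard:", nonstandard)
```

A ValueError points at the first character that does not fit the assumed notation, such as a stray lower-case letter or an unbalanced parenthesis.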

Q. I am behind a firewall so I cannot use the web version of ADIT. How do I complete my deposition?

A. Please use either the validation server (web or desktop versions) or the command line option for validating the files that you prepared using pdb_extract. You can email the validated coordinate and structure factor files to deposit@rcsb.rutgers.edu or ftp them to pdb.rutgers.edu.

Q. It will probably take me a really long time to complete solving this structure. Why should I bother with pdb_extract right now?

A. pdb_extract can help you keep track of all the relevant information from the different stages of structure solution required for depositing the structure. Apply pdb_extract to the output and log files of each step of structure determination (scaling, molecular replacement, density modification etc.). Finally, you can combine all these output files (using the -icif cif_file_name option) to generate an mmCIF format file that contains all the information regarding the different stages of structure solution.