topp (CCP4: Supported Program)

NAME

topp - an automatic topological and atomic comparison program for protein structures

SYNOPSIS

topp
[Keyworded input]

top3d foo_1.pdb foo_2.pdb

topsearch foo_1.pdb

AUTHOR

Author:
Guoguang Lu
Div. of Molecular Structural Biology
Dept. of Medical Biochemistry and Biophysics,
Karolinska Institute, Stockholm, 17 177, Sweden
E-mail:
Guoguang.Lu@mbfys.lu.se

NOTES ON CCP4 VERSION

Note: TOPP has been renamed from the original TOP to avoid a clash with the UNIX command of that name.

TOPP can be run directly using the command topp with Keyworded input, or via the script top3d which takes two file names as arguments and program parameters from the file $CLIBD/TOP.PARM (see examples section). A search with one file against a database of structures can be done using the script topsearch which takes one file name as argument and program parameters from the file $CLIBD/SEARCH.PARM (see examples section).

Use of the browser facility to search a Protein Data Bank site requires two commands to be on the user's path, namely wget and pdbhtf. The latter is part of the CCP4 suite and should have been compiled and installed. On the other hand, wget is not part of CCP4, but is a GNU program available via internet from the usual GNU sites.


DESCRIPTION

TOP is a protein TOPological comparison program that detects structural similarities between two proteins. It superimposes two protein structures automatically, without any prior knowledge of a sequence alignment. The program can be used to find out whether a newly determined protein structure is similar to any structure in the Protein Data Bank, and to rank the similar proteins according to their topological and structural diversity. From version 6 onwards the program can fetch data directly from the Protein Data Bank or its mirror sites via the internet, so users can search the most recent data without regularly downloading the whole database to local disk. The program also has a 3DB browser interface, so a structure similarity search can be done rapidly when the user restricts the search range by sequence, keywords, resolution, date or other criteria. This makes TOP convenient for modelling homologous proteins and for automatically tracking newly released structures of special interest without having to follow the literature.

TOP is designed to be user friendly. Once the program is properly set up on a Unix computer, a simple command such as top3d file1 file2 superimposes the coordinates in file2 onto file1. The program recognises Protein Data Bank (PDB) entry codes: if the second molecule is PDB entry 2cnd, the user can type top3d file1 2cnd@pdb, and the program will fetch the coordinates of 2cnd to the local disk and perform the comparison. To find out whether the structure in a file is similar to any structure in the PDB, type topsearch file.pdb; the program outputs a list of PDB codes ranked by 3D structural similarity. The user can then type top3d file.pdb code@pdb to superimpose the coordinates of interest onto the probe model. The program can detect sequence permutations and can be used for special purposes such as motif searching.

The program performs each structure comparison in two steps. In the first step the topologies of the secondary structure elements (SSEs) of the two structures are compared. The program represents each SSE (alpha helix or beta strand) by two points, then systematically searches all possible superpositions of these elements between the two protein structures. Once a pair of elements in the two structures fit each other in 3D space (that is, the rms deviation, the angle between the two lines formed by the two points, and the line-line distance are all smaller than given values), the program checks whether more SSEs fit under the same superposition operation. If the number of fitting SSEs exceeds a given number, the program reports the two structures as similar, outputs the names of the corresponding SSEs in the two proteins, and outputs the superimposed coordinates. It also outputs the matrix with which one molecule can be rotated and translated onto the other. Finally, the program outputs a comparison score called "Topological Diversity", which takes into account both the fraction of matching SSEs and the structural difference of the representing points. In database searches this score can be used to rank the topological similarity of the SSEs.

When Ca atoms are available, the program can run a second step: starting from the initial comparison matrix, it finds an alignment based on the Ca atoms of all residues, then improves the matrix using the superposition of the newly aligned Ca atoms. The procedure is iterated until the number of matching residues converges. The program can handle sequence permutations in the superposition. From both the r.m.s. deviation and the number of matching residues, the program calculates a "Structure Diversity" score, which can be used to rank the structural differences between homologous proteins.
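The angle criterion of the first step can be illustrated with a small calculation. This is a sketch only, assuming a POSIX shell with awk; the function name sse_angle is hypothetical and the end points are made-up coordinates. The text does not say which cutoff TOP applies or whether it treats the lines as directed; directed vectors are used here, so the result lies between 0 and 180 degrees.

```shell
#!/bin/sh
# sse_angle X1 Y1 Z1 X2 Y2 Z2  U1 V1 W1 U2 V2 W2
# Print the angle in degrees between the line through the first two points
# (element A) and the line through the last two points (element B).
# Illustrative helper only; it assumes neither element has zero length.
sse_angle() {
    echo "$@" | awk '{
        ax = $4 - $1;  ay = $5 - $2;  az = $6 - $3     # direction of element A
        bx = $10 - $7; by = $11 - $8; bz = $12 - $9    # direction of element B
        dot = ax*bx + ay*by + az*bz
        na = sqrt(ax*ax + ay*ay + az*az)
        nb = sqrt(bx*bx + by*by + bz*bz)
        c = dot / (na * nb)
        if (c > 1) c = 1                               # guard against rounding
        if (c < -1) c = -1
        # acos(c) written with atan2, which every POSIX awk provides
        printf "%.1f\n", atan2(sqrt(1 - c*c), c) * 180 / 3.141592653589793
    }'
}

# Two made-up elements whose axes are 45 degrees apart.
sse_angle 0 0 0 1 0 0   0 0 0 1 1 0
```

A pair of elements would pass this part of the test when the printed angle is below the chosen cutoff; the rms and line-line distance checks are not shown.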

Use of a SSE database

The optimized way of database searching in TOP is to use a library of Secondary Structure Elements (SSEs). This can be created from a set of PDB files with the command MAKEVEC (see below).

The compact SSE library is updated automatically every week at the Karolinska Institute. It includes not only the structures currently released in the Protein Data Bank, but also compact SSE databases of the independent structures, families and super-families classified in the SCOP database, for efficient similarity searching. It can be obtained from ftp://gamma.mbb.ki.se/pub/guoguang/sndlib.tar.Z . After fetching this tar file by FTP and saving it to local disk as, for example, /dir/sndlib.tar.Z, use the following commands:


cd $TOPHOME
zcat /dir/sndlib.tar.Z | tar -xvf -

This gives you the most recently updated SSE databases.

Keyworded Input

The parameters of the TOP program can be controlled by different lines of text, each of which is a "keyword command". Any command line which starts with "!" will be ignored. Available keywords are:

Keywords for location of protein coordinates

The TOP program can compare two structures or search similarities in database by comparing one structure with a group of other structures. The MOL1 command specifies the data location of one molecule (called Molecule 1) while the commands MOL2, LIBDIR, MOLVEC, PDBSITE or WEBSITE specify the data location of another molecule (called Molecule 2) or the other molecules (called database).

TOP can read the 3D coordinates of protein structures in "Brookhaven" (PDB) format from the user's local disk, from CD-ROM or via the internet. For structure similarity searching there are several ways to read the data. The recommended setup is to search an automatically updated secondary structure element (SSE) library (see automatic updating of SSE library and MOLVEC). In this way the program searches the most recent data in the compact SSE library and fetches detailed coordinates only for those structures found to be similar to molecule 1. This is considerably faster and, once set up, requires no regular database maintenance.

Keyword Input for structure comparisons

Keywords for 3DB interface

Conventions of the Coordinate files

When comparing two protein structures, the program needs two coordinate files in Brookhaven format. It can read secondary structure elements (SSEs, i.e. alpha helices and beta strands) that are pre-assigned in the PDB-format files, as in the following example:

HELIX    1  F1 LEU     96  SER    103  
HELIX    2  N1 ILE    148  ARG    160 
HELIX    3  N2 ARG    184  GLU    193 
HELIX    4  N3 GLU    223  HIS    229 
HELIX    5 N4A PRO    245  GLN    249 
HELIX    6 N4B SER    253  GLU    257 
HELIX    7  N5 MET    263  SER    266 
SHEET    1  FB 6 LYS    58  TYR    64  0 
SHEET    2  FB 6 HIS    48  ILE    55 -1
SHEET    3  FB 6 TYR   109  LEU   116 -1
SHEET    4  FB 6 ILE    13  SER    24 -1
SHEET    5  FB 6 VAL    27  SER    33 -1
SHEET    6  FB 6 HIS    75  LYS    81 -1

If there are no SSE assignments in the coordinate file, the program will spend some CPU time calculating them. If the file contains the coordinates of all main-chain atoms, the program uses the "Smith-Laskowski method", as in the PROCHECK package. If the file contains only Ca coordinates, or many main-chain atoms are missing, the program can still assign the secondary structure automatically using another method, but some elements, especially beta strands, may be less accurate than when all main-chain atoms are provided. In most cases, however, this does not affect the structure comparison.
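A quick way to see whether a coordinate file already carries SSE assignments is to count its HELIX and SHEET records. The sketch below assumes a POSIX shell; count_sse_records and the sample file are illustrative and not part of the TOP distribution.

```shell
#!/bin/sh
# count_sse_records FILE: print the number of HELIX and SHEET records found
# in a PDB-format file (hypothetical helper, not part of TOP).
count_sse_records() {
    helices=$(grep -c '^HELIX' "$1")
    strands=$(grep -c '^SHEET' "$1")
    echo "HELIX=$helices SHEET=$strands"
}

# Demonstration on a small sample written to a temporary file.
sample=$(mktemp)
cat > "$sample" << 'EOF'
REMARK sample records for illustration only
HELIX    1  F1 LEU     96  SER    103
HELIX    2  N1 ILE    148  ARG    160
SHEET    1  FB 6 LYS    58  TYR    64  0
EOF
count_sse_records "$sample"
rm -f "$sample"
```

If both counts are zero, the program has to assign the secondary structure itself, as described above.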

Conventions of some output parameters

Examples

In many cases, users can quickly learn how to use the program just by studying appropriate examples. There are two ways to run TOP: simple commands or Unix script files. The simple commands are designed for the convenience of users who do not have the Protein Data Bank in their local lab and who use TOP for ordinary purposes. The Unix script files are more flexible and suit special purposes.

Simple commands:

Unix script file

There are several example files available at http://gamma.mbb.ki.se/~guoguang/webtop/examples showing how to use the TOP program. Here is a summary of them:

Name             PDB data from            Function
top.com          local disk or internet   Superimposing two protein structures and comparing them
pdbscan.com      local disk               Searching similar structures in Protein Data Bank
topscan.com      internet                 Searching similar structures in Protein Data Bank
pdbsearch.com    local disk               Searching similar structures in a compact database
topsearch.com    internet                 Searching similar structures in a compact database
top3db.com       internet                 Searching similar structures with 3DB restraints
makevec.com      local disk               Making SSE library
makevec_web.com  internet                 Making SSE library

Example 1: Compare two structures

Two files, 1kxd.pdb and 1vcp.pdb, are compared by the following script file ($TOPHOME/examples/top.com in the distribution package).

#
rm -f fort.10 fort.11 fort.12
ln -s omatrix.ofm fort.10
ln -s mol1.ofm fort.11
ln -s mol2.ofm fort.12
$LUEXE/top << 'end-top'
MOL1 1kxd.pdb
MOL2 1vcp.pdb
RESIDUE 3
WRITE
'end-top'
#

Type "top.com > top.log" to run the script. The program reports which secondary structure elements correspond to each other in the two structures. Optionally, the program also superimposes the two structures based on the Ca atoms and outputs the sequence comparison (see the description of the keyword RESIDUE). The rms deviation is also output. When the WRITE statement is present, the program writes a file in which molecule 2 is superimposed onto molecule 1; in this case the output file name is 1vcp_1kxd.pdb. Sometimes there is more than one way to superimpose the two structures (e.g. when both structures are dimers, the program can superimpose AB onto A'B' and AB onto B'A'). In that case the program outputs several superimposed coordinate files, called 1vcp_1kxd.pdb, 1vcp_1kxd.pdb_2, 1vcp_1kxd.pdb_3, and so on. Any graphics program (such as O, Insight or Frodo) can be used to display the superimposed coordinates together with 1kxd.pdb. See top.log for more information.

There are other commands that set the comparison parameters for different purposes. For details, please see "Keyworded Input".

The TOP software can fetch coordinates directly from the Protein Data Bank (PDB) if the URL of a PDB mirror site is provided. In this example, if you know that the PDB entry code of one of the structures is 1vcp, you can do the following:
  1. Add a command to indicate the site from which you want to fetch the data:
     PDBSITE http://www.pdb.bnl.gov/
  2. Use xxxx@pdb in the MOL2 command:
     MOL2 1vcp@pdb
Then the program will read 1vcp directly from Brookhaven National Laboratory via the internet.
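Put together, the two changes give a keyword file like the one written by the sketch below. It only generates and prints the input. The function name write_web_input and the file name top_web.inp are assumptions for illustration, the remaining keywords are copied from the script of this example, and the commented last line shows how the file might be passed to the program (the location of the executable depends on the installation).

```shell
#!/bin/sh
# write_web_input FILE: write a keyword file that takes molecule 1 from a
# local file and fetches molecule 2 (entry 1vcp) from a PDB site.
write_web_input() {
    cat > "$1" << 'EOF'
PDBSITE http://www.pdb.bnl.gov/
MOL1 1kxd.pdb
MOL2 1vcp@pdb
RESIDUE 3
WRITE
EOF
}

write_web_input top_web.inp
cat top_web.inp
# To run the comparison (executable path as used in the script above):
# $LUEXE/top < top_web.inp > top.log
```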


Example 2: Searching similar structures in Protein Data Bank

TOP can be used to see whether a protein is similar to any structure in the Protein Data Bank. Depending on how the data are obtained from the database, there are two ways to run a database search:
  1. Search a copy of the Protein Data Bank installed on local disk. Example command files are pdbsearch.com and pdbscan.com in the directory $TOPHOME/examples/
  2. Search the Protein Data Bank via the internet (see topsearch.com and topscan.com).

The recommended way to run TOP is to search a compact library of Secondary Structure Elements (SSEs) first. If the SSE arrangements of some proteins are found to be similar to the structure under study, the program can then do further comparisons based on Ca atoms (as shown in pdbsearch.com and topsearch.com). This approach requires a regularly updated SSE library, which can be obtained from ftp://gamma.mbb.ki.se/pub/guoguang/sndlib.tar.Z . It can also be built and updated automatically (see the instructions for "Automatic updating of SSE library").

Users who choose not to use the compact SSE library can use pdbscan.com or topscan.com instead of pdbsearch.com or topsearch.com to search the PDB on local disk or via the internet, respectively.

In pdbscan.com it is assumed that the user has all the Protein Data Bank files under the directory /nfs/protein/pdb/current_release/uncompressed_files and that all the files are named *.ent. In this example file, the command find $pdbdir -name "*.ent" -print > current.lis finds all the PDB entries and writes them to the file current.lis, which has contents like:


/nfs/protein/pdb/current_release/uncompressed_files/00/pdb100d.ent
/nfs/protein/pdb/current_release/uncompressed_files/00/pdb200d.ent
/nfs/protein/pdb/current_release/uncompressed_files/00/pdb200l.ent
/nfs/protein/pdb/current_release/uncompressed_files/00/pdb300d.ent
/nfs/protein/pdb/current_release/uncompressed_files/01/pdb101d.ent
/nfs/protein/pdb/current_release/uncompressed_files/01/pdb201d.ent
....

In this way all the file names are stored in current.lis, which is read by the MOL2 command in the TOP program: MOL2 @current.lis. In fact, one can search not only the whole Protein Data Bank but also a group of selected structures, for example structures that represent independent folds in the SCOP classification.
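The listing step can be sketched as a small function. It assumes a POSIX shell; make_pdb_list and the temporary demo directory are illustrative names, and the directory layout simply mimics the one shown above.

```shell
#!/bin/sh
# make_pdb_list DIR LIST: write the paths of all *.ent files under DIR to
# LIST, sorted so that repeated runs give the same order.
make_pdb_list() {
    find "$1" -name '*.ent' -print | sort > "$2"
}

# Demonstration with a throwaway directory instead of a real PDB copy.
pdbdir=$(mktemp -d)
mkdir -p "$pdbdir/00" "$pdbdir/01"
touch "$pdbdir/00/pdb100d.ent" "$pdbdir/01/pdb101d.ent" "$pdbdir/01/notes.txt"
make_pdb_list "$pdbdir" "$pdbdir/current.lis"
cat "$pdbdir/current.lis"         # only the two .ent paths are listed
rm -rf "$pdbdir"
```

To search a selected subset rather than the whole database, the list can be filtered (for example with grep) before it is handed to MOL2.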

Take pdbscan.com as an example again. To run the database search, type "pdbscan.com &". After some hours, all the information will be in pdbscan.log, which users do not usually need to look at. Instead, look at the summary files "strdiv.lis" or "topdiv.log". (If the program crashes, you can still inspect the intermediate results by typing "grep Str pdbscan.log | sort +3 -4" or "grep Top pdbscan.log | sort +3 -4".)

The content of strdiv.lis is the following:


 1692 structures are found to be similar under the given criteria
 Best Structure Diversity   7.67  with   52 matched residues to 2cnd
 Best Structure Diversity   7.68  with   56 matched residues to 1azz
 Best Structure Diversity   8.13  with   57 matched residues to 1epa
 Best Structure Diversity   8.33  with   48 matched residues to 1cnf
 Best Structure Diversity   8.48  with   54 matched residues to 1ave
 Best Structure Diversity   8.70  with   54 matched residues to 1hav
 Best Structure Diversity   8.70  with   54 matched residues to 2pia
 Best Structure Diversity   9.28  with   51 matched residues to 1avd
 ............


The structures listed here (2cnd, 1azz, 1epa and so on) were found to be similar to the search model; 2cnd is ranked by the program as the most similar structure. Users can take the command file of example 1 and pick individual coordinates to run a pairwise comparison, which gives the superimposed structure and details of the comparison such as the r.m.s. deviation and the sequence alignment. (This information is also in pdbscan.log; run nicelist.com or toplist.com to get a more readable output.)
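Many current sort implementations no longer accept the obsolete +3 -4 key syntax used in the grep and sort commands of this example. A sketch of the equivalent with the POSIX -k option follows; the function name rank_by_diversity is an assumption, and the sample lines are modelled on the strdiv.lis listing above.

```shell
#!/bin/sh
# rank_by_diversity PATTERN LOG: select the lines of LOG containing PATTERN
# and sort them numerically on the 4th whitespace-separated field (the
# diversity score). "sort -k4,4n" replaces the obsolete "sort +3 -4".
rank_by_diversity() {
    grep "$1" "$2" | sort -k4,4n
}

# Demonstration on a few unsorted lines in the format shown above.
log=$(mktemp)
cat > "$log" << 'EOF'
 Best Structure Diversity   8.13  with   57 matched residues to 1epa
 Best Structure Diversity   7.67  with   52 matched residues to 2cnd
 Best Structure Diversity   8.70  with   54 matched residues to 1hav
EOF
rank_by_diversity Str "$log"      # lowest diversity (most similar) first
rm -f "$log"
```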


Example 3: Searching similar structures from a compact SSE library

As explained in the description section, in the first step TOP detects similarities based on the SSE topology of the two proteins. Besides coordinate files in PDB format, the program can also read a compact database containing SSE topologies derived from the Protein Data Bank. Using the SSE library is a fast and recommended way of searching a database for similarities. To make the library from a copy of the PDB on local disk, use $TOPHOME/examples/makevec.com. To make the library from the PDB on the Web, use $TOPHOME/examples/makevec_web.com. The SSE library can be updated automatically to follow the most recent PDB data; please see the installation section.

The following example shows how to use the SSE library for similarity searching. It is similar to example 2, but with one more command, MOLVEC.


rm -f fort.10 fort.11 fort.12
ln -s omatrix.ofm fort.10
ln -s mol1.ofm fort.11
ln -s mol2.ofm fort.12
cat > topsearch.inp << EOF
MATCH auto
PDBSITE http://www2.ebi.ac.uk
!LIBDIR /nfs/pdb/current_release/uncompressed_files/
MOL1 kinA.pdb
MOLVEC $TOPHOME/lib/sndlib.vec
EOF
$TOPBIN/top < topsearch.inp  > topsearch.log
grep Top topsearch.log | sort +3 -4 >> topdiv.lis
grep similar topsearch.log > strdiv.lis
grep Str topsearch.log | sort +3 -4 >> strdiv.lis

The running and analysis procedure is similar to that of example 2.

In this example, if you use LIBDIR /nfs/pdb/current_release/uncompressed_files/ instead of PDBSITE http://www2.ebi.ac.uk, the program will browse the coordinates from local disk instead of internet.

If you use another SSE database, for example MOLVEC $TOPHOME/lib/scop_structure.vec, you search only about 2000 independent domain structures selected from the SCOP database instead of the 8000 in the Protein Data Bank. The search is then much faster (taking only 1/10 to 1/5 of the time). For the same reason, you could use $TOPHOME/lib/scop_family.vec (about 900 domain structures) or $TOPHOME/lib/scop_superfamily.vec (about 600 domain structures) to shorten the search even further. The SCOP database is not updated as frequently as the PDB (so far about once a year). The SSE database for the most recent SCOP release is always kept on our FTP distribution site.

The TOP Web server offers another way to search all the structures: the program searches classification units of independent domain structures, families or super-families in SCOP. Once it finds a similarity, it can optionally go on to search the other structures in the same classification unit. Such a search is very efficient in terms of speed, although it does not cover the most recent data in the Protein Data Bank. Please have a look at: http://alfa.mbb.ki.se:8000/TOP/search_SCOP_new.html


Example 4: Superimpose all the sequence-homologous proteins in PDB

If users wish to compare all the structures in the PDB that have sequence homology to a particular structure, the following simple procedure produces all the superimposed structures.

#!/bin/csh
rm -f fort.10 fort.11 fort.12
ln -s omatrix.ofm fort.10
ln -s mol1.ofm fort.11
ln -s mol2.ofm fort.12
$TOPBIN/top << 'end-top' 
MOL1 zmA.pdb
MOLVEC snd1.vec
pdbsite http://www2.ebi.ac.uk
3dbseq 0.02 @zm.seq
MATCH auto
WRITE yes
'end-top'

In this example zmA.pdb contains the PDB coordinates of the probe structure, and zm.seq is the file that contains the sequence in one-letter code:

SYTVGTYLAERLVQIGLKHHFAVAGDYNLVLLDNLLLNKNMEQVYCCNEL
TLKFIANRDKVAVLVGSKLRAAGAEEAAVKFTDALGGAVATMAAAKSFFP
EENALYIGTSWGEVSYPGVEKTMKEADAVIALAPVFN
....

The filename for all the superimposed coordinates will be 1pyd_zmA.pdb, 1pvd_zmA.pdb, 1pox_zmA.pdb....

Some hints about the program

  1. Database searching. If you find that structures in the Protein Data Bank are similar to your new structure, the program cannot tell you directly which family it belongs to. However, there are Web sites where you can get this information and classify your new protein according to the results from the TOP program. Some of these sites are listed below.

    Name  URL address                            Function                              Group
    SCOP  http://scop.mrc-lmb.cam.ac.uk/scop     Structure Classification of Proteins  Chothia, Murzin...
    CATH  http://www.biochem.ucl.ac.uk/bsm/cath  Class Architecture Topology Homology  Thornton...

    When searching for similar structures in the whole Protein Data Bank, a lot of time is usually wasted on tens of lysozyme mutants or other closely related homologous proteins. It is possible to make a file list containing only structures with independent folds or super-families (see example 2), if such information can be obtained from other sources. So far, no such effort has been made by the author.

  2. Speed. When you have a huge structure with many domains, it is much faster if you divide your protein into several independent domains and search each domain individually. The results will be much easier to understand too.

  3. Parameter of MATCH. Over-estimation: if the program fails to compare two similar structures, it may be because the parameter value in the MATCH command is too high. You can check this as follows. Suppose the MATCH number should be 4 or less but you use 7; at the end of the output the program will write something like:

     ... No way to align in 12ca.pdb Maximum match : 4 Minimum Align: 7

    You can then change MATCH from 7 to 4 and the program will run successfully.

    In a database search, too high a value in this command will cause no or too few similar structures to be found. You can find a proper value by typing: grep "Maximum match" pdbscan.log | sort +10 -11 (assuming the log file is called pdbscan.log). For example, if you give a MATCH number of 5 and get no hits, you will see something like

     ......
     ... No way to align in 1abj.pdb Maximum match :  3 Minimum Align:  5
     ... No way to align in 1abn.pdb Maximum match :  3 Minimum Align:  5
     ... No way to align in 1abo.pdb Maximum match :  3 Minimum Align:  5
     ... No way to align in 12ca.pdb Maximum match :  4 Minimum Align:  5
     ... No way to align in 1aag.pdb Maximum match :  4 Minimum Align:  5
     ... No way to align in 1aao.pdb Maximum match :  4 Minimum Align:  5
    
    
    In this example, you can get 3 more matched similar structures if you use 4 in the MATCH command.

    Under-estimation: Under-estimating this number is usually harmless. The program will find many structures in which you are not interested, but you can always rank the similarity by "Structure Diversity" or "Topological Diversity" and look only at the structures at the top of the ranking. If the search is too slow because the value of this parameter is too low, there is a way to find a suitable number long before the search has finished. For example, suppose you give 5 in the MATCH command. After the program has run for a while, you can type grep "Max Align" pdbscan.log | sort +3 -4 and get

    .......
    ...(too many hints)...
    ......
     1cax.pdb<->mol1.pdb  Max Align:  5  Max Match:  5
     1cwa.pdb<->mol1.pdb  Max Align:  5  Max Match:  5
     1cwb.pdb<->mol1.pdb  Max Align:  5  Max Match:  5
     1cwc.pdb<->mol1.pdb  Max Align:  5  Max Match:  5
     1cxf.pdb<->mol1.pdb  Max Align:  5  Max Match:  5
     1cyn.pdb<->mol1.pdb  Max Align:  5  Max Match:  5
     1dlc.pdb<->mol1.pdb  Max Align:  5  Max Match:  5
     1cnd.pdb<->mol1.pdb  Max Align:  7  Max Match:  7
     1cne.pdb<->mol1.pdb  Max Align:  7  Max Match:  7
     1cnf.pdb<->mol1.pdb  Max Align:  7  Max Match:  7
    
    
    If you find only the last 3 structures fall into your "similarity" criterion, you can give "MATCH 6" (or 7) when you re-scan the database.
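Both MATCH checks described in hint 3 can also be scripted. The sketch below assumes a POSIX shell with awk and the log line layouts shown in the listings above; match_histogram and hits_at_least are hypothetical helper names, not part of TOP, and the two kinds of log line are mixed in one sample file purely for illustration.

```shell
#!/bin/sh
# match_histogram LOG: tally structures per "Maximum match" value.
# The value is field 11 in lines such as
#   ... No way to align in 1abj.pdb Maximum match :  3 Minimum Align:  5
match_histogram() {
    awk '/Maximum match/ { n[$11]++ }
         END { for (m in n) print m, n[m] }' "$1" | sort -n
}

# hits_at_least N LOG: list structures whose "Max Align" value is >= N.
# The value is field 4 in lines such as
#   1cnd.pdb<->mol1.pdb  Max Align:  7  Max Match:  7
hits_at_least() {
    awk -v min="$1" '/Max Align/ && $4 + 0 >= min + 0 { print $1, $4 }' "$2"
}

# Demonstration on a few lines copied from the listings above.
log=$(mktemp)
cat > "$log" << 'EOF'
 ... No way to align in 1abj.pdb Maximum match :  3 Minimum Align:  5
 ... No way to align in 12ca.pdb Maximum match :  4 Minimum Align:  5
 ... No way to align in 1aag.pdb Maximum match :  4 Minimum Align:  5
 1cax.pdb<->mol1.pdb  Max Align:  5  Max Match:  5
 1cnd.pdb<->mol1.pdb  Max Align:  7  Max Match:  7
EOF
match_histogram "$log"     # one "value count" pair per line
hits_at_least 6 "$log"     # structures with Max Align of 6 or more
rm -f "$log"
```

The tally shows at a glance how many extra structures a lower MATCH value would admit, and the filter shows which structures would survive a higher one.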

Reference

  1. Lu G. (1996). A WWW service system for automatic comparison of protein structures. Protein Data Bank Quarterly Newsletter, #78, 10-11.
  2. Guoguang Lu, An automatic topological and atomic comparison program for protein structures (in manuscript or http://gamma.mbb.ki.se/~guoguang/top.html).

Acknowledgment

I am grateful to Dr. Ylva Lindqvist and Prof. Gunter Schneider for encouraging me to write this program and for contributing important ideas. I also thank Dr. Roman Laskowski for permission to use his secondary structure assignment program and Dr. Jaime Prilusky for suggestions on the 3DB interface. Thanks also to a number of colleagues for suggestions and bug reports.