The np_clus README file

INTRODUCTION

This is the README file for the np_clus utility for analysing non-polar
cores in protein structures. It is described in the following paper.

  Finn Drablos, 'Clustering of non-polar contacts in proteins',
  Bioinformatics 15(6), 501-509, 1999

Please cite this publication in any research using the np_clus utility.

You do not need to sign anything in order to get access to the
software, as I do not have time to handle anything like that. However,
I ask you to respect the following simple rules.

* Cite the original publication (see above) whenever you publish any
  research where this software has been used.
* I ask you to inform me about any bugs in the software.
* If you modify the software you should indicate the modifications very
  clearly in the source code, and if you publish any research based
  where a modified version of the software has been used you should
  specify this in the publication.
* If you feel that the modification represents an improvement over the
  original software I will be very grateful if you will give me access
  to it.
* I will also be grateful for reprints of any publications describing
  research were the software has been used.
* The software distribution is for academic research only. If you want
  to use it for commercial purposes then you have to contact me first.
* You should not remove this README file from the distribution, or
  change the INTRODUCTION part of it. If you have something to add
  please use an additional README file.

So please go ahead and have fun. I do not have time to give much
support, but I will be happy to help you as far as I can if you get
stuck. Please use email.

Trondheim, 1999, 2005

Finn Drablos
finn.drablos@ntnu.no

Department of Cancer Research and Molecular Medicine (IKM)
Faculty of Medicine (DMF)
Norwegian University of Science and Technology (NTNU)
N-7489 Trondheim
Norway

INSTALLATION

The distribution is a gzipped tar archive, and you can unpack it (to a
temporary directory) with the following command.

  zcat np_clus.tar.gz | tar xvof -

In the distribution you will find the following files.

  README          (this file)
  pdb_np_cont     (a Cygwin executable for pairwise contacts)
  pdb_np_cont.c   (the C source file for pdb_np_cont)
  pdb_np_clus     (a nawk script for doing the actual clustering)
  dots.lis        (a sphere of dots, for area calculation)
  atoms.dat       (definitions of standard atom types)
  pdb1crn.ent     (a crambin data set, to get you started)

The software was initially developed on SGI computers running IRIX 5.3,
6.2 and 6.5, and later ported to PC/Cygwin platforms. It should be
fairly portable to other platforms.

In order to install the software you should do the following.

1.  Make a suitable directory, e.g. /usr/local/np_clus, and copy
    dots.lis and atoms.dat to this directory. You can also copy the
    README file to this directory, in order to keep it for future
    reference.
2.  Define the SEQ_TOOLS variable in your .cshrc file (or some other
    suitable place) to point to the directory from step 1 (e.g.
    setenv SEQ_TOOLS /usr/local/np_clus). Source the file to make sure
    that the definition is active.
3a. If you are not able to use the precompiled executable you have to
    compile the source code (e.g. gcc pdb_np_cont.c -o pdb_np_cont).
3b. Copy the pdb_np_cont executable to a suitable location, which
    should be in your path (e.g. /usr/local/bin).
4a. pdb_np_clus is a nawk script. The location of the interpreter is
    defined in the first line, and by default it is
    '#!/usr/bin/gawk -f'. You should change this (if necessary) to
    point to your local nawk/gawk/bawk or whatever.
4b.
    Copy the pdb_np_clus script to the same location as pdb_np_cont,
    and rehash so that the programs are available.
5.  The installation is now complete.

Please observe that the exact order of residues in the final tree may
depend upon the awk version you are using (see 4a). The residues are
retrieved directly from the awk data structure, which is generated by
hashing. This means that awk implementations using a different hashing
algorithm may generate a different order. However, the tree structure
itself should be identical.

USAGE

First a WARNING. This software was written as an experiment, based on
some old software tools, and only to be used by me :-) It is therefore
a very userUNfriendly piece of software. But the basic idea is simple,
and as soon as you grasp that you can probably understand the rest of
it by playing around with the options, looking at the source code (if
you dare!), and looking at this README file. There is no real
documentation yet. Maybe some day, if I am able to get some funding
from the research council ...

To use the software you need a protein structure in PDB format. The
software has not been adapted to other types of molecules. However, it
is quite flexible, so it is likely that you can adapt it by adding
relevant definitions to the atoms.dat file.

First you need to compute the pairwise contact areas with pdb_np_cont
(e.g. pdb_np_cont pdb1crn.ent). There is just one option, '-s'. This
option will scale distances according to van der Waals radius during
classification of surface regions. The default is to assign a surface
point as buried by the nearest neighbour (atom). By scaling the
distance according to van der Waals radius the relative burial becomes
significant. If a point is equally far from two different atoms, it
will be assigned to the one with the largest van der Waals radius. This
may be sensible, because it will be relatively more deeply buried into
the van der Waals sphere of this atom.
However, the default behaviour is slightly faster, it may be easier to
analyse and understand, and the net effect of using scaled distances
tends to be relatively small.

You will normally run pdb_np_cont only once for a given structure. It
will generate a file of contact areas (e.g. pdb1crn.cnt). This file is
then used as input to pdb_np_clus, which does the actual clustering.
There you have several options (NB: specified with a '+', rather than
the more standard '-', for historical reasons ...).

Usage : pdb_np_clus []

  +c - Cutoff for contact stopping tree building
  +n - Minimum number of members
  +s - Prefix for script output
  +e - Use external color list (r g b name)
  +h - Do not print header in postscript plot
  +f - Use colors in postscript plot
  +d - Enter DEBUG mode

Tree build options
  default - Join trees on residue-residue contact area
  +t      - Use total contact area between trees

Cutoff options
  default - Cutoff on tree building scale
  +p      - Cutoff on plot scale

Plot options
  default - Plot against tree building scale
  +w      - Plot against total contact area within trees
  +g      - Plot against global contact area within trees

Normally you will do a first clustering with just default values, no
options. Then you can fine-tune this tree by playing around with the
options. Most of them are explained in the paper. Output is in
encapsulated postscript (EPS) format, and can be viewed with a
postscript viewer (e.g. ghostscript) or printed on a postscript
printer.

You can generate a command file for colour coding the protein structure
according to clusters. The command file is a BCL script, which can be
read by the InsightII program from MSI. The short name of the protein
used by Insight can be specified with the '+s' option.

The default colour coding can be modified by using an explicit colour
list (the '+e' option), which will be read sequentially. Each entry
holds the red, green and blue values followed by a name. The following
is an example of a colour list.

# This is just a test colour file.
255 0 255 magenta
255 0 0 red
0 255 0 green
0 0 255 blue
255 255 0 yellow
0 255 255 cyan
# This is a real example.
0 0 255 A/B
255 0 0 C
255 0 0 C
255 255 0 A
0 255 0 B
0 255 0 B
0 0 255 A/C

The following commands will reproduce some of the examples in the
publication, given that default contact files have been generated
first.

  Fig 2a - pdb_np_clus pdb1crn.cnt
  Fig 2b - pdb_np_clus +t pdb1crn.cnt
  Fig 2c - pdb_np_clus +w pdb1crn.cnt
  Fig 2d - pdb_np_clus +w +t pdb1crn.cnt
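
EXAMPLE: THE '-s' OPTION

The difference between the default classification of surface points
and the '-s' option of pdb_np_cont (see USAGE) may be easier to see in
code. The following is only a minimal sketch and is not taken from
pdb_np_cont.c. The Atom struct and the burying_atom function are
hypothetical names, and the sketch assumes that "scaling" means
dividing the distance by the van der Waals radius. The README only
states that a point equally far from two atoms goes to the atom with
the largest radius, which this form satisfies. How the default mode
breaks exact ties is also an assumption here (the first atom wins).

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical atom record; pdb_np_cont.c may store atoms differently. */
typedef struct {
    double x, y, z;   /* centre coordinates */
    double vdw;       /* van der Waals radius, must be > 0 */
} Atom;

/* Squared distance from the point (px, py, pz) to the centre of atom a. */
static double dist2(const Atom *a, double px, double py, double pz)
{
    double dx = a->x - px, dy = a->y - py, dz = a->z - pz;
    return dx * dx + dy * dy + dz * dz;
}

/*
 * Return the index of the atom that "buries" a surface point.
 * scaled == 0: plain nearest neighbour (the documented default).
 * scaled != 0: distance divided by the van der Waals radius
 *              (ASSUMED form of the '-s' scaling).
 * Squared values are compared; this gives the same ordering as the
 * distances themselves, so sqrt() is not needed.
 * Exact ties keep the earliest atom (a choice made for this sketch).
 */
size_t burying_atom(const Atom *atoms, size_t n,
                    double px, double py, double pz, int scaled)
{
    size_t best = 0;
    double best_key = 0.0;

    assert(n > 0);
    for (size_t i = 0; i < n; i++) {
        double key = dist2(&atoms[i], px, py, pz);
        if (scaled)
            key /= atoms[i].vdw * atoms[i].vdw;
        if (i == 0 || key < best_key) {
            best = i;
            best_key = key;
        }
    }
    return best;
}
```

As an illustration with made-up radii: take one atom at x = 0 with
radius 1.2 and another at x = 4 with radius 1.9. A point at x = 2 is
equally far from both. Calling burying_atom with scaled = 0 returns
index 0 (the tie keeps the first atom), while scaled = 1 returns
index 1, the atom with the larger radius.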