Abstract
The classification of protein structural information, especially using the
overall structure of the protein (the fold) as the basis for the classification
[1],
is an exciting area of research in structural biology. We were interested
in the complementary question: could we develop a classification scheme based
on the primary structure (residue based) that would allow us to understand, at
the molecular level, the intricacies of protein structure?
In this paper we present the Protein Structure Database (PSdb), a new protein database that relates secondary (e.g., helix, sheet, turn, random coil), supersecondary (e.g., helix-helix interactions), and tertiary information (e.g., solvent accessibility, internal relative distances, and ligand interactions) to the primary structure. The data for each protein are supplied on a residue by residue basis and encoded in a series of flat ASCII files.
Relationships between the various levels of structure (primary, secondary,
tertiary) can be investigated visually using PSdbView, a graphical
tool provided to view the information within the PSdb. This tool allows
for side by side comparison of residue based data and includes a variety
of standard mechanisms
for visualizing protein data including Ramachandran plots,
C(alpha)-C(alpha)
distance plots, and differences in solvent accessible molecular
surface area graphs (e.g.,
differences in the exposed surface
with and without including either the ligands, metal ions or buried
waters in the computations). PSdbView is written in Java,
thus providing a platform
independent means for exploring PSdb entries over the Internet.
Introduction
Two of the most important and difficult problems in molecular
biology, the protein folding problem and the structure-function relationship in
proteins, are directly concerned with relating the three-dimensional structure
of a protein to the primary sequence. The richest source of information about
protein structure is the
Protein Data Bank (PDB)
maintained by the Brookhaven National Laboratory [2].
The PDB is a collection of individual "flat" text
files, each of which contains the three-dimensional coordinates of one of the
several thousand macromolecular structures determined by
various experimental
techniques (e.g. X-ray crystallography or NMR).
While these files hold massive amounts of crucial
information, that information is difficult to access and use in
a systematic fashion.
There are many efforts that aim to unlock the information contained within the PDB [3 - 21], but there is no one database currently available that provides the flexibility to allow for a variety of questions to be asked. In practice, this has resulted in each investigator's developing specific single-use programs to determine the desired information. These programs are often problem-specific, hastily written and not easily adapted to related problems. This lack of readily available query tools places profound limitations on current research. Many researchers could profit from using structural information to guide and interpret their results, but few have either the skills or time to design and implement database programs.
In this paper, we describe the Protein Structure Database (PSdb),
a database which extends the information available in current databases
and allows researchers to quickly explore relationships between the
primary, secondary, supersecondary, tertiary, and quaternary
structures. The extensions to the currently available databases
include a larger number of identified secondary structure elements,
the ability to use a supersecondary structure for searches, and a greater
number of environmental descriptors for each residue.
Previous Work
There have been a limited number of tools available to
researchers for the examination and correlation of the structural information
within the PDB. The
NRL_3D database [22 - 24]
was created to address whether
the structure of a specific sequence
had been reported. It is a
collection of the primary sequences, annotations and author assigned
secondary structure of the crystal structures reported in the
PDB with a resolution of less than 3 Angstroms. A series of companion files
records the authors, keywords, species from which the sample was taken, and other
information. Chris Sander and his co-workers at EMBL
[3] created
one of the first structural databases based upon the secondary structure
assignments made by the program
DSSP. This
database consists of a series of flat files with primary sequence and
secondary structure information. "Molecular Structure in Biology", a
commercial program available from Oxford University Press, contains at
least three different images of each of the PDB entries along with the
capability to rotate wireframe images of the protein. The researcher is
allowed to browse through the information contained within the header of the
PDB file and to perform searches through this information.
Several groups have incorporated protein structure information into relational
databases. PKB [25]
is a user interface combining a relational database and a
series of user accessible macros. The information contained in this database is
taken directly from the header information of each PDB file (e.g., secondary
structure elements (Helix, Sheet or Turn), resolution, R-Value, refinement
technique, and cell description). Macros have been added to the system to allow
the user to perform relatively complex mathematical modeling
and threading [25].
SESAM follows
similar ideas. It has been developed by Wodak et al.
[26 - 35] at the Universite
Libre de Bruxelles, Belgium. The core of the database is a series of structural
determinants and solvent accessible computations by DSSP
[3]. Isis (Integrated
Sequence/Integrated Structures) is a commercial protein sequence/structural
database developed by the Protein Engineering Club Database Group
[36, 37].
Primary sequences (e.g.,
NBRF-PIR and
SWISS-PROT) are contained in the
OWL
sequence database
and the structural information is contained within the BIPED
relational database. The structural information includes structural domains
(sheet, helix, and turn presumably taken from the file header), torsional
angles, solvent accessibility and hydrogen bonds. The database is provided as
either flat files or Oracle tables.
Data Components of the PSdb
PSdb entries are derived from coordinate data found in the PDB.
The information in each entry is compiled and
stored on a residue by residue basis in a series of flat
ASCII files.
The data included for each residue includes:
We developed software for determining the various values reported in the PSdb. In the following sections, we briefly describe the algorithms used in these computations. In the case of multiple conformations, we always used the first conformation reported in the PDB file.
Many existing databases base their secondary structure classification on the author assignments found in PDB files. The problem with relying on these definitions is that the assignment of the secondary structure is highly idiosyncratic. Helix, sheet, and turn structures are determined using the investigator's personal criteria. While this may cause few problems in the central regions of helix and sheet structures, it can lead to very different classifications at the ends of these structures [4]. Unfortunately, the ends of these secondary structural features may be critical to our understanding of protein structure [4] and function [38].
To address this problem, a new classification has been developed that provides a single, consistent definition for secondary structure. The classification is derived using an algorithm developed by Deerfield [39]. The algorithm initially performs a classification of the secondary structure for each residue based upon the Phi, Psi and Omega angles. This is followed by a more stringent examination of secondary structure for the residues (e.g., whether there are enough consecutive residues to define an element or whether the necessary hydrogen bonds are present). The variety of secondary structural elements identified in the current implementation of this algorithm is greater than in other databases (e.g., C7(eq) and all beta turns are identified). The amount of helical structure identified by this methodology is slightly greater than that from DSSP, but similar to the percent helix as determined by CD spectroscopy.
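The two-stage idea described above (a per-residue assignment from torsion angles, followed by a check that enough consecutive residues agree) can be illustrated with a minimal Python sketch. This is not the Deerfield algorithm [39], which is unpublished. The region boundaries in `classify_phi_psi`, the minimum run lengths in `MIN_RUN`, and all function names are illustrative assumptions, and the Omega and hydrogen bond checks are omitted. Only the torsion angle formula is standard.

```python
import math
from itertools import groupby

def dihedral(p0, p1, p2, p3):
    """Signed torsion angle (degrees) about the p1-p2 bond, IUPAC sign convention."""
    b1 = [b - a for a, b in zip(p0, p1)]  # p0 -> p1
    b2 = [b - a for a, b in zip(p1, p2)]  # p1 -> p2 (central bond)
    b3 = [b - a for a, b in zip(p2, p3)]  # p2 -> p3

    def cross(u, v):
        return [u[1] * v[2] - u[2] * v[1],
                u[2] * v[0] - u[0] * v[2],
                u[0] * v[1] - u[1] * v[0]]

    def dot(u, v):
        return sum(x * y for x, y in zip(u, v))

    n1, n2 = cross(b1, b2), cross(b2, b3)  # normals of the two planes
    y = dot(b2, cross(n1, n2))
    x = math.sqrt(dot(b2, b2)) * dot(n1, n2)
    return math.degrees(math.atan2(y, x))

def classify_phi_psi(phi, psi):
    """Stage 1: coarse per-residue label. Boundaries are ASSUMED, for illustration."""
    if -100 <= phi <= -30 and -80 <= psi <= -10:
        return "H"  # helix-like region
    if -180 <= phi <= -45 and (psi >= 90 or psi <= -150):
        return "E"  # extended (sheet-like) region
    return "C"      # everything else

# ASSUMED minimum number of consecutive residues needed to keep an element.
MIN_RUN = {"H": 4, "E": 2}

def filter_short_runs(labels, min_len=MIN_RUN):
    """Stage 2: demote runs that are too short to define an element to coil."""
    out = []
    for label, run in groupby(labels):
        n = len(list(run))
        keep = n >= min_len.get(label, 1)
        out.extend([label if keep else "C"] * n)
    return out
```

For phi, the four points passed to `dihedral` would be C(i-1), N(i), C(alpha)(i), C(i); for psi, N(i), C(alpha)(i), C(i), N(i+1). A real implementation would add the hydrogen bond test as a further filter on the runs that survive stage 2.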
The PSdb also records the author-assigned secondary structure as reported in the PDB file and the Phi, Psi definition as classified by Garnier [40], Scheraga [41 - 43], and Thornton [44]. The Phi, Psi definitions are provided for completeness and comparison. In addition, these definitions are available for use as the background for PSdbView Ramachandran plots (see below).
We used Connolly's algorithm [45] for computing the solvent accessible molecular surface area (SAMSA) using van der Waals radii (Table I) based upon the atom and its hybridization [46]. We computed the SAMSA for all heavy atoms in a single residue and for all of the side-chain atoms (in this context, side-chain atoms are all heavy atoms EXCEPT for N, C and O). We kept track of the contribution of the polar and nonpolar atoms for both the entire residue and the side-chain. The SAMSA is reported as either the absolute value or as a percentage of the theoretical maximum SAMSA for the specific residue. We computed a total buried area by subtracting the computed SAMSA for the side-chain from the theoretical maximum SAMSA for the side-chain. We systematically increased the contacts used in the computation of the SAMSA (Table II). The abbreviations used for this data in the PSdb are summarized in Table III.
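The derived quantities described above follow directly from the computed areas, as the short Python sketch below shows. The function name, the dictionary keys, and the numbers in the usage note are hypothetical. The surface areas themselves would come from a Connolly-type calculation, which is not reproduced here.

```python
def summarize_side_chain(polar_area, nonpolar_area, max_area):
    """Derive reported side-chain quantities from polar and nonpolar SAMSA (square Angstroms).

    max_area is the theoretical maximum side-chain SAMSA for the residue type.
    """
    exposed = polar_area + nonpolar_area       # absolute side-chain SAMSA
    return {
        "exposed": exposed,
        "percent": 100.0 * exposed / max_area,  # percentage of theoretical maximum
        "buried": max_area - exposed,           # total buried area
    }
```

For example, a side-chain with 10.0 square Angstroms of polar and 30.0 square Angstroms of nonpolar exposed area and a theoretical maximum of 100.0 would be reported as 40% exposed with 60.0 square Angstroms buried.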
PFracc was computed using Eisenberg's [47] algorithm (Eq 1). We computed Eisenberg's environmental descriptors, but because of differences in the algorithm used to compute the SAMSA (Eisenberg used the Lee and Richards algorithm) and different van der Waals radii (Table I), we reproduced his environmental descriptors in only about 95% of the tested cases. We have recently reported [48] a new approach to defining the environmental regions based upon the theoretical features in the environmental plot. The region boundaries (AMH Classification, Figure 1a) are defined as a series of radial lines starting at the upper left hand corner of the environment plot and a series of arcs representing the distance from the upper left hand corner. The upper left hand corner of the environment plot represents a fully exposed residue side-chain. It should be noted that Eisenberg used Method #1 (Table II) for computing his environment classification, whereas we feel that Method #4 (Table II) is potentially more appropriate. A comparison of the Eisenberg and AMH environmental classifications is illustrated in Figure 1.
We have had a long standing interest in the interaction of divalent metal ions with peptides and proteins [49 - 55]. Indeed, one of the driving forces in the development of the PSdb is to understand the role of metal ions in the stabilization of specific tertiary structures and the propensities of the various side-chains to interact with specific metal ions [56].
All ligands in the first coordination sphere of each metal ion are identified. These include ligands from the protein (e.g., backbone oxygen and side-chain groups), non-protein ligands, and the water of hydration about the metal ion. The water of hydration is treated as a member of this site because of its slow exchange rate and was used as a part of the metal ion for the surface area computations (Method #4, Table II). The distance criteria are a function of the metal ion and ligand. For example, Zn(II)-O distances are required to be less than 3.0 Angstroms while Zn(II)-S distances are required to be less than 4.0 Angstroms.
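A distance-based search of this kind can be sketched in Python as follows. Only the two Zn(II) cutoffs quoted above come from the text. The table layout, the function name, the atom tuple format, and the decision to skip metal/ligand pairs that have no table entry are assumptions made for illustration.

```python
import math

# Cutoffs in Angstroms, keyed by (metal element, ligand atom element).
# Only these two values are given in the text; other pairs would need their own entries.
CUTOFFS = {("ZN", "O"): 3.0, ("ZN", "S"): 4.0}

def first_shell(metal_element, metal_xyz, atoms):
    """Return the names of atoms inside the first coordination sphere.

    atoms is an iterable of (name, element, (x, y, z)) tuples (hypothetical format).
    Pairs with no entry in CUTOFFS are skipped rather than guessed.
    """
    shell = []
    for name, element, xyz in atoms:
        cutoff = CUTOFFS.get((metal_element, element))
        if cutoff is not None and math.dist(metal_xyz, xyz) < cutoff:
            shell.append(name)
    return shell
```

Keeping the cutoffs in a lookup table keyed by the metal/ligand pair mirrors the statement that the distance criteria depend on both partners, and makes it easy to add further pairs once their criteria are known.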
All water molecules were checked for possible hydrogen bond interactions
with the protein and the ligands about the protein. A distance criterion was
used: the heavy atom-heavy atom distance was required to be less than
3.5 Angstroms. We did not use an angle dependence in this assignment;
one will be included in future releases. Waters with three or more
contacts to the protein and ligands were treated as structural waters
and were included in the surface computations (Method #5,
Table II).
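The structural water test described above reduces to counting heavy atoms within a cutoff of the water oxygen. The Python sketch below uses the 3.5 Angstrom distance and the three-contact threshold from the text; the function names and the coordinate format are hypothetical, and, as in the text, no angle criterion is applied.

```python
import math

def count_contacts(water_xyz, heavy_atoms, cutoff=3.5):
    """Count protein/ligand heavy atoms closer than cutoff (Angstroms) to a water oxygen."""
    return sum(1 for xyz in heavy_atoms if math.dist(water_xyz, xyz) < cutoff)

def is_structural_water(water_xyz, heavy_atoms, cutoff=3.5, min_contacts=3):
    """A water with at least min_contacts contacts is treated as structural."""
    return count_contacts(water_xyz, heavy_atoms, cutoff) >= min_contacts
```

In practice `heavy_atoms` would be restricted to potential hydrogen bond donors and acceptors, and a spatial grid or tree would replace the linear scan for large structures.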
The PSdb Viewer
In order to visually inspect and investigate the entries of the PSdb,
we provide PSdbView, a tool for visualizing individual PSdb entries.
The tool is written in
Java, thus providing a platform
independent viewer as well as a means of viewing the PSdb over the Internet.
PSdbView consists of a number of screens, each providing a visual overview of a specific set of PSdb data. These screens are summarized below.
Sequence bars for given categories can be added to or deleted from the screen, giving the user the capability to view only those data categories of interest.
The sequence bars themselves are categorized within groups corresponding to the residue based data in the PSdb. Categories include Residue Charge and Polarity, Disulfide Bonds, Amide/Chiral (e.g., whether the backbone amide bond (Omega) is cis or trans, and whether C(alpha) is D or L), Secondary Structure and Phi/Psi Classifications, Solvent Accessible Molecular Surface Data (Table III), Metal Ion Interactions, and Structural Water Molecules Bound.
Surface area plots are currently available for exposed surface area values, buried surface area values, and polar fraction values.
Visualization beyond the two-dimensional representation of proteins is essential. We will therefore extend PSdbView to work in conjunction with molecular graphics packages, enabling the color-coded sequence bar data to be mapped onto the displayed 3D structure. This integration will also allow for the exploration of alternative methods for representing the results of database queries.
Lastly, we will extend
PSC's efforts in
context sequence analysis to include the
primary, secondary, and tertiary structure information
contained within the PSdb.
The alignment of sequences has traditionally been based upon
pairwise alignments using gross structural or evolutionary measures.
Context sequence alignment has only recently been used successfully.
From the multiple sequence alignment of homologous
proteins, one can determine areas of high mutability. If one of these
sequences is in the PSdb, then the relationship between structure and
mutability can be examined.
Acknowledgements
This work was funded by a grant from the NIH-NCRR (1 P41 RR06009). We
gratefully acknowledge help during the early stages of this work by Catherine
P. Milligan, Joseph C. Lappa, Alexander J. Ropelewski and
Amanda M. Holland-Minkley.
We would also like to thank Hugh B. Nicholas Jr. for many helpful discussions.
References
[1] Murzin et al., J. Mol. Biol. 1995, 247, 536-540.
[10] Kuntz, I.D. (1972) Protein Folding. J. Am. Chem. Soc., 94, pp. 4009-4012.
[14] Chou, P.Y. and Fasman, G.D. (1977) β-Turns in Proteins. J. Mol. Biol., 115, pp. 135-175.
[17] Hohne and Kretschmer (1985) Stud. Biophys., 108, 165-186.
[22] Namboodiri, K., Pattabiraman, N., Lowrey, A. and Gaber, B.P. (1988) J. Mol. Graphics, 6, 211-212.
[26] Morffen, A.J., Rodd, S.J.P., and Snelgrove, M. (1983) J. Comput. Chem., 7, 9-16.
[28] Morffew, A.J. and Todd, S.J.P. (1986) Computers Chem., 10, 9-14.
[31] Morffew, A.J., Todd, S.J.P. and Snelgrove, M.J. (1983) Computers Chem., 7, 9-16.
[32] Pabo, C. (1987) Nature, 327, 467.
[33] Glen, R.C. and Rose, V.S. (1987) J. Mol. Graphics, 5, 79-86.
[34] Frey, P.M.D., Paton, N.W., Kemp, G.J.L., Fothergill, J.E. (1990) Protein Engr., 3, 235-243.
[35] Pongor, S. (1988) Nature, 332, 24.
[38] Presta, L.G. and Rose, G.D. (1988) Helix Signals in Proteins. Science, 240, pp. 1632-1641.
[39] Deerfield, D.W., II, manuscript in preparation.
[42] Venkatachalam, C.M. (1968) Biopolymers, 6, 1425-1436.
[43] Lewis, P.N., Momany, F.A. and Scheraga, H.A. (1973) Biochim. Biophys. Acta, 303, 211-229.
[45] Connolly, M.L. (1983) Analytical Molecular Surface Calculation. J. Appl. Cryst., 16, pp. 548-558.
[46] Francl, M.M., Hunt, R.F., Jr., Hehre, W.J. (1984) J. Am. Chem. Soc., 106, pp. 563-570.