DATABASE AND INTERFACE FOR 3-DIMENSIONAL MOLECULAR STRUCTURE VISUALIZATION AND ANALYSIS
5 CROSS-REFERENCE TO RELATED APPLICATION
For purposes of International patent applications, this application claims benefit
of priority from U.S. Non-provisional patent application Serial No. 09/272,814 entitled
"Database and Interface for 3 -Dimensional Molecular Structure Visualization and
Analysis" filed March 19, 1999. Forpurposes ofanyU.S. application, priority is claimed
0 from U.S. Non-provisional patent application Serial No. 09/272,814 entitled "Database
and Interface for 3-Dimensional Molecular Structure Visualization and Analysis" filed
March 19, 1999. Where permitted, this application incorporates by reference the
referenced pending U.S. patent application in its entirety for all purposes.
5 BACKGROUND OF THE INVENTION
1. Field of the Invention
This invention relates generally to database access and, more particularly, to
interfaces that control access to collections of data.
2. Structure-Based Drug Design
o Recent advances in molecular biology, such as the discovery and identification
of large numbers of novel gene sequences encoded in the genomes of humans, other
mammals and infectious disease agents, have contributed to the identification of a large
number of target proteins and other biological macromolecules. Within the decade, up
to 150,000 fully sequenced human genes and an estimated 400,000 genes from other
species will be available.
Based on information derived from these sequences, as well as from experimental
methods, such as X-ray crystallography or NMR, or protein structure determination
5 methodologies, such as homology modeling or ab initio structure prediction, three-
dimensional (3-D) molecular structures of enzymes, ligands or target receptors, such as
protein or other macromolecular receptors, are being determined at increasing rates. 3-D
molecular structure is known to be related to biological function. By employing
structure-based drug discovery methodologies, including structure-directed combinatorial
0 or molecular diversity screening and computational screening using molecular similarity
or computational docking algorithms, it is possible to derive knowledge of 3-D structure
of biomolecules, which is useful, for example, in facilitating the identification of
pharmacophores and the design of biologically-active compounds, such as small
molecule agonists or antagonists.
5 As the number of available biomolecular structures increases, there is an
increasing need for capabilities, such as databases, for organizing and providing access
to the structures to make them available for use in structure-based discovery and design.
3. 3-Dimensional Molecular Structures
3-D molecular structures of proteins or other molecules are represented as sets of
o atomic coordinates that describe the spatial arrangement and intramolecular connectivity
of the atoms in the molecule. For example, a standard format of representing
macromolecular structural data is specified by the Protein Data Bank (PDB) format. The
PDB format organizes molecular structure data by representing atoms according to their
atom types and atomic coordinates. In addition to atomic coordinate data, a PDB format
data file can include information on structural attributes or reactivity of the molecule,
such as active sites or secondary structure attributes. The molecular structure data in a
PDB data file is provided as a simple alphanumeric representation of the atomic
coordinates for a single protein or other macromolecule. Thus, it is stored as a text file
using standard ASCII display codes.
The PDB is a depository of molecular structure data for biological
macromolecules, that is, researchers are invited to add molecular structure data in the
PDB format to the collection currently maintained at the Brookhaven National
Laboratory (Bernstein et al, "The protein data bank: a computer-based archival file for
macromolecule structures", J. Mol. Biol., 112, 535-542 (1977)). To facilitate maximum
accessibility to the data contained in the PDB, the data files are made available over the
Internet and can be retrieved and viewed by a user. Unfortunately, the PDB collection
of data files is simply an archive comprising a series of molecular data files stored
serially, as the files are deposited by researchers. That is, the PDB data files are flat files.
Thus, there is no logical structure or interrelationship among the data files or between the
records of different files. If a user is interested in viewing the atomic structure of a
particular protein, for example, then the user must conduct a text-based search of the
alphanumeric data in each file, one after the other, until the desired protein name or
structure, or other associated information, is located. Text searching through the
molecular data files at a depository such as the PDB might not effectively identify the
desired molecular structural information. Thus, it would be advantageous if molecular
structure data files could be conformed to a logical database design, to permit more
convenient and efficient searching for data records.
The coordinates of molecular structures can be read or downloaded into a
molecular graphics program for visualization and manipulation and structural analysis.
For example, a researcher can textually search the PDB data file to locate a protein of
interest, and then can import the located file into a viewer program to obtain a visual
representation of the protein. Several such graphics packages are known and are readily
available commercially. Molecular graphics programs read in atomic coordinate data,
such as is contained in a PDB-format file, and perform calculations to construct a three-
dimensional representation of the molecule. Additionally, data beyond simple molecular
structure, such as molecular shape analysis, energetic or strain analysis, active or reactive
sites, variants or properties such as electrostatic potential or hydrophobicity, among
others, may be associated with a given molecular structure. It would also be desirable
to provide access to this information, as well. Currently, many of these data files are
generated and must be viewed using different programs. To date, there is no software
package available that integrates a molecular graphics program with a relational database
to permit navigation among related molecular structures stored in a database and
visualization analysis of molecular structures and their related properties within a single
user interface. Thus, it would be further advantageous to integrate a molecular structure
database with molecular visualization and analysis tools, as navigating among the
different databases and viewing programs can be time consuming and can be confusing,
as each program may have a different scale or look and feel.
4. Proprietary Data Issues
Recently, techniques such as homology modeling and other structure prediction
methodologies have been used to generate models of enzymes, ligands or target
receptors, such as protein or other macromolecular receptors, structures whose structures
have not yet been determined experimentally. The databases described herein can
provide access to these structures to researchers and others, such as clinicians or
educators. The databases can be stored on large networks, such as the Internet, which
provide a convenient means of data dissemination, but are most well-known for
providing unrestricted access to data uploaded to Internet servers.
In some cases, however, it might be desirable for the proprietor of a molecular
structure database to carefully control access to their respective collections of molecular
structure data files. For example, such data files may be obtained as a result of
proprietary structure prediction algorithms or methods. In such cases, access to such
information might be limited to only certain structures or families of structures.
Thus, it would be additionally advantageous to provide an integrated system to
permit controlled access to a database of molecular structures and related properties
within a single user interface.
SUMMARY OF THE INVENTION
Described herein is a database and interface for access to 3-D molecular structures
and associated properties, which can be used to facilitate the design of potential new
therapeutics. The interface also provides access to other structure-based drug discovery
tools and to other databases, such as databases of chemical structures, including fine
chemical or combinatorial libraries, for use in structure-focused high-throughput
screening, as well as to a host of public domain databases and bioinformatics sites.
The invention provides a relational database that collects multiple data files
relating to the same molecular structure in the same subdirectory, and provides an
interface to access all of the collected files from the same structure using the same user
interface program. The collected files can comprise a variety of information and
computer file formats, depending on the type of information to be conveyed to users of
the database. In accordance with the invention, a user communicates over a public shared
network, such as the Internet, or over a controlled private network, such as an intranet,
with a secure server that controls access to the collected files, and the interface to the
collected files is provided by a graphical user interface. In this way, the invention
provides a convenient means of searching molecular structure data for characteristics of
interest. The invention also permits data searching, file viewing, and investigation of
multiple representations of molecular structures from within a single viewing program.
In one aspect of the invention, the data files are made available over a wide area
shared network, such as the Internet, and the graphical user interface (GUI) used for
viewing the data files is a standard Internet web browser program, such as the web
- 1-
browser products by Netscape Communications, Inc. and Microsoft Corporation. Such
browser products readily import and provide views of files having a wide variety of
formats that contain alphanumeric, video, and audio data. In another aspect of the
invention, the GUI is provided with a platform independent programming environment
using client applets, such as Enterprise Java Beans (EJB). A security server is preferably
located between the user browser program at a network client machine controls access
to the database, which is housed at a file server connected to the security server. Before
a user gains access to the database, the security server checks authorization for the
individual user and then, if appropriate, permits downloading of appropriate data from
the database file server.
In another aspect of the invention, data for a molecular structure is loaded into the
database by specifying the file pathnames for the various data files that contain the
different types of data, including the different molecule views. Using a browser to view
the data files permits various helper applications, called plug-ins, to smoothly and
transparently accept the different file formats and provide views to the user. The various
data files of the database are organized in accordance with the database design when they
are loaded into the database and are managed by a relational database management
program.
Other features and advantages of the present invention should be apparent from
the following description of the preferred embodiment, which illustrates, by way of
example, the principles of the invention.
BRIEF DESCRIPTION OF THE DRAWINGS
Figure 1 is a block diagram of a network system constructed in accordance with
the present invention.
Figure 2 is a block diagram showing the primary functional components of the
5 database server illustrated in Figure 1.
Figure 3 is a block diagram showing the primary hardware components of the
database server and security server machines illustrated in Figure 1.
Figure 4 is a representation of a Login screen shown to a user at a client machine
display of the network system illustrated in Figure 1.
l o Figure 5 is a representation of a selection screen shown to a user following proper
login through the display illustrated in Figure 4.
Figure 6 is a representation of a selection screen shown to a user following
selection of a database option from the display illustrated in Figure 5.
Figure 7A and Figure 7B show a representation of a protein families display
5 screen showing an index of the protein families available for selection from the database
server illustrated in Figure 1.
Figure 8 is a representation of a query submission screen shown to a user
following selection of the query option from the display illustrated in Figure 6.
Figure 9A and Figure 9B show a protein database listing generated in response
0 to a query submitted from the display illustrated in Figure 8.
Figure 10 is a representation of a query submission screen for a search based on
a particular disease state.
Figure 11 is a representation of a protein information screen such as might be
generated in response to a query or in response to selection of a protein from the protein
name display illustrated in Figure 9.
Figure 12A, 12B, 12C, 12D, 12E, and 12F show protein structural data as stored
in a PDB-format data file, comprising one of the data files stored in the database server
illustrated in Figure 1.
Figure 13 is a representation of a protein Visualization Toolkit screen for a
protein selected from the database server illustrated in Figure 1.
Figure 14 is a representation of a viewing screen displaying a 3 -dimensional view
of a protein structure selected from the database server illustrated in Figure 1.
Figure 15 is a representation of the functionality contained in the Measure menu
showing a graphical display of the interatomic distance between two atoms of the protein
molecule shown in Figure 14.
Figure 16 is a representation showing a graphical display of the angle between
three atoms in the protein molecule shown in Figure 14.
Figure 17 is a representation showing a graphical display of the dihedral angle
between four atoms in the protein molecule shown in Figure 14.
Figure 18 is a representation showing a graphical display of the atomic
coordinates for an atom in the protein molecule shown in Figure 14.
Figure 19 is a representation of the Sequence Viewer window showing the amino
acid sequence of the protein molecule shown in Figure 14.
Figure 20 is a representation of the Sequence Alignment window showing the
alignment of the amino acid sequence of the protein molecule shown in Figure 14 with
a template sequence.
Figure 21 is a representation of the Secondary Structure prediction window
showing the amino acid sequence of the protein molecule shown in Figure 14 and the
predicted secondary structural features of the protein.
Figure 22 is a representation of the Visualization menu selections for the protein
molecule shown in Figure 14.
Figure 23 is a representation of the Secondary Structure Ribbon menu selections
for the protein molecule shown in Figure 14.
Figure 24 is a representation of the Quality menu selections and shows a
Ramachandran plot for the protein molecule shown in Figure 14.
Figure 25 is a representation of the Quality menu selections and shows a
Balasubramanian plot for the protein molecule shown in Figure 14.
Figure 26 is a representation of the Quality menu selections and illustrates the
Ellipsoid functionality for the protein molecule shown in Figure 14.
Figure 27 is a representation of the Surface Hydrophobicity menu selection for
the protein molecule shown in Figure 14.
Figure 28 is a representation of the Strain Plot menu selection for the protein
molecule shown in Figure 14.
Figure 29 is a representation of the Profile Analysis menu selection for the protein
molecule shown in Figure 14.
Figure 30 is a representation of the Align functionality showing a list of proteins
to select for superimposing on the protein molecule shown in Figure 14 and a window
for displaying the superposition of multiple proteins.
Figure 31 A and Figure 3 IB show a display screen for entering data about a
protein into the database server illustrated in Figure 1.
Figure 32 is a flow diagram representation of operations performed when a user
accesses the protein database of the system illustrated in Figure 1.
Figure 33 is a flow diagram representation of the operations performed during the
security checking operation box of Figure 25.
Figure 34 is a representation of the database schema used by the relational
database management system illustrated in Figure 1.
Figure 35 shows the object classes that produce the screen displays from the
browser program at a user terminal illustrated in Figure 1.
Figure 36 is a block diagram representation of an alternative network system
constructed in accordance with the present invention, using a distributed architecture and
"Enterprise Java Beans" components.
Figure 37 is a block diagram representation of the "Enterprise Java Beans"
components in the Figure 36 system.
Figure 38 is a representation of an Application screen shown to a user at a client
machine display of the network system illustrated in Figure 36, with a Family Tree panel
shown at the left side of the application window.
Figure 39 is a representation of the Application screen shown to a user at a client
machine of the network system illustrated in Figure 36, illustrating a Search panel shown
in the application window.
Figure 40 is a representation of the Application screen shown to a user at a client
machine of the network system illustrated in Figure 36, illustrating a Data Mining panel
shown at the left side of the application window with a Structure visualization panel at
the right side of the application window.
Figure 41 is a representation of the Application screen shown to a user at a client
machine of the network system illustrated in Figure 36, illustrating simultaneous Quality
display features for two selected proteins in the Visualization panel of the application
window.
Figure 42 is a representation of the Application screen shown to a user at a client
machine of the network system illustrated in Figure 36, illustrating simultaneous
Structure display features for two selected proteins in the Visualization panel of the
application window.
Figure 43 is a representation of the Application screen of Figure 42, showing the
drop down menu for selection of Display features.
Figure 44 is a representation of the Application screen of Figure 42, showing the
drop down menu for selection of Options for atoms to be viewed.
Figure 45 is a representation of the Application screen shown to a user at a client
machine of the network system illustrated in Figure 36, illustrating simultaneous
Sequence display features for two selected proteins in the Visualization panel of the
application window.
Figure 46 is a representation of the database objects for the database design of the
system illustrated in Figure 36.
5
DESCRIPTION OF THE PREFERRED EMBODIMENTS
The invention will be better understood with reference to the attached drawings,
in which like reference numerals refer to like objects. For purposes of illustration, the
system is described with respect to a database of protein structures. It should be
o understood that the system could also be adapted for use with databases of 3-D structures
of other biological molecules or macromolecules, such as DNA, RNA or carbohydrates.
1. System Components
Figure 1 is a block diagram of a network system 100 constructed in accordance
with the present invention. The system includes a database server 102 that communicates
5 over a high speed communications line 104 with a security server 106. Multiple user
client machines 108, 110, 112 are shown in communication with the security server 106
over high speed network communications lines 114, 116, 118, respectively, to gain access
to data stored at the database server 102. The database server stores protein data in a
relational protein database that can be searched by protein family and by a variety of
o protein characteristics, so that files relating to the same protein are stored in the same
subdirectory. The files comprise a wide variety of protein data, and can include PDB-
format files, text files, graphic image files, virtual reality files, video files, and audio files.
Each of the client machines 108, 110, 112 comprise a network terminal that permits a
user to gain access and retrieve data using a resident browser on their respective client
machine. The browsers provide a graphical user interface to access all of the collected
database files and view the data without changing the interface.
In accordance with the invention, a user at one of the terminals 108, 110, 112
communicates over a public network connection 114, 116, 118 after receiving
authorization from the security file server 106, which controls access to the collected
files. Access to the security server and collected files at the database server 102 is
provided through the graphical user interface of the conventional browser program that
is widely available, such as the Communicator or Navigator browser product from
Netscape Communications Corp. or the Internet Explorer browser product from
Microsoft Corporation. In this way, the invention provides a convenient means of
searching among protein families for proteins with characteristics of interest. The
invention also permits data searching, file viewing, and investigation of multiple views
of proteins from within a single viewing program.
Each of the network terminals 108, 110, 112 comprise a computing platform that
typically includes a central processor unit (CPU), such as the Pentium-class of
microprocessor chips manufactured by Intel Corporation or the CPU microprocessors
manufactured by Silicon Graphics, Inc. The terminals also include a video display unit,
such as a video monitor or other display screen, and a network interface. In the preferred
embodiment, each network terminal also includes another means of accessing data on
storage media, such as a floppy disk drive or a CD-ROM or DVD-ROM drive. Because
each network terminal 108, 110, 112 communicates with the security server 106 using
the same type of graphical user interface, access to the security server does not depend
on the operating system of each terminal being the same. For example, each of the
illustrated terminals 108, HO, 112 may use a different operating system. Thus, the first
client machine 108 may function using the "Windows 95" operating system program
from Microsoft Corporation. The second client machine 110 may function using the
Unix operating system, and the third client machine 112 may operate using the Apple
Macintosh operating system. Access to the security server will be transparent to the
users at these client machines 108, 110, 112 as long as the browser at each terminal is
operable with the respective client operating system.
As noted above, the client network terminals 108, 110, 112 communicate through
a web browser interface to the security server 106 and then gain access to the database
at the database server 102. Figure 2 is a block diagram of the database server 102. The
database server is a computing machine that supports a relational database system 202,
such as the systems manufactured by Oracle Development Corporation, which accesses
the collection of data files 204 comprising the protein database of the system 100. That
is, the relational database system 202 provides access, security control, and search and
retrieval services for the database files 204 that are stored at the database server 102. The
database files may include a wide variety of file types, including alphanumeric text files,
graphic image files, video files, and audio files.
The database server 102 also includes a web application server 206, which
interfaces with the relational data base system 202 to retrieve and, if necessary, process
the data files 204. The web application server may include programs known as
cartridges. Those skilled in the art will be familiar with cartridges and with the tasks they
can perform, and will understand that cartridges specific to particular database systems
can be readily developed by users of the database system 202. As described further
5 below, for example, cartridges are used to control access to the database, so that data
subscribers of different authorizations are provided with different access levels to the
files, or will be granted access with different exceptions. The database server 102 also
includes a controller 208, which provides the operating system at the server. The
operating system in the preferred embodiment supports various program types, including
l o those written in the "Java" programming language developed by Sun Microsystems, Inc
of Palo Alto, California, USA.
Figure 3 is a block diagram of the primary hardware components of the computers
that comprise both the database server 102 and the security server 106. It should be
understood that the functions of the database server and the security server can be
15 implemented on the same computer, if desired. In the preferred embodiment, the
functions are performed at separate computers to facilitate database maintenance at the
database server and increase the rate at which new data can be uploaded and made
available at the database. In addition, the client network terminals 108, 110, 112 may
each have a construction similar to that shown in Figure 3. As noted above, each client
0 machine 108, 110, 112 may function with a different operating system, as may the
servers 102, 106, the only requirement being that they can communicate with each other
over the network connections 104, 114, 116, 118. The exemplary computer 300
includes a CPU 302 that provides substantial computing power to successfully handle a
large volume of user traffic and to manage transfer of large amounts of data. For
example, when a user requests the database files for a single protein, the amount of data
transferred from the database server 102 can comprise more than 2 megabytes (MB) of
data. The computer 300 also includes memory 304, which typically includes 64 MB or
more of fast random access memory (RAM). The computer will also include a network
interface 306 to permit high speed communications with anetwork 308, such as provided
by direct connection with Tl data lines, a frame relay connection, or other high
bandwidth digital data line. The computer also will include an external storage device
interface 310 to permit transfer of data, including programming, from external storage
media 312, such as floppy disks, magnetic tape, CD-ROM, and DVD-ROM storage
media. Internal data storage, of persistent data too large to fit within the memory 304,
may be stored in disk storage 314. Presently, readily available disk storage provides
space of six to eight gigabytes (GB).
A user will initiate action and input data from a combination of input devices 320,
such as a keyboard and display mouse, and will view data on a display 322 such as a
video monitor. Typically, the database server 102 and security server 106 will be more
powerful machines than the network terminals 108, 110, 112, because the servers must
potentially handle a large amount of communications traffic. The network terminals
must be able to communicate with the security server and receive large amounts of data,
but the required speed of such data transfer is to be determined by the user preference.
It has been found, however, that a network connection of greater than 56K/sec data
transfer is most desired, and slower terminal-to-security server communications will not
be satisfactory to most users.
2. Using the System to Retrieve Data
The first step in gaining access to the stored database and the data files contained
therein is to obtain access to the database server 102. To gain access to the server, a user
must first receive access through the security server 106 by standard conventional
communication techniques. For example, the Internet protocol (IP) address of the
security server must be known, so that a user 108, 110, 112 can communicate with the
server 106 through the user's browser program. Upon communicating with the security
server 106, the user is presented with the first login screen at the browser interface.
Figure 4 is a representation of a login screen 402 that is shown to a user 108, 110, 112
(Figure 1) at a client machine display unit, viewing through the user's conventional
browser window. The login screen identifies the database to be accessed, and provides
a user with a "login" request button 404 to be selected (clicked) using the client machine
display mouse, keyboard, or other window designation device. No program response will
occur until the login button is selected.
After a user clicks on the login button 404, the security server will provide the
user's browser with a display screen that shows a list of database selection features, or
utilities, from which one must be chosen. Figure 5 is a representation of the information
box screen 502 that is shown to a user following selection of the login button in the
screen display illustrated in Figure 4. In the Figure 5 representation, two of the illustrated
database selections are shown for variety, but do not form any part of the invention; these
are the selections for "SVdBase" 504 and the "CombiLib" 506. Other permitted database
selections may be included. The other two database selections shown in Figure 5, for
"SBdBase" 508 and for "SBdBasePlus" 510, are used to select access to the unique
relational database of protein structural data provided in accordance with the present
invention. The two choices 508, 510 select different levels of access to the database files.
The different levels of access may be, for example, to different protein families, so that
certain protein families may be made available only at a higher fee. The different levels
of access also may restrict access to files within a given protein family, so that certain
types of data may be made available only at a higher fee.
After a user selects one of the database access levels in Figure 5, the security
server will present the user with a database selection screen. Figure 6 is a representation
of the database selection screen 602 shown to a user following selection of the database
option from the display illustrated in Figure 5, and shows that a user can select between
an index view display button 604 and a query display button 606. If the index view
button 604 is selected by a user, then the security server will return a list of protein
families that are available for viewing. In the preferred embodiment, all protein families
are shown in the index view, according to protein family name, even if the user does not
have authorization to view them. In this way, users will be kept informed of the full
database that is potentially available, including recently added data for which they may
not have access. Thus, users may decide to upgrade their level of access upon viewing
the complete list and determining the available proteins to which they do not currently
have access rights. Figure 7A and Figure 7B comprise a display screen list 702 that
shows an exemplary index listing of the available proteins displayed by selecting the
index button 604. If the user selects the query display button 606 from the Figure 6
display, then the security server will return a query screen so the user can designate a
query on the database.
5 Figure 8 is a representation of a database query display screen 802 shown to a
user following selection of query submission from the display screen illustrated in Figure
6. The query screen 802 shows a display area 804 that lists query fields over which
searching can be performed. In the preferred embodiment, these fields include a database
identification (ID) number, protein name, species, gene, conventional "SwissProt"
0 accession number, disease state, protein function, protein family, and full-text search
string. After the user has determined the search query, the user selects the "Submit
Query" display button 806 and the user's browser sends the query information to the
security server.
Figure 9A and Figure 9B illustrate a query results display screen 902 that is
5 shown to a user following selection of the query option from the display illustrated in
Figure 8, where a user has selected the "Protein Name" field for search. Thus, the Figure
9 A and 9B display shows a list of protein names, corresponding to the names of proteins
available in the database at the database server 102 (Figure 1). That is, the left-most
column 904 in Figure 9A, 9B lists the protein name, the next column 906 shows the
o corresponding database ID number, the next column 908 contains an indication (where
applicable) of whether PDB data is available, and the last column 910 contains an
indication (where applicable) of whether homology data is available for the named
protein.
Figure 10 is a representation of the query display screen first illustrated in Figure
8, except that the Figure 10 display screen 1002 shows that "cancer" has been inserted
5 into the "Disease" state field 1004 so that a query is executed that will search for proteins
related to cancer. More particularly, when the "Submit Query" display button 1006 is
selected in Figure 10, the security server will receive the "cancer" disease query and will
cause a search to be executed over the database files by the database server 202 (Figure
2). The search results will be returned by the security server to the user's browser.
o Figure 11 is a representation of a protein information screen 1102, such as might
be generated in response to a query or in response to selection of a protein from the
protein name display illustrated in Figure 9. The Figure 11 screen shows that, in the
preferred embodiment, the protein information includes the database ID number 1104,
the complete protein name 1106, the species 1108 for the protein data file, the gene type
5 1110, predetermined database keywords 1112 for the protein, disease information 1114,
function information 1116, protein family class 1118, the SwissProt accession number
1120, and the EC number 1122. These data contained in these fields will be familiar or
readily understood to those skilled in the art. A row of display selection buttons below
the information fields of Figure 11 show that a user may select to retrieve the
o corresponding data file for the protein from the PDB-format database 1130, or may select
the SwissProt data 1132 for the protein, or may select "GenBank" data 1134, or may
select "PIR" data 1136 for the protein. Of these alternatives, it should be noted that only
the PDB-format database contains multiple data types, other than protein sequence data.
That is, the SwissProt, GenBank, and PIR databases contain only sequence data and not
molecular structure data. The user may decide to begin another query by selecting the
"New Query" display button 1138.
5 3. The Protein Visual Display Tools
As noted above, protein structure data is alphanumeric data that can be searched
using text strings, but is not meaningful to researchers without the aid of viewing
programs. Figure 12A, 12B, 12C, 12D, 12E, 12F (collectively referred to as "Figure 12")
is a representation of protein structure data as stored in a PDB-format data file,
o comprising one of the data files stored in the database server illustrated in Figure 2. The
standard PDB format of the protein data shown in Figure 12 will be familiar to those
skilled in the art. It should be apparent that such data is tedious to review and is not
especially revealing of protein structures and characteristics. The database system of the
present invention organizes such data into a coherent database representation and permits
5 a visual interface to such data through a conventional Internet web browser interface.
Figure 13 is a representation of a protein "Visualization Toolkit" display screen
1302 for a protein selected from the database server 202 (Figure 2) using the interface of
the present invention. In response to the user selecting a protein name from the
"database" button of Figure 11, the security server causes the visualization toolkit
o interface program to be launched and displayed within the user's browser, and returns the
Figure 13 display page as the opening page of the visualization toolkit interface program.
Thus, a user will always be provided with the display of Figure 11 after selecting a
protein through either the protein name displays or from executing a query, so that
selection of the visualization toolkit may be made for the desired protein.
Figure 14 is a representation of a viewing screen displaying a 3-dimensional view
of a protein structure selected from the database server illustrated in Figure 1. When a
structure is read into the viewer, the atoms are displayed and can be colored according
to atom type. The structure can be manipulated within the 3-D graphics window using
the computer mouse to control its movement. Across the top of the screen are six
pulldown menu selections under which are commands for structure visualization and
analysis. In some cases, commands can only be executed if precalculated information
is stored in data files associated with a particular protein in the database. If such
information is not available for a given protein, the command that requires the
information will be blanked out and will not be accessible for that protein. Shown at the
right side of the screen is the atom identifier information for any atom selected in the
structure (e.g. K 224, CA), as well as commands for clearing any screen annotations,
downloading coordinates for additional structures, for accessing the database index and
for submitting a new database query.
Figure 15 is a representation of the functionality contained in the Measure menu.
The figure shows a graphical display of the interatomic distance between two atoms in
the protein, which is created by selecting the Distance command and picking the two
atoms using the cursor and mouse. The Angle (Figure 16) and Dihedral (Figure 17)
commands calculate and display angle values when the user picks either three or four
consecutive or non-consecutive atoms in the structure. The Coordinate (Figure 18)
command displays the atomic coordinates for a selected atom.
The View pulldown menu includes functionality for displaying selective atoms
in the protein, for example, all atoms or only main chain or alpha carbon atoms.
The Sequence menu includes commands for amino acid sequence analysis. The
Sequence command brings up a separate window in which the protein sequence is
displayed according to single letter amino acid codes and colored according to residue
type. Figure 19 shows a representation of the Sequence Viewer window showing the
amino acid sequence of the protein molecule shown in Figure 14. The sequence and
structure windows are interactive in that placement of the cursor on an amino acid code
in the sequence highlights the position of that amino acid in the structure. The sequence
window is independent of the main molecular structure window in that it can be moved,
resized and closed as a separate frame. These capabilities are due to functionality in the
"Java" language and are applicable among all windows present in the system.
The Alignment functionality (Figure 20) shows the alignment of the protein
molecule with one or more template sequences. The sequence alignment is calculated
using any one of a number of sequence alignment algorithms known to those of skill in
the art, for example, programs such as MSA (Carrillo and Lipman, "The multiple
sequence alignment problem in Biology", SIAM J. Appl. Math. 48, 1073-1082 (1988);
Altschul and Lipman, "Trees, stars and multiple biological sequence alignment", SIAM
J. Appl. Math., 49, 197-209 (1989); Altschul, "Gap costs for multiple sequence
alignment", J. Theor. Biol., 138, 297-309 (1989); Altschul et al, "Weights for data
related by a tree", J. Molec. Biol, 207, 647-653 (1989); Altschul, "Leaf pairs and tree
dissections", SIAM J. Discrete Math., 2, 293-299 (1989); Lipman et al, "A tool for
multiple sequence alignment", Proc. Natl. Acad. Sci. USA, 86, 4412-4415 (1989)) or
ClustalW (Higgins et al, CABIOS, 8, 189-191 (1991)), which are available in the public
domain, can be used.
The Secondary Structure command (Figure 21) opens an additional window in
which is displayed information about the predicted secondary structure of the protein, for
example, whether a particular residue is involved in a helix, coil or sheet. This
information is precalculated and stored in the database, for example, by using a publicly
available algorithm for calculating secondary structure, such as SSPAL (Salamov and
Solovyev, "Prediction of protein secondary structure by combining nearest-neighbor
algorithms and multiple sequence alignments", J. Mol. Biol. 247, 11-15 (1995)) or can
be calculated interactively.
Figure 22 is a representation of the Visualization menu selections for the protein
molecule shown in Figure 14. This menu contains functionality for visualizing different
structural attributes of the displayed molecule. The Hydrophobicity command
interactively colors the residues of the protein according to a color scheme based on a
predetermined hydrophobicity scale for the individual amino acid residues. This allows
identification of hydrophobic and hydrophilic regions in the protein structure. The
Active Sites, Glycosylation, Phosphorylation and Natural Variants commands display
these sites in a protein molecule, if known, in accordance with data which has been
previously determined and stored in association with the particular protein.
The Secondary Structure Ribbon command (Figure 23) opens a window in which
is displayed a solid ribbon along the protein backbone highlighting the secondary
structural attributes of the protein, for example, helices, sheets or coils. The ribbon is
precalculated, such as by using the publicly available Ribbons program (Carson and
Bugg, "Algorithm for ribbon models of proteins", J. Mol. Graphics, 4, 121-122 (1986);
Carson, "Ribbon models of macromolecules", J. Mol. Graphics, 5, 103-106 (1987);
Carson, "Ribbons 2.0", J. Appl. Cryst, 24, 958-961 (1991); Carson, "Ribbons", Methods
in Enzymology, R.M. Sweet and C.W. Carter, eds, Academic Press, 277, 493-505
(1997)) and is stored in the database in a data file associated with a given protein. The
solid surface is displayed in the graphics window, for example, by plugging in a utility
such as the SGI Cosmo Player to Netscape. Also precalculated and stored along with the
protein are electrostatic and dynamic surfaces, which are called up and displayed as solid
surfaces in a separate window using the Electrostatic Surface and Dynamic Surface
commands, respectively. The electrostatic surface is precalculated using an available
program, such as GRASP (Nicholls et al, PROTEINS, Structure, Function and Genetics,
Vol. 11, No. 4, pg 281 ff (1991)), for calculating electrostatic potential. The dynamic
surface is calculated from molecular dynamics data, which is calculated by using any
number of molecular dynamics software packages. Such packages are well known to
those of skill in the art. The Dynamic Surface command color codes the residues in the
protein according to movement or flexibility during molecular dynamics simulation. For
example, "hot" residues move the most and are colored red; residues that move over
average trajectories are colored green; and the "cool" residues that exhibit the least
movement are colored blue.
Figures 24-29 are representations of the Quality menu selections. The commands
in the Quality menu are used to evaluate the quality of a model structure by validating
that the structural and energetic characteristics of the molecule are within a reasonable
expected range of values. Figure 24 shows a Ramachandran plot for the protein molecule
shown in Figure 14. The plot is calculated interactively when the command is executed
and displays the phi and psi angle values for the protein structure in a separate window.
Figure 25 is a representation of the Quality menu selections and shows a
Balasubramanian plot for the protein molecule shown in Figure 14. The
Balasubramanian plot is also calculated interactively for the displayed protein and
indicates a directional arrow between the phi and psi angle values for each residue in the
protein. Each residue is color coded according to secondary structure based on the
dihedral angle values, for example, residues involved in helices are colored red and those
in beta sheets are colored blue.
Figure 26 is a representation of the Quality menu selections and illustrates the
Ellipsoid functionality for the protein molecule shown in Figure 14. The ellipsoid
displays the total extent or size of the protein in the x-, y- and z-directions.
Figure 27 is a representation of the Surrounding Hydrophobicity for the protein
molecule shown in Figure 14 as calculated using the Surface Hydrophobicity command.
The hydrophobic packing for the protein is precalculated based on how far a residue is
from the protein surface (Ponnuswamy and Prabhakaran, "Properties of nucleation sites
in globular proteins", Biochem. and Biophys. Res. Comm., 97, 1582-1590 (1980);
Ponnuswamy et al, "Hydrophobic packing and spatial arrangement of amino acid
residues in globular proteins", Biochim. Biophys. Acta, 623, 301-316 (1980)), and the
result is plotted in a separate window. The interior residues are displayed in the center
of the plot in the area of highest hydrophobicity and the surface residues are shown at the
edge of the plot, indicating areas of lowest hydrophobicity. This plot can give
information on nucleation sites and can be compared to crystal structure data.
The Strain Plot and Local Strain Energy commands display plots of internal strain
energy in the molecule. The strain plot (Figure 28) is displayed in a separate window as
a graph of strain energy per residue. The local strain energy is displayed in a separate
window as a solid ribbon along the protein backbone which is colored according to strain
energy (for example, highly strained residues are colored red and unstrained residues are
colored blue). Strain energies are precalculated by using external programs, such as
ICM, and stored in the database in association with the protein structure. The Profile
Analysis command (Figure 29) displays a plot of packing factor per residue, which is
precalculated using a publicly available program such as WHAT IF (Vriend, "WHAT IF:
a molecular modeling and drug design program", J. Mol. Graph. 8, 52 (1990)) and stored
in a data file in the database. The profile analysis is displayed in a separate window as
a graph of packing factor per residue.
Figure 30 is a representation of the Align functionality. The Align command
aligns proteins within subfamilies. A list of proteins within the subfamily is displayed
in a viewer window and can be selected for superimposing on the protein molecule
shown in the viewer window. A separate window displays the superposition of the
selected proteins.
4. Entering Protein Data Into the System Database
Figure 31 is a representation of a display screen for entering data about a protein
into the database server illustrated in Figure 1. As noted above, the database design of
the system illustrated in Figure 1 collects files of different types, such as text, graphic,
video, virtual reality, and audio. The database design further collects data files and places
them in a relational organization, so that files that are related to the same protein family
are readily accessible through a search routine. In accordance with known relational
database management programs, such searching may be carried out with specialized
search languages, such as Structured Query Language (SQL). In the preferred
embodiment, data is entered into the database by supplying pathnames to data files that
are loaded into the data storage of the database server 102 (Figure 1). The template for
entering such pathname information is supplied by the entry display screen 3102 of
Figure 31. The display screen includes fields that accept input that is manually entered
by an authorized user, as well as data file names to identify protein data files that will
become interrelated by the database management system.
As indicated in Figure 31 , the first data field 3104 is for the protein family name,
followed by the protein name 3106. This information is manually entered by a user who
gains access through the security server or other special access means at the database
server. When the information is received by the database management program at the
database server, it is automatically stored according to the database design. The next
information that is manually entered is for the species name 3108. Following the species
name, the user who is entering the database information must supply data file names.
The file names will comprise a data file pathname to a data file stored at the database
server.
5 The first data file pathname 3110 is to an annotation file, which is a text
(alphanumeric) file that contains predetermined supplementary protein data, such as
disease information, ID numbers, and the like. Those skilled in the art will understand
that a text file has a standardized file extension of ".txt" in the filename. The annotation
file provides a means of collecting the supplementary protein data in one convenient file,
o and may be created by a user manually entering the necessary data, or may be the result
of processing other files or data, such as the protein PDB data file.
The next file pathname to be entered into the database is for an alignment text file
3112 that contains atomic alignment data. The next pathname is to a secondary structure
file 3114, which is another text file. A ribbon setup file pathname 3116 and a ribbon data
5 file pathname 3118 are next entered. These data files comprise virtual reality files, which
have a standardized file extension of ".wrl" and which are viewed in a web browser
program using any one of a variety of readily available plug-in applications. Those
skilled in the art will appreciate that a standard file extension such as " .wrl" is recognized
by operating systems and browsers to automatically trigger launch of a program that can
o view the data. For example, one " .wrl" viewer program that integrates with conventional
Internet browsers such as Netscape Navigator is the "Cosmo Player" by Platinum
Technology, Inc. for playing virtual reality files that are coded in accordance with the
VRML 2.0 open standard for Internet programming. The next entered file pathname is
to a natural variant coordinate file 3120, which is a text file.
The next data pathname to be entered is for an electrostatic surface file 3122,
which is a virtual reality ".wrl" file. Those skilled in the art will understand the type of
molecule view that corresponds to an electrostatic surface. The next file is a text file that
contains surface hydrophobicity data 3124. Next is a profile analysis data file 3126,
which contains atomic coordinate data in the same format as a PDB file, but which is
created with homology model techniques. Next is a protein ellipsoid data file 3128, a
text file. An accessibility data file 3130 is next, comprising a text file.
A local strain data file 3132 is the next file pathname, and is a text file.
Thereafter, the next two files relate to local strain data. The first is a local strain ribbon
data ".wrl" virtual reality file 3134, and the second is a local strain ribbon line data ".wrl"
virtual reality file 3136. The next pathname is to another ".wrl" file, a dynamic surface
file 3138. The dynamic surface file provides a view of the molecular outer surface of the
protein.
The next items for database input relate to the initial model 3140. The first entry
is the pathname of a text file comprising an initial coordinates file 3142. The next three
items shown in Figure 31 are optional, in that the information they specify can be
extracted in real-time from other data files by the visualization toolkit interface program.
If the visualization toolkit interface program has the ability to extract such data in real¬
time, then the optional file pathnames shown in Figure 31 need not be supplied. These
optional files comprise a Ramachandran plot 3144, bond length graph 3146, and bond
angle graph 3148. Those skilled in the art will appreciate that these three types of views
can be easily generated in real-time, without any need for pre-processing. The last data
entry relating to the initial model is for the depositor name 3150. This field identifies the
person who is entering the data, and is useful for identifying persons who should be able
to answer questions about the particular data files they entered.
It is possible to review the atomic coordinate data for a protein and, after careful
consideration by expert review, it is possible to improve the molecular representation by
modifying the coordinate data. Therefore, the preferred embodiment of the system
accepts multiple levels of data refinement, as indicated in Figure 31 by the "Refinement
Level 1" 3160 and "Refinement Level 2" 3162 entry fields. Thus, Level 1 includes a
field for the pathname of the Level 1 coordinates file 3164, optional fields for Level 1
refinement Ramachandran plot pathname 3166, Level 1 refinement bond length graph
pathname 3168, and Level 1 refinement bond angle graph pathname 3170, as well as the
mandatory field for Level 1 refinement depositor name 3172. Similarly, the Level 2
fields 3162 include a field for the pathname of the Level 2 coordinates file 3174, optional
fields for Level 2 refinement Ramachandran plot pathname 3176, Level 2 refinement
bond length graph pathname 3178, and Level 2 refinement bond angle graph pathname
3180, as well as the mandatory field for Level 2 refinement depositor name 3182.
Finally, at the bottom of the data entry display of Figure 31 , the entered data can
be submitted by clicking the "Submit" display button 3190. The entered information can
be cleared by selecting the "Clear" button 3192.
5. System Operation
Figure 32 is a flow diagram representation of operations performed when a user
accesses the protein database of the system illustrated in Figure 1. The first operation is
for the security server to perform a security check to ensure that the user has
5 authorization to proceed with access. This operation is represented by the Figure 32 flow
diagram box numbered 3202. As noted above, different users may be granted different
levels of access. This is described further below. After a user has been granted access,
the next operation is for the user to select a database for viewing. The request must be
authorized by the security server. This operation is represented by the flow diagram box
0 numbered 3204, and the screen display being viewed by the user corresponds to Figure
5.
Next, the user selects either a search query or an index list of protein names
available in the database. At this point, represented by the flow diagram box numbered
3206, the user is viewing a screen display like that of Figure 6. Selection of the index list
5 by a user will cause the user's network terminal to initiate a request to the database server
to execute a corresponding database server cartridge program to create and display the
index list of available protein names. In the preferred embodiment, the entire list of
available SBdBase protein names is shown to all users, regardless of authorization level.
The protein names available to the user may be indicated by special formatting, such as
o boldface or special characters. In this way, the user can be appraised of new database
entries and possibly interesting protein names that are unavailable. This can assist in
migrating users from a lower level of authorization to a higher level of authorization for
viewing the proprietary database.
Next, the user selects a protein name for viewing, either from an index list or from
the results of a query search. At this step, once a protein name has been selected, the user
will be shown a display like that of Figure 11. Selecting a protein name causes the user's
network terminal to initiate a request to an appropriate web application server to create
the display page corresponding to Figure 11. In accordance with the web application
server programming, for example of the kind available from Oracle Development
Corporation, the proper display page is automatically created. This operation is
represented by the Figure 32 flow diagram box numbered 3208. The user may select
from multiple database choices. As illustrated in Figure 11, these choices may include
the "SwissProt" database, "GenBank" database, or PIR database, all of which contain
only text data. Alternatively, the choice may be for the "SDdBase", comprising the
multi-format protein relational database constructed in accordance with the present
invention. The selection of the "SBdBase" is indicated by the flow diagram box
numbered 3210.
With selection of the "SBdBase" button, the user's network terminal sends a
request to the security server, which recognizes the "SBdBase" selection and initiates
execution of appropriate "Java" programming language scripts. Such scripts are executed
quickly and without database server involvement, and so provide convenient and more
efficient operation of the system. This operation is represented by the flow diagram box
numbered 3212. The Java scripts cause the relevant database pathnames to be provided
from the security server to Java language programming routines at the database server
controller. The pathnames may include, for example, the pathnames corresponding to
the multi-format data files entered into the database, such as illustrated in Figure 31. The
pathname sending operation is represented by the Figure 32 flow diagram box numbered
3214.
When the database controller receives the database pathnames, the controller
causes the associated database files to be sent from the database server through the
security server and on to the client network terminal, thereby causing the database files
to be downloaded to the network terminal. This operation is represented by the flow
diagram box numbered 3216. After the files have been downloaded to the network
terminal, the user at the terminal can invoke the "Visualization Toolkit" and view all
associated displays (comprising the views illustrated in Figures 14 through 30). Such
viewing will be accomplished through the user's web browser program, as described
above, and comprise entirely local operations. Thus, network traffic will not be involved.
These local operations are represented by the Figure 32 flow diagram box numbered
3218. Thus, viewing operations can be continued locally, as indicated by the "Continue"
box 3220 of Figure 32.
Figure 33 is a more detailed flow diagram representation of the operations
performed during the security checking operation box 3202 of Figure 32. In Figure 33,
the first operating step is represented by the sending of a request for access from the
user's network terminal to the security server. This operation is represented by the flow
diagram box numbered 3302. In the preferred embodiment, the security server checks
the Internet protocol (IP) address of the sending user against an authorization list and
grants conditional privileges accordingly. This operation is represented by the flow
diagram box numbered 3304.
Next, the security server checks the received user login information for
verification. This may comprise, for example, checking user name and user password
information received from the network terminal. This security operation is represented
by the flow diagram box numbered 3306. If all security checking is approved, then the
security server grants access and returns appropriate display screen information to the
user, as represented by the flow diagram box numbered 3308. The system operation then
continues as described in Figure 32.
As noted above, a relational database management system (RDBMS) provides
access, security control, and search and retrieval services for the database files stored at
the database server. Figure 34 is a representation of the database schema used by the
relational database management system illustrated in Figure 1. The database schema
defines the tables that will be used by the RDBMS in performing the services described
above. Each protein in the database will have data in the RDBMS tables corresponding
to the fields described in Figure 34. Thus, the tables in Figure 34 determine the fields
over which the RDBMS can perform the search and retrieval services and determine the
level of control exercised over the other services provided by the RDBMS. In the
preferred embodiment, the tables are implemented by the RDBMS produced by Oracle
Development Corporation. Those skilled in the art will understand that RDBMS
products from other vendors are equally suitable.
The primary table in Figure 34, called MY_CORE, includes fields from which
the menu display of Figure 5 is created. The MY CORE fields include a field for a
specific protein identification number, listed as SP ID, unique to the database
implementation, for internal identification. Other fields are for accession number
(ACC NUM), pdb-file indicator (PDB_FLAG), database flag for the additional views
provided by the invention (SBDBASE_FLAG), special code (SPEC_CODE), a
description field (DESCRIPTION), disease notes (DISEASE NOTES), function notes
(FUNCTION_NOTES), sequence data (MY_SEQUENCE), and keywords for searching
(KEYWORDS). The MY_CORE table also includes fields for gene data (GENES),
enzyme data (ENZYME), pdb data (PDB_DATA), and notes to indicate similarities to
other proteins (SIMILARITY_NOTES).
Another table for the RDBMS is called FAMILY, which contains a single field
for the family name of the protein to which the Figure 34 tables correspond. Another
table in Figure 34 is called SUBFAMILY, which contains the family name and also any
subfamily name for the protein. A PROTEINS table contains fields for protein family
name, subfamily name, protein name, and protein identification number (SP ID). Thus,
a system user can search the database, using the RDBMS, and search for protein name,
protein family, protein subfamily, and protein identification number.
The RDBMS tables also include a USER_ACCESS table, with which the
RDBMS controls access to the database depending on the individual user. That is, for
each protein entry in the database, the USER ACCESS table indicates whether a
particular user has been granted access. As noted above, a user can be granted viewing
access to the database, protein by protein. Thus, the USER_ACCESS table has fields for
user name (USER_NAME), protein identification (SP_ID), and a conventional protein
identification number.
The CUSTOMERS table contains information that is used by the RDBMS to
control database access according to customer accounts. The fields for CUSTOMERS
include USER_NAME, START_DATE, and END DATE. Thus, customer access is
controlled by time period, to reflect whether a customer account is current or past due for
payments.
If the information in the Figure 34 tables are used for search and retrieval, then
the RDBMS uses the information in the fields to perform searching. Once proteins or
other database entries are identified as satisfying the search request, the RDBMS retries
the data files from the database server as described above and the data files are
downloaded to the user.
The browser display technique described above is advantageous in that a wide
variety of data formats can be accessed and displayed from within the same viewing
program, independently of the operating system being used, as long as the browser has
been configured with the appropriate helper applications. Those skilled in the art will
appreciate that conventional browser programs are developed with object-oriented
programming techniques, and that such browsers can be made to display the protein
database information in an appropriate manner through proper interface with the browser
programs. In particular, the Visualization Toolkit display shown in Figure is an
application that executes from within the browser and provides the special menus and
window views shown above in the drawing figures. Figure 35 shows the classes that
interface with the browser, in a manner specified by the browser vendor.
Figure 35 indicates object classes for the browser product from Netscape
Communications Corporation, but other suitable browsers may be used and will occur
to those skilled in the art. The classes include an Ellipsoid class that generates the
ellipsoid view from the View menu. The class HydrophobicityPlot generates the
hydrophobicity plot view. The class "netscape. application.InternalWindow" generates
a window within the browser, according to the Netscape specification for the class
netscape. application. Window. The subclasses for the internal window include windows
for the Baluchandran plot (class sbiBaluCanvas), the hydrophobicity internal window
(class sbiHydrophobicity), the List window (class List), the Profile window (class
sbiProfile), the Protein Viewer window (class sbiProtViewer), the Ramachandran
window (class sbiRama), the Sequence Viewer (class sbiSeqViewer), and the Strain
window (class sbiStrain).
Other classes are for display windows or to perform calculations on-the-fly to
generate data that is displayed in windows. Such classes include generating the Profile
Plot (class ProfilePlot to generate data for the sbiProfile class), generating the Strain Plot
window (class StrainPlot to generate strain data), a graph test function (class graphTest),
generating the hydrophobicity graph (class hydrophobicityGraph to generate data), and
generating the Profile data (class profileGraph). Other classes provide the Alignment
window (class sbiAlignViewer), generating the Balucandran data (class sbiBalu),
producing the Baluchandran window buttons (class sbiBaluButtons), providing the
Ellipsoid window (class sbiEllipsoid), and providing the browser frame (class sbiGui).
Another class provides the pdb "Canvas" viewer window (class sbiPdbCanvas). Those
skilled in the art will recognize that "Canvas" is a particular viewing program for data.
Another class provides the pdb-data viewer (class sbiPdb Viewer), and another class
5 provides the window frame slider control (class sbiSlider). Finally, other classes provide
the Active Sites display (class sbiActiveSites) and provide a converter application (class
sbiConverter).
6. Alternative Embodiment with a Distributed Network Architecture
Figure 36 shows the configuration of an alternative network system 3600
0 constructed in accordance with the present invention, using a distributed network
architecture and "Enterprise Java Beans" (EJB) components in a multi-tiered
configuration. The distributed architecture may be characterized as comprising multiple
programming tiers. This characterization of an EJB implementation will be familiar to
those skilled in the art. See, for example, Client/Server Programming with JAVA and
5 CORBA (2nd edition), by Robert Orfali and Dan Harkey, pp. 33-50.
The distributed system 3600 permits multiple client machines 3602, designated
Web Client 1 , Web Client 2, ..., Web Client n, in a first tier to communicate over a shared
network 3604, such as the Internet, with a second tier comprising an
Authorization/Security access server 3606 that controls access by the clients 3602 to a
o bioinformatics database. The access server 3606 can comprise one or more programs and
machines that perform the duties of the security server 106 and database server 102 of
the embodiment illustrated in Figure 1.
The client machines 3602 execute one or more user interface applets to interface
with the Authorization server 3606, which communicates with multiple "Enterprise Java
Beans" (EJB) components 3608 that provide the functionality needed to generate the
display panels and features illustrated in Figures 37 through 45 for the second
embodiment, as described further below. The Authorization server and EJB components
form the second tier of networking and communicate with a third tier of the system, a
Bioinformatics Database Management System (DBMS) server 3610. The Bioinformatics
DBMS server manages the collection of protein data stored in a database and provides
the protein data in response to requests and queries from the users 3602.
Those skilled in the art will understand how the EJB components 3608 can be
implemented using a distributed object model of the database 3610. For example, the
database can be structured to communicate with the clients according to the object
communication standard called Common Object Request Broker -Architecture (CORBA),
which is specified by the industry consortium called Object Management Group (OMG).
The CORBA standard will be familiar to those skilled in the art of database design.
Figure 37 is a block diagram representation of the classes into which the EJB
components 3608 are organized. These components provide the functionality for the
GUI of the second embodiment, as described further below. The functionality of these
components generally duplicates the functionality of the corresponding classes described
above for the first embodiment of Figure 13 through Figure 35. Figure 37 shows that the
classes include a Java class for an Alignment database, to provide alignment views of
proteins when requested by the client user, and also include a Java class for an Atomic
database, to provide atom-specific data.
The EJB classes 3608 also include classes for protein Chains, Domain, Family,
Protein, Residue, Secondary Structure, Subfamilies, Subfamily Proteins, Users, VAST,
and VRML The VRML class provides support for the "Virtual Reality" display system
which, in the preferred embodiment, is achieved through the "Cosmo" VRML player
interface developed by Silicon Graphics, Inc. of Mountain View, California USA. A
Deployment EJB class handles communications tasks between user clients and the
application server for activation of classes. The VAST component performs processing
to check the database 3610 for protein similarity, such as for data mining functions. This
information is used in generating the displays for local strain energy, secondary structures
dynamic surface, and hydrostatic surface.
In the preferred embodiment, the VAST component provides an interface to a
portion of the database 3610 (see Figure 36) that is constructed using the "Vector
Alignment Search Tool" (VAST) search program that is publicly available from the
National Institutes of Health (NIH) in Bethesda, Maryland, USA. In the preferred
embodiment illustrated in Figure 36, the database includes VAST output for the database
proteins. Thus, the VAST program is executed on the database proteins to create a
database of protein similarity information, in accordance with the VAST standard. In this
way, the VAST EJB component provides an interface to the VAST output data, such that
a similarity search request from a system user can be provided by appropriately scanning
the VAST database, rather than attempting to execute a comparison operation in real
time. This improves the response time of the system. Those skilled in the art will be
familiar with the VAST search methodology, details of which are available from the
National Center for Biotechnology Information (NCBI) division of the National Library
of Medicine (NLM) at the NTH (available on-line through the World Wide Web at
5 www.ncbi.nlm.nih.gov).
Figure 38 is a representation of an Application screen 3800 shown to a user at a
client machine display of the network system illustrated in Figure 36. The Application
screen is shown in a display window at a client machine, such as in a Java applet
window. Those skilled in the art will understand the automatic launch of Java applets
l o from a web browser. In the preferred embodiment of the distributed architecture system,
the applet window includes a menu bar 3802 along the top of the applet window, a
Protein Selection display panel 3804 along the left side of the applet display, with a sub-
panel 3806 for showing proteins selected by the user and available for viewing, and a
Visualization panel 3808 at the right side of the applet display.
15 In Figure 38, the Protein Selection panel 3804 shows a Family Tree display that
lists the protein families that may be selected for investigation by the system user. As
described further below, other display tabs in the Protein Selection panel indicate that a
Search panel and a Data Mining panel may be called up, in addition to the Family Tree
panel shown. In this regard, it should be noted that this embodiment of the invention
0 permits especially rapid review and easy selection of protein families without need for
the multiple window displays associated with the first embodiment described above in
conjunction with Figures 5 through 9. A user can select protein families by positioning
computer the display cursor on a desired protein family in the Family Tree display and
then using a display mouse button to "double-click" and select the protein family. The
protein family name will then be listed in the "Proteins Available for Viewing" panel in
the lower left area of the applet window.
In the Visualization panel 3808 on the right side of the applet window, the protein
may be displayed in one of several formats. In the preferred embodiment, these formats
include Description, Sequence, Structure, and Quality views, which are selected using
the computer display mouse and the tabs along the top edge of the Visualization panel.
The details of these display formats are the same as for the corresponding views
described above in conjunction with Figures 11 through 30. As noted below, however,
the alternative embodiment of Figure 36 uses a distributed network architecture for the
database and for the user-to-database interface, and thereby permits simultaneous display
of multiple proteins and associated data in the Visualization panel through EJB
components.
The Visualization panel of Figure 38 also shows that various sub-panels 3810
may be displayed and manipulated. For example, in the lower right area of the
Visualization panel, protein views may be selected to show Ramachandran plots,
Balasubram plots, Hydrophobicity data, Strain plots, and Profiles plots. In addition,
Sequence and Secondary Structure information may be shown simultaneously, as
indicated by the sub-panel adj acent to the Profile Analysis view at the bottom right of the
Visualization panel. The details of these display formats are the same as for the
corresponding views described above in conjunction with Figures 11 through 30. As
noted below, however, the alternative embodiment of Figure 36 uses a distributed
network architecture for the database and for the user-to-database interface and thereby
permits simultaneous display of multiple proteins and associated data in the Visualization
panel.
Figure 39 is a representation of the Application screen 3900 shown to a user at
a client machine of the network system illustrated in Figure 36, illustrating the Search
panel that is shown in the application window when a user selects that display tab from
the Protein Selection display area on the left side of the display 3904. Figure 39 shows
that a user enters a protein pattern identifier in a protein dialogue box 3910 of the display
and then selects a Search display button 3912. The results of a search for the identified
protein are then listed in a text window 3914, showing the protein families that matched
the search string. Those skilled in the art will understand the protein pattern identifier
notation used to identify protein families. Beneath the pattern search panel, the web
client applet shows sequence information 3916 for a user-selected one of the search result
proteins. As noted above, one of the proteins may then be selected for viewing in the
Visualization area of the display screen, and multiple proteins may be selected for
simultaneous display.
Figure 40 is a representation of the Application screen 4000 shown to a user at
a client machine of the network system illustrated in Figure 36, showing the Data Mining
panel selected using the display tab of the Protein Selection panel 4004 at the left side of
the application window. Figure 40 shows that selection of the Data Mining tab generates
a dialogue box 4010 in which a protein name is entered, and on which a similarity search
will be conducted. Figure 40 shows that a "Find All Similar Proteins" display button
4012 is provided. When the user selects the display button, the web client applet
forwards the protein search identifier entered by the user in the dialogue box 4010 and
provides it to the application server and then to the database management system. A
database search is then conducted and a list of similar proteins is provided in the text
window 4014 shown beneath the "Find" button.
From the list of Similar Proteins found from the database, a user can select one
or more of the "Similar Proteins" for further information. For example, Figure 40 shows
a panel 4016 below the "Find" box that is called "Similar Cores Between Selected
Proteins", in which the target protein entered in the dialogue box is shown, followed by
information for a selected one of the "Similar Proteins". In the "Proteins Available for
Viewing" panel 4018, both the target protein (shown as "muscle type acylphosphatease")
is listed and the "Similar Protein" (shown as "orphan nuclear receptor HMR") is shown.
When these proteins are selected for viewing, as indicated by their entry in the "Proteins
Available for Viewing" panel, their corresponding views are shown in the Visualization
panel 4008.
Figure 41 is a representation of the Application screen 4100 shown to a user at
a client machine of the network system illustrated in Figure 36, illustrating that
simultaneous proteins can be shown in the Visualization panel 4108. In particular, Figure
41 shows Quality display features for two selected proteins. The two proteins are
identified in the "Proteins Available for Viewing" panel 4108 at the lower left area of the
Application screen, and their corresponding Ramachandran plots of the Quality view are
simultaneously shown in the visualization panel 4108, along with their respective Profile
-Analysis panels with Residue information. As noted above, it should be understood that
all of the Description, Sequence, Structure, and Quality sub-panels and display views
described above for the first embodiment are shown in the respective display views in the
second embodiment. In the second embodiment of Figures 36 through 46, however,
multiple proteins can be shown simultaneously, advantageously using the web client
applets and the distributed architecture.
Figure 42 is a representation of the Application screen 4200 shown to a user at
a client machine of the network system illustrated in Figure 36, illustrating the
simultaneous display of protein Structure features for two selected proteins in the
Visualization panel 4208 of the application window. Moreover, Figure 42 shows that the
applet panels can be resized in accordance with known window programming techniques,
thereby making it possible to devote greater screen area to panels of greater interest.
Thus, in Figure 42, the visual display Structure panel of the Visualization area has been
resized to occupy a greater portion of the user's display screen as compared with the
visualization panel of Figures 38 through 41. Accordingly, the Protein Selection area
occupies less display screen area (compare, for example, with Figure 39). It should be
apparent to those skilled in the art that the resizing is achieved with the left and right
buttons of a display mouse, appropriately dragging the display cursor for the desired
resizing. Again, details of the information shown in the Structure display of the
Visualization panel are the same as those described above for the first embodiment. The
second embodiment, however, permits simultaneous display of multiple proteins.
Figure 43 is a representation 4300 of the Application screen of Figure 42, this
time showing the drop down menu 4320 for selection of Display features. More
particularly, Figure 43 of the Display item of the applet menu bar shows that a user may
select between a Skeleton display format, a Ball & Sticks format, a Spacefill format, and
a VDW Dot Skeleton display format for the Visualization panel. These display formats
will be known to those skilled in the art, so that such persons will understand what
information will be shown in such display views without further explanation. Again,
details of the information shown in the Display formats of the Visualization panel are the
same as those described above for corresponding Display views of the first embodiment.
Figure 44 is a representation 4400 of the Application screen of Figure 42,
showing the drop down menu 4420 for selection of Options for atoms to be viewed.
More particularly, Figure 44 of the Options item of the applet menu bar shows that a user
may select between a "C- Alpha Atoms" view, a "Main Chain Atoms" view, and an "All
Atoms" view for the Visualization panel. These view formats will be known to those
skilled in the art, so that such persons will understand what information will be shown
in such display views without further explanation. Again, details of the information
shown in the Options formats of the Visualization panel are the same as those described
above for corresponding views of the first embodiment.
Figure 45 is a representation of the Application screen 4500 shown to a user at
a client machine of the network system illustrated in Figure 36, illustrating simultaneous
Sequence display features for two selected proteins in the Visualization panel 4508.
Figure 45 shows the Visualization panel in the resized condition first shown in Figure 42,
illustrating sequence information for multiple proteins. As before, the proteins are
selected through the "Proteins Available for Viewing" panel 4508 at the lower left of the
display.
Figure 46 is a graphical representation of the database objects for the database
design of the system illustrated in Figure 36. As with the first embodiment, the system
of the second embodiment shown in Figure 36 is implemented using object oriented
programming techniques in which data objects are organized into classes, each of which
is characterized by attributes that specify parameters of the class and methods or
processes that specify behaviors of the class.
More particularly, Figure 46 shows that the database design of the second
embodiment includes a Residue class, an Atom class, a Domain class, and an Active Sites
class. The database also includes a Protein class, a Protein Sequence class, a Sequence
Link class, a Chain class, Secondary Structure class, VRMLURL class, Protein Segment
class, Protein Search class, Transformation class, an Alignment class, an Alignment
Residue class, and an Elements class. The database design also includes a Subfamily
class, a Subfamily Proteins class, and a Family class. Finally, the database includes a
Useraccess class, a UserAccessProteins class, a Feedback class, and a BugGroup class.
Attributes of the database classes are shown in the class boxes of Figure 46. These
classes store attribute data values and specify class behaviors to provide the functionality
described herein.
For example, the Residue class stores parameters to produce a protein residue
display with features such as illustrated in connection with the first embodiment (Figures
1 through 35) and the second embodiment (Figures 36 through 46). That is, the Residue
class contains the information typically needed to specify a residue display, which will
be apparent to those skilled in the art without further explanation. Similarly, the Atom
class contains information needed to specify display of a protein atom, the Domain class
contains information needed to specify a protein domain, and the Active Sites class
contains information needed to specify the active sites of a protein for display. Those
skilled in the art will understand the information needed to specify display of such
features, which are like those specified by the first embodiment described above for
corresponding displays in conjunction with Figures 1 through 35. Those skilled in the
art will likewise appreciate the information needed for the other classes shown in Figure
46, in conjunction with the description of classes for the first embodiment and for display
of the features described above.
Thus, the second embodiment illustrated in Figures 36 through 46 implements a
bioinformatics database access system having a graphical user interface (GUI) that
communicates over a shared network (such as the Internet) using graphical browsers to
establish a communications session. Once communications with the database server are
established, the GUI is provided through a platform-independent applet environment,
such as provided with a Java programming environment. Thus, no hypertext mark-up
language (HTML) page links are needed, and no common gateway interface (CGI) scripts
are needed to exchange information and retrieve database information and create
displays. Such processing, for example, also permits the simultaneous display of
database information for more than one protein. Such programming also permits a
Protein Selection panel to be shown adjacent a Visualization panel, thereby permitting
search and selection of proteins, followed by display of visualization data for such
proteins, from the same display window. Thus, families of proteins can be shown along
the left side of the display, while visualization displays of the proteins can be shown
5 simultaneously along the right side of the display. This functionality is possible because,
with the Java implementation, the same applet that communicates with the bioinformatics
database also performs the visualization display tasks.
The present invention has been described above in terms of presently preferred
embodiments so that an understanding of the present invention can be conveyed. There
l o are, however, many configurations for protein database viewing systems not specifically
described herein but with which the present invention is applicable. The present
invention should therefore not be seen as limited to the particular embodiments described
herein, but rather, it should be understood that the present invention has wide
applicability with respect to molecular structure database viewing systems generally. All
5 modifications, variations, or equivalent arrangements and implementations that are within
the scope of the attached claims should therefore be considered within the scope of the
invention.