A SYSTEM AND METHOD FOR IDENTIFYING DNA SEQUENCES THAT COULD CODE INTO A STRING OF AMINO ACIDS
BACKGROUND OF THE INVENTION
1. Technical Field
The present invention relates to computerized systems for studying DNA, and more
specifically, to a system and method for identifying DNA sequences that could code into a string
of amino acids.
2. Related Art
Presently, scientists are engaged in an intense research effort to characterize the genomes
of human and selected model organisms through complete mapping and sequencing of their
DNA (deoxyribonucleic acid), and to develop technologies for genomic analysis. A genome comprises all of the genetic material found in the chromosomes of a particular organism. For instance, the human genome consists of 50,000 to 100,000 genes located on 23 pairs of
chromosomes. The first complete human genome to be sequenced will be a composite of
sequences from many sources, most of these being cell lines that have existed in laboratories all
over the world for some time. The sequence will thus be a generic sequence representative of humans in general and not of any particular individual. The complete sequence will provide a
standard against which other partial sequences can be compared.
DNA is contained in self-replicating genetic structures known as chromosomes. Each
chromosome contains a long molecule of DNA, the chemical of which genes are made. The
DNA, in turn, is a double-stranded molecule in which each strand is a linear array of units called nucleotides or bases. There are four different bases, called adenine "A," thymine "T," guanine "G," and cytosine "C." The bases on one strand of DNA are precisely paired with the
bases on the other strand, so that an A is always opposite T and G opposite C. The order of the four units on the DNA strand determines the information content of a particular gene or piece of DNA. Genes are of differing length, ranging in size from roughly 2,000 to as many as 2,000,000 base pairs. Mapping is the process of determining the position and spacing of genes, or other landmarks, on the chromosomes relative to one another. Sequencing is the process of determining the order of the nucleotides, or base pairs, in a DNA molecule.
Although mapping of human genes began early in the twentieth century, it has been intensively pursued only for the past two decades. For most of this period the methods that were developed, though original and ingenious, have been inadequate for comprehensive mapping and have only allowed the construction of relatively crude maps with very little detail. Recently, much more effective technology has been introduced. At the beginning of the 21st century, about 1,700 of the estimated 50,000 to 100,000 human genes (less than 2 percent) have been mapped. Scientists involved in the Human Genome Project hope to find the location of the 50,000 to 100,000 or so human genes and to read the entire genetic script, all three billion bits of information, by the year 2005. More information on the Human Genome Project is available on the world wide web at sites such as the National Human Genome Research Institute at www.nhgri.gov.
The information generated by genome projects is expected to be the source book for biomedical science in the 21st century and will be of immense benefit to the field of medicine. It will help us to understand and eventually treat many of the more than 4000 genetic diseases that afflict mankind, as well as the many multifactorial diseases in which genetic predisposition plays an important role. Scientists are now just beginning to formulate uses for these vast databases of information. To fully exploit these databases, it will be vital to develop new methods and tools for the analysis and interpretation of genome maps and DNA sequences.
One such area that will potentially benefit from genome projects is the study of proteins and their correlation with genetic information, as genes are responsible for determining which proteins get created. Proteins are made up of a string or sequence of amino acids. DNA contains
the information about how amino acids are put together in a series to form a protein.
Accordingly, a gene is said to code into a specific protein. For this coding to occur, DNA serves as a template for the creation of mRNA. The process by which RNA is made from DNA is
called transcription. After transcription, mRNA is further processed until it can serve as a template for linking amino acids together. The process by which an mRNA sequence is
converted into an amino acid sequence is called "translation." Translation is performed by
ribosomes which line up tRNAs with mRNAs. Each tRNA has three nucleotides which
complement three nucleotides on the mRNA. Each tRNA also carries a specific amino acid. As tRNAs are lined up along the mRNA a covalent bond is formed between the amino acid on that
tRNA and the amino acid already present. A key result is that a specific nucleotide triplet corresponds to a specific amino acid. However, there is some redundancy so that several
different combinations of nucleotide triplets can correspond to the same amino acid.
Until now, most efforts have focused on identifying a protein and related amino acid
details from a DNA sequence. The prior art fails, however, to provide systems for performing a
reverse translation, i.e., generating all possible DNA sequences from a single amino acid
sequence. Given the correspondence between protein data (i.e., amino acids) and genetic
information (i.e., nucleotide triplets), there exist any number of potential applications where
such systems could be useful. For instance, a research scientist studying a tissue sample populated with proteins of interest, e.g., a portion of a failing heart, may be interested in
determining the gene responsible for coding to that protein. Until now, however, there has been
no automated mechanism for identifying the gene or collection of genes potentially responsible for coding into that protein.
SUMMARY OF THE INVENTION
The present invention addresses the above-mentioned problems by providing a
computerized method and system for identifying DNA sequences that could code for a string of
inputted amino acids and then for identifying genes within the DNA sequences that are
potentially responsible for creating the amino acid string. In a first aspect, the invention comprises a method of: (1) inputting a string of amino acids; (2) generating all possible codon permutations that could code into the inputted string of amino acids; and (3) examining a gene
database to identify DNA sequences that match the codon permutations.
In a second aspect, the invention comprises a computerized method for identifying DNA
sequences that could code for an inputted motif, comprising the steps of: (1) identifying within
the motif a primary region, wherein the primary region includes a string of amino acids having
contiguous recurrence in the motif; (2) searching a gene database to identify all codon permutations that could code into the primary region; (3) identifying a subset of the gene
database, wherein the subset contains genes that could translate into the primary region; (4)
selecting within the motif a secondary amino acid apart from the primary region; (5) determining a positional relationship of the secondary amino acid with respect to the primary region; (6)
searching the subset of the gene database for DNA sequences that include codon permutations that could code into both the primary region and the secondary amino acid, at the determined
positional relationship.
In a third aspect, the invention comprises a program product stored on a recordable
medium for identifying DNA sequences that could code into an inputted amino acid sequence, comprising: (1) means for inputting the sequence of amino acids; (2) means for generating codon
permutations that could code into the inputted sequence of amino acids; and (3) means for searching a gene database to identify DNA sequences that match the generated codon permutations.
In a fourth aspect, the invention comprises a computer system for identifying DNA sequences that could code into an inputted sequence of amino acids, comprising: a central
processing unit, a memory, a permutation generator for generating possible codon permutations
for an inputted sequence of amino acids, and a search interface system for searching a gene
database to identify DNA sequences that match the codon permutations.
It is therefore an advantage of the present invention to allow a researcher to start with a
limited amount of information regarding a particular protein (i.e., a peptide sequence or amino
acid string) and end up with the entire DNA code(s), i.e., gene(s), for the protein, or families of
the protein, of interest.
It is therefore a further advantage of the present invention to provide a system to screen
DNA databases for genes that could potentially code for proteins containing a pre-specified
amino acid sequence (peptide).
It is therefore a further advantage of the present invention to provide a permutation generator that can perform reverse translation of an amino acid sequence into all possible DNA
sequences.
It is therefore a further advantage of the present invention to provide a comprehensive
system that includes a reverse translation system, permutation generation system, a homology
search engine, and a system for screening for in-frame start codons to select valid gene candidates.
BRIEF DESCRIPTION OF THE DRAWINGS
The preferred exemplary embodiment of the present invention will hereinafter be
described in conjunction with the appended drawings, where like designations denote like
elements, and:
Figure 1 is a block diagram depicting a computer system for identifying DNA sequences that could code for a string of amino acids in accordance with a preferred embodiment of the present invention.
Figure 2 depicts a block diagram of the operational flow of the present invention.
Figure 3 depicts a table showing nucleotide triplets and their corresponding amino acids.
It should be noted that the drawings are merely schematic representations, not intended to portray specific parameters of the invention. The drawings are intended to depict only typical embodiments of the invention, and therefore should not be considered as limiting the scope of the invention. In the drawings, like numbering represents like elements between the drawings.
DETAILED DESCRIPTION OF THE DRAWINGS
Overview
As noted above, a feature of this invention is to identify genes that could potentially code into an inputted amino acid sequence or a protein. Genes are made up of DNA sequences, and DNA is made up of combinations of nucleotides. DNA contains the information about how amino acids are put together in a series to form a protein. There are four possible nucleotides: A,
T, G, and C. A series of three contiguous nucleotides constitutes a triplet. Codons are nucleotide triplets that code for an amino acid or a stop signal. Figure 3 shows a table depicting the list of nucleotide triplets (i.e., codons) along with the corresponding amino acid. Blanks in the table indicate a "stop signal." A key point is that a specific nucleotide triplet (i.e., codon) corresponds to a specific amino acid. However, there is some redundancy in that several different codons can correspond to the same amino acid. Thus, given an amino acid sequence, there exists a corresponding set of codon sequences, with the number of codon sequences in the set being dependent on the number of possible permutations of corresponding codons. The set
of codon sequences, once identified, can be used to identify corresponding genes that were potentially responsible for creating a given amino acid sequence.
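The relationship between codons and amino acids described above can be sketched in Python as follows. This is only an illustrative sketch: it assumes the standard genetic code (Figure 3 is not reproduced here), uses one-letter amino acid codes with "*" for the stop signal, and the names CODON_TO_AA, AA_TO_CODONS, and translate are hypothetical.

```python
from collections import defaultdict
from itertools import product

# Standard genetic code, listed with bases in T, C, A, G order (an assumption;
# the table of Figure 3 is not reproduced in this text). "*" marks a stop signal.
BASES = "TCAG"
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"

# Each of the 64 triplets maps to exactly one amino acid or stop signal.
CODON_TO_AA = {"".join(t): aa for t, aa in zip(product(BASES, repeat=3), AMINO)}

# The reverse mapping is one-to-many: this is the redundancy noted in the text.
AA_TO_CODONS = defaultdict(list)
for codon, aa in CODON_TO_AA.items():
    AA_TO_CODONS[aa].append(codon)


def translate(dna):
    """Translate a DNA string codon by codon, ignoring a trailing partial codon."""
    usable = len(dna) - len(dna) % 3
    return "".join(CODON_TO_AA[dna[i:i + 3]] for i in range(0, usable, 3))


print(translate("TGTGAATCTGAA"))   # CESE (CYS GLU SER GLU)
print(AA_TO_CODONS["S"])           # the six codons for SER
```

The reverse mapping is what makes reverse translation a one-to-many problem: a single amino acid sequence corresponds to a whole set of codon sequences.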
Computer System & Software
Referring now to Figure 1, a computer system 10 is shown that includes a central processing unit (CPU) 16, an input/output (I/O) system 18, bus 28, and memory 20. Stored in memory 20 is a software program 26 comprising permutation generator 22 and search interface system 24. Memory 20 may comprise any known type of data storage and/or transmission media, including magnetic media, optical media, random access memory (RAM), read only memory (ROM), a data cache, a data object, etc. Moreover, memory 20 may reside at a single physical location, comprising one or more types of data storage, or be distributed across a plurality of physical systems in various forms. CPU 16 may likewise comprise a single processing unit, or be distributed across one or more processing units in one or more locations, e.g., on a client and a server. I/O 18 may comprise any system for exchanging information with external sources. User interface 12 is in communication with computer system 10 via datalink 30. User interface 12 may comprise any known type of device for inputting and receiving information into computer system 10, including a CRT, LED screen, hand-held device, keyboard, mouse, voice recognition system, speech output system, printer, facsimile, pager, etc. Gene database 14 is in communication with computer system 10 via datalink 31. User interface 12 and gene database 14 may be linked to computer system 10 in any known way, including via an internet, intranet, worldwide web, local area network, wide area network, etc. Alternatively, user interface 12 and/or gene database 14, may be integrated into computer system 10. Datalinks 30 and 31, and bus 28 may comprise any known type of transmission link including electrical, optical, wireless, etc. In addition, although not shown, additional components, such as cache
memory, communications systems, systems software, etc., may be incorporated into computer system 10.
It is understood that the present invention can be realized in hardware, software, or a
combination of hardware and software. As indicated above, the computer system 10 according
to the present invention can be realized in a centralized fashion in a single computerized workstation, or in a distributed fashion where different elements are spread across several
interconnected computer systems. Any kind of computer system - or other apparatus adapted for
carrying out the methods described herein - is suited. A typical combination of hardware and
software could be a general purpose computer system with a computer program that, when
loaded and executed, controls the computer system 10 such that it carries out the methods described herein. Alternatively, a specific use computer, containing specialized hardware for
carrying out one or more of the functional tasks of the invention could be utilized. The present
invention can also be embedded in a computer program product, which comprises all the features
enabling the implementation of the methods and functions described herein, and which - when loaded in a computer system - is able to carry out these methods and functions. Computer program, software program, program, program product, or software, in the present context mean
any expression, in any language, code or notation, of a set of instructions intended to cause a
system having an information processing capability to perform a particular function either
directly or after either or both of the following: (a) conversion to another language, code or notation; and/or (b) reproduction in a different material form.
Referring now to Figure 2, a box diagram is shown depicting the operation of software
program 26. Permutation generator 22 operates by obtaining an amino acid string or sequence 40. Permutation generator 22 then generates codon sequences 46 that could code to the inputted
amino acid sequence. This operation utilizes an amino acid/codon table 42 and a permutation algorithm, which are described in more detail below. Once the list of codon sequences 46 is
generated, it is inputted into search interface system 24. Search interface system 24 compares each codon sequence to gene sequences in one or more gene databases 14. When a
match is identified, search interface system 24 then attempts to identify the gene containing the
matched sequence. The gene identification process includes identifying a start codon utilizing
start codon identifier 32. Often, start codons are separated from the matched sequence by splice
junctions. To facilitate the process of identifying remotely located start codons, a splice junction
handler 34 is included to identify and handle splice junctions. This process is described in more
detail below. After search interface system 24 identifies the genes containing matches, the genes
48 that code for the inputted amino acid sequence are outputted.
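The operational flow of Figure 2 can be sketched end to end in Python. The sketch below is a simplification under stated assumptions: the gene database is a small in-memory dictionary with made-up entries, matching is exact substring matching rather than a homology search, start codon and splice junction handling are omitted, and the function names codon_sequences and search_database are hypothetical.

```python
from collections import defaultdict
from itertools import product

# Standard genetic code in T, C, A, G order (assumed; "*" marks a stop signal).
BASES = "TCAG"
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
AA_TO_CODONS = defaultdict(list)
for triplet, aa in zip(product(BASES, repeat=3), AMINO):
    AA_TO_CODONS[aa].append("".join(triplet))


def codon_sequences(peptide):
    """Stand-in for permutation generator 22: every DNA string coding for peptide."""
    return ["".join(c) for c in product(*(AA_TO_CODONS[aa] for aa in peptide))]


def search_database(peptide, database):
    """Stand-in for search interface system 24, using exact substring matching."""
    candidates = codon_sequences(peptide)
    hits = {}
    for name, sequence in database.items():
        for candidate in candidates:
            position = sequence.find(candidate)
            if position != -1:
                hits[name] = (candidate, position)  # matched string and its offset
                break
    return hits


# Toy database with invented entries, for illustration only.
toy_database = {
    "geneA": "GCAATGTGTGAATCTGAGTAA",
    "geneB": "ATGCCCGGGTTTTAA",
}
print(search_database("CESE", toy_database))
```

In this toy run only geneA contains a codon sequence for CYS GLU SER GLU. A practical system would hand the candidate sequences to a homology search program instead of scanning strings directly.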
In addition to handling complete amino acid sequences 40, the system can also handle
partial amino acid sequences, known as motifs. Motifs are amino acid sequences that have a
common pattern embedded in nonhomologous sequences. The invention includes a motif
handler 44 and a positional search mechanism 36 to facilitate the process, which are described in more detail below.
Finally, in addition to entering an amino acid sequence 40, search criteria 50 may also
be inputted into the system. The search criteria 50 can be used to instruct the search interface
system 24 to, among other things, search a specific gene database 14, search a specific genome,
specify restraints on how to search for start codons, specify whether to look for splice junctions, or specify how to search for motifs.
Permutation Generator
As noted above, permutation generator 22 generates a set of codon sequences (DNA sequences) that could code for an inputted sequence of amino acids. For example, if a sequence containing the four amino acids: CYS, GLU, SER, GLU was inputted, a set of all possible codon
sequences that could code into the inputted amino acids, each having four codons, would be
generated. As shown in the table of Figure 3, there exist two codons, {TGT, TGC} that could code to CYS; two codons {GAA, GAG} that could code to GLU; six codons {TCT, TCC, TCA,
TCG, AGT, AGC} that could code to SER; and two codons {GAA, GAG} that could code to
GLU. The total number of permutations therefore would be 2 x 2 x 6 x 2 = 48, which is the
product of the number of possible codons that correspond to each amino acid in the inputted
sequence. A partial list of the resulting codon sequences that could code for the inputted amino
acid sequence would comprise: (1) TGT, GAA, TCT, GAA; (2) TGT, GAA, TCT, GAG ... (48)
TGC, GAG, AGC, GAG.
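The CYS, GLU, SER, GLU example above can be reproduced with a Cartesian product over the per-amino-acid codon lists. This is a minimal sketch rather than the matrix-based method described below; the dictionary CODONS holds only the three amino acids used in the example, and the function name codon_permutations is hypothetical.

```python
from itertools import product
from math import prod

# Codon lists taken from the example in the text (only the amino acids it uses).
CODONS = {
    "CYS": ["TGT", "TGC"],
    "GLU": ["GAA", "GAG"],
    "SER": ["TCT", "TCC", "TCA", "TCG", "AGT", "AGC"],
}


def codon_permutations(amino_acids):
    """Return every codon sequence that could code for the given amino acids."""
    choices = [CODONS[aa] for aa in amino_acids]
    return [list(combo) for combo in product(*choices)]


peptide = ["CYS", "GLU", "SER", "GLU"]
perms = codon_permutations(peptide)

# The count is the product of the number of codons for each position.
print(prod(len(CODONS[aa]) for aa in peptide))  # 48
print(perms[0])    # ['TGT', 'GAA', 'TCT', 'GAA']
print(perms[1])    # ['TGT', 'GAA', 'TCT', 'GAG']
print(perms[-1])   # ['TGC', 'GAG', 'AGC', 'GAG']
```

With the codon lists in this order, the first, second, and last entries coincide with items (1), (2), and (48) of the partial list given above.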
Permutation Algorithm
The following is an example of a method of generating permutations of codon sequences
based on an inputted amino acid sequence. This particular method utilizes matrices to manage the permutations. It is understood that the following method is not intended to be limiting, and any algorithm for generating permutations that can be implemented by a person skilled in the art
is believed to be within the scope of the invention.
The first step is to define a Master Codon Table (Table 1 shown below), which contains a
list of all the amino acids and their possible codons, as well as the number of total possible
codons for a given amino acid. The possible codons for the stop signal are also included in this
table. As can be seen, there are 20 possible amino acids and one stop signal (d = 1..21), and each
amino acid may have up to six corresponding codons (q = 1..6).
Table 1
The next step is to define various matrices and tables that are utilized in the process. Each
matrix uses addresses which comprise three components: row and column coordinates and a
symbol which indicates the target matrix to which these coordinates are applied. An address locates a specific position within the designated matrix.
A value is the object and all the properties attached to that object in a specific location of
a matrix where the specific location is uniquely defined by an address. A value in a specific
location designated by an address can itself be an address in another matrix.
A reference matrix [Xaj] is then defined, where a = 1 to g and j = 1 to e. The reference matrix
address (column and row coordinates) assigns location within the matrix. The values in the
reference matrix constitute a "codon table" and are addresses to specific locations in the master
codon table.
Next, an Input Matrix [Ij] for holding the inputted amino acid is defined, where j = 1 to e
(j represents the position of a given amino acid in the amino acid sequence that is input). Each
amino acid input has a corresponding d value in the master codon table (the d value should be
placed under the corresponding amino acid by a link to the master codon table). Amino acids from the input table are linked to their location in the master codon table.
Next, a permutation generator (PG) address matrix [Yαβ] is defined, where α = 1 to e, and β = 1 to
g (the PG address assigns location within this matrix). The values in the PG matrix are addresses to specific positions in the reference matrix.
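The chain of addresses described above (PG matrix to reference matrix to master codon table) can be sketched as nested Python lists. The sketch makes several assumptions that are not in the text: only a fragment of the master codon table is shown, the d values for ILE, VAL, and PRO (3, 5, and 7) are read off the worked example later in this description, the ordering of codons beyond q = 2 is a guess, empty cells are padded with 0, and list indices are zero-based. All function names are hypothetical.

```python
G_MAX = 6  # an amino acid may have up to six codons (q = 1..6)

# Hypothetical fragment of the master codon table, keyed by (q, d).
MASTER = {
    (1, 3): "ATT", (2, 3): "ATC", (3, 3): "ATA",                 # ILE, d = 3
    (1, 5): "GTT", (2, 5): "GTC", (3, 5): "GTA", (4, 5): "GTG",  # VAL, d = 5
    (1, 7): "CCT", (2, 7): "CCC", (3, 7): "CCA", (4, 7): "CCG",  # PRO, d = 7
}
D_VALUE = {"ILE": 3, "VAL": 5, "PRO": 7}


def build_reference_matrix(input_matrix):
    """X[a][j] holds a master-table address (q, d), or 0 where no codon exists."""
    e = len(input_matrix)
    x = [[0] * e for _ in range(G_MAX)]
    for j, amino_acid in enumerate(input_matrix):
        d = D_VALUE[amino_acid]
        for a in range(G_MAX):
            if (a + 1, d) in MASTER:
                x[a][j] = (a + 1, d)
    return x


def build_pg_matrix(x):
    """Y[j][a] holds the reference-matrix address (a, j), i.e. the transposition."""
    rows, cols = len(x), len(x[0])
    return [[(a, j) if x[a][j] != 0 else 0 for a in range(rows)] for j in range(cols)]


def resolve(pg_value, x):
    """Follow a PG value through the reference matrix to a codon."""
    a, j = pg_value
    return MASTER[x[a][j]]


input_matrix = ["ILE", "VAL", "PRO"]
X = build_reference_matrix(input_matrix)
Y = build_pg_matrix(X)
print([resolve(v, X) for v in Y[1] if v != 0])  # codons for VAL
```

Keeping addresses rather than codon strings in the matrices means the permutation step never touches sequence data; codons are only looked up when results are displayed.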
Next, various operations are defined. Operation mqda → mqdv shows the value in
position mqd of the master codon table. Operation [Xaj] → [Yαβ] transposes address locations
in the reference matrix to the PG matrix so that the address in the PG matrix is the transposition
of the address in the reference matrix (i.e., the codon table). Operation μ(2| Ynβ*Y(n+1)β)
generates all permutations of two elements, one element from column α=n, one element from
column α=n+1, for all elements 1 to β of each column. Cμ2(Ynβ|Y(n+1)β) is the superset of the
sets that contain all possible 2 element permutations of Ynβ and Y(n+1)β.
For example, consider the following PG matrix:
In this example, the numbers shown in the PG grid are simple integers for purposes of describing
the concept of the permutation generator; in the actual implementation, these integers would be
replaced by the reference matrix address, i.e., the 1 in position Y11 would be X11, the 2 in position Y12 would be X12, and the 2 in position Y22 would be X22. X11 would be the location in
the reference matrix. The value in the X11 position of the reference matrix would in turn be a
location in the master codon table. X12 would not necessarily equal X22. In this example X12
could not equal X22 because the βmax in the two respective Y columns is different, indicating
that the two columns represent codons for different amino acids. The location in the master
codon table would contain the codon.
In this example, Cμ2(Y1β|Y2β) = { (1,1), (1,2), (2,1), (1,3), (3,1), (1,4), (2,2), (2,3),
(2,4), (3,2), (3,3), (3,4)}. For each element in Cμ2(Y1β|Y2β), Cμ2(Y3β|Y4β) can be generated
as follows:
(1,1) → Cμ2(Y3β|Y4β)
(1,2) → Cμ2(Y3β|Y4β)
(2,1) → Cμ2(Y3β|Y4β)
...
(3,4) → Cμ2(Y3β|Y4β).
The operation Cμ(4: Cμ2(Y1β|Y2β) || Cμ2(Y3β|Y4β)) defines the set of four element
subsets which represent all possible permutations of elements in the sets Cμ2(Y1β|Y2β) and
Cμ2(Y3β|Y4β), i.e., (1,1) combined with (1,1) → (1,1,1,1); (1,1) combined with (1,2) → (1,1,1,2);
etc. For each element of Cμ(4: Cμ2(Y1β|Y2β) || Cμ2(Y3β|Y4β)), Cμ2(Y5β|Y6β) is generated, i.e.,
(1,1,1,1) → Cμ2(Y5β|Y6β); (1,1,1,2) → Cμ2(Y5β|Y6β); etc. This results in the set Cμ(6:
Cμ2(Y1β|Y2β) || Cμ2(Y3β|Y4β) || Cμ2(Y5β|Y6β)), which represents all of the possible permutations
of the elements shown in the above example.
A general operation may be defined as: Cμp = Cμ(p| Cμ2(Y(n)β|Y(n+1)β) ||
Cμ2(Y(n+2)β|Y(n+3)β) || ... || Cμ2(Y(n+p-2)β|Y(n+p-1)β)), which generates the sets containing
p elements that represent all possible permutations of elements in the sets contained as elements
in the groups Cμ2(Y(n)β|Y(n+1)β), Cμ2(Y(n+2)β|Y(n+3)β), through Cμ2(Y(n+p-2)β|Y(n+p-1)β).
Defined rules for the operation would include:
1. Always start at n=1.
2. For a given set, if any value of any element = 0, then delete the set containing the element of
value = 0.
3. Perform iterations of μ until n+x+2 = j where x = integer.
4. If n+x+2 > j (i.e., n+x+1 = j), for each element of Cμ(n+x) generate Cμ1(Y(n+x+1)β), where Cμ1
is the set of elements in Y(n+x+1)β.
Thus, a given permutation of two elements, one from Ynβ and one from Y(n+1)β,
constitutes an element in the total set of permutations Cμ2(Ynβ|Y(n+1)β).
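The pairwise operation and its rules can be sketched in Python as shown below. This is one possible reading of the rules, not a definitive implementation: each inner list holds the PG entries for one position in the amino acid sequence, 0 marks an empty cell, and the function names pair_permutations and all_permutations are hypothetical.

```python
from itertools import product


def pair_permutations(first, second):
    """Two-element permutations of two PG columns; sets containing a 0 are dropped."""
    return [(a, b) for a, b in product(first, second) if a != 0 and b != 0]


def all_permutations(pg):
    """Combine the columns two at a time; a leftover column is appended singly."""
    results = [()]
    n = 0
    while n + 1 < len(pg):
        pairs = pair_permutations(pg[n], pg[n + 1])
        results = [r + p for r in results for p in pairs]
        n += 2
    if n < len(pg):  # odd number of columns: handle the last one on its own
        singles = [v for v in pg[n] if v != 0]
        results = [r + (v,) for r in results for v in singles]
    return results


# Integer stand-ins as in the text: three entries in column 1, four in column 2.
pg = [[1, 2, 3, 0], [1, 2, 3, 4]]
print(sorted(pair_permutations(pg[0], pg[1])))

# With a third column of two entries, rule 4 applies: 3 x 4 x 2 = 24 results.
print(len(all_permutations(pg + [[1, 2, 0, 0]])))  # 24
```

Working two columns at a time yields the same final set as a single Cartesian product over all columns; the pairing only affects the order in which intermediate sets are built.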
Example:
The following example illustrates the permutation generator program. The input
comprises an amino acid sequence containing six amino acids, which is stored in the input
matrix below.
Next, a codon table in the reference matrix from the input data is created.
The program then transposes the data in the reference matrix to the permutation generator
matrix:
The program then generates all possible permutations of the codons that could yield the input
amino acid sequence by performing iterations of μ(2| Ynβ*Y(n+1)β):
Cμ2(Y1β|Y2β) = { (X11,X12), (X11,X22), (X21,X12), (X11,X32), (X31,X12), (X11,X42),
(X21,X22), (X21,X32), (X21,X42), (X31,X22), (X31,X32), (X31,X42)}
For each element in Cμ2(Y1β|Y2β) generate Cμ2(Y3β|Y4β):
(X11,X12) → Cμ2(Y3β|Y4β)
(X11,X22) → Cμ2(Y3β|Y4β)
(X21,X12) → Cμ2(Y3β|Y4β), etc.
After all iterations are complete, there is a list of Xaj addresses, with the values in these
Xaj locations being addresses to locations in the Mqd matrix, i.e.:
M13 M15 M17 M120 M13 M18
M23 M25 M27 M220 M23 M28, etc.
The program then calls the values in the Mqd locations and displays them as codon strings:
ATT GTT CCT GGT ATT ACT
ATC GTC CCC GGC ATC ACC, etc.
Each codon string represents one of all the possible codon permutations that could code for the input amino acid sequence.
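For the six amino acid example, whose codon strings translate to ILE VAL PRO GLY ILE THR, the size of the result set can be computed before any strings are generated, and the strings themselves can be produced lazily. The sketch below assumes the standard genetic code and one-letter amino acid codes; the function names are hypothetical.

```python
from collections import defaultdict
from itertools import islice, product
from math import prod

# Standard genetic code in T, C, A, G order (assumed; "*" marks a stop signal).
BASES = "TCAG"
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
AA_TO_CODONS = defaultdict(list)
for triplet, aa in zip(product(BASES, repeat=3), AMINO):
    AA_TO_CODONS[aa].append("".join(triplet))


def count_codon_strings(peptide):
    """Number of codon strings, computed without generating them."""
    return prod(len(AA_TO_CODONS[aa]) for aa in peptide)


def iter_codon_strings(peptide):
    """Yield codon strings one at a time instead of building the full list."""
    for combo in product(*(AA_TO_CODONS[aa] for aa in peptide)):
        yield " ".join(combo)


peptide = "IVPGIT"  # ILE VAL PRO GLY ILE THR
print(count_codon_strings(peptide))  # 2304 = 3 x 4 x 4 x 4 x 3 x 4
for codon_string in islice(iter_codon_strings(peptide), 3):
    print(codon_string)
```

Because the count grows multiplicatively with sequence length, a generator of this kind lets a search process candidates as they are produced rather than holding every permutation in memory.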
Gene Searching
Referring again to Figure 2, once the codon sequences 46 are generated, they can be
compared to sequences in gene database 14, such as GenBank. Programs that perform this task, referred to as homology search programs 33, may include known programs such as BLAST™
and FASTA™, or can comprise customized systems to handle multiple sequence searching.
Homology search programs 33 will identify high homology sequences, which comprise
sequences that match the inputted sequence.
After high homology sequences are found, other regions of the gene must be examined to
identify a segment homologous to the inputted permutation. Specifically, start codon identifier
32 may be utilized to identify a start codon for the high homology sequence. A start codon is an
ATG codon upstream (prior to) and "in frame" with the high homology sequence. (A start codon is in frame if there exists a whole number of triplets between the start codon and the high
homology sequence.) Thus, start codon identifier 32 examines upstream sequences to find a start
codon (ATG) that would result in a reading frame that could result in expression of the
permutation sequence as a protein. For example, assume the codon sequence ATT GTT CCT
GGT ATT ACT, which resulted from the example above, was inputted into a search engine and matched a gene having a partial sequence:
GCA ATG CCCG ATT GTT CCT GGT ATT ACT
The inputted permutation sequence (ATT GTT CCT GGT ATT ACT) is present, and there is an upstream ATG sequence. However, the upstream ATG start codon is not "in frame" to result in an expression of the amino acid sequence of interest. (There is not a whole number of
triplets between the start codon and the inputted sequence: the four intervening nucleotides, CCCG, do not divide evenly into triplets.) The ATG open reading frame in this
example is:
ATG CCC GAT TGT TCC TGG TAT TAC, which would translate as: MET PRO ASP CYS SER TRP TYR TYR. However, the inputted amino acid sequence was:
ILE VAL PRO GLY ILE THR. Thus, start codon identifier 32 must look further upstream to
identify a start codon that is in frame. A program that can find open reading frames in a gene can
be readily implemented by one skilled in the art.
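The in-frame test can be sketched as a walk upstream from the matched sequence in steps of three nucleotides. This is a minimal illustration: the function name find_in_frame_start is hypothetical, and the sketch does not check for intervening in-frame stop codons or splice junctions, which a fuller implementation would likely need to consider.

```python
def find_in_frame_start(gene, match_start):
    """Return the index of the nearest upstream in-frame ATG, or None if absent."""
    position = match_start - 3
    while position >= 0:
        if gene[position:position + 3] == "ATG":
            return position
        position -= 3  # stay in the same reading frame as the match
    return None


match = "ATTGTTCCTGGTATTACT"

# Partial sequence from the example: the ATG is present but out of frame.
gene = "GCAATGCCCGATTGTTCCTGGTATTACT"
print(find_in_frame_start(gene, gene.find(match)))        # None

# Hypothetical variant with one nucleotide removed, putting the ATG in frame.
variant = "GCAATGCCCATTGTTCCTGGTATTACT"
print(find_in_frame_start(variant, variant.find(match)))  # 3
```

For the partial sequence in the example the function returns None, which corresponds to the case where start codon identifier 32 must look further upstream than the sequence shown.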
An additional problem that may arise when searching for a start codon relates to non-coding
sequences that may exist in the gene database. These non-coding sequences must be
effectively removed from the search, which is done by splice junction handler 34. Non-coding
sequences arise as a consequence of the distinction between genomic DNA and cDNA. Genomic
DNA is the "blueprint" from which mRNA is transcribed. mRNA is transcribed from genomic
DNA but may contain a sequence that does not code for a protein. These non-coding sequences
are called intervening sequences or "introns." Before protein is made from an mRNA, the introns are spliced out. The sequences that remain after the introns are excised are called "exons" and are linked
together to form the sequence from which the protein is translated. The term cDNA refers to the
DNA sequence that would be obtained if one converted mRNA after removal of its introns
directly into DNA. cDNA is a DNA copy of mRNA that has been processed and is ready for
translation. Thus, if genomic DNA is searched, one may find that a start codon is separated from the sequence found to be homologous to the permutation codon string by one or more introns.
The intron which separates the start codon from the high homology region could be several thousand base pairs (nucleotides) in length. Accordingly, splice junction handler 34 is provided
to identify remote start codons. The boundary between an intron and an exon is called a splice
junction. Because splice junctions tend to have particular sequences, it is straightforward for splice junction handler 34 to identify splice junctions in a genomic DNA sequence.
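One simple way to look for candidate introns is sketched below. The text does not name the particular sequences involved; the sketch assumes the widely observed convention that an intron begins with GT and ends with AG, and it ignores the longer consensus patterns that a practical splice junction handler would probably use. The function names and the toy sequences are hypothetical.

```python
def candidate_introns(genomic, min_length=4):
    """List (start, end) spans that begin with GT and end with AG (assumed rule)."""
    donors = [i for i in range(len(genomic) - 1) if genomic[i:i + 2] == "GT"]
    ends = [j for j in range(2, len(genomic) + 1) if genomic[j - 2:j] == "AG"]
    return [(i, j) for i in donors for j in ends if j - i >= min_length]


def splice_out(genomic, start, end):
    """Remove the span [start, end) and join the flanking exons."""
    return genomic[:start] + genomic[end:]


# Toy genomic fragment: two exons separated by an invented intron.
exon1, intron, exon2 = "ATGCCC", "GTAAGTTTTTCAG", "ATTGTT"
genomic = exon1 + intron + exon2

spans = candidate_introns(genomic)
print((len(exon1), len(exon1) + len(intron)) in spans)  # True
print(splice_out(genomic, 6, 19))                       # ATGCCCATTGTT
```

The dinucleotide test produces many false candidates, so each candidate span would still need to be checked, for example by confirming that removing it places a start codon in frame with the high homology sequence.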
Motifs
A further feature of the invention is to provide a motif handler 44 that allows motifs to be
searched. Motifs are amino acid sequences that have a common pattern embedded in
nonhomologous sequences, i.e., motifs include commonly recurring amino acids in fixed
positional relationships. An example is a motif called the SH2 domain. This domain is a region
in some proteins in which certain amino acids recur in the same position. A partial
representation of an SH2 domain is shown below:
GNxxGxFL(V/I)RESExxxGxxSLSxx-xxxxxGDxxKHxK
where G=GLY, N=ASN, F=PHE, L=LEU, V=VAL, I=ILE, R=ARG, E=GLU, S=SER, D=ASP,
K=LYS, H=HIS, x=[any amino acid], - = [a gap in the sequence] (i.e., an amino acid may or may
not be present in this position), (V/I) = either VAL or ILE. In this case, permutations must be
done for the specified amino acids in the motif, and the permutations for the "x" amino acids must be permissive for any codon that codes for an amino acid. Motif handler 44 may utilize,
among others, the following two systems for dealing with motifs.
A first option is to perform permutations only on the amino acids that recur in the motif.
Positional search mechanism 36 is then utilized during the search to ensure that particular
elements in a given permutation are set at a proper distance apart from each other when sequence matching to the database 14 is performed.
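The first option can be sketched by turning a motif into a single pattern in which specified amino acids become alternations of their codons and each "x" becomes any codon that codes for an amino acid. This sketch assumes the standard genetic code and uses Python regular expressions as a stand-in for positional search mechanism 36; the function names and test sequences are hypothetical.

```python
import re
from collections import defaultdict
from itertools import product

# Standard genetic code in T, C, A, G order (assumed; "*" marks a stop signal).
BASES = "TCAG"
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
AA_TO_CODONS = defaultdict(list)
for triplet, aa in zip(product(BASES, repeat=3), AMINO):
    AA_TO_CODONS[aa].append("".join(triplet))

# Any triplet except the three stop codons.
ANY_SENSE_CODON = "(?!TAA|TAG|TGA)[ACGT]{3}"


def codon_group(amino_acids):
    """Alternation of every codon for the given amino acids."""
    codons = [c for aa in amino_acids for c in AA_TO_CODONS[aa]]
    return "(?:" + "|".join(codons) + ")"


def motif_to_regex(motif):
    """Compile a motif such as 'L(V/I)RESExxxG' into a DNA-level pattern."""
    parts, i = [], 0
    while i < len(motif):
        symbol = motif[i]
        if symbol == "(":                      # alternatives, e.g. (V/I)
            close = motif.index(")", i)
            parts.append(codon_group(motif[i + 1:close].split("/")))
            i = close + 1
            continue
        if symbol == "x":                      # any amino acid
            parts.append("(?:" + ANY_SENSE_CODON + ")")
        elif symbol == "-":                    # gap: an amino acid may be absent
            parts.append("(?:" + ANY_SENSE_CODON + ")?")
        else:                                  # a specified amino acid
            parts.append(codon_group([symbol]))
        i += 1
    return re.compile("".join(parts))


pattern = motif_to_regex("L(V/I)RESExxxG")
print(bool(pattern.search("CTTATCCGTGAATCTGAAAAACCCTTTGGT")))  # True
print(bool(pattern.search("CTTATCCGTGAATCTGAATAACCCTTTGGT")))  # False
```

The second test sequence fails because one of its "x" positions holds the stop codon TAA. Note that the pattern fixes the spacing between motif elements but does not by itself tie the match to the reading frame of an upstream start codon.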
A second option is to use motif handler 44 to first locate a primary region (e.g., a region
of highest contiguous recurrence) within the motif. For instance, in the example shown above, a
primary region could be "RESE." Next, codon permutations of this region would be generated to
create a subset of genes within the larger database that contain the permutation sequence. Next, motif handler 44 would identify a second search parameter and note its positional relationship to
the primary region, e.g., G, which is three amino acids downstream from RESE. Then, all the genes
that contain a codon for the second search parameter in the noted positional relationship would
be identified in the created subset. Again, positional search mechanism 36 maintains the codon permutations in the same fixed positional relationship as the commonly recurring amino acids in
the motif, thereby facilitating positional searching of codons.
In the above example, search interface system 24 would search for high homology
sequences that have GGT, GGC, GGA, or GGG (i.e., one of the codons for GLY) nine
nucleotides downstream from the last codon for the RESE component of the motif. Thus, one sequence that would be searched would be CGT GAA TCT GAA XXX XXX XXX GGT, which
would be searched in the subset of all genes with the amino acid sequence RESE. (CGT is a
codon for R, GAA is a codon for E, and TCT is a codon for S.) The above DNA sequence
corresponds to the protein motif sequence RESExxxG. Additional iterations of this procedure
could be performed to identify genes which contain the entire motif.
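The two-stage search for RESExxxG can be sketched as follows. The sketch assumes the standard genetic code, an invented in-memory database, and exact matching; checking whether a twelve nucleotide window translates to RESE is used here as an equivalent shortcut for matching one of the codon permutations of RESE. All names are hypothetical.

```python
from collections import defaultdict
from itertools import product

# Standard genetic code in T, C, A, G order (assumed; "*" marks a stop signal).
BASES = "TCAG"
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TO_AA = {"".join(t): aa for t, aa in zip(product(BASES, repeat=3), AMINO)}
AA_TO_CODONS = defaultdict(list)
for codon, aa in CODON_TO_AA.items():
    AA_TO_CODONS[aa].append(codon)


def translate(dna):
    """Translate a DNA string whose length is a multiple of three."""
    return "".join(CODON_TO_AA[dna[i:i + 3]] for i in range(0, len(dna), 3))


def primary_hits(gene, primary):
    """Offsets at which some codon permutation of the primary region occurs."""
    width = 3 * len(primary)
    return [i for i in range(len(gene) - width + 1)
            if translate(gene[i:i + width]) == primary]


def two_stage_search(database, primary, secondary, gap_codons):
    """Stage 1: genes containing the primary region. Stage 2: positional check."""
    subset, final = {}, []
    for name, gene in database.items():
        hits = primary_hits(gene, primary)
        if hits:
            subset[name] = hits
    for name, hits in subset.items():
        gene = database[name]
        for start in hits:
            offset = start + 3 * len(primary) + 3 * gap_codons
            if gene[offset:offset + 3] in AA_TO_CODONS[secondary]:
                final.append(name)
                break
    return sorted(subset), sorted(final)


# Invented sequences for illustration only.
database = {
    "g1": "ATG" + "CGTGAATCTGAA" + "AAACCCTTT" + "GGT" + "TAA",  # RESExxxG
    "g2": "ATG" + "CGTGAATCTGAA" + "AAACCCTTT" + "CCT" + "TAA",  # RESExxxP
    "g3": "ATGAAACCCTTTTAA",                                     # no RESE
}
print(two_stage_search(database, "RESE", "G", gap_codons=3))
```

Here stage one keeps g1 and g2, and stage two keeps only g1, whose GLY codon lies nine nucleotides downstream from the last codon of RESE. Narrowing to the subset first means the positional check runs on far fewer sequences than the full database.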
The foregoing description of the preferred embodiments of the invention has been
presented for purposes of illustration and description. They are not intended to be exhaustive or
to limit the invention to the precise form disclosed, and obviously many modifications and
variations are possible in light of the above teachings. Such modifications and variations that
are apparent to a person skilled in the art are intended to be included within the scope of this invention as defined by the accompanying claims.