WO2001096860A1

WO2001096860A1 - A system and method for identifying DNA sequences that could code into a string of amino acids

Info

Publication number: WO2001096860A1
Application number: PCT/US2001/040930
Authority: WO
Inventors: Lawrence S. Zisman
Original assignee: Zisman Lawrence S
Priority date: 2000-06-13
Filing date: 2001-06-12
Publication date: 2001-12-20
Also published as: AU2001267070A1

Abstract

A computerized method and system for identifying DNA sequences that could code for a string of inputted amino acids and then identifying genes within the DNA sequences that are potentially responsible for creating the amino acid string. The system comprises means for: (1) inputting a string of amino acids; (2) generating all possible codon permutations that could code into the inputted string of amino acids; and (3) examining a gene database to identify DNA sequences that match the codon permutations. The system further comprises mechanisms for identifying start codons and handling splice junctions in the gene database. Additionally, the invention includes a system for handling amino acid sequences inputted in the form of a motif.

Description

A SYSTEM AND METHOD FOR IDENTIFYING DNA SEQUENCES THAT COULD CODE INTO A STRING OF AMINO ACIDS

BACKGROUND OF THE INVENTION

1. Technical Field

The present invention relates to computerized systems for studying DNA, and more specifically, to a system and method for identifying DNA sequences that could code into a string of amino acids.

2. Related Art

Presently, scientists are engaged in an intense research effort to characterize the genomes

of human and selected model organisms through complete mapping and sequencing of their

DNA (deoxyribonucleic acid), and to develop technologies for genomic analysis. A genome comprises all of the genetic material found in the chromosomes of a particular organism. For instance, the human genome consists of 50,000 to 100,000 genes located on 23 pairs of

chromosomes. The first complete human genome to be sequenced will be a composite of

sequences from many sources, most of these being cell lines that have existed in laboratories all

over the world for some time. The sequence will thus be a generic sequence representative of humans in general and not of any particular individual. The complete sequence will provide a

standard against which other partial sequences can be compared.

DNA is contained in self-replicating genetic structures known as chromosomes. Each chromosome contains a long molecule of DNA, the chemical of which genes are made. The DNA, in turn, is a double-stranded molecule in which each strand is a linear array of units called nucleotides or bases. There are four different bases, called adenine "A," thymine "T," guanine "G," and cytosine "C." The bases on one strand of DNA are precisely paired with the bases on the other strand, so that an A is always opposite T and G opposite C. The order of the four units on the DNA strand determines the information content of a particular gene or piece of DNA. Genes are of differing length, ranging in size from roughly 2,000 to as many as 2,000,000 base pairs. Mapping is the process of determining the position and spacing of genes, or other landmarks, on the chromosomes relative to one another. Sequencing is the process of determining the order of the nucleotides, or base pairs, in a DNA molecule.

Although mapping of human genes began early in the twentieth century, it has been intensively pursued only for the past two decades. For most of this period the methods that were developed, though original and ingenious, have been inadequate for comprehensive mapping and have only allowed the construction of relatively crude maps with very little detail. Recently, much more effective technology has been introduced. At the beginning of the 21st century, about 1,700 of the estimated 50,000 to 100,000 human genes (less than 2 percent) have been mapped. Scientists involved in the Human Genome Project hope to find the location of the 50,000 to 100,000 or so human genes and to read the entire genetic script, all three billion bits of information, by the year 2005. More information on the Human Genome Project is available on the world wide web at sites such as the National Human Genome Research Institute at www.nhgri.gov.

The information generated by genome projects is expected to be the source book for biomedical science in the 21st century and will be of immense benefit to the field of medicine. It will help us to understand and eventually treat many of the more than 4000 genetic diseases that afflict mankind, as well as the many multifactorial diseases in which genetic predisposition plays an important role. Scientists are now just beginning to formulate uses for these vast databases of information. To fully exploit these databases, it will be vital to develop new methods and tools for the analysis and interpretation of genome maps and DNA sequences. One such area that will potentially benefit from genome projects is the study of proteins and their correlation with genetic information, as genes are responsible for determining which proteins get created. Proteins are made up of a string or sequence of amino acids. DNA contains

the information about how amino acids are put together in a series to form a protein.

Accordingly, a gene is said to code into a specific protein. For this coding to occur, DNA serves as a template for the creation of mRNA. The process by which RNA is made from DNA is

called transcription. After transcription, mRNA is further processed until it can serve as a template for linking amino acids together. The process by which an mRNA sequence is

converted into an amino acid sequence is called "translation." Translation is performed by

ribosomes which line up tRNAs with mRNAs. Each tRNA has three nucleotides which

complement three nucleotides on the mRNA. Each tRNA also carries a specific amino acid. As tRNAs are lined up along the mRNA a covalent bond is formed between the amino acid on that

tRNA and the amino acid already present. A key result is that a specific nucleotide triplet corresponds to a specific amino acid. However, there is some redundancy so that several

different combinations of nucleotide triplets can correspond to the same amino acid.

Until now, most efforts have focused on identifying a protein and related amino acid details from a DNA sequence. The prior art fails, however, to provide systems for performing a reverse translation, i.e., generating all possible DNA sequences from a single amino acid sequence. Given the correspondence between protein data (i.e., amino acids) and genetic information (i.e., nucleotide triplets), there exist any number of potential applications where such systems could be useful. For instance, a research scientist studying a tissue sample populated with proteins of interest, e.g., a portion of a failing heart, may be interested in determining the gene responsible for coding to that protein. Until now, however, there has been no automated mechanism for identifying the gene or collection of genes potentially responsible for coding into that protein.

SUMMARY OF THE INVENTION

The present invention addresses the above-mentioned problems by providing a

computerized method and system for identifying DNA sequences that could code for a string of

inputted amino acids and then for identifying genes within the DNA sequences that are

potentially responsible for creating the amino acid string. In a first aspect, the invention comprises a method of: (1) inputting a string of amino acids; (2) generating all possible codon permutations that could code into the inputted string of amino acids; and (3) examining a gene

database to identify DNA sequences that match the codon permutations.

In a second aspect, the invention comprises a computerized method for identifying DNA

sequences that could code for an inputted motif, comprising the steps of: (1) identifying within

the motif a primary region, wherein the primary region includes a string of amino acids having

contiguous recurrence in the motif; (2) searching a gene database to identify all codon permutations that could code into the primary region; (3) identifying a subset of the gene

database, wherein the subset contains genes that could translate into the primary region; (4)

selecting within the motif a secondary amino acid apart from the primary region; (5) determining a positional relationship of the secondary amino acid with respect to the primary region; (6)

searching the subset of the gene database for DNA sequences that include codon permutations that could code into both the primary region and the secondary amino acid, at the determined

positional relationship.

In a third aspect, the invention comprises a program product stored on a recordable medium for identifying DNA sequences that could code into an inputted amino acid sequence, comprising: (1) means for inputting the sequence of amino acids; (2) means for generating codon permutations that could code into the inputted sequence of amino acids; and (3) means for searching a gene database to identify DNA sequences that match the generated codon permutations.

In a fourth aspect, the invention comprises a computer system for identifying DNA sequences that could code into an inputted sequence of amino acids, comprising: a central processing unit, a memory, a permutation generator for generating possible codon permutations for an inputted sequence of amino acids, and a search interface system for searching a gene database to identify DNA sequences that match the codon permutations.

It is therefore an advantage of the present invention to allow a researcher to start with a limited amount of information regarding a particular protein (i.e., a peptide sequence or amino acid string) and end up with the entire DNA code(s), i.e., gene(s), for the protein, or families of the protein, of interest.

It is therefore a further advantage of the present invention to provide a system to screen DNA databases for genes that could potentially code for proteins containing a pre-specified amino acid sequence (peptide).

It is therefore a further advantage of the present invention to provide a permutation generator that can perform reverse translation of an amino acid sequence into all possible DNA

sequences.

It is therefore a further advantage of the present invention to provide a comprehensive

system that includes a reverse translation system, permutation generation system, a homology

search engine, and a system for screening for in-frame start codons to select valid gene candidates.

BRIEF DESCRIPTION OF THE DRAWINGS

The preferred exemplary embodiment of the present invention will hereinafter be described in conjunction with the appended drawings, where like designations denote like elements, and:

Figure 1 depicts a block diagram of a computer system for identifying DNA sequences that could code for a string of amino acids in accordance with a preferred embodiment of the present invention.

Figure 2 depicts a block diagram of the operational flow of the present invention.

Figure 3 depicts a table showing nucleotide triplets and their corresponding amino acids.

It should be noted that the drawings are merely schematic representations, not intended to portray specific parameters of the invention. The drawings are intended to depict only typical embodiments of the invention, and therefore should not be considered as limiting the scope of the invention. In the drawings, like numbering represents like elements between the drawings.

DETAILED DESCRIPTION OF THE DRAWINGS

Overview

As noted above, a feature of this invention is to identify genes that could potentially code into an inputted amino acid sequence or a protein. Genes are made up of DNA sequences, and DNA is made up of combinations of nucleotides. DNA contains the information about how amino acids are put together in a series to form a protein. There are four possible nucleotides: A, T, G, and C. A series of three contiguous nucleotides constitutes a triplet. Codons are nucleotide triplets that code for an amino acid or a stop signal. Figure 3 shows a table depicting the list of nucleotide triplets (i.e., codons) along with the corresponding amino acid. Blanks in the table indicate a "stop signal." A key point is that a specific nucleotide triplet (i.e., codon) corresponds to a specific amino acid. However, there is some redundancy in that several different codons can correspond to the same amino acid. Thus, given an amino acid sequence, there exists a corresponding set of codon sequences, with the number of codon sequences in the set being dependent on the number of possible permutations of corresponding codons. The set of codon sequences, once identified, can be used to identify corresponding genes that were potentially responsible for creating a given amino acid sequence.
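The codon redundancy described above can be illustrated with a short sketch. It assumes the standard genetic code written in the conventional T, C, A, G ordering, which is not taken from Figure 3 itself; the names BASES, AMINO, CODON_TO_AA and AA_TO_CODONS are hypothetical, and one-letter amino acid codes are used with "*" standing for a stop signal.

```python
from collections import defaultdict

BASES = "TCAG"
# Standard genetic code (an assumption, not copied from Figure 3), one letter
# per codon in T, C, A, G order; "*" marks a stop signal.
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"

# Forward table: codon -> amino acid (one-letter code).
CODON_TO_AA = {
    b1 + b2 + b3: AMINO[16 * i + 4 * j + k]
    for i, b1 in enumerate(BASES)
    for j, b2 in enumerate(BASES)
    for k, b3 in enumerate(BASES)
}

# Reverse table: amino acid (or stop) -> every codon that codes for it.
AA_TO_CODONS = defaultdict(list)
for codon, aa in CODON_TO_AA.items():
    AA_TO_CODONS[aa].append(codon)

# Show how many codons correspond to each amino acid or stop signal.
for aa in sorted(AA_TO_CODONS):
    print(aa, len(AA_TO_CODONS[aa]), " ".join(AA_TO_CODONS[aa]))
```

The reverse table is the piece a reverse-translation step would consult: each key maps to between one and six codons.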

Computer System & Software

Referring now to Figure 1, a computer system 10 is shown that includes a central processing unit (CPU) 16, an input/output (I/O) system 18, bus 28, and memory 20. Stored in memory 20 is a software program 26 comprising permutation generator 22 and search interface system 24. Memory 20 may comprise any known type of data storage and/or transmission media, including magnetic media, optical media, random access memory (RAM), read only memory (ROM), a data cache, a data object, etc. Moreover, memory 20 may reside at a single physical location, comprising one or more types of data storage, or be distributed across a plurality of physical systems in various forms. CPU 16 may likewise comprise a single processing unit, or be distributed across one or more processing units in one or more locations, e.g., on a client and a server. I/O 18 may comprise any system for exchanging information with external sources. User interface 12 is in communication with computer system 10 via datalink 30. User interface 12 may comprise any known type of device for inputting and receiving information into computer system 10, including a CRT, LED screen, hand-held device, keyboard, mouse, voice recognition system, speech output system, printer, facsimile, pager, etc. Gene database 14 is in communication with computer system 10 via datalink 31. User interface 12 and gene database 14 may be linked to computer system 10 in any known way, including via an internet, intranet, worldwide web, local area network, wide area network, etc. Alternatively, user interface 12 and/or gene database 14, may be integrated into computer system 10. Datalinks 30 and 31, and bus 28 may comprise any known type of transmission link including electrical, optical, wireless, etc. In addition, although not shown, additional components, such as cache memory, communications systems, systems software, etc., may be incorporated into computer system 10.

It is understood that the present invention can be realized in hardware, software, or a

combination of hardware and software. As indicated above, the computer system 10 according

to the present invention can be realized in a centralized fashion in a single computerized workstation, or in a distributed fashion where different elements are spread across several

interconnected computer systems. Any kind of computer system - or other apparatus adapted for

carrying out the methods described herein - is suited. A typical combination of hardware and

software could be a general purpose computer system with a computer program that, when

loaded and executed, controls the computer system 10 such that it carries out the methods described herein. Alternatively, a specific use computer, containing specialized hardware for

carrying out one or more of the functional tasks of the invention could be utilized. The present

invention can also be embedded in a computer program product, which comprises all the features

enabling the implementation of the methods and functions described herein, and which - when loaded in a computer system - is able to carry out these methods and functions. Computer program, software program, program, program product, or software, in the present context mean

any expression, in any language, code or notation, of a set of instructions intended to cause a

system having an information processing capability to perform a particular function either

directly or after either or both of the following: (a) conversion to another language, code or notation; and/or (b) reproduction in a different material form.

Referring now to Figure 2, a box diagram is shown depicting the operation of software program 26. Permutation generator 22 operates by obtaining an amino acid string or sequence 40. Permutation generator 22 then generates codon sequences 46 that could code to the inputted amino acid sequence. This operation utilizes an amino acid/codon table 42 and a permutation algorithm, which are described in more detail below. Once the list of codon sequences 46 is generated, the sequences are inputted into search interface system 24. Search interface system 24 compares each codon sequence to gene sequences in one or more gene databases 14. When a match is identified, search interface system 24 then attempts to identify the gene containing the matched sequence. The gene identification process includes identifying a start codon utilizing start codon identifier 32. Often, start codons are separated from the matched sequence by splice junctions. To facilitate the process of identifying remotely located start codons, a splice junction handler 34 is included to identify and handle splice junctions. This process is described in more detail below. After search interface system 24 identifies the genes containing matches, the genes 48 that code for the inputted amino acid sequence are outputted.

In addition to handling complete amino acid sequences 40, the system can also handle

partial amino acid sequences, known as motifs. Motifs are amino acid sequences that have a

common pattern embedded in nonhomologous sequences. The invention includes a motif

handler 44 and a positional search mechanism 36 to facilitate the process, which are described in more detail below.

Finally, in addition to entering an amino acid sequence 40, search criteria 50 may also be inputted into the system. The search criteria 50 can be used to instruct the search interface system 24 to, among other things, search a specific gene database 14, search a specific genome, specify restraints on how to search for start codons, specify whether to look for splice junctions, or specify how to search for motifs.

Permutation Generator

As noted above, permutation generator 22 generates a set of codon sequences (DNA sequences) that could code for an inputted sequence of amino acids. For example, if a sequence containing the four amino acids CYS, GLU, SER, GLU was inputted, a set of all possible codon sequences that could code into the inputted amino acids, each having four codons, would be generated. As shown in the table of Figure 3, there exist two codons {TGT, TGC} that could code to CYS; two codons {GAA, GAG} that could code to GLU; six codons {TCT, TCC, TCA, TCG, AGT, AGC} that could code to SER; and two codons {GAA, GAG} that could code to GLU. The total number of permutations therefore would be 2 x 2 x 6 x 2 = 48, which is the product of the number of possible codons that correspond to each amino acid in the inputted sequence. A partial list of the resulting codon sequences that could code for the inputted amino acid sequence would comprise: (1) TGT, GAA, TCT, GAA; (2) TGT, GAA, TCT, GAG ... (48) TGC, GAG, AGC, GAG.
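The count and enumeration in this example can be checked with a minimal sketch. It is not the matrix-based method of the specification; it simply takes a Cartesian product over the codon choices, assuming the standard genetic code in T, C, A, G order. The function names count_codon_sequences and codon_sequences are hypothetical, and the peptide is written in one-letter codes (C = CYS, E = GLU, S = SER).

```python
from itertools import product
from math import prod

BASES = "TCAG"
# Standard genetic code (assumed), one letter per codon; "*" marks a stop signal.
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"

# Reverse table: amino acid -> list of codons, in T, C, A, G enumeration order.
AA_TO_CODONS = {}
for i, b1 in enumerate(BASES):
    for j, b2 in enumerate(BASES):
        for k, b3 in enumerate(BASES):
            aa = AMINO[16 * i + 4 * j + k]
            AA_TO_CODONS.setdefault(aa, []).append(b1 + b2 + b3)


def count_codon_sequences(peptide):
    """Product of the number of codons available at each position."""
    return prod(len(AA_TO_CODONS[aa]) for aa in peptide)


def codon_sequences(peptide):
    """Every codon sequence that could code for the peptide."""
    choices = [AA_TO_CODONS[aa] for aa in peptide]
    return [" ".join(combo) for combo in product(*choices)]


peptide = "CESE"  # CYS, GLU, SER, GLU
sequences = codon_sequences(peptide)
print(count_codon_sequences(peptide), sequences[0], "...", sequences[-1])
```

Because the count grows as a product, long peptides rich in six-codon amino acids produce very large sets, so a practical implementation would likely generate sequences lazily rather than as a list.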

Permutation Algorithm

The following is an example of a method of generating permutations of codon sequences

based on an inputted amino acid sequence. This particular method utilizes matrices to manage the permutations. It is understood that the following method is not intended to be limiting, and any algorithm for generating permutations that can be implemented by a person skilled in the art

is believed to be within the scope of the invention.

The first step is to define a Master Codon Table (Table 1 shown below), which contains a list of all the amino acids and their possible codons, as well as the number of total possible codons for a given amino acid. The possible codons for the stop signal are also included in this table. As can be seen, there are 20 possible amino acids and one stop signal (d = 1..21), and each amino acid may have up to six corresponding codons (q = 1..6).

Table 1

The next step is to define various matrices and tables that are utilized in the process. Each matrix uses addresses which comprise three components: row and column coordinates and a symbol which indicates the target matrix to which these coordinates are applied. An address locates a specific position within the designated matrix.

A value is the object and all the properties attached to that object in a specific location of a matrix where the specific location is uniquely defined by an address. A value in a specific location designated by an address can itself be an address in another matrix.

A reference matrix [Xaj] is then defined, where a = 1 to γ and j = 1 to e. The reference matrix address (column and row coordinates) assigns location within the matrix. The values in the reference matrix constitute a "codon table" and are addresses to specific locations in the master codon table.

Next, an Input Matrix [Ij] for holding the inputted amino acids is defined, where j = 1 to e (j represents the position of a given amino acid in the amino acid sequence that is input). Each amino acid input has a corresponding d value in the master codon table (the d value should be placed under the corresponding amino acid by a link to the master codon table). Amino acids from the input table are linked to their location in the master codon table:

Next, a permutation generator (PG) address matrix [Yαβ] is defined, where α = 1 to e, and β = 1 to g (the PG address assigns location within this matrix). The values in the PG matrix are addresses to specific positions in the reference matrix.

Next, various operations are defined. Operation mqda → mqdv shows the value in position mqd of the master codon table. Operation [Xaj] → [Yαβ] transposes address locations in the reference matrix to the PG matrix so that the address in the PG matrix is the transposition of the address in the reference matrix (i.e., the codon table). Operation μ(2| Ynβ*Y(n+1)β) generates all permutations of two elements, one element from column α = n, one element from column α = n+1, for all elements 1 to β of each column. Cμ2(Ynβ|Y(n+1)β) is the superset of the sets that contain all possible 2 element permutations of Ynβ and Y(n+1)β.

For example, consider the following PG matrix:

In this example, the numbers shown in the PG grid are simple integers for purposes of describing the concept of the permutation generator; in the actual implementation, these integers would be replaced by the reference matrix address, i.e., the 1 in position Y11 would be X11, the 2 in position Y12 would be X12, the 2 in position Y22 would be X22. X11 would be the location in the reference matrix. The value in the X11 position of the reference matrix would in turn be a location in the master codon table. X12 would not necessarily equal X22. In this example X12 could not equal X22 because the βmax in the two respective Y columns is different, indicating that the two columns represent codons for different amino acids. The location in the master codon table would contain the codon.

In this example, Cμ2(Y1β|Y2β) = {(1,1), (1,2), (2,1), (1,3), (3,1), (1,4), (2,2), (2,3), (2,4), (3,2), (3,3), (3,4)}. For each element in Cμ2(Y1β|Y2β), Cμ2(Y3β|Y4β) can be generated as follows:

(1,1) → Cμ2(Y3β|Y4β)

(1,2) → Cμ2(Y3β|Y4β)

(2,1) → Cμ2(Y3β|Y4β)

...

(3,4) → Cμ2(Y3β|Y4β).

The operation Cμ(4: Cμ2(Y1β|Y2β) || Cμ2(Y3β|Y4β)) defines the set of four element subsets which represent all possible permutations of elements in the sets Cμ2(Y1β|Y2β) and Cμ2(Y3β|Y4β), i.e., (1,1) combined with (1,1) → (1,1,1,1); (1,1) combined with (1,2) → (1,1,1,2); etc. For each element of Cμ(4: Cμ2(Y1β|Y2β) || Cμ2(Y3β|Y4β)), Cμ2(Y5β|Y6β) is generated, i.e., (1,1,1,1) → Cμ2(Y5β|Y6β); (1,1,1,2) → Cμ2(Y5β|Y6β); etc. This results in the set Cμ6(6: Cμ2(Y1β|Y2β) || Cμ2(Y3β|Y4β) || Cμ2(Y5β|Y6β)), which represents all of the possible permutations of the elements shown in the above example.

A general operation may be defined as: Cμp = Cμ(p| Cμ2(Y(n)β|Y(n+1)β) || Cμ2(Y(n+2)β|Y(n+3)β) || ... || Cμ2(Y(n+p-2)β|Y(n+p-1)β)), which generates the sets containing p elements that represent all possible permutations of elements in the sets contained as elements in the groups Cμ2(Y(n)β|Y(n+1)β), Cμ2(Y(n+2)β|Y(n+3)β), through Cμ2(Y(n+p-2)β|Y(n+p-1)β). Defined rules for the operation would include:

1. Always start at n = 1.

2. For a given set, if any value of any element = 0, then delete the set containing the element of value = 0.

3. Perform iterations of μ until n+x+2 = j, where x = integer.

4. If n+x+2 > j (i.e., n+x+1 = j), for each element of Cμ(n+x) generate Cμ1(Y(n+x+1)β), where Cμ1 is the set of elements in Y(n+x+1)β.

Thus, a given permutation of two elements, one from Ynβ and one from Y(n+1)β, constitutes an element in the total set of permutations Cμ2(Ynβ,Y(n+1)β).
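One way to read the pairwise operation and its rules is sketched below. This is an interpretation under stated assumptions, not the specification's implementation: columns are plain Python lists, a value of 0 is treated as padding for a column with fewer codons (so filtering zeros up front has the same effect as deleting sets that contain a zero), and the function name pairwise_permutations is hypothetical.

```python
from itertools import product


def pairwise_permutations(columns):
    """Combine PG-matrix columns two at a time into full-length tuples."""
    # Rule 2 (interpreted): drop zero-valued padding before combining.
    cols = [[v for v in col if v != 0] for col in columns]

    results = [()]  # start with one empty tuple (rule 1: begin at the first column)
    n = 0
    # Rule 3: keep pairing column n with column n+1 while a full pair remains.
    while n + 1 < len(cols):
        pairs = list(product(cols[n], cols[n + 1]))
        results = [r + p for r in results for p in pairs]
        n += 2
    # Rule 4: with an odd number of columns, append the last column singly.
    if n < len(cols):
        results = [r + (v,) for r in results for v in cols[n]]
    return results


# Two columns with 3 and 4 entries, as in the integer example above;
# the first column is padded with a zero.
pairs = pairwise_permutations([[1, 2, 3, 0], [1, 2, 3, 4]])
print(len(pairs), pairs[:3])
```

The result is the same set a single Cartesian product over all columns would give; the pairwise grouping only changes the order in which the tuples are built up.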

Example:

The following example illustrates the permutation generator program. The input

comprises an amino acid sequence containing six amino acids, which is stored in the input

matrix below.

Next, a codon table in the reference matrix from the input data is created.

The program then transposes the data in the reference matrix to the permutation generator

matrix:

The program then generates all possible permutations of the codons that could yield the input amino acid sequence by performing iterations of μ(2| Ynβ*Y(n+1)β):

Cμ2(Y1β|Y2β) = {(X11,X12), (X11,X22), (X21,X12), (X11,X32), (X31,X12), (X11,X42), (X21,X22), (X21,X32), (X21,X42), (X31,X22), (X31,X32), (X31,X42)}

For each element in Cμ2(Y1β|Y2β) generate Cμ2(Y3β|Y4β):

(X11,X12) → Cμ2(Y3β|Y4β)

(X11,X22) → Cμ2(Y3β|Y4β)

(X21,X12) → Cμ2(Y3β|Y4β), etc.

After all iterations are complete, there is a list of Xaj addresses, with the values in these Xaj locations being addresses to locations in the Mqd matrix, i.e.:

M13 M15 M17 M120 M13 M18

M23 M25 M27 M220 M23 M28, etc.

The program then calls the values in the Mqd locations and displays them as codon strings:

ATT GTT CCT GGT ATT ACT

ATC GTC CCC GGC ATC ACC, etc.

Each codon string represents one of all the possible codon permutations that could code for the input amino acid sequence.
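The address indirection in this example, where permutations are built from Mqd addresses and only resolved to codons at the end, can be sketched as follows. The d values (3, 5, 7, 20, 8) and the codons at q = 1 and q = 2 are read from the addresses and codon strings above; the ordering of the remaining codons (q = 3, 4) and the names MASTER, value_at and address_permutations are assumptions made for illustration.

```python
from itertools import product

# Fragment of a master codon table, keyed by d; list position q-1 holds codon q.
# Only the five d values that appear in the example are included.
MASTER = {
    3: ["ATT", "ATC", "ATA"],          # ILE
    5: ["GTT", "GTC", "GTA", "GTG"],   # VAL
    7: ["CCT", "CCC", "CCA", "CCG"],   # PRO
    8: ["ACT", "ACC", "ACA", "ACG"],   # THR
    20: ["GGT", "GGC", "GGA", "GGG"],  # GLY
}


def value_at(q, d):
    """Resolve the address Mqd to the codon stored there."""
    return MASTER[d][q - 1]


def address_permutations(d_values):
    """All permutations of (q, d) addresses, one address per input position."""
    columns = [[(q, d) for q in range(1, len(MASTER[d]) + 1)] for d in d_values]
    return list(product(*columns))


input_d = [3, 5, 7, 20, 3, 8]  # ILE VAL PRO GLY ILE THR
perms = address_permutations(input_d)

first = perms[0]
labels = " ".join(f"M{q}{d}" for q, d in first)
codons = " ".join(value_at(q, d) for q, d in first)
print(len(perms), labels, codons)
```

Keeping addresses rather than codon strings until the final step mirrors the description: the same permutation list could be resolved against a different master table without being regenerated.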

Gene Searching

Referring again to Figure 2, once the codon sequences 46 are generated, they can be compared to sequences in gene database 14, such as GenBank. Programs that perform this task, referred to as homology search programs 33, may include known programs such as BLAST™ and FASTA™, or can comprise customized systems to handle multiple sequence searching. Homology search programs 33 will identify high homology sequences, which comprise sequences that match the inputted sequence.

After high homology sequences are found, other regions of the gene must be examined to identify a segment homologous to the inputted permutation. Specifically, start codon identifier 32 may be utilized to identify a start codon for the high homology sequence. A start codon is an ATG codon upstream (prior to) and "in frame" with the high homology sequence. (A start codon is in frame if there exists a whole number of triplets between the start codon and the high homology sequence.) Thus, start codon identifier 32 examines upstream sequences to find a start codon (ATG) that would result in a reading frame that could result in expression of the permutation sequence as a protein. For example, assume the codon sequence ATT GTT CCT GGT ATT ACT, which resulted from the example above, was inputted into a search engine and matched a gene having a partial sequence:

GCA ATG CCCG ATT GTT CCT GGT ATT ACT

The inputted permutation sequence (the final eighteen nucleotides) is present, and there is an upstream ATG sequence (the fourth through sixth nucleotides). However, the upstream ATG start codon is not "in frame" to result in an expression of the amino acid sequence of interest. (There is not a whole number of triplets between the start codon and the inputted sequence: the four nucleotides CCCG lie between them.) The ATG open reading frame in this example is:

ATG CCC GAT TGT TCC TGG TAT TAC

which would translate as: MET PRO ASP CYS SER TRP TYR TYR. However, the inputted amino acid sequence was: ILE VAL PRO GLY ILE THR. Thus, start codon identifier 32 must look further upstream to identify a start codon that is in frame. A program that can find open reading frames in a gene can be readily implemented by one skilled in the art.
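The in-frame test described above reduces to a modulo-3 check on the distance between an upstream ATG and the start of the matched sequence. The following is a minimal sketch of that check only; it does not look for intervening stop codons or splice junctions, and the function name in_frame_start_codons and the five-nucleotide prefix used in the second case are hypothetical.

```python
def in_frame_start_codons(gene, match_start):
    """Return positions of upstream ATG codons that are in frame with the match.

    An ATG at position i is in frame when the distance from i to match_start
    is a whole number of triplets.
    """
    return [
        i
        for i in range(0, match_start - 2)  # the ATG must end at or before the match
        if gene[i:i + 3] == "ATG" and (match_start - i) % 3 == 0
    ]


# The partial gene sequence and matched permutation from the example above.
gene = "GCAATGCCCGATTGTTCCTGGTATTACT"
match = "ATTGTTCCTGGTATTACT"
match_start = gene.find(match)
print(match_start, in_frame_start_codons(gene, match_start))

# A made-up upstream extension that places a second ATG in frame.
longer = "ATGCC" + gene
longer_start = longer.find(match)
print(longer_start, in_frame_start_codons(longer, longer_start))
```

In the first case the only upstream ATG sits seven nucleotides before the match, so nothing qualifies; in the second, the added ATG sits fifteen nucleotides upstream and is reported.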

An additional problem that may arise when searching for a start codon relates to non-coding sequences that may exist in the gene database. These non-coding sequences must be effectively removed from the search, which is done by splice junction handler 34. Non-coding sequences arise as a consequence of the distinction between genomic DNA and cDNA. Genomic DNA is the "blueprint" from which mRNA is transcribed. mRNA is transcribed from genomic DNA but may contain a sequence that does not code for a protein. These non-coding sequences are called intervening sequences or "introns." Before protein is made from an mRNA, the introns are spliced out. The remaining sequences, which are not excised, are called "exons" and are linked together to form the sequence from which the protein is translated. The term cDNA refers to the DNA sequence that would be obtained if one converted mRNA after removal of its introns directly into DNA. cDNA is a DNA copy of mRNA that has been processed and is ready for translation. Thus, if genomic DNA is searched, one may find that a start codon is separated from the sequence found to be homologous to the permutation codon string by one or more introns. The intron which separates the start codon from the high homology region could be several thousand base pairs (nucleotides) in length. Accordingly, splice junction handler 34 is provided to identify remote start codons. The boundary between an intron and an exon is called a splice junction. Because splice junctions tend to have particular sequences, it is straightforward for splice junction handler 34 to identify splice junctions in a genomic DNA sequence.

Motifs

A further feature of the invention is to provide a motif handler 44 that allows motifs to be searched. Motifs are amino acid sequences that have a common pattern embedded in nonhomologous sequences, i.e., motifs include commonly recurring amino acids in fixed positional relationships. An example is a motif called the SH2 domain. This domain is a region in some proteins in which certain amino acids recur in the same position. A partial representation of an SH2 domain is shown below:

GNxxGxFL(V/I)RESExxxGxxSLSxx-xxxxxGDxxKHxK

where G=GLY, N=ASN, F=PHE, L=LEU, V=VAL, I=ILE, R=ARG, E=GLU, S=SER, D=ASP, K=LYS, H=HIS, x=[any amino acid], - = [a gap in the sequence] (i.e., an amino acid may or may not be present in this position), (V/I)= either VAL or ILE. In this case, permutations must be done for the specified amino acids in the motif, and the permutations for the "x" amino acids must be permissive for any codon that codes for an amino acid. Motif handler 44 may utilize, among others, the following two systems for dealing with motifs.

A first option is to perform permutations only on the amino acids that recur in the motif.

Positional search mechanism 36 is then utilized during the search to ensure that particular

elements in a given permutation are set at a proper distance apart from each other when sequence matching to the database 14 is performed.

A second option is to use motif handler 44 to first locate a primary region (e.g., a region

of highest contiguous recurrence) within the motif. For instance, in the example shown above, a

primary region could be "RESE." Next, codon permutations of this region would be generated to

create a subset of genes within the larger database that contain the permutation sequence. Next, motif handler 44 would identify a second search parameter and note its positional relationship to

the primary region, e.g., the G that lies downstream of RESE, separated from it by three amino acids. Then, all the genes

that contain a codon for the second search parameter in the noted positional relationship would be identified in the created subset. Again, positional search mechanism 36 maintains the codon permutations in the same fixed positional relationship as the commonly recurring amino acids in

the motif, thereby facilitating positional searching of codons.
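
The second option can be illustrated with a two-stage filter. The sketch below is a hedged example rather than the disclosed implementation: the gene database is assumed to be an in-memory dictionary of gene name to uppercase DNA string, the standard genetic code is assumed, and the function names (codon_permutations, primary_hits, two_stage_search) and the gap parameter are hypothetical. The intervening "x" codons are not validated here.

```python
from itertools import product

# Standard genetic code, codons ordered T, C, A, G at each position ("*" = stop).
BASES = "TCAG"
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TO_AA = {"".join(c): aa for c, aa in zip(product(BASES, repeat=3), AMINO)}


def codon_permutations(peptide):
    """Return every DNA string that translates to peptide."""
    choices = [
        [codon for codon, aa in CODON_TO_AA.items() if aa == residue]
        for residue in peptide
    ]
    return {"".join(combo) for combo in product(*choices)}


def primary_hits(dna, primary):
    """Return 0-based offsets where any codon permutation of primary occurs."""
    perms = codon_permutations(primary)
    width = 3 * len(primary)
    return [i for i in range(len(dna) - width + 1) if dna[i:i + width] in perms]


def two_stage_search(genes, primary, secondary, gap):
    """Stage 1 keeps genes containing the primary region. Stage 2 keeps the
    hits followed, gap codons after the primary region, by a codon for the
    secondary amino acid."""
    # Stage 1: create the subset of genes that contain the primary region.
    subset = {name: primary_hits(dna, primary) for name, dna in genes.items()}
    subset = {name: hits for name, hits in subset.items() if hits}

    # Stage 2: check the secondary amino acid at the noted position.
    results = {}
    for name, hits in subset.items():
        dna = genes[name]
        kept = []
        for start in hits:
            pos = start + 3 * (len(primary) + gap)
            if CODON_TO_AA.get(dna[pos:pos + 3]) == secondary:
                kept.append(start)
        if kept:
            results[name] = kept
    return results
```

For the fragment RESExxxG, a call such as two_stage_search(genes, "RESE", "G", 3) would return only those genes in which a GLY codon sits three codons past the end of the RESE region. Repeating stage 2 with further residues narrows the subset toward genes containing the whole motif.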

In the above example, search interface system 24 would search for high homology

sequences that have GGT, GGC, GGA, or GGG (i.e., one of the codons for GLY) nine

nucleotides downstream from the last codon for the RESE component of the motif. Thus, one sequence that would be searched would be CGT GAA TCT GAA XXX XXX XXX GGT, which

would be searched in the subset of all genes with the amino acid sequence RESE. (CGT is a

codon for R, GAA is a codon for E, and TCT is a codon for S.) The above DNA sequence

corresponds to the protein motif sequence RESExxxG. Additional iterations of this procedure

could be performed to identify genes which contain the entire motif.
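
As a check on the worked example, the short Python sketch below translates the example codon string back into amino acids and counts how many distinct codon strings the specified residues admit. It assumes the standard genetic code; the helper names translate and permutation_count are hypothetical, and "XXX" is treated purely as the placeholder used in the example above.

```python
from itertools import product
from math import prod

# Standard genetic code, codons ordered T, C, A, G at each position ("*" = stop).
BASES = "TCAG"
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TO_AA = {"".join(c): aa for c, aa in zip(product(BASES, repeat=3), AMINO)}


def translate(spaced_codons):
    """Translate space-separated codons; the placeholder XXX becomes 'x'."""
    return "".join(
        "x" if codon == "XXX" else CODON_TO_AA[codon]
        for codon in spaced_codons.split()
    )


def permutation_count(motif):
    """Count the codon strings for the specified residues, ignoring 'x'."""
    degeneracy = {}
    for aa in CODON_TO_AA.values():
        degeneracy[aa] = degeneracy.get(aa, 0) + 1
    return prod(degeneracy[residue] for residue in motif if residue != "x")


print(translate("CGT GAA TCT GAA XXX XXX XXX GGT"))  # RESExxxG
print(permutation_count("RESE"))                      # 6 * 2 * 6 * 2 = 144
print(permutation_count("RESExxxG"))                  # 144 * 4 = 576
```

Under the standard code, R and S each have six codons, E has two and G has four, so the RESE region alone corresponds to 144 codon strings and adding the GLY position raises this to 576.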

The foregoing description of the preferred embodiments of the invention has been

presented for purposes of illustration and description. They are not intended to be exhaustive or

to limit the invention to the precise form disclosed, and obviously many modifications and

variations are possible in light of the above teachings. Such modifications and variations that

are apparent to a person skilled in the art are intended to be included within the scope of this invention as defined by the accompanying claims.

Claims

I claim:

1. A computerized method for identifying DNA sequences that could code for a string of amino

acids, comprising the steps of:

inputting the string of amino acids;

generating all possible codon permutations that could code into the inputted string of

amino acids; and

examining a gene database to identify DNA sequences that match the codon

permutations.

2. The method of claim 1, comprising the further step of:

searching the gene database for a start codon that is in frame with at least one of the

identified DNA sequences.

3. The method of claim 2, wherein the step of searching the gene database for a start codon

includes the step of identifying splice junctions.

4. The method of claim 3, wherein the step of searching the gene database for a start codon

includes the further step of identifying a remote start codon.

5. The method of claim 1, comprising the further step of:

for each identified DNA sequence, searching the gene database for an associated start

codon.

6. The method of claim 1, wherein the inputted string of amino acids includes a motif that has

commonly recurring amino acids in a fixed positional relationship.

7. The method of claim 6, wherein the step of generating all possible codon permutations

generates codon permutations only for the commonly recurring amino acids.

8. The method of claim 7, wherein the step of examining the gene database includes the step of maintaining the codon permutations in the same fixed positional relationship as the commonly

recurring amino acids.

9. A computerized method for identifying DNA sequences that could code for an inputted motif, comprising the steps of:

identifying within the motif a primary region, wherein the primary region includes a

string of amino acids having the highest contiguous recurrence in the motif;

searching a gene database to identify all codon permutations that could translate into the

primary region;

selecting a subset of the gene database, wherein the subset contains genes that could

translate into the primary region;

selecting within the motif a secondary amino acid apart from the primary region;

determining a positional relationship of the secondary amino acid with respect to the

primary region; and

searching the subset of the gene database for DNA sequences that include codon

permutations that could code into both the primary region and the secondary amino acid, at the determined positional relationship.

10. A program product stored on a recordable medium, that when executed by a computer

system, comprises:

means for inputting a string of amino acids;

means for generating all possible codon permutations that could code into the inputted

string of amino acids; and

means for examining a gene database to identify DNA sequences that match the codon

permutations.

11. The program product of claim 10, further comprising means for searching the gene database

for a start codon that is in frame with at least one of the identified DNA sequences.

12. The program product of claim 11, wherein the means for searching the gene database for a

start codon includes means for identifying a splice junction.

13. The program product of claim 12, wherein the means for searching the gene database for a

start codon includes means for identifying a remote start codon.

14. The program product of claim 10, wherein the inputted string of amino acids includes a

motif having commonly recurring amino acids in a fixed positional relationship.

15. The program product of claim 14, wherein the means for generating all possible codon

permutations generates codon permutations only for the commonly recurring amino acids, and

wherein the means for examining the gene database includes means for maintaining the codon permutations in the same fixed positional relationship as the commonly recurring amino acids.

16. A computer system for identifying DNA sequences that could code for a sequence of amino

acids, comprising:

a central processing unit;

a computer system memory;

a permutation generator, wherein the permutation generator generates permutations of

codon sequences in response to an inputted amino acid sequence; and

a search interface system, wherein the search interface system identifies DNA sequences

in a database that match the codon sequences.

17. The computer system of claim 16, wherein the permutation generator includes a motif

handler.

18. The computer system of claim 17, wherein the search interface system includes a positional search mechanism.

19. The computer system of claim 16, wherein the search interface system includes a homology

search engine.

20. The computer system of claim 16, wherein the search interface system includes a start codon identifier.

21. The computer system of claim 16, wherein the search interface system includes a splice junction handler.