WO2004053766A1

WO2004053766A1 - Reverse translation of protein sequences to nucleotide code

Info

Publication number: WO2004053766A1
Application number: PCT/CA2003/001929
Authority: WO
Inventors: Wayne R. Danter
Original assignee: London Health Sciences Centre Research Inc.
Priority date: 2002-12-06
Filing date: 2003-12-05
Publication date: 2004-06-24
Also published as: AU2003287823A1

Abstract

The invention can be summarized as follows. A method of converting an amino acid sequence of a protein of interest to doublet nucleotide code comprising the steps of providing a first data set corresponding to an amino acid sequence as input to an information processing system (IPS) and determining from the IPS a second data set corresponding to the predicted doublet nucleotide code encoding the amino acid sequence. The method may also comprise a search of a data structure for a nucleotide sequence comprising the predicted doublet nucleotide code. Also disclosed is an information processing system for performing biological sequence analysis.

Description

REVERSE TRANSLATION OF PROTEIN SEQUENCES TO NUCLEOTIDE CODE

The present invention relates to systems and methods for performing biological sequence analysis. More specifically, the invention relates to systems and methods for reverse engineering protein sequences to nucleotide code.

BACKGROUND OF THE INVENTION

While the Human Genome Project has deciphered the human DNA genetic code, the task of determining the function of gene products remains at an early stage of development. Gene products comprise peptides, proteins and antibodies that result from the complex processes of (1) DNA transcription producing messenger ribonucleic acid (mRNA), (2) ribosomal translation of mRNA and (3) post-translational processing of the resulting proteins.

Functional genomics and proteomics are relatively new branches of genetics that attempt to determine the specific function of gene products based on sequence and structure information. A still newer area of active research that is at the interface between functional genomics and pharmacology is pharmacogenomics. Broadly speaking, pharmacogenomics is the study of therapeutics in relation to the genetic makeup of an organism.

If a disease process can be shown to be the result of an abnormal gene product then specific therapies can be developed which target the abnormal gene product. A good example of this process is illustrated by the development of a small molecule called Gleevec (Imatinib) that is used to treat Chronic Myelogenous Leukemia (CML). CML is a type of leukemia associated with a recognized genetic mutation known as the Philadelphia chromosome. This genetic mutation results in the production of an abnormal protein tyrosine kinase. Gleevec is a specific protein tyrosine kinase inhibitor that targets the ATP (Adenosine Tri-Phosphate) binding site of the abnormal enzyme, but leaves cells containing the normal or wild-type enzyme largely unaffected. This represents the first instance in which a specific therapeutic agent capable of selectively targeting cells with a specific genetic abnormality has been produced.

One potential method of linking disease to the genetic makeup of an organism is to identify an abnormal gene product associated with a disease and then search for the DNA encoding the abnormal gene product. This association may allow for a clearer understanding of the disease process, enhance diagnosis and screening for the disease, and may lead to the development of therapeutics or specific gene therapies that target and repair the abnormal gene sequence.

The basic building blocks of human proteins are the amino acids, of which 20 are most commonly used. The double helix of DNA is composed of units called nucleotides or base pairs that are organized into triplets known as codons. There are 64 of these different three base pair codons. One of these codons codes for either the START instruction or the amino acid methionine, depending on whether or not the codon instruction occurs at the beginning of the coding sequence. Three other codons code for the STOP instruction that terminates ribosomal translation. The remaining 60 triplets code for the 20 amino acids commonly linked together by amide bonds to form gene products (i.e. peptides, proteins and antibodies). To further complicate the relationship between DNA code and gene products such as proteins, there is considerable variability in the number of triplets that code for each amino acid. The DNA code is both redundant and degenerate. For example, while Methionine and Tryptophan are each encoded by one unique triplet codon, the other amino acids may be encoded by 2, 3, 4 or 6 different triplet codons. Three amino acids (i.e. Arginine, Leucine and Serine) are each encoded by 6 different triplet codons. This characteristic of the DNA code has made the process of reverse engineering gene products to DNA code a formidable task. Current approaches depend on using a brute force search through all possible combinations based on specific DNA sequences known as gene probes.

Attempts have been made to use artificial intelligence technologies such as neural networks to reverse engineer proteins to DNA code. These attempts have met with limited success. In the best study to date (White and Seffens, 1998) a simple neural network correctly predicted 100% of the non-redundant codons and 85% of the redundant codons from the test amino acid sequences. Overall 93% of the test amino acid sequences were correctly mapped to the actual DNA triplet codon and until very recently it appeared that improving on this degree of accuracy when reverse engineering proteins based on the triplet codon DNA code would remain problematic.

There is a need in the art for novel methods of reverse engineering protein sequences to nucleotide code. Further, there is a need in the art for novel methods of searching and identifying nucleotide sequences in databases.

It is an object of the present invention to overcome disadvantages of the prior art.

The above object is met by a combination of the features of the main claims. The sub claims disclose further advantageous embodiments of the invention.

SUMMARY OF THE INVENTION

According to the present invention there is provided a method of converting an amino acid sequence of a protein of interest to doublet nucleotide code comprising the steps of, a) providing a first data set corresponding to the amino acid sequence as input to an information processing system (IPS), and; b) determining from the IPS a second data set corresponding to the predicted doublet nucleotide code encoding the amino acid sequence of the protein of interest. Further, the method may also comprise the step of outputting the second data set, or information equivalent to the predicted doublet code encoding the amino acid sequence of the protein of interest. Further, the doublet nucleotide code may comprise a DNA nucleotide sequence or an RNA nucleotide sequence.

Further contemplated by the method of the present invention as defined above, the first data set may comprise a full or partial amino acid sequence of a protein of interest. Further, the first data set may be encoded in binary form.

Also contemplated by the method of the present invention, the protein of interest may be a variant or mutant protein of a wild-type protein. Further, the variant or mutant protein may be associated with a disease.

Further, the present invention provides a method as defined above, wherein the information processing system (IPS) comprises a neural network. The neural network may also employ a genetic algorithm. In an embodiment of the present invention, which is not meant to be limiting, the neural network is NeuroSHELL Classifier v2.0. Preferably the neural network is a trained neural network. For example, but not wishing to be limiting, the neural network may be trained by processing a plurality of data sets comprising data elements (X,Y) wherein X represents the amino acid sequence of a protein and Y represents the nucleotide sequence encoding X. Alternatively, but without wishing to be limiting, the IPS may comprise a rule-based system.

The present invention also contemplates a method of identifying a nucleotide sequence encoding a protein of interest within a data structure, comprising the steps of a) providing a first data set corresponding to the amino acid sequence of the protein of interest to an information processing system (IPS) capable of producing a second data set corresponding to the predicted doublet DNA code encoding the amino acid sequence, and; b) performing an in-string search of the data structure to identify all instances wherein the second data set is present in the data structure. The data structure may comprise an electronic medium containing nucleotide sequences, for example, but not limited to an electronic database. Further the nucleotide sequences may comprise genomic nucleotide sequences, for example, but not limited to sequences containing introns.

Also contemplated by the method of the present invention as defined above, the in-string search may be performed, controlled or both performed and controlled by an algorithm employing a sliding window approach to compare sequences. Further, the algorithm may comprise an alignment algorithm such as, but not limited to BLAST, FASTA, dynamic programming, or a version or an executable-code-modified version thereof.

Also contemplated by the present invention is an information processing system (IPS) capable of a) receiving a first data set corresponding to an amino acid sequence of a protein of interest, and b) producing a second data set corresponding to the predicted doublet nucleotide code encoding the amino acid sequence. Further, the IPS may further comprise hardware, software or the like, and may further comprise an alignment algorithm for performing an in-string search of a data structure to identify all instances wherein the second data set is present in the data structure.

This summary does not necessarily describe all necessary features of the invention; the invention may also reside in a sub-combination of the described features.

BRIEF DESCRIPTION OF THE DRAWINGS

These and other features of the invention will become more apparent from the following description in which reference is made to the appended drawings wherein:

FIGURE 1 shows the predicted doublet nucleotide code of normal and mutant hemoglobin proteins output from a trained neural network following input of the first twenty-nine amino acids of the proteins.

DESCRIPTION OF PREFERRED EMBODIMENT

The following description is of a preferred embodiment by way of example only and without limitation to the combination of features necessary for carrying the invention into effect.

According to an embodiment of the present invention there is provided a method of converting an amino acid sequence of a protein of interest to doublet nucleotide code comprising the steps of, a) providing a first data set corresponding to the amino acid sequence as input to an information processing system (IPS), and; b) determining from the IPS a second data set corresponding to predicted doublet nucleotide code encoding the amino acid sequence.

The method may further comprise a step of outputting the second data set, or information equivalent to the predicted doublet nucleotide code of the protein of interest.

Also contemplated by the present invention is an information processing system (IPS) capable of a) receiving as input a first data set corresponding to an amino acid sequence, b) determining a second data set corresponding to the predicted doublet nucleotide code encoding the amino acid sequence and c) outputting the second data set or information equivalent to the predicted doublet nucleotide code. Further, the IPS may comprise additional hardware, for example but not limited to control circuits, microprocessors, software or both, and it may be part of one or more computers or biological sequence analysis systems.

Definitions:

By the term "amino acid sequence" it is meant a consecutive sequence of amino acids linked via peptide bonds defining the protein of interest, preferably starting from the amino terminus (N-terminus) and proceeding to the carboxy terminus (C-terminus) of the protein.

The protein of interest may comprise any protein known in the art, for example, but not limited to, pharmaceutically important proteins such as, but not limited to regulatory proteins, signaling proteins, growth factors, growth regulators, antibodies, antigens, interleukins, insulin, colony stimulating factors such as G-CSF, GM-CSF, hPG-CSF, M-CSF or combinations thereof, interferons, for example, interferon-α, interferon-β, interferon-γ, blood clotting factors, for example, Factor VIII, Factor IX, or tPA. However, any native protein produced by an organism may be considered a protein of interest. Also, the present invention contemplates mutant proteins and variants of native proteins produced by an organism.

By the term "doublet nucleotide code" it is meant a DNA or RNA nucleotide sequence comprising two of the three nucleotides of each triplet codon encoding an amino acid, with a blank, space-holder or the like in the third position. For example, Alanine may be represented by the doublet codon GC_, Cysteine by TG_, Arginine by CG_ OR AG_, Leucine by CT_ OR TT_, Serine by TC_ OR AG_, etc. Preferably, the doublet nucleotide code is provided in a defined orientation, such as a 5' to 3' orientation, as would be understood by a person of skill in the art.

By the term "first data set" it is meant information pertaining to the amino acid sequence of the protein of interest. The first data set comprises the amino acid sequence of the protein of interest, or a computer readable version thereof. Further, the first data set may be encoded on an electronic medium, such as, but not limited to a computer information storage device, such as, but not limited to a hard drive, floppy disk or the like. In this regard, it may be desirable to encode individual amino acids that make up the protein of interest into an appropriate form such as, but not limited to as a binary sequence. For example, but not to be considered limiting in any manner, the amino acid Alanine which represents one of the twenty amino acids, may be represented by the numerical string 10000000000000000000, serine as 00000000000010000000, valine as 00000000000000000001 and so on. Using such notation a protein of interest that comprises the amino acid sequence Ala-Ala-Val-Ser may be depicted by the numerical sequence:

10000000000000000000100000000000000000000000000000000000000100000000000010000000

(also see White and Seffens, 1998, Electronic J. Biotech. 1: 196-201, which is incorporated herein by reference). As will be evident to someone of skill in the art, other ways of encoding the amino acid sequence into binary or other forms may be possible and any such method is contemplated by the present invention.
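By way of illustration only, the binary encoding described above may be sketched in Python as follows. The ordering of the twenty amino acids in `AMINO_ACIDS` (alphabetical by three-letter abbreviation) and the function names are assumptions made for this sketch; the source does not specify an ordering, so the position of the 1 for a given amino acid need not match the example strings given above.

```python
# Assumed ordering: alphabetical by three-letter abbreviation (not specified in the source).
AMINO_ACIDS = [
    "Ala", "Arg", "Asn", "Asp", "Cys", "Gln", "Glu", "Gly", "His", "Ile",
    "Leu", "Lys", "Met", "Phe", "Pro", "Ser", "Thr", "Trp", "Tyr", "Val",
]


def one_hot(residue, alphabet=AMINO_ACIDS):
    """Return a 20-character binary string with a single 1 marking the residue."""
    bits = ["0"] * len(alphabet)
    bits[alphabet.index(residue)] = "1"  # raises ValueError for an unknown residue
    return "".join(bits)


def encode_sequence(residues, alphabet=AMINO_ACIDS):
    """Concatenate the one-hot strings of each residue, N-terminus first."""
    return "".join(one_hot(r, alphabet) for r in residues)


encoded = encode_sequence(["Ala", "Ala", "Val", "Ser"])
print(len(encoded), encoded)
```

Any fixed ordering may be used, provided the same ordering is applied when the network is trained and when it is queried.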

Similarly, by the term "second data set" it is meant information corresponding to the predicted doublet nucleotide code encoding the amino acid sequence. The second data set, corresponding to the predicted doublet DNA code may also be encoded on an electronic medium as defined previously.

By the term "information processing system" or "IPS" it is meant an electronic device that is capable of accepting the first data set and producing therefrom a second data set corresponding to the predicted doublet nucleotide code encoding the amino acid sequence. The IPS may comprise a neural network or a rule-based system or algorithm.

In an aspect of an embodiment the present invention employs a neural network, preferably a trained neural network. In an alternate embodiment, the information processing system comprises a rule-based algorithm. The IPS may also comprise one or more circuits, microprocessors, or combinations thereof, as would be evident to a person of skill in the art.

Artificial Neural Networks are pattern recognition computer models based on the human nervous system that are capable of learning from experience and then making predictions about new patterns. Neural computation encompasses the concepts of distributed, adaptive and nonlinear computing. Neural networks usually comprise a plurality of layers such as an input layer, a middle or hidden layer and an output layer. Each layer comprises a plurality of processing elements that are usually interconnected by weighted connections or scaling factors. A processing element multiplies an input by a set of weights, and non-linearly transforms the result into an output value. The performance of the neural network may be measured in terms of a desired signal and error criterion. The output of the neural network is compared with a desired response to produce an error. A backpropagation algorithm may be used to adjust the weights interconnecting the processing elements in a manner to minimize the error. Other learning algorithms that may be employed include, but are not limited to Probabilistic/Bayesian, Generalized Regression, Self-organizing maps (e.g. Kohonen Networks), Cascade Correlation or a combination thereof. The network may be trained by repeatedly exposing the neural network to known data patterns while the training algorithm adjusts the connection weights between processing elements in order to "learn" the relationship between the input patterns and the desired outputs. This process continues until a predetermined error tolerance has been achieved so as to optimize the generalizability of the trained model. For example, but not wishing to be limiting, a neural network may be trained by processing a plurality of data sets comprising data elements (X,Y) wherein X represents the amino acid sequence of a protein and Y represents the nucleotide sequence encoding X.

The current process attempts to relate the properties of each of the 20 amino acids (AAs) to the specific coding doublet DNA code. Having taught a machine learning system to understand this relationship, the model can then reverse engineer any given sequence of amino acids to the original DNA code via the concept of the formatted doublet codon. In an aspect of the present invention, the training set comprises 128 data patterns (64 for the DNA instance and 64 for the mRNA instance) relating each AA to its encoding doublet DNA codon.
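For illustration only, one possible way of assembling 128 (X,Y) training patterns is sketched below in Python. The sketch assumes that the 64 patterns of each instance correspond to the 64 triplet codons of the standard genetic code, each reduced to its first two nucleotides; the source does not state how its patterns were constructed, and whether STOP codons were included is likewise an assumption. All names are hypothetical.

```python
BASES = "TCAG"
# Standard genetic code in one-letter amino acid symbols, for codons ordered
# TTT, TTC, TTA, TTG, TCT, ... GGG ('*' denotes a STOP codon).
CODE = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"


def codon_table():
    """Map each of the 64 DNA triplet codons to its amino acid symbol."""
    codons = [a + b + c for a in BASES for b in BASES for c in BASES]
    return dict(zip(codons, CODE))


def training_patterns():
    """Return (amino acid, formatted doublet) pairs: 64 DNA plus 64 mRNA."""
    table = codon_table()
    dna = [(aa, codon[:2] + "_") for codon, aa in table.items()]
    # The mRNA instance uses uracil (U) in place of thymine (T).
    rna = [(aa, codon[:2].replace("T", "U") + "_") for codon, aa in table.items()]
    return dna + rna


patterns = training_patterns()
print(len(patterns), patterns[:4])
```

Each amino acid symbol X would then be encoded in binary form before being presented to the network together with its target doublet Y.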

A trained neural network based on Bayes theorem is a powerful classifier and by design learns and makes predictions based on probabilities calculated from the training data. Once trained and validated the neural network can be used to evaluate the amino acid sequence of the protein of interest and output the predicted doublet nucleotide code corresponding to the input amino acid sequence. Examples of the training of neural network systems are known in the art and may be found in, but are not limited to: Foundations of Neural Networks, Fuzzy Systems and Knowledge Engineering, Nikola K. Kasabov, A Bradford Book, The MIT Press, Cambridge, Massachusetts, London, England, 1998; and Neural Smithing: Supervised Learning in Feedforward Artificial Networks, Russell D. Reed and Robert J. Marks II, A Bradford Book, The MIT Press, Cambridge, Massachusetts, London, England, 1999; which are hereby incorporated by reference.

Neural networks that may be employed in the present invention include, but are not limited to NeuroSHELL Classifier v2.0. In an aspect of an embodiment, the present system may rely on a number of specific features of the NeuroSHELL package, for example, but not limited to the ability to carry out leave-one-out cross-validation coupled with the probabilistic classifier utilizing a genetic algorithm to evolve a cross-validated optimal solution to the classification problem.

Alternatively, the IPS may comprise a rule-based algorithm. For example, but not to be considered limiting in any manner, a rule-based algorithm may comprise a plurality of rules such as:

Rule 1: If AA = "Alanine" then output: Doublet = "GC_"

Rule 2: If AA = "Asparagine" then output: Doublet = "AA_"

Rule 3: If AA = "Aspartate" then output: Doublet = "GA_"

Rule 4: If AA = "Cysteine" then output: Doublet = "TG_"

Rule 5: If AA = "Glutamate" then output: Doublet = "GA_"

Rule 6: If AA = "Arginine" then output: Doublet = "CG_" OR "AG_"

and so on for all amino acids.

Preferably, the rule-based system comprises (1) a comprehensive set of rules of varying complexity to handle all possible relationships and (2) an efficient search algorithm to find the appropriate rule quickly rather than a brute-force search through every rule each time to find the appropriate rule for the instance of the data pattern being evaluated. However, other rule-based systems may be employed in the present invention as would be understood by a person of skill in the art.
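For example, but without wishing to be limiting, rules of the kind listed above may be held in a hash table so that the rule for a given amino acid is found in constant time rather than by testing every rule in turn. The following Python sketch is illustrative only; the dictionary layout and the function name `apply_rules` are assumptions, and the doublets are those given in the rules and definitions of this description.

```python
# One entry per rule; amino acids with two possible doublets list both.
DOUBLET_RULES = {
    "Alanine": ("GC_",),       "Asparagine": ("AA_",),
    "Aspartate": ("GA_",),     "Cysteine": ("TG_",),
    "Glutamate": ("GA_",),     "Glutamine": ("CA_",),
    "Glycine": ("GG_",),       "Histidine": ("CA_",),
    "Isoleucine": ("AT_",),    "Lysine": ("AA_",),
    "Methionine": ("AT_",),    "Phenylalanine": ("TT_",),
    "Proline": ("CC_",),       "Threonine": ("AC_",),
    "Tryptophan": ("TG_",),    "Tyrosine": ("TA_",),
    "Valine": ("GT_",),
    "Arginine": ("CG_", "AG_"),
    "Leucine": ("CT_", "TT_"),
    "Serine": ("TC_", "AG_"),
}


def apply_rules(amino_acids):
    """Return, for each amino acid in order, the tuple of doublets its rule outputs."""
    output = []
    for aa in amino_acids:
        if aa not in DOUBLET_RULES:
            raise ValueError(f"no rule for amino acid {aa!r}")
        output.append(DOUBLET_RULES[aa])  # dictionary lookup, no scan of all rules
    return output


print(apply_rules(["Alanine", "Arginine", "Cysteine"]))
```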

The Doublet Nucleotide Code

The genetic material of an organism comprises a string of nucleotides consisting of adenine (A), cytosine (C), guanine (G) and thymine (T) in the case of DNA and A, C, G, and uracil (U) in the case of RNA. Proteins comprising a series of amino acids linked by peptide bonds are produced by transcribing and translating the genetic material. It is well known in the art that triplet codons consisting of three consecutive nucleotides of genetic material specify the amino acids to be incorporated into a protein. An aspect of the present invention, which is not meant to be bound by theory or limiting in any manner, is based on the notion that the triplet genetic code may have evolved from a doublet code comprising two base pair codons. Since there are 4 different nucleotides in either DNA or RNA, a doublet codon can be arranged in only 4² = 16 different ways. These 16 doublet codons encode 16 of the 20 amino acids found in proteins (Table 1).

Table 1: The Doublet Genetic Code

Amino Acid    Doublet (100%)    Amino Acid       Doublet (100%)
Alanine       GC                Isoleucine       AT
Asparagine    AA                Lysine           AA
Aspartate     GA                Methionine       AT* (ATG only)
Cysteine      TG                Phenylalanine    TT
Glutamate     GA                Proline          CC
Glutamine     CA                Threonine        AC
Glycine       GG                Tryptophan       TG* (TGG only)
Histidine     CA                Tyrosine         TA
Valine        GT

Thus, about 80% of the amino acids in proteins can be directly mapped to the triplet DNA code from the doublet nucleotide code. This doublet nucleotide code may be converted to the modern triplet code format by simply adding a blank space holder in the third nucleotide position. If sequence analysis is employed using doublet nucleotide codons for 16 amino acids, then the problem of reverse engineering an amino acid sequence of a protein of interest may be significantly reduced. Further, as the amino acid Tryptophan is encoded by a single codon (i.e. TGG) then only three amino acids (Arginine, Leucine and Serine) encoded by multiple triplet codons require conversion to multiple doublet codons.
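By way of illustration only, the following Python sketch enumerates the 4² = 16 possible doublets and inverts the amino acid to doublet assignments to show which doublets are shared by more than one amino acid. The dictionary `TABLE` transcribes the assignments given in this description (including the two doublets each for Arginine, Leucine and Serine); the three-letter keys and variable names are assumptions made for the sketch.

```python
from collections import defaultdict
from itertools import product

# All ordered pairs of the four DNA nucleotides: 4**2 = 16 doublets.
DOUBLETS = ["".join(pair) for pair in product("ACGT", repeat=2)]

# Amino acid -> doublet(s), as assigned in this description.
TABLE = {
    "Ala": ["GC"], "Asn": ["AA"], "Asp": ["GA"], "Cys": ["TG"], "Glu": ["GA"],
    "Gln": ["CA"], "Gly": ["GG"], "His": ["CA"], "Val": ["GT"], "Ile": ["AT"],
    "Lys": ["AA"], "Met": ["AT"], "Phe": ["TT"], "Pro": ["CC"], "Thr": ["AC"],
    "Trp": ["TG"], "Tyr": ["TA"],
    "Arg": ["CG", "AG"], "Leu": ["CT", "TT"], "Ser": ["TC", "AG"],
}

# Invert the mapping: doublet -> amino acids that use it.
by_doublet = defaultdict(list)
for amino_acid, doublets in TABLE.items():
    for doublet in doublets:
        by_doublet[doublet].append(amino_acid)

shared = {d: aas for d, aas in by_doublet.items() if len(aas) > 1}
for doublet in DOUBLETS:
    print(doublet, by_doublet[doublet])
```

The inversion shows that a doublet does not in general identify a single amino acid; the mapping is used in the amino acid to doublet direction, which is the direction required for reverse translation.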

Results from the Human Genome Project and other sequencing projects permit statistical analysis of DNA sequences. For example, it is possible to estimate the probability that a given amino acid is encoded by a specific doublet DNA codon. Such a statistical approach may be applied to the 3 amino acids (Arginine, Leucine and Serine) that cannot be defined by a single doublet nucleotide code. For example, but not wishing to be limiting, based on statistical analysis of hundreds of thousands of DNA sequences from the human genome, it may be predicted that about 60% of the time Arginine is encoded by the doublet codon CG and about 40% of the time by the doublet codon AG. Similarly Leucine is encoded by CT about 80% of the time and TT about 20% of the time while Serine is encoded by TC about 60% and AG about 40% of the time, as shown in Table 2. As will be understood by a person of skill in the art, nucleotide coding frequencies encoding specific amino acids may be organism or species specific. Thus, the present invention also contemplates using known codon frequencies for specific organisms during reverse engineering protein sequences to nucleotide code as described herein. Further the present invention contemplates extrapolating the codon frequencies for organisms that have yet to be fully sequenced at the nucleotide level, for example, but not limited to by analyzing the nucleotide sequence encoding known proteins from that organism.

Table 2: Doublet Codon Probabilities Derived From the Human Genome Project

Amino Acid    Doublet    Probability
Arginine      CG         0.6
Arginine      AG         0.4
Leucine       CT         0.8
Leucine       TT         0.2
Serine        TC         0.6
Serine        AG         0.4
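For illustration only, the probabilities of Table 2 may be used to rank the alternative doublet sequences for a peptide containing Arginine, Leucine or Serine. The Python sketch below treats the choice of doublet at each position as independent, which is a simplifying assumption not stated in the source; the function and variable names are likewise hypothetical.

```python
from itertools import product

# Amino acids with a single doublet (probability 1.0).
SINGLE = {
    "Ala": "GC", "Asn": "AA", "Asp": "GA", "Cys": "TG", "Glu": "GA", "Gln": "CA",
    "Gly": "GG", "His": "CA", "Val": "GT", "Ile": "AT", "Lys": "AA", "Met": "AT",
    "Phe": "TT", "Pro": "CC", "Thr": "AC", "Trp": "TG", "Tyr": "TA",
}
# Doublet probabilities from Table 2.
WEIGHTED = {
    "Arg": [("CG", 0.6), ("AG", 0.4)],
    "Leu": [("CT", 0.8), ("TT", 0.2)],
    "Ser": [("TC", 0.6), ("AG", 0.4)],
}


def options(amino_acid):
    """Return the (doublet, probability) choices for one amino acid."""
    if amino_acid in WEIGHTED:
        return WEIGHTED[amino_acid]
    return [(SINGLE[amino_acid], 1.0)]


def ranked_candidates(peptide):
    """Enumerate every doublet sequence for the peptide, most probable first."""
    results = []
    for combo in product(*(options(aa) for aa in peptide)):
        probability = 1.0
        for _, p in combo:
            probability *= p  # assumes independence between positions
        formatted = " ".join(doublet + "_" for doublet, _ in combo)
        results.append((formatted, probability))
    return sorted(results, key=lambda item: item[1], reverse=True)


ranked = ranked_candidates(["Met", "Leu", "Ser"])
for candidate, probability in ranked:
    print(candidate, round(probability, 2))
```

The number of candidates doubles with each Arginine, Leucine or Serine residue, so for long peptides it may be preferable to search with the most probable candidate first.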

The final modified doublet representation of the genetic code for all 20 Amino Acids is summarized in Table 3.

Table 3: Modified Doublet Nucleotide Code Symbolic Representation For All 20 Amino Acids

Amino Acid    Doublet (100%)    Amino Acid       Doublet (100%)
Alanine       GC                Isoleucine       AT
Asparagine    AA                Lysine           AA
Aspartate     GA                Methionine       AT* (ATG only)
Cysteine      TG                Phenylalanine    TT
Glutamate     GA                Proline          CC
Glutamine     CA                Threonine        AC
Glycine       GG                Tryptophan       TG* (TGG only)
Histidine     CA                Tyrosine         TA
Valine        GT
Arginine      C(A.4)G     thus CG OR AG (100%)
Leucine       C(T.2)T     thus CT OR TT (100%)
Serine        TC(AG.4)    thus TC OR AG (100%)

Thus, in an embodiment, the present invention contemplates a symbolic probabilistic reverse mapping system from amino acid sequence to DNA or RNA code using a doublet nucleotide representation (i.e. AA_, CG_, TT_, CA_, etc). The system may also employ additional concepts such as, but not limited to, that every coding sequence has a start (ATG) and termination (STOP) code. These 2 instructions may be employed to represent the "boundaries" of a target DNA sequence and therefore define a finite number of amino acids and codons lying between the boundaries. Secondly, methionine and tryptophan are each encoded by a unique triplet codon (i.e. ATG and TGG, respectively). These unique triplet codons may be employed as "anchors" or constants within a given DNA or RNA sequence. Boundaries, anchors or both provide a simple method of partial internal validation of the reverse engineering of proteins to nucleotide code.
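By way of example only, a partial internal validation using boundaries and anchors may be sketched in Python as follows. The sketch assumes that the peptide is given in one-letter symbols beginning with the initiator methionine, and that the candidate DNA sequence runs from the start codon through the STOP codon without introns; the function name and the wording of the messages are hypothetical.

```python
STOP_CODONS = {"TAA", "TAG", "TGA"}


def check_boundaries_and_anchors(peptide, dna):
    """Return a list of problems found; an empty list means the checks passed."""
    usable = len(dna) - len(dna) % 3
    codons = [dna[i:i + 3] for i in range(0, usable, 3)]
    problems = []
    # Boundaries: ATG at the start, a STOP codon at the end, and a codon
    # count equal to the number of amino acids plus the STOP codon.
    if len(codons) != len(peptide) + 1:
        problems.append("codon count does not equal peptide length plus one stop codon")
    if not codons or codons[0] != "ATG":
        problems.append("missing ATG start boundary")
    if not codons or codons[-1] not in STOP_CODONS:
        problems.append("missing stop boundary")
    # Anchors: Met must be ATG and Trp must be TGG, including the third base.
    for i, (amino_acid, codon) in enumerate(zip(peptide, codons)):
        if amino_acid == "M" and codon != "ATG":
            problems.append(f"Met anchor mismatch at codon {i + 1}")
        if amino_acid == "W" and codon != "TGG":
            problems.append(f"Trp anchor mismatch at codon {i + 1}")
    return problems


print(check_boundaries_and_anchors("MW", "ATGTGGTAA"))
print(check_boundaries_and_anchors("MW", "ATGTGTTAA"))
```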

In a further aspect of an embodiment of the present invention, the method of reverse engineering protein sequences to nucleotide code may be employed to convert an amino acid sequence of a protein of interest into doublet nucleotide code and identify the corresponding nucleotide sequence in a data-structure, database or the like, wherein the nucleotide sequence also comprises one or more introns. Without wishing to be limiting in any manner, the method may comprise a 2 (or more) stage search whereby the first stage search looks for the contiguous DNA sequence predicted by the system. If no match is found then an iterative 2nd stage search may be employed to identify subset matches of codons, preferably at least 3 contiguous codons. Thus, if no match is found in the first search then a codon by codon match may ensue. For example if the first 3 codons of the probe match 3 codons in the target but not the fourth then the search algorithm interprets this as the beginning of a possible intron sequence and tries to match the 4th codon of the probe with the next codon in the target. As long as there are 2 or fewer contiguous codon matches this process continues. Matches of 3 or more codons are acknowledged by the system as above and the next iteration begins. The iterative search continues until the probe sequence is matched. A minimum match requirement for 3 contiguous codons is based on calculated probabilities for the appearance of 1, 2, 3 and 4 codons in association in any DNA sequence. The probability that any 3 contiguous codons would be associated based on chance alone is small.
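For illustration only, one possible reading of the staged search is sketched below in Python. The sketch is a greedy simplification and is not asserted to be the search actually employed: it accepts a run of probe codons only when at least `min_run` contiguous doublet codons match, treats any unmatched target sequence between accepted runs as a possible intron, and does not handle an intron that splits a codon. The function names and the example target sequence are hypothetical.

```python
def run_length(probe, i, target, pos):
    """Count how many probe doublets, from index i, match consecutive codons at pos."""
    n = 0
    while i + n < len(probe):
        start = pos + 3 * n
        if start + 3 > len(target) or target[start:start + 2] != probe[i + n]:
            break
        n += 1
    return n


def staged_search(probe, target, min_run=3):
    """Return a list of (probe index, target position, codons matched) blocks, or None."""
    blocks = []
    i, pos = 0, 0
    while i < len(probe):
        needed = min(min_run, len(probe) - i)
        found = None
        for start in range(pos, len(target) - 2):
            n = run_length(probe, i, target, start)
            if n >= needed:  # shorter runs are treated as chance matches
                found = (start, n)
                break
        if found is None:
            return None
        start, n = found
        blocks.append((i, start, n))
        i += n
        pos = start + 3 * n  # resume after the matched block
    return blocks


probe = ["AT", "GT", "CA", "CT", "AC", "CC"]
# Hypothetical target: two coding blocks separated by a made-up intron.
target = "ATGGTGCAC" + "GTAAGTTTTTAG" + "CTGACTCCT"
print(staged_search(probe, target))
```

A single block covering the whole probe corresponds to a first-stage contiguous match. Under the simplifying assumption that the four nucleotides occur independently and with equal frequency, one doublet codon matches at a given position by chance with probability 1/16, and 3 contiguous doublet codons with probability (1/16)³ = 1/4096.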

The method of the present invention was tested using wild-type and mutant hemoglobins including hemoglobin S (HbS), hemoglobin C (HbC) and two hemoglobins encoding truncated proteins.

HbS and HbC both result from missense mutations at the 6th codon position. In the case of HbS, Valine is substituted for Glutamic acid while HbC results from the substitution of Lysine for Glutamic acid. One example of Thalassemia results from a Stop (TAG) mutation at position 17 and a second example of a Thalassemia mutant results from the deletion of an adjacent Adenine at position 8 producing a series of missense amino acids which terminates in an early Stop codon. Both of these examples result in decreased beta-chain synthesis.

The method of the present invention as described herein was used to determine the doublet nucleotide code encoding up to the first 29 amino acids of the beta globin chain. Results are shown in Figure 1. The amino acid sequence for the first 29 amino acids of the hemoglobin proteins was first encoded as outlined in Example 1, and then submitted to a trained neural network for analysis. The output of the neural network was a series of 29 or fewer formatted doublet nucleotide codons. The doublet nucleotide code output was compared with the known triplet DNA code. In all cases (normal hemoglobin, HbS, HbC and the 2 Thalassemia Hb), the doublet nucleotide code predictions exactly matched the first 2 nucleotides of the actual DNA triplet codons including the actual mutations responsible for the abnormal hemoglobin. This indicates that the original triplet DNA code was accurately determined from the doublet nucleotide code produced from neural network evaluation of the amino acid sequences.
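By way of illustration only, the effect of the HbS and HbC substitutions on the doublet nucleotide code may be sketched in Python as follows. The sketch uses a simple lookup table in place of the trained neural network, uses only the first 8 amino acids of the mature beta globin chain, and selects the more probable doublet (CT) for Leucine; these choices and all names are assumptions made for the sketch and do not reproduce Figure 1.

```python
# Doublets for the amino acids appearing in this short example.
DOUBLET = {
    "Val": "GT", "His": "CA", "Leu": "CT", "Thr": "AC",
    "Pro": "CC", "Glu": "GA", "Lys": "AA",
}

# First 8 amino acids of the normal beta globin chain (position 1 = Val).
NORMAL = ["Val", "His", "Leu", "Thr", "Pro", "Glu", "Glu", "Lys"]


def substitute(sequence, position, residue):
    """Return a copy of the sequence with a residue replaced (1-based position)."""
    mutant = list(sequence)
    mutant[position - 1] = residue
    return mutant


def to_doublets(sequence):
    """Convert amino acids to formatted doublet codons."""
    return [DOUBLET[aa] + "_" for aa in sequence]


def differences(a, b):
    """List (position, doublet in a, doublet in b) wherever the two codes differ."""
    return [(i + 1, x, y) for i, (x, y) in enumerate(zip(a, b)) if x != y]


hbs = substitute(NORMAL, 6, "Val")  # HbS: Valine substituted for Glutamic acid
hbc = substitute(NORMAL, 6, "Lys")  # HbC: Lysine substituted for Glutamic acid
print(differences(to_doublets(NORMAL), to_doublets(hbs)))
print(differences(to_doublets(NORMAL), to_doublets(hbc)))
```

In both cases the substitution changes a nucleotide within the first two positions of the codon, so the mutation remains visible in the doublet code.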

The method of the present invention may further comprise one or more additional steps at any stage in the method. For example, but not to be considered limiting in any manner, after the step of determining or outputting, the second data set may be used to search a data structure comprising nucleotide sequence information. By the term "data structure" it is meant any electronic medium comprising nucleotide sequence information. For example, but not to be considered limiting in any manner, the data structure may comprise a database or the like which contains the genome of an organism, for example, but not limited to a yeast genome, such as, but not limited to Saccharomyces cerevisiae, Schizosaccharomyces pombe (Nature 387, 5-105 (suppl) (1997); Wood et al., Nature, 415(6874):871-880 (2002)), protozoa, such as but not limited to Plasmodium falciparum, plants such as, but not limited to Arabidopsis thaliana (Nature 408(6814):796-815), Oryza sativa (Yu et al., Science, 296:79-92 (2002); Goff et al., Science, 296:92-100 (2002)), nematodes such as, but not limited to Caenorhabditis elegans (Washington U, Science, 282(5396):2012-2018 (1998). Erratum: 283(5398):35(1999), 283(5410):2103(1999), 285(5433):1493(1999)), insects such as, but not limited to Drosophila melanogaster (Adams, et al., Science, 287(5461):2185-2195 (2000)) and human (Venter, et al. Science, 291:1304-1351 (2001); International Human Genome Sequencing Consortium, Nature, 409:860-921 (2001)) or a combination thereof.

Further, the data structure may comprise a plurality of nucleotide sequences, preferably eukaryotic nucleotide sequences. However, the data structure may also comprise prokaryotic sequences. Preferably, the coding relationship between nucleotide codons and amino acids is known for the species in question. Also, the present invention contemplates data structures comprising partial or incomplete genomes of organisms. For example, but not to be considered limiting in any manner, the data structure may comprise one or more databases from the National Center for Biotechnology Information (NCBI), European Molecular Biology Laboratory (EMBL), or both. Further, the data structure may comprise a commercial data structure, for example, but not limited to, a data structure of the type available from Celera.

In an alternate embodiment, there is provided a method of identifying a nucleotide sequence encoding a protein of interest within a data structure comprising the steps of i) providing a first data set corresponding to the amino acid sequence of the protein of interest to an information processing system (IPS) capable of producing a second data set corresponding to the predicted doublet nucleotide code encoding the amino acid sequence of the protein of interest, and; ii) performing an in-string search of the data structure to identify all instances wherein the second data set is present in the data structure.

The in-string search of the data structure may be accomplished by any appropriate search algorithm known in the art. For example, the algorithm may perform a simple in-string search to identify the predicted nucleotide sequence within the data structure. An example of a simple in-string search, which is not meant to be considered limiting in any manner, is described in Example 4. Alternatively, dynamic programming, for example, but not limited to, the Smith Waterman dynamic programming algorithm or a version thereof may be employed (Smith and Waterman, 1981 a,b Identification of Common Molecular Subsequences. J. Molecular Biology 147:195-197; which is herein incorporated by reference). Further, other programs, algorithms and the like, such as the Basic Local Alignment Search Tool (BLAST), FASTA, or versions or derivatives thereof, that search for matching patterns termed k-tuples (Wilbur and Lipman, 1983 Rapid Similarity Searches of Nucleic Acid and Protein Data Banks. Proc. Natl. Acad. Sci. 80:726-730; Altschul et al., 1990 Basic Local Alignment Search Tool. J. Mol. Biol. 215:403-410; which are herein incorporated by reference) or other algorithms, for example as described in Bioinformatics: Sequence and Genome Analysis by David W. Mount, Cold Spring Harbor Laboratory Press, Cold Spring Harbor, New York, 2001 and references contained therein (which are herein incorporated by reference), may be employed by the present invention. However, any suitable algorithm known in the art may be employed to perform the in-string search.

The use of alignment algorithms such as, but not limited to, BLAST and FASTA and dynamic programming algorithms permits alignment of sequences that are not identical. Thus, such algorithms may be employed to search a data structure that comprises introns in nucleotide sequences, for example, but not limited to genomic nucleotide sequences. Further, the alignment algorithms also permit alignment of a predicted nucleotide doublet sequence encoding a mutant protein of interest with one or more nucleotide sequences contained in a data structure, for example, but not limited to an electronic database. In such an instance the predicted doublet nucleotide sequence does not need to be identical to the sequence contained in the data structure. For example, but not to be considered limiting in any manner, the human genome has been sequenced and thus any protein produced in a human may be mapped to a specific nucleotide sequence in the genome. However, a human subject suffering from a disease may exhibit one or more mutant proteins that may be partly or wholly responsible for the disease. If a mutant protein is isolated and sequenced, it is unlikely that the exact nucleotide sequence encoding this mutant protein will be found within a data structure comprising the human genome. However, the search algorithm may be employed to determine the most likely nucleotide sequence within the data structure that may give rise to the mutant protein. By providing such information, a person of skill in the art may then determine whether the mutation arose as a result of a point mutation, such as, but not limited to insertion of a stop codon into a nucleotide sequence that is translated into a truncated protein, or an inversion, deletion, translocation or combination thereof.

Thus, the present invention also contemplates a method of searching a data structure for a target nucleotide sequence wherein the target nucleotide sequence comprises the doublet nucleotide code encoding the amino acid sequence of the protein of interest. The data structure may comprise any database known in the art. Further, the data structure may comprise introns, for example, as found in genomic nucleotide sequences of eukaryotes.

The above description is not intended to limit the claimed invention in any manner. Furthermore, the discussed combination of features might not be absolutely necessary for the inventive solution.

The present invention will be further illustrated in the following examples. However, it is to be understood that these examples are for illustrative purposes only, and should not be used to limit the scope of the present invention in any manner.

Examples

Example 1: Data Representation and Variable Creation

The development of machine learning systems such as neural networks as highly accurate predictive models can require attention to both data representation and input variable selection. Based on information known in the art, a superset of 38 potential input variables was employed as a training set to identify an optimal subset of available descriptors that can uniquely be mapped from a specific amino acid to the specific DNA encoding doublet. The following 38 variables were used in this example. However, not all variables are required by the method of the present invention, as would be understood by a person of skill in the art. The 38 variables represent a superset of possible descriptors from which a subset of variables was selected during model development by a statistical importance analysis of the inputs. Based on the analysis, and without wishing to be limiting in any manner, 8 of the 38 variables were required in order to train a perfect Bayesian classifier.

(i) Variables 1-21: The 20 Amino Acids

Amino acids are represented by a series of ones and zeros as described by White and Seffens (1998; Electronic J. Biotech. 1: 196-201, which is herein incorporated by reference) for each of the 20 Amino Acids. Optionally, one additional variable, for example one consisting entirely of zeros, may be used for the Stop instruction. Using this representation each Amino Acid can be represented as a string of 20 0s and a 1 which is unique for each Amino Acid. For example:

Alanine is represented by the string: 100000000000000000000, Valine is represented by the string: 000000000000000000010, Serine is represented by the string: 000000000000100000000, and so on for all 20 AAs. The Stop instruction is represented as all 0s 00000000000000000000.

As will be evident to a person of skill in the art, numerous ways exist to represent the twenty amino acids and any such way of representing the amino acids is contemplated by the present invention.
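The binary representation described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the alphabetical ordering in `AMINO_ACIDS` is hypothetical (the source does not give the ordering used by White and Seffens, so bit positions may differ from the examples above), and the width of 21 follows from variables 1-21, with the last position reserved for the optional Stop variable.

```python
# Hypothetical ordering (alphabetical by name); the source does not state
# which position belongs to which amino acid.
AMINO_ACIDS = [
    "Alanine", "Arginine", "Asparagine", "Aspartate", "Cysteine",
    "Glutamate", "Glutamine", "Glycine", "Histidine", "Isoleucine",
    "Leucine", "Lysine", "Methionine", "Phenylalanine", "Proline",
    "Serine", "Threonine", "Tryptophan", "Tyrosine", "Valine",
]
WIDTH = 21  # variables 1-21: 20 amino acid positions plus one extra position


def encode(name):
    """Return the 21-character binary string for an amino acid or 'Stop'."""
    if name == "Stop":
        return "0" * WIDTH  # the Stop instruction is all zeros
    bits = ["0"] * WIDTH
    bits[AMINO_ACIDS.index(name)] = "1"  # exactly one position is set
    return "".join(bits)


if __name__ == "__main__":
    for name in ("Alanine", "Valine", "Stop"):
        print(name, encode(name))
```

Because each amino acid sets a different position, the 21 strings (including Stop) are all distinct, which is the only property the representation needs.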

(ii) Variable 22: Average Codon Frequencies

Data from millions of DNA sequences has been analyzed during the human genome project. As a result, it is possible to estimate on average how frequently each triplet codon encodes its corresponding amino acid. A simple conversion from triplet to the doublet nucleotide code results in the average frequencies with which each doublet code is associated with each amino acid. Specific codon usage tables which may be employed in the present invention may be obtained from: Codon usage tabulated from the international DNA sequence database: status for the year 2000. Nakamura, Y., Gojobori, T. and Ikemura, T. (2000) Nucl. Acids Res. 28, 292; 2002 codon usage table at: www.kazura.or.jp/codon/CUTG.html

For example:

Amino Acid    Frequency    Doublet

Alanine       7.02%        GC

Arginine      2.28%        AG
              3.33%        CG

Asparagine    3.68%        AA

... and so on for all 20 amino acids.
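The triplet-to-doublet conversion mentioned above amounts to summing the frequencies of all triplet codons that share the same first two nucleotides. The following Python sketch illustrates this; the function name `doublet_frequencies` is hypothetical, and the numbers in `example` are placeholders chosen for easy arithmetic, not real codon usage values.

```python
from collections import defaultdict


def doublet_frequencies(triplet_freqs):
    """Collapse {(amino_acid, triplet): freq} into {(amino_acid, doublet): freq}."""
    totals = defaultdict(float)
    for (amino_acid, codon), freq in triplet_freqs.items():
        # Codons that differ only in the third position share one doublet.
        totals[(amino_acid, codon[:2])] += freq
    return dict(totals)


# Placeholder frequencies (NOT measured values) for the six arginine codons.
example = {
    ("Arginine", "AGA"): 1.0,
    ("Arginine", "AGG"): 1.5,
    ("Arginine", "CGA"): 0.5,
    ("Arginine", "CGC"): 0.5,
    ("Arginine", "CGG"): 1.0,
    ("Arginine", "CGT"): 1.0,
}

if __name__ == "__main__":
    for key, value in sorted(doublet_frequencies(example).items()):
        print(key, value)
```

With a real codon usage table as input, the same summation would yield entries of the kind shown in the table above, for instance separate AG and CG frequencies for arginine.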

(iii) Variables 23-32: Molecular Descriptors

The 20 amino acids have different structural and chemical properties. Similarities in some of these properties have permitted a rudimentary classification scheme based on size, charge and lipid solubility. Each of the amino acids can be classified as (1) hydrophobic, (2) small/polar, (3) charged/polar, or (4) polar. This is an imperfect system with some amino acids arguably being members of more than one class. Another approach is to use molecular descriptors calculated from connection table files for each amino acid (CT File Formats (1999), MDL Information Systems, Inc, 14600 Catalina Street, San Leandro, CA, 94577). The Molecular Surface Area (Angstroms²) of the molecules is a pseudo 3D descriptor and represents the contact surface created when a spherical probe representing a solvent molecule is rolled over the molecular model. The Molar Refractivity Index (cm³/mole) estimates the potential light refracting ability of the molecule. Molecular weight (in daltons) can be used as a measure of molecular size and lipid solubility can be estimated from calculation of LogP, the log of the Octanol/Water partition coefficient. A final descriptor, the Wiener Index, was also used. The Wiener Index is a topological descriptor of longstanding proven value that is calculated from the number and types of bonds in a given molecule. All of these descriptors are easily calculated from connection table files by molecular modeling programs such as Chem3D/ChemOFFICE Ultra 7.0, which was used in the present project.

(iv) Variable 33: DNA/mRNA flag:

A boolean switch was included to allow for the prediction of either DNA or mRNA sequences encoding amino acid sequences. Here DNA= 1, mRNA=0 and the flag is set to the appropriate value at the time of amino acid sequence entry.

(v) Variable 34: Essential vs Non Essential Amino Acids

Some amino acids are identified as essential because they are not made by an organism. An essential amino acid is scored as 1 and a non essential amino acid as 0.

(vi) Variables 35-38: The probability that the 2nd position in the codon is occupied by a given nucleotide

Similar to (ii) above, it is possible to calculate the probability that a given intra-codon position is occupied by a specific nucleotide encoding a specific amino acid. Previous research has suggested that intra-codon nucleotide position 2 is the most important of the 3 positions for optimal translation to occur (www.sb.fsu.edu/~hongli/BCH5425/note8.html and W. R. Danter, 2001). The probability that the 2nd position was occupied by a given nucleotide for a given amino acid was calculated for each of Adenine, Cytosine, Guanine and Thymidine or Uridine depending on the DNA/mRNA flag as outlined above in (iv). This process resulted in 4 variables named P(A), P(C), P(G) and P(UT) representing the corresponding calculated probability.
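One way to compute such second-position probabilities is sketched below in Python. It is an illustration only: the function name `second_position_probabilities` is hypothetical, and when no weights are supplied every codon of the amino acid is counted equally, which is an assumption (the source does not say whether codon usage frequencies were used as weights). The codon lists are taken from the standard genetic code.

```python
from collections import Counter


def second_position_probabilities(codons, weights=None):
    """Return {base: probability} for the 2nd codon position of one amino acid.

    `weights` may carry codon usage frequencies; if omitted, all codons
    are weighted equally (an assumption made for this sketch).
    """
    if weights is None:
        weights = [1.0] * len(codons)
    totals = Counter()
    for codon, weight in zip(codons, weights):
        totals[codon[1]] += weight  # index 1 is the 2nd nucleotide
    total = sum(weights)
    return {base: totals[base] / total for base in "ACGT"}


# DNA codons from the standard genetic code.
SERINE = ["TCT", "TCC", "TCA", "TCG", "AGT", "AGC"]
ARGININE = ["CGT", "CGC", "CGA", "CGG", "AGA", "AGG"]

if __name__ == "__main__":
    print("Serine:", second_position_probabilities(SERINE))
    print("Arginine:", second_position_probabilities(ARGININE))
```

For the mRNA case the same calculation applies with U in place of T, which is why the source names the fourth variable P(UT).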

The 38 variables outlined above were used to generate an input pattern for the 20 amino acids and Stop instruction together with the corresponding encoding DNA (N=64) and mRNA (N=64) doublet codons. This generated a database of 128 data patterns. Field 1 of the database is the amino acid name while Fields 2-39 are the 38 variables outlined above and the Target, namely the doublet DNA codon, is Field 40.

Example 2: Model Development and Input Selection

A neural network or rule-based algorithm may be used to reverse engineer protein sequences to nucleotide code. Both approaches are outlined below.

(i) The neural network approach:

NeuroSHELL Classifier V2.0 was selected as the neural network modeling software. This artificial intelligence tool is a statistical classifier based on Bayes Theorem coupled with a genetic algorithm used to evolve an optimal solution. The software also has the built in ability to carry out cross validation as the model is evolving. Probabilistic classifiers use the available data to generate an optimized equation from the input variables. The genetic algorithm determines the optimal subset of input variables. As will be evident to someone of skill in the art, evolutionary systems using different input variables may be expected to produce similar but not always the same final model and performance of these models may also be expected to be similar but not identical. Therefore some models may perform better than others and thus it is preferable to find an adequate or optimal/minimal set of predictors.

Each input data pattern relating to an amino acid sequence was composed of 38 variables. Simple genetic principles such as cross over and point mutation can be applied to these data patterns with the result that the addition and subtraction of some variables will produce different and shorter input patterns. For example, but not to be considered limiting in any manner:

Cross-over simulation: each of two adjacent or otherwise associated evolving potential solution patterns may transfer part of their respective sequence to the other pattern thereby generating two new potential solution patterns for evaluation according to the fitness criteria;

Point mutation simulation: a particular single variable in an evolving potential solution pattern is deleted or replaced by a new or different variable creating a new potential solution pattern which can be evaluated according to the fitness criteria.
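The two operations just described can be illustrated on lists of variable names. The Python sketch below is a generic illustration of cross-over and point mutation, not the internal procedure of NeuroSHELL Classifier, which the source does not disclose; the function names and the explicit `cut` and `index` parameters are hypothetical, chosen so that the example is deterministic.

```python
def crossover(pattern_a, pattern_b, cut):
    """Swap the tails of two candidate patterns at position `cut`."""
    child_a = pattern_a[:cut] + pattern_b[cut:]
    child_b = pattern_b[:cut] + pattern_a[cut:]
    return child_a, child_b


def point_mutation(pattern, index, replacement=None):
    """Delete the variable at `index`, or replace it if `replacement` is given."""
    mutated = list(pattern)  # copy so the parent pattern is left unchanged
    if replacement is None:
        del mutated[index]
    else:
        mutated[index] = replacement
    return mutated


if __name__ == "__main__":
    a = ["MolWt", "MolSA", "LogP", "MR"]
    b = ["AvrCodFreq", "P(G)", "WIndx", "Flag"]
    print(crossover(a, b, 2))
    print(point_mutation(a, 1))            # deletion gives a shorter pattern
    print(point_mutation(a, 1, "P(A)"))    # replacement keeps the length
```

In an evolutionary search the cut point and mutated index would normally be drawn at random, and each resulting pattern would then be scored against the fitness criteria.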

At the outset of model development a family of mutated input data patterns is created. Each of these new data patterns provides a potential solution to the classification problem and is evaluated against the fitness criterion, which in this case is the maximum number of correct classifications by the model. If a perfect classifier is found after the first generation the process stops. Otherwise the "fittest" patterns from the previous generation undergo further evolution, fitness testing and cross validation. This process continues until an optimal solution is found or a stopping criterion, usually a maximum number of generations, has been reached. Cross validation of the potential solutions is carried out during testing for each candidate model in each generation. A Leave One Out (LOO) training and testing strategy was employed here. During model creation and testing, one data pattern is held out for testing while the remaining 127 patterns are used to train the model. This process is then repeated 127 times so that in turn each pattern is tested in an independent fashion. In other words, all patterns can be used as both training and testing patterns and each test pattern is preferably not in the dataset from which the model is developed at the time it is tested. This approach represents a widely recognized and validated method of compensating for small data sets during model development. References for training of neural network systems include, but are not limited to: Foundations of Neural Networks, Fuzzy Systems and Knowledge Engineering, Nikola K. Kasabov, A Bradford Book, The MIT Press, Cambridge, Massachusetts, London, England, 1998; and Neural Smithing: Supervised Learning in Feedforward Artificial Networks, Russell D. Reed and Robert J. Marks II, A Bradford Book, The MIT Press, Cambridge, Massachusetts, London, England, 1999; which are hereby incorporated by reference.
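The Leave One Out strategy can be sketched generically in Python as follows. This is a minimal illustration under stated assumptions: a 1-nearest-neighbour rule stands in for the classifier (the source uses NeuroSHELL Classifier, whose internals are not given), the function names are hypothetical, and the four toy patterns are invented purely to make the loop runnable.

```python
def nearest_neighbour_predict(train_x, train_y, x):
    """Stand-in classifier: label of the closest training pattern."""
    def squared_distance(i):
        return sum((a - b) ** 2 for a, b in zip(train_x[i], x))
    best = min(range(len(train_x)), key=squared_distance)
    return train_y[best]


def leave_one_out_accuracy(xs, ys, predict=nearest_neighbour_predict):
    """Hold out each pattern in turn, train on the rest, test on the held-out one."""
    correct = 0
    for i in range(len(xs)):
        train_x = xs[:i] + xs[i + 1:]  # every pattern except pattern i
        train_y = ys[:i] + ys[i + 1:]
        if predict(train_x, train_y, xs[i]) == ys[i]:
            correct += 1
    return correct / len(xs)


if __name__ == "__main__":
    # Toy patterns (invented): one numeric feature, two classes.
    xs = [(0.0,), (0.1,), (1.0,), (1.1,)]
    ys = ["A", "A", "B", "B"]
    print(leave_one_out_accuracy(xs, ys))
```

With 128 patterns the loop would run 128 times, each time training on the other 127, so that no pattern is ever tested by a model that saw it during training.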

Early in the first generation of model development NeuroSHELL Classifier identified a cross-validated, perfect classifier using only 8 of the original 38 variables. Since it is possible that more than one solution exists for a given problem, particularly when there is a large number of input variables relative to data patterns, the process was repeated 20 times. In each case the same model utilizing the same 8 variables was identified.

The identified 8 variables listed in decreasing order of importance as determined by sensitivity analysis were:

1. Molecular Weight of the Amino Acid (MolWt);

2. Solvent Excluded Molecular Surface Area of the Amino Acid (MolSA);

3. The log of the Octanol/Water solubility coefficient (Log P);

4. The average doublet codon frequency (AvrCodFreq);

5. The probability that Guanine (i.e. P(G)) will occupy the 2nd position in a given codon;

6. The Molar Refractivity Index (MR);

7. The topological descriptor - Wiener Index (WIndx); and

8. The DNA/mRNA switch.

Example 3: Information Processing System comprising a Rule-Based Algorithm

A series of rules can be developed which also allow for perfect mapping of each amino acid to the encoding doublet DNA or mRNA. The pseudocode for these rules for DNA doublets appears below.

Rule 1: If AA = "Alanine" then output: Doublet = "GC_"

Rule 2: If AA = "Asparagine" then output: Doublet = "AA_"

Rule 3: If AA = "Aspartate" then output: Doublet = "GA_"

Rule 4: If AA = "Cysteine" then output: Doublet = "TG_"

Rule 5: If AA = "Glutamate" then output: Doublet = "GA_"

Rule 6: If AA = "Glutamine" then output: Doublet = "CA_"

Rule 7: If AA = "Glycine" then output: Doublet = "GG_"

Rule 8: If AA = "Histidine" then output: Doublet = "CA_"

Rule 9: If AA = "Isoleucine" then output: Doublet = "AT_"

Rule 10: If AA = "Lysine" then output: Doublet = "AA_"

Rule 11: If AA = "Methionine" then output: Doublet = "ATG"

Rule 12: If AA = "Phenylalanine" then output: Doublet = "TT_"

Rule 13: If AA = "Proline" then output: Doublet = "CC_"

Rule 14: If AA = "Threonine" then output: Doublet = "AC_"

Rule 15: If AA = "Tryptophan" then output: Doublet = "TGG"

Rule 16: If AA = "Tyrosine" then output: Doublet = "TA_"

Rule 17: If AA = "Valine" then output: Doublet = "GT_"

Rule 18: If AA = "Arginine" then output: Doublet = "C(A)G_"

Rule 19: If AA = "Leucine" then output: Doublet = "C(T)T_"

Rule 20: If AA = "Serine" then output: Doublet = "TC(AG)_"

Rule 21: If AA = " " then END

The output provided by rules 18-20 may be interpreted as Doublet = "CG_" or "AG_"; Doublet = "CT_" or "TT_"; and Doublet = "TC_" or "AG_", respectively, wherein A = Adenine; C = Cytosine; G = Guanine; T = Thymine and U = Uracil.

Altering the output for mRNA code is simply a matter of modifying the output such that T is replaced by U everywhere that T appears in the output.
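The rules above lend themselves to a look up table. The Python sketch below is one possible rendering, offered as an illustration rather than as the implementation used in the source: the names `DNA_RULES` and `reverse_translate` are hypothetical, each amino acid maps to a tuple of alternative doublets so that rules 18-20 can be expressed as two alternatives, and a blank entry ends processing as in Rule 21.

```python
# Look up table built from Rules 1-20. Each value is a tuple of alternatives;
# "_" marks the unspecified third position.
DNA_RULES = {
    "Alanine": ("GC_",), "Asparagine": ("AA_",), "Aspartate": ("GA_",),
    "Cysteine": ("TG_",), "Glutamate": ("GA_",), "Glutamine": ("CA_",),
    "Glycine": ("GG_",), "Histidine": ("CA_",), "Isoleucine": ("AT_",),
    "Lysine": ("AA_",), "Methionine": ("ATG",), "Phenylalanine": ("TT_",),
    "Proline": ("CC_",), "Threonine": ("AC_",), "Tryptophan": ("TGG",),
    "Tyrosine": ("TA_",), "Valine": ("GT_",),
    "Arginine": ("CG_", "AG_"),   # Rule 18
    "Leucine": ("CT_", "TT_"),    # Rule 19
    "Serine": ("TC_", "AG_"),     # Rule 20
}


def reverse_translate(amino_acids, mrna=False):
    """Map a sequence of amino acid names to formatted doublet codes."""
    output = []
    for name in amino_acids:
        if name.strip() == "":      # Rule 21: a blank entry ends processing
            break
        alternatives = DNA_RULES[name]
        if mrna:                    # mRNA output: replace every T with U
            alternatives = tuple(a.replace("T", "U") for a in alternatives)
        output.append(alternatives)
    return output


if __name__ == "__main__":
    print(reverse_translate(["Methionine", "Proline", "Arginine"]))
    print(reverse_translate(["Cysteine", "Leucine"], mrna=True))
```

Keeping every value a tuple, even when there is only one doublet, lets downstream code treat single and alternative mappings uniformly.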

Both the neural network model and the rule-based system employ a software shell for execution. Specifically, the neural network approach employed herein requires a Runtime version of NeuroSHELL Classifier and the rule based system approach requires an expert system shell or look up table with code in order to evaluate the amino acid sequences from which the formatted doublet codes are derived.

Example 4 : Searching Data Structures

The output of the information processing system is formatted doublet nucleotide code reverse engineered from the actual amino acid sequence of a protein of interest. The output may be used to search a data structure, for example, an electronic database that comprises nucleotide sequence information. A search algorithm that may be used, which is not meant to be limiting in any manner, is shown below. However, a person of skill in the art will recognize that a variety of search or alignment algorithms, for example, but not limited to BLAST, FASTA and dynamic programming, may be used and all of these are meant to be included under the present invention.

A Simple Search Algorithm:

(i) Identify the first instance of the start codon "ATG" in the data structure.

(ii) Attempt to match the next doublet nucleotide codon of the sequence (e.g. CC_) with the next triplet codon of the target DNA sequence in the data structure.

(iii) If match = YES (e.g. CC_ and CCT) then repeat (ii) above.

(iv) If match = NO (e.g. CC_ and CAT) then find the next instance of the start codon "ATG" in the data structure and then repeat (ii) above.

(v) The above process continues until either a match is found between the doublet nucleotide code and the target nucleotide sequence in the data structure or the end of the nucleotide sequences in the data structure is reached with no match found.

(vi) If a match is found then Stop OR the same doublet nucleotide codon can be applied to the next target nucleotide sequence in the data structure if one exists.

(vii) If no match is found then the same doublet nucleotide codon can be used to search the next target nucleotide sequence in the data structure, OR a new doublet nucleotide codon may be applied to search the same target nucleotide sequence.

(viii) The above steps are repeated until all available target nucleotide sequences within the data structure have been fully searched using all available doublet nucleotide codons.
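Steps (i) to (v) can be sketched in Python for a single target sequence. The sketch rests on several assumptions that are not in the source: the query is a list of tuples of alternative codon patterns with "_" as a wildcard, the query begins with the pattern for Methionine so that it lines up with the start codon, and the function names `codon_matches` and `find_matches` are hypothetical. Alternatives within a tuple are treated as a boolean OR, and a pattern without a wildcard, such as "ATG", must match exactly.

```python
def codon_matches(alternatives, triplet):
    """True if any alternative pattern matches the triplet ('_' is a wildcard)."""
    return any(
        all(p == "_" or p == t for p, t in zip(pattern, triplet))
        for pattern in alternatives
    )


def find_matches(query, target):
    """Return every start index where the whole query matches in frame."""
    hits = []
    start = target.find("ATG")                      # step (i)
    while start != -1:
        matched = True
        for i, alternatives in enumerate(query):    # steps (ii) and (iii)
            triplet = target[start + 3 * i:start + 3 * i + 3]
            if len(triplet) < 3 or not codon_matches(alternatives, triplet):
                matched = False                     # step (iv): mismatch
                break
        if matched:
            hits.append(start)
        start = target.find("ATG", start + 1)       # next start codon
    return hits


if __name__ == "__main__":
    query = [("ATG",), ("CC_",), ("CG_", "AG_")]    # Met, Pro, Arg
    target = "GGATGCATAAATGCCTAGATT"
    print(find_matches(query, target))
```

This version collects all matches rather than stopping at the first one; stopping early, or moving on to the next target sequence as in steps (vi) to (viii), would be a small change to the loop.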

Searching with Special Instances of the Doublet Nucleotide Codons :

Methionine maps exclusively, that is, with 100% accuracy, to one and only one instance of the doublet codon AT_, namely ATG.

Tryptophan maps exclusively to one and only one instance of the doublet codon TG_ namely TGG.

Arginine maps exclusively to 2 instances of doublet codons namely CG_ OR AG_ .

Leucine maps exclusively to 2 instances of doublet codons namely CT_ OR TT_ .

Serine maps exclusively to 2 instances of doublet codons namely TC_ OR AG_ .

When the search process encounters one of the special instances of the doublet codons listed above the search at that point occurs as follows. For instance, where the mapping is from Methionine or Tryptophan, the search is for the exact instance namely "ATG" or "TGG" respectively. In instances where mapping is from Arginine, Leucine or Serine the search is carried out as a simple boolean "OR" match. This occurs because Arginine only maps to CG_ OR AG_; Leucine only maps to CT_ OR TT_ and Serine only maps to TC_ OR AG_.

Example 5: Example of Reverse Engineering Abnormal Gene Products

Normal adult hemoglobin is a tetramer composed of 2 pairs of polypeptide chains termed alpha and beta globin subunits. Each of these subunits is bound to a heme group. Hemoglobin is found primarily in red blood cells. It is the hemoglobin in the red blood cell that is responsible for transporting oxygen from the lungs to the tissues and transporting metabolic products such as carbon dioxide in the reverse direction.

A separate gene regulates the synthesis of each of the hemoglobin subunits. Normally individuals inherit one beta-chain gene from each parent, 2 alpha-chain genes and 2 gamma-chain genes from each parent. The inheritance of abnormal hemoglobins follows classical Mendelian genetics. The commonly found hemoglobin abnormalities are predominantly beta-chain variants and are usually due to a single amino acid replacement that results from a single base substitution in the encoding triplet DNA codon. The most common hemoglobin disorders are labeled Hemoglobin S and C. Thalassemia is a general term used to describe a genetically determined reduction in the amount of hemoglobin produced.

Hemoglobin S (HbS) and Hemoglobin C (HbC) both result from missense mutations at the 6th position. In the case of HbS, Valine is substituted for Glutamic acid while HbC results from the substitution of Lysine for Glutamic acid. One example of Thalassemia results from a Stop (TAG) mutation at position 17 and a second example of a Thalassemia mutant results from the deletion of an adjacent Adenine at position 8 producing a series of missense amino acids which terminates in an early Stop codon. Both of these examples result in decreased beta-chain synthesis.

The reverse engineering process described above was applied to the amino acid sequence and DNA code for the first 29 amino acids of the beta globin chain. The amino acid sequence for the first 29 Amino Acids was first coded as outlined in Example 1, and then submitted to a previously trained and cross validated neural network for analysis. The output of the neural network was a series of 29 or fewer formatted doublet nucleotide codons. The doublet nucleotide codon output was then compared with the known triplet DNA code. In all cases, i.e. Normal Hemoglobin, HbS, HbC and the 2 Thalassemias, the doublet nucleotide code predictions exactly matched the first 2 nucleotides of the actual DNA triplet codons including the actual mutations responsible for the abnormal hemoglobin. In other words, the first two nucleotides of each original triplet DNA codon were accurately determined from the doublet nucleotide code produced from neural network evaluation of the amino acid sequences. These results are summarized graphically in Figure 1.
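The comparison step for the position 6 variants can be illustrated with a short Python check. This sketch makes assumptions beyond the source: the triplet codons GAG, GTG and AAG are the commonly cited beta-globin codon 6 sequences for normal hemoglobin, HbS and HbC and do not appear in the text above, the doublets are taken from the rule table of Example 3 rather than from a trained neural network, and the names used are hypothetical.

```python
# Doublets for the three amino acids involved, as given by the rules in Example 3.
RULES = {"Glutamate": "GA_", "Valine": "GT_", "Lysine": "AA_"}

# Position 6 of the beta chain: (amino acid, triplet codon).
# The triplets are commonly cited values, not taken from the source text.
POSITION_6 = {
    "Normal": ("Glutamate", "GAG"),
    "HbS": ("Valine", "GTG"),
    "HbC": ("Lysine", "AAG"),
}


def doublet_agrees(doublet, triplet):
    """True if the predicted doublet equals the first 2 nucleotides of the triplet."""
    return doublet[:2] == triplet[:2]


if __name__ == "__main__":
    for variant, (amino_acid, triplet) in POSITION_6.items():
        doublet = RULES[amino_acid]
        print(variant, amino_acid, doublet, triplet, doublet_agrees(doublet, triplet))
```

In each of the three cases the single base substitution falls within the first two codon positions, so the predicted doublet changes along with the amino acid.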

All references are herein incorporated by reference.

The present invention has been described with regard to preferred embodiments. However, it will be obvious to persons skilled in the art that a number of variations and modifications can be made without departing from the scope of the invention as described herein.

Claims

THE EMBODIMENTS OF THE INVENTION IN WHICH AN EXCLUSIVE PROPERTY OR PRIVILEGE IS CLAIMED ARE DEFINED AS FOLLOWS:

1. A method of converting an amino acid sequence of a protein of interest to doublet nucleotide code comprising the steps of, a) providing a first data set corresponding to said amino acid sequence as input to an information processing system (IPS), and; b) determining from said IPS a second data set corresponding to predicted doublet nucleotide code encoding said amino acid sequence.

2. The method of claim 1, wherein after said step of determining, said method comprises the step of outputting said second data set, or information equivalent to said predicted doublet code encoding the amino acid sequence of the protein of interest.

3. The method of claim 1, wherein said doublet nucleotide code comprises a DNA nucleotide sequence or an RNA nucleotide sequence.

4. The method of claim 1 wherein said first data set comprises a full or partial amino acid sequence of a protein of interest.

5. The method of claim 1, wherein said first data set corresponding to the amino acid sequence is in binary form.

6. The method of claim 1, wherein said protein of interest is a variant or mutant protein of a wild-type protein.

7. The method of claim 6, wherein said variant or mutant protein is associated with a disease.

8. The method of claim 1, wherein said information processing system (IPS) comprises a neural network.

9. The method of claim 8, wherein said IPS employs a genetic algorithm.

10. The method of claim 8, wherein the neural network is NeuroSHELL Classifier v2.0.

11. The method of claim 8, wherein said neural network is a trained neural network.

12. The method of claim 11, wherein said neural network is trained by processing a plurality of data sets comprising data elements (X,Y) wherein X represents the amino acid sequence of a protein and Y represents the nucleotide sequence encoding X.

13. The method of claim 1, wherein said IPS comprises a rule-based system.

14. A method of identifying a nucleotide sequence encoding a protein of interest within a data structure, comprising the steps of a) providing a first data set corresponding to the amino acid sequence of the protein of interest to an information processing system (IPS) capable of producing a second data set corresponding to the predicted doublet DNA code encoded by the amino acid sequence, and; b) performing an in string search of the data structure to identify all instances wherein the second data set is present in the data structure.

15. The method of claim 14, wherein said data structure comprises an electronic medium containing nucleotide sequence information.

16. The method of claim 15, wherein said electronic medium comprises an electronic database.

17. The method of claim 15, wherein said nucleotide sequence information comprises one or more genomic nucleotide sequences of one or more organisms.

18. The method of claim 14, wherein said in string search comprises a sliding window approach.

19. The method of claim 14, wherein said in string search is controlled by an alignment algorithm.

20. The method of claim 19, wherein said alignment algorithm comprises BLAST, FASTA, dynamic programming, or a version or an executable-code-modified version thereof.

21. An information processing system (IPS) capable of a) receiving a first data set corresponding to an amino acid sequence of a protein of interest, and b) producing a second data set corresponding to the predicted doublet nucleotide code encoding the amino acid sequence.

22. The IPS of claim 21, further comprising an alignment algorithm for performing an in-string search of a data structure to identify all instances wherein the second data set is present in the data structure.

23. The IPS of claim 22, wherein said data structure comprises an on-line data structure.