WO2004053766A1 - Reverse translation of protein sequences to nucleotide code - Google Patents

Reverse translation of protein sequences to nucleotide code

Info

Publication number
WO2004053766A1
Authority
WO
WIPO (PCT)
Prior art keywords
doublet
amino acid
nucleotide
protein
acid sequence
Prior art date
Application number
PCT/CA2003/001929
Other languages
French (fr)
Inventor
Wayne R. Danter
Original Assignee
London Health Sciences Centre Research Inc.
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by London Health Sciences Centre Research Inc.
Priority to AU2003287823A
Publication of WO2004053766A1

Classifications

    • G PHYSICS
    • G16 INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR SPECIFIC APPLICATION FIELDS
    • G16B BIOINFORMATICS, i.e. INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR GENETIC OR PROTEIN-RELATED DATA PROCESSING IN COMPUTATIONAL MOLECULAR BIOLOGY
    • G16B40/00 ICT specially adapted for biostatistics; ICT specially adapted for bioinformatics-related machine learning or data mining, e.g. knowledge discovery or pattern finding
    • G16B40/20 Supervised data analysis
    • G PHYSICS
    • G16 INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR SPECIFIC APPLICATION FIELDS
    • G16B BIOINFORMATICS, i.e. INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR GENETIC OR PROTEIN-RELATED DATA PROCESSING IN COMPUTATIONAL MOLECULAR BIOLOGY
    • G16B30/00 ICT specially adapted for sequence analysis involving nucleotides or amino acids

Definitions

  • the present invention relates to systems and methods for performing biological sequence analysis. More specifically, the invention relates to systems and methods for reverse engineering protein sequences to nucleotide code.
  • Gene products comprise peptides, proteins and antibodies that result from the complex processes of (1) DNA transcription producing messenger ribonucleic acid (mRNA), (2) ribosomal translation of mRNA and (3) post translational processing of the resulting proteins.
  • pharmacogenomics is the study of therapeutics in relation to the genetic makeup of an organism.
  • CML: Chronic Myelogenous Leukemia
  • Gleevec is a specific protein tyrosine kinase inhibitor that targets the ATP (Adenosine Tri-Phosphate) binding site of the abnormal enzyme, but leaves cells containing the normal or wild-type enzyme largely unaffected. This represents the first instance in which a specific therapeutic agent capable of selectively targeting cells with a specific genetic abnormality has been produced.
  • One potential method of linking disease to the genetic makeup of an organism is to identify an abnormal gene product associated with a disease and then search for the DNA encoding the abnormal gene product. This association may allow for a clearer understanding of the disease process, enhance diagnosis and screening for the disease, and may lead to the development of therapeutics or specific gene therapies that target and repair the abnormal gene sequence.
  • the basic building blocks of human proteins are the amino acids, of which 20 are most commonly used.
  • the double helix of DNA is composed of units called nucleotides or base pairs that are organized into triplets known as codons. There are 64 of these different three base pair codons.
  • One of these codons codes for either the START instruction or the amino acid methionine, depending on whether or not the codon instruction occurs at the beginning of the coding sequence.
  • Three other codons code for the STOP instruction that terminates ribosomal translation.
  • the remaining 60 triplets code for the 20 amino acids commonly linked together by amide bonds to form gene products (i.e. peptides, proteins and antibodies).
  • DNA code is both redundant and degenerate.
  • Methionine and Tryptophan are each encoded by one unique triplet codon
  • the other amino acids may be encoded by 2, 3, 4 or 6 different triplet codons.
  • Three amino acids, i.e. Arginine, Leucine and Serine, may each be encoded by 6 different triplet codons.
  • a method of converting an amino acid sequence of a protein of interest to doublet nucleotide code comprising the steps of, a) providing a first data set corresponding to the amino acid sequence as input to an information processing system (IPS), and; b) determining from the IPS a second data set corresponding to the predicted doublet nucleotide code encoding the amino acid sequence of the protein of interest. Further, the method may also comprise the step of outputting the second data set, or information equivalent to the predicted doublet code encoding the amino acid sequence of the protein of interest. Further, the doublet nucleotide code may comprise a DNA nucleotide sequence or an RNA nucleotide sequence.
  • the first data set may comprise a full or partial amino acid sequence of a protein of interest. Further, the first data set may be encoded in binary form.
  • the protein of interest may be a variant or mutant protein of a wild-type protein. Further, the variant or mutant protein may be associated with a disease.
  • the present invention provides a method as defined above, wherein the information processing system (IPS) comprises a neural network.
  • the neural network may also employ a genetic algorithm.
  • the neural network is NeuroSHELL Classifier v2.0.
  • the neural network is a trained neural network.
  • the neural network may be trained by processing a plurality of data sets comprising data elements (X,Y) wherein X represents the amino acid sequence of a protein and Y represents the nucleotide sequence encoding X.
  • the IPS may comprise a rule-based system.
  • the present invention also contemplates a method of identifying a nucleotide sequence encoding a protein of interest within a data structure, comprising the steps of a) providing a first data set corresponding to the amino acid sequence of the protein of interest to an information processing system (IPS) capable of producing a second data set corresponding to the predicted doublet DNA code encoding the amino acid sequence, and b) performing an in-string search of the data structure to identify all instances wherein the second data set is present in the data structure.
  • the data structure may comprise an electronic medium containing nucleotide sequences, for example, but not limited to an electronic database. Further the nucleotide sequences may comprise genomic nucleotide sequences, for example, but not limited to containing introns.
  • the in-string search may be performed, controlled or both performed and controlled by an algorithm employing a sliding window approach to compare sequences.
  • the algorithm may comprise an alignment algorithm such as, but not limited to BLAST, FASTA, dynamic programming, or a version or an executable-code-modified version thereof.
  • an information processing system capable of a) receiving a first data set corresponding to an amino acid sequence of a protein of interest, and b) producing a second data set corresponding to the predicted doublet nucleotide code encoding the amino acid sequence.
  • the IPS may further comprise hardware, software or the like, and may further comprise an alignment algorithm for performing an in-string search of a data structure to identify all instances wherein the second data set is present in the data structure.
  • FIGURE 1 shows the predicted doublet nucleotide code of normal and mutant hemoglobin proteins output from a trained neural network following input of the first twenty nine amino acids of the proteins.
  • a method of converting an amino acid sequence of a protein of interest to doublet nucleotide code comprising the steps of, a) providing a first data set corresponding to the amino acid sequence as input to an information processing system (IPS), and; b) determining from the IPS a second data set corresponding to predicted doublet nucleotide code encoding the amino acid sequence.
  • the method may further comprise a step of outputting the second data set, or information equivalent to the predicted doublet nucleotide code of the protein of interest.
  • an information processing system capable of a) receiving as input a first data set corresponding to an amino acid sequence, b) determining a second data set corresponding to the predicted doublet nucleotide code encoding the amino acid sequence and c) outputting the second data set or information equivalent to the predicted doublet nucleotide code.
  • the IPS may comprise additional hardware, for example but not limited to control circuits, microprocessors, software or both, and it may be part of one or more computers or biological sequence analysis systems.
  • Definitions:
  • by "amino acid sequence" it is meant a consecutive sequence of amino acids linked via peptide bonds defining the protein of interest, preferably starting from the amino terminus (N-terminus) and proceeding to the carboxy terminus (C-terminus) of the protein.
  • the protein of interest may comprise any protein known in the art, for example, but not limited to, pharmaceutically important proteins such as, but not limited to regulatory proteins, signaling proteins, growth factors, growth regulators, antibodies, antigens, interleukins, insulin, colony stimulating factors such as G-CSF, GM-CSF, hPG-CSF, M-CSF or combinations thereof, interferons, for example, interferon-α, interferon-β, interferon-γ, blood clotting factors, for example, Factor VIII, Factor IX, or tPA.
  • any native protein produced by an organism may be considered a protein of interest.
  • the present invention contemplates mutant proteins and variants of native proteins produced by an organism.
  • by "doublet nucleotide code" it is meant a DNA or RNA nucleotide sequence comprising two of the three nucleotides of each triplet codon encoding an amino acid, with a blank, space-holder or the like in the third position.
  • Alanine may be represented by the doublet codon GC_, Cysteine by TG_, Arginine by CG_ OR AG_, Leucine by CT_ OR TT_, Serine by TC_ OR AG_, etc.
  • the doublet nucleotide code is provided in a defined orientation, such as a 5' to 3' orientation, as would be understood by a person of skill in the art.
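The doublet representation described above can be illustrated with a short sketch that derives, for each amino acid, the set of first-two-nucleotide doublets from the standard genetic code. This is only an illustration under stated assumptions: the one-letter amino acid codes, the compact `BASES`/`AMINO` table layout and the function name `doublets_by_amino_acid` are not part of the original text.

```python
from collections import defaultdict

# Standard genetic code, codons enumerated in T, C, A, G order for each position.
# "*" marks the three STOP codons. This layout is an assumption of the sketch.
BASES = "TCAG"
AMINO = ("FFLLSSSSYY**CC*W"
         "LLLLPPPPHHQQRRRR"
         "IIIMTTTTNNKKSSRR"
         "VVVVAAAADDEEGGGG")

def doublets_by_amino_acid():
    """Map each amino acid (one-letter code) to the set of doublets encoding it."""
    table = defaultdict(set)
    index = 0
    for b1 in BASES:
        for b2 in BASES:
            for _b3 in BASES:
                amino = AMINO[index]
                index += 1
                if amino != "*":          # skip STOP codons
                    table[amino].add(b1 + b2)
    return dict(table)

DOUBLETS = doublets_by_amino_acid()
# Amino acids that need more than one doublet.
AMBIGUOUS = {aa for aa, ds in DOUBLETS.items() if len(ds) > 1}
```

Under the standard code, `AMBIGUOUS` contains only Arginine (R), Leucine (L) and Serine (S); every other amino acid maps to a single doublet followed by the placeholder.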
  • by "first data set" it is meant information pertaining to the amino acid sequence of the protein of interest.
  • the first data set comprises the amino acid sequence of the protein of interest, or a computer readable version thereof.
  • the first data set may be encoded on an electronic medium, such as, but not limited to a computer information storage device, such as, but not limited to a hard drive, floppy disk or the like.
  • the amino acid Alanine, which represents one of the twenty amino acids, may be represented by the numerical string 10000000000000000000, serine as 00000000000010000000, valine as 00000000000000000001 and so on.
  • a protein of interest that comprises the amino acid sequence Ala-Ala-Val-Ser may be depicted by the numerical sequence:
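A minimal sketch of this binary encoding follows, assuming 20-digit strings and using only the three positions the text gives (Alanine first, Serine thirteenth, Valine twentieth). The positions of the other amino acids are not stated, so they are left out; `POSITION`, `one_hot` and `encode_peptide` are hypothetical names.

```python
# Zero-based positions taken from the strings quoted in the text.
# Positions for the remaining 17 amino acids are not given and are omitted.
POSITION = {"Ala": 0, "Ser": 12, "Val": 19}

def one_hot(residue, width=20):
    """Return a string of `width` digits with a single 1 at the residue's position."""
    bits = ["0"] * width
    bits[POSITION[residue]] = "1"
    return "".join(bits)

def encode_peptide(peptide):
    """Concatenate the per-residue strings for a peptide written as 'Ala-Ala-Val-Ser'."""
    return "".join(one_hot(residue) for residue in peptide.split("-"))

encoded = encode_peptide("Ala-Ala-Val-Ser")
```

Here `encoded` is an 80-digit string: two copies of the Alanine string followed by the Valine and Serine strings.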
  • by "second data set" it is meant information corresponding to the predicted doublet nucleotide code encoding the amino acid sequence.
  • the second data set, corresponding to the predicted doublet DNA code may also be encoded on an electronic medium as defined previously.
  • the IPS may comprise a neural network or a rule-based system or algorithm.
  • the present invention employs a neural network, preferably a trained neural network.
  • the information processing system comprises a rule-based algorithm.
  • the IPS may also comprise one or more circuits, microprocessors, or combinations thereof, as would be evident to a person of skill in the art.
  • Neural Networks are pattern recognition computer models based on the human nervous system that are capable of learning from experience and then making predictions about new patterns. Neural computation encompasses the concepts of distributed, adaptive and nonlinear computing. Neural networks usually comprise a plurality of layers such as an input layer, a middle or hidden layer and an output layer. Each layer comprises a plurality of processing elements that are usually interconnected by weighted connections or scaling factors. A processing element multiplies an input by a set of weights, and non-linearly transforms the result into an output value. The performance of the neural network may be measured in terms of a desired signal and error criterion. The output of the neural network is compared with a desired response to produce an error.
  • a backpropagation algorithm may be used to adjust the weights interconnecting the processing elements in a manner to minimize the error.
  • Other learning algorithms that may be employed include, but are not limited to Probabilistic/Bayesian, Generalized Regression, Self organizing maps (eg Kohonen Networks), Cascade Correlation or a combination thereof.
  • the network may be trained by repeatedly exposing the neural network to known data patterns while the training algorithm adjusts the connection weights between processing elements in order to "learn" the relationship between the input patterns and the desired outputs. This process continues until a predetermined error tolerance has been achieved so as to optimize the generalizability of the trained model.
  • a neural network may be trained by processing a plurality of data sets comprising data elements (X,Y) wherein X represents the amino acid sequence of a protein and Y represents the nucleotide sequence encoding X.
  • the current process attempts to relate the properties of each of the 20 AAs to the specific coding doublet DNA code. Having taught a machine learning system to understand this relationship, the model can then reverse engineer any given sequence of amino acids to the original DNA code via the concept of the formatted doublet codon.
  • the training set comprises 128 data patterns (64 for the DNA instance and 64 for the mRNA instance) relating each AA to its encoding doublet DNA codon.
  • a trained neural network based on Bayes theorem is a powerful classifier and by design learns and makes predictions based on probabilities calculated from the training data. Once trained and validated the neural network can be used to evaluate the amino acid sequence of the protein of interest and output the predicted doublet nucleotide code corresponding to the input amino acid sequence.
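The idea of a training set of 128 patterns feeding a probability-based classifier can be sketched as follows. This is not the NeuroSHELL model: it is a plain frequency count standing in for a Bayes-style classifier, and it assumes that the 128 patterns are the 64 codons of the standard genetic code taken once for DNA and once for mRNA. All names (`build_patterns`, `train`, `predict`) are hypothetical.

```python
from collections import Counter, defaultdict

BASES = "TCAG"
AMINO = ("FFLLSSSSYY**CC*W"
         "LLLLPPPPHHQQRRRR"
         "IIIMTTTTNNKKSSRR"
         "VVVVAAAADDEEGGGG")  # "*" = STOP

def build_patterns():
    """Return (X, Y) pairs: X = (amino acid, mode), Y = encoding doublet."""
    patterns = []
    for mode in ("DNA", "mRNA"):
        index = 0
        for b1 in BASES:
            for b2 in BASES:
                for _b3 in BASES:
                    doublet = b1 + b2
                    if mode == "mRNA":
                        doublet = doublet.replace("T", "U")
                    patterns.append(((AMINO[index], mode), doublet))
                    index += 1
    return patterns

def train(patterns):
    """Count how often each doublet is seen for each (amino acid, mode) input."""
    counts = defaultdict(Counter)
    for x, y in patterns:
        counts[x][y] += 1
    return counts

def predict(counts, amino, mode):
    """Return the doublet with the highest estimated P(doublet | amino acid, mode)."""
    return counts[(amino, mode)].most_common(1)[0][0]

PATTERNS = build_patterns()
MODEL = train(PATTERNS)
```

With this toy model, `predict(MODEL, "R", "DNA")` selects the more frequent Arginine doublet and `predict(MODEL, "L", "mRNA")` returns the doublet in the RNA alphabet.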
  • Examples of training of neural network systems are known in the art and may be found in, but are not limited to: Foundations of Neural Networks, Fuzzy Systems and Knowledge Engineering, Nikola K. Kasabov, A Bradford Book, The MIT Press, Cambridge, Massachusetts, London, England, 1998; and Neural Smithing: Supervised Learning in Feedforward Artificial Networks, Russell D. Reed and Robert J. Marks II, A Bradford Book, The MIT Press, Cambridge, Massachusetts, London, England, 1999; which are hereby incorporated by reference.
  • Neural networks that may be employed in the present invention include, but are not limited to NeuroSHELL Classifier v2.0.
  • the present system may rely on a number of specific features of the NeuroSHELL package, for example, but not limited to the ability to carry out leave one out cross validation coupled with the probabilistic classifier utilizing a genetic algorithm to evolve a cross validated optimal solution to the classification problem.
  • the IPS may comprise a rule-based algorithm.
  • a rule-based algorithm may comprise a plurality of rules such as:
  • the rule-based system comprises (1) a comprehensive set of rules of varying complexity to handle all possible relationships and (2) an efficient search algorithm to find the appropriate rule quickly rather than a brute force search through every rule each time to find the appropriate rule for the instance of the data pattern being evaluated.
  • rule based systems may be employed in the present invention as would be understood by a person of skill in the art.
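One way to picture a rule-based system of this kind is a lookup table keyed by amino acid, so that the matching rule is found by a single hashed lookup rather than by scanning every rule. The sketch below is an assumption about how such rules could be laid out, not the rule set of the invention; `RULES` and `reverse_translate` are hypothetical names, and the doublets follow the standard genetic code.

```python
# One rule per amino acid: three-letter code -> possible doublets (DNA alphabet).
RULES = {
    "Ala": ["GC"], "Arg": ["CG", "AG"], "Asn": ["AA"], "Asp": ["GA"],
    "Cys": ["TG"], "Gln": ["CA"], "Glu": ["GA"], "Gly": ["GG"],
    "His": ["CA"], "Ile": ["AT"], "Leu": ["CT", "TT"], "Lys": ["AA"],
    "Met": ["AT"], "Phe": ["TT"], "Pro": ["CC"], "Ser": ["TC", "AG"],
    "Thr": ["AC"], "Trp": ["TG"], "Tyr": ["TA"], "Val": ["GT"],
}

def reverse_translate(residues, placeholder="_"):
    """Return, for each residue, the candidate doublet codons with a placeholder third base."""
    return [[doublet + placeholder for doublet in RULES[residue]]
            for residue in residues]

candidates = reverse_translate(["Met", "Val", "His", "Leu"])
```

For the four residues above the result is one candidate each for Met, Val and His, and two alternatives for Leu, which a later step would have to resolve.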
  • the genetic material of an organism comprises a string of nucleotides consisting of adenine (A), cytosine (C), guanine (G) and thymine (T) in the case of DNA and A, C, G, and uracil (U) in the case of RNA.
  • Proteins comprising a series of amino acids linked by peptide bonds are produced by transcribing and translating the genetic material. It is well known in the art that triplet codons consisting of three consecutive nucleotides of genetic material specify the amino acids to be incorporated into a protein.
  • Valine: GT. Thus, about 80% of the amino acids in proteins can be directly mapped to the triplet DNA code from the doublet nucleotide code.
  • This doublet nucleotide code may be converted to the modern triplet code format by simply adding a blank space holder in the third nucleotide position. If sequence analysis is employed using doublet nucleotide codons for 16 amino acids, then the problem of reverse engineering an amino acid sequence of a protein of interest may be significantly reduced. Further, as the amino acid Tryptophan is encoded by a single codon (i.e. TGG) then only three amino acids (Arginine, Leucine and Serine) encoded by multiple triplet codons require conversion to multiple doublet codons.
  • Results from the human genome project and other sequencing projects permit statistical analysis of DNA sequences. For example, it is possible to estimate the probability that a given amino acid is encoded by a specific doublet DNA codon. Such a statistical approach may be applied to the 3 amino acids (Arginine, Leucine and Serine) that cannot be defined by a single doublet nucleotide code. For example, but not wishing to be limiting, based on statistical analysis of hundreds of thousands of DNA sequences from the human genome, it may be predicted that about 60% of the time Arginine is encoded by the doublet codon CG and about 40% of the time by the doublet codon AG.
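The statistical choice between alternative doublets can be sketched as a ranked lookup. Only the Arginine figures (about 60% CG, about 40% AG) come from the text; values for Leucine and Serine are not given here and are deliberately not filled in. `DOUBLET_PROBABILITIES` and `ranked_doublets` are hypothetical names.

```python
# Approximate doublet probabilities quoted in the text for Arginine.
# Leucine and Serine would need their own entries, which the text does not supply.
DOUBLET_PROBABILITIES = {
    "Arg": {"CG": 0.60, "AG": 0.40},
}

def ranked_doublets(amino_acid):
    """Return the candidate doublets for an amino acid, most probable first."""
    probabilities = DOUBLET_PROBABILITIES[amino_acid]
    return sorted(probabilities, key=probabilities.get, reverse=True)

best_guess = ranked_doublets("Arg")[0]
```

A search could try `best_guess` first and fall back to the next entry of `ranked_doublets("Arg")` if no match is found. Species-specific tables could be handled by adding an organism key to the dictionary.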
  • nucleotide coding frequencies encoding specific amino acids may be organism or species specific.
  • the present invention also contemplates using known codon frequencies for specific organisms during reverse engineering protein sequences to nucleotide code as described herein. Further the present invention contemplates extrapolating the codon frequencies for organisms that have yet to be fully sequenced at the nucleotide level, for example, but not limited to by analyzing the nucleotide sequence encoding known proteins from that organism.
  • Table 2: Doublet Codon Probabilities Derived From the Human Genome Project (columns: Amino Acid, Doublet, Probability)
  • the present invention contemplates a symbolic probabilistic reverse mapping system from amino acid sequence to DNA or RNA code using a doublet nucleotide representation (i.e. AA_, CG_, TT_, CA_, etc).
  • the system may also employ additional concepts such as, but not limited to, that every coding sequence has a start (ATG) and termination (STOP) code.
  • These 2 instructions may be employed to represent the "boundaries" of a target DNA sequence and therefore define a finite number of amino acids and codons lying between the boundaries.
  • both methionine and tryptophan are each encoded by a unique triplet codon (i.e. ATG and TGG, respectively). These unique triplet codons may be employed as "anchors" or constants within a given DNA or RNA sequence. Boundaries, anchors or both provide a simple method of partial internal validation of the reverse engineering proteins to nucleotide code.
  • the method of reverse engineering protein sequences to nucleotide code may be employed to convert an amino acid sequence of a protein of interest into doublet nucleotide code and identify the corresponding nucleotide sequence in a data-structure, database or the like, wherein the nucleotide sequence also comprises one or more introns.
  • the method may comprise a 2 (or more) stage search whereby the first stage search looks for the contiguous DNA sequence predicted by the system. If no match is found then an iterative 2nd stage search may be employed to identify subset matches of codons, preferably at least 3 contiguous codons.
  • a codon by codon match may ensue. For example, if the first 3 codons of the probe match 3 codons in the target but not the fourth, then the search algorithm interprets this as the beginning of a possible intron sequence and tries to match the 4th codon of the probe with the next codon in the target. As long as there are 2 or fewer contiguous codon matches this process continues. Matches of 3 or more codons are acknowledged by the system as above and the next iteration begins. The iterative search continues until the probe sequence is matched. A minimum match requirement for 3 contiguous codons is based on calculated probabilities for the appearance of 1, 2, 3 and 4 codons in association in any DNA sequence. The probability that any 3 contiguous codons would be associated based on chance alone is small.
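The two-stage search described above can be sketched as follows. This is a simplified greedy interpretation, not the patented procedure: stage 1 looks for the whole probe as one contiguous run, and stage 2 repeatedly looks for the next run of at least 3 contiguous matching codons, treating anything skipped as possible intron sequence. The function names, the toy target and the rule that a final run may be shorter than 3 codons are assumptions of the sketch.

```python
def run_length(probe, target, pos):
    """Count leading probe doublets that match target codons starting at pos.
    The third base of every codon is treated as a wildcard."""
    n = 0
    for doublet in probe:
        start = pos + 3 * n
        if target[start:start + 2] != doublet:
            break
        n += 1
    return n

def find_contiguous(probe, target):
    """Stage 1: return the first offset where the whole probe matches, else None."""
    for pos in range(len(target)):
        if run_length(probe, target, pos) == len(probe):
            return pos
    return None

def find_blocks(probe, target, min_run=3):
    """Stage 2: return (probe_index, target_offset, codons) blocks, else None."""
    blocks, p, t = [], 0, 0
    while p < len(probe):
        needed = min(min_run, len(probe) - p)  # assumption: a short tail may match
        for pos in range(t, len(target)):
            n = run_length(probe[p:], target, pos)
            if n >= needed:
                blocks.append((p, pos, n))
                p, t = p + n, pos + 3 * n
                break
        else:
            return None  # no acceptable run found for the rest of the probe
    return blocks

# Toy example: six doublets split by a made-up 10-base "intron".
PROBE = ["AT", "GT", "CA", "CT", "AC", "CC"]
TARGET = "ATGGTGCAC" + "GTAAGTTTAG" + "CTGACTCCT"
```

For this toy target, `find_contiguous(PROBE, TARGET)` finds nothing, while `find_blocks(PROBE, TARGET)` reports two blocks of three codons each, one on either side of the inserted stretch.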
  • the method of the present invention was tested using wild-type and mutant hemoglobins including hemoglobin S, hemoglobin C and two hemoglobins encoding truncated proteins.
  • HbS and HbC both result from missense mutations at the 6th codon position.
  • in HbS, Valine is substituted for Glutamic acid
  • HbC results from the substitution of Lysine for Glutamic acid.
  • One example of Thalassemia results from a Stop (TAG) mutation at position 17 and a second example of a Thalassemia mutant results from the deletion of an adjacent Adenine at position 8 producing a series of missense amino acids which terminates in an early Stop codon. Both of these examples result in decreased beta-chain synthesis.
  • the method of the present invention as described herein was used to determine the doublet nucleotide code encoding up to the first 29 amino acids of the beta globin chain. Results are shown in Figure 1.
  • the amino acid sequence for the first 29 amino acids of the hemoglobin proteins was first encoded as outlined in Example 1, and then submitted to a trained neural network for analysis. The output of the neural network was a series of 29 or fewer formatted doublet nucleotide codons. The doublet nucleotide code output was compared with the known triplet DNA code.
  • the doublet nucleotide code predictions exactly matched the first 2 nucleotides of the actual DNA triplet codons including the actual mutations responsible for the abnormal hemoglobin. This indicates that the original triplet DNA code was accurately determined from the doublet nucleotide code produced from neural network evaluation of the amino acid sequences.
  • the method of the present invention may further comprise one or more additional steps at any stage in the method.
  • the second data set may be used to search a data structure comprising nucleotide sequence information.
  • by "data structure" it is meant any electronic medium comprising nucleotide sequence information.
  • the data structure may comprise a database or the like which contains the genome of an organism, for example, but not limited to a yeast genome, such as, but not limited to Saccharomyces cerevisiae, Schizosaccharomyces pombe (Nature 387, 5-105 (suppl) (1997); Wood et al., Nature, 415(6874):871-880 (2002)), protozoa, such as but not limited to Plasmodium falciparum, plants such as, but not limited to Arabidopsis thaliana (Nature 408(6814):796-815), Oryza sativa (Yu et al., Science, 296:79-92 (2002); Goff et al., Science, 296:92-100 (2002)), nematodes such as, but not limited to Caenorhabditis elegans (Washington U, Science, 282(5396):2012-2018 (1998)), insects such as, but not limited to Drosophila melanogaster (Adams, et al., Science, 287(5461):2185-2195 (2000)) and human (Venter, et al. Science, 291:1304-1351 (2001); International Human Genome Sequencing Consortium, Nature, 409:860-921 (2001)) or a combination thereof.
  • the data structure may comprise a plurality of nucleotide sequences, preferably eukaryotic nucleotide sequences. However, the data structure may also comprise prokaryotic sequences. Preferably, the coding relationship between nucleotide codons and amino acids is known for the species in question. Also, the present invention contemplates data structures comprising partial or incomplete genomes of organisms. For example, but not to be considered limiting in any manner, the data structure may comprise one or more databases from the National Center for Biotechnology Information (NCBI), European Molecular Biology Laboratory (EMBL), or both. Further, the data structure may comprise a commercial data structure, for example, but not limited to, such as the type available from Celera.
  • a method of identifying a nucleotide sequence encoding a protein of interest within a data structure comprising the steps of i) providing a first data set corresponding to the amino acid sequence of the protein of interest to an information processing system (IPS) capable of producing a second data set corresponding to the predicted doublet nucleotide code encoding the amino acid sequence of the protein of interest, and; ii) performing an in-string search of the data structure to identify all instances wherein the second data set is present in the data structure.
  • the in-string search of the data structure may be accomplished by any appropriate search algorithm known in the art.
  • the algorithm may perform a simple in-string search to identify the predicted nucleotide sequence within the data structure.
  • An example of a simple in-string search, which is not meant to be considered limiting in any manner is described in Example 4.
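A simple in-string search for a doublet code can be sketched with a regular expression in which every placeholder matches any nucleotide. This is an illustrative sketch rather than the search of Example 4, which is not reproduced here; `doublet_to_regex`, `find_all` and the toy target are hypothetical.

```python
import re

def doublet_to_regex(code, alphabet="ACGT"):
    """Turn a doublet code such as 'AT_GT_CA_' into a regular expression."""
    return code.replace("_", "[" + alphabet + "]")

def find_all(code, target):
    """Return the start offsets of every (possibly overlapping) match in target."""
    # The lookahead lets the regex engine report overlapping matches.
    pattern = re.compile("(?=(" + doublet_to_regex(code) + "))")
    return [match.start() for match in pattern.finditer(target)]

hits = find_all("AT_GT_CA_", "CCATGGTGCACATGGTTCAT")
```

In this toy target the doublet code matches at offsets 2 and 11, with different nucleotides filling the third positions in each hit.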
  • dynamic programming, for example, but not limited to, the Smith-Waterman dynamic programming algorithm or a version thereof may be employed (Smith and Waterman, 1981 a,b Identification of Common Molecular Subsequences. J. Molecular Biology 147:195-197; which is herein incorporated by reference).
  • BLAST (Basic Local Alignment Search Tool), FASTA, k-tuples (Altschul et al., 1990, Basic Local Alignment Search Tool, J. Mol. Biol. 215:403-410; which are herein incorporated by reference)
  • other algorithms for example as described in Bioinformatics, Sequence and Genome Analysis by David W. Mount, Cold Spring Harbor Laboratory Press, Cold Spring Harbor, New York, 2001 and references contained therein, which are herein incorporated by reference
  • any suitable algorithm known in the art may be employed to perform the in-string search.
  • alignment algorithms such as, but not limited to BLAST and FASTA and dynamic programming algorithms permit alignment of sequences that are not identical.
  • such algorithms may be employed to search a data structure that comprises introns in nucleotide sequences, for example, but not limited to genomic nucleotide sequences.
  • the alignment algorithms also permit alignment of a predicted nucleotide doublet sequence encoding a mutant protein of interest with one or more nucleotide sequences contained in a data structure, for example, but not limited to an electronic database. In such an instance the predicted doublet nucleotide sequence does not need to be identical to the sequence contained in the data structure.
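To illustrate how a dynamic programming alignment tolerates sequences that are not identical, the following is a minimal sketch of Smith-Waterman local alignment scoring with a linear gap penalty. The scoring values (match 3, mismatch -3, gap 2) and the function name are assumptions chosen for the example, and only the best score is returned, not the alignment itself.

```python
def smith_waterman_score(a, b, match=3, mismatch=-3, gap=2):
    """Return the best local alignment score between sequences a and b."""
    rows, cols = len(a) + 1, len(b) + 1
    # score[i][j] = best score of a local alignment ending at a[i-1], b[j-1]
    score = [[0] * cols for _ in range(rows)]
    best = 0
    for i in range(1, rows):
        for j in range(1, cols):
            diagonal = score[i - 1][j - 1] + (match if a[i - 1] == b[j - 1] else mismatch)
            up = score[i - 1][j] - gap      # gap in b
            left = score[i][j - 1] - gap    # gap in a
            score[i][j] = max(0, diagonal, up, left)
            best = max(best, score[i][j])
    return best

example_score = smith_waterman_score("TGTTACGG", "GGTTGACTA")
```

In this example the best local alignment pairs GTT-AC with GTTGAC: five matches and one gap, for a score of 13. A predicted sequence for a mutant protein would similarly score slightly below a perfect match against the wild-type sequence.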
  • the human genome has been sequenced and thus any protein produced in a human may be mapped to a specific nucleotide sequence in the genome.
  • a human subject suffering from a disease may exhibit one or more mutant proteins, that may be partly or wholly responsible for the disease. If a mutant protein is isolated and sequenced, it is unlikely that the exact nucleotide sequence encoding this mutant protein will be found within a data structure comprising the human genome.
  • the search algorithm may be employed to determine the most likely nucleotide sequence within the data structure that may give rise to the mutant protein.
  • a person of skill in the art may then determine whether the mutation arose as a result of a point mutation, such as, but not limited to insertion of a stop codon into a nucleotide sequence that is translated into a truncated protein, or an inversion, deletion, translocation or combination thereof.
  • the present invention also contemplates a method of searching a data structure for a target nucleotide sequence wherein the target nucleotide sequence comprises the doublet nucleotide code encoding the amino acid sequence of the protein of interest.
  • the data structure may comprise any database known in the art. Further, the data structure may comprise introns, for example, as found in genomic nucleotide sequences of eukaryotes.
  • Amino acids are represented by a series of ones and zeros as described by White and Seffens (1998; Electronic J. Biotech. 1: 196-201, which is herein incorporated by reference) for each of the 20 Amino Acids.
  • one additional variable for example, consisting entirely of zeros may be used for the Stop instruction.
  • each Amino Acid can be represented as a string of 20 0s and a 1 which is unique for each Amino Acid. For example:
  • Alanine is represented by the string: 100000000000000000000
  • Valine is represented by the string: 000000000000000000010
  • Serine is represented by the string: 000000000000100000000, and so on for all 20 AAs.
  • the Stop instruction is represented as all 0s 00000000000000000000.
  • the 20 amino acids have different structural and chemical properties. Similarities in some of these properties have permitted a rudimentary classification scheme based on size, charge and lipid solubility. Each of the amino acids can be classified as (1) hydrophobic, (2) small/polar, (3) charged/polar, or (4) polar. This is an imperfect system with some amino acids arguably being members of more than one class. Another approach is to use molecular descriptors calculated from connection table files for each amino acid (CT File Formats (1999), MDL Information Systems, Inc, 14600 Catalina Street, San Leandro, CA, 94577).
  • the Molecular Surface Area (Angstroms²) of the molecules is a pseudo 3D descriptor and represents the contact surface created when a spherical probe representing a solvent molecule is rolled over the molecular model.
  • the Molar Refractivity Index (cm³/mole) estimates the potential light refracting ability of the molecule. Molecular weight (in daltons) can be used as a measure of molecular size and lipid solubility can be estimated from calculation of LogP, the log of the Octanol/Water partition coefficient.
  • a final descriptor, the Weiner Index was also used.
  • the Weiner Index is a topological descriptor of longstanding proven value that is calculated from the number and types of bonds in a given molecule. All of these descriptors are easily calculated from connection table files by molecular modeling programs such as Chem3D/ChemOFFICE Ultra 7.0 , which was used in the present project.
  • a boolean switch was included to allow for the prediction of either DNA or mRNA sequences encoding amino acid sequences.
  • amino acids are identified as essential because they are not made by an organism. An essential amino acid is scored as 1 and a non essential amino acid as 0.
  • Variables 35-38: The probability that the 2nd position in the codon is occupied by a given nucleotide. Similar to (ii) above, it is possible to calculate the probability that a given intra-codon position is occupied by a specific nucleotide encoding a specific amino acid. Previous research has suggested that intra-codon nucleotide position 2 is the most important of the 3 positions for optimal translation to occur (www.sb.fsu.edu/~hongli/BCH5425/note8.html and W. R.
  • a neural network or rule-based algorithm may be used to reverse engineer protein sequences to nucleotide code. Both approaches are outlined below.
  • NeuroSHELL Classifier V2.0 was selected as the neural network modeling software.
  • This artificial intelligence tool is a statistical classifier based on Bayes' theorem coupled with a genetic algorithm used to evolve an optimal solution.
  • the software also has the built-in ability to carry out cross validation as the model is evolving.
  • Probabilistic classifiers use the available data to generate an optimized equation from the input variables.
  • the genetic algorithm determines the optimal subset of input variables.
  • evolutionary systems using different input variables may be expected to produce similar but not always the same final model and performance of these models may also be expected to be similar but not identical. Therefore some models may perform better than others and thus it is preferable to find an adequate or optimal/minimal set of predictors.
  • Each input data pattern relating to an amino acid sequence was composed of 38 variables. Simple genetic principles such as cross over and point mutation can be applied to these data patterns with the result that the addition and subtraction of some variables will produce different and shorter input patterns. For example, but not to be considered limiting in any manner: Cross-over simulation: each of two adjacent or otherwise associated evolving potential solution patterns may transfer part of their respective sequence to the other pattern thereby generating two new potential solution patterns for evaluation according to the fitness criteria;
  • Point mutation simulation: a particular single variable in an evolving potential solution pattern is deleted or replaced by a new or different variable, creating a new potential solution pattern which can be evaluated according to the fitness criteria.
  • a family of mutated input data patterns is created. Each of these new data patterns provides a potential solution to the classification problem and is evaluated against the fitness criterion, which in this case is the maximum number of correct classifications by the model. If a perfect classifier is found after the first generation the process stops. Otherwise the "fittest" patterns from the previous generation undergo further evolution, fitness testing and cross validation. This process continues until an optimal solution is found or a stopping criterion, usually a maximum number of generations, has been reached. Cross validation of the potential solutions is carried out during testing for each candidate model in each generation. A Leave One Out (LOO) training and testing strategy was employed here. During model creation and testing, one data pattern is held out for testing while the remaining 127 patterns are used to train the model.
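The cross-over, point mutation and Leave One Out steps described above can be illustrated with a minimal sketch. This is not the NeuroSHELL Classifier implementation, which is not disclosed here; the function names `crossover`, `point_mutation` and `leave_one_out` are hypothetical, and a candidate solution is assumed to be a plain list of input-variable indices.

```python
from typing import Iterator, List, Optional, Tuple


def crossover(a: List[int], b: List[int], point: int) -> Tuple[List[int], List[int]]:
    """Swap the tails of two candidate variable patterns at a cut point."""
    return a[:point] + b[point:], b[:point] + a[point:]


def point_mutation(pattern: List[int], index: int,
                   new_variable: Optional[int] = None) -> List[int]:
    """Delete the variable at `index`, or replace it when `new_variable` is given."""
    if new_variable is None:
        return pattern[:index] + pattern[index + 1:]
    return pattern[:index] + [new_variable] + pattern[index + 1:]


def leave_one_out(n_patterns: int) -> Iterator[Tuple[List[int], int]]:
    """Yield (training indices, held-out index) for every data pattern in turn."""
    for held_out in range(n_patterns):
        train = [i for i in range(n_patterns) if i != held_out]
        yield train, held_out


# With 128 data patterns, every split trains on 127 and tests on 1.
splits = list(leave_one_out(128))
child_a, child_b = crossover([1, 2, 3, 4], [5, 6, 7, 8], point=2)
```

A fitness function would score each candidate pattern by the number of held-out patterns classified correctly across the 128 splits.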
  • LOO Leave One Out
  • Early in the first generation of model development NeuroSHELL Classifier identified a cross-validated, perfect classifier using only 8 of the original 38 variables. Since it is possible that more than one solution exists for a given problem, particularly when there is a large number of input variables relative to data patterns, the process was repeated 20 times. In each case the same model utilizing the same 8 variables was identified.
  • Altering the output for mRNA code is simply a matter of modifying the output such that T is replaced by U everywhere that T appears in the output.
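As a small illustration of the T-to-U substitution, the following sketch converts a formatted doublet DNA string to its mRNA form. The function name `to_mrna` and the space-separated output format are assumptions made for the example.

```python
def to_mrna(doublet_dna: str) -> str:
    """Replace every T with U; blanks ('_') and other bases are left untouched."""
    return doublet_dna.replace("T", "U")


# Met, Ala, Phe in formatted doublet DNA code (ATG is the exact Met codon).
print(to_mrna("ATG GC_ TT_"))
```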
  • Both the neural network model and the rule-based system employ a software shell for execution.
  • the neural network approach employed herein requires a Runtime version of NeuroSHELL Classifier and the rule based system approach requires an expert system shell or look up table with code in order to evaluate the amino acid sequences from which the formatted doublet codes are derived.
  • the output of the information processing is formatted doublet nucleotide code reverse engineered from the actual amino acid sequence of a protein of interest.
  • the output may be used to search a data structure, for example, an electronic database that comprises nucleotide sequence information.
  • a search algorithm that may be used, which is not meant to be limiting in any manner is shown below. However, a person of skill in the art will recognize that a variety of search or alignment algorithms, for example, but not limited to BLAST, FASTA and dynamic programming may be used and all of these are meant to be included under the present invention.
  • Methionine maps exclusively, that is with 100% accuracy to one and only one instance of the doublet codon AT_ namely ATG.
  • Arginine maps exclusively to 2 instances of doublet codons, namely CG_ OR AG_.
  • Leucine maps exclusively to 2 instances of doublet codons, namely CT_ OR TT_.
  • Serine maps exclusively to 2 instances of doublet codons, namely TC_ OR AG_.
  • when the search process encounters one of the special instances of the doublet codons listed above, the search at that point proceeds as follows. Where the mapping is from Methionine or Tryptophan, the search is for the exact instance, namely "ATG" or "TGG" respectively. Where the mapping is from Arginine, Leucine or Serine, the search is carried out as a simple boolean "OR" match, because Arginine only maps to CG_ OR AG_, Leucine only maps to CT_ OR TT_ and Serine only maps to TC_ OR AG_. Example 5. Example of Reverse Engineering Abnormal Gene Products
  • Hemoglobin is a tetramer composed of 2 pairs of polypeptide chains termed alpha and beta globin subunits. Each of these subunits is bound to a heme group. Hemoglobin is found primarily in red blood cells. It is the hemoglobin in the red blood cell that is responsible for transporting oxygen from the lungs to the tissues and transporting metabolic products such as carbon dioxide in the reverse direction.
  • a separate gene regulates the synthesis of each of the hemoglobin subunits. Normally individuals inherit one beta-chain gene from each parent, 2 alpha-chain genes and 2 gamma-chain genes from each parent.
  • the inheritance of abnormal hemoglobins follows classical Mendelian genetics. The commonly found hemoglobin abnormalities are predominantly beta-chain variants and are usually due to a single amino acid replacement that results from a single base substitution in the encoding triplet DNA codon. The most common hemoglobin disorders are labeled Hemoglobin S and C. Thalassemia is a general term used to describe a genetically determined reduction in the amount of hemoglobin produced.
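The effect of a single amino acid replacement on the doublet code can be sketched as follows. The residue sequences are assumptions supplied for illustration from standard textbook knowledge rather than from this text: the first eight residues of mature beta globin (Val-His-Leu-Thr-Pro-Glu-Glu-Lys), with Glu at position 6 replaced by Val in Hemoglobin S and by Lys in Hemoglobin C. The names `DOUBLET`, `to_doublets` and `changed_positions` are hypothetical.

```python
from typing import List

# Doublet codons for the residues used here (one-letter amino acid codes).
DOUBLET = {
    "V": "GT_", "H": "CA_", "L": "CT_ OR TT_", "T": "AC_",
    "P": "CC_", "E": "GA_", "K": "AA_",
}


def to_doublets(protein: str) -> List[str]:
    """Map each residue to its formatted doublet codon."""
    return [DOUBLET[aa] for aa in protein]


def changed_positions(wild_type: str, mutant: str) -> List[int]:
    """Return 1-based residue positions whose doublet codon differs."""
    pairs = zip(to_doublets(wild_type), to_doublets(mutant))
    return [i + 1 for i, (wt, mut) in enumerate(pairs) if wt != mut]


WILD_TYPE = "VHLTPEEK"  # assumed N-terminus of normal beta globin
HBS = "VHLTPVEK"        # assumed Hemoglobin S: Glu6 -> Val
HBC = "VHLTPKEK"        # assumed Hemoglobin C: Glu6 -> Lys

print(changed_positions(WILD_TYPE, HBS), to_doublets(HBS)[5])
print(changed_positions(WILD_TYPE, HBC), to_doublets(HBC)[5])
```

In both variants the doublet at position 6 changes (GA_ to GT_ or AA_), so the substitution remains visible even though the third codon position is left blank.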
  • HemoglobinS HbS
  • HbC HemoglobinC
  • TAG Stop
  • TAG Stop
  • Thalassemia mutant results from the deletion of an adjacent Adenine at position 8 producing a series of missense amino acids which terminates in an early Stop codon. Both of these examples result in decreased beta-chain synthesis.
  • the reverse engineering process described above was applied to the amino acid sequence and DNA code for the first 29 amino acids of the beta globin chain.
  • the amino acid sequence for the first 29 Amino Acids was first coded as outlined in Example 1, and then submitted to a previously trained and cross validated neural network for analysis.
  • the output of the neural network was a series of 29 or fewer formatted doublet nucleotide codons.
  • the doublet nucleotide codon output was then compared with the known triplet DNA code. In all cases i.e.

Abstract

The invention can be summarized as follows. A method of converting an amino acid sequence of a protein of interest to doublet nucleotide code comprising the steps of providing a first data set corresponding to an amino acid sequence as input to an information processing system (IPS) and determining from the IPS a second data set corresponding to the predicted doublet nucleotide code encoding the amino acid sequence. The method may also comprise a search of a data structure for a nucleotide sequence comprising the predicted doublet nucleotide code. Also disclosed is an information processing system for performing biological sequence analysis.

Description

REVERSE TRANSLATION OF PROTEIN SEQUENCES TO NUCLEOTIDE CODE
The present invention relates to systems and methods for performing biological sequence analysis. More specifically, the invention relates to systems and methods for reverse engineering protein sequences to nucleotide code.
BACKGROUND OF THE INVENTION
While the Human Genome Project has deciphered the human DNA genetic code, the task of determining the function of gene products remains at an early stage of development. Gene products comprise peptides, proteins and antibodies that result from the complex processes of (1) DNA transcription producing messenger ribonucleic acid (mRNA), (2) ribosomal translation of mRNA and (3) post translational processing of the resulting proteins.
Functional genomics and proteomics are relatively new branches of genetics that attempt to determine the specific function of gene products based on sequence and structure information. A still newer area of active research that is at the interface between functional genomics and pharmacology is pharmacogenomics. Broadly speaking, pharmacogenomics is the study of therapeutics in relation to the genetic makeup of an organism.
If a disease process can be shown to be the result of an abnormal gene product then specific therapies can be developed which target the abnormal gene product. A good example of this process is illustrated by the development of a small molecule called Gleevec (Imatinib) that is used to treat Chronic Myelogenous Leukemia (CML). CML is a type of leukemia associated with a recognized genetic mutation known as the Philadelphia chromosome. This genetic mutation results in the production of an abnormal protein tyrosine kinase. Gleevec is a specific protein tyrosine kinase inhibitor that targets the ATP (Adenosine Tri-Phosphate) binding site of the abnormal enzyme, but leaves cells containing the normal or wild-type enzyme largely unaffected. This represents the first instance that a specific therapeutic agent capable of selectively targeting cells with a specific genetic abnormality has been produced.
One potential method of linking disease to the genetic makeup of an organism is to identify an abnormal gene product associated with a disease and then search for the DNA encoding the abnormal gene product. This association may allow for a clearer understanding of the disease process, enhance diagnosis and screening for the disease, and may lead to the development of therapeutics or specific gene therapies that target and repair the abnormal gene sequence.
The basic building blocks of human proteins are the amino acids, of which 20 are most commonly used. The double helix of DNA is composed of units called nucleotides or base pairs that are organized into triplets known as codons. There are 64 of these different three base pair codons. One of these codons codes for either the START instruction or the amino acid methionine, depending on whether or not the codon instruction occurs at the beginning of the coding sequence. Three other codons code for the STOP instruction that terminates ribosomal translation. The remaining 60 triplets code for the 20 amino acids commonly linked together by amide bonds to form gene products (i.e. peptides, proteins and antibodies). To further complicate the relationship between DNA code and gene products such as proteins, there is considerable variability in the number of triplets that code for each amino acid. The DNA code is both redundant and degenerate. For example, while Methionine and Tryptophan are each encoded by one unique triplet codon, the other amino acids may be encoded by 2, 3, 4 or 6 different triplet codons. Three amino acids (Arginine, Leucine and Serine) may each be encoded by 6 different triplet codons. This characteristic of the DNA code has made the process of reverse engineering gene products to DNA code a formidable task. Current approaches depend on using a brute force search through all possible combinations based on a specific DNA sequence known as gene probes.
Attempts have been made to use artificial intelligence technologies such as neural networks to reverse engineer proteins to DNA code. These attempts have met with limited success. In the best study to date (White and Seffens, 1998) a simple neural network correctly predicted 100% of the non-redundant codons and 85% of the redundant codons from the test amino acid sequences. Overall 93% of the test amino acid sequences were correctly mapped to the actual DNA triplet codon, and until very recently it appeared that improving on this degree of accuracy when reverse engineering proteins based on the triplet codon DNA code would remain problematic.
There is a need in the art for novel methods of reverse engineering protein sequences to nucleotide code. Further, there is a need in the art for novel methods of searching and identifying nucleotide sequences in databases.
It is an object of the present invention to overcome disadvantages of the prior art.
The above object is met by a combination of the features of the main claims. The sub claims disclose further advantageous embodiments of the invention.
SUMMARY OF THE INVENTION
The present invention relates to systems and methods for performing biological sequence analysis. More specifically, the invention relates to systems and methods for reverse engineering protein sequences to nucleotide code.
According to the present invention there is provided a method of converting an amino acid sequence of a protein of interest to doublet nucleotide code comprising the steps of, a) providing a first data set corresponding to the amino acid sequence as input to an information processing system (IPS), and; b) determining from the IPS a second data set corresponding to the predicted doublet nucleotide code encoding the amino acid sequence of the protein of interest. Further, the method may also comprise the step of outputting the second data set, or information equivalent to the predicted doublet code encoding the amino acid sequence of the protein of interest. Further, the doublet nucleotide code may comprise a DNA nucleotide sequence or an RNA nucleotide sequence.
Further contemplated by the method of the present invention as defined above, the first data set may comprise a full or partial amino acid sequence of a protein of interest. Further, the first data set may be encoded in binary form.
Also contemplated by the method of the present invention, the protein of interest may be a variant or mutant protein of a wild-type protein. Further, the variant or mutant protein may be associated with a disease.
Further, the present invention provides a method as defined above, wherein the information processing system (IPS) comprises a neural network. The neural network may also employ a genetic algorithm. In an embodiment of the present invention, which is not meant to be limiting, the neural network is NeuroSHELL Classifier v2.0. Preferably the neural network is a trained neural network. For example, but not wishing to be limiting, the neural network may be trained by processing a plurality of data sets comprising data elements (X,Y) wherein X represents the amino acid sequence of a protein and Y represents the nucleotide sequence encoding X. Alternatively, but without wishing to be limiting, the IPS may comprise a rule-based system.
The present invention also contemplates a method of identifying a nucleotide sequence encoding a protein of interest within a data structure, comprising the steps of a) providing a first data set corresponding to the amino acid sequence of the protein of interest to an information processing system (IPS) capable of producing a second data set corresponding to the predicted doublet DNA code encoding the amino acid sequence, and; b) performing an in string search of the data structure to identify all instances wherein the second data set is present in the data structure. The data structure may comprise an electronic medium containing nucleotide sequences, for example, but not limited to an electronic database. Further the nucleotide sequences may comprise genomic nucleotide sequences, for example, but not limited to containing introns.
Also contemplated by the method of the present invention as defined above, the in string search may be performed, controlled or both performed and controlled by an algorithm employing a sliding window approach to compare sequences. Further, the algorithm may comprise an alignment algorithm such as, but not limited to BLAST, FASTA, dynamic programming, or a version or an executable-code-modified version thereof.
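One way to picture the in string search is the following minimal sketch, which is not the algorithm of the invention and does not handle introns. It assumes one-letter amino acid codes, treats the blank third position as any of A, C, G or T, searches for Methionine and Tryptophan as the exact codons ATG and TGG, and expresses the Arginine, Leucine and Serine alternatives as an OR. A regular-expression lookahead stands in for the sliding window; the names `DOUBLETS`, `EXACT`, `doublet_pattern` and `find_all` are hypothetical.

```python
import re
from typing import List

# Doublet codons per amino acid (one-letter codes); two entries mean "OR".
DOUBLETS = {
    "A": ["GC"], "N": ["AA"], "D": ["GA"], "C": ["TG"], "E": ["GA"],
    "Q": ["CA"], "G": ["GG"], "H": ["CA"], "I": ["AT"], "K": ["AA"],
    "F": ["TT"], "P": ["CC"], "T": ["AC"], "Y": ["TA"], "V": ["GT"],
    "R": ["CG", "AG"], "L": ["CT", "TT"], "S": ["TC", "AG"],
}
# Amino acids searched as exact triplets.
EXACT = {"M": "ATG", "W": "TGG"}


def doublet_pattern(protein: str) -> "re.Pattern[str]":
    """Build a regex matching any DNA stretch consistent with the doublet code."""
    parts = []
    for aa in protein:
        if aa in EXACT:
            parts.append(EXACT[aa])
        else:
            parts.append("(?:" + "|".join(DOUBLETS[aa]) + ")[ACGT]")
    # The lookahead lets matches overlap, like a window sliding one base at a time.
    return re.compile("(?=(" + "".join(parts) + "))")


def find_all(protein: str, dna: str) -> List[int]:
    """Return every 0-based start position where the doublet code is present."""
    return [m.start() for m in doublet_pattern(protein).finditer(dna)]


print(find_all("MAW", "CCATGGCTTGGAA"))
print(find_all("R", "TTAGACGT"))
```

Because the window advances one nucleotide at a time, all three reading frames of the searched strand are covered in a single pass.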
Also contemplated by the present invention is an information processing system (IPS) capable of a) receiving a first data set corresponding to an amino acid sequence of a protein of interest, and b) producing a second data set corresponding to the predicted doublet nucleotide code encoding the amino acid sequence. Further, the IPS may further comprise hardware, software or the like, and may further comprise an alignment algorithm for performing an in-string search of a data structure to identify all instances wherein the second data set is present in the data structure.
This summary does not necessarily describe all necessary features of the invention; the invention may also reside in a sub-combination of the described features.
BRIEF DESCRIPTION OF THE DRAWINGS
These and other features of the invention will become more apparent from the following description in which reference is made to the appended drawings wherein: FIGURE 1 shows the predicted doublet nucleotide code of normal and mutant hemoglobin proteins output from a trained neural network following input of the first twenty nine amino acids of the proteins.
DESCRIPTION OF PREFERRED EMBODIMENT
The present invention relates to systems and methods for performing biological sequence analysis. More specifically, the invention relates to systems and methods for reverse engineering protein sequences to nucleotide code.
The following description is of a preferred embodiment by way of example only and without limitation to the combination of features necessary for carrying the invention into effect.
According to an embodiment of the present invention there is provided a method of converting an amino acid sequence of a protein of interest to doublet nucleotide code comprising the steps of, a) providing a first data set corresponding to the amino acid sequence as input to an information processing system (IPS), and; b) determining from the IPS a second data set corresponding to predicted doublet nucleotide code encoding the amino acid sequence.
The method may further comprise a step of outputting the second data set, or information equivalent to the predicted doublet nucleotide code of the protein of interest.
Also contemplated by the present invention is an information processing system (IPS) capable of a) receiving as input a first data set corresponding to an amino acid sequence, b) determining a second data set corresponding to the predicted doublet nucleotide code encoding the amino acid sequence and c) outputting the second data set or information equivalent to the predicted doublet nucleotide code. Further, the IPS may comprise additional hardware, for example but not limited to control circuits, microprocessors, software or both, and it may be part of one or more computers or biological sequence analysis systems. Definitions:
By the term "amino acid sequence" it is meant a consecutive sequence of amino acids linked via peptide bonds defining the protein of interest, preferably starting from the amino terminus (N-terminus) and proceeding to the carboxy terminus (C-terminus) of the protein.
The protein of interest may comprise any protein known in the art, for example, but not limited to, pharmaceutically important proteins such as, but not limited to regulatory proteins, signaling proteins, growth factors, growth regulators, antibodies, antigens, interleukins, insulin, colony stimulating factors such as G-CSF, GM-CSF, hPG-CSF, M-CSF or combinations thereof, interferons, for example, interferon-α, interferon-β, interferon-γ, blood clotting factors, for example, Factor VIII, Factor IX, or tPA. However, any native protein produced by an organism may be considered a protein of interest. Also, the present invention contemplates mutant proteins and variants of native proteins produced by an organism.
By the term "doublet nucleotide code" it is meant a DNA or RNA nucleotide sequence comprising two of the three nucleotides of each triplet codon encoding an amino acid, with a blank, space-holder or the like in the third position. For example, Alanine may be represented by the doublet codon GC_, Cysteine by TG_, Arginine by CG_ OR AG_, Leucine by CT_ OR TT_, Serine by TC_ OR AG_, etc. Preferably, the doublet nucleotide code is provided in a defined orientation, such as a 5' to 3' orientation, as would be understood by a person of skill in the art.
By the term "first data set" it is meant information pertaining to the amino acid sequence of the protein of interest. The first data set comprises the amino acid sequence of the protein of interest, or a computer readable version thereof. Further, the first data set may be encoded on an electronic medium, such as, but not limited to a computer information storage device, such as, but not limited to a hard drive, floppy disk or the like. In this regard, it may be desirable to encode individual amino acids that make up the protein of interest into an appropriate form such as, but not limited to as a binary sequence. For example, but not to be considered limiting in any manner, the amino acid Alanine which represents one of the twenty amino acids, may be represented by the numerical string 10000000000000000000, serine as 00000000000010000000, valine as 00000000000000000001 and so on. Using such notation a protein of interest that comprises the amino acid sequence Ala-Ala-Val-Ser may be depicted by the numerical sequence:
100000000000000000001000000000000000000000000000000000000001000000000 00010000000,
(also see White and Seffens (1998, Electronic J. Biotech. 1: 196-201), which is incorporated herein by reference). As will be evident to someone of skill in the art, other ways of encoding the amino acid sequence into binary or other forms may be possible and any such method is contemplated by the present invention.
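The binary notation described above can be sketched as a one-hot encoding. Only the positions of Alanine (first), Serine (thirteenth) and Valine (twentieth) are taken from the strings given in the text; the order of the remaining amino acids in `AA_ORDER` is an assumption made for illustration, as are the names `one_hot` and `encode_sequence`.

```python
from typing import List

# Assumed ordering: only the Ala, Ser and Val positions follow the text above.
AA_ORDER = [
    "Ala", "Arg", "Asn", "Asp", "Cys", "Gln", "Glu", "Gly", "His", "Ile",
    "Leu", "Lys", "Ser", "Met", "Phe", "Pro", "Thr", "Trp", "Tyr", "Val",
]


def one_hot(amino_acid: str) -> str:
    """Return a 20-character string with a single 1; 'Stop' is all zeros."""
    bits = ["0"] * len(AA_ORDER)
    if amino_acid != "Stop":
        bits[AA_ORDER.index(amino_acid)] = "1"
    return "".join(bits)


def encode_sequence(sequence: List[str]) -> str:
    """Concatenate the one-hot strings of each residue in order."""
    return "".join(one_hot(aa) for aa in sequence)


encoded = encode_sequence(["Ala", "Ala", "Val", "Ser"])
print(len(encoded))
```

Each residue contributes 20 characters, so a protein of n residues is represented by 20n binary digits.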
Similarly, by the term "second data set" it is meant information corresponding to the predicted doublet nucleotide code encoding the amino acid sequence. The second data set, corresponding to the predicted doublet DNA code may also be encoded on an electronic medium as defined previously.
By the term "information processing system" or "IPS" it is meant an electronic device that is capable of accepting the first data set and producing therefrom a second data set corresponding to the predicted doublet nucleotide code encoding the amino acid sequence. The IPS may comprise a neural network or a rule-based system or algorithm.
In an aspect of an embodiment the present invention employs a neural network, preferably a trained neural network. In an alternate embodiment, the information processing system comprises a rule-based algorithm. The IPS may also comprise one or more circuits, microprocessors, or combinations thereof, as would be evident to a person of skill in the art.
Artificial Neural Networks are pattern recognition computer models based on the human nervous system that are capable of learning from experience and then making predictions about new patterns. Neural computation encompasses the concepts of distributed, adaptive and nonlinear computing. Neural networks usually comprise a plurality of layers such as an input layer, a middle or hidden layer and an output layer. Each layer comprises a plurality of processing elements that are usually interconnected by weighted connections or scaling factors. A processing element multiplies an input by a set of weights, and non-linearly transforms the result into an output value. The performance of the neural network may be measured in terms of a desired signal and error criterion. The output of the neural network is compared with a desired response to produce an error. A backpropagation algorithm may be used to adjust the weights interconnecting the processing elements in a manner to minimize the error. Other learning algorithms that may be employed include, but are not limited to Probabilistic/Bayesian, Generalized Regression, Self organizing maps (e.g. Kohonen Networks), Cascade Correlation or a combination thereof. The network may be trained by repeatedly exposing the neural network to known data patterns while the training algorithm adjusts the connection weights between processing elements in order to "learn" the relationship between the input patterns and the desired outputs. This process continues until a predetermined error tolerance has been achieved so as to optimize the generalizability of the trained model. For example, but not wishing to be limiting, a neural network may be trained by processing a plurality of data sets comprising data elements (X,Y) wherein X represents the amino acid sequence of a protein and Y represents the nucleotide sequence encoding X.
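The behaviour of a single processing element and one error-driven weight adjustment can be sketched as follows. This is a generic illustration, not the network used in the invention; the sigmoid transfer function, the learning rate of 0.5 and the names `processing_element` and `train_step` are assumptions.

```python
import math
from typing import List


def processing_element(inputs: List[float], weights: List[float]) -> float:
    """Weight the inputs, sum them, and apply a non-linear (sigmoid) transform."""
    z = sum(x * w for x, w in zip(inputs, weights))
    return 1.0 / (1.0 + math.exp(-z))


def train_step(inputs: List[float], weights: List[float],
               target: float, rate: float = 0.5) -> List[float]:
    """One gradient step on squared error for a single sigmoid element."""
    out = processing_element(inputs, weights)
    delta = (target - out) * out * (1.0 - out)
    return [w + rate * delta * x for w, x in zip(weights, inputs)]


x = [1.0, 0.0]
w0 = [0.0, 0.0]
error_before = abs(1.0 - processing_element(x, w0))
w1 = train_step(x, w0, target=1.0)
error_after = abs(1.0 - processing_element(x, w1))
print(error_before, error_after)
```

Repeating the step over all training patterns until the error falls below a chosen tolerance corresponds to the training loop described above.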
The current process attempts to relate the properties of each of the 20 AAs to the specific coding doublet DNA code. Having taught a machine learning system to understand this relationship, the model can then reverse engineer any given sequence of amino acids to the original DNA code via the concept of the formatted doublet codon. In an aspect of the present invention, the training set comprises 128 data patterns (64 for the DNA instance and 64 for the mRNA instance) relating each AA to its encoding doublet DNA codon.
A trained neural network based on Bayes' theorem is a powerful classifier and by design learns and makes predictions based on probabilities calculated from the training data. Once trained and validated, the neural network can be used to evaluate the amino acid sequence of the protein of interest and output the predicted doublet nucleotide code corresponding to the input amino acid sequence. Examples of training of neural network systems are known in the art and may be found in, but are not limited to: Foundations of Neural Networks, Fuzzy Systems and Knowledge Engineering, Nikola K. Kasabov, A Bradford Book, The MIT Press, Cambridge, Massachusetts, London, England, 1998; and Neural Smithing: Supervised Learning in Feedforward Artificial Networks, Russell D. Reed and Robert J. Marks II, A Bradford Book, The MIT Press, Cambridge, Massachusetts, London, England, 1999; which are hereby incorporated by reference.
Neural networks that may be employed in the present invention include, but are not limited to NeuroSHELL Classifier v2.0. In an aspect of an embodiment, the present system may rely on a number of specific features of the NeuroSHELL package, for example, but not limited to the ability to carry out leave one out cross validation coupled with the probabilistic classifier utilizing a genetic algorithm to evolve a cross validated optimal solution to the classification problem.
Alternatively, the IPS may comprise a rule-based algorithm. For example, but not to be considered limiting in any manner, a rule-based algorithm may comprise a plurality of rules such as:
Rule 1: If AA = "Alanine" then output: Doublet = "GC_"
Rule 2: If AA = "Asparagine" then output: Doublet = "AA_"
Rule 3: If AA = "Aspartate" then output: Doublet = "GA_"
Rule 4: If AA = "Cysteine" then output: Doublet = "TG_"
Rule 5: If AA = "Glutamate" then output: Doublet = "GA_"
Rule 6: If AA = "Arginine" then output: Doublet = "CG_" OR "AG_" and so on for all amino acids.
Preferably, the rule-based system comprises (1) a comprehensive set of rules of varying complexity to handle all possible relationships and (2) an efficient search algorithm to find the appropriate rule quickly rather than a brute force search through every rule each time to find the appropriate rule for the instance of the data pattern being evaluated. However, other rule based systems may be employed in the present invention as would be understood by a person of skill in the art.
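A rule set of this kind can be sketched as a look up table, so that the appropriate rule is found by a single keyed lookup instead of a scan through every rule. The doublets follow the rules and tables of this description; the names `DOUBLET_RULES` and `reverse_translate` and the " OR " output format are assumptions.

```python
from typing import List

# One rule per amino acid; tuples with two entries are "OR" alternatives.
DOUBLET_RULES = {
    "Alanine": ("GC_",), "Asparagine": ("AA_",), "Aspartate": ("GA_",),
    "Cysteine": ("TG_",), "Glutamate": ("GA_",), "Glutamine": ("CA_",),
    "Glycine": ("GG_",), "Histidine": ("CA_",), "Isoleucine": ("AT_",),
    "Lysine": ("AA_",), "Methionine": ("ATG",), "Phenylalanine": ("TT_",),
    "Proline": ("CC_",), "Threonine": ("AC_",), "Tryptophan": ("TGG",),
    "Tyrosine": ("TA_",), "Valine": ("GT_",),
    "Arginine": ("CG_", "AG_"), "Leucine": ("CT_", "TT_"),
    "Serine": ("TC_", "AG_"),
}


def reverse_translate(amino_acids: List[str]) -> List[str]:
    """Look up the formatted doublet codon(s) for each amino acid in order."""
    return [" OR ".join(DOUBLET_RULES[aa]) for aa in amino_acids]


print(reverse_translate(["Alanine", "Arginine", "Methionine"]))
```

A dictionary lookup takes roughly constant time per amino acid, which addresses the stated preference for avoiding a brute force pass over all rules.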
The Doublet Nucleotide Code
The genetic material of an organism comprises a string of nucleotides consisting of adenine (A), cytosine (C), guanine (G) and thymine (T) in the case of DNA and A, C, G, and uracil (U) in the case of RNA. Proteins comprising a series of amino acids linked by peptide bonds are produced by transcribing and translating the genetic material. It is well known in the art that triplet codons consisting of three consecutive nucleotides of genetic material specify the amino acids to be incorporated into a protein. An aspect of the present invention, which is not meant to be bound by theory or limiting in any manner, is based on the notion that the triplet genetic code may have evolved from a doublet code comprising two base pair codons. Since there are 4 different nucleotides in either DNA or RNA, a doublet codon can be arranged in only 4² = 16 different ways. These 16 doublet codons encode 16 of the 20 amino acids found in proteins (Table 1).
Table 1: The Doublet Genetic Code
Amino Acid     Doublet (100%)     Amino Acid       Doublet (100%)
Alanine        GC                 Isoleucine       AT
Asparagine     AA                 Lysine           AA
Aspartate      GA                 Methionine       AT* (ATG only)
Cysteine       TG                 Phenylalanine    TT
Glutamate      GA                 Proline          CC
Glutamine      CA                 Threonine        AC
Glycine        GG                 Tryptophan       TG* (TGG only)
Histidine      CA                 Tyrosine         TA
Valine         GT
Thus, about 80% of the amino acids in proteins can be directly mapped to the triplet DNA code from the doublet nucleotide code. This doublet nucleotide code may be converted to the modern triplet code format by simply adding a blank space holder in the third nucleotide position. If sequence analysis is employed using doublet nucleotide codons for 16 amino acids, then the problem of reverse engineering an amino acid sequence of a protein of interest may be significantly reduced. Further, as the amino acid Tryptophan is encoded by a single codon (i.e. TGG), only three amino acids (Arginine, Leucine and Serine) encoded by multiple triplet codons require conversion to multiple doublet codons.
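The doublet mapping can be checked against the standard genetic code with a short sketch. The codon table below is the standard nuclear code written in the conventional T, C, A, G order; it is general knowledge supplied for illustration rather than taken from this description, and the variable names are hypothetical.

```python
from collections import defaultdict
from itertools import product

BASES = "TCAG"
# Standard genetic code, codons ordered TTT, TTC, TTA, TTG, TCT, ... ('*' = Stop).
AMINO_ACIDS = ("FFLLSSSSYY**CC*W" "LLLLPPPPHHQQRRRR"
               "IIIMTTTTNNKKSSRR" "VVVVAAAADDEEGGGG")

codon_table = {"".join(c): aa
               for c, aa in zip(product(BASES, repeat=3), AMINO_ACIDS)}

# All possible doublets: 4 bases in each of 2 positions.
all_doublets = {"".join(d) for d in product(BASES, repeat=2)}

# Collect the first two bases of every codon, grouped by amino acid.
doublets_by_aa = defaultdict(set)
for codon, aa in codon_table.items():
    if aa != "*":
        doublets_by_aa[aa].add(codon[:2])

ambiguous = sorted(aa for aa, ds in doublets_by_aa.items() if len(ds) > 1)
print(len(all_doublets), ambiguous)
```

Under the standard code the only amino acids that need more than one doublet are L, R and S (Leucine, Arginine and Serine), in agreement with the three exceptions named above.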
Results from the human genome project and other sequencing projects permit statistical analysis of DNA sequences. For example, it is possible to estimate the probability that a given amino acid is encoded by a specific doublet DNA codon. Such a statistical approach may be applied to the 3 amino acids (Arginine, Leucine and Serine) that cannot be defined by a single doublet nucleotide code. For example, but not wishing to be limiting, based on statistical analysis of hundreds of thousands of DNA sequences from the human genome, it may be predicted that about 60% of the time Arginine is encoded by the doublet codon CG and about 40% of the time by the doublet codon AG. Similarly, Leucine is encoded by CT about 80% of the time and TT about 20% of the time, while Serine is encoded by TC about 60% and AG about 40% of the time, as shown in Table 2. As will be understood by a person of skill in the art, nucleotide coding frequencies encoding specific amino acids may be organism or species specific. Thus, the present invention also contemplates using known codon frequencies for specific organisms during reverse engineering of protein sequences to nucleotide code as described herein. Further, the present invention contemplates extrapolating the codon frequencies for organisms that have yet to be fully sequenced at the nucleotide level, for example, but not limited to, by analyzing the nucleotide sequences encoding known proteins from that organism.
Table 2: Doublet Codon Probabilities Derived From the Human Genome Project
Amino Acid    Doublet    Probability
Arginine      CG         0.6
Arginine      AG         0.4
Leucine       CT         0.8
Leucine       TT         0.2
Serine        TC         0.6
Serine        AG         0.4
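As a hedged illustration of how the probabilities of Table 2 might be used, the sketch below stores them in a dictionary, selects the most probable doublet for an amino acid, and ranks the possible doublet assignments for a run of ambiguous residues. Multiplying the per-residue probabilities assumes that positions are independent, which is a simplifying assumption made here and not a statement from the present description; the function names are hypothetical.

```python
from itertools import product
from math import prod

# Doublet probabilities as listed in Table 2.
AMBIGUOUS = {
    "Arginine": {"CG": 0.6, "AG": 0.4},
    "Leucine": {"CT": 0.8, "TT": 0.2},
    "Serine": {"TC": 0.6, "AG": 0.4},
}

def most_likely_doublet(amino_acid):
    """Return the doublet with the highest probability for an ambiguous residue."""
    options = AMBIGUOUS[amino_acid]
    return max(options, key=options.get)

def ranked_combinations(amino_acids):
    """List every doublet assignment with its joint probability, most probable first.

    Assumes independence between positions (an assumption of this sketch).
    """
    choices = [AMBIGUOUS[aa].items() for aa in amino_acids]
    combos = []
    for combo in product(*choices):
        doublets = [d for d, _ in combo]
        combos.append((doublets, prod(p for _, p in combo)))
    return sorted(combos, key=lambda item: item[1], reverse=True)

for doublets, probability in ranked_combinations(["Arginine", "Leucine"]):
    print(doublets, round(probability, 2))
```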
The final modified doublet representation of the genetic code for all 20 Amino Acids is summarized in Table 3.
Table 3: Modified Doublet Nucleotide Code Symbolic Representation For All 20 Amino Acids
Amino Acid    Doublet (100%)    Amino Acid       Doublet (100%)
Alanine       GC                Isoleucine       AT
Asparagine    AA                Lysine           AA
Aspartate     GA                Methionine       AT* (ATG only)
Cysteine      TG                Phenylalanine    TT
Glutamate     GA                Proline          CC
Glutamine     CA                Threonine        AC
Glycine       GG                Tryptophan       TG* (TGG only)
Histidine     CA                Tyrosine         TA
Valine        GT
Arginine      C(A.4)G     thus CG OR AG (100%)
Leucine       C(T.2)T     thus CT OR TT (100%)
Serine        TC(AG.4)    thus TC OR AG (100%)
Thus, in an embodiment, the present invention contemplates a symbolic probabilistic reverse mapping system from amino acid sequence to DNA or RNA code using a doublet nucleotide representation (i.e. AA_, CG_, TT_, CA_, etc). The system may also employ additional concepts such as, but not limited to, the fact that every coding sequence has a start (ATG) and a termination (STOP) code. These 2 instructions may be employed to represent the "boundaries" of a target DNA sequence and therefore define a finite number of amino acids and codons lying between the boundaries. Secondly, methionine and tryptophan are each encoded by a unique triplet codon (i.e. ATG and TGG, respectively). These unique triplet codons may be employed as "anchors" or constants within a given DNA or RNA sequence. Boundaries, anchors or both provide a simple method of partial internal validation of the reverse engineering of proteins to nucleotide code.
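The boundary and anchor checks described above may be sketched, without wishing to be limiting, as follows. The sketch assumes that the candidate DNA sequence runs from the start codon through a stop codon in a single reading frame and that the amino acid list includes the initiating methionine; the three stop codons are those of the standard genetic code. The function names are hypothetical.

```python
STOP_CODONS = {"TAA", "TAG", "TGA"}           # standard genetic code stop codons
ANCHORS = {"Methionine": "ATG", "Tryptophan": "TGG"}

def split_codons(dna):
    """Split a DNA string into complete triplets, ignoring trailing bases."""
    return [dna[i:i + 3] for i in range(0, len(dna) - len(dna) % 3, 3)]

def check_boundaries_and_anchors(amino_acids, dna):
    """Return a list of problems; an empty list means all checks passed."""
    codons = split_codons(dna.upper())
    problems = []
    if not codons or codons[0] != "ATG":
        problems.append("missing start codon ATG")
    if not codons or codons[-1] not in STOP_CODONS:
        problems.append("missing stop codon")
    if len(codons) != len(amino_acids) + 1:   # +1 for the stop codon
        problems.append("codon count does not match protein length")
    for index, (residue, codon) in enumerate(zip(amino_acids, codons), start=1):
        expected = ANCHORS.get(residue)
        if expected is not None and codon != expected:
            problems.append(
                f"codon {index}: expected {expected} for {residue}, found {codon}"
            )
    return problems

print(check_boundaries_and_anchors(
    ["Methionine", "Valine", "Tryptophan"], "ATGGTGTGGTAA"))  # []
```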
In a further aspect of an embodiment of the present invention, the method of reverse engineering protein sequences to nucleotide code may be employed to convert an amino acid sequence of a protein of interest into doublet nucleotide code and identify the corresponding nucleotide sequence in a data structure, database or the like, wherein the nucleotide sequence also comprises one or more introns. Without wishing to be limiting in any manner, the method may comprise a 2 (or more) stage search whereby the first stage search looks for the contiguous DNA sequence predicted by the system. If no match is found, then an iterative 2nd stage search may be employed to identify subset matches of codons, preferably at least 3 contiguous codons. Thus, if no match is found in the first search, a codon by codon match may ensue. For example, if the first 3 codons of the probe match 3 codons in the target but not the fourth, then the search algorithm interprets this as the beginning of a possible intron sequence and tries to match the 4th codon of the probe with the next codon in the target. As long as there are 2 or fewer contiguous codon matches, this process continues. Matches of 3 or more codons are acknowledged by the system as above and the next iteration begins. The iterative search continues until the probe sequence is matched. A minimum match requirement of 3 contiguous codons is based on calculated probabilities for the appearance of 1, 2, 3 and 4 codons in association in any DNA sequence. The probability that any 3 contiguous codons would be associated based on chance alone is small.
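One possible reading of the second stage search is sketched below for illustration only. It is a greedy simplification made for this sketch and not the definitive procedure: the probe is a list of single doublet patterns (alternatives such as CG_ OR AG_ are not handled), the target is scanned one nucleotide at a time because an intron need not preserve the reading frame, and a codon split by an intron is not handled. All names are hypothetical.

```python
def codon_matches(pattern, codon):
    """True if a pattern such as 'CC_' matches a triplet; '_' matches any base."""
    return len(codon) == 3 and all(p in ("_", c) for p, c in zip(pattern, codon))

def run_length(probe, start, target, pos):
    """Count consecutive probe codons, from probe[start], matching target at pos."""
    count = 0
    while start + count < len(probe):
        codon = target[pos + 3 * count: pos + 3 * count + 3]
        if not codon_matches(probe[start + count], codon):
            break
        count += 1
    return count

def find_exon_segments(probe, target, min_run=3):
    """Return [(target_pos, probe_index, n_codons), ...] or None if unmatched."""
    segments = []
    i, pos = 0, 0
    while i < len(probe):
        needed = min(min_run, len(probe) - i)
        while pos + 3 * needed <= len(target):
            n = run_length(probe, i, target, pos)
            if n >= needed:
                break
            pos += 1          # too few contiguous matches: treat as intron, move on
        else:
            return None       # ran off the end of the target
        segments.append((pos, i, n))
        i += n
        pos += 3 * n
    return segments

probe = ["AT_", "GT_", "CA_", "CT_", "AC_", "CC_"]
target = "ATGGTGCAC" + "GTAAGTTTTTAG" + "CTGACTCCT"   # toy exon, intron, exon
print(find_exon_segments(probe, target))  # [(0, 0, 3), (21, 3, 3)]
```

Under the simplifying assumption of uniformly random, independent bases, a given doublet matches a random codon with probability 1/16, so 3 contiguous doublets match by chance with probability (1/16)³ = 1/4096 at any one position.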
The method of the present invention was tested using wild-type and mutant hemoglobins, including hemoglobin S (HbS), hemoglobin C (HbC) and two hemoglobin variants encoding truncated proteins.
HbS and HbC both result from missense mutations at the 6th codon position. In the case of HbS, Valine is substituted for Glutamic acid while HbC results from the substitution of Lysine for Glutamic acid. One example of Thalassemia results from a Stop (TAG) mutation at position 17 and a second example of a Thalassemia mutant results from the deletion of an adjacent Adenine at position 8 producing a series of missense amino acids which terminates in an early Stop codon. Both of these examples result in decreased beta-chain synthesis.
The method of the present invention as described herein was used to determine the doublet nucleotide code encoding up to the first 29 amino acids of the beta globin chain. Results are shown in Figure 1. The amino acid sequence for the first 29 amino acids of the hemoglobin proteins was first encoded as outlined in Example 1, and then submitted to a trained neural network for analysis. The output of the neural network was a series of 29 or fewer formatted doublet nucleotide codons. The doublet nucleotide code output was compared with the known triplet DNA code. In all cases (normal hemoglobin, HbS, HbC and the 2 Thalassemia hemoglobins), the doublet nucleotide code predictions exactly matched the first 2 nucleotides of the actual DNA triplet codons, including the actual mutations responsible for the abnormal hemoglobin. This indicates that the first two nucleotides of each original triplet DNA codon were accurately determined from the doublet nucleotide code produced from neural network evaluation of the amino acid sequences.
The method of the present invention may further comprise one or more additional steps at any stage in the method. For example, but not to be considered limiting in any manner, after the step of determining or outputting, the second data set may be used to search a data structure comprising nucleotide sequence information. By the term "data structure" it is meant any electronic medium comprising nucleotide sequence information. For example, but not to be considered limiting in any manner, the data structure may comprise a database or the like which contains the genome of an organism, for example, but not limited to, a yeast genome, such as, but not limited to, Saccharomyces cerevisiae or Schizosaccharomyces pombe (Nature 387, 5-105 (suppl) (1997); Wood et al., Nature, 415(6874):871-880 (2002)), protozoa, such as, but not limited to, Plasmodium falciparum, plants such as, but not limited to, Arabidopsis thaliana (Nature 408(6814):796-815) and Oryza sativa (Yu et al., Science, 296:79-92 (2002); Goff et al., Science, 296:92-100 (2002)), nematodes such as, but not limited to, Caenorhabditis elegans (Washington U, Science, 282(5396):2012-2018 (1998). Erratum: 283(5398):35 (1999), 283(5410):2103 (1999), 285(5433):1493 (1999)), insects such as, but not limited to, Drosophila melanogaster (Adams, et al., Science, 287(5461):2185-2195 (2000)) and human (Venter, et al., Science, 291:1304-1351 (2001); International Human Genome Sequencing Consortium, Nature, 409:860-921 (2001)), or a combination thereof.
Further, the data structure may comprise a plurality of nucleotide sequences, preferably eukaryotic nucleotide sequences. However, the data structure may also comprise prokaryotic sequences. Preferably, the coding relationship between nucleotide codons and amino acids is known for the species in question. Also, the present invention contemplates data structures comprising partial or incomplete genomes of organisms. For example, but not to be considered limiting in any manner, the data structure may comprise one or more databases from the National Center for Biotechnology Information (NCBI), European Molecular Biology Laboratory (EMBL), or both. Further, the data structure may comprise a commercial data structure, for example, but not limited to, one of the type available from Celera.
In an alternate embodiment, there is provided a method of identifying a nucleotide sequence encoding a protein of interest within a data structure comprising the steps of i) providing a first data set corresponding to the amino acid sequence of the protein of interest to an information processing system (IPS) capable of producing a second data set corresponding to the predicted doublet nucleotide code encoding the amino acid sequence of the protein of interest, and; ii) performing an in-string search of the data structure to identify all instances wherein the second data set is present in the data structure.
The in-string search of the data structure may be accomplished by any appropriate search algorithm known in the art. For example, the algorithm may perform a simple in-string search to identify the predicted nucleotide sequence within the data structure. An example of a simple in-string search, which is not meant to be considered limiting in any manner, is described in Example 4. Alternatively, dynamic programming, for example, but not limited to, the Smith-Waterman dynamic programming algorithm or a version thereof may be employed (Smith and Waterman, 1981a,b, Identification of Common Molecular Subsequences. J. Molecular Biology 147:195-197; which is herein incorporated by reference). Further, other programs, algorithms and the like, such as the Basic Local Alignment Search Tool (BLAST), FASTA, or versions or derivatives thereof, that search for matching patterns termed k-tuples (Wilbur and Lipman, 1983, Rapid Similarity Searches of Nucleic Acid and Protein Data Banks. Proc. Natl. Acad. Sci. 80:726-730; Altschul et al., 1990, Basic Local Alignment Search Tool. J. Mol. Biol. 215:403-410; which are herein incorporated by reference), or other algorithms (for example, as described in Bioinformatics, Sequence and Genome Analysis by David W. Mount, Cold Spring Harbor Laboratory Press, Cold Spring Harbor, New York, 2001 and references contained therein, which are herein incorporated by reference), may be employed by the present invention. However, any suitable algorithm known in the art may be employed to perform the in-string search.
The use of alignment algorithms such as, but not limited to, BLAST and FASTA and dynamic programming algorithms permits alignment of sequences that are not identical. Thus, such algorithms may be employed to search a data structure that comprises introns in nucleotide sequences, for example, but not limited to, genomic nucleotide sequences. Further, the alignment algorithms also permit alignment of a predicted nucleotide doublet sequence encoding a mutant protein of interest with one or more nucleotide sequences contained in a data structure, for example, but not limited to, an electronic database. In such an instance the predicted doublet nucleotide sequence does not need to be identical to the sequence contained in the data structure. For example, but not to be considered limiting in any manner, the human genome has been sequenced and thus any protein produced in a human may be mapped to a specific nucleotide sequence in the genome. However, a human subject suffering from a disease may exhibit one or more mutant proteins that may be partly or wholly responsible for the disease. If a mutant protein is isolated and sequenced, it is unlikely that the exact nucleotide sequence encoding this mutant protein will be found within a data structure comprising the human genome. However, the search algorithm may be employed to determine the most likely nucleotide sequence within the data structure that may give rise to the mutant protein. By providing such information, a person of skill in the art may then determine whether the mutation arose as a result of a point mutation, such as, but not limited to, insertion of a stop codon into a nucleotide sequence that is translated into a truncated protein, or an inversion, deletion, translocation or combination thereof.
Thus, the present invention also contemplates a method of searching a data structure for a target nucleotide sequence wherein the target nucleotide sequence comprises the doublet nucleotide code encoding the amino acid sequence of the protein of interest. The data structure may comprise any database known in the art. Further, the data structure may comprise introns, for example, as found in genomic nucleotide sequences of eukaryotes.
The above description is not intended to limit the claimed invention in any manner. Furthermore, the discussed combination of features might not be absolutely necessary for the inventive solution.
The present invention will be further illustrated in the following examples. However, it is to be understood that these examples are for illustrative purposes only, and should not be used to limit the scope of the present invention in any manner.
Examples
Example 1: Data Representation and Variable Creation
The development of machine learning systems such as neural networks as highly accurate predictive models can require attention to both data representation and input variable selection. Based on information known in the art, a superset of 38 potential input variables was employed as a training set to identify an optimal subset of available descriptors that can uniquely be mapped from a specific amino acid to the specific DNA encoding doublet. The following 38 variables were used in this example. However, not all variables are required by the method of the present invention, as would be understood by a person of skill in the art. The 38 variables represent a superset of possible descriptors from which a subset of variables was selected during model development by a statistical importance analysis of the inputs. Based on the analysis, and without wishing to be limiting in any manner, 8 of the 38 variables were required in order to train a perfect Bayesian classifier.
(i) Variables 1-21: The 20 Amino Acids
Amino acids are represented by a series of ones and zeros as described by White and Seffens (1998; Electronic J. Biotech. 1: 196-201, which is herein incorporated by reference) for each of the 20 Amino Acids. Optionally, one additional variable, for example, consisting entirely of zeros, may be used for the Stop instruction. Using this representation each Amino Acid can be represented as a string of 20 0s and a single 1 whose position is unique for each Amino Acid. For example:
Alanine is represented by the string: 100000000000000000000, Valine is represented by the string: 000000000000000000010, Serine is represented by the string: 000000000000100000000, and so on for all 20 AAs. The Stop instruction is represented as all 0s 00000000000000000000.
As will be evident to a person of skill in the art, numerous ways exist to represent twenty amino acids and any such way of representing the amino acids is contemplated by the present invention.
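A minimal sketch of such a representation is given below for illustration. It assumes an alphabetical ordering of the amino acid names and a fixed width of 21 positions with the Stop instruction encoded as all zeros; the actual bit assignment of the cited scheme may differ, and the names used are hypothetical.

```python
# Assumed ordering (alphabetical by name); the cited scheme may order them differently.
AMINO_ACIDS = [
    "Alanine", "Arginine", "Asparagine", "Aspartate", "Cysteine",
    "Glutamate", "Glutamine", "Glycine", "Histidine", "Isoleucine",
    "Leucine", "Lysine", "Methionine", "Phenylalanine", "Proline",
    "Serine", "Threonine", "Tryptophan", "Tyrosine", "Valine",
]
WIDTH = 21  # variables 1-21: 20 amino acid positions plus one spare for Stop

def encode(residue):
    """Return the 21-character binary string for an amino acid or 'Stop'."""
    if residue == "Stop":
        return "0" * WIDTH
    index = AMINO_ACIDS.index(residue)
    return "0" * index + "1" + "0" * (WIDTH - index - 1)

print(encode("Alanine"))  # 100000000000000000000
print(encode("Stop"))     # 000000000000000000000
```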
(ii) Variable 22: Average Codon Frequencies
Data from millions of DNA sequences have been analyzed during the human genome project. As a result, it is possible to estimate on average how frequently each triplet codon encodes its corresponding amino acid. A simple conversion from the triplet to the doublet nucleotide code results in the average frequencies with which each doublet code is associated with each amino acid. Specific codon usage tables which may be employed in the present invention may be obtained from: Codon usage tabulated from the international DNA sequence database: status for the year 2000. Nakamura, Y., Gojobori, T. and Ikemura, T. (2000) Nucl. Acids Res. 28, 292; 2002 codon usage table at: www.kazura.or.jp/codon/CUTG.html
For example:
Amino Acid    Frequency    Doublet
Alanine       7.02%        GC
Arginine      2.28%        AG
              3.33%        CG
Asparagine    3.68%        AA
... and so on for all 20 amino acids.
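The triplet to doublet conversion may be sketched as follows, for illustration only. Frequencies of triplet codons sharing the same first two nucleotides are summed, and the doublet frequencies for one amino acid may then be normalised into relative probabilities. The function names are hypothetical; the Arginine figures are those quoted in the listing above.

```python
from collections import defaultdict

def triplet_to_doublet(triplet_freqs):
    """Sum triplet codon frequencies that share the same first two nucleotides."""
    doublet_freqs = defaultdict(float)
    for codon, freq in triplet_freqs.items():
        doublet_freqs[codon[:2]] += freq
    return dict(doublet_freqs)

def normalise(doublet_freqs):
    """Convert the doublet frequencies of one amino acid into probabilities."""
    total = sum(doublet_freqs.values())
    return {doublet: freq / total for doublet, freq in doublet_freqs.items()}

# Arginine doublet frequencies (percent) as quoted above.
arginine = {"AG": 2.28, "CG": 3.33}
print({d: round(p, 2) for d, p in normalise(arginine).items()})
# {'AG': 0.41, 'CG': 0.59}
```

The normalised values are close to the approximate probabilities of 0.4 and 0.6 given for Arginine in Table 2.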
(iii) Variables 23-32: Molecular Descriptors
The 20 amino acids have different structural and chemical properties. Similarities in some of these properties have permitted a rudimentary classification scheme based on size, charge and lipid solubility. Each of the amino acids can be classified as (1) hydrophobic, (2) small/polar, (3) charged/polar, or (4) polar. This is an imperfect system with some amino acids arguably being members of more than one class. Another approach is to use molecular descriptors calculated from connection table files for each amino acid (CT File Formats (1999), MDL Information Systems, Inc, 14600 Catalina Street, San Leandro, CA, 94577). The Molecular Surface Area (Angstroms²) of the molecule is a pseudo 3D descriptor and represents the contact surface created when a spherical probe representing a solvent molecule is rolled over the molecular model. The Molar Refractivity Index (cm³/mole) estimates the potential light refracting ability of the molecule. Molecular weight (in daltons) can be used as a measure of molecular size, and lipid solubility can be estimated from calculation of LogP, the log of the Octanol/Water partition coefficient. A final descriptor, the Wiener Index, was also used. The Wiener Index is a topological descriptor of longstanding proven value that is calculated from the number and types of bonds in a given molecule. All of these descriptors are easily calculated from connection table files by molecular modeling programs such as Chem3D/ChemOFFICE Ultra 7.0, which was used in the present project.
(iv) Variable 33: DNA/mRNA flag:
A boolean switch was included to allow for the prediction of either DNA or mRNA sequences encoding amino acid sequences. Here DNA = 1, mRNA = 0 and the flag is set to the appropriate value at the time of amino acid sequence entry.
(v) Variable 34: Essential vs Non Essential Amino Acids
Some amino acids are identified as essential because they are not made by an organism. An essential amino acid is scored as 1 and a non essential amino acid as 0.
(vi) Variables 35-38: The probability that the 2nd position in the codon is occupied by a given nucleotide.
Similar to (ii) above, it is possible to calculate the probability that a given intra-codon position is occupied by a specific nucleotide encoding a specific amino acid. Previous research has suggested that intra-codon nucleotide position 2 is the most important of the 3 positions for optimal translation to occur (www.sb.fsu.edu/~hongli/BCH5425/note8.html and W. R. Danter, 2001). The probability that the 2nd position was occupied by a given nucleotide for a given amino acid was calculated for each of Adenine, Cytosine, Guanine and Thymine or Uracil, depending on the DNA/mRNA flag as outlined above in (iv). This process resulted in 4 variables named P(A), P(C), P(G) and P(UT) representing the corresponding calculated probabilities.
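For illustration only, the four second-position variables might be computed as in the sketch below. The codon lists are taken from the standard genetic code; equal weighting of synonymous codons is an assumption of this sketch, and observed codon usage frequencies may be supplied as weights instead. The function name is hypothetical.

```python
from collections import Counter

# Codons of the standard genetic code for three amino acids (DNA alphabet).
CODONS = {
    "Serine": ["TCT", "TCC", "TCA", "TCG", "AGT", "AGC"],
    "Arginine": ["CGT", "CGC", "CGA", "CGG", "AGA", "AGG"],
    "Glycine": ["GGT", "GGC", "GGA", "GGG"],
}

def second_position_probabilities(codons, weights=None):
    """Return P(A), P(C), P(G) and P(UT) for the 2nd codon position.

    weights: optional codon usage weights; equal weights are assumed if omitted.
    """
    if weights is None:
        weights = [1.0] * len(codons)
    totals = Counter()
    for codon, weight in zip(codons, weights):
        base = codon[1].replace("U", "T")   # U and T share one variable, P(UT)
        totals[base] += weight
    total = sum(weights)
    return {
        "P(A)": totals["A"] / total,
        "P(C)": totals["C"] / total,
        "P(G)": totals["G"] / total,
        "P(UT)": totals["T"] / total,
    }

for name, codons in CODONS.items():
    print(name, second_position_probabilities(codons))
```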
The 38 variables outlined above were used to generate an input pattern for the 20 amino acids and Stop instruction together with the corresponding encoding DNA (N=64) and mRNA (N=64) doublet codons. This generated a database of 128 data patterns. Field 1 of the database is the amino acid name while Fields 2-39 are the 38 variables outlined above and the Target, namely the doublet DNA codon, is Field 40.
Example 2: Model Development and Input Selection
A neural network or rule-based algorithm may be used to reverse engineer protein sequences to nucleotide code. Both approaches are outlined below.
(i) The neural network approach:
NeuroSHELL Classifier V2.0 was selected as the neural network modeling software. This artificial intelligence tool is a statistical classifier based on Bayes Theorem coupled with a genetic algorithm used to evolve an optimal solution. The software also has the built in ability to carry out cross validation as the model is evolving. Probabilistic classifiers use the available data to generate an optimized equation from the input variables. The genetic algorithm determines the optimal subset of input variables. As will be evident to someone of skill in the art, evolutionary systems using different input variables may be expected to produce similar but not always the same final model and performance of these models may also be expected to be similar but not identical. Therefore some models may perform better than others and thus it is preferable to find an adequate or optimal/minimal set of predictors.
Each input data pattern relating to an amino acid sequence was composed of 38 variables. Simple genetic principles such as cross-over and point mutation can be applied to these data patterns with the result that the addition and subtraction of some variables will produce different and shorter input patterns. For example, but not to be considered limiting in any manner:
Cross-over simulation: each of two adjacent or otherwise associated evolving potential solution patterns may transfer part of their respective sequence to the other pattern, thereby generating two new potential solution patterns for evaluation according to the fitness criteria;
Point mutation simulation: a particular single variable in an evolving potential solution pattern is deleted or replaced by a new or different variable, creating a new potential solution pattern which can be evaluated according to the fitness criteria.
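The two operations may be sketched, in a non-limiting manner, on a candidate solution represented as a mask of 38 booleans, where True means that the corresponding variable is included in the input pattern. This representation, the single cut point, and the use of a flip for point mutation are assumptions of the sketch and do not describe how NeuroSHELL Classifier operates internally.

```python
import random

N_VARIABLES = 38

def crossover(parent_a, parent_b, rng):
    """Swap the tails of two variable-selection masks at a random cut point."""
    cut = rng.randrange(1, N_VARIABLES)
    child_a = parent_a[:cut] + parent_b[cut:]
    child_b = parent_b[:cut] + parent_a[cut:]
    return child_a, child_b

def point_mutation(mask, rng):
    """Flip one position: drop a selected variable or add an unselected one."""
    position = rng.randrange(N_VARIABLES)
    mutated = list(mask)
    mutated[position] = not mutated[position]
    return mutated

rng = random.Random(0)                     # seeded only to make the run repeatable
all_variables = [True] * N_VARIABLES       # pattern using every variable
no_variables = [False] * N_VARIABLES       # pattern using none
child_a, child_b = crossover(all_variables, no_variables, rng)
mutant = point_mutation(child_a, rng)
print(sum(child_a), sum(child_b), sum(mutant))  # counts of selected variables
```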
At the outset of model development a family of mutated input data patterns is created. Each of these new data patterns provides a potential solution to the classification problem and is evaluated against fitness criteria, which in this case is the maximum number of correct classifications by the model. If a perfect classifier is found after the first generation the process stops. Otherwise the "fittest" patterns from the previous generation undergo further evolution, fitness testing and cross validation. This process continues until an optimal solution is found or a stopping criterion, usually a maximum number of generations, has been reached. Cross validation of the potential solutions is carried out during testing for each candidate model in each generation. A Leave One Out (LOO) training and testing strategy was employed here. During model creation and testing, one data pattern is held out for testing while the remaining 127 patterns are used to train the model. This process is then repeated 127 times so that in turn each pattern is tested in an independent fashion. In other words, all patterns can be used as both training and testing patterns, and each test pattern is preferably not in the dataset from which the model is developed at the time it is tested. This approach represents a widely recognized and validated method of compensating for small data sets during model development. References for training of neural network systems include, but are not limited to: Foundations of Neural Networks, Fuzzy Systems and Knowledge Engineering, Nikola K. Kasabov, A Bradford Book, The MIT Press, Cambridge, Massachusetts, London, England, 1998; and Neural Smithing: Supervised Learning in Feedforward Artificial Networks, Russell D. Reed and Robert J. Marks II, A Bradford Book, The MIT Press, Cambridge, Massachusetts, London, England, 1999; which are hereby incorporated by reference.
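The Leave One Out strategy may be illustrated by the following sketch. A nearest-neighbour rule is used purely as a stand-in classifier so that the example is self-contained; it is not the Bayesian classifier of NeuroSHELL Classifier, and the four toy patterns are hypothetical values chosen only to exercise the loop.

```python
def nearest_neighbour_predict(train_x, train_y, query):
    """Stand-in classifier: label of the closest training pattern."""
    distances = [sum((a - b) ** 2 for a, b in zip(x, query)) for x in train_x]
    return train_y[distances.index(min(distances))]

def leave_one_out_accuracy(patterns, labels, predict=nearest_neighbour_predict):
    """Hold out each pattern once, train on the rest, and score the prediction."""
    correct = 0
    for i in range(len(patterns)):
        train_x = patterns[:i] + patterns[i + 1:]
        train_y = labels[:i] + labels[i + 1:]
        if predict(train_x, train_y, patterns[i]) == labels[i]:
            correct += 1
    return correct / len(patterns)

# Hypothetical toy data: one numeric descriptor per pattern, doublet as target.
patterns = [[0.0], [0.1], [1.0], [1.1]]
labels = ["GC", "GC", "AA", "AA"]
print(leave_one_out_accuracy(patterns, labels))  # 1.0
```

With the 128 patterns described above, the same loop would train on 127 patterns and test on the remaining one in each pass.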
Early in the first generation of model development NeuroSHELL Classifier identified a cross-validated, perfect classifier using only 8 of the original 38 variables. Since it is possible that more than one solution exists for a given problem, particularly when there is a large number of input variables relative to data patterns, the process was repeated 20 times. In each case the same model utilizing the same 8 variables was identified.
The identified 8 variables listed in decreasing order of importance as determined by sensitivity analysis were:
1. Molecular Weight of the Amino Acid (MolWt);
2. Solvent Excluded Molecular Surface Area of the Amino Acid (MolSA);
3. The log of the Octanol/Water solubility coefficient - Log P;
4. The average doublet codon frequency (AvrCodFreq);
5. The probability that Guanine (i.e. P(G)) will occupy the 2nd position in a given codon;
6. The Molar Refractivity Index (MR);
7. The topological descriptor - Wiener Index (WIndx) and
8. The DNA/mRNA switch.
Example 3: Information Processing System comprising a Rule-Based Algorithm
A series of rules can be developed which also allow for perfect mapping of each amino acid to the encoding doublet DNA or mRNA. The pseudocode for these rules for DNA doublets appears below.
Rule 1: If AA = "Alanine" then output: Doublet = "GC_"
Rule 2: If AA = "Asparagine" then output: Doublet = "AA_"
Rule 3: If AA = "Aspartate" then output: Doublet = "GA_"
Rule 4: If AA = "Cysteine" then output: Doublet = "TG_"
Rule 5: If AA = "Glutamate" then output: Doublet = "GA_"
Rule 6: If AA = "Glutamine" then output: Doublet = "CA_"
Rule 7: If AA = "Glycine" then output: Doublet = "GG_"
Rule 8: If AA = "Histidine" then output: Doublet = "CA_"
Rule 9: If AA = "Isoleucine" then output: Doublet = "AT_"
Rule 10: If AA = "Lysine" then output: Doublet = "AA_"
Rule 11: If AA = "Methionine" then output: Doublet = "ATG"
Rule 12: If AA = "Phenylalanine" then output: Doublet = "TT_"
Rule 13: If AA = "Proline" then output: Doublet = "CC_"
Rule 14: If AA = "Threonine" then output: Doublet = "AC_"
Rule 15: If AA = "Tryptophan" then output: Doublet = "TGG"
Rule 16: If AA = "Tyrosine" then output: Doublet = "TA_"
Rule 17: If AA = "Valine" then output: Doublet = "GT_"
Rule 18: If AA = "Arginine" then output: Doublet = "C(A)G_"
Rule 19: If AA = "Leucine" then output: Doublet = "C(T)T_"
Rule 20: If AA = "Serine" then output: Doublet = "TC(AG)_"
Rule 21: If AA = " " then END
The output provided by rules 18-20 may be interpreted as Doublet = "CG_" or "AG_"; Doublet = "CT_" or "TT_"; and Doublet = "TC_" or "AG_", respectively, wherein A = Adenine; C = Cytosine; G = Guanine; T = Thymine and U = Uracil.
Altering the output for mRNA code is simply a matter of modifying the output such that T is replaced by U everywhere that T appears in the output.
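By way of example only, the rules may be implemented as a look up table, as in the following Python sketch. Each amino acid maps to a tuple of alternative formatted doublets, so that the outputs of Rules 18-20 are stored in their interpreted form; this data layout and the function name are choices made for the sketch.

```python
# Look up table following Rules 1-20; tuples hold the alternative doublets.
DOUBLET_RULES = {
    "Alanine": ("GC_",), "Asparagine": ("AA_",), "Aspartate": ("GA_",),
    "Cysteine": ("TG_",), "Glutamate": ("GA_",), "Glutamine": ("CA_",),
    "Glycine": ("GG_",), "Histidine": ("CA_",), "Isoleucine": ("AT_",),
    "Lysine": ("AA_",), "Methionine": ("ATG",), "Phenylalanine": ("TT_",),
    "Proline": ("CC_",), "Threonine": ("AC_",), "Tryptophan": ("TGG",),
    "Tyrosine": ("TA_",), "Valine": ("GT_",),
    "Arginine": ("CG_", "AG_"), "Leucine": ("CT_", "TT_"),
    "Serine": ("TC_", "AG_"),
}

def reverse_translate(amino_acids, mrna=False):
    """Map amino acid names to formatted doublets; T becomes U for mRNA output."""
    output = []
    for residue in amino_acids:
        if residue.strip() == "":          # Rule 21: a blank entry ends processing
            break
        alternatives = DOUBLET_RULES[residue]
        if mrna:
            alternatives = tuple(d.replace("T", "U") for d in alternatives)
        output.append(alternatives)
    return output

print(reverse_translate(["Methionine", "Serine", "Valine"]))
# [('ATG',), ('TC_', 'AG_'), ('GT_',)]
print(reverse_translate(["Methionine", "Serine", "Valine"], mrna=True))
# [('AUG',), ('UC_', 'AG_'), ('GU_',)]
```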
Both the neural network model and the rule-based system employ a software shell for execution. Specifically, the neural network approach employed herein requires a Runtime version of NeuroSHELL Classifier and the rule based system approach requires an expert system shell or look up table with code in order to evaluate the amino acid sequences from which the formatted doublet codes are derived.
Example 4: Searching Data Structures
The output of the information processing system is formatted doublet nucleotide code reverse engineered from the actual amino acid sequence of a protein of interest. The output may be used to search a data structure, for example, an electronic database that comprises nucleotide sequence information. A search algorithm that may be used, which is not meant to be limiting in any manner, is shown below. However, a person of skill in the art will recognize that a variety of search or alignment algorithms, for example, but not limited to, BLAST, FASTA and dynamic programming, may be used and all of these are meant to be included under the present invention.
A Simple Search Algorithm:
(i) Identify the first instance of the start codon "ATG" in the data structure.
(ii) Attempt to match the next doublet nucleotide codon of the sequence (e.g. CC_) with the next triplet codon of the target DNA sequence in the data structure.
(iii) If match = YES (e.g. CC_ and CCT) then repeat (ii) above.
(iv) If match = NO (e.g. CC_ and CAT) then find the next instance of the start codon "ATG" in the data structure and then repeat (ii) above.
(v) The above process continues until either a match is found between the doublet nucleotide code and the target nucleotide sequence in the data structure or the end of the nucleotide sequences in the data structure is reached with no match found.
(vi) If a match is found then Stop, OR the same doublet nucleotide code can be applied to the next target nucleotide sequence in the data structure if one exists.
(vii) If no match is found then the same doublet nucleotide code can be used to search the next target nucleotide sequence in the data structure, OR a new doublet nucleotide code may be applied to search the same target nucleotide sequence.
(viii) The above steps are repeated until all available target nucleotide sequences within the data structure have been fully searched using all available doublet nucleotide codes.
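The simple search algorithm set out in this Example may be sketched as follows, without wishing to be limiting. The sketch assumes a single target sequence held as a string, a probe that lists the codons following the start codon, and probe entries given as tuples of alternative patterns so that a choice such as CG_ or AG_ can be expressed. The function names and the toy target are hypothetical.

```python
def codon_matches(alternatives, codon):
    """True if any alternative pattern matches the triplet; '_' matches any base."""
    return any(
        len(codon) == 3 and all(p in ("_", c) for p, c in zip(pattern, codon))
        for pattern in alternatives
    )

def simple_search(probe, target):
    """Return every index of an 'ATG' in target that is followed, codon by codon,
    by matches for all entries of the probe."""
    hits = []
    start = target.find("ATG")
    while start != -1:
        first = start + 3                       # first codon after the start codon
        if all(
            codon_matches(alts, target[first + 3 * k: first + 3 * k + 3])
            for k, alts in enumerate(probe)
        ):
            hits.append(start)
        start = target.find("ATG", start + 1)   # move to the next start codon
    return hits

probe = [("CC_",), ("CG_", "AG_")]              # e.g. Proline, then Arginine
target = "TT" + "ATG" + "CAT" + "AA" + "ATG" + "CCT" + "AGA" + "TAA"
print(simple_search(probe, target))  # [10]
```

In the toy target the first ATG is followed by CAT, which does not match CC_, so the search moves to the second ATG, where CCT and AGA match.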
Searching with Special Instances of the Doublet Nucleotide Codons:
Methionine maps exclusively, that is, with 100% accuracy, to one and only one instance of the doublet codon AT_, namely ATG.
Tryptophan maps exclusively to one and only one instance of the doublet codon TG_, namely TGG.
Arginine maps exclusively to 2 instances of doublet codons, namely CG_ OR AG_.
Leucine maps exclusively to 2 instances of doublet codons, namely CT_ OR TT_.
Serine maps exclusively to 2 instances of doublet codons, namely TC_ OR AG_.
When the search process encounters one of the special instances of the doublet codons listed above, the search at that point occurs as follows. Where the mapping is from Methionine or Tryptophan, the search is for the exact instance, namely "ATG" or "TGG" respectively. In instances where the mapping is from Arginine, Leucine or Serine, the search is carried out as a simple boolean "OR" match. This occurs because Arginine only maps to CG_ OR AG_; Leucine only maps to CT_ OR TT_ and Serine only maps to TC_ OR AG_.
Example 5: Example of Reverse Engineering Abnormal Gene Products
Normal adult hemoglobin is a tetramer composed of 2 pairs of polypeptide chains termed alpha and beta globin subunits. Each of these subunits is bound to a heme group. Hemoglobin is found primarily in red blood cells. It is the hemoglobin in the red blood cell that is responsible for transporting oxygen from the lungs to the tissues and transporting metabolic products such as carbon dioxide in the reverse direction.
A separate gene regulates the synthesis of each of the hemoglobin subunits. Normally individuals inherit one beta-chain gene from each parent, 2 alpha-chain genes and 2 gamma-chain genes from each parent. The inheritance of abnormal hemoglobins follows classical Mendelian genetics. The commonly found hemoglobin abnormalities are predominantly beta-chain variants and are usually due to a single amino acid replacement that results from a single base substitution in the encoding triplet DNA codon. The most common hemoglobin disorders are labeled Hemoglobin S and C. Thalassemia is a general term used to describe a genetically determined reduction in the amount of hemoglobin produced.
Hemoglobin S (HbS) and Hemoglobin C (HbC) both result from missense mutations at the 6th position. In the case of HbS, Valine is substituted for Glutamic acid while HbC results from the substitution of Lysine for Glutamic acid. One example of Thalassemia results from a Stop (TAG) mutation at position 17 and a second example of a Thalassemia mutant results from the deletion of an adjacent Adenine at position 8, producing a series of missense amino acids which terminates in an early Stop codon. Both of these examples result in decreased beta-chain synthesis.
The reverse engineering process described above was applied to the amino acid sequence and DNA code for the first 29 amino acids of the beta globin chain. The amino acid sequence for the first 29 Amino Acids was first coded as outlined in Example 1, and then submitted to a previously trained and cross validated neural network for analysis. The output of the neural network was a series of 29 or fewer formatted doublet nucleotide codons. The doublet nucleotide codon output was then compared with the known triplet DNA code. In all cases, i.e. Normal Hemoglobin, HbS, HbC and the 2 Thalassemias, the doublet nucleotide code predictions exactly matched the first 2 nucleotides of the actual DNA triplet codons, including the actual mutations responsible for the abnormal hemoglobin. In other words, the first two nucleotides of each original triplet DNA codon were accurately determined from the doublet nucleotide code produced from neural network evaluation of the amino acid sequences. These results are summarized graphically in Figure 1.
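The comparison at the 6th position may be illustrated by the short sketch below. The doublets follow Table 3; the triplet codons are the widely reported beta globin codon 6 variants (GAG in normal hemoglobin, GTG in HbS, AAG in HbC) and are supplied here as an assumption of the sketch rather than taken from Figure 1.

```python
# Doublets for the three residues found at position 6 (per Table 3).
DOUBLET = {"Glutamate": "GA", "Valine": "GT", "Lysine": "AA"}

# Residue and reported triplet codon at position 6 for each variant (assumed).
CODON_6 = {
    "Normal": ("Glutamate", "GAG"),
    "HbS": ("Valine", "GTG"),
    "HbC": ("Lysine", "AAG"),
}

def doublet_agrees(residue, codon):
    """True if the predicted doublet equals the first two bases of the codon."""
    return codon.startswith(DOUBLET[residue])

for variant, (residue, codon) in CODON_6.items():
    print(variant, residue, DOUBLET[residue] + "_", codon, doublet_agrees(residue, codon))
```

In each of the three cases the predicted doublet agrees with the first 2 nucleotides of the triplet, and the HbS and HbC substitutions appear as a change in the doublet itself.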
All references are herein incorporated by reference.
The present invention has been described with regard to preferred embodiments. However, it will be obvious to persons skilled in the art that a number of variations and modifications can be made without departing from the scope of the invention as described herein.

Claims

THE EMBODIMENTS OF THE INVENTION IN WHICH AN EXCLUSIVE PROPERTY OR PRIVILEGE IS CLAIMED ARE DEFINED AS FOLLOWS:
1. A method of converting an amino acid sequence of a protein of interest to doublet nucleotide code, comprising the steps of: a) providing a first data set corresponding to said amino acid sequence as input to an information processing system (IPS); and b) determining from said IPS a second data set corresponding to predicted doublet nucleotide code encoding said amino acid sequence.
2. The method of claim 1, wherein after said step of determining, said method comprises the step of outputting said second data set, or information equivalent to said predicted doublet code encoding the amino acid sequence of the protein of interest.
3. The method of claim 1, wherein said doublet nucleotide code comprises a DNA nucleotide sequence or an RNA nucleotide sequence.
4. The method of claim 1, wherein said first data set comprises a full or partial amino acid sequence of a protein of interest.
5. The method of claim 1, wherein said first data set corresponding to the amino acid sequence is in binary form.
6. The method of claim 1, wherein said protein of interest is a variant or mutant protein of a wild-type protein.
7. The method of claim 6, wherein said variant or mutant protein is associated with a disease.
8. The method of claim 1, wherein said information processing system (IPS) comprises a neural network.
9. The method of claim 8, wherein said IPS employs a genetic algorithm.
10. The method of claim 8, wherein the neural network is NeuroSHELL Classifier v2.0.
11. The method of claim 8, wherein said neural network is a trained neural network.
12. The method of claim 11, wherein said neural network is trained by processing a plurality of data sets comprising data elements (X,Y) wherein X represents the amino acid sequence of a protein and Y represents the nucleotide sequence encoding X.
13. The method of claim 1, wherein said IPS comprises a rule-based system.
14. A method of identifying a nucleotide sequence encoding a protein of interest within a data structure, comprising the steps of: a) providing a first data set corresponding to the amino acid sequence of the protein of interest to an information processing system (IPS) capable of producing a second data set corresponding to the predicted doublet DNA code encoded by the amino acid sequence; and b) performing an in-string search of the data structure to identify all instances wherein the second data set is present in the data structure.
15. The method of claim 14, wherein said data structure comprises an electronic medium containing nucleotide sequence information.
16. The method of claim 15, wherein said electronic medium comprises an electronic database.
17. The method of claim 15, wherein said nucleotide sequence information comprises one or more genomic nucleotide sequences of one or more organisms.
18. The method of claim 14, wherein said in-string search comprises a sliding window approach.
19. The method of claim 14, wherein said in-string search is controlled by an alignment algorithm.
20. The method of claim 19, wherein said alignment algorithm comprises BLAST, FASTA, dynamic programming, or a version or an executable-code-modified version thereof.
21. An information processing system (IPS) capable of a) receiving a first data set corresponding to an amino acid sequence of a protein of interest, and b) producing a second data set corresponding to the predicted doublet nucleotide code encoding the amino acid sequence.
22. The IPS of claim 21, further comprising an alignment algorithm for performing an in-string search of a data structure to identify all instances wherein the second data set is present in the data structure.
23. The IPS of claim 22, wherein said data structure comprises an on-line data structure.
PCT/CA2003/001929 2002-12-06 2003-12-05 Reverse translation of protein sequences to nucleotide code WO2004053766A1 (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
AU2003287823A AU2003287823A1 (en) 2002-12-06 2003-12-05 Reverse translation of protein sequences to nucleotide code

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
US43166102P 2002-12-06 2002-12-06
US60/431,661 2002-12-06

Publications (1)

Publication Number Publication Date
WO2004053766A1 (en) 2004-06-24

Family

ID=32507776

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/CA2003/001929 WO2004053766A1 (en) 2002-12-06 2003-12-05 Reverse translation of protein sequences to nucleotide code

Country Status (2)

Country Link
AU (1) AU2003287823A1 (en)
WO (1) WO2004053766A1 (en)

Cited By (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN109977227A (en) * 2019-03-19 2019-07-05 中国科学院自动化研究所 Text feature, system, device based on feature coding
CN110945595A (en) * 2017-07-25 2020-03-31 南京金斯瑞生物科技有限公司 DNA-based data storage and retrieval

Non-Patent Citations (8)

* Cited by examiner, † Cited by third party
Title
GENE. 30 DEC 2000, vol. 261, no. 1, 30 December 2000 (2000-12-30), pages 139 - 151, XP004316501, ISSN: 0378-1119 *
INTERNATIONAL JOURNAL OF PEPTIDE AND PROTEIN RESEARCH. 1976, vol. 8, no. 1, 1976, pages 13 - 19, XP009027685, ISSN: 0367-8377 *
JOURNAL OF MOLECULAR BIOLOGY. 5 MAY 1985, vol. 183, no. 1, 5 May 1985 (1985-05-05), pages 1 - 12, XP000674208, ISSN: 0022-2836 *
JOURNAL OF THEORETICAL BIOLOGY. 7 AUG 1977, vol. 67, no. 3, 7 August 1977 (1977-08-07), pages 345 - 376, XP009027688, ISSN: 0022-5193 *
JUKES T H: "Evolution of anticodons.", ADVANCES IN SPACE RESEARCH : THE OFFICIAL JOURNAL OF THE COMMITTEE ON SPACE RESEARCH (COSPAR). 1984, vol. 4, no. 12, 1984, pages 177 - 182, XP001180081 *
ORIGINS OF LIFE. JUL 1975, vol. 6, no. 3, July 1975 (1975-07-01), pages 423 - 427, XP009027682, ISSN: 0302-1688 *
R. KNIPPERS, P. PHILIPPSEN, K.P. SCHÄFER, E. FANNING: "Molekulare Genetik", 1990, GEORG THIEME VERLAG, STUTTGART, XP002274646 *
WHITE G. AND SEFFENS W., ELECTRONIC JOURNAL OF BIOTECHNOLOGY, vol. 1, no. 3, 15 December 1998 (1998-12-15), pages 196 - 201, XP001180083 *

Cited By (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN110945595A (en) * 2017-07-25 2020-03-31 南京金斯瑞生物科技有限公司 DNA-based data storage and retrieval
CN110945595B (en) * 2017-07-25 2023-08-18 南京金斯瑞生物科技有限公司 DNA-based data storage and retrieval
CN109977227A (en) * 2019-03-19 2019-07-05 中国科学院自动化研究所 Text feature, system, device based on feature coding

Also Published As

Publication number Publication date
AU2003287823A1 (en) 2004-06-30

Legal Events

Date Code Title Description
AK Designated states

Kind code of ref document: A1

Designated state(s): AE AG AL AM AT AU AZ BA BB BG BR BW BY BZ CA CH CN CO CR CU CZ DE DK DM DZ EC EE EG ES FI GB GD GE GH GM HR HU ID IL IN IS JP KE KG KP KR KZ LC LK LR LS LT LU LV MA MD MG MK MN MW MX MZ NI NO NZ OM PG PH PL PT RO RU SC SD SE SG SK SL SY TJ TM TN TR TT TZ UA UG US UZ VC VN YU ZA ZM ZW

AL Designated countries for regional patents

Kind code of ref document: A1

Designated state(s): BW GH GM KE LS MW MZ SD SL SZ TZ UG ZM ZW AM AZ BY KG KZ MD RU TJ TM AT BE BG CH CY CZ DE DK EE ES FI FR GB GR HU IE IT LU MC NL PT RO SE SI SK TR BF BJ CF CG CI CM GA GN GQ GW ML MR NE SN TD TG

121 Ep: the epo has been informed by wipo that ep was designated in this application
122 Ep: pct application non-entry in european phase
NENP Non-entry into the national phase

Ref country code: JP

WWW Wipo information: withdrawn in national office

Country of ref document: JP