WO2001096860A1 - A system and method for identifying dna sequences that could code into a string of amino acids - Google Patents

A system and method for identifying DNA sequences that could code into a string of amino acids

Info

Publication number
WO2001096860A1
Authority
WO
WIPO (PCT)
Prior art keywords
codon
amino acids
permutations
gene database
string
Prior art date
Application number
PCT/US2001/040930
Other languages
French (fr)
Inventor
Lawrence S. Zisman
Original Assignee
Zisman Lawrence S
Priority date
Filing date
Publication date
Application filed by Zisman Lawrence S
Priority to AU2001267070A
Publication of WO2001096860A1

Classifications

    • C: CHEMISTRY; METALLURGY
    • C12: BIOCHEMISTRY; BEER; SPIRITS; WINE; VINEGAR; MICROBIOLOGY; ENZYMOLOGY; MUTATION OR GENETIC ENGINEERING
    • C12Q: MEASURING OR TESTING PROCESSES INVOLVING ENZYMES, NUCLEIC ACIDS OR MICROORGANISMS; COMPOSITIONS OR TEST PAPERS THEREFOR; PROCESSES OF PREPARING SUCH COMPOSITIONS; CONDITION-RESPONSIVE CONTROL IN MICROBIOLOGICAL OR ENZYMOLOGICAL PROCESSES
    • C12Q1/00: Measuring or testing processes involving enzymes, nucleic acids or microorganisms; Compositions therefor; Processes of preparing such compositions
    • C12Q1/68: Measuring or testing processes involving enzymes, nucleic acids or microorganisms; Compositions therefor; Processes of preparing such compositions involving nucleic acids
    • G: PHYSICS
    • G16: INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR SPECIFIC APPLICATION FIELDS
    • G16B: BIOINFORMATICS, i.e. INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR GENETIC OR PROTEIN-RELATED DATA PROCESSING IN COMPUTATIONAL MOLECULAR BIOLOGY
    • G16B30/00: ICT specially adapted for sequence analysis involving nucleotides or amino acids
    • G: PHYSICS
    • G16: INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR SPECIFIC APPLICATION FIELDS
    • G16B: BIOINFORMATICS, i.e. INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR GENETIC OR PROTEIN-RELATED DATA PROCESSING IN COMPUTATIONAL MOLECULAR BIOLOGY
    • G16B30/00: ICT specially adapted for sequence analysis involving nucleotides or amino acids
    • G16B30/10: Sequence alignment; Homology search

Definitions

  • search interface system 24 attempts to identify the gene containing the
  • the gene identification process includes identifying a start codon utilizing
  • start codon identifier 32 Often, start codons are separated from the matched sequence by splice
  • a splice junction To facilitate the process of identifying remotely located start codons, a splice junction
  • handler 34 is included to identify and handle splice junctions. This process is described in more
  • the system can also handle
  • Motifs are amino acid sequences that have a
  • the invention includes a motif
  • search criteria 50 may also be entered.
  • the search criteria 50 can be used to instruct the search interface
  • permutation generator 22 generates a set of codon sequences (DNA sequences) that could code for an inputted sequence of amino acids. For example, if a sequence containing the four amino acids: CYS, GLU, SER, GLU was inputted, a set of all possible codon
  • TGC TGC, GAG, AGC, GAG.
  • the first step is to define a Master Codon Table (Table 1 shown below), which contains a
  • codons for a given amino acid are also included in this specification.
  • the possible codons for the stop signal are also included in this specification.
  • the next step is to define various matrices and tables that are utilized in the process.
  • An address locates a specific position within the designated matrix.
  • a value is the object and all the properties attached to that object in a specific location of
  • reference matrix constitute a "codon table” and are addresses to specific locations in the master
  • amino acid input has a corresponding d value in the master codon table (the d value should be
  • PG permutation generator
  • the PG address assigns location within this matrix.
  • the values in the PG matrix are addresses to specific positions in the reference matrix.
  • the 1 in position Y11 would be x11
  • the 2 in position Y12 would be x12
  • the 2 in position Y22 would be x22
  • x11 would be the location in
  • x11 position of the reference matrix would in turn be a location in the master codon table.
  • x12 would not necessarily equal x22.
  • Y2 ⁇ ) ⁇ (1,1), (1,2), (2,1), (1,3), (3,1), (1,4), (2,2), (2,3),
  • Y6 ⁇ ) is generated, i.e.,
  • the program then transposes the data in the reference matrix to the permutation generator
  • the program then generates all possible permutations of the codons that could yield the input
  • Xaj locations being addresses to locations in the Mqd matrix, i.e.:
  • Each codon string represents one of all the possible codon permutations that could code for the input amino acid sequence.
  • homology search programs 33 may include known programs such as BLAST™
  • FASTA™ can comprise customized systems to handle multiple sequence searching.
  • Homology search programs 33 will identify high homology sequences, which comprise
  • start codon identifier a segment homologous to the inputted permutation. Specifically, start codon identifier
  • a start codon is an
  • ATG codon upstream (prior to) and "in frame” with the high homology sequence. (A start codon is in frame if there exists a whole number of triplets between the start codon and the high
  • start codon identifier 32 examines upstream sequences to find a start
  • GGT ATT ACT which resulted from the example above, was inputted into a search engine and matched a gene having a partial sequence: GCAATG CCCGATT GTT CCT GGTATTACT
  • the inputted permutation sequence (shown in bold) is present, and there is an upstream ATG sequence (also shown in bold). However, the upstream ATG start codon is not "in frame” to result in an expression of the amino acid sequence of interest. (There is not a whole number of
  • the ATG open reading frame in this example is:
  • start codon identifier 32 must look further upstream to
  • a program that can find open reading frames in a gene can
  • Genomic DNA sequences arise as a consequence of the distinction between genomic DNA and cDNA.
  • DNA is the "blueprint" from which mRNA is transcribed. mRNA is transcribed from genomic
  • DNA but may contain a sequence that does not code for a protein.
  • cDNA refers to the
  • cDNA is a DNA copy of mRNA that has been processed and is ready for
  • genomic DNA is searched, one may find that a start codon is separated from the sequence found to be homologous to the permutation codon string by one or more introns.
  • the intron which separates the start codon from the high homology region could be several thousand base pairs (nucleotides) in length. Accordingly, splice junction handler 34 is provided
  • a further feature of the invention is to provide a motif handler 44 that allows motifs to be
  • Motifs are amino acid sequences that have a common pattern embedded in
  • nonhomologous sequences i.e., motifs include commonly recurring amino acids in fixed
  • SH2 domain An example is a motif called the SH2 domain. This domain is a region
  • Motif handler 44 may utilize,
  • a first option is to perform permutations only on the amino acids that recur in the motif.
  • Positional search mechanism 36 is then utilized during the search to ensure that particular
  • elements in a given permutation are set at a proper distance apart from each other when sequence matching to the database 14 is performed.
  • a second option is to use motif handler 44 to first locate a primary region (e.g., a region
  • motif handler 44 would identify a second search parameter and note its positional relationship to
  • positional search mechanism 36 maintains the codon permutations in the same fixed positional relationship as the commonly recurring amino acids in
  • search interface system 24 would search for high homology
  • CCT is a

Landscapes

  • Life Sciences & Earth Sciences (AREA)
  • Chemical & Material Sciences (AREA)
  • Physics & Mathematics (AREA)
  • Health & Medical Sciences (AREA)
  • Engineering & Computer Science (AREA)
  • Proteomics, Peptides & Aminoacids (AREA)
  • General Health & Medical Sciences (AREA)
  • Biotechnology (AREA)
  • Biophysics (AREA)
  • Bioinformatics & Cheminformatics (AREA)
  • Analytical Chemistry (AREA)
  • Bioinformatics & Computational Biology (AREA)
  • Theoretical Computer Science (AREA)
  • Spectroscopy & Molecular Physics (AREA)
  • Medical Informatics (AREA)
  • Organic Chemistry (AREA)
  • Evolutionary Biology (AREA)
  • Wood Science & Technology (AREA)
  • Zoology (AREA)
  • Biochemistry (AREA)
  • Genetics & Genomics (AREA)
  • General Engineering & Computer Science (AREA)
  • Molecular Biology (AREA)
  • Microbiology (AREA)
  • Immunology (AREA)
  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)

Abstract

A computerized method and system for identifying DNA sequences that could code for a string of inputted amino acids and then identifying genes within the DNA sequences that are potentially responsible for creating the amino acid string. The system comprises means for: (1) inputting a string of amino acids; (2) generating all possible codon permutations that could code into the inputted string of amino acids; and (3) examining a gene database to identify DNA sequences that match the codon permutations. The system further comprises mechanisms for identifying start codons and handling splice junctions in the gene database. Additionally, the invention includes a system for handling amino acid sequences inputted in the form of a motif.

Description

A SYSTEM AND METHOD FOR IDENTIFYING DNA SEQUENCES THAT COULD CODE INTO A STRING OF AMINO ACIDS
BACKGROUND OF THE INVENTION
1. Technical Field
The present invention relates to computerized systems for studying DNA, and more specifically, to a system and method for identifying DNA sequences that could code into a string of amino acids.
2. Related Art
Presently, scientists are engaged in an intense research effort to characterize the genomes of human and selected model organisms through complete mapping and sequencing of their DNA (deoxyribonucleic acid), and to develop technologies for genomic analysis. A genome comprises all of the genetic material found in the chromosomes of a particular organism. For instance, the human genome consists of 50,000 to 100,000 genes located on 23 pairs of chromosomes. The first complete human genome to be sequenced will be a composite of sequences from many sources, most of these being cell lines that have existed in laboratories all over the world for some time. The sequence will thus be a generic sequence representative of humans in general and not of any particular individual. The complete sequence will provide a standard against which other partial sequences can be compared.
DNA is contained in self-replicating genetic structures known as chromosomes. Each chromosome contains a long molecule of DNA, the chemical of which genes are made. The DNA, in turn, is a double-stranded molecule in which each strand is a linear array of units called nucleotides or bases. There are four different bases, called adenine "A," thymine "T," guanine "G," and cytosine "C." The bases on one strand of DNA are precisely paired with the bases on the other strand, so that an A is always opposite T and G opposite C. The order of the four units on the DNA strand determines the information content of a particular gene or piece of DNA. Genes are of differing length, ranging in size from roughly 2,000 to as many as 2,000,000 base pairs. Mapping is the process of determining the position and spacing of genes, or other landmarks, on the chromosomes relative to one another. Sequencing is the process of determining the order of the nucleotides, or base pairs, in a DNA molecule.
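By way of illustration only, the base-pairing rule described above (A opposite T, G opposite C) can be sketched in a few lines of Python. The function names `complement` and `reverse_complement` are hypothetical and do not appear in the specification; the sketch assumes an uppercase or lowercase string containing only the four bases.

```python
# Watson-Crick pairing: A <-> T and G <-> C.
COMPLEMENT = str.maketrans("ATGC", "TACG")


def complement(strand: str) -> str:
    """Return the base-by-base complement of a DNA strand."""
    return strand.upper().translate(COMPLEMENT)


def reverse_complement(strand: str) -> str:
    """Return the opposite strand, read in its own 5'-to-3' direction."""
    return complement(strand)[::-1]


if __name__ == "__main__":
    print(complement("ATGC"))          # TACG
    print(reverse_complement("ATGC"))  # GCAT
```

Because the two strands run in opposite directions, a search of a gene database may need to consider the reverse complement of each stored sequence as well as the sequence itself.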
Although mapping of human genes began early in the twentieth century, it has been intensively pursued only for the past two decades. For most of this period the methods that were developed, though original and ingenious, have been inadequate for comprehensive mapping and have only allowed the construction of relatively crude maps with very little detail. Recently, much more effective technology has been introduced. At the beginning of the 21st century, about 1,700 of the estimated 50,000 to 100,000 human genes (less than 2 percent) have been mapped. Scientists involved in the Human Genome Project hope to find the location of the 50,000 to 100,000 or so human genes and to read the entire genetic script, all three billion bits of information, by the year 2005. More information on the Human Genome Project is available on the world wide web at sites such as the National Human Genome Research Institute at www.nhgri.gov.
The information generated by genome projects is expected to be the source book for biomedical science in the 21st century and will be of immense benefit to the field of medicine. It will help us to understand and eventually treat many of the more than 4,000 genetic diseases that afflict mankind, as well as the many multifactorial diseases in which genetic predisposition plays an important role. Scientists are now just beginning to formulate uses for these vast databases of information. To fully exploit these databases, it will be vital to develop new methods and tools for the analysis and interpretation of genome maps and DNA sequences. One such area that will potentially benefit from genome projects is the study of proteins and their correlation with genetic information, as genes are responsible for determining which proteins get created. Proteins are made up of a string or sequence of amino acids. DNA contains the information about how amino acids are put together in a series to form a protein.
Accordingly, a gene is said to code into a specific protein. For this coding to occur, DNA serves as a template for the creation of mRNA. The process by which RNA is made from DNA is called transcription. After transcription, mRNA is further processed until it can serve as a template for linking amino acids together. The process by which an mRNA sequence is converted into an amino acid sequence is called "translation." Translation is performed by ribosomes which line up tRNAs with mRNAs. Each tRNA has three nucleotides which complement three nucleotides on the mRNA. Each tRNA also carries a specific amino acid. As tRNAs are lined up along the mRNA a covalent bond is formed between the amino acid on that tRNA and the amino acid already present. A key result is that a specific nucleotide triplet corresponds to a specific amino acid. However, there is some redundancy so that several different combinations of nucleotide triplets can correspond to the same amino acid.
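The forward direction just described (transcription followed by translation) can be sketched as follows. This is an illustrative sketch only: it assumes the standard genetic code, one-letter amino acid codes, and an input given as the coding strand of the DNA. The names `transcribe` and `translate` are hypothetical, and the ribosome and tRNA machinery is reduced to a table lookup.

```python
from itertools import product

# Standard genetic code, one letter per codon, with codons enumerated in
# U, C, A, G order (first base varies slowest). "*" marks a stop signal.
BASES = "UCAG"
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {"".join(c): aa for c, aa in zip(product(BASES, repeat=3), AMINO)}


def transcribe(coding_strand: str) -> str:
    """DNA coding strand -> mRNA: same sequence with T replaced by U."""
    return coding_strand.upper().replace("T", "U")


def translate(mrna: str) -> str:
    """Read the mRNA three bases at a time until a stop codon or the end."""
    protein = []
    for i in range(0, len(mrna) - 2, 3):
        amino_acid = CODON_TABLE[mrna[i:i + 3]]
        if amino_acid == "*":
            break
        protein.append(amino_acid)
    return "".join(protein)


if __name__ == "__main__":
    print(translate(transcribe("TGCGAGAGCGAG")))  # CESE (Cys-Glu-Ser-Glu)
```

The redundancy noted above is visible in `AMINO`: most letters occur more than once, so several codons map to the same amino acid.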
Until now, most efforts have focused on identifying a protein and related amino acid details from a DNA sequence. The prior art fails however to provide systems for performing a reverse translation, i.e., generating all possible DNA sequences from a single amino acid sequence. Given the correspondence between protein data (i.e., amino acids) and genetic information (i.e., nucleotide triplets), there exists any number of potential applications where such systems could be useful. For instance, a research scientist studying a tissue sample populated with proteins of interest, e.g., a portion of a failing heart, may be interested in determining the gene responsible for coding to that protein. Until now however, there has been no automated mechanism for identifying the gene or collection of genes potentially responsible for coding into that protein.
SUMMARY OF THE INVENTION
The present invention addresses the above-mentioned problems by providing a computerized method and system for identifying DNA sequences that could code for a string of inputted amino acids and then for identifying genes within the DNA sequences that are potentially responsible for creating the amino acid string.
In a first aspect, the invention comprises a method of: (1) inputting a string of amino acids; (2) generating all possible codon permutations that could code into the inputted string of amino acids; and (3) examining a gene database to identify DNA sequences that match the codon permutations.
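The three steps of the first aspect can be sketched as below. This is a minimal illustration under stated assumptions, not the claimed implementation: it assumes the standard genetic code and one-letter amino acid codes, it uses exact substring matching as a stand-in for a homology search, and the gene database is a hypothetical in-memory dictionary. The names `codon_permutations`, `search`, and `GENE_DB` are invented for the example.

```python
from itertools import product

# Standard genetic code in T, C, A, G order; "*" marks a stop signal.
BASES = "TCAG"
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"

# Invert the table: amino acid -> list of codons that encode it.
CODONS_FOR = {}
for codon, aa in zip(product(BASES, repeat=3), AMINO):
    CODONS_FOR.setdefault(aa, []).append("".join(codon))


def codon_permutations(peptide: str):
    """Step 2: yield every DNA sequence that could encode the peptide."""
    for combo in product(*(CODONS_FOR[aa] for aa in peptide)):
        yield "".join(combo)


def search(peptide: str, database: dict) -> list:
    """Step 3: report (gene name, offset) wherever a permutation occurs."""
    candidates = set(codon_permutations(peptide))
    width = 3 * len(peptide)
    hits = []
    for name, sequence in database.items():
        for i in range(len(sequence) - width + 1):
            if sequence[i:i + width] in candidates:
                hits.append((name, i))
    return hits


# Hypothetical two-entry database, for illustration only.
GENE_DB = {
    "geneA": "GCAATGTGCGAGAGCGAGTAA",
    "geneB": "ATGAAACCCGGGTTT",
}

if __name__ == "__main__":
    print(search("CESE", GENE_DB))  # step 1: the input string is "CESE"
```

Holding the permutations in a set makes each window lookup constant time, but the set itself grows multiplicatively with peptide length, so this exhaustive form is practical only for short inputs.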
In a second aspect, the invention comprises a computerized method for identifying DNA sequences that could code for an inputted motif, comprising the steps of: (1) identifying within the motif a primary region, wherein the primary region includes a string of amino acids having contiguous recurrence in the motif; (2) searching a gene database to identify all codon permutations that could code into the primary region; (3) identifying a subset of the gene database, wherein the subset contains genes that could translate into the primary region; (4) selecting within the motif a secondary amino acid apart from the primary region; (5) determining a positional relationship of the secondary amino acid with respect to the primary region; (6) searching the subset of the gene database for DNA sequences that include codon permutations that could code into both the primary region and the secondary amino acid, at the determined positional relationship.
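The two-stage motif search of the second aspect might be sketched as follows. Several simplifications are assumed and should be read as such: the positional relationship is expressed as a number of intervening codons, the database is a hypothetical dictionary, and a regular expression with one alternation per amino acid is substituted for explicit enumeration of codon permutations. The names `dna_pattern` and `motif_search` are hypothetical.

```python
import re
from itertools import product

# Standard genetic code in T, C, A, G order; "*" marks a stop signal.
BASES = "TCAG"
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODONS_FOR = {}
for codon, aa in zip(product(BASES, repeat=3), AMINO):
    CODONS_FOR.setdefault(aa, []).append("".join(codon))


def dna_pattern(peptide: str) -> str:
    """Build a regex matching any DNA that encodes the peptide.

    The letter "X" is treated as a wildcard position (any codon).
    """
    parts = []
    for aa in peptide:
        if aa == "X":
            parts.append("[ACGT]{3}")
        else:
            parts.append("(?:" + "|".join(CODONS_FOR[aa]) + ")")
    return "".join(parts)


def motif_search(primary: str, secondary: str, gap: int, database: dict) -> list:
    """Return names of genes encoding `primary`, then `gap` arbitrary
    codons, then the `secondary` amino acid."""
    # Steps 2-3: narrow the database to genes containing the primary region.
    primary_re = re.compile(dna_pattern(primary))
    subset = {n: s for n, s in database.items() if primary_re.search(s)}
    # Steps 4-6: within the subset, require the secondary amino acid
    # at the fixed distance from the primary region.
    full_re = re.compile(dna_pattern(primary + "X" * gap + secondary))
    return [n for n, s in subset.items() if full_re.search(s)]


if __name__ == "__main__":
    db = {  # hypothetical sequences, for illustration only
        "geneA": "ATGTGCGAGAAACCCTGG",
        "geneB": "ATGTGTGAAAAACCCGGG",
        "geneC": "ATGAAACCCGGG",
    }
    print(motif_search("CE", "W", 2, db))
```

Filtering on the primary region first keeps the more specific second pattern from being run against sequences that cannot match.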
In a third aspect, the invention comprises a program product stored on a recordable medium for identifying DNA sequences that could code into an inputted amino acid sequence, comprising: (1) means for inputting the sequence of amino acids; (2) means for generating codon permutations that could code into the inputted sequence of amino acids; and (3) means for searching a gene database to identify DNA sequences that match the generated codon permutations.
In a fourth aspect, the invention comprises a computer system for identifying DNA sequences that could code into an inputted sequence of amino acids, comprising: a central processing unit, a memory, a permutation generator for generating possible codon permutations for an inputted sequence of amino acids, and a search interface system for searching a gene database to identify DNA sequences that match the codon permutations.
It is therefore an advantage of the present invention to allow a researcher to start with a limited amount of information regarding a particular protein (i.e., a peptide sequence or amino acid string) and end up with the entire DNA code(s), i.e., gene(s), for the protein, or families of the protein, of interest.
It is therefore a further advantage of the present invention to provide a system to screen DNA databases for genes that could potentially code for proteins containing a pre-specified amino acid sequence (peptide).
It is therefore a further advantage of the present invention to provide a permutation generator that can perform reverse translation of an amino acid sequence into all possible DNA sequences.
It is therefore a further advantage of the present invention to provide a comprehensive system that includes a reverse translation system, permutation generation system, a homology search engine, and a system for screening for in-frame start codons to select valid gene candidates.
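Screening for in-frame start codons can be illustrated with a short sketch. It assumes that "in frame" means a whole number of triplets lies between the ATG and the start of the matched region, and that positions are zero-based offsets into the gene sequence. The function name `in_frame_start_codons` is hypothetical, and the sketch deliberately ignores intervening stop codons and splice junctions.

```python
def in_frame_start_codons(gene: str, match_start: int) -> list:
    """Return offsets of upstream ATG codons in frame with match_start.

    Only positions i with (match_start - i) divisible by 3 are examined,
    so an ATG that is present but shifted by one or two bases is skipped.
    """
    return [
        i
        for i in range(match_start % 3, match_start - 2, 3)
        if gene[i:i + 3] == "ATG"
    ]


if __name__ == "__main__":
    # ATG at offset 3 is in frame with a match starting at offset 6.
    print(in_frame_start_codons("GCAATGTGCGAGAGCGAGTAA", 6))  # [3]
    # ATGs at offsets 1 and 5 are upstream but out of frame.
    print(in_frame_start_codons("CATGAATGCGAG", 6))           # []
```

A gene for which the list is empty would not be selected as a valid candidate on the basis of that match alone.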
BRIEF DESCRIPTION OF THE DRAWINGS
The preferred exemplary embodiment of the present invention will hereinafter be described in conjunction with the appended drawings, where like designations denote like elements, and:
Figure 1 depicts a block diagram depicting a computer system for identifying DNA sequences that could code for a string of amino acids in accordance with a preferred embodiment of the present invention.
Figure 2 depicts a block diagram of the operational flow of the present invention.
Figure 3 depicts a table showing nucleotide triplets and their corresponding amino acids.
It should be noted that the drawings are merely schematic representations, not intended to portray specific parameters of the invention. The drawings are intended to depict only typical embodiments of the invention, and therefore should not be considered as limiting the scope of the invention. In the drawings, like numbering represents like elements between the drawings.
DETAILED DESCRIPTION OF THE DRAWINGS
Overview
As noted above, a feature of this invention is to identify genes that could potentially code into an inputted amino acid sequence or a protein. Genes are made up of DNA sequences, and DNA is made up of combinations of nucleotides. DNA contains the information about how amino acids are put together in a series to form a protein. There are four possible nucleotides: A, T, G, and C. A series of three contiguous nucleotides constitutes a triplet. Codons are nucleotide triplets that code for an amino acid or a stop signal. Figure 3 shows a table depicting the list of nucleotide triplets (i.e., codons) along with the corresponding amino acid. Blanks in the table indicate a "stop signal." A key point is that a specific nucleotide triplet (i.e., codon) corresponds to a specific amino acid. However, there is some redundancy in that several different codons can correspond to the same amino acid. Thus, given an amino acid sequence, there exists a corresponding set of codon sequences, with the number of codon sequences in the set being dependent on the number of possible permutations of corresponding codons. The set of codon sequences, once identified, can be used to identify corresponding genes that were potentially responsible for creating a given amino acid sequence.
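The size of the set of codon sequences follows directly from the redundancy just described: it is the product, over the positions of the amino acid sequence, of the number of codons available at each position. The sketch below assumes the standard genetic code and one-letter amino acid codes; the name `permutation_count` is hypothetical.

```python
from collections import Counter
from math import prod

# One letter per codon of the standard genetic code ("*" is a stop signal).
# Counting how often each letter occurs gives the codons per amino acid.
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODONS_PER_AMINO_ACID = Counter(AMINO)


def permutation_count(peptide: str) -> int:
    """Number of distinct codon sequences that encode the peptide."""
    return prod(CODONS_PER_AMINO_ACID[aa] for aa in peptide)


if __name__ == "__main__":
    print(permutation_count("CESE"))  # 2 * 2 * 6 * 2 = 48
    print(permutation_count("MW"))    # 1 * 1 = 1
    print(permutation_count("LSR"))   # 6 * 6 * 6 = 216
```

Computing the count before generating the sequences gives an inexpensive check on whether exhaustive enumeration is feasible for a given input.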
Computer System & Software
Referring now to Figure 1, a computer system 10 is shown that includes a central processing unit (CPU) 16, an input/output (I/O) system 18, bus 28, and memory 20. Stored in memory 20 is a software program 26 comprising permutation generator 22 and search interface system 24. Memory 20 may comprise any known type of data storage and/or transmission media, including magnetic media, optical media, random access memory (RAM), read only memory (ROM), a data cache, a data object, etc. Moreover, memory 20 may reside at a single physical location, comprising one or more types of data storage, or be distributed across a plurality of physical systems in various forms. CPU 16 may likewise comprise a single processing unit, or be distributed across one or more processing units in one or more locations, e.g., on a client and a server. I/O 18 may comprise any system for exchanging information with external sources. User interface 12 is in communication with computer system 10 via datalink 30. User interface 12 may comprise any known type of device for inputting and receiving information into computer system 10, including a CRT, LED screen, hand-held device, keyboard, mouse, voice recognition system, speech output system, printer, facsimile, pager, etc. Gene database 14 is in communication with computer system 10 via datalink 31. User interface 12 and gene database 14 may be linked to computer system 10 in any known way, including via an internet, intranet, worldwide web, local area network, wide area network, etc. Alternatively, user interface 12 and/or gene database 14, may be integrated into computer system 10. Datalinks 30 and 31, and bus 28 may comprise any known type of transmission link including electrical, optical, wireless, etc. In addition, although not shown, additional components, such as cache memory, communications systems, systems software, etc., may be incorporated into computer system 10.
It is understood that the present invention can be realized in hardware, software, or a
combination of hardware and software. As indicated above, the computer system 10 according
to the present invention can be realized in a centralized fashion in a single computerized workstation, or in a distributed fashion where different elements are spread across several
interconnected computer systems. Any kind of computer system - or other apparatus adapted for
carrying out the methods described herein - is suited. A typical combination of hardware and
software could be a general purpose computer system with a computer program that, when
loaded and executed, controls the computer system 10 such that it carries out the methods described herein. Alternatively, a specific use computer, containing specialized hardware for
carrying out one or more of the functional tasks of the invention could be utilized. The present
invention can also be embedded in a computer program product, which comprises all the features
enabling the implementation of the methods and functions described herein, and which - when loaded in a computer system - is able to carry out these methods and functions. Computer program, software program, program, program product, or software, in the present context mean
any expression, in any language, code or notation, of a set of instructions intended to cause a
system having an information processing capability to perform a particular function either
directly or after either or both of the following: (a) conversion to another language, code or notation; and/or (b) reproduction in a different material form.
Referring now to Figure 2, a box diagram is shown depicting the operation of software
program 26. Permutation generator 22 operates by obtaining an amino acid string or sequence 40. Permutation generator 22 then generates codon sequences 46 that could code to the inputted
amino acid sequence. This operation utilizes an amino acid/codon table 42 and a permutation algorithm, which are described in more detail below. Once the codon sequences 46 are generated, they are inputted into search interface system 24. Search interface system 24 compares each codon sequence to gene sequences in one or more gene databases 14. When a
match is identified, search interface system 24 then attempts to identify the gene containing the
matched sequence. The gene identification process includes identifying a start codon utilizing
start codon identifier 32. Often, start codons are separated from the matched sequence by splice
junctions. To facilitate the process of identifying remotely located start codons, a splice junction
handler 34 is included to identify and handle splice junctions. This process is described in more
detail below. After search interface system 24 identifies the genes containing matches, the genes
48 that code for the inputted amino acid sequence are outputted.
In addition to handling complete amino acid sequences 40, the system can also handle
partial amino acid sequences, known as motifs. Motifs are amino acid sequences that have a
common pattern embedded in nonhomologous sequences. The invention includes a motif
handler 44 and a positional search mechanism 36 to facilitate the process, which are described in more detail below.
Finally, in addition to entering an amino acid sequence 40, a search criteria 50 may also
be inputted into the system. The search criteria 50 can be used to instruct the search interface
system 24 to, among other things, search a specific gene database 14, search a specific genome,
specify restraints on how to search for start codons, specify whether to look for splice junctions or specify how to search for motifs.
Permutation Generator
As noted above, permutation generator 22 generates a set of codon sequences (DNA sequences) that could code for an inputted sequence of amino acids. For example, if a sequence containing the four amino acids: CYS, GLU, SER, GLU was inputted, a set of all possible codon
sequences that could code into the inputted amino acids, each having four codons, would be generated. As shown in the table of Figure 3, there exist two codons, {TGT, TGC} that could code to CYS; two codons {GAA, GAG} that could code to GLU; six codons {TCT, TCC, TCA,
TCG, AGT, AGC} that could code to SER; and two codons {GAA, GAG} that could code to
GLU. The total number of permutations therefore would be 2 x 2 x 6 x 2 = 48, which is the
product of the number of possible codons that correspond to each amino acid in the inputted
sequence. A partial list of the resulting codon sequences that could code for the inputted amino
acid sequence would comprise: (1) TGT, GAA, TCT, GAA; (2) TGT, GAA, TCT, GAG ... (48)
TGC, GAG, AGC, GAG.
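The enumeration described above can be illustrated with a minimal sketch. It assumes only the codon sets quoted in the preceding paragraph; the names `CODONS` and `codon_sequences` are hypothetical and are not part of the disclosed system.

```python
from itertools import product

# Codons per amino acid, copied from the example in the text (Figure 3).
CODONS = {
    "CYS": ["TGT", "TGC"],
    "GLU": ["GAA", "GAG"],
    "SER": ["TCT", "TCC", "TCA", "TCG", "AGT", "AGC"],
}


def codon_sequences(amino_acids):
    """Return every codon sequence that could code for the amino acid list."""
    choices = [CODONS[aa] for aa in amino_acids]
    # product() picks one codon per position; the last position varies fastest.
    return [" ".join(combo) for combo in product(*choices)]


sequences = codon_sequences(["CYS", "GLU", "SER", "GLU"])
print(len(sequences))   # 2 x 2 x 6 x 2 = 48
print(sequences[0])     # TGT GAA TCT GAA
print(sequences[-1])    # TGC GAG AGC GAG
```

The size of the result is the product of the codon counts at each position, so it grows quickly with the length of the inputted sequence.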
Permutation Algorithm
The following is an example of a method of generating permutations of codon sequences
based on an inputted amino acid sequence. This particular method utilizes matrices to manage the permutations. It is understood that the following method is not intended to be limiting, and any algorithm for generating permutations that can be implemented by a person skilled in the art
is believed to be within the scope of the invention.
The first step is to define a Master Codon Table (Table 1 shown below), which contains a
list of all the amino acids and their possible codons, as well as the number of total possible
codons for a given amino acid. The possible codons for the stop signal are also included in this
table. As can be seen, there are 20 possible amino acids and one stop signal (d = 1..21), and each
amino acid may have up to six corresponding codons (q = 1..6).
Table 1
[Image: Master Codon Table]
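Because Table 1 survives only as an image, the following sketch rebuilds an equivalent table from the standard genetic code. It is an illustration under stated assumptions: one-letter amino acid symbols are used, "*" stands for the stop signal, and the names `BASES`, `AMINO`, and `build_master_codon_table` are hypothetical. The d and q index assignments of the original table are not reproduced.

```python
BASES = "TCAG"
# Standard genetic code, codons ordered TTT, TTC, TTA, TTG, TCT, ... GGG.
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"


def build_master_codon_table():
    """Map each amino acid symbol (or '*' for stop) to its list of codons."""
    table = {}
    for i, symbol in enumerate(AMINO):
        codon = BASES[i // 16] + BASES[(i // 4) % 4] + BASES[i % 4]
        table.setdefault(symbol, []).append(codon)
    return table


master = build_master_codon_table()
for symbol, codons in master.items():
    # Each row: symbol, number of possible codons, the codons themselves.
    print(symbol, len(codons), " ".join(codons))
```

The table has 21 rows (20 amino acids plus the stop signal), and no row holds more than six codons, which matches the ranges d = 1..21 and q = 1..6 given in the text.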
The next step is to define various matrices and tables that are utilized in the process. Each
matrix uses addresses which comprise three components: row and column coordinates and a
symbol which indicates the target matrix to which these coordinates are applied. An address locates a specific position within the designated matrix.
A value is the object and all the properties attached to that object in a specific location of
a matrix where the specific location is uniquely defined by an address. A value in a specific
location designated by an address can itself be an address in another matrix.
A reference matrix [Xaj] is then defined, where a = 1 to γ and j = 1 to e. The reference matrix
address (column and row coordinates) assigns location within the matrix. The values in the
reference matrix constitute a "codon table" and are addresses to specific locations in the master
codon table.
[Image: reference matrix [Xaj]]
Next, an Input Matrix [Ij] for holding the inputted amino acid is defined, where j = 1 to e
(j represents the position of a given amino acid in the amino acid sequence that is input). Each
amino acid input has a corresponding d value in the master codon table (the d value should be
placed under the corresponding amino acid by a link to the master codon table). Amino acids from the input table are linked to their location in the master codon table:
[Image: input matrix [Ij]]
Next, a permutation generator (PG) address matrix [Yαβ] is defined, where α = 1 to e, and β = 1 to
g (the PG address assigns location within this matrix). The values in the PG matrix are addresses to specific positions in the reference matrix.
[Image: PG address matrix [Yαβ]]
Next, various operations are defined. Operation mqda → mqdv shows the value in
position mqd of the master codon table. Operation [Xaj] → [Yαβ] transposes address locations
in the reference matrix to the PG matrix so that the address in the PG matrix is the transposition
of the address in the reference matrix (i.e., the codon table). Operation μ(2| Y(n)β*Y(n+1)β)
generates all permutations of two elements, one element from column α=n, one element from
column α=n+1, for all elements 1 to β of each column. Cμ2(Ynβ|Y(n+1)β) is the superset of the
sets that contain all possible 2 element permutations of Ynβ and Y(n+1)β.
For example, consider the following PG matrix:
[Image: example PG matrix]
In this example, the numbers shown in the PG grid are simple integers for purposes of describing
the concept of the permutation generator; in the actual implementation, these integers would be
replaced by the reference matrix address, i.e., the 1 in position Y11 would be x11, the 2 in position Y12 would be x12, the 2 in position Y22 would be x22. x11 would be the location in
the reference matrix. The value in the x11 position of the reference matrix would in turn be a location in the master codon table. x12 would not necessarily equal x22. In this example X12
could not equal X22 because the βmax in the two respective Y columns is different indicating
that the two columns represent codons for different amino acids. The location in the master
codon table would contain the codon.
In this example, Cμ2(Y1β|Y2β) = {(1,1), (1,2), (2,1), (1,3), (3,1), (1,4), (2,2), (2,3),
(2,4), (3,2), (3,3), (3,4)}. For each element in Cμ2(Y1β|Y2β), Cμ2(Y3β|Y4β) can be generated
as follows:
(1,1) → Cμ2(Y3β|Y4β)
(1,2) → Cμ2(Y3β|Y4β)
(2,1) → Cμ2(Y3β|Y4β)
...
(3,4) → Cμ2(Y3β|Y4β).
The operation Cμ(4: Cμ2(Y1β|Y2β) || Cμ2(Y3β|Y4β)) defines the set of four element
subsets which represent all possible permutations of elements in the sets Cμ2(Y1β|Y2β) and
Cμ2(Y3β|Y4β), i.e., (1,1) combined with (1,1) → (1,1,1,1); (1,1) combined with (1,2) → (1,1,1,2);
etc. For each element of Cμ(4: Cμ2(Y1β|Y2β) || Cμ2(Y3β|Y4β)), Cμ2(Y5β|Y6β) is generated, i.e.,
(1,1,1,1) → Cμ2(Y5β|Y6β); (1,1,1,2) → Cμ2(Y5β|Y6β); etc. This results in the set Cμ(6:
Cμ2(Y1β|Y2β) || Cμ2(Y3β|Y4β) || Cμ2(Y5β|Y6β)), which represents all of the possible permutations
of the elements shown in the above example.
A general operation may be defined as: Cμp = Cμ(p| Cμ2(Y(n)β|Y(n+1)β) ||
Cμ2(Y(n+2)β|Y(n+3)β) || ... || Cμ2(Y(n+p-2)β|Y(n+p-1)β)), which generates the sets containing
p elements that represent all possible permutations of elements in the sets contained as elements
in the groups Cμ2(Y(n)β|Y(n+1)β), Cμ2(Y(n+2)β|Y(n+3)β), through Cμ2(Y(n+p-2)β|Y(n+p-1)β).
Defined rules for the operation would include:
1. Always start at n=1.
2. For a given set, if any value of any element = 0, then delete the set containing the element of
value = 0.
3. Perform iterations of μ until n+x+2 = j, where x = integer.
4. If n+x+2 > j (i.e., n+x+1 = j), for each element of Cμ(n+x) generate Cμ1(Y(n+x+1)β), where Cμ1
is the set of elements in Y(n+x+1)β.
Thus, a given permutation of two elements, one from Ynβ and one from Y(n+1)β,
constitutes an element in the total set of permutations Cμ2(Ynβ|Y(n+1)β).
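The pairwise operation and its rules can be sketched as follows. This is one possible reading of the method, not a definitive implementation: the function names are hypothetical, plain integers stand in for reference matrix addresses, and a value of 0 is assumed to be padding in columns that hold fewer codons than the tallest column.

```python
def pair_permutations(col_a, col_b):
    """All two-element permutations, one element from each column.

    Pairs containing a 0 are dropped (rule 2); 0 is treated here as
    padding, which is an assumption about how short columns are filled.
    """
    return [(a, b) for a in col_a for b in col_b if a != 0 and b != 0]


def generate_permutations(columns):
    """Combine columns two at a time, extending every earlier result."""
    results = [()]                      # start with one empty prefix
    n = 0
    while n + 1 < len(columns):         # rule 3: consume columns in pairs
        pairs = pair_permutations(columns[n], columns[n + 1])
        results = [prefix + pair for prefix in results for pair in pairs]
        n += 2
    if n < len(columns):                # rule 4: one unpaired column remains
        tail = [v for v in columns[n] if v != 0]
        results = [prefix + (v,) for prefix in results for v in tail]
    return results


# Hypothetical PG matrix columns with 3, 4, and 2 usable entries.
columns = [[1, 2, 3, 0], [1, 2, 3, 4], [1, 2, 0, 0]]
print(generate_permutations(columns[:2]))   # the 12 pairs of columns 1 and 2
print(len(generate_permutations(columns)))  # 3 * 4 * 2 = 24
```

With the first two columns alone, the sketch yields the same twelve pairs listed for Cμ2(Y1β|Y2β) in the example above, although in a different order.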
Example:
The following example illustrates the permutation generator program. The input
comprises an amino acid sequence containing six amino acids, which is stored in the input
matrix below.
[Image: input matrix holding the six inputted amino acids]
Next, a codon table in the reference matrix is created from the input data.
[Image: reference matrix for the inputted sequence]
The program then transposes the data in the reference matrix to the permutation generator
matrix:
[Image: permutation generator matrix for the inputted sequence]
The program then generates all possible permutations of the codons that could yield the input
amino acid sequence by performing iterations of μ(2| Y(n)β*Y(n+1)β):
Cμ2(Y1β|Y2β) = {(X11,X12), (X11,X22), (X21,X12), (X11,X32), (X31,X12), (X11,X42),
(X21,X22), (X21,X32), (X21,X42), (X31,X22), (X31,X32), (X31,X42)}
For each element in Cμ2(Y1β|Y2β) generate Cμ2(Y3β|Y4β):
(X11,X12) → Cμ2(Y3β|Y4β)
(X11,X22) → Cμ2(Y3β|Y4β)
(X21,X12) → Cμ2(Y3β|Y4β), etc.
After all iterations are complete, there is a list of Xaj addresses, with the values in these
Xaj locations being addresses to locations in the Mqd matrix, i.e.:
M13 M15 M17 M120 M13 M18
M23 M25 M27 M220 M23 M28, etc.
The program then calls the values in the Mqd locations and displays them as codon strings:
ATT GTT CCT GGT ATT ACT
ATC GTC CCC GGC ATC ACC, etc.
Each codon string represents one of the possible codon permutations that could code for the input amino acid sequence.
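As a cross-check on the example, the sketch below counts and lazily generates the codon strings for the six amino acids implied by the codon strings shown (ILE VAL PRO GLY ILE THR, written as "IVPGIT" in one-letter code). It assumes the standard genetic code and uses a generic Cartesian product rather than the matrix method of the text; all names are hypothetical.

```python
from itertools import product
from math import prod

BASES = "TCAG"
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"

# One-letter amino acid symbol -> list of codons (standard genetic code).
CODONS = {}
for i, aa in enumerate(AMINO):
    CODONS.setdefault(aa, []).append(
        BASES[i // 16] + BASES[(i // 4) % 4] + BASES[i % 4]
    )


def count_codon_strings(peptide):
    """Number of codon strings: product of the codon counts per position."""
    return prod(len(CODONS[aa]) for aa in peptide)


def iter_codon_strings(peptide):
    """Yield codon strings one at a time instead of building the full list."""
    for combo in product(*(CODONS[aa] for aa in peptide)):
        yield " ".join(combo)


peptide = "IVPGIT"
print(count_codon_strings(peptide))       # 3 * 4 * 4 * 4 * 3 * 4 = 2304
print(next(iter_codon_strings(peptide)))  # ATT GTT CCT GGT ATT ACT
```

Generating the strings lazily keeps memory use flat, which matters because the count multiplies with every added amino acid.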
Gene Searching
Referring again to Figure 2, once the codon sequences 46 are generated, they can be
compared to sequences in gene database 14, such as GenBank. Programs that perform this task, referred to as homology search programs 33, may include known programs such as BLAST™
and FASTA™, or can comprise customized systems to handle multiple sequence searching.
Homology search programs 33 will identify high homology sequences, which comprise
sequences that match the inputted sequence.
After high homology sequences are found, other regions of the gene must be examined to
identify a segment homologous to the inputted permutation. Specifically, start codon identifier
32 may be utilized to identify a start codon for the high homology sequence. A start codon is an
ATG codon upstream (prior to) and "in frame" with the high homology sequence. (A start codon is in frame if there exists a whole number of triplets between the start codon and the high
homology sequence.) Thus, start codon identifier 32 examines upstream sequences to find a start
codon (ATG) that would result in a reading frame that could result in expression of the
permutation sequence as a protein. For example, assume the codon sequence ATT GTT CCT
GGT ATT ACT, which resulted from the example above, was inputted into a search engine and matched a gene having a partial sequence:
GCA ATG CCCG ATT GTT CCT GGT ATT ACT
The inputted permutation sequence (ATT GTT CCT GGT ATT ACT) is present, and there is an upstream ATG sequence. However, the upstream ATG start codon is not "in frame" to result in an expression of the amino acid sequence of interest. (There is not a whole number of
triplets between the start codon and the inputted sequence; four nucleotides, CCCG, separate them.) The ATG open reading frame in this example is:
ATG CCC GAT TGT TCC TGG TAT TAC, which would translate as: MET PRO ASP CYS SER TRP TYR TYR. However, the inputted amino acid sequence was:
ILE VAL PRO GLY ILE THR. Thus, start codon identifier 32 must look further upstream to
identify a start codon that is in frame. A program that can find open reading frames in a gene can
be readily implemented by one skilled in the art.
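The in-frame test described above can be sketched as follows, using the partial gene sequence from the example. This is a simplified illustration: the function names are hypothetical, one-letter amino acid symbols are used, and the sketch does not check for stop codons or introns between the ATG and the matched region.

```python
BASES = "TCAG"
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"


def translate(dna):
    """Translate whole triplets of dna into one-letter amino acid symbols."""
    out = []
    for i in range(0, len(dna) - 2, 3):
        b1, b2, b3 = dna[i], dna[i + 1], dna[i + 2]
        index = 16 * BASES.index(b1) + 4 * BASES.index(b2) + BASES.index(b3)
        out.append(AMINO[index])
    return "".join(out)


def upstream_start_codons(gene, match):
    """List (position, in_frame) for every ATG upstream of the match."""
    match_pos = gene.find(match)
    hits = []
    atg = gene.find("ATG")
    while 0 <= atg < match_pos:
        # In frame means a whole number of triplets separates ATG and match.
        hits.append((atg, (match_pos - atg) % 3 == 0))
        atg = gene.find("ATG", atg + 1)
    return hits


gene = "GCAATGCCCGATTGTTCCTGGTATTACT"
match = "ATTGTTCCTGGTATTACT"
print(upstream_start_codons(gene, match))  # [(3, False)]
print(translate(gene[3:]))                 # MPDCSWYY
print(translate(match))                    # IVPGIT
```

The single upstream ATG is reported as out of frame, and translating from it gives MET PRO ASP CYS SER TRP TYR TYR rather than the inputted sequence, in agreement with the example.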
An additional problem that may arise when searching for a start codon relates to non-
coding sequences that may exist in the gene database. These non-coding sequences must be
effectively removed from the search, which is done by splice junction handler 34. Non-coding
sequences arise as a consequence of the distinction between genomic DNA and cDNA. Genomic
DNA is the "blueprint" from which mRNA is transcribed. mRNA is transcribed from genomic
DNA but may contain a sequence that does not code for a protein. These non-coding sequences
are called intervening sequences or "introns." Before protein is made from an mRNA, the introns are spliced out. The remaining sequences are called "exons" and are linked
together to form the sequence from which the protein is translated. The term cDNA refers to the
DNA sequence that would be obtained if one converted mRNA after removal of its introns
directly into DNA. cDNA is a DNA copy of mRNA that has been processed and is ready for
translation. Thus, if genomic DNA is searched, one may find that a start codon is separated from the sequence found to be homologous to the permutation codon string by one or more introns.
The intron which separates the start codon from the high homology region could be several thousand base pairs (nucleotides) in length. Accordingly, splice junction handler 34 is provided
to identify remote start codons. The boundary between an intron and an exon is called a splice
junction. Because splice junctions tend to have particular sequences, it is straightforward for splice junction handler 34 to identify splice junctions in a genomic DNA sequence.
Motifs
A further feature of the invention is to provide a motif handler 44 that allows motifs to be
searched. Motifs are amino acid sequences that have a common pattern embedded in
nonhomologous sequences, i.e., motifs include commonly recurring amino acids in fixed
positional relationships. An example is a motif called the SH2 domain. This domain is a region
in some proteins in which certain amino acids recur in the same position. A partial
representation of an SH2 domain is shown below:
GNxxGxFL(V/I)RESExxxGxxSLSxx-xxxxxGDxxKHxK
where G=GLY, N=ASN, F=PHE, L=LEU, V=VAL, I=ILE, R=ARG, E=GLU, S=SER, D=ASP,
K=LYS, H=HIS, x=[any amino acid], - = [a gap in the sequence] (i.e., an amino acid may or may
not be present in this position), (V/I) = either VAL or ILE. In this case, permutations must be
done for the specified amino acids in the motif, and the permutations for the "x" amino acids must be permissive for any codon that codes for an amino acid. Motif handler 44 may utilize,
among others, the following two systems for dealing with motifs.
A first option is to perform permutations only on the amino acids that recur in the motif.
Positional search mechanism 36 is then utilized during the search to ensure that particular
elements in a given permutation are set at a proper distance apart from each other when sequence matching to the database 14 is performed.
A second option is to use motif handler 44 to first locate a primary region (e.g., a region
of highest contiguous recurrence) within the motif. For instance, in the example shown above, a
primary region could be "RESE." Next, codon permutations of this region would be generated to
create a subset of genes within the larger database that contain the permutation sequence. Next, motif handler 44 would identify a second search parameter and note its positional relationship to
the primary region, e.g., G, which is three amino acids downstream from RESE. Then, all the genes
that contain a codon for the second search parameter in the noted positional relationship would be identified in the created subset. Again, positional search mechanism 36 maintains the codon permutations in the same fixed positional relationship as the commonly recurring amino acids in
the motif, thereby facilitating positional searching of codons.
In the above example, search interface system 24 would search for high homology
sequences that have GGT, GGC, GGA, or GGG (i.e., one of the codons for GLY) nine
nucleotides downstream from the last codon for the RESE component of the motif. Thus, one sequence that would be searched would be CGT GAA TCT GAA XXX XXX XXX GGT, which
would be searched in the subset of all genes with the amino acid sequence RESE. (CGT is a
codon for R, GAA is a codon for E, and TCT is a codon for S.) The above DNA sequence
corresponds to the protein motif sequence RESExxxG. Additional iterations of this procedure
could be performed to identify genes which contain the entire motif.
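One way to hold codon permutations in a fixed positional relationship is to compile the motif fragment into a single pattern. The sketch below does this with a regular expression for RESExxxG. It is an assumption-laden illustration, not the disclosed positional search mechanism: the names are hypothetical, the standard genetic code is assumed, and "x" is expanded to any codon that codes for an amino acid (stop codons excluded).

```python
import re

BASES = "TCAG"
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"

CODONS = {}
for i, aa in enumerate(AMINO):
    CODONS.setdefault(aa, []).append(
        BASES[i // 16] + BASES[(i // 4) % 4] + BASES[i % 4]
    )

# Every codon that codes for an amino acid, i.e. all codons except stops.
SENSE_CODONS = [c for aa, cs in CODONS.items() if aa != "*" for c in cs]


def motif_to_regex(motif):
    """Compile a gap-free motif such as 'RESExxxG' into a DNA pattern."""
    parts = []
    for symbol in motif:
        choices = SENSE_CODONS if symbol == "x" else CODONS[symbol]
        parts.append("(?:" + "|".join(choices) + ")")
    return re.compile("".join(parts))


pattern = motif_to_regex("RESExxxG")
dna = "AAACGTGAATCTGAAGCTGCTGCTGGTTT"   # made-up test sequence
hit = pattern.search(dna)
print(hit.start(), hit.group())
```

Because the pattern is tried at every offset, all three forward reading frames are covered in one pass; gapped positions ("-") and alternatives such as (V/I) would need additional handling.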
The foregoing description of the preferred embodiments of the invention has been
presented for purposes of illustration and description. They are not intended to be exhaustive or
to limit the invention to the precise form disclosed, and obviously many modifications and
variations are possible in light of the above teachings. Such modifications and variations that
are apparent to a person skilled in the art are intended to be included within the scope of this invention as defined by the accompanying claims.

Claims

I claim:
1. A computerized method for identifying DNA sequences that could code for a string of amino
acids, comprising the steps of: inputting the string of amino acids;
generating all possible codon permutations that could code into the inputted string of
amino acids; and
examining a gene database to identify DNA sequences that match the codon
permutations.
2. The method of claim 1, comprising the further step of:
searching the gene database for a start codon that is in frame with at least one of the
identified DNA sequences.
3. The method of claim 2, wherein the step of searching the gene database for a start codon
includes the step of identifying splice junctions.
4. The method of claim 3, wherein the step of searching the gene database for a start codon
includes the further step of identifying a remote start codon.
5. The method of claim 1, comprising the further step of: for each identified DNA sequence, searching the gene database for an associated start
codon.
6. The method of claim 1, wherein the inputted string of amino acids includes a motif that has
commonly recurring amino acids in a fixed positional relationship.
7. The method of claim 6, wherein the step of generating all possible codon permutations
generates codon permutations only for the commonly recurring amino acids.
8. The method of claim 7, wherein the step of examining the gene database includes the step of maintaining the codon permutations in the same fixed positional relationship as the commonly
recurring amino acids.
9. A computerized method for identifying DNA sequences that could code for an inputted motif, comprising the steps of:
identifying within the motif a primary region, wherein the primary region includes a
string of amino acids having the highest contiguous recurrence in the motif;
searching a gene database to identify all codon permutations that could translate into the
primary region; selecting a subset of the gene database, wherein the subset contains genes that could
translate into the primary region;
selecting within the motif a secondary amino acid apart from the primary region; determining a positional relationship of the secondary amino acid with respect to the
primary region; and
searching the subset of the gene database for DNA sequences that include codon
permutations that could code into both the primary region and the secondary amino acid, at the determined positional relationship.
10. A program product stored on a recordable medium, that when executed by a computer
system, comprises: means for inputting a string of amino acids;
means for generating all possible codon permutations that could code into the inputted
string of amino acids; and
means for examining a gene database to identify DNA sequences that match the codon
permutations.
11. The program product of claim 10, further comprising means for searching the gene database
for a start codon that is in frame with at least one of the identified DNA sequences.
12. The program product of claim 11, wherein the means for searching the gene database for a
start codon includes means for identifying a splice junction.
13. The program product of claim 12, wherein the means for searching the gene database for a
start codon includes means for identifying a remote start codon.
14. The program product of claim 10, wherein the inputted string of amino acids includes a
motif having commonly recurring amino acids in a fixed positional relationship.
15. The program product of claim 14, wherein the means for generating all possible codon
permutations generates codon permutations only for the commonly recurring amino acids, and
wherein the means for examining the gene database includes means for maintaining the codon permutations in the same fixed positional relationship as the commonly recurring amino acids.
16. A computer system for identifying DNA sequences that could code for a sequence of amino
acids, comprising:
a central processing unit; a computer system memory;
a permutation generator, wherein the permutation generator generates permutations of
codon sequences in response to an inputted amino acid sequence; and a search interface system, wherein the search interface system identifies DNA sequences
in a database that matches the codon sequences.
17. The computer system of claim 16, wherein the permutation generator includes a motif
handler.
18. The computer system of claim 17, wherein the search interface system includes a positional search mechanism.
19. The computer system of claim 16, wherein the search interface system includes a homology
search engine.
20. The computer system of claim 16, wherein the search interface system includes a start codon identifier.
21. The computer system of claim 16, wherein the search interface system includes a splice junction handler.
PCT/US2001/040930 2000-06-13 2001-06-12 A system and method for identifying dna sequences that could code into a string of amino acids WO2001096860A1 (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
AU2001267070A AU2001267070A1 (en) 2000-06-13 2001-06-12 A system and method for identifying dna sequences that could code into a string of amino acids

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
US59372500A 2000-06-13 2000-06-13
US09/593,725 2000-06-13

Publications (1)

Publication Number Publication Date
WO2001096860A1 true WO2001096860A1 (en) 2001-12-20

Family

ID=24375887

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/US2001/040930 WO2001096860A1 (en) 2000-06-13 2001-06-12 A system and method for identifying dna sequences that could code into a string of amino acids

Country Status (2)

Country Link
AU (1) AU2001267070A1 (en)
WO (1) WO2001096860A1 (en)

Also Published As

Publication number Publication date
AU2001267070A1 (en) 2001-12-24

Legal Events

Date Code Title Description
AK Designated states

Kind code of ref document: A1

Designated state(s): AE AG AL AM AT AU AZ BA BB BG BR BY BZ CA CH CN CO CR CU CZ DE DK DM DZ EC EE ES FI GB GD GE GH GM HR HU ID IL IN IS JP KE KG KP KR KZ LC LK LR LS LT LU LV MA MD MG MK MN MW MX MZ NO NZ PL PT RO RU SD SE SG SI SK SL TJ TM TR TT TZ UA UG UZ VN YU ZA ZW

AL Designated countries for regional patents

Kind code of ref document: A1

Designated state(s): GH GM KE LS MW MZ SD SL SZ TZ UG ZW AM AZ BY KG KZ MD RU TJ TM AT BE CH CY DE DK ES FI FR GB GR IE IT LU MC NL PT SE TR BF BJ CF CG CI CM GA GN GW ML MR NE SN TD TG

121 Ep: the epo has been informed by wipo that ep was designated in this application
REG Reference to national code

Ref country code: DE

Ref legal event code: 8642

122 Ep: pct application non-entry in european phase
NENP Non-entry into the national phase

Ref country code: JP