WO2008000090A1 - DNA barcode sequence classification

Info

Publication number
WO2008000090A1
Authority
WO
WIPO (PCT)
Prior art keywords
dna
barcode
dna barcode
query
barcode sequence
Application number
PCT/CA2007/001170
Other languages
French (fr)
Inventor
Mehrdad Hajibabaei
Paul Hebert
Donal Hickey
Original Assignee
University Of Guelph
Application filed by University Of Guelph
Publication of WO2008000090A1

Classifications

    • G: PHYSICS
    • G16: INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR SPECIFIC APPLICATION FIELDS
    • G16B: BIOINFORMATICS, i.e. INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR GENETIC OR PROTEIN-RELATED DATA PROCESSING IN COMPUTATIONAL MOLECULAR BIOLOGY
    • G16B50/00: ICT programming tools or database systems specially adapted for bioinformatics
    • G16B50/30: Data warehousing; Computing architectures

Definitions

  • the present invention relates generally to deoxyribonucleic acid (DNA) barcodes and specifically to a system and method for effectively indexing and searching a DNA barcode library for classification of a DNA barcode query.
  • DNA: deoxyribonucleic acid
  • a DNA barcode is a relatively short sequence of genomic DNA that can be used to identify biological species.
  • the idea of using DNA barcodes (also referred to as barcoding) for species identification was first introduced in early 2003 in "Biological identifications through DNA barcodes" by Hebert, P. D. N., A. Cywinska, S. L. Ball, and J. R. deWaard, Proceedings of the Royal Society of London B Biological Sciences 270:313-321, 2003.
  • the barcode sequence currently used for animal species is a 650 base pair (bp) fragment from a mitochondrial gene, referred to as cytochrome c oxidase I (COI, cox1).
  • COI: cytochrome c oxidase I
  • This barcode sequence has shown potential in identification of protist and fungal species. Since the COI gene shows a reduced level of sequence diversity among plant species, other genomic fragments are being tested for barcoding in plants. In the future, it is likely that the assignment of a specimen to a particular species will involve the use of more than a single barcode sequence.
  • the use of DNA barcodes for species identification is dependent on the availability of comprehensive barcode libraries from known species. These libraries are now being constructed as part of the large-scale DNA barcoding projects that are currently underway.
  • NJ: neighbor-joining
  • the following describes a method that is used for the analysis of DNA barcodes rather than the generation of those sequences. Specifically, a new method is described for comparing the DNA sequence from a biological specimen to a library of reference DNA barcode sequences. As a result of this comparison, the specimen can quickly be assigned to a biological species.
  • a method for indexing DNA barcode sequences for a barcode database comprising the steps of: receiving a DNA barcode sequence; cleaning the received DNA barcode sequence by removing all characters except those characters uniquely identifying one of the four nucleotide subunits of a DNA strand; segmenting the cleaned DNA barcode sequence into a number of words, each word having a predefined number of nucleotides, the segmented DNA barcode representing a barcode index; and associating the barcode index with its species name for storage in the barcode database.
  • a method of searching for a DNA barcode sequence against the barcode database described above comprising the steps of: receiving a DNA barcode sequence query; cleaning the received DNA barcode sequence query by removing all characters except those characters uniquely identifying one of the four nucleotide subunits of a DNA strand; segmenting the cleaned DNA barcode sequence query into a number of words, each word having a predefined number of nucleotides; creating a frame set comprising a plurality of query index frames, each query index frame representing a shifted version of the cleaned DNA barcode sequence query; and searching the barcode database using a predefined search algorithm for finding a barcode index that best matches at least one query index frame in the frame set.
  • a computer readable medium comprising instructions for executing one or both of the methods described above.
  • Figure 1 is a block diagram illustrating a computing device
  • Figure 2 is a flow chart illustrating adding DNA barcode sequences to a barcode database
  • Figure 3 is a flow chart illustrating searching the barcode database
  • Figure 4 is a sample distributed architecture for accessing the barcode database
  • Figure 5 is a graph illustrating the accuracy of the search in relation to word size.
  • Figure 6 is a graph illustrating the ability to uniquely identify species in relation to barcode size.
  • the computing device 102 can be any type of computing device, including a desktop computer, notebook, or a mobile device such as a data messaging device, a two-way pager, a smart-phone or a personal digital assistant.
  • the computing device 102 includes a communication subsystem 111, which includes a receiver 212, a transmitter 214, and associated components, such as one or more embedded or internal antenna elements 216 and 218, local oscillators (LOs) 213, and a processing module such as a digital signal processor (DSP) 220.
  • DSP: digital signal processor
  • the computing device 102 includes a microprocessor 138 which controls general operation of the computing device 102.
  • the microprocessor 138 also interacts with additional device subsystems such as a display 122, a hard disc drive 124, a random access memory (RAM) 126, auxiliary input/output (I/O) subsystems 128, a serial port 130, a keyboard 132, a speaker 134, a microphone 136, a short-range communications subsystem 140 such as Bluetooth™ for example, and any other device subsystems or peripheral devices generally designated at 142.
  • Operating system software used by the microprocessor 138 is preferably stored in a persistent store such as the hard disc drive 124, which may alternatively be a read-only memory (ROM) or similar storage element such as flash memory (not shown).
  • ROM: read-only memory
  • the display 122 is used to visually present an application's graphical user interface (GUI) to the user.
  • GUI: graphical user interface
  • the user can manipulate application data by modifying information on the GUI using an input device such as the keyboard 132 for example.
  • the user may have access to other types of input devices, such as, for example, a scroll wheel, trackball, light pen or touch sensitive screen.
  • the microprocessor 138 in addition to its operating system functions, preferably enables execution of software applications on the computing device 102.
  • a predetermined set of applications, which control basic device operations, is installed on the computing device 102 during its manufacture. These basic operations typically include data and voice communication applications, for example. Additionally, applications may also be loaded onto the computing device 102 through a network, an auxiliary I/O subsystem 128, serial port 130, short-range communications subsystem 140, or any other suitable subsystem 142, and installed by a user in RAM 126, or preferably the persistent store 124, for execution by the microprocessor 138.
  • Such flexibility in application installation increases the functionality of the computing device 102 and may provide enhanced on-device features, communication-related features, or both.
  • data can be stored in the persistent store 124 for access by software applications executing on the computing device 102.
  • the persistent store can be included locally on the computing device 102 or provided remotely. Any number of types of data can be stored in the persistent store, including, for example, DNA barcode sequences.
  • DNA barcode sequence searching software can be provided for the computing device 102.
  • the DNA barcode sequence searching software provides a method of DNA barcode sequence analysis that is analogous to non-molecular methods of taxonomic analysis.
  • the method can be implemented as software using a number of programming techniques for execution on the computing device 102.
  • the DNA barcode sequence searching software converts a DNA barcode sequence into a series of "characters" that can be used to create keys for species identification. In practice, removing the requirement to assess evolutionary relatedness, which is not the primary goal of DNA barcoding, allows for an efficient method of taxonomic assignment of an unknown specimen to a species.
  • a DNA barcode sequence is subdivided into predefined units (also referred to as words). Each word includes a given number of nucleotides and is treated as a taxonomic character. In this manner, each DNA barcode sequence is transformed into a set of unique words. This set of words is then compared to a library of known DNA barcode sequences that have, themselves, been subdivided in a similar way. Since the DNA barcode sequence of the query and the library of DNA barcode sequences have been separated into relatively short words, many, if not all, existing word search algorithms can be manipulated to perform these searches.
  • the benefit of this method lies in its simplicity and its flexibility. Because of its simplicity, it can exploit many existing text-searching strategies, known or proprietary. Because of its flexibility, it can be easily customized for efficiency. Specifically, the flexibility lies in the fact that (i) the word size is variable; and (ii) the DNA barcode sequence is predefined.
  • a developer implementing the search tool can determine the word size into which the DNA barcode is divided.
  • the COI DNA coding sequence is approximately 1,500 nucleotides long.
  • individual units of a DNA sequence are referred to as nucleotides.
  • a tri-nucleotide sequence is referred to as a codon. Therefore, the 1,500 nucleotide-long DNA sequence can be divided into 500 words, where each word is a codon.
  • the 1,500 nucleotide-long DNA sequence can be treated as a single 1,500-nucleotide word. It will be apparent that other word sizes between these extremes can also be selected.
  • Longer word sizes are faster to search.
  • each DNA barcode sequence is divided into approximately 40 words, each word comprising 5 codons (15 nucleotides). This particular word-size yields high accuracy while allowing the identification of DNA barcode sequences that are related, but not necessarily identical, to the DNA barcode sequence query.
  • Such partial matching allows recognition of a new species or sub-species that may not have DNA barcode sequences listed in the existing library of sequences.
  • DNA barcodes are predefined. This feature allows matching a variety of query DNA barcode sequences that are derived from different areas of the gene. This is useful in DNA barcoding because the availability of sequence primers may constrain the query DNA barcode sequences to come from non-identical gene regions. That is, for example, one query may match words four-through-eighteen while another may match words two-through-twenty-three. This becomes even more important for the analysis of degraded samples where we have to work with several "mini" DNA barcodes.
  • the first component relates to entering DNA barcode sequences in a barcode database.
  • the second component relates to querying the DNA barcode database for a particular DNA barcode sequence.
  • the barcode database can be a custom-built hash table, or can be one of any number of existing database types and search engines provided by applications such as Google Desktop Search, Google Enterprise Solutions (Google Mini or Google Appliance), Apple Spotlight, and Microsoft's Indexing Service. This list merely illustrates some of the possible implementations. Other implementations will be apparent to a person of ordinary skill in the art.
  • a flow chart illustrating the method for entering a DNA barcode sequence into the barcode database is illustrated generally by numeral 200.
  • a DNA barcode sequence is provided in one of a plurality of sequence formats.
  • the sequence format can be any of the standard sequence formats, such as FASTA, PHYLIP, or MEGA, or a proprietary format.
  • In step 204, the DNA barcode sequence is read from the provided source.
  • In step 206, the DNA barcode sequence is converted to upper-case characters and "cleaned" to remove alignment characters or flanking ambiguous characters (poly-N's). For example, the DNA sequence NNNagcGCG--cgGATNNN would be converted to AGCGCGCGGAT.
  • In step 208, the converted sequence is broken up into words comprising a predefined number of characters, starting at the first nucleotide. For example, consider a converted sequence GTATCGGTAACGAACTT and a word size of five (5). The resulting division is GTATC GGTAA CGAAC TT. Note that the word size of five is selected for ease of explanation only.
  • In step 210, it is determined whether or not the final word is incomplete. If the word is complete, the method continues to step 214. If the word is incomplete, the method continues to step 212. In step 212, the final, incomplete word is deleted and the method continues to step 214. Therefore, continuing the previous example, the incomplete word comprising the final TT nucleotides is deleted and the converted DNA barcode sequence is modified to GTATC GGTAA CGAAC.
  • the converted DNA barcode sequence is stored in the barcode database as a barcode index along with its associated relevant information.
  • the relevant information includes a name of the species associated with the DNA barcode sequence.
  • the relevant information may also include the complete DNA barcode sequence, a source identifier and/or an information link.
  • the source identifier identifies the source of the DNA barcode sequence.
  • the source can be, for example, the person, organization or institution that added the DNA barcode to the barcode library.
  • the information link identifies a source for more information relating to the DNA barcode sequence, such as the entire DNA sequence and/or other genetic information. If the method is implemented over a computer network, the information link can be a hyperlink to a Web page.
  • the converted, modified DNA barcode sequence GTATC GGTAA CGAAC is stored in the barcode database.
  • a species name, Species Example1, is associated with the DNA barcode sequence, as well as source identifier 12345 that identifies which organization provided the DNA barcode sequence to the barcode library.
  • hyperlink http://www.dnabarcodeexample.com (fictional hyperlink for example only) provides a link to a web page that includes more information regarding the DNA barcode sequence and its associated species.
  • a flow chart illustrating the method for searching a DNA barcode sequence in the barcode database is illustrated generally by numeral 300.
  • In step 302, the user submits a DNA barcode sequence from an unknown specimen.
  • In step 304, the DNA barcode sequence query is converted to upper-case characters and "cleaned" to remove alignment characters or flanking ambiguous characters (poly-N's). Accordingly, the only characters that remain are A, C, G or T, representing the four nucleotide subunits of a DNA strand. For example, the DNA sequence NNNN??gtatcg--GTAACGAACTT would be converted to GTATCGGTAACGAACTT.
  • In step 306, the converted sequence is broken up into words comprising a predefined number of characters, starting at the first nucleotide. For example, consider a converted sequence GTATCGGTAACGAACTT and a word size of five (5). The resulting word set is GTATC GGTAA CGAAC TT.
  • a frame set is created for all possible frames.
  • Each frame is referred to as a query index frame.
  • Different frames are used because the DNA barcode sequence query may not align directly with the DNA barcode sequence stored in the DNA database, depending on how the words are broken up. Accordingly, the DNA barcode query is "shifted" to allow for different possible alignments. Therefore, it will be apparent that the number of query index frames correlates directly with the word size. Continuing the previous example, the word size is five, so there are five possible query index frames:
  • Frame 1: GTATC GGTAA CGAAC TT
    Frame 2: G TATCG GTAAC GAACT T
    Frame 3: GT ATCGG TAACG AACTT
    Frame 4: GTA TCGGT AACGA ACTT
    Frame 5: GTAT CGGTA ACGAA CTT
  • In step 310, for each frame, incomplete words are deleted, resulting in a query frame set that is used to query the barcode database.
  • the query frame set is:
  • Frame 1: GTATC GGTAA CGAAC
    Frame 2: TATCG GTAAC GAACT
    Frame 3: ATCGG TAACG AACTT
    Frame 4: TCGGT AACGA
    Frame 5: CGGTA ACGAA
  • In step 312, the set of words from each frame in the query frame set is used to query the barcode database.
  • a score is assigned to each database match based on the number of words in the DNA barcode sequence query that exactly match words of a DNA barcode sequence in the barcode database.
  • the database match with the best score is considered to be the matching sequence. In the present embodiment, the database match with the best score is always returned, even if the best score matches only one of the words. It is left to the user to determine the relevance of the results.
  • the species name of the matching sequence is returned to the user.
  • information associated with the species name may also be returned to the user.
  • the information associated with the query may include, for example, the DNA barcode sequence itself, a source identifier and/or an information link.
  • the returned DNA barcode sequence has the matching words highlighted, so that the user can determine the relevance of the results accordingly.
  • a predefined threshold is established to determine the relevance of the match.
  • the predefined threshold is set at a number of matching words. Therefore, if the number of matching words exceeds the threshold, the results are returned to the user. Alternatively, if the number of matching words fails to exceed the threshold, a message indicating that no match was found is returned to the user.
  • the threshold can be established by the person entering the search query or an administrator of the database, for example.
  • a list of the most relevant results can be returned, with the list ordered by decreasing score.
  • Referring to Figure 4, a system architecture in accordance with an embodiment is illustrated generally by numeral 400.
  • the present embodiment includes a Web server 402, a database 404, a communication network 406 and a plurality of client devices 408.
  • the client devices 408 encompass any computing device 102 that can be used to access the communication network 406, and include desktop computers, notebook computers, smart-phones, portable digital assistants, and the like.
  • the client devices 408 can connect with the web server 402 via the communication network.
  • the communication network 406 may include one or more of a local intranet, the Internet, a wireless network infrastructure, and a mobile telecommunication network.
  • the web server 402 hosts software required to add new DNA barcode sequences to the barcode database as well as search the database for existing DNA barcode sequences, as described with reference to Figures 2 and 3.
  • Access to the Web server 402 is provided to the client device 408 by an interface.
  • a standard hyper-text markup language (HTML) Web page provides the interface to the Web server 402.
  • the interface allows the user to add DNA barcode sequences to the barcode database and search the barcode database for DNA barcode sequences that match a DNA barcode sequence query.
  • If the user chooses to submit DNA barcode sequences to be indexed in the barcode database, the user is provided with a submission interface.
  • the submission interface allows the user to upload a file with the required information, including the DNA barcode sequence, species name, source identifier and/or an information link, if used.
  • the user is provided with a manual entry interface to enter the information manually.
  • the information is transmitted to the Web server 402, which enters it into the barcode database, as described with reference to Figure
  • If the user chooses to search against the DNA barcode sequences stored in the barcode database, the user is provided with a search interface.
  • the search interface provides a text box in which the user can enter the DNA barcode sequence query.
  • the query is transmitted to the Web server 402.
  • the Web server 402 formats the query as described with reference to Figure 3 and searches the barcode database.
  • the actual database search can be performed using any one of a number of existing search algorithms. In the present embodiment, the Google Desktop Search algorithm is used.
  • the speed for a matching response in the present embodiment has been measured to be approximately two to three seconds in most cases, even though the current system is only a proof-of-concept implementation on a low-end computer. Further optimization of the system will ensure that the rapid response is maintained as the databases continue to grow. It also may be possible to improve and expand the operation by using improved search algorithms.
  • the methods described with reference to Figures 2 and 3 provide several advantages over the prior art. Specifically, the creation of the database does not require time-consuming alignments.
  • the DNA barcode sequence query can be in any "frame" relative to the DNA barcode sequences in the barcode database and it can still be matched properly.
  • the order of the words is of little importance. Scores are based on the number of words that match, not the order in which they match. Therefore, highly fragmented sequences can still be matched properly.
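
By way of illustration only, the scoring behaviour described in the list above (exact word matches, an optional threshold, and results ordered by decreasing score) may be sketched in Python as follows. This is a minimal sketch and not the implementation of the embodiment: the function names `score` and `rank_matches`, the dictionary-based library, the species labels, and the choice to take the best score over all query index frames are assumptions made for the example.

```python
def score(frame_words, index_words):
    """Count the query words that exactly match a word in a barcode index.

    Word order is ignored, so only membership is tested.
    """
    index_set = set(index_words)
    return sum(1 for word in frame_words if word in index_set)


def rank_matches(query_frames, library, threshold=0):
    """Return (species, score) pairs whose score exceeds the threshold, best first.

    Assumption: the score of a library entry is its best score over all frames.
    """
    results = []
    for species, index_words in library.items():
        best = max((score(frame, index_words) for frame in query_frames), default=0)
        if best > threshold:
            results.append((species, best))
    results.sort(key=lambda pair: pair[1], reverse=True)
    return results


# Query frame set for GTATCGGTAACGAACTT with a word size of five.
query_frames = [
    ["GTATC", "GGTAA", "CGAAC"],
    ["TATCG", "GTAAC", "GAACT"],
    ["ATCGG", "TAACG", "AACTT"],
    ["TCGGT", "AACGA"],
    ["CGGTA", "ACGAA"],
]

# Hypothetical library; the first entry stores its words in a different order.
library = {
    "Example species A": ["CGAAC", "GTATC", "GGTAA"],
    "Example species B": ["TTTTT", "GTAAC", "CCCCC"],
}

print(rank_matches(query_frames, library))               # both entries, best first
print(rank_matches(query_frames, library, threshold=1))  # only the strong match
```

Because only the number of matching words is counted, the first entry still scores three even though its words are stored in a different order from the query.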

Abstract

There is provided a method for indexing and searching DNA barcode sequences in a barcode database. In order to index a DNA barcode sequence, the DNA barcode sequence is cleaned by removing all characters except those characters uniquely identifying one of the four nucleotide subunits of a DNA strand. The cleaned DNA barcode sequence is segmented into a number of words, each word having a predefined number of nucleotides, the segmented DNA barcode representing a barcode index. The barcode index is associated with its species name for storage in the barcode database. In order to search the barcode database, a DNA barcode sequence query is cleaned and segmented as described above. A frame set is created comprising a plurality of query index frames, each query index frame representing a shifted version of the cleaned DNA barcode sequence query. The barcode database is searched using the words of the query indices and a predefined search algorithm for finding a barcode index that best matches at least one query index frame in the frame set.

Description

DNA BARCODE SEQUENCE CLASSIFICATION
[0001] The present invention relates generally to deoxyribonucleic acid (DNA) barcodes and specifically to a system and method for effectively indexing and searching a DNA barcode library for classification of a DNA barcode query.
BACKGROUND
[0002] A DNA barcode is a relatively short sequence of genomic DNA that can be used to identify biological species. The idea of using DNA barcodes (also referred to as barcoding) for species identification was first introduced in early 2003 in "Biological identifications through DNA barcodes" by Hebert, P. D. N., A. Cywinska, S. L. Ball, and J. R. deWaard, Proceedings of the Royal Society of London B Biological Sciences 270:313-321, 2003.
[0003] This approach was later tested in a series of studies for different animal groups. These early studies established the usefulness of DNA barcodes in species identification with 95% of species showing distinctive DNA barcode patterns that provide a robust method for identification.
[0004] Although many different segments of the genome are potentially suitable for use as DNA barcodes, it has been important to achieve a wide consensus among researchers regarding the choice of a particular genomic segment. This is because, when faced with an unknown specimen, the user needs to know which particular short sequence can be used to identify it. If different barcodes were used for different species, then the user would need to know the species identity in order to choose the appropriate barcode. This would defeat the purpose of a technique whose primary function is species identification.
[0005] The barcode sequence currently used for animal species is a 650 base pair (bp) fragment from a mitochondrial gene, referred to as cytochrome c oxidase I (COI, cox1). This barcode sequence has shown potential in identification of protist and fungal species. Since the COI gene shows a reduced level of sequence diversity among plant species, other genomic fragments are being tested for barcoding in plants. In the future, it is likely that the assignment of a specimen to a particular species will involve the use of more than a single barcode sequence.
[0006] The use of DNA barcodes for species identification is dependent on the availability of comprehensive barcode libraries from known species. These libraries are now being constructed as part of the large-scale DNA barcoding projects that are currently underway. Global DNA barcode libraries for birds, fishes and several arrays of Lepidoptera, as well as regional and nation-wide barcode libraries, are now being assembled. For example, studies have tested the utility of barcoding in species identification and a barcode of life data (BOLD) system currently holds more than 260,000 barcode sequences from over 25,000 species. Additional sequences will migrate to public databases, such as GenBank, upon publication in scientific studies.
[0007] However, these barcode libraries will require efficient search and pattern matching methods so that the barcode of an unknown specimen can be accurately and rapidly found if it exists in the barcode library. At present, this task is being performed using methods that were not developed for taxonomy but rather for molecular phylogenetics. Therefore, the goal of these existing methods is to measure evolutionary relatedness rather than to achieve taxonomic assignment at the species level.
[0008] For example, the neighbor-joining (NJ) phylogenetic clustering algorithm, which has been used in most barcode studies so far, is based on measuring genetic distances between DNA sequences and assembling a phylogenetic tree. In addition to having been developed for a different purpose, and perhaps because of this, such methods are not easily adaptable to very large data sets containing several thousand DNA barcodes. Furthermore, the interpretation of their results is not always straightforward.
[0009] Recently, Bayesian population genetics approaches have been proposed for DNA barcode data. However, their use has been limited to somewhat small test data sets and their effectiveness in large-scale barcode data sets has not been established. Other available sequence search methods such as the popular BLAST algorithm do not yield reliable results.
[0010] Therefore, it can be seen that there is a need for a system and method that provides efficient searching of a library of DNA barcodes. Accordingly, it is an object of the present invention to obviate or mitigate at least some of the above-mentioned disadvantages.
SUMMARY
[0011] The following describes a method that is used for the analysis of DNA barcodes rather than the generation of those sequences. Specifically, a new method is described for comparing the DNA sequence from a biological specimen to a library of reference DNA barcode sequences. As a result of this comparison, the specimen can quickly be assigned to a biological species.
[0012] Additionally, new web-based technologies are described that can significantly increase the possibilities for sharing and using sequence data in different contexts. The approach presented herein utilizes the capabilities of web-based search engines, such as Google™ or Yahoo!®, for exploring sequence and related information across multiple data sources.
[0013] In accordance with an aspect of the invention, there is provided a method for indexing DNA barcode sequences for a barcode database, the method comprising the steps of: receiving a DNA barcode sequence; cleaning the received DNA barcode sequence by removing all characters except those characters uniquely identifying one of the four nucleotide subunits of a DNA strand; segmenting the cleaned DNA barcode sequence into a number of words, each word having a predefined number of nucleotides, the segmented DNA barcode representing a barcode index; and associating the barcode index with its species name for storage in the barcode database.
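By way of illustration only, the indexing steps of paragraph [0013] (cleaning, segmenting, and associating the barcode index with a species name) may be sketched in Python as follows. This is a minimal sketch under stated assumptions: the function names `clean`, `segment` and `index_barcode`, the list-of-records database, and the species label are hypothetical and do not appear in the specification, and the incomplete final word is simply dropped.

```python
import re


def clean(sequence: str) -> str:
    """Upper-case the sequence and remove every character except A, C, G and T."""
    return re.sub(r"[^ACGT]", "", sequence.upper())


def segment(cleaned: str, word_size: int) -> list[str]:
    """Split into consecutive words from the first nucleotide.

    An incomplete final word is dropped.
    """
    usable = len(cleaned) - len(cleaned) % word_size
    return [cleaned[i:i + word_size] for i in range(0, usable, word_size)]


def index_barcode(database: list, species: str, sequence: str, word_size: int = 15) -> dict:
    """Build a barcode index and store it with its species name (hypothetical record layout)."""
    record = {"species": species, "words": segment(clean(sequence), word_size)}
    database.append(record)
    return record


database = []
# Word size of five is used for ease of explanation only.
index_barcode(database, "Example species", "gtatcg--GTAACGAACTT", word_size=5)
print(database[0]["words"])
```

A plain list is used here only to keep the sketch self-contained; any database type or search engine index could hold the same records.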
[0014] In accordance with a further aspect of the invention, there is provided a method of searching for a DNA barcode sequence against the barcode database described above, the method comprising the steps of: receiving a DNA barcode sequence query; cleaning the received DNA barcode sequence query by removing all characters except those characters uniquely identifying one of the four nucleotide subunits of a DNA strand; segmenting the cleaned DNA barcode sequence query into a number of words, each word having a predefined number of nucleotides; creating a frame set comprising a plurality of query index frames, each query index frame representing a shifted version of the cleaned DNA barcode sequence query; and searching the barcode database using a predefined search algorithm for finding a barcode index that best matches at least one query index frame in the frame set.
[0015] In accordance with yet a further aspect of the invention, there is provided a computer readable medium comprising instructions for executing one or both of the methods described above.
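By way of illustration only, the creation of the frame set recited in paragraph [0014] may be sketched in Python as follows. This is a minimal sketch, assuming that one query index frame is produced per possible shift (so the number of frames equals the word size) and that incomplete leading and trailing words are discarded; the function name `query_frame_set` is hypothetical, and the search step itself is left to whichever search algorithm is used.

```python
import re


def query_frame_set(query: str, word_size: int) -> list[list[str]]:
    """Return one list of complete words for each shift of the cleaned query."""
    # Cleaning: upper-case, then keep only the four nucleotide characters.
    cleaned = re.sub(r"[^ACGT]", "", query.upper())
    frames = []
    for shift in range(word_size):
        shifted = cleaned[shift:]  # drops the incomplete leading word
        usable = len(shifted) - len(shifted) % word_size  # drops the incomplete trailing word
        frames.append([shifted[i:i + word_size] for i in range(0, usable, word_size)])
    return frames


# Word size of five is used for ease of explanation only.
for number, frame in enumerate(query_frame_set("NNNN??gtatcg--GTAACGAACTT", 5), start=1):
    print(f"Frame {number}: {' '.join(frame)}")
```

Each frame can then be submitted as a set of words to the barcode database, so that a query whose word boundaries do not coincide with those of a stored barcode index can still be matched.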
BRIEF DESCRIPTION OF THE DRAWINGS
[0016] An embodiment of the present invention will now be described by way of example only with reference to the following drawings in which:
Figure 1 is a block diagram illustrating a computing device;
Figure 2 is a flow chart illustrating adding DNA barcode sequences to a barcode database;
Figure 3 is a flow chart illustrating searching the barcode database;
Figure 4 is a sample distributed architecture for accessing the barcode database;
Figure 5 is a graph illustrating the accuracy of the search in relation to word size; and
Figure 6 is a graph illustrating the ability to uniquely identify species in relation to barcode size.
DETAILED DESCRIPTION OF THE PREFERRED EMBODIMENTS
[0017] For convenience, like numerals in the description refer to like structures in the drawings. Referring to Figure 1, a computing device 102 is illustrated in greater detail. The computing device 102 can be any type of computing device, including a desktop computer, notebook, or a mobile device such as a data messaging device, a two-way pager, a smart-phone and a personal digital assistant.
[0018] The computing device 102 includes a communication subsystem 111, which includes a receiver 212, a transmitter 214, and associated components, such as one or more embedded or internal antenna elements 216 and 218, local oscillators (LOs) 213, and a processing module such as a digital signal processor (DSP) 220. As will be apparent to those skilled in the field of communications, the particular design of the communication subsystem 211 depends on the communication network in which the computing device 102 is intended to operate.
[0019] The computing device 102 includes a microprocessor 138 which controls general operation of the computing device 102. The microprocessor 138 also interacts with additional device subsystems such as a display 122, a hard disc drive 124, a random access memory (RAM) 126, auxiliary input/output (I/O) subsystems 128, a serial port 130, a keyboard 132, a speaker 134, a microphone 136, a short-range communications subsystem 140 such as Bluetooth™ for example, and any other device subsystems or peripheral devices generally designated at 142.
[0020] Operating system software used by the microprocessor 138 is preferably stored in a persistent store such as the hard disc drive 124, which may alternatively be a read-only memory (ROM) or similar storage element such as flash memory (not shown). Those skilled in the art will appreciate that the operating system, specific device applications, or parts thereof, may be temporarily loaded into a volatile store such as RAM 126.
[0021] The display 122 is used to visually present an application's graphical user interface (GUI) to the user. The user can manipulate application data by modifying information on the GUI using an input device such as the keyboard 132 for example. Depending on the type of computing device 102, the user may have access to other types of input devices, such as, for example, a scroll wheel, trackball, light pen or touch sensitive screen.
[0022] The microprocessor 138, in addition to its operating system functions, preferably enables execution of software applications on the computing device 102. A predetermined set of applications, which control basic device operations, is installed on the mobile device 102 during its manufacture. These basic operations typically include data and voice communication applications, for example. Additionally, applications may also be loaded onto the mobile device 102 through a network, an auxiliary I/O subsystem 128, serial port 130, short-range communications subsystem 140, or any other suitable subsystem 142, and installed by a user in RAM 126, or preferably the persistent store 124, for execution by the microprocessor 138. Such flexibility in application installation increases the functionality of the computing device 102 and may provide enhanced on-device features, communication-related features, or both.
[0023] Additionally, data can be stored in the persistent store 124 for access by software applications executing on the computing device 102. The persistent store can be included locally on the computing device 102 or provided remotely. Any number of types of data can be stored in the persistent store, including, for example, DNA barcode sequences.
[0024] Accordingly, DNA barcode sequence searching software can be provided for the computing device 102. As described herein, the DNA barcode sequence searching software provides a method of DNA barcode sequence analysis that is analogous to non-molecular methods of taxonomic analysis. The method can be implemented as software using a number of programming techniques for execution on the computing device 102.
[0025] The DNA barcode sequence searching software converts a DNA barcode sequence into a series of "characters" that can be used to create keys for species identification. In practice, removing the requirement to assess evolutionary relatedness, which is not the primary goal of DNA barcoding, allows for an efficient method of taxonomic assignment of an unknown specimen to a species.
[0026] In accordance with the present embodiment, an efficient method for DNA barcode sequence searching is described as follows. A DNA barcode sequence is subdivided into predefined units (also referred to as words). Each word includes a given number of nucleotides and is treated as a taxonomic character. In this manner, each DNA barcode sequence is transformed into a set of unique words. This set of words is then compared to a library of known DNA barcode sequences that have, themselves, been subdivided in a similar way. Since the DNA barcode sequence of the query and the library of DNA barcode sequences have been separated into relatively short words, many, if not all, existing word search algorithms can be manipulated to perform these searches.
[0027] The benefit of this method lies in its simplicity and its flexibility. Because of its simplicity, it can exploit many existing text-searching strategies, known or proprietary. Because of its flexibility, it can be easily customized for efficiency. Specifically, the flexibility lies in the fact that (i) the word size is variable; and (ii) the DNA barcode sequence is predefined.
[0028] A developer implementing the search tool can determine the word size into which the DNA barcode is divided. For example, the COI DNA coding sequence is approximately 1,500 nucleotides long. For clarity, individual units of a DNA sequence are referred to as nucleotides. A tri-nucleotide sequence is referred to as a codon. Therefore, the 1,500 nucleotide-long DNA sequence can be divided into 500 words, where each word is a codon. At the other extreme, the 1,500 nucleotide-long DNA sequence can be treated as a single 1,500-nucleotide word. It will be apparent that other word sizes between these extremes can also be selected.
[0029] Longer word sizes are faster to search. However, the longer the word size, the more stringent the matching requirement becomes. Typical DNA barcodes are approximately 600 nucleotides long. Accordingly, for the present embodiment, each DNA barcode sequence is divided into approximately 40 words, each word comprising 5 codons (15 nucleotides). This particular word size yields high accuracy while allowing the identification of DNA barcode sequences that are related, but not necessarily identical, to the DNA barcode sequence query. Such partial matching allows recognition of a new species or sub-species that may not have DNA barcode sequences listed in the existing library of sequences.
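The relationship between sequence length and word size described above can be checked with a short calculation. The following Python sketch is illustrative only; the function name word_count is hypothetical and is not part of the described embodiment.

```python
def word_count(sequence_length: int, word_size: int) -> int:
    """Return the number of complete words in a sequence of the given length."""
    if word_size <= 0:
        raise ValueError("word_size must be a positive number of nucleotides")
    return sequence_length // word_size


# Figures quoted in the description.
print(word_count(1500, 3))     # one codon per word
print(word_count(1500, 1500))  # the whole sequence as a single word
print(word_count(600, 15))     # five codons per word, as in the present embodiment
```

The three calls correspond to the 500-word, single-word and 40-word cases discussed in the text.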
[0030] It will be appreciated that different embodiments of the method can use different word sizes in order to suit the purpose of the embodiment. Further, it will be appreciated that the word size within an embodiment can be modified after the embodiment has been implemented. In order to facilitate a change in word size once the embodiment has been implemented, the library is parsed and the words for each barcode are adjusted accordingly. Once the library has been adjusted to the new word size, it can be searched as described herein.
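One possible way of adjusting a library to a new word size is sketched below in Python. The sketch assumes that the complete cleaned DNA barcode sequence is kept for each entry, which the description treats as optional, and the function name resegment_library and the species-to-sequence mapping are hypothetical.

```python
def resegment_library(sequences: dict[str, str], word_size: int) -> dict[str, list[str]]:
    """Rebuild every barcode index from its stored complete sequence."""
    rebuilt = {}
    for species, sequence in sequences.items():
        # Ignore any trailing nucleotides that do not fill a complete word.
        complete = len(sequence) - len(sequence) % word_size
        rebuilt[species] = [sequence[i:i + word_size] for i in range(0, complete, word_size)]
    return rebuilt


library = {"Species Example": "GTATCGGTAACGAACTT"}
print(resegment_library(library, 5))
print(resegment_library(library, 3))
```

If only the segmented words were stored, the words of each barcode would first have to be concatenated in order before being divided again.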
[0031] Further flexibility is afforded by the fact that the DNA barcodes are predefined. This feature allows matching a variety of query DNA barcode sequences that are derived from different areas of the gene. This is useful in DNA barcoding because the availability of sequence primers may constrain the query DNA barcode sequences to come from non-identical gene regions. That is, for example, one query may match words four-through-eighteen while another may match words two-through-twenty-three. This becomes even more important for the analysis of degraded samples where we have to work with several "mini" DNA barcodes.
[0032] As will be appreciated, there are two components to the present embodiment. The first component relates to entering DNA barcode sequences in a barcode database. The second component relates to querying the DNA barcode database for a particular DNA barcode sequence. The barcode database can be a custom-built hash table, or can be one of any number of existing database types and search engines provided by applications such as Google Desktop Search, Google Enterprise Solutions (Google Mini or Google Appliance), Apple Spotlight, and Microsoft's Indexing Service. This list merely illustrates some of the possible implementations. Other implementations will be apparent to a person of ordinary skill in the art.
[0033] Referring to Figure 2, a flow chart illustrating the method for entering a DNA barcode sequence into the barcode database is illustrated generally by numeral 200. In step 202, a DNA barcode sequence is provided in one of a plurality of sequence formats. The sequence format can be any of the standard sequence formats, such as FASTA, PHYLIP, or MEGA, or a proprietary format.
[0034] In step 204, the DNA barcode sequence is read from the provided source. In step 206, the DNA barcode sequence is converted to upper-case characters and "cleaned" to remove alignment characters or flanking ambiguous characters (poly-N's). For example, the DNA sequence NNNagcGCG—cgGATNNN would be converted to AGCGCGCGGAT.
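The cleaning of step 206 can be sketched in Python as follows. This is a minimal illustration rather than the implementation of the embodiment: the function name clean_barcode is hypothetical, two hyphens are assumed to stand for the alignment gap characters of the example, and every character other than A, C, G and T is removed wherever it occurs, in line with the summary of the method.

```python
import re


def clean_barcode(raw: str) -> str:
    """Upper-case the input and keep only the characters A, C, G and T."""
    return re.sub(r"[^ACGT]", "", raw.upper())


# Example sequence from the description, with "--" as the gap characters.
print(clean_barcode("NNNagcGCG--cgGATNNN"))
```

For the example input, the sketch yields AGCGCGCGGAT, as in the text.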
[0035] In step 208, the converted sequence is broken up into words comprising a predefined number of characters, starting at the first nucleotide. For example, consider a converted sequence GTATCGGTAACGAACTT and a word size of five (5). The resulting division is GTATC GGTAA CGAAC TT. Note that the word size of five is selected for ease of explanation only.
[0036] In step 210, it is determined whether or not the final word is incomplete. If the word is complete the method continues to step 214. If the word is incomplete, the method continues to step 212. In step 212 the final, incomplete word is deleted and the method continues to step 214. Therefore, continuing the previous example, the incomplete word comprising the final TT nucleotides is deleted from the division GTATC GGTAA CGAAC TT, and the converted DNA barcode sequence is modified to GTATC GGTAA CGAAC.
[0037] In step 214, the converted DNA barcode sequence is stored in the barcode database as a barcode index along with its associated relevant information. In the present embodiment, the relevant information includes a name of the species associated with the DNA barcode sequence. Optionally, the relevant information may also include the complete DNA barcode sequence, a source identifier and/or an information link. The source identifier identifies the source of the DNA barcode sequence. The source can be, for example, the person, organization or institution that added the DNA barcode to the barcode library. The information link identifies a source for more information relating to the DNA barcode sequence, such as the entire DNA sequence and/or other genetic information. If the method is implemented over a computer network, the information link can be a hyperlink to a Web page.
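Steps 208 to 212 can be sketched in Python as follows. The function name segment is hypothetical, and the word size of five is again used for ease of explanation only.

```python
def segment(cleaned: str, word_size: int) -> list[str]:
    """Split a cleaned sequence into words and delete a final incomplete word."""
    words = [cleaned[i:i + word_size] for i in range(0, len(cleaned), word_size)]
    # Steps 210 and 212: the last word is deleted if it is shorter than the word size.
    if words and len(words[-1]) < word_size:
        words.pop()
    return words


print(segment("GTATCGGTAACGAACTT", 5))
```

For the example sequence, the sketch keeps the words GTATC, GGTAA and CGAAC and deletes the trailing TT.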
[0038] Continuing the example, the converted, modified DNA barcode sequence GTATC GGTAA CGAAC is stored in the barcode database. A species name Species Example1 is associated with the DNA barcode sequence as well as source identifier 12345 that identifies which organization provided the DNA barcode sequence to the barcode library. Further, hyperlink http://www.dnabarcodeexample.com (fictional hyperlink for example only) provides a link to a web page that includes more information regarding the DNA barcode sequence and its associated species.
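Where the barcode database is implemented as a custom-built hash table, the storage of step 214 might be sketched in Python as follows. The class names BarcodeRecord and BarcodeDatabase, the field names and the word-to-record mapping are all hypothetical choices made for illustration; the description does not prescribe a particular data structure.

```python
from collections import defaultdict
from dataclasses import dataclass


@dataclass
class BarcodeRecord:
    species: str
    words: list[str]     # the barcode index
    source_id: str = ""  # optional source identifier
    info_link: str = ""  # optional information link


class BarcodeDatabase:
    def __init__(self) -> None:
        self.records: list[BarcodeRecord] = []
        # Hash table from each word to the records whose index contains it.
        self.word_to_records: dict[str, set[int]] = defaultdict(set)

    def add(self, record: BarcodeRecord) -> int:
        record_id = len(self.records)
        self.records.append(record)
        for word in record.words:
            self.word_to_records[word].add(record_id)
        return record_id


db = BarcodeDatabase()
record_id = db.add(BarcodeRecord(
    species="Species Example1",
    words=["GTATC", "GGTAA", "CGAAC"],
    source_id="12345",
    info_link="http://www.dnabarcodeexample.com",  # fictional link from the example
))
print(record_id, sorted(db.word_to_records))
```

Keying the table on individual words means that a later query only needs to look up its own words, rather than scan every stored barcode index.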
[0039] Referring to Figure 3, a flow chart illustrating the method for searching a DNA barcode sequence in the barcode database is illustrated generally by numeral 300. In step 302 the user submits a DNA barcode sequence from an unknown specimen.
[0040] In step 304, the DNA barcode sequence query is converted to upper-case characters and "cleaned" to remove alignment characters or flanking ambiguous characters (poly-N's). Accordingly, the only characters that remain include A, C, G or T representing the four nucleotide subunits of a DNA strand. For example, the DNA sequence NNNN??gtatcg—GTAACGAACTT would be converted to GTATCGGTAACGAACTT.
[0041] In step 306, the converted sequence is broken up into words comprising a predefined number of characters, starting at the first nucleotide. For example, consider a converted sequence GTATCGGTAACGAACTT and a word size of five (5). The resulting word set is GTATC GGTAA CGAAC TT.
[0042] In step 308, a frame set is created for all possible frames. Each frame is referred to as a query index frame. Different frames are used because the DNA barcode sequence query may not align directly with the DNA barcode sequence stored in the DNA database, depending on how the words are broken up. Accordingly, the DNA barcode query is "shifted" to allow for different possible alignments. Therefore, it will be apparent that the number of query index frames correlates directly with the word size. Continuing the previous example, the word size is five, so there are five possible query index frames:
Frame 1: GTATC GGTAA CGAAC TT
Frame 2: G TATCG GTAAC GAACT T
Frame 3: GT ATCGG TAACG AACTT
Frame 4: GTA TCGGT AACGA ACTT
Frame 5: GTAT CGGTA ACGAA CTT
[0043] In step 310, for each frame, incomplete words are deleted, resulting in a query frame set that is used to query the barcode database. Continuing the above example, the query frame set is:
Frame 1: GTATC GGTAA CGAAC
Frame 2: TATCG GTAAC GAACT
Frame 3: ATCGG TAACG AACTT
Frame 4: TCGGT AACGA
Frame 5: CGGTA ACGAA
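The creation of the query frame set in steps 308 and 310 can be sketched in Python as follows. The function name query_frames is hypothetical. Each frame is obtained by skipping a leading offset of zero to word size minus one nucleotides, which has the same effect as deleting the incomplete leading and trailing words.

```python
def query_frames(cleaned: str, word_size: int) -> list[list[str]]:
    """Return one list of complete words for each possible shift of the query."""
    frames = []
    for offset in range(word_size):
        shifted = cleaned[offset:]               # drop the incomplete leading word
        complete = len(shifted) // word_size     # number of complete words that remain
        frames.append([shifted[i * word_size:(i + 1) * word_size] for i in range(complete)])
    return frames


for number, frame in enumerate(query_frames("GTATCGGTAACGAACTT", 5), start=1):
    print(f"Frame {number}:", " ".join(frame))
```

For the example query, the sketch reproduces the five frames of the query frame set listed above.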
[0044] In step 312, the set of words from each frame in the query frame set is used to query the barcode database. A score is assigned to each database match based on the number of words in the DNA barcode sequence query that exactly match words of a DNA barcode sequence in the barcode database. The database match with the best score is considered to be the matching sequence. In the present embodiment, the database match with the best score is always returned, even if the best score matches only one of the words. It is left to the user to determine the relevance of the results.
[0045] In step 314, the species name of the matching sequence is returned to the user. Optionally, information associated with the species name may also be returned to the user. In the context of a search, the information associated with the query may include, for example, the DNA barcode sequence itself, a source identifier and/or an information link. The returned DNA barcode sequence has the matching words highlighted, so that the user can determine the relevance of the results accordingly.
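The highlighting of matching words in the returned DNA barcode sequence might be sketched in Python as follows. The function name highlight and the use of square brackets as the highlighting marker are assumptions made for illustration; a Web-based implementation could equally apply markup to the matching words.

```python
def highlight(index_words: list[str], query_words: list[str]) -> str:
    """Return the barcode index as text with matching words placed in brackets."""
    matched = set(query_words)
    return " ".join(f"[{word}]" if word in matched else word for word in index_words)


print(highlight(["GTATC", "GGTAA", "CGAAC"], ["GGTAA", "TTTTT"]))
```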
[0046] In an alternate embodiment, a predefined threshold is established to determine the relevance of the match. The predefined threshold is set at a number of matching words. Therefore, if the number of matching words exceeds the threshold, the results are returned to the user. Alternatively, if the number of matching words fails to exceed the threshold, a message indicating that no match was found is returned to the user. Depending on the implementation, the threshold can be established by the person entering the search query or an administrator of the database, for example.
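The threshold test of this alternate embodiment can be sketched in Python as follows. The function name apply_threshold and the wording of the returned message are hypothetical.

```python
NO_MATCH_MESSAGE = "No match found"  # hypothetical wording of the message


def apply_threshold(species: str, matching_words: int, threshold: int) -> str:
    """Return the species only if the number of matching words exceeds the threshold."""
    if matching_words > threshold:
        return species
    return NO_MATCH_MESSAGE


print(apply_threshold("Species Example1", 3, 2))
print(apply_threshold("Species Example1", 2, 2))
```

Because the text requires the number of matching words to exceed the threshold, a score equal to the threshold is treated as no match in this sketch.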
[0047] In yet an alternate embodiment, a list of the most relevant results can be returned, with the list organized in decreasing score results.
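A list of results in decreasing order of score can be sketched in Python as follows. The function name rank_results, the limit parameter and the alphabetical ordering of equal scores are assumptions made for illustration.

```python
def rank_results(scores: dict[str, int], limit: int = 10) -> list[tuple[str, int]]:
    """Return up to `limit` (species, score) pairs, best score first."""
    hits = [(species, count) for species, count in scores.items() if count > 0]
    # Sort by decreasing score; equal scores are ordered by species name.
    hits.sort(key=lambda hit: (-hit[1], hit[0]))
    return hits[:limit]


print(rank_results({"Species A": 1, "Species B": 3, "Species C": 0, "Species D": 2}))
```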
Sample Implementation
[0048] Referring to Figure 4, a system architecture in accordance with an embodiment is illustrated generally by numeral 400. The present embodiment includes a Web server 402, a database 404, a communication network 406 and a plurality of client devices 408.
[0049] The client devices 408 comprise any computing device 102 that can be used to access the communication network 406, and include desktop computers, notebook computers, smart-phones, portable digital assistants, and the like. The client devices 408 can connect with the web server 402 via the communication network. Accordingly, the communication network 406 may include one or more of a local intranet, the Internet, a wireless network infrastructure, and a mobile telecommunication network.
[0050] The web server 402 hosts software required to add new DNA barcode sequences to the barcode database as well as search the database for existing DNA barcode sequences, as described with reference to Figures 2 and 3.
[0051] Access to the Web server 402 is provided to the client device 408 by an interface. In the present embodiment, a standard hyper-text markup language (HTML) Web page provides the interface to the Web server 402. The interface allows the user to add DNA barcode sequences to the barcode database and search the barcode database for DNA barcode sequences that match a DNA barcode sequence query.
[0052] If the user chooses to submit DNA barcode sequences to be indexed in the barcode database, the user is provided with a submission interface. The submission interface allows the user to upload a file with the required information, including the DNA barcode sequence, species name, source identifier and/or an information link, if used. Optionally, the user is provided with a manual entry interface to enter the information manually. The information is transmitted to the Web server 402, which enters it into the barcode database, as described with reference to Figure 2.
[0053] If the user chooses to search against the DNA barcode sequences stored in the barcode database, the user is provided with a search interface. The search interface provides a text box in which the user can enter the DNA barcode sequence query. The query is transmitted to the Web server 402. The Web server 402 formats the query as described with reference to Figure 3 and searches the barcode database. The actual database search can be performed using any one of a number of existing search algorithms. In the present embodiment, the Google Desktop Search algorithm is used.
[0054] The speed for a matching response in the present embodiment has been measured to be approximately two to three seconds in most cases, even though the current system is only a proof-of-concept implementation on a low-end computer. Further optimization of the system will ensure that the rapid response is maintained as the databases continue to grow. It also may be possible to improve and expand the operation by using improved search algorithms.
Word Size Selection
[0055] As previously described, the selection of the word size is flexible and left to the implementation. However, it has been determined that a word size of 15 nucleotides provides results that have a fast response time with little loss in accuracy. In order to arrive at this word size, a series of parsimony tree-building analyses were performed on words of length 5-50 nucleotides from over 1,200 barcode sequences from butterflies, moths, and primates. How often species were correctly grouped into clades in the resulting tree was also determined.
[0056] A graph illustrating the results of this analysis is provided in Figure 5. As shown in the graph, there is only a gradual decline in accuracy as the word size increases to 15 nucleotides and then a significant drop as the word size increases from 15 to 20 nucleotides. Further, since the speed of the algorithm improves with increased word size, it is desirable to have the largest word size possible that provides reasonable accuracy. Accordingly, a 15-nucleotide long word appears to be the largest word size that maintains a high degree of accuracy.
DNA Barcode Sequence Length
[0057] Although the standard barcode length is approximately 650 nucleotides long, it is often impossible to obtain such a large sequence from an archival biological sample. This is especially true where the DNA has degraded, as in preserved museum specimens. However, experiments show that shorter sequences are often adequate for species identification. Referring to Figure 6, a graph illustrating the results of DNA barcode sequence length is illustrated generally by numeral 600. In order to plot the graph, all barcode sequences from GenBank were loaded into the barcode database. DNA barcodes of varying length were used to search the barcode database and it was determined that 95% of the species could be accurately identified with only a 300-nucleotide long DNA barcode. Even a 100-nucleotide long DNA barcode is good enough to identify the species 90% of the time. This exercise has been successfully repeated for a wide variety of barcodes, from mammals to insects to fish, and similar results have been found in each case.
[0058] Although the embodiments described above relate to a Web based implementation, it will be apparent to a person of ordinary skill in the art that a local installation of this search tool can also be implemented. Such a tool would be installed locally on a user's computer and can be useful, for example, to researchers who have unpublished DNA barcode sequences through which they want to search.
[0059] Further, although the embodiments described above teach a user pushing DNA barcode sequences into the barcode database, it will be apparent to a person of ordinary skill in the art that a predefined set of third party databases could be polled periodically to determine if any new DNA barcode sequences have been identified. If so, the new DNA barcode sequences could be retrieved and added to the barcode database as described with reference to Figure 2. Alternatively, the third party databases could be set up to automatically push any new DNA barcode sequences to the barcode database.
[0060] Accordingly, it will be appreciated that the methods described with reference to Figures 2 and 3 provide several advantages over the prior art. Specifically, the creation of the database does not require time-consuming alignments. The DNA barcode sequence query can be in any "frame" relative to the DNA barcode sequences in the barcode database and it can still be matched properly. The order of the words is of little importance. Scores are based on the number of words that match, not the order in which they match. Therefore, highly fragmented sequences can still be matched properly.
[0061] While the invention has been described in terms of various specific embodiments, those skilled in the art will recognize that the invention can be practiced with modification within the spirit and scope of the claims.

Claims

1. A method for indexing DNA barcode sequences for a barcode database, the method comprising the steps of: receiving a DNA barcode sequence; cleaning the received DNA barcode sequence by removing all characters except those characters uniquely identifying one of the four nucleotide subunits of a DNA strand; segmenting the cleaned DNA barcode sequence into a number of words, each word having a predefined number of nucleotides, the segmented DNA barcode representing a barcode index; and associating the barcode index with its species name for storage in the barcode database.
2. The method of claim 1, wherein the barcode index does not include incomplete words.
3. The method of claim 2, comprising the further step of ensuring that all of the characters representing one of the four nucleotide subunits of a DNA strand are in capital letters.
4. The method of claim 2, wherein the DNA barcode sequence is received as one of a plurality of DNA barcode sequences in a file, and the steps are repeated for each DNA barcode sequence.
5. The method of claim 2, wherein the word size is 15 nucleotides.
6. The method of claim 2, wherein the DNA barcode sequence is at least 100 nucleotides in length.
7. The method of claim 6, wherein the DNA barcode sequence is approximately 600 nucleotides in length.
8. The method of claim 2, wherein additional information is associated with the segmented DNA barcode sequence.
9. The method of claim 8, wherein the additional information includes at least one of a complete DNA barcode sequence, a source identifier or an information link.
10. The method of claim 1, wherein the DNA barcode sequence information is retrieved from a third party data-source at a predefined time.
11. The method of claim 1, wherein the DNA barcode sequence information is pushed from a third party data-source when it becomes available.
12. A computer readable medium comprising instructions which, when executed on a computing device, cause the computing device to implement the steps of claim 1.
13. A method of searching for a DNA barcode sequence against the barcode database of claim 1, the method comprising the steps of: receiving a DNA barcode sequence query; cleaning the received DNA barcode sequence query by removing all characters except those characters uniquely identifying one of the four nucleotide subunits of a DNA strand; segmenting the cleaned DNA barcode sequence query into a number of words, each word having a predefined number of nucleotides; creating a frame set comprising a plurality of query index frames, each query index frame representing a shifted version of the cleaned DNA barcode sequence query; and searching the barcode database using the words of the query indices and a predefined search algorithm for finding a barcode index that best matches at least one query index frame in the frame set.
14. The method of claim 13 comprising the further step of returning a result to a user, the result comprising at least the barcode index that best matches the at least one query.
15. The method of claim 14, wherein the step of returning the result of the search is performed only when the barcode index that best matches the at least one query index frame exceeds a predefined threshold.
16. The method of claim 14, wherein a plurality of results are returned to the user, the results being organized in accordance with the results of their match.
17. The method of claim 14, wherein matching words are highlighted in the result.
18. The method of claim 13, wherein the query frame indices do not include incomplete words.
19. The method of claim 14, wherein the DNA barcode sequence query is received via a communication network from a client device.
20. The method of claim 14 comprising the further step of retrieving a species identifier associated with the best match barcode index.
21. The method of claim 16, wherein supplemental information is retrieved along with the species identifier.
22. The method of claim 17, wherein the supplemental information includes at least one of a complete DNA barcode sequence, a source identifier or an information link.
23. A computer readable medium comprising instructions which, when executed on a computing device, cause the computing device to implement the steps of claim 13.
PCT/CA2007/001170 2006-06-30 2007-06-29 Dna barcode sequence classification WO2008000090A1 (en)

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
US81735306P 2006-06-30 2006-06-30
US60/817,353 2006-06-30

Publications (1)

Publication Number Publication Date
WO2008000090A1 true WO2008000090A1 (en) 2008-01-03

Family

ID=38845087

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/CA2007/001170 WO2008000090A1 (en) 2006-06-30 2007-06-29 Dna barcode sequence classification

Country Status (1)

Country Link
WO (1) WO2008000090A1 (en)

Cited By (16)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN102200967A (en) * 2011-03-30 2011-09-28 中国人民解放军军事医学科学院放射与辐射医学研究所 Method and system for processing text based on DNA sequences
CN102332064A (en) * 2011-10-07 2012-01-25 吉林大学 Biological species identification method based on genetic barcode
WO2012039633A2 (en) * 2010-09-23 2012-03-29 Real Time Genomics, Inc. Methods of characterizing, determining similarity, predicting correlation between and representing sequences and systems and indicators therefor
CN102799795A (en) * 2011-05-25 2012-11-28 中国医学科学院药用植物研究所 Species movable identification system, terminal, server and method
US20140358937A1 (en) * 2013-05-29 2014-12-04 Sterling Thomas Systems and methods for snp analysis and genome sequencing
WO2015179493A1 (en) * 2014-05-23 2015-11-26 Centrillion Technology Holding Corporation Methods for generating and decoding barcodes
WO2015184016A3 (en) * 2014-05-27 2016-03-10 The Broad Institute, Inc. High-throughput assembly of genetic elements
WO2016168584A1 (en) * 2015-04-17 2016-10-20 President And Fellows Of Harvard College Barcoding systems and methods for gene sequencing and other applications
WO2016210191A1 (en) * 2015-06-23 2016-12-29 Tupac Bio, Inc. Computer-implemented method for designing synthetic dna, and terminal, system and computer-readable medium for the same
CN108018607A (en) * 2016-10-28 2018-05-11 深圳华大基因股份有限公司 A kind of sequence label for lifting microarray dataset library fractionation rate mixes storehouse method and apparatus
CN108823317A (en) * 2017-05-05 2018-11-16 深圳市华大海洋研究院 A kind of DNA bar code and its application for the detection of Xu Shi tooth mudskipper
CN109979536A (en) * 2019-03-07 2019-07-05 青岛市疾病预防控制中心(青岛市预防医学研究院) It is a kind of based on DNA bar code to the identification method of species
US10560552B2 (en) 2015-05-21 2020-02-11 Noblis, Inc. Compression and transmission of genomic information
US10596541B2 (en) 2014-04-21 2020-03-24 President And Fellows Of Harvard College Systems and methods for barcoding nucleic acids
US11001883B2 (en) 2012-03-05 2021-05-11 The General Hospital Corporation Systems and methods for epigenetic sequencing
US11222712B2 (en) 2017-05-12 2022-01-11 Noblis, Inc. Primer design using indexed genomic information

Citations (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US6484166B1 (en) * 1999-05-20 2002-11-19 Evresearch, Ltd. Information management, retrieval and display system and associated method
US20020177138A1 (en) * 2000-11-15 2002-11-28 The United States Of America , Represented By The Secretary, Department Of Health And Human Services Methods for the indentification of textual and physical structured query fragments for the analysis of textual and biopolymer information
US20030124527A1 (en) * 2000-09-28 2003-07-03 Schlager John J. Automated method of identifying and archiving nucleic acid sequences
US6876930B2 (en) * 1999-07-30 2005-04-05 Agy Therapeutics, Inc. Automated pathway recognition system

Cited By (30)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
WO2012039633A2 (en) * 2010-09-23 2012-03-29 Real Time Genomics, Inc. Methods of characterizing, determining similarity, predicting correlation between and representing sequences and systems and indicators therefor
WO2012039633A3 (en) * 2010-09-23 2012-05-18 Real Time Genomics, Inc. Methods of characterizing, determining similarity, predicting correlation between and representing sequences and systems and indicators therefor
GB2498278A (en) * 2010-09-23 2013-07-10 Real Time Genomics Inc Methods of characterizing, determining similarity, predicting correlation between and representing sequences and systems and indicators therefor
CN102200967A (en) * 2011-03-30 2011-09-28 中国人民解放军军事医学科学院放射与辐射医学研究所 Method and system for processing text based on DNA sequences
CN102799795A (en) * 2011-05-25 2012-11-28 中国医学科学院药用植物研究所 Species movable identification system, terminal, server and method
CN102332064A (en) * 2011-10-07 2012-01-25 吉林大学 Biological species identification method based on genetic barcode
CN102332064B (en) * 2011-10-07 2013-11-06 吉林大学 Biological species identification method based on genetic barcode
US11001883B2 (en) 2012-03-05 2021-05-11 The General Hospital Corporation Systems and methods for epigenetic sequencing
US11047003B2 (en) 2012-03-05 2021-06-29 The General Hospital Corporation Systems and methods for epigenetic sequencing
US10191929B2 (en) * 2013-05-29 2019-01-29 Noblis, Inc. Systems and methods for SNP analysis and genome sequencing
US11308056B2 (en) 2013-05-29 2022-04-19 Noblis, Inc. Systems and methods for SNP analysis and genome sequencing
US20140358937A1 (en) * 2013-05-29 2014-12-04 Sterling Thomas Systems and methods for snp analysis and genome sequencing
US10596541B2 (en) 2014-04-21 2020-03-24 President And Fellows Of Harvard College Systems and methods for barcoding nucleic acids
US11052368B2 (en) 2014-04-21 2021-07-06 Vilnius University Systems and methods for barcoding nucleic acids
WO2015179493A1 (en) * 2014-05-23 2015-11-26 Centrillion Technology Holding Corporation Methods for generating and decoding barcodes
WO2015184016A3 (en) * 2014-05-27 2016-03-10 The Broad Institute, Inc. High-throughput assembly of genetic elements
US11898141B2 (en) 2014-05-27 2024-02-13 The Broad Institute, Inc. High-throughput assembly of genetic elements
WO2016168584A1 (en) * 2015-04-17 2016-10-20 President And Fellows Of Harvard College Barcoding systems and methods for gene sequencing and other applications
US11746367B2 (en) 2015-04-17 2023-09-05 President And Fellows Of Harvard College Barcoding systems and methods for gene sequencing and other applications
US10560552B2 (en) 2015-05-21 2020-02-11 Noblis, Inc. Compression and transmission of genomic information
EP3475422A4 (en) * 2015-06-23 2021-01-06 Tupac Bio, Inc. Computer-implemented method for designing synthetic dna, and terminal, system and computer-readable medium for the same
JP2019527443A (en) * 2015-06-23 2019-09-26 Tupac Bio, Inc. Computer-implemented method for designing synthetic DNA, and terminal, system and computer-readable medium for the same
WO2017222596A1 (en) 2015-06-23 2017-12-28 Tupac Bio, Inc. Computer-implemented method for designing synthetic dna, and terminal, system and computer-readable medium for the same
WO2016210191A1 (en) * 2015-06-23 2016-12-29 Tupac Bio, Inc. Computer-implemented method for designing synthetic dna, and terminal, system and computer-readable medium for the same
CN108018607B (en) * 2016-10-28 2021-04-27 深圳华大基因股份有限公司 Tag sequence library mixing method and device for improving sequencing platform library resolution rate
CN108018607A (en) * 2016-10-28 2018-05-11 深圳华大基因股份有限公司 Tag sequence library mixing method and device for improving sequencing platform library resolution rate
CN108823317A (en) * 2017-05-05 2018-11-16 深圳市华大海洋研究院 DNA bar code for the detection of Xu Shi tooth mudskipper and its application
US11222712B2 (en) 2017-05-12 2022-01-11 Noblis, Inc. Primer design using indexed genomic information
CN109979536A (en) * 2019-03-07 2019-07-05 青岛市疾病预防控制中心(青岛市预防医学研究院) Species identification method based on DNA bar code
CN109979536B (en) * 2019-03-07 2022-12-23 青岛市疾病预防控制中心(青岛市预防医学研究院) Species identification method based on DNA bar code

Similar Documents

Publication Publication Date Title
WO2008000090A1 (en) DNA barcode sequence classification
Down et al. NestedMICA: sensitive inference of over-represented motifs in nucleic acid sequence
Ghodsi et al. DNACLUST: accurate and efficient clustering of phylogenetic marker genes
Chain et al. An applications-focused review of comparative genomics tools: Capabilities, limitations and future challenges
Merkel et al. Detecting short tandem repeats from genome data: opening the software black box
Emms et al. SHOOT: phylogenetic gene search and ortholog inference
Parisi et al. STRING: finding tandem repeats in DNA sequences
Shibuya et al. Dictionary-driven prokaryotic gene finding
Chesters et al. A DNA Barcoding system integrating multigene sequence data
Portik et al. SuperCRUNCH: A bioinformatics toolkit for creating and manipulating supermatrices and other large phylogenetic datasets
Dinu et al. An efficient rank based approach for closest string and closest substring
Hua et al. Towards comprehensive integration and curation of chloroplast genomes
Mørk et al. Evaluating bacterial gene-finding HMM structures as probabilistic logic programs
Svetlitsky et al. CSBFinder: discovery of colinear syntenic blocks across thousands of prokaryotic genomes
Velasco et al. Look4TRs: a de novo tool for detecting simple tandem repeats using self-supervised hidden Markov models
US20080263002A1 (en) Base Sequence Retrieval Apparatus
Vilo Pattern discovery from biosequences
Nicolas et al. Finding and characterizing repeats in plant genomes
Park et al. UPP2: fast and accurate alignment of datasets with fragmentary sequences
Miklós et al. Stochastic models of sequence evolution including insertion-deletion events
Pavesi et al. Using Weeder for the discovery of conserved transcription factor binding sites
Bennett et al. SeqWho: reliable, rapid determination of sequence file identity using k-mer frequencies in Random Forest classifiers
Esmat et al. A parallel hash‐based method for local sequence alignment
Gupta et al. DAVI: Deep learning-based tool for alignment and single nucleotide variant identification

Legal Events

Date Code Title Description
121 EP: the EPO has been informed by WIPO that EP was designated in this application (Ref document number: 07720061; Country of ref document: EP; Kind code of ref document: A1)
NENP Non-entry into the national phase (Ref country code: DE)
NENP Non-entry into the national phase (Ref country code: RU)
122 EP: PCT application non-entry in European phase (Ref document number: 07720061; Country of ref document: EP; Kind code of ref document: A1)