WO1999049403A1 - System and methods for analyzing biomolecular sequences - Google Patents

System and methods for analyzing biomolecular sequences

Info

Publication number
WO1999049403A1
WO1999049403A1 PCT/US1999/006575 US9906575W WO9949403A1 WO 1999049403 A1 WO1999049403 A1 WO 1999049403A1 US 9906575 W US9906575 W US 9906575W WO 9949403 A1 WO9949403 A1 WO 9949403A1
Authority
WO
WIPO (PCT)
Prior art keywords
sequences
bins
regions
similarity
polymer
Prior art date
Application number
PCT/US1999/006575
Other languages
English (en)
Inventor
Stephen E. Lincoln
David M. Hodgson
Peter A. Spiro
Frank D. Russo
Ingrid E. Akerblom
Jennifer L. Hillman
Anissa Lee Jones
Shawn Robert Bratcher
Howard Jerome Cohen
Gerard Dufour
Michael Peter Wood
Alexander George Koleszar
Steven C. Banville
Claudia Alden CASE
Original Assignee
Incyte Pharmaceuticals, Inc.
Priority date
Filing date
Publication date
Application filed by Incyte Pharmaceuticals, Inc.
Priority to CA002325469A (patent CA2325469A1)
Priority to AU34537/99A (patent AU771877B2)
Priority to EP99916165A (patent EP1066576A1)
Priority to JP2000538305A (patent JP2002508546A)
Publication of WO1999049403A1

Classifications

    • G PHYSICS
    • G16 INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR SPECIFIC APPLICATION FIELDS
    • G16B BIOINFORMATICS, i.e. INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR GENETIC OR PROTEIN-RELATED DATA PROCESSING IN COMPUTATIONAL MOLECULAR BIOLOGY
    • G16B30/00 ICT specially adapted for sequence analysis involving nucleotides or amino acids
    • G16B30/10 Sequence alignment; Homology search
    • G16B30/20 Sequence assembly

Definitions

  • the present invention relates generally to bioinformatics, and particularly to a system and method for analyzing biomolecular sequences.
  • bioinformatics includes the development of methods to search databases quickly, to analyze nucleic acid sequence information, and to predict protein sequence and structure from DNA sequence data.
  • molecular biology is shifting from the laboratory bench to the computer desktop. Advanced quantitative analyses, database comparisons, and computational algorithms are needed to explore the relationships between sequence and phenotype.
  • a gene 30 is the basic unit of genetic information which is made up of a set of DNA sequences.
  • a gene 30 is transcribed into an RNA primary transcript. This primary transcript is typically spliced to create a mature mRNA, which is then translated into a polypeptide (protein), which performs some function in the cell.
  • An exon 32 is a coding region of the gene 30, while an intron 34 is a control or non-coding region of the gene 30.
  • the most complete representation of a gene 30 is a genomic DNA sequence completely covering the coding, control and non-coding regions of the gene 30.
  • the gene 30 is edited by removing the introns, and splicing together the remaining exons.
  • the transcript may be spliced, by the optional inclusion or exclusion of each intron or exon. The various arrangements that result are called splice variants.
  • the exons are labeled as 1, 2, 3 and 4.
  • the same gene 30 may generate different mRNA sequences for healthy and diseased tissue, 42 and 44, respectively.
  • the diseased tissue 42 includes sequences from exons 1, 2 and 4, while the healthy tissue 44 includes sequences from exons 1, 2, 3 and 4.
  • Fig. 2 further illustrates the relationship of expressed sequence tags (ESTs) 46 to mRNA (mRNA1 and mRNA2) and genomic sequences.
  • A gene may be transcribed into multiple copies of mRNA. Each mRNA is reverse transcribed into a different cDNA sequence.
  • An EST 46 is a sampling of a cDNA sequence. ESTs 46 are partial transcript sequences that may cover different parts of the mRNA(s) of a gene, depending on cloning and sequencing strategy.
  • In genomic research, DNA, mRNA, and cDNA molecules are broken into fragments, the nucleotide sequences of the fragments are identified, the sequence data for the fragments are input into a database, and a computer program attempts to electronically re-assemble the sequence fragments.
  • There are two types of assembly processes for this data.
  • For genomic data, the DNA from one or more individuals is broken up, individual portions or sequences of the DNA are identified, and then the sequences are reassembled using computer-based methods. Any given fragment of a genomic sequence should be represented at approximately the same level, and there is theoretically one correct way to reassemble these fragments into a linear sequence representing the original genomic DNA.
  • Fig. 3 is a flowchart of a typical computer-based assembly process for EST data.
  • clusters are generated from the EST data.
  • the clustering process groups ESTs based on the similarity between pairs of sequences (pairwise similarity) that make up the ESTs.
  • a computer program, such as BLAST, receives the EST data from two ESTs and generates a score based on the similarity of the bases making up the ESTs. If the score exceeds a predetermined threshold, the ESTs are grouped into the same cluster.
  • In step 54, within each cluster, the ESTs are assembled into sequence data. Typically, a single cluster will produce many contiguous sequences. Ideally, for each cluster, the goal is to generate a consensus sequence that represents the entire cluster.
  • This prior art method has two problems. First, the clustering technique tends to overcluster the ESTs. In other words, the method generates too few clusters with too many ESTs in each cluster. Second, the assembly process generates too many consensus sequences. To solve these problems, one prior art method clusters ESTs and selects a single consensus sequence to represent the cluster. For those clusters with multiple consensus sequences, another prior art method designates each consensus sequence as a different gene.
  • the same gene may generate multiple cDNA sequences. Therefore, the prior art methods may designate splice variants as different genes. Because individuals can vary in expression of the same gene over long sequences, there is a need for a clustering method that tolerates differences over long sequences. Conversely, cDNA sequences from different genes may be quite similar. Therefore, the clustering method needs to distinguish consensus sequences from different genes from splice variants of the same gene.
  • a false positive occurs when a similarity score exceeds a predetermined threshold but the ESTs are, in reality, from different parts of the gene or from different genes.
  • stringent thresholds can be set for the similarity scores to reduce false positives. However, too high a threshold tends to break clusters apart too much, and therefore to undercluster. A method is therefore needed that avoids both the underclustering and the overclustering problems.
  • some clusters may generate multiple consensus sequences.
  • a method of identifying and displaying consensus sequences that are splice variants of the same gene is needed.
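The threshold-based clustering of the prior art method can be illustrated with a minimal sketch. It assumes the pairwise similarity scores have already been computed (for example by BLAST); the function name, the EST identifiers and the score values are hypothetical. Because any pair above the threshold merges two clusters, est1 and est3 end up together without ever being compared directly, which is one way overclustering arises, while a stricter threshold breaks the same data apart.

```python
def cluster_by_threshold(est_ids, pair_scores, threshold):
    """Single-linkage grouping: a pair scoring above threshold shares a cluster."""
    parent = {est: est for est in est_ids}

    def find(est):
        # Follow parent links to the cluster representative, compressing the path.
        while parent[est] != est:
            parent[est] = parent[parent[est]]
            est = parent[est]
        return est

    for (a, b), score in pair_scores.items():
        if score > threshold:
            parent[find(a)] = find(b)

    clusters = {}
    for est in est_ids:
        clusters.setdefault(find(est), []).append(est)
    return sorted(sorted(members) for members in clusters.values())


# Hypothetical pairwise scores for four ESTs.
scores = {("est1", "est2"): 200, ("est2", "est3"): 180, ("est3", "est4"): 90}
ests = ["est1", "est2", "est3", "est4"]
loose = cluster_by_threshold(ests, scores, threshold=150)
strict = cluster_by_threshold(ests, scores, threshold=190)
```

With the lower threshold three ESTs are chained into one cluster; with the higher threshold est3 is separated again.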
  • Polymer sequences are assembled into bins. A first number of bins are populated with polymer sequences. The polymer sequences in each bin are assembled into one or more consensus sequences representative of the polymer sequences of the bin. The consensus sequences of the bins are compared to determine relationships, if any, between the consensus sequences. The bins are modified based on the relationships between the consensus sequences. The polymer sequences are reassembled in the modified bins to generate one or more modified consensus sequences for each bin representative of the modified bins.
  • sequence similarities and dissimilarities are analyzed in a set of polymer sequences.
  • Pairwise alignment data is generated for pairs of the polymer sequences.
  • the pairwise alignment data defines regions of similarity between the pairs of polymer sequences with boundaries. Additional boundaries in particular polymer sequences are determined by applying at least one boundary from at least one pairwise alignment for one pair of polymer sequences to at least one other pairwise alignment for another pair of polymer sequences including one of the particular polymer sequences. Additional regions of similarity are generated based on the boundaries.
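The idea of applying a boundary from one pairwise alignment to another can be sketched as follows. This is an illustration under simplifying assumptions, not the procedure of the similarity boundary finder: alignments are treated as ungapped, coordinates are half-open, and the `Alignment` tuple and function names are hypothetical.

```python
from collections import namedtuple

# One ungapped region of similarity: seq_a[start_a:end_a] aligns to seq_b[start_b:end_b].
Alignment = namedtuple("Alignment", "seq_a start_a end_a seq_b start_b end_b")


def initial_boundaries(alignments):
    """Collect the region boundaries that each pairwise alignment defines directly."""
    bounds = {}
    for al in alignments:
        bounds.setdefault(al.seq_a, set()).update((al.start_a, al.end_a))
        bounds.setdefault(al.seq_b, set()).update((al.start_b, al.end_b))
    return bounds


def propagate_boundaries(alignments):
    """Project boundaries through alignments until no new boundary appears."""
    bounds = initial_boundaries(alignments)
    changed = True
    while changed:
        changed = False
        for al in alignments:
            directions = (
                (al.seq_a, al.start_a, al.end_a, al.seq_b, al.start_b),
                (al.seq_b, al.start_b, al.end_b, al.seq_a, al.start_a),
            )
            for src, s_start, s_end, dst, d_start in directions:
                for pos in list(bounds[src]):
                    # A boundary strictly inside the aligned region maps to the partner.
                    if s_start < pos < s_end:
                        mapped = d_start + (pos - s_start)
                        if mapped not in bounds[dst]:
                            bounds[dst].add(mapped)
                            changed = True
    return {seq: sorted(b) for seq, b in bounds.items()}


# A aligns to B over its full length, and the last 60 bases of A align to C.
alignments = [
    Alignment("A", 0, 100, "B", 0, 100),
    Alignment("A", 40, 100, "C", 0, 60),
]
boundaries = propagate_boundaries(alignments)
```

Here the boundary at position 40 of A, which comes from the A/C alignment, is applied to the A/B alignment and yields an additional boundary at position 40 of B.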
  • Fig. 1 is an example of gene expression.
  • Fig. 2 depicts the relationship of ESTs to mRNA and genomic sequences.
  • Fig. 3 is a flowchart of a prior art clustering and assembly process.
  • Fig. 4A is a diagram of a client-server system using the present invention.
  • Fig. 4B is a diagram of a computer system with a memory storing exemplary procedures and data of the present invention.
  • Fig. 5A is an exemplary gene bin with a single consensus sequence and EST data.
  • Fig. 5B is another exemplary gene bin with multiple consensus sequences and EST data.
  • Fig. 6 is a flowchart of a method of generating gene bins of the present invention.
  • Fig. 7A illustrates the population and assembling of ESTs in gene bins.
  • Fig. 7B illustrates the joining of two exemplary gene bins.
  • Fig. 7C illustrates the splitting of the gene bin of Fig. 7A.
  • Fig. 8 is a flowchart of a filter applied prior to the assembly or re-assembly process.
  • Fig. 9 is a flowchart of a method of mapping persistent bin identifiers when new EST data is added to the database.
  • Fig. 10 is a table used for tracking inheritance of old gene bin identifiers to new bin identifiers used with the method of Fig. 9.
  • Fig. 11 is a flowchart of an alternate embodiment of populating an initial set of gene bins.
  • Fig. 12 is a flowchart of a method of identification of cross-species gene links.
  • Fig. 13 is a flowchart of a general method of determining conserved regions across input sequences.
  • Fig. 14 is an alternate embodiment of the method of Fig. 13.
  • Fig. 15 is a diagram of three sequences showing regions of similarity and boundaries.
  • Fig. 16 is a more detailed flowchart of the method of Fig. 13.
  • Fig. 17 is a detailed flowchart of the method of identifying and determining segments with multiple alignments among the received input sequences of Fig. 16.
  • Fig. 18 shows data structures used with the method of Fig. 17.
  • Fig. 19 is an exemplary display of multiple consensus sequences and segment graph.
  • Figs. 20A and 20B are a flowchart of a method of displaying consensus sequences and a segment graph for identification of splice variants among the consensus sequences as shown in Fig. 19.
  • a network system is used to retrieve information stored in the biomolecular expression information processing system of the present invention.
  • the major network system components include:
  • at least one client computer 60, 62;
  • a firewall gateway server 70 that connects to the internet 72 to access external databases 74.
  • Fig. 4A depicts the memories 80, 82 of the client computers 60, 62, respectively.
  • an operating system 84 such as UNIX
  • a web browser 86 such as Netscape.
  • the network server 64 has a UNIX operating system 84, an application software module 88 and a relational database management system (RDBMS) 90 such as Oracle.
  • the application module 88 uploads JAVA classes 92 from the server 64 to the client system 80.
  • the JAVA classes 92 include a similarity boundary finder 94 and a template viewer 96 which will be discussed below.
  • the web browser 86 executes the uploaded JAVA classes 98 which use JAVA objects 100 to provide a graphical user interface 102 to the application module 88 for the user.
  • a subset of the JAVA objects 100 are loaded with data from the database 68.
  • JAVA classes 98 on client 80 build a SQL statement based on user defined criteria that is passed to a CGI 104 on the network server 64.
  • the CGI 104 then passes the SQL statement to the RDBMS 90.
  • the RDBMS 90 executes the SQL statement and returns the retrieved data to the CGI 104 which, in turn, passes the data back to the client 80.
  • the JAVA classes 98 populate the JAVA objects 100 with the retrieved data, and the results are displayed on the client computer 80.
  • methods within the JAVA classes 98 pass a parameter to the CGI script 104 which builds a SQL statement using a SQL Query Generator 106.
  • the SQL statement is passed to the RDBMS 90.
  • the gene bin database 68 is stored on storage media in a storage device 66 such as a disk drive.
  • the gene bin database 68 stores the data in tables 108.
  • the client systems 80, 82 access public domain resources on the Internet 72 via the firewall gateway server 70.
  • the client systems 80, 82, the network server 64 and the firewall gateway server 70 are networked via an intranet 109 using the TCP/IP protocol.
  • a generate_gene_bin procedure 110 uses the methods of the present invention to process expression data 112 to generate gene bins and a gene bin database 114, which will be described below.
  • the client system 82 copies the database onto one of the storage devices 66 on the network server 64 where the copied gene bin database 68 is made available to all users.
  • the network server 64 generates the gene bin database 68.
  • the graphical user interface 102 allows the user to graphically construct search requests to retrieve data from the tables 108 of the gene bin database 68. The commands of the search request are called queries. As described above, either the JAVA classes or the CGI scripts generate the database queries.
  • the gene bin database 68 has many tables 108 storing information including gene bins, consensus sequences and ESTs.
  • an exemplary network server computer system 120 stores exemplary procedures and data of the present invention in a memory 122.
  • the memory 122 includes both semiconductor memory and disk memory.
  • a system bus 124 connects a processor 126, a display 128, a keyboard 130, a mouse 132, a network interface 134 that connects to the intranet, a disk drive 136 and the semiconductor memory 122.
  • the procedures and data can also be stored on the disk drive 66.
  • the procedures include the operating system 84, such as UNIX; the Web Browser 86, such as Netscape; and the set of application modules 136, which includes the following.
  • the Generate Gene Bin Procedure 110 creates the gene bins of the present invention.
  • the EST data 112 from both private and public databases includes both the raw and processed EST data.
  • a block 1 sequence preparation procedure 138 receives the raw EST data and outputs processed EST data for the gene bin database.
  • a populate bins procedure 140 populates an initial set of gene bins.
  • BLAST Basic Local Alignment Search Tool
  • NCBI National Center for Biotechnology Information
  • HSP high-scoring segment pairs
  • a "phragment" assembly program (PHRAP) 144 assembles shotgun DNA sequence data such as the processed EST data.
  • a representative EST filter 146 generates a representative set of EST sequences to be processed by PHRAP 144.
  • An ID&Remove_Bins procedure 148 is used to exclude a predetermined subset of bins from the joining and splitting process of the present invention.
  • Cross_match 150, developed by Phil Green at the University of Washington, is a computer program for rapid protein and nucleic acid sequence comparison and database searches based on the Smith-Waterman-Gotoh algorithm. In the present invention, Cross_match was modified to obtain sequence alignment comparison results that are independent of the order in which the input sequences are compared.
  • An annotate_bins procedure 152 adds annotation data for certain consensus sequences to the database.
  • a compare_bins procedure 154 compares the consensus sequences of the gene bins.
  • a join_bins procedure 156 joins gene bins.
  • a split_bins procedure 158 splits gene bins.
  • a FASTX procedure 160 is a database search tool used to compare nucleotide sequences to a peptide sequence database. The procedure is based on the rapid sequence algorithm described by Lipman and
  • a map_persistent_bin_id procedure 162 maps bin identifiers between old and new versions of the gene bin database.
  • the template viewer procedure 96 displays the consensus sequences of the gene bins with their assembled ESTs.
  • the gene bin database 68 is stored in the memory 122.
  • a similarity boundary finder 94 finds similar boundaries and segments across input sequences while accommodating gaps.
  • the similarity boundary finder 94 identifies, aligns and displays consensus segments among an arbitrarily large number of input sequences.
  • the RDBMS 90 is also stored in the memory 122.
  • the similarity boundary finder 94 includes a set of procedures and data structures.
  • the procedures include:
  • An id_similar_regions procedure 166 that identifies shared regions of similarity among different sequences and within a sequence
  • a display_con_sequence procedure 168 that displays the shared regions of similarity among different sequences in a spatially aligned manner
  • a display_segment_map procedure 170 that displays a segment map of the input sequences.
  • the data structures include: input sequence strings 172; Cross_match output 174; boundary lists 176; equivalent boundary lists 178; a directed graph array 180; and a topological ordering list.
  • an exemplary gene bin 200 has a single consensus sequence 202 that represents assembled EST data 204.
  • the term "gene" or "genes" refers to the partial or complete coding sequence of a gene.
  • Gene bins 200 are sequence-based clusters which have been grouped together.
  • a gene bin 200 is designed to associate or store all the EST sequences 204 for a particular single gene.
  • An EST 204 belongs to only one gene bin 200.
  • A gene bin 200 is associated with the component sequences 204 for a particular single gene.
  • the PHRAP assembly program is run using the ESTs 204 of the bin 200 to generate at least one consensus sequence 202.
  • the consensus sequence 202 acts as a template for that gene.
  • Each base of the assembled sequence represents the consensus of base calls in the component sequences 204 aligned at that position.
  • the component sequences 212 generate multiple consensus sequences 214, 216, 218.
  • each consensus sequence 214, 216, 218 acts as a template for the gene associated with the gene bin 210.
  • Gene bins 210 with multiple templates or consensus sequences 214, 216, 218 may denote or represent genes with alternative splicing or significant polymorphism.
  • the gene bins are implemented in tables of the relational database. Each gene bin has a gene bin identifier, each consensus sequence has a consensus sequence identifier and each EST has an identifier. Tables in the database associate the gene bins with consensus sequences and ESTs using the gene bin, consensus sequence and EST identifiers, respectively. Other tables associate the EST data with consensus sequences using the EST and consensus sequence identifiers.
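The table layout described above can be sketched with a small relational schema. This is an assumed layout for illustration only: the table and column names are hypothetical, and Python's built-in sqlite3 module stands in for the Oracle RDBMS named in the text.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE gene_bin (bin_id INTEGER PRIMARY KEY);
CREATE TABLE consensus (
    consensus_id INTEGER PRIMARY KEY,
    bin_id INTEGER NOT NULL REFERENCES gene_bin(bin_id),
    sequence TEXT NOT NULL
);
-- A single bin_id column per EST reflects that an EST belongs to only one gene bin.
CREATE TABLE est (
    est_id INTEGER PRIMARY KEY,
    bin_id INTEGER NOT NULL REFERENCES gene_bin(bin_id),
    sequence TEXT NOT NULL
);
-- Associates EST data with the consensus sequences it was assembled into.
CREATE TABLE est_consensus (
    est_id INTEGER NOT NULL REFERENCES est(est_id),
    consensus_id INTEGER NOT NULL REFERENCES consensus(consensus_id),
    PRIMARY KEY (est_id, consensus_id)
);
""")

# Made-up rows: one bin with two consensus sequences and two ESTs.
conn.execute("INSERT INTO gene_bin VALUES (1)")
conn.executemany("INSERT INTO consensus VALUES (?, ?, ?)",
                 [(10, 1, "ACGTACGT"), (11, 1, "ACGTTTGT")])
conn.executemany("INSERT INTO est VALUES (?, ?, ?)",
                 [(100, 1, "ACGTAC"), (101, 1, "GTTTGT")])
conn.executemany("INSERT INTO est_consensus VALUES (?, ?)",
                 [(100, 10), (101, 11)])


def ests_for_consensus(connection, consensus_id):
    """Return the identifiers of the ESTs associated with one consensus sequence."""
    rows = connection.execute(
        "SELECT e.est_id FROM est e "
        "JOIN est_consensus ec ON ec.est_id = e.est_id "
        "WHERE ec.consensus_id = ? ORDER BY e.est_id",
        (consensus_id,),
    )
    return [row[0] for row in rows]
```

A query such as `ests_for_consensus(conn, 10)` then follows the identifiers from a consensus sequence back to its component ESTs.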
  • the component sequences or EST data may come from public and private databases.
  • Fig. 6 is a flowchart of a method of generating gene bins of the present invention used in the generate_gene_bin 110 procedure of Fig. 4B. The flowchart will be described in general, followed by a detailed discussion of each of the steps.
  • In step 222, new raw sequence or EST data is received and processed in a set of block 1 procedures (138, Fig. 4B).
  • Step 224 populates an initial set of gene bins with the EST data using the populate_bin procedure (140, Fig. 4B).
  • a filter (146, Fig. 4B) is applied to the ESTs in the gene bins to determine a representative set of ESTs which will be assembled using PHRAP. In an alternate embodiment, the filter is not used.
  • within each bin, the PHRAP assembler (144, Fig. 4B) is used to assemble the ESTs in the bin to generate one or more consensus sequences.
  • the id_&_remove_bins procedure (148, Fig. 4B) identifies a predetermined set of bins and removes them from further processes.
  • a compare_bins procedure (154, Fig. 4B) compares the consensus sequences of the bins to determine relationships, if any, between the consensus sequences of the bins.
  • a join_bins procedure (156, Fig. 4B) joins bins based on the relationships of the consensus sequences to generate modified bins.
  • In step 236, the filter (146, Fig. 4B) is applied to the EST data of the modified bins. In an alternate embodiment, the filter is not used.
  • In step 240, the consensus sequences in the modified bins are compared to determine relationships, if any, between the consensus sequences.
  • In step 242, the modified bins are split based on the relationships of the consensus sequences using the split_bins procedure (158, Fig. 4B).
  • In step 244, the method determines whether the comparing, joining and splitting process should repeat. If so, the process continues at step 232.
  • In step 246, bins may be joined based on clone information.
  • the filter (146, Fig. 4B) is applied to the EST data of the modified bins. In an alternate embodiment, the filter is not used.
  • the PHRAP assembler (144, Fig. 4B) is used to re-assemble the ESTs in the modified bins to generate one or more consensus sequences.
  • the bins are annotated.
  • the template viewer procedure (96, Fig.
  • the method of the present invention provides a set of gene bins that avoids the overclustering and underclustering of the prior art and that tends to group splice variants of the same gene.
  • Block 1 Sequence Preparation. In step 222, block 1 sequence preparation is performed. After raw sequence data is extracted from a sequencing chromatogram, the raw sequence data passes through a series of filters. First, low quality sequences and those with sequencing artifacts are clipped on the basis of quality scores. Next, recognized 5' and 3' vector sequences are clipped using a method based on dynamic programming. Then regular expression matching to 3' PolyA (or 5' PolyT) patterns is used to clip the mRNA tail.
  • BLAST comparisons are performed to further filter the sequence data.
  • Low-information segments, such as dinucleotide repeats, are masked (replaced by "n"s) when the BLAST similarity score is greater than or equal to 150, to prevent subsequent spurious matches.
  • the "n"s are different from "N"s, which are used to represent ambiguities found during sequencing.
  • Raw sequences containing recognized contamination sequences are removed from further bioanalysis when the BLAST similarity score is greater than or equal to 130.
  • Dispersed repetitive elements such as Alu, LINE and MIR are masked when the BLAST similarity score is greater than or equal to 150.
  • Known repetitive elements are present in many copies in the genome. Their functional relevance is very low and they would cause assembly problems if included.
  • ribosomal RNA sequences are removed based on a BLAST similarity score greater than or equal to 150.
  • the initial bin set is populated with clusters of those sequences having at least fifty bases.
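The regular-expression tail clipping and the fifty-base minimum described above can be sketched as follows. The minimum run length of ten bases for a PolyA or PolyT pattern is an assumption, as are the function names; quality clipping, vector clipping and the BLAST-based masking steps are not modeled here.

```python
import re

# Assumed patterns: a run of at least ten A's at the 3' end or T's at the 5' end.
POLY_A_TAIL = re.compile(r"A{10,}$")
POLY_T_HEAD = re.compile(r"^T{10,}")
MIN_LENGTH = 50  # sequences shorter than fifty bases are not binned


def clip_tail(sequence):
    """Remove a 3' PolyA run and a 5' PolyT run, leaving masked 'n' bases untouched."""
    sequence = POLY_A_TAIL.sub("", sequence)
    return POLY_T_HEAD.sub("", sequence)


def prepare(sequences):
    """Clip every sequence and keep only those with at least MIN_LENGTH bases."""
    clipped = {name: clip_tail(seq) for name, seq in sequences.items()}
    return {name: seq for name, seq in clipped.items() if len(seq) >= MIN_LENGTH}


# Made-up reads: est1 keeps 60 bases after clipping, est2 keeps only 20.
raw = {
    "est1": "ACGT" * 15 + "A" * 20,
    "est2": "ACGT" * 5 + "A" * 20,
}
prepared = prepare(raw)
```

Only est1 survives, because est2 has fewer than fifty bases once its tail is clipped.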
  • the PHRAP assembly program generates at least one consensus sequence for each gene bin.
  • the version of PHRAP used in this method was modified to interpret a set of private sequence identifier conventions.
  • other assembly programs, such as FAKII, which was developed by Eugene W. Myers, are used.
  • Cross_match (150, Fig. 4B) compares all unassigned ESTs to all consensus sequences using a Smith-Waterman based algorithm.
  • An unassigned EST sequence is added to the bin with the consensus sequence that yields the highest Smith-Waterman score. New bins are created for the non-matching unassigned EST sequences.
  • PHRAP has the advantage of being able to incorporate base quality values into the assembly process. This extra data is essential to achieve the sensitivity and accuracy required for EST assembly.
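The assignment of unassigned ESTs to the bin whose consensus sequence gives the highest Smith-Waterman score can be sketched as below. This is a plain textbook Smith-Waterman score with a linear gap penalty, not Cross_match; the scoring parameters, the `min_score` cutoff and the function names are assumptions made for the example.

```python
def smith_waterman_score(a, b, match=2, mismatch=-1, gap=-2):
    """Best local alignment score between a and b (linear gap penalty)."""
    previous = [0] * (len(b) + 1)
    best = 0
    for i in range(1, len(a) + 1):
        current = [0] * (len(b) + 1)
        for j in range(1, len(b) + 1):
            diagonal = previous[j - 1] + (match if a[i - 1] == b[j - 1] else mismatch)
            current[j] = max(0, diagonal, previous[j] + gap, current[j - 1] + gap)
            best = max(best, current[j])
        previous = current
    return best


def assign_ests(unassigned, bins, min_score):
    """Map each EST to the best-scoring bin, or to a new bin if nothing matches.

    bins maps an integer bin identifier to a list of consensus sequences.
    """
    assignment = {}
    next_bin = max(bins, default=0) + 1
    for est_id, sequence in unassigned.items():
        scored = [
            (smith_waterman_score(sequence, consensus), bin_id)
            for bin_id, consensus_list in bins.items()
            for consensus in consensus_list
        ]
        score, bin_id = max(scored, default=(0, None))
        if bin_id is None or score < min_score:
            # Non-matching EST: create a new bin for it.
            bin_id = next_bin
            next_bin += 1
        assignment[est_id] = bin_id
    return assignment


# Made-up consensus sequences and ESTs.
bins = {1: ["ACGTACGTAC"], 2: ["TTTTGGGGCC"]}
unassigned = {"e1": "GTACGT", "e2": "GGGGCC", "e3": "CACACA"}
assignment = assign_ests(unassigned, bins, min_score=10)
```

In this toy data e1 and e2 are exact substrings of the consensus sequences of bins 1 and 2, while e3 matches neither well enough and receives a new bin.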
  • In step 232, the bins are modified based on the relationship between the consensus sequences among all the bins. All consensus sequences in all bins are compared to each other using BLAST2. A high BLAST2 score indicates high sequence overlap and identity.
  • all consensus sequences in all bins are compared to each other using BLAST. If the BLAST score exceeds 150 for a pair of consensus sequences, Cross_match is executed using that pair of consensus sequences to verify the BLAST score and generate the local identity.
  • bins are joined when at least one consensus sequence overlaps a consensus sequence in another bin with at least 82% local identity according to BLAST2. In an alternate embodiment, bins are joined when the local identity is at least 92%. In another alternate embodiment, bins are joined when the local identity is at least 85%.
  • the PHRAP assembly program generates at least one consensus sequence for each modified gene bin.
  • Cross_match is used to compare the consensus sequences of the reassembled bins.
  • the Smith-Waterman algorithm is used instead of Cross_match.
  • bins are split when the overlap between the consensus sequences results in less than 95% identity or the length of the alignment is less than fifty base pairs.
  • the consensus sequences with insufficient overlap or alignment are split out to form a new bin.
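The joining and splitting criteria can be expressed as two small predicates plus a merge step. The thresholds are those of the first embodiment described above (82% local identity to join; less than 95% identity or fewer than fifty aligned base pairs to split). The function names and the shape of the `hits` list are hypothetical, and the comparison results are assumed to come from an external tool.

```python
JOIN_MIN_IDENTITY = 82.0    # percent local identity needed to join two bins
SPLIT_MIN_IDENTITY = 95.0   # below this percent identity, consensus sequences are split
SPLIT_MIN_LENGTH = 50       # below this alignment length (bp), they are split


def should_join(local_identity):
    return local_identity >= JOIN_MIN_IDENTITY


def should_split(identity, alignment_length):
    return identity < SPLIT_MIN_IDENTITY or alignment_length < SPLIT_MIN_LENGTH


def join_bins(bins, hits):
    """Merge bins linked by a qualifying consensus-to-consensus hit.

    bins maps bin_id -> set of EST ids; hits is a list of
    (bin_a, bin_b, local_identity) tuples from the comparison step.
    """
    merged = {bin_id: set(members) for bin_id, members in bins.items()}
    alias = {bin_id: bin_id for bin_id in bins}

    def resolve(bin_id):
        # Follow earlier merges to the bin that currently holds the members.
        while alias[bin_id] != bin_id:
            bin_id = alias[bin_id]
        return bin_id

    for bin_a, bin_b, identity in hits:
        root_a, root_b = resolve(bin_a), resolve(bin_b)
        if root_a != root_b and should_join(identity):
            merged[root_a] |= merged.pop(root_b)
            alias[root_b] = root_a
    return merged


# Made-up bins and comparison results.
example_bins = {1: {"e1", "e2"}, 2: {"e3"}, 3: {"e4"}}
example_hits = [(1, 2, 90.0), (2, 3, 70.0)]
joined = join_bins(example_bins, example_hits)
```

Bins 1 and 2 are joined at 90% local identity, while the 70% hit between bins 2 and 3 is ignored.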
  • In step 244, the process of comparing all consensus sequences across all bins, joining bins, re-assembling bins, re-comparing bins and splitting bins repeats until convergence of the database is achieved. Convergence of the database is achieved when the bin compositions do not change significantly between iterations.
  • the process of comparing all consensus sequences across all bins, joining bins, re-assembling bins, re-comparing bins and splitting bins repeats for a predetermined number of iterations.
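Both stopping rules (convergence of the bin compositions, or a predetermined number of iterations) fit in one loop. In this sketch the fraction of ESTs whose bin membership changed is used as the measure of "significant" change; that measure, the `tolerance` parameter and the `toy_refine` stand-in for one compare/join/re-assemble/split pass are all assumptions.

```python
def composition(bins):
    """Bin memberships as a set of frozensets, ignoring bin identifiers."""
    return {frozenset(members) for members in bins.values()}


def changed_fraction(old_bins, new_bins):
    """Fraction of ESTs that sit in a bin whose membership differs from before."""
    old_sets, new_sets = composition(old_bins), composition(new_bins)
    total = sum(len(members) for members in new_sets)
    moved = sum(len(members) for members in new_sets if members not in old_sets)
    return moved / total if total else 0.0


def iterate_until_converged(bins, refine, tolerance=0.0, max_iterations=10):
    """Apply refine repeatedly until bins stop changing or the limit is reached."""
    for iteration in range(1, max_iterations + 1):
        new_bins = refine(bins)
        if changed_fraction(bins, new_bins) <= tolerance:
            return new_bins, iteration
        bins = new_bins
    return bins, max_iterations


def toy_refine(bins):
    """Hypothetical refinement pass: joins bins 'a' and 'b' if both exist."""
    bins = {bin_id: set(members) for bin_id, members in bins.items()}
    if "a" in bins and "b" in bins:
        bins["a"] |= bins.pop("b")
    return bins


final_bins, iterations = iterate_until_converged({"a": {1}, "b": {2}, "c": {3}}, toy_refine)
```

The first pass joins two bins, the second pass changes nothing, and the loop stops after two iterations.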
  • a single EST clone may be used multiple times to perform sequencing reactions in the laboratory. Therefore, a clone may be associated with multiple sequences. For example, a single clone may be associated with a 5' first-pass sequence, a 5' long-read sequence and a 3' first-pass sequence.
  • In step 246, after a number of iterations of joining and splitting bins based on their consensus sequences, bins are joined based on clone information. If the 5' sequence of one clone is present in one bin and the 3' sequence from the same clone is present in a different bin, it is likely that the two bins actually belong together in a single bin. Since it is possible that a single clone may be chimeric, bins are joined in this step if there are at least two different clones with a 5' and 3' sequence in each of the bins to be joined.
  • Bins are not joined if the resulting bin would be very large, having 5,000 or more ESTs.
  • clone joining is not applied to bins with annotation hits to common genes, nor is clone joining performed on inert bins.
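The clone-based joining rule can be sketched as a search for pairs of bins linked by at least two different clones. The record layout `(clone_id, end, bin_id)` and the function name are hypothetical; the `excluded` argument stands in for inert bins and bins with annotation hits to common genes.

```python
from collections import defaultdict
from itertools import product

MIN_SHARED_CLONES = 2   # guards against a single chimeric clone
MAX_JOINED_SIZE = 5000  # bins are not joined if the result would reach this size


def clone_join_candidates(est_records, bin_sizes, excluded=frozenset()):
    """Return sorted pairs of bins that clone information suggests joining.

    est_records: iterable of (clone_id, end, bin_id), with end "5'" or "3'".
    bin_sizes: dict mapping bin_id -> number of ESTs in that bin.
    """
    ends = defaultdict(lambda: defaultdict(set))  # clone -> end -> bins
    for clone, end, bin_id in est_records:
        ends[clone][end].add(bin_id)

    support = defaultdict(set)  # (bin_x, bin_y) -> clones linking the two bins
    for clone, by_end in ends.items():
        for bin_5, bin_3 in product(by_end["5'"], by_end["3'"]):
            if bin_5 != bin_3:
                support[tuple(sorted((bin_5, bin_3)))].add(clone)

    return sorted(
        pair
        for pair, clones in support.items()
        if len(clones) >= MIN_SHARED_CLONES
        and bin_sizes[pair[0]] + bin_sizes[pair[1]] < MAX_JOINED_SIZE
        and not (set(pair) & excluded)
    )


# Made-up records: clones c1 and c2 both link bins 1 and 2; c3 alone links 3 and 4.
records = [
    ("c1", "5'", 1), ("c1", "3'", 2),
    ("c2", "5'", 1), ("c2", "3'", 2),
    ("c3", "5'", 3), ("c3", "3'", 4),
]
sizes = {1: 100, 2: 200, 3: 10, 4: 10}
candidates = clone_join_candidates(records, sizes)
```

Bins 3 and 4 are not proposed because only one clone links them, which could be a chimera.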
  • In step 252, using BLAST2 and FASTX, each consensus sequence is compared to the sequences in the GenBank database, one of the external databases available on the internet. Exact hits are annotated and homologs are recorded in the gene bin database. If no match is found for the gene's consensus sequence, the gene is identified as unique in the gene bin database.
  • Gbpri and gbpept are divisions of the GenBank database. Using BLAST2 searches, hits are collected against the gbpri database. Exact hits are annotated and recorded when the percent identity and the alignment length meet any one of the following paired thresholds, which range from a percent identity greater than or equal to 95% with an alignment length of at least 200 base pairs to a percent identity of 100% with an alignment length of at least 100 base pairs:
    • percent identity ≥ 95%, alignment length ≥ 200 base pairs
    • percent identity ≥ 96%, alignment length ≥ 180 base pairs
    • percent identity ≥ 97%, alignment length ≥ 160 base pairs
    • percent identity ≥ 98%, alignment length ≥ 140 base pairs
    • percent identity ≥ 99%, alignment length ≥ 120 base pairs
    • percent identity ≥ 100%, alignment length ≥ 100 base pairs
  • Homologs are recorded when hits have an expectation value (E-value) less than or equal to 1 x 10^-8.
  • the expectation value indicates the expected number of times that an alignment between two sequences might occur by chance.
  • An E-value of zero indicates an exact match while an E-value of one indicates no significant matches were found.
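The annotation thresholds can be collected into a small classifier. The tier table follows the identity and length pairs listed above; the E-value cutoff of 1e-8 is an assumed reading of the homolog threshold, and the function names and return labels are hypothetical.

```python
# (minimum percent identity, minimum alignment length in base pairs)
EXACT_HIT_TIERS = [
    (95, 200),
    (96, 180),
    (97, 160),
    (98, 140),
    (99, 120),
    (100, 100),
]
HOMOLOG_MAX_EVALUE = 1e-8  # assumed cutoff for recording a homolog


def is_exact_hit(percent_identity, alignment_length):
    """True if the hit satisfies at least one identity/length tier."""
    return any(
        percent_identity >= min_identity and alignment_length >= min_length
        for min_identity, min_length in EXACT_HIT_TIERS
    )


def classify_hit(percent_identity, alignment_length, evalue):
    """Label a database hit as 'exact', 'homolog' or 'none'."""
    if is_exact_hit(percent_identity, alignment_length):
        return "exact"
    if evalue <= HOMOLOG_MAX_EVALUE:
        return "homolog"
    return "none"
```

The tiers trade identity against length: a shorter alignment is accepted as an exact hit only if its percent identity is correspondingly higher.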
  • a sequence is annotated as an exact match when the percent identity is equal to 100% with an alignment length of at least 50 base pairs, and both the portion of the template before the match is less than or equal to ten base pairs
  • Step 230 identifies and removes the inert bins from the assembly process.
  • Inert bins are very deep, typically having more than 2,000 EST sequences. The inventors found that reassembly of the inert bins does not significantly affect the existing assembled consensus sequences. Therefore, for the inert bins, new EST sequences that are assigned to the inert bins are aligned to the existing consensus sequences, but the new EST sequences are not used to generate the consensus sequences in the assembly process.
  • the inert bins are predetermined and are typically well-known and well-characterized genes such as actin or EF-1a.
  • an initial set of bins is updated with new EST data using the following procedures: assign sequences to bins based on a BLAST comparison, confirm matches and append the EST sequences to the bins for future assembly.
  • the new sequences are assigned to a bin based on the BLAST comparison between the new EST sequences and the current set of consensus sequences. Significant matches are confirmed using Cross_match, a Smith-Waterman based tool that also incorporates base-call confidence scores into the alignment process.
  • the template viewer procedure displays a bin with at least one consensus sequence with its assembled ESTs.
  • the consensus sequence is displayed at the top of the display, and the ESTs are displayed, starting at the leftmost EST in left-to-right order, below the consensus sequence with one EST to a row.
  • exemplary ESTs 272 are placed into a bin 274 and assembled to generate a bin 274 with two consensus sequences 276, 278.
  • In Fig. 7B, two exemplary bins 282 and 284 are joined and the ESTs are associated with a single bin 286.
  • In Fig. 7C, the assembled bin 274 of Fig. 7A is split into two bins 292 and 294.
  • a flowchart of the optional filtering procedure 146 (Fig. 4B) of steps 226 and 236 is shown.
  • the PHRAP assembly program either fails or takes a very long time to execute when the ESTs of a bin have a large local depth.
  • the local depth refers to, for a particular location in the eventual assembly, the number of ESTs whose alignments span that location.
  • the filter generates a set of representative ESTs for that gene bin that are input to the PHRAP assembler. Since local depth is the problem, the filter effectively removes ESTs located in the regions of greatest local depth, while retaining those ESTs with low local depth. Since some bins may have a very large number of EST sequences, for example, 30,000 or more, the filter reduces the number of ESTs used in the assembly process and thereby speeds up the operation of the assembly process.
  • step 302 for each gene bin starting at the first gene bin, a set of ESTs is initialized. The set of steps in block 304 are then performed for each gene bin.
  • step 306 a redundancy score is calculated for the ESTs in the gene bin. To generate the redundancy score, Cross_match is run on the set of ESTs to obtain the pairwise alignments of the ESTs. Based on the pairwise alignments, the redundancy score for an EST is equal to the minimum, over all the bases of the EST, of the number of matches each base has with respect to the other ESTs in the gene bin.
  • step 308 the EST with the highest redundancy score is identified. In step 310, the identified EST is removed from the representative set of ESTs.
  • when several ESTs share the highest redundancy score, the minimum local depth of those ESTs is identified, and the EST with the fewest bases at that minimum local depth is removed. In this way, the ESTs covering shallow regions tend to remain as representative ESTs, while ESTs in the deeper regions are removed. In addition, ESTs having a shorter sequence length will also tend to be removed.
  • step 312 after removing an EST, if the depth of the remaining representative ESTs of the gene bin is greater than a predetermined threshold, the method repeats steps 306 to 310 to determine the next EST to remove. If the depth of the remaining representative ESTs of the gene bin is less than or equal to the predetermined threshold, the process ends for that gene bin.
  • Cross_match can also run into memory problems and take a long time to execute for bins with large numbers of ESTs. Therefore, in an alternate embodiment, for those bins with large numbers of ESTs, the ESTs are divided into batches and each batch is processed separately using the method described above for Fig. 8. Prior to assembly, the remaining ESTs are combined into a representative set of ESTs for that bin and are submitted to the assembly process.
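The filtering loop of steps 306 to 312 can be illustrated with a minimal Python sketch. It is not the patented implementation: it assumes each EST has already been placed on a common coordinate axis as a non-empty half-open interval (the source derives depth from Cross_match pairwise alignments instead), and the names `coverage`, `filter_bin` and `max_depth` are hypothetical.

```python
from typing import Dict, Tuple

Interval = Tuple[int, int]  # hypothetical: half-open [start, end) placement of an EST


def coverage(ests: Dict[str, Interval]) -> Dict[int, int]:
    """Local depth at each position: the number of ESTs spanning it."""
    depth: Dict[int, int] = {}
    for start, end in ests.values():
        for pos in range(start, end):
            depth[pos] = depth.get(pos, 0) + 1
    return depth


def filter_bin(ests: Dict[str, Interval], max_depth: int) -> Dict[str, Interval]:
    """Remove ESTs from the deepest regions until no position exceeds max_depth."""
    kept = dict(ests)
    while kept:
        depth = coverage(kept)
        if max(depth.values()) <= max_depth:
            break

        def removal_key(name: str):
            start, end = kept[name]
            # Matches per base against the *other* ESTs in the bin.
            others = [depth[pos] - 1 for pos in range(start, end)]
            score = min(others)  # redundancy score of this EST
            # Highest score first; ties go to the EST with the fewest bases
            # at that minimum depth; the name makes the choice deterministic.
            return (-score, others.count(score), name)

        del kept[min(kept, key=removal_key)]
    return kept


bin_ests = {"a": (0, 10), "b": (0, 10), "c": (0, 10), "d": (8, 20)}
representatives = filter_bin(bin_ests, max_depth=2)
```

In this toy bin, the three fully overlapping ESTs compete for removal while `d`, which covers a shallow region, is retained, which is the behaviour the text describes.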
  • a bin identifier can be persistent between database versions.
  • a persistent bin identifier entails the retroactive monitoring of the inheritance of bin identifiers by determining which bins in the newer version of the data base are substantially the same as bins in the older version of the database.
  • Fig. 9 provides a method of mapping persistent bin identifiers using the Map_persistent_bin_id procedure 162 (Fig. 4B).
  • bin identifiers are mapped from an old set of bins of an old database to a new set of bins of a new database. The method is independent of the process used to generate the bins. Using this method, there is no need to track a bin identifier through the many steps of processing of Fig. 6 or to generate and compress a processing history into a compact interpretable form.
  • a two-sided score that includes a forward score and a reverse score is determined as follows:
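  • $$\text{Forward Score} = \frac{\#\ \text{ESTs in common between the old and new bin in the pair of bins}}{\text{total}\ \#\ \text{of inherited ESTs in the new bin}}$$
  • $$\text{Reverse Score} = \frac{\#\ \text{ESTs in common between the old and new bin in the pair of bins}}{\text{total}\ \#\ \text{of inheritable ESTs in the old bin}}$$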
  • both the forward score and the reverse score have the same numerator.
  • the denominator of the forward score is the total number of inherited ESTs in the new bin. In other words, the total number of ESTs in the new set of bins that were present in the old set of bins.
  • the denominator of the reverse score is the total number of ESTs in the old bins.
  • step 324 for each new bin, all Reverse Scores greater than or equal to a predetermined reverse score threshold, such as 90%, are identified in order to identify a subset of potentially inheritable bin identifiers, and all Forward Scores are ranked.
  • step 326 for each new bin, the new bin identifier is mapped to the old bin identifier in the subset of potentially inheritable bin identifiers that has the highest Forward Score.
  • a table 328 in the database stores the mapping of old bin identifiers to new bin identifiers.
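The two-sided scoring and the selection in steps 324 and 326 can be sketched as follows. This is an illustrative reading, not the Map_persistent_bin_id procedure itself: bins are modelled as sets of EST identifiers, "inheritable" ESTs of an old bin are assumed to be those that still exist somewhere in the new bin set, and the function name `map_bin_ids` is hypothetical.

```python
from typing import Dict, Set


def map_bin_ids(old_bins: Dict[str, Set[str]],
                new_bins: Dict[str, Set[str]],
                reverse_threshold: float = 0.9) -> Dict[str, str]:
    """Return {new bin id: inherited old bin id} using forward/reverse scores."""
    old_ests = set().union(*old_bins.values())
    new_ests = set().union(*new_bins.values())
    mapping: Dict[str, str] = {}
    for new_id, members in new_bins.items():
        inherited = members & old_ests          # forward-score denominator
        best = None                             # (forward score, old id)
        for old_id, old_members in old_bins.items():
            inheritable = old_members & new_ests  # assumed reverse denominator
            common = len(members & old_members)   # shared numerator
            if not common or not inherited or not inheritable:
                continue
            reverse = common / len(inheritable)
            forward = common / len(inherited)
            # Keep only candidates that pass the reverse threshold,
            # then prefer the highest forward score.
            if reverse >= reverse_threshold and (best is None or forward > best[0]):
                best = (forward, old_id)
        if best is not None:
            mapping[new_id] = best[1]
    return mapping


old = {"B1": {"e1", "e2", "e3"}, "B2": {"e4", "e5"}}
new = {"N1": {"e1", "e2", "e3", "e6"}, "N2": {"e4"}, "N3": {"e5", "e7"}}
table = map_bin_ids(old, new)
```

Here `N1` inherits the identifier of `B1`, while the two halves of the split bin `B2` each reach a reverse score of only 0.5 and therefore inherit nothing under the 90% threshold.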
  • Fig. 11 is a flowchart of an alternate embodiment of populating the initial set of gene bins of step 224 of Fig. 6.
  • each EST sequence is placed in its own bin so that each EST is a consensus sequence.
  • the consensus sequences of the bins are compared to determine relationships, if any, between the consensus sequences of the bins.
  • Step 334 of Fig. 11 is the same as step 232 of Fig. 6.
  • step 336 the bins are joined based on the relationships of the consensus sequences.
  • Step 336 of Fig. 11 is the same as step 234 of Fig. 6.
  • Cross-Species Gene Links
  • Sets of gene bins can be assembled not only for human sequence data, but also for other organisms. In these gene bins, the same gene may appear across multiple species. Genes sufficiently common to be captured by the libraries for a given species that are grouped together by the assembly process will appear in the database represented at the sequence level by one or more consensus sequences from one or more gene bins.
  • step 338 of Fig. 12 consensus sequences from assembled bin sets from each species are compared using BLAST.
  • step 340 for those comparison results exceeding a predetermined threshold, a first species identifier with its first species gene bin identifier and first species consensus sequence identifier, and a second species identifier with its second species gene bin identifier and second species consensus sequence identifier, are stored in a table in the database to provide a cross-reference of common genes among species.
  • Similarity Boundary Finder
  • The purpose of the similarity boundary finder is to identify and then extract information about regions of similarity between input sequences, as well as unique regions. Regions of similarity are patterns that occur at least once in two or more input sequences, or that occur at least twice in a single input sequence. A segment is a region of similarity, or is designated as such, when the difference between patterns from different input sequences is deemed biologically unimportant. Input sequences have at least one and typically many segments.
  • a flowchart of a general method of determining conserved regions across input sequences 174 (Fig. 4B) used by the similarity boundary finder 94 (Fig. 4B) is shown.
  • the initial pairwise alignment criteria are set. Since the pairwise alignment tool is Cross_match, the criteria include a minimum length and a score threshold at which a homologous sequence or region of similarity is identified.
  • pairwise alignment data 176 (Fig. 4B) is generated for all pairs of input sequences using Cross_match.
  • step 356 based on the pairwise alignment data, boundaries of aligned sequence portions are identified. All boundaries of all aligned sequence portions are determined by iteratively applying all identified boundaries to previously identified aligned sequence portions.
  • step 358 an average number of boundaries per input sequence is determined.
  • step 360 if the average number of boundaries is greater than or equal to a predetermined threshold, the process proceeds to step 362.
  • step 362 the pairwise alignment criteria is modified to increase the requirements for pairwise alignment such that the
  • step 364 displays the input sequences with their aligned sequence portions and boundaries.
  • a user sets the predetermined threshold number of sequences to be compared to the average.
  • Fig. 14 is an alternate embodiment of the general method of the similarity boundary finder of Fig. 13.
  • Fig. 14 is different from Fig. 13 because the pairwise alignment data is generated only once.
  • the pairwise alignment criteria are set; and, in step 354, the pairwise alignment data for pairs of input sequences are generated.
  • the alternate embodiment of Fig. 14 differs from that shown in Fig. 13.
  • the pairwise alignment data are ordered according to the likelihood of generating short segments.
  • a pairwise alignment is considered likely to yield short segments according to the extent to which the aligned regions of the sequences involved are also contained in other pairwise alignments.
  • the likelihood is considered especially high if there is another pairwise alignment involving the same two sequences and containing the majority or the entire extents of the aligned regions.
  • step 367 based on the ordered pairwise alignment data contained in the pairwise alignment data processed so far, the boundaries of aligned sequence portions are identified, and all boundaries of all shared sequence portions are determined by iteratively applying all identified boundaries to aligned sequence portions.
  • step 368 the average distance between boundaries in the input sequences is determined.
  • step 369 if the average is greater than or equal to a predetermined threshold and if there are more pairwise alignments to process, the process proceeds to step 370 to get the next pairwise alignment and the process repeats at step 367. If the average distance between boundaries is less than the predetermined threshold and if there are no more pairwise
  • step 364 the input sequences are displayed with their boundaries.
  • Step 364 is the same for Fig. 13 and Fig. 14.
  • the id_similar_regions procedure 166 of Fig. 4B implements either steps 352-362 of Fig. 13 or steps 352, 354, 365-370 of Fig. 14.
  • the display_con_sequence procedure 168 and the display_segment_map procedure 170 of Fig. 4B implement step 364 of Figs. 13 and 14.
  • Sequences 1 and 2 have a first region of similarity with boundaries Boundary 1 and Boundary 2.
  • Sequences 2 and 3 have a second region of similarity with boundaries Boundary 3 and Boundary 4. Since Boundary 3 falls in the middle of the first region of similarity, the present invention will apply Boundary 3 to Sequence 1 thereby splitting the first region of similarity into two portions. Since Boundary 2 falls in the middle of the second region of similarity, Boundary 2 is applied to Sequence 3 to split the second region of similarity into two portions.
  • Figs. 16A and 16B are a more detailed flowchart of the method of Fig. 13.
  • step 372 input sequences are received.
  • the input sequences are consensus sequences of EST assemblies. Alternately other sequences can be received such as genomic sequence data. Auxiliary data may also be received with the input sequences such as assembly depth, base call quality scores, and tissue or disease-state categorization.
  • step 374 as described above, the initial pairwise alignment criteria are set.
  • step 376 pairwise alignments between the input sequences are identified.
  • pairwise alignments between the input sequences and their reverse complements are identified.
  • step 378 for each pairwise alignment, the boundaries of the alignment in each sequence, the locations of all insertions and deletions in the alignments and the orientation of each sequence are identified.
  • step 380 input sequences are received.
  • the pairwise alignments are split at large gaps.
  • a large gap is a gap that exceeds a predetermined threshold gap length in the pairwise alignments. A user can set the predetermined gap length.
  • the pairwise alignment is subdivided at the large gap to form two new shorter pairwise alignments. The ends of the gap are boundaries.
  • any sequences whose alignments are primarily to their reverse complements are replaced with their reverse complements. This step is performed to simplify the display.
  • step 384 based on the pairwise alignment data, the boundaries of aligned sequence portions are identified. All boundaries of all regions of similarity between sequences are determined by iteratively applying all identified boundaries to all aligned sequence portions. Steps 358, 360 and 362 are the same as described above and the description will not be repeated.
  • step 390 based on the pairwise alignment data and the boundaries, segment instances are identified.
  • a segment instance is a region of a sequence between a pair of adjacent similarity boundaries.
  • similar segment instances, e.g., from different input sequences, are clustered into segment groups.
  • the segment instances are multiply aligned into segment groups.
  • the alignment along a tree method is used, except that instead of using profiles as guides in aligning two multiple alignments, the gapping that is specified by one of the generated pairwise alignments that has a segment from each multiple alignment is used.
  • the structure of the tree is determined by an ordering of the sequence pairwise alignments. Segment instances are iteratively clustered into binary trees by merging, for each pairwise alignment, the pair of trees containing the two segment instances contained in the alignment.
  • the pairwise alignments are processed in increasing order of the sum of the lengths of the two aligned regions because such an ordering appears likely to join more similar segments before more dissimilar segments. However, other orderings can be used.
  • a pairwise alignment is ignored if its two aligned segment instances are already in the
  • the consensus segment for each of the segment groups is determined by selecting, for each position in the multiple alignment, the base call having the highest quality score from among the base calls at the corresponding positions in the segment instances.
  • a gap quality score is assigned to equal the average score of the two bases on either side of the gap. Ties are resolved by selecting the base call occurring in the largest number of segment instances at the highest quality score. If there is still a tie, an unambiguous base call is chosen instead of a gap, and a gap is chosen over an ambiguous base call. If there is still a tie among unambiguous base calls, assign an "N" to that position in the consensus segment.
  • the quality score is defined as the highest score among the segment instances at that position.
  • the assembly depth and tissue counts are the sums of the equivalent quantities for the segment instances.
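The base-selection and tie-breaking rules for the consensus segment can be sketched per alignment column. This is a simplified illustration under stated assumptions: each column is a list of `(call, quality)` pairs, gap qualities are assumed to have been computed by the caller as the average of the flanking bases, and the function names are hypothetical. The source does not say what to do when only ambiguous calls tie; the sketch returns "N" in that case as an assumption.

```python
from collections import Counter
from typing import List, Tuple

UNAMBIGUOUS = set("ACGT")
Call = Tuple[str, int]  # (base call or "-" for a gap, quality score)


def consensus_call(column: List[Call]) -> Call:
    """Pick the consensus call for one column of the multiple alignment."""
    best_q = max(q for _, q in column)       # consensus quality = highest score
    counts = Counter(call for call, q in column if q == best_q)
    top = max(counts.values())
    tied = [call for call, n in counts.items() if n == top]
    if len(tied) == 1:                       # clear winner at the best quality
        return tied[0], best_q
    unambiguous = [call for call in tied if call in UNAMBIGUOUS]
    if len(unambiguous) == 1:                # unambiguous base beats a gap
        return unambiguous[0], best_q
    if len(unambiguous) > 1:                 # unresolved tie between bases
        return "N", best_q
    if "-" in tied:                          # gap beats an ambiguous call
        return "-", best_q
    return "N", best_q                       # assumption: ambiguous-only tie


def consensus_segment(columns: List[List[Call]]) -> str:
    """Concatenate the per-column consensus calls."""
    return "".join(consensus_call(col)[0] for col in columns)


example = consensus_segment([
    [("A", 30), ("C", 20), ("A", 10)],
    [("A", 30), ("C", 30)],
])
```

Keeping the tie-break order in one small function makes it easy to compare against the prose rule by rule.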
  • step 398 junctions between segment groups are identified.
  • a junction occurs when two segment instances, one from each group, are adjacent in any sequence.
  • step 400 for nucleotide input sequences and their consensus sequences, likely splice junction sequences are identified.
  • step 402 the input sequences are displayed with their boundaries.
  • Fig. 17 is a detailed flowchart of the method of identifying and determining segments with multiple alignments among the received input sequences of step 386 of Figs. 16A and 16B.
  • a boundary list 178 (Fig. 4B) is created and populated with the sequence's left and right endpoints.
  • step 424 the left and right endpoints of all pairwise alignments involving the sequence are added to that sequence's boundary list.
  • An equivalent boundary list 180 (Fig. 4B) associates the equivalent boundaries of
  • a queue of boundaries to be processed is generated. Initially, the queue includes all of the above sequence and alignment endpoints. The queue may also be implemented as another list.
  • a spanning list of all pairwise alignments spanning the boundary location in a corresponding sequence is created.
  • the pairwise alignment is subdivided by adding the boundary to the boundary list of the input sequence associated with the pairwise alignment if the boundary list does not already contain a boundary at this location, and this boundary is added to the queue for processing.
  • Fig. 18 shows data structures used with the method of Fig. 17 that reflect the exemplary sequences, alignment and boundaries of Fig. 15.
  • each sequence has a boundary list with its starting point, S1, S2 and S3, and end point, E1, E2 and E3, respectively.
  • Each initial boundary list also has boundaries from the pairwise alignment data.
  • the boundaries are uniquely designated as "Bx" where x refers to a boundary number.
  • Boundaries B1 and B2 of sequences 1 and 2 are aligned.
  • boundary B1 will most likely occur at a different location in sequence 1, such as fifty, from boundary B1 in sequence 2, such as seventy. However, for simplicity, both boundaries are designated as B1.
  • Boundaries B3 and B4 of sequences 2 and 3 are also aligned. Referring also to Fig. 15, boundaries 1, 2, 3 and 4 of Fig. 15 are the same as B1, B2, B3 and B4 of Fig. 18.
  • In Fig. 18, another data structure, such as a list, is used to associate the equivalent boundaries among the sequences, such as B1 from sequence 1 and B1 from sequence 2.
  • boundary lists for each sequence are shown after applying the method of Figs. 13, 14 and 17 described above. Note that boundary B3 is added to the list for Sequence one and boundary B2 is added to the list for Sequence three.
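The queue-driven propagation of boundaries through spanning alignments can be sketched as below. It is a simplified model rather than the method of Fig. 17: alignments are assumed to be ungapped (the source also tracks insertions and deletions), coordinates are plain integers, and the `Alignment` tuple and `propagate_boundaries` function are hypothetical names. The example coordinates are invented to mirror the situation of Fig. 15.

```python
from collections import deque
from typing import Dict, List, NamedTuple


class Alignment(NamedTuple):
    seq_a: str
    start_a: int
    seq_b: str
    start_b: int
    length: int  # ungapped length shared by both sequences (assumption)


def propagate_boundaries(lengths: Dict[str, int],
                         alignments: List[Alignment]) -> Dict[str, List[int]]:
    """Apply every boundary to every alignment that spans it, until stable."""
    # Boundary lists start with the sequence endpoints ...
    boundaries = {name: {0, n} for name, n in lengths.items()}
    # ... plus the endpoints of every pairwise alignment.
    for al in alignments:
        boundaries[al.seq_a] |= {al.start_a, al.start_a + al.length}
        boundaries[al.seq_b] |= {al.start_b, al.start_b + al.length}
    queue = deque((name, pos) for name, cuts in boundaries.items()
                  for pos in sorted(cuts))
    while queue:
        name, pos = queue.popleft()
        for al in alignments:
            for src, s0, dst, d0 in ((al.seq_a, al.start_a, al.seq_b, al.start_b),
                                     (al.seq_b, al.start_b, al.seq_a, al.start_a)):
                # Only boundaries strictly inside the aligned region split it.
                if src == name and s0 < pos < s0 + al.length:
                    mapped = d0 + (pos - s0)
                    if mapped not in boundaries[dst]:
                        boundaries[dst].add(mapped)
                        queue.append((dst, mapped))
    return {name: sorted(cuts) for name, cuts in boundaries.items()}


result = propagate_boundaries(
    {"seq1": 200, "seq2": 300, "seq3": 150},
    [Alignment("seq1", 50, "seq2", 70, 100),    # B1..B2
     Alignment("seq2", 120, "seq3", 10, 100)],  # B3..B4
)
```

With these coordinates, B3 (position 120 in sequence 2) is applied to sequence 1 at position 100, and B2 (position 170 in sequence 2) is applied to sequence 3 at position 60, matching the additions to the boundary lists described above.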
  • In Fig. 19, the input sequences and their segments are displayed.
  • An exemplary display 440 has an upper portion 442 displaying the input sequences AA, BB.c, and CC with aligned consensus segments 443.
  • One input sequence with all its consensus segments is displayed horizontally on a single line. For simplicity, the segments are numbered. In practice, each similar segment has a unique color.
  • Input sequence BB.c is the reverse complement as indicated by the ".c" extension.
  • the rows of input sequences are displayed in an order that positions more similar sequence pairs closer together than less similar pairs based on the number of similar base pairs of each input sequence.
  • the lines 444 between segments indicate junctions.
  • the junctions are drawn at the endpoints at which the segments meet.
  • An alignment between a region of a sequence and its reverse complement is displayed with an "X" pattern.
  • a segment graph shows the relationship among the aligned segments.
  • the segments are numbered one through fourteen and each segment is shown once. Again, the lines indicate junctions between segments.
  • segment 6 is a likely alternatively spliced exon because input sequence AA includes segment 6 while input sequence BB.c does not include segment 6 as indicated by the curved line connecting segments 5 and 7.
  • the segments of the segment graph are also vertically aligned with respect to the segments of the input sequences in the upper display.
  • Segments 8 and 9 are repeating sequences. The method of the present invention results in repeating sequences being identified both within a single input sequence and among two or more input sequences.
  • the input sequences are consensus sequences from the gene bins.
  • Figs. 20A and 20B are a flowchart of a method of displaying input sequences and the segment graph of Fig. 19 for identification of splice variants among the input sequences.
  • step 452 the input or consensus sequences and their segments are received.
  • step 454 the relative horizontal ordering of segments in the display is determined by clustering segment instances within segment groups into subsets that will share the same horizontal location.
  • step 456 the relative horizontal ordering of segment instances is represented using an acyclic directed graph 182 (Fig. 4B).
  • the vertices of the acyclic directed graph represent segment subsets and the edges indicate the horizontal adjacency of the segment subsets, with the edge direction dictated by the two segment subsets' left-right ordering.
  • the acyclic directed graph is initialized as a set of unconnected directed paths, each path representing the ordering of segment instances within one input sequence.
  • step 458 a list of all pairs of similar segment instances is created and the list is sorted.
  • the list is sorted, first in descending order of the lengths of each pair's input sequences, then by whether the pair has the same orientation, then in ascending order of the two segment instances' average location within their respective input sequences.
  • step 460 for each segment instance pair in the sorted list, starting from the beginning of the list, attempt to merge the subsets to which the two segment instances belong, if the segment instances in the pair belong to different subsets of segments.
  • a merge when a merge is to be performed, identify the two vertices in the acyclic directed graph corresponding to the two subsets, and merge the subsets only if doing so will not cause a cycle to be added to the acyclic directed graph when the corresponding graph vertices are merged.
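The cycle check that guards each merge can be sketched with a small adjacency-set graph. This is an illustration only: vertices stand for segment subsets, `succ` maps each vertex to its right-hand neighbours, and the names `reachable` and `try_merge` are hypothetical. Merging two vertices of an acyclic directed graph creates a cycle exactly when one is reachable from the other, so the sketch tests reachability in both directions.

```python
from typing import Dict, Set


def reachable(succ: Dict[str, Set[str]], start: str, goal: str) -> bool:
    """Depth-first search: is there a directed path from start to goal?"""
    stack, seen = [start], set()
    while stack:
        node = stack.pop()
        if node == goal:
            return True
        if node in seen:
            continue
        seen.add(node)
        stack.extend(succ.get(node, ()))
    return False


def try_merge(succ: Dict[str, Set[str]], u: str, v: str) -> bool:
    """Merge vertex v into u unless that would add a cycle; report success."""
    if u == v:
        return True
    if reachable(succ, u, v) or reachable(succ, v, u):
        return False                      # merging would close a cycle
    succ.setdefault(u, set()).update(succ.pop(v, set()))
    for targets in succ.values():         # redirect edges that pointed at v
        if v in targets:
            targets.discard(v)
            targets.add(u)
    return True


# Two unconnected paths, one per input sequence: a1 -> a2 -> a3 and b1 -> b2.
graph = {"a1": {"a2"}, "a2": {"a3"}, "a3": set(), "b1": {"b2"}, "b2": set()}
first = try_merge(graph, "a1", "b1")   # similar segments at the left end
second = try_merge(graph, "a3", "b2")  # similar segments at the right end
third = try_merge(graph, "a2", "a3")   # rejected: a2 already precedes a3
```

After the two accepted merges the graph records that the second sequence skips `a2`, which is the kind of structure the display later draws as a curved junction.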
  • step 462 the absolute positions of segment subsets in the display are determined by:
  • the relative positioning of the above trees defines one or more clusters of connected segment subsets.
  • the segment subsets within each cluster form a connected graph via their junctions and segment subsets in different clusters have no left-to-right junctions to each other. All such clusters are aligned so that the left end of the leftmost segment subset in each cluster is located at position zero.
  • the input sequences are ordered vertically by: creating an ordering of all pairs of input sequences, sorted in decreasing order of the total lengths of all pairwise alignments between each input sequence pair; creating lists of vertically ordered input sequences, by processing, in order, pairs of input sequences as follows: starting with each sequence being in its own one-sequence list, then in the ordering created in the previous step, if two input sequences in a pair belong to different lists, append one list to the other; and if, at the end, there are two or more lists, arrange the lists vertically in decreasing order of their numbers of consensus sequences.
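The vertical ordering can be sketched as a greedy list-joining pass. This is a hedged illustration: `pair_scores` is assumed to hold the total pairwise alignment length for each pair of input sequences, and `order_rows` is a hypothetical name. The direction in which one list is appended to the other is not specified in the text, so the sketch always appends the second sequence's list to the first's.

```python
from typing import Dict, List, Tuple


def order_rows(sequences: List[str],
               pair_scores: Dict[Tuple[str, str], int]) -> List[str]:
    """Order rows so that strongly aligned sequence pairs end up close together."""
    # Each sequence starts in its own one-sequence list.
    lists: Dict[str, List[str]] = {name: [name] for name in sequences}
    # Process pairs in decreasing order of total pairwise alignment length.
    for (a, b), _ in sorted(pair_scores.items(), key=lambda kv: -kv[1]):
        list_a, list_b = lists[a], lists[b]
        if list_a is not list_b:
            list_a.extend(list_b)          # append one list to the other
            for name in list_b:
                lists[name] = list_a
    # Collect the distinct lists and stack the longest first.
    unique: List[List[str]] = []
    for lst in lists.values():
        if not any(lst is seen for seen in unique):
            unique.append(lst)
    unique.sort(key=len, reverse=True)
    return [name for lst in unique for name in lst]


rows = order_rows(
    ["AA", "BB", "CC", "DD"],
    {("AA", "BB"): 500, ("BB", "CC"): 300, ("AA", "CC"): 100},
)
```

The unaligned sequence `DD` stays in its own list and is placed below the three-sequence list.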
  • the topmost list to display will be determined based on the length of the input sequences.
  • the vertical (row) positions of consensus segments in the segment graph are determined by: sorting all segment instances in decreasing order of the length of the corresponding sequence; starting with a segment graph having only empty rows, for each segment instance in the sorted list, if the corresponding segment subset does not yet have a position in the graph, add the corresponding consensus segment to the topmost row of the graph where the consensus segment can be positioned at the horizontal location of the segment subset and be at least the minimum separation distance from all other consensus segments already positioned in the row.
  • a consensus segment is added to the topmost row in which it fits and which contains the consensus segment of a second segment subset with which the first segment subset shares a left-to-right junction. If there is no such row, then the consensus segment is added to the topmost row in which it fits.
  • the similarity boundary finder processes the output of the pairwise alignment to reliably identify conserved regions in a manner consistent with all of the pairwise alignment data, no matter how complex. Therefore, the similarity boundary finder can be used to aid in determining alternative splicing of a gene by displaying putative variants, that is, segments which may correspond to putative alternatively spliced exons or groups of exons.
  • the input sequences to the similarity boundary finder are not limited to consensus sequences of the gene bins.
  • the similarity boundary finder can be used to determine genomic to cDNA alignments by processing the genomic and cDNA sequence data as the input or consensus sequences described above.
  • the similarity boundary finder can also be used to identify similar regions of homologous sequences including cross-species homologs by processing sequence data from two different species as the input or consensus sequences described above.
  • the similarity boundary finder can be used to determine sequence polymorphisms, such as single nucleotide polymorphisms (SNPs) - including substitutions, insertions and deletions. This can be done by disallowing substitutions in the Cross_match pairwise alignments by setting the magnitude of the mismatch penalty greater than twice that of the gap initiation penalty to force SNPs to appear as gaps in the alignments, and by setting the minimum gap length to zero within a segment, to force SNPs to form individual single base segments.
  • the similarity boundary finder can also be used to determine tissue differentiation among the segments in the consensus sequences.
  • the similar and dissimilar segments are correlated with a tissue category to form subsets having a common tissue category. Each subset may include both similar and dissimilar segments.
  • the polymer sequences are displayed as shown in Fig. 19.
  • Each subset of segments is displayed with a unique color such that the colors of the segments indicate regions where expression is specific to a single tissue category.
  • the segments are correlated with a disease state and each disease state is uniquely identified on the display.
  • the segments are correlated with a developmental stage, and each developmental stage is uniquely identified on the display.
  • the present invention solves many problems of identifying genes from many heterogeneous sequences.
  • the invention removes chimeric clones, removes construction artifacts, masks repetitive elements, splits close homologs, merges gene bins with apparent splice variation into a single gene bin, and trims low accuracy tails.
  • the present invention also provides a visual display of the consensus sequences of the gene bins for identification of splice variants.

Abstract

Polymer sequences are assembled into bins. A first set of bins is populated with polymer sequences. The polymer sequences in each bin are assembled into one or more consensus sequences representative of the polymer sequences of the bin. The consensus sequences are compared to determine relationships, if any, between the consensus sequences of the bins. The bins are modified based on the relationships between the consensus sequences. The polymer sequences are reassembled in the modified bins to generate one or more modified consensus sequences for each bin, representative of the modified bins. In another aspect of the invention, sequence similarities and dissimilarities are analyzed in a set of polymer sequences. Pairwise alignment data are generated for pairs of polymer sequences. These data define regions of similarity, with boundaries, between the pairs of polymer sequences. Additional boundaries in particular polymer sequences are determined by applying at least one boundary from at least one pairwise alignment for one pair of polymer sequences to at least one pairwise alignment for another pair of polymer sequences that includes one of the particular polymer sequences. Additional regions of similarity are generated based on the boundaries.
PCT/US1999/006575 1998-03-26 1999-03-25 System and methods for analyzing biomolecular sequences WO1999049403A1 (fr)

Priority Applications (4)

Application Number Priority Date Filing Date Title
CA002325469A CA2325469A1 (fr) 1998-03-26 1999-03-25 System and methods for analyzing biomolecular sequences
AU34537/99A AU771877B2 (en) 1998-03-26 1999-03-25 Computer system and methods for analyzing biomolecular sequences
EP99916165A EP1066576A1 (fr) 1998-03-26 1999-03-25 System and methods for analyzing biomolecular sequences
JP2000538305A JP2002508546A (ja) 1998-03-26 1999-03-25 System and method for analyzing biomolecular sequences

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
US7946998P 1998-03-26 1998-03-26
US60/079,469 1998-03-26

Publications (1)

Publication Number Publication Date
WO1999049403A1 true WO1999049403A1 (fr) 1999-09-30

Family

ID=22150762

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/US1999/006575 WO1999049403A1 (fr) 1998-03-26 1999-03-25 System and methods for analyzing biomolecular sequences

Country Status (5)

Country Link
EP (1) EP1066576A1 (fr)
JP (1) JP2002508546A (fr)
AU (1) AU771877B2 (fr)
CA (1) CA2325469A1 (fr)
WO (1) WO1999049403A1 (fr)

Cited By (13)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
WO2002015107A2 (fr) * 2000-08-14 2002-02-21 Incyte Genomics, Inc. System and protocol for converting raw data into a base sequence
US20030177143A1 (en) * 2002-01-28 2003-09-18 Steve Gardner Modular bioinformatics platform
US7957908B2 (en) * 2003-11-17 2011-06-07 New York University System, method and software arrangement utilizing a multi-strip procedure that can be applied to gene characterization using DNA-array data
WO2015021540A1 (fr) * 2013-08-15 2015-02-19 Zymeworks Inc. Systems and methods for in silico evaluation of polymers
WO2015026853A3 (fr) * 2013-08-19 2015-04-16 Abbott Molecular Inc. Next-generation sequencing libraries
WO2015112619A1 (fr) * 2014-01-22 2015-07-30 Adam Platt Methods and systems for detecting genetic mutations
US9618474B2 (en) 2014-12-18 2017-04-11 Edico Genome, Inc. Graphene FET devices, systems, and methods of using the same for sequencing nucleic acids
US9857328B2 (en) 2014-12-18 2018-01-02 Agilome, Inc. Chemically-sensitive field effect transistors, systems and methods for manufacturing and using the same
US9859394B2 (en) 2014-12-18 2018-01-02 Agilome, Inc. Graphene FET devices, systems, and methods of using the same for sequencing nucleic acids
US10006910B2 (en) 2014-12-18 2018-06-26 Agilome, Inc. Chemically-sensitive field effect transistors, systems, and methods for manufacturing and using the same
US10020300B2 (en) 2014-12-18 2018-07-10 Agilome, Inc. Graphene FET devices, systems, and methods of using the same for sequencing nucleic acids
US10429342B2 (en) 2014-12-18 2019-10-01 Edico Genome Corporation Chemically-sensitive field effect transistor
US10811539B2 (en) 2016-05-16 2020-10-20 Nanomedical Diagnostics, Inc. Graphene FET devices, systems, and methods of using the same for sequencing nucleic acids

Families Citing this family (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
JP5825790B2 (ja) * 2011-01-11 2015-12-02 日本ソフトウェアマネジメント株式会社 Nucleic acid information processing device and processing method thereof

Family Cites Families (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
ATE264523T1 (de) * 1997-07-25 2004-04-15 Affymetrix Inc A Delaware Corp Method for producing a bioinformatics database
US6047109A (en) * 1998-07-29 2000-04-04 Smithkline Beecham P.L.C. Methods and systems for re-evaluating assembly consensus sequences

Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
No relevant documents have been disclosed. *

Cited By (19)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
WO2002015107A2 (fr) * 2000-08-14 2002-02-21 Incyte Genomics, Inc. System and protocol for converting raw data into a base sequence
WO2002015107A3 (fr) * 2000-08-14 2004-04-08 Incyte Genomics Inc System and protocol for converting raw data into a base sequence
US20030177143A1 (en) * 2002-01-28 2003-09-18 Steve Gardner Modular bioinformatics platform
US7957908B2 (en) * 2003-11-17 2011-06-07 New York University System, method and software arrangement utilizing a multi-strip procedure that can be applied to gene characterization using DNA-array data
WO2015021540A1 (fr) * 2013-08-15 2015-02-19 Zymeworks Inc. Systems and methods for in silico evaluation of polymers
US10036013B2 (en) 2013-08-19 2018-07-31 Abbott Molecular Inc. Next-generation sequencing libraries
US10865410B2 (en) 2013-08-19 2020-12-15 Abbott Molecular Inc. Next-generation sequencing libraries
WO2015026853A3 (fr) * 2013-08-19 2015-04-16 Abbott Molecular Inc. Next-generation sequencing libraries
WO2015112619A1 (fr) * 2014-01-22 2015-07-30 Adam Platt Methods and systems for detecting genetic mutations
US9859394B2 (en) 2014-12-18 2018-01-02 Agilome, Inc. Graphene FET devices, systems, and methods of using the same for sequencing nucleic acids
US10006910B2 (en) 2014-12-18 2018-06-26 Agilome, Inc. Chemically-sensitive field effect transistors, systems, and methods for manufacturing and using the same
US10020300B2 (en) 2014-12-18 2018-07-10 Agilome, Inc. Graphene FET devices, systems, and methods of using the same for sequencing nucleic acids
US9857328B2 (en) 2014-12-18 2018-01-02 Agilome, Inc. Chemically-sensitive field effect transistors, systems and methods for manufacturing and using the same
US10429381B2 (en) 2014-12-18 2019-10-01 Agilome, Inc. Chemically-sensitive field effect transistors, systems, and methods for manufacturing and using the same
US10429342B2 (en) 2014-12-18 2019-10-01 Edico Genome Corporation Chemically-sensitive field effect transistor
US10494670B2 (en) 2014-12-18 2019-12-03 Agilome, Inc. Graphene FET devices, systems, and methods of using the same for sequencing nucleic acids
US10607989B2 (en) 2014-12-18 2020-03-31 Nanomedical Diagnostics, Inc. Graphene FET devices, systems, and methods of using the same for sequencing nucleic acids
US9618474B2 (en) 2014-12-18 2017-04-11 Edico Genome, Inc. Graphene FET devices, systems, and methods of using the same for sequencing nucleic acids
US10811539B2 (en) 2016-05-16 2020-10-20 Nanomedical Diagnostics, Inc. Graphene FET devices, systems, and methods of using the same for sequencing nucleic acids

Also Published As

Publication number Publication date
AU771877B2 (en) 2004-04-01
JP2002508546A (ja) 2002-03-19
EP1066576A1 (fr) 2001-01-10
CA2325469A1 (fr) 1999-09-30
AU3453799A (en) 1999-10-18

Similar Documents

Publication Publication Date Title
EP3304383B1 (fr) De novo diploid genome assembly and haplotype sequence reconstruction
Buhler Efficient large-scale sequence comparison by locality-sensitive hashing
AU771877B2 (en) Computer system and methods for analyzing biomolecular sequences
Brendel et al. Gene structure prediction from consensus spliced alignment of multiple ESTs matching the same genomic locus
US6714874B1 (en) Method and system for the assembly of a whole genome using a shot-gun data set
Batzoglou et al. ARACHNE: a whole-genome shotgun assembler
Mullikin et al. The phusion assembler
US5970500A (en) Database and system for determining, storing and displaying gene locus information
US5966712A (en) Database and system for storing, comparing and displaying genomic information
AU2006258264B2 (en) Method of processing and/or genome mapping of ditag sequences
US11308056B2 (en) Systems and methods for SNP analysis and genome sequencing
US20160019339A1 (en) Bioinformatics tools, systems and methods for sequence assembly
WO2015094844A1 (fr) String graph assembly for polyploid genomes
CN111161797A (zh) Method for multi-sample comparative transcriptome analysis based on third-generation sequencing
CA2400890A1 (fr) Method and system for the assembly of a whole genome using a shot-gun data set
WO2016205767A1 (fr) String graph assembly for polyploid genomes
Ogasawara et al. A fast and sensitive algorithm for aligning ESTs to the human genome
Li et al. An optimized approach for local de novo assembly of overlapping paired-end RAD reads from multiple individuals
Li et al. Seeding with minimized subsequence
Yee et al. Automated clustering and assembly of large EST collections.
Tammi et al. TRAP: Tandem Repeat Assembly Program produces improved shotgun assemblies of repetitive sequences
Morris et al. Read Alignment and Transcriptome Assembly
Assour et al. Hot RAD: a tool for analysis of next-gen RAD tag data
Scheetz et al. Informatics for efficient EST-based gene discovery in normalized and subtracted cDNA libraries
Aguilar et al. Improving spliced alignment for identification of ortholog groups and multiple CDS alignment

Legal Events

Date Code Title Description
AK Designated states

Kind code of ref document: A1

Designated state(s): AE AL AM AT AU AZ BA BB BG BR BY CA CH CN CU CZ DE DK EE ES FI GB GD GE GH GM HR HU ID IL IN IS JP KE KG KP KR KZ LC LK LR LS LT LU LV MD MG MK MN MW MX NO NZ PL PT RO RU SD SE SG SI SK SL TJ TM TR TT UA UG UZ VN YU ZA ZW

AL Designated countries for regional patents

Kind code of ref document: A1

Designated state(s): GH GM KE LS MW SD SL SZ UG ZW AM AZ BY KG KZ MD RU TJ TM AT BE CH CY DE DK ES FI FR GB GR IE IT LU MC NL PT SE BF BJ CF CG CI CM GA GN GW ML MR NE SN TD TG

121 EP: the EPO has been informed by WIPO that EP was designated in this application
DFPE Request for preliminary examination filed prior to expiration of 19th month from priority date (PCT application filed before 20040101)
ENP Entry into the national phase

Ref document number: 2325469

Country of ref document: CA

Ref country code: CA

Ref document number: 2325469

Kind code of ref document: A

Format of ref document f/p: F

ENP Entry into the national phase

Ref country code: JP

Ref document number: 2000 538305

Kind code of ref document: A

Format of ref document f/p: F

NENP Non-entry into the national phase

Ref country code: KR

WWE WIPO information: entry into national phase

Ref document number: 34537/99

Country of ref document: AU

WWE WIPO information: entry into national phase

Ref document number: 1999916165

Country of ref document: EP

WWP WIPO information: published in national office

Ref document number: 1999916165

Country of ref document: EP

REG Reference to national code

Ref country code: DE

Ref legal event code: 8642

WWW WIPO information: withdrawn in national office

Ref document number: 1999916165

Country of ref document: EP

WWG WIPO information: grant in national office

Ref document number: 34537/99

Country of ref document: AU