US20030175722A1 - Methods and systems for searching genomic databases - Google Patents

Methods and systems for searching genomic databases


Publication number
US20030175722A1
Authority
US
United States
Prior art keywords
sequence
glu
lys
val
ala
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Abandoned
Application number
US10/119,528
Other languages
English (en)
Inventor
Matthias Mann
Peter Mortensen
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
MDS Proteomics Inc
Original Assignee
MDS Proteomics Inc
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by MDS Proteomics Inc
Priority to US10/119,528
Assigned to MDS PROTEOMICS, INC. Assignment of assignors' interest (see document for details). Assignors: MANN, MATTHIAS; MORTENSEN, PETER
Publication of US20030175722A1
Legal status: Abandoned

Classifications

    • G PHYSICS
    • G16 INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR SPECIFIC APPLICATION FIELDS
    • G16B BIOINFORMATICS, i.e. INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR GENETIC OR PROTEIN-RELATED DATA PROCESSING IN COMPUTATIONAL MOLECULAR BIOLOGY
    • G16B50/00 ICT programming tools or database systems specially adapted for bioinformatics
    • G PHYSICS
    • G16 INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR SPECIFIC APPLICATION FIELDS
    • G16B BIOINFORMATICS, i.e. INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR GENETIC OR PROTEIN-RELATED DATA PROCESSING IN COMPUTATIONAL MOLECULAR BIOLOGY
    • G16B30/00 ICT specially adapted for sequence analysis involving nucleotides or amino acids
    • G16B30/10 Sequence alignment; Homology search
    • G PHYSICS
    • G16 INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR SPECIFIC APPLICATION FIELDS
    • G16B BIOINFORMATICS, i.e. INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR GENETIC OR PROTEIN-RELATED DATA PROCESSING IN COMPUTATIONAL MOLECULAR BIOLOGY
    • G16B30/00 ICT specially adapted for sequence analysis involving nucleotides or amino acids

Definitions

  • the present invention provides a method for identifying a coding sequence in a genomic database, e.g., an unannotated genomic database, comprising:
  • the genomic database is an unannotated genomic database.
  • the method also involves determining an open reading frame for the input polypeptide sequence in the genomic database, and, optionally, determining intron/exon boundaries in the open reading frame. In that manner, the subject method can be used to update, or provide a cross-referenced database, including coding sequence and intronic annotation for the genomic database.
  • the input polypeptide sequence is provided from a system for protein sequencing by mass spectrometry.
  • the subject method can be performed by a computer which has a data link from a mass spectrometer system for transmitting the input polypeptide sequence.
  • the approximate string matching method is selected from the group consisting of a Shift-And method, a Karp-Rabin fingerprint method, and a Commentz-Walter method.
  • the approximate string matching method is a GREP method, such as an AGREP method.
  • the approximate string matching method will be one which tolerates a maximal number of errors, such as gaps for intronic sequence, of a size equal to at least the average length of intronic sequences in the genomic database.
  • the approximate string matching method has an error ratio, a, of less than 3.0, and even more preferably less than 1.0.
  • the subject method is carried out with multiple sequence tags, e.g., the multiple sequence tags are combined into a single array which is used as the input for the approximate string matching method.
  • Another aspect of the present invention provides a method for identifying a coding sequence in an unannotated genomic database, comprising:
  • Still another aspect of the present invention provides a computer system for identifying coding sequences in genomic databases, comprising:
  • the system generates a set of sequence tags corresponding to possible coding sequences for an input polypeptide sequence, and identifies, from the database, any genomic sequences which are similar to one or more of the sequence tags, and indicates exon/intron boundaries, if any, in the genomic sequence(s).
  • the computer system also includes a sample/identification proteomics database for logging and correlating information such as sample identity, gel photos, mass spectra (and features therein), and search results.
  • the subject system can also include a transfer system to automate the transfer and utilization of mass spectrometric data of a target polypeptide.
  • Still another aspect of the present invention provides a mass spectrometry system including the above computer system and a mass spectrometer for sequencing polypeptides.
  • the spectrometer may include an ion source selected from the group consisting of electrospray and MALDI.
  • Yet another aspect of the present invention relates to a method of conducting a proteomics business, comprising:
  • step (iii) conducting therapeutic profiling of agents identified in step (b), or further analogs thereof, for efficacy and toxicity in animals;
  • step (iv) formulating a pharmaceutical preparation including one or more agents identified in step (iii) as having an acceptable therapeutic profile.
  • the subject business method can include the additional step of establishing a distribution system for distributing the pharmaceutical preparation for sale, and may optionally include establishing a sales group for marketing the pharmaceutical preparation.
  • Still another aspect of the present invention provides a method of conducting a proteomics business, comprising:
  • FIG. 1 Scheme representing the steps which can be used to identify a gene locus from MS sequencing of a protein.
  • FIG. 2 Illustration of how MALDI sequence data can be used to extend exon coverage (SEQ ID NOS: 91-100).
  • FIG. 3 Comparison of performance of various sequence analysis algorithms with respect to predicting gene structure (SEQ ID NOS: 101-102).
  • FIG. 4 Two sequences retrieved from the human genome by the indicated peptide sequence tag. Correlation of calculated Y-ion series of the two sequences with the tandem MS spectrum reveals that only one sequence can be correct (SEQ ID NOS: 103-104).
  • FIG. 5 Demonstrates the use of MS/MS and genome identification to elucidate the gene structure of a novel human protein (SEQ ID NOS: 105-112).
  • FIG. 6 Schematic representation of one preferred embodiment for information flow.
  • FIG. 7 Proposed information flow. All relevant information is stored in ProteomeDB, and unique Sample ID numbers are given. Links may also go directly to ProteomeDB from ProLogDB, ProAutoDB and the Prospects agents (not shown) in order to enter information without operator assistance.
  • FIG. 8 Main switchboard.
  • FIG. 9 Tables in ProteomeDB.
  • FIG. 10 Forms in ProteomeDB. In general, these parameters should not be modified unless by an administrator familiar with the database program such as MS Access.
  • FIG. 11 Reporting options in ProteomeDB. Reports can be transferred to a word processor (MS Word) by one button click and subsequently saved as a separate file (e.g. in rich text format) for easy distribution of analytical results via electronic means such as e-mail.
  • FIG. 12 ProteomeDB interface form.
  • FIG. 13 Search parameter window for peptide map queries.
  • FIG. 14 Search parameter window for peptide sequence tag queries.
  • FIG. 15 Search parameter window for breakpoint queries.
  • FIG. 16 Search parameter window for amino acid sequence queries.
  • FIG. 17 Sample information dialogue box.
  • FIG. 18 Search parameter window for automating ID program via ProAutoDB.
  • FIG. 19 Search parameter window for logging searches.
  • FIG. 20 Multi template interface.
  • FIG. 21 Search results window.
  • FIG. 22 2nd pass check windows.
  • FIG. 23 Database entry window.
  • FIG. 24 Result summary window.
  • FIG. 25 ProLogDB browser window.
  • FIG. 26 Conversion of DNA sequence to amino acid sequence.
  • FIG. 27 Search parameter window for calculating theoretical fragment masses from a peptide sequence.
  • FIG. 28 Search parameter window for calculating theoretical peptide masses.
  • One aspect of the present invention is related to the demonstration that proteins can be directly sequenced using, e.g., mass spectrometry, and the coding sequences (along with information about intronic structures) for the proteins unambiguously identified in unannotated genomes as large as the human genome.
  • the subject method can be carried out using small amounts of proteins, e.g., sub-nanomol amounts of a test protein, and more preferably sub-picomol amounts of the test protein.
  • the present invention is based on the discovery that a suitably modified pattern matching algorithm can be used with direct protein sequencing data to locate coding sequences in raw genomic data.
  • the mass spectrometric data can be used to predict the gene structure, such as intron/exon boundaries.
  • the subject method is carried out as follows. Beginning with amino acid sequence data for a sample protein, such as may be provided using mass spectrometry, a set of degenerate nucleotide sequences (“reverse transcribed sequences” or “sequence tags”) is calculated for the input polypeptide sequence.
  • the set of sequence tags represents all, or at least the most likely based on codon usage, nucleotide sequences which could encode the sample protein.
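  • By way of non-limiting illustration, the calculation of sequence tags from a polypeptide might be sketched in Python as follows. The sketch assumes the standard genetic code and enumerates every encoding exhaustively; the function name `sequence_tags` is hypothetical, and no codon-usage weighting is attempted.

```python
from itertools import product

# Standard genetic code, written in the conventional T, C, A, G codon order.
BASES = "TCAG"
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    a + b + c: AMINO_ACIDS[16 * i + 4 * j + k]
    for i, a in enumerate(BASES)
    for j, b in enumerate(BASES)
    for k, c in enumerate(BASES)
}

# Invert the table: amino acid -> list of codons that encode it.
CODONS_FOR = {}
for codon, amino_acid in CODON_TABLE.items():
    CODONS_FOR.setdefault(amino_acid, []).append(codon)


def sequence_tags(peptide):
    """Yield every nucleotide sequence that could encode `peptide` (hypothetical helper)."""
    choices = [CODONS_FOR[amino_acid] for amino_acid in peptide]
    for combination in product(*choices):
        yield "".join(combination)
```

  • The number of tags grows multiplicatively with peptide length (a two-residue peptide of leucine and serine already has 36 encodings), which is one reason a practical system might restrict the set to the most likely codons.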
  • Utilizing each of the sequence tags, one or more similarity searches of a genomic database(s) are carried out in “forward” and “reverse” directions to identify similar sequence(s) in the genomic database.
  • the subject method will utilize a pattern matching algorithm in the search which accounts for gaps in the similarity between the sequence tag and the genomic sequence, e.g., which accounts for and identifies the occurrence of intronic sequences which may disrupt the genomic coding sequence for the sample protein. This may be carried out utilizing further sequencing data, or by calculating intron/exon boundaries using known rules for intron splicing, and, for example, knowledge of the molecular weight of an unmodified form of the sample protein.
  • the subject method is carried out by pattern searching with the amino acid sequence for the sample protein, against a set (e.g., six) of genomic sequence databases representing the genomic nucleotide sequence having been dynamically translated in all three reading frames. That is, the pattern matching is done at the level of actual amino acid sequence in a database of predicted amino acid sequences.
  • the subject method will preferably utilize a pattern matching algorithm which accounts for gaps in the similarity between the amino acid sequence of the sample protein and the dynamically translated genomic sequence in order to allow for intronic sequences which have been carried into the dynamically translated database in the form of non-sense amino acid sequence.
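  • As a minimal sketch of searching at the amino acid level, the following Python fragment translates a nucleotide sequence in three reading frames on each strand and looks for a peptide in each translation. It assumes the standard genetic code and performs exact matching only; the gap-tolerant matching described above is not implemented here, and the names `six_frames` and `find_peptide` are hypothetical.

```python
BASES = "TCAG"
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    a + b + c: AMINO_ACIDS[16 * i + 4 * j + k]
    for i, a in enumerate(BASES)
    for j, b in enumerate(BASES)
    for k, c in enumerate(BASES)
}
COMPLEMENT = str.maketrans("ACGT", "TGCA")


def translate(dna):
    """Translate codon by codon; unknown codons become 'X', stops become '*'."""
    return "".join(
        CODON_TABLE.get(dna[i:i + 3], "X") for i in range(0, len(dna) - 2, 3)
    )


def six_frames(dna):
    """Return {(strand, offset): protein} for both strands and three offsets."""
    reverse = dna.translate(COMPLEMENT)[::-1]
    return {
        (strand, offset): translate(sequence[offset:])
        for strand, sequence in (("+", dna), ("-", reverse))
        for offset in range(3)
    }


def find_peptide(dna, peptide):
    """Return ((strand, offset), residue_index) for each exact occurrence."""
    hits = []
    for frame, protein in six_frames(dna).items():
        start = protein.find(peptide)
        while start != -1:
            hits.append((frame, start))
            start = protein.find(peptide, start + 1)
    return hits
```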
  • the subject method also utilizes homology searching to identify known, related proteins. Where only fragments of the sample protein have been sequenced, the sequence of identified homologs can be used to predict the remaining coding sequence and, accordingly, the intronic structure of the gene. The presence of homologs of known function can, of course, also provide guidance to the potential function of the sample protein.
  • the size of the human genome is approximately 25 times that of A. thaliana but the coding sequence is expected to be only 2-3 times larger. Tryptic peptides of the size typically encountered in MS sequencing (>10 aa) are almost always unique in the human genome. The information content of peptide sequence tags approximates that of the complete peptide sequence. In addition, the sequences retrieved by the search are checked against the tandem MS data which eliminates false positives. Therefore, searches using even short tags almost always result in unique identifications. Interestingly, the search specificity in the human genome is virtually identical to that of the dbEST but with the added advantage of high sequence accuracy, low redundancy and unbiased coverage.
  • peptides are partially sequenced during the course of a protein identification experiment using, for example, a mass spectrometer. Subsequent database searches identify peptides which cluster in a confined (2-15 kb) region of the genome which encompasses the underlying gene. The identified peptides define reading frames which in turn hold information about the intron/exon structure of the gene. Generally, two peptides are sufficient to identify and map the respective gene to its chromosomal location. Any of the identified exons can be used as probes for cloning or for homology searching for tentative function assignment. The defined genome area can be used to direct sequencing of further peptides in the same experiment.
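  • The clustering of peptide hits into a confined genomic region might be sketched as follows. This is an illustrative assumption rather than the disclosed procedure: hits are taken to be integer coordinates on a single chromosome, a cluster may span at most 15 kb (the upper end of the range given above), and the function name `cluster_hits` is hypothetical.

```python
def cluster_hits(positions, max_span=15_000):
    """Group hit coordinates so that each group spans at most `max_span` bases."""
    clusters = []
    for position in sorted(positions):
        if clusters and position - clusters[-1][0] <= max_span:
            clusters[-1].append(position)   # still inside the current region
        else:
            clusters.append([position])     # start a new candidate region
    return clusters


def candidate_loci(positions, min_peptides=2, max_span=15_000):
    """Keep only regions supported by at least `min_peptides` hits."""
    return [c for c in cluster_hits(positions, max_span) if len(c) >= min_peptides]
```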
  • percent identity refers to the degree to which residues at aligned positions between nucleic acid or amino acid sequences are identical. For example, if two sequences have 43 residues out of a total of 144 in common, they are 29.9% identical.
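  • The arithmetic of the example can be checked with a short Python sketch. It assumes the two sequences are already aligned to equal length and that a gap character `-` never counts as a match; the function name `percent_identity` is hypothetical.

```python
def percent_identity(aligned_a, aligned_b):
    """Percentage of aligned positions holding the same residue."""
    if len(aligned_a) != len(aligned_b):
        raise ValueError("sequences must be aligned to the same length")
    if not aligned_a:
        raise ValueError("sequences must not be empty")
    matches = sum(x == y and x != "-" for x, y in zip(aligned_a, aligned_b))
    return 100.0 * matches / len(aligned_a)
```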
  • genomic information includes protein coding regions, introns and other non-coding sequences, and other such structures that commonly appear in genomic sequences. It is also meant to include the reading frame for proteins as encoded by a gene.
  • a “nucleotide residue” refers to the nucleotide found along a polynucleotide sequence.
  • A adenine
  • G guanine
  • C cytosine
  • T thymine
  • U uracil
  • This term can also include mutated and/or genetically engineered variations of nucleotide bases as are known in the art.
  • ORF or “Open Reading Frame” is a nucleotide sequence which could be translated into a polypeptide. Such a stretch of sequence is uninterrupted by a stop codon.
  • An ORF that represents the coding sequence for a full protein begins with an ATG “start” codon and terminates with one of the three “stop” codons.
  • an ORF may be any part of a coding sequence, with or without start and/or stop codons. “ORF” and “CDS” may be used interchangeably.
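  • A minimal sketch of locating stretches uninterrupted by a stop codon is given below. It scans only the three forward reading frames of a DNA string and does not require an ATG start codon, consistent with the broader usage of “ORF” above; the function name `open_reading_frames` is hypothetical.

```python
STOP_CODONS = {"TAA", "TAG", "TGA"}


def open_reading_frames(dna, min_codons=1):
    """Return (start, end) index pairs of stop-free stretches, frame by frame."""
    orfs = []
    for offset in range(3):
        start = offset
        position = offset
        while position + 3 <= len(dna):
            if dna[position:position + 3] in STOP_CODONS:
                # Close the current stretch just before the stop codon.
                if (position - start) // 3 >= min_codons:
                    orfs.append((start, position))
                start = position + 3
            position += 3
        # Close the stretch that runs to the end of the sequence.
        if (position - start) // 3 >= min_codons:
            orfs.append((start, position))
    return orfs
```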
  • annotation refers to the description of an ORF, introns and other genomic features.
  • a “contig” is a sequence derived by assembling two or more overlapping sequence fragments. For instance, a contig representing a portion of a CDS may be constructed by combining two or more overlapping EST sequences.
  • allele refers to alternative forms of a genetic locus; a single allele for each locus is inherited separately from each parent. The sequences of two alleles may be identical or may be different.
  • A variety of genomic sequence matching, or “approximate string matching”, algorithms are known in the art and can be readily adapted for use in the present invention.
  • One problem in finding the coding sequence for a protein in genomic sequence databases can be formally stated as follows: given a genomic sequence of length n, a sequence tag of length m, and a maximal number of errors (e.g., gaps for intronic sequence) of k, find all segments of the genomic sequence (referred to herein as “occurrences” or “matches”) whose “edit distance” to the sequence tag is at most k.
  • the edit distance between two sequences is defined as the minimum number of edit operations needed to transform one sequence into the other.
  • the allowed edits in the context of the present invention include deleting, inserting and replacing nucleotide residues.
  • the genomic sequence(s) and sequence tag(s) are sequences of characters from an alphabet $\Sigma$ (of nucleotide residues) of size $\sigma$.
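  • The problem statement above can be illustrated with a textbook dynamic-programming sketch (often attributed to Sellers), offered as an assumption-laden example rather than as the disclosed algorithm. It reports every end position in the genomic sequence at which some segment lies within edit distance k of the sequence tag; the function name `approximate_ends` is hypothetical.

```python
def approximate_ends(text, pattern, k):
    """Return end indices j such that some segment of text ending at j is
    within edit distance k of pattern (insertions, deletions, replacements)."""
    m = len(pattern)
    # column[i] = best distance between pattern[:i] and a segment ending here.
    column = list(range(m + 1))
    ends = []
    for j, character in enumerate(text):
        diagonal = column[0]          # value of column[i - 1] before this step
        column[0] = 0                 # a match may begin at any text position
        for i in range(1, m + 1):
            cost = 0 if pattern[i - 1] == character else 1
            best = min(
                column[i] + 1,        # extra character in the text
                column[i - 1] + 1,    # pattern character missing from the text
                diagonal + cost,      # match or replacement
            )
            diagonal = column[i]
            column[i] = best
        if column[m] <= k:
            ends.append(j)
    return ends
```

  • This version takes time proportional to n·m; the bit-parallel methods discussed in this document aim to reduce that cost.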
  • Examples of local similarity algorithms include the Smith-Waterman ( J Mol Biol 147:195-197, 1981), BLAST (Altschul et al, J Mol Biol 215:403-410, 1990), and FASTA (Pearson and Lipman, PNAS 85:2444-2448, 1988).
  • the subject method uses a string matching method based on bit operations or on arithmetic, rather than character comparisons.
  • Some of the examples are the Shift-And method, Karp-Rabin fingerprint method, or the algorithm of Commentz-Walter (“A string matching algorithm fast on the average” Proc. 6th International Colloquium on Automata, Languages, and Programming (1979), pp. 118-132), which combines the Boyer-Moore technique with the Aho algorithm.
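  • As an example of matching based on arithmetic rather than character comparisons, a Karp-Rabin style fingerprint search might look like the sketch below. The base and modulus are arbitrary illustrative choices, the function name `karp_rabin_search` is hypothetical, and each fingerprint hit is verified by a direct comparison to rule out collisions.

```python
def karp_rabin_search(text, pattern, base=256, modulus=1_000_000_007):
    """Return start indices of exact occurrences of pattern in text."""
    n, m = len(text), len(pattern)
    if m == 0 or m > n:
        return []

    def fingerprint(segment):
        value = 0
        for character in segment:
            value = (value * base + ord(character)) % modulus
        return value

    high = pow(base, m - 1, modulus)      # weight of the leading character
    target = fingerprint(pattern)
    window = fingerprint(text[:m])
    starts = []
    for i in range(n - m + 1):
        # Compare fingerprints first; confirm with a direct comparison.
        if window == target and text[i:i + m] == pattern:
            starts.append(i)
        if i + m < n:
            # Roll the window: drop text[i], append text[i + m].
            window = ((window - ord(text[i]) * high) * base + ord(text[i + m])) % modulus
    return starts
```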
  • the subject method utilizes a pattern matching algorithm from the GREP family.
  • One method for solving this problem is the algorithm described by Aho et al. (“Efficient string matching”, Communications of the ACM 18 (June 1975), pp. 333-340) which solves the problem in linear time. This algorithm is the basis of fgrep.
  • an exemplary embodiment of the method utilizes the AGREP algorithm, e.g., adapted from the teachings of Wu et al. (1992) Communications of the ACM, 35:83 and Wu et al. Proceedings of the Winter 1992 USENIX Conference San Francisco, 20-24. January 1992. pp. 153-162, Berkeley.
  • the pattern and the text are sequences of characters from a finite character set $\Sigma$.
  • the characters are DNA sequences, e.g., representing nucleotide bases, and are preferably genomic sequences.
  • $1 \le i \le n - m + 1$ such that $t_i t_{i+1} \ldots t_{i+m-1} = P$.
  • the subject method uses an exact string matching method.
  • Let R be a bit array of size m (the size of the pattern).
  • the term $R_j$ denotes the value of the array R after the j-th character of the nucleotide sequence has been processed.
  • the transition from $R_j$ to $R_{j+1}$ can be summarized as follows:
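  • The transition is not reproduced in this text. In the standard Shift-And formulation, which is assumed here, $R_{j+1}[i] = 1$ exactly when $R_j[i-1] = 1$ and the i-th pattern character equals $t_{j+1}$, with the first position always allowed to start a match. A minimal Python sketch under that assumption follows; the function name `shift_and_search` is hypothetical, and bit i − 1 of an integer stands for entry i of R.

```python
def shift_and_search(text, pattern):
    """Return start indices of exact occurrences using bit operations."""
    if not pattern:
        raise ValueError("pattern must not be empty")
    m = len(pattern)
    # masks[c] has bit i set when pattern[i] == c (the S arrays).
    masks = {}
    for i, character in enumerate(pattern):
        masks[character] = masks.get(character, 0) | (1 << i)
    accept = 1 << (m - 1)     # bit set when the whole pattern has matched
    state = 0                 # R_0: nothing matched yet
    starts = []
    for j, character in enumerate(text):
        # Shift in a 1 (a match may start here), then keep only the
        # positions whose pattern character equals the text character.
        state = ((state << 1) | 1) & masks.get(character, 0)
        if state & accept:
            starts.append(j - m + 1)
    return starts
```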
  • the subject method utilizes an algorithm which tolerates errors (mismatches), e.g., for approximate pattern matching between the sequence tag and genomic sequence(s).
  • the previously described method can be adapted to allow errors in matching.
  • the method can be adapted to permit one insertion into the pattern at any position.
  • the method finds all intervals of size at most m+1 in the genomic sequence that contain the pattern of the sequence tag as a subsequence.
  • the R and S arrays are defined as before, but now there are two possibilities for each prefix match. There can be an exact match or a match with one insertion.
  • the transition for the R array is the same as before. One need only specify the transition for $R^1$. There are two cases for a match with at most one insertion of the first i characters of P up to $t_{j+1}$:
  • Case I1 can be handled by just copying the value of R to $R^1$.
  • the method can allow for one deletion between the sequences (and no insertions).
  • R, $R^1$ (which now indicates one deletion), and S are as defined before. There are again two cases for a match with at most one deletion of the first i characters of P up to $t_{j+1}$:
  • the method can allow for a substitution. That is, it allows for replacing one nucleotide of P with one nucleotide of T. Again, there are two cases:
  • Case S2 is again the same. Case S1 corresponds to looking at $R_j[i-1]$ as opposed to looking at $R_{j+1}[i-1]$ in case D1.
  • the subject method handles the general case of up to k errors, where an error can be either an insertion, a deletion, or a substitution (the Levenshtein or the edit-distance measure).
  • k additional arrays $R^1, R^2, \ldots, R^k$ are maintained, such that array $R^d$ stores all possible matches with up to d errors.
  • the transition from array $R_j^d$ to $R_{j+1}^d$ is determined.
  • the subject method can provide the following expression for $R_{j+1}^d$:
  • $R_0^d = 11 \ldots 100 \ldots 000$ (d ones).
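  • The expression itself is not reproduced in this text. The sketch below assumes the recurrence commonly given for the AGREP family, in which $R_{j+1}^d$ combines a match step on $R_j^d$ with an insertion term ($R_j^{d-1}$), a substitution term (shifted $R_j^{d-1}$) and a deletion term (shifted $R_{j+1}^{d-1}$). It is an illustration under that assumption, not a statement of the claimed method; the function name `bitparallel_approximate_ends` is hypothetical.

```python
def bitparallel_approximate_ends(text, pattern, k):
    """Return end indices of segments within edit distance k of pattern."""
    if not pattern:
        raise ValueError("pattern must not be empty")
    m = len(pattern)
    masks = {}
    for i, character in enumerate(pattern):
        masks[character] = masks.get(character, 0) | (1 << i)
    accept = 1 << (m - 1)
    full = (1 << m) - 1
    # R[d] starts with d ones: a prefix of length d matches with d deletions.
    R = [(1 << d) - 1 for d in range(k + 1)]
    ends = []
    for j, character in enumerate(text):
        s = masks.get(character, 0)
        previous_old = R[0]                       # R_j^{d-1}
        R[0] = ((R[0] << 1) | 1) & s              # exact-match transition
        for d in range(1, k + 1):
            old = R[d]                            # R_j^d
            R[d] = (
                (((old << 1) | 1) & s)            # match on t_{j+1}
                | previous_old                    # insertion
                | ((previous_old << 1) | 1)       # substitution
                | ((R[d - 1] << 1) | 1)           # deletion, uses R_{j+1}^{d-1}
            ) & full
            previous_old = old
        if R[k] & accept:
            ends.append(j)
    return ends
```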
  • Let $P_1, P_2, \ldots, P_{k+1}$ be the first k+1 blocks of P, each of size r.
  • All $P_j$'s can be searched at the same time and, if one matches, then the whole pattern can be checked directly within a neighborhood of size m from the position of the match.
  • Because the method looks for an exact match, there is no need to maintain all k of the $R^d$ vectors.
  • This scheme will run fast if the number of exact matches to any one of the $P_j$'s is not too high. The number of such matches depends on many factors including the values of r and m.
  • the subject method will generate multiple sequence tags. In general, one will want to find all occurrences of any of these sequence tags. Under those circumstances, the pattern searching against the genomic sequence(s) can be conducted one at a time or together.
  • the multi-pattern matching algorithm described above can be used to solve the approximate string-matching problem for searching reverse translated sequences against genomic sequences.
  • the pattern P, of length M, is divided into k+1 fragments $P_1, P_2, \ldots, P_{k+1}$, each of size $m = M/(k+1)$.
  • By the pigeonhole principle, if $T_{ij}$ differs from P by no more than k errors, then one of the fragments must match a substring of $T_{ij}$ exactly.
  • the approximate string matching algorithm is conducted in two phases.
  • In the first phase, the sequence is partitioned into k+1 fragments and the multi-pattern string matching algorithm is used to find all those places in the genomic sequence that contain one of the fragments. If there is a match of a fragment at position i of the genomic sequence, the system marks the positions $i - M - k$ to $i + M + k - m$ as a “candidate” area.
  • In the second phase, an approximate matching algorithm as described above is used to find the actual matches in those marked areas.
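  • The two phases might be sketched as follows. Several simplifications are assumed: fragments are located with Python's built-in `str.find` rather than a multi-pattern algorithm, the fragment size is taken as the integer part of M/(k+1), the candidate area is a slightly wider window than the one stated above, and the second phase uses a plain dynamic-programming check. The function names are hypothetical.

```python
def approximate_ends(text, pattern, k):
    """Dynamic-programming check: end indices of segments within k edits."""
    m = len(pattern)
    column = list(range(m + 1))
    ends = []
    for j, character in enumerate(text):
        diagonal, column[0] = column[0], 0
        for i in range(1, m + 1):
            cost = 0 if pattern[i - 1] == character else 1
            best = min(column[i] + 1, column[i - 1] + 1, diagonal + cost)
            diagonal, column[i] = column[i], best
        if column[m] <= k:
            ends.append(j)
    return ends


def two_phase_search(text, pattern, k):
    """Filter by exact fragment hits, then verify only the candidate areas."""
    M = len(pattern)
    m = M // (k + 1)                       # fragment size (integer division)
    if m == 0:
        raise ValueError("pattern is too short for this many errors")
    fragments = {pattern[f * m:(f + 1) * m] for f in range(k + 1)}
    ends = set()
    for fragment in fragments:
        hit = text.find(fragment)          # phase 1: exact fragment search
        while hit != -1:
            low = max(0, hit - M - k)      # candidate area around the hit
            high = min(len(text), hit + M + k)
            for end in approximate_ends(text[low:high], pattern, k):
                ends.add(low + end)        # phase 2: verify inside the area
            hit = text.find(fragment, hit + 1)
    return sorted(ends)
```

  • Because at least one of the k+1 disjoint fragments must survive k errors intact, restricting verification to the candidate areas should not lose any match that a full scan would report.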
  • Mass spectrometry has emerged as a central technique in a wide variety of functional genomics, or proteomics approaches to study gene function in the post-genomics world. Mass spectrometric instrumentation continues to become more powerful and novel instrumental concepts are being put into use.
  • the subject genomic searching system can be used as part of a proteomics discovery method.
  • the subject method can use peptide sequence information obtained by mass spectrometry as the identification method in “expression proteomics”, e.g., with sequencing data from two-dimensional gels of two different biological states.
  • FT ICR Fourier transform ion cyclotron resonance mass spectrometer
  • With a tandem mass spectrometric method, it may also be possible to identify the proteins “on-line” as they elute into the mass spectrometer. See, for example, Mørtz et al. (1996) PNAS 93:8264-8267; and Li et al. (1999) Anal. Chem. 71:4397.
  • the subject method is used to search genomic databases for sequences derived from multi-protein complexes, e.g., assemblies with a particular function such as splicing, transport or nuclear import/export.
  • One use of proteomics technology is to determine the makeup of such complexes. To this end, they need to be purified specifically, the identity of the factors in the complex needs to be determined and finally the in vivo presence of the novel members of the complex needs to be established.
  • the subject method can also be used as part of a proteomic discovery method to elucidate transient rather than structural complexes. Many signaling cascades are transmitted through multi-protein complexes involving scaffolds and these complexes can be biochemically purified.
  • the subject method can be used to identify proteins in cellular organelles. For instance, organelles can be purified and their composition analyzed by mass spectrometry. Since organelles are often less well defined than smaller multi-protein complexes, the task of verification of identifications becomes even more important.
  • a protein identification program comprising two main components: a server application with sequence database search routines that include client interface(s).
  • the ID program can be automated via the Microsoft Access databases ProAutoDB and ProLogDB and associated Visual Basic applications. Control of automation and data flow can be as follows: from the ID program GUI it is specified to query, e.g., the ‘FavoriteIndexFile’ from a list of several virtual index files. Elsewhere it is specified that ‘FavoriteIndexFile’ actually refers to, e.g., particular genomic sequence databases. Upon finding matches with scores higher than a predefined value, the search result and all search parameters can be logged, also in another prespecified database, and further searches on the dataset can be aborted or continued as predefined in the automation database.
  • Special automated actions can also be triggered by certain database retrieval events, e.g. the matching of a data set to a specific ORF (Open Reading Frame) could result in an e-mail being sent with all available information to a person with particular interest in this gene/protein.
  • ProteinDB: a sample/identification proteomics database for logging and correlating information such as sample identity, gel photos, mass spectra (and features therein), search results, etc. This database can be the final destination of data but can also be regarded as a temporary storage facility for data that is subsequently transferred by, e.g., standard SQL commands to other databases (e.g. Oracle and Sybase databases) on a remote server.
  • ID program Flow Agent: a software daemon to automate the transfer and utilization of mass spectrometric data.
  • ProAutoDB: database(s) containing search parameters and related information regarding data sets that have been scheduled for later automated (and repeated) searching against sequence databases.
  • all incoming samples are logged into ProteomeDB, before any analyses are carried out. Each logged sample is then automatically given a unique ID number that can be used to sort subsequently generated mass spectrometric data and database search results.
  • ProteomeDB will be able to download digestion protocols to a robotic workstation and supply all relevant sample information directly into MALDI and ES mass spectrometer control software. This means that setting up the analysis of a batch of samples will be done automatically.
  • mass spectrometric data is acquired either: i) manually; ii) automatically through built in features in the MS software; or iii) governed by scripts.
  • the relevant MS information e.g. a peptide mass list or fragment mass list is passed to ProAutoDB either by the MS control software directly, or via ID program Flow Agent.
  • ID program Client checks ProAutoDB for new tasks at set intervals and upon finding a job then executes the sequence database search. The outcome of every search is logged in ProLogDB, and if sequence database entries achieve a scoring value above a set threshold then these proteins are also logged back into ProteomeDB under the pertinent sample record.
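  • The polling behaviour just described might be sketched as below. Plain Python lists and dictionaries stand in for ProAutoDB, ProLogDB and ProteomeDB, the search routine is passed in as a function, and every name in the sketch is hypothetical; a real client would call something like this from a timer at set intervals.

```python
def run_pending_searches(pending, search, threshold, search_log, sample_records):
    """Execute every queued task once; return how many were run.

    pending        -- list of task dicts, stand-in for ProAutoDB
    search         -- function(masses) -> list of (entry, score) pairs
    search_log     -- list receiving every outcome, stand-in for ProLogDB
    sample_records -- dict sample_id -> entries, stand-in for ProteomeDB
    """
    executed = 0
    while pending:
        task = pending.pop(0)
        hits = search(task["masses"])
        # Every search outcome is logged, whatever its score.
        search_log.append({"sample_id": task["sample_id"], "hits": hits})
        # Only entries scoring above the threshold go back to the sample record.
        for entry, score in hits:
            if score > threshold:
                sample_records.setdefault(task["sample_id"], []).append(entry)
        executed += 1
    return executed
```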
  • ProteomeDB is a hierarchic database as can be developed for Microsoft Access. It may contain tables, forms, reports and a VisualBasic module or the like. Briefly, a batch can have many samples, each of which can have many mass spectra, each of which can have many database search results. The form set and the database tables can also be separated (called ‘split’) such that the data can reside on a central file server and be simultaneously accessible to a group of users, each of whom should have a copy of the form set on their computer.
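  • The batch, sample, spectrum and search-result hierarchy might be sketched with an in-memory SQLite database, used here purely as a stand-in for Microsoft Access. All table names, column names and the helper `enter_batch` are hypothetical; the default status string is taken from the description of new samples elsewhere in this document.

```python
import sqlite3

SCHEMA = """
CREATE TABLE batch (
    batch_id INTEGER PRIMARY KEY,
    owner    TEXT
);
CREATE TABLE sample (
    sample_id INTEGER PRIMARY KEY,          -- unique, auto-assigned sample ID
    batch_id  INTEGER REFERENCES batch(batch_id),
    name      TEXT,
    status    TEXT DEFAULT 'Received for identification'
);
CREATE TABLE spectrum (
    spectrum_id INTEGER PRIMARY KEY,
    sample_id   INTEGER REFERENCES sample(sample_id),
    kind        TEXT
);
CREATE TABLE search_result (
    result_id   INTEGER PRIMARY KEY,
    spectrum_id INTEGER REFERENCES spectrum(spectrum_id),
    entry       TEXT,
    score       REAL
);
"""


def open_database():
    """Create an empty in-memory database with the sketch schema."""
    connection = sqlite3.connect(":memory:")
    connection.executescript(SCHEMA)
    return connection


def enter_batch(connection, owner, sample_names):
    """Log a batch and its samples; return the batch ID and the sample IDs."""
    cursor = connection.execute("INSERT INTO batch (owner) VALUES (?)", (owner,))
    batch_id = cursor.lastrowid
    sample_ids = []
    for name in sample_names:
        cursor = connection.execute(
            "INSERT INTO sample (batch_id, name) VALUES (?, ?)", (batch_id, name)
        )
        sample_ids.append(cursor.lastrowid)
    return batch_id, sample_ids
```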
  • FIG. 8 shows an exemplary first window that becomes available after opening ProteomeDB, the main ‘Switchboard’.
  • the appearance of the switchboard can be modified to display the logo and colors of a company.
  • the ‘Enter New Batch’ button can be used to enter the data relating to a new batch of samples.
  • One or more secondary switchboards give access to most of the sub-forms for more direct and simplified entry of data (e.g., going into one table at a time). See FIGS. 9-11.
  • the primary batch information can be one record. See FIG. 12. These two forms, or views, can be set up to be the most used ProteomeDB interfaces; i.e., they are the ‘top level’ where a batch of samples is set in line for analysis and the report option is finally chosen. Ideally, it is only necessary to type in the number of samples in the batch and the name given to each sample by the owner of the batch. There is no way of predicting what users call their samples, so this task is preferably not automated.
  • the information in all the sub forms and surrounding bits of information may be either:
  • Examples of information reused from earlier batches are digestion protocols (and protocol steps), contact person information, etc.
  • the Companies sub form shows all the information stored on each single contact person. Not all information is necessarily used for each role that the contact subsequently has. For example, only the person chosen in the ‘Contacts information’ tab may be allocated the Web access code, and only the person in the tab ‘Billing information’ shows as having a Tax identification number associated.
  • a list can be provided which contains the information for each batch, namely the sample names along with the corresponding identification or sequence information that was found.
  • the user can be prompted to start by setting the number of samples in order to obtain an auto-generated list of unique sample ID numbers.
  • the analysis status of a new sample can be by default ‘Received for identification’, with other status possibilities chosen manually, such as, for example, ‘Received for sequencing’.
  • the ProteomeDB can maintain reports, e.g., for printing or electronic documentation, in separate text files pertaining to the relevant analytical results from each batch.
  • the system may offer some report options that exclusively deal with the results:
  • the ‘Short status report’: a very short information abstract that shows the results from the entire batch on one or a few pages.
  • the ‘Search Details paper’ includes results lists from each search and can occupy several pages per sample. In many cases, the researchers doing the actual protein analysis may not be the ones who own (submit) the samples and need the results. This means that communication of results to other parties is needed, and for that purpose there are some extra report options:
  • the ‘Receipt of samples’ fax: confirms the arrival of the samples and also states the batch ID and Web access code that will allow the owner of the samples to follow the analysis progress via the Web.
  • the ‘Report letter’: a letter to accompany one or all of the result reports mentioned above.
  • the information that needs to be conveyed about the analyses can often be similar from project to project. Therefore it may be useful to have a selection of informative standard paragraphs that may be included (or excluded) in this letter.
  • the ‘Invoice’ This module may generate the finished invoice, or can be used for interdepartmental billing. Invoice numbers can be assigned when the ‘Batch status’ is changed to ‘Completed’.
  • Date fields allow entry of the dates where: I) the samples were dispatched by their owner; II) the samples were received for analysis; and III) the analyses were completed.
  • the system will provide queries to check current status at various stages of the project work.
  • the window in FIGS. 13 and 14 contain the primary information that can be used in the database query, e.g., under the mass spectrometric data.
  • the user can create searches using peptide maps, peptide sequence tags, breakpoint, and sequence alone.
  • Help lines pertaining to any parameter field can be provided, shown here in the lower left-hand corner of the window, e.g., by leaving the cursor over the field of interest.
  • Nucleotide databases can be queried by peptide maps by the ID program version. ‘Breakpoint’ searches require a defined minimum number of fragment ion masses to match theoretically expected fragment masses. See FIG. 15. For example, for a database entry to match against a list of 10 masses, the system may require that at least 5 of these masses be possible Y-ions.
  • Main parameters are the precursor mass along with a list of fragment masses of which a requested number must theoretically match calculated fragment masses of Y, B, or Y AND B ions. See FIG. 16.
  • The MS error may, for example, be chosen very large (say, 50 Da) to accommodate modifications, substitutions, etc. To regain search specificity, a very small MS/MS error should then be used (for example, 20 ppm).
  • This search method is useful for searching on completely uninterpreted data. However, the search specificity is not as high as for sequence tag searches.
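The matching rule of a breakpoint search can be illustrated with a minimal Python sketch. It assumes the theoretical fragment masses of a candidate database peptide have already been calculated; the function names and default tolerances (taken from the 50 Da and 20 ppm examples above) are illustrative assumptions, not the ID program's actual interface.

```python
def within_ppm(observed, theoretical, ppm):
    """True if observed lies within +/- ppm parts per million of theoretical."""
    return abs(observed - theoretical) <= theoretical * ppm / 1e6


def count_matches(observed_masses, theoretical_masses, ppm=20.0):
    """Count observed fragment masses that match any theoretical fragment mass."""
    return sum(
        any(within_ppm(obs, theo, ppm) for theo in theoretical_masses)
        for obs in observed_masses
    )


def breakpoint_match(precursor, candidate_mass, observed, theoretical,
                     min_matches, ms_error_da=50.0, msms_ppm=20.0):
    """Accept a candidate when the precursor mass fits the wide MS window
    and at least min_matches fragments fit the narrow MS/MS window."""
    if abs(precursor - candidate_mass) > ms_error_da:
        return False
    return count_matches(observed, theoretical, msms_ppm) >= min_matches
```

The wide precursor window tolerates modified or substituted peptides, while the narrow fragment window and the `min_matches` threshold restore specificity.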
  • FIG. 17 shows a sample dialog box for entering and viewing information that is secondary to the database searches, i.e., information that is unnecessary for the search itself but may be relevant to the information flow following completion of the search. All of this information is expected to be entered automatically, either by the ID program Client itself or by the ID program Flow Agent passing information to the ID program.
  • Search information can then be logged in ProteomeDB (if logging is selected), but only if a unique sample record can be assigned. This requires the field ‘Sample ID’ to be filled out correctly. There must also be a spectrum name. Other fields can remain empty and still allow logging.
  • FIG. 18 shows a search parameter window for automating pattern matching.
  • The ‘Search life cycle’ can be set in the Automation tab of the search window.
  • Parameters that may be required by the subject system include:
  • The ID program Client can be configured to send an e-mail to a user when a match is found or when the search life cycle has ended and no match has been found.
  • The ID program Client can use the Simple Mail Transfer Protocol (SMTP) to send an e-mail.
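As an illustration of such a notification, the following Python sketch builds and sends a message with the standard library's `smtplib` and `email.message` modules. The function names, the message wording, and the SMTP host and port are assumptions made for the example; the source does not describe how the ID program Client composes its e-mails.

```python
import smtplib
from email.message import EmailMessage


def build_notification(sender, recipient, sample_id, match_found):
    """Compose the notification for a finished or successful search."""
    outcome = "match found" if match_found else "search life cycle ended, no match"
    msg = EmailMessage()
    msg["From"] = sender
    msg["To"] = recipient
    msg["Subject"] = f"Search result for sample {sample_id}: {outcome}"
    msg.set_content(f"Sample {sample_id}: {outcome}.")
    return msg


def send_notification(msg, host="localhost", port=25):
    """Send the message through an SMTP server (host and port are assumed)."""
    with smtplib.SMTP(host, port) as server:
        server.send_message(msg)
```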
  • FIG. 19 shows one embodiment in which the options for logging search results can be specified by the user.
  • These log files can be, e.g., local or on a remote file server. If the ProLogDB file does not exist, the ID program Client will create it.
  • The ProteomeDB database file can be the file that contains the tables. This means that if a database is split into forms and tables (e.g., by a Microsoft Access function), then the ID program Client must also keep track of the various parts.
  • The user may be prompted to set the standard search parameters to the most accurate values. If the standard search fails, then the user may define one or more follow-up searches using, e.g., less stringent criteria.
  • Matching entries to a query can be returned in a dynamic table that allows alphabetic sorting in either ascending or descending order by the contents of any column.
  • Default sorting may follow a score, which follows empirical scoring algorithms based on observations from hundreds of database searches. For example, these may have the form:
  • FIG. 21 shows an illustrative Search Result Window. Selecting and then ‘right-clicking’ on an entry in the result window, for example, can bring up a menu with a selection of information windows to further enhance the analysis of that entry.
  • FIG. 22 shows an exemplary 2nd pass check window.
  • The entry can be selected, e.g., by left-clicking in any field of its row in the Result window; the ‘2nd pass check’ window can then be chosen either by right-clicking or from the side bar.
  • The window displays the entire sequence information in the index file with the matched sequence pieces highlighted in different colors. Sequence covered by one matching peptide is a first color, that covered by two peptides is another color, and so forth.
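The per-residue coverage that drives such color coding can be sketched in Python as follows. This is a minimal illustration that assumes exact substring matching of peptides against the protein sequence; the function name is hypothetical.

```python
def coverage_counts(protein, peptides):
    """For each residue, count how many matched peptide occurrences cover it."""
    counts = [0] * len(protein)
    for pep in peptides:
        if not pep:
            continue
        start = protein.find(pep)
        while start != -1:
            for i in range(start, start + len(pep)):
                counts[i] += 1
            # Continue searching so repeated occurrences are also counted.
            start = protein.find(pep, start + 1)
    return counts
```

A display layer could then map a count of 1 to the first color, 2 to the second, and so on, leaving residues with a count of 0 unhighlighted.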
  • FIG. 23 shows an exemplary database entry window.
  • The illustrated browser window is for fast access to, e.g., SwissProt and BLAST searches at NCBI.
  • The addresses of the databases are listed in a settings file and can be changed to utilize Intranet mirrors instead of the presently chosen sites.
  • FIG. 24 illustrates what a Result summary window may look like. There is a small check box in the upper right-hand corner of each search result window. When checked, the contents of the individual search result windows are passed to the summary window.
  • The result lists are interleaved so that proteins found multiple times are registered with a count of occurrences and summed score values. This is meant to provide a simple and fast means of comparing data from several individually non-specific searches (as arising from short sequence tags and low abundance MALDI maps, for example).
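The interleaving of result lists can be sketched in Python as a simple aggregation. The sketch assumes each search result is a list of (accession, score) pairs and that the summary is ordered by occurrence count and then by summed score; both the data layout and the ordering are assumptions for illustration.

```python
from collections import defaultdict


def summarize(result_lists):
    """Merge several result lists into (accession, occurrences, summed score) rows."""
    totals = defaultdict(lambda: [0, 0.0])
    for results in result_lists:
        for accession, score in results:
            totals[accession][0] += 1
            totals[accession][1] += score
    rows = [(acc, n, s) for acc, (n, s) in totals.items()]
    # Most frequently found proteins first, then highest summed score.
    rows.sort(key=lambda r: (-r[1], -r[2], r[0]))
    return rows
```

A protein that appears in several individually weak searches rises to the top of the summary, which is the comparison the summary window is meant to support.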
  • FIG. 25 is an illustration of a database browser window.
  • The ProLogDB file whose path is specified in the ‘Logging’ tab can be browsed directly from the ID program Client by choosing ‘ProLogDB’ in the ‘View’ menu.
  • The result list can also be shown in full in a separate window (not displayed here) whenever a record is selected (highlighted). Alternatively, it is possible to work with the contents of these files via Microsoft Access or the like.
  • FIG. 26 shows two sample windows for permitting the user to convert a nucleic acid sequence to the corresponding amino acid sequence.
  • FIG. 27 shows an exemplary search parameter window for calculating theoretical fragment masses from peptide sequences. Selecting an entry from any result window from searches other than on peptide maps will enable the calculation of theoretical fragment mass values. These can be sorted (ascending and descending), for example, by each of the column titles shown in the window below.
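A calculation of theoretical fragment masses might look like the following Python sketch. It assumes singly protonated b and y ions, unmodified residues, and standard monoisotopic residue masses; the function name and output layout are hypothetical and do not describe the program's actual window.

```python
# Monoisotopic residue masses in Da (standard values).
MASS = {
    "G": 57.02146, "A": 71.03711, "S": 87.03203, "P": 97.05276,
    "V": 99.06841, "T": 101.04768, "C": 103.00919, "L": 113.08406,
    "I": 113.08406, "N": 114.04293, "D": 115.02694, "Q": 128.05858,
    "K": 128.09496, "E": 129.04259, "M": 131.04049, "H": 137.05891,
    "F": 147.06841, "R": 156.10111, "Y": 163.06333, "W": 186.07931,
}
WATER = 18.01056
PROTON = 1.00728


def fragment_ions(peptide):
    """Return rows (n, b_n, y_n) of singly charged b and y ion m/z values."""
    rows = []
    length = len(peptide)
    for n in range(1, length):
        b = sum(MASS[a] for a in peptide[:n]) + PROTON
        y = sum(MASS[a] for a in peptide[length - n:]) + WATER + PROTON
        rows.append((n, b, y))
    return rows
```

Sorting by a column, as the window allows, is then a one-liner such as `sorted(fragment_ions("GAK"), key=lambda row: row[2], reverse=True)`.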
  • FIG. 28 is a window which can be used as an interface for translating DNA sequence data that is either typed or copied into the window. It is also possible to type in a stretch of amino acid sequence to check for the occurrence of the sequence in each reading frame. This feature can be used for the validation of found ESTs on queries by MS/MS data. However, the window can also be used generally for highlighting amino acid sequence stretches in longer sequences of amino acids copied in (disregarding the use of the translation facility).
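The translation and reading-frame check can be illustrated with a short Python sketch. It assumes the standard genetic code and unambiguous A/C/G/T input (any other codon is translated as X); the frame labels such as "+1" and "-3" are a naming convention chosen for this example.

```python
from itertools import product

BASES = "TCAG"
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
# Standard genetic code; '*' marks a stop codon.
CODON_TABLE = {"".join(c): a for c, a in zip(product(BASES, repeat=3), AMINO)}
COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(dna):
    return dna.translate(COMPLEMENT)[::-1]


def translate(dna):
    """Translate complete codons; unknown codons become 'X'."""
    return "".join(CODON_TABLE.get(dna[i:i + 3], "X")
                   for i in range(0, len(dna) - 2, 3))


def six_frames(dna):
    """Translate all three frames of both strands."""
    dna = dna.upper()
    frames = {}
    for strand, seq in (("+", dna), ("-", reverse_complement(dna))):
        for offset in range(3):
            frames[f"{strand}{offset + 1}"] = translate(seq[offset:])
    return frames


def frames_containing(dna, stretch):
    """Names of the reading frames whose translation contains the stretch."""
    return [name for name, protein in six_frames(dna).items() if stretch in protein]
```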
  • The ID program Flow Agent can function as a conduit for control and information between the mass spectrum acquisition software and the ID program by transferring a list of peptide masses from an MS peptide map to the ID program Client for subsequent database search.
  • The ID program Flow Agent can monitor specified folders for the arrival of new peak lists and then transfer these, with or without relevant specific search parameters, to the ID program. This application is generally useful on computer systems that are directly in control of mass spectrometric data acquisition and handling or wherever the mass spectra are stored, but it also works well over a network.
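Folder monitoring of this kind can be sketched in Python with simple polling. The file pattern, the polling interval, and the `handle` callback (standing in for the transfer to the ID program) are all assumptions for illustration; the source does not specify how the Flow Agent detects new files.

```python
import time
from pathlib import Path


def find_new_peak_lists(folder, seen, pattern="*.txt"):
    """Return peak-list files not seen before and remember them in `seen`."""
    new_files = sorted(p for p in Path(folder).glob(pattern) if p.name not in seen)
    seen.update(p.name for p in new_files)
    return new_files


def watch(folder, handle, interval_s=5.0, max_polls=None):
    """Poll the folder and pass each newly arrived file to `handle`."""
    seen = set()
    polls = 0
    while max_polls is None or polls < max_polls:
        for path in find_new_peak_lists(folder, seen):
            handle(path)
        polls += 1
        time.sleep(interval_s)
```

Polling works equally on local disks and network shares, which fits the remark that the application also works over a network.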
  • Proteome projects seek to provide systematic functional analysis of the genes uncovered by genome sequencing initiatives. Mass spectrometric protein identification is a key requirement in these studies but to date, database-searching tools rely on the availability of protein sequences derived from full-length cDNA, expressed sequence tags (ESTs) or predicted open reading frames (ORFs) from genomic sequences. We demonstrate here that proteins can be identified directly in large genomic databases using peptide sequence tags obtained by tandem mass spectrometry. On the background of vast amounts of non-coding DNA sequence, identified peptides localize coding sequences (exons) in a confined region of the genome, which contains the cognate gene. The approach does not require prior information about putative ORFs as predicted by computerized gene finding algorithms. The method scales to the complete human genome and allows identification, mapping, cloning and assistance in gene prediction of any protein for which minimal mass spectrometric information can be obtained. Several novel proteins from A. thaliana and human have been discovered in this way.
  • Proteins: Protein samples from Arabidopsis thaliana (A. thaliana) were excised from a 2D PAGE gel of a total membrane-associated protein preparation. (Human protein samples were obtained from ongoing research projects within our group and through collaborations, see Example 2.) Spots were excised from gels and digested with trypsin as described previously. See Shevchenko et al. 1996 Anal Chem 68:850-858.
  • Mass spectrometry: MALDI mass spectra were acquired on a Bruker REFLEX III reflectron time of flight (TOF) mass spectrometer (Bruker-Daltonik, Bremen, Germany). Matrix surfaces were made from α-cyano-4-hydroxycinnamic acid by the fast evaporation method. Vorm et al. 1994 Anal. Chem. 66:3281-3287; and Jensen et al. 1996 Rapid Commun Mass Spectrom 10:1371-1378.
  • Genome databases and searching: The A. thaliana genome database was obtained from the curators of the Arabidopsis Genome Initiative at The Institute for Genomic Research (TIGR), Rockville, Md. A custom Perl script was used to convert the downloaded database into a FASTA-formatted sequence index file accepted by the PepSea database search software system (MDS Protana A/S, Denmark). The human genome database (HGdB) was constructed in a similar fashion. Finished and unfinished human genome sequences (phases 0-3) were downloaded from the NCBI ftp site. Peptide sequence tags and MALDI peptide mass maps were searched against the respective databases using the program PepSea (MDS Protana A/S, Denmark).
  • Default search criteria specified trypsin as the protease and required measurement accuracy of better than 50 ppm for both intact peptide ion and fragment ion masses.
  • The amino acid part of the peptide sequence tag was translated into the corresponding degenerate oligonucleotide sequence.
  • Potential hits in the forward or reverse direction on the human genome data were checked as to whether they coded for the amino acid sequence of the tag.
  • The mass distance to the N- and C-terminal part of the potential peptide match was then calculated in the reading frame defined by that match. For the A. thaliana genomic database, searches took 2 s to complete on a PC cluster.
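The first two steps above can be sketched in Python. In this illustration the degenerate oligonucleotide is expressed as a regular expression that lists every codon of the standard genetic code for each tag residue, so finding a hit and confirming that it codes for the tag happen in one step; that merging, the function names, and the (strand, position, frame) output are assumptions of the sketch rather than the method used by PepSea.

```python
import re
from itertools import product

BASES = "TCAG"
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
# Map each amino acid to all codons that encode it (standard genetic code).
CODONS = {}
for codon, aa in zip(product(BASES, repeat=3), AMINO):
    CODONS.setdefault(aa, []).append("".join(codon))
COMPLEMENT = str.maketrans("ACGT", "TGCA")


def tag_to_pattern(tag):
    """Back-translate an amino acid tag into a degenerate nucleotide pattern."""
    return "".join("(?:" + "|".join(CODONS[aa]) + ")" for aa in tag)


def find_tag(dna, tag):
    """Find coding occurrences of the tag on both strands.

    Returns (strand, position on that strand, reading frame 0-2) tuples.
    """
    dna = dna.upper()
    # A lookahead lets overlapping occurrences be reported as well.
    pattern = re.compile("(?=(" + tag_to_pattern(tag) + "))")
    hits = []
    for strand, seq in (("+", dna), ("-", dna.translate(COMPLEMENT)[::-1])):
        for match in pattern.finditer(seq):
            hits.append((strand, match.start(), match.start() % 3))
    return hits
```

The reading frame returned with each hit is what the subsequent mass-distance calculation would use to extend the match toward the N- and C-termini.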
  • A peptide sequence tag consists of a few amino acids that can easily be assigned (manually or by software) in a tandem mass spectrum. These amino acids are ‘locked’ into position within the peptide by the ‘start’ and ‘end’ masses of a fragment ion series. Together with the mass of the intact peptide, a search template is created.
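How such a search template constrains a candidate peptide can be shown with a simplified Python sketch. Here the ‘start’ and ‘end’ masses are taken to be plain sums of residue masses up to the beginning and end of the tag; real fragment ion masses also carry terminal-group and proton offsets that depend on the ion series, which this sketch deliberately omits. The function name, argument order, and tolerance are hypothetical.

```python
# Monoisotopic residue masses in Da (standard values).
MASS = {
    "G": 57.02146, "A": 71.03711, "S": 87.03203, "P": 97.05276,
    "V": 99.06841, "T": 101.04768, "C": 103.00919, "L": 113.08406,
    "I": 113.08406, "N": 114.04293, "D": 115.02694, "Q": 128.05858,
    "K": 128.09496, "E": 129.04259, "M": 131.04049, "H": 137.05891,
    "F": 147.06841, "R": 156.10111, "Y": 163.06333, "W": 186.07931,
}
WATER = 18.01056


def matches_tag(peptide, peptide_mass, start_mass, tag, end_mass, tol=0.02):
    """Check a candidate peptide against (peptide mass, start, tag, end)."""
    total = sum(MASS[a] for a in peptide) + WATER
    if abs(total - peptide_mass) > tol:
        return False
    tag_mass = sum(MASS[a] for a in tag)
    prefix = 0.0  # summed residue masses N-terminal to position i
    for i in range(len(peptide) - len(tag) + 1):
        if (abs(prefix - start_mass) <= tol
                and abs(prefix + tag_mass - end_mass) <= tol
                and peptide.startswith(tag, i)):
            return True
        prefix += MASS[peptide[i]]
    return False
```

The tag alone would match any peptide containing the same short residue stretch; the three mass conditions are what pin it to one position in one peptide.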
  • The use of accurate peptide and fragment ion mass information in addition to amino acid sequence information increases the search specificity of a peptide sequence tag by more than a millionfold over the short amino acid tag sequence alone. Searches using the peptide sequence tag algorithm (Mann et al.
  • In-frame stop codons upstream and downstream of the identified peptide also limit the extent of the exon within which the splice signals (exon intron boundaries) must be found. This information is useful for the reconstruction of the gene from the nucleotide sequence (see below).
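The bounding of an exon candidate by in-frame stop codons can be sketched in Python. The sketch assumes a forward-strand hit given as half-open coordinates whose length is a multiple of three, and it only reports the stop-free stretch; locating the actual splice signals inside that stretch is outside its scope. The function name is hypothetical.

```python
STOPS = {"TAA", "TAG", "TGA"}


def open_frame_bounds(dna, hit_start, hit_end):
    """Extend dna[hit_start:hit_end] codon by codon, in the hit's own frame,
    until a stop codon or the sequence end is reached on either side."""
    dna = dna.upper()
    start = hit_start
    while start - 3 >= 0 and dna[start - 3:start] not in STOPS:
        start -= 3
    end = hit_end
    while end + 3 <= len(dna) and dna[end:end + 3] not in STOPS:
        end += 3
    return start, end
```

For a hit on the reverse strand, the same function could be applied to the reverse complement of the sequence.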
  • Unambiguous identification of peptides in the genome directly provides the information necessary for further analysis of the corresponding gene.
  • The identified exon sequences define the direction of the nucleotide sequence.
  • The identified exon can be used directly for homology searching, and as a probe for cloning the gene.
  • The above identifications map the respective genes to their locations in the genome.
  • The mass fingerprinting data obtained in the first step of the mass spectrometric analysis were used to obtain further information about the gene structure. Since only peptide masses are available in peptide mass mapping, the identified genomic region (approximately 10 kb for A. thaliana) was translated in three reading frames. The exon sequence coverage can be refined and additional exons can sometimes be discovered by peptide mass mapping.
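Peptide mass mapping against one translated reading frame can be sketched in Python as follows. The sketch assumes the frame is supplied as an amino acid string with ‘*’ at stop codons, that trypsin cleaves after K or R except before P, and that the observed values are neutral monoisotopic masses; the minimum peptide length, the 50 ppm default, and the function names are illustrative assumptions.

```python
# Monoisotopic residue masses in Da (standard values).
MASS = {
    "G": 57.02146, "A": 71.03711, "S": 87.03203, "P": 97.05276,
    "V": 99.06841, "T": 101.04768, "C": 103.00919, "L": 113.08406,
    "I": 113.08406, "N": 114.04293, "D": 115.02694, "Q": 128.05858,
    "K": 128.09496, "E": 129.04259, "M": 131.04049, "H": 137.05891,
    "F": 147.06841, "R": 156.10111, "Y": 163.06333, "W": 186.07931,
}
WATER = 18.01056


def tryptic_peptides(translation, min_len=5):
    """Digest each stop-free segment after K/R (not before P)."""
    peptides = []
    for segment in translation.split("*"):
        start = 0
        for i, aa in enumerate(segment):
            last = i == len(segment) - 1
            if last or (aa in "KR" and segment[i + 1] != "P"):
                if i + 1 - start >= min_len:
                    peptides.append(segment[start:i + 1])
                start = i + 1
    return peptides


def map_masses(translation, observed, ppm=50.0):
    """Pair observed masses with tryptic peptides of the translated frame."""
    hits = []
    for pep in tryptic_peptides(translation):
        if any(a not in MASS for a in pep):
            continue  # skip peptides with residues outside the mass table
        mass = sum(MASS[a] for a in pep) + WATER
        for obs in observed:
            if abs(obs - mass) <= mass * ppm / 1e6:
                hits.append((pep, obs))
    return hits
```

Running `map_masses` on each of the three frame translations of the identified region would flag stretches supported by measured peptide masses as candidate exon sequence.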
  • Peptide sequences can also be used to join adjacent exons.
  • The underlined part of the peptide sequence TFDESK ETINKEIEEK (SEQ ID No: 1), derived from MS sequencing of the protein S8, is located in exon 1 and the remainder in exon 2 (Table 1) of the gene.
  • These ‘peptide exon bridges’ were frequently found (on average one per protein), but not so frequently as to always allow the full reconstruction of the gene.
  • The size of the human genome is approximately 25 times that of A. thaliana, and it is estimated that only 3% of the nucleotide sequence codes for proteins.
  • To learn about the feasibility of identifying coding sequences in the human genome against the background of the vast amounts of non-coding sequence, we searched data from more than 200 peptides, which we have sequenced by mass spectrometry in various projects, against an estimated 80% of the human genome that was publicly available at the time of writing. The results of these experiments are summarized in Table 2.
  • Peptide sequence tags comprising four amino acid residues retrieve only a single entry if the peptide is indeed in the human genome and none if it is not.
  • For a tag of three amino acids, the search retrieves on average two sequences, only one of which fits the spectrum when comparing all calculated fragment ion masses for the retrieved peptides with the experimental spectrum.
  • With a two amino acid tag, on average seven peptide sequences are retrieved. Evaluation of the sequences yields a unique result in almost all cases except when the peptide is too short to be unique in the database (<10 amino acids).
  • Tryptic peptides encountered in MS sequencing are typically longer than 10 amino acids and thus were almost always unique in the human genome. As in the case of searches in the A. thaliana genome, however, data from any two peptides were always sufficient for an unambiguous localization of the protein in the genome.
  • The methods presented here can be used in small and large-scale proteomics projects for all organisms that have sequenced genomes as well as their close relatives. Given the availability of minimal mass spectrometric peptide fragmentation data, it is possible to identify any protein from those organisms whether or not additional sequence information in the form of ESTs is available. The approach does not rely on a completely assembled genome sequence, only on full coverage of the genome, which, to date, can be achieved relatively quickly even for large genomes.
  • Peptide sequences correspond to coding regions within a gene. Whenever a peptide sequence tag derived from a MS/MS spectrum unambiguously identifies the corresponding DNA sequence in the genome, this sequence must be part of an exon. The peptide therefore locates the exon as well as the correct reading frame. In-frame stop codons upstream and downstream of the identified peptide also limit the extent of the exon within which the splice signals (exon intron boundaries) must be found. Mass spectral data can be used to screen the vicinity of mapped regions for further exons. In many cases, peptides span two exons which enables the localization of the exact splice site for the two exons involved.
  • Peptides are partially sequenced during the course of a protein identification experiment using nanoES tandem MS. Subsequent database searches identify peptides which cluster in a confined (2-15 kb) region of the genome which encompasses the underlying gene. The identified peptides define reading frames which in turn hold information about the intron/exon structure of the gene. Generally, two peptides are sufficient to identify and map the respective gene to its chromosomal location. Any of the identified exons can be used as probes for cloning or for homology searching for tentative function assignment. The defined genome area can be used to direct sequencing of further peptides in the same experiment.
  • The size of the human genome is approximately 25 times that of A. thaliana, but the coding sequence is expected to be only 2-3 times larger. Tryptic peptides of the size typically encountered in MS sequencing (>10 aa) are almost always unique in the human genome. The information content of peptide sequence tags approximates that of the complete peptide sequence. In addition, the sequences retrieved by the search are checked against the tandem MS data, which eliminates false positives. Therefore, searches using even short tags almost always result in unique identifications. Interestingly, the search specificity in the human genome is virtually identical to that of the dbEST but with the added advantage of high sequence accuracy, low redundancy and unbiased coverage.

US10/119,528 2001-04-09 2002-04-09 Methods and systems for searching genomic databases Abandoned US20030175722A1 (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
US10/119,528 US20030175722A1 (en) 2001-04-09 2002-04-09 Methods and systems for searching genomic databases

Applications Claiming Priority (3)

Application Number Priority Date Filing Date Title
US28255101P 2001-04-09 2001-04-09
US28536201P 2001-04-20 2001-04-20
US10/119,528 US20030175722A1 (en) 2001-04-09 2002-04-09 Methods and systems for searching genomic databases

Publications (1)

Publication Number Publication Date
US20030175722A1 true US20030175722A1 (en) 2003-09-18

Family

ID=26961511

Family Applications (1)

Application Number Title Priority Date Filing Date
US10/119,528 Abandoned US20030175722A1 (en) 2001-04-09 2002-04-09 Methods and systems for searching genomic databases

Country Status (5)

Country Link
US (1) US20030175722A1 (fr)
EP (1) EP1419518A2 (fr)
AU (1) AU2002256173A1 (fr)
CA (1) CA2445529A1 (fr)
WO (1) WO2002080649A2 (fr)

Also Published As

Publication number Publication date
EP1419518A2 (fr) 2004-05-19
WO2002080649A2 (fr) 2002-10-17
WO2002080649A3 (fr) 2004-03-11
AU2002256173A1 (en) 2002-10-21
CA2445529A1 (fr) 2002-10-17

Legal Events

Date Code Title Description
AS Assignment

Owner name: MDS PROTEOMICS, INC., CANADA

Free format text: ASSIGNMENT OF ASSIGNORS INTEREST;ASSIGNORS:MANN, MATTHIAS;MORTENSEN, PETER;REEL/FRAME:013078/0638

Effective date: 20020826

STCB Information on status: application discontinuation

Free format text: ABANDONED -- FAILURE TO RESPOND TO AN OFFICE ACTION