CN100552686C

CN100552686C - Gene probe information annotation method

Info

Publication number: CN100552686C
Application number: CNB2006100259721A
Authority: CN
Inventors: 金刚; 谢松旻; 王超
Original assignee: Shanghai Institutes for Biological Sciences SIBS of CAS
Current assignee: Shanghai Institutes for Biological Sciences SIBS of CAS
Priority date: 2006-04-24
Filing date: 2006-04-24
Publication date: 2009-10-21
Anticipated expiration: 2026-04-24
Also published as: CN101063988A

Abstract

The invention discloses a gene probe information annotation method that combines link integration with data warehouse technology. It overcomes the limitations of traditional gene chip probe annotation systems, significantly increasing the number of public source databases the annotation system can cover and improving the timeliness of its data updates. The technical scheme is: (1) input the identification information of the probe library used to prepare the gene chip into a data warehouse; (2) extract the correspondence between the gene probe identification information and the identification information of public source databases; (3) build links to the public source databases according to that correspondence, and use the links to extract, directly from the public source databases, the specific information related to the gene chip probes; (4) parse the specific information and output it. The invention is applied to building gene chip technology platforms.

Description

Gene probe information annotation method

Technical field

The present invention relates to a method for building a gene chip technology platform, and in particular to a method for annotating gene chip probe information.

Background technology

The emergence of the gene chip is one of the defining advances in high technology in recent years, a new technology formed at the intersection of physics, microelectronics and molecular biology. Gene chip technology is high-throughput: its basic principle is to integrate tens of thousands of DNA probes onto a chip about one square centimeter in size by microfabrication, enabling fast and efficient quantitative detection of mRNA and DNA sequences. Gene chips are playing an increasingly important role in applications such as elucidating gene function, exploring the causes and mechanisms of disease, and discovering potential diagnostic and therapeutic targets.

Because gene chips are high-throughput and information-dense, the probe annotation system is a key step in building a gene chip technology platform. The main function of a gene chip annotation system is to annotate the tens of thousands of gene probes on a chip, integrating the latest information about each gene's sequence, function and metabolic pathways, so as to meet the needs of automated analysis of chip detection results and of gene chip probe design. Well-known gene chip annotation systems include the DAVID system developed by Button et al. at the U.S. National Institutes of Health, the DRAGON system developed by Wilkinson et al. at Johns Hopkins University, and the SOURCE system developed by Diehn et al. at Stanford University. All of these systems rest on data warehouse technology: they physically integrate the public source databases to build a "one-stop" record of the information related to each gene chip probe.

Yet this approach has significant limitations. The greatest limitation of a data warehouse is that it cannot be updated in real time. Life science is advancing rapidly, and the public source databases add and revise large amounts of information every day, so a data warehouse that is refreshed only every two to three months cannot incorporate the latest gene probe information in a timely manner. Taking the DAVID system as an example, its annotation results contain a large number of invalid URLs (Uniform Resource Locators), which cannot give the user correct annotation results.

Another limitation of the data warehouse annotation method stems from the inconsistent data formats of the public source databases: the more source databases and data types a warehouse includes, the harder it becomes to manage, so the annotation capability of the data warehouse method is limited. Taking the DRAGON system, another of the probe annotation systems mentioned above, as an example, its limited annotation capability means that it cannot annotate information from GenBank and LocusLink, the most commonly used databases.

Given the importance of gene probe annotation to building gene chip platforms and to the automated analysis of chip detection results, overcoming the limitations of the above annotation systems and establishing a more accurate and more complete annotation method is an urgent problem in the gene chip field.

Summary of the invention

The object of the present invention is to solve the above problems by providing a gene probe information annotation method that overcomes the limitations of traditional gene chip probe annotation systems and the problems of data warehouse annotation technology. It integrates the latest information about the sequence, function and metabolic pathways of the genes targeted by chip probes, which aids both the automated analysis of gene chip detection results and the design of gene chip probes.

The technical scheme of the present invention is a gene probe information annotation method comprising: (1) inputting the identification information of the probe library used to prepare the gene chip into a data warehouse; (2) extracting, through an interface program of the data warehouse, the correspondence between the gene probe identification information and the identification information of public source databases; (3) building links to the public source databases according to the correspondence, and extracting through the links, directly from the public source databases, the specific information related to the gene chip probes; (4) parsing the specific information and outputting it.
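The four steps can be sketched as a small pipeline. This is only an illustrative outline, not the patented implementation: the function `annotate_probes` and its parameters `map_ids`, `build_link`, `fetch` and `parse` are hypothetical names, and each stage is passed in as a callable so that no particular warehouse or database API is assumed.

```python
from typing import Callable, Dict, Iterable, List

# Hypothetical type: database name -> identifier in that database.
IdMap = Dict[str, str]


def annotate_probes(
    probe_ids: Iterable[str],
    map_ids: Callable[[str], IdMap],        # steps (1)-(2): warehouse lookup
    build_link: Callable[[str, str], str],  # step (3): (database, id) -> URL
    fetch: Callable[[str], str],            # step (3): URL -> raw record text
    parse: Callable[[str, str], str],       # step (4): (database, raw) -> summary
) -> List[dict]:
    """Run the four steps for every probe ID and collect the results."""
    results = []
    for probe_id in probe_ids:
        record = {"probe_id": probe_id, "annotations": {}}
        # The warehouse supplies the ID correspondence; the data itself is
        # then pulled from each source database through its link.
        for database, db_id in map_ids(probe_id).items():
            url = build_link(database, db_id)
            record["annotations"][database] = parse(database, fetch(url))
        results.append(record)
    return results
```

Keeping the warehouse lookup (`map_ids`) separate from the per-database retrieval (`fetch`) mirrors the division of labor described above: the warehouse is used only for ID correspondences, while the content always comes from the source database at query time.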

In the above gene probe information annotation method, the data warehouses include NCBI's Entrez, EBI's Ensembl, and the UniProt data warehouse.

In the above gene probe information annotation method, the interfaces include the E-Utilities interface of Entrez, the EnsMart interface of Ensembl, and the SRS interface of UniProt.

In the above gene probe information annotation method, the specific information includes gene-related information for the probe, information about the protein encoded by the probe's gene, literature information for the probe, and data that aid the automated analysis of chip results.

In the above gene probe information annotation method, the identification information of the probe library in step (1) is an accession number, a UniGene Cluster identifier, or a LocusLink identifier.

In the above gene probe information annotation method, in step (3), the specific information is organized and then exported to a file in text format.

In the above gene probe information annotation method, in step (3), the links are URL links.

Compared with the prior art, the gene probe information annotation method of the present invention has the following beneficial effects. The invention extracts the correspondences among the various data types from online data warehouse systems, and then, based on those correspondences, uses link integration to extract the specific information related to the gene chip probes directly from the public source databases, thereby achieving annotation and integration. Because link integration retrieves data from the public source databases in real time, it overcomes the problem of untimely updates in data warehouse annotation. Link integration, however, has one major weakness: it handles correspondences between data structures poorly, whereas data warehouses maintain complete correspondences between data types. By combining link integration with data warehouse technology, the invention overcomes the limitations of traditional gene chip probe annotation systems, significantly increasing the number of public source databases the annotation system can cover and improving the timeliness of its data updates.

Description of drawings

Fig. 1 is a flowchart of the gene probe information annotation method of the present invention.

Fig. 2 is a data flow diagram of the gene probe information annotation method of the present invention.

Embodiment

The invention is further described below with reference to the drawings and an embodiment.

Fig. 1 shows the flow of the gene probe information annotation method of the present invention, and Fig. 2 shows its data flow. The steps of the method are described in detail below with reference to both Fig. 1 and Fig. 2.

Step S11: input the identification numbers (IDs) of the probe library used to prepare the gene chip. A gene chip probe is usually identified by its accession number as a unique ID. Because of their importance in chip information, the UniGene Cluster ID (hereinafter UniGene ID) and the LocusLink ID (that is, the Gene ID of the Entrez Gene database) are also increasingly common and can likewise be used as the initial unique ID. The input format for gene chip probe IDs is one ID per line; to keep the annotation results readable and correct, a text file containing the IDs is read in the same one-per-line format. Here we take the P53 gene (a well-known tumor suppressor gene) as an example and input the ID of its probe: AF136271.
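Reading the one-ID-per-line input can be sketched as follows. The function name `read_probe_ids` is hypothetical, and skipping blank lines and repeated IDs is an assumption added here for robustness; the text only specifies one ID per line.

```python
from typing import Iterable, List


def read_probe_ids(lines: Iterable[str]) -> List[str]:
    """Return probe IDs, one per input line, in their original order."""
    seen = set()
    ids = []
    for line in lines:
        probe_id = line.strip()
        # Assumption: blank lines and repeated IDs are ignored.
        if probe_id and probe_id not in seen:
            seen.add(probe_id)
            ids.append(probe_id)
    return ids
```

Because the function accepts any iterable of lines, it works equally on an open text file and on a list of strings typed in directly.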

Step S12: submit the ID to the data warehouses. These include NCBI's Entrez, EBI's Ensembl, and the UniProt data warehouse. Each warehouse in turn aggregates information from many public source databases: for example, Entrez collects information from the MIM database, the GO database and others; Ensembl collects information from the Promotor database; and UniProt collects information from the Pfam database, the Prosite database and others.

Step S13: extract, through the data warehouse interfaces in turn, the correspondence between the probe ID and the IDs of other public source databases. Among the data warehouses above, Entrez provides a CGI interface called E-Utilities, through which Entrez can be queried and data downloaded. Likewise, Ensembl provides the EnsMart interface and UniProt provides the SRS interface, which let users run customized, large-batch queries and retrieve information. These interfaces can be used to extract the correspondences among the various IDs.

Continuing with the P53 gene from step S11, the probe ID is sent through the E-Utilities CGI interface to query Entrez UniGene, and the system's parsing program extracts the UniGene ID, giving the UniGene ID for P53: Hs.408312. If there is no corresponding UniGene ID, the message "data not found" is returned.

After sorting, the UniGene IDs are sent through the E-Utilities CGI interface to query Entrez Gene, which yields the corresponding IDs in other public source databases. These include the IDs of various gene-related public source databases, such as MIM ID, GO ID, PMID, RefSeq ID and CDD ID.
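As a rough illustration of what such E-Utilities requests look like, the sketch below only builds query URLs and performs no network access. The `esearch.fcgi` and `elink.fcgi` endpoints and their `db`, `term`, `dbfrom` and `id` parameters are part of NCBI's E-utilities; the helper names are hypothetical, and the database names passed in (for example `unigene`) follow the text and may no longer be served by NCBI.

```python
from urllib.parse import urlencode

EUTILS_BASE = "https://eutils.ncbi.nlm.nih.gov/entrez/eutils"


def esearch_url(db: str, term: str) -> str:
    """URL for searching one Entrez database for a term (e.g. an accession)."""
    return f"{EUTILS_BASE}/esearch.fcgi?" + urlencode({"db": db, "term": term})


def elink_url(dbfrom: str, db: str, uid: str) -> str:
    """URL for finding records in `db` linked to record `uid` in `dbfrom`."""
    query = urlencode({"dbfrom": dbfrom, "db": db, "id": uid})
    return f"{EUTILS_BASE}/elink.fcgi?" + query
```

A search request of this kind would correspond to looking up the probe accession, and a link request to collecting the related IDs in other databases; the XML responses would still need to be parsed to pull out the identifiers.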

Next, the UniGene ID is passed through Ensembl's EnsMart CGI interface to extract the ID correspondences it stores, including the UniProt ID, the Ensembl ID and others. Here, the Ensembl ID corresponding to P53 is ENSG00000141510 and the corresponding UniProt ID is Q761V2.

The UniProt ID is then passed through the SRS CGI interface of the UniProt data warehouse to extract the corresponding IDs in various protein-related public source databases, such as Pfam ID and Prosite ID.
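The chained lookups of step S13 can be sketched with in-memory tables standing in for the remote warehouse queries. The tables and the function `resolve_ids` are hypothetical; the only values they hold are the P53 example IDs given in the text.

```python
from typing import Dict

NOT_FOUND = "data not found"

# Stand-ins for the warehouse interfaces, filled with the P53 example only.
ACCESSION_TO_UNIGENE = {"AF136271": "Hs.408312"}
UNIGENE_TO_ENSEMBL = {"Hs.408312": "ENSG00000141510"}
UNIGENE_TO_UNIPROT = {"Hs.408312": "Q761V2"}


def resolve_ids(accession: str) -> Dict[str, str]:
    """Follow accession -> UniGene -> Ensembl/UniProt, marking any gaps."""
    unigene = ACCESSION_TO_UNIGENE.get(accession)
    if unigene is None:
        # Without a UniGene ID the later lookups cannot be attempted.
        return {"accession": accession, "unigene": NOT_FOUND}
    return {
        "accession": accession,
        "unigene": unigene,
        "ensembl": UNIGENE_TO_ENSEMBL.get(unigene, NOT_FOUND),
        "uniprot": UNIGENE_TO_UNIPROT.get(unigene, NOT_FOUND),
    }
```

For the example probe, `resolve_ids("AF136271")` returns the UniGene, Ensembl and protein IDs quoted above, while an unknown accession yields the "data not found" marker instead of raising an error.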

Step S14: using the ID correspondences obtained in step S13, construct URLs in the form each public source database specifies, and use them to extract the relevant information remotely and directly from the online databases.
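Building one link per database from the ID correspondences might look like the sketch below. The URL templates are placeholders on `example.org`, not the real address formats of these databases, and `build_links` is a hypothetical helper; a working system would substitute the URL pattern each database actually publishes.

```python
from typing import Dict
from urllib.parse import quote

# Placeholder templates (assumption): real databases define their own formats.
URL_TEMPLATES = {
    "MIM": "https://example.org/mim/{id}",
    "Pfam": "https://example.org/pfam/{id}",
    "Prosite": "https://example.org/prosite/{id}",
}


def build_links(ids: Dict[str, str]) -> Dict[str, str]:
    """Map each database with a known template to a URL for its record."""
    links = {}
    for database, db_id in ids.items():
        template = URL_TEMPLATES.get(database)
        if template is not None:
            # Percent-encode the ID so it is safe inside a URL path.
            links[database] = template.format(id=quote(db_id))
    return links
```

Keeping the templates in one table means that supporting another public source database is a one-line addition, which reflects the flexibility the method attributes to link integration.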

Step S15: parse and organize the information obtained in step S14. After the data have been extracted from each public source database, a text parser extracts the information related to the chip probe and organizes it into four groups:

(1) gene-related information for the chip probe, including descriptive and functional annotation of the sequence, such as the gene name, symbol, accession number, related literature, GO terms, chromosome position, E.C. number, and many other features of the gene and of the product it encodes;

(2) information about the protein encoded by the probe's gene, including the protein's structure, function, classification, domains, conserved regions and sequence motifs;

(3) literature information for the probe, including the literature and citation information on single-gene disorders in the OMIM database, the carefully curated summaries of recent literature in GeneRIF, and the IDs and abstracts of articles published in PubMed;

(4) other important data that aid the automated analysis of chip results, including the promoter sequences in the Ensembl database, which can be used for batch prediction of gene promoters for the annotated probes, and the KEGG IDs and GO IDs used to analyze the pathways involved.
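The grouping performed in step S15 can be illustrated with a minimal parser. It assumes, purely for the sake of the example, that each retrieved record has already been reduced to `key: value` lines; the field names, the `FIELD_CATEGORY` table and the function `parse_record` are all hypothetical, and real database records would need format-specific parsing.

```python
from typing import Dict

CATEGORIES = ("gene", "protein", "literature", "analysis")

# Hypothetical field names mapped to the four groups described in step S15.
FIELD_CATEGORY = {
    "gene_name": "gene",
    "chromosome": "gene",
    "domain": "protein",
    "motif": "protein",
    "pmid": "literature",
    "generif": "literature",
    "kegg_id": "analysis",
    "promoter": "analysis",
}


def parse_record(text: str) -> Dict[str, Dict[str, str]]:
    """Sort 'key: value' lines into the four annotation groups."""
    grouped: Dict[str, Dict[str, str]] = {name: {} for name in CATEGORIES}
    for line in text.splitlines():
        # Split on the first colon only, so values may contain colons.
        key, sep, value = line.partition(":")
        category = FIELD_CATEGORY.get(key.strip().lower())
        if sep and category is not None:
            grouped[category][key.strip().lower()] = value.strip()
    return grouped
```

Lines with no colon or with an unrecognized field name are silently dropped in this sketch; a production parser would more likely log them for review.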

Step S16: output the information from step S15. For a small batch of probes, the annotation results can be displayed directly; for a large batch, the annotation results can be saved to a file.
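The output rule of step S16 can be sketched as below. The threshold `SMALL_BATCH_LIMIT`, the tab-separated layout and the function names are assumptions made for illustration; the text says only that small batches are displayed and large batches are saved to a file.

```python
import sys
from typing import List, TextIO, Tuple

Row = Tuple[str, str, str]  # (probe ID, field name, value)
SMALL_BATCH_LIMIT = 50      # hypothetical cutoff, not given in the text


def format_rows(rows: List[Row]) -> str:
    """Render rows as tab-separated text, one row per line."""
    return "".join("\t".join(row) + "\n" for row in rows)


def output_annotations(rows: List[Row], path: str, stream: TextIO = sys.stdout) -> str:
    """Display small batches on `stream`; save large ones to `path`."""
    probe_count = len({row[0] for row in rows})
    if probe_count <= SMALL_BATCH_LIMIT:
        stream.write(format_rows(rows))
        return "displayed"
    with open(path, "w", encoding="utf-8") as handle:
        handle.write(format_rows(rows))
    return "saved"
```

The batch size is counted in distinct probes rather than rows, since one probe typically produces many annotation rows.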

It should be understood that the inventive point of the present invention lies in annotating gene chip probes by combining link integration with data warehouse technology. The specific data warehouses, public source databases and gene probes mentioned in the above embodiment are given only as examples and are not intended to limit the invention.

The above embodiment is provided to enable those of ordinary skill in the art to implement or use the invention. Those of ordinary skill in the art may make various modifications or changes to the embodiment without departing from the inventive concept, so the scope of protection of the invention is not limited by the embodiment but is the broadest scope consistent with the inventive features recited in the claims.

Claims

1. A gene probe information annotation method, characterized by comprising:

(1) inputting the identification information of the probe library used to prepare the gene chip into a data warehouse;

(2) extracting, through an interface program of the data warehouse, the correspondence between the gene probe identification information and the identification information of public source databases;

(3) building links to the public source databases according to the correspondence, and extracting through the links, directly from the public source databases, the specific information related to the gene chip probes;

(4) parsing the specific information and outputting it.

2. The gene probe information annotation method according to claim 1, characterized in that the data warehouses include NCBI's Entrez, EBI's Ensembl, and the UniProt data warehouse.

3. The gene probe information annotation method according to claim 2, characterized in that the interfaces include the E-Utilities interface of Entrez, the EnsMart interface of Ensembl, and the SRS interface of UniProt.

4. The gene probe information annotation method according to claim 1, characterized in that the specific information includes gene-related information for the probe, information about the protein encoded by the probe's gene, literature information for the probe, and data that aid the automated analysis of chip results.

5. The gene probe information annotation method according to claim 1, characterized in that the identification information of the probe library in step (1) is an accession number, a UniGene Cluster identifier, or a LocusLink identifier.

6. The gene probe information annotation method according to claim 1, characterized in that, in step (3), the specific information is organized and then exported to a file in text format.

7. The gene probe information annotation method according to claim 1, characterized in that, in step (3), the links are URL links.