CN100343852C - Specific function-related gene information searching system and method for building database of searching words thereof

Publication number
CN100343852C
Authority
CN
China
Prior art keywords
gene
keyword
database
word frequency
speech
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Expired - Fee Related
Application number
CNB2005100375268A
Other languages
Chinese (zh)
Other versions
CN1744080A (en)
Inventor
黄仲曦
姚开泰
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Southern Medical University
Original Assignee
Southern Medical University
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Southern Medical University
Priority to CNB2005100375268A
Publication of CN1744080A
Application granted
Publication of CN100343852C
Status: Expired - Fee Related
Anticipated expiration

Abstract

The invention discloses a gene information retrieval system related to a specific function. Using a computer with an input and display terminal, together with a document search-term database composed of a gene name database, a word frequency base value database, a character string database and an auxiliary search word database, the system retrieves the literature related to each gene to be searched and performs word frequency analysis on it to extract the keywords of the genes. After specialized processing of the keywords, a word frequency list is built, and the gene information related to the specific function is finally obtained by cluster analysis. The invention offers accurate targeting and fast retrieval, avoids duplicated work, and saves considerable human and material resources. It is also suitable for commercial promotion and development.

Description

A gene information retrieval system related to a specific function, and a method for constructing the search-term database used by the system
Technical field:
The present invention relates to retrieving related gene information from existing gene information repositories, and in particular to a system for retrieving, from public gene information repositories, information on genes related to a specific function.
Technical background:
With the advance of the life sciences, it is now known that an abnormality in a particular biological function is caused by abnormal expression of certain genes in the organism, or by abnormal modification of their expression products. These genes are called the genes related to that biological function. For this reason, genome projects have been under way since the 1990s, and the genomes of many organisms (for example, yeast, human, rice, chicken and mouse) have now been sequenced. The sequencing results show that microbial genomes consist of several to several thousand genes; the human genome has more than 25,000 genes, animal genomes are comparable to the human genome, and plant genomes may even reach tens of thousands of genes.
The progress of the genome projects has been accompanied by many high-throughput analysis techniques, for example chip (microarray) technology, serial analysis of gene expression (SAGE) and suppression subtractive hybridization. Their common feature is that they can simultaneously measure, for a large number of genes or even the whole genome of an organism (typically several thousand to tens of thousands of genes), how expression differs between a research object in which a particular biological function has changed (for example, a certain disease) and a reference (for example, a healthy individual). The genes related to that biological function should be among these differentially expressed genes.
The problem that follows is that the number of differentially expressed genes screened out by high-throughput analysis generally ranges from tens to hundreds, or even thousands, which far exceeds the expected number of function-related genes. The reason is that differential expression may be a cause of the trait observed in the research object (for example, the onset of a disease), but more often it is merely a phenomenon that accompanies the trait. Yet excluding the unrelated genes one by one by experiment is almost impossible, because determining experimentally whether a single gene is related to the biological function is very expensive. In the past, researchers therefore selected a few genes for further study on the basis of their own limited experience, hoping to come across one or two function-related genes by luck.
With the rapid development of the life sciences, tens of thousands of laboratories around the world are engaged in functional studies of genes, and almost every gene has at least some reported function. The abstracts of these reports are indexed by the international biomedical bibliographic database MEDLINE, which currently holds more than 15 million abstracts; a human gene has on average about a hundred related articles. Many laboratories therefore try to mine this literature for the functions of genes, or the interactions of gene products with other molecules, and to combine the results with high-throughput analyses in order to explain the mechanism of the biological function under study. Current prediction methods fall into two classes. In the first, the functions of the genes to be screened and the functional relationships among them are extracted from the literature and integrated to describe how these genes relate to the biological function under study, so as to infer which genes are related to it and how; genes for which no first-hand report of such a relation exists are the new genes. In the second, keywords related to the specific function are set first, the abstracts of genes in which these keywords occur are then searched, and the abstracts are read to infer whether the genes are related to the biological function under study; again, genes without a first-hand report are the new genes. Both methods predict new candidate function-related genes to some extent, but both have defects.
For the first class of methods, the literature may report that a gene is related to a certain function only at a specific time (developmental stage or age), in a specific space (organism, tissue or cell) and under specific conditions (physical, chemical, physiological and so on), while the functions of a gene differ and vary across times, spaces and conditions. The gene functions and inter-gene functional relationships extracted from the literature are therefore very complex; it is hard to infer which relationships are relevant to the current biological function, and hard to predict new function-related genes from them. For the second class, a specific biological function is usually related to multiple elements, and knowledge of a specific function is never complete, so the keywords related to the function can never be given in full. Moreover, each element has several forms of expression and a single element is usually related to several biological functions, so the genes and documents found for a given keyword are usually very numerous. It therefore remains extremely difficult to infer which genes are related to the particular biological function under study, and thus to predict new function-related genes.
In Genome Biology, 2002, volume 3, issue 10, pages research0055.1-0055.16, Damien Chaussabel and Alan Sher first proposed a method for analysing the genes to be screened through literature annotation. The method first extracts the names and aliases of the genes to be screened from the gene name database HUGO Nomenclature Committee (http://www.gene.ucl.ac.uk/nomenclature/), then obtains from the public online biomedical bibliographic database (PubMed, the web version of MEDLINE) the abstracts whose titles contain these names and aliases, and analyses the frequency of words in the abstracts with the commercial software Provalis Research. A keyword related to a specific gene should have a low frequency in the literature of all genes and a high frequency in the literature of the current gene. In addition, the biological function under study is usually produced by a signalling pathway, so only keywords shared by several genes on the pathway are of interest. The authors therefore selected as keywords the words whose average frequency in the literature of 250 randomly chosen genes (studies showed that 250 genes can represent all genes) is below a base value (5%) and whose frequency in the literature of at least two of the genes to be screened is above a threshold. The genes are then clustered on the frequencies of these keywords in the literature of the genes to be screened, to show which keywords (that is, elements) the biological function under study is related to and to classify the genes functionally (with the keywords describing the functional classes). From the other genes in the functional class that contains a gene already known to be related to the current biological function, the method can sometimes also predict new genes related to that function.
Subsequently, the inventors published an article entitled "Mining the chip expression profile of colorectal cancer metastasis with literature profiles" in the Journal of First Military Medical University, 2003, volume 23, issue 11, pages 1195-7. The technical scheme disclosed in that article made some improvements to the method of Damien Chaussabel et al.: it added the Entrez Gene database (formerly named LocusLink), and, in the data filtering (that is, keyword specialization) step, it added the deletion of keywords that contribute little to the literature profiles used to define gene-specific words, as well as the setting of keyword frequency weights. The range from which information on the genes to be searched is obtained was thereby extended to the Entrez Gene database, and the closeness of the obtained keywords to the current biological function was improved to some degree. The method described in the article nevertheless has the following deficiencies:
1) Retrieval is incomplete: gene names and aliases that are included only in the Genome Database (http://www.gdb.org) and the GENATLAS database (http://www.dsi.univ-paris5.fr/genatlas/) are missed.
2) Gene names and aliases are searched only in titles; the search cannot be extended to abstracts, so the many related documents that mention a gene name only in the abstract and not in the title cannot be retrieved.
3) The literature related to a gene is obtained entirely by hand, which is time-consuming and laborious.
4) Word frequency analysis with the commercial software Provalis Research is hard to master, tedious and error-prone. The software was developed mainly for analysing word frequency in newspapers and periodicals, and its functions were made complex and bulky for the sake of generality. The method also requires a file format conversion before word frequency analysis, which generates new files; the original files, new files and result files must be matched one to one during analysis, and mismatches occur easily. In particular, the method cannot analyse the literature of all genes automatically, only gene by gene, which is mechanical and tedious.
5) Keywords obtained automatically from frequency values alone sometimes have no biological functional meaning, so false positives occur easily, and omissions also occur.
6) The obtained keywords differ in how closely they relate to the current biological function. When clustering is applied directly, genes often gather, because of high frequencies, under keywords that are only weakly related (or even unrelated) to the current biological function, which makes the genes related to the current biological function hard to obtain.
7) A keyword that characterizes an element of the function-related genes usually has several synonyms or written forms, and these cannot be treated as one entity, which scatters the clusters and makes browsing inconvenient because there are too many keywords.
8) Manual literature search can hardly retrieve the documents that contain several keywords together with their synonyms.
9) Searching directly with the names of the genes to be searched, every searcher must go through the complex and lengthy process of extracting the relevant gene information from public gene name databases, calculating word frequencies, computing base values, and determining character strings and auxiliary search words, which wastes a great deal of human and material resources.
Summary of the invention:
In view of the above deficiencies of the prior art, the invention provides a gene information retrieval system related to a specific function, to solve the technical problem of locating new function-related genes quickly and accurately.
The technical solution by which the invention solves this problem is:
A gene information retrieval system related to a specific function, comprising a computer with an input and display terminal, a web server, a public biomedical bibliographic database, a public gene name database and a cluster analysis unit, characterized in that it further comprises a document search-term database composed of a gene name database, a word frequency base value database, a character string database and an auxiliary search word database, and
a literature retrieval unit for the genes to be searched, which,
according to the entered official abbreviation (official symbol) of a gene to be searched, obtains all corresponding name strings and auxiliary search words from the constructed document search-term database and edits them: on the basis of the original information in the document search-term database, name strings and auxiliary search words that easily cause false positives are removed and omitted ones are added,
and then retrieves from the public biomedical bibliographic database the document records that contain these name strings and auxiliary search words and saves them to a specified file;
a word frequency analysis unit for the genes to be searched, which first extracts the abstract field of every retrieved document record and then each word in the abstract field, and divides the number of documents in which a word occurs by the total number of documents related to the gene, thereby calculating one by one the frequency of each word in the literature related to the gene to be searched, that is, the gene word frequency;
a keyword extraction unit, which compares each gene word frequency with the base value of the same word in the word frequency base value database, deletes the words whose base value is higher than a cutoff of 1%~10% and the words whose gene word frequency, or whose difference between gene word frequency and base value, is lower than the threshold m = t + (k/n) × 100% (where t is the minimum threshold, k is a constant and n is the number of abstract records related to the gene), and then selects the words shared by at least two genes as the keywords of the genes to be searched and saves the record;
a keyword specialization unit, which produces an editable list in which keywords can be added or deleted, the singular and plural forms of keywords can be set, keyword weights can be set, a keyword and its synonyms can be set as a single entity, and the parameter record can be saved;
a word frequency list building and output unit, which obtains the frequency of each keyword in the literature related to each gene from the word frequencies calculated by the word frequency analysis unit: the frequencies of the singular and plural forms of a keyword are first averaged to give the keyword frequency, which is then multiplied by the keyword weight; the frequencies of the keywords belonging to the same synonym entity are then averaged to give the frequency of that entity; the word frequency list is built from these values, and the list of the frequencies of all keywords in the literature related to each gene is finally output in the format of the cluster analysis software; the cluster analysis unit performs cluster analysis on the data in this word frequency list file and displays the resulting information on the genes related to the specific function.
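The word frequency analysis and word frequency list steps described above can be sketched as follows. This is only an illustrative sketch, not the patent's implementation: the function names, the whitespace tokenization, the data layout of `entities` and the choice to key each weight by the singular form are all assumptions.

```python
from collections import defaultdict

def document_frequency(abstracts):
    """Fraction of abstracts in which each word occurs at least once."""
    counts = defaultdict(int)
    for text in abstracts:
        for word in set(text.lower().split()):  # count each word once per abstract
            counts[word] += 1
    total = len(abstracts)
    return {word: c / total for word, c in counts.items()}

def entity_frequency(freqs, entities, weights):
    """Combine word frequencies into one value per synonym entity.

    entities: {entity: [[singular, plural], ...]}, one inner list per synonym
              (hypothetical layout).
    weights:  {singular form: weight}; a missing weight defaults to 1.0.
    """
    result = {}
    for entity, synonyms in entities.items():
        per_keyword = []
        for forms in synonyms:
            # average the singular and plural forms, then apply the weight
            mean_forms = sum(freqs.get(f, 0.0) for f in forms) / len(forms)
            per_keyword.append(mean_forms * weights.get(forms[0], 1.0))
        # average over all synonyms of the entity
        result[entity] = sum(per_keyword) / len(per_keyword)
    return result

# Made-up abstracts for one gene, for illustration only.
abstracts = ["tumor invasion", "tumors metastasis",
             "tumor metastasis", "neoplasm growth"]
freqs = document_frequency(abstracts)
entities = {"tumor": [["tumor", "tumors"], ["neoplasm", "neoplasms"]]}
row = entity_frequency(freqs, entities, {"tumor": 2.0})
```

Repeating the second step for every gene gives one row per gene, which is the kind of table a cluster analysis program would take as input.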
A preferred embodiment of the above system further comprises a secondary retrieval unit for gene-related literature, which, on the basis of the function-related gene information obtained by cluster analysis,
selects a gene to be searched and several keywords corresponding to it;
searches for and displays the documents, among the literature related to the selected gene, that contain the selected keywords and their synonyms;
and saves the search results.
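One way to read the secondary retrieval step is as a filter over the abstracts already retrieved for the selected gene: an abstract is kept when every selected keyword is present, where a keyword counts as present if it or any of its synonyms occurs. The sketch below assumes that reading; the function name, the tokenization and the sample abstracts are hypothetical.

```python
import re

def secondary_search(abstracts, keyword_groups):
    """Return the abstracts that contain every keyword group.

    keyword_groups: list of lists; each inner list holds one keyword
    and its synonyms, and any one of them satisfies the group.
    """
    hits = []
    for text in abstracts:
        words = set(re.findall(r"[a-z0-9]+", text.lower()))
        if all(any(term.lower() in words for term in group)
               for group in keyword_groups):
            hits.append(text)
    return hits

# Made-up abstracts for one selected gene.
abstracts = [
    "Hypoxia induces collagen synthesis.",
    "Anoxia alters collagen deposition.",
    "Collagen gel contraction in fibroblasts.",
]
selected = [["hypoxia", "anoxia"], ["collagen"]]
results = secondary_search(abstracts, selected)
```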
In the above technical solution, the keyword extraction unit preferably compares each gene word frequency with the base value of the same word in the word frequency base value database, deletes the words whose base value is higher than 5%~10% and the words whose gene word frequency is lower than the threshold m = 15% + (1.5/n) × 100%, and then selects the words shared by at least two genes as the keywords of the genes to be searched and saves the record.
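As a rough illustration of the keyword extraction rule, the sketch below expresses all percentages as fractions, so m = 15% + (1.5/n) × 100% becomes 0.15 + 1.5/n. The function names, the 5% base value cutoff chosen from the stated 5%~10% range, and the treatment of words missing from the base value database (base value 0) are assumptions.

```python
def threshold(n, t=0.15, k=1.5):
    """m = t + (k / n) * 100%, as a fraction; n = abstracts related to the gene."""
    return t + k / n

def select_keywords(gene_freqs, gene_doc_counts, base_values, base_cutoff=0.05):
    """Keep words that pass both filters in at least two genes."""
    passing = {}
    for gene, freqs in gene_freqs.items():
        m = threshold(gene_doc_counts[gene])
        for word, freq in freqs.items():
            if base_values.get(word, 0.0) > base_cutoff:
                continue  # common word: base value too high
            if freq < m:
                continue  # too rare in this gene's literature
            passing.setdefault(word, set()).add(gene)
    return sorted(w for w, genes in passing.items() if len(genes) >= 2)

# Made-up frequencies for two hypothetical genes.
gene_freqs = {
    "GENE_A": {"metastasis": 0.40, "the": 0.95, "kinase": 0.10},
    "GENE_B": {"metastasis": 0.35, "the": 0.90, "apoptosis": 0.50},
}
gene_doc_counts = {"GENE_A": 50, "GENE_B": 30}
base_values = {"the": 0.90, "metastasis": 0.02, "kinase": 0.03, "apoptosis": 0.04}
keywords = select_keywords(gene_freqs, gene_doc_counts, base_values)
```

Because of the k/n term, the threshold is stricter for genes with few abstracts: with n = 10 it is 30%, with n = 100 it is 16.5%.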
Another object of the invention is to disclose a method for constructing the document search-term database used in the above gene information retrieval system. The method uses a computer with an input and display terminal to access public gene information databases through a web server, and is characterized by the following steps:
1) extracting from these databases the non-duplicated full names, abbreviations, aliases and product names of each gene, creating a new gene record identified by the official abbreviation, and thereby forming the gene name database;
2) first randomly selecting, from known genes, 200 or more genes belonging to the same species as the genes to be searched and entering them (for convenience, these genes are called random genes in the description below), then calling up the new gene records corresponding to the random genes from the gene name database formed above and editing them, setting name strings and auxiliary search words according to the original information in the gene name database;
then retrieving from the public biomedical bibliographic database the document records that contain these name strings and auxiliary search words and saving them to a specified file;
then extracting the abstract field of every retrieved document record and each word in the abstract field, dividing the number of documents in which a word occurs by the total number of documents related to a random gene to obtain, word by word, the frequency of the word in the literature related to that random gene, then summing these frequencies over the random genes and dividing by the number of random genes to obtain the average frequency of the word in the literature related to a random gene, that is, the base value, and thereby forming the word frequency base value database;
3) calling up the new gene records in the gene name database to build the character string database and the auxiliary search word database, the character string database being built by the following steps:
a) character processing: deleting the content in brackets within a name and replacing non-letter, non-digit characters with another symbol,
b) adding the written variants of gene family member names: if a name is written with a space, deleting the space produces a new written form; if the last character of an abbreviation is a digit, searching backward to the first non-digit character and inserting a space there produces a new abbreviated form,
c) deleting gene names shorter than 2~4 characters,
d) deleting names that are common words rather than gene names,
e) deleting names that are English words rather than gene names,
f) outputting the gene name strings to build the character string database;
the auxiliary search word database being built by the following steps:
a) extracting all words in all full names and product names of each gene,
b) deleting the candidate auxiliary words that are shorter than 4~6 characters and those identical to a gene name,
c) deleting words that are common words,
d) outputting the result to build the auxiliary search word database of the genes.
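The two step lists above (character string database and auxiliary search word database) might be sketched as follows. The function names, the use of a space as the replacement symbol, the default length limits (3 and 5, taken from within the stated ranges) and the small word sets are assumptions made for illustration only.

```python
import re

def name_variants(name):
    """Steps a) and b): clean a name and add its written variants."""
    cleaned = re.sub(r"\([^)]*\)", " ", name)            # drop bracketed content
    cleaned = re.sub(r"[^A-Za-z0-9]+", " ", cleaned).strip()
    variants = {cleaned}
    if " " in cleaned:
        variants.add(cleaned.replace(" ", ""))          # "IL 2" -> "IL2"
    match = re.fullmatch(r"(.*[^0-9 ])([0-9]+)", cleaned)
    if match:                                            # "IL2" -> "IL 2"
        variants.add(match.group(1) + " " + match.group(2))
    return variants

def build_name_strings(names, common_words, english_words, min_len=3):
    """Steps c) to f): filter the variants and output the name strings."""
    strings = set()
    for name in names:
        for variant in name_variants(name):
            low = variant.lower()
            if len(variant) < min_len or low in common_words or low in english_words:
                continue
            strings.add(variant)
    return sorted(strings)

def build_auxiliary_words(full_names, gene_names, common_words, min_len=5):
    """Auxiliary word steps a) to d)."""
    names = {n.lower() for n in gene_names}
    words = set()
    for full in full_names:
        for word in re.findall(r"[a-z0-9]+", full.lower()):
            if len(word) >= min_len and word not in names and word not in common_words:
                words.add(word)
    return sorted(words)

# Hypothetical inputs.
strings = build_name_strings(["IL2", "A", "WAS", "PLUNC"], {"cell"}, {"was"})
aux = build_auxiliary_words(["met proto-oncogene (hepatocyte growth factor receptor)"],
                            ["MET"], {"factor"})
```

In practice the common-word set would come from the word frequency base value database and the English word set from a dictionary; both are passed in here so the sketch stays self-contained.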
The number of random genes whose new gene records are called up from the gene name database is preferably 250.
Compared with the prior art, the invention has the following outstanding advantages and significant technical effects:
1) A local gene name database is built that users can call directly for retrieval, which not only reduces unnecessary duplicated work but is also fast and saves a great deal of human and material resources;
2) The local gene name database can be kept up to date by the developer as a service, which makes commercial promotion easy;
3) The search range is extended to the gene names and aliases in all currently published public gene name databases, so the gene name strings and auxiliary search words produced for searching the biomedical bibliographic database are comprehensive;
4) Because the search range is extended to abstracts, the gene-related literature retrieved is comprehensive;
5) The user can select the gene name of greatest interest together with several closely related keywords and their synonyms for a secondary search, which gives accurate targeting;
6) The interface is interactive, so users can bring their professional knowledge fully to bear to modify or delete keywords without biological functional meaning and to add omitted keywords, fully embodying a people-oriented design;
7) The system is highly specialized, simple to operate, and easy to learn and master;
8) The keyword frequencies are weighted so that they reflect how closely each keyword relates to the current biological function. This avoids the situation in which genes gather, because of high frequencies, under keywords that are weakly related (or even unrelated) to the current biological function while the genes related to the keywords that closely characterize it are scattered across different classes;
9) A keyword that characterizes an element of the function-related genes, together with its several synonyms and written forms, is treated as one entity, so the clusters are concentrated and browsing is convenient;
10) All computation and processing are carried out automatically, with accurate targeting and fast retrieval.
Description of drawings:
Fig. 1 is a structural diagram of the gene information retrieval system related to a specific function according to the invention;
Fig. 2 is the main flowchart of retrieving gene information related to a specific function according to the invention;
Fig. 3 is the flowchart of literature retrieval for the genes to be searched;
Fig. 4 is the flowchart of word frequency analysis for the genes to be searched;
Fig. 5 is the flowchart of keyword extraction;
Fig. 6 is the flowchart of keyword specialization;
Fig. 7 is the flowchart of word frequency list building and output;
Fig. 8 is the flowchart of secondary retrieval of gene-related literature;
Fig. 9 is the flowchart for constructing the gene name database of the invention;
Fig. 10 is the flowchart for constructing the word frequency base value database of the invention;
Fig. 11 is the flowchart for constructing the character string database of the invention;
Fig. 12 is the flowchart for constructing the auxiliary search word database of the invention;
Fig. 13 is the gene-related literature retrieval interface in the embodiments below;
Fig. 14 is the keyword specialization interface in the embodiments below;
Fig. 15 is the interface for searching documents that contain a specific gene and several keywords in the embodiments below;
Fig. 16 is the main interface for retrieving function-related genes in the embodiments below;
Fig. 17 is a diagram of the clustering result for 51 genes differentially expressed in colorectal cancer metastasis and their keywords in the embodiments below;
Fig. 18 is the interface for searching the abstracts related to a gene for the intersection of several keywords, synonyms included, in the embodiments below;
Fig. 19 is a diagram of the clustering result for 174 genes differentially expressed in pathological scars and their keywords;
Fig. 19a is the overall view of the clustering result;
Fig. 19b shows the main keywords related to pathological scars;
Fig. 19c is the cluster tree of the known collagen-related genes related to pathological scars;
Fig. 19d is the cluster tree of the genes related to the keyword "anoxia".
Embodiment:
Example 1 (method for building the human document search-term database):
The document search-term database of the invention consists of a gene name database, a word frequency base value database, a character string database and an auxiliary search word database; the character string database and the auxiliary search word database are each built from the new gene records in the gene name database through different technical processing. The composition and construction of the human document search-term database follow the same scheme and are described in detail below:
I. Construction of the human gene name database (see Fig. 9)
To make the gene names and aliases more comprehensive, this embodiment collects and integrates four public gene information databases: HUGO Nomenclature Committee (http://www.gene.ucl.ac.uk/nomenclature/), Entrez Gene (http://www.ncbi.nlm.nih.gov/entrez/query.fcgi?DB=gene), the Genome Database (http://www.gdb.org) and GENATLAS (http://www.dsi.univ-paris5.fr/genatlas/). From them it extracts the full names, abbreviations, aliases and product names of genes and identifies each gene by its official abbreviation (for example, the official abbreviation of the gene "palate, lung and nasal epithelium carcinoma associated" is PLUNC). Information on 23,060 human genes is obtained in total and built into the human gene name database (human gene names database). The method and steps of construction are as follows:
1. Each approved gene in HUGO Nomenclature Committee is extracted as a new gene record, with the official abbreviation as the gene identifier (that is, the official abbreviation represents the gene). The full name, other abbreviations, aliases and product names of the gene are all treated as aliases. The non-duplicated full names, abbreviations, aliases and product names of the corresponding records in the three databases Entrez Gene, the Genome Database and GENATLAS are then appended, also as aliases. If an alias occurs in a text, the gene is considered to occur, that is, the official abbreviation is considered to occur.
2. Every human gene in the Entrez Gene database is extracted. If a gene is not included in the HUGO Nomenclature Committee database, it becomes a new gene record of the human gene names database, again with the official abbreviation as the gene identifier. The full name, other abbreviations, aliases and product names of the gene are all treated as aliases. The non-duplicated full names, abbreviations, aliases and product names of the corresponding records in the two databases the Genome Database and GENATLAS are then appended, also as aliases.
3. In the same way, the new gene records in the Genome Database and GENATLAS are added in turn.
4. The new gene records are saved to form the human gene name database.
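Steps 1 to 4 amount to a priority-ordered merge of the four databases. The sketch below assumes that records in different databases are matched by official abbreviation (the text only says "corresponding records"), and the function name, the input layout and the alias values are hypothetical placeholders rather than real database content.

```python
def merge_gene_names(sources):
    """Merge name lists from several databases, highest priority first.

    sources: list of {official_abbreviation: [names...]}, e.g. in the order
    HUGO, Entrez Gene, the Genome Database, GENATLAS.
    Returns {official_abbreviation: [non-duplicated aliases]}.
    """
    merged = {}
    for source in sources:
        for symbol, names in source.items():
            # a gene first seen in a later database becomes a new record
            aliases = merged.setdefault(symbol, [])
            for name in names:
                if name != symbol and name not in aliases:
                    aliases.append(name)   # append only non-duplicated names
    return merged

# Placeholder records; "ALIAS_1", "ALIAS_2" and "GENE_X" are made up.
hugo = {"PLUNC": ["palate, lung and nasal epithelium carcinoma associated", "ALIAS_1"]}
entrez = {"PLUNC": ["ALIAS_1", "ALIAS_2"], "GENE_X": ["gene x full name"]}
database = merge_gene_names([hugo, entrez])
```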
II. Construction of the human gene word frequency base value database (see Fig. 10)
In the system of the invention, both the identification of common words and the extraction of keywords rely on comparison against the word frequency base values, so a word frequency base value database must be built to support the retrieval process. It is constructed as follows: 250 genes (random genes) are randomly selected from the known human genes and entered into the system; the new gene records corresponding to the random genes are called up from the gene name database and edited, and name strings and auxiliary search words are set according to the original information in the gene name database; the abstracts containing these name strings and auxiliary search words are then retrieved from the public biomedical bibliographic database and saved to a specified file in text format; the abstract field of every retrieved document record is extracted, then each word in the abstract field, and the number of documents in which a word occurs is divided by the total number of documents related to a random gene, giving word by word the frequency in the literature related to that random gene; these frequencies are summed and divided by 250 (the number of random genes) to give the average frequency of the word in the literature related to a random gene, that is, the base value; together the base values form the word frequency base value database. The number of random genes is best at 250 and good between 200 and 300. A number greater than 300 is certainly acceptable, but the computation becomes too heavy and is also unnecessary, because earlier studies indicate that the base values essentially no longer change once the number of random genes exceeds 250. A number smaller than 200 can cause keywords that should not be deleted to be deleted during retrieval, so that keywords are missed.
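The base value of a word is therefore the mean, over the random genes, of the word's per-gene document frequency. A minimal sketch, assuming the per-gene frequencies have already been computed and that a word absent from a gene's literature contributes zero for that gene (the function name and the sample numbers are made up):

```python
def base_values(per_gene_freqs):
    """Average each word's frequency over all random genes.

    per_gene_freqs: list with one {word: frequency} dict per random gene.
    """
    n_genes = len(per_gene_freqs)
    totals = {}
    for freqs in per_gene_freqs:
        for word, freq in freqs.items():
            totals[word] = totals.get(word, 0.0) + freq
    # dividing by the number of genes counts missing words as frequency 0
    return {word: total / n_genes for word, total in totals.items()}

# Four hypothetical random genes instead of 250, to keep the example short.
per_gene = [
    {"cell": 0.5, "kinase": 0.25},
    {"cell": 1.0},
    {"cell": 0.75, "kinase": 0.5},
    {"cell": 0.75},
]
bases = base_values(per_gene)
```

Here "cell" gets a base value of 0.75, marking it as a common word, while "kinase" gets 0.1875.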
Three, construction of the string database and the auxiliary search word database used for searching the public biomedical literature database
After the gene name database has been built, some auxiliary processing is needed so that the relevant genes can be retrieved accurately from text: (1) delete gene names shorter than 2 to 4 characters (preferably 3); (2) if a gene name is shorter than 4 to 6 characters (preferably 5), the gene is considered to occur in a text only if at least one word from its full name also occurs there. This processing prevents excessive false positives. For example, the gene name "MET" (the official abbreviation of the gene "met proto-oncogene (hepatocyte growth factor receptor)") has so few characters that the same name easily occurs in other fields (for example as a drug or a cell line); unless it is searched together with an auxiliary word (for example "oncogene"), a great many false-positive documents are returned. A single character (for example "A") used as a gene name would produce even more false positives.
1, construction of the string database (see Figure 11)
1) Character processing: delete the content inside parentheses in each name, and replace every character that is neither a letter nor a digit with a space. This processing is similar to that of Alako et al. Because we have added the full names, aliases and product names of genes as gene names to be searched for in text, and these names often contain parentheses (whose content is only explanatory), the names cannot be used directly as search strings, and the parenthetical content must be deleted. For example, the full name of the gene "MET", "met proto-oncogene (hepatocyte growth factor receptor)", contains parentheses. The parenthetical content "hepatocyte growth factor receptor" only explains the full name "met proto-oncogene" and is not needed to detect the gene "MET" in text (we only need to check whether the full name "met proto-oncogene" occurs), so it must be deleted; otherwise no exact match would ever be found.
2) Add the spelling variants of gene family member abbreviations: if an abbreviation contains a space, delete the space to produce a new abbreviated form; if the last character of an abbreviation is a digit, search backwards to the first non-digit character and insert a space there to produce a new abbreviated form. For example, the gene "TP53" has the family member abbreviation "P 53", which contains a space; deleting the space produces the new form "P53". Likewise, the last character of the gene "BCL2" is a digit; searching backwards to the first non-digit character and inserting a space there produces the new form "BCL 2".
3) Delete gene names shorter than 3 characters, to reduce false positives.
4) Delete non-gene names that are common words. A word is regarded as a common word if its base value is greater than 1%. (The most common gene name in the literature, 'P53', has a base value of 1% and 35000 related documents, so even if a word with a base value above 1% is not a common word, its related documents are too numerous to analyse.) A phrase can also be common, for example 'novel protein'. We define a phrase as a common phrase if the base value of every word in it is greater than 0.4% and the product of these base values is greater than 0.005; for a two-word phrase, this product is reached exactly when the two base values are 5% and 10%. For example, the base values of 'novel' and 'protein' are 13% and 47.6% respectively, and their product is about 0.06, so 'novel protein' is a common phrase.
5) Delete non-gene names that are ordinary English words, for example 'sky' and 'fat'.
6) Output the gene name strings to build the human string database.
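Steps 1), 2) and 4) above can be illustrated with a short Python sketch. The function names are hypothetical, the regular expressions are one possible reading of the rules rather than the patented implementation, and words missing from the base value table are assumed to have a base value of 0.

```python
import math
import re


def clean_name(name):
    """Step 1): drop parenthesised content, then map non-alphanumerics to spaces."""
    name = re.sub(r"\([^)]*\)", " ", name)
    name = re.sub(r"[^A-Za-z0-9]+", " ", name)
    return " ".join(name.split())


def family_variants(abbrev):
    """Step 2): spelling variants of a gene family member abbreviation."""
    variants = {abbrev}
    if " " in abbrev:
        variants.add(abbrev.replace(" ", ""))
    # Trailing digits directly preceded by a non-digit, non-space character.
    match = re.fullmatch(r"(.*[^0-9 ])([0-9]+)", abbrev)
    if match:
        variants.add(match.group(1) + " " + match.group(2))
    return variants


def is_common_phrase(words, base, per_word=0.004, product=0.005):
    """Step 4): every word above 0.4% and the product of base values above 0.005."""
    values = [base.get(word, 0.0) for word in words]
    return all(v > per_word for v in values) and math.prod(values) > product


cleaned = clean_name("met proto-oncogene (hepatocyte growth factor receptor)")
bcl2_forms = family_variants("BCL2")
p53_forms = family_variants("P 53")
common = is_common_phrase(["novel", "protein"], {"novel": 0.13, "protein": 0.476})
```

Note that after character processing the hyphen in "met proto-oncogene" also becomes a space, so the stored search string is "met proto oncogene".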
2, the auxiliary search word database (see Figure 12)
1) Extract all words (including those inside parentheses) from all full names and product names of each gene.
2) Delete candidate auxiliary words that are shorter than 5 characters and identical to a gene name.
3) Delete words whose base value is greater than 1% (such words are common words). The remaining words serve as the auxiliary search words of the gene. For example, the full name of the gene BMP3 is 'bone morphogenetic protein 3 (osteogenic)', and the base values of its words are bone (2.5%), morphogenetic (0.3%), protein (47.6%), 3 (17.7%) and osteogenic (0.2%), so 'morphogenetic' and 'osteogenic' are the auxiliary search words of BMP3.
4) Output the result to build the auxiliary search word database of human genes.
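The BMP3 example can be reproduced with the following Python sketch. The function name `auxiliary_words` is hypothetical. Two assumptions are made: step 2) is read as removing only words that are both shorter than 5 characters and identical to a gene name, and a word absent from the base value table is treated as having a base value of 0.

```python
import re


def auxiliary_words(full_names, gene_names, base, cutoff=0.01):
    """Select auxiliary search words from a gene's full names and product names."""
    gene_names_upper = {name.upper() for name in gene_names}
    kept = []
    for name in full_names:
        # Words inside parentheses are included on purpose (step 1).
        for word in re.findall(r"[A-Za-z0-9]+", name):
            key = word.lower()
            if len(word) < 5 and word.upper() in gene_names_upper:
                continue  # step 2 (assumed reading)
            if base.get(key, 0.0) > cutoff:
                continue  # step 3: common word
            if key not in kept:
                kept.append(key)
    return kept


base = {"bone": 0.025, "morphogenetic": 0.003, "protein": 0.476,
        "3": 0.177, "osteogenic": 0.002}
bmp3_aux = auxiliary_words(["bone morphogenetic protein 3 (osteogenic)"],
                           {"BMP3"}, base)
```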
Example 2 (method for building the document search-word databases of animals and plants):
Because the Entrez Gene database contains the gene name information of almost all sequenced species, the gene name information of any animal or plant of interest can be extracted from it to build a gene name database.
For animals, take the mouse as an example. First obtain the official abbreviations, full names, aliases and product names of 48039 genes from Entrez Gene and build the mouse gene name database. Then randomly call up 250 genes, retrieve their related literature, analyse the word frequencies and build the word frequency base value database. Next, process the gene name database to produce the mouse string database and auxiliary search word database. The data processing methods and procedures are the same as those used to build the human word frequency base value database, string database and auxiliary search word database, and can be carried out with reference to Example 1.
For plants, take Arabidopsis as an example. First obtain the official abbreviations, full names, aliases and product names of 30879 genes from Entrez Gene and build the Arabidopsis gene name database. Then randomly call up 250 genes, retrieve their related literature, analyse the word frequencies and build the word frequency base value database. Next, process the gene name database to produce the Arabidopsis string database and auxiliary search word database. The data processing methods and procedures are the same as those used to build the human document search-word database, and can likewise be carried out with reference to Example 1.
Example 3 (method for building the document search-word database of microorganisms):
For microorganisms, take Epstein-Barr virus as an example. First obtain the official abbreviations, full names, aliases and product names of 90 genes in total from the Taxonomy database and the Swiss-Prot protein database, and build the gene name database of Epstein-Barr virus. Then set up the string database and the auxiliary search word database of Epstein-Barr virus manually. Next, retrieve the related literature of these 90 genes, analyse the word frequencies and build the word frequency base value database. The data processing methods and procedures are the same as those used to build the human word frequency base value database, and can likewise be carried out with reference to Example 1.
Example 4 (retrieval of human specific-function-related gene information, see Fig. 2):
Online public biomedical literature databases are a valuable resource for gene research, and the specific-function-related gene information searching system of the present invention is a retrieval system developed on this resource. The system uses a computer with input and display terminals, together with a document search-word database built on that computer and composed of the gene name database, the word frequency base value database, the string database and the auxiliary search word database. Through a web server it accesses the public biomedical literature database and retrieves the related literature of the genes to be queried (for the system architecture, see Fig. 1), performs word frequency analysis, extracts the keywords of the genes, applies professional processing to them, builds a word frequency table, and finally retrieves the specific-function-related gene information through cluster analysis. The system is applicable to retrieving specific-function-related gene information for any species: the retrieval method and steps are the same, and only the corresponding document search-word database has to be built in advance for the species to which the genes belong (for the construction of the human document search-word database, see Example 1). This embodiment describes the method and steps for retrieving human specific-function-related gene information in detail as follows:
1. Automatic acquisition of literature (see Fig. 3)
1) The system generates an interactive automatic-retrieval list (for the interface, see Figure 13). The user either types in the official abbreviations of the genes to be queried, or saves them in a file that the machine reads in. The number of genes to be queried depends on the purpose of the retrieval: for screening genes related to a function, several genes can be entered; for studying the function of a single gene, one is entered.
2) For the official abbreviation of each gene supplied by the user, the system obtains all corresponding name strings and auxiliary search words from the string database and the auxiliary search word database. For example, the name strings of the CDH1 gene are: "CDH1", "UVO", "CDHE", "ECAD", "LCAM", "Arc 1", "HDGC", "CAD 1", "CDH 1", "Arc 1", "CAD 1", "uvomorulin", "cadherin 1 type 1 E cadherin", "cadherin 1 E cadherin", "cadherin 1 type 1", "cell CAM 120 80" and "calcium dependent adhesion protein epithelial". The auxiliary search words of the CDH1 gene are: "cadherin", "cam" and "120".
3) The system provides a human-computer interaction interface that displays the genes submitted by the user together with their name strings, auxiliary search words and the original gene name information, so that the user can edit the name strings and auxiliary search words. Although the two kinds of auxiliary processing applied to the gene name database above are necessary for automatic retrieval of gene-related literature, gene names are expressed in such complex ways in scientific literature that the results of this processing cannot cover every form in which a gene name appears. For example, in the literature on nasopharyngeal carcinoma the CDH1 gene seldom appears in any of the name string forms above; it mainly appears as "E cadherin", and "E cadherin" on its own is not recorded as a gene name in the gene name database. In addition, in the full name of the CDH1 gene the word "adhesion" has a base value of 2.4%, which is greater than 1%, so it is not accepted as an auxiliary search word; yet it is important evidence for deciding whether a name string shorter than 5 characters in a document refers to the CDH1 gene. This shows that excluding common words does not perfectly separate the words that are useful as auxiliary search words from those that are not. Therefore, before gene-related literature is retrieved automatically, a human-computer interaction facility must be provided so that the user, when submitting the genes whose literature is to be retrieved, can see the search strings and auxiliary search words of each gene together with its original information in the gene name database, and can add, delete and edit search strings and auxiliary search words on the basis of that information to obtain more accurate and more complete ones.
4) Using logical OR, build a search string for retrieving the gene-related literature from the public biomedical literature database PubMed (http://www.ncbi.nlm.nih.gov/entrez/query.fcgi?db=pubmed). If a gene name is shorter than 5 characters, it must additionally be combined with any one of the gene's auxiliary search words. For example, suppose we want to retrieve the related literature of the PIM1 gene. The names of the PIM1 gene are: "PIM1", "PIM", "PIM 1", "Oncogene PIM1" and "pim 1 oncogene". The names "PIM1" and "PIM" are shorter than 5 characters and therefore need auxiliary search words. The auxiliary search words of the PIM1 gene are: "oncogene", "proviral" and "integration". The search string for the related literature of the PIM1 gene is therefore: http://www.ncbi.nlm.nih.gov/entrez/query.fcgi?db=pubmed&cmd=search&term=PIM%201%5BAB%5D%20OR%20Oncogene%20PIM1%5BAB%5D%20OR%20pim%201%20oncogene%5BAB%5D%20OR%20((PIM1%5BAB%5D%20OR%20PIM%5BAB%5D)%20AND%20(oncogene%5BAB%5D%20OR%20proviral%5BAB%5D%20OR%20integration%5BAB%5D)). This is equivalent to typing the following search string into the PubMed search window of a web browser: 'PIM 1[AB] OR Oncogene PIM1[AB] OR pim 1 oncogene[AB] OR ((PIM1[AB] OR PIM[AB]) AND (oncogene[AB] OR proviral[AB] OR integration[AB]))'.
5) Create a web browser and point its URL at the search string of the gene. In the browser's "download complete" event, add commands that select "Abstract" as the PubMed output format and "send to file" as the output mode. The file is named after the official abbreviation of the gene and saved as text in the folder designated by the user.
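The construction of the PIM1 search string in step 4) can be sketched in Python as follows. The function name `build_query` is hypothetical, the endpoint and the `db`, `cmd` and `term` parameters are simply copied from the URL quoted above, and the sketch assumes that short names are dropped when a gene has no auxiliary search words, a case the text does not cover.

```python
from urllib.parse import quote

# Endpoint as quoted in the text; not verified against the current PubMed site.
BASE_URL = "http://www.ncbi.nlm.nih.gov/entrez/query.fcgi"


def build_query(name_strings, aux_words, min_len=5, field="AB"):
    """OR together all name strings; short names must co-occur with an auxiliary word."""
    def tag(term):
        return f"{term}[{field}]"

    long_names = [n for n in name_strings if len(n) >= min_len]
    short_names = [n for n in name_strings if len(n) < min_len]
    parts = [tag(n) for n in long_names]
    if short_names and aux_words:
        shorts = " OR ".join(tag(n) for n in short_names)
        auxes = " OR ".join(tag(w) for w in aux_words)
        parts.append(f"(({shorts}) AND ({auxes}))")
    return " OR ".join(parts)


names = ["PIM1", "PIM", "PIM 1", "Oncogene PIM1", "pim 1 oncogene"]
aux = ["oncogene", "proviral", "integration"]
query = build_query(names, aux)
# Percent-encode spaces and brackets, but keep parentheses as in the quoted URL.
url = f"{BASE_URL}?db=pubmed&cmd=search&term={quote(query, safe='()')}"
```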
The interactive automatic-retrieval list mentioned above (interface shown in Figure 13) consists of 3 list boxes, 11 buttons, 3 text boxes, 1 combo box and 1 status bar, described as follows:
The 3 list boxes are: the gene list ("Gene List"), the gene name string list ("Search Term for One Gene") and the gene original information list ("Gene Information Detail"). The gene list has 5 columns: the "No." column shows the gene number; the "OfficalName" column shows the official abbreviation of the gene; the "In" column shows whether gene names are searched for in the abstract (AB) or in the title (TI); the "Dn" column shows whether the related literature of the gene has been retrieved (Y/N); and the "PNum" column shows the number of related documents retrieved. The gene name string list has two columns: the "First Term" column shows the gene name string, and the "Sec" column shows whether an auxiliary search word is required (Y/N). The gene original information list provides the original information on all names of the gene and the gene summary. The three lists are linked: when the user clicks a gene in the gene list, the contents of the gene name string list and the gene original information list change accordingly.
The 11 buttons are: add a gene ("Add Gene"), add several genes ("Add Genes"), delete a gene ("Del Gene"), add a name string ("Add Term"), delete a name string ("Del Term"), open an existing retrieval record of gene-related literature ("Open"), save the current retrieval record of gene-related literature ("Save"), start retrieving the related literature of the genes ("Retrieve"), stop retrieving the related literature of the genes ("Stop"), exit ("Exit") and search for the official abbreviation of a gene ("Search"). The first 10 buttons are arranged together in one panel, and the "Search" button is located between the official abbreviation input text box ("Official Name") and the official abbreviation output combo box.
The 3 text boxes are: the auxiliary search word text box ("Candidate Second Terms"), the official abbreviation input text box ("Official Name") and the text box for the folder in which the gene-related literature is stored ("Target").
The combo box is the official abbreviation output combo box. The status bar shows the progress of the current literature retrieval.
2. Word frequency analysis (see Fig. 4)
After the related literature of the genes has been obtained and saved as text in a folder, the user specifies that folder, and the system automatically extracts the text files from it and processes them as follows.
1) Convert every field (source, title, authors, address, abstract and PMID) of every literature record in the text file into a single line. In a literature record the title, address and abstract generally span several lines. The system reads one line of the text file at a time into variable A, after saving the previous content of A into another variable B, so that B and A always hold two consecutive lines. Two consecutive blank lines mark the boundary between literature records, and one blank line marks the boundary between fields. If B is blank and A is not, a single-line field is identified; if neither A nor B is blank, a multi-line field is identified. For a multi-line field, each newly read A is appended to the end of B, which turns it into a single-line field. For example, the following are two literature records from the NMI gene text file obtained from PubMed. “……
6: J Interferon Cytokine Res 1998 Sep;18(9):767-71
Interferon-induced upregulation and cytoplasmic localization of Myc-interacting protein Nmi.
Lebrun SJ, Shpall RL, Naumovski L.
Department of Pediatrics, Stanford Medical Center, CA 94305, USA.
Nmi interacts with c-Myc, N-Myc, Max, and fos, as demonstrated by yeast two-hybrid and coimmunoprecipitation assays. Nmi is partially homologous to IFP35, an interferon (IFN)-inducible protein. In this study, we show that basal expression of Nmi is upregulated by IFN in multiple tumor-derived cell lines. Treatment with IFN results in an increased amount of cytoplasmic Nmi distributed in a punctate granular pattern. We also demonstrate that Nmi is expressed in various fetal and adult tissues. As Nmi does not contain a known DNA-binding motif, it has the potential to form inactive heterodimers with its putative DNA-binding partners. Our studies suggest that Nmi may modulate its binding partners in an IFN-inducible manner.
PMID: 9781816 [PubMed - indexed for MEDLINE]
7: Oncogene 1996 May 16;12(10):2171-6
Isolation and characterization of Nmi, a novel partner of Myc proteins.
Bao J, Zervos AS.
Cutaneous Biology Research Center, Massachusetts General Hospital, Harvard Medical School, Charlestown Massachusetts 02129, USA.
The Myc family of oncogenes is thought to play an important role in cell proliferation, differentiation, and neoplastic transformation. Although the structure and expression of Myc genes are well characterized, the function and biochemical properties of the Myc proteins are less well understood. Here, using a yeast genetic screen, we identified a novel gene, Nmi, that binds to N-myc and C-myc. It also interacts with other transcription factors in yeast. The carboxyl terminus of Nmi shows homology to an interferon-induced leucine zipper protein, IFP 35, whereas its amino terminus is homologous to a coiled-coil heptad repeat in the C. elegans protein, CEF59. Co-precipitation studies of Nmi with N-myc and C-myc confirmed the interaction in mammalian cells. Nmi mRNA is expressed at low levels in all fetal and adult human tissues tested, except brain. Among several cancer cell lines, high expression of Mni was found in myeloid leukemias, which also express high levels of C-myc. Mni gene is localized on human chromosome 22q13.3. Translocations of this region have been reported in some human leukemias.
PMID: 8668343 [PubMed - indexed for MEDLINE]
……”。
In the example above, records are separated by three consecutive blank lines (which clearly satisfies the criterion of two consecutive blank lines), and the fields within a record are separated by one blank line. In every literature record the title, address and abstract span several lines; by appending the lines to one another, each of them is converted into a single line.
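The blank-line rules of step 1) can be sketched in Python as follows. This is a simplified illustration, not the patented routine: instead of the two variables A and B it keeps a counter of consecutive blank lines, the function name `collapse_fields` and the toy input are hypothetical, and joining the lines of a field with a single space is an assumption.

```python
def collapse_fields(lines):
    """Return a list of records; each record is a list of single-line fields."""
    records, fields, current = [], [], []
    blanks = 0  # number of consecutive blank lines seen so far
    for line in lines:
        text = line.strip()
        if text:
            if blanks >= 2 and fields:
                # Two or more blank lines: the previous record is complete.
                records.append(fields)
                fields = []
            current.append(text)
            blanks = 0
        else:
            if current:
                # A blank line closes the current (possibly multi-line) field.
                fields.append(" ".join(current))
                current = []
            blanks += 1
    if current:
        fields.append(" ".join(current))
    if fields:
        records.append(fields)
    return records


# Toy input (hypothetical) with a two-line title and three blank lines between records.
sample = ["", "1: Journal A 1998", "", "A title split", "over two lines.", "",
          "Author A.", "", "", "", "2: Journal B 1996", "", "Another title."]
records = collapse_fields(sample)
```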
2) Extract the abstract field (that is, the abstract line) of every literature record. As the example above shows, every literature record begins with a number followed by a colon, where the number indicates which document it is, so the system identifies the start of a record by "a number followed by a colon at the beginning of the line". The abstract line of each record is then obtained as follows: A) Take the first line as the source line (the source field). B) Read two consecutive lines into two variables V1 and V2 and take V2 as the title line. If V2 begins with "Comment in:", "Comment on:" or "Erratum in:", it is a comment line or a journal notice line, so read two more lines, repeating until V2 no longer begins with any of these three markers. C) Take all remaining lines of the record and compute the number and lengths of the non-blank lines. If there are more than 2 non-blank lines and at least one of them is longer than 180 characters, take the longest line as the abstract line. This rule is based on a random analysis of 10000 literature records, in all of which the abstract line was longer than 180 characters and was the longest of all fields in the record.
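A minimal Python sketch of the abstract-selection rule follows. It assumes that each record has already been reduced to a list of single-line fields with the blank separator lines removed, so the two-variable reading of step B) is simplified to skipping the marker lines; the name `extract_abstract` and the toy records are hypothetical.

```python
import re

RECORD_START = re.compile(r"^\d+:")
NOTE_PREFIXES = ("Comment in:", "Comment on:", "Erratum in:")


def extract_abstract(fields, min_len=180):
    """Return the abstract line of one record, or None if no line qualifies."""
    if not fields or not RECORD_START.match(fields[0]):
        return None
    rest = fields[1:]  # everything after the source line
    i = 0
    # Skip comment or erratum lines that precede the title.
    while i < len(rest) and rest[i].startswith(NOTE_PREFIXES):
        i += 1
    remaining = [f for f in rest[i + 1:] if f.strip()]  # lines after the title
    if len(remaining) > 2:
        longest = max(remaining, key=len)
        if len(longest) > min_len:
            return longest
    return None


with_abstract = ["7: Oncogene 1996", "A title.", "Bao J, Zervos AS.",
                 "An address.", "x" * 200, "PMID: 1"]
with_comment = ["8: J 2000", "Comment in: J 2001", "A title.", "Author A.",
                "An address.", "y" * 181, "PMID: 2"]
no_abstract = ["9: J 2001", "A title.", "Author A.", "An address.", "PMID: 3"]
```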
3) Apply the following two character-processing steps to the extracted abstract lines: (A) Replace every character that is neither a letter nor a digit with a space, so that words can be identified with the space as their boundary. For example, "(BCL-2)", "BCL-2" and "BCL2" are the same word, and all are represented as "BCL 2" after processing. (B) Convert all letters to upper case, so that the machine does not treat two identical words as different merely because of letter case, for example "Cancer" and "cancer".
4) Extract every word of the processed abstract lines with the space as boundary, and compute the frequency of each word (the number of abstracts in which the word occurs divided by the total number of abstracts).
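Steps 3) and 4) amount to a normalisation followed by a per-abstract count, which the following Python sketch illustrates. The function names and the two toy abstracts are hypothetical, and collapsing runs of spaces into one is a convenience added here, not something the text requires.

```python
import re
from collections import Counter


def normalize(abstract):
    """Non-alphanumerics become spaces, letters become upper case."""
    return " ".join(re.sub(r"[^A-Za-z0-9]", " ", abstract).upper().split())


def word_frequencies(abstracts):
    """Number of abstracts containing a word divided by the number of abstracts."""
    counts = Counter()
    for abstract in abstracts:
        # set() so that a word is counted once per abstract
        counts.update(set(normalize(abstract).split()))
    return {word: n / len(abstracts) for word, n in counts.items()}


normalized = normalize("(BCL-2) and Cancer")
freqs = word_frequencies(["Cancer cells express (BCL-2).", "cancer therapy"])
```

Because words are split on spaces, a processed name such as "BCL 2" is counted as the two tokens "BCL" and "2" in this sketch.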
With the above method the system automatically analyses the word frequencies in the related literature of each gene for use in the next step. All intermediate and final results are also kept for later use: the result file in which multi-line fields have been converted to single-line fields, the file consisting purely of the extracted abstract lines, and the result file of the word frequency analysis. They are kept as follows: 1) Create a result folder ("result") in the folder designated by the user. 2) Create three sub-folders in the result folder, "linemid", "lineresult" and "wordresult", to hold the three results respectively. 3) Save the three results as text in three different files, each named after the official abbreviation of the gene.
3. Extraction of keywords (see Fig. 5)
After the system has automatically analysed the word frequencies in the related literature of each gene, it can automatically extract the keywords of the genes from these words. The extraction method is briefly as follows:
1) Obtain all words of each gene and their word frequencies.
2) Delete words whose base value is greater than 5%; these are regarded as common words, for example 'the' and 'and'. The base value cutoff in this step can be chosen between 1% and 10%, preferably between 5% and 10%, and 5% is the best choice.
3) Delete words whose frequency in the related literature of the current gene is less than a threshold m = t + (k/n) × 100%; such words are not mentioned widely enough in research on the current gene to be accepted as keywords. Here t is the minimum threshold, k is a constant and n is the number of related abstract records of the gene. t can range from 5% to 25% and k from 0.5 to 2.5; this example uses t = 15% and k = 1.5.
4) Delete words that occur for only one gene and keep only words shared by at least two genes. Because a biological function is generally produced by a group of genes rather than by a single gene, only words shared by at least two genes can be function words that characterise a gene group; these words are kept and all others are deleted. The word frequency test can also be based on the difference between the word frequency of the gene being queried and the word frequency base value: words for which this difference is less than m = t + (k/n) × 100% are deleted. The two criteria are essentially the same and differ very little, so the choice has essentially no effect on the retrieval result.
5) The words that finally remain are regarded as the keywords of the genes and are saved under the file name "allkeyword.xls" in the "result" sub-folder of the folder designated by the user.
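The three filters can be combined in a short Python sketch. All names and the toy frequencies are hypothetical, frequencies are expressed as fractions (so the threshold is 0.15 + 1.5/n for t = 15% and k = 1.5), and a word without a base value is assumed to have a base value of 0.

```python
from collections import Counter


def keyword_threshold(n_abstracts, t=0.15, k=1.5):
    """m = t + (k/n) x 100%, expressed as a fraction."""
    return t + k / n_abstracts


def extract_keywords(freqs_by_gene, n_by_gene, base, common_cutoff=0.05):
    candidates = {}
    for gene, freqs in freqs_by_gene.items():
        m = keyword_threshold(n_by_gene[gene])
        candidates[gene] = {
            word for word, freq in freqs.items()
            if base.get(word, 0.0) <= common_cutoff  # step 2: drop common words
            and freq >= m                            # step 3: drop rare words
        }
    # Step 4: keep only words shared by at least two genes.
    shared = Counter(word for words in candidates.values() for word in words)
    return sorted(word for word, n in shared.items() if n >= 2)


freqs_by_gene = {
    "G1": {"THE": 0.90, "METASTASIS": 0.40, "KINASE": 0.30, "RARE": 0.05},
    "G2": {"THE": 0.95, "METASTASIS": 0.50, "ZINC": 0.60},
}
n_by_gene = {"G1": 100, "G2": 50}
base = {"THE": 0.90, "METASTASIS": 0.01, "KINASE": 0.02, "ZINC": 0.01, "RARE": 0.001}
keywords = extract_keywords(freqs_by_gene, n_by_gene, base)
```

With 100 related abstracts the threshold is 16.5%, and with 50 it is 18%, so genes with little literature need a proportionally higher frequency before a word is accepted.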
4. Professional processing of keywords (see Fig. 6)
The keywords are processed professionally as follows:
The system reads the automatically generated keyword file and produces an editable list (for the interface, see Figure 14). The following edits can be made in this list:
1) Add or delete keywords, or define a keyword as a phrase.
2) Set the frequency weight of a keyword. The weight is a number and defaults to 1. The user can set the weight according to how closely the keyword is related to the particular biological function. During cluster analysis the weight is multiplied by the word frequency of the keyword, and the product replaces the original word frequency.
3) Set the synonyms of a keyword as one entity. Synonymy is indicated by a number. The default is 0, meaning no synonym; a number greater than 0 indicates a synonym, and keywords with the same number belong to the same synonym class. During clustering the keywords of one class are treated as a single entity, which is represented by the first synonym in the list.
4) Set the singular/plural form of a keyword, indicated by a number. The default is 0, meaning singular only. The user can set it to 1 to also allow the plural form ending in "s".
5) Save the result, replacing the automatically generated keyword file.
The editable list for professional processing of keywords mentioned above (interface shown in Figure 14) contains one list box and 6 buttons, described as follows:
The list box has 5 columns: the "No." column shows the keyword number, the "Keyword" column shows the keyword, the "Plural" column shows the singular/plural setting of the keyword, the "Weight" column shows the weight of the keyword, and the "Synonymy" column shows the synonym code of the keyword.
The 6 buttons are: add a keyword ("Add"), delete a keyword ("Remove"), re-sort the keywords ("Sort"), save the edited keywords ("Save"), confirm ("Ok") and cancel ("Cancel"). Re-sorting uses the synonym code as the first sort key (ascending) and the weight as the second sort key (descending). The user can therefore decide which synonym comes first within a synonym entity by giving it a higher weight (for example +1) (as shown in Figure 14).
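The two-key ordering used by the "Sort" button can be expressed in one line of Python. The dictionary keys and the sample rows below are hypothetical stand-ins for the columns of the list box.

```python
def sort_keywords(rows):
    """Synonym code ascending, then weight descending."""
    return sorted(rows, key=lambda row: (row["synonym"], -row["weight"]))


rows = [
    {"word": "TUMOR", "weight": 1, "synonym": 1},
    {"word": "METASTASIS", "weight": 1, "synonym": 0},
    {"word": "CANCER", "weight": 2, "synonym": 1},
]
ordered = [row["word"] for row in sort_keywords(rows)]
```

In this toy list "CANCER" has the higher weight within synonym class 1, so it sorts ahead of "TUMOR" and would represent the entity during clustering.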
5. Building and output of the word frequency table, and gene-word frequency cluster analysis (see Fig. 7)
After finishing the professional editing of the keywords, the user can carry out the gene-word frequency cluster analysis. The method is summarised as follows: 1) Obtain the word frequency of each keyword for each gene to form a word frequency table. The first row of the table contains the genes, the first column contains the keywords, and the cell where a row and a column cross holds the frequency of that keyword for that gene. 2) Cluster the table with the average-linkage hierarchical clustering module of the Cluster software developed at Stanford University (http://rana.lbl.gov/index.htm). 3) Display the clustering result with the Treeview software (http://rana.lbl.gov/index.htm). Because Cluster and Treeview are both free and easy to use, the key issue is how to obtain the word frequency table of the keywords for each gene. We therefore only need to build the word frequency table and output it in the format required by the Cluster software.
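To make the idea of average-linkage hierarchical clustering concrete, here is a small pure-Python sketch. It is not the Cluster software and does not reproduce its options or file format; the function name, the use of Euclidean distance and the toy frequency vectors are all assumptions made for illustration.

```python
import math
from itertools import combinations


def average_linkage(vectors):
    """Repeatedly merge the two clusters with the smallest average pairwise distance.

    vectors maps a gene name to its list of keyword frequencies.
    Returns the merges in order as (cluster_a, cluster_b, distance).
    """
    clusters = [(name,) for name in vectors]

    def distance(a, b):
        pairs = [(x, y) for x in a for y in b]
        return sum(math.dist(vectors[x], vectors[y]) for x, y in pairs) / len(pairs)

    merges = []
    while len(clusters) > 1:
        a, b = min(combinations(clusters, 2), key=lambda pair: distance(*pair))
        merges.append((a, b, distance(a, b)))
        clusters = [c for c in clusters if c not in (a, b)] + [a + b]
    return merges


# Toy table (hypothetical): frequencies of two keywords for three genes.
vectors = {"G1": [0.9, 0.1], "G2": [0.8, 0.2], "G3": [0.1, 0.9]}
merges = average_linkage(vectors)
```

Genes with similar keyword profiles (G1 and G2 here) are merged first, which is the behaviour the tree display relies on.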
The word frequency table is output as follows: 1) Obtain the word frequency of each keyword for each gene from the word frequency file of the gene. 2) If a keyword allows singular and plural forms, add the word frequencies of its singular and plural forms and divide by two; the result is the word frequency of the keyword. 3) Multiply the word frequency of the keyword by its weight. 4) Add up the word frequencies of the keywords belonging to the same synonym entity and divide by the number of synonyms; the result is the word frequency of the synonym entity. The entity is represented by its first keyword, and the other keywords of the entity are deleted. 5) Output the gene-keyword frequency table in the format required by the Cluster software, saving it under the file name "array.txt" in the "result" sub-folder of the folder designated by the user.
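Steps 2) to 4) of the output procedure can be sketched in Python for a single gene as follows. The function name, the dictionary keys and the toy values are hypothetical, and the final formatting for the Cluster software is deliberately left out because the text does not describe that file format.

```python
def keyword_row(freqs, keywords):
    """Frequencies of one gene after plural averaging, weighting and synonym merging."""
    row, groups = {}, {}
    for kw in keywords:
        word = kw["word"]
        freq = freqs.get(word, 0.0)
        if kw.get("plural", 0) == 1:
            # Step 2: average the singular form and the plural form ending in "S".
            freq = (freq + freqs.get(word + "S", 0.0)) / 2
        freq *= kw.get("weight", 1)  # step 3
        code = kw.get("synonym", 0)
        if code > 0:
            groups.setdefault(code, []).append((word, freq))
        else:
            row[word] = freq
    for members in groups.values():
        # Step 4: mean over the synonyms, stored under the first keyword.
        row[members[0][0]] = sum(f for _, f in members) / len(members)
    return row


freqs = {"TUMOR": 0.4, "TUMORS": 0.2, "CANCER": 0.6, "METASTASIS": 0.5}
keywords = [
    {"word": "CANCER", "weight": 2, "synonym": 1},
    {"word": "TUMOR", "plural": 1, "synonym": 1},
    {"word": "METASTASIS"},
]
row = keyword_row(freqs, keywords)
```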
The clustering result is displayed as a tree and as a word frequency table. In the tree the user can see the genes clustered by keyword and the words clustered by gene. In the word frequency table the user can see the frequency of every word for each gene, that is, the degree of association between word and gene, which is represented by colour brightness.
Through cluster analysis the user can obtain three kinds of information. 1) Verification of the reliability of current high-throughput analysis results. Most current high-throughput analysis techniques have questionable reproducibility, and because they are expensive the experiments often cannot be repeated enough times to obtain credible data. In the system of the present invention, by giving high weights to the main keywords of the current biological function, the user can bring the known genes related to that function together. The reliability of the experimental result can then be inferred from the proportion of known function-related genes whose detection results agree with the literature reports found by the system. 2) A general understanding of the mechanism by which the current biological function arises or changes. From the number of genes related to each keyword and the degree of association between the keyword and those genes, the user can infer which keywords the genes being queried are mainly related to, and thus judge overall which elements the current biological function mainly depends on and which elements can change the function when they change. 3) Prediction of new genes related to the current biological function. The user can predict new function-related genes from the degree of association between a gene being queried and the main keywords of the current biological function. If a gene is related to most of the main keywords of the current biological function and is unrelated only to the keyword that characterises the current research object, it can be predicted to be a new gene related to the current biological function. For example, if an experiment finds that the TIAM1 gene is abnormally expressed in colorectal cancer metastasis, and cluster analysis finds that TIAM1 is related to the keywords "cancer" and "metastasis" and unrelated only to the keyword "colorectal", then TIAM1 can be inferred to be a new colorectal cancer metastasis-related gene.
6. Gene-multi-word literature search (secondary search of the gene-related literature; see Fig. 8)
When cluster analysis predicts new genes related to the current biological function, the prediction needs literature support, and the relationships become clear only by reading the documents concerned. The present invention therefore performs a secondary search for documents that contain a specific gene together with several keywords, so that gene information related to the specific function is retrieved accurately. The method is as follows:
The system generates an interactive, editable secondary-search list (see Figure 15 for the interface), in which the following editing and processing can be carried out: 1) Select a gene. The list of query genes is read and shown in a combo box, from which the user selects the gene of interest (only one gene can be selected). 2) Select keywords. The keyword list is read and shown in four combo boxes, from which the user selects the keywords of interest (up to four keywords can be selected at once). 3) Search for the keywords. For the documents related to the selected gene, the system takes the result file in which multi-line fields have been converted to single-line fields (the "single-line file") and the file consisting solely of the extracted abstract lines (the "abstract file"). In the single-line file, the first line of each document record (the source line) is identified as a line that begins with a number followed by a colon and is shorter than 120 characters, and the single-line file is read into a two-dimensional array A; the first dimension of A indexes the document records and the second dimension indexes the field lines within a record. Each line of the abstract file is then read and tested for whether it contains all the selected keywords at once. A line is deemed to contain a keyword if it contains any one synonym of that keyword; when a synonym allows both singular and plural forms, either form suffices. If an abstract line contains all the selected keywords, array A is searched for the field line that matches this abstract line, and all field lines of the record containing that field line are recorded. 4) Display the documents. All recorded field lines are appended line by line to a string variable V1, and the content of V1 is shown in a text box that can display colors. The following processing is then carried out: (1) Character processing of V1. All characters that are neither letters nor digits are replaced with spaces, and all letters are converted to upper case. (2) Highlight the gene. All name strings of the selected gene are obtained from the gene name string database, and each is searched for in V1. If a name string is found, its position in V1 and its length L are recorded, and the L characters starting at the corresponding position in the text box are turned bright red. (3) Highlight the keywords. All synonyms of the four keywords are obtained; if a synonym allows singular and plural forms, its plural form ending in "s" is added. The synonyms undergo the same character processing as V1: all characters that are neither letters nor digits are replaced with spaces, and all letters are converted to upper case. Each synonym is then searched for in V1. If a synonym is found, its position in V1 and its length L are recorded, and the L characters starting at the corresponding position in the text box are colored (bright blue for the first keyword, bright magenta for the second, bright green for the third and bright cyan for the fourth). 5) Save the results. The user can save the retrieved documents to a specified file.
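The keyword test in step 3) can be sketched in Python as follows. This is only an illustrative sketch, not the implementation of the invention: the function names are hypothetical, each keyword is assumed to be given as a mapping from synonym to a flag saying whether the plural form is allowed, and matching is done on whole words, which is an assumption (the text only says that the synonym must "occur").

```python
import re

def normalize(text):
    """Replace every non-letter, non-digit character with a space; upper-case the rest."""
    return re.sub(r"[^A-Za-z0-9]", " ", text).upper()

def contains_keyword(abstract_words, synonyms):
    """True if any synonym of one keyword (singular, or plural if allowed) is present.

    `synonyms` maps each synonym to True when its plural form (synonym + "S")
    is also accepted. Whole-word matching is an assumption of this sketch.
    """
    for synonym, allow_plural in synonyms.items():
        base = normalize(synonym).strip()
        forms = {base, base + "S"} if allow_plural else {base}
        if forms & abstract_words:
            return True
    return False

def secondary_search(abstracts, keywords):
    """Return the abstracts that contain every selected keyword at the same time."""
    hits = []
    for abstract in abstracts:
        words = set(normalize(abstract).split())
        if all(contains_keyword(words, syns) for syns in keywords):
            hits.append(abstract)
    return hits
```

For instance, with the keywords `{"CANCER": True, "CARCINOMA": True}` and `{"METASTASIS": False, "METASTATIC": False}`, an abstract mentioning "cancers" and "metastasis" is kept, while one mentioning neither is dropped.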
The secondary-search list (interface shown in Figure 15) comprises one dialog box, one text box and two buttons, as follows:
The dialog box contains one gene name ("Gene") combo box and four keyword ("Keywords") combo boxes, in which the user selects the gene and the keywords of interest.
The text box ("Information") displays the retrieved documents.
The two buttons are the Display ("Display") button and the Save ("Save") button. After choosing the gene and keywords of interest, the user clicks the Display button and the text box shows the retrieved documents. Clicking the Save button saves the retrieved documents to a file specified by the user.
So that the public can fully understand the main functions and operation of a system built according to the present invention, the main interface of the system is briefly introduced below with reference to the drawings.
The specific function-related gene retrieval system of the present invention (Gene Specialized Finder, GSPFinder) runs on existing Windows platforms. For ease of use, the system gathers all functional units on the main interface shown in Figure 16, which comprises eight buttons, one text box and one dialog box, as follows:
The eight buttons are: the automatic gene-related literature retrieval ("Genes") button, the automatic word frequency analysis and automatic keyword extraction ("Frequency") button, the specialized keyword processing ("Keywords") button, the word frequency list output ("Array") button, the Display ("Display") and Save ("Save") buttons of the gene-multi-word literature search (secondary search), the Help ("Help") button and the Exit ("Exit") button.
The text box ("Information") displays the current state and progress; during a gene-multi-word literature search (secondary search) it displays the retrieved documents.
The dialog box is the dialog box of the gene-multi-word literature search. Figure 15 is in fact extracted from Figure 16 and is a part of it.
Example 5 (retrieval of animal gene information related to a specific function):
The hardware connections and composition of the system, and the whole retrieval procedure, method and steps for retrieving animal gene information related to a specific function, are the same as those for retrieving human gene information and can be carried out with reference to Example 4. The retrieval procedure and results are briefly described below, taking the mouse as an example.
To investigate the mechanism by which the malignant phenotype of tumor cells can be reversed by nuclear transfer, we transplanted nuclei of mouse malignant melanoma cells into enucleated oocytes of healthy mice, obtained reconstructed tumor-cell embryos and cultured them in vitro. Using a cDNA microarray, we then analyzed the gene expression profile of reconstructed melanoma-cell embryos that had developed to the 32-cell stage, compared it with that of 32-cell-stage reconstructed cumulus-cell embryos, and obtained 244 differentially expressed genes. These genes were then analyzed with the present system. The mouse gene name database was selected, and a PubMed search returned a total of 150643 related documents for 233 genes. After word frequency analysis, automatic keyword extraction, specialized keyword processing and output of the gene-word frequency list, the cluster analysis showed that most of the genes are related to embryonic development, and some are also related to cancer and reprogramming.
Example 6 (retrieval of microbial gene information related to a specific function):
The hardware connections and composition of the system, and the whole retrieval procedure, method and steps for retrieving microbial gene information related to a specific function, are the same as those for retrieving human gene information and can be carried out with reference to Example 4. The retrieval procedure and results are briefly described below, taking the Epstein-Barr virus in nasopharyngeal carcinoma as an example.
Infection with Epstein-Barr virus is one of the causes of nasopharyngeal carcinoma. To investigate the mechanism involved, we used the present system to analyze all 90 genes of the Epstein-Barr virus. The Epstein-Barr virus gene name database was selected, and the species names of the virus (EBV, EB virus, Epstein-Barr virus, etc.) were added as auxiliary search terms for all Epstein-Barr virus genes. A PubMed search returned a total of 11905 related documents for 80 genes. After word frequency analysis, automatic keyword extraction, specialized keyword processing and output of the gene-word frequency list, the cluster analysis showed that 9 genes are related to nasopharyngeal carcinoma, being associated respectively with self-sufficiency in growth signals, evasion of apoptosis, sustained angiogenesis, tissue invasion and metastasis, and inhibition of cell growth.
Example 7 (retrieval of information on expressed genes closely related to colorectal cancer metastasis):
So that the public can better grasp the use and operation of the present invention and fully understand the technical effect it can achieve, this embodiment takes the retrieval of information on expressed genes closely related to colorectal cancer metastasis as a concrete example and describes both the automatic analysis performed by the retrieval system and the specific operations performed by the user.
In research on colorectal cancer metastasis, those skilled in the art have successively found 51 expressed genes closely related to colorectal cancer metastasis (see Table 1). More than 60,000 documents currently relate to these 51 genes, so it is almost impossible for anyone to grasp the state of research on all of them in a short time, clarify their relationships and find new targets. How can the functional relationships among these 51 genes be analyzed and new colorectal cancer metastasis-related genes be found? To address this problem, we used the system of the present invention to explore the functional relationships of these genes and to find new colorectal cancer metastasis-related genes. The analysis proceeds as follows:
1. Automatic retrieval of the gene-related literature
The procedure is as follows: 1) Click the Genes button in Figure 16; the interface of Figure 13 pops up. 2) In Figure 13, click the Add Genes button (add several genes); an open-file dialog pops up. In the dialog, select the text file containing the official abbreviations and full names of the 51 genes (Table 1). The required format of this text file is: one gene per line; in each line the official abbreviation comes first and the full name second; the abbreviation and the full name are separated by a tab. Click the OK button of the open-file dialog; the system reads the selected text file and shows the genes in the gene list ("Gene List") (see Figure 13). 3) The user clicks each gene in turn to browse its information and edits the search strings and auxiliary search terms in the gene name string table ("SearchTerm for One Gene"), the auxiliary search term text box ("Candidate Second Terms") and the original gene information table ("GeneInformation Detail"). 4) In the text box for the folder that stores the gene-related documents ("Target"), select the target folder "D:\colon". 5) Click the Retrieve button to start retrieving the documents related to the genes. The result is shown in Figure 14. The "PNum" column of the gene list box ("Gene List") in the figure shows the number of related documents retrieved for each gene. 6) For some genes the number of related documents is shown as "large", which means that the gene has more than 1000 related documents and that the user must download them manually. The system leaves sets of more than 1000 documents to manual download because such sets (1) contain many false positives, so that the user needs to re-examine the search strings, and (2) would take too long to retrieve automatically. When the Retrieve button is clicked, the system generates a file genesinformation.gsi in the target folder "D:\colon"; this file records the retrieval address of the related documents of each gene. For example, the address for the PIM1 gene is: http://www.ncbi.nlm.nih.gov/entrez/query.fcgi?db=pubmed&cmd=search&term=PIM%201%5BAB%5D%20OR%20Oncogene%20PIM1%5BAB%5D%20OR%20pim%201%20oncogene%5BAB%5D%20OR%20((PIM1%5BAB%5D%20OR%20PIM%5BAB%5D)%20AND%20(oncogene%5BAB%5D%20OR%20proviral%5BAB%5D%20OR%20integration%5BAB%5D)). The user only needs to copy this address into the address bar of a browser and press Enter, and the browser displays the PubMed search results; the user then sets the PubMed output format to "Abstract" and the output mode to "send to file", names the file with the official abbreviation of the gene, and saves it as a text file in the target folder "D:\colon". In this way a total of 61914 related documents were obtained for the 51 genes; the number of related documents per gene ranged from 3 to 7144, with an average of 1214.
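The way such a retrieval address is assembled can be sketched in Python. This is a hedged illustration only: the function names are hypothetical, and the split into unambiguous name strings, ambiguous name strings and auxiliary terms is inferred from the shape of the PIM1 address quoted in the text. The base address is the legacy one given in the text and may no longer be served by PubMed.

```python
from urllib.parse import quote

# Legacy Entrez address as quoted in the text; kept only for illustration.
BASE_URL = "http://www.ncbi.nlm.nih.gov/entrez/query.fcgi"

def build_term(names, ambiguous_names=(), aux_terms=()):
    """Combine name strings into one search term restricted to the abstract field [AB].

    Unambiguous names are simply OR-ed together. Ambiguous names are accepted
    only together with at least one auxiliary term (assumed logic).
    """
    parts = [f"{n}[AB]" for n in names]
    if ambiguous_names and aux_terms:
        amb = " OR ".join(f"{n}[AB]" for n in ambiguous_names)
        aux = " OR ".join(f"{t}[AB]" for t in aux_terms)
        parts.append(f"(({amb}) AND ({aux}))")
    return " OR ".join(parts)

def build_url(term):
    """Percent-encode the term (spaces and brackets) and append it to the base address."""
    return f"{BASE_URL}?db=pubmed&cmd=search&term={quote(term, safe='()')}"

term = build_term(
    ["PIM 1", "Oncogene PIM1", "pim 1 oncogene"],
    ambiguous_names=["PIM1", "PIM"],
    aux_terms=["oncogene", "proviral", "integration"],
)
url = build_url(term)
```

Keeping the ambiguous abbreviations behind an AND with auxiliary terms is what reduces false positives for short gene symbols.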
2. Automatic word frequency analysis and automatic keyword extraction
The procedure is as follows: 1) Click the Frequency button in Figure 16; an open-file dialog pops up. In the dialog, select any text file in the folder "D:\colon", where the gene-related documents are stored. 2) The system first analyzes the words and word frequencies in all text files in this folder. A subfolder "result" is created in the target folder "D:\colon". The number of document records of each gene is written to the text file "filecount.txt", which is stored in the "result" folder. Three subfolders, "linemid", "lineresult" and "wordresult", are created in the "result" folder to hold the intermediate and final results of all processing (namely the result files in which multi-line fields are converted to single-line fields, the files consisting solely of the extracted abstract lines, and the result files of the word frequency analysis). 3) The system then analyzes the words and word frequencies in all text files in the "wordresult" folder and automatically extracts the keywords of the 51 genes; the numbers of document records are read from the text file "filecount.txt". In the end a total of 148 keywords are obtained (see Table 2) and saved in the "result" folder under the file name "allkeyword.xls".
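A minimal Python sketch of the word frequency analysis and keyword extraction is given below. It follows the rule stated in the claims (word frequency = number of a gene's abstracts containing the word divided by the gene's total number of abstracts; threshold m = t + (k/n) × 100%; a keyword must be shared by at least two genes), with the values t = 15%, k = 1.5 and a 5% baseline cut-off taken from claim 3. The function names and the in-memory data layout are hypothetical, and the optional "difference from the baseline" test mentioned in claim 1 is not implemented here.

```python
import re
from collections import Counter

def word_frequency(abstracts):
    """Fraction of a gene's abstracts in which each word occurs (document frequency)."""
    counts = Counter()
    for text in abstracts:
        # Count each word at most once per abstract.
        counts.update(set(re.sub(r"[^A-Za-z0-9]", " ", text).upper().split()))
    n = len(abstracts)
    return {word: c / n for word, c in counts.items()}

def threshold(n, t=0.15, k=1.5):
    """m = t + (k/n) * 100%, expressed as a fraction; n is the number of abstracts."""
    return t + k / n

def candidate_words(abstracts, base_freq, max_base=0.05):
    """Words frequent for this gene (>= m) and rare in the baseline (<= max_base)."""
    if not abstracts:
        return set()
    m = threshold(len(abstracts))
    return {w for w, f in word_frequency(abstracts).items()
            if base_freq.get(w, 0.0) <= max_base and f >= m}

def extract_keywords(gene_abstracts, base_freq):
    """Keep only candidate words that are shared by at least two genes."""
    shared = Counter()
    for abstracts in gene_abstracts.values():
        shared.update(candidate_words(abstracts, base_freq))
    return {w for w, c in shared.items() if c >= 2}
```

The term k/n makes the threshold stricter for genes with few abstracts: with these values, a gene with 100 abstracts needs a word to appear in 16.5% of them, while a gene with only 3 abstracts needs 65%.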
3. Specialized processing of the keywords
The procedure is as follows: 1) Click the Keywords button in Figure 16; an open-file dialog pops up. Select the file "allkeyword.xls" in the target folder "D:\colon\result" and click the OK button. 2) The system reads the file "allkeyword.xls", opens the interface of Figure 14 and shows the 148 keywords in a list. By default, the value in the "Plural" column of the list box is 0, the value in the "Weight" column is 1 and the value in the "Synonymy" column is 0. 3) Select a meaningless keyword in the list by clicking it (for example "9", "H", "N" and "X"; see Table 2), then click the Remove button to delete it. 4) Click the Add button; a new row is appended at the bottom of the list box, and a new keyword, for example "COLON", is entered in the "Keyword" column of this row. 5) In the "Plural" column, set whether the keyword has singular and plural forms. For example, setting the "Plural" value of the keyword "CANCER" to 1 indicates that the form "CANCERS" is also accepted in the text. 6) In the "Synonymy" column, define synonyms by giving them the same number. For example, the keywords "COLORECTAL" and "COLON" both have the "Synonymy" value 3, which indicates that the two are synonyms (Table 3). 7) Set the keyword weights; for example, the weight of the keyword "INVASION" is 10 and the weight of the keyword "COLORECTAL" is 1001 (Table 3). Because each class of synonyms is represented in the cluster result by its keyword with the largest weight, we give the largest weight to the keyword that is to represent the class. For example, in the synonym class of "COLORECTAL" and "COLON", the weight of "COLORECTAL" is larger than that of "COLON" by 1 (Table 3). The final keywords are shown in Table 3: 15 keywords (27 synonyms) remain.
4. Word frequency cluster analysis of genes and keywords
The procedure is as follows: 1) Click the Array button in Figure 16; an open-file dialog pops up. Select the file "allkeyword.xls" in the target folder "D:\colon\result" and click the OK button. 2) The system calls the array output unit, reads the keywords and their settings from the file "allkeyword.xls", reads the word frequency of each selected keyword in each gene from the folder "D:\colon\result\wordresult", and outputs the gene-keyword frequency list, which is saved under the file name "array.txt" in the "result" subfolder of the folder specified by the user. 3) Run the Cluster software and click the "Load File" button; an open-file dialog pops up. Select the file "array.txt" in the target folder "D:\colon\result" and click the OK button. Click the "Average Linkage Clustering" button under the "Hierarchical Clustering" option; the Cluster software then writes three files, "array.cdt", "array.atr" and "array.gtr", to the target folder "D:\colon\result". 4) Run the Treeview software and click the menu "File->Load"; an open-file dialog pops up. Select the file "array.cdt" in the target folder "D:\colon\result" and click the OK button. The main window shows the cluster result. Then click the menu "Setting->Options"; in the dialog that pops up, set "Image Contrast" to 25, click the "color" option and set the color of "Positive" to bright yellow. The main window then displays the result shown in Figure 17.
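How the gene-keyword frequency list might be computed from the keyword settings can be sketched in Python as below. The sketch follows the order of operations given in claim 1 (average the singular and plural frequencies, multiply by the keyword weight, then average within each synonym class) and labels each class with its highest-weight keyword, as described in step 7) of the keyword processing. The function names, the tuple layout of the settings and the header line of the output are assumptions; the exact input layout expected by the Cluster software is not given in the text.

```python
def keyword_score(freq, word, allow_plural, weight):
    """Average the singular/plural frequencies of one keyword, then apply its weight."""
    forms = [word, word + "S"] if allow_plural else [word]
    mean = sum(freq.get(form, 0.0) for form in forms) / len(forms)
    return mean * weight

def gene_row(freq, settings):
    """One row of the list for a single gene.

    `settings` is a list of (keyword, allow_plural, weight, synonym_class);
    class 0 is taken to mean "no synonyms" (the default value in the list box).
    """
    groups = {}
    for word, plural, weight, cls in settings:
        key = ("single", word) if cls == 0 else ("class", cls)
        score = keyword_score(freq, word, plural, weight)
        groups.setdefault(key, []).append((weight, word, score))
    row = {}
    for members in groups.values():
        label = max(members)[1]  # the keyword with the largest weight names the class
        row[label] = sum(score for _, _, score in members) / len(members)
    return row

def to_table(freqs_by_gene, settings):
    """Tab-delimited text: one line per gene, one column per keyword class."""
    rows = {gene: gene_row(freq, settings) for gene, freq in freqs_by_gene.items()}
    labels = list(next(iter(rows.values())))
    lines = ["GENE\t" + "\t".join(labels)]
    for gene, row in rows.items():
        lines.append(gene + "\t" + "\t".join(f"{row[label]:.4f}" for label in labels))
    return "\n".join(lines)
```

Because the weight multiplies the frequency, a very large weight such as 1001 for "COLORECTAL" dominates the distance between genes, which is why the genes split first by that keyword in the cluster result.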
Three pieces of information can be obtained from Figure 17: 1) Of the 51 genes, 31 are clearly related to the keywords "colon" ("colorectal" and its synonyms), "cancer" ("cancer" and its synonyms) and "metastasis" ("metastasis" and its synonyms), for example the genes "CDH2", "VCAM1" and "NCAM1". Another 8 genes are strongly related to the keywords "colon" and "cancer" and weakly related to "metastasis", for example the genes "CD58" and "BCL2". Most of the known colorectal cancer metastasis-related genes were thus detected by this experiment, which indicates that the experimental result is fairly reliable. 2) From the degree of association between each keyword and all the genes, we can see that, apart from the three keywords above, these genes are mainly related to the keywords "invasion", "suppressor", "proliferation", "adhesion", "apoptosis" and "cell cycle" ("cycle"), and secondarily to the keyword "angiogenesis". This suggests that the mechanism of colorectal cancer metastasis may be as follows: (1) tumor cells proliferate abnormally by breaking through cell cycle control or escaping apoptosis; (2) tumor cells lack effective metastasis suppressors; (3) tumor cells can disrupt intercellular adhesion, which facilitates metastasis. 3) Because we gave the keyword "colon" ("colorectal") and its synonyms a very high weight (Table 3), the 51 genes are clustered by this keyword into one class that is related to it and another class that is not. Among the genes unrelated to the keyword "colon", TIAM1 is the only gene that is clearly related to "cancer", "metastasis" and "invasion". This suggests that TIAM1 may be a new colorectal cancer metastasis-related gene.
5. Gene-multi-word literature search (secondary search)
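The prediction rule applied to TIAM1 (related to the main keywords of the function, unrelated only to the keyword that characterizes the research object) can be written as a small Python filter. This is a simplified sketch under stated assumptions: the function name and the cut-off are hypothetical, the score values in the usage note are made up for illustration, and "most of the main keywords" is simplified here to "all of the listed main keywords".

```python
def predict_new_genes(scores, main_keywords, object_keyword, cutoff=0.0):
    """Genes related to every main keyword but not to the object keyword.

    `scores` maps gene -> {keyword: weighted frequency}; a gene counts as
    "related" to a keyword when its score exceeds `cutoff` (assumed rule).
    """
    hits = []
    for gene, row in scores.items():
        related_to_main = all(row.get(k, 0.0) > cutoff for k in main_keywords)
        related_to_object = row.get(object_keyword, 0.0) > cutoff
        if related_to_main and not related_to_object:
            hits.append(gene)
    return hits
```

With made-up scores such as `{"TIAM1": {"CANCER": 0.6, "METASTASIS": 0.4, "COLORECTAL": 0.0}, "CDH2": {"CANCER": 0.5, "METASTASIS": 0.3, "COLORECTAL": 0.2}}`, main keywords `["CANCER", "METASTASIS"]` and object keyword `"COLORECTAL"`, only TIAM1 is returned. Any candidate found this way still needs the literature check of the secondary search.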
The procedure is as follows: 1) Click the gene ("Gene") combo box in the search dialog ("Search Dialog") of Figure 18 and select the gene "TIAM1"; click the first keyword ("Keywords") combo box and select the keyword "cancer" ("CANCER"); click the second keyword combo box and select the keyword "metastasis" ("METASTASIS"). 2) Click the Display ("Display") button of Figure 17; the information ("Information") text box then shows the retrieved documents, 6 in total (see Figure 18). 3) Click the Save ("Save") button of Figure 18; a save-file dialog pops up. Because the selected folder is "D:\colon", the default save path and file name are "D:\colon\result\TIAM1-CANCER-METASTASIS.rtf"; click OK to accept them, or save under a specified file name in a specified path. 4) Reading the abstracts of these six documents shows that TIAM1 is normally expressed mainly in the brain and testis and is also expressed in various tumor cell lines (including JEG-3); that TIAM1 can promote breast cancer metastasis; and that, in this experiment, TIAM1 is also highly expressed in metastatic colorectal cancer (see Table 1). In addition, TIAM1 is found to interact with another tumor metastasis suppressor, NME1; NME1 is also abnormally expressed in this experiment (see Table 1) and is clearly related to the keywords "colon", "cancer" and "metastasis" (see Figure 18). 5) We further searched the abstracts related to NME1 (1101 in total) for those in which the keywords "colon", "cancer" and "metastasis" or their synonyms occur simultaneously, and obtained 70 abstracts. These abstracts clearly show that down-regulated expression of the NME1 gene is related to colorectal cancer metastasis; this experiment also shows low NME1 expression in metastatic colorectal cancer (see Table 1). All of this information strongly suggests that TIAM1 influences colorectal cancer metastasis by interacting with NME1, so TIAM1 is a new colorectal cancer metastasis-related gene. Encouragingly, Liu et al. and Minard et al. have recently confirmed this prediction.
Example 8 (analysis of genes and expressed sequence tags (ESTs) differentially expressed between pathological scars and normal skin)
We further used the present invention to analyze another set of genes and expressed sequence tags (ESTs), 205 in total, that are differentially expressed between pathological scars (including hypertrophic scars and keloids) and normal skin. Related documents could be retrieved for 174 genes, and analysis of these documents yielded 453 potential keywords. After specialized processing, 30 keywords (comprising 73 synonyms) remained. Of these, the keywords "keloid", "hypertrophic" and "scar" were added with the keyword processor, and the keywords "keloid", "collagen" and "hypoxia" were given very high weights (see Table 4), because the latest theory holds that keloids may be caused by excessive fibroblast proliferation under hypoxic conditions.
The keyword-gene word frequency array was output according to the word frequency of each keyword in each gene and the respective parameter settings; the cluster analysis result is shown in Figure 19. The figure shows that most of the genes are related to the keywords "fibroblast" and "proliferation" (box 1 of Figure 19a and Figure 19b), and that many known collagen-related genes associated with pathological scars gather into one class (box 2 of Figure 19a and Figure 19c). We also found a group of genes related to hypoxia ("hypoxia"), among which the HIF1A gene is also related to keloid ("keloid") (box 4 of Figure 19a and Figure 19d). We therefore searched the abstracts related to HIF1A for those containing the keyword "keloid" and obtained 2 abstracts. These two abstracts show that HIF1A, activated by hypoxia, can induce the expression of PAI-1, which is an important cause of keloids, so HIF1A may be a therapeutic target for regulating scar fibrosis. In this group of genes we also found that the Cited2 gene is related to the keywords "fibroblast" and "proliferation", so we searched for the corresponding documents and obtained 2. Their abstracts show that Cited2 is a negative regulator of HIF1A in fibroblasts under hypoxic conditions; in this experiment Cited2 is expressed at a low level and HIF1A at a high level in pathological scars (data not shown), so the genes Cited2 and HIF1A merit further study.
Table 1. Genes differentially expressed in colorectal cancer metastasis

Up-regulated genes
Symbol      Gene name
MAPK4       mitogen-activated protein kinase 4
NMI         N-myc (and STAT) interactor
MYCBP       c-myc binding protein
MTA1        metastasis associated 1
CDH2        cadherin 2, type 1, N-cadherin
TACSTD1     tumor-associated calcium signal transducer 1
MMP9        matrix metalloproteinase 9
ECGF1       endothelial cell growth factor 1
CCND1       cyclin D1
BCL2        B-cell CLL/lymphoma 2
IL6         interleukin 6
PIM2        pim-2 oncogene
GAMAP       GPI-anchored metastasis-associated protein homolog
YAP65       YAP65
S100B       S100 calcium binding protein, beta
RASA1       RAS p21 protein activator 1
RET         ret proto-oncogene
FGF4        fibroblast growth factor 4
ETS2        v-ets erythroblastosis virus E26 oncogene homolog 2
PIM1        pim-1 oncogene
TIAM1       T-cell lymphoma invasion and metastasis 1
MET         met proto-oncogene (hepatocyte growth factor receptor)

Down-regulated genes
Symbol      Gene name
IL1B        interleukin 1, beta
NCAM1       neural cell adhesion molecule 1
CD58        CD58 antigen
TP53        tumor protein p53
RB1         retinoblastoma 1
DCC         deleted in colorectal carcinoma
THBS1       thrombospondin 1
LGALS3      lectin, galactoside-binding, soluble, 3
CD44        CD44 antigen
VCAM1       vascular cell adhesion molecule 1
SELE        selectin E
E2F3        E2F transcription factor 3
ITGB1       integrin, beta 1
ITGAV       integrin, alpha V
PECAM1      platelet/endothelial cell adhesion molecule
ICAM2       intercellular adhesion molecule 2
SELP        selectin P
ITM2A       integral membrane protein 2A
SCAMP3      secretory carrier membrane protein 3
LAMP1       lysosomal-associated membrane protein 1
CSPG6       chondroitin sulfate proteoglycan 6
SERPINB2    serine or cysteine proteinase inhibitor, clade B member 2
PPP1R15A    protein phosphatase 1, regulatory (inhibitor) subunit 15A
CASP1       caspase 1, apoptosis-related cysteine protease
NME3        non-metastatic cells 3
NME1        non-metastatic cells 1
CLDN1       claudin 1
TIMP1       tissue inhibitor of metalloproteinase 1
BAI2        brain-specific angiogenesis inhibitor 2
Table 2. Potential keywords found automatically for the 51 genes differentially expressed in colorectal cancer metastasis

9 H N X H1 21 IL ML 13 RB G1 MAB GTP MMP MYC IFN P53 PCR RAS BCL TNF LFA NM23 NODE LOSS
ICAM PIM1 BETA CD54 TIME TIMP ANTI VCAM WILD LIKE GROUP PROTO MOTIF BLOOD SERUM STAGE CYCLE DEATH LYMPH CASES GLIAL PHASE MURINE BREAST GTPASE
SERINE NEURAL CYCLIN STRESS TARGET MARKER MATRIX CANCER TUMORS LIGAND HYBRID PLASMA SUBUNIT DERIVED LIBRARY METHODS CALCIUM ANTIGEN CULTURE KINASES LAMININ CYTOKINE NECROSIS SELECTIN NEGATIVE
ADHESION MEASURED STAINING INTEGRIN ONCOGENE SURVIVAL INVASION TERMINAL CLINICAL MOLECULE COLLAGEN LYMPHOMA TYROSINE VASCULAR PLATELET POSITIVE LEUKEMIA CARCINOMA SYNTHESIS APOPTOSIS APOPTOTIC THREONINE LEUKOCYTE ACTIVATOR MIGRATION
ACTIVATED INDUCIBLE INDUCTION INHIBITOR COMPLEXES HOMOLOGUE COLORECTAL ANGIOGENIC SUPPRESSOR METASTASIS METASTATIC PRODUCTION PROGNOSTIC EPITHELIAL MICROSCOPY EPITHELIUM RESISTANCE LYMPHOCYTE EUKARYOTIC CONCLUSION MONOCLONAL CORRELATED STIMULATED STIMULATION PLASMINOGEN
INTERLEUKIN DEGRADATION CORRELATION ENDOTHELIAL PROGRESSION FIBRONECTIN ENDOTHELIUM INFLAMMATION INFLAMMATORY ANGIOGENESIS GLYCOPROTEIN INTERACTIONS TRANSDUCTION PROLIFERATION INTERCELLULAR PHOSPHORYLATED OVEREXPRESSION RETINOBLASTOMA CONCENTRATIONS PHOSPHORYLATION METALLOPROTEINASE IMMUNOHISTOCHEMICAL
Table 3. Related keywords of the 51 genes differentially expressed in colorectal cancer metastasis, after manual processing

Keyword          Plural  Weight  Synonym
INVASION         0       10      0
PROLIFERATION    0       5       0
ADHESION         0       5       0
ONCOGENE         0       1       0
METASTASIS       0       11      1
METASTATIC       0       10      1
MIGRATION        0       4       1
CANCER           1       11      2
CARCINOMA        1       10      2
TUMORS           0       10      2
COLORECTAL       0       1001    3
COLON            0       1000    3
CYCLE            0       6       4
CYCLIN           0       5       4
INFLAMMATION     0       1       6
INFLAMMATORY     0       1       6
SUPPRESSOR       0       6       7
INHIBITOR        1       5       7
LEUKEMIA         0       2       8
LYMPHOMA         1       1       8
EPITHELIAL       0       1       9
EPITHELIUM       0       1       9
ANGIOGENESIS     0       6       10
ANGIOGENIC       0       5       10
APOPTOSIS        0       6       11
APOPTOTIC        0       5       11
DEATH            0       2       11
Table 4. Related keywords of the 174 genes differentially expressed in pathological scars, after manual processing

Keyword            Plural  Weight  Synonym
SCAR               1       100     0
KELOID             1       100     0
FIBROBLAST         1       10      0
PROLIFERATION      0       10      0
HYPERTROPHIC       0       10      0
CONGENITAL         0       1       0
EPITHELIAL         0       5       1
SKIN               0       5       1
EPITHELIUM         0       5       1
EPIDERMIS          0       5       1
EPITHELIA          0       5       1
EPIDERMAL          0       5       1
LEUKEMIA           0       1       2
SCAR               1       100     0
LYMPHOMA           0       1       2
CADHERIN           0       2       3
ANNEXIN            0       1       3
CALCIUM            0       1       3
CALCINEURIN        0       1       3
CAP                0       1       3
INFLAMMATION       0       4       4
INFECTION          0       2       4
INFLAMMATORY       0       1       4
ANGIOGENESIS       0       5       5
ANGIOGENIC         0       5       5
ENDOTHELIAL        0       1       5
VASCULAR           0       1       5
CANCER             1       5       6
CARCINOMA          1       5       6
TUMORS             0       5       6
PARANEOPLASTIC     0       3       6
SUBSTRATE          1       3       8
METALLOTHIONEIN    1       1       8
BASEMENT           0       1       8
HYPOXIA            0       10      9
HYPOXIC            0       9       9
ANTIOXIDANT        0       1       9
HIF                0       1       9
OXYGEN             0       1       9
IMMUNE             0       2       10
IMMUNOSUPPRESSIVE  0       1       10
KERATINOCYTE       1       1       12
KERATIN            1       1       12
CYTOKERATIN        0       1       12
CYTOSKELETON       0       3       14
CATENIN            0       2       14
APOPTOSIS          0       5       16
APOPTOTIC          0       4       16
DEATH              0       2       16
CYCLE              0       5       18
CYCLIC             0       4       18
CYCLIN             0       2       18
G1                 0       2       18
MITOGEN            0       1       18
COLLAGEN           1       10      19
PROCOLLAGEN        0       5       19
CYTOKINE           1       2       20
CTGF               0       1       20
TNF                0       1       20
VEGF               0       1       20
TGF                0       1       20

Claims (5)

1, a kind of specific function related gene information retrieval system, this system comprises that one has the input and computing machine, a webserver, public biomedical bibliographic data base and the public gene name database and the cluster analysis unit of display terminal, it is characterized in that also comprising the documenm database formed by gene name database, word frequency base value database, string data storehouse and retrieval-assisted phrase database and
A literature search unit for the gene to be searched: this unit,
according to the entered abbreviation of the gene to be searched, as defined by the Gene Nomenclature Committee of the Human Genome Organization, obtains all corresponding name strings and auxiliary search words from the constructed document search-word database and edits them; according to the original information in that database, it removes name strings and auxiliary search words that readily cause false positives and adds name strings and auxiliary search words that were omitted,
and then retrieves from a public biomedical bibliographic database the literature records that contain the edited name strings and auxiliary search words and saves them to a specified file;
A word frequency analysis unit for the gene to be searched: this unit first extracts the abstract field of every retrieved literature record and then extracts each word in the abstract field; it divides the number of documents in which each word occurs by the total number of documents related to the gene to be searched, thereby calculating, word by word, the frequency of occurrence of these words in the literature related to the gene to be searched, i.e. the gene word frequency;
A keyword extraction unit: this unit compares the gene word frequency with the base value of the same word in the word frequency base value database, deletes words whose base value is higher than 1% and words whose gene word frequency, or whose difference between gene word frequency and word frequency base value, is lower than the threshold m = t + (k/n) × 100%, and then selects the words shared by at least two genes as the keywords of the genes to be searched and saves the record, where, in the formula m = t + (k/n) × 100%, t is the minimum value of the threshold m, k is a constant, and n is the number of abstract records related to the gene to be searched;
A keyword specialization unit: this unit produces an editable list in which keywords can be added or deleted, the plural forms of keywords can be set, keyword weights can be set, a keyword and its synonyms can be set as a single entity, and the parameter records can be saved;
A word frequency list building and output unit: from the gene word frequencies calculated by the word frequency analysis unit, this unit obtains the word frequency of each keyword in the literature related to each gene; it first averages the word frequencies of the singular and plural forms of a keyword to obtain the word frequency of that keyword, multiplies the result by the keyword's weight, and then averages the word frequencies of synonyms of the same class to obtain the word frequency of that synonym class, thereby building the word frequency list; finally it outputs, in the format of the cluster analysis software, the word frequency list of the occurrence frequencies of all keywords in the literature related to each gene; the cluster analysis unit performs cluster analysis on the data in this word frequency list file and displays the resulting gene information related to the specific function.
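The word frequency calculation and the construction of one row of the word frequency list described in claim 1 can be sketched in Python roughly as follows. This is a minimal illustration under stated assumptions, not the patented implementation: the function names, the regular-expression tokenizer, and the dictionary layouts for synonym groups, plural forms, and weights are all hypothetical, and a keyword with no listed plural is simply used as is.

```python
import re
from collections import Counter


def gene_word_frequency(abstracts):
    """Fraction of a gene's related abstracts in which each word occurs."""
    doc_counts = Counter()
    for text in abstracts:
        # Count each word once per abstract (a document count, not a token count).
        doc_counts.update(set(re.findall(r"[a-z0-9]+", text.lower())))
    total = len(abstracts)
    return {w: c / total for w, c in doc_counts.items()} if total else {}


def keyword_row(freq, synonym_groups, plurals, weights):
    """Build one gene's row of the word frequency list.

    synonym_groups: {entity label: [keywords treated as a single entity]}
    plurals:        {keyword: plural form}   (hypothetical layout)
    weights:        {keyword: weight}        (defaults to 1.0)
    """
    row = {}
    for label, members in synonym_groups.items():
        values = []
        for kw in members:
            forms = [kw] + ([plurals[kw]] if kw in plurals else [])
            # Average the singular and plural frequencies, then apply the weight.
            mean = sum(freq.get(f, 0.0) for f in forms) / len(forms)
            values.append(mean * weights.get(kw, 1.0))
        # Average over the synonyms that make up one entity.
        row[label] = sum(values) / len(values)
    return row
```

Rows produced this way for every gene could then be written out as a tab-delimited matrix (genes by keywords) for whichever cluster analysis software is used; the claim does not name a specific file format.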
2. The gene information search system related to a specific function according to claim 1, characterized in that it further comprises a secondary search unit for gene-related literature, which, based on the gene information related to the specific function obtained through cluster analysis,
selects a gene to be searched and a plurality of keywords corresponding to that gene;
searches for and displays the documents, among the literature related to the selected gene, that contain the selected keywords and their synonyms;
and saves the search results.
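The secondary search of claim 2 can be illustrated with a short Python sketch. It assumes, as one possible reading of the claim, that a document matches when every selected keyword is present either directly or through one of its synonyms; the record layout (a dictionary with an `abstract` field), the function name, and the restriction to single-word terms are assumptions made for the example.

```python
import re


def secondary_search(records, keyword_groups):
    """Return the records whose abstract mentions every selected keyword.

    records:        list of dicts with an "abstract" field (hypothetical layout)
    keyword_groups: one list per selected keyword, holding the keyword
                    itself and its synonyms (single words only in this sketch)
    """
    hits = []
    for rec in records:
        words = set(re.findall(r"[a-z0-9]+", rec["abstract"].lower()))
        # A keyword counts as present if it or any of its synonyms occurs.
        if all(any(term.lower() in words for term in group)
               for group in keyword_groups):
            hits.append(rec)
    return hits
```

The returned list can be displayed and saved as the search result; how the result is stored is not specified in the claim.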
3. The gene information search system related to a specific function according to claim 1 or 2, characterized in that the keyword extraction unit compares the gene word frequency with the base value of the same word in the word frequency base value database, deletes words whose base value is higher than 5% and words whose gene word frequency is lower than the threshold m = 15% + (1.5/n) × 100%, and then selects the words shared by at least two genes as the keywords of the genes to be searched and saves the record.
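The filtering rule of claim 3 can be sketched as follows. The sketch expresses percentages as fractions (so 100% is 1.0), implements only the plain word frequency test of claim 3 rather than the difference-based alternative mentioned in claim 1, and treats a word missing from the base value database as having a base value of 0; the function names and that last choice are assumptions.

```python
from collections import Counter


def threshold(n, t=0.15, k=1.5):
    """m = t + (k/n) * 100%, with percentages written as fractions."""
    return t + k / n


def candidate_keywords(gene_freq, base, n, max_base=0.05, t=0.15, k=1.5):
    """Words of one gene that survive the base value and threshold tests.

    gene_freq: {word: gene word frequency}, base: {word: base value},
    n: number of abstract records related to the gene.
    """
    m = threshold(n, t, k)
    return {w for w, f in gene_freq.items()
            if base.get(w, 0.0) <= max_base and f >= m}


def shared_keywords(per_gene_candidates):
    """Keep only candidates that are common to at least two genes."""
    counts = Counter(w for cands in per_gene_candidates.values() for w in cands)
    return {w for w, c in counts.items() if c >= 2}
```

With n = 10 related abstracts the threshold is 15% + 15% = 30%, while with n = 100 it falls to 16.5%, so genes with few related abstracts must show a word in a larger share of them before it is kept.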
4. A method for constructing the document search-word database used in the gene information search system related to a specific function according to claim 1 or 2, in which a computer with an input and display terminal accesses a public gene name database through a network server, characterized by comprising the following steps:
1) extracting from that database the non-duplicated full name, abbreviation, aliases and product names of each gene and creating a new gene record labelled with the abbreviation defined by the Gene Nomenclature Committee of the Human Genome Organization, thereby forming the gene name database;
2) first randomly selecting and entering, from among known genes, 200 or more genes that belong to the same species as the gene to be searched, then calling up from the gene name database thus formed the new gene records corresponding to the randomly selected genes and editing them, setting name strings and auxiliary search words according to the original information in the gene name database;
then retrieving from a public biomedical bibliographic database the literature records that contain these name strings and auxiliary search words and saving them to a specified file;
then extracting the abstract field of every retrieved literature record and extracting each word in the abstract field, dividing the number of documents in which a word occurs by the total number of documents related to one random gene so as to calculate, word by word, the frequency of occurrence of these words in the literature related to that random gene, then summing these frequencies over the random genes and dividing by the number of random genes to obtain the average frequency of occurrence of these words in the literature related to a random gene, i.e. the base value, thereby forming the word frequency base value database;
3) calling up the new gene records in the gene name database and building the string database and the auxiliary search word database, wherein the string database is built by the following steps:
a. character processing: deleting the content inside parentheses in a name and replacing non-letter, non-digit characters with spaces,
b. adding the written variants of gene family member names: if the abbreviation contains a space, deleting the space to produce a new abbreviated form; if the last character of the abbreviation is a digit, searching backward to the first non-digit character and inserting a space there to produce a new abbreviated form,
c. deleting gene names shorter than 4 characters,
d. deleting non-gene names that are common words,
e. deleting non-gene names that are English words,
f. outputting the gene name strings and building the string database;
and the auxiliary search word database is built by the following steps:
a. extracting all words in all full names and product names of each gene,
b. deleting candidate auxiliary words that are shorter than 6 characters and those identical to a gene name,
c. deleting words that are common words,
d. outputting the result and building the auxiliary search word database of the gene.
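The string-processing steps of claim 4, step 3), lend themselves to a compact Python sketch. The version below follows steps a to f for the name strings and steps a to d for the auxiliary search words, under several assumptions: the common-word and English-word lists are supplied by the caller as lowercase sets, comparison against them is case-insensitive, and all function names are hypothetical.

```python
import re


def normalize(name):
    """Step a: drop parenthesized content, turn other symbols into spaces."""
    name = re.sub(r"\([^)]*\)", " ", name)
    name = re.sub(r"[^A-Za-z0-9]+", " ", name)
    return " ".join(name.split())


def abbreviation_variants(abbrev):
    """Step b: alternative written forms for gene family members."""
    variants = {abbrev}
    if " " in abbrev:
        variants.add(abbrev.replace(" ", ""))          # "IL 6" -> "IL6"
    # Trailing digits directly after a non-digit character: insert a space.
    m = re.fullmatch(r"(.*[^0-9 ])([0-9]+)", abbrev)
    if m:
        variants.add(f"{m.group(1)} {m.group(2)}")     # "BCL2" -> "BCL 2"
    return variants


def name_strings(names, abbreviations, common_words, english_words):
    """Steps a-f: the name strings kept for one gene."""
    candidates = {normalize(n) for n in names}
    for a in abbreviations:
        candidates |= abbreviation_variants(normalize(a))
    return sorted(
        s for s in candidates
        if len(s) >= 4                          # step c
        and s.lower() not in common_words       # step d
        and s.lower() not in english_words      # step e
    )


def auxiliary_words(full_names, product_names, gene_names, common_words):
    """Auxiliary search words: long, non-name, non-common words."""
    words = set()
    for text in list(full_names) + list(product_names):
        words.update(normalize(text).lower().split())
    taken = {g.lower() for g in gene_names}
    return sorted(w for w in words
                  if len(w) >= 6 and w not in taken and w not in common_words)
```

Applying the length filter after the variants are generated means that a short form such as "IL6" is dropped while the spaced form "IL 6" is retained; whether the original method orders the steps exactly this way for every case is not stated beyond the listed sequence.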
5. The method for constructing the document search-word database used in the gene information search system related to a specific function according to claim 4, characterized in that the number of random genes whose new gene records are randomly called up from the gene name database is 250.
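The base value calculation of claim 4, step 2), which claim 5 applies to 250 random genes, can be sketched as an average of per-gene document frequencies. The input layout (a mapping from each random gene to the list of its related abstracts), the tokenizer, and the function names are assumptions made for illustration; a gene with no abstracts contributes nothing to the sums but still counts in the divisor.

```python
import re
from collections import Counter, defaultdict


def document_frequency(abstracts):
    """Fraction of one gene's abstracts in which each word occurs."""
    if not abstracts:
        return {}
    counts = Counter()
    for text in abstracts:
        counts.update(set(re.findall(r"[a-z0-9]+", text.lower())))
    return {w: c / len(abstracts) for w, c in counts.items()}


def base_values(abstracts_by_gene):
    """Average each word's frequency over all random genes."""
    totals = defaultdict(float)
    for abstracts in abstracts_by_gene.values():
        for word, freq in document_frequency(abstracts).items():
            totals[word] += freq
    n_genes = len(abstracts_by_gene)
    # Divide the summed frequencies by the number of random genes.
    return {w: s / n_genes for w, s in totals.items()} if n_genes else {}
```

Words that appear in the literature of almost any gene, such as articles and generic biomedical terms, end up with a high base value and are the ones the keyword extraction step discards.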
CNB2005100375268A 2005-09-27 2005-09-27 Specific function-related gene information searching system and method for building database of searching workds thereof Expired - Fee Related CN100343852C (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CNB2005100375268A CN100343852C (en) 2005-09-27 2005-09-27 Specific function-related gene information searching system and method for building database of searching workds thereof

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CNB2005100375268A CN100343852C (en) 2005-09-27 2005-09-27 Specific function-related gene information searching system and method for building database of searching workds thereof

Publications (2)

Publication Number Publication Date
CN1744080A CN1744080A (en) 2006-03-08
CN100343852C true CN100343852C (en) 2007-10-17

Family

ID=36139456

Family Applications (1)

Application Number Title Priority Date Filing Date
CNB2005100375268A Expired - Fee Related CN100343852C (en) 2005-09-27 2005-09-27 Specific function-related gene information searching system and method for building database of searching workds thereof

Country Status (1)

Country Link
CN (1) CN100343852C (en)

Families Citing this family (14)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
AU2011205296B2 (en) 2010-01-13 2016-07-28 Ab Initio Technology Llc Matching metadata sources using rules for characterizing matches
EP2545462A1 (en) * 2010-03-12 2013-01-16 Telefonaktiebolaget LM Ericsson (publ) System and method for matching entities and synonym group organizer used therein
CN102270208A (en) * 2010-06-29 2011-12-07 上海聚类生物科技有限公司 Method for constructing gene interaction network
KR101188886B1 (en) * 2010-10-22 2012-10-09 삼성에스디에스 주식회사 System and method for managing genetic information
US9384321B2 (en) 2010-11-25 2016-07-05 Portable Genomics, Inc. Organization, visualization and utilization of genomic data on electronic devices
CN102902711B (en) * 2012-08-09 2016-03-09 刘莎 The generation of the general masterplate of a kind of pragmatic keyword, application process and device
JP6101563B2 (en) * 2013-05-20 2017-03-22 株式会社日立製作所 Information structuring system
CN110955371B (en) 2014-02-13 2023-09-12 Illumina公司 Integrated consumer genome services
CN106295252B (en) * 2016-08-18 2019-05-07 杭州布理岚柏科技有限公司 Search method for gene prod
CN108428137A (en) * 2017-02-14 2018-08-21 阿里巴巴集团控股有限公司 Generate the method and device of abbreviation, verification electronic banking rightness of business
CN109493978B (en) * 2018-11-12 2021-05-25 北京懿医云科技有限公司 Disease research hotspot mining method and device, storage medium and electronic equipment
CN110349632B (en) * 2019-06-28 2020-06-16 南方医科大学 Method for screening gene keywords from PubMed literature
CN113921082B (en) * 2021-10-27 2023-04-07 云舟生物科技(广州)股份有限公司 Gene search weight adjustment method, computer storage medium, and electronic device
CN116796750B (en) * 2023-08-24 2023-11-10 宁波甬恒瑶瑶智能科技有限公司 NER model-based gene literature information extraction method, system and storage medium

Non-Patent Citations (3)

* Cited by examiner, † Cited by third party
Title
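Study on screening genes related to lung carcinogenesis using gene chip technology, 吕嘉春, 陈家堃, 纪卫东, 蒋义国, 施侣元, 吴中亮, 何敏, 曾波航, National Medical Journal of China (中华医学杂志), Vol. 83, No. 24, 2003 *
Mining microarray expression profiles of colorectal cancer metastasis using literature profiles, 黄仲曦, 孙青, 丁彦青, 姚开泰, Journal of First Military Medical University (第一军医大学学报), Vol. 23, No. 11, 2003 *
Mining nasopharyngeal carcinoma microarray expression data using literature profiles, 黄仲曦, 姚开泰, Journal of First Military Medical University (第一军医大学学报), Vol. 24, No. 7, 2004 *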

Also Published As

Publication number Publication date
CN1744080A (en) 2006-03-08

Similar Documents

Publication Publication Date Title
CN100343852C (en) Specific function-related gene information searching system and method for building database of searching workds thereof
Bastian et al. The Bgee suite: integrated curated expression atlas and comparative transcriptomics in animals
Bornstein et al. Single-cell mapping of the thymic stroma identifies IL-25-producing tuft epithelial cells
US20210272695A1 (en) Systems and methods for using sequencing data for pathogen detection
Maas et al. Cutting edge: molecular portrait of human autoimmune disease
Bleharski et al. Use of genetic profiling in leprosy to discriminate clinical forms of the disease
Monti et al. Molecular profiling of diffuse large B-cell lymphoma identifies robust subtypes including one characterized by host inflammatory response
Bomprezzi et al. Gene expression profile in multiple sclerosis patients and healthy controls: identifying pathways relevant to disease
Culhane et al. GeneSigDB—a curated database of gene expression signatures
US20200395097A1 (en) Pan-cancer model to predict the pd-l1 status of a cancer cell sample using rna expression data and other patient data
Wang et al. Single-cell RNA sequencing reveals the sustained immune cell dysfunction in the pathogenesis of sepsis secondary to bacterial pneumonia
Alcorta et al. Microarray studies of gene expression in circulating leukocytes in kidney diseases
Weirick et al. Logic programming to infer complex RNA expression patterns from RNA-seq data
Elsink et al. Implementation of early next-generation sequencing for inborn errors of immunity: a prospective observational cohort study of diagnostic yield and clinical implications in Dutch genome diagnostic centers
Moreland et al. The Mnemiopsis Genome Project Portal: integrating new gene expression resources and improving data visualization
Yang et al. Platform-independent approach for cancer detection from gene expression profiles of peripheral blood cells
Ladanyi et al. Expression profiling of human tumors: the end of surgical pathology?
Ammons et al. A single-cell RNA sequencing atlas of circulating leukocytes from healthy and osteosarcoma affected dogs
Quéré et al. Mining SAGE data allows large-scale, sensitive screening of antisense transcript expression
Coccaro et al. Feasibility of Optical Genome Mapping in Cytogenetic Diagnostics of Hematological Neoplasms: A New Way to Look at DNA
Cios et al. Specific TCR V–J gene segment recombinations leading to the identification pan-V–J CDR3s associated with survival distinctions: diffuse large B-cell lymphoma
CN1653454A (en) Method of forming molecule function network
Lee et al. How should biobanks prioritize and diversify biosample collections? A 40-year scientific publication trend analysis by the type of biosample
Alenda et al. FFPE samples from cavitational ultrasonic surgical aspirates are suitable for RNA profiling of gliomas
Fan et al. Genetic cross-talk between oral squamous cell carcinoma and type 2 diabetes: the potential role of immunity

Legal Events

Date Code Title Description
C06 Publication
PB01 Publication
C10 Entry into substantive examination
SE01 Entry into force of request for substantive examination
C14 Grant of patent or utility model
GR01 Patent grant
C17 Cessation of patent right
CF01 Termination of patent right due to non-payment of annual fee

Granted publication date: 20071017

Termination date: 20100927