CN113626567A

CN113626567A - Method for mining information related to genes and diseases from biomedical literature

Info

Publication number: CN113626567A
Application number: CN202110857485.6A
Authority: CN
Inventors: 韦嘉; 付宁
Original assignee: Shanghai Jixukang Biotechnology Co ltd
Current assignee: Shanghai Jixukang Biotechnology Co ltd
Priority date: 2021-07-28
Filing date: 2021-07-28
Publication date: 2021-11-09

Abstract

The invention relates to a method for mining gene-disease correlation information from biomedical literature, referred to for short as bidirectional disease and gene exploration. The method identifies gene and disease terms in biomedical documents using an enhanced standard NER tool combined with a dictionary-based longest-match strategy, screens candidate gene-disease groups for possible correlation using a Support Vector Machine (SVM) that combines local and global features, and ranks the positively correlated gene-disease groups by degree of correlation using a co-occurrence frequency statistic that accounts for the influence of articles and authors. Compared with other common schemes, the method greatly improves the recognition rate for genes and diseases while maintaining high accuracy and stability, and greatly reduces running time. In addition, the disease-gene correlation information base created by the method has wider coverage and can provide more systematic and comprehensive information for updating the treatment and diagnosis of diseases.

Description

Method for mining information related to genes and diseases from biomedical literature

Technical Field

The invention relates to the technical field of gene and disease related information mining, in particular to a method for mining gene and disease related information from biomedical documents.

Background

In recent years, progress in biomedical research has continued to reveal the molecules, cells and genetic components involved in disease, giving researchers new ideas and insights for exploring cell signalling pathways, genetic changes and the consequences of those changes. Meanwhile, with continuing breakthroughs in medical diagnostic technology, the strategy of classifying patients by disease biomarkers and selecting the optimal therapy has achieved some success and has attracted more researchers to the field.

Although such research has focused on oncology, other disease areas, such as respiratory, infectious and inflammatory diseases, have also advanced considerably in the last decade, providing a broader basis for research into disease-related factors. These new findings are contained in a large body of biomedical literature. If gene-disease correlation information can be extracted effectively from this literature, new therapeutic targets and biomarkers for patient classification may be discovered and developed, which would greatly promote clinical medicine and research.

However, the biomedical literature is large in volume and complex in text, and effectively extracting gene-disease correlation information from it remains difficult. Researchers have made various attempts over the past decades, but the coverage of the extracted information and the accuracy of information screening are still very limited. A method for mining gene-disease correlation information from biomedical literature is therefore proposed to solve these problems.

Disclosure of Invention

The invention aims at the problems in the prior art and provides a method for mining gene and disease related information from biomedical documents, which is called disease and gene bidirectional exploration for short.

The technical scheme for solving the technical problems is as follows: a method for mining gene and disease related information from biomedical literature, comprising the following steps:

S1, preparing and preprocessing data;

S2, identifying gene- and disease-related words in a given literature base;

S3, identifying possible correlations among the candidate gene-disease groups;

S4, ranking the identified positively correlated gene-disease groups.

On the basis of the technical scheme, the invention is further improved as follows.

Further, the preparation and preprocessing of the data comprises the steps of:

1) preparing reference libraries for identifying the two classes of words, namely a gene library and a disease term library;

2) preparing a set of annotated text data to train the model used for subsequent gene/disease identification, so as to achieve better accuracy;

3) preprocessing the document library to be retrieved, mainly to acquire three kinds of information:

i) titles and abstracts of articles;

ii) author information;

iii) citation (reference) information of the articles.

Further, the gene library is built from a combined library of three publicly available gene/protein databases, namely:

i) the Gene database of the National Center for Biotechnology Information (NCBI);

ii) the gene database of the HUGO Gene Nomenclature Committee (HGNC);

iii) the UniProt knowledge base;

the disease term library is built from the union of multiple databases, comprising:

i) the Disease Ontology developed at the University of Maryland School of Medicine;

ii) the Medical Dictionary for Regulatory Activities (MedDRA), created under the International Conference on Harmonisation of Technical Requirements for Registration of Pharmaceuticals for Human Use (ICH);

iii) the Unified Medical Language System (UMLS) designed by the United States National Library of Medicine (NLM);

iv) the Infectious Disease Database (IDDB).

Further, the identification of the gene/disease vocabulary comprises the steps of:

1) first, training the CRF-based NER model using the prepared sentences annotated with genes, diseases and their correlations;

2) using the trained model to identify gene- and disease-related words in the article titles and abstracts extracted from the document library;

3) mapping each recognized word to its unique library identifier in the gene or disease library;

4) moving a sliding window with a fixed character length one character at a time from the beginning to the end of the sentence to be examined, and fuzzy-matching the characters in the current window against the gene/disease library at each position; when the window reaches the end of the sentence, reducing the window length by one and starting again from the beginning of the sentence; and repeating these steps until the window length is zero, at which point detection for the sentence is complete;

5) combining the detection results of step 3) and step 4) as the final recognition result, and, when the two results disagree, handling them as follows:

i) if a recognized word is shorter than four characters, it is likely to be either an acronym of a term or a misrecognition; if a longer-form synonym of the word appears in the preceding text, the word is treated as the acronym of that term;

ii) if a word shorter than four characters has no longer-form synonym in the text but is recognized by the enhanced NER, it is still added to the result file;

iii) if a word is recognized by both schemes, the result of the enhanced NER is added to the result file.
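The reconciliation rules above can be sketched in a few lines of Python. This is only an illustrative reading, not the applicant's implementation: the function name `merge_mentions`, the dictionary-shaped inputs and the decision to drop a short word that satisfies neither rule i) nor rule ii) are all assumptions introduced here.

```python
def merge_mentions(ner_hits, dict_hits, synonyms, preceding_text):
    """Combine enhanced-NER hits and dictionary hits for one sentence.

    ner_hits / dict_hits: dict mapping a surface word to a library identifier.
    synonyms: dict mapping a short word to its known longer forms (hypothetical input).
    """
    merged = {}
    text = preceding_text.lower()
    for word in set(ner_hits) | set(dict_hits):
        in_ner = word in ner_hits
        if len(word) < 4:
            # Rule i): a short word is kept as an acronym if a longer form appeared earlier.
            long_forms = [s for s in synonyms.get(word, []) if len(s) >= 4]
            has_long_form = any(s.lower() in text for s in long_forms)
            # Rule ii): otherwise it is kept only if the enhanced NER recognized it.
            if not (has_long_form or in_ner):
                continue  # assumption: treated as a misrecognition and dropped
        # Rule iii): when both schemes recognize the word, the NER result wins.
        merged[word] = ner_hits[word] if in_ner else dict_hits[word]
    return merged
```

With `ner_hits = {"BRCA1": "G1"}` and dictionary hits for `"BRCA1"`, `"RP"` and `"AD"`, a preceding text that mentions "retinitis pigmentosa" keeps `"RP"` but drops `"AD"`, and `"BRCA1"` takes the NER identifier.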

Further, for the identification of gene/disease words, the standard NER tool is also enhanced as follows:

A. if an identified gene/disease word is not contained in the existing gene/disease library, relevant information is automatically searched for on the web in order to find the word's unique corresponding entry in the library;

B. if the corresponding entry is found in the library, the gene/disease is added to the result file;

C. if the word is not yet an alias or synonym of the corresponding entry in the library, it is also added to the existing gene/disease library, always sharing the same library identifier as its corresponding entry.

Further, fuzzy matching means extracting words that match a library entry exactly and also retaining words that differ only slightly, in punctuation or in singular/plural form; when a gene/disease word is recognized in the sliding window, it is marked so that it does not overlap with other recognized words.

Further, the correlation of the gene-disease groups obtained in step S2 is determined using a binary Support Vector Machine (SVM) classifier, which mainly comprises the following steps:

1) classification is performed by extracting two types of features, namely local lexical features and global syntactic features, wherein the local lexical features comprise the words surrounding the gene and disease terms to be identified in the original text, and the global syntactic features comprise unigrams, bigrams and trigrams extracted from the following two paths:

i) the shortest path between the gene term and the disease term in the dependency tree;

ii) the path between the lowest common ancestor (LCA) of the two words and the root of the dependency tree;

wherein the two types comprise the following three features:

1. local lexical features, specifically the lemmas of the two words before and after the gene term and the lemmas of the two words before and after the disease term;

2. global syntactic features, specifically the unigrams, bigrams and trigrams of the lemmas on the shortest path between the gene and disease terms in the dependency tree;

3. global syntactic features, specifically the unigrams, bigrams and trigrams of the lemmas on the path between the lowest common ancestor of the gene and disease terms and the root of the dependency tree;

2) after the features are extracted, training the binary SVM classifier using libsvm;

wherein the kernel function of the SVM scores features 1, 2 and 3 separately and adds the resulting scores linearly; each feature is treated as a bag-of-words model in which every possible word or n-gram is a dimension of a vector, the value of a dimension being 1 if the feature contains that word or n-gram and 0 otherwise, and the similarity between two instances of a feature is calculated as the cosine between their vectors.

Further, the ranking of the identified positively correlated gene-disease groups comprises the steps of:

1) calculating the PageRank of journal articles in an article citation network constructed from PubMed, and weighting the co-occurrence of gene-disease groups in different journal articles according to the PageRank;

2) after the PageRank of the journal articles is obtained, calculating the score of each gene-disease group using the following formula (1):

wherein g represents a gene, d represents a disease, C_(g,d) represents the set of all articles containing the gene-disease group, and PR(a) represents the PageRank of article a;

3) considering the influence of the authors of the articles: when a gene-disease group is mentioned repeatedly by the same author in different articles, the contribution of the repeated evidence needs to be suppressed; assuming that the weights contributed by one author to the different articles related to the same gene-disease group are equal and sum to 1, the weight of each article related to the gene-disease group is given by formula (2):

wherein l represents the list of all authors of article a, and |C_x| represents the number of articles published by author x about the gene-disease group (g, d);

4) when the influence of both the article and the author is considered, formula (1) is modified into the following formula (3):

In end-to-end operation, when a user queries the genes positively correlated with a specific disease, the system outputs a ranked list of correlated genes according to the score (i.e. the degree of correlation) obtained by formula (3).

Compared with the prior art, the technical scheme of the application has the following beneficial technical effects:

1) gene and disease terms are recognized by the enhanced standard NER tool combined with a dictionary-based longest-match strategy, so that, compared with other schemes, the recognition rate for genes and diseases is greatly improved and the accuracy is higher;

2) the possible correlation of candidate gene-disease groups is identified by a Support Vector Machine (SVM) that combines local and global features, so that running time is greatly reduced while accuracy is maintained, and the method is more efficient than other schemes;

3) the identified disease-correlated genes are ranked by a ranking algorithm that is based on co-occurrence frequency statistics and also considers the influence of articles and authors, so that a good F-score is maintained as coverage increases, i.e. the method is more stable and reliable;

4) the disease-gene correlation information base created by the method has a coverage far beyond that of other methods of the same period, and provides more systematic and comprehensive information for research into the treatment and diagnosis of diseases.

Drawings

FIG. 1 is a system diagram of the method for mining information on gene and disease correlations from biomedical literature according to the present invention.

Detailed Description

The principles and features of this invention are described below in conjunction with the following drawings, which are set forth by way of illustration only and are not intended to limit the scope of the invention.

In this embodiment, the method for mining information related to genes and diseases from biomedical documents improves on the prior art in three main respects:

first, an improved Named Entity Recognition (NER) tool is used to identify, in a given biomedical literature repository, diseases or genes that may be associated with a gene or disease of interest, and to form paired candidate gene-disease groups;

second, the possible correlation of the candidate gene-disease groups is identified using a Support Vector Machine (SVM) with the specified features;

finally, the identified positively correlated gene-disease groups are ranked by degree of correlation using a ranking algorithm that is based on co-occurrence frequency statistics and also considers the influence of articles and authors.

Specifically, the system framework implemented in this embodiment is shown in fig. 1, and includes the following steps:

S1, preparing and preprocessing data;

S2, identifying gene- and disease-related words in a given literature base;

S3, identifying possible correlations among the candidate gene-disease groups;

S4, ranking the identified positively correlated gene-disease groups.

In step S1, to identify gene- and disease-related words in the biomedical literature, reference libraries for the two types of words must be prepared: a gene library and a disease term library. In practice, the system combines three publicly available gene/protein databases, namely:

i) the Gene database of the National Center for Biotechnology Information (NCBI);

ii) the gene database of the HUGO Gene Nomenclature Committee (HGNC);

iii) the UniProt knowledge base.

This combined gene library contains a total of 60,197 distinct genes. Each gene name and its aliases or synonyms are assigned to the same library identifier, and cross-reference tags to the three source databases are attached, so that the corresponding information in any one of the databases can be found easily.
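As an illustration of how names, synonyms and cross-references might share one library identifier, a minimal Python sketch follows. The class names `GeneEntry` and `GeneLibrary`, the identifier format `GENE:0001` and the case-insensitive lookup are assumptions made for this example and are not taken from the described system.

```python
from dataclasses import dataclass, field


@dataclass
class GeneEntry:
    library_id: str
    names: set = field(default_factory=set)    # official name plus aliases/synonyms
    xrefs: dict = field(default_factory=dict)  # source database -> accession


class GeneLibrary:
    def __init__(self):
        self.entries = {}  # library_id -> GeneEntry
        self.by_name = {}  # lowercased name -> library_id

    def add(self, library_id, names, xrefs):
        """Register names under one identifier and attach cross-references."""
        entry = self.entries.setdefault(library_id, GeneEntry(library_id))
        entry.names.update(names)
        entry.xrefs.update(xrefs)
        for name in names:
            self.by_name[name.lower()] = library_id

    def lookup(self, name):
        """Return the shared library identifier for any known name, else None."""
        return self.by_name.get(name.lower())


library = GeneLibrary()
library.add("GENE:0001", {"TP53", "p53"},
            {"NCBI": "7157", "HGNC": "HGNC:11998", "UniProt": "P04637"})
```

Here `library.lookup("p53")` and `library.lookup("TP53")` resolve to the same identifier, and the entry's `xrefs` lead back to each source database.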

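The disease term library is likewise built by combining several databases, comprising:

i) the Disease Ontology developed at the University of Maryland School of Medicine;

ii) the Medical Dictionary for Regulatory Activities (MedDRA), created under the International Conference on Harmonisation of Technical Requirements for Registration of Pharmaceuticals for Human Use (ICH);

iii) the Unified Medical Language System (UMLS) designed by the United States National Library of Medicine (NLM);

iv) the Infectious Disease Database (IDDB).

It should be noted that the method is not limited to these databases/sets; the sources used may be chosen according to actual requirements.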

The disease term library assembled in this way covers 22,831 disease terms in total. It is not a flat collection of terms but a hierarchy: if a disease A belongs to a parent class B, A is given a label pointing to B. Each disease in the library thus has a unique identifier, the name of the disease and its aliases or synonyms, the number of the disease in its source database, and its parent class in the library.
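The hierarchical record described here could be represented as follows. This is a sketch under assumptions: the field names, the identifiers `D1` to `D3` and the example hierarchy are hypothetical and serve only to show how a parent-class label lets a disease be traced to its ancestors.

```python
from dataclasses import dataclass, field
from typing import Optional


@dataclass
class DiseaseEntry:
    library_id: str
    name: str
    synonyms: tuple = ()
    source_ids: dict = field(default_factory=dict)  # source database -> number there
    parent_id: Optional[str] = None                 # parent class in the library


def ancestors(library, disease_id):
    """Follow parent labels upward and return the chain of parent identifiers."""
    chain = []
    current = library[disease_id].parent_id
    while current is not None:
        chain.append(current)
        current = library[current].parent_id
    return chain


# Hypothetical three-level hierarchy for illustration only.
disease_library = {
    "D1": DiseaseEntry("D1", "eye disease"),
    "D2": DiseaseEntry("D2", "retinal degeneration", parent_id="D1"),
    "D3": DiseaseEntry("D3", "retinitis pigmentosa", ("RP",), parent_id="D2"),
}
```

A mention resolved to `D3` can then also be credited to `D2` and `D1` by calling `ancestors(disease_library, "D3")`.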

In addition to the reference libraries, a set of annotated text data needs to be prepared to train the model used for subsequent gene/disease recognition, so as to achieve better accuracy.

In practice, about 2000 annotated sentences are used, comprising both positively correlated (i.e. related) gene-disease groups and negatively correlated (i.e. unrelated) groups. Most of the annotated text comes from the Genetic Association Database (GAD) developed and maintained by the National Institutes of Health (NIH), and most of it has been reviewed and supplemented in person by domain experts, which raises the recognition rate and accuracy for the vocabulary and improves the recognition capability of the subsequent models. A small part of the text comes from the document database to be retrieved and is annotated manually, which improves the model's recognition performance on that specific library.

In addition, before recognition proper begins, the document library to be retrieved is preprocessed, mainly to obtain three kinds of information:

i) titles and abstracts of articles (the main parts that contain important information and need to be analysed);

ii) author information;

iii) citation (reference) information of the articles.

In step S2, gene/disease recognition is a Named Entity Recognition (NER) problem. The method uses a joint recognition scheme, namely the Conditional Random Field (CRF)-based NER tool developed at Stanford University combined with a dictionary-based longest-match strategy.

Specifically, the CRF-based NER model is first trained using the approximately 2000 sentences mentioned above, which are annotated with genes, diseases and their correlations. The trained model is then used to identify gene- and disease-related words in the article titles and abstracts extracted from the document library, and each identified word is mapped to its unique library identifier in the gene or disease library.

The scheme further enhances the standard NER tool as follows. If an identified gene/disease word is not contained in the existing gene/disease library, relevant information is automatically crawled from the web and searched in order to find the word's unique corresponding entry in the library. If the corresponding entry is found, the gene/disease is added to the result file. If the word is not yet an alias or synonym of that entry, it is also added to the existing gene/disease library, always sharing the same library identifier as its corresponding entry.
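The fallback logic of the enhanced NER can be outlined as below. The sketch assumes a caller-supplied function `search_candidates`, a hypothetical stand-in for the web lookup that returns candidate names for an unknown mention; the text does not specify how the search itself is performed, so no real search API is used.

```python
def resolve_mention(mention, library, search_candidates):
    """Map a recognized mention to a library identifier.

    library: dict mapping lowercased names to library identifiers (updated in place).
    search_candidates: hypothetical callable returning candidate names found online.
    Returns (library_id or None, True if a new synonym was added to the library).
    """
    key = mention.lower()
    if key in library:
        return library[key], False
    for candidate in search_candidates(mention):
        lib_id = library.get(candidate.lower())
        if lib_id is not None:
            # The new word joins the library and shares its entry's identifier.
            library[key] = lib_id
            return lib_id, True
    return None, False
```

A mention that cannot be tied to any existing entry returns `None` and is left out of the result file in this reading.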

The other recognition scheme is a longest-match strategy based on the gene/disease library. A sliding window with a fixed character length is moved one character at a time from the beginning to the end of the sentence to be examined, and the characters in the current window are fuzzy-matched against the gene/disease library at each position. When the window reaches the end of the sentence, its length is reduced by one and it starts again from the beginning of the sentence. These steps are repeated until the window length is zero, at which point detection for the sentence is complete.

Fuzzy matching means extracting words that match a library entry exactly and also retaining words that differ only slightly, in punctuation or in singular/plural form. When a gene/disease word is recognized in the sliding window, it is marked so that it does not overlap with other recognized words.
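A minimal Python sketch of the shrinking sliding window with fuzzy matching is given below. Several details are assumptions added for the example: the initial window length is taken to be the sentence length, fuzzy matching is approximated by lowercasing, removing punctuation and stripping one trailing "s", and a window is accepted only when it starts and ends on a word boundary.

```python
import re


def normalize(text):
    """Fuzzy key: lowercase, drop punctuation, collapse spaces, strip one plural 's'."""
    key = re.sub(r"[^\w\s]", "", text.lower())
    key = re.sub(r"\s+", " ", key).strip()
    if len(key) > 3 and key.endswith("s"):
        key = key[:-1]
    return key


def longest_match(sentence, library):
    """Return sorted (start, end, text, library_id) hits, longest windows first."""
    index = {normalize(name): lib_id for name, lib_id in library.items()}
    taken = [False] * len(sentence)  # marks characters of already recognized words
    hits = []
    for length in range(len(sentence), 0, -1):           # shrink the window by one each pass
        for start in range(len(sentence) - length + 1):  # slide one character at a time
            end = start + length
            window = sentence[start:end]
            if any(taken[start:end]):
                continue  # would overlap an earlier, longer hit
            if not (window[0].isalnum() and window[-1].isalnum()):
                continue
            if start > 0 and sentence[start - 1].isalnum():
                continue  # window starts inside a word
            if end < len(sentence) and sentence[end].isalnum():
                continue  # window ends inside a word
            lib_id = index.get(normalize(window))
            if lib_id is not None:
                hits.append((start, end, window, lib_id))
                taken[start:end] = [True] * length
    return sorted(hits)
```

Because longer windows are tried first and matched characters are marked, a multi-word term such as "retinitis pigmentosa" is captured whole before any shorter window can claim part of it. The scan is quadratic in sentence length, which is acceptable for titles and abstracts.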

The detection results of the two recognition schemes are combined as the final recognition result. When the two results disagree, they are handled as follows:

i) if a recognized word is shorter than four characters, it is likely to be either an acronym of a term or a misrecognition; if a longer-form synonym of the word appears in the preceding text, the word is treated as the acronym of that term;

ii) if a word shorter than four characters has no longer-form synonym in the text but is recognized by the enhanced NER, it is still added to the result file;

iii) if a word is recognized by both schemes, the result of the enhanced NER is added to the result file.

It should be noted that the enhanced NER result is preferred because the enhanced NER also uses a web search engine and therefore covers terms more comprehensively, so its recognition result is more accurate. In addition, different genes or diseases are sometimes denoted by the same term, so that one word may map to several independent library identifiers. If one of those identifiers has already been recognized earlier in the text, the word most likely belongs to that identifier; otherwise, it is matched once to each of the candidate library identifiers.

Meanwhile, because the gene/disease library is hierarchical, a recognized word is assigned not only the identifier of its corresponding library word but also the identifier of that word's parent class. Each gene-disease group extracted from a given text then serves as candidate evidence for identifying the relationship between the two.

In step S3, a binary Support Vector Machine (SVM) classifier is used to determine the correlation of the gene-disease groups obtained in step S2. For a given sentence S = w_1, ..., g, ..., w_i, ..., d, ..., w_n, where w represents a word, g represents a gene word and d represents a disease word, the SVM classifier determines whether g and d are correlated.

The classifier works mainly by extracting two types of features, namely local lexical features and global syntactic features. The local lexical features comprise the words surrounding the gene and disease terms to be identified in the original text, and the global syntactic features comprise unigrams, bigrams and trigrams extracted from the following two paths:

i) the shortest path between the gene term and the disease term in the dependency tree;

ii) the path between the lowest common ancestor (LCA) of the two words and the root of the dependency tree.

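The following table details the three features in these two main categories:

No.	Category	Feature
1	Local lexical	Lemmas of the two words before and after the gene term and of the two words before and after the disease term
2	Global syntactic	Unigrams, bigrams and trigrams of the lemmas on the shortest path between the gene and disease terms in the dependency tree
3	Global syntactic	Unigrams, bigrams and trigrams of the lemmas on the path between the lowest common ancestor of the gene and disease terms and the root of the dependency tree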

It should be noted that the lemmas and dependency trees are produced with the CoreNLP toolkit developed at Stanford University (Manning et al., 2014). The feature extraction process takes into account words attached to verbs through dependency relations such as "neg" (negation) and "advmod" (adverbial modifier), and includes such modifiers in the path as part of the global syntactic features.
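To make the three features concrete, the following Python sketch extracts them from a toy dependency parse. The parse is written by hand as a list of lemmas plus a list of head indices (-1 for the root) and is not CoreNLP output; all function names are hypothetical, and merging the gene-side and disease-side context words into one set is a simplification chosen for this example.

```python
def path_to_root(node, heads):
    """Indices from a token up to the root of the dependency tree."""
    path = [node]
    while heads[path[-1]] != -1:
        path.append(heads[path[-1]])
    return path


def dependency_paths(gene, disease, heads):
    """Shortest gene-disease path and the path from their LCA to the root."""
    up_g, up_d = path_to_root(gene, heads), path_to_root(disease, heads)
    on_d = set(up_d)
    lca = next(n for n in up_g if n in on_d)  # first shared ancestor
    shortest = up_g[:up_g.index(lca) + 1] + up_d[:up_d.index(lca)][::-1]
    return shortest, path_to_root(lca, heads)


def ngrams(items, max_n=3):
    """Set of unigrams, bigrams and trigrams over a lemma sequence."""
    return {" ".join(items[i:i + n])
            for n in range(1, max_n + 1)
            for i in range(len(items) - n + 1)}


def window_lemmas(i, lemmas, size=2):
    """Lemmas of up to `size` words before and after position i."""
    return set(lemmas[max(0, i - size):i] + lemmas[i + 1:i + 1 + size])


def extract_features(lemmas, heads, gene, disease):
    shortest, lca_to_root = dependency_paths(gene, disease, heads)
    return {
        "local": window_lemmas(gene, lemmas) | window_lemmas(disease, lemmas),
        "shortest_path": ngrams([lemmas[i] for i in shortest]),
        "lca_to_root": ngrams([lemmas[i] for i in lca_to_root]),
    }


# Toy parse of "RHO mutation cause retinitis_pigmentosa"; "cause" is the root.
lemmas = ["RHO", "mutation", "cause", "retinitis_pigmentosa"]
heads = [1, 2, -1, 2]
features = extract_features(lemmas, heads, gene=0, disease=3)
```

In this toy sentence the lowest common ancestor of the gene and disease tokens is the root verb, so the third feature contains only its lemma.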

After the features are extracted, the binary SVM classifier is trained with LIBSVM (developed at National Taiwan University, 2011).

The kernel function of the SVM scores features 1, 2 and 3 separately, and the resulting scores are added linearly.

Each of these features is treated as a bag-of-words model: every possible word or n-gram is a dimension of a vector, the value of a dimension is set to 1 if the feature contains that word or n-gram and to 0 otherwise, and the similarity between two instances of a feature is calculated as the cosine between their vectors.
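For binary bag-of-words vectors the cosine reduces to a set computation, since the dot product is the size of the intersection and each norm is the square root of the set size. The sketch below uses that identity and sums the three per-feature similarities; the feature keys `local`, `shortest_path` and `lca_to_root` are names assumed for this example.

```python
from math import sqrt


def cosine(a, b):
    """Cosine between two binary bag-of-words vectors given as sets of active terms."""
    if not a or not b:
        return 0.0
    return len(a & b) / sqrt(len(a) * len(b))


def combined_kernel(x, y):
    """Linear sum of the per-feature cosine similarities of two instances."""
    return sum(cosine(x[k], y[k]) for k in ("local", "shortest_path", "lca_to_root"))
```

A Gram matrix filled with `combined_kernel` values could then be supplied to an SVM implementation that accepts precomputed kernels.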

After the SVM classifier has identified a set of positively correlated gene-disease groups, together with the unique identifiers of the journal articles in which each group co-occurs, the next step, ranking, is carried out.

In step S4, the positively correlated gene-disease groups identified in step S3 are ranked. The correlation identification of step S3 produces a large number of positively correlated gene-disease groups: many genes are positively correlated with a given disease and, likewise, many diseases are positively correlated with a given gene. The method ranks these positively correlated genes/diseases by gene-disease co-occurrence frequency while also taking the influence of articles and authors into account. The basis is the co-occurrence frequency: for the different gene-disease groups containing the same disease, the number of distinct journal articles in which each group appears together is counted. Used alone for ranking, this frequency has proven to be less effective; relatively optimal results are obtained when the influence of both the article and the author is considered.

Specifically, the PageRank of journal articles is first calculated in an article citation network constructed from PubMed, and the co-occurrence of gene-disease groups in different journal articles is weighted according to the PageRank.

It should be noted that PageRank, proposed by Page, Brin et al. in 1999, is based on the idea that the more important a website is, the more websites link to it; it therefore uses the number and importance of the websites linking to a given website to evaluate that website's importance (its PageRank). Here the principle is applied to journal articles: citations between articles are equivalent to links between web pages, so an article cited by more articles, and by more influential ones, is more important and naturally has a higher PageRank.
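The idea can be illustrated with a small power-iteration PageRank over a citation graph. This is a generic textbook formulation rather than the computation used in the described system; the damping factor of 0.85, the fixed iteration count and the uniform redistribution of rank from articles that cite nothing are conventional choices assumed here.

```python
def pagerank(citations, damping=0.85, iterations=50):
    """citations: dict mapping an article to the list of articles it cites."""
    nodes = set(citations) | {t for targets in citations.values() for t in targets}
    if not nodes:
        return {}
    n = len(nodes)
    rank = {v: 1.0 / n for v in nodes}
    for _ in range(iterations):
        # Rank held by articles without outgoing citations is spread evenly.
        dangling = sum(rank[v] for v in nodes if not citations.get(v))
        new = {v: (1 - damping) / n + damping * dangling / n for v in nodes}
        for src, targets in citations.items():
            for t in targets:
                new[t] += damping * rank[src] / len(targets)
        rank = new
    return rank
```

For `{"A": ["C"], "B": ["C"], "C": []}`, article C, which is cited by both others, ends up with the highest rank, and the ranks sum to 1.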

Then, after the PageRank of the journal articles is obtained, the score of each gene-disease group is calculated by the following formula (1):

wherein g represents a gene, d represents a disease, C_(g,d) represents the set of all articles containing the gene-disease group, and PR(a) represents the PageRank of article a.

Besides the importance of the article, the influence of its authors must be considered. A biomedical researcher who focuses on a particular disease or gene may publish many articles on the same gene-disease group; therefore, when a gene-disease group is mentioned repeatedly by the same author in different articles, the contribution of such repeated evidence is suppressed.

Assuming that the weights contributed by one author to the different articles related to the same gene-disease group are equal and sum to 1, the weight of each article related to the gene-disease group is given by formula (2) below:

wherein l represents the list of all authors of article a, and |C_x| represents the number of articles published by author x about the gene-disease group (g, d).

Therefore, when the influence of both the article and the author is considered, formula (1) is modified into formula (3) as follows, which is also the basis for ranking positively correlated genes in this method:

In end-to-end operation, when a user queries the genes positively correlated with a specific disease, the system outputs a ranked list of correlated genes according to the score (i.e. the degree of correlation) obtained by formula (3).
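The scoring step can be sketched in Python under explicitly stated assumptions, because the text gives only the symbol definitions for formulas (1) to (3). The sketch assumes that formula (1) sums PR(a) over the articles in C_(g,d), that the article weight of formula (2) is the average over the article's authors of 1/|C_x|, and that formula (3) sums PR(a) multiplied by that weight. These forms are one plausible reading, not a reproduction of the original formulas, and the function names are hypothetical.

```python
def article_weight(article, authors_of, articles_by_author):
    """Assumed formula (2): mean over the article's authors x of 1 / |C_x|."""
    authors = authors_of[article]
    return sum(1.0 / len(articles_by_author[x]) for x in authors) / len(authors)


def pair_score(articles, pagerank, authors_of):
    """Assumed formula (3): sum of PR(a) * weight(a) over articles for one (g, d)."""
    articles_by_author = {}
    for a in articles:
        for x in authors_of[a]:
            articles_by_author.setdefault(x, set()).add(a)
    return sum(pagerank[a] * article_weight(a, authors_of, articles_by_author)
               for a in articles)


def rank_genes(evidence, pagerank, authors_of):
    """evidence: dict gene -> articles supporting (gene, queried disease)."""
    scores = {g: pair_score(arts, pagerank, authors_of) for g, arts in evidence.items()}
    return sorted(scores, key=scores.get, reverse=True)
```

Under these assumptions, two articles by the same single author together contribute as much as one article by an author who wrote only once about the group, which is the suppression of repeated evidence described above.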

Typical cases are as follows:

The practical performance of the method is demonstrated here using the example of identifying gene-disease correlations in the MEDLINE literature database created by the United States National Library of Medicine.

First, the corresponding data sets are prepared. The gene library and disease term library described in the technical scheme are the two basic libraries used: the gene library contains 60,197 genes, and the disease term library covers 22,831 diseases. The annotated data set used to train the recognition model and to evaluate the system or tools is also as described above: most of it comes from the NIH Genetic Association Database (GAD), supplemented manually by domain experts, and a small part comes from manually annotated text drawn at random from MEDLINE. In total it comprises 2340 positive (determined to be correlated) gene-disease labels and 1437 negative (determined to be not correlated) labels.

With the data sets prepared, the gene/disease term recognition performance of the method is tested first. 800 annotated sentences, in which 525 genes and 592 diseases are annotated, are used as the validation set, and the recognition performance is compared with other common or feasible schemes of the same period, as shown in the following table:

Among them, ABNER (Settles et al., 2004) and the NER tool developed at Stanford University are existing, commonly used schemes, and the scheme combining the NER tool with a dictionary (without search-engine enhancement) is similar to the contemporary, commonly used schemes BeFree (Bravo et al., 2015) and CoPub (Frijters et al., 2008). The results show that the recognition scheme of the method outperforms the other common or feasible schemes in overall gene/disease term recognition, with the highest F-score (0.877).

Next, the discriminative power of the method is evaluated by testing how well the classification model identifies correlations among candidate gene-disease groups.

The method uses a support vector machine model that combines local and global features. First, the model using only one type of feature is compared with the model using both types. Ten-fold cross-validation is carried out on a training set containing 2080 positive samples and 1277 negative samples, and the comparison results are shown in the following table:

When both types of features are used for classification, the best results are achieved in precision, recall and F-score compared with classifiers using a single feature type.

In addition, as the number of documents keeps growing, computation (recognition) speed is an important factor. Compared with the BeFree scheme, the method has a marked speed advantage while achieving an equivalent recognition performance (equivalent F-score), as shown in the following table:

Finally, the ranking algorithm is evaluated.

The ranking algorithm of the method is a co-occurrence frequency statistic weighted by the influence of articles and authors. Data on five randomly selected diseases and their associated genes, obtained from the DisGeNET website, are used as the ground truth for verification. The overall performance (expressed as the F-score) of the top K output genes (K = 50, 100, 150 and 200) is evaluated for the method, BeFree and CoPub, and the results are shown in the following table:

K	This method	BeFree	CoPub
50	0.241	0.213	0.212
100	0.238	0.214	0.211
150	0.216	0.186	0.192
200	0.197	0.167	0.175

In addition, for a specific disease, the method can identify more genes related to the disease than other methods, for example, in the above evaluation experiment, for the disease of retinitis pigmentosa, the method can identify 818 genes related to the disease, while the BeFree method only identifies 193 genes (142 of which are in the result of the method), the CoPub method identifies 179 genes (124 of which are in the result of the method), and by manually checking 40 of the 818 genes, 34 of the genes are found to be positive related genes which have been determined.

Although the F-score decreases as the number of output genes increases, the ranking scheme of this method consistently performs best (highest F-score) compared with the other two schemes.
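The top-K evaluation described above can be illustrated with a short sketch. It assumes that the F-score at K is the harmonic mean of precision and recall of the first K output genes against the ground-truth gene set; the text does not spell out the exact definition, and the function name `f_score_at_k` and the gene identifiers below are hypothetical placeholders.

```python
def f_score_at_k(ranked_genes, truth, k):
    """F-score of the first k ranked genes against a ground-truth gene set.

    Assumption: precision = hits / k and recall = hits / len(truth).
    """
    top = ranked_genes[:k]
    hits = sum(1 for gene in top if gene in truth)
    precision = hits / len(top) if top else 0.0
    recall = hits / len(truth) if truth else 0.0
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)

# Placeholder data: a ranked output list and a ground-truth set for one disease.
ranked = ["G1", "G2", "G3", "G4"]
truth = {"G1", "G3", "G9"}
f_at_2 = f_score_at_k(ranked, truth, 2)
f_at_4 = f_score_at_k(ranked, truth, 4)
```

Under this definition the score is computed per disease and can then be averaged over the five diseases used for verification.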

Although embodiments of the present invention have been shown and described, it will be appreciated by those skilled in the art that changes, modifications, substitutions and alterations can be made in these embodiments without departing from the principles and spirit of the invention, the scope of which is defined in the appended claims and their equivalents.

Claims

1. A method for mining information related to genes and diseases from biomedical literature, which is characterized by comprising the following steps:

S1, preparing and preprocessing data;

S3, identifying possible correlations among the candidate gene-disease groups;

S4, sorting the identified positively correlated gene-disease groups.

2. The method of claim 1, wherein the preparation and preprocessing of the data comprises the steps of:

i) titles and abstracts of articles;

ii) author information;

iii) literature references information of the article.

3. The method of claim 2, wherein the gene library is created based on a combinatorial library comprising three publicly available gene/protein databases, wherein the three publicly available gene/protein databases are respectively:

i) the Gene database of the National Center for Biotechnology Information (NCBI);

ii) the gene database of the HUGO Gene Nomenclature Committee (HGNC);

iii) the UniProt Knowledgebase;

i) the Disease Ontology constructed by the University of Maryland School of Medicine;

iv) Infectious Disease Database (IDDB).

4. The method of claim 1, wherein the identification of the gene/disease vocabulary comprises the steps of:

4) sliding a window of fixed character length one character at a time from the beginning to the end of the sentence to be detected, and fuzzy-matching the characters contained in the current window against the gene/disease library at each position; when the sliding window reaches the end of the sentence, reducing the window length by one and restarting sliding and matching from the beginning of the sentence; and repeating these steps until the window length is zero;

5. The method of claim 4, wherein the identification of gene/disease vocabulary is further enhanced by augmenting standard NER tools in the following ways:

6. The method as claimed in claim 5, wherein fuzzy matching means that exactly matching words are extracted and words differing only slightly in punctuation or in singular/plural form are also retained, and when a gene/disease word is recognized in the sliding window, it is marked to ensure that it does not overlap with other recognized words.
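The sliding-window longest-match procedure and the fuzzy-matching rule recited above can be sketched as follows. This is an illustrative reading only: the functions `normalize` and `find_terms`, the crude plural rule (dropping one trailing "s"), the word-boundary check and the initial window width are all assumptions introduced for the example, and the lexicon is a toy stand-in for the gene/disease library.

```python
import re

def normalize(text):
    """Fuzzy key: lower-case, drop punctuation and spaces, strip one plural 's'."""
    key = re.sub(r"[^a-z0-9]", "", text.lower())
    return key[:-1] if len(key) > 3 and key.endswith("s") else key

def find_terms(sentence, lexicon):
    """Longest-match dictionary lookup with a shrinking sliding window."""
    index = {normalize(term): term for term in lexicon}
    # Assumed starting width: longest lexicon term plus slack for plural/punctuation.
    width = min(len(sentence), max((len(t) for t in lexicon), default=0) + 2)
    taken = [False] * len(sentence)  # marks characters of already recognized terms
    found = []
    while width > 0:
        for start in range(len(sentence) - width + 1):
            end = start + width
            window = sentence[start:end]
            if any(taken[start:end]):
                continue  # would overlap a longer term recognized earlier
            if not (window[0].isalnum() and window[-1].isalnum()):
                continue  # assumed rule: a term starts and ends on a letter or digit
            if start > 0 and sentence[start - 1].isalnum():
                continue  # assumed rule: do not start inside a word
            if end < len(sentence) and sentence[end].isalnum():
                continue  # assumed rule: do not end inside a word
            term = index.get(normalize(window))
            if term is not None:
                found.append((start, end, term))
                for i in range(start, end):
                    taken[i] = True
        width -= 1  # shrink the window and rescan from the sentence start
    return sorted(found)

lexicon = {"retinitis pigmentosa", "RPGR", "cataract"}
sentence = "Mutations in RPGR cause retinitis pigmentosa and cataracts."
matches = find_terms(sentence, lexicon)
```

Because wider windows are tried first, the multi-word term "retinitis pigmentosa" is marked before any shorter candidate inside it can be considered, which is what keeps recognized terms from overlapping.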

7. The method of claim 1, wherein the identification of correlations between the identified gene and disease combinations uses a binary Support Vector Machine (SVM) classifier to determine the correlation of each gene-disease pair obtained from the second part, and comprises the following steps:

wherein, the two categories include the following three features:

8. The method of claim 1, wherein the step of ranking the identified positively correlated gene-disease groups comprises the steps of:

wherein g represents a gene, d represents a disease, C_{g,d} represents the set of all articles containing the gene-disease group, and PR(a) represents the PageRank of article a;

in end-to-end operation, when a user queries the positively correlated genes of a specific disease, the system outputs a series of related genes ordered by the scores obtained from formula (3).
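The query step can be illustrated with a small sketch. Formula (3) is not reproduced in this text, so the score used below, a plain sum of PR(a) over the articles in C_{g,d}, is only an assumed stand-in and may differ from the claimed formula (for example, it omits any author-influence weighting). The function name `rank_genes`, the article identifiers and the PageRank values are hypothetical.

```python
def rank_genes(disease, cooccurrence, pagerank):
    """Rank genes for one disease by an assumed score: the sum of PR(a) over C_{g,d}.

    cooccurrence maps (gene, disease) to the set of article ids containing both;
    pagerank maps an article id to its precomputed PageRank value.
    """
    scores = {}
    for (gene, d), articles in cooccurrence.items():
        if d == disease:
            scores[gene] = sum(pagerank.get(a, 0.0) for a in articles)
    # Highest score first; ties broken alphabetically for a stable output.
    return sorted(scores.items(), key=lambda item: (-item[1], item[0]))

# Placeholder data, not taken from the patent.
cooccurrence = {
    ("GENE_A", "disease_x"): {"a1", "a2"},
    ("GENE_B", "disease_x"): {"a3"},
    ("GENE_A", "disease_y"): {"a3"},
}
pagerank = {"a1": 0.2, "a2": 0.1, "a3": 0.5}
ranking = rank_genes("disease_x", cooccurrence, pagerank)
```

In this toy example GENE_B ranks above GENE_A for disease_x, because its single supporting article carries more weight than the two articles supporting GENE_A combined.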