CN113626567A - Method for mining information related to genes and diseases from biomedical literature - Google Patents

Method for mining information related to genes and diseases from biomedical literature

Info

Publication number
CN113626567A
Authority
CN
China
Prior art keywords
disease
gene
library
genes
words
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202110857485.6A
Other languages
Chinese (zh)
Inventor
韦嘉
付宁
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Shanghai Jixukang Biotechnology Co ltd
Original Assignee
Shanghai Jixukang Biotechnology Co ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Shanghai Jixukang Biotechnology Co ltd
Priority to CN202110857485.6A
Publication of CN113626567A
Legal status: Pending

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/30 Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
    • G06F16/33 Querying
    • G06F16/332 Query formulation
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/30 Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
    • G06F16/35 Clustering; Classification
    • G PHYSICS
    • G16 INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR SPECIFIC APPLICATION FIELDS
    • G16B BIOINFORMATICS, i.e. INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR GENETIC OR PROTEIN-RELATED DATA PROCESSING IN COMPUTATIONAL MOLECULAR BIOLOGY
    • G16B20/00 ICT specially adapted for functional genomics or proteomics, e.g. genotype-phenotype associations
    • G PHYSICS
    • G16 INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR SPECIFIC APPLICATION FIELDS
    • G16B BIOINFORMATICS, i.e. INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR GENETIC OR PROTEIN-RELATED DATA PROCESSING IN COMPUTATIONAL MOLECULAR BIOLOGY
    • G16B50/00 ICT programming tools or database systems specially adapted for bioinformatics

Abstract

The invention relates to a method for mining gene and disease correlation information from biomedical literature, referred to for short as bidirectional disease and gene exploration. The method identifies gene and disease terms in biomedical literature using an enhanced standard NER tool combined with a dictionary-based longest-match strategy; screens candidate gene-disease groups for possible correlations using a Support Vector Machine (SVM) that combines local and global features; and ranks the positively correlated gene-disease groups by degree of correlation using a coexistence frequency statistic that accounts for the influence of articles and authors. Compared with other common schemes, the method greatly improves the recognition rate for genes and diseases while maintaining high accuracy and stability, and greatly reduces running time. In addition, the disease-gene correlation database created by the method has wider coverage and can provide more systematic and comprehensive information for updating disease treatment and diagnosis technology.

Description

Method for mining information related to genes and diseases from biomedical literature
Technical Field
The invention relates to the technical field of gene and disease related information mining, in particular to a method for mining gene and disease related information from biomedical documents.
Background
In recent years, progress in biomedical research has continued to reveal the molecules, cells and genetic components associated with diseases, giving researchers new ideas and insights for exploring cell signalling pathways, genetic changes and the consequences of those changes. At the same time, with continuing breakthroughs in medical diagnostic technology, the strategy of classifying patients by disease biomarkers and selecting the optimal therapy has achieved some success and has attracted more researchers to the field.
Although such research has focused on oncology, over the last decade other disease fields, such as respiratory, infectious and inflammatory diseases, have also advanced considerably, providing a broader basis for research into disease-related factors. All of these new findings are contained in a large body of biomedical literature. If gene-disease related information can be extracted effectively from that literature, it becomes possible to discover and develop new therapeutic targets and biomarkers for patient classification, which would greatly promote clinical medicine and research.
However, the biomedical literature is considerable in both volume and textual complexity, and effectively extracting gene-disease related information from it remains difficult. Over the past decades many researchers have made various attempts, but the coverage of the extracted information and the accuracy of information screening are still very limited. A method for mining gene-disease related information from biomedical literature is therefore proposed to solve these problems.
Disclosure of Invention
The invention aims at the problems in the prior art and provides a method for mining gene and disease related information from biomedical documents, which is called disease and gene bidirectional exploration for short.
The technical scheme for solving the technical problems is as follows: a method for mining gene and disease related information from biomedical literature, comprising the following steps:
S1, preparing and preprocessing data;
S2, identifying gene- and disease-related words in a given literature library;
S3, identifying possible correlations among the candidate gene-disease groups;
S4, ranking the identified positively correlated gene-disease groups.
On the basis of the technical scheme, the invention is further improved as follows.
Further, the preparation and preprocessing of the data comprises the steps of:
1) preparing reference libraries for identifying the two classes of words, namely a gene library and a disease term library;
2) preparing a set of annotated text data to train the model used for subsequent gene/disease identification, so as to achieve better accuracy;
3) preprocessing the document library to be searched, mainly to obtain three kinds of information:
i) titles and abstracts of articles;
ii) author information;
iii) the reference (citation) information of each article.
Further, the gene library is constructed as a combined library of three publicly available gene/protein databases:
i) the Gene database of the National Center for Biotechnology Information (NCBI);
ii) the gene database of the HUGO Gene Nomenclature Committee (HGNC);
iii) the UniProt Knowledgebase;
the disease term library is based on the union of multiple databases, comprising:
i) the Disease Ontology, built by the University of Maryland School of Medicine;
ii) the MedDRA medical dictionary (Medical Dictionary for Regulatory Activities), created under the auspices of the International Conference on Harmonisation of Technical Requirements for Registration of Pharmaceuticals for Human Use (ICH);
iii) the Unified Medical Language System (UMLS), designed by the United States National Library of Medicine (NLM);
iv) the Infectious Disease Database (IDDB).
Further, the identification of the gene/disease vocabulary comprises the steps of:
1) first, training the CRF-based NER model with prepared sentences annotated with genes, diseases and their correlations;
2) using the trained model to identify gene- and disease-related words in the article titles and abstracts extracted from the document library;
3) mapping each recognized word to the unique library identifier of the corresponding word in the gene or disease library;
4) sliding a window of fixed character length one character at a time from the beginning to the end of the sentence under test, fuzzy-matching the characters in the current window against the gene/disease library at each position; when the window reaches the end of the sentence, reducing the window length by one and sliding again from the beginning of the sentence; and repeating until the window length is zero, at which point detection for the sentence is complete;
5) combining the results of step 3) and step 4) as the final recognition result and, where the two results disagree, resolving the disagreement as follows:
i) a recognized word shorter than four characters is likely to be either an acronym of a term or a misrecognition; if a longer-form synonym of it appears earlier in the text, it is treated as an acronym of that term;
ii) if a word shorter than four characters has no longer-form synonym in the text but is recognized by the enhanced NER, it is still added to the result file;
iii) if a word is recognized by both schemes, the result of the enhanced NER is added to the result file.
Further, for the identification of gene/disease words, the standard NER tool is enhanced as follows:
A. if an identified gene/disease word is not contained in the existing gene/disease library, relevant information is automatically retrieved from the web and searched in order to find the unique corresponding word in the library;
B. if the corresponding word is eventually found in the library, the gene/disease is added to the result file;
C. if the identified word is not already an alias or synonym of the corresponding word in the library, it is also added to the existing gene/disease library, always sharing the same library identifier as its corresponding word.
Further, fuzzy matching means extracting words that match exactly while also retaining words that differ only slightly in punctuation or in singular/plural form; when a gene/disease word is recognized in the sliding window, it is marked to ensure that it does not overlap with other recognized words.
Further, the correlation of each identified gene-disease combination is determined with a binary Support Vector Machine (SVM) classifier applied to the gene-disease pairs obtained in the second step, which mainly comprises the following steps:
1) classification uses two types of features, local lexical features and global syntactic features, wherein the local lexical features comprise the words surrounding the gene and disease terms in the original text, and the global syntactic features comprise unigram, bigram and trigram models extracted from the following two paths:
i) the shortest path between the gene term and the disease term in the dependency tree;
ii) the path between the lowest common ancestor (LCA) of the two words and the root of the dependency tree;
wherein the two types comprise the following three features:
1. local lexical feature: the lemmas of the two words before and after the gene term and of the two words before and after the disease term;
2. global syntactic feature: unigram, bigram and trigram models of the lemmas on the shortest path between the gene and disease terms in the dependency tree;
3. global syntactic feature: unigram, bigram and trigram models of the lemmas on the path from the lowest common ancestor of the gene and disease terms to the root of the dependency tree;
2) after the features are extracted, training a binary SVM classifier using LIBSVM;
wherein the kernel function of the SVM scores features 1, 2 and 3 separately and adds the resulting scores linearly; each feature is treated as a bag-of-words model in which every possible word or n-gram is one dimension of a vector, the value of a dimension being set to 1 if the feature contains that word or n-gram and 0 otherwise; and the similarity between two instances of a feature is calculated as the cosine between their vectors.
Further, the ranking of the identified positively correlated gene-disease groups comprises the steps of:
1) calculating the PageRank of journal articles in the article citation network constructed from PubMed, and weighting the coexistence of gene-disease groups in different journal articles according to that PageRank;
2) after the PageRank of the journal articles is obtained, calculating the score of each gene-disease group using the following formula (1):
[Formula (1): image BDA0003184574480000051, not reproduced in this text]
wherein g represents a gene, d represents a disease, C(g,d) represents the set of all articles containing the gene-disease group, and PR(a) represents the PageRank of article a;
3) considering the influence of the authors of the articles: when a gene-disease group is mentioned repeatedly by the same author in different articles, the contribution of the repeated evidence must be suppressed; assuming that one author's contributions to the different articles on the same gene-disease group carry equal weights that sum to 1, the weight of each article on the gene-disease group is given by formula (2):
[Formula (2): image BDA0003184574480000052, not reproduced in this text]
wherein l represents the list of all authors of article a, and |C_x| represents the number of articles published by author x about the gene-disease group (g, d);
4) when the influence of both the article and the author is considered, formula (1) is modified into the following formula (3):
[Formula (3): image BDA0003184574480000061, not reproduced in this text]
In end-to-end operation, when a user queries the positively correlated genes of a specific disease, the system outputs a list of correlated genes ordered by the score (i.e. the degree of correlation) obtained from formula (3).
Compared with the prior art, the technical scheme of the application has the following beneficial technical effects:
1) gene and disease terms are recognized by the enhanced standard NER tool combined with a dictionary-based longest-match strategy, which, compared with other schemes, greatly improves the recognition rate for genes and diseases and achieves higher accuracy;
2) possible correlations of candidate gene-disease groups are identified by a Support Vector Machine (SVM) combining local and global features, which greatly reduces running time while maintaining accuracy and is more efficient than other schemes;
3) the identified positively correlated genes of a disease are ranked by correlation with a ranking algorithm based on statistical coexistence frequency that also accounts for the influence of articles and authors; a good F-score is maintained throughout as coverage increases, i.e. the method is more stable and reliable;
4) the disease-gene correlation database created by the method has coverage far beyond that of other contemporary methods and provides more systematic and comprehensive information for research on disease treatment and diagnosis technology.
Drawings
FIG. 1 is a system diagram of the method for mining information on gene and disease correlations from biomedical literature according to the present invention.
Detailed Description
The principles and features of this invention are described below in conjunction with the following drawings, which are set forth by way of illustration only and are not intended to limit the scope of the invention.
In this embodiment, the method for mining gene and disease correlation information from biomedical literature improves on the prior art mainly in three respects:
first, an improved Named Entity Recognition (NER) tool is used to identify, in a given biomedical literature library, diseases or genes that may be associated with a gene or disease of interest, forming paired candidate gene-disease groups;
second, a Support Vector Machine (SVM) with the specified features is used to identify possible correlations in the candidate gene-disease groups;
finally, the identified positively correlated gene-disease groups are ranked by degree of correlation with a ranking algorithm based on statistical coexistence frequency that also accounts for the influence of articles and authors.
Specifically, the system framework implemented in this embodiment is shown in fig. 1, and includes the following steps:
S1, preparing and preprocessing data;
S2, identifying gene- and disease-related words in a given literature library;
S3, identifying possible correlations among the candidate gene-disease groups;
S4, ranking the identified positively correlated gene-disease groups.
In step S1, to identify gene- and disease-related words in the biomedical literature, reference libraries for the two classes of words must be prepared: a gene library and a disease term library. In practice, the scheme combines three publicly available gene/protein databases, namely:
i) the Gene database of the National Center for Biotechnology Information (NCBI);
ii) the gene database of the HUGO Gene Nomenclature Committee (HGNC);
iii) the UniProt Knowledgebase.
This combined gene library contains a total of 60,197 distinct genes. Each gene name and its aliases or synonyms are assigned the same library identifier, with cross-reference tags to the three source databases, so that the corresponding information in any of them can be found easily.
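The shared-identifier layout described above can be illustrated with a minimal sketch. The class name `GeneEntry`, the function `build_index` and the identifier `LIB:0001` are hypothetical and are not taken from the patent; the three cross-references shown are the public accessions for TP53 and serve only as an example.

```python
from dataclasses import dataclass, field

@dataclass
class GeneEntry:
    """One gene in the combined library (hypothetical layout)."""
    library_id: str                      # identifier shared by all names of the gene
    names: set[str]                      # official name plus aliases and synonyms
    xrefs: dict[str, str] = field(default_factory=dict)  # source database -> accession

def build_index(entries):
    """Map every lower-cased name to the set of library identifiers using it."""
    index = {}
    for entry in entries:
        for name in entry.names:
            index.setdefault(name.lower(), set()).add(entry.library_id)
    return index

tp53 = GeneEntry(
    library_id="LIB:0001",               # hypothetical library identifier
    names={"TP53", "p53"},
    xrefs={"NCBI": "7157", "HGNC": "HGNC:11998", "UniProt": "P04637"},
)
index = build_index([tp53])
```

Keeping a set of identifiers per name is a deliberate choice in this sketch: it lets an ambiguous name point at several entries, a situation the description addresses later.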
The disease term library is likewise constructed by combining several databases, comprising:
i) the Disease Ontology, built by the University of Maryland School of Medicine;
ii) the MedDRA medical dictionary (Medical Dictionary for Regulatory Activities), created under the auspices of the International Conference on Harmonisation of Technical Requirements for Registration of Pharmaceuticals for Human Use (ICH);
iii) the Unified Medical Language System (UMLS), designed by the United States National Library of Medicine (NLM);
iv) the Infectious Disease Database (IDDB).
It should be noted that the method is not limited to these databases; the selection may be determined according to actual requirements.
The disease term library assembled in this way covers 22,831 disease terms. It is not a flat collection of terms but is hierarchical: if a disease A belongs to a parent class B, then A is given a label referring to B. Each disease in the library thus has a unique identifier, its name and its aliases or synonyms, its number in the source database, and its parent class in the library.
Once the reference libraries are constructed, a set of annotated text data must also be prepared to train the model for subsequent gene/disease recognition, so as to achieve better accuracy.
In practice, about 2000 annotated sentences are used, comprising both positively correlated (i.e. related) disease-gene groups and negatively correlated (i.e. unrelated) groups. Most of the annotated text comes from the Genetic Association Database (GAD) developed and maintained by the U.S. National Institutes of Health (NIH), and most of it has been personally reviewed and supplemented by domain experts, which raises the recognition rate and accuracy for vocabulary and improves the recognition ability of the subsequent models. A small part of the text comes from the document library to be searched and is manually annotated, which improves the model's recognition for that specific library.
In addition, before formal identification begins, the document library to be searched is preprocessed, mainly to obtain three kinds of information:
i) titles and abstracts of articles (the main parts, which contain the important information to be identified);
ii) author information;
iii) the reference (citation) information of each article.
In step S2, gene/disease recognition is a Named Entity Recognition (NER) problem. The method uses a joint recognition scheme: the NER tool based on Conditional Random Fields (CRF) developed by Stanford University is combined with a dictionary-based longest-match strategy.
Specifically, the CRF-based NER model is first trained with the approximately 2000 sentences mentioned above, annotated with genes, diseases and their correlations. The trained model is then used to identify gene- and disease-related words in the article titles and abstracts extracted from the document library, and each identified word is mapped to its unique library identifier in the gene or disease library.
The scheme further enhances the standard NER tool as follows: if an identified gene/disease word is not contained in the existing gene/disease library, relevant information is automatically crawled from the web and searched in order to find the unique corresponding word in the library. If the corresponding word is eventually found, the gene/disease is added to the result file; and if the identified word is not already an alias or synonym of the corresponding word in the library, it is also added to the existing gene/disease library, always sharing the same library identifier as its corresponding word.
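The alias-resolution step can be sketched as follows. The description does not say how the web search is performed, so the sketch takes it as an injected callable named `lookup` (a hypothetical stand-in that returns a canonical name or `None`); the function name `resolve_unknown` and the identifiers in the example are likewise assumptions.

```python
def resolve_unknown(term, library, lookup):
    """Return the library identifier for `term`, learning a new alias if needed.

    library: dict mapping known names to library identifiers (mutated in place)
    lookup:  hypothetical callable standing in for the web search; it returns
             the canonical name it found for `term`, or None
    """
    if term in library:
        return library[term]
    canonical = lookup(term)
    if canonical is None or canonical not in library:
        return None                        # no unique library word found
    # The new alias always shares the identifier of its corresponding word.
    library[term] = library[canonical]
    return library[term]

library = {"tumor necrosis factor": "G2"}              # hypothetical identifier
fake_search = {"TNF-alpha": "tumor necrosis factor"}.get
resolved = resolve_unknown("TNF-alpha", library, fake_search)
```

After the call, `library` also contains the learned alias, so the same word is resolved directly the next time it is seen.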
The other recognition scheme is a longest-match strategy based on the gene/disease libraries. A window of fixed character length slides one character at a time from the beginning to the end of the sentence under test, and the characters in the current window are fuzzy-matched against the gene/disease library at each position. When the window reaches the end of the sentence, its length is reduced by one and it slides again from the beginning of the sentence. This is repeated until the window length is zero, at which point detection for the sentence is complete.
Fuzzy matching means extracting words that match exactly while also retaining words that differ only slightly in punctuation or in singular/plural form. When a gene/disease word is recognized in the sliding window, it is marked to ensure that it does not overlap with other recognized words.
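A minimal sketch of the sliding-window longest-match procedure is given below. The normalisation rule (lower-casing, dropping punctuation, stripping one trailing "s"), the word-boundary checks, the starting width `max_len` and the identifiers in the example are all assumptions made for illustration; the description does not specify them.

```python
import re

def normalize(term: str) -> str:
    """Crude fuzzy key: lower-case, drop punctuation, strip one plural 's'."""
    key = re.sub(r"[^\w\s]", "", term.lower())
    key = re.sub(r"\s+", " ", key).strip()
    return key[:-1] if len(key) > 3 and key.endswith("s") else key

def longest_match(sentence: str, library: dict[str, str], max_len: int = 40):
    """Return non-overlapping (start, end, text, library_id) hits."""
    index = {normalize(name): lib_id for name, lib_id in library.items()}
    taken = [False] * len(sentence)        # marks characters already claimed
    hits = []
    # Shrink the window by one character after each full pass over the sentence.
    for width in range(min(max_len, len(sentence)), 0, -1):
        for start in range(len(sentence) - width + 1):
            end = start + width
            if any(taken[start:end]):
                continue                   # would overlap an earlier, longer hit
            window = sentence[start:end]
            # Assumed boundary rule: a hit starts and ends on a whole word.
            if not (window[0].isalnum() and window[-1].isalnum()):
                continue
            if start > 0 and sentence[start - 1].isalnum():
                continue
            if end < len(sentence) and sentence[end].isalnum():
                continue
            lib_id = index.get(normalize(window))
            if lib_id is not None:
                taken[start:end] = [True] * width
                hits.append((start, end, window, lib_id))
    return sorted(hits)

library = {"breast cancer": "D001", "cancer": "D000", "BRCA1": "G001"}
hits = longest_match("Mutations in BRCA1 are linked to breast cancers.", library)
```

Because longer windows are tried first and matched characters are marked, "breast cancers" is claimed as one term and the shorter entry "cancer" cannot match inside it.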
The detection results of the two recognition schemes are combined as the final recognition result. Where the two results disagree, the disagreement is resolved as follows:
i) a recognized word shorter than four characters is likely to be either an acronym of a term or a misrecognition; if a longer-form synonym of it appears earlier in the text, it is treated as an acronym of that term;
ii) if a word shorter than four characters has no longer-form synonym in the text but is recognized by the enhanced NER, it is still added to the result file;
iii) if a word is recognized by both schemes, the result of the enhanced NER is added to the result file.
It should be noted that the rationale is that the enhanced NER also draws on a web search engine and so covers terms more comprehensively, making its recognition results more accurate. In addition, gene and disease words sometimes share the same term, so that one word matches several distinct library identifiers. If one of those identifiers has already been recognized earlier in the text, the word most likely belongs to that identifier; otherwise it is matched once to each of the identifiers.
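The three merging rules can be sketched as a small function. The data shapes (dictionaries from surface word to library identifier), the name `merge_hits`, and the decision to keep dictionary-only words of four or more characters are assumptions; the description states only that the two results are combined.

```python
def merge_hits(ner_hits, dict_hits, synonyms, earlier_text):
    """Combine enhanced-NER and longest-match results for one passage.

    ner_hits, dict_hits: dict mapping a recognized word to a library identifier
    synonyms:            dict mapping a library identifier to its known names
    earlier_text:        text preceding the word, searched for a long form
    """
    merged = {}
    seen_before = earlier_text.lower()
    for word in set(ner_hits) | set(dict_hits):
        if word in ner_hits:
            # Rules ii and iii: whenever the enhanced NER recognized the word,
            # its result is the one that is kept.
            merged[word] = ner_hits[word]
        elif len(word) >= 4:
            # Assumption: longer dictionary-only words are kept as they are.
            merged[word] = dict_hits[word]
        else:
            # Rule i: a short dictionary-only word is kept only if a longer
            # synonym appeared earlier, i.e. it is plausibly an acronym.
            long_forms = synonyms.get(dict_hits[word], [])
            if any(len(f) >= 4 and f.lower() in seen_before for f in long_forms):
                merged[word] = dict_hits[word]
    return merged

merged = merge_hits(
    ner_hits={"BRCA1": "G1", "TNF": "G2"},
    dict_hits={"BRCA1": "G1x", "AD": "D9", "ER": "G7", "asthma": "D3"},
    synonyms={"D9": ["Alzheimer disease", "AD"], "G7": ["estrogen receptor"]},
    earlier_text="Alzheimer disease (AD) is common.",
)
```

In this example "AD" survives because its long form occurs earlier in the text, while "ER" is dropped as a probable misrecognition.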
Meanwhile, because the gene/disease library is hierarchical, a recognized word is assigned not only the identifier of its corresponding library word but also the identifier of that word's parent. A gene-disease pair extracted from a given literature text then serves as candidate evidence for identifying a relationship between the two.
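How a recognized word might inherit identifiers from the hierarchy can be sketched as below. The description mentions only the parent word; walking further up the chain of ancestors is an assumption of this sketch, and the function name and identifiers are hypothetical.

```python
def with_ancestors(term_id, parent_of):
    """Return term_id followed by its parent, grandparent, and so on.

    parent_of: dict mapping a library identifier to its parent identifier
    """
    chain = [term_id]
    seen = {term_id}
    while chain[-1] in parent_of and parent_of[chain[-1]] not in seen:
        chain.append(parent_of[chain[-1]])
        seen.add(chain[-1])                # guards against accidental cycles
    return chain

# Hypothetical identifiers: D12 is a child of D5, which is a child of D1.
parent_of = {"D12": "D5", "D5": "D1"}
ids = with_ancestors("D12", parent_of)     # ['D12', 'D5', 'D1']
```

Taking only `ids[:2]` reproduces the narrower reading in which a word receives just its own identifier and that of its direct parent.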
In step S3, a binary Support Vector Machine (SVM) classifier is used to determine the correlation of the gene-disease pairs obtained in the second step. For a given sentence S = w_1, ..., g, ..., w_i, ..., d, ..., w_n, where w represents a word, g a gene word and d a disease word, the SVM classifier determines whether g and d are correlated.
The classifier uses two types of features, local lexical features and global syntactic features. The local lexical features comprise the words surrounding the gene and disease terms in the original text, and the global syntactic features comprise unigram, bigram and trigram models extracted from the following two paths:
i) the shortest path between the gene term and the disease term in the dependency tree;
ii) the path between the lowest common ancestor (LCA) of the two words and the root of the dependency tree.
The following table details the three features in these two categories:
[Table: image BDA0003184574480000111, not reproduced in this text]
It should be noted that the lemmas and dependency trees are produced with the CoreNLP toolkit developed by Stanford University (Manning et al., 2014). The feature extraction process takes into account words attached by dependency relations such as "neg" (negation) and "advmod" (adverbial modifier), which modify verbs, and includes such modifiers in the path as part of the global syntactic features.
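The two dependency paths and their n-grams can be illustrated with a minimal sketch. It assumes the parse is already available as a head array (one parent index per token, -1 for the root) rather than calling CoreNLP; the function names and the toy sentence are hypothetical, and the handling of "neg" and "advmod" modifiers is omitted.

```python
def path_to_root(heads, i):
    """Token indices from token i up to the root of the dependency tree."""
    path = [i]
    while heads[path[-1]] != -1:
        path.append(heads[path[-1]])
    return path

def dependency_paths(heads, gene, disease):
    """Return (shortest gene-disease path, path from their LCA to the root)."""
    up_g = path_to_root(heads, gene)
    up_d = path_to_root(heads, disease)
    on_d = set(up_d)
    lca = next(n for n in up_g if n in on_d)   # lowest common ancestor
    # Climb from the gene to the LCA, then descend to the disease.
    shortest = up_g[:up_g.index(lca) + 1] + up_d[:up_d.index(lca)][::-1]
    lca_to_root = up_g[up_g.index(lca):]
    return shortest, lca_to_root

def ngrams(tokens, n_max=3):
    """All unigrams, bigrams and trigrams of a token sequence."""
    return [" ".join(tokens[i:i + n])
            for n in range(1, n_max + 1)
            for i in range(len(tokens) - n + 1)]

# Toy parse of "BRCA1 mutation cause breast cancer" (lemmas, hypothetical heads).
lemmas = ["BRCA1", "mutation", "cause", "breast", "cancer"]
heads = [1, 2, -1, 4, 2]
shortest, lca_to_root = dependency_paths(heads, gene=0, disease=4)
feature2 = ngrams([lemmas[i] for i in shortest])      # shortest-path n-grams
feature3 = ngrams([lemmas[i] for i in lca_to_root])   # LCA-to-root n-grams
```

Here the shortest path runs BRCA1, mutation, cause, cancer, and because the verb "cause" is itself the root, the LCA-to-root path contains that single lemma.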
After the features are extracted, the binary SVM classifier is trained with LIBSVM (Chang and Lin, National Taiwan University, 2011).
The kernel function of the SVM scores features 1, 2 and 3 separately, and the resulting scores are added linearly.
It should be understood that each feature is treated as a bag-of-words model: every possible word or n-gram is one dimension of a vector, the value of a dimension is set to 1 if the feature contains that word or n-gram and 0 otherwise, and the similarity between two instances of a feature is calculated as the cosine between their vectors.
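The kernel described above can be sketched directly on sets, since a binary bag-of-words vector is fully determined by the set of words or n-grams it contains. The function names are hypothetical, and equal weighting of the three per-feature scores is an assumption (the text says only that they are added linearly).

```python
import math

def cosine(a: set, b: set) -> float:
    """Cosine similarity of two binary bag-of-words vectors given as sets.

    For 0/1 vectors the dot product is the size of the intersection and
    each norm is the square root of the set size.
    """
    if not a or not b:
        return 0.0
    return len(a & b) / math.sqrt(len(a) * len(b))

def kernel(x, y):
    """Sum of per-feature cosine similarities between two samples.

    x, y: tuples of three sets (local lexical lemmas, shortest-path n-grams,
          LCA-to-root n-grams)
    """
    return sum(cosine(fx, fy) for fx, fy in zip(x, y))

sample_1 = ({"mutation", "cause"}, {"BRCA1 mutation", "cause"}, {"cause"})
sample_2 = ({"mutation", "in"}, {"cause"}, {"cause"})
similarity = kernel(sample_1, sample_2)
```

A matrix of such values over all training pairs could in principle be supplied to an SVM implementation that accepts a precomputed kernel; whether the patent's implementation does this is not stated.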
After the SVM classifier has identified a series of positively correlated gene-disease groups, together with the unique identifiers of the journal articles in which each group co-occurs, the next step is ranking.
In step S4, the positively correlated gene-disease groups identified in step S3 are ranked. The correlation identification of step S3 yields many positively correlated gene-disease groups: many genes are positively correlated with a given disease, and likewise many diseases with a given gene. The method ranks these positively correlated genes/diseases on the basis of gene-disease coexistence frequency while also taking the influence of articles and authors into account. For the different gene-disease groups that contain the same disease, the number of distinct journal articles in which each group appears together is counted; this is the basic gene-disease coexistence frequency. Used alone for ranking, coexistence frequency has proven relatively ineffective, and the best results are obtained when the influence of both articles and authors is considered.
Specifically, the PageRank of each journal article is first calculated in the article citation network constructed from PubMed, and the coexistence of gene-disease groups in different journal articles is weighted according to that PageRank.
It should be noted that the PageRank method proposed by Page, Brin et al. in 1999 rests on the idea that the more important a website is, the more websites link to it; it therefore uses the number and importance of the websites linking to a given website to evaluate that website's importance (its PageRank). Here the principle is applied to journal articles: a citation between articles is equivalent to a link between web pages, so the more an article is cited, and the more influential the citing articles, the higher its PageRank.
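A minimal power-iteration sketch of PageRank on a citation network follows. The damping factor 0.85, the fixed iteration count and the uniform treatment of articles that cite nothing are conventional choices assumed here, not parameters given in the description; the article identifiers are hypothetical.

```python
def pagerank(citations, damping=0.85, iters=50):
    """PageRank over a citation graph.

    citations: dict mapping an article to the list of articles it cites
    Returns a dict mapping each article to its rank; ranks sum to 1.
    """
    nodes = set(citations) | {c for cited in citations.values() for c in cited}
    n = len(nodes)
    rank = {a: 1.0 / n for a in nodes}
    for _ in range(iters):
        new = {a: (1.0 - damping) / n for a in nodes}
        for a in nodes:
            cited = citations.get(a, [])
            if cited:
                # An article passes its rank on to the articles it cites.
                share = damping * rank[a] / len(cited)
                for c in cited:
                    new[c] += share
            else:
                # An article with no references spreads its rank uniformly.
                for b in nodes:
                    new[b] += damping * rank[a] / n
        rank = new
    return rank

# Articles A and B both cite C, so C ends up with the highest rank.
ranks = pagerank({"A": ["C"], "B": ["C"], "C": []})
```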
Then, once the PageRank of the journal articles has been obtained, the score of each gene-disease group is calculated by the following formula (1):
[Formula (1): image BDA0003184574480000121, not reproduced in this text]
where g represents a gene, d represents a disease, C(g,d) represents the set of all articles containing the gene-disease group, and PR(a) represents the PageRank of article a.
In addition to the importance of the article, the influence of its authors must be considered. A biomedical researcher who focuses on a particular disease or gene may publish many articles on the same gene-disease group; therefore, when a gene-disease group is mentioned repeatedly by the same author in different articles, the contribution of such repeated evidence should be suppressed.
Assuming that one author's contributions to the different articles on the same gene-disease group carry equal weights that sum to 1, the weight of each article on the gene-disease group is given by formula (2) below:
[Formula (2): image BDA0003184574480000122, not reproduced in this text]
where l represents the list of all authors of article a, and |C_x| represents the number of articles published by author x about the gene-disease group (g, d).
Therefore, when the influence of both the article and the author is considered, formula (1) is modified to formula (3) as follows, which is also the basis on which this method ranks positively correlated genes:
[Formula (3): image BDA0003184574480000131, not reproduced in this text]
In end-to-end operation, when a user queries the positively correlated genes of a specific disease, the system outputs a list of correlated genes ordered by the score (i.e. the degree of correlation) given by formula (3).
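Formulas (1) to (3) appear only as images in this text, so the sketch below rests on an assumed reading of the symbol definitions: the score of a group is taken as the sum of PR(a) over the articles in C(g,d), the weight of article a is taken as the average of 1/|C_x| over its authors x in l, and the combined score multiplies each PR(a) by that weight. These forms, the function name and the example data are assumptions, not the patent's confirmed formulas.

```python
from collections import defaultdict

def score_groups(evidence, pr, authors):
    """Score each gene-disease group from its supporting articles.

    evidence: dict mapping (gene, disease) to the set of supporting article ids
    pr:       dict mapping an article id to its PageRank
    authors:  dict mapping an article id to its non-empty list of authors
    """
    scores = {}
    for group, articles in evidence.items():
        # |C_x|: number of articles on this group written by each author x.
        count = defaultdict(int)
        for a in articles:
            for x in authors[a]:
                count[x] += 1
        total = 0.0
        for a in articles:
            # Assumed form of formula (2): mean of 1/|C_x| over the authors.
            weight = sum(1.0 / count[x] for x in authors[a]) / len(authors[a])
            # Assumed form of formula (3): PageRank scaled by that weight.
            total += pr[a] * weight
        scores[group] = total
    return scores

evidence = {("G1", "D"): {"a1", "a2"}, ("G2", "D"): {"a3", "a4"}}
pr = {"a1": 0.25, "a2": 0.25, "a3": 0.25, "a4": 0.25}
authors = {"a1": ["X"], "a2": ["X"], "a3": ["Y"], "a4": ["Z"]}
scores = score_groups(evidence, pr, authors)
ranked = sorted(scores, key=scores.get, reverse=True)
```

In this example both groups are supported by two equally ranked articles, but G1's evidence comes from a single author, so under the assumed weighting it scores half as much as G2 and is listed second.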
Typical cases are as follows:
the practical performance of the method is demonstrated here using an example of identifying gene and disease associations in the MEDLINE literature base created in the national library of medicine.
First, a corresponding data set is prepared, and the gene and disease term libraries described in the technical scheme are the two basic libraries used in this practice: the gene library contains 60,197 genes; the disease term library covers 22,831 diseases and annotated datasets for training recognition models and evaluation system protocols or tools as also described above, mostly from the NIH genetic association database GAD in combination with manual supplementation by domain experts, and a small part from the manually annotated MEDLINE randomly drawn text, which collectively comprises 2340 positive gene-disease correlation (determined to be correlated) tags and 1437 negative correlation (determined to be not correlated) tags.
With the data sets prepared, the gene/disease term recognition of the method is tested first. A validation set of 800 annotated sentences is used, containing 525 gene annotations and 592 disease annotations. The recognition performance, compared with other common or feasible schemes of the same period, is shown in the following table:
[Table: image BDA0003184574480000132, not reproduced in this text]
Among them, ABNER (Settles et al., 2004) and the NER tool developed at Stanford University are commonly used existing schemes, and the scheme that combines an NER tool with a dictionary (without search-engine enhancement) resembles the contemporary schemes BeFree (Bravo et al., 2015) and CoPub (Frijters et al., 2008). The results show that the recognition scheme of this method outperforms the other schemes in overall gene/disease term recognition, with the highest F-score (0.877).
With the recognition capability of the method evaluated, the classification model was then tested for its effectiveness in identifying candidate gene-disease associations.
The method uses a support vector machine model that combines local and global features. First, the model using only one type of feature is compared with the model using both types. Ten-fold cross-validation is carried out on a training set containing 2,080 positive samples and 1,277 negative samples, and the results are shown in the following table:
[Table: image BDA0003184574480000141, not reproduced in this text]
When both types of features are used for classification, the classifier achieves the best accuracy, recall, and F-score compared with the single-feature classifiers.
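The text does not say how the ten folds were drawn. As a minimal sketch, assuming a stratified split that keeps the positive/negative ratio roughly equal across folds, the fold assignment could look like this (the function name `stratified_folds` and the fixed seed are illustrative choices, not taken from the patent):

```python
import random

def stratified_folds(labels, k=10, seed=0):
    """Split sample indices into k folds, balancing each label across folds."""
    rng = random.Random(seed)
    by_label = {}
    for idx, label in enumerate(labels):
        by_label.setdefault(label, []).append(idx)

    folds = [[] for _ in range(k)]
    for label in sorted(by_label):
        idxs = by_label[label]
        rng.shuffle(idxs)
        # Deal the shuffled indices round-robin so fold sizes differ by at most 1.
        for pos, idx in enumerate(idxs):
            folds[pos % k].append(idx)
    return folds

def train_test_splits(folds):
    """Yield (train_indices, test_indices), holding out one fold at a time."""
    for i, test in enumerate(folds):
        train = [idx for j, fold in enumerate(folds) if j != i for idx in fold]
        yield train, test
```

For the 2,080 positive and 1,277 negative samples mentioned above, each fold would hold 208 positives and 127 or 128 negatives.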
In addition, as the volume of literature keeps growing, computation (recognition) speed is an important factor. Compared with the BeFree scheme, this method has a clear speed advantage while achieving an equivalent recognition performance (equivalent F-score), as shown in the following table:
[Table: image BDA0003184574480000142, not reproduced in this text]
Finally, the ranking algorithm is evaluated.
The ranking algorithm in this method is a statistical co-occurrence frequency method weighted by the influence of articles and authors. Data on five randomly chosen diseases and their related genes, obtained from the DisGeNET website, serve as the ground truth for verification. The overall performance (expressed as F-score) of the ranking of the top K output genes (K = 50, 100, 150, and 200) is evaluated for this method, BeFree, and CoPub, and the results are shown in the following table:
K      This method   BeFree   CoPub
50     0.241         0.213    0.212
100    0.238         0.214    0.211
150    0.216         0.186    0.192
200    0.197         0.167    0.175
In addition, for a specific disease, the method can identify more related genes than the other methods. For example, in the above evaluation, for retinitis pigmentosa the method identified 818 related genes, whereas BeFree identified only 193 (142 of which are also in this method's results) and CoPub identified 179 (124 of which are also in this method's results). A manual check of 40 of the 818 genes found that 34 are already established as positively correlated genes.
Although the F-score decreases as the number of output genes increases, the ranking scheme of this method consistently performs best (highest F-score) compared with the other two schemes.
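The text does not spell out how the F-score of a top-K list is computed. A minimal sketch, assuming precision and recall are measured against the set of ground-truth genes for one disease and combined as the usual harmonic mean, is given below; the function name `top_k_scores` and the toy gene labels in the usage note are hypothetical.

```python
def top_k_scores(ranked_genes, truth, k):
    """Return (precision, recall, F-score) of the first k ranked genes.

    ranked_genes: list of gene identifiers, best first.
    truth: set of gene identifiers known to be associated with the disease.
    """
    top = ranked_genes[:k]
    hits = sum(1 for gene in top if gene in truth)
    precision = hits / len(top) if top else 0.0
    recall = hits / len(truth) if truth else 0.0
    if precision + recall == 0:
        return precision, recall, 0.0
    f_score = 2 * precision * recall / (precision + recall)
    return precision, recall, f_score
```

For a ranked list `["A", "B", "C", "D"]` and ground truth `{"A", "C", "E"}`, the call with k = 2 gives precision 0.5, recall 1/3, and F-score 0.4.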
Although embodiments of the present invention have been shown and described, it will be appreciated by those skilled in the art that changes, modifications, substitutions and alterations can be made in these embodiments without departing from the principles and spirit of the invention, the scope of which is defined in the appended claims and their equivalents.

Claims (8)

1. A method for mining information related to genes and diseases from biomedical literature, characterized by comprising the following steps:
S1, preparing and preprocessing data;
S2, identifying gene- and disease-related terms in a given literature database;
S3, identifying possible correlations among the candidate gene-disease pairs;
S4, ranking the identified positively correlated gene-disease pairs.
2. The method of claim 1, wherein the preparation and preprocessing of the data comprises the following steps:
1) preparing two reference libraries for identifying the two classes of terms, namely a gene library and a disease term library;
2) preparing a set of annotated text data to train the model used for the subsequent identification of genes/diseases, so as to achieve better accuracy;
3) preprocessing the literature database to be searched, mainly to obtain three kinds of information:
i) titles and abstracts of articles;
ii) author information;
iii) citation information of the articles.
3. The method of claim 2, wherein the gene library is built from a combined library of three publicly available gene/protein databases, namely:
i) the Gene database of the National Center for Biotechnology Information (NCBI);
ii) the gene database of the HUGO Gene Nomenclature Committee (HGNC);
iii) the UniProt Knowledgebase;
and the disease term library is based on the union of multiple databases comprising:
i) the Disease Ontology developed at the University of Maryland School of Medicine;
ii) the MedDRA medical dictionary, created under the International Conference on Harmonisation of Technical Requirements for Registration of Pharmaceuticals for Human Use (ICH);
iii) the Unified Medical Language System (UMLS) developed by the U.S. National Library of Medicine (NLM);
iv) the Infectious Disease Database (IDDB).
4. The method of claim 1, wherein the identification of gene/disease terms comprises the following steps:
1) first, training a CRF-based NER model on prepared sentences annotated with genes, diseases, and their correlations;
2) using the trained model to identify gene- and disease-related terms in the titles and abstracts of articles extracted from the literature database;
3) mapping each recognized term to its unique library identifier in the gene or disease library;
4) sliding a window of fixed character length one character at a time from the beginning to the end of the sentence under test, and fuzzy-matching the characters in the current window against the gene/disease library; when the window reaches the end of the sentence, reducing the window length by one character and sliding and matching again from the beginning of the sentence; and repeating these steps until the window length is zero;
5) combining the results of step 3) and step 4) as the final recognition result, and, when the two results disagree, resolving them as follows:
i) if a recognized term has fewer than four characters, it is likely either an acronym of a term or a misrecognition; if a longer synonym of it appears in the preceding text, it is treated as an acronym of that term;
ii) if a term with fewer than four characters has no longer synonym present but is recognized by the enhanced NER, it is still added to the results file;
iii) if a term is recognized by both schemes, the result of the enhanced NER is added to the results file.
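The reconciliation rules i) to iii) can be sketched as below. This is an illustrative reading only: the claim does not say what happens to a term of four or more characters found by just one scheme, and this sketch assumes such terms are kept. The function name `merge_hits`, the argument layout, and the identifiers in the usage note are hypothetical.

```python
def merge_hits(ner_hits, window_hits, preceding_text, synonyms):
    """Combine enhanced-NER hits with sliding-window hits.

    ner_hits, window_hits: dict mapping recognized term -> library identifier.
    preceding_text: text that appears before the sentence being processed.
    synonyms: dict mapping library identifier -> set of names for that entry.
    """
    merged = {}
    earlier = preceding_text.lower()
    for term in set(ner_hits) | set(window_hits):
        in_ner = term in ner_hits
        in_window = term in window_hits
        if in_ner and in_window:
            # Rule iii: both schemes agree on the term; keep the NER result.
            merged[term] = ner_hits[term]
        elif len(term) < 4:
            lib_id = ner_hits[term] if in_ner else window_hits[term]
            long_forms = [s for s in synonyms.get(lib_id, ()) if len(s) >= 4]
            seen_long_form = any(s.lower() in earlier for s in long_forms)
            # Rule i: a longer synonym appeared earlier, so treat as acronym.
            # Rule ii: no longer synonym, but the enhanced NER recognized it.
            if seen_long_form or in_ner:
                merged[term] = lib_id
            # Otherwise the short term is treated as a misrecognition.
        else:
            # Assumption (not stated in the claim): longer single-source hits are kept.
            merged[term] = ner_hits[term] if in_ner else window_hits[term]
    return merged
```

For example, a window-only hit "AD" is kept when "Alzheimer disease" occurs in the preceding text, while a window-only hit "TNF" with no earlier long form is dropped.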
5. The method of claim 4, wherein the identification of gene/disease terms further enhances a standard NER tool as follows:
A. if an identified gene/disease term is not contained in the existing gene/disease library, relevant information is searched automatically on the web, using as many searches as necessary, to find the unique corresponding term in the library;
B. if the corresponding term can eventually be found in the library, the gene/disease is added to the results file;
C. if the identified term is not already an alias or synonym of the corresponding term in the library, it is also added to the existing gene/disease library, always sharing the same library identifier as its corresponding term.
6. The method of claim 5, wherein fuzzy matching means that exactly matching terms are extracted and terms that differ only slightly, in punctuation or in singular/plural form, are also retained; and when a gene/disease term is recognized in the sliding window, it is marked to ensure that it does not overlap with other recognized terms.
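The shrinking sliding window with fuzzy matching can be sketched as follows. This is a simplified illustration under stated assumptions: fuzzy matching is approximated by lowercasing, dropping punctuation, and stripping one trailing "s"; the starting window length is derived from the longest library name; and a word-boundary check, which the claims do not mention, is added to avoid matches inside longer words. The names `normalize` and `sliding_window_match` and the identifiers in the usage note are hypothetical.

```python
import re

def normalize(text):
    """Crude fuzzy key: lowercase, no punctuation, naive singular form."""
    text = re.sub(r"[^\w\s]", "", text.lower())
    text = re.sub(r"\s+", " ", text).strip()
    if len(text) > 3 and text.endswith("s"):
        text = text[:-1]
    return text

def sliding_window_match(sentence, library):
    """Return sorted (start, end, matched_text, library_id) tuples.

    library: dict mapping term name -> library identifier.
    """
    if not library or not sentence:
        return []
    index = {normalize(name): lib_id for name, lib_id in library.items()}
    taken = [False] * len(sentence)  # marks characters already recognized
    found = []
    # Start slightly longer than the longest name to allow plural/punctuation.
    length = min(len(sentence), max(len(name) for name in library) + 2)
    while length > 0:
        for start in range(len(sentence) - length + 1):
            end = start + length
            window = sentence[start:end]
            if window != window.strip() or any(taken[start:end]):
                continue  # skip padded windows and overlaps with earlier hits
            # Assumed word-boundary check (not part of the claims).
            if start > 0 and sentence[start - 1].isalnum():
                continue
            if end < len(sentence) and sentence[end].isalnum():
                continue
            lib_id = index.get(normalize(window))
            if lib_id is not None:
                found.append((start, end, window, lib_id))
                taken[start:end] = [True] * length
        length -= 1  # shrink the window and scan the sentence again
    return sorted(found)
```

Because longer windows are tried first, "breast cancers" is matched as a whole before any shorter window can claim part of it, and the marking step prevents a later overlapping match.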
7. The method of claim 1, wherein the correlation of each identified gene-disease combination is determined with a binary support vector machine (SVM) classifier applied to the gene-disease pairs obtained in step S2, comprising the following steps:
1) classification is performed on two types of extracted features, namely local lexical features and global syntactic features, wherein the local lexical features comprise the words surrounding the gene and disease terms in the original text, and the global syntactic features comprise unigrams, bigrams, and trigrams extracted from the following two paths:
i) the shortest path between the gene term and the disease term in the dependency tree;
ii) the path between the lowest common ancestor (LCA) of the two terms and the root of the dependency tree;
wherein the two types comprise the following three features:
1. local lexical features: the lemmas of the two words before and after the gene term and of the two words before and after the disease term;
2. global syntactic features: unigrams, bigrams, and trigrams of the lemmas on the shortest path between the gene and disease terms in the dependency tree;
3. global syntactic features: unigrams, bigrams, and trigrams of the lemmas on the path between the lowest common ancestor of the gene and disease terms and the root of the dependency tree;
2) after the features are extracted, a binary SVM classifier is trained using libsvm;
wherein the kernel function of the SVM scores features 1, 2, and 3 separately and adds the scores linearly; each feature is treated as a bag-of-words model in which every possible word or n-gram is a dimension of the vector, whose value is 1 if the feature contains that word or n-gram and 0 otherwise; and the similarity between two instances of a feature is computed as the cosine between their vectors.
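The kernel described above can be sketched with Python sets, since a 0/1 bag-of-words vector is fully described by the set of words or n-grams it contains. This is a minimal sketch, not the patented implementation: the feature-group keys `local`, `path`, and `lca_root` are hypothetical names for features 1, 2, and 3, and the lemma lists are assumed to come from a dependency parser that is not shown.

```python
import math

def ngrams(lemmas, max_n=3):
    """Set of unigrams, bigrams, and trigrams over a list of lemmas."""
    return {
        " ".join(lemmas[i:i + n])
        for n in range(1, max_n + 1)
        for i in range(len(lemmas) - n + 1)
    }

def cosine(a, b):
    """Cosine between two binary vectors represented as sets."""
    if not a or not b:
        return 0.0
    # Dot product is the overlap size; each norm is the square root of the set size.
    return len(a & b) / math.sqrt(len(a) * len(b))

def kernel(x, y):
    """Sum of per-feature cosine similarities for two candidate pairs."""
    return sum(cosine(x[name], y[name]) for name in ("local", "path", "lca_root"))
```

A Gram matrix built with such a function could be handed to an SVM library that accepts precomputed kernels; how the patent wires the kernel into libsvm is not stated in this text.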
8. The method of claim 1, wherein ranking the identified positively correlated gene-disease pairs comprises the following steps:
1) calculating the PageRank of journal articles in an article citation network constructed from PubMed, and weighting the co-occurrence of a gene-disease pair in different journal articles according to that PageRank;
2) after the PageRank of the journal articles is obtained, calculating the score of each gene-disease pair using formula (1):
[Formula (1): image FDA0003184574470000041, not reproduced in this text]
wherein g denotes a gene, d denotes a disease, C_(g,d) denotes the set of all articles containing the gene-disease pair, and PR(a) denotes the PageRank of article a;
3) considering the influence of the article authors: when a gene-disease pair is mentioned repeatedly by the same author in different articles, the contribution of the repeated evidence needs to be suppressed; assuming that the weights one author contributes to the different articles about the same gene-disease pair are equal and sum to 1, the weight of each article about that pair is given by formula (2):
[Formula (2): image FDA0003184574470000051, not reproduced in this text]
wherein l denotes the list of all authors of article a, and |C_x| denotes the number of articles published by author x about the gene-disease pair (g, d);
4) when the influence of both the article and its authors is considered, formula (1) is modified into formula (3):
[Formula (3): image FDA0003184574470000052, not reproduced in this text]
in end-to-end operation, when a user queries the positively correlated genes of a specific disease, the system outputs a list of related genes according to the score obtained from formula (3).
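Step 1) relies on PageRank over an article citation network. A minimal power-iteration sketch is shown below, assuming the network is given as a mapping from each article to the articles it cites; the damping factor 0.85 and the fixed iteration count are conventional defaults chosen here, not values taken from the patent.

```python
def pagerank(citations, damping=0.85, iterations=50):
    """Power-iteration PageRank on a citation graph.

    citations: dict mapping article id -> list of article ids it cites.
    Returns a dict mapping article id -> rank; ranks sum to 1.
    """
    nodes = set(citations)
    for cited in citations.values():
        nodes.update(cited)
    nodes = sorted(nodes)
    if not nodes:
        return {}
    n = len(nodes)
    rank = {a: 1.0 / n for a in nodes}
    for _ in range(iterations):
        # Articles that cite nothing spread their rank evenly over all nodes.
        dangling = sum(rank[a] for a in nodes if not citations.get(a))
        new_rank = {a: (1.0 - damping) / n + damping * dangling / n for a in nodes}
        for a in nodes:
            cited = citations.get(a, [])
            for b in cited:
                new_rank[b] += damping * rank[a] / len(cited)
        rank = new_rank
    return rank
```

In a three-article toy graph where two articles both cite a third, the cited article receives the highest rank and the two citing articles receive equal ranks.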
CN202110857485.6A 2021-07-28 2021-07-28 Method for mining information related to genes and diseases from biomedical literature Pending CN113626567A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202110857485.6A CN113626567A (en) 2021-07-28 2021-07-28 Method for mining information related to genes and diseases from biomedical literature

Publications (1)

Publication Number Publication Date
CN113626567A true CN113626567A (en) 2021-11-09

Family

ID=78381325

Country Status (1)

Country Link
CN (1) CN113626567A (en)

Cited By (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN113921089A (en) * 2021-11-22 2022-01-11 北京安智因生物技术有限公司 Method and system for confirming updating frequency of IVD gene annotation database

Citations (12)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20050086078A1 (en) * 2003-10-17 2005-04-21 Cogentmedicine, Inc. Medical literature database search tool
US20130339005A1 (en) * 2012-03-30 2013-12-19 The Florida State University Research Foundation, Inc. Automated Extraction of Bio-Entity Relationships from Literature
CN105740243A (en) * 2014-12-08 2016-07-06 深圳华大基因研究院 Method and device for constructing biological information database
CN108780445A (en) * 2016-03-16 2018-11-09 马鲁巴公司 Parallel hierarchical model for the machine understanding to small data
CN110010196A (en) * 2019-03-19 2019-07-12 北京工业大学 A kind of gene similarity searching algorithm based on heterogeneous network
CN110428897A (en) * 2019-06-19 2019-11-08 西安电子科技大学 Medical diagnosis on disease information processing method based on SNP pathogenic factor Yu disease association relationship
CN110807327A (en) * 2019-10-16 2020-02-18 大连理工大学 Biomedical entity identification method based on contextualized capsule network
CN111428036A (en) * 2020-03-23 2020-07-17 浙江大学 Entity relationship mining method based on biomedical literature
CN112036151A (en) * 2020-09-09 2020-12-04 平安科技(深圳)有限公司 Method and device for constructing gene disease relation knowledge base and computer equipment
CN112397141A (en) * 2019-08-16 2021-02-23 财团法人工业技术研究院 Method and apparatus for constructing a digital disease module
CN112990039A (en) * 2021-03-25 2021-06-18 上海基绪康生物科技有限公司 Method for extracting structured text information from medical image based on ODL (optical distribution level)
CN113160879A (en) * 2021-04-25 2021-07-23 上海基绪康生物科技有限公司 Method for predicting drug relocation through side effect based on network learning

Non-Patent Citations (2)

* Cited by examiner, † Cited by third party
Title
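Xu Dong: "Text-based mining of pathogenic genes" (基于文本的致病基因挖掘), China Master's Theses Full-text Database, Basic Sciences (《中国优秀硕士学位论文全文数据库基础科学辑》), no. 01, pages 006 - 674 *
Wang Xue et al.: "Knowledge mining of gene-disease associations in Alzheimer's disease" (阿尔茨海默病基因-疾病关联的知识挖掘), Library and Information Service (《图书情报工作》), vol. 64, no. 13, pages 120 - 132 *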

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination