CN109558493A

CN109558493A - Disease similarity calculation method based on disease ontology

Info

Publication number: CN109558493A
Application number: CN201811255993.1A
Authority: CN
Inventors: 周水庚; 袁梓峰; 孙志丹; 关佶红
Original assignee: Fudan University
Current assignee: Fudan University
Priority date: 2018-10-26
Filing date: 2018-10-26
Publication date: 2019-04-02
Anticipated expiration: 2038-10-26
Also published as: CN109558493B

Abstract

The invention belongs to the field of bioinformatics and specifically concerns a disease similarity calculation method based on disease ontology. The method has two parts: the first computes disease functional similarity based on the gene ontology, and the second computes disease semantic similarity based on the disease ontology. The algorithm is evaluated with two methods, ROC curves and the PTC sharing rate, and on both it outperforms existing disease-pair similarity algorithms. The pathogenesis of a disease is often closely tied to complex metabolic and life processes in the body, which poses a great challenge for understanding disease mechanisms and for research on diagnosis and treatment; the method of the invention supports research on disease mechanisms, diagnosis and treatment, and disease prevention.

Description

A disease similarity calculation method based on disease ontology

Technical field

The invention belongs to the field of bioinformatics and is used to assess the similarity of disease pairs; analysis of the similarity results supports research on disease mechanisms, diagnosis and treatment, and disease prevention. It specifically relates to a disease similarity calculation method based on disease ontology.

Background art

In recent years, more and more researchers in biomedicine have taken up similarity research, including protein functional similarity, gene similarity, drug similarity, and disease similarity. Disease similarity research aims to reveal possible pathological principles and to provide appropriate medical advice by analyzing the intrinsic associations between two diseases. From limited disease-related information such as gene information, a known disease and an unknown disease can be compared; when their similarity is high, the two diseases share identical or similar attributes in at least one respect, and so may also have similar disease mechanisms and treatments, which offers useful guidance for studying the cause of disease.

Fig. 1 lists the main existing disease-pair similarity calculation methods based on disease ontology. They can be divided into methods based on disease semantic similarity, methods based on disease functional similarity, and methods that combine the two. Methods based on disease semantic similarity can be further divided into methods based on the information content of disease ontology nodes and methods based on the structural information of the disease ontology.

In 1995 Resnik proposed an algorithm based on the most informative common ancestor (MICA): following the "is_a" relations in the disease ontology, it expresses the similarity of a disease pair through their most informative common ancestor node. In 2004 Couto et al. proposed the concept of disjunctive ancestor nodes and used it to compute the similarity of two disease nodes. The methods of Resnik and Couto both compute disease similarity from the ancestor nodes of the two disease nodes in the disease ontology. Lin (1998) and Schlicker (2006), in contrast, argued that computing similarity from only one or a few ancestor nodes, while ignoring the information content of the disease nodes themselves, is incomplete. Building on Resnik's method, Lin proposed a disease similarity calculation that combines the ancestor node with the nodes themselves; Schlicker then improved on Lin's work by introducing a correction factor.
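As an illustration of the two information-content measures described above, the following is a minimal sketch under stated assumptions rather than code from the cited works. The toy ontology `PARENTS`, the probabilities `PROB`, and all function names are hypothetical. It uses the standard definitions: information content IC(t) = -log p(t), Resnik similarity as the IC of the MICA, and Lin similarity as twice that value divided by the sum of the two nodes' own IC.

```python
import math

# Hypothetical toy ontology: term -> list of parents ("is_a" edges).
PARENTS = {
    "disease": [],
    "cancer": ["disease"],
    "lung cancer": ["cancer"],
    "breast cancer": ["cancer"],
}
# Hypothetical annotation probabilities p(t) for each term.
PROB = {"disease": 1.0, "cancer": 0.5, "lung cancer": 0.125, "breast cancer": 0.25}


def ancestors(term):
    """Return the term itself together with all of its ancestors."""
    seen, stack = set(), [term]
    while stack:
        current = stack.pop()
        if current not in seen:
            seen.add(current)
            stack.extend(PARENTS[current])
    return seen


def ic(term):
    """Information content: IC(t) = -log p(t)."""
    return -math.log(PROB[term])


def mica(a, b):
    """Most informative common ancestor of terms a and b."""
    return max(ancestors(a) & ancestors(b), key=ic)


def resnik(a, b):
    """Resnik similarity: the information content of the MICA."""
    return ic(mica(a, b))


def lin(a, b):
    """Lin similarity: MICA information content normalized by the nodes' own IC."""
    denom = ic(a) + ic(b)
    return 2 * resnik(a, b) / denom if denom > 0 else 0.0
```

In this toy example the MICA of "lung cancer" and "breast cancer" is "cancer". The Lin score stays within [0, 1] because the MICA is never more informative than either node, whereas the Resnik score is unbounded above.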

Among the methods based on the structural information of the disease ontology, Rada et al. proposed an early node-distance-based disease similarity calculation in 1989, but the distance in Rada's method can be 0 or infinite. To address this, Lee et al. proposed an improved algorithm in 1993 that normalizes Rada's result so that the similarity lies in [0, 1]. Wang et al. (2007) argued that disease similarity should use not only the information content of disease nodes but also the path structure of the disease ontology. Wang therefore extended the information shared by two diseases from the common ancestor nodes to all ancestor nodes, which overcomes the incomplete information caused by local node path information and improves accuracy. In addition, Wu (2006) and Jiang (1997) held that the hierarchy of the disease ontology can itself describe the similarity between diseases well. Wu proposed an algorithm based on disease ontology path information that considers not only the paths between the two diseases and their ancestor nodes but also the paths between the diseases and the leaf nodes, and it obtained good results. Jiang's algorithm considers the local network complexity, node depth, edge type, and edge strength in the ontology and assigns weights to edges; the sum of the weights of the edges on the shortest path between two disease nodes is taken as their similarity result.

Among the methods for disease functional similarity, Mathur et al. published the BOG algorithm in 2010 and the PSB algorithm in 2012. The authors argued that genes control the heritable traits of an organism and that gene expression is closely related to the occurrence of disease, so diseases are associated not only through the disease ontology but also through the gene ontology, and the latter association is the more important. The BOG algorithm computes disease similarity mainly from the genes shared by two disease nodes, so it can reveal close associations between diseases whose symptoms are dissimilar. The PSB algorithm builds on BOG by taking into account the functional relationships between disease-related genes: it first annotates the gene ontology with all disease-associated genes, computes the information content of each gene ontology node on that basis, then computes the gene ontology node similarity matrix, and finally obtains disease similarity through the associations between diseases and the gene ontology. This algorithm is more accurate than algorithms based on disease semantic similarity.

Li et al. observed that when two gene ontology nodes are very similar but not identical, the corresponding genes may still share the same biological process, whereas in Mathur's algorithm two genes must share the same gene ontology node to obtain a high similarity score. In view of this, Li et al. proposed in 2013 an algorithm based on a gene interaction network, the SemFunSim algorithm. It combines disease semantic similarity and disease functional similarity. For semantic similarity, it uses Resnik's method. For functional similarity, it introduces the gene interaction network HumanNet, in which each pair of interacting genes has a similarity score (log likelihood score, LLS) that measures the functional relationship between the genes. The final similarity score is the functional similarity multiplied by the semantic similarity.

Summary of the invention

The invention belongs to the field of bioinformatics and specifically concerns a method for assessing the similarity of disease pairs. The method of the invention (referred to as the DGS algorithm) has two parts: the first computes disease functional similarity based on the gene ontology, and the second computes disease semantic similarity based on the disease ontology. The algorithm is evaluated with two methods, ROC curves and the PTC sharing rate, and compared with Resnik's method, Wang's method, the PSB algorithm, and the SemFunSim algorithm; its results are better than those of each of these existing disease-pair similarity algorithms. The pathogenesis of a disease is often closely tied to complex metabolic and life processes in the body, which poses a great challenge for understanding disease mechanisms and for research on diagnosis and treatment; the method of the invention supports research on disease mechanisms, diagnosis and treatment, and disease prevention.

The invention proposes a disease similarity calculation method based on disease ontology. The method comprises two parts, a disease functional similarity method for assessing a disease pair and a disease semantic similarity method for assessing a disease pair, specifically as follows:

(1) The disease functional similarity of a disease pair is assessed as follows:

(1.1) First, the gene ontology node similarity matrix is constructed with the shortest path algorithm (the Shortest Path, SP), with the information content of gene ontology nodes computed by Resnik's method. Each element of the matrix is a value between 0 and 1 that represents the similarity of a pair of nodes in the gene ontology; the matrix is denoted Sim_GO;

where path₁ is the shortest path from node t₁ to the most informative common ancestor node (MICA node), and path₂ is the shortest path from node t₂ to the MICA node;

(1.2) The disease ontology is linked to the Medical Subject Headings (MeSH) database. Let two disease terms in the MeSH database be m₁ and m₂, and let the sets of gene molecular function terms associated with them in the MeSH database be T₁ = {t₁₁, t₁₂, ..., t_1p} and T₂ = {t₂₁, t₂₂, ..., t_2q}, respectively. T₁ and T₂ are regarded as descriptions of the characteristics of the disease terms in the MeSH database, and formula (1) converts the gene ontology node similarity matrix into MeSH term node similarity:

where p and q are the numbers of molecular function terms associated with disease terms m₁ and m₂, respectively, max takes the maximum, and Sim_GO(t₁, t₂) is the similarity score of two gene ontology nodes t₁ and t₂;

(1.3) A disease node in the disease ontology usually corresponds to several disease terms in the MeSH database. Let the disease pair be d₁ and d₂, corresponding to two disease terms in the disease ontology; the MeSH disease terms associated with d₁ and d₂, obtained from the disease ontology, are M₁ = {m₁₁, m₁₂, ..., m_1p} and M₂ = {m₂₁, m₂₂, ..., m_2q}, respectively;

where p and q are the numbers of MeSH disease terms associated with diseases d₁ and d₂, respectively, max takes the maximum, and Sim_MeSH(m₁, m₂) is the similarity score of two MeSH disease terms m₁ and m₂;
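The formulas for steps (1.2) and (1.3) are not reproduced in this text. A form consistent with the symbols defined for them (set sizes p and q, a max over pairwise scores) is the best-match average, a common way to lift pairwise term scores to set-level scores. The sketch below assumes that form; it may differ from the patent's exact formulas, and every term name, mapping, and Sim_GO value in it is hypothetical.

```python
def best_match_average(set_a, set_b, sim):
    """Average, over both sets, of each element's best match in the other set."""
    if not set_a or not set_b:
        return 0.0
    best_a = sum(max(sim(a, b) for b in set_b) for a in set_a)
    best_b = sum(max(sim(a, b) for a in set_a) for b in set_b)
    return (best_a + best_b) / (len(set_a) + len(set_b))


# Hypothetical precomputed Sim_GO entries for pairs of gene ontology nodes.
SIM_GO = {
    ("t11", "t21"): 0.9,
    ("t11", "t22"): 0.2,
    ("t12", "t21"): 0.4,
    ("t12", "t22"): 0.6,
}
# Hypothetical mapping from MeSH disease terms to molecular function terms.
MESH_TO_GO = {"m1": ["t11", "t12"], "m2": ["t21", "t22"]}


def sim_go(t1, t2):
    """Symmetric lookup in the Sim_GO matrix; identical nodes score 1."""
    if t1 == t2:
        return 1.0
    if (t1, t2) in SIM_GO:
        return SIM_GO[(t1, t2)]
    return SIM_GO.get((t2, t1), 0.0)


def sim_mesh(m1, m2):
    """Step (1.2) under the assumed form: lift Sim_GO to MeSH terms."""
    return best_match_average(MESH_TO_GO[m1], MESH_TO_GO[m2], sim_go)


def fsim(mesh_terms_1, mesh_terms_2):
    """Step (1.3) under the assumed form: lift Sim_MeSH to a disease pair."""
    return best_match_average(mesh_terms_1, mesh_terms_2, sim_mesh)
```

With these toy values each term's best match contributes 0.9 or 0.6, so `sim_mesh("m1", "m2")` evaluates to 3.0 / 4 = 0.75. The same helper is reused at both levels because the text describes step (1.3) as following a similar method to step (1.2).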

(2) The disease semantic similarity of a disease pair is assessed as follows:

The disease functional similarity calculation considers the associations between diseases at the molecular level of the gene ontology, whereas the disease semantic similarity calculation exploits the structural organization of the disease ontology. To compute disease semantic similarity, the most informative common ancestor (MICA) of the two disease nodes must be found. Let the two diseases be d₁ and d₂ and this ancestor node be d_MICA, and let G₁, G₂, and G_MICA denote the gene sets associated with d₁, d₂, and d_MICA, respectively. The similarity is computed with the following formula:

where |G₁|, |G₂|, and |G_MICA| denote the numbers of gene ontology nodes associated with diseases d₁, d₂, and d_MICA, respectively;

(3) The similarity score of the disease pair is assessed:

The disease functional similarity and disease semantic similarity scores computed in steps (1) and (2) both lie between 0 and 1; multiplying them gives the similarity score of the disease pair:

Sim_DGS(d₁, d₂)=FSim_DGS(d₁, d₂)·SSim_DGS(d₁, d₂)

where FSim_DGS(d₁, d₂) is the functional similarity score of the disease pair d₁ and d₂, and SSim_DGS(d₁, d₂) is its semantic similarity score.

The beneficial effect of the invention is that it yields more accurate results than previous disease-pair similarity algorithms and provides a new reference tool for research on disease pathogenesis, diagnosis and treatment, and disease prevention.

Brief description of the drawings

Fig. 1 is an overview of the main existing disease-pair similarity algorithms.

Fig. 2 is the flow of the algorithm of the invention.

Detailed description of the embodiments

The invention is further illustrated below through an embodiment with reference to the drawings.

Embodiment 1: the method of the invention is implemented as follows:

Step 1: Compute the information content of gene ontology nodes using Resnik's method

The associations between genes and gene ontology nodes are obtained from the GOA database (http://geneontology.org/page/download-annotations). Associating a gene with a gene ontology node is called annotation. Once annotation is complete, each gene ontology node is annotated with some number of genes; this number is taken as the node's frequency, and the information content of each gene ontology node is obtained from the information content formula.
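Step 1 can be sketched as follows. This is a minimal illustration, not the patent's implementation: it assumes the annotations have already been propagated to ancestor nodes, and it assumes the frequency is normalized by the total number of annotated genes, so that IC(t) = -log(n_t / N). The function name and the sample identifiers are hypothetical.

```python
import math
from collections import defaultdict


def information_content(annotations):
    """Map each gene ontology node to its information content.

    annotations: iterable of (gene, go_node) pairs.
    Assumption: IC(t) = -log(n_t / N), where n_t is the number of distinct
    genes annotated to node t and N is the number of distinct annotated genes.
    """
    genes_by_node = defaultdict(set)
    all_genes = set()
    for gene, node in annotations:
        genes_by_node[node].add(gene)
        all_genes.add(gene)
    total = len(all_genes)
    return {
        node: -math.log(len(genes) / total)
        for node, genes in genes_by_node.items()
    }


# Hypothetical annotation pairs for illustration only.
sample = [
    ("g1", "GO:A"), ("g2", "GO:A"), ("g4", "GO:A"),
    ("g1", "GO:B"),
    ("g3", "GO:C"),
]
ic_values = information_content(sample)
```

Nodes annotated with few genes receive a high information content, while a node annotated with every gene would receive zero.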

Step 2: Construct the gene ontology node similarity matrix using the SP algorithm

The node information content obtained in the previous step is used to construct the gene ontology node similarity matrix with the shortest path algorithm (the Shortest Path, SP). Each element of this matrix is a value between 0 and 1 that represents the similarity of two nodes in the gene ontology; the matrix is denoted Sim_GO and is computed as follows:

where path₁ is the shortest path from node t₁ to the most informative common ancestor node (MICA node), and path₂ is the shortest path from node t₂ to the MICA node.

Step 3: Compute the functional similarity of the disease pair

The disease ontology is linked to a variety of medical databases, so there are many possible choices of data. Among them, the Medical Subject Headings (MeSH) database has been in use for more than 100 years, describes diseases well, and is highly authoritative, and MeSH terms are widely used across the major databases; we therefore introduce MeSH terms into the calculation of disease similarity. Let two disease terms in MeSH be m₁ and m₂, and let the sets of gene molecular function terms associated with them in the MeSH database be T₁ = {t₁₁, t₁₂, ..., t_1p} and T₂ = {t₂₁, t₂₂, ..., t_2q}, respectively. T₁ and T₂ can be regarded as descriptions of the characteristics of the disease terms in MeSH, and the following formula converts gene ontology node similarity into MeSH term node similarity:

The disease ontology is an ontology structure compiled from multiple medical databases, so a disease node in the disease ontology usually corresponds to several MeSH terms; the disease functional similarity of a disease pair is then computed in a similar way. Let the disease pair be d₁ and d₂, corresponding to two disease terms in the disease ontology; the MeSH terms associated with d₁ and d₂, obtained from the disease ontology, are M₁ = {m₁₁, m₁₂, ..., m_1p} and M₂ = {m₂₁, m₂₂, ..., m_2q}, respectively.

Step 4: Compute the semantic similarity of the disease pair

The disease functional similarity calculation considers the associations between diseases at the molecular level of the gene ontology, whereas the disease semantic similarity calculation exploits the structural organization of the disease ontology. To compute disease semantic similarity, the most informative common ancestor (MICA) of the two disease nodes must be found. Let the two diseases be d₁ and d₂ and this ancestor node be d_MICA, and let G₁, G₂, and G_MICA denote the gene sets associated with d₁, d₂, and d_MICA, respectively. The similarity is computed with the following formula:

Step 5: Compute the final similarity score of the disease pair

The two preceding steps give the disease functional similarity and disease semantic similarity scores, both of which lie between 0 and 1; multiplying them gives the similarity score of the disease pair:

Sim_DGS(d₁, d₂)=FSim_DGS(d₁, d₂)·SSim_DGS(d₁, d₂)
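The combination in Step 5 can be written as a small helper. This is an illustrative sketch; the function name and the range check are assumptions added here, not part of the described method.

```python
def dgs_similarity(fsim_score, ssim_score):
    """Final DGS score: functional similarity times semantic similarity.

    Both inputs are expected to lie in [0, 1], as stated for Steps 3 and 4.
    """
    for name, value in (("fsim_score", fsim_score), ("ssim_score", ssim_score)):
        if not 0.0 <= value <= 1.0:
            raise ValueError(f"{name} must lie in [0, 1], got {value}")
    return fsim_score * ssim_score
```

Because the two scores are multiplied, the result also lies in [0, 1] and never exceeds the smaller of the two, so a disease pair ranks highly only when both its functional and its semantic similarity are high.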

Performance evaluation of the method of the invention:

Following the evaluation methods used in the literature on disease-pair similarity calculation, two evaluation methods are used, ROC curves and the PTC sharing rate, and the method is compared with the better-performing existing approaches: Resnik's method, Wang's method, the PSB algorithm, and the SemFunSim algorithm.

1. ROC curve assessment of disease similarity

The test set is formed by combining two sets, a standard set and a random set. The standard set is the data set published in the 2013 paper of Li et al., 70 disease pairs of high similarity in total; the random set consists of 100 disease pairs selected at random from the disease ontology. To verify the generality of the algorithms, we generate 10 random sets, combine each with the standard set to form a test set, and then run the five algorithms on these ten test sets and compare the results.

We run each of the five algorithms on the disease pairs in the test set to obtain five sets of disease similarity results, then sort the similarity results of the disease pairs from high to low. With the disease pairs of the standard set as positive samples and those of the random set as negative samples, we draw the ROC curve and compute the AUC value to obtain the assessment result.
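The AUC used in this assessment can be computed directly from the two groups of scores, without drawing the curve, because the area under the ROC curve equals the probability that a randomly chosen positive pair scores higher than a randomly chosen negative pair, with ties counted as one half. The sketch below uses that equivalence; the function name and the sample scores are hypothetical.

```python
def roc_auc(positive_scores, negative_scores):
    """AUC as the fraction of (positive, negative) pairs ranked correctly.

    Ties between a positive and a negative score count as half a win.
    """
    if not positive_scores or not negative_scores:
        raise ValueError("both score lists must be non-empty")
    wins = 0.0
    for pos in positive_scores:
        for neg in negative_scores:
            if pos > neg:
                wins += 1.0
            elif pos == neg:
                wins += 0.5
    return wins / (len(positive_scores) * len(negative_scores))


# Hypothetical similarity scores: standard-set pairs vs. random-set pairs.
auc = roc_auc([0.9, 0.8, 0.4], [0.5, 0.3])
```

For a test set of 70 positive and 100 negative pairs this direct comparison needs only 7,000 score comparisons, so no sorting-based optimization is required.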

In this experiment, to further strengthen the ROC-based assessment, we add constraints to the generation of random disease pairs and produce three groups of data sets. The first group, S₁, generates disease pairs at random from the disease ontology. The second group, S₂, restricts the choice of diseases: a disease can be selected into a random pair only when its number of associated genes is greater than 5. The third group, S₃, admits a disease into random pair selection only when its number of associated genes is greater than 10.
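The three random data sets can be generated with one sampling routine that differs only in the gene-count threshold. The following is a sketch under assumptions: the function name, the `gene_counts` mapping, and the rule that a pair consists of two distinct diseases with no repeated pairs are illustrative choices, not details given in the text.

```python
import random


def random_disease_pairs(gene_counts, n_pairs, min_genes=None, seed=None):
    """Draw n_pairs distinct unordered disease pairs.

    gene_counts: mapping disease -> number of associated genes.
    min_genes:   None for no constraint (S1); 5 for S2; 10 for S3.
                 A disease is eligible only if its count is greater than min_genes.
    """
    rng = random.Random(seed)
    eligible = sorted(
        d for d, count in gene_counts.items()
        if min_genes is None or count > min_genes
    )
    max_pairs = len(eligible) * (len(eligible) - 1) // 2
    if n_pairs > max_pairs:
        raise ValueError("not enough eligible diseases for the requested pairs")
    pairs = set()
    while len(pairs) < n_pairs:
        a, b = rng.sample(eligible, 2)  # two distinct diseases
        pairs.add((a, b) if a < b else (b, a))
    return sorted(pairs)


# Hypothetical gene counts for six diseases.
counts = {"a": 0, "b": 3, "c": 6, "d": 8, "e": 12, "f": 20}
s2 = random_disease_pairs(counts, 3, min_genes=5, seed=0)
s3 = random_disease_pairs(counts, 1, min_genes=10, seed=1)
```

Passing a fixed `seed` makes each of the ten random sets reproducible, which helps when the same test sets must be scored by all five algorithms.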

The following table gives the comparison results. The method of the invention performs best, followed in order by the SemFunSim algorithm, the PSB algorithm, Wang's method, and Resnik's method.

2. PTC sharing rate assessment of disease similarity

The CTD database (http://ctdbase.org/downloads/) provides not only disease-related gene ontology information but also the potential therapeutic compounds (PTC) of diseases. PTCs are an effective measure of the components of drugs that treat a disease and are of great significance for drug development.

To further study the effectiveness of the algorithms, we obtain the associations between diseases and PTCs from the CTD database and use them to verify how the algorithms perform with respect to disease drugs.

We randomly generate ten disease-pair test sets, each containing 1000 disease pairs, and compute the similarity of the disease pairs in each test set with each of the five similarity algorithms. We then sort the similarity results of each test set and take the 100 disease pairs with the highest similarity to form the set T₁₀₀. The PTC sharing rate of the disease pairs in T₁₀₀ is then analyzed to measure the effect of each algorithm.

The following table gives the comparison results:

Method	Comprehensive PTC sharing rate	Standard PTC sharing rate	Number of disease pairs sharing PTC
Method of Resnik	6.26%	7.63%	82.1
Method of Wang	7.34%	8.37%	87.8
PSB algorithm	11.13%	12.10%	92.2
SemFunSim algorithm	13.03%	13.54%	96.3
DGS algorithm	15.25%	15.28%	99.8

The table has two PTC sharing rate indices. The standard PTC sharing rate is the average PTC sharing rate over the disease pairs in T₁₀₀ that share PTCs, i.e. the sum of the PTC sharing rates of all disease pairs divided by the number of disease pairs in the set that share PTCs. The comprehensive PTC sharing rate is the sum of the PTC sharing rates of all disease pairs divided by the number of disease pairs in T₁₀₀ (100). By definition, the comprehensive PTC sharing rate better reflects the overall effect of an algorithm on the test set. However, when the number of disease pairs in the test set grows further, every disease pair in T₁₀₀ may share PTCs; in that case the standard PTC sharing rate still gives a rough assessment of the data.
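The two indices differ only in their denominator, which the following sketch makes explicit. It takes the per-pair PTC sharing rates as given, since the text does not spell out how the rate of a single pair is computed; the function name and the sample values are hypothetical.

```python
def ptc_sharing_rates(pair_rates):
    """Return (comprehensive rate, standard rate, number of sharing pairs).

    pair_rates: per-pair PTC sharing rates for the disease pairs in T100;
                a rate of 0 means the pair shares no PTC.
    """
    if not pair_rates:
        raise ValueError("pair_rates must be non-empty")
    total = sum(pair_rates)
    n_sharing = sum(1 for rate in pair_rates if rate > 0)
    comprehensive = total / len(pair_rates)               # divide by all pairs
    standard = total / n_sharing if n_sharing else 0.0    # divide by sharing pairs
    return comprehensive, standard, n_sharing


# Hypothetical rates for a four-pair set.
comprehensive, standard, n_sharing = ptc_sharing_rates([0.5, 0.0, 0.25, 0.25])
```

Since the number of sharing pairs never exceeds the size of the set, the standard rate is always at least the comprehensive rate, and the two coincide when every pair shares at least one PTC.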

The table shows that the method of the invention performs best on all three measures: the comprehensive PTC sharing rate, the number of disease pairs sharing PTC, and the standard PTC sharing rate.

The method of the invention achieves the best results under both evaluation methods. The experimental results show that it not only helps in studying the associations between diseases and offers useful guidance for disease research, but also provides an effective approach to studying drugs for diseases.

Claims

1. A disease similarity calculation method based on disease ontology, characterized in that the method comprises two parts, a disease functional similarity method for assessing a disease pair and a disease semantic similarity method for assessing a disease pair, specifically as follows:

(1.1) First, the gene ontology node similarity matrix is constructed with the shortest path algorithm (the Shortest Path, SP), with the information content of gene ontology nodes computed by Resnik's method. Each element of the matrix is a value between 0 and 1 that represents the similarity of a pair of nodes in the gene ontology; the matrix is denoted Sim_GO;

where path₁ is the shortest path from node t₁ to the most informative common ancestor node (MICA node), and path₂ is the shortest path from node t₂ to the MICA node;

(1.2) The disease ontology is linked to the Medical Subject Headings (MeSH) database. Let two disease terms in the MeSH database be m₁ and m₂, and let the sets of gene molecular function terms associated with them in the MeSH database be T₁ = {t₁₁, t₁₂, ..., t_1p} and T₂ = {t₂₁, t₂₂, ..., t_2q}, respectively. T₁ and T₂ are regarded as descriptions of the characteristics of the disease terms in the MeSH database, and formula (1) converts the gene ontology node similarity matrix into MeSH term node similarity:

where p and q are the numbers of molecular function terms associated with disease terms m₁ and m₂, respectively, max takes the maximum, and Sim_GO(t₁, t₂) is the similarity score of two gene ontology nodes t₁ and t₂;

(1.3) A disease node in the disease ontology usually corresponds to several disease terms in the MeSH database. Let the disease pair be d₁ and d₂, corresponding to two disease terms in the disease ontology; the MeSH disease terms associated with d₁ and d₂, obtained from the disease ontology, are M₁ = {m₁₁, m₁₂, ..., m_1p} and M₂ = {m₂₁, m₂₂, ..., m_2q}, respectively;

where p and q are the numbers of MeSH disease terms associated with diseases d₁ and d₂, respectively, max takes the maximum, and Sim_MeSH(m₁, m₂) is the similarity score of two MeSH disease terms m₁ and m₂;

The disease functional similarity calculation considers the associations between diseases at the molecular level of the gene ontology, whereas the disease semantic similarity calculation exploits the structural organization of the disease ontology. To compute disease semantic similarity, the most informative common ancestor (MICA) of the two disease nodes must be found. Let the two diseases be d₁ and d₂ and this ancestor node be d_MICA, and let G₁, G₂, and G_MICA denote the sets of gene ontology nodes associated with d₁, d₂, and d_MICA, respectively. The similarity is computed with the following formula:

(3) The similarity score of the disease pair is assessed:

The disease functional similarity and disease semantic similarity scores computed in steps (1) and (2) both lie between 0 and 1; multiplying them gives the similarity score of the disease pair:

Sim_DGS(d₁, d₂)=FSim_DGS(d₁, d₂)·SSim_DGS(d₁, d₂)

where FSim_DGS(d₁, d₂) is the functional similarity score of the disease pair d₁ and d₂, and SSim_DGS(d₁, d₂) is its semantic similarity score.