CN106021871A

CN106021871A - Disease similarity calculation method and device based on big data group behaviors

Info

Publication number: CN106021871A
Application number: CN201610307328.7A
Authority: CN
Inventors: 韦辉华; 王界兵; 张伟; 董迪马; 郭宇翔; 宋泰然; 梁猛
Original assignee: Shenzhen Frontsurf Information Technology Co Ltd
Current assignee: Shenzhen Frontsurf Information Technology Co Ltd
Priority date: 2016-05-10
Filing date: 2016-05-10
Publication date: 2016-10-12

Abstract

The invention discloses a disease similarity calculation method and device based on big data group behavior. The method comprises the following steps: calculating, for each patient element instance, the weight with which it is diagnosed as each disease, wherein the patient element instance comprises patient case information; establishing a disease vector for each disease according to the calculated weight values, wherein the weight values serve as the elements of the disease vector; and calculating disease similarity according to the disease vectors. The method and device calculate the similarity between diseases from the social perspective of diseases, according to the diagnosis and treatment behavior of a big data population, and can be used to identify diseases that are easily misdiagnosed yet have no cellular, genetic, or similar association.

Description

Disease similarity calculation method and device based on big data group behavior

Technical field

The present invention relates to the field of disease similarity calculation, and in particular to a disease similarity calculation method and device based on big data group behavior.

Background

Current methods for calculating disease similarity generally rely on attributes of the diseases, such as inclusion relations between diseases (for example, 'breast cancer' includes 'male breast cancer' and 'female breast cancer') and shared factors between diseases (common pathogenic genes, common drugs, common metabolites, and so on). Such methods generally approach the problem from two angles:

1. Calculating disease similarity based on semantic association.

In biomedicine, ontologies such as the Gene Ontology and the Human Phenotype Ontology are frequently used to calculate the semantic similarity of terms. However, only a small number of these methods have been applied to disease similarity. Resnik's method is the most common; it has mostly been applied to the Gene Ontology to calculate the similarity of gene function, cellular component, and biological process terms, and it shows clear advantages over several other methods (union-intersection, longest shared path, JC). Resnik's method uses the 'is_a' relations in an ontology to calculate term similarity: the similarity of a disease pair depends on the common ancestor node of the pair with the greatest information content. Lin's method refines how Resnik's method compares information content, improving it on theoretical grounds. Researchers have implemented the Resnik and Lin methods in an R package to make disease similarity calculations convenient. The method proposed by Wang et al. optimizes Resnik's method further: when calculating the similarity of a disease pair, it considers not only the common ancestor node with the greatest information content but also the pair's other common ancestor nodes. This method has performed well on the Gene Ontology and has been used to calculate the semantic similarity of disease terms in Medical Subject Headings.

2. Calculating disease similarity based on disease-related genes.

Associations between diseases are reflected not only in disease ontologies but also in shared pathogenic genes. Researchers have therefore also focused on calculating disease similarity from the pathogenic genes of diseases. Two gene-based methods for calculating disease similarity currently exist.

(1) The first is the method based on overlapping gene sets (BOG). It compares the number of associated genes that two diseases share and derives the disease similarity from that number. Compared with semantics-based similarity, this method finds similar disease pairs from an entirely different angle and can therefore uncover previously unknown disease associations. However, it does not consider the functional associations between disease genes, which clearly affect disease similarity.

(2) The second method calculates disease similarity based on process similarity (PSB), where 'process' refers to the Gene Ontology biological process terms associated with the pathogenic genes. Because it takes the functional associations of disease genes into account, it improves substantially on the BOG method. PSB also performs well in comparison with the Resnik, Lin, LC, and JC methods. Functional associations between genes take many forms, such as gene co-expression, protein interaction, and Gene Ontology terms. In addition, to improve the performance of disease similarity methods, the FunSim method calculates disease similarity using an integrated weighted human gene association network.

Therefore, if two diseases have roughly the same gene functions, cellular components, biological processes, or shared pathogenic genes, both approaches (semantic association and disease-related genes) calculate disease similarity effectively, which is very useful for disease research. However, for two diseases that are easily misdiagnosed as each other but have no cellular, genetic, or similar association, both approaches perform poorly.

Summary of the invention

The main object of the present invention is to provide a disease similarity calculation method and device based on big data group behavior that calculate the similarity between diseases from the social perspective of diseases, according to the diagnosis and treatment behavior of a big data population, and that can be used to identify diseases that are easily misdiagnosed yet have no cellular, genetic, or similar association.

The present invention proposes a disease similarity calculation method based on big data group behavior, comprising the steps of:

calculating, for each patient element instance, the weight with which it is diagnosed as each disease, wherein the patient element instance comprises patient case information;

establishing a disease vector for each disease according to the calculated weight values, wherein the weight values serve as the elements of the disease vector; and

calculating disease similarity according to the disease vectors.

Further, the step of calculating, for each patient element instance, the weight with which it is diagnosed as each disease comprises:

counting the number of times each patient element instance is diagnosed as each disease, obtaining the frequency of that count in all the data, and using the frequency as the weight value with which the patient element instance is diagnosed as each disease.

Further, the step of calculating disease similarity according to the disease vectors comprises:

calculating the similarity of two diseases from their disease vectors using cosine distance.

Further, the formula for calculating the similarity of two diseases from their disease vectors using cosine distance is:

\operatorname{sim}(\vec{D_i}, \vec{D_j}) = \cos\langle \vec{D_i}, \vec{D_j} \rangle = \frac{\sum_{k=1}^{H} d_{ik} \, d_{jk}}{\sqrt{\sum_{k=1}^{H} d_{ik}^{2}} \cdot \sqrt{\sum_{k=1}^{H} d_{jk}^{2}}};

where sim(D_i, D_j) is the disease similarity between the two disease vectors; d_ik and d_jk are the weight values with which the k-th patient element instance is diagnosed as diseases D_i and D_j, respectively; H is the total number of patient element instances; and k is a natural number.
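As a worked example with hypothetical values (not taken from the patent data), take H = 3, \vec{D_i} = (0.5, 0.5, 0) and \vec{D_j} = (0.5, 0, 0.5). Under the standard definition of cosine similarity, the dot product divided by the product of the vector norms, the two diseases share only the first patient element instance, which gives:

\operatorname{sim}(\vec{D_i}, \vec{D_j}) = \frac{0.5 \cdot 0.5 + 0.5 \cdot 0 + 0 \cdot 0.5}{\sqrt{0.5^{2} + 0.5^{2} + 0^{2}} \cdot \sqrt{0.5^{2} + 0^{2} + 0.5^{2}}} = \frac{0.25}{\sqrt{0.5} \cdot \sqrt{0.5}} = 0.5.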

Further, when the weight value with which a patient element instance is diagnosed as a certain disease is very low, the formula for calculating the similarity of two diseases from their disease vectors using cosine distance is:

\operatorname{sim}(\vec{D_i}, \vec{D_j}) = \frac{\sum_{k \in T_{ij}} d_{ik} \, d_{jk}}{\sqrt{\sum_{k \in T_{i}} d_{ik}^{2}} \cdot \sqrt{\sum_{k \in T_{j}} d_{jk}^{2}}};

where sim(D_i, D_j) is the disease similarity between the two disease vectors; d_ik and d_jk are the weight values with which the k-th patient element instance is diagnosed as diseases D_i and D_j, respectively, each higher than the set weight value; k is a natural number; T_i is the set of patient element instances whose weight value for disease D_i is higher than the set weight value; T_j is the set of patient element instances whose weight value for disease D_j is higher than the set weight value; and T_ij is the intersection of T_i and T_j.

The present invention further provides a disease similarity calculation device based on big data group behavior, comprising:

a weight calculation unit, which calculates, for each patient element instance, the weight with which it is diagnosed as each disease, wherein the patient element instance comprises patient case information;

a vector establishing unit, which establishes a disease vector for each disease according to the calculated weight values, wherein the weight values serve as the elements of the disease vector; and

a similarity calculation unit, which calculates disease similarity according to the disease vectors.

Further, the weight calculation unit comprises:

a weight value calculation subunit, which counts the number of times each patient element instance is diagnosed as each disease, obtains the frequency of that count in all the data, and uses the frequency as the weight value with which the patient element instance is diagnosed as each disease.

Further, the similarity calculation unit comprises:

a cosine distance calculation subunit, which calculates the similarity of two diseases from their disease vectors using cosine distance.

Further, the calculation formula of the cosine distance calculation subunit is:

\operatorname{sim}(\vec{D_i}, \vec{D_j}) = \cos\langle \vec{D_i}, \vec{D_j} \rangle = \frac{\sum_{k=1}^{H} d_{ik} \, d_{jk}}{\sqrt{\sum_{k=1}^{H} d_{ik}^{2}} \cdot \sqrt{\sum_{k=1}^{H} d_{jk}^{2}}};

Further, when the weight value with which a patient element instance is diagnosed as a certain disease is very low, the calculation formula of the cosine distance calculation subunit is:

\operatorname{sim}(\vec{D_i}, \vec{D_j}) = \frac{\sum_{k \in T_{ij}} d_{ik} \, d_{jk}}{\sqrt{\sum_{k \in T_{i}} d_{ik}^{2}} \cdot \sqrt{\sum_{k \in T_{j}} d_{jk}^{2}}};

The disease similarity calculation method and device based on big data group behavior provided by the present invention have the following beneficial effects:

The method and device calculate the similarity between diseases from the social perspective of diseases, according to the diagnosis and treatment behavior of a big data population, and can be used to identify diseases that are easily misdiagnosed yet have no cellular, genetic, or similar association. Disease vectors are established from the weights with which each patient element instance is diagnosed as each disease, and disease similarity is calculated by cosine distance, so the most easily misdiagnosed diseases can be identified. When the weight value with which a patient element instance is diagnosed as a certain disease is very low, the credibility of that diagnosis can be regarded as very low, and the weight can be ignored when calculating disease similarity.

Brief description of the drawings

Fig. 1 is a schematic diagram of the steps of the disease similarity calculation method based on big data group behavior in one embodiment of the present invention;

Fig. 2 is a schematic diagram of the steps of the disease similarity calculation method based on big data group behavior in another embodiment of the present invention;

Fig. 3 is a structural schematic diagram of the disease similarity calculation device based on big data group behavior in one embodiment of the present invention;

Fig. 4 is a structural schematic diagram of the disease similarity calculation device based on big data group behavior in another embodiment of the present invention;

Fig. 5 is a structural schematic diagram of the weight calculation unit in one embodiment of the present invention;

Fig. 6 is a structural schematic diagram of the similarity calculation unit in one embodiment of the present invention.

The realization of the objects, functional features, and advantages of the present invention is further described below with reference to the embodiments and the accompanying drawings.

Detailed description

It should be understood that the specific embodiments described herein serve only to explain the present invention and are not intended to limit it.

Fig. 1 is a schematic diagram of the steps of the disease similarity calculation method based on big data group behavior in one embodiment of the present invention.

One embodiment of the present invention proposes a disease similarity calculation method based on big data group behavior, comprising:

Step S1: calculate, for each patient element instance, the weight with which it is diagnosed as each disease. The patient element instance comprises patient case information, which includes at least patient gender, age, symptom, symptom location, and refined symptom description.

Step S2: establish a disease vector for each disease according to the calculated weight values, which serve as the elements of the disease vector.

Step S3: calculate disease similarity according to the disease vectors.

At present, disease similarity is generally calculated by one of two approaches: semantic association or disease-related genes. If two diseases have roughly the same gene functions, cellular components, biological processes, or shared pathogenic genes, both approaches are effective, which is very useful for disease research. However, for two diseases that are easily misdiagnosed as each other but have no cellular, genetic, or similar association, both approaches perform poorly. The disease similarity calculation method based on big data group behavior provided in this embodiment quantifies the similarity between diseases. The calculation does not use attribute relations between diseases; instead it uses the medical consultation behavior of a large patient population. A large volume of data on patients seeking treatment for each disease is first collected, the diseases are modeled from this population's diagnosis behavior, and a suitable similarity algorithm is then chosen to calculate the similarity between every pair of diseases, so that easily misdiagnosed diseases can be identified. The method can be used in systems for computer-aided diagnosis and self-diagnosis. It is also valuable for research into the pathogenesis of complex diseases, for the early prevention and diagnosis of major and infectious diseases, and for new drug development.

With reference to Fig. 2, the following step may also be included before step S1:

Step S0: construct patient element instances from case information such as patient gender, age, symptom, symptom location, and refined symptom description. Step S0 may comprise the following steps:

Step 1: Collect diagnosis and treatment data for patients of different genders and ages. From each case, extract the following information (without being limited to it): patient gender, patient age, symptom, symptom location, refined symptom description, and diagnosed disease. Patient gender is denoted G: {G₁, G₂}, where G_i is one of the 2 values that gender G can take; patient age is denoted A; symptom is denoted S: {S₁, S₂, …, S_L}, where S_i is one of the L values that symptom S can take; symptom location is denoted B: {B₁, B₂, …, B_M}, where B_i is one of the M values that location B can take; disease is denoted D: {D₁, D₂, …, D_N}, where D_i is one of the N values that disease D can take.

Step 2: Discretize the age data into K bands, A: {A₁, A₂, …, A_K}, where A_i is one of the K values that age band A can take. The age data may be divided in either of the following two ways:

The first, K = 5:

(1) childhood: 0–6 years; (2) juvenile: 7–17 years; (3) young: 18–40 years; (4) middle age: 41–65 years; (5) old: 66 years and over.

The second, K = 14:

(1) infancy: 0–3 months; (2) early childhood: 4 months–2.5 years; (3) preschool period: over 2.5 years–6 years; (4) early-learning period: 7–10 years; (5) rebellious period: 11–14 years; (6) growth period: 15–17 years; (7) adolescence: 18–28 years; (8) maturity: 29–40 years; (9) robust period: 41–48 years; (10) steady period: 49–55 years; (11) adjustment period: 56–65 years; (12) early old age: 67–72 years; (13) middle old age: 73–84 years; (14) advanced old age: 85 years and over.
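The K = 5 partition can be sketched as a simple lookup. This is a minimal illustration, assuming ages are given in whole years; the names `AGE_UPPER_BOUNDS`, `AGE_BANDS`, and `age_band` are hypothetical and do not come from the patent.

```python
from bisect import bisect_left

# Inclusive upper bounds (whole years) of the first four bands of the
# K = 5 partition; ages above the last bound fall in the fifth band.
AGE_UPPER_BOUNDS = [6, 17, 40, 65]
AGE_BANDS = ["childhood", "juvenile", "young", "middle age", "old"]


def age_band(age_years: int) -> str:
    """Map an age in whole years to one of the K = 5 age bands."""
    if age_years < 0:
        raise ValueError("age must be non-negative")
    # bisect_left returns how many upper bounds are strictly below the age,
    # which is exactly the index of the band the age belongs to.
    return AGE_BANDS[bisect_left(AGE_UPPER_BOUNDS, age_years)]
```

The K = 14 partition could be handled the same way with a longer bounds list, provided the fractional boundaries (months, 2.5 years) are expressed in a common unit.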

Step 3: Consolidate the refined symptom descriptions under each symptom, according to the symptom's time of onset, severity, triggering factors, and so on, into 3 to 5 refined descriptions: S_i: {S_i1, S_i2, S_i3, S_i4, S_i5}, where S_i1, S_i2, S_i3, S_i4, S_i5 are the refined descriptions of symptom S_i. The number of refined descriptions may differ from symptom to symptom.

Step 4: Combine the values of patient gender G: {G₁, G₂}, patient age A: {A₁, A₂, …, A_K}, symptom S: {S₁, S₂, …, S_L}, symptom location B: {B₁, B₂, …, B_M}, and refined symptom description S_i: {S_i1, S_i2, S_i3, S_i4, S_i5} into patient element instances E: {E₁, E₂, …, E_H}. For example, E₁ denotes the patient element instance "gender G₁, age A₁, symptom S₁, location B₁, refined symptom description S₁₁", and H is the number of element instances. In theory H = 2 × K × L × M × 5; in practice, because some symptoms occur only in certain age bands or in one gender, H < 2 × K × L × M × 5.
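One way to picture Step 4 is to enumerate only the attribute combinations that actually occur in the case records, which is why H stays below the theoretical product. The sketch below is illustrative only: the record field names, the `ElementInstance` type, and the sample cases are all hypothetical.

```python
from typing import NamedTuple


class ElementInstance(NamedTuple):
    """One patient element instance E: a combination of case attributes."""
    gender: str
    age_band: str
    symptom: str
    body_part: str
    refinement: str


def build_element_instances(cases):
    """Return {element instance: index} for the distinct combinations
    observed in the case records, numbered in first-seen order."""
    index = {}
    for case in cases:
        instance = ElementInstance(
            case["gender"], case["age_band"], case["symptom"],
            case["body_part"], case["refinement"],
        )
        # Assign the next free index only the first time a combination appears.
        index.setdefault(instance, len(index))
    return index


# Hypothetical case records; the first two share the same combination.
cases = [
    {"gender": "female", "age_band": "young", "symptom": "cough",
     "body_part": "chest", "refinement": "worse at night"},
    {"gender": "female", "age_band": "young", "symptom": "cough",
     "body_part": "chest", "refinement": "worse at night"},
    {"gender": "male", "age_band": "old", "symptom": "cough",
     "body_part": "chest", "refinement": "after exertion"},
]
index = build_element_instances(cases)
H = len(index)  # number of distinct element instances observed
```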

Further, in step S1, calculating the weight with which each patient element instance is diagnosed as each disease comprises:

counting the number of times each patient element instance is diagnosed as each disease, obtaining the frequency of that count in all the data, and using the frequency as the weight value with which the patient element instance is diagnosed as each disease.

In this embodiment, the weight W_ij of each patient element instance E_i for each disease D_j is calculated as follows:

Using the large volume of case data collected above (the data volume must be large enough to be statistically meaningful), count, for each patient element instance E_i, the number of times {F_i1, F_i2, …, F_iN} it is diagnosed as each disease {D₁, D₂, …, D_N}, and then convert these counts into frequencies by dividing by F_i, the total against which the counts of E_i are normalized.

The frequency is used as the weight of patient element instance E_i for each disease {D₁, D₂, …, D_N}, that is:

W_{ij} = \frac{F_{ij}}{F_{i}} .

This yields the weight matrix W_ij relating patient element instances to diseases.
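The count-to-frequency conversion can be sketched as follows. The sketch assumes that the normalizing total F_i is the sum of the counts of E_i over all diseases; the text does not spell this out, so treat it as an assumption. The function name and the sample data are hypothetical.

```python
from collections import Counter, defaultdict


def weight_matrix(diagnoses):
    """Build the weights W_ij from (element_instance, disease) pairs.

    Returns a nested dict {element_instance: {disease: weight}}; pairs that
    never occur are simply absent, i.e. their weight is zero.
    """
    counts = defaultdict(Counter)  # counts[E_i][D_j] = F_ij
    for instance, disease in diagnoses:
        counts[instance][disease] += 1

    weights = {}
    for instance, row in counts.items():
        total = sum(row.values())  # ASSUMPTION: F_i = sum over j of F_ij
        weights[instance] = {d: f / total for d, f in row.items()}
    return weights


# Hypothetical diagnosis log: one (element instance, disease) pair per case.
diagnoses = [("E1", "flu"), ("E1", "flu"), ("E1", "cold"), ("E2", "cold")]
W = weight_matrix(diagnoses)
```

With this normalization each row of W sums to 1, so a weight can be read as the share of cases of that element instance that ended in a given diagnosis.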

Further, in step S2, establishing a disease vector for each disease according to the calculated weight values comprises using the weights of the weight matrix W_ij of patient element instances and diseases as the elements of the disease vector:

\vec{D_j} = (d_{j1}, d_{j2}, \ldots, d_{jH}) = (W_{1j}, W_{2j}, \ldots, W_{Hj}) .
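In other words, each disease vector is a column of the weight matrix. A minimal sketch, assuming the weights are stored as a nested dict with zero entries omitted (a hypothetical representation, not one prescribed by the patent):

```python
def disease_vectors(weights):
    """Transpose {element_instance: {disease: w}} into
    {disease: {element_instance: w}}, one sparse vector per disease."""
    vectors = {}
    for instance, row in weights.items():
        for disease, w in row.items():
            vectors.setdefault(disease, {})[instance] = w
    return vectors


# Hypothetical weights for two element instances and two diseases.
W = {"E1": {"flu": 0.75, "cold": 0.25}, "E2": {"cold": 1.0}}
D = disease_vectors(W)
```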

Further, in step S3, disease similarity is calculated according to the disease vectors as follows.

The disease vector is a high-dimensional sparse vector, and the formula for calculating the similarity of two diseases from their disease vectors using cosine distance is:

\operatorname{sim}(\vec{D_i}, \vec{D_j}) = \cos\langle \vec{D_i}, \vec{D_j} \rangle = \frac{\sum_{k=1}^{H} d_{ik} \, d_{jk}}{\sqrt{\sum_{k=1}^{H} d_{ik}^{2}} \cdot \sqrt{\sum_{k=1}^{H} d_{jk}^{2}}};
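Because the vectors are sparse, the cosine can be computed over the non-zero entries only. The sketch below uses the standard definition of cosine similarity (dot product divided by the product of the norms) and assumes each disease vector is a dict from element instance to weight; returning 0.0 for an all-zero vector is a convention chosen here, not something the patent specifies.

```python
import math


def cosine_similarity(u, v):
    """Cosine similarity of two sparse vectors stored as {key: weight}."""
    # Only keys present in both vectors contribute to the dot product.
    dot = sum(w * v[k] for k, w in u.items() if k in v)
    norm_u = math.sqrt(sum(w * w for w in u.values()))
    norm_v = math.sqrt(sum(w * w for w in v.values()))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0  # convention for empty vectors (an assumption)
    return dot / (norm_u * norm_v)
```

Two diseases that are diagnosed from the same element instances in similar proportions score close to 1, and diseases with no element instance in common score 0.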

Further, when the weight value with which a patient element instance is diagnosed as a certain disease is very low, the credibility of that diagnosis is also very low, and the weight can generally be ignored when calculating disease similarity. The formula for calculating the similarity of two diseases from their disease vectors using cosine distance is then adjusted to:

\operatorname{sim}(\vec{D_i}, \vec{D_j}) = \frac{\sum_{k \in T_{ij}} d_{ik} \, d_{jk}}{\sqrt{\sum_{k \in T_{i}} d_{ik}^{2}} \cdot \sqrt{\sum_{k \in T_{j}} d_{jk}^{2}}};

where sim(D_i, D_j) is the disease similarity between the two disease vectors; d_ik and d_jk are the weight values with which the k-th patient element instance is diagnosed as diseases D_i and D_j, respectively, each higher than the set weight value; k is a natural number; T_i is the set of patient element instances whose weight value for disease D_i is higher than the set weight value; T_j is the set of patient element instances whose weight value for disease D_j is higher than the set weight value; and T_ij is the intersection of T_i and T_j. The set weight value is a minimum threshold chosen according to the actual situation: when the weight value with which a patient element instance is diagnosed as D_i or D_j is below this threshold, its credibility is judged to be low, and the weight of that patient element instance is ignored when calculating disease similarity.
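The thresholded variant can be sketched by building the sets T_i, T_j, and T_ij explicitly. As before, the dict representation and the function name are hypothetical, and the threshold value in the usage note is arbitrary.

```python
import math


def thresholded_similarity(u, v, threshold):
    """Cosine similarity that ignores weights not above `threshold`.

    u and v are sparse disease vectors {element_instance: weight}.
    """
    t_i = {k for k, w in u.items() if w > threshold}  # T_i
    t_j = {k for k, w in v.items() if w > threshold}  # T_j
    t_ij = t_i & t_j                                  # T_ij = T_i ∩ T_j

    dot = sum(u[k] * v[k] for k in t_ij)
    norm_u = math.sqrt(sum(u[k] ** 2 for k in t_i))
    norm_v = math.sqrt(sum(v[k] ** 2 for k in t_j))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0  # convention when nothing survives the threshold
    return dot / (norm_u * norm_v)
```

For example, with `threshold=0.05`, a weight of 0.01 is dropped from both the numerator and the corresponding norm, so a rare and probably unreliable diagnosis no longer influences the result.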

Fig. 3 is a structural schematic diagram of the disease similarity calculation device based on big data group behavior in one embodiment of the present invention.

One embodiment of the present invention further provides a disease similarity calculation device based on big data group behavior, comprising:

a weight calculation unit 10, which calculates, for each patient element instance, the weight with which it is diagnosed as each disease, wherein the patient element instance comprises patient case information;

a vector establishing unit 20, which establishes a disease vector for each disease according to the calculated weight values, wherein the weight values serve as the elements of the disease vector; and

a similarity calculation unit 30, which calculates disease similarity according to the disease vectors.
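The three units can be pictured as a small pipeline in which each unit consumes the output of the previous one. The sketch below is only one possible arrangement: the class and function names are hypothetical, the weight normalization assumes F_i is the sum of the counts of E_i over all diseases, and the similarity is the standard cosine.

```python
import math
from collections import Counter, defaultdict
from itertools import combinations


class WeightCalculationUnit:
    """Unit 10: counts -> frequencies, per element instance."""
    def run(self, diagnoses):
        counts = defaultdict(Counter)
        for instance, disease in diagnoses:
            counts[instance][disease] += 1
        # ASSUMPTION: each row is normalized by its own total count.
        return {e: {d: f / sum(row.values()) for d, f in row.items()}
                for e, row in counts.items()}


class VectorEstablishingUnit:
    """Unit 20: one sparse vector per disease (a column of the weights)."""
    def run(self, weights):
        vectors = defaultdict(dict)
        for instance, row in weights.items():
            for disease, w in row.items():
                vectors[disease][instance] = w
        return dict(vectors)


class SimilarityCalculationUnit:
    """Unit 30: cosine similarity for every pair of diseases."""
    @staticmethod
    def _cosine(u, v):
        dot = sum(w * v[k] for k, w in u.items() if k in v)
        nu = math.sqrt(sum(w * w for w in u.values()))
        nv = math.sqrt(sum(w * w for w in v.values()))
        return dot / (nu * nv) if nu and nv else 0.0

    def run(self, vectors):
        return {(a, b): self._cosine(vectors[a], vectors[b])
                for a, b in combinations(sorted(vectors), 2)}


def disease_similarity_device(diagnoses):
    """Chain the three units: weights -> disease vectors -> similarities."""
    weights = WeightCalculationUnit().run(diagnoses)
    vectors = VectorEstablishingUnit().run(weights)
    return SimilarityCalculationUnit().run(vectors)
```

Sorting the resulting pairs by similarity in descending order would list the disease pairs most likely to be confused with each other first.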

At present, disease similarity is generally calculated by one of two approaches: semantic association or disease-related genes. If two diseases have roughly the same gene functions, cellular components, biological processes, or shared pathogenic genes, both approaches are effective, which is very useful for disease research. However, for two diseases that are easily misdiagnosed as each other but have no cellular, genetic, or similar association, both approaches perform poorly. The disease similarity calculation device based on big data group behavior provided in this embodiment quantifies the similarity between diseases. The calculation does not use attribute relations between diseases; instead it uses the medical consultation behavior of a large patient population. A large volume of data on patients seeking treatment for each disease is first collected, the diseases are modeled from this population's diagnosis behavior, and a suitable similarity algorithm is then chosen to calculate the similarity between every pair of diseases, so that the most easily misdiagnosed diseases can be identified. The device can be used in systems for computer-aided diagnosis and self-diagnosis. It is also valuable for research into the pathogenesis of complex diseases, for the early prevention and diagnosis of major and infectious diseases, and for new drug development.

With reference to Fig. 4, the disease similarity calculation device based on big data group behavior may further comprise:

an element instance construction unit 1, which constructs patient element instances from case information such as patient gender, age, symptom, symptom location, and refined symptom description. The element instance construction unit 1 may construct patient element instances as follows:

Step 1: Collect diagnosis and treatment data for patients of different genders and ages. From each case, extract the following information (without being limited to it): patient gender, patient age, symptom, symptom location, refined symptom description, and diagnosed disease. Patient gender is denoted G: {G₁, G₂}, where G_i is one of the 2 values that gender G can take; patient age is denoted A; symptom is denoted S: {S₁, S₂, …, S_L}, where S_i is one of the L values that symptom S can take; symptom location is denoted B: {B₁, B₂, …, B_M}, where B_i is one of the M values that location B can take; disease is denoted D: {D₁, D₂, …, D_N}, where D_i is one of the N values that disease D can take.

The age data may then be discretized in either of the two ways given for the method: the first with K = 5, the second with K = 14.

Further, with reference to Fig. 5, the weight calculation unit 10 comprises:

a weight value calculation subunit 100, which counts the number of times each patient element instance is diagnosed as each disease, obtains the frequency of that count in all the data, and uses the frequency as the weight value with which the patient element instance is diagnosed as each disease.

In this embodiment, the weight value calculation subunit 100 calculates the weight W_ij of each patient element instance E_i for each disease D_j as follows, where F_ij is the number of times E_i is diagnosed as D_j and F_i is the total against which the counts of E_i are normalized:

W_{ij} = \frac{F_{ij}}{F_{i}} .

This yields the weight matrix W_ij relating patient element instances to diseases.

Further, the vector establishing unit 20 establishes a disease vector for each disease according to the calculated weight values by using the weights of the weight matrix W_ij of patient element instances and diseases as the elements of the disease vector:

\vec{D_j} = (d_{j1}, d_{j2}, \ldots, d_{jH}) = (W_{1j}, W_{2j}, \ldots, W_{Hj}) .

Further, with reference to Fig. 6, the similarity calculation unit 30 comprises:

a cosine distance calculation subunit 300, which calculates the similarity of two diseases from their disease vectors using cosine distance.

Further, the disease vector is a high-dimensional sparse vector, and the calculation formula of the cosine distance calculation subunit 300 is:

\operatorname{sim}(\vec{D_i}, \vec{D_j}) = \cos\langle \vec{D_i}, \vec{D_j} \rangle = \frac{\sum_{k=1}^{H} d_{ik} \, d_{jk}}{\sqrt{\sum_{k=1}^{H} d_{ik}^{2}} \cdot \sqrt{\sum_{k=1}^{H} d_{jk}^{2}}};

Further, when the weight value with which a patient element instance is diagnosed as a certain disease is very low, the credibility of that diagnosis is also very low, and the weight can generally be ignored when calculating disease similarity. The calculation formula of the cosine distance calculation subunit 300 is then adjusted to:

\operatorname{sim}(\vec{D_i}, \vec{D_j}) = \frac{\sum_{k \in T_{ij}} d_{ik} \, d_{jk}}{\sqrt{\sum_{k \in T_{i}} d_{ik}^{2}} \cdot \sqrt{\sum_{k \in T_{j}} d_{jk}^{2}}};

In one embodiment, an experiment tested three disease similarity calculation methods or devices on "the 7 most easily misdiagnosed diseases". The calculation methods used were:

Method A: calculating disease similarity based on semantic association;

Method B: calculating disease similarity based on disease-related genes;

Method C: calculating disease similarity based on big data group behavior, as in the embodiments of the present invention.

The experimental results are shown in Table 1 below:

Table 1

The experimental results in Table 1 show the following:

If two diseases have roughly the same gene functions, cellular components, biological processes, or shared pathogenic genes, both the semantic association method and the disease-related gene method calculate disease similarity effectively, which is very useful for disease research.

However, for two diseases that are easily misdiagnosed as each other but have no cellular, genetic, or similar association, those two methods perform poorly. The method and device of the embodiments of the present invention take the social perspective of diseases and calculate disease similarity from the diagnosis and treatment behavior of a big data population, and can identify the 7 most easily misdiagnosed diseases.

The disease similarity calculation method and device based on big data group behavior provided in the embodiments of the present invention calculate the similarity between diseases from the social perspective of diseases, according to the diagnosis and treatment behavior of a big data population, and can be used to identify diseases that are easily misdiagnosed yet have no cellular, genetic, or similar association. Disease vectors are established from the weights with which each patient element instance is diagnosed as each disease, and disease similarity is calculated by cosine distance, so the most easily misdiagnosed diseases can be identified. When the weight value with which a patient element instance is diagnosed as a certain disease is very low, the credibility of that diagnosis can be regarded as very low, and the weight can be ignored when calculating disease similarity.

The foregoing describes only preferred embodiments of the present invention and does not thereby limit its patent scope. Any equivalent structure or equivalent process transformation made using the contents of the description and drawings of the present invention, or any direct or indirect application in other related technical fields, likewise falls within the scope of patent protection of the present invention.

Claims

1. A disease similarity calculation method based on big data group behavior, characterized by comprising the steps of:

calculating the weight with which each patient element instance is diagnosed as each disease, wherein the patient element instance comprises patient case information;

establishing a disease vector for each disease according to the calculated weight values, wherein the weight values serve as the elements of the disease vector;

calculating disease similarity according to the disease vectors.

2. The disease similarity calculation method based on big data group behavior according to claim 1, characterized in that the step of calculating the weight with which each patient element instance is diagnosed as each disease comprises:

calculating the number of times each patient element instance is diagnosed as each disease, obtaining the frequency of this count in all data, and using the frequency as the weight value with which each patient element instance is diagnosed as each disease.

3. The disease similarity calculation method based on big data group behavior according to claim 1, characterized in that the step of calculating disease similarity according to the disease vectors comprises:

4. The disease similarity calculation method based on big data group behavior according to claim 3, characterized in that the formula for calculating the similarity of two diseases from the disease vectors using cosine distance is:

\mathrm{sim}(\vec{D_i}, \vec{D_j}) = \cos\langle \vec{D_i}, \vec{D_j} \rangle = \frac{\sum_{k=1}^{H} d_{ik} \cdot d_{jk}}{\sqrt{\sum_{k=1}^{H} d_{ik}^{2}} \cdot \sqrt{\sum_{k=1}^{H} d_{jk}^{2}}};

wherein sim(D_i, D_j) is the disease similarity between the two disease vectors; d_ik and d_jk are the weight values with which patient element instance k is diagnosed as diseases D_i and D_j, respectively; H is the total number of patient element instances; and k is a natural number.

5. The disease similarity calculation method based on big data group behavior according to claim 3 or 4, characterized in that, when the weight value with which a patient element instance is diagnosed as a certain disease is too low, the formula for calculating the similarity of two diseases from the disease vectors using cosine distance is:

\mathrm{sim}(\vec{D_i}, \vec{D_j}) = \frac{\sum_{k \in T_{ij}} d_{ik} \cdot d_{jk}}{\sqrt{\sum_{k \in T_{i}} d_{ik}^{2}} \cdot \sqrt{\sum_{k \in T_{j}} d_{jk}^{2}}};

wherein sim(D_i, D_j) is the disease similarity between the two disease vectors; d_ik and d_jk are the weight values with which patient element instance k is diagnosed as diseases D_i and D_j, respectively, each being higher than the set weight value; k is a natural number; T_i is the set of patient element instances whose weight value for disease D_i is higher than the set weight value; T_j is the set of patient element instances whose weight value for disease D_j is higher than the set weight value; and T_ij is the collection of patient element instances whose weight values for the two diseases D_i and D_j are each higher than the set weight value.

6. A disease similarity calculation device based on big data group behavior, characterized by comprising:

a weight calculation unit, which calculates the weight with which each patient element instance is diagnosed as each disease, wherein the patient element instance comprises patient case information;

a vector establishing unit, which establishes a disease vector for each disease according to the calculated weight values, wherein the weight values serve as the elements of the disease vector;

7. The disease similarity calculation device based on big data group behavior according to claim 6, characterized in that the weight calculation unit comprises:

a weight value calculation subunit, which calculates the number of times each patient element instance is diagnosed as each disease, obtains the frequency of this count in all data, and uses the frequency as the weight value with which each patient element instance is diagnosed as each disease.

8. The disease similarity calculation device based on big data group behavior according to claim 6, characterized in that the similarity calculation unit comprises:

9. The disease similarity calculation device based on big data group behavior according to claim 8, characterized in that the calculation formula of the cosine distance calculation subunit is:

\mathrm{sim}(\vec{D_i}, \vec{D_j}) = \cos\langle \vec{D_i}, \vec{D_j} \rangle = \frac{\sum_{k=1}^{H} d_{ik} \cdot d_{jk}}{\sqrt{\sum_{k=1}^{H} d_{ik}^{2}} \cdot \sqrt{\sum_{k=1}^{H} d_{jk}^{2}}};

10. The disease similarity calculation device based on big data group behavior according to claim 8 or 9, characterized in that, when the weight value with which a patient element instance is diagnosed as a certain disease is too low, the calculation formula of the cosine distance calculation subunit is:

\mathrm{sim}(\vec{D_i}, \vec{D_j}) = \frac{\sum_{k \in T_{ij}} d_{ik} \cdot d_{jk}}{\sqrt{\sum_{k \in T_{i}} d_{ik}^{2}} \cdot \sqrt{\sum_{k \in T_{j}} d_{jk}^{2}}};