CN106021871A - Disease similarity calculation method and device based on big data group behaviors - Google Patents
- Publication number
- CN106021871A (CN201610307328.7A)
- Authority
- CN
- China
- Prior art keywords
- disease
- weights
- diagnosed
- similarity
- unit example
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Pending
Classifications
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F18/00—Pattern recognition
- G06F18/20—Analysing
- G06F18/22—Matching criteria, e.g. proximity measures
-
- G—PHYSICS
- G16—INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR SPECIFIC APPLICATION FIELDS
- G16H—HEALTHCARE INFORMATICS, i.e. INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR THE HANDLING OR PROCESSING OF MEDICAL OR HEALTHCARE DATA
- G16H50/00—ICT specially adapted for medical diagnosis, medical simulation or medical data mining; ICT specially adapted for detecting, monitoring or modelling epidemics or pandemics
- G16H50/70—ICT specially adapted for medical diagnosis, medical simulation or medical data mining; ICT specially adapted for detecting, monitoring or modelling epidemics or pandemics for mining of medical data, e.g. analysing previous cases of other patients
Landscapes
- Engineering & Computer Science (AREA)
- Data Mining & Analysis (AREA)
- Public Health (AREA)
- Health & Medical Sciences (AREA)
- Theoretical Computer Science (AREA)
- Medical Informatics (AREA)
- Primary Health Care (AREA)
- Computer Vision & Pattern Recognition (AREA)
- Epidemiology (AREA)
- General Health & Medical Sciences (AREA)
- Databases & Information Systems (AREA)
- Life Sciences & Earth Sciences (AREA)
- Artificial Intelligence (AREA)
- Bioinformatics & Cheminformatics (AREA)
- Bioinformatics & Computational Biology (AREA)
- Pathology (AREA)
- Evolutionary Biology (AREA)
- Evolutionary Computation (AREA)
- Physics & Mathematics (AREA)
- General Engineering & Computer Science (AREA)
- General Physics & Mathematics (AREA)
- Biomedical Technology (AREA)
- Medical Treatment And Welfare Office Work (AREA)
Abstract
The invention discloses a disease similarity calculation method and device based on big data group behaviors. The method comprises the following steps: calculating, for each patient element instance, the weight with which it is diagnosed as each disease, wherein the patient element instance comprises patient case information; establishing a disease vector for each disease from the calculated weight values, wherein the weight values serve as the elements of the disease vector; and calculating disease similarity from the disease vectors. The method and device calculate the similarity among diseases from the social perspective of diseases, according to the disease diagnosis and treatment behavior of a large population, and can be used to identify diseases that are likely to be misdiagnosed but have no correlation at the level of cells, genes, and the like.
Description
Technical field
The present invention relates to the field of disease similarity calculation, and in particular to a disease similarity calculation method and device based on big data group behavior.
Background art
Current methods for calculating disease similarity generally rely on attributes of the diseases, such as inclusion relations between diseases ('breast cancer' includes 'male breast cancer' and 'female breast cancer') and shared factors between diseases (common pathogenic genes, common drugs, common metabolites, and so on). Disease similarity is generally calculated from one of two perspectives:
1. Calculating disease similarity based on semantic association.
In biomedicine, ontologies such as the Gene Ontology and the Human Phenotype Ontology are often used to calculate the semantic similarity of terms. However, only a few of these methods have been applied to disease similarity. Resnik's method is currently the most common. It is mostly applied to the Gene Ontology to calculate the similarity of gene function, cellular component, and biological process terms, and it has a clear advantage over several other methods (union-intersection, longest shared path, JC). Resnik's method uses the 'is_a' relation in the ontology to calculate term similarity: the similarity of a disease pair depends on the common ancestor node of the pair that has the greatest information content. Lin's method improves the way Resnik's method compares information content, refining Resnik's method from a theoretical standpoint. Researchers have implemented both the Resnik and Lin methods in an R package to make disease similarity easier to calculate. The method proposed by Wang et al. optimizes Resnik's method further: when calculating the similarity of a disease pair, it considers not only the common ancestor node with the greatest information content but also the pair's other common ancestor nodes. The advantages of this method are most evident on the Gene Ontology, and it has also been used to calculate the semantic similarity of disease terms in the Medical Subject Headings.
2. Calculating disease similarity based on disease-related genes.
Associations between diseases are reflected not only in disease-related ontologies but also in shared pathogenic genes. Researchers have therefore also focused on calculating disease similarity from the pathogenic genes of diseases. Two gene-based methods currently exist.
(1) The first is the method based on shared disease genes (based on overlapping gene set, BOG). It compares the number of associated genes that diseases have in common and derives disease similarity from that number. Compared with the semantic approach, it finds similar disease pairs from an entirely different angle and can therefore uncover previously unknown disease associations. However, it does not consider functional associations between disease genes when calculating similarity, even though such associations clearly influence disease similarity.
(2) The second method calculates disease similarity based on process similarity (process-similarity based, PSB), where 'process' refers to the Gene Ontology biological process terms associated with the pathogenic genes. Because it takes the functional associations of disease genes into account, it is a considerable improvement over BOG. PSB also performs well compared with the Resnik, Lin, LC, and JC methods. Functional associations between genes take many forms, for example gene co-expression, protein interaction, and Gene Ontology terms. In addition, to improve the performance of disease similarity methods, the FunSim method calculates disease similarity using a comprehensively weighted human gene association network.
Therefore, if two diseases have largely the same gene function, cellular component, biological process, or common pathogenic genes, both the semantic-association approach and the disease-gene approach are effective at calculating their similarity, which is very useful for disease research. However, for two diseases that are easily mistaken for each other but share no cellular, genetic, or similar association, both approaches perform poorly.
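As a rough illustration of the semantic approach described above, the following sketch computes a Resnik-style similarity as the information content of the most informative common ancestor. The toy ontology, the term probabilities, and all function names are hypothetical and chosen only for illustration; real implementations work on a full ontology with annotation statistics.

```python
import math

# Hypothetical toy ontology: term -> list of parents along 'is_a' edges.
PARENTS = {
    "disease": [],
    "cancer": ["disease"],
    "breast cancer": ["cancer"],
    "lung cancer": ["cancer"],
    "infection": ["disease"],
}

# Hypothetical annotation probabilities p(term); the root covers everything.
PROB = {
    "disease": 1.0,
    "cancer": 0.4,
    "breast cancer": 0.1,
    "lung cancer": 0.2,
    "infection": 0.5,
}


def ancestors(term):
    """Return the term itself plus every term reachable via 'is_a' edges."""
    seen = {term}
    stack = [term]
    while stack:
        for parent in PARENTS[stack.pop()]:
            if parent not in seen:
                seen.add(parent)
                stack.append(parent)
    return seen


def information_content(term):
    """IC(t) = -log p(t): rarer terms carry more information."""
    return -math.log(PROB[term])


def resnik_similarity(a, b):
    """Information content of the most informative common ancestor of a and b."""
    common = ancestors(a) & ancestors(b)
    return max(information_content(t) for t in common)
```

In this toy data, two cancers share the ancestor "cancer", while a cancer and an infection share only the root, whose information content is zero. Two diseases that are often confused in the clinic but sit far apart in the ontology would score low in the same way, which is the limitation noted above.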
Summary of the invention
The main object of the present invention is to provide a disease similarity calculation method and device based on big data group behavior, which calculate the similarity between diseases from the social perspective of diseases, according to the disease diagnosis and treatment behavior of a large population, and which can be used to identify diseases that are easily misdiagnosed but have no cellular, genetic, or similar association.
The present invention proposes a disease similarity calculation method based on big data group behavior, comprising the steps of:
calculating, for each patient element instance, the weight with which it is diagnosed as each disease, wherein the patient element instance comprises patient case information;
establishing a disease vector for each disease from the calculated weight values, wherein the weight values serve as the elements of the disease vector;
calculating disease similarity from the disease vectors.
Further, the step of calculating the weight with which each patient element instance is diagnosed as each disease comprises:
counting how often each patient element instance is diagnosed as each disease, obtaining the frequency of that count in all the data, and using the frequency as the value of the weight with which the patient element instance is diagnosed as each disease.
Further, the step of calculating disease similarity from the disease vectors comprises:
calculating the similarity of two diseases from their disease vectors using cosine distance.
Further, the formula for calculating the similarity of two diseases from their disease vectors using cosine distance is:
$$\mathrm{Sim}(D_i, D_j) = \frac{\sum_{k=1}^{H} d_{ik}\, d_{jk}}{\sqrt{\sum_{k=1}^{H} d_{ik}^{2}}\;\sqrt{\sum_{k=1}^{H} d_{jk}^{2}}}$$
where $\mathrm{Sim}(D_i, D_j)$ is the disease similarity between the two disease vectors; $d_{ik}$ and $d_{jk}$ are the weights with which the $k$-th patient element instance is diagnosed as diseases $D_i$ and $D_j$, respectively; $H$ is the total number of patient element instances; and $k$ is a natural number.
Further, when the weight with which a patient element instance is diagnosed as a certain disease is very low, the formula for calculating the similarity of two diseases from their disease vectors using cosine distance is:
$$\mathrm{Sim}(D_i, D_j) = \frac{\sum_{k \in T_{ij}} d_{ik}\, d_{jk}}{\sqrt{\sum_{k \in T_i} d_{ik}^{2}}\;\sqrt{\sum_{k \in T_j} d_{jk}^{2}}}$$
where $\mathrm{Sim}(D_i, D_j)$ is the disease similarity between the two disease vectors; $d_{ik}$ and $d_{jk}$ are the weights with which the $k$-th patient element instance is diagnosed as diseases $D_i$ and $D_j$, respectively, with values above the set weight; $k$ is a natural number; $T_i$ is the set of patient element instances whose weight for disease $D_i$ is above the set weight; $T_j$ is the set of patient element instances whose weight for disease $D_j$ is above the set weight; and $T_{ij}$ is the intersection of the two, that is, the patient element instances whose weights for both $D_i$ and $D_j$ are above the set weight.
The present invention also provides a disease similarity calculation device based on big data group behavior, comprising:
a weight calculation unit, which calculates, for each patient element instance, the weight with which it is diagnosed as each disease, wherein the patient element instance comprises patient case information;
a vector establishing unit, which establishes a disease vector for each disease from the calculated weight values, wherein the weight values serve as the elements of the disease vector;
a similarity calculation unit, which calculates disease similarity from the disease vectors.
Further, the weight calculation unit comprises:
a weight value calculation subunit, which counts how often each patient element instance is diagnosed as each disease, obtains the frequency of that count in all the data, and uses the frequency as the value of the weight with which the patient element instance is diagnosed as each disease.
Further, the similarity calculation unit comprises:
a cosine distance calculation subunit, which calculates the similarity of two diseases from their disease vectors using cosine distance.
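Further, the calculation formula of the cosine distance calculation subunit is:
$$\mathrm{Sim}(D_i, D_j) = \frac{\sum_{k=1}^{H} d_{ik}\, d_{jk}}{\sqrt{\sum_{k=1}^{H} d_{ik}^{2}}\;\sqrt{\sum_{k=1}^{H} d_{jk}^{2}}}$$
where $\mathrm{Sim}(D_i, D_j)$ is the disease similarity between the two disease vectors; $d_{ik}$ and $d_{jk}$ are the weights with which the $k$-th patient element instance is diagnosed as diseases $D_i$ and $D_j$, respectively; $H$ is the total number of patient element instances; and $k$ is a natural number.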
Further, when the weight with which a patient element instance is diagnosed as a certain disease is very low, the calculation formula of the cosine distance calculation subunit is:
$$\mathrm{Sim}(D_i, D_j) = \frac{\sum_{k \in T_{ij}} d_{ik}\, d_{jk}}{\sqrt{\sum_{k \in T_i} d_{ik}^{2}}\;\sqrt{\sum_{k \in T_j} d_{jk}^{2}}}$$
where $\mathrm{Sim}(D_i, D_j)$ is the disease similarity between the two disease vectors; $d_{ik}$ and $d_{jk}$ are the weights with which the $k$-th patient element instance is diagnosed as diseases $D_i$ and $D_j$, respectively, with values above the set weight; $k$ is a natural number; $T_i$ is the set of patient element instances whose weight for disease $D_i$ is above the set weight; $T_j$ is the set of patient element instances whose weight for disease $D_j$ is above the set weight; and $T_{ij}$ is the intersection of the two, that is, the patient element instances whose weights for both $D_i$ and $D_j$ are above the set weight.
The disease similarity calculation method and device based on big data group behavior provided by the present invention have the following beneficial effects:
They calculate the similarity between diseases from the social perspective of diseases, according to the disease diagnosis and treatment behavior of a large population, and can be used to identify diseases that are easily misdiagnosed but have no cellular, genetic, or similar association. A disease vector is established from the weights with which each patient element instance is diagnosed as each disease, and disease similarity is calculated by cosine distance, so that the most easily misdiagnosed diseases can be identified. When the weight with which a patient element instance is diagnosed as a certain disease is very low, the diagnosis of that instance as that disease can be regarded as having very low credibility and can be ignored when calculating disease similarity.
Brief description of the drawings
Fig. 1 is a schematic diagram of the steps of the disease similarity calculation method based on big data group behavior in one embodiment of the present invention;
Fig. 2 is a schematic diagram of the steps of the disease similarity calculation method based on big data group behavior in another embodiment of the present invention;
Fig. 3 is a schematic structural diagram of the disease similarity calculation device based on big data group behavior in one embodiment of the present invention;
Fig. 4 is a schematic structural diagram of the disease similarity calculation device based on big data group behavior in another embodiment of the present invention;
Fig. 5 is a schematic structural diagram of the weight calculation unit in one embodiment of the present invention;
Fig. 6 is a schematic structural diagram of the similarity calculation unit in one embodiment of the present invention.
The realization of the object, the functional features, and the advantages of the present invention are further described below with reference to the accompanying drawings and in conjunction with the embodiments.
Detailed description of the embodiments
It should be understood that the specific embodiments described here serve only to explain the present invention and are not intended to limit it.
Fig. 1 is a schematic diagram of the steps of the disease similarity calculation method based on big data group behavior in one embodiment of the present invention.
One embodiment of the present invention proposes a disease similarity calculation method based on big data group behavior, comprising:
Step S1: calculate, for each patient element instance, the weight with which it is diagnosed as each disease. The patient element instance comprises patient case information, which includes at least: patient sex, age, symptom, symptom site, refined symptom description, and so on.
Step S2: establish a disease vector for each disease from the calculated weight values; the weight values serve as the elements of the disease vector.
Step S3: calculate disease similarity from the disease vectors.
At present, disease similarity is generally calculated either from semantic association or from disease-related genes. If two diseases have largely the same gene function, cellular component, biological process, or common pathogenic genes, both approaches are effective, which is very useful for disease research. However, for two diseases that are easily misdiagnosed as each other but have no cellular, genetic, or similar association, both approaches perform poorly. The disease similarity calculation method based on big data group behavior provided in this embodiment quantifies the similarity between diseases. The calculation does not use attribute relations between diseases; it uses the medical consultation behavior of a large population instead. A large volume of data on patients seeking medical care for each disease is collected first, the diseases are modeled from this group diagnosis behavior, and a suitable similarity algorithm is then chosen to calculate the similarity between each pair of diseases, so that easily misdiagnosed diseases can be identified. The method can be used in systems such as computer-aided disease diagnosis and disease self-diagnosis. It also plays an important role in research on the pathogenesis of complex diseases, in the early prevention and diagnosis of major and infectious diseases, in new drug development, and so on.
With reference to Fig. 2, the following step may also be included before step S1:
Step S0: build patient element instances from case information such as patient sex, age, symptom, symptom site, and refined symptom description. Step S0 may comprise the following steps:
Step 1: Collect disease diagnosis data for patients of different sexes and ages. From each case, extract the following information (without being limited to it): patient sex, patient age, symptom, symptom site, refined symptom description, and diagnosed disease. Patient sex is denoted $G = \{G_1, G_2\}$, where $G_i$ is one of the 2 values that sex $G$ can take; patient age is denoted $A$; symptoms are denoted $S = \{S_1, S_2, \ldots, S_L\}$, where $S_i$ is one of the $L$ values that symptom $S$ can take; symptom sites are denoted $B = \{B_1, B_2, \ldots, B_M\}$, where $B_i$ is one of the $M$ values that site $B$ can take; diseases are denoted $D = \{D_1, D_2, \ldots, D_N\}$, where $D_i$ is one of the $N$ values that disease $D$ can take.
Step 2: Discretize the age data into $K$ bands: $A = \{A_1, A_2, \ldots, A_K\}$, where $A_i$ is one of the $K$ values that age $A$ can take. The age data can be divided in either of the following two ways:
The first, with K = 5:
(1) childhood: 0 to 6 years; (2) juvenile: 7 to 17 years; (3) young: 18 to 40 years; (4) middle-aged: 41 to 65 years; (5) elderly: 66 years and over.
The second, with K = 14:
(1) infancy: 0 to 3 weeks; (2) early childhood: 4 weeks to 2.5 years; (3) preschool period: over 2.5 years to 6 years; (4) enlightenment period: 7 to 10 years; (5) rebellious period: 11 to 14 years; (6) growth period: 15 to 17 years; (7) youth: 18 to 28 years; (8) maturity: 29 to 40 years; (9) prime: 41 to 48 years; (10) steady period: 49 to 55 years; (11) adjustment period: 56 to 65 years; (12) early old age: 67 to 72 years; (13) middle old age: 73 to 84 years; (14) advanced old age: 85 years and over.
Step 3: For each symptom, consolidate its refined descriptions into 3 to 5 refined symptom descriptions according to the symptom's time of onset, severity, triggering factors, and so on: $S_i = \{S_{i1}, S_{i2}, S_{i3}, S_{i4}, S_{i5}\}$, where $S_{i1}, S_{i2}, S_{i3}, S_{i4}, S_{i5}$ are the refined descriptions of symptom $S_i$. The number of refined descriptions may differ from symptom to symptom.
Step 4: Combine the values of patient sex $G = \{G_1, G_2\}$, patient age $A = \{A_1, A_2, \ldots, A_K\}$, symptom $S = \{S_1, S_2, \ldots, S_L\}$, symptom site $B = \{B_1, B_2, \ldots, B_M\}$, and refined symptom description $S_i = \{S_{i1}, S_{i2}, S_{i3}, S_{i4}, S_{i5}\}$ into patient element instances $E = \{E_1, E_2, \ldots, E_H\}$. For example, $E_1$ denotes the patient element instance "sex is $G_1$, age is $A_1$, symptom is $S_1$, site is $B_1$, refined symptom description is $S_{11}$", and $H$ is the number of element instances. In theory $H = 2 \times K \times L \times M \times 5$; in practice, because some symptoms occur only in certain age bands or in one sex, $H < 2 \times K \times L \times M \times 5$.
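Steps 2 and 4 can be sketched as follows: an age in whole years is mapped to one of the K = 5 bands, and a raw case record is reduced to a tuple that identifies its patient element instance. The record field names (`gender`, `age`, `symptom`, `site`, `refinement`) and the function names are hypothetical; the source does not prescribe a data layout.

```python
# Age bands of the first scheme (K = 5): (label, inclusive upper bound in years).
# Ages are assumed to be whole years.
AGE_BANDS = [("childhood", 6), ("juvenile", 17), ("young", 40), ("middle-aged", 65)]


def age_band(age_years):
    """Map an age in whole years to one of the K = 5 bands."""
    for label, upper in AGE_BANDS:
        if age_years <= upper:
            return label
    return "elderly"  # 66 years and over


def to_element_instance(case):
    """Reduce a raw case record (a dict) to its patient element instance key."""
    return (
        case["gender"],
        age_band(case["age"]),
        case["symptom"],
        case["site"],
        case["refinement"],
    )


def theoretical_instance_count(k, l, m, refinements_per_symptom=5):
    """Theoretical upper bound H = 2 * K * L * M * 5 on the number of instances."""
    return 2 * k * l * m * refinements_per_symptom


# Hypothetical case record.
example = to_element_instance(
    {"gender": "F", "age": 30, "symptom": "cough", "site": "chest", "refinement": "dry, at night"}
)
```

Using a plain tuple as the instance key keeps instances hashable, so they can serve directly as dictionary keys when diagnoses are counted per instance.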
Further, in step S1, calculating the weight with which each patient element instance is diagnosed as each disease comprises:
counting how often each patient element instance is diagnosed as each disease, obtaining the frequency of that count in all the data, and using the frequency as the value of the weight with which the patient element instance is diagnosed as each disease.
In this embodiment, the weight $W_{ij}$ of each patient element instance $E_i$ for each disease $D_j$ is calculated as follows:
Using the large volume of case data collected above (the data volume must be large enough to be statistically meaningful), count, for each patient element instance $E_i$, how often it is diagnosed as each disease $\{D_1, D_2, \ldots, D_N\}$, giving the counts $\{F_{i1}, F_{i2}, \ldots, F_{iN}\}$, and then convert each count into a frequency. This frequency is used as the weight of patient element instance $E_i$ for each disease $\{D_1, D_2, \ldots, D_N\}$, which yields the patient element instance by disease weight matrix $W_{ij}$.
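The counting step can be sketched as below. The normalization is an assumption: here each count F_ij is divided by the total number of cases observed for instance E_i, so the weights of one instance sum to 1. The source only states that counts are converted into frequencies, so other denominators are possible. The function name and the sample labels are hypothetical.

```python
from collections import Counter, defaultdict


def weight_matrix(cases):
    """Build W[instance][disease] from (element_instance, disease) pairs.

    Assumption: each count F_ij is divided by the total count of instance E_i.
    """
    counts = defaultdict(Counter)
    for instance, disease in cases:
        counts[instance][disease] += 1  # F_ij

    weights = {}
    for instance, per_disease in counts.items():
        total = sum(per_disease.values())
        weights[instance] = {d: f / total for d, f in per_disease.items()}
    return weights


# Hypothetical cases: instance "E1" was diagnosed twice as flu and once as cold.
cases = [("E1", "flu"), ("E1", "flu"), ("E1", "cold"), ("E2", "cold")]
W = weight_matrix(cases)
```

Diseases that never occur for an instance are simply absent from its row, which keeps the matrix sparse; a reader of the matrix treats a missing entry as weight 0.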
Further, in step S2, establishing a disease vector for each disease from the calculated weight values comprises: using the values of the patient element instance by disease weight matrix $W_{ij}$ as the elements of the disease vector:
$$D_j = (W_{1j}, W_{2j}, \ldots, W_{Hj})$$
Further, in step S3, calculating disease similarity from the disease vectors comprises:
calculating the similarity of two diseases from their disease vectors using cosine distance.
Further, a disease vector is a high-dimensional sparse vector, and the formula for calculating the similarity of two diseases from their disease vectors using cosine distance is:
$$\mathrm{Sim}(D_i, D_j) = \frac{\sum_{k=1}^{H} d_{ik}\, d_{jk}}{\sqrt{\sum_{k=1}^{H} d_{ik}^{2}}\;\sqrt{\sum_{k=1}^{H} d_{jk}^{2}}}$$
where $\mathrm{Sim}(D_i, D_j)$ is the disease similarity between the two disease vectors; $d_{ik}$ and $d_{jk}$ are the weights with which the $k$-th patient element instance is diagnosed as diseases $D_i$ and $D_j$, respectively; $H$ is the total number of patient element instances; and $k$ is a natural number.
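A minimal sketch of steps S2 and S3, assuming the weights are held in a nested dictionary keyed by instance and then by disease. The dictionary layout, the sample weights, and the function names are hypothetical. Returning 0.0 for an all-zero vector is a choice made here; the source does not cover that case.

```python
import math


def disease_vector(weights, disease, instances):
    """Collect the weights of one disease over all H element instances."""
    return [weights.get(e, {}).get(disease, 0.0) for e in instances]


def cosine_similarity(u, v):
    """Cosine of the angle between two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0
    return dot / (norm_u * norm_v)


# Hypothetical weights W[instance][disease].
W = {"E1": {"flu": 0.5, "cold": 0.5}, "E2": {"cold": 1.0}, "E3": {"flu": 1.0}}
instances = sorted(W)
flu = disease_vector(W, "flu", instances)    # [0.5, 0.0, 1.0]
cold = disease_vector(W, "cold", instances)  # [0.5, 1.0, 0.0]
similarity = cosine_similarity(flu, cold)
```

Only instance E1 contributes to the numerator here, because it is the only instance with a non-zero weight for both diseases. Instances that are diagnosed as either disease are what drive the similarity up.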
Further, when the weight with which a patient element instance is diagnosed as a certain disease is very low, the credibility of diagnosing that instance as that disease is very low, and the weight can generally be ignored when calculating disease similarity. The formula for calculating the similarity of two diseases from their disease vectors using cosine distance is then adjusted to:
$$\mathrm{Sim}(D_i, D_j) = \frac{\sum_{k \in T_{ij}} d_{ik}\, d_{jk}}{\sqrt{\sum_{k \in T_i} d_{ik}^{2}}\;\sqrt{\sum_{k \in T_j} d_{jk}^{2}}}$$
where $\mathrm{Sim}(D_i, D_j)$ is the disease similarity between the two disease vectors; $d_{ik}$ and $d_{jk}$ are the weights with which the $k$-th patient element instance is diagnosed as diseases $D_i$ and $D_j$, respectively, with values above the set weight; $k$ is a natural number; $T_i$ is the set of patient element instances whose weight for disease $D_i$ is above the set weight; $T_j$ is the set of patient element instances whose weight for disease $D_j$ is above the set weight; and $T_{ij}$ is the intersection of the two, that is, the patient element instances whose weights for both $D_i$ and $D_j$ are above the set weight. The set weight is a minimum threshold chosen according to the practical situation: when the weight with which a patient element instance is diagnosed as $D_i$ or $D_j$ is below this threshold, its credibility is judged to be low and the weight of that patient element instance is ignored when calculating disease similarity.
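The thresholded variant can be sketched as follows, assuming two equal-length disease vectors indexed by patient element instance. The function name and the sample threshold are hypothetical, and the exact form of the adjusted formula is an interpretation of the definitions of the sets T_i, T_j, and T_ij given above.

```python
import math


def thresholded_similarity(d_i, d_j, threshold):
    """Cosine-style similarity that ignores weights not above `threshold`.

    d_i and d_j are equal-length disease vectors indexed by element instance.
    """
    t_i = {k for k, w in enumerate(d_i) if w > threshold}  # instances kept for D_i
    t_j = {k for k, w in enumerate(d_j) if w > threshold}  # instances kept for D_j
    t_ij = t_i & t_j                                       # kept for both diseases
    if not t_i or not t_j:
        return 0.0
    dot = sum(d_i[k] * d_j[k] for k in t_ij)
    norm_i = math.sqrt(sum(d_i[k] ** 2 for k in t_i))
    norm_j = math.sqrt(sum(d_j[k] ** 2 for k in t_j))
    return dot / (norm_i * norm_j)


# Hypothetical vectors: the weights 0.01 and 0.02 fall below the threshold 0.05.
score = thresholded_similarity([0.5, 0.01, 1.0], [0.5, 1.0, 0.02], threshold=0.05)
```

Dropping low weights removes spurious overlap caused by rare, unreliable diagnoses and also shrinks the number of terms that have to be summed for sparse, high-dimensional vectors.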
Fig. 3 is a schematic structural diagram of the disease similarity calculation device based on big data group behavior in one embodiment of the present invention.
One embodiment of the present invention also provides a disease similarity calculation device based on big data group behavior, comprising:
a weight calculation unit 10, which calculates, for each patient element instance, the weight with which it is diagnosed as each disease, wherein the patient element instance comprises patient case information;
a vector establishing unit 20, which establishes a disease vector for each disease from the calculated weight values, wherein the weight values serve as the elements of the disease vector;
a similarity calculation unit 30, which calculates disease similarity from the disease vectors.
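The three units can be pictured as small classes chained by a device object, as in the sketch below. The class and method names are hypothetical, and the per-instance normalization in the weight calculation unit is an assumption; the source describes the units only functionally.

```python
import math
from collections import Counter, defaultdict


class WeightCalculationUnit:
    """Unit 10: per-instance diagnosis frequencies (normalization is an assumption)."""

    def run(self, cases):
        counts = defaultdict(Counter)
        for instance, disease in cases:
            counts[instance][disease] += 1
        return {
            e: {d: f / sum(c.values()) for d, f in c.items()}
            for e, c in counts.items()
        }


class VectorEstablishingUnit:
    """Unit 20: one vector per disease, indexed by element instance."""

    def run(self, weights):
        instances = sorted(weights)
        diseases = sorted({d for row in weights.values() for d in row})
        return {d: [weights[e].get(d, 0.0) for e in instances] for d in diseases}


class SimilarityCalculationUnit:
    """Unit 30: cosine measure between two disease vectors."""

    def run(self, u, v):
        dot = sum(a * b for a, b in zip(u, v))
        norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
        return dot / norm if norm else 0.0


class DiseaseSimilarityDevice:
    """Chains unit 10 -> unit 20 -> unit 30."""

    def __init__(self):
        self.weight_unit = WeightCalculationUnit()
        self.vector_unit = VectorEstablishingUnit()
        self.similarity_unit = SimilarityCalculationUnit()

    def similarity(self, cases, disease_a, disease_b):
        vectors = self.vector_unit.run(self.weight_unit.run(cases))
        return self.similarity_unit.run(vectors[disease_a], vectors[disease_b])


# Hypothetical cases as (element_instance, disease) pairs.
device = DiseaseSimilarityDevice()
result = device.similarity(
    [("E1", "flu"), ("E1", "cold"), ("E2", "cold"), ("E3", "flu")], "flu", "cold"
)
```

Keeping each unit behind a single `run` method mirrors the unit boundaries of the device, so a different similarity measure could be swapped in without touching the other two units.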
At present, disease similarity is generally calculated either from semantic association or from disease-related genes. If two diseases have largely the same gene function, cellular component, biological process, or common pathogenic genes, both approaches are effective, which is very useful for disease research. However, for two diseases that are easily misdiagnosed as each other but have no cellular, genetic, or similar association, both approaches perform poorly. The disease similarity calculation device based on big data group behavior provided in this embodiment quantifies the similarity between diseases. The calculation does not use attribute relations between diseases; it uses the medical consultation behavior of a large population instead. A large volume of data on patients seeking medical care for each disease is collected first, the diseases are modeled from this group diagnosis behavior, and a suitable similarity algorithm is then chosen to calculate the similarity between each pair of diseases, so that the most easily misdiagnosed diseases can be identified. The device can be used in systems such as computer-aided disease diagnosis and disease self-diagnosis. It also plays an important role in research on the pathogenesis of complex diseases, in the early prevention and diagnosis of major and infectious diseases, in new drug development, and so on.
With reference to Fig. 4, the disease similarity calculation device based on big data group behavior may also comprise:
an element instance construction unit 1, which builds patient element instances from case information such as patient sex, age, symptom, symptom site, and refined symptom description. The element instance construction unit 1 may build patient element instances as follows:
Step 1: Collect disease diagnosis data for patients of different sexes and ages. From each case, extract the following information (without being limited to it): patient sex, patient age, symptom, symptom site, refined symptom description, and diagnosed disease. Patient sex is denoted $G = \{G_1, G_2\}$, where $G_i$ is one of the 2 values that sex $G$ can take; patient age is denoted $A$; symptoms are denoted $S = \{S_1, S_2, \ldots, S_L\}$, where $S_i$ is one of the $L$ values that symptom $S$ can take; symptom sites are denoted $B = \{B_1, B_2, \ldots, B_M\}$, where $B_i$ is one of the $M$ values that site $B$ can take; diseases are denoted $D = \{D_1, D_2, \ldots, D_N\}$, where $D_i$ is one of the $N$ values that disease $D$ can take.
Step 2: Discretize the age data into $K$ bands: $A = \{A_1, A_2, \ldots, A_K\}$, where $A_i$ is one of the $K$ values that age $A$ can take. The age data can be divided in either of the following two ways:
The first, with K = 5:
(1) childhood: 0 to 6 years; (2) juvenile: 7 to 17 years; (3) young: 18 to 40 years; (4) middle-aged: 41 to 65 years; (5) elderly: 66 years and over.
The second, with K = 14:
(1) infancy: 0 to 3 weeks; (2) early childhood: 4 weeks to 2.5 years; (3) preschool period: over 2.5 years to 6 years; (4) enlightenment period: 7 to 10 years; (5) rebellious period: 11 to 14 years; (6) growth period: 15 to 17 years; (7) youth: 18 to 28 years; (8) maturity: 29 to 40 years; (9) prime: 41 to 48 years; (10) steady period: 49 to 55 years; (11) adjustment period: 56 to 65 years; (12) early old age: 67 to 72 years; (13) middle old age: 73 to 84 years; (14) advanced old age: 85 years and over.
Step 3: For each symptom, consolidate its refined descriptions into 3 to 5 refined symptom descriptions according to the symptom's time of onset, severity, triggering factors, and so on: $S_i = \{S_{i1}, S_{i2}, S_{i3}, S_{i4}, S_{i5}\}$, where $S_{i1}, S_{i2}, S_{i3}, S_{i4}, S_{i5}$ are the refined descriptions of symptom $S_i$. The number of refined descriptions may differ from symptom to symptom.
Step 4: Combine the values of patient sex $G = \{G_1, G_2\}$, patient age $A = \{A_1, A_2, \ldots, A_K\}$, symptom $S = \{S_1, S_2, \ldots, S_L\}$, symptom site $B = \{B_1, B_2, \ldots, B_M\}$, and refined symptom description $S_i = \{S_{i1}, S_{i2}, S_{i3}, S_{i4}, S_{i5}\}$ into patient element instances $E = \{E_1, E_2, \ldots, E_H\}$. For example, $E_1$ denotes the patient element instance "sex is $G_1$, age is $A_1$, symptom is $S_1$, site is $B_1$, refined symptom description is $S_{11}$", and $H$ is the number of element instances. In theory $H = 2 \times K \times L \times M \times 5$; in practice, because some symptoms occur only in certain age bands or in one sex, $H < 2 \times K \times L \times M \times 5$.
Further, with reference to Fig. 5, the weight calculation unit 10 comprises:
a weight value calculation subunit 100, which counts how often each patient element instance is diagnosed as each disease, obtains the frequency of that count in all the data, and uses the frequency as the value of the weight with which the patient element instance is diagnosed as each disease.
In this embodiment, the weight value calculation subunit 100 calculates the weight $W_{ij}$ of each patient element instance $E_i$ for each disease $D_j$ as follows:
Using the large volume of case data collected above (the data volume must be large enough to be statistically meaningful), count, for each patient element instance $E_i$, how often it is diagnosed as each disease $\{D_1, D_2, \ldots, D_N\}$, giving the counts $\{F_{i1}, F_{i2}, \ldots, F_{iN}\}$, and then convert each count into a frequency. This frequency is used as the weight of patient element instance $E_i$ for each disease $\{D_1, D_2, \ldots, D_N\}$, which yields the patient element instance by disease weight matrix $W_{ij}$.
Further, the vector establishing unit 20 establishes a disease vector for each disease from the calculated weight values by using the values of the patient element instance by disease weight matrix $W_{ij}$ as the elements of the disease vector:
$$D_j = (W_{1j}, W_{2j}, \ldots, W_{Hj})$$
Further, with reference to Fig. 6, the similarity calculation unit 30 comprises:
a cosine distance calculation subunit 300, which calculates the similarity of two diseases from their disease vectors using cosine distance.
Further, the disease vector is a high-dimensional sparse vector, and the calculation formula of the cosine distance computation subunit 300 is:
where the computed value is the disease similarity between the two disease vectors; d_ik and d_jk are the weights with which each patient unit example is diagnosed as diseases D_i and D_j, respectively; H is the total number of patient unit examples; and k is a natural number.
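The definitions above match the standard cosine similarity of two vectors, which can be written as follows. This is a reconstruction from those definitions, and the name sim(D_i, D_j) for the left-hand side is an assumption, since the original symbol does not survive in this text.

```latex
\[
\mathrm{sim}(D_i, D_j) =
\frac{\sum_{k=1}^{H} d_{ik}\, d_{jk}}
     {\sqrt{\sum_{k=1}^{H} d_{ik}^{2}}\;\sqrt{\sum_{k=1}^{H} d_{jk}^{2}}}
\]
```

Because the weights are non-negative, this value lies between 0 and 1, and only patient unit examples with a nonzero weight for both diseases contribute to the numerator.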
Further, when the weight with which a patient unit example is diagnosed as a certain disease is very low, the credibility of diagnosing that disease from this patient unit example is also low, and such a weight can generally be ignored when calculating disease similarity. The calculation formula of the cosine distance computation subunit 300 is then adjusted to:
where the computed value is the disease similarity between the two disease vectors; d_ik and d_jk are the weights with which each patient unit example is diagnosed as diseases D_i and D_j, respectively, each higher than the set weight; k is a natural number; T_i is the set of weights, higher than the set weight, with which the patient unit examples are diagnosed as disease D_i; T_j is the set of weights, higher than the set weight, with which the patient unit examples are diagnosed as disease D_j; and T_ij is the intersection of the weights, higher than the set weight, with which the patient unit examples are diagnosed as diseases D_i and D_j. The set weight is a minimum threshold chosen according to the actual situation: when the weight with which a patient unit example is diagnosed as disease D_i or D_j is below this threshold, its credibility is judged to be low, and the weight of that patient unit example is ignored when calculating disease similarity.
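The thresholded calculation described above can be sketched in Python over sparse disease vectors. This is an illustrative sketch, not the patent's implementation: the function name `cosine_similarity`, the parameter `min_weight`, and the sample vectors are hypothetical. It assumes that T_i and T_j supply the two norms, that T_ij supplies the dot product, and that a weight must be strictly higher than the set weight to be kept.

```python
import math


def cosine_similarity(d_i, d_j, min_weight=0.0):
    """Cosine similarity of two sparse disease vectors.

    d_i, d_j: dicts mapping a patient unit example key k to the weights
    d_ik and d_jk. Weights not higher than min_weight are ignored
    (ASSUMPTION: this is how the "set weight" threshold is applied).
    """
    t_i = {k: w for k, w in d_i.items() if w > min_weight}  # kept weights for D_i
    t_j = {k: w for k, w in d_j.items() if w > min_weight}  # kept weights for D_j
    t_ij = t_i.keys() & t_j.keys()                          # examples kept for both

    dot = sum(t_i[k] * t_j[k] for k in t_ij)
    norm_i = math.sqrt(sum(w * w for w in t_i.values()))
    norm_j = math.sqrt(sum(w * w for w in t_j.values()))
    if norm_i == 0.0 or norm_j == 0.0:
        return 0.0  # a disease with no retained weights is similar to nothing
    return dot / (norm_i * norm_j)


# Hypothetical disease vectors keyed by patient unit example.
d1 = {"E1": 0.6, "E2": 0.8, "E3": 0.01}
d2 = {"E1": 0.6, "E2": 0.8}
d3 = {"E4": 1.0}

plain = cosine_similarity(d1, d2)                      # no threshold
thresholded = cosine_similarity(d1, d2, min_weight=0.05)  # drops the 0.01 weight
disjoint = cosine_similarity(d1, d3)                   # no shared unit examples
```

With `min_weight=0.0` the function reduces to the plain cosine similarity, so one routine covers both the original and the adjusted formula. Iterating only over stored keys keeps the cost proportional to the number of nonzero weights rather than to H.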
In one embodiment, an experiment applied three disease similarity calculation methods or devices to "the 7 diseases most easily misdiagnosed". The calculation methods used were:
Method A: calculating disease similarity based on semantic association;
Method B: calculating disease similarity based on disease-related genes;
Method C: calculating disease similarity based on big data group behavior, as in the embodiment of the present invention.
The experimental results are shown in Table 1 below:
Table 1
The experimental results in Table 1 show the following:
If the gene functions, cellular components, biological processes, or shared pathogenic genes of two diseases are largely the same, then both the semantic-association method and the disease-related-gene method are effective for calculating disease similarity, and both are very useful for research on the science of disease.
However, for two diseases that are easily mistaken for each other but have no cellular or genetic association, the first two methods perform relatively poorly. The method and device of the embodiment of the present invention approach the problem from the social perspective of disease: they calculate disease similarity from the disease-diagnosis behavior of a big data population and can identify the 7 diseases most easily misdiagnosed.
The disease similarity calculation method and device based on big data group behavior provided in the embodiments of the present invention calculate the similarity between diseases from the social perspective of disease, according to the disease-diagnosis behavior of a big data population, and can be used to identify diseases that are easily misdiagnosed yet have no cellular or genetic association. A disease vector is built from the weights with which each patient unit example is diagnosed as each disease, and disease similarity is calculated by cosine distance, so that the most easily misdiagnosed diseases can be identified. When the weight with which a patient unit example is diagnosed as a certain disease is very low, the credibility of that diagnosis for this patient unit example is considered low, and the weight can be ignored when calculating disease similarity.
The foregoing is only a preferred embodiment of the present invention and does not thereby limit the patent scope of the present invention. Any equivalent structure or equivalent process transformation made using the description and drawings of the present invention, or any direct or indirect application in other related technical fields, is likewise included within the scope of patent protection of the present invention.
Claims (10)
1. A disease similarity calculation method based on big data group behavior, characterized by comprising the steps of:
calculating the weight with which each patient unit example is diagnosed as each disease, the patient unit example including patient case information;
building a disease vector for each disease from the calculated weight values, the weight values serving as the elements of the disease vector; and
calculating disease similarity from the disease vectors.
2. The disease similarity calculation method based on big data group behavior according to claim 1, characterized in that the step of calculating the weight with which each patient unit example is diagnosed as each disease comprises:
counting how many times each patient unit example is diagnosed as each disease, obtaining the relative frequency of that count within all the data, and using the relative frequency as the weight value with which the patient unit example is diagnosed as each disease.
3. The disease similarity calculation method based on big data group behavior according to claim 1, characterized in that the step of calculating disease similarity from the disease vectors comprises:
calculating the similarity of two diseases from the disease vectors using cosine distance.
4. The disease similarity calculation method based on big data group behavior according to claim 3, characterized in that the formula for calculating the similarity of two diseases from the disease vectors using cosine distance is:
where the computed value is the disease similarity between the two disease vectors; d_ik and d_jk are the weights with which each patient unit example is diagnosed as diseases D_i and D_j, respectively; H is the total number of patient unit examples; and k is a natural number.
5. The disease similarity calculation method based on big data group behavior according to claim 3 or 4, characterized in that, when the weight with which a patient unit example is diagnosed as a certain disease is very low, the formula for calculating the similarity of two diseases from the disease vectors using cosine distance is:
where the computed value is the disease similarity between the two disease vectors; d_ik and d_jk are the weights with which each patient unit example is diagnosed as diseases D_i and D_j, respectively, each higher than the set weight; k is a natural number; T_i is the set of weights, higher than the set weight, with which the patient unit examples are diagnosed as disease D_i; T_j is the set of weights, higher than the set weight, with which the patient unit examples are diagnosed as disease D_j; and T_ij is the conjunction of the weights, higher than the set weight, with which the patient unit examples are diagnosed as diseases D_i and D_j.
6. A disease similarity calculation device based on big data group behavior, characterized by comprising:
a weight calculation unit, which calculates the weight with which each patient unit example is diagnosed as each disease, the patient unit example including patient case information;
a vector establishing unit, which builds a disease vector for each disease from the calculated weight values, the weight values serving as the elements of the disease vector; and
a similarity calculation unit, which calculates disease similarity from the disease vectors.
7. The disease similarity calculation device based on big data group behavior according to claim 6, characterized in that the weight calculation unit comprises:
a weight computing subunit, which counts how many times each patient unit example is diagnosed as each disease, obtains the relative frequency of that count within all the data, and uses the relative frequency as the weight value with which the patient unit example is diagnosed as each disease.
8. The disease similarity calculation device based on big data group behavior according to claim 6, characterized in that the similarity calculation unit comprises:
a cosine distance computation subunit, which calculates the similarity of two diseases from the disease vectors using cosine distance.
9. The disease similarity calculation device based on big data group behavior according to claim 8, characterized in that the calculation formula of the cosine distance computation subunit is:
where the computed value is the disease similarity between the two disease vectors; d_ik and d_jk are the weights with which each patient unit example is diagnosed as diseases D_i and D_j, respectively; H is the total number of patient unit examples; and k is a natural number.
10. The disease similarity calculation device based on big data group behavior according to claim 8 or claim 9, characterized in that, when the weight with which a patient unit example is diagnosed as a certain disease is very low, the calculation formula of the cosine distance computation subunit is:
where the computed value is the disease similarity between the two disease vectors; d_ik and d_jk are the weights with which each patient unit example is diagnosed as diseases D_i and D_j, respectively, each higher than the set weight; k is a natural number; T_i is the set of weights, higher than the set weight, with which the patient unit examples are diagnosed as disease D_i; T_j is the set of weights, higher than the set weight, with which the patient unit examples are diagnosed as disease D_j; and T_ij is the conjunction of the weights, higher than the set weight, with which the patient unit examples are diagnosed as diseases D_i and D_j.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201610307328.7A CN106021871A (en) | 2016-05-10 | 2016-05-10 | Disease similarity calculation method and device based on big data group behaviors |
Publications (1)
Publication Number | Publication Date |
---|---|
CN106021871A true CN106021871A (en) | 2016-10-12 |
Family
ID=57100197
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN201610307328.7A Pending CN106021871A (en) | 2016-05-10 | 2016-05-10 | Disease similarity calculation method and device based on big data group behaviors |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN106021871A (en) |
Citations (3)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN102156812A (en) * | 2011-04-02 | 2011-08-17 | 中国医学科学院医学信息研究所 | Hospital decision-making aiding method based on symptom similarity analysis |
CN102184314A (en) * | 2011-04-02 | 2011-09-14 | 中国医学科学院医学信息研究所 | Deviation symptom description-oriented automatic computer-aided diagnosis method |
CN104915561A (en) * | 2015-06-11 | 2015-09-16 | 万达信息股份有限公司 | Intelligent disease attribute matching method |
Non-Patent Citations (2)
Title |
---|
李杰等: "基于疾病本体的疾病相似性计算方法", 《生物化学与生物物理进展》 * |
郭艾侠等: "融合 Harris 与 SIFT 算法的荔枝采摘点计算与立体匹配", 《农业机械学报》 * |
Cited By (10)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN106650299A (en) * | 2017-01-18 | 2017-05-10 | 浙江大学 | Quick calculating method for patient similarity analysis |
CN106650299B (en) * | 2017-01-18 | 2019-01-25 | 浙江大学 | A kind of quick calculation method of patient's similarity analysis |
CN106897580A (en) * | 2017-02-10 | 2017-06-27 | 华东师范大学 | The computational methods of semantic similarity between a kind of gene based on vector |
CN109102895A (en) * | 2017-06-21 | 2018-12-28 | 京东方科技集团股份有限公司 | Medical data coalignment and method |
CN108346474A (en) * | 2018-03-14 | 2018-07-31 | 湖南省蓝蜻蜓网络科技有限公司 | The electronic health record feature selection approach of distribution within class and distribution between class based on word |
CN108647203A (en) * | 2018-04-20 | 2018-10-12 | 浙江大学 | A kind of computational methods of Chinese medicine state of an illness text similarity |
CN108630322B (en) * | 2018-04-27 | 2020-08-14 | 厦门大学 | Drug interaction modeling and risk assessment method, terminal device and storage medium |
CN111091906A (en) * | 2019-10-31 | 2020-05-01 | 中电药明数据科技(成都)有限公司 | Auxiliary medical diagnosis method and system based on real world data |
CN111091906B (en) * | 2019-10-31 | 2023-06-20 | 中电药明数据科技(成都)有限公司 | Auxiliary medical diagnosis method and system based on real world data |
CN112151184A (en) * | 2020-09-27 | 2020-12-29 | 东北林业大学 | System for calculating disease similarity based on network representation learning |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
C06 | Publication | ||
PB01 | Publication | ||
C10 | Entry into substantive examination | ||
SE01 | Entry into force of request for substantive examination | ||
RJ01 | Rejection of invention patent application after publication | Application publication date: 20161012 |