CN106021871A - Disease similarity calculation method and device based on big data group behaviors - Google Patents

Disease similarity calculation method and device based on big data group behaviors

Info

Publication number
CN106021871A
Authority
CN
China
Prior art keywords
disease
weights
diagnosed
similarity
unit example
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN201610307328.7A
Other languages
Chinese (zh)
Inventor
韦辉华
王界兵
张伟
董迪马
郭宇翔
宋泰然
梁猛
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Shenzhen Frontsurf Information Technology Co Ltd
Original Assignee
Shenzhen Frontsurf Information Technology Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Shenzhen Frontsurf Information Technology Co Ltd
Priority to CN201610307328.7A
Publication of CN106021871A
Legal status: Pending

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00 Pattern recognition
    • G06F18/20 Analysing
    • G06F18/22 Matching criteria, e.g. proximity measures
    • G PHYSICS
    • G16 INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR SPECIFIC APPLICATION FIELDS
    • G16H HEALTHCARE INFORMATICS, i.e. INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR THE HANDLING OR PROCESSING OF MEDICAL OR HEALTHCARE DATA
    • G16H50/00 ICT specially adapted for medical diagnosis, medical simulation or medical data mining; ICT specially adapted for detecting, monitoring or modelling epidemics or pandemics
    • G16H50/70 ICT specially adapted for medical diagnosis, medical simulation or medical data mining; ICT specially adapted for detecting, monitoring or modelling epidemics or pandemics for mining of medical data, e.g. analysing previous cases of other patients

Landscapes

  • Engineering & Computer Science (AREA)
  • Data Mining & Analysis (AREA)
  • Public Health (AREA)
  • Health & Medical Sciences (AREA)
  • Theoretical Computer Science (AREA)
  • Medical Informatics (AREA)
  • Primary Health Care (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Epidemiology (AREA)
  • General Health & Medical Sciences (AREA)
  • Databases & Information Systems (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Artificial Intelligence (AREA)
  • Bioinformatics & Cheminformatics (AREA)
  • Bioinformatics & Computational Biology (AREA)
  • Pathology (AREA)
  • Evolutionary Biology (AREA)
  • Evolutionary Computation (AREA)
  • Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Biomedical Technology (AREA)
  • Medical Treatment And Welfare Office Work (AREA)

Abstract

The invention discloses a disease similarity calculation method and device based on big data group behaviors. The method comprises the following steps: calculating, for each patient element instance, the weight with which it is diagnosed as each disease, wherein the patient element instance comprises patient case information; establishing a disease vector for each disease from the calculated weight values, wherein the weight values serve as the elements of the disease vector; and calculating disease similarity from the disease vectors. The method and device calculate the similarity among diseases from the social perspective of diseases, according to the diagnosis and treatment behaviors of a big data population, and can be used to identify diseases that are easily misdiagnosed as one another yet have no association at the level of cells, genes, and the like.

Description

Disease similarity calculation method and device based on big data group behaviors
Technical field
The present invention relates to the field of disease similarity calculation, and in particular to a disease similarity calculation method and device based on big data group behaviors.
Background art
At present, disease similarity is generally calculated from the attributes of diseases, such as inclusion relations between diseases ('breast cancer' includes 'male breast cancer' and 'female breast cancer') and relating factors between diseases (shared disease-causing genes, shared drugs, shared metabolites, and so on). Methods for calculating disease similarity can generally be considered from two angles:
1. Calculating disease similarity based on semantic association.
The biomedical field frequently uses ontologies to calculate the semantic similarity of terms, for example the Gene Ontology and the Human Phenotype Ontology. Even so, only a small number of these methods have been used to calculate disease similarity. Resnik's method is the most common. It is mostly applied to the Gene Ontology to calculate the similarity of gene function, cellular component, and biological process terms, and it has a clear advantage over several other methods (union-intersection, longest shared path, JC). Resnik's method uses the 'is_a' relations in an ontology to calculate term similarity: the similarity of a disease pair depends on the common ancestor node of the pair with the greatest information content. Lin's method improves the way Resnik's method compares information entropy, refining Resnik's method to some extent from a theoretical standpoint. Researchers have implemented both Resnik's and Lin's methods as R packages to make disease similarity easier to calculate. Wang et al. proposed a method that optimizes Resnik's method further. When calculating the similarity of a disease pair, it considers not only the common ancestor node with the greatest information content but also the other common ancestor nodes of the pair. The advantages of this method have been demonstrated more fully on the Gene Ontology, and it has been used to calculate the semantic similarity of disease terms in Medical Subject Headings.
2. Calculating disease similarity based on disease-related genes.
The association between diseases is reflected not only in disease-related ontologies but also in shared disease-causing genes. Researchers have therefore also focused on calculating disease similarity from disease-causing genes. At present there are two gene-based methods for calculating disease similarity.
(1) The first is based on overlapping gene sets (BOG). It compares the number of disease-related genes that two diseases share and derives the disease similarity from that number. Compared with semantics-based calculation, this method finds similar disease pairs from a new angle, so it can discover previously unknown disease associations. However, when calculating disease similarity it does not consider the functional associations between disease genes, and such associations clearly have some influence on disease similarity.
(2) The second is based on process similarity (PSB), where "process" refers to the Gene Ontology biological process terms associated with the disease-causing genes. Because this method takes the functional associations of disease genes into account, it improves considerably on BOG. PSB also performs well compared with the Resnik, Lin, LC, and JC methods. Functional associations between genes cover many aspects, such as gene co-expression, protein interaction, and Gene Ontology terms. In addition, to improve the performance of disease similarity methods, the FunSim method calculates disease similarity using an integrated weighted human gene association network.
Therefore, if two diseases are roughly the same in gene function, cellular component, biological process, or shared disease-causing genes, both the semantic-association approach and the disease-related-gene approach are effective for calculating their similarity, which is very useful for disease research. However, for two diseases that are easily misdiagnosed as one another but have no association at the level of cells, genes, and the like, both approaches perform poorly.
Summary of the invention
The main object of the present invention is to provide a disease similarity calculation method and device based on big data group behaviors, which calculate the similarity between diseases from the social perspective of diseases, according to the disease diagnosis and treatment behaviors of a big data population, and which can be used to identify diseases that are easily misdiagnosed as one another but have no association at the level of cells, genes, and the like.
The present invention proposes a disease similarity calculation method based on big data group behaviors, comprising the steps of:
calculating, for each patient element instance, the weight with which it is diagnosed as each disease, the patient element instance comprising patient case information;
establishing a disease vector for each disease from the calculated weight values, the weight values serving as the elements of the disease vector;
calculating disease similarity from the disease vectors.
Further, the step of calculating, for each patient element instance, the weight with which it is diagnosed as each disease comprises:
counting the number of times each patient element instance is diagnosed as each disease, obtaining the relative frequency of that count in all the data, and taking the relative frequency as the weight value with which the patient element instance is diagnosed as each disease.
Further, the step of calculating disease similarity from the disease vectors comprises:
calculating the similarity of two diseases from their disease vectors using the cosine distance.
Further, the formula for calculating the similarity of two diseases from their disease vectors using the cosine distance is:
$$\mathrm{sim}(\vec{D_i}, \vec{D_j}) = \cos\langle \vec{D_i}, \vec{D_j} \rangle = \frac{\sum_{k=1}^{H} d_{ik}\, d_{jk}}{\sqrt{\sum_{k=1}^{H} d_{ik}^{2}} \cdot \sqrt{\sum_{k=1}^{H} d_{jk}^{2}}};$$
where $\mathrm{sim}(\vec{D_i}, \vec{D_j})$ is the disease similarity between the two disease vectors; $d_{ik}$ and $d_{jk}$ are the weight values with which the $k$-th patient element instance is diagnosed as diseases $D_i$ and $D_j$ respectively; $H$ is the total number of patient element instances; $k$ is a natural number.
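As a worked illustration of the cosine calculation, the following uses hypothetical weight values for $H = 3$ patient element instances. The numbers are assumptions chosen for easy arithmetic and do not come from the source data.

```latex
\vec{D_i} = (0.6,\ 0.8,\ 0), \qquad \vec{D_j} = (0.6,\ 0,\ 0.8)

\sum_{k=1}^{3} d_{ik}\, d_{jk} = 0.6 \cdot 0.6 + 0.8 \cdot 0 + 0 \cdot 0.8 = 0.36

\sqrt{\sum_{k=1}^{3} d_{ik}^{2}} = \sqrt{0.36 + 0.64} = 1, \qquad
\sqrt{\sum_{k=1}^{3} d_{jk}^{2}} = \sqrt{0.36 + 0.64} = 1

\cos\langle \vec{D_i}, \vec{D_j} \rangle = \frac{0.36}{1 \cdot 1} = 0.36
```

Only the first element instance carries weight for both diseases, so it alone contributes to the numerator.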
Further, when the weight value with which a patient element instance is diagnosed as a certain disease is very low, the formula for calculating the similarity of two diseases from their disease vectors using the cosine distance is:
$$\mathrm{sim}(\vec{D_i}, \vec{D_j}) = \frac{\sum_{k \in T_{ij}} d_{ik}\, d_{jk}}{\sqrt{\sum_{k \in T_i} d_{ik}^{2}} \cdot \sqrt{\sum_{k \in T_j} d_{jk}^{2}}};$$
where $\mathrm{sim}(\vec{D_i}, \vec{D_j})$ is the disease similarity between the two disease vectors; $d_{ik}$ and $d_{jk}$ are the weight values with which the $k$-th patient element instance is diagnosed as diseases $D_i$ and $D_j$ respectively, and their values are higher than the set weight value; $k$ is a natural number; $T_i$ is the set of patient element instances whose weight value for disease $D_i$ is higher than the set weight value; $T_j$ is the set of patient element instances whose weight value for disease $D_j$ is higher than the set weight value; $T_{ij}$ is the intersection of the two, that is, the patient element instances whose weight values for both $D_i$ and $D_j$ are higher than the set weight value.
The present invention also provides a disease similarity calculation device based on big data group behaviors, comprising:
a weight calculation unit, which calculates, for each patient element instance, the weight with which it is diagnosed as each disease, the patient element instance comprising patient case information;
a vector establishing unit, which establishes a disease vector for each disease from the calculated weight values, the weight values serving as the elements of the disease vector;
a similarity calculation unit, which calculates disease similarity from the disease vectors.
Further, the weight calculation unit comprises:
a weight value calculation subunit, which counts the number of times each patient element instance is diagnosed as each disease, obtains the relative frequency of that count in all the data, and takes the relative frequency as the weight value with which the patient element instance is diagnosed as each disease.
Further, the similarity calculation unit comprises:
a cosine distance calculation subunit, which calculates the similarity of two diseases from their disease vectors using the cosine distance.
Further, the formula used by the cosine distance calculation subunit is:
$$\mathrm{sim}(\vec{D_i}, \vec{D_j}) = \cos\langle \vec{D_i}, \vec{D_j} \rangle = \frac{\sum_{k=1}^{H} d_{ik}\, d_{jk}}{\sqrt{\sum_{k=1}^{H} d_{ik}^{2}} \cdot \sqrt{\sum_{k=1}^{H} d_{jk}^{2}}};$$
where $\mathrm{sim}(\vec{D_i}, \vec{D_j})$ is the disease similarity between the two disease vectors; $d_{ik}$ and $d_{jk}$ are the weight values with which the $k$-th patient element instance is diagnosed as diseases $D_i$ and $D_j$ respectively; $H$ is the total number of patient element instances; $k$ is a natural number.
Further, when the weight value with which a patient element instance is diagnosed as a certain disease is very low, the formula used by the cosine distance calculation subunit is:
$$\mathrm{sim}(\vec{D_i}, \vec{D_j}) = \frac{\sum_{k \in T_{ij}} d_{ik}\, d_{jk}}{\sqrt{\sum_{k \in T_i} d_{ik}^{2}} \cdot \sqrt{\sum_{k \in T_j} d_{jk}^{2}}};$$
where $\mathrm{sim}(\vec{D_i}, \vec{D_j})$ is the disease similarity between the two disease vectors; $d_{ik}$ and $d_{jk}$ are the weight values with which the $k$-th patient element instance is diagnosed as diseases $D_i$ and $D_j$ respectively, and their values are higher than the set weight value; $k$ is a natural number; $T_i$ is the set of patient element instances whose weight value for disease $D_i$ is higher than the set weight value; $T_j$ is the set of patient element instances whose weight value for disease $D_j$ is higher than the set weight value; $T_{ij}$ is the intersection of the two, that is, the patient element instances whose weight values for both $D_i$ and $D_j$ are higher than the set weight value.
The disease similarity calculation method and device based on big data group behaviors provided by the present invention have the following beneficial effects:
They calculate the similarity between diseases from the social perspective of diseases, according to the disease diagnosis and treatment behaviors of a big data population, and can be used to identify diseases that are easily misdiagnosed as one another but have no association at the level of cells, genes, and the like. A disease vector is established from the weights with which each patient element instance is diagnosed as each disease, and disease similarity is calculated with the cosine distance, so that the diseases most easily misdiagnosed can be identified. When the weight value with which a patient element instance is diagnosed as a certain disease is very low, the credibility of that diagnosis for this patient element instance can be regarded as very low, and it can be ignored when calculating disease similarity.
Brief description of the drawings
Fig. 1 is a schematic diagram of the steps of the disease similarity calculation method based on big data group behaviors in one embodiment of the invention;
Fig. 2 is a schematic diagram of the steps of the disease similarity calculation method based on big data group behaviors in another embodiment of the invention;
Fig. 3 is a schematic structural diagram of the disease similarity calculation device based on big data group behaviors in one embodiment of the invention;
Fig. 4 is a schematic structural diagram of the disease similarity calculation device based on big data group behaviors in another embodiment of the invention;
Fig. 5 is a schematic structural diagram of the weight calculation unit in one embodiment of the invention;
Fig. 6 is a schematic structural diagram of the similarity calculation unit in one embodiment of the invention.
The realization of the object, the functional characteristics, and the advantages of the present invention will be further described with reference to the drawings and in conjunction with the embodiments.
Detailed description
It should be understood that the specific embodiments described herein only explain the present invention and are not intended to limit it.
Fig. 1 is a schematic diagram of the steps of the disease similarity calculation method based on big data group behaviors in one embodiment of the invention.
One embodiment of the invention proposes a disease similarity calculation method based on big data group behaviors, comprising:
Step S1: calculate, for each patient element instance, the weight with which it is diagnosed as each disease. The patient element instance comprises patient case information, which includes at least: patient sex, age, symptom, symptom location, refined symptom description, and so on.
Step S2: establish a disease vector for each disease from the calculated weight values. The weight values serve as the elements of the disease vector.
Step S3: calculate disease similarity from the disease vectors.
At present, disease similarity is generally calculated with one of two approaches: calculation based on semantic association, or calculation based on disease-related genes. If two diseases are roughly the same in gene function, cellular component, biological process, or shared disease-causing genes, both approaches are effective and very useful for disease research. However, for two diseases that are easily misdiagnosed as one another but have no association at the level of cells, genes, and the like, both approaches perform poorly. The disease similarity calculation method based on big data group behaviors provided in this embodiment quantifies the similarity between diseases. The calculation does not use attribute relations between diseases; it uses the disease consultation behavior of a big data population. Mass data on how people seek medical advice for each disease is collected first, these group diagnosis behaviors are used to model the diseases, and a suitable similarity algorithm is then selected to calculate the pairwise similarity between diseases, so that easily misdiagnosed diseases can be identified. The method can be used in systems such as computer-aided disease diagnosis and disease self-diagnosis. It also plays an important role in research on the pathogenesis of complex diseases, in the early prevention and diagnosis of major diseases and infectious diseases, and in new drug research and development.
With reference to Fig. 2, the following may also be included before step S1:
Step S0: build patient element instances from case information such as patient sex, age, symptom, symptom location, and refined symptom description. Step S0 may comprise the following steps:
Step 1: Collect disease diagnosis and treatment data for patients of different sexes and ages. From each case, extract the following information (without being limited to it): patient sex, patient age, symptom, symptom location, refined symptom description, and diagnosed disease. Patient sex is denoted $G:\{G_1, G_2\}$, where $G_i$ is one of the 2 values that sex $G$ can take; patient age is denoted $A$; symptom is denoted $S:\{S_1, S_2, \dots, S_L\}$, where $S_i$ is one of the $L$ values that symptom $S$ can take; symptom location is denoted $B:\{B_1, B_2, \dots, B_M\}$, where $B_i$ is one of the $M$ values that location $B$ can take; disease is denoted $D:\{D_1, D_2, \dots, D_N\}$, where $D_i$ is one of the $N$ values that disease $D$ can take.
Step 2: Discretize the age data into $K$ bands: $A:\{A_1, A_2, \dots, A_K\}$, where $A_i$ is one of the $K$ values that age $A$ can take. Age data can be divided in either of the following two ways:
The first, $K = 5$:
(1) childhood: 0 to 6 years; (2) juvenile: 7 to 17 years; (3) young: 18 to 40 years; (4) middle age: 41 to 65 years; (5) old: 66 years and over.
The second, $K = 14$:
(1) infancy: 0 to 3 months; (2) early childhood: 4 months to 2.5 years; (3) preschool period: over 2.5 to 6 years; (4) initiation period: 7 to 10 years; (5) rebellious period: 11 to 14 years; (6) growth period: 15 to 17 years; (7) adolescence: 18 to 28 years; (8) maturity: 29 to 40 years; (9) sturdy period: 41 to 48 years; (10) steady period: 49 to 55 years; (11) adjustment period: 56 to 65 years; (12) early old age: 67 to 72 years; (13) middle old age: 73 to 84 years; (14) advanced old age: 85 years and over.
Step 3: Consolidate the refined symptom descriptions under each symptom into 3 to 5 refined descriptions according to the time of onset, severity, triggering factors, and so on: $S_i:\{S_{i1}, S_{i2}, S_{i3}, S_{i4}, S_{i5}\}$, where $S_{i1}, S_{i2}, S_{i3}, S_{i4}, S_{i5}$ are the refined symptom descriptions of symptom $S_i$. The number of refined descriptions may differ from symptom to symptom.
Step 4: Combine the values of patient sex $G:\{G_1, G_2\}$, patient age $A:\{A_1, A_2, \dots, A_K\}$, symptom $S:\{S_1, S_2, \dots, S_L\}$, symptom location $B:\{B_1, B_2, \dots, B_M\}$, and refined symptom description $S_i:\{S_{i1}, S_{i2}, S_{i3}, S_{i4}, S_{i5}\}$ into patient element instances $E:\{E_1, E_2, \dots, E_H\}$. For example, $E_1$ denotes the patient element instance "sex is $G_1$, age is $A_1$, symptom is $S_1$, location is $B_1$, refined symptom description is $S_{11}$", and $H$ is the number of element instances. In theory $H = 2 \times K \times L \times M \times 5$; in practice, because some symptoms appear only in certain age bands or in one sex, $H < 2 \times K \times L \times M \times 5$.
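Steps 2 and 4 can be sketched as follows. This is a minimal illustration, assuming the $K = 5$ age scheme, integer-year band boundaries, and hypothetical record field names (`sex`, `age`, `symptom`, `location`, `refinement`); none of these names come from the source.

```python
from typing import NamedTuple

# Inclusive upper bounds (whole years) for the first four K = 5 bands.
AGE_BANDS = [(6, "childhood"), (17, "juvenile"), (40, "young"), (65, "middle age")]


def age_band(age_years: float) -> str:
    """Map an age to one of the K = 5 bands; 66 and over is 'old'."""
    if age_years < 0:
        raise ValueError("age must be non-negative")
    whole_years = int(age_years)  # assumption: fractional ages are floored
    for upper, label in AGE_BANDS:
        if whole_years <= upper:
            return label
    return "old"


class ElementInstance(NamedTuple):
    """One combination of sex, age band, symptom, location, and refinement."""
    sex: str
    age_band: str
    symptom: str
    location: str
    refinement: str


def to_element_instance(record: dict) -> ElementInstance:
    """Reduce a case record to its patient element instance."""
    return ElementInstance(
        record["sex"],
        age_band(record["age"]),
        record["symptom"],
        record["location"],
        record["refinement"],
    )
```

Because the instance is a hashable tuple, two cases that differ only in exact age within the same band map to the same key, which is what allows diagnosis counts to be accumulated per element instance.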
Further, in step S1, calculating, for each patient element instance, the weight with which it is diagnosed as each disease includes:
counting the number of times each patient element instance is diagnosed as each disease, obtaining the relative frequency of that count in all the data, and taking the relative frequency as the weight value with which the patient element instance is diagnosed as each disease.
In this embodiment, the weight $W_{ij}$ of each patient element instance $E_i$ for each disease $D_j$ is calculated as follows:
Using the mass case data collected above (the data volume must be large enough to meet the requirement of statistical significance), for each patient element instance $E_i$, count the number of times it is diagnosed as each disease $\{D_1, D_2, \dots, D_N\}$, giving the counts $\{F_{i1}, F_{i2}, \dots, F_{iN}\}$, and then convert each count into a relative frequency, where $F_i$ is the count against which $F_{ij}$ is normalized over all the data.
This relative frequency is taken as the weight of patient element instance $E_i$ for each disease $\{D_1, D_2, \dots, D_N\}$, that is:
$$W_{ij} = \frac{F_{ij}}{F_i}.$$
This yields the weight matrix $W_{ij}$ of patient element instances against diseases.
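The counting and normalization can be sketched as below. The text does not spell out the denominator, so this sketch assumes it is the total number of recorded diagnoses for the element instance (the sum of its per-disease counts); the function name and the input shape are also hypothetical.

```python
from collections import Counter, defaultdict


def weight_matrix(cases):
    """Build {instance: {disease: weight}} from (instance, disease) pairs.

    Each weight is the per-disease count divided by the instance's total
    count (assumed denominator: all diagnoses recorded for that instance).
    """
    counts = defaultdict(Counter)
    for instance, disease in cases:
        counts[instance][disease] += 1  # per-instance, per-disease count

    weights = {}
    for instance, per_disease in counts.items():
        total = sum(per_disease.values())  # assumed normalizing count
        weights[instance] = {d: f / total for d, f in per_disease.items()}
    return weights
```

Only diseases that were actually observed for an instance are stored, so the resulting matrix is sparse, and under the assumed denominator the weights of each instance sum to 1.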
Further, in step S2, establishing a disease vector for each disease from the calculated weight values includes: using the weight values in the weight matrix $W_{ij}$ of patient element instances against diseases as the elements of the disease vector, that is, $d_{jk} = W_{kj}$:
$$\vec{D_j} = (d_{j1}, d_{j2}, \dots, d_{jH}) = (W_{1j}, W_{2j}, \dots, W_{Hj}).$$
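Building the disease vectors amounts to reading the weight matrix by disease instead of by element instance. A minimal sketch, assuming the weights are held in a nested dictionary keyed by instance and then by disease (a hypothetical layout, not one prescribed by the source):

```python
def disease_vectors(weights):
    """Transpose {instance: {disease: w}} into {disease: {instance: w}}.

    Each inner dictionary is a sparse disease vector: instances with no
    recorded diagnosis of that disease are simply absent (weight 0).
    """
    vectors = {}
    for instance, per_disease in weights.items():
        for disease, w in per_disease.items():
            vectors.setdefault(disease, {})[instance] = w
    return vectors
```

Keeping the vectors as dictionaries avoids materializing all $H$ components when most of them are zero.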
Further, in step S3, calculating disease similarity from the disease vectors includes:
calculating the similarity of two diseases from their disease vectors using the cosine distance.
Further, the disease vector is a high-dimensional sparse vector, and the formula for calculating the similarity of two diseases from their disease vectors using the cosine distance is:
$$\mathrm{sim}(\vec{D_i}, \vec{D_j}) = \cos\langle \vec{D_i}, \vec{D_j} \rangle = \frac{\sum_{k=1}^{H} d_{ik}\, d_{jk}}{\sqrt{\sum_{k=1}^{H} d_{ik}^{2}} \cdot \sqrt{\sum_{k=1}^{H} d_{jk}^{2}}};$$
where $\mathrm{sim}(\vec{D_i}, \vec{D_j})$ is the disease similarity between the two disease vectors; $d_{ik}$ and $d_{jk}$ are the weight values with which the $k$-th patient element instance is diagnosed as diseases $D_i$ and $D_j$ respectively; $H$ is the total number of patient element instances; $k$ is a natural number.
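Because the disease vectors are sparse, the standard cosine similarity can be computed over dictionaries without expanding them to $H$ components. The sketch below assumes each vector is a mapping from element instance to weight; returning 0.0 for an all-zero vector is a convention chosen here, not something the source specifies.

```python
import math


def cosine_similarity(u, v):
    """Cosine similarity of two sparse vectors given as {key: weight} dicts."""
    # Only keys present in both vectors contribute to the dot product.
    dot = sum(w * v[k] for k, w in u.items() if k in v)
    norm_u = math.sqrt(sum(w * w for w in u.values()))
    norm_v = math.sqrt(sum(w * w for w in v.values()))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0  # assumed convention for empty or all-zero vectors
    return dot / (norm_u * norm_v)
```

Since all weights are non-negative, the result lies between 0 (no shared element instances) and 1 (proportional vectors).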
Further, when the weight value with which a patient element instance is diagnosed as a certain disease is very low, the credibility of diagnosing that disease from this patient element instance is also very low, and it can usually be ignored when calculating disease similarity. The formula for calculating the similarity of two diseases from their disease vectors using the cosine distance is then adjusted to:
$$\mathrm{sim}(\vec{D_i}, \vec{D_j}) = \frac{\sum_{k \in T_{ij}} d_{ik}\, d_{jk}}{\sqrt{\sum_{k \in T_i} d_{ik}^{2}} \cdot \sqrt{\sum_{k \in T_j} d_{jk}^{2}}};$$
where $\mathrm{sim}(\vec{D_i}, \vec{D_j})$ is the disease similarity between the two disease vectors; $d_{ik}$ and $d_{jk}$ are the weight values with which the $k$-th patient element instance is diagnosed as diseases $D_i$ and $D_j$ respectively, and their values are higher than the set weight value; $k$ is a natural number; $T_i$ is the set of patient element instances whose weight value for disease $D_i$ is higher than the set weight value; $T_j$ is the set of patient element instances whose weight value for disease $D_j$ is higher than the set weight value; $T_{ij}$ is the intersection of the two, that is, the patient element instances whose weight values for both $D_i$ and $D_j$ are higher than the set weight value. The set weight value is a minimum threshold chosen according to the actual situation: when the weight value with which a patient element instance is diagnosed as $D_i$ or $D_j$ is below this threshold, its credibility is judged to be low, and that weight value is ignored when calculating disease similarity.
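The thresholded variant can be sketched by dropping low weights from each vector before taking the cosine. The parameter name `min_weight` and the strict "greater than" comparison are assumptions; the source only says that weights below the set value are ignored.

```python
import math


def thresholded_similarity(u, v, min_weight):
    """Cosine similarity that ignores weights not above min_weight.

    u, v: sparse disease vectors as {element_instance: weight} dicts.
    """
    kept_u = {k: w for k, w in u.items() if w > min_weight}  # instances kept for the first disease
    kept_v = {k: w for k, w in v.items() if w > min_weight}  # instances kept for the second disease
    # The dot product runs over instances kept in both vectors.
    dot = sum(w * kept_v[k] for k, w in kept_u.items() if k in kept_v)
    norm_u = math.sqrt(sum(w * w for w in kept_u.values()))
    norm_v = math.sqrt(sum(w * w for w in kept_v.values()))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0  # assumed convention when nothing survives the threshold
    return dot / (norm_u * norm_v)
```

Setting `min_weight` to a negative value keeps every stored weight and reduces this to the unthresholded cosine similarity.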
Fig. 3 is a schematic structural diagram of the disease similarity calculation device based on big data group behaviors in one embodiment of the invention.
One embodiment of the invention also provides a disease similarity calculation device based on big data group behaviors, comprising:
a weight calculation unit 10, which calculates, for each patient element instance, the weight with which it is diagnosed as each disease, the patient element instance comprising patient case information;
a vector establishing unit 20, which establishes a disease vector for each disease from the calculated weight values, the weight values serving as the elements of the disease vector;
a similarity calculation unit 30, which calculates disease similarity from the disease vectors.
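One way to picture how units 10, 20, and 30 fit together is as three small classes composed by a device object. This is only an architectural sketch: the class and method names are hypothetical, the weight denominator is assumed to be the instance's total diagnosis count, and the similarity is the standard cosine similarity.

```python
import math
from collections import Counter, defaultdict
from itertools import combinations


class WeightCalculationUnit:
    """Unit 10: per-instance diagnosis counts turned into relative frequencies."""

    def compute(self, cases):
        counts = defaultdict(Counter)
        for instance, disease in cases:
            counts[instance][disease] += 1
        return {
            inst: {d: f / sum(c.values()) for d, f in c.items()}
            for inst, c in counts.items()
        }


class VectorEstablishingUnit:
    """Unit 20: one sparse vector per disease, keyed by element instance."""

    def build(self, weights):
        vectors = defaultdict(dict)
        for inst, per_disease in weights.items():
            for disease, w in per_disease.items():
                vectors[disease][inst] = w
        return dict(vectors)


class SimilarityCalculationUnit:
    """Unit 30: cosine similarity between two sparse disease vectors."""

    def similarity(self, u, v):
        dot = sum(w * v[k] for k, w in u.items() if k in v)
        norm = math.sqrt(sum(w * w for w in u.values())) * math.sqrt(
            sum(w * w for w in v.values())
        )
        return dot / norm if norm else 0.0


class DiseaseSimilarityDevice:
    """Chains the three units: cases -> weights -> vectors -> pairwise similarity."""

    def __init__(self):
        self.weight_unit = WeightCalculationUnit()
        self.vector_unit = VectorEstablishingUnit()
        self.similarity_unit = SimilarityCalculationUnit()

    def run(self, cases):
        vectors = self.vector_unit.build(self.weight_unit.compute(cases))
        return {
            (a, b): self.similarity_unit.similarity(vectors[a], vectors[b])
            for a, b in combinations(sorted(vectors), 2)
        }
```

Calling `DiseaseSimilarityDevice().run(cases)` with `(element_instance, disease)` pairs returns a dictionary keyed by alphabetically ordered disease pairs, so each pair is scored once.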
At present, disease similarity is generally calculated with one of two approaches: calculation based on semantic association, or calculation based on disease-related genes. If two diseases are roughly the same in gene function, cellular component, biological process, or shared disease-causing genes, both approaches are effective and very useful for disease research. However, for two diseases that are easily misdiagnosed as one another but have no association at the level of cells, genes, and the like, both approaches perform poorly. The disease similarity calculation device based on big data group behaviors provided in this embodiment quantifies the similarity between diseases. The calculation does not use attribute relations between diseases; it uses the disease consultation behavior of a big data population. Mass data on how people seek medical advice for each disease is collected first, these group diagnosis behaviors are used to model the diseases, and a suitable similarity algorithm is then selected to calculate the pairwise similarity between diseases, so that the diseases most easily misdiagnosed can be identified. The device can be used in systems such as computer-aided disease diagnosis and disease self-diagnosis. It also plays an important role in research on the pathogenesis of complex diseases, in the early prevention and diagnosis of major diseases and infectious diseases, and in new drug research and development.
With reference to Fig. 4, the disease similarity calculation device based on big data group behaviors may also include:
an element instance construction unit 1, which builds patient element instances from case information such as patient sex, age, symptom, symptom location, and refined symptom description. The element instance construction unit 1 may build patient element instances as follows:
Step 1: Collect disease diagnosis and treatment data for patients of different sexes and ages. From each case, extract the following information (without being limited to it): patient sex, patient age, symptom, symptom location, refined symptom description, and diagnosed disease. Patient sex is denoted $G:\{G_1, G_2\}$, where $G_i$ is one of the 2 values that sex $G$ can take; patient age is denoted $A$; symptom is denoted $S:\{S_1, S_2, \dots, S_L\}$, where $S_i$ is one of the $L$ values that symptom $S$ can take; symptom location is denoted $B:\{B_1, B_2, \dots, B_M\}$, where $B_i$ is one of the $M$ values that location $B$ can take; disease is denoted $D:\{D_1, D_2, \dots, D_N\}$, where $D_i$ is one of the $N$ values that disease $D$ can take.
Step 2: Discretize the age data into $K$ bands: $A:\{A_1, A_2, \dots, A_K\}$, where $A_i$ is one of the $K$ values that age $A$ can take. Age data can be divided in either of the following two ways:
The first, $K = 5$:
(1) childhood: 0 to 6 years; (2) juvenile: 7 to 17 years; (3) young: 18 to 40 years; (4) middle age: 41 to 65 years; (5) old: 66 years and over.
The second, $K = 14$:
(1) infancy: 0 to 3 months; (2) early childhood: 4 months to 2.5 years; (3) preschool period: over 2.5 to 6 years; (4) initiation period: 7 to 10 years; (5) rebellious period: 11 to 14 years; (6) growth period: 15 to 17 years; (7) adolescence: 18 to 28 years; (8) maturity: 29 to 40 years; (9) sturdy period: 41 to 48 years; (10) steady period: 49 to 55 years; (11) adjustment period: 56 to 65 years; (12) early old age: 67 to 72 years; (13) middle old age: 73 to 84 years; (14) advanced old age: 85 years and over.
Step 3: Consolidate the refined symptom descriptions under each symptom into 3 to 5 refined descriptions according to the time of onset, severity, triggering factors, and so on: $S_i:\{S_{i1}, S_{i2}, S_{i3}, S_{i4}, S_{i5}\}$, where $S_{i1}, S_{i2}, S_{i3}, S_{i4}, S_{i5}$ are the refined symptom descriptions of symptom $S_i$. The number of refined descriptions may differ from symptom to symptom.
Step 4: Combine the values of patient sex $G:\{G_1, G_2\}$, patient age $A:\{A_1, A_2, \dots, A_K\}$, symptom $S:\{S_1, S_2, \dots, S_L\}$, symptom location $B:\{B_1, B_2, \dots, B_M\}$, and refined symptom description $S_i:\{S_{i1}, S_{i2}, S_{i3}, S_{i4}, S_{i5}\}$ into patient element instances $E:\{E_1, E_2, \dots, E_H\}$. For example, $E_1$ denotes the patient element instance "sex is $G_1$, age is $A_1$, symptom is $S_1$, location is $B_1$, refined symptom description is $S_{11}$", and $H$ is the number of element instances. In theory $H = 2 \times K \times L \times M \times 5$; in practice, because some symptoms appear only in certain age bands or in one sex, $H < 2 \times K \times L \times M \times 5$.
Further, referring to Fig. 5, the weight calculation unit 10 includes:
A weight computing subunit 100, which counts how many times each patient unit example is diagnosed as each disease, converts this count into a relative frequency over all the data, and uses that relative frequency as the weight value with which the patient unit example is diagnosed as each disease.
In this embodiment, the weight computing subunit 100 calculates the weight $W_{ij}$ of each patient unit example $E_i$ for each disease $D_j$ as follows:
Using the massive case data collected above (the data volume must be large enough to be statistically meaningful), count, for each patient unit example $E_i$, the number of times $\{F_{i1}, F_{i2}, \dots, F_{iN}\}$ it is diagnosed as each disease $\{D_1, D_2, \dots, D_N\}$, then convert these counts into relative frequencies:
$$\frac{F_{ij}}{F_i}, \quad \text{where } F_i = \sum_{j=1}^{N} F_{ij}.$$
This relative frequency is used as the weight of patient unit example $E_i$ for each disease $\{D_1, D_2, \dots, D_N\}$, that is:
$$W_{ij} = \frac{F_{ij}}{F_i}.$$
This yields the weight matrix $W_{ij}$ of patient unit examples against diseases.
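The weight computation can be sketched in Python as below. The sketch assumes that $F_i$ is the total number of diagnoses recorded for patient unit example $E_i$ (the sum of $F_{ij}$ over all diseases) and that the input is a list of (unit example, diagnosed disease) pairs; the function name `weight_matrix` is hypothetical.

```python
from collections import Counter, defaultdict


def weight_matrix(records):
    """Return {E_i: {D_j: W_ij}} with W_ij = F_ij / F_i.

    records: iterable of (unit_example, disease) pairs, one per diagnosed case.
    Assumption: F_i is the sum of F_ij over all diseases j.
    """
    counts = defaultdict(Counter)
    for example, disease in records:
        counts[example][disease] += 1  # F_ij: times E_i was diagnosed as D_j

    weights = {}
    for example, row in counts.items():
        total = sum(row.values())  # F_i
        weights[example] = {disease: f / total for disease, f in row.items()}
    return weights


if __name__ == "__main__":
    records = [("E1", "D1"), ("E1", "D1"), ("E1", "D2"), ("E2", "D2")]
    for example, row in weight_matrix(records).items():
        print(example, row)
```

Storing only the non-zero entries keeps the matrix compact, since most patient unit examples are diagnosed as only a few diseases.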
Further, the vector establishing unit 20 establishes a disease vector for each disease according to the calculated weight values, which includes using the weight values of the weight matrix $W_{ij}$ of patient unit examples against diseases as the elements of the disease vector:
$$\vec{D_j} = \{d_{j1}, d_{j2}, \dots, d_{jH}\} = (W_{1j}, W_{2j}, \dots, W_{Hj}).$$
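Building the disease vectors amounts to reading the weight matrix column by column. A minimal sketch, assuming the weights are held as a nested dictionary `{E_i: {D_j: W_ij}}` (a hypothetical representation) and keeping only non-zero elements:

```python
def disease_vectors(weights):
    """Transpose {E_i: {D_j: W_ij}} into {D_j: {E_i: W_ij}}.

    Each inner dictionary is one sparse disease vector: unit examples that
    are never diagnosed as the disease are simply absent (implicit zero).
    """
    vectors = {}
    for example, row in weights.items():
        for disease, w in row.items():
            if w:  # skip explicit zeros to keep the vector sparse
                vectors.setdefault(disease, {})[example] = w
    return vectors


if __name__ == "__main__":
    weights = {"E1": {"D1": 0.75, "D2": 0.25}, "E2": {"D2": 1.0}}
    print(disease_vectors(weights))
```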
Further, referring to Fig. 6, the similarity calculation unit 30 includes:
A cosine distance computing subunit 300, which uses the cosine distance to calculate the similarity of two diseases from their disease vectors.
Further, the disease vector is a high-dimensional sparse vector, and the calculation formula of the cosine distance computing subunit 300 is:
$$\mathrm{sim}(\vec{D_i}, \vec{D_j}) = \cos\langle \vec{D_i}, \vec{D_j} \rangle = \frac{\sum_{k=1}^{H} d_{ik}\, d_{jk}}{\sqrt{\sum_{k=1}^{H} d_{ik}^{2}} \cdot \sqrt{\sum_{k=1}^{H} d_{jk}^{2}}};$$
where $\mathrm{sim}(\vec{D_i}, \vec{D_j})$ is the disease similarity between the two disease vectors; $d_{ik}$ and $d_{jk}$ are the weight values with which the $k$-th patient unit example is diagnosed as diseases $D_i$ and $D_j$ respectively; $H$ is the total number of patient unit examples; and $k$ is a natural number.
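The cosine calculation over sparse disease vectors can be sketched as below. The sketch assumes the standard cosine form (sum of products divided by the product of the two vector norms) and represents each disease vector as a dictionary `{unit example: weight}`; the function name `cosine_similarity` is hypothetical.

```python
import math


def cosine_similarity(u, v):
    """Cosine of the angle between two sparse vectors given as {key: weight}."""
    if len(u) > len(v):
        u, v = v, u  # iterate over the shorter vector
    # Only keys present in both vectors contribute to the dot product.
    dot = sum(w * v[k] for k, w in u.items() if k in v)
    norm_u = math.sqrt(sum(w * w for w in u.values()))
    norm_v = math.sqrt(sum(w * w for w in v.values()))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0  # convention chosen here for empty or all-zero vectors
    return dot / (norm_u * norm_v)


if __name__ == "__main__":
    d_i = {"E1": 3.0, "E2": 4.0}
    d_j = {"E1": 3.0}
    print(cosine_similarity(d_i, d_j))
```

Because the weights are non-negative frequencies, the result lies between 0 (no shared patient unit examples) and 1 (proportional vectors).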
Further, when the weight value with which a patient unit example is diagnosed as a certain disease is very low, the credibility of diagnosing that disease from this patient unit example is also very low, and the weight can generally be ignored when calculating disease similarity. The calculation formula of the cosine distance computing subunit 300 is then adjusted to:
$$\mathrm{sim}(\vec{D_i}, \vec{D_j}) = \frac{\sum_{k \in T_{ij}} d_{ik}\, d_{jk}}{\sqrt{\sum_{k \in T_i} d_{ik}^{2}} \cdot \sqrt{\sum_{k \in T_j} d_{jk}^{2}}};$$
where $\mathrm{sim}(\vec{D_i}, \vec{D_j})$ is the disease similarity between the two disease vectors; $d_{ik}$ and $d_{jk}$ are the weight values with which the $k$-th patient unit example is diagnosed as diseases $D_i$ and $D_j$ respectively, each higher than the set weight; $k$ is a natural number; $T_i$ is the set of patient unit examples whose weight for disease $D_i$ is higher than the set weight; $T_j$ is the set of patient unit examples whose weight for disease $D_j$ is higher than the set weight; and $T_{ij}$ is the intersection of the two, that is, the patient unit examples whose weights for both $D_i$ and $D_j$ are higher than the set weight. The set weight is a minimum threshold chosen according to the actual situation: when the weight with which a patient unit example is diagnosed as $D_i$ or $D_j$ is lower than this threshold, its credibility is judged to be low and the weight of this patient unit example is ignored when calculating disease similarity.
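The thresholded variant can be sketched in the same style. The sketch assumes the cosine form with products in the numerator and square roots in the denominator, treats $T_{ij}$ as the intersection of $T_i$ and $T_j$, and uses a hypothetical parameter `threshold` for the set weight.

```python
import math


def thresholded_similarity(u, v, threshold):
    """Cosine-style similarity that ignores weights not above `threshold`.

    u, v: sparse disease vectors as {unit_example: weight}.
    """
    t_i = {k for k, w in u.items() if w > threshold}  # T_i
    t_j = {k for k, w in v.items() if w > threshold}  # T_j
    t_ij = t_i & t_j                                   # T_ij, the intersection

    dot = sum(u[k] * v[k] for k in t_ij)
    norm_u = math.sqrt(sum(u[k] ** 2 for k in t_i))
    norm_v = math.sqrt(sum(v[k] ** 2 for k in t_j))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0  # convention chosen here when nothing exceeds the threshold
    return dot / (norm_u * norm_v)


if __name__ == "__main__":
    d_i = {"E1": 0.6, "E2": 0.05, "E3": 0.8}
    d_j = {"E1": 0.6, "E2": 0.9, "E3": 0.02}
    print(thresholded_similarity(d_i, d_j, threshold=0.1))
```

In this usage example only E1 exceeds the threshold in both vectors, so it alone contributes to the numerator, while each norm is taken over that disease's own retained weights.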
In one embodiment, an experiment applies three disease similarity calculation methods or devices to the "7 most easily misdiagnosed diseases". The calculation methods used are:
Method A: calculating disease similarity based on semantic association;
Method B: calculating disease similarity based on disease-related genes;
Method C: calculating disease similarity based on big data group behaviors, as in the embodiment of the present invention.
The experimental results are shown in Table 1 below:
Table 1
The experimental results in Table 1 show the following:
If the gene functions, cellular structures, biological processes or shared pathogenic genes of two diseases are substantially the same, then both the method based on semantic association and the method based on disease-related genes are effective in calculating disease similarity, which is very useful for research in disease science.
However, for two diseases that are easily misdiagnosed but have no cellular or genetic association, the first two methods perform rather poorly. The method and device of the embodiment of the present invention consider the social aspect of diseases and calculate disease similarity based on the disease-diagnosis behavior of a big-data population, and can thereby identify the 7 most easily misdiagnosed diseases.
The disease similarity calculation method and device based on big data group behaviors provided in the embodiments of the present invention calculate the similarity between diseases from the social aspect of diseases, according to the disease-diagnosis behavior of a big-data population, and can be used to identify diseases that are easily misdiagnosed yet have no cellular or genetic association. A disease vector is established from the weights with which each patient unit example is diagnosed as each disease, and disease similarity is calculated by cosine distance, so that the most easily misdiagnosed diseases can be identified. When the weight value with which a patient unit example is diagnosed as a certain disease is very low, the credibility of that diagnosis is considered very low as well, and the weight can be ignored when calculating disease similarity.
The above are only preferred embodiments of the present invention and do not thereby limit its patent scope. Any equivalent structure or equivalent process transformation made using the contents of the description and drawings of the present invention, or any direct or indirect application in other related technical fields, is likewise included in the scope of patent protection of the present invention.

Claims (10)

1. A disease similarity calculation method based on big data group behaviors, characterized by comprising the steps of:
calculating the weight with which each patient unit example is diagnosed as each disease, the patient unit example including patient case information;
establishing a disease vector for each disease according to the calculated weight values, the weight values serving as the elements of the disease vector; and
calculating disease similarity according to the disease vectors.
2. The disease similarity calculation method based on big data group behaviors according to claim 1, characterized in that the step of calculating the weight with which each patient unit example is diagnosed as each disease includes:
counting how many times each patient unit example is diagnosed as each disease, converting this count into a relative frequency over all the data, and using the relative frequency as the weight value with which the patient unit example is diagnosed as each disease.
3. The disease similarity calculation method based on big data group behaviors according to claim 1, characterized in that the step of calculating disease similarity according to the disease vectors includes:
calculating the similarity of two diseases from their disease vectors by using the cosine distance.
4. The disease similarity calculation method based on big data group behaviors according to claim 3, characterized in that the calculation formula for calculating the similarity of two diseases from their disease vectors by using the cosine distance is:
$$\mathrm{sim}(\vec{D_i}, \vec{D_j}) = \cos\langle \vec{D_i}, \vec{D_j} \rangle = \frac{\sum_{k=1}^{H} d_{ik}\, d_{jk}}{\sqrt{\sum_{k=1}^{H} d_{ik}^{2}} \cdot \sqrt{\sum_{k=1}^{H} d_{jk}^{2}}};$$
where $\mathrm{sim}(\vec{D_i}, \vec{D_j})$ is the disease similarity between the two disease vectors; $d_{ik}$ and $d_{jk}$ are the weight values with which the $k$-th patient unit example is diagnosed as diseases $D_i$ and $D_j$ respectively; $H$ is the total number of patient unit examples; and $k$ is a natural number.
5. The disease similarity calculation method based on big data group behaviors according to claim 3 or 4, characterized in that, when the weight value with which a patient unit example is diagnosed as a certain disease is very low, the calculation formula for calculating the similarity of two diseases from their disease vectors by using the cosine distance is:
$$\mathrm{sim}(\vec{D_i}, \vec{D_j}) = \frac{\sum_{k \in T_{ij}} d_{ik}\, d_{jk}}{\sqrt{\sum_{k \in T_i} d_{ik}^{2}} \cdot \sqrt{\sum_{k \in T_j} d_{jk}^{2}}};$$
where $\mathrm{sim}(\vec{D_i}, \vec{D_j})$ is the disease similarity between the two disease vectors; $d_{ik}$ and $d_{jk}$ are the weight values with which the $k$-th patient unit example is diagnosed as diseases $D_i$ and $D_j$ respectively, each higher than the set weight; $k$ is a natural number; $T_i$ is the set of patient unit examples whose weight for disease $D_i$ is higher than the set weight; $T_j$ is the set of patient unit examples whose weight for disease $D_j$ is higher than the set weight; and $T_{ij}$ is the intersection of the two, that is, the patient unit examples whose weights for both $D_i$ and $D_j$ are higher than the set weight.
6. A disease similarity calculation device based on big data group behaviors, characterized by comprising:
a weight calculation unit, which calculates the weight with which each patient unit example is diagnosed as each disease, the patient unit example including patient case information;
a vector establishing unit, which establishes a disease vector for each disease according to the calculated weight values, the weight values serving as the elements of the disease vector; and
a similarity calculation unit, which calculates disease similarity according to the disease vectors.
7. The disease similarity calculation device based on big data group behaviors according to claim 6, characterized in that the weight calculation unit includes:
a weight computing subunit, which counts how many times each patient unit example is diagnosed as each disease, converts this count into a relative frequency over all the data, and uses the relative frequency as the weight value with which the patient unit example is diagnosed as each disease.
8. The disease similarity calculation device based on big data group behaviors according to claim 6, characterized in that the similarity calculation unit includes:
a cosine distance computing subunit, which calculates the similarity of two diseases from their disease vectors by using the cosine distance.
9. The disease similarity calculation device based on big data group behaviors according to claim 8, characterized in that the calculation formula of the cosine distance computing subunit is:
$$\mathrm{sim}(\vec{D_i}, \vec{D_j}) = \cos\langle \vec{D_i}, \vec{D_j} \rangle = \frac{\sum_{k=1}^{H} d_{ik}\, d_{jk}}{\sqrt{\sum_{k=1}^{H} d_{ik}^{2}} \cdot \sqrt{\sum_{k=1}^{H} d_{jk}^{2}}};$$
where $\mathrm{sim}(\vec{D_i}, \vec{D_j})$ is the disease similarity between the two disease vectors; $d_{ik}$ and $d_{jk}$ are the weight values with which the $k$-th patient unit example is diagnosed as diseases $D_i$ and $D_j$ respectively; $H$ is the total number of patient unit examples; and $k$ is a natural number.
10. The disease similarity calculation device based on big data group behaviors according to claim 8 or 9, characterized in that, when the weight value with which a patient unit example is diagnosed as a certain disease is very low, the calculation formula of the cosine distance computing subunit is:
$$\mathrm{sim}(\vec{D_i}, \vec{D_j}) = \frac{\sum_{k \in T_{ij}} d_{ik}\, d_{jk}}{\sqrt{\sum_{k \in T_i} d_{ik}^{2}} \cdot \sqrt{\sum_{k \in T_j} d_{jk}^{2}}};$$
where $\mathrm{sim}(\vec{D_i}, \vec{D_j})$ is the disease similarity between the two disease vectors; $d_{ik}$ and $d_{jk}$ are the weight values with which the $k$-th patient unit example is diagnosed as diseases $D_i$ and $D_j$ respectively, each higher than the set weight; $k$ is a natural number; $T_i$ is the set of patient unit examples whose weight for disease $D_i$ is higher than the set weight; $T_j$ is the set of patient unit examples whose weight for disease $D_j$ is higher than the set weight; and $T_{ij}$ is the intersection of the two, that is, the patient unit examples whose weights for both $D_i$ and $D_j$ are higher than the set weight.
CN201610307328.7A 2016-05-10 2016-05-10 Disease similarity calculation method and device based on big data group behaviors Pending CN106021871A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201610307328.7A CN106021871A (en) 2016-05-10 2016-05-10 Disease similarity calculation method and device based on big data group behaviors

Publications (1)

Publication Number Publication Date
CN106021871A true CN106021871A (en) 2016-10-12

Family

ID=57100197

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201610307328.7A Pending CN106021871A (en) 2016-05-10 2016-05-10 Disease similarity calculation method and device based on big data group behaviors

Country Status (1)

Country Link
CN (1) CN106021871A (en)

Legal Events

Date Code Title Description
C06 Publication
PB01 Publication
C10 Entry into substantive examination
SE01 Entry into force of request for substantive examination
RJ01 Rejection of invention patent application after publication

Application publication date: 20161012