CN106021871A - Disease similarity calculation method and device based on big data group behaviors - Google Patents

Disease similarity calculation method and device based on big data group behaviors

Info

Publication number
CN106021871A
Authority
CN
China
Prior art keywords
disease
weight
patient
similarity
diseases
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN201610307328.7A
Other languages
Chinese (zh)
Inventor
韦辉华
王界兵
张伟
董迪马
郭宇翔
宋泰然
梁猛
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Shenzhen Frontsurf Information Technology Co Ltd
Original Assignee
Shenzhen Frontsurf Information Technology Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Shenzhen Frontsurf Information Technology Co Ltd
Priority to CN201610307328.7A
Publication of CN106021871A
Legal status: Pending

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00 Pattern recognition
    • G06F18/20 Analysing
    • G06F18/22 Matching criteria, e.g. proximity measures
    • G PHYSICS
    • G16 INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR SPECIFIC APPLICATION FIELDS
    • G16H HEALTHCARE INFORMATICS, i.e. INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR THE HANDLING OR PROCESSING OF MEDICAL OR HEALTHCARE DATA
    • G16H50/00 ICT specially adapted for medical diagnosis, medical simulation or medical data mining; ICT specially adapted for detecting, monitoring or modelling epidemics or pandemics
    • G16H50/70 ICT specially adapted for medical diagnosis, medical simulation or medical data mining; ICT specially adapted for detecting, monitoring or modelling epidemics or pandemics for mining of medical data, e.g. analysing previous cases of other patients

Landscapes

  • Engineering & Computer Science (AREA)
  • Data Mining & Analysis (AREA)
  • Public Health (AREA)
  • Health & Medical Sciences (AREA)
  • Theoretical Computer Science (AREA)
  • Medical Informatics (AREA)
  • Primary Health Care (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Epidemiology (AREA)
  • General Health & Medical Sciences (AREA)
  • Databases & Information Systems (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Artificial Intelligence (AREA)
  • Bioinformatics & Cheminformatics (AREA)
  • Bioinformatics & Computational Biology (AREA)
  • Pathology (AREA)
  • Evolutionary Biology (AREA)
  • Evolutionary Computation (AREA)
  • Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Biomedical Technology (AREA)
  • Medical Treatment And Welfare Office Work (AREA)

Abstract

The invention discloses a disease similarity calculation method and device based on big data group behaviors. The method comprises the following steps: calculating, for each patient meta-instance, the weight with which it is diagnosed as each disease, wherein the patient meta-instance comprises patient case information; establishing a disease vector for each disease from the calculated weights, wherein the weights are used as the elements of the disease vector; and calculating disease similarity from the disease vectors. The method and device calculate the similarity among diseases from the social perspective of diseases, according to the diagnosis and treatment behaviors of a big data population, and can be used for identifying diseases that are likely to be misdiagnosed as one another but have no association at the level of cells, genes and the like.

Description

Disease similarity calculation method and device based on big data group behaviors
Technical Field
The invention relates to the field of calculation of disease similarity, in particular to a disease similarity calculation method and device based on big data group behaviors.
Background
Current methods for calculating disease similarity are usually based on the attributes of diseases, such as the inclusion relationship between diseases ('breast cancer' includes 'male breast cancer' and 'female breast cancer') or disease-to-disease association factors (common pathogenic genes, common therapeutic drugs, common metabolites, etc.). These methods generally take one of two perspectives:
1. Calculating disease similarity based on semantic association.
The biomedical field often uses ontologies, such as the Gene Ontology and the Human Phenotype Ontology, to compute the semantic similarity of terms. However, only a few of these methods have been used to calculate disease similarity. The method designed by Resnik is the most common; it is mostly applied to the Gene Ontology to calculate the similarity of gene function, cell structure and biological process terms, and has clear advantages over other methods (union-intersection, longest shared path, JC). Resnik's method calculates term similarity using the 'is_a' relationship in the ontology, so the similarity of a disease pair depends mainly on the common ancestor node of the pair that carries the most information. The Lin method refines the way information entropy is compared in the Resnik method and improves on it to some extent from a theoretical point of view. Researchers have recently implemented the Resnik and Lin methods in R packages to facilitate the calculation of disease similarity. The method proposed by Wang et al. optimizes Resnik's method further: when calculating the similarity of a disease pair, it considers not only the most informative common ancestor node of the pair but also the pair's other common ancestor nodes. The advantage of this method shows most clearly on the Gene Ontology, and it has been used to calculate the semantic similarity of disease terms in the Medical Subject Headings.
2. Calculating disease similarity based on disease-associated genes.
The association of diseases is not only reflected in disease-related ontologies, but also in common pathogenic genes. Therefore, researchers are also concerned with how to calculate the similarity of diseases based on their causative genes. There are two methods for calculating disease similarity based on genes.
(1) The first method is based on overlapping gene sets (BOG). It compares the number of genes shared between diseases to obtain their similarity. Compared with similarity calculated from the semantic angle, this method finds similar disease pairs from an entirely new angle, and can therefore discover previously unknown disease associations. However, when calculating disease similarity it does not consider the functional association between disease genes, even though that association clearly has some influence on the similarity of diseases.
(2) The second method is based on process similarity (PSB), where 'process' refers to the biological process terms of the Gene Ontology associated with the causative genes. This method takes the functional association of disease genes into account and thereby greatly improves on the BOG method. PSB also performs well compared with the Resnik, Lin, LC and JC methods. Functional associations between genes cover many aspects, such as gene co-expression, protein interactions and Gene Ontology terms. In addition, to improve performance further, the FunSim method uses an integrated weighted human gene association network to calculate disease similarity.
Therefore, if the gene functions, cell constitutions, biological processes or common pathogenic genes of two diseases are approximately the same, both approaches (semantic association and disease-associated genes) are effective in calculating disease similarity, which is useful for disease research. However, both are relatively ineffective for two diseases that are easily misdiagnosed as one another but have no association at the level of cells, genes and the like.
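As a rough illustration of the overlapping-gene-set idea in item (1), the sketch below counts the genes two diseases share. The text only says that the number of shared genes is compared, so the Jaccard normalization shown here is an assumption, and the gene identifiers are hypothetical placeholders.

```python
def shared_gene_count(genes_a, genes_b):
    """Number of genes associated with both diseases."""
    return len(set(genes_a) & set(genes_b))


def jaccard_overlap(genes_a, genes_b):
    """Shared genes divided by all genes of either disease.

    Assumption: this normalization is one common choice, not one
    prescribed by the text.
    """
    a, b = set(genes_a), set(genes_b)
    union = a | b
    return len(a & b) / len(union) if union else 0.0


# Hypothetical gene sets, for illustration only.
disease_x = {"g1", "g2", "g3"}
disease_y = {"g2", "g3", "g4"}
print(shared_gene_count(disease_x, disease_y))  # 2
print(jaccard_overlap(disease_x, disease_y))    # 0.5
```

A measure of this kind sees only gene membership, which is why it cannot capture the functional association between genes that the PSB method adds.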
Disclosure of Invention
The main object of the invention is to provide a disease similarity calculation method and device based on big data group behaviors, which calculate the similarity between diseases from the social perspective of diseases, according to the disease diagnosis and treatment behaviors of a big data population, and which can be used for identifying diseases that are easily misdiagnosed as one another but have no association at the level of cells, genes and the like.
The invention provides a disease similarity calculation method based on big data group behaviors, which comprises the following steps:
calculating, for each patient meta-instance, the weight with which it is diagnosed as each disease, the patient meta-instance comprising patient case information;
establishing a disease vector for each disease from the calculated weights, the weights being used as the elements of the disease vector; and
calculating disease similarity from the disease vectors.
Further, the step of calculating, for each patient meta-instance, the weight with which it is diagnosed as each disease comprises:
counting the number of times each patient meta-instance is diagnosed as each disease, computing the relative frequency of that count in all the data, and taking the relative frequency as the weight with which the patient meta-instance is diagnosed as the disease.
Further, the step of calculating disease similarity from the disease vectors comprises:
calculating the similarity of two diseases from their disease vectors using the cosine distance.
Further, the formula for calculating the similarity of two diseases from their disease vectors using the cosine distance is:
$$\mathrm{sim}(\vec{D_i}, \vec{D_j}) = \cos\langle \vec{D_i}, \vec{D_j} \rangle = \frac{\sum_{k=1}^{H} d_{ik}\, d_{jk}}{\sqrt{\sum_{k=1}^{H} d_{ik}^{2}} \cdot \sqrt{\sum_{k=1}^{H} d_{jk}^{2}}};$$
where $\mathrm{sim}(\vec{D_i}, \vec{D_j})$ is the disease similarity between the two disease vectors; $d_{ik}$ and $d_{jk}$ are the weights with which the $k$-th patient meta-instance is diagnosed as diseases $D_i$ and $D_j$ respectively; $H$ is the total number of patient meta-instances; and $k$ is a natural number.
Further, when the weight with which a patient meta-instance is diagnosed as a certain disease is very low, the formula for calculating the similarity of two diseases from their disease vectors using the cosine distance is:
$$\mathrm{sim}(\vec{D_i}, \vec{D_j}) = \frac{\sum_{k \in T_{ij}} d_{ik}\, d_{jk}}{\sqrt{\sum_{k \in T_i} d_{ik}^{2}} \cdot \sqrt{\sum_{k \in T_j} d_{jk}^{2}}};$$
where $\mathrm{sim}(\vec{D_i}, \vec{D_j})$ is the disease similarity between the two disease vectors; $d_{ik}$ and $d_{jk}$ are the weights, higher than the set weight, with which the $k$-th patient meta-instance is diagnosed as diseases $D_i$ and $D_j$ respectively; $k$ is a natural number; $T_i$ is the set of patient meta-instances whose weight for disease $D_i$ is higher than the set weight; $T_j$ is the set of patient meta-instances whose weight for disease $D_j$ is higher than the set weight; and $T_{ij}$ is the set of patient meta-instances whose weights for both $D_i$ and $D_j$ are higher than the set weight.
The invention also provides a disease similarity calculation device based on big data group behaviors, which comprises:
a weight calculation unit for calculating, for each patient meta-instance, the weight with which it is diagnosed as each disease, the patient meta-instance comprising patient case information;
a vector establishing unit for establishing a disease vector for each disease from the calculated weights, the weights being used as the elements of the disease vector; and
a similarity calculation unit for calculating disease similarity from the disease vectors.
Further, the weight calculation unit includes:
a weight calculation subunit for counting the number of times each patient meta-instance is diagnosed as each disease, computing the relative frequency of that count in all the data, and taking the relative frequency as the weight with which the patient meta-instance is diagnosed as the disease.
Further, the similarity calculation unit includes:
a cosine distance calculation subunit for calculating the similarity of two diseases from their disease vectors using the cosine distance.
Further, the calculation formula of the cosine distance calculation subunit is:
$$\mathrm{sim}(\vec{D_i}, \vec{D_j}) = \cos\langle \vec{D_i}, \vec{D_j} \rangle = \frac{\sum_{k=1}^{H} d_{ik}\, d_{jk}}{\sqrt{\sum_{k=1}^{H} d_{ik}^{2}} \cdot \sqrt{\sum_{k=1}^{H} d_{jk}^{2}}};$$
where $\mathrm{sim}(\vec{D_i}, \vec{D_j})$ is the disease similarity between the two disease vectors; $d_{ik}$ and $d_{jk}$ are the weights with which the $k$-th patient meta-instance is diagnosed as diseases $D_i$ and $D_j$ respectively; $H$ is the total number of patient meta-instances; and $k$ is a natural number.
Further, when the weight with which a patient meta-instance is diagnosed as a certain disease is very low, the calculation formula of the cosine distance calculation subunit is:
$$\mathrm{sim}(\vec{D_i}, \vec{D_j}) = \frac{\sum_{k \in T_{ij}} d_{ik}\, d_{jk}}{\sqrt{\sum_{k \in T_i} d_{ik}^{2}} \cdot \sqrt{\sum_{k \in T_j} d_{jk}^{2}}};$$
where $\mathrm{sim}(\vec{D_i}, \vec{D_j})$ is the disease similarity between the two disease vectors; $d_{ik}$ and $d_{jk}$ are the weights, higher than the set weight, with which the $k$-th patient meta-instance is diagnosed as diseases $D_i$ and $D_j$ respectively; $k$ is a natural number; $T_i$ is the set of patient meta-instances whose weight for disease $D_i$ is higher than the set weight; $T_j$ is the set of patient meta-instances whose weight for disease $D_j$ is higher than the set weight; and $T_{ij}$ is the set of patient meta-instances whose weights for both $D_i$ and $D_j$ are higher than the set weight.
The disease similarity calculation method and device based on big data group behaviors provided by the invention have the following beneficial effects:
The method and device calculate the similarity between diseases from the social perspective of diseases, according to the disease diagnosis and treatment behaviors of a big data population, and can be used for identifying diseases that are easily misdiagnosed as one another but have no association at the level of cells, genes and the like. A disease vector is established from the weights with which each patient meta-instance is diagnosed as each disease, and disease similarity is calculated by cosine distance, so that the diseases most easily misdiagnosed can be identified. When the weight with which a patient meta-instance is diagnosed as a certain disease is very low, the credibility of that diagnosis is considered very low, and the weight can be ignored when calculating disease similarity.
Drawings
FIG. 1 is a schematic diagram of the steps of a disease similarity calculation method based on big data population behavior according to an embodiment of the present invention;
FIG. 2 is a schematic diagram of the steps of a disease similarity calculation method based on big data population behavior according to another embodiment of the present invention;
FIG. 3 is a schematic structural diagram of a disease similarity calculation apparatus based on big data group behavior according to an embodiment of the present invention;
FIG. 4 is a schematic diagram of a disease similarity calculation apparatus based on big data group behavior according to another embodiment of the present invention;
FIG. 5 is a diagram illustrating a structure of a weight calculation unit according to an embodiment of the present invention;
FIG. 6 is a schematic structural diagram of a similarity calculation unit according to an embodiment of the present invention.
The implementation, functional features and advantages of the objects of the present invention will be further explained with reference to the accompanying drawings.
Detailed Description
It should be understood that the specific embodiments described herein are merely illustrative of the invention and are not intended to limit the invention.
Referring to fig. 1, a schematic diagram of the steps of a disease similarity calculation method based on big data group behaviors according to an embodiment of the present invention is shown.
An embodiment of the present invention provides a disease similarity calculation method based on big data group behaviors, including:
Step S1: calculating, for each patient meta-instance, the weight with which it is diagnosed as each disease. The patient meta-instance includes patient case information, and the case information includes at least the patient's sex, age, symptoms, symptom sites and detailed symptom descriptions.
Step S2: establishing a disease vector for each disease from the calculated weights; the weights are used as the elements of the disease vector.
Step S3: calculating disease similarity from the disease vectors.
At present, disease similarity is generally calculated either from semantic association or from disease-associated genes. If the gene functions, cell constitutions, biological processes or common pathogenic genes of two diseases are approximately the same, both approaches are effective, which is useful for disease research. However, both are relatively ineffective for two diseases that are easily misdiagnosed as one another but have no association at the level of cells, genes and the like. The disease similarity calculation method based on big data group behaviors provided in this embodiment quantifies the similarity between diseases. The calculation does not use attribute relationships between diseases; it uses the disease diagnosis behaviors of a big data population. Massive outpatient data of the population are collected for each disease, the diseases are modeled from the population's diagnosis behaviors, and the similarity between every pair of diseases is calculated with a suitable similarity measure, so that the diseases most prone to misdiagnosis can be identified. The method can be used in systems for assisted diagnosis, self-diagnosis and the like. It is also of significant value for research into the pathogenesis of complex diseases, the early prevention and diagnosis of serious and infectious diseases, the development of new drugs, and so on.
Referring to FIG. 2, before step S1, the method may further include:
Step S0: constructing patient meta-instances from case information such as the patient's sex, age, symptoms, symptom sites and detailed symptom descriptions. Step S0 may include the following steps:
Step 1: Collect disease diagnosis and treatment data from patients of different sexes and ages. From each case, extract the following information (without being limited to it): patient sex, patient age, symptoms, symptom sites, detailed symptom descriptions and the confirmed disease diagnosis. Patient sex is denoted $G: \{G_1, G_2\}$, where $G_i$ is one of the 2 possible values of sex $G$; patient age is denoted $A$; symptoms are denoted $S: \{S_1, S_2, \ldots, S_L\}$, where $S_i$ is one of the $L$ possible values of symptom $S$; symptom sites are denoted $B: \{B_1, B_2, \ldots, B_M\}$, where $B_i$ is one of the $M$ possible values of site $B$; diseases are denoted $D: \{D_1, D_2, \ldots, D_N\}$, where $D_i$ is one of the $N$ possible values of disease $D$.
Step 2: Discretize the age data into $K$ segments, $A: \{A_1, A_2, \ldots, A_K\}$, where $A_i$ is one of the $K$ possible values of age $A$. The age data can be segmented in two ways:
First, $K = 5$:
(1) childhood: 0-6 years; (2) juvenile: 7-17 years; (3) young adult: 18-40 years; (4) middle age: 41-65 years; (5) old age: 66 years and above.
Second, $K = 14$:
(1) infancy: 0-3 months; (2) toddler period: 4 months-2.5 years; (3) early childhood: over 2.5 years-6 years; (4) initial period: 7-10 years; (5) rebellious period: 11-14 years; (6) growth period: 15-17 years; (7) youth: 18-28 years; (8) mature period: 29-40 years; (9) robust period: 41-48 years; (10) steady period: 49-55 years; (11) adjustment period: 56-65 years; (12) early old age: 67-72 years; (13) middle old age: 73-84 years; (14) advanced old age: 85 years and above.
Step 3: Consolidate the detailed symptom descriptions under each symptom into 3 to 5 descriptions according to the onset time, severity, triggering factors and so on of the symptom: $S_i: \{S_{i1}, S_{i2}, S_{i3}, S_{i4}, S_{i5}\}$, where $S_{i1}, S_{i2}, S_{i3}, S_{i4}, S_{i5}$ are the detailed descriptions of symptom $S_i$. The number of detailed descriptions may differ from symptom to symptom.
Step 4: Combine the values of patient sex $G: \{G_1, G_2\}$, patient age $A: \{A_1, A_2, \ldots, A_K\}$, symptom $S: \{S_1, S_2, \ldots, S_L\}$, symptom site $B: \{B_1, B_2, \ldots, B_M\}$ and detailed symptom description $S_i: \{S_{i1}, S_{i2}, S_{i3}, S_{i4}, S_{i5}\}$ into patient meta-instances $E: \{E_1, E_2, \ldots, E_H\}$. For example, $E_1$ means "sex is $G_1$, age is $A_1$, symptom is $S_1$, site is $B_1$, and the symptom is described as $S_{11}$". In theory $H = 2 \times K \times L \times M \times 5$; in practice, because some symptoms appear only in certain age groups or in one sex, $H < 2 \times K \times L \times M \times 5$.
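Steps 2 and 4 can be sketched as follows, assuming the K = 5 age segmentation and cases stored as dictionaries. The field names, the `MetaInstance` type and the sample records are hypothetical and serve only as an illustration; indexing only the combinations that actually occur is what keeps H below the theoretical maximum.

```python
from bisect import bisect_left
from typing import NamedTuple

# Inclusive upper bounds (whole years) of the first four K = 5 age segments;
# ages above the last bound fall into the fifth segment.
AGE_UPPER_BOUNDS = [6, 17, 40, 65]


def age_segment(age_years: int) -> int:
    """Return the 1-based index i of the age segment A_i (K = 5 scheme)."""
    if age_years < 0:
        raise ValueError("age must be non-negative")
    return bisect_left(AGE_UPPER_BOUNDS, age_years) + 1


class MetaInstance(NamedTuple):
    """One combination of sex, age segment, symptom, site and description."""
    sex: str
    age_segment: int
    symptom: str
    site: str
    description: str


def build_meta_instances(cases):
    """Assign an index to every distinct combination seen in the cases."""
    index = {}
    for case in cases:
        key = MetaInstance(
            case["sex"],
            age_segment(case["age"]),
            case["symptom"],
            case["site"],
            case["description"],
        )
        # Keep the first index given to a combination; new ones get the next.
        index.setdefault(key, len(index))
    return index


# Hypothetical records, for illustration only.
cases = [
    {"sex": "F", "age": 34, "symptom": "cough", "site": "chest",
     "description": "worse at night"},
    {"sex": "F", "age": 29, "symptom": "cough", "site": "chest",
     "description": "worse at night"},
    {"sex": "M", "age": 70, "symptom": "cough", "site": "chest",
     "description": "worse at night"},
]
meta_index = build_meta_instances(cases)
print(len(meta_index))  # 2: the first two cases share one meta-instance
```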
Further, in step S1, calculating, for each patient meta-instance, the weight with which it is diagnosed as each disease includes:
counting the number of times each patient meta-instance is diagnosed as each disease, computing the relative frequency of that count in all the data, and taking the relative frequency as the weight with which the patient meta-instance is diagnosed as the disease.
In this embodiment, the weight $W_{ij}$ with which each patient meta-instance $E_i$ is diagnosed as each disease $D_j$ is calculated as follows:
Using the massive case data collected beforehand (a data volume large enough to meet the requirements of statistical significance), count for each patient meta-instance $E_i$ the number of times $\{F_{i1}, F_{i2}, \ldots, F_{iN}\}$ it is diagnosed as each of the diseases $\{D_1, D_2, \ldots, D_N\}$, and convert the counts into relative frequencies:
wherein
This relative frequency is taken as the weight with which patient meta-instance $E_i$ is diagnosed as each of the diseases $\{D_1, D_2, \ldots, D_N\}$, i.e.:
$$W_{ij} = \frac{F_{ij}}{F_i}.$$
This yields the patient meta-instance-to-disease weight matrix $W_{ij}$.
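A minimal sketch of the weight calculation, assuming each case has already been reduced to a (meta-instance, diagnosed disease) pair. The normalizer is taken here to be the total number of cases observed for the meta-instance, which is an assumption; the function name and sample labels are hypothetical.

```python
from collections import Counter, defaultdict


def weight_matrix(records):
    """Return {meta_instance: {disease: W_ij}} from (meta, disease) pairs.

    F_ij is the number of cases in which meta-instance i was diagnosed as
    disease j. Assumption: the normalizer F_i is the sum of F_ij over all
    diseases j for that meta-instance.
    """
    counts = defaultdict(Counter)
    for meta, disease in records:
        counts[meta][disease] += 1

    weights = {}
    for meta, per_disease in counts.items():
        total = sum(per_disease.values())  # assumed F_i
        weights[meta] = {d: n / total for d, n in per_disease.items()}
    return weights


# Hypothetical labels, for illustration only.
records = [("E1", "D1"), ("E1", "D1"), ("E1", "D2"), ("E2", "D2")]
weights = weight_matrix(records)
# E1 was diagnosed as D1 in 2 of its 3 cases and as D2 in 1 of 3;
# E2 was diagnosed as D2 in its only case.
```

Under this normalization the weights of one meta-instance sum to 1, so each row of the matrix reads as an empirical distribution over diagnoses.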
Further, in step S2, establishing a disease vector for each disease from the calculated weights includes: taking the weights in the patient meta-instance-to-disease weight matrix $W_{ij}$ as the elements of the disease vector:
$$\vec{D_j} = \{d_{j1}, d_{j2}, \ldots, d_{jH}\} = (W_{1j}, W_{2j}, \ldots, W_{Hj}),$$
that is, $d_{jk} = W_{kj}$.
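The disease vector is simply one column of the weight matrix. The sketch below assumes the matrix is stored as a nested dictionary keyed by meta-instance and then by disease, with missing entries treated as weight 0; the storage layout, names and sample values are hypothetical.

```python
def disease_vectors(weights, meta_order, diseases):
    """Return {disease: [W_1j, ..., W_Hj]}, one column per disease.

    meta_order fixes the order of the H patient meta-instances so that all
    vectors share the same coordinates. A meta-instance never diagnosed as
    a disease contributes 0.0, which is what makes the vectors sparse.
    """
    return {
        disease: [weights.get(meta, {}).get(disease, 0.0) for meta in meta_order]
        for disease in diseases
    }


# Hypothetical weights, for illustration only.
weights = {"E1": {"D1": 0.75, "D2": 0.25}, "E2": {"D2": 1.0}}
vectors = disease_vectors(weights, ["E1", "E2"], ["D1", "D2"])
# vectors["D1"] is [0.75, 0.0] and vectors["D2"] is [0.25, 1.0]
```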
Further, in step S3, calculating disease similarity from the disease vectors includes:
calculating the similarity of two diseases from their disease vectors using the cosine distance.
Further, the disease vectors are high-dimensional sparse vectors, and the formula for calculating the similarity of two diseases from their disease vectors using the cosine distance is:
$$\mathrm{sim}(\vec{D_i}, \vec{D_j}) = \cos\langle \vec{D_i}, \vec{D_j} \rangle = \frac{\sum_{k=1}^{H} d_{ik}\, d_{jk}}{\sqrt{\sum_{k=1}^{H} d_{ik}^{2}} \cdot \sqrt{\sum_{k=1}^{H} d_{jk}^{2}}};$$
where $\mathrm{sim}(\vec{D_i}, \vec{D_j})$ is the disease similarity between the two disease vectors; $d_{ik}$ and $d_{jk}$ are the weights with which the $k$-th patient meta-instance is diagnosed as diseases $D_i$ and $D_j$ respectively; $H$ is the total number of patient meta-instances; and $k$ is a natural number.
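A minimal sketch of the cosine measure on two disease vectors, written as the standard cosine similarity: the dot product divided by the product of the Euclidean norms. Returning 0.0 for an all-zero vector is a convention chosen here, not something the text specifies.

```python
import math


def cosine_similarity(u, v):
    """Cosine of the angle between two equal-length weight vectors."""
    if len(u) != len(v):
        raise ValueError("vectors must have the same length H")
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0  # convention for an empty disease vector (assumption)
    return dot / (norm_u * norm_v)


# Diseases diagnosed from disjoint sets of meta-instances have similarity 0.
print(cosine_similarity([1.0, 0.0], [0.0, 1.0]))  # 0.0
```

Because the weights are non-negative, the result lies between 0 and 1, and a value near 1 means the two diseases are diagnosed from nearly the same patient meta-instances in nearly the same proportions.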
Further, when the weight with which a patient meta-instance is diagnosed as a certain disease is very low, the diagnosis of that disease from the patient meta-instance has low credibility and can usually be ignored when calculating disease similarity. The formula for calculating the similarity of two diseases from their disease vectors using the cosine distance is then adjusted as follows:
$$\mathrm{sim}(\vec{D_i}, \vec{D_j}) = \frac{\sum_{k \in T_{ij}} d_{ik}\, d_{jk}}{\sqrt{\sum_{k \in T_i} d_{ik}^{2}} \cdot \sqrt{\sum_{k \in T_j} d_{jk}^{2}}};$$
where $\mathrm{sim}(\vec{D_i}, \vec{D_j})$ is the disease similarity between the two disease vectors; $d_{ik}$ and $d_{jk}$ are the weights, higher than the set weight, with which the $k$-th patient meta-instance is diagnosed as diseases $D_i$ and $D_j$ respectively; $k$ is a natural number; $T_i$ is the set of patient meta-instances whose weight for disease $D_i$ is higher than the set weight; $T_j$ is the set of patient meta-instances whose weight for disease $D_j$ is higher than the set weight; and $T_{ij}$ is the set of patient meta-instances whose weights for both $D_i$ and $D_j$ are higher than the set weight. The set weight is a minimum threshold chosen according to actual conditions. When the weight with which a patient meta-instance is diagnosed as $D_i$ or $D_j$ is smaller than this threshold, its credibility is judged to be low, and that weight is ignored when calculating disease similarity.
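The thresholded variant can be sketched as below, assuming a single threshold `min_weight` (a hypothetical parameter name) applies to both diseases and that "higher than the set weight" means a strict comparison. Returning 0.0 when either filtered vector is empty is a convention chosen here.

```python
import math


def thresholded_similarity(u, v, min_weight):
    """Cosine-style similarity that ignores weights at or below min_weight."""
    if len(u) != len(v):
        raise ValueError("vectors must have the same length H")
    t_i = {k for k, w in enumerate(u) if w > min_weight}  # set T_i
    t_j = {k for k, w in enumerate(v) if w > min_weight}  # set T_j
    t_ij = t_i & t_j                                      # set T_ij

    numerator = sum(u[k] * v[k] for k in t_ij)
    denominator = (math.sqrt(sum(u[k] ** 2 for k in t_i))
                   * math.sqrt(sum(v[k] ** 2 for k in t_j)))
    return numerator / denominator if denominator else 0.0


# Hypothetical weights, for illustration only: with min_weight = 0.1 the
# entries 0.05 and 0.02 are dropped, so only index 0 remains in T_ij.
u = [0.9, 0.05, 0.4]
v = [0.8, 0.5, 0.02]
score = thresholded_similarity(u, v, min_weight=0.1)
```

With `min_weight` at or below zero and strictly positive weights, the function reduces to the plain cosine similarity.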
Fig. 3 is a schematic structural diagram of a disease similarity calculation apparatus based on big data group behaviors according to an embodiment of the present invention.
An embodiment of the present invention further provides a disease similarity calculation apparatus based on big data group behaviors, including:
a weight calculation unit 10 for calculating, for each patient meta-instance, the weight with which it is diagnosed as each disease, the patient meta-instance including patient case information;
a vector establishing unit 20 for establishing a disease vector for each disease from the calculated weights, the weights being used as the elements of the disease vector; and
a similarity calculation unit 30 for calculating disease similarity from the disease vectors.
At present, disease similarity is generally calculated either from semantic association or from disease-associated genes. If the gene functions, cell constitutions, biological processes or common pathogenic genes of two diseases are approximately the same, both approaches are effective, which is useful for disease research. However, both are relatively ineffective for two diseases that are easily misdiagnosed as one another but have no association at the level of cells, genes and the like. The disease similarity calculation device based on big data group behaviors provided in this embodiment quantifies the similarity between diseases. The calculation does not use attribute relationships between diseases; it uses the disease diagnosis behaviors of a big data population. Massive outpatient data of the population are collected for each disease, the diseases are modeled from the population's diagnosis behaviors, and the similarity between every pair of diseases is calculated with a suitable similarity measure, so that the diseases most prone to misdiagnosis can be identified. The device can be used in systems for assisted diagnosis, self-diagnosis and the like. It is also of significant value for research into the pathogenesis of complex diseases, the early prevention and diagnosis of serious and infectious diseases, the development of new drugs, and so on.
Referring to FIG. 4, the disease similarity calculation device based on big data group behaviors may further include:
a meta-instance construction unit 1 for constructing patient meta-instances from case information such as the patient's sex, age, symptoms, symptom sites and detailed symptom descriptions. The meta-instance construction unit 1 may construct the patient meta-instances by the following steps:
step 1: collecting disease diagnosis and treatment data of patients with different sexes and different ages. One case extracts the following (but not limited to the following) information: sex of the patient, age of the patient, symptoms, symptom parts, detailed description of symptoms and definite diagnosis of diseases. Patient gender is expressed as G: { G1,G2},GiA value (of 2) indicating that gender G is desirable; the patient age is denoted as a; the symptoms are denoted S: { S1,S2,…,SL},SiA value (of L) indicative of the symptom S is advisable; the symptom site is denoted B: { B1,B2,…,BM},BiA value (of M) indicating that site B is desirable; the disease is represented as D: { D1,D2,…,DN},DiIndicating a value (of N) that is desirable for location D.
Step 2: discretizing age data into K segments: a: { A1,A2,…,AK},AiIndicating a value (of K) that is desirable for location a. The age data is divided into two methods:
first K ═ 5:
(1) and (4) childhood: 0-6 years old; (2) juvenile: 7-17 years old; (3) young people: 18-40 years old; (4) in middle age: age 41-65 years old; (5) old people: after age 66.
Second K ═ 14:
(1) during infancy: 0-3 weeks and months; (2) in the pediatric stage: 4 weeks month-2.5 years old; (3) in the infancy stage: 2.5 years later-6 years old; (4) in the initial period: 7-10 years old; (5) and (3) reverse period: 11-14 years old; (6) growth period: from 15 years old to 17 years old; (7) and (3) puberty: 18-28 years old; (8) and (3) mature period: 29-40 years old; (9) and (3) a strong fruiting period: age 41-48 years old; (10) a steady period: 49-55 years old; (11) adjusting period: 56-65 years old; (12) in the early stage of aging: age 67-72; (13) in the middle-aged and old period: 73-84 years old; (14) the old age: after the age of 85 years.
And step 3: the detailed description of symptoms under each symptom is integrated into 3 to 5 detailed descriptions of symptoms according to the occurrence time, severity, trigger factors and the like of the symptoms: si:{Si1,Si2,Si3,Si4,Si5In which S isi1,Si2,Si3,Si4,Si5Indicates the symptom SiDetailed description of symptoms of (1). The number of symptom refinement descriptions for each symptom may not be the same.
Step 4: Combine the values of patient sex $G = \{G_1, G_2\}$, patient age $A = \{A_1, A_2, \ldots, A_K\}$, symptom $S = \{S_1, S_2, \ldots, S_L\}$, symptom site $B = \{B_1, B_2, \ldots, B_M\}$, and refined symptom description $S_i = \{S_{i1}, S_{i2}, S_{i3}, S_{i4}, S_{i5}\}$ into patient meta-instances $E = \{E_1, E_2, \ldots, E_H\}$. For example, $E_1$ means "sex is $G_1$, age is $A_1$, symptom is $S_1$, site is $B_1$, refined symptom description is $S_{11}$". In theory $H = 2 \times K \times L \times M \times 5$; in practice, because some symptoms appear only in certain age groups or in one sex, $H < 2 \times K \times L \times M \times 5$.
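Steps 2 and 4 can be illustrated with a minimal Python sketch, assuming the K = 5 segmentation and ages given in whole years. The names `age_segment`, `MetaInstance`, and `to_meta_instance`, and the string labels such as "A1" or "G1", are hypothetical and chosen only for illustration; the patent does not prescribe a data structure.

```python
from typing import NamedTuple

# Inclusive upper bounds (in years) of the first four K = 5 age segments.
AGE_SEGMENTS = [
    (6, "A1"),   # childhood: 0-6
    (17, "A2"),  # juvenile: 7-17
    (40, "A3"),  # youth: 18-40
    (65, "A4"),  # middle age: 41-65
]


def age_segment(age_years: int) -> str:
    """Map an age in whole years to one of the K = 5 segment labels."""
    if age_years < 0:
        raise ValueError("age must be non-negative")
    for upper, label in AGE_SEGMENTS:
        if age_years <= upper:
            return label
    return "A5"  # old age: 66 and above


class MetaInstance(NamedTuple):
    """Hypothetical container for one patient meta-instance E_h."""
    sex: str
    age: str          # discretized age segment, not the raw age
    symptom: str
    site: str
    refinement: str   # refined symptom description, e.g. "S11"


def to_meta_instance(sex, age_years, symptom, site, refinement) -> MetaInstance:
    """Combine the raw fields of one case into a hashable meta-instance key."""
    return MetaInstance(sex, age_segment(age_years), symptom, site, refinement)


e = to_meta_instance("G1", 35, "S1", "B1", "S11")
print(e)
```

Because a `NamedTuple` is hashable, two cases that share sex, age segment, symptom, site, and refined description map to the same key, which is what the later frequency counting relies on.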
Further, referring to fig. 5, the weight calculation unit 10 includes:
the weight calculation subunit 100, which counts how many times each patient meta-instance is diagnosed as each disease, converts these counts into relative frequencies over the data, and uses each relative frequency as the weight of the patient meta-instance for the corresponding disease.
In the present embodiment, the weight calculation subunit 100 calculates the weight $W_{ij}$ of each patient meta-instance $E_i$ for each disease $D_j$ as follows:
using the previously collected massive case data (a data volume large enough to meet statistical significance requirements), for each patient meta-instance $E_i$, count the number of times $\{F_{i1}, F_{i2}, \ldots, F_{iN}\}$ it is diagnosed as each disease $\{D_1, D_2, \ldots, D_N\}$, and convert these counts into relative frequencies:
$$\frac{F_{ij}}{F_i},$$
where $F_i = \sum_{j=1}^{N} F_{ij}$ is the total number of cases of patient meta-instance $E_i$.
This relative frequency is taken as the weight of the patient meta-instance $E_i$ for each disease $\{D_1, D_2, \ldots, D_N\}$, i.e.:
$$W_{ij} = \frac{F_{ij}}{F_i}.$$
This yields the patient meta-instance-disease weight matrix $W_{ij}$.
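The weight calculation can be sketched in Python as follows. This is only an illustration under the assumption that $F_i$ is the total number of cases observed for meta-instance $E_i$ (the sum of $F_{ij}$ over all diseases). The function name `weight_matrix` and the nested-dictionary layout are hypothetical.

```python
from collections import Counter, defaultdict


def weight_matrix(cases):
    """Turn (meta_instance, disease) pairs into relative-frequency weights.

    Returns {meta_instance: {disease: W_ij}} with W_ij = F_ij / F_i.
    Assumption: F_i is the number of cases seen for that meta-instance.
    """
    counts = defaultdict(Counter)
    for meta_instance, disease in cases:
        counts[meta_instance][disease] += 1  # F_ij

    weights = {}
    for meta_instance, per_disease in counts.items():
        total = sum(per_disease.values())  # F_i
        weights[meta_instance] = {d: f / total for d, f in per_disease.items()}
    return weights


# Toy data: meta-instance "E1" is diagnosed twice as "D1" and once as "D2".
cases = [("E1", "D1"), ("E1", "D1"), ("E1", "D2"), ("E2", "D2")]
W = weight_matrix(cases)
print(W)
```

Only nonzero weights are stored, so a meta-instance that was never diagnosed as a given disease simply has no entry for it.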
Further, the vector creating unit 20 establishes a disease vector for each disease from the calculated weights; specifically, the weights in the patient meta-instance-disease weight matrix $W_{ij}$ are used as the elements of the disease vector:
$$\vec{D_j} = \{d_{1j}, d_{2j}, \ldots, d_{Hj}\} = (W_{1j}, W_{2j}, \ldots, W_{Hj}).$$
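Building the disease vectors amounts to reading the weight matrix column by column. A minimal sketch, assuming the weights are held as a nested dictionary `{meta_instance: {disease: weight}}` (a hypothetical layout, as is the function name `disease_vectors`):

```python
def disease_vectors(weights):
    """Transpose {meta_instance: {disease: W}} into {disease: {meta_instance: W}}.

    Each inner dictionary is a sparse disease vector: meta-instances with
    zero weight for the disease are simply absent.
    """
    vectors = {}
    for meta_instance, per_disease in weights.items():
        for disease, w in per_disease.items():
            vectors.setdefault(disease, {})[meta_instance] = w
    return vectors


weights = {
    "E1": {"D1": 0.75, "D2": 0.25},
    "E2": {"D2": 1.0},
}
vectors = disease_vectors(weights)
print(vectors["D2"])  # sparse vector of disease D2 over meta-instances
```

A sparse representation is a natural fit here because the number of meta-instances H is large while each disease is associated with only a small fraction of them.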
Further, referring to fig. 6, the similarity calculation unit 30 includes:
the cosine distance calculating subunit 300, which calculates the similarity between two diseases from their disease vectors by using the cosine distance.
Further, the disease vector is a high-dimensional sparse vector, and the calculation formula of the cosine distance calculating subunit 300 is:
$$sim(\vec{D_i}, \vec{D_j}) = \cos\langle \vec{D_i}, \vec{D_j} \rangle = \frac{\sum_{k=1}^{H} d_{ik} \cdot d_{jk}}{\sqrt{\sum_{k=1}^{H} d_{ik}^{2}} \cdot \sqrt{\sum_{k=1}^{H} d_{jk}^{2}}};$$
where $sim(\vec{D_i}, \vec{D_j})$ is the disease similarity between the two disease vectors; $d_{ik}$ and $d_{jk}$ are the weights with which patient meta-instance $E_k$ is diagnosed as diseases $D_i$ and $D_j$, respectively; $H$ is the total number of patient meta-instances; $k$ is a natural number from 1 to $H$.
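The similarity calculation can be sketched as follows, assuming the standard cosine definition (dot product divided by the product of the Euclidean norms) and sparse disease vectors stored as dictionaries keyed by meta-instance. The function name `cosine_similarity` is hypothetical, and returning 0.0 for an all-zero vector is a convention chosen for this sketch, not something the patent specifies.

```python
import math


def cosine_similarity(u, v):
    """Cosine of the angle between two sparse vectors given as dicts."""
    # Only keys present in both vectors contribute to the dot product.
    dot = sum(w * v[k] for k, w in u.items() if k in v)
    norm_u = math.sqrt(sum(w * w for w in u.values()))
    norm_v = math.sqrt(sum(w * w for w in v.values()))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0  # assumed convention for empty or all-zero vectors
    return dot / (norm_u * norm_v)


d_i = {"E1": 0.6, "E2": 0.8}
d_j = {"E1": 0.8, "E2": 0.6}
print(cosine_similarity(d_i, d_j))
```

Two diseases that are diagnosed from the same meta-instances with similar weights score close to 1, and diseases with no meta-instance in common score 0.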
Further, when the weight of a patient meta-instance for a certain disease is low, a diagnosis of that disease based on the patient meta-instance has low credibility, so such weights can usually be ignored when calculating disease similarity. The calculation formula of the cosine distance calculating subunit 300 is adjusted as follows:
$$sim(\vec{D_i}, \vec{D_j}) = \frac{\sum_{k \in T_{ij}} d_{ik} \cdot d_{jk}}{\sqrt{\sum_{k \in T_i} d_{ik}^{2}} \cdot \sqrt{\sum_{k \in T_j} d_{jk}^{2}}};$$
where $sim(\vec{D_i}, \vec{D_j})$ is the disease similarity between the two disease vectors; $d_{ik}$ and $d_{jk}$ are the weights, higher than the set weight, with which patient meta-instance $E_k$ is diagnosed as diseases $D_i$ and $D_j$, respectively; $k$ is a natural number; $T_i$ is the set of patient meta-instances whose weight for disease $D_i$ is higher than the set weight; $T_j$ is the set of patient meta-instances whose weight for disease $D_j$ is higher than the set weight; $T_{ij}$ is the set of patient meta-instances whose weights for both diseases $D_i$ and $D_j$ are higher than the set weight. The set weight is a minimum threshold chosen according to actual conditions; when the weight of a patient meta-instance for $D_i$ or $D_j$ is smaller than the threshold, its credibility is judged to be low, and that weight is ignored when calculating the disease similarity.
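The thresholded variant can be sketched in the same style. The sketch assumes a cosine-style formula restricted to the sets $T_i$, $T_j$, and $T_{ij}$, with "higher than the set weight" read as a strict comparison. The function name `thresholded_similarity` and the parameter name `threshold` are hypothetical.

```python
import math


def thresholded_similarity(u, v, threshold):
    """Cosine-style similarity that ignores weights at or below `threshold`.

    u, v: sparse disease vectors as {meta_instance: weight}.
    """
    t_i = {k for k, w in u.items() if w > threshold}   # T_i
    t_j = {k for k, w in v.items() if w > threshold}   # T_j
    t_ij = t_i & t_j                                   # T_ij

    dot = sum(u[k] * v[k] for k in t_ij)
    norm_u = math.sqrt(sum(u[k] ** 2 for k in t_i))
    norm_v = math.sqrt(sum(v[k] ** 2 for k in t_j))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0  # assumed convention when no weight passes the threshold
    return dot / (norm_u * norm_v)


d_i = {"E1": 0.9, "E2": 0.05}
d_j = {"E1": 0.8, "E2": 0.04, "E3": 0.2}
print(thresholded_similarity(d_i, d_j, threshold=0.1))
```

In this toy example the low weights on "E2" are dropped from both vectors, so only "E1" contributes to the numerator, while "E3" still counts toward the norm of the second vector.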
In one embodiment, an experiment used three disease similarity calculation methods or devices to examine the 7 diseases most prone to misdiagnosis. The calculation methods used were:
Method A: calculating disease similarity based on semantic association;
Method B: calculating disease similarity based on disease-associated genes;
Method C: calculating disease similarity based on big data group behaviors, as in the embodiment of the invention.
The experimental results are shown in Table 1 below:
TABLE 1
The experimental results in Table 1 show the following:
If the gene functions, cell constitutions, biological processes, or co-morbid genes of two diseases are approximately the same, both calculating disease similarity based on semantic association and calculating disease similarity based on disease-associated genes are effective, and both are useful for disease research.
However, these two methods are relatively ineffective for pairs of diseases that are easily misdiagnosed as each other but have no association in terms of cells, genes, and the like. The method and device in the embodiment of the invention approach the problem from the social perspective of disease: the similarity of diseases is calculated from the diagnosis and treatment behaviors of a big data group, and the 7 diseases most prone to misdiagnosis can be identified.
The disease similarity calculation method and device based on big data group behaviors calculate the similarity between diseases from the social perspective of disease, according to the diagnosis and treatment behaviors of a big data group, and can be used to identify diseases that are easily misdiagnosed and have no association in terms of cells, genes, and the like. A disease vector is established from the weights with which each patient meta-instance is diagnosed as each disease, and the disease similarity is calculated by cosine distance, so that the diseases most prone to misdiagnosis can be identified. When the weight of a patient meta-instance for a certain disease is very low, the credibility of that patient meta-instance for the disease is considered very low, and the weight can be ignored when calculating the disease similarity.
The above description is only for the preferred embodiment of the present invention and is not intended to limit the scope of the present invention, and all modifications of equivalent structures and equivalent processes, which are made by the contents of the present specification and drawings, or applied directly or indirectly to other related technical fields, are included in the scope of the present invention.

Claims (10)

1. A disease similarity calculation method based on big data group behaviors, characterized by comprising the following steps:
calculating the weight with which each patient meta-instance is diagnosed as each disease; the patient meta-instance includes patient case information;
establishing a disease vector for each disease according to the calculated weights; the weights are used as the elements of the disease vector;
and calculating the disease similarity according to the disease vectors.
2. The disease similarity calculation method based on big data group behaviors according to claim 1, wherein the step of calculating the weight with which each patient meta-instance is diagnosed as each disease comprises:
counting how many times each patient meta-instance is diagnosed as each disease, converting the counts into relative frequencies over all the data, and taking each relative frequency as the weight with which the patient meta-instance is diagnosed as the corresponding disease.
3. The disease similarity calculation method based on big data group behaviors according to claim 1, wherein the step of calculating the disease similarity according to the disease vectors comprises:
calculating the similarity of two diseases by using the cosine distance according to the disease vectors.
4. The disease similarity calculation method based on big data group behaviors according to claim 3, wherein the calculation formula for calculating the similarity of two diseases by using the cosine distance according to the disease vectors is:
$$sim(\vec{D_i}, \vec{D_j}) = \cos\langle \vec{D_i}, \vec{D_j} \rangle = \frac{\sum_{k=1}^{H} d_{ik} \cdot d_{jk}}{\sqrt{\sum_{k=1}^{H} d_{ik}^{2}} \cdot \sqrt{\sum_{k=1}^{H} d_{jk}^{2}}};$$
where $sim(\vec{D_i}, \vec{D_j})$ is the disease similarity between the two disease vectors; $d_{ik}$ and $d_{jk}$ are the weights with which patient meta-instance $E_k$ is diagnosed as diseases $D_i$ and $D_j$, respectively; $H$ is the total number of patient meta-instances; $k$ is a natural number.
5. The method according to claim 3 or 4, wherein when the weight with which a patient meta-instance is diagnosed as a certain disease is low, the calculation formula for calculating the similarity of two diseases by using the cosine distance according to the disease vectors is:
$$sim(\vec{D_i}, \vec{D_j}) = \frac{\sum_{k \in T_{ij}} d_{ik} \cdot d_{jk}}{\sqrt{\sum_{k \in T_i} d_{ik}^{2}} \cdot \sqrt{\sum_{k \in T_j} d_{jk}^{2}}};$$
where $sim(\vec{D_i}, \vec{D_j})$ is the disease similarity between the two disease vectors; $d_{ik}$ and $d_{jk}$ are the weights, higher than the set weight, with which patient meta-instance $E_k$ is diagnosed as diseases $D_i$ and $D_j$, respectively; $k$ is a natural number; $T_i$ is the set of patient meta-instances whose weight for disease $D_i$ is higher than the set weight; $T_j$ is the set of patient meta-instances whose weight for disease $D_j$ is higher than the set weight; $T_{ij}$ is the set of patient meta-instances whose weights for both diseases $D_i$ and $D_j$ are higher than the set weight.
6. A disease similarity calculation apparatus based on big data group behaviors, comprising:
a weight calculation unit for calculating the weight with which each patient meta-instance is diagnosed as each disease; the patient meta-instance includes patient case information;
a vector establishing unit for establishing a disease vector for each disease according to the calculated weights; the weights are used as the elements of the disease vector;
and a similarity calculation unit for calculating the disease similarity according to the disease vectors.
7. The disease similarity calculation apparatus based on big data group behaviors according to claim 6, wherein the weight calculation unit includes:
a weight calculation subunit for counting how many times each patient meta-instance is diagnosed as each disease, converting the counts into relative frequencies over all the data, and taking each relative frequency as the weight with which the patient meta-instance is diagnosed as the corresponding disease.
8. The disease similarity calculation apparatus based on big data group behaviors according to claim 6, wherein the similarity calculation unit includes:
a cosine distance calculating subunit for calculating the similarity of two diseases by using the cosine distance according to the disease vectors.
9. The disease similarity calculation apparatus based on big data group behaviors according to claim 8, wherein the calculation formula of the cosine distance calculating subunit is:
$$sim(\vec{D_i}, \vec{D_j}) = \cos\langle \vec{D_i}, \vec{D_j} \rangle = \frac{\sum_{k=1}^{H} d_{ik} \cdot d_{jk}}{\sqrt{\sum_{k=1}^{H} d_{ik}^{2}} \cdot \sqrt{\sum_{k=1}^{H} d_{jk}^{2}}};$$
where $sim(\vec{D_i}, \vec{D_j})$ is the disease similarity between the two disease vectors; $d_{ik}$ and $d_{jk}$ are the weights with which patient meta-instance $E_k$ is diagnosed as diseases $D_i$ and $D_j$, respectively; $H$ is the total number of patient meta-instances; $k$ is a natural number.
10. The disease similarity calculation apparatus based on big data group behaviors according to claim 8 or 9, wherein when the weight with which a patient meta-instance is diagnosed as a certain disease is low, the calculation formula of the cosine distance calculating subunit is:
$$sim(\vec{D_i}, \vec{D_j}) = \frac{\sum_{k \in T_{ij}} d_{ik} \cdot d_{jk}}{\sqrt{\sum_{k \in T_i} d_{ik}^{2}} \cdot \sqrt{\sum_{k \in T_j} d_{jk}^{2}}};$$
where $sim(\vec{D_i}, \vec{D_j})$ is the disease similarity between the two disease vectors; $d_{ik}$ and $d_{jk}$ are the weights, higher than the set weight, with which patient meta-instance $E_k$ is diagnosed as diseases $D_i$ and $D_j$, respectively; $k$ is a natural number; $T_i$ is the set of patient meta-instances whose weight for disease $D_i$ is higher than the set weight; $T_j$ is the set of patient meta-instances whose weight for disease $D_j$ is higher than the set weight; $T_{ij}$ is the set of patient meta-instances whose weights for both diseases $D_i$ and $D_j$ are higher than the set weight.
CN201610307328.7A 2016-05-10 2016-05-10 Disease similarity calculation method and device based on big data group behaviors Pending CN106021871A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201610307328.7A CN106021871A (en) 2016-05-10 2016-05-10 Disease similarity calculation method and device based on big data group behaviors

Publications (1)

Publication Number Publication Date
CN106021871A true CN106021871A (en) 2016-10-12

Family

ID=57100197

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201610307328.7A Pending CN106021871A (en) 2016-05-10 2016-05-10 Disease similarity calculation method and device based on big data group behaviors

Country Status (1)

Country Link
CN (1) CN106021871A (en)

Cited By (8)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN106650299A (en) * 2017-01-18 2017-05-10 浙江大学 Quick calculating method for patient similarity analysis
CN106897580A (en) * 2017-02-10 2017-06-27 华东师范大学 The computational methods of semantic similarity between a kind of gene based on vector
CN108346474A (en) * 2018-03-14 2018-07-31 湖南省蓝蜻蜓网络科技有限公司 The electronic health record feature selection approach of distribution within class and distribution between class based on word
CN108647203A (en) * 2018-04-20 2018-10-12 浙江大学 A kind of computational methods of Chinese medicine state of an illness text similarity
CN109102895A (en) * 2017-06-21 2018-12-28 京东方科技集团股份有限公司 Medical data coalignment and method
CN111091906A (en) * 2019-10-31 2020-05-01 中电药明数据科技(成都)有限公司 Auxiliary medical diagnosis method and system based on real world data
CN108630322B (en) * 2018-04-27 2020-08-14 厦门大学 Drug interaction modeling and risk assessment method, terminal device and storage medium
CN112151184A (en) * 2020-09-27 2020-12-29 东北林业大学 System for calculating disease similarity based on network representation learning

Citations (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN102156812A (en) * 2011-04-02 2011-08-17 中国医学科学院医学信息研究所 Hospital decision-making aiding method based on symptom similarity analysis
CN102184314A (en) * 2011-04-02 2011-09-14 中国医学科学院医学信息研究所 Deviation symptom description-oriented automatic computer-aided diagnosis method
CN104915561A (en) * 2015-06-11 2015-09-16 万达信息股份有限公司 Intelligent disease attribute matching method

Non-Patent Citations (2)

* Cited by examiner, † Cited by third party
Title
李杰等: "基于疾病本体的疾病相似性计算方法", 《生物化学与生物物理进展》 *
郭艾侠等: "融合 Harris 与 SIFT 算法的荔枝采摘点计算与立体匹配", 《农业机械学报》 *

Cited By (10)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN106650299A (en) * 2017-01-18 2017-05-10 浙江大学 Quick calculating method for patient similarity analysis
CN106650299B (en) * 2017-01-18 2019-01-25 浙江大学 A kind of quick calculation method of patient's similarity analysis
CN106897580A (en) * 2017-02-10 2017-06-27 华东师范大学 The computational methods of semantic similarity between a kind of gene based on vector
CN109102895A (en) * 2017-06-21 2018-12-28 京东方科技集团股份有限公司 Medical data coalignment and method
CN108346474A (en) * 2018-03-14 2018-07-31 湖南省蓝蜻蜓网络科技有限公司 The electronic health record feature selection approach of distribution within class and distribution between class based on word
CN108647203A (en) * 2018-04-20 2018-10-12 浙江大学 A kind of computational methods of Chinese medicine state of an illness text similarity
CN108630322B (en) * 2018-04-27 2020-08-14 厦门大学 Drug interaction modeling and risk assessment method, terminal device and storage medium
CN111091906A (en) * 2019-10-31 2020-05-01 中电药明数据科技(成都)有限公司 Auxiliary medical diagnosis method and system based on real world data
CN111091906B (en) * 2019-10-31 2023-06-20 中电药明数据科技(成都)有限公司 Auxiliary medical diagnosis method and system based on real world data
CN112151184A (en) * 2020-09-27 2020-12-29 东北林业大学 System for calculating disease similarity based on network representation learning

Legal Events

Date Code Title Description
C06 Publication
PB01 Publication
C10 Entry into substantive examination
SE01 Entry into force of request for substantive examination
RJ01 Rejection of invention patent application after publication
RJ01 Rejection of invention patent application after publication

Application publication date: 20161012