CN107845424B - Method and system for diagnostic information processing analysis - Google Patents
Method and system for diagnostic information processing analysis
- Publication number
- CN107845424B (application CN201711128161.9A)
- Authority
- CN
- China
- Prior art keywords
- label
- information
- sample
- vector
- calculating
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Active
Classifications
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F18/00—Pattern recognition
- G06F18/20—Analysing
- G06F18/22—Matching criteria, e.g. proximity measures
Landscapes
- Engineering & Computer Science (AREA)
- Data Mining & Analysis (AREA)
- Theoretical Computer Science (AREA)
- Computer Vision & Pattern Recognition (AREA)
- Bioinformatics & Cheminformatics (AREA)
- Bioinformatics & Computational Biology (AREA)
- Artificial Intelligence (AREA)
- Evolutionary Biology (AREA)
- Evolutionary Computation (AREA)
- Physics & Mathematics (AREA)
- General Engineering & Computer Science (AREA)
- General Physics & Mathematics (AREA)
- Life Sciences & Earth Sciences (AREA)
- Medical Treatment And Welfare Office Work (AREA)
Abstract
The invention discloses a method and a system for processing and analyzing diagnostic information, and relates to the technical field of medical systems. The method comprises the following steps: establishing a feature space and a label space from a plurality of sample features and label information, and establishing a feature vector and a first label vector for each sample; calculating the number of occurrences of each piece of label information; calculating the number of times every two pieces of label information appear together in one sample; calculating the similarity of every two pieces of label information, and establishing a label-label similarity matrix; reconstructing the first label vector to obtain a second label vector, and calculating the occurrence probability of the label information of the target sample according to the second label vector; and selecting a preset number of pieces of label information as the recommended label information of the target sample. By using the features corresponding to patients' diseases as the feature space and the diseases as label information, the invention exploits the correlation between diseases and their complications to find more potential diseases, thereby improving the precision of diagnosis decision support.
Description
Technical Field
The invention relates to the technical field of medical systems, in particular to a method and a system for processing and analyzing diagnostic information.
Background
In the big data age, people have gradually accepted methods of using health big data to assist diagnosis and treatment. With the success of many industry applications in utilizing big data, the health services industry is beginning to utilize medical big data to improve service efficiency and quality.
Health information tools and machine learning techniques have been successfully used to help physicians diagnose diseases and develop treatment regimens more efficiently. Clinical decision support applications include systems that provide diagnostics, personalized drug assessment, treatment protocols, and related medical knowledge. Clinical decision support applications aim to provide medical personnel with professional knowledge, patient information and intelligent means to improve the efficiency and utility of medical personnel in making decisions. Through clinical decision support applications, medical negligence can be reduced and medical service quality can be improved. In the medical field, there is an increasing demand for high quality clinical decision support systems.
Clinicians differentiate patients and make diagnoses for patients through their experience and knowledge. Therefore, if the clinician does not have a great deal of experience and accurate judgment, inevitable medical errors will result. The goal of building clinical decision support systems is to improve the accuracy and efficiency of clinicians through machine learning techniques. The system can extract the characteristics of the patient through personal health records, such as physiological data, electronic medical records, 3D images, radiological images, genome sequencing, clinical and charging data, classify the patient according to the characteristics of the patient and provide corresponding clinical suggestions to doctors. Scoring criteria in medical scenarios and the complexity of the medical field are challenges for clinical decision support systems. Currently, many clinical decision support systems have been developed in the market to provide assistance to physicians.
Because some diseases are accompanied by complications, it is common for a patient to have multiple diseases simultaneously. To recommend reference diseases to the clinician in such cases, a more complex clinical decision support system is required. Analysis of real clinical diagnosis information shows that patients with multiple diseases account for a large part of the total number of patients, so the clinical decision support system needs to recommend multiple reference diseases to the clinician. The problem of recommending reference diseases is therefore converted into a multi-label classification problem.
ML-kNN (a lazy multi-label classification method) is widely applied and studied because of its simple steps and strong results. However, the algorithm estimates the likelihood of each label independently and therefore ignores the associations between labels. In actual disease diagnosis many labels are related, and ML-kNN is not effective in application scenarios where labels are related to one another.
Disclosure of Invention
The invention mainly aims to provide a method and a system for processing and analyzing diagnostic information, aiming at effectively improving the accuracy of diagnosing diseases by utilizing the correlation among the diseases.
To achieve the above object, the present invention provides a method for processing and analyzing diagnostic information, comprising the steps of:
establishing a feature space and a label space of multi-label information through a plurality of sample features and label information of a plurality of samples, and establishing a feature vector and a first label vector of each sample according to each sample feature and each label information;
calculating the occurrence frequency of each label information; calculating the times that every two label information simultaneously appear in one sample;
calculating the similarity of every two label information, and establishing a label-label similarity matrix;
reconstructing a first label vector of the sample through the label-label similarity matrix to obtain a second label vector, and calculating the label information occurrence probability of the target sample according to the second label vector;
and sorting the probability of the target sample label information in a descending order, and selecting a preset number of label information as the recommended label information of the target sample.
Preferably, the establishing a feature space and a label space of multi-label information by using a plurality of sample features and label information of a plurality of samples further comprises:
let $F = \{f_1, f_2, \ldots, f_b\}$ be the b-dimensional feature space of the multi-label information, and let $L = \{l_1, l_2, \ldots, l_q\}$ be the q-dimensional label space of the multi-label information;
let $T = \{(X_1, Y_1), (X_2, Y_2), \ldots, (X_n, Y_n)\}$ be the set of multi-label samples, where $X_i$ is the b-dimensional feature vector of the i-th sample; then
$Y_i = (y_i(l_1), y_i(l_2), \ldots, y_i(l_q))$ is the label vector corresponding to sample $X_i$; if sample $X_i$ has label $l_j$, then $y_i(l_j) = 1$, otherwise $y_i(l_j) = 0$.
Preferably, the calculating the number of occurrences of each label information includes: recording, for each sample, whether each piece of label information occurs in it, and denoting the value as $r_{ki}$: if sample $X_k$ has label $l_i$, then $r_{ki} = 1$, otherwise $r_{ki} = 0$. The number of occurrences of label $l_i$ is then $\sum_k r_{ki}$.
Preferably, the calculating the similarity between every two tags and the establishing the tag-tag similarity matrix further includes:
and calculating the similarity of every two labels by using a cosine similarity calculation method.
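Preferably, the calculating the similarity of every two labels by using the cosine similarity calculation method includes:
$$\mathrm{sim}(l_i, l_j) = \frac{\sum_{k \in P_{ij}} r_{ki}\, r_{kj}}{\sqrt{\sum_{k} r_{ki}^2}\,\sqrt{\sum_{k} r_{kj}^2}}$$
wherein $r_{ki} \in \{0, 1\}$ indicates whether label $l_i$ appears in sample $X_k$; $P_{ij}$ is the set of samples that contain both label $l_i$ and label $l_j$; the numerator is the number of times label $l_i$ and label $l_j$ appear together in a sample $X_k$; and $\sum_k r_{ki}^2$ and $\sum_k r_{kj}^2$ are the total number of occurrences of label $l_i$ and of label $l_j$, respectively.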
Preferably, the calculating the similarity between every two label information, and the establishing a label-label similarity matrix includes:
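the label-label similarity matrix is
$$S = \begin{pmatrix} S_{11} & S_{12} & \cdots & S_{1q} \\ S_{21} & S_{22} & \cdots & S_{2q} \\ \vdots & \vdots & \ddots & \vdots \\ S_{q1} & S_{q2} & \cdots & S_{qq} \end{pmatrix}$$
wherein the element $S_{ij} = \mathrm{sim}(l_i, l_j)$ of the matrix represents the similarity of label $l_i$ and label $l_j$.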
Preferably, the label information matrix of the sample is reconstructed through the label-label similarity matrix as: y = g(x).
Preferably, the calculating the probability of occurrence of the label information of the target sample according to the second label vector includes:
and calculating the occurrence probability of each label information in the target sample by using a lazy multi-label classification algorithm according to the second label vector.
The present invention also provides a diagnostic information processing and analyzing system, the system comprising:
a module for establishing a feature space and a label space of multi-label information according to the plurality of sample features and label information of the plurality of samples;
means for establishing a feature vector and a first label vector for the sample based on each sample feature and each label information;
a module for calculating the number of occurrences of each of the tag information;
a module for counting the number of times that every two tag information simultaneously appear in one sample;
a module for calculating the similarity of every two label information and establishing a label-label similarity matrix;
a module for reconstructing the label information matrix of the sample through the label-label similarity matrix, that is, reconstructing the first label vector of the sample to obtain a second label vector, and calculating the occurrence probability of the label information of a target sample according to the second label vector;
a module for sorting the occurrence probabilities of the target sample's label information in descending order;
and a module for selecting a preset number of pieces of label information as the recommended label information of the target sample.
Preferably, the system further comprises: the module for calculating the similarity of every two label information and establishing the label-label similarity matrix is a cosine similarity calculation module.
In the technical scheme of the invention, the features corresponding to patients' diseases form the feature space of the multi-label learning algorithm, and the diseases serve as its label information. By analyzing the label information, the number of times every two pieces of label information appear together in one patient is obtained. The label matrix is then reconstructed with the label similarity matrix to update the label vector of each patient, and the multi-label learning algorithm is used to predict possible diseases for the target patient. In this way the correlation between diseases and their complications is used to discover more potential diseases, and the precision of diagnosis decision support is improved.
Drawings
FIG. 1 is a flow chart illustrating a method of diagnostic information processing analysis according to the present invention.
Detailed Description
The technical solutions in the embodiments of the present invention will be clearly and completely described below with reference to the drawings in the embodiments of the present invention, and it is obvious that the described embodiments are only a part of the embodiments of the present invention, and not all of the embodiments. The components of embodiments of the present invention generally described and illustrated in the figures herein may be arranged and designed in a wide variety of different configurations. Thus, the following detailed description of the embodiments of the present invention, presented in the figures, is not intended to limit the scope of the invention, as claimed, but is merely representative of selected embodiments of the invention. All other embodiments, which can be derived by a person skilled in the art from the embodiments of the present invention without making any creative effort, shall fall within the protection scope of the present invention.
As shown in fig. 1, the present invention provides a method for processing and analyzing diagnostic information, comprising the steps of:
establishing a feature space and a label space of multi-label information through a plurality of sample features and label information of a plurality of samples, and establishing a feature vector and a first label vector of each sample according to each sample feature and each label information; calculating the occurrence frequency of each label information; calculating the times that every two label information simultaneously appear in one sample; calculating the similarity of every two label information, and establishing a label-label similarity matrix; reconstructing a first label vector of a sample through the label-label similarity matrix to obtain a second label vector, and calculating the occurrence probability of label information of a target sample according to the second label vector; and sorting the probability of the target sample label information in a descending order, and selecting a preset number of label information as the recommended label information of the target sample.
The principle of the invention is as follows: complications refer to the development of one disease causing another disease or condition. When a doctor diagnoses a patient, if the doctor has diagnosed a disease, the doctor considers whether the patient also suffers from complications of the disease to ensure that more diseases of the patient can be found. Meanwhile, the relation between a disease and other diseases can be reflected through the co-occurrence frequency of the disease and other diseases, so that the similarity between the diseases can be reflected by using the co-occurrence frequency between every two diseases, and the information of the diseases possibly suffered by the patient is finally recommended by using the calculated disease similarity and combining a lazy multi-label classification method.
In a particular embodiment, a basic multi-label data set (denoted BMLDS) is built in advance from patient diagnostic records; the feature set and the label set of the samples are separated from the BMLDS and denoted F and L, respectively, and the BMLDS is divided into a training set and a test set in a ratio of 7:3. The method of the invention then comprises:
S1, establishing a feature space and a label space of multi-label information through a plurality of sample features and label information of a plurality of samples:
let $F = \{f_1, f_2, \ldots, f_b\}$ be the b-dimensional feature space of the multi-label information, and let $L = \{l_1, l_2, \ldots, l_q\}$ be the q-dimensional label space of the multi-label information;
let $T = \{(X_1, Y_1), (X_2, Y_2), \ldots, (X_n, Y_n)\}$ be the set of multi-label samples, where $X_i$ is the b-dimensional feature vector of the i-th sample; then
$Y_i = (y_i(l_1), y_i(l_2), \ldots, y_i(l_q))$ is the label vector corresponding to sample $X_i$; if sample $X_i$ has label $l_j$, then $y_i(l_j) = 1$, otherwise $y_i(l_j) = 0$.
S2, calculating the number of occurrences of each label information: for each sample, whether each piece of label information occurs in it is recorded, and the value is denoted $r_{ki}$: if sample $X_k$ has label $l_i$, then $r_{ki} = 1$, otherwise $r_{ki} = 0$. The number of occurrences of label $l_i$ is then $\sum_k r_{ki}$.
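The counting in S2, together with the pairwise co-occurrence count that the method also requires, can be sketched as follows. This is a minimal illustration only: the function name `label_counts` and the toy label vectors are assumptions for the example and do not appear in the patent.

```python
from itertools import combinations


def label_counts(label_vectors):
    """Count label occurrences and pairwise co-occurrences.

    label_vectors: list of binary label vectors, one per sample.
    Returns (occ, co) where occ[i] is the number of samples having label i
    and co[(i, j)] (with i < j) is the number of samples having both labels.
    """
    q = len(label_vectors[0])
    occ = [0] * q
    co = {}
    for y in label_vectors:
        present = [i for i, v in enumerate(y) if v == 1]
        for i in present:
            occ[i] += 1
        for i, j in combinations(present, 2):
            co[(i, j)] = co.get((i, j), 0) + 1
    return occ, co


# Toy data (hypothetical): 3 samples, 3 labels.
Y = [[1, 1, 0], [1, 0, 0], [1, 1, 1]]
occ, co = label_counts(Y)
print(occ)  # occurrences per label
print(co)   # co-occurrences per label pair
```

Storing only pairs with i < j keeps the co-occurrence table small; the symmetric entry can be looked up by swapping the indices.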
S3, calculating the similarity of every two labels, and establishing a label-label similarity matrix:
and calculating the similarity of every two labels by using a cosine similarity calculation method.
The similarity calculation method comprises a cosine similarity calculation method, a Pearson correlation coefficient calculation method and a Jaccard similarity coefficient calculation method.
Specifically, calculating the similarity of every two tags by using a cosine similarity calculation method comprises the following steps:
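the similarity of every two labels is calculated by the formula
$$\mathrm{sim}(l_i, l_j) = \frac{\sum_{k \in P_{ij}} r_{ki}\, r_{kj}}{\sqrt{\sum_{k} r_{ki}^2}\,\sqrt{\sum_{k} r_{kj}^2}}$$
wherein $r_{ki} \in \{0, 1\}$ indicates whether label $l_i$ appears in sample $X_k$; $P_{ij}$ is the set of samples that contain both label $l_i$ and label $l_j$; the numerator is the number of times label $l_i$ and label $l_j$ appear together in a sample $X_k$; and $\sum_k r_{ki}^2$ and $\sum_k r_{kj}^2$ are the total number of occurrences of label $l_i$ and of label $l_j$, respectively;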
the calculation method of the Pearson correlation coefficient is:
$$\rho(I, J) = \frac{\sum_{k}(I_k - \bar{I})(J_k - \bar{J})}{\sqrt{\sum_{k}(I_k - \bar{I})^2}\,\sqrt{\sum_{k}(J_k - \bar{J})^2}}$$
wherein $I$ and $J$ are the sample vectors corresponding to label $l_i$ and label $l_j$, respectively, and $\bar{I}$ and $\bar{J}$ are their mean values; the Pearson correlation coefficient is the cosine of the angle between vector $I$ and vector $J$ after each has been centered on its mean.
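The calculation method of the Jaccard similarity coefficient is:
$$\mathrm{Jaccard}(I, J) = \frac{|I \cap J|}{|I \cup J|}$$
wherein $I$ and $J$ are the sample vectors corresponding to label $l_i$ and label $l_j$, respectively; $|I \cap J|$ is the number of samples having both labels, and $|I \cup J|$ is the number of samples having at least one of them.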
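The three similarity measures named above can be sketched on binary sample vectors as follows. The function names and the two toy vectors are assumptions for illustration; the formulas are the standard textbook definitions, which may differ in detail from the equations in the original filing.

```python
import math


def cosine(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return dot / (na * nb) if na and nb else 0.0


def pearson(a, b):
    """Pearson correlation: cosine similarity after centering on the mean."""
    ma, mb = sum(a) / len(a), sum(b) / len(b)
    return cosine([x - ma for x in a], [y - mb for y in b])


def jaccard(a, b):
    """Jaccard coefficient of two binary vectors."""
    both = sum(1 for x, y in zip(a, b) if x and y)
    either = sum(1 for x, y in zip(a, b) if x or y)
    return both / either if either else 0.0


# Hypothetical sample vectors of two labels over four samples.
vec_i = [1, 1, 0, 1]
vec_j = [1, 0, 0, 1]
print(cosine(vec_i, vec_j), pearson(vec_i, vec_j), jaccard(vec_i, vec_j))
```

All three return 0.0 for a label that never occurs, which avoids a division by zero; how the patent handles that case is not stated.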
S4, calculating the similarity of every two label information, and establishing a label-label similarity matrix:
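the label-label similarity matrix is
$$S = \begin{pmatrix} S_{11} & S_{12} & \cdots & S_{1q} \\ S_{21} & S_{22} & \cdots & S_{2q} \\ \vdots & \vdots & \ddots & \vdots \\ S_{q1} & S_{q2} & \cdots & S_{qq} \end{pmatrix}$$
wherein the element $S_{ij} = \mathrm{sim}(l_i, l_j)$ of the matrix represents the similarity of label $l_i$ and label $l_j$.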
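A minimal sketch of building the q x q label-label similarity matrix with cosine similarity is given below. The function name `similarity_matrix` and the toy label vectors are hypothetical; the sketch only assumes that each sample is a binary label vector as defined in S1.

```python
import math


def similarity_matrix(label_vectors):
    """Return the q x q cosine similarity matrix between label columns."""
    q = len(label_vectors[0])
    # Column j holds the indicator of label j across all samples.
    cols = [[row[j] for row in label_vectors] for j in range(q)]
    norms = [math.sqrt(sum(v * v for v in col)) for col in cols]
    S = [[0.0] * q for _ in range(q)]
    for i in range(q):
        for j in range(q):
            if norms[i] and norms[j]:
                dot = sum(a * b for a, b in zip(cols[i], cols[j]))
                S[i][j] = dot / (norms[i] * norms[j])
    return S


# Toy data (hypothetical): 3 samples, 3 labels.
Y = [[1, 1, 0], [1, 0, 0], [1, 1, 1]]
S = similarity_matrix(Y)
for row in S:
    print([round(v, 3) for v in row])
```

The matrix is symmetric with ones on the diagonal (up to floating-point rounding) for every label that occurs at least once.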
S5, reconstructing the label information matrix of the sample through the label-label similarity matrix
S6, calculating the label information occurrence probability of the target sample according to the second label vector:
the occurrence probability of each piece of label information in the target sample is calculated from the second label vector by using a lazy multi-label classification algorithm. Specifically, the probability value of each piece of label information appearing in the target sample lies in the range [0, 1].
The classification function of the lazy multi-label classification algorithm is as follows:
The number of times each label appears among the k nearest neighbor (kNN) samples of each sample is first counted, and the labels likely to appear in an unlabeled sample are then estimated by maximizing the posterior probability. For a sample space X containing m samples, the label space is denoted as L. Event $H_b^{l_i}$ denotes that the i-th label takes the value b, where $b \in \{0, 1\}$: b = 0 means that the label does not appear, and b = 1 means that it appears. Event $E_j^{l_i}$ denotes that exactly j of the k neighbors have label $l_i$. In the standard ML-kNN formulation, whether label $l_i$ appears in sample x is determined by comparing the values of the following expression for b = 0 and b = 1, where j is the number of the k neighbors of x that have label $l_i$:
$$y_x(l_i) = \arg\max_{b \in \{0,1\}} P(H_b^{l_i})\, P(E_j^{l_i} \mid H_b^{l_i})$$
$y_j(l_i) \in \{0, 1\}$ indicates whether sample j has label $l_i$, and s is the smoothing factor. Therefore,
$$P(H_1^{l_i}) = \frac{s + \sum_{j=1}^{m} y_j(l_i)}{2s + m}, \qquad P(H_0^{l_i}) = 1 - P(H_1^{l_i})$$
To calculate the conditional probability $P(E_j^{l_i} \mid H_b^{l_i})$, the training sample set is traversed and, for each sample, the number of its k nearest neighbors that have label $l_i$ is counted. Array element $C[j]$ counts the training samples whose label $l_i$ has the value 1 and exactly j of whose k neighbors have label $l_i$; array element $C'[j]$ counts the training samples whose label $l_i$ has the value 0 and exactly j of whose k neighbors have label $l_i$. Here p ranges over the possible numbers of neighbors having the label (0 to k). The conditional probability is calculated by the following equations:
$$P(E_j^{l_i} \mid H_1^{l_i}) = \frac{s + C[j]}{s(k+1) + \sum_{p=0}^{k} C[p]}, \qquad P(E_j^{l_i} \mid H_0^{l_i}) = \frac{s + C'[j]}{s(k+1) + \sum_{p=0}^{k} C'[p]}$$
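The lazy multi-label classification step described above can be sketched as follows. This is a compact illustration of the standard ML-kNN estimate with smoothing factor s, not the patented implementation: the function names, the use of Euclidean distance, and the toy data are all assumptions. In the method of the invention, Y would be the reconstructed (second) label vectors.

```python
import math


def knn_indices(X, x, k, exclude=None):
    """Indices of the k samples in X nearest to x (Euclidean distance)."""
    dists = [(math.dist(X[i], x), i) for i in range(len(X)) if i != exclude]
    return [i for _, i in sorted(dists)[:k]]


def mlknn_fit(X, Y, k, s=1.0):
    """Estimate priors and conditional probabilities for every label."""
    m, q = len(X), len(Y[0])
    neigh = [knn_indices(X, X[i], k, exclude=i) for i in range(m)]
    prior, cond1, cond0 = [], [], []
    for l in range(q):
        prior.append((s + sum(y[l] for y in Y)) / (2 * s + m))
        c1 = [0] * (k + 1)  # C[j]: samples with the label, j neighbors have it
        c0 = [0] * (k + 1)  # C'[j]: samples without it, j neighbors have it
        for i in range(m):
            j = sum(Y[n][l] for n in neigh[i])
            target = c1 if Y[i][l] == 1 else c0
            target[j] += 1
        cond1.append([(s + c) / (s * (k + 1) + sum(c1)) for c in c1])
        cond0.append([(s + c) / (s * (k + 1) + sum(c0)) for c in c0])
    return prior, cond1, cond0


def mlknn_predict_proba(X, Y, model, x, k):
    """Posterior probability in [0, 1] of each label for target sample x."""
    prior, cond1, cond0 = model
    neigh = knn_indices(X, x, k)
    probs = []
    for l in range(len(prior)):
        j = sum(Y[n][l] for n in neigh)
        p1 = prior[l] * cond1[l][j]
        p0 = (1 - prior[l]) * cond0[l][j]
        probs.append(p1 / (p1 + p0))
    return probs


# Toy data (hypothetical): one feature, two labels, two clear clusters.
X = [[0.0], [0.1], [0.2], [1.0], [1.1], [1.2]]
Y = [[1, 0], [1, 0], [1, 0], [0, 1], [0, 1], [0, 1]]
model = mlknn_fit(X, Y, k=2, s=1.0)
probs = mlknn_predict_proba(X, Y, model, [0.05], k=2)
print(probs)
```

Normalizing the two products gives a probability rather than only a yes/no decision, which is what the later ranking step needs.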
S7, sorting the occurrence probabilities of the target sample's label information in descending order, and selecting a preset number of pieces of label information as the recommended label information of the target sample.
The labels are arranged from largest to smallest according to their probability values in the target sample, and the top N pieces of label information are selected as the recommended labels of the target sample. When a patient is seen, the N possible complication labels with the largest probability values are recommended; the specific value of N can be preset according to the disease.
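The ranking step can be sketched in a few lines. The function name `recommend` and the probability values are made up for illustration; only the disease names come from the example in this document.

```python
def recommend(probabilities, labels, n):
    """Return the n labels with the highest probability, in descending order."""
    ranked = sorted(zip(labels, probabilities), key=lambda t: t[1], reverse=True)
    return [label for label, _ in ranked[:n]]


# Hypothetical probabilities for three of the diseases named in the example.
labels = ["type 2 diabetes", "hyperlipidemia", "fatty liver"]
probabilities = [0.2, 0.9, 0.5]
top = recommend(probabilities, labels, n=2)
print(top)
```

Python's sort is stable, so labels with equal probability keep their original relative order.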
In a specific embodiment, the patients are the samples in the multi-label learning algorithm, the features corresponding to each patient's diseases form the feature space, and the diseases suffered by all patients serve as the label information. The number of co-occurrences of every two pieces of label information is obtained by analyzing the label information; the more often two different pieces of label information appear in the same sample, the greater their correlation. The similarity between every two pieces of label information is calculated by cosine similarity, and the label vector of each sample is recalculated according to the label similarities: if the recalculated value for a label is equal to or greater than 0.5, that label's entry is set to 1; otherwise it remains 0. Reconstructing the label vectors in this way uncovers potential labels of the training samples and associates their labels with one another. The label matrix in the label space is thus reconstructed with the label similarity matrix to update the label vector of each sample, and finally a multi-label learning algorithm is used to predict potential labels for the target sample.
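The reconstruction of a label vector with the 0.5 threshold might look like the sketch below. The exact reconstruction formula is not recoverable from this text, so the combination rule used here (a label's recalculated value is its highest similarity to any label the sample already has) is purely an assumption, as are the function name and the example matrix.

```python
def reconstruct(y, S, threshold=0.5):
    """Turn a first label vector y into a second label vector.

    ASSUMPTION: the recalculated value of label j is the maximum similarity
    S[i][j] over the labels i already present in y. The patent's own formula
    may differ. Labels already set to 1 are kept.
    """
    q = len(y)
    out = []
    for j in range(q):
        score = max((S[i][j] for i in range(q) if y[i] == 1), default=0.0)
        out.append(1 if y[j] == 1 or score >= threshold else 0)
    return out


# Hypothetical similarity matrix for three labels.
S = [
    [1.0, 0.8, 0.2],
    [0.8, 1.0, 0.4],
    [0.2, 0.4, 1.0],
]
second = reconstruct([1, 0, 0], S)
print(second)
```

With this rule, a sample that has only the first label also receives the second label (similarity 0.8) but not the third (similarity 0.2).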
In a specific example, 9 common diseases were selected: type 2 diabetes, hyperlipidemia, fatty liver, hyperkalemia, hypoproteinemia, diabetic nephropathy, cerebral infarction, coronary heart disease, and osteoporosis. Patients with one or more of these diseases were selected from a hospital as samples; their test reports and basic information were collected as sample features, giving 459 patient samples with 5 items of basic information and 278 test items. The sex, age, body temperature, height and weight of each patient were extracted as the patient's basic attributes. Sex is binary, for example 0 for male and 1 for female, whereas age, body temperature, height and weight are numerical and keep their actual values. For a test item whose value is numerical, the feature is set to 1 if the value is within the normal reference range and to the actual test value if it is outside the normal range. For a test item whose value is a textual description, the different descriptions observed for that item are collected and arranged in an array; the feature is 0 if the description equals the reference value, and otherwise it is the position of that description in the array.
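The feature encoding rules just described can be sketched as follows. The function names, the reference range, and the text values are illustrative assumptions and are not taken from the patent's data; in particular, counting array positions from 1 (so that a position never collides with the reference code 0) is an assumption.

```python
def encode_numeric(value, low, high):
    """Return 1.0 if value lies in the normal range [low, high], else the value."""
    return 1.0 if low <= value <= high else float(value)


def encode_text(value, reference, observed_values):
    """Return 0 if value equals the reference description.

    Otherwise return the 1-based position of value in observed_values
    (1-based indexing is an assumption of this sketch).
    """
    if value == reference:
        return 0
    return observed_values.index(value) + 1


# Hypothetical test item with a made-up normal range of 3.9 to 6.1.
in_range = encode_numeric(5.0, 3.9, 6.1)
out_of_range = encode_numeric(7.8, 3.9, 6.1)

# Hypothetical textual item with three observed descriptions.
descriptions = ["negative", "trace", "positive"]
normal_text = encode_text("negative", "negative", descriptions)
abnormal_text = encode_text("positive", "negative", descriptions)
print(in_range, out_of_range, normal_text, abnormal_text)
```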
Statistics of the features and of the diseases are listed in Table 1 and Table 2, respectively. Overall, 60% of patients are male and 40% are female. The average age, body temperature, height and weight of the patients were 64.64, 36.5, 167.81 and 67.75, respectively. According to the disease statistics, type 2 diabetes and cerebral infarction are the two most common of these 9 diseases. In fact, these diseases are also the most common diseases in the elderly population. 70% of patients were randomly selected as training samples and the remaining 30% as test samples.
TABLE 1
TABLE 2
The evaluation criteria for the multi-label classification problem fall into two categories: a. ranking-based evaluation criteria: the goal is to rank the relevant examples ahead of the irrelevant examples. b. Binary prediction evaluation: the goal is to make a strict yes/no classification for each target sample. Hamming Loss, accuracy, recall, and F1-score (F1 score) were used to evaluate the effect of the present invention.
Hamming Loss evaluates the average difference between the recommended labels of a test sample and its actual labels:
$$\mathrm{HammingLoss} = \frac{1}{p}\sum_{i=1}^{p}\frac{1}{q}\,\bigl|h(x_i)\,\Delta\,Y_i\bigr|$$
wherein $h(x_i)$ is the recommended label set of test sample $x_i$; p is the number of test samples; q is the number of labels in the label space; $Y_i$ is the actual label set of test sample $x_i$; and $\Delta$ is the symmetric difference.
The accuracy (that is, precision) is defined as the ratio of the number of labels hit in the label recommendation list to the total length of the label recommendation list. In other words, it represents the probability that a recommended label is correct for the test sample. The accuracy is formulated as follows:
$$\mathrm{Accuracy} = \frac{1}{p}\sum_{i=1}^{p}\frac{|h(x_i)\cap Y_i|}{|h(x_i)|}$$
The recall is defined as the ratio of the number of hit labels to the number of true labels of the test sample. In other words, the recall represents the probability that a true label is recommended. The recall is formulated as follows:
$$\mathrm{Recall} = \frac{1}{p}\sum_{i=1}^{p}\frac{|h(x_i)\cap Y_i|}{|Y_i|}$$
F1-score considers both accuracy and recall, and its formula is as follows:
$$F1 = \frac{2 \cdot \mathrm{Accuracy} \cdot \mathrm{Recall}}{\mathrm{Accuracy} + \mathrm{Recall}}$$
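The four evaluation measures described above can be sketched together as follows. The function name `evaluate` and the toy label sets are assumptions for illustration, and the normalization of Hamming Loss by the number of labels q follows the common definition rather than a formula confirmed by this text. The quantity this document calls "accuracy" is computed here as `precision`.

```python
def evaluate(recommended, actual, q):
    """Return (hamming_loss, precision, recall, f1) averaged over test samples.

    recommended, actual: lists of label-index sets, one per test sample.
    q: number of labels in the label space.
    """
    p = len(recommended)
    hl = prec = rec = 0.0
    for h, y in zip(recommended, actual):
        h, y = set(h), set(y)
        hits = len(h & y)
        hl += len(h ^ y) / q          # symmetric difference, normalized by q
        prec += hits / len(h) if h else 0.0
        rec += hits / len(y) if y else 0.0
    hl, prec, rec = hl / p, prec / p, rec / p
    f1 = 2 * prec * rec / (prec + rec) if prec + rec else 0.0
    return hl, prec, rec, f1


# Toy example (hypothetical): 2 test samples, 9 labels, 2 recommendations each.
recommended = [{0, 1}, {2, 3}]
actual = [{0}, {2, 3, 4}]
hl, prec, rec, f1 = evaluate(recommended, actual, q=9)
print(round(hl, 4), round(prec, 4), round(rec, 4), round(f1, 4))
```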
The effect of the method is analyzed by comparing the invention with two classical multi-label classification methods: the lazy multi-label classification method (ML-kNN), and a multi-label classification method that combines the BR method with the kNN method (BR-kNN). Table 3 compares the invention with these classical algorithms. The number of nearest neighbors is set to 10 for all methods, a value chosen for its good stability and high accuracy, and the smoothing factor of both the invention and ML-kNN is set to 1. In all methods, the number of recommended labels is 2. Experiments were performed using 10-fold cross validation, and the final results are the averages over these experiments. As shown in Table 3, the invention achieves an accuracy of 0.236, a recall of 0.3793 and an F1-score of 0.2915, all better than the other two methods. Compared with the second-best method, ML-kNN, the accuracy, recall and F1-score of the invention are improved by 11%, 13% and 12%, respectively. The Hamming Loss of the invention is 0.2117, which is also better than that of the other two methods. Thus, the performance of the invention is superior to that of the other two methods.
TABLE 3
↓: the smaller the value, the better the effect; ↑: the larger the value, the better the effect
The present invention also provides a diagnostic information processing and analyzing system, the system comprising:
a module for establishing a feature space and a label space of multi-label information according to the plurality of sample features and label information of the plurality of samples;
means for establishing a feature vector and a first label vector for the sample based on each sample feature and each label information;
a module for calculating the number of occurrences of each of the tag information;
a module for counting the number of times that every two tag information simultaneously appear in one sample;
a module for calculating the similarity of every two label information and establishing a label-label similarity matrix;
a module for reconstructing the label information matrix of the sample through the label-label similarity matrix, that is, reconstructing the first label vector of the sample to obtain a second label vector, and calculating the occurrence probability of the label information of a target sample according to the second label vector;
a module for sorting the occurrence probabilities of the target sample's label information in descending order;
and a module for selecting a preset number of pieces of label information as the recommended label information of the target sample.
Preferably, in the system, the module for calculating the similarity of every two label information and establishing the label-label similarity matrix is a module that calculates cosine similarity.
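The reconstruction and recommendation modules can be sketched as follows. The text does not state the reconstruction formula, so this sketch assumes that the second label vector is the product of the first label vector and the label-label similarity matrix; the function names `reconstruct_label_vectors` and `recommend_labels` and the example label names are hypothetical.

```python
import numpy as np

def reconstruct_label_vectors(Y, S):
    """Turn first label vectors (rows of Y, n x q, binary) into second label vectors.

    Assumption: the reconstruction is the matrix product Y @ S, so that a label
    absent from a sample still receives weight from similar labels that are present.
    """
    Y = np.asarray(Y, dtype=float)
    S = np.asarray(S, dtype=float)
    return Y @ S

def recommend_labels(probabilities, label_names, k=2):
    """Sort label probabilities in descending order and keep the first k labels."""
    probabilities = np.asarray(probabilities, dtype=float)
    order = np.argsort(-probabilities, kind="stable")  # descending, ties by index
    return [label_names[i] for i in order[:k]]

if __name__ == "__main__":
    S = [[1.0, 0.5, 0.0],
         [0.5, 1.0, 0.0],
         [0.0, 0.0, 1.0]]
    print(reconstruct_label_vectors([[1, 0, 1]], S))   # [[1.  0.5 1. ]]
    print(recommend_labels([0.2, 0.7, 0.5], ["label_a", "label_b", "label_c"], k=2))
```

Under this assumption the reconstructed values are no longer binary, so any classifier that consumes them has to accept real-valued label weights.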
The above description covers only specific embodiments of the present invention, but the scope of protection of the present invention is not limited thereto. Any change or substitution that a person skilled in the art could readily conceive within the technical scope disclosed by the present invention shall fall within the scope of protection of the present invention.
Claims (9)
1. A method of diagnostic information processing analysis, comprising the steps of:
establishing a feature space and a label space of multi-label information through a plurality of sample features and label information of a plurality of samples, and establishing a feature vector and a first label vector of each sample according to each sample feature and each label information; the sample characteristics are the test report and basic information of the disease of the patient;
calculating the occurrence frequency of each label information; calculating the times that every two label information simultaneously appear in one sample; calculating the similarity of every two label information; establishing a label-label similarity matrix;
reconstructing a first label vector of the sample through the label-label similarity matrix to obtain a second label vector, and calculating the label information occurrence probability of the target sample according to the second label vector;
sorting the probability of the target sample label information in a descending order, and selecting a preset number of label information as the recommended label information of the target sample;
the establishing of the feature space and the label space of the multi-label information by the plurality of sample features and the label information of the plurality of samples further comprises:
preset $F = \{f_1, f_2, \ldots, f_b\}$ as the b-dimensional feature space of the multi-label information, and preset $L = \{l_1, l_2, \ldots, l_q\}$ as the q-dimensional label space of the multi-label information;
preset $T = \{(X_1, Y_1), (X_2, Y_2), \ldots, (X_n, Y_n)\}$ as a set of multi-label information, and preset $X_i$ as the b-dimensional feature vector of a sample; then
2. The method of claim 1, wherein said calculating the number of occurrences of each label information comprises: determining, from the occurrence of the label information in each sample, the samples corresponding to the label information, and setting the value as $r_{ij}$; if the feature vector $X_k$ has label $l_i$, then $r_{ij} = 1$; otherwise $r_{ij} = 0$.
3. The method of claim 2, wherein the calculating the similarity of every two label information and establishing the label-label similarity matrix further comprises:
calculating the similarity of every two labels by using a cosine similarity calculation method.
4. The method of claim 3, wherein the calculating the similarity of every two labels by using the cosine similarity calculation method comprises computing $\mathrm{sim}(l_i, l_j)$ from the following quantities:
$P_{ij}$, the set of samples that contain both label $l_i$ and label $l_j$; the number of times that label $l_i$ and label $l_j$ appear simultaneously in a sample $X_k$; and the total number of occurrences of label $l_i$ and the total number of occurrences of label $l_j$, respectively.
5. The method of claim 4, wherein the calculating the similarity of every two label information and establishing the label-label similarity matrix comprises:
the label-label similarity matrix is $S = (S_{ij})_{q \times q}$,
wherein the element $S_{ij} = \mathrm{sim}(l_i, l_j)$ of the matrix denotes the similarity of label $l_i$ and label $l_j$.
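A label-label similarity matrix of this kind can be sketched as follows. The sketch assumes that each label is represented by its binary occurrence column over the samples, in which case the cosine similarity of two labels reduces to their co-occurrence count divided by the square root of the product of their occurrence counts. The function name `label_similarity_matrix` is hypothetical, and the exact formula of the claims is not reproduced in the text.

```python
import numpy as np

def label_similarity_matrix(Y):
    """Cosine similarity between the label columns of a binary n x q label matrix Y.

    Y[k, i] = 1 if sample k carries label i, else 0 (assumed encoding).
    """
    Y = np.asarray(Y, dtype=float)
    co = Y.T @ Y                    # co[i, j]: samples carrying both label i and label j
    counts = np.diag(co)            # counts[i]: total occurrences of label i
    norm = np.sqrt(np.outer(counts, counts))
    # Labels that never occur get similarity 0 instead of a division by zero.
    return np.divide(co, norm, out=np.zeros_like(co), where=norm > 0)

if __name__ == "__main__":
    Y = [[1, 1, 0],
         [1, 0, 0],
         [1, 1, 0],
         [0, 0, 1]]
    print(np.round(label_similarity_matrix(Y), 4))
```

In this example the first two labels co-occur in two samples and occur three and two times in total, so their similarity is 2 / sqrt(3 * 2); the third label never co-occurs with the others and has similarity 0 to both.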
7. The method of claim 1, wherein the calculating the probability of occurrence of the label information of the target sample according to the second label vector comprises:
and calculating the occurrence probability of each label information in the target sample by using a lazy multi-label classification algorithm according to the second label vector.
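The idea of a lazy, neighbor-based probability estimate can be illustrated with a simplified sketch. This is not the full ML-kNN algorithm, which estimates prior and conditional probabilities from neighbor counts; it is a smoothed neighbor vote that assumes Euclidean distance in the feature space and second label values clipped to the range 0 to 1. The function name `knn_label_probabilities` and the parameter names `k` and `s` (smoothing factor) are hypothetical.

```python
import numpy as np

def knn_label_probabilities(X_train, Y2_train, x_target, k=10, s=1.0):
    """Smoothed k-nearest-neighbor vote over second label vectors (simplified stand-in).

    X_train:  n x b feature vectors of the training samples
    Y2_train: n x q second label vectors of the training samples
    x_target: b-dimensional feature vector of the target sample
    """
    X_train = np.asarray(X_train, dtype=float)
    # Assumption: label weights are limited to [0, 1] so the result stays a probability.
    Y2_train = np.clip(np.asarray(Y2_train, dtype=float), 0.0, 1.0)
    x_target = np.asarray(x_target, dtype=float)
    k = min(k, len(X_train))
    # Lazy learning: nothing is trained; neighbors are looked up at query time.
    dist = np.linalg.norm(X_train - x_target, axis=1)
    nearest = np.argsort(dist, kind="stable")[:k]
    votes = Y2_train[nearest].sum(axis=0)
    # Laplace-style smoothing keeps every probability strictly between 0 and 1.
    return (s + votes) / (2.0 * s + k)

if __name__ == "__main__":
    X = [[0.0], [1.0], [10.0]]
    Y2 = [[1.0, 0.0], [1.0, 1.0], [0.0, 1.0]]
    print(knn_label_probabilities(X, Y2, [0.2], k=2, s=1.0))  # [0.75 0.5 ]
```

The resulting probabilities can then be sorted in descending order to select the preset number of recommended labels.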
8. A diagnostic information processing and analysis system, the system comprising:
a module for establishing a feature space and a label space of multi-label information according to the plurality of sample features and label information of the plurality of samples; the sample characteristics are the test report and basic information of the disease of the patient;
a module for establishing a feature vector and a first label vector for each sample according to each sample feature and each label information;
a module for calculating the number of occurrences of each label information;
a module for counting the number of times that every two label information simultaneously appear in one sample;
a module for calculating the similarity of every two label information and establishing a label-label similarity matrix;
a module for reconstructing the label information matrix of the samples through the label-label similarity matrix, reconstructing the first label vector of each sample to obtain a second label vector, and calculating the occurrence probability of the label information of a target sample according to the second label vector;
a module for sorting the occurrence probabilities of the label information of the target sample in descending order;
a module for selecting a preset number of label information as the recommended label information of the target sample;
the establishing of the feature space and the label space of the multi-label information according to the plurality of sample features and the label information of the plurality of samples further comprises:
preset $F = \{f_1, f_2, \ldots, f_b\}$ as the b-dimensional feature space of the multi-label information, and preset $L = \{l_1, l_2, \ldots, l_q\}$ as the q-dimensional label space of the multi-label information;
preset $T = \{(X_1, Y_1), (X_2, Y_2), \ldots, (X_n, Y_n)\}$ as a set of multi-label information, and preset $X_i$ as the b-dimensional feature vector of a sample; then
9. The system of claim 8, wherein the module for calculating the similarity of every two label information and establishing the label-label similarity matrix is a module that calculates cosine similarity.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201711128161.9A CN107845424B (en) | 2017-11-15 | 2017-11-15 | Method and system for diagnostic information processing analysis |
Applications Claiming Priority (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201711128161.9A CN107845424B (en) | 2017-11-15 | 2017-11-15 | Method and system for diagnostic information processing analysis |
Publications (2)
Publication Number | Publication Date |
---|---|
CN107845424A (en) | 2018-03-27 |
CN107845424B (en) | 2021-11-16 |
Family
ID=61678966
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN201711128161.9A Active CN107845424B (en) | 2017-11-15 | 2017-11-15 | Method and system for diagnostic information processing analysis |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN107845424B (en) |
Families Citing this family (6)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN108962383A (en) * | 2018-06-05 | 2018-12-07 | 南京麦睿智能科技有限公司 | Hospital's intelligence hospital guide's method and apparatus |
CN109697629B (en) * | 2018-11-15 | 2023-02-24 | 平安科技(深圳)有限公司 | Product data pushing method and device, storage medium and computer equipment |
CN113096795B (en) * | 2019-12-23 | 2023-10-03 | 四川医枢科技有限责任公司 | Multi-source data-aided clinical decision support system and method |
CN111968740B (en) * | 2020-09-03 | 2021-04-27 | 卫宁健康科技集团股份有限公司 | Diagnostic label recommendation method and device, storage medium and electronic equipment |
CN112117009A (en) * | 2020-09-25 | 2020-12-22 | 北京百度网讯科技有限公司 | Method, device, electronic equipment and medium for constructing label prediction model |
CN112201354A (en) * | 2020-10-20 | 2021-01-08 | 吾征智能技术(北京)有限公司 | Disease information matching system based on blood lead value |
Citations (9)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN102298605A (en) * | 2011-06-01 | 2011-12-28 | 清华大学 | Image automatic annotation method and device based on digraph unequal probability random search |
CN102799680A (en) * | 2012-07-24 | 2012-11-28 | 华北电力大学(保定) | XML (extensible markup language) document spectrum clustering method based on affinity propagation |
KR101450784B1 (en) * | 2013-07-02 | 2014-10-23 | 아주대학교산학협력단 | Systematic identification method of novel drug indications using electronic medical records in network frame method |
CN104615894A (en) * | 2015-02-13 | 2015-05-13 | 上海中医药大学 | Traditional Chinese medicine diagnosis method and system based on k-nearest neighbor labeled specific weight characteristics |
CN105243300A (en) * | 2015-08-31 | 2016-01-13 | 合肥工业大学 | Approximation spectral clustering algorithm based method for predicting cancer metastasis and recurrence |
CN105320967A (en) * | 2015-11-04 | 2016-02-10 | 中科院成都信息技术股份有限公司 | Multi-label AdaBoost integration method based on label correlation |
CN105808931A (en) * | 2016-03-03 | 2016-07-27 | 北京大学深圳研究生院 | Knowledge graph based acupuncture and moxibustion decision support method and apparatus |
CN106202916A (en) * | 2016-07-04 | 2016-12-07 | 扬州大学 | The layering multiple manifold setting up a kind of Alzheimer analyzes model |
CN107133293A (en) * | 2017-04-25 | 2017-09-05 | 中国科学院计算技术研究所 | A kind of ML kNN improved methods and system classified suitable for multi-tag |
Family Cites Families (1)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
US20090105092A1 (en) * | 2006-11-28 | 2009-04-23 | The Trustees Of Columbia University In The City Of New York | Viral database methods |
Patent Citations (10)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN102298605A (en) * | 2011-06-01 | 2011-12-28 | 清华大学 | Image automatic annotation method and device based on digraph unequal probability random search |
CN102298605B (en) * | 2011-06-01 | 2013-04-17 | 清华大学 | Image automatic annotation method and device based on digraph unequal probability random search |
CN102799680A (en) * | 2012-07-24 | 2012-11-28 | 华北电力大学(保定) | XML (extensible markup language) document spectrum clustering method based on affinity propagation |
KR101450784B1 (en) * | 2013-07-02 | 2014-10-23 | 아주대학교산학협력단 | Systematic identification method of novel drug indications using electronic medical records in network frame method |
CN104615894A (en) * | 2015-02-13 | 2015-05-13 | 上海中医药大学 | Traditional Chinese medicine diagnosis method and system based on k-nearest neighbor labeled specific weight characteristics |
CN105243300A (en) * | 2015-08-31 | 2016-01-13 | 合肥工业大学 | Approximation spectral clustering algorithm based method for predicting cancer metastasis and recurrence |
CN105320967A (en) * | 2015-11-04 | 2016-02-10 | 中科院成都信息技术股份有限公司 | Multi-label AdaBoost integration method based on label correlation |
CN105808931A (en) * | 2016-03-03 | 2016-07-27 | 北京大学深圳研究生院 | Knowledge graph based acupuncture and moxibustion decision support method and apparatus |
CN106202916A (en) * | 2016-07-04 | 2016-12-07 | 扬州大学 | The layering multiple manifold setting up a kind of Alzheimer analyzes model |
CN107133293A (en) * | 2017-04-25 | 2017-09-05 | 中国科学院计算技术研究所 | A kind of ML kNN improved methods and system classified suitable for multi-tag |
Also Published As
Publication number | Publication date |
---|---|
CN107845424A (en) | 2018-03-27 |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
PB01 | Publication | ||
SE01 | Entry into force of request for substantive examination | ||
GR01 | Patent grant | ||