CN107845424A

CN107845424A - Method and system for diagnostic information processing and analysis

Info

Publication number: CN107845424A
Application number: CN201711128161.9A
Authority: CN
Inventors: 黄梦醒; 韩惠蕊; 张雨; 冯文龙
Original assignee: Hainan University
Current assignee: Hainan University
Priority date: 2017-11-15
Filing date: 2017-11-15
Publication date: 2018-03-27
Anticipated expiration: 2037-11-15
Also published as: CN107845424B

Abstract

The invention discloses a method and system for diagnostic information processing and analysis, and relates to the technical field of medical systems. The method comprises the following steps: establishing a feature space and a label space from the features and label information of multiple samples, and establishing a feature vector and a first label vector for each sample; counting the number of occurrences of each label; counting the number of times each pair of labels appears together in one sample; calculating the similarity of each pair of labels and establishing a label-label similarity matrix; reconstructing the first label vector to obtain a second label vector, and calculating from the second label vector the probability that each label occurs in a target sample; and selecting a preset number of labels as the recommended labels of the target sample. By using the features corresponding to a patient's diseases to form the feature space, and the diseases themselves as label information, the invention can effectively exploit the association between a disease and its complications to find more potential diseases, improving the precision of diagnostic decision support.

Description

Method and system for diagnostic information processing and analysis

Technical field

The present invention relates to the technical field of medical systems, and in particular to a method and system for diagnostic information processing and analysis.

Background art

In the era of big data, people have gradually accepted the use of health big data to assist diagnosis and treatment. Following the success that many industries and applications have had with big data, the health service industry has also begun to use medical big data to improve service efficiency and quality.

Health information tools and machine learning techniques have been used successfully to help doctors diagnose diseases and formulate treatment plans more efficiently. Clinical decision support applications include systems that provide diagnosis, personalized drug assessment, treatment plans, and relevant medical knowledge. They aim to give medical staff professional knowledge, patient information, and intelligent tools, thereby improving the efficiency and effectiveness of their decisions. Clinical decision support applications can reduce medical errors and improve the quality of medical services. In the medical field, demand for high-quality clinical decision support systems keeps growing.

Clinicians differentiate and diagnose patients on the basis of their own experience and knowledge. A clinician who lacks rich experience and accurate judgment will therefore inevitably make medical errors. The goal of a clinical decision support system is to improve the accuracy and efficiency of clinicians through machine learning techniques. Such a system can extract a patient's features from personal health records, such as physiological data, electronic medical records, 3D images, radiological images, genome sequencing, and clinical and billing data, classify the patient according to those features, and provide the doctor with corresponding clinical recommendations. The scoring criteria used in medical scenarios and the complexity of the medical domain are the main difficulties for clinical decision support systems. Many clinical decision support systems have already been developed on the market to assist doctors.

Because some diseases carry complications, it is very common for one patient to have several diseases at the same time. Recommending reference diseases to clinicians in such cases requires a more sophisticated clinical decision support system. Analysis of real clinical diagnosis records shows that patients with multiple simultaneous diseases make up a large share of all patients, so a clinical decision support system needs to recommend multiple reference diseases to the clinician. Disease recommendation is therefore cast as a multi-label classification problem over reference diseases.

ML-kNN (a lazy multi-label classification method) is widely used and studied because its steps are simple and it performs well. However, the algorithm estimates the likelihood of each label independently and thus ignores the associations between labels. In real diagnosis many labels are correlated, and in application scenarios with correlated labels the ML-kNN method lacks effectiveness.

Summary of the invention

The main object of the present invention is to provide a method and system for diagnostic information processing and analysis that uses the correlation between diseases to effectively improve the accuracy of disease diagnosis.

To achieve this object, the present invention provides a method for diagnostic information processing and analysis, comprising the following steps:

Establishing a feature space and a label space of multi-label information from the features of multiple samples and the label information of those samples, and establishing a feature vector and a first label vector for each sample according to its features and label information;

Counting the number of occurrences of each label; counting the number of times each pair of labels appears together in one sample;

Calculating the similarity of each pair of labels and establishing a label-label similarity matrix;

Reconstructing the first label vector of each sample through the label-label similarity matrix to obtain a second label vector, and calculating the probability of occurrence of each label of a target sample according to the second label vector;

Sorting the label occurrence probabilities of the target sample in descending order, and selecting a preset number of labels as the recommended labels of the target sample.

Preferably, establishing the feature space and the label space of multi-label information from the features and label information of multiple samples further comprises:

Let F = {f_1, f_2, ..., f_b} be the b-dimensional feature space of the multi-label information, and let L = {l_1, l_2, ..., l_q} be the q-dimensional label space of the multi-label information;

Let T = {(X_1, Y_1), (X_2, Y_2), ..., (X_n, Y_n)} be the multi-label data set, where X_i is the b-dimensional feature vector of the i-th sample; then

Y_i is the label vector corresponding to sample X_i; if the feature vector X_i has the label l_j, the j-th component of Y_i is 1, and otherwise it is 0.

Preferably, counting the occurrences of each label comprises: according to whether the label appears in each sample, computing an indicator value for the label and the sample, denoted r_ij; if the feature vector X_k has the label l_i, then r_ij = 1, otherwise r_ij = 0.

Preferably, calculating the similarity of each pair of labels and establishing the label-label similarity matrix further comprises:

Calculating the similarity of each pair of labels using the cosine similarity method.

Preferably, calculating the similarity of each pair of labels using the cosine similarity method comprises:

Calculating the similarity of each pair of labels by the cosine similarity formula,

where P_ij is the set of samples containing both label l_i and label l_j, the co-occurrence term of the formula is the number of times label l_i and label l_j appear together in sample X_k, and the two normalizing terms are respectively the total number of occurrences of label l_i and the total number of occurrences of label l_j.

Preferably, calculating the similarity of each pair of labels and establishing the label-label similarity matrix comprises:

The label-label similarity matrix is the q × q matrix S = (S_ij),

where the element S_ij = sim(l_i, l_j) of the matrix represents the similarity of label l_i and label l_j.

Preferably, the label information matrix of the samples is reconstructed through the label-label similarity matrix as: Y = g(X),

where g(X) is:

Preferably, calculating the label occurrence probabilities of the target sample according to the second label vector comprises:

According to the second label vector, calculating the probability of occurrence of each label in the target sample using the lazy multi-label classification algorithm.

The present invention also provides a diagnostic information processing and analysis system, the system comprising:

A module for establishing a feature space and a label space of multi-label information from the features of multiple samples and the label information of those samples;

A module for establishing a feature vector and a first label vector for each sample according to its features and label information;

A module for counting the number of occurrences of each label;

A module for counting the number of times each pair of labels appears together in one sample;

A module for calculating the similarity of each pair of labels and establishing a label-label similarity matrix;

A module for reconstructing the label information matrix of the samples through the label-label similarity matrix, reconstructing the first label vector of each sample to obtain a second label vector, and a module for calculating the label occurrence probabilities of a target sample according to the second label vector;

A module for sorting the label occurrence probabilities of the target sample in descending order;

A module for selecting a preset number of labels as the recommended labels of the target sample.

Preferably, in the system, the module for calculating the similarity of each pair of labels and establishing the label-label similarity matrix is a cosine similarity calculation module.

In the technical solution of the present invention, the features corresponding to the diseases a patient suffers from form the feature space of a multi-label learning algorithm, and the diseases themselves serve as its label information. Statistical analysis of the label information yields the number of times each pair of labels appears together in one patient. The label matrix is then reconstructed using the label similarity matrix to update the label vector of each patient, and the multi-label learning algorithm is used to predict possible diseases for a target patient. The association between a disease and its complications can thus be used effectively to find more potential diseases, improving the precision of diagnostic decision support.

Brief description of the drawings

Fig. 1 is a schematic flow chart of the diagnostic information processing and analysis method of the present invention.

Embodiment

The technical solutions in the embodiments of the present invention are described clearly and completely below with reference to the accompanying drawings. Obviously, the described embodiments are only some, not all, of the embodiments of the present invention. The components of the embodiments generally described and illustrated in the drawings may be arranged and designed in a variety of different configurations. Therefore, the following detailed description of the embodiments provided in the drawings is not intended to limit the scope of the claimed invention, but merely represents selected embodiments. All other embodiments obtained by those skilled in the art on the basis of these embodiments without creative work fall within the scope of protection of the present invention.

As shown in Fig. 1, the present invention provides a method for diagnostic information processing and analysis, comprising the following steps:

Establishing a feature space and a label space of multi-label information from the features of multiple samples and the label information of those samples, and establishing a feature vector and a first label vector for each sample according to its features and label information; counting the number of occurrences of each label; counting the number of times each pair of labels appears together in one sample; calculating the similarity of each pair of labels and establishing a label-label similarity matrix; reconstructing the first label vector of each sample through the label-label similarity matrix to obtain a second label vector, and calculating the label occurrence probabilities of a target sample according to the second label vector; sorting the label occurrence probabilities of the target sample in descending order, and selecting a preset number of labels as the recommended labels of the target sample.

The principle of the present invention is as follows. A complication is another disease or symptom that arises during the development of a disease. When diagnosing a patient, a doctor who has confirmed one disease will consider whether the patient also suffers from its complications, so as to find more of the patient's diseases. Moreover, the number of times a disease appears together with other diseases reflects its relationship with them. The co-occurrence frequency of each pair of diseases can therefore be used to reflect the similarity between diseases, and the calculated disease similarity is combined with the lazy multi-label classification method to recommend the diseases the patient may have.

In a specific embodiment, a multi-label data set is built from patient diagnosis records (the basic multi-label data set, denoted BMLDS). The feature set and the label set of the samples are separated from BMLDS and denoted F and L respectively, and BMLDS is divided into a training set and a test set in a 7:3 ratio. The method of the invention comprises:

S1. Establishing the feature space and the label space of multi-label information from the features and label information of multiple samples:

S2. Counting the occurrences of each label: according to whether the label appears in each sample, an indicator value for the label and the sample is computed, denoted r_ij; if the feature vector X_k has the label l_i, then r_ij = 1, otherwise r_ij = 0.
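The counting in this step, together with the pairwise co-occurrence counts used in the next step, can be sketched as follows. This is a minimal illustration, assuming the label indicators are held in a binary matrix `Y` with one row per sample and one column per label; the function name `label_counts` and the matrix layout are assumptions for illustration, not part of the source.

```python
import numpy as np

def label_counts(Y):
    """Count label occurrences and pairwise co-occurrences.

    Y is assumed to be an (n_samples, n_labels) binary matrix where
    Y[k, i] = 1 if sample k carries label i (hypothetical layout).
    """
    Y = np.asarray(Y, dtype=int)
    occurrences = Y.sum(axis=0)   # how many samples carry each label
    cooccurrence = Y.T @ Y        # [i, j] = samples carrying both label i and label j
    return occurrences, cooccurrence

# Toy data: 4 patients, 3 diseases
Y = np.array([[1, 1, 0],
              [1, 0, 0],
              [1, 1, 1],
              [0, 0, 1]])
occ, co = label_counts(Y)
print(occ)       # per-label occurrence counts
print(co[0, 1])  # number of patients with both disease 0 and disease 1
```

The diagonal of the co-occurrence matrix equals the per-label occurrence counts, so a single matrix product yields both statistics.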

S3. Calculating the similarity of each pair of labels and establishing the label-label similarity matrix:

There are three similarity calculation methods: cosine similarity, the Pearson correlation coefficient, and the Jaccard similarity coefficient.

Specifically, calculating the similarity of each pair of labels using the cosine similarity method comprises:

Calculating the similarity of each pair of labels by the cosine similarity formula, where P_ij is the set of samples containing both label l_i and label l_j, the co-occurrence term of the formula is the number of times label l_i and label l_j appear together in sample X_k, and the two normalizing terms are respectively the total number of occurrences of label l_i and the total number of occurrences of label l_j;
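Since the formula itself does not survive in this text, the following is only a sketch of the textbook cosine similarity applied to binary label columns, which matches the description above (a co-occurrence count normalized by the two labels' total occurrences). The function name and the matrix layout are assumptions.

```python
import numpy as np

def cosine_label_similarity(Y):
    """Cosine similarity between label columns of a binary matrix.

    For binary indicators, the dot product of two columns is the
    co-occurrence count, and each squared column norm is that label's
    total number of occurrences. Y[k, i] = 1 if sample k has label i
    (assumed layout).
    """
    Y = np.asarray(Y, dtype=float)
    co = Y.T @ Y                      # co-occurrence counts
    norms = np.sqrt(np.diag(co))      # sqrt of each label's occurrence count
    denom = np.outer(norms, norms)
    # Labels that never occur get similarity 0 instead of a division by zero.
    return np.divide(co, denom, out=np.zeros_like(co), where=denom > 0)

Y = np.array([[1, 1, 0],
              [1, 0, 0],
              [1, 1, 1],
              [0, 0, 1]])
S = cosine_label_similarity(Y)
print(np.round(S, 3))
```

The result is a symmetric q × q matrix with ones on the diagonal for every label that occurs at least once.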

The Pearson correlation coefficient is calculated as:

where I and J are the sample vectors corresponding to label l_i and label l_j respectively; the Pearson correlation coefficient is the cosine of the angle between the vectors I and J after each has been standardized as a whole.

The Jaccard similarity coefficient is calculated as:

where I and J are the sample vectors corresponding to label l_i and label l_j respectively.
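The two alternative measures can be sketched with their textbook definitions, which are assumed here because the source formulas are not reproduced: Pearson correlation as the cosine of the mean-centered label columns, and the Jaccard coefficient as the size of the intersection over the size of the union of the sample sets carrying each label. Function names are illustrative.

```python
import numpy as np

def pearson_label_similarity(Y):
    """Pearson correlation between label columns (columns are variables).

    Note: a label that is constant across all samples yields NaN entries.
    """
    return np.corrcoef(np.asarray(Y, dtype=float), rowvar=False)

def jaccard_label_similarity(Y):
    """Jaccard coefficient |I and J| / |I or J| between binary label columns."""
    Y = np.asarray(Y, dtype=float)
    inter = Y.T @ Y                                    # samples with both labels
    counts = Y.sum(axis=0)                             # samples with each label
    union = counts[:, None] + counts[None, :] - inter  # samples with either label
    return np.divide(inter, union, out=np.zeros_like(inter), where=union > 0)

Y = np.array([[1, 1, 0],
              [1, 0, 0],
              [1, 1, 1],
              [0, 0, 1]])
print(np.round(pearson_label_similarity(Y), 3))
print(np.round(jaccard_label_similarity(Y), 3))
```

Unlike cosine and Jaccard similarity on binary data, Pearson correlation can be negative, which matters if the resulting matrix is later thresholded.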

S4. Calculating the similarity of each pair of labels and establishing the label-label similarity matrix:

The label-label similarity matrix is the q × q matrix S = (S_ij).

S5. Reconstructing the label information matrix of the samples through the label-label similarity matrix:

Y = g(X), where g(X) is:

S6. Calculating the label occurrence probabilities of the target sample according to the second label vector:

According to the second label vector, the probability of occurrence of each label in the target sample is calculated using the lazy multi-label classification algorithm. Specifically, the probability that each label occurs in the target sample lies in the range [0, 1].

The classification function of the lazy multi-label classification algorithm is as follows:

First, the number of occurrences of each label among the k nearest neighbor (kNN) samples of each sample is counted, and the maximum a posteriori probability method is used to estimate which labels may appear in an unlabeled sample. For a sample space X containing m samples, the label space is denoted L. One event denotes that the i-th label takes the value b, where b is 0 or 1: b = 0 means the label does not appear and b = 1 means it appears. A second event denotes that exactly j of the k nearest neighbors carry the label l_i. Whether the label l_i appears in sample x is determined by the magnitude of the computed value of the classification function.

The prior probability of the first event is calculated by a formula in which y_j(l_i) indicates whether sample j has the label l_i, and s ∈ (0,1).

Calculating the conditional probability requires traversing the training sample set and counting, for each sample, how many of its k nearest neighbor samples contain the label l_i. The array C[j] counts the samples whose label l_i has value 1 and whose k nearest neighbor samples include j samples with label l_i; the array C'[j] counts the samples whose label l_i has value 0 and whose k nearest neighbor samples include j samples with label l_i. P denotes the number of labels. The conditional probability is calculated by the following formula:
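Because the formulas are missing from this text, the sketch below follows the commonly published ML-kNN estimates rather than the patent's exact expressions: a smoothed prior (s + count) / (2s + m), smoothed likelihoods (s + C[j]) / (s(k + 1) + sum of C), and a posterior obtained by Bayes' rule. The function names, the Euclidean distance, and the dictionary used to hold the model are assumptions.

```python
import numpy as np

def mlknn_fit(X, Y, k=10, s=1.0):
    """Estimate ML-kNN priors and likelihoods (assumes k < number of samples)."""
    X = np.asarray(X, dtype=float)
    Y = np.asarray(Y, dtype=int)
    m, q = Y.shape
    prior1 = (s + Y.sum(axis=0)) / (2 * s + m)      # smoothed P(label present)
    dist = np.linalg.norm(X[:, None, :] - X[None, :, :], axis=2)
    np.fill_diagonal(dist, np.inf)                  # a sample is not its own neighbour
    nn = np.argsort(dist, axis=1)[:, :k]
    delta = Y[nn].sum(axis=1)                       # (m, q): neighbours carrying each label
    c1 = np.zeros((q, k + 1))                       # C[j]: label present, j neighbours have it
    c0 = np.zeros((q, k + 1))                       # C'[j]: label absent, j neighbours have it
    for l in range(q):
        for j in range(k + 1):
            exactly_j = delta[:, l] == j
            c1[l, j] = np.sum(exactly_j & (Y[:, l] == 1))
            c0[l, j] = np.sum(exactly_j & (Y[:, l] == 0))
    like1 = (s + c1) / (s * (k + 1) + c1.sum(axis=1, keepdims=True))
    like0 = (s + c0) / (s * (k + 1) + c0.sum(axis=1, keepdims=True))
    return {"X": X, "Y": Y, "k": k, "prior1": prior1, "like1": like1, "like0": like0}

def mlknn_predict_proba(model, x):
    """Posterior probability in [0, 1] that each label applies to sample x."""
    dist = np.linalg.norm(model["X"] - np.asarray(x, dtype=float), axis=1)
    nn = np.argsort(dist)[:model["k"]]
    counts = model["Y"][nn].sum(axis=0)             # neighbours carrying each label
    idx = np.arange(len(counts))
    p1 = model["prior1"] * model["like1"][idx, counts]
    p0 = (1 - model["prior1"]) * model["like0"][idx, counts]
    return p1 / (p1 + p0)

X = [[0.0], [0.1], [0.2], [5.0], [5.1], [5.2]]
Y = [[1, 0], [1, 0], [1, 0], [0, 1], [0, 1], [0, 1]]
model = mlknn_fit(X, Y, k=2, s=1.0)
print(mlknn_predict_proba(model, [0.05]))
```

In the method described here, `Y` would be the reconstructed (second) label matrix rather than the original one; the estimation itself is unchanged.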

S7. Sorting the label occurrence probabilities of the target sample in descending order, and selecting a preset number of labels as the recommended labels of the target sample.

The labels are arranged in descending order of their probability of occurrence in the target sample, and the top N labels are selected as the recommended labels of the target sample. When a patient is examined, the N possible complication labels with the highest probability values are recommended; the specific value of N can be preset for different diseases.
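The ranking step can be sketched in a few lines. The function name `recommend_top_n` and the example disease names and probabilities are purely illustrative.

```python
import numpy as np

def recommend_top_n(probabilities, label_names, n=2):
    """Return the n labels with the highest probability, in descending order."""
    p = np.asarray(probabilities, dtype=float)
    order = np.argsort(-p, kind="stable")   # descending; ties keep original order
    return [(label_names[i], float(p[i])) for i in order[:n]]

# Hypothetical probabilities for three candidate diseases
print(recommend_top_n([0.2, 0.8, 0.5],
                      ["fatty liver", "type 2 diabetes", "coronary heart disease"],
                      n=2))
```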

In a specific embodiment, patients are the samples of the multi-label learning algorithm, the features corresponding to each patient's diseases form its feature space, and the diseases of all patients serve as its label information. Analysis of the label information gives the number of times each pair of labels appears together; the more often two different labels appear in the same sample, the stronger the association between them. The similarity between each pair of labels is calculated by cosine similarity, and the label vector of each sample is recalculated according to the label similarities: if the recalculated value for a label is equal to or greater than 0.5, the value for that label is reset to 1; otherwise it remains 0. Reconstructing the label vectors in this way uncovers potential labels of the training samples and links their labels. The label matrix of the label space is reconstructed with the label similarity matrix to update the label vector of each sample, and finally the multi-label learning algorithm predicts possible labels for the target sample.
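The reconstruction function g is not reproduced in this text, so the following is only one plausible reading of the 0.5 rule: a label is switched on for a sample when its similarity to any label the sample already carries reaches the threshold. The max-based scoring, the function name, and the example similarity values are all assumptions.

```python
import numpy as np

def reconstruct_labels(Y, S, threshold=0.5):
    """Build second label vectors from first label vectors and similarities.

    scores[k, j] is the largest similarity between label j and any label
    already present in sample k (an assumed scoring rule). With
    S[j, j] = 1, labels already present are always kept.
    """
    Y = np.asarray(Y, dtype=float)   # (n_samples, q) binary first label vectors
    S = np.asarray(S, dtype=float)   # (q, q) non-negative similarity matrix
    scores = (Y[:, :, None] * S[None, :, :]).max(axis=1)
    return (scores >= threshold).astype(int)

S = np.array([[1.0, 0.8, 0.2],
              [0.8, 1.0, 0.1],
              [0.2, 0.1, 1.0]])
Y1 = np.array([[1, 0, 0],
               [0, 0, 1]])
print(reconstruct_labels(Y1, S))
```

With these example values, the first sample gains the second label because its similarity to the first label is 0.8, while the second sample is unchanged.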

In a specific embodiment, nine common diseases are selected: type 2 diabetes, hyperlipidemia, fatty liver, hyperkalemia, hypoproteinemia, diabetic nephropathy, cerebral infarction, coronary heart disease, and osteoporosis. Patients with one or more of these diseases are selected from a hospital as experimental samples. Their laboratory reports and basic information are collected as sample features, yielding 459 patient samples covering 5 items of basic patient information and 278 test items. Sex, age, body temperature, height, and weight are extracted as the basic attributes of each patient. Sex is a binary value: 0 for male and 1 for female. Age, body temperature, height, and weight are numeric and keep their actual values. For a test item with a numeric result, the value is set to 1 if the result lies within the normal reference range, and to the actual result if it lies outside the normal range. For a test item with a textual result, the distinct textual values of the item are collected and arranged in an array; the value is 0 if the textual result equals the reference value, and otherwise it is set to the position of that text in the array.
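The encoding rules above can be sketched as three small helpers. The function names, the reference range, and the example values are hypothetical, and the sketch assumes array positions are counted from 1 so that 0 stays reserved for the reference value; the source does not say which convention is used.

```python
def encode_sex(sex):
    """Binary encoding: 0 for male, 1 for female."""
    return 0 if sex == "male" else 1

def encode_numeric_test(value, low, high):
    """1 if the result is within the normal range, otherwise the raw value."""
    return 1.0 if low <= value <= high else float(value)

def encode_text_test(value, reference, observed_values):
    """0 if the text equals the reference value, otherwise its position in
    the array of observed texts (1-based here, which is an assumption)."""
    if value == reference:
        return 0
    return observed_values.index(value) + 1

# Hypothetical test item with an assumed normal range of 3.9 to 6.1
print(encode_numeric_test(5.0, 3.9, 6.1))
print(encode_numeric_test(7.2, 3.9, 6.1))
print(encode_text_test("positive", "negative", ["negative", "trace", "positive"]))
```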

Tables 1 and 2 list the statistics of the features and of the diseases respectively. Overall, 60% of the patients are male and 40% are female. The mean age, body temperature, height, and weight of the patients are 64.64, 36.5, 167.81, and 67.75 respectively. The disease statistics show that type 2 diabetes and cerebral infarction are the two most common of the nine diseases; these are in fact also the most common diseases among the elderly. 70% of the patients are randomly selected as training samples, and the remaining 30% serve as test samples.
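A random 70/30 split of the 459 samples can be sketched as follows. The function name and the fixed seed are assumptions added for reproducibility; the source does not describe how the random selection was implemented.

```python
import numpy as np

def split_train_test(n_samples, train_fraction=0.7, seed=0):
    """Randomly partition sample indices into training and test sets."""
    rng = np.random.default_rng(seed)
    order = rng.permutation(n_samples)
    n_train = int(round(train_fraction * n_samples))
    return order[:n_train], order[n_train:]

train_idx, test_idx = split_train_test(459)
print(len(train_idx), len(test_idx))
```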

Table 1

Table 2

Evaluation criteria for multi-label classification fall into two classes. (a) Ranking-based evaluation: the goal is to rank relevant samples ahead of irrelevant ones. (b) Binary prediction evaluation: the goal is to make a strict yes/no classification for each target sample. The effect of the invention is evaluated using Hamming loss, precision, recall, and F1-score.

Hamming loss measures the average difference between the recommended labels of a test sample and its actual labels:

where h(x_i) is the set of labels recommended for test sample x_i; p is the number of test samples; Y_i is the set of actual labels of test sample x_i; and Δ is the symmetric difference.

Precision is defined as the ratio of the number of hit labels in the recommendation list to the total number of labels in the recommendation list. That is, precision is the probability that a recommended label is one the test sample actually has. The formula for precision is:

Recall is defined as the ratio of the number of hit labels to the number of true labels of the test sample. In other words, recall is the probability that a true label is recommended. The formula for recall is:

F1-score takes both precision and recall into account; its formula is:
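The four measures can be sketched on sets of label indices. Since the source formulas are not reproduced, this sketch makes two assumptions: Hamming loss is normalized by the number of labels, as in the usual definition, and precision and recall are pooled over all test samples (micro-averaged) rather than averaged per sample. The function name is illustrative.

```python
def evaluate(recommended, actual, n_labels):
    """Hamming loss, precision, recall and F1 for label recommendations.

    recommended, actual: lists of sets of label indices, one per test sample.
    """
    p = len(recommended)
    # Size of the symmetric difference between recommended and actual labels
    hamming = sum(len(r ^ a) for r, a in zip(recommended, actual)) / (p * n_labels)
    hits = sum(len(r & a) for r, a in zip(recommended, actual))
    n_recommended = sum(len(r) for r in recommended)
    n_actual = sum(len(a) for a in actual)
    precision = hits / n_recommended if n_recommended else 0.0
    recall = hits / n_actual if n_actual else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if precision + recall else 0.0)
    return hamming, precision, recall, f1

recommended = [{0, 1}, {2, 3}]
actual = [{0}, {2, 3, 4}]
print(evaluate(recommended, actual, n_labels=5))
```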

The effect of the present invention is analyzed by comparing it with two classical multi-label classification methods: the lazy multi-label classification method (ML-kNN), and the multi-label classification method that combines the BR method with the kNN method (BR-kNN). Table 3 gives the comparison with these classical algorithms. The number of nearest neighbors is set to 10 for every method, a value at which each is stable and accurate, and the smoothing factor of both the present invention and ML-kNN is set to 1. In all methods the number of recommended labels is 2. Experiments are performed with 10-fold cross-validation, and the final result is the average of the experimental results. In Table 3, the present invention has a precision of 0.236, a recall of 0.3793, and an F1-score of 0.2915, all better than the other two methods. Compared with ML-kNN, which ranks second, the precision, recall, and F1-score of the present invention improve by 11%, 13%, and 12% respectively. The Hamming loss of the present invention is 0.2117, also better than the other two methods. The present invention therefore outperforms both.

Table 3

↓: the smaller the value, the better the effect; ↑: the larger the value, the better the effect

Preferably, in the system, the module for calculating the similarity of each pair of labels and establishing the label-label similarity matrix is a cosine similarity calculation module.

The foregoing is only a specific embodiment of the present invention, but the scope of protection of the present invention is not limited to it. Any change or replacement that a person skilled in the art could readily conceive within the technical scope disclosed by the invention shall fall within the scope of protection of the present invention.

Claims

1. A method for diagnostic information processing and analysis, characterized by comprising the following steps:

Establishing a feature space and a label space of multi-label information from the features of multiple samples and the label information of those samples, and establishing a feature vector and a first label vector for each sample according to its features and label information;

Counting the number of occurrences of each label; counting the number of times each pair of labels appears together in one sample; calculating the similarity of each pair of labels; establishing a label-label similarity matrix;

Reconstructing the first label vector of each sample through the label-label similarity matrix to obtain a second label vector, and calculating the label occurrence probabilities of a target sample according to the second label vector;

Sorting the label occurrence probabilities of the target sample in descending order, and selecting a preset number of labels as the recommended labels of the target sample.
2. The method according to claim 1, characterized in that establishing the feature space and the label space of multi-label information from the features and label information of multiple samples further comprises:

Let F = {f_1, f_2, ..., f_b} be the b-dimensional feature space of the multi-label information, and let L = {l_1, l_2, ..., l_q} be the q-dimensional label space of the multi-label information;

Let T = {(X_1, Y_1), (X_2, Y_2), ..., (X_n, Y_n)} be the multi-label data set, where X_i is the b-dimensional feature vector of the i-th sample; then

Y_i is the label vector corresponding to sample X_i; if the feature vector X_i has the label l_j, the j-th component of Y_i is 1, and otherwise it is 0.
3. The method according to claim 2, characterized in that counting the occurrences of each label comprises: according to whether the label appears in each sample, computing an indicator value for the label and the sample, denoted r_ij; if the feature vector X_k has the label l_i, then r_ij = 1, otherwise r_ij = 0.
4. The method according to claim 3, characterized in that calculating the similarity of each pair of labels and establishing the label-label similarity matrix further comprises:

Calculating the similarity of each pair of labels using the cosine similarity method.
5. The method according to claim 4, characterized in that calculating the similarity of each pair of labels using the cosine similarity method comprises:

Calculating the similarity of each pair of labels by the cosine similarity formula,

where P_ij is the set of samples containing both label l_i and label l_j, the co-occurrence term of the formula is the number of times label l_i and label l_j appear together in sample X_k, and the two normalizing terms are respectively the total number of occurrences of label l_i and the total number of occurrences of label l_j.
6. The method according to claim 5, characterized in that calculating the similarity of each pair of labels and establishing the label-label similarity matrix comprises:

The label-label similarity matrix is the q × q matrix S = (S_ij),

where the element S_ij = sim(l_i, l_j) of the matrix represents the similarity of label l_i and label l_j.
7. The method according to claim 6, characterized in that the label information matrix of the samples is reconstructed through the label-label similarity matrix as: Y = g(X),

where g(X) is:
8. The method according to claim 1, characterized in that calculating the label occurrence probabilities of the target sample according to the second label vector comprises:

According to the second label vector, calculating the probability of occurrence of each label in the target sample using the lazy multi-label classification algorithm.
9. A diagnostic information processing and analysis system, characterized in that the system comprises:

A module for establishing a feature space and a label space of multi-label information from the features of multiple samples and the label information of those samples;

A module for establishing a feature vector and a first label vector for each sample according to its features and label information;

A module for counting the number of occurrences of each label;

A module for counting the number of times each pair of labels appears together in one sample;

A module for calculating the similarity of each pair of labels and establishing a label-label similarity matrix;

A module for reconstructing the label information matrix of the samples through the label-label similarity matrix, reconstructing the first label vector of each sample to obtain a second label vector, and a module for calculating the label occurrence probabilities of a target sample according to the second label vector;

A module for sorting the label occurrence probabilities of the target sample in descending order;

A module for selecting a preset number of labels as the recommended labels of the target sample.
10. The system according to claim 9, characterized in that the module for calculating the similarity of each pair of labels and establishing the label-label similarity matrix is a cosine similarity calculation module.