CN107845424B

CN107845424B - Method and system for diagnostic information processing analysis

Info

Publication number: CN107845424B
Application number: CN201711128161.9A
Authority: CN
Inventors: 黄梦醒; 韩惠蕊; 张雨; 冯文龙
Original assignee: Hainan University
Current assignee: Hainan University
Priority date: 2017-11-15
Filing date: 2017-11-15
Publication date: 2021-11-16
Anticipated expiration: 2037-11-15
Also published as: CN107845424A

Abstract

The invention discloses a method and a system for processing and analyzing diagnostic information, and relates to the technical field of medical systems. The method includes the following steps: establishing a feature space and a label space from the features and label information of a plurality of samples, and establishing a feature vector and a first label vector for each sample; calculating the number of occurrences of each piece of label information; calculating the number of times every two pieces of label information appear in the same sample at the same time; calculating the similarity of every two pieces of label information and establishing a label-label similarity matrix; reconstructing the first label vector to obtain a second label vector, and calculating the occurrence probability of the label information of the target sample according to the second label vector; and selecting a preset number of pieces of label information as the recommended label information of the target sample. By using the features corresponding to patients' diseases to form the feature space and the diseases as the label information, the invention can effectively use the correlation between a disease and its complications to discover more potential diseases and improve the accuracy of diagnostic decision support.

Description

Method and system for processing and analyzing diagnostic information

Technical Field

The present invention relates to the technical field of medical systems, and in particular to a method and system for processing and analyzing diagnostic information.

Background

In the era of big data, people have gradually accepted the use of health big data to assist diagnosis and treatment. Following the success of many industries in exploiting big data, the health service industry has also begun to use medical big data to improve the efficiency and quality of its services.

Health information tools and machine learning techniques have been used successfully to help doctors diagnose diseases and develop treatment plans more efficiently. Clinical decision support applications include systems that provide diagnoses, personalized drug evaluation, treatment plans, and relevant medical knowledge. They aim to give medical staff professional knowledge, patient information, and intelligent tools so as to improve the efficiency and effectiveness of their decisions. Clinical decision support applications can reduce medical errors and improve the quality of medical services. In the medical field, demand for high-quality clinical decision support systems is growing.

Clinicians rely on their experience and knowledge to differentiate and diagnose patients. If a clinician lacks rich experience and accurate judgment, medical errors become unavoidable. The goal of a clinical decision support system is to improve clinicians' accuracy and efficiency through machine learning. Such a system can extract patient characteristics from personal health records (physiological data, electronic medical records, 3D images, radiological images, genome sequencing, and clinical and billing data), classify patients according to those characteristics, and provide corresponding clinical recommendations to doctors. The scoring criteria used in medical scenarios and the complexity of the medical domain are difficult problems for clinical decision support systems. Many clinical decision support systems have already been developed to assist physicians.

Because some diseases carry complications, it is common for one patient to have several diseases at the same time. To suggest reference diseases to clinicians, the clinical decision support system therefore needs to be more sophisticated. Analysis of real clinical diagnostic information shows that patients with multiple concurrent diseases account for a large proportion of all patients, so the system needs to recommend multiple reference diseases to clinicians. Disease recommendation thus becomes a multi-label classification problem over reference diseases.

ML-kNN (a lazy multi-label classification method) has been widely applied and studied because its steps are simple and its results are strong. However, the algorithm estimates the likelihood of each label independently and therefore ignores the associations between labels. In actual disease diagnosis many labels are related, and for application scenarios with related labels the ML-kNN method is not effective enough.

Summary of the Invention

The main purpose of the present invention is to provide a method and system for processing and analyzing diagnostic information that effectively improve the accuracy of disease diagnosis by exploiting the correlation between diseases.

To achieve the above purpose, the present invention provides a method for processing and analyzing diagnostic information, comprising the following steps:

establishing a feature space and a label space of multi-label information from multiple sample features and the label information of multiple samples, and establishing a feature vector and a first label vector for each sample according to each sample feature and each piece of label information;

calculating the number of occurrences of each piece of label information; calculating the number of times every two pieces of label information appear in the same sample at the same time;

calculating the similarity of every two pieces of label information, and establishing a label-label similarity matrix;

reconstructing the first label vector of the sample through the label-label similarity matrix to obtain a second label vector, and calculating the occurrence probability of the label information of the target sample according to the second label vector;

sorting the occurrence probabilities of the target sample's label information in descending order, and selecting a preset number of pieces of label information as the recommended label information of the target sample.

Preferably, establishing the feature space and the label space of multi-label information from multiple sample features and the label information of multiple samples further includes:

letting F = {f_1, f_2, ..., f_b} be the b-dimensional feature space of the multi-label information, and letting L = {l_1, l_2, ..., l_q} be the q-dimensional label space of the multi-label information;

letting T = {(X_1, Y_1), (X_2, Y_2), ..., (X_n, Y_n)} be the set of multi-label samples, where X_i = (x_i1, x_i2, ..., x_ib) is the b-dimensional feature vector of the i-th sample and Y_i = (y_i1, y_i2, ..., y_iq) is the label vector corresponding to the sample X_i; if the feature vector X_i has the label l_j, then y_ij = 1; otherwise y_ij = 0.
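As a rough illustration of these definitions, the following Python sketch builds 0/1 label vectors from per-sample label sets. The label names, the `label_vector` helper, and the toy records are hypothetical and are not taken from the patent.

```python
# Hypothetical label space L = {l_1, ..., l_q}; the names are illustrative only.
LABELS = ["type_2_diabetes", "hyperlipidemia", "fatty_liver"]


def label_vector(sample_labels, label_space=LABELS):
    """Return Y_i, where entry j is 1 if the sample has label l_j, else 0."""
    present = set(sample_labels)
    return [1 if label in present else 0 for label in label_space]


# Toy set T = {(X_i, Y_i)}: made-up feature vectors paired with label sets.
records = [
    ([0, 64.0, 36.5], {"type_2_diabetes", "fatty_liver"}),
    ([1, 58.0, 36.8], {"hyperlipidemia"}),
]
T = [(x, label_vector(labels)) for x, labels in records]
```

Keeping the label space as an ordered list fixes the position of each label in every vector, which the later similarity and reconstruction steps rely on.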

Preferably, calculating the number of occurrences of each piece of label information includes: determining, from the occurrences of the label information in each sample, the samples that correspond to that label information, and recording the result as r_ij; if the feature vector X_k has the label l_i, then r_ij = 1; otherwise r_ij = 0.
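The counting steps can be sketched as follows, assuming each sample is represented by a 0/1 label vector. The function name `count_labels` and the toy matrix are illustrative assumptions rather than part of the described method.

```python
from itertools import combinations


def count_labels(label_matrix):
    """Count single-label occurrences and pairwise co-occurrences.

    label_matrix is a list of 0/1 label vectors, one per sample.
    """
    q = len(label_matrix[0])
    occurrences = [0] * q
    co_occurrences = [[0] * q for _ in range(q)]
    for y in label_matrix:
        present = [j for j in range(q) if y[j] == 1]
        for j in present:
            occurrences[j] += 1
        # Every unordered pair of labels present in this sample co-occurs once.
        for a, b in combinations(present, 2):
            co_occurrences[a][b] += 1
            co_occurrences[b][a] += 1
    return occurrences, co_occurrences


Y = [[1, 1, 0], [1, 0, 0], [1, 1, 1]]
occ, co = count_labels(Y)
```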

Preferably, calculating the similarity of every two labels and establishing the label-label similarity matrix further includes:

calculating the similarity of every two labels using the cosine similarity method.

Preferably, calculating the similarity of every two labels using the cosine similarity method includes:

calculating the cosine similarity sim(l_i, l_j) of every two labels, where P_ij is the set of samples that contain both label l_i and label l_j; the numerator sums, over the samples X_k in P_ij, the number of times label l_i and label l_j appear together in sample X_k; and the denominator is formed from the total number of occurrences of label l_i and the total number of occurrences of label l_j.
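For 0/1 label indicators, the cosine similarity reduces to the co-occurrence count divided by the product of the square roots of the two occurrence totals. The sketch below assumes that reading; the name `cosine_from_counts` and the zero-count convention are assumptions.

```python
import math


def cosine_from_counts(n_ij, n_i, n_j):
    """Cosine similarity of two binary label indicator vectors.

    n_ij: number of samples that contain both labels;
    n_i, n_j: total number of occurrences of each label.
    """
    if n_i == 0 or n_j == 0:
        # Assumption: a label that never occurs is similar to nothing.
        return 0.0
    return n_ij / (math.sqrt(n_i) * math.sqrt(n_j))
```

For example, two labels that each occur in 4 samples and co-occur in 2 of them have similarity 2 / (2 × 2) = 0.5.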

Preferably, calculating the similarity of every two pieces of label information and establishing the label-label similarity matrix includes:

the label-label similarity matrix is S = [S_ij], a q × q matrix in which each element S_ij = sim(l_i, l_j) represents the similarity between label l_i and label l_j.
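A minimal sketch of assembling such a matrix from 0/1 label vectors is shown below, using cosine similarity between label columns. The helper name `similarity_matrix` and the toy data are assumptions made for illustration.

```python
import math


def similarity_matrix(label_matrix):
    """Build S, where S[i][j] is the cosine similarity of label columns i and j."""
    q = len(label_matrix[0])
    columns = [[row[j] for row in label_matrix] for j in range(q)]
    norms = [math.sqrt(sum(v * v for v in col)) for col in columns]
    S = [[0.0] * q for _ in range(q)]
    for i in range(q):
        for j in range(q):
            # Labels that never occur keep a similarity of 0 (an assumption).
            if norms[i] > 0 and norms[j] > 0:
                dot = sum(a * b for a, b in zip(columns[i], columns[j]))
                S[i][j] = dot / (norms[i] * norms[j])
    return S


S = similarity_matrix([[1, 1, 0], [1, 0, 0], [1, 1, 1], [1, 0, 0]])
```

The resulting matrix is symmetric, and every label that occurs at least once has similarity 1 with itself.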

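Preferably, the label information matrix of the samples is reconstructed through the label-label similarity matrix as Y = g(X), where g(X) recomputes each sample's label vector from the label similarities: a label whose recomputed vector value equals or exceeds 0.5 is set to 1, and otherwise its value remains 0.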

Preferably, calculating the occurrence probability of the label information of the target sample according to the second label vector includes:

according to the second label vector, using the lazy multi-label classification algorithm to calculate the occurrence probability of each piece of label information in the target sample.

The present invention also provides a diagnostic information processing and analysis system, the system comprising:

a module for establishing a feature space and a label space of multi-label information from multiple sample features and the label information of multiple samples;

a module for establishing a feature vector and a first label vector for each sample according to each sample feature and each piece of label information;

a module for calculating the number of occurrences of each piece of label information;

a module for calculating the number of times every two pieces of label information appear in the same sample at the same time;

a module for calculating the similarity of every two pieces of label information and establishing a label-label similarity matrix;

a module for reconstructing the label information matrix of the samples through the label-label similarity matrix, reconstructing the first label vector of each sample to obtain a second label vector, and a module for calculating the occurrence probability of the label information of the target sample according to the second label vector;

a module for sorting the occurrence probabilities of the target sample's label information in descending order;

a module for selecting a preset number of pieces of label information as the recommended label information of the target sample.

Preferably, in the system, the module for calculating the similarity of every two pieces of label information and establishing the label-label similarity matrix is a cosine similarity calculation module.

The technical solution of the present invention uses the features corresponding to each patient's diseases to form the feature space of a multi-label learning algorithm and uses the diseases themselves as its label information. By calculating and analyzing the label information, it obtains the number of times every two pieces of label information occur together in one patient, uses the label similarity matrix to reconstruct the label matrix and thereby update the label vector of each patient, and then applies the multi-label learning algorithm to predict possible diseases for the target patient. In this way the association between a disease and its complications can be used effectively to discover more potential diseases and improve the accuracy of diagnostic decision support.

Description of Drawings

FIG. 1 is a schematic flowchart of the method for processing and analyzing diagnostic information according to the present invention.

Detailed Description

The technical solutions in the embodiments of the present invention are described clearly and completely below with reference to the accompanying drawings. The described embodiments are only some of the embodiments of the present invention, not all of them. The components of the embodiments described and illustrated in the drawings may be arranged and designed in a variety of different configurations. The following detailed description of the embodiments provided in the drawings is therefore not intended to limit the scope of the claimed invention, but merely represents selected embodiments. All other embodiments obtained by those skilled in the art on the basis of these embodiments without creative work fall within the protection scope of the present invention.

As shown in FIG. 1, the present invention provides a method for processing and analyzing diagnostic information, comprising the following steps:

establishing a feature space and a label space of multi-label information from multiple sample features and the label information of multiple samples, and establishing a feature vector and a first label vector for each sample according to each sample feature and each piece of label information; calculating the number of occurrences of each piece of label information; calculating the number of times every two pieces of label information appear in the same sample at the same time; calculating the similarity of every two pieces of label information and establishing a label-label similarity matrix; reconstructing the first label vector of the sample through the label-label similarity matrix to obtain a second label vector, and calculating the occurrence probability of the label information of the target sample according to the second label vector; sorting the occurrence probabilities of the target sample's label information in descending order, and selecting a preset number of pieces of label information as the recommended label information of the target sample.

The principle of the present invention is as follows. A complication is another disease or symptom caused during the development of a disease. When a doctor has confirmed one disease in a patient, the doctor considers whether the patient also suffers from complications of that disease, so that more of the patient's diseases can be found. At the same time, the number of times a disease occurs together with other diseases reflects its relationship to them. The co-occurrence frequency of every pair of diseases can therefore be used to reflect the similarity between diseases, and the calculated disease similarity, combined with the lazy multi-label classification method, is used to recommend the diseases that the patient may have.

In a specific embodiment, the patients' diagnosis records are taken as a multi-label data set (basic multiple labels data set, denoted BMLDS). The feature set and the label set of the samples are separated from the BMLDS and denoted F and L, respectively, and the BMLDS is divided into a training set and a test set at a ratio of 7:3. The method of the present invention includes:
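A 7:3 split of this kind might be sketched as below. The `split_dataset` helper, the fixed seed, and the use of integer indices in place of patient records are assumptions for illustration; the source only states the ratio.

```python
import random


def split_dataset(samples, train_ratio=0.7, seed=0):
    """Shuffle a copy of the samples and split it into training and test parts."""
    shuffled = list(samples)
    # A seeded generator keeps the split reproducible (the seed is arbitrary).
    random.Random(seed).shuffle(shuffled)
    cut = round(len(shuffled) * train_ratio)
    return shuffled[:cut], shuffled[cut:]


# Integer indices stand in for the 459 patient samples of the embodiment.
train, test = split_dataset(range(459))
```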

S1. Establish the feature space and the label space of multi-label information from multiple sample features and the label information of multiple samples: each sample has a b-dimensional feature vector X_i and a corresponding label vector Y_i; the entry of Y_i for label l_j is 1 if X_i has the label l_j, and 0 otherwise.

S2. Calculate the number of occurrences of each piece of label information: determine, from the occurrences of the label information in each sample, the samples that correspond to that label information, and record the result as r_ij; if the feature vector X_k has the label l_i, then r_ij = 1; otherwise r_ij = 0.

S3. Calculate the similarity of every two labels and establish the label-label similarity matrix:

Three similarity measures can be used: cosine similarity, the Pearson correlation coefficient, and the Jaccard similarity coefficient.

Specifically, for the cosine similarity sim(l_i, l_j) of every two labels, P_ij is the set of samples that contain both label l_i and label l_j; the numerator sums, over the samples X_k in P_ij, the number of times label l_i and label l_j appear together in sample X_k; and the denominator is formed from the total number of occurrences of label l_i and the total number of occurrences of label l_j.

For the Pearson correlation coefficient, I and J are the sample vectors corresponding to label l_i and label l_j, respectively; the coefficient is the cosine of the angle between the two vectors after each of I and J has been standardized against its own population.

For the Jaccard similarity coefficient, I and J are likewise the sample vectors corresponding to label l_i and label l_j, respectively.
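The two alternative measures can be sketched with their textbook definitions, applied to 0/1 sample vectors I and J. The source does not reproduce its own formulas, so the function names and the handling of degenerate vectors are assumptions.

```python
import math


def pearson(I, J):
    """Pearson correlation: cosine of the angle between the mean-centered vectors."""
    mean_i = sum(I) / len(I)
    mean_j = sum(J) / len(J)
    ci = [a - mean_i for a in I]
    cj = [b - mean_j for b in J]
    denom = math.sqrt(sum(a * a for a in ci)) * math.sqrt(sum(b * b for b in cj))
    if denom == 0:
        return 0.0  # assumption: a constant vector is treated as uncorrelated
    return sum(a * b for a, b in zip(ci, cj)) / denom


def jaccard(I, J):
    """Jaccard coefficient: samples with both labels over samples with either."""
    both = sum(1 for a, b in zip(I, J) if a == 1 and b == 1)
    either = sum(1 for a, b in zip(I, J) if a == 1 or b == 1)
    return both / either if either else 0.0
```

Unlike cosine and Jaccard, the Pearson coefficient can be negative, which would need to be taken into account if it were used in a thresholded reconstruction step.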

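S4. Calculate the similarity of every two pieces of label information and establish the label-label similarity matrix:

The label-label similarity matrix is S = [S_ij], a q × q matrix whose element S_ij = sim(l_i, l_j) is the similarity between label l_i and label l_j.

S5. Reconstruct the label information matrix of the samples through the label-label similarity matrix:

Y = g(X), where g(X) recomputes each sample's label vector from the label similarities: a label whose recomputed vector value equals or exceeds 0.5 is set to 1, and otherwise its value remains 0.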

S6. Calculate the occurrence probability of the label information of the target sample according to the second label vector:

According to the second label vector, the lazy multi-label classification algorithm is used to calculate the occurrence probability of each piece of label information in the target sample. Specifically, the probability of each piece of label information appearing in the target sample lies in the range [0, 1].

The classification function of the lazy multi-label classification algorithm works as follows:

First, for each sample, count the number of occurrences of each label among its k nearest neighbor (kNN) samples, and use maximum a posteriori estimation to estimate the labels that may appear in an unlabeled sample. For a sample space X containing m samples, the label space is denoted L. One event denotes that the i-th label takes the value b, where b is 0 or 1; b = 0 means the label does not appear and b = 1 means the label appears. A second event denotes that exactly j of the k nearest neighbors have the label l_i. Whether the label l_i appears in sample x is determined by comparing the values of the classification function for b = 1 and b = 0.

The probability of the first event is calculated by a formula in which y_j(l_i) indicates whether sample j has the label l_i and s ∈ (0, 1). The probability for b = 0 then follows as the complement of the probability for b = 1.

Calculating the conditional probability of the second event requires, while traversing the training sample set, counting for each sample how many of its k nearest neighbors contain the label l_i. The array C[j] holds these counts for samples whose label l_i takes the value 1, and the array C'[j] holds them for samples whose label l_i takes the value 0. p denotes the number of labels. The conditional probability is then calculated from these counts.
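Since the formulas themselves are not reproduced in this text, the sketch below assumes the smoothed estimates of the standard published ML-kNN algorithm. The function names, the default smoothing factor s = 1, and the interpretation of the count arrays follow that standard form and are assumptions, not the patent's wording.

```python
def prior_probability(label_column, s=1.0):
    """Smoothed prior that a label is present: (s + count) / (2s + m)."""
    m = len(label_column)
    return (s + sum(label_column)) / (2 * s + m)


def posterior_probabilities(c, s=1.0):
    """Smoothed P(exactly j neighbours carry the label | label state).

    c[j] is the number of training samples (in one label state) whose
    k nearest neighbours contain exactly j samples with the label, so
    len(c) == k + 1.
    """
    k = len(c) - 1
    total = sum(c)
    return [(s + c[j]) / (s * (k + 1) + total) for j in range(k + 1)]


def label_probability(prior1, post1, post0, j):
    """Probability that the label is present, given j positive neighbours."""
    on = prior1 * post1[j]
    off = (1.0 - prior1) * post0[j]
    return on / (on + off)
```

Here `post1` would be built from the counts for samples that have the label and `post0` from the counts for samples that do not; normalizing the two products yields a value in [0, 1], matching the range stated above.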

S7. Sort the occurrence probabilities of the target sample's label information in descending order, and select a preset number of pieces of label information as the recommended label information of the target sample.

The labels are ranked from the largest to the smallest probability of appearing in the target sample, and the top N labels in the ranking are selected as the recommended labels of the target sample. When a patient is being treated, the N possible complication labels with the highest probabilities are recommended; the specific value of N can be preset for different diseases.
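The ranking and selection step can be sketched as follows. The probability values are made up for illustration, and `recommend_top_n` is a hypothetical helper name.

```python
def recommend_top_n(probabilities, n=2):
    """Return the n labels with the highest probabilities, in descending order."""
    ranked = sorted(probabilities.items(), key=lambda item: item[1], reverse=True)
    return [label for label, _ in ranked[:n]]


# Made-up probabilities for three of the diseases named in the embodiment.
probs = {"type_2_diabetes": 0.81, "fatty_liver": 0.35, "cerebral_infarction": 0.62}
top = recommend_top_n(probs, n=2)
```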

In a specific embodiment, the patients are the samples of the multi-label learning algorithm, the features corresponding to each patient's diseases constitute its feature space, and the diseases of all patients serve as its label information. Analyzing the label information yields the number of co-occurrences of every two pieces of label information; the more often two different pieces of label information appear in the same sample, the stronger their association. The similarity between every two pieces of label information is calculated with cosine similarity, and the label vector of each sample is recalculated according to the label similarities: if the vector value corresponding to a label equals or exceeds 0.5, the value for that label is reset to 1; otherwise it remains 0. Reconstructing the label vectors in this way uncovers potential labels of the training samples and associates the labels of the training samples. The label similarity matrix is used to reconstruct the label matrix in the label space and so update the label vector of each sample, and finally the multi-label learning algorithm is used to predict possible labels for the target sample.
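The thresholding rule can be illustrated as below. Only the 0.5 threshold comes from the text; how the similarities are aggregated into a per-label value is not specified in this source, so taking the largest similarity to any label the sample already has is purely an assumption chosen for illustration, as are the helper name and the toy matrix.

```python
def reconstruct_label_vector(y, S, threshold=0.5):
    """Return a second label vector from the first one and a similarity matrix S.

    Assumption: the value for label j is the largest similarity between j and
    any label already present in y; the source does not specify this step.
    """
    present = [i for i, v in enumerate(y) if v == 1]
    rebuilt = []
    for j in range(len(y)):
        score = max((S[i][j] for i in present), default=0.0)
        rebuilt.append(1 if score >= threshold else 0)
    return rebuilt


# Toy similarity matrix with 1.0 on the diagonal (values are illustrative).
S = [[1.0, 0.7, 0.2], [0.7, 1.0, 0.1], [0.2, 0.1, 1.0]]
second = reconstruct_label_vector([1, 0, 0], S)
```

With a unit diagonal, labels that were already present stay at 1, and a strongly associated label (similarity 0.7 here) is added as a potential label.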

In a specific embodiment, 9 common diseases are selected: type 2 diabetes, hyperlipidemia, fatty liver, hyperkalemia, hypoalbuminemia, diabetic nephropathy, cerebral infarction, coronary heart disease, and osteoporosis. Patients with one or more of these diseases are selected from a hospital as experimental samples, and their laboratory reports and basic information are collected as sample features, giving 459 patient samples with 5 kinds of basic patient information and 278 kinds of test items. Gender, age, body temperature, height, and weight are extracted from the patient samples as the basic attributes of a patient. Gender is binary, for example 0 for male and 1 for female. Age, body temperature, height, and weight are numerical and keep their actual values. For a test item whose result is numerical, the value is set to 1 if the result lies within the normal reference range, and to the actual result if it lies outside the normal range. For a test item whose result is a text description, the different text descriptions of that item are collected and arranged in an array; if the item's text description equals the reference value, the value is 0, and otherwise the value is set to the position of that text description in the array.
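The encoding rules for test items might look like the following sketch. The function names, the inclusive range bounds, the 1-based array position, and the example values are all assumptions; the source does not state how boundaries or positions are counted.

```python
def encode_numeric(value, low, high):
    """Return 1 if the value is inside the reference range, else the raw value.

    Assumption: the reference range is inclusive at both ends.
    """
    return 1 if low <= value <= high else value


def encode_text(value, reference, observed_values):
    """Return 0 for the reference description, else the value's array position.

    Assumption: positions are 1-based so that they cannot collide with the
    0 reserved for the reference value; the source does not state the base.
    """
    if value == reference:
        return 0
    return observed_values.index(value) + 1
```

For instance, under these assumptions a numerical result of 9.4 against a hypothetical reference range of 3.9 to 6.1 keeps its raw value, while a result of 5.0 is encoded as 1.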

Tables 1 and 2 list the statistics of the features and of the diseases, respectively. Overall, 60% of the patients are male and 40% are female. The mean age, body temperature, height, and weight of the patients are 64.64, 36.5, 167.81, and 67.75, respectively. The disease statistics show that, among the 9 diseases, type 2 diabetes and cerebral infarction are the two most common. These diseases are in fact also the most common diseases among the elderly. 70% of the patients are selected at random as training samples, and the remaining 30% are used as test samples.

Table 1

Table 2

Evaluation criteria for multi-label classification problems fall into two categories: (a) ranking-based criteria, whose goal is to rank relevant examples ahead of irrelevant ones; and (b) binary prediction criteria, whose goal is a strict yes/no classification for each target sample. Hamming Loss, precision, recall, and F1-score are used to evaluate the effect of the present invention.

Hamming Loss evaluates the average difference between the recommended labels of a test sample and its actual labels. In its definition, h(x_i) is the set of labels recommended for test sample x_i; p is the number of test samples; Y_i is the actual label set of test sample x_i; and Δ denotes the symmetric difference.
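A possible implementation of this metric is sketched below. It assumes the commonly used form in which the symmetric difference is also divided by the number of labels q; the source text does not show whether that normalization is applied, and the function name is hypothetical.

```python
def hamming_loss(predicted_sets, actual_sets, q):
    """Average size of the symmetric difference, normalised by q labels.

    predicted_sets[i] plays the role of h(x_i) and actual_sets[i] of Y_i.
    """
    p = len(predicted_sets)
    # For Python sets, ^ is the symmetric difference operator.
    total = sum(len(h ^ y) for h, y in zip(predicted_sets, actual_sets))
    return total / (p * q)
```

For example, with predictions [{"a", "b"}, {"c"}], actual sets [{"a"}, {"c", "d"}] and q = 4, each sample differs in one label, giving 2 / (2 × 4) = 0.25.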

Precision is defined as the ratio of the number of hit labels in the label recommendation list to the total number of labels in the recommendation list; that is, it represents the probability that a label recommended for a test sample is correct:

Precision = (number of hit labels) / (number of recommended labels)

Recall is defined as the ratio of the number of hit labels to the number of true labels of the test sample; in other words, it represents the probability that a true label is recommended:

Recall = (number of hit labels) / (number of true labels of the test sample)

F1-score takes both precision and recall into account:

F1 = 2 × Precision × Recall / (Precision + Recall)
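These three measures can be computed per test sample as in the sketch below. The function name and the convention of returning 0 for empty inputs are assumptions; averaging over samples and folds is left out.

```python
def precision_recall_f1(recommended, actual):
    """Score one recommendation list against the sample's true label set."""
    hits = len(set(recommended) & set(actual))
    precision = hits / len(recommended) if recommended else 0.0
    recall = hits / len(actual) if actual else 0.0
    if precision + recall == 0:
        return precision, recall, 0.0
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1
```

With 2 recommended labels of which 1 is among 3 true labels, precision is 1/2, recall is 1/3, and F1 is 0.4.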

The effect of the method is analyzed by comparing it with two classical multi-label classification methods: a lazy multi-label classification method (ML-kNN) and a multi-label classification method that combines the BR method with the kNN method (BR-kNN). Table 3 gives the comparison with these classical algorithms. Based on the number of nearest neighbors at which each method is stable and accurate, the number of neighbors is set to 10 for all methods, and the smoothing factor of the present invention and that of ML-kNN are both set to 1. The number of recommended labels is 2 for all methods. The experiments are run with 10-fold cross-validation, and the final result is the average of the fold results. As Table 3 shows, the present invention has a precision of 0.236, a recall of 0.3793, and an F1-score of 0.2915, all better than the other two methods. Compared with ML-kNN, which ranks second in the experiments, the precision, recall, and F1-score of the present invention are higher by 11%, 13%, and 12%, respectively. The Hamming Loss of the present invention is 0.2117, which is also better than that of the other two methods. The present invention therefore outperforms both of the other methods.

Table 3

↓: the smaller the value, the better the effect. ↑: the larger the value, the better the effect.

Preferably, in the system, the module for calculating the similarity of every two label information and establishing the label-label similarity matrix is a cosine similarity calculation module.
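A cosine similarity module of this kind can be sketched as follows, assuming each sample's label vector is a list of 0/1 values. For binary vectors, the cosine similarity of two labels reduces to their co-occurrence count divided by the square root of the product of their occurrence counts. The function name `label_similarity_matrix` is hypothetical, and giving a similarity of 0 to any label that never occurs is an assumption of this sketch, not something stated in the text.

```python
import math
from typing import List


def label_similarity_matrix(label_vectors: List[List[int]]) -> List[List[float]]:
    """Build a q x q label-label cosine similarity matrix from 0/1 label vectors.

    label_vectors[k][i] is 1 if sample k carries label i, else 0.
    """
    if not label_vectors:
        return []
    q = len(label_vectors[0])

    # Number of occurrences of each label over all samples.
    counts = [sum(row[i] for row in label_vectors) for i in range(q)]

    similarity = [[0.0] * q for _ in range(q)]
    for i in range(q):
        for j in range(q):
            if counts[i] == 0 or counts[j] == 0:
                continue  # assumption: an unseen label is similar to nothing
            # Number of samples in which labels i and j appear together.
            co_occurrences = sum(row[i] * row[j] for row in label_vectors)
            similarity[i][j] = co_occurrences / math.sqrt(counts[i] * counts[j])
    return similarity
```

The matrix is symmetric, and every label that occurs at least once has similarity 1 with itself.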

The above are only specific embodiments of the present invention, but the protection scope of the present invention is not limited thereto. Any change or substitution that a person skilled in the art could readily conceive within the technical scope disclosed by the present invention shall fall within the protection scope of the present invention.

Claims

1. A method of diagnostic information processing analysis, comprising the steps of:

establishing a feature space and a label space of multi-label information from the sample features and label information of a plurality of samples, and establishing a feature vector and a first label vector for each sample according to each sample feature and each label information, wherein the sample features are the test report and basic information for the patient's disease;

calculating the number of occurrences of each label information; calculating the number of times that every two label information appear simultaneously in one sample; calculating the similarity of every two label information; and establishing a label-label similarity matrix;

reconstructing the first label vector of the sample through the label-label similarity matrix to obtain a second label vector, and calculating the occurrence probability of the label information of the target sample according to the second label vector;

sorting the occurrence probabilities of the label information of the target sample in descending order, and selecting a preset number of label information as the recommended label information of the target sample;

wherein establishing the feature space and the label space of the multi-label information from the sample features and label information of the plurality of samples further comprises:

presetting $F = \{f_1, f_2, \ldots, f_b\}$ as the $b$-dimensional feature space of the multi-label information, and presetting $L = \{l_1, l_2, \ldots, l_q\}$ as the $q$-dimensional label space of the multi-label information;

presetting $T = \{(X_1, Y_1), (X_2, Y_2), \ldots, (X_n, Y_n)\}$ as the set of multi-label information, and presetting $X_i = (x_{i1}, x_{i2}, \ldots, x_{ib})$ as the $b$-dimensional feature vector of a sample; then $Y_i = (y_{i1}, y_{i2}, \ldots, y_{iq})$ is the label vector corresponding to sample $X_i$; if the feature vector $X_i$ has label $l_j$, then the label vector element $y_{ij} = 1$; otherwise $y_{ij} = 0$.

2. The method of claim 1, wherein calculating the number of occurrences of each label information comprises: determining, according to the occurrence of the label information in each sample, the samples corresponding to the label information, and recording the value as $r_{ki}$; if the feature vector $X_k$ has label $l_i$, then $r_{ki} = 1$; otherwise $r_{ki} = 0$.

3. The method of claim 2, wherein calculating the similarity of every two label information and establishing the label-label similarity matrix further comprises:

calculating the similarity of every two labels using a cosine similarity calculation method.

4. The method of claim 3, wherein calculating the similarity of every two labels using the cosine similarity calculation method comprises:

calculating the similarity of every two labels by the formula

$$\mathrm{sim}(l_i, l_j) = \frac{\sum_{k \in P_{ij}} r_{ki}\, r_{kj}}{\sqrt{\sum_{k=1}^{n} r_{ki}^{2}}\,\sqrt{\sum_{k=1}^{n} r_{kj}^{2}}}$$

wherein $P_{ij}$ is the set of samples containing both label $l_i$ and label $l_j$;

$\sum_{k \in P_{ij}} r_{ki}\, r_{kj}$ is the number of times label $l_i$ and label $l_j$ appear simultaneously in a sample $X_k$; and

$\sum_{k=1}^{n} r_{ki}^{2}$ and $\sum_{k=1}^{n} r_{kj}^{2}$ are respectively the total number of occurrences of label $l_i$ and the total number of occurrences of label $l_j$.

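5. The method of claim 4, wherein calculating the similarity of every two label information and establishing the label-label similarity matrix comprises:

the label-label similarity matrix is

$$S = \begin{pmatrix} S_{11} & S_{12} & \cdots & S_{1q} \\ S_{21} & S_{22} & \cdots & S_{2q} \\ \vdots & \vdots & \ddots & \vdots \\ S_{q1} & S_{q2} & \cdots & S_{qq} \end{pmatrix}$$

wherein the element $S_{ij} = \mathrm{sim}(l_i, l_j)$ of the matrix represents the similarity of label $l_i$ and label $l_j$.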

6. The method of claim 5, wherein reconstructing the label information matrix of the sample through the label-label similarity matrix is: $Y = g(X)$,

wherein $g(X)$ is:

7. The method of claim 1, wherein calculating the occurrence probability of the label information of the target sample according to the second label vector comprises:

calculating the occurrence probability of each label information in the target sample by using a lazy multi-label classification algorithm according to the second label vector.

8. A diagnostic information processing and analysis system, the system comprising:

a module for establishing a feature space and a label space of multi-label information according to the sample features and label information of a plurality of samples, wherein the sample features are the test report and basic information for the patient's disease;

a module for establishing a feature vector and a first label vector of each sample according to each sample feature and each label information;

a module for calculating the number of occurrences of each of the tag information;

a module for counting the number of times that every two tag information simultaneously appear in one sample;

a module for calculating the similarity of every two label information and establishing a label-label similarity matrix;

a module for reconstructing the label information matrix of the sample through the label-label similarity matrix, reconstructing the first label vector of the sample to obtain a second label vector, and calculating the occurrence probability of the label information of a target sample according to the second label vector;

a module for sorting the occurrence probabilities of the label information of the target sample in descending order;

a module for selecting a preset number of label information as the recommended label information of the target sample;

the establishing of the feature space and the label space of the multi-label information according to the plurality of sample features and the label information of the plurality of samples further comprises:

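presetting $F = \{f_1, f_2, \ldots, f_b\}$ as the $b$-dimensional feature space of the multi-label information, and presetting $L = \{l_1, l_2, \ldots, l_q\}$ as the $q$-dimensional label space of the multi-label information;

presetting $T = \{(X_1, Y_1), (X_2, Y_2), \ldots, (X_n, Y_n)\}$ as the set of multi-label information, and presetting $X_i = (x_{i1}, x_{i2}, \ldots, x_{ib})$ as the $b$-dimensional feature vector of a sample; then $Y_i = (y_{i1}, y_{i2}, \ldots, y_{iq})$ is the label vector corresponding to sample $X_i$; if the feature vector $X_i$ has label $l_j$, then the label vector element $y_{ij} = 1$; otherwise $y_{ij} = 0$.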

9. The system of claim 8, wherein the module for calculating the similarity of every two label information and establishing the label-label similarity matrix is a cosine similarity calculation module.