CN115206538A

CN115206538A - Perioperative patient sample data set balancing method and sample data set acquisition system

Info

Publication number: CN115206538A
Application number: CN202210760514.1A
Authority: CN
Inventors: 卢莉; 王琳娜; 朱涛; 郝学超; 桑永胜
Original assignee: Sichuan University
Current assignee: Sichuan University
Priority date: 2022-06-30
Filing date: 2022-06-30
Publication date: 2022-10-18

Abstract

The invention provides a method for balancing a perioperative patient sample data set and a sample data set acquisition system. The balancing method includes: S1, oversampling the minority-label samples in the perioperative patient sample data set to obtain synthetic samples and generating a corresponding synthetic label set for the synthetic samples, where the sample data set includes a plurality of samples and the classification label sets corresponding to the samples; S2, adding the synthetic samples and synthetic label sets to the sample data set to obtain a temporary sample data set; S3, cleaning the samples in the temporary sample data set to obtain a balanced sample data set. Oversampling increases the number of minority-label samples and balances them against the majority-label samples; cleaning the noise samples improves sample quality in the output balanced sample data set; and using the balanced sample data set for subsequent classification can improve the performance of the classification model.

Description

A method for balancing the sample data set of perioperative patients and a sample data set acquisition system

Technical field

The invention relates to the field of computer technology, and in particular to a method for balancing a perioperative patient sample data set and a sample data set acquisition system.

Background

The perioperative period is the whole process surrounding an operation, from the patient's decision to undergo surgical treatment, through the operation itself, to basic recovery. It covers a period before, during and after the operation; specifically, it runs from the time surgical treatment is decided on until the treatment related to that operation is basically finished, roughly from 5-7 days before surgery to 7-12 days after surgery.

According to the World Health Statistics 2021 report released by the World Health Organization (WHO), global life expectancy has increased to 73.3 years, and the global elderly population is expected to exceed 1.5 billion by 2050. The growing elderly population around the world has been identified as the main population in the surgical market, and risk event prediction for elderly patients has become an active research direction. Postoperative risk prediction for elderly surgical patients helps doctors formulate diagnosis and treatment plans and allocate treatment resources rationally, thereby reducing the probability of postoperative risk events. At present, some diagnostic tools can help hospitals provide comprehensive and reliable treatment for high-risk patients. For example, Chinese patents with publication numbers CN111009322A and CN114038565A disclose perioperative risk assessment using prediction models built on patient perioperative data sets. However, patient perioperative data sets often suffer from label imbalance, which directly affects the performance of perioperative prediction models.

Summary of the invention

The invention aims to solve the technical problems existing in the prior art, and provides a method for balancing a perioperative patient sample data set and a sample data set acquisition system.

To achieve the above object, according to a first aspect of the present invention, the invention provides a method for balancing a perioperative patient sample data set, comprising: step S1, using the MLSMOTE algorithm to oversample the minority-label samples in the sample data set of perioperative patients to obtain synthetic samples, and generating a corresponding synthetic label set for the synthetic samples, where the sample data set includes a plurality of samples and the classification label sets corresponding to the samples; step S2, adding the synthetic samples to the sample data set to obtain a temporary sample data set; step S3, cleaning the samples in the temporary sample data set to obtain a balanced sample data set.

The above technical solution oversamples the minority-label samples in the sample data set to increase their number and balance the majority-label and minority-label samples. In addition, noise samples produced while generating minority-label samples are cleaned out of the full set of samples, which improves sample quality in the output balanced sample data set and effectively augments the data. When the balanced sample data set is used for subsequent classification, the performance of the classification model can be improved.

To achieve the above object, according to a second aspect of the present invention, the invention provides a sample data set balancing device for perioperative patients, including: a sample synthesis module, which uses the MLSMOTE algorithm to oversample the minority-label samples in the sample data set of perioperative patients to obtain synthetic samples and generates a corresponding synthetic label set for the synthetic samples, where the sample data set includes a plurality of samples and the classification label sets corresponding to the samples; a temporary sample data set acquisition module, which adds the synthetic samples to the sample data set to obtain a temporary sample data set; and a cleaning module, which cleans the samples in the temporary sample data set to obtain a balanced sample data set.

The above technical solution uses the MLSMOTE algorithm to oversample the minority-label samples in the sample data set, increasing their number and balancing the majority-label and minority-label samples. In addition, noise samples that MLSMOTE produces while generating minority-label samples are cleaned out of the full set of samples, which improves sample quality in the output balanced sample data set and effectively augments the data. When the balanced sample data set is used for subsequent classification, the performance of the classification model can be improved.

To achieve the above object, according to a third aspect of the present invention, the invention provides a perioperative patient sample data set acquisition system, comprising: a data acquisition module for acquiring the original perioperative characteristic data and case records of a plurality of patients; a classification label set acquisition module, which obtains a classification label set from the case records, where classification labels represent perioperative patient risk events; a classification label association module for associating each patient's original perioperative characteristic data with at least one classification label in the classification label set; a perioperative patient data dimension reduction device, which reduces the dimension of the original perioperative characteristic data of all patients to obtain the corresponding perioperative characteristic data; and a sample data set acquisition module, which takes each patient's perioperative characteristic data as a sample and associates it with the classification label set of the corresponding original perioperative characteristic data to obtain the sample data set of perioperative patients. The system further includes the sample data set balancing device for perioperative patients according to the second aspect of the present invention, which balances the sample data set.

The above technical solution constructs a multi-label sample data set of perioperative patients. The perioperative patient data dimension reduction device gives the samples in the data set a low feature dimension while retaining the features that most influence subsequent classification, which speeds up subsequent classification and model training. The sample data set balancing device increases the number of minority-label samples to balance the majority-label and minority-label samples, and cleans the noise samples that MLSMOTE produces while generating minority-label samples out of the full set of samples. This improves sample quality in the output balanced sample data set and effectively augments the data; when the balanced sample data set is used for subsequent classification, the performance of the classification model can be improved.

Description of the drawings

FIG. 1 is a schematic structural diagram of the perioperative patient data dimension reduction device in Embodiment 1 of the present invention;

FIG. 2 is a schematic structural diagram of the perioperative patient sample data set acquisition system in Embodiment 2 of the present invention;

FIG. 3 is a schematic flowchart of the sample data set balancing method in Embodiment 3 of the present invention;

FIG. 4 is a schematic structural diagram of the sample data set balancing device in Embodiment 4 of the present invention;

FIG. 5 is a schematic structural diagram of the sample data set acquisition system in Embodiment 5 of the present invention;

FIG. 6 is a schematic flowchart of the multi-label classification method for perioperative patient data in Embodiment 6 of the present invention;

FIG. 7 is a schematic structural diagram of the classification model in Embodiment 6;

FIG. 8 is a schematic diagram of a preferred flow of the multi-label classification method for perioperative patient data in Embodiment 6;

FIG. 9 is a schematic structural diagram of the multi-label classification device for perioperative patient data in Embodiment 7 of the present invention;

FIG. 10 is a schematic structural diagram of the perioperative patient risk event prediction system in Embodiment 8 of the present invention.

Detailed description

Embodiments of the present invention are described in detail below, and examples of the embodiments are illustrated in the accompanying drawings, in which the same or similar reference numerals refer throughout to the same or similar elements or to elements having the same or similar functions. The embodiments described below with reference to the accompanying drawings are exemplary, are intended only to explain the present invention, and should not be construed as limiting it.

In the description of the present invention, it should be understood that orientations or positional relationships indicated by terms such as "longitudinal", "transverse", "upper", "lower", "front", "rear", "left", "right", "vertical", "horizontal", "top", "bottom", "inside" and "outside" are based on the orientations or positional relationships shown in the accompanying drawings. They are used only for convenience and simplicity of description and do not indicate or imply that the referenced device or element must have a particular orientation or be constructed and operated in a particular orientation; they therefore should not be construed as limiting the invention.

In the description of the present invention, unless otherwise specified and limited, the terms "installed", "coupled" and "connected" should be understood broadly. For example, a connection may be mechanical or electrical, may be an internal communication between two elements, and may be direct or indirect through an intermediate medium. Those of ordinary skill in the art can understand the specific meanings of these terms according to the specific situation.

Embodiment 1

This embodiment discloses a perioperative patient data dimension reduction device. As shown in FIG. 1, the device includes:

an input module, which obtains a patient's original perioperative characteristic data containing multi-dimensional features, together with the classification labels corresponding to the original perioperative characteristic data;

a first dimension reduction module, which reduces the dimension of the original perioperative characteristic data with a principal component analysis algorithm to obtain first perioperative characteristic data;

a second dimension reduction module, which reduces the dimension of the first perioperative characteristic data with a genetic algorithm to obtain the perioperative characteristic data;

an output module, which outputs the perioperative characteristic data.

In this embodiment, to better reflect the patient's perioperative state, improve the accuracy of subsequent classification, and avoid the difficulty of collecting and managing postoperative patient data, the original perioperative characteristic data preferably includes the patient's preoperative and intraoperative indicator data, such as preoperative blood pressure, heart rate and blood lipids, and intraoperative heart rate, blood pressure, blood loss and operation duration. Some existing classification prediction models include only the preoperative baseline condition of surgical patients and do not consider what happens during the operation. However, numerous studies have confirmed that intraoperative indicators such as heart rate, blood pressure, blood loss and operation time are all related to the patient's postoperative condition. The original perioperative characteristic data provided in this embodiment can therefore improve the accuracy with which subsequent models predict postoperative events, without relying on postoperative patient indicator data.

In this embodiment, classification labels represent perioperative patient risk events, which preferably include, but are not limited to, unplanned readmission and death.

In this embodiment, to enrich the data, the indicator data includes categorical data and numerical data. Categorical data expresses an indicator as a category; for example, intraoperative blood loss can be expressed as high, medium or low. Numerical data expresses an indicator as a number, such as a blood pressure value.

In this embodiment, the original perioperative characteristic data may come from patients whose perioperative risk events are known, in which case the known risk events are used as the classification labels associated with the original perioperative characteristic data. The original perioperative characteristic data may also come from patients whose perioperative risk events are unknown, in which case experts assign the corresponding classification labels. The original perioperative characteristic data may correspond to one, two or more classification labels.

In this embodiment, after processing by the principal component analysis algorithm, the feature dimension of the first perioperative characteristic data is smaller than that of the original perioperative characteristic data, and the initial population of the genetic algorithm is constructed from the first perioperative characteristic data.
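The first reduction step can be illustrated with a minimal, dependency-free sketch. The function name `pca_reduce` is hypothetical, and power iteration with deflation is swapped in here for a full eigendecomposition purely to keep the example short; the source does not say how the principal components are computed.

```python
import math

def pca_reduce(rows, n_components, n_iter=200):
    """Project rows onto their leading principal components.

    Sketch only: uses power iteration with deflation, and a fixed start
    vector that could in rare cases be orthogonal to a component.
    """
    n, d = len(rows), len(rows[0])
    means = [sum(r[j] for r in rows) / n for j in range(d)]
    centered = [[r[j] - means[j] for j in range(d)] for r in rows]
    # Sample covariance matrix (d x d).
    cov = [[sum(c[i] * c[j] for c in centered) / (n - 1) for j in range(d)]
           for i in range(d)]
    components = []
    for _ in range(n_components):
        v = [1.0 / math.sqrt(d)] * d
        for _ in range(n_iter):
            w = [sum(cov[i][j] * v[j] for j in range(d)) for i in range(d)]
            norm = math.sqrt(sum(x * x for x in w))
            if norm < 1e-12:          # no variance left in any direction
                break
            v = [x / norm for x in w]
        # Rayleigh quotient gives the eigenvalue for the unit vector v.
        eigval = sum(v[i] * sum(cov[i][j] * v[j] for j in range(d))
                     for i in range(d))
        components.append(v)
        # Deflation: remove the direction just found from the covariance.
        cov = [[cov[i][j] - eigval * v[i] * v[j] for j in range(d)]
               for i in range(d)]
    projected = [[sum(c[j] * comp[j] for j in range(d)) for comp in components]
                 for c in centered]
    return projected, components

# Three original features reduced to one.
reduced, comps = pca_reduce(
    [[1.0, 2.0, 0.0], [2.0, 4.0, 0.0], [3.0, 6.0, 0.0], [4.0, 8.0, 0.0]], 1)
```

The reduced rows would then serve as the first perioperative characteristic data from which the genetic algorithm builds its initial population.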

In this embodiment, to further reduce the dimension of the first perioperative characteristic data with the genetic algorithm, the second dimension reduction module preferably includes:

an initial population setting unit, which sets individuals based on the first perioperative characteristic data. The genes of an individual are features of the first perioperative characteristic data, and the number of genes of an individual is less than or equal to the total number of features in the first perioperative characteristic data; subject to that condition, the number of genes of each individual may be set randomly. A plurality of individuals form the initial population;

an evolutionary iteration unit, which repeats the following process until a termination condition is reached and then outputs the individual with the highest fitness at that point: obtain the fitness of each individual in the current generation; select some individuals from the current generation, based on their fitness, as individuals of the next generation; perform crossover and mutation operations on the individuals of the next generation.

In this embodiment, the termination condition is preferably, but not limited to, one of the following: the number of evolutionary iterations reaches a preset maximum; the maximum individual fitness no longer increases across iterations; or the increase in the maximum individual fitness across iterations falls below an increase threshold. In each iteration, the individuals of the current generation are sorted by fitness from high to low, and the top-ranked individuals are selected as individuals of the next generation. The crossover operation mainly exchanges the genes at the same loci of paired parents; the offspring obtained after the exchange become individuals of the next generation.

In this embodiment, so that the dimension-reduced perioperative characteristic data performs better in subsequent classification and improves classification accuracy, the fitness of an individual is preferably obtained as follows: obtain the original perioperative characteristic data and corresponding classification labels of a plurality of patients; reduce the dimension of each patient's original perioperative characteristic data according to the individual's feature information, obtaining a plurality of dimension-reduced samples whose features match those of the individual; divide the dimension-reduced samples into a dimension-reduction training set and a dimension-reduction test set; construct a dimension-reduction multilayer perceptron neural network; train the network on the dimension-reduction training set to obtain a dimension-reduction classification prediction model; test the model on the dimension-reduction test set to obtain its accuracy, and take that accuracy as the fitness of the individual.
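The evolutionary loop described above can be sketched as follows. This is an illustration under several assumptions rather than the embodiment's implementation: individuals are encoded as 0/1 masks over the features, selection keeps the top half by fitness, crossover is single-point, only the maximum-generation termination condition is used, and `toy_fitness` is a hypothetical stand-in for the test-set accuracy of the multilayer perceptron.

```python
import random

def evolve_feature_subset(n_features, fitness, pop_size=20, max_gen=30,
                          mutation_rate=0.05, seed=0):
    """Return (best_mask, best_fitness); a mask holds one 0/1 gene per feature."""
    rng = random.Random(seed)

    def random_individual():
        mask = [rng.randint(0, 1) for _ in range(n_features)]
        if not any(mask):                       # keep at least one feature
            mask[rng.randrange(n_features)] = 1
        return tuple(mask)

    population = [random_individual() for _ in range(pop_size)]
    best = max(population, key=fitness)
    for _ in range(max_gen):
        # Selection: rank by fitness and keep the top half.
        ranked = sorted(population, key=fitness, reverse=True)
        parents = ranked[: pop_size // 2]
        children = []
        while len(parents) + len(children) < pop_size:
            a, b = rng.sample(parents, 2)
            point = rng.randrange(1, n_features)    # single-point crossover
            child = list(a[:point] + b[point:])
            for i in range(n_features):             # bit-flip mutation
                if rng.random() < mutation_rate:
                    child[i] = 1 - child[i]
            if not any(child):
                child[rng.randrange(n_features)] = 1
            children.append(tuple(child))
        population = parents + children
        best = max(population + [best], key=fitness)
    return best, fitness(best)

# Hypothetical fitness: features 0, 2 and 5 are "informative";
# every other selected feature costs 0.1.
INFORMATIVE = {0, 2, 5}

def toy_fitness(mask):
    hits = sum(1 for i, g in enumerate(mask) if g and i in INFORMATIVE)
    extras = sum(1 for i, g in enumerate(mask) if g and i not in INFORMATIVE)
    return hits - 0.1 * extras

best_mask, best_score = evolve_feature_subset(8, toy_fitness)
```

In the embodiment, the fitness callable would instead train and test the dimension-reduction multilayer perceptron on the features selected by the individual and return its accuracy.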

Embodiment 2

This embodiment discloses a perioperative patient sample data set acquisition system. As shown in FIG. 2, the system includes: a data acquisition module for acquiring the original perioperative characteristic data and case records of a plurality of patients, where case data is generally text data including the doctor's diagnosis, past medical history, postoperative follow-up records and the like; a classification label set acquisition module, which obtains a classification label set from the case records, where classification labels represent perioperative patient risk events; a classification label association module for associating each patient's original perioperative characteristic data with at least one classification label in the classification label set, so that each patient's original perioperative characteristic data has a corresponding classification label set containing at least one classification label; the perioperative patient data dimension reduction device provided in Embodiment 1, which reduces the dimension of the original perioperative characteristic data of all patients to obtain the corresponding perioperative characteristic data; and a sample data set acquisition module, which takes each patient's perioperative characteristic data as a sample and associates it with the classification label set of the corresponding original perioperative characteristic data to obtain the sample data set of perioperative patients.

In this embodiment, the classification label set acquisition module preferably performs the following: segment the patient case records into words to obtain at least one postoperative event result (a postoperative event result is a perioperative patient risk event); apply a trained CBOW model to the postoperative event results of the plurality of patients to find similar words by analogy, obtaining a plurality of sets of similar postoperative event results; match each set of similar postoperative event results against an event dictionary and look up the classification label in the dictionary that matches the set. The resulting classification labels constitute the classification label set.

In this embodiment, the CBOW Multi-Word Context Model of Word2Vec is trained on a large medical corpus, and the PKUSEG word segmentation tool (which can segment text from multiple domains and includes a separate model for the medical domain) is used to segment the text of the case records in this embodiment into a plurality of postoperative event results. The event dictionary is preferably, but not limited to, the Chinese version of the ICD-11 event dictionary, the unified International Classification of Diseases published by the World Health Organization; the event dictionary contains many classification labels. Whether a set of similar postoperative event results matches an entry of the event dictionary is preferably, but not limited to being, judged by semantic similarity: if the semantic similarity of the two is greater than a preset similarity threshold, they are considered to match; otherwise they do not match.
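The threshold test on semantic similarity might look like the sketch below. It assumes that each postoperative event result and each dictionary label has already been mapped to an embedding vector (for example by the trained CBOW model) and that cosine similarity is the similarity measure; the source specifies neither. The vectors, the `match_label` helper and the threshold of 0.8 are illustrative.

```python
import math

def cosine_similarity(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0
    return dot / (norm_u * norm_v)

def match_label(event_vector, dictionary_vectors, threshold=0.8):
    """Return the most similar dictionary label, or None when no label's
    similarity exceeds the threshold."""
    best_label, best_score = None, threshold
    for label, vec in dictionary_vectors.items():
        score = cosine_similarity(event_vector, vec)
        if score > best_score:
            best_label, best_score = label, score
    return best_label

# Made-up three-dimensional embeddings for two dictionary labels.
dictionary = {
    "unplanned readmission": [1.0, 0.0, 0.2],
    "death": [0.0, 1.0, 0.1],
}
matched = match_label([0.9, 0.1, 0.2], dictionary)
unmatched = match_label([0.5, 0.5, 0.0], dictionary)
```

Here the first event vector lies close to one dictionary entry and is matched, while the second is roughly equidistant from both and falls below the threshold.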

In this embodiment, to fill in the missing values in the data and improve data quality, the system preferably further includes a missing value imputation device, which imputes the missing values in the patient's original perioperative characteristic data and feeds the imputed original perioperative characteristic data into the perioperative patient data dimension reduction device for dimension reduction. The missing value imputation device preferably, but not exclusively, performs imputation with an existing method such as RandomForestRegressor imputation, MissForest imputation, mean imputation or median imputation.
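The two simplest options listed above, mean and median imputation, can be sketched for a single numerical feature as follows. Representing a missing value as `None` and the helper name `impute_column` are assumptions made for the example.

```python
from statistics import mean, median

def impute_column(values, strategy="mean"):
    """Replace None entries with the mean or median of the observed entries."""
    observed = [v for v in values if v is not None]
    if not observed:
        raise ValueError("column has no observed values")
    fill = mean(observed) if strategy == "mean" else median(observed)
    return [fill if v is None else v for v in values]

# Example: systolic blood pressure with one missing measurement.
filled_mean = impute_column([120.0, None, 130.0, 140.0])
filled_median = impute_column([1.0, None, 2.0, 10.0], strategy="median")
```

The median variant is less sensitive to outlying measurements, which is why both are commonly offered side by side.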

In this embodiment, further preferably, the missing value imputation device imputes the original perioperative characteristic data with a Bayesian Gaussian process latent variable model.

In this embodiment, imputing missing values inevitably introduces uncertainty into the original perioperative characteristic data set. This embodiment uses the Bayesian Gaussian process latent variable model (BGPLVM) to impute the missing values of numerical features, specifically as follows:

First, the probability density $p(y_* \mid Y)$ of the observed test data vector $y_* \in \mathbb{R}^{N \times M}$ is computed approximately, where $N$ is the total number of patient samples and $M$ is the total number of features; the variational distribution of the latent variable associated with the observation $y_*$ is $q(x_*)$. Once the model parameters and latent variables have been learned, BGPLVM can be used to estimate missing values through the predictive density

$$p(y_*^{U} \mid y_*^{O}, Y),$$

where $y_*^{O}$ denotes the observed values in the vector $y_*$ and $y_*^{U}$ denotes the missing values to be predicted. Given a partially observed point $y_*$, this embodiment aims to reconstruct the missing part $y_*^{U}$.

The missing data are imputed by learning a low-dimensional embedding of the observable variables on a small complete data set. The BGPLVM is trained on the complete data set $D$, introducing the latent variables $X$ and a new test latent variable $x_*$. As before, $y_*$ is a row vector holding the measurements of a single patient, $y_*^{O}$ denotes its known observations and $y_*^{U}$ its missing values. The Gaussian distribution of the latent variable $x_*$ corresponding to $y_*$ is obtained by the maximization described next.

The variational distribution $q(x_*)$ is optimized by maximizing the variational lower bound on $\log p(y_*^{O}, Y)$, keeping all optimized quantities other than $q(x_*)$ fixed. To predict the missing values $y_*^{U}$, the invention uses the standard Gaussian process (GP) prediction method while also accounting for the uncertainty of the input $x_*$, since $x_*$ follows the distribution $q(x_*)$. As in standard GP prediction, to predict $y_*^{U}$ the invention first predicts $f_*^{U}$, the latent function values corresponding to $y_*^{U}$.

Marginalizing over $x_*$ yields a non-Gaussian, fully dependent multivariate density; with the squared exponential kernel, however, the moments of the density of $f_*^{U}$ are analytically tractable. The invention uses the mean and covariance of $f_*^{U}$: the mean provides the estimate of the missing values, and the variance quantifies the uncertainty associated with that estimate. With the BGPLVM model, the latent space and model hyperparameters learned on the training set yield, through this distribution, a mean estimate for each feature that contains missing values.
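For orientation, the textbook GP regression equations with a squared exponential kernel are shown below for a single output dimension and a fixed (deterministic) test input $x_*$. The symbols $\sigma_f$, $\ell$, $\sigma_n$, $K$ and $k_*$ are notation introduced for this illustration and do not appear in the source; $y$ stands for one column of the training data and $K$ for the kernel matrix over the training latent points. The embodiment goes one step further and averages such predictions over $q(x_*)$.

```latex
% Squared exponential kernel with signal variance \sigma_f^2 and length scale \ell
k(x, x') = \sigma_f^{2} \exp\!\left( -\frac{\lVert x - x' \rVert^{2}}{2\ell^{2}} \right)

% GP predictive mean and variance at a fixed test input x_*,
% with (k_*)_i = k(x_i, x_*) and observation noise variance \sigma_n^2
\mu(x_*) = k_*^{\top} \left( K + \sigma_n^{2} I \right)^{-1} y
\qquad
\sigma^{2}(x_*) = k(x_*, x_*) - k_*^{\top} \left( K + \sigma_n^{2} I \right)^{-1} k_*

% Averaging the mean over the uncertain input
\mathbb{E}[f_*] = \int \mu(x_*)\, q(x_*)\, \mathrm{d}x_*
```

With a Gaussian $q(x_*)$ and the squared exponential kernel, the last integral reduces to closed-form kernel expectations, which is what makes the mean and covariance tractable.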

在本实施例中，为便于数据处理，进一步优选地，还包括编码装置，用于对原始围术期特征数据进行编码处理，将编码处理后的数据输入缺失填补装置。编码装置优选但不限于采用现有的One-hot编码规则进行编码。In this embodiment, in order to facilitate data processing, it is further preferable to further include an encoding device for encoding the original perioperative characteristic data, and inputting the encoded data into the missing filling device. The encoding device preferably, but is not limited to, uses the existing One-hot encoding rules for encoding.

In this embodiment, to facilitate data processing, the system further preferably includes a normalization device that normalizes the encoded original perioperative characteristic data and feeds the normalized data to the missing-value filling device. The normalization device preferably, but not exclusively, uses standard-deviation (z-score) normalization.
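The encoding and normalization steps above can be illustrated with a minimal pure-Python sketch. The helper names `one_hot` and `z_score`, the use of the population standard deviation, and the toy features (a categorical grade and an age column) are assumptions for illustration, not details given in the source.

```python
from statistics import mean, pstdev

def one_hot(values):
    """Encode a categorical column as one-hot rows (categories in sorted order)."""
    categories = sorted(set(values))
    index = {cat: i for i, cat in enumerate(categories)}
    rows = []
    for v in values:
        row = [0] * len(categories)
        row[index[v]] = 1
        rows.append(row)
    return categories, rows

def z_score(column):
    """Standard-deviation normalization: (x - mean) / std, population std assumed."""
    mu = mean(column)
    sigma = pstdev(column)
    if sigma == 0:
        return [0.0 for _ in column]  # constant column carries no information
    return [(x - mu) / sigma for x in column]

# Hypothetical toy columns: a categorical grade and a numeric age.
cats, encoded = one_hot(["II", "I", "III", "II"])
ages = z_score([30.0, 50.0, 70.0, 50.0])
```

In this sketch the encoded and normalized columns would then be passed on to the missing-value filling step.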

实施例3Example 3

本实施例提供一种围术期患者的样本数据集均衡方法，如图3所示，该样本数据集均衡方法包括：This embodiment provides a method for balancing sample data sets of perioperative patients. As shown in FIG. 3 , the method for balancing sample data sets includes:

Step S1: oversample the minority-label samples in the sample data set of perioperative patients to obtain synthetic samples, and generate a corresponding synthetic label set for each synthetic sample. The sample data set includes a plurality of samples and the classification label set corresponding to each sample. Each sample represents the perioperative characteristic data of one patient, which may be the original perioperative characteristic data or the perioperative characteristic data obtained by reducing the dimensionality of the original data as in Example 1. The association of classification labels with samples is described in detail in Example 1 and is not repeated here.

步骤S2，将合成样本和合成标签集加入样本数据集获得临时样本数据集；Step S2, adding the synthetic sample and the synthetic label set to the sample data set to obtain a temporary sample data set;

步骤S3，对临时样本数据集中的样本进行清洗获得均衡样本数据集。Step S3, cleaning the samples in the temporary sample data set to obtain a balanced sample data set.

In this embodiment, the minority-label samples in the sample data set of perioperative patients may be oversampled with SMOTE, SVM SMOTE, BorderlineSMOTE, K-Means SMOTE, or SMOTE-NC to obtain the synthetic samples and generate their synthetic label sets. Preferably, to improve the balancing effect, the MLSMOTE algorithm is used for this oversampling. MLSMOTE (Multi-Label Synthetic Minority Over-sampling Technique) is commonly used to handle data imbalance in multi-label classification tasks. Its generation process includes: minority label selection, in which minority labels are selected using the imbalance rate (IR); nearest neighbor search, in which, once a sample carrying a minority label is chosen as a seed sample, its nearest neighbors are searched; feature set generation, in which, after a neighbor is selected, the synthetic sample is obtained by interpolation; and synthetic label set generation, in which a synthetic label set is produced for each synthetic sample.
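The four MLSMOTE stages listed above can be sketched in pure Python as follows. This is a simplified illustration, not the patent's implementation: it assumes the per-label imbalance rate is the count of the most frequent label divided by the count of the given label, that a label is a minority label when its rate exceeds the mean rate, that neighbors are found with Euclidean distance, and that a synthetic label is kept when at least half of the seed and its neighbors carry it. One synthetic sample is generated per seed here, whereas the source sets a per-label oversampling rate.

```python
import math
import random

def imbalance_rates(Y):
    """Per-label imbalance rate: count of the most frequent label / count of this label."""
    counts = [sum(row[l] for row in Y) for l in range(len(Y[0]))]
    top = max(counts)
    return [top / c if c else math.inf for c in counts]

def mlsmote(X, Y, k=3, rng=None):
    """Return synthetic feature vectors and label sets for minority labels."""
    rng = rng or random.Random(0)
    ir = imbalance_rates(Y)
    finite = [r for r in ir if math.isfinite(r)]
    mean_ir = sum(finite) / len(finite)
    new_X, new_Y = [], []
    for label, rate in enumerate(ir):
        if not math.isfinite(rate) or rate <= mean_ir:
            continue  # stage 1: only labels above the mean rate are minority labels
        bag = [i for i, row in enumerate(Y) if row[label] == 1]
        for seed in bag:
            # stage 2: nearest neighbours of the seed among samples with this label
            others = sorted((i for i in bag if i != seed),
                            key=lambda i: math.dist(X[seed], X[i]))
            neighbours = others[:k]
            if not neighbours:
                continue
            # stage 3: interpolate between the seed and one random neighbour
            ref = rng.choice(neighbours)
            gap = rng.random()
            new_X.append([a + gap * (b - a) for a, b in zip(X[seed], X[ref])])
            # stage 4: keep labels carried by at least half of seed + neighbours
            group = [seed] + neighbours
            new_Y.append([1 if 2 * sum(Y[i][l] for i in group) >= len(group) else 0
                          for l in range(len(Y[0]))])
    return new_X, new_Y

# Hypothetical toy data: label 0 is frequent, label 1 is rare.
X = [[0.0, 0.0], [0.1, 0.2], [0.2, 0.1], [0.3, 0.3], [1.0, 1.0], [1.2, 0.8]]
Y = [[1, 0], [1, 0], [1, 0], [1, 0], [1, 1], [0, 1]]
syn_X, syn_Y = mlsmote(X, Y)
```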

In this embodiment, oversampling algorithms such as MLSMOTE produce some noise samples while synthesizing minority-label samples. Cleaning these noise samples is necessary, so step S3 is provided to improve the quality of the sample data set.

In this embodiment, preferably, to identify the minority labels in the sample data set quickly, the ratio of the number of samples carrying each classification label to the total number of samples in the sample data set is calculated. A classification label whose ratio is below the ratio threshold is treated as a minority classification label, and one whose ratio is greater than or equal to the threshold is treated as a majority classification label. The ratio threshold is preferably, but not necessarily, less than 0.2.
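The ratio test described above can be sketched as follows. The function name, the label names, and the threshold value of 0.15 (chosen only because it lies below 0.2) are illustrative assumptions.

```python
def split_labels_by_ratio(Y, label_names, threshold=0.15):
    """Split labels into minority and majority by their share of all samples."""
    total = len(Y)
    minority, majority = [], []
    for idx, name in enumerate(label_names):
        ratio = sum(row[idx] for row in Y) / total
        # below the threshold: minority; otherwise (>= threshold): majority
        (minority if ratio < threshold else majority).append(name)
    return minority, majority

# Hypothetical toy label matrix: 10 patients, 2 risk-event labels.
Y = [[1 if i == 0 else 0, 1 if i < 6 else 0] for i in range(10)]
minority, majority = split_labels_by_ratio(Y, ["risk_A", "risk_B"])
```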

In this embodiment, the oversampling rate of a minority classification label is the number of samples to be generated for that label. To determine the oversampling rate of each minority classification label so that the resulting balanced sample data set performs better in subsequent classification, preferably, in step S1 the oversampling rate of each minority label is set with a genetic algorithm, which specifically includes:

Step S11: suppose the sample data set contains W minority labels, where W is a positive integer. The oversampling rates of the W minority labels serve as the W genes of an individual, each gene representing the oversampling rate of one minority classification label. An initial population is built from a plurality of initial individuals, and the W gene values of each initial individual are chosen at random. Preferably, a numerical range can be set for the oversampling rate of each minority classification label, and gene values are drawn at random within that range when the initial population is built; the range can be set as required;

Step S12: repeat the following evolutionary iteration until the termination condition is met: obtain the fitness of each individual in the current population; select some individuals from the current population, based on their fitness, as individuals of the next-generation population; perform crossover and mutation operations on the individuals of the next-generation population;

步骤S13，输出达到终止条件时适应度最大的个体。Step S13, output the individual with the largest fitness when the termination condition is reached.

In this embodiment, the termination condition is preferably, but not limited to, one of the following: the number of evolutionary iterations reaches a preset maximum; the maximum individual fitness no longer increases between iterations; or the increase in the maximum individual fitness falls below an increase threshold. In each iteration, the individuals of the current population are sorted by fitness from high to low, and the top-ranked individuals are selected as individuals of the next-generation population.

In this embodiment, so that the resulting balanced sample data set performs better in subsequent classification, preferably, the fitness of an individual is obtained as follows:

The combination of minority-label oversampling rates is read from the individual's genes; the combination contains the oversampling rate of every minority label;

Based on this combination, the minority-label samples in the sample data set of perioperative patients are oversampled to obtain synthetic samples and their synthetic label sets; the synthetic samples and synthetic label sets are added to the sample data set to obtain a balanced sample set, which is divided into a balanced training sample set and a balanced test sample set;

A balanced multilayer perceptron (MLP) neural network is constructed and trained on the balanced training sample set to obtain a balanced prediction classification model; the model is tested on the balanced test sample set to obtain its accuracy, and this accuracy is used as the fitness of the individual.
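Steps S11 to S13 can be sketched as a small genetic algorithm. This is one possible reading under stated assumptions: integer gene values, single-point crossover, per-gene random-reset mutation, and all parameter names and defaults are hypothetical. The fitness function is passed in as a callable; in the source it is the test accuracy of an MLP trained on the balanced set, whereas the toy fitness below is only a stand-in so the sketch runs without a classifier.

```python
import random

def evolve_oversampling_rates(fitness, n_minority, rate_range=(1, 10), pop_size=20,
                              max_generations=30, keep=0.5, mutation_prob=0.1, seed=0):
    """Search for one oversampling rate per minority label (one gene per label)."""
    rng = random.Random(seed)
    lo, hi = rate_range
    # S11: random initial population, each gene drawn from the allowed range
    population = [[rng.randint(lo, hi) for _ in range(n_minority)]
                  for _ in range(pop_size)]
    best, best_fit = None, float("-inf")
    for _ in range(max_generations):          # termination: iteration cap
        ranked = sorted(population, key=fitness, reverse=True)
        top_fit = fitness(ranked[0])
        if top_fit <= best_fit:
            break                             # termination: best fitness stopped increasing
        best, best_fit = ranked[0][:], top_fit
        # S12: keep the top-ranked individuals, then apply crossover and mutation
        parents = ranked[:max(2, int(keep * pop_size))]
        children = []
        while len(children) < pop_size:
            a, b = rng.sample(parents, 2)
            cut = rng.randint(1, n_minority - 1) if n_minority > 1 else 0
            child = a[:cut] + b[cut:]
            child = [rng.randint(lo, hi) if rng.random() < mutation_prob else g
                     for g in child]
            children.append(child)
        population = children
    return best, best_fit                     # S13: fittest individual found

def toy_fitness(genes, target=(3, 7)):
    """Stand-in fitness: higher when the rates are closer to a hypothetical target."""
    return -sum((g - t) ** 2 for g, t in zip(genes, target))

best_rates, best_score = evolve_oversampling_rates(toy_fitness, n_minority=2)
```

To follow the source more closely, `toy_fitness` would be replaced by a function that oversamples with the given rates, trains the MLP, and returns its test accuracy.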

在本实施例中，为有效地去除噪声样本，提升样本集的质量，优选地，步骤S3为对临时样本数据集中每个样本进行清洗处理，清洗处理过程包括：In this embodiment, in order to effectively remove noise samples and improve the quality of the sample set, preferably, step S3 is to perform cleaning processing on each sample in the temporary sample data set, and the cleaning processing process includes:

步骤S31，从临时样本数据集中选取种子样本，选择种子样本的k个近邻样本，k个近邻样本的分类标签组成近邻分类标签集，k为正整数；可依次选取临时样本数据集中的每个样本作为种子样本；Step S31, select a seed sample from the temporary sample data set, select k neighbor samples of the seed sample, and the classification labels of the k neighbor samples form a neighbor classification label set, and k is a positive integer; each sample in the temporary sample data set can be selected in turn. as a seed sample;

步骤S32，基于近邻分类标签集通过贝叶斯条件概率预测种子样本的分类标签集，获得种子样本的预测分类标签集；Step S32, predicting the classification label set of the seed sample by Bayesian conditional probability based on the nearest neighbor classification label set, and obtaining the predicted classification label set of the seed sample;

步骤S33，判断种子样本的预测分类标签集与其在临时样本数据集中的分类标签集是否相同，若相同，保留该种子样本，若不相同，删除该种子样本，认为该种子样本为噪声样本。Step S33, determine whether the predicted classification label set of the seed sample is the same as the classification label set in the temporary sample data set, if the same, keep the seed sample, if not, delete the seed sample, and consider the seed sample to be a noise sample.

上述清洗过程直接基于种子样本近邻分类标签集通过贝叶斯条件概率预测种子样本的分类标签集，将获得的预测分类标签集与该种子样本在临时样本数据集中的真实分类标签集进行比较判断，不会依赖于分类器判定，仅依赖数据本身判定，减少运算量，提高判断效率和准确率。The above cleaning process directly predicts the classification label set of the seed sample through Bayesian conditional probability based on the nearest neighbor classification label set of the seed sample, and compares the obtained predicted classification label set with the real classification label set of the seed sample in the temporary sample data set. It will not rely on the classifier to determine, but only rely on the data itself to determine, reduce the amount of calculation, and improve the efficiency and accuracy of the judgment.
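The cleaning steps S31 to S33 can be sketched as follows. The source does not spell out the Bayesian conditional-probability rule, so this sketch assumes an ML-kNN-style maximum a posteriori estimate with Laplace smoothing and leave-one-out counts, and it uses plain Euclidean distance in place of the modified HVDM. The function names and the toy data are hypothetical.

```python
import math

def knn_indices(X, i, k):
    """Indices of the k samples closest to sample i (Euclidean distance)."""
    others = [j for j in range(len(X)) if j != i]
    others.sort(key=lambda j: math.dist(X[i], X[j]))
    return others[:k]

def clean_by_bayes_knn(X, Y, k=3, smooth=1.0):
    """Return indices of samples whose predicted label set equals their own."""
    n, q = len(X), len(Y[0])
    neigh = [knn_indices(X, i, k) for i in range(n)]
    # votes[i][l]: how many of sample i's k neighbours carry label l
    votes = [[sum(Y[j][l] for j in neigh[i]) for l in range(q)] for i in range(n)]
    kept = []
    for i in range(n):                       # S31: every sample acts as a seed once
        rest = [s for s in range(n) if s != i]
        predicted = []
        for l in range(q):                   # S32: MAP decision per label
            pos = sum(Y[s][l] for s in rest)
            neg = len(rest) - pos
            prior1 = (smooth + pos) / (2 * smooth + len(rest))
            same_vote = [s for s in rest if votes[s][l] == votes[i][l]]
            c1 = sum(1 for s in same_vote if Y[s][l] == 1)
            c0 = len(same_vote) - c1
            like1 = (smooth + c1) / (smooth * (k + 1) + pos)
            like0 = (smooth + c0) / (smooth * (k + 1) + neg)
            predicted.append(1 if prior1 * like1 > (1 - prior1) * like0 else 0)
        if predicted == list(Y[i]):          # S33: keep only consistent samples
            kept.append(i)
    return kept

# Hypothetical 1-D toy data: cluster A (label 0), one mislabeled point, cluster B (label 1).
X = [[float(v)] for v in range(8)] + [[0.5]] + [[float(v)] for v in range(100, 108)]
Y = [[1, 0]] * 8 + [[0, 1]] + [[0, 1]] * 8
kept = clean_by_bayes_knn(X, Y)
```

In this toy run the point at 0.5, which sits inside cluster A but carries cluster B's label, is the only sample removed.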

在本实施例中，进一步优选地，在步骤S31中，选择种子样本的k个近邻样本的具体过程包括：In this embodiment, further preferably, in step S31, the specific process of selecting the k nearest neighbor samples of the seed sample includes:

获取种子样本分别与临时样本数据集中全部或部分样本的异类值差度量HVDM；HVDM为Heterogeneous Value Difference Metric的缩写；Obtain the heterogeneous value difference metric HVDM between the seed sample and all or part of the samples in the temporary sample data set; HVDM is the abbreviation of Heterogeneous Value Difference Metric;

The heterogeneous value difference metric HVDM is corrected with the global imbalance weight of the samples in the temporary sample data set to obtain a modified heterogeneous value difference metric;

The modified heterogeneous value difference metrics between the seed sample and all samples in the temporary sample data set are sorted, and the k samples with the largest modified metric values are selected as the k nearest-neighbor samples of the seed sample. Preferably, the modified metrics are sorted from high to low and the first k samples are taken.

In the above selection of the k nearest-neighbor samples of the seed sample, weighted kNN (WkNN) is used to improve the quality of the synthetic samples. If the real minority-label samples in the sample data set are widely scattered, i.e. spatially sparse, the minority samples synthesized by algorithms such as MLSMOTE are also scattered and sparse, and the data remain unbalanced locally. Cleaning directly with kNN would then very likely remove both the sparse minority samples and the new minority samples synthesized by MLSMOTE, so that an appropriate classification boundary cannot be established. Distance weighting is therefore introduced to moderate the kNN cleaning: sparsely distributed samples are not deleted blindly; instead, the local spatial density (i.e. the heterogeneous value difference metric HVDM and the global imbalance weight of the sample) is taken into account so that as many minority samples as possible are retained. kNN cleaning relies mainly on the label sets of the neighbor samples, so the distance computation for neighbors is especially important when the data are sparse; this is the main reason for adding distance weighting (i.e. correcting HVDM with the global imbalance weight of the samples in the temporary sample data set). WkNN cleans the noise samples by changing how the distance to neighbor samples is computed (the heterogeneous value difference metric is corrected), which accounts for local density; the distance between samples is expressed by the heterogeneous value difference metric.

In this embodiment, further preferably, the heterogeneous value difference metric HVDM between the seed sample and a sample in the temporary sample data set is calculated as:

[formula not reproduced in the source text]

where f₁ denotes the feature vector of the seed sample; f₂ denotes the feature vector of any sample in the temporary sample data set other than the seed sample; HVDM(f₁,f₂) denotes the heterogeneous value difference metric between f₁ and f₂; D(f₁,f₂) denotes the distance between f₁ and f₂; n denotes the feature dimension of the samples in the temporary sample data set; x denotes the feature index; and d_x(f₁,f₂) denotes the distance between f₁ and f₂ on feature x, obtained by the following formula:

[formula not reproduced in the source text]

where C denotes the number of categories of feature x when x is a categorical feature, and c denotes the category index of feature x. The formula uses four sample counts over the temporary sample data set, whose symbols are not reproduced in the source text: the number of samples in which feature x belongs to feature vector f₁ and the category of feature x is c; the number of samples in which feature x belongs to feature vector f₂ and the category of feature x is c; the number of samples in which feature x belongs to feature vector f₁; and the number of samples in which feature x belongs to feature vector f₂. |f₁-f₂| denotes the absolute value of the difference between f₁ and f₂, and σ_x denotes the standard deviation of feature x in the temporary sample data set.
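Because the formula images are missing from the source text, the following is only a sketch of the standard HVDM of Wilson and Martinez, which is consistent with the symbols defined above (per-feature distances d_x, category counts, |f₁-f₂| and σ_x) but may differ in detail from the patent's formula. The 4σ_x scaling for numeric features, the population standard deviation, the use of one class identifier per sample for the categorical counts, and all names are assumptions.

```python
import math
from statistics import pstdev

def hvdm(f1, f2, X, classes, categorical):
    """Standard HVDM between feature vectors f1 and f2 over the data set X.

    classes: one class identifier per row of X (assumption for this sketch).
    categorical: set of feature indices that are categorical.
    """
    total = 0.0
    for x in range(len(f1)):
        column = [row[x] for row in X]
        if x in categorical:
            # value difference: compare class distributions of the two values
            n1 = sum(1 for v in column if v == f1[x])
            n2 = sum(1 for v in column if v == f2[x])
            d2 = 0.0
            for c in set(classes):
                n1c = sum(1 for v, y in zip(column, classes) if v == f1[x] and y == c)
                n2c = sum(1 for v, y in zip(column, classes) if v == f2[x] and y == c)
                p1 = n1c / n1 if n1 else 0.0
                p2 = n2c / n2 if n2 else 0.0
                d2 += (p1 - p2) ** 2
        else:
            # numeric difference scaled by four standard deviations
            sigma = pstdev(column)
            d = abs(f1[x] - f2[x]) / (4 * sigma) if sigma else 0.0
            d2 = d * d
        total += d2
    return math.sqrt(total)

# Hypothetical toy data: [sex, age] with one class identifier per patient.
X = [["M", 40.0], ["M", 50.0], ["F", 60.0], ["F", 70.0]]
classes = [0, 0, 1, 1]
d_same_sex = hvdm(X[0], X[1], X, classes, categorical={0})
d_diff_sex = hvdm(X[0], X[2], X, classes, categorical={0})
```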

In this embodiment, further preferably, the modified heterogeneous value difference metric between the seed sample and a sample in the temporary sample data set is calculated as:

[formula not reproduced in the source text]

where f₁ denotes the feature vector of the seed sample; f₂ denotes the feature vector of any sample in the temporary sample data set other than the seed sample; HVDM(f₁,f₂) denotes the heterogeneous value difference metric between f₁ and f₂; D_W(f₁,f₂) denotes the modified heterogeneous value difference metric between f₁ and f₂; n denotes the feature dimension of the samples in the temporary sample data set; and IW denotes the global imbalance weight of the sample whose feature vector is f₂, with IW = IR_nn/(IR⁺ + IR⁻), where IR⁺ is the total imbalance rate of all minority classification labels in the temporary sample data set, IR⁻ is the total imbalance rate of all majority classification labels in the temporary sample data set, and IR_nn is the total imbalance rate of all classification labels in the classification label set of the sample whose feature vector is f₂.
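The global imbalance weight IW = IR_nn/(IR⁺ + IR⁻) defined above can be computed as in the sketch below. It assumes that the imbalance rate of a label is the count of the most frequent label divided by the count of that label, and that a "total" imbalance rate is the sum over the labels concerned; the source does not state either definition, and the function names are hypothetical. The formula that applies IW to HVDM is not reproduced in the source, so it is not sketched here.

```python
def imbalance_rates(Y):
    """Per-label imbalance rate (assumes every label occurs at least once)."""
    counts = [sum(row[l] for row in Y) for l in range(len(Y[0]))]
    top = max(counts)
    return [top / c for c in counts]

def global_imbalance_weight(label_row, ir, minority):
    """IW = IR_nn / (IR+ + IR-) for one sample's label set."""
    ir_plus = sum(r for l, r in enumerate(ir) if l in minority)       # minority labels
    ir_minus = sum(r for l, r in enumerate(ir) if l not in minority)  # majority labels
    ir_nn = sum(r for l, r in enumerate(ir) if label_row[l] == 1)     # this sample's labels
    return ir_nn / (ir_plus + ir_minus)

# Hypothetical toy label matrix: label counts are 8, 4 and 2, so label 2 is the rarest.
Y = [[1, 1 if i < 4 else 0, 1 if i < 2 else 0] for i in range(8)]
ir = imbalance_rates(Y)
iw_with_minority = global_imbalance_weight([1, 0, 1], ir, minority={2})
iw_majority_only = global_imbalance_weight([1, 0, 0], ir, minority={2})
```

Under these assumptions a label set that contains the rare label receives a larger IW, which agrees with the behavior the text describes for IR_nn and IW.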

In the noise-cleaning process above, WkNN uses the Heterogeneous Value Difference Metric (HVDM) as the distance measure and corrects it with the sample's global imbalance weight IW as the weighting coefficient. In the temporary sample data set, the more minority labels a classification label set contains, the larger IR_nn and hence the larger IW. For a temporary sample data set in which minority-label samples are sparsely distributed and the imbalance rate is high, introducing IW into the HVDM distance can increase the density of minority samples.

From the formula [not reproduced in the source text], it can be seen that the value of the weighting coefficient [symbol not reproduced in the source text] scales HVDM(f₁,f₂): the more minority labels the classification label set of a neighbor sample contains, the smaller the weighting coefficient. The larger the IW of a neighbor sample of the seed sample, i.e. the more minority labels its label set contains, the smaller the weighting coefficient of that neighbor sample, so the coefficient decreases monotonically. As a result, with the feature dimension fixed, the weighting coefficient of a neighbor sample is scaled to a different degree depending on the majority and minority labels contained in its label set; and as the feature dimension increases, i.e. as the sample distribution becomes sparser, the scaling coefficient also becomes smaller.

It can be seen that, when neighbor samples are selected for a sample whose label set contains many minority labels, WkNN takes the label distribution in the neighbors' label sets into account, so that samples with more minority labels in their label sets move closer to the seed sample. This increases the local density of minority labels and reduces that of majority labels. The overall procedure is as follows: first, MLSMOTE is used to oversample the minority-label samples, which together with the original samples form a more balanced temporary sample set; on this set, the WkNN process is applied to every sample, i.e. its k nearest-neighbor samples are ranked by the weighted HVDM and the label set of the seed sample is predicted from them; if the predicted label set is the same as the seed's label set, the sample is retained, otherwise it is deleted.

实施例4Example 4

本实施例公开了一种围术期患者的样本数据集均衡装置，如图4所示，该样本数据集均衡装置包括：This embodiment discloses a sample data set equalization device for perioperative patients. As shown in FIG. 4 , the sample data set equalization device includes:

样本合成模块，对围术期患者的样本数据集中的少数类标签样本进行过采样获得合成样本，为合成样本生成对应的合成标签集，样本数据集包括多个样本以及样本对应的分类标签集；The sample synthesis module is used to oversample the minority label samples in the perioperative patient's sample data set to obtain synthetic samples, and generate corresponding synthetic label sets for the synthetic samples. The sample data set includes multiple samples and the corresponding classification label sets of the samples;

临时样本数据集获取模块，将合成样本和合成标签集加入样本数据集获得临时样本数据集；The temporary sample data set acquisition module adds synthetic samples and synthetic label sets to the sample data set to obtain a temporary sample data set;

清洗模块，对临时样本数据集中的样本进行清洗获得均衡样本数据集。The cleaning module cleans the samples in the temporary sample data set to obtain a balanced sample data set.

在本实施例中，优选地，清洗模块包括：In this embodiment, preferably, the cleaning module includes:

近邻样本获取单元，从临时样本数据集中选取种子样本，选择种子样本的k个近邻样本，k个近邻样本的分类标签组成近邻分类标签集，k为正整数；The nearest neighbor sample acquisition unit selects a seed sample from the temporary sample data set, selects k nearest neighbor samples of the seed sample, and the classification labels of the k nearest neighbor samples form a nearest neighbor classification label set, where k is a positive integer;

预测分类标签集获取单元，基于近邻分类标签集通过贝叶斯条件概率预测种子样本的分类标签集，获得种子样本的预测分类标签集；The predicted classification label set obtaining unit, based on the nearest neighbor classification label set, predicts the classification label set of the seed sample through Bayesian conditional probability, and obtains the predicted classification label set of the seed sample;

清洗单元，判断种子样本的预测分类标签集与其在临时样本数据集中的分类标签集是否相同，若相同，保留该种子样本，若不相同，删除该种子样本。The cleaning unit determines whether the predicted classification label set of the seed sample is the same as the classification label set in the temporary sample data set, if the same, the seed sample is retained, if not, the seed sample is deleted.

在本实施例中，进一步优选地，近邻样本获取单元选择种子样本的k个近邻样本的具体过程包括：In this embodiment, further preferably, the specific process of selecting the k nearest neighbor samples of the seed sample by the nearest neighbor sample acquisition unit includes:

获取种子样本分别与临时样本数据集中全部或部分样本的异类值差度量HVDM；Obtain the heterogeneous value difference metric HVDM between the seed sample and all or part of the samples in the temporary sample data set;

The modified heterogeneous value difference metrics between the seed sample and all samples in the temporary sample data set are sorted, and the k samples with the largest modified metric values are selected as the k nearest-neighbor samples of the seed sample.

对本实施例提供的样本数据集均衡装置的均衡效果进行试验验证，结果如下：The balancing effect of the sample data set balancing device provided in this embodiment is experimentally verified, and the results are as follows:

IR denotes the imbalance rate of a sample set; the larger the IR, the more unbalanced the sample set. According to the experimental results in the referenced table (not reproduced in the source text), the balancing device provided in this embodiment yields the smallest maximum IR and mean IR, and the gap between the maximum and the mean IR is narrowed, indicating that the sample set is better balanced.

实施例5Example 5

This embodiment also discloses a system for acquiring a sample data set of perioperative patients. Compared with Example 2, it adds a sample data set balancing device, i.e. the sample data set obtained after dimensionality reduction in Example 2 is subjected to sample balancing. As shown in FIG. 5, the system includes: a data acquisition module for acquiring the original perioperative characteristic data and cases of a plurality of patients; a classification label set acquisition module, which obtains the classification label set from the cases, each classification label representing a perioperative patient risk event; a classification label association module for associating each patient's original perioperative characteristic data with at least one classification label in the classification label set; a perioperative patient data dimensionality reduction device, which reduces the dimensionality of all patients' original perioperative characteristic data to obtain the corresponding perioperative characteristic data; a sample data set acquisition module, which takes each patient's perioperative characteristic data as a sample and associates the sample with the classification label set of the corresponding original perioperative characteristic data, thereby obtaining the sample data set of perioperative patients; and the sample data set balancing device for perioperative patients provided in Example 4, which balances the sample data set.

In this embodiment, preferably, the system further includes a missing-value filling device, which fills the missing values in each patient's original perioperative characteristic data and feeds the filled data to the perioperative patient data dimensionality reduction device for dimensionality reduction.

实施例6Example 6

本实施例6公开了一种围术期患者数据多标签分类方法，如图6所示，该多标签分类方法包括：This embodiment 6 discloses a multi-label classification method for perioperative patient data, as shown in FIG. 6 , the multi-label classification method includes:

Step A: obtain the characteristic data of the patient to be classified. These are the characteristic data of a perioperative patient and may include multi-dimensional features. To make the data easier to process, reduce their dimensionality, and improve their quality, the characteristic data may in turn be encoded, normalized, and reduced to the feature dimensions of the samples output by the perioperative patient data dimensionality reduction device provided in Example 1; the reduced data are then input into the trained classification model.

Step B: input the characteristic data of the patient to be classified into the trained classification model, which outputs the classification result. The classification result includes one or more classification labels and the classification confidence of each label; the classification confidence of a label is the probability that the patient's characteristic data belong to that label. The classification model includes a Stacking-based classification ensemble model, a label association rule acquisition module, and a fusion module. The fusion module fuses the classification matrix output by the classification ensemble model with the association rule matrix output by the label association rule acquisition module to obtain the classification result; the fusion is preferably, but not limited to, multiplying the classification matrix by the association rule matrix.
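The preferred fusion by matrix multiplication can be illustrated as follows. The shapes are assumptions: the classification matrix is taken as patients by labels, the association rule matrix as labels by labels, and the numbers are invented for illustration. The source does not say whether the fused values are rescaled afterwards, so none is applied here.

```python
def matmul(A, B):
    """Multiply two matrices given as lists of rows."""
    return [[sum(a * b for a, b in zip(row, col)) for col in zip(*B)] for row in A]

# Hypothetical confidences for one patient over three labels.
classification = [[0.9, 0.2, 0.1]]
# Hypothetical rule matrix: entry [i][j] is the confidence that label i implies label j.
rules = [
    [1.0, 0.5, 0.0],
    [0.0, 1.0, 0.0],
    [0.0, 0.0, 1.0],
]
fused = matmul(classification, rules)
```

In this toy case the confidence of the second label rises from 0.2 to about 0.65 because the first label, predicted with high confidence, is associated with it.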

In this embodiment, preferably, the structure of the classification model is shown in FIG. 7. The classification ensemble model includes a first multi-classification model, a second multi-classification model, a third multi-classification model, and a logistic regression model. The first, second, and third multi-classification models each perform multi-label classification on the characteristic data of the patient to be classified, yielding the first, second, and third primary classification results; the logistic regression model processes the three primary classification results to obtain the classification matrix.

In this embodiment, preferably, the first, second, and third multi-classification models are a Ranking-SVM model, a classification multilayer perceptron neural network model, and a Binary Relevance model, respectively. Ranking-SVM and Binary Relevance are conventional base models in Stacking ensembles, so using them here gives a reliable ensemble. The classification multilayer perceptron model adopts the multilayer perceptron (MLP) network structure, which can avoid overfitting and has low complexity.

In this embodiment, preferably, the method further includes a step of constructing the sample data set of perioperative patients, as shown in FIG. 8; this step preferably, but not exclusively, uses the system of Example 2 or Example 5.

In this embodiment, as shown in FIG. 7, the classification ensemble model is trained as follows. A sample data set of perioperative patients is constructed, in which each sample is associated with one or more classification labels, and the sample data set is divided into a classification training set and a classification test set; the classification labels may be associated manually. The classification ensemble model, that is, the Stacking-based ensemble model described above, is constructed; it includes the first multi-classification model, the second multi-classification model, the third multi-classification model, and the logistic regression model. The classification ensemble model is trained on the classification training set, and the trained model is tested and validated on the classification test set. During validation, RandomizedSearchCV and GridSearchCV are used for cross-validation on the training set, and hyperparameters are selected by the F1_Micro score.
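As an illustration of the F1_Micro selection criterion, the following minimal sketch computes a micro-averaged F1 score for multi-label predictions by pooling true positives, false positives, and false negatives over all labels. The function name `micro_f1` and the toy label matrices are hypothetical and are not part of the embodiment; in scikit-learn the same criterion can be passed to RandomizedSearchCV and GridSearchCV as `scoring="f1_micro"`.

```python
def micro_f1(y_true, y_pred):
    """Micro-averaged F1 over binary label matrices (rows = samples, columns = labels)."""
    tp = fp = fn = 0
    for true_row, pred_row in zip(y_true, y_pred):
        for t, p in zip(true_row, pred_row):
            if t == 1 and p == 1:
                tp += 1          # label present and predicted
            elif t == 0 and p == 1:
                fp += 1          # label predicted but absent
            elif t == 1 and p == 0:
                fn += 1          # label present but missed
    denom = 2 * tp + fp + fn
    return 2 * tp / denom if denom else 0.0


# Hypothetical toy data: 2 patients, 3 candidate risk-event labels.
y_true = [[1, 0, 1], [0, 1, 0]]
y_pred = [[1, 0, 0], [0, 1, 1]]
score = micro_f1(y_true, y_pred)  # tp=2, fp=1, fn=1
```

Because the counts are pooled before the ratio is taken, frequent labels weigh more heavily in this score than rare ones.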

In this embodiment, as shown in FIG. 7, preferably, the association rule acquisition module performs the following steps. It acquires a sample data set of perioperative patients, in which each sample is associated with one or more classification labels; the sample data set is preferably, but not limited to, the perioperative patient sample data set obtained in Embodiment 2 or Embodiment 5, that is, the standard patient data set. It then performs association rule mining on the classification labels in the sample data set to obtain an association rule matrix. The association rule matrix contains the association confidence between any two of the classification labels.

In this embodiment, as shown in FIG. 7, further preferably, when the number of classification labels in the sample data set is small, specifically when it is below a number threshold, association rule mining is performed directly on the classification labels in the sample data set with the FP-growth algorithm. First, a classification label matrix as shown in FIG. 7 is built, whose first row lists the labels and whose first column lists the patient numbers. Next, the FP-growth algorithm performs association rule analysis on the classification label matrix and outputs the association confidence between any two classification labels; the association confidence ranges from 0 to 1. Based on these association confidences, the association rule matrix shown in FIG. 7 is built. In the association rule matrix, the first row and the first column both list the classification labels, and each element is the association confidence between the classification labels of its row and its column; for example, in FIG. 7, A(N-1) denotes the association confidence between classification label N and classification label 1.
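The pairwise association rule matrix can be sketched as follows. This is a minimal illustration that counts label co-occurrences directly instead of running FP-growth; for rules between two single labels the confidence is defined the same way, namely the number of patients carrying both labels divided by the number carrying the row label. The function name `confidence_matrix`, the row-to-column direction of each rule, the diagonal value of 1.0, and the absence of a minimum-support threshold are all assumptions made for the example.

```python
def confidence_matrix(label_rows):
    """matrix[i][j] = confidence of the rule (label i -> label j).

    label_rows: binary classification label matrix, one row per patient.
    """
    n = len(label_rows[0])
    support = [sum(row[i] for row in label_rows) for i in range(n)]
    matrix = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(n):
            if i == j:
                matrix[i][j] = 1.0  # assumption: a label fully implies itself
            elif support[i]:
                both = sum(1 for row in label_rows if row[i] and row[j])
                matrix[i][j] = both / support[i]
    return matrix


# Hypothetical label matrix: 4 patients, 3 labels.
rows = [[1, 1, 0], [1, 0, 0], [1, 1, 1], [0, 1, 1]]
rules = confidence_matrix(rows)
```

Every entry lies between 0 and 1, matching the stated range of the association confidence. The matrix is generally not symmetric, since the confidence of "label 3 implies label 1" differs from that of "label 1 implies label 3".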

In this embodiment, preferably, when the sample data set contains many classification labels, the correlation patterns among the labels differ, and performing association analysis directly complicates the search for frequent itemsets and reduces the accuracy of the analysis. Specifically, this applies when the number of classification labels is greater than or equal to the number threshold, which is preferably, but not limited to, 3, 4, or 5. In this case, the step of performing association rule mining on the classification labels in the sample data set to obtain the association rule matrix specifically includes:

clustering the classification labels in the sample data set to obtain one or more clusters, preferably but not limited to using the K-means++ algorithm; and performing association rule mining on the classification labels in each cluster to obtain an association rule sub-matrix. During fusion, the classification matrix is divided according to the clustering result into one or more sub-classification matrices, one per cluster; each sub-classification matrix is multiplied by the association rule sub-matrix of its cluster to obtain the classification sub-result of that cluster, and all classification sub-results together form the classification result.
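The cluster-wise fusion described above can be sketched as follows. The sketch assumes that the classification matrix has one row per patient and one column per label, and that each association rule sub-matrix is indexed in the same label order as its cluster; the function name `fuse_by_cluster` and the numeric values are hypothetical. The embodiment does not state whether the product is normalized afterwards, so none is applied here.

```python
def fuse_by_cluster(class_matrix, clusters, rule_submatrices):
    """Multiply each sub-classification matrix by its cluster's rule sub-matrix.

    class_matrix:     per-patient label scores from the ensemble (rows = patients).
    clusters:         list of clusters, each a list of label column indices.
    rule_submatrices: one square confidence matrix per cluster.
    """
    n_labels = len(class_matrix[0])
    fused = [[0.0] * n_labels for _ in class_matrix]
    for cluster, rules in zip(clusters, rule_submatrices):
        for r, row in enumerate(class_matrix):
            sub_row = [row[label] for label in cluster]  # sub-classification row
            for col, label in enumerate(cluster):
                fused[r][label] = sum(
                    sub_row[i] * rules[i][col] for i in range(len(cluster))
                )
    return fused


# Hypothetical example: labels 0 and 1 form one cluster, label 2 another.
scores = [[0.8, 0.2, 0.5]]
clusters = [[0, 1], [2]]
sub_rules = [[[1.0, 0.5], [0.0, 1.0]], [[1.0]]]
fused = fuse_by_cluster(scores, clusters, sub_rules)
```

In this toy case the score of label 1 rises from 0.2 to about 0.6 because label 0 is likely and the rule from label 0 to label 1 has confidence 0.5.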

In this embodiment, further preferably, the FP-growth algorithm performs association rule mining on the classification labels in each cluster to obtain the association rule sub-matrix. The procedure is the same as that shown in FIG. 7 and described in detail in the preferred scheme above, so it is not repeated here.

Embodiment 7

This embodiment discloses a multi-label classification device for perioperative patient data. As shown in FIG. 9, it includes: a data acquisition module for acquiring feature data of a patient to be classified; and a classification module for inputting that feature data into a trained classification model, which outputs a classification result comprising one or more classification labels and a classification confidence for each label. The classification model includes a Stacking-based classification ensemble model, a label association rule acquisition module, and a fusion module; the fusion module fuses the classification matrix output by the classification ensemble model with the association rule matrix output by the label association rule acquisition module to obtain the classification result.

In this embodiment, preferably, the classification ensemble model includes a first multi-classification model, a second multi-classification model, a third multi-classification model, and a logistic regression model. The first, second, and third multi-classification models each perform multi-label classification on the feature data of the patient to be classified, yielding a first, a second, and a third primary classification result, respectively. The logistic regression model processes the three primary classification results to obtain the classification matrix.

In this embodiment, preferably, the device further includes a classification ensemble model training module, which performs the following process: constructing a sample data set of perioperative patients, in which each sample is associated with one or more classification labels, and dividing the sample data set into a classification training set and a classification test set, the sample data set being preferably, but not limited to, constructed by the system of Embodiment 2 or Embodiment 5; constructing the classification ensemble model, which includes the first multi-classification model, the second multi-classification model, the third multi-classification model, and the logistic regression model; training the classification ensemble model on the classification training set; and testing and validating the trained classification ensemble model on the classification test set.

In this embodiment, the classification device builds a multi-label classification ensemble model of perioperative postoperative events that incorporates association rule analysis. Multiple postoperative risk events may occur after surgery, so the outcomes of multiple postoperative events are studied and predicted. A multi-label prediction model is built by integrating the Ranking-SVM model, the multilayer perceptron neural network model, and the Binary Relevance model; to further improve the stability and accuracy of the model, association rules are fused into the prediction model for optimization.

Embodiment 8

This embodiment discloses a perioperative patient risk event prediction system. As shown in FIG. 10, it includes: a data acquisition module for acquiring feature data of a patient to be classified; a classification module for inputting that feature data into a trained classification model, which outputs a classification result comprising one or more classification labels and a classification confidence for each label, each classification label corresponding to one perioperative patient risk event;

wherein the classification model includes a Stacking-based classification ensemble model, a label association rule acquisition module, and a fusion module, the fusion module fusing the classification matrix output by the classification ensemble model with the association rule matrix output by the label association rule acquisition module to obtain the classification result; and a conversion module for converting the classification labels in the classification result into the corresponding perioperative patient risk events to obtain a risk prediction result.

In this embodiment, in the sample data set acquisition process of the system provided in Embodiment 2 or Embodiment 5, risk events during the perioperative period of patients (especially elderly surgical patients) are predicted: after the missing and imbalanced data set has been remedied, association rule analysis is incorporated and a multi-label prediction model of postoperative events is built. Postoperative event labels are extracted from patient case text. Using the CBOW model of Word2Vec, a large medical corpus is collected and a medical word vector model is trained to extract the postoperative event label set (that is, the classification label set). Next, missing data are filled in with a Bayesian Gaussian process latent variable model, and label-imbalanced data are processed with MLSMOTE, weighted kNN (WKNN), and a genetic algorithm. Finally, a feature dimension reduction model is built by combining a principal component analysis (PCA) model with a genetic algorithm, providing more relevant input to the classification ensemble model.

Although embodiments of the present invention have been shown and described, those of ordinary skill in the art will understand that various changes, modifications, substitutions, and variations can be made to these embodiments without departing from the principles and spirit of the invention; the scope of the invention is defined by the claims and their equivalents.

Claims

1. A perioperative patient sample dataset equalization method, comprising:

step S1, oversampling minority-class label samples in a sample data set of perioperative patients to obtain synthetic samples, and generating a corresponding synthetic label set for the synthetic samples, wherein the sample data set comprises a plurality of samples and the classification label sets corresponding to the samples;

step S2, adding the synthetic samples and the synthetic label set to the sample data set to obtain a temporary sample data set;

and S3, cleaning the samples in the temporary sample data set to obtain a balanced sample data set.

2. The perioperative patient sample dataset equalization method of claim 1, wherein in step S1 an oversampling rate is set for each minority-class label based on a genetic algorithm, which specifically comprises:

step S11, letting the sample data set comprise W minority-class labels, and taking the oversampling rates of the samples of the W minority-class labels as the W genes of an individual, wherein W is a positive integer; constructing an initial population, wherein the initial population comprises a plurality of initial individuals, and the W gene values of each initial individual are selected at random;

step S12, repeatedly performing the following evolutionary iteration until a termination condition is reached:

acquiring the fitness of each individual in the population of the current generation; selecting, based on the fitness of the individuals, a part of the individuals of the current generation as individuals of the next-generation population; performing crossover and mutation operations on the individuals of the next-generation population;

and step S13, outputting the individual with the maximum fitness when the termination condition is reached.

3. The perioperative patient sample data set balancing method of claim 2, wherein obtaining the fitness of an individual comprises:

obtaining a minority-class label oversampling rate combination based on the gene information of the individual, the oversampling rate combination including the oversampling rates of all the minority-class labels;

oversampling the minority-class label samples in the sample data set of perioperative patients based on the oversampling rate combination to obtain synthetic samples and a synthetic label set of the synthetic samples, adding the synthetic samples and the synthetic label set to the sample data set to obtain a balanced sample set, and dividing the balanced sample set into a balanced training sample set and a balanced test sample set;

constructing a balanced multilayer perceptron neural network, training it with the balanced training sample set to obtain a balanced prediction classification model, testing the balanced prediction classification model with the balanced test sample set to obtain its accuracy, and taking the accuracy as the fitness of the individual.
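The evolutionary loop of steps S11 to S13 can be sketched as follows. This is a minimal illustration, not the claimed implementation: the tournament selection, single-point crossover, fixed generation budget as termination condition, and all parameter values are assumptions, and the toy fitness function stands in for the accuracy of a multilayer perceptron trained on the resampled data, which is too heavy to reproduce here. The names `evolve_oversampling_rates` and `toy_fitness` are hypothetical.

```python
import random


def evolve_oversampling_rates(fitness, n_labels, rate_choices, pop_size=20,
                              generations=30, crossover_p=0.8, mutation_p=0.1,
                              seed=0):
    """Search one oversampling rate per minority-class label (W = n_labels genes)."""
    rng = random.Random(seed)
    # S11: random initial population, one gene per minority-class label.
    population = [[rng.choice(rate_choices) for _ in range(n_labels)]
                  for _ in range(pop_size)]
    best = max(population, key=fitness)
    for _ in range(generations):  # S12: assumed termination = generation budget
        # Selection: binary tournament on fitness (copies, so parents stay intact).
        selected = [list(max(rng.sample(population, 2), key=fitness))
                    for _ in range(pop_size)]
        # Crossover: swap tails of consecutive pairs at a random cut point.
        for a, b in zip(selected[::2], selected[1::2]):
            if n_labels > 1 and rng.random() < crossover_p:
                cut = rng.randrange(1, n_labels)
                a[cut:], b[cut:] = b[cut:], a[cut:]
        # Mutation: redraw a gene with small probability.
        for individual in selected:
            for g in range(n_labels):
                if rng.random() < mutation_p:
                    individual[g] = rng.choice(rate_choices)
        population = selected
        best = max(population + [best], key=fitness)
    return best  # S13: individual with the highest fitness seen


def toy_fitness(rates):
    """Stand-in for model accuracy: peaks at the hypothetical rates (2, 4, 3)."""
    return -sum((r - t) ** 2 for r, t in zip(rates, (2, 4, 3)))


choices = [1, 2, 3, 4, 5]
best_rates = evolve_oversampling_rates(toy_fitness, n_labels=3, rate_choices=choices)
```

Keeping the best individual across generations means the returned fitness never falls below that of the best initial individual, a design choice the claim leaves open.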

4. The perioperative patient sample dataset balancing method according to claim 1, 2 or 3, wherein in step S3 a cleaning process is performed on each sample in the temporary sample data set, the cleaning process comprising:

step S31, selecting a seed sample from the temporary sample data set and selecting k neighbor samples of the seed sample, wherein the classification labels of the k neighbor samples form a neighbor classification label set, and k is a positive integer;

step S32, predicting the classification label set of the seed sample by Bayesian conditional probability based on the neighbor classification label set, to obtain a predicted classification label set of the seed sample;

and step S33, judging whether the predicted classification label set of the seed sample is the same as the classification label set of the seed sample in the temporary sample data set; if so, retaining the seed sample, and if not, deleting the seed sample.
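Steps S31 to S33 can be sketched as follows. The sketch assumes an ML-kNN-style estimate for the Bayesian conditional probability (per-label priors and neighbor-count likelihoods with smoothing constant `s`), which the claim does not spell out, and it uses plain Euclidean distance in place of the corrected HVDM of claim 5. The function names and the toy data are hypothetical.

```python
import math


def knn_indices(features, i, k):
    """Indices of the k samples closest to sample i, excluding i itself."""
    others = [j for j in range(len(features)) if j != i]
    others.sort(key=lambda j: math.dist(features[i], features[j]))
    return others[:k]


def clean_dataset(features, labels, k=3, s=1.0):
    """Return indices of samples whose predicted label set equals their own."""
    m, n_labels = len(features), len(labels[0])
    neighbors = [knn_indices(features, i, k) for i in range(m)]  # S31
    predicted = [[0] * n_labels for _ in range(m)]
    for l in range(n_labels):
        # counts[i]: how many neighbors of sample i carry label l.
        counts = [sum(labels[j][l] for j in neighbors[i]) for i in range(m)]
        prior_pos = (s + sum(row[l] for row in labels)) / (2 * s + m)
        pos_hist = [0] * (k + 1)  # neighbor-count histogram, label present
        neg_hist = [0] * (k + 1)  # neighbor-count histogram, label absent
        for i in range(m):
            if labels[i][l]:
                pos_hist[counts[i]] += 1
            else:
                neg_hist[counts[i]] += 1
        for i in range(m):  # S32: maximum a posteriori decision per label
            c = counts[i]
            like_pos = (s + pos_hist[c]) / (s * (k + 1) + sum(pos_hist))
            like_neg = (s + neg_hist[c]) / (s * (k + 1) + sum(neg_hist))
            predicted[i][l] = int(prior_pos * like_pos > (1 - prior_pos) * like_neg)
    # S33: keep a sample only if prediction and stored label set agree.
    return [i for i in range(m) if predicted[i] == list(labels[i])]


# Hypothetical data: two well-separated groups plus one mislabeled sample (index 20).
features = ([(float(x),) for x in range(10)]
            + [(float(x),) for x in range(100, 110)] + [(0.5,)])
labels = [[1, 0]] * 10 + [[0, 1]] * 10 + [[0, 1]]
kept = clean_dataset(features, labels, k=3)
```

In the toy data only the last sample sits among neighbors that all carry the other label, so it is the one the cleaning step discards.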

5. The perioperative patient sample dataset equalization method of claim 4, wherein in step S31, selecting the k neighbor samples of the seed sample specifically comprises:

obtaining the heterogeneous value difference metric (HVDM) between the seed sample and each of all or part of the samples in the temporary sample data set;

correcting the HVDM with the global imbalance weight of the samples in the temporary sample data set to obtain a corrected heterogeneous value difference metric;

and sorting the corrected heterogeneous value difference metrics between the seed sample and the samples in the temporary sample data set, and selecting the first k samples with larger corrected heterogeneous value difference metrics as the k neighbor samples of the seed sample.

6. The perioperative patient sample dataset equalization method of claim 5, wherein the heterogeneous value difference metric (HVDM) between the seed sample and a sample in the temporary sample data set is calculated by the formula:

wherein f₁ represents the feature vector of the seed sample; f₂ represents the feature vector of any sample in the temporary sample data set other than the seed sample; HVDM(f₁, f₂) represents the heterogeneous value difference metric of feature vectors f₁ and f₂; d(f₁, f₂) represents the distance between feature vectors f₁ and f₂; n represents the feature dimension of the samples in the temporary sample data set; x represents a feature index; d_x(f₁, f₂) represents the distance between feature vectors f₁ and f₂ on feature x, and d_x(f₁, f₂) is obtained by the following formula:

wherein, when feature x is a categorical feature, C denotes the number of classes of feature x and c denotes a class index of feature x,

representing the number of samples in the temporary sample data set whose value of feature x equals that in feature vector f₁ and whose class is c;

representing the number of samples in the temporary sample data set whose value of feature x equals that in feature vector f₂ and whose class is c;

representing the number of samples in the temporary sample data set whose value of feature x equals that in feature vector f₁;

representing the number of samples in the temporary sample data set whose value of feature x equals that in feature vector f₂; |f₁ - f₂| represents the absolute value of the difference between feature vectors f₁ and f₂; σ_x represents the standard deviation of feature x in the temporary sample data set.
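Since the formulas of this claim are not reproduced in the text, the following sketch uses the commonly published form of the heterogeneous value difference metric, which is consistent with the symbol definitions above but may differ in detail from the claimed formula. Assumptions: categorical features use the normalized value difference over classes, numeric features use the absolute difference divided by four standard deviations, the overall distance is the square root of the summed squared per-feature distances, each sample carries a single class, and the population standard deviation is used. The names `vdm` and `hvdm` and the toy data are hypothetical.

```python
import math
import statistics


def vdm(x, v1, v2, data, classes):
    """Normalized value difference between two values of categorical feature x.

    Both values are assumed to occur in the data set, so the counts are non-zero.
    """
    n1 = sum(1 for row in data if row[x] == v1)
    n2 = sum(1 for row in data if row[x] == v2)
    total = 0.0
    for c in set(classes):
        n1c = sum(1 for row, y in zip(data, classes) if row[x] == v1 and y == c)
        n2c = sum(1 for row, y in zip(data, classes) if row[x] == v2 and y == c)
        total += (n1c / n1 - n2c / n2) ** 2
    return math.sqrt(total)


def hvdm(f1, f2, data, classes, categorical):
    """Distance between feature vectors f1 and f2 over mixed feature types."""
    total = 0.0
    for x in range(len(f1)):
        if x in categorical:
            d_x = vdm(x, f1[x], f2[x], data, classes)
        else:
            # Assumes feature x is not constant, so sigma_x is non-zero.
            sigma_x = statistics.pstdev([row[x] for row in data])
            d_x = abs(f1[x] - f2[x]) / (4 * sigma_x)
        total += d_x ** 2
    return math.sqrt(total)


# Hypothetical data: feature 0 is categorical, feature 1 is numeric.
data = [("a", 1.0), ("a", 2.0), ("b", 3.0), ("b", 4.0)]
classes = [0, 0, 1, 1]
dist = hvdm(data[0], data[2], data, classes, categorical={0})
```

Dividing numeric differences by four standard deviations keeps most per-feature distances within roughly the same 0 to 1 range as the categorical term, so neither feature type dominates the sum.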

7. The perioperative patient sample dataset equalization method of claim 5 or 6, wherein the corrected heterogeneous value difference metric between the seed sample and a sample in the temporary sample data set is calculated by the formula:

wherein f₁ represents the feature vector of the seed sample; f₂ represents the feature vector of any sample in the temporary sample data set other than the seed sample; HVDM(f₁, f₂) represents the heterogeneous value difference metric of feature vectors f₁ and f₂; D_W(f₁, f₂) represents the corrected heterogeneous value difference metric of feature vectors f₁ and f₂; n represents the feature dimension of the samples in the temporary sample data set; IW represents the global imbalance weight of the sample whose feature vector is f₂, IW = IR_nn/(IR⁺ + IR⁻), where IR⁺ denotes the total imbalance rate of all minority-class classification labels in the temporary sample data set, IR⁻ denotes the total imbalance rate of all majority-class classification labels in the temporary sample data set, and IR_nn is the total imbalance rate of all classification labels in the classification label set of the sample whose feature vector is f₂.
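The weight IW = IR_nn/(IR⁺ + IR⁻) can be illustrated with the following minimal sketch. It assumes that each "total imbalance rate" is the plain sum of per-label imbalance rates and takes those per-label rates as given inputs, since the claim does not define how they are computed; how IW then enters the corrected metric is not recoverable from this text and is not modeled. The function name `imbalance_weight` and the numeric values are hypothetical.

```python
def imbalance_weight(sample_labels, imbalance_rate, minority_labels):
    """IW of one sample from per-label imbalance rates.

    sample_labels:   labels in the sample's classification label set.
    imbalance_rate:  mapping label -> imbalance rate (assumed given).
    minority_labels: set of labels treated as minority-class labels.
    """
    ir_plus = sum(r for label, r in imbalance_rate.items()
                  if label in minority_labels)       # IR+: minority-class labels
    ir_minus = sum(r for label, r in imbalance_rate.items()
                   if label not in minority_labels)  # IR-: majority-class labels
    ir_nn = sum(imbalance_rate[label] for label in sample_labels)
    return ir_nn / (ir_plus + ir_minus)


# Hypothetical rates: label A is a majority label, B and C are minority labels.
rates = {"A": 1.0, "B": 4.0, "C": 5.0}
iw = imbalance_weight({"B"}, rates, minority_labels={"B", "C"})
```

Under these assumptions a sample carrying rarer labels receives a larger weight, and a sample carrying every label receives a weight of 1.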

8. A perioperative patient's sample data set equalization apparatus, comprising:

the sample synthesis module is used for oversampling minority-class label samples in a sample data set of perioperative patients to obtain synthetic samples and generating a corresponding synthetic label set for the synthetic samples, wherein the sample data set comprises a plurality of samples and the classification label sets corresponding to the samples;

the temporary sample data set acquisition module is used for adding the synthetic sample and the synthetic label set into the sample data set to acquire a temporary sample data set;

and the cleaning module is used for cleaning the samples in the temporary sample data set to obtain a balanced sample data set.

9. A perioperative patient sample dataset acquisition system, comprising:

the data acquisition module is used for acquiring original perioperative characteristic data and cases of a plurality of patients;

the classification label set acquisition module is used for acquiring a classification label set based on a plurality of cases, and the classification labels represent perioperative patient risk events;

the classification label association module is used for associating the original perioperative feature data of each patient with at least one classification label in the classification label set;

the perioperative patient data dimension reduction device is used for performing dimension reduction on the original perioperative feature data of all patients to obtain corresponding perioperative feature data;

the sample data set acquisition module is used for taking the perioperative feature data of each patient as a sample and associating with the sample the classification label set corresponding to the original perioperative feature data, to obtain a sample data set of perioperative patients;

and further comprising the perioperative patient sample data set equalization apparatus of claim 8, for balancing the sample data set.

10. The perioperative patient sample dataset acquisition system of claim 9, further comprising a missing-value filling device for filling missing values in the original perioperative feature data of the patients and inputting the filled original perioperative feature data into the perioperative patient data dimension reduction device for dimension reduction.