CN115206539A

CN115206539A - Multi-label integrated classification method based on perioperative patient risk event data

Info

Publication number: CN115206539A
Application number: CN202210760528.3A
Authority: CN
Inventors: 卢莉; 王琳娜; 朱涛; 郝学超; 桑永胜
Original assignee: Sichuan University
Current assignee: Sichuan University
Priority date: 2022-06-30
Filing date: 2022-06-30
Publication date: 2022-10-18

Abstract

The invention provides a multi-label integrated classification method based on perioperative patient risk event data. The classification method comprises the following steps: acquiring characteristic data of a patient to be classified; inputting the characteristic data of the patient to be classified into a trained classification model, and outputting a classification result by the classification model; the classification model comprises a Stacking-based classification integration model, a label association rule acquisition module and a fusion module, wherein the fusion module is used for fusing a classification matrix output by the classification integration model and an association rule matrix output by the label association rule acquisition module to obtain a classification result. The classification accuracy can be improved by adopting the classification integration model based on the Stacking, and meanwhile, the classification matrix output by the classification model is corrected through the association rule matrix representing the association between the classification labels, so that the accuracy of the final classification result or risk prediction result is further improved, and the reference value of the classification result or risk prediction result is improved.

Description

Multi-label integrated classification method based on perioperative patient risk event data

Technical Field

The invention relates to the technical field of computers, in particular to a multi-label integrated classification method based on perioperative patient risk event data.

Background

The perioperative period is the period surrounding a surgical procedure: it runs from the time a patient decides to undergo surgical treatment until the patient has substantially recovered, and covers the preoperative, intraoperative and postoperative phases. Specifically, it spans from the time surgical treatment is decided upon until the surgery-related treatment is substantially complete, that is, from 5 to 7 days before the surgery to 7 to 12 days after the surgery.

According to the World Health Statistics 2021 report published by the World Health Organization (WHO), global life expectancy has risen to 73.3 years, and by 2050 the number of elderly people worldwide will exceed 1.5 billion. The growing elderly population has been identified as a major segment of the surgical market, and the prediction of risk events in elderly patients has become one of the leading research directions. Postoperative risk prediction for elderly surgical patients helps doctors formulate diagnosis and treatment plans, allocate treatment resources reasonably, and reduce the probability of postoperative risk events. At present, some diagnostic tools help hospitals provide comprehensive and reliable treatment for high-risk patients. For example, the Chinese patents with publication numbers CN111009322A and CN114038565A disclose perioperative risk assessment using a prediction model built on a perioperative patient data set. However, such prediction models are mostly single classification models; they neither give particular attention to the noise introduced when samples of minority-class labels are generated nor consider the correlation among the classification labels, so their classification accuracy is poor.

Disclosure of Invention

The invention aims to at least solve the technical problems in the prior art and provides a multi-label integration classification method based on perioperative patient risk event data.

To achieve the above objects, according to a first aspect of the present invention, there is provided a method for multi-label ensemble classification based on perioperative patient risk event data, comprising: acquiring characteristic data of a patient to be classified; inputting the characteristic data of the patient to be classified into a trained classification model, and outputting a classification result by the classification model, wherein the classification result comprises more than one classification label and a classification confidence coefficient of each classification label; the classification model comprises a Stacking-based classification integration model, a label association rule acquisition module and a fusion module, wherein the fusion module is used for fusing a classification matrix output by the classification integration model and an association rule matrix output by the label association rule acquisition module to obtain a classification result.

To achieve the above object, according to a second aspect of the present invention, there is provided a perioperative patient data multi-label classification apparatus comprising: a data acquisition module for acquiring the characteristic data of the patient to be classified; and a classification module for inputting the characteristic data of the patient to be classified into a trained classification model, the classification model outputting a classification result, wherein the classification result comprises more than one classification label and a classification confidence coefficient of each classification label; the classification model comprises a Stacking-based classification integration model, a label association rule acquisition module and a fusion module, wherein the fusion module is used for fusing a classification matrix output by the classification integration model and an association rule matrix output by the label association rule acquisition module to obtain the classification result.

To achieve the above object, according to a third aspect of the present invention, there is provided a perioperative patient risk event prediction system comprising: the data acquisition module is used for acquiring the characteristic data of the patient to be classified; the classification module is used for inputting the characteristic data of the patient to be classified into a trained classification model, and the classification model outputs a classification result, wherein the classification result comprises more than one classification label and the classification confidence of each classification label, and each classification label corresponds to a perioperative patient risk event; the classification model comprises a Stacking-based classification integration model, a label association rule acquisition module and a fusion module, wherein the fusion module is used for fusing a classification matrix output by the classification integration model and an association rule matrix output by the label association rule acquisition module to obtain a classification result; and the conversion module is used for converting the classification labels in the classification result into corresponding perioperative patient risk events to obtain a risk prediction result.

The beneficial effects of the above technical solutions are as follows: multi-label classification of the characteristic data of the patient to be classified is achieved; using the Stacking-based classification integration model in the classification process improves classification accuracy; and the classification matrix output by the classification integration model is corrected by the association rule matrix, which represents the associations between the classification labels, further improving the accuracy and hence the reference value of the final classification result or risk prediction result.

Drawings

FIG. 1 is a schematic structural diagram of a perioperative patient data dimension reduction apparatus in embodiment 1 of the present invention;

FIG. 2 is a schematic structural diagram of a perioperative patient sample dataset acquisition system according to embodiment 2 of the present invention;

FIG. 3 is a schematic flowchart of a sample data set equalization method in embodiment 3 of the present invention;

FIG. 4 is a schematic structural diagram of a sample data set equalizing apparatus in embodiment 4 of the present invention;

FIG. 5 is a schematic structural diagram of a sample data set acquisition system in embodiment 5 of the present invention;

FIG. 6 is a flow chart of a method for classifying perioperative patient risk event data based on multi-label integration in embodiment 6 of the present invention;

FIG. 7 is a schematic diagram showing a structure of a classification model in embodiment 6;

FIG. 8 is a schematic flowchart showing a preferred method of multi-label ensemble classification based on perioperative patient risk event data according to example 6;

FIG. 9 is a schematic structural diagram of a perioperative patient data multi-label classification device according to embodiment 7 of the present invention;

FIG. 10 is a schematic structural diagram of a perioperative patient risk event prediction system according to embodiment 8 of the present invention.

Detailed Description

Reference will now be made in detail to embodiments of the present invention, examples of which are illustrated in the accompanying drawings, wherein like or similar reference numerals refer to the same or similar elements or elements having the same or similar function throughout. The embodiments described below with reference to the accompanying drawings are illustrative only for the purpose of explaining the present invention, and are not to be construed as limiting the present invention.

In the description of the present invention, it is to be understood that the terms "longitudinal", "lateral", "upper", "lower", "front", "rear", "left", "right", "vertical", "horizontal", "top", "bottom", "inner", "outer", and the like, indicate orientations or positional relationships based on those shown in the drawings, and are used merely for convenience of description and for simplicity of description, and do not indicate or imply that the referenced devices or elements must have a particular orientation, be constructed in a particular orientation, and be operated, and thus, are not to be construed as limiting the present invention.

In the description of the present invention, unless otherwise specified and limited, it is to be noted that the terms "mounted," "coupled," and "connected" are to be interpreted broadly, and may be, for example, a mechanical connection or an electrical connection, a communication between two elements, a direct connection, or an indirect connection via an intermediate medium, and specific meanings of the terms may be understood by those skilled in the art according to specific situations.

Example 1

This embodiment discloses a perioperative patient data dimension reduction device, as shown in fig. 1, the device includes:

the input module is used for acquiring original perioperative characteristic data containing multi-dimensional characteristics of a patient and a classification label corresponding to the original perioperative characteristic data;

the primary dimensionality reduction module is used for carrying out dimensionality reduction on the original perioperative feature data based on a principal component analysis algorithm to obtain first perioperative feature data;

the secondary dimensionality reduction module is used for carrying out dimensionality reduction on the first perioperative feature data based on a genetic algorithm to obtain perioperative feature data;

and the output module outputs perioperative characteristic data.

In this embodiment, in order to better reflect the perioperative state of the patient, improve the accuracy of subsequent classification, and avoid the difficulty of collecting and managing postoperative patient data, the original perioperative characteristic data preferably includes preoperative and intraoperative index data of the patient, such as preoperative blood pressure, heart rate and blood lipids, and intraoperative heart rate, blood pressure, blood loss and operation duration. Some existing classification prediction models include only the preoperative baseline condition of the surgical patient and do not consider the specific intraoperative conditions. However, a number of studies have shown that intraoperative indices such as heart rate, blood pressure, blood loss and operation duration are related to the patient's postoperative condition. Using the original perioperative characteristic data provided by this embodiment can therefore improve the accuracy with which subsequent models predict postoperative events, without depending on postoperative patient index data.

In this embodiment, the class labels are used to characterize perioperative patient risk events, which preferably include, but are not limited to, unscheduled readmission, death.

In this embodiment, to enrich the data, the index data includes categorical data and numerical data. Categorical data expresses an index as a category, such as high, medium or low intraoperative blood loss; numerical data expresses an index as a numerical value, such as a blood pressure value.

In this embodiment, the original perioperative feature data can be data of a patient with a known perioperative patient risk event, and therefore, the known perioperative patient risk event can be used as the classification label associated with the original perioperative feature data. The original perioperative characteristic data can also be the data of a patient with unknown perioperative patient risk events, and the expert sets corresponding classification labels for the original perioperative characteristic data. The corresponding classification labels of the original perioperative feature data can be one, two or more.

In this embodiment, after being processed by the principal component analysis algorithm, the feature dimension of the first perioperative feature data is smaller than the feature dimension of the original perioperative feature data, and an initial population of the genetic algorithm is constructed based on the first perioperative feature data.
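The first-stage reduction described above can be illustrated with a minimal sketch. It assumes NumPy is available; the function name `pca_reduce`, the random data and the choice of 3 components are hypothetical and are not taken from the source.

```python
import numpy as np

def pca_reduce(X: np.ndarray, n_components: int) -> np.ndarray:
    """Project the rows of X onto the top principal components."""
    X_centered = X - X.mean(axis=0)
    # SVD of the centered data: the rows of Vt are the principal directions,
    # ordered by decreasing singular value.
    _, _, Vt = np.linalg.svd(X_centered, full_matrices=False)
    return X_centered @ Vt[:n_components].T

rng = np.random.default_rng(0)
X_raw = rng.normal(size=(50, 8))           # 50 hypothetical patients, 8 features
X_first = pca_reduce(X_raw, n_components=3)
print(X_first.shape)                       # fewer feature dimensions than X_raw
```

The columns of `X_first` would then play the role of the features from which the genetic algorithm builds its initial population.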

In this embodiment, to further reduce the dimension of the first perioperative feature data through a genetic algorithm, the secondary dimensionality reduction module preferably includes:

the initial population setting unit is used for setting individuals based on the first perioperative characteristic data, wherein the number of genes of an individual is less than or equal to the total number of features in the first perioperative characteristic data, and a plurality of individuals form an initial population; each gene of an individual is a feature in the first perioperative characteristic data, and the genes of each individual can be set randomly provided that the number of genes is less than or equal to the total number of features in the first perioperative characteristic data;

and the evolution iteration unit is used for repeatedly executing the following process until a termination condition is reached, and outputting the individual with the maximum fitness when the termination condition is reached: acquiring the fitness of each individual in the population of the current generation; selecting some individuals from the population of the current generation as individuals of the next-generation population based on their fitness; and performing crossover and mutation operations on the individuals of the next-generation population.

In this embodiment, the termination condition is preferably, but not limited to, that the number of evolution iterations reaches a preset maximum number of evolution iterations, or that the maximum fitness value of the individuals no longer increases during the evolution iterations, or that the increase in the maximum fitness value of the individuals is lower than an increase threshold. In each iteration, the individuals in the population of the current generation are ranked by fitness from high to low, and the top-ranked individuals are selected as individuals of the next-generation population. The crossover operation exchanges the genes at the same locus of two paired parents; the offspring obtained after the exchange are taken as individuals of the next-generation population.

In this embodiment, so that the dimension-reduced perioperative feature data performs better in subsequent classification and improves classification accuracy, the fitness of an individual is preferably obtained as follows: obtaining original perioperative characteristic data of a plurality of patients and the corresponding classification labels, and reducing the dimension of the original perioperative characteristic data according to the feature information of the individual to obtain a plurality of dimension-reduced samples whose features are consistent with those of the individual; dividing the dimension-reduced samples into a dimension-reduction training set and a dimension-reduction test set; constructing a dimension-reduction multilayer perceptron neural network; training it with the dimension-reduction training set to obtain a dimension-reduction classification prediction model; and testing that model with the dimension-reduction test set to obtain its accuracy, which is taken as the fitness of the individual.
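The evolution loop of this embodiment can be sketched as follows. This is only an illustration under stated assumptions: individuals are encoded as binary feature masks (one possible encoding, not stated in the source), the loop stops after a fixed number of generations, and `toy_fitness` is a hypothetical stand-in for the test accuracy of the multilayer perceptron described above.

```python
import random

INFORMATIVE = {0, 2, 5}  # hypothetical "useful" feature indices for the toy fitness

def toy_fitness(mask):
    """Stand-in for the MLP test accuracy (assumption, not the source's fitness)."""
    hits = sum(mask[i] for i in INFORMATIVE)
    extras = sum(mask) - hits
    return hits - 0.1 * extras

def evolve(fitness, n_features, pop_size=20, keep=10, generations=30,
           mutation_rate=0.05, seed=0):
    """Return the fittest binary feature mask found by a simple genetic algorithm."""
    rng = random.Random(seed)
    population = [[rng.randint(0, 1) for _ in range(n_features)]
                  for _ in range(pop_size)]
    for _ in range(generations):
        # Selection: rank by fitness and keep the top-ranked individuals.
        survivors = sorted(population, key=fitness, reverse=True)[:keep]
        children = []
        while len(survivors) + len(children) < pop_size:
            a, b = rng.sample(survivors, 2)
            # Crossover: cut both parents at the same locus and join the parts.
            point = rng.randrange(1, n_features)
            child = a[:point] + b[point:]
            # Mutation: flip each gene with a small probability.
            child = [1 - g if rng.random() < mutation_rate else g for g in child]
            children.append(child)
        population = survivors + children
    return max(population, key=fitness)

best_mask = evolve(toy_fitness, n_features=8)
print(best_mask, toy_fitness(best_mask))
```

In practice `fitness` would train and test a classifier on the features selected by the mask, which is far more expensive than the toy function used here.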

Example 2

The present embodiment discloses a perioperative patient sample data set acquiring system, as shown in fig. 2, the perioperative patient sample data set acquiring system includes:

the data acquisition module is used for acquiring original perioperative characteristic data and cases of a plurality of patients; case data is generally text data including doctor's diagnosis, past medical history, postoperative follow-up records, and the like;

the classification label set acquisition module is used for acquiring a classification label set based on a plurality of cases, and the classification labels represent perioperative patient risk events;

the classification label association module is used for associating and corresponding original perioperative characteristic data of the patient with at least one classification label in a classification label set, so that the original perioperative characteristic data corresponds to the classification label set which comprises at least one classification label;

the perioperative patient data dimension reduction device provided in embodiment 1 performs dimension reduction processing on the original perioperative characteristic data of all patients to obtain corresponding perioperative characteristic data;

and the sample data set acquisition module is used for associating the classification label set corresponding to the original perioperative characteristic data for the sample by taking the perioperative characteristic data of the patient as the sample to obtain the sample data set of the perioperative patient.

In this embodiment, preferably, the classification label set acquisition module specifically executes: performing word segmentation on the patient cases to obtain at least one postoperative event result (a postoperative event result is a perioperative patient risk event); grouping similar words among the postoperative event results of a plurality of patients using a trained CBOW model to obtain a plurality of similar postoperative event result sets; matching the similar postoperative event result sets against an event dictionary and finding the classification labels in the event dictionary that match them; and forming the classification label set from the resulting classification labels.

In this embodiment, the CBOW multi-word context model of Word2Vec is trained on a large medical corpus, and the PKUSEG word segmentation tool (PKUSEG supports word segmentation in multiple domains and includes a dedicated model for the medical domain) is used to segment the text of the case set in this embodiment to obtain a plurality of postoperative event results. The event dictionary is preferably, but not limited to, the Chinese version of ICD-11, the unified International Classification of Diseases published by the World Health Organization; the event dictionary contains a plurality of classification labels. Whether a similar postoperative event result set matches the event dictionary is preferably, but not limited to, judged by semantic similarity: if the semantic similarity between the similar postoperative event result set and a classification label in the event dictionary is greater than a preset similarity threshold, the two are considered matched; otherwise they are not matched.
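The threshold-based matching can be sketched as below. Cosine similarity is used here as one common choice of semantic similarity; the source does not name the measure. The three-dimensional vectors, the label names in `event_dictionary` and the threshold of 0.8 are hypothetical illustration values, whereas real vectors would come from the trained CBOW model.

```python
import math

def cosine(u, v):
    """Cosine similarity of two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0

def match_label(term_vec, dictionary, threshold=0.8):
    """Return the most similar label whose similarity exceeds the threshold, else None."""
    best_label, best_sim = None, threshold
    for label, vec in dictionary.items():
        sim = cosine(term_vec, vec)
        if sim > best_sim:
            best_label, best_sim = label, sim
    return best_label

# Hypothetical embeddings of two dictionary labels.
event_dictionary = {
    "death": [0.9, 0.1, 0.0],
    "unscheduled readmission": [0.1, 0.9, 0.2],
}
print(match_label([0.8, 0.2, 0.1], event_dictionary))  # close to "death"
print(match_label([0.5, 0.5, 0.7], event_dictionary))  # below threshold for both
```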

In this embodiment, preferably, in order to fill missing values in the data and improve data quality, the system further includes a missing-value filling device, which fills the missing values in the original perioperative characteristic data of the patient and inputs the filled original perioperative characteristic data into the perioperative patient data dimension reduction device for dimension reduction. The missing-value filling device preferably, but not exclusively, uses an existing random forest regressor filling method, the missForest filling method, the mean filling method, or the median filling method.
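The two simplest options named above, mean filling and median filling, can be sketched with the standard library alone. The function name `fill_missing`, the use of `None` to mark a missing value and the heart-rate numbers are assumptions made for illustration.

```python
from statistics import mean, median

def fill_missing(column, strategy="mean"):
    """Replace None entries with the mean or median of the observed entries."""
    observed = [v for v in column if v is not None]
    fill = mean(observed) if strategy == "mean" else median(observed)
    return [fill if v is None else v for v in column]

heart_rate = [72, None, 80, 76, None]   # hypothetical feature column
print(fill_missing(heart_rate, "mean"))
print(fill_missing(heart_rate, "median"))
```

Both strategies ignore relationships between features, which is the limitation that model-based filling methods are meant to address.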

In this embodiment, it is further preferable that the missing-value filling device fills the original perioperative characteristic data based on a Bayesian Gaussian process latent variable model.

In this embodiment, filling in missing values inevitably introduces uncertainty into the original perioperative characteristic data set. The present embodiment applies a Bayesian Gaussian process latent variable model (BGPLVM) to fill the missing values of numerical features, specifically as follows:

First, the probability density $p(y_* \mid Y)$ of the observed test data vector $y_* \in \mathbb{R}^{N \times M}$ is approximated (where $N$ is the total number of patient samples and $M$ is the total number of features); the variational distribution of the latent variable associated with the observation $y_*$ is $q(x_*)$. After the model parameters and latent variables have been learned, the BGPLVM can be used to estimate missing values [formula not reproduced in the source text], where $y_*^{O}$ denotes the values of the vector $y_*$ that are observed and $y_*^{U}$ denotes the missing values that need to be predicted. Given the partially observed point $y_*$, this embodiment seeks to reconstruct the missing part $y_*^{U}$.

The missing data are filled by learning a low-dimensional embedding of the observable variables on a small complete data set. The BGPLVM is trained on a complete data set $D$, introducing the latent variable $X$ and a new test latent variable $x_*$. As above, $y_*$ is a row vector representing the measurements of an individual patient, $y_*^{O}$ represents the known observed values, and $y_*^{U}$ represents the missing values. By maximizing a probability density [formula not reproduced in the source text], the Gaussian probability distribution of the latent variable $x_*$ corresponding to $y_*$ is obtained.

Next, the variational distribution $q(x_*)$ is optimized by maximizing an objective [formula not reproduced in the source text] while keeping all optimized quantities other than $q(x_*)$ unchanged. To predict the missing values $y_*^{U}$, the invention uses the standard Gaussian process (GP) prediction method while also taking the uncertainty of the input $x_*$ into account, $x_*$ being distributed according to $q(x_*)$. As in standard GP prediction, to predict $y_*^{U}$ the invention first predicts $f_*^{U}$, that is, the latent function values corresponding to $y_*$. Marginalizing over $x_*$ yields a multivariate density that is non-Gaussian and fully dependent; with a squared exponential kernel, however, the required quantity [expression not reproduced in the source text] is analytically tractable. In the present invention, the mean of $f_*^{U}$ provides an estimate of the missing values, and the variance quantifies the uncertainty associated with that estimate. With the BGPLVM, the latent space and the model hyperparameters are learned on the training set, and a mean estimate of each feature containing missing values is obtained from the resulting distribution.
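For orientation, the standard GP prediction that the passage above builds on can be written as follows for a single feature and a fixed test input $x_*$. The symbols $K$, $k_*$, $y$, $\sigma_n^2$, $\sigma_f^2$ and $\ell$ are notation assumed here for illustration and do not come from the source; in the BGPLVM the corresponding quantities are additionally averaged over $q(x_*)$.

```latex
% Squared exponential kernel with signal variance \sigma_f^2 and length scale \ell
k(x, x') = \sigma_f^2 \exp\!\left( -\frac{\lVert x - x' \rVert^2}{2 \ell^2} \right)

% Predictive mean and variance at x_*, with K_{ij} = k(x_i, x_j),
% (k_*)_i = k(x_i, x_*), training targets y and noise variance \sigma_n^2
\mu_* = k_*^{\top} \left( K + \sigma_n^2 I \right)^{-1} y,
\qquad
\sigma_*^2 = k(x_*, x_*) - k_*^{\top} \left( K + \sigma_n^2 I \right)^{-1} k_*
```

Under this reading, $\mu_*$ corresponds to the estimate of a missing value and $\sigma_*^2$ to the uncertainty attached to that estimate.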

In this embodiment, to facilitate data processing, it is further preferable that the system further includes an encoding device, which encodes the original perioperative characteristic data and inputs the encoded data into the missing-value filling device. The encoding device preferably, but not exclusively, uses the existing one-hot encoding rule.
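One-hot encoding of a categorical index can be sketched as follows. The function name `one_hot` and the blood-loss categories are hypothetical; the columns are ordered alphabetically purely for reproducibility.

```python
def one_hot(values):
    """Return (sorted category list, one 0/1 row per input value)."""
    categories = sorted(set(values))
    index = {c: i for i, c in enumerate(categories)}
    rows = [[1 if index[v] == i else 0 for i in range(len(categories))]
            for v in values]
    return categories, rows

# Hypothetical categorical feature: intraoperative blood loss level.
categories, encoded = one_hot(["low", "high", "medium", "low"])
print(categories)
print(encoded)
```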

In this embodiment, to facilitate data processing, it is further preferable that the system further includes a normalization device, which normalizes the encoded original perioperative characteristic data and inputs the normalized data into the missing-value filling device. The normalization device preferably, but not exclusively, uses the standard deviation (z-score) normalization method.
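Standard deviation normalization rescales each numerical feature to zero mean and unit standard deviation. A minimal sketch, assuming the population standard deviation is used and using hypothetical blood pressure values:

```python
from statistics import mean, pstdev

def standardize(column):
    """Return (v - mean) / std for each value; all zeros if the column is constant."""
    mu, sigma = mean(column), pstdev(column)
    if sigma == 0:
        return [0.0 for _ in column]
    return [(v - mu) / sigma for v in column]

z = standardize([110.0, 120.0, 130.0])   # hypothetical systolic pressures
print(z)
```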

Example 3

This embodiment provides a sample data set balancing method for perioperative patients, as shown in fig. 3, the sample data set balancing method includes:

step S1, oversampling the minority-class label samples in the sample data set of perioperative patients to obtain synthetic samples, and generating a corresponding synthetic label set for the synthetic samples, wherein the sample data set comprises a plurality of samples and the classification label set corresponding to each sample; each sample represents the perioperative characteristic data of one patient and may be either original perioperative characteristic data or the perioperative characteristic data obtained by reducing the dimension of the original perioperative characteristic data as in example 1; the classification label association process of the samples has been described in detail in example 1 and is not repeated here;

step S2, adding the synthetic samples and the synthetic label sets into the sample data set to obtain a temporary sample data set;

and step S3, cleaning the samples in the temporary sample data set to obtain a balanced sample data set.

In this embodiment, the minority-class label samples in the sample data set of perioperative patients may be oversampled by SMOTE, SVM-SMOTE, Borderline-SMOTE, K-Means SMOTE or SMOTE-NC to obtain synthetic samples, with a corresponding synthetic label set generated for the synthetic samples. Preferably, to improve the balancing effect, the MLSMOTE algorithm is used for this oversampling. MLSMOTE (Multi-Label Synthetic Minority Over-sampling Technique) is commonly used to handle data imbalance in multi-label classification tasks. Its generation process includes: selecting the minority-class labels using the imbalance ratio (IR); nearest-neighbor search, in which each sample carrying a minority-class label is taken as a seed sample and its nearest neighbors are found; feature set generation, in which a neighbor is selected and the synthetic sample is obtained by interpolation; and synthetic label set generation, in which a label set is produced for each generated synthetic sample.
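The four MLSMOTE stages listed above can be sketched in simplified form. This is an illustration under assumptions, not the source's implementation: a label counts as minority when its imbalance ratio exceeds the mean ratio, neighbors are searched only among samples carrying the same minority label, each seed produces one synthetic sample, and a synthetic label is kept when more than half of the seed and its neighbors carry it. The data set is a hypothetical toy example.

```python
import random

def imbalance_ratios(Y):
    """IR per label: count of the most frequent label / count of this label."""
    counts = [sum(col) for col in zip(*Y)]
    top = max(counts)
    return [top / c if c else float("inf") for c in counts]

def sq_dist(a, b):
    return sum((p - q) ** 2 for p, q in zip(a, b))

def mlsmote(X, Y, k=3, seed=0):
    """Return a list of synthetic (features, label set) pairs."""
    rng = random.Random(seed)
    ir = imbalance_ratios(Y)
    finite = [r for r in ir if r != float("inf")]
    mean_ir = sum(finite) / len(finite)
    minority = [j for j, r in enumerate(ir) if mean_ir < r < float("inf")]
    synthetic = []
    for j in minority:
        seeds = [i for i, y in enumerate(Y) if y[j] == 1]
        for i in seeds:
            others = [n for n in seeds if n != i]
            if not others:
                continue
            # Nearest neighbours of the seed among samples carrying label j.
            neighbours = sorted(others, key=lambda n: sq_dist(X[i], X[n]))[:k]
            ref = rng.choice(neighbours)
            gap = rng.random()
            # Feature set: interpolate between the seed and the chosen neighbour.
            x_new = [a + gap * (b - a) for a, b in zip(X[i], X[ref])]
            # Label set: labels carried by more than half of seed + neighbours.
            group = [Y[i]] + [Y[n] for n in neighbours]
            y_new = [1 if sum(col) > len(group) / 2 else 0 for col in zip(*group)]
            synthetic.append((x_new, y_new))
    return synthetic

# Toy data: label 0 is frequent, label 1 is rare.
X = [[1.0, 1.0], [1.5, 1.2], [2.0, 0.8], [1.2, 1.9], [5.0, 5.0], [6.0, 7.0]]
Y = [[1, 0], [1, 0], [1, 0], [1, 0], [1, 1], [0, 1]]
synthetic = mlsmote(X, Y)
print(len(synthetic), [y for _, y in synthetic])
```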

In this embodiment, because algorithms that synthesize minority-class samples by oversampling, such as MLSMOTE, may generate some noise samples while synthesizing minority-class label samples, these noise samples need to be cleaned; step S3 is therefore provided to improve the quality of the sample data set.

In this embodiment, preferably, in order to quickly determine the minority class labels in the sample data set, a ratio between the number of samples corresponding to each classification label and the total number of samples in the sample data set is calculated, the classification label with the ratio smaller than a ratio threshold is used as the minority class classification label, the classification label with the ratio greater than or equal to the ratio threshold is used as the majority class classification label, and the ratio threshold is preferably, but not limited to, smaller than 0.2.
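The ratio test described above can be sketched as follows. The function name `split_labels`, the label names `A`, `B`, `C` and the threshold value 0.15 (chosen below 0.2) are hypothetical.

```python
def split_labels(label_sets, ratio_threshold=0.15):
    """Split labels into (minority, majority) by their share of all samples."""
    total = len(label_sets)
    counts = {}
    for labels in label_sets:
        for label in labels:
            counts[label] = counts.get(label, 0) + 1
    minority = sorted(l for l, c in counts.items() if c / total < ratio_threshold)
    majority = sorted(l for l, c in counts.items() if c / total >= ratio_threshold)
    return minority, majority

# Ten hypothetical samples: A appears in 9, B in 3, C in 1.
samples = [{"A"}] * 6 + [{"A", "B"}] * 3 + [{"C"}]
minority, majority = split_labels(samples)
print(minority, majority)
```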

In this embodiment, preferably, the oversampling rate of a minority-class classification label is the number of samples that need to be generated for that label. In order to better determine the oversampling rate of each minority-class classification label, so that the resulting balanced sample data set performs better in subsequent classification, step S1 preferably sets an oversampling rate for each minority-class label based on a genetic algorithm, which specifically includes:

step S11, assuming that the sample data set comprises W minority-class labels, taking the oversampling rates of the samples of the W minority-class labels as the W genes of an individual, wherein W is a positive integer and each gene represents the oversampling rate of one minority-class classification label; constructing an initial population from a plurality of such individuals, the values of the W genes of each initial individual being obtained by random selection; preferably, a value range can be set as needed for the oversampling rate of each minority-class classification label, and when the initial population is constructed the gene values are randomly selected within that range;

step S12, repeatedly executing the following evolutionary iteration process until the termination condition is reached: acquiring the fitness of each individual in the population of the current generation; selecting some individuals from the population of the current generation as individuals of the next-generation population based on their fitness; and performing crossover and mutation operations on the individuals of the next-generation population;

and step S13, outputting the individual with the maximum fitness when the termination condition is reached.

In this embodiment, the termination condition is preferably, but not limited to, that the number of evolution iterations reaches a preset maximum number of evolution iterations, or that the maximum fitness value of the individual in the evolution iterations does not increase any more, or that the increase amplitude of the maximum fitness value of the individual in the evolution iterations is lower than the increase threshold. In each iteration, the fitness of the individuals in the population of the current generation is ranked from high to low, and the part of the individuals with the top rank is selected as the individuals of the population of the next generation.
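Two pieces specific to this genetic algorithm, the random initialization of oversampling-rate genes within a value range and the check of the termination conditions, can be sketched as follows. The function names, the range of 1 to 10, the limit of 50 generations and the gain threshold of 0.001 are hypothetical values chosen for illustration.

```python
import random

def init_population(n_minority_labels, pop_size, rate_range=(1, 10), seed=0):
    """Each individual holds one integer oversampling rate per minority label."""
    rng = random.Random(seed)
    low, high = rate_range
    return [[rng.randint(low, high) for _ in range(n_minority_labels)]
            for _ in range(pop_size)]

def should_stop(best_history, max_generations=50, min_gain=1e-3):
    """best_history[g] is the maximum fitness observed in generation g."""
    if len(best_history) >= max_generations:
        return True          # preset maximum number of iterations reached
    if len(best_history) >= 2:
        gain = best_history[-1] - best_history[-2]
        # Covers both "no longer increases" and "increase below the threshold".
        return gain < min_gain
    return False

population = init_population(n_minority_labels=3, pop_size=5)
print(population)
print(should_stop([0.70, 0.80]), should_stop([0.80, 0.8005]))
```

Comparing only the last two generations is the simplest form of the stagnation test; a longer window would make the stop less sensitive to a single flat generation.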

In this embodiment, in order to make the obtained equalized sample data set have a better performance when applied to subsequent classification, preferably, the process of obtaining the fitness of the individual is as follows:

obtaining minority label oversampling rate combinations based on individual gene information; the over-sampling rate combination includes the over-sampling rates of all the minority class tags;

oversampling the minority-class label samples in the sample data set of perioperative patients based on the minority-class label oversampling rate combination to obtain synthetic samples and synthetic label sets of the synthetic samples, adding the synthetic samples and the synthetic label sets to the sample data set to obtain a balanced sample set, and dividing the balanced sample set into a balanced training sample set and a balanced test sample set;

constructing a balanced multilayer perceptron neural network, training it with the balanced training sample set to obtain a balanced prediction classification model, testing the balanced prediction classification model with the balanced test sample set to obtain its accuracy, and taking the accuracy as the fitness of the individual.

In this embodiment, in order to effectively remove the noise sample and improve the quality of the sample set, preferably, step S3 is to perform a cleaning process on each sample in the temporary sample set, where the cleaning process includes:

Step S31, selecting a seed sample from the temporary sample data set and selecting k neighbor samples of the seed sample, wherein the classification labels of the k neighbor samples form a neighbor classification label set and k is a positive integer; each sample in the temporary sample data set can be selected in turn as the seed sample;

Step S32, predicting the classification label set of the seed sample through Bayesian conditional probability based on the neighbor classification label set to obtain a predicted classification label set of the seed sample;

and Step S33, judging whether the predicted classification label set of the seed sample is the same as the classification label set of the seed sample in the temporary sample data set; if so, retaining the seed sample, and if not, determining the seed sample to be a noise sample and deleting it.

The cleaning process predicts the classification label set of the seed sample directly through Bayesian conditional probability based on the neighbor classification label set of the seed sample, and compares the predicted classification label set with the real classification label set of the seed sample in the temporary sample data set. The judgment depends only on the data rather than on a classifier, which reduces the amount of computation and improves judgment efficiency and accuracy.
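Steps S31 to S33 can be sketched as follows. The text does not spell out the Bayesian estimator, so this sketch assumes an ML-kNN-style maximum a posteriori rule with Laplace smoothing; the function name `clean_by_neighbor_labels`, the `smooth` parameter, and the precomputed `neighbors` lists (which stand in for the neighbor selection of step S31) are hypothetical.

```python
def clean_by_neighbor_labels(label_sets, neighbors, smooth=1.0):
    """Return the indices of the samples that are retained.

    label_sets: list of sets, the classification label set of each sample.
    neighbors:  list of lists, the indices of the k neighbors of each sample
                (assumed to be precomputed and to exclude the sample itself).
    """
    m = len(label_sets)
    k = len(neighbors[0])
    labels = sorted(set().union(*label_sets))
    # counts[i][l]: how many neighbors of sample i carry label l.
    counts = [{l: sum(l in label_sets[j] for j in neighbors[i]) for l in labels}
              for i in range(m)]
    predicted = [set() for _ in range(m)]
    for l in labels:
        n_pos = sum(l in s for s in label_sets)
        prior_pos = (smooth + n_pos) / (2 * smooth + m)
        # with_l[c] / without_l[c]: samples that do / do not carry l and
        # have exactly c neighbors carrying l.
        with_l = [0] * (k + 1)
        without_l = [0] * (k + 1)
        for i in range(m):
            if l in label_sets[i]:
                with_l[counts[i][l]] += 1
            else:
                without_l[counts[i][l]] += 1
        for i in range(m):
            c = counts[i][l]
            like_pos = (smooth + with_l[c]) / (smooth * (k + 1) + sum(with_l))
            like_neg = (smooth + without_l[c]) / (smooth * (k + 1) + sum(without_l))
            # S32: assign the label when its posterior exceeds the alternative.
            if prior_pos * like_pos > (1 - prior_pos) * like_neg:
                predicted[i].add(l)
    # S33: keep a sample only when predicted and recorded label sets agree.
    return [i for i in range(m) if predicted[i] == label_sets[i]]
```

In this sketch a sample labeled B whose neighbors all carry label A is predicted as A and therefore removed as noise, while samples whose neighbors agree with their own labels are retained.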

In this embodiment, it is further preferable that, in step S31, the specific process of selecting k neighboring samples of the seed sample includes:

obtaining the heterogeneous value difference metric (HVDM) between the seed sample and each of all or part of the samples in the temporary sample data set; HVDM is an abbreviation of Heterogeneous Value Difference Metric;

modifying the heterogeneous value difference metric HVDM by using the global imbalance weight of the samples in the temporary sample data set to obtain a modified heterogeneous value difference metric;

and sorting the modified heterogeneous value difference metrics between the seed sample and all samples in the temporary sample data set, and selecting the first k samples with the larger modified heterogeneous value difference metrics as the k neighbor samples of the seed sample. Preferably, the modified heterogeneous value difference metrics may be sorted in descending order before the first k samples are selected.

In the process of selecting the k neighbor samples of the seed sample, a weighted kNN (WkNN) method is adopted to improve the quality of the synthesized samples. If the real minority-class label samples in the sample data set are very dispersed, that is, spatially sparse, the minority-class samples synthesized by an algorithm such as MLSMOTE are still sparsely scattered, and the data remain locally imbalanced. If kNN cleaning is used directly, the sparse minority-class samples and the new minority-class samples synthesized by MLSMOTE are removed with high probability, so that a proper classification boundary cannot be established. The kNN cleaning therefore needs to be adjusted by introducing distance weighting: when sparsely distributed samples are encountered, the local spatial density (namely, the heterogeneous value difference metric HVDM and the global imbalance weight of the samples) is taken into account so that minority-class samples are retained as far as possible. Because kNN cleaning relies mainly on the label sets of the neighbor samples, the distance calculation for the neighbor samples is particularly important when the data distribution is sparse, which is the main reason for adding the distance weighting (that is, modifying HVDM with the global imbalance weight of the samples in the temporary sample data set). WkNN cleans the noise samples while changing the distance between neighboring samples: the distance between samples is expressed by the modified heterogeneous value difference metric, which takes the local density into account.

In this embodiment, it is further preferable that the calculation formula of the heterogeneous value difference metric HVDM between the seed sample and the sample in the temporary sample data set is:

wherein f₁ represents the feature vector of the seed sample; f₂ represents the feature vector of any sample in the temporary sample data set other than the seed sample; HVDM(f₁, f₂) represents the heterogeneous value difference metric of the feature vectors f₁ and f₂; d(f₁, f₂) represents the distance between the feature vectors f₁ and f₂; n represents the feature dimension of the samples in the temporary sample data set; x represents the feature index; d_x(f₁, f₂) represents the distance between the feature vectors f₁ and f₂ on feature x, and d_x(f₁, f₂) is obtained by the following formula:

wherein, when feature x is a categorical feature, C represents the number of categories and c represents the category index; the four sample counts in the formula represent, respectively, the number of samples in the temporary sample data set whose feature x takes the same value as in f₁ and whose category is c, the number of samples whose feature x takes the same value as in f₂ and whose category is c, the number of samples whose feature x takes the same value as in f₁, and the number of samples whose feature x takes the same value as in f₂; |f₁ - f₂| represents the absolute value of the difference between the feature vectors f₁ and f₂; σ_x represents the standard deviation of feature x in the temporary sample data set.
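For reference, the heterogeneous value difference metric as commonly defined in the literature is consistent with the symbols defined above and can be written as follows. The count symbols N_{x,f,c} and N_{x,f}, the factor 4σ_x, and the squared form of the categorical distance are assumptions taken from that common definition, not from this text.

```latex
\mathrm{HVDM}(f_1, f_2) = \sqrt{\sum_{x=1}^{n} d_x(f_1, f_2)^{2}}

d_x(f_1, f_2) =
\begin{cases}
\sqrt{\displaystyle\sum_{c=1}^{C}
  \left| \frac{N_{x,f_1,c}}{N_{x,f_1}} - \frac{N_{x,f_2,c}}{N_{x,f_2}} \right|^{2}},
  & \text{if feature } x \text{ is categorical} \\[2ex]
\dfrac{\left| f_1 - f_2 \right|}{4\sigma_x},
  & \text{if feature } x \text{ is numerical}
\end{cases}
```

Here N_{x,f,c} would be the number of samples whose feature x takes the same value as in f and whose category is c, and N_{x,f} the number of samples whose feature x takes the same value as in f; in the numerical case the difference is taken on feature x only.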

In this embodiment, it is further preferable that the calculation formula of the modified heterogeneous value difference measure between the seed sample and the sample in the temporary sample data set is:

wherein f₁ represents the feature vector of the seed sample; f₂ represents the feature vector of any sample in the temporary sample data set other than the seed sample; HVDM(f₁, f₂) represents the heterogeneous value difference metric of the feature vectors f₁ and f₂; d_W(f₁, f₂) represents the modified heterogeneous value difference metric of the feature vectors f₁ and f₂; n represents the feature dimension of the samples in the temporary sample data set; IW represents the global imbalance weight of the sample whose feature vector is f₂, IW = IR_nn/(IR⁺ + IR⁻), where IR⁺ indicates the total imbalance rate of all minority-class classification labels in the temporary sample data set, IR⁻ indicates the total imbalance rate of all majority-class classification labels in the temporary sample data set, and IR_nn is the total imbalance rate of all classification labels in the classification label set of the sample whose feature vector is f₂.

In the above process of removing noise samples, when WkNN calculates the distance, the heterogeneous value difference metric (HVDM) is used to measure the distance, and HVDM is corrected by using the global imbalance weight IW of the sample as a weight coefficient. The more minority-class labels the classification label set of a sample contains, the larger IR_nn is and the larger IW becomes; for a temporary sample data set in which the minority-class label samples are sparsely distributed and the imbalance rate is large, IW is introduced into the HVDM distance to increase the local density of the minority-class samples.
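The weight IW = IR_nn/(IR⁺ + IR⁻) can be illustrated with a short sketch. The text does not define how the imbalance rate of a single label is computed, so the sketch assumes the per-label ratio commonly used with MLSMOTE (count of the most frequent label divided by the count of the label) and treats labels whose ratio exceeds the mean as minority-class labels; the function names are hypothetical.

```python
from collections import Counter


def imbalance_rates(label_sets):
    """Per-label imbalance rate (assumed definition): max count / label count."""
    counts = Counter(label for labels in label_sets for label in labels)
    top = max(counts.values())
    return {label: top / count for label, count in counts.items()}


def global_imbalance_weight(sample_labels, label_sets):
    """IW = IR_nn / (IR+ + IR-) for one sample's classification label set."""
    ir = imbalance_rates(label_sets)
    mean_ir = sum(ir.values()) / len(ir)
    # Assumption: a label is minority-class when its rate exceeds the mean.
    ir_plus = sum(v for v in ir.values() if v > mean_ir)    # minority labels
    ir_minus = sum(v for v in ir.values() if v <= mean_ir)  # majority labels
    ir_nn = sum(ir[label] for label in sample_labels)       # this sample's labels
    return ir_nn / (ir_plus + ir_minus)


data = [{"A"}, {"A"}, {"A"}, {"A"}, {"A", "B"}, {"B"}, {"C"}]
weight_rare = global_imbalance_weight({"C"}, data)    # sample with a rare label
weight_common = global_imbalance_weight({"A"}, data)  # sample with a common label
```

Under these assumptions a sample carrying the rare label C receives a larger IW than a sample carrying only the common label A, which matches the behaviour described above.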

From the formula of the modified heterogeneous value difference metric, it can be seen that the weighting coefficient scales HVDM(f₁, f₂): the more minority-class labels there are in the classification label set of a neighbor sample, the smaller the weighting coefficient. Among the samples neighboring the seed sample, the larger the IW, that is, the more minority-class labels the label set of a neighbor sample contains, the smaller the weighting coefficient of that neighbor sample, so that a monotonically decreasing form is maintained. With the feature dimension fixed, the weighting coefficients of neighbor samples are scaled differently because their label sets contain both majority-class and minority-class labels; when the feature dimension increases, that is, when the sample distribution becomes sparser, the scaling coefficient also decreases.

It can be seen that WkNN helps to select, as neighbor samples, the samples whose label sets contain more minority-class labels, and takes the distribution of labels in the neighbor samples' label sets into account, so that such samples are brought closer to the seed sample, the local density of minority-class labels is increased, and the density of majority-class labels is reduced. The whole process is as follows: first, MLSMOTE is used to oversample the minority-class label samples, and the synthesized samples together with the original samples form a balanced temporary sample set; then, on this temporary sample set, the WkNN process is carried out for each sample, that is, the k neighbor samples are selected by sorting on the weighted HVDM, and the label set of the seed sample is predicted from the neighbor samples; if the predicted label set is the same as the label set of the seed sample, the sample is retained, and otherwise the sample is deleted.

Example 4

This embodiment discloses a sample data set balancing device of perioperative patients, as shown in fig. 4, the sample data set balancing device includes:

the sample synthesis module is used for oversampling the minority-class label samples in the sample data set of perioperative patients to obtain synthesized samples and generating a corresponding synthesized label set for each synthesized sample, wherein the sample data set comprises a plurality of samples and the classification label sets corresponding to the samples;

the temporary sample data set acquisition module is used for adding the synthetic sample and the synthetic label set into the sample data set to acquire a temporary sample data set;

and the cleaning module is used for cleaning the samples in the temporary sample data set to obtain a balanced sample data set.

In this embodiment, preferably, the cleaning module includes:

the neighbor sample acquisition unit is used for selecting a seed sample from the temporary sample data set and selecting k neighbor samples of the seed sample, wherein the classification labels of the k neighbor samples form a neighbor classification label set and k is a positive integer;

the prediction classification tag set obtaining unit is used for predicting the classification tag set of the seed sample through Bayesian conditional probability based on the neighbor classification tag set to obtain a prediction classification tag set of the seed sample;

and the cleaning unit is used for judging whether the predicted classification label set of the seed sample is the same as the classification label set of the seed sample in the temporary sample data set, if so, retaining the seed sample, and if not, deleting the seed sample.

In this embodiment, it is further preferable that the specific process of selecting k neighboring samples of the seed sample by the neighboring sample acquiring unit includes:

obtaining the heterogeneous value difference metric HVDM between the seed sample and each of all or part of the samples in the temporary sample data set;

modifying the heterogeneous value difference metric HVDM by using the global imbalance weight of the samples in the temporary sample data set to obtain a modified heterogeneous value difference metric;

and sorting the modified heterogeneous value difference metrics between the seed sample and all samples in the temporary sample data set, and selecting the first k samples with the larger modified heterogeneous value difference metrics as the k neighbor samples of the seed sample.

The equalization effect of the sample data set balancing device provided by this embodiment was tested and verified, and the result is as follows:

IR denotes the imbalance rate of the sample set; the larger the IR, the more imbalanced the sample set. The above experimental results show that the maximum IR and the average IR obtained with the balancing device provided in this embodiment are the smallest, and that the gap between the maximum IR and the average IR is narrowed, which indicates that the sample set is more balanced.

Example 5

This embodiment also discloses a perioperative patient sample data set acquisition system to which a sample data set balancing device is added, that is, the system performs sample balancing processing on the dimension-reduced sample data set obtained in embodiment 2. A schematic structural diagram of the system is shown in fig. 5, and the system includes:

the data acquisition module is used for acquiring original perioperative characteristic data and cases of a plurality of patients;

the classification label correlation module is used for correlating and corresponding the original perioperative characteristic data of the patient with at least one classification label in the classification label set;

the perioperative patient data dimension reduction device is used for performing dimension reduction processing on the original perioperative characteristic data of all patients to obtain corresponding perioperative characteristic data;

the sample data set acquisition module is used for associating a classification label set corresponding to the original perioperative period characteristic data for the sample by taking perioperative period characteristic data of the patient as the sample to obtain a sample data set of the perioperative period patient;

the perioperative patient sample data set balancing device provided in embodiment 4 is further included, and is used for performing balancing processing on the sample data set.

In this embodiment, it is preferable that the system further includes a missing-value filling device, configured to fill the missing values in the original perioperative feature data of the patients and to input the filled original perioperative feature data into the perioperative patient data dimension reduction device for dimension reduction processing.

Example 6

This embodiment 6 discloses a multi-label integrated classification method based on perioperative patient risk event data, as shown in fig. 6, the multi-label classification method includes:

Step A, acquiring characteristic data of a patient to be classified; the characteristic data of the patient to be classified are characteristic data of a perioperative patient and may include multi-dimensional features. In order to make the characteristic data easier to process, reduce their dimensionality, and improve their quality, the characteristic data of the patient to be classified can be subjected to encoding and normalization in sequence and then reduced in dimension according to the sample feature dimensions output by the perioperative patient data dimension reduction device provided in embodiment 1; the dimension-reduced characteristic data of the patient to be classified are input into the trained classification model.

Step B, inputting the characteristic data of the patient to be classified into a trained classification model, outputting a classification result by the classification model, wherein the classification result comprises more than one classification label and the classification confidence of each classification label; the classification confidence of a classification label indicates the probability that the patient feature data to be classified belongs to that classification label. The classification model comprises a Stacking-based classification integration model, a label association rule acquisition module and a fusion module, wherein the fusion module is used for fusing a classification matrix output by the classification integration model and an association rule matrix output by the label association rule acquisition module to obtain a classification result, and the fusion mode is preferably but not limited to multiplying the classification matrix and the association rule matrix.
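The preferred fusion mode above, multiplying the classification matrix by the association rule matrix, can be sketched as follows. The function name `fuse`, the orientation of the product (samples by labels times labels by labels), and the diagonal value of 1 in the example are assumptions made for illustration.

```python
def fuse(classification, association):
    """Multiply a (samples x labels) matrix by a (labels x labels) matrix."""
    n_labels = len(association)
    if any(len(row) != n_labels for row in classification):
        raise ValueError("each classification row needs one score per label")
    if any(len(row) != n_labels for row in association):
        raise ValueError("the association rule matrix must be square")
    return [[sum(row[i] * association[i][j] for i in range(n_labels))
             for j in range(n_labels)]
            for row in classification]


# One patient, three labels; all values are made up for illustration.
classification = [[0.9, 0.2, 0.1]]
association = [[1.0, 0.5, 0.0],   # confidence of label 1 -> labels 1, 2, 3
               [0.4, 1.0, 0.0],
               [0.0, 0.0, 1.0]]
fused = fuse(classification, association)
```

In this example the score of the second label rises from 0.2 to 0.65 because it is strongly associated with the confidently predicted first label. A plain product can exceed 1, so a rescaling step would be needed before the fused values are read as probabilities; the text does not specify one.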

In an embodiment, preferably, the structural diagram of the classification model is shown in fig. 7, and the classification integration model includes a first multi-classification model, a second multi-classification model, a third multi-classification model and a logistic regression model; the first multi-classification model, the second multi-classification model and the third multi-classification model respectively carry out multi-label classification processing on the characteristic data of the patient to be classified to obtain a first primary classification result, a second primary classification result and a third primary classification result; and processing the first primary classification result, the second primary classification result and the third primary classification result by the logistic regression model to obtain a classification matrix.

In this embodiment, preferably, the first multi-classification model, the second multi-classification model, and the third multi-classification model are a Ranking-SVM model, a classification multilayer perceptron neural network model, and a Binary Relevance model, respectively. The Ranking-SVM model and the Binary Relevance model are conventional base models in Stacking integration, and model integration is highly reliable when they are used in Stacking integration. The classification multilayer perceptron neural network model adopts a multilayer perceptron (MLP) network structure, which helps to avoid over-fitting and has low complexity.
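The data flow of the Stacking-based classification integration model at prediction time can be sketched as follows. The three base models are replaced by stub functions, and the weights of the logistic regression layer are made-up constants rather than trained values; the function name `stack_predict` and the choice to concatenate the base outputs as the input of the logistic regression are assumptions.

```python
import math


def stack_predict(features, base_models, meta_weights, meta_bias):
    """Return one probability per label for a single patient.

    base_models:  callables mapping features to a list of per-label scores.
    meta_weights: one weight vector per label over the concatenated scores.
    meta_bias:    one bias per label.
    """
    # Level 0: collect the primary classification results of all base models.
    meta_input = [score for model in base_models for score in model(features)]
    # Level 1: one logistic regression per label on the concatenated scores.
    probabilities = []
    for weights, bias in zip(meta_weights, meta_bias):
        z = bias + sum(w * s for w, s in zip(weights, meta_input))
        probabilities.append(1.0 / (1.0 + math.exp(-z)))
    return probabilities


# Stubs standing in for the Ranking-SVM, MLP and Binary Relevance models.
ranking_svm = lambda x: [1.0, 0.0]
mlp = lambda x: [1.0, 0.0]
binary_relevance = lambda x: [1.0, 0.0]

# Hypothetical weights: each label listens to its own score in every model.
weights = [[1, 0, 1, 0, 1, 0], [0, 1, 0, 1, 0, 1]]
bias = [-1.5, -1.5]
row = stack_predict([0.3, 0.7], [ranking_svm, mlp, binary_relevance], weights, bias)
```

Each call yields one row of the classification matrix; stacking the rows for all patients gives the matrix that is passed to the fusion module.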

In this embodiment, it is preferable that the method further includes a step of constructing a sample data set of perioperative patients; as shown in fig. 8, the sample data set is preferably, but not limited to, constructed by using the system of embodiment 2 or embodiment 5.

In an embodiment, as shown in fig. 7, the training process of the classification integration model is as follows:

constructing a sample data set of perioperative patients, wherein each sample in the sample data set is associated with more than one classification label, the sample data set is divided into a classification training set and a classification test set, and the association of the classification labels can be performed in a manual mode;

constructing a classification integration model, namely the Stacking-based integration model, which comprises a first multi-classification model, a second multi-classification model, a third multi-classification model and a logistic regression model;

and training the classification integration model by using the classification training set, and testing and verifying the trained classification integration model by using the classification test set. During validation, cross-validation is performed on the training set using RandomizedSearchCV and GridSearchCV, and the hyperparameters are selected according to the F1_Micro score (micro-averaged F1).
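The micro-averaged F1 score used for hyperparameter selection pools true positives, false positives, and false negatives over all labels and samples before computing F1. A minimal sketch follows; the function name `micro_f1` and the representation of label sets as Python sets are assumptions. In scikit-learn the same criterion is requested with the scoring string `"f1_micro"`.

```python
def micro_f1(true_sets, predicted_sets):
    """Micro-averaged F1 over paired true and predicted label sets."""
    pairs = list(zip(true_sets, predicted_sets))
    tp = sum(len(t & p) for t, p in pairs)  # labels correctly predicted
    fp = sum(len(p - t) for t, p in pairs)  # labels predicted but absent
    fn = sum(len(t - p) for t, p in pairs)  # labels present but missed
    denominator = 2 * tp + fp + fn
    return 2 * tp / denominator if denominator else 0.0


score = micro_f1([{"A", "B"}, {"C"}], [{"A"}, {"C", "B"}])
```

Because the counts are pooled, frequent labels influence the score more than rare ones, which is worth keeping in mind for imbalanced label sets.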

In this embodiment, as shown in fig. 7, preferably, the association rule obtaining module performs the following steps:

acquiring a sample data set of perioperative patients, wherein each sample in the sample data set is associated with more than one classification label; the sample data set is preferably, but not limited to, the perioperative patient sample data set obtained in example 2 or example 5, i.e. a standard patient data set.

mining association rules of the classification labels in the sample data set to obtain an association rule matrix. The association rule matrix includes the association confidence between any two classification labels among all the classification labels.

In this embodiment, as shown in fig. 7, it is further preferable that, when the number of classification labels in the sample data set is small, specifically smaller than the number threshold, the FP-growth algorithm is used directly to mine association rules among the classification labels in the sample data set. First, a classification label matrix as shown in fig. 7 is established, in which the first row lists the labels and the first column lists the patient numbers; then, the FP-growth algorithm is used to perform association rule analysis on the classification label matrix and output the association confidence between any two classification labels, the association confidence ranging from 0 to 1. Based on the association confidences, an association rule matrix as shown in fig. 7 is established; in the association rule matrix, the first row and the first column both list the classification labels, and each element represents the association confidence between the classification labels of the row and the column in which the element is located; for example, as shown in fig. 7, a(N-1) represents the association confidence between classification label N and classification label 1.
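For rules between two labels, the association confidence can be obtained by direct counting, which gives a compact way to illustrate the association rule matrix. The sketch below does not implement FP-growth and applies no minimum-support pruning; the function name and the convention that element [i][j] is the confidence of the rule "label i implies label j" are assumptions.

```python
def association_confidence_matrix(label_rows):
    """Build the pairwise confidence matrix from a 0/1 classification label matrix.

    label_rows: one row per patient, one 0/1 entry per classification label.
    Element [i][j] is support(i and j) / support(i).
    """
    n_labels = len(label_rows[0])
    support = [sum(row[i] for row in label_rows) for i in range(n_labels)]
    matrix = []
    for i in range(n_labels):
        confidences = []
        for j in range(n_labels):
            both = sum(1 for row in label_rows if row[i] and row[j])
            confidences.append(both / support[i] if support[i] else 0.0)
        matrix.append(confidences)
    return matrix


# Four patients, three labels; values are made up for illustration.
patients = [[1, 1, 0],
            [1, 0, 0],
            [1, 1, 1],
            [0, 1, 0]]
rules = association_confidence_matrix(patients)
```

Every entry lies between 0 and 1, and the matrix is generally not symmetric: here the only patient with the third label also has the first, so the confidence of the rule from the third label to the first is 1, while the reverse confidence is 1/3.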

In this embodiment, preferably, when the number of classification labels in the sample data set is large, specifically greater than or equal to the number threshold, the correlation patterns between the classification labels may differ, and performing the association analysis directly may make the item set search complex and affect the accuracy of the association analysis; the number threshold is preferably, but not limited to, 3, 4, or 5. In this case, mining association rules of the classification labels in the sample data set to obtain the association rule matrix comprises the following steps:

clustering the classification labels in the sample data set to obtain more than one cluster, the clustering preferably, but not limited to, being performed with the K-means++ algorithm; and mining association rules of the classification labels in each cluster to obtain an association rule submatrix. During fusion, the classification matrix is divided into more than one sub-classification matrix according to the clustering result, with one cluster corresponding to one sub-classification matrix; each sub-classification matrix is multiplied by the association rule submatrix of the corresponding cluster to obtain the classification sub-result of that cluster, and all the classification sub-results together form the classification result.
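The cluster-wise fusion can be sketched as follows, taking the clustering result as given. The function name `fuse_by_cluster`, the representation of a cluster as a list of label column indices, and the way the sub-results are written back into the original column order are assumptions; the clustering itself (for example with K-means++) is not shown.

```python
def fuse_by_cluster(classification, clusters, sub_matrices):
    """Fuse each cluster's sub-classification matrix with its rule submatrix.

    classification: rows of per-label scores, one row per patient.
    clusters:       list of clusters, each a list of label column indices.
    sub_matrices:   one square association rule submatrix per cluster, with
                    rows and columns ordered like the cluster's index list.
    """
    fused = [[0.0] * len(row) for row in classification]
    for columns, sub in zip(clusters, sub_matrices):
        for r, row in enumerate(classification):
            for b, col_out in enumerate(columns):
                # Matrix product restricted to the labels of this cluster.
                fused[r][col_out] = sum(row[col_in] * sub[a][b]
                                        for a, col_in in enumerate(columns))
    return fused


# One patient, four labels split into two clusters; values are made up.
scores = [[0.9, 0.2, 0.1, 0.6]]
clusters = [[0, 1], [2, 3]]
sub_matrices = [[[1.0, 0.5], [0.4, 1.0]],
                [[1.0, 0.0], [0.5, 1.0]]]
result = fuse_by_cluster(scores, clusters, sub_matrices)
```

Labels in different clusters do not influence each other in this scheme, which keeps each item set search small at the cost of ignoring cross-cluster associations.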

In this embodiment, it is further preferable that association rule mining is performed on the classification tags in each classification cluster through an FP-growth algorithm to obtain an association rule submatrix, and the obtaining process of the association rule submatrix is consistent with the process in fig. 7, which has been described in detail in the above preferred embodiment and is not described again here.

Example 7

The embodiment discloses a perioperative patient data multi-label classification device, as shown in fig. 9, including:

the data acquisition module is used for acquiring characteristic data of the patient to be classified;

the classification module is used for inputting the characteristic data of the patient to be classified into a trained classification model, and the classification model outputs a classification result which comprises more than one classification label and the classification confidence of each classification label; the classification model comprises a Stacking-based classification integration model, a label association rule acquisition module and a fusion module, wherein the fusion module is used for fusing a classification matrix output by the classification integration model and an association rule matrix output by the label association rule acquisition module to obtain a classification result.

In this embodiment, preferably, the classification integration model includes a first multi-classification model, a second multi-classification model, a third multi-classification model and a logistic regression model; the first multi-classification model, the second multi-classification model and the third multi-classification model respectively carry out multi-label classification processing on the characteristic data of the patient to be classified to obtain a first primary classification result, a second primary classification result and a third primary classification result; and processing the first primary classification result, the second primary classification result and the third primary classification result by the logistic regression model to obtain a classification matrix.

In this embodiment, it is preferable that the system further includes a classification integration model training module, and the classification integration model training module performs the following processes:

constructing a sample data set of perioperative patients, associating more than one classification label with each sample in the sample data set, and dividing the sample data set into a classification training set and a classification test set; the sample data set of perioperative patients is preferably, but not limited to, constructed by the system provided in embodiment 2 or embodiment 5;

constructing a classification integration model; the classification integration model comprises a first multi-classification model, a second multi-classification model, a third multi-classification model and a logistic regression model;

and training the classification integrated model by using a classification training set, and testing and verifying the trained classification integrated model by using a classification testing set.

In this embodiment, the classification device builds a multi-label classification integration model for perioperative postoperative events that incorporates association rule analysis. Multiple postoperative risk events may occur after an operation; to study and predict such multi-event outcomes, a multi-label prediction model is built by integrating a Ranking-SVM model, a multilayer perceptron neural network model, and a Binary Relevance model, and association rules are fused into the prediction model to further improve its stability and accuracy.

Example 8

The present embodiment discloses a perioperative patient risk event prediction system, as shown in fig. 10, including: the data acquisition module is used for acquiring characteristic data of the patient to be classified; the classification module is used for inputting the characteristic data of the patient to be classified into a trained classification model, the classification model outputs a classification result, the classification result comprises more than one classification label and the classification confidence coefficient of each classification label, and each classification label corresponds to a perioperative patient risk event; the classification model comprises a Stacking-based classification integration model, a label association rule acquisition module and a fusion module, wherein the fusion module is used for fusing a classification matrix output by the classification integration model and an association rule matrix output by the label association rule acquisition module to obtain a classification result; and the conversion module is used for converting the classification labels in the classification result into corresponding perioperative patient risk events to obtain a risk prediction result.

In this embodiment, preferably, the classification-integration model includes a first multi-classification model, a second multi-classification model, a third multi-classification model, and a logistic regression model; the first multi-classification model, the second multi-classification model and the third multi-classification model respectively carry out multi-label classification processing on the characteristic data of the patient to be classified to obtain a first primary classification result, a second primary classification result and a third primary classification result; and processing the first primary classification result, the second primary classification result and the third primary classification result by the logistic regression model to obtain a classification matrix.

In this embodiment, the sample data set is acquired by the system provided in embodiment 2 or embodiment 5, risk events in the perioperative period of a patient (particularly an elderly surgical patient) are predicted, and, on the basis of repairing the missing and imbalanced data set, association rule analysis is fused in to build a multi-label prediction model for postoperative events. Postoperative event labels are extracted from patient case texts: a large medical corpus is collected, a medical word vector model is trained with the CBOW model of Word2Vec, and the postoperative event label set (namely, the classification label set) is extracted. Missing data are then filled with a Bayesian Gaussian process latent variable model, label imbalance is handled with MLSMOTE, weighted kNN (WkNN), and a genetic algorithm, and finally a feature dimension reduction model combining principal component analysis (PCA) with a genetic algorithm is built to provide more relevant input for the classification integration model.

While embodiments of the invention have been shown and described, it will be understood by those of ordinary skill in the art that: various changes, modifications, substitutions and alterations can be made to the embodiments without departing from the principles and spirit of the invention, the scope of which is defined by the claims and their equivalents.

Claims

1. A multi-label integrated classification method based on perioperative patient risk event data is characterized by comprising the following steps:

acquiring characteristic data of a patient to be classified;

inputting the characteristic data of the patient to be classified into a trained classification model, and outputting a classification result by the classification model, wherein the classification result comprises more than one classification label and a classification confidence coefficient of each classification label;

the classification model comprises a Stacking-based classification integration model, a label association rule acquisition module and a fusion module, wherein the fusion module is used for fusing a classification matrix output by the classification integration model and an association rule matrix output by the label association rule acquisition module to obtain a classification result.

2. The perioperative patient risk event data-based multi-label integrated classification method of claim 1, wherein the classification integration model includes a first multi-classification model, a second multi-classification model, a third multi-classification model, and a logistic regression model;

the first multi-classification model, the second multi-classification model and the third multi-classification model respectively carry out multi-label classification processing on the characteristic data of the patient to be classified to obtain a first primary classification result, a second primary classification result and a third primary classification result;

and processing the first primary classification result, the second primary classification result and the third primary classification result by the logistic regression model to obtain a classification matrix.

3. The perioperative patient risk event data-based multi-label integrated classification method of claim 2, wherein the training process of the classification integration model is:

constructing a sample data set of perioperative patients, associating more than one classification label with each sample in the sample data set, and dividing the sample data set into a classification training set and a classification testing set;

constructing a classification integration model;

and training the classification integration model by using a classification training set, and testing and verifying the trained classification integration model by using a classification testing set.

4. The perioperative patient risk event data-based multi-label integrated classification method of claim 1, 2 or 3, wherein the label association rule acquisition module performs the following steps:

acquiring a sample data set of perioperative patients, wherein each sample in the sample data set is associated with more than one classification label;

and mining association rules of the classification labels in the sample data set to obtain an association rule matrix.

5. The perioperative patient risk event data-based multi-label integrated classification method of claim 4, wherein association rule mining is performed on the classification labels in the sample data set through the FP-growth algorithm.

6. The perioperative patient risk event data-based multi-label integrated classification method according to claim 4, wherein the step of performing association rule mining on the classification labels in the sample data set to obtain an association rule matrix specifically includes:

clustering classification labels in the sample data set to obtain more than one cluster;

and mining association rules of the classification labels in each cluster to obtain an association rule submatrix.

7. The perioperative patient risk event data-based multi-label integrated classification method according to claim 6, wherein association rule mining is performed on the classification labels in each cluster through the FP-growth algorithm to obtain an association rule submatrix.

8. The perioperative patient risk event data-based multi-label integrated classification method of claim 5 or 7, wherein the association rule matrix includes the association confidence between any two of the classification labels.

9. A perioperative patient data multi-label classification device, comprising:

the data acquisition module is used for acquiring the characteristic data of the patient to be classified;

the classification module is used for inputting the characteristic data of the patient to be classified into a trained classification model, and the classification model outputs a classification result, wherein the classification result comprises more than one classification label and a classification confidence coefficient of each classification label;

10. A perioperative patient risk event prediction system, comprising:

the classification module is used for inputting the characteristic data of the patient to be classified into a trained classification model, and the classification model outputs a classification result, wherein the classification result comprises more than one classification label and the classification confidence of each classification label, and each classification label corresponds to a perioperative patient risk event;

the classification model comprises a Stacking-based classification integration model, a label association rule acquisition module and a fusion module, wherein the fusion module is used for fusing a classification matrix output by the classification integration model and an association rule matrix output by the label association rule acquisition module to obtain a classification result;

and the conversion module is used for converting the classification labels in the classification result into corresponding perioperative patient risk events to obtain a risk prediction result.