CN115295105A - Perioperative patient data dimension reduction device and sample data set acquisition system

Info

Publication number: CN115295105A
Application number: CN202210763377.7A
Authority: CN
Inventors: 卢莉; 王琳娜; 朱涛; 郝学超; 桑永胜
Original assignee: Sichuan University
Current assignee: Sichuan University
Priority date: 2022-06-30
Filing date: 2022-06-30
Publication date: 2022-11-04

Abstract

The invention provides a perioperative patient data dimension reduction device and a sample data set acquisition system. The dimension reduction device comprises: an input module, used for acquiring original perioperative feature data of a patient, which contains multi-dimensional features, and the classification label corresponding to the original perioperative feature data; a primary dimensionality reduction module, used for reducing the dimension of the original perioperative feature data based on a principal component analysis algorithm to obtain first perioperative feature data; a secondary dimensionality reduction module, used for reducing the dimension of the first perioperative feature data based on a genetic algorithm to obtain the perioperative feature data; and an output module, used for outputting the perioperative feature data. Combining the principal component analysis algorithm with the genetic algorithm screens out lower-dimensional perioperative feature data that is better suited to subsequent classification, which speeds up the subsequent classification while maintaining strong classification performance.

Description

Perioperative patient data dimension reduction device and sample data set acquisition system

Technical Field

The invention relates to the technical field of computers, in particular to a perioperative patient data dimension reduction device and a sample data set acquisition system.

Background

The perioperative period is the period surrounding the entire surgical procedure. It runs from the time the patient decides to undergo surgical treatment until the treatment related to the operation is substantially complete, and covers the time before, during, and after the operation. It lasts from about 5-7 days before the operation to about 7-12 days after the operation.

According to World Health Statistics 2021, published by the World Health Organization (WHO), global life expectancy has risen to 73.3 years, and the global elderly population is expected to exceed 1.5 billion by 2050. The growing elderly population worldwide has been identified as a major segment of the surgical market, and risk event prediction for elderly patients has become a leading research direction. Predicting postoperative risk for elderly surgical patients helps doctors formulate diagnosis and treatment plans, allocate treatment resources reasonably, and reduce the probability of postoperative risk events. Some diagnostic tools already help hospitals provide comprehensive and reliable treatment for high-risk patients. For example, the Chinese patents with publication numbers CN111009322A and CN114038565A disclose perioperative risk assessment using a prediction model based on a perioperative data set of patients. However, the perioperative data set of patients is high-dimensional, which directly reduces the operating efficiency of a perioperative prediction model, while reducing the dimension blindly degrades the prediction performance of the model.

Disclosure of Invention

The invention aims to solve the technical problems in the prior art and provides a perioperative patient data dimension reduction device and a sample data set acquisition system.

To achieve the above object, according to a first aspect of the present invention, there is provided a perioperative patient data dimension reduction device comprising: an input module, used for acquiring original perioperative feature data of a patient, which contains multi-dimensional features, and the classification label corresponding to the original perioperative feature data; a primary dimensionality reduction module, used for reducing the dimension of the original perioperative feature data based on a principal component analysis algorithm to obtain first perioperative feature data; a secondary dimensionality reduction module, used for reducing the dimension of the first perioperative feature data based on a genetic algorithm to obtain the perioperative feature data; and an output module, used for outputting the perioperative feature data.

The beneficial effects of the above technical scheme are as follows: a principal component analysis algorithm and a genetic algorithm are combined for data dimension reduction. A primary dimension reduction by principal component analysis first yields first perioperative feature data that represents the original perioperative feature data well. The first perioperative feature data is then used as the input of the genetic algorithm and as its heuristic data set, so that the initial population of the genetic algorithm consists of feature combinations that are lower-dimensional and better than the original perioperative feature data. This creates a small search range for further feature selection, which improves operating efficiency and screens out lower-dimensional perioperative feature data that is better suited to subsequent classification, speeding up the subsequent classification while maintaining strong classification performance.

To achieve the above object, according to a second aspect of the present invention, there is provided a perioperative patient sample data set acquisition system comprising: a data acquisition module, used for acquiring original perioperative feature data and case records of a plurality of patients; a classification label set acquisition module, used for acquiring a classification label set based on the plurality of case records, where the classification labels represent perioperative patient risk events; a classification label association module, used for associating the original perioperative feature data of each patient with at least one classification label in the classification label set; the perioperative patient data dimension reduction device of the first aspect of the invention, which performs dimension reduction on the original perioperative feature data of all patients to obtain the corresponding perioperative feature data; and a sample data set acquisition module, used for taking the perioperative feature data of each patient as a sample and associating the sample with the classification label set corresponding to its original perioperative feature data, to obtain the sample data set of perioperative patients.

The beneficial effects of the above technical scheme are as follows: a multi-label sample data set of perioperative patients is constructed. The samples in the data set have low feature dimensionality, which speeds up subsequent classification and model training, and the features retained in the samples are all features with a large influence on subsequent classification, so that the subsequent classification performs well.

Drawings

FIG. 1 is a schematic structural diagram of a perioperative patient data dimension reduction apparatus in embodiment 1 of the present invention;

FIG. 2 is a schematic structural diagram of a perioperative patient sample data set acquisition system in embodiment 2 of the present invention;

FIG. 3 is a schematic flowchart of a sample data set equalization method in embodiment 3 of the present invention;

FIG. 4 is a schematic structural diagram of a sample data set equalization apparatus in embodiment 4 of the present invention;

FIG. 5 is a schematic structural diagram of a sample data set acquisition system in embodiment 5 of the present invention;

FIG. 6 is a flowchart illustrating a perioperative patient data multi-label classification method according to embodiment 6 of the present invention;

FIG. 7 is a schematic structural diagram of the classification model in embodiment 6;

FIG. 8 is a schematic flowchart of a preferred perioperative patient data multi-label classification method in embodiment 6;

FIG. 9 is a schematic structural diagram of a perioperative patient data multi-label classification device according to embodiment 7 of the present invention;

FIG. 10 is a schematic structural diagram of a perioperative patient risk event prediction system according to embodiment 8 of the present invention.

Detailed Description

Reference will now be made in detail to embodiments of the present invention, examples of which are illustrated in the accompanying drawings, wherein like or similar reference numerals refer to the same or similar elements or elements having the same or similar function throughout. The embodiments described below with reference to the accompanying drawings are illustrative only for the purpose of explaining the present invention, and are not to be construed as limiting the present invention.

In the description of the present invention, it is to be understood that the terms "longitudinal", "lateral", "upper", "lower", "front", "rear", "left", "right", "vertical", "horizontal", "top", "bottom", "inner", "outer", and the like, indicate orientations or positional relationships based on those shown in the drawings, and are used merely for convenience of description and for simplicity of description, and do not indicate or imply that the referenced devices or elements must have a particular orientation, be constructed in a particular orientation, and be operated, and thus, are not to be construed as limiting the present invention.

In the description of the present invention, unless otherwise specified and limited, it is to be noted that the terms "mounted," "coupled," and "connected" are to be interpreted broadly. A connection may be, for example, a mechanical connection or an electrical connection, a communication between two elements, a direct connection, or an indirect connection via an intermediate medium, and those skilled in the art can understand the specific meanings of these terms according to the specific situation.

Example 1

This embodiment discloses a perioperative patient data dimension reduction device. As shown in FIG. 1, the device includes:

an input module, used for acquiring original perioperative feature data of a patient, which contains multi-dimensional features, and the classification label corresponding to the original perioperative feature data;

a primary dimensionality reduction module, used for reducing the dimension of the original perioperative feature data based on a principal component analysis algorithm to obtain first perioperative feature data;

a secondary dimensionality reduction module, used for reducing the dimension of the first perioperative feature data based on a genetic algorithm to obtain the perioperative feature data;

and an output module, used for outputting the perioperative feature data.

In this embodiment, in order to better reflect the perioperative state of the patient, improve the accuracy of subsequent classification, and avoid relying on postoperative patient data, which is difficult to collect and manage, the original perioperative feature data preferably includes preoperative and intraoperative index data of the patient, such as preoperative blood pressure, heart rate, and blood lipids, and intraoperative heart rate, blood pressure, blood loss, and operation duration. Some existing classification prediction models include only the preoperative baseline condition of the surgical patient and do not consider what happens during the operation. However, a number of studies have shown that intraoperative indexes such as heart rate, blood pressure, blood loss, and operation duration are related to the postoperative condition of the patient. Using the original perioperative feature data provided by this embodiment can therefore improve the accuracy with which a subsequent model predicts postoperative events, without depending on postoperative index data.

In this embodiment, the classification labels characterize perioperative patient risk events, which preferably include, but are not limited to, unplanned readmission and death.

In this embodiment, to enrich the data, the index data includes categorical data and numerical data. Categorical data represents an index by category, such as high, medium, or low intraoperative blood loss; numerical data represents an index by a numerical value, such as a blood pressure value.

In this embodiment, the original perioperative feature data can be data of a patient whose perioperative risk event is known, in which case the known risk event is used as the classification label associated with the data. It can also be data of a patient whose perioperative risk event is unknown, in which case an expert assigns the corresponding classification labels. The original perioperative feature data may correspond to one, two, or more classification labels.

In this embodiment, after being processed by the principal component analysis algorithm, the feature dimension of the first perioperative feature data is smaller than the feature dimension of the original perioperative feature data, and an initial population of the genetic algorithm is constructed based on the first perioperative feature data.
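As an illustration of the primary dimensionality reduction, the following minimal sketch projects feature rows onto their leading principal components. It uses plain Python with power iteration and deflation. The function name `pca_reduce`, its parameters, and the toy data are assumptions made for this example; the embodiment does not specify how the principal component analysis is implemented.

```python
import math
import random


def pca_reduce(rows, n_components, n_iter=200, seed=0):
    """Project rows onto their top principal components.

    Hypothetical helper: eigenvectors of the covariance matrix are found
    by power iteration with deflation, which is adequate for a small sketch.
    """
    n, d = len(rows), len(rows[0])
    means = [sum(r[j] for r in rows) / n for j in range(d)]
    centered = [[r[j] - means[j] for j in range(d)] for r in rows]
    # Sample covariance matrix (d x d).
    cov = [[sum(c[i] * c[j] for c in centered) / (n - 1) for j in range(d)]
           for i in range(d)]
    rng = random.Random(seed)
    components = []
    for _ in range(n_components):
        v = [rng.uniform(-1.0, 1.0) for _ in range(d)]
        for _ in range(n_iter):
            w = [sum(cov[i][j] * v[j] for j in range(d)) for i in range(d)]
            norm = math.sqrt(sum(x * x for x in w))
            if norm < 1e-12:  # no variance left in any direction
                v = None
                break
            v = [x / norm for x in w]
        if v is None:
            break  # fewer informative components than requested
        eigval = sum(v[i] * cov[i][j] * v[j]
                     for i in range(d) for j in range(d))
        components.append(v)
        # Deflation: remove the component just found from the covariance.
        cov = [[cov[i][j] - eigval * v[i] * v[j] for j in range(d)]
               for i in range(d)]
    # Coordinates of each centered row along the selected components.
    return [[sum(c[j] * comp[j] for j in range(d)) for comp in components]
            for c in centered]


# Toy data: three features, but all variance lies along one direction.
raw = [[1.0, 2.0, 5.0], [2.0, 4.0, 5.0], [3.0, 6.0, 5.0], [4.0, 8.0, 5.0]]
first_stage = pca_reduce(raw, n_components=1)
```

In this toy case a single component retains all of the variance of the three original features, so the first-stage data has one column instead of three. The columns of the first-stage data are what the genetic algorithm then selects from.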

In this embodiment, to further reduce the dimension of the first perioperative feature data through a genetic algorithm, the secondary dimensionality reduction module preferably includes:

an initial population setting unit, used for setting individuals based on the first perioperative feature data, where a plurality of individuals form the initial population. Each gene of an individual is one feature of the first perioperative feature data, and the genes of each individual can be set randomly, provided that the number of genes of the individual is less than or equal to the total number of features in the first perioperative feature data;

an evolution iteration unit, used for repeatedly executing the following process until a termination condition is reached, and outputting the individual with the maximum fitness when the termination condition is reached: acquiring the fitness of each individual in the population of the current generation; selecting some individuals from the population of the current generation as individuals of the next-generation population based on their fitness; and performing crossover and mutation operations on the individuals of the next-generation population.

In this embodiment, the termination condition is preferably, but not limited to, one of the following: the number of evolution iterations reaches a preset maximum number of evolution iterations; the maximum individual fitness no longer increases across iterations; or the increase in the maximum individual fitness across iterations falls below an increase threshold. In each iteration, the individuals of the current generation are ranked by fitness from high to low, and the top-ranked individuals are selected as individuals of the next-generation population. The crossover operation exchanges the genes at the same positions of two paired parents; the offspring obtained after the exchange are taken as individuals of the next-generation population.

In this embodiment, so that the dimension-reduced perioperative feature data performs better in subsequent classification and improves classification accuracy, the fitness of an individual is preferably obtained as follows: obtain the original perioperative feature data of a plurality of patients and the corresponding classification labels, and reduce the dimension of that data according to the feature information of the individual to obtain a plurality of dimension-reduced samples whose features are consistent with the individual; divide the dimension-reduced samples into a dimension-reduction training set and a dimension-reduction test set; construct a multilayer perceptron neural network for the dimension-reduced data; train this network with the dimension-reduction training set to obtain a dimension-reduction classification prediction model; and test this model with the dimension-reduction test set to obtain its accuracy, which is taken as the fitness of the individual.
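The evolutionary loop of the secondary dimensionality reduction can be sketched as follows. This is a minimal illustration under several assumptions: an individual is encoded as a 0/1 mask over the first-stage features (one possible encoding of "which features are genes"), the selected top-ranked individuals are carried over unchanged and the rest of the population is refilled with their offspring, and the loop stops only at a maximum number of generations. The names `ga_select_features` and `toy_fitness` are hypothetical. In the embodiment the fitness is the test accuracy of a trained multilayer perceptron; here a simple scoring function stands in for it so that the example stays self-contained.

```python
import random


def ga_select_features(n_features, fitness, pop_size=20, max_gens=30,
                       keep_ratio=0.5, mutation_rate=0.05, seed=0):
    """Search for a feature mask with high fitness (sketch, n_features >= 2)."""
    rng = random.Random(seed)
    # Initial population: random masks, 1 = keep the feature.
    population = [[rng.randint(0, 1) for _ in range(n_features)]
                  for _ in range(pop_size)]
    best, best_fit = None, float("-inf")
    for _ in range(max_gens):
        # Rank the current generation by fitness, high to low.
        scored = sorted(((fitness(ind), ind) for ind in population),
                        key=lambda pair: pair[0], reverse=True)
        if scored[0][0] > best_fit:
            best_fit, best = scored[0][0], scored[0][1][:]
        # Selection: the top-ranked part survives.
        n_keep = max(2, int(pop_size * keep_ratio))
        parents = [ind for _, ind in scored[:n_keep]]
        children = []
        while len(parents) + len(children) < pop_size:
            a, b = rng.sample(parents, 2)
            point = rng.randrange(1, n_features)
            child = a[:point] + b[point:]  # single-point crossover
            # Mutation: flip each gene with a small probability.
            child = [1 - g if rng.random() < mutation_rate else g
                     for g in child]
            children.append(child)
        population = parents + children
    return best, best_fit


def toy_fitness(mask):
    # Hypothetical stand-in for MLP test accuracy: features 0 and 2 are
    # informative, and every kept feature carries a small cost.
    return mask[0] + mask[2] - 0.1 * sum(mask)


best_mask, best_score = ga_select_features(6, toy_fitness)
```

To follow the embodiment more closely, `toy_fitness` would be replaced by a function that keeps the masked columns, trains a multilayer perceptron on the training split, and returns its accuracy on the test split.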

Example 2

This embodiment discloses a perioperative patient sample data set acquisition system. As shown in FIG. 2, the system includes:

a data acquisition module, used for acquiring original perioperative feature data and case records of a plurality of patients, where case data is generally text data, including the doctor's diagnosis, past medical history, postoperative follow-up records, and the like;

a classification label set acquisition module, used for acquiring a classification label set based on the plurality of case records, where the classification labels represent perioperative patient risk events;

a classification label association module, used for associating the original perioperative feature data of each patient with at least one classification label in the classification label set, so that the original perioperative feature data corresponds to a classification label set containing at least one classification label;

the perioperative patient data dimension reduction device provided in embodiment 1, which performs dimension reduction on the original perioperative feature data of all patients to obtain the corresponding perioperative feature data;

and a sample data set acquisition module, used for taking the perioperative feature data of each patient as a sample and associating the sample with the classification label set corresponding to its original perioperative feature data, to obtain the sample data set of perioperative patients.

In this embodiment, the classification label set acquisition module preferably performs the following: segment the patient case records into words to obtain at least one postoperative event result (a postoperative event result is a perioperative patient risk event); use a trained CBOW model to group similar words among the postoperative event results of a plurality of patients, obtaining a plurality of similar postoperative event result sets; match the similar postoperative event result sets against an event dictionary and look up the classification labels that match them; and form the classification label set from the resulting classification labels.

In this embodiment, the CBOW multi-word context model of Word2Vec is trained on a large medical corpus, and the PKUSEG word segmentation tool (PKUSEG supports word segmentation in multiple domains and includes a separate model for the medical domain) is used to segment the text of the case set, yielding a plurality of postoperative event results. The event dictionary is preferably, but not limited to, the Chinese version of the International Classification of Diseases, 11th Revision (ICD-11), issued by the World Health Organization; the event dictionary contains a plurality of classification labels. Whether a similar postoperative event result set matches the event dictionary is preferably, but not limited to, judged by semantic similarity: if the semantic similarity between the result set and an entry of the event dictionary is greater than a preset similarity threshold, the two are considered matched; otherwise they are not matched.
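The threshold test on semantic similarity can be illustrated with a short sketch. It assumes that each term is already represented by a word vector (in practice these would come from the trained CBOW model), that a result set is summarized by the mean of its vectors, and that cosine similarity is the similarity measure. None of these choices is stated in the embodiment, and the names `cosine`, `match_label`, and the two-dimensional toy vectors are hypothetical.

```python
import math


def cosine(u, v):
    """Cosine similarity of two vectors; 0.0 if either has zero length."""
    dot = sum(a * b for a, b in zip(u, v))
    nu = math.sqrt(sum(a * a for a in u))
    nv = math.sqrt(sum(b * b for b in v))
    return dot / (nu * nv) if nu and nv else 0.0


def match_label(event_vectors, dictionary, threshold=0.8):
    """Return the best-matching dictionary label, or None if no entry
    exceeds the similarity threshold."""
    dim = len(event_vectors[0])
    # Assumption: summarize the similar-event set by its mean vector.
    centroid = [sum(v[i] for v in event_vectors) / len(event_vectors)
                for i in range(dim)]
    best_label, best_sim = None, threshold
    for label, vec in dictionary.items():
        sim = cosine(centroid, vec)
        if sim > best_sim:  # must be strictly greater than the threshold
            best_label, best_sim = label, sim
    return best_label


# Toy two-dimensional vectors standing in for real word embeddings.
event_dictionary = {"death": [1.0, 0.0], "unplanned readmission": [0.0, 1.0]}
label = match_label([[0.9, 0.1], [1.0, 0.0]], event_dictionary)
```

A result set that is not close enough to any dictionary entry returns `None` and therefore contributes no classification label.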

In this embodiment, in order to fill missing values in the data and improve data quality, the system preferably further includes a missing-value filling device, which fills the missing values in the original perioperative feature data of the patient and inputs the filled data into the perioperative patient data dimension reduction device for dimension reduction. The missing-value filling device preferably, but not exclusively, uses an existing filling method such as RandomForestRegressor filling, MissForest filling, mean filling, or median filling.
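Of the filling methods listed, mean and median filling are simple enough to sketch directly. The example below is a minimal illustration that assumes missing entries are represented by `None` and that every column has at least one observed value; the function name `fill_missing` and the sample values are hypothetical.

```python
from statistics import mean, median


def fill_missing(rows, strategy="mean"):
    """Replace None entries with the column mean or median (sketch)."""
    aggregate = mean if strategy == "mean" else median
    n_cols = len(rows[0])
    fills = []
    for j in range(n_cols):
        # Only observed values contribute to the fill value.
        observed = [r[j] for r in rows if r[j] is not None]
        fills.append(aggregate(observed))
    return [[fills[j] if r[j] is None else r[j] for j in range(n_cols)]
            for r in rows]


# Columns: systolic blood pressure, heart rate (illustrative values).
records = [[120.0, 70.0], [None, 80.0], [140.0, None]]
filled = fill_missing(records, strategy="mean")
```

Median filling is less sensitive to extreme values, which can matter for indexes such as intraoperative blood loss.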

In this embodiment, it is further preferable that the missing-value filling device fills the missing values in the original perioperative feature data based on a Bayesian Gaussian process latent variable model.

In this embodiment, filling in missing values inevitably introduces uncertainty into the original perioperative feature data set. This embodiment applies a Bayesian Gaussian process latent variable model (BGPLVM) to fill the missing values of numerical features, specifically as follows:

First, the probability density $p(y_* \mid Y)$ of the observed test data vector $y_* \in \mathbb{R}^{N \times M}$ is computed approximately (where $N$ is the total number of patient samples and $M$ is the total number of features), and the variational distribution of the hidden (latent) variable associated with the observation $y_*$ is $q(x_*)$. After the model parameters and hidden variables have been learned, the BGPLVM can be used to estimate missing values. The vector $y_*$ consists of the values that can be observed and the missing values that need to be predicted; given the partially observed point $y_*$, this embodiment reconstructs the missing part from the observed part.

Missing data are filled by learning a low-dimensional embedding of the observable variables on a small complete data set. The BGPLVM is trained on a complete data set $D$, introducing a hidden variable $X$ and a new test hidden variable $X_*$. As described above, $y_*$ is a row vector representing the measurements of a single patient and consists of the known observed values and the missing values. Maximizing the approximate probability density of the observed values yields a Gaussian probability distribution over the hidden variable $x_*$ corresponding to $y_*$.

Next, the variational distribution $q(x_*)$ is optimized by maximizing this quantity while keeping all optimized quantities other than $q(x_*)$ unchanged. To predict the missing values, this embodiment uses a standard Gaussian process (GP) prediction method that also takes the uncertainty of the input $x_*$ into account, since $x_*$ follows the distribution $q(x_*)$. As in ordinary GP prediction, the latent function values corresponding to the missing part of $y_*$ are predicted first. Marginalizing over $x_*$ produces a multivariate density that is not Gaussian, but with a squared exponential kernel its mean and variance can be computed analytically, and these are what this embodiment uses.

The mean provides an estimate of the missing value, and the variance quantifies the uncertainty associated with that estimate. Through the BGPLVM, a mean estimate is obtained for every feature containing missing values, using the distribution in the hidden space and the model hyperparameters learned from the training set.
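The prediction step can be written compactly under assumed notation. The symbols $y_*^{o}$ (observed part of $y_*$), $y_*^{m}$ (missing part of $y_*$), and the GP predictive mean $\mu(x_*)$ and variance $\sigma^{2}(x_*)$ for a fixed input $x_*$ are introduced here for illustration only. The expressions show the general form of GP prediction with an uncertain input and are not necessarily the exact formulas of this embodiment.

```latex
% Sketch under assumed notation; not the embodiment's own formulas.
% Predictive density of the missing part, averaged over the uncertain input:
p\left(y_*^{m} \mid y_*^{o}, Y\right) \approx
  \int p\left(y_*^{m} \mid x_*, Y\right)\, q(x_*)\, \mathrm{d}x_*

% Mean (the fill-in estimate) and variance (its uncertainty), per feature:
\mathbb{E}\left[y_*^{m}\right] = \mathbb{E}_{q(x_*)}\left[\mu(x_*)\right],
\qquad
\operatorname{Var}\left[y_*^{m}\right] =
  \mathbb{E}_{q(x_*)}\left[\sigma^{2}(x_*)\right]
  + \operatorname{Var}_{q(x_*)}\left[\mu(x_*)\right]
```

The second line follows from the laws of total expectation and total variance applied to the first. The variance has two parts: the GP's own predictive uncertainty and the extra spread caused by not knowing $x_*$ exactly.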

In this embodiment, to facilitate data processing, the system preferably further includes an encoding device, which encodes the original perioperative feature data and inputs the encoded data into the missing-value filling device. The encoding device preferably, but not exclusively, uses the existing one-hot encoding rule.

In this embodiment, to facilitate data processing, the system preferably further includes a normalization device, which normalizes the encoded original perioperative feature data and inputs the normalized data into the missing-value filling device. The normalization device preferably, but not exclusively, uses standard deviation normalization.
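A minimal sketch of these two preprocessing steps follows. It assumes that categorical indexes are given as strings, that standard deviation normalization means subtracting the mean and dividing by the population standard deviation, and that missing numerical values (`None`) are passed through untouched because filling happens afterwards. The function names `one_hot` and `standardize` and the sample values are hypothetical.

```python
from statistics import mean, pstdev


def one_hot(values):
    """One-hot encode a categorical column; returns (categories, codes)."""
    categories = sorted(set(values))
    codes = [[1 if v == c else 0 for c in categories] for v in values]
    return categories, codes


def standardize(values):
    """Standard deviation normalization of a numerical column.

    None entries are ignored when computing the statistics and are
    returned unchanged so that a later step can fill them.
    """
    observed = [v for v in values if v is not None]
    mu, sigma = mean(observed), pstdev(observed)
    if sigma == 0:
        return [None if v is None else 0.0 for v in values]
    return [None if v is None else (v - mu) / sigma for v in values]


# Intraoperative blood loss as a category, blood pressure as a number.
categories, codes = one_hot(["low", "high", "medium", "low"])
z_scores = standardize([110.0, None, 120.0, 130.0])
```

After these steps every column is numerical and on a comparable scale, which is what the filling and dimension reduction stages operate on.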

Example 3

This embodiment provides a sample data set balancing method for perioperative patients. As shown in FIG. 3, the sample data set balancing method includes:

Step S1, oversampling the minority-class label samples in the sample data set of perioperative patients to obtain synthetic samples, and generating a corresponding synthetic label set for each synthetic sample. The sample data set includes a plurality of samples and the classification label set corresponding to each sample. Each sample represents the perioperative feature data of one patient; it may be the original perioperative feature data or the perioperative feature data obtained by reducing the dimension of the original data as in embodiment 1. The association of classification labels with samples has been described in detail in embodiment 1 and is not repeated here;

Step S2, adding the synthetic samples and the synthetic label sets to the sample data set to obtain a temporary sample data set;

Step S3, cleaning the samples in the temporary sample data set to obtain a balanced sample data set.

In this embodiment, the minority-class label samples in the sample data set of perioperative patients may be oversampled with SMOTE, SVM-SMOTE, Borderline-SMOTE, K-Means SMOTE, or SMOTE-NC to obtain synthetic samples and generate corresponding synthetic label sets. Preferably, to improve the balancing effect, the MLSMOTE algorithm is used for this oversampling. MLSMOTE (Multi-Label Synthetic Minority Over-sampling Technique) is commonly used to handle data imbalance in multi-label classification tasks. Its generation process includes: selecting the minority labels using the imbalance ratio (IR); searching for nearest neighbors, that is, once a sample belonging to a minority label is selected as a seed sample, finding its nearest neighbors; generating the feature set, that is, choosing a neighbor and obtaining a synthetic sample by interpolation; and generating the synthetic label set required by the synthetic sample.
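The four MLSMOTE stages named above can be sketched in a few lines. This is a simplified illustration and not a full implementation of the published algorithm. It assumes that a label is a minority label when its imbalance ratio exceeds the mean ratio, that one synthetic sample is generated per seed, that a single interpolation factor is used for all features, and that the synthetic label set contains the labels carried by more than half of the seed and its neighbors. The function names and the toy data are hypothetical.

```python
import random


def sq_dist(a, b):
    """Squared Euclidean distance between two feature vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def imbalance_ratios(label_sets, labels):
    """IR of a label = count of the most frequent label / its own count."""
    counts = {l: sum(l in s for s in label_sets) for l in labels}
    top = max(counts.values())
    return {l: top / c for l, c in counts.items() if c > 0}


def mlsmote_sketch(samples, label_sets, labels, k=3, seed=0):
    rng = random.Random(seed)
    ir = imbalance_ratios(label_sets, labels)
    mean_ir = sum(ir.values()) / len(ir)
    synth_x, synth_y = [], []
    for label, ratio in ir.items():
        if ratio <= mean_ir:
            continue  # stage 1: keep only minority labels
        bag = [i for i, s in enumerate(label_sets) if label in s]
        for seed_idx in bag:
            # Stage 2: nearest neighbours of the seed within the minority bag.
            neighbours = sorted(
                (j for j in bag if j != seed_idx),
                key=lambda j: sq_dist(samples[seed_idx], samples[j]))[:k]
            if not neighbours:
                continue
            # Stage 3: interpolate between the seed and one neighbour.
            ref = samples[rng.choice(neighbours)]
            gap = rng.random()
            synth_x.append([a + gap * (b - a)
                            for a, b in zip(samples[seed_idx], ref)])
            # Stage 4: labels present in more than half of seed + neighbours.
            group = [seed_idx] + neighbours
            synth_y.append({l for l in labels
                            if 2 * sum(l in label_sets[g] for g in group)
                            > len(group)})
    return synth_x, synth_y


features = [[0.0, 0.0], [1.0, 0.0], [0.0, 1.0], [1.0, 1.0],
            [5.0, 5.0], [6.0, 5.0]]
label_sets = [{"A"}, {"A"}, {"A"}, {"A"}, {"A", "B"}, {"B"}]
new_x, new_y = mlsmote_sketch(features, label_sets, ["A", "B"])
```

In the toy data only label "B" is a minority label, so the two synthetic samples lie on the segment between the two "B" samples and both receive the label set {"B"}.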

In this embodiment, oversampling algorithms that synthesize minority-class samples, such as MLSMOTE, may generate some noise samples in the process. These noise samples need to be cleaned out, so step S3 is provided to improve the quality of the sample data set.

In this embodiment, to determine the minority-class labels in the sample data set quickly, the ratio of the number of samples carrying each classification label to the total number of samples in the sample data set is preferably calculated. A classification label whose ratio is below a ratio threshold is treated as a minority-class label, and a classification label whose ratio is greater than or equal to the ratio threshold is treated as a majority-class label. The ratio threshold is preferably, but not limited to, a value smaller than 0.2.
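The ratio test for minority-class labels can be sketched as follows. The function name `split_labels`, the default threshold of 0.15 (one possible value below 0.2), and the toy label sets are assumptions made for this example.

```python
def split_labels(label_sets, ratio_threshold=0.15):
    """Split labels into minority and majority classes by sample ratio."""
    total = len(label_sets)
    counts = {}
    for labels in label_sets:
        for label in labels:
            counts[label] = counts.get(label, 0) + 1
    # A label is a minority label if too few samples carry it.
    minority = sorted(l for l, c in counts.items()
                      if c / total < ratio_threshold)
    majority = sorted(l for l, c in counts.items()
                      if c / total >= ratio_threshold)
    return minority, majority


# 10 patients: 1 death, 4 unplanned readmissions, 5 without a risk event.
toy_label_sets = ([{"death"}] + [{"unplanned readmission"}] * 4
                  + [set()] * 5)
minority_labels, majority_labels = split_labels(toy_label_sets)
```

Here "death" is carried by 10% of the samples and falls below the threshold, while "unplanned readmission" at 40% does not.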

In this embodiment, the number of samples to be generated for each minority-class label is the oversampling rate of that label. To better determine the oversampling rate of each minority-class label, so that the resulting balanced sample data set performs better in subsequent classification, step S1 preferably sets the oversampling rate of each minority-class label based on a genetic algorithm, specifically as follows:

Step S11, assuming the sample data set contains W minority-class labels, where W is a positive integer, the oversampling rates of the W minority-class labels are taken as the W genes of an individual, each gene representing the oversampling rate of one minority-class label. An initial population is constructed from a plurality of initial individuals, and the values of the W genes of each initial individual are chosen at random. Preferably, a value range can be set as needed for the oversampling rate of each minority-class label, and the gene values are drawn at random from that range when the initial population is constructed;

Step S12, the following evolutionary iteration process is repeated until a termination condition is reached: acquiring the fitness of each individual in the population of the current generation; selecting some individuals from the population of the current generation as individuals of the next-generation population based on their fitness; and performing crossover and mutation operations on the individuals of the next-generation population;

Step S13, when the termination condition is reached, the individual with the maximum fitness is output.

In this embodiment, the termination condition is preferably, but not limited to, that the number of evolution iterations reaches a preset maximum number of evolution iterations, or that the maximum fitness value of the individual in the evolution iterations does not increase any more, or that the increase amplitude of the maximum fitness value of the individual in the evolution iterations is lower than the increase threshold. In each iteration, the fitness of the individuals in the population of the current generation is ranked from high to low, and the part of the individuals with the top rank is selected as the individuals of the population of the next generation.

In this embodiment, in order to make the obtained equalized sample data set have a better performance when applied to subsequent classification, preferably, the process of obtaining the fitness of the individual is as follows:

obtaining a minority class label oversampling rate combination based on the gene information of the individual, the oversampling rate combination including the oversampling rates of all the minority class labels;

oversampling the minority class label samples in the sample data set of perioperative patients according to the oversampling rate combination to obtain synthetic samples and the synthetic label sets of the synthetic samples, adding the synthetic samples and the synthetic label sets to the sample data set to obtain a balanced sample set, and dividing the balanced sample set into a balanced training sample set and a balanced test sample set;

constructing a balanced multi-layer perceptive neural network, training it with the balanced training sample set to obtain a balanced prediction classification model, testing the balanced prediction classification model with the balanced test sample set to obtain its accuracy, and taking the accuracy as the fitness of the individual.

In this embodiment, in order to effectively remove the noise sample and improve the quality of the sample set, preferably, step S3 is to perform a cleaning process on each sample in the temporary sample set, where the cleaning process includes:

Step S31, a seed sample is selected from the temporary sample data set, and k neighbor samples of the seed sample are selected, the classification labels of the k neighbor samples forming a neighbor classification label set, where k is a positive integer; each sample in the temporary sample data set can be selected in turn as the seed sample;

step S32, predicting the classification label set of the seed sample through Bayesian conditional probability based on the neighbor classification label set to obtain a predicted classification label set of the seed sample;

Step S33, judging whether the predicted classification label set of the seed sample is the same as the classification label set of the seed sample in the temporary sample data set; if so, the seed sample is retained, and if not, the seed sample is regarded as a noise sample and deleted.

The cleaning process predicts the classification label set of the seed sample directly from its neighbor classification label set through Bayesian conditional probability, and compares the predicted classification label set with the real classification label set of the seed sample in the temporary sample data set. The judgment relies only on the data and not on a classifier, which reduces the amount of computation and improves the efficiency and accuracy of the judgment.
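The keep-or-delete logic of steps S31 to S33 can be sketched as follows. To keep the example short, the Bayesian conditional probability of step S32 is replaced here by a plain majority vote over the neighbors' label sets, which is a stand-in and not the method of the embodiment. The function names, the one-dimensional toy features, and the absolute-difference neighbor search are assumptions for illustration.

```python
def clean_samples(samples, neighbors_of, k=3):
    """Keep a sample only if its neighbours' labels reproduce its own label set.

    samples: list of (features, frozenset_of_labels)
    neighbors_of: callable(index, k) -> indices of the k neighbour samples
    """
    all_labels = set().union(*(labels for _, labels in samples))
    kept = []
    for i, (features, labels) in enumerate(samples):   # S31: every sample is a seed in turn
        neighbour_sets = [samples[j][1] for j in neighbors_of(i, k)]
        # S32 (simplified stand-in): predict a label when more than half
        # of the neighbours carry it.
        predicted = frozenset(
            label for label in all_labels
            if sum(label in s for s in neighbour_sets) * 2 > len(neighbour_sets)
        )
        if predicted == labels:                        # S33: retain on agreement
            kept.append((features, labels))
    return kept

def nearest_by_abs_difference(samples):
    """Toy neighbour search for scalar features (assumed, for the example only)."""
    def neighbors_of(i, k):
        others = [j for j in range(len(samples)) if j != i]
        others.sort(key=lambda j: abs(samples[j][0] - samples[i][0]))
        return others[:k]
    return neighbors_of
```

With three samples labeled A near 0, three labeled B near 5, and one B sample placed among the A samples, only the misplaced sample is removed.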

In this embodiment, it is further preferable that, in step S31, the specific process of selecting k neighboring samples of the seed sample includes:

obtaining the heterogeneous value difference metric HVDM between the seed sample and each of all or some of the samples in the temporary sample data set; HVDM is an abbreviation for Heterogeneous Value Difference Metric;

modifying the heterogeneous value difference metric HVDM by using the global imbalance weight of the samples in the temporary sample data set to obtain a modified heterogeneous value difference metric;

sorting the modified heterogeneous value difference metrics between the seed sample and the samples in the temporary sample data set, and selecting the first k samples with larger modified heterogeneous value difference metrics as the k neighbor samples of the seed sample. Preferably, the modified heterogeneous value difference metrics are sorted from high to low before the first k samples are selected.

In the process of selecting the k neighbor samples of the seed sample, a weighted kNN (WkNN) method is adopted to improve the quality of the synthetic samples. If the real minority class label samples in the sample data set are distributed very dispersedly, i.e. are spatially sparse, the minority class samples synthesized by an algorithm such as MLSMOTE are still sparsely scattered, and the data remain unbalanced locally. If kNN cleaning is used directly, the sparse minority class samples and the new minority class samples synthesized by MLSMOTE are removed with high probability, so that a proper classification boundary cannot be established. A distance weighting idea is therefore introduced to coordinate kNN cleaning: when sparsely distributed samples are encountered, the local spatial density (that is, the heterogeneous value difference metric HVDM and the global imbalance weight of the samples) is taken into account so that minority class samples are retained as far as possible. kNN cleaning relies mainly on the label sets of the neighbor samples, so the distance calculation for the neighbor samples is particularly important when the data distribution is sparse; this is the main reason for adding the distance weighting (that is, modifying the HVDM with the global imbalance weight of the samples in the temporary sample data set). When cleaning noise samples, WkNN changes the distance between neighboring samples by using the modified heterogeneous value difference metric, i.e. the distance between samples is expressed by the heterogeneous value difference metric while the influence of local density is taken into account.

In this embodiment, it is further preferable that the heterogeneous value difference metric HVDM between the seed sample and the sample in the temporary sample data set is calculated by:

wherein f₁ represents the feature vector of the seed sample; f₂ represents the feature vector of any sample in the temporary sample data set other than the seed sample; HVDM(f₁, f₂) represents the heterogeneous value difference metric of the feature vectors f₁ and f₂; D(f₁, f₂) represents the distance between the feature vectors f₁ and f₂; n represents the feature dimension of the samples in the temporary sample data set; x represents the feature index; d_x(f₁, f₂) represents the distance between the feature vectors f₁ and f₂ on feature x, and is obtained by the following formula:

wherein, when feature x is a categorical feature, C represents the number of categories of feature x and c represents the category index of feature x. The four sample counts in the formula represent, respectively: the number of samples in the temporary sample data set whose value of feature x matches that of f₁ and whose category is c; the number of samples whose value of feature x matches that of f₂ and whose category is c; the number of samples whose value of feature x matches that of f₁; and the number of samples whose value of feature x matches that of f₂. |f₁ - f₂| represents the absolute value of the difference between f₁ and f₂ on feature x; σ_x represents the standard deviation of feature x in the temporary sample data set.
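For reference, the Heterogeneous Value Difference Metric as it is commonly defined in the literature can be written with the symbols described above as shown below. The count symbols N are names chosen here for the four sample counts (N with subscripts x, f, c is the number of samples whose value of feature x matches that of f and whose category is c; N with subscripts x, f drops the category condition), and the 4σ_x normalization is the conventional choice; that the embodiment uses exactly these constants is an assumption.

```latex
\mathrm{HVDM}(f_1, f_2) = D(f_1, f_2) = \sqrt{\sum_{x=1}^{n} d_x(f_1, f_2)^2}

d_x(f_1, f_2) =
\begin{cases}
\sqrt{\displaystyle\sum_{c=1}^{C} \left| \frac{N_{x,f_1,c}}{N_{x,f_1}} - \frac{N_{x,f_2,c}}{N_{x,f_2}} \right|^2}, & \text{if } x \text{ is a categorical feature} \\[2ex]
\dfrac{|f_1 - f_2|}{4\sigma_x}, & \text{if } x \text{ is a numerical feature}
\end{cases}
```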

In this embodiment, it is further preferable that the calculation formula of the modified heterogeneous value difference measure between the seed sample and the sample in the temporary sample data set is:

wherein f₁ represents the feature vector of the seed sample; f₂ represents the feature vector of any sample in the temporary sample data set other than the seed sample; HVDM(f₁, f₂) represents the heterogeneous value difference metric of the feature vectors f₁ and f₂; D_W(f₁, f₂) represents the modified heterogeneous value difference metric of the feature vectors f₁ and f₂; n represents the feature dimension of the samples in the temporary sample data set; IW represents the global imbalance weight of the sample whose feature vector is f₂, IW = IR_nn/(IR⁺ + IR⁻), where IR⁺ represents the total imbalance rate of all minority class labels in the temporary sample data set, IR⁻ represents the total imbalance rate of all majority class labels in the temporary sample data set, and IR_nn represents the total imbalance rate of all classification labels in the classification label set of the sample whose feature vector is f₂.

In the above noise removal process, when WkNN calculates distances, the Heterogeneous Value Difference Metric (HVDM) is used as the distance measure, and the HVDM is modified with the global imbalance weight IW of the sample as a weight coefficient. For a sample in the temporary sample data set, the more minority class labels its classification label set contains, the larger IR_nn and therefore the larger IW. For a temporary sample data set in which the minority class label samples are sparsely distributed and the imbalance rate is large, introducing IW into the HVDM distance increases the density of the minority class samples.
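The global imbalance weight IW = IR_nn/(IR⁺ + IR⁻) can be illustrated with a short sketch. The text does not state how the imbalance rate of a single label is computed or how minority labels are identified, so two conventions from the multi-label imbalance literature are assumed here: the rate of a label is the count of the most frequent label divided by the count of that label, and a label is a minority label when its rate exceeds the mean rate. The function names are hypothetical, and the sketch stops at IW without modeling the weighting coefficient applied to HVDM.

```python
from collections import Counter

def label_imbalance_rates(label_sets):
    """Per-label imbalance rate: count of the most frequent label / count of this label (assumed)."""
    counts = Counter(label for labels in label_sets for label in labels)
    largest = max(counts.values())
    return {label: largest / count for label, count in counts.items()}

def global_imbalance_weight(sample_labels, rates):
    """IW = IR_nn / (IR+ + IR-) for one sample's classification label set."""
    mean_rate = sum(rates.values()) / len(rates)
    ir_minority = sum(r for r in rates.values() if r > mean_rate)    # IR+ (assumed split)
    ir_majority = sum(r for r in rates.values() if r <= mean_rate)   # IR-
    ir_nn = sum(rates[label] for label in sample_labels)             # IR_nn
    return ir_nn / (ir_minority + ir_majority)
```

For ten samples labeled "a", two of which also carry "b", the rates are 1.0 for "a" and 5.0 for "b", so a sample labeled only "a" gets IW = 1/6 while a sample labeled "a" and "b" gets IW = 1, matching the statement that more minority labels give a larger IW.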

It can be seen from the formulas that the weighting coefficient scales HVDM(f₁, f₂): the more minority class labels the classification label set of a neighbor sample contains, the smaller the weighting coefficient. That is, when the IW of a neighbor sample of the seed sample is larger, i.e. its label set contains more minority class labels, the weighting coefficient of that neighbor sample is smaller, so the coefficient decreases monotonically with IW. For a fixed feature dimension, the weighting coefficient of a neighbor sample is scaled to different degrees depending on the majority and minority class labels contained in its label set; when the feature dimension increases, that is, when the sample distribution becomes sparser, the scaling coefficient also becomes smaller.

It can be seen that WkNN helps to select, as neighbor samples, the samples whose label sets contain more minority class labels, and takes the distribution of labels in the neighbor samples' label sets into account, so that such samples lie closer to the seed sample, the local density of minority class labels increases, and that of majority class labels decreases. The whole process is as follows: first, MLSMOTE is used to oversample the samples of the minority class labels, and the oversampled samples together with the original samples form a balanced temporary sample set; then the WkNN process is carried out for each sample in this set, that is, the k neighbor samples are obtained by sorting on the weighted HVDM, and the label set of the seed sample is predicted from the neighbor samples; if the predicted label set is the same as the label set of the seed sample, the sample is retained, otherwise it is deleted.

Example 4

This embodiment discloses a sample data set balancing device for perioperative patients. As shown in fig. 4, the sample data set balancing device includes:

the sample synthesis module is used for oversampling the minority class label samples in the sample data set of perioperative patients to obtain synthetic samples and generating a corresponding synthetic label set for each synthetic sample, wherein the sample data set comprises a plurality of samples and the classification label sets corresponding to the samples;

the temporary sample data set acquisition module is used for adding the synthetic sample and the synthetic label set into the sample data set to acquire a temporary sample data set;

and the cleaning module is used for cleaning the samples in the temporary sample data set to obtain a balanced sample data set.

In this embodiment, preferably, the cleaning module includes:

the neighbor sample acquisition unit selects seed samples from the temporary sample data set, selects k neighbor samples of the seed samples, and the classification labels of the k neighbor samples form a neighbor classification label set, wherein k is a positive integer;

a predicted classification tag set obtaining unit which predicts the classification tag set of the seed sample through Bayesian conditional probability based on the neighbor classification tag set to obtain a predicted classification tag set of the seed sample;

and the cleaning unit is used for judging whether the predicted classification label set of the seed sample is the same as the classification label set of the seed sample in the temporary sample data set, if so, retaining the seed sample, and if not, deleting the seed sample.

In this embodiment, it is further preferable that the specific process of selecting k neighboring samples of the seed sample by the neighboring sample acquiring unit includes:

obtaining the heterogeneous value difference metric HVDM between the seed sample and each of all or some of the samples in the temporary sample data set;

modifying the heterogeneous value difference metric HVDM by using the global imbalance weight of the samples in the temporary sample data set to obtain a modified heterogeneous value difference metric;

sorting the modified heterogeneous value difference metrics between the seed sample and the samples in the temporary sample data set, and selecting the first k samples with larger modified heterogeneous value difference metrics as the k neighbor samples of the seed sample.

The balancing effect of the sample data set balancing device provided by this embodiment was verified by testing, and the results are as follows:

IR denotes the imbalance rate of the sample set; a larger IR indicates a more imbalanced sample set. The experimental results show that the balancing device provided in this embodiment yields the smallest maximum IR and the smallest average IR, and that the gap between the maximum and the average IR narrows, which indicates that the sample set is more balanced.

Example 5

This embodiment also discloses a perioperative patient sample data set acquisition system, which adds a sample data set balancing device to the system of embodiment 2, that is, performs sample balancing on the dimension-reduced sample data set obtained in embodiment 2. A schematic structural diagram of the system is shown in fig. 5, and the system includes:

the data acquisition module is used for acquiring original perioperative characteristic data and cases of a plurality of patients;

the classification label set acquisition module is used for acquiring a classification label set based on a plurality of cases, and the classification labels represent perioperative patient risk events;

the classification label correlation module is used for correlating and corresponding the original perioperative characteristic data of the patient with at least one classification label in the classification label set;

the perioperative patient data dimension reduction device is used for performing dimension reduction processing on the original perioperative characteristic data of all patients to obtain corresponding perioperative characteristic data;

the sample data set acquisition module is used for taking the perioperative feature data of each patient as a sample and associating with the sample the classification label set corresponding to its original perioperative feature data, so as to obtain a sample data set of perioperative patients;

the perioperative patient sample data set balancing device provided in embodiment 4 is further included, and is used for performing balancing processing on the sample data set.

In this embodiment, it is preferable that the system further includes a missing value filling device, configured to fill in missing values in the original perioperative feature data of the patient and to input the filled original perioperative feature data into the perioperative patient data dimension reduction device for dimension reduction processing.

Example 6

This embodiment 6 discloses a perioperative patient data multi-label classification method, as shown in fig. 6, the multi-label classification method includes:

Step A, acquiring the feature data of the patient to be classified. The feature data of the patient to be classified are feature data of a perioperative patient and may include multi-dimensional features. To make the feature data easier to process, reduce their dimensionality, and improve their quality, the feature data may be encoded and normalized in sequence and then reduced to the feature dimensions of the samples output by the perioperative patient data dimension reduction device provided in embodiment 1; the dimension-reduced feature data of the patient to be classified are then input into the trained classification model.

Step B, inputting the characteristic data of the patient to be classified into a trained classification model, outputting a classification result by the classification model, wherein the classification result comprises more than one classification label and the classification confidence of each classification label; the classification confidence of a classification label indicates the probability that the patient feature data to be classified belongs to that classification label. The classification model comprises a Stacking-based classification integration model, a label association rule acquisition module and a fusion module, wherein the fusion module is used for fusing a classification matrix output by the classification integration model and an association rule matrix output by the label association rule acquisition module to obtain a classification result, and the fusion mode is preferably but not limited to multiplying the classification matrix and the association rule matrix.
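The multiplication-based fusion can be sketched in a few lines. The shapes are assumptions made for the example: the classification matrix is taken to hold one row of per-label scores per patient (m x L), and the association rule matrix to be square over the same L labels. The function name `fuse` is hypothetical.

```python
def fuse(classification, association):
    """Multiply an (m x L) classification matrix by an (L x L) association rule matrix."""
    n_labels = len(association)
    if any(len(row) != n_labels for row in classification):
        raise ValueError("each classification row needs one score per label")
    # Entry [p][j] accumulates, over every label i, the score of label i for
    # patient p weighted by the association confidence between labels i and j.
    return [[sum(row[i] * association[i][j] for i in range(n_labels))
             for j in range(n_labels)]
            for row in classification]
```

For example, a patient scored `[0.8, 0.1]` with association matrix `[[1.0, 0.5], [0.2, 1.0]]` yields fused scores of about `[0.82, 0.5]`: the strong first label raises the score of the second label through their association confidence.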

In an embodiment, preferably, the structural diagram of the classification model is shown in fig. 7, and the classification integration model includes a first multi-classification model, a second multi-classification model, a third multi-classification model and a logistic regression model; the first multi-classification model, the second multi-classification model and the third multi-classification model respectively carry out multi-label classification processing on the characteristic data of the patient to be classified to obtain a first primary classification result, a second primary classification result and a third primary classification result; and processing the first primary classification result, the second primary classification result and the third primary classification result by the logistic regression model to obtain a classification matrix.

In this embodiment, preferably, the first multi-classification model, the second multi-classification model, and the third multi-classification model are a Ranking-SVM model, a classification multi-layer perceptual neural network model, and a Binary Relevance model, respectively. The Ranking-SVM model and the Binary Relevance model are conventional base models in Stacking integration, and using them in the Stacking integration gives high integration reliability. The classification multi-layer perceptual neural network model adopts a multilayer perceptron (MLP) network structure, so that the over-fitting problem can be avoided and the complexity is low.

In this embodiment, it is preferable that the method further includes a step of constructing a sample data set of perioperative patients; as shown in fig. 8, the sample data set is preferably, but not limited to, constructed by using the system of embodiment 2 or embodiment 5.

In an embodiment, as shown in fig. 7, the training process of the classification integration model is as follows:

constructing a sample data set of perioperative patients, wherein each sample in the sample data set is associated with more than one classification label, the sample data set is divided into a classification training set and a classification test set, and the association of the classification labels can be performed in a manual mode;

constructing a classification integration model, namely the Stacking-based integration model, which comprises a first multi-classification model, a second multi-classification model, a third multi-classification model and a logistic regression model;

training the classification integration model by using the classification training set, and testing and verifying the trained classification integration model by using the classification test set. During validation, cross-validation is performed on the training set using RandomizedSearchCV and GridSearchCV, and the hyperparameters are selected according to the F1_micro score.

In this embodiment, as shown in fig. 7, preferably, the association rule obtaining module performs the following steps:

acquiring a sample data set of perioperative patients, wherein each sample in the sample data set is associated with more than one classification label; the sample data set is preferably, but not limited to, the perioperative patient sample data set obtained in example 2 or example 5, i.e. a standard patient data set.

And mining association rules of the classification labels in the sample data set to obtain an association rule matrix. The association rule matrix includes confidence of association between any two of all the classification tags.

In this embodiment, as shown in fig. 7, it is further preferable that, when the number of classification labels in the sample data set is small, specifically smaller than a number threshold, the FP-growth algorithm is used directly to mine association rules among the classification labels in the sample data set. First, a classification label matrix as shown in fig. 7 is established, in which the first row lists the labels and the first column lists the patient numbers; then the FP-growth algorithm is used to perform association rule analysis on the classification label matrix and to output the association confidence between any two classification labels, the association confidence ranging from 0 to 1. Based on these association confidences, an association rule matrix as shown in fig. 7 is established, in which the first row and the first column both list the classification labels and each element represents the association confidence between the classification labels of its row and column; for example, as shown in fig. 7, a(N-1) represents the association confidence between classification label N and classification label 1.
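For pairs of labels, the association confidence can also be obtained by direct counting, which makes the content of the association rule matrix easy to see. The sketch below uses the usual rule confidence, conf(A -> B) = count(A and B) / count(A); it stands in for the pairwise output of FP-growth and is not an FP-growth implementation. The function name and the convention that rows are antecedents and columns are consequents are assumptions.

```python
def association_confidence_matrix(label_rows, labels):
    """Entry [i][j] is conf(labels[i] -> labels[j]) = count(i and j) / count(i).

    label_rows: one set of classification labels per patient
    labels: the label order used for the rows and columns of the matrix
    """
    matrix = []
    for a in labels:
        with_a = [row for row in label_rows if a in row]
        matrix.append([
            (sum(b in row for row in with_a) / len(with_a)) if with_a else 0.0
            for b in labels
        ])
    return matrix
```

Every entry lies between 0 and 1, and the diagonal equals 1 for any label that occurs at least once.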

In this embodiment, preferably, when the number of classification labels in the sample data set is large, specifically greater than or equal to the number threshold, the association patterns among the classification labels may differ, and performing association analysis directly leads to a complex itemset search that may affect the accuracy of the analysis. The number threshold is preferably, but not limited to, 3, 4, or 5. In this case, mining association rules of the classification labels in the sample data set to obtain the association rule matrix includes the following steps:

clustering the classification labels in the sample data set to obtain more than one cluster, preferably but not limited to by using the K-means++ algorithm; and performing association rule mining on the classification labels in each cluster to obtain an association rule submatrix. During fusion, the classification matrix is divided into more than one sub-classification matrix according to the clustering result, one cluster corresponding to one sub-classification matrix; each sub-classification matrix is multiplied by the association rule submatrix of its cluster to obtain the classification sub-result of that cluster, and all the classification sub-results together form the classification result.
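The cluster-wise fusion can be sketched as below, taking the clusters and their association rule submatrices as given (the K-means++ clustering itself is not shown). The function name, the row-per-patient layout of the classification matrix, and the requirement that the clusters partition the label list are assumptions for the example.

```python
def fuse_by_cluster(classification, labels, clusters, submatrices):
    """Fuse each cluster's columns with that cluster's association rule submatrix.

    classification: one row of per-label scores per patient, ordered as `labels`
    clusters: list of label lists that together partition `labels`
    submatrices: submatrices[c] is the square matrix for clusters[c]
    """
    column = {label: i for i, label in enumerate(labels)}
    results = []
    for row in classification:
        fused = {}
        for members, sub in zip(clusters, submatrices):
            scores = [row[column[m]] for m in members]   # sub-classification row
            for j, target in enumerate(members):
                fused[target] = sum(scores[i] * sub[i][j]
                                    for i in range(len(members)))
        # Reassemble the classification sub-results in the original label order.
        results.append([fused[label] for label in labels])
    return results
```

Labels in different clusters do not influence one another, which is what keeps the itemset search small when there are many labels.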

In this embodiment, it is further preferable that the association rule mining is performed on the classification label in each classification cluster through the FP-growth algorithm to obtain the association rule submatrix, and the process of obtaining the association rule submatrix is consistent with that in fig. 7, which has been described in detail in the above preferred solution, and is not described again here.

Example 7

The embodiment discloses a perioperative patient data multi-label classification device, as shown in fig. 9, including:

the data acquisition module is used for acquiring characteristic data of the patient to be classified;

the classification module is used for inputting the characteristic data of the patient to be classified into a trained classification model, and the classification model outputs a classification result which comprises more than one classification label and the classification confidence of each classification label; the classification model comprises a Stacking-based classification integration model, a label association rule acquisition module and a fusion module, wherein the fusion module is used for fusing a classification matrix output by the classification integration model and an association rule matrix output by the label association rule acquisition module to obtain a classification result.

In this embodiment, preferably, the classification integration model includes a first multi-classification model, a second multi-classification model, a third multi-classification model and a logistic regression model; the first multi-classification model, the second multi-classification model and the third multi-classification model respectively carry out multi-label classification processing on the characteristic data of the patient to be classified to obtain a first primary classification result, a second primary classification result and a third primary classification result; and processing the first primary classification result, the second primary classification result and the third primary classification result by the logistic regression model to obtain a classification matrix.

In this embodiment, it is preferable that the system further includes a classification integration model training module, and the classification integration model training module performs the following processes:

constructing a sample data set of perioperative patients, associating more than one classification label with each sample in the sample data set, and dividing the sample data set into a classification training set and a classification test set; the sample data set of perioperative patients is preferably, but not limited to, constructed by the system provided in embodiment 2 or embodiment 5;

constructing a classification integration model; the classification integration model comprises a first multi-classification model, a second multi-classification model, a third multi-classification model and a logistic regression model;

and training the classification integrated model by using a classification training set, and testing and verifying the trained classification integrated model by using a classification testing set.

In this embodiment, the classification device builds a multi-label classification integration model for perioperative postoperative events in combination with association rule analysis. To study and predict multiple postoperative event outcomes, a multi-label prediction model is built by integrating a Ranking-SVM model, a multi-layer perception neural network model, and a Binary Relevance model; to further improve the stability and accuracy of the model, association rules are integrated into the prediction model for optimization.

Example 8

The present embodiment discloses a perioperative patient risk event prediction system, as shown in fig. 10, including:

the data acquisition module is used for acquiring the characteristic data of the patient to be classified;

the classification module is used for inputting the characteristic data of the patient to be classified into a trained classification model, the classification model outputs a classification result, the classification result comprises more than one classification label and the classification confidence coefficient of each classification label, and each classification label corresponds to a perioperative patient risk event;

the classification model comprises a Stacking-based classification integration model, a label association rule acquisition module and a fusion module, wherein the fusion module is used for fusing a classification matrix output by the classification integration model and an association rule matrix output by the label association rule acquisition module to obtain a classification result;

and the conversion module is used for converting the classification labels in the classification result into corresponding perioperative patient risk events to obtain a risk prediction result.

In this embodiment, using the sample data set obtained by the system provided in embodiment 2 or embodiment 5, perioperative risk events of patients (especially elderly surgical patients) are predicted: after the missing and imbalanced data have been improved, association rule analysis is fused in and a multi-label prediction model of postoperative events is built. The postoperative event labels are extracted from patient case texts: a large medical corpus is collected, a medical word vector model is trained using the CBOW model of Word2Vec as the label extraction model, and the postoperative event label set (i.e. the classification label set) is extracted. Missing data are then filled in using a Bayesian Gaussian process latent variable model; label imbalance is handled using MLSMOTE, weighted kNN (WkNN), and a genetic algorithm; and finally a feature dimension reduction model is built by combining principal component analysis (PCA) with a genetic algorithm, providing more relevant input for the classification integration model.

While embodiments of the invention have been shown and described, it will be understood by those of ordinary skill in the art that: various changes, modifications, substitutions and alterations can be made to the embodiments without departing from the principles and spirit of the invention, the scope of which is defined by the claims and their equivalents.

Claims

1. A perioperative patient data dimension reduction device, comprising:

the input module is used for acquiring original perioperative characteristic data containing multi-dimensional characteristics of a patient and a classification label corresponding to the original perioperative characteristic data;

and the output module outputs perioperative characteristic data.

2. The perioperative patient data dimension-reducing apparatus of claim 1, wherein the raw perioperative feature data includes index data of pre-operative and intra-operative patients.

3. The perioperative patient data dimension reduction device of claim 1 or 2, wherein the secondary dimension reduction module comprises:

an initial population setting unit, configured to set individuals based on the first perioperative feature data, wherein the number of genes of each individual is less than or equal to the total number of features in the first perioperative feature data, and a plurality of individuals form an initial population;

an evolution iteration unit, configured to repeatedly execute the following process until a termination condition is reached, and to output the individual with the maximum fitness when the termination condition is reached:

acquiring the fitness of each individual in the current-generation population; selecting some individuals from the current-generation population as individuals of the next-generation population based on their fitness; and performing crossover and mutation operations on the individuals of the next-generation population.
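
The evolution loop described in claim 3 can be illustrated with a minimal Python sketch. It is not the claimed implementation. The binary-mask encoding of an individual, the keep-the-fitter-half selection, the single-point crossover, the bit-flip mutation, the fixed generation count used as the termination condition, and all names and default parameters (`evolve`, `pop_size`, `mutation_rate`, and so on) are assumptions made for illustration only.

```python
import random


def evolve(n_features, fitness, pop_size=20, generations=30,
           crossover_rate=0.8, mutation_rate=0.05, seed=0):
    """Return (best_mask, best_fitness) found by a simple genetic algorithm.

    Assumption: an individual is a 0/1 mask over the features left after PCA,
    so it never has more genes than there are features.
    """
    if n_features < 2 or pop_size < 4:
        raise ValueError("need at least 2 features and 4 individuals")
    rng = random.Random(seed)
    # Initial population: random feature masks.
    population = [[rng.randint(0, 1) for _ in range(n_features)]
                  for _ in range(pop_size)]
    best_mask, best_fit = None, float("-inf")

    for _ in range(generations):  # assumed termination: fixed generation count
        # Fitness of every individual in the current generation.
        scored = sorted(((fitness(ind), ind) for ind in population),
                        key=lambda pair: pair[0], reverse=True)
        if scored[0][0] > best_fit:
            best_fit, best_mask = scored[0][0], scored[0][1][:]
        # Selection: the fitter half becomes the parent pool.
        parents = [ind for _, ind in scored[:pop_size // 2]]
        # Crossover and mutation produce the next generation.
        children = []
        while len(children) < pop_size:
            a, b = rng.sample(parents, 2)
            if rng.random() < crossover_rate:
                cut = rng.randint(1, n_features - 1)
                child = a[:cut] + b[cut:]
            else:
                child = a[:]
            child = [1 - g if rng.random() < mutation_rate else g
                     for g in child]
            children.append(child)
        population = children

    # Score the final generation so its individuals can also win.
    for ind in population:
        f = fitness(ind)
        if f > best_fit:
            best_fit, best_mask = f, ind[:]
    return best_mask, best_fit
```

Any callable that maps a mask to a number can serve as `fitness`; in the device described here that role is played by the classification accuracy obtained on the reduced features.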

4. The perioperative patient data dimension reduction device of claim 3, wherein the process of obtaining the fitness of an individual comprises:

obtaining original perioperative feature data of a plurality of patients and the corresponding classification labels, and performing dimension reduction on the original perioperative feature data according to the feature information of the individual to obtain a plurality of dimension-reduced samples consistent with the features of the individual;

dividing the plurality of dimension-reduced samples into a dimension reduction training set and a dimension reduction test set;

constructing a dimension reduction multilayer perceptron neural network;

training the constructed dimension reduction multilayer perceptron neural network with the dimension reduction training set to obtain a dimension reduction classification prediction model; and testing the dimension reduction classification prediction model with the dimension reduction test set to obtain the accuracy of the model, and taking the accuracy as the fitness of the individual.
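
The accuracy-as-fitness evaluation in claim 4 can be sketched as follows. To keep the example self-contained, a nearest-centroid classifier is swapped in for the multilayer perceptron neural network named in the claim; only the surrounding steps (reduce the samples by the individual's features, split into training and test sets, train, take test accuracy as fitness) follow the claim. The function names, the 0/1 mask encoding, the 30% test fraction and the fixed shuffle seed are illustrative assumptions.

```python
import random


def reduce_sample(row, mask):
    """Keep only the features whose mask gene is 1."""
    return [v for v, keep in zip(row, mask) if keep]


def nearest_centroid_fit(X, y):
    """Stand-in classifier (NOT the claimed MLP): one mean vector per label."""
    sums, counts = {}, {}
    for row, label in zip(X, y):
        acc = sums.setdefault(label, [0.0] * len(row))
        for i, v in enumerate(row):
            acc[i] += v
        counts[label] = counts.get(label, 0) + 1
    return {label: [s / counts[label] for s in acc]
            for label, acc in sums.items()}


def nearest_centroid_predict(centroids, row):
    """Return the label whose centroid is closest to the row."""
    def sq_dist(c):
        return sum((a - b) ** 2 for a, b in zip(row, c))
    return min(centroids, key=lambda label: sq_dist(centroids[label]))


def mask_fitness(mask, X, y, test_fraction=0.3, seed=0):
    """Test-set accuracy of a classifier trained on the masked features."""
    if not any(mask):
        return 0.0  # an individual that selects no feature cannot classify
    reduced = [reduce_sample(row, mask) for row in X]
    order = list(range(len(reduced)))
    random.Random(seed).shuffle(order)
    n_test = max(1, int(len(order) * test_fraction))
    test_idx, train_idx = order[:n_test], order[n_test:]
    centroids = nearest_centroid_fit([reduced[i] for i in train_idx],
                                     [y[i] for i in train_idx])
    correct = sum(nearest_centroid_predict(centroids, reduced[i]) == y[i]
                  for i in test_idx)
    return correct / n_test
```

The sketch assumes enough samples that the training split is non-empty. Replacing the two `nearest_centroid_*` helpers with a trained multilayer perceptron would bring it closer to the process described in the claim.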

5. A perioperative patient sample dataset acquisition system, comprising:

a classification label association module, configured to associate the original perioperative feature data of a patient with at least one classification label in the classification label set;

and a perioperative patient data dimension reduction device as claimed in any one of claims 1-4, configured to perform dimension reduction on the original perioperative feature data of all patients to obtain the corresponding perioperative feature data;

6. The perioperative patient sample dataset acquisition system of claim 5, wherein the classification label set acquisition module is further configured to perform:

performing word segmentation on patient cases to obtain at least one postoperative event result; performing similar-word analogy on the postoperative event results of a plurality of patients using a trained CBOW model to obtain a plurality of similar postoperative event result sets; matching the similar postoperative event result sets against an event dictionary to find the classification labels that match them; and forming the classification label set from the plurality of classification labels.

7. The perioperative patient sample dataset acquisition system of claim 5 or 6, further comprising a missing-value filling device, configured to fill missing values in the original perioperative feature data of the patient and to input the filled original perioperative feature data into the perioperative patient data dimension reduction device for dimension reduction.

8. The perioperative patient sample dataset acquisition system of claim 7, wherein the missing-value filling device fills missing values in the original perioperative feature data based on a Bayesian Gaussian process latent variable model.

9. The perioperative patient sample dataset acquisition system of claim 7 or 8, further comprising:

an encoding device, configured to encode the original perioperative feature data.

10. The perioperative patient sample dataset acquisition system of claim 9, further comprising:

a normalization device, configured to normalize the original perioperative feature data.
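
Claims 9 and 10 do not state which encoding or normalization scheme is used. As a hedged illustration only, the sketch below assumes integer label encoding for categorical columns and min-max scaling to [0, 1] for numeric columns; the function names `label_encode` and `min_max_normalize` are hypothetical.

```python
def label_encode(column):
    """Map each distinct category to an integer, in order of first appearance.

    Assumed scheme: the claims only say the data are "encoded".
    """
    codes = {}
    return [codes.setdefault(value, len(codes)) for value in column]


def min_max_normalize(column):
    """Scale a numeric column linearly to [0, 1].

    Assumed scheme: the claims only say the data are "normalized".
    A constant column is mapped to all zeros to avoid division by zero.
    """
    lo, hi = min(column), max(column)
    if hi == lo:
        return [0.0 for _ in column]
    return [(v - lo) / (hi - lo) for v in column]
```

For example, a categorical column such as sex could pass through `label_encode` and a numeric column such as age through `min_max_normalize` before the data reach the dimension reduction device.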