CN115206538A - Perioperative patient sample data set balancing method and sample data set acquisition system - Google Patents
Perioperative patient sample data set balancing method and sample data set acquisition system
- Publication number
- CN115206538A CN202210760514.1A CN202210760514A
- Authority
- CN
- China
- Prior art keywords
- sample
- samples
- data set
- classification
- perioperative
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Pending
Classifications
-
- G—PHYSICS
- G16—INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR SPECIFIC APPLICATION FIELDS
- G16H—HEALTHCARE INFORMATICS, i.e. INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR THE HANDLING OR PROCESSING OF MEDICAL OR HEALTHCARE DATA
- G16H50/00—ICT specially adapted for medical diagnosis, medical simulation or medical data mining; ICT specially adapted for detecting, monitoring or modelling epidemics or pandemics
- G16H50/70—ICT specially adapted for medical diagnosis, medical simulation or medical data mining; ICT specially adapted for detecting, monitoring or modelling epidemics or pandemics for mining of medical data, e.g. analysing previous cases of other patients
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F16/00—Information retrieval; Database structures therefor; File system structures therefor
- G06F16/20—Information retrieval; Database structures therefor; File system structures therefor of structured data, e.g. relational data
- G06F16/21—Design, administration or maintenance of databases
- G06F16/215—Improving data quality; Data cleansing, e.g. de-duplication, removing invalid entries or correcting typographical errors
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06N—COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
- G06N3/00—Computing arrangements based on biological models
- G06N3/12—Computing arrangements based on biological models using genetic models
- G06N3/126—Evolutionary algorithms, e.g. genetic algorithms or genetic programming
-
- G—PHYSICS
- G16—INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR SPECIFIC APPLICATION FIELDS
- G16H—HEALTHCARE INFORMATICS, i.e. INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR THE HANDLING OR PROCESSING OF MEDICAL OR HEALTHCARE DATA
- G16H50/00—ICT specially adapted for medical diagnosis, medical simulation or medical data mining; ICT specially adapted for detecting, monitoring or modelling epidemics or pandemics
- G16H50/30—ICT specially adapted for medical diagnosis, medical simulation or medical data mining; ICT specially adapted for detecting, monitoring or modelling epidemics or pandemics for calculating health indices; for individual health risk assessment
Landscapes
- Engineering & Computer Science (AREA)
- Health & Medical Sciences (AREA)
- Physics & Mathematics (AREA)
- Data Mining & Analysis (AREA)
- Databases & Information Systems (AREA)
- Medical Informatics (AREA)
- Public Health (AREA)
- Theoretical Computer Science (AREA)
- Biomedical Technology (AREA)
- Biophysics (AREA)
- Life Sciences & Earth Sciences (AREA)
- General Health & Medical Sciences (AREA)
- General Physics & Mathematics (AREA)
- Evolutionary Biology (AREA)
- General Engineering & Computer Science (AREA)
- Primary Health Care (AREA)
- Epidemiology (AREA)
- Bioinformatics & Cheminformatics (AREA)
- Bioinformatics & Computational Biology (AREA)
- Pathology (AREA)
- Physiology (AREA)
- Quality & Reliability (AREA)
- Genetics & Genomics (AREA)
- Artificial Intelligence (AREA)
- Computational Linguistics (AREA)
- Evolutionary Computation (AREA)
- Molecular Biology (AREA)
- Computing Systems (AREA)
- Mathematical Physics (AREA)
- Software Systems (AREA)
- Investigating Or Analysing Biological Materials (AREA)
Abstract
The invention provides a perioperative patient sample data set balancing method and a sample data set acquisition system. The sample data set balancing method comprises the following steps: S1, oversampling the minority-class label samples in a sample data set of perioperative patients to obtain synthetic samples, and generating a corresponding synthetic label set for each synthetic sample, wherein the sample data set comprises a plurality of samples and the classification label sets corresponding to the samples; S2, adding the synthetic samples and the synthetic label sets to the sample data set to obtain a temporary sample data set; and S3, cleaning the samples in the temporary sample data set to obtain a balanced sample data set. Oversampling the minority-class label samples increases their number and balances the majority-class and minority-class label samples; cleaning the noise samples improves the quality of the samples in the output balanced sample data set, so that the performance of a classification model improves when the balanced sample data set is used for subsequent classification.
Description
Technical Field
The invention relates to the technical field of computers, in particular to a perioperative patient sample data set balancing method and a sample data set acquisition system.
Background
The perioperative period is the period surrounding an operation, running from the time a patient decides to receive surgical treatment until the patient has substantially recovered. It covers the preoperative, intraoperative and postoperative phases; specifically, it runs from the time surgical treatment is decided until the surgery-related treatment is substantially complete, roughly from 5 to 7 days before the surgery to 7 to 12 days after the surgery.
According to World Health Statistics 2021, published by the World Health Organization (WHO), global life expectancy has risen to 73.3 years, and by 2050 the number of elderly people worldwide will exceed 1.5 billion. The growing elderly population worldwide has been identified as a major segment of the surgical market, and predicting risk events in elderly patients has become one of the leading research directions. Predicting postoperative risk for elderly surgical patients helps doctors formulate diagnosis and treatment plans, allocate treatment resources reasonably, and reduce the probability of postoperative risk events. Some diagnostic tools can already help hospitals provide comprehensive and reliable treatment for high-risk patients; for example, Chinese patents with publication numbers CN111009322A and CN114038565A disclose perioperative risk assessment using a prediction model built on a perioperative data set of patients. However, the labels in such perioperative data sets are imbalanced, which directly affects the performance of the perioperative prediction model.
Disclosure of Invention
The invention aims to solve the technical problems in the prior art and provides a perioperative patient sample data set balancing method and a sample data set acquisition system.
To achieve the above object, according to a first aspect of the present invention, there is provided a perioperative patient sample data set balancing method, comprising: S1, oversampling the minority-class label samples in a sample data set of perioperative patients by using the MLSMOTE algorithm to obtain synthetic samples, and generating a corresponding synthetic label set for each synthetic sample, wherein the sample data set comprises a plurality of samples and the classification label sets corresponding to the samples; S2, adding the synthetic samples to the sample data set to obtain a temporary sample data set; and S3, cleaning the samples in the temporary sample data set to obtain a balanced sample data set.
This technical scheme has the following advantages: oversampling the minority-class label samples in the sample data set increases their number and balances the majority-class and minority-class label samples; in addition, the noise samples generated while synthesizing minority-class label samples are cleaned from the full set of samples, which improves the sample quality of the output balanced sample data set and effectively augments the data, so that the performance of a classification model improves when the balanced sample data set is used for subsequent classification.
In order to achieve the above object, according to a second aspect of the present invention, there is provided a sample data set balancing device for perioperative patients, comprising: a sample synthesis module, used for oversampling the minority-class label samples in a sample data set of perioperative patients by using the MLSMOTE algorithm to obtain synthetic samples and generating a corresponding synthetic label set for each synthetic sample, wherein the sample data set comprises a plurality of samples and the classification label sets corresponding to the samples; a temporary sample data set acquisition module, used for adding the synthetic samples to the sample data set to obtain a temporary sample data set; and a cleaning module, used for cleaning the samples in the temporary sample data set to obtain a balanced sample data set.
This technical scheme has the following advantages: the MLSMOTE algorithm oversamples the minority-class label samples in the sample data set to increase their number and balance the majority-class and minority-class label samples; in addition, the noise samples generated while MLSMOTE synthesizes minority-class label samples are cleaned from the full set of samples, which improves the sample quality of the output balanced sample data set and effectively augments the data, so that the performance of a classification model improves when the balanced sample data set is used for subsequent classification.
To achieve the above object, according to a third aspect of the present invention, there is provided a perioperative patient sample data set acquisition system, comprising: a data acquisition module, used for acquiring the original perioperative feature data and cases of a plurality of patients; a classification label set acquisition module, used for acquiring a classification label set based on the plurality of cases, the classification labels representing perioperative patient risk events; a classification label association module, used for associating the original perioperative feature data of each patient with at least one classification label in the classification label set; a perioperative patient data dimension reduction device, used for performing dimension reduction on the original perioperative feature data of all patients to obtain the corresponding perioperative feature data; a sample data set acquisition module, which takes the perioperative feature data of a patient as a sample and associates with the sample the classification label set corresponding to the original perioperative feature data, to obtain a sample data set of perioperative patients; and a balancing device, used for balancing the sample data set.
This technical scheme has the following advantages: a multi-label sample data set of perioperative patients is constructed, and, thanks to the perioperative patient data dimension reduction device, the samples in the data set have a lower feature dimension and retain the features that most influence subsequent classification, which improves the efficiency of subsequent classification and model training; the sample data set balancing device increases the number of minority-class label samples and balances the majority-class and minority-class label samples, and the noise samples generated while MLSMOTE synthesizes minority-class label samples are cleaned from the full set of samples, which improves the sample quality of the output balanced sample data set and effectively augments the data, so that the performance of a classification model improves when the balanced sample data set is used for subsequent classification.
Drawings
FIG. 1 is a schematic structural diagram of a perioperative patient data dimension reduction device in embodiment 1 of the present invention;
FIG. 2 is a schematic structural diagram of a perioperative patient sample data set acquisition system in embodiment 2 of the present invention;
FIG. 3 is a schematic flowchart of a sample data set balancing method in embodiment 3 of the present invention;
FIG. 4 is a schematic structural diagram of a sample data set balancing device in embodiment 4 of the present invention;
FIG. 5 is a schematic structural diagram of a sample data set acquisition system in embodiment 5 of the present invention;
FIG. 6 is a flowchart illustrating a perioperative patient data multi-label classification method according to embodiment 6 of the present invention;
FIG. 7 is a schematic structural diagram of a classification model in example 6;
FIG. 8 is a schematic view showing a preferred flow chart of the perioperative patient data multi-label classification method according to example 6;
FIG. 9 is a schematic structural diagram of a perioperative patient data multi-label classification device according to embodiment 7 of the present invention;
FIG. 10 is a schematic structural diagram of a perioperative patient risk event prediction system according to embodiment 8 of the present invention.
Detailed Description
Reference will now be made in detail to embodiments of the present invention, examples of which are illustrated in the accompanying drawings, wherein like or similar reference numerals refer to the same or similar elements or elements having the same or similar function throughout. The embodiments described below with reference to the accompanying drawings are illustrative only for the purpose of explaining the present invention, and are not to be construed as limiting the present invention.
In the description of the present invention, it is to be understood that the terms "longitudinal", "lateral", "upper", "lower", "front", "rear", "left", "right", "vertical", "horizontal", "top", "bottom", "inner", "outer", and the like, indicate orientations or positional relationships based on those shown in the drawings, and are used merely for convenience of description and for simplicity of description, and do not indicate or imply that the referenced devices or elements must have a particular orientation, be constructed in a particular orientation, and be operated, and thus, are not to be construed as limiting the present invention.
In the description of the present invention, unless otherwise specified and limited, it should be noted that the terms "mounted," "connected," and "connected" are to be interpreted broadly, and may be, for example, a mechanical connection or an electrical connection, a communication between two elements, a direct connection, or an indirect connection through an intermediate medium, and those skilled in the art will understand the specific meaning of the terms as they are used in the specific case.
Example 1
This embodiment discloses a perioperative patient data dimension reduction device, as shown in fig. 1, the device includes:
the input module is used for acquiring original perioperative characteristic data containing multi-dimensional characteristics of a patient and a classification label corresponding to the original perioperative characteristic data;
the primary dimensionality reduction module is used for carrying out dimensionality reduction on the original perioperative feature data based on a principal component analysis algorithm to obtain first perioperative feature data;
the secondary dimensionality reduction module is used for carrying out dimensionality reduction on the first perioperative feature data based on a genetic algorithm to obtain perioperative feature data;
and the output module outputs perioperative characteristic data.
In this embodiment, in order to better reflect the perioperative state of the patient, improve the accuracy of subsequent classification, and avoid the difficulty of collecting and managing postoperative patient data, the original perioperative feature data preferably includes the patient's preoperative and intraoperative index data, for example preoperative blood pressure, heart rate and blood lipids, and intraoperative heart rate, blood pressure, blood loss and operative duration. Some existing classification prediction models include only the preoperative baseline condition of the surgical patient and ignore what happens during the operation, yet a number of studies have shown that intraoperative indexes such as heart rate, blood pressure, blood loss and operative duration are related to the patient's postoperative condition. Using the original perioperative feature data provided by this embodiment can therefore improve the accuracy with which a subsequent model predicts postoperative events, without depending on postoperative patient index data.
In this embodiment, the classification labels are used to characterize perioperative patient risk events, which preferably include, but are not limited to, unplanned readmission and death.
In this embodiment, to improve the richness of the data, the index data includes categorical data and numerical data. Categorical data represents an index by category; for example, intraoperative blood loss can be represented as high, medium, or low. Numerical data represents an index by a numerical value, for example a blood pressure value.
In this embodiment, the original perioperative feature data can be the data of a patient whose perioperative risk events are known, in which case the known risk events can be used as the classification labels associated with the original perioperative feature data. The original perioperative feature data can also be the data of a patient whose perioperative risk events are unknown, in which case an expert assigns the corresponding classification labels to the original perioperative feature data. The original perioperative feature data may correspond to one, two, or more classification labels.
In this embodiment, after being processed by the principal component analysis algorithm, the feature dimension of the first perioperative feature data is smaller than the feature dimension of the original perioperative feature data, and an initial population of the genetic algorithm is constructed based on the first perioperative feature data.
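The primary reduction step can be illustrated with a minimal sketch, assuming only the leading principal component is wanted and using power iteration on the sample covariance matrix. The function names `first_principal_component` and `project` and the toy data are hypothetical and are not taken from the patent; a practical implementation would keep several components.

```python
import math

def first_principal_component(rows, iterations=200):
    """Return (feature means, unit direction) of the leading principal component."""
    n, m = len(rows), len(rows[0])
    mean = [sum(r[j] for r in rows) / n for j in range(m)]
    centered = [[r[j] - mean[j] for j in range(m)] for r in rows]
    # Sample covariance matrix (m x m).
    cov = [[sum(c[a] * c[b] for c in centered) / (n - 1) for b in range(m)]
           for a in range(m)]
    v = [1.0] * m
    for _ in range(iterations):
        # Power iteration: repeatedly multiply by the covariance and renormalize.
        w = [sum(cov[a][b] * v[b] for b in range(m)) for a in range(m)]
        norm = math.sqrt(sum(x * x for x in w))
        if norm == 0.0:
            break
        v = [x / norm for x in w]
    return mean, v

def project(rows, mean, direction):
    """Project each centered row onto the principal direction (one score per row)."""
    return [sum((r[j] - mean[j]) * direction[j] for j in range(len(mean)))
            for r in rows]

# Made-up toy data: the second feature is exactly twice the first.
rows = [[1.0, 2.0], [2.0, 4.0], [3.0, 6.0], [4.0, 8.0]]
mean, direction = first_principal_component(rows)
scores = project(rows, mean, direction)
```

Here two correlated features collapse into one score per patient, which is the sense in which the feature dimension of the first perioperative feature data is smaller than that of the original data.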
In this embodiment, to further reduce the dimension of the first perioperative feature data through a genetic algorithm, preferably, the secondary dimensionality reduction module includes:
the initial population setting unit is used for setting individuals based on the first perioperative feature data, wherein the number of genes of an individual is less than or equal to the total number of features in the first perioperative feature data, and a plurality of individuals form the initial population; each gene of an individual is a feature in the first perioperative feature data, and the genes of each individual can be set randomly, provided that the number of genes of the individual is less than or equal to the total number of features in the first perioperative feature data;
the evolution iteration unit is used for repeatedly executing the following process until a termination condition is reached, and outputting the individual with the maximum fitness when the termination condition is reached: acquiring the fitness of each individual in the current-generation population; selecting some individuals from the current-generation population as individuals of the next-generation population based on their fitness; and performing crossover and mutation operations on the individuals of the next-generation population.
In this embodiment, the termination condition is preferably, but not limited to, that the number of evolution iterations reaches a preset maximum, that the maximum individual fitness no longer increases across iterations, or that the increase in the maximum individual fitness across iterations falls below a growth threshold. In each iteration, the individuals in the current-generation population are ranked by fitness from high to low, and the top-ranked individuals are selected as individuals of the next-generation population. The crossover operation mainly exchanges the genes at the same positions of paired parents; the offspring obtained after the exchange are taken as individuals of the next-generation population.
In this embodiment, so that the dimension-reduced perioperative feature data performs better in subsequent classification and improves classification accuracy, the fitness of an individual is preferably obtained as follows: obtaining the original perioperative feature data of a plurality of patients and the corresponding classification labels, and reducing the dimension of the original perioperative feature data according to the feature information of the individual to obtain a plurality of dimension-reduced samples whose features match those of the individual; dividing the dimension-reduced samples into a dimension-reduction training set and a dimension-reduction test set; constructing a dimension-reduction multilayer perceptron neural network; training the constructed network with the dimension-reduction training set to obtain a dimension-reduction classification prediction model; and testing the dimension-reduction classification prediction model with the dimension-reduction test set to obtain the accuracy of the model, which is taken as the fitness of the individual.
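The evolution loop described above can be sketched as follows. This is an illustration under stated assumptions, not the patented implementation: individuals are encoded as 0/1 feature masks, crossover is single-point, the selected parents are carried over unchanged, and `toy_fitness` is a hypothetical stand-in for the classifier accuracy that the embodiment uses as fitness. All names and parameter values are illustrative.

```python
import random

def evolve(num_features, fitness, pop_size=20, keep=10, max_generations=50,
           mutation_rate=0.05, min_gain=1e-6, patience=5, seed=0):
    """Evolve 0/1 feature masks; return the fittest mask seen and its fitness."""
    rng = random.Random(seed)
    population = [[rng.randint(0, 1) for _ in range(num_features)]
                  for _ in range(pop_size)]
    best, best_fit, stalled = None, float("-inf"), 0
    for _ in range(max_generations):           # termination: iteration cap
        ranked = sorted(population, key=fitness, reverse=True)
        top_fit = fitness(ranked[0])
        # Count generations whose gain in maximum fitness is below the threshold.
        stalled = stalled + 1 if top_fit - best_fit < min_gain else 0
        if top_fit > best_fit:
            best, best_fit = list(ranked[0]), top_fit
        if stalled >= patience:                # termination: fitness stopped growing
            break
        parents = ranked[:keep]                # selection: top-ranked individuals
        children = [list(p) for p in parents]  # assumption: parents survive unchanged
        while len(children) < pop_size:
            a, b = rng.sample(parents, 2)
            point = rng.randrange(1, num_features)
            child = a[:point] + b[point:]      # single-point crossover
            # Mutation: flip each gene with a small probability.
            children.append([1 - g if rng.random() < mutation_rate else g
                             for g in child])
        population = children
    return best, best_fit

def toy_fitness(mask):
    # Hypothetical stand-in: features 0-2 are useful, each selected feature costs 0.1.
    return sum(mask[:3]) - 0.1 * sum(mask)

best, best_fit = evolve(6, toy_fitness)
```

In the embodiment the fitness call would instead train and test a classifier on the features selected by the mask, which is far more expensive than this toy function.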
Example 2
This embodiment discloses a perioperative patient sample data set acquisition system. As shown in FIG. 2, the system includes: a data acquisition module, used for acquiring the original perioperative feature data and cases of a plurality of patients, where case data is generally text data including the doctor's diagnosis, past medical history, postoperative follow-up records and the like; a classification label set acquisition module, used for acquiring a classification label set based on the plurality of cases, where the classification labels represent perioperative patient risk events; a classification label association module, used for associating the original perioperative feature data of each patient with at least one classification label in the classification label set, so that the original perioperative feature data corresponds to a classification label set containing at least one classification label; the perioperative patient data dimension reduction device provided in embodiment 1, which performs dimension reduction on the original perioperative feature data of all patients to obtain the corresponding perioperative feature data; and a sample data set acquisition module, which takes the perioperative feature data of a patient as a sample and associates with the sample the classification label set corresponding to the original perioperative feature data, to obtain the sample data set of perioperative patients.
In this embodiment, preferably, the classification label set acquisition module specifically executes: performing word segmentation on the patient cases to obtain at least one postoperative event result (a postoperative event result is a perioperative patient risk event); grouping similar words among the postoperative event results of a plurality of patients by using a trained CBOW model to obtain a plurality of similar postoperative event result sets; matching the similar postoperative event result sets against an event dictionary and searching the dictionary for the classification labels that match them; and forming the classification label set from the resulting classification labels.
In this embodiment, the CBOW multi-word context model of Word2Vec is trained on a large medical corpus, and the PKUSEG segmentation tool (PKUSEG supports word segmentation in multiple domains and includes a dedicated model for the medical domain) is used to segment the text corresponding to the case set, yielding a plurality of postoperative event results. The event dictionary is preferably, but not limited to, the Chinese version of ICD-11, the unified International Classification of Diseases published by the World Health Organization, and it contains a plurality of classification labels. Whether a similar postoperative event result set matches the event dictionary is preferably, but not limited to, judged by semantic similarity: if the semantic similarity between the similar postoperative event result set and a dictionary entry is greater than a preset similarity threshold, the two are considered matched; otherwise they are not matched.
In this embodiment, preferably, in order to fill missing values in the data and improve data quality, the system further includes a missing-value filling device, which fills the missing values in the original perioperative feature data of the patients and inputs the filled original perioperative feature data into the perioperative patient data dimension reduction device for dimension reduction. The missing-value filling device preferably, but not exclusively, performs the filling with an existing method such as random forest regressor filling, MissForest filling, mean filling, or median filling.
In this embodiment, it is further preferable that the missing-value filling device fills the missing values in the original perioperative feature data based on a Bayesian Gaussian process hidden variable model.
In this embodiment, filling in missing values inevitably introduces uncertainty into the original perioperative feature data set. This embodiment applies a Bayesian Gaussian process hidden variable model (BGPLVM) to fill the missing values of numerical features, specifically as follows:
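The threshold-based matching can be sketched as below. The patent only states that semantic similarity is compared against a preset threshold; the use of cosine similarity over embedding vectors, the function names, the 0.8 threshold and the three-dimensional toy vectors are all assumptions made for illustration.

```python
import math

def cosine_similarity(u, v):
    """Cosine of the angle between two equal-length vectors (0.0 if either is zero)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0
    return dot / (norm_u * norm_v)

def match_label(event_vector, label_vectors, threshold=0.8):
    """Return the most similar dictionary label above the threshold, else None."""
    best_label, best_sim = None, threshold
    for label, vector in label_vectors.items():
        sim = cosine_similarity(event_vector, vector)
        if sim > best_sim:
            best_label, best_sim = label, sim
    return best_label

# Hypothetical embeddings for two dictionary labels.
label_vectors = {"death": [1.0, 0.0, 0.0], "unplanned readmission": [0.0, 1.0, 0.0]}
matched = match_label([0.9, 0.1, 0.0], label_vectors)
unmatched = match_label([1.0, 1.0, 1.0], label_vectors)
```

In this toy case `matched` is `"death"`, while the second vector is not close enough to either label and yields `None`, mirroring the matched / not matched decision described above.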
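As a minimal sketch of the two simplest filling options named above, mean filling and median filling of one numerical feature column might look like this. Representing a missing value as `None` and the helper name `fill_missing` are assumptions for illustration.

```python
import statistics

def fill_missing(column, strategy="mean"):
    """Replace None entries with the mean or median of the observed values."""
    observed = [v for v in column if v is not None]
    if not observed:
        raise ValueError("column has no observed values")
    if strategy == "mean":
        fill = statistics.mean(observed)
    else:
        fill = statistics.median(observed)
    return [fill if v is None else v for v in column]

mean_filled = fill_missing([1.0, None, 3.0])
median_filled = fill_missing([1.0, None, 2.0, 9.0], strategy="median")
```

The median variant is less sensitive to outliers such as the 9.0 in the second column, which is one reason both options are commonly offered.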
First, the probability density $p(y_* \mid Y)$ of the observed test data vector $y_* \in \mathbb{R}^{N \times M}$ is approximated (where N is the total number of patient samples and M is the total number of features), and the variational distribution of the hidden variable associated with the observation $y_*$ is $q(x_*)$. After the model parameters and hidden variables have been learned, the BGPLVM can be used to estimate missing values. Write $y_* = (y_*^{o}, y_*^{m})$, where $y_*^{o}$ is the part of the vector $y_*$ that can be observed and $y_*^{m}$ is the missing part that needs to be predicted. Given the partially observed point $y_*$, this embodiment aims to reconstruct the missing part $y_*^{m}$. Missing data sets are filled by learning a low-dimensional embedding of the observable variables on a small complete data set. The BGPLVM is trained on a complete data set D, introducing the hidden variable $X$ and the new test hidden variable $X_*$. As described above, $y_*$ is a row vector representing the measurements of a single patient, $y_*^{o}$ represents the known observed values, and $y_*^{m}$ represents the missing values; the Gaussian probability distribution of the hidden variable $x_*$ corresponding to $y_*$ is obtained by maximizing the associated probability density.
Next, the variational distribution $q(x_*)$ is optimized by maximizing this density, with all optimized quantities other than $q(x_*)$ held fixed. To predict the missing values $y_*^{m}$, the invention adopts a standard Gaussian process prediction method while also taking into account the uncertainty of the input $x_*$, since $x_*$ follows the distribution $q(x_*)$. As in standard GP prediction, to predict $y_*^{m}$ the invention first predicts the latent function values corresponding to it.
Marginalizing over $x_*$ yields a fully dependent multivariate density that is not Gaussian; however, with a squared exponential kernel its mean and variance can be computed analytically. The mean provides the estimate of the missing value, and the variance quantifies the uncertainty associated with that estimate. With the BGPLVM model, the hidden space and the model hyper-parameters are learned on the training set, and the mean estimate of each feature containing missing values is obtained from the resulting distribution.
In this embodiment, in order to facilitate data processing, it is further preferable that the system further includes an encoding device, used for encoding the original perioperative feature data and inputting the encoded data into the missing-value filling device. The encoding device preferably, but not exclusively, uses the existing one-hot encoding rule.
In this embodiment, in order to facilitate data processing, it is further preferable that the system further includes a normalization device, used for normalizing the encoded original perioperative feature data and inputting the normalized data into the missing-value filling device. The normalization device preferably, but not exclusively, uses the standard deviation normalization method.
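One-hot encoding of a categorical index can be sketched as follows. The helper name `one_hot` and the example categories for intraoperative blood loss are illustrative assumptions, and the alphabetical ordering of categories is a choice made here rather than something the patent specifies.

```python
def one_hot(values):
    """Return (sorted categories, one 0/1 vector per input value)."""
    categories = sorted(set(values))
    index = {category: i for i, category in enumerate(categories)}
    encoded = [[1 if index[value] == i else 0 for i in range(len(categories))]
               for value in values]
    return categories, encoded

# Hypothetical blood-loss categories for four patients.
categories, encoded = one_hot(["high", "low", "medium", "low"])
```

Each categorical value becomes a vector with a single 1, so the encoded data can be processed together with the numerical features.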
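Standard deviation normalization (z-score) of one feature column can be sketched as below. Using the population standard deviation rather than the sample standard deviation, and mapping a constant column to zeros, are assumptions made for this example.

```python
import statistics

def standardize(column):
    """Shift a column to zero mean and scale it to unit standard deviation."""
    mean = statistics.mean(column)
    std = statistics.pstdev(column)  # population standard deviation (assumption)
    if std == 0.0:
        return [0.0 for _ in column]  # constant column: nothing to scale
    return [(value - mean) / std for value in column]

z_scores = standardize([2.0, 4.0, 4.0, 4.0, 5.0, 5.0, 7.0, 9.0])
```

For this toy column the mean is 5 and the standard deviation is 2, so the value 9.0 maps to a z-score of 2.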
Example 3
This embodiment provides a sample data set balancing method for perioperative patients, as shown in fig. 3, the sample data set balancing method includes:
the method comprises the following steps that S1, a few types of label samples in a sample data set of perioperative patients are subjected to oversampling to obtain synthetic samples, a corresponding synthetic label set is generated for the synthetic samples, and the sample data set comprises a plurality of samples and a classification label set corresponding to the samples; each sample represents a perioperative feature data set of a patient, and may be original perioperative feature data or perioperative feature data obtained after dimensionality reduction of the original perioperative feature data in example 1, and the classification label association process of the sample has been described in detail in example 1, and is not described herein again.
S2, adding the synthetic sample and the synthetic label set into the sample data set to obtain a temporary sample data set;
and S3, cleaning the samples in the temporary sample data set to obtain a balanced sample data set.
In this embodiment, a few types of label samples in the sample data set of the perioperative patient may be oversampled by SMOTE or SVM SMOTE or borderlinessmote or K-Means SMOTE or SMOTE-NC to obtain synthetic samples and generate a corresponding synthetic label set for the synthetic samples. Preferably, in order to improve the balance effect, the MLSMOTE algorithm is adopted to oversample a few types of label samples in the sample data set of the perioperative patient to obtain synthetic samples and generate a corresponding synthetic label set for the synthetic samples. The MLSMOTE algorithm, i.e. Multi label Synthetic least-sampling Technique (MLSMOTE), is commonly used to deal with the problem of data imbalance in the Multi-label classification task, and the generation process thereof includes: selecting a few classes of labels by adopting an Imbalance Imbalance Rate (IR); searching nearest neighbors, namely searching the nearest neighbors of the samples belonging to a few labels once the samples are selected as seed samples; generating a characteristic set, namely selecting a neighborhood and then obtaining a synthesized sample through interpolation; and (3) generation of a synthetic tag set, wherein the synthetic tag set is required for the generated synthetic sample.
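The nearest-neighbor search, interpolation and label set generation steps can be sketched for a single seed sample as follows. This is a simplified illustration, not the full MLSMOTE algorithm: it assumes purely numerical features, Euclidean distance, and the "ranking" rule for labels (keep a label if it appears in at least half of the seed and its neighbors). The function name `mlsmote_sample` and the toy data are hypothetical.

```python
import math
import random

def mlsmote_sample(seed_idx, features, label_sets, k=3, rng=None):
    """Create one synthetic (feature vector, label set) pair from a seed sample."""
    rng = rng or random.Random(0)
    seed = features[seed_idx]
    # Nearest-neighbor search: k closest samples to the seed, excluding the seed.
    others = [i for i in range(len(features)) if i != seed_idx]
    others.sort(key=lambda i: math.dist(seed, features[i]))
    neighbours = others[:k]
    # Feature set generation: interpolate between the seed and a random neighbor.
    reference = features[rng.choice(neighbours)]
    gap = rng.random()
    synthetic = [s + gap * (r - s) for s, r in zip(seed, reference)]
    # Label set generation: keep labels present in at least half of the group.
    group = [label_sets[seed_idx]] + [label_sets[i] for i in neighbours]
    counts = {}
    for labels in group:
        for label in labels:
            counts[label] = counts.get(label, 0) + 1
    synthetic_labels = {label for label, c in counts.items() if c >= len(group) / 2}
    return synthetic, synthetic_labels

# Made-up toy data: three nearby samples and one distant sample.
features = [[0.0, 0.0], [1.0, 0.0], [0.0, 1.0], [10.0, 10.0]]
label_sets = [{"death"}, {"death", "readmission"}, {"death"}, {"readmission"}]
synthetic, synthetic_labels = mlsmote_sample(0, features, label_sets, k=2)
```

The synthetic point lies on the segment between the seed and one of its two nearest neighbors, and it inherits only the label shared by most of that neighborhood.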
In this embodiment, oversampling algorithms such as MLSMOTE may generate noise samples while synthesizing minority-class label samples, and these noise samples need to be cleaned; step S3 is therefore provided to improve the quality of the sample data set.
In this embodiment, preferably, in order to quickly determine the minority-class labels in the sample data set, the ratio of the number of samples corresponding to each classification label to the total number of samples in the sample data set is calculated; a classification label whose ratio is smaller than a ratio threshold is taken as a minority-class classification label, and a classification label whose ratio is greater than or equal to the ratio threshold is taken as a majority-class classification label. The ratio threshold is preferably, but not limited to, a value smaller than 0.2.
In this embodiment, preferably, the number of samples to be generated for each minority-class classification label is the oversampling rate of that label. In order to better determine the oversampling rate of each minority-class classification label, so that the resulting balanced sample data set performs better in subsequent classification, preferably, in step S1, the oversampling rate of each minority-class label is set based on a genetic algorithm, which specifically includes:
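As a minimal sketch of this ratio test: the function name `split_labels`, the representation of each sample's labels as a Python set, and the default threshold of 0.2 (used here only as an illustrative upper bound) are assumptions.

```python
def split_labels(label_sets, ratio_threshold=0.2):
    """Split labels into minority and majority classes by sample ratio.

    label_sets holds one set of labels per sample (hypothetical layout).
    """
    total = len(label_sets)
    counts = {}
    for labels in label_sets:
        for label in labels:
            counts[label] = counts.get(label, 0) + 1
    # A label is a minority label when its share of samples is below the threshold.
    minority = sorted(l for l, c in counts.items() if c / total < ratio_threshold)
    majority = sorted(l for l, c in counts.items() if c / total >= ratio_threshold)
    return minority, majority
```

With ten samples in which a label occurs only once, that label has a ratio of 0.1 and is treated as a minority-class label.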
step S11, assuming that the sample data set contains W minority-class labels, taking the oversampling rates of the samples of the W minority-class labels as the W genes of an individual, where W is a positive integer and each gene represents the oversampling rate of one minority-class classification label; constructing an initial population comprising a plurality of initial individuals, the values of the W genes of each initial individual being obtained by random selection. Preferably, a value range may be set as needed for the oversampling rate of each minority-class classification label, and when the initial population is constructed, values are randomly selected within that range as the gene values;
step S12, repeating the following evolutionary iteration until a termination condition is reached: obtaining the fitness of each individual in the current-generation population; selecting, based on fitness, some of the individuals of the current-generation population as individuals of the next-generation population; and performing crossover and mutation operations on the individuals of the next-generation population;
and step S13, outputting the individual with the maximum fitness when the termination condition is reached.
In this embodiment, the termination condition is preferably, but not limited to, that the number of evolution iterations reaches a preset maximum number of evolution iterations, or the maximum fitness value of the individual in the evolution iterations does not increase any more, or the increase amplitude of the maximum fitness value of the individual in the evolution iterations is lower than the increase threshold. In each iteration, the fitness of the individuals in the population of the current generation is ranked from high to low, and the part of the individuals with the top rank is selected as the individuals of the population of the next generation.
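Steps S11 to S13 can be sketched as follows. This is a simplified illustration under stated assumptions, not the embodiment's implementation: the function name `evolve`, all default parameter values, single-point crossover, per-gene random-reset mutation and carrying the best individual over unchanged are choices made for the example, and `fitness` is any callable that scores an individual.

```python
import random

def evolve(fitness, n_genes, rate_range=(1, 10), pop_size=20, keep=10,
           max_generations=30, mutation_prob=0.1, seed=0):
    """Search for oversampling rates (one gene per minority label)."""
    rng = random.Random(seed)
    low, high = rate_range
    # S11: random initial population within the allowed rate range.
    population = [[rng.randint(low, high) for _ in range(n_genes)]
                  for _ in range(pop_size)]
    best, best_fitness = None, float("-inf")
    for _ in range(max_generations):          # termination: iteration cap
        ranked = sorted(population, key=fitness, reverse=True)
        if fitness(ranked[0]) <= best_fitness:
            break                             # termination: no further gain
        best, best_fitness = list(ranked[0]), fitness(ranked[0])
        survivors = ranked[:keep]             # S12: keep the top-ranked part
        children = [list(best)]               # assumption: best carried over
        while len(children) < pop_size:
            mother, father = rng.sample(survivors, 2)
            cut = rng.randint(1, n_genes - 1) if n_genes > 1 else 0
            child = mother[:cut] + father[cut:]           # crossover
            child = [rng.randint(low, high) if rng.random() < mutation_prob
                     else gene for gene in child]         # mutation
            children.append(child)
        population = children
    return best, best_fitness                 # S13: fittest individual
```

In the embodiment the fitness of an individual is the test accuracy of a classifier trained on the data set balanced with that individual's rates; for a quick check any cheap stand-in can be passed, for example `lambda ind: -sum(abs(g - t) for g, t in zip(ind, [3, 7]))`.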
In this embodiment, in order to make the obtained equalized sample data set have a better performance when applied to subsequent classification, preferably, the process of obtaining the fitness of the individual is as follows:
obtaining minority class label oversampling rate combinations based on individual gene information; the over-sampling rate combination comprises the over-sampling rates of all the minority class tags;
oversampling the minority-class label samples in the sample data set of perioperative patients based on the minority-class label oversampling rate combination to obtain synthetic samples and their synthetic label sets, adding the synthetic samples and the synthetic label sets to the sample data set to obtain a balanced sample set, and dividing the balanced sample set into a balanced training sample set and a balanced test sample set;
constructing a multi-layer perceptron (MLP) neural network, training it with the balanced training sample set to obtain a balanced prediction classification model, testing the balanced prediction classification model with the balanced test sample set to obtain its accuracy, and taking that accuracy as the fitness of the individual.
In this embodiment, in order to effectively remove the noise sample and improve the quality of the sample set, preferably, step S3 is to perform a cleaning process on each sample in the temporary sample set, where the cleaning process includes:
step S31, selecting a seed sample from the temporary sample data set and selecting k neighbor samples of the seed sample, the classification labels of the k neighbor samples forming a neighbor classification label set, where k is a positive integer; each sample in the temporary sample data set may be selected in turn as the seed sample;
step S32, predicting the classification label set of the seed sample through Bayesian conditional probability based on the neighbor classification label set to obtain a predicted classification label set of the seed sample;
and step S33, judging whether the predicted classification label set of the seed sample is the same as the classification label set of the seed sample in the temporary sample data set, if so, retaining the seed sample, and if not, deleting the seed sample and considering the seed sample as a noise sample.
This cleaning process predicts the classification label set of the seed sample directly from its neighbor classification label set through Bayesian conditional probability, and compares the predicted classification label set with the actual classification label set of the seed sample in the temporary sample data set. Because the judgment relies only on the data and not on a classifier, the amount of computation is reduced and the efficiency and accuracy of the judgment are improved.
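The control flow of steps S31 to S33 can be sketched as below. The names `clean`, `neighbors_of` and `vote_predict` are hypothetical. Note in particular that `vote_predict` is a simple neighbor-vote stand-in and does not implement the Bayesian conditional probability prediction of step S32; it is used only so that the loop can be run end to end.

```python
def clean(samples, neighbors_of, predict):
    """Keep the samples whose predicted label set equals their own (S31-S33).

    samples:      list of (features, labels) tuples, labels being a set
    neighbors_of: callable returning the indices of a sample's k neighbors
    predict:      callable mapping neighbor label sets to a predicted set
    """
    kept = []
    for i, (features, labels) in enumerate(samples):   # each sample as seed
        neighbor_labels = [samples[j][1] for j in neighbors_of(i)]
        if predict(neighbor_labels) == labels:
            kept.append((features, labels))            # consistent: retain
        # otherwise the seed is treated as noise and dropped
    return kept

def vote_predict(neighbor_label_sets):
    """Stand-in predictor: labels carried by more than half of the neighbors."""
    counts = {}
    for labels in neighbor_label_sets:
        for label in labels:
            counts[label] = counts.get(label, 0) + 1
    return {l for l, c in counts.items() if 2 * c > len(neighbor_label_sets)}
```

Passing the neighbor selection in as a callable keeps the loop independent of how neighbors are chosen, so the weighted selection described for step S31 can be plugged in without changing the cleaning logic.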
In this embodiment, it is further preferable that, in step S31, the specific process of selecting k neighboring samples of the seed sample includes:
obtaining the heterogeneous value difference metric (HVDM) between the seed sample and each of all or some of the samples in the temporary sample data set, HVDM being an abbreviation of Heterogeneous Value Difference Metric;
correcting the heterogeneous value difference measurement HVDM by using the global unbalanced weight of the samples in the temporary sample data set to obtain a corrected heterogeneous value difference measurement;
and sorting the corrected heterogeneous value difference metrics between the seed sample and all the samples in the temporary sample data set, and selecting the first k samples with the larger corrected heterogeneous value difference metrics as the k neighbor samples of the seed sample. Preferably, the corrected metrics may be sorted in descending order before the first k samples are selected.
In the process of selecting the k neighbor samples of the seed sample, a weighted kNN (WkNN) method is adopted to improve the quality of the synthesized samples. If the real minority-class label samples in the sample data set are widely scattered, i.e. spatially sparse, the minority-class samples synthesized by an algorithm such as MLSMOTE are also sparsely distributed, and the data remain unbalanced from a local point of view. If plain kNN cleaning were used directly, the sparse minority-class samples and the new minority-class samples synthesized by MLSMOTE would be removed with high probability, so that a proper classification boundary could not be established. kNN cleaning is therefore adjusted by introducing distance weighting: when sparsely distributed samples are encountered, the local spatial density (i.e. the heterogeneous value difference metric HVDM and the global imbalance weight of the samples) is taken into account so that as many minority-class samples as possible are retained. Because kNN cleaning relies mainly on the label sets of the neighbor samples, the distance calculation for neighbor samples is particularly important when the data distribution is sparse, which is the main reason for adding distance weighting (i.e. correcting the HVDM with the global imbalance weight of the samples in the temporary sample data set). WkNN thus cleans noise samples while changing the distance between neighboring samples: the distance between samples is expressed by the corrected heterogeneous value difference metric, which takes the local density into account.
In this embodiment, it is further preferable that the calculation formula of the heterogeneous value difference metric HVDM between the seed sample and the sample in the temporary sample data set is:
wherein f_1 represents the feature vector of the seed sample; f_2 represents the feature vector of any sample in the temporary sample data set other than the seed sample; HVDM(f_1, f_2) represents the heterogeneous value difference metric of f_1 and f_2; D(f_1, f_2) represents the distance between f_1 and f_2; n represents the feature dimension of the samples in the temporary sample data set; x represents the feature index; and d_x(f_1, f_2) represents the distance between f_1 and f_2 on feature x, which is obtained by a piecewise formula in which, when feature x is a categorical feature, C represents the number of classes of feature x and c represents the class index of feature x; N_{x,f_1,c} represents the number of samples in the temporary sample data set whose feature x has the same value as in f_1 and belongs to class c; N_{x,f_2,c} represents the number of samples in the temporary sample data set whose feature x has the same value as in f_2 and belongs to class c; N_{x,f_1} represents the number of samples in the temporary sample data set whose feature x has the same value as in f_1; N_{x,f_2} represents the number of samples in the temporary sample data set whose feature x has the same value as in f_2; |f_1 - f_2| represents the absolute value of the difference between f_1 and f_2 on feature x; and σ_x represents the standard deviation of feature x in the temporary sample data set.
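The formulas themselves are not reproduced in this text. A form consistent with the symbol definitions above, following the commonly used definition of HVDM, would be the one below. Here N_{x,f,c} denotes the number of samples whose feature x has the same value as in f and belongs to class c, and N_{x,f} the number of samples whose feature x has the same value as in f. The factor 4 in the numeric branch and the squared terms in the categorical branch are assumptions taken from that common definition, not confirmed by this document.

```latex
\mathrm{HVDM}(f_1, f_2) = D(f_1, f_2) = \sqrt{\sum_{x=1}^{n} d_x(f_1, f_2)^{2}}

d_x(f_1, f_2) =
\begin{cases}
\sqrt{\displaystyle\sum_{c=1}^{C} \left| \frac{N_{x,f_1,c}}{N_{x,f_1}} - \frac{N_{x,f_2,c}}{N_{x,f_2}} \right|^{2}}, & \text{if feature } x \text{ is categorical} \\[2ex]
\dfrac{|f_1 - f_2|}{4\,\sigma_x}, & \text{if feature } x \text{ is numeric}
\end{cases}
```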
In this embodiment, it is further preferable that the calculation formula of the modified heterogeneous value difference measure between the seed sample and the sample in the temporary sample data set is:
wherein f_1 represents the feature vector of the seed sample; f_2 represents the feature vector of any sample in the temporary sample data set other than the seed sample; HVDM(f_1, f_2) represents the heterogeneous value difference metric of f_1 and f_2; D_W(f_1, f_2) represents the modified heterogeneous value difference metric of f_1 and f_2; n represents the feature dimension of the samples in the temporary sample data set; and IW represents the global imbalance weight of the sample whose feature vector is f_2, IW = IR_nn / (IR_+ + IR_-), where IR_+ represents the total imbalance rate of all minority-class classification labels in the temporary sample data set, IR_- represents the total imbalance rate of all majority-class classification labels in the temporary sample data set, and IR_nn represents the total imbalance rate of all classification labels in the classification label set of the sample whose feature vector is f_2.
In the process of removing noise samples, the Heterogeneous Value Difference Metric (HVDM) is used for distance measurement in the WkNN distance calculation, and the HVDM is corrected with the global imbalance weight IW of the sample as a weighting factor. For a sample in the temporary sample data set, the more minority-class labels its classification label set contains, the larger IR_nn is and the larger IW becomes; for a temporary sample data set in which minority-class label samples are sparsely distributed and the imbalance rate is large, introducing IW into the HVDM distance increases the density of the minority-class samples.
From the formula for the modified heterogeneous value difference metric, it can be seen that the weighting coefficient scales HVDM(f_1, f_2): the more minority-class labels the classification label set of a neighbor sample contains, the smaller the weighting coefficient. That is, when the IW of a neighbor sample of the seed sample is larger, meaning that its label set contains more minority-class labels, the weighting coefficient of that neighbor sample is smaller. The coefficient is therefore monotonically decreasing, and the following properties hold: with the feature dimension fixed, the weighting coefficients of neighbor samples are scaled differently depending on the majority-class and minority-class labels contained in their label sets; and when the feature dimension increases, i.e. the sample distribution becomes sparser, the scaling coefficient also decreases.
It can be seen that WkNN helps to select, as neighbor samples, those samples whose label sets contain more minority-class labels. Because the distribution of labels in the neighbor label sets is taken into account, such samples are drawn closer to the seed sample, which increases the local label density of the minority-class labels and reduces that of the majority-class labels. The overall process is as follows: first, MLSMOTE is used to oversample the minority-class label samples, and the synthetic samples together with the original samples form a balanced temporary sample set; on this new sample set, the WkNN process is carried out for each sample, i.e. the k neighbor samples are ranked based on the weighted HVDM and the label set of the seed sample is then predicted from the neighbor samples; if the predicted label set is the same as the label set of the seed sample, the sample is retained, otherwise it is deleted.
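The global imbalance weight IW = IR_nn / (IR_+ + IR_-) can be sketched as follows. Two points are assumptions, since this text does not define them: the imbalance rate of a single label is taken as the count of the most frequent label divided by the count of that label (the usual per-label imbalance ratio in the multi-label literature), and a "total imbalance rate" is taken as the sum of the per-label rates. The function names are hypothetical.

```python
def label_imbalance_rates(label_sets):
    """Per-label imbalance rate: most frequent label count / this label's count."""
    counts = {}
    for labels in label_sets:
        for label in labels:
            counts[label] = counts.get(label, 0) + 1
    largest = max(counts.values())
    return {label: largest / c for label, c in counts.items()}

def imbalance_weight(sample_labels, rates, minority):
    """IW = IR_nn / (IR_+ + IR_-) for one sample's label set."""
    ir_plus = sum(r for l, r in rates.items() if l in minority)       # minority total
    ir_minus = sum(r for l, r in rates.items() if l not in minority)  # majority total
    ir_nn = sum(rates[l] for l in sample_labels)                      # this sample
    return ir_nn / (ir_plus + ir_minus)
```

Under these assumptions a sample whose label set contains a rare label receives a larger IW than one carrying only frequent labels, which matches the behavior described above.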
Example 4
This embodiment discloses a sample data set balancing device of perioperative patients, as shown in fig. 4, the sample data set balancing device includes:
the sample synthesis module is used for oversampling the minority-class label samples in the sample data set of perioperative patients to obtain synthetic samples, and for generating a corresponding synthetic label set for the synthetic samples, wherein the sample data set comprises a plurality of samples and a classification label set corresponding to each sample;
the temporary sample data set acquisition module is used for adding the synthetic sample and the synthetic label set into the sample data set to acquire a temporary sample data set;
and the cleaning module is used for cleaning the samples in the temporary sample data set to obtain a balanced sample data set.
In this embodiment, preferably, the cleaning module includes:
the neighbor sample acquisition unit selects seed samples from the temporary sample data set, selects k neighbor samples of the seed samples, and the classification labels of the k neighbor samples form a neighbor classification label set, wherein k is a positive integer;
the prediction classification tag set obtaining unit is used for predicting the classification tag set of the seed sample through Bayesian conditional probability based on the neighbor classification tag set to obtain a prediction classification tag set of the seed sample;
and the cleaning unit is used for judging whether the predicted classification label set of the seed sample is the same as the classification label set of the seed sample in the temporary sample data set, if so, retaining the seed sample, and if not, deleting the seed sample.
In this embodiment, it is further preferable that the specific process of selecting k neighboring samples of the seed sample by the neighboring sample acquiring unit includes:
obtaining heterogeneous value difference measurement HVDM between the seed sample and all or part of the temporary sample data set respectively;
modifying the heterogeneous value difference measurement HVDM by using the global unbalanced weight of the samples in the temporary sample data set to obtain a modified heterogeneous value difference measurement;
and sorting the modified heterogeneous value difference metrics between the seed sample and all the samples in the temporary sample data set, and selecting the first k samples with the larger modified heterogeneous value difference metrics as the k neighbor samples of the seed sample.
The balancing effect of the sample data set balancing device provided by this embodiment was tested and verified, with the following results:
IR denotes the Imbalance Rate of the sample set; the larger the IR, the more unbalanced the sample set. The experimental results show that the maximum IR and the average IR obtained with the balancing device provided in this embodiment are the smallest, and that the gap between the maximum IR and the average IR is narrowed, which indicates that the sample set is more balanced.
Example 5
This embodiment further discloses a perioperative patient sample data set acquisition system, which adds the sample data set balancing device to the system of embodiment 2, i.e. it performs sample balancing on the dimension-reduced sample data set obtained in embodiment 2. As shown in the schematic structural diagram of fig. 5, the system includes: a data acquisition module for acquiring original perioperative characteristic data and cases of a plurality of patients; a classification label set acquisition module for acquiring a classification label set based on the plurality of cases, the classification labels representing perioperative patient risk events; a classification label association module for associating the original perioperative characteristic data of each patient with at least one classification label in the classification label set; a perioperative patient data dimension reduction device for performing dimension reduction on the original perioperative characteristic data of all patients to obtain the corresponding perioperative characteristic data; a sample data set acquisition module for taking the perioperative characteristic data of each patient as a sample and associating with the sample the classification label set corresponding to its original perioperative characteristic data, to obtain a sample data set of perioperative patients; and the perioperative patient sample data set balancing device provided in embodiment 4, for balancing the sample data set.
In this embodiment, preferably, the system further includes a missing-value filling device, configured to fill missing values in the original perioperative characteristic data of the patients and to input the filled original perioperative characteristic data into the perioperative patient data dimension reduction device for dimension reduction.
Example 6
This embodiment 6 discloses a perioperative patient data multi-label classification method, as shown in fig. 6, the multi-label classification method includes:
step A, acquiring characteristic data of a patient to be classified. The characteristic data of the patient to be classified is characteristic data of a perioperative patient and may comprise multi-dimensional features. To make the data easier to process, reduce its dimensionality and improve its quality, the characteristic data of the patient to be classified may be encoded and then normalized, and then reduced to the feature dimensions of the samples output by the perioperative patient data dimension reduction device provided in embodiment 1; the dimension-reduced characteristic data of the patient to be classified is then input into the trained classification model.
Step B, inputting the characteristic data of the patient to be classified into a trained classification model, outputting a classification result by the classification model, wherein the classification result comprises more than one classification label and the classification confidence of each classification label; the classification confidence of a classification label indicates the probability that the patient feature data to be classified belongs to that classification label. The classification model comprises a Stacking-based classification integration model, a label association rule acquisition module and a fusion module, wherein the fusion module is used for fusing a classification matrix output by the classification integration model and an association rule matrix output by the label association rule acquisition module to obtain a classification result, and the fusion mode is preferably but not limited to multiplying the classification matrix and the association rule matrix.
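The multiplicative fusion can be sketched in a few lines. The orientation is an assumption: each row of the classification matrix holds one patient's confidences over the N labels, and `association[i][j]` holds the association confidence from label i to label j. The function name `fuse` is hypothetical.

```python
def fuse(classification, association):
    """Multiply an M x N classification matrix by an N x N association matrix."""
    n = len(association)
    return [[sum(row[i] * association[i][j] for i in range(n))
             for j in range(n)]
            for row in classification]
```

For example, a patient scored `[0.8, 0.2]` combined with `[[1.0, 0.5], [0.0, 1.0]]` yields a raised score for the second label, because the first label is associated with it. The products are not renormalized in this sketch, so they should be read as fused scores rather than probabilities.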
In an embodiment, preferably, the structural diagram of the classification model is shown in fig. 7, and the classification integration model includes a first multi-classification model, a second multi-classification model, a third multi-classification model and a logistic regression model; the first multi-classification model, the second multi-classification model and the third multi-classification model respectively carry out multi-label classification processing on the characteristic data of the patient to be classified to obtain a first primary classification result, a second primary classification result and a third primary classification result; and processing the first primary classification result, the second primary classification result and the third primary classification result by the logistic regression model to obtain a classification matrix.
In this embodiment, preferably, the first, second and third multi-classification models are a Ranking-SVM model, a multi-layer perceptron (MLP) neural network classification model and a Binary Relevance model, respectively. The Ranking-SVM model and the Binary Relevance model are conventional base models in Stacking ensembles, and using them in the Stacking ensemble gives high reliability of model integration. The MLP classification model adopts a multi-layer perceptron network structure, which helps to avoid over-fitting and has low complexity.
In this embodiment, preferably, the method further includes a step of constructing a sample data set of perioperative patients, as shown in fig. 8; the sample data set is preferably, but not limited to, constructed using the system of embodiment 2 or embodiment 5.
In this embodiment, as shown in fig. 7, the classification integration model is trained as follows: constructing a sample data set of perioperative patients, in which each sample is associated with more than one classification label (the association may be done manually), and dividing the sample data set into a classification training set and a classification test set; constructing the classification integration model, i.e. the Stacking-based integration model, which comprises a first multi-classification model, a second multi-classification model, a third multi-classification model and a logistic regression model; and training the classification integration model with the classification training set, and testing and verifying the trained classification integration model with the classification test set. During validation, cross-validation is performed on the training set using RandomizedSearchCV and GridSearchCV, and the hyperparameters are selected by the F1_micro score.
In this embodiment, as shown in fig. 7, preferably, the association rule obtaining module performs the following steps: acquiring a sample data set of perioperative patients, wherein each sample in the sample data set is associated with more than one classification label; the sample data set is preferably, but not limited to, the perioperative patient sample data set acquired in example 2 or example 5, i.e. a standard patient data set. And mining association rules of the classification labels in the sample data set to obtain an association rule matrix. The association rule matrix includes confidence of association between any two of all the classification tags.
In this embodiment, as shown in fig. 7, it is further preferable that, when the number of classification labels in the sample data set is small, specifically smaller than a number threshold, the FP-growth algorithm is used directly to mine association rules among the classification labels in the sample data set. First, a classification label matrix as shown in fig. 7 is established, in which the first row lists the labels and the first column lists the patient numbers; then the FP-growth algorithm is used to perform association rule analysis on the classification label matrix and to output the association confidence between any two classification labels, the association confidence ranging from 0 to 1. Based on the association confidences, an association rule matrix as shown in fig. 7 is established, in which the first row and the first column are both the classification labels and each element represents the association confidence between the classification labels of its row and column; for example, as shown in fig. 7, a(N-1) represents the association confidence between classification label N and classification label 1.
In this embodiment, preferably, when the number of classification labels in the sample data set is large, specifically greater than or equal to the number threshold (the number threshold being preferably, but not limited to, 3, 4 or 5), the association patterns among the classification labels may differ, and performing association analysis directly would lead to a complex item set search that affects the accuracy of the analysis. In this case, mining the association rules of the classification labels in the sample data set to obtain the association rule matrix comprises the following steps:
clustering the classification labels in the sample data set to obtain more than one cluster, the clustering being preferably, but not limited to, performed with the K-means++ algorithm; and mining association rules among the classification labels within each cluster to obtain an association rule submatrix. During fusion, the classification matrix is divided into more than one sub-classification matrix according to the clustering result, one cluster corresponding to one sub-classification matrix; each sub-classification matrix is multiplied by the association rule submatrix of its cluster to obtain the classification sub-result of that cluster, and all the classification sub-results together form the classification result.
In this embodiment, it is further preferable that association rule mining is performed on the classification labels in each cluster by the FP-growth algorithm to obtain the association rule submatrix; the submatrix is obtained in the same way as in the process of fig. 7, which has been described in detail in the preferred embodiment above and is not repeated here.
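The pairwise association confidence can be illustrated by direct counting. This sketch is a stand-in for FP-growth rather than an implementation of it: it computes confidence(i -> j) = support(i and j) / support(i) for every label pair without any minimum-support pruning. The function name `association_matrix` and the 0/1 row layout (one row per patient, one column per label) are assumptions.

```python
def association_matrix(label_rows):
    """Build an N x N matrix of pairwise rule confidences from 0/1 label rows."""
    n = len(label_rows[0])
    support = [sum(row[i] for row in label_rows) for i in range(n)]
    matrix = []
    for i in range(n):
        confidences = []
        for j in range(n):
            # Number of patients carrying both label i and label j.
            both = sum(1 for row in label_rows if row[i] and row[j])
            confidences.append(both / support[i] if support[i] else 0.0)
        matrix.append(confidences)
    return matrix
```

Each entry lies between 0 and 1, and the matrix is generally not symmetric, since the confidence of label i implying label j differs from that of label j implying label i.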
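The cluster-wise fusion can be sketched for a single patient as follows. The function name `fuse_by_cluster`, the dictionary holding the patient's per-label confidences, and the convention that each submatrix follows the label order of its cluster are assumptions made for illustration.

```python
def fuse_by_cluster(confidences, clusters, sub_matrices):
    """Fuse one patient's label confidences cluster by cluster.

    confidences:  dict mapping label -> classification confidence
    clusters:     list of label lists, one per cluster
    sub_matrices: one square association submatrix per cluster, in the
                  same label order as that cluster
    """
    result = {}
    for labels, sub in zip(clusters, sub_matrices):
        scores = [confidences[label] for label in labels]   # sub-classification row
        for j, label in enumerate(labels):
            result[label] = sum(scores[i] * sub[i][j] for i in range(len(labels)))
    return result
```

Labels in different clusters never influence each other here, which is the effect of splitting the classification matrix before multiplying.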
Example 7
The embodiment discloses a perioperative patient data multi-label classification device, as shown in fig. 9, including: the data acquisition module is used for acquiring the characteristic data of the patient to be classified; the classification module is used for inputting the characteristic data of the patient to be classified into a trained classification model, and the classification model outputs a classification result which comprises more than one classification label and the classification confidence of each classification label; the classification model comprises a Stacking-based classification integration model, a label association rule acquisition module and a fusion module, wherein the fusion module is used for fusing a classification matrix output by the classification integration model and an association rule matrix output by the label association rule acquisition module to obtain a classification result.
In this embodiment, preferably, the classification integration model includes a first multi-classification model, a second multi-classification model, a third multi-classification model and a logistic regression model; the first multi-classification model, the second multi-classification model and the third multi-classification model respectively carry out multi-label classification processing on the characteristic data of the patient to be classified to obtain a first primary classification result, a second primary classification result and a third primary classification result; and processing the first primary classification result, the second primary classification result and the third primary classification result by the logistic regression model to obtain a classification matrix.
In this embodiment, it is preferable that the system further includes a classification integration model training module, and the classification integration model training module performs the following processes: constructing a sample data set of perioperative patients, associating more than one classification label with each sample in the sample data set, and dividing the sample data set into a classification training set and a classification testing set; preferably but not limited to, constructing a sample data set of a perioperative patient by means of the system provided in example 2 or example 5; constructing a classification integration model; the classification integration model comprises a first multi-classification model, a second multi-classification model, a third multi-classification model and a logistic regression model; and training the classification integrated model by using a classification training set, and testing and verifying the trained classification integrated model by using a classification testing set.
In this embodiment, the classification device builds a multi-label classification integration model for perioperative postoperative events that incorporates association rule analysis. Since multiple postoperative risk events may occur after an operation, the postoperative multi-event outcome is studied and predicted: a multi-label prediction model is built by integrating a Ranking-SVM model, a multi-layer perceptron neural network model and a Binary Relevance model, and association rules are fused into the prediction model to further improve its stability and accuracy.
Example 8
The present embodiment discloses a perioperative patient risk event prediction system, as shown in fig. 10, including: the data acquisition module is used for acquiring characteristic data of the patient to be classified; the classification module is used for inputting the characteristic data of the patient to be classified into a trained classification model, the classification model outputs a classification result, the classification result comprises more than one classification label and the classification confidence of each classification label, and each classification label corresponds to a perioperative patient risk event;
the classification model comprises a Stacking-based classification integration model, a label association rule acquisition module and a fusion module, wherein the fusion module is used for fusing a classification matrix output by the classification integration model and an association rule matrix output by the label association rule acquisition module to obtain a classification result; and the conversion module is used for converting the classification labels in the classification result into corresponding perioperative patient risk events to obtain a risk prediction result.
In this embodiment, preferably, the classification-integration model includes a first multi-classification model, a second multi-classification model, a third multi-classification model, and a logistic regression model; the first multi-classification model, the second multi-classification model and the third multi-classification model respectively carry out multi-label classification processing on the characteristic data of the patient to be classified to obtain a first primary classification result, a second primary classification result and a third primary classification result; and processing the first primary classification result, the second primary classification result and the third primary classification result by the logistic regression model to obtain a classification matrix.
In this embodiment, it is preferable that the system further includes a classification integration model training module, and the classification integration model training module performs the following processes: constructing a sample data set of perioperative patients, associating more than one classification label with each sample in the sample data set, and dividing the sample data set into a classification training set and a classification testing set; preferably, but not limited to, constructing the sample data set of perioperative patients by the system provided in example 2 or example 5; constructing a classification integration model; the classification integration model comprises a first multi-classification model, a second multi-classification model, a third multi-classification model and a logistic regression model; and training the classification integration model by using the classification training set, and testing and verifying the trained classification integration model by using the classification testing set.
In this embodiment, the sample data set is acquired with the system provided in embodiment 2 or embodiment 5 in order to predict perioperative risk events of patients (particularly elderly surgical patients): after the missing and unbalanced data set has been improved, association rule analysis is fused in and a multi-label prediction model for postoperative events is built. Postoperative event labels are extracted from the patient case text: a large amount of medical corpora is collected, a medical word vector model is trained with the CBOW model of Word2Vec, and the postoperative event label set (namely the classification label set) is extracted. Missing data are then filled with a Bayesian Gaussian process latent variable model, label-imbalanced data are processed with MLSMOTE, weighted KNN (WKNN) and a genetic algorithm, and finally a feature dimension reduction model combining Principal Component Analysis (PCA) with the genetic algorithm is built to provide more relevant input for the classification integration model.
While embodiments of the invention have been shown and described, it will be understood by those of ordinary skill in the art that: various changes, modifications, substitutions and alterations can be made to the embodiments without departing from the principles and spirit of the invention, the scope of which is defined by the claims and their equivalents.
Claims (10)
1. A perioperative patient sample data set balancing method, comprising:
step S1, oversampling the minority-class label samples in a sample data set of perioperative patients to obtain synthetic samples, and generating a corresponding synthetic label set for each synthetic sample, wherein the sample data set comprises a plurality of samples and the classification label sets corresponding to the samples;
step S2, adding the synthetic samples and the synthetic label sets to the sample data set to obtain a temporary sample data set;
and step S3, cleaning the samples in the temporary sample data set to obtain a balanced sample data set.
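The three steps of claim 1 can be sketched as follows, under several stated assumptions. Synthetic samples are produced by SMOTE-style linear interpolation between two randomly chosen minority-class samples, the synthetic label set keeps labels carried by at least half of the contributing samples, and the cleaning rule is passed in as a hypothetical `keep` callable. None of these choices is fixed by the claim; the function names are illustrative only.

```python
import random


def synthesize(seed_vec, other_vec, rng):
    """SMOTE-style interpolation between two numeric feature vectors."""
    gap = rng.random()
    return [a + gap * (b - a) for a, b in zip(seed_vec, other_vec)]


def vote_labels(label_sets):
    """Keep labels carried by at least half of the contributing samples."""
    counts = {}
    for labels in label_sets:
        for lab in labels:
            counts[lab] = counts.get(lab, 0) + 1
    return {lab for lab, c in counts.items() if 2 * c >= len(label_sets)}


def balance(samples, label_sets, minority_label, n_new, keep, seed=0):
    """Oversample one minority label, merge, then clean (steps S1 to S3)."""
    rng = random.Random(seed)
    idx = [i for i, ls in enumerate(label_sets) if minority_label in ls]
    new_x, new_y = [], []
    for _ in range(n_new):
        # Step S1: synthesize a sample and its label set.
        i, j = rng.choice(idx), rng.choice(idx)
        new_x.append(synthesize(samples[i], samples[j], rng))
        new_y.append(vote_labels([label_sets[i], label_sets[j]]))
    # Step S2: temporary sample data set = original + synthetic.
    tmp_x = samples + new_x
    tmp_y = label_sets + new_y
    # Step S3: cleaning, delegated to the caller-supplied rule.
    kept = [(x, y) for x, y in zip(tmp_x, tmp_y) if keep(x, y)]
    return [x for x, _ in kept], [y for _, y in kept]
```

Passing `keep=lambda x, y: True` disables cleaning, which makes it easy to inspect the temporary sample data set on its own.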
2. The perioperative patient sample data set balancing method of claim 1, wherein in step S1 an oversampling rate is set for each minority-class label based on a genetic algorithm, which specifically comprises:
step S11, assuming that the sample data set comprises W minority-class labels, taking the oversampling rates of the samples of the W minority-class labels as the W genes of an individual, wherein W is a positive integer; and constructing an initial population, wherein the initial population comprises a plurality of initial individuals and the W gene values of each initial individual are selected at random;
step S12, repeatedly performing the following evolutionary iteration until a termination condition is reached:
acquiring the fitness of each individual in the population of the current generation; selecting, based on the fitness of the individuals, a part of the individuals from the population of the current generation as individuals of the population of the next generation; and performing crossover and mutation operations on the individuals of the next-generation population;
and step S13, outputting the individual with the maximum fitness when the termination condition is reached.
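A minimal sketch of the genetic search in steps S11 to S13 follows. The claim does not fix the selection scheme, the crossover or mutation operators, or the termination condition, so the sketch assumes truncation selection, one-point crossover, per-gene random-reset mutation, retention of the best individual (elitism) and a fixed number of generations. The function name `evolve_rates` and all parameters are hypothetical, and `fitness` is any callable that scores an individual.

```python
import random


def evolve_rates(n_labels, fitness, rate_choices, pop_size=10,
                 generations=20, p_mut=0.1, seed=0):
    """Search for one oversampling rate per minority-class label."""
    rng = random.Random(seed)
    # Step S11: W genes per individual, each gene chosen at random.
    pop = [[rng.choice(rate_choices) for _ in range(n_labels)]
           for _ in range(pop_size)]
    # Step S12: iterate until the (assumed) generation limit is reached.
    for _ in range(generations):
        ranked = sorted(pop, key=fitness, reverse=True)
        parents = ranked[: pop_size // 2]   # truncation selection (assumption)
        next_pop = [ranked[0][:]]           # elitism (assumption)
        while len(next_pop) < pop_size:
            a, b = rng.choice(parents), rng.choice(parents)
            cut = rng.randrange(1, n_labels) if n_labels > 1 else 0
            child = a[:cut] + b[cut:]       # one-point crossover
            # Mutation: reset a gene to a random rate with probability p_mut.
            child = [rng.choice(rate_choices) if rng.random() < p_mut else g
                     for g in child]
            next_pop.append(child)
        pop = next_pop
    # Step S13: output the individual with the maximum fitness.
    return max(pop, key=fitness)
```

Because the best individual of each generation is carried over unchanged, the fitness of the returned individual is never lower than the best fitness in the initial population.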
3. The perioperative patient sample data set balancing method of claim 2, wherein the process of obtaining the fitness of an individual comprises:
obtaining a minority-class label oversampling rate combination based on the gene information of the individual, the oversampling rate combination including the oversampling rates of all the minority-class labels;
oversampling the minority-class label samples in the sample data set of perioperative patients based on the minority-class label oversampling rate combination to obtain synthetic samples and the synthetic label sets of the synthetic samples, adding the synthetic samples and the synthetic label sets to the sample data set to obtain a balanced sample set, and dividing the balanced sample set into a balanced training sample set and a balanced testing sample set;
constructing a balancing multilayer perceptron neural network, training the balancing multilayer perceptron neural network with the balanced training sample set to obtain a balanced prediction classification model, testing the balanced prediction classification model with the balanced testing sample set to obtain the accuracy of the balanced prediction classification model, and taking the accuracy as the fitness of the individual.
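The fitness evaluation of claim 3 can be outlined as a thin wrapper around two caller-supplied steps. In the sketch below, `oversample` and `train_and_score` are hypothetical callables standing in for the oversampling procedure and for training and testing the multilayer perceptron; the 30% test fraction and the shuffled split are assumptions, since the claim does not state how the balanced sample set is divided.

```python
import random


def individual_fitness(rates, samples, label_sets, oversample,
                       train_and_score, test_fraction=0.3, seed=0):
    """Return the test accuracy obtained with one oversampling rate combination.

    oversample(samples, label_sets, rates) -> (balanced_x, balanced_y)
    train_and_score(x_train, y_train, x_test, y_test) -> accuracy
    Both callables are placeholders supplied by the caller.
    """
    # Build the balanced sample set from the individual's genes.
    x, y = oversample(samples, label_sets, rates)
    # Split it into balanced training and testing sample sets.
    order = list(range(len(x)))
    random.Random(seed).shuffle(order)
    n_test = max(1, int(len(order) * test_fraction))
    test, train = order[:n_test], order[n_test:]
    # Train on one part, test on the other; the accuracy is the fitness.
    return train_and_score([x[i] for i in train], [y[i] for i in train],
                           [x[i] for i in test], [y[i] for i in test])
```

A function of this shape can be handed to a genetic search as its fitness callable after binding the data arguments, for example with `functools.partial`.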
4. The perioperative patient sample data set balancing method according to claim 1, 2 or 3, wherein in step S3 a cleaning process is performed on each sample in the temporary sample data set, the cleaning process comprising:
step S31, selecting a seed sample from the temporary sample data set and selecting k neighbor samples of the seed sample, wherein the classification labels of the k neighbor samples form a neighbor classification label set and k is a positive integer;
step S32, predicting the classification label set of the seed sample through Bayesian conditional probability based on the neighbor classification label set, to obtain a predicted classification label set of the seed sample;
and step S33, judging whether the predicted classification label set of the seed sample is the same as the classification label set of the seed sample in the temporary sample data set; if so, retaining the seed sample, and if not, deleting the seed sample.
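Steps S31 to S33 can be sketched with an ML-kNN-style maximum a posteriori rule. This is one possible reading of "Bayes conditional probability", not necessarily the claimed computation: for each label, the prior and the likelihood of seeing a given number of neighbors with that label are estimated from the temporary sample data set with Laplace smoothing. The sketch assumes numeric features and plain Euclidean distance for neighbor selection; the function names and the smoothing parameter `s` are hypothetical.

```python
import math


def knn_indices(xs, i, k):
    """Indices of the k samples closest to sample i (Euclidean distance)."""
    order = sorted((j for j in range(len(xs)) if j != i),
                   key=lambda j: math.dist(xs[i], xs[j]))
    return order[:k]


def clean_indices(xs, ys, labels, k=3, s=1.0):
    """Return indices of samples whose predicted label set matches their own."""
    n = len(xs)
    neigh = [knn_indices(xs, i, k) for i in range(n)]   # step S31
    stats = {}
    for lab in labels:
        prior = (s + sum(lab in y for y in ys)) / (2 * s + n)
        has = [0] * (k + 1)    # has[c]: samples WITH lab having c such neighbors
        lacks = [0] * (k + 1)  # lacks[c]: samples WITHOUT lab having c such neighbors
        for i in range(n):
            c = sum(lab in ys[j] for j in neigh[i])
            (has if lab in ys[i] else lacks)[c] += 1
        stats[lab] = (prior, has, lacks)
    kept = []
    for i in range(n):
        predicted = set()
        for lab in labels:                               # step S32
            prior, has, lacks = stats[lab]
            c = sum(lab in ys[j] for j in neigh[i])
            p_yes = prior * (s + has[c]) / (s * (k + 1) + sum(has))
            p_no = (1 - prior) * (s + lacks[c]) / (s * (k + 1) + sum(lacks))
            if p_yes > p_no:
                predicted.add(lab)
        if predicted == set(ys[i]):                      # step S33
            kept.append(i)
    return kept
```

With two well-separated groups of samples and a single sample carrying the other group's label, the mismatching sample is the one dropped from the returned index list.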
5. The perioperative patient sample data set balancing method of claim 4, wherein in step S31 the specific process of selecting the k neighbor samples of the seed sample comprises:
obtaining the heterogeneous value difference metric (HVDM) between the seed sample and each of all or part of the samples in the temporary sample data set;
correcting the heterogeneous value difference metric HVDM by using the global imbalance weight of the samples in the temporary sample data set, to obtain a modified heterogeneous value difference metric;
and sorting the modified heterogeneous value difference metrics between the seed sample and all the samples in the temporary sample data set, and selecting the first k samples with the larger modified heterogeneous value difference metrics as the k neighbor samples of the seed sample.
6. The perioperative patient sample data set balancing method of claim 5, wherein the heterogeneous value difference metric HVDM between the seed sample and a sample in the temporary sample data set is calculated by the formula:
wherein f_1 represents the feature vector of the seed sample; f_2 represents the feature vector of any sample in the temporary sample data set other than the seed sample; HVDM(f_1, f_2) represents the heterogeneous value difference metric of feature vectors f_1 and f_2; D(f_1, f_2) represents the distance between feature vectors f_1 and f_2; n represents the feature dimension of the samples in the temporary sample data set; x represents a feature index; d_x(f_1, f_2) represents the distance between feature vector f_1 and feature vector f_2 on feature x, and d_x(f_1, f_2) is obtained by the following formula:
wherein C denotes the number of classes of feature x when feature x is a class feature, and c denotes a class index of feature x; the four count terms in the formula denote, respectively, the number of samples in the temporary sample data set whose value of feature x equals that of feature vector f_1 and whose class is c, the number of samples whose value of feature x equals that of feature vector f_2 and whose class is c, the number of samples whose value of feature x equals that of feature vector f_1, and the number of samples whose value of feature x equals that of feature vector f_2; |f_1 - f_2| represents the absolute value of the difference between feature vectors f_1 and f_2; σ_x represents the standard deviation of feature x in the temporary sample data set.
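The quantities defined in claim 6 match the ingredients of the commonly published heterogeneous value difference metric, so a sketch in that form may help. The exact combination below (a normalized value difference for class features, |f_1 - f_2| / (4σ_x) for numeric features, and the square root of the sum of squared per-feature distances) is an assumption based on that published form, not a reproduction of the claimed formula. The use of the population standard deviation, a single class per sample, and the names `hvdm`, `data`, `classes` and `categorical` are likewise assumptions.

```python
import math
from statistics import pstdev


def hvdm(f1, f2, data, classes, categorical):
    """Heterogeneous value difference between two samples drawn from `data`.

    data:        list of feature vectors (the temporary sample data set).
    classes:     one class per sample in `data` (simplifying assumption).
    categorical: set of feature indices treated as class features.
    """
    total = 0.0
    for x in range(len(f1)):
        col = [row[x] for row in data]
        if x in categorical:
            # Counts of samples sharing the feature value of f1 / f2.
            n1 = sum(v == f1[x] for v in col)
            n2 = sum(v == f2[x] for v in col)
            d2 = 0.0
            for c in set(classes):
                n1c = sum(v == f1[x] and y == c for v, y in zip(col, classes))
                n2c = sum(v == f2[x] and y == c for v, y in zip(col, classes))
                d2 += (n1c / n1 - n2c / n2) ** 2
            total += d2
        else:
            # Numeric feature: absolute difference scaled by 4 standard deviations.
            sigma = pstdev(col)
            d = abs(f1[x] - f2[x]) / (4 * sigma) if sigma > 0 else 0.0
            total += d * d
    return math.sqrt(total)
```

The function expects `f1` and `f2` to be rows of `data`, so that the counts in the denominators are never zero.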
7. The perioperative patient sample data set balancing method of claim 5 or 6, wherein the modified heterogeneous value difference metric between the seed sample and a sample in the temporary sample data set is calculated by the formula:
wherein f_1 represents the feature vector of the seed sample; f_2 represents the feature vector of any sample in the temporary sample data set other than the seed sample; HVDM(f_1, f_2) represents the heterogeneous value difference metric of feature vectors f_1 and f_2; D_W(f_1, f_2) represents the modified heterogeneous value difference metric of feature vectors f_1 and f_2; n represents the feature dimension of the samples in the temporary sample data set; IW represents the global imbalance weight of the sample whose feature vector is f_2, IW = IR_nn / (IR+ + IR-), where IR+ denotes the total imbalance rate of all the minority-class classification labels in the temporary sample data set, IR- denotes the total imbalance rate of all the majority-class classification labels in the temporary sample data set, and IR_nn is the total imbalance rate of all the classification labels in the classification label set of the sample whose feature vector is f_2.
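The weight IW = IR_nn / (IR+ + IR-) can be computed directly once per-label imbalance rates are available. The sketch below takes those rates as a ready-made dictionary `ir` (hypothetical input, since the claim does not define how a single label's imbalance rate is obtained) and reads "total imbalance rate" as a plain sum, which is an assumption. How IW then corrects HVDM is not shown.

```python
def imbalance_weight(sample_labels, ir, minority):
    """Global imbalance weight IW of one sample.

    sample_labels: classification label set of the sample.
    ir:            per-label imbalance rate, label -> float (assumed given).
    minority:      set of labels regarded as minority-class labels.
    """
    # IR+: total imbalance rate over all minority-class labels.
    ir_plus = sum(v for lab, v in ir.items() if lab in minority)
    # IR-: total imbalance rate over all majority-class labels.
    ir_minus = sum(v for lab, v in ir.items() if lab not in minority)
    # IR_nn: total imbalance rate over the labels carried by this sample.
    ir_nn = sum(ir[lab] for lab in sample_labels)
    return ir_nn / (ir_plus + ir_minus)
```

Under this reading, a sample that carries every label has weight 1, and a sample that carries only rare labels with high imbalance rates receives a larger weight than one that carries only common labels.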
8. A perioperative patient sample data set balancing apparatus, comprising:
the sample synthesis module is used for oversampling the minority-class label samples in a sample data set of perioperative patients to obtain synthetic samples and generating a corresponding synthetic label set for each synthetic sample, wherein the sample data set comprises a plurality of samples and the classification label sets corresponding to the samples;
the temporary sample data set acquisition module is used for adding the synthetic samples and the synthetic label sets to the sample data set to obtain a temporary sample data set;
and the cleaning module is used for cleaning the samples in the temporary sample data set to obtain a balanced sample data set.
9. A perioperative patient sample dataset acquisition system, comprising:
the data acquisition module is used for acquiring original perioperative characteristic data and cases of a plurality of patients;
the classification label set acquisition module is used for acquiring a classification label set based on a plurality of cases, and the classification labels represent perioperative patient risk events;
the classification label association module is used for associating and corresponding original perioperative characteristic data of the patient with at least one classification label in the classification label set;
the perioperative patient data dimension reduction device is used for performing dimension reduction processing on the original perioperative characteristic data of all patients to obtain corresponding perioperative characteristic data;
the sample data set acquisition module is used for taking the perioperative characteristic data of each patient as a sample and associating with the sample the classification label set corresponding to the original perioperative characteristic data, to obtain a sample data set of perioperative patients;
the system further comprising the perioperative patient sample data set balancing apparatus of claim 8 for balancing the sample data set.
10. The perioperative patient sample data set acquisition system of claim 9, further comprising a missing-value filling device for filling missing values in the original perioperative characteristic data of the patients and inputting the filled original perioperative characteristic data into the perioperative patient data dimension reduction device for dimension reduction.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202210760514.1A CN115206538A (en) | 2022-06-30 | 2022-06-30 | Perioperative patient sample data set balancing method and sample data set acquisition system |
Applications Claiming Priority (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202210760514.1A CN115206538A (en) | 2022-06-30 | 2022-06-30 | Perioperative patient sample data set balancing method and sample data set acquisition system |
Publications (1)
Publication Number | Publication Date |
---|---|
CN115206538A (en) | 2022-10-18 |
Family
ID=83577629
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN202210760514.1A Pending CN115206538A (en) | 2022-06-30 | 2022-06-30 | Perioperative patient sample data set balancing method and sample data set acquisition system |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN115206538A (en) |
Cited By (1)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN117633625A (en) * | 2023-11-30 | 2024-03-01 | 成都市成华区妇幼保健院 | Gynaecology and obstetrics postoperative care data analysis method and system based on big data |
- 2022-06-30: CN application CN202210760514.1A (CN115206538A), status Pending
Cited By (2)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN117633625A (en) * | 2023-11-30 | 2024-03-01 | 成都市成华区妇幼保健院 | Gynaecology and obstetrics postoperative care data analysis method and system based on big data |
CN117633625B (en) * | 2023-11-30 | 2024-08-13 | 成都市成华区妇幼保健院 | Gynaecology and obstetrics postoperative care data analysis method and system based on big data |
Similar Documents
Publication | Title | Publication Date |
---|---|---|
CN109192298B (en) | Deep brain disease diagnosis algorithm based on brain network | |
CN111785329B (en) | Single-cell RNA sequencing clustering method based on countermeasure automatic encoder | |
Guidotti et al. | Explaining any time series classifier | |
CN107992976B (en) | Hot topic early development trend prediction system and prediction method | |
CN117253614A (en) | Diabetes risk early warning method based on big data analysis | |
CN110363230B (en) | Stacking integrated sewage treatment fault diagnosis method based on weighted base classifier | |
CN111563533A (en) | Test subject classification method based on graph convolution neural network fusion of multiple human brain maps | |
CN114898879A (en) | Chronic disease risk prediction method based on graph representation learning | |
CN115659174A (en) | Multi-sensor fault diagnosis method, medium and equipment based on graph regularization CNN-BilSTM | |
CN114999635A (en) | circRNA-disease association relation prediction method based on graph convolution neural network and node2vec | |
CN116805533A (en) | Cerebral hemorrhage operation risk prediction system based on data collection and simulation | |
CN115206538A (en) | Perioperative patient sample data set balancing method and sample data set acquisition system | |
CN117349494A (en) | Graph classification method, system, medium and equipment for space graph convolution neural network | |
CN115206539A (en) | Multi-label integrated classification method based on perioperative patient risk event data | |
CN115295105A (en) | Perioperative patient data dimension reduction device and sample data set acquisition system | |
CN114897085A (en) | Clustering method based on closed subgraph link prediction and computer equipment | |
CN113284627B (en) | Medication recommendation method based on patient characterization learning | |
Koskela | Neural network methods in analysing and modelling time varying processes | |
CN116612831A (en) | Chemical substance safety evaluation method for deep learning combined mode biological zebra fish | |
CN114724630B (en) | Deep learning method for predicting post-translational modification site of protein | |
CN110633368A (en) | Deep learning classification method for early colorectal cancer unstructured data | |
Kelly et al. | Variable interaction measures with random forest classifiers | |
Usman et al. | Feature selection: It importance in performance prediction | |
CN112465054A (en) | Multivariate time series data classification method based on FCN | |
CN116070120B (en) | Automatic identification method and system for multi-tag time sequence electrophysiological signals | |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
PB01 | Publication | ||
SE01 | Entry into force of request for substantive examination | ||