CN109948732B - Abnormal cell distant metastasis classification method and system based on unbalanced learning

Info

Publication number: CN109948732B
Application number: CN201910251365.4A
Authority: CN
Inventors: 彭立志; 李雪梅; 杨波; 李宝生; 朱健
Original assignee: Shandong Cancer Hospital & Institute (shandong Cancer Hospital); University of Jinan
Current assignee: Shandong Cancer Hospital & Institute (shandong Cancer Hospital); University of Jinan
Priority date: 2019-03-29
Filing date: 2019-03-29
Publication date: 2020-12-22
Anticipated expiration: 2039-03-29
Also published as: CN109948732A

Abstract

The disclosure provides a method and a system for classifying abnormal cell distant metastasis based on unbalanced learning. A plurality of data sequences in which certain cells have distant metastasis and a plurality of data sequences in which they do not are obtained, and the data set is divided into a training set, used to train a model, and a test set, used to test the model. First, the training set is input into feature selection algorithms, the classification results are compared with those obtained on the original data set, and the p features with the best results are selected. Next, oversampling algorithms are used to obtain training sets with a positive-to-negative sample ratio of 1:1; each is input into a classification algorithm and tested with the data sequences of the test set, and the oversampling algorithm i that produced the training set Pi with the optimal evaluation result is selected. Finally, the training set is input into the oversampling algorithm that produced Pi, the positive-to-negative sample ratio is gradually increased to a set ratio, and the ratio with the optimal classification evaluation is selected. By increasing the proportion of positive samples with an oversampling algorithm, the technical scheme obtains better model evaluation indexes and a better recall for the minority positive samples.

Description

Abnormal cell distant metastasis classification method and system based on unbalanced learning

Technical Field

The disclosure relates to the technical field of machine learning and data mining, in particular to a method and a system for classifying abnormal cell distant metastasis based on unbalanced learning.

Background

Esophageal squamous cell carcinoma is one of the most common malignant tumors worldwide. Its early symptoms are not obvious and the changes in the body are easily ignored, so the disease has generally reached the middle or advanced stage by the time the patient can no longer tolerate the symptoms and goes to a hospital for examination. Clinically, doctors determine whether the cancer cells of a patient with esophageal squamous cell carcinoma have metastasized distantly by imaging, or even by puncture or surgery. These three approaches not only add to the patient's treatment cost but are also time-consuming. With the advent of the big data age, it has been proposed to solve this problem by predicting from blood cell analysis whether a patient's cancer cells have metastasized. According to the relevant literature, classification prediction of lymph node metastasis has already been attempted in the medical field, but with specificity and sensitivity below 50%, and no comparable research has been carried out on distant metastasis. Clinical researchers have relied on P-value statistical tests run in statistical analysis software (SPSS, SAS), whereas the present disclosure uses machine learning for analysis and prediction.

Because relatively little data has been collected from patients with esophageal squamous cell carcinoma, and patients whose cancer cells have metastasized distantly make up only a small proportion of them, the data suffer from class imbalance.

The inventors found in their research that, on such unbalanced data sets, a standard classifier tends to maximize overall accuracy while ignoring the minority samples, which are precisely the samples of interest. Even if a high accuracy is obtained, the analysis result is then meaningless, and it is difficult to predict effectively whether a patient's cancer cells have metastasized. In real life, and particularly in the medical field, class imbalance is common, mainly because of disease incidence rates. If the unbalanced data are not processed, the performance of a standard classifier is severely affected.
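The effect described above can be illustrated with a minimal sketch. The cohort sizes (95 negative, 5 positive) and the always-negative classifier are hypothetical and are not taken from the disclosure; they only show how a high accuracy can coexist with zero recall on the minority class.

```python
def accuracy_and_recall(y_true, y_pred, positive=1):
    """Return (overall accuracy, recall of the positive class)."""
    correct = sum(t == p for t, p in zip(y_true, y_pred))
    true_pos = sum(t == positive and p == positive for t, p in zip(y_true, y_pred))
    actual_pos = sum(t == positive for t in y_true)
    accuracy = correct / len(y_true)
    recall = true_pos / actual_pos if actual_pos else 0.0
    return accuracy, recall


# Hypothetical cohort: 95 patients without distant metastasis (0), 5 with (1).
y_true = [0] * 95 + [1] * 5
# A degenerate classifier that always predicts the majority class.
y_pred = [0] * 100

accuracy, recall = accuracy_and_recall(y_true, y_pred)
print(accuracy, recall)
```

The accuracy looks high only because the negative class dominates; none of the positive samples is found.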

Disclosure of Invention

The purpose of the embodiments of the present specification is to provide a method for classifying abnormal cell distant metastasis based on unbalanced learning, which uses an oversampling algorithm to increase the proportion of positive samples and thereby obtains better model evaluation indexes and a better recall for the minority positive samples.

The embodiment of the specification provides a method for classifying abnormal cell distant metastasis based on unbalanced learning, which comprises the following steps:

obtaining a plurality of data sequences with certain cells having distant metastasis and a plurality of data sequences without certain cells having distant metastasis, and forming a training set;

inputting the training set into each of k feature selection algorithms, selecting for each algorithm the p top-ranked attributes as features of the training set, inputting those attributes into a classifier for training, comparing the classification results, and selecting the p features with the best results;

enabling the training set to achieve data balance on a data level based on an oversampling algorithm, and inputting the training set processed by the feature selection algorithm into n oversampling algorithms to obtain a training set with a positive-negative sample ratio of 1: 1;

respectively inputting the training sets with the positive and negative sample proportion of 1:1 into a classification algorithm, then testing by using a data sequence of a test set, and selecting an oversampling algorithm i of the training set Pi with the optimal evaluation result;

inputting the training set into the oversampling algorithm that produced the training set Pi, adjusting the ratio of positive to negative samples by gradually increasing it to a set ratio, and selecting, by classification evaluation, the optimal ratio of positive to negative samples.

Embodiments of the present disclosure provide a system for classifying cell distant metastasis based on unbalanced learning, comprising:

a training set acquisition unit configured to: obtaining a plurality of data sequences with certain cells having distant metastasis and a plurality of data sequences without certain cells having distant metastasis, and forming a training set;

a feature selection unit configured to: input the training set into each of k feature selection algorithms, select for each algorithm the p top-ranked attributes as features of the training set, input those attributes into a classifier for training, compare the classification results, and select the p features with the best results;

an oversampling unit configured to: enabling the training set to achieve data balance on a data level based on an oversampling algorithm, and inputting the training set processed by the feature selection algorithm into n oversampling algorithms to obtain a training set with a positive-negative sample ratio of 1: 1;

an optimal oversampling algorithm obtaining unit configured to: respectively inputting the training sets with the positive and negative sample proportion of 1:1 into a classification algorithm, then testing by using a data sequence of a test set, and selecting an oversampling algorithm i of the training set Pi with the optimal evaluation result;

an optimal positive and negative sample proportion obtaining unit configured to: input the training set into the oversampling algorithm that produced the training set Pi, adjust the ratio of positive to negative samples by gradually increasing it to a set ratio, and select, by classification evaluation, the optimal ratio of positive to negative samples.

Compared with the prior art, the beneficial effect of this disclosure is:

1. The data sequences used in the technical scheme of the present disclosure can be routine hospital blood cell analysis data, which are technically easy to acquire and convenient for the subsequent feature selection and oversampling.

2. The technical scheme increases the proportion of positive samples by means of an oversampling algorithm, and obtains better model evaluation indexes and a better recall for the minority positive samples.

Drawings

The accompanying drawings, which are included to provide a further understanding of the disclosure, illustrate embodiments of the disclosure and together with the description serve to explain the disclosure and are not to limit the disclosure.

FIG. 1 is a flowchart of a method for classifying abnormal distant cell metastasis based on unbalanced learning according to an embodiment of the present disclosure;

FIG. 2 is a schematic diagram illustrating a feature selection strategy of an abnormal cell distant metastasis classification method based on unbalanced learning according to an embodiment of the present disclosure;

FIG. 3 is a schematic diagram of an oversampling algorithm selection strategy of the abnormal cell distant metastasis classification method based on unbalanced learning according to an embodiment of the present disclosure;

FIG. 4 is a schematic diagram illustrating a strategy for adjusting the proportion of positive and negative samples in the abnormal cell distant metastasis classification method based on unbalanced learning according to an embodiment of the present disclosure.

Detailed Description

It should be noted that the following detailed description is exemplary and is intended to provide further explanation of the disclosure. Unless defined otherwise, all technical and scientific terms used herein have the same meaning as commonly understood by one of ordinary skill in the art to which this disclosure belongs.

It is noted that the terminology used herein is for the purpose of describing particular embodiments only and is not intended to be limiting of example embodiments according to the present disclosure. As used herein, the singular forms "a", "an" and "the" are intended to include the plural forms as well, unless the context clearly indicates otherwise, and it should be understood that when the terms "comprises" and/or "comprising" are used in this specification, they specify the presence of the stated features, steps, operations, devices, components, and/or combinations thereof.

Currently, there are two main approaches to classifying unbalanced data. First, the data can be balanced: at the data level, the training samples are reconstructed with a suitable method, and balance can be achieved with an oversampling or undersampling algorithm. Second, algorithms can be improved or newly proposed: at the algorithm level, an existing classification algorithm is improved, or a new one is proposed, so that minority samples receive more attention and are classified more accurately. The technical solution of the embodiments of the present application takes the first approach and balances the data samples at the data level.

Example I

This implementation example discloses a method for classifying distant metastasis of abnormal cells based on unbalanced learning. First, an available data set is screened. Taking the classification of distant metastasis in esophageal squamous cell carcinoma as an example, patients with a recorded clinical M stage are selected from the diagnosis table according to the existing diagnosis information of esophageal squamous cell carcinoma patients: a clinical M stage of 0 indicates that the patient's cancer cells have not metastasized to other organs, and a clinical M stage of 1 indicates that they have. For each patient, the most recent blood cell analysis results before surgery are selected according to the operation time recorded in the operation table; if no surgery was performed, the blood cell analysis results from the day of diagnosis, or the most recent ones before diagnosis, are selected. The method operates only on existing patient data, so it is not itself a diagnostic or therapeutic procedure and only predicts the metastasis of the relevant cells from those data.

In this implementation example, the screened available samples are divided into a 75% training set and a 25% test set. The training set is input into several feature selection methods, and the 8 top-ranked attributes from each are taken as the features of the data set. These features are then input into a classifier, the output model evaluation index AUC and the recall are compared with the results obtained on the original data, and the features with the best results are selected.
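The 75%/25% division can be sketched as follows. The disclosure does not say how the split is drawn; stratifying by class (so that the scarce positive samples appear in both parts) is an assumption made here, and the function name `stratified_split`, the seed, and the toy label counts are hypothetical.

```python
import random


def stratified_split(labels, test_fraction=0.25, seed=0):
    """Split sample indices into train/test parts, preserving the class ratio.

    Stratification is an assumption of this sketch, not something the text specifies.
    """
    rng = random.Random(seed)
    by_class = {}
    for idx, label in enumerate(labels):
        by_class.setdefault(label, []).append(idx)
    train_idx, test_idx = [], []
    for indices in by_class.values():
        rng.shuffle(indices)
        n_test = round(len(indices) * test_fraction)
        test_idx.extend(indices[:n_test])
        train_idx.extend(indices[n_test:])
    return sorted(train_idx), sorted(test_idx)


# Toy labels: 80 negative (M stage 0) and 20 positive (M stage 1) samples.
labels = [0] * 80 + [1] * 20
train_idx, test_idx = stratified_split(labels)
print(len(train_idx), len(test_idx))
```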

In this implementation example, the classifier is trained on both classes of data so that it learns the characteristics that distinguish them; when new data are input, the classifier can automatically identify which class they belong to.

The output model evaluation index AUC and the recall are explained as follows:

AUC (Area Under Curve) is defined as the area under the ROC curve bounded by the coordinate axes; this area cannot exceed 1. Since the ROC curve generally lies above the line y = x, the AUC generally ranges between 0.5 and 1. The AUC is used as the evaluation criterion because in many cases the ROC curve alone does not clearly show which classifier performs better, whereas, as a single number, a larger AUC corresponds to a better classifier.

Recall, also known as the true positive rate (TPR), is a completeness measure for the classification of unbalanced data. It is the ratio of the number of minority samples that are correctly classified to the actual number of minority samples.
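The two measures can be sketched in a few lines. The rank-based form of the AUC used here (the probability that a randomly chosen positive sample scores higher than a randomly chosen negative one, with ties counted as one half) is a standard equivalent of the area under the ROC curve; the scores, labels, and the 0.5 decision threshold are hypothetical.

```python
def recall_score(y_true, y_pred, positive=1):
    """Recall (TPR): correctly classified positives / all actual positives."""
    tp = sum(t == positive and p == positive for t, p in zip(y_true, y_pred))
    fn = sum(t == positive and p != positive for t, p in zip(y_true, y_pred))
    return tp / (tp + fn) if (tp + fn) else 0.0


def auc_score(y_true, scores, positive=1):
    """AUC as the fraction of (positive, negative) pairs ranked correctly."""
    pos = [s for t, s in zip(y_true, scores) if t == positive]
    neg = [s for t, s in zip(y_true, scores) if t != positive]
    if not pos or not neg:
        raise ValueError("AUC needs at least one sample of each class")
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0 for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


# Hypothetical classifier scores for three negative and two positive samples.
y_true = [0, 0, 0, 1, 1]
scores = [0.1, 0.4, 0.35, 0.8, 0.3]
y_pred = [1 if s >= 0.5 else 0 for s in scores]  # assumed threshold of 0.5

recall = recall_score(y_true, y_pred)
auc = auc_score(y_true, scores)
print(recall, auc)
```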

Then, the data with the selected features are input into different oversampling algorithms so that the ratio of positive to negative samples becomes 1:1, the balanced data are input into a classifier, and the output model evaluation indexes and recall are compared to select the oversampling algorithm with the best results.

Then, the selected oversampling algorithm is used to increase the proportion of positive samples step by step, from 1.1:1 and 1.2:1 up to 2:1; two ratios with a larger gap between the classes, 5:1 and 10:1, are also tried, and a suitable ratio of positive to negative samples is selected from the comparison of results.

In specific implementation, referring to fig. 1, the method for classifying abnormal distant cell metastasis based on unbalanced learning includes:

Step (1): Screen the data set and then clean out dirty data: samples containing missing data are deleted directly, and the erythrocyte distribution width (CV) attribute is deleted before feature selection, leaving a complete data set.
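The cleaning in step (1) can be sketched as below. The record layout (one dictionary per sample), the attribute keys such as `rdw_cv`, and the use of `None` for a missing value are all hypothetical. The sketch follows the order in which the text lists the operations (samples with missing data are removed first, then the attribute is dropped); the disclosure does not state whether that order matters.

```python
def clean_records(records, drop_attributes=("rdw_cv",)):
    """Delete samples with any missing value, then delete the listed attributes."""
    cleaned = []
    for record in records:
        if any(value is None for value in record.values()):
            continue  # sample contains missing data: delete it directly
        cleaned.append({k: v for k, v in record.items() if k not in drop_attributes})
    return cleaned


# Hypothetical blood cell analysis records; m_stage is the class label.
records = [
    {"wbc": 6.1, "lymph_pct": 30.2, "rdw_cv": 13.0, "m_stage": 0},
    {"wbc": None, "lymph_pct": 25.0, "rdw_cv": 12.5, "m_stage": 1},
    {"wbc": 7.4, "lymph_pct": 18.9, "rdw_cv": 14.1, "m_stage": 1},
]
cleaned = clean_records(records)
print(len(cleaned), sorted(cleaned[0]))
```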

Specifically, the data set includes a training set and a testing set, and the data in the training set includes a plurality of blood cell analysis data sequences in which cancer cells have distant metastasis and a plurality of blood cell analysis data sequences in which cancer cells have no distant metastasis.

The test set stores a blood cell analysis data sequence to be tested.

The data sequence, i.e., the blood cell analysis data sequence, includes: leukocyte count, lymphocyte absolute value, lymphocyte percentage, neutrophil absolute value, neutrophil percentage, monocyte absolute value, monocyte percentage, eosinophil absolute value, eosinophil percentage, basophil absolute value, basophil percentage, erythrocyte count, hemoglobin, erythrocyte mean volume, erythrocyte mean hemoglobin content, erythrocyte mean hemoglobin concentration, erythrocyte distribution width (CV), platelet count, platelet distribution width, platelet distribution hematocrit, platelet mean volume.

Step (2): Referring to FIG. 2, feature selection is performed on the blood cell analysis data. The data set is input into each of k feature selection algorithms, the p top-ranked attributes from each are selected as the features of the data set, the features are input into a classifier for training, and the model evaluation indexes (G-Mean, AUC) and the recall (Recall) are output.

In this embodiment, the model is the aforementioned classifier.

Because the minority samples are the ones that matter, and following the principle of analyzing each specific problem on its own terms, the thresholds used to calculate the model evaluation indexes G-Mean and AUC (in effect, the weights applied to sensitivity and specificity) are reassigned, yielding a weighted G-Mean and a weighted AUC, denoted WG-Mean and WAUC. WG-Mean and WAUC are calculated in the given way, and the p features that give the best result compared with the original data are selected; the larger the calculated WG-Mean and WAUC, the better the result. These p features are used as the input in the subsequent steps.

Since the minority samples are the focus of attention in this case, the thresholds are reassigned so that recall carries more weight in the calculation.

In this example, G-Mean and AUC are two indexes for the comprehensive evaluation of classifiers.

The calculation formula after the threshold value is reset is as follows:

WAUC = Sensitivity × 0.7 + Specificity × 0.3;
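As a worked instance of the WAUC formula, the following sketch computes sensitivity and specificity from a set of predictions and combines them with the 0.7 and 0.3 weights stated above. The labels and predictions are hypothetical, and the helper names are illustrative only.

```python
def sensitivity_specificity(y_true, y_pred, positive=1):
    """Return (sensitivity, specificity) for a binary classification result."""
    tp = sum(t == positive and p == positive for t, p in zip(y_true, y_pred))
    fn = sum(t == positive and p != positive for t, p in zip(y_true, y_pred))
    tn = sum(t != positive and p != positive for t, p in zip(y_true, y_pred))
    fp = sum(t != positive and p == positive for t, p in zip(y_true, y_pred))
    sensitivity = tp / (tp + fn) if (tp + fn) else 0.0
    specificity = tn / (tn + fp) if (tn + fp) else 0.0
    return sensitivity, specificity


def weighted_auc(sensitivity, specificity, w_sens=0.7, w_spec=0.3):
    """WAUC = Sensitivity x 0.7 + Specificity x 0.3, as given in the text."""
    return sensitivity * w_sens + specificity * w_spec


# Hypothetical result: 3 of 4 positives found, 3 of 6 negatives correctly rejected.
y_true = [1, 1, 1, 1, 0, 0, 0, 0, 0, 0]
y_pred = [1, 1, 1, 0, 0, 0, 0, 1, 1, 1]

sens, spec = sensitivity_specificity(y_true, y_pred)
wauc = weighted_auc(sens, spec)
print(sens, spec, wauc)
```

With a sensitivity of 0.75 and a specificity of 0.5, the weighting favors the recall of the positive class: 0.75 × 0.7 + 0.5 × 0.3 = 0.675.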

In this embodiment, the original data are the data without balancing of the positive and negative class samples.

In specific implementation, the 8 top-ranked attributes are selected as basic features according to the results of each algorithm. After analysis, the selected features are: platelet distribution width, lymphocyte percentage, lymphocyte absolute value, neutrophil percentage, platelet mean volume, erythrocyte count, hemoglobin, and hematocrit.
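The selection strategy of step (2) can be sketched generically. The disclosure does not name the k feature selection algorithms or the classifier, so this sketch takes per-algorithm attribute scores as given and an `evaluate` callback that stands in for "train a classifier on these features and return an index such as WAUC". The algorithm names, scores, attribute keys, and the toy evaluator are all hypothetical.

```python
def top_p(scores, p):
    """Return the p attribute names with the highest score, best first."""
    return sorted(scores, key=scores.get, reverse=True)[:p]


def select_features(scores_by_algorithm, evaluate, p=8):
    """Take the top-p attributes of each algorithm and keep the best-scoring set."""
    best_name, best_features, best_value = None, None, float("-inf")
    for name, scores in scores_by_algorithm.items():
        features = top_p(scores, p)
        value = evaluate(features)  # e.g. WAUC of a classifier trained on them
        if value > best_value:
            best_name, best_features, best_value = name, features, value
    return best_name, best_features, best_value


# Hypothetical attribute scores from two unnamed feature selection algorithms.
scores_by_algorithm = {
    "algo_a": {"plt_width": 0.9, "lymph_pct": 0.7, "wbc": 0.2},
    "algo_b": {"wbc": 0.8, "hgb": 0.6, "lymph_pct": 0.1},
}

# Toy stand-in for classifier evaluation: share of features in a made-up useful set.
useful = {"plt_width", "lymph_pct", "hgb"}


def toy_evaluate(features):
    return sum(f in useful for f in features) / len(features)


best_name, best_features, best_value = select_features(
    scores_by_algorithm, toy_evaluate, p=2
)
print(best_name, best_features, best_value)
```

In the described method the comparison also includes the result on the original, unselected data; that baseline could be added as one more entry evaluated with the same callback.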

Referring to FIG. 3, the oversampling algorithm that yields the best results is selected. Steps (3) and (4): the data set is balanced at the data level by oversampling. The data with the 8 selected features are input into different oversampling algorithms to obtain training sets P1, P2, ..., Pn, each with a positive-to-negative sample ratio of 1:1. Each of P1, P2, ..., Pn is input into a classification algorithm and then tested with test set N, and the model evaluation indexes (G-Mean, AUC) and the recall (Recall) are output. WG-Mean and WAUC are calculated, and the oversampling algorithm i that produced the training set Pi with the best result is selected.
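Steps (3) and (4) can be sketched as follows. The disclosure does not name the n oversampling algorithms or the classifier, so everything concrete here is an assumption: random duplication and a SMOTE-like interpolation stand in for the oversamplers, a nearest-centroid rule stands in for the classifier, and the data are toy values. Only the WAUC weights (0.7 and 0.3) come from the text.

```python
import random


def _n_extra(y, ratio):
    """Positive rows to add so that positives : negatives reaches ratio : 1."""
    return max(0, round(ratio * y.count(0)) - y.count(1))


def duplicate_oversample(X, y, ratio=1.0, seed=0):
    """Stand-in oversampler 1: copy randomly chosen positive rows."""
    rng = random.Random(seed)
    pos = [row for row, label in zip(X, y) if label == 1]
    extra = [list(rng.choice(pos)) for _ in range(_n_extra(y, ratio))]
    return X + extra, y + [1] * len(extra)


def interpolate_oversample(X, y, ratio=1.0, seed=0):
    """Stand-in oversampler 2 (SMOTE-like): new rows between two positive rows."""
    rng = random.Random(seed)
    pos = [row for row, label in zip(X, y) if label == 1]
    extra = []
    for _ in range(_n_extra(y, ratio)):
        a, b, t = rng.choice(pos), rng.choice(pos), rng.random()
        extra.append([ai + t * (bi - ai) for ai, bi in zip(a, b)])
    return X + extra, y + [1] * len(extra)


def centroid(rows):
    return [sum(col) / len(rows) for col in zip(*rows)]


def make_evaluator(X_test, y_test):
    """Build evaluate(X_train, y_train) -> WAUC of a nearest-centroid classifier."""
    def dist(a, b):
        return sum((ai - bi) ** 2 for ai, bi in zip(a, b))

    def evaluate(X_train, y_train):
        c_pos = centroid([r for r, label in zip(X_train, y_train) if label == 1])
        c_neg = centroid([r for r, label in zip(X_train, y_train) if label == 0])
        pred = [1 if dist(r, c_pos) <= dist(r, c_neg) else 0 for r in X_test]
        sens = sum(p == 1 for p, t in zip(pred, y_test) if t == 1) / y_test.count(1)
        spec = sum(p == 0 for p, t in zip(pred, y_test) if t == 0) / y_test.count(0)
        return 0.7 * sens + 0.3 * spec

    return evaluate


def pick_oversampler(oversamplers, X, y, evaluate):
    """Balance the training set with each oversampler and keep the best one."""
    scores = {name: evaluate(*fn(X, y)) for name, fn in oversamplers.items()}
    return max(scores, key=scores.get), scores


# Toy data: 2 positive and 4 negative training rows, 3 test rows.
X_train = [[1.0, 2.0], [1.2, 1.8], [3.0, 3.5], [3.1, 3.6], [2.9, 3.4], [3.2, 3.3]]
y_train = [1, 1, 0, 0, 0, 0]
X_test = [[1.1, 1.9], [3.0, 3.4], [2.8, 3.5]]
y_test = [1, 0, 0]

oversamplers = {"duplicate": duplicate_oversample, "interpolate": interpolate_oversample}
X_bal, y_bal = interpolate_oversample(X_train, y_train)
best, scores = pick_oversampler(
    oversamplers, X_train, y_train, make_evaluator(X_test, y_test)
)
print(y_bal.count(1), y_bal.count(0), best)
```

Only the training set is oversampled; the test set is left untouched so that the evaluation reflects the original class distribution.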

In this embodiment, the data set is divided into a training set, used to train the model, and a test set, used to test the model. First, the training set is input into the feature selection algorithms, the classification results are compared with those obtained on the original data set, and the p features with the best results are selected; then an oversampling algorithm is used to obtain a training set with a positive-to-negative sample ratio of 1:1.

Step (5): Referring to FIG. 4, the ratio of positive to negative samples is adjusted to obtain a higher recall and better model evaluation indexes. Training set M is input into the oversampling algorithm that produced training set Pi, and the ratio of positive to negative samples is increased step by step, from 1.1:1 and 1.2:1 up to 2:1, and even to 5:1 and 10:1. The Recall, WG-Mean, and WAUC output at each ratio are compared, and the ratio of positive to negative samples with the optimal result is selected.
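The ratio sweep of step (5) can be sketched as below. The step size of 0.1 between 1.1:1 and 2:1 is inferred from the two listed values and is an assumption, as are the helper names. The `evaluate` callback stands in for "oversample to this ratio, train, test, and return an index such as WAUC"; the one used in the demonstration is a toy placeholder, not a result from the disclosure.

```python
def candidate_ratios():
    """1.1:1 up to 2:1 in assumed steps of 0.1, plus the wider 5:1 and 10:1."""
    return [round(1.0 + 0.1 * k, 1) for k in range(1, 11)] + [5.0, 10.0]


def positives_needed(n_pos, n_neg, ratio):
    """Synthetic positive samples to add so that positives : negatives = ratio : 1."""
    return max(0, round(ratio * n_neg) - n_pos)


def best_ratio(ratios, evaluate):
    """Evaluate every candidate ratio and return the best one with all scores."""
    scores = {ratio: evaluate(ratio) for ratio in ratios}
    return max(scores, key=scores.get), scores


ratios = candidate_ratios()
extra = positives_needed(n_pos=20, n_neg=80, ratio=1.5)


# Toy placeholder evaluator that happens to peak at 1.5:1; a real one would
# oversample the training set to the given ratio, train, and score on the test set.
def toy_evaluate(ratio):
    return -abs(ratio - 1.5)


best, scores = best_ratio(ratios, toy_evaluate)
print(ratios, extra, best)
```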

Selecting the ratio of positive to negative samples with the optimal result gives the best model evaluation indexes.

The disclosed embodiment performs analysis and prediction on routine hospital blood cell examinations, so as to replace other expensive and time-consuming diagnostic approaches, which is innovative in terms of application.

The technology used in the disclosed embodiment overcomes the limitation that clinical researchers are often unfamiliar with machine learning, and goes beyond the conventional P-value test analysis method.

The disclosed embodiment reassigns the thresholds used to calculate the model evaluation indexes in light of their specific practical meaning.

An oversampling algorithm is used to increase the proportion of positive samples, yielding better model evaluation indexes and a better recall for the minority positive samples.

Example II

In another embodiment, the system may be implemented by a server, a data input device, and a data display. The data input device is used to input blood cell analysis data into the server or to retrieve blood cell data stored in a memory; after the server processes the data, the display shows the specific results and the related data from the processing.

The server comprises a training set acquisition unit, a feature selection unit, an oversampling unit, an optimal oversampling algorithm acquisition unit and an optimal positive and negative sample proportion acquisition unit.

For the specific implementation of the above units, refer to the process in Example I; it is not described in detail here.

Example III

The disclosed embodiment provides a computer device comprising a memory, a processor, and a computer program stored on the memory and executable on the processor, wherein the processor, when executing the program, implements the steps of cell distant metastasis classification based on unbalanced learning.

In this embodiment, for the specific steps, refer to the detailed process of Example I; they are not described in detail here.

It should be noted that although several modules or sub-modules of the device are mentioned in the above detailed description, such division is merely exemplary and not mandatory. Indeed, the features and functionality of two or more of the modules described above may be embodied in one module in accordance with embodiments of the present disclosure. Conversely, the features and functions of one module described above may be further divided so as to be embodied by a plurality of modules.

Example IV

The disclosed embodiment provides a computer-readable storage medium on which a computer program is stored, wherein the program, when executed by a processor, implements the steps of cell distant metastasis classification based on unbalanced learning.

In the present embodiments, a computer program product may include a computer-readable storage medium having computer-readable program instructions embodied thereon for performing various aspects of the present disclosure. The computer readable storage medium may be a tangible device that can hold and store the instructions for use by the instruction execution device.

It is to be understood that throughout the description of the present specification, reference to the term "one embodiment", "another embodiment", "other embodiments", or "first through nth embodiments", etc., is intended to mean that a particular feature, structure, material, or characteristic described in connection with the embodiment or example is included in at least one embodiment or example of the present invention. In this specification, the schematic representations of the terms used above do not necessarily refer to the same embodiment or example. Furthermore, the particular features, structures, or materials described may be combined in any suitable manner in any one or more embodiments or examples.

The above description is only a preferred embodiment of the present disclosure and is not intended to limit the present disclosure, and various modifications and changes may be made to the present disclosure by those skilled in the art. Any modification, equivalent replacement, improvement and the like made within the spirit and principle of the present disclosure should be included in the protection scope of the present disclosure.

Claims

1. The abnormal cell distant metastasis classification method based on unbalanced learning is characterized by comprising the following steps:

inputting each training set with a positive-to-negative sample ratio of 1:1 into a classification algorithm, testing with the data sequences of the test set, and selecting the oversampling algorithm i of the training set Pi with the optimal evaluation result;

inputting the training set into the oversampling algorithm that produced the training set Pi, adjusting the ratio of positive to negative samples by gradually increasing it to a set ratio, and selecting the ratio of positive to negative samples with the optimal classification evaluation.

2. The method of classifying abnormal cell distant metastasis based on unbalanced learning according to claim 1, wherein the data sequence comprises: leukocyte count, lymphocyte absolute value, lymphocyte percentage, neutrophil absolute value, neutrophil percentage, monocyte absolute value, monocyte percentage, eosinophil absolute value, eosinophil percentage, basophil absolute value, basophil percentage, erythrocyte count, hemoglobin, erythrocyte mean volume, erythrocyte mean hemoglobin content, erythrocyte mean hemoglobin concentration, erythrocyte distribution width, platelet count, platelet distribution width, platelet distribution volume, and platelet mean volume.

3. The method of claim 1, wherein the selecting the p features with the best results comprises: width of platelet distribution, lymphocyte percentage, absolute value of lymphocytes, neutrophil percentage, mean volume of platelets, red blood cell count, hemoglobin, and hematocrit.

4. The abnormal cell distant metastasis classification method based on unbalanced learning as claimed in claim 1, wherein the data of the training set is subjected to data screening before feature selection, and the data integrity is judged, and the samples containing the missing data are deleted.

5. A cell distant metastasis classification system based on unbalanced learning is characterized by comprising:

an optimal oversampling algorithm obtaining unit configured to: input each training set with a positive-to-negative sample ratio of 1:1 into a classification algorithm, test with the data sequences of the test set, and select the oversampling algorithm i of the training set Pi with the optimal evaluation result;

an optimal positive and negative sample proportion obtaining unit configured to: input the training set into the oversampling algorithm that produced the training set Pi, adjust the ratio of positive to negative samples by gradually increasing it to a set ratio, and select the ratio of positive to negative samples with the optimal classification evaluation.

6. The system of claim 5, wherein the selecting the p features that yield the best results comprises: width of platelet distribution, lymphocyte percentage, absolute value of lymphocytes, neutrophil percentage, mean volume of platelets, red blood cell count, hemoglobin, and hematocrit.

7. A cell distant metastasis classification system based on unbalanced learning, characterized by comprising a server, a data input device, and a data display, wherein the data input device is used to input blood cell analysis data into the server or to retrieve blood cell data stored in a memory, and the display is used to display specific results and related data from the data processing;

the server is configured to include:

8. A computer device comprising a memory, a processor and a computer program stored on the memory and executable on the processor, wherein the processor executes the program to implement the steps of the method for classifying distant metastasis of abnormal cells based on unbalanced learning according to any one of claims 1 to 4.

9. A computer-readable storage medium, on which a computer program is stored, which, when being executed by a processor, carries out the steps of a method for classifying distant metastasis of abnormal cells based on unbalanced learning according to any one of claims 1 to 4.