CN109948732B - Abnormal cell distant metastasis classification method and system based on unbalanced learning - Google Patents

Abnormal cell distant metastasis classification method and system based on unbalanced learning

Info

Publication number
CN109948732B
Authority
CN
China
Prior art keywords
training set
data
positive
proportion
algorithm
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Expired - Fee Related
Application number
CN201910251365.4A
Other languages
Chinese (zh)
Other versions
CN109948732A (en)
Inventor
彭立志
李雪梅
杨波
李宝生
朱健
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Shandong Cancer Hospital & Institute (Shandong Cancer Hospital)
University of Jinan
Original Assignee
Shandong Cancer Hospital & Institute (Shandong Cancer Hospital)
University of Jinan
Application filed by Shandong Cancer Hospital & Institute (Shandong Cancer Hospital) and University of Jinan
Priority to CN201910251365.4A
Publication of CN109948732A
Application granted
Publication of CN109948732B
Legal status: Expired - Fee Related

Abstract

The disclosure provides a method and a system for classifying abnormal cell distant metastasis based on unbalanced learning. A plurality of data sequences in which certain cells have distant metastasis and a plurality of data sequences in which they do not are obtained; the data set is divided into a training set, used to train a model, and a test set, used to test the model. First, the training set is input into feature selection algorithms, the classification results are compared with those obtained on the original data set, and the p features with the best results are selected. Next, oversampling algorithms are used to obtain training sets with a positive-to-negative sample ratio of 1:1; each is input into a classification algorithm and tested with the data sequences of the test set, and the oversampling algorithm i that produced the training set Pi with the best evaluation result is selected. Finally, the training set is input into oversampling algorithm i while the positive-to-negative sample ratio is gradually increased to a set ratio, and the ratio with the best classification evaluation is selected. By using an oversampling algorithm to increase the proportion of positive samples, the technical scheme obtains better model evaluation indexes and a better recall for the minority positive samples.

Description

Abnormal cell distant metastasis classification method and system based on unbalanced learning
Technical Field
The disclosure relates to the technical field of machine learning and data mining, in particular to a method and a system for classifying abnormal cell distant metastasis based on unbalanced learning.
Background
Esophageal squamous cell carcinoma is one of the most common malignant tumors worldwide, but its early symptoms are not obvious and the changes in the body are easily ignored, so the disease is usually at a middle or advanced stage by the time the patient can no longer tolerate the symptoms and goes to a hospital for examination. In clinical practice, doctors determine whether the cancer cells of a patient with esophageal squamous cell carcinoma have metastasized to distant sites by imaging, or even by puncture or surgery. These three approaches are time-consuming and add to the patient's treatment costs. With the advent of the big data era, it has been proposed to address this problem by predicting from blood cell analysis whether a patient's cancer cells have metastasized. According to the relevant literature, classification prediction of lymph node metastasis has been attempted in the medical field, with specificity and sensitivity below 50%, whereas no comparable research exists for distant metastasis. Clinical researchers have relied on P-value statistical tests using statistical analysis software (SPSS, SAS), while the present disclosure uses machine learning for analysis and prediction.
Because only a limited amount of data has been collected from patients with esophageal squamous cell carcinoma, and patients whose cancer cells have metastasized to distant sites make up only a small proportion of them, the data suffer from class imbalance.
The inventors found in their research that, on such unbalanced data sets, a standard classifier tends to maximize overall accuracy while ignoring the minority-class samples, which are precisely the samples of interest. Even if high accuracy is obtained, the analysis result is then meaningless, and it is difficult to predict effectively whether a patient's cancer cells have metastasized. In real life, and particularly in the medical field, class imbalance is common, mainly because of disease incidence rates. In this case, the performance of a standard classifier is severely degraded if the unbalanced data are not processed.
Disclosure of Invention
The purpose of the embodiments of the present specification is to provide a method for classifying abnormal cell distant metastasis based on unbalanced learning, which uses an oversampling algorithm to increase the proportion of positive samples and thereby obtains better model evaluation indexes and a better recall for the minority positive samples.
The embodiment of the specification provides a method for classifying abnormal cell distant metastasis based on unbalanced learning, which comprises the following steps:
obtaining a plurality of data sequences with certain cells having distant metastasis and a plurality of data sequences without certain cells having distant metastasis, and forming a training set;
inputting the training set into each of k feature selection algorithms, selecting for each algorithm the p top-ranked attributes as features of the training set, inputting them into a classifier for training, comparing the classification results, and selecting the p features with the best results;
balancing the training set at the data level by means of oversampling: inputting the training set processed by the feature selection algorithm into n oversampling algorithms to obtain training sets with a positive-to-negative sample ratio of 1:1;
inputting each training set with a positive-to-negative sample ratio of 1:1 into a classification algorithm, testing with the data sequences of a test set, and selecting the oversampling algorithm i that produced the training set Pi with the best evaluation result;
inputting the training set into oversampling algorithm i while adjusting the positive-to-negative sample ratio, gradually increasing the ratio to a set ratio, and selecting the positive-to-negative sample ratio with the best classification evaluation.
Embodiments of the present disclosure provide a cell distant metastasis classification system based on unbalanced learning, comprising:
a training set acquisition unit configured to: obtaining a plurality of data sequences with certain cells having distant metastasis and a plurality of data sequences without certain cells having distant metastasis, and forming a training set;
a feature selection unit configured to: input the training set into each of k feature selection algorithms, select for each algorithm the p top-ranked attributes as features of the training set, input them into a classifier for training, compare the classification results, and select the p features with the best results;
an oversampling unit configured to: balance the training set at the data level by means of oversampling, inputting the training set processed by the feature selection algorithm into n oversampling algorithms to obtain training sets with a positive-to-negative sample ratio of 1:1;
an optimal oversampling algorithm obtaining unit configured to: input each training set with a positive-to-negative sample ratio of 1:1 into a classification algorithm, test with the data sequences of a test set, and select the oversampling algorithm i that produced the training set Pi with the best evaluation result;
an optimal positive and negative sample proportion obtaining unit configured to: input the training set into oversampling algorithm i while adjusting the positive-to-negative sample ratio, gradually increase the ratio to a set ratio, and select the positive-to-negative sample ratio with the best classification evaluation.
Compared with the prior art, the beneficial effect of this disclosure is:
1. The data sequences used in the disclosed technical scheme can be routine blood cell analysis data from hospital examinations, so the data are technically easy to acquire, and the subsequent feature selection and oversampling are convenient to perform.
2. The technical scheme uses an oversampling algorithm to increase the proportion of positive samples, and obtains better model evaluation indexes and a better recall for the minority positive samples.
Drawings
The accompanying drawings, which are included to provide a further understanding of the disclosure, illustrate embodiments of the disclosure and together with the description serve to explain the disclosure and are not to limit the disclosure.
FIG. 1 is a flowchart of a method for classifying abnormal cell distant metastasis based on unbalanced learning according to an embodiment of the present disclosure;
FIG. 2 is a schematic diagram illustrating a feature selection strategy of the abnormal cell distant metastasis classification method based on unbalanced learning according to an embodiment of the present disclosure;
FIG. 3 is a schematic diagram of an oversampling algorithm selection strategy of the abnormal cell distant metastasis classification method based on unbalanced learning according to an embodiment of the present disclosure;
FIG. 4 is a schematic diagram illustrating a strategy for adjusting the proportion of positive and negative samples in the abnormal cell distant metastasis classification method based on unbalanced learning according to an embodiment of the present disclosure.
Detailed Description
It should be noted that the following detailed description is exemplary and is intended to provide further explanation of the disclosure. Unless defined otherwise, all technical and scientific terms used herein have the same meaning as commonly understood by one of ordinary skill in the art to which this disclosure belongs.
It is noted that the terminology used herein is for the purpose of describing particular embodiments only and is not intended to limit example embodiments according to the present disclosure. As used herein, the singular forms "a", "an" and "the" are intended to include the plural forms as well, unless the context clearly indicates otherwise. It should further be understood that the terms "comprises" and/or "comprising", when used in this specification, specify the presence of the stated features, steps, operations, devices, components, and/or combinations thereof.
Currently, there are two main approaches to classifying unbalanced data. The first is to balance the data: at the data level, the training samples are reconstructed by a suitable method, and balance can be achieved with an oversampling or undersampling algorithm. The second is to improve an existing algorithm or propose a new one: at the algorithm level, an existing classification algorithm is modified, or a new one is designed, so that minority-class samples receive more attention and are classified more accurately. The technical solution of the embodiments of the present application takes the first approach and balances the data samples at the data level.
Embodiment 1
This embodiment discloses a method for classifying abnormal cell distant metastasis based on unbalanced learning. An available data set is screened first. Taking the classification of distant metastasis in esophageal squamous cell carcinoma as an example, patients with a recorded clinical M stage are screened from the diagnosis table according to the existing diagnostic information of esophageal squamous cell carcinoma patients: a clinical M stage of 0 indicates that the patient's cancer cells have not metastasized to other organs, and a clinical M stage of 1 indicates that they have. For each patient, the blood cell analysis test data from the last test before surgical treatment are selected according to the operation time recorded in the operation table; if no surgical treatment was performed, the blood cell analysis test data from the day of diagnosis, or from the last test before diagnosis, are selected. The method operates only on existing patient data; it is not itself a method of diagnosis or treatment and only predicts the metastasis of the relevant cells from those data.
In this embodiment, the screened available samples are divided into a training set (75%) and a test set (25%). The training set is input into several feature selection methods, and for each method the 8 top-ranked attributes are selected as the features of the data set. These features are then input into a classifier, the output model evaluation index AUC and the recall are compared with the results obtained on the original data, and the features with the best results are selected.
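The 75%/25% split can be sketched as follows. This is a minimal illustration, not the patent's implementation: the function name `stratified_split`, the toy data, and the choice to split each class separately (so that the rare positive class appears in both parts) are assumptions; the source only states the 75/25 proportions.

```python
import random

def stratified_split(samples, labels, train_fraction=0.75, seed=0):
    """Split into train/test, applying the same fraction to each class (assumption)."""
    rng = random.Random(seed)
    by_class = {}
    for index, label in enumerate(labels):
        by_class.setdefault(label, []).append(index)
    train_idx, test_idx = [], []
    for indices in by_class.values():
        rng.shuffle(indices)
        cut = round(len(indices) * train_fraction)
        train_idx.extend(indices[:cut])
        test_idx.extend(indices[cut:])
    train = [(samples[i], labels[i]) for i in sorted(train_idx)]
    test = [(samples[i], labels[i]) for i in sorted(test_idx)]
    return train, test

# Toy data only: 8 positive (M stage 1) and 40 negative (M stage 0) records.
toy_samples = [[float(i)] for i in range(48)]
toy_labels = [1] * 8 + [0] * 40
train_set, test_set = stratified_split(toy_samples, toy_labels)
```

With a plain random split, a very small positive class could by chance end up almost entirely in one part, which is why the sketch splits per class.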
In this embodiment, a classifier is trained on the two classes of data so that it learns the characteristics that distinguish them; when new data are input, the classifier automatically identifies which class they belong to.
The output model evaluation index AUC and recall ratio recall are explained as follows:
AUC (Area Under the Curve) is defined as the area enclosed by the ROC curve and the coordinate axes; the value of this area cannot exceed 1. Since the ROC curve generally lies above the line y = x, the AUC generally ranges between 0.5 and 1. AUC is used as the evaluation criterion because in many cases the ROC curve alone does not clearly show which classifier performs better, whereas AUC is a single number: the classifier with the larger AUC performs better.
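As an illustration of the definition above, the area under the ROC curve equals the probability that a randomly chosen positive sample receives a higher score than a randomly chosen negative one. The sketch below computes it that way; the function name `roc_auc` and the example scores are hypothetical, and this pairwise form is a standard equivalent, not necessarily the computation used in the patent.

```python
def roc_auc(scores, labels):
    """AUC as the fraction of positive/negative pairs ranked correctly.

    A tie between a positive and a negative score counts as half a correct pair.
    """
    positives = [s for s, y in zip(scores, labels) if y == 1]
    negatives = [s for s, y in zip(scores, labels) if y == 0]
    if not positives or not negatives:
        raise ValueError("AUC needs at least one sample of each class")
    wins = 0.0
    for p in positives:
        for n in negatives:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(positives) * len(negatives))

# Two positives (0.9, 0.4) and three negatives (0.8, 0.35, 0.1):
# five of the six positive/negative pairs are ordered correctly.
example_auc = roc_auc([0.9, 0.8, 0.4, 0.35, 0.1], [1, 0, 1, 0, 0])
```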
Recall, also known as the true positive rate (TPR), is a completeness measure for the classification of unbalanced data: it is the ratio of the number of minority-class samples correctly identified to the actual number of minority-class samples.
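A short sketch of why recall matters on unbalanced data: a classifier that always predicts the majority class can reach high accuracy while finding none of the minority samples. The helper names and the toy labels are assumptions made for illustration.

```python
def confusion_counts(y_true, y_pred, positive=1):
    """Return (tp, fn, tn, fp), with `positive` marking the minority class."""
    tp = fn = tn = fp = 0
    for truth, pred in zip(y_true, y_pred):
        if truth == positive and pred == positive:
            tp += 1
        elif truth == positive:
            fn += 1
        elif pred == positive:
            fp += 1
        else:
            tn += 1
    return tp, fn, tn, fp

def recall(y_true, y_pred, positive=1):
    """Correctly identified minority samples divided by all minority samples."""
    tp, fn, _, _ = confusion_counts(y_true, y_pred, positive)
    return tp / (tp + fn) if tp + fn else 0.0

def accuracy(y_true, y_pred):
    """Fraction of all samples predicted correctly."""
    return sum(1 for t, p in zip(y_true, y_pred) if t == p) / len(y_true)

# Toy case: 2 positives among 20 samples, classifier always answers "negative".
y_true = [1] * 2 + [0] * 18
always_negative = [0] * 20
toy_accuracy = accuracy(y_true, always_negative)  # high despite missing every positive
toy_recall = recall(y_true, always_negative)      # no positive is recovered
```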
Then, the data with the selected features are input into different oversampling algorithms to bring the positive-to-negative sample ratio to 1:1. The balanced data are input into a classifier, and the output model evaluation indexes and recall are compared to select the oversampling algorithm with the best results.
Then, the selected oversampling algorithm is used to raise the proportion of positive samples step by step, from 1.1:1 and 1.2:1 up to 2:1, followed by two much more skewed ratios, 5:1 and 10:1. The most suitable positive-to-negative sample ratio is selected by comparing the results.
In a specific implementation, referring to FIG. 1, the method for classifying abnormal cell distant metastasis based on unbalanced learning includes:
Step (1): Screen the data set and then clean the dirty data: samples containing missing data are deleted directly, and the erythrocyte distribution width (CV) attribute is deleted before feature selection, leaving a complete data set.
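The cleaning in step (1) can be sketched as below. The record layout, the key names (`rdw_cv`, `wbc`, `m_stage`) and the values are hypothetical placeholders, and the order of the two operations (drop incomplete samples first, then drop the attribute) follows the order in which the text lists them, which is an assumption about the intended procedure.

```python
def clean_records(records, dropped_attribute="rdw_cv"):
    """Drop samples with any missing value, then remove one attribute from the rest."""
    complete = [r for r in records if all(v is not None for v in r.values())]
    return [
        {key: value for key, value in record.items() if key != dropped_attribute}
        for record in complete
    ]

# Placeholder records; None marks a missing measurement.
raw = [
    {"wbc": 6.1, "hemoglobin": 132.0, "rdw_cv": 13.1, "m_stage": 0},
    {"wbc": None, "hemoglobin": 118.0, "rdw_cv": 14.0, "m_stage": 1},
    {"wbc": 7.4, "hemoglobin": 125.0, "rdw_cv": 12.8, "m_stage": 1},
]
cleaned = clean_records(raw)
```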
Specifically, the data set includes a training set and a testing set, and the data in the training set includes a plurality of blood cell analysis data sequences in which cancer cells have distant metastasis and a plurality of blood cell analysis data sequences in which cancer cells have no distant metastasis.
The test set stores a blood cell analysis data sequence to be tested.
The data sequence, i.e., the blood cell analysis data sequence, includes: leukocyte count, lymphocyte absolute value, lymphocyte percentage, neutrophil absolute value, neutrophil percentage, monocyte absolute value, monocyte percentage, eosinophil absolute value, eosinophil percentage, basophil absolute value, basophil percentage, erythrocyte count, hemoglobin, erythrocyte mean volume, erythrocyte mean hemoglobin content, erythrocyte mean hemoglobin concentration, erythrocyte distribution width (CV), platelet count, platelet distribution width, platelet distribution hematocrit, platelet mean volume.
Step (2): Referring to FIG. 2, feature selection is performed on the blood cell analysis data. The data set is input into each of k feature selection algorithms, and for each algorithm the p top-ranked attributes are selected as the features of data set A. These features are input into a classifier for training, and the model evaluation indexes (G-Mean, AUC) and the recall are output.
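The selection among k feature rankings can be sketched as follows. The text does not name the feature selection algorithms or the classifier, so the ranker names, the rankings, and the `evaluate` callback (which stands in for "train a classifier on this subset and return its evaluation score") are all hypothetical, and the scores are placeholders rather than results.

```python
def select_best_feature_subset(rankings, p, evaluate):
    """Take the p top-ranked features of each ranking and keep the best-scoring subset.

    rankings: {algorithm_name: [feature names, best first]}
    evaluate: callable mapping a feature subset to a score (higher is better)
    """
    best_name, best_subset, best_score = None, None, float("-inf")
    for name, ranked_features in rankings.items():
        subset = ranked_features[:p]
        score = evaluate(subset)
        if score > best_score:
            best_name, best_subset, best_score = name, subset, score
    return best_name, best_subset, best_score

# Placeholder rankings from two hypothetical feature selection algorithms.
toy_rankings = {
    "ranker_a": ["platelet_count", "hemoglobin", "lymphocyte_pct"],
    "ranker_b": ["lymphocyte_pct", "neutrophil_pct", "platelet_count"],
}
# Placeholder scores standing in for a trained classifier's evaluation index.
toy_scores = {
    ("platelet_count", "hemoglobin"): 0.61,
    ("lymphocyte_pct", "neutrophil_pct"): 0.68,
}
chosen = select_best_feature_subset(
    toy_rankings, p=2, evaluate=lambda subset: toy_scores[tuple(subset)]
)
```

In the described method the winning subset is also compared against the result on the original data; that comparison is left out of this sketch.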
In this embodiment, the model is the aforementioned classifier.
Because the minority-class samples are the focus, and following the principle of analyzing each specific problem on its own terms, the threshold (weighting) values used in calculating the model evaluation indexes G-Mean and AUC are reassigned, giving a weighted G-Mean and a weighted AUC, denoted WG-Mean and WAUC. WG-Mean and WAUC are calculated in the given manner, and the p features whose results most improve on the original data are selected; the larger the calculated WG-Mean and WAUC, the better the result. These p features are used as input in the subsequent steps.
Since the minority-class samples are the focus in this case, the thresholds are reassigned so that recall carries more weight in the calculation.
In this embodiment, G-Mean and AUC are two indexes for comprehensively evaluating a classifier.
[Formula images BDA0002012500380000071 and BDA0002012500380000072: definitions of G-Mean and AUC, not reproduced in this text]
The calculation formula after the threshold value is reset is as follows:
WAUC = Sensitivity × 0.7 + Specificity × 0.3;
[Formula image BDA0002012500380000073: definition of WG-Mean, not reproduced in this text]
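The WAUC formula above can be sketched directly. The function name and the two example classifiers are hypothetical; only the 0.7 and 0.3 weights come from the text. WG-Mean is not sketched because its formula is not legible in this text.

```python
def weighted_auc(sensitivity, specificity, w_sensitivity=0.7, w_specificity=0.3):
    """WAUC = Sensitivity x 0.7 + Specificity x 0.3, with the weights from the text."""
    return sensitivity * w_sensitivity + specificity * w_specificity

# Two hypothetical classifiers with the same plain average of sensitivity
# and specificity (0.7), but different recall on the positive class.
cautious = weighted_auc(sensitivity=0.5, specificity=0.9)
sensitive = weighted_auc(sensitivity=0.8, specificity=0.6)
```

Because sensitivity carries the larger weight, the second classifier scores higher under WAUC even though an unweighted average would not separate the two, which matches the stated aim of giving recall more influence.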
In this embodiment, the original data are the data before the positive and negative class samples are balanced.
In a specific implementation, the 8 top-ranked attributes are selected as the basic features according to the results obtained from each algorithm. Following this analysis, the selected features are: platelet distribution width, lymphocyte percentage, lymphocyte absolute value, neutrophil percentage, mean platelet volume, red blood cell count, hemoglobin, and hematocrit.
Steps (3) and (4): Referring to FIG. 3, the oversampling algorithm that yields the best results is selected. The data set is balanced at the data level by means of oversampling: the data with the 8 selected features are input into different oversampling algorithms to obtain training sets P1, P2, ..., Pn with a positive-to-negative sample ratio of 1:1. Each of the training sets P1, P2, ..., Pn is input into a classification algorithm and then tested with test set N, and the model evaluation indexes (G-Mean, AUC) and the recall are output. WG-Mean and WAUC are calculated, and the oversampling algorithm i that produced the training set Pi with the best results is selected.
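As a minimal sketch of balancing a training set to a given positive-to-negative ratio, the code below uses plain random oversampling (duplicating randomly chosen positive samples). The text does not name the n oversampling algorithms it compares, so random oversampling is a stand-in chosen for illustration, and the function name and toy data are assumptions.

```python
import random

def random_oversample(pairs, ratio=1.0, positive=1, seed=0):
    """Duplicate random positive samples until positives:negatives reaches `ratio`.

    pairs: list of (features, label) tuples. A new list is returned and the
    input list is left unchanged.
    """
    rng = random.Random(seed)
    positives = [pair for pair in pairs if pair[1] == positive]
    negatives = [pair for pair in pairs if pair[1] != positive]
    if not positives:
        raise ValueError("cannot oversample without any positive sample")
    target = round(ratio * len(negatives))
    extra = [rng.choice(positives) for _ in range(max(0, target - len(positives)))]
    return negatives + positives + extra

# Toy training set: 5 positive and 20 negative samples.
toy_train = [([float(i)], 1) for i in range(5)] + [([float(i)], 0) for i in range(20)]
balanced = random_oversample(toy_train, ratio=1.0)
```

Only the training set is oversampled; the test set keeps its original class proportions, consistent with the text testing each Pi against the same test set N.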
In this embodiment, the data set is divided into a training set and a test set, the training set is used for training the model, and the test set is used for testing the model; firstly, inputting a training set into a feature selection algorithm to be compared with the results of the classification of an original condition data set, and selecting p features with the best results; and then obtaining a training set with a positive-negative sample ratio of 1:1 by using an oversampling algorithm.
Step (5): Referring to FIG. 4, the positive-to-negative sample ratio is adjusted to obtain a higher recall and better model evaluation indexes. Training set M is input into the oversampling algorithm i that produced training set Pi, and the positive-to-negative sample ratio is gradually increased through 1.1:1 and 1.2:1 up to 2:1; the ratios 5:1 and 10:1 are also tried. The Recall, WG-Mean and WAUC output at each ratio are compared, and the positive-to-negative sample ratio with the best results is selected.
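The ratio sweep in step (5) can be sketched as below. The text names 1.1:1, 1.2:1, 2:1, 5:1 and 10:1; the 0.1 step between 1.2:1 and 2:1, the helper names, and picking the ratio by a single score are assumptions, since the text compares Recall, WG-Mean and WAUC without stating how they are combined.

```python
def candidate_ratios():
    """Positive-to-negative ratios to try: 1.1 to 2.0 in steps of 0.1, then 5 and 10."""
    fine_steps = [round(1.0 + 0.1 * step, 1) for step in range(1, 11)]
    return fine_steps + [5.0, 10.0]

def positives_needed(n_negative, ratio):
    """Number of positive samples the oversampled training set must contain."""
    return round(n_negative * ratio)

def pick_best_ratio(score_by_ratio):
    """Return the ratio with the highest score (for example WAUC)."""
    return max(score_by_ratio, key=score_by_ratio.get)

ratios = candidate_ratios()
# For a training set with 30 negative samples, the positive-sample target per ratio.
targets = {ratio: positives_needed(30, ratio) for ratio in ratios}
```

Each target count would be passed to the selected oversampling algorithm, the classifier retrained, and the resulting score recorded per ratio before calling `pick_best_ratio`.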
Selecting the positive-to-negative sample ratio with the best results gives the best model evaluation indexes.
The disclosed embodiment bases its analysis and prediction on routine hospital blood cell examinations, with the aim of replacing other expensive and time-consuming diagnostic approaches, which is innovative in terms of application.
The technology used in the disclosed embodiment addresses the limited use of machine learning among clinical medicine researchers and goes beyond the conventional P-value test analysis.
The disclosed embodiment reassigns the thresholds used in calculating the model evaluation indexes in light of their practical meaning.
An oversampling algorithm is used to increase the proportion of positive samples, obtaining better model evaluation indexes and a better recall for the minority positive samples.
Embodiment 2
Embodiments of the present disclosure provide a cell distant metastasis classification system based on unbalanced learning, comprising:
a training set acquisition unit configured to: obtaining a plurality of data sequences with certain cells having distant metastasis and a plurality of data sequences without certain cells having distant metastasis, and forming a training set;
a feature selection unit configured to: input the training set into each of k feature selection algorithms, select for each algorithm the p top-ranked attributes as features of the training set, input them into a classifier for training, compare the classification results, and select the p features with the best results;
an oversampling unit configured to: balance the training set at the data level by means of oversampling, inputting the training set processed by the feature selection algorithm into n oversampling algorithms to obtain training sets with a positive-to-negative sample ratio of 1:1;
an optimal oversampling algorithm obtaining unit configured to: input each training set with a positive-to-negative sample ratio of 1:1 into a classification algorithm, test with the data sequences of a test set, and select the oversampling algorithm i that produced the training set Pi with the best evaluation result;
an optimal positive and negative sample proportion obtaining unit configured to: input the training set into oversampling algorithm i while adjusting the positive-to-negative sample ratio, gradually increase the ratio to a set ratio, and select the positive-to-negative sample ratio with the best classification evaluation.
In another embodiment, the system may be implemented by a server, a data input device and a data display. The data input device is used to input blood cell analysis data into the server or to retrieve blood cell data stored in a memory; after the server processes the data, the display shows the specific results and the related data generated during processing.
The server comprises a training set acquisition unit, a feature selection unit, an oversampling unit, an optimal oversampling algorithm acquisition unit and an optimal positive and negative sample proportion acquisition unit.
For the specific implementation of the above units, refer to the process described in the first embodiment; it is not repeated here.
Embodiment 3
This embodiment discloses a computer device comprising a memory, a processor, and a computer program stored in the memory and executable on the processor, wherein the processor, when executing the program, implements the steps of cell distant metastasis classification based on unbalanced learning.
For the specific steps in this embodiment, refer to the detailed process of the first embodiment; they are not repeated here.
It should be noted that although several modules or sub-modules of the device are mentioned in the above detailed description, such division is merely exemplary and not mandatory. Indeed, the features and functionality of two or more of the modules described above may be embodied in one module in accordance with embodiments of the present disclosure. Conversely, the features and functions of one module described above may be further divided so as to be embodied by a plurality of modules.
Embodiment 4
This embodiment discloses a computer-readable storage medium on which a computer program is stored, wherein the program, when executed by a processor, implements the steps of cell distant metastasis classification based on unbalanced learning.
For the specific steps in this embodiment, refer to the detailed process of the first embodiment; they are not repeated here.
In the present embodiments, a computer program product may include a computer-readable storage medium having computer-readable program instructions embodied thereon for performing various aspects of the present disclosure. The computer readable storage medium may be a tangible device that can hold and store the instructions for use by the instruction execution device.
It is to be understood that throughout the description of the present specification, reference to the term "one embodiment", "another embodiment", "other embodiments", or "first through nth embodiments", etc., is intended to mean that a particular feature, structure, material, or characteristic described in connection with the embodiment or example is included in at least one embodiment or example of the present invention. In this specification, the schematic representations of the terms used above do not necessarily refer to the same embodiment or example. Furthermore, the particular features, structures, or materials described may be combined in any suitable manner in any one or more embodiments or examples.
The above description is only a preferred embodiment of the present disclosure and is not intended to limit the present disclosure, and various modifications and changes may be made to the present disclosure by those skilled in the art. Any modification, equivalent replacement, improvement and the like made within the spirit and principle of the present disclosure should be included in the protection scope of the present disclosure.

Claims (9)

1. The abnormal cell distant metastasis classification method based on unbalanced learning is characterized by comprising the following steps:
obtaining a plurality of data sequences with certain cells having distant metastasis and a plurality of data sequences without certain cells having distant metastasis, and forming a training set;
respectively inputting the training set into k feature selection algorithms, respectively selecting p attributes ranked in the front as features of the training set, inputting the attributes into a classifier for training, comparing classification results, and selecting p features with the best results;
enabling the training set to achieve data balance on a data level based on an oversampling algorithm, and inputting the training set processed by the feature selection algorithm into n oversampling algorithms to obtain a training set with a positive-negative sample ratio of 1: 1;
respectively inputting the training sets with the positive and negative sample ratio of 1:1 into a classification algorithm, testing by using the data sequence of the test set, and selecting the oversampling algorithm i of the training set Pi with the optimal evaluation result;
inputting the training set into the oversampling algorithm for obtaining the training set Pi by adjusting the proportion of the positive and negative samples, gradually increasing the proportion of the positive and negative samples to a set proportion, and selecting the proportion of the positive and negative samples with the optimal classification evaluation.
2. The method of classifying abnormal cell distant metastasis based on unbalanced learning according to claim 1, wherein the data sequence comprises: leukocyte count, lymphocyte absolute value, lymphocyte percentage, neutrophil absolute value, neutrophil percentage, monocyte absolute value, monocyte percentage, eosinophil absolute value, eosinophil percentage, basophil absolute value, basophil percentage, erythrocyte count, hemoglobin, erythrocyte mean volume, erythrocyte mean hemoglobin content, erythrocyte mean hemoglobin concentration, erythrocyte distribution width, platelet count, platelet distribution width, platelet distribution volume, and platelet mean volume.
3. The method of claim 1, wherein the selecting the p features with the best results comprises: width of platelet distribution, lymphocyte percentage, absolute value of lymphocytes, neutrophil percentage, mean volume of platelets, red blood cell count, hemoglobin, and hematocrit.
4. The abnormal cell distant metastasis classification method based on unbalanced learning as claimed in claim 1, wherein the data of the training set is subjected to data screening before feature selection, and the data integrity is judged, and the samples containing the missing data are deleted.
5. A cell distant metastasis classification system based on unbalanced learning is characterized by comprising:
a training set acquisition unit configured to: obtaining a plurality of data sequences with certain cells having distant metastasis and a plurality of data sequences without certain cells having distant metastasis, and forming a training set;
a feature selection unit configured to: respectively inputting the training set into k feature selection algorithms, respectively selecting p attributes ranked in the front as features of the training set, inputting the attributes into a classifier for training, comparing classification results, and selecting p features with the best results;
an oversampling unit configured to: enabling the training set to achieve data balance on a data level based on an oversampling algorithm, and inputting the training set processed by the feature selection algorithm into n oversampling algorithms to obtain training sets with a positive-negative sample ratio of 1:1;
an optimal oversampling algorithm obtaining unit configured to: respectively inputting the training sets with the positive and negative sample ratio of 1:1 into a classification algorithm, testing by using the data sequence of the test set, and selecting the oversampling algorithm i corresponding to the training set P_i with the optimal evaluation result;
an optimal positive and negative sample proportion obtaining unit configured to: adjusting the proportion of the positive and negative samples by inputting the training set into the oversampling algorithm i of the obtained training set P_i, gradually increasing the proportion of the positive and negative samples to a set proportion, and selecting the proportion of the positive and negative samples with the optimal classification evaluation.
6. The system of claim 5, wherein the p features with the best results comprise: platelet distribution width, lymphocyte percentage, lymphocyte absolute value, neutrophil percentage, platelet mean volume, erythrocyte count, hemoglobin, and hematocrit.
7. A cell distant metastasis classification system based on unbalanced learning, characterized by comprising a server, a data input device and a data display, wherein the data input device is used for inputting blood cell analysis data into the server or for retrieving blood cell data stored in a memory, and the display is used for displaying specific results and related data in the data processing process;
the server is configured to include:
a training set acquisition unit configured to: obtaining a plurality of data sequences with certain cells having distant metastasis and a plurality of data sequences without certain cells having distant metastasis, and forming a training set;
a feature selection unit configured to: respectively inputting the training set into k feature selection algorithms, respectively selecting p attributes ranked in the front as features of the training set, inputting the attributes into a classifier for training, comparing classification results, and selecting p features with the best results;
an oversampling unit configured to: enabling the training set to achieve data balance on a data level based on an oversampling algorithm, and inputting the training set processed by the feature selection algorithm into n oversampling algorithms to obtain training sets with a positive-negative sample ratio of 1:1;
an optimal oversampling algorithm obtaining unit configured to: respectively inputting the training sets with the positive and negative sample ratio of 1:1 into a classification algorithm, testing by using the data sequence of the test set, and selecting the oversampling algorithm i corresponding to the training set P_i with the optimal evaluation result;
an optimal positive and negative sample proportion obtaining unit configured to: adjusting the proportion of the positive and negative samples by inputting the training set into the oversampling algorithm i of the obtained training set P_i, gradually increasing the proportion of the positive and negative samples to a set proportion, and selecting the proportion of the positive and negative samples with the optimal classification evaluation.
8. A computer device comprising a memory, a processor and a computer program stored on the memory and executable on the processor, wherein the processor executes the program to implement the steps of the method for classifying distant metastasis of abnormal cells based on unbalanced learning according to any one of claims 1 to 4.
9. A computer-readable storage medium on which a computer program is stored, wherein the program, when executed by a processor, carries out the steps of the method for classifying distant metastasis of abnormal cells based on unbalanced learning according to any one of claims 1 to 4.
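As a non-authoritative illustration of the data screening, oversampling, and positive-negative proportion search recited in the claims above, the following minimal Python sketch may help. It rests on several assumptions not stated in the claims: samples are (features, label) pairs with label 1 for distant metastasis, the positive class is the minority class, plain random duplication stands in for the unspecified oversampling algorithm i, and the names `drop_incomplete`, `random_oversample`, `best_ratio`, and the `evaluate` callback are hypothetical.

```python
import random
from typing import Callable, List, Optional, Sequence, Tuple

# One sample: (feature values, label). Label 1 = distant metastasis (assumed
# here to be the minority class), label 0 = no distant metastasis.
Sample = Tuple[List[Optional[float]], int]


def drop_incomplete(samples: Sequence[Sample]) -> List[Sample]:
    """Data screening: delete every sample that contains a missing value (None)."""
    return [(x, y) for x, y in samples if all(v is not None for v in x)]


def random_oversample(samples: Sequence[Sample], ratio: float, seed: int = 0) -> List[Sample]:
    """Duplicate random positives until positives equal round(ratio * negatives).

    Assumption: random duplication is only a stand-in for whichever
    oversampling algorithm i was selected; the claims do not name it.
    """
    rng = random.Random(seed)
    pos = [s for s in samples if s[1] == 1]
    neg = [s for s in samples if s[1] == 0]
    if not pos:
        raise ValueError("at least one positive sample is required")
    target = round(ratio * len(neg))
    extra = [rng.choice(pos) for _ in range(max(0, target - len(pos)))]
    return list(samples) + extra


def best_ratio(
    samples: Sequence[Sample],
    ratios: Sequence[float],
    evaluate: Callable[[List[Sample]], float],
) -> float:
    """Oversample to each candidate ratio and return the best-scoring ratio.

    `evaluate` is a hypothetical callback that trains a classifier on the
    oversampled training set and returns an evaluation score (higher is better).
    """
    scores = {r: evaluate(random_oversample(samples, r)) for r in ratios}
    return max(scores, key=lambda r: scores[r])
```

In use, `ratios` would start at 1.0 (the 1:1 training set) and step up to the set proportion, with `evaluate` wrapping the chosen classification algorithm and the test set.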
CN201910251365.4A 2019-03-29 2019-03-29 Abnormal cell distant metastasis classification method and system based on unbalanced learning Expired - Fee Related CN109948732B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201910251365.4A CN109948732B (en) 2019-03-29 2019-03-29 Abnormal cell distant metastasis classification method and system based on unbalanced learning

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201910251365.4A CN109948732B (en) 2019-03-29 2019-03-29 Abnormal cell distant metastasis classification method and system based on unbalanced learning

Publications (2)

Publication Number Publication Date
CN109948732A (en) 2019-06-28
CN109948732B (en) 2020-12-22

Family

ID=67012266

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201910251365.4A Expired - Fee Related CN109948732B (en) 2019-03-29 2019-03-29 Abnormal cell distant metastasis classification method and system based on unbalanced learning

Country Status (1)

Country Link
CN (1) CN109948732B (en)

Citations (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN104331642A (en) * 2014-10-28 2015-02-04 山东大学 Integrated learning method for recognizing ECM (extracellular matrix) protein
CN105868775A (en) * 2016-03-23 2016-08-17 深圳市颐通科技有限公司 Imbalance sample classification method based on PSO (Particle Swarm Optimization) algorithm
CN106126972A (en) * 2016-06-21 2016-11-16 哈尔滨工业大学 A kind of level multi-tag sorting technique for protein function prediction
CN108509982A (en) * 2018-03-12 2018-09-07 昆明理工大学 A method of the uneven medical data of two classification of processing

Family Cites Families (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
KR101563406B1 (en) * 2013-12-13 2015-10-26 건국대학교 산학협력단 System and method for large unbalanced data classification based on hadoop
CN104933053A (en) * 2014-03-18 2015-09-23 中国银联股份有限公司 Classification of class-imbalanced data
US10748663B2 (en) * 2017-05-04 2020-08-18 Efthymios Kalafatis Machine learning, natural language processing and network analysis-guided discovery related to medical research
CN108091397B (en) * 2018-01-24 2021-09-14 浙江大学 Bleeding event prediction method for patients with ischemic heart disease
CN108470187A (en) * 2018-02-26 2018-08-31 华南理工大学 A kind of class imbalance question classification method based on expansion training dataset
CN108847285B (en) * 2018-05-09 2021-05-28 吉林大学 Down syndrome screening method for pre-pregnancy and mid-pregnancy based on machine learning

Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
Zero-order TSK-type fuzzy system for imbalanced data classification; Gu Xiaoqing et al.; Acta Automatica Sinica; 2017-10-31; Vol. 43, No. 10; pp. 1773-1788 *

Also Published As

Publication number Publication date
CN109948732A (en) 2019-06-28

Similar Documents

Publication Publication Date Title
CN110444248B (en) Cancer biomolecule marker screening method and system based on network topology parameters
CN110634563A (en) Differential diagnosis device for diabetic nephropathy and non-diabetic nephropathy
Da Costa Digital image analysis of blood cells
CN109191451B (en) Abnormality detection method, apparatus, device, and medium
CN112635057B (en) Esophageal squamous carcinoma prognosis index model construction method based on clinical phenotype and LASSO
CN111524594A (en) Target population blood system malignant tumor screening system
CN114220540A (en) Construction method and application of diabetic nephropathy risk prediction model
CN107169264B (en) complex disease diagnosis system
Aktar et al. Predicting patient COVID-19 disease severity by means of statistical and machine learning analysis of blood cell transcriptome data
Shrestha et al. Supervised machine learning for early predicting the sepsis patient: modified mean imputation and modified chi-square feature selection
Mohammed et al. Analysis of anemia using data mining techniques with risk factors specification
CN109948732B (en) Abnormal cell distant metastasis classification method and system based on unbalanced learning
Khan et al. Reinforcing synthetic data for meticulous survival prediction of patients suffering from left ventricular systolic dysfunction
CN113539473A (en) Method and system for diagnosing brucellosis only by using blood routine test data
Li et al. An AI-Aided diagnostic framework for hematologic neoplasms based on morphologic features and medical expertise
CN112967803A (en) Early mortality prediction method and system for emergency patients based on integrated model
CN114242245A (en) Machine learning method, system and device for predicting diabetic nephropathy occurrence risk based on electronic medical record data
CN112508909A (en) Disease association method of peripheral blood cell morphology automatic detection system
EP2920573B1 (en) Particle data segmentation result evaluation methods and flow cytometer
CN107065839B (en) A kind of method for diagnosing faults and device based on diversity recursion elimination feature
CN113555118B (en) Method and device for predicting disease degree, electronic equipment and storage medium
Li et al. The Risk Prediction of Prostate Cancer Based on A Improved Hybrid Algorithm
Yördan et al. Hybrid AI-Based Chronic Kidney Disease Risk Prediction
CN113488170B (en) Method for constructing acute pre-uveitis recurrence risk prediction model and related equipment
Bennett et al. Using a machine learning model to risk stratify for the presence of significant liver disease in a primary care population

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant
CF01 Termination of patent right due to non-payment of annual fee

Granted publication date: 20201222
