CN114943874A - Data dividing method and device, computer equipment and storage medium - Google Patents


Publication number
CN114943874A
Authority
CN
China
Prior art keywords
data
classification
training
result
model
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202210589598.7A
Other languages
Chinese (zh)
Inventor
肖继锋 (Xiao Jifeng)
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Shenzhen United Imaging Research Institute of Innovative Medical Equipment
Original Assignee
Shenzhen United Imaging Research Institute of Innovative Medical Equipment
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Shenzhen United Imaging Research Institute of Innovative Medical Equipment
Priority to CN202210589598.7A priority Critical patent/CN114943874A/en
Publication of CN114943874A publication Critical patent/CN114943874A/en

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06V IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V 10/00 Arrangements for image or video recognition or understanding
    • G06V 10/70 Arrangements for image or video recognition or understanding using pattern recognition or machine learning
    • G06V 10/77 Processing image or video features in feature spaces; using data integration or data reduction, e.g. principal component analysis [PCA] or independent component analysis [ICA] or self-organising maps [SOM]; Blind source separation
    • G06V 10/774 Generating sets of training patterns; Bootstrap methods, e.g. bagging or boosting
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N 20/00 Machine learning
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06V IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V 10/00 Arrangements for image or video recognition or understanding
    • G06V 10/70 Arrangements for image or video recognition or understanding using pattern recognition or machine learning
    • G06V 10/764 Arrangements for image or video recognition or understanding using pattern recognition or machine learning using classification, e.g. of video objects
    • G06V 10/765 Arrangements for image or video recognition or understanding using pattern recognition or machine learning using classification, e.g. of video objects using rules for classification or partitioning the feature space

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Software Systems (AREA)
  • Evolutionary Computation (AREA)
  • General Physics & Mathematics (AREA)
  • Medical Informatics (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Physics & Mathematics (AREA)
  • Computing Systems (AREA)
  • Artificial Intelligence (AREA)
  • Health & Medical Sciences (AREA)
  • Databases & Information Systems (AREA)
  • General Health & Medical Sciences (AREA)
  • Multimedia (AREA)
  • Data Mining & Analysis (AREA)
  • Mathematical Physics (AREA)
  • General Engineering & Computer Science (AREA)
  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)

Abstract

The application relates to a data partitioning method and apparatus, a computer device, and a storage medium. The method comprises the following steps: randomly dividing each data item in a target data set into training data or test data, and labeling the division result; performing cross validation on the target data set, in each validation training a classification model with the first data re-divided into training data, and classifying the second data re-divided into test data with the trained classification model to obtain a classification result corresponding to each second data item, the classification result representing the probability that the second data is training data or the probability that the second data is test data; and if the division result is determined to be reasonable according to the classification results, ending the data division. By adopting the method, the training data and the test data can be distributed more uniformly, thereby improving the effect of a machine learning model.

Description

Data dividing method and device, computer equipment and storage medium
Technical Field
The present application relates to the field of machine learning technologies, and in particular, to a data partitioning method and apparatus, a computer device, and a storage medium.
Background
In a radiomics workflow, a data set is often divided into a training data set and a test data set, where the training data set is used to train a machine learning model and the test data set is used to test the trained model. However, since the individual data items in the data set differ slightly from one another, if the data are distributed non-uniformly between the training and test data sets when the data set is divided, the effect of the machine learning model may be poor.
Disclosure of Invention
In view of the foregoing, it is necessary to provide a data partitioning method, apparatus, computer device and storage medium capable of making training data and test data uniformly distributed.
In a first aspect, the present application provides a data partitioning method. The method comprises the following steps:
randomly dividing each data in the target data set into training data and testing data, and labeling division results;
performing cross validation on a target data set, performing classification model training by using first data which is re-divided into training data in each validation, and performing classification processing on second data which is re-divided into test data by using the trained classification model to obtain classification results corresponding to the second data; the classification result is used for representing the probability that the second data is training data or the probability that the second data is test data;
and if the division result is determined to be reasonable according to the classification result, ending the data division.
In one embodiment, the method further comprises:
and if the division result is determined to be unreasonable according to the classification result, randomly dividing the data in the target data set again until the division result is determined to be reasonable according to the classification result.
In one embodiment, the determining that the division result is reasonable according to the classification result includes:
determining an AUC value according to the classification result and the division result;
if the AUC value is within the preset range, determining that the division result is reasonable;
and if the AUC value is not in the preset range, determining that the division result is unreasonable.
In one embodiment, the determining the AUC value according to the classification result and the partition result includes:
determining an ROC curve according to the classification result, the division result and a plurality of probability threshold values;
and determining an AUC value according to the ROC curve.
In one embodiment, the performing cross validation on the target data set, training a classification model using the first data re-divided into training data in each validation, and classifying the second data re-divided into test data using the trained classification model to obtain a classification result corresponding to each second data includes:
equally dividing a target data set into M parts according to the data volume, taking M-1 parts of data as first data in each verification, dividing the first data into training data again, taking 1 part of data except the M-1 parts of data as second data, and dividing the second data into test data again; wherein M is a positive integer;
inputting the first data into an initial model for training to obtain a classification model;
and inputting the second data into the classification model to obtain a classification result output by the classification model.
In one embodiment, the inputting the first data into the initial model for training to obtain the classification model includes:
inputting the first data into the initial model to obtain a training result output by the initial model;
and adjusting parameters of the initial model according to the training result and the division result corresponding to the first data to obtain a classification model.
In one embodiment, randomly dividing each data in the target data set into training data and test data, and labeling the division result includes:
dividing a target data set into a training data set and a testing data set;
respectively labeling the division results of the data in the training data set and the data in the test data set;
the training data set and the test data set are mixed into a target data set.
In a second aspect, the present application further provides a data partitioning apparatus. The device includes:
the dividing module is used for randomly dividing each data in the target data set into training data and testing data and marking dividing results;
the verification module is used for performing cross verification on the target data set, training a classification model by using first data which is subdivided into training data in each verification, and classifying second data which is subdivided into test data by using the trained classification model to obtain a classification result corresponding to each second data; the classification result is used for representing the probability that the second data is training data or the probability that the second data is test data;
and the determining module is used for finishing data division if the dividing result is determined to be reasonable according to the classifying result.
In one embodiment, the apparatus further comprises:
and the re-dividing module is used for re-randomly dividing each data in the target data set if the dividing result is determined to be unreasonable according to the classification result until the dividing result is determined to be reasonable according to the classification result.
In one embodiment, the determining module is specifically configured to determine an AUC value according to the classification result and the partition result; if the AUC value is within the preset range, determining that the division result is reasonable; and if the AUC value is not in the preset range, determining that the division result is unreasonable.
In one embodiment, the determining module is specifically configured to determine an ROC curve according to the classification result, the division result, and a plurality of probability thresholds, and to determine an AUC value according to the ROC curve.
In one embodiment, the verification module is specifically configured to evenly divide the target data set into M parts according to the data volume; in each verification, use M-1 parts of data as first data, re-divide the first data into training data, use the 1 remaining part as second data, and re-divide the second data into test data, wherein M is a positive integer; input the first data into an initial model for training to obtain a classification model; and input the second data into the classification model to obtain the classification result output by the classification model.
In one embodiment, the verification module is specifically configured to input the first data into the initial model to obtain a training result output by the initial model; and adjusting parameters of the initial model according to the training result and the division result corresponding to the first data to obtain a classification model.
In one embodiment, the dividing module is specifically configured to divide the target data set into a training data set and a test data set; respectively labeling the division results of the data in the training data set and the data in the test data set; the training data set and the test data set are mixed into a target data set.
In a third aspect, the present application also provides a computer device. The computer device comprises a memory in which a computer program is stored and a processor which, when executing the computer program, carries out the method according to the first aspect described above.
In a fourth aspect, the present application further provides a computer-readable storage medium. The computer-readable storage medium, on which a computer program is stored which, when executed by a processor, implements the method of the first aspect described above.
In a fifth aspect, the present application further provides a computer program product. The computer program product comprising a computer program that when executed by a processor implements the method of the first aspect described above.
According to the data dividing method and apparatus, computer device, and storage medium above, each data item in the target data set is randomly divided into training data or test data, and the division result is labeled; cross validation is performed on the labeled target data set, in each validation a classification model is trained with the first data re-divided into training data, and the second data re-divided into test data is classified with the trained classification model to obtain a classification result corresponding to each second data item; and if the division result is determined to be reasonable according to the classification results, the data division ends. In the embodiments of the application, each data item in the target data set is first randomly divided, and the division result is then verified, so that the division is more reasonable, the training data and the test data are distributed more uniformly, and the effect of the machine learning model is improved.
Drawings
FIG. 1 is a flow diagram illustrating a data partitioning method according to one embodiment;
FIG. 2 is a second flowchart illustrating a data partitioning method according to an embodiment;
FIG. 3 is a flowchart illustrating the step of determining whether the division result is reasonable according to the classification result in one embodiment;
FIG. 4 is a schematic representation of an ROC curve in one embodiment;
FIG. 5 is a schematic flow chart diagram illustrating the cross-validation step performed on a target data set in one embodiment;
FIG. 6 is a flow chart illustrating a data partitioning method according to another embodiment;
FIG. 7 is a block diagram showing the structure of a data dividing apparatus according to an embodiment;
FIG. 8 is a diagram illustrating an internal structure of a computer device according to an embodiment.
Detailed Description
In order to make the objects, technical solutions and advantages of the present application more apparent, the present application is described in further detail below with reference to the accompanying drawings and embodiments. It should be understood that the specific embodiments described herein are merely illustrative of the present application and are not intended to limit the present application.
Referring to fig. 1, the present application provides a data partitioning method, which can be applied to a radiomics workflow; that is, the partitioned training data and test data can be used for training an image recognition model, an image segmentation model, and the like, providing a basis for the analysis and recognition of medical images. The embodiments of the application are illustrated by applying the method to a computer device. It can be understood that the computer device may be a terminal or a server; the terminal may be, but is not limited to, various personal computers, notebook computers, smart phones, and tablet computers, and the server may be implemented as a stand-alone server or as a server cluster composed of multiple servers. In the embodiments of the application, the method may comprise the following steps:
Step 101, randomly dividing each data in the target data set into training data and testing data, and labeling the division result.
The computer device first acquires a target data set composed of a plurality of data, each of which may be, but is not limited to, various MR (Magnetic Resonance) image data, CT (Computed Tomography) image data, PET (Positron Emission Tomography) image data. The data in the target data set may also be other data, which is not limited in this embodiment of the present application.
The computer device randomly divides each data item in the target data set into training data or test data, and labels each item according to its division result.
For example, data a1 is divided into training data, and data a1 is labeled 1; data a2 is divided into test data and data a2 is labeled as 0. And by analogy, randomly dividing each data in the target data set, and labeling a division result. In practical application, the training data may also be labeled as T, and the test data may be labeled as F, which is not limited in the embodiment of the present application.
In one embodiment, a preset division principle may be followed during the random division. For example, after data division, the ratio of the number of training data to the number of test data in the target data set is 1:1 or 4:1. The division principle is not limited in the embodiments of the application.
In one embodiment, the process of randomly dividing and labeling may include: dividing a target data set into a training data set and a testing data set; respectively labeling the division results of the data in the training data set and the data in the test data set; the training data set and the test data set are mixed into a target data set.
In the random division, the target data set may be split into two data sets, namely a training data set and a test data set. The data in the training data set are then uniformly labeled with one label, and the data in the test data set with another. After the division results are labeled, the training data set and the test data set are mixed back together to obtain the labeled target data set.
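As a minimal sketch of the random division and labeling in step 101 (function and variable names are illustrative, not taken from the patent):

```python
import numpy as np

def random_division(n_samples, train_ratio=0.8, seed=None):
    """Randomly assign each data item a division label:
    1 -> training data, 0 -> test data."""
    rng = np.random.default_rng(seed)
    labels = np.zeros(n_samples, dtype=int)
    n_train = int(round(n_samples * train_ratio))
    # choose n_train distinct indices to become training data
    train_idx = rng.choice(n_samples, size=n_train, replace=False)
    labels[train_idx] = 1
    return labels

labels = random_division(10, train_ratio=0.8, seed=0)
```

With train_ratio=0.8 this realizes the 4:1 division principle mentioned above; keeping the labels alongside the data corresponds to mixing the two labeled subsets back into a single labeled target data set.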
Step 102, performing cross validation on the target data set, training a classification model using the first data re-divided into training data in each validation, and classifying the second data re-divided into test data using the trained classification model to obtain a classification result corresponding to each second data.
The classification result represents the probability that the second data is training data or the probability that the second data is test data. For example, a classification result may indicate that the probability of the second data being training data is 85%, or that the probability of the second data being test data is 15%.
After the random division, the computer device performs cross validation on the target data set. In each validation, a portion of the data in the target data set is re-divided into training data and the other portion is re-divided into test data. It will be appreciated that data re-divided into training data here may have been assigned to either training data or test data in the previous random division; similarly, data re-divided into test data may have been assigned to either test data or training data in the previous random division.
After the re-division, training a classification model by adopting first data which is re-divided into training data; after the model training is finished, the classification model is adopted to classify the second data which are divided into the test data again, and the probability that the second data are the training data is obtained, or the probability that the second data are the test data is obtained.
By analogy, after repeated re-division, model training and classification processing, each data in the target data set is used as test data (second data) to be subjected to classification processing, and a corresponding classification result is obtained.
Step 103, ending the data division if the division result is determined to be reasonable according to the classification result.
Whether the division result is reasonable is determined according to the classification result corresponding to each data item; if the random division in step 101 is determined to be reasonable, the data division ends.
The process of determining whether the division result is reasonable according to the classification results may include: for each data item, determining whether it is training data or test data according to a probability threshold and its classification result; judging whether this determination is consistent with the division result; and if the two are consistent, determining that the division result for that data item is reasonable.
For example, if the classification result corresponding to the data a1 is that the probability that the data a1 is training data is 85%, and the probability is greater than the probability threshold 80%, the data a1 is determined to be training data. In the previous random division, the data a1 was divided into training data, and the division result of the data a1 was determined to be reasonable. For another example, if the classification result corresponding to the data a2 is that the probability that the data a2 is training data is 82%, which is greater than the probability threshold 80%, the data a2 is determined to be training data. Whereas in the previous random division, the data a2 was divided into test data, it was determined that the division result of the data a2 was not reasonable.
Then, the ratio of the data with reasonable partitioning results in the target data set is counted, and if the ratio is larger than the preset ratio, the partitioning results of the random partitioning in the step 101 are determined to be reasonable. If the ratio is less than or equal to the preset ratio, it is determined that the division result of the random division in step 101 is unreasonable.
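A sketch of this consistency check, with illustrative threshold values (the 80% probability threshold and the 0.5 agreement ratio below are assumptions for the example, not values fixed by the description):

```python
def division_is_consistent(probs_train, assigned, prob_threshold=0.8, min_ratio=0.5):
    """probs_train[i]: classifier's probability that item i is training data.
    assigned[i]: 1 if item i was randomly divided into training data, else 0.
    An item's division counts as reasonable when the thresholded prediction
    matches its assigned label; the whole division is judged reasonable when
    the agreeing fraction exceeds min_ratio."""
    predicted = [1 if p > prob_threshold else 0 for p in probs_train]
    agree = sum(p == a for p, a in zip(predicted, assigned))
    return agree / len(assigned) > min_ratio
```

For the a1/a2 example above, `division_is_consistent([0.85, 0.82], [1, 0])` returns False: a1 agrees with its assignment but a2 does not, so only half the items are consistently divided.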
In practical application, whether the partitioning result is reasonable or not can be determined in other manners, which is not limited in the embodiment of the present application.
In the data dividing method, each data item in the target data set is randomly divided into training data or test data, and the division result is labeled; cross validation is performed on the target data set, in each validation a classification model is trained with the first data re-divided into training data, and the second data re-divided into test data is classified with the trained classification model to obtain a classification result corresponding to each second data item; and if the division result is determined to be reasonable according to the classification results, the data division ends. In the embodiments of the application, each data item in the target data set is first randomly divided, and the division result is then verified, so that the division is more reasonable, the training data and the test data are distributed more uniformly, and the effect of the machine learning model is improved.
In an embodiment, as shown in fig. 2, on the basis of the above embodiment, the embodiment of the present application may further include:
and 104, if the division result is determined to be unreasonable according to the classification result, randomly dividing the data in the target data set again until the division result is determined to be reasonable according to the classification result.
And if the computer equipment determines that the division result is unreasonable according to the classification result, randomly dividing each data in the target data set again. And after random division, performing cross validation on the target data set. In each verification, the first data which is divided into the training data again is used for training the classification model, and the trained classification model is used for classifying the second data which is divided into the test data again, so that the classification result corresponding to each second data is obtained. And then, determining whether the division result is reasonable according to the classification result again. And if the division result is still determined to be unreasonable, randomly dividing each data in the target data set again. And ending data division until the division result is determined to be reasonable according to the classification result.
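The overall divide-verify-redivide loop can be sketched as follows; `evaluate_division` stands in for the cross-validation-plus-check procedure described above, and all names as well as the max_rounds safeguard are illustrative:

```python
import random

def divide_until_reasonable(n_samples, evaluate_division,
                            train_ratio=0.8, max_rounds=100, seed=None):
    """Repeatedly draw a random division (1 = training data, 0 = test data)
    and hand it to evaluate_division; return the first division judged
    reasonable. max_rounds guards against non-termination."""
    rng = random.Random(seed)
    for _ in range(max_rounds):
        labels = [1 if rng.random() < train_ratio else 0
                  for _ in range(n_samples)]
        if evaluate_division(labels):
            return labels
    raise RuntimeError("no reasonable division found in max_rounds attempts")
```

The caller supplies `evaluate_division`, which would internally run the cross validation of step 102 and the reasonableness check of step 103.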
In the above embodiment, if the division result is determined to be unreasonable according to the classification result, the data in the target data set are randomly divided again until the division result is determined to be reasonable according to the classification result. Through this embodiment, repeated division enables repeated verification, so that the training data and the test data in the target data set are distributed more uniformly, and the effect of the machine learning model is improved.
In an embodiment, as shown in fig. 3, the above process of determining whether the division result is reasonable according to the classification result may include the following steps:
Step 201, determining an AUC (Area Under Curve) value according to the classification result and the division result.
The process by which the computer device determines the AUC value may include: determining an ROC curve (Receiver Operating Characteristic curve) according to the classification result, the division result, and a plurality of probability thresholds; and determining the AUC value according to the ROC curve.
In the ROC curve shown in fig. 4, the horizontal axis is the FPR (false positive rate), calculated as the number of samples wrongly classified as positive divided by the number of all truly negative samples. The vertical axis is the TPR (true positive rate), calculated as the number of samples correctly classified as positive divided by the number of all truly positive samples.
Taking data a1 in the target data set as an example, data a1 is first divided into training data. Then, in the cross validation of the target data set, data a1 is classified by the classification model, and the classification result is that the probability of data a1 being training data is 85%. Assuming a probability threshold of 50%, data a1 is correctly classified as training data. By analogy, whether each data item in the target data set is correctly classified can be determined according to the division result, the classification result, and the probability threshold; the true positive rate and false positive rate are then calculated at each threshold to obtain the ROC curve. Finally, the area enclosed under the ROC curve is calculated; this area is the AUC value.
Step 202, if the AUC value is within the preset range, it is determined that the partitioning result is reasonable.
Wherein the AUC value may characterize the classification accuracy of the classification model. The higher the AUC value is, the more accurate the classification result of the classification model is, and the lower the AUC value is, the less accurate the classification result of the classification model is.
The preset range of the AUC value may be set to 0.4-0.6. If the AUC value is within this range, the classification model cannot reliably distinguish whether a data item was assigned to training data or to test data in the previous random division. In this case, the data could serve equally well as training data or test data; that is, the machine learning model would achieve a good effect whether a given item is used for training or for testing. The previous random division is therefore balanced, i.e. the division result is reasonable.
Step 203, if the AUC value is not within the preset range, it is determined that the division result is unreasonable.
If the AUC value is not within the preset range, the classification model can distinguish whether a data item was assigned to training data or to test data in the previous random division. In this case, the data is suited to being only training data or only test data, so the result of the random division is not balanced enough.
In the above embodiment, the AUC value is determined according to the classification result and the division result; if the AUC value is within the preset range, determining that the division result is reasonable; and if the AUC is not in the preset range, determining that the division result is unreasonable. According to the embodiment of the application, whether the dividing result is reasonable or not can be easily determined through the AUC value, and the data dividing efficiency can be improved.
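The AUC computation and range check can be sketched without any plotting by using the pairwise-ranking form of AUC, which is mathematically equal to the area under the ROC curve traced over all probability thresholds. The 0.4-0.6 bounds follow the description; the function names and everything else are illustrative:

```python
def auc_score(assigned, probs_train):
    """AUC via the pairwise-ranking (Mann-Whitney) definition: the
    fraction of (training-assigned, test-assigned) pairs in which the
    sample assigned to training data received the higher probability.
    assigned[i]: 1 if item i was divided into training data, else 0.
    probs_train[i]: classifier's probability that item i is training data."""
    pos = [p for a, p in zip(assigned, probs_train) if a == 1]
    neg = [p for a, p in zip(assigned, probs_train) if a == 0]
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0
               for p in pos for n in neg)
    return wins / (len(pos) * len(neg))

def division_is_reasonable(auc, low=0.4, high=0.6):
    # AUC near 0.5 means the classifier cannot tell training-assigned
    # items from test-assigned items, i.e. the random split is balanced.
    return low <= auc <= high
```

An AUC of 1.0 (perfectly separable assignments) would fail the check, while an AUC of 0.5 (indistinguishable assignments) would pass it.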
In an embodiment, as shown in fig. 5, the above process of performing cross validation on the target data set, training a classification model using the first data re-divided into training data in each validation, and classifying the second data re-divided into test data using the trained classification model to obtain the classification result corresponding to each second data, may include the following steps:
Step 301, equally dividing a target data set into M parts according to data volume, taking M-1 parts of data as first data in each verification, dividing the first data into training data again, taking 1 part of data except the M-1 parts of data as second data, and dividing the second data into test data again; wherein M is a positive integer.
The computer device may perform five-fold cross validation, ten-fold cross validation, etc. on the target data set. Taking five-fold cross validation as an example, M is 5: the target data set is divided evenly into 5 parts by data volume, and five validations are then performed. In each validation, 4 of the parts are re-divided into training data and the remaining 1 part into test data. It can be understood that after the five validations, each part has been re-divided into test data exactly once.
Step 302, inputting the first data into the initial model for training, and obtaining a classification model.
In each verification, first data serving as training data are input into the initial model for training, and a classification model is obtained after training is finished. It will be appreciated that in each validation, a classification model may be trained.
Step 303, inputting the second data into the classification model to obtain the classification result output by the classification model.
In each validation, the second data serving as test data are input into the classification model for classification processing to obtain the classification result output by the classification model. It can be understood that after the five validations, every piece of data in the target data set has been input into the classification model as test data and classified.
In the above embodiment, the target data set is divided into M parts on average according to the data volume, in each verification, M-1 parts of data are used as first data, the first data are divided into training data again, 1 part of data except M-1 parts of data are used as second data, and the second data are divided into test data again; inputting the first data into an initial model for training to obtain a classification model; and inputting the second data into the classification model to obtain a classification result output by the classification model. The embodiment of the application adopts cross validation, and all data in the target data set can be used as training data and test data, so that the validation is more comprehensive, and the division result is more reasonable.
In an embodiment, the process of inputting the first data into the initial model for training to obtain the classification model may include: inputting the first data into the initial model to obtain a training result output by the initial model; and adjusting parameters of the initial model according to the training result and the division result corresponding to the first data to obtain a classification model.
In the training process, a plurality of first data are sequentially input into the initial model to obtain a plurality of training results output by the initial model; whether the model meets the convergence condition is then determined according to these training results and the division result recorded for each first data in the previous random division. If the convergence condition is not met, the adjustable parameters of the model are adjusted, and the plurality of first data are again sequentially input into the model for training. The training ends once the model is determined to meet the convergence condition, and the model obtained at the end of training is taken as the classification model.
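The loop above can be sketched as follows, with logistic regression standing in for the unspecified initial model and a loss-change test standing in for the convergence condition; the learning rate, tolerance, and iteration cap are illustrative assumptions:

```python
import numpy as np

def train_classifier(features, split_labels, lr=0.1, tol=1e-4, max_iter=1000):
    """Feed the first data through the model, compare the training results
    with the recorded division labels, adjust the adjustable parameters,
    and stop once the convergence condition is met."""
    X = np.asarray(features, dtype=float)
    y = np.asarray(split_labels, dtype=float)
    w = np.zeros(X.shape[1])  # adjustable parameters of the initial model
    b = 0.0
    prev_loss = np.inf
    for _ in range(max_iter):
        p = 1.0 / (1.0 + np.exp(-(X @ w + b)))  # training results
        loss = -np.mean(y * np.log(p + 1e-12) + (1 - y) * np.log(1 - p + 1e-12))
        if abs(prev_loss - loss) < tol:  # convergence condition met
            break
        grad = p - y  # compare training results with division labels
        w -= lr * (X.T @ grad) / len(y)
        b -= lr * grad.mean()
        prev_loss = loss
    return w, b  # the trained model is taken as the classification model
```

On linearly separable inputs the learned weight points toward the positive class and the bias places the decision boundary between the two groups.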
In the above embodiment, the first data is input into the initial model to obtain a training result output by the initial model; and the parameters of the initial model are adjusted according to the training result and the division result corresponding to the first data to obtain a classification model. Training the classification model during cross validation means that, on the one hand, the data held out for validation also contribute to model training as training data, and on the other hand, the classification results output by the trained classification model can then be used to determine whether the division result is reasonable.
It should be understood that, although the steps in the flowcharts related to the embodiments described above are shown in sequence as indicated by the arrows, the steps are not necessarily performed in sequence as indicated by the arrows. The steps are not limited to being performed in the exact order illustrated and, unless explicitly stated herein, may be performed in other orders. Moreover, at least a part of the steps in the flowcharts related to the embodiments described above may include multiple steps or multiple stages, which are not necessarily performed at the same time, but may be performed at different times, and the execution order of the steps or stages is not necessarily sequential, but may be rotated or alternated with other steps or at least a part of the steps or stages in other steps.
Referring to fig. 6, the present application provides a data partitioning method, and this embodiment is illustrated by applying the method to a computer device, and may include the following steps:
step 401, randomly dividing each data in the target data set into training data or testing data, and labeling the division result.
Step 402, equally dividing the target data set into M parts according to the data size, taking M-1 parts of data as first data in each verification, dividing the first data into training data again, taking 1 part of data except the M-1 parts of data as second data, and dividing the second data into test data again.
And 403, inputting the first data into the initial model for training to obtain a classification model.
And step 404, inputting the second data into the classification model to obtain a classification result output by the classification model.
And step 405, determining an AUC value according to the classification result and the division result.
And step 406, if the AUC value is within the preset range, determining that the division result is reasonable.
In step 407, if the AUC value is not within the preset range, it is determined that the division result is unreasonable.
And step 408, if the division result is determined to be unreasonable according to the classification result, randomly dividing the data in the target data set again until the division result is determined to be reasonable according to the classification result.
In the embodiment, each data in the target data set is randomly divided into training data or testing data, and division results are marked; performing cross validation on a target data set, performing classification model training by using first data which is re-divided into training data in each validation, and performing classification processing on second data which is re-divided into test data by using the trained classification model to obtain classification results corresponding to the second data; and if the division result is determined to be reasonable according to the classification result, ending the data division. According to the embodiment of the application, each data in the target data set is randomly divided firstly, then the division result is verified, and random division is performed again if the division is unreasonable, so that the division can be more reasonable, the training data and the test data are distributed more uniformly, and the effect of a machine learning model is improved.
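Steps 401 and 406-408 form an accept/retry loop around the cross validation of steps 402-405. A minimal sketch, with the cross-validation routine passed in as a callable that returns the AUC value; the 0.4-0.6 acceptance range and the retry cap are illustrative assumptions:

```python
import random

def divide_until_reasonable(dataset, validate, low=0.4, high=0.6, max_rounds=100):
    """Randomly divide each data item into training or test data (step 401),
    score the division with the supplied cross-validation routine, which is
    expected to return an AUC value (steps 402-405), and re-divide until the
    AUC lands in the preset range (steps 406-408)."""
    for _ in range(max_rounds):
        division = {item: random.choice(("train", "test")) for item in dataset}
        auc = validate(dataset, division)
        if low <= auc <= high:  # step 406: the division result is reasonable
            return division, auc
    raise RuntimeError("no reasonable division found within max_rounds")
```

Passing in the validator keeps the retry loop independent of the particular classification model used during cross validation.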
Based on the same inventive concept, the embodiment of the present application further provides a data partitioning apparatus for implementing the above-mentioned data partitioning method. The implementation scheme for solving the problem provided by the device is similar to the implementation scheme described in the above method, so specific limitations in one or more embodiments of the data partitioning device provided below can refer to the limitations on the data partitioning method in the foregoing, and details are not described here.
In one embodiment, as shown in fig. 7, there is provided a data dividing apparatus including:
a dividing module 501, configured to randomly divide each data in the target data set into training data and test data, and label a dividing result;
the verification module 502 is configured to perform cross validation on the target data set, perform classification model training using the first data re-divided into training data in each validation, and perform classification processing on the second data re-divided into test data using the trained classification model to obtain the classification result corresponding to each second data; the classification result is used for representing the probability that the second data is training data or the probability that the second data is test data;
a determining module 503, configured to end data partitioning if it is determined that the partitioning result is reasonable according to the classification result.
In one embodiment, the apparatus further comprises:
and the re-dividing module 504 is configured to re-randomly divide each data in the target data set if the division result is determined to be unreasonable according to the classification result until the division result is determined to be reasonable according to the classification result.
In one embodiment, the determining module 503 is specifically configured to determine an AUC value according to the classification result and the partition result; if the AUC value is within the preset range, determining that the division result is reasonable; and if the AUC value is not in the preset range, determining that the division result is unreasonable.
In one embodiment, the determining module 503 is specifically configured to determine an ROC curve according to the classification result, the division result, and a plurality of probability thresholds, and determine the AUC value according to the ROC curve.
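A sketch of this module's computation, assuming the classification result is a probability per sample, the division result is the binary label (1 = training data), and the probability thresholds are supplied as an illustrative list; the AUC value is then the trapezoidal area under the resulting ROC points:

```python
def roc_points(scores, labels, thresholds):
    """Sweep the probability thresholds over the classification results and
    record (false positive rate, true positive rate) at each threshold."""
    positives = sum(1 for y in labels if y == 1)
    negatives = len(labels) - positives
    points = []
    for t in thresholds:
        tp = sum(1 for s, y in zip(scores, labels) if s >= t and y == 1)
        fp = sum(1 for s, y in zip(scores, labels) if s >= t and y == 0)
        points.append((fp / negatives, tp / positives))
    return points

def auc_from_roc(points):
    """AUC value as the trapezoidal area under the sorted ROC points."""
    pts = sorted(points)
    return sum((x1 - x0) * (y0 + y1) / 2
               for (x0, y0), (x1, y1) in zip(pts, pts[1:]))
```

A denser threshold list gives a finer ROC curve and a more accurate trapezoidal AUC estimate.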
In one embodiment, the verification module 502 is specifically configured to divide the target data set into M parts on average according to the data size, in each verification, take M-1 parts of data as first data, and re-divide the first data into training data, take 1 part of data other than M-1 parts of data as second data, and re-divide the second data into test data; wherein M is a positive integer; inputting the first data into an initial model for training to obtain a classification model; and inputting the second data into the classification model to obtain a classification result output by the classification model.
In one embodiment, the verification module 502 is specifically configured to input the first data into the initial model to obtain a training result output by the initial model; and adjusting parameters of the initial model according to the training result and the division result corresponding to the first data to obtain a classification model.
In one embodiment, the dividing module 501 is specifically configured to divide the target data set into a training data set and a test data set; respectively labeling the division results of the data in the training data set and the data in the test data set; the training dataset and the test dataset are mixed into a target dataset.
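The dividing module's three moves can be sketched as follows; the 0.8 training fraction is an illustrative assumption, since the text does not fix the sizes of the two sets:

```python
import random

def label_and_mix(dataset, train_fraction=0.8):
    """Divide the target data set into a training set and a test set, label
    each element with its division result, then mix the two labelled sets
    back into a single target data set."""
    data = list(dataset)
    random.shuffle(data)
    cut = int(len(data) * train_fraction)
    labelled = ([(x, "train") for x in data[:cut]]
                + [(x, "test") for x in data[cut:]])
    random.shuffle(labelled)  # mix into one labelled target data set
    return labelled
```

The returned list carries each element together with its division result, which is what the subsequent cross validation treats as the label to predict.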
The various modules in the data partitioning apparatus described above may be implemented in whole or in part by software, hardware, and combinations thereof. The modules can be embedded in a hardware form or independent from a processor in the computer device, and can also be stored in a memory in the computer device in a software form, so that the processor can call and execute operations corresponding to the modules.
In one embodiment, a computer device is provided, which may be a server, and the internal structure thereof may be as shown in fig. 8. The computer device includes a processor, a memory, and a network interface connected by a system bus. Wherein the processor of the computer device is configured to provide computing and control capabilities. The memory of the computer device includes a non-volatile storage medium and an internal memory. The non-volatile storage medium stores an operating system, a computer program, and a database. The internal memory provides an environment for the operation of an operating system and computer programs in the non-volatile storage medium. The database of the computer device is used to store the XX data. The network interface of the computer device is used for communicating with an external terminal through a network connection. The computer program is executed by a processor to implement a data partitioning method.
Those skilled in the art will appreciate that the architecture shown in fig. 8 is merely a block diagram of some of the structures associated with the disclosed aspects and is not intended to limit the computer devices to which the disclosed aspects apply; a particular computer device may include more or fewer components than those shown, combine certain components, or have a different arrangement of components.
In one embodiment, a computer device is provided, comprising a memory and a processor, the memory having a computer program stored therein, the processor implementing the following steps when executing the computer program:
randomly dividing each data in the target data set into training data and testing data, and labeling division results;
performing cross validation on a target data set, performing classification model training by using first data which is subdivided into training data in each validation, and performing classification processing on second data which is subdivided into test data by using the trained classification model to obtain a classification result corresponding to each second data; the classification result is used for representing the probability that the second data is training data or the probability that the second data is test data;
and if the division result is determined to be reasonable according to the classification result, ending the data division.
In one embodiment, the processor when executing the computer program further performs the steps of:
and if the division result is determined to be unreasonable according to the classification result, randomly dividing the data in the target data set again until the division result is determined to be reasonable according to the classification result.
In one embodiment, the processor, when executing the computer program, further performs the steps of:
determining an AUC value according to the classification result and the division result;
if the AUC value is within the preset range, determining that the division result is reasonable;
and if the AUC value is not in the preset range, determining that the division result is unreasonable.
In one embodiment, the processor, when executing the computer program, further performs the steps of:
determining an ROC curve according to the classification result, the division result and a plurality of probability threshold values;
and determining the AUC value according to the ROC curve.
In one embodiment, the processor, when executing the computer program, further performs the steps of:
equally dividing a target data set into M parts according to the data volume, taking M-1 parts of data as first data in each verification, dividing the first data into training data again, taking 1 part of data except the M-1 parts of data as second data, and dividing the second data into test data again; wherein M is a positive integer;
inputting the first data into an initial model for training to obtain a classification model;
and inputting the second data into the classification model to obtain a classification result output by the classification model.
In one embodiment, the processor when executing the computer program further performs the steps of:
inputting the first data into the initial model to obtain a training result output by the initial model;
and adjusting parameters of the initial model according to the training result and the division result corresponding to the first data to obtain a classification model.
In one embodiment, the processor when executing the computer program further performs the steps of:
dividing a target data set into a training data set and a testing data set;
respectively labeling the data in the training data set and the data in the test data set with a division result;
the training data set and the test data set are mixed into a target data set.
In one embodiment, a computer-readable storage medium is provided, having a computer program stored thereon, which when executed by a processor, performs the steps of:
randomly dividing each data in the target data set into training data and testing data, and labeling a division result;
performing cross validation on a target data set, performing classification model training by using first data which is subdivided into training data in each validation, and performing classification processing on second data which is subdivided into test data by using the trained classification model to obtain a classification result corresponding to each second data; the classification result is used for representing the probability that the second data is training data or the probability that the second data is test data;
and if the division result is determined to be reasonable according to the classification result, ending the data division.
In one embodiment, the computer program when executed by the processor further performs the steps of:
and if the division result is determined to be unreasonable according to the classification result, randomly dividing the data in the target data set again until the division result is determined to be reasonable according to the classification result.
In one embodiment, the computer program when executed by the processor further performs the steps of:
determining an AUC value according to the classification result and the division result;
if the AUC value is within the preset range, determining that the division result is reasonable;
and if the AUC value is not in the preset range, determining that the division result is unreasonable.
In one embodiment, the computer program when executed by the processor further performs the steps of:
determining an ROC curve according to the classification result, the division result and a plurality of probability threshold values;
and determining the AUC value according to the ROC curve.
In one embodiment, the computer program when executed by the processor further performs the steps of:
equally dividing a target data set into M parts according to the data volume, taking M-1 parts of data as first data in each verification, dividing the first data into training data again, taking 1 part of data except the M-1 parts of data as second data, and dividing the second data into test data again; wherein M is a positive integer;
inputting the first data into an initial model for training to obtain a classification model;
and inputting the second data into the classification model to obtain a classification result output by the classification model.
In one embodiment, the computer program when executed by the processor further performs the steps of:
inputting the first data into the initial model to obtain a training result output by the initial model;
and adjusting parameters of the initial model according to the training result and the division result corresponding to the first data to obtain a classification model.
In one embodiment, the computer program when executed by the processor further performs the steps of:
dividing a target data set into a training data set and a testing data set;
respectively labeling the division results of the data in the training data set and the data in the test data set;
the training data set and the test data set are mixed into a target data set.
In one embodiment, a computer program product is provided, comprising a computer program which, when executed by a processor, performs the steps of:
randomly dividing each data in the target data set into training data and testing data, and labeling a division result;
performing cross validation on a target data set, performing classification model training by using first data which is re-divided into training data in each validation, and performing classification processing on second data which is re-divided into test data by using the trained classification model to obtain classification results corresponding to the second data; the classification result is used for representing the probability that the second data is training data or the probability that the second data is test data;
and if the division result is determined to be reasonable according to the classification result, ending the data division.
In one embodiment, the computer program when executed by the processor further performs the steps of:
and if the division result is determined to be unreasonable according to the classification result, randomly dividing the data in the target data set again until the division result is determined to be reasonable according to the classification result.
In one embodiment, the computer program when executed by the processor further performs the steps of:
determining an AUC value according to the classification result and the division result;
if the AUC value is within the preset range, determining that the division result is reasonable;
and if the AUC value is not in the preset range, determining that the division result is unreasonable.
In one embodiment, the computer program when executed by the processor further performs the steps of:
determining an ROC curve according to the classification result, the division result and a plurality of probability threshold values;
and determining the AUC value according to the ROC curve.
In one embodiment, the computer program when executed by the processor further performs the steps of:
averagely dividing a target data set into M parts according to the data volume, taking M-1 parts of data as first data in each verification, reclassifying the first data into training data, taking 1 part of data except the M-1 parts of data as second data, and reclassifying the second data into test data; wherein M is a positive integer;
inputting the first data into an initial model for training to obtain a classification model;
and inputting the second data into the classification model to obtain a classification result output by the classification model.
In one embodiment, the computer program when executed by the processor further performs the steps of:
inputting the first data into the initial model to obtain a training result output by the initial model;
and adjusting parameters of the initial model according to the training result and the division result corresponding to the first data to obtain a classification model.
In one embodiment, the computer program when executed by the processor further performs the steps of:
dividing a target data set into a training data set and a testing data set;
respectively labeling the data in the training data set and the data in the test data set with a division result;
the training data set and the test data set are mixed into a target data set.
It will be understood by those skilled in the art that all or part of the processes of the methods of the embodiments described above may be implemented by a computer program instructing the relevant hardware; the computer program may be stored in a non-volatile computer-readable storage medium, and when executed, may include the processes of the embodiments of the methods described above. Any reference to memory, database, or other medium used in the embodiments provided herein may include at least one of non-volatile and volatile memory. The non-volatile memory may include Read-Only Memory (ROM), magnetic tape, floppy disk, flash memory, optical memory, high-density embedded non-volatile memory, Resistive Random Access Memory (ReRAM), Magnetoresistive Random Access Memory (MRAM), Ferroelectric Random Access Memory (FRAM), Phase Change Memory (PCM), graphene memory, and the like. Volatile memory may include Random Access Memory (RAM), external cache memory, and the like. By way of illustration and not limitation, RAM can take many forms, such as Static Random Access Memory (SRAM) or Dynamic Random Access Memory (DRAM). The databases involved in the embodiments provided herein may include at least one of relational and non-relational databases. Non-relational databases may include, but are not limited to, blockchain-based distributed databases and the like. The processors referred to in the embodiments provided herein may be general-purpose processors, central processing units, graphics processors, digital signal processors, programmable logic devices, data processing logic devices based on quantum computing, etc., without limitation.
For the sake of brevity, not all possible combinations of the technical features in the above embodiments are described; however, as long as there is no contradiction between the combinations of these technical features, they should be considered to be within the scope of the present disclosure.
The above-mentioned embodiments only express several implementations of the present application, and although they are described in relative detail, they should not be construed as limiting the scope of the present application. It should be noted that, for a person skilled in the art, several variations and modifications can be made without departing from the concept of the present application, all of which fall within the protection scope of the present application. Therefore, the protection scope of the present application shall be subject to the appended claims.

Claims (10)

1. A method of data partitioning, the method comprising:
randomly dividing each data in the target data set into training data and testing data, and labeling a division result;
performing cross validation on the target data set, performing training of a classification model by using first data which is re-divided into training data in each validation, and performing classification processing on second data which is re-divided into test data by using the trained classification model to obtain a classification result corresponding to each second data;
and if the division result is determined to be reasonable according to the classification result, ending the data division.
2. The method of claim 1, further comprising:
and if the division result is determined to be unreasonable according to the classification result, randomly dividing the data in the target data set again until the division result is determined to be reasonable according to the classification result.
3. The method according to claim 1 or 2, wherein the determining that the division result is reasonable according to the classification result comprises:
determining an AUC value according to the classification result and the division result;
if the AUC value is within a preset range, determining that the dividing result is reasonable;
and if the AUC value is not in the preset range, determining that the division result is unreasonable.
4. The method of claim 3, wherein said determining an AUC value from said classification result and said partition result comprises:
determining an ROC curve according to the classification result, the division result and a plurality of probability threshold values;
determining said AUC value from said ROC curve.
5. The method according to claim 1, wherein the cross-validating the target data set, performing a classification model training using first data subdivided into training data in each validation, and performing a classification process on second data subdivided into test data using the trained classification model to obtain a classification result corresponding to each second data, comprises:
equally dividing the target data set into M parts according to the data volume, taking M-1 parts of data as first data in each verification, dividing the first data into training data again, taking 1 part of data except the M-1 parts of data as second data, and dividing the second data into test data again; wherein M is a positive integer;
inputting the first data into an initial model for training to obtain the classification model;
and inputting the second data into the classification model to obtain the classification result output by the classification model.
6. The method of claim 5, wherein the inputting the first data into an initial model for training to obtain the classification model comprises:
inputting the first data into the initial model to obtain a training result output by the initial model;
and adjusting parameters of the initial model according to the training result and the division result corresponding to the first data to obtain the classification model.
7. The method of claim 1, wherein randomly dividing each data in the target data set into training data and testing data and labeling the division result comprises:
dividing the target dataset into a training dataset and a testing dataset;
respectively labeling the data in the training data set and the data in the test data set with a division result;
mixing the training data set and the test data set into the target data set.
8. An apparatus for data partitioning, the apparatus comprising:
the dividing module is used for randomly dividing each data in the target data set into training data and test data and marking dividing results;
the verification module is used for performing cross verification on the target data set, training a classification model by using first data which is subdivided into training data in each verification, and performing classification processing on second data which is subdivided into test data by using the trained classification model to obtain a classification result corresponding to each second data, wherein the classification result is used for representing the probability that the second data is training data or the probability that the second data is test data;
and the determining module is used for finishing data division if the dividing result is determined to be reasonable according to the classifying result.
9. A computer device comprising a memory and a processor, the memory storing a computer program, characterized in that the processor, when executing the computer program, implements the steps of the method of any of claims 1 to 7.
10. A computer-readable storage medium, on which a computer program is stored which, when being executed by a processor, carries out the steps of the method according to any one of claims 1 to 7.
CN202210589598.7A 2022-05-27 2022-05-27 Data dividing method and device, computer equipment and storage medium Pending CN114943874A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202210589598.7A CN114943874A (en) 2022-05-27 2022-05-27 Data dividing method and device, computer equipment and storage medium


Publications (1)

Publication Number Publication Date
CN114943874A true CN114943874A (en) 2022-08-26

Family

ID=82909307




Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination