WO2019169704A1 - Data classification method, apparatus, device and computer readable storage medium - Google Patents

Data classification method, apparatus, device and computer readable storage medium

Info

Publication number
WO2019169704A1
WO2019169704A1 (PCT/CN2018/084047; CN2018084047W)
Authority
WO
WIPO (PCT)
Prior art keywords
sample
samples
neighbor
new
sample set
Prior art date
Application number
PCT/CN2018/084047
Other languages
French (fr)
Chinese (zh)
Inventor
伍文岳
Original Assignee
平安科技(深圳)有限公司
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by 平安科技(深圳)有限公司
Publication of WO2019169704A1

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/90 Details of database functions independent of the retrieved data types
    • G06F16/906 Clustering; Classification

Definitions

  • the present application relates to the field of information processing technologies, and in particular, to a data classification method, apparatus, device, and computer readable storage medium.
  • The embodiments of the present application provide a data classification method, apparatus, device, and computer readable storage medium, which improve the accuracy of data prediction by bringing two unbalanced sample classes into numerical balance and by combining multiple rounds of modeling and prediction, thereby improving the prediction accuracy of the model.
  • an embodiment of the present application provides a data classification method, where the method includes:
  • Obtaining a sample set including a majority-class sample set and a minority-class sample set; determining a preset number of copies of a first-type sample set and a preset number of samples per copy according to the ratio of the total number of samples in the majority-class sample set to the total number of samples in the minority-class sample set; randomly drawing the preset number of samples from the majority-class sample set to form one first-type sample set, and repeating the drawing to obtain the preset number of copies of the first-type sample set; determining the estimated total number of new samples to be generated according to the total number of samples in the minority-class sample set and the preset number of samples; generating new samples from the minority-class sample set according to the estimated total number, and mixing the new samples with the minority-class sample set to form a second-type sample set; performing machine learning on each first-type sample set together with the second-type sample set to obtain a corresponding classification model; predicting and classifying the data to be classified with the classification models to obtain corresponding prediction results; and determining the prediction result given by the larger number of models as the classification result.
  • The embodiment of the present application further provides a data classification apparatus, the data classification apparatus including units for performing the foregoing data classification method.
  • An embodiment of the present application further provides a data classification device, the device including a memory and a processor connected to the memory; the memory is configured to store a computer program implementing the data classification method, and the processor is configured to run the computer program stored in the memory to perform the method described in the first aspect above.
  • An embodiment of the present application provides a computer readable storage medium storing one or more computer programs, the one or more computer programs being executable by one or more processors to implement the method described in the first aspect above.
  • The embodiments of the present application provide a data classification method, apparatus, device, and computer readable storage medium. When two sample classes (a minority class and a majority class) are unbalanced, several same-class sample sets are drawn from the numerous majority-class samples by downsampling, while new samples are generated from the scarce minority-class samples by upsampling; the new samples are mixed with the original minority-class samples to form a larger set, so that the originally small class is brought into balance with the originally large class. The minority-class and majority-class samples are then used in several rounds of modeling to predict the data, and the prediction result held by the larger number of models is finally taken as the classification result. Upsampling, downsampling, and repeated modeling and prediction together improve the accuracy of data prediction.
  • FIG. 1 is a schematic flowchart of a data classification method according to an embodiment of the present application
  • FIG. 2 is a schematic diagram of a sub-flow of a data classification method according to an embodiment of the present application
  • FIG. 3 is a schematic diagram of another sub-flow of a data classification method according to an embodiment of the present application.
  • FIG. 4 is a schematic block diagram of a data classification apparatus according to an embodiment of the present application.
  • FIG. 5 is a schematic block diagram of a subunit of a data classification apparatus according to an embodiment of the present application.
  • FIG. 6 is a schematic block diagram of a subunit of a data classification apparatus according to an embodiment of the present application.
  • FIG. 7 is a schematic block diagram showing the structure of a data classification device according to an embodiment of the present application.
  • FIG. 1 is a schematic flowchart of a data classification method according to an embodiment of the present application.
  • The method can be run on terminals such as smart phones (for example Android phones, iOS phones, and the like), tablets, laptops, and smart devices.
  • the steps of the method include S101 to S108.
  • Click data refers to the behavior data of users who clicked on a certain type of advertisement, while non-click data refers to the behavior data of users who did not click on that type of advertisement.
  • The ratio of click data to non-click data may reach 1:1000, making the two classes of data very unbalanced.
  • Majority-class samples are the type of data available in large quantity, such as the non-click data above, and the majority-class sample set is the set formed by these samples; minority-class samples are the type of data available in small quantity, such as the click data above, and the minority-class sample set is the set formed by these samples.
  • S102: Determine, according to the ratio of the total number of samples in the majority-class sample set to the total number of samples in the minority-class sample set, a preset number of copies of the first-type sample set and a preset number of samples per copy, the preset number of copies being odd.
  • A first-type sample set is a set of samples formed from majority-class samples.
  • The preset number of copies and the preset number of samples per copy of the first-type sample set are determined by the gap between the total number of majority-class samples and the total number of minority-class samples. When the ratio of the majority-class total to the minority-class total is below a threshold (the threshold may be any value in the range 100-1000), the preset number of samples per copy is set to 1/2 or 1/3 of the majority-class total and the preset number of copies is set to 3; because the number of drawn samples must be an integer, a non-integer 1/2 or 1/3 of the total is rounded. When the ratio is greater than or equal to the threshold, the preset number of samples per copy is set to 1/4 of the majority-class total and the preset number of copies is set to 5; likewise, a non-integer 1/4 of the total is rounded.
  • After the preset number of samples has been drawn at random from the majority-class sample set to form one first-type sample set, the drawn samples are put back into the original majority-class sample set, and the random drawing is repeated on the original set to form another first-type sample set, until the preset number of copies has been formed. Sampling with replacement keeps the sample structure of the original majority-class set unchanged, so each random draw follows the same distribution and no round of model training is adversely affected by differences between the drawn sets.
  • S104 Determine an estimated total number of new samples to be generated according to the total number of samples of the minority sample set and the preset number of samples.
  • Because the minority-class samples are few, some new samples can be generated by upsampling so that the minority class reaches balance with a first-type sample set.
  • The number of new samples expected to be generated equals the preset number of samples minus the total number of samples in the minority-class sample set.
  • S105 Generate a new sample by using the minority class sample set according to the estimated total number, and mix the new sample with the minority class sample set to form a second type of sample set.
  • The new samples are generated from real minority-class samples, and the generated new samples are mixed with the minority-class samples to form the second-type sample set, so that the second-type and first-type sample sets are balanced in size.
  • New samples are generated following the SMOTE idea.
  • The step of generating new samples from the minority-class sample set according to the estimated total number in S105 includes the following sub-steps S1051-S1058.
  • S1051: Take each sample in the minority-class sample set in turn as a reference sample. S1052: Acquire the neighbor samples of each reference sample. S1053: Count the first number of neighbor samples of each reference sample. S1054: Calculate, from the first number and the total number of samples in the minority-class sample set, the second number of non-neighbor samples of the corresponding reference sample. S1055: Calculate the ratio of the second number to the total number of minority-class samples. S1056: Normalize the ratio of each reference sample to obtain a corresponding normalized ratio. S1057: Calculate a corresponding third number from each normalized ratio and the estimated total number. S1058: Select neighbor samples of the corresponding reference sample according to the third number and the first number, and generate new samples from the reference sample and the selected neighbor samples.
  • The third number is the number of new samples the corresponding reference sample is expected to generate. It is only an estimate, not a fixed value; the actual number of new samples generated may equal the third number, or be slightly larger or smaller.
  • a neighbor sample of a sample refers to a sample that is close to the sample in the feature space, and includes a nearest neighbor sample, that is, the sample closest to the sample.
  • A sample is called a neighbor sample of a given sample when the gap between its distance to that sample and the nearest-neighbor distance is within a certain range (for example 0-50%); otherwise it is called a non-neighbor sample.
  • Corresponding new samples are generated for all minority-class samples: each minority-class sample serves as a reference sample and its neighbor samples are obtained to generate new samples. The number of new samples generated from each reference sample depends on how the minority-class samples are distributed in the set: where the minority-class samples are densely distributed, the corresponding reference samples generate fewer new samples, and where they are sparsely distributed, the corresponding reference samples generate more, so that the sample distribution of the final second-type sample set is more uniform. The uniformity of the sample distribution affects model training: the more uniform the distribution, the better the training result.
  • S1058 includes the following sub-steps S1-S4:
  • If the quotient of the third number and the first number is less than 1, the actual number of new samples the reference sample needs to generate is smaller than its number of neighbor samples, so only the third number of its neighbor samples need be paired with the reference sample to generate new samples. Choosing the neighbor samples that are farther from the reference sample to form the sample pairs lets the new samples be inserted into the regions where the original samples are sparsely distributed, which makes the distribution more uniform.
  • If the quotient of the third number and the first number is greater than or equal to 1, the actual number of new samples to be generated is greater than or equal to the number of neighbor samples. The quotient is rounded, each neighbor sample of the reference sample is paired with the reference sample, and each sample pair generates that integer number of new samples. After the new samples generated from all reference samples are mixed with the original minority-class samples, the resulting number of samples is balanced with the number of samples in a first-type sample set.
  • Each neighbor sample can be paired with the reference sample so that each pair produces the same number of new samples (the quotient rounded to an integer); the new samples produced in this way are richer, making the whole sample set more complete.
  • Each sample of known type is converted into a feature vector An(a1, a2, ..., ai) in an i-dimensional space, where each vector value ai encodes one attribute of the sample An. A model is then obtained by machine learning on the feature vectors and corresponding types of all samples, and the model is finally used to predict which type a piece of data to be classified belongs to.
  • a neighbor sample of a reference sample is obtained based on the Euclidean distance.
  • the method of generating a new sample using a sample pair includes steps (1)-(3):
  • In practice, i is usually greater than or equal to 2: if a sample carries i kinds of attribute information, the feature vector has i dimensions.
  • An refers to the nth sample, where n ≤ m, and a1, a2, ..., ai are the feature values of the reference sample An in the i-dimensional space.
  • The feature vector of the reference sample is known, and once a neighbor sample is determined its feature vector is also known (a neighbor sample is itself a sample of the minority-class sample set); An and Bk, and ai and bi, serve only to distinguish the reference sample from the neighbor sample.
  • Cnk denotes the new sample generated from the sample pair formed by the reference sample An and the neighbor sample Bk.
  • From each vector value bi of the neighbor sample, the corresponding vector value ai of the reference sample, and the proportion value t, the vector value ci of the new sample can be computed as ci = ai + t*(bi - ai). Geometrically, the point of the reference sample and the point of the neighbor sample are joined by a straight line and a point is taken at random on that line, between the reference sample and the neighbor sample; this interpolation yields a new point, i.e. a new sample.
  • a method of generating an integer number of new samples using a sample pair includes steps (a)-(c):
  • For example, if the reference sample An has Y neighbor samples, the Y neighbor samples are each paired with the reference sample to form Y sample pairs, and each pair generates an integer number j of new samples, so the reference sample An generates Y*j new samples in total.
  • The point of the reference sample is joined to the point of the neighbor sample by a straight line and an integer number of points are taken at random on the line, between the reference sample and the neighbor sample; this interpolation yields an integer number of new points, i.e. an integer number of new samples.
  • S106: Perform machine learning on each first-type sample set together with the second-type sample set to obtain a corresponding classification model.
  • S107: Predict and classify the data to be classified with the classification models to obtain corresponding prediction results.
  • S108: Determine the prediction result given by the larger number of models as the classification result.
  • For the sake of prediction accuracy, as many rounds of modeling and prediction as possible are performed; each first-type sample set is therefore combined with the second-type sample set and machine-learned to obtain a corresponding classification model, and each resulting model makes its own prediction.
  • Each prediction result is either the first class (majority class) or the second class (minority class), and the prediction result given by the larger number of models is the final classification result.
  • With the above method, it is possible to predict, from a user's behavior data, whether the user will click on a certain type of advertisement, so different advertisements can be served to different user groups in a planned way, or advertising plans can be designed for potential customers according to their needs, increasing the likelihood of winning potential business.
  • The embodiments of the present application provide a data classification method. When two sample classes (a minority class and a majority class) are unbalanced, several same-class sample sets are drawn from the numerous majority-class samples by downsampling, and new samples are generated from the scarce minority-class samples by upsampling; the new samples are mixed with the original minority-class samples to form a larger set, so that the originally small class is brought into balance with the originally large class. The minority-class and majority-class samples are then used in several rounds of modeling to predict the data, and the prediction result held by the larger number of models is finally taken as the classification result. Upsampling, downsampling, and repeated modeling and prediction together improve the accuracy of data prediction.
  • FIG. 4 is a schematic block diagram of a data classification apparatus 100 provided by an embodiment of the present application.
  • The data classification device 100 includes an acquisition unit 101, a first determination unit 102, a first formation unit 103, a second determination unit 104, a generation unit 105, a second formation unit 106, a learning unit 107, a prediction unit 108, a statistics unit 109, and a third determining unit 110.
  • the obtaining unit 101 is configured to acquire a sample set, where the sample set includes a majority class sample set and a minority class sample set.
  • the first determining unit 102 is configured to determine a preset number of copies and a preset number of samples of the first type of sample set according to a ratio of a total number of samples of the majority class sample set to a total sample number of the minority class sample set.
  • The first forming unit 103 is configured to randomly draw samples of the preset sample number from the majority-class sample set to form one first-type sample set, and to repeat the drawing to obtain the preset number of copies of the first-type sample set.
  • the second determining unit 104 is configured to determine an estimated total number of new samples that need to be generated according to the total number of samples of the minority class sample set and the preset number of samples.
  • the generating unit 105 is configured to generate a new sample by using the minority class sample set according to the estimated total number.
  • the second forming unit 106 is configured to mix the new sample with the minority class sample set to form a second type of sample set.
  • the learning unit 107 is configured to perform machine learning on each of the first type of sample set and the second type of sample set to obtain a corresponding classification model.
  • The prediction unit 108 is configured to predict and classify the data to be classified by using the classification models to obtain corresponding prediction results.
  • the statistical unit 109 is configured to determine a larger number of prediction results as the classification result.
  • the third determining unit 110 is configured to determine a larger number of prediction results as the classification result.
  • The generating unit 105 includes the following subunits: a determining subunit 1051, configured to take each sample in the minority-class sample set in turn as a reference sample; an acquiring subunit 1052, configured to acquire the neighbor samples of each reference sample; a statistical subunit 1053, configured to count the first number of neighbor samples of each reference sample; a first calculating subunit 1054, configured to calculate, from the first number and the total number of samples in the minority-class sample set, the second number of non-neighbor samples of the corresponding reference sample; a second calculating subunit 1055, configured to calculate the ratio of the second number to the total number of minority-class samples; a normalization subunit 1056, configured to normalize the ratio of each reference sample to obtain a corresponding normalized ratio; a third calculating subunit 1057, configured to calculate a corresponding third number from each normalized ratio and the estimated total number; and a generating subunit 1058, configured to select neighbor samples of the corresponding reference sample according to the third number and the first number and to generate new samples from the reference sample and the selected neighbor samples.
  • The generating subunit 1058 includes the following subunits: a fourth calculating subunit 10581, configured to calculate the quotient of the third number and the first number.
  • A determining subunit 10582 is configured to determine whether the quotient is less than 1.
  • A selecting subunit 10583 is configured to, if the quotient is less than 1, select the third number of neighbor samples from the neighbor samples of the reference sample, the selected neighbor samples all being farther from the reference sample than the remaining neighbor samples.
  • A first generating subunit 10584 is configured to pair each selected neighbor sample with the reference sample and to generate one new sample from each sample pair.
  • A second generating subunit 10585 is configured to, if the quotient is greater than or equal to 1, round the quotient to an integer, pair each neighbor sample of the reference sample with the reference sample, and generate that integer number of new samples from each sample pair.
  • The above data classification apparatus 100 can be implemented in the form of a computer program that can run on a computer device as shown in FIG. 7.
  • FIG. 7 is a schematic block diagram of a data classification device according to an embodiment of the present application.
  • the device may be a terminal or a server, wherein the terminal may be a communication-enabled electronic device such as a smart phone, a tablet computer, a notebook computer, a desktop computer, a personal digital assistant, and a wearable device.
  • the server can be a standalone server or a server cluster consisting of multiple servers.
  • the device is a computer device 200 comprising a processor 202, a memory and a network interface 205 connected by a system bus 201, wherein the memory comprises a non-volatile storage medium 203 and an internal memory 204.
  • the non-volatile storage medium 203 of the computer device 200 can store an operating system 2031 and a computer program 2032 that, when executed, can cause the processor 202 to perform a data classification method.
  • the processor 202 of the computer device 200 is used to provide computing and control capabilities to support the operation of the entire computer device 200.
  • the internal memory 204 provides an environment for the operation of the computer program 2032 in the non-volatile storage medium 203.
  • the network interface 205 of the computer device 200 is used to perform network communications, such as sending assigned tasks and the like.
  • The processor 202 can execute all of the embodiments of the data classification method described above when the computer program 2032 in the non-volatile storage medium 203 is run. Those skilled in the art will understand that the embodiment of the computer device shown in FIG. 7 does not limit the specific configuration of the data classification device; in other embodiments, the data classification device may include more or fewer parts than illustrated, combine certain parts, or arrange the parts differently. For example, in some embodiments the data classification device may include only a memory and a processor; in such an embodiment the structure and function of the memory and the processor are the same as in the embodiment shown in FIG. 7 and are not described again here.
  • The application further provides a computer readable storage medium storing one or more computer programs, the one or more computer programs being executable by one or more processors; when the one or more computer programs are executed by the one or more processors, all of the embodiments of the above data classification method can be implemented.

Landscapes

  • Engineering & Computer Science (AREA)
  • Databases & Information Systems (AREA)
  • Theoretical Computer Science (AREA)
  • Data Mining & Analysis (AREA)
  • Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)

Abstract

Provided in the embodiments of the present application are a data classification method, apparatus, device, and computer readable storage medium. When two sample classes are unbalanced, several same-class sample sets are generated from the numerous class by downsampling, and new samples are generated from the scarce class by upsampling; the new samples are mixed with the minority-class samples to form a larger set, so that the originally small class is brought into balance with the originally large class. The minority-class and majority-class samples are then used in several rounds of modeling to predict the data, and the prediction result held by the larger number of models is finally taken as the classification result. Upsampling, downsampling, and repeated modeling and prediction improve the accuracy of data prediction.

Description

Data classification method, apparatus, device and computer readable storage medium
This application claims priority to Chinese Patent Application No. 201810190818.2, filed with the Chinese Patent Office on March 8, 2018 and entitled "Data classification method, apparatus, device and computer readable storage medium", the entire contents of which are incorporated herein by reference.
Technical field
The present application relates to the field of information processing technologies, and in particular, to a data classification method, apparatus, device, and computer readable storage medium.
Background
At present, when data are classified through data modeling, and especially in multi-class settings, the classes of samples are often unbalanced. When the numbers of training samples of the various classes differ greatly, a classification model trained directly on the unbalanced samples may perform poorly because of the imbalance, and the predictions obtained with such a model are then also unsatisfactory, or even the opposite of the true result.
The common practice at present is to increase the number of the scarcer samples by generating new samples, so that their number is balanced with the more numerous class. The new samples need to be as close to real samples as possible, but they are not real samples after all, and a model trained on them has a certain adverse effect on the prediction of data. Moreover, if the generated new samples are simply combined with the original samples for a single round of modeling and prediction, any error in that one-shot prediction result is irreparable.
Summary of the invention
The embodiments of the present application provide a data classification method, apparatus, device, and computer readable storage medium, which improve the accuracy of data prediction by bringing two unbalanced sample classes into numerical balance and by combining multiple rounds of modeling and prediction, thereby improving the prediction accuracy of the model.
In a first aspect, an embodiment of the present application provides a data classification method, the method including:
obtaining a sample set, the sample set including a majority-class sample set and a minority-class sample set; determining a preset number of copies of a first-type sample set and a preset number of samples per copy according to the ratio of the total number of samples in the majority-class sample set to the total number of samples in the minority-class sample set; randomly drawing the preset number of samples from the majority-class sample set to form one first-type sample set, and repeating the drawing to obtain the preset number of copies of the first-type sample set; determining the estimated total number of new samples to be generated according to the total number of samples in the minority-class sample set and the preset number of samples; generating new samples from the minority-class sample set according to the estimated total number, and mixing the new samples with the minority-class sample set to form a second-type sample set; performing machine learning on each first-type sample set together with the second-type sample set to obtain a corresponding classification model; predicting and classifying the data to be classified with the classification models to obtain corresponding prediction results; and determining the prediction result given by the larger number of models as the classification result.
In a second aspect, an embodiment of the present application further provides a data classification apparatus, the data classification apparatus including units for performing the above data classification method. In a third aspect, an embodiment of the present application further provides a data classification device, the device including a memory and a processor connected to the memory; the memory is configured to store a computer program implementing the data classification method, and the processor is configured to run the computer program stored in the memory to perform the method described in the first aspect. In a fourth aspect, an embodiment of the present application provides a computer readable storage medium storing one or more computer programs, the one or more computer programs being executable by one or more processors to implement the method described in the first aspect.
The embodiments of the present application provide a data classification method, apparatus, device, and computer readable storage medium. When the two sample classes (a minority class and a majority class) are unbalanced, several same-class sample sets are drawn from the numerous majority-class samples by downsampling, and new samples are generated from the scarce minority-class samples by upsampling; the new samples are mixed with the original minority-class samples to form a larger set, so that the originally small class is brought into balance with the originally large class. The minority-class and majority-class samples are then used in several rounds of modeling to predict the data, and the prediction result held by the larger number of models is finally taken as the classification result. Upsampling, downsampling, and repeated modeling and prediction together improve the accuracy of data prediction.
Brief description of the drawings
FIG. 1 is a schematic flowchart of a data classification method according to an embodiment of the present application;
FIG. 2 is a schematic diagram of a sub-flow of the data classification method according to an embodiment of the present application;
FIG. 3 is a schematic diagram of another sub-flow of the data classification method according to an embodiment of the present application;
FIG. 4 is a schematic block diagram of a data classification apparatus according to an embodiment of the present application;
FIG. 5 is a schematic block diagram of subunits of the data classification apparatus according to an embodiment of the present application;
FIG. 6 is a schematic block diagram of subunits of the data classification apparatus according to an embodiment of the present application;
FIG. 7 is a schematic block diagram of the structure of a data classification device according to an embodiment of the present application.
Detailed description of the embodiments
The technical solutions in the embodiments of the present application are described clearly and completely below with reference to the accompanying drawings in the embodiments of the present application. Obviously, the described embodiments are only some, not all, of the embodiments of the present application. All other embodiments obtained by a person of ordinary skill in the art based on the embodiments of the present application without creative effort fall within the scope of protection of the present application.
FIG. 1 is a schematic flowchart of a data classification method according to an embodiment of the present application. The method can run on terminals such as smart phones (for example Android phones, iOS phones, and the like), tablet computers, notebook computers, and smart devices. As shown in FIG. 1, the method includes steps S101 to S108.
S101: Acquire a sample set, the sample set including a majority-class sample set and a minority-class sample set.
In big data analysis or learning, the data are often unbalanced, for example advertisement click data and non-click data. Click data refers to the behavior data of users who clicked on a certain type of advertisement, and non-click data refers to the behavior data of users who did not click on that type of advertisement; the ratio of click data to non-click data may reach 1:1000, making the two classes very unbalanced. Majority-class samples are the type of data available in large quantity, such as the non-click data above, and the majority-class sample set is the set formed by these majority-class samples; minority-class samples are the type of data available in small quantity, such as the click data above, and the minority-class sample set is the set formed by these minority-class samples.
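As a concrete illustration of S101, the sketch below (not part of the patent; the function and variable names are hypothetical) splits a binary-labeled dataset such as ad click logs into a majority-class set and a minority-class set.
```python
from collections import Counter

def split_majority_minority(samples, labels):
    """Partition a binary-labeled dataset into a majority-class set and a minority-class set."""
    counts = Counter(labels)
    # The label with more samples is the majority class (e.g. "no_click"),
    # the other is the minority class (e.g. "click").
    majority_label, minority_label = [label for label, _ in counts.most_common(2)]
    majority_set = [s for s, y in zip(samples, labels) if y == majority_label]
    minority_set = [s for s, y in zip(samples, labels) if y == minority_label]
    return majority_set, minority_set

# Toy usage: one click for every nine non-clicks (the text mentions ratios up to 1:1000).
samples = [[float(i), float(i % 7)] for i in range(100)]
labels = ["click" if i % 10 == 0 else "no_click" for i in range(100)]
majority_set, minority_set = split_majority_minority(samples, labels)
print(len(majority_set), len(minority_set))  # 90 10
```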
S102: Determine a preset number of copies of a first-type sample set and a preset number of samples per copy according to the ratio of the total number of samples in the majority-class sample set to the total number of samples in the minority-class sample set, the preset number of copies being odd.
When the total number of samples in the majority-class sample set differs greatly from that in the minority-class sample set, part of the majority-class samples must be drawn by downsampling to form first-type sample sets; because there are many such samples, several copies of the first-type sample set are formed so that more of the majority-class samples are used. A first-type sample set is a set of samples formed from majority-class samples. The preset number of copies and the preset number of samples per copy are determined by the gap between the total number of majority-class samples and the total number of minority-class samples. When the ratio of the majority-class total to the minority-class total is below a threshold (the threshold may be any value in the range 100-1000), the preset number of samples per copy is set to 1/2 or 1/3 of the majority-class total and the preset number of copies is set to 3; because the preset number of samples must be an integer, a non-integer 1/2 or 1/3 of the total is rounded. When the ratio is greater than or equal to the threshold, the preset number of samples per copy is set to 1/4 of the majority-class total and the preset number of copies is set to 5; likewise, a non-integer 1/4 of the total is rounded.
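A minimal sketch of this S102 rule, assuming a threshold of 500 (any value in the stated 100-1000 range would do) and the 1/2 fraction for the below-threshold case; the threshold value, the fraction choice, and the function name are assumptions, not part of the patent.
```python
def preset_copies_and_size(n_majority, n_minority, threshold=500, small_fraction=0.5):
    """Return (number_of_copies, samples_per_copy) for the first-type sample sets.

    threshold: assumed value from the 100-1000 range mentioned in the text.
    small_fraction: 1/2 (or 1/3) of the majority total for the below-threshold case.
    """
    ratio = n_majority / n_minority
    if ratio < threshold:
        copies = 3                                              # an odd number of copies
        samples_per_copy = round(n_majority * small_fraction)   # rounded to an integer
    else:
        copies = 5
        samples_per_copy = round(n_majority * 0.25)             # 1/4 of the majority total
    return copies, samples_per_copy

print(preset_copies_and_size(90_000, 300))    # ratio 300 < 500   -> (3, 45000)
print(preset_copies_and_size(900_000, 900))   # ratio 1000 >= 500 -> (5, 225000)
```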
S103: Randomly draw the preset number of samples from the majority-class sample set to form one first-type sample set, and repeat the drawing to obtain the preset number of copies of the first-type sample set.
Once the preset number of copies and the preset number of samples per copy are determined, samples are drawn at random from the majority-class sample set to obtain the required first-type sample sets. In the embodiments of the present application, after the preset number of samples has been drawn at random from the majority-class sample set to form one first-type sample set, the drawn samples are put back into the original majority-class sample set, and the random drawing of the preset number of samples is repeated on the original majority-class set to form another first-type sample set, until the preset number of copies has been formed. Sampling with replacement keeps the sample structure of the original majority-class set unchanged, so each random draw follows the same distribution and no round of model training is adversely affected by differences between the drawn sets.
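The following sketch of S103 is an assumption about one way to implement it: each copy is a random draw of the preset size, and the drawn samples are conceptually put back before the next copy is drawn, so every copy is taken from the same, unchanged majority-class set.
```python
import random

def draw_first_type_sets(majority_set, copies, samples_per_copy, seed=0):
    """Draw `copies` first-type sample sets, each containing `samples_per_copy` samples.

    Every copy is drawn from the full majority-class set; putting the drawn samples
    back before the next draw means each draw sees the same sample structure.
    """
    rng = random.Random(seed)
    return [rng.sample(majority_set, samples_per_copy) for _ in range(copies)]

majority_set = list(range(1000))
first_type_sets = draw_first_type_sets(majority_set, copies=3, samples_per_copy=500)
print([len(s) for s in first_type_sets])  # [500, 500, 500]
```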
S104: Determine the estimated total number of new samples to be generated according to the total number of samples in the minority-class sample set and the preset number of samples per copy. Because the minority-class samples are few, some new samples can be generated by upsampling so that the minority class reaches balance with a first-type sample set. The number of new samples expected to be generated, i.e. the estimated total number, equals the preset number of samples minus the total number of samples in the minority-class sample set.
S105: Generate new samples from the minority-class sample set according to the estimated total number, and mix the new samples with the minority-class sample set to form a second-type sample set. The new samples are generated from real minority-class samples, and the generated new samples are mixed with the minority-class samples to form the second-type sample set, so that the second-type and first-type sample sets are balanced in size.
In the embodiments of the present application, new samples are generated following the SMOTE idea. Specifically, as shown in FIG. 2, the step of generating new samples from the minority-class sample set according to the estimated total number in S105 includes the following sub-steps S1051-S1058.
S1051: Take each sample in the minority-class sample set in turn as a reference sample. S1052: Acquire the neighbor samples of each reference sample. S1053: Count the first number of neighbor samples of each reference sample. S1054: Calculate, from the first number and the total number of samples in the minority-class sample set, the second number of non-neighbor samples of the corresponding reference sample. S1055: Calculate the ratio of the second number to the total number of minority-class samples. S1056: Normalize the ratio of each reference sample to obtain a corresponding normalized ratio. S1057: Calculate a corresponding third number from each normalized ratio and the estimated total number. S1058: Select neighbor samples of the corresponding reference sample according to the third number and the first number, and generate new samples from the reference sample and the selected neighbor samples.
The third number is the number of new samples that the corresponding reference sample is expected to generate. It is only an estimate for that reference sample, not a fixed value; the actual number of new samples generated may equal the third number, or be slightly larger or smaller.
A neighbor sample of a given sample is a sample close to it in the feature space, and the neighbors include a nearest-neighbor sample, i.e. the sample closest to the given sample. In the embodiments of the present application, a sample is called a neighbor sample when the gap between its distance to the given sample and the nearest-neighbor distance is within a certain range (for example 0-50%); otherwise it is called a non-neighbor sample.
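One possible reading of this neighbor definition, sketched below with an assumed margin of 50%: a candidate counts as a neighbor of a given sample when its distance to that sample exceeds the nearest-neighbor distance by no more than the margin. The helper names are hypothetical.
```python
import math

def euclidean(a, b):
    """Euclidean distance between two feature vectors."""
    return math.sqrt(sum((ai - bi) ** 2 for ai, bi in zip(a, b)))

def neighbor_samples(reference, candidates, margin=0.5):
    """Return the neighbor samples of `reference` among `candidates`.

    A candidate is a neighbor when its distance to the reference is within
    (1 + margin) times the nearest-neighbor distance, with margin in [0, 0.5].
    """
    others = [c for c in candidates if c is not reference]   # exclude the sample itself
    distances = [(euclidean(reference, c), c) for c in others]
    nearest = min(d for d, _ in distances)
    return [c for d, c in distances if d <= (1 + margin) * nearest]

minority_set = [[0.0, 0.0], [0.1, 0.0], [0.12, 0.05], [1.0, 1.0], [0.9, 1.1]]
print(neighbor_samples(minority_set[0], minority_set))  # [[0.1, 0.0], [0.12, 0.05]]
```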
In the embodiments of the present application, corresponding new samples are generated for all minority-class samples: each minority-class sample serves as a reference sample and its neighbor samples are obtained to generate new samples. The number of new samples generated from each reference sample depends on how the minority-class samples are distributed in the set: where the minority-class samples are densely distributed, the corresponding reference samples generate fewer new samples, and where they are sparsely distributed, the corresponding reference samples generate more, so that the sample distribution of the final second-type sample set is more uniform. The uniformity of the sample distribution has an influence on model training: the more uniform the distribution, the better the training result.
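Continuing the sketch above (it reuses euclidean, neighbor_samples, and minority_set from that block), the allocation of S1051-S1057 can be written as below. Whether the reference sample itself is excluded when counting non-neighbors, and the rounding of each per-sample share, are assumptions the text leaves open.
```python
def allocate_new_samples(minority_set, estimated_total, margin=0.5):
    """For each minority-class sample, return its neighbors, first number and third number.

    The third number is that sample's share of `estimated_total`, proportional to its
    normalized non-neighbor ratio, so samples in sparse regions get larger shares.
    """
    total = len(minority_set)
    plan, ratios = [], []
    for reference in minority_set:                                    # S1051: each sample in turn
        neighbors = neighbor_samples(reference, minority_set, margin) # S1052
        first_number = len(neighbors)                                 # S1053
        second_number = total - first_number                          # S1054 (counting rule assumed)
        ratios.append(second_number / total)                          # S1055
        plan.append({"reference": reference, "neighbors": neighbors,
                     "first_number": first_number})
    ratio_sum = sum(ratios)
    for entry, ratio in zip(plan, ratios):
        normalized = ratio / ratio_sum                                # S1056
        entry["third_number"] = round(normalized * estimated_total)   # S1057 (rounded)
    return plan

for entry in allocate_new_samples(minority_set, estimated_total=20):
    print(entry["first_number"], entry["third_number"])
```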
Specifically, as shown in FIG. 3, S1058 includes the following sub-steps S1-S4:
S1: Calculate the quotient of the third number and the first number.
S2: Determine whether the quotient is less than 1.
S3: If so, select the third number of neighbor samples from the neighbor samples of the reference sample, the selected neighbor samples all being farther from the reference sample than the remaining neighbor samples; pair each selected neighbor sample with the reference sample, and generate one new sample from each sample pair.
When the quotient of the third number and the first number is less than 1, the actual number of new samples the reference sample needs to generate is smaller than its number of neighbor samples, so only a subset of the neighbor samples (the third number of them) need be paired with the reference sample to generate new samples. Choosing the neighbor samples that are farther from the reference sample to form the sample pairs lets the new samples be inserted into the regions where the original samples are sparsely distributed, which makes the distribution more uniform.
For example, the nth reference sample An in the minority-class sample set has Y neighbor samples, and the total number of new samples that An is expected to generate (the third number) is computed as N. If N is smaller than Y (for example N=3 and Y=6), there is no need to pair all the neighbor samples with the reference sample An; it suffices to select N (3) neighbor samples to generate new samples with An, choosing neighbor samples as far from An as possible, so that new samples are inserted where the sample distribution is sparse and the distribution becomes more uniform.
S4: If not, round the quotient to an integer, pair each neighbor sample of the reference sample with the reference sample, and generate that integer number of new samples from each sample pair.
If the quotient of the third number and the first number is greater than or equal to 1, the actual number of new samples to be generated from the reference sample is greater than or equal to its number of neighbor samples. The quotient is then rounded, each neighbor sample of the reference sample is paired with the reference sample, and each sample pair generates that integer number of new samples. After the new samples generated from all reference samples are mixed with the original minority-class samples, the resulting number of samples is balanced with the number of samples in a first-type sample set.
For example, if N is greater than Y (N=15, Y=6), the quotient of the two is greater than 1 with a remainder; each neighbor sample can then be paired with the reference sample so that each pair produces the same number of new samples (the quotient rounded to an integer). The new samples produced in this way are richer, making the whole sample set more complete.
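A sketch of how the S1058 branching might be organized: plan_sample_pairs only decides which (reference, neighbor) pairs to form and how many new samples each pair should yield; the interpolation itself is given by the formula in the following passage. Round-half-up is used to match the rounding rule in the text; the function and parameter names are assumptions.
```python
import math

def plan_sample_pairs(reference, neighbors, third_number):
    """Return (reference, neighbor, count) tasks implementing the S1058 branching."""
    first_number = len(neighbors)
    quotient = third_number / first_number
    if quotient < 1:
        # Fewer new samples than neighbors: keep only the `third_number` farthest
        # neighbors so that the new samples land in the sparser regions.
        farthest = sorted(neighbors, key=lambda b: math.dist(reference, b), reverse=True)
        return [(reference, b, 1) for b in farthest[:third_number]]
    # At least as many new samples as neighbors: pair every neighbor with the
    # reference; each pair yields the quotient rounded half up (the "rounding rule").
    per_pair = int(quotient + 0.5)
    return [(reference, b, per_pair) for b in neighbors]

ref = [0.0, 0.0]
neigh = [[0.1, 0.0], [0.0, 0.1], [0.2, 0.1], [0.1, 0.2], [0.3, 0.0], [0.0, 0.3]]
print(plan_sample_pairs(ref, neigh, third_number=15))  # 6 pairs, 3 new samples each (15/6 -> 3)
print(plan_sample_pairs(ref, neigh, third_number=2))   # the 2 farthest neighbors, 1 each
```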
In model training, each sample of known type usually needs to be converted into a feature vector An(a1, a2, ..., ai) in an i-dimensional space, where each vector value ai encodes one attribute of the sample An. A model is then obtained by machine learning on the feature vectors and corresponding types of all samples, and the model is finally used to predict which type a piece of data to be classified belongs to.
In the embodiments of the present application, the neighbor samples of a reference sample are obtained on the basis of the Euclidean distance.
The method of generating one new sample from a sample pair includes steps (1)-(3):
(1) Acquire the feature vector An(a1, a2, ..., ai) of the reference sample of the sample pair in the i-dimensional space and the feature vector Bk(b1, b2, ..., bi) of the neighbor sample.
In practice, i is usually greater than or equal to 2: if a sample carries i kinds of attribute information, the feature vector has i dimensions.
Assuming the minority-class sample set contains m samples, An refers to the nth sample, where n ≤ m, and a1, a2, ..., ai are the feature values of the reference sample An in the i-dimensional space. The reference sample An has Y neighbor samples, of which the K more distant neighbor samples are selected and paired with the reference sample to form K sample pairs. Bk refers to the kth of these K neighbor samples, where k = 1, 2, ..., K; each time one neighbor sample is selected from the K neighbor samples and paired with the reference sample to generate one new sample, so the reference sample An finally generates K new samples.
The feature vector of the reference sample is known, and once a neighbor sample is determined its feature vector is also known (a neighbor sample is itself a sample of the minority-class sample set); An and Bk, and ai and bi, serve only to distinguish the reference sample from the neighbor sample.
(2) Randomly generate a proportion value t, where 0 < t < 1.
(3) Calculate the feature vector Cnk(c1, c2, ..., ci) of the new sample to be generated, where ci = ai + t*(bi - ai), and generate a sample with the feature vector Cnk(c1, c2, ..., ci) in the i-dimensional space. Cnk denotes the new sample generated from the sample pair formed by the reference sample An and the neighbor sample Bk.
From each vector value bi of the neighbor sample, the corresponding vector value ai of the reference sample, and the proportion value t, the vector value ci of the new sample can be computed. Geometrically, the point of the reference sample and the point of the neighbor sample are joined by a straight line and a point is taken at random on that line, between the reference sample and the neighbor sample; this interpolation yields a new point, i.e. a new sample.
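A sketch of steps (1)-(3) for one sample pair, directly applying ci = ai + t*(bi - ai) with a proportion t drawn strictly between 0 and 1; the random generator and its seeding are incidental assumptions.
```python
import random

def interpolate_one(an, bk, rng=None):
    """Generate one new sample Cnk from the pair (An, Bk) via ci = ai + t * (bi - ai)."""
    rng = rng or random.Random(0)
    t = rng.uniform(1e-9, 1.0)   # the proportion value t, 0 < t < 1
    return [ai + t * (bi - ai) for ai, bi in zip(an, bk)]

an = [0.0, 0.0, 1.0]   # feature vector of the reference sample An
bk = [1.0, 2.0, 3.0]   # feature vector of the neighbor sample Bk
print(interpolate_one(an, bk))   # a point on the segment between An and Bk
```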
The method of generating an integer number of new samples from one sample pair includes steps (a)-(c):
(a) Obtain the feature vector An(a1, a2, ..., ai) of the reference sample of the sample pair in the i-dimensional space and the feature vector Bk(b1, b2, ..., bi) of the neighbor sample.
For example, if the reference sample An has Y neighbor samples, the Y neighbor samples are each paired with the reference sample to form Y sample pairs, and Bk denotes the kth of these Y neighbor samples, where k = 1, 2, ..., Y. Each time, one neighbor sample is taken from the Y neighbor samples to form a sample pair with the reference sample and generate an integer number j of new samples, so one reference sample An ultimately yields Y*j new samples.
(b) Randomly generate j scale values t_x, where 0 < t_x < 1, x = 1, 2, ..., j, j equals the integer, and all scale values t_x are different from one another.
(c) Calculate the feature vectors Cnk_x(c1, c2, ..., ci) of the integer number of new samples to be generated, where ci = ai + t_x*(bi - ai), and generate, in the i-dimensional space, samples having the feature vectors Cnk_x(c1, c2, ..., ci). Cnk_x denotes the xth new sample generated from the sample pair formed by the reference sample An and the neighbor sample Bk.
The point of the reference sample is joined to the point of the neighbor sample by a straight line and an integer number of points are taken at random on that segment, between the reference sample and the neighbor sample; the integer number of new points obtained by this interpolation are the integer number of new samples.
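A rough sketch of steps (a)-(c) follows: one sample pair yields an integer number j of new samples by drawing j distinct scale values. The function name make_j_new_samples and the re-drawing loop used to keep the scale values distinct are assumptions for illustration only.

```python
import numpy as np

def make_j_new_samples(reference: np.ndarray, neighbor: np.ndarray, j: int,
                       rng: np.random.Generator) -> np.ndarray:
    """Return j interpolated samples on the segment between reference and neighbor."""
    t = rng.uniform(0.0, 1.0, size=j)        # j scale values t_x
    while len(np.unique(t)) < j:             # keep drawing until all t_x differ
        t = rng.uniform(0.0, 1.0, size=j)
    # Row x is C_nk_x = A_n + t_x * (B_k - A_n), computed by broadcasting
    return reference + t[:, None] * (neighbor - reference)

rng = np.random.default_rng(1)
An = np.array([1.0, 2.0])
Bk = np.array([3.0, 0.0])
new_samples = make_j_new_samples(An, Bk, j=3, rng=rng)   # shape (3, 2)
```

Calling make_j_new_samples once per sample pair reproduces the Y*j new samples described above for a reference sample with Y neighbor pairs.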
S106: Perform machine learning on each first-type sample set together with the second-type sample set to obtain a corresponding classification model. S107: Use the classification models to predict the classification of the data to be classified and obtain corresponding prediction results. S108: Count the numbers of the different prediction results respectively, and determine the prediction result with the larger number as the classification result.
For the sake of prediction accuracy, modeling and prediction are repeated as many times as possible. Each first-type sample set is therefore combined with the second-type sample set for machine learning to obtain a corresponding classification model, and each resulting model makes its own prediction. The prediction results fall into the first class (the majority class) and the second class (the minority class), and the prediction result that occurs more often is the final classification result.
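The sketch below, under the assumption that scikit-learn is available and class labels are the integers 0 and 1, illustrates S106 to S108: each first-type (down-sampled majority) set is combined with the second-type (minority plus synthetic) set, one model is trained per combination, and the class predicted by more models is returned. Logistic regression is an arbitrary choice for the example; the patent does not prescribe a particular learner.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

def ensemble_predict(first_type_sets, second_type_set, X_test):
    """first_type_sets: list of (X, y) majority subsets; second_type_set: (X, y) balanced minority set."""
    X_min, y_min = second_type_set
    votes = []
    for X_maj, y_maj in first_type_sets:
        X_train = np.vstack([X_maj, X_min])              # S106: one training set per first-type subset
        y_train = np.concatenate([y_maj, y_min])
        model = LogisticRegression(max_iter=1000).fit(X_train, y_train)
        votes.append(model.predict(X_test))              # S107: each model predicts on its own
    votes = np.stack(votes)                              # shape (n_models, n_test)
    # S108: the prediction made by more models is the classification result
    return np.array([np.bincount(col.astype(int)).argmax() for col in votes.T])
```

Because the preset number of first-type sample sets is an odd number (see claim 1), a two-class vote of this kind cannot end in a tie.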
With the above method, a user's behavior data can be used to predict whether the user will click on a certain type of advertisement, so different advertisements can be served to different user groups in a planned way, or advertising plans can be tailored to potential customers according to their needs, increasing the likelihood of winning potential business.
The embodiment of the present application provides a data classification method. When the two classes of samples (minority-class samples and majority-class samples) are unbalanced in number, several sample sets of the same kind are produced from the numerous samples by down-sampling, and new samples are produced from the scarce samples by up-sampling; the new samples are mixed with the original minority-class samples to form a larger set, so that the originally scarce samples are balanced in number against the originally numerous samples. The minority-class and majority-class samples are then used to predict the data through multiple rounds of modeling, and the prediction result holding the numerical majority is taken as the classification result. Up-sampling, down-sampling, and repeated modeling and prediction together improve the accuracy of data prediction.
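For completeness, a minimal sketch of the down-sampling half of the scheme is given below, assuming each first-type subset is drawn without replacement from the majority-class set; the names downsample_majority, n_subsets and subset_size are illustrative and not taken from the patent.

```python
import numpy as np

def downsample_majority(X_majority: np.ndarray, n_subsets: int, subset_size: int,
                        rng: np.random.Generator) -> list:
    """Randomly draw n_subsets subsets of subset_size samples from the majority class."""
    subsets = []
    for _ in range(n_subsets):
        idx = rng.choice(len(X_majority), size=subset_size, replace=False)
        subsets.append(X_majority[idx])
    return subsets

rng = np.random.default_rng(2)
X_majority = rng.normal(size=(1000, 5))   # 1000 majority-class samples with 5 features
subsets = downsample_majority(X_majority, n_subsets=5, subset_size=200, rng=rng)
```

Each subset returned here would then be paired with the balanced second-type sample set for the modeling and voting step sketched above.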
FIG. 4 is a schematic block diagram of a data classification apparatus 100 provided by an embodiment of the present application. The data classification apparatus 100 includes an acquisition unit 101, a first determination unit 102, a first forming unit 103, a second determination unit 104, a generation unit 105, a second forming unit 106, a learning unit 107, a prediction unit 108, a statistics unit 109, and a third determination unit 110.
The acquisition unit 101 is configured to acquire a sample set, the sample set including a majority-class sample set and a minority-class sample set. The first determination unit 102 is configured to determine a preset number of copies and a preset number of samples of the first-type sample set according to the ratio of the total number of samples in the majority-class sample set to the total number of samples in the minority-class sample set. The first forming unit 103 is configured to randomly extract the preset number of samples from the majority-class sample set to form one first-type sample set, and to repeat the extraction multiple times to obtain the preset number of first-type sample sets. The second determination unit 104 is configured to determine the estimated total number of new samples to be generated according to the total number of samples in the minority-class sample set and the preset number of samples. The generation unit 105 is configured to generate new samples from the minority-class sample set according to the estimated total number. The second forming unit 106 is configured to mix the new samples with the minority-class sample set to form a second-type sample set. The learning unit 107 is configured to perform machine learning on each first-type sample set together with the second-type sample set to obtain a corresponding classification model. The prediction unit 108 is configured to use the classification models to predict the classification of the data to be classified and obtain corresponding prediction results. The statistics unit 109 is configured to count the numbers of the different prediction results respectively. The third determination unit 110 is configured to determine the prediction result with the larger number as the classification result.
In this embodiment of the present application, as shown in FIG. 5, the generation unit 105 includes the following subunits: a determination subunit 1051, configured to determine, in turn, one sample of the minority-class sample set as a reference sample; a first acquisition subunit 1052, configured to acquire the neighbor samples of each reference sample; a statistics subunit 1053, configured to count, for each reference sample, a first number of its neighbor samples; a first calculation subunit 1054, configured to calculate a second number of non-neighbor samples of the corresponding reference sample according to the first number and the total number of samples in the minority-class sample set; a second calculation subunit 1055, configured to calculate the proportion of the second number to the total number of samples in the minority-class sample set; a normalization subunit 1056, configured to normalize the proportion of each reference sample to obtain a corresponding normalized proportion; a third calculation subunit 1057, configured to calculate a corresponding third number according to each normalized proportion and the estimated total number; and a generation subunit 1058, configured to select neighbor samples of the corresponding reference sample according to the third number and the first number, and to generate new samples according to the reference sample and the neighbor samples.
In this embodiment of the present application, as shown in FIG. 6, the generation subunit 1058 includes the following subunits: a fourth calculation subunit 10581, configured to calculate the quotient of the third number and the first number; a judgment subunit 10582, configured to judge whether the quotient is less than 1; a selection subunit 10583, configured to, if the quotient is less than 1, select the third number of neighbor samples from the neighbor samples of the reference sample, each of the selected neighbor samples being farther from the reference sample than the remaining neighbor samples; a first generation subunit 10584, configured to form a sample pair from each selected neighbor sample and the reference sample, and to generate one new sample from each sample pair; and a second generation subunit 10585, configured to, if the quotient is greater than or equal to 1, take an integer according to the rounding rule, form a sample pair from each neighbor sample of the reference sample and the reference sample, and generate the integer number of new samples from each sample pair.
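To tie the subunits of FIGS. 5 and 6 together, the sketch below allocates the estimated total number of new samples across the reference samples in proportion to their normalized non-neighbor proportions, then decides per reference sample how many new samples each neighbor pair contributes. The assumption that the second number equals the minority total minus the neighbor count, and all function and variable names, are illustrative only; the patent leaves the exact formula to the earlier description.

```python
import numpy as np

def allocate_new_samples(neighbor_counts, total_minority, estimated_total):
    """Per reference sample, decide which neighbor pairs to use and how many new samples per pair."""
    first = np.asarray(neighbor_counts, dtype=float)   # first number: neighbor count per reference sample
    second = total_minority - first                    # assumed second number: non-neighbor count
    proportion = second / total_minority               # proportion of non-neighbors
    normalized = proportion / proportion.sum()         # normalized proportion across reference samples
    third = normalized * estimated_total               # third number per reference sample
    plan = []
    for third_n, first_n in zip(third, first):
        quotient = third_n / first_n
        if quotient < 1:
            # quotient < 1: use only the third-number farthest neighbors, one new sample per pair
            plan.append(("farthest_neighbors", int(round(third_n)), 1))
        else:
            # quotient >= 1: use every neighbor pair, a rounded quotient of new samples per pair
            plan.append(("all_neighbors", int(first_n), int(round(quotient))))
    return plan

# Example: three reference samples with 3, 5 and 2 neighbors in a 40-sample minority set,
# and 60 new samples to distribute in total.
print(allocate_new_samples([3, 5, 2], total_minority=40, estimated_total=60))
```

The returned plan mirrors the branch implemented by the selection and generation subunits: fewer allotted samples than neighbors means only the farthest neighbors are used once each, otherwise every neighbor pair is reused a rounded number of times.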
The first generation subunit 10584 includes: a second acquisition subunit, configured to obtain the feature vector An(a1, a2, ..., ai) of the reference sample of the sample pair in the i-dimensional space and the feature vector Bk(b1, b2, ..., bi) of the neighbor sample; a first random subunit, configured to randomly generate a scale value t, where 0 < t < 1; and a first feature calculation subunit, configured to calculate the feature vector Cnk(c1, c2, ..., ci) of the new sample to be generated, where ci = ai + t*(bi - ai), and to generate, in the i-dimensional space, a sample having the feature vector Cnk(c1, c2, ..., ci).
The second generation subunit 10585 includes the following subunits: a third acquisition subunit, configured to obtain the feature vector An(a1, a2, ..., ai) of the reference sample of the sample pair in the i-dimensional space and the feature vector Bk(b1, b2, ..., bi) of the neighbor sample; a second random subunit, configured to randomly generate j scale values t_x, where 0 < t_x < 1, x = 1, 2, ..., j, j equals the integer, and all scale values t_x are different from one another; and a second feature calculation subunit, configured to calculate the feature vectors Cnk_x(c1, c2, ..., ci) of the integer number of new samples to be generated, where ci = ai + t_x*(bi - ai), and to generate, in the i-dimensional space, samples having the feature vectors Cnk_x(c1, c2, ..., ci).
For the functions of the data classification apparatus 100 and the detailed description of each unit, reference may be made to the description in the above method embodiments, which is not repeated here. The data classification apparatus 100 may be implemented in the form of a computer program, and the computer program may run on a computer device as shown in FIG. 7.
FIG. 7 is a schematic block diagram of a data classification device provided by an embodiment of the present application. The device may be a terminal or a server, where the terminal may be an electronic device with a communication function, such as a smartphone, a tablet computer, a notebook computer, a desktop computer, a personal digital assistant or a wearable device, and the server may be a standalone server or a server cluster made up of multiple servers.
The device is a computer device 200 that includes a processor 202, a memory and a network interface 205 connected by a system bus 201, where the memory includes a non-volatile storage medium 203 and an internal memory 204. The non-volatile storage medium 203 of the computer device 200 can store an operating system 2031 and a computer program 2032; when the computer program 2032 is executed, it can cause the processor 202 to perform a data classification method. The processor 202 of the computer device 200 provides computing and control capabilities and supports the operation of the entire computer device 200. The internal memory 204 provides an environment for running the computer program 2032 stored in the non-volatile storage medium 203. The network interface 205 of the computer device 200 is used for network communication, such as sending assigned tasks. When the processor 202 runs the computer program 2032 in the non-volatile storage medium 203, it can carry out the implementations of all the embodiments of the above data classification method. Those skilled in the art will understand that the embodiment of the computer device shown in FIG. 7 does not limit the specific configuration of the data classification device; in other embodiments, the data classification device may include more or fewer components than shown, combine certain components, or arrange the components differently. For example, in some embodiments the data classification device may include only a memory and a processor; in such embodiments the structures and functions of the memory and the processor are consistent with the embodiment shown in FIG. 7 and are not repeated here.
The present application further provides a computer-readable storage medium that stores one or more computer programs executable by one or more processors; when the one or more programs are executed by the one or more processors, all the embodiments of the above data classification method can be implemented.
The above are only specific embodiments of the present application, but the scope of protection of the present application is not limited thereto. Any equivalent modification or replacement that those skilled in the art can readily conceive within the technical scope disclosed in the present application shall fall within the scope of protection of the present application. Therefore, the scope of protection of the present application shall be subject to the scope of protection of the claims.

Claims (20)

1. A data classification method, comprising:
    obtaining a sample set, the sample set comprising a majority-class sample set and a minority-class sample set;
    determining a preset number of copies and a preset number of samples of a first-type sample set according to a ratio of the total number of samples in the majority-class sample set to the total number of samples in the minority-class sample set, the preset number of copies being an odd number;
    randomly extracting the preset number of samples from the majority-class sample set to form one first-type sample set, and repeating the extraction multiple times to obtain the preset number of first-type sample sets;
    determining an estimated total number of new samples to be generated according to the total number of samples in the minority-class sample set and the preset number of samples;
    generating new samples from the minority-class sample set according to the estimated total number, and mixing the new samples with the minority-class sample set to form a second-type sample set;
    performing machine learning on each first-type sample set together with the second-type sample set to obtain a corresponding classification model;
    predicting the classification of data to be classified by using the classification models to obtain corresponding prediction results; and
    counting the numbers of the different prediction results respectively, and determining the prediction result with the larger number as the classification result.
2. The data classification method according to claim 1, wherein the generating new samples from the minority-class sample set according to the estimated total number comprises:
    determining, in turn, one sample of the minority-class sample set as a reference sample;
    acquiring neighbor samples of each reference sample;
    counting, for each reference sample, a first number of its neighbor samples;
    calculating a second number of non-neighbor samples of the corresponding reference sample according to the first number and the total number of samples in the minority-class sample set;
    calculating a proportion of the second number to the total number of samples in the minority-class sample set;
    normalizing the proportion of each reference sample to obtain a corresponding normalized proportion;
    calculating a corresponding third number according to each normalized proportion and the estimated total number; and
    selecting neighbor samples of the corresponding reference sample according to the third number and the first number, and generating new samples according to the reference sample and the neighbor samples.
3. The data classification method according to claim 2, wherein the selecting neighbor samples of the corresponding reference sample according to the third number and the first number, and generating new samples according to the reference sample and the neighbor samples, comprises:
    calculating a quotient of the third number and the first number;
    judging whether the quotient is less than 1;
    if so, selecting the third number of neighbor samples from the neighbor samples of the reference sample, each of the selected neighbor samples being farther from the reference sample than the remaining neighbor samples, forming a sample pair from each selected neighbor sample and the reference sample, and generating one new sample from each sample pair; and
    if not, taking an integer according to the rounding rule, forming a sample pair from each neighbor sample of the reference sample and the reference sample, and generating the integer number of new samples from each sample pair.
4. The data classification method according to claim 3, wherein generating one new sample from one sample pair comprises:
    obtaining a feature vector An(a1, a2, ..., ai) of the reference sample of the sample pair in an i-dimensional space and a feature vector Bk(b1, b2, ..., bi) of the neighbor sample;
    randomly generating a scale value t, where 0 < t < 1; and
    calculating a feature vector Cnk(c1, c2, ..., ci) of the new sample to be generated, where ci = ai + t*(bi - ai), and generating, in the i-dimensional space, a sample having the feature vector Cnk(c1, c2, ..., ci).
5. The method according to claim 3, wherein generating the integer number of new samples from one sample pair comprises:
    obtaining a feature vector An(a1, a2, ..., ai) of the reference sample of the sample pair in an i-dimensional space and a feature vector Bk(b1, b2, ..., bi) of the neighbor sample;
    randomly generating j scale values t_x, where 0 < t_x < 1, x = 1, 2, ..., j, j equals the integer, and all scale values t_x are different from one another; and
    calculating feature vectors Cnk_x(c1, c2, ..., ci) of the integer number of new samples to be generated, where ci = ai + t_x*(bi - ai), and generating, in the i-dimensional space, samples having the feature vectors Cnk_x(c1, c2, ..., ci).
6. A data classification apparatus, comprising:
    an acquisition unit, configured to acquire a sample set, the sample set comprising a majority-class sample set and a minority-class sample set;
    a first determination unit, configured to determine a preset number of copies and a preset number of samples of a first-type sample set according to a ratio of the total number of samples in the majority-class sample set to the total number of samples in the minority-class sample set, the preset number of copies being an odd number;
    a first forming unit, configured to randomly extract the preset number of samples from the majority-class sample set to form one first-type sample set, and to repeat the extraction multiple times to obtain the preset number of first-type sample sets;
    a second determination unit, configured to determine an estimated total number of new samples to be generated according to the total number of samples in the minority-class sample set and the preset number of samples;
    a generation unit, configured to generate new samples from the minority-class sample set according to the estimated total number;
    a second forming unit, configured to mix the new samples with the minority-class sample set to form a second-type sample set;
    a learning unit, configured to perform machine learning on each first-type sample set together with the second-type sample set to obtain a corresponding classification model;
    a prediction unit, configured to predict the classification of data to be classified by using the classification models to obtain corresponding prediction results;
    a statistics unit, configured to count the numbers of the different prediction results respectively; and
    a third determination unit, configured to determine the prediction result with the larger number as the classification result.
7. The data classification apparatus according to claim 6, wherein the generation unit comprises:
    a determination subunit, configured to determine, in turn, one sample of the minority-class sample set as a reference sample;
    a first acquisition subunit, configured to acquire neighbor samples of each reference sample;
    a statistics subunit, configured to count, for each reference sample, a first number of its neighbor samples;
    a first calculation subunit, configured to calculate a second number of non-neighbor samples of the corresponding reference sample according to the first number and the total number of samples in the minority-class sample set;
    a second calculation subunit, configured to calculate a proportion of the second number to the total number of samples in the minority-class sample set;
    a normalization subunit, configured to normalize the proportion of each reference sample to obtain a corresponding normalized proportion;
    a third calculation subunit, configured to calculate a corresponding third number according to each normalized proportion and the estimated total number; and
    a generation subunit, configured to select neighbor samples of the corresponding reference sample according to the third number and the first number, and to generate new samples according to the reference sample and the neighbor samples.
8. The data classification apparatus according to claim 7, wherein the generation subunit comprises:
    a fourth calculation subunit, configured to calculate a quotient of the third number and the first number;
    a judgment subunit, configured to judge whether the quotient is less than 1;
    a selection subunit, configured to, if the quotient is less than 1, select the third number of neighbor samples from the neighbor samples of the reference sample, each of the selected neighbor samples being farther from the reference sample than the remaining neighbor samples;
    a first generation subunit, configured to form a sample pair from each selected neighbor sample and the reference sample, and to generate one new sample from each sample pair; and
    a second generation subunit, configured to, if the quotient is greater than or equal to 1, take an integer according to the rounding rule, form a sample pair from each neighbor sample of the reference sample and the reference sample, and generate the integer number of new samples from each sample pair.
9. The data classification apparatus according to claim 8, wherein the first generation subunit comprises:
    a second acquisition subunit, configured to obtain a feature vector An(a1, a2, ..., ai) of the reference sample of the sample pair in an i-dimensional space and a feature vector Bk(b1, b2, ..., bi) of the neighbor sample;
    a first random subunit, configured to randomly generate a scale value t, where 0 < t < 1; and
    a first feature calculation subunit, configured to calculate a feature vector Cnk(c1, c2, ..., ci) of the new sample to be generated, where ci = ai + t*(bi - ai), and to generate, in the i-dimensional space, a sample having the feature vector Cnk(c1, c2, ..., ci).
10. The data classification apparatus according to claim 8, wherein the second generation subunit comprises:
    a third acquisition subunit, configured to obtain a feature vector An(a1, a2, ..., ai) of the reference sample of the sample pair in an i-dimensional space and a feature vector Bk(b1, b2, ..., bi) of the neighbor sample;
    a second random subunit, configured to randomly generate j scale values t_x, where 0 < t_x < 1, x = 1, 2, ..., j, j equals the integer, and all scale values t_x are different from one another; and
    a second feature calculation subunit, configured to calculate feature vectors Cnk_x(c1, c2, ..., ci) of the integer number of new samples to be generated, where ci = ai + t_x*(bi - ai), and to generate, in the i-dimensional space, samples having the feature vectors Cnk_x(c1, c2, ..., ci).
11. A data classification device, comprising a memory and a processor connected to the memory, wherein:
    the memory is configured to store a computer program implementing a data classification method; and
    the processor is configured to run the computer program stored in the memory to perform the following steps:
    obtaining a sample set, the sample set comprising a majority-class sample set and a minority-class sample set;
    determining a preset number of copies and a preset number of samples of a first-type sample set according to a ratio of the total number of samples in the majority-class sample set to the total number of samples in the minority-class sample set, the preset number of copies being an odd number;
    randomly extracting the preset number of samples from the majority-class sample set to form one first-type sample set, and repeating the extraction multiple times to obtain the preset number of first-type sample sets;
    determining an estimated total number of new samples to be generated according to the total number of samples in the minority-class sample set and the preset number of samples;
    generating new samples from the minority-class sample set according to the estimated total number, and mixing the new samples with the minority-class sample set to form a second-type sample set;
    performing machine learning on each first-type sample set together with the second-type sample set to obtain a corresponding classification model;
    predicting the classification of data to be classified by using the classification models to obtain corresponding prediction results; and
    counting the numbers of the different prediction results respectively, and determining the prediction result with the larger number as the classification result.
12. The data classification device according to claim 11, wherein, when performing the step of generating new samples from the minority-class sample set according to the estimated total number, the processor specifically performs the following steps:
    determining, in turn, one sample of the minority-class sample set as a reference sample;
    acquiring neighbor samples of each reference sample;
    counting, for each reference sample, a first number of its neighbor samples;
    calculating a second number of non-neighbor samples of the corresponding reference sample according to the first number and the total number of samples in the minority-class sample set;
    calculating a proportion of the second number to the total number of samples in the minority-class sample set;
    normalizing the proportion of each reference sample to obtain a corresponding normalized proportion;
    calculating a corresponding third number according to each normalized proportion and the estimated total number; and
    selecting neighbor samples of the corresponding reference sample according to the third number and the first number, and generating new samples according to the reference sample and the neighbor samples.
13. The data classification device according to claim 12, wherein, when performing the step of selecting neighbor samples of the corresponding reference sample according to the third number and the first number and generating new samples according to the reference sample and the neighbor samples, the processor specifically performs the following steps:
    calculating a quotient of the third number and the first number;
    judging whether the quotient is less than 1;
    if so, selecting the third number of neighbor samples from the neighbor samples of the reference sample, each of the selected neighbor samples being farther from the reference sample than the remaining neighbor samples, forming a sample pair from each selected neighbor sample and the reference sample, and generating one new sample from each sample pair; and
    if not, taking an integer according to the rounding rule, forming a sample pair from each neighbor sample of the reference sample and the reference sample, and generating the integer number of new samples from each sample pair.
14. The data classification device according to claim 13, wherein, when generating one new sample from one sample pair, the processor specifically performs the following steps:
    obtaining a feature vector An(a1, a2, ..., ai) of the reference sample of the sample pair in an i-dimensional space and a feature vector Bk(b1, b2, ..., bi) of the neighbor sample;
    randomly generating a scale value t, where 0 < t < 1; and
    calculating a feature vector Cnk(c1, c2, ..., ci) of the new sample to be generated, where ci = ai + t*(bi - ai), and generating, in the i-dimensional space, a sample having the feature vector Cnk(c1, c2, ..., ci).
15. The data classification device according to claim 13, wherein, when generating the integer number of new samples from one sample pair, the processor specifically performs the following steps:
    obtaining a feature vector An(a1, a2, ..., ai) of the reference sample of the sample pair in an i-dimensional space and a feature vector Bk(b1, b2, ..., bi) of the neighbor sample;
    randomly generating j scale values t_x, where 0 < t_x < 1, x = 1, 2, ..., j, j equals the integer, and all scale values t_x are different from one another; and
    calculating feature vectors Cnk_x(c1, c2, ..., ci) of the integer number of new samples to be generated, where ci = ai + t_x*(bi - ai), and generating, in the i-dimensional space, samples having the feature vectors Cnk_x(c1, c2, ..., ci).
16. A computer-readable storage medium, wherein the computer-readable storage medium stores one or more computer programs executable by one or more processors to implement the following steps:
    determining a preset number of copies and a preset number of samples of a first-type sample set according to a ratio of the total number of samples in the majority-class sample set to the total number of samples in the minority-class sample set, the preset number of copies being an odd number;
    randomly extracting the preset number of samples from the majority-class sample set to form one first-type sample set, and repeating the extraction multiple times to obtain the preset number of first-type sample sets;
    determining an estimated total number of new samples to be generated according to the total number of samples in the minority-class sample set and the preset number of samples;
    generating new samples from the minority-class sample set according to the estimated total number, and mixing the new samples with the minority-class sample set to form a second-type sample set;
    performing machine learning on each first-type sample set together with the second-type sample set to obtain a corresponding classification model;
    predicting the classification of data to be classified by using the classification models to obtain corresponding prediction results; and
    counting the numbers of the different prediction results respectively, and determining the prediction result with the larger number as the classification result.
17. The computer-readable storage medium according to claim 16, wherein the step of generating new samples from the minority-class sample set according to the estimated total number comprises:
    determining, in turn, one sample of the minority-class sample set as a reference sample;
    acquiring neighbor samples of each reference sample;
    counting, for each reference sample, a first number of its neighbor samples;
    calculating a second number of non-neighbor samples of the corresponding reference sample according to the first number and the total number of samples in the minority-class sample set;
    calculating a proportion of the second number to the total number of samples in the minority-class sample set;
    normalizing the proportion of each reference sample to obtain a corresponding normalized proportion;
    calculating a corresponding third number according to each normalized proportion and the estimated total number; and
    selecting neighbor samples of the corresponding reference sample according to the third number and the first number, and generating new samples according to the reference sample and the neighbor samples.
18. The computer-readable storage medium according to claim 17, wherein the step of selecting neighbor samples of the corresponding reference sample according to the third number and the first number and generating new samples according to the reference sample and the neighbor samples comprises:
    calculating a quotient of the third number and the first number;
    judging whether the quotient is less than 1;
    if so, selecting the third number of neighbor samples from the neighbor samples of the reference sample, each of the selected neighbor samples being farther from the reference sample than the remaining neighbor samples, forming a sample pair from each selected neighbor sample and the reference sample, and generating one new sample from each sample pair; and
    if not, taking an integer according to the rounding rule, forming a sample pair from each neighbor sample of the reference sample and the reference sample, and generating the integer number of new samples from each sample pair.
19. The computer-readable storage medium according to claim 18, wherein the step of generating one new sample from one sample pair comprises:
    obtaining a feature vector An(a1, a2, ..., ai) of the reference sample of the sample pair in an i-dimensional space and a feature vector Bk(b1, b2, ..., bi) of the neighbor sample;
    randomly generating a scale value t, where 0 < t < 1; and
    calculating a feature vector Cnk(c1, c2, ..., ci) of the new sample to be generated, where ci = ai + t*(bi - ai), and generating, in the i-dimensional space, a sample having the feature vector Cnk(c1, c2, ..., ci).
20. The computer-readable storage medium according to claim 18, wherein the step of generating the integer number of new samples from one sample pair comprises:
    obtaining a feature vector An(a1, a2, ..., ai) of the reference sample of the sample pair in an i-dimensional space and a feature vector Bk(b1, b2, ..., bi) of the neighbor sample;
    randomly generating j scale values t_x, where 0 < t_x < 1, x = 1, 2, ..., j, j equals the integer, and all scale values t_x are different from one another; and
    calculating feature vectors Cnk_x(c1, c2, ..., ci) of the integer number of new samples to be generated, where ci = ai + t_x*(bi - ai), and generating, in the i-dimensional space, samples having the feature vectors Cnk_x(c1, c2, ..., ci).
PCT/CN2018/084047 2018-03-08 2018-04-23 Data classification method, apparatus, device and computer readable storage medium WO2019169704A1 (en)

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
CN201810190818.2 2018-03-08
CN201810190818.2A CN108491474A (en) 2018-03-08 2018-03-08 A kind of data classification method, device, equipment and computer readable storage medium

Publications (1)

Publication Number Publication Date
WO2019169704A1 true WO2019169704A1 (en) 2019-09-12

Family

ID=63338126

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/CN2018/084047 WO2019169704A1 (en) 2018-03-08 2018-04-23 Data classification method, apparatus, device and computer readable storage medium

Country Status (2)

Country Link
CN (1) CN108491474A (en)
WO (1) WO2019169704A1 (en)

Cited By (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN111259964A (en) * 2020-01-17 2020-06-09 上海海事大学 Over-sampling method for unbalanced data set
CN111292329A (en) * 2020-01-15 2020-06-16 北京字节跳动网络技术有限公司 Training method and device for video segmentation network and electronic equipment
CN112085080A (en) * 2020-08-31 2020-12-15 北京百度网讯科技有限公司 Sample equalization method, device, equipment and storage medium
CN112801178A (en) * 2021-01-26 2021-05-14 上海明略人工智能(集团)有限公司 Model training method, device, equipment and computer readable medium
CN113673575A (en) * 2021-07-26 2021-11-19 浙江大华技术股份有限公司 Data synthesis method, training method of image processing model and related device

Families Citing this family (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN109726821B (en) * 2018-11-27 2021-07-09 东软集团股份有限公司 Data equalization method and device, computer readable storage medium and electronic equipment
CN111539451B (en) * 2020-03-26 2023-08-15 平安科技(深圳)有限公司 Sample data optimization method, device, equipment and storage medium
CN111597225B (en) * 2020-04-21 2023-10-27 杭州安脉盛智能技术有限公司 Self-adaptive data reduction method based on segmentation transient identification
CN112784884A (en) * 2021-01-07 2021-05-11 重庆兆琨智医科技有限公司 Medical image classification method, system, medium and electronic terminal
CN112948463B (en) * 2021-03-01 2022-10-14 创新奇智(重庆)科技有限公司 Rolled steel data sampling method and device, electronic equipment and storage medium

Patent Citations (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20170132516A1 (en) * 2015-11-05 2017-05-11 Adobe Systems Incorporated Adaptive sampling scheme for imbalanced large scale data
CN105487526A (en) * 2016-01-04 2016-04-13 华南理工大学 FastRVM (fast relevance vector machine) wastewater treatment fault diagnosis method
EP3336739A1 (en) * 2016-12-18 2018-06-20 Deutsche Telekom AG A method for classifying attack sources in cyber-attack sensor systems
CN106973057A (en) * 2017-03-31 2017-07-21 浙江大学 A kind of sorting technique suitable for intrusion detection

Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
DU HONGLE ET AL.: "A classification algorithm based on mixed sampling for imbalanced dataset", JOURNAL OF YANSHAN UNIVERSITY, vol. 39, no. 2, 31 March 2015 (2015-03-31), XP055636317 *

Cited By (8)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN111292329A (en) * 2020-01-15 2020-06-16 北京字节跳动网络技术有限公司 Training method and device for video segmentation network and electronic equipment
CN111259964A (en) * 2020-01-17 2020-06-09 上海海事大学 Over-sampling method for unbalanced data set
CN111259964B (en) * 2020-01-17 2023-04-07 上海海事大学 Over-sampling method for unbalanced data set
CN112085080A (en) * 2020-08-31 2020-12-15 北京百度网讯科技有限公司 Sample equalization method, device, equipment and storage medium
CN112085080B (en) * 2020-08-31 2024-03-08 北京百度网讯科技有限公司 Sample equalization method, device, equipment and storage medium
CN112801178A (en) * 2021-01-26 2021-05-14 上海明略人工智能(集团)有限公司 Model training method, device, equipment and computer readable medium
CN112801178B (en) * 2021-01-26 2024-04-09 上海明略人工智能(集团)有限公司 Model training method, device, equipment and computer readable medium
CN113673575A (en) * 2021-07-26 2021-11-19 浙江大华技术股份有限公司 Data synthesis method, training method of image processing model and related device

Also Published As

Publication number Publication date
CN108491474A (en) 2018-09-04

Similar Documents

Publication Publication Date Title
WO2019169704A1 (en) Data classification method, apparatus, device and computer readable storage medium
US10609433B2 (en) Recommendation information pushing method, server, and storage medium
Marcus et al. Counting with the crowd
TWI658420B (en) Method, device, server and computer readable storage medium for integrate collaborative filtering with time factor
CN107305637B (en) Data clustering method and device based on K-Means algorithm
CN110457577B (en) Data processing method, device, equipment and computer storage medium
CN106709318B (en) A kind of recognition methods of user equipment uniqueness, device and calculate equipment
CN110750658B (en) Recommendation method of media resource, server and computer readable storage medium
WO2018149337A1 (en) Information distribution method, device, and server
JP6249027B2 (en) Data model generation method and system for relational data
CN110087228B (en) Method and device for determining service package
CN108647997A (en) A kind of method and device of detection abnormal data
CN111178435B (en) Classification model training method and system, electronic equipment and storage medium
WO2023024408A1 (en) Method for determining feature vector of user, and related device and medium
Chen et al. A bootstrap method for goodness of fit and model selection with a single observed network
CN109543940B (en) Activity evaluation method, activity evaluation device, electronic equipment and storage medium
CN105677645B (en) A kind of tables of data comparison method and device
CN115545103A (en) Abnormal data identification method, label identification method and abnormal data identification device
CN113627950B (en) Method and system for extracting user transaction characteristics based on dynamic diagram
CN109583492A (en) A kind of method and terminal identifying antagonism image
CN107403199B (en) Data processing method and device
CN111291792B (en) Flow data type integrated classification method and device based on double evolution
CN112651764B (en) Target user identification method, device, equipment and storage medium
Ärje et al. Breaking the curse of dimensionality in quadratic discriminant analysis models with a novel variant of a Bayes classifier enhances automated taxa identification of freshwater macroinvertebrates
WO2019227415A1 (en) Scorecard model adjustment method, device, server and storage medium

Legal Events

Date Code Title Description
NENP Non-entry into the national phase

Ref country code: DE

32PN Ep: public notification in the ep bulletin as address of the adressee cannot be established

Free format text: NOTING OF LOSS OF RIGHTS PURSUANT TO RULE 112(1) EPC

122 Ep: pct application non-entry in european phase

Ref document number: 18909061

Country of ref document: EP

Kind code of ref document: A1