WO2019169704A1 - Data classification method, apparatus, device and computer readable storage medium - Google Patents

Data classification method, apparatus, device and computer readable storage medium

Info

Publication number
WO2019169704A1
WO2019169704A1 (PCT/CN2018/084047; CN2018084047W)
Authority
WO
WIPO (PCT)
Prior art keywords
sample
samples
neighbor
new
sample set
Prior art date
Application number
PCT/CN2018/084047
Other languages
French (fr)
Chinese (zh)
Inventor
伍文岳
Original Assignee
平安科技(深圳)有限公司
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by 平安科技(深圳)有限公司
Publication of WO2019169704A1

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/90 Details of database functions independent of the retrieved data types
    • G06F16/906 Clustering; Classification

Definitions

  • the present application relates to the field of information processing technologies, and in particular, to a data classification method, apparatus, device, and computer readable storage medium.
  • The embodiments of the present application provide a data classification method, apparatus, device, and computer readable storage medium, which improve the accuracy of data prediction by bringing two unbalanced sample classes into numerical balance and by combining multiple rounds of modeling and prediction, thereby improving the prediction accuracy of the model.
  • an embodiment of the present application provides a data classification method, where the method includes:
  • Obtaining a sample set including a majority-class sample set and a minority-class sample set; determining a preset number of copies of a first-type sample set and a preset number of samples per copy according to the ratio of the total number of samples in the majority-class sample set to the total number of samples in the minority-class sample set; randomly drawing the preset number of samples from the majority-class sample set to form one first-type sample set, and repeating the drawing to obtain the preset number of copies of the first-type sample set; determining the estimated total number of new samples to be generated according to the total number of samples in the minority-class sample set and the preset number of samples; generating new samples from the minority-class sample set according to the estimated total number, and mixing the new samples with the minority-class sample set to form a second-type sample set; performing machine learning on each first-type sample set together with the second-type sample set to obtain a corresponding classification model; predicting and classifying the data to be classified with the classification models to obtain corresponding prediction results; and determining the prediction result given by the larger number of models as the classification result.
  • The embodiment of the present application further provides a data classification apparatus, the data classification apparatus including units for performing the foregoing data classification method.
  • An embodiment of the present application further provides a data classification device, the device including a memory and a processor connected to the memory; the memory is configured to store a computer program implementing the data classification method, and the processor is configured to run the computer program stored in the memory to perform the method described in the first aspect above.
  • An embodiment of the present application provides a computer readable storage medium storing one or more computer programs, the one or more computer programs being executable by one or more processors to implement the method described in the first aspect above.
  • The embodiments of the present application provide a data classification method, apparatus, device, and computer readable storage medium. When two sample classes (a minority class and a majority class) are unbalanced, several same-class sample sets are drawn from the numerous majority-class samples by downsampling, while new samples are generated from the scarce minority-class samples by upsampling; the new samples are mixed with the original minority-class samples to form a larger set, so that the originally small class is brought into balance with the originally large class. The minority-class and majority-class samples are then used in several rounds of modeling to predict the data, and the prediction result held by the larger number of models is finally taken as the classification result. Upsampling, downsampling, and repeated modeling and prediction together improve the accuracy of data prediction.
  • FIG. 1 is a schematic flowchart of a data classification method according to an embodiment of the present application
  • FIG. 2 is a schematic diagram of a sub-flow of a data classification method according to an embodiment of the present application
  • FIG. 3 is a schematic diagram of another sub-flow of a data classification method according to an embodiment of the present application.
  • FIG. 4 is a schematic block diagram of a data classification apparatus according to an embodiment of the present application.
  • FIG. 5 is a schematic block diagram of a subunit of a data classification apparatus according to an embodiment of the present application.
  • FIG. 6 is a schematic block diagram of a subunit of a data classification apparatus according to an embodiment of the present application.
  • FIG. 7 is a schematic block diagram showing the structure of a data classification device according to an embodiment of the present application.
  • FIG. 1 is a schematic flowchart of a data classification method according to an embodiment of the present application.
  • The method can be run on terminals such as smart phones (for example Android phones, iOS phones, and the like), tablets, laptops, and smart devices.
  • the steps of the method include S101 to S108.
  • Click data refers to the behavior data of users who clicked on a certain type of advertisement, while non-click data refers to the behavior data of users who did not click on that type of advertisement.
  • The ratio of click data to non-click data may reach 1:1000, making the two classes of data very unbalanced.
  • Majority-class samples are the type of data available in large quantity, such as the non-click data above, and the majority-class sample set is the set formed by these samples; minority-class samples are the type of data available in small quantity, such as the click data above, and the minority-class sample set is the set formed by these samples.
  • S102: Determine, according to the ratio of the total number of samples in the majority-class sample set to the total number of samples in the minority-class sample set, a preset number of copies of the first-type sample set and a preset number of samples per copy, the preset number of copies being odd.
  • A first-type sample set is a set of samples formed from majority-class samples.
  • The preset number of copies and the preset number of samples per copy of the first-type sample set are determined by the gap between the total number of majority-class samples and the total number of minority-class samples. When the ratio of the majority-class total to the minority-class total is below a threshold (the threshold may be any value in the range 100-1000), the preset number of samples per copy is set to 1/2 or 1/3 of the majority-class total and the preset number of copies is set to 3; because the number of drawn samples must be an integer, a non-integer 1/2 or 1/3 of the total is rounded. When the ratio is greater than or equal to the threshold, the preset number of samples per copy is set to 1/4 of the majority-class total and the preset number of copies is set to 5; likewise, a non-integer 1/4 of the total is rounded.
  • After the preset number of samples has been drawn at random from the majority-class sample set to form one first-type sample set, the drawn samples are put back into the original majority-class sample set, and the random drawing is repeated on the original set to form another first-type sample set, until the preset number of copies has been formed. Sampling with replacement keeps the sample structure of the original majority-class set unchanged, so each random draw follows the same distribution and no round of model training is adversely affected by differences between the drawn sets.
  • S104 Determine an estimated total number of new samples to be generated according to the total number of samples of the minority sample set and the preset number of samples.
  • Because the minority-class samples are few, some new samples can be generated by upsampling so that the minority class reaches balance with a first-type sample set.
  • The number of new samples expected to be generated equals the preset number of samples minus the total number of samples in the minority-class sample set.
  • S105 Generate a new sample by using the minority class sample set according to the estimated total number, and mix the new sample with the minority class sample set to form a second type of sample set.
  • The new samples are generated from real minority-class samples, and the generated new samples are mixed with the minority-class samples to form the second-type sample set, so that the second-type and first-type sample sets are balanced in size.
  • New samples are generated following the SMOTE idea.
  • The step of generating new samples from the minority-class sample set according to the estimated total number in S105 includes the following sub-steps S1051-S1058.
  • S1051: Take each sample in the minority-class sample set in turn as a reference sample. S1052: Acquire the neighbor samples of each reference sample. S1053: Count the first number of neighbor samples of each reference sample. S1054: Calculate, from the first number and the total number of samples in the minority-class sample set, the second number of non-neighbor samples of the corresponding reference sample. S1055: Calculate the ratio of the second number to the total number of minority-class samples. S1056: Normalize the ratio of each reference sample to obtain a corresponding normalized ratio. S1057: Calculate a corresponding third number from each normalized ratio and the estimated total number. S1058: Select neighbor samples of the corresponding reference sample according to the third number and the first number, and generate new samples from the reference sample and the selected neighbor samples.
  • The third number is the number of new samples the corresponding reference sample is expected to generate. It is only an estimate, not a fixed value; the actual number of new samples generated may equal the third number, or be slightly larger or smaller.
  • a neighbor sample of a sample refers to a sample that is close to the sample in the feature space, and includes a nearest neighbor sample, that is, the sample closest to the sample.
  • A sample is called a neighbor sample of a given sample when the gap between its distance to that sample and the nearest-neighbor distance is within a certain range (for example 0-50%); otherwise it is called a non-neighbor sample.
  • Corresponding new samples are generated for all minority-class samples: each minority-class sample serves as a reference sample and its neighbor samples are obtained to generate new samples. The number of new samples generated from each reference sample depends on how the minority-class samples are distributed in the set: where the minority-class samples are densely distributed, the corresponding reference samples generate fewer new samples, and where they are sparsely distributed, the corresponding reference samples generate more, so that the sample distribution of the final second-type sample set is more uniform. The uniformity of the sample distribution affects model training: the more uniform the distribution, the better the training result.
  • S1058 includes the following sub-steps S1-S4:
  • If the quotient of the third number and the first number is less than 1, the actual number of new samples the reference sample needs to generate is smaller than its number of neighbor samples, so only the third number of its neighbor samples need be paired with the reference sample to generate new samples. Choosing the neighbor samples that are farther from the reference sample to form the sample pairs lets the new samples be inserted into the regions where the original samples are sparsely distributed, which makes the distribution more uniform.
  • If the quotient of the third number and the first number is greater than or equal to 1, the actual number of new samples to be generated is greater than or equal to the number of neighbor samples. The quotient is rounded, each neighbor sample of the reference sample is paired with the reference sample, and each sample pair generates that integer number of new samples. After the new samples generated from all reference samples are mixed with the original minority-class samples, the resulting number of samples is balanced with the number of samples in a first-type sample set.
  • Each neighbor sample can be paired with the reference sample so that each pair produces the same number of new samples (the quotient rounded to an integer); the new samples produced in this way are richer, making the whole sample set more complete.
  • Each sample of known type is converted into a feature vector An(a1, a2, ..., ai) in an i-dimensional space, where each vector value ai encodes one attribute of the sample An. A model is then obtained by machine learning on the feature vectors and corresponding types of all samples, and the model is finally used to predict which type a piece of data to be classified belongs to.
  • a neighbor sample of a reference sample is obtained based on the Euclidean distance.
  • the method of generating a new sample using a sample pair includes steps (1)-(3):
  • In practice, i is usually greater than or equal to 2: if a sample carries i kinds of attribute information, the feature vector has i dimensions.
  • An refers to the nth sample, where n ≤ m, and a1, a2, ..., ai are the feature values of the reference sample An in the i-dimensional space.
  • The feature vector of the reference sample is known, and once a neighbor sample is determined its feature vector is also known (a neighbor sample is itself a sample of the minority-class sample set); An and Bk, and ai and bi, serve only to distinguish the reference sample from the neighbor sample.
  • Cnk denotes the new sample generated from the sample pair formed by the reference sample An and the neighbor sample Bk.
  • From each vector value bi of the neighbor sample, the corresponding vector value ai of the reference sample, and the proportion value t, the vector value ci of the new sample can be computed as ci = ai + t*(bi - ai). Geometrically, the point of the reference sample and the point of the neighbor sample are joined by a straight line and a point is taken at random on that line, between the reference sample and the neighbor sample; this interpolation yields a new point, i.e. a new sample.
  • a method of generating an integer number of new samples using a sample pair includes steps (a)-(c):
  • For example, if the reference sample An has Y neighbor samples, the Y neighbor samples are each paired with the reference sample to form Y sample pairs, and each pair generates an integer number j of new samples, so the reference sample An generates Y*j new samples in total.
  • The point of the reference sample is joined to the point of the neighbor sample by a straight line and an integer number of points are taken at random on the line, between the reference sample and the neighbor sample; this interpolation yields an integer number of new points, i.e. an integer number of new samples.
  • S106: Perform machine learning on each first-type sample set together with the second-type sample set to obtain a corresponding classification model.
  • S107: Predict and classify the data to be classified with the classification models to obtain corresponding prediction results.
  • S108: Determine the prediction result given by the larger number of models as the classification result.
  • For the sake of prediction accuracy, as many rounds of modeling and prediction as possible are performed; each first-type sample set is therefore combined with the second-type sample set and machine-learned to obtain a corresponding classification model, and each resulting model makes its own prediction.
  • Each prediction result is either the first class (majority class) or the second class (minority class), and the prediction result given by the larger number of models is the final classification result.
  • With the above method, it is possible to predict, from a user's behavior data, whether the user will click on a certain type of advertisement, so different advertisements can be served to different user groups in a planned way, or advertising plans can be designed for potential customers according to their needs, increasing the likelihood of winning potential business.
  • The embodiments of the present application provide a data classification method. When two sample classes (a minority class and a majority class) are unbalanced, several same-class sample sets are drawn from the numerous majority-class samples by downsampling, and new samples are generated from the scarce minority-class samples by upsampling; the new samples are mixed with the original minority-class samples to form a larger set, so that the originally small class is brought into balance with the originally large class. The minority-class and majority-class samples are then used in several rounds of modeling to predict the data, and the prediction result held by the larger number of models is finally taken as the classification result. Upsampling, downsampling, and repeated modeling and prediction together improve the accuracy of data prediction.
  • FIG. 4 is a schematic block diagram of a data classification apparatus 100 provided by an embodiment of the present application.
  • The data classification device 100 includes an acquisition unit 101, a first determination unit 102, a first formation unit 103, a second determination unit 104, a generation unit 105, a second formation unit 106, a learning unit 107, a prediction unit 108, a statistics unit 109, and a third determining unit 110.
  • the obtaining unit 101 is configured to acquire a sample set, where the sample set includes a majority class sample set and a minority class sample set.
  • the first determining unit 102 is configured to determine a preset number of copies and a preset number of samples of the first type of sample set according to a ratio of a total number of samples of the majority class sample set to a total sample number of the minority class sample set.
  • The first forming unit 103 is configured to randomly draw samples of the preset sample number from the majority-class sample set to form one first-type sample set, and to repeat the drawing to obtain the preset number of copies of the first-type sample set.
  • the second determining unit 104 is configured to determine an estimated total number of new samples that need to be generated according to the total number of samples of the minority class sample set and the preset number of samples.
  • the generating unit 105 is configured to generate a new sample by using the minority class sample set according to the estimated total number.
  • the second forming unit 106 is configured to mix the new sample with the minority class sample set to form a second type of sample set.
  • the learning unit 107 is configured to perform machine learning on each of the first type of sample set and the second type of sample set to obtain a corresponding classification model.
  • The prediction unit 108 is configured to predict and classify the data to be classified by using the classification models to obtain corresponding prediction results.
  • the statistical unit 109 is configured to determine a larger number of prediction results as the classification result.
  • the third determining unit 110 is configured to determine a larger number of prediction results as the classification result.
  • The generating unit 105 includes the following subunits: a determining subunit 1051, configured to take each sample in the minority-class sample set in turn as a reference sample; an acquiring subunit 1052, configured to acquire the neighbor samples of each reference sample; a statistical subunit 1053, configured to count the first number of neighbor samples of each reference sample; a first calculating subunit 1054, configured to calculate, from the first number and the total number of samples in the minority-class sample set, the second number of non-neighbor samples of the corresponding reference sample; a second calculating subunit 1055, configured to calculate the ratio of the second number to the total number of minority-class samples; a normalization subunit 1056, configured to normalize the ratio of each reference sample to obtain a corresponding normalized ratio; a third calculating subunit 1057, configured to calculate a corresponding third number from each normalized ratio and the estimated total number; and a generating subunit 1058, configured to select neighbor samples of the corresponding reference sample according to the third number and the first number and to generate new samples from the reference sample and the selected neighbor samples.
  • The generating subunit 1058 includes the following subunits: a fourth calculating subunit 10581, configured to calculate the quotient of the third number and the first number.
  • A determining subunit 10582 is configured to determine whether the quotient is less than 1.
  • A selecting subunit 10583 is configured to, if the quotient is less than 1, select the third number of neighbor samples from the neighbor samples of the reference sample, the selected neighbor samples all being farther from the reference sample than the remaining neighbor samples.
  • A first generating subunit 10584 is configured to pair each selected neighbor sample with the reference sample and to generate one new sample from each sample pair.
  • A second generating subunit 10585 is configured to, if the quotient is greater than or equal to 1, round the quotient to an integer, pair each neighbor sample of the reference sample with the reference sample, and generate that integer number of new samples from each sample pair.
  • The above data classification apparatus 100 can be implemented in the form of a computer program that can run on a computer device as shown in FIG. 7.
  • FIG. 7 is a schematic block diagram of a data classification device according to an embodiment of the present application.
  • the device may be a terminal or a server, wherein the terminal may be a communication-enabled electronic device such as a smart phone, a tablet computer, a notebook computer, a desktop computer, a personal digital assistant, and a wearable device.
  • the server can be a standalone server or a server cluster consisting of multiple servers.
  • the device is a computer device 200 comprising a processor 202, a memory and a network interface 205 connected by a system bus 201, wherein the memory comprises a non-volatile storage medium 203 and an internal memory 204.
  • the non-volatile storage medium 203 of the computer device 200 can store an operating system 2031 and a computer program 2032 that, when executed, can cause the processor 202 to perform a data classification method.
  • the processor 202 of the computer device 200 is used to provide computing and control capabilities to support the operation of the entire computer device 200.
  • the internal memory 204 provides an environment for the operation of the computer program 2032 in the non-volatile storage medium 203.
  • the network interface 205 of the computer device 200 is used to perform network communications, such as sending assigned tasks and the like.
  • The processor 202 can execute all of the embodiments of the data classification method described above when the computer program 2032 in the non-volatile storage medium 203 is run. Those skilled in the art will understand that the embodiment of the computer device shown in FIG. 7 does not limit the specific configuration of the data classification device; in other embodiments, the data classification device may include more or fewer parts than illustrated, combine certain parts, or arrange the parts differently. For example, in some embodiments the data classification device may include only a memory and a processor; in such an embodiment the structure and function of the memory and the processor are the same as in the embodiment shown in FIG. 7 and are not described again here.
  • The application further provides a computer readable storage medium storing one or more computer programs, the one or more computer programs being executable by one or more processors; when the one or more computer programs are executed by the one or more processors, all of the embodiments of the above data classification method can be implemented.

Landscapes

  • Engineering & Computer Science (AREA)
  • Databases & Information Systems (AREA)
  • Theoretical Computer Science (AREA)
  • Data Mining & Analysis (AREA)
  • Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)

Abstract

Provided in the embodiments of the present application are a data classification method, apparatus, device, and computer readable storage medium. When two sample classes are unbalanced, several same-class sample sets are generated from the numerous class by downsampling, and new samples are generated from the scarce class by upsampling; the new samples are mixed with the minority-class samples to form a larger set, so that the originally small class is brought into balance with the originally large class. The minority-class and majority-class samples are then used in several rounds of modeling to predict the data, and the prediction result held by the larger number of models is finally taken as the classification result. Upsampling, downsampling, and repeated modeling and prediction improve the accuracy of data prediction.

Description

Data classification method, apparatus, device and computer readable storage medium
This application claims priority to Chinese Patent Application No. 201810190818.2, filed with the Chinese Patent Office on March 8, 2018 and entitled "Data classification method, apparatus, device and computer readable storage medium", the entire contents of which are incorporated herein by reference.
Technical field
The present application relates to the field of information processing technologies, and in particular, to a data classification method, apparatus, device, and computer readable storage medium.
Background
At present, when data are classified through data modeling, and especially in multi-class settings, the classes of samples are often unbalanced. When the numbers of training samples of the various classes differ greatly, a classification model trained directly on the unbalanced samples may perform poorly because of the imbalance, and the predictions obtained with such a model are then also unsatisfactory, or even the opposite of the true result.
The common practice at present is to increase the number of the scarcer samples by generating new samples, so that their number is balanced with the more numerous class. The new samples need to be as close to real samples as possible, but they are not real samples after all, and a model trained on them has a certain adverse effect on the prediction of data. Moreover, if the generated new samples are simply combined with the original samples for a single round of modeling and prediction, any error in that one-shot prediction result is irreparable.
Summary of the invention
The embodiments of the present application provide a data classification method, apparatus, device, and computer readable storage medium, which improve the accuracy of data prediction by bringing two unbalanced sample classes into numerical balance and by combining multiple rounds of modeling and prediction, thereby improving the prediction accuracy of the model.
In a first aspect, an embodiment of the present application provides a data classification method, the method including:
obtaining a sample set, the sample set including a majority-class sample set and a minority-class sample set; determining a preset number of copies of a first-type sample set and a preset number of samples per copy according to the ratio of the total number of samples in the majority-class sample set to the total number of samples in the minority-class sample set; randomly drawing the preset number of samples from the majority-class sample set to form one first-type sample set, and repeating the drawing to obtain the preset number of copies of the first-type sample set; determining the estimated total number of new samples to be generated according to the total number of samples in the minority-class sample set and the preset number of samples; generating new samples from the minority-class sample set according to the estimated total number, and mixing the new samples with the minority-class sample set to form a second-type sample set; performing machine learning on each first-type sample set together with the second-type sample set to obtain a corresponding classification model; predicting and classifying the data to be classified with the classification models to obtain corresponding prediction results; and determining the prediction result given by the larger number of models as the classification result.
In a second aspect, an embodiment of the present application further provides a data classification apparatus, the data classification apparatus including units for performing the above data classification method. In a third aspect, an embodiment of the present application further provides a data classification device, the device including a memory and a processor connected to the memory; the memory is configured to store a computer program implementing the data classification method, and the processor is configured to run the computer program stored in the memory to perform the method described in the first aspect. In a fourth aspect, an embodiment of the present application provides a computer readable storage medium storing one or more computer programs, the one or more computer programs being executable by one or more processors to implement the method described in the first aspect.
The embodiments of the present application provide a data classification method, apparatus, device, and computer readable storage medium. When the two sample classes (a minority class and a majority class) are unbalanced, several same-class sample sets are drawn from the numerous majority-class samples by downsampling, and new samples are generated from the scarce minority-class samples by upsampling; the new samples are mixed with the original minority-class samples to form a larger set, so that the originally small class is brought into balance with the originally large class. The minority-class and majority-class samples are then used in several rounds of modeling to predict the data, and the prediction result held by the larger number of models is finally taken as the classification result. Upsampling, downsampling, and repeated modeling and prediction together improve the accuracy of data prediction.
Brief description of the drawings
FIG. 1 is a schematic flowchart of a data classification method according to an embodiment of the present application;
FIG. 2 is a schematic diagram of a sub-flow of the data classification method according to an embodiment of the present application;
FIG. 3 is a schematic diagram of another sub-flow of the data classification method according to an embodiment of the present application;
FIG. 4 is a schematic block diagram of a data classification apparatus according to an embodiment of the present application;
FIG. 5 is a schematic block diagram of subunits of the data classification apparatus according to an embodiment of the present application;
FIG. 6 is a schematic block diagram of subunits of the data classification apparatus according to an embodiment of the present application;
FIG. 7 is a schematic block diagram of the structure of a data classification device according to an embodiment of the present application.
Detailed description of the embodiments
The technical solutions in the embodiments of the present application are described clearly and completely below with reference to the accompanying drawings in the embodiments of the present application. Obviously, the described embodiments are only some, not all, of the embodiments of the present application. All other embodiments obtained by a person of ordinary skill in the art based on the embodiments of the present application without creative effort fall within the scope of protection of the present application.
FIG. 1 is a schematic flowchart of a data classification method according to an embodiment of the present application. The method can run on terminals such as smart phones (for example Android phones, iOS phones, and the like), tablet computers, notebook computers, and smart devices. As shown in FIG. 1, the method includes steps S101 to S108.
S101: Acquire a sample set, the sample set including a majority-class sample set and a minority-class sample set.
In big data analysis or learning, the data are often unbalanced, for example advertisement click data and non-click data. Click data refers to the behavior data of users who clicked on a certain type of advertisement, and non-click data refers to the behavior data of users who did not click on that type of advertisement; the ratio of click data to non-click data may reach 1:1000, making the two classes very unbalanced. Majority-class samples are the type of data available in large quantity, such as the non-click data above, and the majority-class sample set is the set formed by these majority-class samples; minority-class samples are the type of data available in small quantity, such as the click data above, and the minority-class sample set is the set formed by these minority-class samples.
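As a concrete illustration of S101, the sketch below (not part of the patent; the function and variable names are hypothetical) splits a binary-labeled dataset such as ad click logs into a majority-class set and a minority-class set.
```python
from collections import Counter

def split_majority_minority(samples, labels):
    """Partition a binary-labeled dataset into a majority-class set and a minority-class set."""
    counts = Counter(labels)
    # The label with more samples is the majority class (e.g. "no_click"),
    # the other is the minority class (e.g. "click").
    majority_label, minority_label = [label for label, _ in counts.most_common(2)]
    majority_set = [s for s, y in zip(samples, labels) if y == majority_label]
    minority_set = [s for s, y in zip(samples, labels) if y == minority_label]
    return majority_set, minority_set

# Toy usage: one click for every nine non-clicks (the text mentions ratios up to 1:1000).
samples = [[float(i), float(i % 7)] for i in range(100)]
labels = ["click" if i % 10 == 0 else "no_click" for i in range(100)]
majority_set, minority_set = split_majority_minority(samples, labels)
print(len(majority_set), len(minority_set))  # 90 10
```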
S102: Determine a preset number of copies of a first-type sample set and a preset number of samples per copy according to the ratio of the total number of samples in the majority-class sample set to the total number of samples in the minority-class sample set, the preset number of copies being odd.
When the total number of samples in the majority-class sample set differs greatly from that in the minority-class sample set, part of the majority-class samples must be drawn by downsampling to form first-type sample sets; because there are many such samples, several copies of the first-type sample set are formed so that more of the majority-class samples are used. A first-type sample set is a set of samples formed from majority-class samples. The preset number of copies and the preset number of samples per copy are determined by the gap between the total number of majority-class samples and the total number of minority-class samples. When the ratio of the majority-class total to the minority-class total is below a threshold (the threshold may be any value in the range 100-1000), the preset number of samples per copy is set to 1/2 or 1/3 of the majority-class total and the preset number of copies is set to 3; because the preset number of samples must be an integer, a non-integer 1/2 or 1/3 of the total is rounded. When the ratio is greater than or equal to the threshold, the preset number of samples per copy is set to 1/4 of the majority-class total and the preset number of copies is set to 5; likewise, a non-integer 1/4 of the total is rounded.
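A minimal sketch of this S102 rule, assuming a threshold of 500 (any value in the stated 100-1000 range would do) and the 1/2 fraction for the below-threshold case; the threshold value, the fraction choice, and the function name are assumptions, not part of the patent.
```python
def preset_copies_and_size(n_majority, n_minority, threshold=500, small_fraction=0.5):
    """Return (number_of_copies, samples_per_copy) for the first-type sample sets.

    threshold: assumed value from the 100-1000 range mentioned in the text.
    small_fraction: 1/2 (or 1/3) of the majority total for the below-threshold case.
    """
    ratio = n_majority / n_minority
    if ratio < threshold:
        copies = 3                                              # an odd number of copies
        samples_per_copy = round(n_majority * small_fraction)   # rounded to an integer
    else:
        copies = 5
        samples_per_copy = round(n_majority * 0.25)             # 1/4 of the majority total
    return copies, samples_per_copy

print(preset_copies_and_size(90_000, 300))    # ratio 300 < 500   -> (3, 45000)
print(preset_copies_and_size(900_000, 900))   # ratio 1000 >= 500 -> (5, 225000)
```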
S103: Randomly draw the preset number of samples from the majority-class sample set to form one first-type sample set, and repeat the drawing to obtain the preset number of copies of the first-type sample set.
Once the preset number of copies and the preset number of samples per copy are determined, samples are drawn at random from the majority-class sample set to obtain the required first-type sample sets. In the embodiments of the present application, after the preset number of samples has been drawn at random from the majority-class sample set to form one first-type sample set, the drawn samples are put back into the original majority-class sample set, and the random drawing of the preset number of samples is repeated on the original majority-class set to form another first-type sample set, until the preset number of copies has been formed. Sampling with replacement keeps the sample structure of the original majority-class set unchanged, so each random draw follows the same distribution and no round of model training is adversely affected by differences between the drawn sets.
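The following sketch of S103 is an assumption about one way to implement it: each copy is a random draw of the preset size, and the drawn samples are conceptually put back before the next copy is drawn, so every copy is taken from the same, unchanged majority-class set.
```python
import random

def draw_first_type_sets(majority_set, copies, samples_per_copy, seed=0):
    """Draw `copies` first-type sample sets, each containing `samples_per_copy` samples.

    Every copy is drawn from the full majority-class set; putting the drawn samples
    back before the next draw means each draw sees the same sample structure.
    """
    rng = random.Random(seed)
    return [rng.sample(majority_set, samples_per_copy) for _ in range(copies)]

majority_set = list(range(1000))
first_type_sets = draw_first_type_sets(majority_set, copies=3, samples_per_copy=500)
print([len(s) for s in first_type_sets])  # [500, 500, 500]
```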
S104: Determine the estimated total number of new samples to be generated according to the total number of samples in the minority-class sample set and the preset number of samples per copy. Because the minority-class samples are few, some new samples can be generated by upsampling so that the minority class reaches balance with a first-type sample set. The number of new samples expected to be generated, i.e. the estimated total number, equals the preset number of samples minus the total number of samples in the minority-class sample set.
S105: Generate new samples from the minority-class sample set according to the estimated total number, and mix the new samples with the minority-class sample set to form a second-type sample set. The new samples are generated from real minority-class samples, and the generated new samples are mixed with the minority-class samples to form the second-type sample set, so that the second-type and first-type sample sets are balanced in size.
In the embodiments of the present application, new samples are generated following the SMOTE idea. Specifically, as shown in FIG. 2, the step of generating new samples from the minority-class sample set according to the estimated total number in S105 includes the following sub-steps S1051-S1058.
S1051: Take each sample in the minority-class sample set in turn as a reference sample. S1052: Acquire the neighbor samples of each reference sample. S1053: Count the first number of neighbor samples of each reference sample. S1054: Calculate, from the first number and the total number of samples in the minority-class sample set, the second number of non-neighbor samples of the corresponding reference sample. S1055: Calculate the ratio of the second number to the total number of minority-class samples. S1056: Normalize the ratio of each reference sample to obtain a corresponding normalized ratio. S1057: Calculate a corresponding third number from each normalized ratio and the estimated total number. S1058: Select neighbor samples of the corresponding reference sample according to the third number and the first number, and generate new samples from the reference sample and the selected neighbor samples.
The third number is the number of new samples that the corresponding reference sample is expected to generate. It is only an estimate for that reference sample, not a fixed value; the actual number of new samples generated may equal the third number, or be slightly larger or smaller.
A neighbor sample of a given sample is a sample close to it in the feature space, and the neighbors include a nearest-neighbor sample, i.e. the sample closest to the given sample. In the embodiments of the present application, a sample is called a neighbor sample when the gap between its distance to the given sample and the nearest-neighbor distance is within a certain range (for example 0-50%); otherwise it is called a non-neighbor sample.
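One possible reading of this neighbor definition, sketched below with an assumed margin of 50%: a candidate counts as a neighbor of a given sample when its distance to that sample exceeds the nearest-neighbor distance by no more than the margin. The helper names are hypothetical.
```python
import math

def euclidean(a, b):
    """Euclidean distance between two feature vectors."""
    return math.sqrt(sum((ai - bi) ** 2 for ai, bi in zip(a, b)))

def neighbor_samples(reference, candidates, margin=0.5):
    """Return the neighbor samples of `reference` among `candidates`.

    A candidate is a neighbor when its distance to the reference is within
    (1 + margin) times the nearest-neighbor distance, with margin in [0, 0.5].
    """
    others = [c for c in candidates if c is not reference]   # exclude the sample itself
    distances = [(euclidean(reference, c), c) for c in others]
    nearest = min(d for d, _ in distances)
    return [c for d, c in distances if d <= (1 + margin) * nearest]

minority_set = [[0.0, 0.0], [0.1, 0.0], [0.12, 0.05], [1.0, 1.0], [0.9, 1.1]]
print(neighbor_samples(minority_set[0], minority_set))  # [[0.1, 0.0], [0.12, 0.05]]
```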
In the embodiments of the present application, corresponding new samples are generated for all minority-class samples: each minority-class sample serves as a reference sample and its neighbor samples are obtained to generate new samples. The number of new samples generated from each reference sample depends on how the minority-class samples are distributed in the set: where the minority-class samples are densely distributed, the corresponding reference samples generate fewer new samples, and where they are sparsely distributed, the corresponding reference samples generate more, so that the sample distribution of the final second-type sample set is more uniform. The uniformity of the sample distribution has an influence on model training: the more uniform the distribution, the better the training result.
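Continuing the sketch above (it reuses euclidean, neighbor_samples, and minority_set from that block), the allocation of S1051-S1057 can be written as below. Whether the reference sample itself is excluded when counting non-neighbors, and the rounding of each per-sample share, are assumptions the text leaves open.
```python
def allocate_new_samples(minority_set, estimated_total, margin=0.5):
    """For each minority-class sample, return its neighbors, first number and third number.

    The third number is that sample's share of `estimated_total`, proportional to its
    normalized non-neighbor ratio, so samples in sparse regions get larger shares.
    """
    total = len(minority_set)
    plan, ratios = [], []
    for reference in minority_set:                                    # S1051: each sample in turn
        neighbors = neighbor_samples(reference, minority_set, margin) # S1052
        first_number = len(neighbors)                                 # S1053
        second_number = total - first_number                          # S1054 (counting rule assumed)
        ratios.append(second_number / total)                          # S1055
        plan.append({"reference": reference, "neighbors": neighbors,
                     "first_number": first_number})
    ratio_sum = sum(ratios)
    for entry, ratio in zip(plan, ratios):
        normalized = ratio / ratio_sum                                # S1056
        entry["third_number"] = round(normalized * estimated_total)   # S1057 (rounded)
    return plan

for entry in allocate_new_samples(minority_set, estimated_total=20):
    print(entry["first_number"], entry["third_number"])
```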
Specifically, as shown in FIG. 3, S1058 includes the following sub-steps S1-S4:
S1: Calculate the quotient of the third number and the first number.
S2: Determine whether the quotient is less than 1.
S3: If so, select the third number of neighbor samples from the neighbor samples of the reference sample, the selected neighbor samples all being farther from the reference sample than the remaining neighbor samples; pair each selected neighbor sample with the reference sample, and generate one new sample from each sample pair.
When the quotient of the third number and the first number is less than 1, the actual number of new samples the reference sample needs to generate is smaller than its number of neighbor samples, so only a subset of the neighbor samples (the third number of them) need be paired with the reference sample to generate new samples. Choosing the neighbor samples that are farther from the reference sample to form the sample pairs lets the new samples be inserted into the regions where the original samples are sparsely distributed, which makes the distribution more uniform.
For example, the nth reference sample An in the minority-class sample set has Y neighbor samples, and the total number of new samples that An is expected to generate (the third number) is computed as N. If N is smaller than Y (for example N=3 and Y=6), there is no need to pair all the neighbor samples with the reference sample An; it suffices to select N (3) neighbor samples to generate new samples with An, choosing neighbor samples as far from An as possible, so that new samples are inserted where the sample distribution is sparse and the distribution becomes more uniform.
S4: If not, round the quotient to an integer, pair each neighbor sample of the reference sample with the reference sample, and generate that integer number of new samples from each sample pair.
If the quotient of the third number and the first number is greater than or equal to 1, the actual number of new samples to be generated from the reference sample is greater than or equal to its number of neighbor samples. The quotient is then rounded, each neighbor sample of the reference sample is paired with the reference sample, and each sample pair generates that integer number of new samples. After the new samples generated from all reference samples are mixed with the original minority-class samples, the resulting number of samples is balanced with the number of samples in a first-type sample set.
For example, if N is greater than Y (N=15, Y=6), the quotient of the two is greater than 1 with a remainder; each neighbor sample can then be paired with the reference sample so that each pair produces the same number of new samples (the quotient rounded to an integer). The new samples produced in this way are richer, making the whole sample set more complete.
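A sketch of how the S1058 branching might be organized: plan_sample_pairs only decides which (reference, neighbor) pairs to form and how many new samples each pair should yield; the interpolation itself is given by the formula in the following passage. Round-half-up is used to match the rounding rule in the text; the function and parameter names are assumptions.
```python
import math

def plan_sample_pairs(reference, neighbors, third_number):
    """Return (reference, neighbor, count) tasks implementing the S1058 branching."""
    first_number = len(neighbors)
    quotient = third_number / first_number
    if quotient < 1:
        # Fewer new samples than neighbors: keep only the `third_number` farthest
        # neighbors so that the new samples land in the sparser regions.
        farthest = sorted(neighbors, key=lambda b: math.dist(reference, b), reverse=True)
        return [(reference, b, 1) for b in farthest[:third_number]]
    # At least as many new samples as neighbors: pair every neighbor with the
    # reference; each pair yields the quotient rounded half up (the "rounding rule").
    per_pair = int(quotient + 0.5)
    return [(reference, b, per_pair) for b in neighbors]

ref = [0.0, 0.0]
neigh = [[0.1, 0.0], [0.0, 0.1], [0.2, 0.1], [0.1, 0.2], [0.3, 0.0], [0.0, 0.3]]
print(plan_sample_pairs(ref, neigh, third_number=15))  # 6 pairs, 3 new samples each (15/6 -> 3)
print(plan_sample_pairs(ref, neigh, third_number=2))   # the 2 farthest neighbors, 1 each
```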
In model training, each sample of known type usually needs to be converted into a feature vector An(a1, a2, ..., ai) in an i-dimensional space, where each vector value ai encodes one attribute of the sample An. A model is then obtained by machine learning on the feature vectors and corresponding types of all samples, and the model is finally used to predict which type a piece of data to be classified belongs to.
In the embodiments of the present application, the neighbor samples of a reference sample are obtained on the basis of the Euclidean distance.
The method of generating one new sample from a sample pair includes steps (1)-(3):
(1) Acquire the feature vector An(a1, a2, ..., ai) of the reference sample of the sample pair in the i-dimensional space and the feature vector Bk(b1, b2, ..., bi) of the neighbor sample.
In practice, i is usually greater than or equal to 2: if a sample carries i kinds of attribute information, the feature vector has i dimensions.
Assuming the minority-class sample set contains m samples, An refers to the nth sample, where n ≤ m, and a1, a2, ..., ai are the feature values of the reference sample An in the i-dimensional space. The reference sample An has Y neighbor samples, of which the K more distant neighbor samples are selected and paired with the reference sample to form K sample pairs. Bk refers to the kth of these K neighbor samples, where k = 1, 2, ..., K; each time one neighbor sample is selected from the K neighbor samples and paired with the reference sample to generate one new sample, so the reference sample An finally generates K new samples.
The feature vector of the reference sample is known, and once a neighbor sample is determined its feature vector is also known (a neighbor sample is itself a sample of the minority-class sample set); An and Bk, and ai and bi, serve only to distinguish the reference sample from the neighbor sample.
(2) Randomly generate a proportion value t, where 0 < t < 1.
(3) Calculate the feature vector Cnk(c1, c2, ..., ci) of the new sample to be generated, where ci = ai + t*(bi - ai), and generate a sample with the feature vector Cnk(c1, c2, ..., ci) in the i-dimensional space. Cnk denotes the new sample generated from the sample pair formed by the reference sample An and the neighbor sample Bk.
From each vector value bi of the neighbor sample, the corresponding vector value ai of the reference sample, and the proportion value t, the vector value ci of the new sample can be computed. Geometrically, the point of the reference sample and the point of the neighbor sample are joined by a straight line and a point is taken at random on that line, between the reference sample and the neighbor sample; this interpolation yields a new point, i.e. a new sample.
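A sketch of steps (1)-(3) for one sample pair, directly applying ci = ai + t*(bi - ai) with a proportion t drawn strictly between 0 and 1; the random generator and its seeding are incidental assumptions.
```python
import random

def interpolate_one(an, bk, rng=None):
    """Generate one new sample Cnk from the pair (An, Bk) via ci = ai + t * (bi - ai)."""
    rng = rng or random.Random(0)
    t = rng.uniform(1e-9, 1.0)   # the proportion value t, 0 < t < 1
    return [ai + t * (bi - ai) for ai, bi in zip(an, bk)]

an = [0.0, 0.0, 1.0]   # feature vector of the reference sample An
bk = [1.0, 2.0, 3.0]   # feature vector of the neighbor sample Bk
print(interpolate_one(an, bk))   # a point on the segment between An and Bk
```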
The method of generating an integer number of new samples from one sample pair includes steps (a)-(c):
(a) Obtain the feature vector An(a1, a2, ..., ai) of the reference sample of the sample pair in the i-dimensional space and the feature vector Bk(b1, b2, ..., bi) of the neighbor sample.
For example, if the reference sample An has Y neighbor samples, the Y neighbor samples are each paired with the reference sample to form Y sample pairs, and Bk denotes the kth of these Y neighbor samples, where k = 1, 2, ..., Y. Each time, one neighbor sample is taken from the Y neighbor samples to form a sample pair with the reference sample and generate an integer number j of new samples, so one reference sample An ultimately yields Y*j new samples.
(b) Randomly generate j scale values t_x, where 0 < t_x < 1, x = 1, 2, ..., j, j equals the integer, and all scale values t_x are different from one another.
(c) Calculate the feature vectors Cnk_x(c1, c2, ..., ci) of the integer number of new samples to be generated, where ci = ai + t_x*(bi - ai), and generate, in the i-dimensional space, samples having the feature vectors Cnk_x(c1, c2, ..., ci). Cnk_x denotes the xth new sample generated from the sample pair formed by the reference sample An and the neighbor sample Bk.
The point of the reference sample is joined to the point of the neighbor sample by a straight line and an integer number of points are taken at random on that segment, between the reference sample and the neighbor sample; the integer number of new points obtained by this interpolation are the integer number of new samples.
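A rough sketch of steps (a)-(c) follows: one sample pair yields an integer number j of new samples by drawing j distinct scale values. The function name make_j_new_samples and the re-drawing loop used to keep the scale values distinct are assumptions for illustration only.

```python
import numpy as np

def make_j_new_samples(reference: np.ndarray, neighbor: np.ndarray, j: int,
                       rng: np.random.Generator) -> np.ndarray:
    """Return j interpolated samples on the segment between reference and neighbor."""
    t = rng.uniform(0.0, 1.0, size=j)        # j scale values t_x
    while len(np.unique(t)) < j:             # keep drawing until all t_x differ
        t = rng.uniform(0.0, 1.0, size=j)
    # Row x is C_nk_x = A_n + t_x * (B_k - A_n), computed by broadcasting
    return reference + t[:, None] * (neighbor - reference)

rng = np.random.default_rng(1)
An = np.array([1.0, 2.0])
Bk = np.array([3.0, 0.0])
new_samples = make_j_new_samples(An, Bk, j=3, rng=rng)   # shape (3, 2)
```

Calling make_j_new_samples once per sample pair reproduces the Y*j new samples described above for a reference sample with Y neighbor pairs.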
S106: Perform machine learning on each first-type sample set together with the second-type sample set to obtain a corresponding classification model. S107: Use the classification models to predict the classification of the data to be classified and obtain corresponding prediction results. S108: Count the numbers of the different prediction results respectively, and determine the prediction result with the larger number as the classification result.
For the sake of prediction accuracy, modeling and prediction are repeated as many times as possible. Each first-type sample set is therefore combined with the second-type sample set for machine learning to obtain a corresponding classification model, and each resulting model makes its own prediction. The prediction results fall into the first class (the majority class) and the second class (the minority class), and the prediction result that occurs more often is the final classification result.
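The sketch below, under the assumption that scikit-learn is available and class labels are the integers 0 and 1, illustrates S106 to S108: each first-type (down-sampled majority) set is combined with the second-type (minority plus synthetic) set, one model is trained per combination, and the class predicted by more models is returned. Logistic regression is an arbitrary choice for the example; the patent does not prescribe a particular learner.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

def ensemble_predict(first_type_sets, second_type_set, X_test):
    """first_type_sets: list of (X, y) majority subsets; second_type_set: (X, y) balanced minority set."""
    X_min, y_min = second_type_set
    votes = []
    for X_maj, y_maj in first_type_sets:
        X_train = np.vstack([X_maj, X_min])              # S106: one training set per first-type subset
        y_train = np.concatenate([y_maj, y_min])
        model = LogisticRegression(max_iter=1000).fit(X_train, y_train)
        votes.append(model.predict(X_test))              # S107: each model predicts on its own
    votes = np.stack(votes)                              # shape (n_models, n_test)
    # S108: the prediction made by more models is the classification result
    return np.array([np.bincount(col.astype(int)).argmax() for col in votes.T])
```

Because the preset number of first-type sample sets is an odd number (see claim 1), a two-class vote of this kind cannot end in a tie.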
With the above method, a user's behavior data can be used to predict whether the user will click on a certain type of advertisement, so different advertisements can be served to different user groups in a planned way, or advertising plans can be tailored to potential customers according to their needs, increasing the likelihood of winning potential business.
The embodiment of the present application provides a data classification method. When the two classes of samples (minority-class samples and majority-class samples) are unbalanced in number, several sample sets of the same kind are produced from the numerous samples by down-sampling, and new samples are produced from the scarce samples by up-sampling; the new samples are mixed with the original minority-class samples to form a larger set, so that the originally scarce samples are balanced in number against the originally numerous samples. The minority-class and majority-class samples are then used to predict the data through multiple rounds of modeling, and the prediction result holding the numerical majority is taken as the classification result. Up-sampling, down-sampling, and repeated modeling and prediction together improve the accuracy of data prediction.
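For completeness, a minimal sketch of the down-sampling half of the scheme is given below, assuming each first-type subset is drawn without replacement from the majority-class set; the names downsample_majority, n_subsets and subset_size are illustrative and not taken from the patent.

```python
import numpy as np

def downsample_majority(X_majority: np.ndarray, n_subsets: int, subset_size: int,
                        rng: np.random.Generator) -> list:
    """Randomly draw n_subsets subsets of subset_size samples from the majority class."""
    subsets = []
    for _ in range(n_subsets):
        idx = rng.choice(len(X_majority), size=subset_size, replace=False)
        subsets.append(X_majority[idx])
    return subsets

rng = np.random.default_rng(2)
X_majority = rng.normal(size=(1000, 5))   # 1000 majority-class samples with 5 features
subsets = downsample_majority(X_majority, n_subsets=5, subset_size=200, rng=rng)
```

Each subset returned here would then be paired with the balanced second-type sample set for the modeling and voting step sketched above.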
FIG. 4 is a schematic block diagram of a data classification apparatus 100 provided by an embodiment of the present application. The data classification apparatus 100 includes an acquisition unit 101, a first determination unit 102, a first forming unit 103, a second determination unit 104, a generation unit 105, a second forming unit 106, a learning unit 107, a prediction unit 108, a statistics unit 109, and a third determination unit 110.
The acquisition unit 101 is configured to acquire a sample set, the sample set including a majority-class sample set and a minority-class sample set. The first determination unit 102 is configured to determine a preset number of copies and a preset number of samples of the first-type sample set according to the ratio of the total number of samples in the majority-class sample set to the total number of samples in the minority-class sample set. The first forming unit 103 is configured to randomly extract the preset number of samples from the majority-class sample set to form one first-type sample set, and to repeat the extraction multiple times to obtain the preset number of first-type sample sets. The second determination unit 104 is configured to determine the estimated total number of new samples to be generated according to the total number of samples in the minority-class sample set and the preset number of samples. The generation unit 105 is configured to generate new samples from the minority-class sample set according to the estimated total number. The second forming unit 106 is configured to mix the new samples with the minority-class sample set to form a second-type sample set. The learning unit 107 is configured to perform machine learning on each first-type sample set together with the second-type sample set to obtain a corresponding classification model. The prediction unit 108 is configured to use the classification models to predict the classification of the data to be classified and obtain corresponding prediction results. The statistics unit 109 is configured to count the numbers of the different prediction results respectively. The third determination unit 110 is configured to determine the prediction result with the larger number as the classification result.
In this embodiment of the present application, as shown in FIG. 5, the generation unit 105 includes the following subunits: a determination subunit 1051, configured to determine, in turn, one sample of the minority-class sample set as a reference sample; a first acquisition subunit 1052, configured to acquire the neighbor samples of each reference sample; a statistics subunit 1053, configured to count, for each reference sample, a first number of its neighbor samples; a first calculation subunit 1054, configured to calculate a second number of non-neighbor samples of the corresponding reference sample according to the first number and the total number of samples in the minority-class sample set; a second calculation subunit 1055, configured to calculate the proportion of the second number to the total number of samples in the minority-class sample set; a normalization subunit 1056, configured to normalize the proportion of each reference sample to obtain a corresponding normalized proportion; a third calculation subunit 1057, configured to calculate a corresponding third number according to each normalized proportion and the estimated total number; and a generation subunit 1058, configured to select neighbor samples of the corresponding reference sample according to the third number and the first number, and to generate new samples according to the reference sample and the neighbor samples.
In this embodiment of the present application, as shown in FIG. 6, the generation subunit 1058 includes the following subunits: a fourth calculation subunit 10581, configured to calculate the quotient of the third number and the first number; a judgment subunit 10582, configured to judge whether the quotient is less than 1; a selection subunit 10583, configured to, if the quotient is less than 1, select the third number of neighbor samples from the neighbor samples of the reference sample, each of the selected neighbor samples being farther from the reference sample than the remaining neighbor samples; a first generation subunit 10584, configured to form a sample pair from each selected neighbor sample and the reference sample, and to generate one new sample from each sample pair; and a second generation subunit 10585, configured to, if the quotient is greater than or equal to 1, take an integer according to the rounding rule, form a sample pair from each neighbor sample of the reference sample and the reference sample, and generate the integer number of new samples from each sample pair.
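To tie the subunits of FIGS. 5 and 6 together, the sketch below allocates the estimated total number of new samples across the reference samples in proportion to their normalized non-neighbor proportions, then decides per reference sample how many new samples each neighbor pair contributes. The assumption that the second number equals the minority total minus the neighbor count, and all function and variable names, are illustrative only; the patent leaves the exact formula to the earlier description.

```python
import numpy as np

def allocate_new_samples(neighbor_counts, total_minority, estimated_total):
    """Per reference sample, decide which neighbor pairs to use and how many new samples per pair."""
    first = np.asarray(neighbor_counts, dtype=float)   # first number: neighbor count per reference sample
    second = total_minority - first                    # assumed second number: non-neighbor count
    proportion = second / total_minority               # proportion of non-neighbors
    normalized = proportion / proportion.sum()         # normalized proportion across reference samples
    third = normalized * estimated_total               # third number per reference sample
    plan = []
    for third_n, first_n in zip(third, first):
        quotient = third_n / first_n
        if quotient < 1:
            # quotient < 1: use only the third-number farthest neighbors, one new sample per pair
            plan.append(("farthest_neighbors", int(round(third_n)), 1))
        else:
            # quotient >= 1: use every neighbor pair, a rounded quotient of new samples per pair
            plan.append(("all_neighbors", int(first_n), int(round(quotient))))
    return plan

# Example: three reference samples with 3, 5 and 2 neighbors in a 40-sample minority set,
# and 60 new samples to distribute in total.
print(allocate_new_samples([3, 5, 2], total_minority=40, estimated_total=60))
```

The returned plan mirrors the branch implemented by the selection and generation subunits: fewer allotted samples than neighbors means only the farthest neighbors are used once each, otherwise every neighbor pair is reused a rounded number of times.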
The first generation subunit 10584 includes: a second acquisition subunit, configured to obtain the feature vector An(a1, a2, ..., ai) of the reference sample of the sample pair in the i-dimensional space and the feature vector Bk(b1, b2, ..., bi) of the neighbor sample; a first random subunit, configured to randomly generate a scale value t, where 0 < t < 1; and a first feature calculation subunit, configured to calculate the feature vector Cnk(c1, c2, ..., ci) of the new sample to be generated, where ci = ai + t*(bi - ai), and to generate, in the i-dimensional space, a sample having the feature vector Cnk(c1, c2, ..., ci).
The second generation subunit 10585 includes the following subunits: a third acquisition subunit, configured to obtain the feature vector An(a1, a2, ..., ai) of the reference sample of the sample pair in the i-dimensional space and the feature vector Bk(b1, b2, ..., bi) of the neighbor sample; a second random subunit, configured to randomly generate j scale values t_x, where 0 < t_x < 1, x = 1, 2, ..., j, j equals the integer, and all scale values t_x are different from one another; and a second feature calculation subunit, configured to calculate the feature vectors Cnk_x(c1, c2, ..., ci) of the integer number of new samples to be generated, where ci = ai + t_x*(bi - ai), and to generate, in the i-dimensional space, samples having the feature vectors Cnk_x(c1, c2, ..., ci).
For the functions of the data classification apparatus 100 and the detailed description of each unit, reference may be made to the description in the above method embodiments, which is not repeated here. The data classification apparatus 100 may be implemented in the form of a computer program, and the computer program may run on a computer device as shown in FIG. 7.
FIG. 7 is a schematic block diagram of a data classification device provided by an embodiment of the present application. The device may be a terminal or a server, where the terminal may be an electronic device with a communication function, such as a smartphone, a tablet computer, a notebook computer, a desktop computer, a personal digital assistant or a wearable device, and the server may be a standalone server or a server cluster made up of multiple servers.
The device is a computer device 200 that includes a processor 202, a memory and a network interface 205 connected by a system bus 201, where the memory includes a non-volatile storage medium 203 and an internal memory 204. The non-volatile storage medium 203 of the computer device 200 can store an operating system 2031 and a computer program 2032; when the computer program 2032 is executed, it can cause the processor 202 to perform a data classification method. The processor 202 of the computer device 200 provides computing and control capabilities and supports the operation of the entire computer device 200. The internal memory 204 provides an environment for running the computer program 2032 stored in the non-volatile storage medium 203. The network interface 205 of the computer device 200 is used for network communication, such as sending assigned tasks. When the processor 202 runs the computer program 2032 in the non-volatile storage medium 203, it can carry out the implementations of all the embodiments of the above data classification method. Those skilled in the art will understand that the embodiment of the computer device shown in FIG. 7 does not limit the specific configuration of the data classification device; in other embodiments, the data classification device may include more or fewer components than shown, combine certain components, or arrange the components differently. For example, in some embodiments the data classification device may include only a memory and a processor; in such embodiments the structures and functions of the memory and the processor are consistent with the embodiment shown in FIG. 7 and are not repeated here.
The present application further provides a computer-readable storage medium that stores one or more computer programs executable by one or more processors; when the one or more programs are executed by the one or more processors, all the embodiments of the above data classification method can be implemented.
The above are only specific embodiments of the present application, but the scope of protection of the present application is not limited thereto. Any equivalent modification or replacement that those skilled in the art can readily conceive within the technical scope disclosed in the present application shall fall within the scope of protection of the present application. Therefore, the scope of protection of the present application shall be subject to the scope of protection of the claims.

Claims (20)

1. A data classification method, comprising:
    obtaining a sample set, the sample set comprising a majority-class sample set and a minority-class sample set;
    determining a preset number of copies and a preset number of samples of a first-type sample set according to a ratio of the total number of samples in the majority-class sample set to the total number of samples in the minority-class sample set, the preset number of copies being an odd number;
    randomly extracting the preset number of samples from the majority-class sample set to form one first-type sample set, and repeating the extraction multiple times to obtain the preset number of first-type sample sets;
    determining an estimated total number of new samples to be generated according to the total number of samples in the minority-class sample set and the preset number of samples;
    generating new samples from the minority-class sample set according to the estimated total number, and mixing the new samples with the minority-class sample set to form a second-type sample set;
    performing machine learning on each first-type sample set together with the second-type sample set to obtain a corresponding classification model;
    predicting the classification of data to be classified by using the classification models to obtain corresponding prediction results; and
    counting the numbers of the different prediction results respectively, and determining the prediction result with the larger number as the classification result.
2. The data classification method according to claim 1, wherein the generating new samples from the minority-class sample set according to the estimated total number comprises:
    determining, in turn, one sample of the minority-class sample set as a reference sample;
    acquiring neighbor samples of each reference sample;
    counting, for each reference sample, a first number of its neighbor samples;
    calculating a second number of non-neighbor samples of the corresponding reference sample according to the first number and the total number of samples in the minority-class sample set;
    calculating a proportion of the second number to the total number of samples in the minority-class sample set;
    normalizing the proportion of each reference sample to obtain a corresponding normalized proportion;
    calculating a corresponding third number according to each normalized proportion and the estimated total number; and
    selecting neighbor samples of the corresponding reference sample according to the third number and the first number, and generating new samples according to the reference sample and the neighbor samples.
3. The data classification method according to claim 2, wherein the selecting neighbor samples of the corresponding reference sample according to the third number and the first number, and generating new samples according to the reference sample and the neighbor samples, comprises:
    calculating a quotient of the third number and the first number;
    judging whether the quotient is less than 1;
    if so, selecting the third number of neighbor samples from the neighbor samples of the reference sample, each of the selected neighbor samples being farther from the reference sample than the remaining neighbor samples, forming a sample pair from each selected neighbor sample and the reference sample, and generating one new sample from each sample pair; and
    if not, taking an integer according to the rounding rule, forming a sample pair from each neighbor sample of the reference sample and the reference sample, and generating the integer number of new samples from each sample pair.
4. The data classification method according to claim 3, wherein generating one new sample from one sample pair comprises:
    obtaining a feature vector An(a1, a2, ..., ai) of the reference sample of the sample pair in an i-dimensional space and a feature vector Bk(b1, b2, ..., bi) of the neighbor sample;
    randomly generating a scale value t, where 0 < t < 1; and
    calculating a feature vector Cnk(c1, c2, ..., ci) of the new sample to be generated, where ci = ai + t*(bi - ai), and generating, in the i-dimensional space, a sample having the feature vector Cnk(c1, c2, ..., ci).
5. The method according to claim 3, wherein generating the integer number of new samples from one sample pair comprises:
    obtaining a feature vector An(a1, a2, ..., ai) of the reference sample of the sample pair in an i-dimensional space and a feature vector Bk(b1, b2, ..., bi) of the neighbor sample;
    randomly generating j scale values t_x, where 0 < t_x < 1, x = 1, 2, ..., j, j equals the integer, and all scale values t_x are different from one another; and
    calculating feature vectors Cnk_x(c1, c2, ..., ci) of the integer number of new samples to be generated, where ci = ai + t_x*(bi - ai), and generating, in the i-dimensional space, samples having the feature vectors Cnk_x(c1, c2, ..., ci).
6. A data classification apparatus, comprising:
    an acquisition unit, configured to acquire a sample set, the sample set comprising a majority-class sample set and a minority-class sample set;
    a first determination unit, configured to determine a preset number of copies and a preset number of samples of a first-type sample set according to a ratio of the total number of samples in the majority-class sample set to the total number of samples in the minority-class sample set, the preset number of copies being an odd number;
    a first forming unit, configured to randomly extract the preset number of samples from the majority-class sample set to form one first-type sample set, and to repeat the extraction multiple times to obtain the preset number of first-type sample sets;
    a second determination unit, configured to determine an estimated total number of new samples to be generated according to the total number of samples in the minority-class sample set and the preset number of samples;
    a generation unit, configured to generate new samples from the minority-class sample set according to the estimated total number;
    a second forming unit, configured to mix the new samples with the minority-class sample set to form a second-type sample set;
    a learning unit, configured to perform machine learning on each first-type sample set together with the second-type sample set to obtain a corresponding classification model;
    a prediction unit, configured to predict the classification of data to be classified by using the classification models to obtain corresponding prediction results;
    a statistics unit, configured to count the numbers of the different prediction results respectively; and
    a third determination unit, configured to determine the prediction result with the larger number as the classification result.
7. The data classification apparatus according to claim 6, wherein the generation unit comprises:
    a determination subunit, configured to determine, in turn, one sample of the minority-class sample set as a reference sample;
    a first acquisition subunit, configured to acquire neighbor samples of each reference sample;
    a statistics subunit, configured to count, for each reference sample, a first number of its neighbor samples;
    a first calculation subunit, configured to calculate a second number of non-neighbor samples of the corresponding reference sample according to the first number and the total number of samples in the minority-class sample set;
    a second calculation subunit, configured to calculate a proportion of the second number to the total number of samples in the minority-class sample set;
    a normalization subunit, configured to normalize the proportion of each reference sample to obtain a corresponding normalized proportion;
    a third calculation subunit, configured to calculate a corresponding third number according to each normalized proportion and the estimated total number; and
    a generation subunit, configured to select neighbor samples of the corresponding reference sample according to the third number and the first number, and to generate new samples according to the reference sample and the neighbor samples.
8. The data classification apparatus according to claim 7, wherein the generation subunit comprises:
    a fourth calculation subunit, configured to calculate a quotient of the third number and the first number;
    a judgment subunit, configured to judge whether the quotient is less than 1;
    a selection subunit, configured to, if the quotient is less than 1, select the third number of neighbor samples from the neighbor samples of the reference sample, each of the selected neighbor samples being farther from the reference sample than the remaining neighbor samples;
    a first generation subunit, configured to form a sample pair from each selected neighbor sample and the reference sample, and to generate one new sample from each sample pair; and
    a second generation subunit, configured to, if the quotient is greater than or equal to 1, take an integer according to the rounding rule, form a sample pair from each neighbor sample of the reference sample and the reference sample, and generate the integer number of new samples from each sample pair.
9. The data classification apparatus according to claim 8, wherein the first generation subunit comprises:
    a second acquisition subunit, configured to obtain a feature vector An(a1, a2, ..., ai) of the reference sample of the sample pair in an i-dimensional space and a feature vector Bk(b1, b2, ..., bi) of the neighbor sample;
    a first random subunit, configured to randomly generate a scale value t, where 0 < t < 1; and
    a first feature calculation subunit, configured to calculate a feature vector Cnk(c1, c2, ..., ci) of the new sample to be generated, where ci = ai + t*(bi - ai), and to generate, in the i-dimensional space, a sample having the feature vector Cnk(c1, c2, ..., ci).
10. The data classification apparatus according to claim 8, wherein the second generation subunit comprises:
    a third acquisition subunit, configured to obtain a feature vector An(a1, a2, ..., ai) of the reference sample of the sample pair in an i-dimensional space and a feature vector Bk(b1, b2, ..., bi) of the neighbor sample;
    a second random subunit, configured to randomly generate j scale values t_x, where 0 < t_x < 1, x = 1, 2, ..., j, j equals the integer, and all scale values t_x are different from one another; and
    a second feature calculation subunit, configured to calculate feature vectors Cnk_x(c1, c2, ..., ci) of the integer number of new samples to be generated, where ci = ai + t_x*(bi - ai), and to generate, in the i-dimensional space, samples having the feature vectors Cnk_x(c1, c2, ..., ci).
11. A data classification device, comprising a memory and a processor connected to the memory, wherein:
    the memory is configured to store a computer program implementing a data classification method; and
    the processor is configured to run the computer program stored in the memory to perform the following steps:
    obtaining a sample set, the sample set comprising a majority-class sample set and a minority-class sample set;
    determining a preset number of copies and a preset number of samples of a first-type sample set according to a ratio of the total number of samples in the majority-class sample set to the total number of samples in the minority-class sample set, the preset number of copies being an odd number;
    randomly extracting the preset number of samples from the majority-class sample set to form one first-type sample set, and repeating the extraction multiple times to obtain the preset number of first-type sample sets;
    determining an estimated total number of new samples to be generated according to the total number of samples in the minority-class sample set and the preset number of samples;
    generating new samples from the minority-class sample set according to the estimated total number, and mixing the new samples with the minority-class sample set to form a second-type sample set;
    performing machine learning on each first-type sample set together with the second-type sample set to obtain a corresponding classification model;
    predicting the classification of data to be classified by using the classification models to obtain corresponding prediction results; and
    counting the numbers of the different prediction results respectively, and determining the prediction result with the larger number as the classification result.
12. The data classification device according to claim 11, wherein, when performing the step of generating new samples from the minority-class sample set according to the estimated total number, the processor specifically performs the following steps:
    determining, in turn, one sample of the minority-class sample set as a reference sample;
    acquiring neighbor samples of each reference sample;
    counting, for each reference sample, a first number of its neighbor samples;
    calculating a second number of non-neighbor samples of the corresponding reference sample according to the first number and the total number of samples in the minority-class sample set;
    calculating a proportion of the second number to the total number of samples in the minority-class sample set;
    normalizing the proportion of each reference sample to obtain a corresponding normalized proportion;
    calculating a corresponding third number according to each normalized proportion and the estimated total number; and
    selecting neighbor samples of the corresponding reference sample according to the third number and the first number, and generating new samples according to the reference sample and the neighbor samples.
13. The data classification device according to claim 12, wherein, when performing the step of selecting neighbor samples of the corresponding reference sample according to the third number and the first number and generating new samples according to the reference sample and the neighbor samples, the processor specifically performs the following steps:
    calculating a quotient of the third number and the first number;
    judging whether the quotient is less than 1;
    if so, selecting the third number of neighbor samples from the neighbor samples of the reference sample, each of the selected neighbor samples being farther from the reference sample than the remaining neighbor samples, forming a sample pair from each selected neighbor sample and the reference sample, and generating one new sample from each sample pair; and
    if not, taking an integer according to the rounding rule, forming a sample pair from each neighbor sample of the reference sample and the reference sample, and generating the integer number of new samples from each sample pair.
14. The data classification device according to claim 13, wherein, when generating one new sample from one sample pair, the processor specifically performs the following steps:
    obtaining a feature vector An(a1, a2, ..., ai) of the reference sample of the sample pair in an i-dimensional space and a feature vector Bk(b1, b2, ..., bi) of the neighbor sample;
    randomly generating a scale value t, where 0 < t < 1; and
    calculating a feature vector Cnk(c1, c2, ..., ci) of the new sample to be generated, where ci = ai + t*(bi - ai), and generating, in the i-dimensional space, a sample having the feature vector Cnk(c1, c2, ..., ci).
15. The data classification device according to claim 13, wherein, when generating the integer number of new samples from one sample pair, the processor specifically performs the following steps:
    obtaining a feature vector An(a1, a2, ..., ai) of the reference sample of the sample pair in an i-dimensional space and a feature vector Bk(b1, b2, ..., bi) of the neighbor sample;
    randomly generating j scale values t_x, where 0 < t_x < 1, x = 1, 2, ..., j, j equals the integer, and all scale values t_x are different from one another; and
    calculating feature vectors Cnk_x(c1, c2, ..., ci) of the integer number of new samples to be generated, where ci = ai + t_x*(bi - ai), and generating, in the i-dimensional space, samples having the feature vectors Cnk_x(c1, c2, ..., ci).
16. A computer-readable storage medium, wherein the computer-readable storage medium stores one or more computer programs executable by one or more processors to implement the following steps:
    determining a preset number of copies and a preset number of samples of a first-type sample set according to a ratio of the total number of samples in the majority-class sample set to the total number of samples in the minority-class sample set, the preset number of copies being an odd number;
    randomly extracting the preset number of samples from the majority-class sample set to form one first-type sample set, and repeating the extraction multiple times to obtain the preset number of first-type sample sets;
    determining an estimated total number of new samples to be generated according to the total number of samples in the minority-class sample set and the preset number of samples;
    generating new samples from the minority-class sample set according to the estimated total number, and mixing the new samples with the minority-class sample set to form a second-type sample set;
    performing machine learning on each first-type sample set together with the second-type sample set to obtain a corresponding classification model;
    predicting the classification of data to be classified by using the classification models to obtain corresponding prediction results; and
    counting the numbers of the different prediction results respectively, and determining the prediction result with the larger number as the classification result.
17. The computer-readable storage medium according to claim 16, wherein the step of generating new samples from the minority-class sample set according to the estimated total number comprises:
    determining, in turn, one sample of the minority-class sample set as a reference sample;
    acquiring neighbor samples of each reference sample;
    counting, for each reference sample, a first number of its neighbor samples;
    calculating a second number of non-neighbor samples of the corresponding reference sample according to the first number and the total number of samples in the minority-class sample set;
    calculating a proportion of the second number to the total number of samples in the minority-class sample set;
    normalizing the proportion of each reference sample to obtain a corresponding normalized proportion;
    calculating a corresponding third number according to each normalized proportion and the estimated total number; and
    selecting neighbor samples of the corresponding reference sample according to the third number and the first number, and generating new samples according to the reference sample and the neighbor samples.
18. The computer-readable storage medium according to claim 17, wherein the step of selecting neighbor samples of the corresponding reference sample according to the third number and the first number and generating new samples according to the reference sample and the neighbor samples comprises:
    calculating a quotient of the third number and the first number;
    judging whether the quotient is less than 1;
    if so, selecting the third number of neighbor samples from the neighbor samples of the reference sample, each of the selected neighbor samples being farther from the reference sample than the remaining neighbor samples, forming a sample pair from each selected neighbor sample and the reference sample, and generating one new sample from each sample pair; and
    if not, taking an integer according to the rounding rule, forming a sample pair from each neighbor sample of the reference sample and the reference sample, and generating the integer number of new samples from each sample pair.
19. The computer-readable storage medium according to claim 18, wherein the step of generating one new sample from one sample pair comprises:
    obtaining a feature vector An(a1, a2, ..., ai) of the reference sample of the sample pair in an i-dimensional space and a feature vector Bk(b1, b2, ..., bi) of the neighbor sample;
    randomly generating a scale value t, where 0 < t < 1; and
    calculating a feature vector Cnk(c1, c2, ..., ci) of the new sample to be generated, where ci = ai + t*(bi - ai), and generating, in the i-dimensional space, a sample having the feature vector Cnk(c1, c2, ..., ci).
20. The computer-readable storage medium according to claim 18, wherein the step of generating the integer number of new samples from one sample pair comprises:
    obtaining a feature vector An(a1, a2, ..., ai) of the reference sample of the sample pair in an i-dimensional space and a feature vector Bk(b1, b2, ..., bi) of the neighbor sample;
    randomly generating j scale values t_x, where 0 < t_x < 1, x = 1, 2, ..., j, j equals the integer, and all scale values t_x are different from one another; and
    calculating feature vectors Cnk_x(c1, c2, ..., ci) of the integer number of new samples to be generated, where ci = ai + t_x*(bi - ai), and generating, in the i-dimensional space, samples having the feature vectors Cnk_x(c1, c2, ..., ci).
PCT/CN2018/084047 2018-03-08 2018-04-23 Data classification method, apparatus, device and computer readable storage medium WO2019169704A1 (en)

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
CN201810190818.2 2018-03-08
CN201810190818.2A CN108491474A (en) 2018-03-08 2018-03-08 A kind of data classification method, device, equipment and computer readable storage medium

Publications (1)

Publication Number Publication Date
WO2019169704A1 true WO2019169704A1 (en) 2019-09-12

Family

ID=63338126

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/CN2018/084047 WO2019169704A1 (en) 2018-03-08 2018-04-23 Data classification method, apparatus, device and computer readable storage medium

Country Status (2)

Country Link
CN (1) CN108491474A (en)
WO (1) WO2019169704A1 (en)

Cited By (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN111259964A (en) * 2020-01-17 2020-06-09 上海海事大学 Over-sampling method for unbalanced data set
CN111292329A (en) * 2020-01-15 2020-06-16 北京字节跳动网络技术有限公司 Training method and device for video segmentation network and electronic equipment
CN112085080A (en) * 2020-08-31 2020-12-15 北京百度网讯科技有限公司 Sample equalization method, device, equipment and storage medium
CN112801178A (en) * 2021-01-26 2021-05-14 上海明略人工智能(集团)有限公司 Model training method, device, equipment and computer readable medium
CN113673575A (en) * 2021-07-26 2021-11-19 浙江大华技术股份有限公司 Data synthesis method, training method of image processing model and related device

Families Citing this family (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN109726821B (en) * 2018-11-27 2021-07-09 东软集团股份有限公司 Data equalization method and device, computer readable storage medium and electronic equipment
CN111539451B (en) * 2020-03-26 2023-08-15 平安科技(深圳)有限公司 Sample data optimization method, device, equipment and storage medium
CN111597225B (en) * 2020-04-21 2023-10-27 杭州安脉盛智能技术有限公司 Self-adaptive data reduction method based on segmentation transient identification
CN112784884A (en) * 2021-01-07 2021-05-11 重庆兆琨智医科技有限公司 Medical image classification method, system, medium and electronic terminal
CN112948463B (en) * 2021-03-01 2022-10-14 创新奇智(重庆)科技有限公司 Rolled steel data sampling method and device, electronic equipment and storage medium

Patent Citations (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20170132516A1 (en) * 2015-11-05 2017-05-11 Adobe Systems Incorporated Adaptive sampling scheme for imbalanced large scale data
CN105487526A (en) * 2016-01-04 2016-04-13 华南理工大学 FastRVM (fast relevance vector machine) wastewater treatment fault diagnosis method
EP3336739A1 (en) * 2016-12-18 2018-06-20 Deutsche Telekom AG A method for classifying attack sources in cyber-attack sensor systems
CN106973057A (en) * 2017-03-31 2017-07-21 浙江大学 A kind of sorting technique suitable for intrusion detection

Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
DU HONGLE ET AL.: "A classification algorithm based on mixed sampling for imbalanced dataset", JOURNAL OF YANSHAN UNIVERSITY, vol. 39, no. 2, 31 March 2015 (2015-03-31), XP055636317 *

Cited By (8)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN111292329A (en) * 2020-01-15 2020-06-16 北京字节跳动网络技术有限公司 Training method and device for video segmentation network and electronic equipment
CN111259964A (en) * 2020-01-17 2020-06-09 上海海事大学 Over-sampling method for unbalanced data set
CN111259964B (en) * 2020-01-17 2023-04-07 上海海事大学 Over-sampling method for unbalanced data set
CN112085080A (en) * 2020-08-31 2020-12-15 北京百度网讯科技有限公司 Sample equalization method, device, equipment and storage medium
CN112085080B (en) * 2020-08-31 2024-03-08 北京百度网讯科技有限公司 Sample equalization method, device, equipment and storage medium
CN112801178A (en) * 2021-01-26 2021-05-14 上海明略人工智能(集团)有限公司 Model training method, device, equipment and computer readable medium
CN112801178B (en) * 2021-01-26 2024-04-09 上海明略人工智能(集团)有限公司 Model training method, device, equipment and computer readable medium
CN113673575A (en) * 2021-07-26 2021-11-19 浙江大华技术股份有限公司 Data synthesis method, training method of image processing model and related device

Also Published As

Publication number Publication date
CN108491474A (en) 2018-09-04

Similar Documents

Publication Publication Date Title
WO2019169704A1 (en) Data classification method, apparatus, device and computer readable storage medium
US10609433B2 (en) Recommendation information pushing method, server, and storage medium
Marcus et al. Counting with the crowd
TWI658420B (en) Method, device, server and computer readable storage medium for integrate collaborative filtering with time factor
CN107305637B (en) Data clustering method and device based on K-Means algorithm
CN110457577B (en) Data processing method, device, equipment and computer storage medium
CN106709318B (en) A kind of recognition methods of user equipment uniqueness, device and calculate equipment
CN110750658B (en) Recommendation method of media resource, server and computer readable storage medium
WO2018149337A1 (en) Information distribution method, device, and server
JP6249027B2 (en) Data model generation method and system for relational data
CN110087228B (en) Method and device for determining service package
CN108647997A (en) A kind of method and device of detection abnormal data
CN111178435B (en) Classification model training method and system, electronic equipment and storage medium
WO2023024408A1 (en) Method for determining feature vector of user, and related device and medium
Chen et al. A bootstrap method for goodness of fit and model selection with a single observed network
CN109543940B (en) Activity evaluation method, activity evaluation device, electronic equipment and storage medium
CN105677645B (en) A kind of tables of data comparison method and device
CN115545103A (en) Abnormal data identification method, label identification method and abnormal data identification device
CN113627950B (en) Method and system for extracting user transaction characteristics based on dynamic diagram
CN109583492A (en) A kind of method and terminal identifying antagonism image
CN107403199B (en) Data processing method and device
CN111291792B (en) Flow data type integrated classification method and device based on double evolution
CN112651764B (en) Target user identification method, device, equipment and storage medium
Ärje et al. Breaking the curse of dimensionality in quadratic discriminant analysis models with a novel variant of a Bayes classifier enhances automated taxa identification of freshwater macroinvertebrates
WO2019227415A1 (en) Scorecard model adjustment method, device, server and storage medium

Legal Events

Date Code Title Description
NENP Non-entry into the national phase

Ref country code: DE

32PN Ep: public notification in the ep bulletin as address of the adressee cannot be established

Free format text: NOTING OF LOSS OF RIGHTS PURSUANT TO RULE 112(1) EPC

122 Ep: pct application non-entry in european phase

Ref document number: 18909061

Country of ref document: EP

Kind code of ref document: A1