WO2020093718A1 - Training data re-sampling method and apparatus, and storage medium and electronic device

Info

Publication number: WO2020093718A1
Application number: PCT/CN2019/094741
Authority: WO
Inventors: 李伟健; 王长虎
Original assignee: 北京字节跳动网络技术有限公司
Priority date: 2018-11-08
Filing date: 2019-07-04
Publication date: 2020-05-14
Also published as: CN109635034A; CN109635034B

Abstract

Disclosed are a training data re-sampling method and apparatus, and a storage medium and an electronic device. The method comprises: acquiring first original data within a first time period (S101); calculating respective first proportions of multiple pre-set classifications in the first original data (S102); sorting the multiple pre-set classifications according to a size relationship of the first proportions and a pre-set rule so as to obtain a first sorting result (S103); determining, according to the first sorting result of the pre-set classifications and a pre-set correlation, a sampling proportion corresponding to each pre-set classification (S104), wherein the pre-set correlation is a correlation between the first sorting result and the sampling proportion; and re-sampling training data for modeling according to the sampling proportions respectively corresponding to the multiple pre-set classifications (S105). The problem of a classification model being unfriendly to a small category can be solved, and the classification accuracy of a classification model, obtained through training by means of training data, for different applications can be improved, thereby improving user experience.

Description

Training data resampling method, device, storage medium and electronic equipment

Cross-reference to related applications

This application claims priority to Chinese Patent Application No. 201811327417.3, filed with the China Intellectual Property Office on November 8, 2018. The entire contents of the disclosure of that Chinese patent application are incorporated herein by reference.

Technical field

The present disclosure relates to the field of data mining, and in particular, to a training data resampling method, device, storage medium, and electronic equipment.

Background

In machine learning, the number of training samples may differ greatly between the classifications handled by a classification model. For example, among N training samples, the number of samples belonging to a first category may be very different from the numbers belonging to a second category and a third category (the first category may account for 90% of the N training samples, while the second and third categories together account for the remaining 10%). When a classification model is trained directly on training data with such an uneven number of samples, the machine learning algorithm tends to produce an unsatisfactory model: the model may under-fit the classifications with few samples in the training data and over-fit the classifications with many samples. In fact, if the imbalance ratio exceeds 4:1, the classification model will be biased towards the larger category and ignore the small categories. Therefore, a classification model trained with unprocessed, unbalanced training data may not classify actual data well. At present, unbalanced training data is usually handled by resampling the training data.

Summary of the invention

The purpose of the present disclosure is to provide a training data resampling method, device, storage medium, and electronic equipment which, when the training data is imbalanced, can resample the training data according to the proportions of the different classifications in the actual original data, so as to solve the problem that the classification model is not friendly to small categories.

In order to achieve the above objective, the present disclosure provides a training data resampling method. The method includes:

Obtaining first raw data in a first period;

Calculating a first proportion respectively occupied by multiple preset categories in the first original data;

Sorting the plurality of preset categories according to a preset rule and the size relationship of the first proportions, to obtain a first sorting result;

Determining a sampling ratio corresponding to each preset category according to the first sorting result of each preset category and a preset correspondence, where the preset correspondence is the correspondence between the first sorting result and the sampling ratio;

Re-sampling the training data used for modeling according to the sampling ratios respectively corresponding to the multiple preset categories.

Optionally, the method further includes: after obtaining the first sorting result,

Obtaining second raw data in a second period;

Calculating a second proportion respectively occupied by the plurality of preset categories in the second original data;

Sorting the plurality of preset categories according to the preset rule and the size relationship of the second proportions, to obtain a second sorting result;

When the first sorting result is consistent with the second sorting result, the step of determining the sampling ratio corresponding to each preset category is performed.

Optionally, when the first sorting result and the second sorting result are inconsistent, the second period is re-determined and the second sorting result is taken as the first sorting result; the method then returns to the step of obtaining the second raw data in the second period.

Optionally, when the proportions of at least two preset categories are the same, the order of the at least two preset categories is determined according to the priorities of the at least two preset categories.

The present disclosure also provides a training data resampling device. The device includes:

The first obtaining module is used to obtain the first original data in the first period;

A first calculation module, configured to calculate a first proportion respectively occupied by multiple preset categories in the first original data;

A first sorting module, configured to sort the plurality of preset categories according to a preset rule according to the size relationship of the first ratio, to obtain a first sorting result;

A ratio obtaining module, configured to determine a sampling ratio corresponding to each preset category according to the first sorting result of each preset category and a preset correspondence, where the preset correspondence is the correspondence between the first sorting result and the sampling ratio;

The re-sampling module is used to re-sample the training data used for modeling according to the sampling ratios respectively corresponding to the multiple preset categories.

Optionally, the device further includes:

A second obtaining module, configured to obtain the second original data in the second period after obtaining the first sorting result;

A second calculation module, configured to calculate a second proportion respectively occupied by the plurality of preset categories in the second original data;

A second sorting module, configured to sort the plurality of preset categories according to the preset rule and the size relationship of the second proportions, to obtain a second sorting result;

The ranking comparison module is configured to trigger the ratio acquisition module to determine the sampling ratio corresponding to each preset category when the first ranking result is consistent with the second ranking result.

Optionally, the ranking comparison module is further configured to:

When the first sorting result and the second sorting result are inconsistent, re-determine the second time period, and determine the second sorting result as the first sorting result;

The second acquisition module is triggered to acquire the second original data in the second period.

The present disclosure also provides a computer-readable storage medium on which a computer program is stored, which implements the above method when the program is executed by a processor.

The present disclosure also provides an electronic device, which includes:

Memory, on which computer programs are stored;

The processor is configured to execute the computer program in the memory to implement the above method.

According to the embodiments of the present disclosure, when the training data is imbalanced, the training data can be resampled according to the proportions of the different classifications in the actual original data. This solves the problem that the classification model is not friendly to small categories and improves the classification accuracy, for different applications, of the classification model trained on the training data, thereby improving user experience.

Brief description of the drawings

The drawings are used to help further understand the present disclosure, and constitute a part of the specification, together with the following specific embodiments to explain the present disclosure, but do not constitute a limitation of the present disclosure. In the drawings:

FIG. 1 is a flowchart of a training data resampling method according to an exemplary embodiment of the present disclosure.

FIG. 2 is a flowchart of still another training data resampling method according to an exemplary embodiment of the present disclosure.

FIG. 3 is a structural block diagram of a training data resampling device according to an exemplary embodiment of the present disclosure.

FIG. 4 is a structural block diagram of yet another training data resampling device according to an exemplary embodiment of the present disclosure.

FIG. 5 is a block diagram of an electronic device according to an exemplary embodiment of the present disclosure.

Detailed description

The specific embodiments of the present disclosure will be described in detail below with reference to the drawings. It should be understood that the specific embodiments described herein are only used to illustrate and explain the present disclosure, and are not intended to limit the present disclosure.

Classification models frequently encounter imbalanced training data, and a classification model trained on imbalanced training data is often biased towards the categories that account for a larger share: over-fitting may occur for the categories with a larger share, while under-fitting may occur for the categories with a smaller share. Therefore, when establishing a classification model, those skilled in the art usually adopt strategies such as improving the classification algorithm or balancing the training data (data preprocessing). The latter, that is, resampling the unbalanced training data, is more commonly used because of its wide range of application.

According to its principle, resampling is usually divided into over-sampling and under-sampling. Over-sampling increases the amount of training data for the categories that account for a smaller share, thereby increasing the representation of the minority classes in the training data. Under-sampling reduces the amount of training data for the categories that account for a larger share, thereby reducing the representation of the majority classes. Both aim to balance the amount of data across categories with different shares and so resolve the imbalance of the training data. Resampling is usually performed with one of three methods: the nearest neighbor method, the bilinear interpolation method, and the cubic convolution interpolation method.

It can be seen that, according to the different needs of training data resampling in different fields, the same resampling method has different effects on training data processing. Therefore, how to determine how to resample unbalanced training data according to the actual situation of classification distribution in different fields is an important problem that needs to be solved urgently.

Therefore, the present disclosure proposes a training data resampling method.

FIG. 1 is a flowchart of a training data resampling method according to an exemplary embodiment of the present disclosure. As shown in FIG. 1, the method may include steps S101 to S105.

In step S101, the first raw data in the first period is acquired. Raw data refers to the data actually generated in the application to which the classification model is applied. For example, for a classification model that classifies online small videos, the raw data is the data actually generated in the online small video application. The first raw data in the first period may be, for example, the small video data actually uploaded by users between 0:00 on October 15, 2018 and 0:00 on October 16, 2018.

In step S102, the first proportions respectively occupied by multiple preset categories in the first original data are calculated. The preset categories are different categories set manually, such as landscape, pets, dance, technology, funny, and so on. Step S102 thus yields the first proportion that each preset category occupies in the first original data acquired during the first period.
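As an illustration only, the proportion calculation of step S102 could be sketched in Python as follows. The function name `category_proportions` and the representation of the raw data as a flat list of category labels are assumptions made for this sketch and are not specified by the disclosure; the sketch also assumes that proportions are taken over the items that fall into one of the preset categories.

```python
from collections import Counter


def category_proportions(labels, preset_categories):
    """Return the share of each preset category among the raw-data labels.

    `labels` is assumed to hold one category label per raw-data item
    (for example, one label per uploaded video).
    """
    counts = Counter(labels)
    # Assumption: only items belonging to a preset category are counted.
    total = sum(counts[c] for c in preset_categories)
    if total == 0:
        raise ValueError("no raw data falls into the preset categories")
    return {c: counts[c] / total for c in preset_categories}


# Hypothetical raw data for one period: 6 pets, 3 dance, 1 landscape item.
first_raw_labels = ["pets"] * 6 + ["dance"] * 3 + ["landscape"]
first_proportions = category_proportions(
    first_raw_labels, ["landscape", "pets", "dance"]
)
```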

In step S103, the multiple preset categories are sorted according to a preset rule and the magnitude relationship of the first proportions, and a first sorting result is obtained (that is, a ranking of each preset category is obtained).

In step S104, the sampling ratio corresponding to each preset category is determined according to the ranking of each preset category and a preset correspondence, where the preset correspondence is the correspondence between ranking and sampling ratio. The preset rule used for sorting may be from large to small, from small to large, and so on; the preset rule should be coordinated with the preset correspondence so that the resulting sampling ratios reflect the actual distribution of each preset category in the first original data. For example, in step S103 the multiple preset categories may be sorted in descending order of the first proportion, and the preset correspondence may be, for example, as shown in Table 1:

Table 1

Rank	1	2	3	4	5
Sampling ratio	50%	20%	10%	10%	10%

The preset categories are not limited to the five shown in Table 1 and may be any other number, as long as the sampling ratios corresponding to the preset categories sum to 100%. As shown in Table 1, each of the five preset categories thus obtains its corresponding sampling ratio.
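Steps S103 and S104 with the correspondence of Table 1 might be sketched as follows. This is a minimal illustration, not the implementation of the disclosure: the names `rank_categories`, `sampling_ratios`, and `TABLE_1`, and the example proportions, are assumptions introduced here.

```python
# Preset correspondence from Table 1: rank -> sampling ratio.
TABLE_1 = {1: 0.50, 2: 0.20, 3: 0.10, 4: 0.10, 5: 0.10}


def rank_categories(proportions, descending=True):
    """Step S103: order the categories by proportion (rank 1 comes first)."""
    return sorted(proportions, key=proportions.get, reverse=descending)


def sampling_ratios(ranking, rank_to_ratio=TABLE_1):
    """Step S104: look up the sampling ratio for each category's rank."""
    if len(ranking) != len(rank_to_ratio):
        raise ValueError("the correspondence must cover every rank")
    ratios = {
        category: rank_to_ratio[rank]
        for rank, category in enumerate(ranking, start=1)
    }
    # The sampling ratios must sum to 100%.
    if abs(sum(ratios.values()) - 1.0) > 1e-9:
        raise ValueError("sampling ratios must sum to 100%")
    return ratios


# Hypothetical first proportions for five preset categories.
proportions = {
    "landscape": 0.05,
    "pets": 0.40,
    "dance": 0.25,
    "technology": 0.10,
    "funny": 0.20,
}
ranking = rank_categories(proportions)  # largest proportion first
ratios = sampling_ratios(ranking)
```

With a small-to-large preset rule, the correspondence table would need to be reversed accordingly, which is what coordinating the preset rule with the preset correspondence amounts to in this sketch.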

In step S105, the training data used for modeling is resampled according to the sampling ratios respectively corresponding to the multiple preset categories. After the sampling ratios corresponding to the different preset categories have been obtained from the first raw data of the first period through steps S101 to S104, the training data can be resampled according to those sampling ratios. Whether up-sampling or down-sampling is used, and which specific sampling method is used, are not limited, as long as the resampling is performed according to the sampling ratio of each preset category obtained through steps S101 to S104 shown in FIG. 1.
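Since the disclosure leaves the sampling method open, the following sketch shows only one possible choice for step S105: random under-sampling without replacement when a category has enough samples, and random over-sampling with replacement otherwise. The function name `resample`, the `target_size` parameter, and the fixed seed are assumptions made for illustration.

```python
import random


def resample(samples_by_category, ratios, target_size, seed=0):
    """Draw roughly `target_size` samples so each category matches its ratio.

    `samples_by_category` maps a category to a list of its training samples.
    Per-category quotas are rounded, so the total may differ slightly from
    `target_size` for ratios that do not divide it evenly.
    """
    rng = random.Random(seed)
    resampled = []
    for category, ratio in ratios.items():
        pool = samples_by_category[category]
        quota = round(ratio * target_size)
        if quota > 0 and not pool:
            raise ValueError(f"no training samples for category {category!r}")
        if quota <= len(pool):
            # Under-sampling: pick a subset without replacement.
            chosen = rng.sample(pool, quota)
        else:
            # Over-sampling: repeat samples by drawing with replacement.
            chosen = rng.choices(pool, k=quota)
        resampled.extend((category, item) for item in chosen)
    rng.shuffle(resampled)
    return resampled


# Hypothetical training pools: one large category and one small category.
pools = {"pets": list(range(1000)), "dance": list(range(5))}
balanced = resample(pools, {"pets": 0.5, "dance": 0.5}, target_size=20)
```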

FIG. 2 is a flowchart of still another training data resampling method according to an exemplary embodiment of the present disclosure. As shown in FIG. 2, the method may include steps S201 to S205 in addition to steps S101 to S105 shown in FIG. 1.

In step S201, the second raw data in the second period is acquired. In the present disclosure, the second period is different from the first period, but the two periods may overlap, and there is no fixed order between them: the first period may be before or after the second period. However, the interval between the end time of the earlier period and the start time of the later period needs to be less than a preset threshold. The preset threshold may be, for example, 24 hours or 48 hours. This avoids choosing first and second periods that are too far apart, in which case the content of the corresponding first and second original data could change so much that the proportions of the preset categories are affected. The definition of original data in the second original data is the same as in the first original data.
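The constraint on the two periods can be expressed as a small check. In this sketch a period is assumed to be a `(start, end)` pair of `datetime` values, and "earlier" is taken to mean the period with the earlier start time; the function name `periods_compatible` and the 24-hour default are illustrative assumptions.

```python
from datetime import datetime, timedelta


def periods_compatible(first, second, max_gap=timedelta(hours=24)):
    """Check whether two (start, end) periods may be compared.

    The periods must differ, may overlap, and may come in either order,
    but the gap between the end of the earlier period and the start of
    the later one must be below `max_gap`.
    """
    if first == second:
        return False
    earlier, later = sorted([first, second], key=lambda p: p[0])
    gap = later[0] - earlier[1]  # negative when the periods overlap
    return gap < max_gap


first_period = (datetime(2018, 10, 15), datetime(2018, 10, 16))
near_period = (datetime(2018, 10, 16, 12), datetime(2018, 10, 17, 12))
far_period = (datetime(2018, 10, 20), datetime(2018, 10, 21))
```

Here `near_period` starts 12 hours after `first_period` ends and passes the check, while `far_period` starts four days later and does not.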

In step S202, the second proportions respectively occupied by multiple preset categories in the second original data are calculated.

In step S203, the multiple preset categories are sorted according to the preset rule and the magnitude relationship of the second proportions, and a second sorting result is obtained. Steps S202 and S203 are similar to steps S102 and S103 shown in FIG. 1: steps S102 and S103 rank the preset categories according to their proportions in the first raw data, and steps S202 and S203 rank them according to their proportions in the second raw data, yielding the first sorting result and the second sorting result respectively.

In step S204, the first sorting result obtained in step S103 and the second sorting result obtained in step S203 are compared. If the two are consistent, the sorting results are considered accurate and can be used to determine the sampling ratio of each preset category. The method therefore proceeds to step S104 to determine the sampling ratio corresponding to each preset category, and finally to step S105 to resample the training data used for modeling according to those sampling ratios, so as to solve the problem that the imbalance of the training data degrades the effect of the classification model. If the two are not consistent, the method proceeds to step S205.

In step S205, the second period is re-determined, and the second sorting result is taken as the first sorting result. The re-determined second period is different from the second period used in step S201. The relationship between the re-determined second period and the second period used in step S201 is similar to the relationship between the second period used in step S201 and the first period: the periods may overlap, and there is no fixed order between them. In addition, the re-determined second period cannot be the same as the first period or any previously determined second period, so that each time original data is obtained for proportion calculation and sorting, the selected content truly reflects the proportions of the different preset categories in actual application and is not repeated. After the second period has been re-determined and the second sorting result has been taken as the first sorting result, the method returns to step S201 to obtain a new second sorting result. The newly obtained second sorting result is then compared with the first sorting result (that is, the previous second sorting result) to determine whether the training data can be resampled according to a consistent sampling ratio.

To sum up, in this embodiment, a first sorting result is first obtained from the proportions of the preset categories in the first raw data of the first period, and a second sorting result is then obtained from the proportions of the preset categories in the second raw data of the second period. The first sorting result is compared with the second sorting result. If they are consistent, the sorting result is considered free of error, the sampling ratio of each preset category can be determined directly from the ranking, and the training data is resampled according to those sampling ratios. If they are inconsistent, at least one of the two is in error, so new original data must be selected in a new period, a new sorting result obtained from it, and that result compared with the most recently obtained sorting result, repeating until the comparison is consistent. Once the comparison is consistent, the sampling ratio corresponding to each preset category is determined according to the consistent sorting result, and the training data is resampled according to those sampling ratios.
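The compare-and-retry loop of steps S201 to S205 can be sketched as below. The callable `rank_for_period`, which is assumed to return the ranking computed from the raw data of a given period, and the function name `find_consistent_ranking` are hypothetical. The `max_rounds` bound is also an assumption added so that the sketch always terminates; the disclosure does not specify a limit.

```python
def find_consistent_ranking(rank_for_period, candidate_periods, max_rounds=10):
    """Return the first ranking confirmed by two consecutive periods.

    `candidate_periods` yields hashable period identifiers; the first one
    plays the role of the first period, the rest are successive second
    periods. Each period may be used only once.
    """
    periods = iter(candidate_periods)
    first_period = next(periods)
    seen = {first_period}
    first_ranking = rank_for_period(first_period)        # S101 to S103
    for _, second_period in zip(range(max_rounds), periods):
        if second_period in seen:
            raise ValueError("each period may be used only once")
        seen.add(second_period)
        second_ranking = rank_for_period(second_period)  # S201 to S203
        if second_ranking == first_ranking:              # S204
            return first_ranking
        first_ranking = second_ranking                   # S205
    raise RuntimeError("no two consecutive periods produced the same ranking")


# Hypothetical rankings observed on three different days.
observed = {
    "day1": ["pets", "dance", "funny"],
    "day2": ["dance", "pets", "funny"],
    "day3": ["dance", "pets", "funny"],
}
stable = find_consistent_ranking(observed.get, ["day1", "day2", "day3"])
```

In this example the rankings of day 1 and day 2 disagree, so the day 2 ranking becomes the new first sorting result and is then confirmed by day 3.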

According to an embodiment of the present disclosure, by comparing, at least twice, the rankings of the proportions of the preset categories in different original data from different periods, a ranking can be determined that truly reflects the proportions of the different preset categories in the original data. This avoids the situation in which the selected period is unusual, or contains infrequent events, so that the proportions determined from its original data do not reflect the proportions of the preset categories in the actual original data. The accuracy of resampling according to the sampling ratios determined by the ranking is thereby ensured, and the effect of the classification model trained on the resampled training data is further improved.

In a possible implementation manner, when the proportions of at least two preset categories are the same, the order of the at least two preset categories is determined according to their priorities. In step S103 shown in FIG. 1 and step S203 shown in FIG. 2, when the preset categories are sorted according to the preset rule and the magnitude relationship of their proportions in the respective original data, if two or more preset categories have the same proportion, their order may be determined according to their priorities. For example, suppose preset category A, preset category B, and preset category C each account for 2% of the original data in a period, and their priorities are preset category A > preset category B > preset category C. If the preset rule is from large to small, preset category A is ranked first, followed by preset category B, and finally preset category C. If the preset rule is from small to large, the opposite holds: preset category C is ranked first, followed by preset category B, and finally preset category A.
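The priority-based tie-breaking described above could be sketched with a composite sort key. The function name `rank_with_priority` and the representation of priorities as a list ordered from highest to lowest are assumptions for this illustration; category D is added to the example only to show that non-tied categories are still ordered by proportion.

```python
def rank_with_priority(proportions, priority, descending=True):
    """Sort categories by proportion, breaking ties by priority.

    `priority` lists the categories from highest to lowest priority.
    With the large-to-small rule, higher priority comes first among ties;
    with the small-to-large rule, the tie order is reversed.
    """
    prio_index = {category: i for i, category in enumerate(priority)}
    if descending:
        key = lambda c: (-proportions[c], prio_index[c])
    else:
        key = lambda c: (proportions[c], -prio_index[c])
    return sorted(proportions, key=key)


# A, B and C each account for 2%; the hypothetical category D takes the rest.
proportions = {"A": 0.02, "B": 0.02, "C": 0.02, "D": 0.94}
priority = ["A", "B", "C", "D"]
large_to_small = rank_with_priority(proportions, priority)
small_to_large = rank_with_priority(proportions, priority, descending=False)
```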

FIG. 3 is a structural block diagram of a training data resampling device according to an exemplary embodiment of the present disclosure. As shown in FIG. 3, the device may include: a first obtaining module 10, configured to obtain first original data in a first period; a first calculation module 20, configured to calculate the first proportions respectively occupied by a plurality of preset categories in the first raw data; a first sorting module 30, configured to sort the plurality of preset categories according to a preset rule and the magnitude relationship of the first proportions, to obtain a first sorting result; a ratio obtaining module 40, configured to determine the sampling ratio corresponding to each preset category according to the first sorting result of each preset category and a preset correspondence, where the preset correspondence is the correspondence between the first sorting result and the sampling ratio; and a re-sampling module 50, configured to re-sample the training data used for modeling according to the sampling ratios respectively corresponding to the multiple preset categories.

FIG. 4 is a structural block diagram of yet another training data resampling device according to an exemplary embodiment of the present disclosure. Compared with the training data resampling device shown in FIG. 3, the device shown in FIG. 4 may further include: a second acquiring module 60, configured to acquire the second raw data in the second period after the first sorting module 30 obtains the first sorting result; a second calculation module 70, configured to calculate the second proportions respectively occupied by the plurality of preset categories in the second raw data; a second sorting module 80, configured to sort the plurality of preset categories according to the preset rule and the magnitude relationship of the second proportions, to obtain a second sorting result; and a ranking comparison module 90, configured to trigger the ratio obtaining module 40 to determine the sampling ratio corresponding to each preset category when the first sorting result and the second sorting result are consistent.

In a possible implementation manner, the ranking comparison module 90 may be further configured to: when the first ranking result and the second ranking result are inconsistent, re-determine the second period and take the second ranking result as the first ranking result; and trigger the second acquisition module 60 to acquire the second raw data in the re-determined second period.

In a possible implementation manner, when the proportions of at least two preset categories are the same, the order of the at least two preset categories is determined according to the priorities of the at least two preset categories.

Regarding the device in the above embodiment, the specific manner in which each module performs operations has been described in detail in the embodiment related to the method, and will not be elaborated here.

FIG. 5 is a block diagram of an electronic device 500 according to an exemplary embodiment of the present disclosure. For example, the electronic device 500 may be provided as a server. Referring to FIG. 5, the electronic device 500 includes a processor 522, the number of which may be one or more, and a memory 532 for storing a computer program executable by the processor 522. The computer program stored in the memory 532 may include one or more modules each corresponding to a set of instructions. In addition, the processor 522 may be configured to execute the computer program to perform the training data resampling method described above.

In addition, the electronic device 500 may further include a power supply component 526 and a communication component 550. The power supply component 526 may be configured to perform power management of the electronic device 500, and the communication component 550 may be configured to implement communication of the electronic device 500, for example, wired or wireless communication. In addition, the electronic device 500 may also include an input/output (I/O) interface 558. The electronic device 500 can operate based on an operating system stored in the memory 532, such as Windows Server™, Mac OS X™, Unix™, Linux™, and so on.

In another exemplary embodiment, a computer-readable storage medium including program instructions is also provided. When the program instructions are executed by a processor, the training data resampling method described above is implemented. For example, the computer-readable storage medium may be the above-mentioned memory 532 including program instructions, and the above-mentioned program instructions may be executed by the processor 522 of the electronic device 500 to complete the above training data resampling method.

The exemplary embodiments of the present disclosure have been described in detail above with reference to the drawings. However, the present disclosure is not limited to the specific details in the above embodiments. Within the scope of the technical idea of the present disclosure, various simple modifications can be made to the technical solutions of the present disclosure, and these simple modifications are within the scope of this disclosure.

In addition, it should be noted that the specific technical features described in the above specific embodiments can be combined in any suitable manner without contradictions. In order to avoid unnecessary repetition, the present disclosure does not describe various possible combinations.

In addition, the various embodiments of the present disclosure may also be combined arbitrarily; as long as a combination does not violate the concept of the present disclosure, it should likewise be regarded as within the scope of the present disclosure.

Claims

A training data resampling method, the method comprising:

Obtaining first raw data in a first period;

Calculating a first proportion respectively occupied by multiple preset categories in the first original data;

Sorting the plurality of preset categories according to a preset rule and the size relationship of the first proportions, to obtain a first sorting result;

Determining a sampling ratio corresponding to each preset category according to the first sorting result of each preset category and a preset correspondence, where the preset correspondence is the correspondence between the first sorting result and the sampling ratio;

Re-sampling the training data used for modeling according to the sampling ratios respectively corresponding to the plurality of preset categories.
The method according to claim 1, further comprising: after obtaining the first sorting result,

Obtaining second raw data in a second period;

Calculating a second proportion respectively occupied by the plurality of preset categories in the second original data;

Sorting the plurality of preset categories according to the preset rule and the size relationship of the second proportions, to obtain a second sorting result;

When the first sorting result is consistent with the second sorting result, the step of determining the sampling ratio corresponding to each preset classification is performed.
The method according to claim 2, wherein

When the first sorting result and the second sorting result are inconsistent, re-determine the second time period, and determine the second sorting result as the first sorting result;

Return to the step of obtaining the second original data in the second period.
The method according to any one of claims 1 to 3, wherein when the proportions of at least two preset categories are the same, the order of the at least two preset categories is determined according to the priorities of the at least two preset categories.
A training data resampling device, the device comprising:

The first obtaining module is used to obtain the first original data in the first period;

A first calculation module, configured to calculate a first proportion respectively occupied by multiple preset categories in the first original data;

A first sorting module, configured to sort the plurality of preset categories according to a preset rule according to the size relationship of the first ratio, to obtain a first sorting result;

A ratio obtaining module, configured to determine a sampling ratio corresponding to each preset category according to the first sorting result of each preset category and a preset correspondence, where the preset correspondence is the correspondence between the first sorting result and the sampling ratio;

The re-sampling module is used for re-sampling the training data used for modeling according to the sampling ratios respectively corresponding to the plurality of preset categories.
The device according to claim 5, further comprising:

A second obtaining module, configured to obtain second original data within a second period after the first sorting module obtains the first sorting result;

A second calculation module, configured to calculate a second proportion respectively occupied by the plurality of preset categories in the second original data;

A second sorting module, configured to sort the plurality of preset categories according to the preset rule and the size relationship of the second proportions, to obtain a second sorting result;

The ranking comparison module is configured to trigger the ratio acquisition module to determine the sampling ratio corresponding to each preset category when the first ranking result is consistent with the second ranking result.
The apparatus according to claim 6, wherein the ranking comparison module is further configured to:

When the first sorting result and the second sorting result are inconsistent, re-determine the second time period, and determine the second sorting result as the first sorting result;

The second acquisition module is triggered to acquire the second original data in the second period.
The apparatus according to any one of claims 5-7, wherein, when the proportions of at least two preset categories are the same, the order of the at least two preset categories is determined according to the priorities of the at least two preset categories.
A computer-readable storage medium on which a computer program is stored, which when executed by a processor implements the method of any one of claims 1-4.
An electronic device, including:

Memory, on which computer programs are stored;

A processor, configured to execute the computer program in the memory, to implement the method of any one of claims 1-4.