WO2020168796A1 - Data augmentation method based on high-dimensional spatial sampling - Google Patents

Data augmentation method based on high-dimensional spatial sampling

Info

Publication number
WO2020168796A1
WO2020168796A1 (PCT/CN2019/125431)
Authority
WO
WIPO (PCT)
Prior art keywords
training
data set
data
classifier
sampling
Prior art date
Application number
PCT/CN2019/125431
Other languages
French (fr)
Chinese (zh)
Inventor
王卡风
须成忠
曹廷荣
熊超
Original Assignee
深圳先进技术研究院
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by 深圳先进技术研究院
Publication of WO2020168796A1 publication Critical patent/WO2020168796A1/en

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 Computing arrangements based on biological models
    • G06N3/02 Neural networks
    • G06N3/04 Architecture, e.g. interconnection topology
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 Computing arrangements based on biological models
    • G06N3/02 Neural networks
    • G06N3/08 Learning methods

Definitions

  • the present invention relates to the technical field of data enhancement and, more specifically, to a method that enhances data by lifting the training set to a higher-dimensional space and then generating new samples there by Monte Carlo sampling.
  • in machine learning and deep learning, accuracy is generally improved through data augmentation or by adjusting the classification and regression algorithms.
  • data enhancement is an important branch of machine-learning and deep-learning research, and obtaining sufficient, effective data is a key means of reaching high accuracy. In practice the data are often insufficient, or the original data contain many invalid, redundant samples; in such cases more data must be found or the original data must be enhanced effectively. Real problems may involve many data categories but too few samples, which is a major obstacle; one solution is to enhance the original data so as to obtain more data suited to the task.
  • to make the fullest use of the training data, the training data are generally "expanded" through a series of random transformations so that the machine-learning model never sees exactly the same training sample twice; this helps prevent overfitting and thereby improves test accuracy.
  • two recent data enhancement methods are described below. The first is the AutoAugment method: the paper "AutoAugment: Learning Augmentation Policies from Data" by Ekin D. Cubuk et al. learns, through a model, an augmentation policy suited to the current task. It uses reinforcement learning to find the best image-transformation strategy from the data themselves and learns different combinations of augmentation operations for different tasks; it is a search over an existing set of image operations applied to the original images. In essence, however, it does not differ from commonly used algorithms (rotation, affine transforms, and so on): neither the sampling space nor the sampling dimension changes.
  • the second is GAN-based enhancement: generative adversarial networks (GANs) learn the data distribution through a model and randomly generate images consistent with the distribution of the training set, but this method cannot directly improve classifier accuracy.
  • in view of these problems, the present invention proposes a technical solution that lifts the training set to a higher dimension, generates new samples from the lifted data set by Monte Carlo sampling, and jointly optimizes this with the choice of machine-learning algorithm and the tuning of its hyperparameters, thereby improving accuracy, as follows:
  • the present invention provides a data enhancement method based on high-dimensional space sampling.
  • the method first divides the data set to be enhanced into a training set and a test set.
  • the sampler performs sampling on the first data set by using a Monte Carlo method to obtain a second data set
  • the training model further includes a Metropolis-Hastings corrector.
  • the "sampling device uses the Monte Carlo method to sample on the first data set to obtain a second data set"
  • the steps include:
  • the Metropolis-Hastings corrector determines whether the candidate sample conforms to the distribution properties consistent with the first data set by setting an acceptance/rejection ratio, wherein the acceptance/rejection ratio ranges from 0.8 to 1.4.
  • in step S1, the step of "mapping the training set from the low-dimensional space P to the high-dimensional space D to obtain the first data set" includes:
  • the dimensionality of the training set is increased through the dictionary matrix and the dimensionality increase operator to obtain the first data set.
  • the dictionary matrix is generated randomly or trained with the KSVD algorithm on the training set, and the dimension-raising operator is any one of the LASSO function, convolution, or encoding.
  • the Monte Carlo method is the stochastic gradient Langevin dynamics (SGLD) sampling method or the stochastic gradient Hamiltonian Monte Carlo (sgHMC) sampling method.
  • the classifier is selected from any one of a support vector machine algorithm, a random forest algorithm, or a convolutional neural network algorithm.
  • a dimension-raising operator or a dimension-reducing operator is used to bring the training set, the second data set, and the test set into the same dimensional space, and the raising/reducing operator pair is any one of convolution/deconvolution, encoding/decoding, or the LASSO function.
  • in step S5, the step of "inputting the dimension-controlled training set and the second data set into the classifier for training" includes:
  • after the dimension-controlled training set and the second data set are combined, they are input into a classifier for training.
  • the dimension-controlled training set and the second data set are combined at a ratio of 4:1 to 7:1.
  • the method provided by the present invention samples the data in a higher dimension; lifting the dimension with the LASSO function removes the restriction of sampling in only the original data dimensions, achieves the goal of data enhancement, and at the same time avoids the curse of dimensionality and reduces the resources consumed by sampling.
  • the performance of subsequent classifiers has also been significantly improved.
  • the new samples generated by this method are better suited to classification by the classifier.
  • Fig. 1 is a flowchart of a method for realizing data enhancement by sampling in a high-dimensional space provided by an embodiment of the present invention.
  • Fig. 2 is a design flow chart of a gradient estimator provided by an embodiment of the present invention.
  • Fig. 3 is an implementation flowchart of a sampling algorithm after dimension lifting with compressed sensing according to an embodiment of the present invention.
  • Fig. 4 is a design flow chart of the Metropolis-Hastings corrector provided by an embodiment of the present invention.
  • Fig. 5 is a design flowchart for training a training model provided by an embodiment of the present invention.
  • the present invention provides a data enhancement method based on high-dimensional space sampling.
  • inspired by compressed sensing, the method assumes that every sample is a low-dimensional measurement of some high-dimensional sparse vector and that a continuous probability distribution exists in that high-dimensional space; new samples are drawn from this continuous distribution, and these new high-dimensional samples are more favorable for classification.
  • FIG. 1 is a flowchart of a method for realizing data enhancement by sampling in a high-dimensional space according to an embodiment of the present invention. The present invention will be explained in detail below with reference to FIG. 1.
  • This method first divides the data set that needs to be enhanced into a training set and a test set, and specifically includes the following steps:
  • Step S1 Map the training set from the low-dimensional space P to the high-dimensional space D to obtain a first data set.
  • this step further includes S11, randomly generating a compressed-sensing dictionary matrix, or training a dictionary matrix with the KSVD algorithm on the training set; and S12, using the dictionary matrix generated in step S11 together with the dimension-raising operator to lift the training set and obtain the first data set.
  • the dimension-raising operator can be any of the LASSO function, convolution, or encoding, and is preferably the LASSO function; lifting with LASSO not only removes the restriction of sampling in only the original data dimensions, achieving data enhancement, but also avoids the curse of dimensionality and reduces the resources consumed by sampling.
  • Step S2 build an initial training model, which includes a sampler and a classifier.
  • the sampler used in the training model performs sampling based on the Monte Carlo method.
  • the Monte Carlo methods that can be used include the stochastic gradient Langevin dynamics (SGLD) sampling method, the stochastic gradient Hamiltonian Monte Carlo (sgHMC) sampling method, and the like.
  • the classifiers used in the training model include support vector machines (SVM), random forests and other shallow learning algorithms, and convolutional neural networks (CNN) and other deep learning algorithms.
  • a Metropolis-Hastings corrector can be added to the training model; the corrector judges whether a drawn sample conforms to the distribution of the first data set (or of the training set before dimension lifting): if it does, the sample is accepted, otherwise it is rejected. Adding a Metropolis-Hastings corrector helps collect samples that meet the requirements.
  • Step S3 the sampler performs sampling on the first data set by using the Monte Carlo method to obtain the second data set.
  • the sampler used in the present invention includes a gradient estimator. Please refer to Fig. 2.
  • Fig. 2 is a design flowchart of the gradient estimator provided by an embodiment of the present invention. Its principle is as follows: a small mini-batch of size S is first drawn at random from the original data set, the stochastic gradient g_m of the initial value X_0 is computed on that mini-batch, and the value of the next candidate sample X_T is then obtained from g_m. Based on this gradient estimator, the embodiment of the present invention provides a concrete sampling algorithm whose implementation is shown in Fig. 3.
  • step S31, draw an initial value X_0 on the first data set using independent, identically distributed white noise; step S32, in the sampler with the gradient estimator, iterate T times from the initial value X_0 to find the next candidate sample X_T; step S33, use the Metropolis-Hastings corrector to judge whether X_T conforms to the distribution of the first data set and so decide whether to accept X_T as a new valid sample; when the result is yes, add the current candidate sample to the second data set and return to step S31; when the result is no, replace the current candidate sample with a new initial sample and return to step S32.
  • after K rounds, K random samples X_1, X_2, X_3, ..., X_K have been drawn from the D-dimensional distribution; these samples form the second data set in the high-dimensional space.
  • the Metropolis-Hastings corrector determines whether the candidate sample conforms to the distribution properties consistent with the first data set by setting an acceptance/rejection ratio.
  • the acceptance/rejection ratio ranges from 0.8 to 1.4.
  • the implementation of the Metropolis-Hastings corrector is shown in Fig. 4: first, the negative log-density and its derivative are evaluated for X_0 and X_T on the entire data set; then the transition probability from X_0 to X_T and the transition probability from X_T to X_0 are computed, and the ratio θ of the two probability values is obtained; a number ε is then drawn uniformly from (0, 1) and εd is compared with θ, where d is the preset acceptance/rejection ratio value: if εd < θ, X_T is accepted, otherwise it is rejected.
  • Step S4 controlling the training set, the second data set and the test set to be in the same dimensional space.
  • in this step, a dimension-raising or dimension-reducing operator is used to bring the training set, the second data set, and the test set into the same dimensional space, so as to obtain data in the dimensionality the classifier requires. Specifically, the dimension-raising operator lifts the training set and the test set so that all three data sets are distributed in the D-dimensional space, or the dimension-reducing operator reduces the second data set so that all three data sets are distributed in the P-dimensional space.
  • throughout the invention, the dimension-raising/dimension-reducing operators used form a pair of algorithms; any pair such as convolution/deconvolution, encoder/decoder, or LASSO operators can be chosen.
  • Step S5: input the dimension-controlled training set and the second data set into the classifier to train the training model.
  • the training result is evaluated by the accuracy obtained during training, and training ends when the accuracy saturates and no longer rises; during training, the accuracy is fed back to tune the sampler's burn-in steps and sampling-interval steps, the corrector's acceptance/rejection ratio, and the classifier algorithm and its corresponding hyperparameters. The specific training process is shown in Fig. 5.
  • the dimension-controlled training set may first be input into the classifier for training and, after that training completes, the dimension-controlled second data set is then input into the classifier to continue training; alternatively, the dimension-controlled training set and second data set may be merged and then input into the classifier for training.
  • the dimension-controlled training set and the second data set are merged at a ratio of 4:1 to 7:1.
  • Step S6 Evaluate the performance of the trained training model by using the dimensional controlled test set.
  • the data enhancement method provided by the present invention not only removes the restriction of sampling in only the original data dimensions but also brings a clear improvement in downstream classifier performance, and the new samples it generates are better suited to classification.

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Data Mining & Analysis (AREA)
  • General Health & Medical Sciences (AREA)
  • Biomedical Technology (AREA)
  • Biophysics (AREA)
  • Computational Linguistics (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Evolutionary Computation (AREA)
  • Artificial Intelligence (AREA)
  • Molecular Biology (AREA)
  • Computing Systems (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Mathematical Physics (AREA)
  • Software Systems (AREA)
  • Health & Medical Sciences (AREA)
  • Character Discrimination (AREA)

Abstract

A data augmentation method based on high-dimensional spatial sampling. The method first divides a data set to be augmented into a training set and a test set, and comprises: S1, mapping the training set from a low-dimensional space P to a high-dimensional space D to obtain a first data set; S2, building a training model comprising a sampler and a classifier; S3, the sampler sampling the first data set with a Monte Carlo method to obtain a second data set; S4, controlling the training set, the second data set, and the test set to be in the same dimensional space; S5, inputting the dimension-controlled training set and second data set into the classifier to train the training model; and S6, evaluating the performance of the trained training model with the dimension-controlled test set. The method is free of the restriction of sampling only in the original data dimensions, and the new samples it generates are better suited to classification by a classifier.

Description

A data augmentation method based on high-dimensional spatial sampling
Technical Field
The present invention relates to the technical field of data augmentation and, more specifically, to a method that augments data by lifting the training set to a higher-dimensional space and then generating new samples there by Monte Carlo sampling.
Background Art
In machine learning and deep learning, accuracy is generally improved through data augmentation or by adjusting the classification and regression algorithms. Data augmentation is an important branch of machine-learning and deep-learning research, and obtaining sufficient, effective data is a key means of reaching high accuracy. In practice the data are often insufficient, or the original data contain many invalid, redundant samples; in such cases more data must be found or the original data must be augmented effectively. Real problems may involve many data categories but too few samples, which is a major obstacle to solving the problem; one solution is to augment the original data so as to obtain more data suited to the task. To make the fullest possible use of the training data, it is generally "expanded" through a series of random transformations so that the machine-learning model never sees exactly the same training sample twice; this helps prevent overfitting and thereby improves test accuracy. Two recent data augmentation methods are described below. The first is AutoAugment: the paper "AutoAugment: Learning Augmentation Policies from Data" by Ekin D. Cubuk et al. learns, through a model, an augmentation policy suited to the current task. It uses reinforcement learning to find the best image-transformation strategy from the data themselves and learns different combinations of augmentation operations for different tasks; it is a search over an existing set of image operations applied to the original images. In essence, however, it does not differ from commonly used algorithms (rotation, affine transforms, and so on): neither the sampling space nor the sampling dimension changes. The second is the GAN-based method: generative adversarial networks (GANs) learn the data distribution through a model and randomly generate images consistent with the distribution of the training set, but this method cannot directly improve classifier accuracy.
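As a minimal illustration of the kind of label-preserving random transformation referred to above (a sketch only, not tied to any particular cited method), the Python snippet below randomly flips and shifts an image array; the specific operations and their ranges are assumptions chosen for the example.

```python
import numpy as np

def random_transform(image, rng=None):
    """Apply a simple random flip and small shift - the kind of label-preserving
    transformation classically used to "expand" a training set."""
    rng = rng or np.random.default_rng()
    out = image.copy()
    if rng.random() < 0.5:                    # random horizontal flip
        out = np.fliplr(out)
    dy, dx = rng.integers(-2, 3, size=2)      # small random translation
    out = np.roll(out, shift=(int(dy), int(dx)), axis=(0, 1))
    return out
```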
Summary of the Invention
In view of the above problems, the present invention proposes a technical solution that lifts the training set to a higher dimension, generates new samples from the lifted data set by Monte Carlo sampling, and jointly optimizes this with the choice of machine-learning algorithm and the tuning of its hyperparameters, thereby improving machine-learning accuracy. The solution is as follows:
The present invention provides a data augmentation method based on high-dimensional space sampling. The method first divides the data set to be augmented into a training set and a test set, and specifically includes:
S1, mapping the training set from a low-dimensional space P to a high-dimensional space D to obtain a first data set;
S2, building a training model, the training model including a sampler and a classifier;
S3, the sampler sampling the first data set with a Monte Carlo method to obtain a second data set;
S4, controlling the training set, the second data set, and the test set to be in the same dimensional space;
S5, inputting the dimension-controlled training set and second data set into the classifier and training the training model;
S6, evaluating the performance of the trained training model with the dimension-controlled test set.
Preferably, the training model further includes a Metropolis-Hastings corrector, and in step S3 the step in which "the sampler samples the first data set with a Monte Carlo method to obtain a second data set" includes:
S31, randomly selecting a sample from the first data set as an initial sample;
S32, performing T iterations on the initial sample to obtain a candidate sample;
S33, using the Metropolis-Hastings corrector to judge whether the candidate sample conforms to the distribution of the first data set; when the result is yes, adding the current candidate sample to the second data set and returning to step S31; when the result is no, replacing the current candidate sample with a new initial sample and returning to step S32.
More preferably, the Metropolis-Hastings corrector judges whether the candidate sample conforms to the distribution of the first data set by setting an acceptance/rejection ratio, the acceptance/rejection ratio ranging from 0.8 to 1.4.
Preferably, in step S1 the step of "mapping the training set from the low-dimensional space P to the high-dimensional space D to obtain the first data set" includes:
lifting the dimensionality of the training set with a dictionary matrix and a dimension-raising operator to obtain the first data set.
More preferably, the dictionary matrix is generated randomly or trained with the KSVD algorithm on the training set, and the dimension-raising operator is any one of the LASSO function, convolution, or encoding.
Preferably, the Monte Carlo method is the stochastic gradient Langevin dynamics sampling method or the stochastic gradient Hamiltonian Monte Carlo sampling method.
Preferably, the classifier is any one of a support vector machine algorithm, a random forest algorithm, or a convolutional neural network algorithm.
Preferably, a dimension-raising operator or a dimension-reducing operator is used to control the training set, the second data set, and the test set to be in the same dimensional space, the raising/reducing operator pair being any one of convolution/deconvolution, encoding/decoding, or the LASSO function.
Preferably, in step S5 the step of "inputting the dimension-controlled training set and second data set into the classifier for training" includes:
first inputting the dimension-controlled training set into the classifier for training and, after that training completes, inputting the dimension-controlled second data set into the classifier to continue training; or
merging the dimension-controlled training set and the second data set and then inputting them into the classifier for training.
More preferably, the dimension-controlled training set and the second data set are merged at a ratio of 4:1 to 7:1.
Compared with the prior art, the method provided by the present invention samples the data in a higher dimension. Lifting the dimension with the LASSO function removes the restriction of sampling in only the original data dimensions and achieves the goal of data augmentation, while also avoiding the curse of dimensionality and reducing the resources consumed by sampling. The performance of the downstream classifier is also clearly improved, and experiments confirm that the new samples generated by this method are better suited to classification.
Description of the Drawings
Fig. 1 is a flowchart of the method for data augmentation by sampling in a high-dimensional space provided by an embodiment of the present invention.
Fig. 2 is a design flowchart of the gradient estimator provided by an embodiment of the present invention.
Fig. 3 is an implementation flowchart of a sampling algorithm after dimension lifting with compressed sensing according to an embodiment of the present invention.
Fig. 4 is a design flowchart of the Metropolis-Hastings corrector provided by an embodiment of the present invention.
Fig. 5 is a design flowchart of the training of the training model provided by an embodiment of the present invention.
Detailed Description
To make the purpose, technical solution, and advantages of this application clearer, the application is described in further detail below with reference to the accompanying drawings and embodiments. It should be understood that the specific embodiments described here are only intended to explain the application and are not intended to limit it.
The present invention provides a data augmentation method based on high-dimensional space sampling. Inspired by compressed sensing, the method assumes that every sample is a low-dimensional measurement of some high-dimensional sparse vector and that a continuous probability distribution exists in that high-dimensional space; new samples are drawn from this continuous distribution, and these new high-dimensional samples are more favorable for classification. Please refer to Fig. 1, a flowchart of the method for data augmentation by sampling in a high-dimensional space provided by an embodiment of the present invention; the invention is explained in detail below with reference to Fig. 1.
The method first divides the data set to be augmented into a training set and a test set, and specifically includes the following steps:
Step S1: map the training set from the low-dimensional space P to the high-dimensional space D to obtain the first data set. This step further includes S11, randomly generating a compressed-sensing dictionary matrix, or training a dictionary matrix with the KSVD algorithm on the training set; and S12, using the dictionary matrix generated in step S11 together with the dimension-raising operator to lift the training set and obtain the first data set. According to some embodiments of the invention, the dimension-raising operator may be any of the LASSO function, convolution, or encoding, preferably the LASSO function, which not only removes the restriction of sampling in only the original data dimensions and achieves data augmentation, but also avoids the curse of dimensionality and reduces the resources consumed by sampling.
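A minimal Python sketch of the LASSO-based lift in step S1, assuming a randomly generated dictionary A of shape (P, D); the helper name lift_with_lasso, the alpha penalty, and the toy sizes are illustrative assumptions rather than the exact embodiment.

```python
import numpy as np
from sklearn.linear_model import Lasso

def lift_with_lasso(X_low, dictionary, alpha=0.01):
    """Map each P-dimensional row of X_low to a D-dimensional sparse code z
    such that X_low[i] is approximately dictionary @ z (dictionary shape: (P, D))."""
    n = X_low.shape[0]
    D = dictionary.shape[1]
    X_high = np.zeros((n, D))
    for i in range(n):
        lasso = Lasso(alpha=alpha, fit_intercept=False, max_iter=5000)
        lasso.fit(dictionary, X_low[i])   # solves min ||x - A z||^2 + alpha * ||z||_1
        X_high[i] = lasso.coef_
    return X_high

# Illustrative usage with a random dictionary lifting P = 64 to D = 256
rng = np.random.default_rng(0)
A = rng.standard_normal((64, 256)) / np.sqrt(64)
X_train_low = rng.standard_normal((10, 64))
X_train_high = lift_with_lasso(X_train_low, A)   # the "first data set"
```

A dictionary trained with KSVD on the training set, which the text also allows, would simply take the place of the random A.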
Step S2: build the initial training model, which includes a sampler and a classifier. The sampler used in the training model samples with a Monte Carlo method; according to some embodiments of the invention, the usable Monte Carlo methods include stochastic gradient Langevin dynamics (SGLD) sampling, stochastic gradient Hamiltonian Monte Carlo (sgHMC) sampling, and the like. The classifiers used in the training model include shallow learning algorithms such as support vector machines (SVM) and random forests, as well as deep learning algorithms such as convolutional neural networks (CNN). According to other embodiments of the invention, a Metropolis-Hastings corrector can also be added to the training model; the corrector judges whether a drawn sample conforms to the distribution of the first data set (or of the training set before dimension lifting): if it does, the sample is accepted, otherwise it is rejected. Adding a Metropolis-Hastings corrector helps collect samples that meet the requirements.
Step S3: the sampler samples the first data set with the Monte Carlo method to obtain the second data set. The sampler used in the invention includes a gradient estimator; please refer to Fig. 2, a design flowchart of the gradient estimator. Its principle is as follows: a small mini-batch of size S is first drawn at random from the original data set, the stochastic gradient g_m of the initial value X_0 is computed on that mini-batch, and the value of the next candidate sample X_T is then obtained from g_m. Based on this gradient estimator, an embodiment of the invention provides a concrete sampling algorithm, shown in Fig. 3: step S31, draw an initial value X_0 on the first data set using independent, identically distributed white noise; step S32, in the sampler with the gradient estimator, iterate T times from X_0 to find the next candidate sample X_T; step S33, use the Metropolis-Hastings corrector to judge whether X_T conforms to the distribution of the first data set and so decide whether to accept X_T as a new valid sample; when the result is yes, add the current candidate sample to the second data set and return to step S31; when the result is no, replace the current candidate sample with a new initial sample and return to step S32. After K rounds, K random samples X_1, X_2, X_3, ..., X_K have been drawn from the D-dimensional distribution; these samples form the second data set in the high-dimensional space.
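The sketch below shows one stochastic gradient Langevin dynamics (SGLD) update loop matching the description of the gradient estimator; the callable grad_neg_log_density (the density model over the lifted data), the step size, the iteration count, and the batch size are assumptions made for illustration, not the exact sampler of the disclosure.

```python
import numpy as np

def sgld_candidate(x0, X_high, grad_neg_log_density, step=1e-3, T=50,
                   batch_size=64, rng=None):
    """Run T stochastic gradient Langevin dynamics updates starting from x0.

    grad_neg_log_density(x, batch) must return an estimate of grad U(x),
    where U is the negative log-density of the lifted (first) data set,
    computed on the given mini-batch (this plays the role of g_m)."""
    rng = rng or np.random.default_rng()
    x = np.array(x0, dtype=float)
    for _ in range(T):
        idx = rng.choice(len(X_high), size=min(batch_size, len(X_high)), replace=False)
        g = grad_neg_log_density(x, X_high[idx])     # stochastic gradient g_m
        noise = rng.normal(scale=np.sqrt(step), size=x.shape)
        x = x - 0.5 * step * g + noise               # Langevin update
    return x                                          # candidate sample X_T
```

In practice grad_neg_log_density could be, for example, the gradient of a kernel density estimate fitted to the first data set; the text leaves this choice to the gradient estimator of Fig. 2.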
In step S33, the Metropolis-Hastings corrector judges whether the candidate sample conforms to the distribution of the first data set by setting an acceptance/rejection ratio; according to some embodiments of the invention, the acceptance/rejection ratio ranges from 0.8 to 1.4. Further, the implementation of the Metropolis-Hastings corrector is shown in Fig. 4: first, the negative log-density and its derivative are evaluated for X_0 and X_T on the entire data set; then the transition probability from X_0 to X_T and the transition probability from X_T to X_0 are computed, and the ratio θ of the two probability values is obtained; a number ε is then drawn uniformly from (0, 1) and εd is compared with θ, where d is the preset acceptance/rejection ratio value: if εd < θ, X_T is accepted, otherwise it is rejected.
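A sketch of the acceptance test of Fig. 4: here θ is taken as the standard Metropolis-Hastings ratio combining the densities with the two transition probabilities (one reading of the text), and neg_log_density and log_transition are assumed callables, while the scaled comparison ε·d < θ with d in the 0.8 to 1.4 range follows the description.

```python
import numpy as np

def mh_accept(x0, xT, neg_log_density, log_transition, d=1.0, rng=None):
    """Metropolis-Hastings-style correction with a tunable accept/reject ratio d.

    neg_log_density(x): negative log-density U(x) evaluated on the full data set.
    log_transition(a, b): log probability of proposing b when starting from a."""
    rng = rng or np.random.default_rng()
    log_theta = (-neg_log_density(xT) + log_transition(xT, x0)
                 - (-neg_log_density(x0) + log_transition(x0, xT)))
    theta = float(np.exp(np.clip(log_theta, -50.0, 50.0)))  # ratio theta, clipped for stability
    eps = rng.uniform(0.0, 1.0)              # random number between 0 and 1
    return eps * d < theta                   # accept X_T when eps*d < theta
```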
Step S4: control the training set, the second data set, and the test set to be in the same dimensional space. In this step, a dimension-raising or dimension-reducing operator is used to bring the training set, the second data set, and the test set into the same dimensional space, so as to obtain data in the dimensionality the classifier requires. Specifically, the dimension-raising operator lifts the training set and the test set so that all three data sets are distributed in the D-dimensional space, or the dimension-reducing operator reduces the second data set so that all three data sets are distributed in the P-dimensional space. Throughout the invention, the dimension-raising/dimension-reducing operators used form a pair of algorithms; any pair such as convolution/deconvolution, encoder/decoder, or LASSO operators can be chosen.
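Continuing the earlier sketch, step S4 can be realized either by mapping the D-dimensional samples back to P dimensions with the same (assumed) dictionary A, or by lifting the training and test sets with lift_with_lasso; all variable names here are carried over from the illustrative code above.

```python
def reduce_with_dictionary(X_high, dictionary):
    """Inverse of the LASSO lift: reconstruct P-dimensional data as z @ A.T."""
    return X_high @ dictionary.T            # dictionary A has shape (P, D)

# Option 1: bring the sampled second data set down to the P-dimensional space
# X_sampled_low = reduce_with_dictionary(X_sampled_high, A)
# Option 2: lift the training and test sets up to D dimensions instead
# X_train_high = lift_with_lasso(X_train_low, A)
# X_test_high  = lift_with_lasso(X_test_low, A)
```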
Step S5: input the dimension-controlled training set and the second data set into the classifier and train the training model. In this step, the training result is evaluated by the accuracy obtained during training, and training ends when the accuracy saturates and no longer rises; during training, the accuracy is fed back to tune the sampler's burn-in steps and sampling-interval steps, the corrector's acceptance/rejection ratio, and the classifier algorithm and its corresponding hyperparameters. The specific training process is shown in Fig. 5. According to some embodiments of the invention, in this step the dimension-controlled training set may first be input into the classifier for training and, after that training completes, the dimension-controlled second data set is then input into the classifier to continue training; alternatively, the dimension-controlled training set and second data set may be merged and then input into the classifier for training. According to other embodiments of the invention, the dimension-controlled training set and the second data set are merged at a ratio of 4:1 to 7:1.
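A sketch of step S5 under the merge option, using scikit-learn's SVC as one of the classifiers named in the text; the 5:1 mix (within the stated 4:1 to 7:1 range) and the assumption that each generated sample carries a label y_sampled (how labels are attached to generated samples is not specified in this passage, e.g. sampling could be done per class) are illustrative choices.

```python
import numpy as np
from sklearn.svm import SVC

def train_with_augmentation(X_train, y_train, X_sampled, y_sampled,
                            mix_ratio=5, rng=None):
    """Merge the original and generated data at roughly mix_ratio:1 and fit an SVM."""
    rng = rng or np.random.default_rng()
    n_keep = min(len(X_sampled), max(1, len(X_train) // mix_ratio))
    idx = rng.choice(len(X_sampled), size=n_keep, replace=False)
    X_merged = np.vstack([X_train, X_sampled[idx]])
    y_merged = np.concatenate([y_train, y_sampled[idx]])
    clf = SVC(kernel="rbf", C=1.0)          # one of the classifiers named in the text
    clf.fit(X_merged, y_merged)
    return clf

# Step S6 (evaluation on the dimension-controlled test set) would then be, e.g.:
# model = train_with_augmentation(X_train, y_train, X_sampled, y_sampled)
# accuracy = model.score(X_test, y_test)
```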
Step S6: evaluate the performance of the trained training model with the dimension-controlled test set.
Experiments confirm that the data augmentation method provided by the present invention not only removes the restriction of sampling in only the original data dimensions but also brings a clear improvement in downstream classifier performance, and the new samples it generates are better suited to classification.
The above are only preferred embodiments of the present invention. It should be noted that those of ordinary skill in the art can make several improvements and refinements without departing from the principle of the invention, and such improvements and refinements should also be regarded as falling within the protection scope of the invention.

Claims (10)

  1. A data augmentation method based on high-dimensional space sampling, the method first dividing the data set to be augmented into a training set and a test set, characterized in that the method comprises:
    S1, mapping the training set from a low-dimensional space P to a high-dimensional space D to obtain a first data set;
    S2, building a training model, the training model comprising a sampler and a classifier;
    S3, the sampler sampling the first data set with a Monte Carlo method to obtain a second data set;
    S4, controlling the training set, the second data set, and the test set to be in the same dimensional space;
    S5, inputting the dimension-controlled training set and second data set into the classifier and training the training model;
    S6, evaluating the performance of the trained training model with the dimension-controlled test set.
  2. The method of claim 1, characterized in that the training model further comprises a Metropolis-Hastings corrector, and in step S3 the step in which "the sampler samples the first data set with a Monte Carlo method to obtain a second data set" comprises:
    S31, randomly selecting a sample from the first data set as an initial sample;
    S32, performing T iterations on the initial sample to obtain a candidate sample;
    S33, using the Metropolis-Hastings corrector to judge whether the candidate sample conforms to the distribution of the first data set; when the result is yes, adding the current candidate sample to the second data set and returning to step S31; when the result is no, replacing the current candidate sample with a new initial sample and returning to step S32.
  3. The method of claim 2, characterized in that the Metropolis-Hastings corrector judges whether the candidate sample conforms to the distribution of the first data set by setting an acceptance/rejection ratio, wherein the acceptance/rejection ratio ranges from 0.8 to 1.4.
  4. The method of claim 1, characterized in that in step S1 the step of "mapping the training set from the low-dimensional space P to the high-dimensional space D to obtain the first data set" comprises:
    lifting the dimensionality of the training set with a dictionary matrix and a dimension-raising operator to obtain the first data set.
  5. The method of claim 4, characterized in that the dictionary matrix is generated randomly or trained with the KSVD algorithm on the training set, and the dimension-raising operator is any one of the LASSO function, convolution, or encoding.
  6. The method of claim 1, characterized in that the Monte Carlo method is the stochastic gradient Langevin dynamics sampling method or the stochastic gradient Hamiltonian Monte Carlo sampling method.
  7. The method of claim 1, characterized in that the classifier is any one of a support vector machine algorithm, a random forest algorithm, or a convolutional neural network algorithm.
  8. The method of claim 1 or 4, characterized in that a dimension-raising operator or a dimension-reducing operator is used to control the training set, the second data set, and the test set to be in the same dimensional space, the raising/reducing operator being any pair of convolution/deconvolution, encoding/decoding, or LASSO operators.
  9. The method of claim 1, characterized in that in step S5 the step of "inputting the dimension-controlled training set and second data set into the classifier for training" comprises:
    first inputting the dimension-controlled training set into the classifier for training and, after that training completes, inputting the dimension-controlled second data set into the classifier to continue training; or
    merging the dimension-controlled training set and the second data set and then inputting them into the classifier for training.
  10. The method of claim 9, characterized in that the dimension-controlled training set and the second data set are merged at a ratio of 4:1 to 7:1.
PCT/CN2019/125431 2019-02-19 2019-12-14 Data augmentation method based on high-dimensional spatial sampling WO2020168796A1 (en)

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
CN201910123936.6A CN109886333A (en) 2019-02-19 2019-02-19 A kind of data enhancement methods based on higher dimensional space sampling
CN201910123936.6 2019-02-19

Publications (1)

Publication Number Publication Date
WO2020168796A1 true WO2020168796A1 (en) 2020-08-27

Family

ID=66928457

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/CN2019/125431 WO2020168796A1 (en) 2019-02-19 2019-12-14 Data augmentation method based on high-dimensional spatial sampling

Country Status (2)

Country Link
CN (1) CN109886333A (en)
WO (1) WO2020168796A1 (en)

Families Citing this family (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN109886333A (en) * 2019-02-19 2019-06-14 深圳先进技术研究院 A kind of data enhancement methods based on higher dimensional space sampling
CN111027717A (en) * 2019-12-11 2020-04-17 支付宝(杭州)信息技术有限公司 Model training method and system

Patent Citations (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20140324742A1 (en) * 2013-04-30 2014-10-30 Hewlett-Packard Development Company, L.P. Support vector machine
CN106407664A (en) * 2016-08-31 2017-02-15 深圳市中识创新科技有限公司 Domain self-adaptive method and device of breathing gas diagnosis system
WO2018187950A1 (en) * 2017-04-12 2018-10-18 邹霞 Facial recognition method based on kernel discriminant analysis
CN109214401A (en) * 2017-06-30 2019-01-15 清华大学 SAR image classification method and device based on stratification autocoder
CN108921123A (en) * 2018-07-17 2018-11-30 重庆科技学院 A kind of face identification method based on double data enhancing
CN109886333A (en) * 2019-02-19 2019-06-14 深圳先进技术研究院 A kind of data enhancement methods based on higher dimensional space sampling

Cited By (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN112183300A (en) * 2020-09-23 2021-01-05 厦门大学 AIS radiation source identification method and system based on multi-level sparse representation
CN112183300B (en) * 2020-09-23 2024-03-22 厦门大学 AIS radiation source identification method and system based on multi-level sparse representation
CN113626414A (en) * 2021-08-26 2021-11-09 国家电网有限公司 Data dimension reduction and denoising method for high-dimensional data set
CN117655118A (en) * 2024-01-29 2024-03-08 太原科技大学 Strip steel plate shape control method and device with multiple modes fused
CN117655118B (en) * 2024-01-29 2024-04-19 太原科技大学 Strip steel plate shape control method and device with multiple modes fused

Also Published As

Publication number Publication date
CN109886333A (en) 2019-06-14

Similar Documents

Publication Publication Date Title
WO2020168796A1 (en) Data augmentation method based on high-dimensional spatial sampling
CN108564129B (en) Trajectory data classification method based on generation countermeasure network
Gao et al. Balanced semisupervised generative adversarial network for damage assessment from low‐data imbalanced‐class regime
WO2022121289A1 (en) Methods and systems for mining minority-class data samples for training neural network
US20180253640A1 (en) Hybrid architecture system and method for high-dimensional sequence processing
CN107690663B (en) Whitening neural network layer
JP6244059B2 (en) Face image verification method and face image verification system based on reference image
JP7266674B2 (en) Image classification model training method, image processing method and apparatus
CN110929679B (en) GAN-based unsupervised self-adaptive pedestrian re-identification method
CN110288030A (en) Image-recognizing method, device and equipment based on lightweight network model
CN108446676B (en) Face image age discrimination method based on ordered coding and multilayer random projection
JP2015095212A (en) Identifier, identification program, and identification method
US20220036231A1 (en) Method and device for processing quantum data
CN110543906B (en) Automatic skin recognition method based on Mask R-CNN model
CN112766399B (en) Self-adaptive neural network training method for image recognition
US11544532B2 (en) Generative adversarial network with dynamic capacity expansion for continual learning
CN109740695A (en) Image-recognizing method based on adaptive full convolution attention network
CN116668327A (en) Small sample malicious flow classification increment learning method and system based on dynamic retraining
CN112085086A (en) Multi-source transfer learning method based on graph convolution neural network
Xue et al. Research on edge detection operator of a convolutional neural network
CN109101984B (en) Image identification method and device based on convolutional neural network
CN111652264B (en) Negative migration sample screening method based on maximum mean value difference
CN111242176B (en) Method and device for processing computer vision task and electronic system
CN108573275B (en) Construction method of online classification micro-service
CN116976461A (en) Federal learning method, apparatus, device and medium

Legal Events

Date Code Title Description
121 Ep: the epo has been informed by wipo that ep was designated in this application

Ref document number: 19916310

Country of ref document: EP

Kind code of ref document: A1

NENP Non-entry into the national phase

Ref country code: DE

122 Ep: pct application non-entry in european phase

Ref document number: 19916310

Country of ref document: EP

Kind code of ref document: A1
