CN112381161B - Neural network training method - Google Patents
Neural network training method

- Publication number
- CN112381161B (application CN202011296897.9A)
- Authority
- CN
- China
- Prior art keywords
- data
- training
- neural network
- class
- batch
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Active
Classifications
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F18/00—Pattern recognition
- G06F18/20—Analysing
- G06F18/23—Clustering techniques
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F18/00—Pattern recognition
- G06F18/20—Analysing
- G06F18/24—Classification techniques
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06N—COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
- G06N3/00—Computing arrangements based on biological models
- G06N3/02—Neural networks
- G06N3/04—Architecture, e.g. interconnection topology
- G06N3/045—Combinations of networks
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06N—COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
- G06N3/00—Computing arrangements based on biological models
- G06N3/02—Neural networks
- G06N3/08—Learning methods
Abstract
The invention relates to a neural network training method comprising the following steps: S1, preliminary training — performing deep learning neural network training on training sample data with unbalanced class data to obtain a preliminary optimal training model; S2, processing the training sample data according to the preliminary optimal training model; S3, secondary training — continuing iterative training on the basis of the preliminary optimal training model using the data processed in S2, until the neural network training model converges. The method uses the DBSCAN clustering result together with the existing labels to guide the data sampling of each batch during neural network training; through balance of data between classes and feature diversity of data within each single class, the convergence speed of the algorithm model is increased and its generalization performance is improved.
Description
Technical Field
The invention relates to deep learning algorithms, and in particular to a neural network training method that improves on class imbalance in training sample data.
Background
An important step in deep learning neural network training is gradient descent, i.e., the updating of the weight parameters in the network. Three update schemes are common: 1. traverse the entire training data set to compute the loss function once, compute the gradient of the loss with respect to each parameter, and update — this is called batch gradient descent; 2. compute the loss function once for every training sample and then compute the gradient and update the parameters — this is called stochastic gradient descent; 3. divide the training data set into many small batches, compute the loss function per batch, and update the parameters — this is called mini-batch gradient descent. Scheme 1 trains on all samples at once, so its computational cost is high and it is slow; scheme 2 updates the parameters for every single sample, which is fast but converges poorly; therefore deep learning training today generally adopts mini-batch gradient descent.
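As a rough illustration of scheme 3 above, the following sketch runs mini-batch gradient descent on a toy one-parameter least-squares problem. All names and figures here are illustrative, not from the patent:

```python
import random

# Illustrative sketch of mini-batch gradient descent on a toy
# 1-parameter least-squares fit y = w * x.
def minibatch_gd(xs, ys, batch_size=4, lr=0.01, epochs=200, seed=0):
    rng = random.Random(seed)
    idx = list(range(len(xs)))
    w = 0.0
    for _ in range(epochs):
        rng.shuffle(idx)                       # re-shuffle each epoch
        for start in range(0, len(idx), batch_size):
            batch = idx[start:start + batch_size]
            # gradient of the mean squared error over this mini-batch only
            grad = sum(2 * (w * xs[i] - ys[i]) * xs[i] for i in batch) / len(batch)
            w -= lr * grad                     # one parameter update per batch
    return w

xs = [float(i) for i in range(1, 9)]
ys = [3.0 * x for x in xs]                     # true weight is 3
w = minibatch_gd(xs, ys)
```

The key point is that one parameter update is performed per small batch — cheaper than a full-dataset pass (scheme 1) and less noisy than a per-sample update (scheme 2).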
Data class imbalance means that the amount of sample data in each class of the data set is highly unbalanced. The problem is frequently encountered in deep learning training, especially for classification models: a model trained on class-imbalanced data generalizes poorly and is severely biased at inference time, and the common metrics used to measure model performance during training often fail to reveal this. For example, for a binary classification model with an extreme positive-to-negative sample ratio of 99:1, even a model that predicts every sample as positive still achieves 99% prediction accuracy and 100% recall.
To mitigate data class imbalance, the traditional approach uses sample-level resampling — mainly random under-sampling (RUS) and random over-sampling (ROS) — to balance the data between classes, and then forms mini-batches by random sampling during model training. This traditional approach has two drawbacks: 1. random resampling easily distorts the sample data distribution and causes the model to overfit; 2. there is no way to guarantee class balance within each individual batch, so convergence is slow and the model's generalization is poor.
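For reference, the random over-sampling (ROS) baseline mentioned above can be sketched as follows — an illustrative helper, not the patent's code:

```python
import random
from collections import Counter

# Illustrative random over-sampling (ROS): duplicate minority-class samples
# (drawn with replacement) until every class matches the majority count.
# `samples` is a list of (features, label) pairs.
def random_oversample(samples, seed=0):
    rng = random.Random(seed)
    by_label = {}
    for x, y in samples:
        by_label.setdefault(y, []).append((x, y))
    target = max(len(group) for group in by_label.values())
    balanced = []
    for label, group in by_label.items():
        balanced.extend(group)
        # draw the shortfall with replacement from this class
        balanced.extend(rng.choices(group, k=target - len(group)))
    return balanced

# 99:1 imbalance, as in the binary-classification example above
data = [([i], "pos") for i in range(99)] + [([0], "neg")]
balanced = random_oversample(data)
counts = Counter(y for _, y in balanced)
```

Note the drawback the text identifies: the single "neg" sample is duplicated 99 times, which changes nothing about the feature distribution and invites overfitting.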
Disclosure of Invention
The present invention is directed to a neural network training method that improves on class imbalance of training sample data, so as to solve the problems described above. To this end, the invention adopts the following specific technical scheme:
a neural network training method, comprising the steps of:
S1, preliminary training: performing deep learning neural network training on training sample data with unbalanced class data to obtain a preliminary optimal training model;
S2, processing the training sample data according to the preliminary optimal training model, the specific process being as follows:
S21, extracting the feature vectors of all pictures of each category according to the preliminary optimal training model to obtain the feature vector set V = {V_{A-id-1}, …, V_{M-id-n}}, wherein A…M represent the marked category labels and id-n represents the picture id number;
S22, performing intra-category feature clustering on the feature vectors V with the clustering algorithm DBSCAN, separately for each label category, to obtain the data clustering result of each category C = {C_{A-i-id-n}}, wherein A represents a marked class label, called the first-level classification label, id-n represents the picture id number, and i represents the class label assigned by DBSCAN clustering, called the second-level classification label;
S23, obtaining the internal clustering condition of the pictures of each category from the data clustering result C;
S24, setting the sampling strategy for each batch of the deep learning neural network training process: extracting batch samples from all categories of pictures in C, the pictures in each batch satisfying two-level classification balance: the data amounts of the classes of the first-level classification are balanced against each other; and, within the same first-level class, the data conform to the DBSCAN clustering distribution, with the data amounts of the second-level classes balanced against each other;
S3, secondary training: continuing iterative training on the basis of the preliminary optimal training model using the data processed in S2, until the neural network training model converges.
Further, the data amounts of the different classes of the training sample data differ by a factor of more than 4.
Further, the sample data size of each batch is 0.01% to 1% of the training sample data size.
Further, the sample data size per batch is 256 or 512.
Further, the epsilon parameter of DBSCAN is 0.6 and the minPts parameter is 2.
Further, within each batch, the data amounts of the classes of the primary classification differ by at most 10%, as do the data amounts of the classes of the secondary classification.
By adopting the above technical scheme, the invention has the following beneficial effects: the clustering algorithm DBSCAN is used to cluster all data samples separately by category, obtaining the distribution of the data features within each category; the number of clusters does not need to be specified in advance, which avoids introducing artificial bias and yields a better clustering result. Meanwhile, balancing the data between classes and the diversity of the data within each single class in every batch improves the convergence speed of the algorithm model and ensures that the model generalizes well.
Drawings
To further illustrate the various embodiments, the invention provides the accompanying drawings. The accompanying drawings, which are incorporated in and constitute a part of this disclosure, illustrate embodiments of the invention and, together with the description, serve to explain the principles of the embodiments. With these references, one of ordinary skill in the art will appreciate other possible embodiments and advantages of the present invention. The components in the drawings are not necessarily to scale, and similar reference numerals are generally used to identify similar components.
FIG. 1 is a flow chart of the method of the present invention;
FIG. 2 is a sample diagram of a batch.
Detailed Description
The invention will now be further described with reference to the drawings and the detailed description.
As shown in fig. 1, a neural network training method includes the following steps:
S1, preliminary training: perform deep learning neural network training on the class-imbalanced training sample data to obtain a preliminary optimal training model. Here, class data imbalance means that the data amounts of the different classes of the training sample data differ greatly (for example, by a factor of 4 or more), i.e., the class with the most data has at least 4 times as much data as the class with the least.
S2, process the training sample data according to the preliminary optimal training model. Specifically, the clustering result is used to guide the sampling of each batch during neural network training: the data in each batch must not only satisfy balance between the categories but also exhibit feature diversity within each single category. The specific process of S2 is as follows:
S21, extract the feature vectors of all pictures in each category according to the preliminary optimal training model, obtaining the feature vector set V = {V_{A-id-1}, …, V_{M-id-n}}, where A…M denote the labeled category tags and id-n denotes the picture id number.
S22, perform intra-category feature clustering on the feature vectors V with the clustering algorithm DBSCAN, separately for each label category, to obtain the data clustering result of each category C = {C_{A-i-id-n}}, where A denotes a labeled class label (the first-level classification label), id-n denotes the picture id number, and i denotes the class label assigned by DBSCAN clustering (the second-level classification label). Here, the epsilon parameter of DBSCAN is 0.6 and the minPts parameter is 2.
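The per-category clustering of S22 can be sketched as follows. This is an illustrative reimplementation, not the patent's code: a production version would typically call `sklearn.cluster.DBSCAN(eps=0.6, min_samples=2)` on real network embeddings, whereas here a minimal DBSCAN is inlined on toy 2-D "feature vectors" so the sketch stays self-contained.

```python
import math

# Minimal Euclidean DBSCAN, enough to mirror the per-category clustering of
# S22.  eps=0.6 and min_pts=2 follow the parameters stated in the patent.
def dbscan(points, eps=0.6, min_pts=2):
    n = len(points)
    labels = [None] * n                         # None = unvisited, -1 = noise
    def neighbors(i):
        return [j for j in range(n) if math.dist(points[i], points[j]) <= eps]
    cluster = -1
    for i in range(n):
        if labels[i] is not None:
            continue
        seeds = neighbors(i)
        if len(seeds) < min_pts:
            labels[i] = -1                      # provisionally noise
            continue
        cluster += 1                            # i is a core point: new cluster
        labels[i] = cluster
        queue = [j for j in seeds if j != i]
        while queue:
            j = queue.pop()
            if labels[j] == -1:
                labels[j] = cluster             # border point reclaimed from noise
                continue
            if labels[j] is not None:
                continue
            labels[j] = cluster
            more = neighbors(j)
            if len(more) >= min_pts:            # j is also core: keep expanding
                queue.extend(k for k in more if labels[k] is None)
    return labels

# Toy stand-ins for per-category feature vectors (second-level labels per class)
features_by_category = {
    "A": [(0.0, 0.0), (0.1, 0.1), (0.2, 0.0), (5.0, 5.0), (5.1, 5.1)],
    "B": [(0.0, 0.0), (0.05, 0.0), (9.0, 9.0)],
}
clusters = {cat: dbscan(pts) for cat, pts in features_by_category.items()}
```

Because DBSCAN is density-based, the number of second-level clusters per first-level class falls out of the data (category "A" yields two clusters here, "B" one cluster plus a noise point) rather than being fixed in advance — the property the Disclosure section credits with avoiding artificial bias.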
S23, obtain the internal clustering condition of the pictures of each category from the data clustering result C.
S24, set the sampling strategy for each batch of the deep learning neural network training process: extract batch samples from all categories of pictures in C, where the pictures in each batch satisfy two-level classification balance. The data amounts of the first-level classes must be balanced against each other — generally required to differ by no more than 10%, and ideally identical. Within the same first-level class, the data conform to the DBSCAN clustering distribution, with the data amounts of the second-level classes balanced against each other — again within 10%, and ideally identical — which guarantees data diversity within the single category. For example, suppose the training sample data has 4 first-level classes which, after DBSCAN clustering, contain 2, 4, 4 and 3 second-level classes respectively, and each batch needs 256 samples. Then each first-level class contributes 256/4 = 64 samples per batch, and the second-level classes within each first-level class need 64/2 = 32, 64/4 = 16, 64/4 = 16, and 64/3 ≈ 21.3 samples respectively (when the quota is not an integer, the counts are rounded up or down so they differ by at most one), as shown in fig. 2. Preferably, the amount of data per batch is 0.01% to 1% of the amount of training sample data.
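The quota arithmetic of S24 can be sketched as follows; the helper names are ours, and the 4-class/256-sample figures are the example from the description:

```python
# Sketch of the S24 batch-quota computation.  `split` divides a quota as
# evenly as possible, rounding up/down so counts differ by at most one.
def split(total, parts):
    base, rem = divmod(total, parts)
    return [base + 1 if p < rem else base for p in range(parts)]

def batch_quotas(batch_size, subclusters_per_class):
    # first level: equal share per labeled class
    per_class = split(batch_size, len(subclusters_per_class))
    # second level: equal share per DBSCAN subcluster within each class
    return [split(q, k) for q, k in zip(per_class, subclusters_per_class)]

# 4 first-level classes with 2, 4, 4, 3 DBSCAN subclusters; batch of 256
quotas = batch_quotas(256, [2, 4, 4, 3])
```

The 64/3 case lands as 22 + 21 + 21 = 64, matching the "rounded up and down" rule in the text.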
S3, secondary training: continue iterative training on the basis of the preliminary optimal training model using the data processed in S2, until the neural network training model converges.
Experimental testing
1) Algorithm model: a 10-class neural network consisting of a GoogLeNet backbone and a fully-connected layer; the 10 classes are airplane, automobile, bird, cat, deer, dog, frog, horse, ship, and truck;
2) Training sample data: the data set is CIFAR-10, 10 classes (airplane, automobile, bird, cat, deer, dog, frog, horse, ship, truck), with a simulated class imbalance ratio greater than 4:1 (5000:750), i.e. 5000 pictures for each of 7 classes (airplane, automobile, bird, cat, deer, dog, frog) and 750 pictures for each of 3 classes (horse, ship, truck);
3) Test data: 10 categories, 1000 pictures each;
4) Experimental hardware: 4 GTX 1080Ti GPU graphics cards;
5) Experimental procedure: the batch size is 512; 600 batches are run, and the accuracy (acc) and loss value on the test set are computed;
6) Experiment groups:
Experiment 1: training with the existing neural network training method — the training data is randomly split into batches of size 512, and 600 batches are run;
Experiment 2: training with the invention's neural network training method for improving data class imbalance; the specific training process is as follows:
First stage (preliminary training):
For the first 300 batches, sample the data of each category such that the categories' counts within a batch are equal, i.e. 512/10 = 51 samples per category (the remaining 2 samples are randomly assigned to 2 of the categories), and save the model with the highest accuracy on the test set;
Second stage (secondary training):
Data processing: take the model with the highest test accuracy from the first stage and remove its fully-connected layer, keeping only the backbone network for extracting picture features; extract the picture features of all training samples and cluster the data of each of the 10 classes separately with DBSCAN. Each batch then keeps the per-class counts equal, as in the first stage, and — after each class's data has been clustered by DBSCAN within that single class — also keeps the subclass (second-level classification) data within the class balanced;
Secondary training: using the processed data, continue training for 300 batches from the best model of the first stage.
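Putting the two stages' sampling together, a per-batch sampler might look like the sketch below. This is illustrative only — function and variable names are not from the patent. Given each sample's first-level label and second-level DBSCAN cluster, it draws one batch meeting both balance levels:

```python
import random

# Illustrative two-level balanced batch sampler (not the patent's code).
# `index` maps (class_label, cluster_label) -> list of sample ids.
def sample_batch(index, batch_size, seed=0):
    rng = random.Random(seed)
    classes = sorted({c for c, _ in index})
    per_class = batch_size // len(classes)      # first-level balance
    batch = []
    for c in classes:
        subclusters = sorted(k for k in index if k[0] == c)
        base, rem = divmod(per_class, len(subclusters))
        for j, key in enumerate(subclusters):   # second-level balance
            quota = base + (1 if j < rem else 0)
            # sample with replacement so small subclusters can fill their quota
            batch.extend(rng.choices(index[key], k=quota))
    return batch

index = {
    ("cat", 0): list(range(100)),
    ("cat", 1): list(range(100, 130)),
    ("dog", 0): list(range(200, 260)),
}
batch = sample_batch(index, batch_size=8)
```

During the first stage (before any clustering exists) the same sampler applies with a single subcluster per class, which reduces to plain per-class balance.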
7) The results of the experiment are shown in table 1:
TABLE 1
As can be seen from Table 1, the method of the present invention can improve the convergence rate and accuracy of model training.
In conclusion, for deep learning neural network training with class-imbalanced data, the method of the invention uses the DBSCAN clustering result together with the existing labels to guide the data sampling of every batch during training; the balance of data between classes and the diversity of data features within each single class (in particular, guaranteeing the number and distribution of hard samples) improve both the convergence speed and the generalization performance of the algorithm model. The training method is widely applicable to class-imbalanced AI training scenarios and promotes the practical application of artificial intelligence in complex real-world scenes.
While the invention has been particularly shown and described with reference to a preferred embodiment, it will be understood by those skilled in the art that various changes in form and detail may be made therein without departing from the spirit and scope of the invention as defined by the appended claims.
Claims (6)
1. A neural network training method, characterized by comprising the following steps:
S1, preliminary training: performing deep learning neural network training on training sample data with unbalanced class data to obtain a preliminary optimal training model;
S2, processing the training sample data according to the preliminary optimal training model, the specific process being as follows:
S21, extracting the feature vectors of all pictures in each category according to the preliminary optimal training model to obtain the feature vector set V = {V_{A-id-1}, …, V_{M-id-n}}, where A…M represent the labeled category tags and id-n represents the picture id number;
S22, performing intra-category feature clustering on the feature vectors V with the clustering algorithm DBSCAN, separately for each label category, to obtain the data clustering result of each category C = {C_{A-i-id-n}}, where A represents a labeled class label, called the first-level classification label, id-n represents the picture id number, and i represents the class label assigned by DBSCAN clustering, called the second-level classification label;
S23, obtaining the internal clustering condition of the pictures of each category from the data clustering result C;
S24, setting the sampling strategy for each batch of the deep learning neural network training process: extracting batch samples from all categories of pictures in C, the pictures in each batch satisfying two-level classification balance: the data amounts of the classes of the first-level classification are balanced against each other; and, within the same first-level class, the data conform to the DBSCAN clustering distribution, with the data amounts of the second-level classes balanced against each other;
S3, secondary training: continuing iterative training on the basis of the preliminary optimal training model using the data processed in S2, until the neural network training model converges.
2. The neural network training method of claim 1, wherein the amount of data between different classes of training sample data differs by more than a factor of 4.
3. The neural network training method of claim 1, wherein the epsilon parameter of DBSCAN is 0.6 and the minPts parameter is 2.
4. The neural network training method of claim 1, wherein the sample data size of each batch is 0.01% to 1% of the training sample data size.
5. The neural network training method of claim 4, wherein the sample data size for each batch is 256 or 512.
6. The neural network training method of claim 1, wherein, within each batch, the data amounts of the classes of the primary classification differ by at most 10%, as do the data amounts of the classes of the secondary classification.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202011296897.9A CN112381161B (en) | 2020-11-18 | 2020-11-18 | Neural network training method |
Publications (2)
Publication Number | Publication Date |
---|---|
CN112381161A CN112381161A (en) | 2021-02-19 |
CN112381161B true CN112381161B (en) | 2022-08-30 |
Family
ID=74585149
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN202011296897.9A Active CN112381161B (en) | 2020-11-18 | 2020-11-18 | Neural network training method |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN112381161B (en) |
Families Citing this family (1)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN114387457A (en) * | 2021-12-27 | 2022-04-22 | 腾晖科技建筑智能(深圳)有限公司 | Face intra-class interval optimization method based on parameter adjustment |
Citations (4)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN108921208A (en) * | 2018-06-20 | 2018-11-30 | 天津大学 | The aligned sample and modeling method of unbalanced data based on deep learning |
CN109816092A (en) * | 2018-12-13 | 2019-05-28 | 北京三快在线科技有限公司 | Deep neural network training method, device, electronic equipment and storage medium |
CN110298451A (en) * | 2019-06-10 | 2019-10-01 | 上海冰鉴信息科技有限公司 | A kind of equalization method and device of the lack of balance data set based on Density Clustering |
CN110443281A (en) * | 2019-07-05 | 2019-11-12 | 重庆信科设计有限公司 | Adaptive oversampler method based on HDBSCAN cluster |
Family Cites Families (1)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
US20190385045A1 (en) * | 2018-06-14 | 2019-12-19 | Dell Products L.P. | Systems And Methods For Generalized Adaptive Storage Endpoint Prediction |
- 2020-11-18: CN application CN202011296897.9A filed (patent CN112381161B, status Active)
Legal Events

Date | Code | Title | Description
---|---|---|---
| PB01 | Publication | |
| SE01 | Entry into force of request for substantive examination | |
| GR01 | Patent grant | |