CN112381148A - Semi-supervised image classification method based on random regional interpolation

Semi-supervised image classification method based on random regional interpolation

Info

Publication number
CN112381148A
Authority
CN
China
Prior art keywords: image, label, images, cnn, real
Prior art date
Legal status
Granted
Application number
CN202011282976.4A
Other languages
Chinese (zh)
Other versions
CN112381148B (en)
Inventor
曾祥平
霍晓阳
吴斯
Current Assignee
South China University of Technology SCUT
Original Assignee
South China University of Technology SCUT
Priority date
Filing date
Publication date
Application filed by South China University of Technology SCUT filed Critical South China University of Technology SCUT
Priority to CN202011282976.4A
Publication of CN112381148A
Application granted
Publication of CN112381148B
Current legal status: Active

Classifications

    • G06F18/214: Pattern recognition; generating training patterns; bootstrap methods, e.g. bagging or boosting
    • G06F18/241: Pattern recognition; classification techniques relating to the classification model, e.g. parametric or non-parametric approaches
    • G06N3/045: Neural networks; combinations of networks
    • G06N3/08: Neural networks; learning methods
    • G06N3/088: Neural networks; non-supervised learning, e.g. competitive learning


Abstract

The invention discloses a semi-supervised image classification method based on random regional interpolation. A small number of images with real labels are selected from the training set, and the remaining images are treated as images without real labels; both types of images are fed into a random regional interpolation module. The interpolation process differs between the two types: an image with a real label can be interpolated directly to generate a new augmented image, whereas an image without a real label cannot be interpolated as-is, so high-confidence label information is first obtained from a teacher network and used as the temporary label of the image before the interpolation operation is performed. The network is then trained on the new augmented images until a preset number of training rounds is reached. By applying random regional interpolation to both types of images simultaneously, the method generates new augmented images for training the classification network, thereby improving the generalization performance of the trained model.

Description

Semi-supervised image classification method based on random regional interpolation
Technical Field
The invention relates to the technical field of computer vision, in particular to a semi-supervised image classification method based on random regional interpolation.
Background
With the growth of social media and network services, a large number of photos are uploaded every day, and labeling them manually for model training has become infeasible. In more and more applications, data with real labels is scarce, while data without real labels is easy to obtain. To address the problem of learning when only part of the data carries real labels, semi-supervised learning has been studied as a way to effectively reduce the dependence on data with real labels by exploiting data without real labels.
Semi-supervised learning is a key problem in pattern recognition and machine learning research, and is a learning method that combines supervised and unsupervised learning. Semi-supervised learning performs pattern recognition using a large amount of data without real labels together with a small amount of data with real labels. It requires as little manual labeling effort as possible while delivering higher accuracy, so semi-supervised learning is currently receiving more and more attention.
In a semi-supervised setting, only a small portion of the training examples are labeled, while all remaining examples are unlabeled. To overcome the lack of real-label data, many data augmentation methods have been developed to obtain similar but different examples from the original ones; the class labels of these examples are unchanged before and after the transformation. Training on the augmented data makes the model robust to rotation, translation, cropping, resizing, flipping, and random erasure. Kolesnikov et al. studied a number of existing data augmentation mechanisms to gain insight into CNN design. To produce more complex supervision, linear interpolation has been used to blend training examples, with the corresponding training targets mixed in proportion to the blend ratio; the model is thereby regularized to make smooth predictions between examples. However, deep neural networks always tend to learn the most discriminative features to achieve higher training accuracy, and such models may focus on regions that are not necessarily important or desirable. We believe this can reduce generalization performance on unseen data, especially when labeled data is limited. To solve this problem, we propose a more complete mechanism for constructing complex class-ambiguous examples, which forces the model to learn more interpretable and robust features by randomly varying the size and location of the blended region.
Disclosure of Invention
The present invention aims to overcome the problem that existing deep neural networks tend to learn only the most discriminative features to obtain high training accuracy and may concentrate on regions that are not necessarily important or needed, especially when supervision is limited. To this end, a semi-supervised image classification method based on random regional interpolation is provided. The method generates new augmented images by performing random regional interpolation on images with real labels and on images without real labels, and uses the new augmented images to train the network, thereby greatly improving the classification accuracy and generalization performance of the network.
In order to achieve the above purpose, the technical scheme provided by the invention is as follows: a semi-supervised image classification method based on random regional interpolation, comprising the following steps:

S1, dividing the training set data into a real-label image set D_L and a no-real-label image set D_U;

S2, taking an image u without a real label from the no-real-label image set D_U, and obtaining predicted class score information through the teacher network corresponding to the CNN-13 classification network as the temporary label of the image u; wherein the parameters of the teacher network are obtained by an exponential moving average of the CNN-13 classification network parameters;

S3, inputting two images with real labels into the entrance of the whole random regional interpolation module to obtain a new augmented image sample pair (x̃_L, ỹ_L), where x̃_L denotes the newly generated augmented image information and ỹ_L denotes its label information; inputting two images without real labels into the entrance of the whole random regional interpolation module to obtain a new augmented image sample pair (x̃_U, ỹ_U), where x̃_U denotes the newly generated augmented image information and ỹ_U denotes its label information;

S4, inputting the new augmented image sample pairs (x̃_L, ỹ_L) and (x̃_U, ỹ_U) obtained in step S3 into the CNN-13 classification network for the current round of training, constrained by a loss function;
and S5, repeating the steps S2-S4, finishing training after reaching the preset training times, outputting the trained CNN-13 classification network, and performing class prediction on the image to be classified by using the trained CNN-13 classification network.
In step S1, all images are scaled as required to achieve the desired training effect and reduce computation; all data are then classified as required: first, all image data are divided into training data and a test data set D_T; the training data are further divided into two categories, a real-label image set D_L and a no-real-label image set D_U, in a ratio of 1:50, i.e. the training data equal D_L ∪ D_U; an image with a real label is recorded as (x, y) ∈ D_L, and an image without a real label is recorded as u ∈ D_U.
In step S2, each image u in the no-real-label image set D_U is assigned a temporary label; the teacher network corresponding to the CNN-13 classification network is run in the same mode as in the test stage, with its parameters fixed and not updated; the parameters of the teacher network are the result of an exponential moving average of the CNN-13 classification network parameters; the prediction result of the teacher network corresponding to the CNN-13 classification network is taken as the temporary label of the image u without a real label.
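The exponential moving average that produces the teacher parameters can be sketched as follows; this is a minimal illustration, and the decay value 0.99 is an assumption, since the patent does not state it:

import torch

@torch.no_grad()
def update_teacher(student: torch.nn.Module, teacher: torch.nn.Module,
                   decay: float = 0.99) -> None:
    """Teacher update by exponential moving average of student parameters:
    teacher <- decay * teacher + (1 - decay) * student."""
    for t_param, s_param in zip(teacher.parameters(), student.parameters()):
        t_param.mul_(decay).add_(s_param, alpha=1.0 - decay)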
In step S3, the random regional interpolation module randomly selects a rectangular region according to the Beta distribution on the two images with real labels or the two images without real labels, interpolates the image information inside the rectangular region of the first image with the image information in the same rectangular region of the second image, keeps the information of the first image outside the rectangular region, and finally returns the new augmented image formed after interpolation; the specific situation is as follows, with a Python sketch of the whole procedure given after step b7:
a. inputting two images with real labels, (x1, y1) and (x2, y2), and generating a new augmented image sample pair (x̃_L, ỹ_L); the specific process is as follows:

a1, input two images with real labels, (x1, y1) and (x2, y2), where x1 and x2 denote the image information and y1 and y2 denote the label information; the spatial resolution of both images is W × H, where W denotes the image width and H denotes the image height;

a2, randomly draw a combination ratio λ from the Beta distribution, i.e. λ ~ Beta(α, α), where Beta(α, α) is a continuous probability distribution defined on the interval (0, 1) and α is a hyper-parameter whose value differs from data set to data set;

a3, compute a binary mask R with spatial resolution W × H, whose values are 0 inside the rectangular region (r_x, r_y, r_w, r_h) and 1 in the other regions, where (r_x, r_y) denotes the top-left corner coordinate, r_w denotes the width of the rectangular region, and r_h denotes its height:

r_w = W·√(1 - λ);

r_x ~ Unif(0, W - r_w), where Unif denotes the uniform distribution;

r_h = H·√(1 - λ);

r_y ~ Unif(0, H - r_h);

a4, generate the new augmented image x̃_L = R ⊙ x1 + (1 - R) ⊙ (λ·x1 + (1 - λ)·x2), where ⊙ denotes element-wise multiplication;

a5, generate the label ỹ_L corresponding to the new augmented image x̃_L: ỹ_L = λ′·y1 + (1 - λ′)·y2, where λ′ = 1 - (1 - λ)·r_w·r_h/(W·H) is the proportion of the first image retained in x̃_L;

a6, obtain the new augmented image sample pair (x̃_L, ỹ_L), where x̃_L denotes the newly generated augmented image information and ỹ_L denotes its label information;
b. inputting two images without real labels, u1 and u2, and generating a new augmented image sample pair (x̃_U, ỹ_U); because the input images carry no real labels, temporary labels must first be generated for them before the interpolation operation is carried out; the specific process is as follows:

b1, input two images without real labels, u1 and u2, denoting the image information; the spatial resolution of both images is W × H;

b2, input u1 and u2 into the teacher network corresponding to the CNN-13 classification network to obtain the corresponding temporary labels q1 and q2;

b3, randomly draw a combination ratio λ from the Beta distribution, i.e. λ ~ Beta(α, α), where Beta(α, α) is a continuous probability distribution defined on the interval (0, 1) and α is a hyper-parameter whose value differs from data set to data set;

b4, compute a binary mask R with spatial resolution W × H, whose values are 0 inside the rectangular region (r_x, r_y, r_w, r_h) and 1 in the other regions, where (r_x, r_y) denotes the top-left corner coordinate, r_w denotes the width of the rectangular region, and r_h denotes its height:

r_w = W·√(1 - λ);

r_x ~ Unif(0, W - r_w);

r_h = H·√(1 - λ);

r_y ~ Unif(0, H - r_h);

b5, generate the new augmented image x̃_U = R ⊙ u1 + (1 - R) ⊙ (λ·u1 + (1 - λ)·u2), where ⊙ denotes element-wise multiplication;

b6, generate the label ỹ_U corresponding to the new augmented image x̃_U: ỹ_U = λ′·q1 + (1 - λ′)·q2, where λ′ = 1 - (1 - λ)·r_w·r_h/(W·H);

b7, obtain the new augmented image sample pair (x̃_U, ỹ_U), where x̃_U denotes the newly generated augmented image information and ỹ_U denotes its label information.
In step S4, the new augmented image sample pairs (x̃_L, ỹ_L) and (x̃_U, ỹ_U) are used to optimize the CNN-13 classification network. Since both the original images and the blended region are randomly determined, predicting the class probability distribution of a composite image is a challenge when constructing the training target; the CNN-13 classification network is therefore forced to find the important regions associated with the object and to learn robust features on both classes of composite data, i.e. the generated new augmented image sample pairs (x̃_L, ỹ_L) and (x̃_U, ỹ_U). The network is optimized using two independent loss functions as follows:

min_{θ_C}  E_{(x̃_L, ỹ_L)~P_L}[ ℓ_CE(C(x̃_L), ỹ_L) ] + ρ · E_{(x̃_U, ỹ_U)~P_U}[ ℓ_Div(C(x̃_U), ỹ_U) ]

where θ_C denotes the parameters of the CNN-13 classification network to be updated; P_L denotes the distribution of synthetic data derived from images with real labels; P_U denotes the distribution of synthetic data derived from images without real labels; C(·) = Softmax(h(·)), where h(·) denotes the output of the last hidden layer of the CNN-13 classification network for a given input; ℓ_CE denotes the cross-entropy loss function between the real label and the predicted value; ℓ_Div is a function measuring the divergence between the training target and the CNN-13 classification network output; and the weight ρ controls the relative importance of the synthetic instances derived from images without real labels.
In step S5, the number of training rounds is set to 400; after all the data have been trained once, training continues over the data again until the preset number of rounds is reached; the teacher network corresponding to the CNN-13 classification network is updated at the end of each round. After the trained CNN-13 classification network is obtained, its parameters are fixed and no longer updated, and class prediction is performed on the images to be classified without using a loss function: the images to be classified are input into the CNN-13 classification network in turn, and each image obtains a corresponding prediction result.
Compared with the prior art, the invention has the following advantages and beneficial effects:
1. The invention adopts the currently popular deep-learning CNN-13 classification network as the base model; compared with existing semi-supervised methods, it achieves a better classification effect and better generalization performance.
2. The invention addresses the problem of insufficient supervision by constructing effective class-ambiguous data for semi-supervised image classification, which has not been explored before. Compared with existing interpolation-based methods, the proposed random regional interpolation provides a more complete data augmentation mechanism, and the constructed data regularizes the behavior of the model near the decision boundary.
3. CNN-13 classification networks always tend to learn the most discriminative features to achieve higher training accuracy, yet they may focus on regions that are not necessarily important or desirable. This can reduce generalization performance on unseen data, especially when real-label data is limited. To solve this problem, the method proposes a more complete mechanism for constructing complex class-ambiguous instances: by randomly varying the size and location of the blended region, the CNN-13 classification network learns more robust and interpretable features.
4. The invention regularizes the behavior of the CNN-13 classification network near the decision boundary by random regional interpolation, which interpolates only inside a random rectangular region and thereby combines one training image with another. The augmented image is more complex than the originals and can become class-ambiguous when it contains objects belonging to different classes. The training target is determined by interpolating between the labels of the original images according to the area of the blended region. Considering that in a semi-supervised environment a large number of training instances carry no real labels, establishing reliable training targets is crucial; therefore a teacher network corresponding to the CNN-13 classification network is constructed by taking an exponential moving average (EMA) of the CNN-13 classification network parameters during training. This teacher network produces more stable and accurate predictions for data without real labels. On this basis, the random regional interpolation method can be applied both to data without real labels and to data with real labels, and the synthetic data are used to supervise the training of the model. The randomness of the blended region helps the model find the important spatial regions associated with the target.
Drawings
FIG. 1 is a block flow diagram of the method of the present invention.
FIG. 2 is a schematic diagram of the random regional interpolation operation of the method of the present invention, in which Original data denotes the original images input to the random regional interpolation module, RRI module denotes the random regional interpolation module, Classification network denotes the CNN-13 classification network, ℓ_CE denotes the cross-entropy loss function applied to images with real labels, ℓ_Div denotes the mean-squared divergence loss function applied to images without real labels, mask denotes the binary mask, and Beta denotes the Beta distribution.
FIG. 3 shows sample images of the method of the present invention, in which the first and second columns show the input images and the third column shows the generated augmented image.
Detailed Description
The present invention will be further described with reference to the following specific examples.
As shown in FIG. 1, the semi-supervised image classification method based on random regional interpolation provided in this embodiment comprises the following steps:

S1, dividing the training data into a real-label image set D_L, a no-real-label image set D_U, and a test data set D_T. The specific steps are as follows:
Firstly, each image is horizontally flipped with probability 0.5; the image is then padded by 2 pixels in width and height and randomly cropped to 32 × 32 pixels; the mean of the image pixels is subtracted, and finally whitening is applied. The image data are then classified as required: first, the image data are divided into training data and a test data set D_T; the training data are further divided into two categories, a real-label image set D_L and a no-real-label image set D_U, in a ratio of 1:50, i.e. the training data equal D_L ∪ D_U; an image with a real label is recorded as (x, y) ∈ D_L, and an image without a real label is recorded as u ∈ D_U.
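The preprocessing above can be sketched with torchvision as follows; the per-channel normalization constants are illustrative assumptions standing in for the mean subtraction, and the whitening step is not reproduced here:

import torchvision.transforms as T

# Step S1 preprocessing: horizontal flip with probability 0.5, pad width and
# height by 2 pixels, random 32x32 crop, then per-channel normalization.
train_transform = T.Compose([
    T.RandomHorizontalFlip(p=0.5),
    T.Pad(2),
    T.RandomCrop(32),
    T.ToTensor(),
    T.Normalize(mean=(0.4914, 0.4822, 0.4465),
                std=(0.2470, 0.2435, 0.2616)),
])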
S2, taking an image u without a real label from the no-real-label image set D_U, and obtaining predicted class information through the teacher network as the temporary label of u. The method is as follows: to assign a temporary label to the image u, the CNN-13 classification network side is run in the same mode as in the test stage, with parameters fixed and not updated; the prediction information of the teacher network corresponding to the CNN-13 classification network serves as the temporary label of u; the parameters of this teacher network are the result of an exponential moving average of the CNN-13 classification network parameters.
S3, inputting two images with real labels together with their real labels into the random regional interpolation module to obtain the corresponding augmented image sample pair (x̃_L, ỹ_L); inputting two images without real labels together with their temporary labels into the random regional interpolation module to obtain the corresponding augmented image sample pair (x̃_U, ỹ_U). As shown in the random regional interpolation module (RRI module) part on the left of FIG. 2, two original images are input; after passing through the random regional interpolation module, images with real labels yield the new augmented image sample pair (x̃_L, ỹ_L), and images without real labels yield the new augmented image sample pair (x̃_U, ỹ_U). The resulting augmented image is shown in the third column of FIG. 3; the first and second columns of FIG. 3 show the two original input images.
the random area interpolation module randomly selects a rectangular area from two images with real labels or two images without real labels according to beta distribution, interpolates image information in the same rectangular area in the second image and image information in the rectangular area in the first image, keeps the information of the first image in the images outside the rectangular area, and finally returns a new augmented image formed after interpolation; the specific situation is as follows:
a. inputting two images with real labels, (x1, y1) and (x2, y2), and generating a new augmented image sample pair (x̃_L, ỹ_L); the specific process is as follows:

a1, input two images with real labels, (x1, y1) and (x2, y2), where x1 and x2 denote the image information and y1 and y2 denote the label information; the spatial resolution of both images is W × H, where W denotes the image width and H denotes the image height;

a2, randomly draw a combination ratio λ from the Beta distribution, i.e. λ ~ Beta(α, α), where Beta(α, α) is a continuous probability distribution defined on the interval (0, 1) and α is a hyper-parameter whose value differs from data set to data set;

a3, compute a binary mask R with spatial resolution W × H, whose values are 0 inside the rectangular region (r_x, r_y, r_w, r_h) and 1 in the other regions, where (r_x, r_y) denotes the top-left corner coordinate, r_w denotes the width of the rectangular region, and r_h denotes its height:

r_w = W·√(1 - λ);

r_x ~ Unif(0, W - r_w), where Unif denotes the uniform distribution;

r_h = H·√(1 - λ);

r_y ~ Unif(0, H - r_h);

a4, generate the new augmented image x̃_L = R ⊙ x1 + (1 - R) ⊙ (λ·x1 + (1 - λ)·x2), where ⊙ denotes element-wise multiplication;

a5, generate the label ỹ_L corresponding to the new augmented image x̃_L: ỹ_L = λ′·y1 + (1 - λ′)·y2, where λ′ = 1 - (1 - λ)·r_w·r_h/(W·H) is the proportion of the first image retained in x̃_L;

a6, obtain the new augmented image sample pair (x̃_L, ỹ_L), where x̃_L denotes the newly generated augmented image information and ỹ_L denotes its label information;
b. inputting two images without real labels, u1 and u2, and generating a new augmented image sample pair (x̃_U, ỹ_U); because the input images carry no real labels, temporary labels must first be generated for them before the interpolation operation is carried out; the specific process is as follows:

b1, input two images without real labels, u1 and u2, denoting the image information; the spatial resolution of both images is W × H;

b2, input u1 and u2 into the teacher network corresponding to the CNN-13 classification network to obtain the corresponding temporary labels q1 and q2;

b3, randomly draw a combination ratio λ from the Beta distribution, i.e. λ ~ Beta(α, α), where Beta(α, α) is a continuous probability distribution defined on the interval (0, 1) and α is a hyper-parameter whose value differs from data set to data set;

b4, compute a binary mask R with spatial resolution W × H, whose values are 0 inside the rectangular region (r_x, r_y, r_w, r_h) and 1 in the other regions, where (r_x, r_y) denotes the top-left corner coordinate, r_w denotes the width of the rectangular region, and r_h denotes its height:

r_w = W·√(1 - λ);

r_x ~ Unif(0, W - r_w);

r_h = H·√(1 - λ);

r_y ~ Unif(0, H - r_h);

b5, generate the new augmented image x̃_U = R ⊙ u1 + (1 - R) ⊙ (λ·u1 + (1 - λ)·u2), where ⊙ denotes element-wise multiplication;

b6, generate the label ỹ_U corresponding to the new augmented image x̃_U: ỹ_U = λ′·q1 + (1 - λ′)·q2, where λ′ = 1 - (1 - λ)·r_w·r_h/(W·H);

b7, obtain the new augmented image sample pair (x̃_U, ỹ_U), where x̃_U denotes the newly generated augmented image information and ỹ_U denotes its label information.
S4, inputting the two pairs of the augmented image samples to the entrance of the whole network (CNN-13): an augmented image sample pair corresponding to the non-genuine label image of step S3
Figure BDA00027814128700001015
The other is an augmented image sample pair corresponding to the real label image
Figure BDA00027814128700001016
And for the training of the current round, constraint is carried out by using a loss function. As shown in the CNN-13 Classification network (Classification network) part on the right of FIG. 2, there is an augmented image sample pair corresponding to the true tag image
Figure BDA00027814128700001017
Using cross entropy loss function (l)CE) To constrain the corresponding pair of augmented image samples for true tag-free images
Figure BDA00027814128700001018
Using the mean square variance loss function (l)Div) To constrain.
The CNN-13 classification network comprises 9 convolution layers (divided into 3 groups), 3 pooling layers and 1 fully connected layer; the specific training process is as follows (see the sketch after step S45):

S41, input a picture I (an augmented image x̃_L from a sample pair corresponding to images with real labels, or an augmented image x̃_U from a sample pair corresponding to images without real labels);

S42, pass the picture I through the first group of convolution layers (128, 128 and 128 channels) to obtain feature map F1, then through a max-pooling layer to obtain feature map F1';

S43, pass feature map F1' through the second group of convolution layers (256, 256 and 256 channels) to obtain feature map F2, then through a max-pooling layer to obtain feature map F2';

S44, pass feature map F2' through the third group of convolution layers (512, 256 and 128 channels) to obtain feature map F3, then through a max-pooling layer to obtain feature map F3';

S45, pass feature map F3' through a 128 × 10 fully connected layer to obtain the classification result score;
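A PyTorch sketch of this backbone follows; kernel sizes, padding, batch normalization, and the pooling that reduces feature map F3' to a 128-dimensional vector are assumptions, since the text only specifies the channel counts, the three max-pooling layers, and the 128 × 10 fully connected layer:

import torch.nn as nn

class CNN13(nn.Module):
    """Three groups of three convolution layers (128/128/128, 256/256/256,
    512/256/128 channels), each followed by max pooling, then a 128 -> 10
    fully connected classifier (steps S41-S45)."""

    def __init__(self, num_classes: int = 10):
        super().__init__()

        def conv(c_in, c_out):
            return nn.Sequential(
                nn.Conv2d(c_in, c_out, kernel_size=3, padding=1),
                nn.BatchNorm2d(c_out),
                nn.ReLU(inplace=True))

        self.group1 = nn.Sequential(conv(3, 128), conv(128, 128),
                                    conv(128, 128), nn.MaxPool2d(2))
        self.group2 = nn.Sequential(conv(128, 256), conv(256, 256),
                                    conv(256, 256), nn.MaxPool2d(2))
        self.group3 = nn.Sequential(conv(256, 512), conv(512, 256),
                                    conv(256, 128), nn.MaxPool2d(2))
        self.pool = nn.AdaptiveAvgPool2d(1)  # collapse F3' to 128 features
        self.fc = nn.Linear(128, num_classes)

    def forward(self, x):
        x = self.group3(self.group2(self.group1(x)))
        return self.fc(self.pool(x).flatten(1))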
The loss function in the entire network is as follows:

min_{θ_C}  E_{(x̃_L, ỹ_L)~P_L}[ ℓ_CE(C(x̃_L), ỹ_L) ] + ρ · E_{(x̃_U, ỹ_U)~P_U}[ ℓ_Div(C(x̃_U), ỹ_U) ]

where θ_C denotes the parameters of the CNN-13 classification network to be updated; P_L denotes the distribution of synthetic data derived from images with real labels; P_U denotes the distribution of synthetic data derived from images without real labels; C(·) = Softmax(h(·)), where h(·) denotes the output of the last hidden layer of the CNN-13 classification network for a given input; ℓ_CE denotes the cross-entropy loss function between the real label and the predicted value; ℓ_Div is a function measuring the divergence between the training target and the CNN-13 classification network output (e.g. the mean-squared distance); and the weight ρ controls the relative importance of the synthetic instances derived from images without real labels.
Minimizing this loss function drives each prediction of the network toward the nearest real label. In our setup only a limited number of labeled training images are used, so data augmentation with images without real labels is very important: it can be expected to effectively increase the diversity and number of training images and thereby improve the generalization ability of the classification model.
And S5, repeating the steps S2-S4, and finishing training after the preset training times are reached.
The real-label image set D_L and the no-real-label image set D_U contain a large amount of data. To train the CNN-13 classification network well, the number of training rounds is set to 400; after all the data have been trained once, they are shuffled and trained again until the preset number of rounds is reached, so that the characteristics of the samples can be fully learned. The teacher network corresponding to the CNN-13 classification network is updated at the end of each round, and its parameters are the exponential moving average of the CNN-13 classification network parameters.

S6, testing and evaluating the trained CNN-13 classification network on the test data set D_T to obtain prediction results.

The trained CNN-13 classification network is fixed; throughout the test process the network is not updated and no loss function is used. Each image in the test data set D_T is input into the trained CNN-13 classification network in turn; each image obtains a corresponding prediction result, which is compared with the real class label to compute the test evaluation result.
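Step S6 amounts to a fixed-weight evaluation pass; a minimal sketch, in which the model and data-loader names are assumptions, is:

import torch

@torch.no_grad()
def evaluate(model, test_loader, device="cpu"):
    """Run the fixed, trained network over the test set and compare
    predictions with the real class labels (step S6)."""
    model.eval()
    correct, total = 0, 0
    for images, labels in test_loader:
        preds = model(images.to(device)).argmax(dim=1).cpu()
        correct += (preds == labels).sum().item()
        total += labels.size(0)
    return correct / total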
In the following we use the Cifar10 dataset as an example; it is divided into 50000 training images and 10000 test images. From the 50000 training images, 1000 are selected as labeled images and the rest are treated as unlabeled images. Each image is horizontally flipped with probability 0.5, padded by 2 pixels in width and height, randomly cropped to 32 × 32 pixels, has the mean of the image pixels subtracted, and is finally whitened before being fed into the CNN-13 classification network.
Firstly, 100 images without real labels are input into the teacher network corresponding to the CNN-13 classification network to obtain temporary labels, which are assigned to those images. Then 100 images with real labels are input into the random regional interpolation module to obtain the augmented image sample pairs corresponding to the real-label images, and 100 images without real labels, now carrying temporary labels, are input into the random regional interpolation module to obtain the augmented image sample pairs corresponding to the no-real-label images. The 100 augmented image sample pairs from the real-label images and the 100 augmented image sample pairs from the no-real-label images are then put into the CNN-13 classification network to train it jointly. During training, the label information of the images with real labels is completely reliable, whereas the temporary labels of the images without real labels are produced by the teacher network and carry considerable uncertainty; to avoid the total loss being dominated by the synthetic examples derived from data without real labels, the weight ρ is gradually increased to its maximum value of 100 over the first 100 iterations. The initial learning rate is 0.1 and is decayed to 0 by cosine annealing; the momentum is 0.9; the optimizer is stochastic gradient descent (SGD); and the hyper-parameter α is set to 0.25.
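The optimization settings reported above can be sketched as follows; the linear form of the ρ ramp is an assumption (the text only says the weight is gradually increased), and the stand-in model and steps_per_epoch are illustrative:

import torch
import torch.nn as nn

model = nn.Sequential(nn.Flatten(), nn.Linear(3 * 32 * 32, 10))  # stand-in
steps_per_epoch = 500  # illustrative; depends on batch size and data size

optimizer = torch.optim.SGD(model.parameters(), lr=0.1, momentum=0.9)
scheduler = torch.optim.lr_scheduler.CosineAnnealingLR(
    optimizer, T_max=400 * steps_per_epoch, eta_min=0.0)  # decay 0.1 -> 0

def rho_schedule(iteration, max_rho=100.0, ramp_iters=100):
    """Raise the unlabeled-loss weight rho to its maximum of 100 over the
    first 100 iterations (linear ramp assumed)."""
    return max_rho * min(1.0, iteration / ramp_iters)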
With this method, after 400 training iterations on Cifar10 the whole CNN-13 classification network is essentially stable and the classification results show a good effect; the goal of semi-supervised image classification is achieved, and a small number of labeled images brings a large improvement.
The above-mentioned embodiments are merely preferred embodiments of the present invention, and the scope of the present invention is not limited thereto; changes made according to the shape and principle of the present invention shall be covered by the protection scope of the present invention.

Claims (6)

1. A semi-supervised image classification method based on random regional interpolation, characterized by comprising the following steps:

S1, dividing the training set data into a real-label image set D_L and a no-real-label image set D_U;

S2, taking an image u without a real label from the no-real-label image set D_U, and obtaining predicted class score information through the teacher network corresponding to the CNN-13 classification network as the temporary label of the image u, wherein the parameters of the teacher network are obtained by an exponential moving average of the CNN-13 classification network parameters;

S3, inputting two images with real labels into the entrance of the whole random regional interpolation module to obtain a new augmented image sample pair (x̃_L, ỹ_L), where x̃_L denotes the newly generated augmented image information and ỹ_L denotes its label information; inputting two images without real labels into the entrance of the whole random regional interpolation module to obtain a new augmented image sample pair (x̃_U, ỹ_U), where x̃_U denotes the newly generated augmented image information and ỹ_U denotes its label information;

S4, inputting the new augmented image sample pairs (x̃_L, ỹ_L) and (x̃_U, ỹ_U) obtained in step S3 into the CNN-13 classification network for the current round of training, constrained by a loss function;
and S5, repeating the steps S2-S4, finishing training after reaching the preset training times, outputting the trained CNN-13 classification network, and performing class prediction on the image to be classified by using the trained CNN-13 classification network.
2. The semi-supervised image classification method based on random regional interpolation as claimed in claim 1, wherein: in step S1, all images are scaled as required to achieve the desired training effect and reduce computation; all data are classified as required: first, all image data are divided into training data and a test data set D_T; the training data are further divided into two categories, a real-label image set D_L and a no-real-label image set D_U, in a ratio of 1:50, i.e. the training data equal D_L ∪ D_U; an image with a real label is recorded as (x, y) ∈ D_L, and an image without a real label is recorded as u ∈ D_U.
3. The semi-supervised image classification method based on random regional interpolation as claimed in claim 1, wherein: in step S2, each image u in the no-real-label image set D_U is assigned a temporary label; the teacher network corresponding to the CNN-13 classification network is run in the same mode as in the test stage, with its parameters fixed and not updated; the parameters of the teacher network are the result of an exponential moving average of the CNN-13 classification network parameters; the prediction result of the teacher network corresponding to the CNN-13 classification network is taken as the temporary label of the image u without a real label.
4. The semi-supervised image classification method based on random regional interpolation as claimed in claim 1, wherein: in step S3, the random regional interpolation module randomly selects a rectangular region according to the Beta distribution on the two images with real labels or the two images without real labels, interpolates the image information inside the rectangular region of the first image with the image information in the same rectangular region of the second image, keeps the information of the first image outside the rectangular region, and finally returns the new augmented image formed after interpolation; the specific situation is as follows:

a. inputting two images with real labels, (x1, y1) and (x2, y2), and generating a new augmented image sample pair (x̃_L, ỹ_L); the specific process is as follows:

a1, input two images with real labels, (x1, y1) and (x2, y2), where x1 and x2 denote the image information and y1 and y2 denote the label information; the spatial resolution of both images is W × H, where W denotes the image width and H denotes the image height;

a2, randomly draw a combination ratio λ from the Beta distribution, i.e. λ ~ Beta(α, α), where Beta(α, α) is a continuous probability distribution defined on the interval (0, 1) and α is a hyper-parameter whose value differs from data set to data set;

a3, compute a binary mask R with spatial resolution W × H, whose values are 0 inside the rectangular region (r_x, r_y, r_w, r_h) and 1 in the other regions, where (r_x, r_y) denotes the top-left corner coordinate, r_w denotes the width of the rectangular region, and r_h denotes its height:

r_w = W·√(1 - λ);

r_x ~ Unif(0, W - r_w), where Unif denotes the uniform distribution;

r_h = H·√(1 - λ);

r_y ~ Unif(0, H - r_h);

a4, generate the new augmented image x̃_L = R ⊙ x1 + (1 - R) ⊙ (λ·x1 + (1 - λ)·x2), where ⊙ denotes element-wise multiplication;

a5, generate the label ỹ_L corresponding to the new augmented image x̃_L: ỹ_L = λ′·y1 + (1 - λ′)·y2, where λ′ = 1 - (1 - λ)·r_w·r_h/(W·H) is the proportion of the first image retained in x̃_L;

a6, obtain the new augmented image sample pair (x̃_L, ỹ_L), where x̃_L denotes the newly generated augmented image information and ỹ_L denotes its label information;

b. inputting two images without real labels, u1 and u2, and generating a new augmented image sample pair (x̃_U, ỹ_U); because the input images carry no real labels, temporary labels must first be generated for them before the interpolation operation is carried out; the specific process is as follows:

b1, input two images without real labels, u1 and u2, denoting the image information; the spatial resolution of both images is W × H;

b2, input u1 and u2 into the teacher network corresponding to the CNN-13 classification network to obtain the corresponding temporary labels q1 and q2;

b3, randomly draw a combination ratio λ from the Beta distribution, i.e. λ ~ Beta(α, α), where Beta(α, α) is a continuous probability distribution defined on the interval (0, 1) and α is a hyper-parameter whose value differs from data set to data set;

b4, compute a binary mask R with spatial resolution W × H, whose values are 0 inside the rectangular region (r_x, r_y, r_w, r_h) and 1 in the other regions, where (r_x, r_y) denotes the top-left corner coordinate, r_w denotes the width of the rectangular region, and r_h denotes its height:

r_w = W·√(1 - λ);

r_x ~ Unif(0, W - r_w);

r_h = H·√(1 - λ);

r_y ~ Unif(0, H - r_h);

b5, generate the new augmented image x̃_U = R ⊙ u1 + (1 - R) ⊙ (λ·u1 + (1 - λ)·u2), where ⊙ denotes element-wise multiplication;

b6, generate the label ỹ_U corresponding to the new augmented image x̃_U: ỹ_U = λ′·q1 + (1 - λ′)·q2, where λ′ = 1 - (1 - λ)·r_w·r_h/(W·H);

b7, obtain the new augmented image sample pair (x̃_U, ỹ_U), where x̃_U denotes the newly generated augmented image information and ỹ_U denotes its label information.
5. The semi-supervised image classification method based on random regional interpolation as claimed in claim 1, wherein: in step S4, the new augmented image sample pairs (x̃_L, ỹ_L) and (x̃_U, ỹ_U) are used to optimize the CNN-13 classification network; since both the original images and the blended region are randomly determined, predicting the class probability distribution of a composite image is a challenge when constructing the training target; the CNN-13 classification network is therefore forced to find the important regions associated with the object and to learn robust features on both classes of composite data, i.e. the generated new augmented image sample pairs (x̃_L, ỹ_L) and (x̃_U, ỹ_U); the network is optimized using two independent loss functions as follows:

min_{θ_C}  E_{(x̃_L, ỹ_L)~P_L}[ ℓ_CE(C(x̃_L), ỹ_L) ] + ρ · E_{(x̃_U, ỹ_U)~P_U}[ ℓ_Div(C(x̃_U), ỹ_U) ]

where θ_C denotes the parameters of the CNN-13 classification network to be updated; P_L denotes the distribution of synthetic data derived from images with real labels; P_U denotes the distribution of synthetic data derived from images without real labels; C(·) = Softmax(h(·)), where h(·) denotes the output of the last hidden layer of the CNN-13 classification network for a given input; ℓ_CE denotes the cross-entropy loss function between the real label and the predicted value; ℓ_Div is a function measuring the divergence between the training target and the CNN-13 classification network output; and the weight ρ controls the relative importance of the synthetic instances derived from images without real labels.
6. The semi-supervised image classification method based on random regional interpolation as claimed in claim 1, wherein: in step S5, the number of training rounds is set to 400; after all the data have been trained once, training continues over the data again until the preset number of rounds is reached; the teacher network corresponding to the CNN-13 classification network is updated at the end of each round; after the trained CNN-13 classification network is obtained, its parameters are fixed and no longer updated, and class prediction is performed on the images to be classified without using a loss function: the images to be classified are input into the CNN-13 classification network in turn, and each image obtains a corresponding prediction result.
CN202011282976.4A 2020-11-17 2020-11-17 Semi-supervised image classification method based on random regional interpolation Active CN112381148B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202011282976.4A CN112381148B (en) 2020-11-17 2020-11-17 Semi-supervised image classification method based on random regional interpolation


Publications (2)

Publication Number Publication Date
CN112381148A true CN112381148A (en) 2021-02-19
CN112381148B CN112381148B (en) 2022-06-14

Family

ID=74584880


Country Status (1)

Country Link
CN (1) CN112381148B (en)


Patent Citations (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN108364016A (en) * 2018-01-12 2018-08-03 华南理工大学 Gradual semisupervised classification method based on multi-categorizer
CN108764281A (en) * 2018-04-18 2018-11-06 华南理工大学 A kind of image classification method learning across task depth network based on semi-supervised step certainly
CN109657697A (en) * 2018-11-16 2019-04-19 中山大学 Classified optimization method based on semi-supervised learning and fine granularity feature learning
CN110097103A (en) * 2019-04-22 2019-08-06 西安电子科技大学 Based on the semi-supervision image classification method for generating confrontation network
CN111275129A (en) * 2020-02-17 2020-06-12 平安科技(深圳)有限公司 Method and system for selecting image data augmentation strategy
CN111368660A (en) * 2020-02-25 2020-07-03 华南理工大学 Single-stage semi-supervised image human body target detection method

Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
Vikas Verma et al.: "Interpolation Consistency Training for Semi-Supervised Learning", arXiv *

Cited By (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN113408575A (en) * 2021-05-12 2021-09-17 桂林电子科技大学 Image data augmentation method based on discriminant area positioning
CN113408575B (en) * 2021-05-12 2022-08-19 桂林电子科技大学 Image data augmentation method based on discriminant area positioning
CN113420786A (en) * 2021-05-31 2021-09-21 杭州电子科技大学 Semi-supervised classification method for feature mixed image

Also Published As

Publication number Publication date
CN112381148B (en) 2022-06-14

Similar Documents

Publication Publication Date Title
CN109949317B (en) Semi-supervised image example segmentation method based on gradual confrontation learning
CN111242208B (en) Point cloud classification method, segmentation method and related equipment
CN110097095B (en) Zero sample classification method based on multi-view generation countermeasure network
CN110837836A (en) Semi-supervised semantic segmentation method based on maximized confidence
CN110852273A (en) Behavior identification method based on reinforcement learning attention mechanism
CN109743642B (en) Video abstract generation method based on hierarchical recurrent neural network
CN112381148B (en) Semi-supervised image classification method based on random regional interpolation
CN110889450B (en) Super-parameter tuning and model construction method and device
CN114332578A (en) Image anomaly detection model training method, image anomaly detection method and device
CN113313123B (en) Glance path prediction method based on semantic inference
CN115731441A (en) Target detection and attitude estimation method based on data cross-modal transfer learning
CN112651998A (en) Human body tracking algorithm based on attention mechanism and double-current multi-domain convolutional neural network
CN114360067A (en) Dynamic gesture recognition method based on deep learning
CN111753207A (en) Collaborative filtering model of neural map based on comments
CN113111716A (en) Remote sensing image semi-automatic labeling method and device based on deep learning
Zhou et al. Attention transfer network for nature image matting
CN111259938A (en) Manifold learning and gradient lifting model-based image multi-label classification method
CN112529025A (en) Data processing method and device
CN117152427A (en) Remote sensing image semantic segmentation method and system based on diffusion model and knowledge distillation
CN114841778B (en) Commodity recommendation method based on dynamic graph neural network
CN111209886A (en) Rapid pedestrian re-identification method based on deep neural network
CN113658285B (en) Method for generating face photo to artistic sketch
Yue et al. A Novel Two-stream Architecture Fusing Static And Dynamic Features for Human Action Recognition
Chang et al. STAU: a spatiotemporal-aware unit for video prediction and beyond
Wu et al. DDFPN: Context enhanced network for object detection

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant