CN109146058B - Convolutional neural network with transform invariant capability and consistent expression - Google Patents

Convolutional neural network with transform invariant capability and consistent expression

Info

Publication number: CN109146058B
Application number: CN201810861718.8A
Authority: CN (China)
Prior art keywords: layer, convolutional neural, expression, neural network, picture
Legal status: Active (the legal status is an assumption and is not a legal conclusion)
Other languages: Chinese (zh)
Other versions: CN109146058A
Inventors: 田新梅, 何岸峰, 沈旭
Current Assignee: University of Science and Technology of China USTC
Original Assignee: University of Science and Technology of China USTC
Application filed by University of Science and Technology of China USTC
Priority/filing date: 2018-07-27
Publication of CN109146058A: 2019-01-04
Grant and publication of CN109146058B: 2022-03-01


Classifications

    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06N: COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N 3/00: Computing arrangements based on biological models
    • G06N 3/02: Neural networks
    • G06N 3/04: Architecture, e.g. interconnection topology
    • G06N 3/045: Combinations of networks
    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06F: ELECTRIC DIGITAL DATA PROCESSING
    • G06F 18/00: Pattern recognition
    • G06F 18/20: Analysing
    • G06F 18/24: Classification techniques
    • G06F 18/241: Classification techniques relating to the classification model, e.g. parametric or non-parametric approaches
    • G06F 18/2413: Classification techniques relating to the classification model, e.g. parametric or non-parametric approaches based on distances to training or reference patterns

Abstract

The invention discloses a convolutional neural network with transformation-invariance capability and consistent expression: simply by introducing a consistency (invariance) loss function during training, the trained model becomes more robust to transformed pictures. The method also lets the model learn a transformation-invariant mode of expression, so that, compared with traditional methods that only learn a mapping from transformed pictures to fixed labels, it transfers better to other deep-learning problems. In addition, the method embeds the transformation-invariance capability into the weight parameters of the network, genuinely improving the transformation invariance of the convolutional neural network; it introduces no new parameters into the model, requires no extra processing of the pictures, and requires no change to the existing network structure at test time.

Description

Convolutional neural network with transform invariant capability and consistent expression
Technical Field
The invention relates to technical fields such as image classification and image retrieval, and in particular to a convolutional neural network with transformation invariance and consistent expression.
Background
In recent years, with the rapid development of the Internet, people are exposed to massive numbers of pictures and videos. For this huge volume of pictures, accurate recognition and retrieval are the basis of all picture-related applications. In the past, limited by insufficient computing power, only relatively low-level feature-extraction algorithms could be used, and these could not accurately express the high-level semantic information of pictures. With the growth of computing power, deep-learning techniques have made breakthrough progress in a series of related fields such as image recognition and picture retrieval. The convolutional neural network is the main model used in applications such as picture recognition and retrieval: operations such as convolution and pooling allow the model to extract feature expressions layer by layer, from local to global. The accurate expression of high-level semantics allows this technique to greatly surpass traditional algorithms in recognition performance.
However, existing convolutional neural networks are not particularly robust to pictures that have undergone various spatial transformations. Visualizing the outputs of the intermediate layers shows that after the input picture is rotated, scaled, or translated, the feature expressions of each layer differ considerably, and the recognition accuracy therefore drops rapidly.
Existing methods mainly address this problem from three perspectives. The first enhances the data set during training, so that the model learns from a wide variety of transformed pictures; the increased diversity of samples improves the model's robustness to transformed pictures. The second feeds various transformed pictures into a multi-channel structure, performs a maximum-pooling operation over the feature-map outputs of the channels, and uses the max-pooled feature map as the feature expression of the picture. The third learns the picture's transformation with an additional neural network, inversely transforms the picture back to a more standard pose according to that estimate, and then classifies the picture in the standard pose, thereby improving the recognition result.
However, all three methods either increase training time or introduce additional parameters and operations, raising the computational complexity at recognition time. Moreover, if robustness to transformations is gained by modifying the structure, the existing network structure must also be modified when the network is deployed, which hinders model portability.
Disclosure of Invention
The invention aims to provide a convolutional neural network with transformation-invariance capability and consistent expression, so that the invariance of the feature expressions inside the network is effectively improved and the network is more stable when recognizing pictures.
The purpose of the invention is realized by the following technical scheme:
a convolutional neural network with transform invariant capability and uniform expression, comprising:
in the training stage, a consistent loss function is introduced into a convolutional neural network comprising a convolutional layer, a full-link layer and a Softmax layer, so that the trained convolutional neural network learns an expression mode which is invariant to transformation;
the consistency loss function is introduced into the convolutional layer to promote the network to learn the expression of consistency on the characteristic information, the consistency loss function is introduced into the full connection layer to promote the network to learn the expression of consistency on the semantic information, and the consistency loss function is introduced into the Softmax layer to promote the network to learn the expression of consistency on the classification information.
According to the technical scheme provided by the invention, expression-consistency optimization targets are introduced at the feature level, the semantic level, and the classification-label level in turn, so that the convolutional neural network model becomes robust to transformations in the expressions at all three levels.
Drawings
In order to more clearly illustrate the technical solutions of the embodiments of the present invention, the drawings needed in the description of the embodiments are briefly introduced below. Obviously, the drawings described below are only some embodiments of the present invention; those skilled in the art can obtain other drawings from them without creative effort.
FIG. 1 is a schematic diagram of a convolutional neural network provided by an embodiment of the present invention;
FIG. 2 is a schematic diagram of a picture before and after a basic transformation according to an embodiment of the present invention;
FIG. 3 is a block diagram of a convolutional neural network with transform invariant capability and consistent expression provided by an embodiment of the present invention;
FIG. 4 is a schematic diagram comparing the RC-CNN with the original model and with a data-augmentation scheme according to an embodiment of the present invention.
Detailed Description
The technical solutions in the embodiments of the present invention are clearly and completely described below with reference to the drawings in the embodiments of the present invention, and it is obvious that the described embodiments are only a part of the embodiments of the present invention, and not all embodiments. All other embodiments, which can be derived by a person skilled in the art from the embodiments of the present invention without making any creative effort, shall fall within the protection scope of the present invention.
The embodiment of the invention provides a convolutional neural network with transformation-invariance capability and consistent expression (RC-CNN). Before introducing the RC-CNN, existing convolutional neural networks (CNNs) and the basic transformations of images are first introduced.
1. Convolutional neural network
Convolutional neural networks (CNNs) are a class of deep neural networks with multiple levels. In each layer, different convolution kernels are learned as feature-extraction operators, and these kernels are convolved with the feature maps of the previous layer to obtain the feature maps of the current layer. The feature maps of the lower layers mainly learn low-level feature information such as edges and corners; as the hierarchy deepens, the information expressed by each layer's feature maps becomes progressively more abstract. The feature expressions in different layers thus represent the picture's feature information at the corresponding levels. Weight sharing and spatial pooling make the convolutional neural network invariant to some small local spatial transformations, while also reducing the number of model parameters. In a convolutional neural network, the operation of a convolutional layer can be expressed by the following formula:
X_i^j = f(W_i^j ∗ X_{i−1} + b_i^j)

where ∗ denotes the convolution operation, X_{i−1} is the feature map of layer i−1, W_i^j is the j-th convolution kernel of the i-th layer, and b_i^j is the bias of the j-th feature expression of the i-th layer; W_i^j and b_i^j are learned through a gradient-descent algorithm. f(·) is a nonlinear function such as the ReLU, Sigmoid, or Tanh function.
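As a numerical illustration of this layer equation, the sketch below applies one convolution kernel, a bias, and a ReLU to a feature map (a minimal sketch, not code from the patent; the shapes and the use of PyTorch are assumptions for illustration):

```python
import torch
import torch.nn.functional as F

# One step of the layer equation X_i^j = f(W_i^j * X_{i-1} + b_i^j)
# for a single kernel j; all shapes here are illustrative assumptions.
x_prev = torch.randn(1, 16, 32, 32)   # X_{i-1}: 16 feature maps of size 32x32
w_ij   = torch.randn(1, 16, 3, 3)     # W_i^j: one 3x3 kernel over 16 channels
b_ij   = torch.zeros(1)               # b_i^j: bias of the j-th feature map

x_ij = F.relu(F.conv2d(x_prev, w_ij, bias=b_ij, padding=1))  # f(.) = ReLU
print(x_ij.shape)  # torch.Size([1, 1, 32, 32]): the j-th feature map of layer i
```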
The operation of the fully-connected layer is essentially the same as that of convolution, except that the convolution symbol ∗ is replaced by the matrix-multiplication symbol ×, as follows:

X_i = f(W_i × X_{i−1} + b_i)
FIG. 1 is a schematic diagram of a convolutional neural network (CNN); it comprises convolutional layers (C1-C5), fully-connected layers (FC6-FC8), and a Softmax layer.
The convolution operations perform feature extraction on the input picture from lower to higher layers. The fully-connected layers further abstract the picture's feature-level expression into a higher semantic-level expression; the output of the last fully-connected layer, FC8, is usually followed by a Softmax layer, whose output is the network's confidence for each predicted category.
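The FIG. 1 layout can be sketched as follows (a minimal sketch assuming an AlexNet-style configuration with 224×224 RGB inputs and 1000 classes; the channel and kernel sizes are illustrative assumptions, not taken from the patent):

```python
import torch
import torch.nn as nn

# Sketch of the FIG. 1 layout: five conv layers (C1-C5), three
# fully-connected layers (FC6-FC8), then a Softmax layer.
class SimpleCNN(nn.Module):
    def __init__(self, num_classes=1000):
        super().__init__()
        self.features = nn.Sequential(          # C1-C5
            nn.Conv2d(3, 64, 11, stride=4, padding=2), nn.ReLU(),
            nn.MaxPool2d(3, 2),
            nn.Conv2d(64, 192, 5, padding=2), nn.ReLU(),
            nn.MaxPool2d(3, 2),
            nn.Conv2d(192, 384, 3, padding=1), nn.ReLU(),
            nn.Conv2d(384, 256, 3, padding=1), nn.ReLU(),
            nn.Conv2d(256, 256, 3, padding=1), nn.ReLU(),
            nn.MaxPool2d(3, 2),
        )
        self.classifier = nn.Sequential(        # FC6-FC8
            nn.Flatten(),
            nn.Linear(256 * 6 * 6, 4096), nn.ReLU(),
            nn.Linear(4096, 4096), nn.ReLU(),
            nn.Linear(4096, num_classes),       # FC8
        )

    def forward(self, x):
        logits = self.classifier(self.features(x))
        return torch.softmax(logits, dim=1)     # per-class confidences
```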
2. Fundamental transformation of images
In the embodiment of the present invention, the basic transformations of images considered are mainly basic spatial transformations, including rotation, translation, and scaling. Let the coordinates of the original picture be (x, y) and the coordinates of the transformed picture be (x′, y′). The transformation of the picture can be implemented by the following formula:
(x′, y′, 1) = (x, y, 1) × T
where T is the transformation matrix of the picture.
The rotation transformation matrix T_R is:

T_R = [  cos θ   sin θ   0 ]
      [ −sin θ   cos θ   0 ]
      [    0       0     1 ]

where θ is the angle of rotation.
The translation transformation matrix T_T is:

T_T = [  1     0    0 ]
      [  0     1    0 ]
      [ d_x   d_y   1 ]

where d_x and d_y are the numbers of pixels the picture is shifted in the x-direction and the y-direction, respectively.
The scaling transformation matrix T_S is:

T_S = [ s_x   0    0 ]
      [  0   s_y   0 ]
      [  0    0    1 ]

where s_x and s_y are the scales by which the picture is scaled in the x-direction and the y-direction, respectively.
The combined transformation matrix T_RTS, which applies all three transformations, is obtained by multiplying the three matrices:

T_RTS = T_R × T_T × T_S
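The three matrices and their composition can be written out directly; the sketch below (illustrative, following the row-vector homogeneous-coordinate convention (x′, y′, 1) = (x, y, 1) × T used above) builds T_R, T_T, and T_S and applies T_RTS to one point:

```python
import numpy as np

def t_rotate(theta):                    # T_R, rotation by theta (radians)
    c, s = np.cos(theta), np.sin(theta)
    return np.array([[ c, s, 0],
                     [-s, c, 0],
                     [ 0, 0, 1]], dtype=float)

def t_translate(dx, dy):                # T_T, shift by (dx, dy) pixels
    return np.array([[1, 0, 0],
                     [0, 1, 0],
                     [dx, dy, 1]], dtype=float)

def t_scale(sx, sy):                    # T_S, scale by (sx, sy)
    return np.array([[sx, 0, 0],
                     [0, sy, 0],
                     [0,  0, 1]], dtype=float)

# T_RTS = T_R x T_T x T_S, applied as (x', y', 1) = (x, y, 1) x T
T = t_rotate(np.pi / 6) @ t_translate(5, -3) @ t_scale(1.2, 0.8)
x, y = 10.0, 20.0
xp, yp, _ = np.array([x, y, 1.0]) @ T
print(xp, yp)   # transformed coordinates (x', y')
```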
FIG. 2 shows examples of pictures before and after the basic transformations: the ORI column shows the original pictures; the R column shows the rotated pictures; the T column shows the translated pictures; the S column shows the scaled pictures; and the RTS column shows pictures with all three transformations applied simultaneously.
Although convolutional neural networks are invariant to some small local spatial transformations, once a picture undergoes a global, large transformation, the convolutional neural network is no longer robust. Therefore, the embodiment of the invention provides a convolutional neural network that has transformation-invariance capability (i.e., transformed pictures can still be recognized accurately, enabling subsequent classification and retrieval operations) and consistent expression: simply by introducing a consistency (invariance) loss function during training, the trained model becomes more robust to transformed pictures. The method also lets the model learn a transformation-invariant mode of expression, so that, compared with traditional methods that only learn a mapping from transformed pictures to fixed labels, it transfers better to other deep-learning problems. In addition, by introducing the consistency loss function, the method embeds the transformation-invariance capability into the weight parameters of the network, genuinely improving the transformation invariance of the convolutional neural network; it introduces no new parameters into the model, requires no extra processing of the pictures, and requires no change to the existing network structure at test time.
FIG. 3 is a block diagram of the convolutional neural network with transformation-invariance capability and consistent expression. In the training stage, a consistency loss function is introduced into a convolutional neural network comprising convolutional layers, fully-connected layers, and a Softmax layer, so that the trained convolutional neural network learns a transformation-invariant mode of expression.
A consistency loss function is introduced at the convolutional layers to push the network to learn consistent expressions of the feature information; at the fully-connected layers to push the network to learn expressions of the semantic information that are as consistent as possible; and at the Softmax layer to push the network to learn expressions of the classification information that are as consistent as possible.
Referring again to FIG. 3, in the training stage, two random transformations T′(·) and T″(·) are applied to the input sample picture X, and the resulting transformed pictures are denoted X′ and X″.
the consistency loss function of the ith layer in the convolutional neural network is added to the characteristic expression Fea of the pictures X 'and X' at the ith layeri(X') with Feai(X'), which is expressed as:
Figure BDA0001745981570000051
in the above formula, LiRepresenting the uniformity loss function for the ith layer.
The loss function of the entire convolutional neural network is expressed as:

L_All = λ_Cls × (L′_Cls + L″_Cls) + Σ_i λ_i × L_i

where the coefficient λ_i weighs the consistency loss function L_i of the i-th layer; L′_Cls and L″_Cls are the classification losses of pictures X′ and X″, respectively; and the coefficient λ_Cls weighs the classification loss L_Cls of sample picture X. Assuming there are N classes in total, L_Cls is the loss of a Softmax layer with N outputs.
In the embodiment of the present invention, the i-th layer refers to the i-th layer of the entire network, without specially distinguishing whether it is a convolutional layer, a fully-connected layer, or the Softmax layer.
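A minimal sketch of this total loss (assuming PyTorch, mean-squared error as the consistency distance, and a single shared weight standing in for all λ_i; the model interface that returns intermediate feature maps is an assumption for illustration, not an API prescribed by the patent):

```python
import torch.nn.functional as F

def rc_cnn_loss(model, x, labels, transform, lambda_cls=1.0, lambda_i=0.1):
    """L_All = lambda_Cls * (L'_Cls + L''_Cls) + sum_i lambda_i * L_i.

    Assumes `model(x)` returns (logits, [per-layer feature maps]) and
    `transform` is a random spatial transform (e.g. torchvision RandomAffine).
    """
    x1, x2 = transform(x), transform(x)          # X' = T'(X), X'' = T''(X)

    logits1, feats1 = model(x1)
    logits2, feats2 = model(x2)

    # Per-layer consistency losses L_i = ||Fea_i(X') - Fea_i(X'')||^2
    loss_consist = sum(F.mse_loss(f1, f2) for f1, f2 in zip(feats1, feats2))

    # Classification losses L'_Cls and L''_Cls on the two transformed pictures
    loss_cls = F.cross_entropy(logits1, labels) + F.cross_entropy(logits2, labels)

    return lambda_cls * loss_cls + lambda_i * loss_consist
```

In practice each layer can carry its own weight λ_i, exactly as in the formula above; a single shared value is used here only to keep the sketch short.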
In FIG. 3, T′(X) and T″(X) on the left denote the random transformations T′(·) and T″(·) applied to the sample picture X. The labels "L_Conv1, L_Conv2, ..., L_FC8" on the upward arrows in the middle denote the loss functions applied at the different layers; for example, L_Conv1 is the loss function on the first convolutional layer. The rightmost L_Cls denotes the classification loss function. The label below X indicates the true category of the sample picture X.
After training in the above manner, a convolutional neural network with transformation-invariance capability and consistent expression is obtained; in the test stage, a transformed test picture can be fed directly into the network, which outputs the classification result.
FIG. 4 compares the RC-CNN with the original model and with data augmentation. Panel (a) shows the distribution of the feature maps of original pictures in the original model. Panel (b) shows the distribution of the feature maps of transformed pictures in a model trained with data augmentation: even with augmentation, some of the internal expressions are mixed together and not easily separated. Panel (c) shows the convolutional neural network with transformation-invariance capability and consistent expression provided by the invention: by promoting consistent expression of the feature maps, even transformed pictures can be distinguished more easily.
To compare the RC-CNN provided by the invention with the best current methods, comparative experiments were performed on two tasks: a large-scale picture recognition task and a picture retrieval task. The RC-CNN is compared with the traditional convolutional neural network, a data-augmented convolutional neural network, SI-CNN, TI-CNN, ST-CNN, and other models.
For the large-scale picture recognition problem, the ILSVRC-2012 data is used. The dataset is divided into 1000 classes according to picture content and is a subset of ImageNet. The training set contains 1.3M pictures, the validation set 50,000 pictures, and the test set 100,000 pictures. Recognition accuracy is judged by two metrics, top-1 and top-5 accuracy: top-1 accuracy is the probability that the highest-confidence prediction matches the actual category, and top-5 accuracy is the probability that the actual category is among the five highest-confidence predictions. The results of the comparative experiments are shown in Tables 1 and 2.
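For concreteness, the two metrics can be computed as follows (a small sketch, not code from the patent):

```python
import torch

def topk_accuracy(logits, labels, k=5):
    # Fraction of samples whose true label is among the k
    # highest-confidence predictions (k=1 gives top-1 accuracy).
    topk = logits.topk(k, dim=1).indices            # (batch, k)
    hits = (topk == labels.unsqueeze(1)).any(dim=1)
    return hits.float().mean().item()
```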
TABLE 1: Results on the transformed ILSVRC-2012 dataset (top-1/top-5)
In the comparative experiments, the consistency loss function is added at the label level only (RC-CNN (Cls)), at the feature-expression level plus the label level (RC-CNN (Conv+Cls)), at the semantic level plus the label level (RC-CNN (FC+Cls)), and at all levels (RC-CNN (Conv+FC+Cls)). Adding the consistency loss at all levels achieves the best overall results.
TABLE 2: Results on the original ILSVRC-2012 dataset (top-1/top-5)
From the above results, it can be seen that the RC-CNN effectively improves the invariance of the convolutional neural network to transformations compared with the best existing methods. Meanwhile, the RC-CNN's results on the original-picture dataset do not degrade but improve somewhat, which shows that the RC-CNN is not merely overfitting transformed pictures to the prediction of their true labels.
For the picture retrieval problem, the UK-Bench dataset is used, a dataset dedicated to picture retrieval. It contains 2,550 groups of pictures, 4 pictures per group, all taken from different perspectives of the same object or scene. The task on this dataset is, given any one picture, to retrieve the other three pictures of the same group from the entire dataset. To verify the effect of RC-CNN on large-scale data, an additional 1,000,000 pictures from the MIRFLICKR dataset were added as negative examples. The models pre-trained for the picture classification task were used without retraining or fine-tuning on these data. All pictures in the dataset are fed into the model and the L2-normalized feature expressions are extracted. The Euclidean distances between one feature expression and the feature expressions of all pictures in the dataset are then computed and sorted in ascending order, and the nearest 4 pictures are used to compute the NS-Score, i.e., the average number of correct pictures among the four nearest. For example, if all four nearest pictures come from the correct group, that query scores 4.0. The experimental results are shown in Table 3.
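The retrieval protocol can be sketched as follows (assuming features have already been extracted into a NumPy array; counting the query itself among its 4 nearest neighbours follows the usual UK-Bench convention and is an assumption here):

```python
import numpy as np

def ns_score(features, group_ids):
    """Mean number of same-group pictures among each query's 4 nearest
    neighbours (including the query itself), on L2-normalized features."""
    feats = features / np.linalg.norm(features, axis=1, keepdims=True)
    scores = []
    for i, q in enumerate(feats):
        d = np.linalg.norm(feats - q, axis=1)     # Euclidean distances
        nearest4 = np.argsort(d)[:4]              # ascending order
        scores.append(np.sum(group_ids[nearest4] == group_ids[i]))
    return float(np.mean(scores))                 # maximum is 4.0
```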
TABLE 3: Results on the UK-Bench dataset
The results on the image-retrieval dataset show that RC-CNN also achieves clear improvements on a different task, indicating that the invention has a certain transfer capability.
The main idea of the above scheme provided by the embodiment of the invention is to give the network a certain robustness to transformations by introducing consistency optimization targets at three levels during training. With this optimization, even after the input picture undergoes a certain degree of transformation, the invariance of the feature expressions inside the network is clearly and effectively improved, and the network is more stable when recognizing pictures.
Through the above description of the embodiments, it is clear to those skilled in the art that the above embodiments can be implemented by software, or by software plus a necessary general hardware platform. With this understanding, the technical solutions of the embodiments can be embodied in the form of a software product, which can be stored in a non-volatile storage medium (a CD-ROM, a USB disk, a removable hard disk, etc.) and includes several instructions for enabling a computer device (a personal computer, a server, a network device, etc.) to execute the methods according to the embodiments of the present invention.
The above description covers only preferred embodiments of the present invention, but the protection scope of the present invention is not limited thereto. Any changes or substitutions that can be easily conceived by those skilled in the art within the technical scope disclosed by the present invention shall be covered by the protection scope of the present invention. Therefore, the protection scope of the present invention shall be subject to the protection scope of the claims.

Claims (1)

1. A convolutional neural network implementation method with transformation-invariance capability and consistent expression, comprising:
in the training stage, a consistency loss function is introduced into a convolutional neural network comprising convolutional layers, fully-connected layers, and a Softmax layer, so that the trained convolutional neural network learns a transformation-invariant mode of expression;
wherein the consistency loss function is introduced at the convolutional layers to push the network to learn consistent expressions of the feature information, at the fully-connected layers to push the network to learn consistent expressions of the semantic information, and at the Softmax layer to push the network to learn consistent expressions of the classification information;
in the training stage, two random transformations T′(·) and T″(·) are applied to an input sample picture X, and the resulting transformed pictures are denoted X′ and X″;
the consistency loss function of the i-th layer of the convolutional neural network is imposed between the feature expressions Fea_i(X′) and Fea_i(X″) of pictures X′ and X″ at the i-th layer, and is expressed as:

L_i = ‖Fea_i(X′) − Fea_i(X″)‖²

where L_i denotes the consistency loss function of the i-th layer;
the loss function of the entire convolutional neural network is expressed as:
LAll=λCls×(L′Cls+L″Cls)+∑λi×Li
wherein the coefficient lambdaiUsed to weigh the i-th layer's consistency loss function Li,L′ClsAnd L ″)ClsCorresponding to the classification losses of pictures X' and X ", respectively, by a factor λClsClassification loss L used to weigh sample pictures XCls
CN201810861718.8A, filed 2018-07-27 (priority date 2018-07-27): Convolutional neural network with transform invariant capability and consistent expression. Granted as CN109146058B; status: Active.

Priority Applications (1)

CN201810861718.8A (priority date 2018-07-27, filing date 2018-07-27): Convolutional neural network with transform invariant capability and consistent expression

Applications Claiming Priority (1)

CN201810861718.8A (priority date 2018-07-27, filing date 2018-07-27): Convolutional neural network with transform invariant capability and consistent expression

Publications (2)

CN109146058A: published 2019-01-04
CN109146058B: granted and published 2022-03-01

Family

Family ID: 64799291

Family Applications (1)

CN201810861718.8A (priority date 2018-07-27, filing date 2018-07-27, Active): Convolutional neural network with transform invariant capability and consistent expression

Country Status (1)

CN: CN109146058B

Families Citing this family (1)

* Cited by examiner, † Cited by third party
CN110633790B * (priority date 2019-09-19, publication date 2022-04-08), 郑州大学: Method and system for measuring residual oil quantity of airplane oil tank based on convolutional neural network


Patent Citations (7)

* Cited by examiner, † Cited by third party
US9971940B1 * (priority 2015-08-10, published 2018-05-15), Google LLC: Automatic learning of a video matching system
WO2017214968A1 * (priority 2016-06-17, published 2017-12-21), Nokia Technologies Oy: Method and apparatus for convolutional neural networks
CN106203420A * (priority 2016-07-26, published 2016-12-07), 浙江捷尚视觉科技股份有限公司: Bayonet vehicle color identification method
CN106897714A * (priority 2017-03-23, published 2017-06-27), 北京大学深圳研究生院: Video action detection method based on convolutional neural networks
CN107145900A * (priority 2017-04-24, published 2017-09-08), 清华大学: Pedestrian re-identification method based on consistency-constrained feature learning
CN108280411A * (priority 2018-01-10, published 2018-07-13), 上海交通大学: Pedestrian search method with spatial transformation capability
CN108257115A * (priority 2018-04-13, published 2018-07-06), 中山大学: Image enhancement detection method and system based on orientation-consistency convolutional neural networks

Non-Patent Citations (3)

* Cited by examiner, † Cited by third party
Title
Xu Shen et al., "Transform-Invariant Convolutional Neural Networks for Image Classification and Search," ACM, 2016-10-19, pp. 1345-1354. *
卢官明 et al., "A convolutional neural network for facial expression recognition" (一种用于人脸表情识别的卷积神经网络), Journal of Nanjing University of Posts and Telecommunications (Natural Science Edition), 2016-02-28, pp. 16-22. *
李洁樱, "A vehicle consistency discrimination method based on a Siamese convolutional neural network" (基于孪生卷积神经网络的车辆一致性判别方法), 《中国交通信息化》, 2018-04-30, pp. 104-105. *

Also Published As

CN109146058A: 2019-01-04

Similar Documents

Publication Publication Date Title
Zhu et al. A deep convolutional neural network approach to single-particle recognition in cryo-electron microscopy
CN107122809B (en) Neural network feature learning method based on image self-coding
CN111027563A (en) Text detection method, device and recognition system
US10867169B2 (en) Character recognition using hierarchical classification
Mohamed et al. Content-based image retrieval using convolutional neural networks
Kadam et al. Detection and localization of multiple image splicing using MobileNet V1
EP4002161A1 (en) Image retrieval method and apparatus, storage medium, and device
Ahmad et al. Data augmentation-assisted deep learning of hand-drawn partially colored sketches for visual search
TW202207077A (en) Text area positioning method and device
CN110008365B (en) Image processing method, device and equipment and readable storage medium
CN114067385A (en) Cross-modal face retrieval Hash method based on metric learning
Wei et al. Food image classification and image retrieval based on visual features and machine learning
Gao et al. SHREC’15 Track: 3D object retrieval with multimodal views
Jiang et al. MTFFNet: a multi-task feature fusion framework for Chinese painting classification
Pengcheng et al. Fast Chinese calligraphic character recognition with large-scale data
CN109146058B (en) Convolutional neural network with transform invariant capability and consistent expression
Sunitha et al. Novel content based medical image retrieval based on BoVW classification method
Xu et al. Chinese characters recognition from screen-rendered images using inception deep learning architecture
CN115640401B (en) Text content extraction method and device
US11816909B2 (en) Document clusterization using neural networks
Tomei et al. Image-to-image translation to unfold the reality of artworks: an empirical analysis
CN115100694A (en) Fingerprint quick retrieval method based on self-supervision neural network
Mauricio et al. High-resolution generative adversarial neural networks applied to histological images generation
Ameur et al. Hybrid descriptors and weighted PCA-EFMNet for face verification in the wild
Ouni et al. An efficient ir approach based semantic segmentation

Legal Events

PB01: Publication
SE01: Entry into force of request for substantive examination
GR01: Patent grant