CN109146058A - Convolutional neural network with transform-invariant capability and consistent expression - Google Patents
- Publication number: CN109146058A (application CN201810861718.8A)
- Authority: CN (China)
- Prior art keywords: neural networks, convolutional, expression, picture, consistency
- Legal status: Granted (the legal status is an assumption and is not a legal conclusion; Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed)
Classifications
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06N—COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
- G06N3/00—Computing arrangements based on biological models
- G06N3/02—Neural networks
- G06N3/04—Architecture, e.g. interconnection topology
- G06N3/045—Combinations of networks
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F18/00—Pattern recognition
- G06F18/20—Analysing
- G06F18/24—Classification techniques
- G06F18/241—Classification techniques relating to the classification model, e.g. parametric or non-parametric approaches
- G06F18/2413—Classification techniques relating to the classification model, e.g. parametric or non-parametric approaches based on distances to training or reference patterns
Abstract
The invention discloses a convolutional neural network with transform-invariant capability and consistent expression. Only an invariance loss function needs to be introduced during training to make the trained model more robust to transformed pictures. At the same time, the method enables the model to learn transform-invariant ways of expression, whereas traditional methods only learn a mapping from transformed pictures to fixed labels; the method therefore transfers better to other deep-learning problems. In addition, the method embeds the transform-invariant capability into the weight parameters of the network, genuinely improving the transform invariance of the convolutional neural network: no new parameters are introduced into the model, no extra processing of pictures is required, and the existing network structure does not need to be changed at test time.
Description
Technical field
The present invention relates to technical fields such as image classification and image retrieval, and in particular to a convolutional neural network with transform-invariant capability and consistent expression.
Background art
In recent years, with the rapid development of the Internet, we can access massive numbers of pictures and videos. For these massive pictures, accurate recognition and retrieval form the basis of all picture-related applications. In the past, limited by relatively insufficient computing power, only some relatively low-level feature extraction algorithms could be used, and these algorithms could not accurately express the high-level semantic information of pictures. With the improvement of computing power, deep learning technology has brought a series of breakthroughs in related fields such as image recognition and image retrieval. In applications such as picture recognition and retrieval, deep learning mainly uses convolutional neural networks. Through operations such as convolution and pooling, the model can extract feature representations layer by layer, from local to global. Compared with traditional methods, the accurate expression of high-level semantics enables this technology to greatly surpass traditional algorithms in recognition performance.
However, existing convolutional neural networks are not particularly robust to pictures that have undergone various spatial transformations. By visualizing the outputs of intermediate network layers, we can see that after the input picture is rotated, scaled, or translated, the feature representations at all levels differ greatly, and recognition accuracy therefore drops dramatically.
Existing methods mainly address this problem from three angles. The first method mainly enhances the data set during training, so that the model is adequately trained on pictures under various transformations; such processing increases sample diversity, so the robustness of the model on transformed pictures is improved accordingly. The second method feeds the variously transformed pictures into a multi-channel structure and applies a max-pooling operation over the feature-map outputs of the channels, taking the max-pooled feature map as the feature representation of the picture. The third method learns the transformation of the picture through an additional neural network and, according to this transformation, inversely transforms the picture back to a more standard pose, then classifies the picture in this standard pose, so that the picture recognition effect is likewise improved.
However, the three methods above either increase training time or introduce additional parameters and operations, increasing the computational complexity at recognition time. Meanwhile, if the robustness of the network to transformations is increased by modifying its structure, the existing network structure must also be modified when the network is applied, which is unfavorable for model portability.
Summary of the invention
The object of the present invention is to provide a convolutional neural network with transform-invariant capability and consistent expression, so that the invariance of the feature representations inside the network is effectively improved and the network becomes more robust when recognizing pictures.
The object of the present invention is achieved through the following technical solutions:
A convolutional neural network with transform-invariant capability and consistent expression, comprising:
in the training stage, a consistency loss function is introduced into a convolutional neural network comprising convolutional layers, fully connected layers, and a Softmax layer, so that the trained convolutional neural network learns transform-invariant ways of expression;

wherein the consistency loss function introduced at the convolutional layers pushes the network to learn consistent expression of feature information, the consistency loss function introduced at the fully connected layers pushes the network to learn consistent expression of semantic information, and the consistency loss function introduced at the Softmax layer pushes the network to learn consistent expression of classification information.
As can be seen from the above technical solution provided by the invention, by introducing expression-consistency optimization objectives at the feature level, the semantic level, and the classification-label level layer by layer, the expressions of the convolutional neural network model at these three levels become robust to transformations.
Brief description of the drawings
In order to more clearly illustrate the technical solutions of the embodiments of the present invention, the drawings required in the description of the embodiments are briefly introduced below. Obviously, the drawings described below are only some embodiments of the present invention; for those of ordinary skill in the art, other drawings can be obtained from these drawings without creative effort.
Fig. 1 is a schematic diagram of a convolutional neural network provided by an embodiment of the present invention;

Fig. 2 is a schematic diagram of pictures before and after basic transformations, provided by an embodiment of the present invention;

Fig. 3 is a framework diagram of the convolutional neural network with transform-invariant capability and consistent expression, provided by an embodiment of the present invention;

Fig. 4 is a comparison diagram of the RC-CNN provided by an embodiment of the present invention against the original model and a data-augmentation scheme.
Specific embodiments
The technical solutions in the embodiments of the present invention are described below clearly and completely with reference to the accompanying drawings. Obviously, the described embodiments are only some, not all, of the embodiments of the present invention. Based on the embodiments of the present invention, all other embodiments obtained by those of ordinary skill in the art without creative effort fall within the protection scope of the present invention.
An embodiment of the present invention provides a convolutional neural network with transform-invariant capability and consistent expression (RC-CNN). Before introducing RC-CNN, conventional convolutional neural networks (CNNs) and the basic transformations of images are first introduced.
1. Convolutional neural networks

Convolutional neural networks (CNNs) are multi-layer deep neural networks. In each layer, different convolution kernels are learned as feature extraction operators; these kernels are then convolved with the feature maps of the previous layer to obtain the feature maps of the current layer. The feature maps of lower layers mainly learn relatively low-level feature information, such as edges and corners. As the layers gradually deepen, the information expressed by the feature maps of each layer becomes gradually more abstract, so the feature representations in different layers represent the feature information of the picture at all levels. Weight sharing and spatial pooling operations give convolutional neural networks invariance to some small local spatial transformations, and also reduce the number of model parameters. In a convolutional neural network, the operation of a convolutional layer can be expressed by the following formula:

X_i^j = f(W_i^j * X_{i-1} + b_i^j);

where * denotes the convolution operation, X_{i-1} is the feature map of the (i-1)-th layer, W_i^j is the j-th convolution kernel of the i-th layer, and b_i^j is the bias of the j-th feature representation of the i-th layer; W_i^j and b_i^j are learned by the gradient descent algorithm. f(·) is a nonlinear function, such as the ReLU, Sigmoid, or Tanh function.
The operation of a fully connected layer is essentially the same as that of convolution, except that the convolution symbol * is replaced by the matrix multiplication symbol ×, as in the following formula:

X_i = f(W_i × X_{i-1} + b_i);
Fig. 1 shows a schematic diagram of a convolutional neural network (CNN); it includes convolutional layers (C1~C5), fully connected layers (FC6~FC8), and a Softmax layer.

The convolution operations extract features of the input picture from low layers to high layers. The fully connected layers further abstract the feature-level representation of the picture into a representation at a higher semantic level. The output of the last fully connected layer, FC8, is usually followed by a Softmax layer, whose output is the confidence the network predicts for each class.
2. Basic transformations of images

In the embodiment of the present invention, the basic transformations of images considered are mainly some basic spatial transformations, including rotation, translation, and scaling. Suppose the coordinates of the original image are (x, y) and the transformed picture coordinates are (x', y'). Then the transformation of the picture can be realized by the following formula:

(x', y', 1) = (x, y, 1) × T;

where T is the transformation matrix of the picture.

The rotation transformation matrix T_R is as follows:

T_R = [[cos θ, sin θ, 0], [-sin θ, cos θ, 0], [0, 0, 1]];

where θ is the angle of rotation.

The translation transformation matrix T_T is as follows:

T_T = [[1, 0, 0], [0, 1, 0], [d_x, d_y, 1]];

where d_x and d_y are the numbers of pixels by which the picture is translated in the x and y directions, respectively.

The scaling transformation matrix T_S is as follows:

T_S = [[s_x, 0, 0], [0, s_y, 0], [0, 0, 1]];

where s_x and s_y are the scaling ratios of the picture in the x and y directions, respectively.

The transformation matrix T_RTS combining all of the transformations is obtained by multiplying the three matrices above:

T_RTS = T_R × T_T × T_S.
Fig. 2 shows examples of pictures before and after the basic transformations: the ORI column contains the original pictures, the R column the rotated pictures, the T column the translated pictures, and the S column the scaled pictures; RTS means all three transformations are applied to the picture simultaneously.
Although convolutional neural networks are invariant to some small local spatial transformations, once a picture undergoes global and larger transformations, convolutional neural networks are no longer robust. Therefore, the embodiment of the present invention provides a convolutional neural network with transform-invariant capability (i.e., a transformed picture can still be recognized accurately, enabling subsequent classification and retrieval operations) and consistent expression. Only an invariance loss function needs to be introduced during training to make the trained model more robust to transformed pictures. At the same time, this method enables the model to learn transform-invariant ways of expression, whereas traditional methods only learn a mapping from transformed pictures to fixed labels, so this method transfers better to other deep-learning problems. In addition, by introducing the consistency loss function, the method embeds the transform-invariant capability into the weight parameters of the network, genuinely improving the transform invariance of the convolutional neural network: no new parameters are introduced into the model, no extra processing of pictures is required, and the existing network structure does not need to be changed at test time.
Fig. 3 is a framework diagram of the convolutional neural network with transform-invariant capability and consistent expression. In the training stage, consistency loss functions are introduced at the convolutional layers, fully connected layers, and Softmax layer of the convolutional neural network, so that the trained network learns transform-invariant ways of expression;

wherein the consistency loss function introduced at the convolutional layers pushes the network to learn consistent expression of feature information; the consistency loss function introduced at the fully connected layers pushes the network to learn consistent expression of semantic information, so that the network's expression of semantic information is as consistent as possible; and the consistency loss function introduced at the Softmax layer pushes the network to learn consistent expression of classification information, so that the expression of classification information is as consistent as possible.
Referring again to Fig. 3, in the training stage, two random transformations T'(·) and T''(·) are applied to the input sample picture X; the resulting transformed pictures are denoted X' and X''.

The consistency loss function of the i-th layer of the convolutional neural network is added between the feature representations Fea_i(X') and Fea_i(X'') of pictures X' and X'' at the i-th layer, expressed as:

L_i = ||Fea_i(X') - Fea_i(X'')||^2;

In the above formula, L_i denotes the consistency loss function of the i-th layer.
The loss function of the entire convolutional neural network is expressed as:

L_All = λ_Cls × (L'_Cls + L''_Cls) + Σ_i λ_i × L_i;

where the coefficient λ_i weights the consistency loss function L_i of the i-th layer, L'_Cls and L''_Cls correspond respectively to the classification losses of pictures X' and X'', and the coefficient λ_Cls weights the classification loss L_Cls of the sample picture X. Assuming there are N classes in total, L_Cls is the loss over the N outputs of the Softmax layer.

In the embodiment of the present invention, the i-th layer above refers to the i-th layer of the whole network, without distinguishing whether it is specifically a convolutional layer, a fully connected layer, or the Softmax layer.
In Fig. 3, T'(X) and T''(X) on the left side refer to applying the random transformations T'(·) and T''(·) to the sample picture X; the labels "L_Conv1, L_Conv2, ..., L_FC8" on the upward arrows in the middle part indicate the loss functions added at the different layers, e.g., L_Conv1 refers to the loss function on the first convolutional layer; L_Cls on the far right denotes the classification loss function; and "Ground truth of X" at the bottom denotes the true class of the sample picture X.
After training is completed in the above manner, a convolutional neural network with transform-invariant capability and consistent expression is obtained. In the test stage, the transformed test picture is fed directly into the network, and the classification result can be output.
Fig. 4 compares RC-CNN with the original model and with data augmentation. Panel (a) shows the distribution of the feature maps of the original pictures in the original model. Panel (b) shows the distribution of the feature maps of transformed pictures in a model trained with data augmentation; it can be seen that, even with data augmentation, some internal representations are aliased together and not easily separated. Panel (c) shows the convolutional neural network with transform-invariant capability and consistent expression provided by the invention: the expressions of the feature maps are pushed to be consistent, so that even transformed pictures can be distinguished more easily.
To compare the RC-CNN provided by the invention with other current state-of-the-art methods, comparative experiments were carried out on two tasks: a large-scale picture recognition task and a picture retrieval task. RC-CNN was compared with traditional convolutional neural networks, data-augmented convolutional neural networks, and models such as SI-CNN, TI-CNN, and ST-CNN.
For the large-scale picture recognition problem, we use the ILSVRC-2012 data, a subset of ImageNet in which the pictures are divided into 1000 classes according to their content. The training set has a total of 1.3M pictures, the validation set has 50,000 pictures, and the test set has 100,000 pictures. Recognition accuracy is generally judged by two indices: top-1 accuracy and top-5 accuracy. Top-1 represents the probability that the prediction with the highest confidence agrees with the actual class; top-5 represents the probability that the actual class is among the five predictions with the highest confidence. The comparative experimental results are shown in Tables 1 and 2.
Table 1: Results (top-1/top-5) on the transformed ILSVRC-2012 data set
In the above comparative experiments, the consistency loss function is added at the label level only (RC-CNN (Cls)), at the feature level plus label level (RC-CNN (Conv+Cls)), at the semantic level plus label level (RC-CNN (FC+Cls)), and at all levels (RC-CNN (Conv+FC+Cls)). It can be seen that adding the consistency loss at all levels achieves the best overall results.
Table 2: Results (top-1/top-5) on the original ILSVRC-2012 data set
From the above results, it can be seen that RC-CNN surpasses the other current best results and effectively improves the invariance of the convolutional neural network to transformations. Meanwhile, the results of RC-CNN on the original-picture data set not only do not decrease but even improve somewhat, showing that RC-CNN does not merely overfit the prediction from transformed pictures to their true labels.
For the picture retrieval problem, the UK-Bench data set, a data set used exclusively for picture retrieval, is used. It contains 2550 groups of pictures; each group has 4 pictures, which are different views of the same object or scene. The task on this data set is to use any one picture to retrieve the other three pictures of the same group from the entire data set. To verify the effect of RC-CNN on large-scale data, an additional 1,000,000 pictures from the MIRFlickr data set are brought in as negative samples. The model pre-trained on the picture classification task above is used for these data without retraining or fine-tuning. All pictures in the data set are fed into the model and their L2-normalized feature representations are extracted. Then the Euclidean distance between one feature representation and the feature representations of all pictures in the data set is computed and sorted in ascending order. The 4 nearest pictures are used to compute the NS-Score, which represents the average accuracy over the four closest pictures: for example, if all four pictures come from the correct group, the query picture obtains a score of 4.0. The experimental results are shown in Table 3.
Table 3: Results on the UK-Bench data set
It can be seen from the results on the image retrieval data set that RC-CNN obtains an obvious improvement across different tasks, showing that the present invention has a certain transferable ability.
The main idea of the above scheme provided by the embodiment of the present invention is to make the network robust to transformations by introducing consistency optimization objectives at three levels during training. With this optimization method, after the input picture undergoes a certain transformation, it can be clearly seen that the invariance of the feature representations inside the network is effectively improved, so the network is more robust when recognizing pictures.
Through the description of the above embodiments, those skilled in the art can clearly understand that the above embodiments can be realized by software, or by software plus a necessary general hardware platform. Based on this understanding, the technical solutions of the above embodiments can be embodied in the form of a software product, which can be stored in a non-volatile storage medium (a CD-ROM, USB flash disk, mobile hard disk, etc.) and includes several instructions to make a computing device (a personal computer, server, network device, etc.) execute the methods described in the embodiments of the present invention.
The foregoing is only a preferred embodiment of the present invention, but the protection scope of the present invention is not limited thereto. Any changes or substitutions that can be easily conceived by anyone skilled in the art within the technical scope disclosed by the present invention shall be covered by the protection scope of the present invention. Therefore, the protection scope of the present invention shall be subject to the protection scope of the claims.
Claims (3)
1. A convolutional neural network with transform-invariant capability and consistent expression, characterized by comprising:

in a training stage, introducing a consistency loss function into a convolutional neural network comprising convolutional layers, fully connected layers, and a Softmax layer, so that the trained convolutional neural network learns transform-invariant ways of expression;

wherein the consistency loss function introduced at the convolutional layers pushes the network to learn consistent expression of feature information, the consistency loss function introduced at the fully connected layers pushes the network to learn consistent expression of semantic information, and the consistency loss function introduced at the Softmax layer pushes the network to learn consistent expression of classification information.
2. The convolutional neural network with transform-invariant capability and consistent expression according to claim 1, characterized in that:

in the training stage, two random transformations T'(·) and T''(·) are applied to the input sample picture X, and the resulting transformed pictures are denoted X' and X'';

the consistency loss function of the i-th layer of the convolutional neural network is added between the feature representations Fea_i(X') and Fea_i(X'') of pictures X' and X'' at the i-th layer, expressed as:

L_i = ||Fea_i(X') - Fea_i(X'')||^2;

In the above formula, L_i denotes the consistency loss function of the i-th layer.
3. The convolutional neural network with transform-invariant capability and consistent expression according to claim 2, characterized in that the loss function of the entire convolutional neural network is expressed as:

L_All = λ_Cls × (L'_Cls + L''_Cls) + Σ_i λ_i × L_i;

where the coefficient λ_i weights the consistency loss function L_i of the i-th layer, L'_Cls and L''_Cls correspond respectively to the classification losses of pictures X' and X'', and the coefficient λ_Cls weights the classification loss L_Cls of the sample picture X.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201810861718.8A CN109146058B (en) | 2018-07-27 | 2018-07-27 | Convolutional neural network with transform invariant capability and consistent expression |
Publications (2)
Publication Number | Publication Date |
---|---|
CN109146058A true CN109146058A (en) | 2019-01-04 |
CN109146058B CN109146058B (en) | 2022-03-01 |
Family
ID=64799291
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN201810861718.8A Active CN109146058B (en) | 2018-07-27 | 2018-07-27 | Convolutional neural network with transform invariant capability and consistent expression |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN109146058B (en) |
Cited By (2)
Publication number | Priority date | Publication date | Assignee | Title
---|---|---|---|---|
CN110633790A (en) * | 2019-09-19 | 2019-12-31 | 郑州大学 | Method and system for measuring residual oil quantity of airplane oil tank based on convolutional neural network |
CN110633790B (en) * | 2019-09-19 | 2022-04-08 | 郑州大学 | Method and system for measuring residual oil quantity of airplane oil tank based on convolutional neural network |
Citations (7)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN106203420A (en) * | 2016-07-26 | 2016-12-07 | 浙江捷尚视觉科技股份有限公司 | A kind of bayonet vehicle color identification method |
CN106897714A (en) * | 2017-03-23 | 2017-06-27 | 北京大学深圳研究生院 | A kind of video actions detection method based on convolutional neural networks |
CN107145900A (en) * | 2017-04-24 | 2017-09-08 | 清华大学 | Pedestrian based on consistency constraint feature learning recognition methods again |
WO2017214968A1 (en) * | 2016-06-17 | 2017-12-21 | Nokia Technologies Oy | Method and apparatus for convolutional neural networks |
US9971940B1 (en) * | 2015-08-10 | 2018-05-15 | Google Llc | Automatic learning of a video matching system |
CN108257115A (en) * | 2018-04-13 | 2018-07-06 | 中山大学 | Image enhancement detection method and system based on orientation consistency convolutional neural networks |
CN108280411A (en) * | 2018-01-10 | 2018-07-13 | 上海交通大学 | A kind of pedestrian's searching method with spatial alternation ability |
Non-Patent Citations (3)
Title |
---|
XU SHEN et al.: "Transform-Invariant Convolutional Neural Networks for Image Classification and Search", 《ACM》 *
LU Guanming et al.: "A convolutional neural network for facial expression recognition", 《Journal of Nanjing University of Posts and Telecommunications (Natural Science Edition)》 *
LI Jieying: "A vehicle consistency discrimination method based on Siamese convolutional neural networks", 《China Transportation Informatization》 *
Also Published As
Publication number | Publication date |
---|---|
CN109146058B (en) | 2022-03-01 |
Legal Events

Date | Code | Title | Description |
---|---|---|---|
| PB01 | Publication | |
| SE01 | Entry into force of request for substantive examination | |
| GR01 | Patent grant | |