CN111652246B - Image self-adaptive sparsization representation method and device based on deep learning - Google Patents
Info
- Publication number
- CN111652246B (application CN202010385699.3A)
- Authority
- CN
- China
- Prior art keywords
- image
- feature
- value
- deep learning
- neural network
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Active
Classifications
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06V—IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
- G06V10/00—Arrangements for image or video recognition or understanding
- G06V10/40—Extraction of image or video features
- G06V10/44—Local feature extraction by analysis of parts of the pattern, e.g. by detecting edges, contours, loops, corners, strokes or intersections; Connectivity analysis, e.g. of connected components
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F18/00—Pattern recognition
- G06F18/20—Analysing
- G06F18/24—Classification techniques
- G06F18/241—Classification techniques relating to the classification model, e.g. parametric or non-parametric approaches
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F18/00—Pattern recognition
- G06F18/20—Analysing
- G06F18/24—Classification techniques
- G06F18/241—Classification techniques relating to the classification model, e.g. parametric or non-parametric approaches
- G06F18/2415—Classification techniques relating to the classification model, e.g. parametric or non-parametric approaches based on parametric or probabilistic models, e.g. based on likelihood ratio or false acceptance rate versus a false rejection rate
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06N—COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
- G06N3/00—Computing arrangements based on biological models
- G06N3/02—Neural networks
- G06N3/04—Architecture, e.g. interconnection topology
- G06N3/045—Combinations of networks
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06N—COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
- G06N3/00—Computing arrangements based on biological models
- G06N3/02—Neural networks
- G06N3/04—Architecture, e.g. interconnection topology
- G06N3/047—Probabilistic or stochastic networks
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06N—COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
- G06N3/00—Computing arrangements based on biological models
- G06N3/02—Neural networks
- G06N3/04—Architecture, e.g. interconnection topology
- G06N3/048—Activation functions
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06N—COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
- G06N3/00—Computing arrangements based on biological models
- G06N3/02—Neural networks
- G06N3/08—Learning methods
- G06N3/084—Backpropagation, e.g. using gradient descent
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06V—IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
- G06V10/00—Arrangements for image or video recognition or understanding
- G06V10/40—Extraction of image or video features
- G06V10/513—Sparse representations
-
- Y—GENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
- Y02—TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
- Y02T—CLIMATE CHANGE MITIGATION TECHNOLOGIES RELATED TO TRANSPORTATION
- Y02T10/00—Road transport of goods or passengers
- Y02T10/10—Internal combustion engine [ICE] based vehicles
- Y02T10/40—Engine management systems
Abstract
An image adaptive sparsification representation method and device based on deep learning are disclosed. The method comprises the following steps: A1, selecting an arbitrary deep convolutional neural network model M, adding a deep-learning module based on a semi-hard attention mechanism to the convolutional layer at each stage of convolution, and constructing a new deep convolutional neural network model M'; A2, setting a linearly increasing semi-hard attention sparsification value range to obtain a sparse image representation; and A3, setting a loss function suited to the task and training the whole deep convolutional neural network model M' with back-propagation. Without introducing extra time or space complexity, the method stably improves the recognition accuracy of deep convolutional models on computer vision tasks such as image recognition and object detection.
Description
Technical Field
The invention relates to the fields of computer vision and deep learning, and in particular to an image adaptive sparsification representation method and device based on deep learning.
Background
Computer vision captures natural scenes with a camera, or generates images with a computer, and uses electronic equipment to recognize, locate, and monitor targets in those images. It can be regarded as the application of machine learning to the visual domain and is an important component of artificial intelligence. The main research content of computer vision can be summarized as follows: pictures or videos are acquired, preprocessed, and analyzed to obtain the information we need, which is usually called features. In short, cameras and electronic devices are used to capture the intrinsic information of pictures or videos.
Computer vision is a comprehensive discipline that touches a wide range of fields. At the current stage of research, computer vision attempts to build what is commonly called an artificial intelligence (AI) system. In recent years, the theory and techniques around computer vision have mainly focused on extracting high-dimensional features from images or videos as a representation of the image or video content.
Traditional feature extraction mainly relies on hand-crafted feature designs, such as the classic SIFT (Scale-Invariant Feature Transform) feature. SIFT consists of four basic steps: (1) scale-space extremum detection: extreme points are searched at every position of the multi-scale representation of the scaled image, and potential interest points that are invariant to scale and rotation are identified with a difference-of-Gaussian function; (2) keypoint localization: at each position and scale, a fine model is fitted to judge whether the candidate extreme points detected in the first step are stable, and the stable ones are kept as keypoints for subsequent computation; (3) orientation assignment: gradient directions are determined from the local image information and one or more orientations are assigned to each keypoint, so that all subsequent operations on the image data are transformations relative to the orientation, scale, and position of the keypoints, which ultimately yields features robust to rotation and spatial changes; (4) keypoint description: local image gradients are measured at different scales within the neighbourhood of each keypoint. The set of all such gradient descriptors is the SIFT feature of the image, which is robust to large local shape deformations and illumination changes.
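For concreteness, the following is a minimal usage sketch (not part of the patent) of this classic pipeline using OpenCV's SIFT implementation; the input file name is a placeholder.

```python
# Minimal SIFT usage sketch (illustrative only; "example.jpg" is a placeholder path).
import cv2

image = cv2.imread("example.jpg", cv2.IMREAD_GRAYSCALE)

sift = cv2.SIFT_create()                                       # steps (1)-(3): scale-space extrema,
keypoints, descriptors = sift.detectAndCompute(image, None)    # keypoint filtering, orientation

# Each keypoint stores location, scale and dominant orientation; each 128-dimensional
# descriptor row is the local-gradient histogram of step (4).
print(len(keypoints), descriptors.shape)
```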
Compared with traditional feature extraction methods such as SIFT, modern deep-learning feature extraction from images is much simpler in design and comprises only three parts, as sketched below: (1) convolutional layers, which convolve local image information to obtain features with local receptive fields; (2) non-linear layers, which enhance the representational capacity of the convolutional outputs; and (3) fully connected layers, which transform the global image information to obtain features with a global receptive field. The learned features are similar in spirit to traditional features, being essentially a representation of the position- and rotation-invariant content of the image, but each network layer is trained on a specific data set with back-propagation and shows a more robust representation capability when large amounts of data are available.
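A minimal PyTorch sketch of this three-part structure (layer sizes and names are illustrative assumptions, not taken from the patent):

```python
import torch
import torch.nn as nn

class TinyExtractor(nn.Module):
    """Convolution -> non-linearity -> fully connected layer, as described above."""
    def __init__(self, num_classes: int = 10):
        super().__init__()
        self.conv = nn.Conv2d(3, 16, kernel_size=3, padding=1)  # (1) local receptive field
        self.act = nn.ReLU()                                     # (2) non-linear layer
        self.fc = nn.Linear(16, num_classes)                     # (3) global transformation

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        x = self.act(self.conv(x))       # local features
        x = x.mean(dim=(2, 3))           # pool to a per-channel global descriptor
        return self.fc(x)                # global receptive field / class scores

out = TinyExtractor()(torch.randn(2, 3, 32, 32))
print(out.shape)  # torch.Size([2, 10])
```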
Disclosure of Invention
The main object of the invention is to provide an image adaptive sparsification representation method and device based on deep learning, so that the recognition accuracy of deep convolutional models on computer vision tasks such as image recognition and object detection is improved without introducing extra time or space complexity.
In order to achieve the purpose, the invention adopts the following technical scheme:
an image self-adaptive sparsification characterization method based on deep learning comprises the following steps:
a1, selecting an arbitrary deep convolutional neural network model M, adding a deep learning method based on a semi-hard attention mechanism at each stage of convolutional operation, adding the semi-hard attention mechanism into a convolutional layer, and constructing a new deep convolutional neural network model M';
a2, setting a linear increasing semihard attention sparse value domain for obtaining sparse image representation;
a3, setting a loss function suitable for a task, and training a whole deep convolutional neural network model M' by utilizing back propagation;
the semi-hard attention mechanism is that the neural network learns the weight value of the image feature by using the statistical information of the image feature, and when the weight value is smaller than a set value range k, the image feature corresponding to the weight value smaller than k is reset to zero.
Further, the method comprises the following steps:
in the step A1, multiple convolution operations are used to gradually extract image local information, and then the image features with the local information are convolved, so as to extract global information.
In the step A1, the convolution operation is as shown in formula (1):

F_{i+1,j} = Conv(F_{i,*})   (1)

wherein Conv represents the convolution operation, F_{i+1,j} represents the j-th feature output by the (i+1)-th convolutional layer, and F_{i,*} represents all the features of the i-th layer.

An attention mechanism is then introduced as formula (2), in which the mean of each image feature is used to determine the importance of that feature:

v_{i+1,j} = avgpool(F_{i+1,j})   (2)

wherein F_{i+1,j} is a two-dimensional feature of height h and width w, and avgpool averages this two-dimensional feature, i.e. v_{i+1,j} = (1/(hw)) Σ_{a=1}^{h} Σ_{b=1}^{w} F_{i+1,j}(a, b).

The means are then mapped to weights in [0,1] by a linear transformation and a non-linear activation function:

v'_{i+1,*} = δ(W v_{i+1,*})   (3)

wherein v'_{i+1,*} is the attention value, W is the learnable weight of the linear transformation, and δ represents the sigmoid activation function.
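The following is a minimal PyTorch sketch of formulas (1)-(3) (the module and variable names are illustrative assumptions): the convolution output is average-pooled per channel, passed through the learnable linear map W and a sigmoid, yielding one weight in [0,1] per feature map.

```python
import torch
import torch.nn as nn

class AttentionWeights(nn.Module):
    def __init__(self, channels: int):
        super().__init__()
        self.W = nn.Linear(channels, channels, bias=False)  # learnable linear transformation W

    def forward(self, F_out: torch.Tensor) -> torch.Tensor:
        # F_out: convolution output F_{i+1,*} of shape (batch, channels, h, w)
        v = F_out.mean(dim=(2, 3))            # formula (2): per-channel mean (avgpool)
        return torch.sigmoid(self.W(v))       # formula (3): weights v' in [0, 1]

weights = AttentionWeights(64)(torch.randn(8, 64, 56, 56))
print(weights.shape)  # torch.Size([8, 64])
```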
Preferably, a semi-hard attention mechanism is added every third convolution layer for feature sparsification.
In the step A2, a dynamic value-range function is set:

y = min(f(x), k)   (4)

wherein f(x) is a linear function and x represents the number of iterations in training, increasing from 0 to the maximum number of iterations; once the value of f(x) exceeds k, the dynamic value range is fixed at k and no longer changes. As the network iterates, the weight value of each image feature first learns a locally optimal solution, the value range below which weights are zeroed then increases gradually, and the network finally converges to an overall optimal solution.

At each iteration, the attention values v'_{i+1,j} that are smaller than the dynamic value range y in formula (4) are set to 0 and then applied to the convolution features of the current layer, resulting in self-sparsified features:

F'_{i+1,*} = v'_{i+1,*} * F_{i+1,*}   (5)
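A minimal sketch of formulas (4) and (5) (the slope and cap are the example values f(x) = 3e-5·x and k = 0.3 given in the detailed description; the function names are illustrative):

```python
import torch

def dynamic_threshold(iteration: int, slope: float = 3e-5, k: float = 0.3) -> float:
    """Formula (4): y = min(f(x), k) with a linear f(x)."""
    return min(slope * iteration, k)

def semi_hard_attention(features: torch.Tensor, weights: torch.Tensor, iteration: int) -> torch.Tensor:
    """Formula (5): zero the attention values below the threshold and re-weight the features."""
    y = dynamic_threshold(iteration)
    weights = torch.where(weights < y, torch.zeros_like(weights), weights)
    return weights[:, :, None, None] * features   # F' = v' * F

sparse = semi_hard_attention(torch.randn(8, 64, 56, 56), torch.rand(8, 64), iteration=5000)
```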
in the step A3, the cross entropy function is used to process the image classification related task, and the mean square error loss function is used to process the target detection related task.
In the step A3, for the ImageNet image classification task, two fully connected layers and a softmax layer are appended to the one-dimensional global feature to output the predicted value of each class, and a cross-entropy loss is used to train the whole network with back-propagation:

H(p, q) = −Σ_{i=1}^{n} p_i · log(q_i)

where n represents the number of classes, p represents the correct answer given by the label, and q represents the predicted values output by the trained model.
An image adaptive sparsification representation device based on deep learning comprises a computer-readable storage medium and a processor, the computer-readable storage medium storing an executable program which, when executed by the processor, implements the above image adaptive sparsification representation method based on deep learning.
A computer-readable storage medium stores an executable program which, when executed by a processor, implements the deep-learning-based image adaptive sparsification representation method.
The invention has the following beneficial effects:
the invention provides an image self-adaptive sparsity characterization method and device based on deep learning, which can be well fused in the current mainstream deep convolution models (such as ResNet) and stably improve the recognition accuracy of the deep convolution models on computer vision tasks such as image recognition, target detection and the like under the condition of not introducing extra time complexity and space complexity. The invention simultaneously proves the effectiveness of the ImageNet data set and the COCO data set which are widely used.
With this method, after adaptive sparsification is added to any deep learning model, the generalization and robustness of the model are clearly enhanced, i.e. the representational capability of the image features is strengthened, while the time and space complexity of the model remain unchanged.
Detailed Description
The embodiments of the present invention will be described in detail below. It should be emphasized that the following description is merely exemplary in nature and is not intended to limit the scope of the invention or its application.
The embodiment of the invention provides an image self-adaptive sparsification characterization method based on deep learning, which comprises the following steps:
A1, selecting an arbitrary deep convolutional neural network model M and adding a module based on a semi-hard attention mechanism at each stage of convolution; the new model is denoted M'.
Convolution is the standard way of extracting image features in deep learning: repeated convolution operations extract local image information step by step, from shallow to deep, and further convolution of the features carrying this local information then extracts global information. Over the whole process, convolution can extract both shallow local features of the image (such as texture information) and high-level global features (such as semantic information). The attention mechanism means that the neural network uses statistical information of the image features at each stage (such as their mean and variance) to learn a weight for each feature, with weight values in [0,1] indicating the importance of the feature. The semi-hard attention mechanism of the invention sets a value range k; when a weight value is smaller than k, the image feature corresponding to that weight is zeroed, i.e. a sparsification operation is performed. The purpose is to keep the most important features, while zeroing unimportant features prevents this unimportant information from influencing the back-propagation of the neural network, so that the trained network generalizes better and is more robust.
A2, setting a linearly increasing semi-hard attention sparsification value range to obtain a more robust sparse image representation.
Specifically, a final value range k is set, so that by the end of training all image features with weights smaller than k are explicitly zeroed. In addition, a value-range function can be set:

y = min(f(x), k)

Here, x represents the number of iterations in training and is incremented from 0 to the maximum number of iterations; once the value of f(x) exceeds k, the dynamic value range is fixed at k and no longer changes. In our scheme f(x) is a linear function, for example f(x) = 3e-5·x with k = 0.3. Zeroing the image features whose weights are smaller than k makes the corresponding parameters non-differentiable, so those features could not be trained sufficiently. Therefore, during the first few thousand iterations the value of f(x) stays close to 0 and the whole neural network is approximately fully differentiable; the network first runs these iterations so that the weight of each image feature learns a locally optimal solution, then the value range below which weights are zeroed increases gradually, and the network finally converges to an overall optimal solution.
A3, setting a loss function suitable for the specific task, and training the whole deep convolutional neural network model M' with back-propagation.
The method is suitable for general computer vision tasks, and the whole network model is trained with back-propagation. A cross-entropy loss can be used for image classification tasks, and a mean-squared-error loss for tasks such as target detection. This step and the two steps above are trained jointly, and the whole process does not increase the time or space complexity of the neural network. Image classification and object detection are taken as example tasks in this embodiment, but the applicable range of the method is not limited to these tasks.
The embodiment of the invention provides an adaptive sparse representation method for the application of deep learning in computer vision, in particular for extracting image information. The method can simply be merged into any model based on a deep convolutional neural network. By adding a semi-hard attention mechanism to each convolution block, attention is paid to effective features while ineffective features are zeroed. For example, for an image a, if the features corresponding to some subset of parameters p are zeroed, those features have low weight and can be discarded; the back-propagation triggered by image a during training then does not affect the parameters p of the model, so other training images can make full use of the feature representation carried by p. The benefit is that invalid features are sparsified away and large-scale image data is characterized more effectively. Notably, the method adds no extra time or space complexity in either the training or the testing stage, while genuinely enhancing the generalization, robustness, and image representation capability of the model.
In some particularly preferred embodiments, the method may be operated as follows.
Step A1: the core operation of this step is the addition of a semi-hard attention mechanism to the convolutional layer, where we first describe the convolutional layer:
F i+1,j =Conv(F i,* ) (1)
here, conv stands for convolution operation, F i+i,j Represents the jth feature, F, of the i +1 th layer convolution output i,* Representing all the characteristics of the ith layer. The convolution operation of a conventional convolutional network is performed by equation (1).
We next introduce a mechanism of attention where we use the mean of each feature to determine the importance of the feature itself:
v i+1,j =avgpool(F i+1,j ) (2)
here, F i+1,j Is a two-dimensional feature with a length h and a width w, avgpool for F i+1,j This two-dimensional feature is averaged and can be written as
Then, the means are mapped to weights between [0,1] by a linear transformation and a nonlinear activation function:
v' i+1,* =δ(Wv i+1,* ) (3)
here, W is a learnable weight of the linear transformation, δ represents a sigmoid activation function, and until the weight [0,1] of the feature itself is obtained by adding a self-attention mechanism to the feature, we will describe how to sparsify the feature at step A2. Since the order of the operands of the attention mechanism is smaller than the amount of operations of the current layer convolution operation, our method does not enter additional temporal and spatial complexity. In addition, in our model, we add a semi-hard attention mechanism every other two convolution layers for feature sparsification, which is beneficial to further reduce the computational complexity.
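As an illustration of where such a module could sit, the sketch below attaches it after the third convolution of a ResNet-style bottleneck block, reusing the AttentionWeights and semi_hard_attention sketches above; the host architecture is an assumption, since the patent only states that the module follows every third convolution layer.

```python
import torch.nn as nn

class SparseBottleneck(nn.Module):
    def __init__(self, channels: int, mid_channels: int):
        super().__init__()
        self.conv1 = nn.Conv2d(channels, mid_channels, kernel_size=1)
        self.conv2 = nn.Conv2d(mid_channels, mid_channels, kernel_size=3, padding=1)
        self.conv3 = nn.Conv2d(mid_channels, channels, kernel_size=1)   # third convolution
        self.relu = nn.ReLU(inplace=True)
        self.attn = AttentionWeights(channels)          # semi-hard attention after the 3rd conv

    def forward(self, x, iteration: int):
        out = self.relu(self.conv1(x))
        out = self.relu(self.conv2(out))
        out = self.conv3(out)
        out = semi_hard_attention(out, self.attn(out), iteration)  # sparsified features F'
        return self.relu(out + x)                       # residual connection of the host block
```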
Step A2: the attention mechanism is generally referred to as a soft attention (soft attention) mechanism without specific description, and represents that the weight may be any value between [0,1] and is everywhere conductive. The hard attention (hard attention) mechanism means that the weight can only take one of two values of 0,1, and most of the hard attention mechanism is not conductive. In the method, a half-hard attention (half-hard attention) mechanism is adopted, namely a value range k is set, all weight values smaller than k are forced to be 0, all values larger than or equal to k keep the original values, and the method combines a hard attention mechanism and a soft attention mechanism, so that the half-hard attention mechanism is conductive within a value range larger than or equal to k and is non-conductive within a value range smaller than k, and the effect is just required.
Furthermore, to make the weight of each image feature meaningful, we train for several rounds at initialization with a very small value range (close to 0). The goal is to ensure that the features zeroed out by the semi-hard attention mechanism really are relatively unimportant, rather than victims of random initialization. The specific dynamic value range is set as follows:

y = min(f(x), k)   (4)

Here, x represents the number of iterations in training (one iteration processes one batch of images) and is incremented from 0 to the maximum number of iterations; once the value of f(x) exceeds k, the dynamic value range is fixed at k and no longer changes. In our scheme f(x) is a linear function, for example f(x) = 3e-5·x with k = 0.3, so the threshold reaches its cap after 10,000 iterations. It is worth noting that we do not set k to 0.5: most features in a neural network are useful, and discarding too many (half or more) feature channels would cause a large drop in performance, i.e. the opposite of the intended effect.
At each iteration, the attention values v'_{i+1,j} of formula (3) that are smaller than the dynamic value range y in formula (4) are set to 0 and then applied to the convolution features of the current layer, resulting in self-sparsified features:

F'_{i+1,*} = v'_{i+1,*} * F_{i+1,*}   (5)
step A3: in the first two steps, a high-dimensional vector representing image features is usually output, i.e. the last two-dimensional features of the convolutional neural network are input into an avgpool layer (see formula 2), and one-dimensional global features are obtained. Taking the image classification ImageNet task as an example, two full-connection layers and one softmax layer are connected behind one-dimensional global features to output various predicted values, and a Cross Entropy Loss function (Cross Engine Loss) is used for back propagation training of the whole network:
where n represents the number of classes (e.g., imageNet n = 1000), p represents the correct answer given by the label, and q represents the predicted value of the model output we have trained.
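A small numerical check of the cross-entropy loss above (the three-class label and prediction are made-up values): with a one-hot label p and a softmax output q, the loss reduces to the negative log of the probability assigned to the true class.

```python
import math

p = [0.0, 1.0, 0.0]   # label: the correct class is class 1
q = [0.2, 0.7, 0.1]   # softmax output of the model
loss = -sum(pi * math.log(qi) for pi, qi in zip(p, q))
print(round(loss, 4))  # 0.3567, i.e. -log(0.7)
```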
Compared with the baseline deep convolutional model ResNet50, the performance of a model trained with this method on ImageNet image classification is shown in Table 1 below, and its performance on COCO object detection is shown in Table 2.

TABLE 1

Model on ImageNet image classification | Accuracy
---|---
ResNet50 | 76.2%
ResNet50 + feature sparsification (this method) | 77.3%

TABLE 2

Model on COCO object detection | mAP
---|---
FCOS (ResNet50) | 38.7%
FCOS (ResNet50) + feature sparsification (this method) | 39.3%
Steps A1-A3 allow end-to-end training: as the number of training iterations increases, the value range in step A2 grows, the features output by the whole model become sparse, and the sparsified redundant parameters provide better feature expression for other inputs; a joint training loop is sketched below.
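A joint-training sketch of steps A1-A3 (the toy model, random data, and optimizer settings are illustrative assumptions, not the patent's configuration): the iteration counter drives the growing threshold of step A2 while ordinary back-propagation in step A3 updates the whole network M'.

```python
import torch
import torch.nn as nn

def dynamic_threshold(x: int, slope: float = 3e-5, k: float = 0.3) -> float:
    return min(slope * x, k)                       # step A2, formula (4)

model = nn.Sequential(nn.Flatten(), nn.Linear(3 * 32 * 32, 10))  # stand-in for M'
criterion = nn.CrossEntropyLoss()                  # step A3 loss for classification
optimizer = torch.optim.SGD(model.parameters(), lr=0.1)

for iteration in range(100):                       # one batch per iteration
    images = torch.randn(16, 3, 32, 32)
    labels = torch.randint(0, 10, (16,))
    threshold = dynamic_threshold(iteration)       # would be handed to the attention modules of M'
    loss = criterion(model(images), labels)
    optimizer.zero_grad()
    loss.backward()                                # back-propagation trains the whole network
    optimizer.step()
```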
The background of the present invention may contain background information related to the problem or environment of the present invention and does not necessarily describe the prior art. Accordingly, the inclusion in the background section is not an admission of prior art by the applicant.
The foregoing is a further detailed description of the invention in connection with specific/preferred embodiments, and it is not intended to limit the invention to the embodiments described. It will be apparent to those skilled in the art that numerous alterations and modifications can be made to the described embodiments without departing from the inventive concept, and such alterations and modifications are to be considered as within the scope of the invention. In this description, references to the terms "one embodiment", "some embodiments", "preferred embodiments", "an example", "a specific example", "some examples" and the like mean that a particular feature, structure, material, or characteristic described in connection with the embodiment or example is included in at least one embodiment or example of the invention. Such schematic representations do not necessarily refer to the same embodiment or example. Furthermore, the particular features, structures, materials, or characteristics described may be combined in any suitable manner in any one or more embodiments or examples, and the various embodiments or examples and their features may be combined by those skilled in the art without contradiction. Although embodiments of the present invention and their advantages have been described in detail, it should be understood that various changes, substitutions and alterations can be made herein without departing from the scope of the claims.
Claims (8)
1. An image self-adaptive sparsification representation method based on deep learning, characterized by comprising the following steps:
A1, selecting an arbitrary deep convolutional neural network model M, adding a deep-learning module based on a semi-hard attention mechanism at each stage of convolution, i.e. into the convolutional layer, and constructing a new deep convolutional neural network model M';
A2, setting a linearly increasing semi-hard attention sparsification value range to obtain a sparse image representation;
A3, setting a loss function suited to the task, and training the whole deep convolutional neural network model M' with back-propagation;
wherein the semi-hard attention mechanism means that the neural network learns a weight value for each image feature from statistical information of the image features, and when a weight value is smaller than a set value range k, the image feature corresponding to that weight value is set to zero;
in the step A2, a dynamic value-range function is set:

y = min(f(x), k)   (4)

wherein f(x) is a linear function and x represents the number of iterations in training, increasing from 0 to the maximum number of iterations; once the value of f(x) exceeds k, the dynamic value range y is fixed at k and no longer changes; as the network iterates, the weight value of each image feature first learns a locally optimal solution, the value range below which weights are zeroed then increases gradually, and the network finally converges to an overall optimal solution;
at each iteration, the attention values v'_{i+1,j} that are smaller than the dynamic value range in formula (4) are set to 0 and then applied to the convolution features of the current layer, resulting in self-sparsified features:

F'_{i+1,*} = v'_{i+1,*} * F_{i+1,*}   (5)

wherein F_{i+1,*} denotes the (i+1)-th layer feature maps, v'_{i+1,*} is the corresponding semi-hard attention value, and F'_{i+1,*} is the sparsified feature map obtained by multiplying the two.
2. The method according to claim 1, wherein in step A1, the local information of the image is gradually extracted by using a plurality of convolution operations, and then the image features with the local information are convolved, so as to extract the global information.
3. The method of claim 1, wherein in step A1, the convolution operation is as in formula (1):

F_{i+1,j} = Conv(F_{i,*})   (1)

wherein Conv stands for the convolution operation, F_{i+1,j} represents the j-th feature output by the (i+1)-th convolutional layer, and F_{i,*} represents all the features of the i-th layer;
an attention mechanism is introduced as formula (2), using the mean v_{i+1,j} of each image feature to determine the importance of that feature:

v_{i+1,j} = avgpool(F_{i+1,j})   (2)

wherein F_{i+1,j} is a two-dimensional feature of height h and width w, and avgpool averages this two-dimensional feature, i.e. v_{i+1,j} = (1/(hw)) Σ_{a=1}^{h} Σ_{b=1}^{w} F_{i+1,j}(a, b);
the means are then mapped to weights in [0,1] by a linear transformation and a non-linear activation function:

v'_{i+1,*} = δ(W v_{i+1,*})   (3)

wherein v'_{i+1,*} is the attention value, W is the learnable weight of the linear transformation, and δ represents the sigmoid activation function.
4. A method as claimed in any one of claims 1 to 3, wherein a semi-hard attention mechanism is applied to every third convolution layer for feature sparsification.
5. The method according to any one of claims 1 to 3, wherein in step A3, the image classification related task is processed by using a cross entropy function, and the target detection related task is processed by using a mean square error loss function.
6. The method according to any one of claims 1 to 3, wherein in the step A3, for the ImageNet image classification task, two fully connected layers and a softmax layer are appended to the one-dimensional global feature to output the predicted value of each class, and the whole network is trained by back-propagation using a cross-entropy loss function:

H(p, q) = −Σ_{i=1}^{n} p_i · log(q_i)

where n represents the number of classes, p represents the correct answer given by the label, and q represents the predicted values output by the trained model.
7. An image adaptive sparsification representation device based on deep learning, comprising a computer-readable storage medium and a processor, wherein an executable program is stored in the computer-readable storage medium and, when executed by the processor, implements the image adaptive sparsification representation method based on deep learning according to any one of claims 1 to 6.
8. A computer-readable storage medium storing an executable program, wherein the executable program, when executed by a processor, implements the image adaptive sparsification representation method based on deep learning according to any one of claims 1 to 6.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202010385699.3A CN111652246B (en) | 2020-05-09 | 2020-05-09 | Image self-adaptive sparsization representation method and device based on deep learning |
Applications Claiming Priority (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202010385699.3A CN111652246B (en) | 2020-05-09 | 2020-05-09 | Image self-adaptive sparsization representation method and device based on deep learning |
Publications (2)
Publication Number | Publication Date |
---|---|
CN111652246A CN111652246A (en) | 2020-09-11 |
CN111652246B true CN111652246B (en) | 2023-04-18 |
Family
ID=72342551
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN202010385699.3A Active CN111652246B (en) | 2020-05-09 | 2020-05-09 | Image self-adaptive sparsization representation method and device based on deep learning |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN111652246B (en) |
Families Citing this family (1)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN114563763B (en) * | 2022-01-21 | 2022-10-21 | 青海师范大学 | Underwater sensor network node distance measurement positioning method based on return-to-zero neurodynamics |
Citations (3)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN107871136A (en) * | 2017-03-22 | 2018-04-03 | 中山大学 | The image-recognizing method of convolutional neural networks based on openness random pool |
CN110827312A (en) * | 2019-11-12 | 2020-02-21 | 北京深境智能科技有限公司 | Learning method based on cooperative visual attention neural network |
CN111046962A (en) * | 2019-12-16 | 2020-04-21 | 中国人民解放军战略支援部队信息工程大学 | Sparse attention-based feature visualization method and system for convolutional neural network model |
Also Published As
Publication number | Publication date |
---|---|
CN111652246A (en) | 2020-09-11 |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
| PB01 | Publication | |
| SE01 | Entry into force of request for substantive examination | |
| GR01 | Patent grant | |