CN110782420A - Small target feature representation enhancement method based on deep learning - Google Patents
Small target feature representation enhancement method based on deep learning
- Publication number
- CN110782420A CN110782420A CN201910886472.4A CN201910886472A CN110782420A CN 110782420 A CN110782420 A CN 110782420A CN 201910886472 A CN201910886472 A CN 201910886472A CN 110782420 A CN110782420 A CN 110782420A
- Authority
- CN
- China
- Prior art keywords
- feature map
- characteristic diagram
- feature
- size
- neural network
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Pending
Classifications
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06T—IMAGE DATA PROCESSING OR GENERATION, IN GENERAL
- G06T5/00—Image enhancement or restoration
- G06T5/50—Image enhancement or restoration using two or more images, e.g. averaging or subtraction
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06T—IMAGE DATA PROCESSING OR GENERATION, IN GENERAL
- G06T2207/00—Indexing scheme for image analysis or image enhancement
- G06T2207/20—Special algorithmic details
- G06T2207/20081—Training; Learning
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06T—IMAGE DATA PROCESSING OR GENERATION, IN GENERAL
- G06T2207/00—Indexing scheme for image analysis or image enhancement
- G06T2207/20—Special algorithmic details
- G06T2207/20084—Artificial neural networks [ANN]
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06T—IMAGE DATA PROCESSING OR GENERATION, IN GENERAL
- G06T2207/00—Indexing scheme for image analysis or image enhancement
- G06T2207/20—Special algorithmic details
- G06T2207/20212—Image combination
- G06T2207/20221—Image fusion; Image merging
Landscapes
- Physics & Mathematics (AREA)
- General Physics & Mathematics (AREA)
- Engineering & Computer Science (AREA)
- Theoretical Computer Science (AREA)
- Image Analysis (AREA)
Abstract
The invention discloses a small target feature representation enhancement method based on deep learning, comprising the following steps: step 1, pre-train a Faster R-CNN neural network model on a very large-scale database containing more than 14 million images covering 20,000 categories; step 2, read the input image data; step 3, generate feature maps with a convolutional neural network and build a feature map spatial pyramid; step 4, obtain feature map weights from an attention mechanism module; step 5, fuse the feature maps from different layers according to the obtained weights; step 6, detect and localize targets on the feature map; and step 7, repeat steps 3 to 6 for the specified task and continue training the neural network model until the network reaches an optimum. The method strengthens the influence of salient features and effectively combines deep semantic features with shallow high-resolution convolutional neural network features, thereby improving overall target detection accuracy.
Description
Technical Field
The invention relates to a target detection method, and in particular to a small target feature representation enhancement method based on deep learning, belonging to the technical field of computer vision and image processing.
Background
Object detection, one of the fundamental problems of computer vision, underpins many other computer vision tasks such as instance segmentation, image captioning, and object tracking. From an application standpoint, object detection divides into two research topics, "generic object detection" and "specific object detection": the former explores methods for detecting different types of objects under a unified framework so as to simulate human vision and cognition, while the latter covers detection in specific application scenarios, such as pedestrian detection, face detection, and text detection. In recent years, the rapid development of deep learning technology has reinvigorated object detection and produced significant breakthroughs, pushing it to an unprecedented research hotspot. Object detection is now widely applied in fields such as autonomous driving, robot vision, and video surveillance.
Disclosure of Invention
The invention provides a small target feature representation enhancement method based on deep learning, mainly intended to resolve the conflict between detail information and abstract semantic information inherent in single-layer convolutional features.
To achieve the above technical objectives, the present invention adopts the following technical solutions:
A small target feature representation enhancement method based on deep learning is realized by the following steps:
Step (1): pre-train a Faster R-CNN neural network model on a very large-scale database containing more than 14 million images covering 20,000 categories;
Step (2): read the input image data;
Step (3): generate feature maps with a convolutional neural network and build a feature map spatial pyramid;
Step (4): obtain feature map weights from the attention mechanism module;
Step (5): fuse the feature maps from different layers according to the obtained weights;
Step (6): detect and localize targets on the feature map;
Step (7): repeat steps (3) to (6) for the specific task, continuing to train the neural network model until the network reaches an optimum.
Further, the deep convolutional neural network framework adopted in step (1) is Faster R-CNN, wherein:
Faster R-CNN accepts input images of arbitrary size; images are rescaled to a normalized size before entering the network, for example so that the short edge is at most 600 pixels and the long edge at most 1000 pixels. We may assume M × N = 1000 × 600 (smaller images are zero-padded at the edges, i.e. the image acquires black borders).
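The normalization just described can be sketched as follows. This is a minimal, dependency-light illustration: the function name `normalize_scale`, the downscale-only policy, and the nearest-neighbour resampling are assumptions for the sketch, not details taken from the patent.

```python
import numpy as np

def normalize_scale(img, short_max=600, long_max=1000):
    """Rescale so the short edge is at most `short_max` and the long edge
    at most `long_max`, then zero-pad to a fixed canvas (black borders)."""
    h, w = img.shape[:2]
    scale = min(short_max / min(h, w), long_max / max(h, w))
    scale = min(scale, 1.0)  # downscale only; small images keep their size
    new_h, new_w = int(round(h * scale)), int(round(w * scale))
    # Nearest-neighbour resize via index sampling (keeps the sketch numpy-only).
    rows = (np.arange(new_h) / scale).astype(int).clip(0, h - 1)
    cols = (np.arange(new_w) / scale).astype(int).clip(0, w - 1)
    resized = img[rows][:, cols]
    # Orient the canvas so the short edge gets short_max and the long edge long_max.
    out_h, out_w = (short_max, long_max) if h <= w else (long_max, short_max)
    padded = np.zeros((out_h, out_w) + img.shape[2:], dtype=img.dtype)
    padded[:new_h, :new_w] = resized  # zero padding = black border
    return padded
```

A 300 × 400 image, for instance, is left at its original size and placed on a 600 × 1000 zero-filled canvas, matching the "complement with 0" behaviour described above.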
Further, the Faster R-CNN network framework includes:
13 convolution (conv) layers: kernel_size = 3, pad = 1, stride = 1;
using the convolution output-size formula:
output = (input − kernel_size + 2 × pad) / stride + 1
where kernel_size indicates a 3 × 3 convolution kernel, pad indicates 1 pixel of edge padding, and stride indicates the kernel moves 1 pixel at a time. The formula shows that the conv layers do not change the image size, that is, the output image has the same size as the input image;
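The size bookkeeping above can be checked with two small helper functions (the names `conv_output_size` and `pool_output_size` are illustrative, not from the patent):

```python
def conv_output_size(n, kernel_size=3, pad=1, stride=1):
    """Standard convolution size formula: (n - k + 2p) // s + 1."""
    return (n - kernel_size + 2 * pad) // stride + 1

def pool_output_size(n, kernel_size=2, stride=2):
    """Pooling uses the same formula without padding."""
    return (n - kernel_size) // stride + 1
```

With kernel_size = 3, pad = 1, stride = 1 the convolution leaves the size unchanged, and each 2 × 2 / stride-2 pooling halves it, so four poolings reduce the input by a factor of 16 (with integer flooring).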
Further, Faster R-CNN also includes 13 activation (ReLU) layers: the activation function does not change the image size;
Further, Faster R-CNN also includes 4 pooling layers: kernel_size = 2, stride = 2; each pooling layer halves the spatial size of its input;
Further, after feature extraction the image size becomes (M/16) × (N/16), i.e. approximately 60 × 40 (1000/16 ≈ 60, 600/16 ≈ 40); the feature map is 60 × 40 × 512, meaning its spatial size is 60 × 40 with 512 channels;
Further, the construction in step (3) of a feature map spatial pyramid from the feature maps generated by the convolutional neural network is realized as follows:
The feature activations output by the last convolutional layer of each stage are used. The final outputs of conv2, conv3, conv4 and conv5 are denoted {C2, C3, C4, C5}; they have strides of {4, 8, 16, 32} relative to the input image. Conv1 is not incorporated into the pyramid because of its large memory footprint. The lower-resolution feature map is then upsampled by a factor of 2 so that the laterally connected bottom-up and top-down feature maps have the same size. Before fusion, an attention mechanism module automatically learns the weights of the feature maps at different scales, and fusion is then carried out. This process iterates until the final feature maps are generated. The final set of feature maps is called {P2, P3, P4, P5}, corresponding to {C2, C3, C4, C5} respectively.
Further, the step (4) obtains the feature graph weight from the attention mechanism module, and is specifically realized as follows:
the characteristic diagram A is subjected to matrix multiplication with the transposed AT of the characteristic diagram, because the characteristic diagram has channel dimensions, each pixel and every other pixel are equivalently subjected to point multiplication, the point multiplication geometrical meaning of the vectors is to calculate the similarity of two vectors, and the more similar the two vectors are, the larger the point multiplication is. And multiplying the characteristic diagram transpose matrix and the characteristic diagram matrix, and then normalizing by softmax to obtain the attention weight wi. The attention weight wi is multiplied by the transpose of the feature graph through a matrix, the correlation information is redistributed to the original feature graph, and the fusion mode is expressed by a formula as follows:
wi=softmax(matmul(Ai,AiT))
wherein matmul represents the matrix product and the softmax function represents the value that maps the matrix product to (0, 1);
Further, step (5) fuses feature maps from different layers according to the obtained attention weight w_i; the fusion formula is:
E_i = w_i · A_i + A_i
where E_i denotes the i-th new feature map, w_i the attention weight of the i-th layer, and A_i the i-th original feature map;
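Steps (4) and (5) together can be sketched in a few lines. Flattening A_i into a (H·W) × C pixel matrix, so that the product measures pixel-to-pixel similarity, is an assumed reading of the description; the patent does not spell out the exact shapes:

```python
import numpy as np

def softmax(x, axis=-1):
    """Numerically stable softmax along the given axis."""
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def attention_fuse(a):
    """w_i = softmax(matmul(A_i, A_i^T)), then E_i = w_i * A_i + A_i.
    `a` is a (C, H, W) feature map; one row of `flat` per pixel."""
    c, h, w = a.shape
    flat = a.reshape(c, h * w).T        # (H*W, C): pixel vectors
    wi = softmax(flat @ flat.T)         # (H*W, H*W) pixel-similarity weights
    e = wi @ flat + flat                # redistribute correlated context, residual add
    return e.T.reshape(c, h, w), wi
```

Each row of `wi` sums to 1, so the first term reweights every pixel by its similar pixels while the residual `+ A_i` preserves the original features.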
Further, the detection and localization on the feature map in step (6) is realized as follows: for the features extracted from the fused feature map, a classifier determines whether they belong to a specific class, and a localizer then refines the position of each candidate box assigned to that class.
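The classify-then-refine logic of step (6) can be illustrated with hypothetical linear heads. The weight matrices `cls_w` and `box_w` and the (dx, dy, dw, dh) delta parameterization follow the common R-CNN convention and are assumptions for the sketch, not details from the patent:

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def decode_box(box, deltas):
    """Refine a candidate box (x1, y1, x2, y2) with regression deltas
    (dx, dy, dw, dh), the standard R-CNN box parameterization."""
    x1, y1, x2, y2 = box
    w, h = x2 - x1, y2 - y1
    cx, cy = x1 + 0.5 * w, y1 + 0.5 * h
    cx, cy = cx + deltas[0] * w, cy + deltas[1] * h
    w, h = w * np.exp(deltas[2]), h * np.exp(deltas[3])
    return (cx - 0.5 * w, cy - 0.5 * h, cx + 0.5 * w, cy + 0.5 * h)

def detect(features, cls_w, box_w, boxes, score_thresh=0.5):
    """Classifier scores each pooled feature vector; the localizer then
    refines the box of the best non-background class (class 0 = background)."""
    scores = softmax(features @ cls_w)          # (N, num_classes)
    detections = []
    for f, s, b in zip(features, scores, boxes):
        cls = int(s[1:].argmax()) + 1           # best foreground class
        if s[cls] >= score_thresh:
            detections.append((cls, float(s[cls]), decode_box(b, f @ box_w)))
    return detections
```

With zero deltas the candidate box is returned unchanged, which makes the refinement step easy to sanity-check.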
The invention has the following advantages:
The method uses deep learning to detect image content, automatically learning the characteristics of the target classes without manual intervention, and exhibits good robustness and adaptability for detecting small targets during recognition, classification and localization.
The method strengthens the influence of salient features and effectively combines deep semantic features with shallow high-resolution convolutional neural network features, thereby improving overall target detection accuracy.
Drawings
FIG. 1 is a flow chart of the overall implementation of the present invention;
FIG. 2 is a diagram of a convolutional neural network architecture used in the present invention;
FIG. 3 is a diagram of a weight assignment method used by the present invention;
FIG. 4 is an image to be detected;
FIG. 5 is the same image after detection using the present invention.
Detailed Description
The accompanying drawings disclose, without limitation, a flow chart of a preferred embodiment of the invention; the technical solution of the invention is described in detail below with reference to the drawings.
As shown in FIG. 1, FIG. 2 and FIG. 3, a small target feature representation enhancement method based on deep learning is implemented by the following steps:
Step (1): pre-train a Faster R-CNN neural network model on a very large-scale database containing more than 14 million images covering 20,000 categories;
Step (2): read the input image data;
Step (3): generate feature maps with a convolutional neural network and build a feature map spatial pyramid;
Step (4): obtain feature map weights from the attention mechanism module;
Step (5): fuse the feature maps from different layers according to the obtained weights;
Step (6): detect and localize targets on the feature map;
Step (7): repeat steps (3) to (6) for the specific task, continuing to train the neural network model until the network reaches an optimum.
Further, the deep convolutional neural network framework adopted in step (1) is Faster R-CNN, wherein:
Faster R-CNN accepts input images of arbitrary size; images are rescaled to a normalized size before entering the network, for example so that the short edge is at most 600 pixels and the long edge at most 1000 pixels. We may assume M × N = 1000 × 600 (smaller images are zero-padded at the edges, i.e. the image acquires black borders).
Further, the Faster R-CNN network framework includes:
13 convolution (conv) layers: kernel_size = 3, pad = 1, stride = 1;
using the convolution output-size formula:
output = (input − kernel_size + 2 × pad) / stride + 1
where kernel_size indicates a 3 × 3 convolution kernel, pad indicates 1 pixel of edge padding, and stride indicates the kernel moves 1 pixel at a time. The formula shows that the conv layers do not change the image size, that is, the output image has the same size as the input image;
Further, Faster R-CNN also includes 13 activation (ReLU) layers: the activation function does not change the image size;
Further, Faster R-CNN also includes 4 pooling layers: kernel_size = 2, stride = 2; each pooling layer halves the spatial size of its input;
Further, after feature extraction the image size becomes (M/16) × (N/16), i.e. approximately 60 × 40 (1000/16 ≈ 60, 600/16 ≈ 40); the feature map is 60 × 40 × 512, meaning its spatial size is 60 × 40 with 512 channels;
Further, the construction in step (3) of a feature map spatial pyramid from the feature maps generated by the convolutional neural network is realized as follows:
The feature activations output by the last convolutional layer of each stage are used. The final outputs of conv2, conv3, conv4 and conv5 are denoted {C2, C3, C4, C5}; they have strides of {4, 8, 16, 32} relative to the input image. Conv1 is not incorporated into the pyramid because of its large memory footprint. The lower-resolution feature map is then upsampled by a factor of 2 so that the laterally connected bottom-up and top-down feature maps have the same size. Before fusion, an attention mechanism module automatically learns the weights of the feature maps at different scales, and fusion is then carried out. This process iterates until the final feature maps are generated. The final set of feature maps is called {P2, P3, P4, P5}, corresponding to {C2, C3, C4, C5} respectively.
Further, the step (4) obtains the feature graph weight from the attention mechanism module, and is specifically realized as follows:
the characteristic diagram A is subjected to matrix multiplication with the transposed AT of the characteristic diagram, because the characteristic diagram has channel dimensions, each pixel and every other pixel are equivalently subjected to point multiplication, the point multiplication geometrical meaning of the vectors is to calculate the similarity of two vectors, and the more similar the two vectors are, the larger the point multiplication is. And multiplying the characteristic diagram transpose matrix and the characteristic diagram matrix, and then normalizing by softmax to obtain the attention weight wi. The attention weight wi is multiplied by the transpose of the feature graph through a matrix, the correlation information is redistributed to the original feature graph, and the fusion mode is expressed by a formula as follows:
wi=softmax(matmul(Ai,AiT))
wherein matmul represents the matrix product and the softmax function represents the value that maps the matrix product to (0, 1);
Further, step (5) fuses feature maps from different layers according to the obtained attention weight w_i; the fusion formula is:
E_i = w_i · A_i + A_i
where E_i denotes the i-th new feature map, w_i the attention weight of the i-th layer, and A_i the i-th original feature map;
Further, the detection and localization on the feature map in step (6) is realized as follows: for the features extracted from the fused feature map, a classifier determines whether they belong to a specific class, and a localizer then refines the position of each candidate box assigned to that class.
Comparison of FIG. 4 and FIG. 5 readily shows that the method markedly improves detection of small targets in the image.
Claims (7)
1. A small target feature representation enhancement method based on deep learning, characterized by the following implementation steps:
Step (1): pre-train a Faster R-CNN neural network model on a very large-scale database containing more than 14 million images covering 20,000 categories;
Step (2): read the input image data;
Step (3): generate feature maps with a convolutional neural network and build a feature map spatial pyramid;
Step (4): obtain feature map weights from the attention mechanism module;
Step (5): fuse the feature maps from different layers according to the obtained weights;
Step (6): detect and localize targets on the feature map;
Step (7): repeat steps (3) to (6) for the specified task, continuing to train the neural network model until the network reaches an optimum.
2. The small target feature representation enhancement method based on deep learning of claim 1, characterized in that the deep convolutional neural network framework adopted in step (1) is Faster R-CNN, wherein:
Faster R-CNN accepts input images of arbitrary size; images are rescaled to a normalized size before entering the network, with the short edge at most N and the long edge at most M; smaller images are zero-padded at the edges, i.e. the image acquires black borders;
the Faster R-CNN network framework includes:
13 convolution (conv) layers: kernel_size = 3, pad = 1, stride = 1;
using the convolution output-size formula:
output = (input − kernel_size + 2 × pad) / stride + 1
where kernel_size indicates a 3 × 3 convolution kernel, pad indicates 1 pixel of edge padding, and stride indicates the kernel moves 1 pixel at a time; the formula shows that the conv layers do not change the image size;
Faster R-CNN also includes 13 activation (ReLU) layers;
Faster R-CNN also includes 4 pooling layers: kernel_size = 2, stride = 2; each pooling layer halves the spatial size of its input.
3. The method according to claim 2, characterized in that after step (2) reads the input image data and feature extraction is performed, the image size becomes (M/16) × (N/16), and the feature map is (M/16) × (N/16) × 512, meaning its spatial size is (M/16) × (N/16) with 512 channels.
4. The small target feature representation enhancement method based on deep learning of claim 3, characterized in that step (3) builds the feature map spatial pyramid from the feature maps generated by the convolutional neural network as follows:
the feature activations output by the last convolutional layer of each stage are used; the final outputs of conv2, conv3, conv4 and conv5 are denoted {C2, C3, C4, C5}, with strides of {4, 8, 16, 32} relative to the input image; the lower-resolution feature map is then upsampled by a factor of 2 so that the laterally connected bottom-up and top-down feature maps have the same size; before fusion, an attention mechanism module automatically learns the weights of the feature maps at different scales, and fusion is carried out until the final feature maps are generated; the final set of feature maps is called {P2, P3, P4, P5}, corresponding to {C2, C3, C4, C5} respectively.
5. The small target feature representation enhancement method based on deep learning of claim 4, characterized in that step (4) obtains the feature map weights from the attention mechanism module as follows:
the feature map A is matrix-multiplied with its transpose A^T; because the feature map has a channel dimension, this is equivalent to taking the dot product of every pixel with every other pixel; the product of the feature map matrix and its transpose is normalized with softmax to obtain the attention weight w_i; the attention weight w_i is matrix-multiplied with the feature map, redistributing correlation information back onto the original feature map, expressed by the formula:
w_i = softmax(matmul(A_i, A_i^T))
where matmul denotes the matrix product and the softmax function maps the values of the matrix product into (0, 1).
6. The small target feature representation enhancement method based on deep learning of claim 5, characterized in that step (5) fuses feature maps from different layers according to the obtained attention weight w_i, with the fusion formula:
E_i = w_i · A_i + A_i
where E_i denotes the i-th new feature map, w_i the attention weight of the i-th layer, and A_i the i-th original feature map.
7. The small target feature representation enhancement method based on deep learning of claim 6, characterized in that the detection and localization on the feature map in step (6) is realized as follows: for the features extracted from the fused feature map, a classifier determines whether they belong to a specific class, and a localizer then refines the position of each candidate box assigned to that class.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201910886472.4A CN110782420A (en) | 2019-09-19 | 2019-09-19 | Small target feature representation enhancement method based on deep learning |
Applications Claiming Priority (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201910886472.4A CN110782420A (en) | 2019-09-19 | 2019-09-19 | Small target feature representation enhancement method based on deep learning |
Publications (1)
Publication Number | Publication Date |
---|---|
CN110782420A true CN110782420A (en) | 2020-02-11 |
Family
ID=69383587
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN201910886472.4A Pending CN110782420A (en) | 2019-09-19 | 2019-09-19 | Small target feature representation enhancement method based on deep learning |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN110782420A (en) |
Cited By (16)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN111507183A (en) * | 2020-03-11 | 2020-08-07 | 杭州电子科技大学 | Crowd counting method based on multi-scale density map fusion cavity convolution |
CN111507359A (en) * | 2020-03-09 | 2020-08-07 | 杭州电子科技大学 | Self-adaptive weighting fusion method of image feature pyramid |
CN111539458A (en) * | 2020-04-02 | 2020-08-14 | 咪咕文化科技有限公司 | Feature map processing method and device, electronic equipment and storage medium |
CN111563414A (en) * | 2020-04-08 | 2020-08-21 | 西北工业大学 | SAR image ship target detection method based on non-local feature enhancement |
CN111709291A (en) * | 2020-05-18 | 2020-09-25 | 杭州电子科技大学 | Takeaway personnel identity identification method based on fusion information |
CN111709294A (en) * | 2020-05-18 | 2020-09-25 | 杭州电子科技大学 | Express delivery personnel identity identification method based on multi-feature information |
CN111723841A (en) * | 2020-05-09 | 2020-09-29 | 北京捷通华声科技股份有限公司 | Text detection method and device, electronic equipment and storage medium |
CN112131935A (en) * | 2020-08-13 | 2020-12-25 | 浙江大华技术股份有限公司 | Motor vehicle carriage manned identification method and device and computer equipment |
CN112131925A (en) * | 2020-07-22 | 2020-12-25 | 浙江元亨通信技术股份有限公司 | Construction method of multi-channel characteristic space pyramid |
CN112396115A (en) * | 2020-11-23 | 2021-02-23 | 平安科技(深圳)有限公司 | Target detection method and device based on attention mechanism and computer equipment |
CN113327253A (en) * | 2021-05-24 | 2021-08-31 | 北京市遥感信息研究所 | Weak and small target detection method based on satellite-borne infrared remote sensing image |
CN113570003A (en) * | 2021-09-23 | 2021-10-29 | 深圳新视智科技术有限公司 | Feature fusion defect detection method and device based on attention mechanism |
CN113591593A (en) * | 2021-07-06 | 2021-11-02 | 厦门路桥信息股份有限公司 | Method, equipment and medium for detecting target under abnormal weather based on causal intervention |
US11436447B2 (en) | 2020-06-29 | 2022-09-06 | Beijing Baidu Netcom Science And Technology Co., Ltd. | Target detection |
US11521603B2 (en) | 2020-06-30 | 2022-12-06 | Beijing Baidu Netcom Science And Technology Co., Ltd. | Automatically generating conference minutes |
CN115482395A (en) * | 2022-09-30 | 2022-12-16 | 北京百度网讯科技有限公司 | Model training method, image classification method, device, electronic equipment and medium |
Citations (9)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN109242032A (en) * | 2018-09-21 | 2019-01-18 | 桂林电子科技大学 | A kind of object detection method based on deep learning |
CN109658387A (en) * | 2018-11-27 | 2019-04-19 | 北京交通大学 | The detection method of the pantograph carbon slide defect of power train |
CN109816037A (en) * | 2019-01-31 | 2019-05-28 | 北京字节跳动网络技术有限公司 | The method and apparatus for extracting the characteristic pattern of image |
CN109858451A (en) * | 2019-02-14 | 2019-06-07 | 清华大学深圳研究生院 | A kind of non-cooperation hand detection method |
CN109902399A (en) * | 2019-03-01 | 2019-06-18 | 哈尔滨理工大学 | Rolling bearing fault recognition methods under a kind of variable working condition based on ATT-CNN |
CN109948658A (en) * | 2019-02-25 | 2019-06-28 | 浙江工业大学 | The confrontation attack defense method of Feature Oriented figure attention mechanism and application |
CN110084210A (en) * | 2019-04-30 | 2019-08-02 | 电子科技大学 | The multiple dimensioned Ship Detection of SAR image based on attention pyramid network |
CN110163836A (en) * | 2018-11-14 | 2019-08-23 | 宁波大学 | Based on deep learning for the excavator detection method under the inspection of high-altitude |
CN110245665A (en) * | 2019-05-13 | 2019-09-17 | 天津大学 | Image, semantic dividing method based on attention mechanism |
- 2019
- 2019-09-19: Application CN201910886472.4A filed in China; publication CN110782420A (en), status Pending
Patent Citations (9)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN109242032A (en) * | 2018-09-21 | 2019-01-18 | 桂林电子科技大学 | A kind of object detection method based on deep learning |
CN110163836A (en) * | 2018-11-14 | 2019-08-23 | 宁波大学 | Based on deep learning for the excavator detection method under the inspection of high-altitude |
CN109658387A (en) * | 2018-11-27 | 2019-04-19 | 北京交通大学 | The detection method of the pantograph carbon slide defect of power train |
CN109816037A (en) * | 2019-01-31 | 2019-05-28 | 北京字节跳动网络技术有限公司 | The method and apparatus for extracting the characteristic pattern of image |
CN109858451A (en) * | 2019-02-14 | 2019-06-07 | 清华大学深圳研究生院 | A kind of non-cooperation hand detection method |
CN109948658A (en) * | 2019-02-25 | 2019-06-28 | 浙江工业大学 | The confrontation attack defense method of Feature Oriented figure attention mechanism and application |
CN109902399A (en) * | 2019-03-01 | 2019-06-18 | 哈尔滨理工大学 | Rolling bearing fault recognition methods under a kind of variable working condition based on ATT-CNN |
CN110084210A (en) * | 2019-04-30 | 2019-08-02 | 电子科技大学 | The multiple dimensioned Ship Detection of SAR image based on attention pyramid network |
CN110245665A (en) * | 2019-05-13 | 2019-09-17 | 天津大学 | Image, semantic dividing method based on attention mechanism |
Non-Patent Citations (4)
Title |
---|
SHAOQING REN 等: "Faster R-CNN: Towards Real-Time Object Detection with Region Proposal Networks", 《ARXIV》 * |
TSUNG-YI LIN 等: "Feature Pyramid Networks for Object Detection", 《ARXIV》 * |
董镭刚: "特征金字塔网络在图像检测中的应用", 《科学技术创新》 * |
陈飞 等: "基于多尺度特征融合的Faster R-CNN道路目标检测", 《中国计量大学学报》 * |
Cited By (27)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN111507359A (en) * | 2020-03-09 | 2020-08-07 | 杭州电子科技大学 | Self-adaptive weighting fusion method of image feature pyramid |
CN111507183A (en) * | 2020-03-11 | 2020-08-07 | 杭州电子科技大学 | Crowd counting method based on multi-scale density map fusion cavity convolution |
CN111539458A (en) * | 2020-04-02 | 2020-08-14 | 咪咕文化科技有限公司 | Feature map processing method and device, electronic equipment and storage medium |
CN111539458B (en) * | 2020-04-02 | 2024-02-27 | 咪咕文化科技有限公司 | Feature map processing method and device, electronic equipment and storage medium |
CN111563414A (en) * | 2020-04-08 | 2020-08-21 | 西北工业大学 | SAR image ship target detection method based on non-local feature enhancement |
CN111563414B (en) * | 2020-04-08 | 2022-03-01 | 西北工业大学 | SAR image ship target detection method based on non-local feature enhancement |
CN111723841A (en) * | 2020-05-09 | 2020-09-29 | 北京捷通华声科技股份有限公司 | Text detection method and device, electronic equipment and storage medium |
CN111709294A (en) * | 2020-05-18 | 2020-09-25 | 杭州电子科技大学 | Express delivery personnel identity recognition method based on multi-feature information |
CN111709291A (en) * | 2020-05-18 | 2020-09-25 | 杭州电子科技大学 | Takeaway personnel identity recognition method based on fusion information |
CN111709294B (en) * | 2020-05-18 | 2023-07-14 | 杭州电子科技大学 | Express delivery personnel identity recognition method based on multi-feature information |
CN111709291B (en) * | 2020-05-18 | 2023-05-26 | 杭州电子科技大学 | Takeaway personnel identity recognition method based on fusion information |
US11436447B2 (en) | 2020-06-29 | 2022-09-06 | Beijing Baidu Netcom Science And Technology Co., Ltd. | Target detection |
US11521603B2 (en) | 2020-06-30 | 2022-12-06 | Beijing Baidu Netcom Science And Technology Co., Ltd. | Automatically generating conference minutes |
CN112131925B (en) * | 2020-07-22 | 2024-06-07 | 随锐科技集团股份有限公司 | Construction method of multichannel feature space pyramid |
CN112131925A (en) * | 2020-07-22 | 2020-12-25 | 浙江元亨通信技术股份有限公司 | Construction method of multi-channel characteristic space pyramid |
CN112131935A (en) * | 2020-08-13 | 2020-12-25 | 浙江大华技术股份有限公司 | Method and device for identifying persons in motor vehicle compartments, and computer equipment |
WO2021208726A1 (en) * | 2020-11-23 | 2021-10-21 | 平安科技(深圳)有限公司 | Target detection method and apparatus based on attention mechanism, and computer device |
CN112396115A (en) * | 2020-11-23 | 2021-02-23 | 平安科技(深圳)有限公司 | Target detection method and device based on attention mechanism and computer equipment |
CN112396115B (en) * | 2020-11-23 | 2023-12-22 | 平安科技(深圳)有限公司 | Attention mechanism-based target detection method and device and computer equipment |
CN113327253B (en) * | 2021-05-24 | 2024-05-24 | 北京市遥感信息研究所 | Weak and small target detection method based on satellite-borne infrared remote sensing image |
CN113327253A (en) * | 2021-05-24 | 2021-08-31 | 北京市遥感信息研究所 | Weak and small target detection method based on satellite-borne infrared remote sensing image |
CN113591593A (en) * | 2021-07-06 | 2021-11-02 | 厦门路桥信息股份有限公司 | Method, equipment and medium for detecting targets in abnormal weather based on causal intervention |
CN113591593B (en) * | 2021-07-06 | 2023-08-15 | 厦门路桥信息股份有限公司 | Method, equipment and medium for detecting target in abnormal weather based on causal intervention |
CN113570003A (en) * | 2021-09-23 | 2021-10-29 | 深圳新视智科技术有限公司 | Feature fusion defect detection method and device based on attention mechanism |
CN113570003B (en) * | 2021-09-23 | 2022-01-07 | 深圳新视智科技术有限公司 | Feature fusion defect detection method and device based on attention mechanism |
CN115482395B (en) * | 2022-09-30 | 2024-02-20 | 北京百度网讯科技有限公司 | Model training method, image classification device, electronic equipment and medium |
CN115482395A (en) * | 2022-09-30 | 2022-12-16 | 北京百度网讯科技有限公司 | Model training method, image classification method, device, electronic equipment and medium |
Similar Documents
Publication | Publication Date | Title |
---|---|---|
CN110782420A (en) | Small target feature representation enhancement method based on deep learning | |
CN111210443B (en) | Deformable convolution mixing task cascading semantic segmentation method based on embedding balance | |
Gao et al. | Reading scene text with fully convolutional sequence modeling | |
CN114202672A (en) | Small target detection method based on attention mechanism | |
US20180114071A1 (en) | Method for analysing media content | |
CN108734210B (en) | Object detection method based on cross-modal multi-scale feature fusion | |
JP2017062781A (en) | Similarity-based detection of prominent objects using deep cnn pooling layers as features | |
CN111027576B (en) | Cooperative significance detection method based on cooperative significance generation type countermeasure network | |
CN111353544B (en) | Target detection method based on improved Mixed Pooling-YOLOv3 | |
CN110781744A (en) | Small-scale pedestrian detection method based on multi-level feature fusion | |
CN110781980B (en) | Training method of target detection model, target detection method and device | |
CN112434618B (en) | Video target detection method, storage medium and device based on sparse foreground priori | |
CN113487610B (en) | Herpes image recognition method and device, computer equipment and storage medium | |
CN116758130A (en) | Monocular depth prediction method based on multipath feature extraction and multi-scale feature fusion | |
CN111507359A (en) | Self-adaptive weighting fusion method of image feature pyramid | |
CN112037239B (en) | Text guidance image segmentation method based on multi-level explicit relation selection | |
Fan et al. | A novel sonar target detection and classification algorithm | |
CN112149526A (en) | Lane line detection method and system based on long-distance information fusion | |
Zhao et al. | BiTNet: a lightweight object detection network for real-time classroom behavior recognition with transformer and bi-directional pyramid network | |
US20230072445A1 (en) | Self-supervised video representation learning by exploring spatiotemporal continuity | |
CN109284752A (en) | Rapid vehicle detection method | |
Cai et al. | Vehicle detection based on visual saliency and deep sparse convolution hierarchical model | |
TWI809957B (en) | Object detection method and electronic apparatus | |
CN113688864B (en) | Human-object interaction relation classification method based on split attention | |
CN114387489A (en) | Power equipment identification method and device and terminal equipment |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
PB01 | Publication | ||
SE01 | Entry into force of request for substantive examination | ||
RJ01 | Rejection of invention patent application after publication |
Application publication date: 20200211 |