CN111179212A - Method for realizing micro target detection chip integrating distillation strategy and deconvolution - Google Patents


Info

Publication number
CN111179212A
CN111179212A (application CN201911091454.3A; granted as CN111179212B)
Authority
CN
China
Prior art keywords
layer
network
output
size
learning network
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Granted
Application number
CN201911091454.3A
Other languages
Chinese (zh)
Other versions
CN111179212B (en)
Inventor
Xiong Weihua (熊伟华)
Wu Hua (吴华)
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Hangzhou Jingmou Intelligent Technology Co Ltd
Original Assignee
Hangzhou Jingmou Intelligent Technology Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Hangzhou Jingmou Intelligent Technology Co Ltd filed Critical Hangzhou Jingmou Intelligent Technology Co Ltd
Publication of CN111179212A publication Critical patent/CN111179212A/en
Application granted granted Critical
Publication of CN111179212B publication Critical patent/CN111179212B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical

Classifications

    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06T: IMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T7/00: Image analysis
    • G06T7/0002: Inspection of images, e.g. flaw detection
    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06N: COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00: Computing arrangements based on biological models
    • G06N3/02: Neural networks
    • G06N3/04: Architecture, e.g. interconnection topology
    • G06N3/045: Combinations of networks
    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06N: COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00: Computing arrangements based on biological models
    • G06N3/02: Neural networks
    • G06N3/08: Learning methods
    • Y: GENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
    • Y02: TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
    • Y02D: CLIMATE CHANGE MITIGATION TECHNOLOGIES IN INFORMATION AND COMMUNICATION TECHNOLOGIES [ICT], I.E. INFORMATION AND COMMUNICATION TECHNOLOGIES AIMING AT THE REDUCTION OF THEIR OWN ENERGY USE
    • Y02D10/00: Energy efficient computing, e.g. low power processors, power management or thermal management

Landscapes

  • Engineering & Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Theoretical Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • General Health & Medical Sciences (AREA)
  • General Engineering & Computer Science (AREA)
  • Biophysics (AREA)
  • Computational Linguistics (AREA)
  • Data Mining & Analysis (AREA)
  • Evolutionary Computation (AREA)
  • Artificial Intelligence (AREA)
  • Molecular Biology (AREA)
  • Computing Systems (AREA)
  • Biomedical Technology (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Mathematical Physics (AREA)
  • Software Systems (AREA)
  • Health & Medical Sciences (AREA)
  • Quality & Reliability (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Image Processing (AREA)
  • Image Analysis (AREA)

Abstract

A method for realizing a tiny-target detection chip that integrates a distillation strategy and deconvolution: several intermediate feature maps of a teaching network for high-resolution images train, via adversarial-loss learning, several layers of a learning network for low-resolution images that contains deconvolution layers, thereby enlarging the receptive field over the low-pixel image, improving the output accuracy of the learning network, and reducing the size of the chip. By casting the target detection task simply as a classification task in the learning network (only whether a target exists in a 20 × 20 pixel region needs to be judged), tiny-object detection can effectively eliminate false detections while keeping the chip area small and the required memory low in the hardware implementation.

Description

Method for realizing micro target detection chip integrating distillation strategy and deconvolution
Technical Field
The invention relates to a technology in the field of image detection, in particular to a method for realizing a tiny target detection chip integrating a distillation strategy and deconvolution.
Background
Although there are several methods for detecting objects with convolutional neural networks, most popular algorithms perform well only when the target occupies a large portion of the image (typically larger than 20 × 20 pixels). Recently, many algorithms have emerged that detect small, low-resolution objects (smaller than 20 × 20 pixels). These methods typically rely on multi-scale resolution, detecting target objects of different sizes at the corresponding resolutions. Such a structure detects multiple target objects simultaneously (multi-task), which helps with detecting tiny objects, but it requires larger storage and longer computation time in a hardware implementation.
Disclosure of Invention
To address these defects in the prior art, the invention provides a method for realizing a tiny-target detection chip that integrates a distillation strategy and deconvolution. The method uses several deconvolution layers to expand the receptive field, and these layers are distilled from a convolutional network pre-trained on high-resolution objects, so that the detection of tiny objects reaches an accuracy similar to that achieved on large objects.
The invention is realized by the following technical scheme:
the invention relates to a method for realizing a micro target detection chip integrating a distillation strategy and deconvolution, which trains a plurality of intermediate characteristic maps of a teaching network for a high-resolution image on a plurality of layers in a learning network for a low-resolution image and containing deconvolution layers in a loss-resistant learning mode, enlarges the receptive field of the low-pixel image, improves the output precision of the learning network and reduces the size of the chip.
The learning network contains no residual structure.
The intermediate feature maps used by the teaching network for adversarial-loss learning are produced by a residual network.
Technical effects
Compared with the prior art, the learning network contains no residual structure, so the hardware implementation needs no clock waiting and runs faster, avoids reading feature data of different scales for fusion operations, and reduces read/write power consumption. The learning network casts the target detection task simply as a classification task: only whether a target exists in a 20 × 20 pixel region needs to be judged, so the detection of tiny objects can effectively eliminate false detections while keeping the chip area small and the required memory low in the hardware implementation.
Drawings
FIG. 1 is a schematic diagram of the integrated distillation strategy and deconvolution micro target detection architecture of the present invention;
FIG. 2 is a schematic diagram of a residual structure used in the teaching network;
FIG. 3 is a schematic diagram illustrating the effects of the embodiment.
Detailed Description
As shown in Fig. 1, the tiny-target detection architecture integrating a distillation strategy and deconvolution according to this embodiment comprises: a learning network (StudentNet) containing deconvolution layers for low-resolution images, and a teaching network (TeacherNet) for high-resolution images. Several intermediate feature maps of the teaching network train multiple layers of the learning network via adversarial-loss learning, improving the output accuracy of the learning network while expanding the receptive field over the low-pixel image.
The learning network comprises, connected in sequence: a convolutional layer 400, a normalization layer 402 with an S-shaped rectified nonlinear activation unit, a convolutional layer 404, a deconvolution layer 406, a normalization layer 408 with an S-shaped rectified nonlinear activation unit, a convolutional layer 410, a normalization layer 412 with an S-shaped rectified nonlinear activation unit, a pooling layer 414, an ordinary convolutional layer 416, and a fully-connected layer 418. Convolutional layer 400 receives an input image of size 20 × 20 × 3 and outputs a feature map 401 of size 20 × 20 × 32 to normalization layer 402 for normalization. Convolutional layer 404 outputs a feature map 405 of size 40 × 40 × 32 from feature map 403; deconvolution layer 406 outputs a feature map 407 of size 40 × 40 × 32 to normalization layer 408 for normalization; convolutional layer 410 outputs a feature map 411 of size 40 × 40 × 32 from feature map 409, normalized by normalization layer 412. Pooling layer 414 takes the maximum over each 2 × 2 region of feature map 413, sampling every 2 pixels in the width and height directions, to obtain a feature map 415 of size 20 × 20 × 32, which passes through ordinary convolutional layer 416 and is output to fully-connected layer 418, finally producing a vector of size 1 × 4096, i.e., the final feature vector of the image.
The fully-connected layer 418 outputs the image feature vector, which serves as input to a subsequent classifier that determines the type of object (e.g., face, license plate) detected in the image.
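The data flow just described can be traced as a quick sanity check. The sketch below reproduces only the feature-map shapes stated in the text; kernel sizes, strides, and padding are not specified in the patent, so no layer parameters are assumed.

```python
# Shape trace of the learning network (StudentNet) described above.
# Only the shapes given in the text are reproduced; layer parameters
# (kernels, strides, padding) are not specified in the patent.
def trace_student_net(h=20, w=20):
    return [
        ("input",             (h, w, 3)),
        ("conv 400",          (h, w, 32)),          # feature map 401
        ("conv 404",          (2 * h, 2 * w, 32)),  # feature map 405
        ("deconv 406",        (2 * h, 2 * w, 32)),  # feature map 407
        ("conv 410",          (2 * h, 2 * w, 32)),  # feature map 411
        ("maxpool 414 (2x2)", (h, w, 32)),          # feature map 415
        ("fc 418",            (1, 4096)),           # final feature vector
    ]

for name, shape in trace_student_net():
    print(f"{name}: {shape}")
```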
The deconvolution layer 406 in the learning network expands the receptive field over the low-pixel image, and the feature extraction from the low-pixel image is guided by object detection on the high-resolution image.
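The receptive-field expansion relies on the deconvolution (transposed convolution) layer upsampling the 20 × 20 feature map to 40 × 40. A minimal sketch of the standard transposed-convolution output-size formula follows; the patent does not state kernel, stride, or padding values, so the numbers below are illustrative assumptions.

```python
def deconv_output_size(in_size, kernel, stride=2, padding=1, output_padding=0):
    """Spatial output size of a transposed convolution (deconvolution) layer."""
    return (in_size - 1) * stride - 2 * padding + kernel + output_padding

# Upsampling 20x20 to 40x40, as deconvolution layer 406 does in the text;
# the kernel/stride/padding values are assumptions, not from the patent.
print(deconv_output_size(20, kernel=4, stride=2, padding=1))  # 40
```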
The teaching network adopts a ResNet50 architecture: a convolutional layer 200, four serially connected ResNet blocks 202, 204, 206, 208, a max-pooling layer 210, a fifth ResNet block 212, and a fully-connected layer 214. Convolutional layer 200 receives a high-resolution image of size 40 × 40 × 3. The feature map 209 of size 40 × 40 × 32, output by the four serially connected ResNet blocks, guides the output feature map 413 of normalization layer 412 of the learning network; the feature map 213 of size 20 × 20 × 32, output by the fifth ResNet block, guides the output feature map 417 of ordinary convolutional layer 416; and fully-connected layer 214 outputs a vector of size 1 × 4096, the image feature vector, which guides the corresponding image feature vector output by fully-connected layer 418 of the learning network.
The guidance works as follows: feature maps 209, 213, 215 output by the teaching network are trained adversarially, through a discrimination network, against feature maps 413, 417, 419 output by normalization layer 412, pooling layer 414, and fully-connected layer 418 of the learning network. Whenever the two feature maps differ, the teaching network guides the learning network, so that the output of the learning network eventually becomes consistent with that of the teaching network. Specifically, each channel's feature map in the teaching network first finds the most similar channel among the channels of the corresponding layer of the learning network by cross-correlation: the variance of each channel's feature array is computed in each network, the variances are sorted, high-variance feature maps of the teaching network are matched by rank to high-variance feature maps of the learning network, and the matched feature maps are then fed into the discrimination network.
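The variance-based channel matching described above can be sketched in a few lines of NumPy. This is a minimal illustration under stated assumptions: the patent's exact similarity measure (cross-correlation) is simplified here to pure variance ranking.

```python
import numpy as np

def match_channels_by_variance(teacher_fm, student_fm):
    """Pair teacher/student channels by ranking per-channel variance.

    Simplified sketch of the guidance step: feature maps are (C, H, W);
    channels are sorted by variance (descending) and paired by rank.
    """
    t_var = teacher_fm.reshape(teacher_fm.shape[0], -1).var(axis=1)
    s_var = student_fm.reshape(student_fm.shape[0], -1).var(axis=1)
    t_order = np.argsort(-t_var)  # teacher channels, high variance first
    s_order = np.argsort(-s_var)  # student channels, high variance first
    return list(zip(t_order.tolist(), s_order.tolist()))

rng = np.random.default_rng(0)
pairs = match_channels_by_variance(rng.normal(size=(32, 40, 40)),
                                   rng.normal(size=(32, 40, 40)))
print(len(pairs))  # 32 matched teacher/student channel pairs
```

Each pair would then be fed to the discrimination network for adversarial training.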
This embodiment uses an InceptionV3 network for the adversarial training; adversarial learning through the discriminator transfers the features of the teaching network to the learning network.
The convolutions in this embodiment all adopt the depthwise plus pointwise form, which significantly reduces computational complexity. For example, an original 64 × 64 × 3 × 3 convolution becomes two consecutive convolutions: a depthwise convolution with kernel 64 × 1 × 3 × 3 (an independent convolution on each input channel), followed by a 64 × 64 × 1 × 1 pointwise convolution for channel fusion. The parameter count drops from 36864 to 4672, and the computational complexity drops correspondingly; this is the convolution style commonly used by MobileNet.
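The 4672/36864 parameter reduction quoted above can be verified directly: a depthwise 64 × 1 × 3 × 3 convolution plus a pointwise 64 × 64 × 1 × 1 convolution replaces a standard 64 × 64 × 3 × 3 kernel.

```python
def standard_conv_params(c_in, c_out, k):
    # one k x k filter per (input channel, output channel) pair
    return c_out * c_in * k * k

def depthwise_separable_params(c_in, c_out, k):
    # depthwise: one k x k filter per input channel;
    # pointwise: a 1 x 1 convolution fusing the channels
    return c_in * k * k + c_out * c_in

print(standard_conv_params(64, 64, 3))        # 36864
print(depthwise_separable_params(64, 64, 3))  # 576 + 4096 = 4672
```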
As shown in Fig. 2, a ResNet block in the teaching network adds the block's input X directly to the block's output F(X), so the final output is H(X) = F(X) + X, and the training target is F(X) = H(X) - X, i.e., the "residual" that remains after the input is subtracted from the final output.
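The residual relation H(X) = F(X) + X can be illustrated with a toy function standing in for the block's convolutional stack:

```python
import numpy as np

def residual_block(x, f):
    # H(x) = F(x) + x: the block only has to learn the residual F(x) = H(x) - x
    return f(x) + x

x = np.array([1.0, 2.0, 3.0])
h = residual_block(x, lambda v: 0.1 * v)  # toy F; in the patent F is a conv stack
print(np.allclose(h - x, 0.1 * x))  # the training target F(x) equals H(x) - x
```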
The learning network designed in this embodiment contains no residual structure, so the hardware implementation needs no clock waiting and runs faster, and the fusion operations that read feature data of different scales are avoided, reducing read/write power consumption. The learning network casts the target detection task simply as a classification task, needing only to judge whether a target exists in a 20 × 20 pixel region, so the on-chip area is small and little memory is required in the hardware implementation.
In this embodiment, a small-face data set (about 6000 faces) was constructed from an open-source face data set; it contains large faces and the corresponding small faces. Training the learning network both with and without the distillation method shows that the scheme of this embodiment improves accuracy by 5%.
The foregoing embodiments may be modified in many different ways by those skilled in the art without departing from the spirit and scope of the invention, which is defined by the appended claims and all changes that come within the meaning and range of equivalency of the claims are therefore intended to be embraced therein.

Claims (8)

1. A method for realizing a tiny-target detection chip integrating a distillation strategy and deconvolution, characterized in that several intermediate feature maps of a teaching network for high-resolution images train, via adversarial-loss learning, several layers of a learning network for low-resolution images that contains deconvolution layers, thereby enlarging the receptive field over the low-pixel image, improving the output accuracy of the learning network, and reducing the size of the chip; the learning network contains no residual structure, and the intermediate feature maps used by the teaching network for adversarial-loss learning are produced by a residual network.
2. The method of claim 1, wherein the teaching network adopts a ResNet50 architecture comprising a sequentially connected convolutional layer, four serially connected ResNet blocks, a max-pooling layer, a fifth ResNet block, and a fully-connected layer, wherein: the convolutional layer receives a high-resolution image of size 40 × 40 × 3; a feature map of size 40 × 40 × 32, output by the four serially connected ResNet blocks, guides the output feature map of the normalization layer of the learning network; a feature map of size 20 × 20 × 32, output by the fifth ResNet block, guides the output feature map of the ordinary convolutional layer of the learning network; and the fully-connected layer outputs a vector of size 1 × 4096, i.e., an image feature vector, which guides the corresponding image feature vector output by the fully-connected layer of the learning network.
3. The method of claim 1, wherein the learning network comprises, connected in sequence: a convolutional layer, a normalization layer with an S-shaped rectified nonlinear activation unit, a convolutional layer, a deconvolution layer, a normalization layer with an S-shaped rectified nonlinear activation unit, a convolutional layer, a normalization layer with an S-shaped rectified nonlinear activation unit, a pooling layer, an ordinary convolutional layer, and a fully-connected layer, wherein: the first convolutional layer receives an input image of size 20 × 20 × 3 and outputs a feature map of size 20 × 20 × 32 to a normalization layer for normalization; the second convolutional layer outputs a feature map of size 40 × 40 × 32; the deconvolution layer outputs a feature map of size 40 × 40 × 32 to a normalization layer for normalization; the third convolutional layer outputs a feature map of size 40 × 40 × 32, normalized by a normalization layer; the pooling layer takes the maximum over each 2 × 2 region, sampling every 2 pixels in the width and height directions, to obtain a feature map of size 20 × 20 × 32, which passes through the ordinary convolutional layer and is output to the fully-connected layer, which finally outputs a vector of size 1 × 4096, i.e., the final feature vector of the image.
4. The method of claim 1, 2 or 3, wherein the guidance comprises: feature maps output by the teaching network are trained adversarially, through a discrimination network, against the feature maps output by the normalization layer, the pooling layer, and the fully-connected layer of the learning network; whenever the feature maps differ, the teaching network guides the learning network, so that the output of the learning network becomes consistent with the teaching network.
5. The method of claim 4, wherein the guidance comprises: each channel's feature map in the teaching network first finds the most similar channel among the channels of the corresponding layer of the learning network by cross-correlation, namely: the variance of each channel's feature array is computed in each network, the variances are sorted, high-variance feature maps of the teaching network are matched by rank to high-variance feature maps of the learning network, and the matched feature maps are then fed into the discrimination network.
6. The method of claim 4, wherein an InceptionV3 network is used for the adversarial training, and adversarial learning through a discriminator transfers the features of the teaching network to the learning network.
7. The method of any preceding claim, wherein said convolutions take the depthwise and pointwise form to substantially reduce computational complexity.
8. The method of claim 2, wherein the ResNet block adds the block's input X directly to the block's output F(X), so the final output is H(X) = F(X) + X, and the training target is F(X) = H(X) - X, i.e., the residual that remains after the input is subtracted from the final output.
CN201911091454.3A 2018-11-10 2019-11-10 Method for realizing tiny target detection on-chip by integrating distillation strategy and deconvolution Active CN111179212B (en)

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
US201862758517P 2018-11-10 2018-11-10
US 62/758,517 2018-11-10

Publications (2)

Publication Number Publication Date
CN111179212A true CN111179212A (en) 2020-05-19
CN111179212B CN111179212B (en) 2023-05-23

Family

ID=70656217

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201911091454.3A Active CN111179212B (en) 2018-11-10 2019-11-10 Method for realizing tiny target detection on-chip by integrating distillation strategy and deconvolution

Country Status (1)

Country Link
CN (1) CN111179212B (en)

Citations (7)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20150065803A1 (en) * 2013-09-05 2015-03-05 Erik Scott DOUGLAS Apparatuses and methods for mobile imaging and analysis
US20170140248A1 (en) * 2015-11-13 2017-05-18 Adobe Systems Incorporated Learning image representation by distilling from multi-task networks
CN107247989A (en) * 2017-06-15 2017-10-13 北京图森未来科技有限公司 A kind of neural network training method and device
WO2018150083A1 (en) * 2017-02-16 2018-08-23 Nokia Technologies Oy A method and technical equipment for video processing
US20180268292A1 (en) * 2017-03-17 2018-09-20 Nec Laboratories America, Inc. Learning efficient object detection models with knowledge distillation
US20180307894A1 (en) * 2017-04-21 2018-10-25 General Electric Company Neural network systems
CN108764462A (en) * 2018-05-29 2018-11-06 成都视观天下科技有限公司 A kind of convolutional neural networks optimization method of knowledge based distillation


Non-Patent Citations (4)

* Cited by examiner, † Cited by third party
Title
GUIBO ZHU ET AL.: "Feature Distilled Tracking" *
GUOBIN CHEN ET AL.: "Learning Efficient Object Detection Models with Knowledge Distillation" *
VASILEIOS BELAGIANNIS ET AL.: "Adversarial Network Compression" *
GE SHIMING et al.: "Face Recognition Based on Deep Feature Distillation" *

Cited By (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
JP7459425B2 2020-06-15 2024-04-02 Intel Corporation Input image size switchable networks for adaptive runtime efficient image classification
CN112183579A (en) * 2020-09-01 2021-01-05 国网宁夏电力有限公司检修公司 Method, medium and system for detecting micro target
CN112183579B (en) * 2020-09-01 2023-05-30 国网宁夏电力有限公司检修公司 Method, medium and system for detecting micro target

Also Published As

Publication number Publication date
CN111179212B (en) 2023-05-23

Similar Documents

Publication Publication Date Title
CN110188705B (en) Remote traffic sign detection and identification method suitable for vehicle-mounted system
CN112232232B (en) Target detection method
US20200218948A1 (en) Thundernet: a turbo unified network for real-time semantic segmentation
CN110782420A (en) Small target feature representation enhancement method based on deep learning
US11157764B2 (en) Semantic image segmentation using gated dense pyramid blocks
CN110879982B (en) Crowd counting system and method
KR102165273B1 (en) Method and system for channel pruning of compact neural networks
CN112215332B (en) Searching method, image processing method and device for neural network structure
CN116188999B (en) Small target detection method based on visible light and infrared image data fusion
CN113487610B (en) Herpes image recognition method and device, computer equipment and storage medium
CN114708437B (en) Training method of target detection model, target detection method, device and medium
CN115631344B (en) Target detection method based on feature self-adaptive aggregation
CN109815931A (en) A kind of method, apparatus, equipment and the storage medium of video object identification
CN110782430A (en) Small target detection method and device, electronic equipment and storage medium
CN111179212B (en) Method for realizing tiny target detection on-chip by integrating distillation strategy and deconvolution
CN113343989A (en) Target detection method and system based on self-adaption of foreground selection domain
CN112541394A (en) Black eye and rhinitis identification method, system and computer medium
CN110503002B (en) Face detection method and storage medium
US11704894B2 (en) Semantic image segmentation using gated dense pyramid blocks
CN116844032A (en) Target detection and identification method, device, equipment and medium in marine environment
CN114492634A (en) Fine-grained equipment image classification and identification method and system
Yu et al. Intelligent corner synthesis via cycle-consistent generative adversarial networks for efficient validation of autonomous driving systems
KR20210109327A (en) Method and apparatus for learning artificial neural network
CN113255459B (en) Lane line detection method based on image sequence
CN111582057B (en) Face verification method based on local receptive field

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant