CN114037885A

CN114037885A - Target detection method based on size of selectable expansion convolution kernel

Info

Publication number: CN114037885A
Application number: CN202010705702.5A
Authority: CN
Inventors: 何小海; 熊书琪; 吴晓红; 陈洪刚; 卿粼波; 滕奇志
Original assignee: Sichuan University
Current assignee: Sichuan University
Priority date: 2020-07-21
Filing date: 2020-07-21
Publication date: 2022-02-11
Anticipated expiration: 2040-07-21
Also published as: CN114037885B

Abstract

The invention discloses a target detection method based on the size of a selectable expansion convolution kernel, and relates to the field of computer vision and artificial intelligence. Firstly, extracting features through a convolutional neural network, and performing feature fusion through a feature pyramid; then, introducing the feature maps obtained by fusion into selectable expansion coefficient convolution modules in corresponding layers respectively, and obtaining better features through the modules; and finally, performing multi-classification and bounding box regression on the fused feature layer, and continuously training an iterative model to obtain a multi-scale fused target detection result. The method not only realizes effective improvement of precision, can keep real-time performance under the condition of inputting pictures with certain sizes, and can be applied to places such as machine vision, face recognition, automatic driving, intelligent video, medical detection and the like.

Description

Target detection method based on size of selectable expansion convolution kernel

Technical Field

The invention relates to a target detection method based on the size of a selectable expansion convolution kernel, and belongs to the technical field of computer vision and intelligent information.

Background

Object detection is the basis for computer vision tasks and for many applications in the field of artificial intelligence. For target detection, it is defined as follows: given an input RGB image, target detection accomplishes two tasks: and detecting and identifying, namely knowing what category the object belongs to and finding out the position of the picture where the object is located. The category may be a species commonly found in nature, such as people, poultry, vehicles, and the like, and the positioning is performed by using a bounding box (bounding box). The target detection has wide application in face recognition, automatic driving, man-machine interaction, content-based image retrieval, intelligent video monitoring and the like.

The existing detectors are mainly divided into two types: one is a single-stage detector and the other is a two-stage detector. The two-stage method divides the detection problem into two processes, firstly generates a region suggestion, and then classifies and regresses a candidate region by a boundary box, the beginning of the algorithm is an R-CNN algorithm proposed in 2016, but because a large number of frames are generated in the two stages, the calculation complexity is greatly increased, and real-time detection is difficult to achieve; the single-stage detector adopts a regression-based idea, a region-abandoning proposal stage predicts the class probability and the position coordinate of an object through an Anchor point (Anchor), and can obtain a final detection result through end-to-end learning, wherein the region-abandoning proposal stage can greatly reduce the calculation complexity, so that the real-time detection can be realized under the condition of proper input resolution, and a representative algorithm is as follows: YOLOv3, DSSD, RfineDet, etc. In recent years, the field of view problem is gaining more and more attention in the field of computer vision, the larger the field of view means that the more the computer pays attention to, the better the integrity is, but the larger the field of view often ignores many detailed factors, so a convolution method capable of selecting the expansion coefficient is proposed to learn the size of the field of view adaptively, so that the network focuses on the region of interest adaptively, and more attention is paid recently.

Disclosure of Invention

The invention provides a target detection method based on the size of a selectable expansion convolution kernel, and aims to design a selectable expansion convolution network structure, apply the selectable expansion convolution network structure to a characteristic pyramid so as to better utilize characteristic fusion, and then perform target detection.

The invention realizes the purpose through the following technical scheme:

(1) extracting features by using a reference network Darknet-53, obtaining a multi-scale feature layer after 5 downsampling convolution and 3 upsampling operations, performing weighting fusion operation, and finally performing secondary classification and frame regression operation;

(2) constructing a selectable Dilated convolution module (selected scaled convolutional module (SDCM);

(3) the SDCM is introduced into the combination of the characteristic pyramid structure, namely the top layer characteristic is fused with the bottom layer characteristic and then is assisted by the attention information SDCM to obtain more effective characteristic P₅，P₄，P₃For multiple classification and regression operations.

(4) And finally, directly applying the characteristics to multi-classification and regression operation, and continuously training an iterative model to obtain a final detection result.

Drawings

FIG. 1 is a block diagram of a method for detecting an object according to the present invention.

Fig. 2 is a block diagram of an alternative expansion convolution based module of the present invention.

FIG. 3 is a diagram of a feature pyramid fusion module according to the present invention.

Detailed Description

The invention will be further described with reference to the accompanying drawings in which:

the method for constructing the selectable dilated convolution module is as follows:

a series of feature, pooling and activation layers are obtained through the Darknet-53 network, P in Darknet-53₅，P₄And sequentially introducing the SDCM into the characteristic layers, and sending the SDCM into the corresponding characteristic layers in the up-sampling stage for weighted fusion.

The network consists of a convolution function (conv), an activation function (LEAKyrelu) and an average pooling function (avgpool), given input F, obtaining output F ' after F passes through a selected expansion coefficient formula, and obtaining F ' by F ' through a channel attention mechanism_CThen the spatial attention mechanism yields the final output F ".

F'＝w×conv 1(F)+(1-w)×conv 2(F) (1)

In the formula, the convolution coefficient of the conv1 function is 1, the convolution coefficient of the conv2 function is 2, and the feature map F ″ output finally is as follows:

in the formula, N_cRepresenting channel attention mechanism operation, N_sRepresenting a spatial attention mechanism of operation,

representing a convolution operation.

The method for constructing the characteristic pyramid fusion module comprises the following steps:

the size of the input feature map is (C multiplied by H multiplied by W), wherein H and W represent the input height and width, and C represents the number of channels; for top layer feature P'₅Dimension is reduced from 1024 to 512 for detected feature P 'by dimension reduction operation of 3 x 3 convolutional layers'₄The output P 'obtained by performing dimension reduction operation while enlarging the feature map through dimension reduction operation of the 3 x 3 convolution layer'₅Has a dimension of (19 × 19 × 255); then, the output is subjected to weighted element-wise summation operation to obtain a fused feature map P'₄Has a dimension of (38 × 38 × 255); and then made of P'₄Down-sampling to obtain P'₃Finally, P 'are respectively paired'₅，P'₄，P'₃Classification and regression operations were performed separately.

In order to verify the effectiveness of the target detection method based on the selectable dilation convolution, experiments were conducted in the MS COCO2017 dataset. The experimental platforms herein are: ubuntu 20.04, Nvidia RTX 2080Ti GPU, Intel (R) core (TM) i7-9700 CPU. The deep learning frame is Pythrch, the precision evaluation indexes are mAP (mean average precision) and AP (average precision), and the speed evaluation index is fps (frames per second).

The MS COCO2017 data set comprises 118282 training sets and 5000 testing sets. Experiments were trained on a COCO2017 train val and tested on a COCO2017 testval dataset. All experiments were pre-trained in the VGG16 reference network. The learning rate is adjusted using the Cosine approach, namely, inThe learning rate in the first 50epochs was set to 10^-2Then, the learning rates in 100epochs were set to 10, respectively^-3. When the input picture size is 608 × 608, the batch size is set to 16 during training, and the number of GPUs is 2; when the picture size is 416 × 416, the batch size is set to 32 at the time of training, and the number of GPUs is 2. The batch size is set to 1 for testing, and the pytorch acceleration is not applicable. The experimental results of the invention are shown in table 1 and table 2, for 608 × 608 input, the mAP is 36.2%, and the detection speed is 50 fps; for an input of 416 x 416, the mAP is 36.1%, and the detection speed is 60fps, which is superior to the existing one-stage detector.

TABLE 1 test results of different algorithms on COCO test-dev2017 data set

TABLE 2 different algorithms are small on COCO test-dev2017 dataset (AP)_s) China (AP)_m) Large (AP)_l) AP comparison of targets

Claims

1. A method for detecting an object based on a selectable magnitude of a dilated convolution kernel, comprising the steps of:

(2) constructing a selectable scaled Convolution Module (SDCM);

(3) the SDCM is introduced into the combination of the characteristic pyramid structure, namely the top layer characteristic is fused with the bottom layer characteristic and then is assisted by the attention information SDCM to obtain more effective characteristic P₅,P₄,P₃For multiple classification and regression operations.

2. The method of claim 1, wherein the optional dilation convolution module is constructed in (1) by:

a series of feature, pooling and activation layers are obtained through the Darknet-53 network, P in Darknet-53₄，P₅And sequentially introducing the SDCM into the characteristic layers, and sending the SDCM into the corresponding characteristic layers in the up-sampling stage for weighted fusion.

3. The method of claim 1, wherein the optional dilation convolution module is constructed in (2) by:

the network consists of a convolution function (conv), an activation function (leakyrelu) and an average pooling function (avgpool); given input F, obtaining output F 'after selecting expansion coefficient formula, obtaining F' by channel attention mechanism_CThen the spatial attention mechanism yields the final output F ":

F'＝w×conv 1(F)+(1-w)×conv 2(F) (1)

representing a convolution operation.

4. The method of claim 1, wherein the fusing of the pyramid structure of the features of the dilated convolution module in (3) is selected as follows:

the size of the input feature map is (C multiplied by H multiplied by W), wherein H and W represent the input height and width, and C represents the number of channels; for top layer feature P'₅Dimension reduction from 1024 to 512 for feature P 'via dimension reduction operation of 3 x 3 convolutional layers'₄The output P 'obtained by performing dimension reduction operation while enlarging the feature map through dimension reduction operation of the 3 x 3 convolution layer'₅Has a dimension of (19 × 19 × 255); then, the output is subjected to weighted element-wise summation operation to obtain a fused feature map P'₄Has a dimension of (38 × 38 × 255); and then made of P'₄Down-sampling to obtain P'₃Finally, P 'are respectively paired'₅，P'₄，P'₃Classification and regression operations were performed separately.