CN111079739B - Multi-scale attention feature detection method - Google Patents


Info

Publication number
CN111079739B
Authority
CN
China
Prior art keywords
attention
feature
layer
scale
detection
Prior art date
Legal status
Active
Application number
CN201911189274.9A
Other languages
Chinese (zh)
Other versions
CN111079739A (en
Inventor
周书仁
Current Assignee
Changsha University of Science and Technology
Original Assignee
Changsha University of Science and Technology
Priority date
Filing date
Publication date
Application filed by Changsha University of Science and Technology filed Critical Changsha University of Science and Technology
Priority to CN201911189274.9A priority Critical patent/CN111079739B/en
Publication of CN111079739A publication Critical patent/CN111079739A/en
Application granted granted Critical
Publication of CN111079739B publication Critical patent/CN111079739B/en

Classifications

    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06V - IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V 10/00 - Arrangements for image or video recognition or understanding
    • G06V 10/20 - Image preprocessing
    • G06V 10/25 - Determination of region of interest [ROI] or a volume of interest [VOI]
    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06F - ELECTRIC DIGITAL DATA PROCESSING
    • G06F 18/00 - Pattern recognition
    • G06F 18/20 - Analysing
    • G06F 18/21 - Design or setup of recognition systems or techniques; Extraction of features in feature space; Blind source separation
    • G06F 18/214 - Generating training patterns; Bootstrap methods, e.g. bagging or boosting
    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06N - COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N 3/00 - Computing arrangements based on biological models
    • G06N 3/02 - Neural networks
    • G06N 3/04 - Architecture, e.g. interconnection topology
    • G06N 3/045 - Combinations of networks

Abstract

The invention discloses a multi-scale attention feature detection method, which comprises the following steps: constructing a single-shot target detector on the hardware resources of a computer, the detector comprising a base network, newly added convolutional layers and a prediction layer; adding a plurality of attention branches to the newly added convolutional layers to enhance the detection features and constructing a parallel multi-scale attention feature detection model; training the single-shot target detector; training the parallel multi-scale attention feature detection model with the parameters obtained from training the single-shot target detector; and inputting an image to be detected into the multi-scale attention feature detection model and computing the detection result. Because the attention branches added to the newly added convolutional layers of the single-shot target detector combine context information with attention features, the detection effect is improved; in particular, the method can reach 78.6% on the VOC2007 data set.

Description

Multi-scale attention feature detection method
Technical Field
The invention relates to the field of computer vision and intelligent image recognition, in particular to a multi-scale attention feature detection method.
Background
Object detection is a basic and core technique in today's machine vision. Its task is to find the objects or persons of interest in large numbers of images and to determine the category and position of each object in the image.
The target detection has extremely high application value and wide application prospect. The application fields include: the method comprises the steps of unmanned driving, target detection in intelligent traffic, an intelligent question-answering system, face detection, medical image detection and the like. In addition, in the field of intelligent security, the target detection can realize the dynamic detection of targets such as safety helmets, life jackets and the like, and can also realize the functions of target intrusion, departure detection and the like.
At present the field of target detection is well developed, with methods such as RCNN, SPP-Net, Fast/Faster RCNN and YOLO; according to the number of stages in the detection process, these methods fall into two categories: one-stage (One stage) target detection methods and two-stage (Two stage) target detection methods.
The main difference between the two types of methods is that a Two stage target detection algorithm must first generate pre-selected regions (candidate boxes that may contain a target) on the image and then classify and localize the target, whereas a One stage algorithm skips the pre-selected regions and extracts features directly from the image to classify and localize the target.
The most typical Two stage target detection algorithms are Fast/Faster RCNN, which comprise feature extraction, region selection, and target classification and localization. They achieve high detection precision but poor detection speed. For example, the Faster RCNN method (2016) detects targets accurately but needs approximately 0.2 s per image, which is impractical for real-time detection. The reason is that Faster RCNN generates about 2000 candidate regions in the forward pass, which causes a large amount of computation and ultimately limits the detection speed.
The One stage target detection algorithm omits the generation of pre-selected regions by using simple anchor points, so target detection becomes an end-to-end process. One stage methods therefore have a large speed advantage over Two stage methods; for example, the YOLO algorithm can reach 155 FPS, but its detection precision is low.
SSD, for example, is one of the One stage target detection algorithms, but its detection effect is mediocre.
Disclosure of Invention
The invention mainly aims to provide a multi-scale attention feature detection method, and aims to solve the problem that the detection effect of SSD is general in the prior art.
A multi-scale attention feature detection method, comprising:
constructing a single target detector through hardware resources of a computer, wherein the single target detector comprises a basic network, a newly added convolution layer and a prediction layer;
adding a plurality of attention branches to the newly added convolutional layers to enhance the detection features and constructing a parallel multi-scale attention feature detection model, wherein each attention branch provides an attention-region mask that is multiplied element-wise with the feature detected at the upper layer, so that during detection each detected feature contains both upper-layer information and lower-layer information;
training the single-shot target detector;
training the parallel multi-scale attention feature detection model according to the parameters obtained by training the single-shot target detector;
and inputting the image to be detected into the multi-scale attention feature detection model, and calculating to obtain a detection result.
Preferably, the adding a plurality of attention branches to the newly added convolutional layer to enhance the characteristics of the detection features and constructing a parallel multi-scale attention feature detection model further includes:
and taking the next detection feature obtained by a down-sampling layer in a shared network as the input of the attention branch, wherein the shared network comprises the base network and the newly added convolutional layer.
Preferably, the depth of the hourglass network of the attention branch is set to 1.
Preferably, the attention branch includes a feature layer, wherein the probability value of a channel of the feature layer is calculated by the formula:

$$p_{ij}^{c}=\frac{\lambda_{ij}\exp\left(x_{ij}^{c}\right)}{\sum_{c'=1}^{C}\lambda_{ij}\exp\left(x_{ij}^{c'}\right)}$$

wherein $\lambda_{ij}$ is a weight of the previous feature, set to 1; $c$ denotes the current channel; $x_{ij}^{c}$ denotes the feature value of pixel $(i,j)$ on the $c$-th feature map, with $0\le i,j\le k$; $C$ denotes the number of feature channels of the layer; and $p_{ij}^{c}$ denotes the channel probability value of the pixel.

The probability value of the pixels of the feature layer is calculated by the formula:

$$x'_{ij}=\frac{\lambda_{ij}\exp\left(x_{ij}\right)}{\sum_{0\le i,j\le k}\lambda_{ij}\exp\left(x_{ij}\right)}$$

wherein $\lambda_{ij}$ is a weight of the previous feature, set to 1; $x_{ij}$ denotes the feature value of pixel $(i,j)$ on the feature map; $\sum_{0\le i,j\le k}\lambda_{ij}\exp(x_{ij})$ is the sum of the weighted pixel values of the pixels in a channel; $x'_{ij}$ denotes the probability value of pixel $(i,j)$; and $k$ denotes the size of the feature map.
Preferably, the loss function of the multi-scale attention feature detection model comprises two parts, a localization loss and a classification loss, calculated by the formula:

$$L_{loss}=\frac{1}{N}\left(L_{cls}+\alpha L_{loc}\right)$$

wherein $L_{loss}$ is the loss function; $L_{loc}$ is the localization loss and $L_{cls}$ is the classification loss; $N$ denotes the number of matched prediction boxes, and the loss is set to 0 if $N$ is 0; $\alpha$ denotes the weight between the localization loss and the classification loss, and is set to 1.
Preferably, the localization loss is calculated by the formula:

$$L_{loc}(b,p,t)=\sum_{i\in Pos}^{N}\sum_{m\in\{x,y,w,h\}}x_{ij}^{k}\,\mathrm{smooth}_{L1}\!\left(l_{i}^{m}-\hat{g}_{j}^{m}\right)$$

wherein $x_{ij}^{k}$ denotes the degree of matching between the $i$-th prediction box and the $j$-th ground-truth box in the $k$-th class; $l_{i}^{m}$ denotes the $i$-th positive prediction, directly represented by its bounding box; $\hat{g}_{j}^{m}$ denotes the distance between the default box and the correct box; $L_{loc}(b,p,t)$ denotes the localization loss, wherein $b$ denotes the bounding box, $p$ denotes the prediction box, i.e. the predicted candidate box, and $t$ denotes the ground truth, i.e. the real bounding box; $Pos$ denotes the positive samples; $x,y$ denote the abscissa and ordinate of the center point, and $w,h$ denote the width and height of the box, respectively.
Preferably, the number of said attention branches is 5.
Preferably, the base network is a VGG-16 model, namely a pre-trained ILSVRC classification model with the last two fully-connected layers removed; the VGG-16 includes 5 groups of convolutional layers.
Preferably, the constructing the single-pass object detector comprises:
and taking the multi-scale convolutional layer of the newly-added convolutional layer as the input of the prediction layer, and respectively and independently calculating classification and positioning results by using two convolutional kernels with the same size.
Preferably, after the two convolution kernels with the same size are used for independently calculating the classification and localization results, the method further includes:
highly repetitive predictions are eliminated by non-maxima suppression.
Through the above technical scheme, a plurality of attention branches are added to the newly added convolutional layers of the single-shot target detector, so that context information and attention features are combined and the detection effect is improved; in particular, the method can reach 78.6% on the VOC2007 data set.
Drawings
In order to more clearly illustrate the embodiments of the present invention or the technical solutions in the prior art, the drawings used in the description of the embodiments or the prior art are briefly described below. It is obvious that the drawings in the following description are only some embodiments of the present invention, and that those skilled in the art can obtain other drawings from the structures shown in them without creative effort.
FIG. 1 is a flow chart of a first embodiment of a multi-scale attention feature detection method of the present invention;
FIG. 2 is a schematic diagram of a basic architecture of an SSD according to a first embodiment of the multi-scale attention feature detection method of the present invention;
FIG. 3 is a diagram illustrating feature maps with different sizes according to a first embodiment of a multi-scale attention feature detection method of the present invention.
FIG. 4 is a schematic diagram of a prior block of an SSD in a first embodiment of a method for multi-scale attention feature detection in accordance with the present invention;
FIG. 5 is a schematic diagram of a network structure of an SSD in a first embodiment of a method for detecting multi-scale attention characteristics according to the present invention;
FIG. 6 is a schematic diagram of a MA-SSD in the first embodiment of the multi-scale attention feature detection method of the present invention;
FIG. 7 is a schematic view of an attention module in a first embodiment of a multi-scale attention feature detection method according to the present invention;
fig. 8 is a schematic diagram illustrating a computing manner of a feature layer in a multi-scale attention feature detection method according to the present invention.
Detailed Description
It should be understood that the specific embodiments described herein are merely illustrative of the invention and do not limit the invention.
The invention provides a multi-scale attention feature detection method.
As shown in fig. 1, in a first embodiment of the multi-scale attention feature detection method provided by the present invention, the method includes the following steps:
step S110: and constructing a single target detector through hardware resources of a computer, wherein the single target detector comprises a basic network, a newly added convolutional layer and a prediction layer.
Specifically, the method is executed by a multi-scale attention feature detection system that includes a computer. In step S110 the processor of the computer constructs a single-shot target detector (i.e. SSD). The main idea of SSD is to sample densely and uniformly at different positions of the image, using different scales and aspect ratios during sampling, and then to extract features with a CNN and perform classification and regression directly. Compared with a Two stage method, the whole process has one step fewer and is therefore faster; the drawback is that the uniform dense sampling makes training difficult, so the final model accuracy is not high.
SSD proposes prior boxes of different scales and aspect ratios based on the anchor points of Faster RCNN (faster region convolutional neural network). Meanwhile, SSD generates predictions of different proportions from multi-scale features and separates them explicitly according to aspect ratio: a large-scale feature map can be used to detect small objects, and a small-scale feature map to detect large objects. As shown in fig. 2, SSD employs multi-scale feature maps.
Regarding multi-scale feature maps: a CNN (convolutional neural network) generally has larger feature maps at the front and gradually reduces the feature-map size with stride-2 convolution or pooling. As shown in fig. 2, both a larger and a smaller feature map are used for detection. The advantage is that the larger feature map detects relatively small targets while the smaller feature map is responsible for the larger targets; as shown in fig. 3, the 8 × 8 feature map can be divided into more cells, but the prior-box scale of each cell is smaller.
In addition, unlike YOLO (You Only Look Once), which ends with a fully connected layer, SSD directly uses convolution to extract detection results from the different feature maps. For a feature map of shape m × n × p, a detection value can be obtained with a single convolution kernel of size 3 × 3 × p.
Furthermore, in YOLO each unit predicts multiple bounding boxes, but all of them are relative to the unit itself (a square block), whereas the shape of a real target is variable, so YOLO has to adapt to target shapes during training. SSD borrows the anchor-point concept of Faster R-CNN: each unit is given prior boxes of different scales and aspect ratios, and the predicted bounding boxes are based on these prior boxes, which reduces the training difficulty to some extent. In general each cell is given several prior boxes with different scales and aspect ratios; as shown in fig. 4, each cell uses 4 different prior boxes, and in the picture the cat and the dog are each trained with the prior box best suited to their shape.
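The prior-box layout described above can be sketched in a few lines of numpy. This is only an illustration: the function name, the scale value and the aspect-ratio set are our assumptions, and a full SSD implementation also adds an extra scale for the ratio-1 box.

```python
import numpy as np

def prior_boxes(fmap_size, scale, aspect_ratios=(1.0, 2.0, 0.5)):
    """Generate SSD-style prior boxes as (cx, cy, w, h) in normalised [0, 1]
    image coordinates: every cell of the feature map gets one box per ratio."""
    boxes = []
    for i in range(fmap_size):          # rows of the feature map
        for j in range(fmap_size):      # columns
            cx = (j + 0.5) / fmap_size  # box centre sits at the cell centre
            cy = (i + 0.5) / fmap_size
            for ar in aspect_ratios:    # same area, different width/height
                boxes.append([cx, cy, scale * np.sqrt(ar), scale / np.sqrt(ar)])
    return np.array(boxes)

priors = prior_boxes(fmap_size=8, scale=0.2)  # 8 x 8 cells, 3 boxes each
```

An 8 × 8 map with 3 ratios yields 192 priors; a smaller map with a larger `scale` would cover the large targets instead, matching the division of labour between feature maps described above.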
The SSD model consists of three parts: the base network, the newly added convolutional layers and the prediction layer; the model structure is shown in fig. 5.
SSD detects small targets relatively poorly, because small targets do not retain enough information in the high-level feature maps.
Step S120: adding a plurality of attention branches to the newly added convolutional layers to enhance the detection features and constructing a parallel multi-scale attention feature detection model, wherein each attention branch provides an attention-region mask that is multiplied element-wise with the feature detected at the upper layer, so that during detection each detected feature contains both upper-layer information and lower-layer information.
Specifically, step S120 is completed by the processor, and the newly added convolutional layers form the multi-scale layers; a structural diagram of the parallel multi-scale attention feature detection model (MA-SSD) is shown in fig. 6. Because each attention branch provides an attention-region mask that is multiplied element-wise with the feature detected at the upper layer, the detection features are enhanced. Each detected feature therefore contains both upper-layer and lower-layer information, which introduces context information into the detection process and improves the accuracy of target detection.
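The element-wise masking step can be illustrated with numpy. This is a sketch under our own assumptions: the sigmoid of the channel mean stands in for the real attention branch, which in the model is a small hourglass network.

```python
import numpy as np

rng = np.random.default_rng(0)
feat = rng.standard_normal((256, 8, 8))    # one detection feature (C, H, W)

# Toy attention branch: collapse channels, squash to (0, 1).
logits = feat.mean(axis=0, keepdims=True)  # shape (1, 8, 8)
mask = 1.0 / (1.0 + np.exp(-logits))       # attention-region mask

# Element-wise product; the mask broadcasts over all 256 channels,
# keeping positions the branch considers important and damping the rest.
attended = feat * mask
```

The same broadcasting pattern applies whatever network produces the mask, as long as its spatial size matches the detection feature.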
Step S130: training the single-shot target detector.
Specifically, the model is trained by the processor after being built.
Step S140: training the parallel multi-scale attention feature detection model according to the parameters obtained by training the single-shot target detector.
Specifically, parameters of the SSD backbone network are fixed, and then a multi-scale attention layer is trained through a processor, so that the parallel multi-scale attention feature detection model (MA-SSD) is trained.
Step S150: and inputting the image to be detected into the multi-scale attention feature detection model, and calculating to obtain a detection result.
Specifically, the image to be detected is input to a parallel multi-scale attention feature detection model (MA-SSD) through an input module of the computer, so that the image to be detected can be detected, and the detection result is better than that of the SSD model.
Through the above technical scheme, a plurality of attention branches are added to the newly added convolutional layers of the single-shot target detector, so that context information and attention features are combined and the detection effect is improved; in particular, the method can reach 78.6% on the VOC2007 data set.
In addition, the existing SSD model is complex, the training process is complex and long, and the training cost is high.
In order to solve the above technical problem, in a second embodiment of the multi-scale attention feature detection method provided by the present invention, based on the first embodiment, step S120 further includes:
step S210: and taking the next detection feature obtained by a down-sampling layer in a shared network as the input of the attention branch, wherein the shared network comprises the base network and the newly added convolutional layer.
Specifically, in order to make the MA-SSD model smaller and faster, the down-sampling layers of the model are shared; the MA-SSD uses the detected features as input to the encoding-decoding structure and feeds the next detection feature, obtained through a down-sampling layer of the shared network, to the attention branch, wherein the shared network comprises the base network and the newly added convolutional layers. This reduces the computation of the model, shortens the training process and training time, and lowers the training cost. In addition, sharing the multi-scale layers increases the target detection speed.
Furthermore, this parallel structure is less computationally expensive and more loosely coupled than algorithms such as DSSD (Deconvolutional Single Shot Detector) and FPN (feature pyramid network). In a serial encoding-decoding structure, the up-sampled features of a lower level depend on the top-level features, which form the multi-scale decoding structure of the higher levels. In the parallel structure, by contrast, an up-sampled feature at a lower level depends only on the features of the next higher level, so the coupling between up-sampled features is low.
In addition, in the third embodiment of the multi-scale attention feature detection method proposed by the present invention, based on the second embodiment, the depth of the hourglass network of the above-mentioned attention branches is set to 1.
Specifically, the design of the attention branch mainly draws on the residual attention structure and the soft attention structure in natural language processing (NLP); the specific structure of the attention branch is shown in fig. 7. The present invention improves the residual attention structure to meet the speed requirement: in the attention branch, the depth of the hourglass network is set to 1. Unlike NLP, a single image has no time dimension, but it has many different feature maps and pixels; the attention branch therefore computes a region of interest from the feature map to raise the importance of the target region in the features.
In a fourth embodiment of the multi-scale attention feature detection method provided by the present invention, based on the second embodiment, the attention branch includes a feature layer, wherein the probability value of a channel of the feature layer is calculated by the formula:

$$p_{ij}^{c}=\frac{\lambda_{ij}\exp\left(x_{ij}^{c}\right)}{\sum_{c'=1}^{C}\lambda_{ij}\exp\left(x_{ij}^{c'}\right)}$$

wherein $\lambda_{ij}$ is a weight of the previous feature, set to 1; $c$ denotes the current channel; $x_{ij}^{c}$ denotes the feature value of pixel $(i,j)$ on the $c$-th feature map, with $0\le i,j\le k$; $C$ denotes the number of feature channels of the layer; and $p_{ij}^{c}$ denotes the channel probability value of the pixel.

Specifically, this channel probability formula is C-Softmax (channel Softmax).

The probability value of the pixels of the feature layer is calculated by the formula:

$$x'_{ij}=\frac{\lambda_{ij}\exp\left(x_{ij}\right)}{\sum_{0\le i,j\le k}\lambda_{ij}\exp\left(x_{ij}\right)}$$

wherein $\lambda_{ij}$ is a weight of the previous feature, set to 1; $x_{ij}$ denotes the feature value of pixel $(i,j)$ on the feature map; $\sum_{0\le i,j\le k}\lambda_{ij}\exp(x_{ij})$ is the sum of the weighted pixel values of the pixels in a channel; $x'_{ij}$ denotes the probability value of pixel $(i,j)$; and $k$ denotes the size of the feature map.
Specifically, the formula for calculating the probability value of the pixel point is F-Softmax (characteristic-Softmax), and the schematic diagram of the formulas for calculating C-Softmax and F-Softmax refers to fig. 8, where the dark portion in the diagram is a summation area.
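Under the stated assumption that λ_ij = 1 everywhere, both formulas reduce to an ordinary softmax taken over different axes of a (C, H, W) tensor. A numpy sketch (the max-subtraction is a standard numerical-stability trick that leaves the result unchanged):

```python
import numpy as np

def c_softmax(x):
    """C-Softmax: at each pixel (i, j), normalise across the C channels."""
    e = np.exp(x - x.max(axis=0, keepdims=True))
    return e / e.sum(axis=0, keepdims=True)

def f_softmax(x):
    """F-Softmax: within each channel, normalise across all k*k pixels."""
    e = np.exp(x - x.max(axis=(1, 2), keepdims=True))
    return e / e.sum(axis=(1, 2), keepdims=True)

x = np.arange(24, dtype=float).reshape(2, 3, 4)  # toy (C, H, W) feature
pc = c_softmax(x)  # channel probabilities: sum to 1 at every pixel
pf = f_softmax(x)  # pixel probabilities: sum to 1 inside every channel
```

The two normalisations are complementary: C-Softmax scores which channel responds most at a location, while F-Softmax scores which locations matter most within a channel.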
By the calculation method, the MA-SSD model is simplified, and the time cost of training and the calculation amount of the model are reduced.
In a fifth embodiment of the multi-scale attention feature detection method provided by the present invention, based on the first embodiment, the loss function of the multi-scale attention feature detection model comprises two parts, a localization loss and a classification loss, calculated by the formula:

$$L_{loss}=\frac{1}{N}\left(L_{cls}+\alpha L_{loc}\right)$$

wherein $L_{loss}$ is the loss function; $L_{loc}$ is the localization loss and $L_{cls}$ is the classification loss; $N$ denotes the number of matched prediction boxes, and the loss is set to 0 if $N$ is 0; $\alpha$ denotes the weight between the localization loss and the classification loss, and is set to 1.
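The weighted combination can be written directly as a small helper. A minimal sketch under the stated definitions: the classification and localization losses are passed in as already-computed scalars, and the zero-match guard from the text is made explicit.

```python
def total_loss(l_cls, l_loc, n_matched, alpha=1.0):
    """L_loss = (1/N) * (L_cls + alpha * L_loc); if no prediction box
    matched (N == 0), the loss is defined as 0."""
    if n_matched == 0:
        return 0.0
    return (l_cls + alpha * l_loc) / n_matched

loss = total_loss(l_cls=2.0, l_loc=4.0, n_matched=2)  # (2 + 4) / 2 = 3.0
```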
In a sixth embodiment of the multi-scale attention feature detection method provided by the present invention, based on the fifth embodiment, the localization loss is calculated by the formula:

$$L_{loc}(b,p,t)=\sum_{i\in Pos}^{N}\sum_{m\in\{x,y,w,h\}}x_{ij}^{k}\,\mathrm{smooth}_{L1}\!\left(l_{i}^{m}-\hat{g}_{j}^{m}\right)$$

wherein $x_{ij}^{k}$ denotes the degree of matching between the $i$-th prediction box and the $j$-th ground-truth box in the $k$-th class; $l_{i}^{m}$ denotes the $i$-th positive prediction, directly represented by its bounding box; $\hat{g}_{j}^{m}$ denotes the distance between the default box and the correct box; $L_{loc}(b,p,t)$ denotes the localization loss, wherein $b$ denotes the bounding box, $p$ denotes the prediction box, i.e. the predicted candidate box, and $t$ denotes the ground truth, i.e. the real bounding box; $Pos$ denotes the positive samples; $x,y$ denote the abscissa and ordinate of the center point, and $w,h$ denote the width and height of the box, respectively.
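A numpy sketch of the localization term, assuming the standard SSD smooth-L1 form; the 0/1 indicator array plays the role of the matching degree, and box offsets are given as (x, y, w, h).

```python
import numpy as np

def smooth_l1(d):
    """smooth_L1(d) = 0.5 * d^2 if |d| < 1, else |d| - 0.5 (element-wise)."""
    d = np.asarray(d, dtype=float)
    return np.where(np.abs(d) < 1.0, 0.5 * d * d, np.abs(d) - 0.5)

def localization_loss(pred, target, positive):
    """Sum of smooth-L1 over the (x, y, w, h) offsets of positive boxes.

    pred, target: (N, 4) predicted and encoded ground-truth offsets
    positive:     (N,) 0/1 indicator of matched (positive) prediction boxes
    """
    return float((positive[:, None] * smooth_l1(pred - target)).sum())

pred = np.array([[0.0, 0.0, 0.0, 0.0], [1.0, 1.0, 1.0, 1.0]])
loss = localization_loss(pred, np.zeros((2, 4)), np.array([1.0, 1.0]))
```

Smooth-L1 behaves quadratically near zero and linearly for large errors, which keeps gradients bounded for badly mislocalized boxes.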
By the calculation method, the MA-SSD model is simplified, and the training time cost and the calculation amount of the model are reduced.
In a seventh embodiment of the multi-scale attention feature detection method proposed by the present invention, based on the first embodiment, the number of attention branches is set to 5, so as to improve the accuracy of target detection.
In an eighth embodiment of the multi-scale attention feature detection method provided by the present invention, based on the seventh embodiment, the basic network is a VGG-16 (visual geometry group network-16) model, which is a pre-trained ILSVRC (ImageNet large-scale visual recognition challenge race) classification model with the last two fully-connected layers removed; the VGG-16 includes 5 sets of convolutional layers.
In a ninth embodiment of the multi-scale attention feature detection method provided by the present invention, based on the eighth embodiment, step S110 includes:
step S910: and taking the multi-scale convolution layer of the newly-added convolution layer as the input of the prediction layer, and respectively calculating a classification result and a positioning result by using two convolution kernels with the same size.
Specifically, the convolution kernels of the same size are preferably 3 × 3 convolution kernels.
In a tenth embodiment of the multi-scale attention feature detection method provided by the present invention, based on the ninth embodiment, after step S910, the method further includes:
step S1010: highly repetitive predictions are eliminated by non-maxima suppression.
Specifically, the optimal prediction effect is obtained by non-maximum suppression.
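Greedy non-maximum suppression can be sketched in numpy. This is an illustration, not the patent's implementation: boxes are (x1, y1, x2, y2) corners and the 0.5 IoU threshold is our assumption.

```python
import numpy as np

def nms(boxes, scores, iou_thresh=0.5):
    """Greedy non-maximum suppression.

    boxes:  (N, 4) corners as (x1, y1, x2, y2)
    scores: (N,) confidence of each prediction
    Returns indices of the predictions kept, highest score first."""
    order = np.argsort(scores)[::-1]          # process best-scoring box first
    keep = []
    while order.size > 0:
        i = order[0]
        keep.append(int(i))
        if order.size == 1:
            break
        rest = order[1:]
        # Intersection of the kept box with every remaining box.
        x1 = np.maximum(boxes[i, 0], boxes[rest, 0])
        y1 = np.maximum(boxes[i, 1], boxes[rest, 1])
        x2 = np.minimum(boxes[i, 2], boxes[rest, 2])
        y2 = np.minimum(boxes[i, 3], boxes[rest, 3])
        inter = np.clip(x2 - x1, 0, None) * np.clip(y2 - y1, 0, None)
        area_i = (boxes[i, 2] - boxes[i, 0]) * (boxes[i, 3] - boxes[i, 1])
        area_r = (boxes[rest, 2] - boxes[rest, 0]) * (boxes[rest, 3] - boxes[rest, 1])
        iou = inter / (area_i + area_r - inter)
        order = rest[iou <= iou_thresh]       # drop highly repetitive boxes
    return keep

boxes = np.array([[0., 0., 10., 10.], [1., 1., 10., 10.], [20., 20., 30., 30.]])
scores = np.array([0.9, 0.8, 0.7])
kept = nms(boxes, scores)  # the two overlapping boxes collapse to one
```

The two heavily overlapping boxes (IoU 0.81) are merged into the higher-scoring one, while the distant box survives; raising `iou_thresh` keeps more near-duplicates.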
The above-mentioned serial numbers of the embodiments of the present invention are merely for description and do not represent the merits of the embodiments.
Through the above description of the embodiments, those skilled in the art will clearly understand that the method of the above embodiments can be implemented by software plus a necessary general hardware platform, and certainly can also be implemented by hardware, but in many cases, the former is a better implementation manner. Based on such understanding, the technical solutions of the present invention may be embodied in the form of a software product, wherein the software product is stored in a storage medium (such as ROM/RAM, magnetic disk, optical disk), and includes instructions for enabling a terminal (such as a mobile phone, a computer, a server, an air conditioner, or a network device) to execute the method according to the embodiments of the present invention.
While the present invention has been described with reference to the particular illustrative embodiments, it is to be understood that the invention is not limited to the disclosed embodiments, but is intended to cover various modifications, equivalent arrangements, and equivalents thereof, which may be made by those skilled in the art without departing from the spirit and scope of the invention as defined by the appended claims.

Claims (9)

1. A multi-scale attention feature detection method is characterized by comprising the following steps:
constructing a single target detector through hardware resources of a computer, wherein the single target detector comprises a basic network, a newly added convolution layer and a prediction layer;
adding a plurality of attention branches to the newly added convolutional layers to enhance the detection features, and constructing a parallel multi-scale attention feature detection model, wherein each attention branch provides an attention-region mask that is multiplied element-wise with the feature detected at the upper layer, and each detected feature contains both upper-layer information and lower-layer information during detection;
training the single pass target detector;
training the parallel multi-scale attention feature detection model according to parameters obtained by training the single target detector;
inputting an image to be detected into the multi-scale attention feature detection model, and calculating to obtain a detection result;
adding a plurality of attention branches to the newly added convolutional layer to enhance the characteristic of the detection feature and constructing a parallel multi-scale attention feature detection model, further comprising:
and taking the next detection feature obtained by a down-sampling layer in a shared network as the input of the attention branch, wherein the shared network comprises the base network and the newly added convolutional layer.
2. The multi-scale attention feature detection method of claim 1, wherein a depth of the hourglass network of attention branches is set to 1.
3. A multi-scale attention feature detection method as claimed in claim 1, wherein the attention branch comprises a feature layer, and the channel probability value of the feature layer is calculated by the formula:

p_c = exp(λ·f_{c+1}(i,j)) / Σ_{c=1}^{C} exp(λ·f_{c+1}(i,j))

wherein λ represents the weight of the previous feature and is set to 1; c denotes the current channel; f_{c+1}(i,j) represents the feature value of the (i,j) pixel point on the (c+1)-th feature map; C denotes the number of feature channels of the layer; p_c represents the channel probability value of the pixel point.

The probability value of the pixel points of the feature layer is calculated by the formula:

p(i,j) = exp(λ·f(i+1,j+1)) / Σ_{i=1}^{K} Σ_{j=1}^{K} exp(λ·f(i,j))

wherein λ represents the weight of the previous feature and is set to 1; f(i+1,j+1) represents the feature value of the (i+1,j+1) pixel point on the feature map; the denominator is the sum of the weighted pixel values over the different pixels of a channel; p(i,j) represents the probability value of the (i,j) pixel point; K represents the size of the feature map.
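The two claim-3 formulas are softmax normalizations, one across channels at a fixed pixel and one across all pixels of a single channel. A minimal NumPy sketch (the toy shapes and the max-subtraction stabilization are assumptions for illustration):

```python
import numpy as np

def channel_probability(feats, i, j, lam=1.0):
    """Softmax over the C channels at pixel (i, j); lam is the
    previous-feature weight, set to 1 as in the claim."""
    v = lam * feats[:, i, j]               # feature values at (i, j) across channels
    e = np.exp(v - v.max())                # stabilized exponentials
    return e / e.sum()                     # one probability per channel

def pixel_probability(feat_map, lam=1.0):
    """Softmax over all K*K pixels of one channel's (K, K) feature map."""
    v = lam * feat_map
    e = np.exp(v - v.max())
    return e / e.sum()                     # (K, K) probabilities summing to 1

feats = np.random.rand(3, 5, 5)            # C=3 channels, K=5
pc = channel_probability(feats, 2, 2)
pp = pixel_probability(feats[0])
```

Both outputs are valid probability distributions: `pc` sums to 1 over the channels, and `pp` sums to 1 over the spatial grid.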
4. A multi-scale attention feature detection method according to claim 1, wherein the loss function of the multi-scale attention feature detection model comprises two parts, a localization loss and a classification loss, and the calculation formula is:

L = (1/N) · (L_conf + α·L_loc)

wherein L is the loss function; L_loc is the localization loss; L_conf is the classification loss; N represents the number of matched prediction boxes, and if N is 0 the loss is set to 0; α represents the weight between the localization and classification losses and is set to 1.
5. A multi-scale attention feature detection method as claimed in claim 4, wherein the localization loss is calculated by the formula:

L_loc = Σ_{i∈Pos}^{N} Σ_{m∈{x,y,w,h}} x_ij^k · smooth_L1(p_i^m − t̂_j^m)

wherein x_ij^k represents the degree of matching between the i-th prediction box and the j-th real box for the k-th class; p_i^m represents the i-th positive prediction distance, directly replaced by the bounding box; t̂_j^m represents the distance between the default box and the correct box; L_loc represents the localization loss; b denotes a bounding box, i.e. a frame; p denotes a prediction box, i.e. a predicted candidate box; t denotes the ground truth, i.e. the real box; Pos denotes the positive samples; x and y denote the abscissa and ordinate of the center point, and w and h denote the width and height of the box, respectively.
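Claims 4 and 5 together describe an SSD-style objective: a smooth-L1 penalty summed over matched boxes and the four box offsets, combined with the classification loss and averaged over the N matches. A hedged NumPy sketch (the offset encoding and the toy inputs are illustrative assumptions; the patent defines only the formulas):

```python
import numpy as np

def smooth_l1(x):
    """Smooth-L1: 0.5*x^2 for |x| < 1, |x| - 0.5 otherwise."""
    ax = np.abs(x)
    return np.where(ax < 1, 0.5 * x * x, ax - 0.5)

def localization_loss(pred_offsets, gt_offsets, match):
    """Sum of smooth-L1 over matched (positive) boxes and the
    four offsets (x, y, w, h)."""
    diff = smooth_l1(pred_offsets - gt_offsets)    # (num_boxes, 4)
    return float((match[:, None] * diff).sum())

def total_loss(loc_loss, conf_loss, n_matched, alpha=1.0):
    """L = (L_conf + alpha * L_loc) / N; set to 0 if no boxes matched."""
    if n_matched == 0:
        return 0.0
    return (conf_loss + alpha * loc_loss) / n_matched

pred  = np.array([[0.1, 0.2, 0.0, 0.5], [2.0, 0.0, 0.0, 0.0]])
gt    = np.zeros((2, 4))
match = np.array([1.0, 1.0])               # both prediction boxes matched
l = localization_loss(pred, gt, match)     # 0.005 + 0.02 + 0.125 + 1.5 = 1.65
```

Note how the 2.0 residual falls on the linear part of smooth-L1 (contributing 1.5 rather than 2.0), which is what keeps large localization errors from dominating the gradient.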
6. The multi-scale attention feature detection method of claim 1, wherein the number of said attention branches is 5.
7. The multi-scale attention feature detection method of claim 1, wherein the base network is a VGG-16 model, namely a pre-trained ILSVRC classification model with its two fully connected layers removed; the VGG-16 comprises 5 convolutional layers.
8. The multi-scale attention feature detection method of claim 7, wherein said constructing a single-shot target detector comprises:
taking the multi-scale convolutional layers of the newly added convolutional layers as the input of the prediction layer, and independently calculating the classification and localization results with two convolution kernels of the same size.
9. The multi-scale attention feature detection method of claim 8, further comprising, after the two convolution kernels of the same size are used to independently compute the classification and localization results:
eliminating highly repetitive predictions by non-maximum suppression.
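The patent names non-maximum suppression but does not spell it out; the standard greedy form (box format and the 0.5 overlap threshold are assumptions for illustration) is:

```python
import numpy as np

def iou(a, b):
    """Intersection-over-union of two (x1, y1, x2, y2) boxes."""
    x1, y1 = max(a[0], b[0]), max(a[1], b[1])
    x2, y2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, x2 - x1) * max(0.0, y2 - y1)
    area = lambda r: (r[2] - r[0]) * (r[3] - r[1])
    union = area(a) + area(b) - inter
    return inter / union if union > 0 else 0.0

def nms(boxes, scores, thresh=0.5):
    """Greedy NMS: keep the highest-scoring box, drop boxes that
    overlap it by more than `thresh`, repeat on the remainder."""
    order = np.argsort(scores)[::-1].tolist()   # indices, best score first
    keep = []
    while order:
        best = order.pop(0)
        keep.append(best)
        order = [i for i in order if iou(boxes[best], boxes[i]) <= thresh]
    return keep

boxes  = np.array([[0, 0, 10, 10], [1, 1, 10, 10], [20, 20, 30, 30]], float)
scores = np.array([0.9, 0.8, 0.7])
kept = nms(boxes, scores, 0.5)             # → [0, 2]
```

The second box (IoU 0.81 with the first) is suppressed as a highly repetitive prediction, while the disjoint third box survives.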
CN201911189274.9A 2019-11-28 2019-11-28 Multi-scale attention feature detection method Active CN111079739B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201911189274.9A CN111079739B (en) 2019-11-28 2019-11-28 Multi-scale attention feature detection method


Publications (2)

Publication Number Publication Date
CN111079739A CN111079739A (en) 2020-04-28
CN111079739B true CN111079739B (en) 2023-04-18

Family

ID=70312155

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201911189274.9A Active CN111079739B (en) 2019-11-28 2019-11-28 Multi-scale attention feature detection method

Country Status (1)

Country Link
CN (1) CN111079739B (en)

Families Citing this family (8)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN111626176B (en) * 2020-05-22 2021-08-06 中国科学院空天信息创新研究院 Remote sensing target rapid detection method and system based on dynamic attention mechanism
CN111797712B (en) * 2020-06-16 2023-09-15 南京信息工程大学 Remote sensing image cloud and cloud shadow detection method based on multi-scale feature fusion network
CN111860398B (en) * 2020-07-28 2022-05-10 河北师范大学 Remote sensing image target detection method and system and terminal equipment
CN111914758A (en) * 2020-08-04 2020-11-10 成都奥快科技有限公司 Face in-vivo detection method and device based on convolutional neural network
CN111985552B (en) * 2020-08-17 2022-07-29 中国民航大学 Method for detecting diseases of thin strip-shaped structure of airport pavement under complex background
CN112200045B (en) * 2020-09-30 2024-03-19 华中科技大学 Remote sensing image target detection model establishment method based on context enhancement and application
CN112949635B (en) * 2021-03-12 2022-09-16 北京理工大学 Target detection method based on feature enhancement and IoU perception
CN113128476A (en) * 2021-05-17 2021-07-16 广西师范大学 Low-power consumption real-time helmet detection method based on computer vision target detection

Citations (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN108694471A (en) * 2018-06-11 2018-10-23 深圳市唯特视科技有限公司 A kind of user preference prediction technique based on personalized attention network
CN109886359A (en) * 2019-03-25 2019-06-14 西安电子科技大学 Small target detecting method and detection model based on convolutional neural networks
CN110245655A (en) * 2019-05-10 2019-09-17 天津大学 A kind of single phase object detecting method based on lightweight image pyramid network
CN110263819A (en) * 2019-05-28 2019-09-20 中国农业大学 A kind of object detection method and device for shellfish image

Family Cites Families (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US10262237B2 (en) * 2016-12-08 2019-04-16 Intel Corporation Technologies for improved object detection accuracy with multi-scale representation and training
US11373018B2 (en) * 2018-01-25 2022-06-28 Kioxia Corporation Method of displaying model and designing pattern, and storage medium


Non-Patent Citations (2)

* Cited by examiner, † Cited by third party
Title
Wei Liu. SSD: Single Shot MultiBox Detector. arXiv, 2016, 1-17. *
Yu Chunyan. An improved SSD model for salient object detection. Journal of Electronics & Information Technology, 2018, 40(40), 2554-2561. *


Similar Documents

Publication Publication Date Title
CN111079739B (en) Multi-scale attention feature detection method
CN110348376B (en) Pedestrian real-time detection method based on neural network
CN112308019B (en) SAR ship target detection method based on network pruning and knowledge distillation
CN114202672A (en) Small target detection method based on attention mechanism
CN110458165B (en) Natural scene text detection method introducing attention mechanism
CN111179217A (en) Attention mechanism-based remote sensing image multi-scale target detection method
CN113569667B (en) Inland ship target identification method and system based on lightweight neural network model
CN111753677B (en) Multi-angle remote sensing ship image target detection method based on characteristic pyramid structure
Chen et al. Research on recognition of fly species based on improved RetinaNet and CBAM
CN111612008A (en) Image segmentation method based on convolution network
CN111898432B (en) Pedestrian detection system and method based on improved YOLOv3 algorithm
CN110647802A (en) Remote sensing image ship target detection method based on deep learning
CN113591968A (en) Infrared weak and small target detection method based on asymmetric attention feature fusion
CN113052834B (en) Pipeline defect detection method based on convolution neural network multi-scale features
CN112150493A (en) Semantic guidance-based screen area detection method in natural scene
CN111797841B (en) Visual saliency detection method based on depth residual error network
CN110991444A (en) Complex scene-oriented license plate recognition method and device
CN112288026B (en) Infrared weak and small target detection method based on class activation diagram
CN110852327A (en) Image processing method, image processing device, electronic equipment and storage medium
CN114005094A (en) Aerial photography vehicle target detection method, system and storage medium
CN115861756A (en) Earth background small target identification method based on cascade combination network
CN114565824A (en) Single-stage rotating ship detection method based on full convolution network
Dai et al. GCD-YOLOv5: An armored target recognition algorithm in complex environments based on array lidar
CN117079095A (en) Deep learning-based high-altitude parabolic detection method, system, medium and equipment
CN115035429A (en) Aerial photography target detection method based on composite backbone network and multiple measuring heads

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant