CN115331081A - Image target detection method and device - Google Patents

Image target detection method and device Download PDF

Info

Publication number
CN115331081A
CN115331081A (Application No. CN202211053451.2A)
Authority
CN
China
Prior art keywords
resolution
feature
detection method
matrix
cross
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202211053451.2A
Other languages
Chinese (zh)
Inventor
方杰民 (Fang Jiemin)
王兴刚 (Wang Xinggang)
刘文予 (Liu Wenyu)
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Huazhong University of Science and Technology
Original Assignee
Huazhong University of Science and Technology
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Huazhong University of Science and Technology filed Critical Huazhong University of Science and Technology
Priority to CN202211053451.2A priority Critical patent/CN115331081A/en
Publication of CN115331081A publication Critical patent/CN115331081A/en
Pending legal-status Critical Current

Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06VIMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V10/00Arrangements for image or video recognition or understanding
    • G06V10/70Arrangements for image or video recognition or understanding using pattern recognition or machine learning
    • G06V10/77Processing image or video features in feature spaces; using data integration or data reduction, e.g. principal component analysis [PCA] or independent component analysis [ICA] or self-organising maps [SOM]; Blind source separation
    • G06V10/80Fusion, i.e. combining data from various sources at the sensor level, preprocessing level, feature extraction level or classification level
    • G06V10/806Fusion, i.e. combining data from various sources at the sensor level, preprocessing level, feature extraction level or classification level of extracted features
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06VIMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V10/00Arrangements for image or video recognition or understanding
    • G06V10/70Arrangements for image or video recognition or understanding using pattern recognition or machine learning
    • G06V10/766Arrangements for image or video recognition or understanding using pattern recognition or machine learning using regression, e.g. by projecting features on hyperplanes
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06VIMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V2201/00Indexing scheme relating to image or video recognition or understanding
    • G06V2201/07Target detection

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Evolutionary Computation (AREA)
  • Computing Systems (AREA)
  • Databases & Information Systems (AREA)
  • Artificial Intelligence (AREA)
  • General Health & Medical Sciences (AREA)
  • Medical Informatics (AREA)
  • Software Systems (AREA)
  • Health & Medical Sciences (AREA)
  • Physics & Mathematics (AREA)
  • General Physics & Mathematics (AREA)
  • Multimedia (AREA)
  • Image Analysis (AREA)

Abstract

The invention discloses an image target detection method, comprising the following steps: extracting a multi-resolution feature map from the image using a Transformer network as the backbone network; inputting the multi-resolution feature map into a cross-scale attention feature pyramid network; in the feature pyramid network, starting from the input small-resolution features, using a cross-scale attention module to progressively fuse and recombine features toward the large resolution, so that features are accumulated and fused from small resolution to large resolution; feeding the fused features to subsequent processing and prediction modules for detection-box regression and category prediction, and training the cross-scale attention feature pyramid network on the target dataset until convergence; and performing detection-box regression and category prediction on the picture to be detected using the trained cross-scale attention feature pyramid network. This improves the performance of the final target detection. The invention also provides a corresponding image target detection device.

Description

Image target detection method and device
Technical Field
The invention belongs to the technical field of deep learning and computer vision, and particularly relates to an image target detection method and device.
Background
Target detection is one of the most basic and important tasks in computer vision. It typically uses deep neural networks to extract features from and model visual data, and predicts the location and class of target objects. Current deep learning methods treat target detection as a classification problem, a regression problem, or a combination of both.
Visual images are highly complex and diverse, and object detection generally needs to capture object information at various scales. To better model target features at each scale, the Feature Pyramid Network (FPN) is widely used in target detection frameworks. The FPN takes the features extracted by the backbone network at each resolution as input and fuses them. Low-resolution features have stronger semantics but lack detail, while high-resolution features are rich in detail but semantically weaker. The FPN interpolates the low-resolution features and superimposes them on the high-resolution features for fusion. This enriches semantic information at multiple levels and makes the visual representation more sensitive to multi-scale objects.
The Transformer network was first proposed for, and applied with great success to, various scenarios and tasks in Natural Language Processing (NLP). In recent years, Transformers have come to be widely used in visual tasks such as image classification, semantic segmentation, and object detection, achieving very strong performance and to some extent surpassing the Convolutional Neural Networks (CNNs) conventionally used for these tasks. The self-attention mechanism is the core component of a Transformer network; it automatically establishes relationships between features by measuring the responses between them and recombining the features according to the response values. Existing detection methods based on the traditional FPN generally fuse cross-scale features by directly interpolating and adding them, so their ability to model objects of complex scales remains limited.
Disclosure of Invention
In view of the above defects or improvement needs in the prior art, the present invention provides an image target detection method and apparatus which, by introducing an FPN with a cross-scale attention mechanism, model object features that are more scale-robust and more expressive, and improve the performance of the final target detection.
To achieve the above object, according to an aspect of the present invention, there is provided an image object detection method, including the steps of:
Step one: extracting a multi-resolution feature map from the image by using a Transformer network as the backbone network;
Step two: inputting the multi-resolution feature map of step one into a cross-scale attention feature pyramid network;
Step three: in the feature pyramid network, starting from the small-resolution features input in step two, using a cross-scale attention module to progressively fuse and recombine features toward the large resolution, accumulating and fusing the features from small resolution to large resolution;
Step four: feeding the features fused in step three to subsequent processing and prediction modules for detection-box regression and category prediction, and training the cross-scale attention feature pyramid network on the target dataset until convergence;
Step five: performing detection-box regression and category prediction on the picture to be detected using the trained cross-scale attention feature pyramid network.
In one embodiment of the present invention, the cross-scale attention module in step three is implemented by the following steps:
(3.1) the feature of the nth level, F_n ∈ R^(H_n×W_n×C), and the feature of the (n+1)th level, F_{n+1} ∈ R^(H_{n+1}×W_{n+1}×C), are first converted into 1-dimensional token sequences, i.e. F_n ∈ R^((H_n·W_n)×C) and F_{n+1} ∈ R^((H_{n+1}·W_{n+1})×C), where H_n, W_n are the spatial sizes of feature F_n in the height and width dimensions, C is the size of the channel dimension of the feature, and H_{n+1}, W_{n+1} are the spatial sizes of feature F_{n+1} in the height and width dimensions;
(3.2) mapping the two token sequences obtained in step (3.1) to the three spaces Query, Key and Value to obtain the feature matrices Q, K and V of the three spaces;
(3.3) performing the attention mechanism operation on the three matrices Q, K and V obtained in step (3.2).
In one embodiment of the invention, the Query matrix is obtained by applying a linear mapping to the feature F_n of the nth level:
Q = F_n × W,
where W is the matrix parameter of the linear mapping and Q ∈ R^((H_n·W_n)×C) is the mapped Query matrix.
In one embodiment of the invention, the Key matrix and the Value matrix are both obtained by directly concatenating the two sets of features F_n and F_{n+1}, i.e.
K = V = [F_n, F_{n+1}],
where [·] denotes the concatenation operation, and K, V ∈ R^((H_n·W_n + H_{n+1}·W_{n+1})×C) are the resulting Key and Value matrices.
In one embodiment of the present invention, the step (3.3) multiplies the Query matrix and the Key matrix to obtain an attention response map, and the response map is further applied to the Value matrix to obtain a new token sequence.
In one embodiment of the present invention, in step (3.3), the token sequence F_n of the nth level is finally added back to the new token sequence in the form of a residual connection; the whole process is denoted F_attn = softmax(QK^T)V + F_n, where F_attn is the resulting output feature matrix, K^T is the transpose of the Key matrix K, and softmax is the normalized exponential function.
In one embodiment of the invention, the cross-scale attention module operates within local individual feature windows.
In one embodiment of the present invention, the window size is set according to specific requirements.
In one embodiment of the present invention, in the first step, the Transformer network is a Swin-Transformer.
According to another aspect of the present invention, there is also provided an image object detecting apparatus, comprising at least one processor and a memory, the at least one processor and the memory being connected via a data bus, the memory storing instructions executable by the at least one processor, the instructions being configured to perform the image object detecting method described above after being executed by the processor.
Generally, compared with the prior art, the technical scheme of the invention has the following beneficial effects:
aiming at the problem that the modeling capacity of the existing detection method for the complex-scale object is limited, the invention provides a novel target detection method based on a cross-scale self-attention feature pyramid, which can remarkably improve the modeling capacity of the multi-scale object with lower calculation cost and improve the performance of final target detection.
Drawings
FIG. 1 is a schematic flow chart of an image target detection method according to an embodiment of the present invention;
FIG. 2 is a schematic structural diagram of a feature pyramid network based on a cross-scale attention mechanism in an embodiment of the present invention.
Detailed Description
In order to make the objects, technical solutions and advantages of the present invention more apparent, the present invention is described in further detail below with reference to the accompanying drawings and embodiments. It should be understood that the specific embodiments described herein are merely illustrative of the invention and are not intended to limit the invention. In addition, the technical features involved in the embodiments of the present invention described below may be combined with each other as long as they do not conflict with each other.
The invention relates to the fields of deep learning and computer vision, and in particular to a general-purpose target detection algorithm based on cross-scale feature relationship modeling, the attention mechanism, and the Feature Pyramid Network (FPN).
The purpose of the invention is achieved by the following technical solution: cross-resolution features are modeled and recombined through a self-attention mechanism. Fig. 1 is a block diagram of the target detection method based on a Transformer backbone network and a cross-scale attention feature pyramid network in an embodiment of the invention. As shown in fig. 1, the invention provides an image target detection method comprising the following steps:
the method comprises the following steps: a transform network (such as Swin-Transformer) is used as a backbone network to extract a multi-resolution feature map from an image.
Step two: the multi-resolution feature map of step one is input into a cross-scale attention feature pyramid network.
Step three: in the feature pyramid network, starting from the small-resolution features input in step two, the proposed cross-scale attention module is used to progressively fuse and recombine features toward the large resolution. Features are fused cumulatively from small resolution to large resolution.
Step four: the features fused in step three are further fed to subsequent processing and prediction modules for detection-box regression and category prediction. The network module is trained on the target dataset until convergence.
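Steps two and three above amount to a coarse-to-fine fusion loop over the pyramid levels. The following is a minimal illustrative sketch, not the patented implementation: the function name `fuse_pyramid`, the list ordering, and the `scale_attention` callable are assumptions introduced here for illustration only.

```python
def fuse_pyramid(features, scale_attention):
    """Accumulate features from small resolution to large resolution.

    features: pyramid levels ordered from the smallest resolution
    (strongest semantics) to the largest.
    scale_attention: a two-argument callable standing in for the
    cross-scale attention module described in the text.
    """
    fused = features[0]          # start from the small-resolution end
    outputs = [fused]
    for f_larger in features[1:]:
        # Each step recombines the next (larger) level with everything
        # fused so far, so coarse semantics accumulate into fine levels.
        fused = scale_attention(f_larger, fused)
        outputs.append(fused)
    return outputs

# Toy usage with strings in place of feature tensors (illustrative):
levels = ["P5", "P4", "P3"]  # smallest resolution first
print(fuse_pyramid(levels, lambda fine, coarse: fine + "+" + coarse))
# → ['P5', 'P4+P5', 'P3+P4+P5']
```

The toy run shows the accumulation pattern: each larger level is fused with everything already fused at smaller resolutions.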
Further, as shown in fig. 2, the cross-scale attention module (Scale-Attention Block) in step three is implemented by the following steps:
(3.1) The feature of the nth level, F_n ∈ R^(H_n×W_n×C), and the feature of the (n+1)th level, F_{n+1} ∈ R^(H_{n+1}×W_{n+1}×C), are first converted into 1-dimensional token sequences, i.e. F_n ∈ R^((H_n·W_n)×C) and F_{n+1} ∈ R^((H_{n+1}·W_{n+1})×C), where H_n, W_n are the spatial sizes of feature F_n in the height and width dimensions, C is the channel dimension size of the feature, and H_{n+1}, W_{n+1} are the spatial sizes of feature F_{n+1} in the height and width dimensions.
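The conversion in step (3.1) is a plain reshape of each feature map from (H, W, C) to (H·W, C). A minimal numpy sketch, with sizes chosen purely for illustration (the patent does not fix H_n, W_n, or C; the halved resolution of the (n+1)th level is a typical-pyramid assumption):

```python
import numpy as np

# Illustrative sizes (assumptions, not from the patent text).
H_n, W_n, C = 8, 8, 32
H_n1, W_n1 = 4, 4          # the (n+1)th level is typically half the resolution

F_n = np.random.rand(H_n, W_n, C)     # nth-level feature map
F_n1 = np.random.rand(H_n1, W_n1, C)  # (n+1)th-level feature map

# Flatten the two spatial dimensions into one token axis: (H*W, C).
tokens_n = F_n.reshape(H_n * W_n, C)
tokens_n1 = F_n1.reshape(H_n1 * W_n1, C)

print(tokens_n.shape)   # (64, 32)
print(tokens_n1.shape)  # (16, 32)
```

With numpy's row-major layout, token i corresponds to spatial position (i // W, i % W), so the first token is the feature vector at the top-left pixel.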
(3.2) The two token sequences obtained in (3.1) are mapped to the three spaces Query, Key and Value. Specifically, the feature matrices of the three spaces are obtained as follows: the Query matrix is obtained by a linear mapping (linear projection) of the nth-level feature F_n, i.e.
Q = F_n × W,
where W is the matrix parameter of the linear mapping and Q ∈ R^((H_n·W_n)×C) is the mapped Query matrix; the Key matrix and the Value matrix are both obtained by directly concatenating the two sets of features F_n and F_{n+1}, i.e.
K = V = [F_n, F_{n+1}],
where [·] denotes the concatenation operation, and K, V ∈ R^((H_n·W_n + H_{n+1}·W_{n+1})×C) are the resulting Key and Value matrices.
(3.3) The attention mechanism operation is performed on the three matrices Q, K and V obtained in (3.2). Specifically, the Query matrix is multiplied with the Key matrix to obtain an attention response map (attention map), and the response map is further applied to the Value matrix to obtain a new token sequence. In particular, the nth-level token sequence F_n is finally added back to the new token sequence in the form of a residual connection.
The whole process can be expressed as
F_attn = softmax(QK^T)V + F_n,
where F_attn is the resulting output feature matrix, K^T is the transpose of the Key matrix K, and softmax is the normalized exponential function, computed as
softmax(z_i) = e^(z_i) / Σ_{j=1..K} e^(z_j),
where K here denotes the number of elements of the input vector z (not the Key matrix).
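Steps (3.2)-(3.3) can be sketched in numpy as follows. This is an illustrative reconstruction of the formulas as written, not the patented implementation; note that, unlike standard scaled dot-product attention, the formula above includes no 1/√C scaling factor, and the sketch follows the formula as written. All shapes and the random parameter W are assumptions for demonstration.

```python
import numpy as np

def softmax(z, axis=-1):
    # Normalized exponential; subtracting the row max for numerical stability.
    z = z - z.max(axis=axis, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=axis, keepdims=True)

def scale_attention_block(F_n, F_n1, W):
    """Cross-scale attention as described in the text:
    Q from a linear map of the nth-level tokens only,
    K = V = concatenation of both levels, plus a residual connection.
    F_n: (N_n, C) tokens, F_n1: (N_{n+1}, C) tokens, W: (C, C)."""
    Q = F_n @ W                                   # Query: nth level only
    K = V = np.concatenate([F_n, F_n1], axis=0)   # cross-scale Key/Value
    attn = softmax(Q @ K.T)                       # attention response map
    return attn @ V + F_n                         # recombine + residual

# Toy shapes (illustrative, not from the patent).
C = 16
F_n = np.random.rand(64, C)    # e.g. an 8x8 level flattened to 64 tokens
F_n1 = np.random.rand(16, C)   # e.g. a 4x4 level flattened to 16 tokens
W = np.random.rand(C, C) * 0.1
out = scale_attention_block(F_n, F_n1, W)
print(out.shape)  # (64, 16): same shape as the nth-level token sequence
```

Because Q has one row per nth-level token while K and V cover both levels, each fine-grained token attends over tokens of both resolutions, and the residual keeps the output aligned with the nth level.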
In particular, to save computational cost, the above cross-scale attention module usually operates within local feature windows, and the window size can be set according to specific requirements.
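Restricting attention to local windows shrinks the QK^T response map from (H·W)² entries to num_windows·(win²)². A sketch of the window partitioning is below; it assumes, for illustration only, that the window size divides the spatial size evenly (the patent does not state how boundaries are handled).

```python
import numpy as np

def partition_windows(F, win):
    """Split an (H, W, C) feature map into non-overlapping win x win
    windows, returning (num_windows, win*win, C) token sequences.
    Assumes H and W are divisible by win (a simplifying assumption)."""
    H, W, C = F.shape
    F = F.reshape(H // win, win, W // win, win, C)
    # Bring the two window-grid axes together, then flatten each window.
    F = F.transpose(0, 2, 1, 3, 4).reshape(-1, win * win, C)
    return F

F = np.random.rand(8, 8, 4)
wins = partition_windows(F, 4)
print(wins.shape)  # (4, 16, 4): four 4x4 windows of 16 tokens each
```

Attention would then be computed independently inside each of the `num_windows` groups, e.g. by applying the scale-attention operation per window.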
Furthermore, the invention also provides an image object detection device, which comprises at least one processor and a memory, wherein the at least one processor and the memory are connected through a data bus, and the memory stores instructions capable of being executed by the at least one processor, and the instructions are used for completing the image object detection method after being executed by the processor.
The method was tested on environment-image data from a relevant power-line patrol channel; the results are shown in Table 1 below:
TABLE 1 Comparison of test results
[Table 1 is reproduced as images in the original publication; its numerical contents are not recoverable from the text.]
It will be understood by those skilled in the art that the foregoing is only an exemplary embodiment of the present invention, and is not intended to limit the invention to the particular forms disclosed, since various modifications, substitutions and improvements within the spirit and scope of the invention are possible and within the scope of the appended claims.

Claims (10)

1. An image target detection method is characterized by comprising the following steps:
Step one: extracting a multi-resolution feature map from the image by using a Transformer network as the backbone network;
Step two: inputting the multi-resolution feature map of step one into a cross-scale attention feature pyramid network;
Step three: in the feature pyramid network, starting from the small-resolution features input in step two, using a cross-scale attention module to progressively fuse and recombine features toward the large resolution, accumulating and fusing the features from small resolution to large resolution;
Step four: feeding the features fused in step three to subsequent processing and prediction modules for detection-box regression and category prediction, and training the cross-scale attention feature pyramid network on the target dataset until convergence;
Step five: performing detection-box regression and category prediction on the picture to be detected using the trained cross-scale attention feature pyramid network.
2. The image object detection method of claim 1, wherein the cross-scale attention module in step three is implemented by:
(3.1) the feature of the nth level, F_n ∈ R^(H_n×W_n×C), and the feature of the (n+1)th level, F_{n+1} ∈ R^(H_{n+1}×W_{n+1}×C), are first converted into 1-dimensional token sequences, i.e. F_n ∈ R^((H_n·W_n)×C) and F_{n+1} ∈ R^((H_{n+1}·W_{n+1})×C), where H_n, W_n are the spatial sizes of feature F_n in the height and width dimensions, C is the channel dimension size of the feature, and H_{n+1}, W_{n+1} are the spatial sizes of feature F_{n+1} in the height and width dimensions;
(3.2) mapping the two token sequences obtained in (3.1) to the three spaces Query, Key and Value to obtain the feature matrices Q, K and V of the three spaces;
(3.3) performing the attention mechanism operation on the three matrices Q, K and V obtained in step (3.2).
3. The image object detection method of claim 2, wherein the Query matrix is obtained by applying a linear mapping to the feature F_n of the nth level:
Q = F_n × W,
where W is the matrix parameter of the linear mapping and Q ∈ R^((H_n·W_n)×C) is the mapped Query matrix.
4. The image object detection method of claim 2, wherein the Key matrix and the Value matrix are both obtained by directly concatenating the two sets of features F_n and F_{n+1}, i.e.
K = V = [F_n, F_{n+1}],
where [·] denotes the concatenation operation, and K, V ∈ R^((H_n·W_n + H_{n+1}·W_{n+1})×C) are the resulting Key and Value matrices.
5. The image object detection method according to claim 2, characterized in that said step (3.3) multiplies the Query matrix and the Key matrix to obtain an attention response map, and the response map is further applied to the Value matrix to obtain a new token sequence.
6. The image object detection method according to claim 1 or 2, wherein in step (3.3), the token sequence F_n of the nth level is finally added back to the new token sequence in the form of a residual connection; the whole process is denoted F_attn = softmax(QK^T)V + F_n, where F_attn is the resulting output feature matrix, K^T is the transpose of the Key matrix K, and softmax is the normalized exponential function.
7. An image object detection method as claimed in claim 1 or 2, wherein the cross-scale attention module operates within local respective feature windows.
8. The image object detection method of claim 7, wherein the window size is tailored to specific requirements.
9. The image object detection method according to claim 1 or 2, wherein in the first step, the Transformer network is Swin-Transformer.
10. An image object detecting apparatus characterized by:
comprising at least one processor and a memory, said at least one processor and memory being connected via a data bus, said memory storing instructions executable by said at least one processor, said instructions being adapted to perform the image object detection method of any of claims 1-9 after being executed by said processor.
CN202211053451.2A 2022-08-31 2022-08-31 Image target detection method and device Pending CN115331081A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202211053451.2A CN115331081A (en) 2022-08-31 2022-08-31 Image target detection method and device

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202211053451.2A CN115331081A (en) 2022-08-31 2022-08-31 Image target detection method and device

Publications (1)

Publication Number Publication Date
CN115331081A true CN115331081A (en) 2022-11-11

Family

ID=83928983

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202211053451.2A Pending CN115331081A (en) 2022-08-31 2022-08-31 Image target detection method and device

Country Status (1)

Country Link
CN (1) CN115331081A (en)

Cited By (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN116740414A (en) * 2023-05-15 2023-09-12 中国科学院自动化研究所 Image recognition method, device, electronic equipment and storage medium
CN116740414B (en) * 2023-05-15 2024-03-01 中国科学院自动化研究所 Image recognition method, device, electronic equipment and storage medium

Similar Documents

Publication Publication Date Title
CN106570464B (en) Face recognition method and device for rapidly processing face shielding
CN111950453A (en) Optional-shape text recognition method based on selective attention mechanism
CN112381097A (en) Scene semantic segmentation method based on deep learning
CN110838095B (en) Single image rain removing method and system based on cyclic dense neural network
CN113239825B (en) High-precision tobacco beetle detection method in complex scene
CN113628059A (en) Associated user identification method and device based on multilayer graph attention network
CN116895030B (en) Insulator detection method based on target detection algorithm and attention mechanism
CN114360030A (en) Face recognition method based on convolutional neural network
CN115471216B (en) Data management method of intelligent laboratory management platform
CN115331081A (en) Image target detection method and device
CN116523875A (en) Insulator defect detection method based on FPGA pretreatment and improved YOLOv5
CN111144407A (en) Target detection method, system, device and readable storage medium
CN111079930A (en) Method and device for determining quality parameters of data set and electronic equipment
CN117576402A (en) Deep learning-based multi-scale aggregation transducer remote sensing image semantic segmentation method
CN117173595A (en) Unmanned aerial vehicle aerial image target detection method based on improved YOLOv7
CN114529450B (en) Face image super-resolution method based on improved depth iteration cooperative network
CN114627370A (en) Hyperspectral image classification method based on TRANSFORMER feature fusion
CN111353976B (en) Sand grain target detection method based on convolutional neural network
CN114581789A (en) Hyperspectral image classification method and system
CN114049519A (en) Optical remote sensing image scene classification method
CN111340137A (en) Image recognition method, device and storage medium
CN111899161A (en) Super-resolution reconstruction method
CN115471893B (en) Face recognition model training, face recognition method and device
CN116578903A (en) Space-time detection method for power grid false data injection attack
CN115512109A (en) Image semantic segmentation method based on relational context aggregation

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination