CN115331081A - Image target detection method and device - Google Patents

Image target detection method and device Download PDF

Info

Publication number
CN115331081A
CN115331081A (Application No. CN202211053451.2A)
Authority
CN
China
Prior art keywords
resolution
feature
detection method
matrix
cross
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202211053451.2A
Other languages
Chinese (zh)
Inventor
方杰民 (Fang Jiemin)
王兴刚 (Wang Xinggang)
刘文予 (Liu Wenyu)
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Huazhong University of Science and Technology
Original Assignee
Huazhong University of Science and Technology
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Huazhong University of Science and Technology filed Critical Huazhong University of Science and Technology
Priority to CN202211053451.2A priority Critical patent/CN115331081A/en
Publication of CN115331081A publication Critical patent/CN115331081A/en
Pending legal-status Critical Current

Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06VIMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V10/00Arrangements for image or video recognition or understanding
    • G06V10/70Arrangements for image or video recognition or understanding using pattern recognition or machine learning
    • G06V10/77Processing image or video features in feature spaces; using data integration or data reduction, e.g. principal component analysis [PCA] or independent component analysis [ICA] or self-organising maps [SOM]; Blind source separation
    • G06V10/80Fusion, i.e. combining data from various sources at the sensor level, preprocessing level, feature extraction level or classification level
    • G06V10/806Fusion, i.e. combining data from various sources at the sensor level, preprocessing level, feature extraction level or classification level of extracted features
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06VIMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V10/00Arrangements for image or video recognition or understanding
    • G06V10/70Arrangements for image or video recognition or understanding using pattern recognition or machine learning
    • G06V10/766Arrangements for image or video recognition or understanding using pattern recognition or machine learning using regression, e.g. by projecting features on hyperplanes
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06VIMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V2201/00Indexing scheme relating to image or video recognition or understanding
    • G06V2201/07Target detection

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Evolutionary Computation (AREA)
  • Computing Systems (AREA)
  • Databases & Information Systems (AREA)
  • Artificial Intelligence (AREA)
  • General Health & Medical Sciences (AREA)
  • Medical Informatics (AREA)
  • Software Systems (AREA)
  • Health & Medical Sciences (AREA)
  • Physics & Mathematics (AREA)
  • General Physics & Mathematics (AREA)
  • Multimedia (AREA)
  • Image Analysis (AREA)

Abstract

The invention discloses an image target detection method, comprising the following steps: extracting a multi-resolution feature map from the image using a Transformer network as the backbone network; inputting the multi-resolution feature map into a cross-scale attention feature pyramid network; in the feature pyramid network, starting from the input small-resolution features, using a cross-scale attention module to progressively fuse and recombine features toward the large resolution, so that features are accumulated and fused from small resolution to large resolution; feeding the fused features to subsequent processing and prediction modules for detection-box regression and category prediction, and training the cross-scale attention feature pyramid network on the target dataset until convergence; and performing detection-box regression and category prediction on the picture to be detected using the trained cross-scale attention feature pyramid network. This improves the performance of the final target detection. The invention also provides a corresponding image target detection device.

Description

Image target detection method and device
Technical Field
The invention belongs to the technical field of deep learning and computer vision, and particularly relates to an image target detection method and device.
Background
Target detection is one of the most basic and important tasks in computer vision. It typically uses deep neural networks to extract features from and model visual data, and predicts the location and class of target objects. Current deep learning methods treat target detection as a classification problem, a regression problem, or a combination of both.
Visual images are highly complex and diverse, and object detection generally needs to capture object information at various scales. To better model target features at each scale, the Feature Pyramid Network (FPN) is widely used in target detection frameworks. The FPN takes the features extracted by the backbone network at each resolution as input and fuses them. Low-resolution features have stronger semantics but lack detail, while high-resolution features are rich in detail but semantically weaker. The FPN interpolates the low-resolution features and superimposes them on the high-resolution features for fusion. This enriches semantic information at multiple levels and makes the visual representation more sensitive to multi-scale objects.
The Transformer network was first proposed for, and applied with great success to, various scenarios and tasks in Natural Language Processing (NLP). In recent years, Transformers have come to be widely used in visual tasks such as image classification, semantic segmentation, and object detection, achieving very strong performance and to some extent surpassing the Convolutional Neural Networks (CNNs) conventionally used for these tasks. The self-attention mechanism is the core component of a Transformer network; it automatically establishes relationships between features by measuring the responses between them and recombining the features according to the response values. Existing detection methods based on the traditional FPN generally fuse cross-scale features by directly interpolating and adding them, so their ability to model objects of complex scales remains limited.
Disclosure of Invention
In view of the above defects or improvement needs in the prior art, the present invention provides an image target detection method and apparatus which, by introducing an FPN with a cross-scale attention mechanism, model object features that are more scale-robust and more expressive, and improve the performance of the final target detection.
To achieve the above object, according to an aspect of the present invention, there is provided an image object detection method, including the steps of:
Step one: extracting a multi-resolution feature map from the image by using a Transformer network as the backbone network;
Step two: inputting the multi-resolution feature map of step one into a cross-scale attention feature pyramid network;
Step three: in the feature pyramid network, starting from the small-resolution features input in step two, using a cross-scale attention module to progressively fuse and recombine features toward the large resolution, accumulating and fusing the features from small resolution to large resolution;
Step four: feeding the features fused in step three to subsequent processing and prediction modules for detection-box regression and category prediction, and training the cross-scale attention feature pyramid network on the target dataset until convergence;
Step five: performing detection-box regression and category prediction on the picture to be detected using the trained cross-scale attention feature pyramid network.
In one embodiment of the present invention, the cross-scale attention module in step three is implemented by the following steps:
(3.1) the feature of the nth level, F_n ∈ R^(H_n×W_n×C), and the feature of the (n+1)th level, F_{n+1} ∈ R^(H_{n+1}×W_{n+1}×C), are first converted into 1-dimensional token sequences, i.e. F_n ∈ R^((H_n·W_n)×C) and F_{n+1} ∈ R^((H_{n+1}·W_{n+1})×C), where H_n, W_n are the spatial sizes of feature F_n in the height and width dimensions, C is the size of the channel dimension of the feature, and H_{n+1}, W_{n+1} are the spatial sizes of feature F_{n+1} in the height and width dimensions;
(3.2) mapping the two token sequences obtained in step (3.1) to the three spaces Query, Key and Value to obtain the feature matrices Q, K and V of the three spaces;
(3.3) performing the attention mechanism operation on the three matrices Q, K and V obtained in step (3.2).
In one embodiment of the invention, the Query matrix is obtained by applying a linear mapping to the feature F_n of the nth level:
Q = F_n × W,
where W is the matrix parameter of the linear mapping and Q ∈ R^((H_n·W_n)×C) is the mapped Query matrix.
In one embodiment of the invention, the Key matrix and the Value matrix are both obtained by directly concatenating the two sets of features F_n and F_{n+1}, i.e.
K = V = [F_n, F_{n+1}],
where [·] denotes the concatenation operation, and K, V ∈ R^((H_n·W_n + H_{n+1}·W_{n+1})×C) are the resulting Key and Value matrices.
In one embodiment of the present invention, the step (3.3) multiplies the Query matrix and the Key matrix to obtain an attention response map, and the response map is further applied to the Value matrix to obtain a new token sequence.
In one embodiment of the present invention, in step (3.3), the token sequence F_n of the nth level is finally added back to the new token sequence in the form of a residual connection; the whole process is denoted F_attn = softmax(QK^T)V + F_n, where F_attn is the resulting output feature matrix, K^T is the transpose of the Key matrix K, and softmax is the normalized exponential function.
In one embodiment of the invention, the cross-scale attention module operates within local individual feature windows.
In one embodiment of the present invention, the window size is set according to specific requirements.
In one embodiment of the present invention, in the first step, the Transformer network is a Swin-Transformer.
According to another aspect of the present invention, there is also provided an image object detecting apparatus, comprising at least one processor and a memory, the at least one processor and the memory being connected via a data bus, the memory storing instructions executable by the at least one processor, the instructions being configured to perform the image object detecting method described above after being executed by the processor.
Generally, compared with the prior art, the technical scheme of the invention has the following beneficial effects:
aiming at the problem that the modeling capacity of the existing detection method for the complex-scale object is limited, the invention provides a novel target detection method based on a cross-scale self-attention feature pyramid, which can remarkably improve the modeling capacity of the multi-scale object with lower calculation cost and improve the performance of final target detection.
Drawings
FIG. 1 is a schematic flow chart of an image target detection method according to an embodiment of the present invention;
FIG. 2 is a schematic structural diagram of a feature pyramid network based on a cross-scale attention mechanism in an embodiment of the present invention.
Detailed Description
In order to make the objects, technical solutions and advantages of the present invention more apparent, the present invention is described in further detail below with reference to the accompanying drawings and embodiments. It should be understood that the specific embodiments described herein are merely illustrative of the invention and are not intended to limit the invention. In addition, the technical features involved in the embodiments of the present invention described below may be combined with each other as long as they do not conflict with each other.
The invention relates to the fields of deep learning and computer vision, and in particular to a general-purpose target detection algorithm based on cross-scale feature relationship modeling, the attention mechanism, and the Feature Pyramid Network (FPN).
The purpose of the invention is achieved by the following technical solution: cross-resolution features are modeled and recombined through a self-attention mechanism. Fig. 1 is a block diagram of the target detection method based on a Transformer backbone network and a cross-scale attention feature pyramid network in an embodiment of the invention. As shown in fig. 1, the invention provides an image target detection method comprising the following steps:
the method comprises the following steps: a transform network (such as Swin-Transformer) is used as a backbone network to extract a multi-resolution feature map from an image.
Step two: the multi-resolution feature map of step one is input into a cross-scale attention feature pyramid network.
Step three: in the feature pyramid network, starting from the small-resolution features input in step two, the proposed cross-scale attention module is used to progressively fuse and recombine features toward the large resolution. Features are fused cumulatively from small resolution to large resolution.
Step four: the features fused in step three are further fed to subsequent processing and prediction modules for detection-box regression and category prediction. The network module is trained on the target dataset until convergence.
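Steps two and three above amount to a coarse-to-fine fusion loop over the pyramid levels. The following is a minimal illustrative sketch, not the patented implementation: the function name `fuse_pyramid`, the list ordering, and the `scale_attention` callable are assumptions introduced here for illustration only.

```python
def fuse_pyramid(features, scale_attention):
    """Accumulate features from small resolution to large resolution.

    features: pyramid levels ordered from the smallest resolution
    (strongest semantics) to the largest.
    scale_attention: a two-argument callable standing in for the
    cross-scale attention module described in the text.
    """
    fused = features[0]          # start from the small-resolution end
    outputs = [fused]
    for f_larger in features[1:]:
        # Each step recombines the next (larger) level with everything
        # fused so far, so coarse semantics accumulate into fine levels.
        fused = scale_attention(f_larger, fused)
        outputs.append(fused)
    return outputs

# Toy usage with strings in place of feature tensors (illustrative):
levels = ["P5", "P4", "P3"]  # smallest resolution first
print(fuse_pyramid(levels, lambda fine, coarse: fine + "+" + coarse))
# → ['P5', 'P4+P5', 'P3+P4+P5']
```

The toy run shows the accumulation pattern: each larger level is fused with everything already fused at smaller resolutions.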
Further, as shown in fig. 2, the cross-scale attention module (Scale-Attention Block) in step three is implemented by the following steps:
(3.1) The feature of the nth level, F_n ∈ R^(H_n×W_n×C), and the feature of the (n+1)th level, F_{n+1} ∈ R^(H_{n+1}×W_{n+1}×C), are first converted into 1-dimensional token sequences, i.e. F_n ∈ R^((H_n·W_n)×C) and F_{n+1} ∈ R^((H_{n+1}·W_{n+1})×C), where H_n, W_n are the spatial sizes of feature F_n in the height and width dimensions, C is the channel dimension size of the feature, and H_{n+1}, W_{n+1} are the spatial sizes of feature F_{n+1} in the height and width dimensions.
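The conversion in step (3.1) is a plain reshape of each feature map from (H, W, C) to (H·W, C). A minimal numpy sketch, with sizes chosen purely for illustration (the patent does not fix H_n, W_n, or C; the halved resolution of the (n+1)th level is a typical-pyramid assumption):

```python
import numpy as np

# Illustrative sizes (assumptions, not from the patent text).
H_n, W_n, C = 8, 8, 32
H_n1, W_n1 = 4, 4          # the (n+1)th level is typically half the resolution

F_n = np.random.rand(H_n, W_n, C)     # nth-level feature map
F_n1 = np.random.rand(H_n1, W_n1, C)  # (n+1)th-level feature map

# Flatten the two spatial dimensions into one token axis: (H*W, C).
tokens_n = F_n.reshape(H_n * W_n, C)
tokens_n1 = F_n1.reshape(H_n1 * W_n1, C)

print(tokens_n.shape)   # (64, 32)
print(tokens_n1.shape)  # (16, 32)
```

With numpy's row-major layout, token i corresponds to spatial position (i // W, i % W), so the first token is the feature vector at the top-left pixel.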
(3.2) The two token sequences obtained in (3.1) are mapped to the three spaces Query, Key and Value. Specifically, the feature matrices of the three spaces are obtained as follows: the Query matrix is obtained by a linear mapping (linear projection) of the nth-level feature F_n, i.e.
Q = F_n × W,
where W is the matrix parameter of the linear mapping and Q ∈ R^((H_n·W_n)×C) is the mapped Query matrix; the Key matrix and the Value matrix are both obtained by directly concatenating the two sets of features F_n and F_{n+1}, i.e.
K = V = [F_n, F_{n+1}],
where [·] denotes the concatenation operation, and K, V ∈ R^((H_n·W_n + H_{n+1}·W_{n+1})×C) are the resulting Key and Value matrices.
(3.3) The attention mechanism operation is performed on the three matrices Q, K and V obtained in (3.2). Specifically, the Query matrix is multiplied with the Key matrix to obtain an attention response map (attention map), and the response map is further applied to the Value matrix to obtain a new token sequence. In particular, the nth-level token sequence F_n is finally added back to the new token sequence in the form of a residual connection.
The whole process can be expressed as
F_attn = softmax(QK^T)V + F_n,
where F_attn is the resulting output feature matrix, K^T is the transpose of the Key matrix K, and softmax is the normalized exponential function, computed as
softmax(z_i) = e^(z_i) / Σ_{j=1..K} e^(z_j),
where K here denotes the number of elements of the input vector z (not the Key matrix).
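Steps (3.2)-(3.3) can be sketched in numpy as follows. This is an illustrative reconstruction of the formulas as written, not the patented implementation; note that, unlike standard scaled dot-product attention, the formula above includes no 1/√C scaling factor, and the sketch follows the formula as written. All shapes and the random parameter W are assumptions for demonstration.

```python
import numpy as np

def softmax(z, axis=-1):
    # Normalized exponential; subtracting the row max for numerical stability.
    z = z - z.max(axis=axis, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=axis, keepdims=True)

def scale_attention_block(F_n, F_n1, W):
    """Cross-scale attention as described in the text:
    Q from a linear map of the nth-level tokens only,
    K = V = concatenation of both levels, plus a residual connection.
    F_n: (N_n, C) tokens, F_n1: (N_{n+1}, C) tokens, W: (C, C)."""
    Q = F_n @ W                                   # Query: nth level only
    K = V = np.concatenate([F_n, F_n1], axis=0)   # cross-scale Key/Value
    attn = softmax(Q @ K.T)                       # attention response map
    return attn @ V + F_n                         # recombine + residual

# Toy shapes (illustrative, not from the patent).
C = 16
F_n = np.random.rand(64, C)    # e.g. an 8x8 level flattened to 64 tokens
F_n1 = np.random.rand(16, C)   # e.g. a 4x4 level flattened to 16 tokens
W = np.random.rand(C, C) * 0.1
out = scale_attention_block(F_n, F_n1, W)
print(out.shape)  # (64, 16): same shape as the nth-level token sequence
```

Because Q has one row per nth-level token while K and V cover both levels, each fine-grained token attends over tokens of both resolutions, and the residual keeps the output aligned with the nth level.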
In particular, to save computational cost, the above cross-scale attention module usually operates within local feature windows, and the window size can be set according to specific requirements.
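Restricting attention to local windows shrinks the QK^T response map from (H·W)² entries to num_windows·(win²)². A sketch of the window partitioning is below; it assumes, for illustration only, that the window size divides the spatial size evenly (the patent does not state how boundaries are handled).

```python
import numpy as np

def partition_windows(F, win):
    """Split an (H, W, C) feature map into non-overlapping win x win
    windows, returning (num_windows, win*win, C) token sequences.
    Assumes H and W are divisible by win (a simplifying assumption)."""
    H, W, C = F.shape
    F = F.reshape(H // win, win, W // win, win, C)
    # Bring the two window-grid axes together, then flatten each window.
    F = F.transpose(0, 2, 1, 3, 4).reshape(-1, win * win, C)
    return F

F = np.random.rand(8, 8, 4)
wins = partition_windows(F, 4)
print(wins.shape)  # (4, 16, 4): four 4x4 windows of 16 tokens each
```

Attention would then be computed independently inside each of the `num_windows` groups, e.g. by applying the scale-attention operation per window.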
Furthermore, the invention also provides an image object detection device, which comprises at least one processor and a memory, wherein the at least one processor and the memory are connected through a data bus, and the memory stores instructions capable of being executed by the at least one processor, and the instructions are used for completing the image object detection method after being executed by the processor.
The method was tested on environment-image data from a relevant power-line patrol channel; the results are shown in Table 1 below:
TABLE 1 Comparison of test results
[Table 1 is reproduced as images in the original publication; its numerical contents are not recoverable from the text.]
It will be understood by those skilled in the art that the foregoing is only an exemplary embodiment of the present invention, and is not intended to limit the invention to the particular forms disclosed, since various modifications, substitutions and improvements within the spirit and scope of the invention are possible and within the scope of the appended claims.

Claims (10)

1. An image target detection method is characterized by comprising the following steps:
Step one: extracting a multi-resolution feature map from the image by using a Transformer network as the backbone network;
Step two: inputting the multi-resolution feature map of step one into a cross-scale attention feature pyramid network;
Step three: in the feature pyramid network, starting from the small-resolution features input in step two, using a cross-scale attention module to progressively fuse and recombine features toward the large resolution, accumulating and fusing the features from small resolution to large resolution;
Step four: feeding the features fused in step three to subsequent processing and prediction modules for detection-box regression and category prediction, and training the cross-scale attention feature pyramid network on the target dataset until convergence;
Step five: performing detection-box regression and category prediction on the picture to be detected using the trained cross-scale attention feature pyramid network.
2. The image object detection method of claim 1, wherein the cross-scale attention module in step three is implemented by:
(3.1) the feature of the nth level, F_n ∈ R^(H_n×W_n×C), and the feature of the (n+1)th level, F_{n+1} ∈ R^(H_{n+1}×W_{n+1}×C), are first converted into 1-dimensional token sequences, i.e. F_n ∈ R^((H_n·W_n)×C) and F_{n+1} ∈ R^((H_{n+1}·W_{n+1})×C), where H_n, W_n are the spatial sizes of feature F_n in the height and width dimensions, C is the channel dimension size of the feature, and H_{n+1}, W_{n+1} are the spatial sizes of feature F_{n+1} in the height and width dimensions;
(3.2) mapping the two token sequences obtained in (3.1) to the three spaces Query, Key and Value to obtain the feature matrices Q, K and V of the three spaces;
(3.3) performing the attention mechanism operation on the three matrices Q, K and V obtained in step (3.2).
3. The image object detection method of claim 2, wherein the Query matrix is obtained by applying a linear mapping to the feature F_n of the nth level:
Q = F_n × W,
where W is the matrix parameter of the linear mapping and Q ∈ R^((H_n·W_n)×C) is the mapped Query matrix.
4. The image object detection method of claim 2, wherein the Key matrix and the Value matrix are both obtained by directly concatenating the two sets of features F_n and F_{n+1}, i.e.
K = V = [F_n, F_{n+1}],
where [·] denotes the concatenation operation, and K, V ∈ R^((H_n·W_n + H_{n+1}·W_{n+1})×C) are the resulting Key and Value matrices.
5. The image object detection method according to claim 2, characterized in that said step (3.3) multiplies the Query matrix and the Key matrix to obtain an attention response map, and the response map is further applied to the Value matrix to obtain a new token sequence.
6. The image object detection method according to claim 1 or 2, wherein in step (3.3), the token sequence F_n of the nth level is finally added back to the new token sequence in the form of a residual connection; the whole process is denoted F_attn = softmax(QK^T)V + F_n, where F_attn is the resulting output feature matrix, K^T is the transpose of the Key matrix K, and softmax is the normalized exponential function.
7. An image object detection method as claimed in claim 1 or 2, wherein the cross-scale attention module operates within local respective feature windows.
8. The image object detection method of claim 7, wherein the window size is tailored to specific requirements.
9. The image object detection method according to claim 1 or 2, wherein in the first step, the Transformer network is Swin-Transformer.
10. An image object detecting apparatus characterized by:
comprising at least one processor and a memory, said at least one processor and memory being connected via a data bus, said memory storing instructions executable by said at least one processor, said instructions being adapted to perform the image object detection method of any of claims 1-9 after being executed by said processor.
CN202211053451.2A 2022-08-31 2022-08-31 Image target detection method and device Pending CN115331081A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202211053451.2A CN115331081A (en) 2022-08-31 2022-08-31 Image target detection method and device

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202211053451.2A CN115331081A (en) 2022-08-31 2022-08-31 Image target detection method and device

Publications (1)

Publication Number Publication Date
CN115331081A true CN115331081A (en) 2022-11-11

Family

ID=83928983

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202211053451.2A Pending CN115331081A (en) 2022-08-31 2022-08-31 Image target detection method and device

Country Status (1)

Country Link
CN (1) CN115331081A (en)

Cited By (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN116740414A (en) * 2023-05-15 2023-09-12 中国科学院自动化研究所 Image recognition method, device, electronic equipment and storage medium
CN116740414B (en) * 2023-05-15 2024-03-01 中国科学院自动化研究所 Image recognition method, device, electronic equipment and storage medium

Similar Documents

Publication Publication Date Title
CN106570464B (en) Face recognition method and device for rapidly processing face shielding
CN111950453A (en) Optional-shape text recognition method based on selective attention mechanism
CN112381097A (en) Scene semantic segmentation method based on deep learning
CN110838095B (en) Single image rain removing method and system based on cyclic dense neural network
CN113239825B (en) High-precision tobacco beetle detection method in complex scene
CN113628059A (en) Associated user identification method and device based on multilayer graph attention network
CN116895030B (en) Insulator detection method based on target detection algorithm and attention mechanism
CN114360030A (en) Face recognition method based on convolutional neural network
CN115471216B (en) Data management method of intelligent laboratory management platform
CN115331081A (en) Image target detection method and device
CN116523875A (en) Insulator defect detection method based on FPGA pretreatment and improved YOLOv5
CN111144407A (en) Target detection method, system, device and readable storage medium
CN111079930A (en) Method and device for determining quality parameters of data set and electronic equipment
CN117576402A (en) Deep learning-based multi-scale aggregation transducer remote sensing image semantic segmentation method
CN117173595A (en) Unmanned aerial vehicle aerial image target detection method based on improved YOLOv7
CN114529450B (en) Face image super-resolution method based on improved depth iteration cooperative network
CN114627370A (en) Hyperspectral image classification method based on TRANSFORMER feature fusion
CN111353976B (en) Sand grain target detection method based on convolutional neural network
CN114581789A (en) Hyperspectral image classification method and system
CN114049519A (en) Optical remote sensing image scene classification method
CN111340137A (en) Image recognition method, device and storage medium
CN111899161A (en) Super-resolution reconstruction method
CN115471893B (en) Face recognition model training, face recognition method and device
CN116578903A (en) Space-time detection method for power grid false data injection attack
CN115512109A (en) Image semantic segmentation method based on relational context aggregation

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination