CN115331081A - Image target detection method and device - Google Patents
Image target detection method and device
- Publication number
- CN115331081A (application CN202211053451.2A)
- Authority
- CN
- China
- Prior art keywords
- resolution
- feature
- detection method
- matrix
- cross
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Pending
Classifications
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06V—IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
- G06V10/00—Arrangements for image or video recognition or understanding
- G06V10/70—Arrangements for image or video recognition or understanding using pattern recognition or machine learning
- G06V10/77—Processing image or video features in feature spaces; using data integration or data reduction, e.g. principal component analysis [PCA] or independent component analysis [ICA] or self-organising maps [SOM]; Blind source separation
- G06V10/80—Fusion, i.e. combining data from various sources at the sensor level, preprocessing level, feature extraction level or classification level
- G06V10/806—Fusion, i.e. combining data from various sources at the sensor level, preprocessing level, feature extraction level or classification level of extracted features
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06V—IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
- G06V10/00—Arrangements for image or video recognition or understanding
- G06V10/70—Arrangements for image or video recognition or understanding using pattern recognition or machine learning
- G06V10/766—Arrangements for image or video recognition or understanding using pattern recognition or machine learning using regression, e.g. by projecting features on hyperplanes
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06V—IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
- G06V2201/00—Indexing scheme relating to image or video recognition or understanding
- G06V2201/07—Target detection
Landscapes
- Engineering & Computer Science (AREA)
- Theoretical Computer Science (AREA)
- Computer Vision & Pattern Recognition (AREA)
- Evolutionary Computation (AREA)
- Computing Systems (AREA)
- Databases & Information Systems (AREA)
- Artificial Intelligence (AREA)
- General Health & Medical Sciences (AREA)
- Medical Informatics (AREA)
- Software Systems (AREA)
- Health & Medical Sciences (AREA)
- Physics & Mathematics (AREA)
- General Physics & Mathematics (AREA)
- Multimedia (AREA)
- Image Analysis (AREA)
Abstract
The invention discloses an image target detection method comprising the following steps: extracting a multi-resolution feature map from the image using a Transformer network as the backbone network; inputting the multi-resolution feature map into a cross-scale attention feature pyramid network; in the feature pyramid network, starting from the input small-resolution features, progressively fusing and recombining features towards the large resolutions with a cross-scale attention module, so that features are accumulated and fused from small resolution to large resolution; sending the fused features to subsequent processing and prediction modules for detection-box regression and category prediction, and training the cross-scale attention feature pyramid network on the target data set until convergence; and performing detection-box regression and category prediction on the picture to be detected using the trained cross-scale attention feature pyramid network, thereby improving the performance of final target detection. The invention also provides a corresponding image target detection device.
Description
Technical Field
The invention belongs to the technical field of deep learning and computer vision, and particularly relates to an image target detection method and device.
Background
Target detection is one of the most basic and important tasks in the field of computer vision. It typically uses deep neural networks to extract features from and model visual data, and to predict the locations and classes of target objects. Current deep learning methods treat the target detection task as a classification problem, a regression problem, or a combination of the two.
Visual images have high complexity and diversity, and object detection generally needs to capture object information at various scales. To better model target features at each scale, the Feature Pyramid Network (FPN) is widely used in target detection frameworks. The FPN takes the features at each resolution extracted by the backbone network (Backbone Networks) as input and fuses them. Low-resolution features have stronger semantics but lack detail, while high-resolution features are full of detail but semantically weaker. The FPN interpolates the low-resolution features and superimposes them on the high-resolution features for fusion. This enriches semantic information at multiple levels and makes the visual representation more sensitive to multi-scale objects.
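As an illustrative sketch only (not part of the patent text), the classic FPN top-down fusion described above — interpolating each low-resolution feature map and superimposing it on the next higher-resolution one — can be written with NumPy. The 2× nearest-neighbour upsampling and the omission of the usual 1×1 lateral projections are simplifying assumptions:

```python
import numpy as np

def upsample2x(feat):
    """Nearest-neighbour 2x upsampling of an (H, W, C) feature map."""
    return feat.repeat(2, axis=0).repeat(2, axis=1)

def fpn_topdown(features):
    """Fuse a list of (H, W, C) maps ordered from high to low resolution.

    Each lower-resolution map is upsampled and added to the next
    higher-resolution map, propagating semantics top-down.
    """
    fused = [features[-1]]                   # start from the smallest map
    for feat in reversed(features[:-1]):
        fused.append(feat + upsample2x(fused[-1]))
    return list(reversed(fused))             # back to high-to-low order

# toy pyramid: 8x8, 4x4 and 2x2 maps with 16 channels
pyramid = [np.random.rand(8, 8, 16),
           np.random.rand(4, 4, 16),
           np.random.rand(2, 2, 16)]
out = fpn_topdown(pyramid)
print([f.shape for f in out])  # [(8, 8, 16), (4, 4, 16), (2, 2, 16)]
```

This direct interpolate-and-add fusion is exactly the baseline whose limited cross-scale modeling the patent seeks to improve upon.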
The Transformer network was first proposed for, and applied with great success to, various scenarios and tasks in Natural Language Processing (NLP). In recent years, Transformers have come to be widely used in visual tasks such as image classification, semantic segmentation and object detection, achieving very strong performance and to some extent surpassing the Convolutional Neural Networks (CNNs) traditionally used for these tasks. The self-attention mechanism is the core component of a Transformer network: it automatically establishes relationships between features by measuring the responses between them and recombining the features according to the response values. Existing detection methods based on the traditional FPN technique generally realize cross-scale feature fusion by directly interpolating and adding features, and their ability to model objects of complex scales remains limited.
Disclosure of Invention
In view of the above defects or improvement needs of the prior art, the present invention provides an image target detection method and apparatus which, by introducing an FPN with a cross-scale attention mechanism, model object features that are more robust to scale and have stronger expressive power, and improve the performance of final target detection.
To achieve the above object, according to an aspect of the present invention, there is provided an image object detection method, including the steps of:
Step one: extracting a multi-resolution feature map from the image using a Transformer network as the backbone network;
Step two: inputting the multi-resolution feature map of step one into a cross-scale attention feature pyramid network;
Step three: in the feature pyramid network, starting from the small-resolution features input in step two, using the cross-scale attention module to progressively fuse and recombine features towards the large resolutions, accumulating and fusing features from small resolution to large resolution;
Step four: sending the features fused in step three to subsequent processing and prediction modules for detection-box regression and category prediction, and training the cross-scale attention feature pyramid network on the target data set until convergence;
Step five: performing detection-box regression and category prediction on the picture to be detected using the trained cross-scale attention feature pyramid network.
In one embodiment of the present invention, the cross-scale attention module of step three is implemented by the following steps:
(3.1) the feature F_n of the n-th level and the feature F_{n+1} of the (n+1)-th level are first converted into 1-dimensional token sequences, i.e. F_n ∈ R^{(H_n·W_n)×C} and F_{n+1} ∈ R^{(H_{n+1}·W_{n+1})×C}, where H_n, W_n are the spatial sizes of feature F_n in the height and width dimensions respectively, C is the size of the channel dimension of the features, and H_{n+1}, W_{n+1} are the spatial sizes of feature F_{n+1} in the height and width dimensions respectively;
(3.2) mapping the two feature sequences obtained in step (3.1) into the three spaces Query, Key and Value to obtain the feature matrices Q, K and V of the three spaces;
(3.3) performing the attention mechanism operation on the three matrices Q, K and V obtained in step (3.2).
In one embodiment of the invention, the Query matrix is obtained by applying a linear mapping to the feature F_n of the n-th level:

Q = F_n × W,

where W is the parameter matrix of the linear mapping and Q is the mapped Query matrix.
In one embodiment of the invention, the Key matrix and the Value matrix are both obtained by directly concatenating the two feature sets F_n and F_{n+1}, i.e.

K = V = [F_n, F_{n+1}],

where [·] denotes the concatenation operation, and K, V denote the resulting Key and Value matrices.
In one embodiment of the present invention, step (3.3) multiplies the Query matrix and the Key matrix to obtain an attention response map, and the response map is then applied to the Value matrix to obtain a new token sequence.
In one embodiment of the present invention, in step (3.3) the token sequence F_n of the n-th level is finally added back to the new token sequence through a residual connection; the whole process is denoted F_attn = softmax(QK^T)V + F_n, where F_attn is the resulting output feature matrix, K^T is the transpose of the Key matrix K, and softmax is the normalized exponential function.
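As an illustrative sketch only — assuming a single attention head, a shared channel size C across levels, no attention scaling factor, and a randomly drawn projection matrix W, none of which are fixed by the patent — the cross-scale attention of steps (3.1)–(3.3) can be written out with NumPy:

```python
import numpy as np

def softmax(x, axis=-1):
    """Row-wise normalized exponential function."""
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def cross_scale_attention(F_n, F_n1, W):
    """F_attn = softmax(Q K^T) V + F_n  (single head, no scaling).

    F_n  : (H_n*W_n, C)     token sequence of level n
    F_n1 : (H_n1*W_n1, C)   token sequence of level n+1
    W    : (C, C)           linear projection for the Query
    """
    Q = F_n @ W                                  # Query from level n only
    K = V = np.concatenate([F_n, F_n1], axis=0)  # K = V = [F_n, F_{n+1}]
    attn = softmax(Q @ K.T)                      # cross-scale attention map
    return attn @ V + F_n                        # residual connection

rng = np.random.default_rng(0)
F_n  = rng.standard_normal((16, 8))  # e.g. a 4x4 map, C=8, flattened to tokens
F_n1 = rng.standard_normal((4, 8))   # e.g. a 2x2 map, C=8, flattened to tokens
out = cross_scale_attention(F_n, F_n1, rng.standard_normal((8, 8)))
print(out.shape)  # (16, 8) — same shape as the level-n token sequence
```

Because the output retains the shape of the level-n tokens, the fused result can be reshaped back to (H_n, W_n, C) and passed up the pyramid for the next fusion step.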
In one embodiment of the invention, the cross-scale attention module operates within local feature windows.
In one embodiment of the present invention, the window size is set according to specific requirements.
In one embodiment of the present invention, in the first step, the Transformer network is a Swin-Transformer.
According to another aspect of the present invention, there is also provided an image object detecting apparatus, comprising at least one processor and a memory, the at least one processor and the memory being connected via a data bus, the memory storing instructions executable by the at least one processor, the instructions being configured to perform the image object detecting method described above after being executed by the processor.
Generally, compared with the prior art, the technical scheme of the invention has the following beneficial effects:
Aiming at the limited ability of existing detection methods to model objects of complex scale, the invention provides a novel target detection method based on a cross-scale self-attention feature pyramid, which can significantly improve the modeling of multi-scale objects at a relatively low computational cost and improve the performance of final target detection.
Drawings
FIG. 1 is a schematic flow chart of an image target detection method according to an embodiment of the present invention;
FIG. 2 is a schematic structural diagram of a feature pyramid network based on a cross-scale attention mechanism in an embodiment of the present invention.
Detailed Description
In order to make the objects, technical solutions and advantages of the present invention more apparent, the present invention is described in further detail below with reference to the accompanying drawings and embodiments. It should be understood that the specific embodiments described herein are merely illustrative of the invention and are not intended to limit the invention. In addition, the technical features involved in the embodiments of the present invention described below may be combined with each other as long as they do not conflict with each other.
The invention relates to the fields of deep learning and computer vision, and in particular to a general-purpose target detection algorithm based on cross-scale feature relationship modeling, an attention mechanism (Attention Mechanism) and a Feature Pyramid Network (FPN).
The purpose of the invention is achieved by the following technical scheme: cross-resolution features are modeled and recombined through a self-attention mechanism. Fig. 1 is a block diagram of a target detection method based on a Transformer backbone network and a cross-scale attention feature pyramid network in an embodiment of the invention. As shown in fig. 1, the invention provides an image target detection method comprising the following steps:
Step one: a Transformer network (such as Swin-Transformer) is used as the backbone network to extract a multi-resolution feature map from the image.
Step two: the multi-resolution feature map of step one is input into a cross-scale attention feature pyramid network.
Step three: in the feature pyramid network, starting from the small-resolution features input in step two, the proposed cross-scale attention module is used to progressively fuse and recombine features towards the large resolutions. Features are fused cumulatively from small resolution to large resolution.
Step four: the features fused in step three are further sent to subsequent processing and prediction modules for detection-box regression and category prediction. The network module is trained on the target data set until convergence.
Further, as shown in fig. 2, the cross-scale attention module (Scale-Attention Block) of step three is implemented by the following steps:
(3.1) the feature F_n of the n-th level and the feature F_{n+1} of the (n+1)-th level are first converted into 1-dimensional token sequences (Tokens), i.e. F_n ∈ R^{(H_n·W_n)×C} and F_{n+1} ∈ R^{(H_{n+1}·W_{n+1})×C}, where H_n, W_n are the spatial sizes of feature F_n in the height and width dimensions respectively, C is the size of the channel dimension of the features, and H_{n+1}, W_{n+1} are the spatial sizes of feature F_{n+1} in the height and width dimensions respectively.
(3.2) The two feature sequences obtained in (3.1) are mapped into the three spaces Query, Key and Value. Specifically, the feature matrices of the three spaces are obtained as follows. The Query matrix is obtained by a linear projection of the n-th-level feature F_n:

Q = F_n × W,

where W is the parameter matrix of the linear mapping and Q is the mapped Query matrix. The Key matrix and the Value matrix are both obtained by directly concatenating the two feature sets F_n and F_{n+1}, i.e.

K = V = [F_n, F_{n+1}],

where [·] denotes the concatenation operation, and K, V denote the resulting Key and Value matrices.
(3.3) The attention mechanism operation is performed on the three matrices Q, K and V obtained in (3.2). Specifically, the Query matrix and the Key matrix are multiplied to obtain an attention response map (attention map), and the response map is then applied to the Value matrix to obtain a new token sequence. Finally, the token sequence F_n of the n-th level is added back to the new token sequence through a residual connection.
The whole process can be expressed as

F_attn = softmax(QK^T)V + F_n,

where F_attn is the resulting output feature matrix, K^T is the transpose of the Key matrix K, and softmax is the normalized exponential function, computed row-wise as softmax(x)_i = exp(x_i) / Σ_j exp(x_j).
In particular, to save computational cost, the above cross-scale attention module usually operates within local feature windows, and the window size can be set according to specific requirements.
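A hedged sketch of this local-window restriction (the window size of 2 and the non-overlapping partition are illustrative assumptions; the patent only states that attention is computed within local feature windows of configurable size):

```python
import numpy as np

def window_partition(feat, win):
    """Split an (H, W, C) map into non-overlapping (win*win, C) token windows.

    Attention computed independently per window costs O(num_windows * win^4)
    instead of O((H*W)^2), which is the saving the text refers to.
    """
    H, W, C = feat.shape
    assert H % win == 0 and W % win == 0, "spatial size must divide the window"
    x = feat.reshape(H // win, win, W // win, win, C)
    # group the two window-grid axes together, then flatten each window
    return x.transpose(0, 2, 1, 3, 4).reshape(-1, win * win, C)

feat = np.arange(32, dtype=float).reshape(4, 4, 2)  # toy 4x4 map, C=2
windows = window_partition(feat, 2)
print(windows.shape)  # (4, 4, 2): 4 windows of 4 tokens, 2 channels each
```

Each (win*win, C) window can then be fed to the cross-scale attention operation in place of the full token sequence, and the outputs reassembled by inverting the partition.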
Furthermore, the invention also provides an image object detection device comprising at least one processor and a memory connected via a data bus, the memory storing instructions executable by the at least one processor, the instructions, when executed by the processor, performing the image object detection method described above.
Tests were carried out on environmental image data from a power-line patrol channel; the results are shown in Table 1 below:
TABLE 1 comparison of test results
It will be understood by those skilled in the art that the foregoing is only an exemplary embodiment of the present invention, and is not intended to limit the invention to the particular forms disclosed, since various modifications, substitutions and improvements within the spirit and scope of the invention are possible and within the scope of the appended claims.
Claims (10)
1. An image target detection method is characterized by comprising the following steps:
Step one: extracting a multi-resolution feature map from the image using a Transformer network as the backbone network;
Step two: inputting the multi-resolution feature map of step one into a cross-scale attention feature pyramid network;
Step three: in the feature pyramid network, starting from the small-resolution features input in step two, using the cross-scale attention module to progressively fuse and recombine features towards the large resolutions, accumulating and fusing features from small resolution to large resolution;
Step four: sending the features fused in step three to subsequent processing and prediction modules for detection-box regression and category prediction, and training the cross-scale attention feature pyramid network on the target data set until convergence;
Step five: performing detection-box regression and category prediction on the picture to be detected using the trained cross-scale attention feature pyramid network.
2. The image object detection method of claim 1, wherein the cross-scale attention module in step three is implemented by:
(3.1) the feature F_n of the n-th level and the feature F_{n+1} of the (n+1)-th level are first converted into 1-dimensional token sequences, i.e. F_n ∈ R^{(H_n·W_n)×C} and F_{n+1} ∈ R^{(H_{n+1}·W_{n+1})×C}, where H_n, W_n are the spatial sizes of feature F_n in the height and width dimensions respectively, C is the size of the channel dimension of the features, and H_{n+1}, W_{n+1} are the spatial sizes of feature F_{n+1} in the height and width dimensions respectively;
(3.2) mapping the two feature sequences obtained in (3.1) into the three spaces Query, Key and Value to obtain the feature matrices Q, K and V of the three spaces;
(3.3) performing the attention mechanism operation on the three matrices Q, K and V obtained in (3.2).
4. The image object detection method of claim 2, wherein the Key matrix and the Value matrix are both obtained by directly concatenating the two feature sets F_n and F_{n+1}, i.e.

K = V = [F_n, F_{n+1}].
5. The image object detection method according to claim 2, characterized in that said step (3.3) multiplies the Query matrix and the Key matrix to obtain an attention response map, and the response map is further applied to the Value matrix to obtain a new token sequence.
6. The image object detection method according to claim 1 or 2, wherein in step (3.3) the token sequence F_n of the n-th level is finally added back to the new token sequence through a residual connection; the whole process is denoted F_attn = softmax(QK^T)V + F_n, where F_attn is the resulting output feature matrix, K^T is the transpose of the Key matrix K, and softmax is a normalized exponential function.
7. An image object detection method as claimed in claim 1 or 2, wherein the cross-scale attention module operates within local feature windows.
8. The image object detection method of claim 7, wherein the window size is tailored to specific requirements.
9. The image object detection method according to claim 1 or 2, wherein in the first step, the Transformer network is Swin-Transformer.
10. An image object detecting apparatus characterized by:
comprising at least one processor and a memory, said at least one processor and memory being connected via a data bus, said memory storing instructions executable by said at least one processor, said instructions being adapted to perform the image object detection method of any of claims 1-9 after being executed by said processor.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202211053451.2A CN115331081A (en) | 2022-08-31 | 2022-08-31 | Image target detection method and device |
Applications Claiming Priority (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202211053451.2A CN115331081A (en) | 2022-08-31 | 2022-08-31 | Image target detection method and device |
Publications (1)
Publication Number | Publication Date |
---|---|
CN115331081A true CN115331081A (en) | 2022-11-11 |
Family
ID=83928983
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN202211053451.2A Pending CN115331081A (en) | 2022-08-31 | 2022-08-31 | Image target detection method and device |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN115331081A (en) |
Worldwide Applications (1)
- 2022-08-31: CN CN202211053451.2A patent/CN115331081A/en — active, Pending

Cited By (2)
Publication number | Priority date | Publication date | Assignee | Title
---|---|---|---|---|
CN116740414A (en) * | 2023-05-15 | 2023-09-12 | 中国科学院自动化研究所 | Image recognition method, device, electronic equipment and storage medium |
CN116740414B (en) * | 2023-05-15 | 2024-03-01 | 中国科学院自动化研究所 | Image recognition method, device, electronic equipment and storage medium |
Similar Documents
Publication | Publication Date | Title |
---|---|---|
CN106570464B (en) | Face recognition method and device for rapidly processing face shielding | |
CN111950453A (en) | Optional-shape text recognition method based on selective attention mechanism | |
CN112381097A (en) | Scene semantic segmentation method based on deep learning | |
CN110838095B (en) | Single image rain removing method and system based on cyclic dense neural network | |
CN113239825B (en) | High-precision tobacco beetle detection method in complex scene | |
CN113628059A (en) | Associated user identification method and device based on multilayer graph attention network | |
CN116895030B (en) | Insulator detection method based on target detection algorithm and attention mechanism | |
CN114360030A (en) | Face recognition method based on convolutional neural network | |
CN115471216B (en) | Data management method of intelligent laboratory management platform | |
CN115331081A (en) | Image target detection method and device | |
CN116523875A (en) | Insulator defect detection method based on FPGA pretreatment and improved YOLOv5 | |
CN111144407A (en) | Target detection method, system, device and readable storage medium | |
CN111079930A (en) | Method and device for determining quality parameters of data set and electronic equipment | |
CN117576402A (en) | Deep learning-based multi-scale aggregation transducer remote sensing image semantic segmentation method | |
CN117173595A (en) | Unmanned aerial vehicle aerial image target detection method based on improved YOLOv7 | |
CN114529450B (en) | Face image super-resolution method based on improved depth iteration cooperative network | |
CN114627370A (en) | Hyperspectral image classification method based on TRANSFORMER feature fusion | |
CN111353976B (en) | Sand grain target detection method based on convolutional neural network | |
CN114581789A (en) | Hyperspectral image classification method and system | |
CN114049519A (en) | Optical remote sensing image scene classification method | |
CN111340137A (en) | Image recognition method, device and storage medium | |
CN111899161A (en) | Super-resolution reconstruction method | |
CN115471893B (en) | Face recognition model training, face recognition method and device | |
CN116578903A (en) | Space-time detection method for power grid false data injection attack | |
CN115512109A (en) | Image semantic segmentation method based on relational context aggregation |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
PB01 | Publication | ||
SE01 | Entry into force of request for substantive examination | ||