CN113822871A - Target detection method and device based on dynamic detection head, storage medium and equipment - Google Patents


Info

Publication number
CN113822871A
Authority
CN
China
Prior art keywords
attention
feature
tensor
dimensional
target
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202111148642.2A
Other languages
Chinese (zh)
Inventor
杨紫崴
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Shenzhen Ping An Medical Health Technology Service Co Ltd
Original Assignee
Ping An Medical and Healthcare Management Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Ping An Medical and Healthcare Management Co Ltd filed Critical Ping An Medical and Healthcare Management Co Ltd
Priority to CN202111148642.2A priority Critical patent/CN113822871A/en
Publication of CN113822871A publication Critical patent/CN113822871A/en
Pending legal-status Critical Current

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06T IMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T 7/00 Image analysis
    • G06T 7/0002 Inspection of images, e.g. flaw detection
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N 3/00 Computing arrangements based on biological models
    • G06N 3/02 Neural networks
    • G06N 3/04 Architecture, e.g. interconnection topology
    • G06N 3/045 Combinations of networks
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N 3/00 Computing arrangements based on biological models
    • G06N 3/02 Neural networks
    • G06N 3/08 Learning methods
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06T IMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T 2207/00 Indexing scheme for image analysis or image enhancement
    • G06T 2207/20 Special algorithmic details
    • G06T 2207/20016 Hierarchical, coarse-to-fine, multiscale or multiresolution image processing; Pyramid transform

Landscapes

  • Engineering & Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Theoretical Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • General Health & Medical Sciences (AREA)
  • General Engineering & Computer Science (AREA)
  • Biophysics (AREA)
  • Computational Linguistics (AREA)
  • Data Mining & Analysis (AREA)
  • Evolutionary Computation (AREA)
  • Artificial Intelligence (AREA)
  • Molecular Biology (AREA)
  • Computing Systems (AREA)
  • Biomedical Technology (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Mathematical Physics (AREA)
  • Software Systems (AREA)
  • Health & Medical Sciences (AREA)
  • Quality & Reliability (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Image Analysis (AREA)

Abstract

The invention relates to the field of artificial intelligence, and provides a target detection method, apparatus, storage medium and device based on a dynamic detection head. The method comprises the following steps: acquiring a target picture to be detected, and extracting features of the target picture through a backbone network to obtain a picture feature pyramid; converting the picture feature pyramid into a three-dimensional feature tensor, wherein each slice of the three-dimensional feature tensor along the level dimension corresponds to a feature map of one level of the picture feature pyramid; and inputting the three-dimensional feature tensor into a dynamic detection head, where a plurality of serially stacked self-attention modules perform multi-dimensional perceptual attention processing on it to obtain a target detection result. Through the serial stacking structure of the self-attention modules in the dynamic output head, each self-attention module performs more refined processing of the target detection task, so that target detection accuracy is effectively improved.

Description

Target detection method and device based on dynamic detection head, storage medium and equipment
Technical Field
The invention relates to the field of artificial intelligence, in particular to a target detection method and device based on a dynamic detection head, a storage medium and computer equipment.
Background
With the continuous development of artificial intelligence technology, the processing and analysis of multimedia signals such as images, videos and voice increasingly rely on advanced technical means such as artificial intelligence, and one of the basic tasks in image processing and analysis is target detection. Target detection, also called target extraction, is an image segmentation process based on the geometric shape and characteristics of a target; it mainly segments and identifies the target to be detected. Generally, target detection inputs a target picture into a trained target detection model, which performs the detection and outputs a detection result for the target to be detected: for example, detecting the specific position of a human face, a vehicle or a building in the image, or detecting the category of an object in the image. The accuracy and real-time performance of target detection are important capabilities of the whole target detection system.
In the prior art, a target detection algorithm mainly comprises a backbone network (backbone) and a detection head (head): the backbone network is mainly used to extract feature information from the target picture, and the detection head is mainly used to output a prediction result according to the feature information extracted by the backbone network. However, as target detection frameworks and algorithms have developed and matured, the accuracy of target detection results has also reached a bottleneck, and it is difficult to obtain more accurate results under existing target detection frameworks.
Disclosure of Invention
In view of this, the present application provides a target detection method, an apparatus, a storage medium and a computer device based on a dynamic detection head, and mainly aims to solve the technical problem that the accuracy of a target detection result cannot be further improved in the existing target detection framework.
According to a first aspect of the present invention, there is provided a target detection method based on a dynamic detection head, the method comprising:
acquiring a target picture to be detected, and extracting the characteristics of the target picture through a backbone network to obtain a picture characteristic pyramid;
converting the picture feature pyramid into a three-dimensional feature tensor, wherein each slice of the three-dimensional feature tensor along the level dimension corresponds to a feature map of one level of the picture feature pyramid;
inputting the three-dimensional feature tensor into a dynamic detection head, and performing multi-dimensional perception attention processing on the three-dimensional feature tensor through a plurality of self-attention modules which are serially stacked in the dynamic detection head to obtain a target detection result.
According to a second aspect of the present invention, there is provided a target detection apparatus based on a dynamic detection head, the apparatus comprising:
the image feature extraction module is used for acquiring a target image to be detected and extracting features of the target image through a backbone network to obtain an image feature pyramid;
the feature tensor conversion module is used for converting the picture feature pyramid into a three-dimensional feature tensor, wherein each slice of the three-dimensional feature tensor along the level dimension corresponds to a feature map of one level of the picture feature pyramid;
and the self-attention processing module is used for inputting the three-dimensional feature tensor into the dynamic detection head, and performing multi-dimensional perception attention processing on the three-dimensional feature tensor through a plurality of self-attention modules which are serially stacked in the dynamic detection head to obtain a target detection result.
According to a third aspect of the present invention, there is provided a storage medium having stored thereon a computer program which, when executed by a processor, implements the above-described dynamic detection head-based object detection method.
According to a fourth aspect of the present invention, there is provided a computer device comprising a memory, a processor and a computer program stored on the memory and executable on the processor, the processor implementing the above dynamic detection head-based object detection method when executing the program.
The invention provides a target detection method, apparatus, storage medium and computer device based on a dynamic detection head. According to the method, the output head in the traditional target detection algorithm is replaced by a dynamic output head comprising a plurality of self-attention modules; through the serial stacking structure of these self-attention modules, the target detection task can be refined and decomposed from difficult to simple, so that each self-attention module performs more refined processing of its part of the task. In addition, by applying the dynamic output head to the target detection algorithm, the method solves the problems that the output head module is difficult to design and detection accuracy is low when the image detection data are complex, greatly improves target detection accuracy, and saves the manpower otherwise spent tuning the hyperparameters of the target detection algorithm.
The foregoing description is only an overview of the technical solutions of the present application. To make the technical means of the present application more clearly understood so that it can be implemented according to the content of this specification, and to make the above and other objects, features and advantages of the present application more comprehensible, a detailed description of the present application is given below.
Drawings
The accompanying drawings, which are included to provide a further understanding of the invention and are incorporated in and constitute a part of this application, illustrate embodiment(s) of the invention and together with the description serve to explain the invention without limiting the invention. In the drawings:
fig. 1 is a schematic flowchart illustrating a target detection method based on a dynamic detection head according to an embodiment of the present invention;
FIG. 2 is a schematic processing flow diagram of a dynamic test head according to an embodiment of the present invention;
fig. 3 is a schematic structural diagram of an object detection apparatus based on a dynamic detection head according to an embodiment of the present invention.
Detailed Description
The invention will be described in detail hereinafter with reference to the accompanying drawings in conjunction with embodiments. It should be noted that the embodiments and features of the embodiments in the present application may be combined with each other without conflict.
At present, target detection algorithms can mainly be grouped into three framework structures: two-stage target detection algorithms, single-stage (one-stage) target detection algorithms, and anchor-free target detection algorithms. Two-stage algorithms are based on a Region Proposal Network and are represented by the R-CNN (Region-CNN) family of deep-learning-based detectors. Single-stage algorithms, such as YOLO, directly predict the categories and positions of different targets with only one convolutional neural network (CNN); compared with two-stage algorithms, they offer higher detection speed at the cost of lower accuracy. Anchor-free algorithms differ greatly from both the two-stage and single-stage approaches and mainly solve detection as a regression problem: they split target detection into two sub-problems, determining the center point of an object and predicting the distances from the center point to the four sides of the bounding box. The center-point prediction determines the position of the target object, and the four predicted distances determine its size, thereby realizing target detection. As these three families of algorithms have gradually matured, their detection accuracy has also reached a bottleneck and is difficult to improve further.
In an embodiment, as shown in fig. 1, a target detection method based on a dynamic detection head is provided. The method is described using its application to a computer device such as a server as an example, where the server may be an independent server, or a cloud server providing basic cloud computing services such as cloud services, cloud databases, cloud computing, cloud functions, cloud storage, network services, cloud communication, middleware services, domain name services, security services, a Content Delivery Network (CDN), and big data and artificial intelligence platforms. The method comprises the following steps:
101. and acquiring a target picture to be detected, and extracting the features of the target picture through a backbone network to obtain a picture feature pyramid.
The backbone network (Backbone) refers to a convolutional neural network with pre-trained parameters used to extract picture features. The picture feature pyramid (FPN, Feature Pyramid Network) refers to a network layer that collects feature maps at different scales, and is also called the neck (Neck). Generally, to detect a target's position and category in an image, it is necessary to extract some necessary feature information from the image, then use these features to construct network layers of feature maps, and finally use the constructed feature-map layers to localize and classify the target. In practical applications, most backbone networks are built on deep learning network models; common backbones include ResNet-50, Darknet53 and the like.
Specifically, the computer device may first acquire the target picture to be processed through a data interface, then input the target picture into a pre-trained backbone network for picture feature extraction, and finally construct a picture feature pyramid from the extracted feature information. In this embodiment, the feature pyramid extracts and fuses feature information at different scales, making full use of all the feature information extracted by the backbone network and enabling the network to detect objects better. With the picture feature pyramid, the information extracted by the backbone network can be utilized more fully, so that the target detection algorithm copes better with multi-scale conditions. For example, when the target objects in the pictures to be detected have different scales, the picture feature pyramid forms a multi-level picture feature structure from the features of the target objects at those different scales.
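As an illustrative sketch only (not the patent's actual backbone), the multi-scale feature maps that a backbone plus feature pyramid would produce can be mimicked with repeated 2×2 average pooling; the function names and the choice of three levels here are hypothetical:

```python
import numpy as np

def downsample2x(fm):
    """Halve spatial resolution with 2x2 average pooling: a toy stand-in
    for one downsampling stage of a backbone such as ResNet-50."""
    h, w, c = fm.shape
    return fm[:h - h % 2, :w - w % 2, :].reshape(h // 2, 2, w // 2, 2, c).mean(axis=(1, 3))

def build_feature_pyramid(feature_map, levels=3):
    """Collect feature maps at successively coarser scales, coarsest last."""
    pyramid = [feature_map]
    for _ in range(levels - 1):
        pyramid.append(downsample2x(pyramid[-1]))
    return pyramid

pyramid = build_feature_pyramid(np.random.rand(32, 32, 8), levels=3)
print([fm.shape for fm in pyramid])  # [(32, 32, 8), (16, 16, 8), (8, 8, 8)]
```

Each list element stands in for one level of the picture feature pyramid that the dynamic detection head later consumes.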
102. And converting the picture feature pyramid into a three-dimensional feature tensor, wherein each slice of the three-dimensional feature tensor along the level dimension corresponds to a feature map of one level of the picture feature pyramid.
Specifically, since the scales of the feature layers in the image feature pyramid are not uniform, and the feature layers with non-uniform scales are difficult to perform further feature identification and classification, it is necessary to perform feature normalization on the image feature pyramid. In this embodiment, the feature with a smaller scale in the image feature pyramid may be enlarged, and the feature with a larger scale in the image feature pyramid may be reduced, so as to construct a three-dimensional feature tensor (3D tensor) with a uniform scale.
103. Inputting the three-dimensional feature tensor into a dynamic detection head, and performing multi-dimensional perception attention processing on the three-dimensional feature tensor through a plurality of self-attention modules which are serially stacked in the dynamic detection head to obtain a target detection result.
The detection head, also called an output head, refers to a module for performing target position prediction or target type prediction on the extracted image features. In the prior art, the detection head usually adopts a fixed convolutional network to predict the target position or the target category, and in this embodiment, a dynamic detection head may be used to replace the conventional detection head. The dynamic detection head may dynamically select the number of sub-modules and the stacking order of the sub-modules according to different target detection tasks or different target detection categories, and is therefore referred to as the dynamic detection head. In addition, a plurality of sub-modules in the dynamic detection head are constructed based on a self-attention mechanism and are serially stacked, so that the dynamic detection head can perform self-attention detection on the target to be detected from multiple dimensions and accumulate detection results, and the feature expression accuracy of the target task is effectively improved.
Specifically, before the target detection feature map is input into the dynamic detection head, corresponding self-attention modules may be preset for the dynamic detection head according to different target detection requirements, for example three self-attention modules that perform perceptual attention processing on the target picture in the scale, spatial and task dimensions respectively. The processed three-dimensional feature tensor is then input into the dynamic detection head, and the plurality of serially stacked self-attention modules perform multi-dimensional perceptual attention processing on it to obtain a target detection result. In this embodiment, the feature map of each dimension of the three-dimensional feature tensor can be processed by zero or more self-attention modules: for each dimension, enhanced perception processing can be performed by one or more self-attention modules, or the feature map can be left unprocessed. Generally speaking, the more attention modules the dynamic detection head has, the more accurate the target detection result.
In the above target detection method based on a dynamic detection head, features of the picture to be detected are first extracted through a backbone network to obtain a feature pyramid of the target picture; the picture feature pyramid is then converted into a three-dimensional feature tensor of uniform scale; finally, the three-dimensional feature tensor is input into the dynamic detection head, where a plurality of serially stacked self-attention modules perform multi-dimensional perceptual attention processing on it to obtain a target detection result. According to the method, the output head in the traditional target detection algorithm is replaced by a dynamic output head comprising a plurality of self-attention modules; through their serial stacking structure, the target detection task can be refined and decomposed from difficult to simple, so that each self-attention module performs more refined processing of its part of the task. In addition, by applying the dynamic output head to the target detection algorithm, the method solves the problems that the output head module is difficult to design and detection accuracy is low when the image detection data are complex, greatly improves target detection accuracy, and saves the manpower otherwise spent tuning the hyperparameters of the target detection algorithm.
In one embodiment, step 101 may be implemented as follows: first extract a basic feature network of the target picture through the backbone network, then extract a plurality of feature maps of different scales from this basic feature network, and finally arrange the feature maps of different scales layer by layer to obtain the picture feature pyramid. In this embodiment, the feature maps of the multiple levels of the picture feature pyramid are all extracted from the basic feature network produced by the backbone network, which is relatively convenient to implement; in addition, the extracted feature maps can be chosen from those output by higher-level layers of the network to improve the feature expression accuracy of the picture feature pyramid. In other embodiments, the feature pyramid may also be obtained by compressing or enlarging the target picture into pictures of different sizes and then processing each size with the same model, but this approach requires feeding the same picture through the model once per size and therefore places higher demands on the computer's computing power and memory.
In one embodiment, step 102 can be implemented as follows: scale the feature maps of all levels of the picture feature pyramid to the same size, and construct the three-dimensional feature tensor from the resulting equally sized feature maps. The three dimensions of the tensor can be a position dimension, a level dimension and a channel dimension. In this embodiment, the raw data of the target picture usually appear as pixels, represented by numbers and arranged over two spatial sizes (height and width), so the target picture can be expressed by three dimensional features: the position dimension and the level dimension representing space and scale, and the channel dimension representing color. Specifically, the computer device may perform scale unification on the feature maps of the respective levels, where scaling the feature maps better preserves the original features of each level. Converting the feature pyramid into a three-dimensional feature tensor of uniform scale makes it convenient for the dynamic detection head to accurately identify and classify the target picture.
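A minimal NumPy sketch of this scale-unification step, assuming nearest-neighbor resizing and a target size equal to the middle pyramid level (both assumptions, not mandated by the embodiment), produces a tensor shaped (levels L, space S = H·W, channels C):

```python
import numpy as np

def resize_nearest(fm, out_h, out_w):
    """Nearest-neighbor resize of an (H, W, C) feature map, used here both
    to enlarge small levels and to shrink large ones."""
    h, w, _ = fm.shape
    ys = (np.arange(out_h) * h) // out_h
    xs = (np.arange(out_w) * w) // out_w
    return fm[ys][:, xs]

def pyramid_to_tensor(pyramid):
    """Rescale every pyramid level to a common size and stack the flattened
    maps into a three-dimensional tensor of shape (L, S, C)."""
    mid_h, mid_w, c = pyramid[len(pyramid) // 2].shape
    resized = [resize_nearest(fm, mid_h, mid_w).reshape(mid_h * mid_w, c)
               for fm in pyramid]
    return np.stack(resized)  # axis 0 = level, axis 1 = space, axis 2 = channel

pyramid = [np.random.rand(32, 32, 8), np.random.rand(16, 16, 8), np.random.rand(8, 8, 8)]
tensor = pyramid_to_tensor(pyramid)
print(tensor.shape)  # (3, 256, 8)
```

Each slice `tensor[i]` along the level dimension corresponds to one level of the pyramid, matching the correspondence described above.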
In one embodiment, the self-attention modules in step 103 may comprise a top attention module, a front attention module and a right attention module, where the top attention module is constructed based on a spatial attention mechanism, the front attention module based on a scale attention mechanism, and the right attention module based on a target attention mechanism. In this embodiment, the self-attention modules in the three directions respectively process the three dimensions of the three-dimensional feature tensor: the top attention module spatially perceives the shape, rotation direction and position of targets in the picture; the front attention module perceives, across scales, the multiple targets of different sizes in the picture; and the right attention module perceives the different expression modes of the targets in the picture. By serially stacking the top, front and right attention modules in the dynamic detection head, the feature expression accuracy of the target detection result in scale, space and task can be effectively improved.
In one embodiment, as shown in fig. 2, the dynamic detection head comprises a Top Attention module, a Front Attention module and a Right Attention module stacked in series, and step 103 can be implemented as follows. First, the feature pyramid (Feature Pyramid) is converted into a three-dimensional feature tensor whose three dimensions are T, F and R. The tensor is input into the top attention module, which decouples it; the decoupled tensor is multiplied with the original tensor to obtain the top attention feature map (St). The top attention feature map is then input into the front attention module, which fuses its multi-level feature maps; the fused convolution network is multiplied with the top attention feature map to obtain the front attention feature map (Sf). Finally, the front attention feature map is input into the right attention module, which activates its feature channels according to a preset target detection task to obtain the target detection result (Sr). The target detection task may be a classification task (Classification), a regression task (Regression), or another type of task.
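The St → Sf → Sr pipeline can be sketched as three sequential gating functions over the (L, S, C) tensor. This is a heavily simplified NumPy illustration of the serial stacking idea only: the real top module uses a deformable convolution, and the plain sigmoid-of-mean gates below are placeholders for the embodiment's learned networks:

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def top_attention(F):
    """Spatial gate: one weight per spatial location (axis 1 of (L, S, C))."""
    gate = sigmoid(F.mean(axis=(0, 2)))   # (S,)
    return F * gate[None, :, None]        # St

def front_attention(F):
    """Scale gate: one weight per pyramid level (axis 0)."""
    gate = sigmoid(F.mean(axis=(1, 2)))   # (L,)
    return F * gate[:, None, None]        # Sf

def right_attention(F):
    """Task gate: one weight per feature channel (axis 2)."""
    gate = sigmoid(F.mean(axis=(0, 1)))   # (C,)
    return F * gate[None, None, :]        # Sr

F = np.random.rand(3, 256, 8)             # (levels, space, channels)
Sr = right_attention(front_attention(top_attention(F)))
print(Sr.shape)  # (3, 256, 8)
```

Because each stage gates a different axis, the serial composition refines the tensor space-first, then scale, then task, mirroring the Top → Front → Right order described above.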
In the above embodiment, the dynamic output head may split and refine the feature expression of the three-dimensional feature tensor from three dimensions respectively by using a top attention module, a front attention module and a right attention module which are stacked in series. In this embodiment, the three-dimensional feature tensor can be first input into the top attention module, then the deformable convolution network is used to decouple the three-dimensional feature tensor in the target detection feature map, and further, the deformable convolution network obtained after decoupling is fused with the target detection feature map, so that the top attention feature map is obtained. In the process of target identification, the target object to be detected usually presents different shapes, different rotation directions and different positions, and the top attention module is an attention module based on the spatial position of the target object to be detected, so that the top attention module can improve the spatial attention to the target object to be detected in different shapes, different rotation directions and different positions.
Further, after the top-end attention feature map is obtained, the top-end attention feature map may be input into the front-end attention module, and then the multilayer features of the feature pyramid are fused through the convolution network of the front-end attention module, so that the fused convolution network and the top-end attention feature map are fused to obtain the front-end attention feature map. The target objects in the picture to be detected are often different in size, when feature extraction is performed through a backbone network, the sizes different in size are distributed in different levels of a picture feature pyramid, and features of different levels are usually extracted from different depths of the network, so that semantic differences are inevitably generated, and the semantic differences cause interference in the positioning or classification process of subsequent target objects.
Further, after the front-end attention feature map is obtained, it may be input into the right-end attention module, which outputs a target detection result according to the preset target detection task. The target detection task may be predicting target position coordinates, classifying the target object, or the like. In addition, the target to be detected can be represented in different ways in target detection, for example by four border distances, by a center point, by a border and a center point together, and so on. In this embodiment, the right-end attention module performs joint learning on different target detection tasks, which improves generalization across them. Specifically, the right-end attention module may comprise a fully connected network and a normalization network; the features in the output of the front-end attention module are classified through these networks, and the corresponding feature channels are selected according to the different target detection tasks, thereby assisting the accurate completion of each detection task.
In one embodiment, the processing of the top attention module may comprise the following steps: first perform feature coefficient processing on the three-dimensional feature tensor through the deformable convolution network of the top attention module, then perform cross-scale integration on the coefficient-processed tensor, and finally multiply the integrated tensor with the original three-dimensional feature tensor to obtain the top attention feature map. In this embodiment, the top attention module mainly comprises a deformable convolution network, obtained by adding an offset to the normal sampling coordinates of an ordinary convolution with a 3 × 3 kernel; by learning this offset, the size and position of the deformable convolution kernel can be dynamically adjusted according to the content of the identified target, so as to adapt to the different shapes and rotation directions of the target object to be detected. Further, the top attention module decouples the high-dimensional three-dimensional feature tensor through the deformable convolution network, where the decoupling comprises a dimension reduction process and a feature integration process.
Firstly, the deformable convolution network performs coefficient processing on the three-dimensional feature tensor through attention-based learning: the weight of a no-target region in the three-dimensional feature tensor may be set to zero while the weight of the target region is increased, which amounts to a dimension reduction of the three-dimensional feature tensor. Then, cross-scale feature integration is performed on the dimension-reduced three-dimensional feature tensor, so that the features of the target object to be detected in different shapes, different rotation directions, and different positions are enhanced, and the features of no-target regions are suppressed. Finally, through this decoupling processing of the deformable convolution network, the top attention module outputs a feature map that can focus on foreground targets at different spatial positions.
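The coefficient-processing step above, zeroing the weight of no-target positions and re-weighting target positions, can be sketched as follows. Note this is only an illustrative stand-in: it replaces the patented deformable convolution with a simple saliency threshold, and the `keep_ratio` parameter and magnitude-based saliency are assumptions, not part of the original disclosure.

```python
import numpy as np

def spatial_attention(feat, keep_ratio=0.25):
    """Illustrative sketch of the top (spatial) attention step:
    keep only the most salient spatial positions, zero the rest,
    then multiply the weights back into the feature tensor."""
    # feat: (L, S, C); saliency per spatial position, averaged over levels/channels
    saliency = np.abs(feat).mean(axis=(0, 2))            # (S,)
    k = max(1, int(len(saliency) * keep_ratio))
    thresh = np.sort(saliency)[-k]
    mask = (saliency >= thresh).astype(feat.dtype)       # zero weight for no-target positions
    weights = mask * (saliency / (saliency.max() + 1e-8))  # increased weight for target positions
    return feat * weights[None, :, None]                 # top attention feature map
```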
In one embodiment, the processing method of the front-end attention module may include the following steps: firstly, averaging the multi-dimensional feature maps of the top attention feature map through an average pooling layer of the front-end attention module; then performing feature fusion on the averaged top attention feature map and normalizing the fused feature map; mapping the normalized top attention feature map into a preset value range through a nonlinear activation function; and finally multiplying the mapped top attention feature map with the original top attention feature map to obtain a front-end attention feature map. In this embodiment, the image features of different semantics in the output map of the top attention module may be averaged by the average pooling layer, fused by a convolutional neural network with a 1 × 1 convolution kernel, and then normalized; the image feature vectors of different semantics are mapped to values between 0 and 1 by a nonlinear activation function, such as the sigmoid activation function. Finally, the result of this series of operations is fused with the top attention feature map output in the previous step to obtain the front-end attention feature map. The front-end attention feature map obtained through the front-end attention module effectively reduces the interference caused by semantic differences, so that the attention paid to target objects of different scales can be improved.
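The pool–fuse–normalize–activate–multiply sequence described above can be sketched with per-level gating. This is a hedged sketch: the 1 × 1 convolution is modeled here as a linear map `w` over the pooled per-level statistics, and the zero-mean/unit-variance normalization is an assumption standing in for whichever normalization the implementation uses.

```python
import numpy as np

def scale_attention(feat, w, b):
    """Illustrative sketch of front-end (scale) attention over an
    (L, S, C) tensor: average pool per level, fuse across levels,
    normalize, squash with a sigmoid, and re-weight each level."""
    pooled = feat.mean(axis=(1, 2))                        # (L,) average pooling per level
    fused = w @ pooled + b                                 # 1x1-conv-style fusion across levels
    fused = (fused - fused.mean()) / (fused.std() + 1e-8)  # normalization
    gate = 1.0 / (1.0 + np.exp(-fused))                    # sigmoid maps into (0, 1)
    return feat * gate[:, None, None]                      # front-end attention feature map
```

The effect is that pyramid levels whose scale matches the target receive gates near 1 while mismatched levels are attenuated, which is how the module raises attention to targets of different scales.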
Further, as a specific implementation of the method shown in fig. 1 and fig. 2, this embodiment provides an object detection apparatus based on a dynamic detection head, as shown in fig. 3, the apparatus includes: a picture feature extraction module 21, a feature tensor conversion module 22 and a self-attention processing module 23, wherein:
the picture feature extraction module 21 is configured to obtain a target picture to be detected, and perform feature extraction on the target picture through a backbone network to obtain a picture feature pyramid;
the feature tensor conversion module 22 is configured to convert the picture feature pyramid into a three-dimensional feature tensor, where each slice of the three-dimensional feature tensor along the level dimension corresponds to the feature map of one level of the picture feature pyramid;
the self-attention processing module 23 may be configured to input the three-dimensional feature tensor into the dynamic detection head, and perform multidimensional perception attention processing on the three-dimensional feature tensor through a plurality of self-attention modules stacked in series in the dynamic detection head to obtain a target detection result.
In a specific application scenario, the picture feature extraction module 21 is specifically configured to extract a basic feature network of a target picture through a backbone network; extracting a plurality of feature graphs with different scales from a basic feature network of a target picture; and arranging a plurality of feature graphs with different scales layer by layer to obtain a picture feature pyramid.
In a specific application scenario, the feature tensor conversion module 22 is specifically configured to scale the feature maps of the respective levels of the image feature pyramid to the same scale, and construct a three-dimensional feature tensor according to a plurality of feature maps of the same size, where three dimensions of the three-dimensional feature tensor are a position dimension, a level dimension, and a channel dimension, respectively.
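The conversion performed by module 22 can be sketched as follows. The nearest-neighbor resizing and the choice to flatten height and width into a single position dimension are assumptions for illustration; any resampling method that brings the levels to a common scale would serve.

```python
import numpy as np

def build_feature_tensor(pyramid, target_hw):
    """Scale each pyramid level to one common size and stack them into
    a tensor whose dimensions are level, position, and channel."""
    H, W = target_hw
    levels = []
    for fmap in pyramid:                       # fmap: (h, w, C) feature map of one level
        h, w, _ = fmap.shape
        ys = np.arange(H) * h // H             # nearest-neighbor row indices
        xs = np.arange(W) * w // W             # nearest-neighbor column indices
        levels.append(fmap[ys][:, xs])         # resized to (H, W, C)
    t = np.stack(levels)                       # (L, H, W, C)
    return t.reshape(t.shape[0], H * W, t.shape[3])  # (L, S, C): level, position, channel
```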
In a specific application scenario, the plurality of self-attention modules include a top-end attention module, a front-end attention module, and a right-end attention module; the top attention module is constructed based on a space attention mechanism, the front end attention module is constructed based on a scale attention mechanism, and the right end attention module is constructed based on a target attention mechanism.
In a specific application scenario, the dynamic detection head comprises a top attention module, a front-end attention module, and a right-end attention module stacked in series. The self-attention processing module 23 is specifically configured to input the three-dimensional feature tensor into the top attention module, decouple the three-dimensional feature tensor through the top attention module, and multiply the decoupled three-dimensional feature tensor with the original three-dimensional feature tensor to obtain a top attention feature map; input the top attention feature map into the front-end attention module, fuse the multi-dimensional feature maps of the top attention feature map through the front-end attention module, and multiply the fused result with the top attention feature map to obtain a front-end attention feature map; and input the front-end attention feature map into the right-end attention module, and activate the feature channels of the front-end attention feature map through the right-end attention module according to a preset target detection task to obtain a target detection result.
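The serial stacking of the three modules reduces to function composition, which can be sketched generically. The block interface (each module as a tensor-to-tensor callable) is an assumption of this sketch, not language from the disclosure.

```python
import numpy as np

def dynamic_head(feat, blocks):
    """Serial stacking: each self-attention block refines the output of
    the previous one, so the detection task is split from difficult to
    simple and each block performs a more refined share of the work."""
    for block in blocks:
        feat = block(feat)
    return feat
```

A head would then be assembled as `dynamic_head(tensor, [spatial_block, scale_block, task_block])`, matching the top, front-end, right-end ordering described above.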
In a specific application scenario, the self-attention processing module 23 may be specifically configured to perform feature coefficient processing on the three-dimensional feature tensor through the deformable convolution network of the top attention module; perform cross-scale integration on the three-dimensional feature tensor after the feature coefficient processing; and multiply the cross-scale-integrated three-dimensional feature tensor with the original three-dimensional feature tensor to obtain a top attention feature map.
In a specific application scenario, the self-attention processing module 23 may be specifically configured to perform averaging processing on the multi-dimensional feature map of the top-end attention feature map through an average pooling layer of the front-end attention module; performing feature fusion on the top end attention feature map subjected to averaging processing, and performing normalization processing on the feature map subjected to feature fusion; mapping the top attention characteristic graph after the normalization processing into a preset numerical range through a nonlinear activation function; and multiplying the mapped top end attention feature map and the top end attention feature map to obtain a front end attention feature map.
It should be noted that other corresponding descriptions of the functional units related to the target detection apparatus based on the dynamic detection head provided in this embodiment may refer to the corresponding descriptions in fig. 1 and fig. 2, and are not repeated herein.
Based on the methods shown in fig. 1 and fig. 2, correspondingly, the present embodiment further provides a storage medium, on which a computer program is stored, and the computer program, when executed by a processor, implements the target detection method based on the dynamic detection head shown in fig. 1 and fig. 2.
Based on such understanding, the technical solution of the present application may be embodied in the form of a software product, which may be stored in a non-volatile storage medium (such as a CD-ROM, a USB flash disk, or a removable hard disk) and includes several instructions for enabling a computer device (which may be a personal computer, a server, or a network device) to execute the method described in the implementation scenarios of the present application.
Based on the method shown in fig. 1 and fig. 2 and the embodiment of the target detection apparatus based on the dynamic detection head shown in fig. 3, in order to achieve the above object, the present embodiment further provides an entity device for target detection based on the dynamic detection head, which may specifically be a personal computer, a server, a smart phone, a tablet computer, a smart watch, or other network devices, and the entity device includes a storage medium and a processor; a storage medium for storing a computer program; a processor for executing the computer program to implement the above-mentioned methods as shown in fig. 1 and fig. 2.
Optionally, the entity device may further include a user interface, a network interface, a camera, a radio frequency (RF) circuit, a sensor, an audio circuit, a Wi-Fi module, and the like. The user interface may include a display screen (Display) and an input unit such as a keyboard (Keyboard), and may optionally also include a USB interface, a card reader interface, and the like. The network interface may optionally include a standard wired interface, a wireless interface (e.g., a Wi-Fi interface), and the like.
Those skilled in the art will appreciate that the entity device structure for target detection based on the dynamic detection head provided in this embodiment does not constitute a limitation on the entity device, which may include more or fewer components, combine certain components, or have a different arrangement of components.
The storage medium may further include an operating system and a network communication module. The operating system is a program that manages the hardware and software resources of the above-mentioned entity device and supports the operation of the information processing program and other software and/or programs. The network communication module is used to realize communication among the components within the storage medium and with other hardware and software in the information processing entity device.
Through the above description of the embodiments, those skilled in the art will clearly understand that the present application can be implemented by software plus a necessary general hardware platform, or by hardware alone. By applying the technical solution of the present application, the features of the picture to be detected are first extracted through a backbone network to obtain the feature pyramid of the target picture; the picture feature pyramid is then converted into a three-dimensional feature tensor of uniform scale; finally, the three-dimensional feature tensor is input into a dynamic detection head, so that multi-dimensional perception attention processing is performed on it by a plurality of serially stacked self-attention modules, yielding a target detection result. Compared with the prior art, the output head of a traditional target detection algorithm is replaced by a dynamic output head comprising a plurality of self-attention modules, and through the serial stacking of these modules the target detection task can be refined and split from difficult to simple, so that each self-attention module can process the task in a more refined way. In addition, by applying the dynamic output head to the target detection algorithm, the method solves the problems that the output head module is difficult to design and that detection precision is low when the image detection data are complex, greatly improves target detection precision, and saves the manpower needed to tune the hyperparameters of the target detection algorithm.
Those skilled in the art will appreciate that the figures are merely schematic representations of one preferred implementation scenario and that the blocks or flow diagrams in the figures are not necessarily required to practice the present application. Those skilled in the art will appreciate that the modules in the devices in the implementation scenario may be distributed in the devices in the implementation scenario according to the description of the implementation scenario, or may be located in one or more devices different from the present implementation scenario with corresponding changes. The modules of the implementation scenario may be combined into one module, or may be further split into a plurality of sub-modules.
The above application serial numbers are for description purposes only and do not represent the superiority or inferiority of the implementation scenarios. The above disclosure is only a few specific implementation scenarios of the present application, but the present application is not limited thereto, and any variations that can be made by those skilled in the art are intended to fall within the scope of the present application.

Claims (10)

1. A target detection method based on a dynamic detection head is characterized by comprising the following steps:
acquiring a target picture to be detected, and extracting the features of the target picture through a backbone network to obtain a picture feature pyramid;
converting the picture feature pyramid into a three-dimensional feature tensor, wherein each slice of the three-dimensional feature tensor along the level dimension corresponds to a feature map of one level of the picture feature pyramid;
and inputting the three-dimensional feature tensor into a dynamic detection head, and performing multi-dimensional perception attention processing on the three-dimensional feature tensor through a plurality of self-attention modules which are serially stacked in the dynamic detection head to obtain a target detection result.
2. The method of claim 1, wherein the extracting features of the target picture through a backbone network to obtain a picture feature pyramid comprises:
extracting a basic feature network of the target picture through a backbone network;
extracting a plurality of feature graphs with different scales from the basic feature network of the target picture;
and arranging the plurality of feature graphs with different scales layer by layer to obtain a picture feature pyramid.
3. The method of claim 1, wherein the converting the pyramid of picture features into a three-dimensional feature tensor comprises:
the feature images of all levels of the image feature pyramid are scaled to the same scale, and the three-dimensional feature tensor is constructed according to the feature images of the same size, wherein three dimensions of the three-dimensional feature tensor are respectively position dimensions, level dimensions and channel dimensions.
4. The method of claim 1, wherein the plurality of self-attention modules comprises a top attention module, a front end attention module, and a right end attention module; wherein the top attention module is constructed based on a spatial attention mechanism, the front end attention module is constructed based on a scale attention mechanism, and the right end attention module is constructed based on a target attention mechanism.
5. The method of claim 4, wherein the dynamic detection head comprises a top attention module, a front attention module, and a right attention module stacked in series; then, the performing multidimensional perception attention processing on the three-dimensional feature tensor through a plurality of self-attention modules stacked in series in the dynamic detection head to obtain a target detection result includes:
inputting the three-dimensional feature tensor into the top attention module, decoupling the three-dimensional feature tensor through the top attention module, and multiplying the decoupled three-dimensional feature tensor and the three-dimensional feature tensor to obtain a top attention feature map;
inputting the top end attention feature map into the front end attention module, fusing the multi-dimensional feature map of the top end attention feature map through the front end attention module, and multiplying the fused convolution network by the top end attention feature map to obtain a front end attention feature map;
and inputting the front end attention feature map into the right end attention module, and activating a feature channel of the front end attention feature map through the right end attention module according to a preset target detection task to obtain a target detection result.
6. The method of claim 5, wherein the decoupling the three-dimensional feature tensor by the top attention module and multiplying the decoupled three-dimensional feature tensor by the three-dimensional feature tensor to obtain a top attention feature map comprises:
performing feature coefficient processing on the three-dimensional feature tensor through a deformable convolution network of the top attention module;
performing cross-scale integration on the three-dimensional feature tensor subjected to feature coefficient processing;
and multiplying the three-dimensional feature tensor integrated in the cross-scale mode by the three-dimensional feature tensor to obtain a top attention feature map.
7. The method of claim 5, wherein the fusing the multi-dimensional feature map of the top attention feature map by the front end attention module and multiplying the fused convolutional network by the top attention feature map to obtain a front end attention feature map comprises:
averaging the multi-dimensional feature map of the top attention feature map through an averaging pooling layer of the front end attention module;
performing feature fusion on the top end attention feature map subjected to averaging processing, and performing normalization processing on the feature map subjected to feature fusion;
mapping the top attention characteristic graph after the normalization processing into a preset numerical range through a nonlinear activation function;
and multiplying the mapped top end attention feature map and the top end attention feature map to obtain a front end attention feature map.
8. An object detection device based on a dynamic detection head, the device comprising:
the image feature extraction module is used for acquiring a target image to be detected and extracting features of the target image through a backbone network to obtain an image feature pyramid;
the feature tensor conversion module is used for converting the picture feature pyramid into a three-dimensional feature tensor, wherein each slice of the three-dimensional feature tensor along the level dimension corresponds to a feature map of one level of the picture feature pyramid;
and the self-attention processing module is used for inputting the three-dimensional feature tensor into a dynamic detection head, and performing multi-dimensional perception attention processing on the three-dimensional feature tensor through a plurality of self-attention modules which are serially stacked in the dynamic detection head to obtain a target detection result.
9. A storage medium having a computer program stored thereon, the computer program, when being executed by a processor, realizing the steps of the method of any one of claims 1 to 7.
10. A computer arrangement comprising a memory, a processor and a computer program stored on the memory and executable on the processor, characterized in that the computer program realizes the steps of the method of any one of claims 1 to 7 when executed by the processor.
CN202111148642.2A 2021-09-29 2021-09-29 Target detection method and device based on dynamic detection head, storage medium and equipment Pending CN113822871A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202111148642.2A CN113822871A (en) 2021-09-29 2021-09-29 Target detection method and device based on dynamic detection head, storage medium and equipment

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202111148642.2A CN113822871A (en) 2021-09-29 2021-09-29 Target detection method and device based on dynamic detection head, storage medium and equipment

Publications (1)

Publication Number Publication Date
CN113822871A true CN113822871A (en) 2021-12-21

Family

ID=78921698

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202111148642.2A Pending CN113822871A (en) 2021-09-29 2021-09-29 Target detection method and device based on dynamic detection head, storage medium and equipment

Country Status (1)

Country Link
CN (1) CN113822871A (en)

Cited By (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN115661754A (en) * 2022-11-04 2023-01-31 南通大学 Pedestrian re-identification method based on dimension fusion attention
CN115661754B (en) * 2022-11-04 2024-05-31 南通大学 Pedestrian re-recognition method based on dimension fusion attention


Similar Documents

Publication Publication Date Title
CN109255352B (en) Target detection method, device and system
KR102591961B1 (en) Model training method and device, and terminal and storage medium for the same
CN109815770B (en) Two-dimensional code detection method, device and system
CN109960742B (en) Local information searching method and device
CN111476306A (en) Object detection method, device, equipment and storage medium based on artificial intelligence
US10885660B2 (en) Object detection method, device, system and storage medium
CN112052186B (en) Target detection method, device, equipment and storage medium
CN110059728B (en) RGB-D image visual saliency detection method based on attention model
CN109684969B (en) Gaze position estimation method, computer device, and storage medium
CN111597884A (en) Facial action unit identification method and device, electronic equipment and storage medium
US20220254134A1 (en) Region recognition method, apparatus and device, and readable storage medium
WO2023284182A1 (en) Training method for recognizing moving target, method and device for recognizing moving target
WO2022161302A1 (en) Action recognition method and apparatus, device, storage medium, and computer program product
CN114418069A (en) Method and device for training encoder and storage medium
CN112101386B (en) Text detection method, device, computer equipment and storage medium
CN114445633A (en) Image processing method, apparatus and computer-readable storage medium
CN112907569A (en) Head image area segmentation method and device, electronic equipment and storage medium
CN113947613B (en) Target area detection method, device, equipment and storage medium
CN111126358A (en) Face detection method, face detection device, storage medium and equipment
CN113705361A (en) Method and device for detecting model in living body and electronic equipment
CN110633630B (en) Behavior identification method and device and terminal equipment
CN116958873A (en) Pedestrian tracking method, device, electronic equipment and readable storage medium
CN116363641A (en) Image processing method and device and electronic equipment
CN113822871A (en) Target detection method and device based on dynamic detection head, storage medium and equipment
CN114677578A (en) Method and device for determining training sample data

Legal Events

Date Code Title Description
PB01 Publication
PB01 Publication
SE01 Entry into force of request for substantive examination
SE01 Entry into force of request for substantive examination
TA01 Transfer of patent application right
TA01 Transfer of patent application right

Effective date of registration: 20220525

Address after: 518000 China Aviation Center 2901, No. 1018, Huafu Road, Huahang community, Huaqiang North Street, Futian District, Shenzhen, Guangdong Province

Applicant after: Shenzhen Ping An medical and Health Technology Service Co.,Ltd.

Address before: Room 12G, Area H, 666 Beijing East Road, Huangpu District, Shanghai 200001

Applicant before: PING AN MEDICAL AND HEALTHCARE MANAGEMENT Co.,Ltd.