WO2021232771A1 - Multi-task target detection method and apparatus, electronic device, and storage medium - Google Patents

Multi-task target detection method and apparatus, electronic device, and storage medium Download PDF

Info

Publication number
WO2021232771A1
WO2021232771A1 (PCT/CN2020/137446)
Authority
WO
WIPO (PCT)
Prior art keywords
feature
task
target
features
attention
Prior art date
Application number
PCT/CN2020/137446
Other languages
French (fr)
Chinese (zh)
Inventor
王金桥
赵朝阳
朱优松
Original Assignee
中科视语(北京)科技有限公司
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by 中科视语(北京)科技有限公司 filed Critical 中科视语(北京)科技有限公司
Publication of WO2021232771A1 publication Critical patent/WO2021232771A1/en

Classifications

    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06V: IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V 10/00: Arrangements for image or video recognition or understanding
    • G06V 10/40: Extraction of image or video features
    • G06V 10/46: Descriptors for shape, contour or point-related descriptors, e.g. scale invariant feature transform [SIFT] or bags of words [BoW]; Salient regional features
    • G06V 10/462: Salient features, e.g. scale invariant feature transforms [SIFT]
    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06F: ELECTRIC DIGITAL DATA PROCESSING
    • G06F 18/00: Pattern recognition
    • G06F 18/20: Analysing
    • G06F 18/25: Fusion techniques
    • G06F 18/253: Fusion techniques of extracted features
    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06V: IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V 2201/00: Indexing scheme relating to image or video recognition or understanding
    • G06V 2201/07: Target detection

Definitions

  • The present invention relates to the field of image processing technology, and in particular to a multi-task target detection method, apparatus, electronic device, and storage medium.
  • Target detection is a basic task in computer vision and a prerequisite for many other tasks.
  • The difficulties of target detection appear in almost all of those tasks, in more complex and varied forms, such as background noise interference, target occlusion, truncation, posture change, and deformation.
  • Multi-task design on top of target detection has long been an active topic: predicting multiple tasks with a single network not only saves computation but also improves the generalization ability of the model.
  • the main purpose of the present disclosure is to provide a multi-task target detection method, device, electronic equipment and storage medium, which can solve at least one of the above technical problems.
  • the first aspect of the embodiments of the present disclosure provides a multi-task target detection method, including:
  • At least one of the target detection task, key point detection task, and instance segmentation task is realized.
  • extracting the attention-aware convolutional feature map of the target using the cascaded spatial attention modules includes:
  • multiplying the multiple attention maps channel by channel with the convolutional feature maps at the corresponding downsampling multiples to obtain the attention-aware convolutional feature maps.
  • the extraction of local component features, global structural features, spatial context features, and multi-task features of the target based on the convolution feature map of the attention perception includes:
  • the detection task to achieve the target includes:
  • the detection task of the target is realized.
  • realizing the key point detection task and/or the instance segmentation task of the target based on the local component feature, global structural feature, spatial context feature, and multi-task feature of the target includes:
  • the extracting the feature of a local component of the target based on the convolution feature map of the attention perception and the candidate frame includes:
  • the candidate frame is mapped onto the component-sensitive feature map through PSRoIPooling and divided into k×k candidate frame blocks, so that each candidate frame block represents a local component and each candidate frame forms a k×k component feature;
  • Each k ⁇ k component feature is averagely pooled to obtain the local component feature of the target.
  • the extracting the global structural feature of the target based on the convolutional feature map of the attention perception and the candidate frame includes:
  • Dimensionality reduction is performed on the convolutional feature maps of the attention perception through a convolutional layer with a size of 1 ⁇ 1, to obtain a set of dimensionality-reduced convolution feature maps;
  • encoding is performed through two convolutional layers with sizes of k ⁇ k and 1 ⁇ 1, respectively, to obtain the global structural feature of the target.
  • the extraction of the context structure feature of the target based on the convolution feature map of the attention perception and the candidate frame includes:
  • the area-expanded candidate frame is mapped onto the dimension-reduced convolutional feature map and divided into k×k candidate frame blocks, so that each candidate frame forms a k×k context feature;
  • the context structure feature of the target is obtained by encoding through two convolutional layers with sizes of k ⁇ k and 1 ⁇ 1, respectively.
  • the extraction of the multi-task feature of the target based on the convolutional feature map of the attention perception and the candidate frame includes:
  • the feature of each candidate frame after encoding is up-sampled with a frequency of a preset multiple to obtain the multi-task feature of the target.
  • the loss of at least one of the detection task, key point detection task, and instance segmentation task of the target is computed through a preset loss model:
  • Loss = L_det(N) + λ1·L_att(N) + λ2·L_multi(N);
  • where N denotes the detection network implementing the multi-task target detection method,
  • L_det denotes the loss of the detection task,
  • L_att denotes the loss of the attention module,
  • L_multi denotes the loss of the key point detection task and/or the instance segmentation task, and
  • λ1 and λ2 are preset loss adjustment factors.
  • a second aspect of the embodiments of the present disclosure provides a multi-task target detection device, including:
  • the acquisition module is used to acquire the image of the target to be detected
  • the first extraction module is configured to use the cascaded spatial attention module to extract the convolutional feature map of the target's attention perception;
  • the second extraction module is configured to extract local component features, global structural features, spatial context features, and multi-task features of the target based on the convolution feature map of the attention perception;
  • the task realization module is used to implement at least one of the target detection task, key point detection task, and instance segmentation task based on the local component feature, global structure feature, spatial context feature, and multi-task feature of the target.
  • a third aspect of the embodiments of the present disclosure provides an electronic device, including:
  • a memory, a processor, and a computer program stored on the memory and capable of running on the processor are characterized in that, when the processor executes the program, the multi-task target detection method provided by the first aspect of the embodiments of the present disclosure is implemented.
  • a fourth aspect of the embodiments of the present disclosure provides a computer-readable storage medium on which a computer program is stored.
  • the computer program is executed by a processor, the multi-task target detection method provided in the first aspect of the embodiments of the present disclosure is implemented.
  • The multi-task target detection method, apparatus, electronic device, and storage medium use cascaded attention modules to extract the attention-aware convolutional feature map of the target, which can generate attention-aware convolutional features of the whole image from coarse to fine and suppress the interference of background noise.
  • Based on the attention-aware convolutional feature map, the local component features, global structural features, spatial context features, and multi-task features of the target are extracted, and at least one of the detection task, key point detection task, and instance segmentation task of the target is realized on the basis of those features. This effectively associates the global structure, local components, and context information of the target into a structured feature expression, improving the robustness of the features to occlusion, deformation, and posture changes, and improving multi-task performance.
  • FIG. 1 is a schematic flowchart of a multi-task target detection method provided by an embodiment of the present disclosure
  • FIG. 2 is a schematic diagram of a multi-task spatial attention mechanism provided by an embodiment of the present disclosure
  • FIG. 3 is a schematic flowchart of step S103 in the multi-task target detection method provided by an embodiment of the present disclosure
  • FIG. 4 is a schematic structural diagram of a model for implementing a multi-task target detection method provided by an embodiment of the present disclosure
  • FIG. 5 is a schematic structural diagram of a multi-task target detection device provided by an embodiment of the present disclosure
  • Figure 6 shows a schematic diagram of the hardware structure of an electronic device.
  • FIG. 1 is a schematic flowchart of a multi-task target detection method provided by an embodiment of the present disclosure. The method mainly includes the following steps:
  • S104 Based on the local component feature, global structural feature, spatial context feature, and multi-task feature of the target, at least one of a detection task, a key point detection task, and an instance segmentation task of the target is realized.
  • the image can be any image.
  • the target can be people, animals, flowers and plants, etc. This embodiment does not impose any limitation on this.
  • each attention module is used to generate a pixel-by-pixel foreground and/or background attention map, and multiple attention modules are connected in a cascaded manner to learn the spatial region of the whole image from coarse to fine.
  • In step S103, the local component features, global structural features, spatial context features, and multi-task features of the target are explicitly extracted to enhance the descriptive power for the target.
  • local component features such as human eyes, nose, mouth and other specific components
  • global structural features such as the upright structure of the human body
  • spatial context features are mainly used to extract the spatial context information around the target, such as a person being in an indoor environment or an airplane being in the sky; multi-task features are mainly used to extract key point and/or segmentation features.
  • the four processes of extracting the target's local component features, global structural features, spatial context features, and multi-task features may be performed in parallel or not in parallel.
  • In step S104, the local component features, global structural features, and spatial context features of the target are coupled together after a normalization operation to form a complete structured feature of the target, which can be used for the detection task of the target.
  • The structured feature is further coupled with the multi-task feature through up-sampling, and the coupled feature can be used for the key point detection task and the instance segmentation task of the target, realizing end-to-end multi-task training and testing.
  • FIG. 2 is a schematic diagram of a multi-task spatial attention mechanism provided by an embodiment of the present disclosure.
  • the multi-task spatial attention mechanism implements step S102 of the present disclosure (understandably, the multi-task coupling network in FIG. 2 implements steps S103 and S104 of the present disclosure).
  • Step S102 includes: inserting attention modules at multiple preset downsampling multiples of the preset basic network to obtain multiple attention maps; and multiplying the multiple attention maps channel by channel with the convolutional feature maps at the corresponding downsampling multiples to obtain the attention-aware convolutional feature maps.
  • For example, an attention module is inserted at each downsampling stage of the preset basic network. Taking downsampling multiples of 4, 8, and 16 as an example, three attention maps are generated, and each attention map is multiplied channel by channel with the preset convolutional feature map at the corresponding downsampling multiple (that is, the attention map at 4× is multiplied by the convolutional feature map at 4×, the attention map at 8× by the convolutional feature map at 8×, and the attention map at 16× by the convolutional feature map at 16×). Background noise interference is thereby suppressed from coarse to fine and the foreground feature expression is enhanced, guiding the feature learning of the preset basic network and producing the final attention-aware convolutional feature map.
  • The present disclosure does not use an attention module after Conv1, mainly because shallow features lack sufficient semantic information, and the attention map generated at that stage is often very inaccurate.
  • For each attention module, an attention map A is predicted to express the confidence that each position belongs to the target.
  • The attention module contains two 3×3 convolutional layers with 256 channels, followed by a 1×1 convolutional layer for foreground/background classification, and finally a sigmoid activation function that normalizes the output to the range 0 to 1 to produce the final attention map.
  • The generated attention map is independent of the target category, and its number of channels is 1.
  • The attention map is then multiplied channel by channel, via broadcasting, with the convolutional features at the corresponding downsampling multiple, and the multiplied features serve as the next input. This process is repeated throughout the preset basic network, gradually guiding the learning of the network features and finally yielding the attention-aware convolutional feature map, as sketched below.
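  • The following is a minimal PyTorch sketch of one such attention block, for illustration only: the two 3×3 convolutions (256 channels), the 1×1 foreground/background convolution, the sigmoid, and the broadcast multiplication follow the text, while the ReLU activations between the convolutions are an assumption the text does not specify.

```python
import torch
import torch.nn as nn

class SpatialAttentionModule(nn.Module):
    """One cascaded attention block: predicts a single-channel,
    class-agnostic foreground confidence map and reweights the
    backbone features with it."""

    def __init__(self, in_channels):
        super().__init__()
        self.body = nn.Sequential(
            nn.Conv2d(in_channels, 256, kernel_size=3, padding=1),
            nn.ReLU(inplace=True),             # nonlinearity assumed, not stated
            nn.Conv2d(256, 256, kernel_size=3, padding=1),
            nn.ReLU(inplace=True),
            nn.Conv2d(256, 1, kernel_size=1),  # foreground/background logit
        )

    def forward(self, features):
        # Attention map A in [0, 1], shape (N, 1, H, W).
        attention = torch.sigmoid(self.body(features))
        # Channel-by-channel multiplication via broadcasting; the result
        # is fed to the next backbone stage.
        return features * attention, attention
```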
  • Step S103 includes:
  • the region proposal network (RPN) can be used to extract the candidate frames, and a generated candidate frame containing the target is denoted P.
  • Optionally, step S104 includes: fusing the local component feature, global structural feature, and spatial context feature of the target to obtain the structured feature of the target; and realizing the detection task of the target based on the structured feature.
  • the local component feature, the global structure feature, and the spatial context feature are coupled together through a normalization operation to form a complete structured feature of the target, which can be used for the detection task of the target.
  • Optionally, step S104 includes: up-sampling the structured feature so that its resolution is the same as that of the multi-task feature; fusing the up-sampled structured feature with the multi-task feature to obtain a fused feature; and performing key point detection on the fused feature to realize the key point detection task of the target, and/or performing instance segmentation on the fused feature to realize the instance segmentation task of the target.
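  • A compact sketch of this coupling step is shown below. The text specifies normalization, coupling, up-sampling, and fusion but not the exact operators, so the L2 normalization, summation, bilinear up-sampling, and channel concatenation here are all assumptions chosen for illustration.

```python
import torch
import torch.nn.functional as F

def couple_and_fuse(local_feat, global_feat, context_feat, multitask_feat):
    """Couple the three structured descriptors (assumed to share a common
    (R, C, 1, 1) shape) and fuse the result with the multi-task feature."""
    # Normalize and couple into the structured feature (detection head input).
    structured = sum(F.normalize(f, dim=1)
                     for f in (local_feat, global_feat, context_feat))
    # Up-sample to the multi-task resolution, then fuse; concatenation is
    # one plausible reading of "fuse" and keeps channel counts flexible.
    structured_up = F.interpolate(structured, size=multitask_feat.shape[-2:],
                                  mode="bilinear", align_corners=False)
    fused = torch.cat([structured_up, multitask_feat], dim=1)
    return structured, fused  # detection input, keypoint/mask head input
```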
  • Optionally, step S1032 includes: passing the attention-aware convolutional feature map through a 1×1 convolutional layer to obtain a component-sensitive feature map; mapping the candidate frame onto the component-sensitive feature map through PSRoIPooling and dividing the candidate frame into k×k candidate frame blocks, so that each candidate frame block represents a local component and each candidate frame forms a k×k component feature; and average-pooling each k×k component feature to obtain the local component feature of the target.
  • Specifically, a component-sensitive score map is generated through a 1×1 convolution whose number of filters is k²(C+1), where k (usually taken as 7) means that the target is divided into k×k candidate frame blocks of the same size, each candidate frame block represents a local component, and C is the total number of target categories.
  • a total of k 2 feature channels are generated for each target category, and each feature channel is responsible for encoding a local component of the target.
  • The PSRoIPooling operation from "R-FCN: Object detection via region-based fully convolutional networks" is used to extract the local component features.
  • The component-sensitive score map has k²(C+1) channels, and average pooling finally yields a 1×1×(C+1) local component feature.
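  • A minimal sketch of this local-component branch, using the PSRoIPooling op available in torchvision: the 1×1 convolution to k²(C+1) channels, the k×k position-sensitive pooling, and the final average pooling follow the text, while the backbone width (1024) and class count (80) are placeholder assumptions.

```python
import torch.nn as nn
from torchvision.ops import ps_roi_pool

k, num_classes = 7, 80   # k = 7 per the text; the class count is an assumption
in_channels = 1024       # backbone width, also an assumption

# 1x1 conv producing the component-sensitive score map with k^2*(C+1) channels.
part_conv = nn.Conv2d(in_channels, k * k * (num_classes + 1), kernel_size=1)

def local_part_feature(attention_fmap, rois, spatial_scale):
    """attention_fmap: (N, in_channels, H, W); rois: (R, 5) rows of
    (batch_index, x1, y1, x2, y2) in image coordinates."""
    score_map = part_conv(attention_fmap)
    # Each of the k*k bins pools from its own channel group, so every bin
    # encodes one local component of the target.
    parts = ps_roi_pool(score_map, rois, output_size=k,
                        spatial_scale=spatial_scale)   # (R, C+1, k, k)
    # Average pooling over the k*k parts -> (R, C+1, 1, 1).
    return parts.mean(dim=(2, 3), keepdim=True)
```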
  • Optionally, step S1032 includes: reducing the dimensionality of the attention-aware convolutional feature map through a 1×1 convolutional layer to obtain a set of dimension-reduced convolutional feature maps; mapping the candidate frame onto the dimension-reduced convolutional feature map through RoIPooling and dividing the candidate frame into k×k candidate frame blocks, so that each candidate frame block forms a k×k global feature; and treating each k×k global feature as a whole and encoding it through two convolutional layers of sizes k×k and 1×1 to obtain the global structural feature of the target.
  • Similar to the local branch, the candidate frame is divided into k×k candidate frame blocks and each block is pooled individually, with two differences: 1) every feature channel extracts k×k features, that is, the feature channels here do not distinguish categories or positions, and the candidate frames are neither score-sensitive nor position-sensitive; 2) after the pooling operation, all candidate frame blocks are combined into a whole with a feature spatial resolution of k×k, which is then further encoded into the global structural feature through two convolutional layers whose filter sizes are k×k and 1×1, likewise outputting a 1×1×(C+1) feature.
  • The RoIPooling operation from Faster R-CNN is used to extract the features, which unifies the global structural features into scale-normalized features; that is, whether the target is large or small, the global structural features have the same size.
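  • A sketch of this global branch under the same placeholder assumptions (backbone width 1024, reduced width 256, 80 classes, and a ReLU between the two encoding convolutions):

```python
import torch.nn as nn
from torchvision.ops import roi_pool

k, num_classes = 7, 80
reduced = 256  # width after the 1x1 dimensionality reduction (assumed)

reduce_conv = nn.Conv2d(1024, reduced, kernel_size=1)
encode = nn.Sequential(
    nn.Conv2d(reduced, reduced, kernel_size=k),          # k x k -> 1 x 1
    nn.ReLU(inplace=True),                               # nonlinearity assumed
    nn.Conv2d(reduced, num_classes + 1, kernel_size=1),  # final 1 x 1 conv
)

def global_structural_feature(attention_fmap, rois, spatial_scale):
    fmap = reduce_conv(attention_fmap)
    # RoIPooling normalizes every box, large or small, to a k x k grid.
    pooled = roi_pool(fmap, rois, output_size=k,
                      spatial_scale=spatial_scale)       # (R, reduced, k, k)
    return encode(pooled)                                # (R, C+1, 1, 1)
```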
  • Optionally, step S1032 includes: reducing the dimensionality of the attention-aware convolutional feature map through a 1×1 convolutional layer to obtain a set of dimension-reduced convolutional feature maps; keeping the center point of each candidate frame unchanged and expanding the area of each candidate frame by a preset multiple; mapping the expanded candidate frame onto the dimension-reduced convolutional feature map through RoIPooling and dividing the expanded candidate frame into k×k candidate frame blocks, so that each candidate frame forms a k×k context feature; and treating each k×k context feature as a whole and encoding it through two convolutional layers of sizes k×k and 1×1 to obtain the context structure feature of the target.
  • Context structural features are widely used in visual recognition tasks as basic and important information. For example, a ship appears on the water rather than in the sky, which suggests that the information around a target can usually help to better distinguish its semantic category. In addition, the effective receptive field of a network is much smaller than its theoretical receptive field, so collecting information around the target can effectively reduce misidentification.
  • The process of extracting context structural features in the present disclosure is the same as that of extracting global structural features, except that before extraction, the center point coordinates of each candidate frame are kept unchanged and the frame's area is expanded to 2 times the original.
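  • A small helper for the box-expansion step, for illustration. The text doubles the box area, so each side is scaled by the square root of the factor; scaling the side length directly would be a different (also common) convention.

```python
import torch

def expand_rois(rois, area_factor=2.0):
    """Scale each (batch_index, x1, y1, x2, y2) box about its own center
    so that its area grows by area_factor."""
    side = area_factor ** 0.5          # side scale that doubles the area
    cx = (rois[:, 1] + rois[:, 3]) / 2
    cy = (rois[:, 2] + rois[:, 4]) / 2
    half_w = (rois[:, 3] - rois[:, 1]) * side / 2
    half_h = (rois[:, 4] - rois[:, 2]) * side / 2
    out = rois.clone()
    out[:, 1], out[:, 3] = cx - half_w, cx + half_w
    out[:, 2], out[:, 4] = cy - half_h, cy + half_h
    return out
```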
  • Optionally, step S1032 includes: reducing the dimensionality of the attention-aware convolutional feature map through a 1×1 convolutional layer to obtain a set of dimension-reduced convolutional feature maps; mapping the candidate frame onto the dimension-reduced convolutional feature map through RoIPooling and dividing the candidate frame into 2k×2k candidate frame blocks, so that each candidate frame forms a 2k×2k feature; encoding each 2k×2k feature through four 3×3 convolutions with 256 channels; and up-sampling the encoded feature of each candidate frame by a preset multiple to obtain the multi-task feature of the target.
  • Specifically, the candidate frame is divided into 2k×2k blocks, and the features are likewise extracted by RoIPooling. The spatial resolution of the extracted features is 2k×2k, and they are further encoded by four 3×3 convolutional layers whose channel count is set to 256. Since the key point detection and instance segmentation tasks require features with high spatial resolution, an up-sampling layer restores their spatial structure information; the up-sampling rate can be set to 2×, 4×, and so on, and the up-sampled features are the multi-task features.
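  • A sketch of this multi-task branch under the same assumptions (reduced width 256, ReLU activations, bilinear up-sampling; the text only fixes the four 3×3/256-channel convolutions, the 2k×2k pooling, and the 2× or 4× up-sampling rate):

```python
import torch.nn as nn
import torch.nn.functional as F
from torchvision.ops import roi_pool

k, reduced = 7, 256  # reduced width after the 1x1 conv (assumed)

# Four 3x3 convolutions with 256 channels each (ReLUs between them assumed).
multitask_encoder = nn.Sequential(
    *[layer for _ in range(4)
      for layer in (nn.Conv2d(reduced, reduced, kernel_size=3, padding=1),
                    nn.ReLU(inplace=True))]
)

def multitask_feature(reduced_fmap, rois, spatial_scale, up=2):
    pooled = roi_pool(reduced_fmap, rois, output_size=2 * k,
                      spatial_scale=spatial_scale)  # (R, 256, 2k, 2k)
    encoded = multitask_encoder(pooled)
    # Up-sampling (2x or 4x per the text) restores the spatial detail that
    # keypoint and mask prediction need.
    return F.interpolate(encoded, scale_factor=up, mode="bilinear",
                         align_corners=False)
```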
  • Optionally, the loss of at least one of the detection task, key point detection task, and instance segmentation task of the target is computed through a preset loss model:
  • Loss = L_det(N) + λ1·L_att(N) + λ2·L_multi(N);
  • where N denotes the detection network implementing the multi-task target detection method, L_det denotes the loss of the detection task, L_att denotes the loss of the attention module, L_multi denotes the loss of the key point detection task and/or the instance segmentation task, and λ1 and λ2 are preset loss adjustment factors.
  • The present disclosure adopts a two-stage detection method, which first generates candidate frames through the RPN and then further classifies and regresses them through the coupling network, so the detection loss includes the classification and regression losses of both the RPN and the coupling network.
  • Both regression losses use the smooth L1 loss;
  • the classification loss of the RPN is a two-class cross-entropy loss;
  • the classification loss of the coupling network is a multi-class cross-entropy loss.
  • L att is the loss of the spatial attention module and also the two-class (foreground/background) cross-entropy loss.
  • L multi is the loss of other tasks, which can be key point loss or instance segmentation loss, or the sum of two losses (key point detection and instance segmentation are performed at the same time).
  • ⁇ 1 and ⁇ 2 are loss adjustment factors, which can be set as needed. In one example, ⁇ 1 is set to 0.25, ⁇ 2 is set to 1, the positive and negative sample selection ratio of the detection part is 1:4, and the sample threshold is 0.5 , That is, the IOU with ground truth is greater than 0.5 as a positive sample, otherwise as a negative sample.
  • the ratio of positive and negative samples in the RPN part is 1:1, the positive sample threshold is 0.7, and the negative sample threshold is 0.3.
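  • The combined objective is simple enough to state directly in code; the λ values below are the example settings from the text.

```python
def total_loss(l_det, l_att, l_multi, lambda1=0.25, lambda2=1.0):
    """Loss = L_det + lambda1 * L_att + lambda2 * L_multi.

    l_det sums the RPN and coupling-network classification (cross-entropy)
    and regression (smooth L1) losses; l_att is the binary foreground /
    background cross-entropy over the attention maps; l_multi is the
    keypoint and/or instance-segmentation loss."""
    return l_det + lambda1 * l_att + lambda2 * l_multi
```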
  • FIG. 5 is a schematic structural diagram of a multi-task target detection device provided by an embodiment of the present disclosure.
  • the device includes:
  • the obtaining module 201 is used to obtain an image of the target to be detected
  • the first extraction module 202 is configured to use the cascaded spatial attention module to extract the convolutional feature map of the target's attention perception;
  • the second extraction module 203 is configured to extract local component features, global structural features, spatial context features, and multi-task features of the target based on the convolutional feature map of the attention perception;
  • the task realization module 204 is configured to implement at least one of the target detection task, key point detection task, and instance segmentation task based on the local component feature, global structure feature, spatial context feature, and multi-task feature of the target.
  • the first extraction module 202 includes: an insertion sub-module, used to insert attention modules at multiple preset downsampling multiples of the preset basic network to obtain multiple attention maps;
  • and a sub-module used to multiply the multiple attention maps channel by channel with the convolutional feature maps at the corresponding downsampling multiples to obtain the attention-aware convolutional feature maps.
  • the second extraction module 203 includes: a first extraction sub-module for extracting candidate frames containing the target on the convolutional feature map of the attention perception; a second extraction sub-module for Based on the convolutional feature map and the candidate frame based on the attention perception, the local component features, global structural features, spatial context features and multi-task features of the target are extracted.
  • the task realization module 204 includes: a first feature fusion sub-module for fusing local component features, global structural features, and spatial context features of the target to obtain the structural features of the target;
  • the detection task realization sub-module is used to realize the detection task of the target based on the structural feature.
  • the task realization module 204 includes: a first up-sampling sub-module for up-sampling the structured feature so that the resolution of the structured feature is the same as the resolution of the multi-task feature ;
  • the second feature fusion sub-module is used to fuse the structured features after upsampling with the multi-task features to obtain the fused features;
  • the key point detection task realization sub-module is used to perform key point detection on the fused features to realize the key point detection task of the target, and/or the instance segmentation task realization sub-module is used to perform instance segmentation on the fused features to realize the instance segmentation task of the target.
  • Optionally, the second extraction sub-module includes: a first dimensionality reduction sub-module, used to pass the attention-aware convolutional feature map through a 1×1 convolutional layer to obtain a component-sensitive feature map; a first mapping division sub-module, used to map the candidate frame onto the component-sensitive feature map through PSRoIPooling and divide the candidate frame into k×k candidate frame blocks, so that each candidate frame block represents a local component and each candidate frame forms a k×k component feature; and a pooling sub-module, used to average-pool each k×k component feature to obtain the local component feature of the target.
  • Optionally, the second extraction sub-module includes: a second dimensionality reduction sub-module, used to reduce the dimensionality of the attention-aware convolutional feature map through a 1×1 convolutional layer to obtain a set of dimension-reduced convolutional feature maps; a second mapping division sub-module, used to map the candidate frame onto the dimension-reduced convolutional feature map through RoIPooling and divide the candidate frame into k×k candidate frame blocks, so that each candidate frame block forms a k×k global feature; and a first coding sub-module, used to treat each k×k global feature as a whole and encode it through two convolutional layers of sizes k×k and 1×1 to obtain the global structural feature of the target.
  • Optionally, the second extraction sub-module includes: a third dimensionality reduction sub-module, used to reduce the dimensionality of the attention-aware convolutional feature map through a 1×1 convolutional layer to obtain a set of dimension-reduced convolutional feature maps; an area expansion sub-module, used to keep the center point of each candidate frame unchanged and expand the area of each candidate frame by a preset multiple; a third mapping division sub-module, used to map the area-expanded candidate frame onto the dimension-reduced convolutional feature map through RoIPooling and divide the area-expanded candidate frame into k×k candidate frame blocks, so that each candidate frame forms a k×k context feature; and a second coding sub-module, used to treat each k×k context feature as a whole and encode it through two convolutional layers of sizes k×k and 1×1 to obtain the context structure feature of the target.
  • Optionally, the second extraction sub-module includes: a fourth dimensionality reduction sub-module, used to reduce the dimensionality of the attention-aware convolutional feature map through a 1×1 convolutional layer to obtain a set of dimension-reduced convolutional feature maps; a fourth mapping division sub-module, used to map the candidate frame onto the dimension-reduced convolutional feature map through RoIPooling and divide the candidate frame into 2k×2k candidate frame blocks, so that each candidate frame forms a 2k×2k feature; a third encoding sub-module, used to encode each 2k×2k feature through four 3×3 convolutions with 256 channels; and a second up-sampling sub-module, used to up-sample the encoded feature of each candidate frame by a preset multiple to obtain the multi-task feature of the target.
  • Optionally, a loss detection module is further included, configured to compute, through a preset loss model, the loss of at least one of the detection task, key point detection task, and instance segmentation task of the target:
  • Loss = L_det(N) + λ1·L_att(N) + λ2·L_multi(N);
  • where N denotes the detection network implementing the multi-task target detection method, L_det denotes the loss of the detection task, L_att denotes the loss of the attention module, L_multi denotes the loss of the key point detection task and/or the instance segmentation task, and λ1 and λ2 are preset loss adjustment factors.
  • FIG. 6 shows a hardware structure diagram of an electronic device.
  • the electronic device also includes:
  • the aforementioned memory 41, processor 42, input device 43, and output device 44 are connected via a bus 45.
  • the input device 43 may specifically be a camera, a touch panel, a physical button, a mouse, and so on.
  • the output device 44 may specifically be a display screen.
  • the memory 41 may be a high-speed random access memory (RAM) or a non-volatile memory, such as a magnetic disk memory.
  • the memory 41 is used to store a group of executable program codes, and the processor 42 is coupled with the memory 41.
  • the embodiments of the present disclosure also provide a computer-readable storage medium, which may be provided in the electronic device of each of the above embodiments, for example the electronic device of the embodiment shown in FIG. 6 above.
  • a computer program is stored on the computer-readable storage medium, and when the program is executed by the processor, the multi-task target detection method described in the embodiment shown in FIG. 1 is implemented.
  • the computer storage medium may also be a USB flash drive, a removable hard disk, a read-only memory (ROM), a random access memory (RAM), a magnetic disk, an optical disk, or any other medium that can store program code.

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • General Physics & Mathematics (AREA)
  • Data Mining & Analysis (AREA)
  • Artificial Intelligence (AREA)
  • Bioinformatics & Cheminformatics (AREA)
  • Bioinformatics & Computational Biology (AREA)
  • Evolutionary Biology (AREA)
  • Evolutionary Computation (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • General Engineering & Computer Science (AREA)
  • Multimedia (AREA)
  • Image Analysis (AREA)

Abstract

The present invention is applied in the technical field of image processing. Disclosed are a multi-task target detection method and apparatus, an electronic device, and a storage medium. An attention-aware convolutional feature map of a target is extracted using cascaded attention modules, so that attention-aware convolutional features of a whole image can be generated from coarse to fine to suppress interference of background noise. A local component feature, a global structure feature, a spatial context feature, and a multi-task feature of the target are extracted on the basis of the attention-aware convolutional feature map, and at least one of a detection task, a key point detection task, and an instance segmentation task of the target is implemented on the basis of these features. The global structure, local components, and context information of the target can thus be effectively associated to form a structured feature expression, enhancing the robustness of the features to occlusion, deformation, posture changes, and the like, and improving multi-task performance.

Description

Multi-task target detection method, apparatus, electronic device, and storage medium
This application claims priority to Chinese Patent Application No. 202010422038.3, filed on May 18, 2020 by 中科视语(北京)科技有限公司 and entitled "Multi-task target detection method, apparatus, electronic device, and storage medium", the entire content of which is incorporated herein by reference.
Technical Field
The present invention relates to the field of image processing technology, and in particular to a multi-task target detection method, apparatus, electronic device, and storage medium.
Background
Target detection is a basic task in computer vision and a prerequisite for many other tasks. The difficulties of target detection appear in almost all of those tasks, in more complex and varied forms, such as background noise interference, target occlusion, truncation, posture change, and deformation. Multi-task design on top of target detection has long been an active topic: predicting multiple tasks with a single network not only saves computation but also improves the generalization ability of the model.
Existing multi-task frameworks such as Mask R-CNN are highly extensible and widely used, but such frameworks do not consider the influence of the environment or of the target's own state and lack a targeted structure and learning strategy, so their expressive power still needs to be enhanced. Overall, there is currently a lack of an integrated solution for multi-task problems such as environmental interference and target posture changes.
Summary of the Invention
The main purpose of the present disclosure is to provide a multi-task target detection method, apparatus, electronic device, and storage medium, which can solve at least one of the above technical problems.
To achieve the foregoing objectives, a first aspect of the embodiments of the present disclosure provides a multi-task target detection method, including:
obtaining an image of a target to be detected;
extracting an attention-aware convolutional feature map of the target using cascaded attention modules;
extracting local component features, global structural features, spatial context features, and multi-task features of the target based on the attention-aware convolutional feature map;
realizing at least one of a detection task, a key point detection task, and an instance segmentation task of the target based on the local component features, global structural features, spatial context features, and multi-task features of the target.
Optionally, extracting the attention-aware convolutional feature map of the target using the cascaded spatial attention modules includes:
inserting attention modules at multiple preset downsampling multiples of a preset basic network to obtain multiple attention maps;
multiplying the multiple attention maps channel by channel with the convolutional feature maps at the corresponding downsampling multiples to obtain the attention-aware convolutional feature maps.
Optionally, extracting the local component features, global structural features, spatial context features, and multi-task features of the target based on the attention-aware convolutional feature map includes:
extracting candidate frames containing the target on the attention-aware convolutional feature map;
extracting the local component features, global structural features, spatial context features, and multi-task features of the target based on the attention-aware convolutional feature map and the candidate frames.
Optionally, realizing the detection task of the target based on the local component features, global structural features, spatial context features, and multi-task features of the target includes:
fusing the local component features, global structural features, and spatial context features of the target to obtain a structured feature of the target;
realizing the detection task of the target based on the structured feature.
Optionally, realizing the key point detection task and/or the instance segmentation task of the target based on the local component features, global structural features, spatial context features, and multi-task features of the target includes:
up-sampling the structured feature so that its resolution is the same as that of the multi-task feature;
fusing the up-sampled structured feature with the multi-task feature to obtain a fused feature;
performing key point detection on the fused feature to realize the key point detection task of the target, and/or performing instance segmentation on the fused feature to realize the instance segmentation task of the target.
Optionally, extracting the local component feature of the target based on the attention-aware convolutional feature map and the candidate frame includes:
passing the attention-aware convolutional feature map through a 1×1 convolutional layer to obtain a component-sensitive feature map;
mapping the candidate frame onto the component-sensitive feature map through PSRoIPooling, and dividing the candidate frame into k×k candidate frame blocks, so that each candidate frame block represents a local component and each candidate frame forms a k×k component feature;
average-pooling each k×k component feature to obtain the local component feature of the target.
Optionally, extracting the global structural feature of the target based on the attention-aware convolutional feature map and the candidate frame includes:
reducing the dimensionality of the attention-aware convolutional feature map through a 1×1 convolutional layer to obtain a set of dimension-reduced convolutional feature maps;
mapping the candidate frame onto the dimension-reduced convolutional feature map through RoIPooling, and dividing the candidate frame into k×k candidate frame blocks, so that each candidate frame block forms a k×k global feature;
treating each k×k global feature as a whole and encoding it through two convolutional layers of sizes k×k and 1×1 to obtain the global structural feature of the target.
Optionally, extracting the context structure feature of the target based on the attention-aware convolutional feature map and the candidate frame includes:
reducing the dimensionality of the attention-aware convolutional feature map through a 1×1 convolutional layer to obtain a set of dimension-reduced convolutional feature maps;
keeping the center point of each candidate frame unchanged, and expanding the area of each candidate frame by a preset multiple;
mapping the expanded candidate frame onto the dimension-reduced convolutional feature map through RoIPooling, and dividing the expanded candidate frame into k×k candidate frame blocks, so that each candidate frame forms a k×k context feature;
treating each k×k context feature as a whole and encoding it through two convolutional layers of sizes k×k and 1×1 to obtain the context structure feature of the target.
Optionally, extracting the multi-task feature of the target based on the attention-aware convolutional feature map and the candidate frame includes:
reducing the dimensionality of the attention-aware convolutional feature map through a 1×1 convolutional layer to obtain a set of dimension-reduced convolutional feature maps;
mapping the candidate frame onto the dimension-reduced convolutional feature map through RoIPooling, and dividing the candidate frame into 2k×2k candidate frame blocks, so that each candidate frame forms a 2k×2k feature;
encoding each 2k×2k feature through four 3×3 convolutions with 256 channels;
up-sampling the encoded feature of each candidate frame by a preset multiple to obtain the multi-task feature of the target.
Optionally, the loss of at least one of the detection task, key point detection task, and instance segmentation task of the target is computed through a preset loss model;
the preset loss model is:
Loss = L_det(N) + λ1·L_att(N) + λ2·L_multi(N);
where N denotes the detection network implementing the multi-task target detection method, L_det denotes the loss of the detection task, L_att denotes the loss of the attention module, L_multi denotes the loss of the key point detection task and/or the instance segmentation task, and λ1 and λ2 are preset loss adjustment factors.
A second aspect of the embodiments of the present disclosure provides a multi-task target detection apparatus, including:
an acquisition module, configured to acquire an image of a target to be detected;
a first extraction module, configured to extract an attention-aware convolutional feature map of the target using cascaded spatial attention modules;
a second extraction module, configured to extract local component features, global structural features, spatial context features, and multi-task features of the target based on the attention-aware convolutional feature map;
a task realization module, configured to realize at least one of a detection task, a key point detection task, and an instance segmentation task of the target based on the local component features, global structural features, spatial context features, and multi-task features of the target.
A third aspect of the embodiments of the present disclosure provides an electronic device, including:
a memory, a processor, and a computer program stored on the memory and runnable on the processor, wherein the processor, when executing the program, implements the multi-task target detection method provided by the first aspect of the embodiments of the present disclosure.
A fourth aspect of the embodiments of the present disclosure provides a computer-readable storage medium on which a computer program is stored; when the computer program is executed by a processor, the multi-task target detection method provided by the first aspect of the embodiments of the present disclosure is implemented.
As can be seen from the above embodiments, the multi-task target detection method, apparatus, electronic device, and storage medium provided by the present disclosure use cascaded attention modules to extract the attention-aware convolutional feature map of the target, which can generate attention-aware convolutional features of the whole image from coarse to fine and suppress the interference of background noise. Based on the attention-aware convolutional feature map, the local component features, global structural features, spatial context features, and multi-task features of the target are extracted, and at least one of the detection task, key point detection task, and instance segmentation task of the target is realized on the basis of those features. This effectively associates the global structure, local components, and context information of the target into a structured feature expression, improving the robustness of the features to occlusion, deformation, and posture changes, and improving multi-task performance.
Brief Description of the Drawings
To explain the embodiments of the present invention or the technical solutions in the prior art more clearly, the drawings needed in the description of the embodiments or the prior art are briefly introduced below. Obviously, the drawings in the following description are merely exemplary; for those of ordinary skill in the art, other implementation drawings can be derived from the provided drawings without creative work.
The structures, proportions, sizes, and the like shown in this specification are only used to match the content disclosed in the specification, for understanding and reading by those familiar with this technology, and are not intended to limit the conditions under which the present invention can be implemented; they therefore carry no substantive technical significance. Any structural modification, change of proportional relationship, or adjustment of size that does not affect the effects and objectives achievable by the present invention shall still fall within the scope covered by the technical content disclosed herein.
FIG. 1 is a schematic flowchart of a multi-task target detection method provided by an embodiment of the present disclosure;
FIG. 2 is a schematic diagram of a multi-task spatial attention mechanism provided by an embodiment of the present disclosure;
FIG. 3 is a schematic flowchart of step S103 in the multi-task target detection method provided by an embodiment of the present disclosure;
FIG. 4 is a schematic structural diagram of a model implementing the multi-task target detection method provided by an embodiment of the present disclosure;
FIG. 5 is a schematic structural diagram of a multi-task target detection device provided by an embodiment of the present disclosure;
FIG. 6 is a schematic diagram of the hardware structure of an electronic device.
Detailed Description
The implementation of the present invention is illustrated below by specific embodiments; those familiar with this technology can easily understand other advantages and effects of the present invention from the content disclosed in this specification. Obviously, the described embodiments are a part, rather than all, of the embodiments of the present invention. Based on the embodiments of the present invention, all other embodiments obtained by those of ordinary skill in the art without creative work shall fall within the protection scope of the present invention.
请参阅图1,图1为本公开一实施例提供的多任务的目标检测方法的流程示意图,该方法主要包括以下步骤:Please refer to FIG. 1. FIG. 1 is a schematic flowchart of a multi-task target detection method provided by an embodiment of the present disclosure. The method mainly includes the following steps:
S101、获取待检测目标的图像;S101. Obtain an image of a target to be detected;
S102、利用级联式的空间注意力模块,提取该目标的注意力感知的卷积特征图;S102. Use the cascaded spatial attention module to extract the convolutional feature map of the target's attention perception;
S103、基于该注意力感知的卷积特征图,提取该目标的局部部件特征、全 局结构特征、空间上下文特征以及多任务特征;S103: Extracting local component features, global structural features, spatial context features, and multi-task features of the target based on the convolution feature map of the attention perception;
S104、基于该目标的局部部件特征、全局结构特征、空间上下文特征以及多任务特征,实现该目标的检测任务、关键点检测任务、实例分割任务中的至少一个。S104: Based on the local component feature, global structural feature, spatial context feature, and multi-task feature of the target, at least one of a detection task, a key point detection task, and an instance segmentation task of the target is realized.
在步骤S101中,该图像可以是任一图像。该目标可以是人、动物、花草等等,本实施例对此不做任何限制。In step S101, the image can be any image. The target can be people, animals, flowers and plants, etc. This embodiment does not impose any limitation on this.
在步骤S102中,每个注意力模块用于产生逐像素的前景和/或背景注意力图,多个注意力模块之间通过级联的方式连接,可由粗到细地学习全图的空间区域,实现对前景特征的增强和对背景特征的削弱,从而不断精调基础网络特征,最后得到更加全面精准的注意力感知的基础网络特征,然后将注意力感知的基础网络特征作用于卷积特征图上,获得注意力感知的卷积特征图。因此,通过步骤S102在全图的基础上产生注意力感知的卷积特征图,可有效地过滤掉图像中背景的干扰并增强前景目标的特征表达。In step S102, each attention module is used to generate a pixel-by-pixel foreground and/or background attention map, and multiple attention modules are connected in a cascaded manner to learn the spatial region of the whole image from coarse to fine. Realize the enhancement of the foreground features and the weakening of the background features, thereby continuously fine-tuning the basic network features, and finally get more comprehensive and accurate basic network features for attention perception, and then apply the basic network features of attention perception to the convolutional feature map Above, get the convolutional feature map of attention perception. Therefore, by generating an attention-perceived convolution feature map on the basis of the full image in step S102, the background interference in the image can be effectively filtered out and the feature expression of the foreground target can be enhanced.
在步骤S103中,显示地提取目标的局部部件特征、全局结构特征、空间上下文特征以及多任务特征,增强对目标的描述力。其中,局部部件特征,比如人的眼睛、鼻子、嘴巴等特定的部件;全局结构特征,比如人体的直立结构;空间上下文特征,主要用来提取目标周围的空间上下文信息,比如人在室内环境,飞机在天空中等;多任务特征,主要用来提取关键点和/或分割特征。In step S103, the local component features, global structural features, spatial context features, and multi-task features of the target are explicitly extracted to enhance the descriptive power of the target. Among them, local component features, such as human eyes, nose, mouth and other specific components; global structural features, such as the upright structure of the human body; spatial context features, are mainly used to extract spatial context information around the target, such as people in an indoor environment. The aircraft is in the middle of the sky; multi-task features are mainly used to extract key points and/or segmentation features.
The four extraction processes for local part features, global structure features, spatial context features, and multi-task features may be executed either serially or in parallel.
In step S104, the local part features, global structure features, and spatial context features of the target are normalized and then coupled together to form a complete structured feature of the target, which can be used for the target detection task. The structured feature is further coupled with the multi-task features via upsampling, and the resulting fused feature can be used for the keypoint detection and instance segmentation tasks, enabling end-to-end multi-task training and testing.
In one embodiment of the present application, referring to FIG. 2, which is a schematic diagram of a multi-task spatial attention mechanism provided by an embodiment of the present disclosure, the mechanism implements step S102 of the present disclosure (correspondingly, the multi-task coupling network in FIG. 2 implements steps S103 and S104). Step S102 includes: inserting attention modules at multiple preset downsampling multiples of a preset backbone network to obtain multiple attention maps; and multiplying each attention map channel by channel with the convolutional feature map at the corresponding downsampling multiple to obtain the attention-aware convolutional feature map. For example, an attention module is inserted at each downsampling stage of the preset backbone. Taking downsampling multiples of 4, 8, and 16, three attention maps are produced and multiplied channel by channel with the convolutional feature maps at the corresponding multiples (i.e., the 4× attention map with the 4× feature map, the 8× attention map with the 8× feature map, and the 16× attention map with the 16× feature map). Background noise is thus suppressed from coarse to fine and foreground feature expression is enhanced, guiding the feature learning of the preset backbone and yielding the final attention-aware convolutional feature map.
Specifically, no attention module is placed after Conv1, mainly because shallow features lack sufficient semantic information, so the attention maps produced at that depth tend to be highly inaccurate. Each attention module predicts an attention map A expressing the confidence that each position belongs to a target. An attention module consists of two 3×3 convolutional layers with 256 channels, followed by a 1×1 convolutional layer for foreground/background classification, and finally a sigmoid activation that normalizes the output to the range 0-1 to produce the final attention map. The generated attention map is class-agnostic and has a single channel. The attention map is then multiplied, by broadcasting, channel by channel with the convolutional features at the corresponding downsampling multiple, and the product serves as the input to the next stage. Repeating this process throughout the preset backbone gradually guides its feature learning and finally yields the attention-aware convolutional feature map.
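As an illustration, a minimal sketch of one such attention module follows, assuming a PyTorch implementation. The two 3×3 convolutions with 256 channels, the 1×1 foreground/background convolution, and the sigmoid normalization follow the description above; the ReLU activations between convolutions and all names are illustrative assumptions.

```python
import torch
import torch.nn as nn

class SpatialAttention(nn.Module):
    """One attention module: predicts a single-channel, class-agnostic
    attention map A and re-weights the input features with it."""

    def __init__(self, in_channels: int):
        super().__init__()
        self.body = nn.Sequential(
            nn.Conv2d(in_channels, 256, kernel_size=3, padding=1),
            nn.ReLU(inplace=True),  # activations between convs are assumed
            nn.Conv2d(256, 256, kernel_size=3, padding=1),
            nn.ReLU(inplace=True),
            nn.Conv2d(256, 1, kernel_size=1),  # foreground/background logit
        )

    def forward(self, feat: torch.Tensor) -> torch.Tensor:
        attn = torch.sigmoid(self.body(feat))  # N x 1 x H x W, values in 0-1
        return feat * attn                     # broadcast multiply over channels
```

In use, one such module would sit after each downsampling stage at strides 4, 8, and 16 (but not after Conv1), with the re-weighted output feeding the next backbone stage so that the attention is refined from coarse to fine.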
In one embodiment of the present application, referring to FIG. 3, step S103 includes:
S1031. Extract candidate boxes containing the target from the attention-aware convolutional feature map;
S1032. Based on the attention-aware convolutional feature map and the candidate boxes, extract the local part features, global structure features, spatial context features, and multi-task features of the target.
Specifically, a region proposal network (RPN) may be used to extract the candidate boxes; the generated candidate boxes containing the target are denoted P.
In one embodiment of the present application, step S104 includes: fusing the local part features, global structure features, and spatial context features of the target to obtain a structured feature of the target; and performing the target detection task based on the structured feature.
Specifically, the local part features, global structure features, and spatial context features are normalized and then coupled together to form a complete structured feature of the target, which can be used for the target detection task.
In one embodiment of the present application, step S104 includes: upsampling the structured feature so that its resolution matches that of the multi-task features; fusing the upsampled structured feature with the multi-task features to obtain a fused feature; and performing keypoint detection on the fused feature to accomplish the keypoint detection task, and/or performing instance segmentation on the fused feature to accomplish the instance segmentation task.
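A hedged sketch of this coupling step follows. The text does not fix the exact normalization or fusion operators, so L2 normalization and element-wise addition are used here as one plausible reading, and matching channel widths are assumed (a 1×1 convolution could reconcile them otherwise); all function names are illustrative.

```python
import torch
import torch.nn.functional as F

def structured_feature(local, global_, context):
    # each input: K x (C+1) per-proposal vectors; L2 normalization stands in
    # for the unspecified normalization operation
    parts = [F.normalize(f, dim=1) for f in (local, global_, context)]
    return sum(parts)  # K x (C+1) structured feature, used for detection

def fuse_for_dense_tasks(structured, multitask_feat):
    # structured: K x C'; multitask_feat: K x C' x H x W (equal C' assumed)
    up = F.interpolate(structured[:, :, None, None],
                       size=multitask_feat.shape[-2:], mode="nearest")
    return multitask_feat + up  # fused feature for keypoints / segmentation
```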
In one embodiment of the present application, referring to FIG. 4, step S1032 includes: passing the attention-aware convolutional feature map through a 1×1 convolutional layer to obtain a part-sensitive feature map; mapping the candidate box onto the part-sensitive feature map via PSRoIPooling and dividing the candidate box into k×k blocks, so that each block represents a local part and each candidate box forms a k×k part feature; and average-pooling each k×k part feature to obtain the local part features of the target.
Specifically, a part-sensitive score map is generated from the attention-aware convolutional feature map by a 1×1 convolution with k²(C+1) filters, where k (typically 7) means that the target is divided into k×k equally sized blocks, each block representing a local part, and C is the total number of target classes. In other words, k² feature channels are produced for each target class, and each channel encodes one local part of the target. The PSRoIPooling operation from "R-FCN: Object detection via region-based fully convolutional networks" is adopted to extract the local part features. The pooled local part feature has size k²(C+1), from which a 1×1×(C+1) feature is obtained by a weighted average within the channels.
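As a concrete illustration, the local branch could be sketched as follows using torchvision's position-sensitive RoI pooling. The 256 input channels, the stride-16 spatial scale, the example values of k and C, and the plain mean in place of the in-channel weighted average are assumptions.

```python
import torch.nn as nn
from torchvision.ops import ps_roi_pool

k, C = 7, 80  # parts per side and number of object classes (illustrative)
score_conv = nn.Conv2d(256, k * k * (C + 1), kernel_size=1)  # part-sensitive scores

def local_part_features(attn_feat, rois, spatial_scale=1.0 / 16):
    # attn_feat: N x 256 x H x W; rois: K x 5 rows of (batch_idx, x1, y1, x2, y2)
    score_map = score_conv(attn_feat)                 # N x k*k*(C+1) x H x W
    parts = ps_roi_pool(score_map, rois, output_size=k,
                        spatial_scale=spatial_scale)  # K x (C+1) x k x k
    return parts.mean(dim=(2, 3))                     # K x (C+1) local part feature
```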
In one embodiment of the present application, referring to FIG. 4, step S1032 includes: reducing the dimensionality of the attention-aware convolutional feature map through a 1×1 convolutional layer to obtain a set of dimension-reduced convolutional feature maps; mapping the candidate box onto the dimension-reduced feature maps via RoIPooling and dividing the candidate box into k×k blocks, so that each candidate box forms a k×k global feature; and treating each k×k global feature as a whole and encoding it with two convolutional layers of sizes k×k and 1×1 to obtain the global structure features of the target.
Specifically, as with the extraction of the local part features, the candidate box is divided into k×k blocks and each block is pooled separately, but this branch differs from the local branch in two respects: 1) every feature channel extracts all k×k features, i.e., the channels here are neither class-specific nor position-specific, and the candidate boxes carry no score-sensitive or position-sensitive property; 2) after pooling, all blocks are combined into a single whole with spatial resolution k×k, which is further encoded into a global structure feature by two convolutional layers with filter sizes k×k and 1×1, again outputting a 1×1×(C+1) feature.
Since targets often appear at different scales, the RoIPooling operation from Faster R-CNN is used to extract features, which unifies the global structure features into scale-normalized features: whether the target is large or small, its global structure feature has the same size.
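Under the same assumptions, the global branch might look like the sketch below; only the k×k and 1×1 kernel sizes come from the description, while the 256-channel widths, the ReLU, and the stride-16 scale are illustrative.

```python
import torch.nn as nn
from torchvision.ops import roi_pool

k, C = 7, 80  # same illustrative values as in the local-branch sketch
reduce_conv = nn.Conv2d(256, 256, kernel_size=1)  # 1x1 dimensionality reduction
encode = nn.Sequential(
    nn.Conv2d(256, 256, kernel_size=k),           # k x k kernel collapses k x k to 1 x 1
    nn.ReLU(inplace=True),
    nn.Conv2d(256, C + 1, kernel_size=1),
)

def global_structure_features(attn_feat, rois, spatial_scale=1.0 / 16):
    pooled = roi_pool(reduce_conv(attn_feat), rois, output_size=k,
                      spatial_scale=spatial_scale)  # K x 256 x k x k
    return encode(pooled).flatten(1)                # K x (C+1), scale-normalized
```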
In one embodiment of the present application, referring to FIG. 4, step S1032 includes: reducing the dimensionality of the attention-aware convolutional feature map through a 1×1 convolutional layer to obtain a set of dimension-reduced convolutional feature maps; keeping the center point of each candidate box fixed while enlarging its area by a preset multiple; mapping the enlarged candidate box onto the dimension-reduced feature maps via RoIPooling and dividing the enlarged candidate box into k×k blocks, so that each candidate box forms a k×k context feature; and treating each k×k context feature as a whole and encoding it with two convolutional layers of sizes k×k and 1×1 to obtain the context structure features of the target.
Specifically, context is among the most basic and important cues and is widely used in visual recognition tasks. For example, a ship appears on water rather than in the sky, which suggests that information around a target usually helps discriminate its semantic class. Moreover, the effective receptive field of a network is much smaller than its theoretical receptive field, so gathering information around the target effectively reduces misrecognition. In the present disclosure, context structure features are extracted in the same way as global structure features, except that beforehand each candidate box keeps the coordinates of its center point fixed while its area is enlarged to twice the original.
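The only new ingredient relative to the global branch is the box enlargement, sketched below. Since "twice the original area" could be read as doubling the sides or doubling the area, the scale is left as a parameter: scale=2.0 doubles the side lengths, while scale=2**0.5 doubles the area.

```python
import torch

def expand_rois(rois: torch.Tensor, scale: float = 2.0) -> torch.Tensor:
    # rois: K x 5 rows of (batch_idx, x1, y1, x2, y2); center points stay fixed
    cx = (rois[:, 1] + rois[:, 3]) / 2
    cy = (rois[:, 2] + rois[:, 4]) / 2
    half_w = (rois[:, 3] - rois[:, 1]) * scale / 2
    half_h = (rois[:, 4] - rois[:, 2]) * scale / 2
    out = rois.clone()
    out[:, 1], out[:, 2] = cx - half_w, cy - half_h
    out[:, 3], out[:, 4] = cx + half_w, cy + half_h
    return out
```

The context branch then simply runs the global-structure sketch above on expand_rois(rois).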
In one embodiment of the present application, referring to FIG. 4, step S1032 includes: reducing the dimensionality of the attention-aware convolutional feature map through a 1×1 convolutional layer to obtain a set of dimension-reduced convolutional feature maps; mapping the candidate box onto the dimension-reduced feature maps via RoIPooling and dividing the candidate box into 2k×2k blocks, so that each candidate box forms a 2k×2k feature; encoding each 2k×2k feature with four 3×3 convolutions of 256 channels; and upsampling the encoded feature of each candidate box by a preset multiple to obtain the multi-task features of the target.
Specifically, the candidate box is divided into 2k×2k blocks and features are likewise extracted by RoIPooling, giving a spatial resolution of 2k×2k; these are further encoded by four 3×3 convolutional layers with 256 channels. Since the keypoint detection and instance segmentation tasks demand high spatial resolution, an upsampling layer then restores the spatial structure information; the upsampling rate may be set to 2×, 4×, and so on. The upsampled features are the multi-task features.
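A sketch of the multi-task branch under the same assumptions (a 256-channel 1×1 reduction, stride-16 feature map, and bilinear 2× upsampling as one of the admissible rates):

```python
import torch.nn as nn
import torch.nn.functional as F
from torchvision.ops import roi_pool

k = 7  # same illustrative value as in the earlier sketches
mt_reduce = nn.Conv2d(256, 256, kernel_size=1)
mt_encode = nn.Sequential(*[
    layer
    for _ in range(4)  # four 3x3, 256-channel convolutions per the description
    for layer in (nn.Conv2d(256, 256, kernel_size=3, padding=1),
                  nn.ReLU(inplace=True))
])

def multitask_features(attn_feat, rois, spatial_scale=1.0 / 16, up=2):
    pooled = roi_pool(mt_reduce(attn_feat), rois, output_size=2 * k,
                      spatial_scale=spatial_scale)  # K x 256 x 2k x 2k
    encoded = mt_encode(pooled)
    return F.interpolate(encoded, scale_factor=up,  # restores spatial detail
                         mode="bilinear", align_corners=False)
```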
In one embodiment of the present application, the loss of at least one of the target detection task, keypoint detection task, and instance segmentation task is computed through a preset loss model;
The preset loss model is:
Loss = L_det(N) + λ1·L_att(N) + λ2·L_multi(N);
where N denotes the detection network implementing the multi-task target detection method, L_det denotes the loss of the detection task, L_att denotes the loss of the attention modules, L_multi denotes the loss of the keypoint detection task and/or instance segmentation task, and λ1 and λ2 are preset loss weighting factors.
Specifically, the present disclosure adopts a two-stage detection method: candidate boxes are first produced by the RPN and then further classified and regressed by the coupling network, so the detection loss comprises the classification and regression losses of both the RPN and the coupling network. Both regression losses use the smooth L1 loss; the RPN classification loss is a binary cross-entropy loss, while the coupling network's classification loss is a multi-class cross-entropy loss. L_att, the loss of the spatial attention modules, is also a binary (foreground/background) cross-entropy loss. L_multi is the loss of the other tasks: it may be the keypoint loss, the instance segmentation loss, or the sum of both (when keypoint detection and instance segmentation are performed simultaneously). λ1 and λ2 are loss weighting factors and can be set as needed. In one example, λ1 is set to 0.25 and λ2 to 1; the detection head samples positives and negatives at a ratio of 1:4 with a sample threshold of 0.5, i.e., a proposal whose IoU with the ground truth exceeds 0.5 is taken as a positive sample and otherwise as a negative sample. The RPN samples positives and negatives at 1:1, with a positive threshold of 0.7 and a negative threshold of 0.3.
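A hedged sketch of the overall objective follows. The weighted sum and the example values λ1 = 0.25, λ2 = 1 come from the description; the attention supervision target (a binary foreground mask resized to each map, e.g. rasterized from the ground-truth boxes) is an assumption, since the text only states that the attention loss is a binary cross-entropy.

```python
import torch.nn.functional as F

def attention_loss(attn_maps, fg_mask):
    # attn_maps: sigmoid outputs (N x 1 x Hi x Wi) from each attention module;
    # fg_mask: N x 1 x H x W float mask with 1 on foreground, 0 on background
    loss = 0.0
    for a in attn_maps:
        target = F.interpolate(fg_mask, size=a.shape[-2:], mode="nearest")
        loss = loss + F.binary_cross_entropy(a, target)
    return loss / len(attn_maps)

def total_loss(l_det, l_att, l_multi, lam1=0.25, lam2=1.0):
    # Loss = L_det + λ1·L_att + λ2·L_multi with the example settings above
    return l_det + lam1 * l_att + lam2 * l_multi
```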
Referring to FIG. 5, which is a schematic structural diagram of a multi-task target detection apparatus provided by an embodiment of the present disclosure, the apparatus includes:
an acquisition module 201, configured to acquire an image of a target to be detected;
a first extraction module 202, configured to extract an attention-aware convolutional feature map of the target using a cascaded spatial attention module;
a second extraction module 203, configured to extract local part features, global structure features, spatial context features, and multi-task features of the target based on the attention-aware convolutional feature map;
a task realization module 204, configured to perform at least one of a target detection task, a keypoint detection task, and an instance segmentation task based on the local part features, global structure features, spatial context features, and multi-task features of the target.
In one embodiment of the present application, the first extraction module 202 includes: an insertion submodule, configured to insert attention modules at multiple preset downsampling multiples of a preset backbone network to obtain multiple attention maps; and a multiplication submodule, configured to multiply each attention map channel by channel with the convolutional feature map at the corresponding downsampling multiple to obtain the attention-aware convolutional feature map.
In one embodiment of the present application, the second extraction module 203 includes: a first extraction submodule, configured to extract candidate boxes containing the target from the attention-aware convolutional feature map; and a second extraction submodule, configured to extract the local part features, global structure features, spatial context features, and multi-task features of the target based on the attention-aware convolutional feature map and the candidate boxes.
In one embodiment of the present application, the task realization module 204 includes: a first feature fusion submodule, configured to fuse the local part features, global structure features, and spatial context features of the target to obtain a structured feature of the target; and a detection task submodule, configured to perform the target detection task based on the structured feature.
In one embodiment of the present application, the task realization module 204 includes: a first upsampling submodule, configured to upsample the structured feature so that its resolution matches that of the multi-task features; a second feature fusion submodule, configured to fuse the upsampled structured feature with the multi-task features to obtain a fused feature; a keypoint detection task submodule, configured to perform keypoint detection on the fused feature to accomplish the keypoint detection task; and/or an instance segmentation task submodule, configured to perform instance segmentation on the fused feature to accomplish the instance segmentation task.
In one embodiment of the present application, the second extraction submodule includes: a first dimension reduction submodule, configured to pass the attention-aware convolutional feature map through a 1×1 convolutional layer to obtain a part-sensitive feature map; a first mapping and division submodule, configured to map the candidate box onto the part-sensitive feature map via PSRoIPooling and divide the candidate box into k×k blocks, so that each block represents a local part and each candidate box forms a k×k part feature; and a pooling submodule, configured to average-pool each k×k part feature to obtain the local part features of the target.
In one embodiment of the present application, the second extraction submodule includes: a second dimension reduction submodule, configured to reduce the dimensionality of the attention-aware convolutional feature map through a 1×1 convolutional layer to obtain a set of dimension-reduced convolutional feature maps; a second mapping and division submodule, configured to map the candidate box onto the dimension-reduced feature maps via RoIPooling and divide the candidate box into k×k blocks, so that each candidate box forms a k×k global feature; and a first encoding submodule, configured to treat each k×k global feature as a whole and encode it with two convolutional layers of sizes k×k and 1×1 to obtain the global structure features of the target.
In one embodiment of the present application, the second extraction submodule includes: a third dimension reduction submodule, configured to reduce the dimensionality of the attention-aware convolutional feature map through a 1×1 convolutional layer to obtain a set of dimension-reduced convolutional feature maps; an area expansion submodule, configured to keep the center point of each candidate box fixed while enlarging its area by a preset multiple; a third mapping and division submodule, configured to map the enlarged candidate box onto the dimension-reduced feature maps via RoIPooling and divide the enlarged candidate box into k×k blocks, so that each candidate box forms a k×k context feature; and a second encoding submodule, configured to treat each k×k context feature as a whole and encode it with two convolutional layers of sizes k×k and 1×1 to obtain the context structure features of the target.
In one embodiment of the present application, the second extraction submodule includes: a fourth dimension reduction submodule, configured to reduce the dimensionality of the attention-aware convolutional feature map through a 1×1 convolutional layer to obtain a set of dimension-reduced convolutional feature maps; a fourth mapping and division submodule, configured to map the candidate box onto the dimension-reduced feature maps via RoIPooling and divide the candidate box into 2k×2k blocks, so that each candidate box forms a 2k×2k feature; a third encoding submodule, configured to encode each 2k×2k feature with four 3×3 convolutions of 256 channels; and a second upsampling submodule, configured to upsample the encoded feature of each candidate box by a preset multiple to obtain the multi-task features of the target.
In one embodiment of the present application, the apparatus further includes a loss detection module, configured to compute, through a preset loss model, the loss of at least one of the target detection task, keypoint detection task, and instance segmentation task;
The preset loss model is:
Loss = L_det(N) + λ1·L_att(N) + λ2·L_multi(N);
where N denotes the detection network implementing the multi-task target detection method, L_det denotes the loss of the detection task, L_att denotes the loss of the attention modules, L_multi denotes the loss of the keypoint detection task and/or instance segmentation task, and λ1 and λ2 are preset loss weighting factors.
The beneficial effects achievable by the above embodiments of the present disclosure are the same as those of the multi-task target detection method shown in FIG. 1 and are not repeated here.
Referring to FIG. 6, FIG. 6 shows a hardware structure diagram of an electronic device.
The electronic device described in this embodiment includes:
a memory 41, a processor 42, and a computer program stored on the memory 41 and executable on the processor; when executing the program, the processor implements the multi-task target detection method described in the embodiment shown in FIG. 1.
Further, the electronic device also includes:
at least one input device 43; and at least one output device 44.
The memory 41, the processor 42, the input device 43, and the output device 44 are connected via a bus 45.
The input device 43 may specifically be a camera, a touch panel, a physical button, a mouse, or the like, and the output device 44 may specifically be a display screen.
The memory 41 may be a high-speed random access memory (RAM) or a non-volatile memory, such as a disk memory. The memory 41 is used to store a set of executable program code, and the processor 42 is coupled to the memory 41.
Further, an embodiment of the present disclosure also provides a computer-readable storage medium, which may be provided in the electronic device of any of the above embodiments, for example the electronic device of the embodiment shown in FIG. 6. The computer-readable storage medium stores a computer program which, when executed by a processor, implements the multi-task target detection method described in the embodiment shown in FIG. 1. Further, the computer-readable storage medium may be any medium that can store program code, such as a USB flash drive, a removable hard disk, a read-only memory (ROM), a random access memory (RAM), a magnetic disk, or an optical disc.
Although the present disclosure has been described in detail above with general descriptions and specific embodiments, modifications or improvements may be made on its basis, as will be apparent to those skilled in the art. Accordingly, such modifications or improvements made without departing from the spirit of the present disclosure fall within its claimed scope of protection.
Terms cited in this specification such as "upper", "lower", "left", "right", and "middle" are used only for clarity of description and are not intended to limit the implementable scope of the present disclosure; changes or adjustments of their relative relationships, without substantive alteration of the technical content, are also regarded as within the implementable scope of the present disclosure.

Claims (13)

  1. A multi-task target detection method, characterized by comprising:
    acquiring an image of a target to be detected;
    extracting an attention-aware convolutional feature map of the target using a cascaded attention module;
    based on the attention-aware convolutional feature map, extracting local part features, global structure features, spatial context features, and multi-task features of the target;
    based on the local part features, global structure features, spatial context features, and multi-task features of the target, performing at least one of a target detection task, a keypoint detection task, and an instance segmentation task.
  2. The multi-task target detection method according to claim 1, characterized in that extracting the attention-aware convolutional feature map of the target using the cascaded spatial attention module comprises:
    inserting attention modules at multiple preset downsampling multiples of a preset backbone network to obtain multiple attention maps;
    multiplying the multiple attention maps channel by channel with the convolutional feature maps at the corresponding downsampling multiples to obtain the attention-aware convolutional feature map.
  3. The multi-task target detection method according to claim 1, characterized in that extracting the local part features, global structure features, spatial context features, and multi-task features of the target based on the attention-aware convolutional feature map comprises:
    extracting a candidate box containing the target from the attention-aware convolutional feature map;
    based on the attention-aware convolutional feature map and the candidate box, extracting the local part features, global structure features, spatial context features, and multi-task features of the target.
  4. The multi-task target detection method according to claim 1, characterized in that performing the target detection task based on the local part features, global structure features, spatial context features, and multi-task features of the target comprises:
    fusing the local part features, global structure features, and spatial context features of the target to obtain a structured feature of the target;
    performing the target detection task based on the structured feature.
  5. The multi-task target detection method according to claim 1, characterized in that performing the keypoint detection task and/or the instance segmentation task based on the local part features, global structure features, spatial context features, and multi-task features of the target comprises:
    upsampling the structured feature so that the resolution of the structured feature matches the resolution of the multi-task features;
    fusing the upsampled structured feature with the multi-task features to obtain a fused feature;
    performing keypoint detection on the fused feature to accomplish the keypoint detection task of the target, and/or performing instance segmentation on the fused feature to accomplish the instance segmentation task of the target.
  6. The multi-task target detection method according to any one of claims 3 to 5, characterized in that extracting the local part features of the target based on the attention-aware convolutional feature map and the candidate box comprises:
    passing the attention-aware convolutional feature map through a 1×1 convolutional layer to obtain a part-sensitive feature map;
    mapping the candidate box onto the part-sensitive feature map via PSRoIPooling and dividing the candidate box into k×k blocks, so that each block represents a local part and each candidate box forms a k×k part feature;
    average-pooling each k×k part feature to obtain the local part features of the target.
  7. The multi-task target detection method according to any one of claims 3 to 5, characterized in that extracting the global structure features of the target based on the attention-aware convolutional feature map and the candidate box comprises:
    reducing the dimensionality of the attention-aware convolutional feature map through a 1×1 convolutional layer to obtain a set of dimension-reduced convolutional feature maps;
    mapping the candidate box onto the dimension-reduced convolutional feature maps via RoIPooling and dividing the candidate box into k×k blocks, so that each candidate box forms a k×k global feature;
    treating each k×k global feature as a whole and encoding it with two convolutional layers of sizes k×k and 1×1, respectively, to obtain the global structure features of the target.
  8. The multi-task target detection method according to any one of claims 3 to 5, characterized in that extracting the context structure features of the target based on the attention-aware convolutional feature map and the candidate box comprises:
    reducing the dimensionality of the attention-aware convolutional feature map through a 1×1 convolutional layer to obtain a set of dimension-reduced convolutional feature maps;
    keeping the center point of each candidate box fixed while enlarging the area of each candidate box to a preset multiple;
    mapping the enlarged candidate box onto the dimension-reduced convolutional feature maps via RoIPooling and dividing the enlarged candidate box into k×k blocks, so that each candidate box forms a k×k context feature;
    treating each k×k context feature as a whole and encoding it with two convolutional layers of sizes k×k and 1×1, respectively, to obtain the context structure features of the target.
  9. The multi-task target detection method according to any one of claims 3 to 5, characterized in that extracting the multi-task features of the target based on the attention-aware convolutional feature map and the candidate box comprises:
    reducing the dimensionality of the attention-aware convolutional feature map through a 1×1 convolutional layer to obtain a set of dimension-reduced convolutional feature maps;
    mapping the candidate box onto the dimension-reduced convolutional feature maps via RoIPooling and dividing the candidate box into 2k×2k blocks, so that each candidate box forms a 2k×2k feature;
    encoding each 2k×2k feature with four 3×3 convolutions of 256 channels;
    upsampling the encoded feature of each candidate box by a preset multiple to obtain the multi-task features of the target.
  10. The multi-task target detection method according to any one of claims 1 to 5, characterized in that the loss of at least one of the target detection task, keypoint detection task, and instance segmentation task is computed through a preset loss model;
    the preset loss model being:
    Loss = L_det(N) + λ1·L_att(N) + λ2·L_multi(N);
    where N denotes the detection network implementing the multi-task target detection method, L_det denotes the loss of the detection task, L_att denotes the loss of the attention modules, L_multi denotes the loss of the keypoint detection task and/or instance segmentation task, and λ1 and λ2 are preset loss weighting factors.
  11. A multi-task target detection apparatus, characterized by comprising:
    an acquisition module, configured to acquire an image of a target to be detected;
    a first extraction module, configured to extract an attention-aware convolutional feature map of the target using a cascaded spatial attention module;
    a second extraction module, configured to extract local part features, global structure features, spatial context features, and multi-task features of the target based on the attention-aware convolutional feature map;
    a task realization module, configured to perform at least one of a target detection task, a keypoint detection task, and an instance segmentation task based on the local part features, global structure features, spatial context features, and multi-task features of the target.
  12. An electronic device, comprising a memory, a processor, and a computer program stored on the memory and executable on the processor, characterized in that the processor, when executing the computer program, implements the steps of the multi-task target detection method according to any one of claims 1 to 11.
  13. A computer-readable storage medium having a computer program stored thereon, characterized in that the computer program, when executed by a processor, implements the steps of the multi-task target detection method according to any one of claims 1 to 11.
PCT/CN2020/137446 2020-05-18 2020-12-18 Multi-task target detection method and apparatus, electronic device, and storage medium WO2021232771A1 (en)

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
CN202010422038.3A CN111598112B (en) 2020-05-18 2020-05-18 Multitask target detection method and device, electronic equipment and storage medium
CN202010422038.3 2020-05-18

Publications (1)

Publication Number Publication Date
WO2021232771A1 true WO2021232771A1 (en) 2021-11-25

Family

ID=72191519

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/CN2020/137446 WO2021232771A1 (en) 2020-05-18 2020-12-18 Multi-task target detection method and apparatus, electronic device, and storage medium

Country Status (2)

Country Link
CN (1) CN111598112B (en)
WO (1) WO2021232771A1 (en)

Cited By (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN114241277A (en) * 2021-12-22 2022-03-25 中国人民解放军国防科技大学 Attention-guided multi-feature fusion disguised target detection method, device, equipment and medium

Families Citing this family (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN111598112B (en) * 2020-05-18 2023-02-24 中科视语(北京)科技有限公司 Multitask target detection method and device, electronic equipment and storage medium
CN112149683A (en) * 2020-09-30 2020-12-29 华宇金信(北京)软件有限公司 Method and device for detecting living objects in night vision environment
CN112507872B (en) * 2020-12-09 2021-12-28 中科视语(北京)科技有限公司 Positioning method and positioning device for head and shoulder area of human body and electronic equipment
CN113222899B (en) * 2021-04-15 2022-09-30 浙江大学 Method for segmenting and classifying liver tumors through CT detection based on deep learning
CN113902983B (en) * 2021-12-06 2022-03-25 南方医科大学南方医院 Laparoscopic surgery tissue and organ identification method and device based on target detection model

Citations (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20100329517A1 (en) * 2009-06-26 2010-12-30 Microsoft Corporation Boosted face verification
CN108647585A (en) * 2018-04-20 2018-10-12 浙江工商大学 A kind of traffic mark symbol detection method based on multiple dimensioned cycle attention network
CN109886871A (en) * 2019-01-07 2019-06-14 国家新闻出版广电总局广播科学研究院 The image super-resolution method merged based on channel attention mechanism and multilayer feature
CN109948709A (en) * 2019-03-21 2019-06-28 南京斯玛唯得智能技术有限公司 A kind of multitask Attribute Recognition system of target object
CN110197182A (en) * 2019-06-11 2019-09-03 中国电子科技集团公司第五十四研究所 Remote sensing image semantic segmentation method based on contextual information and attention mechanism
CN111598112A (en) * 2020-05-18 2020-08-28 中科视语(北京)科技有限公司 Multitask target detection method and device, electronic equipment and storage medium

Family Cites Families (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US10354159B2 (en) * 2016-09-06 2019-07-16 Carnegie Mellon University Methods and software for detecting objects in an image using a contextual multiscale fast region-based convolutional neural network
CN109543699A (en) * 2018-11-28 2019-03-29 北方工业大学 Image abstract generation method based on target detection
CN111062438B (en) * 2019-12-17 2023-06-16 大连理工大学 Image propagation weak supervision fine granularity image classification algorithm based on correlation learning

Also Published As

Publication number Publication date
CN111598112B (en) 2023-02-24
CN111598112A (en) 2020-08-28

Similar Documents

Publication Publication Date Title
WO2021232771A1 (en) Multi-task target detection method and apparatus, electronic device, and storage medium
CN111340814B (en) RGB-D image semantic segmentation method based on multi-mode self-adaptive convolution
CN111047551A (en) Remote sensing image change detection method and system based on U-net improved algorithm
CN111210443A (en) Deformable convolution mixing task cascading semantic segmentation method based on embedding balance
CN110288602A (en) Come down extracting method, landslide extraction system and terminal
CN112950780B (en) Intelligent network map generation method and system based on remote sensing image
CN111898432A (en) Pedestrian detection system and method based on improved YOLOv3 algorithm
CN112070040A (en) Text line detection method for video subtitles
CN108537109B (en) OpenPose-based monocular camera sign language identification method
CN116129129A (en) Character interaction detection model and detection method
CN113705575B (en) Image segmentation method, device, equipment and storage medium
CN114913325A (en) Semantic segmentation method, device and computer program product
Cheng et al. A survey on image semantic segmentation using deep learning techniques
CN112800932B (en) Method for detecting remarkable ship target in offshore background and electronic equipment
Cambuim et al. An efficient static gesture recognizer embedded system based on ELM pattern recognition algorithm
CN116258931B (en) Visual finger representation understanding method and system based on ViT and sliding window attention fusion
CN116363429A (en) Training method of image recognition model, image recognition method, device and equipment
CN116310324A (en) Pyramid cross-layer fusion decoder based on semantic segmentation
CN114550047B (en) Behavior rate guided video behavior recognition method
AU2021104479A4 (en) Text recognition method and system based on decoupled attention mechanism
CN111583352B (en) Intelligent generation method of stylized icon for mobile terminal
CN113610856A (en) Method and device for training image segmentation model and image segmentation
CN111695507A (en) Static gesture recognition method based on improved VGGNet network and PCA
CN115511968B (en) Two-dimensional hand posture estimation method, device, equipment and storage medium
CN110569790A (en) Residential area element extraction method based on texture enhancement convolutional network

Legal Events

Date Code Title Description
121 Ep: the epo has been informed by wipo that ep was designated in this application

Ref document number: 20936493

Country of ref document: EP

Kind code of ref document: A1

NENP Non-entry into the national phase

Ref country code: DE

32PN Ep: public notification in the ep bulletin as address of the adressee cannot be established

Free format text: NOTING OF LOSS OF RIGHTS PURSUANT TO RULE 112(1) EPC (EPO FORM 1205A DATED 15.03.2023)

122 Ep: pct application non-entry in european phase

Ref document number: 20936493

Country of ref document: EP

Kind code of ref document: A1