WO2021232771A1 - Multi-task target detection method and apparatus, electronic device, and storage medium - Google Patents

Multi-task target detection method and apparatus, electronic device, and storage medium Download PDF

Info

Publication number
WO2021232771A1
WO2021232771A1 (PCT/CN2020/137446)
Authority
WO
WIPO (PCT)
Prior art keywords
feature
task
target
features
attention
Prior art date
Application number
PCT/CN2020/137446
Other languages
French (fr)
Chinese (zh)
Inventor
王金桥
赵朝阳
朱优松
Original Assignee
中科视语(北京)科技有限公司
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by 中科视语(北京)科技有限公司 filed Critical 中科视语(北京)科技有限公司
Publication of WO2021232771A1 publication Critical patent/WO2021232771A1/en

Classifications

    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06V: IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V 10/00: Arrangements for image or video recognition or understanding
    • G06V 10/40: Extraction of image or video features
    • G06V 10/46: Descriptors for shape, contour or point-related descriptors, e.g. scale invariant feature transform [SIFT] or bags of words [BoW]; Salient regional features
    • G06V 10/462: Salient features, e.g. scale invariant feature transforms [SIFT]
    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06F: ELECTRIC DIGITAL DATA PROCESSING
    • G06F 18/00: Pattern recognition
    • G06F 18/20: Analysing
    • G06F 18/25: Fusion techniques
    • G06F 18/253: Fusion techniques of extracted features
    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06V: IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V 2201/00: Indexing scheme relating to image or video recognition or understanding
    • G06V 2201/07: Target detection

Definitions

  • The present invention relates to the field of image processing technology, and in particular to a multi-task target detection method, apparatus, electronic device, and storage medium.
  • Target detection is a basic task in computer vision and a prerequisite for many other tasks.
  • The difficulties of target detection appear in almost all of those tasks, in more complex and varied forms, such as background noise interference, target occlusion, truncation, posture change, and deformation.
  • Multi-task design on top of target detection has long been an active topic: predicting multiple tasks with a single network not only saves computation but also improves the generalization ability of the model.
  • the main purpose of the present disclosure is to provide a multi-task target detection method, device, electronic equipment and storage medium, which can solve at least one of the above technical problems.
  • the first aspect of the embodiments of the present disclosure provides a multi-task target detection method, including:
  • At least one of the target detection task, key point detection task, and instance segmentation task is realized.
  • extracting the attention-aware convolutional feature map of the target using the cascaded spatial attention modules includes:
  • multiplying the multiple attention maps channel by channel with the convolutional feature maps at the corresponding downsampling multiples to obtain the attention-aware convolutional feature maps.
  • the extraction of local component features, global structural features, spatial context features, and multi-task features of the target based on the convolution feature map of the attention perception includes:
  • the detection task to achieve the target includes:
  • the detection task of the target is realized.
  • realizing the key point detection task and/or the instance segmentation task of the target based on the local component feature, global structural feature, spatial context feature, and multi-task feature of the target includes:
  • the extracting the feature of a local component of the target based on the convolution feature map of the attention perception and the candidate frame includes:
  • the candidate frame is mapped onto the component-sensitive feature map through PSRoIPooling and divided into k×k candidate frame blocks, so that each candidate frame block represents a local component and each candidate frame forms a k×k component feature;
  • Each k ⁇ k component feature is averagely pooled to obtain the local component feature of the target.
  • the extracting the global structural feature of the target based on the convolutional feature map of the attention perception and the candidate frame includes:
  • Dimensionality reduction is performed on the convolutional feature maps of the attention perception through a convolutional layer with a size of 1 ⁇ 1, to obtain a set of dimensionality-reduced convolution feature maps;
  • encoding is performed through two convolutional layers with sizes of k ⁇ k and 1 ⁇ 1, respectively, to obtain the global structural feature of the target.
  • the extraction of the context structure feature of the target based on the convolution feature map of the attention perception and the candidate frame includes:
  • the area-expanded candidate frame is mapped onto the dimension-reduced convolutional feature map and divided into k×k candidate frame blocks, so that each candidate frame forms a k×k context feature;
  • the context structure feature of the target is obtained by encoding through two convolutional layers with sizes of k ⁇ k and 1 ⁇ 1, respectively.
  • the extraction of the multi-task feature of the target based on the convolutional feature map of the attention perception and the candidate frame includes:
  • the feature of each candidate frame after encoding is up-sampled with a frequency of a preset multiple to obtain the multi-task feature of the target.
  • the loss of at least one of the detection task, key point detection task, and instance segmentation task of the target is computed through a preset loss model:
  • Loss = L_det(N) + λ1·L_att(N) + λ2·L_multi(N);
  • where N denotes the detection network implementing the multi-task target detection method,
  • L_det denotes the loss of the detection task,
  • L_att denotes the loss of the attention module,
  • L_multi denotes the loss of the key point detection task and/or the instance segmentation task, and
  • λ1 and λ2 are preset loss adjustment factors.
  • a second aspect of the embodiments of the present disclosure provides a multi-task target detection device, including:
  • the acquisition module is used to acquire the image of the target to be detected
  • the first extraction module is configured to use the cascaded spatial attention module to extract the convolutional feature map of the target's attention perception;
  • the second extraction module is configured to extract local component features, global structural features, spatial context features, and multi-task features of the target based on the convolution feature map of the attention perception;
  • the task realization module is used to implement at least one of the target detection task, key point detection task, and instance segmentation task based on the local component feature, global structure feature, spatial context feature, and multi-task feature of the target.
  • a third aspect of the embodiments of the present disclosure provides an electronic device, including:
  • a memory, a processor, and a computer program stored on the memory and capable of running on the processor are characterized in that, when the processor executes the program, the multi-task target detection method provided by the first aspect of the embodiments of the present disclosure is implemented.
  • a fourth aspect of the embodiments of the present disclosure provides a computer-readable storage medium on which a computer program is stored.
  • the computer program is executed by a processor, the multi-task target detection method provided in the first aspect of the embodiments of the present disclosure is implemented.
  • The multi-task target detection method, apparatus, electronic device, and storage medium use cascaded attention modules to extract the attention-aware convolutional feature map of the target, which can generate attention-aware convolutional features of the whole image from coarse to fine and suppress the interference of background noise.
  • Based on the attention-aware convolutional feature map, the local component features, global structural features, spatial context features, and multi-task features of the target are extracted, and at least one of the detection task, key point detection task, and instance segmentation task of the target is realized on the basis of those features. This effectively associates the global structure, local components, and context information of the target into a structured feature expression, improving the robustness of the features to occlusion, deformation, and posture changes, and improving multi-task performance.
  • FIG. 1 is a schematic flowchart of a multi-task target detection method provided by an embodiment of the present disclosure
  • FIG. 2 is a schematic diagram of a multi-task spatial attention mechanism provided by an embodiment of the present disclosure
  • FIG. 3 is a schematic flowchart of step S103 in the multi-task target detection method provided by an embodiment of the present disclosure
  • FIG. 4 is a schematic structural diagram of a model for implementing a multi-task target detection method provided by an embodiment of the present disclosure
  • FIG. 5 is a schematic structural diagram of a multi-task target detection device provided by an embodiment of the present disclosure
  • Figure 6 shows a schematic diagram of the hardware structure of an electronic device.
  • FIG. 1 is a schematic flowchart of a multi-task target detection method provided by an embodiment of the present disclosure. The method mainly includes the following steps:
  • S104 Based on the local component feature, global structural feature, spatial context feature, and multi-task feature of the target, at least one of a detection task, a key point detection task, and an instance segmentation task of the target is realized.
  • the image can be any image.
  • the target can be people, animals, flowers and plants, etc. This embodiment does not impose any limitation on this.
  • each attention module is used to generate a pixel-by-pixel foreground and/or background attention map, and multiple attention modules are connected in a cascaded manner to learn the spatial region of the whole image from coarse to fine.
  • In step S103, the local component features, global structural features, spatial context features, and multi-task features of the target are explicitly extracted to enhance the descriptive power for the target.
  • local component features such as human eyes, nose, mouth and other specific components
  • global structural features such as the upright structure of the human body
  • spatial context features are mainly used to extract the spatial context information around the target, such as a person being in an indoor environment or an airplane being in the sky; multi-task features are mainly used to extract key point and/or segmentation features.
  • the four processes of extracting the target's local component features, global structural features, spatial context features, and multi-task features may be performed in parallel or not in parallel.
  • In step S104, the local component features, global structural features, and spatial context features of the target are coupled together after a normalization operation to form a complete structured feature of the target, which can be used for the detection task of the target.
  • The structured feature is further coupled with the multi-task feature through up-sampling, and the coupled feature can be used for the key point detection task and the instance segmentation task of the target, realizing end-to-end multi-task training and testing.
  • FIG. 2 is a schematic diagram of a multi-task spatial attention mechanism provided by an embodiment of the present disclosure.
  • the multi-task spatial attention mechanism implements step S102 of the present disclosure (understandably, the multi-task coupling network in FIG. 2 implements steps S103 and S104 of the present disclosure).
  • Step S102 includes: inserting attention modules at multiple preset downsampling multiples of the preset basic network to obtain multiple attention maps; and multiplying the multiple attention maps channel by channel with the convolutional feature maps at the corresponding downsampling multiples to obtain the attention-aware convolutional feature maps.
  • For example, an attention module is inserted at each downsampling stage of the preset basic network. Taking downsampling multiples of 4, 8, and 16 as an example, three attention maps are generated, and each attention map is multiplied channel by channel with the preset convolutional feature map at the corresponding downsampling multiple (that is, the attention map at 4× is multiplied by the convolutional feature map at 4×, the attention map at 8× by the convolutional feature map at 8×, and the attention map at 16× by the convolutional feature map at 16×). Background noise interference is thereby suppressed from coarse to fine and the foreground feature expression is enhanced, guiding the feature learning of the preset basic network and producing the final attention-aware convolutional feature map.
  • The present disclosure does not use an attention module after Conv1, mainly because shallow features lack sufficient semantic information, and the attention map generated at that stage is often very inaccurate.
  • For each attention module, an attention map A is predicted to express the confidence that each position belongs to the target.
  • The attention module contains two 3×3 convolutional layers with 256 channels, followed by a 1×1 convolutional layer for foreground/background classification, and finally a sigmoid activation function that normalizes the output to the range 0 to 1 to produce the final attention map.
  • The generated attention map is independent of the target category, and its number of channels is 1.
  • The attention map is then multiplied channel by channel, via broadcasting, with the convolutional features at the corresponding downsampling multiple, and the multiplied features serve as the next input. This process is repeated throughout the preset basic network, gradually guiding the learning of the network features and finally yielding the attention-aware convolutional feature map, as sketched below.
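  • The following is a minimal PyTorch sketch of one such attention block, for illustration only: the two 3×3 convolutions (256 channels), the 1×1 foreground/background convolution, the sigmoid, and the broadcast multiplication follow the text, while the ReLU activations between the convolutions are an assumption the text does not specify.

```python
import torch
import torch.nn as nn

class SpatialAttentionModule(nn.Module):
    """One cascaded attention block: predicts a single-channel,
    class-agnostic foreground confidence map and reweights the
    backbone features with it."""

    def __init__(self, in_channels):
        super().__init__()
        self.body = nn.Sequential(
            nn.Conv2d(in_channels, 256, kernel_size=3, padding=1),
            nn.ReLU(inplace=True),             # nonlinearity assumed, not stated
            nn.Conv2d(256, 256, kernel_size=3, padding=1),
            nn.ReLU(inplace=True),
            nn.Conv2d(256, 1, kernel_size=1),  # foreground/background logit
        )

    def forward(self, features):
        # Attention map A in [0, 1], shape (N, 1, H, W).
        attention = torch.sigmoid(self.body(features))
        # Channel-by-channel multiplication via broadcasting; the result
        # is fed to the next backbone stage.
        return features * attention, attention
```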
  • Step S103 includes:
  • the region proposal network (RPN) can be used to extract the candidate frames, and a generated candidate frame containing the target is denoted P.
  • Optionally, step S104 includes: fusing the local component feature, global structural feature, and spatial context feature of the target to obtain the structured feature of the target; and realizing the detection task of the target based on the structured feature.
  • the local component feature, the global structure feature, and the spatial context feature are coupled together through a normalization operation to form a complete structured feature of the target, which can be used for the detection task of the target.
  • Optionally, step S104 includes: up-sampling the structured feature so that its resolution is the same as that of the multi-task feature; fusing the up-sampled structured feature with the multi-task feature to obtain a fused feature; and performing key point detection on the fused feature to realize the key point detection task of the target, and/or performing instance segmentation on the fused feature to realize the instance segmentation task of the target.
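  • A compact sketch of this coupling step is shown below. The text specifies normalization, coupling, up-sampling, and fusion but not the exact operators, so the L2 normalization, summation, bilinear up-sampling, and channel concatenation here are all assumptions chosen for illustration.

```python
import torch
import torch.nn.functional as F

def couple_and_fuse(local_feat, global_feat, context_feat, multitask_feat):
    """Couple the three structured descriptors (assumed to share a common
    (R, C, 1, 1) shape) and fuse the result with the multi-task feature."""
    # Normalize and couple into the structured feature (detection head input).
    structured = sum(F.normalize(f, dim=1)
                     for f in (local_feat, global_feat, context_feat))
    # Up-sample to the multi-task resolution, then fuse; concatenation is
    # one plausible reading of "fuse" and keeps channel counts flexible.
    structured_up = F.interpolate(structured, size=multitask_feat.shape[-2:],
                                  mode="bilinear", align_corners=False)
    fused = torch.cat([structured_up, multitask_feat], dim=1)
    return structured, fused  # detection input, keypoint/mask head input
```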
  • Optionally, step S1032 includes: passing the attention-aware convolutional feature map through a 1×1 convolutional layer to obtain a component-sensitive feature map; mapping the candidate frame onto the component-sensitive feature map through PSRoIPooling and dividing the candidate frame into k×k candidate frame blocks, so that each candidate frame block represents a local component and each candidate frame forms a k×k component feature; and average-pooling each k×k component feature to obtain the local component feature of the target.
  • Specifically, a component-sensitive score map is generated through a 1×1 convolution whose number of filters is k²(C+1), where k (usually taken as 7) means that the target is divided into k×k candidate frame blocks of the same size, each candidate frame block represents a local component, and C is the total number of target categories.
  • a total of k 2 feature channels are generated for each target category, and each feature channel is responsible for encoding a local component of the target.
  • The PSRoIPooling operation from "R-FCN: Object detection via region-based fully convolutional networks" is used to extract the local component features.
  • The component-sensitive score map has k²(C+1) channels, and average pooling finally yields a 1×1×(C+1) local component feature.
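  • A minimal sketch of this local-component branch, using the PSRoIPooling op available in torchvision: the 1×1 convolution to k²(C+1) channels, the k×k position-sensitive pooling, and the final average pooling follow the text, while the backbone width (1024) and class count (80) are placeholder assumptions.

```python
import torch.nn as nn
from torchvision.ops import ps_roi_pool

k, num_classes = 7, 80   # k = 7 per the text; the class count is an assumption
in_channels = 1024       # backbone width, also an assumption

# 1x1 conv producing the component-sensitive score map with k^2*(C+1) channels.
part_conv = nn.Conv2d(in_channels, k * k * (num_classes + 1), kernel_size=1)

def local_part_feature(attention_fmap, rois, spatial_scale):
    """attention_fmap: (N, in_channels, H, W); rois: (R, 5) rows of
    (batch_index, x1, y1, x2, y2) in image coordinates."""
    score_map = part_conv(attention_fmap)
    # Each of the k*k bins pools from its own channel group, so every bin
    # encodes one local component of the target.
    parts = ps_roi_pool(score_map, rois, output_size=k,
                        spatial_scale=spatial_scale)   # (R, C+1, k, k)
    # Average pooling over the k*k parts -> (R, C+1, 1, 1).
    return parts.mean(dim=(2, 3), keepdim=True)
```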
  • Optionally, step S1032 includes: reducing the dimensionality of the attention-aware convolutional feature map through a 1×1 convolutional layer to obtain a set of dimension-reduced convolutional feature maps; mapping the candidate frame onto the dimension-reduced convolutional feature map through RoIPooling and dividing the candidate frame into k×k candidate frame blocks, so that each candidate frame block forms a k×k global feature; and treating each k×k global feature as a whole and encoding it through two convolutional layers of sizes k×k and 1×1 to obtain the global structural feature of the target.
  • Similar to the local branch, the candidate frame is divided into k×k candidate frame blocks and each block is pooled individually, with two differences: 1) every feature channel extracts k×k features, that is, the feature channels here do not distinguish categories or positions, and the candidate frames are neither score-sensitive nor position-sensitive; 2) after the pooling operation, all candidate frame blocks are combined into a whole with a feature spatial resolution of k×k, which is then further encoded into the global structural feature through two convolutional layers whose filter sizes are k×k and 1×1, likewise outputting a 1×1×(C+1) feature.
  • The RoIPooling operation from Faster R-CNN is used to extract the features, which unifies the global structural features into scale-normalized features; that is, whether the target is large or small, the global structural features have the same size.
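  • A sketch of this global branch under the same placeholder assumptions (backbone width 1024, reduced width 256, 80 classes, and a ReLU between the two encoding convolutions):

```python
import torch.nn as nn
from torchvision.ops import roi_pool

k, num_classes = 7, 80
reduced = 256  # width after the 1x1 dimensionality reduction (assumed)

reduce_conv = nn.Conv2d(1024, reduced, kernel_size=1)
encode = nn.Sequential(
    nn.Conv2d(reduced, reduced, kernel_size=k),          # k x k -> 1 x 1
    nn.ReLU(inplace=True),                               # nonlinearity assumed
    nn.Conv2d(reduced, num_classes + 1, kernel_size=1),  # final 1 x 1 conv
)

def global_structural_feature(attention_fmap, rois, spatial_scale):
    fmap = reduce_conv(attention_fmap)
    # RoIPooling normalizes every box, large or small, to a k x k grid.
    pooled = roi_pool(fmap, rois, output_size=k,
                      spatial_scale=spatial_scale)       # (R, reduced, k, k)
    return encode(pooled)                                # (R, C+1, 1, 1)
```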
  • Optionally, step S1032 includes: reducing the dimensionality of the attention-aware convolutional feature map through a 1×1 convolutional layer to obtain a set of dimension-reduced convolutional feature maps; keeping the center point of each candidate frame unchanged and expanding the area of each candidate frame by a preset multiple; mapping the expanded candidate frame onto the dimension-reduced convolutional feature map through RoIPooling and dividing the expanded candidate frame into k×k candidate frame blocks, so that each candidate frame forms a k×k context feature; and treating each k×k context feature as a whole and encoding it through two convolutional layers of sizes k×k and 1×1 to obtain the context structure feature of the target.
  • Context structural features are widely used in visual recognition tasks as basic and important information. For example, a ship appears on the water rather than in the sky, which suggests that the information around a target can usually help to better distinguish its semantic category. In addition, the effective receptive field of a network is much smaller than its theoretical receptive field, so collecting information around the target can effectively reduce misidentification.
  • The process of extracting context structural features in the present disclosure is the same as that of extracting global structural features, except that before extraction, the center point coordinates of each candidate frame are kept unchanged and the frame's area is expanded to 2 times the original.
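  • A small helper for the box-expansion step, for illustration. The text doubles the box area, so each side is scaled by the square root of the factor; scaling the side length directly would be a different (also common) convention.

```python
import torch

def expand_rois(rois, area_factor=2.0):
    """Scale each (batch_index, x1, y1, x2, y2) box about its own center
    so that its area grows by area_factor."""
    side = area_factor ** 0.5          # side scale that doubles the area
    cx = (rois[:, 1] + rois[:, 3]) / 2
    cy = (rois[:, 2] + rois[:, 4]) / 2
    half_w = (rois[:, 3] - rois[:, 1]) * side / 2
    half_h = (rois[:, 4] - rois[:, 2]) * side / 2
    out = rois.clone()
    out[:, 1], out[:, 3] = cx - half_w, cx + half_w
    out[:, 2], out[:, 4] = cy - half_h, cy + half_h
    return out
```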
  • Optionally, step S1032 includes: reducing the dimensionality of the attention-aware convolutional feature map through a 1×1 convolutional layer to obtain a set of dimension-reduced convolutional feature maps; mapping the candidate frame onto the dimension-reduced convolutional feature map through RoIPooling and dividing the candidate frame into 2k×2k candidate frame blocks, so that each candidate frame forms a 2k×2k feature; encoding each 2k×2k feature through four 3×3 convolutions with 256 channels; and up-sampling the encoded feature of each candidate frame by a preset multiple to obtain the multi-task feature of the target.
  • Specifically, the candidate frame is divided into 2k×2k blocks, and the features are likewise extracted by RoIPooling. The spatial resolution of the extracted features is 2k×2k, and they are further encoded by four 3×3 convolutional layers whose channel count is set to 256. Since the key point detection and instance segmentation tasks require features with high spatial resolution, an up-sampling layer restores their spatial structure information; the up-sampling rate can be set to 2×, 4×, and so on, and the up-sampled features are the multi-task features.
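  • A sketch of this multi-task branch under the same assumptions (reduced width 256, ReLU activations, bilinear up-sampling; the text only fixes the four 3×3/256-channel convolutions, the 2k×2k pooling, and the 2× or 4× up-sampling rate):

```python
import torch.nn as nn
import torch.nn.functional as F
from torchvision.ops import roi_pool

k, reduced = 7, 256  # reduced width after the 1x1 conv (assumed)

# Four 3x3 convolutions with 256 channels each (ReLUs between them assumed).
multitask_encoder = nn.Sequential(
    *[layer for _ in range(4)
      for layer in (nn.Conv2d(reduced, reduced, kernel_size=3, padding=1),
                    nn.ReLU(inplace=True))]
)

def multitask_feature(reduced_fmap, rois, spatial_scale, up=2):
    pooled = roi_pool(reduced_fmap, rois, output_size=2 * k,
                      spatial_scale=spatial_scale)  # (R, 256, 2k, 2k)
    encoded = multitask_encoder(pooled)
    # Up-sampling (2x or 4x per the text) restores the spatial detail that
    # keypoint and mask prediction need.
    return F.interpolate(encoded, scale_factor=up, mode="bilinear",
                         align_corners=False)
```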
  • Optionally, the loss of at least one of the detection task, key point detection task, and instance segmentation task of the target is computed through a preset loss model:
  • Loss = L_det(N) + λ1·L_att(N) + λ2·L_multi(N);
  • where N denotes the detection network implementing the multi-task target detection method, L_det denotes the loss of the detection task, L_att denotes the loss of the attention module, L_multi denotes the loss of the key point detection task and/or the instance segmentation task, and λ1 and λ2 are preset loss adjustment factors.
  • The present disclosure adopts a two-stage detection method, which first generates candidate frames through the RPN and then further classifies and regresses them through the coupling network, so the detection loss includes the classification and regression losses of both the RPN and the coupling network.
  • Both regression losses use the smooth L1 loss;
  • the classification loss of the RPN is a two-class cross-entropy loss;
  • the classification loss of the coupling network is a multi-class cross-entropy loss.
  • L att is the loss of the spatial attention module and also the two-class (foreground/background) cross-entropy loss.
  • L multi is the loss of other tasks, which can be key point loss or instance segmentation loss, or the sum of two losses (key point detection and instance segmentation are performed at the same time).
  • ⁇ 1 and ⁇ 2 are loss adjustment factors, which can be set as needed. In one example, ⁇ 1 is set to 0.25, ⁇ 2 is set to 1, the positive and negative sample selection ratio of the detection part is 1:4, and the sample threshold is 0.5 , That is, the IOU with ground truth is greater than 0.5 as a positive sample, otherwise as a negative sample.
  • the ratio of positive and negative samples in the RPN part is 1:1, the positive sample threshold is 0.7, and the negative sample threshold is 0.3.
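  • The combined objective is simple enough to state directly in code; the λ values below are the example settings from the text.

```python
def total_loss(l_det, l_att, l_multi, lambda1=0.25, lambda2=1.0):
    """Loss = L_det + lambda1 * L_att + lambda2 * L_multi.

    l_det sums the RPN and coupling-network classification (cross-entropy)
    and regression (smooth L1) losses; l_att is the binary foreground /
    background cross-entropy over the attention maps; l_multi is the
    keypoint and/or instance-segmentation loss."""
    return l_det + lambda1 * l_att + lambda2 * l_multi
```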
  • FIG. 5 is a schematic structural diagram of a multi-task target detection device provided by an embodiment of the present disclosure.
  • the device includes:
  • the obtaining module 201 is used to obtain an image of the target to be detected
  • the first extraction module 202 is configured to use the cascaded spatial attention module to extract the convolutional feature map of the target's attention perception;
  • the second extraction module 203 is configured to extract local component features, global structural features, spatial context features, and multi-task features of the target based on the convolutional feature map of the attention perception;
  • the task realization module 204 is configured to implement at least one of the target detection task, key point detection task, and instance segmentation task based on the local component feature, global structure feature, spatial context feature, and multi-task feature of the target.
  • the first extraction module 202 includes: an insertion sub-module, used to insert attention modules at multiple preset downsampling multiples of the preset basic network to obtain multiple attention maps;
  • and a sub-module used to multiply the multiple attention maps channel by channel with the convolutional feature maps at the corresponding downsampling multiples to obtain the attention-aware convolutional feature maps.
  • the second extraction module 203 includes: a first extraction sub-module for extracting candidate frames containing the target on the convolutional feature map of the attention perception; a second extraction sub-module for Based on the convolutional feature map and the candidate frame based on the attention perception, the local component features, global structural features, spatial context features and multi-task features of the target are extracted.
  • the task realization module 204 includes: a first feature fusion sub-module for fusing local component features, global structural features, and spatial context features of the target to obtain the structural features of the target;
  • the detection task realization sub-module is used to realize the detection task of the target based on the structural feature.
  • the task realization module 204 includes: a first up-sampling sub-module for up-sampling the structured feature so that the resolution of the structured feature is the same as the resolution of the multi-task feature ;
  • the second feature fusion sub-module is used to fuse the structured features after upsampling with the multi-task features to obtain the fused features;
  • the key point detection task realization sub-module is used to perform key point detection on the fused features to realize the key point detection task of the target, and/or the instance segmentation task realization sub-module is used to perform instance segmentation on the fused features to realize the instance segmentation task of the target.
  • Optionally, the second extraction sub-module includes: a first dimensionality reduction sub-module, used to pass the attention-aware convolutional feature map through a 1×1 convolutional layer to obtain a component-sensitive feature map; a first mapping division sub-module, used to map the candidate frame onto the component-sensitive feature map through PSRoIPooling and divide the candidate frame into k×k candidate frame blocks, so that each candidate frame block represents a local component and each candidate frame forms a k×k component feature; and a pooling sub-module, used to average-pool each k×k component feature to obtain the local component feature of the target.
  • Optionally, the second extraction sub-module includes: a second dimensionality reduction sub-module, used to reduce the dimensionality of the attention-aware convolutional feature map through a 1×1 convolutional layer to obtain a set of dimension-reduced convolutional feature maps; a second mapping division sub-module, used to map the candidate frame onto the dimension-reduced convolutional feature map through RoIPooling and divide the candidate frame into k×k candidate frame blocks, so that each candidate frame block forms a k×k global feature; and a first coding sub-module, used to treat each k×k global feature as a whole and encode it through two convolutional layers of sizes k×k and 1×1 to obtain the global structural feature of the target.
  • Optionally, the second extraction sub-module includes: a third dimensionality reduction sub-module, used to reduce the dimensionality of the attention-aware convolutional feature map through a 1×1 convolutional layer to obtain a set of dimension-reduced convolutional feature maps; an area expansion sub-module, used to keep the center point of each candidate frame unchanged and expand the area of each candidate frame by a preset multiple; a third mapping division sub-module, used to map the area-expanded candidate frame onto the dimension-reduced convolutional feature map through RoIPooling and divide the area-expanded candidate frame into k×k candidate frame blocks, so that each candidate frame forms a k×k context feature; and a second coding sub-module, used to treat each k×k context feature as a whole and encode it through two convolutional layers of sizes k×k and 1×1 to obtain the context structure feature of the target.
  • Optionally, the second extraction sub-module includes: a fourth dimensionality reduction sub-module, used to reduce the dimensionality of the attention-aware convolutional feature map through a 1×1 convolutional layer to obtain a set of dimension-reduced convolutional feature maps; a fourth mapping division sub-module, used to map the candidate frame onto the dimension-reduced convolutional feature map through RoIPooling and divide the candidate frame into 2k×2k candidate frame blocks, so that each candidate frame forms a 2k×2k feature; a third encoding sub-module, used to encode each 2k×2k feature through four 3×3 convolutions with 256 channels; and a second up-sampling sub-module, used to up-sample the encoded feature of each candidate frame by a preset multiple to obtain the multi-task feature of the target.
  • Optionally, a loss detection module is further included, configured to compute, through a preset loss model, the loss of at least one of the detection task, key point detection task, and instance segmentation task of the target:
  • Loss = L_det(N) + λ1·L_att(N) + λ2·L_multi(N);
  • where N denotes the detection network implementing the multi-task target detection method, L_det denotes the loss of the detection task, L_att denotes the loss of the attention module, L_multi denotes the loss of the key point detection task and/or the instance segmentation task, and λ1 and λ2 are preset loss adjustment factors.
  • FIG. 6 shows a hardware structure diagram of an electronic device.
  • the electronic device also includes:
  • the aforementioned memory 41, processor 42, input device 43, and output device 44 are connected via a bus 45.
  • the input device 43 may specifically be a camera, a touch panel, a physical button, a mouse, and so on.
  • the output device 44 may specifically be a display screen.
  • the memory 41 may be a high-speed random access memory (RAM) or a non-volatile memory, such as a magnetic disk memory.
  • the memory 41 is used to store a group of executable program codes, and the processor 42 is coupled with the memory 41.
  • the embodiments of the present disclosure also provide a computer-readable storage medium, which may be provided in the electronic device of each of the above embodiments, for example the electronic device of the embodiment shown in FIG. 6 above.
  • a computer program is stored on the computer-readable storage medium, and when the program is executed by the processor, the multi-task target detection method described in the embodiment shown in FIG. 1 is implemented.
  • the computer storage medium may also be a USB flash drive, a removable hard disk, a read-only memory (ROM), a random access memory (RAM), a magnetic disk, an optical disk, or any other medium that can store program code.

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • General Physics & Mathematics (AREA)
  • Data Mining & Analysis (AREA)
  • Artificial Intelligence (AREA)
  • Bioinformatics & Cheminformatics (AREA)
  • Bioinformatics & Computational Biology (AREA)
  • Evolutionary Biology (AREA)
  • Evolutionary Computation (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • General Engineering & Computer Science (AREA)
  • Multimedia (AREA)
  • Image Analysis (AREA)

Abstract

The present invention is applied in the technical field of image processing. Disclosed are a multi-task target detection method and apparatus, an electronic device, and a storage medium. An attention-aware convolutional feature map of a target is extracted using cascaded attention modules, so that attention-aware convolutional features of a whole image can be generated from coarse to fine to suppress interference of background noise. A local component feature, a global structure feature, a spatial context feature, and a multi-task feature of the target are extracted on the basis of the attention-aware convolutional feature map, and at least one of a detection task, a key point detection task, and an instance segmentation task of the target is implemented on the basis of these features. The global structure, local components, and context information of the target can thus be effectively associated to form a structured feature expression, enhancing the robustness of the features to occlusion, deformation, posture changes, and the like, and improving multi-task performance.

Description

Multi-task target detection method, apparatus, electronic device, and storage medium
This application claims priority to Chinese Patent Application No. 202010422038.3, filed on May 18, 2020 by 中科视语(北京)科技有限公司 and entitled "Multi-task target detection method, apparatus, electronic device, and storage medium", the entire content of which is incorporated herein by reference.
Technical Field
The present invention relates to the field of image processing technology, and in particular to a multi-task target detection method, apparatus, electronic device, and storage medium.
Background
Target detection is a basic task in computer vision and a prerequisite for many other tasks. The difficulties of target detection appear in almost all of those tasks, in more complex and varied forms, such as background noise interference, target occlusion, truncation, posture change, and deformation. Multi-task design on top of target detection has long been an active topic: predicting multiple tasks with a single network not only saves computation but also improves the generalization ability of the model.
Existing multi-task frameworks such as Mask R-CNN are highly extensible and widely used, but such frameworks do not consider the influence of the environment or of the target's own state and lack a targeted structure and learning strategy, so their expressive power still needs to be enhanced. Overall, there is currently a lack of an integrated solution for multi-task problems such as environmental interference and target posture changes.
Summary of the Invention
The main purpose of the present disclosure is to provide a multi-task target detection method, apparatus, electronic device, and storage medium, which can solve at least one of the above technical problems.
To achieve the foregoing objectives, a first aspect of the embodiments of the present disclosure provides a multi-task target detection method, including:
obtaining an image of a target to be detected;
extracting an attention-aware convolutional feature map of the target using cascaded attention modules;
extracting local component features, global structural features, spatial context features, and multi-task features of the target based on the attention-aware convolutional feature map;
realizing at least one of a detection task, a key point detection task, and an instance segmentation task of the target based on the local component features, global structural features, spatial context features, and multi-task features of the target.
Optionally, extracting the attention-aware convolutional feature map of the target using the cascaded spatial attention modules includes:
inserting attention modules at multiple preset downsampling multiples of a preset basic network to obtain multiple attention maps;
multiplying the multiple attention maps channel by channel with the convolutional feature maps at the corresponding downsampling multiples to obtain the attention-aware convolutional feature maps.
Optionally, extracting the local component features, global structural features, spatial context features, and multi-task features of the target based on the attention-aware convolutional feature map includes:
extracting candidate frames containing the target on the attention-aware convolutional feature map;
extracting the local component features, global structural features, spatial context features, and multi-task features of the target based on the attention-aware convolutional feature map and the candidate frames.
Optionally, realizing the detection task of the target based on the local component features, global structural features, spatial context features, and multi-task features of the target includes:
fusing the local component features, global structural features, and spatial context features of the target to obtain a structured feature of the target;
realizing the detection task of the target based on the structured feature.
Optionally, realizing the key point detection task and/or the instance segmentation task of the target based on the local component features, global structural features, spatial context features, and multi-task features of the target includes:
up-sampling the structured feature so that its resolution is the same as that of the multi-task feature;
fusing the up-sampled structured feature with the multi-task feature to obtain a fused feature;
performing key point detection on the fused feature to realize the key point detection task of the target, and/or performing instance segmentation on the fused feature to realize the instance segmentation task of the target.
Optionally, extracting the local component feature of the target based on the attention-aware convolutional feature map and the candidate frame includes:
passing the attention-aware convolutional feature map through a 1×1 convolutional layer to obtain a component-sensitive feature map;
mapping the candidate frame onto the component-sensitive feature map through PSRoIPooling, and dividing the candidate frame into k×k candidate frame blocks, so that each candidate frame block represents a local component and each candidate frame forms a k×k component feature;
average-pooling each k×k component feature to obtain the local component feature of the target.
Optionally, extracting the global structural feature of the target based on the attention-aware convolutional feature map and the candidate frame includes:
reducing the dimensionality of the attention-aware convolutional feature map through a 1×1 convolutional layer to obtain a set of dimension-reduced convolutional feature maps;
mapping the candidate frame onto the dimension-reduced convolutional feature map through RoIPooling, and dividing the candidate frame into k×k candidate frame blocks, so that each candidate frame block forms a k×k global feature;
treating each k×k global feature as a whole and encoding it through two convolutional layers of sizes k×k and 1×1 to obtain the global structural feature of the target.
Optionally, extracting the context structure feature of the target based on the attention-aware convolutional feature map and the candidate frame includes:
reducing the dimensionality of the attention-aware convolutional feature map through a 1×1 convolutional layer to obtain a set of dimension-reduced convolutional feature maps;
keeping the center point of each candidate frame unchanged, and expanding the area of each candidate frame by a preset multiple;
mapping the expanded candidate frame onto the dimension-reduced convolutional feature map through RoIPooling, and dividing the expanded candidate frame into k×k candidate frame blocks, so that each candidate frame forms a k×k context feature;
treating each k×k context feature as a whole and encoding it through two convolutional layers of sizes k×k and 1×1 to obtain the context structure feature of the target.
Optionally, extracting the multi-task feature of the target based on the attention-aware convolutional feature map and the candidate frame includes:
reducing the dimensionality of the attention-aware convolutional feature map through a 1×1 convolutional layer to obtain a set of dimension-reduced convolutional feature maps;
mapping the candidate frame onto the dimension-reduced convolutional feature map through RoIPooling, and dividing the candidate frame into 2k×2k candidate frame blocks, so that each candidate frame forms a 2k×2k feature;
encoding each 2k×2k feature through four 3×3 convolutions with 256 channels;
up-sampling the encoded feature of each candidate frame by a preset multiple to obtain the multi-task feature of the target.
Optionally, the loss of at least one of the detection task, key point detection task, and instance segmentation task of the target is computed through a preset loss model;
the preset loss model is:
Loss = L_det(N) + λ1·L_att(N) + λ2·L_multi(N);
where N denotes the detection network implementing the multi-task target detection method, L_det denotes the loss of the detection task, L_att denotes the loss of the attention module, L_multi denotes the loss of the key point detection task and/or the instance segmentation task, and λ1 and λ2 are preset loss adjustment factors.
A second aspect of the embodiments of the present disclosure provides a multi-task target detection apparatus, including:
an acquisition module, configured to acquire an image of a target to be detected;
a first extraction module, configured to extract an attention-aware convolutional feature map of the target using cascaded spatial attention modules;
a second extraction module, configured to extract local component features, global structural features, spatial context features, and multi-task features of the target based on the attention-aware convolutional feature map;
a task realization module, configured to realize at least one of a detection task, a key point detection task, and an instance segmentation task of the target based on the local component features, global structural features, spatial context features, and multi-task features of the target.
A third aspect of the embodiments of the present disclosure provides an electronic device, including:
a memory, a processor, and a computer program stored on the memory and runnable on the processor, wherein the processor, when executing the program, implements the multi-task target detection method provided by the first aspect of the embodiments of the present disclosure.
A fourth aspect of the embodiments of the present disclosure provides a computer-readable storage medium on which a computer program is stored; when the computer program is executed by a processor, the multi-task target detection method provided by the first aspect of the embodiments of the present disclosure is implemented.
As can be seen from the above embodiments, the multi-task target detection method, apparatus, electronic device, and storage medium provided by the present disclosure use cascaded attention modules to extract the attention-aware convolutional feature map of the target, which can generate attention-aware convolutional features of the whole image from coarse to fine and suppress the interference of background noise. Based on the attention-aware convolutional feature map, the local component features, global structural features, spatial context features, and multi-task features of the target are extracted, and at least one of the detection task, key point detection task, and instance segmentation task of the target is realized on the basis of those features. This effectively associates the global structure, local components, and context information of the target into a structured feature expression, improving the robustness of the features to occlusion, deformation, and posture changes, and improving multi-task performance.
Brief Description of the Drawings
To explain the embodiments of the present invention or the technical solutions in the prior art more clearly, the drawings needed in the description of the embodiments or the prior art are briefly introduced below. Obviously, the drawings in the following description are merely exemplary; for those of ordinary skill in the art, other implementation drawings can be derived from the provided drawings without creative work.
The structures, proportions, sizes, and the like shown in this specification are only used to match the content disclosed in the specification, for understanding and reading by those familiar with this technology, and are not intended to limit the conditions under which the present invention can be implemented; they therefore carry no substantive technical significance. Any structural modification, change of proportional relationship, or adjustment of size that does not affect the effects and objectives achievable by the present invention shall still fall within the scope covered by the technical content disclosed herein.
FIG. 1 is a schematic flowchart of a multi-task target detection method provided by an embodiment of the present disclosure;
FIG. 2 is a schematic diagram of a multi-task spatial attention mechanism provided by an embodiment of the present disclosure;
FIG. 3 is a schematic flowchart of step S103 in the multi-task target detection method provided by an embodiment of the present disclosure;
FIG. 4 is a schematic structural diagram of a model implementing the multi-task target detection method provided by an embodiment of the present disclosure;
FIG. 5 is a schematic structural diagram of a multi-task target detection device provided by an embodiment of the present disclosure;
FIG. 6 is a schematic diagram of the hardware structure of an electronic device.
Detailed Description
The implementation of the present invention is illustrated below by specific embodiments; those familiar with this technology can easily understand other advantages and effects of the present invention from the content disclosed in this specification. Obviously, the described embodiments are a part, rather than all, of the embodiments of the present invention. Based on the embodiments of the present invention, all other embodiments obtained by those of ordinary skill in the art without creative work shall fall within the protection scope of the present invention.
请参阅图1,图1为本公开一实施例提供的多任务的目标检测方法的流程示意图,该方法主要包括以下步骤:Please refer to FIG. 1. FIG. 1 is a schematic flowchart of a multi-task target detection method provided by an embodiment of the present disclosure. The method mainly includes the following steps:
S101、获取待检测目标的图像;S101. Obtain an image of a target to be detected;
S102、利用级联式的空间注意力模块,提取该目标的注意力感知的卷积特征图;S102. Use the cascaded spatial attention module to extract the convolutional feature map of the target's attention perception;
S103、基于该注意力感知的卷积特征图,提取该目标的局部部件特征、全 局结构特征、空间上下文特征以及多任务特征;S103: Extracting local component features, global structural features, spatial context features, and multi-task features of the target based on the convolution feature map of the attention perception;
S104、基于该目标的局部部件特征、全局结构特征、空间上下文特征以及多任务特征,实现该目标的检测任务、关键点检测任务、实例分割任务中的至少一个。S104: Based on the local component feature, global structural feature, spatial context feature, and multi-task feature of the target, at least one of a detection task, a key point detection task, and an instance segmentation task of the target is realized.
在步骤S101中,该图像可以是任一图像。该目标可以是人、动物、花草等等,本实施例对此不做任何限制。In step S101, the image can be any image. The target can be people, animals, flowers and plants, etc. This embodiment does not impose any limitation on this.
在步骤S102中,每个注意力模块用于产生逐像素的前景和/或背景注意力图,多个注意力模块之间通过级联的方式连接,可由粗到细地学习全图的空间区域,实现对前景特征的增强和对背景特征的削弱,从而不断精调基础网络特征,最后得到更加全面精准的注意力感知的基础网络特征,然后将注意力感知的基础网络特征作用于卷积特征图上,获得注意力感知的卷积特征图。因此,通过步骤S102在全图的基础上产生注意力感知的卷积特征图,可有效地过滤掉图像中背景的干扰并增强前景目标的特征表达。In step S102, each attention module is used to generate a pixel-by-pixel foreground and/or background attention map, and multiple attention modules are connected in a cascaded manner to learn the spatial region of the whole image from coarse to fine. Realize the enhancement of the foreground features and the weakening of the background features, thereby continuously fine-tuning the basic network features, and finally get more comprehensive and accurate basic network features for attention perception, and then apply the basic network features of attention perception to the convolutional feature map Above, get the convolutional feature map of attention perception. Therefore, by generating an attention-perceived convolution feature map on the basis of the full image in step S102, the background interference in the image can be effectively filtered out and the feature expression of the foreground target can be enhanced.
在步骤S103中,显示地提取目标的局部部件特征、全局结构特征、空间上下文特征以及多任务特征,增强对目标的描述力。其中,局部部件特征,比如人的眼睛、鼻子、嘴巴等特定的部件;全局结构特征,比如人体的直立结构;空间上下文特征,主要用来提取目标周围的空间上下文信息,比如人在室内环境,飞机在天空中等;多任务特征,主要用来提取关键点和/或分割特征。In step S103, the local component features, global structural features, spatial context features, and multi-task features of the target are explicitly extracted to enhance the descriptive power of the target. Among them, local component features, such as human eyes, nose, mouth and other specific components; global structural features, such as the upright structure of the human body; spatial context features, are mainly used to extract spatial context information around the target, such as people in an indoor environment. The aircraft is in the middle of the sky; multi-task features are mainly used to extract key points and/or segmentation features.
The four extraction processes for local part features, global structure features, spatial context features, and multi-task features may be executed either serially or in parallel.
In step S104, the local part features, global structure features, and spatial context features of the target are normalized and then coupled together to form a complete structured feature of the target, which can be used for the target detection task. The structured feature is further coupled with the multi-task features via upsampling, and the resulting fused feature can be used for the keypoint detection and instance segmentation tasks, enabling end-to-end multi-task training and testing.
In one embodiment of the present application, referring to FIG. 2, which is a schematic diagram of a multi-task spatial attention mechanism provided by an embodiment of the present disclosure, the mechanism implements step S102 of the present disclosure (correspondingly, the multi-task coupling network in FIG. 2 implements steps S103 and S104). Step S102 includes: inserting attention modules at multiple preset downsampling multiples of a preset backbone network to obtain multiple attention maps; and multiplying each attention map channel by channel with the convolutional feature map at the corresponding downsampling multiple to obtain the attention-aware convolutional feature map. For example, an attention module is inserted at each downsampling stage of the preset backbone. Taking downsampling multiples of 4, 8, and 16, three attention maps are produced and multiplied channel by channel with the convolutional feature maps at the corresponding multiples (i.e., the 4× attention map with the 4× feature map, the 8× attention map with the 8× feature map, and the 16× attention map with the 16× feature map). Background noise is thus suppressed from coarse to fine and foreground feature expression is enhanced, guiding the feature learning of the preset backbone and yielding the final attention-aware convolutional feature map.
Specifically, no attention module is placed after Conv1, mainly because shallow features lack sufficient semantic information, so the attention maps produced at that depth tend to be highly inaccurate. Each attention module predicts an attention map A expressing the confidence that each position belongs to a target. An attention module consists of two 3×3 convolutional layers with 256 channels, followed by a 1×1 convolutional layer for foreground/background classification, and finally a sigmoid activation that normalizes the output to the range 0-1 to produce the final attention map. The generated attention map is class-agnostic and has a single channel. The attention map is then multiplied, by broadcasting, channel by channel with the convolutional features at the corresponding downsampling multiple, and the product serves as the input to the next stage. Repeating this process throughout the preset backbone gradually guides its feature learning and finally yields the attention-aware convolutional feature map.
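As an illustration, a minimal sketch of one such attention module follows, assuming a PyTorch implementation. The two 3×3 convolutions with 256 channels, the 1×1 foreground/background convolution, and the sigmoid normalization follow the description above; the ReLU activations between convolutions and all names are illustrative assumptions.

```python
import torch
import torch.nn as nn

class SpatialAttention(nn.Module):
    """One attention module: predicts a single-channel, class-agnostic
    attention map A and re-weights the input features with it."""

    def __init__(self, in_channels: int):
        super().__init__()
        self.body = nn.Sequential(
            nn.Conv2d(in_channels, 256, kernel_size=3, padding=1),
            nn.ReLU(inplace=True),  # activations between convs are assumed
            nn.Conv2d(256, 256, kernel_size=3, padding=1),
            nn.ReLU(inplace=True),
            nn.Conv2d(256, 1, kernel_size=1),  # foreground/background logit
        )

    def forward(self, feat: torch.Tensor) -> torch.Tensor:
        attn = torch.sigmoid(self.body(feat))  # N x 1 x H x W, values in 0-1
        return feat * attn                     # broadcast multiply over channels
```

In use, one such module would sit after each downsampling stage at strides 4, 8, and 16 (but not after Conv1), with the re-weighted output feeding the next backbone stage so that the attention is refined from coarse to fine.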
In one embodiment of the present application, referring to FIG. 3, step S103 includes:
S1031. Extract candidate boxes containing the target from the attention-aware convolutional feature map;
S1032. Based on the attention-aware convolutional feature map and the candidate boxes, extract the local part features, global structure features, spatial context features, and multi-task features of the target.
Specifically, a region proposal network (RPN) may be used to extract the candidate boxes; the generated candidate boxes containing the target are denoted P.
In one embodiment of the present application, step S104 includes: fusing the local part features, global structure features, and spatial context features of the target to obtain a structured feature of the target; and performing the target detection task based on the structured feature.
Specifically, the local part features, global structure features, and spatial context features are normalized and then coupled together to form a complete structured feature of the target, which can be used for the target detection task.
In one embodiment of the present application, step S104 includes: upsampling the structured feature so that its resolution matches that of the multi-task features; fusing the upsampled structured feature with the multi-task features to obtain a fused feature; and performing keypoint detection on the fused feature to accomplish the keypoint detection task, and/or performing instance segmentation on the fused feature to accomplish the instance segmentation task.
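A hedged sketch of this coupling step follows. The text does not fix the exact normalization or fusion operators, so L2 normalization and element-wise addition are used here as one plausible reading, and matching channel widths are assumed (a 1×1 convolution could reconcile them otherwise); all function names are illustrative.

```python
import torch
import torch.nn.functional as F

def structured_feature(local, global_, context):
    # each input: K x (C+1) per-proposal vectors; L2 normalization stands in
    # for the unspecified normalization operation
    parts = [F.normalize(f, dim=1) for f in (local, global_, context)]
    return sum(parts)  # K x (C+1) structured feature, used for detection

def fuse_for_dense_tasks(structured, multitask_feat):
    # structured: K x C'; multitask_feat: K x C' x H x W (equal C' assumed)
    up = F.interpolate(structured[:, :, None, None],
                       size=multitask_feat.shape[-2:], mode="nearest")
    return multitask_feat + up  # fused feature for keypoints / segmentation
```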
In one embodiment of the present application, referring to FIG. 4, step S1032 includes: passing the attention-aware convolutional feature map through a 1×1 convolutional layer to obtain a part-sensitive feature map; mapping the candidate box onto the part-sensitive feature map via PSRoIPooling and dividing the candidate box into k×k blocks, so that each block represents a local part and each candidate box forms a k×k part feature; and average-pooling each k×k part feature to obtain the local part features of the target.
Specifically, a part-sensitive score map is generated from the attention-aware convolutional feature map by a 1×1 convolution with k²(C+1) filters, where k (typically 7) means that the target is divided into k×k equally sized blocks, each block representing a local part, and C is the total number of target classes. In other words, k² feature channels are produced for each target class, and each channel encodes one local part of the target. The PSRoIPooling operation from "R-FCN: Object detection via region-based fully convolutional networks" is adopted to extract the local part features. The pooled local part feature has size k²(C+1), from which a 1×1×(C+1) feature is obtained by a weighted average within the channels.
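As a concrete illustration, the local branch could be sketched as follows using torchvision's position-sensitive RoI pooling. The 256 input channels, the stride-16 spatial scale, the example values of k and C, and the plain mean in place of the in-channel weighted average are assumptions.

```python
import torch.nn as nn
from torchvision.ops import ps_roi_pool

k, C = 7, 80  # parts per side and number of object classes (illustrative)
score_conv = nn.Conv2d(256, k * k * (C + 1), kernel_size=1)  # part-sensitive scores

def local_part_features(attn_feat, rois, spatial_scale=1.0 / 16):
    # attn_feat: N x 256 x H x W; rois: K x 5 rows of (batch_idx, x1, y1, x2, y2)
    score_map = score_conv(attn_feat)                 # N x k*k*(C+1) x H x W
    parts = ps_roi_pool(score_map, rois, output_size=k,
                        spatial_scale=spatial_scale)  # K x (C+1) x k x k
    return parts.mean(dim=(2, 3))                     # K x (C+1) local part feature
```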
In one embodiment of the present application, referring to FIG. 4, step S1032 includes: reducing the dimensionality of the attention-aware convolutional feature map through a 1×1 convolutional layer to obtain a set of dimension-reduced convolutional feature maps; mapping the candidate box onto the dimension-reduced feature maps via RoIPooling and dividing the candidate box into k×k blocks, so that each candidate box forms a k×k global feature; and treating each k×k global feature as a whole and encoding it with two convolutional layers of sizes k×k and 1×1 to obtain the global structure features of the target.
Specifically, as with the extraction of the local part features, the candidate box is divided into k×k blocks and each block is pooled separately, but this branch differs from the local branch in two respects: 1) every feature channel extracts all k×k features, i.e., the channels here are neither class-specific nor position-specific, and the candidate boxes carry no score-sensitive or position-sensitive property; 2) after pooling, all blocks are combined into a single whole with spatial resolution k×k, which is further encoded into a global structure feature by two convolutional layers with filter sizes k×k and 1×1, again outputting a 1×1×(C+1) feature.
Since targets often appear at different scales, the RoIPooling operation from Faster R-CNN is used to extract features, which unifies the global structure features into scale-normalized features: whether the target is large or small, its global structure feature has the same size.
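Under the same assumptions, the global branch might look like the sketch below; only the k×k and 1×1 kernel sizes come from the description, while the 256-channel widths, the ReLU, and the stride-16 scale are illustrative.

```python
import torch.nn as nn
from torchvision.ops import roi_pool

k, C = 7, 80  # same illustrative values as in the local-branch sketch
reduce_conv = nn.Conv2d(256, 256, kernel_size=1)  # 1x1 dimensionality reduction
encode = nn.Sequential(
    nn.Conv2d(256, 256, kernel_size=k),           # k x k kernel collapses k x k to 1 x 1
    nn.ReLU(inplace=True),
    nn.Conv2d(256, C + 1, kernel_size=1),
)

def global_structure_features(attn_feat, rois, spatial_scale=1.0 / 16):
    pooled = roi_pool(reduce_conv(attn_feat), rois, output_size=k,
                      spatial_scale=spatial_scale)  # K x 256 x k x k
    return encode(pooled).flatten(1)                # K x (C+1), scale-normalized
```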
In one embodiment of the present application, referring to FIG. 4, step S1032 includes: reducing the dimensionality of the attention-aware convolutional feature map through a 1×1 convolutional layer to obtain a set of dimension-reduced convolutional feature maps; keeping the center point of each candidate box fixed while enlarging its area by a preset multiple; mapping the enlarged candidate box onto the dimension-reduced feature maps via RoIPooling and dividing the enlarged candidate box into k×k blocks, so that each candidate box forms a k×k context feature; and treating each k×k context feature as a whole and encoding it with two convolutional layers of sizes k×k and 1×1 to obtain the context structure features of the target.
Specifically, context is among the most basic and important cues and is widely used in visual recognition tasks. For example, a ship appears on water rather than in the sky, which suggests that information around a target usually helps discriminate its semantic class. Moreover, the effective receptive field of a network is much smaller than its theoretical receptive field, so gathering information around the target effectively reduces misrecognition. In the present disclosure, context structure features are extracted in the same way as global structure features, except that beforehand each candidate box keeps the coordinates of its center point fixed while its area is enlarged to twice the original.
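The only new ingredient relative to the global branch is the box enlargement, sketched below. Since "twice the original area" could be read as doubling the sides or doubling the area, the scale is left as a parameter: scale=2.0 doubles the side lengths, while scale=2**0.5 doubles the area.

```python
import torch

def expand_rois(rois: torch.Tensor, scale: float = 2.0) -> torch.Tensor:
    # rois: K x 5 rows of (batch_idx, x1, y1, x2, y2); center points stay fixed
    cx = (rois[:, 1] + rois[:, 3]) / 2
    cy = (rois[:, 2] + rois[:, 4]) / 2
    half_w = (rois[:, 3] - rois[:, 1]) * scale / 2
    half_h = (rois[:, 4] - rois[:, 2]) * scale / 2
    out = rois.clone()
    out[:, 1], out[:, 2] = cx - half_w, cy - half_h
    out[:, 3], out[:, 4] = cx + half_w, cy + half_h
    return out
```

The context branch then simply runs the global-structure sketch above on expand_rois(rois).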
In one embodiment of the present application, referring to FIG. 4, step S1032 includes: reducing the dimensionality of the attention-aware convolutional feature map through a 1×1 convolutional layer to obtain a set of dimension-reduced convolutional feature maps; mapping the candidate box onto the dimension-reduced feature maps via RoIPooling and dividing the candidate box into 2k×2k blocks, so that each candidate box forms a 2k×2k feature; encoding each 2k×2k feature with four 3×3 convolutions of 256 channels; and upsampling the encoded feature of each candidate box by a preset multiple to obtain the multi-task features of the target.
Specifically, the candidate box is divided into 2k×2k blocks and features are likewise extracted by RoIPooling, giving a spatial resolution of 2k×2k; these are further encoded by four 3×3 convolutional layers with 256 channels. Since the keypoint detection and instance segmentation tasks demand high spatial resolution, an upsampling layer then restores the spatial structure information; the upsampling rate may be set to 2×, 4×, and so on. The upsampled features are the multi-task features.
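A sketch of the multi-task branch under the same assumptions (a 256-channel 1×1 reduction, stride-16 feature map, and bilinear 2× upsampling as one of the admissible rates):

```python
import torch.nn as nn
import torch.nn.functional as F
from torchvision.ops import roi_pool

k = 7  # same illustrative value as in the earlier sketches
mt_reduce = nn.Conv2d(256, 256, kernel_size=1)
mt_encode = nn.Sequential(*[
    layer
    for _ in range(4)  # four 3x3, 256-channel convolutions per the description
    for layer in (nn.Conv2d(256, 256, kernel_size=3, padding=1),
                  nn.ReLU(inplace=True))
])

def multitask_features(attn_feat, rois, spatial_scale=1.0 / 16, up=2):
    pooled = roi_pool(mt_reduce(attn_feat), rois, output_size=2 * k,
                      spatial_scale=spatial_scale)  # K x 256 x 2k x 2k
    encoded = mt_encode(pooled)
    return F.interpolate(encoded, scale_factor=up,  # restores spatial detail
                         mode="bilinear", align_corners=False)
```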
In one embodiment of the present application, the loss of at least one of the target detection task, keypoint detection task, and instance segmentation task is computed through a preset loss model;
The preset loss model is:
Loss = L_det(N) + λ1·L_att(N) + λ2·L_multi(N);
where N denotes the detection network implementing the multi-task target detection method, L_det denotes the loss of the detection task, L_att denotes the loss of the attention modules, L_multi denotes the loss of the keypoint detection task and/or instance segmentation task, and λ1 and λ2 are preset loss weighting factors.
Specifically, the present disclosure adopts a two-stage detection method: candidate boxes are first produced by the RPN and then further classified and regressed by the coupling network, so the detection loss comprises the classification and regression losses of both the RPN and the coupling network. Both regression losses use the smooth L1 loss; the RPN classification loss is a binary cross-entropy loss, while the coupling network's classification loss is a multi-class cross-entropy loss. L_att, the loss of the spatial attention modules, is also a binary (foreground/background) cross-entropy loss. L_multi is the loss of the other tasks: it may be the keypoint loss, the instance segmentation loss, or the sum of both (when keypoint detection and instance segmentation are performed simultaneously). λ1 and λ2 are loss weighting factors and can be set as needed. In one example, λ1 is set to 0.25 and λ2 to 1; the detection head samples positives and negatives at a ratio of 1:4 with a sample threshold of 0.5, i.e., a proposal whose IoU with the ground truth exceeds 0.5 is taken as a positive sample and otherwise as a negative sample. The RPN samples positives and negatives at 1:1, with a positive threshold of 0.7 and a negative threshold of 0.3.
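A hedged sketch of the overall objective follows. The weighted sum and the example values λ1 = 0.25, λ2 = 1 come from the description; the attention supervision target (a binary foreground mask resized to each map, e.g. rasterized from the ground-truth boxes) is an assumption, since the text only states that the attention loss is a binary cross-entropy.

```python
import torch.nn.functional as F

def attention_loss(attn_maps, fg_mask):
    # attn_maps: sigmoid outputs (N x 1 x Hi x Wi) from each attention module;
    # fg_mask: N x 1 x H x W float mask with 1 on foreground, 0 on background
    loss = 0.0
    for a in attn_maps:
        target = F.interpolate(fg_mask, size=a.shape[-2:], mode="nearest")
        loss = loss + F.binary_cross_entropy(a, target)
    return loss / len(attn_maps)

def total_loss(l_det, l_att, l_multi, lam1=0.25, lam2=1.0):
    # Loss = L_det + λ1·L_att + λ2·L_multi with the example settings above
    return l_det + lam1 * l_att + lam2 * l_multi
```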
Referring to FIG. 5, which is a schematic structural diagram of a multi-task target detection apparatus provided by an embodiment of the present disclosure, the apparatus includes:
an acquisition module 201, configured to acquire an image of a target to be detected;
a first extraction module 202, configured to extract an attention-aware convolutional feature map of the target using a cascaded spatial attention module;
a second extraction module 203, configured to extract local part features, global structure features, spatial context features, and multi-task features of the target based on the attention-aware convolutional feature map;
a task realization module 204, configured to perform at least one of a target detection task, a keypoint detection task, and an instance segmentation task based on the local part features, global structure features, spatial context features, and multi-task features of the target.
In one embodiment of the present application, the first extraction module 202 includes: an insertion submodule, configured to insert attention modules at multiple preset downsampling multiples of a preset backbone network to obtain multiple attention maps; and a multiplication submodule, configured to multiply each attention map channel by channel with the convolutional feature map at the corresponding downsampling multiple to obtain the attention-aware convolutional feature map.
In one embodiment of the present application, the second extraction module 203 includes: a first extraction submodule, configured to extract candidate boxes containing the target from the attention-aware convolutional feature map; and a second extraction submodule, configured to extract the local part features, global structure features, spatial context features, and multi-task features of the target based on the attention-aware convolutional feature map and the candidate boxes.
In one embodiment of the present application, the task realization module 204 includes: a first feature fusion submodule, configured to fuse the local part features, global structure features, and spatial context features of the target to obtain a structured feature of the target; and a detection task submodule, configured to perform the target detection task based on the structured feature.
In one embodiment of the present application, the task realization module 204 includes: a first upsampling submodule, configured to upsample the structured feature so that its resolution matches that of the multi-task features; a second feature fusion submodule, configured to fuse the upsampled structured feature with the multi-task features to obtain a fused feature; a keypoint detection task submodule, configured to perform keypoint detection on the fused feature to accomplish the keypoint detection task; and/or an instance segmentation task submodule, configured to perform instance segmentation on the fused feature to accomplish the instance segmentation task.
In one embodiment of the present application, the second extraction submodule includes: a first dimension reduction submodule, configured to pass the attention-aware convolutional feature map through a 1×1 convolutional layer to obtain a part-sensitive feature map; a first mapping and division submodule, configured to map the candidate box onto the part-sensitive feature map via PSRoIPooling and divide the candidate box into k×k blocks, so that each block represents a local part and each candidate box forms a k×k part feature; and a pooling submodule, configured to average-pool each k×k part feature to obtain the local part features of the target.
In one embodiment of the present application, the second extraction submodule includes: a second dimension reduction submodule, configured to reduce the dimensionality of the attention-aware convolutional feature map through a 1×1 convolutional layer to obtain a set of dimension-reduced convolutional feature maps; a second mapping and division submodule, configured to map the candidate box onto the dimension-reduced feature maps via RoIPooling and divide the candidate box into k×k blocks, so that each candidate box forms a k×k global feature; and a first encoding submodule, configured to treat each k×k global feature as a whole and encode it with two convolutional layers of sizes k×k and 1×1 to obtain the global structure features of the target.
In one embodiment of the present application, the second extraction submodule includes: a third dimension reduction submodule, configured to reduce the dimensionality of the attention-aware convolutional feature map through a 1×1 convolutional layer to obtain a set of dimension-reduced convolutional feature maps; an area expansion submodule, configured to keep the center point of each candidate box fixed while enlarging its area by a preset multiple; a third mapping and division submodule, configured to map the enlarged candidate box onto the dimension-reduced feature maps via RoIPooling and divide the enlarged candidate box into k×k blocks, so that each candidate box forms a k×k context feature; and a second encoding submodule, configured to treat each k×k context feature as a whole and encode it with two convolutional layers of sizes k×k and 1×1 to obtain the context structure features of the target.
In one embodiment of the present application, the second extraction submodule includes: a fourth dimension reduction submodule, configured to reduce the dimensionality of the attention-aware convolutional feature map through a 1×1 convolutional layer to obtain a set of dimension-reduced convolutional feature maps; a fourth mapping and division submodule, configured to map the candidate box onto the dimension-reduced feature maps via RoIPooling and divide the candidate box into 2k×2k blocks, so that each candidate box forms a 2k×2k feature; a third encoding submodule, configured to encode each 2k×2k feature with four 3×3 convolutions of 256 channels; and a second upsampling submodule, configured to upsample the encoded feature of each candidate box by a preset multiple to obtain the multi-task features of the target.
In one embodiment of the present application, the apparatus further includes a loss detection module, configured to compute, through a preset loss model, the loss of at least one of the target detection task, keypoint detection task, and instance segmentation task;
The preset loss model is:
Loss = L_det(N) + λ1·L_att(N) + λ2·L_multi(N);
where N denotes the detection network implementing the multi-task target detection method, L_det denotes the loss of the detection task, L_att denotes the loss of the attention modules, L_multi denotes the loss of the keypoint detection task and/or instance segmentation task, and λ1 and λ2 are preset loss weighting factors.
The beneficial effects achievable by the above embodiments of the present disclosure are the same as those of the multi-task target detection method shown in FIG. 1 and are not repeated here.
Referring to FIG. 6, FIG. 6 shows a hardware structure diagram of an electronic device.
The electronic device described in this embodiment includes:
a memory 41, a processor 42, and a computer program stored on the memory 41 and executable on the processor; when executing the program, the processor implements the multi-task target detection method described in the embodiment shown in FIG. 1.
Further, the electronic device also includes:
at least one input device 43; and at least one output device 44.
The memory 41, the processor 42, the input device 43, and the output device 44 are connected via a bus 45.
The input device 43 may specifically be a camera, a touch panel, a physical button, a mouse, or the like, and the output device 44 may specifically be a display screen.
The memory 41 may be a high-speed random access memory (RAM) or a non-volatile memory, such as a disk memory. The memory 41 is used to store a set of executable program code, and the processor 42 is coupled to the memory 41.
Further, an embodiment of the present disclosure also provides a computer-readable storage medium, which may be provided in the electronic device of any of the above embodiments, for example the electronic device of the embodiment shown in FIG. 6. The computer-readable storage medium stores a computer program which, when executed by a processor, implements the multi-task target detection method described in the embodiment shown in FIG. 1. Further, the computer-readable storage medium may be any medium that can store program code, such as a USB flash drive, a removable hard disk, a read-only memory (ROM), a random access memory (RAM), a magnetic disk, or an optical disc.
Although the present disclosure has been described in detail above with general descriptions and specific embodiments, modifications or improvements may be made on its basis, as will be apparent to those skilled in the art. Accordingly, such modifications or improvements made without departing from the spirit of the present disclosure fall within its claimed scope of protection.
Terms cited in this specification such as "upper", "lower", "left", "right", and "middle" are used only for clarity of description and are not intended to limit the implementable scope of the present disclosure; changes or adjustments of their relative relationships, without substantive alteration of the technical content, are also regarded as within the implementable scope of the present disclosure.

Claims (13)

  1. A multi-task target detection method, characterized by comprising:
    acquiring an image of a target to be detected;
    extracting an attention-aware convolutional feature map of the target using a cascaded attention module;
    based on the attention-aware convolutional feature map, extracting local part features, global structure features, spatial context features, and multi-task features of the target;
    based on the local part features, global structure features, spatial context features, and multi-task features of the target, performing at least one of a target detection task, a keypoint detection task, and an instance segmentation task.
  2. The multi-task target detection method according to claim 1, characterized in that extracting the attention-aware convolutional feature map of the target using the cascaded spatial attention module comprises:
    inserting attention modules at multiple preset downsampling multiples of a preset backbone network to obtain multiple attention maps;
    multiplying the multiple attention maps channel by channel with the convolutional feature maps at the corresponding downsampling multiples to obtain the attention-aware convolutional feature map.
  3. The multi-task target detection method according to claim 1, characterized in that extracting the local part features, global structure features, spatial context features, and multi-task features of the target based on the attention-aware convolutional feature map comprises:
    extracting a candidate box containing the target from the attention-aware convolutional feature map;
    based on the attention-aware convolutional feature map and the candidate box, extracting the local part features, global structure features, spatial context features, and multi-task features of the target.
  4. The multi-task target detection method according to claim 1, characterized in that performing the target detection task based on the local part features, global structure features, spatial context features, and multi-task features of the target comprises:
    fusing the local part features, global structure features, and spatial context features of the target to obtain a structured feature of the target;
    performing the target detection task based on the structured feature.
  5. The multi-task target detection method according to claim 1, characterized in that performing the keypoint detection task and/or the instance segmentation task based on the local part features, global structure features, spatial context features, and multi-task features of the target comprises:
    upsampling the structured feature so that the resolution of the structured feature matches the resolution of the multi-task features;
    fusing the upsampled structured feature with the multi-task features to obtain a fused feature;
    performing keypoint detection on the fused feature to accomplish the keypoint detection task of the target, and/or performing instance segmentation on the fused feature to accomplish the instance segmentation task of the target.
  6. The multi-task target detection method according to any one of claims 3 to 5, characterized in that extracting the local part features of the target based on the attention-aware convolutional feature map and the candidate box comprises:
    passing the attention-aware convolutional feature map through a 1×1 convolutional layer to obtain a part-sensitive feature map;
    mapping the candidate box onto the part-sensitive feature map via PSRoIPooling and dividing the candidate box into k×k blocks, so that each block represents a local part and each candidate box forms a k×k part feature;
    average-pooling each k×k part feature to obtain the local part features of the target.
  7. The multi-task target detection method according to any one of claims 3 to 5, characterized in that extracting the global structure features of the target based on the attention-aware convolutional feature map and the candidate box comprises:
    reducing the dimensionality of the attention-aware convolutional feature map through a 1×1 convolutional layer to obtain a set of dimension-reduced convolutional feature maps;
    mapping the candidate box onto the dimension-reduced convolutional feature maps via RoIPooling and dividing the candidate box into k×k blocks, so that each candidate box forms a k×k global feature;
    treating each k×k global feature as a whole and encoding it with two convolutional layers of sizes k×k and 1×1, respectively, to obtain the global structure features of the target.
  8. The multi-task target detection method according to any one of claims 3 to 5, characterized in that extracting the context structure features of the target based on the attention-aware convolutional feature map and the candidate box comprises:
    reducing the dimensionality of the attention-aware convolutional feature map through a 1×1 convolutional layer to obtain a set of dimension-reduced convolutional feature maps;
    keeping the center point of each candidate box fixed while enlarging the area of each candidate box to a preset multiple;
    mapping the enlarged candidate box onto the dimension-reduced convolutional feature maps via RoIPooling and dividing the enlarged candidate box into k×k blocks, so that each candidate box forms a k×k context feature;
    treating each k×k context feature as a whole and encoding it with two convolutional layers of sizes k×k and 1×1, respectively, to obtain the context structure features of the target.
  9. The multi-task target detection method according to any one of claims 3 to 5, characterized in that extracting the multi-task features of the target based on the attention-aware convolutional feature map and the candidate box comprises:
    reducing the dimensionality of the attention-aware convolutional feature map through a 1×1 convolutional layer to obtain a set of dimension-reduced convolutional feature maps;
    mapping the candidate box onto the dimension-reduced convolutional feature maps via RoIPooling and dividing the candidate box into 2k×2k blocks, so that each candidate box forms a 2k×2k feature;
    encoding each 2k×2k feature with four 3×3 convolutions of 256 channels;
    upsampling the encoded feature of each candidate box by a preset multiple to obtain the multi-task features of the target.
  10. The multi-task target detection method according to any one of claims 1 to 5, characterized in that the loss of at least one of the target detection task, keypoint detection task, and instance segmentation task is computed through a preset loss model;
    the preset loss model being:
    Loss = L_det(N) + λ1·L_att(N) + λ2·L_multi(N);
    where N denotes the detection network implementing the multi-task target detection method, L_det denotes the loss of the detection task, L_att denotes the loss of the attention modules, L_multi denotes the loss of the keypoint detection task and/or instance segmentation task, and λ1 and λ2 are preset loss weighting factors.
  11. A multi-task target detection apparatus, characterized by comprising:
    an acquisition module, configured to acquire an image of a target to be detected;
    a first extraction module, configured to extract an attention-aware convolutional feature map of the target using a cascaded spatial attention module;
    a second extraction module, configured to extract local part features, global structure features, spatial context features, and multi-task features of the target based on the attention-aware convolutional feature map;
    a task realization module, configured to perform at least one of a target detection task, a keypoint detection task, and an instance segmentation task based on the local part features, global structure features, spatial context features, and multi-task features of the target.
  12. An electronic device, comprising a memory, a processor, and a computer program stored on the memory and executable on the processor, characterized in that the processor, when executing the computer program, implements the steps of the multi-task target detection method according to any one of claims 1 to 11.
  13. A computer-readable storage medium having a computer program stored thereon, characterized in that the computer program, when executed by a processor, implements the steps of the multi-task target detection method according to any one of claims 1 to 11.
PCT/CN2020/137446 2020-05-18 2020-12-18 Multi-task target detection method and apparatus, electronic device, and storage medium WO2021232771A1 (en)

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
CN202010422038.3A CN111598112B (en) 2020-05-18 2020-05-18 Multitask target detection method and device, electronic equipment and storage medium
CN202010422038.3 2020-05-18

Publications (1)

Publication Number Publication Date
WO2021232771A1 true WO2021232771A1 (en) 2021-11-25

Family

ID=72191519

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/CN2020/137446 WO2021232771A1 (en) 2020-05-18 2020-12-18 Multi-task target detection method and apparatus, electronic device, and storage medium

Country Status (2)

Country Link
CN (1) CN111598112B (en)
WO (1) WO2021232771A1 (en)

Cited By (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN114241277A (en) * 2021-12-22 2022-03-25 中国人民解放军国防科技大学 Attention-guided multi-feature fusion disguised target detection method, device, equipment and medium

Families Citing this family (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN111598112B (en) * 2020-05-18 2023-02-24 中科视语(北京)科技有限公司 Multitask target detection method and device, electronic equipment and storage medium
CN112149683A (en) * 2020-09-30 2020-12-29 华宇金信(北京)软件有限公司 Method and device for detecting living objects in night vision environment
CN112507872B (en) * 2020-12-09 2021-12-28 中科视语(北京)科技有限公司 Positioning method and positioning device for head and shoulder area of human body and electronic equipment
CN113222899B (en) * 2021-04-15 2022-09-30 浙江大学 Method for segmenting and classifying liver tumors through CT detection based on deep learning
CN113902983B (en) * 2021-12-06 2022-03-25 南方医科大学南方医院 Laparoscopic surgery tissue and organ identification method and device based on target detection model

Citations (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20100329517A1 (en) * 2009-06-26 2010-12-30 Microsoft Corporation Boosted face verification
CN108647585A (en) * 2018-04-20 2018-10-12 浙江工商大学 A kind of traffic mark symbol detection method based on multiple dimensioned cycle attention network
CN109886871A (en) * 2019-01-07 2019-06-14 国家新闻出版广电总局广播科学研究院 The image super-resolution method merged based on channel attention mechanism and multilayer feature
CN109948709A (en) * 2019-03-21 2019-06-28 南京斯玛唯得智能技术有限公司 A kind of multitask Attribute Recognition system of target object
CN110197182A (en) * 2019-06-11 2019-09-03 中国电子科技集团公司第五十四研究所 Remote sensing image semantic segmentation method based on contextual information and attention mechanism
CN111598112A (en) * 2020-05-18 2020-08-28 中科视语(北京)科技有限公司 Multitask target detection method and device, electronic equipment and storage medium

Family Cites Families (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US10354159B2 (en) * 2016-09-06 2019-07-16 Carnegie Mellon University Methods and software for detecting objects in an image using a contextual multiscale fast region-based convolutional neural network
CN109543699A (en) * 2018-11-28 2019-03-29 北方工业大学 Image abstract generation method based on target detection
CN111062438B (en) * 2019-12-17 2023-06-16 大连理工大学 Image propagation weak supervision fine granularity image classification algorithm based on correlation learning

Also Published As

Publication number Publication date
CN111598112B (en) 2023-02-24
CN111598112A (en) 2020-08-28

Similar Documents

Publication Publication Date Title
WO2021232771A1 (en) Multi-task target detection method and apparatus, electronic device, and storage medium
CN111340814B (en) RGB-D image semantic segmentation method based on multi-mode self-adaptive convolution
CN111047551A (en) Remote sensing image change detection method and system based on U-net improved algorithm
CN111210443A (en) Deformable convolution mixing task cascading semantic segmentation method based on embedding balance
CN110288602A (en) Come down extracting method, landslide extraction system and terminal
CN112950780B (en) Intelligent network map generation method and system based on remote sensing image
CN111898432A (en) Pedestrian detection system and method based on improved YOLOv3 algorithm
CN112070040A (en) Text line detection method for video subtitles
CN108537109B (en) OpenPose-based monocular camera sign language identification method
CN116129129A (en) Character interaction detection model and detection method
CN113705575B (en) Image segmentation method, device, equipment and storage medium
CN114913325A (en) Semantic segmentation method, device and computer program product
Cheng et al. A survey on image semantic segmentation using deep learning techniques
CN112800932B (en) Method for detecting remarkable ship target in offshore background and electronic equipment
Cambuim et al. An efficient static gesture recognizer embedded system based on ELM pattern recognition algorithm
CN116258931B (en) Visual finger representation understanding method and system based on ViT and sliding window attention fusion
CN116363429A (en) Training method of image recognition model, image recognition method, device and equipment
CN116310324A (en) Pyramid cross-layer fusion decoder based on semantic segmentation
CN114550047B (en) Behavior rate guided video behavior recognition method
AU2021104479A4 (en) Text recognition method and system based on decoupled attention mechanism
CN111583352B (en) Intelligent generation method of stylized icon for mobile terminal
CN113610856A (en) Method and device for training image segmentation model and image segmentation
CN111695507A (en) Static gesture recognition method based on improved VGGNet network and PCA
CN115511968B (en) Two-dimensional hand posture estimation method, device, equipment and storage medium
CN110569790A (en) Residential area element extraction method based on texture enhancement convolutional network

Legal Events

Date Code Title Description
121 Ep: the epo has been informed by wipo that ep was designated in this application

Ref document number: 20936493

Country of ref document: EP

Kind code of ref document: A1

NENP Non-entry into the national phase

Ref country code: DE

32PN Ep: public notification in the ep bulletin as address of the adressee cannot be established

Free format text: NOTING OF LOSS OF RIGHTS PURSUANT TO RULE 112(1) EPC (EPO FORM 1205A DATED 15.03.2023)

122 Ep: pct application non-entry in european phase

Ref document number: 20936493

Country of ref document: EP

Kind code of ref document: A1