CN116524320A - Multi-task target detection model for target detection and semantic segmentation - Google Patents

Multi-task target detection model for target detection and semantic segmentation

Info

Publication number
CN116524320A
Authority
CN
China
Prior art keywords
module
loss
semantic segmentation
detection
object detection
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202310275566.4A
Other languages
Chinese (zh)
Inventor
方志宁 (Fang Zhining)
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Guodian Power Ningxia New Energy Development Co ltd
Original Assignee
Guodian Power Ningxia New Energy Development Co ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Guodian Power Ningxia New Energy Development Co ltd filed Critical Guodian Power Ningxia New Energy Development Co ltd
Priority to CN202310275566.4A
Publication of CN116524320A
Legal status: Pending

Links

Classifications

    • G06V 10/82 — Arrangements for image or video recognition or understanding using pattern recognition or machine learning using neural networks
    • G06N 3/045 — Combinations of networks
    • G06N 3/0464 — Convolutional networks [CNN, ConvNet]
    • G06N 3/0495 — Quantised networks; Sparse networks; Compressed networks
    • G06N 3/082 — Learning methods modifying the architecture, e.g. adding, deleting or silencing nodes or connections
    • G06V 10/26 — Segmentation of patterns in the image field; Cutting or merging of image elements to establish the pattern region, e.g. clustering-based techniques; Detection of occlusion
    • G06V 10/95 — Hardware or software architectures specially adapted for image or video understanding structured as a network, e.g. client-server architectures
    • G06V 2201/07 — Target detection

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • General Physics & Mathematics (AREA)
  • Software Systems (AREA)
  • Evolutionary Computation (AREA)
  • Computing Systems (AREA)
  • Artificial Intelligence (AREA)
  • General Health & Medical Sciences (AREA)
  • Health & Medical Sciences (AREA)
  • Biomedical Technology (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Biophysics (AREA)
  • Computational Linguistics (AREA)
  • Data Mining & Analysis (AREA)
  • Molecular Biology (AREA)
  • General Engineering & Computer Science (AREA)
  • Mathematical Physics (AREA)
  • Multimedia (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Databases & Information Systems (AREA)
  • Medical Informatics (AREA)
  • Image Analysis (AREA)

Abstract

A multi-task target detection model for target detection and semantic segmentation comprises a detection module and a semantic segmentation module, both deployed on mobile terminal equipment and connected through a shared backbone network. To make the algorithm model of the detection module and the semantic segmentation module more universal and to further reduce the influence of the amount of prior data on the algorithm model, the detection algorithm branch adopts an Anchor-free detection mode. The model ultimately identifies the position and contour information of an object within 50 ms on mobile terminal equipment, requires no additional program-deployment overhead when the number of users is large, and can also be deployed on a server side to perform detection with a GPU/CPU.

Description

Multi-task target detection model for target detection and semantic segmentation
Technical Field
The invention relates to the technical field of rapid detection with algorithm models on mobile terminal equipment, and in particular to a multi-task target detection model for target detection and semantic segmentation.
Background
At present, in scenes in the field of visual images, it is especially important to detect a specific target at a specified position or to find the position of the target's contour boundary. For example, in automated industrial inspection, a deep learning model is used to detect flaws on a product surface; in automated unmanned aerial vehicle (UAV) inspection, a deep learning model carried on the UAV must automatically analyze the pictures captured by the UAV camera; and in the power industry, UAVs are used to automatically record the display data of power equipment and to enter data such as instrument nameplates, which requires recognizing the content of specified targets.
Normally, under big-data conditions this is handled in one of two ways. In the first mode, for scenes without particularly high real-time requirements, processing is done offline: the mobile terminal equipment is only responsible for collecting and transmitting images, and after receiving the image data, a background server processes it offline with a large model and returns the result to the terminal equipment for further handling; this counts as the relatively traditional mode. The second mode is to deploy a detection/identification model directly on the terminal equipment, process the images acquired by the mobile terminal equipment in real time, and return results in real time; this has been one of the popular approaches in recent years. However, most end-side models detect inaccurately and lack finer description of an object's edge contour, so they struggle to satisfy some high-precision industrial scenes. Compared with the conventional method of the first mode, how to design an algorithm model that rapidly detects and acquires the fine boundary-contour information of an object is the problem to be solved.
Compared with the traditional first mode, current mobile-terminal detection methods of the second mode face the following difficulties:
Only the object position described by a rectangular box can be detected, and the position information of the object's contour boundary is missing, which is hard to satisfy in some high-precision industrial detection scenes.
Conventional models use general convolution modules, or attach attention mechanisms before and after them, but this approach is computationally expensive and runs relatively slowly on edge terminal equipment.
Conventional detection models usually adopt an Anchor-based method, and selecting Anchors requires a certain prior probability; some Anchor-free methods exist at present, but their detection effect on terminal equipment is not ideal.
When the image acquired by the edge equipment is of low quality or an object is occluded, the detection effect of the conventional second mode is poor, because the object is under-represented in the current image or occupies too small a pixel proportion; the main cause is that conventional convolution features lack global-local feature exchange and an effective attention mechanism.
Disclosure of Invention
In view of this, it is desirable to provide a multi-task target detection model for target detection and semantic segmentation.
A multi-task target detection model for target detection and semantic segmentation comprises a detection module and a semantic segmentation module, both deployed on mobile terminal equipment and connected through a shared backbone network.
Preferably, the network module of the mobile terminal device comprises an Anchor-free detection module with the Swin Transformer as the backbone network and a Decoupled Head network module connected to it.
Preferably, the overall architecture of the network module comprises a Patch Partition processing layer, a first-stage Linear Embedding layer, a second-stage Linear Embedding layer, a third-stage Linear Embedding layer, a fourth-stage Linear Embedding layer, a CSP module, a first CSP+DECON module, a second CSP+DECON module, a Concat module, a Decoupled Head module and a Conv module, and the image passes through three processing procedures. The first processing procedure: the image first passes through the Patch Partition processing layer, then through the first-stage, second-stage, third-stage and fourth-stage Linear Embedding layers in turn, then through the CSP module and the detection processing of the Decoupled Head module, and finally outputs data after the convolution module. The second processing procedure: the image first passes through the Patch Partition processing layer, then through the first-stage and second-stage Linear Embedding layers, enters the first CSP+DECON module for processing, then the Concat module, and finally outputs data after the convolution module and the Decoupled Head module. The third processing procedure: the image first passes through the Patch Partition processing layer, then through the first-stage, second-stage and third-stage Linear Embedding layers, enters the second CSP+DECON module for processing, then the Concat module, and finally outputs data after the convolution module and the Decoupled Head module. With this detection scheme, when an image is input, one forward pass of the model outputs the detection result and the semantic segmentation result simultaneously; the detection module and the semantic segmentation module share the feature layer in the training stage, and loss-function compensation further improves the detection precision of the model.
Preferably, the Decoupled Head module outputs IOU information, position detection information and classification information.
Preferably, the detection flow of the Decoupled Head module is as follows:
Step 1: determine the candidate areas of positive samples by using the center prior of the GT (ground truth);
Step 2: for each GT, calculate the Reg + Cls loss of each sample point:
C_ij = L_ij^cls + λ·L_ij^reg
where the labels annotated on the image data set serve as the GT for the image classification task, and Reg + Cls loss denotes the combined regression and classification loss; L_ij^cls and λ·L_ij^reg are the classification loss and the weighted regression loss of GT i with respect to sample j, respectively;
Step 3: determine the number of positive samples to be allocated to each GT by using its predicted sample points, taking the top-20 samples by IoU with the current GT; finally, the Anchor-free predicted point and four offsets are regressed to form the rectangular-box coordinates;
Step 4: sum and round the IoU values of the top-20 samples to obtain the dynamic k of the current GT, where k is the number of dynamically matched candidate points/boxes; for each GT, take the k samples with the smallest loss as positive sample points; and globally remove cases where the same sample is assigned as a positive sample of multiple GTs.
Preferably, the semantic segmentation module further comprises a branching module; before the features extracted by the channel-pruned Swin Transformer-lite backbone network are sent to the segmentation branches, they are further extracted with different convolution kernels, and semantic features of H×W×C are finally output.
Preferably, the semantic segmentation branch structure comprises a first convolution module, a second convolution module and a third convolution module; the convolution kernel size of the first convolution module is 1×1, that of the second convolution module is 5×5, and that of the third convolution module is 3×3.
Preferably, the training loss function of the semantic segmentation branch structure is:
Loss = Loss_Det + Loss_Seg
Loss_Det = Loss_cls + Loss_iou_regression + Loss_confidence + λ·Loss_l1
Loss_Seg = Loss_softmax_cross_entropy
where Loss is the total loss; Loss_Det is the detection branch loss, Loss_Seg is the segmentation branch loss, Loss_confidence is the box confidence loss, Loss_iou_regression is the IoU loss, Loss_cls is the classification loss, and Loss_softmax_cross_entropy is the softmax cross-entropy loss.
To make the algorithm model of the detection module and the semantic segmentation module more universal and to further reduce the influence of the amount of prior data on the algorithm model, the detection algorithm branch adopts an Anchor-free detection mode. The model ultimately identifies the position and contour information of an object within 50 ms on mobile terminal equipment, requires no additional program-deployment overhead when the number of users is large, and can also be deployed on a server side to perform detection with a GPU/CPU.
Drawings
FIG. 1 is a schematic diagram of the overall architecture of a network module;
FIG. 2 is a schematic diagram of the Decoupled Head module;
FIG. 3 is a schematic diagram of a branching module;
in the figure: Patch Partition processing layer 1, first-stage Linear Embedding layer 2, second-stage Linear Embedding layer 3, third-stage Linear Embedding layer 4, fourth-stage Linear Embedding layer 5, CSP module 6, first CSP+DECON module 7, second CSP+DECON module 8, Concat module 9, Decoupled Head module 10, and Conv module 11.
Detailed Description
In order to make the technical scheme of the invention easier to understand, it is described clearly and completely below by way of specific embodiments with reference to the accompanying drawings.
The multi-task target detection model for target detection and semantic segmentation comprises a detection module and a semantic segmentation module, both deployed on mobile terminal equipment and connected through a shared backbone network.
The network module of the mobile terminal device comprises an Anchor-free detection module with the Swin Transformer as the backbone network and a Decoupled Head network module.
Referring to FIG. 1, the overall architecture of the network module includes a Patch Partition processing layer 1, a first-stage Linear Embedding layer 2, a second-stage Linear Embedding layer 3, a third-stage Linear Embedding layer 4, a fourth-stage Linear Embedding layer 5, a CSP module 6, a first CSP+DECON module 7, a second CSP+DECON module 8, a Concat module 9, a Decoupled Head module 10 and a Conv module 11, and the image passes through three processing procedures. The first processing procedure: the image first passes through the Patch Partition processing layer 1, then through the first-stage Linear Embedding layer 2, the second-stage Linear Embedding layer 3, the third-stage Linear Embedding layer 4 and the fourth-stage Linear Embedding layer 5 in turn, then through the CSP module 6 and the detection processing of the Decoupled Head module 10, and finally outputs data after the convolution module. The second processing procedure: the image first passes through the Patch Partition processing layer 1, then through the first-stage Linear Embedding layer 2 and the second-stage Linear Embedding layer 3, enters the first CSP+DECON module 7 for processing, then the Concat module 9, and finally outputs data after the convolution module and the Decoupled Head module 10. The third processing procedure: the image first passes through the Patch Partition processing layer 1, then through the first-stage Linear Embedding layer 2, the second-stage Linear Embedding layer 3 and the third-stage Linear Embedding layer 4, enters the second CSP+DECON module 8 for processing, then the Concat module 9, and finally outputs data after the convolution module and the Decoupled Head module 10. With this detection scheme, when an image is input, one forward pass of the model outputs the detection result and the semantic segmentation result simultaneously; the detection module and the semantic segmentation module share the feature layer in the training stage, and loss-function compensation further improves the detection precision of the model. CSP modules and deconvolution modules are introduced behind the backbone network: taking the size differences of detection targets into account, deconvolution finally brings the outputs to a unified feature-map size, and further fusion of the feature maps forms the feature layer that is finally sent to the detection head. Further fusing the three feature layers behind the Swin Transformer backbone improves the feature expression capacity for small targets and, in the final model, improves the small-target detection effect.
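For illustration only, the following is a minimal PyTorch sketch of the fused-feature design described above: three backbone exits are brought to one spatial scale by deconvolution, fused by Concat and convolution, and fed to a single Decoupled Head plus a segmentation head. All module implementations, channel widths and strides here are assumptions (plain convolutions stand in for the Swin Transformer stages); the patent publishes no reference code.

```python
import torch
import torch.nn as nn

def conv(c_in, c_out, s=1):
    # Conv-BN-SiLU block standing in for the CSP / Conv modules of FIG. 1
    return nn.Sequential(nn.Conv2d(c_in, c_out, 3, s, 1, bias=False),
                         nn.BatchNorm2d(c_out), nn.SiLU())

class MultiTaskNet(nn.Module):
    def __init__(self, num_classes=4):
        super().__init__()
        self.patch_partition = conv(3, 32, s=4)   # Patch Partition, 1/4 scale
        self.stage1 = conv(32, 64)                # Linear Embedding stage 1, 1/4
        self.stage2 = conv(64, 128, s=2)          # stage 2, 1/8
        self.stage3 = conv(128, 256, s=2)         # stage 3, 1/16
        self.stage4 = conv(256, 512, s=2)         # stage 4, 1/32
        # deepest path: CSP, then deconvolution up to the common 1/4 scale
        self.csp = nn.Sequential(conv(512, 128), nn.ConvTranspose2d(128, 128, 8, 8))
        # first CSP+DECON (exit after stage 2) and second CSP+DECON (after stage 3)
        self.csp_decon1 = nn.Sequential(conv(128, 128), nn.ConvTranspose2d(128, 128, 2, 2))
        self.csp_decon2 = nn.Sequential(conv(256, 128), nn.ConvTranspose2d(128, 128, 4, 4))
        self.fuse = conv(3 * 128, 128)            # Concat + Conv fusion
        # single Decoupled Head: separate classification / box / IoU outputs
        self.cls_head = nn.Conv2d(128, num_classes, 1)
        self.reg_head = nn.Conv2d(128, 4, 1)
        self.iou_head = nn.Conv2d(128, 1, 1)
        self.seg_head = nn.Conv2d(128, num_classes, 1)  # segmentation branch

    def forward(self, x):
        x = self.stage1(self.patch_partition(x))
        f2 = self.stage2(x)
        f3 = self.stage3(f2)
        f4 = self.stage4(f3)
        fused = self.fuse(torch.cat([self.csp(f4), self.csp_decon1(f2),
                                     self.csp_decon2(f3)], dim=1))
        return (self.cls_head(fused), self.reg_head(fused),
                self.iou_head(fused), self.seg_head(fused))

cls, box, iou, seg = MultiTaskNet()(torch.randn(1, 3, 512, 512))  # one pass, both tasks
```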
The multi-task target detection model comprises a detection module and a semantic segmentation module, and the algorithm adopts a pruned Swin Transformer as the backbone network. To make the algorithm model more universal and to further reduce the influence of the amount of prior data on it, the detection algorithm branch adopts an Anchor-free detection mode, so that the position and contour information of an object can finally be identified within 50 ms on mobile terminal equipment, with no extra program-deployment overhead when the number of users is large; in addition, the model can be deployed on a server side to perform detection with a GPU/CPU. With this detection scheme, one forward pass of the model on an input image outputs the detection result and the semantic segmentation result simultaneously; the detection module and the semantic segmentation module share the feature layer in the training stage, and loss-function compensation further improves the detection precision of the model.
The backbone networks most commonly used in industry, and with the best performance, are currently ViT and the Swin Transformer. However, ViT uses only the Transformer encoding process without a decoding process, does not take the complexity of CV tasks into account, and has mainly been explored for image classification rather than other CV tasks, so its performance in multi-task detection-and-segmentation scenes is not robust enough. For the backbone design of this model, the Swin Transformer-tiny backbone is used as the baseline, and to further improve the backbone's inference speed, channel pruning is applied to obtain a backbone with a total model size of 0.6 M.
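The patent does not state the pruning criterion used to reach the 0.6 M backbone; a common choice, shown below purely as a hedged sketch, is to rank output channels by the L1 norm of their convolution weights and keep the top fraction (the keep_ratio is an assumption for illustration).

```python
import torch

def prune_conv_channels(weight: torch.Tensor, keep_ratio: float = 0.5) -> torch.Tensor:
    """weight: (out_ch, in_ch, kH, kW); returns the indices of kept output channels."""
    scores = weight.abs().sum(dim=(1, 2, 3))        # L1 norm of each output channel
    k = max(1, int(weight.shape[0] * keep_ratio))   # number of channels to keep
    return torch.topk(scores, k).indices.sort().values

w = torch.randn(64, 32, 3, 3)                       # a conv layer's weight tensor
keep = prune_conv_channels(w, keep_ratio=0.25)      # 16 surviving channel indices
pruned_w = w[keep]                                  # pruned weight: (16, 32, 3, 3)
```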
The local attention mechanism of the Swin Transformer compensates precisely for the accuracy loss the model incurs from channel-wise lightweighting; its W-MSA and SW-MSA structures perform feature exchange and feature fusion over local features after local attention, further improving the feature expression capacity of the network model and yielding a good detection effect on partially occluded targets.
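For reference, the window partitioning underlying W-MSA, and the cyclic shift that turns it into SW-MSA, can be sketched as follows; this is the standard Swin Transformer mechanism the paragraph refers to, not patent-specific code.

```python
import torch

def window_partition(x: torch.Tensor, ws: int) -> torch.Tensor:
    """x: (B, H, W, C) -> (num_windows * B, ws, ws, C); attention runs per window."""
    B, H, W, C = x.shape
    x = x.view(B, H // ws, ws, W // ws, ws, C)
    return x.permute(0, 1, 3, 2, 4, 5).reshape(-1, ws, ws, C)

x = torch.randn(1, 8, 8, 96)                          # an 8x8 feature map, 96 channels
wins = window_partition(x, ws=4)                      # W-MSA: 4 local windows
shifted = torch.roll(x, shifts=(-2, -2), dims=(1, 2)) # SW-MSA: cyclic shift by ws/2 ...
wins_sw = window_partition(shifted, ws=4)             # ... then partition again
```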
Most of the better-performing detection models in industry are multi-head detectors, such as YOLOv5 and YOLOX. In a multi-head detector, each detection head detects targets of a different size, the aim being to further enhance the detection effect. In this model the feature extraction capacity depends mainly on the backbone network, so to further speed up the model a single detection head is used: as shown in the algorithm model architecture diagram of FIG. 1, only one Decoupled Head module 10 is used, in contrast to YOLOX.
referring to fig. 2, the coupled Head module includes output IOU information, location detection information, and classification information, respectively.
The detection flow of the coupled Head module is as follows:
step one, determining candidate areas of positive samples by using the center priori of GT;
step two, calculating reg+cls loss of each sample for each GT point:
C ij =L ij cls +λL ij reg
performing an image classification task, namely marking the image analog labels on the image data set as GT, and reg+Cls loss as regression and classification loss; l (L) ij cls And lambda L ij reg Regression loss and classification loss of GT respectively;
determining the number of positive samples to which each GT needs to be allocated by using the predicted sample points of each GT; samples of the iou front 20 with the current GT; finally, the Anchor free predicted point and the four offsets are regressed to form the rectangular box coordinates;
the iou summation of the Top20 sample is rounded, and k is a candidate point/frame of dynamic matching as dynamic k of the current GT; taking the first k samples with the minimum loss as positive sample points for each GT; the case where the same sample is assigned to positive samples of multiple GT is globally removed.
Referring to FIG. 3, the semantic segmentation module further comprises a branching module; before the features extracted by the channel-pruned Swin Transformer-lite backbone network are sent to the segmentation branches, they are further extracted with different convolution kernels, and semantic features of H×W×C are finally output. The semantic segmentation branch structure comprises a first convolution module, a second convolution module and a third convolution module; the convolution kernel size of the first convolution module is 1×1, that of the second convolution module is 5×5, and that of the third convolution module is 3×3.
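A minimal sketch of such a three-kernel segmentation branch might look as follows; the channel width, the summed fusion of the three branches and the final 1×1 projection are illustrative assumptions.

```python
import torch
import torch.nn as nn

class SegBranch(nn.Module):
    def __init__(self, c_in: int = 128, num_classes: int = 4):
        super().__init__()
        self.conv1 = nn.Conv2d(c_in, c_in, kernel_size=1)             # 1x1 branch
        self.conv2 = nn.Conv2d(c_in, c_in, kernel_size=5, padding=2)  # 5x5 branch
        self.conv3 = nn.Conv2d(c_in, c_in, kernel_size=3, padding=1)  # 3x3 branch
        self.proj = nn.Conv2d(c_in, num_classes, 1)                   # H x W x C output

    def forward(self, feat):
        # fuse the three receptive fields, then project to semantic classes
        return self.proj(self.conv1(feat) + self.conv2(feat) + self.conv3(feat))

seg_map = SegBranch()(torch.randn(1, 128, 64, 64))   # -> (1, 4, 64, 64)
```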
To further improve the detection capability of the model, so that it can detect both the rectangular-box position of the target object and the boundary-contour information of the target object, the training loss function strategy of the model is designed as follows.
The training loss function of the semantic segmentation branch structure is:
Loss = Loss_Det + Loss_Seg
Loss_Det = Loss_cls + Loss_iou_regression + Loss_confidence + λ·Loss_l1
Loss_Seg = Loss_softmax_cross_entropy
where Loss is the total loss; Loss_Det is the detection branch loss, Loss_Seg is the segmentation branch loss, Loss_confidence is the box confidence loss, Loss_iou_regression is the IoU loss, Loss_cls is the classification loss, and Loss_softmax_cross_entropy is the softmax cross-entropy loss.
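The following sketch assembles the loss terms named above using common stand-ins, since the patent gives only the term names: binary cross-entropy for the classification and confidence terms, 1 − IoU for the IoU regression term, and L1 on box offsets.

```python
import torch
import torch.nn.functional as F

def total_loss(cls_logits, cls_targets, pred_iou, conf_logits, conf_targets,
               box_preds, box_targets, seg_logits, seg_targets, lam=1.0):
    loss_cls = F.binary_cross_entropy_with_logits(cls_logits, cls_targets)    # Loss_cls
    loss_iou = (1.0 - pred_iou).mean()                                        # Loss_iou_regression stand-in
    loss_conf = F.binary_cross_entropy_with_logits(conf_logits, conf_targets) # Loss_confidence
    loss_l1 = F.l1_loss(box_preds, box_targets)                               # Loss_l1 on box offsets
    loss_det = loss_cls + loss_iou + loss_conf + lam * loss_l1                # Loss_Det
    loss_seg = F.cross_entropy(seg_logits, seg_targets)                       # Loss_Seg (softmax CE)
    return loss_det + loss_seg                                                # Loss = Loss_Det + Loss_Seg

B, N, C = 2, 100, 4
loss = total_loss(torch.randn(B, N, C), torch.rand(B, N, C), torch.rand(B, N),
                  torch.randn(B, N), torch.rand(B, N),
                  torch.randn(B, N, 4), torch.randn(B, N, 4),
                  torch.randn(B, C, 64, 64), torch.randint(0, C, (B, 64, 64)))
```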
To accelerate convergence in the actual training stage, the network weights of the segmentation part are first solidified (frozen) so that gradient back-propagation is not executed for them, and only the network weights of the detection module are trained; after the detection module's weights have converged, the weights of the segmentation module are released so that all global weights converge.
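A minimal sketch of this two-stage schedule, with a toy stand-in model (the sub-module names det_head and seg_head are hypothetical, not from the patent):

```python
import torch
import torch.nn as nn

model = nn.ModuleDict({"det_head": nn.Conv2d(8, 4, 1),
                       "seg_head": nn.Conv2d(8, 2, 1)})   # toy stand-in model

# Stage 1: solidify (freeze) the segmentation weights; no gradients flow into them
for p in model["seg_head"].parameters():
    p.requires_grad = False
opt = torch.optim.SGD([p for p in model.parameters() if p.requires_grad], lr=1e-3)
# ... train until the detection weights converge ...

# Stage 2: release the segmentation weights and train all weights to global convergence
for p in model["seg_head"].parameters():
    p.requires_grad = True
opt = torch.optim.SGD(model.parameters(), lr=1e-4)
# ... continue joint training ...
```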
The invention overcomes the following problems:
the algorithm model itself needs to be smaller, the size of the model needs to be less than 5M, less calculation parameters are obtained, and low-delay effect is achieved.
For a mobile terminal carrying model, image data shot by each device needs to be processed in real time, because uncertain object shielding exists in the outdoor in some scenes such as automatic inspection, and the size of a target object is inconsistent due to uncertain shooting height of the mobile terminal device, a series of special situations that the target is too small (below 60 pixels) or is partially shielded need to be processed by the detection model.
The data sets of most industrial scenes to be detected for the mobile terminal-mounted device may be of unknown shape and of unknown aspect ratio, and the model needs to be flexibly adapted to the size of each target of each scene, and at this time, the Anchor size may not be debugged by too many a priori data sets, so the model requirement is preferably an Anchor free-based algorithm model.
To make the algorithm model of the detection module and the semantic segmentation module more universal and to further reduce the influence of the amount of prior data on the algorithm model, the detection algorithm branch adopts an Anchor-free detection mode. The model ultimately identifies the position and contour information of an object within 50 ms on mobile terminal equipment, requires no additional program-deployment overhead when the number of users is large, and can also be deployed on a server side to perform detection with a GPU/CPU.
It should be noted that the embodiments described herein are only some embodiments of the present invention, not all of its implementations; they are exemplary and serve only to provide a more intuitive and clear understanding of the present disclosure, without limiting the described technical solution. All other embodiments, and other simple alternatives and variations of the inventive solution that would occur to a person skilled in the art without departing from the inventive concept, fall within the scope of the invention.

Claims (8)

1. A multi-task target detection model for target detection and semantic segmentation, characterized in that: the multi-task target detection model for target detection and semantic segmentation comprises a detection module and a semantic segmentation module, both deployed on mobile terminal equipment and connected through a shared backbone network.
2. The multi-task target detection model for target detection and semantic segmentation according to claim 1, characterized in that: the network module of the mobile terminal device comprises an Anchor-free detection module with the Swin Transformer as the backbone network and a Decoupled Head network module.
3. The multi-task target detection model for target detection and semantic segmentation according to claim 2, characterized in that: the overall architecture of the network module comprises a Patch Partition processing layer, a first-stage Linear Embedding layer, a second-stage Linear Embedding layer, a third-stage Linear Embedding layer, a fourth-stage Linear Embedding layer, a CSP module, a first CSP+DECON module, a second CSP+DECON module, a Concat module, a Decoupled Head module and a Conv module, and the image passes through three processing procedures. The first processing procedure: the image first passes through the Patch Partition processing layer, then through the first-stage, second-stage, third-stage and fourth-stage Linear Embedding layers in turn, then through the CSP module and the detection processing of the Decoupled Head module, and finally outputs data after the convolution module. The second processing procedure: the image first passes through the Patch Partition processing layer, then through the first-stage and second-stage Linear Embedding layers, enters the first CSP+DECON module for processing, then the Concat module, and finally outputs data after the convolution module and the Decoupled Head module. The third processing procedure: the image first passes through the Patch Partition processing layer, then through the first-stage, second-stage and third-stage Linear Embedding layers, enters the second CSP+DECON module for processing, then the Concat module, and finally outputs data after the convolution module and the Decoupled Head module.
4. The multi-task target detection model for target detection and semantic segmentation according to claim 3, characterized in that: the Decoupled Head module outputs IOU information, position detection information and classification information.
5. The multi-task target detection model for target detection and semantic segmentation according to claim 4, characterized in that: the detection flow of the Decoupled Head module is as follows:
Step 1: determine the candidate areas of positive samples by using the center prior of the GT (ground truth);
Step 2: for each GT, calculate the Reg + Cls loss of each sample point:
C_ij = L_ij^cls + λ·L_ij^reg
where the labels annotated on the image data set serve as the GT for the image classification task, and Reg + Cls loss denotes the combined regression and classification loss; L_ij^cls and λ·L_ij^reg are the classification loss and the weighted regression loss of GT i with respect to sample j, respectively;
Step 3: determine the number of positive samples to be allocated to each GT by using its predicted sample points, taking the top-20 samples by IoU with the current GT; finally, the Anchor-free predicted point and four offsets are regressed to form the rectangular-box coordinates;
Step 4: sum and round the IoU values of the top-20 samples to obtain the dynamic k of the current GT, where k is the number of dynamically matched candidate points/boxes; for each GT, take the k samples with the smallest loss as positive sample points; and globally remove cases where the same sample is assigned as a positive sample of multiple GTs.
6. The multi-task target detection model for target detection and semantic segmentation according to claim 5, characterized in that: the semantic segmentation module further comprises a branching module; before the features extracted by the channel-pruned Swin Transformer-lite backbone network are sent to the segmentation branches, they are further extracted with different convolution kernels, and semantic features of H×W×C are finally output.
7. The multi-task target detection model for target detection and semantic segmentation according to claim 6, characterized in that: the semantic segmentation branch structure comprises a first convolution module, a second convolution module and a third convolution module; the convolution kernel size of the first convolution module is 1×1, that of the second convolution module is 5×5, and that of the third convolution module is 3×3.
8. The multi-task target detection model for target detection and semantic segmentation according to claim 7, characterized in that: the training loss function of the semantic segmentation branch structure is:
Loss = Loss_Det + Loss_Seg
Loss_Det = Loss_cls + Loss_iou_regression + Loss_confidence + λ·Loss_l1
Loss_Seg = Loss_softmax_cross_entropy
where Loss is the total loss; Loss_Det is the detection branch loss, Loss_Seg is the segmentation branch loss, Loss_confidence is the box confidence loss, Loss_iou_regression is the IoU loss, Loss_cls is the classification loss, and Loss_softmax_cross_entropy is the softmax cross-entropy loss.
CN202310275566.4A 2023-03-21 2023-03-21 Multi-task target detection model for target detection and semantic segmentation Pending CN116524320A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202310275566.4A CN116524320A (en) 2023-03-21 2023-03-21 Multi-task target detection model for target detection and semantic segmentation

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202310275566.4A CN116524320A (en) 2023-03-21 2023-03-21 Multi-task target detection model for target detection and semantic segmentation

Publications (1)

Publication Number Publication Date
CN116524320A true CN116524320A (en) 2023-08-01

Family

ID=87407197

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202310275566.4A Pending CN116524320A (en) 2023-03-21 2023-03-21 Multi-task target detection model for target detection and semantic segmentation

Country Status (1)

Country Link
CN (1) CN116524320A (en)

Cited By (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN118036666A * 2024-04-12 2024-05-14 Tsinghua University Task processing method, device, equipment, storage medium and computer program product
CN118036666B * 2024-04-12 2024-06-11 Tsinghua University Task processing method, device, equipment, storage medium and computer program product

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination