CN116524320A - Multi-task target detection model for target detection and semantic segmentation - Google Patents

Multi-task target detection model for target detection and semantic segmentation

Info

Publication number
CN116524320A
Authority
CN
China
Prior art keywords
module
loss
semantic segmentation
detection
object detection
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202310275566.4A
Other languages
Chinese (zh)
Inventor
方志宁 (Fang Zhining)
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Guodian Power Ningxia New Energy Development Co ltd
Original Assignee
Guodian Power Ningxia New Energy Development Co ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Guodian Power Ningxia New Energy Development Co ltd filed Critical Guodian Power Ningxia New Energy Development Co ltd
Priority to CN202310275566.4A
Publication of CN116524320A
Legal status: Pending

Links

Classifications

    • G06V 10/82 — Arrangements for image or video recognition or understanding using pattern recognition or machine learning using neural networks
    • G06N 3/045 — Combinations of networks
    • G06N 3/0464 — Convolutional networks [CNN, ConvNet]
    • G06N 3/0495 — Quantised networks; Sparse networks; Compressed networks
    • G06N 3/082 — Learning methods modifying the architecture, e.g. adding, deleting or silencing nodes or connections
    • G06V 10/26 — Segmentation of patterns in the image field; Cutting or merging of image elements to establish the pattern region, e.g. clustering-based techniques; Detection of occlusion
    • G06V 10/95 — Hardware or software architectures specially adapted for image or video understanding structured as a network, e.g. client-server architectures
    • G06V 2201/07 — Target detection

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • General Physics & Mathematics (AREA)
  • Software Systems (AREA)
  • Evolutionary Computation (AREA)
  • Computing Systems (AREA)
  • Artificial Intelligence (AREA)
  • General Health & Medical Sciences (AREA)
  • Health & Medical Sciences (AREA)
  • Biomedical Technology (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Biophysics (AREA)
  • Computational Linguistics (AREA)
  • Data Mining & Analysis (AREA)
  • Molecular Biology (AREA)
  • General Engineering & Computer Science (AREA)
  • Mathematical Physics (AREA)
  • Multimedia (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Databases & Information Systems (AREA)
  • Medical Informatics (AREA)
  • Image Analysis (AREA)

Abstract

A multi-task target detection model for target detection and semantic segmentation comprises a detection module and a semantic segmentation module, both deployed on mobile terminal equipment and connected through a shared backbone network. To make the algorithm model of the detection module and the semantic segmentation module more universal and to further reduce the influence of the amount of prior data on the algorithm model, the detection algorithm branch adopts an Anchor-free detection mode. The model ultimately identifies the position and contour information of an object within 50 ms on mobile terminal equipment, requires no additional program-deployment overhead when the number of users is large, and can also be deployed on a server side to perform detection with a GPU/CPU.

Description

Multi-task target detection model for target detection and semantic segmentation
Technical Field
The invention relates to the technical field of rapid detection with algorithm models on mobile terminal equipment, and in particular to a multi-task target detection model for target detection and semantic segmentation.
Background
At present, in scenes in the field of visual images, it is especially important to detect a specific target at a specified position or to find the position of the target's contour boundary. For example, in automated industrial inspection, a deep learning model is used to detect flaws on a product surface; in automated unmanned aerial vehicle (UAV) inspection, a deep learning model carried on the UAV must automatically analyze the pictures captured by the UAV camera; and in the power industry, UAVs are used to automatically record the display data of power equipment and to enter data such as instrument nameplates, which requires recognizing the content of specified targets.
Normally, under big-data conditions this is handled in one of two ways. In the first mode, for scenes without particularly high real-time requirements, processing is done offline: the mobile terminal equipment is only responsible for collecting and transmitting images, and after receiving the image data, a background server processes it offline with a large model and returns the result to the terminal equipment for further handling; this counts as the relatively traditional mode. The second mode is to deploy a detection/identification model directly on the terminal equipment, process the images acquired by the mobile terminal equipment in real time, and return results in real time; this has been one of the popular approaches in recent years. However, most end-side models detect inaccurately and lack finer description of an object's edge contour, so they struggle to satisfy some high-precision industrial scenes. Compared with the conventional method of the first mode, how to design an algorithm model that rapidly detects and acquires the fine boundary-contour information of an object is the problem to be solved.
Compared with the traditional first mode, current mobile-terminal detection methods of the second mode face the following difficulties:
Only the object position described by a rectangular box can be detected, and the position information of the object's contour boundary is missing, which is hard to satisfy in some high-precision industrial detection scenes.
Conventional models use general convolution modules, or attach attention mechanisms before and after them, but this approach is computationally expensive and runs relatively slowly on edge terminal equipment.
Conventional detection models usually adopt an Anchor-based method, and selecting Anchors requires a certain prior probability; some Anchor-free methods exist at present, but their detection effect on terminal equipment is not ideal.
When the image acquired by the edge equipment is of low quality or an object is occluded, the detection effect of the conventional second mode is poor, because the object is under-represented in the current image or occupies too small a pixel proportion; the main cause is that conventional convolution features lack global-local feature exchange and an effective attention mechanism.
Disclosure of Invention
In view of this, it is desirable to provide a multi-task target detection model for target detection and semantic segmentation.
A multi-task target detection model for target detection and semantic segmentation comprises a detection module and a semantic segmentation module, both deployed on mobile terminal equipment and connected through a shared backbone network.
Preferably, the network module of the mobile terminal device comprises an Anchor-free detection module with the Swin Transformer as the backbone network and a Decoupled Head network module connected to it.
Preferably, the overall architecture of the network module comprises a Patch Partition processing layer, a first-stage Linear Embedding layer, a second-stage Linear Embedding layer, a third-stage Linear Embedding layer, a fourth-stage Linear Embedding layer, a CSP module, a first CSP+DECON module, a second CSP+DECON module, a Concat module, a Decoupled Head module and a Conv module, and the image passes through three processing procedures. The first processing procedure: the image first passes through the Patch Partition processing layer, then through the first-stage, second-stage, third-stage and fourth-stage Linear Embedding layers in turn, then through the CSP module and the detection processing of the Decoupled Head module, and finally outputs data after the convolution module. The second processing procedure: the image first passes through the Patch Partition processing layer, then through the first-stage and second-stage Linear Embedding layers, enters the first CSP+DECON module for processing, then the Concat module, and finally outputs data after the convolution module and the Decoupled Head module. The third processing procedure: the image first passes through the Patch Partition processing layer, then through the first-stage, second-stage and third-stage Linear Embedding layers, enters the second CSP+DECON module for processing, then the Concat module, and finally outputs data after the convolution module and the Decoupled Head module. With this detection scheme, when an image is input, one forward pass of the model outputs the detection result and the semantic segmentation result simultaneously; the detection module and the semantic segmentation module share the feature layer in the training stage, and loss-function compensation further improves the detection precision of the model.
Preferably, the Decoupled Head module outputs IOU information, position detection information and classification information.
Preferably, the detection flow of the Decoupled Head module is as follows:
Step 1: determine the candidate areas of positive samples by using the center prior of the GT (ground truth);
Step 2: for each GT, calculate the Reg + Cls loss of each sample point:
C_ij = L_ij^cls + λ·L_ij^reg
where the labels annotated on the image data set serve as the GT for the image classification task, and Reg + Cls loss denotes the combined regression and classification loss; L_ij^cls and λ·L_ij^reg are the classification loss and the weighted regression loss of GT i with respect to sample j, respectively;
Step 3: determine the number of positive samples to be allocated to each GT by using its predicted sample points, taking the top-20 samples by IoU with the current GT; finally, the Anchor-free predicted point and four offsets are regressed to form the rectangular-box coordinates;
Step 4: sum and round the IoU values of the top-20 samples to obtain the dynamic k of the current GT, where k is the number of dynamically matched candidate points/boxes; for each GT, take the k samples with the smallest loss as positive sample points; and globally remove cases where the same sample is assigned as a positive sample of multiple GTs.
Preferably, the semantic segmentation module further comprises a branching module; before the features extracted by the channel-pruned Swin Transformer-lite backbone network are sent to the segmentation branches, they are further extracted with different convolution kernels, and semantic features of H×W×C are finally output.
Preferably, the semantic segmentation branch structure comprises a first convolution module, a second convolution module and a third convolution module; the convolution kernel size of the first convolution module is 1×1, that of the second convolution module is 5×5, and that of the third convolution module is 3×3.
Preferably, the training loss function of the semantic segmentation branch structure is:
Loss = Loss_Det + Loss_Seg
Loss_Det = Loss_cls + Loss_iou_regression + Loss_confidence + λ·Loss_l1
Loss_Seg = Loss_softmax_cross_entropy
where Loss is the total loss; Loss_Det is the detection branch loss, Loss_Seg is the segmentation branch loss, Loss_confidence is the box confidence loss, Loss_iou_regression is the IoU loss, Loss_cls is the classification loss, and Loss_softmax_cross_entropy is the softmax cross-entropy loss.
To make the algorithm model of the detection module and the semantic segmentation module more universal and to further reduce the influence of the amount of prior data on the algorithm model, the detection algorithm branch adopts an Anchor-free detection mode. The model ultimately identifies the position and contour information of an object within 50 ms on mobile terminal equipment, requires no additional program-deployment overhead when the number of users is large, and can also be deployed on a server side to perform detection with a GPU/CPU.
Drawings
FIG. 1 is a schematic diagram of the overall architecture of a network module;
FIG. 2 is a schematic diagram of the Decoupled Head module;
FIG. 3 is a schematic diagram of a branching module;
in the figure: Patch Partition processing layer 1, first-stage Linear Embedding layer 2, second-stage Linear Embedding layer 3, third-stage Linear Embedding layer 4, fourth-stage Linear Embedding layer 5, CSP module 6, first CSP+DECON module 7, second CSP+DECON module 8, Concat module 9, Decoupled Head module 10, and Conv module 11.
Detailed Description
In order to make the technical scheme of the invention easier to understand, it is described clearly and completely below by way of specific embodiments with reference to the accompanying drawings.
The multi-task target detection model for target detection and semantic segmentation comprises a detection module and a semantic segmentation module, both deployed on mobile terminal equipment and connected through a shared backbone network.
The network module of the mobile terminal device comprises an Anchor-free detection module with the Swin Transformer as the backbone network and a Decoupled Head network module.
Referring to FIG. 1, the overall architecture of the network module includes a Patch Partition processing layer 1, a first-stage Linear Embedding layer 2, a second-stage Linear Embedding layer 3, a third-stage Linear Embedding layer 4, a fourth-stage Linear Embedding layer 5, a CSP module 6, a first CSP+DECON module 7, a second CSP+DECON module 8, a Concat module 9, a Decoupled Head module 10 and a Conv module 11, and the image passes through three processing procedures. The first processing procedure: the image first passes through the Patch Partition processing layer 1, then through the first-stage Linear Embedding layer 2, the second-stage Linear Embedding layer 3, the third-stage Linear Embedding layer 4 and the fourth-stage Linear Embedding layer 5 in turn, then through the CSP module 6 and the detection processing of the Decoupled Head module 10, and finally outputs data after the convolution module. The second processing procedure: the image first passes through the Patch Partition processing layer 1, then through the first-stage Linear Embedding layer 2 and the second-stage Linear Embedding layer 3, enters the first CSP+DECON module 7 for processing, then the Concat module 9, and finally outputs data after the convolution module and the Decoupled Head module 10. The third processing procedure: the image first passes through the Patch Partition processing layer 1, then through the first-stage Linear Embedding layer 2, the second-stage Linear Embedding layer 3 and the third-stage Linear Embedding layer 4, enters the second CSP+DECON module 8 for processing, then the Concat module 9, and finally outputs data after the convolution module and the Decoupled Head module 10. With this detection scheme, when an image is input, one forward pass of the model outputs the detection result and the semantic segmentation result simultaneously; the detection module and the semantic segmentation module share the feature layer in the training stage, and loss-function compensation further improves the detection precision of the model. CSP modules and deconvolution modules are introduced behind the backbone network: taking the size differences of detection targets into account, deconvolution finally brings the outputs to a unified feature-map size, and further fusion of the feature maps forms the feature layer that is finally sent to the detection head. Further fusing the three feature layers behind the Swin Transformer backbone improves the feature expression capacity for small targets and, in the final model, improves the small-target detection effect.
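For illustration only, the following is a minimal PyTorch sketch of the fused-feature design described above: three backbone exits are brought to one spatial scale by deconvolution, fused by Concat and convolution, and fed to a single Decoupled Head plus a segmentation head. All module implementations, channel widths and strides here are assumptions (plain convolutions stand in for the Swin Transformer stages); the patent publishes no reference code.

```python
import torch
import torch.nn as nn

def conv(c_in, c_out, s=1):
    # Conv-BN-SiLU block standing in for the CSP / Conv modules of FIG. 1
    return nn.Sequential(nn.Conv2d(c_in, c_out, 3, s, 1, bias=False),
                         nn.BatchNorm2d(c_out), nn.SiLU())

class MultiTaskNet(nn.Module):
    def __init__(self, num_classes=4):
        super().__init__()
        self.patch_partition = conv(3, 32, s=4)   # Patch Partition, 1/4 scale
        self.stage1 = conv(32, 64)                # Linear Embedding stage 1, 1/4
        self.stage2 = conv(64, 128, s=2)          # stage 2, 1/8
        self.stage3 = conv(128, 256, s=2)         # stage 3, 1/16
        self.stage4 = conv(256, 512, s=2)         # stage 4, 1/32
        # deepest path: CSP, then deconvolution up to the common 1/4 scale
        self.csp = nn.Sequential(conv(512, 128), nn.ConvTranspose2d(128, 128, 8, 8))
        # first CSP+DECON (exit after stage 2) and second CSP+DECON (after stage 3)
        self.csp_decon1 = nn.Sequential(conv(128, 128), nn.ConvTranspose2d(128, 128, 2, 2))
        self.csp_decon2 = nn.Sequential(conv(256, 128), nn.ConvTranspose2d(128, 128, 4, 4))
        self.fuse = conv(3 * 128, 128)            # Concat + Conv fusion
        # single Decoupled Head: separate classification / box / IoU outputs
        self.cls_head = nn.Conv2d(128, num_classes, 1)
        self.reg_head = nn.Conv2d(128, 4, 1)
        self.iou_head = nn.Conv2d(128, 1, 1)
        self.seg_head = nn.Conv2d(128, num_classes, 1)  # segmentation branch

    def forward(self, x):
        x = self.stage1(self.patch_partition(x))
        f2 = self.stage2(x)
        f3 = self.stage3(f2)
        f4 = self.stage4(f3)
        fused = self.fuse(torch.cat([self.csp(f4), self.csp_decon1(f2),
                                     self.csp_decon2(f3)], dim=1))
        return (self.cls_head(fused), self.reg_head(fused),
                self.iou_head(fused), self.seg_head(fused))

cls, box, iou, seg = MultiTaskNet()(torch.randn(1, 3, 512, 512))  # one pass, both tasks
```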
The multi-task target detection model comprises a detection module and a semantic segmentation module, and the algorithm adopts a pruned Swin Transformer as the backbone network. To make the algorithm model more universal and to further reduce the influence of the amount of prior data on it, the detection algorithm branch adopts an Anchor-free detection mode, so that the position and contour information of an object can finally be identified within 50 ms on mobile terminal equipment, with no extra program-deployment overhead when the number of users is large; in addition, the model can be deployed on a server side to perform detection with a GPU/CPU. With this detection scheme, one forward pass of the model on an input image outputs the detection result and the semantic segmentation result simultaneously; the detection module and the semantic segmentation module share the feature layer in the training stage, and loss-function compensation further improves the detection precision of the model.
The backbone networks most commonly used in industry, and with the best performance, are currently ViT and the Swin Transformer. However, ViT uses only the Transformer encoding process without a decoding process, does not take the complexity of CV tasks into account, and has mainly been explored for image classification rather than other CV tasks, so its performance in multi-task detection-and-segmentation scenes is not robust enough. For the backbone design of this model, the Swin Transformer-tiny backbone is used as the baseline, and to further improve the backbone's inference speed, channel pruning is applied to obtain a backbone with a total model size of 0.6 M.
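The patent does not state the pruning criterion used to reach the 0.6 M backbone; a common choice, shown below purely as a hedged sketch, is to rank output channels by the L1 norm of their convolution weights and keep the top fraction (the keep_ratio is an assumption for illustration).

```python
import torch

def prune_conv_channels(weight: torch.Tensor, keep_ratio: float = 0.5) -> torch.Tensor:
    """weight: (out_ch, in_ch, kH, kW); returns the indices of kept output channels."""
    scores = weight.abs().sum(dim=(1, 2, 3))        # L1 norm of each output channel
    k = max(1, int(weight.shape[0] * keep_ratio))   # number of channels to keep
    return torch.topk(scores, k).indices.sort().values

w = torch.randn(64, 32, 3, 3)                       # a conv layer's weight tensor
keep = prune_conv_channels(w, keep_ratio=0.25)      # 16 surviving channel indices
pruned_w = w[keep]                                  # pruned weight: (16, 32, 3, 3)
```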
The local attention mechanism of the Swin Transformer compensates precisely for the accuracy loss the model incurs from channel-wise lightweighting; its W-MSA and SW-MSA structures perform feature exchange and feature fusion over local features after local attention, further improving the feature expression capacity of the network model and yielding a good detection effect on partially occluded targets.
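For reference, the window partitioning underlying W-MSA, and the cyclic shift that turns it into SW-MSA, can be sketched as follows; this is the standard Swin Transformer mechanism the paragraph refers to, not patent-specific code.

```python
import torch

def window_partition(x: torch.Tensor, ws: int) -> torch.Tensor:
    """x: (B, H, W, C) -> (num_windows * B, ws, ws, C); attention runs per window."""
    B, H, W, C = x.shape
    x = x.view(B, H // ws, ws, W // ws, ws, C)
    return x.permute(0, 1, 3, 2, 4, 5).reshape(-1, ws, ws, C)

x = torch.randn(1, 8, 8, 96)                          # an 8x8 feature map, 96 channels
wins = window_partition(x, ws=4)                      # W-MSA: 4 local windows
shifted = torch.roll(x, shifts=(-2, -2), dims=(1, 2)) # SW-MSA: cyclic shift by ws/2 ...
wins_sw = window_partition(shifted, ws=4)             # ... then partition again
```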
Most of the better-performing detection models in industry are multi-head detectors, such as YOLOv5 and YOLOX. In a multi-head detector, each detection head detects targets of a different size, the aim being to further enhance the detection effect. In this model the feature extraction capacity depends mainly on the backbone network, so to further speed up the model a single detection head is used: as shown in the algorithm model architecture diagram of FIG. 1, only one Decoupled Head module 10 is used, in contrast to YOLOX.
referring to fig. 2, the coupled Head module includes output IOU information, location detection information, and classification information, respectively.
The detection flow of the coupled Head module is as follows:
step one, determining candidate areas of positive samples by using the center priori of GT;
step two, calculating reg+cls loss of each sample for each GT point:
C ij =L ij cls +λL ij reg
performing an image classification task, namely marking the image analog labels on the image data set as GT, and reg+Cls loss as regression and classification loss; l (L) ij cls And lambda L ij reg Regression loss and classification loss of GT respectively;
determining the number of positive samples to which each GT needs to be allocated by using the predicted sample points of each GT; samples of the iou front 20 with the current GT; finally, the Anchor free predicted point and the four offsets are regressed to form the rectangular box coordinates;
the iou summation of the Top20 sample is rounded, and k is a candidate point/frame of dynamic matching as dynamic k of the current GT; taking the first k samples with the minimum loss as positive sample points for each GT; the case where the same sample is assigned to positive samples of multiple GT is globally removed.
Referring to FIG. 3, the semantic segmentation module further comprises a branching module; before the features extracted by the channel-pruned Swin Transformer-lite backbone network are sent to the segmentation branches, they are further extracted with different convolution kernels, and semantic features of H×W×C are finally output. The semantic segmentation branch structure comprises a first convolution module, a second convolution module and a third convolution module; the convolution kernel size of the first convolution module is 1×1, that of the second convolution module is 5×5, and that of the third convolution module is 3×3.
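A minimal sketch of such a three-kernel segmentation branch might look as follows; the channel width, the summed fusion of the three branches and the final 1×1 projection are illustrative assumptions.

```python
import torch
import torch.nn as nn

class SegBranch(nn.Module):
    def __init__(self, c_in: int = 128, num_classes: int = 4):
        super().__init__()
        self.conv1 = nn.Conv2d(c_in, c_in, kernel_size=1)             # 1x1 branch
        self.conv2 = nn.Conv2d(c_in, c_in, kernel_size=5, padding=2)  # 5x5 branch
        self.conv3 = nn.Conv2d(c_in, c_in, kernel_size=3, padding=1)  # 3x3 branch
        self.proj = nn.Conv2d(c_in, num_classes, 1)                   # H x W x C output

    def forward(self, feat):
        # fuse the three receptive fields, then project to semantic classes
        return self.proj(self.conv1(feat) + self.conv2(feat) + self.conv3(feat))

seg_map = SegBranch()(torch.randn(1, 128, 64, 64))   # -> (1, 4, 64, 64)
```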
To further improve the detection capability of the model, so that it can detect both the rectangular-box position of the target object and the boundary-contour information of the target object, the training loss function strategy of the model is designed as follows.
The training loss function of the semantic segmentation branch structure is:
Loss = Loss_Det + Loss_Seg
Loss_Det = Loss_cls + Loss_iou_regression + Loss_confidence + λ·Loss_l1
Loss_Seg = Loss_softmax_cross_entropy
where Loss is the total loss; Loss_Det is the detection branch loss, Loss_Seg is the segmentation branch loss, Loss_confidence is the box confidence loss, Loss_iou_regression is the IoU loss, Loss_cls is the classification loss, and Loss_softmax_cross_entropy is the softmax cross-entropy loss.
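The following sketch assembles the loss terms named above using common stand-ins, since the patent gives only the term names: binary cross-entropy for the classification and confidence terms, 1 − IoU for the IoU regression term, and L1 on box offsets.

```python
import torch
import torch.nn.functional as F

def total_loss(cls_logits, cls_targets, pred_iou, conf_logits, conf_targets,
               box_preds, box_targets, seg_logits, seg_targets, lam=1.0):
    loss_cls = F.binary_cross_entropy_with_logits(cls_logits, cls_targets)    # Loss_cls
    loss_iou = (1.0 - pred_iou).mean()                                        # Loss_iou_regression stand-in
    loss_conf = F.binary_cross_entropy_with_logits(conf_logits, conf_targets) # Loss_confidence
    loss_l1 = F.l1_loss(box_preds, box_targets)                               # Loss_l1 on box offsets
    loss_det = loss_cls + loss_iou + loss_conf + lam * loss_l1                # Loss_Det
    loss_seg = F.cross_entropy(seg_logits, seg_targets)                       # Loss_Seg (softmax CE)
    return loss_det + loss_seg                                                # Loss = Loss_Det + Loss_Seg

B, N, C = 2, 100, 4
loss = total_loss(torch.randn(B, N, C), torch.rand(B, N, C), torch.rand(B, N),
                  torch.randn(B, N), torch.rand(B, N),
                  torch.randn(B, N, 4), torch.randn(B, N, 4),
                  torch.randn(B, C, 64, 64), torch.randint(0, C, (B, 64, 64)))
```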
To accelerate convergence in the actual training stage, the network weights of the segmentation part are first solidified (frozen) so that gradient back-propagation is not executed for them, and only the network weights of the detection module are trained; after the detection module's weights have converged, the weights of the segmentation module are released so that all global weights converge.
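A minimal sketch of this two-stage schedule, with a toy stand-in model (the sub-module names det_head and seg_head are hypothetical, not from the patent):

```python
import torch
import torch.nn as nn

model = nn.ModuleDict({"det_head": nn.Conv2d(8, 4, 1),
                       "seg_head": nn.Conv2d(8, 2, 1)})   # toy stand-in model

# Stage 1: solidify (freeze) the segmentation weights; no gradients flow into them
for p in model["seg_head"].parameters():
    p.requires_grad = False
opt = torch.optim.SGD([p for p in model.parameters() if p.requires_grad], lr=1e-3)
# ... train until the detection weights converge ...

# Stage 2: release the segmentation weights and train all weights to global convergence
for p in model["seg_head"].parameters():
    p.requires_grad = True
opt = torch.optim.SGD(model.parameters(), lr=1e-4)
# ... continue joint training ...
```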
The invention overcomes the following problems:
the algorithm model itself needs to be smaller, the size of the model needs to be less than 5M, less calculation parameters are obtained, and low-delay effect is achieved.
For a mobile terminal carrying model, image data shot by each device needs to be processed in real time, because uncertain object shielding exists in the outdoor in some scenes such as automatic inspection, and the size of a target object is inconsistent due to uncertain shooting height of the mobile terminal device, a series of special situations that the target is too small (below 60 pixels) or is partially shielded need to be processed by the detection model.
The data sets of most industrial scenes to be detected for the mobile terminal-mounted device may be of unknown shape and of unknown aspect ratio, and the model needs to be flexibly adapted to the size of each target of each scene, and at this time, the Anchor size may not be debugged by too many a priori data sets, so the model requirement is preferably an Anchor free-based algorithm model.
To make the algorithm model of the detection module and the semantic segmentation module more universal and to further reduce the influence of the amount of prior data on the algorithm model, the detection algorithm branch adopts an Anchor-free detection mode. The model ultimately identifies the position and contour information of an object within 50 ms on mobile terminal equipment, requires no additional program-deployment overhead when the number of users is large, and can also be deployed on a server side to perform detection with a GPU/CPU.
It should be noted that the embodiments described herein are only some embodiments of the present invention, not all of its implementations; they are exemplary and serve only to provide a more intuitive and clear understanding of the present disclosure, without limiting the described technical solution. All other embodiments, and other simple alternatives and variations of the inventive solution that would occur to a person skilled in the art without departing from the inventive concept, fall within the scope of the invention.

Claims (8)

1. A multi-task target detection model for target detection and semantic segmentation, characterized in that: the multi-task target detection model for target detection and semantic segmentation comprises a detection module and a semantic segmentation module, both deployed on mobile terminal equipment and connected through a shared backbone network.
2. The multi-task target detection model for target detection and semantic segmentation according to claim 1, characterized in that: the network module of the mobile terminal device comprises an Anchor-free detection module with the Swin Transformer as the backbone network and a Decoupled Head network module.
3. The multi-task target detection model for target detection and semantic segmentation according to claim 2, characterized in that: the overall architecture of the network module comprises a Patch Partition processing layer, a first-stage Linear Embedding layer, a second-stage Linear Embedding layer, a third-stage Linear Embedding layer, a fourth-stage Linear Embedding layer, a CSP module, a first CSP+DECON module, a second CSP+DECON module, a Concat module, a Decoupled Head module and a Conv module, and the image passes through three processing procedures. The first processing procedure: the image first passes through the Patch Partition processing layer, then through the first-stage, second-stage, third-stage and fourth-stage Linear Embedding layers in turn, then through the CSP module and the detection processing of the Decoupled Head module, and finally outputs data after the convolution module. The second processing procedure: the image first passes through the Patch Partition processing layer, then through the first-stage and second-stage Linear Embedding layers, enters the first CSP+DECON module for processing, then the Concat module, and finally outputs data after the convolution module and the Decoupled Head module. The third processing procedure: the image first passes through the Patch Partition processing layer, then through the first-stage, second-stage and third-stage Linear Embedding layers, enters the second CSP+DECON module for processing, then the Concat module, and finally outputs data after the convolution module and the Decoupled Head module.
4. The multi-task target detection model for target detection and semantic segmentation according to claim 3, characterized in that: the Decoupled Head module outputs IOU information, position detection information and classification information.
5. The multi-task target detection model for target detection and semantic segmentation according to claim 4, characterized in that: the detection flow of the Decoupled Head module is as follows:
Step 1: determine the candidate areas of positive samples by using the center prior of the GT (ground truth);
Step 2: for each GT, calculate the Reg + Cls loss of each sample point:
C_ij = L_ij^cls + λ·L_ij^reg
where the labels annotated on the image data set serve as the GT for the image classification task, and Reg + Cls loss denotes the combined regression and classification loss; L_ij^cls and λ·L_ij^reg are the classification loss and the weighted regression loss of GT i with respect to sample j, respectively;
Step 3: determine the number of positive samples to be allocated to each GT by using its predicted sample points, taking the top-20 samples by IoU with the current GT; finally, the Anchor-free predicted point and four offsets are regressed to form the rectangular-box coordinates;
Step 4: sum and round the IoU values of the top-20 samples to obtain the dynamic k of the current GT, where k is the number of dynamically matched candidate points/boxes; for each GT, take the k samples with the smallest loss as positive sample points; and globally remove cases where the same sample is assigned as a positive sample of multiple GTs.
6. The multi-task target detection model for target detection and semantic segmentation according to claim 5, characterized in that: the semantic segmentation module further comprises a branching module; before the features extracted by the channel-pruned Swin Transformer-lite backbone network are sent to the segmentation branches, they are further extracted with different convolution kernels, and semantic features of H×W×C are finally output.
7. The multi-task target detection model for target detection and semantic segmentation according to claim 6, characterized in that: the semantic segmentation branch structure comprises a first convolution module, a second convolution module and a third convolution module; the convolution kernel size of the first convolution module is 1×1, that of the second convolution module is 5×5, and that of the third convolution module is 3×3.
8. The multi-task target detection model for target detection and semantic segmentation according to claim 7, characterized in that: the training loss function of the semantic segmentation branch structure is:
Loss = Loss_Det + Loss_Seg
Loss_Det = Loss_cls + Loss_iou_regression + Loss_confidence + λ·Loss_l1
Loss_Seg = Loss_softmax_cross_entropy
where Loss is the total loss; Loss_Det is the detection branch loss, Loss_Seg is the segmentation branch loss, Loss_confidence is the box confidence loss, Loss_iou_regression is the IoU loss, Loss_cls is the classification loss, and Loss_softmax_cross_entropy is the softmax cross-entropy loss.
CN202310275566.4A 2023-03-21 2023-03-21 Multi-task target detection model for target detection and semantic segmentation Pending CN116524320A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202310275566.4A CN116524320A (en) 2023-03-21 2023-03-21 Multi-task target detection model for target detection and semantic segmentation

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202310275566.4A CN116524320A (en) 2023-03-21 2023-03-21 Multi-task target detection model for target detection and semantic segmentation

Publications (1)

Publication Number Publication Date
CN116524320A true CN116524320A (en) 2023-08-01

Family

ID=87407197

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202310275566.4A Pending CN116524320A (en) 2023-03-21 2023-03-21 Multi-task target detection model for target detection and semantic segmentation

Country Status (1)

Country Link
CN (1) CN116524320A (en)

Cited By (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN118036666A * 2024-04-12 2024-05-14 Tsinghua University Task processing method, device, equipment, storage medium and computer program product
CN118036666B * 2024-04-12 2024-06-11 Tsinghua University Task processing method, device, equipment, storage medium and computer program product

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination