CN117893838A - Target detection method using diffusion detection model - Google Patents
Target detection method using diffusion detection model
- Publication number: CN117893838A
- Application number: CN202410288788.4A
- Authority: CN (China)
- Prior art keywords: dimensional, frame, low, noise, diffusion
- Legal status: Pending (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Abstract
The invention discloses a target detection method using a diffusion detection model, which improves the accuracy of the diffusion detection model and comprises the following steps: 1. acquiring an input image and extracting an image feature map of the input image through an image feature extractor; 2. obtaining low-dimensional truth boxes and encoding them through a bounding box encoder, which maps them from the low-dimensional space to a high-dimensional space, to obtain high-dimensional truth boxes; 3. gradually adding Gaussian noise to the high-dimensional truth boxes according to the noise-adding rule of the diffusion detection model to obtain high-dimensional noise boxes; 4. decoding the high-dimensional noise boxes through a bounding box decoder, which maps them from the high-dimensional space back to the low-dimensional space used before encoding, to obtain low-dimensional noise boxes; 5. using the low-dimensional noise boxes to crop RoI features from the image feature map extracted by the image feature extractor, inputting the cropped RoI features together with the low-dimensional noise boxes into a detection head, performing regression and classification, and predicting the positions and target categories of the corresponding low-dimensional truth boxes.
Description
Technical Field
The invention relates to the technical field of target detection, in particular to a target detection method using a diffusion detection model.
Background
Target detection is an important task in computer vision that aims to identify the objects in an image and determine their positions. Traditional target detection algorithms rely mainly on hand-crafted feature extraction, which is computationally expensive and unstable. With the rise of deep learning, target detection based on deep learning has gradually become a research hotspot. However, existing target detection algorithms still face challenges such as variation in object appearance, shape and pose, and interference from factors such as illumination and occlusion. Meanwhile, diffusion models have achieved remarkable results in image generation, where a denoising diffusion process gradually converts random noise into a clear image. Inspired by this, applying a diffusion model to target detection is a new attempt.
The DiffusionDet model made an initial attempt to apply a diffusion model to the target detection task, modeling target detection as a denoising diffusion process from noise boxes to target boxes. Specifically, in the training phase the target boxes are diffused from the truth boxes to a random distribution, and the model learns to reverse this noise-adding process; in the inference phase the model progressively refines a set of randomly generated boxes into the output results. Although the DiffusionDet model performs well, it overlooks the fact that diffusion models are generally used for image generation, where the diffused object is an image of relatively high dimension, whereas the diffused object in DiffusionDet is a detection box of low dimension. The information that can be carried through the diffusion process is therefore limited, the advantages of the diffusion model cannot be fully exploited, and further improvement of DiffusionDet's performance is constrained. In addition, in the detection stage the DiffusionDet model simply connects several detection heads in series and does not take the role of region-related features into account.
Therefore, how to provide a diffusion detection model based on box encoding that improves the accuracy of the diffusion detection model is a technical problem to be solved.
Disclosure of Invention
The invention aims to provide a target detection method using a diffusion detection model that solves the problems in the prior art and improves the accuracy of the diffusion detection model.
In order to achieve the above object, the solution of the present invention is:
a target detection method using a diffusion detection model, comprising the steps of:
step 1, acquiring an input image, and extracting an image feature map of the input image through an image feature extractor;
step 2, obtaining low-dimensional truth boxes, and encoding them through a bounding box encoder, which maps them from the low-dimensional space to a high-dimensional space, to obtain high-dimensional truth boxes;
step 3, gradually adding Gaussian noise to the high-dimensional truth boxes according to the noise-adding rule of the diffusion detection model to obtain high-dimensional noise boxes;
step 4, decoding the high-dimensional noise boxes through a bounding box decoder, which maps them from the high-dimensional space back to the low-dimensional space used before encoding, to obtain low-dimensional noise boxes;
step 5, using the low-dimensional noise boxes to crop RoI features from the image feature map extracted by the image feature extractor, inputting the cropped RoI features together with the low-dimensional noise boxes into a detection head, performing regression and classification, and predicting the positions and target categories of the corresponding low-dimensional truth boxes;
the detection head has a cascade structure consisting of 4 cascaded stages; each stage receives the image feature map together with either a noise box or a prediction box as input and outputs a prediction box, and the last stage additionally outputs the predicted category; in each stage, the RoIAlign operation is used to extract RoI features from the image feature map for the noise box or prediction box, and the prediction box is then generated from the extracted RoI features; the RoI features extracted in the last stage are additionally fused, by weighting, with the RoI features extracted in the other stages before being used to predict the box regression and classification results.
In the step 1, the image feature map is extracted through a ResNet model or a Res2Net model.
Step 2 further comprises: after the low-dimensional truth boxes are obtained, if the number of low-dimensional truth boxes of the input image is less than a specified value, padding the number of detection boxes of the diffusion detection model to that specified value.
In the step 2, the bounding box encoder is implemented by a multi-layer perceptron.
Step 3 specifically comprises: according to a given total number of time steps and a noise schedule, sampling the noised sample at any one time step within that total.
In the step 4, the bounding box decoder is implemented by a multi-layer perceptron.
Step 4 further comprises: during inference, first generating a high-dimensional random box with the same dimension as the high-dimensional noise box used in training, and decoding it from the high-dimensional space into the low-dimensional space through the bounding box decoder.
After the technical scheme is adopted, the invention has the following technical effects:
By introducing the bounding box encoder and the bounding box decoder, the invention improves the ability to capture information while the low-dimensional truth boxes are diffused and accelerates their diffusion to a Gaussian distribution; by fusing the RoI features of the other stages in the last stage of the detection head, it makes reasonable use of region-related information and improves prediction accuracy.
Drawings
FIG. 1 is a flow chart of an embodiment of the present invention;
FIG. 2 is a schematic diagram comparing the prior art (upper) with the present invention (lower);
FIG. 3 is a schematic structural diagram of a detection head according to an embodiment of the present invention.
Detailed Description
In order to further explain the technical scheme of the invention, the invention is explained in detail by specific examples.
Referring to fig. 1 to 3, the present invention discloses a target detection method using a diffusion detection model, comprising the steps of:
step 1, acquiring an input image, and extracting an image feature map of the input image through an image feature extractor;
step 2, obtaining low-dimensional truth boxes, and encoding them through a bounding box encoder, which maps them from the low-dimensional space to a high-dimensional space, to obtain high-dimensional truth boxes;
step 3, gradually adding Gaussian noise to the high-dimensional truth boxes according to the noise-adding rule of the diffusion detection model to obtain high-dimensional noise boxes;
step 4, decoding the high-dimensional noise boxes through a bounding box decoder, which maps them from the high-dimensional space back to the low-dimensional space used before encoding, to obtain low-dimensional noise boxes;
step 5, using the low-dimensional noise boxes to crop RoI features from the image feature map extracted by the image feature extractor, inputting the cropped RoI features together with the low-dimensional noise boxes into a detection head, performing regression and classification, and predicting the positions and target categories of the corresponding low-dimensional truth boxes.
Diffusion detection models in the prior art ignore the dimensionality of the diffused object, so they cannot fully capture useful information during diffusion, and they do not make full use of region-related information during detection. The invention therefore constructs a bounding box encoder and a bounding box decoder that respectively encode and decode the bounding boxes before and after diffusion, so that the boxes can be fully diffused in a high-dimensional space and still be aligned in the low-dimensional space. During inference, high-dimensional detection boxes are randomly initialized and mapped to the low-dimensional space by the bounding box decoder. During detection, a feature fusion mechanism fuses the RoI features of several cascaded stages, which improves the accuracy of the final classification and regression predictions.
Specific embodiments of the invention are shown below.
In the step 1:
an input image $x$ is acquired and its image features $F$ are extracted, $F = f(x)$;
where $x \in \mathbb{R}^{H \times W \times 3}$, the image feature extractor $f$ is a ResNet model or a Res2Net model, $H$ is the height of the input image, $W$ is the width of the input image, the factor 3 indicates that the image consists of the three primary colors red, green and blue, and $C$ is the number of channels of the image features $F$.
The step 2 further includes:
acquiring the set $B$ of low-dimensional truth boxes corresponding to the input image $x$, where $M$ is the number of low-dimensional truth boxes in the input image $x$; setting the number of targets in any input image to $N$, where $N$ is greater than or equal to the maximum number of targets in any input image (in this embodiment $N$ is 300); if $M < N$, the truth box set of the input image $x$ is padded so that the number of detection boxes of the diffusion detection model reaches the specified value $N$, at which point $B$ contains $N$ boxes.
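The padding step can be illustrated with a minimal sketch. The source text does not recover what the padding boxes look like, so the choice below (a box covering the whole image) is an assumption, as are the `(cx, cy, w, h)` box format and the function name `pad_truth_boxes`; only the target count of 300 comes from the embodiment.

```python
def pad_truth_boxes(boxes, num_boxes, image_w, image_h):
    """Pad a list of (cx, cy, w, h) truth boxes to exactly num_boxes entries."""
    if len(boxes) > num_boxes:
        raise ValueError("num_boxes must be at least the number of truth boxes")
    # Assumed padding strategy: a filler box that covers the whole image.
    filler = (image_w / 2.0, image_h / 2.0, float(image_w), float(image_h))
    return list(boxes) + [filler] * (num_boxes - len(boxes))


# Two truth boxes in a 640x480 image, padded to the 300 boxes of the embodiment.
truth = [(50.0, 40.0, 20.0, 30.0), (120.0, 80.0, 60.0, 40.0)]
padded = pad_truth_boxes(truth, num_boxes=300, image_w=640, image_h=480)
```

The original truth boxes keep their positions at the front of the list, so any per-box labels stay aligned with them.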
In the step 2, the bounding box encoder is implemented by a multi-layer perceptron, and the calculation formula is as follows:
$$Z_0 = \mathrm{MLP}(B)$$
where $B$ is the set of low-dimensional truth boxes, $Z_0$ represents the encoded high-dimensional truth boxes, $d$ is the dimension of a high-dimensional truth box (128 in this embodiment), and $\mathrm{MLP}$ denotes a multi-layer perceptron (the same below).
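As a rough illustration of the encoder, the sketch below maps one low-dimensional box to a 128-dimensional vector with a two-layer perceptron written in plain Python. The input size of 4, the hidden width of 64, the ReLU activation and the random initialisation are all assumptions made for the example; the text only fixes the output dimension of 128 and the use of a multi-layer perceptron.

```python
import random


def make_linear(in_dim, out_dim, rng):
    """Return (weights, bias) for one fully connected layer."""
    scale = 1.0 / in_dim ** 0.5
    weights = [[rng.uniform(-scale, scale) for _ in range(in_dim)]
               for _ in range(out_dim)]
    return weights, [0.0] * out_dim


def linear(x, layer):
    """Apply one fully connected layer to the vector x."""
    weights, bias = layer
    return [sum(w * v for w, v in zip(row, x)) + b
            for row, b in zip(weights, bias)]


def mlp(x, layers):
    """Apply the layers in order with ReLU between them (none after the last)."""
    for layer in layers[:-1]:
        x = [max(0.0, v) for v in linear(x, layer)]
    return linear(x, layers[-1])


rng = random.Random(0)
# Hypothetical layer sizes 4 -> 64 -> 128; only the 128 comes from the embodiment.
encoder = [make_linear(4, 64, rng), make_linear(64, 128, rng)]
box = [0.5, 0.5, 0.2, 0.3]  # assumed normalised (cx, cy, w, h)
z0 = mlp(box, encoder)      # high-dimensional truth box
```

In a trained model the weights would be learned jointly with the detector; here they are random, so only the shapes are meaningful.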
The step 3 specifically comprises the following steps:
according to a given total number of time steps $T$ and a noise schedule, sampling the noised sample at any one time step within $T$. The noise-adding process can be regarded as a Markov process. Given the total number of time steps $T$, by the re-parameterization technique the noise box at time $t$ within $T$ is computed as:
$$\alpha_t = 1 - \beta_t, \qquad \bar{\alpha}_t = \prod_{s=1}^{t} \alpha_s,$$
$$Z_t = \sqrt{\bar{\alpha}_t}\, Z_0 + \sqrt{1 - \bar{\alpha}_t}\, \epsilon, \qquad \epsilon \sim \mathcal{N}(0, I),$$
where $\beta_t$ increases linearly from $\beta_1$ to $\beta_T$ and controls the magnitude of the noise; $\alpha_t$ and $\bar{\alpha}_t$ are intermediate variables introduced for convenience; as $t$ increases, $\beta_t$ gradually increases and, correspondingly, $\bar{\alpha}_t$ gradually decreases; $Z_0$ is the encoded high-dimensional truth box and $Z_t$ represents the noise box at time $t$; in this embodiment $T$ is 1000.
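The noise-adding step can be sketched as follows, assuming the standard re-parameterized forward process of denoising diffusion models, in which the noised sample at time $t$ is $\sqrt{\bar{\alpha}_t}$ times the clean sample plus $\sqrt{1-\bar{\alpha}_t}$ times standard Gaussian noise. The schedule endpoints `1e-4` and `0.02` are assumed values chosen for illustration (the endpoints are not recoverable from the text), and the function names are hypothetical; the linear schedule and the 1000 steps come from the embodiment.

```python
import math
import random


def linear_beta_schedule(num_steps, beta_start, beta_end):
    """Betas that increase linearly from beta_start to beta_end."""
    if num_steps == 1:
        return [beta_start]
    step = (beta_end - beta_start) / (num_steps - 1)
    return [beta_start + i * step for i in range(num_steps)]


def cumulative_alphas(betas):
    """alpha_bar_t = product over s <= t of (1 - beta_s)."""
    out, prod = [], 1.0
    for beta in betas:
        prod *= 1.0 - beta
        out.append(prod)
    return out


def add_noise(z0, t, alpha_bars, rng):
    """Sample the noised vector at 1-based time step t directly from z0."""
    a = alpha_bars[t - 1]
    return [math.sqrt(a) * v + math.sqrt(1.0 - a) * rng.gauss(0.0, 1.0)
            for v in z0]


T = 1000
betas = linear_beta_schedule(T, 1e-4, 0.02)  # endpoints are assumptions
alpha_bars = cumulative_alphas(betas)
rng = random.Random(0)
z0 = [0.1] * 128            # stand-in for one encoded truth box
t = rng.randint(1, T)       # any time step in 1..T
zt = add_noise(z0, t, alpha_bars, rng)
```

Because the cumulative product shrinks towards zero as the step index grows, a sample at the last step is almost pure Gaussian noise, which is what allows inference to start from a random box.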
In the step 4, the bounding box decoder is implemented by a multi-layer perceptron, and the calculation formula is as follows:
$$B_t = \mathrm{MLP}(Z_t)$$
where $Z_t$ is the high-dimensional noise box at time $t$ and $B_t$ represents the decoded set of low-dimensional noise boxes.
The step 4 further includes: during inference, first generating a high-dimensional random box with the same dimension as the high-dimensional noise box used in training, and decoding it from the high-dimensional space into the low-dimensional space through the bounding box decoder, with the calculation formula:
$$Z_T \sim \mathcal{N}(0, I), \qquad B_T = \mathrm{MLP}(Z_T)$$
where $Z_T$ represents the high-dimensional random box drawn at random from a Gaussian distribution at inference time, and $B_T$ represents the low-dimensional noise box decoded by the bounding box decoder.
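The inference-time initialisation can be sketched in the same plain-Python style: draw 300 Gaussian vectors of dimension 128 and pass each through a decoder to obtain a low-dimensional box. The decoder layer sizes (128 to 64 to 4), the ReLU activation, the unit-variance Gaussian and the untrained random weights are assumptions for illustration; a real decoder would be trained together with the encoder.

```python
import random


def make_layer(in_dim, out_dim, rng):
    """Random weights and zero bias for one fully connected layer."""
    scale = 1.0 / in_dim ** 0.5
    weights = [[rng.uniform(-scale, scale) for _ in range(in_dim)]
               for _ in range(out_dim)]
    return weights, [0.0] * out_dim


def apply_layer(x, layer):
    """Apply one fully connected layer to the vector x."""
    weights, bias = layer
    return [sum(w * v for w, v in zip(row, x)) + b
            for row, b in zip(weights, bias)]


def decode(z, layers):
    """MLP decoder: ReLU between layers, linear output."""
    for layer in layers[:-1]:
        z = [max(0.0, v) for v in apply_layer(z, layer)]
    return apply_layer(z, layers[-1])


rng = random.Random(0)
num_boxes, high_dim, low_dim = 300, 128, 4  # 300 and 128 follow the embodiment
decoder = [make_layer(high_dim, 64, rng), make_layer(64, low_dim, rng)]

# High-dimensional random boxes drawn from a standard Gaussian.
z_T = [[rng.gauss(0.0, 1.0) for _ in range(high_dim)] for _ in range(num_boxes)]
# Low-dimensional noise boxes that would be handed to the detection head.
boxes_T = [decode(z, decoder) for z in z_T]
```

Sampling in the high-dimensional space and decoding afterwards mirrors training, where noise is also added before the decoder is applied.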
The step 5 specifically comprises the following steps:
the detection head consists of 4 cascaded stages; each stage receives the image feature map together with either a noise box or a prediction box as input and outputs a prediction box, and the last stage additionally outputs the predicted category; in each stage, the RoIAlign operation is used to extract RoI features from the image feature map for the noise box or prediction box, and the prediction box is then generated from the extracted RoI features; in order to make full use of region-related features, the RoI features extracted in the last stage are additionally fused, by weighting, with the RoI features extracted in the other stages before being used to predict the box regression and classification results. The entire detection flow can be expressed by the following formula:
where the symbols denote, in order: the RoI features extracted in the $i$-th stage of the detection head; the RoI feature corresponding to the $j$-th proposal box in stage $i$; the RoIAlign operation; the set of prediction boxes output by stage $i$; the output box corresponding to the $j$-th input box of the $i$-th stage of the detection head; the probability that the object in each output box belongs to each category, together with the number of categories (80 in this embodiment); and a fully connected layer.
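The data flow through the cascaded head can be sketched as below. This is only a schematic under stated assumptions: the fusion is taken to be a plain weighted sum, the weights `[0.1, 0.2, 0.3, 0.4]` are invented for the example, and `extract_roi`, `regress` and `classify` are hypothetical callables standing in for RoIAlign and the regression and classification layers, whose exact form is not recoverable from the text.

```python
def fuse_roi_features(stage_features, weights):
    """Weighted sum of the per-stage RoI feature vectors of one box."""
    if len(stage_features) != len(weights):
        raise ValueError("need exactly one weight per stage")
    dim = len(stage_features[0])
    return [sum(w * f[k] for w, f in zip(weights, stage_features))
            for k in range(dim)]


def run_cascade(noise_box, extract_roi, regress, classify, weights):
    """Refine one box through len(weights) stages; fuse features in the last."""
    box, feats = noise_box, []
    num_stages = len(weights)
    for stage in range(num_stages):
        roi = extract_roi(box)      # stands in for RoIAlign on the feature map
        feats.append(roi)
        if stage == num_stages - 1:
            roi = fuse_roi_features(feats, weights)  # last stage only
        box = regress(roi, box)     # each stage outputs a prediction box
    return box, classify(roi)       # only the last stage outputs categories


# Toy stand-ins so the flow can be executed end to end.
final_box, scores = run_cascade(
    noise_box=[0.0, 0.0],
    extract_roi=lambda box: list(box),
    regress=lambda roi, box: [b + 1.0 for b in box],
    classify=lambda roi: [sum(roi)],
    weights=[0.1, 0.2, 0.3, 0.4],
)
```

Each stage crops features with the box produced by the previous stage, so the early-stage features describe slightly different regions; fusing them in the last stage is how the region-related information reaches the final prediction.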
Experimental verification:
Experiments were carried out on the COCO dataset and compared with existing methods; the comparison results are shown in the following table. Compared with existing methods, the performance of the invention is significantly improved, and it leads DiffusionDet on every metric. With ResNet-50 as the backbone, the invention improves AP and AP50 by 1.3% and 2.3% respectively, which demonstrates the performance gain.
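For readers unfamiliar with the metrics, AP50 counts a predicted box as correct when its intersection over union (IoU) with a truth box is at least 0.5, while COCO AP averages over IoU thresholds from 0.50 to 0.95. A minimal IoU sketch follows; the `(x1, y1, x2, y2)` corner format and the function names are assumptions for illustration and are not taken from the patent.

```python
def iou(box_a, box_b):
    """Intersection over union of two (x1, y1, x2, y2) boxes."""
    ix1, iy1 = max(box_a[0], box_b[0]), max(box_a[1], box_b[1])
    ix2, iy2 = min(box_a[2], box_b[2]), min(box_a[3], box_b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area_a = (box_a[2] - box_a[0]) * (box_a[3] - box_a[1])
    area_b = (box_b[2] - box_b[0]) * (box_b[3] - box_b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0


def matches_at_ap50(pred_box, truth_box):
    """True when the prediction meets the IoU >= 0.5 criterion used by AP50."""
    return iou(pred_box, truth_box) >= 0.5
```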
The above embodiments and drawings do not limit the form or style of the present invention, and any suitable changes or modifications made by those skilled in the art shall be regarded as not departing from the scope of the present invention.
Claims (7)
1. A target detection method using a diffusion detection model, characterized by comprising the following steps:
step 1, acquiring an input image, and extracting an image feature map of the input image through an image feature extractor;
step 2, obtaining low-dimensional truth boxes, and encoding them through a bounding box encoder, which maps them from the low-dimensional space to a high-dimensional space, to obtain high-dimensional truth boxes;
step 3, gradually adding Gaussian noise to the high-dimensional truth boxes according to the noise-adding rule of the diffusion detection model to obtain high-dimensional noise boxes;
step 4, decoding the high-dimensional noise boxes through a bounding box decoder, which maps them from the high-dimensional space back to the low-dimensional space used before encoding, to obtain low-dimensional noise boxes;
step 5, using the low-dimensional noise boxes to crop RoI features from the image feature map extracted by the image feature extractor, inputting the cropped RoI features together with the low-dimensional noise boxes into a detection head, performing regression and classification, and predicting the positions and target categories of the corresponding low-dimensional truth boxes;
the detection head has a cascade structure consisting of 4 cascaded stages; each stage receives the image feature map together with either a noise box or a prediction box as input and outputs a prediction box, and the last stage additionally outputs the predicted category; in each stage, the RoIAlign operation is used to extract RoI features from the image feature map for the noise box or prediction box, and the prediction box is then generated from the extracted RoI features; the RoI features extracted in the last stage are additionally fused, by weighting, with the RoI features extracted in the other stages before being used to predict the box regression and classification results.
2. The target detection method using a diffusion detection model according to claim 1, wherein:
in the step 1, the image feature map is extracted through a ResNet model or a Res2Net model.
3. The target detection method using a diffusion detection model according to claim 1, wherein:
step 2 further comprises: after the low-dimensional truth boxes are obtained, if the number of low-dimensional truth boxes of the input image is less than a specified value, padding the number of detection boxes of the diffusion detection model to that specified value.
4. A target detection method using a diffusion detection model according to claim 1 or 3, wherein:
in the step 2, the bounding box encoder is implemented by a multi-layer perceptron.
5. The target detection method using a diffusion detection model according to claim 1, wherein:
step 3 specifically comprises: according to a given total number of time steps and a noise schedule, sampling the noised sample at any one time step within that total.
6. The target detection method using a diffusion detection model according to claim 1, wherein:
in the step 4, the bounding box decoder is implemented by a multi-layer perceptron.
7. The target detection method using a diffusion detection model according to claim 1 or 4, wherein:
step 4 further comprises: during inference, first generating a high-dimensional random box with the same dimension as the high-dimensional noise box used in training, and decoding it from the high-dimensional space into the low-dimensional space through the bounding box decoder.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202410288788.4A CN117893838A (en) | 2024-03-14 | 2024-03-14 | Target detection method using diffusion detection model |
Publications (1)
Publication Number | Publication Date |
---|---|
CN117893838A (en) | 2024-04-16 |
Family
ID=90649102
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN202410288788.4A Pending CN117893838A (en) | 2024-03-14 | 2024-03-14 | Target detection method using diffusion detection model |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN117893838A (en) |
Citations (7)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN113837190A (en) * | 2021-08-30 | 2021-12-24 | 厦门大学 | End-to-end instance segmentation method based on Transformer |
CN116485682A (en) * | 2023-05-04 | 2023-07-25 | 北京联合大学 | Image shadow removing system and method based on potential diffusion model |
CN117236390A (en) * | 2023-09-22 | 2023-12-15 | 西南石油大学 | Reservoir prediction method based on cross attention diffusion model |
CN117292007A (en) * | 2023-09-28 | 2023-12-26 | 支付宝(杭州)信息技术有限公司 | Image generation method and device |
CN117315263A (en) * | 2023-11-28 | 2023-12-29 | 杭州申昊科技股份有限公司 | Target contour segmentation device, training method, segmentation method and electronic equipment |
CN117351325A (en) * | 2023-12-06 | 2024-01-05 | 浙江省建筑设计研究院 | Model training method, building effect graph generation method, equipment and medium |
CN117496927A (en) * | 2024-01-02 | 2024-02-02 | 广州市车厘子电子科技有限公司 | Music timbre style conversion method and system based on diffusion model |
Non-Patent Citations (1)
Title |
---|
SHOUFA CHEN , PEIZE SUN , ET AL: "DiffusionDet : Diffusion Model for Object Detection", 2023 IEEE/CVF INTERNATIONAL CONFERENCE ON COMPUTER VISION (ICCV), 6 October 2023 (2023-10-06), pages 19773 - 19786, XP034513831, DOI: 10.1109/ICCV51070.2023.01816 * |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
PB01 | Publication | ||
SE01 | Entry into force of request for substantive examination | ||