CN117893838A - Target detection method using diffusion detection model

Info

Publication number: CN117893838A
Application number: CN202410288788.4A
Authority: CN
Inventors: 曹刘娟; 罗耀钦
Original assignee: Xiamen University
Current assignee: Xiamen University
Priority date: 2024-03-14
Filing date: 2024-03-14
Publication date: 2024-04-16

Abstract

The invention discloses a target detection method using a diffusion detection model, which improves the precision of the diffusion detection model and comprises the following steps: 1. acquiring an input image, and extracting an image feature map of the input image through an image feature extractor; 2. obtaining low-dimensional truth boxes, encoding them through a bounding box encoder that maps the low-dimensional space to a high-dimensional space, and obtaining high-dimensional truth boxes; 3. gradually adding Gaussian noise to the high-dimensional truth boxes according to the noise-adding rule of the diffusion detection model to obtain high-dimensional noise boxes; 4. decoding the high-dimensional noise boxes through a bounding box decoder that maps the high-dimensional space back to the original low-dimensional space, obtaining low-dimensional noise boxes; 5. cropping RoI features from the image feature map extracted by the image feature extractor using the low-dimensional noise boxes, inputting the cropped RoI features together with the low-dimensional noise boxes into a detection head for regression and classification, and predicting the positions and target categories of the corresponding low-dimensional truth boxes.

Description

Target detection method using diffusion detection model

Technical Field

The invention relates to the technical field of target detection, in particular to a target detection method using a diffusion detection model.

Background

Target detection is an important task in the field of computer vision, aimed at identifying objects in an image and determining their positions. Traditional target detection algorithms rely mainly on hand-crafted feature extraction, which is computationally expensive and unstable. With the rise of deep learning, target detection algorithms based on deep learning have gradually become a research hotspot. However, existing target detection algorithms still face challenges such as variation in the appearance, shape and pose of objects, and interference from factors such as illumination and occlusion. Diffusion models have recently achieved remarkable results in the field of image generation, where a denoising diffusion process gradually converts random noise into a clear image. In light of this, applying a diffusion model to target detection is a new direction.

The DiffusionDet model was the first attempt to apply a diffusion model to the target detection task, modeling target detection as a denoising diffusion process from noise boxes to target boxes. Specifically, during the training phase, the target boxes are diffused from the truth boxes to a random distribution, and the model learns to reverse the process of adding noise to the truth boxes; in the inference phase, the model progressively refines a set of randomly generated boxes into the output results. Although the DiffusionDet model performs well, it overlooks the fact that diffusion models are generally used for image generation, where the subject of diffusion is an image of high dimension, whereas the subject of diffusion in the DiffusionDet model is a detection box of low dimension. The information that can be carried through the diffusion process in DiffusionDet is therefore limited, the advantages of the diffusion model cannot be fully exploited, and further improvement of DiffusionDet's performance is constrained. In addition, the DiffusionDet model simply connects several detection heads in series at the detection stage, and does not consider the role of region-related features.

Therefore, how to provide a diffusion detection model based on box encoding, so as to improve the accuracy of the diffusion detection model, is a technical problem to be solved.

Disclosure of Invention

The invention aims to provide a target detection method using a diffusion detection model, solve the problems existing in the prior art and realize the improvement of the precision of the diffusion detection model.

In order to achieve the above object, the solution of the present invention is:

an object detection method using a diffusion detection model, comprising the steps of:

step 1, acquiring an input image, and extracting an image feature map of the input image through an image feature extractor;

step 2, obtaining low-dimensional truth boxes, encoding the low-dimensional truth boxes through a bounding box encoder that maps the low-dimensional space to a high-dimensional space, and obtaining high-dimensional truth boxes;

step 3, gradually adding Gaussian noise to the high-dimensional truth boxes according to the noise-adding rule of the diffusion detection model to obtain high-dimensional noise boxes;

step 4, decoding the high-dimensional noise boxes through a bounding box decoder that maps the high-dimensional space back to the original low-dimensional space, obtaining low-dimensional noise boxes;

step 5, cropping RoI features from the image feature map extracted by the image feature extractor using the low-dimensional noise boxes, inputting the cropped RoI features together with the low-dimensional noise boxes into a detection head for regression and classification, and predicting the positions and target categories of the corresponding low-dimensional truth boxes;

the detection head has a cascade structure consisting of 4 cascaded stages; each stage receives the image feature map together with either a noise box or a prediction box as input and outputs a prediction box, and the last stage also outputs a predicted category; in each stage, the RoIAlign operation extracts RoI features from the image feature map using the noise box or prediction box, and a prediction box is then generated from the extracted RoI features; the RoI features extracted in the last stage are additionally fused, by weighting, with the RoI features extracted in the other stages before being used to predict the box regression and classification results.

In step 1, the image feature map is extracted through a ResNet model or a Res2Net model.

Step 2 further comprises: after the low-dimensional truth boxes are obtained, if the number of low-dimensional truth boxes of the input image is less than a prescribed value $N$, padding the number of detection boxes of the diffusion detection model to the prescribed value $N$.

In step 2, the bounding box encoder is implemented by a multi-layer perceptron.

Step 3 specifically comprises: given a total number of time steps $T$ and a noise schedule, sampling the sample at any one time step within the $T$ time steps.

In step 4, the bounding box decoder is implemented by a multi-layer perceptron.

Step 4 further comprises: during inference, first generating high-dimensional random boxes with the same dimension as the high-dimensional noise boxes used during training, and decoding the high-dimensional random boxes from the high-dimensional space into the low-dimensional space through the bounding box decoder.

After the technical scheme is adopted, the invention has the following technical effects:

By introducing the bounding box encoder and the bounding box decoder, the invention helps to improve the ability to capture information while the low-dimensional truth boxes are diffused, and accelerates the diffusion of the low-dimensional truth boxes toward a Gaussian distribution; by fusing the RoI features of the other stages in the last stage of the detection head, it makes reasonable use of region-related information and improves prediction accuracy.

Drawings

FIG. 1 is a flow chart of an embodiment of the present invention;

FIG. 2 is a schematic diagram comparing the prior art (upper) with the present invention (lower);

FIG. 3 is a schematic structural diagram of a detection head according to an embodiment of the present invention.

Detailed Description

In order to further explain the technical scheme of the invention, the invention is explained in detail by specific examples.

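Referring to FIG. 1 to FIG. 3, the present invention discloses a target detection method using a diffusion detection model, comprising the steps of:

step 1, acquiring an input image, and extracting an image feature map of the input image through an image feature extractor;

step 2, obtaining low-dimensional truth boxes, encoding the low-dimensional truth boxes through a bounding box encoder that maps the low-dimensional space to a high-dimensional space, and obtaining high-dimensional truth boxes;

step 3, gradually adding Gaussian noise to the high-dimensional truth boxes according to the noise-adding rule of the diffusion detection model to obtain high-dimensional noise boxes;

step 4, decoding the high-dimensional noise boxes through a bounding box decoder that maps the high-dimensional space back to the original low-dimensional space, obtaining low-dimensional noise boxes;

step 5, cropping RoI features from the image feature map extracted by the image feature extractor using the low-dimensional noise boxes, inputting the cropped RoI features together with the low-dimensional noise boxes into a detection head for regression and classification, and predicting the positions and target categories of the corresponding low-dimensional truth boxes.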

Prior-art diffusion detection models ignore the dimensionality of the diffusion subject, cannot fully capture the useful information in the diffusion process, and cannot make full use of region-related information during detection. The invention therefore constructs a bounding box encoder and a bounding box decoder that encode and decode the bounding boxes before and after diffusion respectively, so that the bounding boxes are diffused fully in the high-dimensional space and aligned in the low-dimensional space. During inference, high-dimensional detection boxes can be randomly initialized and mapped to the low-dimensional space through the bounding box decoder. During detection, a feature fusion mechanism fuses the RoI features of multiple cascaded stages, improving the precision of the final classification and regression predictions.

Specific embodiments of the invention are shown below.

Step 1 specifically comprises:

acquiring an input image $x$ and extracting the image features $f$:

$$f = F(x)$$

where $x \in \mathbb{R}^{H \times W \times 3}$; the image feature extractor $F$ is a ResNet model or a Res2Net model; $H$ is the height of the input image; $W$ is the width of the input image; 3 indicates that the image is composed of the three primary colors red, green and blue; and $C$ is the number of channels of the image features $f$.

Step 2 further comprises:

acquiring the set $b$ of low-dimensional truth boxes corresponding to the input image $x$, where $n$ is the number of low-dimensional truth boxes in the input image $x$; setting the target number for any input image to $N$, where $N$ is greater than or equal to the maximum number of targets in any input image (in this embodiment $N = 300$); if $n < N$, the number of detection boxes of the diffusion detection model is padded to the prescribed value $N$, so that $b$ then contains $N$ boxes.
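The padding step can be sketched as follows. This is only an illustration under stated assumptions: the text says the box set is filled up to the prescribed value but does not say with what, so the sketch assumes normalized `(cx, cy, w, h)` boxes and fills with clipped Gaussian random boxes. The function name `pad_truth_boxes` and its parameters are hypothetical.

```python
import random


def pad_truth_boxes(boxes, n_total=300, seed=0):
    """Pad a list of (cx, cy, w, h) boxes to exactly n_total entries.

    Assumption: filler boxes are Gaussian samples clipped to [0, 1]; the
    source only states that the set is filled up to the prescribed value.
    """
    if len(boxes) > n_total:
        raise ValueError("more truth boxes than the prescribed number")
    rng = random.Random(seed)
    padded = [tuple(b) for b in boxes]
    while len(padded) < n_total:
        # Hypothetical filler: a random normalized box around the image centre.
        filler = tuple(
            min(1.0, max(0.0, rng.gauss(0.5, 1.0 / 6.0))) for _ in range(4)
        )
        padded.append(filler)
    return padded


# Two hypothetical truth boxes, padded to the embodiment's value of 300.
truth = [(0.5, 0.5, 0.2, 0.3), (0.1, 0.2, 0.05, 0.1)]
padded = pad_truth_boxes(truth)
```

The original truth boxes stay at the front of the list, so a later matching step could still tell them apart from the fillers by index.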

In step 2, the bounding box encoder is implemented by a multi-layer perceptron, and the calculation formula is as follows:

$$z_0 = \mathrm{MLP}_{\mathrm{enc}}(b)$$

where $z_0 \in \mathbb{R}^{N \times d}$ denotes the encoded high-dimensional truth boxes, $d$ is the dimension of a high-dimensional truth box (128 in this embodiment), and $\mathrm{MLP}$ denotes a multi-layer perceptron (likewise below).
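A minimal sketch of such an encoder, assuming a 4-dimensional box as input, one hidden layer of 64 units and ReLU activations. None of these choices is given in the text, only the output dimension of 128. The weights are random and untrained, and `make_mlp` and `mlp_forward` are hypothetical helper names.

```python
import math
import random


def make_mlp(sizes, seed=0):
    """Create untrained weights: one (weight matrix, bias vector) pair per layer."""
    rng = random.Random(seed)
    layers = []
    for n_in, n_out in zip(sizes[:-1], sizes[1:]):
        scale = 1.0 / math.sqrt(n_in)
        weights = [[rng.gauss(0.0, scale) for _ in range(n_in)] for _ in range(n_out)]
        layers.append((weights, [0.0] * n_out))
    return layers


def mlp_forward(layers, x):
    """Apply the MLP to one vector: ReLU between layers, linear output."""
    for idx, (weights, bias) in enumerate(layers):
        x = [sum(w * v for w, v in zip(row, x)) + b for row, b in zip(weights, bias)]
        if idx < len(layers) - 1:
            x = [max(0.0, v) for v in x]
    return x


# Assumed layer sizes: 4-d box -> 64 hidden units -> 128-d code.
encoder = make_mlp([4, 64, 128])
box = [0.5, 0.5, 0.2, 0.3]  # hypothetical (cx, cy, w, h) truth box
z0 = mlp_forward(encoder, box)
```

Encoding each of the $N$ boxes independently in this way yields an $N \times 128$ array, which is the object that the diffusion process then perturbs.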

Step 3 specifically comprises:

given a total number of time steps $T$ and a noise schedule, sampling the sample at any one time step within the $T$ time steps. The noise-adding process can be regarded as a Markov process. Given the total number of time steps $T$, by the reparameterization technique the input noise box at time $t$ within the $T$ time steps is computed as:

$$z_t = \sqrt{\bar{\alpha}_t}\, z_0 + \sqrt{1 - \bar{\alpha}_t}\, \epsilon, \qquad \alpha_t = 1 - \beta_t, \qquad \bar{\alpha}_t = \prod_{s=1}^{t} \alpha_s$$

where $\beta_t$ increases linearly from $\beta_1$ to $\beta_T$ and controls the magnitude of the noise; $\alpha_t$ and $\bar{\alpha}_t$ are intermediate variables introduced for convenience; as $t$ increases, $\beta_t$ gradually increases and $\bar{\alpha}_t$ correspondingly decreases; $z_t$ denotes the noise box at time $t$; $\epsilon \sim \mathcal{N}(0, I)$ is Gaussian noise; in this embodiment $T = 1000$.
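The noise-adding step can be illustrated with the standard closed-form sampling used in denoising diffusion models, $z_t = \sqrt{\bar{\alpha}_t}\, z_0 + \sqrt{1 - \bar{\alpha}_t}\, \epsilon$. The sketch below assumes that form and a linear schedule. The endpoint values `1e-4` and `0.02` are a common default and are not taken from the text, and all function names are hypothetical.

```python
import math
import random


def linear_beta_schedule(T, beta_start=1e-4, beta_end=0.02):
    """beta_1..beta_T increasing linearly; the endpoint values are assumed."""
    if T == 1:
        return [beta_start]
    step = (beta_end - beta_start) / (T - 1)
    return [beta_start + step * i for i in range(T)]


def cumulative_alpha(betas):
    """alpha_bar_t = product over s <= t of (1 - beta_s)."""
    out, prod = [], 1.0
    for beta in betas:
        prod *= 1.0 - beta
        out.append(prod)
    return out


def q_sample(z0, t, alpha_bar, rng):
    """Draw z_t directly from z_0 by reparameterization (t is 1-indexed)."""
    a = alpha_bar[t - 1]
    return [
        math.sqrt(a) * v + math.sqrt(1.0 - a) * rng.gauss(0.0, 1.0) for v in z0
    ]


T = 1000  # value used in the embodiment
betas = linear_beta_schedule(T)
alpha_bar = cumulative_alpha(betas)

rng = random.Random(0)
z0 = [rng.uniform(-1.0, 1.0) for _ in range(128)]  # stand-in for one encoded box
z_t = q_sample(z0, 500, alpha_bar, rng)
```

Because $\bar{\alpha}_t$ shrinks monotonically, the contribution of $z_0$ fades as $t$ grows and $z_T$ is close to pure Gaussian noise, which is what allows inference to start from random high-dimensional boxes.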

In step 4, the bounding box decoder is implemented by a multi-layer perceptron, and the calculation formula is as follows:

$$b_t = \mathrm{MLP}_{\mathrm{dec}}(z_t)$$

where $b_t$ denotes the decoded set of low-dimensional noise boxes.

Step 4 further comprises: during inference, first generating high-dimensional random boxes with the same dimension as the high-dimensional noise boxes used during training, and decoding the high-dimensional random boxes from the high-dimensional space into the low-dimensional space through the bounding box decoder, with the calculation formula as follows:

$$z_T \sim \mathcal{N}(0, I), \qquad b_T = \mathrm{MLP}_{\mathrm{dec}}(z_T)$$

where $z_T$ denotes the high-dimensional random boxes drawn from a Gaussian distribution at inference time, and $b_T$ denotes the low-dimensional noise boxes decoded by the bounding box decoder.
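The inference-time initialization can be sketched as below: draw $N$ Gaussian vectors of dimension 128 and push each through a decoder MLP. This is an illustration only. The hidden size of 64, the 4-dimensional output and the absence of any output squashing are assumptions; the decoder here is untrained, and the helper names are hypothetical. The same decoder call would apply to the noised boxes $z_t$ during training.

```python
import math
import random


def make_mlp(sizes, rng):
    """Untrained MLP weights: one (weight matrix, bias vector) pair per layer."""
    layers = []
    for n_in, n_out in zip(sizes[:-1], sizes[1:]):
        scale = 1.0 / math.sqrt(n_in)
        weights = [[rng.gauss(0.0, scale) for _ in range(n_in)] for _ in range(n_out)]
        layers.append((weights, [0.0] * n_out))
    return layers


def mlp_forward(layers, x):
    """Apply the MLP to one vector: ReLU between layers, linear output."""
    for idx, (weights, bias) in enumerate(layers):
        x = [sum(w * v for w, v in zip(row, x)) + b for row, b in zip(weights, bias)]
        if idx < len(layers) - 1:
            x = [max(0.0, v) for v in x]
    return x


rng = random.Random(0)
N, d = 300, 128                      # values used in the embodiment
decoder = make_mlp([d, 64, 4], rng)  # assumed sizes: 128 -> 64 -> 4

# High-dimensional random boxes, one standard Gaussian vector per box.
z_T = [[rng.gauss(0.0, 1.0) for _ in range(d)] for _ in range(N)]

# Low-dimensional noise boxes handed to the detection head.
b_T = [mlp_forward(decoder, z) for z in z_T]
```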

Step 5 specifically comprises:

The detection head consists of 4 cascaded stages. Each stage receives the image feature map together with either a noise box or a prediction box as input and outputs a prediction box; the last stage also outputs the predicted category. In each stage, the RoIAlign operation extracts RoI features from the image feature map using the noise box or prediction box, and a prediction box is then generated from the extracted RoI features. To make full use of region-related features, the RoI features extracted in the last stage are additionally fused, by weighting, with the RoI features extracted in the other stages before being used to predict the box regression and classification results. The symbols used to express the entire detection flow are defined as follows:

$r_i$ denotes the RoI features extracted in the $i$-th stage of the detection head; $r_i^j$ denotes the RoI feature corresponding to the $j$-th proposal box in the $i$-th stage; $\mathrm{RoIAlign}$ denotes the RoIAlign operation; $B_i$ denotes the set of prediction boxes output by the $i$-th stage; $b_i^j$ denotes the output box of the $i$-th stage corresponding to the $j$-th input box; $p$ denotes the probability that the object in each output box belongs to each category; $K$ is the number of categories (80 in this embodiment); and $\mathrm{FC}$ denotes a fully connected layer.
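The control flow of the cascade can be sketched as follows. This is a toy illustration under several assumptions: RoIAlign is replaced by plain mean pooling over the cells inside each box, the per-stage regressor and the classifier are untrained placeholders, and the fusion weights `[0.1, 0.2, 0.3, 0.4]` are made up. Only the four-stage structure and the weighted fusion in the last stage come from the text; every function name is hypothetical.

```python
import math


def roi_mean_pool(feature_map, box):
    """Crude stand-in for RoIAlign: average the feature vectors inside the box.

    feature_map is an H x W grid of C-dimensional lists; box is (x0, y0, x1, y1)
    in pixel coordinates.
    """
    h, w = len(feature_map), len(feature_map[0])
    x0 = min(max(int(box[0]), 0), w - 1)
    y0 = min(max(int(box[1]), 0), h - 1)
    x1 = min(max(int(box[2]), x0 + 1), w)
    y1 = min(max(int(box[3]), y0 + 1), h)
    cells = [feature_map[y][x] for y in range(y0, y1) for x in range(x0, x1)]
    return [sum(c[k] for c in cells) / len(cells) for k in range(len(cells[0]))]


def refine_box(box, roi_feature, step=0.5):
    """Placeholder regressor: shift the box by a feature-dependent offset."""
    shift = step * sum(roi_feature) / len(roi_feature)
    return tuple(v + shift for v in box)


def classify(feature, num_classes):
    """Placeholder classifier: softmax over fixed, untrained logits."""
    base = sum(feature) / len(feature)
    logits = [base * math.sin(k + 1) for k in range(num_classes)]
    peak = max(logits)
    exps = [math.exp(v - peak) for v in logits]
    total = sum(exps)
    return [e / total for e in exps]


def cascade_head(feature_map, noise_boxes, fusion_weights, num_classes=80):
    """Four cascaded stages; the last stage fuses RoI features of all stages."""
    if len(fusion_weights) != 4:
        raise ValueError("one fusion weight per stage is expected")
    boxes = list(noise_boxes)
    stage_features = []
    for stage in range(4):
        # Stage 1 pools with the noise boxes, later stages with predictions.
        feats = [roi_mean_pool(feature_map, b) for b in boxes]
        stage_features.append(feats)
        if stage < 3:
            boxes = [refine_box(b, f) for b, f in zip(boxes, feats)]
    fused = []
    for j in range(len(boxes)):
        channels = len(stage_features[0][j])
        fused.append([
            sum(wt * stage_features[i][j][k] for i, wt in enumerate(fusion_weights))
            for k in range(channels)
        ])
    final_boxes = [refine_box(b, f) for b, f in zip(boxes, fused)]
    scores = [classify(f, num_classes) for f in fused]
    return final_boxes, scores


# Toy 8 x 8 feature map with 2 channels and two hypothetical noise boxes.
feature_map = [[[x / 8.0, y / 8.0] for x in range(8)] for y in range(8)]
noise_boxes = [(1.0, 1.0, 5.0, 5.0), (2.0, 3.0, 7.0, 8.0)]
final_boxes, scores = cascade_head(feature_map, noise_boxes, [0.1, 0.2, 0.3, 0.4])
```

In a real model the pooling would be a proper RoIAlign with bilinear sampling and the fusion weights would be learned or tuned; the sketch only shows where the features of earlier stages re-enter the final prediction.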

Experimental verification:

Experiments were conducted on the COCO dataset, and the comparison with existing methods on this dataset is shown in the following table. Compared with existing methods, the performance of the present method is significantly improved, and it leads DiffusionDet on all metrics; with ResNet-50 as the backbone, the present method improves AP and AP50 by 1.3% and 2.3% respectively, demonstrating the performance gain.

The above embodiments and drawings are not intended to limit the form or style of the present invention, and any suitable variations or modifications made by those skilled in the art should be regarded as not departing from the scope of the present invention.

Claims

1. The target detection method using the diffusion detection model is characterized by comprising the following steps:

2. The target detection method using a diffusion detection model according to claim 1, wherein:

3. The target detection method using a diffusion detection model according to claim 1, wherein:

step 2 further comprises: after the low-dimensional truth boxes are obtained, if the number of low-dimensional truth boxes of the input image is less than a prescribed value $N$, padding the number of detection boxes of the diffusion detection model to the prescribed value $N$.

4. A target detection method using a diffusion detection model according to claim 1 or 3, wherein:

5. The target detection method using a diffusion detection model according to claim 1, wherein:

6. The target detection method using a diffusion detection model according to claim 1, wherein:

7. The target detection method using a diffusion detection model according to claim 1 or 4, wherein: