CN116805389A - Open world target detection method based on decoupling cascade region generation network - Google Patents

Open world target detection method based on decoupling cascade region generation network

Info

Publication number
CN116805389A
CN116805389A (application CN202310722763.6A)
Authority
CN
China
Prior art keywords
convolution
decoupling
feature
generation network
cascade
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202310722763.6A
Other languages
Chinese (zh)
Inventor
焦继超
刘迈
李宁
范家硕
丁奔
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Beijing University of Posts and Telecommunications
Original Assignee
Beijing University of Posts and Telecommunications
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Beijing University of Posts and Telecommunications filed Critical Beijing University of Posts and Telecommunications
Priority to CN202310722763.6A priority Critical patent/CN116805389A/en
Publication of CN116805389A publication Critical patent/CN116805389A/en
Pending legal-status Critical Current

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06V IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V10/00 Arrangements for image or video recognition or understanding
    • G06V10/70 Arrangements for image or video recognition or understanding using pattern recognition or machine learning
    • G06V10/82 Arrangements for image or video recognition or understanding using pattern recognition or machine learning using neural networks
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 Computing arrangements based on biological models
    • G06N3/02 Neural networks
    • G06N3/04 Architecture, e.g. interconnection topology
    • G06N3/0464 Convolutional networks [CNN, ConvNet]
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06V IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V10/00 Arrangements for image or video recognition or understanding
    • G06V10/40 Extraction of image or video features
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06V IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V10/00 Arrangements for image or video recognition or understanding
    • G06V10/70 Arrangements for image or video recognition or understanding using pattern recognition or machine learning
    • G06V10/77 Processing image or video features in feature spaces; using data integration or data reduction, e.g. principal component analysis [PCA] or independent component analysis [ICA] or self-organising maps [SOM]; Blind source separation
    • G06V10/80 Fusion, i.e. combining data from various sources at the sensor level, preprocessing level, feature extraction level or classification level
    • G06V10/806 Fusion, i.e. combining data from various sources at the sensor level, preprocessing level, feature extraction level or classification level of extracted features
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06V IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V2201/00 Indexing scheme relating to image or video recognition or understanding
    • G06V2201/07 Target detection

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Evolutionary Computation (AREA)
  • General Physics & Mathematics (AREA)
  • Computing Systems (AREA)
  • General Health & Medical Sciences (AREA)
  • Software Systems (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Artificial Intelligence (AREA)
  • Health & Medical Sciences (AREA)
  • Multimedia (AREA)
  • Databases & Information Systems (AREA)
  • Medical Informatics (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Biomedical Technology (AREA)
  • Biophysics (AREA)
  • Computational Linguistics (AREA)
  • Data Mining & Analysis (AREA)
  • Molecular Biology (AREA)
  • General Engineering & Computer Science (AREA)
  • Mathematical Physics (AREA)
  • Image Analysis (AREA)

Abstract

The invention provides an open world target detection method based on a decoupled cascade region generation network, belonging to the field of open world target detection. First, an RGB image is input into a resnet50 model to extract feature maps of different scales. Multi-scale information fusion is then performed by an fpn module, and the fused feature map is input into an improved decoupled cascade region generation network, in which the localization-quality estimation and regression branches are decoupled and propagate gradients independently, and the cascade network adjusts candidate boxes from coarse to fine, generating the coordinate corrections of candidate boxes that may contain unknown objects and the probabilities that they contain objects. Next, candidate boxes whose probability exceeds a threshold and whose overlap is below an overlap threshold are selected by non-maximum suppression, RoI Align is performed to obtain the positions of unknown objects in the fused feature map and align them to accurate candidate boxes, the candidate-box positions are fine-tuned, the feature vectors corresponding to the candidate boxes are extracted and input into fully connected layers, and bounding-box regression and localization-quality assessment are carried out to obtain the final detection result. The invention achieves a high recall rate.

Description

Open world target detection method based on decoupling cascade region generation network
Technical Field
The invention belongs to the field of open world target detection, and particularly relates to an open world target detection method based on a decoupling cascade region generation network.
Background
In recent years, with the development of deep learning, object detection technology has advanced greatly. However, conventional object detection techniques typically work only in closed-world scenarios, i.e., they can detect only objects of known classes. For open world target detection, i.e., detection in the presence of unknown-class targets, traditional methods perform poorly.
To address this problem, several deep-learning-based open world object detection methods have appeared in recent years to handle unknown-class objects. However, existing methods are generally suited to only one particular scenario; in other scenarios their recall on unknown objects with semantic shift is low.
Existing open world target detection methods fall mainly into two categories: methods based on convolutional neural networks and methods based on Transformers.
In open world target detection based on convolutional neural networks, for example, the ORE model mainly proposes candidate boxes through a region generation network and then performs contrastive clustering to identify unknown objects. The candidate boxes proposed by the region generation network in this model are of low quality; the classification and regression tasks share parameters and remain coupled even though their focuses differ (classification attends to the texture of the target, localization to its edges), so the final feature representation is insufficient and the recall of unknown objects suffers; as a result, unknown objects cannot be detected well.
Transformer-based methods mainly use an end-to-end Transformer network combined with an open world target detection method. For example, the OW-DETR model lacks inductive bias, fits poorly on small datasets, and requires large amounts of computing resources and training time to reach a good detection result; the model is inefficient in engineering applications and therefore cannot be readily adopted by industry.
Therefore, there is an urgent need for an open world target detection method that improves recall and efficiency of open world target detection.
Disclosure of Invention
Aiming at the defects of the existing open world target detection method, the invention provides the open world target detection method based on the decoupling cascade region generation network.
The open world target detection method based on the decoupling cascade region generation network comprises the following steps:
firstly, inputting RGB images acquired by a camera into a resnet50 model, and extracting feature images with different scales;
firstly, preprocessing the RGB image and inputting it into the residual modules of the different stages of the resnet50 model; a deep feature map x_{l+1} is obtained through the residual module:
x_{l+1} = x_l + F(x_l, W_l)
where x_l denotes the shallow feature map and F(x_l, W_l) denotes the residual function; their sum is the output feature map.
Then, the deep feature maps x_{l+1} are extracted through the model backbone, giving feature maps of different scales at the 4 stages.
The residual block includes 1*1 convolution, 3*3 convolution, batch Norm normalization, and ReLU activation functions;
the convolution formula is shown as follows:
y_{ij} = Σ_{m=1}^{k} Σ_{n=1}^{k} w_{m,n} · x_{i+m, j+n} + b
where x_{ij} is the input feature map, w denotes the convolution kernel weights, k is the convolution kernel size (the summation runs over the k×k kernel), b is the convolution bias term, and y_{ij} is the output feature map. H and W represent the height and width of the input feature map, respectively, and i and j represent the row and column coordinates of pixel points in the input feature map.
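For illustration only, a minimal PyTorch sketch of a bottleneck residual block of the kind described above (1×1 and 3×3 convolutions, Batch Norm, ReLU, identity shortcut) is given below; the class name and channel widths are illustrative assumptions, not details from the patent.

```python
import torch
import torch.nn as nn

class Bottleneck(nn.Module):
    """Sketch of a resnet50-style residual block: x_{l+1} = x_l + F(x_l, W_l)."""
    def __init__(self, channels: int, mid_channels: int):
        super().__init__()
        self.residual = nn.Sequential(              # F(x_l, W_l)
            nn.Conv2d(channels, mid_channels, kernel_size=1, bias=False),
            nn.BatchNorm2d(mid_channels),
            nn.ReLU(inplace=True),
            nn.Conv2d(mid_channels, mid_channels, kernel_size=3, padding=1, bias=False),
            nn.BatchNorm2d(mid_channels),
            nn.ReLU(inplace=True),
            nn.Conv2d(mid_channels, channels, kernel_size=1, bias=False),
            nn.BatchNorm2d(channels),
        )
        self.relu = nn.ReLU(inplace=True)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return self.relu(x + self.residual(x))      # identity shortcut plus residual
```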
Step two, carrying out multi-scale information fusion on the feature images with different scales through a fpn module, and outputting a fused feature image with high resolution and strong semantics;
the specific fusion process is as follows:
Working top-down, the fpn module upsamples the deep feature map x_{l+1} and fuses it with the feature maps of different scales output by the bottom-up pathway; fusion is element-wise addition at corresponding positions, and upsampling uses nearest-neighbor interpolation.
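As a hedged illustration (PyTorch assumed; the module and parameter names are not from the patent), one top-down fusion step of this kind could be sketched as:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class TopDownFusion(nn.Module):
    """Sketch of one fpn top-down step: upsample the deeper map and add it to the lateral map."""
    def __init__(self, in_channels: int, out_channels: int = 256):
        super().__init__()
        self.lateral = nn.Conv2d(in_channels, out_channels, kernel_size=1)  # align channel widths
        self.smooth = nn.Conv2d(out_channels, out_channels, kernel_size=3, padding=1)

    def forward(self, bottom_up: torch.Tensor, top_down: torch.Tensor) -> torch.Tensor:
        lateral = self.lateral(bottom_up)
        upsampled = F.interpolate(top_down, size=lateral.shape[-2:], mode="nearest")  # nearest-neighbor upsampling
        return self.smooth(lateral + upsampled)      # element-wise addition at corresponding positions
```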
Inputting the fusion feature map into an improved decoupling cascade region generation network, and generating coordinate correction amounts of candidate frames possibly containing unknown objects and probabilities of the candidate frames containing the objects;
the improved decoupling cascade region generates a network, sub-task parameter branches of the network are decoupled, and gradients are independently propagated through parallel convolution decoupling branches; the improved decoupling cascade region generation network is enabled to more effectively aggregate object features of different categories by adding the latest convolution operator and normalization method; and the alignment of the feature images and the candidate frames is kept in a cascading mode, so that more accurate candidate frames are obtained.
The method comprises the following specific steps:
step 301, generating a series of candidate frames on an input fusion feature map by a decoupling cascade region generation network according to priori knowledge;
the decoupled cascade region generation network comprises two stages: coarsely screened candidate boxes are generated in the first stage; in the second stage the candidate boxes are aligned to the region of interest through offsets, producing more accurate candidate boxes;
in the cascade design of the network, the self-adaptive convolution and RELU activation functions are included, and the specific formula is shown as follows;
C(x_{i+1}) = ReLU(AdaptiveConv(x_i)), i = 1, …, n
where AdaptiveConv(x) is the adaptive convolution and C(x_{i+1}) is the output of the cascade module; the first stage of the adaptive convolution is dilated (hole) convolution, and the second stage is deformable convolution.
Firstly, in a first stage, adapting to the transformation of a feature space through hole convolution to generate a candidate frame in a decoupling cascade region generation network;
the actual convolution kernel size K of the hole convolution is shown as follows:
K=k+(k-1)(r-1)
where k is the original convolution kernel size and r is the dilation rate of the dilated convolution; the effective kernel size is adjusted through the dilation rate. For example, a 3×3 kernel with r = 2 has an effective size K = 3 + (3-1)(2-1) = 5.
In the second stage, the shape offset is set by deformable convolution, and the spatial position of the sample is changed according to the offset learned by different feature maps, so that an irregular receptive field is formed.
where y(p) and x(p) denote the output feature map and the input feature map at position p, respectively, m_{ij} denotes the shape offset corresponding to position (i, j) in the convolution kernel, and α_k is a penalty term applied when the sampled spatial position exceeds the target region;
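A rough sketch of one such cascade stage, assuming PyTorch and torchvision's DeformConv2d, is shown below; the channel counts and the way the offsets are predicted are illustrative choices, not details given in the patent.

```python
import torch
import torch.nn as nn
from torchvision.ops import DeformConv2d

class CascadeStage(nn.Module):
    """Sketch of one cascade stage: C(x_{i+1}) = ReLU(AdaptiveConv(x_i)).

    Stage 1 uses dilated convolution (effective kernel K = k + (k-1)(r-1));
    stage 2 uses deformable convolution driven by a learned offset field."""
    def __init__(self, channels: int, dilation: int = 2, deformable: bool = False):
        super().__init__()
        self.deformable = deformable
        if deformable:
            # 2 offsets (dy, dx) per kernel position: 2 * 3 * 3 = 18 channels
            self.offset = nn.Conv2d(channels, 18, kernel_size=3, padding=1)
            self.conv = DeformConv2d(channels, channels, kernel_size=3, padding=1)
        else:
            self.conv = nn.Conv2d(channels, channels, kernel_size=3,
                                  padding=dilation, dilation=dilation)
        self.relu = nn.ReLU(inplace=True)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        if self.deformable:
            return self.relu(self.conv(x, self.offset(x)))
        return self.relu(self.conv(x))
```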
then, the model is converged more quickly by decoupling the positioning quality estimation and the parameter branches of the positioning frame regression.
Each branch contains a decoupling module consisting of 3*3 convolutions, group Norm normalization and ReLU activation functions, and in the last layer of the decoupling module the 3*3 convolutions are replaced by deformable convolutions.
The decoupling module parameter branches are defined as follows:
G_i(x) = ReLU(GN(Conv(x))), i = 1, …, n
where Conv (x) represents 3*3 convolution when i=1 to n-1, and the convolution is replaced with a deformable convolution when i=n.
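The following sketch illustrates one decoupled parameter branch under the same assumptions (PyTorch, torchvision DeformConv2d); the number of blocks and the GroupNorm group count are assumed values.

```python
import torch
import torch.nn as nn
from torchvision.ops import DeformConv2d

class DecoupledBranch(nn.Module):
    """Sketch of one decoupled branch: repeated blocks of G_i(x) = ReLU(GN(Conv(x))),
    with the 3x3 convolution of the last block replaced by a deformable convolution."""
    def __init__(self, channels: int, n_blocks: int = 3, groups: int = 32):
        super().__init__()
        blocks = []
        for _ in range(n_blocks - 1):
            blocks += [nn.Conv2d(channels, channels, 3, padding=1),
                       nn.GroupNorm(groups, channels),
                       nn.ReLU(inplace=True)]
        self.body = nn.Sequential(*blocks)
        self.offset = nn.Conv2d(channels, 18, kernel_size=3, padding=1)   # learned offsets for the last layer
        self.last = DeformConv2d(channels, channels, kernel_size=3, padding=1)
        self.gn = nn.GroupNorm(groups, channels)
        self.relu = nn.ReLU(inplace=True)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        x = self.body(x)
        return self.relu(self.gn(self.last(x, self.offset(x))))
```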
Finally, the subtask gradients are propagated separately through the decoupled convolution branches, giving feature maps with different receptive fields for the different scales;
the formula is as follows:
DC_loc(x), DC_reg(x) = L2Norm(G_n(…(G_2(G_1(C(x))))))
where DC_loc(x) and DC_reg(x) denote the feature maps output by the decoupled cascade region generation network, L2Norm denotes L2 normalization, and the feature maps of the different stages are iterated step by step through the C(x) cascade module.
Step 302, inputting the candidate frame output in step 301 into a total loss function, and carrying out regression of the candidate frame by combining the overlapping degree of the positions and the shapes of the object and the real object in the candidate frame.
The calculation formula of the total loss function of the decoupling cascade area generation network is as follows:
L(θ_Rpn, θ_Fpn) = Σ_τ α_τ ( L_loc^τ + λ · L_reg^τ )
where L_loc^τ is the value of the localization-quality-assessment branch loss function and L_reg^τ is the value of the bounding-box regression branch loss function in stage τ. λ is a hyperparameter used to adjust the weight of the bounding-box regression branch loss within the total loss function; α_τ is the weight function of each stage; θ_Rpn and θ_Fpn are the parameters of the decoupled cascade region generation network and of fpn.
The loss function L_loc^τ is the smooth L1 loss between the candidate-box position p and the ground-truth-box position p*, with the positional relation between the candidate box and the ground-truth box judged according to a centerness index.
The loss function L_reg^τ is defined as:
L_reg^τ = 1 - IoU + ρ^2 / c^2
where ρ is the Euclidean distance between the center points of the candidate box and the ground-truth box, c is the diagonal length of the smallest enclosing region covering both boxes, and IoU is the intersection-over-union index used in target detection.
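For reference, a small PyTorch sketch of the DIoU-style regression loss described above (1 - IoU plus the squared center distance over the squared enclosing-box diagonal) might look like this; the box format (x1, y1, x2, y2) is an assumption.

```python
import torch

def diou_loss(pred: torch.Tensor, target: torch.Tensor) -> torch.Tensor:
    """DIoU loss for boxes given as (x1, y1, x2, y2); returns the mean loss over the batch."""
    # Intersection and IoU
    lt = torch.max(pred[:, :2], target[:, :2])
    rb = torch.min(pred[:, 2:], target[:, 2:])
    wh = (rb - lt).clamp(min=0)
    inter = wh[:, 0] * wh[:, 1]
    area_p = (pred[:, 2] - pred[:, 0]) * (pred[:, 3] - pred[:, 1])
    area_t = (target[:, 2] - target[:, 0]) * (target[:, 3] - target[:, 1])
    iou = inter / (area_p + area_t - inter + 1e-7)

    # Squared distance between box centers (rho^2)
    ctr_p = (pred[:, :2] + pred[:, 2:]) / 2
    ctr_t = (target[:, :2] + target[:, 2:]) / 2
    rho2 = ((ctr_p - ctr_t) ** 2).sum(dim=1)

    # Squared diagonal of the smallest enclosing box (c^2)
    enc_lt = torch.min(pred[:, :2], target[:, :2])
    enc_rb = torch.max(pred[:, 2:], target[:, 2:])
    c2 = ((enc_rb - enc_lt) ** 2).sum(dim=1) + 1e-7

    return (1 - iou + rho2 / c2).mean()
```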
Step 303, on each area possibly containing an unknown object on the feature map, corresponding to a plurality of regression candidate frames, calculating the coordinate correction amount of each regression candidate frame and the probability of containing the object;
the probability calculation formula is:
p = sigmoid(f_loc(x)) = 1 / (1 + e^(-f_loc(x)))
where f_loc(x) is the output of the localization-quality estimation branch; the sigmoid activation maps it to a probability value in (0, 1), and a larger output value indicates a higher probability that the region contains a detected object.
Step four, selecting candidate frames exceeding a probability threshold and lower than an overlapping degree threshold through non-maximum value suppression, executing Roi alignment, acquiring the position of an unknown object contained in a fusion feature map, aligning to the accurate candidate frames, finely adjusting the positions of the candidate frames, and extracting feature vectors corresponding to the candidate frames and outputting the feature vectors;
firstly, dividing a region possibly containing an object into a plurality of small blocks by utilizing Roi Align, uniformly sampling in each small block to obtain sampling point coordinates, calculating sampling point pixels by utilizing a bilinear interpolation method, finally averaging pixel areas in the small blocks, and combining characteristic values of all the small blocks to obtain a characteristic representation of the whole region containing the object;
the method of calculating the pixels in each small block is as follows:
f(x, y) ≈ [ f(x_1, y_1)(x_2 - x)(y_2 - y) + f(x_2, y_1)(x - x_1)(y_2 - y) + f(x_1, y_2)(x_2 - x)(y - y_1) + f(x_2, y_2)(x - x_1)(y - y_1) ] / [ (x_2 - x_1)(y_2 - y_1) ]
where f(x, y) is the interpolated pixel value; f(x_1, y_1), f(x_1, y_2), f(x_2, y_1), f(x_2, y_2) are the pixel values of the four neighbouring pixels in the original image; (x_1, y_1), (x_1, y_2), (x_2, y_1), (x_2, y_2) are the upper-left, lower-left, upper-right and lower-right pixel points respectively; and x_1 ≤ x ≤ x_2, y_1 ≤ y ≤ y_2.
The pixel values of the sample points within the patch are averaged as shown in the following equation:
f(x_ave, y_ave) = (1/N) Σ_{i=1}^{N} f(x_i, y_i)
where f(x_ave, y_ave) is the averaged feature pixel value of the block, N is the number of points sampled within the block, and f(x_i, y_i) are the pixel values of the sampling points;
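A minimal NumPy sketch of the block-wise sampling described above (bilinear interpolation of uniformly spaced sampling points, then averaging) is given below; the number of sampling points per block is an illustrative choice.

```python
import numpy as np

def bilinear_sample(image: np.ndarray, x: float, y: float) -> float:
    """Bilinearly interpolate a sub-pixel sample f(x, y) from a 2-D feature map."""
    x1, y1 = int(np.floor(x)), int(np.floor(y))
    x2, y2 = min(x1 + 1, image.shape[1] - 1), min(y1 + 1, image.shape[0] - 1)
    dx, dy = x - x1, y - y1
    return (image[y1, x1] * (1 - dx) * (1 - dy) + image[y1, x2] * dx * (1 - dy) +
            image[y2, x1] * (1 - dx) * dy + image[y2, x2] * dx * dy)

def roi_align_block(image: np.ndarray, x0: float, y0: float, x1: float, y1: float,
                    n: int = 2) -> float:
    """Average n*n uniformly spaced bilinear samples inside one block of a RoI."""
    xs = np.linspace(x0, x1, n + 2)[1:-1]          # uniform interior sample positions
    ys = np.linspace(y0, y1, n + 2)[1:-1]
    samples = [bilinear_sample(image, x, y) for y in ys for x in xs]
    return float(np.mean(samples))                 # the averaged block feature value
```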
then, the position of the candidate frame is finely adjusted by corresponding the whole object area represented by the features to the accurate candidate frame, and the feature vector is extracted and output;
and fifthly, inputting the feature vectors into the full-connection layer respectively, carrying out positioning frame regression and positioning quality evaluation to obtain a final detection result, and outputting coordinates and corresponding confidence of the detection frame.
The invention has the advantages that:
compared with the prior art, the open world target detection method based on the decoupling cascade region generation network utilizes the decoupling branches to improve the region generation network expression capacity, improves the detection recall rate of unknown objects through the cascade module, learns the shape characteristics of the unknown objects in a small data set by utilizing the inductive bias of a convolution operator, performs positioning regression, improves the recall rate and efficiency of open world target detection, and has strong practicability.
Drawings
FIG. 1 is a schematic diagram of an open world target detection method based on a decoupled cascade region generation network of the present invention;
FIG. 2 is a schematic diagram of a decoupled cascade area generation network in accordance with the present invention;
FIG. 3 is a flow chart of an open world target detection method based on a decoupled cascade region generation network of the present invention;
Detailed Description
The present invention is further described in detail below with reference to the drawings and examples for the purpose of facilitating understanding and practicing the present invention by those of ordinary skill in the art. All other embodiments, which can be made by those skilled in the art based on the embodiments of the present invention without making any inventive effort, shall fall within the scope of the present invention.
The invention discloses an open world target detection method based on a decoupling cascade region generation network, which is shown in figures 1 and 3 and comprises the following specific steps:
firstly, inputting RGB images acquired by a camera into a resnet50 model, and extracting feature images with different scales;
the specific extraction process comprises the following steps:
firstly, preprocessing an RGB image, and inputting the RGB image into residual modules of different stages of a resnet 50;
the pretreatment is operated by adopting a module comprising a rolling layer and a pooling layer;
obtaining a deep feature map x through a residual error module l+1
x_{l+1} = x_l + F(x_l, W_l)
where x_l denotes the shallow feature map and F(x_l, W_l) denotes the residual function; their sum is the output feature map.
Then, the deep feature maps x_{l+1} are extracted through the model backbone, giving feature maps of different scales at the 4 stages.
The residual block includes 1*1 convolution, 3*3 convolution, batch Norm normalization, and ReLU activation functions;
the convolution formula is shown as follows:
y_{ij} = Σ_{m=1}^{k} Σ_{n=1}^{k} w_{m,n} · x_{i+m, j+n} + b
where x_{ij} is the input feature map, w denotes the convolution kernel weights, k is the convolution kernel size (the summation runs over the k×k kernel), b is the convolution bias term, and y_{ij} is the output feature map. H and W represent the height and width of the input feature map, respectively, and i and j represent the row and column coordinates of pixel points in the input feature map. The convolution formula represents the sliding of the convolution kernel over the input matrix and the weighted-sum operation at each location.
Step two, carrying out multi-scale information fusion on the feature images with different scales through a fpn module, combining semantic information of the high-level feature images with position information of the bottom-level feature images, and outputting a high-resolution and strong-semantic fusion feature image;
the specific fusion process is as follows:
Working top-down, the fpn module upsamples the deep feature map x_{l+1} and fuses it with the feature maps of different scales output by the bottom-up pathway; fusion is element-wise addition at corresponding positions, and upsampling uses nearest-neighbor interpolation.
Inputting the fusion feature map into an improved decoupling cascade region generation network, and generating coordinate correction amounts of candidate frames possibly containing unknown objects and probabilities of the candidate frames containing the objects;
as shown in fig. 2, the subtask parameter branches of the improved decoupling cascade region generation network are decoupled compared with the existing region generation network, and the gradients are independently propagated through parallel convolution decoupling branches; by adding the latest convolution operator and normalization method, such as deformable convolution and group normalization, the region generation network can more effectively aggregate the characteristics of different types of objects; and, can keep the characteristic map and candidate frame to align through the cascade mode, obtain more accurate candidate frame.
The method comprises the following specific steps:
step 301, generating a series of candidate frames on an input fusion feature map by a decoupling cascade region generation network according to priori knowledge;
the decoupled cascade region generation network comprises two stages: coarsely screened candidate boxes are generated in the first stage; in the second stage the candidate boxes are aligned with the region of interest through offsets, producing more accurate candidate boxes and improving, from coarse to fine, the recall of unknown-object detection;
in the cascade design of the network, the self-adaptive convolution and the ReLU activation function are contained, and the specific formula is shown as follows;
C(x_{i+1}) = ReLU(AdaptiveConv(x_i)), i = 1, …, n
where AdaptiveConv(x) is the adaptive convolution and C(x_{i+1}) is the output of the cascade module; the first stage of the adaptive convolution is dilated (hole) convolution, and the second stage is deformable convolution. The candidate boxes generated in each detection stage are adjusted by adaptive convolution, and the receptive-field size is adjusted dynamically according to the different scales of the input image to obtain more accurate candidate boxes.
Firstly, in a first stage, adapting to the transformation of a feature space through hole convolution to generate a candidate frame in a decoupling cascade region generation network;
the actual convolution kernel size K of the hole convolution is shown as follows:
K=k+(k-1)(r-1)
where k is the original convolution kernel size and r is the dilation rate of the dilated convolution; the dilated convolution is computed with the same formula as the convolution in step one, and the kernel size is adjusted through the dilation rate.
In the second stage, the shape offset is set by deformable convolution, and the spatial position of the sample is changed according to the offset learned by different feature maps, so that an irregular receptive field is formed.
where y(p) and x(p) denote the output feature map and the input feature map at position p, respectively, m_{ij} denotes the shape offset corresponding to position (i, j) in the convolution kernel, α_k is a penalty term applied when the sampled spatial position exceeds the target region, w_k denotes the convolution kernel weights, and b denotes the convolution bias.
Then, by decoupling the parameter branches of the different subtasks, namely the localization-quality estimation branch and the bounding-box regression branch, the subtask gradients are propagated independently without interfering with each other, so the model converges faster. Each branch contains a decoupling module consisting of a 3×3 convolution, Group Norm normalization and a ReLU activation function; in the last layer of the decoupling module the 3×3 convolution is replaced with a deformable convolution, making the target detector robust to objects of different scales.
The decoupling module parameter branches are defined as follows:
G_i(x) = ReLU(GN(Conv(x))), i = 1, …, n
where Conv (x) represents 3*3 convolution when i=1 to n-1, and the convolution is replaced with a deformable convolution when i=n.
Finally, the subtask gradients are propagated separately through the decoupled convolution branches, giving feature maps with different receptive fields for the different scales; the decoupled cascade parameter branches replace the coupled convolution parameter branches of the original region generation network, so that the different subtasks can propagate gradients separately and the features of different objects are aggregated better;
DC_loc(x), DC_reg(x) = L2Norm(G_n(…(G_2(G_1(C(x))))))
where DC_loc(x) and DC_reg(x) denote the feature maps output by the decoupled cascade region generation network, L2Norm denotes L2 normalization, and the feature maps of the different stages are iterated step by step through the C(x) cascade module.
Step 302, inputting the candidate frame output in step 301 into a total loss function, and carrying out regression of the candidate frame by combining the overlapping degree of the positions and the shapes of the object and the real object in the candidate frame.
The candidate boxes are input into the total loss function for correction; the smaller the loss value, the closer the candidate boxes are to the ground-truth boxes of the real objects. At this point, several candidate boxes are generated in each region of the feature map that may contain an object, ensuring coverage;
the calculation formula of the total loss function of the decoupling cascade area generation network is as follows:
L(θ_Rpn, θ_Fpn) = Σ_τ α_τ ( L_loc^τ + λ · L_reg^τ )
where L_loc^τ is the value of the localization-quality-assessment branch loss function and L_reg^τ is the value of the bounding-box regression branch loss function in stage τ. λ is a hyperparameter used to adjust the weight of the bounding-box regression branch loss within the total loss function; α_τ is the weight function of each stage; θ_Rpn and θ_Fpn are the parameters of the decoupled cascade region generation network and of fpn.
The loss function L_loc^τ is the smooth L1 loss between the candidate-box position p and the ground-truth-box position p*, with the positional relation between the candidate box and the ground-truth box judged according to a centerness index.
The loss function L_reg^τ is defined as:
L_reg^τ = 1 - IoU + ρ^2 / c^2
where ρ is the Euclidean distance between the center points of the candidate box and the ground-truth box, c is the diagonal length of the smallest enclosing region covering both boxes, and IoU is the intersection-over-union index used in target detection.
Step 303, on each area possibly containing an unknown object on the feature map, corresponding to a plurality of regression candidate frames, calculating the coordinate correction amount of each regression candidate frame and the probability of containing the object;
the probability calculation formula is:
p = sigmoid(f_loc(x)) = 1 / (1 + e^(-f_loc(x)))
where f_loc(x) is the output of the localization-quality estimation branch; the sigmoid activation maps it to a probability value in (0, 1), and a larger output value indicates a higher probability that the region contains a detected object.
Step four, reserving high-quality candidate frames through non-maximum value inhibition, executing Roi alignment, acquiring the position of an unknown object contained in the fusion feature map, aligning to an accurate candidate frame, finely adjusting the position of the candidate frame, and extracting feature vectors corresponding to the candidate frame and outputting the feature vectors;
specifically, through non-maximum value inhibition, according to the cross ratio and probability between candidate frames, the candidate frames with higher probability and lower overlapping degree are reserved for comprehensive consideration, and the network operation efficiency is improved, wherein the specific formula is as follows:
for each Anchor Box in S: keep it if IoU(M, Anchor Box) < thr, otherwise suppress it
where Anchor Box is a candidate box, M is the candidate box with the highest output probability value, S is the set of candidate boxes corresponding to the same object-containing region on the feature map, IoU is the intersection-over-union measure, and thr is the screening threshold for non-maximum suppression.
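As an illustration of the suppression rule above, a small PyTorch sketch of greedy non-maximum suppression follows; in practice a library routine such as torchvision.ops.nms could be used instead.

```python
import torch

def nms(boxes: torch.Tensor, scores: torch.Tensor, thr: float = 0.5) -> torch.Tensor:
    """Greedy NMS: repeatedly keep the highest-scoring box M and drop every remaining
    box whose IoU with M is >= thr. Boxes are given as (x1, y1, x2, y2)."""
    order = scores.argsort(descending=True)
    keep = []
    while order.numel() > 0:
        m = order[0]
        keep.append(m.item())
        if order.numel() == 1:
            break
        rest = order[1:]
        # IoU between M and the remaining boxes
        lt = torch.max(boxes[m, :2], boxes[rest, :2])
        rb = torch.min(boxes[m, 2:], boxes[rest, 2:])
        wh = (rb - lt).clamp(min=0)
        inter = wh[:, 0] * wh[:, 1]
        area_m = (boxes[m, 2] - boxes[m, 0]) * (boxes[m, 3] - boxes[m, 1])
        area_r = (boxes[rest, 2] - boxes[rest, 0]) * (boxes[rest, 3] - boxes[rest, 1])
        iou = inter / (area_m + area_r - inter + 1e-7)
        order = rest[iou < thr]        # keep only boxes whose overlap is below the threshold
    return torch.tensor(keep, dtype=torch.long)
```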
At this point, the region of the feature map corresponding to a candidate box is likely to contain an object, and RoI Align is performed on it: the region output in step three that may contain an object is divided into several small blocks, sampling points are taken uniformly within each block, the pixel value of each sampling point is computed by bilinear interpolation, the pixel values within each block are averaged, and the feature values of all blocks are combined to obtain the feature representation of the whole object-containing region;
the specific pixel calculation method in each small block is shown as follows:
f(x, y) ≈ [ f(x_1, y_1)(x_2 - x)(y_2 - y) + f(x_2, y_1)(x - x_1)(y_2 - y) + f(x_1, y_2)(x_2 - x)(y - y_1) + f(x_2, y_2)(x - x_1)(y - y_1) ] / [ (x_2 - x_1)(y_2 - y_1) ]
where f(x, y) is the interpolated pixel value; f(x_1, y_1), f(x_1, y_2), f(x_2, y_1), f(x_2, y_2) are the pixel values of the four neighbouring pixels in the original image; (x_1, y_1), (x_1, y_2), (x_2, y_1), (x_2, y_2) are the upper-left, lower-left, upper-right and lower-right pixel points respectively; and x_1 ≤ x ≤ x_2, y_1 ≤ y ≤ y_2.
The pixel values of the sample points within the patch are then averaged as shown in the following equation:
f(x_ave, y_ave) = (1/N) Σ_{i=1}^{N} f(x_i, y_i)
where f(x_ave, y_ave) is the averaged feature pixel value of the block, N is the number of points sampled within the block, and f(x_i, y_i) are the pixel values of the sampling points;
the important region in the feature map is extracted through the Roi Align, so that the problem of unmatched region and candidate frame caused by loss of quantization precision can be avoided.
After region features are extracted for the candidate boxes with RoI Align, the smooth L1 loss function is used to train the two subtask detection heads. In this way the positions of unknown objects contained in the fused feature map are obtained and aligned to accurate candidate boxes; the candidate-box positions are then fine-tuned for further bounding-box regression and localization-quality estimation;
then, the feature vectors corresponding to the candidate frames are extracted and output.
And step five, respectively inputting the feature vectors obtained in the step four into a full-connection layer, carrying out positioning frame regression and positioning quality evaluation to obtain a final detection result, and outputting the detection frame coordinates and the corresponding confidence coefficient.
Examples:
the embodiment comprises the following steps:
step S1, firstly inputting an RGB image into the backbone for image feature extraction to obtain feature maps, and then combining the feature maps of different levels through the neck module to obtain feature maps with multi-scale information;
the method comprises the following specific steps: firstly, outputting feature graphs with different step sizes by inputting RGB images into a resnet50 model; then, combining semantic information of the high-level feature map and position information of the bottom-level feature map through a fpn module to output a high-resolution and high-semantic feature map; the method comprises the following steps:
s1.1, a resnet50 model is used for extracting an image feature map, firstly, input is preprocessed through a module comprising a convolution layer and a pooling layer, then the input is input into residual modules of different stages, the residual modules comprise a 1*1 convolution, a 3*3 convolution, batch Norm normalization and ReLU activation function, and the convolution formula is shown as follows:
where f is the input matrix, g is the convolution kernel, (x, y) is the coordinates on the input matrix, and (m, n) is the coordinates on the convolution kernel, the formula represents the sliding process of the convolution kernel on the input matrix, and the weighted sum operation at each location.
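A naive NumPy sketch of this sliding-window weighted sum (cross-correlation form, as used in CNN layers) is shown below for illustration:

```python
import numpy as np

def conv2d(f: np.ndarray, g: np.ndarray) -> np.ndarray:
    """Slide the kernel g over the input f and compute a weighted sum at each location."""
    kh, kw = g.shape
    out_h, out_w = f.shape[0] - kh + 1, f.shape[1] - kw + 1
    out = np.zeros((out_h, out_w), dtype=f.dtype)
    for y in range(out_h):
        for x in range(out_w):
            out[y, x] = np.sum(f[y:y + kh, x:x + kw] * g)   # weighted sum at (x, y)
    return out
```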
Feature maps with different strides are output from the 4 stages of the model backbone and are then input into fpn for multi-scale information fusion.
S1.2: The fpn network is divided into a bottom-up part and a top-down part. Feature extraction is the bottom-up process: 4 feature maps of different scales are extracted through the 4 stages of resnet50. The deep features are then upsampled to the resolution of the corresponding bottom-up output and fused with it, the fused feature maps are output, and fusion is element-wise addition at corresponding positions. Upsampling uses nearest-neighbor interpolation, whose formula is as follows:
assuming that a feature map of w×h pixels needs to be scaled up to w×h by nearest neighbor interpolation, the calculation method of each pixel (dstX, dstY) in the new feature map is as follows:
the pixel value at (dstX, dstY) is taken from the source point (srcX, srcY), which is obtained by multiplying the destination coordinates by a scaling factor and rounding down, so that only the nearest integer-coordinate pixel is used.
The coordinate transformation calculation formula of the nearest neighbor interpolation method is shown as follows:
srcX = dstX × (srcWidth / dstWidth), srcY = dstY × (srcHeight / dstHeight)
where dstX and dstY are the coordinates of a pixel in the target feature map, dstWidth and dstHeight are the width and height of the target feature map, srcWidth and srcHeight are the width and height of the original feature map, and srcX and srcY are the coordinates in the original feature map corresponding to the point (dstX, dstY).
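For illustration, a NumPy sketch of nearest-neighbor upsampling using exactly this coordinate mapping:

```python
import numpy as np

def nearest_neighbor_upsample(src: np.ndarray, dst_h: int, dst_w: int) -> np.ndarray:
    """Map each destination pixel back to the source via
    srcX = dstX * (srcWidth / dstWidth), srcY = dstY * (srcHeight / dstHeight),
    rounding down to the nearest integer source pixel."""
    src_h, src_w = src.shape[:2]
    dst = np.empty((dst_h, dst_w) + src.shape[2:], dtype=src.dtype)
    for dst_y in range(dst_h):
        src_y = int(dst_y * src_h / dst_h)        # floor by integer conversion
        for dst_x in range(dst_w):
            src_x = int(dst_x * src_w / dst_w)
            dst[dst_y, dst_x] = src[src_y, src_x]
    return dst
```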
And S2, inputting the extracted feature map into an improved decoupling cascade region generation network, and corresponding a plurality of regression candidate boxes on each region possibly containing an unknown object through the network. Calculating the coordinate correction amount of each regression candidate frame and the probability of the object contained in the coordinate correction amount;
firstly, inputting a feature map into a decoupling cascade region generation network to generate a plurality of candidate frames; then, the overlapping degree of the candidate frame and the true object position and shape in the loss function is calculated, and the candidate frame regression is performed, as shown in fig. 1.
S2.1: In the improved decoupled cascade region generation network, a series of candidate boxes is generated on the feature map according to prior knowledge. Adaptive convolution adapts to the transformation of the feature space, solving the alignment problem between the candidate boxes and the feature map in the region generation network. In the decoupled cascade region generation network, the adaptive convolution comprises two stages; the first stage is dilated (hole) convolution, whose actual convolution kernel size K is given by:
K=k+(k-1)(r-1)
where k is the original convolution kernel size and r is the dilation rate of the dilated convolution; the receptive field is computed in the same way as for standard convolution, and its size is controlled by the dilation rate r.
In the second stage, the shape offset is set by deformable convolution, and the spatial position of the sample is changed according to the offset learned by different feature maps, so that an irregular receptive field is formed.
where y(p) and x(p) denote the output feature map and the input feature map at position p, respectively, m_{ij} denotes the shape offset corresponding to position (i, j) in the convolution kernel, α_k is a penalty term applied when the sampled spatial position exceeds the target region, w_k denotes the convolution kernel weights, and b denotes the convolution bias.
Then the feature map passes through the cascade region generation network: using the multi-stage design, a batch of candidate boxes is generated in each stage and passed to the next stage, producing more accurate candidate boxes and improving, from coarse to fine, the recall of unknown-object detection.
In the regression stage of the candidate frame, the positioning quality estimation branch and the positioning frame regression branch are decoupled, and the two branches are processed in parallel, so that the recall rate of an unknown object can be improved.
S2.2, when the loss function is calculated, DIoU is selected as the loss function of the regression of the positioning frame, and the quality evaluation branch of the positioning frame is calculated by using the smooth L1 loss function.
The definition of the regression loss function of the positioning frame is as follows:
L_reg = 1 - IoU + ρ^2 / c^2
where ρ is the Euclidean distance between the center points of the candidate box and the ground-truth box, c is the diagonal length of the smallest enclosing region covering both boxes, and IoU is the intersection-over-union index used in target detection.
The definition of the bounding box quality assessment branch loss function is as follows:
the quality-assessment branch loss is the smooth L1 loss computed between the candidate box p and the ground-truth box p*, with the positional relation between the candidate box and the ground-truth box judged according to a centerness index, where t, b, l and r are respectively the distances from the candidate-box center toward the top, bottom, left and right of the ground-truth target.
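For illustration only: the patent does not give the explicit formula that turns t, b, l and r into a localization-quality target, so the sketch below assumes an FCOS-style centerness definition; it is an assumption, not the patent's formula.

```python
import torch

def centerness_target(t: torch.Tensor, b: torch.Tensor,
                      l: torch.Tensor, r: torch.Tensor) -> torch.Tensor:
    """Assumed FCOS-style centerness from the four center-to-side distances:
    sqrt(min(l,r)/max(l,r) * min(t,b)/max(t,b)). Not taken from the patent."""
    lr = torch.min(l, r) / torch.max(l, r).clamp(min=1e-7)
    tb = torch.min(t, b) / torch.max(t, b).clamp(min=1e-7)
    return torch.sqrt(lr * tb)

# The localization-quality branch could then be trained with a smooth L1 loss
# between its sigmoid output and this target, e.g.:
# loss = torch.nn.functional.smooth_l1_loss(pred_quality, centerness_target(t, b, l, r))
```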
During training, the values of the two loss functions are calculated simultaneously, and then the total loss function is calculated at the end by the following formula:
L(θ_Rpn, θ_Fpn) = Σ_τ α_τ ( L_loc^τ + λ · L_reg^τ )
where L_loc^τ is the value of the localization-quality-assessment branch loss function and L_reg^τ is the value of the bounding-box regression branch loss function in stage τ. λ is a hyperparameter used to adjust the weight of the bounding-box regression branch loss within the total loss function; α_τ is the weight function of each stage; θ_Rpn and θ_Fpn are the parameters of the decoupled cascade region generation network and of fpn.
Step S3, performing non-maximum value inhibition on the candidate frame with high quality in the step S2, then executing Roi alignment, extracting a feature image possibly containing an object, aligning the candidate frame, finely adjusting coordinate values of the candidate frame, and extracting a feature vector corresponding to the candidate frame and outputting the feature vector;
specifically, first, a plurality of candidate frames corresponding to the same object-containing region are screened using a non-maximum suppression method, as shown in the following formula:
for each Anchor Box in S: keep it if IoU(M, Anchor Box) < thr, otherwise suppress it
where Anchor Box is a candidate box, M is the candidate box with the highest output probability value, S is the set of candidate boxes corresponding to the same object-containing region on the feature map, IoU is the intersection-over-union measure, and thr is the screening threshold for non-maximum suppression.
Region features are extracted from the resulting high-quality candidate boxes with RoI Align, the candidate boxes are matched to the ground-truth annotation boxes, the detection heads are trained with the smooth L1 loss function, and finally the coordinate corrections of the candidate boxes and the corresponding confidences are output.
Wherein, the back propagation formula of the Roi Align is shown as follows:
∂L/∂x_i = Σ_r Σ_j [ d(i, i*(r, j)) < 1 ] (1 - Δh)(1 - Δw) · ∂L/∂y_{rj}
where d(·) denotes the distance between pixel points, Δh and Δw denote the differences between the vertical and horizontal coordinates of x_i and i*(r, j), x_j denotes a point on the feature map before pooling, y_{rj} denotes the j-th point in the r-th region after pooling, and i*(r, j) denotes the source pixel of that point.
Then, feature vectors corresponding to the candidate boxes are extracted from the feature map.
And S4, carrying out regression and positioning quality evaluation on the feature vector obtained in the step S3 to obtain a final detection result, and outputting the detection frame coordinates and the corresponding confidence coefficient.

Claims (7)

1. An open world target detection method based on a decoupling cascade region generation network is characterized by comprising the following steps:
firstly, inputting an acquired RGB image into a resnet50 model, and extracting feature images with different scales;
firstly, preprocessing an RGB image, and inputting the RGB image into residual modules of different stages of a resnet50 model; obtaining a deep feature map x through a residual error module l+1
x_{l+1} = x_l + F(x_l, W_l)
where x_l denotes the shallow feature map and F(x_l, W_l) denotes the residual function, the sum being the output feature map;
then, extracting the deep feature maps x_{l+1} through the model backbone to obtain feature maps of different scales at the 4 stages;
step two, carrying out multi-scale information fusion on the feature images with different scales through a fpn module, and outputting a fused feature image with high resolution and strong semantics;
inputting the fusion feature map into an improved decoupling cascade region generation network, and generating coordinate correction amounts of candidate frames possibly containing unknown objects and probabilities of the candidate frames containing the objects;
the method comprises the following specific steps:
step 301, generating a series of candidate frames on an input fusion feature map by a decoupling cascade region generation network according to priori knowledge;
the decoupling cascade region generation network comprises two stages, and a coarse screen candidate frame is generated in a first stage; aligning the candidate frames to the region of interest by shifting in a second stage, resulting in more accurate candidate frames;
then, the model is converged more quickly by decoupling the positioning quality estimation and the parameter branch of the positioning frame regression;
each branch comprises a decoupling module, consists of 3*3 convolution, group Norm normalization and ReLU activation functions, and replaces 3*3 convolution with deformable convolution in the last layer of the decoupling module;
the decoupling module parameter branches are defined as follows:
G_i(x) = ReLU(GN(Conv(x))), i = 1, …, n
wherein Conv (x) represents a 3*3 convolution when i=1 to n-1, and the convolution is replaced with a deformable convolution when i=n;
finally, respectively spreading subtask gradients through decoupling convolution branches to obtain feature graphs of different receptive fields corresponding to different scales;
the formula is as follows:
DC_loc(x), DC_reg(x) = L2Norm(G_n(…(G_2(G_1(C(x))))))
where DC_loc(x) and DC_reg(x) denote the feature maps output by the decoupled cascade region generation network, L2Norm denotes L2 normalization, C(x) is the cascade module, and the feature maps of the different stages are iterated step by step through the C(x) module;
step 302, inputting the candidate frame output in the step 301 into a total loss function, and carrying out regression of the candidate frame by combining the overlapping degree of the positions and the shapes of the object and the real object in the candidate frame;
the calculation formula of the total loss function of the decoupling cascade area generation network is as follows:
L(θ_Rpn, θ_Fpn) = Σ_τ α_τ ( L_loc^τ + λ · L_reg^τ )
where L_loc^τ is the value of the localization-quality-assessment branch loss function and L_reg^τ is the value of the bounding-box regression branch loss function in stage τ; λ is a hyperparameter used to adjust the weight of the bounding-box regression branch loss within the total loss function; α_τ is the weight function of each stage, and θ_Rpn, θ_Fpn are the parameters of the decoupled cascade region generation network and of fpn;
step 303, on each area possibly containing an unknown object on the feature map, corresponding to a plurality of regression candidate frames, calculating the coordinate correction amount of each regression candidate frame and the probability of containing the object;
the probability calculation formula is:
p = sigmoid(f_loc(x)) = 1 / (1 + e^(-f_loc(x)))
where f_loc(x) is the output of the localization-quality estimation branch; the sigmoid activation maps it to a probability value in (0, 1), and a larger output value indicates a higher probability that the region contains a detected object;
step four, selecting candidate frames exceeding a probability threshold and lower than an overlapping degree threshold through non-maximum value suppression, executing Roi alignment, acquiring the position of an unknown object contained in a fusion feature map, aligning to the accurate candidate frames, finely adjusting the positions of the candidate frames, and extracting feature vectors corresponding to the candidate frames and outputting the feature vectors;
and fifthly, inputting the feature vectors into the full-connection layer respectively, carrying out positioning frame regression and positioning quality evaluation to obtain a final detection result, and outputting coordinates and corresponding confidence of the detection frame.
2. The open world target detection method based on decoupled cascade area generation network of claim 1, wherein in step one, the residual block comprises 1*1 convolution, 3*3 convolution, batch Norm normalization, and ReLU activation function;
the convolution formula is shown as follows:
y_{ij} = Σ_{m=1}^{k} Σ_{n=1}^{k} w_{m,n} · x_{i+m, j+n} + b
where x_{ij} is the input feature map, w denotes the convolution kernel weights, k is the convolution kernel size (the summation runs over the k×k kernel), b is the convolution bias term, and y_{ij} is the output feature map; H and W represent the height and width of the input feature map, respectively, and i and j represent the row and column coordinates of pixel points in the input feature map.
3. The open world target detection method based on the decoupling cascade area generation network according to claim 1, wherein the multi-scale information fusion in the second step comprises the following specific processes:
working top-down, the fpn module upsamples the deep feature map x_{l+1} and fuses it with the feature maps of different scales output by the bottom-up pathway; fusion is element-wise addition at corresponding positions, and upsampling uses nearest-neighbor interpolation.
4. The open world target detection method based on decoupling cascade area generation network as claimed in claim 1, wherein in the third step, the improved decoupling cascade area generation network has subtask parameter branches for decoupling, and gradients are propagated independently through parallel convolution decoupling branches; the improved decoupling cascade region generation network is enabled to more effectively aggregate object features of different categories by adding the latest convolution operator and normalization method; and the alignment of the feature images and the candidate frames is kept in a cascading mode, so that more accurate candidate frames are obtained.
5. The method for open world target detection based on decoupled cascade area generation network of claim 1, wherein in step three, the cascade design of the network comprises adaptive convolution and RELU activation functions, specifically as shown in the following formula;
C(x_{i+1}) = ReLU(AdaptiveConv(x_i)), i = 1, …, n
where AdaptiveConv(x) is the adaptive convolution and C(x_{i+1}) is the output of the cascade module; the first stage of the adaptive convolution is dilated (hole) convolution and the second stage is deformable convolution;
firstly, in a first stage, adapting to the transformation of a feature space through hole convolution to generate a candidate frame in a decoupling cascade region generation network;
the actual convolution kernel size K of the hole convolution is shown as follows:
K=k+(k-1)(r-1)
where k is the original convolution kernel size and r is the dilation rate of the dilated convolution; the effective kernel size is adjusted through the dilation rate;
in the second stage, the shape offset is set through deformable convolution, and the spatial position of sampling is changed according to the offset learned by different feature graphs, so that an irregular receptive field is formed;
where y(p) and x(p) denote the output feature map and the input feature map at position p, respectively, m_{ij} denotes the shape offset corresponding to position (i, j) in the convolution kernel, and α_k is a penalty term applied when the sampled spatial position exceeds the target region.
6. The open world target detection method based on the decoupled cascade region generation network of claim 1, wherein in step three, the loss function L_loc^τ is the smooth L1 loss between the predicted-box position p and the ground-truth-box position p*, with the positional relation between the candidate box and the ground-truth box judged according to a centerness index;
the loss function L_reg^τ is defined as:
L_reg^τ = 1 - IoU + ρ^2 / c^2
where ρ is the Euclidean distance between the center points of the candidate box and the ground-truth box, c is the diagonal length of the smallest enclosing region that can simultaneously contain the predicted box and the ground-truth box, and IoU is the intersection-over-union index used in target detection.
7. The open world target detection method based on the decoupling cascade area generation network according to claim 1, wherein the step four specifically comprises:
firstly, dividing a region possibly containing an object into a plurality of small blocks by utilizing RoiAlign, uniformly sampling in each small block to obtain sampling point coordinates, calculating sampling point pixels by utilizing a bilinear interpolation method, finally averaging pixel areas in the small blocks, and combining characteristic values of all the small blocks to obtain a characteristic representation of the whole region containing the object;
the method of calculating the pixels in each small block is as follows:
f(x, y) ≈ [ f(x_1, y_1)(x_2 - x)(y_2 - y) + f(x_2, y_1)(x - x_1)(y_2 - y) + f(x_1, y_2)(x_2 - x)(y - y_1) + f(x_2, y_2)(x - x_1)(y - y_1) ] / [ (x_2 - x_1)(y_2 - y_1) ]
where f(x, y) is the interpolated pixel value; f(x_1, y_1), f(x_1, y_2), f(x_2, y_1), f(x_2, y_2) are the pixel values of the four neighbouring pixels in the original image; (x_1, y_1), (x_1, y_2), (x_2, y_1), (x_2, y_2) are the upper-left, lower-left, upper-right and lower-right pixel points respectively; and x_1 ≤ x ≤ x_2, y_1 ≤ y ≤ y_2.
The pixel values of the sample points within the patch are averaged as shown in the following equation:
f(x_ave, y_ave) = (1/N) Σ_{i=1}^{N} f(x_i, y_i)
where f(x_ave, y_ave) is the averaged feature pixel value of the block, N is the number of points sampled within the block, and f(x_i, y_i) are the pixel values of the sampling points;
then, by mapping the entire object region of the feature representation to the exact candidate frame, the position of the candidate frame is fine-tuned, and the feature vector output is extracted.
CN202310722763.6A 2023-06-16 2023-06-16 Open world target detection method based on decoupling cascade region generation network Pending CN116805389A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202310722763.6A CN116805389A (en) 2023-06-16 2023-06-16 Open world target detection method based on decoupling cascade region generation network

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202310722763.6A CN116805389A (en) 2023-06-16 2023-06-16 Open world target detection method based on decoupling cascade region generation network

Publications (1)

Publication Number Publication Date
CN116805389A true CN116805389A (en) 2023-09-26

Family

ID=88080378

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202310722763.6A Pending CN116805389A (en) 2023-06-16 2023-06-16 Open world target detection method based on decoupling cascade region generation network

Country Status (1)

Country Link
CN (1) CN116805389A (en)

Cited By (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN118038451A (en) * 2024-04-11 2024-05-14 安徽农业大学 Open world fruit detection model construction method, detection method and electronic equipment


Similar Documents

Publication Publication Date Title
CN110335290B (en) Twin candidate region generation network target tracking method based on attention mechanism
CN110930454B (en) Six-degree-of-freedom pose estimation algorithm based on boundary box outer key point positioning
CN109064514B (en) Projection point coordinate regression-based six-degree-of-freedom pose estimation method
CN112184759A (en) Moving target detection and tracking method and system based on video
CN113888461A (en) Method, system and equipment for detecting defects of hardware parts based on deep learning
CN117253154B (en) Container weak and small serial number target detection and identification method based on deep learning
CN111209858A (en) Real-time license plate detection method based on deep convolutional neural network
CN112215079B (en) Global multistage target tracking method
CN116805389A (en) Open world target detection method based on decoupling cascade region generation network
CN109741358B (en) Superpixel segmentation method based on adaptive hypergraph learning
CN115147418A (en) Compression training method and device for defect detection model
CN115984223A (en) Image oil spill detection method based on PCANet and multi-classifier fusion
CN111274964A (en) Detection method for analyzing water surface pollutants based on visual saliency of unmanned aerial vehicle
CN113627481A (en) Multi-model combined unmanned aerial vehicle garbage classification method for smart gardens
JP3716455B2 (en) Region extraction method and region extraction device
JP3251840B2 (en) Image recognition device
CN117011655A (en) Adaptive region selection feature fusion based method, target tracking method and system
CN108765384B (en) Significance detection method for joint manifold sequencing and improved convex hull
CN116645608A (en) Remote sensing target detection based on Yolox-Tiny biased feature fusion network
CN116245843A (en) Vehicle paint defect detection and segmentation integrated method based on YOLOv5 frame
CN113223098B (en) Preprocessing optimization method for image color classification
CN109785331B (en) Sonar image segmentation method based on self-adaptive pixel value constraint and MRF
CN109934853B (en) Correlation filtering tracking method based on response image confidence region adaptive feature fusion
CN112528988A (en) License plate angle correction method
Sarmadian et al. Optimizing the snake model using honey-bee mating algorithm for road extraction from very high-resolution satellite images

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination