CN113392960A - Target detection network and method based on mixed hole convolution pyramid - Google Patents

Target detection network and method based on mixed hole convolution pyramid

Info

Publication number
CN113392960A
Authority
CN
China
Prior art keywords
feature
module
network
pyramid
level
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Granted
Application number
CN202110646653.7A
Other languages
Chinese (zh)
Other versions
CN113392960B (en)
Inventor
殷光强
殷康宁
候少麒
梁杰
丁一寅
曾宇昊
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
University of Electronic Science and Technology of China
Original Assignee
University of Electronic Science and Technology of China
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by University of Electronic Science and Technology of China filed Critical University of Electronic Science and Technology of China
Priority to CN202110646653.7A priority Critical patent/CN113392960B/en
Publication of CN113392960A publication Critical patent/CN113392960A/en
Application granted granted Critical
Publication of CN113392960B publication Critical patent/CN113392960B/en
Legal status: Active

Links

Images

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 Computing arrangements based on biological models
    • G06N3/02 Neural networks
    • G06N3/04 Architecture, e.g. interconnection topology
    • G06N3/045 Combinations of networks
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00 Pattern recognition
    • G06F18/20 Analysing
    • G06F18/25 Fusion techniques
    • G06F18/253 Fusion techniques of extracted features
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 Computing arrangements based on biological models
    • G06N3/02 Neural networks
    • G06N3/08 Learning methods

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Data Mining & Analysis (AREA)
  • Evolutionary Computation (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Artificial Intelligence (AREA)
  • General Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • General Health & Medical Sciences (AREA)
  • Software Systems (AREA)
  • Molecular Biology (AREA)
  • Computing Systems (AREA)
  • Biophysics (AREA)
  • Biomedical Technology (AREA)
  • Mathematical Physics (AREA)
  • Computational Linguistics (AREA)
  • Health & Medical Sciences (AREA)
  • Bioinformatics & Cheminformatics (AREA)
  • Bioinformatics & Computational Biology (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Evolutionary Biology (AREA)
  • Image Analysis (AREA)

Abstract

The present invention relates to the technical field of digital image processing, and in particular to a target detection network and method based on a mixed hole (dilated) convolution pyramid. The target detection network comprises a backbone network, a hybrid receptive field module, a low-level embedded feature pyramid module and a detection module. The backbone network uses a hierarchically cascaded network structure to extract target image features; the hybrid receptive field module performs feature enhancement on the highest-level feature map output at the top of the backbone network; the low-level embedded feature pyramid module, building on the feature pyramid, fuses high-level features downward and generates the final feature maps to be detected by means of low-level embedding; the detection module locates and classifies the feature maps to be detected and outputs the results. The target detection network and method effectively alleviate the missed and false detections caused by scale variation and occlusion.

Description

Target detection network and method based on mixed hole convolution pyramid
Technical Field
The invention relates to the technical field of digital image processing, and in particular to a target detection network and a target detection method based on a mixed hole (dilated) convolution pyramid.
Background
Object detection is one of the most widely used computer-vision tasks in real life; its goal is to locate and identify specific objects in an image. Traditional target detection methods can be divided into single-stage and two-stage methods. The core of the two-stage method is region proposal: the input image is selectively searched to generate region proposal boxes, a convolutional neural network then extracts features from each proposal box, and a classifier performs classification. The single-stage method directly outputs the detection result through a convolutional neural network.
Through a series of variants, the two families have converged on a common point: a large number of anchor boxes must be generated in advance during detection, and such algorithms are collectively called Anchor-based target detection algorithms. Anchor boxes are a set of rectangular boxes obtained by running a clustering algorithm on the training set before training; they represent the dominant width and height distribution of targets in the dataset. During inference, n candidate rectangular boxes are derived from the anchors on the feature map and are then further classified and regressed. As with the two-stage algorithm, processing of the candidate boxes still goes through two steps: coarse foreground classification and fine multi-class classification.
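For illustration only (not part of the patent), the sketch below shows how Anchor-based detectors commonly derive anchor width/height priors by clustering ground-truth box sizes before training, here with a YOLO-style k-means that uses 1 - IoU as the distance; the function names and the choice of k are hypothetical.

```python
# Illustrative sketch: clustering ground-truth (width, height) pairs into k anchor priors.
import numpy as np

def iou_wh(wh, centers):
    """IoU between boxes of size wh (N, 2) and cluster centers (K, 2),
    assuming all boxes share the same top-left corner."""
    inter = np.minimum(wh[:, None, 0], centers[None, :, 0]) * \
            np.minimum(wh[:, None, 1], centers[None, :, 1])
    union = wh[:, None, 0] * wh[:, None, 1] + \
            centers[None, :, 0] * centers[None, :, 1] - inter
    return inter / union

def kmeans_anchors(wh, k=9, iters=100, seed=0):
    rng = np.random.default_rng(seed)
    centers = wh[rng.choice(len(wh), k, replace=False)]
    for _ in range(iters):
        assign = np.argmax(iou_wh(wh, centers), axis=1)  # nearest = highest IoU
        new = np.array([wh[assign == i].mean(axis=0) if np.any(assign == i)
                        else centers[i] for i in range(k)])
        if np.allclose(new, centers):
            break
        centers = new
    return centers  # k anchor (width, height) priors
```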
The single-stage target detection algorithm lacks the fine processing of the two-stage algorithm and performs poorly when faced with problems such as multi-scale targets and occlusion. In addition, although Anchor-based algorithms alleviate to some extent the explosion in candidate-box computation caused by selective search, generating a large number of anchor boxes of different sizes in every grid cell still causes computational redundancy. Most importantly, anchor generation depends on many hyperparameter settings, and manual parameter tuning seriously affects the localization accuracy and classification performance.
In the prior art, the patent with publication number CN110222712A discloses "a multi-item target detection algorithm based on deep learning". The proposed algorithm obtains an augmented RoI set through a multi-scale sliding window and selective search; the dense RoI set is generated exhaustively with the multi-scale sliding window, which is computationally heavy and inefficient.
The patent with publication number CN112115883A discloses a "non-maximum suppression method and apparatus based on an Anchor-free target detection algorithm", which uses a CenterNet-style network model to perform detection by predicting the upper-left corner, lower-right corner and center point of an object, and applies non-maximum suppression to avoid multiple detection boxes on the same target. However, relatively complicated post-processing is required to group the corner-point pairs belonging to the same target, which is inefficient.
The patent with publication number CN112101153A discloses a "remote sensing target detection method based on a receptive field module and a multi-feature pyramid", which extracts features from visible-light remote sensing images with a VGG network to obtain feature maps of different sizes, cascades and fuses these feature maps, obtains optimized feature maps through a strided-convolution feature pyramid, and then performs multi-scale detection through receptive-field information mining. The method exploits feature maps of different sizes, but its fusion scheme is redundant and the backbone network is weak, which affects the final detection result.
Disclosure of Invention
To solve the above technical problems, the invention provides a target detection network and a target detection method based on a mixed hole (dilated) convolution pyramid, which can effectively alleviate the missed and false detections caused by scale variation and occlusion.
The invention is realized by adopting the following technical scheme:
a target detection network based on a hybrid void convolution pyramid is characterized in that: the system comprises a backbone network, a mixed reception field module, a low-level embedded characteristic pyramid module and a detection module; the backbone network extracts target picture features by using a layered cascade network structure; the mixed receptive field module is used for carrying out feature enhancement on the highest layer feature map output from the topmost end of the backbone network; the low-layer embedded feature pyramid module is used for fusing high-layer features downwards on the basis of a feature pyramid and generating a final feature graph to be detected in a low-layer embedding mode; the detection module is used for positioning and classifying the characteristic diagram to be detected and outputting a result.
The low-level embedded feature pyramid module generates the final feature maps to be detected through the following steps:
a. the low-level embedded feature pyramid module fuses the current-level feature map with the higher-level feature map after channel compression and upsampling, forming a composite feature map and completing the embedding of high-level semantic information;
b. the composite feature map is fused with the downsampled lower-level feature map to form a mixed feature map, completing the embedding of low-level detail information;
c. each mixed feature map passes through a composite convolution layer to generate the final feature map to be detected.
The fusion in steps a and b is element-wise, channel-wise addition.
The composite convolution layer in step c is formed by cascading a 3 × 3 convolution layer, a BN layer and a LeakyReLU activation layer.
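A minimal PyTorch sketch of such a composite convolution layer (3 × 3 convolution, BN, LeakyReLU) is shown below; the channel arguments and the choice of PyTorch are assumptions for illustration, not taken from the patent.

```python
import torch.nn as nn

class CompositeConv(nn.Module):
    """Composite convolution layer: 3x3 conv -> BatchNorm -> LeakyReLU."""
    def __init__(self, in_channels, out_channels):
        super().__init__()
        self.block = nn.Sequential(
            nn.Conv2d(in_channels, out_channels, kernel_size=3, padding=1, bias=False),
            nn.BatchNorm2d(out_channels),
            nn.LeakyReLU(inplace=True),
        )

    def forward(self, x):
        return self.block(x)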
The hybrid receptive field module comprises four parallel branches: one 1 × 1 convolution branch and three 3 × 3 convolution branches with dilation rates of 1, 2 and 4 respectively. The module concatenates the feature maps obtained in parallel by the dilated convolution layers with different dilation rates, then uses a 1 × 1 convolution layer to fuse the feature information and reduce the channel dimension to a specified number.
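The following is a sketch, under stated assumptions, of how the hybrid receptive field module could be wired in PyTorch: a 1 × 1 branch plus three 3 × 3 branches with dilation rates 1, 2 and 4, concatenation, and a 1 × 1 fusion convolution that reduces the channels to a specified number. The branch width `branch_channels` and the framework are illustrative choices.

```python
import torch
import torch.nn as nn

class HybridReceptiveField(nn.Module):
    def __init__(self, in_channels, out_channels, branch_channels=256):
        super().__init__()
        self.branch1 = nn.Conv2d(in_channels, branch_channels, kernel_size=1)
        self.branch2 = nn.Conv2d(in_channels, branch_channels, kernel_size=3,
                                 padding=1, dilation=1)
        self.branch3 = nn.Conv2d(in_channels, branch_channels, kernel_size=3,
                                 padding=2, dilation=2)
        self.branch4 = nn.Conv2d(in_channels, branch_channels, kernel_size=3,
                                 padding=4, dilation=4)
        # 1x1 fusion reduces the concatenated channels to the specified number
        self.fuse = nn.Conv2d(4 * branch_channels, out_channels, kernel_size=1)

    def forward(self, c5):
        feats = [self.branch1(c5), self.branch2(c5),
                 self.branch3(c5), self.branch4(c5)]
        return self.fuse(torch.cat(feats, dim=1))
```

With padding equal to each dilation rate, all four branches keep the spatial size of C5, so their outputs can be concatenated directly.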
The backbone network is a single-stage detection network based on Res2Net50; the Anchor-free mechanism of FCOS is introduced for target prediction, performing pixel-by-pixel prediction, and a Centerness branch network is added to the loss function part.
The feature maps output by the backbone network comprise C3, C4 and C5, with sizes of 100 × 100, 50 × 50 and 25 × 25 respectively.
A target detection method based on a mixed hole convolution pyramid comprises the following steps:
i. build a backbone network based on the Anchor-free mechanism and obtain feature maps C3, C4 and C5 through it; the highest-level feature map C5 output by the backbone network is enhanced by the hybrid receptive field module and then output to the low-level embedded feature pyramid module;
ii. the low-level embedded feature pyramid module, together with the feature maps C4 and C3 output by the backbone network, forms composite features through upsampling and downsampling operations; the composite features pass through a composite convolution layer to generate the feature maps to be detected, which are delivered to the detection module for target localization and classification;
iii. train the above network, test the model after each round, save the best model weights, test the real-time performance of the hybrid receptive field module and the low-level embedded feature pyramid module on the corresponding test set, and obtain the trained network model;
iv. detect targets using the trained network model and output the detection results.
In the process of training the network in step iii, the loss function is as follows:
L({p_{x,y}}, {t_{x,y}}) = (1/N) Σ_{x,y} L_cls(p_{x,y}, c*_{x,y}) + (1/N) Σ_{x,y} k · L_reg(t_{x,y}, t*_{x,y})
where p_{x,y} denotes the classification prediction probability, t_{x,y} denotes the regression prediction coordinates, c*_{x,y} and t*_{x,y} are the corresponding ground-truth class and box targets, and N denotes the number of positive samples; k is an indicator function that equals 1 if the current prediction is a positive sample and 0 otherwise;
L_cls is the Focal Loss function, whose specific expression is:
L_cls = -y (1 - y')^γ · log(y') - (1 - y) · y'^γ · log(1 - y')
where y is the sample label, y' is the predicted probability that the sample is positive, and γ is the focusing parameter;
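A minimal sketch of this Focal Loss (without the optional α balancing term), assuming binary labels y ∈ {0, 1}, predicted positive probabilities y', and a default γ of 2:

```python
import torch

def focal_loss(y_pred, y_true, gamma=2.0, eps=1e-7):
    # clamp probabilities away from 0/1 for numerical stability
    y_pred = y_pred.clamp(eps, 1.0 - eps)
    pos = -y_true * (1.0 - y_pred) ** gamma * torch.log(y_pred)
    neg = -(1.0 - y_true) * y_pred ** gamma * torch.log(1.0 - y_pred)
    return (pos + neg).mean()
```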
L_reg is the GIoU Loss function, calculated as follows:
IoU = |A ∩ B| / |A ∪ B|
GIoU = IoU - |C \ (A ∪ B)| / |C|
L_reg = 1 - GIoU
where A and B represent the predicted box and the ground-truth box, and IoU is their intersection over union. Their minimum convex set C, i.e. the smallest box enclosing both A and B, is computed first; GIoU is then obtained using C, and L_reg follows from it.
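A sketch of the GIoU loss computation described above, assuming axis-aligned boxes given as (x1, y1, x2, y2) tensors of shape (N, 4); the small eps term is added only for numerical safety:

```python
import torch

def giou_loss(pred, target, eps=1e-7):
    # intersection of A and B
    x1 = torch.max(pred[:, 0], target[:, 0])
    y1 = torch.max(pred[:, 1], target[:, 1])
    x2 = torch.min(pred[:, 2], target[:, 2])
    y2 = torch.min(pred[:, 3], target[:, 3])
    inter = (x2 - x1).clamp(min=0) * (y2 - y1).clamp(min=0)

    area_p = (pred[:, 2] - pred[:, 0]) * (pred[:, 3] - pred[:, 1])
    area_t = (target[:, 2] - target[:, 0]) * (target[:, 3] - target[:, 1])
    union = area_p + area_t - inter
    iou = inter / (union + eps)

    # smallest enclosing (convex) box C
    cx1 = torch.min(pred[:, 0], target[:, 0])
    cy1 = torch.min(pred[:, 1], target[:, 1])
    cx2 = torch.max(pred[:, 2], target[:, 2])
    cy2 = torch.max(pred[:, 3], target[:, 3])
    area_c = (cx2 - cx1) * (cy2 - cy1)

    giou = iou - (area_c - union) / (area_c + eps)
    return (1.0 - giou).mean()  # L_reg = 1 - GIoU
```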
Compared with the prior art, the invention has the beneficial effects that:
1. The invention improves the structure of the feature pyramid by providing a low-level embedded feature pyramid module, which effectively addresses the insufficient handling of multi-scale variation in target detection; it fuses shallow and high-level feature information, applies normalization and activation functions to the fused output, and optimizes model training.
The invention designs a hybrid receptive field module that, while keeping the model parameter count under control, enlarges the receptive field to capture more global feature detail by using multi-scale dilated convolutions in combination with the multi-scale outputs of the feature pyramid, thereby addressing target occlusion.
The method introduces an Anchor-free mechanism and combines the low-level embedded feature pyramid module with the hybrid receptive field module, reducing invalid computation caused by redundant candidate boxes, improving localization accuracy and effectively alleviating missed detections.
2. The target detection network of the invention can handle the multi-scale and occlusion problems of target detection scenes and can be used in a plug-and-play manner. By introducing an Anchor-free algorithm and combining the low-level embedded feature pyramid module with the hybrid receptive field module, it reduces invalid computation caused by redundant candidate boxes, improves localization accuracy, and overcomes the large parameter counts, heavy redundant computation, low applicability, low efficiency and frequent missed detections of existing target detection approaches under practical conditions.
3. The backbone network adopts the Anchor-free mechanism of FCOS for pixel-by-pixel prediction, performing target detection without relying on predefined anchor boxes or proposal regions; this reduces invalid computation caused by redundant candidate boxes, improves localization accuracy and effectively alleviates missed detections. The Centerness mechanism quickly filters negative samples, suppresses low-quality prediction boxes far from the target center, increases the weight of prediction boxes close to the target center, and improves detection performance. The introduced Res2Net50 network replaces the single 3 × 3 convolution layer used in ResNet50 with a hierarchically cascaded feature set inside each residual block, which is better balanced in terms of network width, depth and resolution.
4. Unlike other networks, which process features after fusing multiple levels (C3, C4 and C5), the hybrid receptive field module of the invention is applied before feature fusion: it is embedded between C5 of the backbone network and the feature pyramid level P5, improving the representational capability of the C5 features, and the final detection prediction is made only after the hybrid receptive field module and the low-level embedded feature pyramid module. Using convolution layers with different dilation rates improves the adaptability of the model to targets of different scales; after the feature maps are concatenated, a 1 × 1 convolution layer fuses the feature information and reduces the channel dimension to a specified number, improving the flexibility of the hybrid receptive field module.
5. Compared with the standard feature pyramid, the features output by the low-level embedded feature pyramid module of the invention contain not only rich semantic information but also concrete detail information, jointly improving both multi-scale detection performance and localization precision.
Drawings
The invention will be described in further detail below with reference to the accompanying drawings and the detailed description, in which:
FIG. 1 is a schematic diagram of the overall structure of a target detection network according to the present invention;
FIG. 2 is a schematic flow chart of a target detection method according to the present invention;
FIG. 3 is a schematic diagram of the hybrid receptive field module according to the present invention;
FIG. 4 is a schematic diagram of a low-level embedded feature pyramid module according to the present invention;
FIG. 5 is a schematic view of the composite convolution layer of the present invention.
Detailed Description
Example 1
As a basic embodiment, the invention comprises a target detection network based on a mixed hole convolution pyramid, which includes a backbone network, a hybrid receptive field module, a low-level embedded feature pyramid module and a detection module. The backbone network extracts target image features using a hierarchically cascaded network structure; the hybrid receptive field module performs feature enhancement on the highest-level feature map output at the top of the backbone network; the low-level embedded feature pyramid module, building on the feature pyramid, fuses high-level features downward and generates the final feature maps to be detected by means of low-level embedding; the detection module locates and classifies the feature maps to be detected and outputs the results.
The backbone network can be a single-stage detection network based on Res2Net50, which has stronger feature extraction capability without increasing the computational load; the Anchor-free mechanism of FCOS is introduced for target prediction to predict pixel by pixel, and a Centerness branch network is added to the loss function part to suppress low-quality detection boxes and improve detection performance.
A target detection method based on a mixed hole convolution pyramid comprises the following steps:
i. build a backbone network based on the Anchor-free mechanism and obtain feature maps C3, C4 and C5 through it; the highest-level feature map C5 output by the backbone network is enhanced by the hybrid receptive field module and then output to the low-level embedded feature pyramid module;
ii. the low-level embedded feature pyramid module, together with the feature maps C4 and C3 output by the backbone network, forms composite features through upsampling and downsampling operations; the composite features pass through a composite convolution layer to generate the feature maps to be detected, which are delivered to the detection module for target localization and classification;
iii. train the above network, test the model after each round, save the best model weights, test the real-time performance of the hybrid receptive field module and the low-level embedded feature pyramid module on the corresponding test set, and obtain the trained network model;
iv. detect targets using the trained network model and output the detection results.
Example 2
As a preferred embodiment, and with reference to FIG. 1, the invention comprises a target detection network based on a mixed hole convolution pyramid, which includes a backbone network, a hybrid receptive field module, a low-level embedded feature pyramid module and a detection module.
The backbone network adopts a single-stage detection structure and introduces the Anchor-free mechanism of FCOS (Fully Convolutional One-Stage object detection), performing pixel-by-pixel prediction without relying on predefined anchor boxes or proposal regions; this reduces invalid computation caused by redundant candidate boxes, improves localization accuracy and effectively alleviates missed detections. The Centerness mechanism quickly filters negative samples, suppresses low-quality prediction boxes far from the target center, increases the weight of prediction boxes close to the target center, and improves detection performance. The expression of Centerness is shown in formula (1), where l*, r*, t*, b* denote the distances from a pixel to the left, right, top and bottom sides of the prediction box; the Centerness value lies between 0 and 1, being larger the closer the pixel is to the true target center and smaller the farther it is from it.
centerness* = √( (min(l*, r*) / max(l*, r*)) · (min(t*, b*) / max(t*, b*)) )    (1)
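A one-line sketch of this Centerness target, assuming per-pixel distance tensors l*, r*, t*, b* of equal shape:

```python
import torch

def centerness_target(l, r, t, b):
    """Centerness as in FCOS, from distances to the four sides of the box."""
    lr = torch.min(l, r) / torch.max(l, r)
    tb = torch.min(t, b) / torch.max(t, b)
    return torch.sqrt(lr * tb)  # in (0, 1], largest at the box centre
```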
The backbone network introduces Res2Net50, which replaces the single 3 × 3 convolution layer used in ResNet50 with a hierarchically cascaded feature set inside each residual block and is better balanced in terms of network width, depth and resolution. The feature maps at C3, C4 and C5 have sizes of 100 × 100, 50 × 50 and 25 × 25 respectively.
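The following is an illustrative Res2Net-style bottleneck, a simplified sketch rather than the exact Res2Net50 implementation (batch normalization and downsampling are omitted): the single 3 × 3 convolution of a ResNet bottleneck is replaced by a hierarchical cascade over `scales` channel groups.

```python
import torch
import torch.nn as nn

class Res2NetBottleneck(nn.Module):
    def __init__(self, channels, scales=4):
        super().__init__()
        assert channels % scales == 0
        self.scales = scales
        width = channels // scales
        self.conv_in = nn.Conv2d(channels, channels, kernel_size=1, bias=False)
        # one 3x3 conv per group except the first, which is passed through
        self.convs = nn.ModuleList(
            nn.Conv2d(width, width, kernel_size=3, padding=1, bias=False)
            for _ in range(scales - 1)
        )
        self.conv_out = nn.Conv2d(channels, channels, kernel_size=1, bias=False)
        self.relu = nn.ReLU(inplace=True)

    def forward(self, x):
        out = self.conv_in(x)
        xs = torch.chunk(out, self.scales, dim=1)
        ys = [xs[0]]          # first group: identity
        prev = None
        for i, conv in enumerate(self.convs):
            # each later group also receives the previous group's output
            inp = xs[i + 1] if prev is None else xs[i + 1] + prev
            prev = self.relu(conv(inp))
            ys.append(prev)
        out = self.conv_out(torch.cat(ys, dim=1))
        return self.relu(out + x)  # residual connection
```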
The hybrid receptive field module concatenates the feature maps obtained in parallel by dilated convolution layers with different dilation rates, improving the network's ability to capture global features and compensating for the gridding effect caused by a single dilated convolution. The hybrid receptive field module of the present application uses dilated convolution layers throughout to effectively address the target occlusion problem.
Referring to FIG. 3, to fully exploit the hybrid receptive field module, the module of the present invention differs from other networks, which process features after fusing multiple levels (C3, C4, C5): it is applied before feature fusion, embedded between C5 of the backbone network and the feature pyramid level P5 to improve the representational capability of the C5 features, and the final detection prediction is made after the hybrid receptive field module and the low-level embedded feature pyramid module. The hybrid receptive field module consists of four parallel branches: one 1 × 1 convolution branch and three 3 × 3 convolution branches with dilation rates of 1, 2 and 4 respectively. The 3 × 3 dilated convolution with dilation rate 4 captures more global contextual detail, enhances reasoning capability and addresses target occlusion; using convolution layers with different dilation rates improves the adaptability of the model to targets of different scales.
The high-level features output by C5 carry rich semantic information. Unlike the conventionally adopted cascaded combination of features, the parallel feature combination adopted by the invention allows the network to learn parameters better suited to the current dataset. Parallel branch 1, a 1 × 1 convolution layer, preserves as much image detail as possible without changing the feature map size while controlling the number of feature channels, reducing subsequent computation. The 3 × 3 convolution kernels have few parameters, so feature information can be processed while further limiting network computation. Dilated convolution captures more global feature detail, enhances reasoning capability and helps recognize occluded targets; the arrangement of different dilation rates eliminates the gridding effect while improving the adaptability of the model to multi-scale targets. Parallel branch 2 is a 3 × 3 convolution with dilation rate 1, suited to small and medium targets; parallel branch 3 is a 3 × 3 convolution with dilation rate 2, suited to medium targets; and parallel branch 4 is a 3 × 3 convolution with dilation rate 4, suited to medium and large targets.
After the feature maps are concatenated, a 1 × 1 convolution layer fuses the feature information and reduces the channel dimension to a specified number, improving the flexibility of the hybrid receptive field module.
The feature pyramid gives the feature map at each level strong semantic information by fusing higher-level features downward, and prediction can be performed at each level. Compared with the standard feature pyramid, the features output by the low-level embedded feature pyramid module of the present application contain not only rich semantic information but also concrete detail information, jointly improving both multi-scale detection performance and localization accuracy.
Referring to FIG. 4, C5' is the feature map obtained after the low-level embedded feature pyramid module; referring to FIG. 5, the composite convolution layer (formed by cascading a 3 × 3 convolution layer, a BN layer and a LeakyReLU activation layer) processes the fused features, optimizes model training and improves the nonlinear expressiveness of the features.
The low-level embedded feature pyramid module first fuses the current-level feature map with the higher-level feature map, which has undergone channel compression and upsampling, by element-wise, channel-wise addition, forming a composite feature map and completing the embedding of high-level semantic information; it then fuses the composite feature map with the downsampled lower-level feature map to form a mixed feature map, completing the embedding of low-level detail information; finally, each mixed feature map passes through the designed composite convolution layer, generating the final feature map to be detected, which enters the next module.
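A sketch of this fusion for one pyramid level is given below, assuming nearest-neighbour upsampling, adaptive max pooling for downsampling, and 1 × 1 channel alignment of the higher and lower levels so that element-wise addition is possible; these resampling choices and the channel arguments are assumptions for illustration, not specified by the patent.

```python
import torch.nn as nn
import torch.nn.functional as F

class LowLevelEmbeddedFusion(nn.Module):
    def __init__(self, cur_channels, high_channels, low_channels, out_channels):
        super().__init__()
        # 1x1 convs align higher/lower level channels with the current level
        self.compress_high = nn.Conv2d(high_channels, cur_channels, kernel_size=1)
        self.compress_low = nn.Conv2d(low_channels, cur_channels, kernel_size=1)
        # composite convolution layer: 3x3 conv -> BN -> LeakyReLU
        self.composite = nn.Sequential(
            nn.Conv2d(cur_channels, out_channels, kernel_size=3, padding=1, bias=False),
            nn.BatchNorm2d(out_channels),
            nn.LeakyReLU(inplace=True),
        )

    def forward(self, cur, high, low):
        # step a: embed high-level semantics (compress, upsample, element-wise add)
        high = F.interpolate(self.compress_high(high), size=cur.shape[-2:], mode="nearest")
        composite = cur + high
        # step b: embed low-level details (compress, downsample, element-wise add)
        low = F.adaptive_max_pool2d(self.compress_low(low), output_size=cur.shape[-2:])
        mixed = composite + low
        # step c: composite convolution layer produces the map to be detected
        return self.composite(mixed)
```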
Referring to FIG. 1, a target detection method based on a mixed hole convolution pyramid comprises the following steps:
i. build a backbone network based on the Anchor-free mechanism and obtain feature maps C3, C4 and C5 through it; the highest-level feature map C5 output by the backbone network is enhanced by the hybrid receptive field module and then output to the low-level embedded feature pyramid module;
ii. the low-level embedded feature pyramid module, together with the feature maps C4 and C3 output by the backbone network, forms composite features through upsampling and downsampling operations; the composite features pass through a composite convolution layer to generate the feature maps to be detected, which are delivered to the detection module for target localization and classification;
iii. train the above network, test the model after each round, save the best model weights, test the real-time performance of the hybrid receptive field module and the low-level embedded feature pyramid module on the corresponding test set, and obtain the trained network model;
iv. detect targets using the trained network model and output the detection results.
Wherein, in the process of training the network, the loss function is as follows:
L({p_{x,y}}, {t_{x,y}}) = (1/N) Σ_{x,y} L_cls(p_{x,y}, c*_{x,y}) + (1/N) Σ_{x,y} k · L_reg(t_{x,y}, t*_{x,y})
where p_{x,y} denotes the classification prediction probability, t_{x,y} denotes the regression prediction coordinates, c*_{x,y} and t*_{x,y} are the corresponding ground-truth class and box targets, and N denotes the number of positive samples; k is an indicator function that equals 1 if the current prediction is a positive sample and 0 otherwise;
L_cls is the Focal Loss function, whose specific expression is:
L_cls = -y (1 - y')^γ · log(y') - (1 - y) · y'^γ · log(1 - y')
where y is the sample label, y' is the predicted probability that the sample is positive, and γ is the focusing parameter. Compared with the ordinary cross-entropy loss, Focal Loss adds the γ factor; by controlling the value of γ, the influence of easy samples is reduced and more attention is paid to hard samples.
L_reg is the GIoU Loss function, calculated as follows:
IoU = |A ∩ B| / |A ∪ B|
GIoU = IoU - |C \ (A ∪ B)| / |C|
L_reg = 1 - GIoU
where A and B represent the predicted box and the ground-truth box, and IoU is their intersection over union. Their minimum convex set C, i.e. the smallest box enclosing both A and B, is computed first; GIoU is then obtained using C, and L_reg follows from it.
In summary, after reading the present disclosure, those skilled in the art may make various other modifications, without creative effort, according to the technical solutions and concepts disclosed herein, and such modifications fall within the protection scope of the present disclosure.

Claims (9)

1. A target detection network based on a mixed hole convolution pyramid, characterized by comprising a backbone network, a hybrid receptive field module, a low-level embedded feature pyramid module and a detection module; the backbone network extracts target image features using a hierarchically cascaded network structure; the hybrid receptive field module is used to perform feature enhancement on the highest-level feature map output at the top of the backbone network; the low-level embedded feature pyramid module, building on the feature pyramid, is used to fuse high-level features downward and generate the final feature maps to be detected by means of low-level embedding; the detection module is used to locate and classify the feature maps to be detected and output the results.

2. The target detection network based on a mixed hole convolution pyramid according to claim 1, characterized in that the low-level embedded feature pyramid module generates the final feature maps to be detected through the following steps: a. the low-level embedded feature pyramid module fuses the current-level feature map with the higher-level feature map after channel compression and upsampling, forming a composite feature map and completing the embedding of high-level semantic information; b. the composite feature map is fused with the downsampled lower-level feature map to form a mixed feature map, completing the embedding of low-level detail information; c. each mixed feature map passes through a composite convolution layer to generate the final feature map to be detected.

3. The target detection network based on a mixed hole convolution pyramid according to claim 2, characterized in that the fusion in steps a and b is element-wise, channel-wise addition.

4. The target detection network based on a mixed hole convolution pyramid according to claim 2, characterized in that the composite convolution layer in step c is formed by cascading a 3 × 3 convolution layer, a BN layer and a LeakyReLU activation layer.

5. The target detection network based on a mixed hole convolution pyramid according to claim 1, characterized in that the hybrid receptive field module comprises four parallel branches: one 1 × 1 convolution branch and three 3 × 3 convolution branches with dilation rates of 1, 2 and 4 respectively; the hybrid receptive field module concatenates the feature maps obtained in parallel by the dilated convolution layers with different dilation rates, then uses a 1 × 1 convolution layer to fuse the feature information and reduce the channel dimension to a specified number.

6. The target detection network based on a mixed hole convolution pyramid according to claim 1, characterized in that the backbone network is a single-stage detection network based on Res2Net50; the Anchor-free mechanism of FCOS is introduced for target prediction, performing pixel-by-pixel prediction, and a Centerness branch network is added to the loss function part.

7. The target detection network based on a mixed hole convolution pyramid according to claim 6, characterized in that the feature maps output by the backbone network comprise C3, C4 and C5, with sizes of 100 × 100, 50 × 50 and 25 × 25.

8. A target detection method based on a mixed hole convolution pyramid, characterized by comprising the following steps: i. build a backbone network based on the Anchor-free mechanism and obtain feature maps C3, C4 and C5 through it; the highest-level feature map C5 output by the backbone network is enhanced by the hybrid receptive field module and then output to the low-level embedded feature pyramid module; ii. the low-level embedded feature pyramid module, together with the feature maps C4 and C3 output by the backbone network, forms composite features through upsampling and downsampling operations; the composite features pass through a composite convolution layer to generate the feature maps to be detected, which are delivered to the detection module for target localization and classification; iii. train the above network, test the model after each round, save the best model weights, test the real-time performance of the hybrid receptive field module and the low-level embedded feature pyramid module on the corresponding test set, and obtain the trained network model; iv. detect targets using the trained network model and output the detection results.

9. The target detection method based on a mixed hole convolution pyramid according to claim 8, characterized in that, in the process of training the network in step iii, the loss function is:

L({p_{x,y}}, {t_{x,y}}) = (1/N) Σ_{x,y} L_cls(p_{x,y}, c*_{x,y}) + (1/N) Σ_{x,y} k · L_reg(t_{x,y}, t*_{x,y})

where p_{x,y} denotes the classification prediction probability, t_{x,y} denotes the regression prediction coordinates, and N denotes the number of positive samples; k is an indicator function that equals 1 if the current prediction is a positive sample and 0 otherwise;

L_cls is the Focal Loss function, whose specific expression is:

L_cls = -y (1 - y')^γ · log(y') - (1 - y) · y'^γ · log(1 - y')

where y is the sample label, y' is the model's predicted probability that the sample is positive, and γ is the focusing parameter;

L_reg is the GIoU Loss function, calculated as follows:

IoU = |A ∩ B| / |A ∪ B|

GIoU = IoU - |C \ (A ∪ B)| / |C|

L_reg = 1 - GIoU

where A and B represent the predicted box and the ground-truth box, and IoU is their intersection over union; their minimum convex set C, i.e. the smallest box enclosing both A and B, is computed first, GIoU is then obtained using C, and L_reg follows from it.
CN202110646653.7A 2021-06-10 2021-06-10 Target detection network and method based on mixed hole convolution pyramid Active CN113392960B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202110646653.7A CN113392960B (en) 2021-06-10 2021-06-10 Target detection network and method based on mixed hole convolution pyramid

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202110646653.7A CN113392960B (en) 2021-06-10 2021-06-10 Target detection network and method based on mixed hole convolution pyramid

Publications (2)

Publication Number Publication Date
CN113392960A (en) 2021-09-14
CN113392960B (en) 2022-08-30

Family

ID=77620186

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202110646653.7A Active CN113392960B (en) 2021-06-10 2021-06-10 Target detection network and method based on mixed hole convolution pyramid

Country Status (1)

Country Link
CN (1) CN113392960B (en)

Cited By (14)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN113887455A (en) * 2021-10-11 2022-01-04 东北大学 Face mask detection system and method based on improved FCOS
CN113902896A (en) * 2021-09-24 2022-01-07 西安电子科技大学 Infrared target detection method based on enlarged receptive field
CN113947774A (en) * 2021-10-08 2022-01-18 东北大学 Lightweight vehicle target detection system
CN113963177A (en) * 2021-11-11 2022-01-21 电子科技大学 A CNN-based method for building mask contour vectorization
CN113989498A (en) * 2021-12-27 2022-01-28 北京文安智能技术股份有限公司 Training method of target detection model for multi-class garbage scene recognition
CN114170587A (en) * 2021-12-13 2022-03-11 微民保险代理有限公司 Vehicle indicator lamp identification method and device, computer equipment and storage medium
CN114283488A (en) * 2022-03-08 2022-04-05 北京万里红科技有限公司 Method for generating detection model and method for detecting eye state by using detection model
CN114339049A (en) * 2021-12-31 2022-04-12 深圳市商汤科技有限公司 A video processing method, apparatus, computer equipment and storage medium
CN114494108A (en) * 2021-11-15 2022-05-13 北京知见生命科技有限公司 A method and system for quality control of pathological slices based on target detection
CN114693939A (en) * 2022-03-16 2022-07-01 中南大学 A deep feature extraction method for transparent object detection in complex environment
CN115100516A (en) * 2022-06-07 2022-09-23 北京科技大学 A Relation Learning-Based Object Detection Method for Remote Sensing Images
CN115861855A (en) * 2022-12-15 2023-03-28 福建亿山能源管理有限公司 Operation and maintenance monitoring method and system for photovoltaic power station
CN115984105A (en) * 2022-12-07 2023-04-18 深圳大学 Method and device for optimizing hole convolution, computer equipment and storage medium
CN117132761A (en) * 2023-08-25 2023-11-28 京东方科技集团股份有限公司 Target detection method and device, storage medium and electronic equipment

Citations (11)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN108985269A (en) * 2018-08-16 2018-12-11 东南大学 Converged network driving environment sensor model based on convolution sum cavity convolutional coding structure
CN109543672A (en) * 2018-10-15 2019-03-29 天津大学 Object detecting method based on dense characteristic pyramid network
CN111260630A (en) * 2020-01-16 2020-06-09 高新兴科技集团股份有限公司 Improved lightweight small target detection method
CN112070729A (en) * 2020-08-26 2020-12-11 西安交通大学 Anchor-free remote sensing image target detection method and system based on scene enhancement
CN112183649A (en) * 2020-09-30 2021-01-05 佛山市南海区广工大数控装备协同创新研究院 An Algorithm for Predicting Pyramid Feature Maps
CN112365501A (en) * 2021-01-13 2021-02-12 南京理工大学 Weldment contour detection algorithm based on convolutional neural network
CN112419237A (en) * 2020-11-03 2021-02-26 中国计量大学 Automobile clutch master cylinder groove surface defect detection method based on deep learning
CN112446327A (en) * 2020-11-27 2021-03-05 中国地质大学(武汉) Remote sensing image target detection method based on non-anchor frame
CN112651351A (en) * 2020-12-29 2021-04-13 珠海大横琴科技发展有限公司 Data processing method and device
CN112801117A (en) * 2021-02-03 2021-05-14 四川中烟工业有限责任公司 Multi-channel receptive field guided characteristic pyramid small target detection network and detection method
CN112819748A (en) * 2020-12-16 2021-05-18 机科发展科技股份有限公司 Training method and device for strip steel surface defect recognition model

Patent Citations (11)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN108985269A (en) * 2018-08-16 2018-12-11 东南大学 Converged network driving environment sensor model based on convolution sum cavity convolutional coding structure
CN109543672A (en) * 2018-10-15 2019-03-29 天津大学 Object detecting method based on dense characteristic pyramid network
CN111260630A (en) * 2020-01-16 2020-06-09 高新兴科技集团股份有限公司 Improved lightweight small target detection method
CN112070729A (en) * 2020-08-26 2020-12-11 西安交通大学 Anchor-free remote sensing image target detection method and system based on scene enhancement
CN112183649A (en) * 2020-09-30 2021-01-05 佛山市南海区广工大数控装备协同创新研究院 An Algorithm for Predicting Pyramid Feature Maps
CN112419237A (en) * 2020-11-03 2021-02-26 中国计量大学 Automobile clutch master cylinder groove surface defect detection method based on deep learning
CN112446327A (en) * 2020-11-27 2021-03-05 中国地质大学(武汉) Remote sensing image target detection method based on non-anchor frame
CN112819748A (en) * 2020-12-16 2021-05-18 机科发展科技股份有限公司 Training method and device for strip steel surface defect recognition model
CN112651351A (en) * 2020-12-29 2021-04-13 珠海大横琴科技发展有限公司 Data processing method and device
CN112365501A (en) * 2021-01-13 2021-02-12 南京理工大学 Weldment contour detection algorithm based on convolutional neural network
CN112801117A (en) * 2021-02-03 2021-05-14 四川中烟工业有限责任公司 Multi-channel receptive field guided characteristic pyramid small target detection network and detection method

Non-Patent Citations (6)

* Cited by examiner, † Cited by third party
Title
GAO S et al.: "Res2Net: A new multi-scale backbone architecture", IEEE Transactions on Pattern Analysis and Machine Intelligence *
GUO C et al.: "AugFPN: Improving multi-scale feature learning for object detection", Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition *
MA J et al.: "Dual refinement feature pyramid networks for object detection", arXiv:2012.01733 *
TIAN Zhi et al.: "FCOS: Fully convolutional one-stage object detection", Proceedings of the IEEE/CVF International Conference on Computer Vision (ICCV) *
HOU Shaoqi et al.: "Object detection algorithm based on dilated convolution pyramid", Journal of University of Electronic Science and Technology of China *
JIANG Shihao et al.: "Instance segmentation based on Mask R-CNN and multi-feature fusion", Computer Technology and Development *

Cited By (19)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN113902896A (en) * 2021-09-24 2022-01-07 西安电子科技大学 Infrared target detection method based on enlarged receptive field
CN113947774A (en) * 2021-10-08 2022-01-18 东北大学 Lightweight vehicle target detection system
CN113947774B (en) * 2021-10-08 2024-05-14 东北大学 A lightweight vehicle target detection system
CN113887455A (en) * 2021-10-11 2022-01-04 东北大学 Face mask detection system and method based on improved FCOS
CN113887455B (en) * 2021-10-11 2024-05-28 东北大学 A face mask detection system and method based on improved FCOS
CN113963177A (en) * 2021-11-11 2022-01-21 电子科技大学 A CNN-based method for building mask contour vectorization
CN114494108A (en) * 2021-11-15 2022-05-13 北京知见生命科技有限公司 A method and system for quality control of pathological slices based on target detection
CN114170587A (en) * 2021-12-13 2022-03-11 微民保险代理有限公司 Vehicle indicator lamp identification method and device, computer equipment and storage medium
CN113989498A (en) * 2021-12-27 2022-01-28 北京文安智能技术股份有限公司 Training method of target detection model for multi-class garbage scene recognition
CN113989498B (en) * 2021-12-27 2022-07-12 北京文安智能技术股份有限公司 Training method of target detection model for multi-class garbage scene recognition
CN114339049A (en) * 2021-12-31 2022-04-12 深圳市商汤科技有限公司 A video processing method, apparatus, computer equipment and storage medium
CN114283488A (en) * 2022-03-08 2022-04-05 北京万里红科技有限公司 Method for generating detection model and method for detecting eye state by using detection model
CN114693939A (en) * 2022-03-16 2022-07-01 中南大学 A deep feature extraction method for transparent object detection in complex environment
CN114693939B (en) * 2022-03-16 2024-04-30 中南大学 Method for extracting depth features of transparent object detection under complex environment
CN115100516A (en) * 2022-06-07 2022-09-23 北京科技大学 A Relation Learning-Based Object Detection Method for Remote Sensing Images
CN115984105A (en) * 2022-12-07 2023-04-18 深圳大学 Method and device for optimizing hole convolution, computer equipment and storage medium
CN115861855A (en) * 2022-12-15 2023-03-28 福建亿山能源管理有限公司 Operation and maintenance monitoring method and system for photovoltaic power station
CN115861855B (en) * 2022-12-15 2023-10-24 福建亿山能源管理有限公司 Operation and maintenance monitoring method and system for photovoltaic power station
CN117132761A (en) * 2023-08-25 2023-11-28 京东方科技集团股份有限公司 Target detection method and device, storage medium and electronic equipment

Also Published As

Publication number Publication date
CN113392960B (en) 2022-08-30

Similar Documents

Publication Publication Date Title
CN113392960A (en) Target detection network and method based on mixed hole convolution pyramid
CN110263705B (en) Two phases of high-resolution remote sensing image change detection system for the field of remote sensing technology
CN110335270B (en) Power transmission line defect detection method based on hierarchical regional feature fusion learning
CN112906718B (en) A multi-target detection method based on convolutional neural network
CN117557922B (en) Improved YOLOv8 drone aerial target detection method
CN113052834A (en) Pipeline defect detection method based on convolution neural network multi-scale features
CN112528913A (en) Grit particulate matter particle size detection analytic system based on image
CN117079163A (en) Aerial image small target detection method based on improved YOLOX-S
CN112183649A (en) An Algorithm for Predicting Pyramid Feature Maps
CN117635628B (en) A land-sea segmentation method based on contextual attention and boundary perception guidance
CN117173120A (en) Chip weld void defect detection method and system
CN117095155A (en) Multi-scale nixie tube detection method based on improved YOLO self-adaptive attention-feature enhancement network
CN118469946A (en) Insulator defect detection method for multiple defect categories based on multi-angle feature enhancement
CN112700450A (en) Image segmentation method and system based on ensemble learning
CN116524319A (en) Night vehicle detection method and system based on improved YOLOv5 convolutional neural network
CN118429355A (en) Lightweight power distribution cabinet shell defect detection method based on feature enhancement
CN117853803A (en) Small sample motor car anomaly detection method and system based on feature enhancement and communication network
CN117670791A (en) Road disease detection method and device based on multi-scale fusion strategy and improved YOLOv5
CN117058386A (en) Asphalt road crack detection method based on improved deep Labv3+ network
CN116664535A (en) Transmission tower ground wire image detection method based on directional representation
CN117853397A (en) Image tampering detection and positioning method and system based on multi-level feature learning
CN116228626A (en) Surface defect detection method of magnetic inductance element based on improved YOLOv5
CN115761223A (en) An Instance Segmentation Method of Remote Sensing Image Using Data Synthesis
CN119152453B (en) An infrared highway foreign object detection method based on Mamba architecture
CN119152315B (en) ElectroTrackNet electric selection track recognition method and system

Legal Events

Date Code Title Description
PB01 Publication
PB01 Publication
SE01 Entry into force of request for substantive examination
SE01 Entry into force of request for substantive examination
GR01 Patent grant
GR01 Patent grant