CN116403081A - Multi-scale detection method for bidirectional self-adaptive feature fusion - Google Patents
Multi-scale detection method for bidirectional self-adaptive feature fusion
- Publication number
- CN116403081A CN202310359590.6A
- Authority
- CN
- China
- Prior art keywords
- feature
- fusion
- scale
- layers
- detection
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Pending
Links
- 238000001514 detection method Methods 0.000 title claims abstract description 156
- 230000004927 fusion Effects 0.000 title claims abstract description 97
- 230000002457 bidirectional effect Effects 0.000 title claims abstract description 22
- 230000001131 transforming effect Effects 0.000 claims abstract description 5
- 238000010586 diagram Methods 0.000 claims description 50
- 238000000034 method Methods 0.000 claims description 38
- 230000003044 adaptive effect Effects 0.000 claims description 12
- 238000005070 sampling Methods 0.000 claims description 12
- 238000012549 training Methods 0.000 claims description 10
- 230000008569 process Effects 0.000 claims description 9
- 238000010606 normalization Methods 0.000 claims description 6
- 230000009466 transformation Effects 0.000 claims description 5
- 238000007500 overflow downdraw method Methods 0.000 claims description 4
- 230000001629 suppression Effects 0.000 claims description 4
- 238000012216 screening Methods 0.000 claims description 2
- 230000000694 effects Effects 0.000 description 19
- 230000006870 function Effects 0.000 description 15
- 238000004364 calculation method Methods 0.000 description 12
- 230000006872 improvement Effects 0.000 description 12
- 238000002474 experimental method Methods 0.000 description 5
- 238000012360 testing method Methods 0.000 description 5
- 238000007499 fusion processing Methods 0.000 description 3
- 238000012544 monitoring process Methods 0.000 description 3
- 230000004913 activation Effects 0.000 description 2
- 230000009286 beneficial effect Effects 0.000 description 2
- 230000007547 defect Effects 0.000 description 2
- 238000009826 distribution Methods 0.000 description 2
- 238000005516 engineering process Methods 0.000 description 2
- 238000011156 evaluation Methods 0.000 description 2
- 239000011159 matrix material Substances 0.000 description 2
- 238000012986 modification Methods 0.000 description 2
- 230000004048 modification Effects 0.000 description 2
- 238000002360 preparation method Methods 0.000 description 2
- 238000012545 processing Methods 0.000 description 2
- 238000012795 verification Methods 0.000 description 2
- 238000002679 ablation Methods 0.000 description 1
- 238000004458 analytical method Methods 0.000 description 1
- 230000008901 benefit Effects 0.000 description 1
- 238000006243 chemical reaction Methods 0.000 description 1
- 230000000052 comparative effect Effects 0.000 description 1
- 230000001276 controlling effect Effects 0.000 description 1
- 238000013527 convolutional neural network Methods 0.000 description 1
- 238000013135 deep learning Methods 0.000 description 1
- 238000011161 development Methods 0.000 description 1
- 230000001094 effect on targets Effects 0.000 description 1
- 238000000605 extraction Methods 0.000 description 1
- 238000001914 filtration Methods 0.000 description 1
- 238000004519 manufacturing process Methods 0.000 description 1
- 238000013507 mapping Methods 0.000 description 1
- 230000007246 mechanism Effects 0.000 description 1
- 230000001105 regulatory effect Effects 0.000 description 1
Classifications
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06V—IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
- G06V10/00—Arrangements for image or video recognition or understanding
- G06V10/70—Arrangements for image or video recognition or understanding using pattern recognition or machine learning
- G06V10/77—Processing image or video features in feature spaces; using data integration or data reduction, e.g. principal component analysis [PCA] or independent component analysis [ICA] or self-organising maps [SOM]; Blind source separation
- G06V10/80—Fusion, i.e. combining data from various sources at the sensor level, preprocessing level, feature extraction level or classification level
- G06V10/806—Fusion, i.e. combining data from various sources at the sensor level, preprocessing level, feature extraction level or classification level of extracted features
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06V—IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
- G06V10/00—Arrangements for image or video recognition or understanding
- G06V10/70—Arrangements for image or video recognition or understanding using pattern recognition or machine learning
- G06V10/77—Processing image or video features in feature spaces; using data integration or data reduction, e.g. principal component analysis [PCA] or independent component analysis [ICA] or self-organising maps [SOM]; Blind source separation
- G06V10/774—Generating sets of training patterns; Bootstrap methods, e.g. bagging or boosting
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06V—IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
- G06V10/00—Arrangements for image or video recognition or understanding
- G06V10/70—Arrangements for image or video recognition or understanding using pattern recognition or machine learning
- G06V10/82—Arrangements for image or video recognition or understanding using pattern recognition or machine learning using neural networks
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06V—IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
- G06V2201/00—Indexing scheme relating to image or video recognition or understanding
- G06V2201/07—Target detection
Landscapes
- Engineering & Computer Science (AREA)
- Theoretical Computer Science (AREA)
- Evolutionary Computation (AREA)
- Computer Vision & Pattern Recognition (AREA)
- Artificial Intelligence (AREA)
- Databases & Information Systems (AREA)
- Computing Systems (AREA)
- General Health & Medical Sciences (AREA)
- Medical Informatics (AREA)
- Software Systems (AREA)
- Health & Medical Sciences (AREA)
- Physics & Mathematics (AREA)
- General Physics & Mathematics (AREA)
- Multimedia (AREA)
- Image Analysis (AREA)
Abstract
The invention relates to the technical field of target detection, and in particular to a multi-scale detection method with bidirectional self-adaptive feature fusion. The method takes the feature maps of different scales output by the C2, C3, C4 and C5 layers of a backbone network, plus a feature map obtained by convolutional downsampling after the C5 layer; each feature map is passed through a 1×1 convolution kernel to unify the channel number to 256, yielding the input feature maps; bidirectional feature fusion is performed on the input feature maps to obtain feature maps of different layers; each group of three adjacent layers among these feature maps is transformed to the same scale; the three adjacent layers are weighted and fused to output feature maps at three scales (large, medium and small); and multi-scale targets are detected on the feature maps of different scales. With this detection method, feature maps of different scales can be fused from two directions, the information among feature maps of different scales is balanced, and the mutual interference of multi-scale features is effectively alleviated.
Description
Technical Field
The invention relates to the technical field of target detection, in particular to a multi-scale detection method for bidirectional self-adaptive feature fusion.
Background
Targets of widely varying scales often appear in practical application scenes, and small targets in particular are difficult to detect, such as vehicles and pedestrians under long-distance monitoring, aerial images from high-altitude unmanned aerial vehicles, and remote sensing images. In these scenes, multi-scale target detection can help identify pedestrians at distant corners, avoiding traffic accidents and improving driving safety; it can identify possible tiny defects, ensuring production safety and improving economic benefit; and it can detect distant small targets, helping grasp the remote situation and facilitating security monitoring. Improving the accuracy of multi-scale target detection algorithms is therefore both necessary and urgent.
Traditional target detection algorithms mainly represent target features with manually designed feature vectors. They generate a large number of redundant, invalid candidate windows when producing candidate boxes, which increases the time complexity and computational burden of the algorithm and causes many false detections in the subsequent classification stage. Meanwhile, manually designed features are affected by the background environment, lighting changes and scale changes, and depend on the personal experience of their designers. Traditional target detection methods are therefore limited, cannot be used widely across common application scenes, and offer very little room for accuracy improvement.
With the development of technology, deep learning has gradually replaced traditional target detection algorithms. However, the semantics of small targets are lost in the deep layers of a convolutional neural network, so detecting and locating only on the deep feature map gives poor small-target detection; this observation inspired the idea of feature fusion. The prior art therefore proposed methods that use shallow feature-map information to fuse feature maps of different scales. A deep feature map has a large receptive field and is better suited to detecting large targets; a shallow feature map has a denser receptive field and can better capture the spatial semantic information of small targets. Analyzing the different feature fusion schemes, the traditional feature pyramid fuses deep features into shallow features from top to bottom and then detects on feature maps of different scales, which reveals three problems:
(1) The feature fusion is unbalanced. While deep features are fused into the shallow feature maps, the shallow feature maps are not fused back into the deep ones, and the spatial position information in the shallow maps is continuously diluted during downsampling. The deep feature maps therefore lack the spatial position information of the shallow maps, which harms localization when predicting at that scale.
(2) Mutual interference between features of different scales is not considered. Information about large targets also exists in the shallow feature maps, mixed together with that of small targets, so targets of different scales interfere with one another, which in particular makes small targets harder to detect.
(3) Targets of different scales are assigned to feature maps of different scales for detection, ignoring that the feature maps are interrelated: information beneficial to detection also exists in the feature maps of other layers, so useful information from other layers goes unused.
Disclosure of Invention
To solve the above technical problems, the invention provides a multi-scale detection method with bidirectional self-adaptive feature fusion, which fuses feature maps of different scales from two directions, balances the information among feature maps of different scales, and effectively alleviates the mutual interference of multi-scale features.
The invention is realized by adopting the following technical scheme:
A multi-scale detection method for bidirectional self-adaptive feature fusion, characterized by comprising the following steps:
taking the feature maps of different scales output by the C2, C3, C4 and C5 layers of a backbone network, plus the feature map obtained by convolutional downsampling after the C5 layer; passing each feature map through a 1×1 convolution kernel to unify the channel number to 256, obtaining the input feature maps {P2, P3, P4, P5, P6};
performing bidirectional feature fusion on the input feature maps {P2, P3, P4, P5, P6} to obtain feature maps of different layers;
transforming each group of three adjacent layers among the different-layer feature maps to the same scale;
weighting and fusing the three adjacent layers of feature maps to output feature maps at three scales (large, medium and small);
and detecting multi-scale targets on the feature maps of different scales.
The bidirectional feature fusion comprises top-to-bottom feature fusion, bottom-to-top feature re-fusion, and feature enhancement by lateral connection.
The top-to-bottom feature fusion specifically means: the input feature map of the upper layer is enlarged to the scale of the layer below by deconvolution upsampling and added element by element to generate a fusion feature map; this is repeated downwards until P2 has been fused, finally forming the fusion feature maps (denoted F2 to F6 below).
The bottom-to-top feature re-fusion specifically means: the re-fused feature map of the layer below is downsampled to the scale of the layer above and added element by element to the fusion feature map of that layer, yielding the re-fused feature maps (denoted N2 to N6 below).
The feature enhancement by lateral connection specifically means: providing a shortcut path from the input feature map to the re-fused feature map.
Transforming three adjacent layers of the different-layer feature maps to the same scale specifically means: scaling each group of three adjacent feature maps to the scale of its lowest layer with the same scale-transformation module.
The weighted fusion of the three adjacent layers of feature maps specifically means: the three adjacent feature maps are assigned different weights and then fused, the weighted fusion being:

y_ij^l = α_ij^l · x_ij^(l+2→l) + β_ij^l · x_ij^(l+1→l) + γ_ij^l · x_ij^(l→l)

where l is the layer number of the current feature map, x_ij denotes the input features of the different scales, and y_ij is the fused feature map; α, β and γ are weight parameters obtained by splicing the different input features, reducing the dimension, and normalizing with softmax. The features from two layers above are multiplied by α, the features from one layer above by β, and the features of the current layer by γ, and the three weighted maps are added element by element.
Detecting multi-scale targets on the feature maps of different scales specifically means: establishing a regression strategy based on the intersection-over-union (IoU) and a heatmap-based multi-scale detection model; in establishing the IoU-based regression strategy, the center point and the bounding box are trained jointly by introducing an IoU loss; the heatmap-based multi-scale detection model assigns targets of different scales to feature maps of different scales for detection.
The center point and the bounding box are trained simultaneously with an HIoU loss function of the form:

L_HIoU = 1 - IoU + ρ²(p, p_gt)/c² + αv,
v = (4/π²) · (arctan(w_gt/h_gt) - arctan(w/h))²

where ρ measures the distance between the predicted center point and the true center point using the Euclidean distance; c is the diagonal length of the minimum enclosing rectangle of the real and predicted bounding boxes; α is a balancing parameter; v measures aspect-ratio consistency through the slope of the box diagonal; w, h and w_gt, h_gt are the width and height of the predicted box and of the real box respectively; and arctan(w/h) computes the inclination angle of a bounding box diagonal.
Assigning targets of different scales to feature maps of different scales for detection specifically means:
controlling the detection range on each feature-map scale by introducing scaling parameters, so that targets of different scales are detected separately; from small to large the detection ranges are L1, L2 and L3, with scale intervals (0, 128²), (16², 326²) and (64², 512²);
and performing non-maximum suppression to screen out the detection box with the highest confidence, obtaining the final detection result.
Compared with the prior art, the invention has the beneficial effects that:
1. In the invention, the existing three-layer feature pyramid is replaced by a five-layer one, so that more feature information is mined. A smaller feature map is obtained by convolutional downsampling after the C5 layer; this newly added map has a larger receptive field and works better for detecting large-scale targets, giving the model a larger detection range when handling multi-scale targets. Furthermore, feature maps of different scales are fused from two directions, balancing the information among them; combined with weighted fusion, the relations among feature maps of different scales can be regulated so that the network learns the importance of each scale, suppresses interfering information, amplifies useful information, and finds the optimal fusion of the different scales. Finally, three adjacent layers of feature maps are used to predict targets of different scales; fusing adjacent layers with one another both exploits the correlated features between neighboring layers, strengthening the connection across layers, and avoids the interference caused by features of widely differing scales.
Target detection, i.e. detecting the category and position of targets in an image, is one of the most basic and crucial tasks in computer vision, with very wide application in security monitoring, medical image detection, unmanned driving and other fields.
2. In the bidirectional feature fusion, a bottom-to-top re-fusion path is added, fusing the rich spatial information of the shallow features into the deep feature maps, so that every layer contains feature information of different scales and a finer-grained feature pyramid is obtained. The lateral connection provides a shortcut from the input feature map to the re-fused feature map, so that more of each layer's own features are retained: on top of the fusion, the feature map of each scale pays more attention to its own information, the feature information is enhanced, and the ground is prepared for the later detection of multi-scale targets on feature maps of different scales.
3. In the bottom-to-top feature re-fusion, since the channel numbers were unified beforehand, all feature maps have 256 channels at this point and no convolution is needed to change the channels; a 3×3 convolution kernel with stride 2 is used for downsampling, and the new fused feature map is passed through another 3×3 convolution kernel to eliminate the aliasing effect and make the features more stable.
4. In the invention, the top-to-bottom and bottom-to-top paths are not fully symmetric: the uppermost and lowermost pyramid layers each need only one scale-transformation computation, i.e. the P6 layer needs only one upsampling and the P2 layer only one downsampling, and a node that takes part in only one fusion operation when the feature map is output contributes little to the fused features; omitting that single input node therefore reduces the computation of the model.
5. Feature maps of different layers are transformed to the same scale by the same scale-transformation module, which facilitates the subsequent multi-scale feature fusion. The three branches correspond to the large-, medium- and small-scale detection tasks respectively, and target scale is inversely related to feature-map scale: small targets are easier to detect on large feature maps and large targets on small feature maps. The feature-map scale must therefore lie in a suitable detection range, which is why the three adjacent layers are scaled to the scale of the lowest layer.
6. Considering the interrelation between the feature maps, the detection ranges on feature maps of different scales are controlled by introducing scaling parameters; from small to large they are L1, L2 and L3, with scale intervals (0, 128²), (16², 326²) and (64², 512²). Small targets are hard to detect and numerous, so a narrower detection interval allows more careful detection; large targets are easier to detect and fewer, so a wider interval detects them more efficiently. A target of any scale can be detected by at least two detectors, making full use of the useful information on adjacent-layer feature maps and thereby improving the detection rate for multi-scale targets.
7. Most existing detection algorithms are based on anchor-box mechanisms and have several drawbacks, including a large computation load and imbalance between positive and negative samples; these problems limit the improvement of small-target detection accuracy. Among anchor-free detection algorithms, the traditional heatmap-based CenterNet improves small-target detection very markedly, but its detection of large targets is not as good as that of the other compared algorithms. The invention assigns targets of different scales to feature maps of different scales for detection, partitioning the rough range of box regression, which improves the regression efficiency of the model; combined with multi-scale prediction this raises the detection precision and effectively overcomes the poor performance of single-scale detection.
8. Since the center point of the bounding box and the size of the bounding box are interrelated, introducing the IoU loss to jointly train the center point and the bounding box accelerates the convergence of the model.
9. The detection method was evaluated on the MS COCO data set: detection accuracy improves by 4.1% over the baseline model, and the detection accuracy for large, medium and small scale targets improves by 2.8%, 5.9% and 3.6% respectively, showing superior accuracy for multi-scale target detection.
Drawings
The invention will be described in further detail with reference to the drawings and detailed description, wherein:
FIG. 1 is a schematic diagram of the overall structure of the present invention;
FIG. 2 is a schematic illustration of the scale division in the present invention;
FIG. 3 is a graph showing a distribution of the number of targets detected at different scales in the present invention;
FIG. 4 is a graph showing the loss function comparison of HIoU and baseline models in accordance with the present invention.
Detailed Description
Example 1
As a preferred embodiment of the invention, referring to FIG. 1 of the specification, the multi-scale detection method with bidirectional self-adaptive feature fusion comprises a self-adaptive feature fusion network and a heatmap-based anchor-free multi-scale detection network. The self-adaptive feature fusion network performs feature fusion and outputs feature maps at three scales (large, medium and small); the heatmap-based anchor-free multi-scale detection network detects multi-scale targets on the feature maps of different scales.
The feature fusion and the output of the three scales of feature maps specifically comprise the following steps:
the characteristic diagrams of different scales of the outputs of the C2, C3, C4 and C5 layers of the backbone network and the C5 layer are adopted to obtain a smaller characteristic diagram after being subjected to convolution downsampling, and the newly added characteristic diagram has a larger receptive field and better effect on detecting a large-scale target, so that the model has a larger detection range when processing a multi-scale target. The feature maps are respectively subjected to a 1×1 convolution kernel, and the channel numbers are unified to 256, so that input feature maps { P2, P3, P4, P5, P6}. Wherein the backbone network may be ResNet101.
Bidirectional feature fusion is then performed on the input feature maps {P2, P3, P4, P5, P6} to obtain feature maps of different layers. The bidirectional feature fusion comprises top-to-bottom feature fusion, bottom-to-top feature re-fusion, and feature enhancement by lateral connection.
The top-to-bottom feature fusion specifically means: the input feature map of the upper layer is enlarged to the scale of the layer below by deconvolution upsampling, so that the two layers have the same channel number and resolution, and an element-by-element addition directly generates a fusion feature map; this is repeated downwards until P2 has been fused, finally forming the fusion feature maps F2 to F6.
The fusion relationship between layers can be expressed as:

F6 = P6
F_i = P_i + Up(F_{i+1}),  i = 5, 4, 3, 2

where P2 to P6 form the feature pyramid with unified channel number after the 1×1 convolutions; Down(·) is a downsampling operation, specifically a convolution kernel of size 3×3 with stride 2; Up(·) is an upsampling operation, specifically a deconvolution kernel of size 3×3 with stride 2 and padding 1; and F2 to F6 are the new fusion feature maps.
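A minimal sketch of this top-to-bottom pass follows, assuming the deconvolution settings quoted above (3×3 kernel, stride 2, padding 1); the output_padding=1 argument is an added assumption needed for the deconvolution to exactly double the resolution.

```python
import torch.nn as nn

class TopDown(nn.Module):
    """Top-to-bottom fusion: F6 = P6, F_i = P_i + Up(F_{i+1})."""
    def __init__(self, width=256, levels=5):
        super().__init__()
        self.up = nn.ModuleList(
            nn.ConvTranspose2d(width, width, 3, stride=2,
                               padding=1, output_padding=1)
            for _ in range(levels - 1))

    def forward(self, pyramid):               # [P2, P3, P4, P5, P6]
        fused = [None] * len(pyramid)
        fused[-1] = pyramid[-1]               # topmost layer passes through
        for i in range(len(pyramid) - 2, -1, -1):
            # upsample the fused map of the layer above, add element by element
            fused[i] = pyramid[i] + self.up[i](fused[i + 1])
        return fused                          # [F2, F3, F4, F5, F6]
```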
The bottom-to-top feature re-fusion specifically means: starting from the fusion feature maps F2 to F6 obtained in the previous step, the re-fused feature map of the layer below is downsampled to the scale of the layer above and added element by element to the fusion feature map of that layer, yielding the re-fused feature maps N2 to N6. Because the channel numbers were unified beforehand, all feature maps have 256 channels at this point and no convolution is needed to change the channels; in the fusion process a 3×3 convolution kernel with stride 2 performs the downsampling, and the new fused map is passed through another 3×3 convolution kernel to eliminate the aliasing effect and make the features more stable.
The process can be expressed as:

N2 = F2
N_i = Conv3×3( F_i + Down(N_{i-1}) ),  i = 3, 4, 5, 6

The downsampling uses a convolution kernel of size 3×3 with stride 2, halving the resolution of the feature map so that it matches the size of the feature map of the layer above. The top-to-bottom and bottom-to-top paths are not fully symmetric: the uppermost and lowermost pyramid layers each need only one scale-transformation computation, i.e. the P6 layer needs only one upsampling and the P2 layer only one downsampling, and a node involved in only one fusion operation when the feature map is output contributes little to the fused features, so that single input node is omitted, reducing the computation of the model.
The feature enhancement by lateral connection specifically means: providing a shortcut path from the input feature map to the re-fused feature map. Because the path from a layer's own feature map to the final fused features is very long, some features are diluted during fusion. The lateral connection therefore retains more of each layer's own features, so that on top of the fusion the feature map of each scale pays more attention to its own information; the feature information is enhanced, preparing for the later detection of multi-scale targets on feature maps of different scales.
The bottom-to-top path plus the lateral connection can be expressed as:

R_i = N_i + P_i,  i = 2, ..., 6

where R_i is the output re-fused feature map. The output of every layer of the network thus contains both the layer's original feature map and the fully fused new feature map: the top-to-bottom and bottom-to-top feature fusion produces feature maps rich in semantic and spatial information, while the lateral connection enhances each layer's own information, providing a finer-grained detection basis for multi-scale target detection.
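Tying the three stages together, a sketch of the full bidirectional module with the lateral shortcut; the function name is illustrative, and for brevity the omission of single-input nodes described above is left out.

```python
def bidirectional_fusion(pyramid, top_down, bottom_up):
    """R_i = N_i + P_i: re-fused maps plus the lateral shortcut
    from the input feature maps, as described above."""
    fused = top_down(pyramid)       # top-to-bottom pass -> F2..F6
    refused = bottom_up(fused)      # bottom-to-top pass -> N2..N6
    return [n + p for n, p in zip(refused, pyramid)]
```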
Three adjacent layers of the different-layer feature maps are then transformed to the same scale. Since feature maps of different layers have different resolutions, they must be transformed to the same scale before fusion; this is the scale transformation. The three branches correspond to the large-, medium- and small-scale detection tasks respectively, and target scale is inversely related to feature-map scale: small targets are easier to detect on large feature maps and large targets on small feature maps, so the feature-map scale must lie in a suitable detection range. The three adjacent layers are therefore scaled to the scale of the lowest of them, and the output feature maps have scales 32², 64² and 128², denoted Y4, Y3 and Y2 respectively. The Y4 layer detects large-scale targets, Y3 medium-scale targets, and Y2 small-scale targets.
The feature map one layer up is enlarged to twice its size by one upsampling, and the uppermost feature map to four times its size by two upsamplings. The upsampling uses deconvolution with a 4×4 kernel, padding 1 and stride 2. Each convolution is followed by batch normalization, which normalizes the data distribution over the channels, and an activation function.
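A sketch of this same-scale transformation, assuming each of the three branches receives layers (l, l+1, l+2) and aligns them to the resolution of layer l; the 4×4 deconvolution with stride 2 and padding 1 exactly doubles the resolution, so applying it twice gives the four-fold enlargement.

```python
import torch.nn as nn

def up2x(width=256):
    # deconv 4x4, stride 2, padding 1 doubles the resolution,
    # followed by batch normalization and an activation
    return nn.Sequential(
        nn.ConvTranspose2d(width, width, 4, stride=2, padding=1),
        nn.BatchNorm2d(width),
        nn.ReLU(inplace=True))

class ScaleAlign(nn.Module):
    """Brings layers l+1 (x2) and l+2 (x4) to the scale of layer l."""
    def __init__(self, width=256):
        super().__init__()
        self.up_once = up2x(width)
        self.up_twice = nn.Sequential(up2x(width), up2x(width))

    def forward(self, x_l, x_l1, x_l2):
        return x_l, self.up_once(x_l1), self.up_twice(x_l2)
```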
The feature maps with unified scale are then fused adaptively, i.e. the three adjacent feature maps are weighted and fused, outputting feature maps at the three scales (large, medium and small). In this embodiment, different weights are assigned to the feature maps so that the network learns the importance of each feature map, suppresses interfering information, amplifies useful information, and finds the optimal fusion of the different scales. The weighted fusion is:

y_ij^l = α_ij^l · x_ij^(l+2→l) + β_ij^l · x_ij^(l+1→l) + γ_ij^l · x_ij^(l→l)

where l is the layer number of the current feature map, x_ij denotes the input features of the different scales, and y_ij is the fused feature map; α, β and γ are learnable weight parameters, obtained by splicing the different input features, reducing the dimension, and normalizing with softmax. The features from two layers above are multiplied by α, the features from one layer above by β, and the features of the current layer by γ; the three weighted maps are added element by element.
Taking α as an example, it is computed as:

α_ij = exp(λ_α,ij) / ( exp(λ_α,ij) + exp(λ_β,ij) + exp(λ_γ,ij) )

β and γ are generated in the same way. The channel number of the three input feature maps was unified to 256 during the bidirectional feature fusion; each input feature x_ij has its channels reduced to 16 by a 1×1 convolution, the results are spliced into an H×W×48 matrix, and another 1×1 convolution reduces this to 3 channels corresponding to λ_α, λ_β and λ_γ; a softmax across the three channels then yields the α, β and γ parameters. Each parameter is in fact a matrix of size H×W, corresponding to the spatial information at every position of the picture. Multiplying a parameter with its feature map amounts to weighting every position on the picture and filtering the feature map by spatial position, thereby focusing on key regions of the feature map and ignoring non-key ones.
To verify how much the adaptive feature fusion network structure of this embodiment improves a target detection algorithm, the classical detector RetinaNet is chosen as the baseline model. The backbone is a ResNet-101 pre-trained on the ImageNet data set; the input image is scaled to 512×512 using long-side scaling with zero padding on the short side. After the last C5 layer of ResNet-101, a convolution layer of size 3×3 with stride 2 is added to downsample the last feature map once, as required by the subsequent neck network. The upsampling in the neck network uses deconvolution with a 3×3 kernel, stride 2 and padding 1. Each convolution is followed by batch normalization and a ReLU activation layer. The focal loss is used as the loss function, and finally Soft-NMS non-maximum suppression screens out the prediction box with the highest confidence.
The initial learning rate is set to 0.001; the Adam optimizer is selected with a decay factor of 0.9, and the learning rate is decreased by a factor of 0.01 every 100 epochs. The batch size is 32, for a total of 250 epochs. The MS COCO data set is used: 118,000 images for training (no validation set is needed) and 41,000 for testing. The commonly used average precision AP serves as the overall evaluation index; AP50 and AP75 are the average precision at IoU thresholds of 0.5 and 0.75 respectively, and APs, APm and APl measure the detection of small, medium and large targets.
To ensure the fairness and validity of the comparative experiment, the variables are strictly controlled: all experiments in this embodiment use the same ResNet-101 backbone. Different feature fusion schemes require different numbers of feature-map layers, so the selected input feature maps differ. FPN, PANet and this embodiment use the four feature maps output by ResNet-101 {C2, C3, C4, C5} plus a C6 map obtained by downsampling C5 with stride 2, while ASFF uses the three maps output by ResNet-101 {C3, C4, C5}. The neck networks used for comparison are implemented consistently with their original models; adding the different feature fusion networks on the same backbone is compared against this embodiment, with the experimental results shown in the table below.
Overall, the table shows that the AP of the method proposed in this embodiment reaches 39.1%, an improvement of 3.8% over the original FPN, and of 0.5% and 0.8% over the more advanced PANet and ASFF methods respectively. In terms of detection at different scales, and small-target detection in particular, the accuracy of the proposed method is 3.1% higher than the original FPN, a very marked improvement. Small-target accuracy improves by 1.3% over PANet and 0.1% over ASFF, so the margin over PANet and ASFF is not as large as over FPN. The reason is that this embodiment shares a similar bottom-to-top re-fusion path with PANet and ASFF; all three are clearly more accurate on small targets than FPN, fully verifying the benefit of the bottom-to-top re-fusion path for small-target detection. Meanwhile, because ASFF uses only three layers of feature maps and no lateral connections, its detection precision on large and medium targets is slightly inferior to PANet and to the method of this embodiment, which illustrates the importance of the multi-scale feature maps and lateral connections for detection performance. PANet in turn performs worse on small targets than ASFF and this embodiment, because PANet fuses the different feature maps by pixel-wise addition without considering the mutual interference of features of different scales, whereas this embodiment and ASFF fuse adaptively; this illustrates the effectiveness of the spatial-attention-based adaptive feature fusion module. Taken together, the method of this embodiment shows superior detection performance on multi-scale targets.
Example 2
As another preferred embodiment of the invention, referring to FIG. 1 of the specification, the multi-scale detection method with bidirectional self-adaptive feature fusion comprises a self-adaptive feature fusion network and a heatmap-based anchor-free multi-scale detection network. The self-adaptive feature fusion network performs feature fusion and outputs feature maps at three scales (large, medium and small); the heatmap-based anchor-free multi-scale detection network detects multi-scale targets on the feature maps of different scales.
Detecting multi-scale targets on the feature maps of different scales specifically means: establishing a regression strategy based on the intersection-over-union (IoU) and a heatmap-based multi-scale detection model. In establishing the IoU-based regression strategy, the center point and the bounding box are trained jointly by introducing an IoU loss.
Because the center point of the bounding box and the size of the bounding box are interrelated, the HIoU is used to train the localization quality of the bounding box, jointly training the center-point position and the box size, specifically:

L_HIoU = 1 - IoU + ρ²(p, p_gt)/c² + αv,
v = (4/π²) · (arctan(w_gt/h_gt) - arctan(w/h))²

where ρ measures the distance between the predicted center point and the true center point using the Euclidean distance; c is the diagonal length of the minimum enclosing rectangle of the real and predicted bounding boxes; α is a balancing parameter; v measures aspect-ratio consistency through the slope of the box diagonal; w, h and w_gt, h_gt are the width and height of the predicted box and of the real box respectively; and arctan(w/h) computes the inclination angle of a bounding box diagonal. When the aspect ratio of the predicted bounding box deviates too much from that of the real box, v provides a direction of improvement for the box regression, making it more accurate.
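Since the exact HIoU formula is reconstructed here from the stated ingredients (IoU term, normalized center distance ρ²/c², aspect-ratio term αv built from the arctan of the box slopes), the following is only a sketch of such a loss, assuming boxes in (x1, y1, x2, y2) form.

```python
import math
import torch

def hiou_loss(pred, target, eps=1e-7):
    """CIoU-style joint loss over the center point and box size, per the
    description above; pred and target are (..., 4) corner-form tensors."""
    # intersection / union
    lt = torch.max(pred[..., :2], target[..., :2])
    rb = torch.min(pred[..., 2:], target[..., 2:])
    wh = (rb - lt).clamp(min=0)
    inter = wh[..., 0] * wh[..., 1]
    area_p = (pred[..., 2] - pred[..., 0]) * (pred[..., 3] - pred[..., 1])
    area_t = (target[..., 2] - target[..., 0]) * (target[..., 3] - target[..., 1])
    iou = inter / (area_p + area_t - inter + eps)

    # squared center distance rho^2, normalized by the enclosing diagonal c^2
    c_p = (pred[..., :2] + pred[..., 2:]) / 2
    c_t = (target[..., :2] + target[..., 2:]) / 2
    rho2 = ((c_p - c_t) ** 2).sum(-1)
    enc_lt = torch.min(pred[..., :2], target[..., :2])
    enc_rb = torch.max(pred[..., 2:], target[..., 2:])
    c2 = ((enc_rb - enc_lt) ** 2).sum(-1) + eps

    # aspect-ratio term v from the arctan of the diagonal slopes
    w_p, h_p = pred[..., 2] - pred[..., 0], pred[..., 3] - pred[..., 1]
    w_t, h_t = target[..., 2] - target[..., 0], target[..., 3] - target[..., 1]
    v = (4 / math.pi ** 2) * (torch.atan(w_t / (h_t + eps))
                              - torch.atan(w_p / (h_p + eps))) ** 2
    alpha = v / (1 - iou + v + eps)   # balance weight (often detached in practice)
    return 1 - iou + rho2 / c2 + alpha * v
```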
The heatmap-based multi-scale detection model assigns targets of different scales to feature maps of different scales for detection, specifically:
The samples are partitioned according to scale. Considering the interrelation between the feature maps, the detection ranges on feature maps of different scales are controlled by introducing scaling parameters; from small to large they are L1, L2 and L3, with scale intervals (0, 128²), (16², 326²) and (64², 512²), giving a large-scale detection range of 448, a medium-scale range of 300 and a small-scale range of 128; the detection range of each layer is defined as shown in FIG. 2 of the specification. Small targets are hard to detect and numerous, so a narrower detection interval allows more careful detection; large targets are easier to detect and fewer, so a wider interval detects them more efficiently. A target of any scale can be detected by at least two detectors, making full use of the useful information on adjacent-layer feature maps and thereby improving the detection rate for multi-scale targets.
For the subsequent training, positive and negative samples are generated on every layer of feature maps, and the true bounding-box center points of the positive samples are mapped onto the feature maps of the different scales:

p = ( ⌊(x1 + x2) / 2 / 2^i⌋ , ⌊(y1 + y2) / 2 / 2^i⌋ )

where (x1, y1) and (x2, y2) are the upper-left and lower-right corner points of the box, p is the mapped coordinate of the center point on the feature map, and i is the layer number of the feature map (layer i having stride 2^i).
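A sketch of the sample assignment and center mapping, assuming box scale is measured by area and that the branches have strides 4/8/16 (consistent with 128²/64²/32² maps from a 512×512 input); the interval endpoints are the ones given above.

```python
import math

RANGES = [(0, 128 ** 2), (16 ** 2, 326 ** 2), (64 ** 2, 512 ** 2)]
STRIDES = [4, 8, 16]     # assumed strides of the Y2 / Y3 / Y4 branches

def assign_center(box):
    """Return (branch, mapped center) for every branch whose overlapping
    scale interval covers the ground-truth box."""
    x1, y1, x2, y2 = box
    area = (x2 - x1) * (y2 - y1)
    hits = []
    for i, (lo, hi) in enumerate(RANGES):
        if lo < area <= hi:
            s = STRIDES[i]
            cx, cy = (x1 + x2) / 2 / s, (y1 + y2) / 2 / s
            hits.append((i, (math.floor(cx), math.floor(cy))))
    return hits
```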
Because the detected scale ranges of the layers overlap, the same target may be detected in different branches, so non-maximum suppression is required when outputting results; the detection box with the highest confidence is screened out with a threshold of 0.5, giving the final detection result.
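A sketch of this cross-branch merging step, using torchvision's standard NMS as a stand-in for the suppression described above with the 0.5 threshold.

```python
import torch
from torchvision.ops import nms

def merge_branches(boxes_per_branch, scores_per_branch, iou_thresh=0.5):
    """Pool the detections of the three overlapping branches and keep only
    the highest-confidence box per object via non-maximum suppression."""
    boxes = torch.cat(boxes_per_branch)      # (N, 4) in (x1, y1, x2, y2)
    scores = torch.cat(scores_per_branch)    # (N,) confidences
    keep = nms(boxes, scores, iou_thresh)
    return boxes[keep], scores[keep]
```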
To verify how much the heatmap-based anchor-free multi-scale detection network structure of this embodiment improves a target detection algorithm, the experiments of this embodiment take the anchor-free CenterNet as the basis for improvement and replace its backbone with a ResNet-101 pre-trained on the ImageNet data set, as the baseline model. The input image is first scaled to 512×512 using long-side scaling with zero padding on the short side. The upsampling uses deconvolution with a 3×3 kernel, stride 2 and padding 1. Each convolution is followed by batch normalization and a ReLU activation layer.
The initial learning rate is set to 0.001; the Adam optimizer is selected with a decay factor of 0.9, and the learning rate is decreased by a factor of 0.01 every 100 epochs. The batch size is 32, for a total of 250 epochs. The MS COCO data set is used: 118,000 images for training (no validation set is needed) and 41,000 for testing. The commonly used average precision AP serves as the overall evaluation index; AP50 and AP75 are the average precision at IoU thresholds of 0.5 and 0.75 respectively, and APs, APm and APl measure the detection of small, medium and large targets.
For easier reference, the heatmap-based anchor-free multi-scale detection model proposed in this embodiment is named HMSNet (Heatmap-based Multi-Scale detection Network); the improved loss function proposed in the IoU-based regression strategy is named the HIoU (Heatmap-IoU) loss function, and the heatmap-based multi-scale detection module is named MS_head (Multi-Scale detection head). To verify the effectiveness of the individual parts of HMSNet as well as of the whole, an ablation experiment is performed against the baseline model. The baseline model is consistent with the original version except that the backbone is replaced by ResNet-101; following the original CenterNet, the last feature map is deconvolved and upsampled three times to 128×128, and the heatmap is generated on the enlarged feature map for regression and detection. HMSNet takes the three feature maps output by ResNet-101 {C3, C4, C5} as input. The experimental results are shown in the table below.
Overall, the CenterNet using the method of this embodiment is 3.4% more accurate than the baseline model, which uses single-scale detection; this illustrates the superiority of multi-scale detection over single-scale detection. Looking at the contribution of each module, every module helps the detection result: with MS_head, every index improves by roughly 2-3% over the baseline, whereas the contribution of HIoU is smaller, around 0.4%. HIoU takes part only in the training of the model, where it accelerates convergence; its effect in the test stage is smaller, but it judges the quality of detection boxes better and therefore still improves accuracy to some extent.
Across scales, the method of this embodiment improves the detection of targets of every scale, and the improvement on large and medium targets is especially clear: the detection of large targets is 2.7% better than the baseline model and that of medium targets improves by 3.8%, effectively alleviating CenterNet's low accuracy on large-target detection. Analyzing MS_head, the overlapping scale division assigns some targets to several feature maps simultaneously for detection, which raises the detection precision. To verify the effectiveness of multi-scale detection, the number of targets detected was counted on the feature maps of each scale, with the result shown in FIG. 3 of the specification: the detection branch contributing most differs across target scales, the targets each branch detects differ, and targets of the same scale are also picked up by different branches. In particular, large and medium targets are spread over the three feature maps of different scales, and the detectors for the large and medium scales achieve a very high detection rate, further improving the detection effect and verifying the necessity of detecting multi-scale targets separately, "divide and conquer".
To further verify the effect of HIoU within HMSNet, the training loss curves of the baseline model and of the method herein are compared, as shown in FIG. 4 of the specification. The loss of the baseline model oscillates more, while the loss of the method of this embodiment is smoother and converges faster. As the losses converge, the loss proposed in this embodiment approaches 0 more closely, verifying the effectiveness of the loss-function improvement and its contribution to the precision of the model.
Example 3
As the best mode of the invention, the multi-scale detection method with bidirectional self-adaptive feature fusion combines the self-adaptive feature fusion network of embodiment 1 with the heatmap-based anchor-free multi-scale detection network of embodiment 2, verifying the superiority of the algorithm from the overall network structure. A comparison experiment takes CenterNet as the baseline model; with ResNet-101 as the backbone, the ASFF-FPN network and the HMSNet network are added to CenterNet in turn, with the experimental results shown in the table below.
The table shows that applying the ASFF-FPN network of embodiment 1 to CenterNet improves accuracy by 4.1% over the baseline model, and the method combining ASFF-FPN and HMSNet reaches 40.5% accuracy, with multi-scale detection accuracies of 24.5%, 44.1% and 53.9% for small, medium and large targets, improvements of 3.6%, 5.9% and 2.8% respectively, greatly improving multi-scale detection performance. A longitudinal comparison of the anchor-box-based results, the anchor-free results and the results of embodiment 1 shows the anchor-free algorithm to be 3.4% more accurate, verifying that an anchor-free algorithm is necessary for improving multi-scale target detection. It should be noted that the baseline model of this embodiment does not use the Hourglass-104 originally adopted by CenterNet as backbone, so the overall accuracy is lower than the model proposed in the prior art, showing that the structure of the backbone network plays a vital role in feature extraction. Nevertheless, in terms of relative improvement, the adaptive multi-scale feature fusion network together with the heatmap-based anchor-free multi-scale detection algorithm markedly improves multi-scale detection performance, and the experimental results meet expectations.
After adopting the multi-scale detection method of this embodiment, targets of different scales are better distinguished and the resulting bounding boxes are more accurate; in particular, false detections and missed detections of large and small targets are reduced, improving the detection accuracy of multi-scale targets.
In view of the foregoing, after reading this specification, those skilled in the art can make various other modifications in accordance with the technical scheme and concepts of the invention without creative effort, and such modifications fall within the protection scope of the invention.
Claims (10)
1. A multi-scale detection method for bidirectional self-adaptive feature fusion, characterized by comprising the following steps: taking the feature maps of different scales output by the C2, C3, C4 and C5 layers of a backbone network, plus the feature map obtained by convolutional downsampling after the C5 layer; passing each feature map through a 1×1 convolution kernel to unify the channel number to 256, obtaining the input feature maps {P2, P3, P4, P5, P6};
performing bidirectional feature fusion on the input feature maps {P2, P3, P4, P5, P6} to obtain feature maps of different layers; transforming each group of three adjacent layers among the different-layer feature maps to the same scale;
weighting and fusing the three adjacent layers of feature maps to output feature maps at three scales (large, medium and small);
and detecting multi-scale targets on the feature maps of different scales.
2. The multi-scale detection method for bidirectional adaptive feature fusion according to claim 1, wherein: the bidirectional feature fusion comprises top-to-bottom feature fusion, bottom-to-top feature re-fusion, and feature enhancement by lateral connection.
3. The multi-scale detection method for bidirectional adaptive feature fusion according to claim 2, wherein the top-to-bottom feature fusion specifically means: the input feature map of the upper layer is enlarged to the scale of the layer below by deconvolution upsampling and added element by element to generate a fusion feature map; this is repeated downwards until P2 has been fused, finally forming the fusion feature maps.
4. The multi-scale detection method for bidirectional adaptive feature fusion according to claim 3, wherein the bottom-to-top feature re-fusion specifically means: the re-fused feature map of the layer below is downsampled to the scale of the layer above and added element by element to the fusion feature map of that layer, yielding the re-fused feature maps.
5. The multi-scale detection method for bidirectional adaptive feature fusion according to claim 4, wherein the feature enhancement by lateral connection specifically means: providing a shortcut path from the input feature map to the re-fused feature map.
6. The multi-scale detection method for bidirectional adaptive feature fusion according to claim 1 or 5, wherein transforming three adjacent layers of the different-layer feature maps to the same scale specifically means: scaling each group of three adjacent feature maps to the scale of its lowest layer with the same scale-transformation module.
7. The multi-scale detection method for bidirectional adaptive feature fusion according to claim 6, wherein the weighted fusion of the three adjacent layers of feature maps specifically means: the three adjacent feature maps are assigned different weights and then fused, the weighted fusion being:

y_ij^l = α_ij^l · x_ij^(l+2→l) + β_ij^l · x_ij^(l+1→l) + γ_ij^l · x_ij^(l→l)

where l is the layer number of the current feature map, x_ij denotes the input features of the different scales, and y_ij is the fused feature map; α, β and γ are weight parameters obtained by splicing the different input features, reducing the dimension, and normalizing with softmax; the features from two layers above are multiplied by α, the features from one layer above by β, and the features of the current layer by γ, and the three weighted maps are added element by element.
8. The multi-scale detection method for bidirectional adaptive feature fusion according to claim 1 or 7, wherein detecting multi-scale targets on the feature maps of different scales specifically refers to: establishing a regression strategy based on the intersection-over-union (IoU) and a heatmap-based multi-scale detection model; in establishing the IoU-based regression strategy, the center point and the bounding box are jointly trained by introducing an IoU loss; the heatmap-based multi-scale detection model assigns targets of different scales to feature maps of different scales for detection.
9. The multi-scale detection method for bidirectional adaptive feature fusion according to claim 8, wherein the center point and the bounding box are trained simultaneously using an HIoU loss function of the form:

$$L_{HIoU} = 1 - IoU + \frac{\rho^{2}\left(b, b^{gt}\right)}{c^{2}} + \alpha v, \quad v = \frac{4}{\pi^{2}}\left(\arctan\frac{w^{gt}}{h^{gt}} - \arctan\frac{w}{h}\right)^{2}, \quad \alpha = \frac{v}{(1 - IoU) + v}$$

wherein $\rho$ measures the distance between the predicted center point and the true center point, using the Euclidean distance; $c$ denotes the diagonal length of the minimum enclosing rectangle of the real bounding box and the predicted bounding box; $\alpha$ is the parameter used to balance the ratio; $v$ is the parameter measuring the aspect ratio through the slope; $w$, $h$ and $w^{gt}$, $h^{gt}$ denote the width and height of the predicted box and of the real box respectively; and $\arctan(w/h)$ is used to calculate the inclination angle of the bounding-box diagonal.
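A hedged PyTorch sketch of the HIoU loss as reconstructed above; boxes are (x1, y1, x2, y2) tensors, and this is an interpretation of the claim's terms rather than the patent's verbatim implementation.

```python
import math
import torch

def hiou_loss(pred, target, eps=1e-7):
    """pred, target: (N, 4) boxes in (x1, y1, x2, y2) form."""
    # intersection over union
    lt = torch.max(pred[:, :2], target[:, :2])
    rb = torch.min(pred[:, 2:], target[:, 2:])
    inter = (rb - lt).clamp(min=0).prod(dim=1)
    area_p = (pred[:, 2:] - pred[:, :2]).prod(dim=1)
    area_t = (target[:, 2:] - target[:, :2]).prod(dim=1)
    iou = inter / (area_p + area_t - inter + eps)
    # rho: Euclidean distance between predicted and true centre points
    rho2 = (((pred[:, :2] + pred[:, 2:])
             - (target[:, :2] + target[:, 2:])) ** 2).sum(dim=1) / 4
    # c: diagonal of the minimum enclosing rectangle of both boxes
    c2 = ((torch.max(pred[:, 2:], target[:, 2:])
           - torch.min(pred[:, :2], target[:, :2])) ** 2).sum(dim=1) + eps
    # v: aspect-ratio term from the diagonal inclination angles, arctan(w/h)
    w, h = pred[:, 2] - pred[:, 0], pred[:, 3] - pred[:, 1]
    wg, hg = target[:, 2] - target[:, 0], target[:, 3] - target[:, 1]
    v = (4 / math.pi ** 2) * (torch.atan(wg / (hg + eps))
                              - torch.atan(w / (h + eps))) ** 2
    alpha = v / ((1 - iou) + v + eps)       # balancing parameter
    return (1 - iou + rho2 / c2 + alpha * v).mean()
```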
10. The multi-scale detection method for bidirectional adaptive feature fusion according to claim 8, wherein assigning targets of different scales to feature maps of different scales for detection specifically comprises the following steps:
controlling the detection range on each scale of feature map by introducing scaling parameters, so that targets of different scales are detected separately; the detection ranges from small to large are L1, L2 and L3, with the scale intervals divided into (0, 128²), (16², 326²) and (64², 512²);
and performing non-maximum suppression to screen out the detection box with the highest confidence, obtaining the final detection result.
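A sketch of the scale assignment and post-processing of claim 10 (illustrative only); the overlapping intervals follow the claim as reconstructed above, while torchvision's NMS operator is an assumption for the suppression step.

```python
import torch
from torchvision.ops import nms

# scale intervals from the claim: a box is handled by every level whose
# interval contains its area; the intervals deliberately overlap
RANGES = {"L1": (0, 128 ** 2),
          "L2": (16 ** 2, 326 ** 2),
          "L3": (64 ** 2, 512 ** 2)}

def levels_for(box):
    """box: (x1, y1, x2, y2); returns the detection levels covering its area."""
    area = max(box[2] - box[0], 0.0) * max(box[3] - box[1], 0.0)
    return [name for name, (lo, hi) in RANGES.items() if lo < area <= hi]

def postprocess(boxes, scores, iou_thr=0.5):
    """Non-maximum suppression keeping the highest-confidence detections."""
    keep = nms(boxes, scores, iou_thr)
    return boxes[keep], scores[keep]
```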
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202310359590.6A CN116403081A (en) | 2023-04-06 | 2023-04-06 | Multi-scale detection method for bidirectional self-adaptive feature fusion |
Publications (1)
Publication Number | Publication Date |
---|---|
CN116403081A true CN116403081A (en) | 2023-07-07 |
Family
ID=87008576
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN202310359590.6A Pending CN116403081A (en) | 2023-04-06 | 2023-04-06 | Multi-scale detection method for bidirectional self-adaptive feature fusion |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN116403081A (en) |
Cited By (5)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN117132601A (en) * | 2023-10-27 | 2023-11-28 | 山东飞博赛斯光电科技有限公司 | Pipeline mode identification method and system based on distributed optical fiber sensing |
CN117132601B (en) * | 2023-10-27 | 2024-01-23 | 山东飞博赛斯光电科技有限公司 | Pipeline mode identification method and system based on distributed optical fiber sensing |
CN117523437A (en) * | 2023-10-30 | 2024-02-06 | 河南送变电建设有限公司 | Real-time risk identification method for substation near-electricity operation site |
CN117237830A (en) * | 2023-11-10 | 2023-12-15 | 湖南工程学院 | Unmanned aerial vehicle small target detection method based on dynamic self-adaptive channel attention |
CN117237830B (en) * | 2023-11-10 | 2024-02-20 | 湖南工程学院 | Unmanned aerial vehicle small target detection method based on dynamic self-adaptive channel attention |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
PB01 | Publication ||
SE01 | Entry into force of request for substantive examination ||