CN115631344A - Target detection method based on feature adaptive aggregation - Google Patents

Target detection method based on feature adaptive aggregation

Info

Publication number
CN115631344A
CN115631344A (application CN202211219905.9A; granted publication CN115631344B)
Authority
CN
China
Prior art keywords
network
feature
image
prediction
aggregation
Prior art date
Legal status
Granted
Application number
CN202211219905.9A
Other languages
Chinese (zh)
Other versions
CN115631344B (en)
Inventor
陈微
何玉麟
罗馨
李晨
姚泽欢
汤明鑫
Current Assignee
National University of Defense Technology
Original Assignee
National University of Defense Technology
Priority date
Filing date
Publication date
Application filed by National University of Defense Technology
Priority to CN202211219905.9A
Publication of CN115631344A
Application granted
Publication of CN115631344B
Legal status: Active
Anticipated expiration

Classifications

    • G06V 10/44: Local feature extraction by analysis of parts of the pattern, e.g. by detecting edges, contours, loops, corners, strokes or intersections; Connectivity analysis, e.g. of connected components
    • G06N 3/08: Computing arrangements based on biological models; Neural networks; Learning methods
    • G06V 10/764: Arrangements for image or video recognition or understanding using pattern recognition or machine learning using classification, e.g. of video objects
    • G06V 10/774: Processing image or video features in feature spaces; Generating sets of training patterns; Bootstrap methods, e.g. bagging or boosting
    • G06V 10/82: Arrangements for image or video recognition or understanding using pattern recognition or machine learning using neural networks
    • G06V 2201/07: Indexing scheme relating to image or video recognition or understanding; Target detection
    • Y02D 10/00: Energy efficient computing, e.g. low power processors, power management or thermal management

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Evolutionary Computation (AREA)
  • General Physics & Mathematics (AREA)
  • Multimedia (AREA)
  • Health & Medical Sciences (AREA)
  • Artificial Intelligence (AREA)
  • Computing Systems (AREA)
  • Software Systems (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • General Health & Medical Sciences (AREA)
  • Medical Informatics (AREA)
  • Databases & Information Systems (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Biomedical Technology (AREA)
  • Biophysics (AREA)
  • Computational Linguistics (AREA)
  • Data Mining & Analysis (AREA)
  • Molecular Biology (AREA)
  • General Engineering & Computer Science (AREA)
  • Mathematical Physics (AREA)
  • Image Analysis (AREA)

Abstract

The invention discloses a target detection method based on feature adaptive aggregation, aiming to solve the problem that the detection accuracy of existing real-time target detection methods needs improvement. The technical scheme is: construct a target detection system based on feature adaptive aggregation, composed of a main feature extraction module, a feature adaptive aggregation module, an auxiliary task module, a main task module and a post-processing module; prepare the data set required by the target detection system and optimize the training-set image data with data enhancement techniques; train the target detection system on the training set, with the auxiliary task module assisting network training; then validate the trained target detection system and select the best-performing model parameters to obtain the best-performing trained target detection system; finally, use the best-performing trained target detection system to perform target detection on a user input image and obtain the positions and categories of the targets. The invention achieves a large accuracy improvement at a small time cost.

Description

Target detection method based on feature adaptive aggregation
Technical Field
The invention relates to the field of image recognition and target detection, and in particular to a target detection method based on feature adaptive aggregation that improves target detection accuracy.
Background
Target detection is one of the important tasks of computer vision and has numerous applications such as intelligent security, intelligent robots and intelligent transportation. With the development of artificial intelligence and deep learning, the performance of target detection technology has improved remarkably. The performance of a target detection method is generally evaluated on two aspects, accuracy and real-time performance: the former reflects the detection accuracy of the method, the latter its processing speed. For tasks such as face detection, vehicle detection and pedestrian detection, real-time performance is also an important index for measuring the performance of a target detection method. In practical applications, detection of the input image needs to be completed within a short time; otherwise the delay is too high, the user experience suffers, and serious consequences such as traffic accidents may occur.
Existing real-time target detection methods generally fall into two broad categories: anchor-based methods and anchor-free methods. Anchor-based methods generate predefined prior boxes (anchors) covering the whole image and extract prior-box features to complete the classification and regression tasks. However, the anchor-based approach is weak in generalization because the predefined prior boxes require manually set hyper-parameters and different aspect ratios and sizes for different data sets; it is also more complex than the anchor-free approach and slightly weaker in real-time performance. Anchor-free methods need no predefined prior boxes and directly extract pixel-point features from the feature map to complete the classification and regression tasks. Anchor-free methods are superior in speed and generalization, but their accuracy is limited by point features with weak characterization capability.
The document "Zhou X, wang D. Objects as points [ J ]. ArXiv preprinting arXiv:1904.07850,2019." (CenterNet) describes an anchor-free real-time object detection method, which uses the idea of keypoint detection to generate a Gaussian kernel for each object, which is used for locating the position of the center point of the object, and then uses regression branches to predict the length and width of the object frame. The centret realizes a simple model structure and has high running speed, but long-time training is needed to ensure that the model converges. The document "Liu Z, zheng T, xu G, et al. Training-time-free network for real-time object detection [ C ]// Proceedings of the AAAI Conference on Artificial Intelligence insight.2020, 34 (07): 11685-11692." (TTFNet) sets a wider range of Gaussian kernels for the problem of long training time of CenterNet, and considers more pixel points as training samples, increasing the number of training samples and making the model easier to converge. The method does not only locate the center point of the object, but takes any point of the Gaussian kernel region of the object as a prediction base point, and then predicts the distances from the prediction base point to the prediction frame in the four directions of the upper direction, the lower direction, the left direction and the right direction by using the regression branch. Through the improvement, the training time is reduced, and the precision is improved.
Both kinds of anchor-free methods have the advantages of high speed and good generalization, but their accuracy is still lower than that of anchor-based methods, because they do not address the key accuracy-limiting problems of insufficient pixel-point feature capability and high coupling between the classification and regression branch features.
How to improve the feature characterization capability of a target detection method and thereby improve its accuracy is still a technical problem of great concern to those skilled in the art.
Disclosure of Invention
The invention aims to solve the technical problems that existing real-time target detection methods have insufficient feature characterization capability, high coupling between classification and regression branch features and low detection accuracy, and provides a target detection method based on feature adaptive aggregation. Without affecting real-time performance, the adaptive feature aggregation technique adds only a small amount of computation to alleviate the problems of insufficient feature characterization capability and high coupling between classification and regression branch features, improving target detection accuracy.
To solve the technical problem, the technical scheme of the invention is: construct a target detection system based on feature adaptive aggregation. The system is composed of a main feature extraction module, a feature adaptive aggregation module, an auxiliary task module, a main task module and a post-processing module. Prepare and construct the data set required by the target detection system, and divide it into a training set, a validation set and a test set. Apply random cropping, random flipping, random translation, random brightness, saturation and contrast changes, and standardization to the training-set images through data enhancement techniques to increase the diversity of the training data. Apply only size scaling and standardization to the validation set and the test set to keep the visual cues of the original images. Then train the main feature extraction module, the feature adaptive aggregation module, the auxiliary task module and the main task module of the target detection system with the training set. During training, the auxiliary task module assists network training in order to strengthen the target detection network's attention to object corner positions and improve localization accuracy. After each round of training, test the trained target detection system with the validation set, select the best-performing model parameters and assign them to the trainable modules of the target detection system (the main feature extraction module, the feature adaptive aggregation module and the main task module), obtaining the best-performing trained target detection system. Finally, use the best-performing trained target detection system to perform target detection on the image input by the user and obtain the positions and categories of the targets.
The technical scheme of the invention comprises the following steps:
firstly, constructing a target detection system based on feature adaptive aggregation. As shown in fig. 1, the target detection system is composed of a main feature extraction module, a feature adaptive aggregation module, an auxiliary task module, a main task module, and a post-processing module.
The main feature extraction module is connected with the feature adaptive aggregation module; it extracts multi-scale features from the input image and sends a multi-scale feature map containing the multi-scale features to the feature adaptive aggregation module. The main feature extraction module consists of a DarkNet-53 convolutional neural network (see "Redmon J, Farhadi A. YOLOv3: An Incremental Improvement. arXiv preprint arXiv:1804.02767, 2018.") and a feature pyramid network (see "Lin T Y, Dollár P, Girshick R, et al. Feature Pyramid Networks for Object Detection. Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2017: 2117-2125."). The DarkNet-53 convolutional neural network is a lightweight backbone network comprising 53 neural network layers, divided into 5 serial sub-networks, and extracts the backbone network features of the image. The feature pyramid network receives the backbone network features from the DarkNet-53 convolutional neural network, obtains a multi-scale feature map containing the multi-scale features through up-sampling, feature extraction and feature fusion operations, and sends the multi-scale feature map to the feature adaptive aggregation module.
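For illustration, the following PyTorch sketch shows one possible wiring of a 5-stage backbone feeding an FPN-style top-down fusion that outputs 3 multi-scale feature maps; the block depths, channel widths and activation choices are illustrative assumptions, not the exact DarkNet-53 configuration of the invention.

```python
# Minimal sketch of the main feature extraction module: a DarkNet-53-style backbone of
# 5 serial sub-networks followed by an FPN-style top-down fusion yielding 3 feature maps.
import torch
import torch.nn as nn
import torch.nn.functional as F

def conv_bn_act(c_in, c_out, k=3, s=1):
    return nn.Sequential(
        nn.Conv2d(c_in, c_out, k, s, k // 2, bias=False),
        nn.BatchNorm2d(c_out),
        nn.LeakyReLU(0.1, inplace=True),
    )

class BackboneFPN(nn.Module):
    def __init__(self, out_ch=64):
        super().__init__()
        # 5 serial sub-networks, each halving the spatial resolution (assumed widths).
        chs = [32, 64, 128, 256, 512]
        self.stages = nn.ModuleList()
        c_prev = 3
        for c in chs:
            self.stages.append(conv_bn_act(c_prev, c, k=3, s=2))
            c_prev = c
        # 1x1 lateral convolutions for the last three stages (strides 8, 16, 32).
        self.laterals = nn.ModuleList(nn.Conv2d(c, out_ch, 1) for c in chs[-3:])

    def forward(self, x):
        feats = []
        for stage in self.stages:
            x = stage(x)
            feats.append(x)
        c3, c4, c5 = feats[-3:]                      # strides 8, 16, 32
        p5 = self.laterals[2](c5)
        p4 = self.laterals[1](c4) + F.interpolate(p5, scale_factor=2, mode="nearest")
        p3 = self.laterals[0](c3) + F.interpolate(p4, scale_factor=2, mode="nearest")
        return [p3, p4, p5]                          # 3 multi-scale feature maps

imgs = torch.randn(2, 3, 512, 512)
for f in BackboneFPN()(imgs):
    print(f.shape)   # [2, 64, 64, 64], [2, 64, 32, 32], [2, 64, 16, 16]
```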
The feature adaptive aggregation module is connected with the main feature extraction module, the auxiliary task module and the main task module. Its function is to provide a multi-scale-aware high-pixel feature map for the auxiliary task module and a boundary-region-aware high-pixel feature map and a salient-region-aware high-pixel feature map for the main task module, improving the detection accuracy of the target detection system. The feature adaptive aggregation module is composed of an adaptive multi-scale feature aggregation network, an adaptive spatial feature aggregation network and a coarse-box prediction network. The adaptive multi-scale feature aggregation network is composed of 4 weight-unshared SE (Squeeze-and-Excitation) networks (called the first, second, third and fourth SE networks). It receives the multi-scale feature map from the feature pyramid network of the main feature extraction module, applies the adaptive multi-scale feature aggregation method (channel self-attention enhancement, bilinear interpolation up-sampling and scale-level soft-weight aggregation) to the multi-scale feature map to obtain the multi-scale-aware high-pixel feature map, and sends this map to the adaptive spatial feature aggregation network, the coarse-box prediction network and the auxiliary task module. The coarse-box prediction network is composed of two 3×3 convolution layers and one 1×1 convolution layer; it receives the multi-scale-aware high-pixel feature map from the adaptive multi-scale feature aggregation network, predicts on it to obtain the coarse-box predicted positions, and sends the coarse-box predicted positions to the adaptive spatial feature aggregation network. The adaptive spatial feature aggregation network is composed of two region-limited deformable convolutions with different offset transfer functions (a classification offset transfer function and a regression offset transfer function); it receives the multi-scale-aware high-pixel feature map from the adaptive multi-scale feature aggregation network and the coarse-box predicted positions from the coarse-box prediction network, generates the boundary-region-aware high-pixel feature map and the salient-region-aware high-pixel feature map, and sends them to the main task module, giving the main task module adaptive spatial perception capability and alleviating the problem that highly coupled input features degrade detection accuracy.
The auxiliary task module is connected with the adaptive multi-scale feature aggregation network in the feature adaptive aggregation module. The auxiliary task module is a corner prediction network composed of two 3×3 convolution layers, one 1×1 convolution layer and a sigmoid activation layer. It receives the multi-scale-aware high-pixel feature map from the adaptive multi-scale feature aggregation network and predicts on it to obtain a corner prediction heatmap, which is used to compute the corner prediction loss during training and to help the target detection system perceive corner regions. The auxiliary task module is used only during training of the target detection system, to strengthen the system's perception of object corner positions so that object box positions can be predicted more accurately. When the trained target detection system detects a user input image, this module is discarded and adds no extra computation.
The main task module is connected with the adaptive spatial feature aggregation network and the post-processing module, and consists of a fine-box prediction network and a center-point prediction network. The fine-box prediction network is a single 1×1 convolution layer; it receives the boundary-region-aware high-pixel feature map from the adaptive spatial feature aggregation network, applies the 1×1 convolution to it to obtain the fine-box predicted positions, and sends the fine-box predicted positions to the post-processing module. The center-point prediction network consists of a 1×1 convolution layer and a sigmoid activation layer; it receives the salient-region-aware high-pixel feature map from the adaptive spatial feature aggregation network, applies the 1×1 convolution and activation to it to obtain the center-point prediction heatmap, and sends the center-point prediction heatmap to the post-processing module.
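A minimal sketch of these two prediction heads is given below, assuming 64-channel input feature maps and 80 classes (MS COCO); the class names and tensor shapes are illustrative only.

```python
# Sketch of the main task module's heads: the fine-box network is a single 1x1 convolution
# producing 4 distance channels, and the center-point network is a 1x1 convolution plus
# sigmoid producing C class heatmap channels.
import torch
import torch.nn as nn

class MainTaskHeads(nn.Module):
    def __init__(self, in_ch=64, num_classes=80):
        super().__init__()
        self.fine_box = nn.Conv2d(in_ch, 4, kernel_size=1)               # distances t, d, l, r
        self.center = nn.Sequential(nn.Conv2d(in_ch, num_classes, 1), nn.Sigmoid())

    def forward(self, f_hr, f_hs):
        # f_hr: boundary-region-aware map, f_hs: salient-region-aware map (both H/4 x W/4)
        return self.fine_box(f_hr), self.center(f_hs)

f_hr = torch.randn(1, 64, 128, 128)
f_hs = torch.randn(1, 64, 128, 128)
boxes, heatmap = MainTaskHeads()(f_hr, f_hs)
print(boxes.shape, heatmap.shape)   # (1, 4, 128, 128) (1, 80, 128, 128)
```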
The post-processing module is a 3×3 pooling layer connected with the fine-box prediction network and the center-point prediction network in the main task module. It receives the fine-box predicted positions from the fine-box prediction network and the center-point prediction heatmap from the center-point prediction network, applies 3×3 max pooling with stride 1 to keep only the prediction maximum within each 3×3 neighborhood of the center-point prediction heatmap, and takes the positions of the kept maxima, i.e. the peak points, as the positions of the object center-region points. For each center-region point, the corresponding distances in the four directions up, down, left and right are looked up in the fine-box predicted positions to generate the predicted object box, and the center-point category at the center-region point position is the predicted object category. By extracting only the peak points within each 3×3 range, the post-processing module suppresses overlapping erroneous boxes and reduces false-positive prediction boxes.
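The peak-extraction step can be sketched as follows; the score threshold and the stride value of 4 used to map feature-map coordinates back to image coordinates are assumptions for illustration.

```python
# Sketch of the post-processing module: 3x3 max pooling with stride 1 keeps only local
# maxima ("peak points") of the center-point heatmap, then the fine-box distances at the
# surviving positions are turned into boxes.
import torch
import torch.nn.functional as F

def decode(center_heatmap, fine_box, score_thr=0.3):
    # center_heatmap: (C, H, W) after sigmoid; fine_box: (4, H, W) distances t, d, l, r
    pooled = F.max_pool2d(center_heatmap.unsqueeze(0), 3, stride=1, padding=1).squeeze(0)
    peaks = (pooled == center_heatmap) & (center_heatmap > score_thr)
    cls, ys, xs = peaks.nonzero(as_tuple=True)
    t, d, l, r = (fine_box[k, ys, xs] for k in range(4))
    # Boxes in feature-map coordinates; multiply by the stride (4) for image coordinates.
    boxes = torch.stack([xs - l, ys - t, xs + r, ys + d], dim=1) * 4
    scores = center_heatmap[cls, ys, xs]
    return boxes, cls, scores

heat = torch.rand(80, 128, 128)
dist = torch.rand(4, 128, 128) * 10
boxes, labels, scores = decode(heat, dist)
```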
Secondly, construct a training set, a validation set and a test set. The method comprises the following steps:
2.1 collecting target detection scene images as a target detection data set, and manually labeling each target detection scene image in the target detection data set, wherein the method comprises the following steps:
the general Scene data set published by MS COCO (see documents "Tsung-Yi Lin, michael Maire, large Belongie, james Hays, pietro Perona, deva Ramanan, piotr Dollar, and C Lawrence' S Zitnicknic. Microdoco: common objects in scenes in ECCV,2014." Tsung-Yi Lin, michael Maire et al, microsoft COCO: common objects in scenes) or the Cityscapes unmanned Scene data set (see documents "Cordts M, omran M, ramos S, ciet al. The MS COCO dataset has 80 classes, containing 105000 training images (train 2017) as training set, 5000 verification images (val 2017) as verification set, and 20000 test images (test-dev) as test set. The citrescaps dataset has 8 classes: pedestrians, riders, trolleys, trucks, buses, trains, motorcycles and bicycles, with 2975 training images as the training set, 500 validation images as the validation set, 1525 Zhang Ceshi images as the test set. Let the total number of images in the training set be S, let the total number of images in the test set be T, let the total number of images in the verification set be V, let S be 205000 or 2975, T be 20000 or 1524, and let V be 5000 or 500. Each image of the MS COCO and the citrescaps data sets is manually labeled, that is, each image is labeled with the position of an object in the form of a rectangular frame and is labeled with the category of the object.
2.2 Optimize the S images in the training set, including flipping, cropping, translation, brightness transformation, contrast transformation, saturation transformation, scaling and standardization, to obtain the optimized training set $D_t$ (a sketch of this augmentation pipeline is given after step 2.2.10). The method comprises the following steps:
2.2.1 Let variable s = 1; initialize the optimized training set $D_t$ to empty;
2.2.2 Flip the s-th image in the training set with a random flipping method to obtain the s-th flipped image; the random probability of the random flipping method is 0.5;
2.2.3 Randomly crop the s-th flipped image under a minimum intersection-over-union (IoU) constraint to obtain the s-th cropped image; the minimum IoU used is 0.3;
2.2.4 Apply random image translation to the s-th cropped image to obtain the s-th translated image;
2.2.5 Apply random brightness transformation to the s-th translated image to obtain the s-th brightness-transformed image; the random brightness uses a brightness delta of 32;
2.2.6 Apply random contrast transformation to the s-th brightness-transformed image to obtain the s-th contrast-transformed image; the contrast range of the random contrast is (0.5, 1.5);
2.2.7 Apply random saturation transformation to the s-th contrast-transformed image to obtain the s-th saturation-transformed image; the saturation range of the random saturation is (0.5, 1.5);
2.2.8 Scale the s-th saturation-transformed image to 512×512 to obtain the s-th scaled image;
2.2.9 Standardize the s-th scaled image with a standardization operation to obtain the s-th standard image, and put the s-th standard image into the optimized training set $D_t$.
2.2.10 If s ≤ S, let s = s + 1 and go to 2.2.2; if s > S, the optimized training set $D_t$ consisting of S standard images has been obtained; go to 2.3.
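For illustration, a minimal sketch of this augmentation pipeline is given below, assuming a PIL image input. The IoU-constrained crop and the translation are only indicated by comments because they must also update the label boxes, and the normalization statistics are assumed ImageNet values rather than values taken from the patent.

```python
# Sketch of the training-set augmentation of step 2.2 (image-only parts shown).
import random
from PIL import Image
import torchvision.transforms as T
import torchvision.transforms.functional as TF

def augment(img):
    # 2.2.2 random horizontal flip with probability 0.5 (label boxes must be flipped too)
    if random.random() < 0.5:
        img = TF.hflip(img)
    # 2.2.3 random crop with minimum IoU 0.3 and 2.2.4 random translation would go here;
    # both need the label boxes, so they are omitted from this image-only sketch.
    # 2.2.5-2.2.7 random brightness (delta 32, approximated as a factor of 32/255),
    # contrast and saturation in the range (0.5, 1.5)
    img = T.ColorJitter(brightness=32 / 255, contrast=(0.5, 1.5), saturation=(0.5, 1.5))(img)
    # 2.2.8 scale to 512x512
    img = TF.resize(img, [512, 512])
    # 2.2.9 standardization (assumed ImageNet mean/std)
    tensor = TF.to_tensor(img)
    return TF.normalize(tensor, mean=[0.485, 0.456, 0.406], std=[0.229, 0.224, 0.225])

# example usage (hypothetical file name): tensor = augment(Image.open("example.jpg"))
```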
2.3 Make the task ground-truth labels for model training according to the optimized training set $D_t$. The labels are divided into four tasks: the center-point prediction task, the corner prediction task, the coarse-box prediction task and the fine-box prediction task. The method comprises the following steps:
2.3.1 Let variable s = 1. Let the s-th image in the optimized training set have $N_s$ label boxes, and let the i-th label box be $B_{si}=\big(x_{si}^{tl},y_{si}^{tl},x_{si}^{br},y_{si}^{br}\big)$ with label category $c_i$, where $(x_{si}^{tl},y_{si}^{tl})$ are the coordinates of the top-left corner point of the i-th label box and $(x_{si}^{br},y_{si}^{br})$ are the coordinates of the bottom-right corner point; $N_s$ is a positive integer and $1 \le i \le N_s$.
2.3.2 Construct the center-point prediction ground truth $H^{gt}_{ctr}$ for the center-point prediction task (a sketch follows step 2.3.2.7). The method comprises the following steps:
2.3.2.1 Construct an all-zero matrix $H_{zeros}$ of size $\frac{H}{4}\times\frac{W}{4}\times C$, where C is the number of classification categories of the optimized training set, i.e. the number of labeled target categories of the target detection data set (for example, 80 categories for the MS COCO data set and 19 categories for the Cityscapes data set), H is the height of the s-th image and W is the width of the s-th image;
2.3.2.2 Let i = 1, denoting the i-th 4× down-sampled label box;
2.3.2.3 Divide the label coordinates of $B_{si}$ by 4 and record the result as the 4× down-sampled label box $B'_{si}=\big(x'^{tl}_{si},y'^{tl}_{si},x'^{br}_{si},y'^{br}_{si}\big)$, whose four corner points $(x'^{tl}_{si},y'^{tl}_{si})$, $(x'^{br}_{si},y'^{tl}_{si})$, $(x'^{tl}_{si},y'^{br}_{si})$ and $(x'^{br}_{si},y'^{br}_{si})$ are the top-left, top-right, bottom-left and bottom-right corner positions of $B'_{si}$.
2.3.2.4 Use the two-dimensional Gaussian kernel generation method: taking the center point of $B'_{si}$ as the base point of a two-dimensional Gaussian kernel with variance $(\sigma_x,\sigma_y)$, compute the Gaussian values of all pixel points within the two-dimensional Gaussian kernel range to obtain the first Gaussian value set $S_{ctr}$. The specific steps are:
2.3.2.4.1 Let the number of pixel points in the two-dimensional Gaussian kernel be $N_{pixel}$, $N_{pixel}$ a positive integer; initialize the first Gaussian value set $S_{ctr}$ to empty;
2.3.2.4.2 Let p = 1, indexing the pixel points in the two-dimensional Gaussian kernel, $1 \le p \le N_{pixel}$;
2.3.2.4.3 In the s-th image, the two-dimensional Gaussian value $K(x_p,y_p)$ of any pixel point $(x_p,y_p)$ within the Gaussian kernel range whose base point is $(x_0,y_0)$ is:

$$K(x_p,y_p)=\exp\!\left(-\frac{(x_p-x_0)^2}{2\sigma_x^2}-\frac{(y_p-y_0)^2}{2\sigma_y^2}\right) \tag{1}$$

where $(x_0,y_0)$ is the base point of the two-dimensional Gaussian kernel, i.e. the center of the two-dimensional Gaussian kernel (it may be the center point of $B'_{si}$ or a corner point of $B'_{si}$), $x_0$ is the coordinate of the base point in the width direction and $y_0$ its coordinate in the height direction; $(x_p,y_p)$ is a pixel point within the Gaussian kernel range of the base point $(x_0,y_0)$, $x_p$ is the coordinate of the pixel point in the width direction and $y_p$ its coordinate in the height direction; $(x_0,y_0)$ and $(x_p,y_p)$ are both located in the 4× down-sampled image coordinate system; $\sigma_x^2$ is the variance of the two-dimensional Gaussian kernel in the width direction and $\sigma_y^2$ its variance in the height direction, and the number of points within the Gaussian kernel range is controlled by controlling these two variances; w is the width of $B'_{si}$ at the feature-map scale, h is the height of $B'_{si}$ at the feature-map scale, and α is the parameter determining the ratio of the center region to $B'_{si}$, set to 0.54. Store $(x_p,y_p)$ and the computed $K(x_p,y_p)$ in the first Gaussian value set $S_{ctr}$;
2.3.2.4.4 Let p = p + 1; if p ≤ $N_{pixel}$, go to 2.3.2.4.3; if p > $N_{pixel}$, the coordinates and two-dimensional Gaussian values within the Gaussian kernel of $B'_{si}$ have been stored in $S_{ctr}$, which now contains $N_{pixel}$ pixel points and their corresponding two-dimensional Gaussian values; go to 2.3.2.5;
2.3.2.5 Assign the values of $S_{ctr}$ to $H_{zeros}$: for each element $(x_p,y_p)$ and $K(x_p,y_p)$ of $S_{ctr}$, assign according to the rule $H_{zeros}[x_p,y_p,c_i]=K(x_p,y_p)$, where $c_i$ is the class number of $B'_{si}$, $1 \le c_i \le C$, $c_i$ a positive integer;
2.3.2.6 Let i = i + 1; if i ≤ $N_s$, go to 2.3.2.3; if i > $N_s$, the two-dimensional Gaussian values generated by all $N_s$ 4× down-sampled label boxes of the s-th image have been assigned to $H_{zeros}$; go to 2.3.2.7;
2.3.2.7 The center-point prediction ground truth of the s-th image is $H^{gt}_{ctr}=H_{zeros}$.
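A sketch of how one center-point Gaussian could be written into the ground-truth heatmap is given below. The choice σ = α·side/6 and the element-wise maximum used to combine overlapping boxes are illustrative assumptions; the text above only states that the variance is controlled through the box width, height and the ratio parameter α = 0.54.

```python
# Sketch of the center-point ground-truth construction of step 2.3.2 for one label box.
import numpy as np

def draw_center_gaussian(heatmap, box, cls, alpha=0.54):
    # heatmap: (C, H/4, W/4); box: (x1, y1, x2, y2) already divided by 4
    x1, y1, x2, y2 = box
    w, h = x2 - x1, y2 - y1
    cx, cy = (x1 + x2) / 2.0, (y1 + y2) / 2.0
    sx, sy = max(alpha * w / 6.0, 1e-3), max(alpha * h / 6.0, 1e-3)   # assumed sigma rule
    ys, xs = np.mgrid[0:heatmap.shape[1], 0:heatmap.shape[2]]
    gauss = np.exp(-((xs - cx) ** 2) / (2 * sx ** 2) - ((ys - cy) ** 2) / (2 * sy ** 2))
    gauss[gauss < 1e-4] = 0.0                       # keep only the kernel's local support
    heatmap[cls] = np.maximum(heatmap[cls], gauss)  # max used here to combine overlapping boxes
    return heatmap

H_zeros = np.zeros((80, 128, 128), dtype=np.float32)
H_ctr = draw_center_gaussian(H_zeros, (10.0, 12.0, 40.0, 30.0), cls=3)
```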
2.3.3 Construct the corner prediction ground truth $H^{gt}_{corner}$ for the corner prediction task. The method comprises the following steps:
2.3.3.1 Construct an all-zero matrix $H^{corner}_{zeros}$ of size $\frac{H}{4}\times\frac{W}{4}\times 4$, where "4" is the number of corner points of the 4× down-sampled label box and also the number of channels of the matrix;
2.3.3.2 Let i = 1, denoting the i-th 4× down-sampled label box;
2.3.3.3 Let the base point of the two-dimensional Gaussian kernel be the top-left corner point of $B'_{si}$, with coordinates $(x'^{tl}_{si},y'^{tl}_{si})$. Using the two-dimensional Gaussian kernel generation method of 2.3.2.4, compute the Gaussian values of all pixel points within the two-dimensional Gaussian kernel whose base point is this corner point and whose variance is $(\sigma_x,\sigma_y)$, obtaining the second Gaussian value set $S_{tl}$;
2.3.3.4 Assign the element coordinates and Gaussian values in $S_{tl}$ to the 1st channel of $H^{corner}_{zeros}$, i.e. assign according to the rule $H^{corner}_{zeros}[x_p,y_p,1]=K(x_p,y_p)$;
2.3.3.5 Let the base point of the two-dimensional Gaussian kernel be the top-right corner point of $B'_{si}$, with coordinates $(x'^{br}_{si},y'^{tl}_{si})$. Using the two-dimensional Gaussian kernel generation method of 2.3.2.4, compute the Gaussian values of all pixel points within the two-dimensional Gaussian kernel whose base point is this corner point and whose variance is $(\sigma_x,\sigma_y)$, obtaining the third Gaussian value set $S_{tr}$;
2.3.3.6 Assign the element coordinates and Gaussian values in $S_{tr}$ to the 2nd channel of $H^{corner}_{zeros}$, i.e. assign according to the rule $H^{corner}_{zeros}[x_p,y_p,2]=K(x_p,y_p)$;
2.3.3.7 Let the base point of the two-dimensional Gaussian kernel be the bottom-left corner point of $B'_{si}$, with coordinates $(x'^{tl}_{si},y'^{br}_{si})$. Using the two-dimensional Gaussian kernel generation method of 2.3.2.4, compute the Gaussian values of all pixel points within the two-dimensional Gaussian kernel whose base point is this corner point and whose variance is $(\sigma_x,\sigma_y)$, obtaining the fourth Gaussian value set $S_{dl}$;
2.3.3.8 Assign the element coordinates and Gaussian values in $S_{dl}$ to the 3rd channel of $H^{corner}_{zeros}$, i.e. assign according to the rule $H^{corner}_{zeros}[x_p,y_p,3]=K(x_p,y_p)$;
2.3.3.9 Let the base point of the two-dimensional Gaussian kernel be the bottom-right corner point of $B'_{si}$, with coordinates $(x'^{br}_{si},y'^{br}_{si})$. Using the two-dimensional Gaussian kernel generation method of 2.3.2.4, compute the Gaussian values of all pixel points within the two-dimensional Gaussian kernel whose base point is this corner point and whose variance is $(\sigma_x,\sigma_y)$, obtaining the fifth Gaussian value set $S_{dr}$;
2.3.3.10 Assign the element coordinates and Gaussian values in $S_{dr}$ to the 4th channel of $H^{corner}_{zeros}$, i.e. assign according to the rule $H^{corner}_{zeros}[x_p,y_p,4]=K(x_p,y_p)$;
2.3.3.11 Let i = i + 1; if i ≤ $N_s$, go to 2.3.3.3; if i > $N_s$, the two-dimensional Gaussian values generated by all $N_s$ 4× down-sampled label boxes of the s-th image have been assigned to $H^{corner}_{zeros}$; go to 2.3.3.12;
2.3.3.12 The corner prediction ground truth of the s-th image is $H^{gt}_{corner}=H^{corner}_{zeros}$.
2.3.4 Construct the coarse-box ground truth $B^{gt}_{coarse}$ of the s-th image for the coarse-box prediction task from the $N_s$ 4× down-sampled label boxes of the s-th image. The method comprises the following steps:
2.3.4.1 Construct an all-zero matrix $B^{coarse}_{zeros}$ of size $\frac{H}{4}\times\frac{W}{4}\times 4$, where "4" represents the 4 coordinates of the 4× down-sampled label box;
2.3.4.2 Let i = 1, denoting the i-th 4× down-sampled label box;
2.3.4.3 Assign values to the pixels inside the i-th 4× down-sampled label box $B'_{si}$ in $B^{coarse}_{zeros}$, i.e. assign the coordinate values $\big(x'^{tl}_{si},y'^{tl}_{si},x'^{br}_{si},y'^{br}_{si}\big)$ of $B'_{si}$ to the 4 channels of those pixel positions;
2.3.4.4 Let i = i + 1; if i ≤ $N_s$, go to 2.3.4.3; if i > $N_s$, the coarse-box ground-truth values corresponding to all $N_s$ label boxes of the s-th image have been assigned to $B^{coarse}_{zeros}$, and the assigned $B^{coarse}_{zeros}$ is the ground-truth label of the s-th image; go to 2.3.4.5;
2.3.4.5 The coarse-box ground truth of the s-th image is $B^{gt}_{coarse}=B^{coarse}_{zeros}$.
2.3.5 Construct the ground truth $B^{gt}_{refine}$ of the fine-box prediction task from $B^{gt}_{coarse}$: the value of $B^{gt}_{refine}$ equals that of $B^{gt}_{coarse}$, i.e. $B^{gt}_{refine}=B^{gt}_{coarse}$.
2.3.6 Let s = s + 1; if s ≤ S, go to 2.3.2; if s > S, go to 2.3.7;
2.3.7 The task ground-truth labels of the S images for model training have been obtained; the task ground-truth labels and the S images together form the training set $D_M$ for model training.
2.4 Optimize the V images in the validation set with an image scaling and standardization method to obtain a new validation set $D_V$ consisting of the V scaled and standardized images. The method comprises the following steps:
2.4.1 Let variable v = 1;
2.4.2 Scale the v-th image in the validation set to 512×512 to obtain the scaled image of the v-th image;
2.4.3 Standardize the scaled image of the v-th image with a standardization operation to obtain the standardized image of the v-th image;
2.4.4 If v ≤ V, let v = v + 1 and go to 2.4.2; if v > V, the new validation set $D_V$ consisting of the V scaled and standardized images has been obtained; go to 2.5.
2.5 Optimize the T images in the test set with the image scaling and standardization method of step 2.4 to obtain a new test set $D_T$ consisting of the T scaled and standardized images.
Thirdly, train the target detection system constructed in the first step using gradient back-propagation to obtain $N_m$ sets of model parameters. The method comprises the following steps:
3.1 Initialize the network weight parameters of each module in the target detection system. The parameters of the DarkNet-53 convolutional neural network in the main feature extraction module are initialized with a pre-trained model trained on the ImageNet data set (https://www.image-net.org/); the other network weight parameters (the feature pyramid network in the main feature extraction module, the feature adaptive aggregation module, the auxiliary task module and the main task module) are initialized with a normal distribution with mean 0 and variance 0.01.
3.2 Set the training parameters of the target detection system. The initial learning rate learning_rate is set to 0.01 and the learning-rate decay factor to 0.1, i.e. the learning rate is reduced by a factor of 10 (decay is performed at training epochs 80 and 110). Stochastic gradient descent (SGD) is selected as the model training optimizer, with momentum 0.9 and weight decay 0.0004. The batch size (mini_batch_size) of network training is 64. The maximum number of training epochs (max_epoch) is 120.
3.3 Train the target detection system: at each training step, the differences between the coarse-box predicted positions, fine-box predicted positions, corner prediction heatmap and center-point prediction heatmap output by the target detection system and their ground-truth values are used as the loss value (loss), and the network weight parameters are updated by gradient back-propagation until the loss value reaches the threshold or the training epoch reaches max_epoch. During the last $N_m$ (typically 10) training epochs, the network weight parameters are saved once per epoch.
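For illustration, a sketch of the optimizer and schedule described in 3.2 and 3.3 is given below, assuming the detection system is available as a torch.nn.Module called `detector` and that `total_loss` (a placeholder name) combines the four task losses.

```python
# Sketch of the training schedule: SGD with momentum 0.9, weight decay 0.0004,
# initial lr 0.01 decayed by 10x at epochs 80 and 110, 120 epochs, batch size 64.
import torch

detector = torch.nn.Conv2d(3, 4, 1)          # stand-in for the full detection network
optimizer = torch.optim.SGD(detector.parameters(), lr=0.01,
                            momentum=0.9, weight_decay=0.0004)
scheduler = torch.optim.lr_scheduler.MultiStepLR(optimizer, milestones=[80, 110], gamma=0.1)

for epoch in range(1, 121):
    # for images, targets in train_loader:                  # batches of 64 images
    #     loss = total_loss(detector(images), targets)      # sum of the four task losses
    #     optimizer.zero_grad(); loss.backward(); optimizer.step()
    scheduler.step()
    if epoch > 110:                                          # keep the last N_m = 10 checkpoints
        torch.save(detector.state_dict(), f"epoch_{epoch}.pth")
```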
The method comprises the following steps:
3.3.1 Let the training epoch epoch = 1 (one pass over all data in the training set is one epoch), and initialize the batch index $N_b$ = 1;
3.3.2 The main feature extraction module reads the $N_b$-th batch, B = 64 images in total, from $D_M$, and the B images are recorded as the matrix $I_{train}$; $I_{train}$ contains B H×W×3 images, where H is the height of the input image, W is the width of the input image and "3" denotes the three RGB channels of the image.
3.3.3 The main feature extraction module extracts the image features of $I_{train}$ with the main feature extraction method to obtain the multi-scale features of $I_{train}$, and sends the multi-scale feature map containing the multi-scale features of $I_{train}$ to the feature adaptive aggregation module. The method comprises the following steps:
3.3.3.1 The DarkNet-53 convolutional neural network of the main feature extraction module extracts the image features of $I_{train}$ to obtain the backbone network feature map set. The method is: the 5 serial sub-networks of the DarkNet-53 convolutional neural network perform down-sampling and feature extraction on the B images of $I_{train}$ to obtain the backbone network features, i.e. 4 feature maps (the outputs of the last four serial sub-networks), which are sent to the feature pyramid network.
3.3.3.2 The feature pyramid network receives the 4 feature maps from the DarkNet-53 convolutional neural network, performs up-sampling, feature extraction and feature fusion on the 4 feature maps to obtain 3 multi-scale feature maps, denoted $\{F_1,F_2,F_3\}$, and sends the multi-scale feature maps $\{F_1,F_2,F_3\}$ to the feature adaptive aggregation module.
3.3.4 The feature adaptive aggregation module receives the multi-scale feature maps $\{F_1,F_2,F_3\}$ from the feature pyramid network, generates the multi-scale-aware high-pixel feature map $F_H$ and sends $F_H$ to the auxiliary task module; it also generates the boundary-region-aware high-pixel feature map and the salient-region-aware high-pixel feature map and sends them to the main task module. The method comprises the following steps:
3.3.4.1 The adaptive multi-scale feature aggregation network receives $\{F_1,F_2,F_3\}$ from the feature pyramid network and applies the adaptive multi-scale feature aggregation method: channel self-attention enhancement, bilinear interpolation up-sampling and scale-level soft-weight aggregation are performed on $\{F_1,F_2,F_3\}$ to obtain the multi-scale-aware high-pixel feature map $F_H$. The resolution of $F_H$ is $\frac{H}{4}\times\frac{W}{4}$ and the number of channels of $F_H$ is 64 (a sketch of this aggregation is given after step 3.3.4.1.4). The specific method is:
3.3.4.1.1 The adaptive multi-scale feature aggregation network applies the first, second and third SE networks to $\{F_1,F_2,F_3\}$ in parallel to perform channel self-attention enhancement: the first SE network applies a weighted summation over the channels of $F_1$ to obtain the first channel-enhanced feature map $\hat F_1$; simultaneously the second SE network applies a weighted summation over the channels of $F_2$ to obtain the second channel-enhanced feature map $\hat F_2$; simultaneously the third SE network applies a weighted summation over the channels of $F_3$ to obtain the third channel-enhanced feature map $\hat F_3$.
3.3.4.1.2 The first, second and third SE networks of the adaptive multi-scale feature aggregation network up-sample $\hat F_1$, $\hat F_2$ and $\hat F_3$ in parallel by bilinear interpolation to the same resolution $\frac{H}{4}\times\frac{W}{4}$, obtaining the up-sampled feature maps $U_1$, $U_2$ and $U_3$, which form the up-sampled feature map set $\{U_1,U_2,U_3\}$. The specific calculation process is shown in formula (2):

$$U_l=\mathrm{Upsample}\big(SE_n(F_l)\big),\qquad 1\le l\le 3,\ 1\le n\le 3 \tag{2}$$

where $SE_n$ denotes the n-th SE network, $F_l$ denotes the l-th multi-scale feature map and Upsample denotes bilinear interpolation up-sampling.
3.3.4.1.3 The adaptive multi-scale feature aggregation network computes the weights of $\{U_1,U_2,U_3\}$ with a 1×1 convolution that reduces the number of channels from 64 to 1, and then applies a Softmax operation over the scale dimension to obtain soft weight maps $\{A_1,A_2,A_3\}$ with the same resolution as $\{U_1,U_2,U_3\}$. The value of each pixel of the soft weight map indicates which of the 3 scales of $\{U_1,U_2,U_3\}$ should receive more attention, i.e. which scale is weighted more heavily, so that objects of different sizes respond to feature maps of different scales.
3.3.4.1.4 The adaptive multi-scale feature aggregation network multiplies the weight map $A_l$ of the l-th scale element-wise with the corresponding l-th up-sampled feature map $U_l$, i.e. $A_1$ is multiplied element-wise with $U_1$, $A_2$ with $U_2$ and $A_3$ with $U_3$, giving 3 products; the 3 products are then summed (weighted summation) and fused into one feature map, the fused feature map. A fourth SE network then enhances the channel representation of the fused feature map to obtain the multi-scale-aware high-pixel feature map $F_H$. The specific process is shown in formula (3):

$$F_H=SE_4\!\left(\sum_{l=1}^{3}A_l\times U_l\right),\qquad A_l=\underset{l}{\mathrm{Softmax}}\big(\mathrm{Conv}(U_l)\big) \tag{3}$$

where $SE_4$ is the fourth SE network, $A_l$ indicates the weight that the element at the same position occupies at the different scales, "×" denotes the element-wise product of corresponding positions, and Conv denotes the 1×1 convolution. The adaptive multi-scale feature aggregation network sends $F_H$ to the auxiliary task module, the coarse-box prediction network and the adaptive spatial feature aggregation network.
3.3.4.2 The coarse-box prediction network receives the multi-scale-aware high-pixel feature map $F_H$ from the adaptive multi-scale feature aggregation network, performs coarse-box position prediction at the position of each feature point of $F_H$ with the coarse-box prediction method to generate the coarse-box predicted positions $B_{coarse}$, and sends $B_{coarse}$ to the adaptive spatial feature aggregation network. $B_{coarse}$ also has the resolution $\frac{H}{4}\times\frac{W}{4}$ of $F_H$ and 4 channels; the 4 channels represent the distances from the pixel point to the box in the four directions up, down, left and right, so that each pixel point forms a coarse box. $B_{coarse}$ is used to limit the deformable-convolution sampling range in the adaptive spatial feature aggregation network. In addition, the loss $L_{coarse}$ between $B_{coarse}$ and the coarse-box ground truth $B^{gt}_{coarse}$ constructed in 2.3.4 is computed. The calculation of $L_{coarse}$ is based on the GIoU loss (see "Rezatofighi H, Tsoi N, Gwak J Y, et al. Generalized Intersection over Union: A Metric and A Loss for Bounding Box Regression. Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2019: 658-666."):

$$L_{coarse}=\frac{1}{N_b}\sum_{(i,j)\in S_b}W_{ij}\Big(1-\mathrm{GIoU}\big(B_{coarse}(i,j),\,B^{gt}_{coarse}(i,j)\big)\Big) \tag{4}$$

where $S_b$ is the regression sample set, consisting of the pixels where $B^{gt}_{coarse}$ is not 0; $N_b$ is the number of regression samples; $B_{coarse}(i,j)$ and $B^{gt}_{coarse}(i,j)$ denote the boxes formed by the predicted and ground-truth distances at pixel (i, j); and $W_{ij}$ is the weight value of position (i, j) where $B^{gt}_{coarse}$ is not 0, used to apply a larger loss weight to pixel points at center-region positions so that those pixel points regress the label-box position more accurately.
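A sketch of a weighted GIoU loss in the spirit of equation (4) is given below, using torchvision's generalized_box_iou_loss; taking the weights W directly from a center heatmap is an illustrative choice, since the text only states that W emphasizes center-region pixels.

```python
# Sketch of the weighted GIoU regression loss of equation (4): only pixels whose
# ground-truth box is non-zero contribute, scaled by a per-pixel weight W.
import torch
from torchvision.ops import generalized_box_iou_loss

def weighted_giou_loss(pred_dist, gt_dist, weight):
    # pred_dist, gt_dist: (4, H, W) distances t, d, l, r; weight: (H, W)
    mask = gt_dist.sum(dim=0) > 0                          # regression sample set S_b
    ys, xs = mask.nonzero(as_tuple=True)

    def to_boxes(dist):
        t, d, l, r = (dist[k, ys, xs] for k in range(4))
        return torch.stack([xs - l, ys - t, xs + r, ys + d], dim=1)

    loss = generalized_box_iou_loss(to_boxes(pred_dist), to_boxes(gt_dist), reduction="none")
    w = weight[ys, xs]
    return (loss * w).sum() / max(ys.numel(), 1)           # average over N_b samples

pred = torch.rand(4, 128, 128) * 8
gt = torch.zeros(4, 128, 128); gt[:, 40:60, 40:60] = 5.0
w = torch.zeros(128, 128); w[40:60, 40:60] = 1.0           # e.g. Gaussian center weights
print(weighted_giou_loss(pred, gt, w))
```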
3.3.4.3 The adaptive spatial feature aggregation network receives the multi-scale-aware high-pixel feature map $F_H$ from the adaptive multi-scale feature aggregation network and the coarse-box predicted positions $B_{coarse}$ from the coarse-box prediction network, and generates the boundary-region-aware high-pixel feature map $F_{HR}$ and the salient-region-aware high-pixel feature map $F_{HS}$. The method comprises the following steps:
3.3.4.3.1 Design the region-limited deformable convolution (R-DConv). Deformable convolution (DConv) (see "Zhu X, Hu H, Lin S, et al. Deformable ConvNets v2: More Deformable, Better Results. Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2019: 9308-9316.") is often used to enhance the spatial perception of features because of its adaptive sparse sampling. However, its sampling range is not limited, so the sampling points easily drift too far; and for objects of different sizes, the difficulty of adaptively learning to sample the most representative feature points is inconsistent, which leads to poor adaptability when detecting objects of different sizes. The invention therefore designs the region-limited deformable convolution (R-DConv) to enhance adaptability. The specific method is as follows (a sketch of R-DConv is given after step 3.3.4.3.1.2):
3.3.4.3.1.1 Design the offset transfer function $\mathcal{T}$, which transforms the offset Δp of the deformable convolution (Δp is the learnable offset attached to the feature points, see "Zhu X, Hu H, Lin S, et al. Deformable ConvNets v2: More Deformable, Better Results. Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2019: 9308-9316.") into a transformed offset, so that the difficulty of adaptively learning to sample the most representative feature points becomes consistent for objects of different sizes. $\mathcal{T}$ limits the offset range of the spatial sampling points of the deformable convolution to within $B_{coarse}$, and at the same time differentiates the offset Δp of the deformable convolution. Because the spatial sampling range of a large object is wider than that of a small object, the corresponding search difficulty differs. To solve the inconsistent search difficulty of spatial feature points for objects of different sizes, a Sigmoid function is used to normalize the offset Δp within $B_{coarse}$ so that Δp lies in the interval [0, 1]. After such processing, the difficulty of searching for the point feature with the strongest characterization capability becomes the same for objects of different sizes. Δp is split into $h_{\Delta p}$ and $w_{\Delta p}$, where $h_{\Delta p}$ denotes the offset of Δp in the vertical direction and $w_{\Delta p}$ the offset of Δp in the horizontal direction. $\mathcal{T}$ is shown in equation (5):

$$\mathcal{T}_v(h_{\Delta p})=(t+d)\,\mathrm{Sigmoid}(h_{\Delta p})-t,\qquad \mathcal{T}_h(w_{\Delta p})=(l+r)\,\mathrm{Sigmoid}(w_{\Delta p})-l \tag{5}$$

where $\mathcal{T}_v$ is the offset transfer function in the vertical direction, $\mathcal{T}_h$ the offset transfer function in the horizontal direction, the overall offset transfer function is $\mathcal{T}=(\mathcal{T}_v,\mathcal{T}_h)$, and (t, l, r, d) are the distances from the convolution kernel position p to $B_{coarse}$ in the four directions up, left, right and down.
3.3.4.3.1.2 Use $\mathcal{T}$ to limit the deformable-convolution sampling region. Given a 3×3 convolution kernel with K = 9 spatial sampling position points, let $w_k$ denote the convolution kernel weight of the k-th position and $P_k$ the predefined position offset of the k-th position; $P_k\in\{(-1,-1),(-1,0),\dots,(1,1)\}$ denotes the 3×3 range centered at (0, 0). Let x(p) denote the input feature map at the convolution kernel center position p and y(p) the output feature map at the convolution kernel center position p. y(p) is calculated with R-DConv as shown in equation (6):

$$y(p)=\sum_{k=1}^{K}w_k\cdot x\big(p+P_k+\mathcal{T}(\Delta p_k)\big)\cdot\Delta m_k \tag{6}$$

where $\Delta p_k$ denotes the learnable offset of the k-th position and $\Delta m_k$ the weight of the k-th position. $\Delta p_k$ and $\Delta m_k$ are generated by a 3×3 convolution that outputs a 27-channel feature map: 9 channels are the abscissa offset values of $\Delta p_k$, 9 channels are the ordinate offset values of $\Delta p_k$, and 9 channels (representing the weights of the different offset-value features) are the values of $\Delta m_k$. $B_{coarse}$, the coarse box predicted at the scale of the current feature map, is also the predefined bounding region.
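A hedged sketch of R-DConv built on torchvision's deform_conv2d follows; the way the sigmoid-normalized offsets are rescaled by the per-pixel coarse-box distances follows the description of the offset transfer function above, but the exact rescaling is an assumption rather than the patent's verbatim formula.

```python
# Sketch of region-limited deformable convolution (R-DConv): a 3x3 convolution produces
# 27 channels (18 raw offsets + 9 modulation masks); offsets are squashed by a sigmoid and
# rescaled by the coarse-box distances (t, d, l, r) so sampling stays inside the coarse box.
import torch
import torch.nn as nn
from torchvision.ops import deform_conv2d

class RDConv(nn.Module):
    def __init__(self, ch=64, k=3):
        super().__init__()
        self.k = k
        self.offset_mask = nn.Conv2d(ch, 3 * k * k, 3, padding=1)   # 18 offsets + 9 masks
        self.weight = nn.Parameter(torch.randn(ch, ch, k, k) * 0.01)

    def forward(self, x, coarse_box):
        # coarse_box: (B, 4, H, W) distances t, d, l, r from each pixel to its coarse box
        om = self.offset_mask(x)
        raw_off = om[:, :2 * self.k ** 2]
        mask = torch.sigmoid(om[:, 2 * self.k ** 2:])
        dy, dx = raw_off[:, 0::2], raw_off[:, 1::2]                  # (B, 9, H, W) each
        t, d = coarse_box[:, 0:1], coarse_box[:, 1:2]
        l, r = coarse_box[:, 2:3], coarse_box[:, 3:4]
        # offset transfer: sigmoid-normalize, then map into [-t, d] vertically, [-l, r] horizontally
        dy = (t + d) * torch.sigmoid(dy) - t
        dx = (l + r) * torch.sigmoid(dx) - l
        offset = torch.stack([dy, dx], dim=2).flatten(1, 2)          # interleave to (B, 18, H, W)
        return deform_conv2d(x, offset, self.weight, padding=1, mask=mask)

x = torch.randn(1, 64, 128, 128)
box = torch.rand(1, 4, 128, 128) * 6
print(RDConv()(x, box).shape)       # torch.Size([1, 64, 128, 128])
```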
3.3.4.3.2 To make R-DConv learn the salient region of an object within the coarse-box range and extract features that make object classification more accurate, the classification adaptive spatial feature aggregation method is adopted, using $B_{coarse}$ to limit the sampling range when aggregating features on $F_H$. The method is:
3.3.4.3.2.1 Let the offset transfer function be the classification offset transfer function $\mathcal{T}_{cls}$ and compute the output feature $y_{cls}(p)$ at position p with equation (6).
3.3.4.3.2.2 Traverse $F_H$ with the convolution kernel using $\mathcal{T}_{cls}$ to obtain the salient-region-aware high-pixel feature map $F_{HS}$. $\mathcal{T}_{cls}$ allows the sampling points to be concentrated so that the classification branch can focus on the most discriminative salient regions. Thus $\mathcal{T}_{cls}$ enables R-DConv to learn the salient region of the object within the coarse-box range and extract the features that make object classification more accurate, i.e. the salient-region-aware high-pixel feature map $F_{HS}$; $F_{HS}$ is sent to the main task module.
3.3.4.3.3 To make R-DConv learn the boundary-region information of an object within the coarse-box range and extract features that make object position regression more accurate, the regression adaptive spatial feature aggregation method is adopted, using $B_{coarse}$ to limit the sampling range when aggregating features on $F_H$. The regression adaptive spatial feature aggregation method is specifically:
3.3.4.3.3.1 Design the regression offset transfer function $\mathcal{T}_{reg}$, which transforms the offset Δp of the deformable convolution. $\mathcal{T}_{reg}$ divides the spatial sampling points of the R-DConv operation uniformly along the four directions up, down, left and right, so that the limited region is divided into four sub-regions corresponding to top-left, top-right, bottom-left and bottom-right. $\mathcal{T}_{reg}$ samples each of the four sub-regions uniformly, i.e. assigns an equal number of sampling points to each sub-region. In this way the spatial sampling points of the R-DConv operation are dispersed, so features containing more information from the boundary can be extracted and the object position can be regressed more accurately. With K = 9, $\mathcal{T}_{reg}$ samples two points from each of the four sub-regions; the eight edge points plus the center point form a 3×3 convolution kernel, which strengthens the capture of boundary information for the center feature point. The regression offset transfer function $\mathcal{T}_{reg}$ is shown in equation (7):

$$\mathcal{T}^{\,v}_{reg}(h_{\Delta p})=\begin{cases}-\,t\cdot\mathrm{Sigmoid}(h_{\Delta p}), & \text{sampling points assigned to the upper sub-regions}\\[2pt]\;\;d\cdot\mathrm{Sigmoid}(h_{\Delta p}), & \text{sampling points assigned to the lower sub-regions}\end{cases}\qquad \mathcal{T}^{\,h}_{reg}(w_{\Delta p})=\begin{cases}-\,l\cdot\mathrm{Sigmoid}(w_{\Delta p}), & \text{left sub-regions}\\[2pt]\;\;r\cdot\mathrm{Sigmoid}(w_{\Delta p}), & \text{right sub-regions}\end{cases} \tag{7}$$

where Sigmoid is the function that normalizes the offset within the coarse-box interval; the normalization balances the sampling difficulty of objects of different sizes.
Will be provided with
Figure BDA00038769718500001112
Substituted into equation (6)
Figure BDA00038769718500001113
Obtaining an output characteristic y at the position p reg (p) of the formula (I). Thus, it is possible to provide
Figure BDA00038769718500001114
Enabling R-DConv to learn the region of the object boundary in the coarse frame range, and extracting the feature which enables the regression position of the prediction frame to be more accurate, namely a high-pixel feature map F perceived by the boundary region HR
3.3.4.3.3.2 uses
Figure BDA00038769718500001115
Traversing F with convolution kernel H Obtaining a high pixel characteristic map F perceived by the boundary area HR Will F HR And sending the data to a main task module.
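To make the contrast between the two branches concrete, the sketch below (Python/PyTorch style, written per feature point) shows one possible reading of the classification and regression offset transfer functions: both squash the raw offsets with a sigmoid and re-scale them by the distances t, l, r, d from the feature point to the sides of B_coarse, but the regression transfer additionally pins two of the eight non-centre kernel points to each of the four sub-regions. The quadrant assignment and the exact scaling are assumptions for illustration; the patent's exact forms are given by equations (5)–(7).

```python
import torch

# Quadrant of each of the K=9 kernel points for the regression transfer:
# index 4 is the kernel centre; the remaining eight points are assigned two
# per sub-region (UL, UR, LL, LR). This particular assignment is an assumption.
_QUAD = ["UL", "UL", "UR", "UR", None, "LL", "LL", "LR", "LR"]

def cls_offsets(raw, tlrd):
    """Classification transfer: sample anywhere inside B_coarse.
    raw: (9, 2) raw (dy, dx) offsets; tlrd: distances (top, left, right, down)
    from the feature point p to the sides of B_coarse (ordering assumed)."""
    t, l, r, d = tlrd
    s = torch.sigmoid(raw)                        # normalise to (0, 1)
    dy = s[:, 0] * (t + d) - t                    # vertical range [-t, d]
    dx = s[:, 1] * (l + r) - l                    # horizontal range [-l, r]
    return torch.stack([dy, dx], dim=1)

def reg_offsets(raw, tlrd):
    """Regression transfer: spread the eight edge points over the four
    sub-regions of B_coarse; the centre point stays at p."""
    t, l, r, d = tlrd
    s = torch.sigmoid(raw)                        # (9, 2) in (0, 1)
    dy, dx = torch.zeros(9), torch.zeros(9)
    for k, quad in enumerate(_QUAD):
        if quad is None:                          # centre point
            continue
        dy[k] = -s[k, 0] * t if quad in ("UL", "UR") else s[k, 0] * d
        dx[k] = -s[k, 1] * l if quad in ("UL", "LL") else s[k, 1] * r
    return torch.stack([dy, dx], dim=1)
```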
3.3.5 The auxiliary task module receives F_H from the adaptive multi-scale feature aggregation network and processes it with two layers of 3 × 3 convolution, one layer of 1 × 1 convolution and a sigmoid function to obtain the corner prediction heatmap H_corner. H_corner has a resolution of (H/4) × (W/4) and 4 channels. The loss between H_corner and the corner prediction ground truth H_corner^gt constructed in 2.3.3 is computed, giving the loss value L_corner. L_corner is based on a modified version of Focal Loss (see the literature "Law H, Deng J. CornerNet: Detecting objects as paired keypoints [C] // Proceedings of the European Conference on Computer Vision (ECCV). 2018", i.e. CornerNet: object detection with paired corner keypoints), as shown in equation (8):

L_corner = -(1/N_s) Σ_{c=1}^{4} Σ_{i,j} f(c, i, j),    where
f(c, i, j) = (1 - H_corner(c,i,j))^{α_l} · log(H_corner(c,i,j))                                   if H_corner^gt(c,i,j) = 1,
f(c, i, j) = (1 - H_corner^gt(c,i,j))^{β} · (H_corner(c,i,j))^{α_l} · log(1 - H_corner(c,i,j))     otherwise.    (8)

where N_s is the number of annotation boxes in the image, and α_l and β are hyperparameters, set to 2 and 4 respectively, that control the gradient contribution of the loss function. H_corner(c,i,j) is the corner prediction output by the auxiliary task module at channel c and pixel position (i, j), and H_corner^gt(c,i,j) is the corner prediction ground truth at channel c and pixel position (i, j). The auxiliary task module learns the positions of the four corners of the annotation boxes and assists the training of the target detection network, so that the extracted features focus on the corner positions of objects and the target detection system localizes objects more accurately.
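For reference, a minimal sketch of this CornerNet-style modified focal loss is given below (PyTorch). The normaliser N_s is approximated here by the number of ground-truth peaks in the heatmap, and the small epsilon is added for numerical stability; both are implementation assumptions rather than details fixed by the patent.

```python
import torch

def corner_focal_loss(pred, gt, alpha=2.0, beta=4.0, eps=1e-6):
    """Modified focal loss for the corner heatmap H_corner.

    pred: (B, 4, H/4, W/4) sigmoid outputs of the corner prediction network.
    gt:   (B, 4, H/4, W/4) Gaussian-splatted corner ground truth.
    """
    pos = gt.eq(1).float()                       # peak positions (y = 1)
    neg = 1.0 - pos
    pos_loss = pos * torch.log(pred + eps) * (1 - pred) ** alpha
    neg_loss = neg * (1 - gt) ** beta * torch.log(1 - pred + eps) * pred ** alpha
    num_pos = pos.sum().clamp(min=1.0)           # stands in for N_s
    return -(pos_loss.sum() + neg_loss.sum()) / num_pos
```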
3.3.6 The fine-box prediction network of the main task module receives the boundary-region-aware high-pixel feature map F_HR from the adaptive spatial feature aggregation network and, after one layer of 1 × 1 convolution, obtains the fine-box predicted positions B_refine at the feature-point positions of F_HR. B_refine has a resolution of (H/4) × (W/4) and 4 channels; the 4 channels represent the distances from a pixel point to the predicted fine box in the up, down, left and right directions, so every pixel point forms a fine prediction box. The loss L_refine between B_refine and the fine-box ground truth B_refine^gt obtained in 2.3.5 is computed. L_refine is based on the GIoU loss, as shown in equation (9):

L_refine = (1/N_b) Σ_{(i,j)∈S_b} W_ij · (1 - GIoU(B_refine(i,j), B_refine^gt(i,j)))    (9)

where S_b is the regression sample set, consisting of the pixels where B_refine^gt is not 0; N_b is the number of samples in the set; and W_ij is the weight at a position (i, j) where B_refine^gt is not 0, used to apply a larger loss weight to pixel points in the centre region so that they regress the annotation-box position more accurately. B_refine reflects how accurately the target detection system regresses the object position.
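A minimal sketch of this centre-weighted GIoU loss is shown below (PyTorch, using torchvision's generalized_box_iou_loss). It assumes the regression samples S_b have already been gathered into flat tensors of (t, l, d, r) distances and per-pixel weights W_ij; the boxes are built relative to each sample pixel, which yields the same GIoU as absolute boxes since predicted and ground-truth boxes share the pixel origin. The normalisation by N_b follows the text above.

```python
import torch
from torchvision.ops import generalized_box_iou_loss

def weighted_giou_loss(pred_tlrd, gt_tlrd, weight):
    """Centre-weighted GIoU loss for B_refine (and, analogously, B_coarse).

    pred_tlrd, gt_tlrd: (N, 4) distances (t, l, d, r) from each regression
    sample pixel to the four box sides; weight: (N,) per-pixel weights W_ij.
    Pixels with an all-zero ground truth are assumed to be filtered out (S_b).
    """
    def to_xyxy(tlrd):
        t, l, d, r = tlrd.unbind(dim=1)
        return torch.stack([-l, -t, r, d], dim=1)   # box relative to the pixel

    loss = generalized_box_iou_loss(to_xyxy(pred_tlrd), to_xyxy(gt_tlrd),
                                    reduction="none")
    return (weight * loss).sum() / max(weight.numel(), 1)   # divide by N_b
```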
3.3.7 The centre-point prediction network of the main task module receives the salient-region-aware high-pixel feature map F_HS from the adaptive spatial feature aggregation network and, after one layer of 1 × 1 convolution and a sigmoid function, obtains the centre-point prediction heatmap H_center at the feature-point positions of F_HS. H_center has a resolution of (H/4) × (W/4), and its number of channels is the number of dataset categories C (C is 80 for the MS COCO dataset and 8 for the Cityscapes dataset). The loss L_center between H_center and the centre-point prediction ground truth H_center^gt constructed in 2.3.2 is computed. L_center is based on the modified Focal Loss, as shown in equation (10), which has the same form as equation (8) with H_corner replaced by H_center and the channel sum running over the C categories. Here N_s is again the number of annotation boxes in the image, and α_l and β are hyperparameters, set to 2 and 4 respectively, that control the gradient contribution of the loss function. H_center(c,i,j) is the centre-point prediction heatmap value at channel c and pixel position (i, j), and H_center^gt(c,i,j) is the centre-point ground truth at channel c and pixel position (i, j). H_center reflects the ability of the target detection system to locate object centres and to distinguish object classes.
3.3.8 Design the total loss function L_total of the target detection system as shown in equation (11):

L_total = λ_corner · L_corner + λ_center · L_center + λ_coarse · L_coarse + λ_refine · L_refine    (11)

where L_corner is the loss computed between the corner prediction network output H_corner and the ground truth H_corner^gt; L_center is the loss computed between the centre-point prediction network output H_center and the ground truth H_center^gt; L_coarse is the loss computed between the coarse-box prediction network output B_coarse and the ground truth B_coarse^gt; and L_refine is the loss computed between the fine-box prediction network output B_refine and the ground truth B_refine^gt. λ_corner, λ_center, λ_coarse and λ_refine are, respectively, the loss weights of the corner prediction network, the centre-point prediction network, the coarse-box prediction network and the fine-box prediction network, set according to the importance of each task.
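Putting the four terms together, the total objective of equation (11) is a weighted sum of the branch losses; a trivial sketch follows. The default weight values below are placeholders, not values fixed by the patent.

```python
def total_loss(l_corner, l_center, l_coarse, l_refine,
               w_corner=1.0, w_center=1.0, w_coarse=1.0, w_refine=1.0):
    """Weighted sum of the four branch losses, as in equation (11)."""
    return (w_corner * l_corner + w_center * l_center
            + w_coarse * l_coarse + w_refine * l_refine)
```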
3.3.9 Let epoch = epoch + 1. If epoch is 80 or 110, let learning_rate = learning_rate × 0.1 and go to 3.3.10; if epoch is neither 80 nor 110, go directly to 3.3.10;
3.3.10 If epoch ≤ maxepoch, go to 3.3.2; if epoch > maxepoch, training is finished, go to 3.3.11;
3.3.11 Save the network weight parameters of the last N_m epochs.
Fourthly, use the validation set to verify the detection precision of the target detection system after loading each of the saved N_m epoch network weight parameters, and keep the best-performing network weight parameters as the network weight parameters of the target detection system. The method comprises the following steps:
4.1 Let variable n_m = 1;
4.2 Load the n_m-th of the N_m saved epoch network weight parameters into the target detection system; input the new validation set D_V, obtained with the image scaling and normalization method of step 2.4, into the target detection system;
4.3 Let v = 1 index the v-th image of the validation set, V being the number of validation images;
4.4 The main feature extraction module receives the v-th validation image D_v, extracts the multi-scale features of D_v with the main feature extraction method of 3.3.3, and sends the multi-scale feature map containing the multi-scale features of D_v to the feature adaptive aggregation module;
4.5 The adaptive multi-scale feature aggregation network in the feature adaptive aggregation module receives the multi-scale feature map containing the multi-scale features of D_v and applies the adaptive multi-scale feature aggregation method of 3.3.4.1 (channel self-attention enhancement, bilinear interpolation upsampling and scale-level soft-weight aggregation) to obtain the multi-scale-aware high-pixel feature map F_HV of D_v; F_HV is sent to the coarse-box prediction network and the adaptive spatial feature aggregation network;
4.6 The coarse-box prediction network in the feature adaptive aggregation module receives F_HV and, with the coarse-box prediction method of 3.3.4.2, predicts a coarse-box position for every feature point of F_HV, generating the coarse-box predicted positions B_HVcoarse of the v-th validation image D_v; B_HVcoarse is sent to the adaptive spatial feature aggregation network. B_HVcoarse also has a resolution of (H/4) × (W/4) and 4 channels;
4.7 The adaptive spatial feature aggregation network in the feature adaptive aggregation module receives B_HVcoarse from the coarse-box prediction network and F_HV from the adaptive multi-scale feature aggregation network, and applies the classification adaptive spatial feature aggregation method of 3.3.4.3.2, using B_HVcoarse to limit the sampling range, to aggregate classification-task spatial features on F_HV, obtaining the salient-region-aware high-pixel feature map of the v-th validation image D_v; this feature map is sent to the centre-point prediction network;
4.8 The adaptive spatial feature aggregation network applies the regression adaptive spatial feature aggregation method of 3.3.4.3.3, using B_HVcoarse to limit the sampling range, to aggregate regression-task spatial features on F_HV, obtaining the boundary-region-aware high-pixel feature map of the v-th validation image D_v; this feature map is sent to the fine-box prediction network;
4.9 The fine-box prediction network in the main task module receives the boundary-region-aware high-pixel feature map and, after one layer of 1 × 1 convolution, obtains the fine-box predicted positions of the objects in the v-th validation image D_v; the fine-box predicted positions are sent to the post-processing module;
4.10 The centre-point prediction network in the main task module receives the salient-region-aware high-pixel feature map of the v-th validation image D_v and, after one layer of 1 × 1 convolution and the sigmoid function, obtains the centre-point prediction heatmap of D_v; the centre-point prediction heatmap of D_v is sent to the post-processing module;
4.11 The post-processing module receives the fine-box predicted positions and the centre-point prediction heatmap of the v-th validation image D_v and removes overlapping false boxes from them to obtain the predicted object-box set of D_v (a sketch of this decoding step is given after 4.11.3 below). The specific method is:
4.11.1 The post-processing module performs a 3 × 3 max-pooling operation (2D max-pooling) on the centre-point prediction heatmap of D_v to extract the set of peak points of the heatmap, each peak point representing a centre-region point inside a predicted object;
4.11.2 For a peak point with coordinates (P_x, P_y) taken from the centre-point prediction heatmap of D_v, the post-processing module reads the distance information (t, l, d, r) in the up, left, down and right directions at (P_x, P_y) from the fine-box predicted positions of D_v and obtains a prediction box of D_v: B_p = {P_x - l, P_y - t, P_x + r, P_y + d}. The category of B_p is the index of the channel with the largest centre-point heatmap value at (P_x, P_y), recorded as c_p; the confidence of B_p is the heatmap value of channel c_p at (P_x, P_y), recorded as s_p;
4.11.3 The post-processing module keeps the prediction boxes of D_v whose confidence s_p is greater than a confidence threshold (typically set to 0.3); these form the predicted object-box set of D_v, which keeps each prediction box B_p and its category c_p;
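A compact sketch of this decoding step (3 × 3 max-pool peak extraction plus fine-box read-out) is given below in PyTorch. The channel order (t, l, d, r) of the fine-box map, the top-k truncation and the stride-4 rescaling are implementation assumptions consistent with the description above.

```python
import torch
import torch.nn.functional as F

def decode_detections(center_heatmap, fine_tlrd, score_thr=0.3, topk=100, stride=4):
    """Keep 3x3 local maxima of the centre heatmap, then build boxes from the
    (t, l, d, r) distances of the fine-box map at each peak (see 4.11).

    center_heatmap: (C, H/4, W/4) sigmoid heatmap; fine_tlrd: (4, H/4, W/4).
    Returns boxes (N, 4) in input-image coordinates, scores (N,), labels (N,).
    """
    C, H, W = center_heatmap.shape
    hmax = F.max_pool2d(center_heatmap[None], 3, stride=1, padding=1)[0]
    peaks = center_heatmap * (hmax == center_heatmap).float()   # suppress non-peaks
    scores, idx = peaks.flatten().topk(topk)
    labels = torch.div(idx, H * W, rounding_mode="floor")       # channel = class c_p
    ys = torch.div(idx % (H * W), W, rounding_mode="floor")
    xs = idx % W
    t, l, d, r = fine_tlrd[:, ys, xs]                           # (4, topk)
    boxes = torch.stack([xs - l, ys - t, xs + r, ys + d], dim=1) * stride
    keep = scores > score_thr                                   # threshold 0.3
    return boxes[keep], scores[keep], labels[keep]
```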
4.12 Let v = v + 1. If v ≤ V, go to 4.4; if v > V, the predicted object-box sets of the V validation images of the n_m-th model have been obtained; go to 4.13;
4.13 If the validation set is the MS COCO public general-scene dataset, test the precision of the final object-box prediction sets output by the target detection system with the standard MS COCO evaluation protocol (https://cocodataset.org/), record the precision, and go to 4.14; if the validation set is the Cityscapes autonomous-driving dataset, test the precision with the Cityscapes evaluation protocol (https://www.cityscapes-dataset.com/), record the precision, and go to 4.14;
4.14 Let n_m = n_m + 1. If n_m ≤ N_m, go to 4.2; if n_m > N_m, the precision of all N_m models has been tested, go to 4.15;
4.15 From the precisions of the N_m object-box prediction sets, select the most accurate one, find the weight parameters of the target detection system that produced it, take these as the selected weight parameters, and load them into the target detection system; the target detection system loaded with the selected weight parameters is the trained target detection system.
Fifthly, adopting the trained target detection system to perform target detection on the image to be detected input by the user, wherein the method comprises the following steps:
5.1 Apply the image scaling and normalization method of step 2.4 to the image I to be detected input by the user, obtaining the normalized image I_nor, and input I_nor into the main feature extraction module;
5.2 The main feature extraction module receives I_nor, extracts the multi-scale features of I_nor with the main feature extraction method of 3.3.3, and sends the multi-scale feature map containing the multi-scale features of I_nor to the feature adaptive aggregation module;
5.3 The adaptive multi-scale feature aggregation network in the feature adaptive aggregation module receives the multi-scale feature map containing the multi-scale features of I_nor and applies the adaptive multi-scale feature aggregation method of 3.3.4.1 (channel self-attention enhancement, bilinear interpolation upsampling and scale-level soft-weight aggregation) to obtain the multi-scale-aware high-pixel feature map F_IH; F_IH is sent to the coarse-box prediction network and the adaptive spatial feature aggregation network;
5.4 The coarse-box prediction network in the feature adaptive aggregation module receives F_IH and, with the coarse-box prediction method of 3.3.4.2, predicts the coarse-box positions B_Icoarse in the image I to be detected; B_Icoarse is sent to the adaptive spatial feature aggregation network. B_Icoarse also has a resolution of (H/4) × (W/4) and 4 channels;
5.5 The adaptive spatial feature aggregation network in the feature adaptive aggregation module receives F_IH and B_Icoarse and applies the classification adaptive spatial feature aggregation method of 3.3.4.3.2, using B_Icoarse to limit the sampling range, to aggregate classification-task spatial features on F_IH, obtaining the salient-region-aware high-pixel feature map of the image I to be detected; this feature map is sent to the centre-point prediction network;
5.6 The adaptive spatial feature aggregation network applies the regression adaptive spatial feature aggregation method of 3.3.4.3.3, using B_Icoarse to limit the sampling range, to aggregate regression-task spatial features on F_IH, obtaining the boundary-region-aware high-pixel feature map of the image I to be detected; this feature map is sent to the fine-box prediction network;
5.7 The fine-box prediction network in the main task module receives the boundary-region-aware high-pixel feature map of the image I to be detected and, after one layer of 1 × 1 convolution, obtains the fine-box predicted positions of the objects in the image I to be detected; these are sent to the post-processing module;
5.8 The centre-point prediction network in the main task module receives the salient-region-aware high-pixel feature map of the image I to be detected and, after one layer of 1 × 1 convolution and the sigmoid function, obtains the centre-point prediction heatmap of the objects in the image I to be detected; the heatmap is sent to the post-processing module;
5.9 The post-processing module receives the fine-box predicted positions and the centre-point prediction heatmap of the objects in the image I to be detected and applies the overlapping-false-box removal method of step 4.11 to them, obtaining the predicted object-box set of the image I to be detected; the set keeps each prediction box B_p and its category information, i.e. the coordinate position and predicted category of each predicted object box of the image to be detected.
And sixthly, finishing.
The invention can achieve the following beneficial effects:
The invention provides a target detection method based on feature adaptive aggregation. By adopting the adaptive multi-scale feature aggregation network and the adaptive spatial feature aggregation network, it achieves a large precision improvement at a small additional computational cost, and it is applicable to most image-based target detection tasks. The invention can achieve the following effects:
1. The invention constructs a target detection system that integrates a main feature extraction module, a feature adaptive aggregation module, an auxiliary task module, a main task module and a post-processing module. While keeping the target detection method fast and real-time, it designs an aggregation scheme and network structure suited to target detection by exploiting the channel self-attention enhancement and scale-level soft-weight aggregation of the adaptive multi-scale feature aggregation network and the adaptive feature aggregation capability of the deformable convolution in the adaptive spatial feature aggregation network, thereby achieving a large improvement in detection precision. Experiments on the MS COCO and Cityscapes datasets show that the detection precision of the invention is greatly improved compared with the CenterNet and TTFNet methods of the background art.
2. The adaptive multi-scale feature aggregation network uses SE modules to strengthen the channel representation of the features and a scale-level soft-weight map to strengthen their multi-scale representation. The adaptive spatial feature aggregation network uses the coarse box to limit the spatial sampling range of the deformable convolution, which alleviates excessive offsets, and designs different offset transfer functions for the centre-point prediction task and the fine-box prediction network, so that the regression task attends to the boundary region of the object and the classification task attends to the salient region of the object. This alleviates the feature-coupling problem between the classification and regression tasks and yields a large improvement in detection precision.
Drawings
FIG. 1 is a logical block diagram of an object detection system constructed in a first step of the present invention.
FIG. 2 is a general flow chart of the present invention.
FIG. 3 is a graph comparing the results of the test of the present invention with the results of the TTFNet method.
Fig. 4 is a diagram showing an example of a detection image in a test for the effect of the present invention.
Detailed Description
The following describes an embodiment of the present invention with reference to the drawings. As shown in fig. 2, the present invention comprises the steps of:
firstly, constructing a target detection system based on feature adaptive aggregation. As shown in fig. 1, the target detection system is composed of a main feature extraction module, a feature adaptive aggregation module, an auxiliary task module, a main task module, and a post-processing module.
The main feature extraction module is connected with the feature adaptive aggregation module; it extracts multi-scale features from the input image and sends a multi-scale feature map containing those features to the feature adaptive aggregation module. The main feature extraction module consists of a DarkNet-53 convolutional neural network and a feature pyramid network. The DarkNet-53 convolutional neural network is a lightweight backbone comprising 53 neural-network layers, divided into 5 serial sub-networks, which extracts the backbone features of the image. The feature pyramid network receives the backbone features from the DarkNet-53 convolutional neural network, obtains a multi-scale feature map containing the multi-scale features through upsampling, feature extraction and feature fusion, and sends the multi-scale feature map to the feature adaptive aggregation module.
The feature self-adaptive aggregation module is connected with the main feature extraction module, the auxiliary task module and the main task module, and has the functions of providing a multi-scale perceived high-pixel feature map for the auxiliary task module, providing a boundary region perceived high-pixel feature map and a salient region perceived high-pixel feature map for the main task module, and improving the detection precision of the target detection system. The characteristic self-adaptive aggregation module is composed of a self-adaptive multi-scale characteristic aggregation network, a self-adaptive spatial characteristic aggregation network and a rough frame prediction network. The self-adaptive multi-scale feature aggregation network is composed of 4 weight unshared SE networks (the 4 SE networks are respectively recorded as a first SE network, a second SE network, a third SE network and a fourth SE network), receives a multi-scale feature map from a feature pyramid network of a main feature extraction module, performs channel self-attention enhancement, bilinear interpolation upsampling and scale level soft weight aggregation on the multi-scale feature map by adopting a self-adaptive multi-scale feature aggregation method to obtain a multi-scale perceived high pixel feature map, and sends the multi-scale perceived high pixel feature map to the self-adaptive spatial feature aggregation network, the rough frame prediction network and the auxiliary task module. The rough frame prediction network is composed of two layers of 3 x 3 convolution and one layer of 1 x 1 convolution, receives the multi-scale perception high pixel characteristic diagram from the self-adaptive multi-scale characteristic aggregation network, predicts the multi-scale perception high pixel characteristic diagram to obtain a rough frame prediction position, and sends the rough frame prediction position to the self-adaptive spatial characteristic aggregation network. The self-adaptive spatial feature aggregation network is composed of two region-limited deformable convolutions with different offset conversion functions (a classification offset conversion function and a regression offset conversion function), receives a multi-scale perceived high pixel feature map from the self-adaptive multi-scale feature aggregation network, receives a rough frame prediction position from a rough frame prediction network, generates a boundary region perceived high pixel feature map and a salient region perceived high pixel feature map, and sends the boundary region perceived high pixel feature map and the salient region perceived high pixel feature map to the main task module, so that the main task module has self-adaptive spatial perception capability, and the problem that the input feature coupling degree is high and affects detection accuracy is relieved.
The auxiliary task module is connected with an adaptive multi-scale feature aggregation network in the feature adaptive aggregation module, the auxiliary task module is a corner prediction network, the corner prediction network is composed of two layers of 3 x 3 convolution, one layer of 1 x 1 convolution and a sigmoid active layer, the auxiliary task module receives a multi-scale perception high pixel feature image from the adaptive multi-scale feature aggregation network, and the corner prediction network predicts the multi-scale perception high pixel feature image to obtain a corner prediction thermodynamic diagram which is used for calculating corner prediction loss in the training of a target detection system and assisting the target detection system in perceiving a corner region. The auxiliary task module is only used during training of the target detection system and is used for enhancing perception of the target detection system on the position of the corner point of the object, so that the position of the object frame is predicted more accurately. When the trained target detection system detects the user input image, the module is directly discarded, and extra calculation amount is not increased.
The main task module is connected with the adaptive spatial feature aggregation network and the post-processing module and consists of a fine frame prediction network and a central point prediction network. The fine frame prediction network is a layer of 1 multiplied by 1 convolution layer, receives the high pixel characteristic diagram sensed by the boundary region from the adaptive spatial characteristic aggregation network, performs 1 multiplied by 1 convolution on the high pixel characteristic diagram sensed by the boundary region to obtain a fine frame prediction position, and sends the fine frame prediction position to the post-processing module; the central point prediction network consists of a layer of 1 x 1 convolutional layer and a sigmoid activation layer, receives the high pixel characteristic diagram sensed by the salient region from the adaptive spatial characteristic aggregation network, performs 1 x 1 convolution and activation on the high pixel characteristic diagram sensed by the salient region to obtain a central point prediction thermodynamic diagram, and sends the central point prediction thermodynamic diagram to the post-processing module.
The post-processing module is a 3 x 3 pooling layer and is connected with a fine frame prediction network and a central point prediction network in the main task module, receives a fine frame prediction position from the fine frame prediction network, receives a central point prediction thermodynamic diagram from the central point prediction network, reserves a prediction maximum value in a central point prediction thermodynamic diagram 3 x 3 range by adopting 3 x 3 maximum pooling operation with the step length of 1, and extracts the position of the reserved prediction maximum value, namely a peak point, as the position of the central area point of the object. And finding out the corresponding up-down, left-right four-direction distances in the fine frame prediction position according to the position of the central area point to generate a predicted object frame position, wherein the central point category where the position of the central area point is located is the category of the object prediction. The post-processing module suppresses overlapping false frames by extracting peak points within a 3 × 3 range, reducing false positive prediction frames.
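As a reading aid, the module wiring described above can be summarised in the following skeleton (PyTorch-style pseudocode; the class and argument names are placeholders, and each sub-module stands in for the corresponding network described in this step):

```python
import torch.nn as nn

class FeatureAdaptiveDetector(nn.Module):
    """Skeleton of the module wiring of Fig. 1 (illustrative, not the patent's code)."""

    def __init__(self, backbone, fpn, msfa, coarse_head, spatial_agg,
                 corner_head, center_head, fine_head):
        super().__init__()
        self.backbone, self.fpn = backbone, fpn          # main feature extraction module
        self.msfa = msfa                                 # adaptive multi-scale aggregation
        self.coarse_head = coarse_head                   # coarse-box prediction network
        self.spatial_agg = spatial_agg                   # adaptive spatial aggregation (R-DConv)
        self.corner_head = corner_head                   # auxiliary task (training only)
        self.center_head, self.fine_head = center_head, fine_head

    def forward(self, images, with_aux=False):
        feats = self.fpn(self.backbone(images))          # multi-scale feature maps
        f_h = self.msfa(feats)                           # multi-scale-aware high-pixel map F_H
        b_coarse = self.coarse_head(f_h)                 # per-pixel (t, l, d, r)
        f_hs, f_hr = self.spatial_agg(f_h, b_coarse)     # salient- / boundary-aware maps
        out = {"coarse": b_coarse,
               "center": self.center_head(f_hs),
               "fine": self.fine_head(f_hr)}
        if with_aux:                                     # corner heatmap, training only
            out["corner"] = self.corner_head(f_h)
        return out
```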
Secondly, constructing a training set, a verification set and a test set, wherein the method comprises the following steps:
2.1 collecting target detection scene images as a target detection data set, and manually labeling each target detection scene image in the target detection data set, wherein the method comprises the following steps:
The MS COCO public general-scene dataset or the Cityscapes autonomous-driving dataset is used as the target detection dataset. The MS COCO dataset has 80 classes and contains 105000 training images (train2017) as the training set, 5000 validation images (val2017) as the validation set, and 20000 test images (test-dev) as the test set. The Cityscapes dataset has 8 classes: pedestrian, rider, car, truck, bus, train, motorcycle and bicycle, with 2975 training images as the training set, 500 validation images as the validation set, and 1525 test images as the test set. Let the total number of training images be S, the total number of test images be T, and the total number of validation images be V; S is 105000 or 2975, T is 20000 or 1525, and V is 5000 or 500. Every image of the MS COCO and Cityscapes datasets is manually annotated, i.e. each object in the image is labelled with a rectangular box and its category.
2.2 carrying out optimization processing on the S images in the training set, including turning, cutting, translation, brightness transformation, contrast transformation, saturation transformation, scaling and standardization to obtain an optimized training set D t The method comprises the following steps:
2.2.1 order variable s =1, initialize the optimized training set D t Is empty;
2.2.2 overturning the s image in the training set by adopting a random overturning method to obtain the s overturned image, wherein the random probability of the random overturning method is 0.5;
2.2.3 Randomly crop the s-th flipped image using a minimum intersection-over-union (IoU) constraint to obtain the s-th cropped image; the minimum IoU used is 0.3 (the same value is used for the minimum size ratio).
2.2.4, carrying out random image translation on the s-th cut image to obtain an s-th translated image;
2.2.5, performing brightness conversion on the s-th translated image by adopting random brightness to obtain an s-th brightness-converted image; the random luminance takes a luminance difference value of 32.
2.2.6, carrying out contrast conversion processing on the image after the s-th brightness conversion by adopting random contrast to obtain an image after the s-th contrast conversion; the random contrast ratio ranges from (0.5,1.5).
2.2.7, performing saturation conversion on the image after the s-th contrast conversion by adopting random saturation to obtain an image after the s-th saturation conversion; the saturation range for random saturation is (0.5,1.5).
2.2.8 adopting a scaling operation to scale the s-th image after saturation transformation to 512 multiplied by 512 to obtain an s-th scaled image;
2.2.9 standardizes the s scaled image by adopting standardization operation to obtain the s standard image, and puts the s standard image into the optimized training set D t In (1).
If S is less than or equal to S, making S = S +1, and rotating by 2.2.2; if s>S, obtaining an optimized training set D consisting of S standard images t Turn 2.3.
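For orientation, the per-image part of this augmentation pipeline roughly corresponds to the torchvision sketch below. It only transforms the image tensor: in a real implementation the flip, crop and translation steps must also update the annotation boxes, and the min-IoU crop and random translation have no torchvision built-in, so they appear as commented, hypothetical callables. The normalization statistics are an assumption (ImageNet values); the patent does not state them.

```python
from torchvision import transforms

# brightness delta 32 on a 0-255 scale ~ 32/255 as a ColorJitter factor (approximation)
train_transform = transforms.Compose([
    transforms.RandomHorizontalFlip(p=0.5),                # 2.2.2
    # MinIoURandomCrop(min_iou=0.3, min_size_ratio=0.3),   # 2.2.3, custom helper
    # RandomTranslate(),                                   # 2.2.4, custom helper
    transforms.ColorJitter(brightness=32 / 255,            # 2.2.5
                           contrast=(0.5, 1.5),            # 2.2.6
                           saturation=(0.5, 1.5)),         # 2.2.7
    transforms.Resize((512, 512)),                         # 2.2.8
    transforms.ToTensor(),
    transforms.Normalize(mean=[0.485, 0.456, 0.406],       # 2.2.9; ImageNet stats
                         std=[0.229, 0.224, 0.225]),       # are an assumption
])
```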
2.3 training set D according to optimization t And making a task truth label for model training. The method is divided into four tasks which are respectively a central point prediction task, an angular point prediction task, a rough frame prediction task and a fine frame prediction task, and comprises the following steps:
2.3.1 Let variable s = 1. Let the s-th image of the optimized training set have N_s annotation boxes, let the i-th annotation box be B_si, and let the label category of the i-th annotation box be c_i. B_si is given by the coordinates of its top-left corner point and its bottom-right corner point; N_s is a positive integer and 1 ≤ i ≤ N_s.
2.3.2 Construct the centre-point prediction ground truth H_center^gt for the centre-point prediction task. The method is:
2.3.2.1 Construct an all-zero matrix H_zeros of size (H/4) × (W/4) × C, where C is the number of classification categories of the optimized training set, i.e. the number of annotated target classes in the target detection dataset (80 for the MS COCO dataset and 8 for the Cityscapes dataset), H is the height of the s-th image, and W is its width;
2.3.2.2 Let i = 1, indexing the i-th 4×-downsampled annotation box;
2.3.2.3 Divide the annotation coordinates of B_si by 4 and record the result as the 4×-downsampled annotation box B'_si; its corner coordinates give the top-left, top-right, bottom-left and bottom-right corner positions of B'_si.
2.3.2.4 Using the two-dimensional Gaussian kernel generation method, take the centre point of B'_si as the base point of a two-dimensional Gaussian kernel with variance (σ_x, σ_y), and compute the Gaussian values of all pixel points within the kernel range, obtaining the first Gaussian value set S_ctr (a sketch of this splatting follows 2.3.2.7 below). The specific steps are:
2.3.2.4.1 Let the number of pixel points inside the two-dimensional Gaussian kernel be N_pixel, a positive integer, and let the first Gaussian value set S_ctr be empty;
2.3.2.4.2 Let p = 1 index the pixel points inside the two-dimensional Gaussian kernel, 1 ≤ p ≤ N_pixel;
2.3.2.4.3 In the s-th image, with (x_0, y_0) as the base point, the two-dimensional Gaussian value K(x_p, y_p) of any pixel point (x_p, y_p) within the kernel range is given by equation (1):

K(x_p, y_p) = exp( -( (x_p - x_0)^2 / (2σ_x^2) + (y_p - y_0)^2 / (2σ_y^2) ) )    (1)

where (x_0, y_0) is the base point of the two-dimensional Gaussian kernel, i.e. its centre (it may be the centre point of B'_si or a corner point of B'_si); x_0 is the coordinate of the base point along the width direction and y_0 its coordinate along the height direction. (x_p, y_p) is a pixel point within the Gaussian kernel range of the base point (x_0, y_0); x_p is its coordinate along the width direction and y_p its coordinate along the height direction. Both (x_0, y_0) and (x_p, y_p) lie in the 4×-downsampled image coordinate system. σ_x^2 is the variance of the two-dimensional Gaussian kernel along the width direction and σ_y^2 its variance along the height direction; the number of points within the kernel range is controlled by controlling these two variances. w is the width of B'_si at the feature-map scale, h is its height at the feature-map scale, and α is the ratio parameter of the central region within B'_si, set to 0.54. Store (x_p, y_p) and the computed K(x_p, y_p) in the first Gaussian value set S_ctr;
2.3.2.4.4 Let p = p + 1. If p ≤ N_pixel, go to 2.3.2.4.3; if p > N_pixel, the coordinates and two-dimensional Gaussian values inside the Gaussian kernel of B'_si have been stored in S_ctr, which now contains N_pixel pixel points and their Gaussian values; go to 2.3.2.5;
2.3.2.5 Assign the values of S_ctr to H_zeros: for each element (x_p, y_p) with value K(x_p, y_p) in S_ctr, assign according to the rule H_zeros[x_p, y_p, c_i] = K(x_p, y_p), where c_i is the category number of B'_si, 1 ≤ c_i ≤ C and c_i is a positive integer;
2.3.2.6 Let i = i + 1. If i ≤ N_s, go to 2.3.2.3; if i > N_s, the two-dimensional Gaussian values generated from all N_s 4×-downsampled annotation boxes of the s-th image have been assigned to H_zeros; go to 2.3.2.7;
2.3.2.7 Let the centre-point prediction ground truth of the s-th image be H_center^gt = H_zeros.
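A short sketch of this Gaussian splatting is given below (NumPy, one class channel). The relation between the box size and the Gaussian spread, sigma_x = alpha*w/6 and sigma_y = alpha*h/6, and the max-merge of overlapping kernels are assumptions consistent with the alpha = 0.54 central-region ratio; the patent defines the exact variance through equation (1).

```python
import numpy as np

def draw_gaussian(heatmap, cx, cy, w, h, alpha=0.54):
    """Splat a 2-D Gaussian centred at (cx, cy) (stride-4 coordinates) onto one
    float channel of the centre-point heatmap, as in 2.3.2.4."""
    sigma_x, sigma_y = alpha * w / 6.0, alpha * h / 6.0   # assumed size->sigma mapping
    H, W = heatmap.shape
    ys, xs = np.mgrid[0:H, 0:W]
    g = np.exp(-((xs - cx) ** 2 / (2 * sigma_x ** 2)
                 + (ys - cy) ** 2 / (2 * sigma_y ** 2)))  # equation (1)
    np.maximum(heatmap, g, out=heatmap)                   # keep larger value on overlaps
    return heatmap
```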
2.3.3 Construct the corner prediction ground truth H_corner^gt for the corner prediction task. The method is:
2.3.3.1 Construct an all-zero matrix H_zeros^corner of size (H/4) × (W/4) × 4, where "4" is the number of corner points of a 4×-downsampled annotation box and also the number of channels of the matrix;
2.3.3.2 Let i = 1, indexing the i-th 4×-downsampled annotation box;
2.3.3.3 Let the base point of the two-dimensional Gaussian kernel be the top-left corner point of B'_si. Using the two-dimensional Gaussian kernel generation method of 2.3.2.4, take this corner as the base point of a two-dimensional Gaussian kernel with variance (σ_x, σ_y) and compute the Gaussian values of all pixel points within the kernel range, obtaining the second Gaussian value set S_tl;
2.3.3.4 Assign the element coordinates and Gaussian values of S_tl to the 1st channel of H_zeros^corner, i.e. assign according to the rule H_zeros^corner[x_p, y_p, 1] = K(x_p, y_p);
2.3.3.5 Let the base point of the two-dimensional Gaussian kernel be the top-right corner point of B'_si. Using the method of 2.3.2.4, compute the Gaussian values of all pixel points within the kernel range, obtaining the third Gaussian value set S_tr;
2.3.3.6 Assign the element coordinates and Gaussian values of S_tr to the 2nd channel of H_zeros^corner;
2.3.3.7 Let the base point of the two-dimensional Gaussian kernel be the bottom-left corner point of B'_si. Using the method of 2.3.2.4, compute the Gaussian values of all pixel points within the kernel range, obtaining the fourth Gaussian value set S_dl;
2.3.3.8 Assign the element coordinates and Gaussian values of S_dl to the 3rd channel of H_zeros^corner;
2.3.3.9 Let the base point of the two-dimensional Gaussian kernel be the bottom-right corner point of B'_si. Using the method of 2.3.2.4, compute the Gaussian values of all pixel points within the kernel range, obtaining the fifth Gaussian value set S_dr;
2.3.3.10 Assign the element coordinates and Gaussian values of S_dr to the 4th channel of H_zeros^corner;
2.3.3.11 Let i = i + 1. If i ≤ N_s, go to 2.3.3.3; if i > N_s, the two-dimensional Gaussian values generated from all N_s 4×-downsampled annotation boxes of the s-th image have been assigned to H_zeros^corner; go to 2.3.3.12;
2.3.3.12 Let the corner prediction ground truth of the s-th image be H_corner^gt = H_zeros^corner.
2.3.4 From the N_s 4×-downsampled annotation boxes of the s-th image, construct the coarse-box ground truth B_coarse^gt of the s-th image for the coarse-box prediction task. The method is:
2.3.4.1 Construct an all-zero matrix B_zeros of size (H/4) × (W/4) × 4, where "4" stands for the 4 coordinates of a 4×-downsampled annotation box;
2.3.4.2 Let i = 1, indexing the i-th 4×-downsampled annotation box;
2.3.4.3 Assign values to the pixels of B_zeros that lie inside the i-th 4×-downsampled annotation box B'_si, i.e. write the coordinate values of B'_si into the 4 channels of B_zeros at those pixel positions;
2.3.4.4 Let i = i + 1. If i ≤ N_s, go to 2.3.4.3; if i > N_s, the coarse-box ground-truth values corresponding to all N_s annotation boxes of the s-th image have been written into B_zeros, and the assigned B_zeros is the ground-truth label of the s-th image; go to 2.3.4.5;
2.3.4.5 Let the coarse-box ground truth of the s-th image be B_coarse^gt = B_zeros.
2.3.5 From B_coarse^gt, construct the fine-box ground truth B_refine^gt for the fine-box prediction task; the value of B_refine^gt is equal to that of B_coarse^gt, i.e. B_refine^gt = B_coarse^gt.
2.3.6 Let s = s + 1. If s ≤ S, go to 2.3.2; if s > S, go to 2.3.7.
2.3.7 The task ground-truth labels of the S images for model training have been obtained; the S images together with their labels form the training set D_M used for model training.
2.4 adopting an image scaling standardization method to optimize the V images in the verification set to obtain a new verification set D consisting of the V scaled and standardized images V The method comprises the following steps:
2.4.1 let variable v =1;
2.4.2 adopting zooming operation to zoom the v-th image in the verification set to 512 multiplied by 512 to obtain a zoomed image of the v-th image;
2.4.3 standardizing the zoomed image of the v th image by adopting a standardization operation to obtain a standardized image of the v th image.
2.4.4 If v ≤ V, let v = v + 1 and go to 2.4.2; if v > V, the new validation set D_V consisting of the V scaled and normalized images has been obtained; go to 2.5.
2.5 adopting the image scaling standardization method of 2.4 steps to carry out optimization processing on the T images in the test set to obtain a new test set D consisting of the images after the T images are scaled and standardized T
Thirdly, train the target detection system constructed in the first step with a gradient back-propagation method to obtain N_m sets of model parameters. The method comprises the following steps:
3.1 initializing the network weight parameters of each module in the target detection system. Initializing parameters of a DarkNet-53 convolutional neural network in a main characteristic extraction module by adopting a pre-training model trained on an ImageNet data set (https:// www.image-net.org /); and initializing other network weight parameters (a characteristic pyramid network, a characteristic self-adaptive aggregation module, an auxiliary task module and a main task module network weight parameter in a main characteristic module) by adopting normal distribution with the mean value of 0 and the variance of 0.01.
3.2 setting the training parameters of the target detection system. The initial learning rate learning _ rate is set to 0.01, and the learning rate attenuation factor is set to 0.1, i.e., the learning rate is reduced by a factor of 10 (attenuation is performed at training steps of 80 and 110). Random gradient descent (SGD) was selected as a model training optimizer with a hyper-parameter "momentum" of 0.9 and a weight decay "of 0.0004. The batch size (mini _ batch _ size) of the network training is 64. The maximum training step size (maxepoch) is 120.
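These hyper-parameters map directly onto a standard PyTorch optimizer and step scheduler; the sketch below also includes the last-N_m checkpointing described in 3.3. `detector` and `train_one_epoch` are placeholder names, not the patent's code.

```python
import torch

# From 3.2: SGD, lr 0.01, momentum 0.9, weight decay 4e-4, batch 64, 120 epochs,
# lr x0.1 at epochs 80 and 110; keep the weights of the last N_m = 10 epochs.
optimizer = torch.optim.SGD(detector.parameters(), lr=0.01,
                            momentum=0.9, weight_decay=0.0004)
scheduler = torch.optim.lr_scheduler.MultiStepLR(optimizer,
                                                 milestones=[80, 110], gamma=0.1)

for epoch in range(1, 121):
    train_one_epoch(detector, optimizer)       # hypothetical training helper
    scheduler.step()
    if epoch > 120 - 10:                       # save the last N_m checkpoints
        torch.save(detector.state_dict(), f"epoch_{epoch}.pth")
```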
3.3 Train the target detection system. The method is to take the differences between the coarse-box predicted positions, the fine-box predicted positions, the corner prediction heatmap and the centre-point prediction heatmap output by the target detection system in one training pass and their ground-truth values as the loss value (loss), and to update the network weight parameters by gradient back-propagation until the loss value reaches a threshold or the training step reaches maxepoch. The network weight parameters of each of the last N_m training epochs (N_m is set to 10 in this embodiment) are saved. The method comprises the following steps:
3.3.1 Let the training step epoch = 1 (one pass over all data in the training set is one epoch) and initialize the batch index N_b = 1;
3.3.2 The main feature extraction module reads the N_b-th batch, B = 64 images in total, from D_M; the B images are recorded as the matrix I_train, which contains B images of size H × W × 3, where H is the height of the input image, W its width, and "3" stands for the three RGB channels of the image.
3.3.3 The main feature extraction module extracts the multi-scale features of I_train with the main feature extraction method and sends the multi-scale feature map containing the multi-scale features of I_train to the feature adaptive aggregation module. The method is:
3.3.3.1 The DarkNet-53 convolutional neural network of the main feature extraction module extracts the image features of I_train to obtain the backbone feature-map set. The method is: the 5 serial sub-networks of the DarkNet-53 convolutional neural network downsample the B images of I_train and extract their features, obtaining the backbone features, i.e. 4 feature maps (the outputs of the last four serial sub-networks), which are sent to the feature pyramid network.
3.3.3.2 The feature pyramid network receives the 4 feature maps from the DarkNet-53 convolutional neural network and performs upsampling, feature extraction and feature fusion on them to obtain 3 multi-scale feature maps, denoted F_1, F_2, F_3; the multi-scale feature map set {F_1, F_2, F_3} is sent to the feature adaptive aggregation module.
3.3.4 The feature adaptive aggregation module receives the multi-scale feature map set {F_1, F_2, F_3} from the feature pyramid network, generates the multi-scale-aware high-pixel feature map F_H and sends F_H to the auxiliary task module; it also generates the boundary-region-aware high-pixel feature map and the salient-region-aware high-pixel feature map and sends them to the main task module. The method is:
3.3.4.1 The adaptive multi-scale feature aggregation network receives {F_1, F_2, F_3} from the feature pyramid network and applies the adaptive multi-scale feature aggregation method (channel self-attention enhancement, bilinear interpolation upsampling and scale-level soft-weight aggregation) to {F_1, F_2, F_3} to obtain the multi-scale-aware high-pixel feature map F_H. The resolution of F_H is (H/4) × (W/4), and the number of feature-map channels of F_H is 64. The specific method is:
3.3.4.1.1 The adaptive multi-scale feature aggregation network applies the first, second and third SE networks to F_1, F_2 and F_3 in parallel for channel self-attention enhancement: the first SE network applies a weighted summation over the channels of F_1 to obtain the first channel-enhanced image; at the same time the second SE network does the same for F_2 to obtain the second channel-enhanced image, and the third SE network does the same for F_3 to obtain the third channel-enhanced image.
3.3.4.1.2 The first, second and third SE networks of the adaptive multi-scale feature aggregation network then use bilinear interpolation in parallel to upsample the three channel-enhanced images to the same resolution, (H/4) × (W/4), obtaining the upsampled feature maps F_1^up, F_2^up, F_3^up, which form the upsampled feature-map set {F_1^up, F_2^up, F_3^up}. The computation is given by equation (2):

F_l^up = Upsample(SE_n(F_l))    (2)

where SE_n denotes the n-th SE network, F_l denotes the l-th multi-scale feature map, Upsample denotes bilinear interpolation upsampling, 1 ≤ l ≤ 3, 1 ≤ n ≤ 3, and n = l.
3.3.4.1.3 The adaptive multi-scale feature aggregation network computes weights for {F_1^up, F_2^up, F_3^up} with a 1 × 1 convolution that reduces the number of channels from 64 to 1, and then applies a Softmax over the scale dimension to obtain soft-weight maps A_1, A_2, A_3 of the same spatial size as the F_l^up. The value of a pixel in a soft-weight map indicates which of the three scales F_1^up, F_2^up, F_3^up should receive more attention at that position, i.e. which scale is weighted more heavily, so that objects of different sizes respond on feature maps of different scales.
3.3.4.1.4 The adaptive multi-scale feature aggregation network multiplies the weight map of the l-th scale, A_l, element by element with the corresponding l-th upsampled feature map F_l^up: A_1 is multiplied element by element with F_1^up, A_2 with F_2^up, and A_3 with F_3^up, giving 3 products. The 3 products are then summed in a weighted manner and fused into a single feature map, the fused feature map. Finally, a fourth SE network enhances the channel representation of the fused feature map, yielding the multi-scale-aware high-pixel feature map F_H. The process is given by equation (3):

F_H = SE_4( Σ_{l=1}^{3} A_l × F_l^up ),  with A_l obtained by applying the Softmax over the scale dimension to Conv(F_l^up)    (3)

where SE_4 is the fourth SE network, A_l indicates the weight that the element at the same position occupies at the l-th scale, "×" denotes the product of corresponding position elements, and Conv denotes the 1 × 1 convolution. The adaptive multi-scale feature aggregation network sends F_H to the auxiliary task module, the coarse-box prediction network and the adaptive spatial feature aggregation network.
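One way to realise 3.3.4.1 in code is sketched below (PyTorch): per-scale SE blocks, bilinear upsampling to the highest resolution, a 1 × 1 convolution plus softmax over the scale dimension for the soft weights, weighted fusion, and a fourth SE block. Whether the 1 × 1 weight convolution is shared across scales, and the SE reduction ratio, are assumptions.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class SEBlock(nn.Module):
    """Standard squeeze-and-excitation channel re-weighting (assumed form)."""
    def __init__(self, c, r=4):
        super().__init__()
        self.fc = nn.Sequential(nn.Linear(c, c // r), nn.ReLU(inplace=True),
                                nn.Linear(c // r, c), nn.Sigmoid())

    def forward(self, x):
        w = self.fc(x.mean(dim=(2, 3)))                      # (B, C) channel weights
        return x * w[:, :, None, None]

class AdaptiveMultiScaleAggregation(nn.Module):
    """Sketch of 3.3.4.1: SE enhancement, upsampling, soft scale weights, fusion."""
    def __init__(self, c=64, num_scales=3):
        super().__init__()
        self.se = nn.ModuleList(SEBlock(c) for _ in range(num_scales))
        self.se_out = SEBlock(c)                             # fourth SE network
        self.weight_conv = nn.Conv2d(c, 1, kernel_size=1)    # assumed shared across scales

    def forward(self, feats):                                # feats: high -> low resolution
        size = feats[0].shape[-2:]
        ups = [F.interpolate(se(f), size=size, mode="bilinear", align_corners=False)
               for se, f in zip(self.se, feats)]
        logits = torch.cat([self.weight_conv(u) for u in ups], dim=1)  # (B, L, H, W)
        w = logits.softmax(dim=1)                            # scale-level soft weights A_l
        fused = sum(w[:, l:l + 1] * ups[l] for l in range(len(ups)))
        return self.se_out(fused)                            # F_H
```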
3.3.4.2 The coarse-box prediction network receives the multi-scale-aware high-pixel feature map F_H from the adaptive multi-scale feature aggregation network and, with the coarse-box prediction method, predicts a coarse-box position for every feature point of F_H, generating the coarse-box predicted positions B_coarse; B_coarse is sent to the adaptive spatial feature aggregation network. B_coarse also has a resolution of (H/4) × (W/4) and 4 channels; the 4 channels represent the distances from each pixel point to the box sides in the up, down, left and right directions, so every pixel point forms a coarse box. B_coarse is used to limit the sampling range of the deformable convolution in the adaptive spatial feature aggregation network. In addition, the loss L_coarse between B_coarse and the coarse-box ground truth B_coarse^gt constructed in 2.3.4 is computed. The loss calculation is based on the GIoU loss (see the literature "Rezatofighi H, Tsoi N, Gwak J Y, et al. Generalized intersection over union: A metric and a loss for bounding box regression [C] // Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. 2019: 658-666", i.e. generalized intersection over union: a metric and a loss for bounding-box regression), as shown in equation (4):

L_coarse = (1/N_b) Σ_{(i,j)∈S_b} W_ij · (1 - GIoU(B_coarse(i,j), B_coarse^gt(i,j)))    (4)

where S_b is the regression sample set, consisting of the pixels where B_coarse^gt is not 0; N_b is the number of samples in the set; and W_ij is the weight at a position (i, j) where B_coarse^gt is not 0, used to apply a larger loss weight to pixel points in the centre region so that they regress the annotation-box position more accurately.
3.3.4.3 The adaptive spatial feature aggregation network receives the multi-scale-aware high-pixel feature map F_H from the adaptive multi-scale feature aggregation network and the coarse-box predicted positions B_coarse from the coarse-box prediction network, and generates the boundary-region-aware high-pixel feature map F_HR and the salient-region-aware high-pixel feature map F_HS. The method is:
3.3.4.3.1 Design the region-limited deformable convolution (R-DConv). The specific method is:
3.3.4.3.1.1 Design an offset transfer function T that transforms the offset Δp of the deformable convolution (Δp is a learnable offset relative to the feature point; see the literature "Zhu X, Hu H, Lin S, et al. Deformable ConvNets v2: More deformable, better results [C] // Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. 2019: 9308-9316") into a transformed offset, so that the difficulty of adaptively learning to sample the most representative feature points becomes consistent for objects of different sizes. T limits the offset range of the spatial sampling points of the deformable convolution to within B_coarse while also differentiating the offset Δp of the deformable convolution. Because the spatial sampling range of a large object is wider than that of a small object, the corresponding search difficulty differs. To resolve this inconsistency in the search difficulty of spatial feature points for objects of different sizes, a Sigmoid function is applied to normalize the offset Δp within B_coarse so that Δp lies in the interval [0, 1]. After this processing, searching for the point features with the strongest representation capability becomes equally difficult for objects of different sizes. Accordingly, Δp is split into h_Δp and w_Δp, where h_Δp denotes the offset of Δp in the vertical direction and w_Δp its offset in the horizontal direction. The offset transfer function T is given by equation (5):

T(Δp) = ( T_v(h_Δp), T_h(w_Δp) )    (5)

where T_v denotes the offset transfer function in the vertical direction, T_h the offset transfer function in the horizontal direction, and T the overall offset transfer function; (t, l, r, d) are the distances from the convolution-kernel position p to the sides of B_coarse in the up, left, right and down directions.
3.3.4.3.1.2 Use T to limit the sampling region of the deformable convolution. Given a 3 × 3 convolution kernel with K = 9 spatial sampling points, let w_k denote the convolution-kernel weight of the k-th position and P_k the predefined position offset of the k-th position, with P_k ∈ {(-1, -1), (-1, 0), …, (1, 1)} covering the 3 × 3 range centred at (0, 0). Let x(p) denote the input feature map at the convolution-kernel centre position p and y(p) the output feature map at p. y(p) is computed with R-DConv as shown in equation (6):

y(p) = Σ_{k=1}^{K} w_k · x( p + P_k + T(Δp_k) ) · Δm_k    (6)
wherein Δ p k Denotes the learnable offset, Δ m, of the k-th position k Representing the weight of the kth position. Δ p k And Δ m k Generated by a 3 × 3 convolution, the 3 × 3 convolution generates a feature map of 27 channels, 9 of which are Δ p k Abscissa offset value, Δ p for 9 channels k Ordinate offset value, 9 channels (representing weights for different offset value characteristics) of Δ m k The value of (c). B coarse A coarse box, representing a prediction on the scale of the current feature map, is also a predefined bounding region.
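A sketch of this region-limited offset idea, built on torchvision's modulated deformable convolution, is given below; the Sigmoid-based clamping follows equations (5)-(6) above, while the class name, the assumed (dy, dx) offset ordering and the unit convention for the coarse-box distances are illustrative assumptions rather than the patent's reference implementation.

```python
import torch
import torch.nn as nn
from torchvision.ops import deform_conv2d

class RegionLimitedDeformConv(nn.Module):
    """Sketch of R-DConv: raw offsets are squashed with a Sigmoid and rescaled so that
    every sampling point stays inside the coarse box predicted at that pixel."""
    def __init__(self, in_ch, out_ch, k=3):
        super().__init__()
        self.k = k
        self.weight = nn.Parameter(torch.randn(out_ch, in_ch, k, k) * 0.01)
        self.offset_conv = nn.Conv2d(in_ch, 3 * k * k, 3, padding=1)  # 2*K offsets + K modulation weights

    def forward(self, feat, coarse_tlrd):
        # coarse_tlrd: (B, 4, H, W) distances (t, l, r, d) from each pixel to the coarse-box
        # sides, assumed to be expressed in feature-map units
        B, _, H, W = feat.shape
        K = self.k * self.k
        out = self.offset_conv(feat)
        raw_off, mask = out[:, :2 * K], out[:, 2 * K:].sigmoid()
        dyx = raw_off.view(B, K, 2, H, W)              # assumed (dy, dx) ordering per kernel point
        t, l, r, d = coarse_tlrd.unbind(dim=1)
        # Sigmoid-normalise, then rescale into [-t, d] vertically and [-l, r] horizontally
        # (measured from the kernel centre; the regular grid offset P_k is ignored in this sketch)
        dy = dyx[:, :, 0].sigmoid() * (t + d).unsqueeze(1) - t.unsqueeze(1)
        dx = dyx[:, :, 1].sigmoid() * (l + r).unsqueeze(1) - l.unsqueeze(1)
        offset = torch.stack([dy, dx], dim=2).view(B, 2 * K, H, W)
        return deform_conv2d(feat, offset, self.weight, padding=self.k // 2, mask=mask)
```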
3.3.4.3.2 To let R-DConv learn the salient region of an object within the coarse-box range and extract features that make object classification more accurate, a classification adaptive spatial feature aggregation method is adopted, using B_coarse to limit the sampling range while aggregating features on F_H. The method is:

3.3.4.3.2.1 Let the classification offset transfer function f_cls(·) adopt the offset transfer function f(·) designed in 3.3.4.3.1.1, and compute the output feature y_cls(p) at position p using equation (6).

3.3.4.3.2.2 Traverse F_H with the f_cls convolution kernel to obtain the salient-region-aware high-pixel feature map F_HS. f_cls allows the sampling points to be concentrated, so that the classification branch can focus on the most discriminative salient regions. Thus f_cls enables R-DConv to learn the salient region of the object within the coarse-box range and to extract the features that make object classification more accurate, namely the salient-region-aware high-pixel feature map F_HS; F_HS is sent to the main task module.
3.3.4.3.3 To let R-DConv learn the boundary region information of an object within the coarse-box range and extract features that make object position regression more accurate, a regression adaptive spatial feature aggregation method is adopted, using B_coarse to limit the sampling range while aggregating features on F_H. The regression adaptive spatial feature aggregation method is specifically:

3.3.4.3.3.1 Design the regression offset transfer function f_reg(·) to transform the offset Δp of the deformable convolution. f_reg divides the spatial sampling points of the R-DConv operation evenly along the up, down, left and right directions, so that the limited region is split into four sub-regions corresponding to upper-left, upper-right, lower-left and lower-right. f_reg samples each of the four sub-regions uniformly, i.e. assigns an equal number of sampling points to each sub-region. In this way the spatial sampling points of the R-DConv operation are dispersed, so that features carrying more information from the boundary can be extracted and the object position can be regressed more accurately. With K = 9, f_reg samples two points from each of the four sub-regions and adds one centre point, forming a 3 × 3 convolution kernel and strengthening the centre feature point's capture of boundary information. The regression offset transfer function f_reg(·) is shown in equation (7):
    f_reg(Δp) = ( −σ(h_Δp)·t, −σ(w_Δp)·l )  for a sampling point assigned to the upper-left sub-region,
                ( −σ(h_Δp)·t, +σ(w_Δp)·r )  for the upper-right sub-region,
                ( +σ(h_Δp)·d, −σ(w_Δp)·l )  for the lower-left sub-region,
                ( +σ(h_Δp)·d, +σ(w_Δp)·r )  for the lower-right sub-region        (7)

where σ(·) is the Sigmoid function used to normalize the offset within the coarse-box interval; the normalization balances the sampling difficulty of objects of different sizes.
Substituting f_reg(·) into equation (6) in place of f(·) gives the output feature y_reg(p) at position p. Thus f_reg enables R-DConv to learn the object boundary region within the coarse-box range and to extract the features that make the predicted box regress to a more accurate position, namely the boundary-region-aware high-pixel feature map F_HR.

3.3.4.3.3.2 Traverse F_H with the f_reg convolution kernel to obtain the boundary-region-aware high-pixel feature map F_HR, and send F_HR to the main task module.
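The quadrant-wise dispersion of sampling points used by f_reg can be sketched as follows; the quadrant assignment of the 9 kernel points and the sign convention are assumptions consistent with the description above, not the patent's exact equation (7).

```python
import torch

# Quadrant assignment for the K = 9 sampling points of a 3x3 kernel:
# two points per sub-region (tl, tr, dl, dr) plus one centre point (index 4).
# The sign pair (sy, sx) selects the vertical/horizontal half of the coarse box.
QUADRANTS = [(-1, -1), (-1, -1), (-1, +1), (-1, +1),
             (0, 0),                                    # centre point, kept at the kernel centre
             (+1, -1), (+1, -1), (+1, +1), (+1, +1)]

def regression_offsets(raw_dy, raw_dx, t, l, r, d):
    """raw_dy / raw_dx: (B, 9, H, W) raw learnable offsets; t, l, r, d: (B, H, W) distances
    from each pixel to the coarse-box sides. Returns offsets confined to the sub-region
    each sampling point is assigned to (a sketch, not the patent's exact formula)."""
    dys, dxs = [], []
    for k, (sy, sx) in enumerate(QUADRANTS):
        if sy == 0 and sx == 0:
            dys.append(torch.zeros_like(t)); dxs.append(torch.zeros_like(t))
            continue
        span_y = d if sy > 0 else t          # how far this point may move vertically
        span_x = r if sx > 0 else l          # how far it may move horizontally
        dys.append(sy * raw_dy[:, k].sigmoid() * span_y)
        dxs.append(sx * raw_dx[:, k].sigmoid() * span_x)
    return torch.stack(dys, dim=1), torch.stack(dxs, dim=1)
```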
3.3.5 The auxiliary task module receives F_H from the adaptive multi-scale feature aggregation network and processes it with two 3 × 3 convolution layers, one 1 × 1 convolution layer and a sigmoid function to obtain the corner prediction heatmap H_corner. H_corner has resolution (H/4) × (W/4) and 4 channels. The loss L_corner between H_corner and the corner prediction ground truth Ĥ_corner constructed in 2.3.3 is computed. L_corner is based on a modified version of Focal Loss, as shown in equation (8):

    L_corner = −(1/N_s) · Σ_{c,i,j} { (1 − H_cij)^{α_l} · log(H_cij),                       if Ĥ_cij = 1
                                      (1 − Ĥ_cij)^{β} · (H_cij)^{α_l} · log(1 − H_cij),     otherwise        (8)

where N_s is the number of annotation boxes in the image, and α_l and β are hyper-parameters, set to 2 and 4 respectively, used to control the gradient distribution of the loss function. H_cij is the corner prediction value output by the auxiliary task module at channel c and pixel position (i, j), and Ĥ_cij is the corner ground-truth value at channel c and pixel position (i, j). The auxiliary task module learns to locate the four corner positions of the annotation boxes and assists the training of the target detection network, making the extracted features pay more attention to the object corner positions, so that the target detection system localizes objects more accurately.
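A compact PyTorch sketch of this penalty-reduced focal loss is given below; the function name and the optional num_boxes argument are illustrative, and the same form is reused for the centre-point loss in 3.3.7.

```python
import torch

def modified_focal_loss(pred, gt, alpha=2.0, beta=4.0, num_boxes=None):
    """CornerNet/CenterNet-style penalty-reduced focal loss for the corner and centre-point
    heatmaps (a sketch; pred is the sigmoid output, gt the Gaussian ground-truth heatmap)."""
    pred = pred.clamp(1e-6, 1 - 1e-6)
    pos = gt.eq(1).float()
    neg = 1.0 - pos
    pos_loss = ((1 - pred) ** alpha) * torch.log(pred) * pos
    neg_loss = ((1 - gt) ** beta) * (pred ** alpha) * torch.log(1 - pred) * neg
    n = num_boxes if num_boxes is not None else pos.sum().clamp(min=1)
    return -(pos_loss.sum() + neg_loss.sum()) / n
```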
3.3.6 The fine-box prediction network of the main task module receives the boundary-region-aware high-pixel feature map F_HR from the adaptive spatial feature aggregation network and, after one 1 × 1 convolution layer, obtains the fine-box predicted position B_refine for the feature point positions of F_HR. B_refine has resolution (H/4) × (W/4) and 4 channels. The 4 channels represent the distances from a pixel point to the predicted fine box in the up, down, left and right directions, so each pixel point forms one fine prediction box. The loss L_refine between B_refine and the fine-box ground truth B̂_refine constructed in 2.3.5 is computed. L_refine is based on the GIoU loss, as shown in equation (9):

    L_refine = (1/N_b) · Σ_{(i,j)∈S_b} W_ij · (1 − GIoU(B_refine(i,j), B̂_refine(i,j)))        (9)

where S_b is the regression sample set, i.e. the set of pixels whose value in B̂_refine is not 0; N_b is the number of samples in the regression set; and W_ij is the weight at a position (i,j) where B̂_refine is not 0, used to apply a larger loss weight to pixels in the central region so that these pixels regress the annotated box position more accurately. B_refine reflects the accuracy with which the target detection system regresses object positions.
3.3.7 The centre point prediction network of the main task module receives the salient-region-aware high-pixel feature map F_HS from the adaptive spatial feature aggregation network and, after one 1 × 1 convolution layer and a sigmoid function, obtains the centre point prediction heatmap H_center for the feature point positions of F_HS. H_center has resolution (H/4) × (W/4), and its number of channels equals the number of dataset categories C; C is 80 for the MS COCO dataset and 8 for the Cityscapes dataset. The loss L_center between H_center and the centre point prediction ground truth Ĥ_center constructed in 2.3.2 is computed. L_center is based on a modified version of Focal Loss, as shown in equation (10):

    L_center = −(1/N_s) · Σ_{c,i,j} { (1 − H_cij^center)^{α_l} · log(H_cij^center),                           if Ĥ_cij^center = 1
                                      (1 − Ĥ_cij^center)^{β} · (H_cij^center)^{α_l} · log(1 − H_cij^center),  otherwise        (10)

where N_s is the number of annotation boxes in the image, and α_l and β are hyper-parameters, set to 2 and 4 respectively, used to control the gradient distribution of the loss function. H_cij^center is the centre point prediction heatmap value at channel c and pixel position (i, j), and Ĥ_cij^center is the centre point ground-truth value at channel c and pixel position (i, j). H_center reflects the ability of the target detection system to locate object centres and distinguish object categories.
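For illustration, a minimal PyTorch sketch of the two main-task prediction heads described in 3.3.6 and 3.3.7 follows; the class and attribute names are assumptions, while the layer types and channel counts follow the text.

```python
import torch.nn as nn

class MainTaskHeads(nn.Module):
    """Sketch of the main-task heads: a 1x1 conv fine-box head on F_HR and a
    1x1 conv + sigmoid centre-point head on F_HS."""
    def __init__(self, in_ch=64, num_classes=80):
        super().__init__()
        self.refine_head = nn.Conv2d(in_ch, 4, kernel_size=1)            # (t, l, d, r) distances
        self.center_head = nn.Sequential(
            nn.Conv2d(in_ch, num_classes, kernel_size=1), nn.Sigmoid())  # per-class centre heatmap

    def forward(self, f_hr, f_hs):
        return self.refine_head(f_hr), self.center_head(f_hs)
```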
3.3.8 Design the total loss function L_total of the target detection system, as shown in equation (11):

    L_total = λ_corner · L_corner + λ_center · L_center + λ_coarse · L_coarse + λ_refine · L_refine        (11)

where L_corner is the loss value computed from the corner prediction network output H_corner and the ground truth Ĥ_corner, L_center is the loss value computed from the centre point prediction network output H_center and the ground truth Ĥ_center, L_coarse is the loss value computed from the coarse-box prediction network output B_coarse and the ground truth B̂_coarse, and L_refine is the loss value computed from the fine-box prediction network output B_refine and the ground truth B̂_refine. The corner prediction network loss weight λ_corner, the centre point prediction network loss weight λ_center, the coarse-box prediction network loss weight λ_coarse and the fine-box prediction network loss weight λ_refine are set according to the importance of each task.
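A short PyTorch-style sketch of equation (11) follows; the default weight values are placeholders, since the patent only states that the weights are set according to task importance.

```python
def total_loss(l_corner, l_center, l_coarse, l_refine,
               w_corner=1.0, w_center=1.0, w_coarse=1.0, w_refine=1.0):
    """Weighted sum of the four task losses (equation (11)); weight values are placeholders."""
    return (w_corner * l_corner + w_center * l_center
            + w_coarse * l_coarse + w_refine * l_refine)
```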
3.3.9 Let epoch = epoch + 1. If epoch is 80 or 110, let learning_rate = learning_rate × 0.1 and go to 3.3.10; if epoch is neither 80 nor 110, go directly to 3.3.10;

3.3.10 If epoch ≤ maxepoch, go to 3.3.2; if epoch > maxepoch, training is finished, go to 3.3.11;

3.3.11 Save the network weight parameters of the last N_m epochs.
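The training schedule of 3.3.1-3.3.11 can be sketched as follows (PyTorch); maxepoch, N_m, the SGD hyper-parameters and the assumed compute_total_loss helper are illustrative, not values fixed by the patent.

```python
import torch

def train(model, loader, max_epoch=120, n_m=5, base_lr=0.01):
    """Sketch of the training schedule: SGD, learning rate decayed by 0.1 at epochs 80 and 110,
    and the weights of the last n_m epochs saved."""
    opt = torch.optim.SGD(model.parameters(), lr=base_lr, momentum=0.9, weight_decay=1e-4)
    for epoch in range(1, max_epoch + 1):
        for batch in loader:
            loss = model.compute_total_loss(batch)   # assumed helper returning L_total
            opt.zero_grad(); loss.backward(); opt.step()
        if epoch in (80, 110):                        # decay points from 3.3.9
            for g in opt.param_groups:
                g["lr"] *= 0.1
        if epoch > max_epoch - n_m:                   # keep the last N_m checkpoints
            torch.save(model.state_dict(), f"epoch_{epoch}.pth")
```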
Fourth step: use the validation set to verify the detection precision of the target detection system loaded with each of the saved N_m epochs' network weight parameters, and take the best-performing network weight parameters as the network weight parameters of the target detection system. The method is:

4.1 Let the variable n_m = 1;

4.2 Load the n_m-th of the saved N_m epochs' network weight parameters into the target detection system; input the new validation set D_V, obtained by the image scaling and standardization method of step 2.4, into the target detection system;

4.3 Let v = 1, denoting the v-th image of the validation set; V is the number of images in the validation set;

4.4 The main feature extraction module receives the v-th validation set image D_v, extracts the multi-scale features of D_v using the main feature extraction method described in 3.3.3, and sends the multi-scale feature map containing the multi-scale features of D_v to the feature adaptive aggregation module;

4.5 The adaptive multi-scale feature aggregation network in the feature adaptive aggregation module receives the multi-scale feature map containing the multi-scale features of D_v and, using the adaptive multi-scale feature aggregation method of 3.3.4.1, performs channel self-attention enhancement, bilinear interpolation upsampling and scale-level soft weight aggregation to obtain the multi-scale aware high-pixel feature map F_HV of D_v; F_HV is sent to the coarse-box prediction network and the adaptive spatial feature aggregation network;

4.6 The coarse-box prediction network in the feature adaptive aggregation module receives F_HV and, using the coarse-box prediction method of 3.3.4.2, performs coarse-box position prediction for every feature point position in F_HV, generating the coarse-box predicted position B_HVcoarse of the v-th validation set image D_v; B_HVcoarse is sent to the adaptive spatial feature aggregation network. B_HVcoarse also has resolution (H/4) × (W/4) and 4 channels;
4.7 The adaptive spatial feature aggregation network in the feature adaptive aggregation module receives B_HVcoarse from the coarse-box prediction network and F_HV from the adaptive multi-scale feature aggregation network, and applies the classification adaptive spatial feature aggregation method of 3.3.4.3.2, using B_HVcoarse to limit the sampling range while performing classification-task spatial feature aggregation on F_HV, obtaining the salient-region-aware high-pixel feature map of the v-th validation set image D_v; this salient-region-aware high-pixel feature map is sent to the centre point prediction network;

4.8 The adaptive spatial feature aggregation network in the feature adaptive aggregation module applies the regression adaptive spatial feature aggregation method of 3.3.4.3.3, using B_HVcoarse to limit the sampling range while performing regression-task spatial feature aggregation on F_HV, obtaining the boundary-region-aware high-pixel feature map of the v-th validation set image D_v; this boundary-region-aware high-pixel feature map is sent to the fine-box prediction network;

4.9 The fine-box prediction network in the main task module receives the boundary-region-aware high-pixel feature map and, after a 1 × 1 convolution, obtains the fine-box predicted positions of the objects in the v-th validation set image D_v, which are sent to the post-processing module;

4.10 The centre point prediction network in the main task module receives the salient-region-aware high-pixel feature map of the v-th validation set image D_v and, after a 1 × 1 convolution layer, obtains the centre point prediction heatmap of D_v, which is sent to the post-processing module;
4.11 The post-processing module receives the fine-box predicted positions and the centre point prediction heatmap of the v-th validation image D_v and applies the overlapping-pseudo-box removal method to them, obtaining the predicted object box set of D_v. The specific method is:

4.11.1 The post-processing module performs a 3 × 3 max-pooling operation (2D max-pooling) on the centre point prediction heatmap of the v-th validation image D_v to extract the set of peak points of the heatmap; each peak point represents a centre region point inside a predicted object;

4.11.2 For a peak point with coordinates (P_x, P_y) obtained from the centre point prediction heatmap of D_v, the post-processing module reads from the fine-box predicted positions of D_v the distance information (t, l, d, r) of the peak point in the up, left, down and right directions, obtaining a prediction box of D_v: B_p = (P_x − l, P_y − t, P_x + r, P_y + d). The category of B_p is the channel with the maximum centre point heatmap value at (P_x, P_y), denoted c_p; the confidence of B_p is the heatmap value of channel c_p at (P_x, P_y), denoted s_p;

4.11.3 The post-processing module keeps the prediction boxes in the v-th validation image D_v whose confidence s_p is greater than a confidence threshold (typically set to 0.3), forming the object box prediction set of D_v; each kept prediction box B_p retains its category c_p information;
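A minimal PyTorch sketch of this overlapping-pseudo-box removal follows; the (t, l, d, r) channel order, the top-k truncation and the feature-map stride are assumptions of the sketch.

```python
import torch
import torch.nn.functional as F

def decode_detections(center_heatmap, refine_dist, stride=4, conf_thresh=0.3, topk=100):
    """A 3x3 max-pooling keeps only local peaks of the centre-point heatmap; each surviving
    peak is combined with the fine-box distances (t, l, d, r) at that pixel to form a box."""
    B, C, H, W = center_heatmap.shape
    pooled = F.max_pool2d(center_heatmap, 3, stride=1, padding=1)
    peaks = center_heatmap * (pooled == center_heatmap).float()   # non-peak positions zeroed out
    scores, inds = peaks.view(B, -1).topk(topk)
    classes = inds // (H * W)
    ys = (inds % (H * W)) // W
    xs = inds % W
    results = []
    for b in range(B):
        keep = scores[b] > conf_thresh
        t, l, d, r = refine_dist[b, :, ys[b][keep], xs[b][keep]]
        cx, cy = xs[b][keep] * stride, ys[b][keep] * stride
        boxes = torch.stack([cx - l, cy - t, cx + r, cy + d], dim=1)  # (x1, y1, x2, y2)
        results.append((boxes, classes[b][keep], scores[b][keep]))
    return results
```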
4.12 Let v = v + 1; if v ≤ V, go to 4.4; if v > V, the object box prediction sets of the V validation images of the n_m-th model have been obtained, go to 4.13;

4.13 If the validation set is the public general-scene dataset MS COCO, test the precision of the final object box prediction sets output by the target detection system using the standard MS COCO evaluation protocol (https://cocodataset.org/), record the precision, and go to 4.14; if the validation set is the Cityscapes autonomous-driving dataset, test the precision using the Cityscapes evaluation protocol (https://www.cityscapes-dataset.com/), record the precision, and go to 4.14;

4.14 Let n_m = n_m + 1; if n_m ≤ N_m, go to 4.2; if n_m > N_m, the precision tests of the N_m models are complete, go to 4.15;

4.15 From the precisions of the object box prediction sets of the N_m models, select the one with the highest precision, find the target detection system weight parameters corresponding to it, take them as the selected weight parameters of the target detection system, and load them into the target detection system; the target detection system loaded with the selected weight parameters becomes the trained target detection system.
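This fourth step can be sketched as a simple model-selection loop (PyTorch); evaluate_fn is an assumed callable that runs steps 4.3-4.13 on the validation set and returns the resulting AP.

```python
import torch

def select_best_checkpoint(model, checkpoints, evaluate_fn):
    """Load each of the N_m saved weight files, run the validation pipeline,
    and keep the weights with the highest validation AP (a sketch)."""
    best_ap, best_ckpt = -1.0, None
    for ckpt in checkpoints:
        model.load_state_dict(torch.load(ckpt, map_location="cpu"))
        ap = evaluate_fn(model)
        if ap > best_ap:
            best_ap, best_ckpt = ap, ckpt
    model.load_state_dict(torch.load(best_ckpt, map_location="cpu"))
    return model, best_ckpt, best_ap
```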
Fifth step: use the trained target detection system to perform target detection on the image to be detected input by the user. The method is:

5.1 Optimize the image to be detected I input by the user with the image scaling and standardization method of step 2.4 to obtain the standardized image to be detected I_nor, and input I_nor into the main feature extraction module;

5.2 The main feature extraction module receives I_nor, extracts the multi-scale features of I_nor using the main feature extraction method described in 3.3.3, and sends the multi-scale feature map containing the multi-scale features of I_nor to the feature adaptive aggregation module.

5.3 The adaptive multi-scale feature aggregation network in the feature adaptive aggregation module receives the multi-scale feature map containing the multi-scale features of I_nor and, using the adaptive multi-scale feature aggregation method of 3.3.4.1, performs channel self-attention enhancement, bilinear interpolation upsampling and scale-level soft weight aggregation on it to obtain the multi-scale aware high-pixel feature map F_IH; F_IH is sent to the coarse-box prediction network and the adaptive spatial feature aggregation network;

5.4 The coarse-box prediction network in the feature adaptive aggregation module receives F_IH and, using the coarse-box prediction method of 3.3.4.2, performs coarse-box position prediction on F_IH to obtain the coarse-box predicted position B_Icoarse of the image to be detected I; B_Icoarse is sent to the adaptive spatial feature aggregation network. B_Icoarse also has resolution (H/4) × (W/4) and 4 channels;

5.5 The adaptive spatial feature aggregation network in the feature adaptive aggregation module receives F_IH and B_Icoarse and applies the classification adaptive spatial feature aggregation method of 3.3.4.3.2, using B_Icoarse to limit the sampling range while performing classification-task spatial feature aggregation on F_IH, obtaining the salient-region-aware high-pixel feature map of the image to be detected I; this feature map is sent to the centre point prediction network;

5.6 The adaptive spatial feature aggregation network in the feature adaptive aggregation module applies the regression adaptive spatial feature aggregation method of 3.3.4.3.3, using B_Icoarse to limit the sampling range while performing regression-task spatial feature aggregation on F_IH, obtaining the boundary-region-aware high-pixel feature map of the image to be detected I; this feature map is sent to the fine-box prediction network;

5.7 The fine-box prediction network in the main task module receives the boundary-region-aware high-pixel feature map of the image to be detected I and, after a 1 × 1 convolution, obtains the fine-box predicted positions of the objects in I; these fine-box predicted positions are sent to the post-processing module;

5.8 The centre point prediction network in the main task module receives the salient-region-aware high-pixel feature map of the image to be detected I and, after a 1 × 1 convolution, obtains the centre point prediction heatmap of the objects in I; this centre point prediction heatmap is sent to the post-processing module;

5.9 The post-processing module receives the fine-box predicted positions and the centre point prediction heatmap of the objects in the image to be detected I and applies the overlapping-pseudo-box removal method of step 4.11 to them, obtaining the object box prediction set of I; the prediction set retains each prediction box B_p and its category information, i.e. the coordinate positions and predicted categories of the predicted object boxes of the image to be detected.

Sixth step: finish.
Detection precision AP (Average Precision) and running speed FPS (Frames Per Second) were evaluated on 20000 test-set images from the MS COCO dataset or 1524 test-set images from the Cityscapes dataset (divided as described in the second step). The experimental environment is Ubuntu 20.04 (a Linux distribution) with an Intel i9-10900K CPU running at 3.70 GHz and four NVIDIA RTX 2080 Ti GPUs with a core frequency of 1635 MHz and 12 GB of video memory each. One test embodiment of the invention is shown in Fig. 4: an image to be detected (the upper image in Fig. 4, captured while driving) is input to the target detection system of the invention, the image prediction set is output and visualized, and a detection visualization image is generated (the lower image in Fig. 4, in which the detection boxes and object categories are labelled; as shown in Fig. 4, the "bicycle" detected at (1), the "person" detected at (2) and the "car" detected at (3) are framed with rectangular boxes).
First, the performance evaluation metrics of the target detection algorithm are defined. The test adopts the standard MS COCO evaluation protocol with 6 metrics: AP, AP_50, AP_75, AP_S, AP_M and AP_L. AP denotes the Average Precision computed at IoU thresholds sampled every 0.05 over the interval [0.5, 0.95] and then averaged over all thresholds. AP_50 and AP_75 denote the AP values at IoU thresholds of 0.5 and 0.75, respectively. AP_S, AP_M and AP_L denote the AP of small, medium and large objects, with size ranges of [0, 64²], [64², 128²] and [128², ∞], respectively. The larger the AP value, the higher the detection precision.
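When the MS COCO protocol is used, the evaluation can be run with the standard pycocotools package, for example as below; the file paths are placeholders, and COCOeval applies its own built-in area ranges for AP_S/AP_M/AP_L.

```python
from pycocotools.coco import COCO
from pycocotools.cocoeval import COCOeval

coco_gt = COCO("annotations/instances_val2017.json")   # ground-truth annotations (placeholder path)
coco_dt = coco_gt.loadRes("detections.json")           # detections in COCO result format
evaluator = COCOeval(coco_gt, coco_dt, iouType="bbox")
evaluator.evaluate()
evaluator.accumulate()
evaluator.summarize()
# evaluator.stats holds [AP, AP_50, AP_75, AP_S, AP_M, AP_L, ...] in that order
```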
The experimental results on the MS COCO dataset and the Cityscapes dataset are analysed separately below.
Performance on the MS COCO dataset is compared in Table 1. The invention is compared with the classical real-time detection method YOLOv3 and with CenterNet and TTFNet, the methods most related to the invention. The experimental results show that the method can detect targets quickly and accurately. Compared with CenterNet, the invention improves precision by 4.4 AP while running about 2.2 ms faster. Compared with TTFNet, the invention achieves a 2.5 AP improvement at a small speed cost of about 3.15 ms. The invention therefore achieves a large precision improvement while hardly affecting real-time performance. Precision and speed are two metrics that must be balanced in target detection, and achieving a large precision gain at a small computational cost is significant for practical applications. Moreover, the higher the precision, the harder it is to improve further: the classical Mask R-CNN algorithm (see "He K, Gkioxari G, Dollár P, et al. Mask R-CNN [C]//Proceedings of the IEEE International Conference on Computer Vision. 2017: 2961-2969.") achieves 39.8 AP at 11 FPS, while the invention is 5.45 times faster than Mask R-CNN and 2.0 AP higher. Sacrificing a speed delay of only about 3.15 ms (fully acceptable for real-world applications) thus yields a larger precision improvement of 2.5 AP.
TABLE 1
Method         Backbone network   FPS   AP     AP_50   AP_75   AP_S   AP_M   AP_L
YOLOv3         DarkNet-53         48    33.4   56.3    35.2    19.5   36.4   43.6
CenterNet      DLA-34             53    37.4   55.1    40.8    20.6   42.0   50.6
TTFNet         DarkNet-53         74    39.3   56.8    42.5    20.6   43.3   54.3
The invention  DarkNet-53         60    41.8   58.7    45.3    22.7   45.6   54.9
Performance on the Cityscapes dataset is compared in Table 2. Cityscapes is a classical intelligent-driving scene dataset; in this experiment, images uniformly resized to 768 × 384 are used as input, and TTFNet and the invention are compared on Cityscapes. TTFNet runs faster than the invention, but the detection precision gap is large (5.8 AP), while the speed delay of the invention is only 3.46 ms, which is fully acceptable for real-world applications. The invention therefore achieves a better balance between running speed and detection precision, realizing a large precision improvement at a small time cost.
TABLE 2
Method         Backbone network   FPS    AP     AP_50   AP_75   AP_S   AP_M   AP_L
TTFNet         DarkNet-53         58.7   17.2   33.9    15.6    6.4    22.5   30.1
The invention  DarkNet-53         48.8   23.0   41.7    22.1    4.3    22.1   45.2
The trained target detection system is also analysed visually. As shown in Fig. 3, TTFNet and the invention are compared visually on the Cityscapes dataset. Figs. 3(a) and 3(b) show the detection results of TTFNet, and Figs. 3(c) and 3(d) show the detection results of the invention. For ease of observation, the regions where TTFNet produces false detections are indicated by arrows (the false detection indicated by the left arrow in Fig. 3(a) is a spurious "bicycle"; the detection indicated by the right arrow contains multiple overlapping false-positive boxes; the false detection indicated by the arrow in Fig. 3(b) classifies a background region as foreground). Compared with TTFNet, the detection of the invention is more accurate, with a lower false detection rate and higher classification precision (in Fig. 3(c) no false detection occurs at the position corresponding to the left arrow of Fig. 3(a), and no overlapping false-positive boxes are produced at the position corresponding to the right arrow; in Fig. 3(d) the background region is not mistaken for foreground at the position corresponding to the arrow of Fig. 3(b)). These visualization results further demonstrate the effectiveness of the proposed method.

Claims (9)

1. A target detection method based on feature adaptive aggregation is characterized by comprising the following steps:
firstly, constructing a target detection system based on feature adaptive aggregation; the target detection system consists of a main feature extraction module, a feature self-adaptive aggregation module, an auxiliary task module, a main task module and a post-processing module;
the main feature extraction module is connected with the feature self-adaptive aggregation module, extracts multi-scale features from the input image and sends a multi-scale feature map containing the multi-scale features to the feature self-adaptive aggregation module; the main characteristic extraction module consists of a DarkNet-53 convolutional neural network and a characteristic pyramid network; the DarkNet-53 convolutional neural network is a lightweight trunk network containing 53 layers of neural networks, and the 53 layers of neural networks are divided into 5 serial sub-networks and used for extracting the trunk network characteristics of the image; the method comprises the steps that a feature pyramid network receives main network features from a DarkNet-53 convolutional neural network, a multi-scale feature map containing multi-scale features is obtained through up-sampling, feature extraction and feature fusion operations, and the multi-scale feature map is sent to a feature self-adaptive aggregation module;
the feature self-adaptive aggregation module is connected with the main feature extraction module, the auxiliary task module and the main task module, and has the functions of providing a multi-scale perceived high-pixel feature map for the auxiliary task module, providing a boundary region perceived high-pixel feature map and a salient region perceived high-pixel feature map for the main task module, and improving the detection precision of the target detection system; the characteristic self-adaptive aggregation module is composed of a self-adaptive multi-scale characteristic aggregation network, a self-adaptive spatial characteristic aggregation network and a rough frame prediction network; the self-adaptive multi-scale feature aggregation network is composed of 4 SE networks with unshared weights, and the 4 SE networks are respectively marked as a first SE network, a second SE network, a third SE network and a fourth SE network; receiving a multi-scale feature map from a feature pyramid network of a main feature extraction module, carrying out channel self-attention enhancement, bilinear interpolation upsampling and scale level soft weight aggregation operation on the multi-scale feature map by adopting a self-adaptive multi-scale feature aggregation method to obtain a multi-scale perceived high pixel feature map, and sending the multi-scale perceived high pixel feature map to a self-adaptive spatial feature aggregation network, a rough frame prediction network and an auxiliary task module; the rough frame prediction network is composed of two layers of 3 x 3 convolutions and one layer of 1 x 1 convolution, receives the multi-scale perception high pixel characteristic diagram from the self-adaptive multi-scale characteristic aggregation network, predicts the multi-scale perception high pixel characteristic diagram to obtain a rough frame prediction position, and sends the rough frame prediction position to the self-adaptive spatial characteristic aggregation network; the self-adaptive spatial feature aggregation network is composed of a region-limited deformable convolution of a classification offset conversion function and a regression offset conversion function, receives a multi-scale perceived high-pixel feature map from the self-adaptive multi-scale feature aggregation network, receives a rough frame prediction position from a rough frame prediction network, generates a boundary region perceived high-pixel feature map and a salient region perceived high-pixel feature map, and sends the boundary region perceived high-pixel feature map and the salient region perceived high-pixel feature map to the main task module;
the auxiliary task module is connected with an adaptive multi-scale feature aggregation network in the feature adaptive aggregation module, the auxiliary task module is an angular point prediction network, the angular point prediction network consists of two layers of 3 x 3 convolutions, a layer of 1 x 1 convolutions and a sigmoid active layer, the auxiliary task module receives a multi-scale perceived high pixel feature map from the adaptive multi-scale feature aggregation network, and the angular point prediction network predicts the multi-scale perceived high pixel feature map to obtain an angular point prediction thermodynamic diagram which is used for calculating angular point prediction loss in the training of a target detection system and assisting the target detection system in perceiving an angular point region; the auxiliary task module is only used for training the target detection system and is used for enhancing the perception of the target detection system on the position of the corner point of the object so as to predict the position of the object frame more accurately; when the trained target detection system detects the user input image, the module is directly discarded;
the main task module is connected with the self-adaptive spatial feature aggregation network and the post-processing module and consists of a fine frame prediction network and a central point prediction network; the fine frame prediction network is a layer of 1 multiplied by 1 convolution layer, receives the high pixel characteristic diagram sensed by the boundary region from the adaptive spatial characteristic aggregation network, performs 1 multiplied by 1 convolution on the high pixel characteristic diagram sensed by the boundary region to obtain a fine frame prediction position, and sends the fine frame prediction position to the post-processing module; the central point prediction network consists of a layer of 1 × 1 convolutional layer and a sigmoid activation layer, receives the high pixel characteristic diagram perceived by the salient region from the adaptive spatial characteristic aggregation network, performs 1 × 1 convolution and activation on the high pixel characteristic diagram perceived by the salient region to obtain a central point prediction thermodynamic diagram, and sends the central point prediction thermodynamic diagram to the post-processing module;
the post-processing module is a 3 x 3 pooling layer and is connected with a fine frame prediction network and a central point prediction network in the main task module, receives a fine frame prediction position from the fine frame prediction network, receives a central point prediction thermodynamic diagram from the central point prediction network, reserves a prediction maximum value in a central point prediction thermodynamic diagram 3 x 3 range by adopting 3 x 3 maximum pooling operation with the step length of 1, and extracts the position of the reserved prediction maximum value, namely a peak point, as the position of an object central area point; finding out the corresponding up-down, left-right four-direction distances in the fine frame prediction position according to the position of the central area point to generate a predicted object frame position, wherein the type of the central point where the position of the central area point is located is the type of object prediction; the post-processing module inhibits overlapping false frames by extracting peak points within a range of 3 multiplied by 3, so that false positive prediction frames are reduced;
secondly, constructing a training set, a verification set and a test set, wherein the method comprises the following steps:
2.1 collecting target detection scene images as a target detection data set, and manually labeling each target detection scene image in the target detection data set, wherein the method comprises the following steps: using a general scene data set or a Cityscapes unmanned scene data set disclosed by MS COCO as a target detection data set; training images in an MS COCO data set or a Cityscapes data set are used as a training set, verification images are used as a verification set, and test images are used as a test set; the total number of images in the training set is S, the total number of images in the testing set is T, the total number of images in the verification set is V, and each image in the MS COCO and Cityscapes data sets is manually marked, namely, each image is marked with the position of an object in a rectangular frame form and the category of the object;
2.2 Optimize the S images in the training set, including flipping, cropping, translation, brightness transformation, contrast transformation, saturation transformation, scaling and standardization, to obtain the optimized training set D_t;

2.3 Construct task ground-truth labels for model training from the optimized training set D_t; the labels cover four tasks, namely a centre point prediction task, a corner prediction task, a coarse-box prediction task and a fine-box prediction task, as follows:
2.3.1 Let the variable s = 1; let the s-th image in the optimized training set have N_s annotation boxes, and let the i-th annotation box be B_si = (x_si^tl, y_si^tl, x_si^dr, y_si^dr) with annotation category c_i, where (x_si^tl, y_si^tl) are the coordinates of the upper-left corner point of the i-th annotation box and (x_si^dr, y_si^dr) are the coordinates of the lower-right corner point of the i-th annotation box; N_s is a positive integer and 1 ≤ i ≤ N_s;
2.3.2 Construct the centre point prediction ground truth Ĥ_center^s for the centre point prediction task. The method is:

2.3.2.1 Construct an all-zero matrix H_zeros of size (H/4) × (W/4) × C, where C is the number of classification categories of the optimized training set, i.e. the number of annotated target categories of the target detection dataset, H is the height of the s-th image, and W is the width of the s-th image;
2.3.2.2 Let i = 1, denoting the i-th 4×-downsampled annotation box;

2.3.2.3 Divide the annotation coordinates of B_si by 4 and denote the result as the 4×-downsampled annotation box B_si' = (x_si'^tl, y_si'^tl, x_si'^dr, y_si'^dr), from which the upper-left, upper-right, lower-left and lower-right corner positions of B_si' are obtained;
2.3.2.4 Using the two-dimensional Gaussian kernel generation method, take the centre point of B_si', ((x_si'^tl + x_si'^dr)/2, (y_si'^tl + y_si'^dr)/2), as the base point of a two-dimensional Gaussian kernel with variance (σ_x, σ_y), and obtain the Gaussian values of all pixel points within the kernel range, giving the first Gaussian value set S_ctr. The specific method is:
2.3.2.4.1 Let the number of pixel points in the two-dimensional Gaussian kernel be N_pixel, N_pixel a positive integer; let the first Gaussian value set S_ctr be empty;

2.3.2.4.2 Let p = 1, denoting the p-th pixel point in the two-dimensional Gaussian kernel, 1 ≤ p ≤ N_pixel;
2.3.2.4.3 For any pixel point (x_p, y_p) within the Gaussian kernel range centred at the base point (x_0, y_0) in the s-th image, the two-dimensional Gaussian value K(x_p, y_p) is:

    K(x_p, y_p) = exp( − ( (x_p − x_0)² / (2σ_x²) + (y_p − y_0)² / (2σ_y²) ) )        (1)

where (x_0, y_0) is the base point, i.e. the centre, of the two-dimensional Gaussian kernel, x_0 being its coordinate in the width direction and y_0 its coordinate in the height direction; (x_p, y_p) is a pixel point within the Gaussian kernel range of the base point (x_0, y_0), x_p being its coordinate in the width direction and y_p its coordinate in the height direction; both (x_0, y_0) and (x_p, y_p) lie in the 4×-downsampled image coordinate system; σ_x² is the variance of the two-dimensional Gaussian kernel in the width direction and σ_y² its variance in the height direction, and the number of points within the kernel range is controlled by controlling these variances; w denotes the width of B_si' at the feature-map scale, h denotes the height of B_si' at the feature-map scale, and α is a parameter giving the proportion of the central region within B_si', from which σ_x and σ_y are determined; (x_p, y_p) and the computed K(x_p, y_p) are stored in the first Gaussian value set S_ctr;

2.3.2.4.4 Let p = p + 1; if p ≤ N_pixel, go to 2.3.2.4.3; if p > N_pixel, the coordinates and two-dimensional Gaussian values within the Gaussian kernel of B_si' have all been stored in S_ctr, which now contains N_pixel pixel points and their corresponding two-dimensional Gaussian values; go to 2.3.2.5;
2.3.2.5 Assign the values of S_ctr to H_zeros: for each element (x_p, y_p) and K(x_p, y_p) of S_ctr, assign according to the rule H_zeros[x_p, y_p, c_i] = K(x_p, y_p), where c_i is the category number of B_si', 1 ≤ c_i ≤ C, c_i a positive integer;

2.3.2.6 Let i = i + 1; if i ≤ N_s, go to 2.3.2.3; if i > N_s, the two-dimensional Gaussian values generated by all N_s 4×-downsampled annotation boxes of the s-th image have been assigned to H_zeros, go to 2.3.2.7;

2.3.2.7 The centre point prediction ground truth of the s-th image is Ĥ_center^s = H_zeros;
2.3.3 Construct the corner prediction ground truth Ĥ_corner^s for the corner prediction task. The method is:

2.3.3.1 Construct an all-zero matrix H_zeros^corner of size (H/4) × (W/4) × 4, where "4" is the number of corner points of a 4×-downsampled annotation box and also the number of channels of the matrix;
2.3.3.2 Let i = 1, denoting the i-th 4×-downsampled annotation box;

2.3.3.3 Let the base point of the two-dimensional Gaussian kernel be the upper-left corner point of B_si', with coordinates (x_si'^tl, y_si'^tl); using the two-dimensional Gaussian kernel generation method of 2.3.2.4, take this point as the base point of a two-dimensional Gaussian kernel with variance (σ_x, σ_y) and obtain the Gaussian values of all pixel points within the kernel range, giving the second Gaussian value set S_tl;

2.3.3.4 Assign the element coordinates and Gaussian values of S_tl to the 1st channel of H_zeros^corner, i.e. assign according to the rule H_zeros^corner[x_p, y_p, 1] = K(x_p, y_p);

2.3.3.5 Let the base point of the two-dimensional Gaussian kernel be the upper-right corner point of B_si', with coordinates (x_si'^dr, y_si'^tl); using the two-dimensional Gaussian kernel generation method of 2.3.2.4, obtain the Gaussian values of all pixel points within the kernel range, giving the third Gaussian value set S_tr;

2.3.3.6 Assign the element coordinates and Gaussian values of S_tr to the 2nd channel of H_zeros^corner, i.e. assign according to the rule H_zeros^corner[x_p, y_p, 2] = K(x_p, y_p);

2.3.3.7 Let the base point of the two-dimensional Gaussian kernel be the lower-left corner point of B_si', with coordinates (x_si'^tl, y_si'^dr); using the two-dimensional Gaussian kernel generation method of 2.3.2.4, obtain the Gaussian values of all pixel points within the kernel range, giving the fourth Gaussian value set S_dl;

2.3.3.8 Assign the element coordinates and Gaussian values of S_dl to the 3rd channel of H_zeros^corner, i.e. assign according to the rule H_zeros^corner[x_p, y_p, 3] = K(x_p, y_p);

2.3.3.9 Let the base point of the two-dimensional Gaussian kernel be the lower-right corner point of B_si', with coordinates (x_si'^dr, y_si'^dr); using the two-dimensional Gaussian kernel generation method of 2.3.2.4, obtain the Gaussian values of all pixel points within the kernel range, giving the fifth Gaussian value set S_dr;

2.3.3.10 Assign the element coordinates and Gaussian values of S_dr to the 4th channel of H_zeros^corner, i.e. assign according to the rule H_zeros^corner[x_p, y_p, 4] = K(x_p, y_p);

2.3.3.11 Let i = i + 1; if i ≤ N_s, go to 2.3.3.3; if i > N_s, the two-dimensional Gaussian values generated by all N_s 4×-downsampled annotation boxes of the s-th image have been assigned to H_zeros^corner, go to 2.3.3.12;

2.3.3.12 The corner prediction ground truth of the s-th image is Ĥ_corner^s = H_zeros^corner;
2.3.4 From the N_s 4×-downsampled annotation boxes of the s-th image, construct the coarse-box ground truth B̂_coarse^s of the s-th image for the coarse-box prediction task;

2.3.5 From B̂_coarse^s, construct the fine-box prediction ground truth B̂_refine^s; the value of B̂_refine^s is equal to that of B̂_coarse^s, i.e. B̂_refine^s = B̂_coarse^s;

2.3.6 Let s = s + 1; if s ≤ S, go to 2.3.2; if s > S, go to 2.3.7;

2.3.7 The task ground-truth labels of the S images for model training have been obtained; the S images and their labels together form the training set D_M for model training;

2.4 Optimize the V images of the validation set with the image scaling and standardization method, i.e. scale and standardize the V images, obtaining the new validation set D_V consisting of the V scaled and standardized images;

2.5 Optimize the T images of the test set with the image scaling and standardization method of step 2.4, obtaining the new test set D_T consisting of the T scaled and standardized images;
Third step: train the target detection system constructed in the first step by gradient back-propagation to obtain N_m sets of model parameters. The method is:

3.1 Initialize the network weight parameters of each module in the target detection system: initialize the parameters of the DarkNet-53 convolutional neural network in the main feature extraction module with a pre-trained model trained on the ImageNet dataset; initialize the network weight parameters of the feature pyramid network in the main feature extraction module, the feature adaptive aggregation module, the auxiliary task module and the main task module;

3.2 Set the training parameters of the target detection system: initialize the initial learning rate learning_rate and its decay coefficient, select stochastic gradient descent as the model training optimizer, initialize the optimizer hyper-parameter momentum, and initialize the weight decay; initialize the network training batch size mini_batch_size as a positive integer; initialize the maximum training epoch maxepoch as a positive integer.

3.3 Train the target detection system: the differences between the coarse-box predicted positions, fine-box predicted positions, corner prediction heatmaps and centre point prediction heatmaps output by the target detection system in one training pass and the corresponding ground truths are taken as the loss value, and the network weight parameters are updated by gradient back-propagation until the loss value reaches the threshold or the training epoch reaches maxepoch; during the last N_m training epochs, the network weight parameters are saved once per epoch. The method is:

3.3.1 Let the training epoch = 1; one epoch trains the training set data once; initialize the batch number N_b = 1;

3.3.2 The main feature extraction module reads the N_b-th batch, B = 64 images in total, from D_M and denotes the B images as the matrix I_train; I_train contains B images of size H × W × 3, where H is the input image height, W the input image width, and "3" denotes the RGB channels of the image;
3.3.3 The main feature extraction module extracts the multi-scale features of I_train using the main feature extraction method, obtains the multi-scale feature map containing the multi-scale features of I_train, and sends it to the feature adaptive aggregation module. The method is:

3.3.3.1 The DarkNet-53 convolutional neural network of the main feature extraction module extracts the image features of I_train to obtain the backbone network feature map set. The method is: the 5 serial sub-networks of the DarkNet-53 convolutional neural network perform downsampling and feature extraction on the B images of I_train to obtain the backbone network features, i.e. the 4 feature maps output by the last four serial sub-networks, and send these features to the feature pyramid network;

3.3.3.2 The feature pyramid network receives the 4 feature maps from the DarkNet-53 convolutional neural network, performs upsampling, feature extraction and feature fusion on them to obtain 3 multi-scale feature maps, denoted {F_1, F_2, F_3}, and sends the multi-scale feature maps {F_1, F_2, F_3} to the feature adaptive aggregation module;
3.3.4 The feature adaptive aggregation module receives the multi-scale feature maps {F_1, F_2, F_3} from the feature pyramid network, generates the multi-scale aware high-pixel feature map F_H and sends F_H to the auxiliary task module; it also generates the boundary-region-aware high-pixel feature map and the salient-region-aware high-pixel feature map and sends them to the main task module. The method is:

3.3.4.1 The adaptive multi-scale feature aggregation network receives {F_1, F_2, F_3} from the feature pyramid network and, using the adaptive multi-scale feature aggregation method, performs channel self-attention enhancement, bilinear interpolation upsampling and scale-level soft weight aggregation on {F_1, F_2, F_3} to obtain the multi-scale aware high-pixel feature map F_H; the resolution of the F_H feature map is (H/4) × (W/4) and the number of channels of F_H is 64. The specific method is:
3.3.4.1.1 The adaptive multi-scale feature aggregation network uses the first, second and third SE networks in parallel to perform channel self-attention enhancement on {F_1, F_2, F_3}: the first SE network applies a weighted summation over the channels of F_1 to obtain the first channel-enhanced image; at the same time the second SE network applies a weighted summation over the channels of F_2 to obtain the second channel-enhanced image; at the same time the third SE network applies a weighted summation over the channels of F_3 to obtain the third channel-enhanced image;

3.3.4.1.2 The first, second and third SE networks of the adaptive multi-scale feature aggregation network upsample {F_1, F_2, F_3} in parallel by bilinear interpolation to the same resolution (H/4) × (W/4), obtaining the upsampled feature maps F_1', F_2', F_3', which form the upsampled feature map set {F_1', F_2', F_3'}. The specific calculation is shown in equation (2):

    F_l' = Upsample(SE_n(F_l))        (2)

where SE_n denotes the n-th SE network, F_l denotes the l-th multi-scale feature map, Upsample denotes bilinear interpolation upsampling, 1 ≤ l ≤ 3, and 1 ≤ n ≤ 3;
3.3.4.1.3 The adaptive multi-scale feature aggregation network computes weights for {F_1', F_2', F_3'} with a 1 × 1 convolution, reducing the number of channels from 64 to 1, and then applies Softmax along the scale dimension to obtain the soft weight maps {A_1, A_2, A_3} corresponding to {F_1', F_2', F_3'}. The value of a pixel in a soft weight map indicates which of the 3 scales of {F_1', F_2', F_3'} should receive more attention at that position, i.e. which scale carries the larger weight, so that objects of different sizes respond on feature maps of different scales;
3.3.4.1.4 The adaptive multi-scale feature aggregation network multiplies the soft weight map of the l-th scale element by element with the corresponding l-th upsampled feature map, i.e. the first soft weight map is multiplied element by element with the first upsampled feature map, the second soft weight map with the second upsampled feature map, and the third soft weight map with the third upsampled feature map, giving 3 products; the 3 products are then summed and fused into one feature map, the fused feature map. A fourth SE network then enhances the channel representation of the fused feature map, yielding the multi-scale-aware high-pixel feature map F_H. The process is shown in formula (3):

F_H = SE_4( Σ_{l=1}^{3} w_l × F_l^up ),  w_l = Softmax_l( Conv(F_l^up) )    (3)

where SE_4 is the fourth SE network, w_l represents the weight of the elements at the same position across the different scales, "×" represents the product of elements at corresponding positions, Conv represents the 1×1 convolution, and the Softmax is taken over the scale dimension. The adaptive multi-scale feature aggregation network sends F_H to the auxiliary task module, the coarse box prediction network and the adaptive spatial feature aggregation network;
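To make the data flow of 3.3.4.1.1–3.3.4.1.4 concrete, the following PyTorch-style sketch strings the four sub-steps together. It is only an illustration under assumptions: the standard squeeze-and-excitation block, the choice of the largest-scale map as the upsampling target, and all class and variable names are not taken from the claim.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class SEBlock(nn.Module):
    """Standard squeeze-and-excitation channel self-attention."""
    def __init__(self, channels, reduction=16):
        super().__init__()
        self.fc = nn.Sequential(
            nn.Linear(channels, channels // reduction), nn.ReLU(inplace=True),
            nn.Linear(channels // reduction, channels), nn.Sigmoid())

    def forward(self, x):
        b, c, _, _ = x.shape
        w = self.fc(x.mean(dim=(2, 3)))      # global average pooling -> channel weights
        return x * w.view(b, c, 1, 1)        # channel-wise re-weighting

class AdaptiveMultiScaleAggregation(nn.Module):
    """SE per scale (3.3.4.1.1), bilinear upsampling (3.3.4.1.2),
    scale-level soft weights (3.3.4.1.3), weighted fusion + SE (3.3.4.1.4)."""
    def __init__(self, channels=64, num_scales=3):
        super().__init__()
        self.se = nn.ModuleList([SEBlock(channels) for _ in range(num_scales)])
        self.weight_conv = nn.Conv2d(channels, 1, 1)   # 64 -> 1 channel, the Conv of eq. (3)
        self.se_fuse = SEBlock(channels)

    def forward(self, feats):                # feats: [largest, middle, smallest] scale maps
        size = feats[0].shape[-2:]           # assumed target resolution of F_H
        ups = [F.interpolate(se(f), size=size, mode='bilinear', align_corners=False)
               for se, f in zip(self.se, feats)]                     # eq. (2)
        w = torch.softmax(torch.stack([self.weight_conv(u) for u in ups], 1), dim=1)
        fused = sum(wi * ui for wi, ui in zip(w.unbind(1), ups))     # scale-level soft weights
        return self.se_fuse(fused)           # F_H, eq. (3)
```

Feeding three 64-channel feature maps of decreasing resolution returns one 64-channel map at the largest resolution, matching the description of F_H.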
3.3.4.2 The coarse box prediction network receives the multi-scale-aware high-pixel feature map F_H from the adaptive multi-scale feature aggregation network and uses the coarse box prediction method to predict a coarse box position for every feature point in F_H, generating the coarse box prediction position B_coarse, which is sent to the adaptive spatial feature aggregation network. B_coarse has the same resolution as F_H and 4 channels; the 4 channels represent the distances from each pixel point to the box boundary in the up, down, left and right directions, so every pixel point defines a coarse box. B_coarse is used to limit the deformable-convolution sampling range in the adaptive spatial feature aggregation network. In addition, the loss L_coarse between B_coarse and the coarse box true value constructed in 2.2.5.4 is calculated, where S_b is the regression sample set, consisting of the pixels whose coarse box true value is non-zero, N_b is the number of samples in the regression set, and W_ij is the weight value at the (i, j) positions whose true value is non-zero;
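The concrete regression loss for B_coarse (and for B_refine in 3.3.6, which is described with the same S_b, N_b and W_ij terms) is only given as a formula image. A hedged sketch, assuming an IoU-style loss on the (t, l, d, r) distance maps weighted by W_ij and normalized by N_b, is:

```python
import torch

def weighted_ltrb_iou_loss(pred, target, weight, eps=1e-6):
    """pred, target: B x 4 x H x W distance maps (t, l, d, r); weight: B x H x W.
    Pixels with zero weight are outside the regression sample set S_b."""
    tp, lp, dp, rp = pred.unbind(1)
    tg, lg, dg, rg = target.unbind(1)
    ih = (torch.min(tp, tg) + torch.min(dp, dg)).clamp(min=0)
    iw = (torch.min(lp, lg) + torch.min(rp, rg)).clamp(min=0)
    inter = ih * iw
    union = (tp + dp) * (lp + rp) + (tg + dg) * (lg + rg) - inter
    iou = inter / union.clamp(min=eps)
    mask = weight > 0                          # S_b: pixels whose ground truth is non-zero
    n_b = mask.sum().clamp(min=1)              # N_b
    return ((1.0 - iou) * weight)[mask].sum() / n_b
```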
3.3.4.3 The adaptive spatial feature aggregation network receives the multi-scale-aware high-pixel feature map F_H from the adaptive multi-scale feature aggregation network and the coarse box prediction position B_coarse from the coarse box prediction network, and generates the boundary-region-aware high-pixel feature map F_HR and the salient-region-aware high-pixel feature map F_HS; the method comprises the following steps:
3.3.4.3.1 A deformable convolution R-DConv with a restricted sampling region is created as follows:
3.3.4.3.1.1 Design an offset transfer function T that transforms the offset Δp of the deformable convolution into a transformed offset. T limits the offset range of the spatial sampling points of the deformable convolution to B_coarse while keeping the offset Δp differentiable. A Sigmoid function σ is used to normalize the offset Δp into the interval [0,1] with respect to B_coarse; Δp is split into h_Δp and w_Δp, where h_Δp denotes the component of Δp in the vertical direction and w_Δp denotes the component of Δp in the horizontal direction;
The transfer function is shown in equation (5):

Δp' = ( T_v(h_Δp), T_h(w_Δp) )    (5)

where T_v represents the offset transfer function in the vertical direction, T_h represents the offset transfer function in the horizontal direction, T = (T_v, T_h) is the overall offset transfer function, and (t, l, r, d), the distances from the convolution kernel position p to B_coarse in the up, left, right and down directions, enter T_v and T_h so that the transformed offsets stay within B_coarse;
3.3.4.3.1.2 Use the offset transfer function T to limit the deformable-convolution sampling area. Given a 3×3 convolution kernel with K = 9 spatial sampling points, let w_k denote the convolution kernel weight of the k-th position and P_k the predefined position offset of the k-th position; P_k ∈ {(−1,−1), (−1,0), …, (1,1)} covers the 3×3 range centred at (0,0). Let x(p) denote the input feature map at the convolution kernel centre position p and y(p) the output feature map at position p. y(p) is calculated with R-DConv as shown in equation (6):

y(p) = Σ_{k=1}^{K} w_k · x( p + P_k + T(Δp_k) ) · Δm_k    (6)

where Δp_k denotes the learnable offset of the k-th position and Δm_k the modulation weight of the k-th position; Δp_k and Δm_k are generated by a 3×3 convolution that outputs a 27-channel feature map, of which 9 channels are the abscissa offset values of Δp_k, 9 channels are the ordinate offset values of Δp_k, and 9 channels are the values of Δm_k; B_coarse represents the coarse box predicted at the current feature-map scale, which also serves as the predefined bounding region;
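A sketch of R-DConv built on torchvision's deform_conv2d is shown below. It is an illustration, not the claimed formula: bounding the vertical offset to [−t, d] and the horizontal offset to [−l, r] via the Sigmoid is one plausible reading of equation (5), the (t, l, d, r) channel order of the coarse-box input is assumed, and the offset layout follows torchvision's convention. Driven by T_cls or T_reg, the same operator underlies the classification and regression aggregation of 3.3.4.3.2 and 3.3.4.3.3.

```python
import torch
import torch.nn as nn
from torchvision.ops import deform_conv2d

class RegionRestrictedDConv(nn.Module):
    """3x3 modulated deformable convolution whose sampling offsets are squashed
    into the coarse box predicted at each position."""
    def __init__(self, channels=64):
        super().__init__()
        self.weight = nn.Parameter(torch.empty(channels, channels, 3, 3))
        nn.init.normal_(self.weight, std=0.01)
        # 27 channels: 9 y-offsets, 9 x-offsets (interleaved below), 9 modulation masks
        self.offset_mask = nn.Conv2d(channels, 27, kernel_size=3, padding=1)

    def forward(self, x, box_dist):
        # box_dist: B x 4 x H x W distances (t, l, d, r) from each position to B_coarse
        out = self.offset_mask(x)
        raw_off, mask = out[:, :18], out[:, 18:].sigmoid()
        dy, dx = raw_off[:, 0::2], raw_off[:, 1::2]            # 9 + 9 raw offsets
        t, l, d, r = (c.unsqueeze(1) for c in box_dist.unbind(1))
        dy = (t + d) * dy.sigmoid() - t                        # assumed bound: [-t, d]
        dx = (l + r) * dx.sigmoid() - l                        # assumed bound: [-l, r]
        offset = torch.stack((dy, dx), dim=2).flatten(1, 2)    # interleave back to 18 channels
        return deform_conv2d(x, offset, self.weight, padding=1, mask=mask)
```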
3.3.4.3.2 A classification adaptive spatial feature aggregation method is adopted, using B_coarse to limit the sampling range for feature aggregation on F_H. The method comprises the following steps:
3.3.4.3.2.1 Let T_cls denote the classification offset transfer function, and calculate the output feature y_cls(p) at position p with formula (6) using T_cls;
3.3.4.3.2.2 Traverse F_H with the R-DConv convolution kernel under T_cls to obtain the salient-region-aware high-pixel feature map F_HS. T_cls allows the sampling points to concentrate, so that the classification branch can focus on the most discriminative salient regions; it enables R-DConv to learn the salient region of the object within the range of the coarse box and to extract the features that allow the object to be classified more accurately, i.e. the salient-region-aware high-pixel feature map F_HS. F_HS is sent to the main task module;
3.3.4.3.3 A regression adaptive spatial feature aggregation method is adopted, using B_coarse to limit the sampling range for feature aggregation on F_H. The regression adaptive spatial feature aggregation method specifically comprises the following steps:
3.3.4.3.3.1 Design a regression offset transfer function T_reg that transforms the offset Δp of the deformable convolution. T_reg evenly divides the spatial sampling points of the R-DConv operation along the up, down, left and right directions, so that the restricted region is split into four sub-regions corresponding to the upper-left, upper-right, lower-left and lower-right quadrants. T_reg samples the four sub-regions uniformly, i.e. assigns an equal number of sampling points to each sub-region. With K = 9, T_reg samples two points from each of the four sub-regions, and the resulting eight edge points plus the centre point form the 3×3 convolution kernel, which strengthens the capture of boundary information by the central feature point. The regression offset transfer function T_reg is given in equation (7), in which the Sigmoid function σ is used to normalize the offsets within the coarse-box interval;
Substituting T_reg into equation (6) yields the output feature y_reg(p) at position p;
3.3.4.3.3.2 Traverse F_H with the R-DConv convolution kernel under T_reg to obtain the boundary-region-aware high-pixel feature map F_HR, and send F_HR to the main task module;
3.3.5 The auxiliary task module receives F_H from the adaptive multi-scale feature aggregation network and processes it with two layers of 3×3 convolution, a 1×1 convolution and a sigmoid function to obtain the corner prediction heatmap H_corner; H_corner has the same resolution as F_H and 4 channels. The loss L_corner between H_corner and the corner true value constructed in 2.3.3 is calculated, where N_s is the number of label boxes in the image, α_l and β are hyper-parameters controlling the gradient curve of the loss function, the prediction term is the corner value output by the auxiliary task module at channel c and pixel position (i, j), and the corresponding true-value term is the corner ground truth at channel c and pixel position (i, j);
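The corner loss itself is given as a formula image, as is the center-point loss of 3.3.7, which is described with the same N_s, α_l and β terms. Since claim 8 fixes α_l = 2 and β = 4, a penalty-reduced pixel-wise focal loss of the CornerNet/CenterNet family is a natural reading; the sketch below assumes that form and is not the claimed expression.

```python
import torch

def penalty_reduced_focal_loss(pred, gt, alpha=2.0, beta=4.0, eps=1e-6):
    """pred, gt: B x C x H x W heatmaps in [0, 1]; gt equals 1 at annotated peaks
    and decays with a Gaussian elsewhere.  Normalized by the number of peaks."""
    pos = gt.eq(1.0).float()
    neg = 1.0 - pos
    pos_loss = -((1.0 - pred) ** alpha) * torch.log(pred + eps) * pos
    neg_loss = -((1.0 - gt) ** beta) * (pred ** alpha) * torch.log(1.0 - pred + eps) * neg
    num_pos = pos.sum().clamp(min=1.0)
    return (pos_loss.sum() + neg_loss.sum()) / num_pos
```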
3.3.6 The fine box prediction network of the main task module receives the boundary-region-aware high-pixel feature map F_HR from the adaptive spatial feature aggregation network and, after one layer of 1×1 convolution, obtains the fine box prediction position B_refine for every feature point position of F_HR. B_refine has the same resolution as F_HR and 4 channels; the 4 channels represent the distances from each pixel point to the predicted fine box in the up, down, left and right directions, so every pixel point can form a fine prediction box. The loss L_refine between B_refine and the fine box true value obtained in 2.3.5 is calculated in the same weighted form as the coarse box loss, where S_b is the regression sample set consisting of the pixels whose fine box true value is non-zero, N_b is the number of samples in the regression set, and W_ij is the weight value at the (i, j) positions whose true value is non-zero. The learning quality of B_refine determines the accuracy with which the target detection system regresses object positions;
3.3.7 The center point prediction network of the main task module receives the salient-region-aware high-pixel feature map F_HS from the adaptive spatial feature aggregation network and, after one layer of 1×1 convolution and a sigmoid function, obtains the center point prediction heatmap H_center for the feature point positions of F_HS. H_center has the same resolution as F_HS and its number of channels is the number of dataset categories C. The loss L_center between H_center and the center point true value constructed in 2.2.5.2 is calculated in the same form as the corner loss, where N_s is the number of label boxes in the image, α_l and β are hyper-parameters, the prediction term is the center point heatmap value at channel c and pixel position (i, j), and the corresponding true-value term is the center point ground truth at channel c and pixel position (i, j). The learning quality of H_center determines the ability of the target detection system to locate object centres and to distinguish object classes;
3.3.8 Design the total loss function L_total of the target detection system as shown in equation (11):

L_total = λ_corner · L_corner + λ_center · L_center + λ_coarse · L_coarse + λ_refine · L_refine    (11)

where L_corner is the loss value calculated between the corner prediction network output H_corner and its true value, L_center is the loss value calculated between the center point prediction network output H_center and its true value, L_coarse is the loss value calculated between the coarse box prediction network output B_coarse and its true value, and L_refine is the loss value calculated between the fine box prediction network output B_refine and its true value; λ_corner is the corner prediction network loss weight, λ_center is the center point prediction network loss weight, λ_coarse is the coarse box prediction network loss weight, and λ_refine is the fine box prediction network loss weight;
3.3.9 Let epoch = epoch + 1; if epoch is 80 or 110, let learning_rate = learning_rate × 0.1 and turn to 3.3.10; if epoch is neither 80 nor 110, turn directly to 3.3.10;
3.3.10 If epoch ≤ maxepoch, turn to 3.3.2; if epoch > maxepoch, training is finished, turn to 3.3.11;
3.3.11 Save the network weight parameters of N_m epochs;
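Steps 3.3.9–3.3.11 describe a step-decay schedule (×0.1 at epochs 80 and 110) over the 120 epochs fixed in claim 7. A minimal sketch with the SGD hyper-parameters of claim 7 is given below; `model` and `train_one_epoch` are placeholders, and saving the last N_m = 10 checkpoints is an assumption (the claim only says N_m epoch weights are kept).

```python
import torch

def train(model, train_one_epoch, max_epoch=120, keep_last=10):
    """Step-decay training loop; saves the weight files verified in the fourth step."""
    optimizer = torch.optim.SGD(model.parameters(), lr=0.01,
                                momentum=0.9, weight_decay=0.0004)
    scheduler = torch.optim.lr_scheduler.MultiStepLR(optimizer,
                                                     milestones=[80, 110], gamma=0.1)
    for epoch in range(1, max_epoch + 1):
        train_one_epoch(model, optimizer)        # one pass over the optimized training set
        scheduler.step()                         # learning_rate x 0.1 at epochs 80 and 110
        if epoch > max_epoch - keep_last:        # keep the last N_m checkpoints (assumed)
            torch.save(model.state_dict(), f"epoch_{epoch}.pth")
```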
Fourth, use the verification set to verify the detection precision of the target detection system loaded with each of the N_m saved epoch weight parameters, and keep the best-performing network weight parameters as the network weight parameters of the target detection system; the method comprises the following steps:
4.1 Let the variable n_m = 1;
4.2 The target detection system loads the n_m-th of the N_m saved epoch network weight parameters; the new verification set D_V is input into the target detection system;
4.3 Let v = 1 index the v-th image of the verification set, V being the number of images in the verification set;
4.4 The main feature extraction module receives the v-th verification set image D_v, extracts the multi-scale features of D_v with the main feature extraction method described in 3.3.3, and sends the multi-scale feature map containing the multi-scale features of D_v to the feature adaptive aggregation module;
4.5 The adaptive multi-scale feature aggregation network in the feature adaptive aggregation module receives the multi-scale feature map containing the multi-scale features of D_v and, with the adaptive multi-scale feature aggregation method of 3.3.4.1, performs channel self-attention enhancement, bilinear-interpolation upsampling and scale-level soft-weight aggregation to obtain the multi-scale-aware high-pixel feature map F_HV of D_v; F_HV is sent to the coarse box prediction network and the adaptive spatial feature aggregation network;
4.6 The coarse box prediction network in the feature adaptive aggregation module receives F_HV and, with the coarse box prediction method of 3.3.4.2, predicts a coarse box position for every feature point in F_HV, generating the coarse box prediction position B_HVcoarse of the v-th verification set image D_v; B_HVcoarse is sent to the adaptive spatial feature aggregation network; B_HVcoarse has the same resolution as F_HV and 4 channels;
4.7 The adaptive spatial feature aggregation network in the feature adaptive aggregation module receives B_HVcoarse from the coarse box prediction network and F_HV from the adaptive multi-scale feature aggregation network, adopts the classification adaptive spatial feature aggregation method of 3.3.4.3.2 and uses B_HVcoarse to limit the sampling range, performing classification-task spatial feature aggregation on F_HV to obtain the salient-region-aware high-pixel feature map of the v-th verification set image D_v; this salient-region-aware high-pixel feature map is sent to the center point prediction network;
4.8 The adaptive spatial feature aggregation network in the feature adaptive aggregation module adopts the regression adaptive spatial feature aggregation method described in 3.3.4.3.3 and uses B_HVcoarse to limit the sampling range, performing regression-task spatial feature aggregation on F_HV to obtain the boundary-region-aware high-pixel feature map of the v-th verification set image D_v; this boundary-region-aware high-pixel feature map is sent to the fine box prediction network;
4.9 The fine box prediction network in the main task module receives the boundary-region-aware high-pixel feature map and, after one layer of 1×1 convolution, obtains the fine box prediction positions of the objects in the v-th verification set image D_v, which are sent to the post-processing module;
4.10 The center point prediction network in the main task module receives the salient-region-aware high-pixel feature map of the v-th verification set image D_v and, after one layer of 1×1 convolution, obtains the center point prediction heatmap of D_v, which is sent to the post-processing module;
4.11 The post-processing module receives the fine box prediction positions and the center point prediction heatmap of the v-th verification set image D_v and applies the overlapping-pseudo-box removal method to them, obtaining the object box prediction set of D_v; the specific method is as follows:
4.11.1 The post-processing module performs a 3×3 max-pooling operation on the center point prediction heatmap of D_v to extract its set of peak points, each peak point representing a point in the central region of a predicted object;
4.11.2 The coordinate values (P_x, P_y) of a peak point are obtained from the center point prediction heatmap of D_v; the post-processing module then reads from the fine box prediction positions of D_v the distance information (t, l, d, r) of the peak point (P_x, P_y) in the up, left, down and right directions, obtaining the prediction box B_p of D_v whose top, left, bottom and right boundaries are P_y − t, P_x − l, P_y + d and P_x + r respectively; the category of B_p is the index of the channel with the maximum pixel value of the center point heatmap at position (P_x, P_y), recorded as c_p; the confidence of B_p is the pixel value of channel c_p of the center point heatmap at position (P_x, P_y), recorded as s_p;
4.11.3 The post-processing module retains the prediction boxes of D_v whose confidence s_p is larger than the confidence threshold, forming the object box prediction set of D_v, which keeps each retained prediction box B_p and its class information c_p;
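Step 4.11 (reused at inference in 5.9) is essentially peak-based decoding of the center-point heatmap. A hedged sketch for a single image, assuming the heatmap is already sigmoid-activated and the fine-box map stores (t, l, d, r) distances, is:

```python
import torch
import torch.nn.functional as F

def decode_boxes(center_heatmap, box_dist, score_thr=0.3):
    """center_heatmap: C x H x W; box_dist: 4 x H x W with (t, l, d, r) distances.
    Returns boxes (x1, y1, x2, y2), scores s_p and classes c_p on the feature grid."""
    pooled = F.max_pool2d(center_heatmap[None], kernel_size=3, stride=1, padding=1)[0]
    peaks = (pooled == center_heatmap) & (center_heatmap > score_thr)   # 4.11.1 + 4.11.3
    cls, ys, xs = peaks.nonzero(as_tuple=True)                          # peak coordinates
    t, l, d, r = box_dist[:, ys, xs]                                    # 4.11.2
    boxes = torch.stack((xs - l, ys - t, xs + r, ys + d), dim=1)
    return boxes, center_heatmap[cls, ys, xs], cls
```

The returned coordinates live on the down-sampled feature grid; multiplying by the stride would map them back to input-image coordinates.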
4.12 Let v = v + 1; if v ≤ V, turn to 4.4; if v > V, the object box prediction sets of the V verification images have been obtained for the n_m-th model, turn to 4.13;
4.13 If the verification set is the public MS COCO general-scene data set, test the precision of the final object box prediction set output by the target detection system with the standard MS COCO evaluation protocol, record the precision of the object box prediction set, and turn to 4.14; if the verification set is the Cityscapes unmanned-driving scene data set, test the precision of the final object box prediction set output by the target detection system with the Cityscapes evaluation protocol, record the precision of the object box prediction set, and turn to 4.14;
4.14 Let n_m = n_m + 1; if n_m ≤ N_m, turn to 4.2; if n_m > N_m, the precision of all N_m models has been tested, turn to 4.15;
4.15 From the precisions of the object box prediction sets of the N_m models, select the object box prediction set with the highest precision, find the weight parameters of the target detection system corresponding to it, take them as the selected weight parameters of the target detection system, and load them into the target detection system; the target detection system loaded with the selected weight parameters becomes the trained target detection system;
Fifth, use the trained target detection system to perform target detection on the image to be detected input by the user. The method comprises the following steps:
5.1 Optimize the image to be detected I input by the user with the image scaling normalization method of step 2.4 to obtain the normalized image to be detected I_nor, and input I_nor into the main feature extraction module;
5.2 The main feature extraction module receives I_nor, extracts the multi-scale features of I_nor with the main feature extraction method described in 3.3.3, and sends the multi-scale feature map containing the multi-scale features of I_nor to the feature adaptive aggregation module;
5.3 The adaptive multi-scale feature aggregation network in the feature adaptive aggregation module receives the multi-scale feature map containing the multi-scale features of I_nor and, with the adaptive multi-scale feature aggregation method of 3.3.4.1, performs channel self-attention enhancement, bilinear-interpolation upsampling and scale-level soft-weight aggregation to obtain the multi-scale-aware high-pixel feature map F_IH, which is sent to the coarse box prediction network and the adaptive spatial feature aggregation network;
5.4 The coarse box prediction network in the feature adaptive aggregation module receives F_IH and, with the coarse box prediction method described in 3.3.4.2, predicts coarse box positions on F_IH, obtaining the coarse box prediction position B_Icoarse of the image to be detected I; B_Icoarse is sent to the adaptive spatial feature aggregation network; B_Icoarse has the same resolution as F_IH and 4 channels;
5.5 The adaptive spatial feature aggregation network in the feature adaptive aggregation module receives F_IH and B_Icoarse, adopts the classification adaptive spatial feature aggregation method of 3.3.4.3.2 and uses B_Icoarse to limit the sampling range, performing classification-task spatial feature aggregation on F_IH to obtain the salient-region-aware high-pixel feature map of the image to be detected I; this feature map is sent to the center point prediction network;
5.6 The adaptive spatial feature aggregation network in the feature adaptive aggregation module adopts the regression adaptive spatial feature aggregation method described in 3.3.4.3.3 and uses B_Icoarse to limit the sampling range, performing regression-task spatial feature aggregation on F_IH to obtain the boundary-region-aware high-pixel feature map of the image to be detected I; this feature map is sent to the fine box prediction network;
5.7 The fine box prediction network in the main task module receives the boundary-region-aware high-pixel feature map of the image to be detected I and, after one layer of 1×1 convolution, obtains the fine box prediction positions of the objects in I, which are sent to the post-processing module;
5.8 The center point prediction network in the main task module receives the salient-region-aware high-pixel feature map of the image to be detected I and, after one layer of 1×1 convolution, obtains the center point prediction heatmap of the objects of I, which is sent to the post-processing module;
5.9 The post-processing module receives the fine box prediction positions and the center point prediction heatmap of the objects of the image to be detected I and applies the overlapping-pseudo-box removal method of step 4.11 to them, obtaining the object box prediction set of I; the object box prediction set keeps each prediction box B_p and its class information, i.e. the coordinate positions and predicted classes of the predicted object boxes of the image to be detected;
Sixth, finish.
2. The method of claim 1, wherein in step 2.1, the MS COCO data set has 80 classes, including 105000 training images as a training set, 5000 verification images as a verification set, and 20000 test images as a test set; the Cityscapes data set has 8 classes: pedestrians, riders, cars, trucks, buses, trains, motorcycles and bicycles, with 2975 training images as a training set, 500 verification images as a verification set, and 1525 test images as a test set; S is 205000 or 2975, T is 20000 or 1524, and V is 5000 or 500.
3. The target detection method of claim 1, wherein the optimization processing performed on the S images of the training set in step 2.2 to obtain the optimized training set D_t is:
2.2.1 Let the variable s = 1 and initialize the optimized training set D_t to be empty;
2.2.2 Flip the s-th image of the training set with random flipping to obtain the s-th flipped image; the random probability of the flipping is 0.5;
2.2.3 Randomly crop the s-th flipped image with a minimum intersection-over-union constraint to obtain the s-th cropped image; the minimum size ratio adopted is 0.3;
2.2.4 Apply random image translation to the s-th cropped image to obtain the s-th translated image;
2.2.5 Apply random brightness transformation to the s-th translated image to obtain the s-th brightness-transformed image; the brightness delta adopted is 32;
2.2.6 Apply random contrast transformation to the s-th brightness-transformed image to obtain the s-th contrast-transformed image; the contrast range is (0.5, 1.5);
2.2.7 Apply random saturation transformation to the s-th contrast-transformed image to obtain the s-th saturation-transformed image; the saturation range is (0.5, 1.5);
2.2.8 Scale the s-th saturation-transformed image to 512×512 to obtain the s-th scaled image;
2.2.9 Standardize the s-th scaled image to obtain the s-th standard image and put it into the optimized training set D_t;
Let s = s + 1; if s ≤ S, turn to 2.2.2; if s > S, the optimized training set D_t consisting of S standard images is obtained.
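A sketch of the photometric part of this pipeline (brightness delta 32, contrast and saturation ranges (0.5, 1.5)) on an H×W×3 float image is given below. Applying each distortion with probability 0.5 and approximating the saturation step by scaling the chroma around the per-pixel grey level are assumptions; the claim only fixes the parameter ranges.

```python
import numpy as np

def photometric_augment(img, rng=np.random):
    """Random brightness (2.2.5), contrast (2.2.6) and saturation (2.2.7)."""
    img = img.astype(np.float32)
    if rng.rand() < 0.5:
        img += rng.uniform(-32, 32)                        # brightness delta 32
    if rng.rand() < 0.5:
        img *= rng.uniform(0.5, 1.5)                       # contrast range (0.5, 1.5)
    if rng.rand() < 0.5:
        grey = img.mean(axis=2, keepdims=True)             # per-pixel grey level
        img = grey + (img - grey) * rng.uniform(0.5, 1.5)  # saturation range (0.5, 1.5)
    return np.clip(img, 0, 255)
```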
4. The feature adaptive aggregation-based target detection method of claim 1, wherein in step 2.3.2.4.3 the two-dimensional Gaussian kernel is centred at the centre of B'_si, and its radius parameter for B'_si is set to 0.54.
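Claim 4 places a two-dimensional Gaussian kernel at the centre of each down-sampled label box, with 0.54 presumably parameterizing the kernel radius. A minimal sketch of rendering such a Gaussian onto one heatmap channel, with sigma left to the caller, is:

```python
import numpy as np

def draw_gaussian(heatmap, center, sigma):
    """Render a 2D Gaussian peak at `center` = (x0, y0) onto `heatmap` (H x W),
    keeping the element-wise maximum so nearby objects do not overwrite each other."""
    x0, y0 = int(center[0]), int(center[1])
    radius = max(1, int(3 * sigma))
    h, w = heatmap.shape
    x1, x2 = max(0, x0 - radius), min(w, x0 + radius + 1)
    y1, y2 = max(0, y0 - radius), min(h, y0 + radius + 1)
    xs, ys = np.meshgrid(np.arange(x1, x2), np.arange(y1, y2))
    g = np.exp(-((xs - x0) ** 2 + (ys - y0) ** 2) / (2 * sigma ** 2))
    np.maximum(heatmap[y1:y2, x1:x2], g, out=heatmap[y1:y2, x1:x2])
    return heatmap
```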
5. The method as claimed in claim 1, wherein in step 2.3.4 the coarse box true value of the s-th image for the coarse box prediction task is constructed from the N_s 4×-down-sampled label boxes of the s-th image by:
2.3.4.1 Construct an all-zero matrix H_zeros with 4 channels at the down-sampled label resolution, where "4" corresponds to the 4 coordinates of a 4×-down-sampled label box;
2.3.4.2 Let i = 1, indexing the i-th 4×-down-sampled label box;
2.3.4.3 Assign values to the pixels of H_zeros inside the i-th 4×-down-sampled label box B'_si, i.e. write the 4 coordinate values of B'_si into the 4 channels of every pixel position inside B'_si;
2.3.4.4 Let i = i + 1; if i ≤ N_s, turn to 2.3.4.3; if i > N_s, the coarse box true values corresponding to all N_s label boxes of the s-th image have been written into the matrix, which becomes the true value label of the s-th image, and turn to 2.3.4.5;
2.3.4.5 The assigned matrix is the coarse box true value of the s-th image.
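A NumPy sketch of 2.3.4.1–2.3.4.5 is given below. The channel layout (the four box coordinates written to every interior pixel) follows the claim; the matrix size arguments and dtype are assumptions.

```python
import numpy as np

def build_coarse_box_target(down_boxes, out_h, out_w):
    """down_boxes: list of 4x-downsampled label boxes (x1, y1, x2, y2).
    Returns a (4, out_h, out_w) map whose interior pixels hold the box coordinates."""
    target = np.zeros((4, out_h, out_w), dtype=np.float32)        # 2.3.4.1: all-zero matrix
    for x1, y1, x2, y2 in down_boxes:                             # 2.3.4.2 - 2.3.4.4
        ys = slice(int(y1), max(int(y1) + 1, int(np.ceil(y2))))
        xs = slice(int(x1), max(int(x1) + 1, int(np.ceil(x2))))
        target[:, ys, xs] = np.array([x1, y1, x2, y2], np.float32).reshape(4, 1, 1)
    return target                                                 # 2.3.4.5
```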
6. The target detection method based on feature adaptive aggregation of claim 1, wherein the optimization processing performed on the V images of the verification set with the image scaling normalization method in step 2.4 is:
2.4.1 Let the variable v = 1;
2.4.2 Scale the v-th image of the verification set to 512×512 to obtain the v-th scaled image;
2.4.3 Standardize the v-th scaled image to obtain the v-th scaled and normalized image;
2.4.4 Let v = v + 1; if v ≤ V, turn to 2.4.2; if v > V, the new verification set D_V consisting of V scaled and normalized images is obtained.
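A sketch of step 2.4 for one image is given below. The claim fixes only the 512×512 scale; the mean/std values (common ImageNet statistics) and the use of OpenCV for resizing are assumptions.

```python
import cv2
import numpy as np

def scale_and_normalize(img,
                        mean=(123.675, 116.28, 103.53),
                        std=(58.395, 57.12, 57.375)):
    """Resize to 512 x 512 (2.4.2) and standardize (2.4.3)."""
    img = cv2.resize(img, (512, 512), interpolation=cv2.INTER_LINEAR).astype(np.float32)
    return (img - np.asarray(mean, np.float32)) / np.asarray(std, np.float32)
```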
7. The target detection method based on feature adaptive aggregation of claim 1, wherein in the third step the feature pyramid network in the main feature extraction module, the feature adaptive aggregation module, the auxiliary task module and the main task module are initialized with a normal distribution with mean 0 and variance 0.01; the initial learning rate learning_rate is initialized to 0.01, the decay coefficient to 0.1, the optimizer momentum hyper-parameter to 0.9 and the weight decay to 0.0004; the training batch size mini_batch_size is initialized to 64; the maximum number of training epochs maxepoch is initialized to 120.
8. The method for detecting a target based on feature adaptive aggregation of claim 1, wherein in the third step N_m = 10; in step 3.3.5 α_l is set to 2 and β is set to 4; and in step 3.3.8 fixed values are assigned to the corner prediction network loss weight λ_corner, the center point prediction network loss weight λ_center, the coarse box prediction network loss weight λ_coarse and the fine box prediction network loss weight λ_refine.
9. The method of claim 1, wherein the step 4.11.3 is performed with the confidence threshold set to 0.3.
CN202211219905.9A 2022-10-06 2022-10-06 Target detection method based on feature self-adaptive aggregation Active CN115631344B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202211219905.9A CN115631344B (en) 2022-10-06 2022-10-06 Target detection method based on feature self-adaptive aggregation

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202211219905.9A CN115631344B (en) 2022-10-06 2022-10-06 Target detection method based on feature self-adaptive aggregation

Publications (2)

Publication Number Publication Date
CN115631344A true CN115631344A (en) 2023-01-20
CN115631344B CN115631344B (en) 2023-05-09

Family

ID=84905182

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202211219905.9A Active CN115631344B (en) 2022-10-06 2022-10-06 Target detection method based on feature self-adaptive aggregation

Country Status (1)

Country Link
CN (1) CN115631344B (en)



Patent Citations (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN110135267A (en) * 2019-04-17 2019-08-16 电子科技大学 A kind of subtle object detection method of large scene SAR image
CN111475650A (en) * 2020-04-02 2020-07-31 中国人民解放军国防科技大学 Russian semantic role labeling method, system, device and storage medium
WO2022083157A1 (en) * 2020-10-22 2022-04-28 北京迈格威科技有限公司 Target detection method and apparatus, and electronic device
CN113158862A (en) * 2021-04-13 2021-07-23 哈尔滨工业大学(深圳) Lightweight real-time face detection method based on multiple tasks
CN114841244A (en) * 2022-04-05 2022-08-02 西北工业大学 Target detection method based on robust sampling and mixed attention pyramid
CN114821357A (en) * 2022-04-24 2022-07-29 中国人民解放军空军工程大学 Optical remote sensing target detection method based on transformer

Non-Patent Citations (4)

* Cited by examiner, † Cited by third party
Title
CHAOJUN LIN et al.: "An anchor-free detector and R-CNN integrated neural network architecture for environmental perception of urban roads" *
YULIN HE et al.: "CenterRepp: Predict Central Representative Point Set's Distribution For Detection" *
HOU Zhiqiang et al.: "Anchor-free object detection algorithm based on dual-branch feature fusion" (基于双分支特征融合的无锚框目标检测算法) *
FAN Hongchao et al.: "Anchor-free based traffic sign detection" (基于Anchor-free的交通标志检测) *

Cited By (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN116452972A (en) * 2023-03-17 2023-07-18 兰州交通大学 Transformer end-to-end remote sensing image vehicle target detection method
CN116052026A (en) * 2023-03-28 2023-05-02 石家庄铁道大学 Unmanned aerial vehicle aerial image target detection method, system and storage medium
CN117152083A (en) * 2023-08-31 2023-12-01 哈尔滨工业大学 Ground penetrating radar road disease image prediction visualization method based on category activation mapping
CN117152083B (en) * 2023-08-31 2024-04-09 哈尔滨工业大学 Ground penetrating radar road disease image prediction visualization method based on category activation mapping
CN118279566A (en) * 2024-05-10 2024-07-02 广东工业大学 Automatic driving target detection system for small object

Also Published As

Publication number Publication date
CN115631344B (en) 2023-05-09

Similar Documents

Publication Publication Date Title
CN110298262B (en) Object identification method and device
CN110188705B (en) Remote traffic sign detection and identification method suitable for vehicle-mounted system
Cortinhal et al. Salsanext: Fast, uncertainty-aware semantic segmentation of lidar point clouds
CN110378381B (en) Object detection method, device and computer storage medium
CN110135267B (en) Large-scene SAR image fine target detection method
CN109584248B (en) Infrared target instance segmentation method based on feature fusion and dense connection network
CN111598030B (en) Method and system for detecting and segmenting vehicle in aerial image
Zhang et al. DAGN: A real-time UAV remote sensing image vehicle detection framework
Tian et al. A dual neural network for object detection in UAV images
CN115631344B (en) Target detection method based on feature self-adaptive aggregation
CN110309856A (en) Image classification method, the training method of neural network and device
CN107545263B (en) Object detection method and device
CN111160249A (en) Multi-class target detection method of optical remote sensing image based on cross-scale feature fusion
CN109583483A (en) A kind of object detection method and system based on convolutional neural networks
CN113361495A (en) Face image similarity calculation method, device, equipment and storage medium
CN104299006A (en) Vehicle license plate recognition method based on deep neural network
Nguyen et al. Hybrid deep learning-Gaussian process network for pedestrian lane detection in unstructured scenes
US20220180476A1 (en) Systems and methods for image feature extraction
CN116188999B (en) Small target detection method based on visible light and infrared image data fusion
CN110852327A (en) Image processing method, image processing device, electronic equipment and storage medium
CN113743417A (en) Semantic segmentation method and semantic segmentation device
CN112365451A (en) Method, device and equipment for determining image quality grade and computer readable medium
Fan et al. A novel sonar target detection and classification algorithm
WO2021083126A1 (en) Target detection and intelligent driving methods and apparatuses, device, and storage medium
Khellal et al. Pedestrian classification and detection in far infrared images

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant