CN115631344A - Target detection method based on feature adaptive aggregation - Google Patents

Target detection method based on feature adaptive aggregation

Info

Publication number
CN115631344A
CN115631344A (application CN202211219905.9A; granted publication CN115631344B)
Authority
CN
China
Prior art keywords
network
feature
image
prediction
aggregation
Prior art date
Legal status
Granted
Application number
CN202211219905.9A
Other languages
Chinese (zh)
Other versions
CN115631344B (en)
Inventor
陈微
何玉麟
罗馨
李晨
姚泽欢
汤明鑫
Current Assignee
National University of Defense Technology
Original Assignee
National University of Defense Technology
Priority date
Filing date
Publication date
Application filed by National University of Defense Technology
Priority to CN202211219905.9A
Publication of CN115631344A
Application granted
Publication of CN115631344B
Legal status: Active
Anticipated expiration

Classifications

    • G06V 10/44: Local feature extraction by analysis of parts of the pattern, e.g. by detecting edges, contours, loops, corners, strokes or intersections; Connectivity analysis, e.g. of connected components
    • G06N 3/08: Computing arrangements based on biological models; Neural networks; Learning methods
    • G06V 10/764: Arrangements for image or video recognition or understanding using pattern recognition or machine learning using classification, e.g. of video objects
    • G06V 10/774: Processing image or video features in feature spaces; Generating sets of training patterns; Bootstrap methods, e.g. bagging or boosting
    • G06V 10/82: Arrangements for image or video recognition or understanding using pattern recognition or machine learning using neural networks
    • G06V 2201/07: Indexing scheme relating to image or video recognition or understanding; Target detection
    • Y02D 10/00: Energy efficient computing, e.g. low power processors, power management or thermal management

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Evolutionary Computation (AREA)
  • General Physics & Mathematics (AREA)
  • Multimedia (AREA)
  • Health & Medical Sciences (AREA)
  • Artificial Intelligence (AREA)
  • Computing Systems (AREA)
  • Software Systems (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • General Health & Medical Sciences (AREA)
  • Medical Informatics (AREA)
  • Databases & Information Systems (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Biomedical Technology (AREA)
  • Biophysics (AREA)
  • Computational Linguistics (AREA)
  • Data Mining & Analysis (AREA)
  • Molecular Biology (AREA)
  • General Engineering & Computer Science (AREA)
  • Mathematical Physics (AREA)
  • Image Analysis (AREA)

Abstract

The invention discloses a target detection method based on feature adaptive aggregation, aiming to solve the problem that the detection accuracy of existing real-time target detection methods needs improvement. The technical scheme is: construct a target detection system based on feature adaptive aggregation, composed of a main feature extraction module, a feature adaptive aggregation module, an auxiliary task module, a main task module and a post-processing module; prepare the data set required by the target detection system and optimize the training-set image data with data enhancement techniques; train the target detection system on the training set, with the auxiliary task module assisting network training; then validate the trained target detection system and select the best-performing model parameters to obtain the best-performing trained target detection system; finally, use the best-performing trained target detection system to perform target detection on a user input image and obtain the positions and categories of the targets. The invention achieves a large accuracy improvement at a small time cost.

Description

Target detection method based on feature adaptive aggregation
Technical Field
The invention relates to the field of image recognition and target detection, and in particular to a target detection method based on feature adaptive aggregation that improves target detection accuracy.
Background
Target detection is one of the important tasks of computer vision and has numerous applications such as intelligent security, intelligent robots and intelligent transportation. With the development of artificial intelligence and deep learning, the performance of target detection technology has improved remarkably. The performance of a target detection method is generally evaluated on two aspects, accuracy and real-time performance: the former reflects the detection accuracy of the method, the latter its processing speed. For tasks such as face detection, vehicle detection and pedestrian detection, real-time performance is also an important index for measuring the performance of a target detection method. In practical applications, detection of the input image needs to be completed within a short time; otherwise the delay is too high, the user experience suffers, and serious consequences such as traffic accidents may occur.
Existing real-time target detection methods generally fall into two broad categories: anchor-based methods and anchor-free methods. Anchor-based methods generate predefined prior boxes (anchors) covering the whole image and extract prior-box features to complete the classification and regression tasks. However, the anchor-based approach is weak in generalization because the predefined prior boxes require manually set hyper-parameters and different aspect ratios and sizes for different data sets; it is also more complex than the anchor-free approach and slightly weaker in real-time performance. Anchor-free methods need no predefined prior boxes and directly extract pixel-point features from the feature map to complete the classification and regression tasks. Anchor-free methods are superior in speed and generalization, but their accuracy is limited by point features with weak characterization capability.
The document "Zhou X, wang D. Objects as points [ J ]. ArXiv preprinting arXiv:1904.07850,2019." (CenterNet) describes an anchor-free real-time object detection method, which uses the idea of keypoint detection to generate a Gaussian kernel for each object, which is used for locating the position of the center point of the object, and then uses regression branches to predict the length and width of the object frame. The centret realizes a simple model structure and has high running speed, but long-time training is needed to ensure that the model converges. The document "Liu Z, zheng T, xu G, et al. Training-time-free network for real-time object detection [ C ]// Proceedings of the AAAI Conference on Artificial Intelligence insight.2020, 34 (07): 11685-11692." (TTFNet) sets a wider range of Gaussian kernels for the problem of long training time of CenterNet, and considers more pixel points as training samples, increasing the number of training samples and making the model easier to converge. The method does not only locate the center point of the object, but takes any point of the Gaussian kernel region of the object as a prediction base point, and then predicts the distances from the prediction base point to the prediction frame in the four directions of the upper direction, the lower direction, the left direction and the right direction by using the regression branch. Through the improvement, the training time is reduced, and the precision is improved.
Both kinds of anchor-free methods have the advantages of high speed and good generalization, but their accuracy is still lower than that of anchor-based methods, because they do not address the key accuracy-limiting problems of insufficient pixel-point feature capability and high coupling between the classification and regression branch features.
How to improve the feature characterization capability of a target detection method and thereby improve its accuracy is still a technical problem of great concern to those skilled in the art.
Disclosure of Invention
The invention aims to solve the technical problems that existing real-time target detection methods have insufficient feature characterization capability, high coupling between classification and regression branch features and low detection accuracy, and provides a target detection method based on feature adaptive aggregation. Without affecting real-time performance, the adaptive feature aggregation technique adds only a small amount of computation to alleviate the problems of insufficient feature characterization capability and high coupling between classification and regression branch features, improving target detection accuracy.
To solve the technical problem, the technical scheme of the invention is: construct a target detection system based on feature adaptive aggregation. The system is composed of a main feature extraction module, a feature adaptive aggregation module, an auxiliary task module, a main task module and a post-processing module. Prepare and construct the data set required by the target detection system, and divide it into a training set, a validation set and a test set. Apply random cropping, random flipping, random translation, random brightness, saturation and contrast changes, and standardization to the training-set images through data enhancement techniques to increase the diversity of the training data. Apply only size scaling and standardization to the validation set and the test set to keep the visual cues of the original images. Then train the main feature extraction module, the feature adaptive aggregation module, the auxiliary task module and the main task module of the target detection system with the training set. During training, the auxiliary task module assists network training in order to strengthen the target detection network's attention to object corner positions and improve localization accuracy. After each round of training, test the trained target detection system with the validation set, select the best-performing model parameters and assign them to the trainable modules of the target detection system (the main feature extraction module, the feature adaptive aggregation module and the main task module), obtaining the best-performing trained target detection system. Finally, use the best-performing trained target detection system to perform target detection on the image input by the user and obtain the positions and categories of the targets.
The technical scheme of the invention comprises the following steps:
firstly, constructing a target detection system based on feature adaptive aggregation. As shown in fig. 1, the target detection system is composed of a main feature extraction module, a feature adaptive aggregation module, an auxiliary task module, a main task module, and a post-processing module.
The main feature extraction module is connected with the feature adaptive aggregation module; it extracts multi-scale features from the input image and sends a multi-scale feature map containing the multi-scale features to the feature adaptive aggregation module. The main feature extraction module consists of a DarkNet-53 convolutional neural network (see "Redmon J, Farhadi A. YOLOv3: An Incremental Improvement. arXiv preprint arXiv:1804.02767, 2018.") and a feature pyramid network (see "Lin T Y, Dollár P, Girshick R, et al. Feature Pyramid Networks for Object Detection. Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2017: 2117-2125."). The DarkNet-53 convolutional neural network is a lightweight backbone network comprising 53 neural network layers, divided into 5 serial sub-networks, and extracts the backbone network features of the image. The feature pyramid network receives the backbone network features from the DarkNet-53 convolutional neural network, obtains a multi-scale feature map containing the multi-scale features through up-sampling, feature extraction and feature fusion operations, and sends the multi-scale feature map to the feature adaptive aggregation module.
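For illustration, the following PyTorch sketch shows one possible wiring of a 5-stage backbone feeding an FPN-style top-down fusion that outputs 3 multi-scale feature maps; the block depths, channel widths and activation choices are illustrative assumptions, not the exact DarkNet-53 configuration of the invention.

```python
# Minimal sketch of the main feature extraction module: a DarkNet-53-style backbone of
# 5 serial sub-networks followed by an FPN-style top-down fusion yielding 3 feature maps.
import torch
import torch.nn as nn
import torch.nn.functional as F

def conv_bn_act(c_in, c_out, k=3, s=1):
    return nn.Sequential(
        nn.Conv2d(c_in, c_out, k, s, k // 2, bias=False),
        nn.BatchNorm2d(c_out),
        nn.LeakyReLU(0.1, inplace=True),
    )

class BackboneFPN(nn.Module):
    def __init__(self, out_ch=64):
        super().__init__()
        # 5 serial sub-networks, each halving the spatial resolution (assumed widths).
        chs = [32, 64, 128, 256, 512]
        self.stages = nn.ModuleList()
        c_prev = 3
        for c in chs:
            self.stages.append(conv_bn_act(c_prev, c, k=3, s=2))
            c_prev = c
        # 1x1 lateral convolutions for the last three stages (strides 8, 16, 32).
        self.laterals = nn.ModuleList(nn.Conv2d(c, out_ch, 1) for c in chs[-3:])

    def forward(self, x):
        feats = []
        for stage in self.stages:
            x = stage(x)
            feats.append(x)
        c3, c4, c5 = feats[-3:]                      # strides 8, 16, 32
        p5 = self.laterals[2](c5)
        p4 = self.laterals[1](c4) + F.interpolate(p5, scale_factor=2, mode="nearest")
        p3 = self.laterals[0](c3) + F.interpolate(p4, scale_factor=2, mode="nearest")
        return [p3, p4, p5]                          # 3 multi-scale feature maps

imgs = torch.randn(2, 3, 512, 512)
for f in BackboneFPN()(imgs):
    print(f.shape)   # [2, 64, 64, 64], [2, 64, 32, 32], [2, 64, 16, 16]
```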
The feature adaptive aggregation module is connected with the main feature extraction module, the auxiliary task module and the main task module. Its function is to provide a multi-scale-aware high-pixel feature map for the auxiliary task module and a boundary-region-aware high-pixel feature map and a salient-region-aware high-pixel feature map for the main task module, improving the detection accuracy of the target detection system. The feature adaptive aggregation module is composed of an adaptive multi-scale feature aggregation network, an adaptive spatial feature aggregation network and a coarse-box prediction network. The adaptive multi-scale feature aggregation network is composed of 4 weight-unshared SE (Squeeze-and-Excitation) networks (called the first, second, third and fourth SE networks). It receives the multi-scale feature map from the feature pyramid network of the main feature extraction module, applies the adaptive multi-scale feature aggregation method (channel self-attention enhancement, bilinear interpolation up-sampling and scale-level soft-weight aggregation) to the multi-scale feature map to obtain the multi-scale-aware high-pixel feature map, and sends this map to the adaptive spatial feature aggregation network, the coarse-box prediction network and the auxiliary task module. The coarse-box prediction network is composed of two 3×3 convolution layers and one 1×1 convolution layer; it receives the multi-scale-aware high-pixel feature map from the adaptive multi-scale feature aggregation network, predicts on it to obtain the coarse-box predicted positions, and sends the coarse-box predicted positions to the adaptive spatial feature aggregation network. The adaptive spatial feature aggregation network is composed of two region-limited deformable convolutions with different offset transfer functions (a classification offset transfer function and a regression offset transfer function); it receives the multi-scale-aware high-pixel feature map from the adaptive multi-scale feature aggregation network and the coarse-box predicted positions from the coarse-box prediction network, generates the boundary-region-aware high-pixel feature map and the salient-region-aware high-pixel feature map, and sends them to the main task module, giving the main task module adaptive spatial perception capability and alleviating the problem that highly coupled input features degrade detection accuracy.
The auxiliary task module is connected with the adaptive multi-scale feature aggregation network in the feature adaptive aggregation module. The auxiliary task module is a corner prediction network composed of two 3×3 convolution layers, one 1×1 convolution layer and a sigmoid activation layer. It receives the multi-scale-aware high-pixel feature map from the adaptive multi-scale feature aggregation network and predicts on it to obtain a corner prediction heatmap, which is used to compute the corner prediction loss during training and to help the target detection system perceive corner regions. The auxiliary task module is used only during training of the target detection system, to strengthen the system's perception of object corner positions so that object box positions can be predicted more accurately. When the trained target detection system detects a user input image, this module is discarded and adds no extra computation.
The main task module is connected with the adaptive spatial feature aggregation network and the post-processing module, and consists of a fine-box prediction network and a center-point prediction network. The fine-box prediction network is a single 1×1 convolution layer; it receives the boundary-region-aware high-pixel feature map from the adaptive spatial feature aggregation network, applies the 1×1 convolution to it to obtain the fine-box predicted positions, and sends the fine-box predicted positions to the post-processing module. The center-point prediction network consists of a 1×1 convolution layer and a sigmoid activation layer; it receives the salient-region-aware high-pixel feature map from the adaptive spatial feature aggregation network, applies the 1×1 convolution and activation to it to obtain the center-point prediction heatmap, and sends the center-point prediction heatmap to the post-processing module.
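A minimal sketch of these two prediction heads is given below, assuming 64-channel input feature maps and 80 classes (MS COCO); the class names and tensor shapes are illustrative only.

```python
# Sketch of the main task module's heads: the fine-box network is a single 1x1 convolution
# producing 4 distance channels, and the center-point network is a 1x1 convolution plus
# sigmoid producing C class heatmap channels.
import torch
import torch.nn as nn

class MainTaskHeads(nn.Module):
    def __init__(self, in_ch=64, num_classes=80):
        super().__init__()
        self.fine_box = nn.Conv2d(in_ch, 4, kernel_size=1)               # distances t, d, l, r
        self.center = nn.Sequential(nn.Conv2d(in_ch, num_classes, 1), nn.Sigmoid())

    def forward(self, f_hr, f_hs):
        # f_hr: boundary-region-aware map, f_hs: salient-region-aware map (both H/4 x W/4)
        return self.fine_box(f_hr), self.center(f_hs)

f_hr = torch.randn(1, 64, 128, 128)
f_hs = torch.randn(1, 64, 128, 128)
boxes, heatmap = MainTaskHeads()(f_hr, f_hs)
print(boxes.shape, heatmap.shape)   # (1, 4, 128, 128) (1, 80, 128, 128)
```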
The post-processing module is a 3×3 pooling layer connected with the fine-box prediction network and the center-point prediction network in the main task module. It receives the fine-box predicted positions from the fine-box prediction network and the center-point prediction heatmap from the center-point prediction network, applies 3×3 max pooling with stride 1 to keep only the prediction maximum within each 3×3 neighborhood of the center-point prediction heatmap, and takes the positions of the kept maxima, i.e. the peak points, as the positions of the object center-region points. For each center-region point, the corresponding distances in the four directions up, down, left and right are looked up in the fine-box predicted positions to generate the predicted object box, and the center-point category at the center-region point position is the predicted object category. By extracting only the peak points within each 3×3 range, the post-processing module suppresses overlapping erroneous boxes and reduces false-positive prediction boxes.
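The peak-extraction step can be sketched as follows; the score threshold and the stride value of 4 used to map feature-map coordinates back to image coordinates are assumptions for illustration.

```python
# Sketch of the post-processing module: 3x3 max pooling with stride 1 keeps only local
# maxima ("peak points") of the center-point heatmap, then the fine-box distances at the
# surviving positions are turned into boxes.
import torch
import torch.nn.functional as F

def decode(center_heatmap, fine_box, score_thr=0.3):
    # center_heatmap: (C, H, W) after sigmoid; fine_box: (4, H, W) distances t, d, l, r
    pooled = F.max_pool2d(center_heatmap.unsqueeze(0), 3, stride=1, padding=1).squeeze(0)
    peaks = (pooled == center_heatmap) & (center_heatmap > score_thr)
    cls, ys, xs = peaks.nonzero(as_tuple=True)
    t, d, l, r = (fine_box[k, ys, xs] for k in range(4))
    # Boxes in feature-map coordinates; multiply by the stride (4) for image coordinates.
    boxes = torch.stack([xs - l, ys - t, xs + r, ys + d], dim=1) * 4
    scores = center_heatmap[cls, ys, xs]
    return boxes, cls, scores

heat = torch.rand(80, 128, 128)
dist = torch.rand(4, 128, 128) * 10
boxes, labels, scores = decode(heat, dist)
```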
Secondly, construct a training set, a validation set and a test set. The method comprises the following steps:
2.1 collecting target detection scene images as a target detection data set, and manually labeling each target detection scene image in the target detection data set, wherein the method comprises the following steps:
the general Scene data set published by MS COCO (see documents "Tsung-Yi Lin, michael Maire, large Belongie, james Hays, pietro Perona, deva Ramanan, piotr Dollar, and C Lawrence' S Zitnicknic. Microdoco: common objects in scenes in ECCV,2014." Tsung-Yi Lin, michael Maire et al, microsoft COCO: common objects in scenes) or the Cityscapes unmanned Scene data set (see documents "Cordts M, omran M, ramos S, ciet al. The MS COCO dataset has 80 classes, containing 105000 training images (train 2017) as training set, 5000 verification images (val 2017) as verification set, and 20000 test images (test-dev) as test set. The citrescaps dataset has 8 classes: pedestrians, riders, trolleys, trucks, buses, trains, motorcycles and bicycles, with 2975 training images as the training set, 500 validation images as the validation set, 1525 Zhang Ceshi images as the test set. Let the total number of images in the training set be S, let the total number of images in the test set be T, let the total number of images in the verification set be V, let S be 205000 or 2975, T be 20000 or 1524, and let V be 5000 or 500. Each image of the MS COCO and the citrescaps data sets is manually labeled, that is, each image is labeled with the position of an object in the form of a rectangular frame and is labeled with the category of the object.
2.2 Optimize the S images in the training set, including flipping, cropping, translation, brightness transformation, contrast transformation, saturation transformation, scaling and standardization, to obtain the optimized training set $D_t$ (a sketch of this augmentation pipeline is given after step 2.2.10). The method comprises the following steps:
2.2.1 Let variable s = 1; initialize the optimized training set $D_t$ to empty;
2.2.2 Flip the s-th image in the training set with a random flipping method to obtain the s-th flipped image; the random probability of the random flipping method is 0.5;
2.2.3 Randomly crop the s-th flipped image under a minimum intersection-over-union (IoU) constraint to obtain the s-th cropped image; the minimum IoU used is 0.3;
2.2.4 Apply random image translation to the s-th cropped image to obtain the s-th translated image;
2.2.5 Apply random brightness transformation to the s-th translated image to obtain the s-th brightness-transformed image; the random brightness uses a brightness delta of 32;
2.2.6 Apply random contrast transformation to the s-th brightness-transformed image to obtain the s-th contrast-transformed image; the contrast range of the random contrast is (0.5, 1.5);
2.2.7 Apply random saturation transformation to the s-th contrast-transformed image to obtain the s-th saturation-transformed image; the saturation range of the random saturation is (0.5, 1.5);
2.2.8 Scale the s-th saturation-transformed image to 512×512 to obtain the s-th scaled image;
2.2.9 Standardize the s-th scaled image with a standardization operation to obtain the s-th standard image, and put the s-th standard image into the optimized training set $D_t$.
2.2.10 If s ≤ S, let s = s + 1 and go to 2.2.2; if s > S, the optimized training set $D_t$ consisting of S standard images has been obtained; go to 2.3.
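For illustration, a minimal sketch of this augmentation pipeline is given below, assuming a PIL image input. The IoU-constrained crop and the translation are only indicated by comments because they must also update the label boxes, and the normalization statistics are assumed ImageNet values rather than values taken from the patent.

```python
# Sketch of the training-set augmentation of step 2.2 (image-only parts shown).
import random
from PIL import Image
import torchvision.transforms as T
import torchvision.transforms.functional as TF

def augment(img):
    # 2.2.2 random horizontal flip with probability 0.5 (label boxes must be flipped too)
    if random.random() < 0.5:
        img = TF.hflip(img)
    # 2.2.3 random crop with minimum IoU 0.3 and 2.2.4 random translation would go here;
    # both need the label boxes, so they are omitted from this image-only sketch.
    # 2.2.5-2.2.7 random brightness (delta 32, approximated as a factor of 32/255),
    # contrast and saturation in the range (0.5, 1.5)
    img = T.ColorJitter(brightness=32 / 255, contrast=(0.5, 1.5), saturation=(0.5, 1.5))(img)
    # 2.2.8 scale to 512x512
    img = TF.resize(img, [512, 512])
    # 2.2.9 standardization (assumed ImageNet mean/std)
    tensor = TF.to_tensor(img)
    return TF.normalize(tensor, mean=[0.485, 0.456, 0.406], std=[0.229, 0.224, 0.225])

# example usage (hypothetical file name): tensor = augment(Image.open("example.jpg"))
```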
2.3 Make the task ground-truth labels for model training according to the optimized training set $D_t$. The labels are divided into four tasks: the center-point prediction task, the corner prediction task, the coarse-box prediction task and the fine-box prediction task. The method comprises the following steps:
2.3.1 Let variable s = 1. Let the s-th image in the optimized training set have $N_s$ label boxes, and let the i-th label box be $B_{si}=\big(x_{si}^{tl},y_{si}^{tl},x_{si}^{br},y_{si}^{br}\big)$ with label category $c_i$, where $(x_{si}^{tl},y_{si}^{tl})$ are the coordinates of the top-left corner point of the i-th label box and $(x_{si}^{br},y_{si}^{br})$ are the coordinates of the bottom-right corner point; $N_s$ is a positive integer and $1 \le i \le N_s$.
2.3.2 Construct the center-point prediction ground truth $H^{gt}_{ctr}$ for the center-point prediction task (a sketch follows step 2.3.2.7). The method comprises the following steps:
2.3.2.1 Construct an all-zero matrix $H_{zeros}$ of size $\frac{H}{4}\times\frac{W}{4}\times C$, where C is the number of classification categories of the optimized training set, i.e. the number of labeled target categories of the target detection data set (for example, 80 categories for the MS COCO data set and 19 categories for the Cityscapes data set), H is the height of the s-th image and W is the width of the s-th image;
2.3.2.2 Let i = 1, denoting the i-th 4× down-sampled label box;
2.3.2.3 Divide the label coordinates of $B_{si}$ by 4 and record the result as the 4× down-sampled label box $B'_{si}=\big(x'^{tl}_{si},y'^{tl}_{si},x'^{br}_{si},y'^{br}_{si}\big)$, whose four corner points $(x'^{tl}_{si},y'^{tl}_{si})$, $(x'^{br}_{si},y'^{tl}_{si})$, $(x'^{tl}_{si},y'^{br}_{si})$ and $(x'^{br}_{si},y'^{br}_{si})$ are the top-left, top-right, bottom-left and bottom-right corner positions of $B'_{si}$.
2.3.2.4 Use the two-dimensional Gaussian kernel generation method: taking the center point of $B'_{si}$ as the base point of a two-dimensional Gaussian kernel with variance $(\sigma_x,\sigma_y)$, compute the Gaussian values of all pixel points within the two-dimensional Gaussian kernel range to obtain the first Gaussian value set $S_{ctr}$. The specific steps are:
2.3.2.4.1 Let the number of pixel points in the two-dimensional Gaussian kernel be $N_{pixel}$, $N_{pixel}$ a positive integer; initialize the first Gaussian value set $S_{ctr}$ to empty;
2.3.2.4.2 Let p = 1, indexing the pixel points in the two-dimensional Gaussian kernel, $1 \le p \le N_{pixel}$;
2.3.2.4.3 In the s-th image, the two-dimensional Gaussian value $K(x_p,y_p)$ of any pixel point $(x_p,y_p)$ within the Gaussian kernel range whose base point is $(x_0,y_0)$ is:

$$K(x_p,y_p)=\exp\!\left(-\frac{(x_p-x_0)^2}{2\sigma_x^2}-\frac{(y_p-y_0)^2}{2\sigma_y^2}\right) \tag{1}$$

where $(x_0,y_0)$ is the base point of the two-dimensional Gaussian kernel, i.e. the center of the two-dimensional Gaussian kernel (it may be the center point of $B'_{si}$ or a corner point of $B'_{si}$), $x_0$ is the coordinate of the base point in the width direction and $y_0$ its coordinate in the height direction; $(x_p,y_p)$ is a pixel point within the Gaussian kernel range of the base point $(x_0,y_0)$, $x_p$ is the coordinate of the pixel point in the width direction and $y_p$ its coordinate in the height direction; $(x_0,y_0)$ and $(x_p,y_p)$ are both located in the 4× down-sampled image coordinate system; $\sigma_x^2$ is the variance of the two-dimensional Gaussian kernel in the width direction and $\sigma_y^2$ its variance in the height direction, and the number of points within the Gaussian kernel range is controlled by controlling these two variances; w is the width of $B'_{si}$ at the feature-map scale, h is the height of $B'_{si}$ at the feature-map scale, and α is the parameter determining the ratio of the center region to $B'_{si}$, set to 0.54. Store $(x_p,y_p)$ and the computed $K(x_p,y_p)$ in the first Gaussian value set $S_{ctr}$;
2.3.2.4.4 Let p = p + 1; if p ≤ $N_{pixel}$, go to 2.3.2.4.3; if p > $N_{pixel}$, the coordinates and two-dimensional Gaussian values within the Gaussian kernel of $B'_{si}$ have been stored in $S_{ctr}$, which now contains $N_{pixel}$ pixel points and their corresponding two-dimensional Gaussian values; go to 2.3.2.5;
2.3.2.5 Assign the values of $S_{ctr}$ to $H_{zeros}$: for each element $(x_p,y_p)$ and $K(x_p,y_p)$ of $S_{ctr}$, assign according to the rule $H_{zeros}[x_p,y_p,c_i]=K(x_p,y_p)$, where $c_i$ is the class number of $B'_{si}$, $1 \le c_i \le C$, $c_i$ a positive integer;
2.3.2.6 Let i = i + 1; if i ≤ $N_s$, go to 2.3.2.3; if i > $N_s$, the two-dimensional Gaussian values generated by all $N_s$ 4× down-sampled label boxes of the s-th image have been assigned to $H_{zeros}$; go to 2.3.2.7;
2.3.2.7 The center-point prediction ground truth of the s-th image is $H^{gt}_{ctr}=H_{zeros}$.
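A sketch of how one center-point Gaussian could be written into the ground-truth heatmap is given below. The choice σ = α·side/6 and the element-wise maximum used to combine overlapping boxes are illustrative assumptions; the text above only states that the variance is controlled through the box width, height and the ratio parameter α = 0.54.

```python
# Sketch of the center-point ground-truth construction of step 2.3.2 for one label box.
import numpy as np

def draw_center_gaussian(heatmap, box, cls, alpha=0.54):
    # heatmap: (C, H/4, W/4); box: (x1, y1, x2, y2) already divided by 4
    x1, y1, x2, y2 = box
    w, h = x2 - x1, y2 - y1
    cx, cy = (x1 + x2) / 2.0, (y1 + y2) / 2.0
    sx, sy = max(alpha * w / 6.0, 1e-3), max(alpha * h / 6.0, 1e-3)   # assumed sigma rule
    ys, xs = np.mgrid[0:heatmap.shape[1], 0:heatmap.shape[2]]
    gauss = np.exp(-((xs - cx) ** 2) / (2 * sx ** 2) - ((ys - cy) ** 2) / (2 * sy ** 2))
    gauss[gauss < 1e-4] = 0.0                       # keep only the kernel's local support
    heatmap[cls] = np.maximum(heatmap[cls], gauss)  # max used here to combine overlapping boxes
    return heatmap

H_zeros = np.zeros((80, 128, 128), dtype=np.float32)
H_ctr = draw_center_gaussian(H_zeros, (10.0, 12.0, 40.0, 30.0), cls=3)
```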
2.3.3 Construct the corner prediction ground truth $H^{gt}_{corner}$ for the corner prediction task. The method comprises the following steps:
2.3.3.1 Construct an all-zero matrix $H^{corner}_{zeros}$ of size $\frac{H}{4}\times\frac{W}{4}\times 4$, where "4" is the number of corner points of the 4× down-sampled label box and also the number of channels of the matrix;
2.3.3.2 Let i = 1, denoting the i-th 4× down-sampled label box;
2.3.3.3 Let the base point of the two-dimensional Gaussian kernel be the top-left corner point of $B'_{si}$, with coordinates $(x'^{tl}_{si},y'^{tl}_{si})$. Using the two-dimensional Gaussian kernel generation method of 2.3.2.4, compute the Gaussian values of all pixel points within the two-dimensional Gaussian kernel whose base point is this corner point and whose variance is $(\sigma_x,\sigma_y)$, obtaining the second Gaussian value set $S_{tl}$;
2.3.3.4 Assign the element coordinates and Gaussian values in $S_{tl}$ to the 1st channel of $H^{corner}_{zeros}$, i.e. assign according to the rule $H^{corner}_{zeros}[x_p,y_p,1]=K(x_p,y_p)$;
2.3.3.5 Let the base point of the two-dimensional Gaussian kernel be the top-right corner point of $B'_{si}$, with coordinates $(x'^{br}_{si},y'^{tl}_{si})$. Using the two-dimensional Gaussian kernel generation method of 2.3.2.4, compute the Gaussian values of all pixel points within the two-dimensional Gaussian kernel whose base point is this corner point and whose variance is $(\sigma_x,\sigma_y)$, obtaining the third Gaussian value set $S_{tr}$;
2.3.3.6 Assign the element coordinates and Gaussian values in $S_{tr}$ to the 2nd channel of $H^{corner}_{zeros}$, i.e. assign according to the rule $H^{corner}_{zeros}[x_p,y_p,2]=K(x_p,y_p)$;
2.3.3.7 Let the base point of the two-dimensional Gaussian kernel be the bottom-left corner point of $B'_{si}$, with coordinates $(x'^{tl}_{si},y'^{br}_{si})$. Using the two-dimensional Gaussian kernel generation method of 2.3.2.4, compute the Gaussian values of all pixel points within the two-dimensional Gaussian kernel whose base point is this corner point and whose variance is $(\sigma_x,\sigma_y)$, obtaining the fourth Gaussian value set $S_{dl}$;
2.3.3.8 Assign the element coordinates and Gaussian values in $S_{dl}$ to the 3rd channel of $H^{corner}_{zeros}$, i.e. assign according to the rule $H^{corner}_{zeros}[x_p,y_p,3]=K(x_p,y_p)$;
2.3.3.9 Let the base point of the two-dimensional Gaussian kernel be the bottom-right corner point of $B'_{si}$, with coordinates $(x'^{br}_{si},y'^{br}_{si})$. Using the two-dimensional Gaussian kernel generation method of 2.3.2.4, compute the Gaussian values of all pixel points within the two-dimensional Gaussian kernel whose base point is this corner point and whose variance is $(\sigma_x,\sigma_y)$, obtaining the fifth Gaussian value set $S_{dr}$;
2.3.3.10 Assign the element coordinates and Gaussian values in $S_{dr}$ to the 4th channel of $H^{corner}_{zeros}$, i.e. assign according to the rule $H^{corner}_{zeros}[x_p,y_p,4]=K(x_p,y_p)$;
2.3.3.11 Let i = i + 1; if i ≤ $N_s$, go to 2.3.3.3; if i > $N_s$, the two-dimensional Gaussian values generated by all $N_s$ 4× down-sampled label boxes of the s-th image have been assigned to $H^{corner}_{zeros}$; go to 2.3.3.12;
2.3.3.12 The corner prediction ground truth of the s-th image is $H^{gt}_{corner}=H^{corner}_{zeros}$.
2.3.4 Construct the coarse-box ground truth $B^{gt}_{coarse}$ of the s-th image for the coarse-box prediction task from the $N_s$ 4× down-sampled label boxes of the s-th image. The method comprises the following steps:
2.3.4.1 Construct an all-zero matrix $B^{coarse}_{zeros}$ of size $\frac{H}{4}\times\frac{W}{4}\times 4$, where "4" represents the 4 coordinates of the 4× down-sampled label box;
2.3.4.2 Let i = 1, denoting the i-th 4× down-sampled label box;
2.3.4.3 Assign values to the pixels inside the i-th 4× down-sampled label box $B'_{si}$ in $B^{coarse}_{zeros}$, i.e. assign the coordinate values $\big(x'^{tl}_{si},y'^{tl}_{si},x'^{br}_{si},y'^{br}_{si}\big)$ of $B'_{si}$ to the 4 channels of those pixel positions;
2.3.4.4 Let i = i + 1; if i ≤ $N_s$, go to 2.3.4.3; if i > $N_s$, the coarse-box ground-truth values corresponding to all $N_s$ label boxes of the s-th image have been assigned to $B^{coarse}_{zeros}$, and the assigned $B^{coarse}_{zeros}$ is the ground-truth label of the s-th image; go to 2.3.4.5;
2.3.4.5 The coarse-box ground truth of the s-th image is $B^{gt}_{coarse}=B^{coarse}_{zeros}$.
2.3.5 Construct the ground truth $B^{gt}_{refine}$ of the fine-box prediction task from $B^{gt}_{coarse}$: the value of $B^{gt}_{refine}$ equals that of $B^{gt}_{coarse}$, i.e. $B^{gt}_{refine}=B^{gt}_{coarse}$.
2.3.6 Let s = s + 1; if s ≤ S, go to 2.3.2; if s > S, go to 2.3.7;
2.3.7 The task ground-truth labels of the S images for model training have been obtained; the task ground-truth labels and the S images together form the training set $D_M$ for model training.
2.4 Optimize the V images in the validation set with an image scaling and standardization method to obtain a new validation set $D_V$ consisting of the V scaled and standardized images. The method comprises the following steps:
2.4.1 Let variable v = 1;
2.4.2 Scale the v-th image in the validation set to 512×512 to obtain the scaled image of the v-th image;
2.4.3 Standardize the scaled image of the v-th image with a standardization operation to obtain the standardized image of the v-th image;
2.4.4 If v ≤ V, let v = v + 1 and go to 2.4.2; if v > V, the new validation set $D_V$ consisting of the V scaled and standardized images has been obtained; go to 2.5.
2.5 Optimize the T images in the test set with the image scaling and standardization method of step 2.4 to obtain a new test set $D_T$ consisting of the T scaled and standardized images.
Thirdly, train the target detection system constructed in the first step using gradient back-propagation to obtain $N_m$ sets of model parameters. The method comprises the following steps:
3.1 Initialize the network weight parameters of each module in the target detection system. The parameters of the DarkNet-53 convolutional neural network in the main feature extraction module are initialized with a pre-trained model trained on the ImageNet data set (https://www.image-net.org/); the other network weight parameters (the feature pyramid network in the main feature extraction module, the feature adaptive aggregation module, the auxiliary task module and the main task module) are initialized with a normal distribution with mean 0 and variance 0.01.
3.2 Set the training parameters of the target detection system. The initial learning rate learning_rate is set to 0.01 and the learning-rate decay factor to 0.1, i.e. the learning rate is reduced by a factor of 10 (decay is performed at training epochs 80 and 110). Stochastic gradient descent (SGD) is selected as the model training optimizer, with momentum 0.9 and weight decay 0.0004. The batch size (mini_batch_size) of network training is 64. The maximum number of training epochs (max_epoch) is 120.
3.3 Train the target detection system: at each training step, the differences between the coarse-box predicted positions, fine-box predicted positions, corner prediction heatmap and center-point prediction heatmap output by the target detection system and their ground-truth values are used as the loss value (loss), and the network weight parameters are updated by gradient back-propagation until the loss value reaches the threshold or the training epoch reaches max_epoch. During the last $N_m$ (typically 10) training epochs, the network weight parameters are saved once per epoch.
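For illustration, a sketch of the optimizer and schedule described in 3.2 and 3.3 is given below, assuming the detection system is available as a torch.nn.Module called `detector` and that `total_loss` (a placeholder name) combines the four task losses.

```python
# Sketch of the training schedule: SGD with momentum 0.9, weight decay 0.0004,
# initial lr 0.01 decayed by 10x at epochs 80 and 110, 120 epochs, batch size 64.
import torch

detector = torch.nn.Conv2d(3, 4, 1)          # stand-in for the full detection network
optimizer = torch.optim.SGD(detector.parameters(), lr=0.01,
                            momentum=0.9, weight_decay=0.0004)
scheduler = torch.optim.lr_scheduler.MultiStepLR(optimizer, milestones=[80, 110], gamma=0.1)

for epoch in range(1, 121):
    # for images, targets in train_loader:                  # batches of 64 images
    #     loss = total_loss(detector(images), targets)      # sum of the four task losses
    #     optimizer.zero_grad(); loss.backward(); optimizer.step()
    scheduler.step()
    if epoch > 110:                                          # keep the last N_m = 10 checkpoints
        torch.save(detector.state_dict(), f"epoch_{epoch}.pth")
```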
The method comprises the following steps:
3.3.1 Let the training epoch epoch = 1 (one pass over all data in the training set is one epoch), and initialize the batch index $N_b$ = 1;
3.3.2 The main feature extraction module reads the $N_b$-th batch, B = 64 images in total, from $D_M$, and the B images are recorded as the matrix $I_{train}$; $I_{train}$ contains B H×W×3 images, where H is the height of the input image, W is the width of the input image and "3" denotes the three RGB channels of the image.
3.3.3 The main feature extraction module extracts the image features of $I_{train}$ with the main feature extraction method to obtain the multi-scale features of $I_{train}$, and sends the multi-scale feature map containing the multi-scale features of $I_{train}$ to the feature adaptive aggregation module. The method comprises the following steps:
3.3.3.1 The DarkNet-53 convolutional neural network of the main feature extraction module extracts the image features of $I_{train}$ to obtain the backbone network feature map set. The method is: the 5 serial sub-networks of the DarkNet-53 convolutional neural network perform down-sampling and feature extraction on the B images of $I_{train}$ to obtain the backbone network features, i.e. 4 feature maps (the outputs of the last four serial sub-networks), which are sent to the feature pyramid network.
3.3.3.2 The feature pyramid network receives the 4 feature maps from the DarkNet-53 convolutional neural network, performs up-sampling, feature extraction and feature fusion on the 4 feature maps to obtain 3 multi-scale feature maps, denoted $\{F_1,F_2,F_3\}$, and sends the multi-scale feature maps $\{F_1,F_2,F_3\}$ to the feature adaptive aggregation module.
3.3.4 The feature adaptive aggregation module receives the multi-scale feature maps $\{F_1,F_2,F_3\}$ from the feature pyramid network, generates the multi-scale-aware high-pixel feature map $F_H$ and sends $F_H$ to the auxiliary task module; it also generates the boundary-region-aware high-pixel feature map and the salient-region-aware high-pixel feature map and sends them to the main task module. The method comprises the following steps:
3.3.4.1 The adaptive multi-scale feature aggregation network receives $\{F_1,F_2,F_3\}$ from the feature pyramid network and applies the adaptive multi-scale feature aggregation method: channel self-attention enhancement, bilinear interpolation up-sampling and scale-level soft-weight aggregation are performed on $\{F_1,F_2,F_3\}$ to obtain the multi-scale-aware high-pixel feature map $F_H$. The resolution of $F_H$ is $\frac{H}{4}\times\frac{W}{4}$ and the number of channels of $F_H$ is 64 (a sketch of this aggregation is given after step 3.3.4.1.4). The specific method is:
3.3.4.1.1 The adaptive multi-scale feature aggregation network applies the first, second and third SE networks to $\{F_1,F_2,F_3\}$ in parallel to perform channel self-attention enhancement: the first SE network applies a weighted summation over the channels of $F_1$ to obtain the first channel-enhanced feature map $\hat F_1$; simultaneously the second SE network applies a weighted summation over the channels of $F_2$ to obtain the second channel-enhanced feature map $\hat F_2$; simultaneously the third SE network applies a weighted summation over the channels of $F_3$ to obtain the third channel-enhanced feature map $\hat F_3$.
3.3.4.1.2 The first, second and third SE networks of the adaptive multi-scale feature aggregation network up-sample $\hat F_1$, $\hat F_2$ and $\hat F_3$ in parallel by bilinear interpolation to the same resolution $\frac{H}{4}\times\frac{W}{4}$, obtaining the up-sampled feature maps $U_1$, $U_2$ and $U_3$, which form the up-sampled feature map set $\{U_1,U_2,U_3\}$. The specific calculation process is shown in formula (2):

$$U_l=\mathrm{Upsample}\big(SE_n(F_l)\big),\qquad 1\le l\le 3,\ 1\le n\le 3 \tag{2}$$

where $SE_n$ denotes the n-th SE network, $F_l$ denotes the l-th multi-scale feature map and Upsample denotes bilinear interpolation up-sampling.
3.3.4.1.3 The adaptive multi-scale feature aggregation network computes the weights of $\{U_1,U_2,U_3\}$ with a 1×1 convolution that reduces the number of channels from 64 to 1, and then applies a Softmax operation over the scale dimension to obtain soft weight maps $\{A_1,A_2,A_3\}$ with the same resolution as $\{U_1,U_2,U_3\}$. The value of each pixel of the soft weight map indicates which of the 3 scales of $\{U_1,U_2,U_3\}$ should receive more attention, i.e. which scale is weighted more heavily, so that objects of different sizes respond to feature maps of different scales.
3.3.4.1.4 The adaptive multi-scale feature aggregation network multiplies the weight map $A_l$ of the l-th scale element-wise with the corresponding l-th up-sampled feature map $U_l$, i.e. $A_1$ is multiplied element-wise with $U_1$, $A_2$ with $U_2$ and $A_3$ with $U_3$, giving 3 products; the 3 products are then summed (weighted summation) and fused into one feature map, the fused feature map. A fourth SE network then enhances the channel representation of the fused feature map to obtain the multi-scale-aware high-pixel feature map $F_H$. The specific process is shown in formula (3):

$$F_H=SE_4\!\left(\sum_{l=1}^{3}A_l\times U_l\right),\qquad A_l=\underset{l}{\mathrm{Softmax}}\big(\mathrm{Conv}(U_l)\big) \tag{3}$$

where $SE_4$ is the fourth SE network, $A_l$ indicates the weight that the element at the same position occupies at the different scales, "×" denotes the element-wise product of corresponding positions, and Conv denotes the 1×1 convolution. The adaptive multi-scale feature aggregation network sends $F_H$ to the auxiliary task module, the coarse-box prediction network and the adaptive spatial feature aggregation network.
3.3.4.2 The coarse-box prediction network receives the multi-scale-aware high-pixel feature map $F_H$ from the adaptive multi-scale feature aggregation network, performs coarse-box position prediction at the position of each feature point of $F_H$ with the coarse-box prediction method to generate the coarse-box predicted positions $B_{coarse}$, and sends $B_{coarse}$ to the adaptive spatial feature aggregation network. $B_{coarse}$ also has the resolution $\frac{H}{4}\times\frac{W}{4}$ of $F_H$ and 4 channels; the 4 channels represent the distances from the pixel point to the box in the four directions up, down, left and right, so that each pixel point forms a coarse box. $B_{coarse}$ is used to limit the deformable-convolution sampling range in the adaptive spatial feature aggregation network. In addition, the loss $L_{coarse}$ between $B_{coarse}$ and the coarse-box ground truth $B^{gt}_{coarse}$ constructed in 2.3.4 is computed. The calculation of $L_{coarse}$ is based on the GIoU loss (see "Rezatofighi H, Tsoi N, Gwak J Y, et al. Generalized Intersection over Union: A Metric and A Loss for Bounding Box Regression. Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2019: 658-666."):

$$L_{coarse}=\frac{1}{N_b}\sum_{(i,j)\in S_b}W_{ij}\Big(1-\mathrm{GIoU}\big(B_{coarse}(i,j),\,B^{gt}_{coarse}(i,j)\big)\Big) \tag{4}$$

where $S_b$ is the regression sample set, consisting of the pixels where $B^{gt}_{coarse}$ is not 0; $N_b$ is the number of regression samples; $B_{coarse}(i,j)$ and $B^{gt}_{coarse}(i,j)$ denote the boxes formed by the predicted and ground-truth distances at pixel (i, j); and $W_{ij}$ is the weight value of position (i, j) where $B^{gt}_{coarse}$ is not 0, used to apply a larger loss weight to pixel points at center-region positions so that those pixel points regress the label-box position more accurately.
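A sketch of a weighted GIoU loss in the spirit of equation (4) is given below, using torchvision's generalized_box_iou_loss; taking the weights W directly from a center heatmap is an illustrative choice, since the text only states that W emphasizes center-region pixels.

```python
# Sketch of the weighted GIoU regression loss of equation (4): only pixels whose
# ground-truth box is non-zero contribute, scaled by a per-pixel weight W.
import torch
from torchvision.ops import generalized_box_iou_loss

def weighted_giou_loss(pred_dist, gt_dist, weight):
    # pred_dist, gt_dist: (4, H, W) distances t, d, l, r; weight: (H, W)
    mask = gt_dist.sum(dim=0) > 0                          # regression sample set S_b
    ys, xs = mask.nonzero(as_tuple=True)

    def to_boxes(dist):
        t, d, l, r = (dist[k, ys, xs] for k in range(4))
        return torch.stack([xs - l, ys - t, xs + r, ys + d], dim=1)

    loss = generalized_box_iou_loss(to_boxes(pred_dist), to_boxes(gt_dist), reduction="none")
    w = weight[ys, xs]
    return (loss * w).sum() / max(ys.numel(), 1)           # average over N_b samples

pred = torch.rand(4, 128, 128) * 8
gt = torch.zeros(4, 128, 128); gt[:, 40:60, 40:60] = 5.0
w = torch.zeros(128, 128); w[40:60, 40:60] = 1.0           # e.g. Gaussian center weights
print(weighted_giou_loss(pred, gt, w))
```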
3.3.4.3 The adaptive spatial feature aggregation network receives the multi-scale-aware high-pixel feature map $F_H$ from the adaptive multi-scale feature aggregation network and the coarse-box predicted positions $B_{coarse}$ from the coarse-box prediction network, and generates the boundary-region-aware high-pixel feature map $F_{HR}$ and the salient-region-aware high-pixel feature map $F_{HS}$. The method comprises the following steps:
3.3.4.3.1 Design the region-limited deformable convolution (R-DConv). Deformable convolution (DConv) (see "Zhu X, Hu H, Lin S, et al. Deformable ConvNets v2: More Deformable, Better Results. Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2019: 9308-9316.") is often used to enhance the spatial perception of features because of its adaptive sparse sampling. However, its sampling range is not limited, so the sampling points easily drift too far; and for objects of different sizes, the difficulty of adaptively learning to sample the most representative feature points is inconsistent, which leads to poor adaptability when detecting objects of different sizes. The invention therefore designs the region-limited deformable convolution (R-DConv) to enhance adaptability. The specific method is as follows (a sketch of R-DConv is given after step 3.3.4.3.1.2):
3.3.4.3.1.1 Design the offset transfer function $\mathcal{T}$, which transforms the offset Δp of the deformable convolution (Δp is the learnable offset attached to the feature points, see "Zhu X, Hu H, Lin S, et al. Deformable ConvNets v2: More Deformable, Better Results. Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2019: 9308-9316.") into a transformed offset, so that the difficulty of adaptively learning to sample the most representative feature points becomes consistent for objects of different sizes. $\mathcal{T}$ limits the offset range of the spatial sampling points of the deformable convolution to within $B_{coarse}$, and at the same time differentiates the offset Δp of the deformable convolution. Because the spatial sampling range of a large object is wider than that of a small object, the corresponding search difficulty differs. To solve the inconsistent search difficulty of spatial feature points for objects of different sizes, a Sigmoid function is used to normalize the offset Δp within $B_{coarse}$ so that Δp lies in the interval [0, 1]. After such processing, the difficulty of searching for the point feature with the strongest characterization capability becomes the same for objects of different sizes. Δp is split into $h_{\Delta p}$ and $w_{\Delta p}$, where $h_{\Delta p}$ denotes the offset of Δp in the vertical direction and $w_{\Delta p}$ the offset of Δp in the horizontal direction. $\mathcal{T}$ is shown in equation (5):

$$\mathcal{T}_v(h_{\Delta p})=(t+d)\,\mathrm{Sigmoid}(h_{\Delta p})-t,\qquad \mathcal{T}_h(w_{\Delta p})=(l+r)\,\mathrm{Sigmoid}(w_{\Delta p})-l \tag{5}$$

where $\mathcal{T}_v$ is the offset transfer function in the vertical direction, $\mathcal{T}_h$ the offset transfer function in the horizontal direction, the overall offset transfer function is $\mathcal{T}=(\mathcal{T}_v,\mathcal{T}_h)$, and (t, l, r, d) are the distances from the convolution kernel position p to $B_{coarse}$ in the four directions up, left, right and down.
3.3.4.3.1.2 Use $\mathcal{T}$ to limit the deformable-convolution sampling region. Given a 3×3 convolution kernel with K = 9 spatial sampling position points, let $w_k$ denote the convolution kernel weight of the k-th position and $P_k$ the predefined position offset of the k-th position; $P_k\in\{(-1,-1),(-1,0),\dots,(1,1)\}$ denotes the 3×3 range centered at (0, 0). Let x(p) denote the input feature map at the convolution kernel center position p and y(p) the output feature map at the convolution kernel center position p. y(p) is calculated with R-DConv as shown in equation (6):

$$y(p)=\sum_{k=1}^{K}w_k\cdot x\big(p+P_k+\mathcal{T}(\Delta p_k)\big)\cdot\Delta m_k \tag{6}$$

where $\Delta p_k$ denotes the learnable offset of the k-th position and $\Delta m_k$ the weight of the k-th position. $\Delta p_k$ and $\Delta m_k$ are generated by a 3×3 convolution that outputs a 27-channel feature map: 9 channels are the abscissa offset values of $\Delta p_k$, 9 channels are the ordinate offset values of $\Delta p_k$, and 9 channels (representing the weights of the different offset-value features) are the values of $\Delta m_k$. $B_{coarse}$, the coarse box predicted at the scale of the current feature map, is also the predefined bounding region.
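A hedged sketch of R-DConv built on torchvision's deform_conv2d follows; the way the sigmoid-normalized offsets are rescaled by the per-pixel coarse-box distances follows the description of the offset transfer function above, but the exact rescaling is an assumption rather than the patent's verbatim formula.

```python
# Sketch of region-limited deformable convolution (R-DConv): a 3x3 convolution produces
# 27 channels (18 raw offsets + 9 modulation masks); offsets are squashed by a sigmoid and
# rescaled by the coarse-box distances (t, d, l, r) so sampling stays inside the coarse box.
import torch
import torch.nn as nn
from torchvision.ops import deform_conv2d

class RDConv(nn.Module):
    def __init__(self, ch=64, k=3):
        super().__init__()
        self.k = k
        self.offset_mask = nn.Conv2d(ch, 3 * k * k, 3, padding=1)   # 18 offsets + 9 masks
        self.weight = nn.Parameter(torch.randn(ch, ch, k, k) * 0.01)

    def forward(self, x, coarse_box):
        # coarse_box: (B, 4, H, W) distances t, d, l, r from each pixel to its coarse box
        om = self.offset_mask(x)
        raw_off = om[:, :2 * self.k ** 2]
        mask = torch.sigmoid(om[:, 2 * self.k ** 2:])
        dy, dx = raw_off[:, 0::2], raw_off[:, 1::2]                  # (B, 9, H, W) each
        t, d = coarse_box[:, 0:1], coarse_box[:, 1:2]
        l, r = coarse_box[:, 2:3], coarse_box[:, 3:4]
        # offset transfer: sigmoid-normalize, then map into [-t, d] vertically, [-l, r] horizontally
        dy = (t + d) * torch.sigmoid(dy) - t
        dx = (l + r) * torch.sigmoid(dx) - l
        offset = torch.stack([dy, dx], dim=2).flatten(1, 2)          # interleave to (B, 18, H, W)
        return deform_conv2d(x, offset, self.weight, padding=1, mask=mask)

x = torch.randn(1, 64, 128, 128)
box = torch.rand(1, 4, 128, 128) * 6
print(RDConv()(x, box).shape)       # torch.Size([1, 64, 128, 128])
```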
3.3.4.3.2 To make R-DConv learn the salient region of an object within the coarse-box range and extract features that make object classification more accurate, the classification adaptive spatial feature aggregation method is adopted, using $B_{coarse}$ to limit the sampling range when aggregating features on $F_H$. The method is:
3.3.4.3.2.1 Let the offset transfer function be the classification offset transfer function $\mathcal{T}_{cls}$ and compute the output feature $y_{cls}(p)$ at position p with equation (6).
3.3.4.3.2.2 Traverse $F_H$ with the convolution kernel using $\mathcal{T}_{cls}$ to obtain the salient-region-aware high-pixel feature map $F_{HS}$. $\mathcal{T}_{cls}$ allows the sampling points to be concentrated so that the classification branch can focus on the most discriminative salient regions. Thus $\mathcal{T}_{cls}$ enables R-DConv to learn the salient region of the object within the coarse-box range and extract the features that make object classification more accurate, i.e. the salient-region-aware high-pixel feature map $F_{HS}$; $F_{HS}$ is sent to the main task module.
3.3.4.3.3 To make R-DConv learn the boundary-region information of an object within the coarse-box range and extract features that make object position regression more accurate, the regression adaptive spatial feature aggregation method is adopted, using $B_{coarse}$ to limit the sampling range when aggregating features on $F_H$. The regression adaptive spatial feature aggregation method is specifically:
3.3.4.3.3.1 Design the regression offset transfer function $\mathcal{T}_{reg}$, which transforms the offset Δp of the deformable convolution. $\mathcal{T}_{reg}$ divides the spatial sampling points of the R-DConv operation uniformly along the four directions up, down, left and right, so that the limited region is divided into four sub-regions corresponding to top-left, top-right, bottom-left and bottom-right. $\mathcal{T}_{reg}$ samples each of the four sub-regions uniformly, i.e. assigns an equal number of sampling points to each sub-region. In this way the spatial sampling points of the R-DConv operation are dispersed, so features containing more information from the boundary can be extracted and the object position can be regressed more accurately. With K = 9, $\mathcal{T}_{reg}$ samples two points from each of the four sub-regions; the eight edge points plus the center point form a 3×3 convolution kernel, which strengthens the capture of boundary information for the center feature point. The regression offset transfer function $\mathcal{T}_{reg}$ is shown in equation (7):

$$\mathcal{T}^{\,v}_{reg}(h_{\Delta p})=\begin{cases}-\,t\cdot\mathrm{Sigmoid}(h_{\Delta p}), & \text{sampling points assigned to the upper sub-regions}\\[2pt]\;\;d\cdot\mathrm{Sigmoid}(h_{\Delta p}), & \text{sampling points assigned to the lower sub-regions}\end{cases}\qquad \mathcal{T}^{\,h}_{reg}(w_{\Delta p})=\begin{cases}-\,l\cdot\mathrm{Sigmoid}(w_{\Delta p}), & \text{left sub-regions}\\[2pt]\;\;r\cdot\mathrm{Sigmoid}(w_{\Delta p}), & \text{right sub-regions}\end{cases} \tag{7}$$

where Sigmoid is the function that normalizes the offset within the coarse-box interval; the normalization balances the sampling difficulty of objects of different sizes.
Will be provided with
Figure BDA00038769718500001112
Substituted into equation (6)
Figure BDA00038769718500001113
Obtaining an output characteristic y at the position p reg (p) of the formula (I). Thus, it is possible to provide
Figure BDA00038769718500001114
Enabling R-DConv to learn the region of the object boundary in the coarse frame range, and extracting the feature which enables the regression position of the prediction frame to be more accurate, namely a high-pixel feature map F perceived by the boundary region HR
3.3.4.3.3.2 uses
Figure BDA00038769718500001115
Traversing F with convolution kernel H Obtaining a high pixel characteristic map F perceived by the boundary area HR Will F HR And sending the data to a main task module.
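To make the contrast between the two branches concrete, the sketch below (Python/PyTorch style, written per feature point) shows one possible reading of the classification and regression offset transfer functions: both squash the raw offsets with a sigmoid and re-scale them by the distances t, l, r, d from the feature point to the sides of B_coarse, but the regression transfer additionally pins two of the eight non-centre kernel points to each of the four sub-regions. The quadrant assignment and the exact scaling are assumptions for illustration; the patent's exact forms are given by equations (5)–(7).

```python
import torch

# Quadrant of each of the K=9 kernel points for the regression transfer:
# index 4 is the kernel centre; the remaining eight points are assigned two
# per sub-region (UL, UR, LL, LR). This particular assignment is an assumption.
_QUAD = ["UL", "UL", "UR", "UR", None, "LL", "LL", "LR", "LR"]

def cls_offsets(raw, tlrd):
    """Classification transfer: sample anywhere inside B_coarse.
    raw: (9, 2) raw (dy, dx) offsets; tlrd: distances (top, left, right, down)
    from the feature point p to the sides of B_coarse (ordering assumed)."""
    t, l, r, d = tlrd
    s = torch.sigmoid(raw)                        # normalise to (0, 1)
    dy = s[:, 0] * (t + d) - t                    # vertical range [-t, d]
    dx = s[:, 1] * (l + r) - l                    # horizontal range [-l, r]
    return torch.stack([dy, dx], dim=1)

def reg_offsets(raw, tlrd):
    """Regression transfer: spread the eight edge points over the four
    sub-regions of B_coarse; the centre point stays at p."""
    t, l, r, d = tlrd
    s = torch.sigmoid(raw)                        # (9, 2) in (0, 1)
    dy, dx = torch.zeros(9), torch.zeros(9)
    for k, quad in enumerate(_QUAD):
        if quad is None:                          # centre point
            continue
        dy[k] = -s[k, 0] * t if quad in ("UL", "UR") else s[k, 0] * d
        dx[k] = -s[k, 1] * l if quad in ("UL", "LL") else s[k, 1] * r
    return torch.stack([dy, dx], dim=1)
```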
3.3.5 The auxiliary task module receives F_H from the adaptive multi-scale feature aggregation network and processes it with two layers of 3 × 3 convolution, one layer of 1 × 1 convolution and a sigmoid function to obtain the corner prediction heatmap H_corner. H_corner has a resolution of (H/4) × (W/4) and 4 channels. The loss between H_corner and the corner prediction ground truth H_corner^gt constructed in 2.3.3 is computed, giving the loss value L_corner. L_corner is based on a modified version of Focal Loss (see the literature "Law H, Deng J. CornerNet: Detecting objects as paired keypoints [C] // Proceedings of the European Conference on Computer Vision (ECCV). 2018", i.e. CornerNet: object detection with paired corner keypoints), as shown in equation (8):

L_corner = -(1/N_s) Σ_{c=1}^{4} Σ_{i,j} f(c, i, j),    where
f(c, i, j) = (1 - H_corner(c,i,j))^{α_l} · log(H_corner(c,i,j))                                   if H_corner^gt(c,i,j) = 1,
f(c, i, j) = (1 - H_corner^gt(c,i,j))^{β} · (H_corner(c,i,j))^{α_l} · log(1 - H_corner(c,i,j))     otherwise.    (8)

where N_s is the number of annotation boxes in the image, and α_l and β are hyperparameters, set to 2 and 4 respectively, that control the gradient contribution of the loss function. H_corner(c,i,j) is the corner prediction output by the auxiliary task module at channel c and pixel position (i, j), and H_corner^gt(c,i,j) is the corner prediction ground truth at channel c and pixel position (i, j). The auxiliary task module learns the positions of the four corners of the annotation boxes and assists the training of the target detection network, so that the extracted features focus on the corner positions of objects and the target detection system localizes objects more accurately.
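For reference, a minimal sketch of this CornerNet-style modified focal loss is given below (PyTorch). The normaliser N_s is approximated here by the number of ground-truth peaks in the heatmap, and the small epsilon is added for numerical stability; both are implementation assumptions rather than details fixed by the patent.

```python
import torch

def corner_focal_loss(pred, gt, alpha=2.0, beta=4.0, eps=1e-6):
    """Modified focal loss for the corner heatmap H_corner.

    pred: (B, 4, H/4, W/4) sigmoid outputs of the corner prediction network.
    gt:   (B, 4, H/4, W/4) Gaussian-splatted corner ground truth.
    """
    pos = gt.eq(1).float()                       # peak positions (y = 1)
    neg = 1.0 - pos
    pos_loss = pos * torch.log(pred + eps) * (1 - pred) ** alpha
    neg_loss = neg * (1 - gt) ** beta * torch.log(1 - pred + eps) * pred ** alpha
    num_pos = pos.sum().clamp(min=1.0)           # stands in for N_s
    return -(pos_loss.sum() + neg_loss.sum()) / num_pos
```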
3.3.6 The fine-box prediction network of the main task module receives the boundary-region-aware high-pixel feature map F_HR from the adaptive spatial feature aggregation network and, after one layer of 1 × 1 convolution, obtains the fine-box predicted positions B_refine at the feature-point positions of F_HR. B_refine has a resolution of (H/4) × (W/4) and 4 channels; the 4 channels represent the distances from a pixel point to the predicted fine box in the up, down, left and right directions, so every pixel point forms a fine prediction box. The loss L_refine between B_refine and the fine-box ground truth B_refine^gt obtained in 2.3.5 is computed. L_refine is based on the GIoU loss, as shown in equation (9):

L_refine = (1/N_b) Σ_{(i,j)∈S_b} W_ij · (1 - GIoU(B_refine(i,j), B_refine^gt(i,j)))    (9)

where S_b is the regression sample set, consisting of the pixels where B_refine^gt is not 0; N_b is the number of samples in the set; and W_ij is the weight at a position (i, j) where B_refine^gt is not 0, used to apply a larger loss weight to pixel points in the centre region so that they regress the annotation-box position more accurately. B_refine reflects how accurately the target detection system regresses the object position.
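A minimal sketch of this centre-weighted GIoU loss is shown below (PyTorch, using torchvision's generalized_box_iou_loss). It assumes the regression samples S_b have already been gathered into flat tensors of (t, l, d, r) distances and per-pixel weights W_ij; the boxes are built relative to each sample pixel, which yields the same GIoU as absolute boxes since predicted and ground-truth boxes share the pixel origin. The normalisation by N_b follows the text above.

```python
import torch
from torchvision.ops import generalized_box_iou_loss

def weighted_giou_loss(pred_tlrd, gt_tlrd, weight):
    """Centre-weighted GIoU loss for B_refine (and, analogously, B_coarse).

    pred_tlrd, gt_tlrd: (N, 4) distances (t, l, d, r) from each regression
    sample pixel to the four box sides; weight: (N,) per-pixel weights W_ij.
    Pixels with an all-zero ground truth are assumed to be filtered out (S_b).
    """
    def to_xyxy(tlrd):
        t, l, d, r = tlrd.unbind(dim=1)
        return torch.stack([-l, -t, r, d], dim=1)   # box relative to the pixel

    loss = generalized_box_iou_loss(to_xyxy(pred_tlrd), to_xyxy(gt_tlrd),
                                    reduction="none")
    return (weight * loss).sum() / max(weight.numel(), 1)   # divide by N_b
```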
3.3.7 The centre-point prediction network of the main task module receives the salient-region-aware high-pixel feature map F_HS from the adaptive spatial feature aggregation network and, after one layer of 1 × 1 convolution and a sigmoid function, obtains the centre-point prediction heatmap H_center at the feature-point positions of F_HS. H_center has a resolution of (H/4) × (W/4), and its number of channels is the number of dataset categories C (C is 80 for the MS COCO dataset and 8 for the Cityscapes dataset). The loss L_center between H_center and the centre-point prediction ground truth H_center^gt constructed in 2.3.2 is computed. L_center is based on the modified Focal Loss, as shown in equation (10), which has the same form as equation (8) with H_corner replaced by H_center and the channel sum running over the C categories. Here N_s is again the number of annotation boxes in the image, and α_l and β are hyperparameters, set to 2 and 4 respectively, that control the gradient contribution of the loss function. H_center(c,i,j) is the centre-point prediction heatmap value at channel c and pixel position (i, j), and H_center^gt(c,i,j) is the centre-point ground truth at channel c and pixel position (i, j). H_center reflects the ability of the target detection system to locate object centres and to distinguish object classes.
3.3.8 Design the total loss function L_total of the target detection system as shown in equation (11):

L_total = λ_corner · L_corner + λ_center · L_center + λ_coarse · L_coarse + λ_refine · L_refine    (11)

where L_corner is the loss computed between the corner prediction network output H_corner and the ground truth H_corner^gt; L_center is the loss computed between the centre-point prediction network output H_center and the ground truth H_center^gt; L_coarse is the loss computed between the coarse-box prediction network output B_coarse and the ground truth B_coarse^gt; and L_refine is the loss computed between the fine-box prediction network output B_refine and the ground truth B_refine^gt. λ_corner, λ_center, λ_coarse and λ_refine are, respectively, the loss weights of the corner prediction network, the centre-point prediction network, the coarse-box prediction network and the fine-box prediction network, set according to the importance of each task.
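Putting the four terms together, the total objective of equation (11) is a weighted sum of the branch losses; a trivial sketch follows. The default weight values below are placeholders, not values fixed by the patent.

```python
def total_loss(l_corner, l_center, l_coarse, l_refine,
               w_corner=1.0, w_center=1.0, w_coarse=1.0, w_refine=1.0):
    """Weighted sum of the four branch losses, as in equation (11)."""
    return (w_corner * l_corner + w_center * l_center
            + w_coarse * l_coarse + w_refine * l_refine)
```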
3.3.9 Let epoch = epoch + 1. If epoch is 80 or 110, let learning_rate = learning_rate × 0.1 and go to 3.3.10; if epoch is neither 80 nor 110, go directly to 3.3.10;
3.3.10 If epoch ≤ maxepoch, go to 3.3.2; if epoch > maxepoch, training is finished, go to 3.3.11;
3.3.11 Save the network weight parameters of the last N_m epochs.
Fourthly, use the validation set to verify the detection precision of the target detection system after loading each of the saved N_m epoch network weight parameters, and keep the best-performing network weight parameters as the network weight parameters of the target detection system. The method comprises the following steps:
4.1 Let variable n_m = 1;
4.2 Load the n_m-th of the N_m saved epoch network weight parameters into the target detection system; input the new validation set D_V, obtained with the image scaling and normalization method of step 2.4, into the target detection system;
4.3 Let v = 1 index the v-th image of the validation set, V being the number of validation images;
4.4 The main feature extraction module receives the v-th validation image D_v, extracts the multi-scale features of D_v with the main feature extraction method of 3.3.3, and sends the multi-scale feature map containing the multi-scale features of D_v to the feature adaptive aggregation module;
4.5 The adaptive multi-scale feature aggregation network in the feature adaptive aggregation module receives the multi-scale feature map containing the multi-scale features of D_v and applies the adaptive multi-scale feature aggregation method of 3.3.4.1 (channel self-attention enhancement, bilinear interpolation upsampling and scale-level soft-weight aggregation) to obtain the multi-scale-aware high-pixel feature map F_HV of D_v; F_HV is sent to the coarse-box prediction network and the adaptive spatial feature aggregation network;
4.6 The coarse-box prediction network in the feature adaptive aggregation module receives F_HV and, with the coarse-box prediction method of 3.3.4.2, predicts a coarse-box position for every feature point of F_HV, generating the coarse-box predicted positions B_HVcoarse of the v-th validation image D_v; B_HVcoarse is sent to the adaptive spatial feature aggregation network. B_HVcoarse also has a resolution of (H/4) × (W/4) and 4 channels;
4.7 The adaptive spatial feature aggregation network in the feature adaptive aggregation module receives B_HVcoarse from the coarse-box prediction network and F_HV from the adaptive multi-scale feature aggregation network, and applies the classification adaptive spatial feature aggregation method of 3.3.4.3.2, using B_HVcoarse to limit the sampling range, to aggregate classification-task spatial features on F_HV, obtaining the salient-region-aware high-pixel feature map of the v-th validation image D_v; this feature map is sent to the centre-point prediction network;
4.8 The adaptive spatial feature aggregation network applies the regression adaptive spatial feature aggregation method of 3.3.4.3.3, using B_HVcoarse to limit the sampling range, to aggregate regression-task spatial features on F_HV, obtaining the boundary-region-aware high-pixel feature map of the v-th validation image D_v; this feature map is sent to the fine-box prediction network;
4.9 The fine-box prediction network in the main task module receives the boundary-region-aware high-pixel feature map and, after one layer of 1 × 1 convolution, obtains the fine-box predicted positions of the objects in the v-th validation image D_v; the fine-box predicted positions are sent to the post-processing module;
4.10 The centre-point prediction network in the main task module receives the salient-region-aware high-pixel feature map of the v-th validation image D_v and, after one layer of 1 × 1 convolution and the sigmoid function, obtains the centre-point prediction heatmap of D_v; the centre-point prediction heatmap of D_v is sent to the post-processing module;
4.11 The post-processing module receives the fine-box predicted positions and the centre-point prediction heatmap of the v-th validation image D_v and removes overlapping false boxes from them to obtain the predicted object-box set of D_v (a sketch of this decoding step is given after 4.11.3 below). The specific method is:
4.11.1 The post-processing module performs a 3 × 3 max-pooling operation (2D max-pooling) on the centre-point prediction heatmap of D_v to extract the set of peak points of the heatmap, each peak point representing a centre-region point inside a predicted object;
4.11.2 For a peak point with coordinates (P_x, P_y) taken from the centre-point prediction heatmap of D_v, the post-processing module reads the distance information (t, l, d, r) in the up, left, down and right directions at (P_x, P_y) from the fine-box predicted positions of D_v and obtains a prediction box of D_v: B_p = {P_x - l, P_y - t, P_x + r, P_y + d}. The category of B_p is the index of the channel with the largest centre-point heatmap value at (P_x, P_y), recorded as c_p; the confidence of B_p is the heatmap value of channel c_p at (P_x, P_y), recorded as s_p;
4.11.3 The post-processing module keeps the prediction boxes of D_v whose confidence s_p is greater than a confidence threshold (typically set to 0.3); these form the predicted object-box set of D_v, which keeps each prediction box B_p and its category c_p;
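A compact sketch of this decoding step (3 × 3 max-pool peak extraction plus fine-box read-out) is given below in PyTorch. The channel order (t, l, d, r) of the fine-box map, the top-k truncation and the stride-4 rescaling are implementation assumptions consistent with the description above.

```python
import torch
import torch.nn.functional as F

def decode_detections(center_heatmap, fine_tlrd, score_thr=0.3, topk=100, stride=4):
    """Keep 3x3 local maxima of the centre heatmap, then build boxes from the
    (t, l, d, r) distances of the fine-box map at each peak (see 4.11).

    center_heatmap: (C, H/4, W/4) sigmoid heatmap; fine_tlrd: (4, H/4, W/4).
    Returns boxes (N, 4) in input-image coordinates, scores (N,), labels (N,).
    """
    C, H, W = center_heatmap.shape
    hmax = F.max_pool2d(center_heatmap[None], 3, stride=1, padding=1)[0]
    peaks = center_heatmap * (hmax == center_heatmap).float()   # suppress non-peaks
    scores, idx = peaks.flatten().topk(topk)
    labels = torch.div(idx, H * W, rounding_mode="floor")       # channel = class c_p
    ys = torch.div(idx % (H * W), W, rounding_mode="floor")
    xs = idx % W
    t, l, d, r = fine_tlrd[:, ys, xs]                           # (4, topk)
    boxes = torch.stack([xs - l, ys - t, xs + r, ys + d], dim=1) * stride
    keep = scores > score_thr                                   # threshold 0.3
    return boxes[keep], scores[keep], labels[keep]
```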
4.12 Let v = v + 1. If v ≤ V, go to 4.4; if v > V, the predicted object-box sets of the V validation images of the n_m-th model have been obtained; go to 4.13;
4.13 If the validation set is the MS COCO public general-scene dataset, test the precision of the final object-box prediction sets output by the target detection system with the standard MS COCO evaluation protocol (https://cocodataset.org/), record the precision, and go to 4.14; if the validation set is the Cityscapes autonomous-driving dataset, test the precision with the Cityscapes evaluation protocol (https://www.cityscapes-dataset.com/), record the precision, and go to 4.14;
4.14 Let n_m = n_m + 1. If n_m ≤ N_m, go to 4.2; if n_m > N_m, the precision of all N_m models has been tested, go to 4.15;
4.15 From the precisions of the N_m object-box prediction sets, select the most accurate one, find the weight parameters of the target detection system that produced it, take these as the selected weight parameters, and load them into the target detection system; the target detection system loaded with the selected weight parameters is the trained target detection system.
Fifthly, adopting the trained target detection system to perform target detection on the image to be detected input by the user, wherein the method comprises the following steps:
5.1 Apply the image scaling and normalization method of step 2.4 to the image I to be detected input by the user, obtaining the normalized image I_nor, and input I_nor into the main feature extraction module;
5.2 The main feature extraction module receives I_nor, extracts the multi-scale features of I_nor with the main feature extraction method of 3.3.3, and sends the multi-scale feature map containing the multi-scale features of I_nor to the feature adaptive aggregation module;
5.3 The adaptive multi-scale feature aggregation network in the feature adaptive aggregation module receives the multi-scale feature map containing the multi-scale features of I_nor and applies the adaptive multi-scale feature aggregation method of 3.3.4.1 (channel self-attention enhancement, bilinear interpolation upsampling and scale-level soft-weight aggregation) to obtain the multi-scale-aware high-pixel feature map F_IH; F_IH is sent to the coarse-box prediction network and the adaptive spatial feature aggregation network;
5.4 The coarse-box prediction network in the feature adaptive aggregation module receives F_IH and, with the coarse-box prediction method of 3.3.4.2, predicts the coarse-box positions B_Icoarse in the image I to be detected; B_Icoarse is sent to the adaptive spatial feature aggregation network. B_Icoarse also has a resolution of (H/4) × (W/4) and 4 channels;
5.5 The adaptive spatial feature aggregation network in the feature adaptive aggregation module receives F_IH and B_Icoarse and applies the classification adaptive spatial feature aggregation method of 3.3.4.3.2, using B_Icoarse to limit the sampling range, to aggregate classification-task spatial features on F_IH, obtaining the salient-region-aware high-pixel feature map of the image I to be detected; this feature map is sent to the centre-point prediction network;
5.6 The adaptive spatial feature aggregation network applies the regression adaptive spatial feature aggregation method of 3.3.4.3.3, using B_Icoarse to limit the sampling range, to aggregate regression-task spatial features on F_IH, obtaining the boundary-region-aware high-pixel feature map of the image I to be detected; this feature map is sent to the fine-box prediction network;
5.7 The fine-box prediction network in the main task module receives the boundary-region-aware high-pixel feature map of the image I to be detected and, after one layer of 1 × 1 convolution, obtains the fine-box predicted positions of the objects in the image I to be detected; these are sent to the post-processing module;
5.8 The centre-point prediction network in the main task module receives the salient-region-aware high-pixel feature map of the image I to be detected and, after one layer of 1 × 1 convolution and the sigmoid function, obtains the centre-point prediction heatmap of the objects in the image I to be detected; the heatmap is sent to the post-processing module;
5.9 The post-processing module receives the fine-box predicted positions and the centre-point prediction heatmap of the objects in the image I to be detected and applies the overlapping-false-box removal method of step 4.11 to them, obtaining the predicted object-box set of the image I to be detected; the set keeps each prediction box B_p and its category information, i.e. the coordinate position and predicted category of each predicted object box of the image to be detected.
And sixthly, finishing.
The invention can achieve the following beneficial effects:
The invention provides a target detection method based on feature adaptive aggregation. By adopting the adaptive multi-scale feature aggregation network and the adaptive spatial feature aggregation network, it achieves a large precision improvement at a small additional computational cost, and it is applicable to most image-based target detection tasks. The invention can achieve the following effects:
1. The invention constructs a target detection system that integrates a main feature extraction module, a feature adaptive aggregation module, an auxiliary task module, a main task module and a post-processing module. While keeping the target detection method fast and real-time, it designs an aggregation scheme and network structure suited to target detection by exploiting the channel self-attention enhancement and scale-level soft-weight aggregation of the adaptive multi-scale feature aggregation network and the adaptive feature aggregation capability of the deformable convolution in the adaptive spatial feature aggregation network, thereby achieving a large improvement in detection precision. Experiments on the MS COCO and Cityscapes datasets show that the detection precision of the invention is greatly improved compared with the CenterNet and TTFNet methods of the background art.
2. The adaptive multi-scale feature aggregation network uses SE modules to strengthen the channel representation of the features and a scale-level soft-weight map to strengthen their multi-scale representation. The adaptive spatial feature aggregation network uses the coarse box to limit the spatial sampling range of the deformable convolution, which alleviates excessive offsets, and designs different offset transfer functions for the centre-point prediction task and the fine-box prediction network, so that the regression task attends to the boundary region of the object and the classification task attends to the salient region of the object. This alleviates the feature-coupling problem between the classification and regression tasks and yields a large improvement in detection precision.
Drawings
FIG. 1 is a logical block diagram of an object detection system constructed in a first step of the present invention.
FIG. 2 is a general flow chart of the present invention.
FIG. 3 is a graph comparing the results of the test of the present invention with the results of the TTFNet method.
Fig. 4 is a diagram showing an example of a detection image in a test for the effect of the present invention.
Detailed Description
The following describes an embodiment of the present invention with reference to the drawings. As shown in fig. 2, the present invention comprises the steps of:
firstly, constructing a target detection system based on feature adaptive aggregation. As shown in fig. 1, the target detection system is composed of a main feature extraction module, a feature adaptive aggregation module, an auxiliary task module, a main task module, and a post-processing module.
The main feature extraction module is connected with the feature adaptive aggregation module; it extracts multi-scale features from the input image and sends a multi-scale feature map containing those features to the feature adaptive aggregation module. The main feature extraction module consists of a DarkNet-53 convolutional neural network and a feature pyramid network. The DarkNet-53 convolutional neural network is a lightweight backbone comprising 53 neural-network layers, divided into 5 serial sub-networks, which extracts the backbone features of the image. The feature pyramid network receives the backbone features from the DarkNet-53 convolutional neural network, obtains a multi-scale feature map containing the multi-scale features through upsampling, feature extraction and feature fusion, and sends the multi-scale feature map to the feature adaptive aggregation module.
The feature self-adaptive aggregation module is connected with the main feature extraction module, the auxiliary task module and the main task module, and has the functions of providing a multi-scale perceived high-pixel feature map for the auxiliary task module, providing a boundary region perceived high-pixel feature map and a salient region perceived high-pixel feature map for the main task module, and improving the detection precision of the target detection system. The characteristic self-adaptive aggregation module is composed of a self-adaptive multi-scale characteristic aggregation network, a self-adaptive spatial characteristic aggregation network and a rough frame prediction network. The self-adaptive multi-scale feature aggregation network is composed of 4 weight unshared SE networks (the 4 SE networks are respectively recorded as a first SE network, a second SE network, a third SE network and a fourth SE network), receives a multi-scale feature map from a feature pyramid network of a main feature extraction module, performs channel self-attention enhancement, bilinear interpolation upsampling and scale level soft weight aggregation on the multi-scale feature map by adopting a self-adaptive multi-scale feature aggregation method to obtain a multi-scale perceived high pixel feature map, and sends the multi-scale perceived high pixel feature map to the self-adaptive spatial feature aggregation network, the rough frame prediction network and the auxiliary task module. The rough frame prediction network is composed of two layers of 3 x 3 convolution and one layer of 1 x 1 convolution, receives the multi-scale perception high pixel characteristic diagram from the self-adaptive multi-scale characteristic aggregation network, predicts the multi-scale perception high pixel characteristic diagram to obtain a rough frame prediction position, and sends the rough frame prediction position to the self-adaptive spatial characteristic aggregation network. The self-adaptive spatial feature aggregation network is composed of two region-limited deformable convolutions with different offset conversion functions (a classification offset conversion function and a regression offset conversion function), receives a multi-scale perceived high pixel feature map from the self-adaptive multi-scale feature aggregation network, receives a rough frame prediction position from a rough frame prediction network, generates a boundary region perceived high pixel feature map and a salient region perceived high pixel feature map, and sends the boundary region perceived high pixel feature map and the salient region perceived high pixel feature map to the main task module, so that the main task module has self-adaptive spatial perception capability, and the problem that the input feature coupling degree is high and affects detection accuracy is relieved.
The auxiliary task module is connected with an adaptive multi-scale feature aggregation network in the feature adaptive aggregation module, the auxiliary task module is a corner prediction network, the corner prediction network is composed of two layers of 3 x 3 convolution, one layer of 1 x 1 convolution and a sigmoid active layer, the auxiliary task module receives a multi-scale perception high pixel feature image from the adaptive multi-scale feature aggregation network, and the corner prediction network predicts the multi-scale perception high pixel feature image to obtain a corner prediction thermodynamic diagram which is used for calculating corner prediction loss in the training of a target detection system and assisting the target detection system in perceiving a corner region. The auxiliary task module is only used during training of the target detection system and is used for enhancing perception of the target detection system on the position of the corner point of the object, so that the position of the object frame is predicted more accurately. When the trained target detection system detects the user input image, the module is directly discarded, and extra calculation amount is not increased.
The main task module is connected with the adaptive spatial feature aggregation network and the post-processing module and consists of a fine frame prediction network and a central point prediction network. The fine frame prediction network is a layer of 1 multiplied by 1 convolution layer, receives the high pixel characteristic diagram sensed by the boundary region from the adaptive spatial characteristic aggregation network, performs 1 multiplied by 1 convolution on the high pixel characteristic diagram sensed by the boundary region to obtain a fine frame prediction position, and sends the fine frame prediction position to the post-processing module; the central point prediction network consists of a layer of 1 x 1 convolutional layer and a sigmoid activation layer, receives the high pixel characteristic diagram sensed by the salient region from the adaptive spatial characteristic aggregation network, performs 1 x 1 convolution and activation on the high pixel characteristic diagram sensed by the salient region to obtain a central point prediction thermodynamic diagram, and sends the central point prediction thermodynamic diagram to the post-processing module.
The post-processing module is a 3 x 3 pooling layer and is connected with a fine frame prediction network and a central point prediction network in the main task module, receives a fine frame prediction position from the fine frame prediction network, receives a central point prediction thermodynamic diagram from the central point prediction network, reserves a prediction maximum value in a central point prediction thermodynamic diagram 3 x 3 range by adopting 3 x 3 maximum pooling operation with the step length of 1, and extracts the position of the reserved prediction maximum value, namely a peak point, as the position of the central area point of the object. And finding out the corresponding up-down, left-right four-direction distances in the fine frame prediction position according to the position of the central area point to generate a predicted object frame position, wherein the central point category where the position of the central area point is located is the category of the object prediction. The post-processing module suppresses overlapping false frames by extracting peak points within a 3 × 3 range, reducing false positive prediction frames.
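As a reading aid, the module wiring described above can be summarised in the following skeleton (PyTorch-style pseudocode; the class and argument names are placeholders, and each sub-module stands in for the corresponding network described in this step):

```python
import torch.nn as nn

class FeatureAdaptiveDetector(nn.Module):
    """Skeleton of the module wiring of Fig. 1 (illustrative, not the patent's code)."""

    def __init__(self, backbone, fpn, msfa, coarse_head, spatial_agg,
                 corner_head, center_head, fine_head):
        super().__init__()
        self.backbone, self.fpn = backbone, fpn          # main feature extraction module
        self.msfa = msfa                                 # adaptive multi-scale aggregation
        self.coarse_head = coarse_head                   # coarse-box prediction network
        self.spatial_agg = spatial_agg                   # adaptive spatial aggregation (R-DConv)
        self.corner_head = corner_head                   # auxiliary task (training only)
        self.center_head, self.fine_head = center_head, fine_head

    def forward(self, images, with_aux=False):
        feats = self.fpn(self.backbone(images))          # multi-scale feature maps
        f_h = self.msfa(feats)                           # multi-scale-aware high-pixel map F_H
        b_coarse = self.coarse_head(f_h)                 # per-pixel (t, l, d, r)
        f_hs, f_hr = self.spatial_agg(f_h, b_coarse)     # salient- / boundary-aware maps
        out = {"coarse": b_coarse,
               "center": self.center_head(f_hs),
               "fine": self.fine_head(f_hr)}
        if with_aux:                                     # corner heatmap, training only
            out["corner"] = self.corner_head(f_h)
        return out
```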
Secondly, constructing a training set, a verification set and a test set, wherein the method comprises the following steps:
2.1 collecting target detection scene images as a target detection data set, and manually labeling each target detection scene image in the target detection data set, wherein the method comprises the following steps:
The MS COCO public general-scene dataset or the Cityscapes autonomous-driving dataset is used as the target detection dataset. The MS COCO dataset has 80 classes and contains 105000 training images (train2017) as the training set, 5000 validation images (val2017) as the validation set, and 20000 test images (test-dev) as the test set. The Cityscapes dataset has 8 classes: pedestrian, rider, car, truck, bus, train, motorcycle and bicycle, with 2975 training images as the training set, 500 validation images as the validation set, and 1525 test images as the test set. Let the total number of training images be S, the total number of test images be T, and the total number of validation images be V; S is 105000 or 2975, T is 20000 or 1525, and V is 5000 or 500. Every image of the MS COCO and Cityscapes datasets is manually annotated, i.e. each object in the image is labelled with a rectangular box and its category.
2.2 carrying out optimization processing on the S images in the training set, including turning, cutting, translation, brightness transformation, contrast transformation, saturation transformation, scaling and standardization to obtain an optimized training set D t The method comprises the following steps:
2.2.1 order variable s =1, initialize the optimized training set D t Is empty;
2.2.2 overturning the s image in the training set by adopting a random overturning method to obtain the s overturned image, wherein the random probability of the random overturning method is 0.5;
2.2.3 Randomly crop the s-th flipped image using a minimum intersection-over-union (IoU) constraint to obtain the s-th cropped image; the minimum IoU used is 0.3 (the same value is used for the minimum size ratio).
2.2.4, carrying out random image translation on the s-th cut image to obtain an s-th translated image;
2.2.5, performing brightness conversion on the s-th translated image by adopting random brightness to obtain an s-th brightness-converted image; the random luminance takes a luminance difference value of 32.
2.2.6, carrying out contrast conversion processing on the image after the s-th brightness conversion by adopting random contrast to obtain an image after the s-th contrast conversion; the random contrast ratio ranges from (0.5,1.5).
2.2.7, performing saturation conversion on the image after the s-th contrast conversion by adopting random saturation to obtain an image after the s-th saturation conversion; the saturation range for random saturation is (0.5,1.5).
2.2.8 adopting a scaling operation to scale the s-th image after saturation transformation to 512 multiplied by 512 to obtain an s-th scaled image;
2.2.9 standardizes the s scaled image by adopting standardization operation to obtain the s standard image, and puts the s standard image into the optimized training set D t In (1).
If S is less than or equal to S, making S = S +1, and rotating by 2.2.2; if s>S, obtaining an optimized training set D consisting of S standard images t Turn 2.3.
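For orientation, the per-image part of this augmentation pipeline roughly corresponds to the torchvision sketch below. It only transforms the image tensor: in a real implementation the flip, crop and translation steps must also update the annotation boxes, and the min-IoU crop and random translation have no torchvision built-in, so they appear as commented, hypothetical callables. The normalization statistics are an assumption (ImageNet values); the patent does not state them.

```python
from torchvision import transforms

# brightness delta 32 on a 0-255 scale ~ 32/255 as a ColorJitter factor (approximation)
train_transform = transforms.Compose([
    transforms.RandomHorizontalFlip(p=0.5),                # 2.2.2
    # MinIoURandomCrop(min_iou=0.3, min_size_ratio=0.3),   # 2.2.3, custom helper
    # RandomTranslate(),                                   # 2.2.4, custom helper
    transforms.ColorJitter(brightness=32 / 255,            # 2.2.5
                           contrast=(0.5, 1.5),            # 2.2.6
                           saturation=(0.5, 1.5)),         # 2.2.7
    transforms.Resize((512, 512)),                         # 2.2.8
    transforms.ToTensor(),
    transforms.Normalize(mean=[0.485, 0.456, 0.406],       # 2.2.9; ImageNet stats
                         std=[0.229, 0.224, 0.225]),       # are an assumption
])
```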
2.3 training set D according to optimization t And making a task truth label for model training. The method is divided into four tasks which are respectively a central point prediction task, an angular point prediction task, a rough frame prediction task and a fine frame prediction task, and comprises the following steps:
2.3.1 Let variable s = 1. Let the s-th image of the optimized training set have N_s annotation boxes, let the i-th annotation box be B_si, and let the label category of the i-th annotation box be c_i. B_si is given by the coordinates of its top-left corner point and its bottom-right corner point; N_s is a positive integer and 1 ≤ i ≤ N_s.
2.3.2 Construct the centre-point prediction ground truth H_center^gt for the centre-point prediction task. The method is:
2.3.2.1 Construct an all-zero matrix H_zeros of size (H/4) × (W/4) × C, where C is the number of classification categories of the optimized training set, i.e. the number of annotated target classes in the target detection dataset (80 for the MS COCO dataset and 8 for the Cityscapes dataset), H is the height of the s-th image, and W is its width;
2.3.2.2 Let i = 1, indexing the i-th 4×-downsampled annotation box;
2.3.2.3 Divide the annotation coordinates of B_si by 4 and record the result as the 4×-downsampled annotation box B'_si; its corner coordinates give the top-left, top-right, bottom-left and bottom-right corner positions of B'_si.
2.3.2.4 Using the two-dimensional Gaussian kernel generation method, take the centre point of B'_si as the base point of a two-dimensional Gaussian kernel with variance (σ_x, σ_y), and compute the Gaussian values of all pixel points within the kernel range, obtaining the first Gaussian value set S_ctr (a sketch of this splatting follows 2.3.2.7 below). The specific steps are:
2.3.2.4.1 Let the number of pixel points inside the two-dimensional Gaussian kernel be N_pixel, a positive integer, and let the first Gaussian value set S_ctr be empty;
2.3.2.4.2 Let p = 1 index the pixel points inside the two-dimensional Gaussian kernel, 1 ≤ p ≤ N_pixel;
2.3.2.4.3 In the s-th image, with (x_0, y_0) as the base point, the two-dimensional Gaussian value K(x_p, y_p) of any pixel point (x_p, y_p) within the kernel range is given by equation (1):

K(x_p, y_p) = exp( -( (x_p - x_0)^2 / (2σ_x^2) + (y_p - y_0)^2 / (2σ_y^2) ) )    (1)

where (x_0, y_0) is the base point of the two-dimensional Gaussian kernel, i.e. its centre (it may be the centre point of B'_si or a corner point of B'_si); x_0 is the coordinate of the base point along the width direction and y_0 its coordinate along the height direction. (x_p, y_p) is a pixel point within the Gaussian kernel range of the base point (x_0, y_0); x_p is its coordinate along the width direction and y_p its coordinate along the height direction. Both (x_0, y_0) and (x_p, y_p) lie in the 4×-downsampled image coordinate system. σ_x^2 is the variance of the two-dimensional Gaussian kernel along the width direction and σ_y^2 its variance along the height direction; the number of points within the kernel range is controlled by controlling these two variances. w is the width of B'_si at the feature-map scale, h is its height at the feature-map scale, and α is the ratio parameter of the central region within B'_si, set to 0.54. Store (x_p, y_p) and the computed K(x_p, y_p) in the first Gaussian value set S_ctr;
2.3.2.4.4 Let p = p + 1. If p ≤ N_pixel, go to 2.3.2.4.3; if p > N_pixel, the coordinates and two-dimensional Gaussian values inside the Gaussian kernel of B'_si have been stored in S_ctr, which now contains N_pixel pixel points and their Gaussian values; go to 2.3.2.5;
2.3.2.5 Assign the values of S_ctr to H_zeros: for each element (x_p, y_p) with value K(x_p, y_p) in S_ctr, assign according to the rule H_zeros[x_p, y_p, c_i] = K(x_p, y_p), where c_i is the category number of B'_si, 1 ≤ c_i ≤ C and c_i is a positive integer;
2.3.2.6 Let i = i + 1. If i ≤ N_s, go to 2.3.2.3; if i > N_s, the two-dimensional Gaussian values generated from all N_s 4×-downsampled annotation boxes of the s-th image have been assigned to H_zeros; go to 2.3.2.7;
2.3.2.7 Let the centre-point prediction ground truth of the s-th image be H_center^gt = H_zeros.
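A short sketch of this Gaussian splatting is given below (NumPy, one class channel). The relation between the box size and the Gaussian spread, sigma_x = alpha*w/6 and sigma_y = alpha*h/6, and the max-merge of overlapping kernels are assumptions consistent with the alpha = 0.54 central-region ratio; the patent defines the exact variance through equation (1).

```python
import numpy as np

def draw_gaussian(heatmap, cx, cy, w, h, alpha=0.54):
    """Splat a 2-D Gaussian centred at (cx, cy) (stride-4 coordinates) onto one
    float channel of the centre-point heatmap, as in 2.3.2.4."""
    sigma_x, sigma_y = alpha * w / 6.0, alpha * h / 6.0   # assumed size->sigma mapping
    H, W = heatmap.shape
    ys, xs = np.mgrid[0:H, 0:W]
    g = np.exp(-((xs - cx) ** 2 / (2 * sigma_x ** 2)
                 + (ys - cy) ** 2 / (2 * sigma_y ** 2)))  # equation (1)
    np.maximum(heatmap, g, out=heatmap)                   # keep larger value on overlaps
    return heatmap
```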
2.3.3 Construct the corner prediction ground truth H_corner^gt for the corner prediction task. The method is:
2.3.3.1 Construct an all-zero matrix H_zeros^corner of size (H/4) × (W/4) × 4, where "4" is the number of corner points of a 4×-downsampled annotation box and also the number of channels of the matrix;
2.3.3.2 Let i = 1, indexing the i-th 4×-downsampled annotation box;
2.3.3.3 Let the base point of the two-dimensional Gaussian kernel be the top-left corner point of B'_si. Using the two-dimensional Gaussian kernel generation method of 2.3.2.4, take this corner as the base point of a two-dimensional Gaussian kernel with variance (σ_x, σ_y) and compute the Gaussian values of all pixel points within the kernel range, obtaining the second Gaussian value set S_tl;
2.3.3.4 Assign the element coordinates and Gaussian values of S_tl to the 1st channel of H_zeros^corner, i.e. assign according to the rule H_zeros^corner[x_p, y_p, 1] = K(x_p, y_p);
2.3.3.5 Let the base point of the two-dimensional Gaussian kernel be the top-right corner point of B'_si. Using the method of 2.3.2.4, compute the Gaussian values of all pixel points within the kernel range, obtaining the third Gaussian value set S_tr;
2.3.3.6 Assign the element coordinates and Gaussian values of S_tr to the 2nd channel of H_zeros^corner;
2.3.3.7 Let the base point of the two-dimensional Gaussian kernel be the bottom-left corner point of B'_si. Using the method of 2.3.2.4, compute the Gaussian values of all pixel points within the kernel range, obtaining the fourth Gaussian value set S_dl;
2.3.3.8 Assign the element coordinates and Gaussian values of S_dl to the 3rd channel of H_zeros^corner;
2.3.3.9 Let the base point of the two-dimensional Gaussian kernel be the bottom-right corner point of B'_si. Using the method of 2.3.2.4, compute the Gaussian values of all pixel points within the kernel range, obtaining the fifth Gaussian value set S_dr;
2.3.3.10 Assign the element coordinates and Gaussian values of S_dr to the 4th channel of H_zeros^corner;
2.3.3.11 Let i = i + 1. If i ≤ N_s, go to 2.3.3.3; if i > N_s, the two-dimensional Gaussian values generated from all N_s 4×-downsampled annotation boxes of the s-th image have been assigned to H_zeros^corner; go to 2.3.3.12;
2.3.3.12 Let the corner prediction ground truth of the s-th image be H_corner^gt = H_zeros^corner.
2.3.4 From the N_s 4×-downsampled annotation boxes of the s-th image, construct the coarse-box ground truth B_coarse^gt of the s-th image for the coarse-box prediction task. The method is:
2.3.4.1 Construct an all-zero matrix B_zeros of size (H/4) × (W/4) × 4, where "4" stands for the 4 coordinates of a 4×-downsampled annotation box;
2.3.4.2 Let i = 1, indexing the i-th 4×-downsampled annotation box;
2.3.4.3 Assign values to the pixels of B_zeros that lie inside the i-th 4×-downsampled annotation box B'_si, i.e. write the coordinate values of B'_si into the 4 channels of B_zeros at those pixel positions;
2.3.4.4 Let i = i + 1. If i ≤ N_s, go to 2.3.4.3; if i > N_s, the coarse-box ground-truth values corresponding to all N_s annotation boxes of the s-th image have been written into B_zeros, and the assigned B_zeros is the ground-truth label of the s-th image; go to 2.3.4.5;
2.3.4.5 Let the coarse-box ground truth of the s-th image be B_coarse^gt = B_zeros.
2.3.5 From B_coarse^gt, construct the fine-box ground truth B_refine^gt for the fine-box prediction task; the value of B_refine^gt is equal to that of B_coarse^gt, i.e. B_refine^gt = B_coarse^gt.
2.3.6 Let s = s + 1. If s ≤ S, go to 2.3.2; if s > S, go to 2.3.7.
2.3.7 The task ground-truth labels of the S images for model training have been obtained; the S images together with their labels form the training set D_M used for model training.
2.4 adopting an image scaling standardization method to optimize the V images in the verification set to obtain a new verification set D consisting of the V scaled and standardized images V The method comprises the following steps:
2.4.1 let variable v =1;
2.4.2 adopting zooming operation to zoom the v-th image in the verification set to 512 multiplied by 512 to obtain a zoomed image of the v-th image;
2.4.3 standardizing the zoomed image of the v th image by adopting a standardization operation to obtain a standardized image of the v th image.
2.4.4 If v ≤ V, let v = v + 1 and go to 2.4.2; if v > V, the new validation set D_V consisting of the V scaled and normalized images has been obtained; go to 2.5.
2.5 adopting the image scaling standardization method of 2.4 steps to carry out optimization processing on the T images in the test set to obtain a new test set D consisting of the images after the T images are scaled and standardized T
Thirdly, train the target detection system constructed in the first step with a gradient back-propagation method to obtain N_m sets of model parameters. The method comprises the following steps:
3.1 initializing the network weight parameters of each module in the target detection system. Initializing parameters of a DarkNet-53 convolutional neural network in a main characteristic extraction module by adopting a pre-training model trained on an ImageNet data set (https:// www.image-net.org /); and initializing other network weight parameters (a characteristic pyramid network, a characteristic self-adaptive aggregation module, an auxiliary task module and a main task module network weight parameter in a main characteristic module) by adopting normal distribution with the mean value of 0 and the variance of 0.01.
3.2 setting the training parameters of the target detection system. The initial learning rate learning _ rate is set to 0.01, and the learning rate attenuation factor is set to 0.1, i.e., the learning rate is reduced by a factor of 10 (attenuation is performed at training steps of 80 and 110). Random gradient descent (SGD) was selected as a model training optimizer with a hyper-parameter "momentum" of 0.9 and a weight decay "of 0.0004. The batch size (mini _ batch _ size) of the network training is 64. The maximum training step size (maxepoch) is 120.
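These hyper-parameters map directly onto a standard PyTorch optimizer and step scheduler; the sketch below also includes the last-N_m checkpointing described in 3.3. `detector` and `train_one_epoch` are placeholder names, not the patent's code.

```python
import torch

# From 3.2: SGD, lr 0.01, momentum 0.9, weight decay 4e-4, batch 64, 120 epochs,
# lr x0.1 at epochs 80 and 110; keep the weights of the last N_m = 10 epochs.
optimizer = torch.optim.SGD(detector.parameters(), lr=0.01,
                            momentum=0.9, weight_decay=0.0004)
scheduler = torch.optim.lr_scheduler.MultiStepLR(optimizer,
                                                 milestones=[80, 110], gamma=0.1)

for epoch in range(1, 121):
    train_one_epoch(detector, optimizer)       # hypothetical training helper
    scheduler.step()
    if epoch > 120 - 10:                       # save the last N_m checkpoints
        torch.save(detector.state_dict(), f"epoch_{epoch}.pth")
```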
3.3 Train the target detection system. The method is to take the differences between the coarse-box predicted positions, the fine-box predicted positions, the corner prediction heatmap and the centre-point prediction heatmap output by the target detection system in one training pass and their ground-truth values as the loss value (loss), and to update the network weight parameters by gradient back-propagation until the loss value reaches a threshold or the training step reaches maxepoch. The network weight parameters of each of the last N_m training epochs (N_m is set to 10 in this embodiment) are saved. The method comprises the following steps:
3.3.1 Let the training step epoch = 1 (one pass over all data in the training set is one epoch) and initialize the batch index N_b = 1;
3.3.2 The main feature extraction module reads the N_b-th batch, B = 64 images in total, from D_M; the B images are recorded as the matrix I_train, which contains B images of size H × W × 3, where H is the height of the input image, W its width, and "3" stands for the three RGB channels of the image.
3.3.3 The main feature extraction module extracts the multi-scale features of I_train with the main feature extraction method and sends the multi-scale feature map containing the multi-scale features of I_train to the feature adaptive aggregation module. The method is:
3.3.3.1 The DarkNet-53 convolutional neural network of the main feature extraction module extracts the image features of I_train to obtain the backbone feature-map set. The method is: the 5 serial sub-networks of the DarkNet-53 convolutional neural network downsample the B images of I_train and extract their features, obtaining the backbone features, i.e. 4 feature maps (the outputs of the last four serial sub-networks), which are sent to the feature pyramid network.
3.3.3.2 The feature pyramid network receives the 4 feature maps from the DarkNet-53 convolutional neural network and performs upsampling, feature extraction and feature fusion on them to obtain 3 multi-scale feature maps, denoted F_1, F_2, F_3; the multi-scale feature map set {F_1, F_2, F_3} is sent to the feature adaptive aggregation module.
3.3.4 The feature adaptive aggregation module receives the multi-scale feature map set {F_1, F_2, F_3} from the feature pyramid network, generates the multi-scale-aware high-pixel feature map F_H and sends F_H to the auxiliary task module; it also generates the boundary-region-aware high-pixel feature map and the salient-region-aware high-pixel feature map and sends them to the main task module. The method is:
3.3.4.1 The adaptive multi-scale feature aggregation network receives {F_1, F_2, F_3} from the feature pyramid network and applies the adaptive multi-scale feature aggregation method (channel self-attention enhancement, bilinear interpolation upsampling and scale-level soft-weight aggregation) to {F_1, F_2, F_3} to obtain the multi-scale-aware high-pixel feature map F_H. The resolution of F_H is (H/4) × (W/4), and the number of feature-map channels of F_H is 64. The specific method is:
3.3.4.1.1 The adaptive multi-scale feature aggregation network applies the first, second and third SE networks to F_1, F_2 and F_3 in parallel for channel self-attention enhancement: the first SE network applies a weighted summation over the channels of F_1 to obtain the first channel-enhanced image; at the same time the second SE network does the same for F_2 to obtain the second channel-enhanced image, and the third SE network does the same for F_3 to obtain the third channel-enhanced image.
3.3.4.1.2 The first, second and third SE networks of the adaptive multi-scale feature aggregation network then use bilinear interpolation in parallel to upsample the three channel-enhanced images to the same resolution, (H/4) × (W/4), obtaining the upsampled feature maps F_1^up, F_2^up, F_3^up, which form the upsampled feature-map set {F_1^up, F_2^up, F_3^up}. The computation is given by equation (2):

F_l^up = Upsample(SE_n(F_l))    (2)

where SE_n denotes the n-th SE network, F_l denotes the l-th multi-scale feature map, Upsample denotes bilinear interpolation upsampling, 1 ≤ l ≤ 3, 1 ≤ n ≤ 3, and n = l.
3.3.4.1.3 The adaptive multi-scale feature aggregation network computes weights for {F_1^up, F_2^up, F_3^up} with a 1 × 1 convolution that reduces the number of channels from 64 to 1, and then applies a Softmax over the scale dimension to obtain soft-weight maps A_1, A_2, A_3 of the same spatial size as the F_l^up. The value of a pixel in a soft-weight map indicates which of the three scales F_1^up, F_2^up, F_3^up should receive more attention at that position, i.e. which scale is weighted more heavily, so that objects of different sizes respond on feature maps of different scales.
3.3.4.1.4 The adaptive multi-scale feature aggregation network multiplies the weight map of the l-th scale, A_l, element by element with the corresponding l-th upsampled feature map F_l^up: A_1 is multiplied element by element with F_1^up, A_2 with F_2^up, and A_3 with F_3^up, giving 3 products. The 3 products are then summed in a weighted manner and fused into a single feature map, the fused feature map. Finally, a fourth SE network enhances the channel representation of the fused feature map, yielding the multi-scale-aware high-pixel feature map F_H. The process is given by equation (3):

F_H = SE_4( Σ_{l=1}^{3} A_l × F_l^up ),  with A_l obtained by applying the Softmax over the scale dimension to Conv(F_l^up)    (3)

where SE_4 is the fourth SE network, A_l indicates the weight that the element at the same position occupies at the l-th scale, "×" denotes the product of corresponding position elements, and Conv denotes the 1 × 1 convolution. The adaptive multi-scale feature aggregation network sends F_H to the auxiliary task module, the coarse-box prediction network and the adaptive spatial feature aggregation network.
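One way to realise 3.3.4.1 in code is sketched below (PyTorch): per-scale SE blocks, bilinear upsampling to the highest resolution, a 1 × 1 convolution plus softmax over the scale dimension for the soft weights, weighted fusion, and a fourth SE block. Whether the 1 × 1 weight convolution is shared across scales, and the SE reduction ratio, are assumptions.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class SEBlock(nn.Module):
    """Standard squeeze-and-excitation channel re-weighting (assumed form)."""
    def __init__(self, c, r=4):
        super().__init__()
        self.fc = nn.Sequential(nn.Linear(c, c // r), nn.ReLU(inplace=True),
                                nn.Linear(c // r, c), nn.Sigmoid())

    def forward(self, x):
        w = self.fc(x.mean(dim=(2, 3)))                      # (B, C) channel weights
        return x * w[:, :, None, None]

class AdaptiveMultiScaleAggregation(nn.Module):
    """Sketch of 3.3.4.1: SE enhancement, upsampling, soft scale weights, fusion."""
    def __init__(self, c=64, num_scales=3):
        super().__init__()
        self.se = nn.ModuleList(SEBlock(c) for _ in range(num_scales))
        self.se_out = SEBlock(c)                             # fourth SE network
        self.weight_conv = nn.Conv2d(c, 1, kernel_size=1)    # assumed shared across scales

    def forward(self, feats):                                # feats: high -> low resolution
        size = feats[0].shape[-2:]
        ups = [F.interpolate(se(f), size=size, mode="bilinear", align_corners=False)
               for se, f in zip(self.se, feats)]
        logits = torch.cat([self.weight_conv(u) for u in ups], dim=1)  # (B, L, H, W)
        w = logits.softmax(dim=1)                            # scale-level soft weights A_l
        fused = sum(w[:, l:l + 1] * ups[l] for l in range(len(ups)))
        return self.se_out(fused)                            # F_H
```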
3.3.4.2 The coarse-box prediction network receives the multi-scale-aware high-pixel feature map F_H from the adaptive multi-scale feature aggregation network and, with the coarse-box prediction method, predicts a coarse-box position for every feature point of F_H, generating the coarse-box predicted positions B_coarse; B_coarse is sent to the adaptive spatial feature aggregation network. B_coarse also has a resolution of (H/4) × (W/4) and 4 channels; the 4 channels represent the distances from each pixel point to the box sides in the up, down, left and right directions, so every pixel point forms a coarse box. B_coarse is used to limit the sampling range of the deformable convolution in the adaptive spatial feature aggregation network. In addition, the loss L_coarse between B_coarse and the coarse-box ground truth B_coarse^gt constructed in 2.3.4 is computed. The loss calculation is based on the GIoU loss (see the literature "Rezatofighi H, Tsoi N, Gwak J Y, et al. Generalized intersection over union: A metric and a loss for bounding box regression [C] // Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. 2019: 658-666", i.e. generalized intersection over union: a metric and a loss for bounding-box regression), as shown in equation (4):

L_coarse = (1/N_b) Σ_{(i,j)∈S_b} W_ij · (1 - GIoU(B_coarse(i,j), B_coarse^gt(i,j)))    (4)

where S_b is the regression sample set, consisting of the pixels where B_coarse^gt is not 0; N_b is the number of samples in the set; and W_ij is the weight at a position (i, j) where B_coarse^gt is not 0, used to apply a larger loss weight to pixel points in the centre region so that they regress the annotation-box position more accurately.
3.3.4.3 The adaptive spatial feature aggregation network receives the multi-scale-aware high-pixel feature map F_H from the adaptive multi-scale feature aggregation network and the coarse-box predicted positions B_coarse from the coarse-box prediction network, and generates the boundary-region-aware high-pixel feature map F_HR and the salient-region-aware high-pixel feature map F_HS. The method is:
3.3.4.3.1 Design the region-limited deformable convolution (R-DConv). The specific method is:
3.3.4.3.1.1 Design an offset transfer function T that transforms the offset Δp of the deformable convolution (Δp is a learnable offset relative to the feature point; see the literature "Zhu X, Hu H, Lin S, et al. Deformable ConvNets v2: More deformable, better results [C] // Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. 2019: 9308-9316") into a transformed offset, so that the difficulty of adaptively learning to sample the most representative feature points becomes consistent for objects of different sizes. T limits the offset range of the spatial sampling points of the deformable convolution to within B_coarse while also differentiating the offset Δp of the deformable convolution. Because the spatial sampling range of a large object is wider than that of a small object, the corresponding search difficulty differs. To resolve this inconsistency in the search difficulty of spatial feature points for objects of different sizes, a Sigmoid function is applied to normalize the offset Δp within B_coarse so that Δp lies in the interval [0, 1]. After this processing, searching for the point features with the strongest representation capability becomes equally difficult for objects of different sizes. Accordingly, Δp is split into h_Δp and w_Δp, where h_Δp denotes the offset of Δp in the vertical direction and w_Δp its offset in the horizontal direction. The offset transfer function T is given by equation (5):

T(Δp) = ( T_v(h_Δp), T_h(w_Δp) )    (5)

where T_v denotes the offset transfer function in the vertical direction, T_h the offset transfer function in the horizontal direction, and T the overall offset transfer function; (t, l, r, d) are the distances from the convolution-kernel position p to the sides of B_coarse in the up, left, right and down directions.
3.3.4.3.1.2 Use T to limit the sampling region of the deformable convolution. Given a 3 × 3 convolution kernel with K = 9 spatial sampling points, let w_k denote the convolution-kernel weight of the k-th position and P_k the predefined position offset of the k-th position, with P_k ∈ {(-1, -1), (-1, 0), …, (1, 1)} covering the 3 × 3 range centred at (0, 0). Let x(p) denote the input feature map at the convolution-kernel centre position p and y(p) the output feature map at p. y(p) is computed with R-DConv as shown in equation (6):

y(p) = Σ_{k=1}^{K} w_k · x( p + P_k + T(Δp_k) ) · Δm_k    (6)
wherein Δ p k Denotes the learnable offset, Δ m, of the k-th position k Representing the weight of the kth position. Δ p k And Δ m k Generated by a 3 × 3 convolution, the 3 × 3 convolution generates a feature map of 27 channels, 9 of which are Δ p k Abscissa offset value, Δ p for 9 channels k Ordinate offset value, 9 channels (representing weights for different offset value characteristics) of Δ m k The value of (c). B coarse A coarse box, representing a prediction on the scale of the current feature map, is also a predefined bounding region.
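A sketch of this region-limited offset idea, built on torchvision's modulated deformable convolution, is given below; the Sigmoid-based clamping follows equations (5)-(6) above, while the class name, the assumed (dy, dx) offset ordering and the unit convention for the coarse-box distances are illustrative assumptions rather than the patent's reference implementation.

```python
import torch
import torch.nn as nn
from torchvision.ops import deform_conv2d

class RegionLimitedDeformConv(nn.Module):
    """Sketch of R-DConv: raw offsets are squashed with a Sigmoid and rescaled so that
    every sampling point stays inside the coarse box predicted at that pixel."""
    def __init__(self, in_ch, out_ch, k=3):
        super().__init__()
        self.k = k
        self.weight = nn.Parameter(torch.randn(out_ch, in_ch, k, k) * 0.01)
        self.offset_conv = nn.Conv2d(in_ch, 3 * k * k, 3, padding=1)  # 2*K offsets + K modulation weights

    def forward(self, feat, coarse_tlrd):
        # coarse_tlrd: (B, 4, H, W) distances (t, l, r, d) from each pixel to the coarse-box
        # sides, assumed to be expressed in feature-map units
        B, _, H, W = feat.shape
        K = self.k * self.k
        out = self.offset_conv(feat)
        raw_off, mask = out[:, :2 * K], out[:, 2 * K:].sigmoid()
        dyx = raw_off.view(B, K, 2, H, W)              # assumed (dy, dx) ordering per kernel point
        t, l, r, d = coarse_tlrd.unbind(dim=1)
        # Sigmoid-normalise, then rescale into [-t, d] vertically and [-l, r] horizontally
        # (measured from the kernel centre; the regular grid offset P_k is ignored in this sketch)
        dy = dyx[:, :, 0].sigmoid() * (t + d).unsqueeze(1) - t.unsqueeze(1)
        dx = dyx[:, :, 1].sigmoid() * (l + r).unsqueeze(1) - l.unsqueeze(1)
        offset = torch.stack([dy, dx], dim=2).view(B, 2 * K, H, W)
        return deform_conv2d(feat, offset, self.weight, padding=self.k // 2, mask=mask)
```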
3.3.4.3.2 To let R-DConv learn the salient region of an object within the coarse-box range and extract features that make object classification more accurate, a classification adaptive spatial feature aggregation method is adopted, using B_coarse to limit the sampling range while aggregating features on F_H. The method is:

3.3.4.3.2.1 Let the classification offset transfer function f_cls(·) adopt the offset transfer function f(·) designed in 3.3.4.3.1.1, and compute the output feature y_cls(p) at position p using equation (6).

3.3.4.3.2.2 Traverse F_H with the f_cls convolution kernel to obtain the salient-region-aware high-pixel feature map F_HS. f_cls allows the sampling points to be concentrated, so that the classification branch can focus on the most discriminative salient regions. Thus f_cls enables R-DConv to learn the salient region of the object within the coarse-box range and to extract the features that make object classification more accurate, namely the salient-region-aware high-pixel feature map F_HS; F_HS is sent to the main task module.
3.3.4.3.3 To let R-DConv learn the boundary region information of an object within the coarse-box range and extract features that make object position regression more accurate, a regression adaptive spatial feature aggregation method is adopted, using B_coarse to limit the sampling range while aggregating features on F_H. The regression adaptive spatial feature aggregation method is specifically:

3.3.4.3.3.1 Design the regression offset transfer function f_reg(·) to transform the offset Δp of the deformable convolution. f_reg divides the spatial sampling points of the R-DConv operation evenly along the up, down, left and right directions, so that the limited region is split into four sub-regions corresponding to upper-left, upper-right, lower-left and lower-right. f_reg samples each of the four sub-regions uniformly, i.e. assigns an equal number of sampling points to each sub-region. In this way the spatial sampling points of the R-DConv operation are dispersed, so that features carrying more information from the boundary can be extracted and the object position can be regressed more accurately. With K = 9, f_reg samples two points from each of the four sub-regions and adds one centre point, forming a 3 × 3 convolution kernel and strengthening the centre feature point's capture of boundary information. The regression offset transfer function f_reg(·) is shown in equation (7):
    f_reg(Δp) = ( −σ(h_Δp)·t, −σ(w_Δp)·l )  for a sampling point assigned to the upper-left sub-region,
                ( −σ(h_Δp)·t, +σ(w_Δp)·r )  for the upper-right sub-region,
                ( +σ(h_Δp)·d, −σ(w_Δp)·l )  for the lower-left sub-region,
                ( +σ(h_Δp)·d, +σ(w_Δp)·r )  for the lower-right sub-region        (7)

where σ(·) is the Sigmoid function used to normalize the offset within the coarse-box interval; the normalization balances the sampling difficulty of objects of different sizes.
Substituting f_reg(·) into equation (6) in place of f(·) gives the output feature y_reg(p) at position p. Thus f_reg enables R-DConv to learn the object boundary region within the coarse-box range and to extract the features that make the predicted box regress to a more accurate position, namely the boundary-region-aware high-pixel feature map F_HR.

3.3.4.3.3.2 Traverse F_H with the f_reg convolution kernel to obtain the boundary-region-aware high-pixel feature map F_HR, and send F_HR to the main task module.
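The quadrant-wise dispersion of sampling points used by f_reg can be sketched as follows; the quadrant assignment of the 9 kernel points and the sign convention are assumptions consistent with the description above, not the patent's exact equation (7).

```python
import torch

# Quadrant assignment for the K = 9 sampling points of a 3x3 kernel:
# two points per sub-region (tl, tr, dl, dr) plus one centre point (index 4).
# The sign pair (sy, sx) selects the vertical/horizontal half of the coarse box.
QUADRANTS = [(-1, -1), (-1, -1), (-1, +1), (-1, +1),
             (0, 0),                                    # centre point, kept at the kernel centre
             (+1, -1), (+1, -1), (+1, +1), (+1, +1)]

def regression_offsets(raw_dy, raw_dx, t, l, r, d):
    """raw_dy / raw_dx: (B, 9, H, W) raw learnable offsets; t, l, r, d: (B, H, W) distances
    from each pixel to the coarse-box sides. Returns offsets confined to the sub-region
    each sampling point is assigned to (a sketch, not the patent's exact formula)."""
    dys, dxs = [], []
    for k, (sy, sx) in enumerate(QUADRANTS):
        if sy == 0 and sx == 0:
            dys.append(torch.zeros_like(t)); dxs.append(torch.zeros_like(t))
            continue
        span_y = d if sy > 0 else t          # how far this point may move vertically
        span_x = r if sx > 0 else l          # how far it may move horizontally
        dys.append(sy * raw_dy[:, k].sigmoid() * span_y)
        dxs.append(sx * raw_dx[:, k].sigmoid() * span_x)
    return torch.stack(dys, dim=1), torch.stack(dxs, dim=1)
```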
3.3.5 The auxiliary task module receives F_H from the adaptive multi-scale feature aggregation network and processes it with two 3 × 3 convolution layers, one 1 × 1 convolution layer and a sigmoid function to obtain the corner prediction heatmap H_corner. H_corner has resolution (H/4) × (W/4) and 4 channels. The loss L_corner between H_corner and the corner prediction ground truth Ĥ_corner constructed in 2.3.3 is computed. L_corner is based on a modified version of Focal Loss, as shown in equation (8):

    L_corner = −(1/N_s) · Σ_{c,i,j} { (1 − H_cij)^{α_l} · log(H_cij),                       if Ĥ_cij = 1
                                      (1 − Ĥ_cij)^{β} · (H_cij)^{α_l} · log(1 − H_cij),     otherwise        (8)

where N_s is the number of annotation boxes in the image, and α_l and β are hyper-parameters, set to 2 and 4 respectively, used to control the gradient distribution of the loss function. H_cij is the corner prediction value output by the auxiliary task module at channel c and pixel position (i, j), and Ĥ_cij is the corner ground-truth value at channel c and pixel position (i, j). The auxiliary task module learns to locate the four corner positions of the annotation boxes and assists the training of the target detection network, making the extracted features pay more attention to the object corner positions, so that the target detection system localizes objects more accurately.
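A compact PyTorch sketch of this penalty-reduced focal loss is given below; the function name and the optional num_boxes argument are illustrative, and the same form is reused for the centre-point loss in 3.3.7.

```python
import torch

def modified_focal_loss(pred, gt, alpha=2.0, beta=4.0, num_boxes=None):
    """CornerNet/CenterNet-style penalty-reduced focal loss for the corner and centre-point
    heatmaps (a sketch; pred is the sigmoid output, gt the Gaussian ground-truth heatmap)."""
    pred = pred.clamp(1e-6, 1 - 1e-6)
    pos = gt.eq(1).float()
    neg = 1.0 - pos
    pos_loss = ((1 - pred) ** alpha) * torch.log(pred) * pos
    neg_loss = ((1 - gt) ** beta) * (pred ** alpha) * torch.log(1 - pred) * neg
    n = num_boxes if num_boxes is not None else pos.sum().clamp(min=1)
    return -(pos_loss.sum() + neg_loss.sum()) / n
```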
3.3.6 The fine-box prediction network of the main task module receives the boundary-region-aware high-pixel feature map F_HR from the adaptive spatial feature aggregation network and, after one 1 × 1 convolution layer, obtains the fine-box predicted position B_refine for the feature point positions of F_HR. B_refine has resolution (H/4) × (W/4) and 4 channels. The 4 channels represent the distances from a pixel point to the predicted fine box in the up, down, left and right directions, so each pixel point forms one fine prediction box. The loss L_refine between B_refine and the fine-box ground truth B̂_refine constructed in 2.3.5 is computed. L_refine is based on the GIoU loss, as shown in equation (9):

    L_refine = (1/N_b) · Σ_{(i,j)∈S_b} W_ij · (1 − GIoU(B_refine(i,j), B̂_refine(i,j)))        (9)

where S_b is the regression sample set, i.e. the set of pixels whose value in B̂_refine is not 0; N_b is the number of samples in the regression set; and W_ij is the weight at a position (i,j) where B̂_refine is not 0, used to apply a larger loss weight to pixels in the central region so that these pixels regress the annotated box position more accurately. B_refine reflects the accuracy with which the target detection system regresses object positions.
3.3.7 The centre point prediction network of the main task module receives the salient-region-aware high-pixel feature map F_HS from the adaptive spatial feature aggregation network and, after one 1 × 1 convolution layer and a sigmoid function, obtains the centre point prediction heatmap H_center for the feature point positions of F_HS. H_center has resolution (H/4) × (W/4), and its number of channels equals the number of dataset categories C; C is 80 for the MS COCO dataset and 8 for the Cityscapes dataset. The loss L_center between H_center and the centre point prediction ground truth Ĥ_center constructed in 2.3.2 is computed. L_center is based on a modified version of Focal Loss, as shown in equation (10):

    L_center = −(1/N_s) · Σ_{c,i,j} { (1 − H_cij^center)^{α_l} · log(H_cij^center),                           if Ĥ_cij^center = 1
                                      (1 − Ĥ_cij^center)^{β} · (H_cij^center)^{α_l} · log(1 − H_cij^center),  otherwise        (10)

where N_s is the number of annotation boxes in the image, and α_l and β are hyper-parameters, set to 2 and 4 respectively, used to control the gradient distribution of the loss function. H_cij^center is the centre point prediction heatmap value at channel c and pixel position (i, j), and Ĥ_cij^center is the centre point ground-truth value at channel c and pixel position (i, j). H_center reflects the ability of the target detection system to locate object centres and distinguish object categories.
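For illustration, a minimal PyTorch sketch of the two main-task prediction heads described in 3.3.6 and 3.3.7 follows; the class and attribute names are assumptions, while the layer types and channel counts follow the text.

```python
import torch.nn as nn

class MainTaskHeads(nn.Module):
    """Sketch of the main-task heads: a 1x1 conv fine-box head on F_HR and a
    1x1 conv + sigmoid centre-point head on F_HS."""
    def __init__(self, in_ch=64, num_classes=80):
        super().__init__()
        self.refine_head = nn.Conv2d(in_ch, 4, kernel_size=1)            # (t, l, d, r) distances
        self.center_head = nn.Sequential(
            nn.Conv2d(in_ch, num_classes, kernel_size=1), nn.Sigmoid())  # per-class centre heatmap

    def forward(self, f_hr, f_hs):
        return self.refine_head(f_hr), self.center_head(f_hs)
```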
3.3.8 Design the total loss function L_total of the target detection system, as shown in equation (11):

    L_total = λ_corner · L_corner + λ_center · L_center + λ_coarse · L_coarse + λ_refine · L_refine        (11)

where L_corner is the loss value computed from the corner prediction network output H_corner and the ground truth Ĥ_corner, L_center is the loss value computed from the centre point prediction network output H_center and the ground truth Ĥ_center, L_coarse is the loss value computed from the coarse-box prediction network output B_coarse and the ground truth B̂_coarse, and L_refine is the loss value computed from the fine-box prediction network output B_refine and the ground truth B̂_refine. The corner prediction network loss weight λ_corner, the centre point prediction network loss weight λ_center, the coarse-box prediction network loss weight λ_coarse and the fine-box prediction network loss weight λ_refine are set according to the importance of each task.
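A short PyTorch-style sketch of equation (11) follows; the default weight values are placeholders, since the patent only states that the weights are set according to task importance.

```python
def total_loss(l_corner, l_center, l_coarse, l_refine,
               w_corner=1.0, w_center=1.0, w_coarse=1.0, w_refine=1.0):
    """Weighted sum of the four task losses (equation (11)); weight values are placeholders."""
    return (w_corner * l_corner + w_center * l_center
            + w_coarse * l_coarse + w_refine * l_refine)
```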
3.3.9 Let epoch = epoch + 1. If epoch is 80 or 110, let learning_rate = learning_rate × 0.1 and go to 3.3.10; if epoch is neither 80 nor 110, go directly to 3.3.10;

3.3.10 If epoch ≤ maxepoch, go to 3.3.2; if epoch > maxepoch, training is finished, go to 3.3.11;

3.3.11 Save the network weight parameters of the last N_m epochs.
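The training schedule of 3.3.1-3.3.11 can be sketched as follows (PyTorch); maxepoch, N_m, the SGD hyper-parameters and the assumed compute_total_loss helper are illustrative, not values fixed by the patent.

```python
import torch

def train(model, loader, max_epoch=120, n_m=5, base_lr=0.01):
    """Sketch of the training schedule: SGD, learning rate decayed by 0.1 at epochs 80 and 110,
    and the weights of the last n_m epochs saved."""
    opt = torch.optim.SGD(model.parameters(), lr=base_lr, momentum=0.9, weight_decay=1e-4)
    for epoch in range(1, max_epoch + 1):
        for batch in loader:
            loss = model.compute_total_loss(batch)   # assumed helper returning L_total
            opt.zero_grad(); loss.backward(); opt.step()
        if epoch in (80, 110):                        # decay points from 3.3.9
            for g in opt.param_groups:
                g["lr"] *= 0.1
        if epoch > max_epoch - n_m:                   # keep the last N_m checkpoints
            torch.save(model.state_dict(), f"epoch_{epoch}.pth")
```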
Fourth step: use the validation set to verify the detection precision of the target detection system loaded with each of the saved N_m epochs' network weight parameters, and take the best-performing network weight parameters as the network weight parameters of the target detection system. The method is:

4.1 Let the variable n_m = 1;

4.2 Load the n_m-th of the saved N_m epochs' network weight parameters into the target detection system; input the new validation set D_V, obtained by the image scaling and standardization method of step 2.4, into the target detection system;

4.3 Let v = 1, denoting the v-th image of the validation set; V is the number of images in the validation set;

4.4 The main feature extraction module receives the v-th validation set image D_v, extracts the multi-scale features of D_v using the main feature extraction method described in 3.3.3, and sends the multi-scale feature map containing the multi-scale features of D_v to the feature adaptive aggregation module;

4.5 The adaptive multi-scale feature aggregation network in the feature adaptive aggregation module receives the multi-scale feature map containing the multi-scale features of D_v and, using the adaptive multi-scale feature aggregation method of 3.3.4.1, performs channel self-attention enhancement, bilinear interpolation upsampling and scale-level soft weight aggregation to obtain the multi-scale aware high-pixel feature map F_HV of D_v; F_HV is sent to the coarse-box prediction network and the adaptive spatial feature aggregation network;

4.6 The coarse-box prediction network in the feature adaptive aggregation module receives F_HV and, using the coarse-box prediction method of 3.3.4.2, performs coarse-box position prediction for every feature point position in F_HV, generating the coarse-box predicted position B_HVcoarse of the v-th validation set image D_v; B_HVcoarse is sent to the adaptive spatial feature aggregation network. B_HVcoarse also has resolution (H/4) × (W/4) and 4 channels;
4.7 The adaptive spatial feature aggregation network in the feature adaptive aggregation module receives B_HVcoarse from the coarse-box prediction network and F_HV from the adaptive multi-scale feature aggregation network, and applies the classification adaptive spatial feature aggregation method of 3.3.4.3.2, using B_HVcoarse to limit the sampling range while performing classification-task spatial feature aggregation on F_HV, obtaining the salient-region-aware high-pixel feature map of the v-th validation set image D_v; this salient-region-aware high-pixel feature map is sent to the centre point prediction network;

4.8 The adaptive spatial feature aggregation network in the feature adaptive aggregation module applies the regression adaptive spatial feature aggregation method of 3.3.4.3.3, using B_HVcoarse to limit the sampling range while performing regression-task spatial feature aggregation on F_HV, obtaining the boundary-region-aware high-pixel feature map of the v-th validation set image D_v; this boundary-region-aware high-pixel feature map is sent to the fine-box prediction network;

4.9 The fine-box prediction network in the main task module receives the boundary-region-aware high-pixel feature map and, after a 1 × 1 convolution, obtains the fine-box predicted positions of the objects in the v-th validation set image D_v, which are sent to the post-processing module;

4.10 The centre point prediction network in the main task module receives the salient-region-aware high-pixel feature map of the v-th validation set image D_v and, after a 1 × 1 convolution layer, obtains the centre point prediction heatmap of D_v, which is sent to the post-processing module;
4.11 The post-processing module receives the fine-box predicted positions and the centre point prediction heatmap of the v-th validation image D_v and applies the overlapping-pseudo-box removal method to them, obtaining the predicted object box set of D_v. The specific method is:

4.11.1 The post-processing module performs a 3 × 3 max-pooling operation (2D max-pooling) on the centre point prediction heatmap of the v-th validation image D_v to extract the set of peak points of the heatmap; each peak point represents a centre region point inside a predicted object;

4.11.2 For a peak point with coordinates (P_x, P_y) obtained from the centre point prediction heatmap of D_v, the post-processing module reads from the fine-box predicted positions of D_v the distance information (t, l, d, r) of the peak point in the up, left, down and right directions, obtaining a prediction box of D_v: B_p = (P_x − l, P_y − t, P_x + r, P_y + d). The category of B_p is the channel with the maximum centre point heatmap value at (P_x, P_y), denoted c_p; the confidence of B_p is the heatmap value of channel c_p at (P_x, P_y), denoted s_p;

4.11.3 The post-processing module keeps the prediction boxes in the v-th validation image D_v whose confidence s_p is greater than a confidence threshold (typically set to 0.3), forming the object box prediction set of D_v; each kept prediction box B_p retains its category c_p information;
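A minimal PyTorch sketch of this overlapping-pseudo-box removal follows; the (t, l, d, r) channel order, the top-k truncation and the feature-map stride are assumptions of the sketch.

```python
import torch
import torch.nn.functional as F

def decode_detections(center_heatmap, refine_dist, stride=4, conf_thresh=0.3, topk=100):
    """A 3x3 max-pooling keeps only local peaks of the centre-point heatmap; each surviving
    peak is combined with the fine-box distances (t, l, d, r) at that pixel to form a box."""
    B, C, H, W = center_heatmap.shape
    pooled = F.max_pool2d(center_heatmap, 3, stride=1, padding=1)
    peaks = center_heatmap * (pooled == center_heatmap).float()   # non-peak positions zeroed out
    scores, inds = peaks.view(B, -1).topk(topk)
    classes = inds // (H * W)
    ys = (inds % (H * W)) // W
    xs = inds % W
    results = []
    for b in range(B):
        keep = scores[b] > conf_thresh
        t, l, d, r = refine_dist[b, :, ys[b][keep], xs[b][keep]]
        cx, cy = xs[b][keep] * stride, ys[b][keep] * stride
        boxes = torch.stack([cx - l, cy - t, cx + r, cy + d], dim=1)  # (x1, y1, x2, y2)
        results.append((boxes, classes[b][keep], scores[b][keep]))
    return results
```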
4.12 Let v = v + 1; if v ≤ V, go to 4.4; if v > V, the object box prediction sets of the V validation images of the n_m-th model have been obtained, go to 4.13;

4.13 If the validation set is the public general-scene dataset MS COCO, test the precision of the final object box prediction sets output by the target detection system using the standard MS COCO evaluation protocol (https://cocodataset.org/), record the precision, and go to 4.14; if the validation set is the Cityscapes autonomous-driving dataset, test the precision using the Cityscapes evaluation protocol (https://www.cityscapes-dataset.com/), record the precision, and go to 4.14;

4.14 Let n_m = n_m + 1; if n_m ≤ N_m, go to 4.2; if n_m > N_m, the precision tests of the N_m models are complete, go to 4.15;

4.15 From the precisions of the object box prediction sets of the N_m models, select the one with the highest precision, find the target detection system weight parameters corresponding to it, take them as the selected weight parameters of the target detection system, and load them into the target detection system; the target detection system loaded with the selected weight parameters becomes the trained target detection system.
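This fourth step can be sketched as a simple model-selection loop (PyTorch); evaluate_fn is an assumed callable that runs steps 4.3-4.13 on the validation set and returns the resulting AP.

```python
import torch

def select_best_checkpoint(model, checkpoints, evaluate_fn):
    """Load each of the N_m saved weight files, run the validation pipeline,
    and keep the weights with the highest validation AP (a sketch)."""
    best_ap, best_ckpt = -1.0, None
    for ckpt in checkpoints:
        model.load_state_dict(torch.load(ckpt, map_location="cpu"))
        ap = evaluate_fn(model)
        if ap > best_ap:
            best_ap, best_ckpt = ap, ckpt
    model.load_state_dict(torch.load(best_ckpt, map_location="cpu"))
    return model, best_ckpt, best_ap
```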
Fifth step: use the trained target detection system to perform target detection on the image to be detected input by the user. The method is:

5.1 Optimize the image to be detected I input by the user with the image scaling and standardization method of step 2.4 to obtain the standardized image to be detected I_nor, and input I_nor into the main feature extraction module;

5.2 The main feature extraction module receives I_nor, extracts the multi-scale features of I_nor using the main feature extraction method described in 3.3.3, and sends the multi-scale feature map containing the multi-scale features of I_nor to the feature adaptive aggregation module.

5.3 The adaptive multi-scale feature aggregation network in the feature adaptive aggregation module receives the multi-scale feature map containing the multi-scale features of I_nor and, using the adaptive multi-scale feature aggregation method of 3.3.4.1, performs channel self-attention enhancement, bilinear interpolation upsampling and scale-level soft weight aggregation on it to obtain the multi-scale aware high-pixel feature map F_IH; F_IH is sent to the coarse-box prediction network and the adaptive spatial feature aggregation network;

5.4 The coarse-box prediction network in the feature adaptive aggregation module receives F_IH and, using the coarse-box prediction method of 3.3.4.2, performs coarse-box position prediction on F_IH to obtain the coarse-box predicted position B_Icoarse of the image to be detected I; B_Icoarse is sent to the adaptive spatial feature aggregation network. B_Icoarse also has resolution (H/4) × (W/4) and 4 channels;

5.5 The adaptive spatial feature aggregation network in the feature adaptive aggregation module receives F_IH and B_Icoarse and applies the classification adaptive spatial feature aggregation method of 3.3.4.3.2, using B_Icoarse to limit the sampling range while performing classification-task spatial feature aggregation on F_IH, obtaining the salient-region-aware high-pixel feature map of the image to be detected I; this feature map is sent to the centre point prediction network;

5.6 The adaptive spatial feature aggregation network in the feature adaptive aggregation module applies the regression adaptive spatial feature aggregation method of 3.3.4.3.3, using B_Icoarse to limit the sampling range while performing regression-task spatial feature aggregation on F_IH, obtaining the boundary-region-aware high-pixel feature map of the image to be detected I; this feature map is sent to the fine-box prediction network;

5.7 The fine-box prediction network in the main task module receives the boundary-region-aware high-pixel feature map of the image to be detected I and, after a 1 × 1 convolution, obtains the fine-box predicted positions of the objects in I; these fine-box predicted positions are sent to the post-processing module;

5.8 The centre point prediction network in the main task module receives the salient-region-aware high-pixel feature map of the image to be detected I and, after a 1 × 1 convolution, obtains the centre point prediction heatmap of the objects in I; this centre point prediction heatmap is sent to the post-processing module;

5.9 The post-processing module receives the fine-box predicted positions and the centre point prediction heatmap of the objects in the image to be detected I and applies the overlapping-pseudo-box removal method of step 4.11 to them, obtaining the object box prediction set of I; the prediction set retains each prediction box B_p and its category information, i.e. the coordinate positions and predicted categories of the predicted object boxes of the image to be detected.

Sixth step: finish.
Detection precision AP (Average Precision) and running speed FPS (Frames Per Second) were evaluated on 20000 test-set images from the MS COCO dataset or 1524 test-set images from the Cityscapes dataset (divided as described in the second step). The experimental environment is Ubuntu 20.04 (a Linux distribution) with an Intel i9-10900K CPU running at 3.70 GHz and four NVIDIA RTX 2080 Ti GPUs with a core frequency of 1635 MHz and 12 GB of video memory each. One test embodiment of the invention is shown in Fig. 4: an image to be detected (the upper image in Fig. 4, captured while driving) is input to the target detection system of the invention, the image prediction set is output and visualized, and a detection visualization image is generated (the lower image in Fig. 4, in which the detection boxes and object categories are labelled; as shown in Fig. 4, the "bicycle" detected at (1), the "person" detected at (2) and the "car" detected at (3) are framed with rectangular boxes).
First, the performance evaluation metrics of the target detection algorithm are defined. The test adopts the standard MS COCO evaluation protocol with 6 metrics: AP, AP_50, AP_75, AP_S, AP_M and AP_L. AP denotes the Average Precision computed at IoU thresholds sampled every 0.05 over the interval [0.5, 0.95] and then averaged over all thresholds. AP_50 and AP_75 denote the AP values at IoU thresholds of 0.5 and 0.75, respectively. AP_S, AP_M and AP_L denote the AP of small, medium and large objects, with size ranges of [0, 64²], [64², 128²] and [128², ∞], respectively. The larger the AP value, the higher the detection precision.
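When the MS COCO protocol is used, the evaluation can be run with the standard pycocotools package, for example as below; the file paths are placeholders, and COCOeval applies its own built-in area ranges for AP_S/AP_M/AP_L.

```python
from pycocotools.coco import COCO
from pycocotools.cocoeval import COCOeval

coco_gt = COCO("annotations/instances_val2017.json")   # ground-truth annotations (placeholder path)
coco_dt = coco_gt.loadRes("detections.json")           # detections in COCO result format
evaluator = COCOeval(coco_gt, coco_dt, iouType="bbox")
evaluator.evaluate()
evaluator.accumulate()
evaluator.summarize()
# evaluator.stats holds [AP, AP_50, AP_75, AP_S, AP_M, AP_L, ...] in that order
```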
The experimental results on the MS COCO dataset and the Cityscapes dataset are analysed separately below.
Performance on the MS COCO dataset is compared in Table 1. The invention is compared with the classical real-time detection method YOLOv3 and with CenterNet and TTFNet, the methods most related to the invention. The experimental results show that the method can detect targets quickly and accurately. Compared with CenterNet, the invention improves precision by 4.4 AP while running about 2.2 ms faster. Compared with TTFNet, the invention achieves a 2.5 AP improvement at a small speed cost of about 3.15 ms. The invention therefore achieves a large precision improvement while hardly affecting real-time performance. Precision and speed are two metrics that must be balanced in target detection, and achieving a large precision gain at a small computational cost is significant for practical applications. Moreover, the higher the precision, the harder it is to improve further: the classical Mask R-CNN algorithm (see "He K, Gkioxari G, Dollár P, et al. Mask R-CNN [C]//Proceedings of the IEEE International Conference on Computer Vision. 2017: 2961-2969.") achieves 39.8 AP at 11 FPS, while the invention is 5.45 times faster than Mask R-CNN and 2.0 AP higher. Sacrificing a speed delay of only about 3.15 ms (fully acceptable for real-world applications) thus yields a larger precision improvement of 2.5 AP.
TABLE 1
Method         Backbone network   FPS   AP     AP_50   AP_75   AP_S   AP_M   AP_L
YOLOv3         DarkNet-53         48    33.4   56.3    35.2    19.5   36.4   43.6
CenterNet      DLA-34             53    37.4   55.1    40.8    20.6   42.0   50.6
TTFNet         DarkNet-53         74    39.3   56.8    42.5    20.6   43.3   54.3
The invention  DarkNet-53         60    41.8   58.7    45.3    22.7   45.6   54.9
Performance on the Cityscapes dataset is compared in Table 2. Cityscapes is a classical intelligent-driving scene dataset; in this experiment, images uniformly resized to 768 × 384 are used as input, and TTFNet and the invention are compared on Cityscapes. TTFNet runs faster than the invention, but the detection precision gap is large (5.8 AP), while the speed delay of the invention is only 3.46 ms, which is fully acceptable for real-world applications. The invention therefore achieves a better balance between running speed and detection precision, realizing a large precision improvement at a small time cost.
TABLE 2
Method         Backbone network   FPS    AP     AP_50   AP_75   AP_S   AP_M   AP_L
TTFNet         DarkNet-53         58.7   17.2   33.9    15.6    6.4    22.5   30.1
The invention  DarkNet-53         48.8   23.0   41.7    22.1    4.3    22.1   45.2
The trained target detection system is also analysed visually. As shown in Fig. 3, TTFNet and the invention are compared visually on the Cityscapes dataset. Figs. 3(a) and 3(b) show the detection results of TTFNet, and Figs. 3(c) and 3(d) show the detection results of the invention. For ease of observation, the regions where TTFNet produces false detections are indicated by arrows (the false detection indicated by the left arrow in Fig. 3(a) is a spurious "bicycle"; the detection indicated by the right arrow contains multiple overlapping false-positive boxes; the false detection indicated by the arrow in Fig. 3(b) classifies a background region as foreground). Compared with TTFNet, the detection of the invention is more accurate, with a lower false detection rate and higher classification precision (in Fig. 3(c) no false detection occurs at the position corresponding to the left arrow of Fig. 3(a), and no overlapping false-positive boxes are produced at the position corresponding to the right arrow; in Fig. 3(d) the background region is not mistaken for foreground at the position corresponding to the arrow of Fig. 3(b)). These visualization results further demonstrate the effectiveness of the proposed method.

Claims (9)

1. A target detection method based on feature adaptive aggregation is characterized by comprising the following steps:
firstly, constructing a target detection system based on feature adaptive aggregation; the target detection system consists of a main feature extraction module, a feature self-adaptive aggregation module, an auxiliary task module, a main task module and a post-processing module;
the main feature extraction module is connected with the feature self-adaptive aggregation module, extracts multi-scale features from the input image and sends a multi-scale feature map containing the multi-scale features to the feature self-adaptive aggregation module; the main characteristic extraction module consists of a DarkNet-53 convolutional neural network and a characteristic pyramid network; the DarkNet-53 convolutional neural network is a lightweight trunk network containing 53 layers of neural networks, and the 53 layers of neural networks are divided into 5 serial sub-networks and used for extracting the trunk network characteristics of the image; the method comprises the steps that a feature pyramid network receives main network features from a DarkNet-53 convolutional neural network, a multi-scale feature map containing multi-scale features is obtained through up-sampling, feature extraction and feature fusion operations, and the multi-scale feature map is sent to a feature self-adaptive aggregation module;
the feature self-adaptive aggregation module is connected with the main feature extraction module, the auxiliary task module and the main task module, and has the functions of providing a multi-scale perceived high-pixel feature map for the auxiliary task module, providing a boundary region perceived high-pixel feature map and a salient region perceived high-pixel feature map for the main task module, and improving the detection precision of the target detection system; the characteristic self-adaptive aggregation module is composed of a self-adaptive multi-scale characteristic aggregation network, a self-adaptive spatial characteristic aggregation network and a rough frame prediction network; the self-adaptive multi-scale feature aggregation network is composed of 4 SE networks with unshared weights, and the 4 SE networks are respectively marked as a first SE network, a second SE network, a third SE network and a fourth SE network; receiving a multi-scale feature map from a feature pyramid network of a main feature extraction module, carrying out channel self-attention enhancement, bilinear interpolation upsampling and scale level soft weight aggregation operation on the multi-scale feature map by adopting a self-adaptive multi-scale feature aggregation method to obtain a multi-scale perceived high pixel feature map, and sending the multi-scale perceived high pixel feature map to a self-adaptive spatial feature aggregation network, a rough frame prediction network and an auxiliary task module; the rough frame prediction network is composed of two layers of 3 x 3 convolutions and one layer of 1 x 1 convolution, receives the multi-scale perception high pixel characteristic diagram from the self-adaptive multi-scale characteristic aggregation network, predicts the multi-scale perception high pixel characteristic diagram to obtain a rough frame prediction position, and sends the rough frame prediction position to the self-adaptive spatial characteristic aggregation network; the self-adaptive spatial feature aggregation network is composed of a region-limited deformable convolution of a classification offset conversion function and a regression offset conversion function, receives a multi-scale perceived high-pixel feature map from the self-adaptive multi-scale feature aggregation network, receives a rough frame prediction position from a rough frame prediction network, generates a boundary region perceived high-pixel feature map and a salient region perceived high-pixel feature map, and sends the boundary region perceived high-pixel feature map and the salient region perceived high-pixel feature map to the main task module;
the auxiliary task module is connected with an adaptive multi-scale feature aggregation network in the feature adaptive aggregation module, the auxiliary task module is an angular point prediction network, the angular point prediction network consists of two layers of 3 x 3 convolutions, a layer of 1 x 1 convolutions and a sigmoid active layer, the auxiliary task module receives a multi-scale perceived high pixel feature map from the adaptive multi-scale feature aggregation network, and the angular point prediction network predicts the multi-scale perceived high pixel feature map to obtain an angular point prediction thermodynamic diagram which is used for calculating angular point prediction loss in the training of a target detection system and assisting the target detection system in perceiving an angular point region; the auxiliary task module is only used for training the target detection system and is used for enhancing the perception of the target detection system on the position of the corner point of the object so as to predict the position of the object frame more accurately; when the trained target detection system detects the user input image, the module is directly discarded;
the main task module is connected with the self-adaptive spatial feature aggregation network and the post-processing module and consists of a fine frame prediction network and a central point prediction network; the fine frame prediction network is a layer of 1 multiplied by 1 convolution layer, receives the high pixel characteristic diagram sensed by the boundary region from the adaptive spatial characteristic aggregation network, performs 1 multiplied by 1 convolution on the high pixel characteristic diagram sensed by the boundary region to obtain a fine frame prediction position, and sends the fine frame prediction position to the post-processing module; the central point prediction network consists of a layer of 1 × 1 convolutional layer and a sigmoid activation layer, receives the high pixel characteristic diagram perceived by the salient region from the adaptive spatial characteristic aggregation network, performs 1 × 1 convolution and activation on the high pixel characteristic diagram perceived by the salient region to obtain a central point prediction thermodynamic diagram, and sends the central point prediction thermodynamic diagram to the post-processing module;
the post-processing module is a 3 x 3 pooling layer and is connected with a fine frame prediction network and a central point prediction network in the main task module, receives a fine frame prediction position from the fine frame prediction network, receives a central point prediction thermodynamic diagram from the central point prediction network, reserves a prediction maximum value in a central point prediction thermodynamic diagram 3 x 3 range by adopting 3 x 3 maximum pooling operation with the step length of 1, and extracts the position of the reserved prediction maximum value, namely a peak point, as the position of an object central area point; finding out the corresponding up-down, left-right four-direction distances in the fine frame prediction position according to the position of the central area point to generate a predicted object frame position, wherein the type of the central point where the position of the central area point is located is the type of object prediction; the post-processing module inhibits overlapping false frames by extracting peak points within a range of 3 multiplied by 3, so that false positive prediction frames are reduced;
secondly, constructing a training set, a verification set and a test set, wherein the method comprises the following steps:
2.1 collecting target detection scene images as a target detection data set, and manually labeling each target detection scene image in the target detection data set, wherein the method comprises the following steps: using a general scene data set or a Cityscapes unmanned scene data set disclosed by MS COCO as a target detection data set; training images in an MS COCO data set or a Cityscapes data set are used as a training set, verification images are used as a verification set, and test images are used as a test set; the total number of images in the training set is S, the total number of images in the testing set is T, the total number of images in the verification set is V, and each image in the MS COCO and Cityscapes data sets is manually marked, namely, each image is marked with the position of an object in a rectangular frame form and the category of the object;
2.2 Optimize the S images in the training set, including flipping, cropping, translation, brightness transformation, contrast transformation, saturation transformation, scaling and standardization, to obtain the optimized training set D_t;

2.3 Construct task ground-truth labels for model training from the optimized training set D_t; the labels cover four tasks, namely a centre point prediction task, a corner prediction task, a coarse-box prediction task and a fine-box prediction task, as follows:
2.3.1 Let the variable s = 1; let the s-th image in the optimized training set have N_s annotation boxes, and let the i-th annotation box be B_si = (x_si^tl, y_si^tl, x_si^dr, y_si^dr) with annotation category c_i, where (x_si^tl, y_si^tl) are the coordinates of the upper-left corner point of the i-th annotation box and (x_si^dr, y_si^dr) are the coordinates of the lower-right corner point of the i-th annotation box; N_s is a positive integer and 1 ≤ i ≤ N_s;
2.3.2 Construct the centre point prediction ground truth Ĥ_center^s for the centre point prediction task. The method is:

2.3.2.1 Construct an all-zero matrix H_zeros of size (H/4) × (W/4) × C, where C is the number of classification categories of the optimized training set, i.e. the number of annotated target categories of the target detection dataset, H is the height of the s-th image, and W is the width of the s-th image;
2.3.2.2 Let i = 1, denoting the i-th 4×-downsampled annotation box;

2.3.2.3 Divide the annotation coordinates of B_si by 4 and denote the result as the 4×-downsampled annotation box B_si' = (x_si'^tl, y_si'^tl, x_si'^dr, y_si'^dr), from which the upper-left, upper-right, lower-left and lower-right corner positions of B_si' are obtained;
2.3.2.4 Using the two-dimensional Gaussian kernel generation method, take the centre point of B_si', ((x_si'^tl + x_si'^dr)/2, (y_si'^tl + y_si'^dr)/2), as the base point of a two-dimensional Gaussian kernel with variance (σ_x, σ_y), and obtain the Gaussian values of all pixel points within the kernel range, giving the first Gaussian value set S_ctr. The specific method is:
2.3.2.4.1 Let the number of pixel points in the two-dimensional Gaussian kernel be N_pixel, N_pixel a positive integer; let the first Gaussian value set S_ctr be empty;

2.3.2.4.2 Let p = 1, denoting the p-th pixel point in the two-dimensional Gaussian kernel, 1 ≤ p ≤ N_pixel;
2.3.2.4.3 For any pixel point (x_p, y_p) within the Gaussian kernel range centred at the base point (x_0, y_0) in the s-th image, the two-dimensional Gaussian value K(x_p, y_p) is:

    K(x_p, y_p) = exp( − ( (x_p − x_0)² / (2σ_x²) + (y_p − y_0)² / (2σ_y²) ) )        (1)

where (x_0, y_0) is the base point, i.e. the centre, of the two-dimensional Gaussian kernel, x_0 being its coordinate in the width direction and y_0 its coordinate in the height direction; (x_p, y_p) is a pixel point within the Gaussian kernel range of the base point (x_0, y_0), x_p being its coordinate in the width direction and y_p its coordinate in the height direction; both (x_0, y_0) and (x_p, y_p) lie in the 4×-downsampled image coordinate system; σ_x² is the variance of the two-dimensional Gaussian kernel in the width direction and σ_y² its variance in the height direction, and the number of points within the kernel range is controlled by controlling these variances; w denotes the width of B_si' at the feature-map scale, h denotes the height of B_si' at the feature-map scale, and α is a parameter giving the proportion of the central region within B_si', from which σ_x and σ_y are determined; (x_p, y_p) and the computed K(x_p, y_p) are stored in the first Gaussian value set S_ctr;

2.3.2.4.4 Let p = p + 1; if p ≤ N_pixel, go to 2.3.2.4.3; if p > N_pixel, the coordinates and two-dimensional Gaussian values within the Gaussian kernel of B_si' have all been stored in S_ctr, which now contains N_pixel pixel points and their corresponding two-dimensional Gaussian values; go to 2.3.2.5;
2.3.2.5 Assign the values of S_ctr to H_zeros: for each element (x_p, y_p) and K(x_p, y_p) of S_ctr, assign according to the rule H_zeros[x_p, y_p, c_i] = K(x_p, y_p), where c_i is the category number of B_si', 1 ≤ c_i ≤ C, c_i a positive integer;

2.3.2.6 Let i = i + 1; if i ≤ N_s, go to 2.3.2.3; if i > N_s, the two-dimensional Gaussian values generated by all N_s 4×-downsampled annotation boxes of the s-th image have been assigned to H_zeros, go to 2.3.2.7;

2.3.2.7 The centre point prediction ground truth of the s-th image is Ĥ_center^s = H_zeros;
2.3.3 Construct the corner prediction ground truth Ĥ_corner^s for the corner prediction task. The method is:

2.3.3.1 Construct an all-zero matrix H_zeros^corner of size (H/4) × (W/4) × 4, where "4" is the number of corner points of a 4×-downsampled annotation box and also the number of channels of the matrix;
2.3.3.2 Let i = 1, denoting the i-th 4×-downsampled annotation box;

2.3.3.3 Let the base point of the two-dimensional Gaussian kernel be the upper-left corner point of B_si', with coordinates (x_si'^tl, y_si'^tl); using the two-dimensional Gaussian kernel generation method of 2.3.2.4, take this point as the base point of a two-dimensional Gaussian kernel with variance (σ_x, σ_y) and obtain the Gaussian values of all pixel points within the kernel range, giving the second Gaussian value set S_tl;

2.3.3.4 Assign the element coordinates and Gaussian values of S_tl to the 1st channel of H_zeros^corner, i.e. assign according to the rule H_zeros^corner[x_p, y_p, 1] = K(x_p, y_p);

2.3.3.5 Let the base point of the two-dimensional Gaussian kernel be the upper-right corner point of B_si', with coordinates (x_si'^dr, y_si'^tl); using the two-dimensional Gaussian kernel generation method of 2.3.2.4, obtain the Gaussian values of all pixel points within the kernel range, giving the third Gaussian value set S_tr;

2.3.3.6 Assign the element coordinates and Gaussian values of S_tr to the 2nd channel of H_zeros^corner, i.e. assign according to the rule H_zeros^corner[x_p, y_p, 2] = K(x_p, y_p);

2.3.3.7 Let the base point of the two-dimensional Gaussian kernel be the lower-left corner point of B_si', with coordinates (x_si'^tl, y_si'^dr); using the two-dimensional Gaussian kernel generation method of 2.3.2.4, obtain the Gaussian values of all pixel points within the kernel range, giving the fourth Gaussian value set S_dl;

2.3.3.8 Assign the element coordinates and Gaussian values of S_dl to the 3rd channel of H_zeros^corner, i.e. assign according to the rule H_zeros^corner[x_p, y_p, 3] = K(x_p, y_p);

2.3.3.9 Let the base point of the two-dimensional Gaussian kernel be the lower-right corner point of B_si', with coordinates (x_si'^dr, y_si'^dr); using the two-dimensional Gaussian kernel generation method of 2.3.2.4, obtain the Gaussian values of all pixel points within the kernel range, giving the fifth Gaussian value set S_dr;

2.3.3.10 Assign the element coordinates and Gaussian values of S_dr to the 4th channel of H_zeros^corner, i.e. assign according to the rule H_zeros^corner[x_p, y_p, 4] = K(x_p, y_p);

2.3.3.11 Let i = i + 1; if i ≤ N_s, go to 2.3.3.3; if i > N_s, the two-dimensional Gaussian values generated by all N_s 4×-downsampled annotation boxes of the s-th image have been assigned to H_zeros^corner, go to 2.3.3.12;

2.3.3.12 The corner prediction ground truth of the s-th image is Ĥ_corner^s = H_zeros^corner;
2.3.4 From the N_s 4×-downsampled annotation boxes of the s-th image, construct the coarse-box ground truth B̂_coarse^s of the s-th image for the coarse-box prediction task;

2.3.5 From B̂_coarse^s, construct the fine-box prediction ground truth B̂_refine^s; the value of B̂_refine^s is equal to that of B̂_coarse^s, i.e. B̂_refine^s = B̂_coarse^s;

2.3.6 Let s = s + 1; if s ≤ S, go to 2.3.2; if s > S, go to 2.3.7;

2.3.7 The task ground-truth labels of the S images for model training have been obtained; the S images and their labels together form the training set D_M for model training;

2.4 Optimize the V images of the validation set with the image scaling and standardization method, i.e. scale and standardize the V images, obtaining the new validation set D_V consisting of the V scaled and standardized images;

2.5 Optimize the T images of the test set with the image scaling and standardization method of step 2.4, obtaining the new test set D_T consisting of the T scaled and standardized images;
Third step: train the target detection system constructed in the first step by gradient back-propagation to obtain N_m sets of model parameters. The method is:

3.1 Initialize the network weight parameters of each module in the target detection system: initialize the parameters of the DarkNet-53 convolutional neural network in the main feature extraction module with a pre-trained model trained on the ImageNet dataset; initialize the network weight parameters of the feature pyramid network in the main feature extraction module, the feature adaptive aggregation module, the auxiliary task module and the main task module;

3.2 Set the training parameters of the target detection system: initialize the initial learning rate learning_rate and its decay coefficient, select stochastic gradient descent as the model training optimizer, initialize the optimizer hyper-parameter momentum, and initialize the weight decay; initialize the network training batch size mini_batch_size as a positive integer; initialize the maximum training epoch maxepoch as a positive integer.

3.3 Train the target detection system: the differences between the coarse-box predicted positions, fine-box predicted positions, corner prediction heatmaps and centre point prediction heatmaps output by the target detection system in one training pass and the corresponding ground truths are taken as the loss value, and the network weight parameters are updated by gradient back-propagation until the loss value reaches the threshold or the training epoch reaches maxepoch; during the last N_m training epochs, the network weight parameters are saved once per epoch. The method is:

3.3.1 Let the training epoch = 1; one epoch trains the training set data once; initialize the batch number N_b = 1;

3.3.2 The main feature extraction module reads the N_b-th batch, B = 64 images in total, from D_M and denotes the B images as the matrix I_train; I_train contains B images of size H × W × 3, where H is the input image height, W the input image width, and "3" denotes the RGB channels of the image;
3.3.3 The main feature extraction module extracts the multi-scale features of I_train using the main feature extraction method, obtains the multi-scale feature map containing the multi-scale features of I_train, and sends it to the feature adaptive aggregation module. The method is:

3.3.3.1 The DarkNet-53 convolutional neural network of the main feature extraction module extracts the image features of I_train to obtain the backbone network feature map set. The method is: the 5 serial sub-networks of the DarkNet-53 convolutional neural network perform downsampling and feature extraction on the B images of I_train to obtain the backbone network features, i.e. the 4 feature maps output by the last four serial sub-networks, and send these features to the feature pyramid network;

3.3.3.2 The feature pyramid network receives the 4 feature maps from the DarkNet-53 convolutional neural network, performs upsampling, feature extraction and feature fusion on them to obtain 3 multi-scale feature maps, denoted {F_1, F_2, F_3}, and sends the multi-scale feature maps {F_1, F_2, F_3} to the feature adaptive aggregation module;
3.3.4 The feature adaptive aggregation module receives the multi-scale feature maps {F_1, F_2, F_3} from the feature pyramid network, generates the multi-scale aware high-pixel feature map F_H and sends F_H to the auxiliary task module; it also generates the boundary-region-aware high-pixel feature map and the salient-region-aware high-pixel feature map and sends them to the main task module. The method is:

3.3.4.1 The adaptive multi-scale feature aggregation network receives {F_1, F_2, F_3} from the feature pyramid network and, using the adaptive multi-scale feature aggregation method, performs channel self-attention enhancement, bilinear interpolation upsampling and scale-level soft weight aggregation on {F_1, F_2, F_3} to obtain the multi-scale aware high-pixel feature map F_H; the resolution of the F_H feature map is (H/4) × (W/4) and the number of channels of F_H is 64. The specific method is:
3.3.4.1.1 The adaptive multi-scale feature aggregation network uses the first, second and third SE networks in parallel to perform channel self-attention enhancement on {F_1, F_2, F_3}: the first SE network applies a weighted summation over the channels of F_1 to obtain the first channel-enhanced image; at the same time the second SE network applies a weighted summation over the channels of F_2 to obtain the second channel-enhanced image; at the same time the third SE network applies a weighted summation over the channels of F_3 to obtain the third channel-enhanced image;

3.3.4.1.2 The first, second and third SE networks of the adaptive multi-scale feature aggregation network upsample {F_1, F_2, F_3} in parallel by bilinear interpolation to the same resolution (H/4) × (W/4), obtaining the upsampled feature maps F_1', F_2', F_3', which form the upsampled feature map set {F_1', F_2', F_3'}. The specific calculation is shown in equation (2):

    F_l' = Upsample(SE_n(F_l))        (2)

where SE_n denotes the n-th SE network, F_l denotes the l-th multi-scale feature map, Upsample denotes bilinear interpolation upsampling, 1 ≤ l ≤ 3, and 1 ≤ n ≤ 3;
3.3.4.1.3 The adaptive multi-scale feature aggregation network computes weights for {F_1', F_2', F_3'} with a 1 × 1 convolution, reducing the number of channels from 64 to 1, and then applies Softmax along the scale dimension to obtain the soft weight maps {A_1, A_2, A_3} corresponding to {F_1', F_2', F_3'}. The value of a pixel in a soft weight map indicates which of the 3 scales of {F_1', F_2', F_3'} should receive more attention at that position, i.e. which scale carries the larger weight, so that objects of different sizes respond on feature maps of different scales;
3.3.4.1.4 The adaptive multi-scale feature aggregation network multiplies the soft weight map of the l-th scale element by element with the corresponding l-th upsampled feature map, i.e. the first soft weight map is multiplied element by element with the first upsampled feature map, the second soft weight map with the second upsampled feature map, and the third soft weight map with the third upsampled feature map, giving 3 products; the 3 products are then summed and fused into one feature map, the fused feature map. A fourth SE network then enhances the channel representation of the fused feature map, yielding the multi-scale-aware high-pixel feature map F_H. The process is shown in formula (3):

F_H = SE_4( Σ_{l=1}^{3} w_l × F_l^up ),  w_l = Softmax_l( Conv(F_l^up) )    (3)

where SE_4 is the fourth SE network, w_l represents the weight of the elements at the same position across the different scales, "×" represents the product of elements at corresponding positions, Conv represents the 1×1 convolution, and the Softmax is taken over the scale dimension. The adaptive multi-scale feature aggregation network sends F_H to the auxiliary task module, the coarse box prediction network and the adaptive spatial feature aggregation network;
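To make the data flow of 3.3.4.1.1–3.3.4.1.4 concrete, the following PyTorch-style sketch strings the four sub-steps together. It is only an illustration under assumptions: the standard squeeze-and-excitation block, the choice of the largest-scale map as the upsampling target, and all class and variable names are not taken from the claim.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class SEBlock(nn.Module):
    """Standard squeeze-and-excitation channel self-attention."""
    def __init__(self, channels, reduction=16):
        super().__init__()
        self.fc = nn.Sequential(
            nn.Linear(channels, channels // reduction), nn.ReLU(inplace=True),
            nn.Linear(channels // reduction, channels), nn.Sigmoid())

    def forward(self, x):
        b, c, _, _ = x.shape
        w = self.fc(x.mean(dim=(2, 3)))      # global average pooling -> channel weights
        return x * w.view(b, c, 1, 1)        # channel-wise re-weighting

class AdaptiveMultiScaleAggregation(nn.Module):
    """SE per scale (3.3.4.1.1), bilinear upsampling (3.3.4.1.2),
    scale-level soft weights (3.3.4.1.3), weighted fusion + SE (3.3.4.1.4)."""
    def __init__(self, channels=64, num_scales=3):
        super().__init__()
        self.se = nn.ModuleList([SEBlock(channels) for _ in range(num_scales)])
        self.weight_conv = nn.Conv2d(channels, 1, 1)   # 64 -> 1 channel, the Conv of eq. (3)
        self.se_fuse = SEBlock(channels)

    def forward(self, feats):                # feats: [largest, middle, smallest] scale maps
        size = feats[0].shape[-2:]           # assumed target resolution of F_H
        ups = [F.interpolate(se(f), size=size, mode='bilinear', align_corners=False)
               for se, f in zip(self.se, feats)]                     # eq. (2)
        w = torch.softmax(torch.stack([self.weight_conv(u) for u in ups], 1), dim=1)
        fused = sum(wi * ui for wi, ui in zip(w.unbind(1), ups))     # scale-level soft weights
        return self.se_fuse(fused)           # F_H, eq. (3)
```

Feeding three 64-channel feature maps of decreasing resolution returns one 64-channel map at the largest resolution, matching the description of F_H.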
3.3.4.2 The coarse box prediction network receives the multi-scale-aware high-pixel feature map F_H from the adaptive multi-scale feature aggregation network and uses the coarse box prediction method to predict a coarse box position for every feature point in F_H, generating the coarse box prediction position B_coarse, which is sent to the adaptive spatial feature aggregation network. B_coarse has the same resolution as F_H and 4 channels; the 4 channels represent the distances from each pixel point to the box boundary in the up, down, left and right directions, so every pixel point defines a coarse box. B_coarse is used to limit the deformable-convolution sampling range in the adaptive spatial feature aggregation network. In addition, the loss L_coarse between B_coarse and the coarse box true value constructed in 2.2.5.4 is calculated, where S_b is the regression sample set, consisting of the pixels whose coarse box true value is non-zero, N_b is the number of samples in the regression set, and W_ij is the weight value at the (i, j) positions whose true value is non-zero;
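The concrete regression loss for B_coarse (and for B_refine in 3.3.6, which is described with the same S_b, N_b and W_ij terms) is only given as a formula image. A hedged sketch, assuming an IoU-style loss on the (t, l, d, r) distance maps weighted by W_ij and normalized by N_b, is:

```python
import torch

def weighted_ltrb_iou_loss(pred, target, weight, eps=1e-6):
    """pred, target: B x 4 x H x W distance maps (t, l, d, r); weight: B x H x W.
    Pixels with zero weight are outside the regression sample set S_b."""
    tp, lp, dp, rp = pred.unbind(1)
    tg, lg, dg, rg = target.unbind(1)
    ih = (torch.min(tp, tg) + torch.min(dp, dg)).clamp(min=0)
    iw = (torch.min(lp, lg) + torch.min(rp, rg)).clamp(min=0)
    inter = ih * iw
    union = (tp + dp) * (lp + rp) + (tg + dg) * (lg + rg) - inter
    iou = inter / union.clamp(min=eps)
    mask = weight > 0                          # S_b: pixels whose ground truth is non-zero
    n_b = mask.sum().clamp(min=1)              # N_b
    return ((1.0 - iou) * weight)[mask].sum() / n_b
```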
3.3.4.3 The adaptive spatial feature aggregation network receives the multi-scale-aware high-pixel feature map F_H from the adaptive multi-scale feature aggregation network and the coarse box prediction position B_coarse from the coarse box prediction network, and generates the boundary-region-aware high-pixel feature map F_HR and the salient-region-aware high-pixel feature map F_HS; the method comprises the following steps:
3.3.4.3.1 A deformable convolution R-DConv with a restricted sampling region is created as follows:
3.3.4.3.1.1 Design an offset transfer function T that transforms the offset Δp of the deformable convolution into a transformed offset. T limits the offset range of the spatial sampling points of the deformable convolution to B_coarse while keeping the offset Δp differentiable. A Sigmoid function σ is used to normalize the offset Δp into the interval [0,1] with respect to B_coarse; Δp is split into h_Δp and w_Δp, where h_Δp denotes the component of Δp in the vertical direction and w_Δp denotes the component of Δp in the horizontal direction;
The transfer function is shown in equation (5):

Δp' = ( T_v(h_Δp), T_h(w_Δp) )    (5)

where T_v represents the offset transfer function in the vertical direction, T_h represents the offset transfer function in the horizontal direction, T = (T_v, T_h) is the overall offset transfer function, and (t, l, r, d), the distances from the convolution kernel position p to B_coarse in the up, left, right and down directions, enter T_v and T_h so that the transformed offsets stay within B_coarse;
3.3.4.3.1.2 Use the offset transfer function T to limit the deformable-convolution sampling area. Given a 3×3 convolution kernel with K = 9 spatial sampling points, let w_k denote the convolution kernel weight of the k-th position and P_k the predefined position offset of the k-th position; P_k ∈ {(−1,−1), (−1,0), …, (1,1)} covers the 3×3 range centred at (0,0). Let x(p) denote the input feature map at the convolution kernel centre position p and y(p) the output feature map at position p. y(p) is calculated with R-DConv as shown in equation (6):

y(p) = Σ_{k=1}^{K} w_k · x( p + P_k + T(Δp_k) ) · Δm_k    (6)

where Δp_k denotes the learnable offset of the k-th position and Δm_k the modulation weight of the k-th position; Δp_k and Δm_k are generated by a 3×3 convolution that outputs a 27-channel feature map, of which 9 channels are the abscissa offset values of Δp_k, 9 channels are the ordinate offset values of Δp_k, and 9 channels are the values of Δm_k; B_coarse represents the coarse box predicted at the current feature-map scale, which also serves as the predefined bounding region;
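A sketch of R-DConv built on torchvision's deform_conv2d is shown below. It is an illustration, not the claimed formula: bounding the vertical offset to [−t, d] and the horizontal offset to [−l, r] via the Sigmoid is one plausible reading of equation (5), the (t, l, d, r) channel order of the coarse-box input is assumed, and the offset layout follows torchvision's convention. Driven by T_cls or T_reg, the same operator underlies the classification and regression aggregation of 3.3.4.3.2 and 3.3.4.3.3.

```python
import torch
import torch.nn as nn
from torchvision.ops import deform_conv2d

class RegionRestrictedDConv(nn.Module):
    """3x3 modulated deformable convolution whose sampling offsets are squashed
    into the coarse box predicted at each position."""
    def __init__(self, channels=64):
        super().__init__()
        self.weight = nn.Parameter(torch.empty(channels, channels, 3, 3))
        nn.init.normal_(self.weight, std=0.01)
        # 27 channels: 9 y-offsets, 9 x-offsets (interleaved below), 9 modulation masks
        self.offset_mask = nn.Conv2d(channels, 27, kernel_size=3, padding=1)

    def forward(self, x, box_dist):
        # box_dist: B x 4 x H x W distances (t, l, d, r) from each position to B_coarse
        out = self.offset_mask(x)
        raw_off, mask = out[:, :18], out[:, 18:].sigmoid()
        dy, dx = raw_off[:, 0::2], raw_off[:, 1::2]            # 9 + 9 raw offsets
        t, l, d, r = (c.unsqueeze(1) for c in box_dist.unbind(1))
        dy = (t + d) * dy.sigmoid() - t                        # assumed bound: [-t, d]
        dx = (l + r) * dx.sigmoid() - l                        # assumed bound: [-l, r]
        offset = torch.stack((dy, dx), dim=2).flatten(1, 2)    # interleave back to 18 channels
        return deform_conv2d(x, offset, self.weight, padding=1, mask=mask)
```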
3.3.4.3.2 A classification adaptive spatial feature aggregation method is adopted, using B_coarse to limit the sampling range for feature aggregation on F_H. The method comprises the following steps:
3.3.4.3.2.1 Let T_cls denote the classification offset transfer function, and calculate the output feature y_cls(p) at position p with formula (6) using T_cls;
3.3.4.3.2.2 Traverse F_H with the R-DConv convolution kernel under T_cls to obtain the salient-region-aware high-pixel feature map F_HS. T_cls allows the sampling points to concentrate, so that the classification branch can focus on the most discriminative salient regions; it enables R-DConv to learn the salient region of the object within the range of the coarse box and to extract the features that allow the object to be classified more accurately, i.e. the salient-region-aware high-pixel feature map F_HS. F_HS is sent to the main task module;
3.3.4.3.3 A regression adaptive spatial feature aggregation method is adopted, using B_coarse to limit the sampling range for feature aggregation on F_H. The regression adaptive spatial feature aggregation method specifically comprises the following steps:
3.3.4.3.3.1 Design a regression offset transfer function T_reg that transforms the offset Δp of the deformable convolution. T_reg evenly divides the spatial sampling points of the R-DConv operation along the up, down, left and right directions, so that the restricted region is split into four sub-regions corresponding to the upper-left, upper-right, lower-left and lower-right quadrants. T_reg samples the four sub-regions uniformly, i.e. assigns an equal number of sampling points to each sub-region. With K = 9, T_reg samples two points from each of the four sub-regions, and the resulting eight edge points plus the centre point form the 3×3 convolution kernel, which strengthens the capture of boundary information by the central feature point. The regression offset transfer function T_reg is given in equation (7), in which the Sigmoid function σ is used to normalize the offsets within the coarse-box interval;
Substituting T_reg into equation (6) yields the output feature y_reg(p) at position p;
3.3.4.3.3.2 Traverse F_H with the R-DConv convolution kernel under T_reg to obtain the boundary-region-aware high-pixel feature map F_HR, and send F_HR to the main task module;
3.3.5 The auxiliary task module receives F_H from the adaptive multi-scale feature aggregation network and processes it with two layers of 3×3 convolution, a 1×1 convolution and a sigmoid function to obtain the corner prediction heatmap H_corner; H_corner has the same resolution as F_H and 4 channels. The loss L_corner between H_corner and the corner true value constructed in 2.3.3 is calculated, where N_s is the number of label boxes in the image, α_l and β are hyper-parameters controlling the gradient curve of the loss function, the prediction term is the corner value output by the auxiliary task module at channel c and pixel position (i, j), and the corresponding true-value term is the corner ground truth at channel c and pixel position (i, j);
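The corner loss itself is given as a formula image, as is the center-point loss of 3.3.7, which is described with the same N_s, α_l and β terms. Since claim 8 fixes α_l = 2 and β = 4, a penalty-reduced pixel-wise focal loss of the CornerNet/CenterNet family is a natural reading; the sketch below assumes that form and is not the claimed expression.

```python
import torch

def penalty_reduced_focal_loss(pred, gt, alpha=2.0, beta=4.0, eps=1e-6):
    """pred, gt: B x C x H x W heatmaps in [0, 1]; gt equals 1 at annotated peaks
    and decays with a Gaussian elsewhere.  Normalized by the number of peaks."""
    pos = gt.eq(1.0).float()
    neg = 1.0 - pos
    pos_loss = -((1.0 - pred) ** alpha) * torch.log(pred + eps) * pos
    neg_loss = -((1.0 - gt) ** beta) * (pred ** alpha) * torch.log(1.0 - pred + eps) * neg
    num_pos = pos.sum().clamp(min=1.0)
    return (pos_loss.sum() + neg_loss.sum()) / num_pos
```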
3.3.6 The fine box prediction network of the main task module receives the boundary-region-aware high-pixel feature map F_HR from the adaptive spatial feature aggregation network and, after one layer of 1×1 convolution, obtains the fine box prediction position B_refine for every feature point position of F_HR. B_refine has the same resolution as F_HR and 4 channels; the 4 channels represent the distances from each pixel point to the predicted fine box in the up, down, left and right directions, so every pixel point can form a fine prediction box. The loss L_refine between B_refine and the fine box true value obtained in 2.3.5 is calculated in the same weighted form as the coarse box loss, where S_b is the regression sample set consisting of the pixels whose fine box true value is non-zero, N_b is the number of samples in the regression set, and W_ij is the weight value at the (i, j) positions whose true value is non-zero. The learning quality of B_refine determines the accuracy with which the target detection system regresses object positions;
3.3.7 The center point prediction network of the main task module receives the salient-region-aware high-pixel feature map F_HS from the adaptive spatial feature aggregation network and, after one layer of 1×1 convolution and a sigmoid function, obtains the center point prediction heatmap H_center for the feature point positions of F_HS. H_center has the same resolution as F_HS and its number of channels is the number of dataset categories C. The loss L_center between H_center and the center point true value constructed in 2.2.5.2 is calculated in the same form as the corner loss, where N_s is the number of label boxes in the image, α_l and β are hyper-parameters, the prediction term is the center point heatmap value at channel c and pixel position (i, j), and the corresponding true-value term is the center point ground truth at channel c and pixel position (i, j). The learning quality of H_center determines the ability of the target detection system to locate object centres and to distinguish object classes;
3.3.8 Design the total loss function L_total of the target detection system as shown in equation (11):

L_total = λ_corner · L_corner + λ_center · L_center + λ_coarse · L_coarse + λ_refine · L_refine    (11)

where L_corner is the loss value calculated between the corner prediction network output H_corner and its true value, L_center is the loss value calculated between the center point prediction network output H_center and its true value, L_coarse is the loss value calculated between the coarse box prediction network output B_coarse and its true value, and L_refine is the loss value calculated between the fine box prediction network output B_refine and its true value; λ_corner is the corner prediction network loss weight, λ_center is the center point prediction network loss weight, λ_coarse is the coarse box prediction network loss weight, and λ_refine is the fine box prediction network loss weight;
3.3.9 Let epoch = epoch + 1; if epoch is 80 or 110, let learning_rate = learning_rate × 0.1 and turn to 3.3.10; if epoch is neither 80 nor 110, turn directly to 3.3.10;
3.3.10 If epoch ≤ maxepoch, turn to 3.3.2; if epoch > maxepoch, training is finished, turn to 3.3.11;
3.3.11 Save the network weight parameters of N_m epochs;
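Steps 3.3.9–3.3.11 describe a step-decay schedule (×0.1 at epochs 80 and 110) over the 120 epochs fixed in claim 7. A minimal sketch with the SGD hyper-parameters of claim 7 is given below; `model` and `train_one_epoch` are placeholders, and saving the last N_m = 10 checkpoints is an assumption (the claim only says N_m epoch weights are kept).

```python
import torch

def train(model, train_one_epoch, max_epoch=120, keep_last=10):
    """Step-decay training loop; saves the weight files verified in the fourth step."""
    optimizer = torch.optim.SGD(model.parameters(), lr=0.01,
                                momentum=0.9, weight_decay=0.0004)
    scheduler = torch.optim.lr_scheduler.MultiStepLR(optimizer,
                                                     milestones=[80, 110], gamma=0.1)
    for epoch in range(1, max_epoch + 1):
        train_one_epoch(model, optimizer)        # one pass over the optimized training set
        scheduler.step()                         # learning_rate x 0.1 at epochs 80 and 110
        if epoch > max_epoch - keep_last:        # keep the last N_m checkpoints (assumed)
            torch.save(model.state_dict(), f"epoch_{epoch}.pth")
```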
Fourth, use the verification set to verify the detection precision of the target detection system loaded with each of the N_m saved epoch weight parameters, and keep the best-performing network weight parameters as the network weight parameters of the target detection system; the method comprises the following steps:
4.1 Let the variable n_m = 1;
4.2 The target detection system loads the n_m-th of the N_m saved epoch network weight parameters; the new verification set D_V is input into the target detection system;
4.3 Let v = 1 index the v-th image of the verification set, V being the number of images in the verification set;
4.4 The main feature extraction module receives the v-th verification set image D_v, extracts the multi-scale features of D_v with the main feature extraction method described in 3.3.3, and sends the multi-scale feature map containing the multi-scale features of D_v to the feature adaptive aggregation module;
4.5 The adaptive multi-scale feature aggregation network in the feature adaptive aggregation module receives the multi-scale feature map containing the multi-scale features of D_v and, with the adaptive multi-scale feature aggregation method of 3.3.4.1, performs channel self-attention enhancement, bilinear-interpolation upsampling and scale-level soft-weight aggregation to obtain the multi-scale-aware high-pixel feature map F_HV of D_v; F_HV is sent to the coarse box prediction network and the adaptive spatial feature aggregation network;
4.6 The coarse box prediction network in the feature adaptive aggregation module receives F_HV and, with the coarse box prediction method of 3.3.4.2, predicts a coarse box position for every feature point in F_HV, generating the coarse box prediction position B_HVcoarse of the v-th verification set image D_v; B_HVcoarse is sent to the adaptive spatial feature aggregation network; B_HVcoarse has the same resolution as F_HV and 4 channels;
4.7 The adaptive spatial feature aggregation network in the feature adaptive aggregation module receives B_HVcoarse from the coarse box prediction network and F_HV from the adaptive multi-scale feature aggregation network, adopts the classification adaptive spatial feature aggregation method of 3.3.4.3.2 and uses B_HVcoarse to limit the sampling range, performing classification-task spatial feature aggregation on F_HV to obtain the salient-region-aware high-pixel feature map of the v-th verification set image D_v; this salient-region-aware high-pixel feature map is sent to the center point prediction network;
4.8 The adaptive spatial feature aggregation network in the feature adaptive aggregation module adopts the regression adaptive spatial feature aggregation method described in 3.3.4.3.3 and uses B_HVcoarse to limit the sampling range, performing regression-task spatial feature aggregation on F_HV to obtain the boundary-region-aware high-pixel feature map of the v-th verification set image D_v; this boundary-region-aware high-pixel feature map is sent to the fine box prediction network;
4.9 The fine box prediction network in the main task module receives the boundary-region-aware high-pixel feature map and, after one layer of 1×1 convolution, obtains the fine box prediction positions of the objects in the v-th verification set image D_v, which are sent to the post-processing module;
4.10 The center point prediction network in the main task module receives the salient-region-aware high-pixel feature map of the v-th verification set image D_v and, after one layer of 1×1 convolution, obtains the center point prediction heatmap of D_v, which is sent to the post-processing module;
4.11 The post-processing module receives the fine box prediction positions and the center point prediction heatmap of the v-th verification set image D_v and applies the overlapping-pseudo-box removal method to them, obtaining the object box prediction set of D_v; the specific method is as follows:
4.11.1 The post-processing module performs a 3×3 max-pooling operation on the center point prediction heatmap of D_v to extract its set of peak points, each peak point representing a point in the central region of a predicted object;
4.11.2 The coordinate values (P_x, P_y) of a peak point are obtained from the center point prediction heatmap of D_v; the post-processing module then reads from the fine box prediction positions of D_v the distance information (t, l, d, r) of the peak point (P_x, P_y) in the up, left, down and right directions, obtaining the prediction box B_p of D_v whose top, left, bottom and right boundaries are P_y − t, P_x − l, P_y + d and P_x + r respectively; the category of B_p is the index of the channel with the maximum pixel value of the center point heatmap at position (P_x, P_y), recorded as c_p; the confidence of B_p is the pixel value of channel c_p of the center point heatmap at position (P_x, P_y), recorded as s_p;
4.11.3 The post-processing module retains the prediction boxes of D_v whose confidence s_p is larger than the confidence threshold, forming the object box prediction set of D_v, which keeps each retained prediction box B_p and its class information c_p;
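Step 4.11 (reused at inference in 5.9) is essentially peak-based decoding of the center-point heatmap. A hedged sketch for a single image, assuming the heatmap is already sigmoid-activated and the fine-box map stores (t, l, d, r) distances, is:

```python
import torch
import torch.nn.functional as F

def decode_boxes(center_heatmap, box_dist, score_thr=0.3):
    """center_heatmap: C x H x W; box_dist: 4 x H x W with (t, l, d, r) distances.
    Returns boxes (x1, y1, x2, y2), scores s_p and classes c_p on the feature grid."""
    pooled = F.max_pool2d(center_heatmap[None], kernel_size=3, stride=1, padding=1)[0]
    peaks = (pooled == center_heatmap) & (center_heatmap > score_thr)   # 4.11.1 + 4.11.3
    cls, ys, xs = peaks.nonzero(as_tuple=True)                          # peak coordinates
    t, l, d, r = box_dist[:, ys, xs]                                    # 4.11.2
    boxes = torch.stack((xs - l, ys - t, xs + r, ys + d), dim=1)
    return boxes, center_heatmap[cls, ys, xs], cls
```

The returned coordinates live on the down-sampled feature grid; multiplying by the stride would map them back to input-image coordinates.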
4.12 Let v = v + 1; if v ≤ V, turn to 4.4; if v > V, the object box prediction sets of the V verification images have been obtained for the n_m-th model, turn to 4.13;
4.13 If the verification set is the public MS COCO general-scene data set, test the precision of the final object box prediction set output by the target detection system with the standard MS COCO evaluation protocol, record the precision of the object box prediction set, and turn to 4.14; if the verification set is the Cityscapes unmanned-driving scene data set, test the precision of the final object box prediction set output by the target detection system with the Cityscapes evaluation protocol, record the precision of the object box prediction set, and turn to 4.14;
4.14 Let n_m = n_m + 1; if n_m ≤ N_m, turn to 4.2; if n_m > N_m, the precision of all N_m models has been tested, turn to 4.15;
4.15 From the precisions of the object box prediction sets of the N_m models, select the object box prediction set with the highest precision, find the weight parameters of the target detection system corresponding to it, take them as the selected weight parameters of the target detection system, and load them into the target detection system; the target detection system loaded with the selected weight parameters becomes the trained target detection system;
Fifth, use the trained target detection system to perform target detection on the image to be detected input by the user. The method comprises the following steps:
5.1 Optimize the image to be detected I input by the user with the image scaling normalization method of step 2.4 to obtain the normalized image to be detected I_nor, and input I_nor into the main feature extraction module;
5.2 The main feature extraction module receives I_nor, extracts the multi-scale features of I_nor with the main feature extraction method described in 3.3.3, and sends the multi-scale feature map containing the multi-scale features of I_nor to the feature adaptive aggregation module;
5.3 The adaptive multi-scale feature aggregation network in the feature adaptive aggregation module receives the multi-scale feature map containing the multi-scale features of I_nor and, with the adaptive multi-scale feature aggregation method of 3.3.4.1, performs channel self-attention enhancement, bilinear-interpolation upsampling and scale-level soft-weight aggregation to obtain the multi-scale-aware high-pixel feature map F_IH, which is sent to the coarse box prediction network and the adaptive spatial feature aggregation network;
5.4 The coarse box prediction network in the feature adaptive aggregation module receives F_IH and, with the coarse box prediction method described in 3.3.4.2, predicts coarse box positions on F_IH, obtaining the coarse box prediction position B_Icoarse of the image to be detected I; B_Icoarse is sent to the adaptive spatial feature aggregation network; B_Icoarse has the same resolution as F_IH and 4 channels;
5.5 The adaptive spatial feature aggregation network in the feature adaptive aggregation module receives F_IH and B_Icoarse, adopts the classification adaptive spatial feature aggregation method of 3.3.4.3.2 and uses B_Icoarse to limit the sampling range, performing classification-task spatial feature aggregation on F_IH to obtain the salient-region-aware high-pixel feature map of the image to be detected I; this feature map is sent to the center point prediction network;
5.6 The adaptive spatial feature aggregation network in the feature adaptive aggregation module adopts the regression adaptive spatial feature aggregation method described in 3.3.4.3.3 and uses B_Icoarse to limit the sampling range, performing regression-task spatial feature aggregation on F_IH to obtain the boundary-region-aware high-pixel feature map of the image to be detected I; this feature map is sent to the fine box prediction network;
5.7 The fine box prediction network in the main task module receives the boundary-region-aware high-pixel feature map of the image to be detected I and, after one layer of 1×1 convolution, obtains the fine box prediction positions of the objects in I, which are sent to the post-processing module;
5.8 The center point prediction network in the main task module receives the salient-region-aware high-pixel feature map of the image to be detected I and, after one layer of 1×1 convolution, obtains the center point prediction heatmap of the objects of I, which is sent to the post-processing module;
5.9 The post-processing module receives the fine box prediction positions and the center point prediction heatmap of the objects of the image to be detected I and applies the overlapping-pseudo-box removal method of step 4.11 to them, obtaining the object box prediction set of I; the object box prediction set keeps each prediction box B_p and its class information, i.e. the coordinate positions and predicted classes of the predicted object boxes of the image to be detected;
Sixth, finish.
2. The method of claim 1, wherein in step 2.1, the MS COCO data set has 80 classes, including 105000 training images as a training set, 5000 verification images as a verification set, and 20000 test images as a test set; the Cityscapes data set has 8 classes: pedestrians, riders, cars, trucks, buses, trains, motorcycles and bicycles, with 2975 training images as a training set, 500 verification images as a verification set, and 1525 test images as a test set; S is 205000 or 2975, T is 20000 or 1524, and V is 5000 or 500.
3. The target detection method of claim 1, wherein the optimization processing performed on the S images of the training set in step 2.2 to obtain the optimized training set D_t is:
2.2.1 Let the variable s = 1 and initialize the optimized training set D_t to be empty;
2.2.2 Flip the s-th image of the training set with random flipping to obtain the s-th flipped image; the random probability of the flipping is 0.5;
2.2.3 Randomly crop the s-th flipped image with a minimum intersection-over-union constraint to obtain the s-th cropped image; the minimum size ratio adopted is 0.3;
2.2.4 Apply random image translation to the s-th cropped image to obtain the s-th translated image;
2.2.5 Apply random brightness transformation to the s-th translated image to obtain the s-th brightness-transformed image; the brightness delta adopted is 32;
2.2.6 Apply random contrast transformation to the s-th brightness-transformed image to obtain the s-th contrast-transformed image; the contrast range is (0.5, 1.5);
2.2.7 Apply random saturation transformation to the s-th contrast-transformed image to obtain the s-th saturation-transformed image; the saturation range is (0.5, 1.5);
2.2.8 Scale the s-th saturation-transformed image to 512×512 to obtain the s-th scaled image;
2.2.9 Standardize the s-th scaled image to obtain the s-th standard image and put it into the optimized training set D_t;
Let s = s + 1; if s ≤ S, turn to 2.2.2; if s > S, the optimized training set D_t consisting of S standard images is obtained.
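A sketch of the photometric part of this pipeline (brightness delta 32, contrast and saturation ranges (0.5, 1.5)) on an H×W×3 float image is given below. Applying each distortion with probability 0.5 and approximating the saturation step by scaling the chroma around the per-pixel grey level are assumptions; the claim only fixes the parameter ranges.

```python
import numpy as np

def photometric_augment(img, rng=np.random):
    """Random brightness (2.2.5), contrast (2.2.6) and saturation (2.2.7)."""
    img = img.astype(np.float32)
    if rng.rand() < 0.5:
        img += rng.uniform(-32, 32)                        # brightness delta 32
    if rng.rand() < 0.5:
        img *= rng.uniform(0.5, 1.5)                       # contrast range (0.5, 1.5)
    if rng.rand() < 0.5:
        grey = img.mean(axis=2, keepdims=True)             # per-pixel grey level
        img = grey + (img - grey) * rng.uniform(0.5, 1.5)  # saturation range (0.5, 1.5)
    return np.clip(img, 0, 255)
```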
4. The feature adaptive aggregation-based target detection method of claim 1, wherein in step 2.3.2.4.3 the two-dimensional Gaussian kernel is centred at the centre of B'_si, and its radius parameter for B'_si is set to 0.54.
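Claim 4 places a two-dimensional Gaussian kernel at the centre of each down-sampled label box, with 0.54 presumably parameterizing the kernel radius. A minimal sketch of rendering such a Gaussian onto one heatmap channel, with sigma left to the caller, is:

```python
import numpy as np

def draw_gaussian(heatmap, center, sigma):
    """Render a 2D Gaussian peak at `center` = (x0, y0) onto `heatmap` (H x W),
    keeping the element-wise maximum so nearby objects do not overwrite each other."""
    x0, y0 = int(center[0]), int(center[1])
    radius = max(1, int(3 * sigma))
    h, w = heatmap.shape
    x1, x2 = max(0, x0 - radius), min(w, x0 + radius + 1)
    y1, y2 = max(0, y0 - radius), min(h, y0 + radius + 1)
    xs, ys = np.meshgrid(np.arange(x1, x2), np.arange(y1, y2))
    g = np.exp(-((xs - x0) ** 2 + (ys - y0) ** 2) / (2 * sigma ** 2))
    np.maximum(heatmap[y1:y2, x1:x2], g, out=heatmap[y1:y2, x1:x2])
    return heatmap
```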
5. The method as claimed in claim 1, wherein in step 2.3.4 the coarse box true value of the s-th image for the coarse box prediction task is constructed from the N_s 4×-down-sampled label boxes of the s-th image by:
2.3.4.1 Construct an all-zero matrix H_zeros with 4 channels at the down-sampled label resolution, where "4" corresponds to the 4 coordinates of a 4×-down-sampled label box;
2.3.4.2 Let i = 1, indexing the i-th 4×-down-sampled label box;
2.3.4.3 Assign values to the pixels of H_zeros inside the i-th 4×-down-sampled label box B'_si, i.e. write the 4 coordinate values of B'_si into the 4 channels of every pixel position inside B'_si;
2.3.4.4 Let i = i + 1; if i ≤ N_s, turn to 2.3.4.3; if i > N_s, the coarse box true values corresponding to all N_s label boxes of the s-th image have been written into the matrix, which becomes the true value label of the s-th image, and turn to 2.3.4.5;
2.3.4.5 The assigned matrix is the coarse box true value of the s-th image.
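A NumPy sketch of 2.3.4.1–2.3.4.5 is given below. The channel layout (the four box coordinates written to every interior pixel) follows the claim; the matrix size arguments and dtype are assumptions.

```python
import numpy as np

def build_coarse_box_target(down_boxes, out_h, out_w):
    """down_boxes: list of 4x-downsampled label boxes (x1, y1, x2, y2).
    Returns a (4, out_h, out_w) map whose interior pixels hold the box coordinates."""
    target = np.zeros((4, out_h, out_w), dtype=np.float32)        # 2.3.4.1: all-zero matrix
    for x1, y1, x2, y2 in down_boxes:                             # 2.3.4.2 - 2.3.4.4
        ys = slice(int(y1), max(int(y1) + 1, int(np.ceil(y2))))
        xs = slice(int(x1), max(int(x1) + 1, int(np.ceil(x2))))
        target[:, ys, xs] = np.array([x1, y1, x2, y2], np.float32).reshape(4, 1, 1)
    return target                                                 # 2.3.4.5
```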
6. The target detection method based on feature adaptive aggregation of claim 1, wherein the optimization processing performed on the V images of the verification set with the image scaling normalization method in step 2.4 is:
2.4.1 Let the variable v = 1;
2.4.2 Scale the v-th image of the verification set to 512×512 to obtain the v-th scaled image;
2.4.3 Standardize the v-th scaled image to obtain the v-th scaled and normalized image;
2.4.4 Let v = v + 1; if v ≤ V, turn to 2.4.2; if v > V, the new verification set D_V consisting of V scaled and normalized images is obtained.
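A sketch of step 2.4 for one image is given below. The claim fixes only the 512×512 scale; the mean/std values (common ImageNet statistics) and the use of OpenCV for resizing are assumptions.

```python
import cv2
import numpy as np

def scale_and_normalize(img,
                        mean=(123.675, 116.28, 103.53),
                        std=(58.395, 57.12, 57.375)):
    """Resize to 512 x 512 (2.4.2) and standardize (2.4.3)."""
    img = cv2.resize(img, (512, 512), interpolation=cv2.INTER_LINEAR).astype(np.float32)
    return (img - np.asarray(mean, np.float32)) / np.asarray(std, np.float32)
```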
7. The target detection method based on feature adaptive aggregation of claim 1, wherein in the third step the feature pyramid network in the main feature extraction module, the feature adaptive aggregation module, the auxiliary task module and the main task module are initialized with a normal distribution with mean 0 and variance 0.01; the initial learning rate learning_rate is initialized to 0.01, the decay coefficient to 0.1, the optimizer momentum hyper-parameter to 0.9 and the weight decay to 0.0004; the training batch size mini_batch_size is initialized to 64; the maximum number of training epochs maxepoch is initialized to 120.
8. The method for detecting a target based on feature adaptive aggregation of claim 1, wherein in the third step N_m = 10; in step 3.3.5 α_l is set to 2 and β is set to 4; and in step 3.3.8 fixed values are assigned to the corner prediction network loss weight λ_corner, the center point prediction network loss weight λ_center, the coarse box prediction network loss weight λ_coarse and the fine box prediction network loss weight λ_refine.
9. The method of claim 1, wherein the step 4.11.3 is performed with the confidence threshold set to 0.3.
CN202211219905.9A 2022-10-06 2022-10-06 Target detection method based on feature self-adaptive aggregation Active CN115631344B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202211219905.9A CN115631344B (en) 2022-10-06 2022-10-06 Target detection method based on feature self-adaptive aggregation

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202211219905.9A CN115631344B (en) 2022-10-06 2022-10-06 Target detection method based on feature self-adaptive aggregation

Publications (2)

Publication Number Publication Date
CN115631344A true CN115631344A (en) 2023-01-20
CN115631344B CN115631344B (en) 2023-05-09

Family

ID=84905182

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202211219905.9A Active CN115631344B (en) 2022-10-06 2022-10-06 Target detection method based on feature self-adaptive aggregation

Country Status (1)

Country Link
CN (1) CN115631344B (en)



Patent Citations (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN110135267A (en) * 2019-04-17 2019-08-16 电子科技大学 A kind of subtle object detection method of large scene SAR image
CN111475650A (en) * 2020-04-02 2020-07-31 中国人民解放军国防科技大学 Russian semantic role labeling method, system, device and storage medium
WO2022083157A1 (en) * 2020-10-22 2022-04-28 北京迈格威科技有限公司 Target detection method and apparatus, and electronic device
CN113158862A (en) * 2021-04-13 2021-07-23 哈尔滨工业大学(深圳) Lightweight real-time face detection method based on multiple tasks
CN114841244A (en) * 2022-04-05 2022-08-02 西北工业大学 Target detection method based on robust sampling and mixed attention pyramid
CN114821357A (en) * 2022-04-24 2022-07-29 中国人民解放军空军工程大学 Optical remote sensing target detection method based on transformer

Non-Patent Citations (4)

* Cited by examiner, † Cited by third party
Title
CHAOJUN LIN et al.: "An anchor-free detector and R-CNN integrated neural network architecture for environmental perception of urban roads" *
YULIN HE et al.: "CenterRepp: Predict Central Representative Point Set's Distribution For Detection" *
HOU Zhiqiang et al.: "Anchor-free object detection algorithm based on dual-branch feature fusion" (基于双分支特征融合的无锚框目标检测算法) *
FAN Hongchao et al.: "Anchor-free based traffic sign detection" (基于Anchor-free的交通标志检测) *

Cited By (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN116452972A (en) * 2023-03-17 2023-07-18 兰州交通大学 Transformer end-to-end remote sensing image vehicle target detection method
CN116052026A (en) * 2023-03-28 2023-05-02 石家庄铁道大学 Unmanned aerial vehicle aerial image target detection method, system and storage medium
CN117152083A (en) * 2023-08-31 2023-12-01 哈尔滨工业大学 Ground penetrating radar road disease image prediction visualization method based on category activation mapping
CN117152083B (en) * 2023-08-31 2024-04-09 哈尔滨工业大学 Ground penetrating radar road disease image prediction visualization method based on category activation mapping
CN118279566A (en) * 2024-05-10 2024-07-02 广东工业大学 Automatic driving target detection system for small object

Also Published As

Publication number Publication date
CN115631344B (en) 2023-05-09

Similar Documents

Publication Publication Date Title
CN110298262B (en) Object identification method and device
CN110188705B (en) Remote traffic sign detection and identification method suitable for vehicle-mounted system
Cortinhal et al. Salsanext: Fast, uncertainty-aware semantic segmentation of lidar point clouds
CN110378381B (en) Object detection method, device and computer storage medium
CN110135267B (en) Large-scene SAR image fine target detection method
CN109584248B (en) Infrared target instance segmentation method based on feature fusion and dense connection network
CN111598030B (en) Method and system for detecting and segmenting vehicle in aerial image
Zhang et al. DAGN: A real-time UAV remote sensing image vehicle detection framework
Tian et al. A dual neural network for object detection in UAV images
CN115631344B (en) Target detection method based on feature self-adaptive aggregation
CN110309856A (en) Image classification method, the training method of neural network and device
CN107545263B (en) Object detection method and device
CN111160249A (en) Multi-class target detection method of optical remote sensing image based on cross-scale feature fusion
CN109583483A (en) A kind of object detection method and system based on convolutional neural networks
CN113361495A (en) Face image similarity calculation method, device, equipment and storage medium
CN104299006A (en) Vehicle license plate recognition method based on deep neural network
Nguyen et al. Hybrid deep learning-Gaussian process network for pedestrian lane detection in unstructured scenes
US20220180476A1 (en) Systems and methods for image feature extraction
CN116188999B (en) Small target detection method based on visible light and infrared image data fusion
CN110852327A (en) Image processing method, image processing device, electronic equipment and storage medium
CN113743417A (en) Semantic segmentation method and semantic segmentation device
CN112365451A (en) Method, device and equipment for determining image quality grade and computer readable medium
Fan et al. A novel sonar target detection and classification algorithm
WO2021083126A1 (en) Target detection and intelligent driving methods and apparatuses, device, and storage medium
Khellal et al. Pedestrian classification and detection in far infrared images

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant