CN113657225A - Target detection method - Google Patents

Target detection method Download PDF

Info

Publication number
CN113657225A
Authority
CN
China
Prior art keywords
class
information
target
attention
category
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Granted
Application number
CN202110898055.9A
Other languages
Chinese (zh)
Other versions
CN113657225B (en)
Inventor
卢涛
陈剑卓
张彦铎
徐爱波
吴云韬
金从元
余晗
魏明
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Wuhan Institute of Technology
Wuhan Fiberhome Technical Services Co Ltd
Original Assignee
Wuhan Institute of Technology
Wuhan Fiberhome Technical Services Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Wuhan Institute of Technology, Wuhan Fiberhome Technical Services Co Ltd filed Critical Wuhan Institute of Technology
Priority to CN202110898055.9A priority Critical patent/CN113657225B/en
Publication of CN113657225A publication Critical patent/CN113657225A/en
Application granted granted Critical
Publication of CN113657225B publication Critical patent/CN113657225B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical

Links

Images

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00 Pattern recognition
    • G06F18/20 Analysing
    • G06F18/24 Classification techniques
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 Computing arrangements based on biological models
    • G06N3/02 Neural networks
    • G06N3/04 Architecture, e.g. interconnection topology
    • G06N3/045 Combinations of networks
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 Computing arrangements based on biological models
    • G06N3/02 Neural networks
    • G06N3/08 Learning methods

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Data Mining & Analysis (AREA)
  • Evolutionary Computation (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Artificial Intelligence (AREA)
  • General Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • General Health & Medical Sciences (AREA)
  • Software Systems (AREA)
  • Molecular Biology (AREA)
  • Computing Systems (AREA)
  • Biophysics (AREA)
  • Biomedical Technology (AREA)
  • Mathematical Physics (AREA)
  • Computational Linguistics (AREA)
  • Health & Medical Sciences (AREA)
  • Bioinformatics & Cheminformatics (AREA)
  • Bioinformatics & Computational Biology (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Evolutionary Biology (AREA)
  • Image Analysis (AREA)

Abstract

The invention provides a target detection method, which comprises the following steps: extracting image features to generate a feature map; up-sampling the feature map to obtain an enlarged feature map; connecting the enlarged feature map to a category prediction head, a width and height prediction head and a central point offset prediction head; adding a category attention network to the category prediction head to mine effective information between targets that are far apart within and between categories but semantically related; supervising the training of each prediction head with supervision information generated by encoding the real target frame; and framing the recognized object in the image to be detected according to the output of each prediction head and marking the classification result. By combining the category attention used to further judge the category of the target with the scale-adaptive encoding used for frame regression, the network can associate intra-class and inter-class features, mine effective information between distant but semantically related targets, and perform more accurate frame selection according to the scale variation of the detected target, so that both the detection accuracy and the frame selection precision are improved.

Description

Target detection method
Technical Field
The invention belongs to the field of computer vision target detection, and particularly relates to a target detection method.
Background
Object detection is a common problem in the field of machine vision. It is image segmentation based on characteristics of the detected object such as geometric and statistical features, and it combines object segmentation and recognition into a whole so as to obtain an accurate object detection result. Target detection combines target positioning and target classification, locating an object of interest in an image or video by drawing on multiple areas of knowledge such as image processing and machine learning. The target classification part is responsible for judging whether the input image contains an object of a given class, and the target positioning part is responsible for representing the position of the target object, marking and positioning it with a circumscribed rectangular frame. Target detection plays an important role in many applications such as target tracking and pose detection.
Generally, target detection can be divided into conventional detection methods and learning-based detection methods. A conventional detection method usually includes three steps: traversing candidate regions with sliding windows of different sizes, extracting the relevant visual features of the candidate regions with the Histogram of Oriented Gradients (HOG) and the Scale-Invariant Feature Transform (SIFT), and classifying the features with a trained classifier. Although such methods can achieve good results, the sliding-window region selection has no pertinence to the object to be detected, so the time complexity is high and the windows are redundant; in addition, the classification effect varies greatly under different conditions and the robustness is not strong. Learning-based methods were subsequently applied widely in the field of target detection; a deep learning method can fully extract the features in the training samples, thereby obtaining more accurate classification and a certain increase in detection speed.
In recent years, methods based on deep convolutional neural networks (CNNs) have improved significantly over traditional target detection algorithms. The deep convolutional network LeNet-5 introduced two layers of CNNs to realize target detection. Thereafter, as deep learning progressed further, the accuracy of target detection improved continuously, and target detection algorithms based on classification (two-stage) as well as algorithms converting target detection into a regression problem (single-stage) were developed. Aiming at the high parameter count and training cost of two-stage target detection algorithms, the You Only Look Once (YOLO) method was born, which divides the picture into grids, where each grid only detects targets whose centers fall in that grid and predicts two bounding boxes and the category information, so that the bounding boxes, target confidence and category probability of all regions are predicted at one time. Subsequently, among target detection methods based on the regression problem, a more intuitive method (Objects as Points, CenterNet) was developed, which directly detects the central point and size of the target and discards the prediction frame, further improving the speed and precision of target detection.
Although the target detection method without prediction frames achieves a satisfactory effect, it does not take into account the change of the aspect ratio of the target and the uneven distribution of targets of different scales when constructing the heatmap, and it does not mine the effective information of targets that are far apart within a class but semantically related. Therefore, it is very important to construct a method that pays attention to the aspect ratio and distribution of the target and can mine more effective information.
Disclosure of Invention
In view of the above drawbacks and needs for improvement in the prior art, the present invention provides a target detection method that addresses the limitations of current regression-based target detection.
An object detection method comprising the steps of:
s1, extracting image features to generate a feature map;
s2, up-sampling the extracted feature map to obtain an enlarged feature map that retains the original feature information;
s3, connecting the amplified feature map to a category prediction head, a width and height prediction head and a central point offset prediction head;
s4, adding a category attention network into the category prediction header, wherein the category attention network is used for mining effective information between distant and semantically related targets in and among the categories;
s5, in the training stage, generating supervision information by encoding the real target frame, thereby supervising the training process of each prediction head;
and S6, outputting classification information, regression frame width and height information and central point position information of the image to be detected respectively by the trained class prediction head, width and height prediction head and central point offset prediction head, framing the identification object in the image to be detected according to the output result and marking the classification result.
Further, the features of the image are extracted by utilizing a residual error network or a deep feature fusion network, and a feature map is generated.
Further, the upsampling module consists of an alternation of a deformable convolution and a transposed convolution.
Further, the mechanism of the class attention network is represented as: I_E = H_E(I_Dk, I_Sk); wherein I_E represents the effective information between targets, H_E represents the operation for mining the effective information, I_Dk represents the distance information in case k, and I_Sk represents the semantic information in case k, where k is divided into an intra-class case and an inter-class case.
Further, the category attention network includes an inter-class associative attention group and an intra-class associative attention group; the inter-class associated attention group comprises a plurality of class attention blocks and a class excitation block, and then inter-class information output by the inter-class associated attention group is superimposed on the amplified characteristic diagram element by element through broadcasting to form an intra-class associated attention group, so that the class attention of the class prediction head is realized.
Further, the category attention workflow of the category attention network comprises the following steps:
s41, for the enlarged feature map F_PI with scale C × H × W, extracting features and reducing the size to obtain the inter-class information, and multiplying the inter-class information onto the enlarged feature map F_PI by matrix multiplication to obtain a new inter-class information feature map; the inter-class information feature map is represented as follows:
F_WI = H_mul(Zip(Conv(F_PI)), F_PI)
wherein F_WI represents the inter-class information feature map, H_mul represents a matrix pixel-by-pixel multiplication operation, Zip represents an information reduction operation, and Conv represents a convolution operation;
s42, for the new inter-class information feature map F_WI, extracting features, passing the extraction result through a linear rectification function, and extracting features again to obtain the intra-class information, and superimposing the intra-class information onto the enlarged feature map F_PI by broadcast element-by-element addition to obtain a category attention feature map; the category attention feature map is represented as follows:
F_CA = H_add(Conv(Lin(Conv(F_WI))), F_PI)
wherein F_CA is the category attention feature map, H_add represents broadcast element-by-element addition, and Lin represents the linear rectification operation.
Further, the central point offset prediction head is used for outputting the central point offset of a central point positioning network, and the central point positioning network comprises a cross entropy loss group and a central point offset loss group; the central point offset prediction head corrects the offset of the target central point by a center offset loss, expressed as follows:
L_offset = (1/N) Σ_i |Ô_i − O_i|
wherein L_offset represents the center offset loss, N represents the batch size, Ô_i represents the predicted center coordinate, and O_i represents the true center coordinate.
Further, the width and height prediction head realizes width and height prediction by constructing a scale self-adaptive network; the scale self-adaptive network is determined by a two-dimensional Gaussian kernel and the real width-to-height ratio of the target, the variance of the two-dimensional Gaussian kernel is determined by the intersection ratio and the width and height of the target frame, and the intersection ratio is determined according to the set upper limit and lower limit and the area of the real target frame, so that the scale self-adaptation of the width and height prediction head is realized.
Further, connecting the enlarged feature map to the category prediction head, the width and height prediction head and the central point offset prediction head compiles three feature maps: a class heatmap of size N × C × (H/r) × (W/r), a scale width and height map of size N × 2 × (H/r) × (W/r), and a central point offset map of size N × 2 × (H/r) × (W/r), wherein N represents the batch size, r represents the output step length, C represents the number of target classes, and H and W represent the height and width of the image respectively;
for each real target box b_t of class c, the equivalent value of its central point p after r-times down-sampling is calculated as p̃ = ⌊p/r⌋; all targets are coded into a heatmap H by means of Gaussian kernels, and a specific channel is occupied by a specific category; when the central points of two or more targets coincide, the target with the largest target-frame area is adopted as the representative; the value H_xyc of the corresponding position is confirmed by a 2D Gaussian kernel:
H_xyc = exp(−(x − p̃_x)² / (2σ_x²) − (y − p̃_y)² / (2σ_y²))
wherein σ_x is a parameter related to IoU and the width of the target box and equals 1/3 of the calculated transverse axis of the ellipse; σ_y is a parameter related to IoU and the height of the target box and equals 1/3 of the calculated longitudinal axis of the ellipse; the Gaussian kernel thus forms an ellipse;
the calculation formulas relating σ_x and σ_y to IoU and to the height and width of the target box are derived as follows: IoU is first expressed in terms of a, b and r, wherein a is half of the transverse axis of the Gaussian kernel, b is half of the longitudinal axis of the Gaussian kernel, and r is the distance from the intersection point of the rectangle's diagonal with the outer contour of the Gaussian kernel to the center of the rectangle; substituting this intersection point into the ellipse formula x²/a² + y²/b² = 1 then yields the Gaussian kernel parameters a and b as functions of IoU and of the width and height of the target box (the intermediate equations are rendered as images in the original publication);
Further, the size of IoU is adaptively adjusted according to the area of the target frame:
IoU = α if area < a_S; IoU = β if area > a_L; otherwise IoU is set to an adaptive value between α and β
wherein [α, β] is the set IoU value range, area is the area of the target box, a_S is the area threshold of the small target frame, and a_L is the area threshold of the large target frame; IoU of target boxes with area smaller than a_S is uniformly set to α, IoU of target boxes with area larger than a_L is uniformly set to β, and IoU of target boxes with area in [a_S, a_L] is set to the adaptive value;
a central point offset map is added; at the coordinate p̃ of each real target box b_t in the offset map, the floating-point value p/r − p̃ of its central point lost by down-sampling is filled in, so that the loss of central point positioning accuracy caused by down-sampling is recovered, and all classes share the same offset map.
The invention has the beneficial effects that:
according to the method, the category attention for further judging the target category and the scale self-adaptive coding for frame regression are combined, so that the network can associate the characteristics in the category and among the categories, and can obtain a more accurate target frame while mining the effective information between targets which are far away from each other in the category and among the categories and are semantically related; and more accurate framing is performed according to the scale transformation of the detected target, so that the accuracy of target detection and the framing precision are improved.
Drawings
Fig. 1 is a schematic flow chart of a target detection method according to an embodiment of the present invention;
fig. 2 is a network structure diagram of a target detection method according to an embodiment of the present invention;
FIG. 3 is a graph comparing test results of the target detection method of the present invention with other algorithms;
fig. 4 is a schematic diagram illustrating the effect of detecting 2 image targets according to the embodiment of the present invention.
Detailed Description
The present invention will be described in detail with reference to specific examples. The following examples will assist those skilled in the art in further understanding the invention, but are not intended to limit the invention in any way. It should be noted that various changes and modifications can be made by those skilled in the art without departing from the spirit of the invention, all of which fall within the scope of the present invention.
The invention discloses a target detection method (DASCAN), which improves the conventional key point detection scheme aiming at the requirement of multi-path real-time accurate inference in actual projects, improves the detection precision of a model and better meets the real-time requirement of a real scene; the invention provides a scale self-adaptive coding module, optimizes a target frame to obtain an accurate frame selection result, and provides a category attention module, so that the similar objects are accurately distinguished. The invention can realize multi-path real-time accurate multi-target detection and detect the object type and position in a complex scene.
A target detection method according to an embodiment of the present invention, as shown in fig. 1 and 2, includes the following steps:
and S1, extracting image features to generate a feature map.
In the embodiment of the present invention, a feature map is generated by extracting image features in an original image or video using a Deep residual network (ResNet) or a Deep Layer feature Aggregation (DLA).
And S2, the extracted feature map is up-sampled to obtain an enlarged feature map that retains the original feature information.
An up-sampling module consisting of alternating 3×3 deformable convolutions and transposed convolutions is constructed, and up-sampling is performed with this module to obtain the enlarged feature map that retains the effective information. The feature map retaining the original feature information is expressed as follows:
F_PI = H_IM(H_US(F_Ori))
wherein F_PI represents the enlarged feature map with retained information, H_IM represents the mapping operation that retains the feature information, H_US represents the image enlarging operation, and F_Ori represents the feature map generated in S1, i.e., the feature image obtained through the backbone network.
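For illustration only, a minimal PyTorch-style sketch of one such up-sampling stage is given below; the module structure, the channel numbers and the offset-predicting convolution are assumptions for illustration, not the disclosed implementation.

```python
import torch
import torch.nn as nn
from torchvision.ops import DeformConv2d

class UpsampleStage(nn.Module):
    """One stage of the up-sampling module: a 3x3 deformable convolution followed by a
    stride-2 transposed convolution that doubles the spatial resolution."""
    def __init__(self, in_ch: int, out_ch: int):
        super().__init__()
        # The deformable convolution needs per-position offsets: 2 * 3 * 3 = 18 channels.
        self.offset = nn.Conv2d(in_ch, 18, kernel_size=3, padding=1)
        self.dcn = DeformConv2d(in_ch, out_ch, kernel_size=3, padding=1)
        self.up = nn.ConvTranspose2d(out_ch, out_ch, kernel_size=4, stride=2, padding=1)
        self.relu = nn.ReLU(inplace=True)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        x = self.relu(self.dcn(x, self.offset(x)))  # remap features at the same resolution
        return self.relu(self.up(x))                # double the resolution

# e.g. three stages bring a ResNet-18 output (stride 32) up to stride 4
upsampler = nn.Sequential(UpsampleStage(512, 256), UpsampleStage(256, 128), UpsampleStage(128, 64))
print(upsampler(torch.randn(1, 512, 16, 16)).shape)  # torch.Size([1, 64, 128, 128])
```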
And S3, connecting the amplified feature map to the category prediction head, the width and height prediction head and the central point offset prediction head, and enhancing the information acquisition capability of the features in different fields.
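As a sketch only, the three prediction heads attached to the enlarged feature map can be pictured as lightweight convolutional branches; the 3×3 conv + ReLU + 1×1 conv structure and the channel counts below are illustrative assumptions rather than the disclosed architecture.

```python
import torch
import torch.nn as nn

def make_head(in_ch: int, out_ch: int, mid_ch: int = 64) -> nn.Sequential:
    """A lightweight prediction head: 3x3 conv, ReLU, then a 1x1 conv to the output channels."""
    return nn.Sequential(
        nn.Conv2d(in_ch, mid_ch, 3, padding=1),
        nn.ReLU(inplace=True),
        nn.Conv2d(mid_ch, out_ch, 1),
    )

num_classes = 80                                  # e.g. MS COCO
f_pi = torch.randn(1, 64, 128, 128)               # enlarged feature map F_PI
class_heatmap = torch.sigmoid(make_head(64, num_classes)(f_pi))  # category prediction head
wh_pred = make_head(64, 2)(f_pi)                  # width and height prediction head
offset_pred = make_head(64, 2)(f_pi)              # central point offset prediction head
```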
In the training stage, the classification prediction head is used to confirm the existence of a target and to determine its category through the channel ID, and a class attention module is added to the classification prediction head to mine effective information between targets that are far apart but semantically related within and between classes. The mechanism of the class attention network is represented as I_E = H_E(I_Dk, I_Sk), wherein I_E represents the effective information between targets, H_E represents the operation for mining the effective information, I_Dk represents the distance information in case k, and I_Sk represents the semantic information in case k; the case k is divided into w (the intra-class case) and b (the inter-class case).
A center offset positioning module is constructed, which is used for constructing the target central point of the central point positioning network. The enlarged feature map F_PI is connected to the center offset prediction head, which outputs the central point offset of the positioning network; the positioning network comprises an improved cross entropy loss group and a central point offset loss group, and these loss groups jointly form the central point positioning network. The offset of the central point is corrected by a center offset loss, expressed as follows:
L_offset = (1/N) Σ_i |Ô_i − O_i|
wherein L_offset represents the center offset loss, N represents the batch size, Ô_i represents the predicted center coordinate, and O_i represents the true center coordinate.
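A minimal sketch of such an L1-style center offset loss, assuming the predicted and true center offsets have already been gathered at the encoded center positions:

```python
import torch

def center_offset_loss(pred_offset: torch.Tensor, true_offset: torch.Tensor) -> torch.Tensor:
    """L_offset = (1/N) * sum |predicted center offset - true center offset|, N = batch size."""
    n = pred_offset.shape[0]
    return (pred_offset - true_offset).abs().sum() / n

pred = torch.tensor([[0.20, 0.70], [0.10, 0.40]])
true = torch.tensor([[0.25, 0.65], [0.05, 0.45]])
print(center_offset_loss(pred, true))  # tensor(0.1000)
```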
Constructing a frame width and height prediction module for constructing a scale-adaptive width and height predictor and amplifying a feature map FPIAnd connecting to a width and height prediction head, and inputting a scale self-adaptive network to obtain a width and height regression quantity. The scale self-adaptive network is determined by a two-dimensional Gaussian kernel and the real aspect ratio of the target, and the variance of the two-dimensional Gaussian kernel is determined by the intersection ratio and the aspect ratio of the target frame. And the intersection ratio is determined according to the set upper limit and the set lower limit and the area of the real target frame, so that the scale self-adaptation of the width and height prediction head is realized.
And S4, adding a category attention network into the category prediction header, wherein the category attention network is used for mining effective information between distant and semantically related targets in and among the categories and reinforcing network classification capability.
And constructing a Class Attention Module (CAM), connecting the amplified feature map to a classification prediction head, inputting the amplified feature map to the class attention module to obtain the object class, wherein the class attention network comprises an inter-class associated attention group and an intra-class associated attention group. The inter-class attention group comprises a plurality of class attention blocks and a class excitation block, and then the inter-class attention group is superimposed on the original characteristic diagram element by element through broadcasting to form an intra-class attention group, so that the class attention of the class prediction head is realized.
In the embodiment of the present invention, the category attention workflow in the category attention module is divided into the following steps:
For the enlarged feature map F_PI with scale C × H × W, features are extracted and the size is reduced to obtain the inter-class information, which is then multiplied onto F_PI by matrix multiplication to obtain a new inter-class information feature map. The inter-class information feature map is represented as follows:
F_WI = H_mul(Zip(Conv(F_PI)), F_PI)
wherein F_WI represents the inter-class information feature map, H_mul represents a matrix pixel-by-pixel multiplication, Zip represents an information reduction operation, and Conv represents a 1 × 1 convolution operation.
For the new feature map F_WI, features are extracted, passed through a linear rectification function, and extracted again to obtain the intra-class information, which is superimposed onto F_PI by broadcast element-by-element addition to obtain the category attention feature map, represented as follows:
F_CA = H_add(Conv(Lin(Conv(F_WI))), F_PI)
wherein F_CA is the above category attention feature map, H_add represents broadcast element-by-element addition, and Lin represents the linear rectification operation.
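A possible PyTorch sketch of this category attention workflow is shown below; the interpretation of Zip as global average pooling, the sigmoid gating and the channel-reduction ratio are assumptions made only for illustration.

```python
import torch
import torch.nn as nn

class ClassAttentionModule(nn.Module):
    """Sketch of the category attention workflow: an inter-class weighting of F_PI followed by
    an intra-class refinement that is added back onto F_PI element by element."""
    def __init__(self, channels: int, reduction: int = 4):
        super().__init__()
        self.conv_in = nn.Conv2d(channels, channels, 1)   # Conv(F_PI)
        self.zip = nn.AdaptiveAvgPool2d(1)                # "Zip": information reduction (assumed)
        self.refine = nn.Sequential(                      # Conv -> Lin (ReLU) -> Conv
            nn.Conv2d(channels, channels // reduction, 1),
            nn.ReLU(inplace=True),
            nn.Conv2d(channels // reduction, channels, 1),
        )

    def forward(self, f_pi: torch.Tensor) -> torch.Tensor:
        inter = torch.sigmoid(self.zip(self.conv_in(f_pi)))  # inter-class weights, (N, C, 1, 1)
        f_wi = inter * f_pi                                   # H_mul: broadcast pixel-wise multiply
        return self.refine(f_wi) + f_pi                       # H_add: broadcast element-wise add

cam = ClassAttentionModule(64)
print(cam(torch.randn(1, 64, 128, 128)).shape)  # torch.Size([1, 64, 128, 128])
```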
And S5, in the training stage, supervision information is generated by encoding the real target frame, thereby supervising the training process of each prediction head and improving the precision and accuracy of each prediction head.
In the training phase, the width and height prediction head predicts the width and height of the target box represented by the central point, and the central point offset prediction head predicts the precision value of the target central point lost in the encoding process of the scale-adaptive coding module. The supervision information used to train the classification prediction head, the width and height prediction head and the central point offset prediction head is obtained by encoding the real target frame with the scale-adaptive coding module. The constraint of the scale-adaptive supervision information on each prediction head is encoded as R_pre = H_adapt(I_bbox), wherein R_pre represents the encoding result for each prediction head, H_adapt represents the scale-adaptive information encoding operation, and I_bbox represents the information of the real target box.
And S6, in the inference stage, the trained class prediction head, the width and height prediction head and the central point offset prediction head respectively output the classification information, the regression frame width and height information and the central point position information of the image to be detected, and then the recognition object is framed in the image to be detected according to the output prediction result and the classification result is marked.
In this example, the data input to the scale-adaptive coding module compiles three feature maps: a class heatmap of size N × C × (H/r) × (W/r), a scale width and height map of size N × 2 × (H/r) × (W/r), and a central point offset map of size N × 2 × (H/r) × (W/r), where N represents the batch size (batch-size), r represents the output step length, C represents the number of target classes, and H and W represent the height and width of the image respectively.
For each real target box b_t of class c, the equivalent value of its central point p after r-times down-sampling is calculated as p̃ = ⌊p/r⌋. All targets are encoded into the heatmap H by means of Gaussian kernels, and a specific class occupies a specific channel. When the central points of two or more targets coincide, the target with the largest target-frame area is taken as the representative. The value H_xyc of the corresponding position is confirmed by a 2D Gaussian kernel:
H_xyc = exp(−(x − p̃_x)² / (2σ_x²) − (y − p̃_y)² / (2σ_y²))
where σ_x is a parameter related to IoU and the width of the target box and equals 1/3 of the calculated transverse axis of the ellipse, and σ_y is a parameter related to IoU and the height of the target box and equals 1/3 of the calculated longitudinal axis of the ellipse; the Gaussian kernel thus forms an ellipse.
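A small NumPy sketch of writing one target center into a class heatmap channel with an elliptical 2D Gaussian kernel; the max-merging of overlapping kernels and the example σ values are illustrative assumptions.

```python
import numpy as np

def draw_elliptical_gaussian(heatmap: np.ndarray, center: tuple, sigma_x: float, sigma_y: float) -> None:
    """Fill one heatmap channel with exp(-(x-cx)^2/(2*sx^2) - (y-cy)^2/(2*sy^2)) around the
    down-sampled target center; overlapping kernels are merged by taking the maximum."""
    h, w = heatmap.shape
    cx, cy = center
    ys, xs = np.ogrid[:h, :w]
    g = np.exp(-((xs - cx) ** 2) / (2 * sigma_x ** 2) - ((ys - cy) ** 2) / (2 * sigma_y ** 2))
    np.maximum(heatmap, g, out=heatmap)

hm = np.zeros((128, 128), dtype=np.float32)        # one channel of the class heatmap
draw_elliptical_gaussian(hm, center=(16, 20), sigma_x=5.0, sigma_y=3.0)
print(hm[20, 16])  # 1.0 at the encoded center
```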
The calculation formulas relating σ_x and σ_y to IoU and to the height and width of the target box are derived as follows: IoU is first expressed in terms of a, b and r, where a is half of the transverse axis of the Gaussian kernel, b is half of the longitudinal axis of the Gaussian kernel, and r is the distance from the intersection point of the rectangle's diagonal with the outer contour of the Gaussian kernel to the center of the rectangle; substituting this intersection point into the ellipse formula x²/a² + y²/b² = 1 then yields the calculation method of the Gaussian kernel parameters a and b related to IoU and the width and height of the target frame (the intermediate equations are rendered as images in the original publication).
In order to further adapt to target frames of different scales, IoU is adaptively adjusted according to the area of the target frame:
IoU = α if area < a_S; IoU = β if area > a_L; otherwise IoU is set to an adaptive value between α and β
where [α, β] is the set IoU value range, area is the area of the target box, a_S is the area threshold of the small target frame, and a_L is the area threshold of the large target frame. IoU of target boxes with area smaller than a_S is uniformly set to α, IoU of target boxes with area larger than a_L is uniformly set to β, and IoU of target boxes with area in [a_S, a_L] is set to the adaptive value.
In order to further predict the accurate position of the scale central point in the input image, a central point offset map is added. At the coordinate p̃ of each real target box b_t in the offset map, the floating-point value p/r − p̃ of its central point lost by down-sampling is filled in, which is used to recover the loss of central point positioning accuracy caused by down-sampling, and all classes share the same offset map.
The width and height of the target frame t, whose class is c_t, are taken from the real target box: at the coordinate p̃ in the scale width and height map, the width and height of the real target box b_t are filled in respectively, and the scale is not normalized. To reduce the amount of computation, a single width and height map is used to predict all classes.
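A sketch, under assumed box and map conventions, of filling the width and height map and the central point offset map at the down-sampled centers of the real target boxes:

```python
import numpy as np

def encode_targets(boxes, out_h: int, out_w: int, r: int = 4):
    """Fill 2-channel width/height and center-offset maps; boxes are assumed to be
    (x1, y1, x2, y2) in input-image pixels."""
    wh_map = np.zeros((2, out_h, out_w), dtype=np.float32)
    offset_map = np.zeros((2, out_h, out_w), dtype=np.float32)
    for x1, y1, x2, y2 in boxes:
        cx, cy = (x1 + x2) / 2 / r, (y1 + y2) / 2 / r  # center after r-times down-sampling
        ix, iy = int(cx), int(cy)                      # integer position (floor of p / r)
        wh_map[:, iy, ix] = (x2 - x1, y2 - y1)         # un-normalized width and height
        offset_map[:, iy, ix] = (cx - ix, cy - iy)     # floating-point part lost by down-sampling
    return wh_map, offset_map

wh, off = encode_targets([(32, 48, 96, 112)], out_h=128, out_w=128)
print(wh[:, 20, 16], off[:, 20, 16])  # [64. 64.] [0. 0.]
```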
In the inference stage, the target frame is drawn on the picture according to the classification information, the regression frame width and height information and the central point position information.
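For illustration, a decoding sketch consistent with this inference step; the 3×3 max-pool peak picking and the top-k selection are common practice for this family of detectors and are assumptions here, not quoted from the disclosure.

```python
import torch
import torch.nn.functional as F

def decode(heatmap: torch.Tensor, wh: torch.Tensor, offset: torch.Tensor, k: int = 100, r: int = 4):
    """Turn the three prediction maps into boxes. Assumed shapes: heatmap (1, C, H, W),
    wh and offset (1, 2, H, W); wh is assumed to be in input-image pixels."""
    _, _, h, w = heatmap.shape
    peaks = heatmap * (F.max_pool2d(heatmap, 3, stride=1, padding=1) == heatmap).float()
    scores, inds = torch.topk(peaks.flatten(), k)
    cls = torch.div(inds, h * w, rounding_mode="floor")       # class channel of each peak
    pos = inds % (h * w)
    ys = torch.div(pos, w, rounding_mode="floor").float()
    xs = (pos % w).float()
    bw, bh = wh[0, 0].flatten()[pos], wh[0, 1].flatten()[pos]
    dx, dy = offset[0, 0].flatten()[pos], offset[0, 1].flatten()[pos]
    cx, cy = (xs + dx) * r, (ys + dy) * r                     # back to input-image coordinates
    boxes = torch.stack([cx - bw / 2, cy - bh / 2, cx + bw / 2, cy + bh / 2], dim=1)
    return boxes, scores, cls
```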
The invention also provides a target detection system based on the scale self-adaptive coding module and the category attention module, which comprises the following components:
the characteristic extraction module is used for grouping the input pictures to form a characteristic image;
the up-sampling module is used for specially encoding the characteristic image to form an amplified characteristic image with reserved information;
and the class attention module is used for constructing a class attention network as a classifier, connecting the amplified feature map to a classification prediction head and obtaining the object class through the class attention network. Wherein the class attention network comprises an inter-class associative attention group and an intra-class associative attention group. The inter-class attention group comprises a plurality of class attention blocks and a class excitation block, and then the inter-class attention group is superimposed on the original characteristic diagram element by element through broadcasting to form an intra-class attention group, so that the class attention of the class prediction head is realized.
And the center offset positioning module is used for constructing a target center point of the center point positioning network, connecting the amplified characteristic diagram to a center offset amount prediction head and correcting the offset of the center point through center offset amount loss. Wherein the positioning network comprises an improved cross entropy loss set and a center point offset loss set. The sets of losses collectively form a central point location network.
And the frame width and height prediction module is used for constructing a scale self-adaptive width and height predictor, connecting the enlarged feature map to a width and height prediction head, and inputting it into a scale self-adaptive network to obtain a width and height regression quantity. The scale self-adaptive network is determined by a two-dimensional Gaussian kernel and the real aspect ratio of the target, and the variance of the two-dimensional Gaussian kernel is determined by the intersection ratio and the width and height of the target frame. The intersection ratio is determined according to the set upper limit and lower limit and the area of the real target frame, so that the scale self-adaptation of the width and height prediction head is realized.
And the image detection result module is used for displaying the classification information of the category classification module, the center offset positioning module and the frame length and width prediction module and drawing a target frame.
The invention finally provides a test embodiment, using the MS COCO 2017 data set as the training set, validation set and test set, with 118000 images as the training data set, 5000 images as the validation data set, and 20000 images as the test data set. The target detection results were evaluated with three metrics, average precision (AP), AP50 and AP75, to examine the target detection performance of the present invention. ResNet-18 and DLA34 were respectively selected as the model backbones of the invention. All images were scaled to 512 × 512 while maintaining their aspect ratio, and 128 × 128 feature maps were generated using the scale-adaptive coding module. Random translation (translation range 128), random flipping, random color jittering and random lighting were used as data augmentation, and the overall objective was optimized with SGD. A learning rate (LR) of 0.02 and a batch size of 128 were used, the model was trained on the data set for 80 epochs, and the LR was reduced by a factor of 10 at epochs 50 and 72 respectively. All experiments, including the training tasks and speed tests, were completed with PyTorch on a machine equipped with NVIDIA Titan V GPUs. Table 1 shows the comparison results of adding the scale-adaptive coding module under the three evaluation metrics, Table 2 shows the comparison results of adding the category attention module, Table 3 shows the comparison of the present invention with current mainstream algorithms, Fig. 3 is the comparison of the method of the present invention with each algorithm in this example, and Fig. 4a and 4b show the effect of the present invention.
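The reported optimizer settings correspond to a standard SGD schedule; in the sketch below the momentum value and the stand-in model are assumptions, and the pass over the training data is elided.

```python
import torch
import torch.nn as nn

model = nn.Conv2d(3, 80, 3, padding=1)  # stand-in for the full detection network (assumption)
optimizer = torch.optim.SGD(model.parameters(), lr=0.02, momentum=0.9)
scheduler = torch.optim.lr_scheduler.MultiStepLR(optimizer, milestones=[50, 72], gamma=0.1)
for epoch in range(80):
    # ... one pass over the training set (forward, loss, backward) would go here ...
    optimizer.step()
    scheduler.step()
print(optimizer.param_groups[0]["lr"])  # ~0.0002 after both milestones
```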
TABLE 1 Comparison experiment for the scale-adaptive coding module (table rendered as an image in the original publication)
TABLE 2 Comparison experiment for the category attention module (table rendered as an image in the original publication)
TABLE 3 Comparison of results on the COCO test data set with SOTA networks (non-optimal results), where bold and italic bold represent the first and second highest values, respectively (table rendered as an image in the original publication)
The foregoing description of specific embodiments of the present invention has been presented. It is to be understood that the present invention is not limited to the specific embodiments described above, and that various changes or modifications may be made by one skilled in the art within the scope of the appended claims without departing from the spirit of the invention. The embodiments and features of the embodiments of the present application may be combined with each other arbitrarily without conflict.

Claims (10)

1. A method of target detection, comprising the steps of:
s1, extracting image features to generate a feature map;
s2, up-sampling the extracted feature map to obtain an enlarged feature map that retains the original feature information;
s3, connecting the amplified feature map to a category prediction head, a width and height prediction head and a central point offset prediction head;
s4, adding a category attention network into the category prediction header, wherein the category attention network is used for mining effective information between distant and semantically related targets in and among the categories;
s5, in the training stage, generating supervision information by encoding the real target frame, thereby supervising the training process of each prediction head;
and S6, outputting classification information, regression frame width and height information and central point position information of the image to be detected respectively by the trained class prediction head, width and height prediction head and central point offset prediction head, framing the identification object in the image to be detected according to the output result and marking the classification result.
2. The method of claim 1, wherein the feature map is generated by extracting features of the image using a residual network or a deep feature fusion network.
3. The object detection method of claim 1, wherein the upsampling module consists of an alternation of a deformable convolution and a transposed convolution.
4. The object detection method of claim 1, wherein the mechanism of the class attention network is represented as: I_E = H_E(I_Dk, I_Sk); wherein I_E represents the effective information between objects, H_E represents the operation for mining the effective information, I_Dk represents the distance information in case k, and I_Sk represents the semantic information in case k, the case k being divided into an intra-class case and an inter-class case.
5. The object detection method of claim 1, wherein the class attention network comprises an inter-class associative attention group and an intra-class associative attention group; the inter-class associated attention group comprises a plurality of class attention blocks and a class excitation block, and then inter-class information output by the inter-class associated attention group is superimposed on the amplified characteristic diagram element by element through broadcasting to form an intra-class associated attention group, so that the class attention of the class prediction head is realized.
6. The object detection method of claim 1, wherein the class attention workflow of the class attention network comprises the steps of:
s41, for the enlarged feature map F_PI with scale C × H × W, extracting features and reducing the size to obtain the inter-class information, and multiplying the inter-class information onto the enlarged feature map F_PI by matrix multiplication to obtain a new inter-class information feature map; the inter-class information feature map is represented as follows:
F_WI = H_mul(Zip(Conv(F_PI)), F_PI)
wherein F_WI represents the inter-class information feature map, H_mul represents a matrix pixel-by-pixel multiplication operation, Zip represents an information reduction operation, and Conv represents a convolution operation;
s42, for the new inter-class information feature map F_WI, extracting features, passing the extraction result through a linear rectification function, and extracting features again to obtain the intra-class information, and superimposing the intra-class information onto the enlarged feature map F_PI by broadcast element-by-element addition to obtain a category attention feature map; the category attention feature map is represented as follows:
F_CA = H_add(Conv(Lin(Conv(F_WI))), F_PI)
wherein F_CA is the category attention feature map, H_add represents broadcast element-by-element addition, and Lin represents the linear rectification operation.
7. The object detection method of claim 1, wherein the central point offset prediction head is configured to output the central point offset of a central point positioning network, the central point positioning network comprising a cross entropy loss group and a central point offset loss group; the central point offset prediction head corrects the offset of the target central point by a center offset loss, expressed as follows:
L_offset = (1/N) Σ_i |Ô_i − O_i|
wherein L_offset represents the center offset loss, N represents the batch size, Ô_i represents the predicted center coordinate, and O_i represents the true center coordinate.
8. The object detection method of claim 1, wherein the width and height prediction head realizes width and height prediction by constructing a scale self-adaptive network; the scale self-adaptive network is determined by a two-dimensional Gaussian kernel and the real width-to-height ratio of the target, the variance of the two-dimensional Gaussian kernel is determined by the intersection ratio and the width and height of the target frame, and the intersection ratio is determined according to the set upper limit and lower limit and the area of the real target frame, so that the scale self-adaptation of the width and height prediction head is realized.
9. The method of claim 1, wherein the connection of the enlarged feature map to the category prediction head, the width and height prediction head and the central point offset prediction head compiles three feature maps: a class heatmap of size N × C × (H/r) × (W/r), a scale width and height map of size N × 2 × (H/r) × (W/r), and a central point offset map of size N × 2 × (H/r) × (W/r), wherein N represents the batch size, r represents the output step length, C represents the number of target classes, and H and W represent the height and width of the image respectively;
for each real target box b_t of class c, the equivalent value of its central point p after r-times down-sampling is calculated as p̃ = ⌊p/r⌋; all targets are coded into a heatmap H by means of Gaussian kernels, and a specific channel is occupied by a specific category; when the central points of two or more targets coincide, the target with the largest target-frame area is adopted as the representative; the value H_xyc of the corresponding position is confirmed by a 2D Gaussian kernel:
H_xyc = exp(−(x − p̃_x)² / (2σ_x²) − (y − p̃_y)² / (2σ_y²))
wherein σ_x is a parameter related to IoU and the width of the target box and equals 1/3 of the calculated transverse axis of the ellipse; σ_y is a parameter related to IoU and the height of the target box and equals 1/3 of the calculated longitudinal axis of the ellipse; the Gaussian kernel forms an ellipse;
the calculation formulas relating σ_x and σ_y to IoU and to the height and width of the target box are derived as follows: IoU is first expressed in terms of a, b and r, wherein a is half of the transverse axis of the Gaussian kernel, b is half of the longitudinal axis of the Gaussian kernel, and r is the distance from the intersection point of the rectangle's diagonal with the outer contour of the Gaussian kernel to the center of the rectangle; substituting this intersection point into the ellipse formula x²/a² + y²/b² = 1 yields the method of computing the Gaussian kernel parameters a and b relating to IoU and the width and height of the target box (the intermediate equations are rendered as images in the original publication).
10. The object detection method of claim 9, wherein the size of IoU is adaptively adjusted according to the area of the target frame:
IoU = α if area < a_S; IoU = β if area > a_L; otherwise IoU is set to an adaptive value between α and β
wherein [α, β] is the set IoU value range, area is the area of the target box, a_S is the area threshold of the small target frame, and a_L is the area threshold of the large target frame; IoU of target boxes with area smaller than a_S is uniformly set to α, IoU of target boxes with area larger than a_L is uniformly set to β, and IoU of target boxes with area in [a_S, a_L] is set to the adaptive value;
a central point offset map is added; at the coordinate p̃ of each real target box b_t in the offset map, the floating-point value p/r − p̃ of its central point lost by down-sampling is filled in, so that the loss of central point positioning accuracy caused by down-sampling is recovered, and all classes share the same offset map.
CN202110898055.9A 2021-08-05 2021-08-05 Target detection method Active CN113657225B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202110898055.9A CN113657225B (en) 2021-08-05 2021-08-05 Target detection method

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202110898055.9A CN113657225B (en) 2021-08-05 2021-08-05 Target detection method

Publications (2)

Publication Number Publication Date
CN113657225A true CN113657225A (en) 2021-11-16
CN113657225B CN113657225B (en) 2023-09-26

Family

ID=78478514

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202110898055.9A Active CN113657225B (en) 2021-08-05 2021-08-05 Target detection method

Country Status (1)

Country Link
CN (1) CN113657225B (en)

Cited By (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN114972976A (en) * 2022-07-29 2022-08-30 之江实验室 Night target detection and training method and device based on frequency domain self-attention mechanism
CN115908790A (en) * 2022-12-28 2023-04-04 北京斯年智驾科技有限公司 Target detection center point offset detection method and device and electronic equipment

Citations (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN111191566A (en) * 2019-12-26 2020-05-22 西北工业大学 Optical remote sensing image multi-target detection method based on pixel classification
CN112036457A (en) * 2020-08-20 2020-12-04 腾讯科技(深圳)有限公司 Method and device for training target detection model and target detection method and device
CN112801146A (en) * 2021-01-13 2021-05-14 华中科技大学 Target detection method and system
US20210183072A1 (en) * 2019-12-16 2021-06-17 Nvidia Corporation Gaze determination machine learning system having adaptive weighting of inputs
CN112990102A (en) * 2021-04-16 2021-06-18 四川阿泰因机器人智能装备有限公司 Improved Centernet complex environment target detection method

Patent Citations (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20210183072A1 (en) * 2019-12-16 2021-06-17 Nvidia Corporation Gaze determination machine learning system having adaptive weighting of inputs
CN111191566A (en) * 2019-12-26 2020-05-22 西北工业大学 Optical remote sensing image multi-target detection method based on pixel classification
CN112036457A (en) * 2020-08-20 2020-12-04 腾讯科技(深圳)有限公司 Method and device for training target detection model and target detection method and device
CN112801146A (en) * 2021-01-13 2021-05-14 华中科技大学 Target detection method and system
CN112990102A (en) * 2021-04-16 2021-06-18 四川阿泰因机器人智能装备有限公司 Improved Centernet complex environment target detection method

Cited By (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN114972976A (en) * 2022-07-29 2022-08-30 之江实验室 Night target detection and training method and device based on frequency domain self-attention mechanism
CN114972976B (en) * 2022-07-29 2022-12-20 之江实验室 Night target detection and training method and device based on frequency domain self-attention mechanism
CN115908790A (en) * 2022-12-28 2023-04-04 北京斯年智驾科技有限公司 Target detection center point offset detection method and device and electronic equipment

Also Published As

Publication number Publication date
CN113657225B (en) 2023-09-26

Similar Documents

Publication Publication Date Title
CN109241913B (en) Ship detection method and system combining significance detection and deep learning
CN108961235B (en) Defective insulator identification method based on YOLOv3 network and particle filter algorithm
CN110738207B (en) Character detection method for fusing character area edge information in character image
CN110111366B (en) End-to-end optical flow estimation method based on multistage loss
CN109886121B (en) Human face key point positioning method for shielding robustness
CN110232350B (en) Real-time water surface multi-moving-object detection and tracking method based on online learning
CN110163207B (en) Ship target positioning method based on Mask-RCNN and storage device
US20210081695A1 (en) Image processing method, apparatus, electronic device and computer readable storage medium
CN113435240B (en) End-to-end form detection and structure identification method and system
CN114627052A (en) Infrared image air leakage and liquid leakage detection method and system based on deep learning
CN117253154B (en) Container weak and small serial number target detection and identification method based on deep learning
CN113609896A (en) Object-level remote sensing change detection method and system based on dual-correlation attention
US11615612B2 (en) Systems and methods for image feature extraction
CN114022408A (en) Remote sensing image cloud detection method based on multi-scale convolution neural network
CN113657225B (en) Target detection method
CN114724155A (en) Scene text detection method, system and equipment based on deep convolutional neural network
CN113191204B (en) Multi-scale blocking pedestrian detection method and system
CN112800955A (en) Remote sensing image rotating target detection method and system based on weighted bidirectional feature pyramid
CN111753682A (en) Hoisting area dynamic monitoring method based on target detection algorithm
CN114266794A (en) Pathological section image cancer region segmentation system based on full convolution neural network
CN111507337A (en) License plate recognition method based on hybrid neural network
CN113554679A (en) Anchor-frame-free target tracking algorithm for computer vision application
CN112419317A (en) Visual loopback detection method based on self-coding network
CN110634142B (en) Complex vehicle road image boundary optimization method
CN114299383A (en) Remote sensing image target detection method based on integration of density map and attention mechanism

Legal Events

Date Code Title Description
PB01 Publication
PB01 Publication
SE01 Entry into force of request for substantive examination
SE01 Entry into force of request for substantive examination
GR01 Patent grant
GR01 Patent grant