CN113657225A - Target detection method - Google Patents
- Publication number: CN113657225A (application number CN202110898055.9A)
- Authority
- CN
- China
- Prior art keywords
- class
- information
- target
- attention
- category
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Granted
Classifications
- G—PHYSICS › G06F—ELECTRIC DIGITAL DATA PROCESSING › G06F18/24—Classification techniques
- G—PHYSICS › G06N—COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS › G06N3/045—Combinations of networks
- G—PHYSICS › G06N—COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS › G06N3/08—Learning methods
Abstract
The invention provides a target detection method comprising the following steps: extracting image features to generate a feature map; upsampling the feature map to obtain an enlarged feature map; connecting the enlarged feature map to a class prediction head, a width-height prediction head and a center-point offset prediction head; adding a class attention network to the class prediction head to mine effective information between targets that are far apart within a class yet semantically related; supervising the training of each prediction head with supervision information generated by encoding the real target boxes; and framing the recognized objects in the image to be detected from the outputs of the prediction heads and marking the classification results. By combining class attention, which refines the judgment of the target class, with scale-adaptive encoding for box regression, the network can associate intra-class and inter-class features; while mining the effective information between targets that are far apart and semantically related, it performs more accurate box selection according to the scale variation of the detected target, improving both detection accuracy and box-selection precision.
Description
Technical Field
The invention belongs to the field of computer vision target detection, and particularly relates to a target detection method.
Background
Object detection is a common problem in the field of machine vision. Based on geometric, statistical and other features of the detected object, it combines object segmentation and recognition to obtain an accurate detection result. Target detection couples target localization with target classification, locating objects of interest in an image or video using knowledge from image processing, machine learning and related fields. The classification part judges whether the input image contains objects of a given class; the localization part represents the position of the target object, marking and locating it with a circumscribed rectangular box. Target detection plays an important role in many applications such as target tracking and pose estimation.
Generally, target detection methods can be divided into traditional detection methods and learning-based detection methods. A traditional method usually includes three steps: traversing candidate regions with sliding windows of different sizes, extracting visual features of each candidate region with descriptors such as the Histogram of Oriented Gradients (HOG) and the Scale-Invariant Feature Transform (SIFT), and classifying the features with a trained classifier. Although such methods can work well, sliding-window region selection is not tailored to the object to be detected, so the time complexity is high and the windows are redundant; the classification quality also varies considerably across conditions, and the robustness is weak. Learning-based methods have since been widely applied to target detection: deep learning can fully extract the features in the training samples, yielding more accurate classification along with a certain gain in detection speed.
In recent years, methods based on deep convolutional neural networks (CNNs) have improved significantly over traditional target detection algorithms. LeNet-5, an early deep convolutional network, introduced stacked CNN layers that were later adapted to target detection. As deep learning progressed further, detection accuracy kept improving, and two families of algorithms developed: two-stage algorithms based on classifying region proposals, and single-stage algorithms that convert target detection into a regression problem. To address the large parameter count and training cost of two-stage detectors, YOLO (You Only Look Once) was introduced: it divides the picture into a grid, each cell detecting only targets whose center falls inside it and predicting two bounding boxes plus class information, so that the bounding boxes, target confidences and class probabilities of all regions are predicted in a single pass. Regression-based detection then produced the more direct CenterNet (Objects as Points), which detects the target's center point and size directly and discards anchor boxes, further improving both the speed and the accuracy of target detection.
Although such anchor-free detection methods perform satisfactorily, they do not account for variation in the target's aspect ratio or for the uneven distribution of targets of different scales when constructing the heatmap, nor do they mine the effective information of targets that are far apart within a class yet semantically related. It is therefore important to construct a method that attends to the aspect ratio and distribution of targets and can mine more effective information.
Disclosure of Invention
In view of the above drawbacks or needs for improvement in the prior art, the present invention provides a target detection method that addresses the limitations of current regression-based target detection.
An object detection method, comprising the steps of:
S1, extracting image features to generate a feature map;
S2, upsampling the extracted feature map to obtain an enlarged feature map that retains the original feature information;
S3, connecting the enlarged feature map to a class prediction head, a width-height prediction head and a center-point offset prediction head;
S4, adding a class attention network to the class prediction head, the class attention network being used to mine effective information between distant yet semantically related targets within and between classes;
S5, in the training stage, generating supervision information by encoding the real target boxes, thereby supervising the training process of each prediction head; and
S6, the trained class prediction head, width-height prediction head and center-point offset prediction head respectively outputting the classification information, regression-box width-height information and center-point position information of the image to be detected, framing the recognized objects in the image according to these outputs and marking the classification results.
Further, the features of the image are extracted with a residual network or a deep-layer feature aggregation network to generate the feature map.
Further, the upsampling module consists of alternating deformable convolutions and transposed convolutions.
Further, the mechanism of the class attention network is expressed as I_E = H_E(I_Dk, I_Sk), where I_E represents the effective information between targets, H_E denotes the operation of mining effective information, I_Dk denotes the distance information in case k, and I_Sk denotes the semantic information in case k; the case k is divided into the intra-class case and the inter-class case.
Further, the class attention network includes an inter-class association attention group and an intra-class association attention group; the inter-class association attention group comprises several class attention blocks and a class excitation block, and the inter-class information it outputs is superimposed element by element onto the enlarged feature map by broadcasting to form the intra-class association attention group, realizing the class attention of the class prediction head.
Further, the class attention workflow of the class attention network comprises the following steps:
S41, extracting features from the enlarged feature map F_PI of scale C×H×W and reducing their size to obtain the inter-class information, then multiplying the inter-class information onto the enlarged feature map F_PI by matrix multiplication to obtain a new inter-class information feature map, expressed as:
F_WI = H_mul(Zip(Conv(F_PI)), F_PI)
where F_WI is the inter-class information feature map, H_mul denotes pixel-by-pixel matrix multiplication, Zip denotes the information reduction operation, and Conv denotes a convolution operation;
S42, extracting features from the new inter-class information feature map F_WI, passing the result through a linear rectification function, and extracting features again to obtain the intra-class information, then superimposing the intra-class information onto the enlarged feature map F_PI by broadcast element-by-element addition to obtain the class attention feature map, expressed as:
F_CA = H_add(Conv(Lin(Conv(F_WI))), F_PI)
where F_CA is the class attention feature map, H_add denotes broadcast element-by-element addition, and Lin denotes the linear rectification operation.
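The S41-S42 workflow above can be sketched as a small PyTorch module. The layer widths, the use of global average pooling for the Zip reduction, and the 1×1 convolutions are assumptions; the patent fixes only the order of operations:

```python
import torch
import torch.nn as nn

class ClassAttention(nn.Module):
    """Sketch of the class attention workflow (S41-S42).

    Assumptions: Zip is modeled as global average pooling, Conv as 1x1
    convolutions, and H_mul as a broadcast pixel-by-pixel multiply.
    """
    def __init__(self, channels):
        super().__init__()
        # S41: Conv + Zip reduce the enlarged feature map F_PI to a
        # per-channel inter-class descriptor.
        self.conv_inter = nn.Conv2d(channels, channels, 1)
        self.zip = nn.AdaptiveAvgPool2d(1)
        # S42: Conv -> ReLU (Lin) -> Conv extracts intra-class information.
        self.conv_intra = nn.Sequential(
            nn.Conv2d(channels, channels, 1),
            nn.ReLU(inplace=True),
            nn.Conv2d(channels, channels, 1),
        )

    def forward(self, f_pi):
        # F_WI = H_mul(Zip(Conv(F_PI)), F_PI): weight each pixel of F_PI
        # by the reduced inter-class descriptor (broadcast multiply).
        f_wi = self.zip(self.conv_inter(f_pi)) * f_pi
        # F_CA = H_add(Conv(Lin(Conv(F_WI))), F_PI): broadcast
        # element-by-element addition back onto the input map.
        return self.conv_intra(f_wi) + f_pi
```

The residual-style addition in the last line keeps the enlarged feature map intact while layering the mined intra-class information on top of it.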
Further, the center-point offset prediction head outputs the center-point offset of the center-point localization network, the center-point localization network comprising a cross-entropy loss group and a center-point offset loss group; the center-point offset prediction head corrects the offset of the target center point through a center offset loss, expressed as:
L_offset = (1/N) Σ_i |Ô_i − O_i|
where L_offset denotes the center offset loss, N is the batch size, Ô_i is the predicted center coordinate, and O_i is the true center coordinate.
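As a sketch, the center offset loss described here reduces to a batch-averaged L1 distance between predicted and true center coordinates; the exact reduction is an assumption, since the patent's formula image is not reproduced in this text:

```python
import numpy as np

def center_offset_loss(pred_offsets, true_offsets):
    """Batch-averaged L1 center-offset loss (an assumption consistent
    with the description): L_offset = (1/N) * sum_i |O_hat_i - O_i|."""
    pred = np.asarray(pred_offsets, dtype=float)
    true = np.asarray(true_offsets, dtype=float)
    n = pred.shape[0]  # batch size N
    return np.abs(pred - true).sum() / n
```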
Further, the width-height prediction head realizes width-height prediction by constructing a scale-adaptive network; the scale-adaptive network is determined by a two-dimensional Gaussian kernel and the true aspect ratio of the target, the variance of the two-dimensional Gaussian kernel is determined by the intersection-over-union (IoU) and the width and height of the target box, and the IoU is determined from the set upper and lower bounds together with the area of the real target box, thereby realizing the scale adaptivity of the width-height prediction head.
Further, connecting the enlarged feature map to the class prediction head, width-height prediction head and center-point offset prediction head compiles three feature maps: a class heatmap Ĥ of size N×C×(H/r)×(W/r), a scale width-height map Ŝ of size N×2×(H/r)×(W/r), and a center-point offset map Ô of size N×2×(H/r)×(W/r), where N is the batch size, r is the output stride, C is the number of target classes, and H and W are the height and width of the image, respectively;
for each real target box b_t of class c, the center point p is downsampled by the factor r to the equivalent position p̃; all targets are encoded into the heatmap Ĥ by means of Gaussian kernels, a specific class occupying a specific channel; when the center points of two or more targets coincide, the target with the largest box area is taken as the representative; the value H_xyc at the corresponding position is given by the 2D Gaussian kernel:
H_xyc = exp(−((x − p̃_x)² / (2σ_x²) + (y − p̃_y)² / (2σ_y²)))
where σ_x, a parameter related to IoU and the width of the target box, is 1/3 of the computed transverse axis of the ellipse, and σ_y, a parameter related to IoU and the height of the target box, is 1/3 of the computed longitudinal axis; the Gaussian kernel thus forms an ellipse.
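A minimal sketch of encoding one target into a single heatmap channel with the elliptical 2D Gaussian kernel above; taking the maximum where kernels overlap follows the description's rule for coincident centers, while the function name is illustrative:

```python
import numpy as np

def draw_gaussian(heatmap, center, sigma_x, sigma_y):
    """Fill one target into a single-channel heatmap H_xyc with the
    2D Gaussian exp(-((x-px)^2/(2*sx^2) + (y-py)^2/(2*sy^2))).
    Overlapping values keep the maximum, so each target's own peak
    survives near its center."""
    h, w = heatmap.shape
    px, py = center
    ys, xs = np.mgrid[0:h, 0:w]
    g = np.exp(-(((xs - px) ** 2) / (2 * sigma_x ** 2)
                 + ((ys - py) ** 2) / (2 * sigma_y ** 2)))
    np.maximum(heatmap, g, out=heatmap)
    return heatmap
```

Because sigma_x and sigma_y are set independently from the box width and height, the level sets of the kernel are ellipses matching the target's aspect ratio rather than circles.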
The calculation formulas for σ_x and σ_y in terms of IoU and the height and width of the target box are derived as follows: IoU is first calculated from the Gaussian-kernel ellipse and the target box; then, since a is half of the transverse axis of the Gaussian kernel, b is half of the longitudinal axis, and r is the distance from the intersection point of the rectangle's diagonal with the outer ring of the Gaussian kernel to the center of the rectangle, the method of computing the Gaussian-kernel parameters a and b from IoU and the width and height of the target box is obtained.
further, the size of IoU is adaptively adjusted according to the size of the target frame area:
wherein [ alpha, beta ]]For the set IoU value range, area is the area of the target box, aSIs the area threshold of the small target frame, aLFor the area threshold of the large target frame, the area is smaller than aSIs uniformly set to a and the area is larger than aLIs uniformly set to be beta, area [ a ]S,aL]The target box IoU in between is set to the adaptation value;
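The area-adaptive IoU rule can be sketched as a piecewise function; the linear interpolation used for mid-sized boxes is an assumption, since the text only says such boxes receive "the adaptation value":

```python
def adaptive_iou(area, alpha, beta, a_s, a_l):
    """IoU target adapted to box area, per the description:
    boxes smaller than a_s get alpha, boxes larger than a_l get beta.
    For mid-sized boxes a linear interpolation is assumed."""
    if area <= a_s:
        return alpha
    if area >= a_l:
        return beta
    return alpha + (beta - alpha) * (area - a_s) / (a_l - a_s)
```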
adding a center point offset mapIn thatFilling real target frames b at the coordinate positions respectivelytLoss floating point value of center point ofThe loss of center point positioning accuracy due to downsampling is recovered and all classes share the same offset map.
The beneficial effects of the invention are as follows:
By combining the class attention used to further judge the target class with the scale-adaptive encoding used for box regression, the method lets the network associate intra-class and inter-class features and obtain a more accurate target box while mining the effective information between targets that are far apart within and between classes yet semantically related; more accurate framing is then performed according to the scale variation of the detected target, so that both the accuracy of target detection and the framing precision are improved.
Drawings
Fig. 1 is a schematic flow chart of a target detection method according to an embodiment of the present invention;
fig. 2 is a network structure diagram of a target detection method according to an embodiment of the present invention;
FIG. 3 is a graph comparing test results of the target detection method of the present invention with other algorithms;
fig. 4 is a schematic diagram illustrating the effect of detecting 2 image targets according to the embodiment of the present invention.
Detailed Description
The present invention will be described in detail with reference to specific embodiments. The following embodiments will help those skilled in the art further understand the invention, but do not limit it in any way. It should be noted that various changes and modifications obvious to those skilled in the art can be made without departing from the spirit of the invention, and all of them fall within the scope of the present invention.
The invention discloses a target detection method (DASCAN) that improves the conventional keypoint detection scheme for the multi-stream, real-time, accurate-inference requirements of practical projects, raising the model's detection accuracy while better meeting the real-time demands of real scenes. The invention provides a scale-adaptive encoding module, which optimizes the target box to obtain an accurate framing result, and a class attention module, which accurately distinguishes objects of similar classes. The invention can realize multi-stream, real-time, accurate multi-target detection, detecting object classes and positions in complex scenes.
A target detection method according to an embodiment of the present invention, as shown in fig. 1 and 2, includes the following steps:
and S1, extracting image features to generate a feature map.
In the embodiment of the present invention, image features are extracted from the original image or video with a deep residual network (ResNet) or a deep layer aggregation network (DLA) to generate the feature map.
And S2, upsampling the extracted feature map to obtain an enlarged feature map that retains the original feature information.
An upsampling module consisting of alternating 3×3 deformable convolutions and transposed convolutions is constructed, and upsampling with this module yields the enlarged feature map that retains the effective information. The feature map retaining the original feature information is expressed as:
F_PI = H_IM(H_US(F_Ori))
where F_PI is the enlarged feature map with retained information, H_IM denotes the mapping operation that retains the feature information, H_US denotes the image enlargement operation, and F_Ori is the feature map generated in S1, i.e., the feature image obtained through the backbone network.
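One stage of the upsampling module can be sketched as follows. A plain 3×3 convolution stands in for the deformable convolution (torchvision.ops.DeformConv2d would be the faithful choice, but it requires an explicit offset branch), and the channel counts are assumptions:

```python
import torch
import torch.nn as nn

class UpsampleBlock(nn.Module):
    """One alternation of the upsampling module: a 3x3 convolution
    (stand-in for the deformable convolution) followed by a stride-2
    transposed convolution that doubles spatial resolution."""
    def __init__(self, in_ch, out_ch):
        super().__init__()
        self.deform = nn.Conv2d(in_ch, out_ch, 3, padding=1)  # stand-in
        self.up = nn.ConvTranspose2d(out_ch, out_ch, 4, stride=2, padding=1)

    def forward(self, x):
        # Each stage maps the retained feature information and enlarges
        # the map: F_PI = H_IM(H_US(F_Ori)).
        return self.up(self.deform(x))
```

Stacking a few such blocks after the backbone recovers the resolution needed by the prediction heads while keeping the backbone's feature information.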
And S3, connecting the enlarged feature map to the class prediction head, the width-height prediction head and the center-point offset prediction head, enhancing the features' ability to acquire information in different domains.
In the training stage, the class prediction head confirms the existence of a target and identifies its class through the channel ID; a class attention module is added to the class prediction head to mine effective information between targets that are far apart but semantically related, within and between classes. The mechanism of the class attention network is expressed as I_E = H_E(I_Dk, I_Sk), where I_E represents the effective information between targets, H_E denotes the operation of mining effective information, I_Dk denotes the distance information in case k, and I_Sk denotes the semantic information in case k; the case k is divided into the intra-class (within, w) case and the inter-class (between, b) case.
A center offset localization module is constructed for locating the target center point of the center-point localization network. The enlarged feature map F_PI is connected to the center offset prediction head, which outputs the center-point offset of the localization network; the localization network comprises an improved cross-entropy loss group and a center-point offset loss group, which together form the center-point localization network. The offset of the center point is corrected by the center offset loss, expressed as:
L_offset = (1/N) Σ_i |Ô_i − O_i|
where L_offset denotes the center offset loss, N is the batch size, Ô_i is the predicted center coordinate, and O_i is the true center coordinate.
A box width-height prediction module is constructed to build the scale-adaptive width-height predictor: the enlarged feature map F_PI is connected to the width-height prediction head and fed into the scale-adaptive network to obtain the width-height regression quantities. The scale-adaptive network is determined by a two-dimensional Gaussian kernel and the true aspect ratio of the target, the variance of the two-dimensional Gaussian kernel is determined by the IoU and the width and height of the target box, and the IoU is determined from the set upper and lower bounds together with the area of the real target box, realizing the scale adaptivity of the width-height prediction head.
And S4, adding a class attention network to the class prediction head, the class attention network being used to mine effective information between distant yet semantically related targets within and between classes, strengthening the network's classification ability.
A Class Attention Module (CAM) is constructed: the enlarged feature map is connected to the class prediction head and input to the class attention module to obtain the object class. The class attention network comprises an inter-class association attention group and an intra-class association attention group. The inter-class attention group comprises several class attention blocks and a class excitation block; its output is then superimposed element by element onto the original feature map by broadcasting to form the intra-class attention group, realizing the class attention of the class prediction head.
In the embodiment of the present invention, the class attention workflow in the class attention module is divided into the following steps:
for the enlarged characteristic diagram F with the scale of C multiplied by H multiplied by WPIExtracting features, reducing the size to obtain information between classes, multiplying the information to F by matrix multiplicationPIAnd obtaining a new inter-class information characteristic diagram. The inter-class information characteristic diagram is represented as follows:
FWI=Hmul(Zip(Conv(FPI)),FPI)
wherein, FWIFeature graph representing information between classes, HmulRepresenting a matrix pixel-by-pixel multiplication, Zip representing an information reduction operation, and Conv representing a convolution operation of 1 x 1.
Features are extracted from the new feature map F_WI, passed through a linear rectification function, and extracted again to obtain the intra-class information, which is superimposed onto F_PI by broadcast element-by-element addition to obtain the class attention feature map, expressed as:
F_CA = H_add(Conv(Lin(Conv(F_WI))), F_PI)
where F_CA is the class attention feature map, H_add denotes broadcast element-by-element addition, and Lin denotes the linear rectification operation.
And S5, in the training stage, generating supervision information by encoding the real target boxes, thereby supervising the training process of each prediction head and improving its precision and accuracy.
In the training stage, the width-height prediction head predicts the width and height of the target box represented by the center point, and the center-point offset prediction head predicts the precision value of the target center point lost in the encoding process of the scale-adaptive encoding module. The supervision information used to train the class prediction head, the width-height prediction head and the center-point offset prediction head is obtained by encoding the real target boxes with the scale-adaptive encoding module. The constraint of the scale-adaptive supervision information on each prediction head is encoded as R_pre = H_adapt(I_bbox), where R_pre denotes the encoding result for each prediction head, H_adapt denotes the scale-adaptive information encoding operation, and I_bbox denotes the information of the real target boxes.
And S6, in the inference stage, the trained class prediction head, width-height prediction head and center-point offset prediction head respectively output the classification information, regression-box width-height information and center-point position information of the image to be detected; the recognized objects are then framed in the image according to these predictions and the classification results are marked.
In this embodiment, the data input to the scale-adaptive encoding module compiles three feature maps: a class heatmap Ĥ of size N×C×(H/r)×(W/r), a scale width-height map Ŝ of size N×2×(H/r)×(W/r), and a center-point offset map Ô of size N×2×(H/r)×(W/r), where N is the batch size, r is the output stride, C is the number of target classes, and H and W are the height and width of the image, respectively.
For each real target box b_t of class c, the center point p is downsampled by the factor r to the equivalent position p̃. All targets are encoded into the heatmap Ĥ by means of Gaussian kernels, a specific class occupying a specific channel. When the center points of two or more targets coincide, the target with the largest box area is taken as the representative. The value H_xyc at the corresponding position is given by the 2D Gaussian kernel:
H_xyc = exp(−((x − p̃_x)² / (2σ_x²) + (y − p̃_y)² / (2σ_y²)))
where σ_x, a parameter related to IoU and the width of the target box, is 1/3 of the computed transverse axis of the ellipse, and σ_y, a parameter related to IoU and the height of the target box, is 1/3 of the computed longitudinal axis; the Gaussian kernel thus forms an ellipse. The formulas relating σ_x and σ_y to IoU and the height and width of the target box are derived as follows:
the specific calculation formula of IoU is:
further deducing that:
due to the fact thatWherein a is half of the transverse axis of the Gaussian kernel, b is half of the transverse axis of the Gaussian kernel, r is the distance from the intersection point of the rectangular diagonal line and the outer ring of the Gaussian kernel to the center of the rectangle, and the following is further provided:
further comprising:
thereby obtaining IoU a calculation method of the Gaussian kernel parameters a and b related to the width and height of the target frame.
To further adapt to target boxes of different scales, the IoU is adaptively adjusted according to the area of the target box: with [α, β] the set IoU value range, area the area of the target box, a_S the area threshold for small target boxes and a_L the area threshold for large target boxes, the IoU of target boxes with area smaller than a_S is uniformly set to α, the IoU of target boxes with area larger than a_L is uniformly set to β, and the IoU of target boxes with area in [a_S, a_L] is set to the adaptation value.
To further predict the accurate position of the scale center point in the input image, a center-point offset map Ô is added; at each position p̃, the floating-point value p/r − p̃ lost from the center point of the real target box b_t is filled in, recovering the center-point localization accuracy lost to downsampling; all classes share the same offset map.
For the target box b_t of class c_t, the width and height of b_t are filled into the scale width-height map Ŝ at the corresponding p̃ coordinate positions; the scale is not normalized. To reduce the amount of computation, one shared width-height map is used to predict all classes.
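Filling the heatmap peak, the width-height map and the shared offset map for a single real target box can be sketched as follows; the map layout and the function name are illustrative assumptions consistent with the description:

```python
import numpy as np

def encode_box(box, cls, r, heat, size_map, off_map):
    """Fill the three target maps for one real box b_t = (x1, y1, x2, y2).
    The center p is downsampled by r to the integer cell p_tilde, the
    size map stores the un-normalized width and height, and the shared
    offset map stores the floating-point remainder p/r - p_tilde lost
    to downsampling."""
    x1, y1, x2, y2 = box
    px, py = (x1 + x2) / 2.0, (y1 + y2) / 2.0
    cx, cy = int(px / r), int(py / r)          # p_tilde
    heat[cls, cy, cx] = 1.0                    # peak of the Gaussian
    size_map[:, cy, cx] = (x2 - x1, y2 - y1)   # width, height (not normalized)
    off_map[:, cy, cx] = (px / r - cx, py / r - cy)
    return cx, cy
```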
In the inference stage, boxes are drawn on the picture according to the classification information, the regression-box width-height information and the center-point position information.
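A hedged sketch of the inference-stage decoding implied by this step: take peaks of the class heatmap, refine each center with the shared offset map, and read the box width and height from the scale map. The top-k/threshold scheme and the function name are assumptions:

```python
import numpy as np

def decode_detections(heat, size_map, off_map, r, k=100, thresh=0.3):
    """Turn the three head outputs (single image, no batch dim) into
    boxes: heatmap peaks give class and coarse center, the offset map
    refines the center, the width-height map gives the box size."""
    c, h, w = heat.shape
    flat = heat.reshape(-1)
    order = np.argsort(flat)[::-1][:k]          # top-k candidate peaks
    boxes = []
    for idx in order:
        score = flat[idx]
        if score < thresh:
            break
        cls, rem = divmod(idx, h * w)
        cy, cx = divmod(rem, w)
        # refine the center with the shared offset map, then scale by r
        px = (cx + off_map[0, cy, cx]) * r
        py = (cy + off_map[1, cy, cx]) * r
        bw, bh = size_map[0, cy, cx], size_map[1, cy, cx]
        boxes.append((cls, score, px - bw / 2, py - bh / 2,
                      px + bw / 2, py + bh / 2))
    return boxes
```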
The invention further provides a target detection system based on the scale-adaptive encoding module and the class attention module, comprising:
the characteristic extraction module is used for grouping the input pictures to form a characteristic image;
the up-sampling module is used for specially encoding the characteristic image to form an amplified characteristic image with reserved information;
and the class attention module is used for constructing a class attention network as a classifier, connecting the amplified feature map to a classification prediction head and obtaining the object class through the class attention network. Wherein the class attention network comprises an inter-class associative attention group and an intra-class associative attention group. The inter-class attention group comprises a plurality of class attention blocks and a class excitation block, and then the inter-class attention group is superimposed on the original characteristic diagram element by element through broadcasting to form an intra-class attention group, so that the class attention of the class prediction head is realized.
a center offset localization module for locating the target center point of the center-point localization network, connecting the enlarged feature map to the center offset prediction head and correcting the offset of the center point through the center offset loss, wherein the localization network comprises an improved cross-entropy loss group and a center-point offset loss group that together form the center-point localization network;
a box width-height prediction module for constructing the scale-adaptive width-height predictor, connecting the enlarged feature map to the width-height prediction head and inputting it into the scale-adaptive network to obtain the width-height regression quantities, wherein the scale-adaptive network is determined by a two-dimensional Gaussian kernel and the true aspect ratio of the target, the variance of the two-dimensional Gaussian kernel is determined by the IoU and the width and height of the target box, and the IoU is determined from the set upper and lower bounds together with the area of the real target box, realizing the scale adaptivity of the width-height prediction head; and
an image detection result module for displaying the outputs of the class attention module, the center offset localization module and the box width-height prediction module and drawing the target boxes.
Finally, a test embodiment is provided. The MS COCO 2017 data set is used for training, validation and testing: 118,000 images form the training set, 5,000 the validation set and 20,000 the test set. Detection results are evaluated with three average-precision metrics, AP, AP50 and AP75, to examine the target detection performance of the invention. ResNet-18 and DLA-34 are respectively selected as model backbones. All images are scaled to 512×512 while maintaining their aspect ratio, and 128×128 feature maps are generated with the scale-adaptive encoding module. Random translation (range 128), random flipping, random color jittering and random illumination are used as data augmentation, and the overall objective is optimized with SGD. A learning rate (LR) of 0.02 and a batch size of 128 are used, with 80 training epochs on the data set and the LR reduced by a factor of 0.1 at epochs 50 and 72. All experiments, including training and speed tests, were completed on a machine running PyTorch with an NVIDIA Titan V GPU. Table 1 shows the comparison results for adding the scale-adaptive encoding module under the three evaluation metrics, Table 2 the comparison results for adding the class attention module, and Table 3 the comparison of the invention with current mainstream algorithms; Fig. 3 compares the method of the invention with each algorithm in this embodiment, and Figs. 4a and 4b show the detection effect of the invention.
TABLE 1 adaptive coding Module comparison experiment
TABLE 2 Category attention Module comparative experiment
Table 3 Comparison of results with state-of-the-art (SOTA) networks on the COCO test data set, where bold and italic bold represent the first- and second-highest values, respectively
The foregoing description of specific embodiments of the present invention has been presented. It is to be understood that the present invention is not limited to the specific embodiments described above, and that various changes or modifications may be made by one skilled in the art within the scope of the appended claims without departing from the spirit of the invention. The embodiments and features of the embodiments of the present application may be combined with each other arbitrarily without conflict.
Claims (10)
1. A method of target detection, comprising the steps of:
s1, extracting image features to generate a feature map;
s2, the extracted feature map is sampled, and an amplified feature map which retains original feature information is obtained;
s3, connecting the amplified feature map to a category prediction head, a width and height prediction head and a central point offset prediction head;
s4, adding a category attention network into the category prediction header, wherein the category attention network is used for mining effective information between distant and semantically related targets in and among the categories;
s5, in the training stage, generating supervision information by encoding the real target frame, thereby supervising the training process of each measuring head;
and S6, the trained category prediction head, width and height prediction head and center point offset prediction head respectively outputting the classification information, regression box width and height information and center point position information of the image to be detected, framing the recognized objects in the image according to the output results and annotating them with the classification results.
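Steps S1–S6 end in three parallel prediction heads over the amplified feature map. A minimal shape-level sketch of those three heads, using random 1×1-convolution weights as stand-ins for the trained heads (all names here are illustrative assumptions, not the patent's implementation):

```python
import numpy as np

def run_heads(feature_map, num_classes, rng):
    """Apply three parallel 1x1-conv heads: class, width/height, center offset."""
    C = feature_map.shape[0]
    def head(c_out):
        w = rng.standard_normal((c_out, C))      # stand-in weights
        return np.einsum('oc,chw->ohw', w, feature_map)
    # class heatmap, width-height regression, center-point offset
    return head(num_classes), head(2), head(2)
```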
2. The method of claim 1, wherein the feature map is generated by extracting features of the image using a residual network or a deep feature fusion network.
3. The object detection method of claim 1, wherein the upsampling module consists of an alternation of a deformable convolution and a transposed convolution.
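A minimal sketch of the transposed-convolution half of such an upsampling module (the deformable convolution is omitted; the single-channel NumPy scatter-add below is an illustrative stand-in, not the patent's implementation):

```python
import numpy as np

def conv_transpose2d(x, w, stride=2):
    """Minimal single-channel 2D transposed convolution (scatter-add form)."""
    H, W = x.shape
    k = w.shape[0]
    out = np.zeros(((H - 1) * stride + k, (W - 1) * stride + k))
    for i in range(H):
        for j in range(W):
            # each input pixel scatters a weighted copy of the kernel
            out[i * stride:i * stride + k, j * stride:j * stride + k] += x[i, j] * w
    return out
```

With kernel size equal to the stride (e.g. 2), the output doubles the spatial resolution with no overlapping contributions.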
4. The object detection method of claim 1, wherein the mechanism of the class attention network is represented as: I_E = H_E(I_Dk, I_Sk); wherein I_E represents the effective information between targets, H_E denotes the operation for mining effective information, I_Dk denotes the distance information in case k, and I_Sk denotes the semantic information in case k; the case k is divided into the intra-class case and the inter-class case.
5. The object detection method of claim 1, wherein the class attention network comprises an inter-class associated attention group and an intra-class associated attention group; the inter-class associated attention group comprises a plurality of class attention blocks and a class excitation block, and the inter-class information it outputs is then superimposed element by element onto the amplified feature map by broadcasting to form the intra-class associated attention group, thereby realizing the class attention of the class prediction head.
6. The object detection method of claim 1, wherein the class attention workflow of the class attention network comprises the steps of:
s41, enlarging characteristic diagram F with scale C multiplied by H multiplied by WPIExtracting features, reducing the size to obtain information between classes, multiplying the information between classes to the enlarged feature graph F by matrix multiplicationPIObtaining a new inter-class information characteristic diagram; the inter-class information feature map is represented as follows:
FWI=Hmul(Zip(Conv(FPI)),FPI)
wherein, FWIFeature graph representing information between classes, HmulRepresenting a matrix pixel-by-pixel multiplication operation, Zip representing an information reduction operation, and Conv representing a convolution operation;
s42, for new inter-class information characteristic diagram FWIExtracting features, passing the extraction result through a linear rectification function, extracting features again to obtain intra-class information, and superimposing the intra-class information on the amplified feature map F by broadcasting element-by-element additionPIObtaining a category attention feature map; the class attention feature map is represented as follows:
FCA=Hadd(Conv(Lin(Conv(FWI))),FPI)
wherein, FCAAs a class attention feature map, HaddIndicating broadcast element-by-element addition and Lin indicating linear commutation operations.
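Under the assumption that Conv is a 1×1 convolution, Zip is global average pooling, and Lin is ReLU (the patent does not fix these choices), the two formulas F_WI = H_mul(Zip(Conv(F_PI)), F_PI) and F_CA = H_add(Conv(Lin(Conv(F_WI))), F_PI) can be sketched in NumPy as:

```python
import numpy as np

def conv1x1(x, w):
    """1x1 convolution as a channel mix: x is (C_in, H, W), w is (C_out, C_in)."""
    return np.einsum('oc,chw->ohw', w, x)

def class_attention(F_PI, w1, w2, w3):
    # Zip: reduce the convolved map to per-channel weights (global average pooling)
    z = conv1x1(F_PI, w1).mean(axis=(1, 2), keepdims=True)   # (C, 1, 1)
    F_WI = z * F_PI                  # H_mul: pixel-wise (broadcast) multiplication
    inner = np.maximum(conv1x1(F_WI, w2), 0)                 # Lin: linear rectification
    F_CA = conv1x1(inner, w3) + F_PI # H_add: broadcast element-by-element addition
    return F_CA
```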
7. The object detection method of claim 1, wherein the center point offset prediction head is configured to output the center point offset of a center point positioning network, the center point positioning network comprising a cross-entropy loss and a center point offset loss; the center point offset prediction head corrects the offset of the target center point by the center offset loss, which is expressed as follows:
8. The object detection method of claim 1, wherein the width and height prediction head implements width and height prediction by constructing a scale-adaptive network; the scale-adaptive network is determined by a two-dimensional Gaussian kernel and the real aspect ratio of the target, the variance of the two-dimensional Gaussian kernel is determined by the IoU ratio and the width and height of the target box, and the IoU ratio is determined by the set upper and lower limits and the area of the real target box, thereby realizing the scale adaptation of the width and height prediction head.
9. The method of claim 1, wherein connecting the amplified feature map to the category prediction head, the width and height prediction head and the center point offset prediction head compiles three feature maps: a class heatmap, a scale (width and height) map, and a center point offset map; wherein N denotes the batch size, r the output stride, C the number of target classes, and H and W the height and width of the image, respectively;
for each real target box btC, calculating the down-sampled r-times equivalent value of the central point pAll targets are coded into a Heatmap graph H in a Gaussian kernel mode, and a specific channel is occupied by a specific category; when the central points of two or more targets are coincident, adopting the target representative with the largest target frame area; hxycThe value of the corresponding position is confirmed by a 2D gaussian kernel, which is:
wherein σ_x is a parameter related to IoU and the width of the target box, equal to 1/3 of the calculated transverse axis of the ellipse; σ_y is a parameter related to IoU and the height of the target box, equal to 1/3 of the calculated longitudinal axis of the ellipse; the Gaussian kernel thus forms an ellipse.
The calculation formulas of σ_x and σ_y with respect to IoU and the height and width of the target box are derived as follows; first, IoU is calculated as:
from which it is further deduced that:
owing to the relation among a, b and r, wherein a is half of the transverse axis of the Gaussian kernel, b is half of the longitudinal axis of the Gaussian kernel, and r is the distance from the intersection point of the rectangle's diagonal with the outer ring of the Gaussian kernel to the center of the rectangle, there is further:
and further:
the method of calculating the Gaussian kernel parameters a and b with respect to IoU and the width and height of the target box is thus obtained:
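The heatmap encoding of claim 9 — writing an elliptical 2D Gaussian peak at each down-sampled center and keeping the element-wise maximum when kernels overlap — can be sketched as follows (an illustrative NumPy version; σ_x and σ_y are taken as given rather than derived from IoU):

```python
import numpy as np

def draw_gaussian(heatmap, center, sigma_x, sigma_y):
    """Write an elliptical 2D Gaussian peak centred at `center` = (cx, cy)."""
    H, W = heatmap.shape
    ys, xs = np.mgrid[0:H, 0:W]
    cx, cy = center
    g = np.exp(-((xs - cx) ** 2 / (2 * sigma_x ** 2)
                 + (ys - cy) ** 2 / (2 * sigma_y ** 2)))
    # keep the element-wise maximum so overlapping targets do not erase each other
    np.maximum(heatmap, g, out=heatmap)
    return heatmap
```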
10. The object detection method of claim 9, wherein the size of IoU is adaptively adjusted according to the area of the target box:

wherein [α, β] is the set IoU value range, area is the area of the target box, a_S is the area threshold of small target boxes, and a_L is the area threshold of large target boxes; the IoU of target boxes with area smaller than a_S is uniformly set to α, that of target boxes with area larger than a_L is uniformly set to β, and that of target boxes with area in [a_S, a_L] is set to the adaptive value.
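A sketch of the adaptive IoU rule of claim 10; the interpolation used between a_S and a_L is assumed linear here, since the exact formula is elided in the source:

```python
def adaptive_iou(area, alpha, beta, a_s, a_l):
    """Piecewise IoU target: alpha below a_s, beta above a_l, interpolated between."""
    if area <= a_s:
        return alpha
    if area >= a_l:
        return beta
    # linear interpolation is an assumption; the source elides the exact formula
    return alpha + (beta - alpha) * (area - a_s) / (a_l - a_s)
```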
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202110898055.9A CN113657225B (en) | 2021-08-05 | 2021-08-05 | Target detection method |
Publications (2)
Publication Number | Publication Date |
---|---|
CN113657225A true CN113657225A (en) | 2021-11-16 |
CN113657225B CN113657225B (en) | 2023-09-26 |
Cited By (2)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN114972976A (en) * | 2022-07-29 | 2022-08-30 | 之江实验室 | Night target detection and training method and device based on frequency domain self-attention mechanism |
CN115908790A (en) * | 2022-12-28 | 2023-04-04 | 北京斯年智驾科技有限公司 | Target detection center point offset detection method and device and electronic equipment |
Citations (5)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN111191566A (en) * | 2019-12-26 | 2020-05-22 | 西北工业大学 | Optical remote sensing image multi-target detection method based on pixel classification |
CN112036457A (en) * | 2020-08-20 | 2020-12-04 | 腾讯科技(深圳)有限公司 | Method and device for training target detection model and target detection method and device |
CN112801146A (en) * | 2021-01-13 | 2021-05-14 | 华中科技大学 | Target detection method and system |
US20210183072A1 (en) * | 2019-12-16 | 2021-06-17 | Nvidia Corporation | Gaze determination machine learning system having adaptive weighting of inputs |
CN112990102A (en) * | 2021-04-16 | 2021-06-18 | 四川阿泰因机器人智能装备有限公司 | Improved Centernet complex environment target detection method |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
PB01 | Publication | ||
SE01 | Entry into force of request for substantive examination | ||
GR01 | Patent grant | ||