CN113657225A - Target detection method - Google Patents

Target detection method Download PDF

Info

Publication number
CN113657225A
Authority
CN
China
Prior art keywords
class
information
target
attention
category
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Granted
Application number
CN202110898055.9A
Other languages
Chinese (zh)
Other versions
CN113657225B (en)
Inventor
卢涛
陈剑卓
张彦铎
徐爱波
吴云韬
金从元
余晗
魏明
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Wuhan Institute of Technology
Wuhan Fiberhome Technical Services Co Ltd
Original Assignee
Wuhan Institute of Technology
Wuhan Fiberhome Technical Services Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Wuhan Institute of Technology, Wuhan Fiberhome Technical Services Co Ltd filed Critical Wuhan Institute of Technology
Priority to CN202110898055.9A priority Critical patent/CN113657225B/en
Publication of CN113657225A publication Critical patent/CN113657225A/en
Application granted granted Critical
Publication of CN113657225B publication Critical patent/CN113657225B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical

Links

Images

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00 Pattern recognition
    • G06F18/20 Analysing
    • G06F18/24 Classification techniques
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 Computing arrangements based on biological models
    • G06N3/02 Neural networks
    • G06N3/04 Architecture, e.g. interconnection topology
    • G06N3/045 Combinations of networks
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 Computing arrangements based on biological models
    • G06N3/02 Neural networks
    • G06N3/08 Learning methods

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Data Mining & Analysis (AREA)
  • Evolutionary Computation (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Artificial Intelligence (AREA)
  • General Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • General Health & Medical Sciences (AREA)
  • Software Systems (AREA)
  • Molecular Biology (AREA)
  • Computing Systems (AREA)
  • Biophysics (AREA)
  • Biomedical Technology (AREA)
  • Mathematical Physics (AREA)
  • Computational Linguistics (AREA)
  • Health & Medical Sciences (AREA)
  • Bioinformatics & Cheminformatics (AREA)
  • Bioinformatics & Computational Biology (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Evolutionary Biology (AREA)
  • Image Analysis (AREA)

Abstract

The invention provides a target detection method, which comprises the following steps: extracting image features to generate a feature map; up-sampling the feature map to obtain an enlarged feature map; connecting the enlarged feature map to a category prediction head, a width and height prediction head and a central point offset prediction head; adding a category attention network to the category prediction head to mine effective information between targets that are far apart within and between categories but semantically related; supervising the training of each prediction head with supervision information generated by encoding the real target frame; and framing the recognized object in the image to be detected according to the output of each prediction head and marking the classification result. By combining the category attention used to further judge the category of the target with the scale-adaptive encoding used for frame regression, the network can associate intra-class and inter-class features, mine effective information between distant but semantically related targets, and perform more accurate frame selection according to the scale variation of the detected target, so that both the detection accuracy and the frame selection precision are improved.

Description

Target detection method
Technical Field
The invention belongs to the field of computer vision target detection, and particularly relates to a target detection method.
Background
Object detection is a common problem in the field of machine vision. It is image segmentation based on characteristics of the detected object such as geometric and statistical features, and it combines object segmentation and recognition into a whole so as to obtain an accurate object detection result. Target detection combines target positioning and target classification, locating an object of interest in an image or video by drawing on multiple areas of knowledge such as image processing and machine learning. The target classification part is responsible for judging whether the input image contains an object of a given class, and the target positioning part is responsible for representing the position of the target object, marking and positioning it with a circumscribed rectangular frame. Target detection plays an important role in many applications such as target tracking and pose detection.
Generally, target detection can be divided into conventional detection methods and learning-based detection methods. A conventional detection method usually includes three steps: traversing candidate regions with sliding windows of different sizes, extracting the relevant visual features of the candidate regions with the Histogram of Oriented Gradients (HOG) and the Scale-Invariant Feature Transform (SIFT), and classifying the features with a trained classifier. Although such methods can achieve good results, the sliding-window region selection has no pertinence to the object to be detected, so the time complexity is high and the windows are redundant; in addition, the classification effect varies greatly under different conditions and the robustness is not strong. Learning-based methods were subsequently applied widely in the field of target detection; a deep learning method can fully extract the features in the training samples, thereby obtaining more accurate classification and a certain increase in detection speed.
In recent years, methods based on deep convolutional neural networks (CNNs) have improved significantly over traditional target detection algorithms. The deep convolutional network LeNet-5 introduced two layers of CNNs to realize target detection. Thereafter, as deep learning progressed further, the accuracy of target detection improved continuously, and target detection algorithms based on classification (two-stage) as well as algorithms converting target detection into a regression problem (single-stage) were developed. Aiming at the high parameter count and training cost of two-stage target detection algorithms, the You Only Look Once (YOLO) method was born, which divides the picture into grids, where each grid only detects targets whose centers fall in that grid and predicts two bounding boxes and the category information, so that the bounding boxes, target confidence and category probability of all regions are predicted at one time. Subsequently, among target detection methods based on the regression problem, a more intuitive method (Objects as Points, CenterNet) was developed, which directly detects the central point and size of the target and discards the prediction frame, further improving the speed and precision of target detection.
Although the target detection method without prediction frames achieves a satisfactory effect, it does not take into account the change of the aspect ratio of the target and the uneven distribution of targets of different scales when constructing the heatmap, and it does not mine the effective information of targets that are far apart within a class but semantically related. Therefore, it is very important to construct a method that pays attention to the aspect ratio and distribution of the target and can mine more effective information.
Disclosure of Invention
In view of the above drawbacks and needs for improvement in the prior art, the present invention provides a target detection method that addresses the limitations of current regression-based target detection.
An object detection method comprising the steps of:
s1, extracting image features to generate a feature map;
s2, up-sampling the extracted feature map to obtain an enlarged feature map that retains the original feature information;
s3, connecting the amplified feature map to a category prediction head, a width and height prediction head and a central point offset prediction head;
s4, adding a category attention network into the category prediction header, wherein the category attention network is used for mining effective information between distant and semantically related targets in and among the categories;
s5, in the training stage, generating supervision information by encoding the real target frame, thereby supervising the training process of each prediction head;
and S6, outputting classification information, regression frame width and height information and central point position information of the image to be detected respectively by the trained class prediction head, width and height prediction head and central point offset prediction head, framing the identification object in the image to be detected according to the output result and marking the classification result.
Further, the features of the image are extracted by utilizing a residual error network or a deep feature fusion network, and a feature map is generated.
Further, the upsampling module consists of an alternation of a deformable convolution and a transposed convolution.
Further, the mechanism of the class attention network is represented as: I_E = H_E(I_Dk, I_Sk); wherein I_E represents the effective information between targets, H_E represents the operation for mining the effective information, I_Dk represents the distance information in case k, and I_Sk represents the semantic information in case k, where k is divided into an intra-class case and an inter-class case.
Further, the category attention network includes an inter-class associative attention group and an intra-class associative attention group; the inter-class associated attention group comprises a plurality of class attention blocks and a class excitation block, and then inter-class information output by the inter-class associated attention group is superimposed on the amplified characteristic diagram element by element through broadcasting to form an intra-class associated attention group, so that the class attention of the class prediction head is realized.
Further, the category attention workflow of the category attention network comprises the following steps:
s41, for the enlarged feature map F_PI with scale C × H × W, extracting features and reducing the size to obtain the inter-class information, and multiplying the inter-class information onto the enlarged feature map F_PI by matrix multiplication to obtain a new inter-class information feature map; the inter-class information feature map is represented as follows:
F_WI = H_mul(Zip(Conv(F_PI)), F_PI)
wherein F_WI represents the inter-class information feature map, H_mul represents a matrix pixel-by-pixel multiplication operation, Zip represents an information reduction operation, and Conv represents a convolution operation;
s42, for the new inter-class information feature map F_WI, extracting features, passing the extraction result through a linear rectification function, and extracting features again to obtain the intra-class information, and superimposing the intra-class information onto the enlarged feature map F_PI by broadcast element-by-element addition to obtain a category attention feature map; the category attention feature map is represented as follows:
F_CA = H_add(Conv(Lin(Conv(F_WI))), F_PI)
wherein F_CA is the category attention feature map, H_add represents broadcast element-by-element addition, and Lin represents the linear rectification operation.
Further, the central point offset prediction head is used for outputting the central point offset of a central point positioning network, and the central point positioning network comprises a cross entropy loss group and a central point offset loss group; the central point offset prediction head corrects the offset of the target central point by a center offset loss, expressed as follows:
L_offset = (1/N) Σ_i |Ô_i − O_i|
wherein L_offset represents the center offset loss, N represents the batch size, Ô_i represents the predicted center coordinate, and O_i represents the true center coordinate.
Further, the width and height prediction head realizes width and height prediction by constructing a scale self-adaptive network; the scale self-adaptive network is determined by a two-dimensional Gaussian kernel and the real width-to-height ratio of the target, the variance of the two-dimensional Gaussian kernel is determined by the intersection ratio and the width and height of the target frame, and the intersection ratio is determined according to the set upper limit and lower limit and the area of the real target frame, so that the scale self-adaptation of the width and height prediction head is realized.
Further, connecting the enlarged feature map to the category prediction head, the width and height prediction head and the central point offset prediction head compiles three feature maps: a class heatmap of size N × C × (H/r) × (W/r), a scale width and height map of size N × 2 × (H/r) × (W/r), and a central point offset map of size N × 2 × (H/r) × (W/r), wherein N represents the batch size, r represents the output step length, C represents the number of target classes, and H and W represent the height and width of the image respectively;
for each real target box b_t of class c, the equivalent value of its central point p after r-times down-sampling is calculated as p̃ = ⌊p/r⌋; all targets are coded into a heatmap H by means of Gaussian kernels, and a specific channel is occupied by a specific category; when the central points of two or more targets coincide, the target with the largest target-frame area is adopted as the representative; the value H_xyc of the corresponding position is confirmed by a 2D Gaussian kernel:
H_xyc = exp(−(x − p̃_x)² / (2σ_x²) − (y − p̃_y)² / (2σ_y²))
wherein σ_x is a parameter related to IoU and the width of the target box and equals 1/3 of the calculated transverse axis of the ellipse; σ_y is a parameter related to IoU and the height of the target box and equals 1/3 of the calculated longitudinal axis of the ellipse; the Gaussian kernel thus forms an ellipse;
the calculation formulas relating σ_x and σ_y to IoU and to the height and width of the target box are derived as follows: IoU is first expressed in terms of a, b and r, wherein a is half of the transverse axis of the Gaussian kernel, b is half of the longitudinal axis of the Gaussian kernel, and r is the distance from the intersection point of the rectangle's diagonal with the outer contour of the Gaussian kernel to the center of the rectangle; substituting this intersection point into the ellipse formula x²/a² + y²/b² = 1 then yields the Gaussian kernel parameters a and b as functions of IoU and of the width and height of the target box (the intermediate equations are rendered as images in the original publication);
Further, the size of IoU is adaptively adjusted according to the area of the target frame:
IoU = α if area < a_S; IoU = β if area > a_L; otherwise IoU is set to an adaptive value between α and β
wherein [α, β] is the set IoU value range, area is the area of the target box, a_S is the area threshold of the small target frame, and a_L is the area threshold of the large target frame; IoU of target boxes with area smaller than a_S is uniformly set to α, IoU of target boxes with area larger than a_L is uniformly set to β, and IoU of target boxes with area in [a_S, a_L] is set to the adaptive value;
a central point offset map is added; at the coordinate p̃ of each real target box b_t in the offset map, the floating-point value p/r − p̃ of its central point lost by down-sampling is filled in, so that the loss of central point positioning accuracy caused by down-sampling is recovered, and all classes share the same offset map.
The invention has the beneficial effects that:
according to the method, the category attention for further judging the target category and the scale self-adaptive coding for frame regression are combined, so that the network can associate the characteristics in the category and among the categories, and can obtain a more accurate target frame while mining the effective information between targets which are far away from each other in the category and among the categories and are semantically related; and more accurate framing is performed according to the scale transformation of the detected target, so that the accuracy of target detection and the framing precision are improved.
Drawings
Fig. 1 is a schematic flow chart of a target detection method according to an embodiment of the present invention;
fig. 2 is a network structure diagram of a target detection method according to an embodiment of the present invention;
FIG. 3 is a graph comparing test results of the target detection method of the present invention with other algorithms;
fig. 4 is a schematic diagram illustrating the effect of detecting 2 image targets according to the embodiment of the present invention.
Detailed Description
The present invention will be described in detail with reference to specific examples. The following examples will assist those skilled in the art in further understanding the invention, but are not intended to limit the invention in any way. It should be noted that various changes and modifications can be made by those skilled in the art without departing from the spirit of the invention, all of which fall within the scope of the present invention.
The invention discloses a target detection method (DASCAN), which improves the conventional key point detection scheme aiming at the requirement of multi-path real-time accurate inference in actual projects, improves the detection precision of a model and better meets the real-time requirement of a real scene; the invention provides a scale self-adaptive coding module, optimizes a target frame to obtain an accurate frame selection result, and provides a category attention module, so that the similar objects are accurately distinguished. The invention can realize multi-path real-time accurate multi-target detection and detect the object type and position in a complex scene.
A target detection method according to an embodiment of the present invention, as shown in fig. 1 and 2, includes the following steps:
and S1, extracting image features to generate a feature map.
In the embodiment of the present invention, a feature map is generated by extracting image features in an original image or video using a Deep residual network (ResNet) or a Deep Layer feature Aggregation (DLA).
And S2, the extracted feature map is up-sampled to obtain an enlarged feature map that retains the original feature information.
An up-sampling module consisting of alternating 3×3 deformable convolutions and transposed convolutions is constructed, and up-sampling is performed with this module to obtain the enlarged feature map that retains the effective information. The feature map retaining the original feature information is expressed as follows:
F_PI = H_IM(H_US(F_Ori))
wherein F_PI represents the enlarged feature map with retained information, H_IM represents the mapping operation that retains the feature information, H_US represents the image enlarging operation, and F_Ori represents the feature map generated in S1, i.e., the feature image obtained through the backbone network.
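For illustration only, a minimal PyTorch-style sketch of one such up-sampling stage is given below; the module structure, the channel numbers and the offset-predicting convolution are assumptions for illustration, not the disclosed implementation.

```python
import torch
import torch.nn as nn
from torchvision.ops import DeformConv2d

class UpsampleStage(nn.Module):
    """One stage of the up-sampling module: a 3x3 deformable convolution followed by a
    stride-2 transposed convolution that doubles the spatial resolution."""
    def __init__(self, in_ch: int, out_ch: int):
        super().__init__()
        # The deformable convolution needs per-position offsets: 2 * 3 * 3 = 18 channels.
        self.offset = nn.Conv2d(in_ch, 18, kernel_size=3, padding=1)
        self.dcn = DeformConv2d(in_ch, out_ch, kernel_size=3, padding=1)
        self.up = nn.ConvTranspose2d(out_ch, out_ch, kernel_size=4, stride=2, padding=1)
        self.relu = nn.ReLU(inplace=True)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        x = self.relu(self.dcn(x, self.offset(x)))  # remap features at the same resolution
        return self.relu(self.up(x))                # double the resolution

# e.g. three stages bring a ResNet-18 output (stride 32) up to stride 4
upsampler = nn.Sequential(UpsampleStage(512, 256), UpsampleStage(256, 128), UpsampleStage(128, 64))
print(upsampler(torch.randn(1, 512, 16, 16)).shape)  # torch.Size([1, 64, 128, 128])
```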
And S3, connecting the amplified feature map to the category prediction head, the width and height prediction head and the central point offset prediction head, and enhancing the information acquisition capability of the features in different fields.
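As a sketch only, the three prediction heads attached to the enlarged feature map can be pictured as lightweight convolutional branches; the 3×3 conv + ReLU + 1×1 conv structure and the channel counts below are illustrative assumptions rather than the disclosed architecture.

```python
import torch
import torch.nn as nn

def make_head(in_ch: int, out_ch: int, mid_ch: int = 64) -> nn.Sequential:
    """A lightweight prediction head: 3x3 conv, ReLU, then a 1x1 conv to the output channels."""
    return nn.Sequential(
        nn.Conv2d(in_ch, mid_ch, 3, padding=1),
        nn.ReLU(inplace=True),
        nn.Conv2d(mid_ch, out_ch, 1),
    )

num_classes = 80                                  # e.g. MS COCO
f_pi = torch.randn(1, 64, 128, 128)               # enlarged feature map F_PI
class_heatmap = torch.sigmoid(make_head(64, num_classes)(f_pi))  # category prediction head
wh_pred = make_head(64, 2)(f_pi)                  # width and height prediction head
offset_pred = make_head(64, 2)(f_pi)              # central point offset prediction head
```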
In the training stage, the classification prediction head is used to confirm the existence of a target and to determine its category through the channel ID, and a class attention module is added to the classification prediction head to mine effective information between targets that are far apart but semantically related within and between classes. The mechanism of the class attention network is represented as I_E = H_E(I_Dk, I_Sk), wherein I_E represents the effective information between targets, H_E represents the operation for mining the effective information, I_Dk represents the distance information in case k, and I_Sk represents the semantic information in case k; the case k is divided into w (the intra-class case) and b (the inter-class case).
A center offset positioning module is constructed, which is used for constructing the target central point of the central point positioning network. The enlarged feature map F_PI is connected to the center offset prediction head, which outputs the central point offset of the positioning network; the positioning network comprises an improved cross entropy loss group and a central point offset loss group, and these loss groups jointly form the central point positioning network. The offset of the central point is corrected by a center offset loss, expressed as follows:
L_offset = (1/N) Σ_i |Ô_i − O_i|
wherein L_offset represents the center offset loss, N represents the batch size, Ô_i represents the predicted center coordinate, and O_i represents the true center coordinate.
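A minimal sketch of such an L1-style center offset loss, assuming the predicted and true center offsets have already been gathered at the encoded center positions:

```python
import torch

def center_offset_loss(pred_offset: torch.Tensor, true_offset: torch.Tensor) -> torch.Tensor:
    """L_offset = (1/N) * sum |predicted center offset - true center offset|, N = batch size."""
    n = pred_offset.shape[0]
    return (pred_offset - true_offset).abs().sum() / n

pred = torch.tensor([[0.20, 0.70], [0.10, 0.40]])
true = torch.tensor([[0.25, 0.65], [0.05, 0.45]])
print(center_offset_loss(pred, true))  # tensor(0.1000)
```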
Constructing a frame width and height prediction module for constructing a scale-adaptive width and height predictor and amplifying a feature map FPIAnd connecting to a width and height prediction head, and inputting a scale self-adaptive network to obtain a width and height regression quantity. The scale self-adaptive network is determined by a two-dimensional Gaussian kernel and the real aspect ratio of the target, and the variance of the two-dimensional Gaussian kernel is determined by the intersection ratio and the aspect ratio of the target frame. And the intersection ratio is determined according to the set upper limit and the set lower limit and the area of the real target frame, so that the scale self-adaptation of the width and height prediction head is realized.
And S4, adding a category attention network into the category prediction header, wherein the category attention network is used for mining effective information between distant and semantically related targets in and among the categories and reinforcing network classification capability.
And constructing a Class Attention Module (CAM), connecting the amplified feature map to a classification prediction head, inputting the amplified feature map to the class attention module to obtain the object class, wherein the class attention network comprises an inter-class associated attention group and an intra-class associated attention group. The inter-class attention group comprises a plurality of class attention blocks and a class excitation block, and then the inter-class attention group is superimposed on the original characteristic diagram element by element through broadcasting to form an intra-class attention group, so that the class attention of the class prediction head is realized.
In the embodiment of the present invention, the category attention workflow in the category attention module is divided into the following steps:
For the enlarged feature map F_PI with scale C × H × W, features are extracted and the size is reduced to obtain the inter-class information, which is then multiplied onto F_PI by matrix multiplication to obtain a new inter-class information feature map. The inter-class information feature map is represented as follows:
F_WI = H_mul(Zip(Conv(F_PI)), F_PI)
wherein F_WI represents the inter-class information feature map, H_mul represents a matrix pixel-by-pixel multiplication, Zip represents an information reduction operation, and Conv represents a 1 × 1 convolution operation.
For the new feature map F_WI, features are extracted, passed through a linear rectification function, and extracted again to obtain the intra-class information, which is superimposed onto F_PI by broadcast element-by-element addition to obtain the category attention feature map, represented as follows:
F_CA = H_add(Conv(Lin(Conv(F_WI))), F_PI)
wherein F_CA is the above category attention feature map, H_add represents broadcast element-by-element addition, and Lin represents the linear rectification operation.
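A possible PyTorch sketch of this category attention workflow is shown below; the interpretation of Zip as global average pooling, the sigmoid gating and the channel-reduction ratio are assumptions made only for illustration.

```python
import torch
import torch.nn as nn

class ClassAttentionModule(nn.Module):
    """Sketch of the category attention workflow: an inter-class weighting of F_PI followed by
    an intra-class refinement that is added back onto F_PI element by element."""
    def __init__(self, channels: int, reduction: int = 4):
        super().__init__()
        self.conv_in = nn.Conv2d(channels, channels, 1)   # Conv(F_PI)
        self.zip = nn.AdaptiveAvgPool2d(1)                # "Zip": information reduction (assumed)
        self.refine = nn.Sequential(                      # Conv -> Lin (ReLU) -> Conv
            nn.Conv2d(channels, channels // reduction, 1),
            nn.ReLU(inplace=True),
            nn.Conv2d(channels // reduction, channels, 1),
        )

    def forward(self, f_pi: torch.Tensor) -> torch.Tensor:
        inter = torch.sigmoid(self.zip(self.conv_in(f_pi)))  # inter-class weights, (N, C, 1, 1)
        f_wi = inter * f_pi                                   # H_mul: broadcast pixel-wise multiply
        return self.refine(f_wi) + f_pi                       # H_add: broadcast element-wise add

cam = ClassAttentionModule(64)
print(cam(torch.randn(1, 64, 128, 128)).shape)  # torch.Size([1, 64, 128, 128])
```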
And S5, in the training stage, supervision information is generated by encoding the real target frame, thereby supervising the training process of each prediction head and improving the precision and accuracy of each prediction head.
In the training phase, the width and height prediction head predicts the width and height of the target box represented by the central point, and the central point offset prediction head predicts the precision value of the target central point lost in the encoding process of the scale-adaptive coding module. The supervision information used to train the classification prediction head, the width and height prediction head and the central point offset prediction head is obtained by encoding the real target frame with the scale-adaptive coding module. The constraint of the scale-adaptive supervision information on each prediction head is encoded as R_pre = H_adapt(I_bbox), wherein R_pre represents the encoding result for each prediction head, H_adapt represents the scale-adaptive information encoding operation, and I_bbox represents the information of the real target box.
And S6, in the inference stage, the trained class prediction head, the width and height prediction head and the central point offset prediction head respectively output the classification information, the regression frame width and height information and the central point position information of the image to be detected, and then the recognition object is framed in the image to be detected according to the output prediction result and the classification result is marked.
In this example, the data input to the scale-adaptive coding module compiles three feature maps: a class heatmap of size N × C × (H/r) × (W/r), a scale width and height map of size N × 2 × (H/r) × (W/r), and a central point offset map of size N × 2 × (H/r) × (W/r), where N represents the batch size (batch-size), r represents the output step length, C represents the number of target classes, and H and W represent the height and width of the image respectively.
For each real target box b_t of class c, the equivalent value of its central point p after r-times down-sampling is calculated as p̃ = ⌊p/r⌋. All targets are encoded into the heatmap H by means of Gaussian kernels, and a specific class occupies a specific channel. When the central points of two or more targets coincide, the target with the largest target-frame area is taken as the representative. The value H_xyc of the corresponding position is confirmed by a 2D Gaussian kernel:
H_xyc = exp(−(x − p̃_x)² / (2σ_x²) − (y − p̃_y)² / (2σ_y²))
where σ_x is a parameter related to IoU and the width of the target box and equals 1/3 of the calculated transverse axis of the ellipse, and σ_y is a parameter related to IoU and the height of the target box and equals 1/3 of the calculated longitudinal axis of the ellipse; the Gaussian kernel thus forms an ellipse.
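A small NumPy sketch of writing one target center into a class heatmap channel with an elliptical 2D Gaussian kernel; the max-merging of overlapping kernels and the example σ values are illustrative assumptions.

```python
import numpy as np

def draw_elliptical_gaussian(heatmap: np.ndarray, center: tuple, sigma_x: float, sigma_y: float) -> None:
    """Fill one heatmap channel with exp(-(x-cx)^2/(2*sx^2) - (y-cy)^2/(2*sy^2)) around the
    down-sampled target center; overlapping kernels are merged by taking the maximum."""
    h, w = heatmap.shape
    cx, cy = center
    ys, xs = np.ogrid[:h, :w]
    g = np.exp(-((xs - cx) ** 2) / (2 * sigma_x ** 2) - ((ys - cy) ** 2) / (2 * sigma_y ** 2))
    np.maximum(heatmap, g, out=heatmap)

hm = np.zeros((128, 128), dtype=np.float32)        # one channel of the class heatmap
draw_elliptical_gaussian(hm, center=(16, 20), sigma_x=5.0, sigma_y=3.0)
print(hm[20, 16])  # 1.0 at the encoded center
```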
The calculation formulas relating σ_x and σ_y to IoU and to the height and width of the target box are derived as follows: IoU is first expressed in terms of a, b and r, where a is half of the transverse axis of the Gaussian kernel, b is half of the longitudinal axis of the Gaussian kernel, and r is the distance from the intersection point of the rectangle's diagonal with the outer contour of the Gaussian kernel to the center of the rectangle; substituting this intersection point into the ellipse formula x²/a² + y²/b² = 1 then yields the calculation method of the Gaussian kernel parameters a and b related to IoU and the width and height of the target frame (the intermediate equations are rendered as images in the original publication).
In order to further adapt to target frames of different scales, IoU is adaptively adjusted according to the area of the target frame:
IoU = α if area < a_S; IoU = β if area > a_L; otherwise IoU is set to an adaptive value between α and β
where [α, β] is the set IoU value range, area is the area of the target box, a_S is the area threshold of the small target frame, and a_L is the area threshold of the large target frame. IoU of target boxes with area smaller than a_S is uniformly set to α, IoU of target boxes with area larger than a_L is uniformly set to β, and IoU of target boxes with area in [a_S, a_L] is set to the adaptive value.
In order to further predict the accurate position of the scale central point in the input image, a central point offset map is added. At the coordinate p̃ of each real target box b_t in the offset map, the floating-point value p/r − p̃ of its central point lost by down-sampling is filled in, which is used to recover the loss of central point positioning accuracy caused by down-sampling, and all classes share the same offset map.
The width and height of the target frame t, whose class is c_t, are taken from the real target box: at the coordinate p̃ in the scale width and height map, the width and height of the real target box b_t are filled in respectively, and the scale is not normalized. To reduce the amount of computation, a single width and height map is used to predict all classes.
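A sketch, under assumed box and map conventions, of filling the width and height map and the central point offset map at the down-sampled centers of the real target boxes:

```python
import numpy as np

def encode_targets(boxes, out_h: int, out_w: int, r: int = 4):
    """Fill 2-channel width/height and center-offset maps; boxes are assumed to be
    (x1, y1, x2, y2) in input-image pixels."""
    wh_map = np.zeros((2, out_h, out_w), dtype=np.float32)
    offset_map = np.zeros((2, out_h, out_w), dtype=np.float32)
    for x1, y1, x2, y2 in boxes:
        cx, cy = (x1 + x2) / 2 / r, (y1 + y2) / 2 / r  # center after r-times down-sampling
        ix, iy = int(cx), int(cy)                      # integer position (floor of p / r)
        wh_map[:, iy, ix] = (x2 - x1, y2 - y1)         # un-normalized width and height
        offset_map[:, iy, ix] = (cx - ix, cy - iy)     # floating-point part lost by down-sampling
    return wh_map, offset_map

wh, off = encode_targets([(32, 48, 96, 112)], out_h=128, out_w=128)
print(wh[:, 20, 16], off[:, 20, 16])  # [64. 64.] [0. 0.]
```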
In the inference stage, the target frame is drawn on the picture according to the classification information, the regression frame width and height information and the central point position information.
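For illustration, a decoding sketch consistent with this inference step; the 3×3 max-pool peak picking and the top-k selection are common practice for this family of detectors and are assumptions here, not quoted from the disclosure.

```python
import torch
import torch.nn.functional as F

def decode(heatmap: torch.Tensor, wh: torch.Tensor, offset: torch.Tensor, k: int = 100, r: int = 4):
    """Turn the three prediction maps into boxes. Assumed shapes: heatmap (1, C, H, W),
    wh and offset (1, 2, H, W); wh is assumed to be in input-image pixels."""
    _, _, h, w = heatmap.shape
    peaks = heatmap * (F.max_pool2d(heatmap, 3, stride=1, padding=1) == heatmap).float()
    scores, inds = torch.topk(peaks.flatten(), k)
    cls = torch.div(inds, h * w, rounding_mode="floor")       # class channel of each peak
    pos = inds % (h * w)
    ys = torch.div(pos, w, rounding_mode="floor").float()
    xs = (pos % w).float()
    bw, bh = wh[0, 0].flatten()[pos], wh[0, 1].flatten()[pos]
    dx, dy = offset[0, 0].flatten()[pos], offset[0, 1].flatten()[pos]
    cx, cy = (xs + dx) * r, (ys + dy) * r                     # back to input-image coordinates
    boxes = torch.stack([cx - bw / 2, cy - bh / 2, cx + bw / 2, cy + bh / 2], dim=1)
    return boxes, scores, cls
```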
The invention also provides a target detection system based on the scale self-adaptive coding module and the category attention module, which comprises the following components:
the characteristic extraction module is used for grouping the input pictures to form a characteristic image;
the up-sampling module is used for specially encoding the characteristic image to form an amplified characteristic image with reserved information;
and the class attention module is used for constructing a class attention network as a classifier, connecting the amplified feature map to a classification prediction head and obtaining the object class through the class attention network. Wherein the class attention network comprises an inter-class associative attention group and an intra-class associative attention group. The inter-class attention group comprises a plurality of class attention blocks and a class excitation block, and then the inter-class attention group is superimposed on the original characteristic diagram element by element through broadcasting to form an intra-class attention group, so that the class attention of the class prediction head is realized.
And the center offset positioning module is used for constructing a target center point of the center point positioning network, connecting the amplified characteristic diagram to a center offset amount prediction head and correcting the offset of the center point through center offset amount loss. Wherein the positioning network comprises an improved cross entropy loss set and a center point offset loss set. The sets of losses collectively form a central point location network.
And the frame width and height prediction module is used for constructing a scale self-adaptive width and height predictor, connecting the enlarged feature map to a width and height prediction head, and inputting it into a scale self-adaptive network to obtain a width and height regression quantity. The scale self-adaptive network is determined by a two-dimensional Gaussian kernel and the real aspect ratio of the target, and the variance of the two-dimensional Gaussian kernel is determined by the intersection ratio and the width and height of the target frame. The intersection ratio is determined according to the set upper limit and lower limit and the area of the real target frame, so that the scale self-adaptation of the width and height prediction head is realized.
And the image detection result module is used for displaying the classification information of the category classification module, the center offset positioning module and the frame length and width prediction module and drawing a target frame.
The invention finally provides a test embodiment, using the MS COCO 2017 data set as the training set, validation set and test set, with 118000 images as the training data set, 5000 images as the validation data set, and 20000 images as the test data set. The target detection results were evaluated with three metrics, average precision (AP), AP50 and AP75, to examine the target detection performance of the present invention. ResNet-18 and DLA34 were respectively selected as the model backbones of the invention. All images were scaled to 512 × 512 while maintaining their aspect ratio, and 128 × 128 feature maps were generated using the scale-adaptive coding module. Random translation (translation range 128), random flipping, random color jittering and random lighting were used as data augmentation, and the overall objective was optimized with SGD. A learning rate (LR) of 0.02 and a batch size of 128 were used, the model was trained on the data set for 80 epochs, and the LR was reduced by a factor of 10 at epochs 50 and 72 respectively. All experiments, including the training tasks and speed tests, were completed with PyTorch on a machine equipped with NVIDIA Titan V GPUs. Table 1 shows the comparison results of adding the scale-adaptive coding module under the three evaluation metrics, Table 2 shows the comparison results of adding the category attention module, Table 3 shows the comparison of the present invention with current mainstream algorithms, Fig. 3 is the comparison of the method of the present invention with each algorithm in this example, and Fig. 4a and 4b show the effect of the present invention.
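The reported optimizer settings correspond to a standard SGD schedule; in the sketch below the momentum value and the stand-in model are assumptions, and the pass over the training data is elided.

```python
import torch
import torch.nn as nn

model = nn.Conv2d(3, 80, 3, padding=1)  # stand-in for the full detection network (assumption)
optimizer = torch.optim.SGD(model.parameters(), lr=0.02, momentum=0.9)
scheduler = torch.optim.lr_scheduler.MultiStepLR(optimizer, milestones=[50, 72], gamma=0.1)
for epoch in range(80):
    # ... one pass over the training set (forward, loss, backward) would go here ...
    optimizer.step()
    scheduler.step()
print(optimizer.param_groups[0]["lr"])  # ~0.0002 after both milestones
```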
TABLE 1 Comparison experiment for the scale-adaptive coding module (table rendered as an image in the original publication)
TABLE 2 Comparison experiment for the category attention module (table rendered as an image in the original publication)
TABLE 3 Comparison of results on the COCO test data set with SOTA networks (non-optimal results), where bold and italic bold represent the first and second highest values, respectively (table rendered as an image in the original publication)
The foregoing description of specific embodiments of the present invention has been presented. It is to be understood that the present invention is not limited to the specific embodiments described above, and that various changes or modifications may be made by one skilled in the art within the scope of the appended claims without departing from the spirit of the invention. The embodiments and features of the embodiments of the present application may be combined with each other arbitrarily without conflict.

Claims (10)

1. A method of target detection, comprising the steps of:
s1, extracting image features to generate a feature map;
s2, up-sampling the extracted feature map to obtain an enlarged feature map that retains the original feature information;
s3, connecting the amplified feature map to a category prediction head, a width and height prediction head and a central point offset prediction head;
s4, adding a category attention network into the category prediction header, wherein the category attention network is used for mining effective information between distant and semantically related targets in and among the categories;
s5, in the training stage, generating supervision information by encoding the real target frame, thereby supervising the training process of each prediction head;
and S6, outputting classification information, regression frame width and height information and central point position information of the image to be detected respectively by the trained class prediction head, width and height prediction head and central point offset prediction head, framing the identification object in the image to be detected according to the output result and marking the classification result.
2. The method of claim 1, wherein the feature map is generated by extracting features of the image using a residual network or a deep feature fusion network.
3. The object detection method of claim 1, wherein the upsampling module consists of an alternation of a deformable convolution and a transposed convolution.
4. The object detection method of claim 1, wherein the mechanism of the class attention network is represented as: I_E = H_E(I_Dk, I_Sk); wherein I_E represents the effective information between objects, H_E represents the operation for mining the effective information, I_Dk represents the distance information in case k, and I_Sk represents the semantic information in case k, the case k being divided into an intra-class case and an inter-class case.
5. The object detection method of claim 1, wherein the class attention network comprises an inter-class associative attention group and an intra-class associative attention group; the inter-class associated attention group comprises a plurality of class attention blocks and a class excitation block, and then inter-class information output by the inter-class associated attention group is superimposed on the amplified characteristic diagram element by element through broadcasting to form an intra-class associated attention group, so that the class attention of the class prediction head is realized.
6. The object detection method of claim 1, wherein the class attention workflow of the class attention network comprises the steps of:
s41, for the enlarged feature map F_PI with scale C × H × W, extracting features and reducing the size to obtain the inter-class information, and multiplying the inter-class information onto the enlarged feature map F_PI by matrix multiplication to obtain a new inter-class information feature map; the inter-class information feature map is represented as follows:
F_WI = H_mul(Zip(Conv(F_PI)), F_PI)
wherein F_WI represents the inter-class information feature map, H_mul represents a matrix pixel-by-pixel multiplication operation, Zip represents an information reduction operation, and Conv represents a convolution operation;
s42, for the new inter-class information feature map F_WI, extracting features, passing the extraction result through a linear rectification function, and extracting features again to obtain the intra-class information, and superimposing the intra-class information onto the enlarged feature map F_PI by broadcast element-by-element addition to obtain a category attention feature map; the category attention feature map is represented as follows:
F_CA = H_add(Conv(Lin(Conv(F_WI))), F_PI)
wherein F_CA is the category attention feature map, H_add represents broadcast element-by-element addition, and Lin represents the linear rectification operation.
7. The object detection method of claim 1, wherein the central point offset prediction head is configured to output the central point offset of a central point positioning network, the central point positioning network comprising a cross entropy loss group and a central point offset loss group; the central point offset prediction head corrects the offset of the target central point by a center offset loss, expressed as follows:
L_offset = (1/N) Σ_i |Ô_i − O_i|
wherein L_offset represents the center offset loss, N represents the batch size, Ô_i represents the predicted center coordinate, and O_i represents the true center coordinate.
8. The object detection method of claim 1, wherein the width and height prediction head realizes width and height prediction by constructing a scale self-adaptive network; the scale self-adaptive network is determined by a two-dimensional Gaussian kernel and the real width-to-height ratio of the target, the variance of the two-dimensional Gaussian kernel is determined by the intersection ratio and the width and height of the target frame, and the intersection ratio is determined according to the set upper limit and lower limit and the area of the real target frame, so that the scale self-adaptation of the width and height prediction head is realized.
9. The method of claim 1, wherein the connection of the enlarged feature map to the category prediction head, the width and height prediction head and the central point offset prediction head compiles three feature maps: a class heatmap of size N × C × (H/r) × (W/r), a scale width and height map of size N × 2 × (H/r) × (W/r), and a central point offset map of size N × 2 × (H/r) × (W/r), wherein N represents the batch size, r represents the output step length, C represents the number of target classes, and H and W represent the height and width of the image respectively;
for each real target box b_t of class c, the equivalent value of its central point p after r-times down-sampling is calculated as p̃ = ⌊p/r⌋; all targets are coded into a heatmap H by means of Gaussian kernels, and a specific channel is occupied by a specific category; when the central points of two or more targets coincide, the target with the largest target-frame area is adopted as the representative; the value H_xyc of the corresponding position is confirmed by a 2D Gaussian kernel:
H_xyc = exp(−(x − p̃_x)² / (2σ_x²) − (y − p̃_y)² / (2σ_y²))
wherein σ_x is a parameter related to IoU and the width of the target box and equals 1/3 of the calculated transverse axis of the ellipse; σ_y is a parameter related to IoU and the height of the target box and equals 1/3 of the calculated longitudinal axis of the ellipse; the Gaussian kernel forms an ellipse;
the calculation formulas relating σ_x and σ_y to IoU and to the height and width of the target box are derived as follows: IoU is first expressed in terms of a, b and r, wherein a is half of the transverse axis of the Gaussian kernel, b is half of the longitudinal axis of the Gaussian kernel, and r is the distance from the intersection point of the rectangle's diagonal with the outer contour of the Gaussian kernel to the center of the rectangle; substituting this intersection point into the ellipse formula x²/a² + y²/b² = 1 yields the method of computing the Gaussian kernel parameters a and b relating to IoU and the width and height of the target box (the intermediate equations are rendered as images in the original publication).
10. The object detection method of claim 9, wherein the size of IoU is adaptively adjusted according to the area of the target frame:
IoU = α if area < a_S; IoU = β if area > a_L; otherwise IoU is set to an adaptive value between α and β
wherein [α, β] is the set IoU value range, area is the area of the target box, a_S is the area threshold of the small target frame, and a_L is the area threshold of the large target frame; IoU of target boxes with area smaller than a_S is uniformly set to α, IoU of target boxes with area larger than a_L is uniformly set to β, and IoU of target boxes with area in [a_S, a_L] is set to the adaptive value;
a central point offset map is added; at the coordinate p̃ of each real target box b_t in the offset map, the floating-point value p/r − p̃ of its central point lost by down-sampling is filled in, so that the loss of central point positioning accuracy caused by down-sampling is recovered, and all classes share the same offset map.
CN202110898055.9A 2021-08-05 2021-08-05 Target detection method Active CN113657225B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202110898055.9A CN113657225B (en) 2021-08-05 2021-08-05 Target detection method

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202110898055.9A CN113657225B (en) 2021-08-05 2021-08-05 Target detection method

Publications (2)

Publication Number Publication Date
CN113657225A true CN113657225A (en) 2021-11-16
CN113657225B CN113657225B (en) 2023-09-26

Family

ID=78478514

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202110898055.9A Active CN113657225B (en) 2021-08-05 2021-08-05 Target detection method

Country Status (1)

Country Link
CN (1) CN113657225B (en)

Cited By (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN114972976A (en) * 2022-07-29 2022-08-30 之江实验室 Night target detection and training method and device based on frequency domain self-attention mechanism
CN115908790A (en) * 2022-12-28 2023-04-04 北京斯年智驾科技有限公司 Target detection center point offset detection method and device and electronic equipment

Citations (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN111191566A (en) * 2019-12-26 2020-05-22 西北工业大学 Optical remote sensing image multi-target detection method based on pixel classification
CN112036457A (en) * 2020-08-20 2020-12-04 腾讯科技(深圳)有限公司 Method and device for training target detection model and target detection method and device
CN112801146A (en) * 2021-01-13 2021-05-14 华中科技大学 Target detection method and system
US20210183072A1 (en) * 2019-12-16 2021-06-17 Nvidia Corporation Gaze determination machine learning system having adaptive weighting of inputs
CN112990102A (en) * 2021-04-16 2021-06-18 四川阿泰因机器人智能装备有限公司 Improved Centernet complex environment target detection method

Patent Citations (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20210183072A1 (en) * 2019-12-16 2021-06-17 Nvidia Corporation Gaze determination machine learning system having adaptive weighting of inputs
CN111191566A (en) * 2019-12-26 2020-05-22 西北工业大学 Optical remote sensing image multi-target detection method based on pixel classification
CN112036457A (en) * 2020-08-20 2020-12-04 腾讯科技(深圳)有限公司 Method and device for training target detection model and target detection method and device
CN112801146A (en) * 2021-01-13 2021-05-14 华中科技大学 Target detection method and system
CN112990102A (en) * 2021-04-16 2021-06-18 四川阿泰因机器人智能装备有限公司 Improved Centernet complex environment target detection method

Cited By (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN114972976A (en) * 2022-07-29 2022-08-30 之江实验室 Night target detection and training method and device based on frequency domain self-attention mechanism
CN114972976B (en) * 2022-07-29 2022-12-20 之江实验室 Night target detection and training method and device based on frequency domain self-attention mechanism
CN115908790A (en) * 2022-12-28 2023-04-04 北京斯年智驾科技有限公司 Target detection center point offset detection method and device and electronic equipment

Also Published As

Publication number Publication date
CN113657225B (en) 2023-09-26

Similar Documents

Publication Publication Date Title
CN109241913B (en) Ship detection method and system combining significance detection and deep learning
CN108961235B (en) Defective insulator identification method based on YOLOv3 network and particle filter algorithm
CN110738207B (en) Character detection method for fusing character area edge information in character image
CN110111366B (en) End-to-end optical flow estimation method based on multistage loss
CN109886121B (en) Human face key point positioning method for shielding robustness
CN110232350B (en) Real-time water surface multi-moving-object detection and tracking method based on online learning
CN110163207B (en) Ship target positioning method based on Mask-RCNN and storage device
US20210081695A1 (en) Image processing method, apparatus, electronic device and computer readable storage medium
CN113435240B (en) End-to-end form detection and structure identification method and system
CN114627052A (en) Infrared image air leakage and liquid leakage detection method and system based on deep learning
CN117253154B (en) Container weak and small serial number target detection and identification method based on deep learning
CN113609896A (en) Object-level remote sensing change detection method and system based on dual-correlation attention
US11615612B2 (en) Systems and methods for image feature extraction
CN114022408A (en) Remote sensing image cloud detection method based on multi-scale convolution neural network
CN113657225B (en) Target detection method
CN114724155A (en) Scene text detection method, system and equipment based on deep convolutional neural network
CN113191204B (en) Multi-scale blocking pedestrian detection method and system
CN112800955A (en) Remote sensing image rotating target detection method and system based on weighted bidirectional feature pyramid
CN111753682A (en) Hoisting area dynamic monitoring method based on target detection algorithm
CN114266794A (en) Pathological section image cancer region segmentation system based on full convolution neural network
CN111507337A (en) License plate recognition method based on hybrid neural network
CN113554679A (en) Anchor-frame-free target tracking algorithm for computer vision application
CN112419317A (en) Visual loopback detection method based on self-coding network
CN110634142B (en) Complex vehicle road image boundary optimization method
CN114299383A (en) Remote sensing image target detection method based on integration of density map and attention mechanism

Legal Events

Date Code Title Description
PB01 Publication
PB01 Publication
SE01 Entry into force of request for substantive examination
SE01 Entry into force of request for substantive examination
GR01 Patent grant
GR01 Patent grant