CN111783819A - Improved target detection method based on region-of-interest training on small-scale data set - Google Patents

Improved target detection method based on region-of-interest training on small-scale data set

Info

Publication number
CN111783819A
CN111783819A (application number CN202010383794.XA)
Authority
CN
China
Prior art keywords
training
target detection
detection model
scale
detection method
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Granted
Application number
CN202010383794.XA
Other languages
Chinese (zh)
Other versions
CN111783819B (en)
Inventor
尹子会
付炜平
赵冀宁
孟荣
贾志辉
董俊虎
杜江龙
赵振兵
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
State Grid Corp of China SGCC
North China Electric Power University
Maintenance Branch of State Grid Hebei Electric Power Co Ltd
Original Assignee
State Grid Corp of China SGCC
North China Electric Power University
Maintenance Branch of State Grid Hebei Electric Power Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by State Grid Corp of China SGCC, North China Electric Power University, Maintenance Branch of State Grid Hebei Electric Power Co Ltd filed Critical State Grid Corp of China SGCC
Priority to CN202010383794.XA priority Critical patent/CN111783819B/en
Publication of CN111783819A publication Critical patent/CN111783819A/en
Application granted granted Critical
Publication of CN111783819B publication Critical patent/CN111783819B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical

Classifications

    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06F - ELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00 - Pattern recognition
    • G06F18/20 - Analysing
    • G06F18/21 - Design or setup of recognition systems or techniques; Extraction of features in feature space; Blind source separation
    • G06F18/214 - Generating training patterns; Bootstrap methods, e.g. bagging or boosting
    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06N - COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N20/00 - Machine learning
    • Y - GENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
    • Y02 - TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
    • Y02T - CLIMATE CHANGE MITIGATION TECHNOLOGIES RELATED TO TRANSPORTATION
    • Y02T10/00 - Road transport of goods or passengers
    • Y02T10/10 - Internal combustion engine [ICE] based vehicles
    • Y02T10/40 - Engine management systems

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Data Mining & Analysis (AREA)
  • Evolutionary Computation (AREA)
  • General Engineering & Computer Science (AREA)
  • Software Systems (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • General Physics & Mathematics (AREA)
  • Artificial Intelligence (AREA)
  • Physics & Mathematics (AREA)
  • Evolutionary Biology (AREA)
  • Bioinformatics & Cheminformatics (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Medical Informatics (AREA)
  • Bioinformatics & Computational Biology (AREA)
  • Computing Systems (AREA)
  • Mathematical Physics (AREA)
  • Image Analysis (AREA)

Abstract

The invention provides an improved target detection method based on region-of-interest training on a small-scale data set, belonging to the technical field of image analysis. An image target detection result is obtained through a target detection model whose training process includes a stage in which frame-regression task training and classification task training are performed independently, in sequence and cyclically. The frame-regression task training uses a first training set obtained by applying a first data enhancement to the small-scale data set, and the classification task training uses a second training set obtained by applying a second data enhancement to the first training set; each image of the second training set retains partial global information of the picture outside the region of interest. By introducing a region-of-interest mechanism in the training stage, the method overcomes the overfitting that easily occurs when an existing One-Stage target detection model is trained on a small-scale data set, yielding an accurate target detection model.

Description

Improved target detection method based on region-of-interest training on small-scale data set
Technical Field
The invention belongs to the technical field of image analysis and relates to an improved target detection method based on region-of-interest training on a small-scale data set.
Background
Deep Learning (DL) is a research direction in the field of Machine Learning (ML). It extracts features through neural-network learning instead of relying on hand-crafted features, which greatly improves learning efficiency and accuracy, and it is widely applied in image classification, target detection, image segmentation, natural language processing and other fields. However, since deep learning methods are generally data-driven, they place high requirements on sample quantity, richness and labeling accuracy. In the field of target detection, if the sample quantity and richness are insufficient, a deep learning model not only extracts the target features from the training samples but also brings the background noise in the samples into its learning range, so that the model overfits the data. Once overfitting occurs, the recall rate of target detection drops severely, seriously degrading detection performance.
Target detection methods based on deep learning generally fall into two categories. Two-Stage detection algorithms divide the detection problem into two stages: the first stage generates candidate regions and the second stage classifies the targets and refines their positions; representative models include Region-based CNN (R-CNN), Fast R-CNN and the like. One-Stage detection algorithms directly predict the class probability and position information of targets with a single network, without generating candidate regions; typical representatives are the SSD (Single Shot MultiBox Detector) model and the YOLO (You Only Look Once) model.
For One-Stage target detection models, the lack of a preliminary target-box screening mechanism similar to that of the Two-Stage algorithms often leads to more serious overfitting to the training set data during classification training, particularly on small-scale data sets.
Disclosure of Invention
The invention aims to provide an improved target detection method based on region-of-interest training on a small-scale data set. A region-of-interest mechanism is introduced in the training stage to overcome the overfitting that easily occurs when an existing One-Stage target detection model is trained on a small-scale data set, so that an accurate target detection model is obtained.
The technical scheme provided by the invention is an improved target detection method based on region-of-interest training on a small-scale data set, in which an image target detection result is obtained through a target detection model comprising a multi-layer-output depth feature extraction network and a multi-scale fusion detection head. The training process of the target detection model includes a stage in which frame-regression task training and classification task training are performed independently, in sequence and cyclically. The independent training is realized by adjusting the loss coefficients in the loss function, so that the classification task training in this stage can learn partial global information of each picture in the training set without affecting the frame-recognition learning of the frame-regression task training on the region of interest.
In one embodiment of the invention, the small-scale data set marked with regions of interest is used to perform the frame regression task training and the classification task training on the target detection model.
In one embodiment of the present invention, the depth feature extraction network is pre-trained using a large-scale data set. The large-scale data set is a classification data set whose categories are essentially unrelated to the classes of the targets to be recognized; during pre-training the depth feature extraction network is treated as a pure classification model (classification only, no regression), and the network weights obtained by pre-training shorten the subsequent training time on the small-scale data set. When a large-scale data set without classification labels is used, it must first be transformed into the classification format required for pre-training.
In one embodiment of the present invention, the frame-regression task training is performed on the target detection model using a first training set obtained by applying a first data enhancement to the small-scale data set, and the classification task training is performed using a second training set obtained by applying a second data enhancement to the first training set; each image of the second training set retains partial global information of the picture outside the region of interest. Different small-scale training sets are thus used in the stage of cyclic, sequential, independent training: the first training set gives the One-Stage target detection model its frame-recognition capability, while the second training set gives it its classification capability, and that classification capability suppresses overfitting.
As an improvement of the above embodiment, the first data enhancement obtains a first training set larger than the small-scale data set through one or more of flipping, translation, blurring, scaling and cropping; the second data enhancement preserves part of the background information of each image according to the distance between the background area and the region of interest, and includes adding noise. The first data enhancement yields a first training set larger than the original small-scale data set and thus richer training data; the second data enhancement yields a second training set of essentially the same size as the first but containing partial global information, which preserves part of the background information and improves the classification and recognition ability of the trained target detection model.
In one embodiment of the present invention, an exemplary noise-adding method is provided: for a picture marked with several regions of interest, the amplitude $n_{x,y}$ of the noise added to its pixel $p_{x,y}$ is $\min(b, a\times d)$, where $d$ is the shortest distance from pixel $p_{x,y}$ to all regions of interest, $a$ is a noise intensity parameter, and $b$ is the maximum noise intensity. In a further improvement, the training results can be optimized by adjusting these parameters.
In one embodiment of the invention, in the multi-scale fusion detection head, the feature maps of different sizes output by the depth feature extraction network are up-sampled, fused and convolved layer by layer using a feature pyramid network structure, obtaining target detection outputs at n scales, equal in number to the detection heads.
In an embodiment of the present invention, each detection head of the multi-scale fusion detection head includes a classification output layer for classification task training and a regression output layer for frame-regression task training. In one round of independent training, if the loss coefficients corresponding to the classification output layer carry the larger weight among all losses, the training concentrates on the classification task; if the loss coefficients corresponding to the regression output layer carry the larger weight, the training concentrates on the frame-regression task.
In an embodiment of the present invention, the learning rate of each frame-regression task training is lower than that of the previous frame-regression task training, and the learning rate of each classification task training is lower than that of the previous classification task training.
In an embodiment of the invention, after the cyclic independent-training stage is finished, the target detection model is fine-tuned using the first training set. In this fine-tuning of the model the weights of the losses differ only slightly, so the fine-tuning training takes the classification task training and the frame-regression task training into account simultaneously.
Compared with the prior art, the invention has the beneficial effects that:
the invention improves the defect that the existing One-Stage target detection excessively depends on data by improving the data enhancement and training methods. The training input data is subjected to local limit strong denoising processing, and the farther the training input data is from a target, the higher the noise intensity is, so that the fitting difficulty of the feature extraction network on the background noise of the input picture is increased, and the overfitting possibility of the model on a small data set is reduced. And for the area close to the target, partial background information is also reserved, so that the network can adaptively learn features in different ranges. During training, a regression task and a classification task are respectively trained. Different training sets are used according to different tasks: for the regression task needing more global information, the image without noise is input, so that the global information is easier to extract; and for a classification task needing to pay more attention to the local part, inputting a noise-added picture and paying more attention to the target characteristic. Through testing, the method has relatively common practical significance on small-scale data sets. The invention is practical and has certain reference significance for the scheme design of related problems.
Drawings
FIG. 1 is a schematic diagram of a target detection model according to an embodiment of the present invention;
FIG. 2 is a schematic flow chart illustrating a method for training a target detection model according to an embodiment of the present invention;
FIG. 3 is a schematic diagram of a data flow of the embodiment of FIG. 2 in training a target detection model;
FIG. 4 is an image containing a target object from the second training set after noise addition with a maximum noise intensity of 127 according to an embodiment of the present invention;
FIG. 5 is an image containing a target object from the second training set after noise addition with a maximum noise intensity of 255 according to an embodiment of the present invention;
FIG. 6 shows partial detection results on substation equipment defect images after training with the method of the present invention in an application embodiment;
FIG. 7 is a result image obtained by detecting part of the VOC2007 data set images using the method of the present invention in an application embodiment.
Detailed Description
It should first be noted that the basic idea of the technical solution of the invention is as follows. When a One-Stage target detection model is trained, the input data are first processed to adjust the learning difficulty of different regions of the image, so that the classification task training and the frame-regression task training can each adapt their range of attention, i.e. the region of interest, and so that the classification task training can learn partial global information while still attending to local information. In the model training stage, the method aims to form a soft region-of-interest mechanism, which differs from the candidate-region mechanism of Two-Stage models: a Two-Stage target detection model contains a module that identifies candidate regions as regions of interest and, when classifying, directly segments the part containing the target out of the extracted full-image features; that segmentation is a hard region-of-interest mechanism. By contrast, the present method adjusts the learning difficulty of different regions, for example by gradually adding noise of different intensities, and sets no definite boundary, so that the classification task training attends to the target itself.
The technical scheme of the invention is based on a target detection model which, as shown in FIG. 1, is essentially One-Stage and comprises a multi-layer-output depth feature extraction network 1 and a multi-scale fusion detection head 2. The depth feature extraction network comprises n backbone layers from top to bottom; each layer comprises one or more convolution layers and outputs a feature map of one scale to the layer below it, and several of these layers are selected, from top to bottom, to also output the feature map acquired at that layer to the multi-scale fusion detection head 2. The deeper the backbone layer, the smaller the scale of its output feature map. In the multi-scale fusion detection head 2, several independent Detection Heads are arranged, matching the selected backbone layers in number and scale. The feature maps output by the selected backbone layers are up-sampled and tensor-spliced layer by layer inside the multi-scale fusion detection head 2; except for the bottom, smallest-scale detection head, which directly processes the feature map output by the bottom backbone layer, every other detection head takes as input the tensor-spliced feature map of its layer. The output of each detection head is processed by a regression output layer and a classification output layer respectively and then serves as the target detection output of the model. In the embodiment of FIG. 1, with 1 ≤ i < j < n, the three backbone layers i, j and n are selected, and correspondingly three detection heads for different scales are arranged in the multi-scale fusion detection head 2 from bottom to top; in other embodiments, the number of detection heads in the multi-scale fusion detection head 2 differs according to the number of selected backbone layers.
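A hedged PyTorch sketch of one possible reading of this fusion structure for two scales follows; the class name, channel counts and the nearest-neighbour up-sampling are illustrative assumptions, not specified by the patent.

```python
# Illustrative sketch only: channel sizes and two-scale fusion are assumptions.
import torch
import torch.nn as nn
import torch.nn.functional as F

class MultiScaleFusionHead(nn.Module):
    """FPN-style fusion of two backbone feature maps, one detection head per scale.

    Each head outputs k * (c + 5) channels per cell: for each of k anchor
    shapes, c class scores, 4 box regression values and 1 confidence score.
    """
    def __init__(self, ch_deep=1280, ch_shallow=96, k=3, c=20):
        super().__init__()
        out_ch = k * (c + 5)
        self.head_deep = nn.Conv2d(ch_deep, out_ch, 1)       # smallest scale
        self.lateral = nn.Conv2d(ch_deep, ch_shallow, 1)
        self.head_shallow = nn.Conv2d(2 * ch_shallow, out_ch, 1)

    def forward(self, feat_shallow, feat_deep):
        # The bottom head detects directly on the deepest (smallest) map.
        out_deep = self.head_deep(feat_deep)
        # Up-sample the deep map and tensor-splice it with the shallower one.
        up = F.interpolate(self.lateral(feat_deep), size=feat_shallow.shape[2:])
        out_shallow = self.head_shallow(torch.cat([feat_shallow, up], dim=1))
        return out_shallow, out_deep

head = MultiScaleFusionHead()
o1, o2 = head(torch.randn(1, 96, 14, 20), torch.randn(1, 1280, 7, 10))
print(o1.shape, o2.shape)  # 14x20 and 7x10 grids, 75 channels each
```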
In the embodiment shown in fig. 2 and 3, based on the structure of the target detection model, the target detection model is trained in the following steps S100 to S110, so as to obtain the weight values of the nodes of the target detection model.
S100, pre-training based on a large-scale data set is carried out on the multi-layer output depth feature extraction network, and initial parameter values of the target detection model are obtained.
Specifically, the multi-layer-output depth feature extraction network in the target detection model is pre-trained using a large-scale data set. The weight values obtained by pre-training serve as the initial parameter values of the depth feature extraction network in the target detection model, so as to accelerate convergence and improve detection precision.
Illustratively, the large-scale data set in the embodiment of the invention is an image data set provided by ImageNet, and the multi-layer output deep feature extraction network is a MobileNet V2 network.
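As a hedged illustration of this pre-training shortcut, the sketch below loads torchvision's ImageNet-pretrained MobileNet V2 and taps two intermediate feature maps; the layer indices 13 and 18 are assumptions chosen so that a 320 × 224 input yields the 14 × 20 and 7 × 10 maps of the embodiments below, and torchvision itself is not named by the patent.

```python
# Sketch: ImageNet-pretrained backbone with two multi-scale output taps.
import torch
from torchvision.models import mobilenet_v2, MobileNet_V2_Weights

backbone = mobilenet_v2(weights=MobileNet_V2_Weights.IMAGENET1K_V1).features

x = torch.randn(1, 3, 224, 320)      # H x W = 224 x 320
feats = []
for i, layer in enumerate(backbone):
    x = layer(x)
    if i in (13, 18):                # assumed taps: stride-16 and stride-32
        feats.append(x)
for f in feats:
    print(f.shape)                   # (1, 96, 14, 20) and (1, 1280, 7, 10)
```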
S101, anchor boxes (Anchors) required for training the target detection model are obtained using the small-scale data set marked with regions of interest.
Specifically, the invention marks the regions of interest by setting a Ground Truth target box on each picture of the small-scale data set; exemplarily, the region of interest is the smallest rectangle covering the device of interest. Based on the small-scale data set, cluster analysis is performed on the normalized sizes of the Ground Truth target boxes. Exemplarily, this embodiment analyzes the size distribution of the Ground Truth target boxes with the K-means algorithm to obtain a group of size clustering results; the results comprise several different scales, each scale corresponding to the shape of one anchor box (Anchor Box), and a set of anchor box scales is established.
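A minimal sketch of this clustering step, assuming plain K-means from scikit-learn over normalized (width, height) pairs; the helper name is hypothetical and the six-cluster count matches the embodiments below.

```python
# Sketch: cluster normalized Ground Truth box sizes into k anchor shapes.
import numpy as np
from sklearn.cluster import KMeans

def cluster_anchor_shapes(boxes_wh, image_wh, k=6):
    """boxes_wh: (N, 2) box sizes in pixels; image_wh: (W, H) of the images."""
    normalized = boxes_wh / np.asarray(image_wh, dtype=float)
    km = KMeans(n_clusters=k, n_init=10, random_state=0).fit(normalized)
    return km.cluster_centers_       # k normalized (w, h) anchor shapes

boxes = np.random.rand(500, 2) * [640, 480]   # placeholder Ground Truth sizes
print(cluster_anchor_shapes(boxes, (640, 480)))
```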
Taking the feature points of the feature map at each scale as anchor points, with each anchor point corresponding to anchor boxes of the several sizes in the set, the number of anchor boxes that the multi-scale fusion detection head must examine for one image is:

$\sum_{i=1}^{n} w_i\times h_i\times k$

where $w_i$ and $h_i$ are the width and height of the $i$-th output feature map and $k$ is the number of anchor box sizes per anchor point. Specifically, a feature map of size 7 × 10 has 70 pixels, i.e. 70 feature points and thus 70 anchor points; if each anchor point corresponds to 3 anchor box sizes, the detection head for that feature map examines 210 anchor boxes. A decoder after the detection head output decodes the combined output of the multi-scale fusion detection head.
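As a quick check of this count, taking the two feature-map sizes and k = 3 anchor shapes of the embodiments below:

```python
# Total anchor boxes examined per image over two feature maps, k = 3 each.
feature_maps = [(7, 10), (14, 20)]
k = 3
total = sum(w * h * k for (w, h) in feature_maps)
print(total)   # 7*10*3 + 14*20*3 = 210 + 840 = 1050
```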
S102, a first training set used for performing frame regression task training on the target detection model and a second training set used for performing classification task training on the target detection model are obtained by using the small-scale data set marked with the region of interest.
Exemplarily, in this embodiment a larger number of pictures is obtained by enhancing the small-scale data set pictures through methods such as flipping, translation, blurring, scaling and cropping, and the set of these pictures is used as the first training set.
Exemplarily, in this embodiment the second training set is obtained by noise-processing the first training set according to the distance between each pixel of a picture and the Ground Truth target boxes. The specific noise-adding method is that, for a picture marked with several regions of interest, the amplitude $n_{x,y}$ of the noise added to pixel $p_{x,y}$ is:

$n_{x,y} = \min(b, a\times d)$

where $d$ is the shortest distance from pixel $p_{x,y}$ to all regions of interest, $a$ is a noise intensity parameter, and $b$ is the maximum noise intensity. The set of noise-processed pictures is taken as the second training set. The noise addition keeps partial background information in the background area outside the regions of interest: in each picture of the second training set the region of interest has no clear visual boundary, and the closer a pixel lies to the boundary of a region of interest, the more background information is retained. By contrast, in a Two-Stage target detection model, the detection information provided to the second stage after the first stage identifies a candidate region contains no background information outside that candidate region.
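A minimal sketch of this noise step, assuming uniform noise, grayscale uint8 images and SciPy's Euclidean distance transform for the shortest distance d; the function name and default parameters are hypothetical.

```python
# Sketch: distance-scaled noise, amplitude n_{x,y} = min(b, a * d).
import numpy as np
from scipy.ndimage import distance_transform_edt

def add_roi_noise(image, roi_boxes, a=2.0, b=127.0, rng=None):
    """image: (H, W) uint8; roi_boxes: list of (x0, y0, x1, y1) Ground Truth boxes."""
    rng = np.random.default_rng() if rng is None else rng
    inside = np.zeros(image.shape, dtype=bool)
    for x0, y0, x1, y1 in roi_boxes:
        inside[y0:y1, x0:x1] = True
    d = distance_transform_edt(~inside)      # shortest distance to any ROI
    amplitude = np.minimum(b, a * d)         # grows with distance, capped at b
    noise = rng.uniform(-1.0, 1.0, image.shape) * amplitude
    return np.clip(image.astype(float) + noise, 0, 255).astype(np.uint8)

img = (np.random.rand(224, 320) * 255).astype(np.uint8)
noisy = add_roi_noise(img, [(100, 60, 180, 140)])
```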
S103, in the multi-scale fusion detection heads, multi-scale fusion is carried out to obtain target detection data of each detection head.
Specifically, in the forward propagation of the depth feature extraction network, using the initial parameters obtained by the pre-training of S100, output feature maps of several backbone layers of different depths and sizes are selected as the outputs of the depth feature extraction network. In the multi-scale fusion detection head, the Feature Pyramid Network (FPN) structure is used to up-sample, fuse and convolve these feature maps of different sizes, obtaining target detection outputs at $n$ scales, equal in number to the $n$ detection heads, each of shape:

$w_i\times h_i\times k\times (c+5)$

where $c$ is the number of target classes and $w_i$ and $h_i$ are the length and width of the $i$-th output convolution feature map. For every anchor, $c$ classification results, the four coordinates of the corresponding prediction box and one confidence value are output; the four coordinates are the abscissa and ordinate of the prediction box and its length and width.
S104, a decoding algorithm for the output of the multi-scale fusion detection head is configured. The decoding algorithm converts the detection-head output of the target detection model into coordinate predictions, i.e. coordinates in the real picture.
Specifically, in this embodiment the anchor boxes generated in step S101 are used for regression training: the anchor box with the largest IoU against the Ground Truth target box is selected as the anchor point responsible for predicting a target object, and the relationship between the prediction output and the actual coordinates is expressed by equations (1) to (4):

$x' = x + \mathrm{sigmoid}(p_x)\times w$ (1)

$y' = y + \mathrm{sigmoid}(p_y)\times h$ (2)

$w' = w\times e^{p_w}$ (3)

$h' = h\times e^{p_h}$ (4)

where $x'$, $y'$, $w'$ and $h'$ are the regressed center coordinates, length and width for the Anchors in the anchor point set; $x$, $y$, $w$ and $h$ are the top-left coordinates, width and height of the Anchors; and $p_x$, $p_y$, $p_w$ and $p_h$ are the regression values predicted by the whole target detection network in one round of frame-regression training.
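A minimal sketch of this decoding, under the assumption that equations (3) and (4) take the exponential form given above and that anchor coordinates are normalized:

```python
# Sketch of the anchor decoding of equations (1)-(4).
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def decode(anchor, p):
    """anchor: (x, y, w, h) top-left corner plus size; p: (px, py, pw, ph)."""
    x, y, w, h = anchor
    px, py, pw, ph = p
    xc = x + sigmoid(px) * w      # eq. (1): regressed center x
    yc = y + sigmoid(py) * h      # eq. (2): regressed center y
    wr = w * np.exp(pw)           # eq. (3)
    hr = h * np.exp(ph)           # eq. (4)
    return xc, yc, wr, hr

print(decode((0.2, 0.3, 0.1, 0.2), (0.0, 0.5, -0.1, 0.3)))
```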
When the target detection model is used for prediction, for each anchor point the products of its $c$ classification predictions and its confidence are taken as the confidences of the $c$ classes. A threshold is selected to ensure that the anchor point correctly predicts a target; its value range is 0 to 1, with 0.7 preferred. For each anchor point, when the confidence of one or more classes is greater than or equal to the threshold, the anchor point is emitted as a valid output, and non-maximum suppression is applied to obtain the final prediction boxes.
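A hedged sketch of this post-processing using torchvision's NMS; only the 0.7 confidence threshold comes from the text, while the 0.5 suppression IoU and the class-agnostic NMS are assumptions.

```python
# Sketch: class confidences = class probabilities x objectness, then NMS.
import torch
from torchvision.ops import nms

def filter_predictions(boxes, class_probs, objectness, conf_thresh=0.7):
    """boxes: (N, 4) xyxy; class_probs: (N, c); objectness: (N,)."""
    conf, labels = (class_probs * objectness[:, None]).max(dim=1)
    keep = conf >= conf_thresh                  # anchors that predict a target
    boxes, conf, labels = boxes[keep], conf[keep], labels[keep]
    kept = nms(boxes, conf, iou_threshold=0.5)  # non-maximum suppression
    return boxes[kept], conf[kept], labels[kept]
```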
S105, the summed loss function used in training the target detection model is configured.
Specifically, in this embodiment, for an Anchor responsible for detecting a target the confidence label $C$ is 1; Anchors that are not responsible for detecting a target but whose prediction box has an IoU greater than 0.5 with a Ground Truth box are ignored; every other Anchor has confidence label $C = 0$.
This embodiment uses the cross-entropy function as the loss function for confidence prediction, with the following formulas:

$L_{obj} = -\sum_{i=1}^{n}\sum_{j} \mathbb{1}_{ij}^{obj}\left[\hat{C}_{ij}\ln\sigma(C_{ij}) + (1-\hat{C}_{ij})\ln\left(1-\sigma(C_{ij})\right)\right]$ (5)

$L_{noobj} = -\sum_{i=1}^{n}\sum_{j} \left(1-\mathbb{1}_{ij}^{obj}\right) m_{ij}\left[\hat{C}_{ij}\ln\sigma(C_{ij}) + (1-\hat{C}_{ij})\ln\left(1-\sigma(C_{ij})\right)\right]$ (6)

where $C_{ij}$ is the predicted confidence, $\hat{C}_{ij}$ is the true confidence value, the network has $n$ output scales, and $\sigma$ is the sigmoid function; $\mathbb{1}_{ij}^{obj}$ is 1 when the Anchor is responsible for prediction and 0 when it is not, and the mask $m_{ij}$ is 0 when the Anchor is ignored and 1 otherwise.
The cross-entropy function is likewise used as the loss function of the classification prediction network:

$L_{class} = -\sum_{i=1}^{n}\sum_{j} \mathbb{1}_{ij}^{obj}\left[\hat{p}_{ij}\ln\sigma(p_{ij}) + (1-\hat{p}_{ij})\ln\left(1-\sigma(p_{ij})\right)\right]$ (7)

where $p_{ij}$ is the predicted classification value, $\hat{p}_{ij}$ is the true classification value, the network has $n$ output scales, $\sigma$ is the sigmoid function, and $\mathbb{1}_{ij}^{obj}$ is 1 when the Anchor is responsible for prediction and 0 otherwise.
For frame regression, the invention uses the mean-square-error loss function:

$L_{xy} = \sum_{i=1}^{n}\sum_{j} \mathbb{1}_{ij}^{obj}\left[(x_{ij}-\hat{x}_{ij})^2 + (y_{ij}-\hat{y}_{ij})^2\right]$ (8)

$L_{wh} = \sum_{i=1}^{n}\sum_{j} \mathbb{1}_{ij}^{obj}\left[(w_{ij}-\hat{w}_{ij})^2 + (h_{ij}-\hat{h}_{ij})^2\right]$ (9)

where $x_{ij}$, $y_{ij}$, $w_{ij}$, $h_{ij}$ are the predicted box center coordinates, length and width, and $\hat{x}_{ij}$, $\hat{y}_{ij}$, $\hat{w}_{ij}$, $\hat{h}_{ij}$ are the true box center coordinates, width and height.
The summed loss function is given by:

$\mathrm{LOSS} = \alpha_{obj}L_{obj} + \alpha_{noobj}L_{noobj} + \alpha_{class}L_{class} + \alpha_{wh}L_{wh} + \alpha_{xy}L_{xy}$ (10)

where $\alpha_{obj}$, $\alpha_{noobj}$, $\alpha_{class}$, $\alpha_{wh}$ and $\alpha_{xy}$ are the weights of the respective loss functions of equations (5) to (9).
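A minimal sketch of equation (10), together with the weight settings used in S106 and S107 below to isolate the two tasks; the loss values here are placeholders.

```python
# Sketch: weighted sum of the five loss terms of equations (5)-(9).
def total_loss(L, alpha):
    """L, alpha: dicts keyed by 'obj', 'noobj', 'class', 'wh', 'xy'."""
    return sum(alpha[k] * L[k] for k in ("obj", "noobj", "class", "wh", "xy"))

losses = {"obj": 0.8, "noobj": 2.1, "class": 1.3, "wh": 0.4, "xy": 0.5}
# Frame-regression round (S106): alpha_class = 0, alpha_noobj = 0.01, rest 1.
regression_w = {"obj": 1, "noobj": 0.01, "class": 0, "wh": 1, "xy": 1}
# Classification round (S107): alpha_class = 1, all other weights 0.
classification_w = {"obj": 0, "noobj": 0, "class": 1, "wh": 0, "xy": 0}
print(total_loss(losses, regression_w), total_loss(losses, classification_w))
```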
S106, on the premise that the decoding algorithm and the summed loss function are determined, frame-regression task training is performed on the target detection model with the first training set, obtaining a target detection model that has undergone one round of independently performed frame-regression task training.
Specifically, in one independently implemented frame-regression task training, $\alpha_{class}$ in the summed loss function is set to zero, which is equivalent to training only the frame-regression output capability of the target detection model.

Specifically, most images of the training set are used as the train set and the rest as the validation set. The train set is trained at a first learning rate, exemplarily 0.001, with the validation set used for verification; $\alpha_{class}$ is set to 0, $\alpha_{noobj}$ to 0.01 and the remaining weights to 1, and this frame-regression task training is stopped when the loss on the validation set no longer decreases. In other embodiments, $\alpha_{class}$ may be set to a weight much smaller than the other loss coefficients so that training concentrates on the frame-regression task, with $\alpha_{noobj}$ likewise given a smaller weight.
S107, on the premise that the decoding algorithm and the summed loss function are determined, classification task training is performed on the target detection model with the second training set, obtaining a target detection model that has undergone one round of independently performed classification task training.
Specifically, in one independently implemented classification task training, $\alpha_{wh}$ and $\alpha_{xy}$ in the summed loss function are set to zero, which is equivalent to training only the classification output capability of the target detection model.

Specifically, the second training set is used for this training: most of it serves as the train set and the rest as the validation set. The train set is trained at a second learning rate, exemplarily 0.001, with the validation set used for verification. In particular, this embodiment sets $\alpha_{class}$ to 1 and all other weights to 0. The classification task training is stopped when the loss on the validation set no longer decreases.
S108, S106 and S107 are repeated cyclically in sequence, with the first and second learning rates gradually reduced over the cycles, until the loss of the frame-regression task training at the first learning rate no longer decreases compared with the previous frame-regression training while, at the same time, the loss of the classification task training at the second learning rate no longer decreases compared with the previous classification training.

Specifically, the first learning rate used in each execution of S106 is lower than that used in the previous execution of S106; for example, if the previous first learning rate was 0.001, this time it may be 0.0005. Likewise, the second learning rate used in each execution of S107 is lower than that used in the previous execution of S107; for example, if the previous second learning rate was 0.001, this time it may be 0.0005. The first and second learning rates may differ within a cycle. Also, since S106 and S107 are performed repeatedly, at the start of the first cycle S107 may be performed before S106.
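A hedged sketch of this alternating schedule; the halving of the learning rates, the stopping test and the train_epoch/val_loss helpers are assumptions rather than the patent's exact procedure.

```python
# Sketch of the S106-S108 alternation with decaying learning rates.
def train_task(model, data, weights, lr, train_epoch, val_loss):
    """Train one task until its validation loss stops decreasing."""
    best = float("inf")
    while True:
        train_epoch(model, data, weights, lr)   # hypothetical helper
        loss = val_loss(model, data, weights)   # hypothetical helper
        if loss >= best:
            return best
        best = loss

def cyclic_training(model, set1, set2, reg_w, cls_w, train_epoch, val_loss,
                    lr1=1e-3, lr2=1e-3):
    prev_reg = prev_cls = float("inf")
    while True:
        reg = train_task(model, set1, reg_w, lr1, train_epoch, val_loss)  # S106
        cls = train_task(model, set2, cls_w, lr2, train_epoch, val_loss)  # S107
        if reg >= prev_reg and cls >= prev_cls:
            break                        # S108: neither task still improving
        prev_reg, prev_cls = reg, cls
        lr1, lr2 = lr1 / 2, lr2 / 2      # e.g. 0.001 -> 0.0005
```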
S109, the model is fine-tuned with a lower $\alpha_{noobj}$ weight.

Specifically, a learning rate lower than the first and second learning rates used in the last cycle is set, and the first training set is used to fine-tune the target detection model obtained in S108 as a whole, until the total loss of the model on the verification set no longer decreases. Exemplarily, in the summed loss function of this training $\alpha_{noobj}$ is set to 0.01 and the remaining weights to 1, so as to reduce the overall weight of $L_{noobj}$ in the fine-tuning training.
S110, the model is tested.

The trained target detection model obtained in S108 or S109 is used to predict images, with the original, non-enhanced small-scale data set as the test set. The performance of the model is evaluated according to the accuracy of the prediction results.
Specific Embodiment 1
In a specific embodiment, substation equipment defect images are data-enhanced and noise-processed and then used as the input of the target detection model of the invention. FIGS. 4 and 5 show noise-processed images of the second training set: FIG. 4 shows a processing result with maximum noise intensity 127 whose region of interest contains the target object cattle, and FIG. 5 shows a processing result with maximum noise intensity 255 whose region of interest contains a respirator. There is no clear boundary between the area containing the target object and the background, and more background information is retained closer to the region of interest, which is why the invention constitutes a soft region-of-interest mechanism. In contrast, in a picture containing a candidate region under a hard region-of-interest mechanism, the image outside the candidate region is completely black, i.e. the background information of every region outside the candidate region is zero. A pre-trained MobileNet V2 network with input resolution 320 × 224 is used as the depth feature extraction network of the target detection model; two backbone output feature maps, of sizes 7 × 10 and 14 × 20, are selected as the two outputs of the depth feature extraction network. The normalized Anchor sizes generated from the data set are (0.73 × 0.79), (0.54 × 0.42), (0.33 × 0.71), (0.24 × 0.25), (0.16 × 0.46) and (0.07 × 0.16). After training with the algorithm of the invention, partial detection results on the substation equipment defect images are shown in FIG. 6, where (a) is a discoloration failure of a respirator, (b) a normal respirator, (c) a breakage failure of an insulator, and (d) a bird-nest foreign object.
Specific Embodiment 2
A part of the VOC2007 data set is selected as the small-scale data set for the target detection model, labeled, data-enhanced and noise-processed, and then used as the input of the target detection model. A pre-trained 320 × 224 MobileNet V2 network performs feature extraction, with two outputs of sizes 7 × 10 and 14 × 20. The normalized Anchor sizes generated from the data set are (0.50 × 0.72), (0.46 × 0.33), (0.30 × 0.36), (0.20 × 0.56), (0.17 × 0.27) and (0.10 × 0.11). After training with the algorithm of the invention, partial image detection results on the VOC2007 data set are shown in FIG. 7, where (a) is a bus and (b) a cow.

Claims (10)

1. An improved target detection method based on region-of-interest training on a small-scale data set, in which an image target detection result is obtained through a target detection model, characterized in that: the target detection model comprises a multi-layer-output depth feature extraction network and a multi-scale fusion detection head; and the training process of the target detection model comprises a stage in which frame-regression task training and classification task training are performed independently, in sequence and cyclically.
2. The object detection method according to claim 1, characterized in that: the frame regression task training and the classification task training are performed on the target detection model using the small-scale data set marked with regions of interest.
3. The object detection method according to claim 1, characterized in that: pre-training the deep feature extraction network using a large-scale dataset.
4. The object detection method according to claim 2, characterized in that: the frame regression task training is performed on the target detection model using a first training set obtained by applying a first data enhancement to the small-scale data set, and the classification task training is performed on the target detection model using a second training set obtained by applying a second data enhancement to the first training set; and each image of the second training set retains partial global information of the picture outside the region of interest.
5. The object detection method according to claim 4, characterized in that: the first data enhancement is used to obtain a first training set larger than the small-scale data set through one or more of flipping, translation, blurring, scaling and cropping; and the second data enhancement is used to preserve part of the background information of a background area of the image according to its distance from the region of interest, and comprises adding noise.
6. The object detection method according to claim 5, characterized in that: the noise-adding method is that, for a picture marked with several regions of interest, the amplitude $n_{x,y}$ of the noise added to its pixel $p_{x,y}$ is $\min(b, a\times d)$, where $d$ is the shortest distance from pixel $p_{x,y}$ to all regions of interest, $a$ is a noise intensity parameter, and $b$ is the maximum noise intensity.
7. The object detection method according to claim 2, characterized in that: in the multi-scale fusion detection head, the feature pyramid network structure is used to up-sample, fuse and convolve, layer by layer, the feature maps of different sizes output by the depth feature extraction network, obtaining target detection outputs at $n$ scales, equal in number to the $n$ detection heads.
8. The object detection method according to claim 2, characterized in that: each detection head of the multi-scale fusion detection head comprises a classification output layer used for classification task training and a regression output layer used for frame regression task training.
9. The object detection method according to claim 1, characterized in that: the learning rate of each frame regression task training is lower than that of the last frame regression task training, and meanwhile, the learning rate of each classification task training is lower than that of the last classification task training.
10. The object detection method according to claim 4, characterized in that: after the stage is finished, the target detection model is fine-tuned using the first training set.
CN202010383794.XA 2020-05-08 2020-05-08 Improved target detection method based on region of interest training on small-scale data set Active CN111783819B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202010383794.XA CN111783819B (en) 2020-05-08 2020-05-08 Improved target detection method based on region of interest training on small-scale data set

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202010383794.XA CN111783819B (en) 2020-05-08 2020-05-08 Improved target detection method based on region of interest training on small-scale data set

Publications (2)

Publication Number Publication Date
CN111783819A true CN111783819A (en) 2020-10-16
CN111783819B CN111783819B (en) 2024-02-09

Family

ID=72753473

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202010383794.XA Active CN111783819B (en) 2020-05-08 2020-05-08 Improved target detection method based on region of interest training on small-scale data set

Country Status (1)

Country Link
CN (1) CN111783819B (en)

Patent Citations (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN109614985A (en) * 2018-11-06 2019-04-12 华南理工大学 A kind of object detection method based on intensive connection features pyramid network
CN109615016A (en) * 2018-12-20 2019-04-12 北京理工大学 A kind of object detection method of the convolutional neural networks based on pyramid input gain
CN110766098A (en) * 2019-11-07 2020-02-07 中国石油大学(华东) Traffic scene small target detection method based on improved YOLOv3
CN111046923A (en) * 2019-11-26 2020-04-21 佛山科学技术学院 Image target detection method and device based on bounding box and storage medium
CN111091105A (en) * 2019-12-23 2020-05-01 郑州轻工业大学 Remote sensing image target detection method based on new frame regression loss function

Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
YUE WU et al.: "Rethinking Classification and Localization for Object Detection", arXiv, pages 1-13 *

Cited By (7)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN112990348A (en) * 2021-04-12 2021-06-18 华南理工大学 Small target detection method for self-adjustment feature fusion
CN112990348B (en) * 2021-04-12 2023-08-22 华南理工大学 Small target detection method based on self-adjusting feature fusion
CN113536896A (en) * 2021-05-28 2021-10-22 国网河北省电力有限公司石家庄供电分公司 Small target detection method, device and storage medium based on improved fast RCNN
CN113673510A (en) * 2021-07-29 2021-11-19 复旦大学 Target detection algorithm combining feature point and anchor frame joint prediction and regression
CN113673510B (en) * 2021-07-29 2024-04-26 复旦大学 Target detection method combining feature point and anchor frame joint prediction and regression
CN113808084A (en) * 2021-08-25 2021-12-17 杭州安脉盛智能技术有限公司 Model-fused online tobacco bale surface mildew detection method and system
CN114299366A (en) * 2022-03-10 2022-04-08 青岛海尔工业智能研究院有限公司 Image detection method and device, electronic equipment and storage medium

Also Published As

Publication number Publication date
CN111783819B (en) 2024-02-09

Similar Documents

Publication Publication Date Title
CN110443143B (en) Multi-branch convolutional neural network fused remote sensing image scene classification method
CN111126472B (en) SSD (solid State disk) -based improved target detection method
CN111783819A (en) Improved target detection method based on region-of-interest training on small-scale data set
CN110348319B (en) Face anti-counterfeiting method based on face depth information and edge image fusion
CN109241982B (en) Target detection method based on deep and shallow layer convolutional neural network
CN111027493B (en) Pedestrian detection method based on deep learning multi-network soft fusion
CN111723748A (en) Infrared remote sensing image ship detection method
CN111652317B (en) Super-parameter image segmentation method based on Bayes deep learning
CN111950453A (en) Optional-shape text recognition method based on selective attention mechanism
CN112464911A (en) Improved YOLOv 3-tiny-based traffic sign detection and identification method
CN112348036A (en) Self-adaptive target detection method based on lightweight residual learning and deconvolution cascade
CN110245620B (en) Non-maximization inhibition method based on attention
CN113160062B (en) Infrared image target detection method, device, equipment and storage medium
CN113076871A (en) Fish shoal automatic detection method based on target shielding compensation
CN111680705B (en) MB-SSD method and MB-SSD feature extraction network suitable for target detection
CN114841972A (en) Power transmission line defect identification method based on saliency map and semantic embedded feature pyramid
CN111553414A (en) In-vehicle lost object detection method based on improved Faster R-CNN
CN112528961A (en) Video analysis method based on Jetson Nano
CN112733942A (en) Variable-scale target detection method based on multi-stage feature adaptive fusion
CN116129291A (en) Unmanned aerial vehicle animal husbandry-oriented image target recognition method and device
CN114782410A (en) Insulator defect detection method and system based on lightweight model
CN110991374B (en) Fingerprint singular point detection method based on RCNN
CN116740758A (en) Bird image recognition method and system for preventing misjudgment
CN115546187A (en) Agricultural pest and disease detection method and device based on YOLO v5
CN112347967A (en) Pedestrian detection method fusing motion information in complex scene

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant