CN111783819B - Improved target detection method based on region of interest training on small-scale data set - Google Patents


Info

Publication number
CN111783819B
CN111783819B
Authority
CN
China
Prior art keywords
training
target detection
detection model
scale
interest
Prior art date
Legal status
Active
Application number
CN202010383794.XA
Other languages
Chinese (zh)
Other versions
CN111783819A (en)
Inventor
尹子会
付炜平
赵冀宁
孟荣
贾志辉
董俊虎
杜江龙
赵振兵
Current Assignee
State Grid Corp of China SGCC
North China Electric Power University
Maintenance Branch of State Grid Hebei Electric Power Co Ltd
Original Assignee
State Grid Corp of China SGCC
North China Electric Power University
Maintenance Branch of State Grid Hebei Electric Power Co Ltd
Priority date
Filing date
Publication date
Application filed by State Grid Corp of China SGCC, North China Electric Power University, and Maintenance Branch of State Grid Hebei Electric Power Co Ltd
Priority to CN202010383794.XA
Publication of CN111783819A
Application granted
Publication of CN111783819B
Legal status: Active
Anticipated expiration

Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00Pattern recognition
    • G06F18/20Analysing
    • G06F18/21Design or setup of recognition systems or techniques; Extraction of features in feature space; Blind source separation
    • G06F18/214Generating training patterns; Bootstrap methods, e.g. bagging or boosting
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N20/00Machine learning
    • YGENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
    • Y02TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
    • Y02TCLIMATE CHANGE MITIGATION TECHNOLOGIES RELATED TO TRANSPORTATION
    • Y02T10/00Road transport of goods or passengers
    • Y02T10/10Internal combustion engine [ICE] based vehicles
    • Y02T10/40Engine management systems

Abstract

The invention provides an improved target detection method based on region-of-interest training on a small-scale data set, belonging to the technical field of image analysis. An image target detection result is obtained through a target detection model whose training process includes a stage in which frame regression task training and classification task training are carried out sequentially and independently: a first training set, obtained by applying a first data enhancement to the small-scale data set, is used for the frame regression task training, and a second training set, obtained by applying a second data enhancement to the first training set, is used for the classification task training. Each image of the second training set retains partial global information from the part of the image outside the region of interest. By introducing a region-of-interest mechanism in the training stage, the method overcomes the overfitting that existing One-Stage target detection models are prone to when trained on small-scale data sets, and yields an accurate target detection model.

Description

Improved target detection method based on region of interest training on small-scale data set
Technical Field
The invention belongs to the technical field of image analysis and relates to an improved target detection method based on region-of-interest training on a small-scale data set.
Background
Deep Learning (DL) is a research direction in the field of Machine Learning (ML). By learning features with a neural network instead of extracting them manually, it greatly improves learning efficiency and accuracy, and it is therefore widely applied in image classification, object detection, image segmentation, natural language processing, and other fields. However, because deep learning methods are generally data-driven, they place high demands on the quantity, richness, and accuracy of the sample data. In the field of target detection, if the amount and richness of sample data are insufficient, the deep learning model learns not only the target features in the samples but also the background noise, so the model overfits the data. After overfitting, the recall rate of target detection drops severely and detection performance is seriously degraded.
Target detection methods based on deep learning generally fall into two types. First, Two-Stage detection algorithms divide the detection problem into two stages: the first stage generates candidate regions, and the second stage classifies the targets and refines their positions; representative models include the Region-based Convolutional Neural Network (R-CNN), Fast R-CNN, and the like. Second, One-Stage detection algorithms directly predict the class probability and position information of targets with a single network, without generating candidate regions; typical representatives are the SSD (Single Shot MultiBox Detector) and YOLO (You Only Look Once) models.
Because One-Stage target detection models lack a candidate-region pre-screening mechanism like that of Two-Stage algorithms, they often overfit the training set data more severely during classification training, especially on small-scale data sets.
Disclosure of Invention
The invention aims to provide an improved target detection method based on region-of-interest training on a small-scale data set: a region-of-interest mechanism is introduced in the training stage, overcoming the overfitting that existing One-Stage target detection models are prone to when trained on small-scale data sets, and thereby obtaining an accurate target detection model.
The technical scheme provided by the invention is an improved target detection method based on region-of-interest training on a small-scale data set, in which an image target detection result is obtained through a target detection model comprising a multi-layer-output depth feature extraction network and a multi-scale fusion detection head. The training process of the target detection model includes a stage in which frame regression task training and classification task training are carried out sequentially and independently. This independent training can be realized by adjusting the loss coefficients in the loss function, so that the classification task training in this stage can learn partial global information of each picture of the training set, while the frame-recognition learning of the region of interest by the frame regression task training is not affected.
In one embodiment of the invention, the frame regression task training and the classification task training are performed on the target detection model using a small-scale data set with marked regions of interest. The method is particularly suitable for target detection after training on a small-scale data set containing the target object: even with only a limited data set available, it can provide more accurate target detection results after learning.
In one embodiment of the present invention, the depth feature extraction network is pre-trained on a large-scale data set, where the large-scale data set is a classification data set whose classification categories are essentially independent of the categories of the targets to be recognized. During pre-training the depth feature extraction network acts as a pure classification model with no regression branch, and the pre-trained network weights reduce the training time subsequently needed on the small-scale data set. When a large-scale data set without classification labels is used, it must first be transformed into the classification format required for pre-training.
In one embodiment of the invention, the frame regression task training is performed on the target detection model using a first training set obtained by applying a first data enhancement to the small-scale data set, and the classification task training is performed using a second training set obtained by applying a second data enhancement to the first training set; each image of the second training set retains partial global information from the part of the image outside the region of interest. Different small-scale training sets are thus used in the stage of cyclically, sequentially, and independently carrying out frame regression task training and classification task training: the first training set gives the One-Stage target detection model its frame recognition capability, while the second training set gives it its classification capability, and this classification capability suppresses overfitting.
As a refinement of the above embodiment, the first data enhancement is used to obtain a first training set larger in scale than the small-scale data set, using one or more of flipping, translation, blurring, scaling, and cropping; the second data enhancement partially retains the background information of each background region of the image according to its distance from the image's region of interest, and includes adding noise. The first data enhancement yields richer training data, while the second data enhancement yields a second training set of essentially the same scale as the first but containing only partial global information, retaining some background information and improving the classification and recognition capability of the trained target detection model.
In one embodiment of the present invention, an exemplary method for adding noise is provided: for a picture with several marked regions of interest, the amplitude n_{x,y} of the noise added to pixel p_{x,y} is min(b, a×d), where d is the shortest distance from p_{x,y} to any region of interest, a is the noise intensity parameter, and b is the maximum noise-adding intensity. In a further refinement, the training results may be optimized by adjusting these parameters.
In one embodiment of the invention, in the multi-scale fusion detection head, the feature maps of different sizes output by the depth feature extraction network are up-sampled, fused, and convolved layer by layer using a feature pyramid network structure, yielding target detection outputs at n scales, equal to the number n of detection heads.
In one embodiment of the invention, each detection head of the multi-scale fusion detection head comprises a classification output layer for classification task training and a regression output layer for frame regression task training. In one independent training run, if the loss coefficient of the classification output layer carries most of the total loss weight, training focuses on the classification task; if the loss coefficient of the regression output layer carries most of the weight, training focuses on the frame regression task.
In one embodiment of the present invention, the learning rate of each training of the frame regression task is lower than the learning rate of the previous training of the frame regression task, and the learning rate of each training of the classification task is lower than the learning rate of the previous training of the classification task.
In one embodiment of the invention, the target detection model is fine-tuned using the first training set after the cyclic independent training stage ends. When fine-tuning the model, the weights of the individual losses differ much less, so that classification task training and frame regression task training are considered simultaneously.
Compared with the prior art, the method has the beneficial effects that:
the invention remedies the existing One-Stage target detection's excessive dependence on data by improving both the data enhancement and the training method. The training input data are processed with locally bounded noise: the farther from the target, the greater the noise intensity, which makes it harder for the feature extraction network to fit the background noise of the input picture and reduces the possibility of the model overfitting on a small data set. For regions close to the target, partial background information is retained, so the network can adaptively learn features over different ranges. During training, the regression task and the classification task are trained separately, each with its own training set: the regression task, which needs more global information, is fed noise-free pictures from which global information is more easily extracted; the classification task, which must attend more to local detail, is fed noise-added pictures that emphasize target features. Testing shows the method has broad practical significance on small-scale data sets. The invention is feasible and offers a useful reference for the design of solutions to related problems.
Drawings
FIG. 1 is a schematic diagram of a target detection model according to an embodiment of the present invention;
FIG. 2 is a flowchart of a training method for a target detection model according to an embodiment of the present invention;
FIG. 3 is a schematic data flow diagram of the embodiment of FIG. 2 when training the object detection model;
FIG. 4 is a representation of an image containing a target object in a second training set after noise addition with one maximum noise-adding intensity, according to an embodiment of the present invention;
FIG. 5 is a representation of an image containing a target object in a second training set after noise addition with another maximum noise-adding intensity, in an embodiment of the present invention;
FIG. 6 is a graph showing the results of partial inspection of a defective image of substation equipment after training using the method of the present invention in an application example;
FIG. 7 is a result image from detecting part of the VOC2007 data set using the method of the present invention in one application example.
Detailed Description
Firstly, the basic idea of the technical scheme of the invention is as follows: when training a One-Stage target detection model, the input data are first processed to adjust the learning difficulty of different regions of the image, so that both the target classification task training and the frame regression task training can adapt their focus range, i.e., the region of interest, and in particular the classification task training can learn part of the global information while focusing on local information. By contrast, a Two-Stage object detection model contains a module that identifies candidate regions as regions of interest and, at classification time, directly restricts the extracted full-image features to the target portion — a hard region-of-interest mechanism. The present method is different: it adjusts the learning difficulty of different regions, for example by adding noise of gradually increasing intensity, without setting a sharp boundary. This soft mechanism lets the classification task training focus on the target.
The technical scheme of the invention is based on a target detection model, shown in FIG. 1, that is essentially One-Stage and comprises a multi-layer-output depth feature extraction network 1 and a multi-scale fusion detection head 2. The depth feature extraction network comprises n layers of backbone networks from top to bottom; each backbone layer comprises one or more convolution layers and outputs a feature map of one scale to the backbone layer below it, and several backbone layers are selected from top to bottom to output the feature maps they produce to the multi-scale fusion detection head 2. The deeper the backbone layer, the smaller the scale of its output feature map. The multi-scale fusion detection head 2 contains several independent Detection Heads, equal in number to the selected backbone layers and matched to their scales. Inside it, the feature maps output by the selected backbone layers are up-sampled and tensor-concatenated layer by layer; except for the bottom-most (smallest-scale) detection head, which directly takes the feature map output by the bottom-most backbone layer, each detection head's input is the tensor-concatenated feature map of its layer. The output of each detection head, after processing by its regression output layer and classification output layer respectively, forms the target detection output of the model. In the embodiment of FIG. 1, with 1 ≤ i < j < n, the three backbone layers at layers i, j, and n are selected, so three detection heads of different scales are arranged from bottom to top in the multi-scale fusion detection head 2; in other embodiments, a different number of selected backbone layers gives a correspondingly different number of detection heads.
In the embodiment shown in fig. 2 and 3, based on the above-mentioned object detection model structure, training is performed on the object detection model through the following steps S100 to S110 to obtain each node weight value of the object detection model.
S100, implementing pre-training based on a large-scale data set on the multi-layer output depth feature extraction network to obtain initial parameter values of the target detection model.
Specifically, the depth feature extraction network of the multi-layer output in the target detection model is pre-trained using a large-scale dataset. And taking the weight value obtained by pre-training as an initial parameter value of the depth feature extraction network in the target detection model, so as to achieve the purposes of accelerating the convergence speed and improving the detection precision.
As an example, in the embodiment of the present invention, the large-scale data set is the image data set provided by ImageNet, and the multi-layer-output depth feature extraction network is a MobileNetV2 network.
S101, obtaining an Anchor point (Anchor) required for training the target detection model using a small-scale data set marked with the region of interest.
Specifically, the invention marks the region of interest by setting a Ground Truth target frame for each picture of the small-scale data set; an exemplary region of interest is the minimum rectangle covering the equipment of interest. Based on the small-scale data set, the sizes of the Ground Truth target frames are normalized and then cluster-analyzed. In this embodiment the size distribution of the Ground Truth target frames is analyzed with the Kmeans algorithm, yielding a set of size clustering results comprising several different scales, each scale corresponding to the shape of an Anchor Box; a set containing multiple anchor frame scales is thereby established.
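The clustering step above can be sketched as follows — a minimal NumPy k-means over normalized Ground Truth box sizes. The patent names the Kmeans algorithm but no particular implementation; the deterministic area-sorted initialization is an illustrative choice:

```python
import numpy as np

def cluster_anchor_sizes(box_wh, k, iters=50):
    """Cluster normalized Ground Truth box sizes into k anchor-box shapes.

    box_wh is an (N, 2) array of normalized (width, height) pairs.
    Centers are initialized from the boxes sorted by area, which keeps
    this sketch deterministic.
    """
    order = np.argsort(box_wh.prod(axis=1))
    init = order[np.linspace(0, len(box_wh) - 1, k).astype(int)]
    centers = box_wh[init].astype(float).copy()
    for _ in range(iters):
        # Assign every box to its nearest current anchor shape.
        d = np.linalg.norm(box_wh[:, None, :] - centers[None, :, :], axis=2)
        labels = d.argmin(axis=1)
        for j in range(k):
            if np.any(labels == j):  # guard against empty clusters
                centers[j] = box_wh[labels == j].mean(axis=0)
    return centers
```

The returned centers are the anchor-frame shapes that populate the anchor scale set.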
Each feature point of the feature map at each scale serves as an anchor point, and each anchor point corresponds to the anchor frames of the several anchor-frame sizes in the set. The number of anchor frames that the multi-scale fusion detection head must detect for one image is then:

N = Σ_i w_i × h_i × k

where w_i and h_i are the width and height of the ith feature map and k is the number of anchor-frame sizes corresponding to the ith feature map. In this embodiment, the scale of each feature map corresponds to one selected backbone network; that is, the scale of the feature map convolved and output by that backbone network is a fixed scale in the anchor-frame set. Concretely, a 7×10 feature map has 70 pixels, i.e., 70 feature points and 70 anchor points; if each anchor point corresponds to 3 anchor-frame sizes, the detection head for that feature map detects 210 anchor frames, and the counts for the feature maps at the other scales follow the same calculation. This count is also used to configure the decoder that merges and decodes the outputs of the multi-scale fusion detection head.
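Assuming the per-head count described above, the total anchor count is just a sum of w_i × h_i × k over the selected feature maps; a trivial sketch:

```python
def total_anchor_count(feature_map_sizes, k):
    """Number of anchor frames one image yields across all detection
    heads: the sum of w_i * h_i * k over the selected feature maps,
    with k anchor shapes per anchor point as in the text's example."""
    return sum(w * h * k for (w, h) in feature_map_sizes)
```

For the 7×10 example with three anchor shapes this yields 210 anchor frames for that head.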
S102, a first training set for performing frame regression task training on the target detection model and a second training set for performing classification task training on the target detection model are obtained by using the small-scale data set marked with the region of interest.
In this embodiment, a larger number of pictures is obtained by enhancing the small-scale data set pictures through flipping, translation, blurring, scaling, cropping, and the like; the collection of these pictures is used as the first training set.
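A sketch of this first data enhancement, with plain NumPy stand-ins for flip, translation, and crop (blur and scaling, also listed, are omitted for brevity; in a real pipeline each Ground Truth frame must be transformed together with its image):

```python
import numpy as np

def first_enhancement(img, rng):
    """Produce several enhanced variants of one HxWxC image array.

    Illustrative stand-ins only: horizontal flip, a crude horizontal
    translation via roll, and a center crop.
    """
    h, w = img.shape[:2]
    variants = [img, img[:, ::-1]]                        # horizontal flip
    shift = int(rng.integers(1, w // 4))
    variants.append(np.roll(img, shift, axis=1))          # crude translation
    variants.append(img[h // 8: h - h // 8, w // 8: w - w // 8])  # center crop
    return variants
```

The union of such variants over all pictures forms the first training set.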
In this embodiment, the second training set is obtained by adding noise to the first training set according to the distance between each pixel of a picture and the Ground Truth target frames. The specific noise-adding method is: for a picture with several marked regions of interest, the amplitude n_{x,y} of the noise added to pixel p_{x,y} is

n_{x,y} = min(b, a×d)

where d is the shortest distance from pixel p_{x,y} to any region of interest, a is the noise intensity parameter, and b is the maximum noise-adding intensity. The collection of the noise-processed pictures is used as the second training set. With this noise added, partial background information of the background area outside the region of interest is retained; that is, in each picture of the second training set the region of interest has no sharp visual boundary, and the closer a pixel lies to the boundary of the picture's region of interest, the more background information is retained. By contrast, in a Two-Stage target detection model, after the candidate region is identified in the first stage, the detection information passed to the second stage contains no background information outside the candidate region at all.
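The noise-adding rule n_{x,y} = min(b, a×d) can be sketched per image as follows; the rectangular-distance computation and the uniform noise distribution are illustrative assumptions, since the text fixes only the amplitude rule:

```python
import numpy as np

def distance_to_boxes(h, w, boxes):
    """Per-pixel shortest Euclidean distance to a set of axis-aligned
    regions of interest, each given as (x0, y0, x1, y1) in pixels."""
    ys, xs = np.mgrid[0:h, 0:w]
    d = np.full((h, w), np.inf)
    for x0, y0, x1, y1 in boxes:
        dx = np.maximum(np.maximum(x0 - xs, xs - x1), 0)
        dy = np.maximum(np.maximum(y0 - ys, ys - y1), 0)
        d = np.minimum(d, np.hypot(dx, dy))
    return d

def add_roi_noise(img, boxes, a=0.5, b=40.0, seed=0):
    """Second data enhancement sketch for an HxWxC image: noise
    amplitude min(b, a * d), zero inside every region of interest;
    the same uniform noise field is applied to all channels."""
    rng = np.random.default_rng(seed)
    d = distance_to_boxes(img.shape[0], img.shape[1], boxes)
    amp = np.minimum(b, a * d)
    noise = rng.uniform(-1, 1, img.shape[:2])[..., None] * amp[..., None]
    return np.clip(img.astype(np.float64) + noise, 0, 255).astype(np.uint8)
```

Pixels inside the Ground Truth frames are untouched (d = 0), and the perturbation grows linearly with distance until it saturates at b.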
S103, in the multi-scale fusion detection heads, acquiring target detection data of each detection head through multi-scale fusion.
Specifically, in the forward propagation of the depth feature extraction network, using the initial parameters obtained by the pre-training of S100, the output feature maps of different sizes from the backbone networks at several depths are selected as the output of the network. In the multi-scale fusion detection head, these feature maps are up-sampled, fused, and convolved using a Feature Pyramid Network (FPN) structure, giving target detection outputs at n scales, equal to the number n of detection heads, each of size:

w_i × h_i × k × (c + 5)

where c is the number of target categories and w_i and h_i are the width and height of the ith output convolution feature map. For every Anchor, the output comprises c classification results plus four coordinates and one confidence value for the corresponding prediction frame; the four coordinates are the abscissa position, the ordinate position, the prediction frame length, and the prediction frame width.
S104, configuring a decoding algorithm of the output of the multi-scale fusion detection head. The decoding algorithm aims at converting the output of the target detection model detection head into a coordinate prediction result, namely, a coordinate in a real picture.
Specifically, in this embodiment, regression training is performed with the anchor frames generated in step S101, and the anchor frame with the largest IOU against the Ground Truth target frame is selected as the anchor point responsible for predicting a given target object. The relationship between the predicted output and the actual coordinates is given by formulas (1) to (4):

x′ = x + sigmoid(p_x) × w    (1)
y′ = y + sigmoid(p_y) × h    (2)
w′ = w × e^{p_w}             (3)
h′ = h × e^{p_h}             (4)

where x′, y′, w′, h′ are the center coordinates and width/height of each Anchor after regression, x, y, w, h are the upper-left coordinates of each anchor and the anchor's width and height, and p_x, p_y, p_w, p_h are the regression values predicted by the whole target detection network during frame regression training.
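A sketch of this decoding for a single anchor; formulas (1) and (2) appear in the text, while the exponential scale terms for w′ and h′ follow the common YOLO convention and are an assumption here:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def decode_box(anchor, pred):
    """Decode one raw regression output into image coordinates.

    anchor: (x, y, w, h) upper-left corner plus anchor width/height;
    pred:   (p_x, p_y, p_w, p_h) raw network regression values.
    The sigmoid keeps the predicted center inside the anchor cell;
    the exponential scale terms are the assumed forms of (3)-(4).
    """
    x, y, w, h = anchor
    p_x, p_y, p_w, p_h = pred
    x_c = x + sigmoid(p_x) * w       # formula (1)
    y_c = y + sigmoid(p_y) * h       # formula (2)
    w_c = w * np.exp(p_w)            # assumed formula (3)
    h_c = h * np.exp(p_h)            # assumed formula (4)
    return x_c, y_c, w_c, h_c
```

With zero regression values the decoded box sits at the anchor cell's center with the anchor's own width and height.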
When the target detection model is used for prediction, for each anchor point the c classification predictions are multiplied by the anchor point's confidence value to give the confidence of the c categories. A threshold between 0 and 1, preferably 0.7, is chosen to decide whether the anchor point correctly predicts a target. For each anchor point, when the confidence of one or more classes is greater than or equal to the threshold, the anchor point's output is taken as a valid output, and non-maximum suppression is applied to obtain the final prediction frames.
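The thresholding and non-maximum suppression described above can be sketched as follows; the 0.7 threshold is the preferred value from the text, while the greedy NMS and its 0.5 IoU cutoff are illustrative assumptions:

```python
import numpy as np

def iou(a, b):
    """IoU of two (x0, y0, x1, y1) boxes."""
    ix0, iy0 = max(a[0], b[0]), max(a[1], b[1])
    ix1, iy1 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, ix1 - ix0) * max(0.0, iy1 - iy0)
    area = lambda r: (r[2] - r[0]) * (r[3] - r[1])
    union = area(a) + area(b) - inter
    return inter / union if union > 0 else 0.0

def predict_boxes(boxes, obj_conf, class_probs, thresh=0.7, nms_iou=0.5):
    """Per-anchor class confidence = class probability x objectness;
    keep anchors whose best class confidence clears the threshold,
    then apply greedy non-maximum suppression. Returns kept indices."""
    scores = class_probs * obj_conf[:, None]   # (N, c) class confidences
    best = scores.max(axis=1)
    keep = np.where(best >= thresh)[0]
    order = keep[np.argsort(-best[keep])]      # high confidence first
    final = []
    for i in order:
        if all(iou(boxes[i], boxes[j]) < nms_iou for j in final):
            final.append(i)
    return final
```

Heavily overlapping anchors thus collapse to the single highest-confidence prediction frame.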
S105, configuring a sum loss function in target detection model training.
Specifically, in this embodiment, an Anchor responsible for detecting a target has confidence C equal to 1; an Anchor not responsible for a target whose prediction frame has an IOU with the Ground Truth greater than 0.5 is ignored; every other Anchor has confidence C equal to 0.
The present embodiment uses a cross-entropy function as the loss function for confidence prediction:

L_obj = -Σ_{i=1}^{n} Σ_j Ĉ_ij log σ(C_ij)    (5)

L_noobj = -Σ_{i=1}^{n} Σ_j m_ij (1 - Ĉ_ij) log(1 - σ(C_ij))    (6)

where C_ij is the predicted confidence value and Ĉ_ij is the true confidence value, the network has n output scales, and σ is the sigmoid function; Ĉ_ij is 1 when the Anchor is responsible for prediction and 0 when it is not, and the mask m_ij is 0 when the Anchor is ignored and 1 otherwise.
A cross-entropy function is likewise used as the loss function for the classification prediction network:

L_class = -Σ_{i=1}^{n} Σ_j 1_ij^obj [p̂_ij log σ(p_ij) + (1 - p̂_ij) log(1 - σ(p_ij))]    (7)

where p_ij is the predicted class value and p̂_ij is the true class value, the network has n output scales, σ is the sigmoid function, and the indicator 1_ij^obj is 1 when the Anchor is responsible for prediction and 0 when it is not.
For frame regression, the method uses a mean-square-error loss function:

L_wh = Σ_{i=1}^{n} Σ_j 1_ij^obj [(w_ij - ŵ_ij)² + (h_ij - ĥ_ij)²]    (8)

L_xy = Σ_{i=1}^{n} Σ_j 1_ij^obj [(x_ij - x̂_ij)² + (y_ij - ŷ_ij)²]    (9)

where x_ij, y_ij, w_ij, h_ij are the prediction frame's center coordinates and width/height, and x̂_ij, ŷ_ij, ŵ_ij, ĥ_ij are the true frame's center coordinates and width/height.
The sum loss function is:

LOSS = α_obj L_obj + α_noobj L_noobj + α_class L_class + α_wh L_wh + α_xy L_xy    (10)

where α_obj, α_noobj, α_class, α_wh, α_xy are the weights of the respective loss functions of formulas (5) to (9).
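The sum loss of formula (10) reduces to a weighted sum over the five loss terms; a minimal sketch showing how the weight settings of the two training phases select what is learned:

```python
def total_loss(parts, weights):
    """Weighted sum from formula (10):
    LOSS = a_obj*L_obj + a_noobj*L_noobj + a_class*L_class
           + a_wh*L_wh + a_xy*L_xy.
    Setting weights["class"] = 0 gives the pure box-regression phase;
    weights["class"] = 1 with all others 0 gives the pure
    classification phase."""
    return sum(weights[k] * parts[k] for k in ("obj", "noobj", "class", "wh", "xy"))
```

With the regression-phase weights of the text (α_class = 0, α_noobj = 0.01, the rest 1), the classification term contributes nothing to the gradient.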
And S106, on the premise of determining a decoding algorithm and a sum loss function, performing frame regression task training on the target detection model through the first training set so as to obtain the target detection model subjected to frame regression task training which is independently implemented once.
Specifically, in one independently implemented round of frame regression task training, α_class in the sum loss function is set to zero, which is equivalent to training only the frame regression output capability of the target detection model.
Training uses the first training set: most of its images serve as the train set and the rest as the valid (verification) set. The train set is trained with a first learning rate, exemplarily 0.001. With α_class set to 0, α_noobj set to 0.01, and the remaining weights reset to 1, training proceeds and is stopped when the loss on the verification set no longer decreases. In other embodiments, α_class may be set to any value much smaller than the other loss coefficients so that training focuses on the frame regression task, while α_noobj is likewise given a smaller weight.
And S107, on the premise of determining a decoding algorithm and a sum loss function, performing classification task training on the target detection model through a second training set to obtain the target detection model subjected to classification task training which is independently implemented once.
Specifically, in one independently implemented round of classification task training, α_wh and α_xy in the sum loss function are set to zero, which is equivalent to training only the classification output capability of the target detection model.
Training uses the second training set: most of it serves as the train set and the rest as the valid (verification) set. The train set is trained with a second learning rate, exemplarily 0.001. In particular, this embodiment sets α_class to 1 and the remaining weights to 0 before training, and stops the classification task training when the loss on the verification set no longer decreases.
S108: steps S106 and S107 are repeated in turn, with the first learning rate and the second learning rate gradually reduced in each cycle, until the loss of a frame regression task training at a given first learning rate no longer decreases compared with the previous frame regression task training and, at the same time, the loss of a classification task training at a given second learning rate no longer decreases compared with the previous classification task training.
Specifically, the first learning rate used each time S106 is performed is lower than that of the previous execution of S106; for example, if the previous value was 0.001, the current value may be 0.0005. Likewise, the second learning rate used each time S107 is performed is lower than that of the previous execution of S107; for example, 0.001 previously and 0.0005 currently. The first learning rate and the second learning rate may differ within a cycle. Since S106 and S107 are performed repeatedly, S107 may also be performed first at the beginning of the first cycle, followed by S106.
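The alternating schedule of S106-S108 can be sketched as below. This is a hedged sketch under stated assumptions: the `train_phase` callable (which runs one early-stopped training phase and returns its best validation loss) and the halving decay are illustrative; the patent only requires that each cycle's learning rates be lower than the previous cycle's and that the loop stop when neither task improves.

```python
def alternate_training(train_phase, lr_box=1e-3, lr_cls=1e-3, decay=0.5,
                       max_cycles=10):
    """Alternate box-regression and classification phases with decaying
    learning rates; stop when neither task's validation loss improves.

    train_phase(task, lr) -> best validation loss for that phase.
    """
    best_box = best_cls = float("inf")
    for _ in range(max_cycles):
        box_loss = train_phase("box_regression", lr_box)
        cls_loss = train_phase("classification", lr_cls)
        # Stop when neither task improves on its previous best (S108).
        if box_loss >= best_box and cls_loss >= best_cls:
            break
        best_box, best_cls = min(best_box, box_loss), min(best_cls, cls_loss)
        lr_box *= decay   # e.g. 0.001 -> 0.0005, as in the example above
        lr_cls *= decay
    return best_box, best_cls
```

With phase losses that plateau after two cycles, the loop performs one extra cycle to detect the plateau and then stops, returning the best loss seen for each task.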
S109: fine-tune the model with a lowered α_noobj weight.
Specifically, a learning rate lower than both the first learning rate and the second learning rate used in the last cycle is set, and the first training set is used to train and fine-tune the whole target detection model obtained in step S108, until the total loss on the verification set no longer decreases. Exemplarily, α_noobj in the total loss function is set to 0.01 and the remaining weights to 1 during training, so as to reduce the weight of L_noobj in the fine-tuning training.
S110, testing the model.
Using the trained target detection model obtained in S108 or S109, images are predicted with the original, unenhanced small-scale dataset as the test set, and the performance of the model is evaluated according to the accuracy of the prediction results.
First embodiment
In a specific embodiment, defect images of substation equipment are used, after data enhancement and noise addition, as the input of the target detection model of the invention. Noise-added defect images from the second training set are shown in fig. 4 and fig. 5: fig. 4 shows a processing result with a maximum noise intensity of 127, where the object of interest is a cow, and fig. 5 shows a processing result with a maximum noise intensity of 255, where the object of interest is a respirator. It can be seen that no sharp boundary exists between the area containing the object of interest and the background, and that the closer a region lies to the object of interest, the more background information it retains; this is the soft region-of-interest mechanism. In contrast, under a hard region-of-interest mechanism a picture is entirely black outside the candidate regions, i.e. the background information of any region outside the candidate regions is zero. The depth feature extraction network of the target detection model is a pre-trained MobileNetV2 network with 320×224 input resolution; two backbone feature maps, of sizes 7×10 and 14×20, are selected as the two outputs of the depth feature extraction network. The normalized anchor sizes generated from the data set are (0.73×0.79), (0.54×0.42), (0.33×0.71), (0.24×0.25), (0.16×0.46) and (0.07×0.16). After training with the algorithm of the invention, detection results for part of the substation equipment defect images are shown in fig. 6, where (a) is a respirator color-change fault, (b) a normal respirator, (c) an insulator breakage fault and (d) bird-nest foreign matter.
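The soft region-of-interest noise described above (and formalized in claim 1 as n = min(b, a×d)) can be sketched as follows. This is a minimal sketch under assumptions: regions of interest are axis-aligned rectangles (x0, y0, x1, y1) in pixel coordinates, the distance is Euclidean, and the function names are illustrative; the patent's text only specifies the shortest distance d to all regions, the intensity parameter a, and the maximum intensity b.

```python
import numpy as np

def soft_roi_noise_amplitude(h, w, rois, a=2.0, b=127):
    """Per-pixel noise amplitude n = min(b, a*d), where d is the shortest
    distance from the pixel to any region of interest (x0, y0, x1, y1).

    Pixels inside a region have d = 0, hence zero noise; the amplitude grows
    with distance from the object of interest up to the cap b, so nearby
    background is partially retained (the soft-ROI mechanism).
    """
    ys, xs = np.mgrid[0:h, 0:w]
    d = np.full((h, w), np.inf)
    for x0, y0, x1, y1 in rois:
        # Distance from each pixel to the rectangle (0 inside it).
        dx = np.maximum(np.maximum(x0 - xs, xs - x1), 0)
        dy = np.maximum(np.maximum(y0 - ys, ys - y1), 0)
        d = np.minimum(d, np.hypot(dx, dy))
    return np.minimum(b, a * d)
```

For example, with one ROI (2, 2, 5, 5) and a = 2, a pixel inside the box gets amplitude 0, while a pixel 4 px to its right gets amplitude 8; with a very large a, distant pixels saturate at b, reproducing the hard-ROI limit.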
Second embodiment
A selected part of the VOC2007 data set is used as the small-scale data set of the target detection model of the invention; after labelling, the data are enhanced and noise-added and then used as the input of the target detection model of the invention. A pre-trained MobileNetV2 network with 320×224 input resolution is used for feature extraction, and the feature extraction network is provided with two outputs of sizes 7×10 and 14×20, respectively. The normalized anchor sizes generated from the data set are (0.50×0.72), (0.46×0.33), (0.30×0.36), (0.20×0.56), (0.17×0.27) and (0.10×0.11). After training with the algorithm of the invention, detection results of the method on part of the VOC2007 images are shown in fig. 7, where (a) is a bus and (b) a cow.
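Both embodiments list normalized anchor sizes "generated according to the data set" without spelling out the procedure; a common approach in YOLO-family detectors, sketched below under that assumption, is k-means clustering on the normalized (width, height) pairs of the labelled boxes. Plain Lloyd iterations with Euclidean distance are used here for brevity; an IoU-based distance is another frequent choice, and the function name and parameters are illustrative.

```python
import random

def kmeans_anchors(whs, k=6, iters=50, seed=0):
    """Cluster normalized (width, height) box pairs into k anchor sizes."""
    rng = random.Random(seed)
    centers = rng.sample(whs, k)
    for _ in range(iters):
        # Assign each box to its nearest center.
        clusters = [[] for _ in range(k)]
        for w, h in whs:
            i = min(range(k), key=lambda j: (w - centers[j][0]) ** 2 +
                                            (h - centers[j][1]) ** 2)
            clusters[i].append((w, h))
        # Move each center to the mean of its cluster.
        new_centers = []
        for i, c in enumerate(clusters):
            if c:
                new_centers.append((sum(w for w, _ in c) / len(c),
                                    sum(h for _, h in c) / len(c)))
            else:
                new_centers.append(centers[i])  # keep an empty cluster's center
        centers = new_centers
    return sorted(centers)
```

On a toy set with two tight groups of box sizes, two clusters converge to the group means, analogous to the six anchors reported for each embodiment.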

Claims (6)

1. An improved target detection method based on region of interest training on a small-scale data set, which obtains an image target detection result through a target detection model, is characterized in that: the target detection model comprises a multi-layer output depth feature extraction network and a multi-scale fusion detection head; the training process of the target detection model comprises a stage of carrying out frame regression task training and classification task training sequentially and independently in a circulating manner;
performing frame regression task training and classification task training on the target detection model using a small-scale data set with marked regions of interest;
performing the frame regression task training on the target detection model using a first training set obtained by applying a first data enhancement to the small-scale data set, and performing the classification task training on the target detection model using a second training set obtained by applying a second data enhancement to the first training set; each image of the second training set contains global information of part of the image outside the region of interest;
the first data enhancement is used to obtain a first training set larger than the small-scale data set and comprises one or more of flipping, translation, blurring, scaling and cropping; the second data enhancement is used to partially retain the background information of a background area of an image according to the distance between that background area and a region of interest of the image, and comprises noise addition;
the noise addition method is: for a picture marked with several regions of interest, the amplitude n_{x,y} of the noise added to pixel p_{x,y} is min(b, a×d), where d is the shortest distance from pixel p_{x,y} to all the regions of interest, a is the noise intensity parameter, and b is the maximum noise intensity.
2. The target detection method according to claim 1, wherein: the depth feature extraction network is pre-trained using a large-scale dataset.
3. The target detection method according to claim 1, wherein: in the multi-scale fusion detection head, the feature maps of different sizes output by the depth feature extraction network are up-sampled, fused and convolved layer by layer using a feature pyramid network structure to obtain target detection outputs at n scales, n being equal to the number of detection heads.
4. The target detection method according to claim 1, wherein: each detection head of the multi-scale fusion detection head comprises a classification output layer for classification task training and a regression output layer for frame regression task training.
5. The target detection method according to claim 1, wherein: the learning rate of each frame regression task training is lower than that of the last frame regression task training, and meanwhile, the learning rate of each classification task training is lower than that of the last classification task training.
6. The target detection method according to claim 1, wherein: after the stage is over, fine tuning the target detection model using the first training set.
CN202010383794.XA 2020-05-08 2020-05-08 Improved target detection method based on region of interest training on small-scale data set Active CN111783819B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202010383794.XA CN111783819B (en) 2020-05-08 2020-05-08 Improved target detection method based on region of interest training on small-scale data set

Publications (2)

Publication Number Publication Date
CN111783819A CN111783819A (en) 2020-10-16
CN111783819B true CN111783819B (en) 2024-02-09

Family

ID=72753473

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202010383794.XA Active CN111783819B (en) 2020-05-08 2020-05-08 Improved target detection method based on region of interest training on small-scale data set

Country Status (1)

Country Link
CN (1) CN111783819B (en)

Families Citing this family (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN112990348B (en) * 2021-04-12 2023-08-22 华南理工大学 Small target detection method based on self-adjusting feature fusion
CN113536896B (en) * 2021-05-28 2022-07-08 国网河北省电力有限公司石家庄供电分公司 Insulator defect detection method and device based on improved Faster RCNN and storage medium
CN113808084A (en) * 2021-08-25 2021-12-17 杭州安脉盛智能技术有限公司 Model-fused online tobacco bale surface mildew detection method and system
CN114299366A (en) * 2022-03-10 2022-04-08 青岛海尔工业智能研究院有限公司 Image detection method and device, electronic equipment and storage medium

Citations (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN109614985A (en) * 2018-11-06 2019-04-12 华南理工大学 A kind of object detection method based on intensive connection features pyramid network
CN109615016A (en) * 2018-12-20 2019-04-12 北京理工大学 A kind of object detection method of the convolutional neural networks based on pyramid input gain
CN110766098A (en) * 2019-11-07 2020-02-07 中国石油大学(华东) Traffic scene small target detection method based on improved YOLOv3
CN111046923A (en) * 2019-11-26 2020-04-21 佛山科学技术学院 Image target detection method and device based on bounding box and storage medium
CN111091105A (en) * 2019-12-23 2020-05-01 郑州轻工业大学 Remote sensing image target detection method based on new frame regression loss function

Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
"Rethinking Classification and Localization for Object Detection";Yue Wu等;《arXiv》;第1-13页 *

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant