CN114565959A - Target detection method and device based on YOLO-SD-Tiny - Google Patents


Info

Publication number
CN114565959A
CN114565959A
Authority
CN
China
Prior art keywords
yolo
tiny
feature
target detection
area
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202210152654.0A
Other languages
Chinese (zh)
Inventor
周斌 (Zhou Bin)
沈振冈 (Shen Zhengang)
李文豪 (Li Wenhao)
李艳红 (Li Yanhong)
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Wuhan Etah Information Technology Co ltd
Original Assignee
Wuhan Etah Information Technology Co ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Wuhan Etah Information Technology Co ltd filed Critical Wuhan Etah Information Technology Co ltd
Priority to CN202210152654.0A priority Critical patent/CN114565959A/en
Publication of CN114565959A publication Critical patent/CN114565959A/en
Pending legal-status Critical Current

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00 Pattern recognition
    • G06F18/20 Analysing
    • G06F18/25 Fusion techniques
    • G06F18/253 Fusion techniques of extracted features
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 Computing arrangements based on biological models
    • G06N3/02 Neural networks
    • G06N3/04 Architecture, e.g. interconnection topology
    • G06N3/045 Combinations of networks
    • G06N3/08 Learning methods

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Data Mining & Analysis (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Artificial Intelligence (AREA)
  • General Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • Evolutionary Computation (AREA)
  • Biophysics (AREA)
  • Computational Linguistics (AREA)
  • Software Systems (AREA)
  • Mathematical Physics (AREA)
  • Health & Medical Sciences (AREA)
  • Biomedical Technology (AREA)
  • Computing Systems (AREA)
  • Molecular Biology (AREA)
  • General Health & Medical Sciences (AREA)
  • Evolutionary Biology (AREA)
  • Bioinformatics & Cheminformatics (AREA)
  • Bioinformatics & Computational Biology (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Image Analysis (AREA)

Abstract

The invention discloses a target detection method and device based on YOLO-SD-Tiny, relating to the field of target detection. The method comprises the following steps: replacing the activation function used in the last CSP-Body and CBL of the YOLOv4-Tiny backbone feature extraction network with the Mish activation function, then extracting information from the picture to be detected to obtain effective feature layers; performing Self-DeConvolution upsampling on the effective feature layers according to the feature pyramid network FPN and outputting the result; and predicting the upsampled output values with a YOLO Head. The invention is suitable for general equipment, in particular low-performance equipment with limited computing power, and improves both the accuracy and the speed of target detection.

Description

Target detection method and device based on YOLO-SD-Tiny
Technical Field
The invention relates to the field of target detection, in particular to a target detection method and device based on YOLO-SD-Tiny.
Background
Face detection is a very important computer vision task and an important branch of target detection. Deep-learning-based target detection algorithms fall into two categories: those based on region proposals and those that are not.
Region-proposal-based target detection algorithms mainly include R-CNN, Fast R-CNN, Faster R-CNN and the like. They work in two steps: a series of candidate boxes is first generated, and classification and coordinate regression are then performed by a convolutional neural network. These algorithms achieve high accuracy, but their models are often too large and their real-time performance poor.
Target detection algorithms without region proposals mainly comprise the YOLO (You Only Look Once) series, which merges candidate-box generation and classification into a single stage, greatly reducing the computational complexity of the neural network at the cost of somewhat lower accuracy than the two-stage, region-proposal-based methods. The YOLO family has continued to develop; currently deployed algorithms such as YOLOv4 are greatly improved in both speed and precision over the early members of the series.
In recent years, some carefully designed network models and operators have been proposed that greatly reduce computation and parameter counts. For example, YOLOv4-Tiny, proposed on the basis of YOLOv4, is a simplified version of YOLOv4 and belongs to the lightweight models: it has only about 6 million parameters, roughly one tenth of the original, so its detection speed is greatly improved.
Mainstream target detection algorithms often require high-compute equipment, while mobile and embedded devices cannot support such complex models. A target detector with high accuracy and high real-time performance that adapts well to devices of different computing power is currently lacking.
Disclosure of Invention
Aiming at the defects in the prior art, the invention provides a target detection method based on YOLO-SD-Tiny which is suitable for general equipment, in particular low-performance equipment with limited computing power, and which improves both the accuracy and the speed of target detection.
In order to achieve the above purposes, the technical scheme adopted by the invention is as follows:
A target detection method based on YOLO-SD-Tiny comprises the following steps:
replacing the activation function used in the last CSP-Body and CBL of the YOLOv4-Tiny backbone feature extraction network with the Mish activation function, then extracting information from the picture to be detected to obtain effective feature layers;
performing Self-DeConvolution upsampling on the effective feature layer according to the feature pyramid network FPN and outputting the result;
and predicting the upsampled output value with a YOLO Head.
In some embodiments, performing the Self-DeConvolution upsampling on the effective feature layer according to the feature pyramid network FPN and outputting the result includes:
compressing the channel number of the effective feature layer F of shape H × W × C to C_r by a 1 × 1 convolution;
setting the upsampling rate to σ and, based on the compressed effective feature layer F, using a convolution to predict, for each point l_t of the output feature layer F′ of shape σH × σW × C, an upsampling kernel W_{l_t} associated with the position information, where l_t = (x_t, y_t) corresponds to the point l = (⌊x_t/σ⌋, ⌊y_t/σ⌋) of F;
reshaping the obtained kernel by the weighted-sum operator to W_{l_t} of size k_area × k_area, where W_{l_t} = θ(N(F_l, k_encoder)), l_t is a point of the output feature layer F′, θ is the weighted-sum operator, N(F_l, k) denotes the k × k neighborhood centered on point l in the effective feature layer F, k_area is the reassembly neighborhood size, and k_encoder is a smaller neighborhood than k_area;
mapping each point l_t of the output feature layer F′ back to the corresponding point l of the effective feature layer F, taking out the k_area × k_area region centered on l, and computing its dot product with the predicted upsampling kernel W_{l_t} of that point to obtain the output value.
In some embodiments, after obtaining the output value, the method further includes:
normalizing the predicted kernel tensor of shape σH × σW × k_area × k_area by softmax, so that the weights of each kernel sum to 1.
in some embodiments, when predicting the upsampled output value using the YOLO Head, the CIOU loss is used as the bounding box regression loss.
In some embodiments, when predicting the upsampled output value by using the YOLO Head, the GHM loss is used as the classification loss.
The invention also provides a target detection device based on YOLO-SD-Tiny which is suitable for general equipment, in particular low-performance equipment with limited computing power, and which improves both the accuracy and the speed of target detection.
In order to achieve the above purposes, the technical scheme adopted by the invention is as follows:
A YOLO-SD-Tiny-based target detection device comprises:
a backbone feature extraction network, formed by replacing the activation function used in the last CSP-Body and CBL of the YOLOv4-Tiny backbone feature extraction network with the Mish activation function, and used to extract information from the picture to be detected to obtain effective feature layers;
a feature pyramid network FPN, configured to perform Self-DeConvolution upsampling on the effective feature layer and output the result;
a YOLO Head, used to predict the upsampled output value.
In some embodiments, the feature pyramid network FPN comprises a Self-DeConvolution computation unit, which comprises:
an upsampling kernel prediction module configured to:
compress the channel number of the effective feature layer F of shape H × W × C to C_r by a 1 × 1 convolution;
set the upsampling rate to σ and, based on the compressed effective feature layer F, use a convolution to predict, for each point l_t of the output feature layer F′ of shape σH × σW × C, an upsampling kernel W_{l_t} associated with the position information, where l_t = (x_t, y_t) corresponds to the point l = (⌊x_t/σ⌋, ⌊y_t/σ⌋) of F;
reshape the obtained kernel by the weighted-sum operator to W_{l_t} of size k_area × k_area, where W_{l_t} = θ(N(F_l, k_encoder)), θ is the weighted-sum operator, N(F_l, k) denotes the k × k neighborhood centered on point l in the effective feature layer F, k_area is the reassembly neighborhood size, and k_encoder is a smaller neighborhood than k_area;
a feature traversal module configured to: map each point l_t of the output feature layer F′ back to the corresponding point l of the effective feature layer F, take out the k_area × k_area region centered on l, and compute its dot product with the predicted upsampling kernel W_{l_t} of that point to obtain the output value.
In some embodiments, after obtaining the output value, the upsampling kernel prediction module is further configured to:
normalize the predicted kernel tensor of shape σH × σW × k_area × k_area by softmax, so that the weights of each kernel sum to 1.
in some embodiments, when predicting the upsampled output value using the YOLO Head, the CIOU loss is used as the bounding box regression loss.
In some embodiments, when predicting the upsampled output value by using the YOLO Head, the GHM loss is used as the classification loss.
Compared with the prior art, the invention has the following advantages:
Aiming at the problems that target detection models are too large to deploy on low-performance equipment and have poor real-time performance, the invention provides the YOLO-SD-Tiny model. An MCSP-Body based on the Mish activation function is introduced into the backbone feature extraction network so that information flows into the network better, and an SD module is introduced into the feature pyramid network to accelerate feature fusion and enlarge the receptive field. According to the experimental analysis, on the OccludeFace data set the proposed YOLO-SD-Tiny improves AP by 6.35% and detection speed by 9.64% compared with YOLOv4-Tiny, alleviating the problems of detection speed and accuracy to a certain extent.
Drawings
FIG. 1 is a flow chart of a target detection method based on YOLO-SD-Tiny in the embodiment of the present invention;
FIG. 2 is a schematic diagram of a YOLO-SD-Tiny overall network model in an embodiment of the present invention;
FIG. 3 is a comparison graph of Mish activation function and LeakyReLU activation function curves in an embodiment of the present invention;
FIG. 4 is a flow chart of upsampling kernel prediction in an embodiment of the present invention;
FIG. 5 is a flow chart of feature traversal in an embodiment of the present invention.
Detailed Description
For the purpose of making the purpose, technical solutions and advantages of the embodiments of the present application clearer, the technical solutions in the embodiments of the present application will be clearly and completely described below with reference to the drawings in the embodiments of the present application, and it is obvious that the described embodiments are some, but not all embodiments of the present application. All other embodiments, which can be derived by a person skilled in the art from the embodiments given herein without making any creative effort, shall fall within the protection scope of the present application.
Referring to fig. 1, an embodiment of the present invention provides a method for detecting a target based on YOLO-SD-Tiny, the method includes the following steps:
S1, the activation function used in the last CSP-Body and CBL of the YOLOv4-Tiny backbone feature extraction network is replaced with the Mish activation function, after which information is extracted from the picture to be detected to obtain effective feature layers.
The YOLOv4-Tiny backbone feature extraction network consists of, in sequence, two CBLs, three CSP-Bodies and one CBL.
It is worth noting that CBL stands for Convolution, Batch Normalization and the LeakyReLU activation function. A CSP-Body consists of three CBL structures and one MaxPool. The CSP-Body splits the feature map passed down from the upper layer into two parts and then merges them through a cross-stage hierarchical structure; through its residual structure it enhances the learning capacity of the neural network, reducing memory occupation and computation while keeping the network's precision intact.
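As a hedged illustration of the cross-stage idea described above, a minimal NumPy sketch of the split-process-merge pattern follows. It is not the patent's exact CSP-Body (which stacks three CBLs and a MaxPool with learned convolutions); the stand-in transform replaces the conv branch for illustration only.

```python
import numpy as np

def csp_split_merge(x, transform):
    """Sketch of the CSP cross-stage idea (hypothetical simplification):
    split the incoming feature map along the channel axis, run only one
    part through the processing branch, then concatenate both parts again.
    x: array of shape (C, H, W); transform: function applied to one half."""
    c = x.shape[0] // 2
    part1, part2 = x[:c], x[c:]           # cross-stage split
    part2 = transform(part2)              # only this half is processed
    return np.concatenate([part1, part2], axis=0)  # merge across stages

# toy stand-in for the CBL branch: a LeakyReLU-like nonlinearity
features = np.random.randn(8, 4, 4)
out = csp_split_merge(features, lambda t: np.maximum(t, 0.1 * t))
```

The untouched half acts like a shortcut, which is the mechanism the patent credits for reduced computation at unchanged precision.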
Referring to fig. 2, in the present embodiment the LeakyReLU activation function used in the fifth module (the last CSP-Body) and the sixth module (the last CBL) is replaced with the Mish activation function; the modified structures are denoted MCSP-Body and CBM, respectively. Upsampling is performed by Self-DeConvolution (SD for short), so the model of the embodiment of the invention is called YOLO-SD-Tiny.
With a 416 × 416 input, the overall YOLO-SD-Tiny network model is shown in fig. 2. As can be seen from fig. 2, YOLO-SD-Tiny is divided into three parts: the backbone feature extraction network, the feature pyramid network and the YOLO Head. The backbone feature extraction network consists of two CBLs, two CSP-Bodies, one MCSP-Body and one CBM. The MCSP-Body replaces the CBL structure inside the CSP-Body with a CBM structure based on the Mish activation function, letting information flow into the network better.
The activation function performs the nonlinear mapping from a neuron's input to its output and is of great significance to the training of a neural network. Activation functions commonly used by neural networks include Sigmoid, Tanh, ReLU and LeakyReLU, but each has certain disadvantages. Taking ReLU as an example, when the input is negative the gradient becomes zero, causing the gradient to vanish; LeakyReLU allows a slight negative gradient for negative inputs and, to some extent, avoids the vanishing gradient they cause.
The Mish activation function is computed as follows:
f(x) = x · tanh(softplus(x)) = x · tanh(ln(1 + e^x))
The value range of the Mish activation function is approximately [−0.31, +∞). A comparison of the Mish and LeakyReLU activation function curves is shown in FIG. 3. As can be seen from fig. 3, the Mish activation function permits slight negative values and sets no upper bound, which yields better gradient flow; after being mapped by this smooth activation function, the network's inputs can carry information deeper into the network and obtain better accuracy.
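The formula and the claimed value range can be checked numerically. A small NumPy sketch (the LeakyReLU slope of 0.01 is an illustrative choice for comparison, not taken from the patent):

```python
import numpy as np

def mish(x):
    """Mish activation: f(x) = x * tanh(softplus(x)) = x * tanh(ln(1 + e^x))."""
    return x * np.tanh(np.log1p(np.exp(x)))

def leaky_relu(x, slope=0.01):
    """LeakyReLU for comparison: small negative slope instead of zero."""
    return np.where(x >= 0, x, slope * x)

# Sample the curve: Mish is smooth, unbounded above, and dips slightly
# below zero (minimum near -0.31) before approaching 0 for large negatives.
xs = np.linspace(-6, 6, 1001)
ys = mish(xs)
```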
S2, performing Self-DeConvolution upsampling on the effective feature layer according to the feature pyramid network FPN and outputting the result.
The FPN, i.e. the feature pyramid network, is a top-down feature fusion method. It upsamples the higher-level, more abstract and semantically stronger feature map and laterally connects the upsampled result to the feature map of the previous level, so that high-level features are fused into shallow features and help them detect targets better. Traditional upsampling is interpolation-based; interpolation cannot exploit the semantic information of the feature map, and its receptive field is small.
For this reason, in this embodiment upsampling is performed by Self-DeConvolution (SD for short). SD involves two modules: an upsampling kernel prediction module and a feature traversal module. For an effective feature layer F of shape H × W × C and a given integer upsampling rate σ, SD produces an output feature layer F′ of shape σH × σW × C. For a point l_t = (x_t, y_t) of F′, a corresponding point l = (x, y) can be found in F, where x = ⌊x_t/σ⌋ and y = ⌊y_t/σ⌋. The k × k neighborhood of l is denoted N(F_l, k).
Step S2 comprises an upsampling kernel prediction process and a feature traversal process, specifically:
S21, compressing the channel number of the effective feature layer F of shape H × W × C to C_r by a 1 × 1 convolution.
S22, setting the upsampling rate to σ and, based on the compressed effective feature layer F, using a convolution of kernel size k_encoder × k_encoder to predict, for each point l_t of the output feature layer F′, an upsampling kernel W_{l_t} associated with the position information, where l = (⌊x_t/σ⌋, ⌊y_t/σ⌋).
S23, reshaping the obtained kernel by the weighted-sum operator to W_{l_t} of size k_area × k_area, where W_{l_t} = θ(N(F_l, k_encoder)), l_t is a point of the output feature layer F′, θ is the weighted-sum operator, N(F_l, k) denotes the k × k neighborhood centered on point l in the effective feature layer F, k_area is the reassembly neighborhood size, and k_encoder is a smaller neighborhood than k_area.
Referring to FIG. 4, it can be understood that in the upsampling kernel prediction module the channel number is first compressed to C_r by a 1 × 1 convolution, after which a k_encoder × k_encoder convolution predicts the upsampling kernel W_{l_t} for each point l_t of the output feature layer F′. The parameter count of this stage is k_encoder × k_encoder × C_r × σ² × k_area², where k_encoder = k_area − 1.
S24, mapping each point l_t of the output feature layer F′ back to the corresponding point l of the effective feature layer F, taking out the k_area × k_area region centered on l, and computing its dot product with the predicted upsampling kernel W_{l_t} of that point to obtain the output value.
It will be appreciated that the feature traversal procedure is shown in FIG. 5: each point l_t of the output feature layer is mapped back to the corresponding point l in the effective feature layer F, the k_area × k_area region centered on l is taken out, and its dot product with the predicted upsampling kernel of that point gives the output value. Different channels at the same position share the same upsampling kernel. Within the neighborhood N(F_l, k_area) of a point l in F, each pixel contributes differently to the corresponding output pixel l_t of F′, and the contribution depends on the content of the features rather than on spatial distance. The reassembled feature map therefore carries stronger semantics than the original, because every output pixel can focus on information from the relevant points in its local region.
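The feature traversal step above can be sketched in NumPy. This is an illustrative reimplementation under stated assumptions: the kernels here are random and softmax-normalised rather than predicted by the learned k_encoder × k_encoder convolution, and border points use zero padding.

```python
import numpy as np

def sd_upsample(F, kernels, sigma, k_area):
    """Sketch of the SD feature-traversal step.
    F:       effective feature layer, shape (C, H, W)
    kernels: one k_area x k_area reassembly kernel per output point,
             shape (sigma*H, sigma*W, k_area, k_area), already normalised
    Returns the output feature layer F' of shape (C, sigma*H, sigma*W)."""
    C, H, W = F.shape
    r = k_area // 2
    Fp = np.pad(F, ((0, 0), (r, r), (r, r)))       # zero-pad the borders
    out = np.zeros((C, sigma * H, sigma * W))
    for yt in range(sigma * H):
        for xt in range(sigma * W):
            y, x = yt // sigma, xt // sigma         # map l_t back to l
            patch = Fp[:, y:y + k_area, x:x + k_area]
            # all channels at this position share the same kernel
            out[:, yt, xt] = (patch * kernels[yt, xt]).sum(axis=(1, 2))
    return out

C, H, W, sigma, k_area = 3, 4, 4, 2, 3
F = np.random.randn(C, H, W)
raw = np.random.randn(sigma * H, sigma * W, k_area, k_area)
# softmax-normalise each kernel so its weights sum to 1
flat = np.exp(raw.reshape(sigma * H, sigma * W, -1))
kernels = (flat / flat.sum(-1, keepdims=True)).reshape(raw.shape)
Fup = sd_upsample(F, kernels, sigma, k_area)
```

A sanity check of the mapping: a kernel that is 1 at its centre and 0 elsewhere reduces SD to nearest-neighbour upsampling.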
And S3, predicting the up-sampled output value by using a YOLO Head.
It should be noted that the loss function of YOLO-SD-Tiny in this embodiment is divided into three parts, namely, confidence loss, classification loss, and bounding box regression loss.
In a preferred embodiment, the CIOU loss function is used as the bounding-box regression loss. The IOU is the ratio of the intersection to the union of the prediction box and the real box and serves as the measure of bounding-box regression accuracy. The IOU and the CIOU loss are computed as follows:
IOU = |B ∩ B^gt| / |B ∪ B^gt|
where B is the prediction box and B^gt is the real box.
L_CIOU = 1 − IOU + ρ²(b, b^gt) / c² + αv
where b and b^gt are the centre points of the prediction box and the real box respectively, ρ is the Euclidean distance between them, α is a weight function, v is a parameter measuring the consistency of the bounding boxes' aspect ratios, and c is the diagonal length of the smallest closure area that contains both the prediction box and the real box:
v = (4 / π²) · (arctan(w^gt / h^gt) − arctan(w / h))², α = v / ((1 − IOU) + v)
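A minimal NumPy sketch of the CIOU loss for two axis-aligned boxes, following the formulas above (the small stabiliser added to the denominator of α is an implementation detail assumed here, not part of the patent):

```python
import numpy as np

def ciou_loss(box, box_gt):
    """CIOU loss for boxes given as (x1, y1, x2, y2):
    L_CIOU = 1 - IOU + rho^2(b, b_gt) / c^2 + alpha * v   (a sketch)."""
    x1, y1, x2, y2 = box
    g1, h1, g2, h2 = box_gt
    # intersection over union
    iw = max(0.0, min(x2, g2) - max(x1, g1))
    ih = max(0.0, min(y2, h2) - max(y1, h1))
    inter = iw * ih
    union = (x2 - x1) * (y2 - y1) + (g2 - g1) * (h2 - h1) - inter
    iou = inter / union
    # squared centre distance rho^2 and diagonal c^2 of the enclosing box
    rho2 = ((x1 + x2) / 2 - (g1 + g2) / 2) ** 2 \
         + ((y1 + y2) / 2 - (h1 + h2) / 2) ** 2
    cw = max(x2, g2) - min(x1, g1)
    ch = max(y2, h2) - min(y1, h1)
    c2 = cw ** 2 + ch ** 2
    # aspect-ratio consistency term v and its weight alpha
    v = (4 / np.pi ** 2) * (np.arctan((g2 - g1) / (h2 - h1))
                            - np.arctan((x2 - x1) / (y2 - y1))) ** 2
    alpha = v / ((1 - iou) + v + 1e-9)   # stabiliser assumed, not in patent
    return 1 - iou + rho2 / c2 + alpha * v
```

Identical boxes give zero loss; disjoint boxes are additionally penalised by their normalised centre distance, which is what speeds up regression compared with plain IOU loss.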
In terms of classification loss, the GHM (Gradient Harmonizing Mechanism) loss is introduced to address the imbalance between positive and negative samples and the problem of especially hard samples (outliers). The gradient modulus length g of outliers is much larger than that of average samples, and forcing the model to focus on these samples can reduce its accuracy. To attenuate both the easy samples and the especially hard samples at the same time, the gradient density GD(g) is proposed, computed as follows:
GD(g) = (1 / l_ε(g)) · Σ_{k=1}^{N} δ_ε(g_k, g)
where δ_ε(g_k, g) indicates whether, among samples 1..N, the gradient modulus length of sample k falls in the interval [g − ε/2, g + ε/2), and l_ε(g) is the length of that interval. The physical meaning of the gradient density GD(g) is therefore the number of samples per unit length in the neighbourhood of gradient modulus length g. The GHM loss is then obtained by weighting the cross entropy of each sample by the inverse of its gradient density:
β_i = N / GD(g_i)
L_GHM-C = (1 / N) · Σ_{i=1}^{N} β_i · L_CE(p_i, p_i*) = Σ_{i=1}^{N} L_CE(p_i, p_i*) / GD(g_i)
where N is the total number of samples, L_CE(p_i, p_i*) is the binary cross-entropy loss, p ∈ [0, 1] is the probability predicted by the model, and p* ∈ {0, 1} is the ground-truth label of the class.
Therefore, in the embodiment of the invention, the loss function of YOLO-SD-Tiny uses the CIOU loss as the bounding-box regression loss, to accelerate bounding-box regression, and the GHM loss as the classification loss, to address the imbalance between positive and negative samples and the problem of especially hard samples (outliers).
The overall YOLO-SD-Tiny target detection process in the embodiment of the invention is as follows:
First, the input image is divided into an S × S grid. Each grid cell is responsible only for predicting targets whose centre point falls inside it and computes 3 prediction boxes, each corresponding to 5 + C values, where C is the total number of categories in the data set and 5 stands for the predicted bounding-box centre coordinates (x, y), the predicted box width and height (w, h), and the confidence. Then the class confidence predicted by the network is computed; it depends on the probability P(n_object) that a target falls into the grid cell, the class accuracy P(n_class | n_object), and the intersection over union (IOU):
Score = P(n_class | n_object) · P(n_object) · IOU_pred^truth
If the target centre falls into the grid cell, P(n_object) = 1, otherwise 0; IOU_pred^truth is the intersection over union between the prediction box and the real box. Finally, DIOU-NMS is used to screen the highest-scoring prediction box as the target detection box, and the output feature maps are 26 × 26 and 13 × 13, achieving the localization and classification of the target. It should be noted that NMS is a necessary post-processing step in target detection, intended to remove duplicate boxes and keep the most accurate one; DIOU-NMS holds that two boxes whose centre points are far apart may lie on different objects and should not be deleted (this is its biggest difference from plain NMS).
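A minimal sketch of DIOU-NMS as just described, keeping overlapping boxes whose centres are far apart; the threshold value and the greedy loop structure are illustrative assumptions, not parameters taken from the patent:

```python
import numpy as np

def iou(a, b):
    """Plain IOU of two (x1, y1, x2, y2) boxes."""
    ix = max(0.0, min(a[2], b[2]) - max(a[0], b[0]))
    iy = max(0.0, min(a[3], b[3]) - max(a[1], b[1]))
    inter = ix * iy
    union = (a[2] - a[0]) * (a[3] - a[1]) \
          + (b[2] - b[0]) * (b[3] - b[1]) - inter
    return inter / union

def diou_nms(boxes, scores, thresh=0.5):
    """Greedy NMS where the suppression criterion is IOU minus the
    normalised squared centre distance, so overlapping boxes with
    far-apart centres (likely different objects) are kept."""
    order = np.argsort(scores)[::-1]       # highest score first
    keep = []
    while len(order) > 0:
        i = order[0]
        keep.append(int(i))
        rest = []
        for j in order[1:]:
            a, b = boxes[i], boxes[j]
            d2 = ((a[0] + a[2]) / 2 - (b[0] + b[2]) / 2) ** 2 \
               + ((a[1] + a[3]) / 2 - (b[1] + b[3]) / 2) ** 2
            cw = max(a[2], b[2]) - min(a[0], b[0])
            ch = max(a[3], b[3]) - min(a[1], b[1])
            diou = iou(a, b) - d2 / (cw ** 2 + ch ** 2)
            if diou <= thresh:             # keep j for the next round
                rest.append(j)
        order = np.array(rest, dtype=int)
    return keep

boxes = np.array([[0, 0, 10, 10], [1, 1, 11, 11], [20, 20, 30, 30]], float)
scores = np.array([0.9, 0.8, 0.7])
kept = diou_nms(boxes, scores, thresh=0.5)
```

Here the second box heavily overlaps the first and is suppressed, while the distant third box survives.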
It is worth noting that there are various performance evaluation indexes in the field of target detection. For example, a model can be evaluated with the most widely used Precision and Recall, computed as follows:
P = TP / (TP + FP)
R = TP / (TP + FN)
The precision P evaluates the prediction results: TP (True Positive) is the number of positive samples correctly predicted by the model, and FP (False Positive) is the number of negative samples predicted as positive. The recall R evaluates the samples, indicating how many of all positive samples are correctly predicted; FN (False Negative) is the number of positive samples the model predicts as negative.
The AP is the area enclosed by the PR curve, formed by the Precision and Recall of a single class under different confidence thresholds, and the coordinate axes; it considers precision and recall together and gives a comprehensive evaluation of single-class detection. The FPS is the number of images the model can process per second; the larger the FPS, the faster the model's detection speed.
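The precision and recall formulas can be illustrated directly (binary labels assumed; this is a generic sketch, not code from the patent):

```python
import numpy as np

def precision_recall(pred, truth):
    """P = TP / (TP + FP), R = TP / (TP + FN) for binary 0/1 labels."""
    pred = np.asarray(pred)
    truth = np.asarray(truth)
    tp = int(np.sum((pred == 1) & (truth == 1)))   # correct positives
    fp = int(np.sum((pred == 1) & (truth == 0)))   # negatives called positive
    fn = int(np.sum((pred == 0) & (truth == 1)))   # missed positives
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    return precision, recall

# toy example: 3 true positives, 1 false positive, 1 false negative
pred  = [1, 1, 1, 0, 0, 1]
truth = [1, 0, 1, 0, 1, 1]
P, R = precision_recall(pred, truth)
```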
The YOLO-SD-Tiny algorithm of the embodiment of the invention is compared with YOLOv4-Tiny on the OccludeFace data set, and ablation experiments are performed on YOLO-SD-Tiny (with MCSP-Body) and YOLO-SD-Tiny (with GHM & CIOU) to verify the influence of each module on the model; the experimental results are shown in Table 1. The MCSP-Body introduced on the basis of the Mish activation function improves AP by 0.67% over YOLOv4-Tiny: the gradient does not vanish, and the smooth activation function lets information embed better into the network, improving detection accuracy. Introducing the GHM loss in the classification loss and the CIOU loss in the bounding-box regression improves AP by 2.09% over the original YOLOv4-Tiny model, showing that the CIOU, which jointly considers overlap area, centre distance and aspect ratio, and the GHM loss, which addresses the imbalance of positive and negative samples and hard samples, can both raise detection accuracy. The full YOLO-SD-Tiny improves AP by 6.35% and detection speed (FPS) by 9.64% compared with YOLOv4-Tiny. Combining the experimental data in Table 1 verifies that the improvements proposed by the invention effectively raise both detection precision and detection speed.
TABLE 1
[Table 1: ablation results on OccludeFace; the original table image is not reproduced. The relative improvements over the YOLOv4-Tiny baseline reported in the text are: +0.67% AP with MCSP-Body; +2.09% AP with GHM & CIOU; +6.35% AP and +9.64% FPS for the full YOLO-SD-Tiny.]
In conclusion, aiming at the problems that target detection models are too large to deploy on low-performance equipment and have poor real-time performance, the invention provides the YOLO-SD-Tiny model, introduces an MCSP-Body based on the Mish activation function into the backbone feature extraction network so that information flows into the network better, and introduces an SD module into the feature pyramid network to accelerate feature fusion and enlarge the receptive field. According to the experimental analysis, on the OccludeFace data set the proposed YOLO-SD-Tiny improves AP by 6.35% and detection speed by 9.64% compared with YOLOv4-Tiny, alleviating the problems of detection speed and accuracy to a certain extent.
Meanwhile, the embodiment of the invention also provides a target detection device based on YOLO-SD-Tiny, comprising a backbone feature extraction network, a feature pyramid network FPN and a YOLO Head.
The backbone feature extraction network is formed by replacing the activation function used in the last CSP-Body and CBL of the YOLOv4-Tiny backbone feature extraction network with the Mish activation function, and is used to extract information from the picture to be detected to obtain effective feature layers.
The feature pyramid network FPN is used to perform Self-DeConvolution upsampling on the effective feature layer and output the result. The YOLO Head is used to predict the upsampled output value.
In some embodiments, feature pyramid network FPN includes a Self-deconvolation computation unit that includes an upsampling kernel prediction module and a feature traversal module.
An upsampling kernel prediction module, configured to:
compress the number of channels of the effective feature layer F, of shape H×W×C, to C_r by a 1×1 convolution;
set the upsampling rate to σ and, based on the compressed effective feature layer F, perform a convolution with a kernel of size k_encoder×k_encoder to predict, for each point l_t of the output feature layer F′ (of shape σH×σW), an upsampling kernel W_{l_t} associated with the location information; the obtained kernel W_{l_t} is, via the weighted-sum operator θ, reshaped into the form σH×σW×k_area×k_area; wherein F′ is the output feature layer, l_t is a point in F′, θ is the weighted-sum operator, k_area is a neighborhood of a point in the effective feature layer F, and k_encoder is a neighborhood of a region smaller than k_area.
A feature traversal module, configured to: map each point l_t of the output feature layer F′ back to the corresponding point l of the effective feature layer F, take out the k_area×k_area region centered on l, and perform a dot product of the region with the upsampling kernel predicted for that point to obtain the output value.
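The kernel-prediction and reassembly procedure described above resembles content-aware upsampling. The NumPy sketch below illustrates only the reassembly step, under stated assumptions: the kernel-prediction convolutions are replaced by a pre-computed kernel tensor passed in as an argument, softmax normalization is applied per kernel, and the name `sd_upsample` is hypothetical, not from the specification:

```python
import numpy as np

def sd_upsample(F, kernels, sigma=2, k_area=5):
    """Content-aware upsampling sketch.
    F       : effective feature layer, shape (H, W, C)
    kernels : predicted upsampling kernels, shape
              (sigma*H, sigma*W, k_area*k_area); in the described
              module these come from a 1x1 channel compression
              followed by a k_encoder x k_encoder convolution.
    Returns the upsampled layer of shape (sigma*H, sigma*W, C)."""
    H, W, C = F.shape
    pad = k_area // 2
    Fp = np.pad(F, ((pad, pad), (pad, pad), (0, 0)), mode="edge")
    out = np.zeros((sigma * H, sigma * W, C), dtype=F.dtype)
    for lt_y in range(sigma * H):
        for lt_x in range(sigma * W):
            # map the output point l_t back to its source point l
            ly, lx = lt_y // sigma, lt_x // sigma
            # softmax-normalize the kernel so its weights sum to 1
            w = kernels[lt_y, lt_x]
            w = np.exp(w - w.max())
            w = w / w.sum()
            w = w.reshape(k_area, k_area, 1)
            # k_area x k_area region centered on l, dot product with kernel
            region = Fp[ly:ly + k_area, lx:lx + k_area]
            out[lt_y, lt_x] = (region * w).sum(axis=(0, 1))
    return out
```

Because each kernel is normalized to sum to 1, a constant feature map stays constant after reassembly, which is one way to sanity-check an implementation of this step.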
In some embodiments, after obtaining the output value, the upsampling kernel prediction module is further configured to normalize the σH×σW×k_area×k_area kernel by softmax so that the sum of the kernel weights is 1.
in some embodiments, when predicting the upsampled output value using the YOLO Head, the CIOU loss is used as the bounding box regression loss.
In some embodiments, when predicting the upsampled output value by using the YOLO Head, the GHM loss is used as the classification loss.
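For reference, the GHM classification loss re-weights each example's cross-entropy by the inverse density of its gradient norm g = |p − y|, so that the huge number of very easy negatives and the few extreme outliers do not dominate training — this is the "positive and negative samples and hard-to-classify samples" balancing mentioned above. A simplified sketch of the conventional binned formulation (illustrative, not this specification's implementation):

```python
import numpy as np

def ghm_c_loss(pred_prob, target, bins=10):
    """GHM-C sketch: binary cross-entropy re-weighted by gradient density.
    pred_prob : predicted probabilities in (0, 1), shape (N,)
    target    : 0/1 labels, shape (N,)"""
    eps = 1e-7
    p = np.clip(pred_prob, eps, 1 - eps)
    g = np.abs(p - target)                 # gradient norm per example
    edges = np.linspace(0, 1, bins + 1)
    weights = np.zeros_like(g)
    n = len(g)
    for i in range(bins):
        # examples falling in this gradient-norm bin
        in_bin = (g >= edges[i]) & (g < edges[i + 1] + (i == bins - 1) * eps)
        num = in_bin.sum()
        if num > 0:
            weights[in_bin] = n / num      # inverse gradient density
    weights /= weights.mean()              # keep the overall loss scale
    bce = -(target * np.log(p) + (1 - target) * np.log(1 - p))
    return float((weights * bce).mean())
```

Examples in crowded gradient-norm bins receive small weights, while examples in sparse bins are emphasized, harmonizing the contribution of easy and hard samples.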
According to the target detection device based on YOLO-SD-Tiny, the MCSP-Body based on the Mish activation function is introduced into the trunk feature extraction network part, so that information flows through the network better; and the SD module is introduced into the feature pyramid network part to accelerate feature fusion and enlarge the receptive field. According to the analysis of the experimental results, on the OccludeFace data set the proposed YOLO-SD-Tiny improves AP by 6.35% and detection speed by 9.64% compared with YOLOv4-Tiny, solving the problems of detection speed and accuracy to a certain extent.
It should be noted that, in this document, the terms "comprises," "comprising," or any other variation thereof, are intended to cover a non-exclusive inclusion, such that a process, method, article, or system that comprises a list of elements does not include only those elements but may include other elements not expressly listed or inherent to such process, method, article, or system. Without further limitation, an element defined by the phrase "comprising an … …" does not exclude the presence of other like elements in a process, method, article, or system that comprises the element.
The above-mentioned serial numbers of the embodiments of the present application are merely for description and do not represent the merits of the embodiments. While the invention has been described with reference to specific embodiments, the scope of the invention is not limited thereto, and those skilled in the art can easily conceive various equivalent modifications or substitutions within the technical scope of the invention. Therefore, the protection scope of the present application shall be subject to the protection scope of the claims.

Claims (10)

1. A target detection method based on YOLO-SD-Tiny is characterized by comprising the following steps:
replacing the activation function adopted in the last CSP-Body and the CBL in the YOLOv4-Tiny trunk feature extraction network with the Mish activation function, and extracting information from the picture to be detected to obtain an effective feature layer;
performing Self-DeConvolution upsampling on the effective feature layer according to the feature pyramid network FPN and outputting the result;
and predicting the upsampled output value by using a YOLO Head.
2. The YOLO-SD-Tiny-based target detection method of claim 1, wherein performing Self-DeConvolution upsampling on the effective feature layer according to the feature pyramid network FPN and outputting the result comprises:
compressing the number of channels of the effective feature layer F, of shape H×W×C, to C_r by a 1×1 convolution;
setting the upsampling rate to σ and, based on the compressed effective feature layer F, performing a convolution with a kernel of size k_encoder×k_encoder to predict, for each point l_t of the output feature layer F′ of shape σH×σW, an upsampling kernel W_{l_t} associated with the location information, the obtained kernel W_{l_t} being reshaped, via the weighted-sum operator θ, into the form σH×σW×k_area×k_area, wherein F′ is the output feature layer, l_t is a point in F′, θ is the weighted-sum operator, k_area is a neighborhood of a point in the effective feature layer F, and k_encoder is a neighborhood of a region smaller than k_area;
mapping the point l_t of the output feature layer F′ back to the corresponding point l of the effective feature layer F, taking out the k_area×k_area region centered on l, and performing a dot product of the region with the upsampling kernel W_{l_t} predicted for that point to obtain an output value.
3. The YOLO-SD-Tiny-based target detection method of claim 2, further comprising, after obtaining the output value:
normalizing the σH×σW×k_area×k_area kernel by softmax so that the sum of the kernel weights is 1.
4. The YOLO-SD-Tiny-based target detection method of claim 1, wherein when predicting the upsampled output value using the YOLO Head, the CIOU loss is used as the bounding box regression loss.
5. The YOLO-SD-Tiny-based target detection method of claim 1, wherein when predicting the upsampled output value using the YOLO Head, the GHM loss is used as the classification loss.
6. A target detection device based on YOLO-SD-Tiny, characterized by comprising:
the system comprises a trunk feature extraction network and a detection module, wherein the trunk feature extraction network is formed by replacing an activation function adopted in the last CSP-Body and CBL in the YOLOV4-Tiny trunk feature extraction network with a Mish activation function and is used for extracting information of a picture to be detected to obtain an effective feature layer;
a feature pyramid network FPN, configured to perform Self-deconvoltation upsampling on the effective feature layer and output the upsampled;
YOLO Head, which is used to predict the up-sampled output value.
7. The YOLO-SD-Tiny-based target detection apparatus of claim 6, wherein the feature pyramid network FPN comprises a Self-DeConvolution calculation unit, the Self-DeConvolution calculation unit comprising:
an upsampling kernel prediction module to:
compressing the number of channels of the significant feature layer F with the shape of H × W × C to C by 1 × 1 convolutionr
Setting the up-sampling rate to sigma, based on the compressedThe effective feature layer F is convolved
Figure FDA0003511198410000021
For outputting feature layers
Figure FDA0003511198410000031
A point l intPredicting an upsampling kernel associated with location information
Figure FDA0003511198410000032
Wherein
Figure FDA0003511198410000033
The obtained core
Figure FDA0003511198410000034
Obtained after a weighted sum operator reshape
Figure FDA0003511198410000035
Wherein
Figure FDA0003511198410000036
Wherein
Figure FDA0003511198410000037
For outputting feature layers
Figure FDA0003511198410000038
Midpoint ltTheta is the weighted sum operator,
Figure FDA0003511198410000039
kareaa neighborhood of a point in the significance signature F, kencoderIs a ratio of kareaA neighborhood of one smaller region;
a feature traversal module to: outputting feature layers
Figure FDA00035111984100000310
Point l intMapping back to the point l corresponding to the effective characteristic layer F, and taking out the k taking l as the centerarea×kareaRegion, and up-sampling kernel of the predicted point
Figure FDA00035111984100000311
And performing dot product to obtain an output value.
8. The YOLO-SD-Tiny-based target detection apparatus of claim 7, wherein after obtaining the output value the upsampling kernel prediction module is further configured to:
normalize the σH×σW×k_area×k_area kernel by softmax so that the sum of the kernel weights is 1.
9. The YOLO-SD-Tiny-based target detection apparatus of claim 6, wherein when predicting the upsampled output value using the YOLO Head, the CIOU loss is used as the bounding box regression loss.
10. The YOLO-SD-Tiny-based target detection apparatus of claim 6, wherein when predicting the upsampled output value using the YOLO Head, the GHM loss is used as the classification loss.
CN202210152654.0A 2022-02-18 2022-02-18 Target detection method and device based on YOLO-SD-Tiny Pending CN114565959A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202210152654.0A CN114565959A (en) 2022-02-18 2022-02-18 Target detection method and device based on YOLO-SD-Tiny


Publications (1)

Publication Number Publication Date
CN114565959A true CN114565959A (en) 2022-05-31

Family

ID=81713124

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202210152654.0A Pending CN114565959A (en) 2022-02-18 2022-02-18 Target detection method and device based on YOLO-SD-Tiny

Country Status (1)

Country Link
CN (1) CN114565959A (en)

Cited By (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN115731533A (en) * 2022-11-29 2023-03-03 淮阴工学院 Vehicle-mounted target detection method based on improved YOLOv5
CN115731533B (en) * 2022-11-29 2024-04-05 淮阴工学院 Vehicle-mounted target detection method based on improved YOLOv5
CN115546473A (en) * 2022-12-01 2022-12-30 珠海亿智电子科技有限公司 Target detection method, apparatus, device and medium
CN117218606A (en) * 2023-11-09 2023-12-12 四川泓宝润业工程技术有限公司 Escape door detection method and device, storage medium and electronic equipment
CN117218606B (en) * 2023-11-09 2024-02-02 四川泓宝润业工程技术有限公司 Escape door detection method and device, storage medium and electronic equipment


Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination