CN114299438A

CN114299438A - Tunnel parking event detection method integrating traditional parking detection and neural network

Info

Publication number: CN114299438A
Application number: CN202111665332.8A
Authority: CN
Inventors: 宋永端; 陈欢; 庞思袁; 凌凯; 赵梦雯; 卫佳; 王攀; 程霜雄; 魏大创; 廖昕怡
Original assignee: Chongqing University; DIBI Chongqing Intelligent Technology Research Institute Co Ltd; Star Institute of Intelligent Systems
Current assignee: Chongqing University; DIBI Chongqing Intelligent Technology Research Institute Co Ltd; Star Institute of Intelligent Systems
Priority date: 2021-12-31
Filing date: 2021-12-31
Publication date: 2022-04-08

Abstract

The invention relates to a tunnel parking event detection method integrating traditional parking detection and a neural network. The method comprises: collecting driving videos from cameras in different scenes of a highway tunnel, extracting pictures from the videos and labeling them to obtain a VOC data set; clustering the labeled vehicle bounding boxes in the VOC data set to obtain the most suitable bounding box size for each vehicle type, and using these sizes as the Anchor sizes of the SSD neural network; constructing and training a vehicle identification model based on the SSD neural network to obtain an optimal vehicle identification model; and inputting a video to be detected into a traditional parking detection algorithm, taking the video frames that correspond to fixed foreground target pictures as pictures to be predicted, and inputting them into the optimal vehicle identification model to obtain a judgment result. Compared with the traditional parking event detection algorithm, the method has higher accuracy.

Description

Tunnel parking event detection method integrating traditional parking detection and neural network

Technical Field

The invention relates to the technical field of tunnel parking detection, and in particular to a real-time tunnel parking event detection method that fuses traditional parking detection with an SSD neural network.

Background

A tunnel is a bottleneck section of a road network. Differences in brightness and environment between the inside and outside of a tunnel affect traffic safety, and illegal parking inside a tunnel in particular can cause casualties and traffic congestion. At present there is no parking detection system fully suited to tunnel scenes: traditional parking detection algorithms have a high false detection rate and cannot fully meet the real-time and accuracy requirements of tunnel parking event detection. Deep learning algorithms can extract deep features of a target, handle vehicle identification in complex scenes effectively, and perform well in both the speed and the accuracy of target detection. Detecting tunnel parking events with a deep learning method can therefore effectively reduce the high false detection rate of traditional parking detection algorithms.

In the field of parking event detection, the problem to be solved is how to minimize the false detection rate while still detecting events promptly. Current video-based parking event detection methods fall into two groups: methods based on deep learning and methods based on traditional parking detection.

Methods based on traditional parking detection sense changes of the foreground target in an image region through background modeling and judge whether parking behavior exists through related constraints. The University of Electronic Science and Technology of China applied for "an illegal parking detection method based on background modeling" (CN107491753A), which performs vehicle detection directly in the background image obtained by background modeling. That invention performs no moving-object detection and thereby eliminates the interference of moving objects in the actual video frames. However, because it assumes that objects appearing in the background image are most likely stationary or slowly moving vehicles, environmental disturbances such as lighting changes and shadows can be mistaken for parked vehicles, which produces false parking detections. In a tunnel scene, darkness and changeable illumination make feature extraction harder for traditional parking detection algorithms, so their accuracy drops sharply and they cannot detect parking events reliably.

Deep learning can simulate the complex hierarchical cognition of the human brain, extract deep features of a target, and effectively solve vehicle identification in complex scenes. One existing method (CN107609491A) sets a parking detection area, detects vehicles in the current frame and records their detection boxes, and computes the intersection over union (IoU) between the current and the historical detection boxes; parking behavior is determined if the IoU is greater than a threshold and the vehicle's stationary time exceeds a set threshold. If one or more vehicles drive slowly through the detection area, their IoU also exceeds the threshold, and the method produces false parking detections. Another method extracts and preprocesses the vehicle target foreground with a background difference method, obtains a suspected stationary target area by tracking and estimating the target's speed over a short time, applies deep learning to the image of that area to detect whether a vehicle is present, and judges a parking event if a vehicle is detected there. However, in a tunnel scene, tunnel lighting, the frequent flashing of the lights of stationary vehicles, and mutual occlusion among multiple vehicles cause ID switches, missed detections and false detections in vehicle tracking, which in turn cause false detections of parking events.

Disclosure of Invention

In view of the problems in the prior art, the technical problem to be solved by the invention is that traditional parking detection algorithms have a high false detection rate and cannot be applied to tunnel scenes.

In order to solve the technical problems, the invention adopts the following technical scheme: the tunnel parking event detection method fusing the traditional parking detection and the neural network comprises the following steps:

S1: collecting driving videos from cameras in different scenes of the expressway tunnel, capturing and storing pictures at a fixed frame interval, and collecting a plurality of pictures as a data set;

labeling the vehicle target in each picture of the data set with an image labeling tool, wherein the labeled content comprises the vehicle type and the coordinates of the bounding box surrounding the vehicle target, and all labeled pictures are taken as first samples to form a VOC data set;

S2: clustering the bounding boxes surrounding the vehicle targets labeled in the pictures of the VOC data set obtained in S1 to obtain the most suitable vehicle target bounding box size for each vehicle type, and taking these sizes as the Anchor sizes in the SSD neural network;

S3: constructing and training a vehicle identification model based on the SSD neural network, wherein the structure of the vehicle identification model is as follows:

selecting VGG16, which uses small convolution kernels instead of large convolution kernels, as the backbone network;

and making the following modifications to VGG16: removing the last fully-connected layer of VGG16, which is used for classification; converting the remaining two fully-connected layers fc6 and fc7 into convolutional layers Conv6 and Conv7; additionally adding 4 convolutional layers named Conv8_2, Conv9_2, Conv10_2 and Conv11_2; and finally selecting the 6 layers Conv4_3, Conv7, Conv8_2, Conv9_2, Conv10_2 and Conv11_2 to form a feature pyramid multi-scale detection structure;

taking the Anchor sizes obtained in S2 as the Anchor sizes of the vehicle identification model, and training the vehicle identification model with the first samples to obtain an optimal vehicle identification model;

S4: inputting a video to be detected into a traditional parking detection algorithm, taking the video frame picture corresponding to each obtained fixed foreground target picture as a picture to be predicted, and passing the picture to be predicted as input to the optimal vehicle identification model;

S5: the optimal vehicle identification model determines, through two judgments, whether a vehicle target in the video to be detected constitutes a tunnel parking event.

As an improvement, the steps in S2 of clustering the bounding boxes in the VOC data set to select appropriate Anchor sizes for the SSD neural network are as follows:

S21: extracting the length and width of the bounding box surrounding the vehicle target in each first sample;

S22: taking the length and width of the bounding box in each first sample, together with the vehicle target size information corresponding to the vehicle category in that first sample, as a second sample to obtain a second sample set, and clustering all second samples in the second sample set with the K-Means clustering algorithm, as follows:

S221: randomly selecting B second samples from the second sample set as initial cluster centers, each of which serves as the center of one cluster;

S222: calculating the distance from each remaining second sample in the second sample set to the B initial cluster centers and assigning each second sample to the closest cluster, wherein the distance is calculated as:

$$d_{i,c} = 1 - \mathrm{IOU}(i, c) \tag{1}$$

where $d_{i,c}$ is the distance from the i-th second sample to the c-th cluster center, i denotes the i-th second sample, c denotes the c-th cluster center, and IOU denotes the intersection over union of the areas of the i-th second sample and the c-th cluster center;

S223: calculating the mean of the distances from the second samples in the c-th cluster to its cluster center, and taking the i-th second sample whose $d_{i,c}$ is closest to this mean as the new cluster center of the c-th cluster;

S224: calculating the distance d between the new cluster center of the c-th cluster and the initial cluster center of the c-th cluster;

S225: if d is smaller than a set threshold, or the current number of iterations has reached the maximum number of iterations, exiting; otherwise, taking the new cluster center of the c-th cluster as its initial cluster center and returning to step S222.
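The clustering of S221 to S225 can be sketched in Python as follows. This is a minimal illustration under stated assumptions, not the implementation of the invention: the names `iou_wh` and `kmeans_anchors`, the default threshold, the iteration cap and the fixed random seed are hypothetical, each second sample is reduced to a (width, height) pair, and the IOU is computed with both boxes aligned at a common corner.

```python
import random


def iou_wh(a, b):
    """IOU of two (width, height) boxes aligned at a common corner (assumption)."""
    inter = min(a[0], b[0]) * min(a[1], b[1])
    union = a[0] * a[1] + b[0] * b[1] - inter
    return inter / union


def kmeans_anchors(boxes, B, threshold=1e-6, max_iter=100, seed=0):
    """Cluster (w, h) pairs along the lines of S221-S225; returns B centers."""
    rng = random.Random(seed)
    centers = rng.sample(boxes, B)                       # S221: random initial centers
    for _ in range(max_iter):                            # S225: iteration cap
        clusters = [[] for _ in range(B)]
        for box in boxes:                                # S222: assign by d = 1 - IOU
            dists = [1.0 - iou_wh(box, c) for c in centers]
            clusters[dists.index(min(dists))].append(box)
        new_centers = []
        for center, members in zip(centers, clusters):   # S223: sample closest to mean distance
            if not members:                              # keep the old center for an empty cluster
                new_centers.append(center)
                continue
            dists = [1.0 - iou_wh(m, center) for m in members]
            mean_d = sum(dists) / len(dists)
            best = min(range(len(members)), key=lambda k: abs(dists[k] - mean_d))
            new_centers.append(members[best])
        # S224: largest movement of any cluster center in this iteration
        shift = max(1.0 - iou_wh(n, c) for n, c in zip(new_centers, centers))
        centers = new_centers
        if shift < threshold:                            # S225: converged
            break
    return centers
```

Because S223 picks the sample closest to the mean distance rather than a mean box, the centers can alternate between neighbouring samples, which is why the sketch keeps the iteration cap of S225 as a second exit condition.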

As an improvement, the process in S3 of training the vehicle identification model with the first samples to obtain the optimal vehicle identification model is as follows:

S31: pre-training the vehicle identification model on the ImageNet large-scale classification data set to obtain a suboptimal vehicle identification model, and initializing each layer of the VGG16 network from the suboptimal vehicle identification model, wherein the newly added layers are initialized with the Xavier method;

S32: setting the total number of categories in the SSD neural network to 2;

S33: inputting all first samples of the first sample set into the suboptimal vehicle identification model, calculating the loss of the current iteration, and updating the parameters of the suboptimal vehicle identification model by stochastic gradient descent according to the loss;

S34: judging whether the maximum number of iterations has been reached; if so, taking the current suboptimal vehicle identification model as the optimal vehicle identification model; otherwise, returning to step S33.

As an improvement, the process in S5 by which the optimal vehicle identification model determines whether a vehicle target in the video to be detected constitutes a tunnel parking event is as follows:

S51: obtaining a preliminary detection result under a given confidence threshold: identifying the vehicle target in the picture with the optimal vehicle identification model and obtaining the coordinates (x1, y1, x2, y2) of the vehicle target bounding box, wherein (x1, y1) is the upper-left corner and (x2, y2) is the lower-right corner of the bounding box;

S52: calculating the area Area_Det of the vehicle target bounding box:

$$\mathrm{Area\_Det} = (x_2 - x_1)(y_2 - y_1) \tag{2}$$

S53: counting the number Q of parking foreground pixels inside the vehicle target bounding box, which represents the area occupied by the fixed foreground target within the detection box;

S54: calculating the proportion P of the parking target foreground within the vehicle target bounding box:

$$P = \frac{Q}{\mathrm{Area\_Det}}$$

the fixed foreground target picture is traversed within the coordinates of the vehicle target bounding box; if the proportion P exceeds a set threshold T, the vehicle target is judged to be a tunnel parking event; otherwise, no tunnel parking event is judged to exist.
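Steps S52 to S54 can be illustrated with the following Python sketch. It is a hedged example rather than the implementation of the invention: it assumes that the proportion P is the foreground pixel count Q divided by Area_Det, that the fixed foreground target picture is a binary mask stored as nested lists indexed as `mask[y][x]`, and that the box uses integer pixel coordinates; the function names are hypothetical, and the default T = 0.7 is the value given in the detailed description.

```python
def foreground_ratio(mask, box):
    """Return P = Q / Area_Det for a binary mask and a box (x1, y1, x2, y2)."""
    x1, y1, x2, y2 = box
    area_det = (x2 - x1) * (y2 - y1)          # S52: formula (2)
    if area_det <= 0:                         # degenerate box: no area, no foreground
        return 0.0
    # S53: count the foreground pixels Q inside the bounding box
    q = sum(mask[y][x] != 0 for y in range(y1, y2) for x in range(x1, x2))
    return q / area_det                       # S54: proportion P


def is_parking(mask, box, T=0.7):
    """Second judgment: the box is a parking candidate if P exceeds T."""
    return foreground_ratio(mask, box) > T
```

For example, a detection box that is completely covered by fixed foreground yields P = 1 and passes the check, whereas a box over a moving vehicle leaves little fixed foreground and is rejected.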

As an improvement, the loss of the current iteration in S33 is calculated as follows:

the loss function of the current iteration is a weighted sum of the position loss function $L_{loc}$ and the confidence loss function $L_{conf}$:

$$L(x, c, l, g) = \frac{1}{N}\left(L_{conf}(x, c) + \alpha L_{loc}(x, l, g)\right)$$

where α adjusts the proportion between the position loss function and the confidence loss function, and N is the total number of default boxes (Anchors) matched to labeling boxes; if N is 0, the loss is defined to be 0. The position loss is the $\mathrm{smooth}_{L1}$ loss between the prediction box output for a first sample by the suboptimal vehicle identification model and the labeled bounding box surrounding the vehicle target. With a default box d of center (cx, cy), width w and height h, the position loss function is:

$$L_{loc}(x, l, g) = \sum_{i \in Pos}^{N} \sum_{m \in \{cx, cy, w, h\}} x_{ij}^{k}\, \mathrm{smooth}_{L1}\left(l_i^{m} - \hat{g}_j^{m}\right)$$

where $x_{ij}^{k}$ indicates whether the i-th prediction box and the j-th labeling box match with respect to category k (1 for a match, 0 for a mismatch), $l_i$ denotes the i-th prediction box, $\hat{g}_j$ denotes the j-th labeling box (encoded relative to the default box d), and Pos denotes the positive samples;

the confidence loss function is the softmax loss over the class confidences, with the weight α set to 1 by cross-validation:

$$L_{conf}(x, c) = -\sum_{i \in Pos}^{N} x_{ij}^{p} \log\left(\hat{c}_i^{p}\right) - \sum_{i \in Neg} \log\left(\hat{c}_i^{0}\right), \qquad \hat{c}_i^{p} = \frac{\exp\left(c_i^{p}\right)}{\sum_{p} \exp\left(c_i^{p}\right)}$$

where i is the prediction box index, j is the labeling box index, p is the category index and p = 0 denotes the background; $x_{ij}^{p}$ equal to 1 indicates that the i-th prediction box is matched to the j-th labeling box, whose category is p; and $\hat{c}_i^{p}$ is the predicted probability of category p for the i-th prediction box.
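The weighted loss described above can be sketched in plain Python as follows. The sketch assumes the standard SSD formulation (confidence loss plus α times position loss, divided by N), since the formula images are not reproduced in this text; the function names and the input layout (already matched and encoded boxes, raw class scores) are hypothetical choices made for illustration only.

```python
import math


def smooth_l1(x):
    """smooth_L1(x): quadratic near zero, linear elsewhere."""
    ax = abs(x)
    return 0.5 * x * x if ax < 1.0 else ax - 0.5


def log_softmax(scores, k):
    """Log of the softmax probability of class k, computed stably."""
    m = max(scores)
    return scores[k] - m - math.log(sum(math.exp(s - m) for s in scores))


def ssd_loss(pos, neg_scores, alpha=1.0):
    """pos: list of (pred_offsets, target_offsets, class_scores, label) for
    matched default boxes; neg_scores: class scores of negative boxes."""
    n = len(pos)
    if n == 0:                                   # N = 0: loss defined as 0
        return 0.0
    # Position loss: smooth L1 over (cx, cy, w, h) of every positive box
    loc = sum(smooth_l1(p - g)
              for pred, target, _, _ in pos
              for p, g in zip(pred, target))
    # Confidence loss: positives against their label, negatives against background (0)
    conf = -sum(log_softmax(scores, label) for _, _, scores, label in pos)
    conf -= sum(log_softmax(scores, 0) for scores in neg_scores)
    return (conf + alpha * loc) / n
```

With two categories (background and vehicle), a positive box whose class scores are equal contributes log 2 to the confidence loss, which is a convenient sanity check for an implementation.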

Compared with the prior art, the invention has at least the following advantages:

compared with the "deep learning vehicle parking detection method based on surveillance video" (CN109919053A) applied for by Taiyuan University of Technology, the method effectively overcomes the influence of interference such as ambient illumination, flickering vehicle lights and vehicle ID switches on the judgment of parking events; compared with the traditional parking event detection algorithm, the method has higher accuracy.

Drawings

FIG. 1 is a schematic flow diagram of the process of the present invention.

Detailed Description

The present invention will be described in further detail with reference to the accompanying drawings.

To address the dark tunnel environment, severe illumination interference, and the real-time and accuracy requirements of parking target detection in tunnel scenes, the invention selects the SSD detection network, which has a high detection speed, as the base network under the TensorFlow deep learning framework, and obtains a vehicle identification model through training. To eliminate the influence of interference such as tunnel lighting, vehicle lights and water stains on parking event detection, the method takes the parking target foreground detected by the traditional parking detection algorithm as a preliminary detection result, then feeds the video frame corresponding to each picture containing a parking target foreground into the SSD-based tunnel vehicle identification model to identify vehicle targets and obtain the coordinates of the vehicle detection boxes, and then computes the proportion of parking target foreground within each detection box; if the proportion stays above the set threshold throughout a set time, a parking event is judged. This double detection method, which fuses the SSD neural network with the traditional parking detection algorithm, improves the accuracy of parking event detection and reduces the false detection rate.

First, videos containing parking behavior in the tunnel are collected, pictures are captured and stored at a fixed frame interval and labeled to form the data set for training the vehicle identification model, and the tunnel vehicle identification model based on the SSD neural network is trained. Next, a background model is established with a Gaussian mixture model, the picture is preprocessed, the parking target foreground is extracted by the background difference method, and the video frame corresponding to each picture containing a parking target foreground is taken as the input of the vehicle identification model based on the SSD (Single Shot MultiBox Detector) neural network. Finally, the picture is detected with the SSD-based vehicle identification model to obtain the coordinates of the vehicle target detection boxes, the proportion of parking target foreground within each detection box is computed, and if the proportion stays above the set threshold throughout a set time, a tunnel parking event is determined to have occurred.
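The time condition described above (the proportion must stay above the threshold throughout a set time) can be sketched as a simple run-length check. This is an illustration under assumptions: the function name `parking_event` and the parameter `min_consecutive`, which expresses the set time as a number of consecutive checked frames, are hypothetical, and the source does not state the length of the set time.

```python
def parking_event(ratios, T=0.7, min_consecutive=5):
    """Return True if the foreground proportion of one detection box stays
    above T for at least min_consecutive consecutive checks."""
    run = 0
    for p in ratios:
        run = run + 1 if p > T else 0   # any drop below T resets the run
        if run >= min_consecutive:
            return True
    return False
```

Resetting the counter on every drop means that a briefly lit or shadowed region, which exceeds the threshold only for a few frames, does not trigger an event.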

In the following description, for the sake of simplifying terms, the bounding box surrounding the vehicle object to be labeled is also referred to as a labeling box, and the result of the optimal vehicle recognition model prediction is also referred to as a prediction box.

The tunnel parking event detection method fusing the traditional parking detection and the neural network comprises the following steps:

s1: collecting driving videos from cameras in different scenes of the expressway tunnel, intercepting and storing pictures according to a fixed frame rate, and collecting a plurality of pictures as a data set; and marking the vehicle target in each picture in the data set by adopting an image marking tool, wherein the marking content comprises the vehicle type and the coordinate value of a boundary frame surrounding the vehicle target, and all marked pictures are used as a first sample to form the VOC data set.

The method comprises the steps of collecting multiple sections of driving videos with parking events from fixed cameras in different scenes of a highway tunnel, intercepting and storing pictures in a video stream every 25 frames through an automatic screenshot program, eliminating pictures with lens switching, fuzzy images and no vehicle targets, and collecting about 12000 pictures as a data set.

The vehicle targets in the captured pictures are labeled manually with the image labeling tool LabelImg. Only vehicles are labeled: the label category is car, which is the positive sample class, and the coordinates of the bounding box surrounding each target are recorded. The labels are saved as xml files to obtain the VOC data set.

The data set is randomly divided into a training set, a validation set and a test set in a ratio of 4:1:1, that is, about 8000 training pictures and about 2000 pictures each for validation and testing.
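The 4:1:1 random division can be sketched as follows. The function name `split_dataset` and the fixed seed are hypothetical and serve only to make the example reproducible; the source does not describe how the random division was carried out.

```python
import random


def split_dataset(items, ratios=(4, 1, 1), seed=0):
    """Shuffle items and split them into training, validation and test sets."""
    items = list(items)
    random.Random(seed).shuffle(items)
    total = sum(ratios)
    n_train = len(items) * ratios[0] // total
    n_val = len(items) * ratios[1] // total
    train = items[:n_train]
    val = items[n_train:n_train + n_val]
    test = items[n_train + n_val:]        # the remainder goes to the test set
    return train, val, test
```

Applied to 12000 pictures, the split produces 8000, 2000 and 2000 pictures respectively.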

The training images are preprocessed by image flipping, scale transformation, random erasing of a vehicle to generate masked images, mean subtraction, and the like.

S2: and clustering the bounding box surrounding the vehicle target marked in each picture in the VOC data set obtained in the step S1 to obtain the most suitable size of the vehicle target bounding box of each vehicle type, and taking the most suitable size of the vehicle target bounding box of each vehicle type as the Anchor size in the SSD neural network.

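S21: extracting the length and width of the bounding box surrounding the vehicle target in each first sample;

S22: clustering the resulting second samples with the K-Means clustering algorithm, wherein the distance from the i-th second sample to the c-th cluster center is:

$$d_{i,c} = 1 - \mathrm{IOU}(i, c) \tag{1}$$

and, in each iteration, the second sample whose $d_{i,c}$ is closest to the mean distance within the c-th cluster is taken as the new cluster center of that cluster.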

S3: constructing and training a vehicle identification model based on the SSD neural network. VGG16, which uses small convolution kernels instead of large convolution kernels, is selected as the backbone network; this limits the number of model parameters while keeping the receptive field unchanged;

and the following modifications are made to VGG16: the last fully-connected layer of VGG16, which is used for classification, is removed; the remaining two fully-connected layers fc6 and fc7 are converted into convolutional layers Conv6 and Conv7; 4 convolutional layers, namely Conv8_2, Conv9_2, Conv10_2 and Conv11_2, are additionally added; and finally the 6 layers Conv4_3, Conv7, Conv8_2, Conv9_2, Conv10_2 and Conv11_2 are selected to form the feature pyramid multi-scale detection structure.

When the input image is 300 × 300, the resolutions of the selected feature layers are as shown in Table 1:

TABLE 1 Resolution of the selected feature layers of the backbone network

| Feature layer | Conv4_3 | Conv7 | Conv8_2 | Conv9_2 | Conv10_2 | Conv11_2 |
| --- | --- | --- | --- | --- | --- | --- |
| Resolution | 38×38 | 19×19 | 10×10 | 5×5 | 3×3 | 1×1 |

S31: pre-training the vehicle identification model on the ImageNet large-scale classification data set to obtain a suboptimal vehicle identification model, and initializing each layer of the VGG16 network from the suboptimal vehicle identification model, wherein the newly added layers are initialized with the Xavier method; training on the training samples from a pre-trained model allows the tunnel vehicle identification model to be trained faster and makes the model more accurate.

S32: setting the total number of categories in the SSD neural network to 2, namely background and vehicle;

S33: inputting all first samples of the first sample set into the suboptimal vehicle identification model, calculating the loss of the current iteration, and updating the parameters of the suboptimal vehicle identification model by stochastic gradient descent according to the loss; training uses stochastic gradient descent with an initial learning rate of 0.004, polynomial learning rate decay, and a batch size of 4.

Specifically, the loss of the current iteration is calculated as follows:

$$L(x, c, l, g) = \frac{1}{N}\left(L_{conf}(x, c) + \alpha L_{loc}(x, l, g)\right)$$

where α adjusts the proportion between the position loss function $L_{loc}$ and the confidence loss function $L_{conf}$ and defaults to 1, and N is the total number of default boxes (Anchors) matched to labeling boxes; if N is 0, the loss is defined to be 0. The position loss is the $\mathrm{smooth}_{L1}$ loss between the prediction box output for a first sample by the suboptimal vehicle identification model and the labeled bounding box surrounding the vehicle target. With a default box d of center (cx, cy), width w and height h, the position loss function is:

$$L_{loc}(x, l, g) = \sum_{i \in Pos}^{N} \sum_{m \in \{cx, cy, w, h\}} x_{ij}^{k}\, \mathrm{smooth}_{L1}\left(l_i^{m} - \hat{g}_j^{m}\right)$$

where $l_i$ denotes the i-th prediction box, $\hat{g}_j$ denotes the j-th labeling box (encoded relative to the default box d), and Pos denotes the positive samples. The confidence loss function is:

$$L_{conf}(x, c) = -\sum_{i \in Pos}^{N} x_{ij}^{p} \log\left(\hat{c}_i^{p}\right) - \sum_{i \in Neg} \log\left(\hat{c}_i^{0}\right), \qquad \hat{c}_i^{p} = \frac{\exp\left(c_i^{p}\right)}{\sum_{p} \exp\left(c_i^{p}\right)}$$

where i is the prediction box index, j is the labeling box index, p is the vehicle category index and p = 0 denotes the background; $\hat{c}_i^{p}$ is the predicted probability of category p for the i-th prediction box. The first half of the formula is the loss of the positive samples (Pos), that is, the loss of being classified into a certain category (excluding the background); the second half is the loss of the negative samples (Neg), that is, the loss of the background category.

The Anchor sizes obtained in S2 are taken as the Anchor sizes of the vehicle identification model, and the vehicle identification model is trained with the first samples to obtain the optimal vehicle identification model.

S4: inputting a video to be detected into the traditional parking detection algorithm, taking the video frame picture corresponding to each obtained fixed foreground target picture as a picture to be predicted, and passing the picture to be predicted as input to the optimal vehicle identification model.

This mainly comprises the following two steps:

1) obtaining a background model of the video to be detected based on a Gaussian mixture model, and then obtaining the size, position and shape of each vehicle target by the background difference method;

2) performing frame sampling and a fixed-frame-count AND operation on the video to be detected: one frame is sampled from every 12 frames of the video, and the foreground obtained by AND-ing the current frame with the previous 6 sampled frames is taken as the parking target foreground; a morphological closing operation is applied to remove small noise and a threshold operation to remove target shadows; and the video frame picture corresponding to the fixed foreground target picture is taken as the picture to be predicted.
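The frame sampling and AND operation of step 2) can be sketched as follows. This is a hedged illustration, not the implementation of the invention: the class name `StationaryForeground` is hypothetical, the per-frame foreground masks are assumed to be supplied already (for example by a Gaussian mixture background subtractor) as nested lists of 0 and 1, and the closing and threshold operations are left out.

```python
from collections import deque


class StationaryForeground:
    """Keep the foreground that persists across the last sampled frames."""

    def __init__(self, sample_every=12, history=6):
        self.sample_every = sample_every       # one frame sampled per 12 frames
        self.past = deque(maxlen=history)      # the previous 6 sampled masks
        self.frame_index = 0

    def update(self, fg_mask):
        """Feed the foreground mask of every frame. Returns the fixed
        foreground mask on sampled frames once the history is full,
        otherwise None."""
        index = self.frame_index
        self.frame_index += 1
        if index % self.sample_every != 0:     # skip frames that are not sampled
            return None
        result = None
        if len(self.past) == self.past.maxlen:
            # AND the current mask with all stored masks, pixel by pixel
            result = [[int(px and all(m[y][x] for m in self.past))
                       for x, px in enumerate(row)]
                      for y, row in enumerate(fg_mask)]
        self.past.append(fg_mask)
        return result
```

A pixel survives the AND only if it was foreground in the current frame and in all six stored frames, so moving vehicles, whose foreground shifts between samples, are removed while a stopped vehicle remains.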

S5: the optimal vehicle identification model obtains whether the vehicle target in the video to be detected is a tunnel parking event through two judgments, and the specific steps are as follows:

S51: obtaining a preliminary detection result under a given confidence threshold (usually 0.5): identifying the vehicle target in the picture with the optimal vehicle identification model and obtaining the coordinates (x1, y1, x2, y2) of the vehicle target bounding box, wherein (x1, y1) is the upper-left corner and (x2, y2) is the lower-right corner of the bounding box; redundant detection boxes are removed with the non-maximum suppression algorithm to obtain a more accurate detection result;

$$\mathrm{Area\_Det} = (x_2 - x_1)(y_2 - y_1) \tag{2}$$

The second judgment of parking behavior proceeds as follows: the fixed foreground target picture is traversed within the coordinates of the vehicle target bounding box; if the proportion P exceeds the set threshold T (T is set to 0.7), the vehicle target is judged to be a tunnel parking event; otherwise, no tunnel parking event is judged to exist.

The rule for determining the existence of parking behavior is therefore: a tunnel parking event exists if P > T, and does not exist otherwise.
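The confidence filtering and non-maximum suppression named in S51 can be sketched as follows. This is a generic greedy suppression written for illustration: the function names are hypothetical, the confidence threshold of 0.5 is the value given in S51, and the IOU threshold of 0.45 is an assumption, since the source does not state one.

```python
def iou(a, b):
    """Intersection over union of two boxes given as (x1, y1, x2, y2)."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0, ix2 - ix1) * max(0, iy2 - iy1)
    union = ((a[2] - a[0]) * (a[3] - a[1])
             + (b[2] - b[0]) * (b[3] - b[1]) - inter)
    return inter / union if union > 0 else 0.0


def nms(detections, conf_threshold=0.5, iou_threshold=0.45):
    """detections: list of (score, box). Returns the kept detections,
    highest score first."""
    candidates = sorted((d for d in detections if d[0] >= conf_threshold),
                        key=lambda d: d[0], reverse=True)
    kept = []
    for score, box in candidates:
        # keep a box only if it does not overlap a better box too strongly
        if all(iou(box, k[1]) <= iou_threshold for k in kept):
            kept.append((score, box))
    return kept
```

Each kept box can then be passed to the foreground proportion check, so that every vehicle is judged once rather than once per overlapping detection.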

The number of pictures per training step, batch_size, is set according to the graphics card of the computer. The larger the batch, the more accurate the training and the smaller the training oscillation. The invention was trained on an NVIDIA GTX 1060 graphics card, and batch_size is set to 4 to keep training feasible. The initial learning rate matters greatly: if it is set too large, the loss becomes large or oscillates instead of decreasing, and if it is set too small, the descent direction cannot be found quickly. After multiple attempts, the initial learning rate was set to 0.004. Training runs for a total of 100 epochs. In the first three epochs, to keep model training stable, the invention uses learning rate warm-up (WarmUp), as shown in formula (10), where $lr_{min}$ is empirically set to $10^{-6}$, $lr_{base}$ is the initial learning rate of 0.004, and Iter and iter denote the number of iterations in one epoch and the current iteration number, respectively. As the number of iterations increases, the learning rate in the warm-up phase therefore grows linearly from $10^{-6}$ until it reaches the initial learning rate.

After the three warm-up epochs, a conventional learning rate decay strategy takes over. The invention reduces the learning rate with the cosine annealing algorithm, as shown in formula (5.6), where $T_{cur}$ and $T_{sum}$ denote the current number of iterations and the total number of iterations, respectively. Cosine annealing is smoother than step learning rate decay and can find a better solution in gradient descent.
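The warm-up and cosine annealing schedule can be sketched as follows. Because the formulas referenced above are not reproduced in this text, the sketch rests on assumptions: warm-up is taken as a linear interpolation from lr_min to lr_base over the first three epochs, the annealing uses the common cosine form, T_cur and T_sum are counted from the end of warm-up, and the annealing is assumed to decay to lr_min. The function name `learning_rate` is hypothetical.

```python
import math


def learning_rate(it, iters_per_epoch, total_epochs=100, warmup_epochs=3,
                  lr_base=0.004, lr_min=1e-6):
    """Learning rate at global iteration `it` (0-based)."""
    warmup_iters = warmup_epochs * iters_per_epoch
    total_iters = total_epochs * iters_per_epoch
    if it < warmup_iters:
        # Warm-up: linear growth from lr_min towards lr_base
        return lr_min + (lr_base - lr_min) * it / warmup_iters
    # Cosine annealing over the remaining iterations (assumed form)
    t_cur = it - warmup_iters
    t_sum = total_iters - warmup_iters
    return lr_min + 0.5 * (lr_base - lr_min) * (1 + math.cos(math.pi * t_cur / t_sum))
```

With this form the rate reaches 0.004 exactly at the end of the third epoch and then falls smoothly, without the abrupt drops of a step schedule.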

The invention selects SGD with momentum as the optimizer of the algorithm, and can effectively accelerate the convergence of the algorithm; while using a weight decay strategy of five parts per million to prevent overfitting.

The parking detection results of the method of the invention are compared with those of the traditional vehicle parking algorithm to show whether detection is improved; the comparison is shown in Table 1:

TABLE 1 statistical table of tunnel parking event detection results

Finally, the above embodiments are only for illustrating the technical solutions of the present invention and not for limiting, although the present invention has been described in detail with reference to the preferred embodiments, it should be understood by those skilled in the art that modifications or equivalent substitutions may be made to the technical solutions of the present invention without departing from the spirit and scope of the technical solutions of the present invention, and all of them should be covered in the claims of the present invention.

Claims

1. The tunnel parking event detection method fusing the traditional parking detection and the neural network is characterized by comprising the following steps of:

2. The method for detecting tunnel parking events fusing conventional parking detection and a neural network as claimed in claim 1, wherein: the step of clustering the images in the VOC data set to select an appropriate Anchor size in the SSD neural network in S2 is as follows:

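$$d_{i,c} = 1 - \mathrm{IOU}(i, c) \tag{1}$$

the second sample whose $d_{i,c}$ is closest to the mean distance within the c-th cluster is taken as the new cluster center of the c-th cluster;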

3. The method for detecting tunnel parking events fusing conventional parking detection and neural networks according to claim 1 or 2, characterized in that: the process of training the vehicle identification model by using the first sample in the step S3 to obtain the optimal vehicle identification model is as follows:

S32: setting the total number of categories in the SSD neural network to 2;

4. The method for detecting tunnel parking events fusing conventional parking detection and neural networks according to claim 3, wherein: the process of obtaining whether the vehicle target in the video to be detected is the tunnel parking event or not by the optimal vehicle identification model in the S5 is as follows:

$$\mathrm{Area\_Det} = (x_2 - x_1)(y_2 - y_1) \tag{2}$$

5. The method for detecting tunnel parking events fusing conventional parking detection and neural networks according to claim 3, wherein: the step of S33 calculating the loss of the current iteration is as follows:

wherein α adjusts the proportion between the position loss function loc and the confidence loss function conf; N is the total number of default boxes (Anchors) matched to labeling boxes, and if N is 0, the loss is defined to be 0; the position loss is the $\mathrm{smooth}_{L1}$ loss between the prediction box output for a first sample by the suboptimal vehicle identification model and the labeled bounding box surrounding the vehicle target; with a default box d of center (cx, cy), width w and height h, the position loss function is as follows:

$$L_{loc}(x, l, g) = \sum_{i \in Pos}^{N} \sum_{m \in \{cx, cy, w, h\}} x_{ij}^{k}\, \mathrm{smooth}_{L1}\left(l_i^{m} - \hat{g}_j^{m}\right)$$

wherein $l_i$ denotes the i-th prediction box, $\hat{g}_j$ denotes the j-th labeling box, and Pos denotes the positive samples;