CN115861956A - Yolov3 road garbage detection method based on decoupling head - Google Patents

Yolov3 road garbage detection method based on decoupling head

Info

Publication number
CN115861956A
CN115861956A (application CN202211703314.9A)
Authority
CN
China
Prior art keywords
layer
output
road garbage
convolution
network
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202211703314.9A
Other languages
Chinese (zh)
Inventor
许水清
易文淏
陶松兵
章文焘
郑浩东
何启航
都海波
陈立平
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Hefei University of Technology
Original Assignee
Hefei University of Technology
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Hefei University of Technology filed Critical Hefei University of Technology
Priority to CN202211703314.9A priority Critical patent/CN115861956A/en
Publication of CN115861956A publication Critical patent/CN115861956A/en
Pending legal-status Critical Current

Landscapes

  • Image Analysis (AREA)

Abstract

The invention discloses a Yolov3 road garbage detection method based on a decoupling head, belonging to the technical field of computer vision. The method establishes an improved Yolov3 network comprising a backbone network, a Neck structure and a detection head, where the Neck structure includes a channel attention mechanism CA. The backbone network is optimized on the obtained training sample set to produce the road garbage recognition network with the best recognition effect, and this network then performs road garbage detection and recognition. Compared with traditional detection methods, the method better separates and fuses features and improves the recognition capability of the network; its accuracy on the test sample set is higher than that of the other methods compared, and it adapts better to complex road environments.

Description

Yolov3 road garbage detection method based on decoupling head
Technical Field
The invention relates to the technical field of computer vision, in particular to a Yolov3 road garbage detection method based on a decoupling head.
Background
Computer vision technology has been deeply integrated into many industries, and recognizing road garbage with deep learning and image processing methods has become a research hotspot of environmental engineering applications in this field in recent years. Road garbage spread across wide urban areas is characterized by small targets, many categories and varied shapes, so its features are comparatively complex. Traditional target detection methods are easily disturbed by such complex features, causing false detections and missed detections; applying them directly to the field of road garbage recognition therefore has certain limitations.
At present, road garbage identification still mainly relies on manual operation, whose sorting efficiency is low. Especially when a large amount of garbage is handled, the harsh sorting environment and heavy workload threaten the health of operators, the difficulty of the task easily causes identification errors, and mixed garbage not only pollutes the environment but may also waste recoverable resources. Another approach builds a picture database of each garbage type and identifies garbage objects with image comparison techniques and algorithms such as scale-invariant feature transform matching; however, this approach suits only garbage-plant scenes where a conveyor belt provides a fixed background. It cannot be applied to road scenes with complex backgrounds, can hardly guarantee accuracy in multi-target tasks, and thus cannot meet the practical requirements of road garbage identification.
In summary, the existing road garbage identification technology has the following problems:
1. The background of road garbage is complex, and traditional target detection methods struggle with such complex-feature tasks;
2. Road garbage occupies only a small proportion of the identification area, and garbage of several types and shapes may be distributed in the same identification area;
3. Due to complex road conditions, road garbage can be occluded to different degrees, which covers its original features and adversely affects feature extraction.
Disclosure of Invention
The present invention has been made to solve the above problems in the prior art. Specifically, a CSPDarkNet53 network with a channel attention mechanism serves as the backbone network and a decoupling head serves as the detection head; the training sample set is used to optimize the backbone network, the test sample set is used to select the backbone network with the best optimization effect, and that network is used as the road garbage recognition network. Compared with traditional methods, the added channel attention mechanism detects small-sample targets better and adapts to complex background environments, while the decoupling detection head improves the extraction and fusion of features of road garbage of many types and forms, raising the recognition rate of road garbage.
To achieve the above object, the present invention provides a Yolov3 road garbage detection method based on a decoupling head. The method improves the Yolov3 network structure, trains it on collected road garbage images to obtain the road garbage recognition network with the best detection effect, and uses this network to recognize road garbage. It specifically comprises the following steps:
step 1, collecting and processing road garbage images
Collect D classes of road garbage images, where D denotes the number of road garbage image classes;
Select M road garbage images from each of the D classes, obtaining M × D images; then apply Z image processing modes to complete data enhancement on the M × D images, obtaining Z × M × D images, which form the training sample data set;
Select another N images from each of the D classes, excluding the M images already chosen, obtaining N × D images, which form the test sample data set, where N ≠ M;
step 2, establishing an improved Yolov3 network based on a decoupling detection head and a channel attention mechanism, wherein the improved Yolov3 network comprises a backbone network, a Neck structure and a detection head;
Step 2.1, adopt a CSPDarkNet53 network as the backbone network, and define the depth coefficient ζ as the value obtained by dividing the actual number of network layers by the nominal network layer number 256. The backbone network structure comprises: a standard convolution layer α1 formed by connecting in series, in sequence, a convolution layer with kernel size 6 × 6, a batch normalization layer and a SiLU activation function, the standard convolution layer α1 having 32 input channels; a standard convolution layer α2 formed by connecting in series, in sequence, a convolution layer with kernel size 3 × 3, a batch normalization layer and a SiLU activation function, the standard convolution layer α2 having 64 input channels; a standard convolution layer α3 formed by connecting in series, in sequence, a convolution layer with kernel size 3 × 3, a batch normalization layer and a SiLU activation function, the standard convolution layer α3 having 128 input channels; a standard convolution layer α4 formed by connecting in series, in sequence, a convolution layer with kernel size 3 × 3, a batch normalization layer and a SiLU activation function, the standard convolution layer α4 having 256 input channels; a standard convolution layer α5 formed by connecting in series, in sequence, a convolution layer with kernel size 6 × 6, a batch normalization layer and a SiLU activation function, the standard convolution layer α5 having 512 input channels; a C3 module layer β1 containing 3 standard convolution layers α2 and 128ζ bottleneck modules; a C3 module layer β2 containing 3 standard convolution layers α3 and 256ζ bottleneck modules; a C3 module layer β3 containing 3 standard convolution layers α4 and 512ζ bottleneck modules; a C3 module layer β4 containing 3 standard convolution layers α5 and 1024ζ bottleneck modules; and an SPPF module layer γ1;
The input of the backbone network is the standard convolution layer α1 and the output is the SPPF module layer γ1; specifically, standard convolution layer α1, standard convolution layer α2, C3 module layer β1, standard convolution layer α3, C3 module layer β2, standard convolution layer α4, C3 module layer β3, standard convolution layer α5, C3 module layer β4 and SPPF module layer γ1 are connected in series in sequence;
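As an illustration, the standard convolution layer used throughout the backbone (a convolution, a batch normalization layer and a SiLU activation connected in series) can be sketched in PyTorch as below; the kernel size and channel figures follow step 2.1, while the stride and the 3-channel RGB input are assumptions not stated in the filing.

    import torch.nn as nn

    class StandardConv(nn.Module):
        """Convolution + batch normalization + SiLU, connected in series."""
        def __init__(self, c_in, c_out, kernel_size, stride=1):
            super().__init__()
            self.conv = nn.Conv2d(c_in, c_out, kernel_size, stride,
                                  padding=kernel_size // 2, bias=False)
            self.bn = nn.BatchNorm2d(c_out)
            self.act = nn.SiLU()

        def forward(self, x):
            return self.act(self.bn(self.conv(x)))

    # e.g. a 6 x 6 standard convolution layer in the spirit of alpha_1
    # (stride 2 is an assumption; the filing only gives kernel and channels)
    alpha1 = StandardConv(3, 32, kernel_size=6, stride=2)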
Step 2.2, adopt an FPN + PAN network as the Neck structure, which comprises: a convolution layer χ1 with kernel size 1 × 1 and 512 channels; a convolution layer χ2 with kernel size 1 × 1 and 256 channels; a convolution layer χ3 with kernel size 3 × 3 and 128 channels; a convolution layer χ4 with kernel size 3 × 3 and 256 channels; a down-sampling layer δ with 256 channels; four Concat module layers, denoted Concat module layer C1, Concat module layer C2, Concat module layer C3 and Concat module layer C4; two 512-channel C3 module layers, denoted C3 module layer D1 and C3 module layer D2; two 256-channel C3 module layers, denoted C3 module layer D3 and the fourth C3 module layer D4; and a channel attention mechanism CA;
The Neck structure has three inputs, denoted input output11, input output12 and input output13, where input output11 is connected to the output of C3 module layer β2 of the backbone network, input output12 to the output of C3 module layer β3, and input output13 to the output of SPPF module layer γ1. The Neck structure has three outputs, denoted output21, output22 and output23, where output21 is the output of C3 module layer D2, output22 is the output of C3 module layer D3, and output23 is the output of the channel attention mechanism CA;
Step 2.3, adopt a decoupling detection head as the detection head, whose structure comprises: a convolution layer Z1 with kernel size 1 × 1 and 256 channels; a convolution layer Z2 with kernel size 3 × 3 and 256 channels; a convolution layer Z3 with kernel size 3 × 3 and 512 channels; a convolution layer Z4 with kernel size 1 × 1 and D channels; a convolution layer Z5 with kernel size 1 × 1 and 4 channels; and a convolution layer Z6 with kernel size 1 × 1 and 1 channel;
The input of the decoupling head is convolution layer Z1, which is connected to the three outputs output21, output22 and output23 of the Neck structure. Its output forms the following three paths: the first path is formed by connecting convolution layer Z1, convolution layer Z2, convolution layer Z3 and convolution layer Z4 in series in sequence; the second path by connecting convolution layer Z1, convolution layer Z2, convolution layer Z3 and convolution layer Z5 in series in sequence; and the third path by connecting convolution layer Z1, convolution layer Z2, convolution layer Z3 and convolution layer Z6 in series in sequence;
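A minimal PyTorch sketch of this decoupled head follows; reading Z1 to Z3 as a trunk shared by the three 1 × 1 output branches Z4 to Z6 is one interpretation of the wording above, and num_classes stands for D.

    import torch.nn as nn

    class DecoupledHead(nn.Module):
        def __init__(self, c_in, num_classes):
            super().__init__()
            self.z1 = nn.Conv2d(c_in, 256, 1)            # Z1: 1x1, 256 channels
            self.z2 = nn.Conv2d(256, 256, 3, padding=1)  # Z2: 3x3, 256 channels
            self.z3 = nn.Conv2d(256, 512, 3, padding=1)  # Z3: 3x3, 512 channels
            self.z4 = nn.Conv2d(512, num_classes, 1)     # Z4: class probabilities, D channels
            self.z5 = nn.Conv2d(512, 4, 1)               # Z5: box coordinates, 4 channels
            self.z6 = nn.Conv2d(512, 1, 1)               # Z6: IoU confidence, 1 channel

        def forward(self, x):
            t = self.z3(self.z2(self.z1(x)))
            return self.z4(t), self.z5(t), self.z6(t)    # three decoupled outputs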
Step 3, train the improved Yolov3 network established in step 2 to obtain the network with the best detection effect, and use it as the road garbage recognition network; the specific steps are as follows:
Step 3.1, uniformly resize the road garbage images in the training sample set to S × S pixels;
Step 3.2, randomly select B road garbage images from the training sample set to form a sequence γ, γ = (y_1, y_2, ..., y_s, ..., y_B), where y_s denotes any road garbage image in the sequence γ, s = 1, 2, ..., B. Compute the actual class probability tensor Y_s, the actual class coordinate tensor W_s and the actual class IoU tensor X_s of image y_s, where the actual class probability tensor Y_s has size H × W × C, the actual class coordinate tensor W_s has size H × W × 4, and the actual class IoU tensor X_s has size H × W × 1; here H denotes the height of each tensor, W the width and C the depth;
Initialize the predicted class probability tensor O_s, the predicted class coordinate tensor P_s and the predicted class IoU tensor Q_s of image y_s as follows:
Define the coordinates of the predicted class probability tensor O_s, the predicted class coordinate tensor P_s and the predicted class IoU tensor Q_s to consist of an abscissa n, an ordinate m and a depth coordinate γ, denoted (n, m, γ);
For O_s, arbitrarily select (n, m, γ) with n = 1, 2, ..., H, m = 1, 2, ..., W, γ = 1, 2, ..., C, and set O_s(n, m, γ) to its initial value (given by a formula image in the original filing); the predicted probability values of all other coordinates of O_s equal 0. For the predicted class coordinate tensor P_s, arbitrarily select (n, m, γ) with n = 1, 2, ..., H, m = 1, 2, ..., W, γ = 1, 2, ..., 4, and set P_s(n, m, γ) to its initial value (given by a formula image in the original filing); the predicted probability values of all other coordinates of P_s equal 0. For the predicted class IoU tensor Q_s, arbitrarily select (n, m, γ) with n = 1, 2, ..., H, m = 1, 2, ..., W, γ = 1, and set Q_s(n, m, γ) to its initial value (given by a formula image in the original filing); the predicted probability values of all other coordinates of Q_s equal 0;
Step 3.3, after the B road garbage images selected in step 3.2 are input into the backbone network, update the predicted class probability tensor O_s, the predicted class coordinate tensor P_s and the predicted class IoU tensor Q_s of each road garbage image, s = 1, 2, ..., B;
Step 3.4, optimize the backbone network according to the updated predicted tensors and the actual tensors:
Divide the height of image y_s equally into H segments and the width equally into W segments, i.e., divide image y_s equally into H × W grids;
Make a prediction for each grid of image y_s, compare the obtained prediction information with the real information to obtain the loss function loss, minimize the loss function loss by gradient descent, and complete the optimization of the backbone network;
Step 3.5, repeat steps 3.2 to 3.4 until all road garbage images in the training sample set have been selected; if fewer than B images remain in the training sample set during the last round of selection, randomly select images from those already selected to make up the difference;
Denote the backbone network optimized through steps 3.2 to 3.5 as the h-th generation backbone network T_h, where h is the generation number;
Step 3.6, use the test sample set to calculate the mean average precision V_h of the h-th generation backbone network T_h over the road garbage images in the test sample set, as follows:
Step 3.6.1, define any one class of road garbage among the D classes as class-i garbage, where i = 1, 2, ..., D;
Define a prediction frame as a rectangular frame marked on a grid, where the predicted class probability tensor O_s determines the garbage class detected by the rectangular frame, the predicted class coordinate tensor P_s determines the center coordinates of the rectangular frame, and the predicted class IoU tensor Q_s determines the confidence of the rectangular frame; define an actual frame as a rectangular frame manually marking the road garbage on the road garbage image; define the overlap degree I as the area of the intersection of the prediction frame and the actual frame divided by the area of their union;
Step 3.6.2, randomly take n unequal decimals between 0 and 1 to form an overlap-degree threshold sequence K, K = {K_i1, K_i2, ..., K_ij, ..., K_in}, where K_ij is the j-th overlap-degree threshold corresponding to class-i garbage, j = 1, 2, ..., n;
Define TP as the number of prediction frames of class-i garbage whose overlap degree I is greater than or equal to the j-th overlap-degree threshold K_ij, FP as the number of prediction frames of class-i garbage whose overlap degree I is less than the j-th overlap-degree threshold K_ij, and FN as the number of actual frames to which no prediction frame corresponds. Calculate the recall rate R_ij and the precision P_ij of class-i garbage at the j-th overlap-degree threshold K_ij by the following formulas:
R_ij = TP / (TP + FN)

P_ij = TP / (TP + FP)
Step 3.6.3, following the method of step 3.6.2, calculate the recall rate and precision of class-i garbage at every overlap-degree threshold in the sequence K, obtaining n recall rates R_ij and precisions P_ij; in order from 1 to n, take the recall rate R_ij as the abscissa and the precision P_ij as the ordinate and plot a curve in a plane coordinate system, denoted the P_i-R_i curve;
Take as a contour the P_i-R_i curve, the abscissa axis, the ordinate axis and the line connecting the end point of the P_i-R_i curve to the abscissa axis; calculate the area enclosed by this contour and record it as the AP value F_i of class-i garbage;
Step 3.6.4, following the method of steps 3.6.2 to 3.6.3, calculate the AP value of each of the D classes of road garbage, obtaining D AP values F_i, and from the D AP values F_i compute the mean average precision V_h of the h-th generation backbone network T_h over the road garbage images in the test sample set:

V_h = (1 / D) × Σ_{i=1}^{D} F_i
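As an illustration of steps 3.6.2 to 3.6.4, the sketch below turns TP/FP/FN counts into recall and precision, approximates the area under the P-R curve, and averages the per-class AP values; trapezoidal integration is an assumption, since the filing only specifies the enclosed area.

    import numpy as np

    def recall_precision(tp, fp, fn):
        # step 3.6.2: R = TP / (TP + FN), P = TP / (TP + FP)
        return tp / (tp + fn), tp / (tp + fp)

    def ap_from_curve(recalls, precisions):
        # step 3.6.3: area enclosed by the P-R curve and the coordinate axes
        order = np.argsort(recalls)
        r = np.asarray(recalls, dtype=float)[order]
        p = np.asarray(precisions, dtype=float)[order]
        return float(np.trapz(p, r))

    def mean_average_precision(ap_values):
        # step 3.6.4: V_h is the mean of the D per-class AP values F_i
        return float(np.mean(ap_values))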
Step 3.7, set the number of repetitions to G and repeat steps 3.4 to 3.6 G times, obtaining a network set T and a mean average precision set V, T = {T_1, T_2, ..., T_h, ..., T_G}, V = {V_1, V_2, ..., V_h, ..., V_G};
Denote V_o as the highest mean average precision, V_o = max{V_1, V_2, ..., V_h, ..., V_G}; the backbone network T_o corresponding to V_o is the network with the best recognition effect and is recorded as the road garbage recognition network;
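Step 3.7 amounts to keeping the generation with the highest mean average precision; a sketch under the assumption that train_one_generation and evaluate_map implement steps 3.2 to 3.5 and step 3.6 respectively (both helper names are invented here).

    import copy

    def select_best_network(net, train_set, test_set, g_rounds,
                            train_one_generation, evaluate_map):
        """Train G generations and keep the one with the highest V_h."""
        best_v, best_net = -1.0, None
        for h in range(g_rounds):
            train_one_generation(net, train_set)   # steps 3.2 to 3.5
            v_h = evaluate_map(net, test_set)      # step 3.6: V_h
            if v_h > best_v:                       # V_o = max{V_1, ..., V_G}
                best_v, best_net = v_h, copy.deepcopy(net)
        return best_net, best_v                    # T_o and V_o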
Step 4, identify road garbage using the road garbage recognition network.
Preferably, C3 module layer β1, C3 module layer β2, C3 module layer β3 and C3 module layer β4 each consist of three layers connected in series in sequence along the input-output direction of the backbone network, specifically:
With 128ζ, 256ζ, 512ζ and 1024ζ collectively denoted n × ζ, where n equals 128, 256, 512 or 1024, and with standard convolution layer α2, standard convolution layer α3, standard convolution layer α4 and standard convolution layer α5 collectively denoted standard convolution layer α:
The first layer of each of the four C3 module layers consists of two parallel paths b1 and b2, where path b1 is formed by connecting a standard convolution layer α and n × ζ bottleneck modules in series in sequence and path b2 consists of a standard convolution layer α; the second layer is a Concat module layer whose inputs are path b1 and path b2 and whose output is connected in series to the third layer; the third layer is a standard convolution layer α. A sketch of this module is given after this paragraph.
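A PyTorch sketch of a C3 module layer consistent with this description follows; splitting the output width evenly between paths b1 and b2 follows common C3 implementations and is an assumption, and the bottleneck module matches the residual structure described near the end of this document.

    import torch
    import torch.nn as nn

    def std_conv(c_in, c_out, k=1):
        # standard convolution layer alpha: conv + batch normalization + SiLU
        return nn.Sequential(
            nn.Conv2d(c_in, c_out, k, padding=k // 2, bias=False),
            nn.BatchNorm2d(c_out), nn.SiLU())

    class Bottleneck(nn.Module):
        def __init__(self, c):
            super().__init__()
            self.cv1 = nn.Conv2d(c, c, 1)             # 1x1 convolution
            self.cv2 = nn.Conv2d(c, c, 3, padding=1)  # 3x3 convolution
        def forward(self, x):
            return x + self.cv2(self.cv1(x))          # direct addition of the two paths

    class C3(nn.Module):
        def __init__(self, c_in, c_out, n_bottlenecks):
            super().__init__()
            c_mid = c_out // 2
            self.path_b1 = nn.Sequential(
                std_conv(c_in, c_mid),
                *[Bottleneck(c_mid) for _ in range(n_bottlenecks)])
            self.path_b2 = std_conv(c_in, c_mid)         # second parallel path
            self.out_conv = std_conv(2 * c_mid, c_out)   # third layer
        def forward(self, x):
            # second layer: Concat of the two parallel paths, then the third layer
            return self.out_conv(torch.cat((self.path_b1(x), self.path_b2(x)), dim=1))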
Preferably, the Neck structure consists of 8 layers connected in series in sequence along the input-output direction of the backbone network:
The first layer takes input output11 and input output12 as inputs and comprises two parallel paths σ1 and σ2, where path σ1 is formed by connecting input output11, Concat module layer C1, convolution layer χ1 and the down-sampling layer δ in series in sequence, and path σ2 is formed by connecting input output12 to Concat module layer C1; the second layer consists of Concat module layer C1; the third layer takes the output of Concat module layer C1 and input output13 as inputs and comprises two parallel paths σ3 and σ4, where path σ3 is formed by connecting the output of Concat module layer C1, Concat module layer C2, C3 module layer D1, convolution layer χ2 and the down-sampling layer δ in series in sequence, and path σ4 is formed by connecting input output13 to Concat module layer C2; the fourth layer consists of Concat module layer C2; the fifth layer takes the output of Concat module layer C2 and the output of convolution layer χ2 as inputs and comprises two parallel paths σ5 and σ6, where path σ5 is formed by connecting the output of Concat module layer C2, Concat module layer C3, C3 module layer D2 and convolution layer χ3 in series in sequence, and path σ6 is formed by directly connecting the output of convolution layer χ2 to Concat module layer C3; the sixth layer is Concat module layer C3; the seventh layer takes the output of Concat module layer C3 and the output of convolution layer χ1 as inputs and comprises two parallel paths σ7 and σ8, where path σ7 is formed by connecting the output of Concat module layer C3, C3 module layer D3 and convolution layer χ4 in series in sequence, and path σ8 is formed by directly connecting the output of convolution layer χ1 to Concat module layer C4; the eighth layer is formed by connecting Concat module layer C4, C3 module layer D4 and the channel attention mechanism CA in series in sequence.
Preferably, the channel attention mechanism CA is implemented as follows:
The channel attention mechanism CA includes an average pooling layer ηX, an average pooling layer ηY, a Concat module layer L, a convolution layer λ, a batch normalization layer A and a Sigmoid nonlinear activation layer J;
The channel attention mechanism CA consists of 3 layers connected in series in sequence along the input-output direction of the backbone network: the first layer consists of two parallel paths II1 and II2, where path II1 consists of the average pooling layer ηX and path II2 consists of the average pooling layer ηY; the second layer is formed by connecting the Concat module layer L and the batch normalization layer A in series in sequence along the direction from the two parallel paths to the backbone network output; the third layer splits the output of the batch normalization layer A into two parallel paths II3 and II4, each formed by connecting a convolution layer λ and a Sigmoid nonlinear activation layer J in series in sequence;
The input of the channel attention mechanism CA is the output of the fourth C3 module layer D4, and the output of the channel attention mechanism CA is obtained by multiplying the outputs of path II3 and path II4 with the input of the channel attention mechanism CA.
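A sketch of this attention block, following the path structure just described; reading ηX and ηY as height-wise and width-wise average pooling (as in published coordinate attention) is an assumption.

    import torch
    import torch.nn as nn

    class ChannelAttentionCA(nn.Module):
        def __init__(self, channels):
            super().__init__()
            self.pool_h = nn.AdaptiveAvgPool2d((None, 1))  # eta_X: average over width
            self.pool_w = nn.AdaptiveAvgPool2d((1, None))  # eta_Y: average over height
            self.bn = nn.BatchNorm2d(channels)             # batch normalization layer A
            self.lam_h = nn.Conv2d(channels, channels, 1)  # convolution lambda, path II3
            self.lam_w = nn.Conv2d(channels, channels, 1)  # convolution lambda, path II4

        def forward(self, x):
            b, c, h, w = x.shape
            xh = self.pool_h(x)                        # (b, c, h, 1)
            xw = self.pool_w(x).permute(0, 1, 3, 2)    # (b, c, w, 1)
            y = self.bn(torch.cat((xh, xw), dim=2))    # Concat module layer L, then A
            yh, yw = torch.split(y, [h, w], dim=2)
            a_h = torch.sigmoid(self.lam_h(yh))                      # Sigmoid layer J
            a_w = torch.sigmoid(self.lam_w(yw)).permute(0, 1, 3, 2)  # back to (b, c, 1, w)
            return x * a_h * a_w   # multiply both attention maps with the CA input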
Preferably, the specific steps of identifying road garbage with the road garbage identification network are as follows:
Step 4.1, input an original road garbage image to be identified and perform the following image processing: resize the original road garbage image to be identified to E × E pixels;
Denote the processed original road garbage image to be identified as image Y_d;
Step 4.2, divide image Y_d equally into grids, and denote any one of the grids as grid K_d^v, where v is the grid number, v = 1, 2, ..., Λ, and Λ is the number of grids;
Step 4.3, feed image Y_d into the road garbage recognition network to obtain the predicted class probability tensor, predicted class coordinate tensor and predicted class IoU tensor of each prediction frame in grid K_d^v. Denote any one of the prediction frames as prediction frame R_d^vu, u = 1, 2, ..., U, where U is the number of prediction frames in grid K_d^v. Obtain the predicted probability value O_d^vu of the predicted class probability tensor, the predicted probability value P_d^vu of the predicted class coordinate tensor and the predicted probability value Q_d^vu of the predicted class IoU tensor corresponding to prediction frame R_d^vu, and from Q_d^vu obtain the confidence of prediction frame R_d^vu, denoted L_d^vu;
Compare the confidence L_d^vu with a given confidence threshold L0 and decide as follows:
If L_d^vu ≥ L0, retain the prediction frame;
If L_d^vu < L0, discard the prediction frame;
Step 4.4, repeat step 4.3 until all grids of image Y_d have been selected; then, according to the decision results for the prediction frames in each grid of the road garbage image to be identified, mark the retained prediction frames on image Y_d to obtain the identified road garbage image.
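The confidence test of step 4.3 reduces to a boolean mask; a sketch, where the threshold value L0 = 0.25 is an arbitrary assumption since the filing does not fix it.

    import torch

    def filter_by_confidence(pred_boxes, pred_conf, l0=0.25):
        """Keep prediction frames whose confidence L_d^vu >= L0 (step 4.3)."""
        keep = pred_conf >= l0
        return pred_boxes[keep], pred_conf[keep]

    # usage with dummy data: three prediction frames with 4 coordinates each
    boxes = torch.rand(3, 4)
    conf = torch.tensor([0.9, 0.1, 0.4])
    kept_boxes, kept_conf = filter_by_confidence(boxes, conf)  # drops the 0.1 frame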
Compared with the prior art, the invention has the following beneficial effects:
1. The channel attention mechanism of the invention adopts an effective data compression method and can selectively extract the required features against a complex recognition background.
2. The invention adopts a recognition method that separates localization and classification features, decoupling the processing of road garbage position-localization features from road garbage class-recognition features; this improves the utilization of complex features and works better on small and polymorphic targets.
3. The CSPDarkNet53 network with the channel attention mechanism has stronger recognition capability for multiple targets and occluded targets and better suits the needs of real life.
Drawings
FIG. 1 is a flow chart of the road garbage identification method of the present invention;
FIG. 2 is a schematic block diagram of the road garbage identification method of the present invention;
FIG. 3 is a schematic diagram of the implementation steps of the detection head in the embodiment of the present invention;
FIG. 4 compares the mAP of the road garbage identification method of the present invention with that of other methods in an embodiment of the present invention.
Detailed description of the invention
The technical solution of the present invention will be described clearly and completely with reference to the accompanying drawings and the detailed description.
Fig. 1 is a flowchart of a road refuse recognition method of the present invention, fig. 2 is a schematic block diagram of the road refuse recognition method of the present invention, and as can be seen from fig. 1 and fig. 2, the present invention provides a Yolov3 road refuse detection method based on a decoupling head, the detection method obtains a road refuse recognition network with an optimal detection effect by training an acquired road refuse image through improving a Yolov3 network structure, and uses the refuse recognition network to recognize road refuse, and specifically includes the following steps:
step 1, collecting and processing road garbage images
Collect D classes of road garbage images, where D denotes the number of road garbage image classes;
Select M road garbage images from each of the D classes, obtaining M × D images; then apply Z image processing modes to complete data enhancement on the M × D images, obtaining Z × M × D images, which form the training sample data set;
Select another N images from each of the D classes, excluding the M images already chosen, obtaining N × D images, which form the test sample data set, where N ≠ M.
In this example, D = 14, M = 70, Z = 8, N = 30.
In this embodiment, the Z = 8 image processing modes are: random cropping, random translation, brightness change, Gaussian random noise addition, random rotation, random flipping, random occlusion processing and mosaic data enhancement.
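Seven of these eight modes can be sketched with torchvision as below (mosaic enhancement stitches four images and their boxes together, so it is normally implemented inside the dataset loader); every magnitude here is an assumption, and for detection training the geometric transforms must also be applied to the bounding-box labels.

    import torch
    from torchvision import transforms

    augment = transforms.Compose([
        transforms.RandomCrop(600, pad_if_needed=True),            # random cropping
        transforms.RandomAffine(degrees=0, translate=(0.1, 0.1)),  # random translation
        transforms.ColorJitter(brightness=0.4),                    # brightness change
        transforms.RandomRotation(15),                             # random rotation
        transforms.RandomHorizontalFlip(p=0.5),                    # random flipping
        transforms.ToTensor(),
        transforms.Lambda(lambda t: (t + 0.02 * torch.randn_like(t)).clamp(0, 1)),  # Gaussian noise
        transforms.RandomErasing(p=0.5),                           # random occlusion
    ])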
Step 2, establishing an improved Yolov3 network based on a decoupling detection head and channel attention mechanism, wherein the improved Yolov3 network comprises a backbone network, a Neck structure and a detection head.
Step 2.1, adopt a CSPDarkNet53 network as the backbone network, and define the depth coefficient ζ as the value obtained by dividing the actual number of network layers by the nominal network layer number 256. The backbone network structure comprises: a standard convolution layer α1 formed by connecting in series, in sequence, a convolution layer with kernel size 6 × 6, a batch normalization layer and a SiLU activation function, the standard convolution layer α1 having 32 input channels; a standard convolution layer α2 formed by connecting in series, in sequence, a convolution layer with kernel size 3 × 3, a batch normalization layer and a SiLU activation function, the standard convolution layer α2 having 64 input channels; a standard convolution layer α3 formed by connecting in series, in sequence, a convolution layer with kernel size 3 × 3, a batch normalization layer and a SiLU activation function, the standard convolution layer α3 having 128 input channels; a standard convolution layer α4 formed by connecting in series, in sequence, a convolution layer with kernel size 3 × 3, a batch normalization layer and a SiLU activation function, the standard convolution layer α4 having 256 input channels; a standard convolution layer α5 formed by connecting in series, in sequence, a convolution layer with kernel size 6 × 6, a batch normalization layer and a SiLU activation function, the standard convolution layer α5 having 512 input channels; a C3 module layer β1 containing 3 standard convolution layers α2 and 128ζ bottleneck modules; a C3 module layer β2 containing 3 standard convolution layers α3 and 256ζ bottleneck modules; a C3 module layer β3 containing 3 standard convolution layers α4 and 512ζ bottleneck modules; a C3 module layer β4 containing 3 standard convolution layers α5 and 1024ζ bottleneck modules; and an SPPF module layer γ1.
The input of the backbone network is the standard convolution layer α1 and the output is the SPPF module layer γ1; specifically, standard convolution layer α1, standard convolution layer α2, C3 module layer β1, standard convolution layer α3, C3 module layer β2, standard convolution layer α4, C3 module layer β3, standard convolution layer α5, C3 module layer β4 and SPPF module layer γ1 are connected in series in sequence.
In this embodiment, C3 module layer β1, C3 module layer β2, C3 module layer β3 and C3 module layer β4 each consist of three layers connected in series in sequence along the input-output direction of the backbone network, specifically:
With 128ζ, 256ζ, 512ζ and 1024ζ collectively denoted n × ζ, where n equals 128, 256, 512 or 1024, and with standard convolution layer α2, standard convolution layer α3, standard convolution layer α4 and standard convolution layer α5 collectively denoted standard convolution layer α:
The first layer of each of the four C3 module layers consists of two parallel paths b1 and b2, where path b1 is formed by connecting a standard convolution layer α and n × ζ bottleneck modules in series in sequence and path b2 consists of a standard convolution layer α; the second layer is a Concat module layer whose inputs are path b1 and path b2 and whose output is connected in series to the third layer; the third layer is a standard convolution layer α.
Step 2.2, adopt an FPN + PAN network as the Neck structure, which comprises: a convolution layer χ1 with kernel size 1 × 1 and 512 channels; a convolution layer χ2 with kernel size 1 × 1 and 256 channels; a convolution layer χ3 with kernel size 3 × 3 and 128 channels; a convolution layer χ4 with kernel size 3 × 3 and 256 channels; a down-sampling layer δ with 256 channels; four Concat module layers, denoted Concat module layer C1, Concat module layer C2, Concat module layer C3 and Concat module layer C4; two 512-channel C3 module layers, denoted C3 module layer D1 and C3 module layer D2; two 256-channel C3 module layers, denoted C3 module layer D3 and the fourth C3 module layer D4; and a channel attention mechanism CA.
The Neck structure has three inputs, denoted input output11, input output12 and input output13, where input output11 is connected to the output of C3 module layer β2 of the backbone network, input output12 to the output of C3 module layer β3, and input output13 to the output of SPPF module layer γ1. The Neck structure has three outputs, denoted output21, output22 and output23, where output21 is the output of C3 module layer D2, output22 is the output of C3 module layer D3, and output23 is the output of the channel attention mechanism CA.
In this embodiment, the Neck structure consists of 8 layers connected in series in sequence along the input-output direction of the backbone network:
The first layer takes input output11 and input output12 as inputs and comprises two parallel paths σ1 and σ2, where path σ1 is formed by connecting input output11, Concat module layer C1, convolution layer χ1 and the down-sampling layer δ in series in sequence, and path σ2 is formed by connecting input output12 to Concat module layer C1; the second layer consists of Concat module layer C1; the third layer takes the output of Concat module layer C1 and input output13 as inputs and comprises two parallel paths σ3 and σ4, where path σ3 is formed by connecting the output of Concat module layer C1, Concat module layer C2, C3 module layer D1, convolution layer χ2 and the down-sampling layer δ in series in sequence, and path σ4 is formed by connecting input output13 to Concat module layer C2; the fourth layer consists of Concat module layer C2; the fifth layer takes the output of Concat module layer C2 and the output of convolution layer χ2 as inputs and comprises two parallel paths σ5 and σ6, where path σ5 is formed by connecting the output of Concat module layer C2, Concat module layer C3, C3 module layer D2 and convolution layer χ3 in series in sequence, and path σ6 is formed by directly connecting the output of convolution layer χ2 to Concat module layer C3; the sixth layer is Concat module layer C3; the seventh layer takes the output of Concat module layer C3 and the output of convolution layer χ1 as inputs and comprises two parallel paths σ7 and σ8, where path σ7 is formed by connecting the output of Concat module layer C3, C3 module layer D3 and convolution layer χ4 in series in sequence, and path σ8 is formed by directly connecting the output of convolution layer χ1 to Concat module layer C4; the eighth layer is formed by connecting Concat module layer C4, C3 module layer D4 and the channel attention mechanism CA in series in sequence.
In this embodiment, the channel attention mechanism CA is implemented as follows:
The channel attention mechanism CA includes an average pooling layer ηX, an average pooling layer ηY, a Concat module layer L, a convolution layer λ, a batch normalization layer A and a Sigmoid nonlinear activation layer J;
The channel attention mechanism CA consists of 3 layers connected in series in sequence along the input-output direction of the backbone network: the first layer consists of two parallel paths II1 and II2, where path II1 consists of the average pooling layer ηX and path II2 consists of the average pooling layer ηY; the second layer is formed by connecting the Concat module layer L and the batch normalization layer A in series in sequence along the direction from the two parallel paths to the backbone network output; the third layer splits the output of the batch normalization layer A into two parallel paths II3 and II4, each formed by connecting a convolution layer λ and a Sigmoid nonlinear activation layer J in series in sequence;
The input of the channel attention mechanism CA is the output of the fourth C3 module layer D4, and the output of the channel attention mechanism CA is obtained by multiplying the outputs of path II3 and path II4 with the input of the channel attention mechanism CA.
Step 2.3, adopt a decoupling detection head as the detection head, whose structure comprises: a convolution layer Z1 with kernel size 1 × 1 and 256 channels; a convolution layer Z2 with kernel size 3 × 3 and 256 channels; a convolution layer Z3 with kernel size 3 × 3 and 512 channels; a convolution layer Z4 with kernel size 1 × 1 and D channels; a convolution layer Z5 with kernel size 1 × 1 and 4 channels; and a convolution layer Z6 with kernel size 1 × 1 and 1 channel.
The input of the decoupling head is convolution layer Z1, which is connected to the three outputs output21, output22 and output23 of the Neck structure. Its output forms the following three paths: the first path is formed by connecting convolution layer Z1, convolution layer Z2, convolution layer Z3 and convolution layer Z4 in series in sequence; the second path by connecting convolution layer Z1, convolution layer Z2, convolution layer Z3 and convolution layer Z5 in series in sequence; and the third path by connecting convolution layer Z1, convolution layer Z2, convolution layer Z3 and convolution layer Z6 in series in sequence.
Fig. 3 is a schematic diagram of the implementation steps of the detection head in the embodiment of the present invention.
Step 3, train the improved Yolov3 network established in step 2 to obtain the network with the best detection effect, and use it as the road garbage recognition network; the specific steps are as follows:
and 3.1, uniformly adjusting the pixels of the road garbage images in the training sample set to S multiplied by S.
Step 3.2, randomly selecting B road garbage images in the training sample set, and forming a series of gamma, gamma = (y) 1 ,y 2 ,...,y s ...,y B ) Wherein, y s Is any road garbage image in the series gamma and is recorded as an image y s S =1, 2.., B, calculating image y s Actual class probability tensor Y s Actual class coordinate tensor W s And the actual class IoU tensor X s Wherein the actual class probability tensor Y s Has a size of H × W × C, an actual class coordinate tensor W s Has a size of H × W × 4, and an actual class IoU tensor X s The size of (a) is H × W × 1; where H denotes the height of each tensor, W denotes the width of each tensor, and C denotes the depth of each tensor.
Initialize the predicted class probability tensor O_s, the predicted class coordinate tensor P_s and the predicted class IoU tensor Q_s of image y_s as follows:
Define the coordinates of the predicted class probability tensor O_s, the predicted class coordinate tensor P_s and the predicted class IoU tensor Q_s to consist of an abscissa n, an ordinate m and a depth coordinate γ, denoted (n, m, γ);
For O_s, arbitrarily select (n, m, γ) with n = 1, 2, ..., H, m = 1, 2, ..., W, γ = 1, 2, ..., C, and set O_s(n, m, γ) to its initial value (given by a formula image in the original filing); the predicted probability values of all other coordinates of O_s equal 0. For the predicted class coordinate tensor P_s, arbitrarily select (n, m, γ) with n = 1, 2, ..., H, m = 1, 2, ..., W, γ = 1, 2, ..., 4, and set P_s(n, m, γ) to its initial value (given by a formula image in the original filing); the predicted probability values of all other coordinates of P_s equal 0. For the predicted class IoU tensor Q_s, arbitrarily select (n, m, γ) with n = 1, 2, ..., H, m = 1, 2, ..., W, γ = 1, and set Q_s(n, m, γ) to its initial value (given by a formula image in the original filing); the predicted probability values of all other coordinates of Q_s equal 0;
Step 3.3, after the B road garbage images selected in step 3.2 are input into the backbone network, update the predicted class probability tensor O_s, the predicted class coordinate tensor P_s and the predicted class IoU tensor Q_s of each road garbage image, s = 1, 2, ..., B.
Step 3.4, optimize the backbone network according to each updated predicted tensor and actual tensor:
Divide the height of image y_s equally into H segments and the width equally into W segments, i.e., divide image y_s equally into H × W grids;
Make a prediction for each grid of image y_s, compare the obtained prediction information with the real information to obtain the loss function loss, obtain the minimized loss function min-loss by gradient descent, and complete the optimization of the backbone network.
In the present embodiment, the expression of the loss function loss is as follows:
loss = box_gain × bbox_loss + cls_gain × cls_loss + obj_gain × obj_loss
where bbox_loss is the rectangular-frame loss, cls_loss is the classification loss, obj_loss is the confidence loss, box_gain is the rectangular-frame loss weight, cls_gain is the classification loss weight and obj_gain is the confidence loss weight;
by default, box_gain = 0.05, cls_gain = 0.5 and obj_gain = 1.0; the expressions for the rectangular-frame loss bbox_loss, the classification loss cls_loss and the confidence loss obj_loss are given as formula images in the original filing;
wherein ||·||_2 denotes the Euclidean norm;
The minimized loss function min-loss is obtained by gradient descent, completing the optimization of the backbone network.
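The weighted sum and the gradient step can be sketched as follows; writing the three component losses as Euclidean-norm distances between the predicted and actual tensors is purely an assumption standing in for the formula images above.

    import torch

    def total_loss(o_s, p_s, q_s, y_s, w_s, x_s,
                   box_gain=0.05, cls_gain=0.5, obj_gain=1.0):
        bbox_loss = torch.linalg.norm(p_s - w_s)  # rectangular-frame loss (assumed form)
        cls_loss = torch.linalg.norm(o_s - y_s)   # classification loss (assumed form)
        obj_loss = torch.linalg.norm(q_s - x_s)   # confidence loss (assumed form)
        return box_gain * bbox_loss + cls_gain * cls_loss + obj_gain * obj_loss

    # one gradient-descent step toward min-loss:
    #   optimizer.zero_grad(); loss.backward(); optimizer.step()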
Step 3.5, repeating the steps 3.2 to 3.4 until the road garbage images in the training sample set are all selected, wherein if the number of the road garbage images left in the training sample set in the last round of selection is less than B, the road garbage images are randomly selected from the selected road garbage images for supplementation;
general purpose medicineThe backbone network optimized through the steps 3.2-3.5 is marked as the h generation backbone network T h Wherein h is the generation number.
Step 3.6, use the test sample set to calculate the mean average precision V_h of the h-th generation backbone network T_h over the road garbage images in the test sample set, as follows:
Step 3.6.1, define any one class of road garbage among the D classes as class-i garbage, where i = 1, 2, ..., D;
Define a prediction frame as a rectangular frame marked on a grid, where the predicted class probability tensor O_s determines the garbage class detected by the rectangular frame, the predicted class coordinate tensor P_s determines the center coordinates of the rectangular frame, and the predicted class IoU tensor Q_s determines the confidence of the rectangular frame; define an actual frame as a rectangular frame manually marking the road garbage on the road garbage image; define the overlap degree I as the area of the intersection of the prediction frame and the actual frame divided by the area of their union;
Step 3.6.2, randomly take n unequal decimals between 0 and 1 to form an overlap-degree threshold sequence K, K = {K_i1, K_i2, ..., K_ij, ..., K_in}, where K_ij is the j-th overlap-degree threshold corresponding to class-i garbage, j = 1, 2, ..., n;
Define TP as the number of prediction frames of class-i garbage whose overlap degree I is greater than or equal to the j-th overlap-degree threshold K_ij, FP as the number of prediction frames of class-i garbage whose overlap degree I is less than the j-th overlap-degree threshold K_ij, and FN as the number of actual frames to which no prediction frame corresponds. Calculate the recall rate R_ij and the precision P_ij of class-i garbage at the j-th overlap-degree threshold K_ij by the following formulas:
R_ij = TP / (TP + FN)

P_ij = TP / (TP + FP)
Step 3.6.3, following the method of step 3.6.2, calculate the recall rate and precision of class-i garbage at every overlap-degree threshold in the sequence K, obtaining n recall rates R_ij and precisions P_ij; in order from 1 to n, take the recall rate R_ij as the abscissa and the precision P_ij as the ordinate and plot a curve in a plane coordinate system, denoted the P_i-R_i curve;
Take as a contour the P_i-R_i curve, the abscissa axis, the ordinate axis and the line connecting the end point of the P_i-R_i curve to the abscissa axis; calculate the area enclosed by this contour and record it as the AP value F_i of class-i garbage;
Step 3.6.4, following the method of steps 3.6.2 to 3.6.3, calculate the AP value of each of the D classes of road garbage, obtaining D AP values F_i, and from the D AP values F_i compute the mean average precision V_h of the h-th generation backbone network T_h over the road garbage images in the test sample set:

V_h = (1 / D) × Σ_{i=1}^{D} F_i
Step 3.7, set the number of repetitions to G and repeat steps 3.4 to 3.6 G times, obtaining a network set T and a mean average precision set V, T = {T_1, T_2, ..., T_h, ..., T_G}, V = {V_1, V_2, ..., V_h, ..., V_G};
Denote V_o as the highest mean average precision, V_o = max{V_1, V_2, ..., V_h, ..., V_G}; the backbone network T_o corresponding to V_o is the network with the best recognition effect and is recorded as the road garbage recognition network.
In the present embodiment, B = 16, S = 640, G = 51.
Step 4, identify road garbage using the road garbage recognition network.
In this embodiment, the specific steps of identifying road garbage with the road garbage identification network are as follows:
Step 4.1, input an original road garbage image to be identified and perform the following image processing: resize the original road garbage image to be identified to E × E pixels.
Denote the processed original road garbage image to be identified as image Y_d.
Step 4.2, divide image Y_d equally into grids, and denote any one of the grids as grid K_d^v, where v is the grid number, v = 1, 2, ..., Λ, and Λ is the number of grids.
Step 4.3, feed image Y_d into the road garbage recognition network to obtain the predicted class probability tensor, predicted class coordinate tensor and predicted class IoU tensor of each prediction frame in grid K_d^v. Denote any one of the prediction frames as prediction frame R_d^vu, u = 1, 2, ..., U, where U is the number of prediction frames in grid K_d^v. Obtain the predicted probability value O_d^vu of the predicted class probability tensor, the predicted probability value P_d^vu of the predicted class coordinate tensor and the predicted probability value Q_d^vu of the predicted class IoU tensor corresponding to prediction frame R_d^vu, and from Q_d^vu obtain the confidence of prediction frame R_d^vu, denoted L_d^vu.
Compare the confidence L_d^vu with a given confidence threshold L0 and decide as follows:
If L_d^vu ≥ L0, retain the prediction frame;
If L_d^vu < L0, discard the prediction frame.
Step 4.4, repeat step 4.3 until all grids of image Y_d have been selected; then, according to the decision results for the prediction frames in each grid of the road garbage image to be identified, mark the retained prediction frames on image Y_d to obtain the identified road garbage image.
In this embodiment, U ≥ 3.
This completes the detection of the road garbage to be identified.
In the above detection process, the specific settings of the bottleneck module, the SiLU activation function, the SPPF module, the Concat module layer and the Sigmoid nonlinear activation layer J in step 2 are as follows.
The bottleneck module has the following specific structure: the input image is divided into two paths, one formed by connecting a convolution layer with kernel size 1 × 1 and a convolution layer with kernel size 3 × 3 in series in sequence, while the other path keeps the original input image; the two paths are then added directly to obtain the new image output.
The expression of the SiLU activation function is as follows:
Y(ω) = ω / (1 + e^(-ω))

where e denotes the exponential function, ω is the input of the SiLU activation function and Y(ω) is its output.
The SPPF module consists of four layers connected in series in sequence along the input-output direction of the backbone network: the first layer is a convolution layer with kernel size 1 × 1 and 512 channels; the second layer comprises four parallel paths output from the convolution layer of the first layer, denoted path κ1, path κ2, path κ3 and path κ4, where path κ1 consists of three pooling layers connected in series in sequence along the input-output direction of the backbone network, path κ2 of two pooling layers connected in series, path κ3 of one pooling layer, and path κ4 is the direct output of the first layer; each pooling layer is a maximum-pooling down-sampling layer with a 5 × 5 kernel; the third layer is a Concat module layer whose inputs are the four paths of the second layer and whose output is connected in series to the fourth layer; the fourth layer is a convolution layer with kernel size 1 × 1 and 512 channels.
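Because the three pooling layers of path κ1 contain those of paths κ2 and κ3 as prefixes, the four parallel paths can share intermediate results; a sketch, where stride 1 with padding 2 keeps the spatial size so the Concat shapes match (an assumption, as the filing calls the 5 × 5 max poolings down-sampling layers without giving a stride).

    import torch
    import torch.nn as nn

    class SPPF(nn.Module):
        def __init__(self, c_in):
            super().__init__()
            self.cv1 = nn.Conv2d(c_in, 512, 1)                # first layer: 1x1, 512 ch
            self.pool = nn.MaxPool2d(5, stride=1, padding=2)  # 5x5 max pooling
            self.cv2 = nn.Conv2d(512 * 4, 512, 1)             # fourth layer: 1x1, 512 ch

        def forward(self, x):
            x = self.cv1(x)
            p1 = self.pool(x)    # path kappa3: one pooling layer
            p2 = self.pool(p1)   # path kappa2: two pooling layers
            p3 = self.pool(p2)   # path kappa1: three pooling layers
            # third layer: Concat of the four parallel paths, then the fourth layer
            return self.cv2(torch.cat((p3, p2, p1, x), dim=1))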
The structure of the Concat module layer is as follows: the two input tensors are connected along the channel dimension and a copy of the connected tensor is returned.
The Sigmoid nonlinear activation layer J comprises a Sigmoid function, and the expression of the Sigmoid function is as follows:
Y1(ω1) = 1 / (1 + e^(-ω1))

where ω1 is the input of the Sigmoid function and Y1(ω1) is its output.
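Numerically, both activations are one-liners; a quick check of the two expressions above.

    import math

    def silu(w):
        return w / (1.0 + math.exp(-w))      # Y(omega) = omega / (1 + e^(-omega))

    def sigmoid(w1):
        return 1.0 / (1.0 + math.exp(-w1))   # Y1(omega1) = 1 / (1 + e^(-omega1))

    print(round(silu(1.0), 3), sigmoid(0.0))  # 0.731 0.5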
FIG. 4 compares the mean average precision V_h of the road garbage detection method of the present invention with that of other identification methods, including the unmodified Yolov3 algorithm, the Yolov3 algorithm using the ECA channel attention mechanism, and the Yolov3 algorithm using the ASFF detection head. As can be seen from FIG. 4, the mean average precision V_h curve of the method of the present invention is superior to those of the other recognition algorithms, demonstrating the superiority of the identification method in the field of road garbage recognition.
In summary, by adding an attention mechanism and a new detection head to the Yolov3 network, the method improves identification accuracy and greatly improves the network's capability to identify road garbage; it can well meet the application requirements of complex road environments and better serve the cause of environmental protection.

Claims (5)

1. A Yolov3 road garbage detection method based on a decoupling head, characterized in that the method improves the Yolov3 network structure, trains it on collected road garbage images to obtain the road garbage recognition network with the best detection effect, and uses the road garbage recognition network to recognize road garbage, specifically comprising the following steps:
step 1, collecting and processing road garbage images
Collect D classes of road garbage images, where D denotes the number of road garbage image classes;
Select M road garbage images from each of the D classes, obtaining M × D images; then apply Z image processing modes to complete data enhancement on the M × D images, obtaining Z × M × D images, which form the training sample data set;
Select another N images from each of the D classes, excluding the M images already chosen, obtaining N × D images, which form the test sample data set, where N ≠ M;
step 2, establishing an improved Yolov3 network based on a decoupling detection head and a channel attention mechanism, wherein the improved Yolov3 network comprises a backbone network, a Neck structure and a detection head;
step 2.1, a CSPDarkNet53 network is adopted as the backbone network, and the depth coefficient ζ is defined as the actual number of network layers divided by the nominal network layer number 256; the backbone network structure comprises: a standard convolution layer α1 formed by sequentially connecting in series a convolution layer with a 6×6 kernel, a batch normalization layer and a SiLU activation function, the standard convolution layer α1 having 32 input channels; a standard convolution layer α2 formed by sequentially connecting in series a convolution layer with a 3×3 kernel, a batch normalization layer and a SiLU activation function, the standard convolution layer α2 having 64 input channels; a standard convolution layer α3 formed in the same way with a 3×3 kernel, having 128 input channels; a standard convolution layer α4 formed in the same way with a 3×3 kernel, having 256 input channels; a standard convolution layer α5 formed by sequentially connecting in series a convolution layer with a 6×6 kernel, a batch normalization layer and a SiLU activation function, having 512 input channels; a C3 module layer β1 comprising 3 standard convolution layers α2 and 128ζ bottleneck modules; a C3 module layer β2 comprising 3 standard convolution layers α3 and 256ζ bottleneck modules; a C3 module layer β3 comprising 3 standard convolution layers α4 and 512ζ bottleneck modules; a C3 module layer β4 comprising 3 standard convolution layers α5 and 1024ζ bottleneck modules; and an SPPF module layer γ1;
the input of the backbone network is the standard convolution layer α1 and the output is the SPPF module layer γ1; specifically, the standard convolution layer α1, the standard convolution layer α2, the C3 module layer β1, the standard convolution layer α3, the C3 module layer β2, the standard convolution layer α4, the C3 module layer β3, the standard convolution layer α5, the C3 module layer β4 and the SPPF module layer γ1 are sequentially connected in series;
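A minimal PyTorch sketch of the "standard convolution layer" building block follows. Stride and padding are assumptions, as the claim fixes only kernel sizes and channel counts; treating the claimed 32 channels as the stem's output width in the usage lines is likewise an assumption made for illustration.

```python
import torch
import torch.nn as nn

class StandardConv(nn.Module):
    """Sketch of a standard convolution layer: a convolution, a batch
    normalization layer and a SiLU activation connected in series."""
    def __init__(self, c_in, c_out, k=3, s=2):
        super().__init__()
        self.conv = nn.Conv2d(c_in, c_out, k, stride=s,
                              padding=(k - 1) // 2, bias=False)
        self.bn = nn.BatchNorm2d(c_out)
        self.act = nn.SiLU()

    def forward(self, x):
        return self.act(self.bn(self.conv(x)))

# Hypothetical stem usage: a 6x6 standard convolution layer applied to
# an RGB image.
x = torch.randn(1, 3, 640, 640)
alpha1 = StandardConv(3, 32, k=6, s=2)
print(alpha1(x).shape)  # torch.Size([1, 32, 320, 320])
```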
step 2.2, an FPN+PAN network is adopted as the Neck structure, which comprises: a convolution layer χ1 with a 1×1 kernel and 512 channels; a convolution layer χ2 with a 1×1 kernel and 256 channels; a convolution layer χ3 with a 3×3 kernel and 128 channels; a convolution layer χ4 with a 3×3 kernel and 256 channels; a down-sampling layer δ with 256 channels; four Concat module layers, respectively denoted Concat module layer C1, Concat module layer C2, Concat module layer C3 and Concat module layer C4; two 512-channel C3 module layers, respectively denoted C3 module layer D1 and C3 module layer D2; two 256-channel C3 module layers, respectively denoted C3 module layer D3 and C3 module layer D4; and a channel attention mechanism CA;
the Neck structure has three inputs, denoted input output11, input output12 and input output13, where input output11 is connected to the output of the C3 module layer β2 of the backbone network, input output12 is connected to the output of the C3 module layer β3 of the backbone network, and input output13 is connected to the output of the SPPF module layer γ1 of the backbone network; the Neck structure has three outputs, denoted output21, output22 and output23, where output21 is the output of the C3 module layer D2, output22 is the output of the C3 module layer D3, and output23 is the output of the channel attention mechanism CA;
step 2.3, a decoupled detection head is adopted as the detection head, whose structure comprises: a convolution layer Z1 with a 1×1 kernel and 256 channels; a convolution layer Z2 with a 3×3 kernel and 256 channels; a convolution layer Z3 with a 3×3 kernel and 512 channels; a convolution layer Z4 with a 1×1 kernel and D channels; a convolution layer Z5 with a 1×1 kernel and 4 channels; and a convolution layer Z6 with a 1×1 kernel and 1 channel;
the input of the decoupled head is the convolution layer Z1, which is connected to the three outputs output21, output22 and output23 of the Neck structure; its output forms the following three paths: the first path is formed by the convolution layers Z1, Z2, Z3 and Z4 sequentially connected in series; the second path is formed by the convolution layers Z1, Z2, Z3 and Z5 sequentially connected in series; the third path is formed by the convolution layers Z1, Z2, Z3 and Z6 sequentially connected in series;
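The three-path decoupled head lends itself to a compact sketch: a shared stem Z1-Z3 followed by three 1×1 output convolutions for the D class scores, the 4 box coordinates and the single IoU value. Leaving the input channel count as a parameter is an assumption, since the three Neck outputs feeding the head may differ in width.

```python
import torch.nn as nn

class DecoupledHead(nn.Module):
    """Sketch of the decoupled detection head of step 2.3."""
    def __init__(self, c_in, num_classes):
        super().__init__()
        self.stem = nn.Sequential(
            nn.Conv2d(c_in, 256, 1),            # Z1: 1x1, 256 channels
            nn.Conv2d(256, 256, 3, padding=1),  # Z2: 3x3, 256 channels
            nn.Conv2d(256, 512, 3, padding=1),  # Z3: 3x3, 512 channels
        )
        self.cls_out = nn.Conv2d(512, num_classes, 1)  # Z4: D class scores
        self.box_out = nn.Conv2d(512, 4, 1)            # Z5: box coordinates
        self.iou_out = nn.Conv2d(512, 1, 1)            # Z6: IoU / confidence

    def forward(self, x):
        f = self.stem(x)
        return self.cls_out(f), self.box_out(f), self.iou_out(f)
```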
step 3, train the improved Yolov3 network established in step 2 to obtain the network with the optimal detection effect, and use it as the road garbage recognition network; the specific steps are as follows:
step 3.1, uniformly resize the road garbage images in the training sample set to S×S pixels;
step 3.2, randomly select B road garbage images from the training sample set to form a series Γ, Γ = (y1, y2, ..., ys, ..., yB), where ys is any road garbage image in the series Γ, denoted image ys, s = 1, 2, ..., B; calculate the actual class probability tensor Ys, the actual class coordinate tensor Ws and the actual class IoU tensor Xs of image ys, where the actual class probability tensor Ys has size H×W×C, the actual class coordinate tensor Ws has size H×W×4, and the actual class IoU tensor Xs has size H×W×1; H denotes the height of each tensor, W the width of each tensor, and C the depth of each tensor;
initialize the prediction class probability tensor Os, the prediction class coordinate tensor Ps and the prediction class IoU tensor Qs of image ys; the initialization process is as follows:
define the coordinates of the prediction class probability tensor Os, the prediction class coordinate tensor Ps and the prediction class IoU tensor Qs as consisting of an abscissa n, an ordinate m and a depth coordinate γ, denoted (n, m, γ);
for Os, arbitrarily select an abscissa n, an ordinate m and a depth coordinate γ (where n = 1, 2, ..., H; m = 1, 2, ..., W; γ = 1, 2, ..., C), assign the initial prediction value at (n, m, γ) [initialization formula rendered as an image in the source; not recoverable], and set the prediction probability values of all other coordinates in Os equal to 0;
for Ps, arbitrarily select an abscissa n, an ordinate m and a depth coordinate γ (where n = 1, 2, ..., H; m = 1, 2, ..., W; γ = 1, 2, 3, 4), assign the initial prediction value at (n, m, γ) [initialization formula rendered as an image in the source; not recoverable], and set the prediction probability values of all other coordinates in Ps equal to 0;
for Qs, arbitrarily select an abscissa n, an ordinate m and a depth coordinate γ (where n = 1, 2, ..., H; m = 1, 2, ..., W; γ = 1), assign the initial prediction value at (n, m, γ) [initialization formula rendered as an image in the source; not recoverable], and set the prediction probability values of all other coordinates in Qs equal to 0;
step 3.3, after the B road garbage images selected in step 3.2 are input into the backbone network, update the prediction class probability tensor Os, the prediction class coordinate tensor Ps and the prediction class IoU tensor Qs of each road garbage image, s = 1, 2, ..., B;
step 3.4, optimize the backbone network according to the updated prediction tensors and the actual tensors:
divide image ys into H equal parts along its height and W equal parts along its width, i.e., divide image ys equally into H×W grids;
make a prediction for each grid of image ys, compare the obtained prediction information with the real information to obtain the loss function loss, minimize the loss function loss by gradient descent, and complete the optimization of the backbone network;
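A minimal sketch of one such optimization step, assuming a composite loss function and an optimizer supplied by the caller; the patent specifies neither:

```python
def train_one_batch(model, loss_fn, optimizer, images, targets):
    """One gradient-descent step of step 3.4: predict, compare with the
    actual tensors via the loss, and update the weights."""
    preds = model(images)            # prediction tensors Os, Ps, Qs
    loss = loss_fn(preds, targets)   # compare with actual tensors Ys, Ws, Xs
    optimizer.zero_grad()
    loss.backward()                  # gradients of the loss function
    optimizer.step()                 # gradient-descent update
    return loss.item()
```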
step 3.5, repeat steps 3.2 to 3.4 until all road garbage images in the training sample set have been selected; if fewer than B road garbage images remain in the training sample set for the last round of selection, road garbage images are randomly drawn from those already selected to make up the batch;
denote the backbone network optimized through steps 3.2 to 3.5 as the h-th generation backbone network Th, where h is the generation number;
step 3.6, use the test sample set to calculate the mean average precision Vh of the h-th generation backbone network Th on the road garbage images of the test sample set, as follows:
step 3.6.1, define any one class of road garbage among the D classes as class-i garbage, i = 1, 2, ..., D;
define a prediction box as a rectangular box marked on a grid, where the prediction class probability tensor Os determines the garbage category detected by the rectangular box, the prediction class coordinate tensor Ps determines the center coordinates of the rectangular box, and the prediction class IoU tensor Qs determines the confidence of the rectangular box; define an actual box as a rectangular box in which road garbage is manually marked on the road garbage image; define the overlap degree I as the area of intersection of the prediction box and the actual box divided by the area of their union;
step 3.6.2, randomly take n unequal decimals in (0, 1) to form an overlap threshold sequence K, K = {Ki1, Ki2, ..., Kij, ..., Kin}, where Kij is the j-th overlap threshold corresponding to class-i garbage, j = 1, 2, ..., n;
define TP as the number of prediction boxes whose overlap degree I with class-i garbage is greater than or equal to the j-th overlap threshold Kij, FP as the number of prediction boxes whose overlap degree I is less than the j-th overlap threshold Kij, and FN as the number of actual boxes to which no prediction box corresponds; compute the recall Rij and precision Pij of class-i garbage at the j-th overlap threshold Kij by the following formulas:
Rij = TP / (TP + FN)

Pij = TP / (TP + FP)
step 3.6.3, compute the recall and precision at every overlap threshold in the sequence K for class-i garbage according to the method of step 3.6.2, obtaining n recall values Rij and n precision values Pij; in order from 1 to n, plot a curve in a plane coordinate system with recall Rij as the abscissa and precision Pij as the ordinate, denoted the Pi-Ri curve;
take the Pi-Ri curve, the abscissa axis, the ordinate axis and the line connecting the endpoint of the Pi-Ri curve to the abscissa axis as the contour, calculate the area enclosed by the contour, and record it as the AP value Fi of class-i garbage;
step 3.6.4, compute the AP value of each of the D classes of road garbage according to the methods of steps 3.6.2 and 3.6.3 to obtain D AP values Fi, and compute the mean average precision Vh of the h-th generation backbone network Th on the road garbage images of the test sample set from the D AP values Fi:
Vh = (F1 + F2 + ... + FD) / D
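For illustration, the per-class AP and the mean average precision Vh can be sketched as below. Matching each prediction box to at most one actual box beforehand is a simplifying assumption, and the trapezoidal rule is one way to estimate the area enclosed by the Pi-Ri curve.

```python
def ap_for_class(matched_ious, n_actual, thresholds):
    """Sketch of steps 3.6.2-3.6.3 for class i: matched_ious holds the
    overlap degree I of each prediction box with its matched actual box,
    n_actual is the number of actual boxes. Returns the AP value Fi."""
    recalls, precisions = [], []
    for k in sorted(thresholds):
        tp = sum(1 for i_val in matched_ious if i_val >= k)  # I >= Kij
        fp = len(matched_ious) - tp                          # I <  Kij
        fn = max(n_actual - tp, 0)   # actual boxes without a prediction
        recalls.append(tp / (tp + fn) if tp + fn else 0.0)
        precisions.append(tp / (tp + fp) if tp + fp else 0.0)
    # sort the (recall, precision) points by recall and estimate the
    # enclosed area with the trapezoidal rule
    pts = sorted(zip(recalls, precisions))
    area = 0.0
    for (r0, p0), (r1, p1) in zip(pts, pts[1:]):
        area += (r1 - r0) * (p1 + p0) / 2.0
    return area

def mean_average_precision(ap_values):
    """Step 3.6.4: Vh is the mean of the D per-class AP values Fi."""
    return sum(ap_values) / len(ap_values)
```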
step 3.7, set the number of repetitions to G and repeat steps 3.4 to 3.6 G times to obtain a network set T and a mean average precision set V, where T = {T1, T2, ..., Th, ..., TG} and V = {V1, V2, ..., Vh, ..., VG};
denote Vo as the highest mean average precision, Vo = max{V1, V2, ..., Vh, ..., VG}; the backbone network To corresponding to Vo is the network with the best recognition effect and is recorded as the road garbage recognition network;
step 4, identify road garbage by using the road garbage recognition network.
2. The Yolov3 road garbage detection method based on a decoupling head as claimed in claim 1, wherein the C3 module layers β1, β2, β3 and β4 are each composed of three layer structures sequentially connected in series along the input-output direction of the backbone network, specifically:
when 128ζ, 256ζ, 512ζ and 1024ζ are collectively referred to as n×ζ, with n equal to 128, 256, 512 or 1024, the standard convolution layers α2, α3, α4 and α5 are collectively referred to as the standard convolution layer α;
the first layer structure of the four C3 module layers is composed of a path 31 and a path 32 in parallel, where the path 31 is composed of a standard convolution layer α and n×ζ bottleneck modules sequentially connected in series, and the path 32 is composed of a standard convolution layer α; the second layer structure is a Concat module layer whose inputs are the path 31 and the path 32 and whose output is connected in series to the third layer structure; the third layer structure is a standard convolution layer α.
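A PyTorch sketch of the C3 module layer of claim 2 follows, reusing the StandardConv block sketched earlier. The internal layout of the bottleneck module and the half-width split are assumptions following the common CSP-style design, since the claim does not detail them.

```python
import torch
import torch.nn as nn

class Bottleneck(nn.Module):
    """Assumed bottleneck layout: a 1x1 conv, a 3x3 conv and a residual
    addition; the claim leaves the bottleneck internals unspecified."""
    def __init__(self, c):
        super().__init__()
        self.conv1 = nn.Conv2d(c, c, 1)
        self.conv2 = nn.Conv2d(c, c, 3, padding=1)

    def forward(self, x):
        return x + self.conv2(self.conv1(x))

class C3(nn.Module):
    """Sketch of the C3 module layer: path 31 is a standard convolution
    followed by the bottleneck chain, path 32 a standard convolution
    alone; their concatenation feeds a third standard convolution."""
    def __init__(self, c_in, c_out, n_bottlenecks):
        super().__init__()
        c_half = c_out // 2
        self.path31 = nn.Sequential(
            StandardConv(c_in, c_half, k=1, s=1),
            *[Bottleneck(c_half) for _ in range(n_bottlenecks)],
        )
        self.path32 = StandardConv(c_in, c_half, k=1, s=1)
        self.conv3 = StandardConv(2 * c_half, c_out, k=1, s=1)

    def forward(self, x):
        return self.conv3(torch.cat((self.path31(x), self.path32(x)), dim=1))
```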
3. The Yolov3 road garbage detection method based on a decoupling head as claimed in claim 1, wherein the Neck structure is formed by sequentially connecting 8 layer structures in series along the input-output direction of the backbone network:
the first layer structure takes input output11 and input output12 as inputs and comprises two parallel paths σ1 and σ2, where the path σ1 is formed by input output11, the convolution layer χ1 and the down-sampling layer δ sequentially connected in series into the Concat module layer C1, and the path σ2 is formed by connecting input output12 to the Concat module layer C1; the second layer structure is composed of the Concat module layer C1; the third layer structure takes the output of the Concat module layer C1 and input output13 as inputs and comprises two parallel paths σ3 and σ4, where the path σ3 is formed by the output of the Concat module layer C1, the C3 module layer D1, the convolution layer χ2 and the down-sampling layer δ sequentially connected in series into the Concat module layer C2, and the path σ4 is formed by connecting input output13 to the Concat module layer C2; the fourth layer structure is composed of the Concat module layer C2; the fifth layer structure takes the output of the Concat module layer C2 and the output of the convolution layer χ2 as inputs and comprises two parallel paths σ5 and σ6, where the path σ5 is formed by the output of the Concat module layer C2, the C3 module layer D2 and the convolution layer χ3 sequentially connected in series into the Concat module layer C3, and the path σ6 is formed by directly connecting the output of the convolution layer χ2 to the Concat module layer C3; the sixth layer structure is the Concat module layer C3; the seventh layer structure takes the output of the Concat module layer C3 and the output of the convolution layer χ1 as inputs and comprises two parallel paths σ7 and σ8, where the path σ7 is formed by the output of the Concat module layer C3, the C3 module layer D3 and the convolution layer χ4 sequentially connected in series into the Concat module layer C4, and the path σ8 is formed by directly connecting the output of the convolution layer χ1 to the Concat module layer C4; the eighth layer structure is composed of the Concat module layer C4, the C3 module layer D4 and the channel attention mechanism CA sequentially connected in series.
4. The Yolov3 road garbage detection method based on a decoupling head as claimed in claim 1, wherein the channel attention mechanism CA is constructed as follows:
the channel attention mechanism CA includes an average pooling layer eta X Average pooling layer η Y A Concat module layer L, a convolution layer lambda, a batch normalization layer A and a Sigmoid nonlinear activation layer J;
the channel attention mechanism CA is composed of 3 layers of structures which are sequentially connected in series along the input-output direction of a main network: the first layer structure is composed of two juxtaposed paths of an boaka 1 and an boaka 2, and the path of the boaka 1 is carried out by an average bath layer eta X The path of the path 2 being defined by an average pooling layer eta Y The two paths are connected into a Concat module layer L, the second layer structure is formed by sequentially connecting the Concat module layer L and a batch standardization layer A in series along two parallel path directions-a main network output direction, the third layer structure is formed by separating two parallel path-an-3 and a path-an-4 from the output of the batch standardization layer A, and the path-an-3 and the path-an-4 are formed by sequentially connecting a convolution layer lambda and a Sigmoid nonlinear activation layer J in series;
the input of the channel attention mechanism CA is a fourth C3 module layer D 4 The outputs of passage boaka 3 and passage boaka 4 are multiplied by the inputs of attention mechanism CA, respectively, to obtain the output of attention mechanism CA.
5. The Yolov3 road garbage detection method based on a decoupling head as claimed in claim 1, wherein the specific steps of identifying road garbage by using the road garbage recognition network are as follows:
step 4.1, input the original road garbage image to be identified and perform the following image processing: adjust the pixels of the original road garbage image to be identified to E×E;
record the processed original road garbage image to be identified as image Yd;
step 4.2, divide image Yd into equal grids and denote any one of the grids as grid Kdv, where v is the grid number, v = 1, 2, ..., Λ, and Λ is the number of grids;
step 4.3, feed image Yd into the road garbage recognition network to obtain the prediction class probability tensor, the prediction class coordinate tensor and the prediction class IoU tensor of each prediction box in grid Kdv; denote any one of the prediction boxes as prediction box Rdvu, u = 1, 2, ..., U, where U is the number of prediction boxes in grid Kdv; obtain the prediction probability value Odvu of the prediction class probability tensor, the prediction probability value Pdvu of the prediction class coordinate tensor and the prediction probability value Qdvu of the prediction class IoU tensor corresponding to prediction box Rdvu; from the prediction probability value Qdvu of the prediction class IoU tensor, obtain the confidence of prediction box Rdvu, denoted Ldvu;
compare the confidence Ldvu with a given confidence threshold L0 and make the following decisions:
if Ldvu ≥ L0, retain the prediction box;
if Ldvu < L0, discard the prediction box;
step 4.4, repeat step 4.3 until all grids of image Yd have been selected, and according to the judgment results for the prediction boxes in each grid of the road garbage image to be identified, mark the retained prediction boxes on image Yd to obtain the identified road garbage image.
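Steps 4.3 and 4.4 reduce to a confidence filter once the three prediction tensors are flattened to one row per prediction box; a minimal sketch, with the threshold value L0 assumed:

```python
import torch

def filter_prediction_boxes(cls_prob, boxes, conf, conf_threshold=0.5):
    """Keep a prediction box only when its confidence L (taken from the
    prediction class IoU tensor Q) reaches the threshold L0; the value
    0.5 for L0 is an assumption, the patent leaves it unspecified."""
    keep = conf.squeeze(-1) >= conf_threshold   # L >= L0: retain the box
    return cls_prob[keep], boxes[keep], conf[keep]

# hypothetical usage on 8400 flattened prediction boxes for D = 10 classes
cls_prob, boxes, conf = torch.rand(8400, 10), torch.rand(8400, 4), torch.rand(8400, 1)
kept_cls, kept_boxes, kept_conf = filter_prediction_boxes(cls_prob, boxes, conf)
```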
CN202211703314.9A 2022-12-23 2022-12-23 Yolov3 road garbage detection method based on decoupling head Pending CN115861956A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202211703314.9A CN115861956A (en) 2022-12-23 2022-12-23 Yolov3 road garbage detection method based on decoupling head

Publications (1)

Publication Number Publication Date
CN115861956A true CN115861956A (en) 2023-03-28

Family

ID=85655763

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202211703314.9A Pending CN115861956A (en) 2022-12-23 2022-12-23 Yolov3 road garbage detection method based on decoupling head

Country Status (1)

Country Link
CN (1) CN115861956A (en)

Cited By (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN117197787A (en) * 2023-08-09 2023-12-08 海南大学 Intelligent security inspection method, device, equipment and medium based on improved YOLOv5



Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination