CN112800906A - Improved YOLOv3-based cross-domain target detection method for automatic driving automobile

Info

Publication number: CN112800906A
Authority: CN (China)
Prior art keywords: real, value, picture, domain, image
Legal status: Granted; Active
Application number: CN202110068030.6A
Other languages: Chinese (zh)
Other versions: CN112800906B (en)
Inventors: 范佳琦, 霍天娇, 李鑫, 魏珍琦, 王嘉琛, 高炳钊
Current Assignee: Jilin University
Original Assignee: Jilin University
Application filed by Jilin University
Priority to CN202110068030.6A
Publication of CN112800906A
Application granted
Publication of CN112800906B

Classifications

    • G06V 20/56 (Image or video recognition or understanding) Context or environment of the image exterior to a vehicle by using sensors mounted on the vehicle
    • G06F 18/214 (Pattern recognition) Generating training patterns; Bootstrap methods, e.g. bagging or boosting
    • G06F 18/23213 (Pattern recognition) Non-hierarchical clustering techniques using statistics or function optimisation, with fixed number of clusters, e.g. K-means clustering
    • G06N 3/045 (Neural networks) Combinations of networks
    • G06N 3/088 (Neural networks) Learning methods; Non-supervised learning, e.g. competitive learning
    • Y02T 10/40 (Climate change mitigation technologies related to transportation) Engine management systems

Abstract

The invention belongs to the technical field of computer vision and environment perception for automatic driving automobiles, and particularly relates to a cross-domain target detection method for automatic driving automobiles based on improved YOLOv3. The method is built on an improved single-stage YOLOv3 detection algorithm framework and uses a generative adversarial network model to obtain training set data, performing cross-domain target detection for the problem that the training set and the test set come from differently distributed data domains. At the same time, the improved YOLOv3 algorithm raises the accuracy of single-stage target detection, while the generative adversarial network reduces the re-labeling of multi-class targets across different data domains, alleviating to a certain extent the difficulty of cross-domain target detection for automatic driving automobiles.

Description

Improved YOLOv3-based cross-domain target detection method for automatic driving automobile
Technical Field
The invention belongs to the technical field of computer vision and environment perception for automatic driving automobiles, and particularly relates to a cross-domain target detection method for automatic driving automobiles based on improved YOLOv3.
Background
With the research on automatic driving automobiles, accurately detecting the various target objects in the video stream input by the camera is of great significance for subsequent planning and decision-making work. Driving scenes are complex and changeable and the target categories are numerous, and traditional hand-crafted feature methods lack the detection accuracy and robustness required for automatic driving, so target detection algorithms based on deep learning have high research value. In recent years, with the development of computer vision technology, more and more detection frameworks with better performance have appeared; they differ in detection accuracy and speed and suit different application scenarios.
Although data-driven deep learning detection algorithms have made great progress on many detection tasks, practical production applications still face many difficulties. First, deep learning detection algorithms depend heavily on training data: when the amount of data is too small, the features of the data cannot be fully learned and the model easily overfits. Because actual road scenes are complex and changeable, obtaining high detection accuracy at different times, in different places and in different weather requires collecting and labeling data in all of these scenes, and data collection in some unusual scenes is difficult, so covering all road scenes is hard to accomplish. Second, applying a deep learning model trained in one scene to detection in a different scene while keeping high detection accuracy places high demands on the robustness of the model. Finally, an automatic driving automobile requires not only high detection accuracy but also real-time detection speed: for the algorithm to be usable in an actual driving task, the number of frames it processes per second must match the driving speed of the vehicle.
In view of the above problems, the deep learning model mainly has two problems to solve: 1. how to improve the generalization ability of the model, so that a model trained on one data domain can be applied to target detection on another, differently distributed data domain, i.e. cross-domain target detection; 2. how to improve the detection accuracy of a detection algorithm that meets the real-time requirement on multiple different classes of target objects, and its average detection accuracy in a variety of complex environments.
Disclosure of Invention
The invention provides a cross-domain target detection method for automatic driving automobiles based on improved YOLOv3. Built on an improved single-stage YOLOv3 detection algorithm framework, the method obtains training set data with a generative adversarial network, performs cross-domain target detection for the problem that the training set and the test set come from differently distributed data domains, improves target detection accuracy through the improved YOLOv3 algorithm, reduces data re-labeling between different data domains through the generative adversarial network, and addresses the cross-domain target detection problem of automatic driving automobiles.
The technical scheme of the invention is described below with reference to the accompanying drawings:
A cross-domain target detection method of an automatic driving automobile based on improved YOLOv3 comprises the following steps:
Step one, inputting source domain images and target domain images into a generative adversarial network CycleGAN model for training to obtain composite images;
Step two, taking the composite images as the training set and the target domain images as the test set;
Step three, clustering the training set label boxes with the K-means clustering algorithm, determining the number of clusters and calculating the prior box sizes;
Step four, building the improved YOLOv3 network as the feature extraction backbone network;
Step five, training the improved YOLOv3 network on the fake images generated by the CycleGAN model;
Step six, detecting the test set images with the trained model and calculating the average detection precision;
Step seven, performing cross-domain target detection.
In step one, an unsupervised generative adversarial network CycleGAN algorithm is adopted to complete the adaptation between the source domain images and the target domain images; the method specifically comprises the following steps:
11) the cyclic adversarial generation network CycleGAN;
The source domain images input into the CycleGAN model pass through a generator to obtain composite images with the target domain style, and the generator is trained continuously;
A discriminator distinguishes the composite images synthesized by the generator from the real images of the original data domain;
12) calculating the CycleGAN loss function;
The CycleGAN loss function comprises a generator loss function and a discriminator loss function;
The generator loss function comprises the cycle consistency loss of the two generators, the GAN loss and the identity loss;
The cycle consistency loss of the two generators is calculated by formula (1), specifically as follows:
L_cyc(G_S-T, G_T-S, S_real, T_real) = L_cyc1 + L_cyc2 = λ1·E_s~S_real[ ||G_T-S(G_S-T(s)) - s||_1 ] + λ2·E_t~T_real[ ||G_S-T(G_T-S(t)) - t||_1 ]   (1)
In the formula, G_S-T and G_T-S represent the two generators; L_cyc represents the total cycle consistency loss value; S_real represents the set of real pictures in the source domain; T_real represents the set of real pictures in the target domain; L_cyc1 represents the cycle consistency loss value in the first direction; L_cyc2 represents the cycle consistency loss value in the second direction; λ1 and λ2 are constant coefficients for balancing the loss terms; E_s~S_real denotes the expectation over images s of the source data domain S_real; E_t~T_real denotes the expectation over images t of the target data domain T_real; s represents a picture in S_real; t represents a picture in T_real;
The GAN loss is calculated by formula (2), specifically as follows:
L_GAN(G_S-T, G_T-S, D_S, D_T, S_real, T_real) = L_GAN1 + L_GAN2 = E_s~S_real[ (D_T(G_S-T(s)) - 1)² ] + E_t~T_real[ (D_S(G_T-S(t)) - 1)² ]   (2)
In the formula, L_GAN represents the total GAN loss value; G_S-T and G_T-S represent the two generators; D_S and D_T are the two discriminators; S_real represents the set of real pictures in the source domain; T_real represents the set of real pictures in the target domain; L_GAN1 represents the GAN loss value of the first generator; L_GAN2 represents the GAN loss value of the second generator; E_s~S_real denotes the expectation over images s of the source data domain S_real; s represents a picture in S_real; E_t~T_real denotes the expectation over images t of the target data domain T_real; t represents a picture in T_real;
The identity loss is calculated by formula (3), specifically as follows:
L_idt(G_S-T, G_T-S, S_real, T_real) = L_idt1 + L_idt2 = λ·E_t~T_real[ ||G_S-T(t) - t||_1 ] + λ·E_s~S_real[ ||G_T-S(s) - s||_1 ]   (3)
In the formula, L_idt represents the total identity loss; G_S-T and G_T-S represent the two generators; S_real represents the set of real pictures in the source domain; T_real represents the set of real pictures in the target domain; L_idt1 represents the identity loss value of the first generator; L_idt2 represents the identity loss value of the second generator; λ, λ1 and λ2 are constant coefficients for balancing the loss terms; E_s~S_real denotes the expectation over images s of the source data domain S_real; s represents a picture in S_real; E_t~T_real denotes the expectation over images t of the target data domain T_real; t represents a picture in T_real;
The generator loss function is calculated by formula (4), specifically as follows:
L_G(G_S-T, G_T-S, D_S, D_T, S_real, T_real) = L_GAN + L_cyc + L_idt   (4)
In the formula, L_G represents the total loss function value of the two generators; L_cyc represents the total cycle consistency loss value; L_GAN represents the total GAN loss value; L_idt represents the total identity loss value; G_S-T and G_T-S represent the two generators; S_real represents the set of real pictures in the source domain; T_real represents the set of real pictures in the target domain; D_S and D_T are the two discriminators;
The discriminator loss function is calculated by formula (5), specifically as follows:
L_D = L_DT + L_DS = E_t~T_real[ (D_T(t) - 1)² ] + E_t'~T_fake[ (D_T(t'))² ] + E_s~S_real[ (D_S(s) - 1)² ] + E_s'~S_fake[ (D_S(s'))² ]   (5)
In the formula, L_D represents the total loss value of the two discriminators; L_DT represents the loss value of discriminator D_T; L_DS represents the loss value of discriminator D_S; E_t'~T_fake denotes the expectation over fake images t' of the fake image set T_fake with target-domain characteristics; E_s'~S_fake denotes the expectation over fake images s' of the fake image set S_fake with source-domain characteristics; t' represents a picture in T_fake; s' represents a picture in S_fake; D_S and D_T are the two discriminators; E_s~S_real denotes the expectation over images s of the source data domain S_real; s represents a picture in S_real; E_t~T_real denotes the expectation over images t of the target data domain T_real; t represents a picture in T_real;
In summary, the CycleGAN loss function is calculated by formula (6), specifically as follows:
L_total = L_G + L_D   (6)
In the formula, L_total represents the total loss function value of the CycleGAN; L_D represents the total loss value of the two discriminators; L_G represents the total loss function value of the two generators; L_cyc represents the total cycle consistency loss value; L_GAN represents the total GAN loss value; L_idt represents the total identity loss value; G_S-T and G_T-S represent the two generators; S_real represents the set of real pictures in the source domain; T_real represents the set of real pictures in the target domain; D_S and D_T are the two discriminators; L_DT represents the loss value of discriminator D_T; L_DS represents the loss value of discriminator D_S.
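For clarity, the loss terms above can be assembled as in the following PyTorch-style sketch. It is a minimal illustration assuming least-squares GAN scores in [0, 1] and L1 cycle/identity terms; the names G_s2t, D_t, lam1, lam2 and lam_idt are placeholders, not the exact implementation of the invention.

import torch.nn.functional as F

def cyclegan_losses(G_s2t, G_t2s, D_s, D_t, s_real, t_real, lam1=1.0, lam2=1.0, lam_idt=0.5):
    """Sketch of the CycleGAN loss terms of formulas (1)-(6); all GAN terms use squared error."""
    t_fake = G_s2t(s_real)              # composite image with target-domain style
    s_fake = G_t2s(t_real)              # composite image with source-domain style

    # (1) cycle consistency: s -> t_fake -> back to s, and t -> s_fake -> back to t
    L_cyc = lam1 * F.l1_loss(G_t2s(t_fake), s_real) + lam2 * F.l1_loss(G_s2t(s_fake), t_real)

    # (2) GAN loss for the generators: composite images should score close to 1
    L_gan = ((D_t(t_fake) - 1) ** 2).mean() + ((D_s(s_fake) - 1) ** 2).mean()

    # (3) identity loss: a generator fed an image of its own output domain should change it little
    L_idt = lam_idt * (F.l1_loss(G_s2t(t_real), t_real) + F.l1_loss(G_t2s(s_real), s_real))

    # (4) total generator loss
    L_G = L_gan + L_cyc + L_idt

    # (5) discriminator loss: real images score close to 1, composite images close to 0
    L_D = ((D_t(t_real) - 1) ** 2).mean() + (D_t(t_fake.detach()) ** 2).mean() \
        + ((D_s(s_real) - 1) ** 2).mean() + (D_s(s_fake.detach()) ** 2).mean()

    # (6) overall CycleGAN objective
    return L_G, L_D, L_G + L_D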
The specific method of the second step is as follows:
Using the CycleGAN model of step one, the composite images are obtained as training set data, and a picture folder, a data folder and an image annotation folder are made.
The picture folder contains all jpg images of the training set, validation set and test set; the data folder contains three txt files, which record respectively all picture names of the training set, all picture names of the validation set and all picture names of the test set; the image annotation folder contains, for the training set, validation set and test set, the top-left corner coordinates (xmin, ymin) and bottom-right corner coordinates (xmax, ymax) of all label boxes and the category of the object in each box (where 0, 1, 2, … represent the different object categories), i.e. one xml file for each picture.
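As an illustration of the data layout described above, the following sketch reads one hypothetical VOC-style per-picture xml file and collects the box coordinates and class indices; the folder layout, tag names and the assumption that the class is stored as a numeric id are illustrative, not the exact files of the invention.

import xml.etree.ElementTree as ET

def read_annotation(xml_path):
    """Parse one per-picture xml file into a list of (xmin, ymin, xmax, ymax, class_id)."""
    root = ET.parse(xml_path).getroot()
    boxes = []
    for obj in root.iter("object"):            # assumed VOC-style <object> entries
        cls = int(obj.find("name").text)       # assumed to store 0, 1, 2, ... for the categories
        bb = obj.find("bndbox")
        boxes.append((int(bb.find("xmin").text), int(bb.find("ymin").text),
                      int(bb.find("xmax").text), int(bb.find("ymax").text), cls))
    return boxes

# example usage (path is a placeholder):
# print(read_annotation("Annotations/000001.xml"))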
The concrete method of the third step is as follows:
The txt file contains all target object label boxes of the training set; K-means clustering is performed on all label boxes according to their width ω and height h using formulas (9)-(14), so as to obtain the size statistics of the training set label boxes;
d = 1 - imp_IOU(box, centroid)   (9)
imp_IOU = (IOU + gIOU + cIOU) / 3   (10)
IOU = |box ∩ centroid| / |box ∪ centroid|   (11)
gIOU = IOU - (|C| - |box ∪ centroid|) / |C|   (12)
cIOU = IOU - ρ²(b_box, b_centroid) / c² - α·v   (13)
v = (4 / π²)·(arctan(ω_centroid / h_centroid) - arctan(ω_box / h_box))²   (14)
wherein d represents the distance value; centroid represents the determined cluster center box; box represents any other box; box ∩ centroid represents the area of the intersection of the two boxes, and box ∪ centroid represents the area of their union; imp_IOU represents the modified IOU value; ρ represents the distance between the center points of the two cluster boxes box and centroid; c represents the diagonal length between the top-left and bottom-right points of the smallest enclosing box C; |C| represents the area of box C; b_box represents any other box; b_centroid represents the determined cluster center box; α represents a coefficient; v is the aspect-ratio term of the cIOU and reflects the difference between the aspect ratios of the two boxes; ω_centroid and h_centroid represent the width and height of the cluster center box; ω_box and h_box represent the width and height of any box.
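A minimal numpy sketch of the modified K-means distance in formulas (9)-(14), assuming boxes are represented only by width and height (both boxes share a common origin, so the ρ²/c² term vanishes as stated later in the description); the α weighting follows the common cIOU convention and is an assumption.

import numpy as np

def imp_iou(box, centroid):
    """box, centroid: (w, h). Returns the averaged IOU/gIOU/cIOU used as clustering similarity."""
    wb, hb = box
    wc, hc = centroid
    inter = min(wb, wc) * min(hb, hc)                 # boxes share the top-left corner
    union = wb * hb + wc * hc - inter
    iou = inter / union

    # gIOU: smallest enclosing box C minus the union, relative to C
    c_area = max(wb, wc) * max(hb, hc)
    giou = iou - (c_area - union) / c_area

    # cIOU: center-distance term is 0 here (only w and h are clustered); aspect-ratio term v remains
    v = (4 / np.pi ** 2) * (np.arctan(wc / hc) - np.arctan(wb / hb)) ** 2
    alpha = v / (1 - iou + v + 1e-9)                  # assumed balancing coefficient
    ciou = iou - alpha * v

    return (iou + giou + ciou) / 3.0

def kmeans_distance(box, centroid):
    """Formula (9): d = 1 - imp_IOU(box, centroid)."""
    return 1.0 - imp_iou(box, centroid)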
The concrete method of the fourth step is as follows:
The feature extraction network adopted in the YOLOv3 algorithm is Darknet-53, i.e. there are 53 network layers, including a feature pyramid structure; the network consists of:
41) the feature extraction front-end network;
The whole feature extraction part contains 5 residual groups with different numbers of convolution kernels, and each residual group consists of a different number of residual blocks; the numbers of residual blocks in the 5 residual groups are 1, 2, 8, 8 and 4 respectively. A convolution layer with stride 2 precedes the first residual group and is arranged between every two adjacent residual groups to halve the size, so that the feature map input into the next residual group is half the previous size; there are 5 such stride-2 convolution layers in total, and the finally output feature map size is 1/32 of the training set image size, i.e. n/32 × n/32;
42) the improved feature pyramid;
421. fusing the information of the lower-layer and deep-layer feature maps of the feature pyramid structure;
The feature map output after feature extraction by the 5 residual groups has size n/32 × n/32 and serves as the input of the feature pyramid network. This structure outputs three feature maps of different sizes, which are, from small to large: the feature map obtained by fusing the feature maps output by residual groups 3, 4 and 5, with the smallest size n/32 × n/32; the feature map obtained by fusing the feature maps output by residual groups 2, 3 and 4, with size n/16 × n/16; and the feature map obtained by fusing the feature maps output by residual groups 1, 2 and 3, with the largest size n/8 × n/8. The number of output channels N is calculated by formula (15), specifically as follows:
N=num×(score+location+label) (15)
where num represents the number of prior boxes drawn in each cell; score represents the confidence probability value of each prediction box (1 value): each box has one score, with a value between 0 and 1; location represents the position coordinates of each prediction box and comprises 4 coordinate values (t_x, t_y, t_ω, t_h), the predicted coordinate offsets of each prediction box, from which the center coordinates and the width and height of each prediction box are calculated; label represents the probability values output by each prediction box for each category of target object, the number of output values being the number of categories of target objects to be detected;
422. adding an attention module;
the attention module includes two sub-modules of channel attention and spatial attention that focus on different features in two dimensions, channel and space, respectively.
Channel attention module: for an input n × n × m feature map, features on each channel are first extracted by a global average pooling layer, which outputs a 1 × 1 × m vector; the correlations between channels are then extracted by two successive fully connected layers whose output channel numbers are m/4 and m respectively; a relu activation function follows the first fully connected layer and a sigmoid activation function follows the second, fixing the output values to the range 0-1; finally, the feature map input into the channel attention module is multiplied by the feature map output by the sigmoid layer, and the finally output feature map also has size n × n × m;
Spatial attention module: the input n × n × m feature map first passes through two parallel convolution branches whose kernel sizes are 1 × 9 and 9 × 1 respectively; the pixel values at corresponding positions of the feature maps output by the two branches are added, and a sigmoid function fixes the output values to the range 0-1, giving a feature map of size n × n × 1; finally, the feature map input into the module is multiplied by this single-channel feature map to obtain the final output feature map of size n × n × m.
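The two attention sub-modules can be sketched in PyTorch as follows. This is an illustrative reading of the description above (global average pooling, two fully connected layers with m/4 and m outputs, and two parallel 1 × 9 / 9 × 1 convolution branches); the exact layer configuration of the invention may differ.

import torch
import torch.nn as nn

class ChannelAttention(nn.Module):
    def __init__(self, channels):                      # channels = m
        super().__init__()
        self.pool = nn.AdaptiveAvgPool2d(1)             # global average over each channel
        self.fc = nn.Sequential(
            nn.Linear(channels, channels // 4), nn.ReLU(inplace=True),
            nn.Linear(channels // 4, channels), nn.Sigmoid())    # weights fixed to [0, 1]

    def forward(self, x):                               # x: (B, m, n, n)
        w = self.fc(self.pool(x).flatten(1)).view(x.size(0), -1, 1, 1)
        return x * w                                    # reweighted map, still (B, m, n, n)

class SpatialAttention(nn.Module):
    def __init__(self, channels):
        super().__init__()
        self.branch_a = nn.Conv2d(channels, 1, kernel_size=(1, 9), padding=(0, 4))
        self.branch_b = nn.Conv2d(channels, 1, kernel_size=(9, 1), padding=(4, 0))

    def forward(self, x):                               # x: (B, m, n, n)
        w = torch.sigmoid(self.branch_a(x) + self.branch_b(x))   # (B, 1, n, n) in [0, 1]
        return x * w                                    # broadcast over channels

# usage sketch: attn = nn.Sequential(ChannelAttention(256), SpatialAttention(256))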
The concrete method of step five is as follows:
After the feature extraction backbone network is built, for the input training set pictures, the data, labels, number of categories and prior box size information are loaded into the network, and the weight file obtained by training on the coco2014 dataset is loaded as the pre-training weights of the YOLOv3 model, i.e. the initial weight parameters of each network layer; during training, forward propagation calculates the YOLOv3 loss function: location loss, confidence loss and category loss.
The concrete method of the sixth step is as follows:
The test set pictures are cropped and input into the trained model; during testing, a Soft-NMS algorithm is used to screen out the redundant boxes among all prediction boxes output by the backbone network. The specific method is as follows:
61) for all prediction boxes B = {b1, b2, ..., bn} output by the network, b1 represents the 1st prediction box, b2 the 2nd prediction box and bn the nth prediction box; their corresponding confidence scores are S = {s1, s2, ..., sn}, where s1 is the confidence score predicted for the 1st prediction box, s2 for the 2nd prediction box and sn for the nth prediction box; an intersection-over-union threshold t, a score threshold σ and a confidence threshold α are set;
62) the box with the highest score of the same category among the prediction boxes is found and denoted bm, with score sm; for each box bi other than the highest-scoring box bm, with score si, the IOU value IOU(bi, bm) between bi and bm is calculated; if IOU(bi, bm) ≥ t, then
si = si · e^( -IOU(bi, bm)² / σ )
otherwise the value of si is unchanged; wherein si represents the confidence score of prediction box bi; bi represents any prediction box; bm represents the prediction box with the highest score; σ represents the score threshold;
63) step 62) is repeated for the prediction boxes of the next category until the prediction boxes of all categories of targets have been traversed, giving the new confidence scores S′ = {s1′, s2′, ..., sn′} corresponding to all prediction boxes;
64) the new confidence scores are screened with the set confidence threshold α: for the score sj′ of any prediction box, if sj′ < α, the box bj is suppressed and not output.
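The screening in steps 61)-64) can be sketched as below for one class of prediction boxes; the Gaussian score decay exp(-IOU²/σ) applied when the overlap exceeds t is one common Soft-NMS variant and is an assumption about the exact decay used here.

import numpy as np

def soft_nms(boxes, scores, t=0.5, sigma=0.5, alpha=0.001):
    """boxes: (n, 4) as (x1, y1, x2, y2) for one class; returns the kept boxes and scores."""
    boxes, scores = boxes.astype(float).copy(), scores.astype(float).copy()
    keep_boxes, keep_scores = [], []
    while len(scores) > 0:
        m = int(np.argmax(scores))                       # b_m: highest-scoring box
        bm = boxes[m]
        keep_boxes.append(bm); keep_scores.append(scores[m])
        boxes = np.delete(boxes, m, axis=0); scores = np.delete(scores, m)
        if len(scores) == 0:
            break
        # IOU of every remaining box with b_m
        x1 = np.maximum(bm[0], boxes[:, 0]); y1 = np.maximum(bm[1], boxes[:, 1])
        x2 = np.minimum(bm[2], boxes[:, 2]); y2 = np.minimum(bm[3], boxes[:, 3])
        inter = np.clip(x2 - x1, 0, None) * np.clip(y2 - y1, 0, None)
        area_m = (bm[2] - bm[0]) * (bm[3] - bm[1])
        area_i = (boxes[:, 2] - boxes[:, 0]) * (boxes[:, 3] - boxes[:, 1])
        iou = inter / (area_m + area_i - inter)
        # decay the scores of heavily overlapping boxes instead of discarding them outright
        decay = np.where(iou >= t, np.exp(-(iou ** 2) / sigma), 1.0)
        scores = scores * decay
        keep = scores >= alpha                           # confidence threshold alpha
        boxes, scores = boxes[keep], scores[keep]
    return np.array(keep_boxes), np.array(keep_scores)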
The concrete method of the seventh step is as follows:
For two differently distributed data domains, the easily collected data domain is taken as the source domain and labeled; the source domain data are converted by the CycleGAN model into fake pictures with the target domain style, the improved YOLOv3 model is trained with these fake pictures, and the trained model is used to detect the target objects in the target domain pictures.
The invention has the beneficial effects that:
1) the improved YOLOv3 network structure of the invention extracts image features more fully, and the trained model has higher detection accuracy on the target objects in the test set;
2) the invention provides a YOLOv3 model with improved accuracy that can be applied to cross-domain target detection for automatic driving automobiles, with the training set data generated by the unsupervised CycleGAN algorithm. The method effectively avoids the heavy work of labeling the dataset target objects in different scenes, transfers a model trained in one scene to target detection in another scene, and improves the robustness and transferability of the model.
Drawings
FIG. 1 is a flow chart of the algorithm of the present invention;
FIG. 2 is a schematic diagram of the cyclic adversarial generation network (CycleGAN) of the present invention;
FIG. 3 is a diagram of the generator network structure in the CycleGAN model of the present invention;
FIG. 4 is a diagram of the discriminator network structure in the CycleGAN model of the present invention;
FIG. 5 is a result graph of the K-means algorithm for determining the cluster number;
FIG. 6 is a diagram of a channel attention module network architecture;
FIG. 7 is a diagram of a spatial attention module network architecture;
FIG. 8 is a diagram of residual block in a feature extraction network;
FIG. 9 is a modified YOLOv3 network architecture diagram;
FIG. 10 is a diagram of the effect of cross-domain detection of different weather;
FIG. 11 is a diagram of the effect of cross-domain detection at different locations;
FIG. 12 is a cross-domain detection effect graph at different time periods;
FIG. 13 is a diagram of the effect of cross-domain detection of different weather at different locations;
fig. 14 is a diagram of the effect of cross-domain detection at different locations and different time periods.
Detailed Description
The invention is described in further detail below with reference to the accompanying drawings:
a cross-domain target detection method of an automatic driving automobile based on improved YOLOv3 comprises the following steps:
Step one, inputting source domain images and target domain images into the generative adversarial network CycleGAN model for training to obtain composite images;
The CycleGAN model, proposed in 2017, is a cyclic generative adversarial network. Its main effect is: for a source data domain and a target data domain, style transfer of the training data can be achieved without establishing a one-to-one mapping between them, i.e. the source data domain can be converted into images with the style of the target domain, and the target data domain can likewise be converted into images with the style of the source domain. To realize this function, the model consists of four parts: two generators, two discriminators, the source data domain images and the target data domain images. In such an unsupervised learning model, the generator and the discriminator are trained alternately in a competing manner: the generator tries to generate fake pictures that are as realistic as possible in order to fool the discriminator, while the discriminator tries to distinguish the fake pictures from the really captured pictures. Compared with traditional unsupervised adversarial networks, the CycleGAN model introduces a cycle consistency loss, which guarantees a one-to-one mapping for the image style conversion between the two data domains and has important application value for the unpaired image style conversion problem.
For the cross-domain detection problem in which the training set and the test set come from two different data domains, the training set pictures are usually easy to obtain, but the model trained on the training set data has not learned the image characteristics of the test set, so its detection accuracy on the test set pictures is very low. Therefore, the unsupervised generative adversarial network CycleGAN algorithm is adopted to complete the adaptation between the source domain images and the target domain images. The adversarial network CycleGAN comprises: two generators, two discriminators, the source data domain images and the target data domain images.
The function of the generator is: to obtain, from an input image of the source data domain, a composite image with the target domain style, and to train the generator continuously so that the composite image is as realistic as possible, in order to fool the discriminator so that it cannot tell whether the image really exists or was synthesized.
The function of the discriminator is: to distinguish the composite images generated by the generator from the real images; the purpose of training the discriminator is to output a score as high as possible (close to 1) for an original image and as low as possible (close to 0) for a composite image, so that the discriminator is not fooled by the composite images generated by the generator and can better discriminate between real images and composite images.
The method specifically comprises the following steps:
11) the cyclic adversarial generation network CycleGAN;
The source domain images input into the CycleGAN model pass through a generator to obtain composite images with the target domain style, and the generator is trained continuously;
A discriminator distinguishes the composite images synthesized by the generator from the real images of the original data domain;
12) calculating the CycleGAN loss function;
The CycleGAN loss function comprises a generator loss function and a discriminator loss function;
The generator loss function comprises the cycle consistency loss of the two generators, the GAN loss and the identity loss;
The cycle consistency loss of the two generators is calculated by formula (1), specifically as follows:
L_cyc(G_S-T, G_T-S, S_real, T_real) = L_cyc1 + L_cyc2 = λ1·E_s~S_real[ ||G_T-S(G_S-T(s)) - s||_1 ] + λ2·E_t~T_real[ ||G_S-T(G_T-S(t)) - t||_1 ]   (1)
In the formula, G_S-T and G_T-S represent the two generators; L_cyc represents the total cycle consistency loss value; S_real represents the set of real pictures in the source domain; T_real represents the set of real pictures in the target domain; L_cyc1 represents the cycle consistency loss value in the first direction; L_cyc2 represents the cycle consistency loss value in the second direction; λ1 and λ2 are constant coefficients for balancing the loss terms; E_s~S_real denotes the expectation over images s of the source data domain S_real; E_t~T_real denotes the expectation over images t of the target data domain T_real; s represents a picture in S_real; t represents a picture in T_real;
An image of the source domain is passed through the generator G_S-T, and the resulting composite image is input into the discriminator, which outputs a score between 0 and 1; the generator wishes the fake image it generates to fool the discriminator, i.e. the score should be as close to 1 as possible, so the GAN loss is calculated between that score and 1.
The GAN loss is calculated by formula (2), specifically as follows:
L_GAN(G_S-T, G_T-S, D_S, D_T, S_real, T_real) = L_GAN1 + L_GAN2 = E_s~S_real[ (D_T(G_S-T(s)) - 1)² ] + E_t~T_real[ (D_S(G_T-S(t)) - 1)² ]   (2)
In the formula, L_GAN represents the total GAN loss value; G_S-T and G_T-S represent the two generators; D_S and D_T are the two discriminators; S_real represents the set of real pictures in the source domain; T_real represents the set of real pictures in the target domain; L_GAN1 represents the GAN loss value of the first generator; L_GAN2 represents the GAN loss value of the second generator; E_s~S_real denotes the expectation over images s of the source data domain S_real; s represents a picture in S_real; E_t~T_real denotes the expectation over images t of the target data domain T_real; t represents a picture in T_real.
In order to prevent overfitting during the training of the generator, an image of the source data domain is used as the input of the generator, and the corresponding output should differ little from the input, which verifies whether the generator overfits.
The intrinsic loss is calculated by formula (3), specifically as follows:
Figure BDA0002904914560000122
in the formula, LidtRepresents the total loss per se; gS-TAnd GT-STwo generators are represented; s _ real represents a real picture set in a source domain; t _ real represents a real picture set in the target domain; l isidt1A loss value indicating the first generator itself; l isidt2To representThe intrinsic loss value of the second generator; lambda, lambda1、λ2A constant coefficient for balancing the loss value; es~S_realRepresents an image S in the source data field S _ real; s represents a picture in S _ real; et~T_realRepresents the image T in the target data field T _ real; t represents a picture in T _ real;
The generator loss function is calculated by formula (4), specifically as follows:
L_G(G_S-T, G_T-S, D_S, D_T, S_real, T_real) = L_GAN + L_cyc + L_idt   (4)
In the formula, L_G represents the total loss function value of the two generators; L_cyc represents the total cycle consistency loss value; L_GAN represents the total GAN loss value; L_idt represents the total identity loss value; G_S-T and G_T-S represent the two generators; S_real represents the set of real pictures in the source domain; T_real represents the set of real pictures in the target domain; D_S and D_T are the two discriminators.
Discriminator loss function: the composite images generated from the input source data domain and the real images of the target data domain are distinguished by the discriminator, whose output score should be as close to 0 as possible for a fake image and as close to 1 as possible for a real image.
The discriminator loss function is calculated by formula (5), specifically as follows:
L_D = L_DT + L_DS = E_t~T_real[ (D_T(t) - 1)² ] + E_t'~T_fake[ (D_T(t'))² ] + E_s~S_real[ (D_S(s) - 1)² ] + E_s'~S_fake[ (D_S(s'))² ]   (5)
In the formula, L_D represents the total loss value of the two discriminators; L_DT represents the loss value of discriminator D_T; L_DS represents the loss value of discriminator D_S; E_t'~T_fake denotes the expectation over fake images t' of the fake image set T_fake with target-domain characteristics; E_s'~S_fake denotes the expectation over fake images s' of the fake image set S_fake with source-domain characteristics; t' represents a picture in T_fake; s' represents a picture in S_fake; D_S and D_T are the two discriminators; E_s~S_real denotes the expectation over images s of the source data domain S_real; s represents a picture in S_real; E_t~T_real denotes the expectation over images t of the target data domain T_real; t represents a picture in T_real;
In summary, the CycleGAN loss function is calculated by formula (6), specifically as follows:
L_total = L_G + L_D   (6)
In the formula, L_total represents the total loss function value of the CycleGAN; L_D represents the total loss value of the two discriminators; L_G represents the total loss function value of the two generators; L_cyc represents the total cycle consistency loss value; L_GAN represents the total GAN loss value; L_idt represents the total identity loss value; G_S-T and G_T-S represent the two generators; S_real represents the set of real pictures in the source domain; T_real represents the set of real pictures in the target domain; D_S and D_T are the two discriminators; L_DT represents the loss value of discriminator D_T; L_DS represents the loss value of discriminator D_S.
Step two, taking the composite images as the training set and the target domain images as the test set;
the specific method of the second step is as follows:
Using the CycleGAN model of step one, the composite images are obtained as training set data, and a picture folder, a data folder and an image annotation folder are made.
The picture folder contains all jpg images of the training set, validation set and test set; the data folder contains three txt files, which record respectively all picture names of the training set, all picture names of the validation set and all picture names of the test set; the image annotation folder contains, for the training set, validation set and test set, the top-left corner coordinates (xmin, ymin) and bottom-right corner coordinates (xmax, ymax) of all label boxes and the category of the object in each box (where 0, 1, 2, … represent the different object categories), i.e. one xml file for each picture.
Step three, clustering the training set label boxes with the K-means clustering algorithm, determining the number of clusters and calculating the prior box sizes;
The txt file contains all target object label boxes of the training set, and K-means clustering is performed on all label boxes according to their width ω and height h using the following formulas, so as to obtain the size statistics of the training set label boxes.
d = 1 - IOU(box, centroid)   (7)
IOU = |box ∩ centroid| / |box ∪ centroid|   (8)
wherein centroid is the determined cluster center box, box is any other box, and the IOU (intersection over union) value between centroid and box is calculated; d represents the distance value, and the K-means algorithm clusters exactly according to this distance value. box ∩ centroid represents the area of the intersection of the two boxes, and box ∪ centroid represents the area of their union. The larger the IOU value, the more the two boxes overlap and the more similar their shapes, so they should be grouped into the same class during clustering; therefore the K-means distance value and the IOU value between the boxes are inversely related.
For the detection of many different classes of target objects, whose shapes and sizes may differ greatly, the K-means clustering algorithm clusters all label boxes only according to the intersection over union between two boxes. This clustering criterion does not reflect the shape differences between label boxes of different classes, and the label boxes of frequently occurring target objects far outnumber those of rarely occurring ones, so the dominant label box sizes easily bias the final clustering result. In order to reflect the shape differences of different classes of target objects during clustering, the K-means clustering formula is improved: besides the IOU value of the boxes, the cIOU and gIOU values, which reflect box shape, are added. The improved clustering formulas between the cluster center box centroid and any box are as follows:
d = 1 - imp_IOU(box, centroid)   (9)
imp_IOU = (IOU + gIOU + cIOU) / 3   (10)
IOU = |box ∩ centroid| / |box ∪ centroid|   (11)
gIOU = IOU - (|C| - |box ∪ centroid|) / |C|   (12)
cIOU = IOU - ρ²(b_box, b_centroid) / c² - α·v   (13)
v = (4 / π²)·(arctan(ω_centroid / h_centroid) - arctan(ω_box / h_box))²   (14)
wherein d represents the distance value; centroid represents the determined cluster center box; box represents any other box; box ∩ centroid represents the area of the intersection of the two boxes, and box ∪ centroid represents the area of their union; imp_IOU represents the improved IOU value, obtained by calculating the IOU, cIOU and gIOU values of all label boxes and taking the average of the three as the basis for computing the distance. gIOU formula: on the basis of the IOU, the smallest enclosing box C of the two boxes is found, and the union area of the cluster boxes box and centroid is subtracted from the area of box C and the difference is divided by the area of box C. cIOU formula: ρ represents the distance between the center points of the two cluster boxes box and centroid, and c represents the diagonal length between the top-left and bottom-right points of the found box C; the value ρ²/c² reflects the distance between the two boxes, but because only the width and height of the boxes are clustered and their positions are not considered, ρ²/c² = 0 here. The value v in the cIOU is computed from the aspect ratio of each cluster box and reflects the difference between the width-to-height ratios of the two boxes: if v is large, the cIOU value is small, meaning the two boxes differ greatly in shape and should not be clustered into the same class; α is a coefficient used to balance the two terms in the cIOU formula; b_box represents any other box; b_centroid represents the determined cluster center box; ω_centroid and h_centroid represent the width and height of the cluster center box; ω_box and h_box represent the width and height of any box.
Step four, building the improved YOLOv3 network as the feature extraction backbone network;
the concrete method of the fourth step is as follows:
The feature extraction network adopted in the YOLOv3 algorithm is Darknet-53, i.e. there are 53 network layers, including a feature pyramid structure; the network consists of:
41) the feature extraction front-end network;
411. Convolution layers: if the input image size is n × n, the numbers of convolution kernels of the first two convolution layers are 32 and 64 respectively, and by setting the stride of the convolution to 2 the output feature map size is halved to n/2 × n/2;
412. Residual group structure: several stacked residual blocks form the backbone of the Darknet-53 feature extraction network structure; each residual block spans two convolution layers between its input and output; the number of residual blocks in the first group is 1; the input and output feature maps of a residual block have the same size, still n/2 × n/2, and the same number of channels, i.e. the residual blocks only extract data features and do not change the feature map size;
413. Convolution layer with stride 2: in Darknet-53, a convolution layer with stride 2 replaces the pooling layer to reduce the feature size, namely: it halves the size of the input feature map to n/4 × n/4 and doubles the number of channels;
414. Stacking of residual groups and convolution layers: the network structure of steps 412 and 413 is repeated 4 more times as a whole, with the numbers of residual blocks in the step-412 structures being 2, 8, 8 and 4 respectively; a stride-2 convolution layer accompanies each step-412 structure to halve the feature map size (no further downsampling follows the last residual group), so the feature map sizes obtained after each repetition are, in turn: n/4 × n/4, n/8 × n/8, n/16 × n/16, n/32 × n/32.
That is: the whole feature extraction part contains 5 residual groups with different numbers of convolution kernels, and each residual group consists of a different number of residual blocks; the numbers of residual blocks in the 5 residual groups are 1, 2, 8, 8 and 4 respectively. A convolution layer with stride 2 precedes the first residual group and is arranged between every two adjacent residual groups to halve the size, so that the feature map input into the next residual group is half the previous size; there are 5 such stride-2 convolution layers in total, and the finally output feature map size is 1/32 of the training set image size, i.e. n/32 × n/32;
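A condensed PyTorch sketch of the residual blocks and stride-2 downsampling convolutions described in 41); the channel numbers and activation choices follow the common Darknet-53 layout and are assumptions where the text does not state them.

import torch.nn as nn

def conv_bn(c_in, c_out, k, s):
    return nn.Sequential(nn.Conv2d(c_in, c_out, k, stride=s, padding=k // 2, bias=False),
                         nn.BatchNorm2d(c_out), nn.LeakyReLU(0.1, inplace=True))

class ResidualBlock(nn.Module):
    """Two convolutions spanned by a shortcut; feature map size and channels are unchanged."""
    def __init__(self, channels):
        super().__init__()
        self.body = nn.Sequential(conv_bn(channels, channels // 2, 1, 1),
                                  conv_bn(channels // 2, channels, 3, 1))

    def forward(self, x):
        return x + self.body(x)

def residual_group(channels, n_blocks):
    return nn.Sequential(*[ResidualBlock(channels) for _ in range(n_blocks)])

# backbone skeleton: 5 residual groups with 1, 2, 8, 8, 4 blocks, each preceded by a
# stride-2 convolution that halves the feature map size (n -> n/32 overall)
def darknet53_trunk():
    layers, channels = [conv_bn(3, 32, 3, 1)], 32
    for n_blocks in (1, 2, 8, 8, 4):
        layers.append(conv_bn(channels, channels * 2, 3, 2))    # downsample, double channels
        channels *= 2
        layers.append(residual_group(channels, n_blocks))
    return nn.Sequential(*layers)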
42) the improved feature pyramid;
421. fusing the information of the lower-layer and deep-layer feature maps of the feature pyramid structure;
The feature map output after feature extraction by the 5 residual groups has size n/32 × n/32 and serves as the input of the feature pyramid network. This structure outputs three feature maps of different sizes, which are, from small to large: the feature map obtained by fusing the feature maps output by residual groups 3, 4 and 5, with the smallest size n/32 × n/32; the feature map obtained by fusing the feature maps output by residual groups 2, 3 and 4, with size n/16 × n/16; and the feature map obtained by fusing the feature maps output by residual groups 1, 2 and 3, with the largest size n/8 × n/8. The number of output channels N is calculated by formula (15), specifically as follows:
N=num×(score+location+label) (15)
where num represents the number of prior boxes drawn in each cell; score represents the confidence probability value of each prediction box (1 value): each box has one score, with a value between 0 and 1; location represents the position coordinates of each prediction box and comprises 4 coordinate values (t_x, t_y, t_ω, t_h), the predicted coordinate offsets of each prediction box, from which the center coordinates and the width and height of each prediction box are calculated; label represents the probability values output by each prediction box for each category of target object, the number of output values being the number of categories of target objects to be detected.
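The channel count N of formula (15) and the decoding of the predicted offsets (t_x, t_y, t_ω, t_h) into box centers and sizes can be illustrated as follows; the sigmoid/exponential decoding is the usual YOLOv3 convention and is stated here as an assumption rather than the exact formulation of the invention.

import math

# formula (15): with num = 3 prior boxes per cell, 1 confidence score,
# 4 location values and 5 target categories, N = 3 * (1 + 4 + 5) = 30
N = 3 * (1 + 4 + 5)

def decode_box(tx, ty, tw, th, cx, cy, pw, ph, stride):
    """Map predicted offsets to a box: (cx, cy) is the cell index, (pw, ph) the prior box size."""
    bx = (1.0 / (1.0 + math.exp(-tx)) + cx) * stride   # center x in pixels
    by = (1.0 / (1.0 + math.exp(-ty)) + cy) * stride   # center y in pixels
    bw = pw * math.exp(tw)                             # width scaled from the prior box
    bh = ph * math.exp(th)                             # height scaled from the prior box
    return bx, by, bw, bh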
The feature map output by a low-layer network, obtained after only a few convolution layers, extracts the image features only preliminarily: it contains more noise vectors but preserves the edge and line features of the image well. The feature map output by a deep network, obtained after many convolution layers, extracts the overall pixel-level features completely and contains less noise, but edge contour information is lost and the edges of target objects become blurred. Therefore, the deep feature maps and the shallow feature maps are fused: on the basis of extracting the target object features well, more contour line features of the target objects are retained, so that the feature maps extracted by the neural network preserve the features of the real target objects to the greatest extent, which helps to improve the target detection rate.
422. Adding an attention module;
the attention module includes two sub-modules of channel attention and spatial attention that focus on different features in two dimensions, channel and space, respectively.
Channel attention module: for an input n × n × m feature map, features on each channel are first extracted by a global average pooling layer, which outputs a 1 × 1 × m vector; the correlations between channels are then extracted by two successive fully connected layers whose output channel numbers are m/4 and m respectively; a relu activation function follows the first fully connected layer and a sigmoid activation function follows the second, fixing the output values to the range 0-1; finally, the feature map input into the channel attention module is multiplied by the feature map output by the sigmoid layer, and the finally output feature map also has size n × n × m;
Spatial attention module: the input n × n × m feature map first passes through two parallel convolution branches whose kernel sizes are 1 × 9 and 9 × 1 respectively; the pixel values at corresponding positions of the feature maps output by the two branches are added, and a sigmoid function fixes the output values to the range 0-1, giving a feature map of size n × n × 1; finally, the feature map input into the module is multiplied by this single-channel feature map to obtain the final output feature map of size n × n × m.
For low-layer feature maps, the features extracted across channels still contain random vectors such as noise, so a spatial attention module is added to the low-layer feature maps: it only attends to the difference information in the same spatial dimension and computes the average across channels at the same pixel position to suppress the noise vectors. For deep feature maps, feature extraction is complete, so a channel attention module is added to capture the partial differences between the different feature channels of the same feature map. For the feature maps of the middle layer, some noise vectors remain and the features extracted by the channels also differ, so both the channel attention and the spatial attention modules are added.
Step five, training the improved YOLOv3 network on the fake images generated by the CycleGAN model;
The concrete method of step five is as follows:
After the feature extraction backbone network is built, for the input training set pictures, the data, labels, number of categories and prior box size information are loaded into the network, and the weight file obtained by training on the coco2014 dataset is loaded as the pre-training weights of the YOLOv3 model, i.e. the initial weight parameters of each network layer; during training, forward propagation calculates the YOLOv3 loss function: location loss, confidence loss and category loss.
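A hedged sketch of the training step described above: pre-trained weights are loaded as the initial parameters and each forward pass accumulates location, confidence and category losses. The function and file names (build_improved_yolov3, yolov3_coco_pretrained.pth, loss_fn, etc.) are placeholders rather than the exact implementation of the invention.

import torch

def train_one_epoch(model, loader, optimizer, loss_fn, device="cuda"):
    """One epoch of improved-YOLOv3 training on the CycleGAN fake images (sketch)."""
    model.train()
    for images, targets in loader:                   # targets: label boxes and class ids
        images = images.to(device)
        preds = model(images)                        # three feature maps from the backbone
        loc_loss, conf_loss, cls_loss = loss_fn(preds, targets)
        loss = loc_loss + conf_loss + cls_loss       # YOLOv3 loss = location + confidence + category
        optimizer.zero_grad()
        loss.backward()
        optimizer.step()

# initial weights: parameters pre-trained on the coco2014 dataset (file name is a placeholder)
# model = build_improved_yolov3(num_classes=5, priors=priors)
# model.load_state_dict(torch.load("yolov3_coco_pretrained.pth"), strict=False)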
Step six, detecting the test set images with the trained model and calculating the average detection precision;
the concrete method of the sixth step is as follows:
The test set pictures are cropped to a certain size, which can be set to 704 × 704, and input into the trained model; during testing, a Soft-NMS algorithm is used to screen out the redundant boxes among all prediction boxes output by the backbone network. The specific method is as follows:
61) for all prediction boxes B = {b1, b2, ..., bn} output by the network, b1 represents the 1st prediction box, b2 the 2nd prediction box and bn the nth prediction box; their corresponding confidence scores are S = {s1, s2, ..., sn}, where s1 is the confidence score predicted for the 1st prediction box, s2 for the 2nd prediction box and sn for the nth prediction box; an intersection-over-union threshold t, a score threshold σ and a confidence threshold α are set;
62) the box with the highest score of the same category among the prediction boxes is found and denoted bm, with score sm; for each box bi other than the highest-scoring box bm, with score si, the IOU value IOU(bi, bm) between bi and bm is calculated; if IOU(bi, bm) ≥ t, then
si = si · e^( -IOU(bi, bm)² / σ )
otherwise the value of si is unchanged; wherein si represents the confidence score of prediction box bi; bi represents any prediction box; bm represents the prediction box with the highest score; σ represents the score threshold;
63) step 62) is repeated for the prediction boxes of the next category until the prediction boxes of all categories of targets have been traversed, giving the new confidence scores S′ = {s1′, s2′, ..., sn′} corresponding to all prediction boxes;
64) the new confidence scores are screened with the set confidence threshold α: for the score sj′ of any prediction box, if sj′ < α, the box bj is suppressed and not output.
Step seven, performing cross-domain target detection.
The concrete method of the seventh step is as follows:
For images from two different data domains, the domain whose images are easy to collect and acquire is called the source data domain; the CycleGAN model takes the source data and the target data as input and converts the source domain data into fake pictures with the target domain style. Compared with the source data, the positions and shapes of the target objects in these fake pictures are unchanged, while the picture background information takes on the target domain style. Therefore, for two differently distributed data domains, the easily collected data domain is taken as the source domain and labeled, the source domain data are converted by the CycleGAN model into fake pictures with the target domain style, the improved YOLOv3 model is trained with these fake pictures, and the trained model is used to detect the target objects in the target domain pictures; the detection accuracy is higher because the model has learned information such as the image background and style of the test set.
Examples
A cross-domain target detection method for an automatic driving automobile based on improved YOLOv3, taking target detection under different weather as an example, comprises the following specific steps:
Step 1: the cross-domain target detection problem is made concrete by considering target detection under different weather, i.e. the training set and the test set data are collected under different weather. For example, the training set data KITTI are pictures taken on sunny days, and the test set data vKITTI-rainy are pictures of the same places on rainy days.
As shown in FIG. 2, S_real is the source domain data KITTI, T_real is the target domain data vKITTI-rainy, T_fake is the composite image set with the target domain style, and S_fake is the composite image set with the source domain style; G_S-T denotes the image generator from the source domain to the target domain, G_T-S the image generator from the target domain to the source domain, D_S the source domain image discriminator and D_T the target domain image discriminator. The generator and discriminator network structures in the CycleGAN are shown in FIG. 3 and FIG. 4 respectively; the symbols mean: Conv, convolution layer; Deconv, deconvolution layer; BN, batch normalization layer; relu, activation function; Leakyrelu, activation function; 9x indicates that the residual block is repeated 9 times.
The method uses the CycleGAN unsupervised generative adversarial network to generate a composite image set T_fake with the target domain style from the source domain images. For two differently distributed data domains, the CycleGAN network only converts the image style; information such as the positions and classes of the target objects in the images is not changed, so the composite images T_fake do not need to be re-labeled. T_fake is taken as the training set data and the annotation files of the original KITTI data are taken as the annotation files of the T_fake images to train the improved YOLOv3 model of the invention. The original image size is 1280 × 720, with 3568 training images and 892 test images; the training images input into the YOLOv3 network are cropped to 704 × 704 and then feature extraction is performed. A train.txt document and a test.txt document are obtained to record the image names of the training set and the test set; the corresponding training set images are found from the image names in the train.txt document, and the 4 position coordinates (xmin, ymin, xmax, ymax) of the top-left and bottom-right corners of all label boxes in the training images, together with the target category information in each box, are recorded in a 2007_train.txt document. The categories of targets to be detected are 5: vehicles, pedestrians, buses, traffic lights and traffic signs.
All images are used to train the CycleGAN for 200 epochs. The learning rate is decayed during training: it is held at 2 × 10^-4 for the first 100 epochs, at 2 × 10^-5 for epochs 100-150, and at 2 × 10^-6 for epochs 150-200. The parameters λ1 = λ2 in the generator cycle-consistency loss are set to 5, the parameter λ in the identity loss is set to 0.5, and the loss calculation uses a squared-error loss in place of the cross-entropy loss. Training is carried out under Ubuntu 18.04.4 with the PyTorch framework, the number of pictures per training step (batch_size) is 1, and the weight coefficients of the generators and discriminators are updated with Adam.
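The staged learning-rate schedule and Adam updates described above can be sketched in PyTorch as follows. The stand-in generator modules, the Adam betas and the placement of the scheduler step are assumptions for illustration; the discriminators would follow the same pattern with their own optimizer.

```python
import itertools
import torch
import torch.nn as nn

# stand-ins for the two generators; the real networks follow the structure of figure 3
G_s2t = nn.Conv2d(3, 3, 3, padding=1)
G_t2s = nn.Conv2d(3, 3, 3, padding=1)

def lr_multiplier(epoch):
    """Staged schedule from the embodiment, relative to the base rate 2e-4."""
    if epoch < 100:
        return 1.0    # 2e-4 for the first 100 epochs
    if epoch < 150:
        return 0.1    # 2e-5 for epochs 100-150
    return 0.01       # 2e-6 for epochs 150-200

g_params = itertools.chain(G_s2t.parameters(), G_t2s.parameters())
opt_G = torch.optim.Adam(g_params, lr=2e-4, betas=(0.5, 0.999))
sched_G = torch.optim.lr_scheduler.LambdaLR(opt_G, lr_lambda=lr_multiplier)

for epoch in range(200):
    # ... one pass over the unpaired source/target images (batch_size = 1) goes here ...
    sched_G.step()
```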
Step 2: the K-means clustering algorithm calculates the prior box size.
First, the number of clusters for all annotation boxes in the training set is determined. For each candidate number of prior boxes per cell (1, 2, 3, ..., 9), the intersection-over-union (IOU) between the clustered prior boxes and the annotation boxes is calculated, and the inflection point of the resulting curve is found, i.e. the point after which the IOU no longer increases noticeably; the number of cluster boxes at that point is the final number per cell. As shown in fig. 5, the inflection point occurs when the number of clusters is 3, i.e. when 3 prior boxes are drawn in each cell, with an IOU of 67.63%; since the YOLOv3 model predicts on three feature-map scales, all annotation boxes are therefore clustered into 9 classes.
The imp_IOU value between the cluster center box (centroid) and any box, and the corresponding distance value, are calculated according to the following formulas, from which the clustered prior box sizes are obtained by the K-means clustering algorithm.
d=1-imp_IOU(box,centroid) (16)
imp_IOU(box, centroid) = [IOU(box, centroid) + cIOU(box, centroid) + gIOU(box, centroid)] / 3    (17)
The improvement is that the IOU, cIOU and gIOU values of all annotation boxes are calculated, and the average of the three is used as the basis for the distance calculation.
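A minimal numpy sketch of K-means anchor clustering with the modified distance d = 1 - imp_IOU, where imp_IOU averages the IOU, gIOU and cIOU of each box against the cluster center. Treating boxes as width-height pairs with aligned centers (so the cIOU center-distance term vanishes) is a simplifying assumption of this sketch, as is the random initialisation.

```python
import numpy as np

def iou_terms(boxes, centroids):
    """IOU, gIOU and cIOU between (N,2) width-height boxes and (K,2) centroids, centers aligned."""
    w1, h1 = boxes[:, None, 0], boxes[:, None, 1]
    w2, h2 = centroids[None, :, 0], centroids[None, :, 1]
    inter = np.minimum(w1, w2) * np.minimum(h1, h2)
    union = w1 * h1 + w2 * h2 - inter
    iou = inter / union
    enclose = np.maximum(w1, w2) * np.maximum(h1, h2)       # smallest enclosing box
    giou = iou - (enclose - union) / enclose
    v = (4 / np.pi ** 2) * (np.arctan(w1 / h1) - np.arctan(w2 / h2)) ** 2
    alpha = v / (1 - iou + v)
    ciou = iou - alpha * v                                  # center-distance term is zero here
    return iou, giou, ciou

def kmeans_anchors(boxes, k=9, iters=300, seed=0):
    """Cluster annotation-box sizes with distance d = 1 - mean(IOU, gIOU, cIOU)."""
    rng = np.random.default_rng(seed)
    centroids = boxes[rng.choice(len(boxes), k, replace=False)]
    for _ in range(iters):
        iou, giou, ciou = iou_terms(boxes, centroids)
        d = 1 - (iou + giou + ciou) / 3
        assign = d.argmin(axis=1)
        new = np.array([boxes[assign == j].mean(axis=0) if np.any(assign == j) else centroids[j]
                        for j in range(k)])
        if np.allclose(new, centroids):
            break
        centroids = new
    return centroids[np.argsort(centroids.prod(axis=1))]    # sorted by area, smallest first
```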
The calculated prior box sizes are: [(26,20), (38,28), (51,37), (69,41), (83,48), (111,56), (147,81), (202,121), (311,177)]. The YOLOv3 network outputs feature maps of three sizes, and the prior boxes drawn in each cell of the feature maps, from small to large, correspond to these 9 sizes. That is, for a 704 × 704 input image, the three feature maps output by the backbone network have sizes 22 × 22, 44 × 44 and 88 × 88: the 3 box sizes drawn in each cell of the 22 × 22 feature map are [(26,20), (38,28), (51,37)]; the 3 box sizes drawn in each cell of the 44 × 44 feature map are [(69,41), (83,48), (111,56)]; and the 3 box sizes drawn in each cell of the 88 × 88 feature map are [(147,81), (202,121), (311,177)]. Each prior box covers part of the features of the input image and outputs the extracted position coordinates of the target object, the predicted confidence score, and the class information.
Step 3: construct the improved YOLOv3 feature extraction backbone network.
The feature extraction network adopted in the YOLOv3 algorithm is Darknet-53, i.e. it has 53 network layers and contains a feature pyramid structure. The front-end network is composed mainly of a number of residual blocks and convolution layers with stride 2: the residual blocks continuously extract features from the input feature maps, while the stride-2 convolution layers replace pooling layers to halve the image size before output. Darknet-53 contains 5 stride-2 convolution layers, so the feature map fed into the feature pyramid is 1/32 of the input image size, i.e. 22 × 22. The output of each feature map includes the confidence value of being predicted as a foreground box, the four position coordinates (tx, ty, tω, th), and the probability value for each of the five target categories, so each of the three feature maps output by the network has 3 × (1 + 4 + 5) = 30 channels. Accordingly, Darknet-53 yields three feature maps of 22 × 22 × 30, 44 × 44 × 30, and 88 × 88 × 30.
Fig. 9 is the structure diagram of the improved YOLOv3 feature extraction backbone network, in which Conv3x3_s2 denotes a convolution layer with kernel size 3x3 and stride 2, which halves the size of the input feature map. Conv1x1 and Conv3x3 are the convolution layers used in the feature pyramid, and x1, x2, x8, x8 and x4 indicate how many times each residual block is repeated in the front-end network. Fig. 8 shows the basic structure of the residual block in fig. 9: it consists of a convolution layer Conv1x1_s1 with kernel size 1x1 and stride 1 followed by a convolution layer Conv3x3_s1 with kernel size 3x3 and stride 1, and the input is added to the output of these two convolution layers to form the final output of the residual block. The network structures of the channel attention and spatial attention modules in fig. 9 are shown in figs. 6 and 7 respectively. In fig. 6, global average pooling takes a feature map as input and outputs a one-dimensional vector of the average pixel value of each channel, FC is a fully connected layer used to convert a multi-dimensional feature map into a one-dimensional column vector, and the Sigmoid activation function adjusts the output value to lie between 0 and 1, i.e. outputs a probability value.
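The channel and spatial attention modules of figs. 6 and 7 might look like the following PyTorch sketch. The m/4 channel reduction and the 1×9 and 9×1 kernels follow the description, while the padding, the single-channel branch outputs and the order in which the two modules are applied are assumptions of this sketch.

```python
import torch
import torch.nn as nn

class ChannelAttention(nn.Module):
    """Fig. 6: global average pooling -> FC(m/4) -> ReLU -> FC(m) -> Sigmoid -> rescale channels."""
    def __init__(self, channels):
        super().__init__()
        self.pool = nn.AdaptiveAvgPool2d(1)
        self.fc = nn.Sequential(
            nn.Linear(channels, channels // 4),
            nn.ReLU(inplace=True),
            nn.Linear(channels // 4, channels),
            nn.Sigmoid(),
        )

    def forward(self, x):
        b, c, _, _ = x.shape
        w = self.fc(self.pool(x).view(b, c)).view(b, c, 1, 1)
        return x * w

class SpatialAttention(nn.Module):
    """Fig. 7: parallel 1x9 and 9x1 convolutions, summed, sigmoid, rescale spatial positions."""
    def __init__(self, channels):
        super().__init__()
        self.branch_a = nn.Conv2d(channels, 1, kernel_size=(1, 9), padding=(0, 4))
        self.branch_b = nn.Conv2d(channels, 1, kernel_size=(9, 1), padding=(4, 0))

    def forward(self, x):
        mask = torch.sigmoid(self.branch_a(x) + self.branch_b(x))   # n x n x 1 attention map
        return x * mask

# usage on a backbone feature map of shape (batch, m, n, n); the application order is assumed
feat = torch.randn(1, 256, 22, 22)
feat = SpatialAttention(256)(ChannelAttention(256)(feat))
```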
Step 4: train the improved YOLOv3 network.
The YOLOv3 model is built under the Keras framework. The training set images are the composite images with vKITTI-rainy style generated from the KITTI images, and the annotation files are the same as those of the KITTI data set. The input image size is 1280 × 720, and each image is cropped to 704 × 704 when input to YOLOv3. The weights trained on the coco2014 data set are loaded as the initial weights. The number of training epochs is set to 500; the training-set and validation-set loss values are output every epoch, and if the loss does not decrease for 6 consecutive epochs, the loss is considered to have converged and training is stopped, giving the final model.
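The stopping rule described above (pause training once the validation loss has not decreased for 6 consecutive epochs) maps naturally onto a Keras EarlyStopping callback. The sketch below uses a trivial stand-in model and dummy data, since the real improved YOLOv3 network, its loss and its data generators are defined elsewhere.

```python
import numpy as np
from tensorflow.keras import layers, models
from tensorflow.keras.callbacks import EarlyStopping

# stand-in model and data; the real network is the improved YOLOv3 backbone described above
model = models.Sequential([
    layers.Conv2D(4, 3, input_shape=(704, 704, 3)),
    layers.GlobalAveragePooling2D(),
    layers.Dense(1),
])
model.compile(optimizer="adam", loss="mse")
x = np.zeros((8, 704, 704, 3), np.float32)
y = np.zeros((8, 1), np.float32)

# stop once the validation loss has not decreased for 6 consecutive epochs, as in the embodiment
stopper = EarlyStopping(monitor="val_loss", patience=6, restore_best_weights=True)
model.fit(x, y, validation_split=0.25, epochs=500, callbacks=[stopper], verbose=0)
```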
Step 5: cross-domain target detection.
The test images come from the vKITTI-rainy data set, while the model is trained on images synthesized with CycleGAN, in which the position, class and other information of the target objects are the same as in the KITTI data set but the environmental background is that of vKITTI-rainy. Because the model has learned the background information of the test set, the detection precision on the test set is greatly improved compared with a model trained directly on the original training-set images. The results show that when a YOLOv3 model trained on the KITTI data set detects target objects in the vKITTI-rainy data set, the vehicle detection precision is 38.00%; when a YOLOv3 model trained on the false KITTI images synthesized by CycleGAN is used to detect target objects in vKITTI-rainy, the vehicle detection precision is 44.15%; and the model of the invention reaches 48.94%. These detection results show that the model greatly improves the accuracy of cross-domain target detection. FIG. 10 shows the effect of applying the algorithm of the invention to the vKITTI-rainy data set.
Step 6: cross-domain target detection in other scenarios.
The training and detection process of the invention has been described in detail above, taking target detection across different weather conditions as an example. The method is also suitable for cross-domain target detection in further scenes, such as different places, different time periods, different places with different weather conditions, and different places with different time periods. Cross-domain detection results in four different scenes are shown in figs. 11-14: fig. 11 is the detection effect in clear daytime, fig. 12 at night, fig. 13 in fog, and fig. 14 in the evening.

Claims (9)

1. A cross-domain target detection method of an automatic driving automobile based on improved YOLOv3 is characterized by comprising the following steps:
inputting a source domain image and a target domain image into a countermeasure generation network cycleGAN model for training to obtain a synthetic graph;
step two, taking the synthetic image as a training set, and taking the target domain image as a test set;
clustering the training set mark boxes through a K-means clustering algorithm, determining the clustering number and calculating the prior box size;
step four, building an improved YOLOv3 feature extraction backbone network;
fifthly, training the improved YOLOv3 network with the false images generated by the CycleGAN model;
step six, detecting the images of the test set by using the model obtained by training and calculating the average detection precision;
and seventhly, detecting the cross-domain target.
2. The improved YOLOv 3-based cross-domain target detection method for the automatic driven vehicle as claimed in claim 1, wherein the first step adopts an unsupervised generation countermeasure network cycleGAN algorithm to complete the adaptation between the source domain image and the target domain image; the method specifically comprises the following steps:
11) the cyclic countermeasure generation network CycleGAN;
the source domain images input into the countermeasure generation network CycleGAN model pass through a generator to obtain synthetic graphs with the target domain style, and are trained continuously;
the discriminator distinguishes the synthetic graphs produced by the generator from the real images in the original data domain;
12) calculating the countermeasure generation network CycleGAN loss function;
the countermeasure generation network CycleGAN loss function comprises a generator loss function and a discriminator loss function;
the generator loss function comprises the cycle consistency losses, the GAN losses and the intrinsic losses of the two generators;
the cycle consistency loss of the two generators is calculated by formula (1), specifically as follows:
L_cyc(G_S-T, G_T-S, S_real, T_real) = L_cyc1 + L_cyc2 = λ1·E_s~S_real[‖G_T-S(G_S-T(s)) − s‖_1] + λ1·E_t~T_real[‖G_S-T(G_T-S(t)) − t‖_1]    (1)
in the formula, G_S-T and G_T-S represent the two generators; L_cyc represents the cycle consistency loss value; S_real represents the set of real pictures in the source domain; T_real represents the set of real pictures in the target domain; L_cyc1 represents the cycle consistency loss value in the first direction; L_cyc2 represents the cycle consistency loss value in the second direction; λ1 is a constant coefficient for balancing the loss value; E_s~S_real denotes the expectation over images s in the source data domain S_real; E_t~T_real denotes the expectation over images t in the target data domain T_real; s represents a picture in S_real; t represents a picture in T_real;
the GAN loss is calculated by equation (2) as follows:
L_GAN(G_S-T, G_T-S, D_S, D_T, S_real, T_real) = L_GAN1 + L_GAN2 = E_s~S_real[(D_T(G_S-T(s)) − 1)^2] + E_t~T_real[(D_S(G_T-S(t)) − 1)^2]    (2)
in the formula, L_GAN represents the total GAN loss value; G_S-T and G_T-S represent the two generators; D_S and D_T are the two discriminators; S_real represents the set of real pictures in the source domain; T_real represents the set of real pictures in the target domain; L_GAN1 represents the GAN loss value of the first generator; L_GAN2 represents the GAN loss value of the second generator; E_s~S_real denotes the expectation over images s in the source data domain S_real; s represents a picture in S_real; E_t~T_real denotes the expectation over images t in the target data domain T_real; t represents a picture in T_real;
the intrinsic loss is calculated by formula (3), specifically as follows:
L_idt(G_S-T, G_T-S, S_real, T_real) = L_idt1 + L_idt2 = λ·λ1·E_t~T_real[‖G_S-T(t) − t‖_1] + λ·λ2·E_s~S_real[‖G_T-S(s) − s‖_1]    (3)
in the formula, L_idt represents the total intrinsic loss; G_S-T and G_T-S represent the two generators; S_real represents the set of real pictures in the source domain; T_real represents the set of real pictures in the target domain; L_idt1 represents the intrinsic loss value of the first generator; L_idt2 represents the intrinsic loss value of the second generator; λ, λ1 and λ2 are constant coefficients for balancing the loss values; E_s~S_real denotes the expectation over images s in the source data domain S_real; s represents a picture in S_real; E_t~T_real denotes the expectation over images t in the target data domain T_real; t represents a picture in T_real;
the generator loss function is calculated by equation (4) as follows:
L_G(G_S-T, G_T-S, D_S, D_T, S_real, T_real) = L_cyc + L_GAN + L_idt    (4)
in the formula, L_G represents the total loss function value of the two generators; L_cyc represents the total cycle consistency loss value; L_GAN represents the total GAN loss value; L_idt represents the total intrinsic loss value; G_S-T and G_T-S represent the two generators; S_real represents the set of real pictures in the source domain; T_real represents the set of real pictures in the target domain; D_S and D_T are the two discriminators;
the discriminator loss function is calculated by formula (5), specifically as follows:
L_D = L_D_T + L_D_S = E_t~T_real[(D_T(t) − 1)^2] + E_t'~T_fake[(D_T(t'))^2] + E_s~S_real[(D_S(s) − 1)^2] + E_s'~S_fake[(D_S(s'))^2]    (5)
in the formula, L_D represents the total loss value of the two discriminators; L_D_T represents the loss value of the discriminator D_T; L_D_S represents the loss value of the discriminator D_S; E_t'~T_fake denotes the expectation over false images t' in the false-picture set T_fake with target-domain characteristics; E_s'~S_fake denotes the expectation over false images s' in the false-picture set S_fake with source-data-domain characteristics; t' represents a picture in T_fake; s' represents a picture in S_fake; D_S and D_T are the two discriminators; E_s~S_real denotes the expectation over images s in the source data domain S_real; s represents a picture in S_real; E_t~T_real denotes the expectation over images t in the target data domain T_real; t represents a picture in T_real;
in summary, the countermeasure generation network CycleGAN loss function is calculated by equation (6) as follows:
L_total(G_S-T, G_T-S, D_S, D_T, S_real, T_real) = L_G + L_D = L_cyc + L_GAN + L_idt + L_D_T + L_D_S    (6)
in the formula, L_total represents the total loss function value of the countermeasure generation network CycleGAN; L_D represents the total loss value of the two discriminators; L_G represents the total loss function value of the two generators; L_cyc represents the total cycle consistency loss value; L_GAN represents the total GAN loss value; L_idt represents the total intrinsic loss value; G_S-T and G_T-S represent the two generators; S_real represents the set of real pictures in the source domain; T_real represents the set of real pictures in the target domain; D_S and D_T are the two discriminators; L_D_T represents the loss value of the discriminator D_T; L_D_S represents the loss value of the discriminator D_S.
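As an illustration of how the loss terms of claim 2 fit together, the following PyTorch-style sketch assembles the generator and discriminator losses. The least-squares adversarial form and the weighting constants mirror the reconstruction above and are assumptions, not the exact formulas of the original filing.

```python
import torch
import torch.nn.functional as F

def generator_loss(G_st, G_ts, D_s, D_t, s_real, t_real, lam=5.0, lam_idt=0.5):
    """L_G = L_cyc + L_GAN + L_idt, following claim 2 (least-squares adversarial terms assumed)."""
    t_fake, s_fake = G_st(s_real), G_ts(t_real)
    # cycle consistency: s -> t_fake -> s and t -> s_fake -> t should reproduce the inputs
    l_cyc = lam * (F.l1_loss(G_ts(t_fake), s_real) + F.l1_loss(G_st(s_fake), t_real))
    # adversarial terms: each generator pushes its discriminator's output on fakes towards 1
    p_t, p_s = D_t(t_fake), D_s(s_fake)
    l_gan = F.mse_loss(p_t, torch.ones_like(p_t)) + F.mse_loss(p_s, torch.ones_like(p_s))
    # intrinsic (identity) terms: a target-domain image fed to G_st should come back unchanged
    l_idt = lam_idt * lam * (F.l1_loss(G_st(t_real), t_real) + F.l1_loss(G_ts(s_real), s_real))
    return l_cyc + l_gan + l_idt

def discriminator_loss(D, real, fake):
    """Per-discriminator L_D term: real images scored towards 1, synthesized images towards 0."""
    p_real, p_fake = D(real), D(fake.detach())
    return F.mse_loss(p_real, torch.ones_like(p_real)) + F.mse_loss(p_fake, torch.zeros_like(p_fake))
```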
3. The method for detecting the cross-domain target of the autonomous vehicle based on the improved YOLOv3 as claimed in claim 1, wherein the specific method in the second step is as follows:
and (4) obtaining a synthetic graph as training set data by using the confrontation generation network cycleGAN model in the step one, and making a picture folder, a data folder and an image marking folder.
4. The improved YOLOv 3-based cross-domain target detection method for the automatic driving vehicle as claimed in claim 3, wherein the picture folder contains all the jpg images in the training set, the verification set and the test set; the data folder comprises three txt files, which respectively record all picture names of the training set, all picture names of the verification set and all picture names of the test set; the image annotation folder comprises, for every annotation box in the training set, verification set and test set, the upper-left corner coordinates (xmin, ymin), the lower-right corner coordinates (xmax, ymax) and the category of the object in the box (where 0, 1, 2, ... respectively represent the different object categories), i.e. one xml file for each picture.
5. The method for detecting the cross-domain target of the autonomous vehicle based on the improved YOLOv3 as claimed in claim 4, wherein the specific method of the third step is as follows:
the txt file comprises all target object marking frames of the training set, and K-means clustering is carried out on all marking frames according to the width omega and the height h of the frames and according to the formula (9) -14, so that the statistical condition of the sizes of the marking frames of the training set is obtained;
d=1-imp_IOU(box,centroid) (9)
imp_IOU(box, centroid) = [IOU(box, centroid) + gIOU(box, centroid) + cIOU(box, centroid)] / 3    (10)
IOU(box, centroid) = (box ∩ centroid) / (box ∪ centroid)    (11)
gIOU(box, centroid) = IOU(box, centroid) − |C − box ∪ centroid| / |C|    (12)
cIOU(box, centroid) = IOU(box, centroid) − ρ²(b_box, b_centroid) / c² − αv    (13)
v = (4/π²)·(arctan(ω_centroid / h_centroid) − arctan(ω_box / h_box))²,  α = v / (1 − IOU(box, centroid) + v)    (14)
wherein d represents the distance value; centroid represents the determined cluster center box; box represents any other box; box ∩ centroid represents the area of the intersection between the two boxes, and box ∪ centroid represents the area of their union; imp_IOU represents the modified IOU value; ρ represents the distance between the center points of the box and the cluster center box; c represents the diagonal length between the upper-left and lower-right points of the enclosing box C found above; b_box denotes any other box and b_centroid the determined cluster center box; α represents a coefficient; v is the aspect-ratio term calculated in the cIOU and represents the difference between the aspect ratios of the two boxes; ω_centroid and h_centroid represent the width and height of the cluster center box; ω_box and h_box represent the width and height of any box.
6. The method for detecting the cross-domain target of the autonomous vehicle based on the improved YOLOv3 of claim 1, wherein the concrete method of the fourth step is as follows:
the characteristic extraction network adopted in the YOLOv3 algorithm is Darknet-53, namely 53 network layers are provided and comprise a characteristic pyramid structure; the network layer consists of:
41) the feature extraction front-end network;
the whole feature extraction part comprises 5 residual groups with different numbers of convolution kernels, each residual group consisting of a different number of residual blocks; the numbers of residual blocks in the 5 residual groups are 1, 2, 8, 8 and 4 respectively; a convolution layer with stride 2 is placed between every two residual groups to halve the size, so that the feature map input to the next residual group is half the previous size; the network contains 5 such stride-2 convolution layers, so the finally output feature map size is 1/32 of the training set image size, i.e. n/32 × n/32;
42) improving the characteristic pyramid;
421. fusing the information of the characteristic pyramid structure lower layer and the deep layer characteristic diagram;
the feature map output after feature extraction by the 5 residual groups has size n/32 × n/32 and is used as the input of the feature pyramid network; this structure outputs three feature maps of different sizes, which from small to large are: the feature map obtained by fusing the feature maps output by residual groups 3, 4 and 5, with the minimum size n/32 × n/32; the feature map obtained by fusing the feature maps output by residual groups 2, 3 and 4, with size n/16 × n/16; and the feature map obtained by fusing the feature maps output by residual groups 1, 2 and 3, with the maximum size n/8 × n/8; the number of output channels N is calculated by formula (15), specifically as follows:
N=num×(score+location+label) (15)
where num represents the number of prior boxes drawn in each cell; score represents the confidence probability value of each prediction box, one score per box with a value between 0 and 1; location represents the position coordinates of each prediction box and comprises 4 coordinate values (tx, ty, tω, th), the predicted coordinate offsets of each prediction box, from which the center-point coordinates and the width and height of each prediction box are calculated; label represents the probability value output by each prediction box for each target category, the number of these values being the number of target categories to be detected;
422. adding an attention module;
the attention module includes two sub-modules of channel attention and spatial attention that focus on different features in two dimensions, channel and space, respectively.
wherein the channel attention module: for an input n × n × m feature map, features on each channel are first extracted by a global average pooling layer, which outputs a 1 × m vector; then the correlated features between channels are extracted by two successive fully connected layers, whose output dimensions are m/4 and m respectively; a relu activation function follows the first fully connected layer, and a sigmoid activation function follows the second fully connected layer to fix the output values within the range 0-1; finally, the feature map input to the channel attention module is multiplied by the feature map output by the sigmoid layer, and the final output feature map also has size n × n × m;
the spatial attention module: the input n × n × m feature map first passes through two parallel convolution branches with kernel sizes 1 × 9 and 9 × 1 respectively; the pixel values at corresponding positions of the feature maps output by the two branches are added, and a sigmoid function fixes the output values within the range 0-1, giving an attention map of size n × n × 1; finally, the feature map input to the module is multiplied by this 1-dimensional attention map to obtain the final output feature map of size n × n × m.
7. The method for detecting the cross-domain target of the autonomous vehicle based on the improved YOLOv3 as claimed in claim 6, wherein the concrete method of the fifth step is as follows:
after a characteristic extraction backbone network is built, for an input training set picture, loading data, labels, category numbers and prior frame size information into the network, and loading a weight file obtained by training on a coco2014 data set as a pre-training weight of a YOLOv3 model, namely an initial weight parameter of each network layer; forward propagation during training calculates the YOLOv3 loss function: location loss, confidence loss, category loss.
8. The method for detecting the cross-domain target of the autonomous vehicle based on the improved YOLOv3 as claimed in claim 6, wherein the concrete method of the sixth step is as follows:
cutting a test set picture and inputting the cut test set picture into a trained model, and screening redundant frames for all prediction frames output by a backbone network in the test process by adopting a Soft-NMS algorithm, wherein the specific method comprises the following steps:
61) for all the prediction boxes output by the network, B = {b1, b2, ..., bn}, where b1 represents the 1st prediction box, b2 represents the 2nd prediction box, and bn represents the nth prediction box; the corresponding confidence scores are S = {s1, s2, ..., sn}, where s1 represents the confidence score predicted for the 1st prediction box, s2 the confidence score predicted for the 2nd prediction box, and sn the confidence score predicted for the nth prediction box; an intersection-over-union threshold t, a score threshold σ and a confidence threshold α are set;
62) the box with the highest score within the same category among the prediction boxes is found and denoted bm, with score sm; for every box bi other than the highest-scoring box bm, with score si, the IOU value IOU(bi, bm) between bi and bm is calculated; if IOU(bi, bm) ≥ t, then
si = si · e^(−IOU(bi, bm)² / σ)
otherwise, the value of si is unchanged; where si represents the confidence score of the prediction box bi; bi represents any prediction box; bm represents the prediction box with the highest score; σ represents the score threshold;
63) step 62) is repeated for the prediction boxes of the next category until the prediction boxes of all categories of targets have been traversed, at which point the new confidence scores corresponding to all prediction boxes are S' = {s1', s2', ..., sn'};
64) the new confidence scores are screened according to the set confidence threshold α: for the score sj' of any prediction box, if sj' < α, the box bj is suppressed and not output.
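A minimal per-class numpy sketch of the Soft-NMS screening in steps 61)-64). The Gaussian decay e^(−IOU²/σ) applied when the IOU exceeds t matches the reconstruction of the formula above and is an assumption, as are the default threshold values.

```python
import numpy as np

def iou(box, boxes):
    """IOU between one box and an array of boxes, all given as (xmin, ymin, xmax, ymax)."""
    x1 = np.maximum(box[0], boxes[:, 0]); y1 = np.maximum(box[1], boxes[:, 1])
    x2 = np.minimum(box[2], boxes[:, 2]); y2 = np.minimum(box[3], boxes[:, 3])
    inter = np.clip(x2 - x1, 0, None) * np.clip(y2 - y1, 0, None)
    area = lambda b: (b[..., 2] - b[..., 0]) * (b[..., 3] - b[..., 1])
    return inter / (area(np.asarray(box)) + area(boxes) - inter)

def soft_nms(boxes, scores, t=0.5, sigma=0.5, alpha=0.001):
    """Decay the scores of boxes overlapping the current best box, then drop boxes below alpha."""
    boxes, scores = boxes.astype(float).copy(), scores.astype(float).copy()
    keep_boxes, keep_scores = [], []
    idx = np.arange(len(boxes))
    while len(idx) > 0:
        m = idx[np.argmax(scores[idx])]                 # highest-scoring remaining box b_m
        keep_boxes.append(boxes[m]); keep_scores.append(scores[m])
        idx = idx[idx != m]
        if len(idx) == 0:
            break
        overlaps = iou(boxes[m], boxes[idx])
        decay = np.where(overlaps >= t, np.exp(-overlaps ** 2 / sigma), 1.0)
        scores[idx] *= decay                            # s_i shrinks instead of being removed outright
        idx = idx[scores[idx] >= alpha]                 # suppress boxes whose new score falls below alpha
    return np.array(keep_boxes), np.array(keep_scores)
```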
9. The method for detecting the cross-domain target of the autonomous vehicle based on the improved YOLOv3 as claimed in claim 6, wherein the concrete method of the seventh step is as follows:
and for two data fields with different distributions, the data field which is easy to collect is taken as a source field and marked, the source field data is generated into false pictures with a target field style through a CycleGAN model, the improved YOLOv3 model is trained by the false pictures, and the trained model is used for detecting the target object in the target field picture.
CN202110068030.6A 2021-01-19 2021-01-19 Improved YOLOv 3-based cross-domain target detection method for automatic driving automobile Active CN112800906B (en)


Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202110068030.6A CN112800906B (en) 2021-01-19 2021-01-19 Improved YOLOv 3-based cross-domain target detection method for automatic driving automobile

Publications (2)

Publication Number Publication Date
CN112800906A true CN112800906A (en) 2021-05-14
CN112800906B CN112800906B (en) 2022-08-30

Family

ID=75810387

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202110068030.6A Active CN112800906B (en) 2021-01-19 2021-01-19 Improved YOLOv 3-based cross-domain target detection method for automatic driving automobile

Country Status (1)

Country Link
CN (1) CN112800906B (en)



Patent Citations (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN109978035A (en) * 2019-03-18 2019-07-05 西安电子科技大学 Pedestrian detection method based on improved k-means and loss function
CN111091151A (en) * 2019-12-17 2020-05-01 大连理工大学 Method for generating countermeasure network for target detection data enhancement
CN111012301A (en) * 2019-12-19 2020-04-17 北京理工大学 Head-mounted visual accurate aiming system
CN111680556A (en) * 2020-04-29 2020-09-18 平安国际智慧城市科技股份有限公司 Method, device and equipment for identifying vehicle type at traffic gate and storage medium

Cited By (10)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN113642418A (en) * 2021-07-23 2021-11-12 南京富岛软件有限公司 Improved intelligent identification method for safety protection in 5T operation and maintenance
CN113837087A (en) * 2021-09-24 2021-12-24 上海交通大学宁波人工智能研究院 Animal target detection system and method based on YOLOv3
CN113837087B (en) * 2021-09-24 2023-08-29 上海交通大学宁波人工智能研究院 Animal target detection system and method based on YOLOv3
CN113822248A (en) * 2021-11-23 2021-12-21 江苏金晓电子信息股份有限公司 Cross-domain vehicle detection method for generating countermeasure network based on cycleGAN
CN114863426A (en) * 2022-05-05 2022-08-05 北京科技大学 Micro target detection method for coupling target feature attention and pyramid
CN114863426B (en) * 2022-05-05 2022-12-13 北京科技大学 Micro target detection method for coupling target feature attention and pyramid
CN116246128A (en) * 2023-02-28 2023-06-09 深圳市锐明像素科技有限公司 Training method and device of detection model crossing data sets and electronic equipment
CN116246128B (en) * 2023-02-28 2023-10-27 深圳市锐明像素科技有限公司 Training method and device of detection model crossing data sets and electronic equipment
CN116883681A (en) * 2023-08-09 2023-10-13 北京航空航天大学 Domain generalization target detection method based on countermeasure generation network
CN116883681B (en) * 2023-08-09 2024-01-30 北京航空航天大学 Domain generalization target detection method based on countermeasure generation network

Also Published As

Publication number Publication date
CN112800906B (en) 2022-08-30

Similar Documents

Publication Publication Date Title
CN112800906B (en) Improved YOLOv 3-based cross-domain target detection method for automatic driving automobile
Garcia-Garcia et al. A survey on deep learning techniques for image and video semantic segmentation
CN110111335B (en) Urban traffic scene semantic segmentation method and system for adaptive countermeasure learning
WO2022083784A1 (en) Road detection method based on internet of vehicles
CN110163187B (en) F-RCNN-based remote traffic sign detection and identification method
CN110796168A (en) Improved YOLOv 3-based vehicle detection method
Cao et al. A low-cost pedestrian-detection system with a single optical camera
CN109543695A (en) General density people counting method based on multiple dimensioned deep learning
CN108537824B (en) Feature map enhanced network structure optimization method based on alternating deconvolution and convolution
CN107609602A (en) A kind of Driving Scene sorting technique based on convolutional neural networks
CN107392131A (en) A kind of action identification method based on skeleton nodal distance
Turay et al. Toward performing image classification and object detection with convolutional neural networks in autonomous driving systems: A survey
CN113255589B (en) Target detection method and system based on multi-convolution fusion network
CN110263786A (en) A kind of road multi-targets recognition system and method based on characteristic dimension fusion
CN112434723B (en) Day/night image classification and object detection method based on attention network
Lian et al. A dense Pointnet++ architecture for 3D point cloud semantic segmentation
CN111008979A (en) Robust night image semantic segmentation method
CN115661777A (en) Semantic-combined foggy road target detection algorithm
CN115690549A (en) Target detection method for realizing multi-dimensional feature fusion based on parallel interaction architecture model
CN114973199A (en) Rail transit train obstacle detection method based on convolutional neural network
Zimmer et al. Real-time and robust 3d object detection within road-side lidars using domain adaptation
CN104008374B (en) Miner&#39;s detection method based on condition random field in a kind of mine image
CN113902753A (en) Image semantic segmentation method and system based on dual-channel and self-attention mechanism
CN112668662A (en) Outdoor mountain forest environment target detection method based on improved YOLOv3 network
CN110852255A (en) Traffic target detection method based on U-shaped characteristic pyramid

Legal Events

Date Code Title Description
PB01 Publication
PB01 Publication
SE01 Entry into force of request for substantive examination
SE01 Entry into force of request for substantive examination
GR01 Patent grant
GR01 Patent grant