CN112800906A - Improved YOLOv3-based cross-domain target detection method for automatic driving automobile

Info

Publication number: CN112800906A
Authority: CN (China)
Prior art keywords: real, value, picture, domain, image
Legal status: Granted; Active
Application number: CN202110068030.6A
Other languages: Chinese (zh)
Other versions: CN112800906B (en)
Inventors: 范佳琦, 霍天娇, 李鑫, 魏珍琦, 王嘉琛, 高炳钊
Current Assignee: Jilin University
Original Assignee: Jilin University
Application filed by Jilin University
Priority to CN202110068030.6A
Publication of CN112800906A
Application granted
Publication of CN112800906B

Classifications

    • G06V 20/56 (Image or video recognition or understanding) Context or environment of the image exterior to a vehicle by using sensors mounted on the vehicle
    • G06F 18/214 (Pattern recognition) Generating training patterns; Bootstrap methods, e.g. bagging or boosting
    • G06F 18/23213 (Pattern recognition) Non-hierarchical clustering techniques using statistics or function optimisation, with fixed number of clusters, e.g. K-means clustering
    • G06N 3/045 (Neural networks) Combinations of networks
    • G06N 3/088 (Neural networks) Learning methods; Non-supervised learning, e.g. competitive learning
    • Y02T 10/40 (Climate change mitigation technologies related to transportation) Engine management systems

Abstract

The invention belongs to the technical field of computer vision and environment perception for automatic driving automobiles, and particularly relates to a cross-domain target detection method for automatic driving automobiles based on improved YOLOv3. The method is built on an improved single-stage YOLOv3 detection algorithm framework and uses a generative adversarial network model to obtain training set data, performing cross-domain target detection for the problem that the training set and the test set come from differently distributed data domains. At the same time, the improved YOLOv3 algorithm raises the accuracy of single-stage target detection, while the generative adversarial network reduces the re-labeling of multi-class targets across different data domains, alleviating to a certain extent the difficulty of cross-domain target detection for automatic driving automobiles.

Description

Improved YOLOv3-based cross-domain target detection method for automatic driving automobile
Technical Field
The invention belongs to the technical field of computer vision and environment perception for automatic driving automobiles, and particularly relates to a cross-domain target detection method for automatic driving automobiles based on improved YOLOv3.
Background
With the research on automatic driving automobiles, accurately detecting the various target objects in the video stream input by the camera is of great significance for subsequent planning and decision-making work. Driving scenes are complex and changeable and the target categories are numerous, and traditional hand-crafted feature methods lack the detection accuracy and robustness required for automatic driving, so target detection algorithms based on deep learning have high research value. In recent years, with the development of computer vision technology, more and more detection frameworks with better performance have appeared; they differ in detection accuracy and speed and suit different application scenarios.
Although data-driven deep learning detection algorithms have made great progress on many detection tasks, practical production applications still face many difficulties. First, deep learning detection algorithms depend heavily on training data: when the amount of data is too small, the features of the data cannot be fully learned and the model easily overfits. Because actual road scenes are complex and changeable, obtaining high detection accuracy at different times, in different places and in different weather requires collecting and labeling data in all of these scenes, and data collection in some unusual scenes is difficult, so covering all road scenes is hard to accomplish. Second, applying a deep learning model trained in one scene to detection in a different scene while keeping high detection accuracy places high demands on the robustness of the model. Finally, an automatic driving automobile requires not only high detection accuracy but also real-time detection speed: for the algorithm to be usable in an actual driving task, the number of frames it processes per second must match the driving speed of the vehicle.
In view of the above problems, the deep learning model mainly has two problems to solve: 1. how to improve the generalization ability of the model, so that a model trained on one data domain can be applied to target detection on another, differently distributed data domain, i.e. cross-domain target detection; 2. how to improve the detection accuracy of a detection algorithm that meets the real-time requirement on multiple different classes of target objects, and its average detection accuracy in a variety of complex environments.
Disclosure of Invention
The invention provides a cross-domain target detection method for automatic driving automobiles based on improved YOLOv3. Built on an improved single-stage YOLOv3 detection algorithm framework, the method obtains training set data with a generative adversarial network, performs cross-domain target detection for the problem that the training set and the test set come from differently distributed data domains, improves target detection accuracy through the improved YOLOv3 algorithm, reduces data re-labeling between different data domains through the generative adversarial network, and addresses the cross-domain target detection problem of automatic driving automobiles.
The technical scheme of the invention is described below with reference to the accompanying drawings:
A cross-domain target detection method of an automatic driving automobile based on improved YOLOv3 comprises the following steps:
Step one, inputting source domain images and target domain images into a generative adversarial network CycleGAN model for training to obtain composite images;
Step two, taking the composite images as the training set and the target domain images as the test set;
Step three, clustering the training set label boxes with the K-means clustering algorithm, determining the number of clusters and calculating the prior box sizes;
Step four, building the improved YOLOv3 network as the feature extraction backbone network;
Step five, training the improved YOLOv3 network on the fake images generated by the CycleGAN model;
Step six, detecting the test set images with the trained model and calculating the average detection precision;
Step seven, performing cross-domain target detection.
In step one, an unsupervised generative adversarial network CycleGAN algorithm is adopted to complete the adaptation between the source domain images and the target domain images; the method specifically comprises the following steps:
11) the cyclic adversarial generation network CycleGAN;
The source domain images input into the CycleGAN model pass through a generator to obtain composite images with the target domain style, and the generator is trained continuously;
A discriminator distinguishes the composite images synthesized by the generator from the real images of the original data domain;
12) calculating the CycleGAN loss function;
The CycleGAN loss function comprises a generator loss function and a discriminator loss function;
The generator loss function comprises the cycle consistency loss of the two generators, the GAN loss and the identity loss;
The cycle consistency loss of the two generators is calculated by formula (1), specifically as follows:
L_cyc(G_S-T, G_T-S, S_real, T_real) = L_cyc1 + L_cyc2 = λ1·E_s~S_real[ ||G_T-S(G_S-T(s)) - s||_1 ] + λ2·E_t~T_real[ ||G_S-T(G_T-S(t)) - t||_1 ]   (1)
In the formula, G_S-T and G_T-S represent the two generators; L_cyc represents the total cycle consistency loss value; S_real represents the set of real pictures in the source domain; T_real represents the set of real pictures in the target domain; L_cyc1 represents the cycle consistency loss value in the first direction; L_cyc2 represents the cycle consistency loss value in the second direction; λ1 and λ2 are constant coefficients for balancing the loss terms; E_s~S_real denotes the expectation over images s of the source data domain S_real; E_t~T_real denotes the expectation over images t of the target data domain T_real; s represents a picture in S_real; t represents a picture in T_real;
The GAN loss is calculated by formula (2), specifically as follows:
L_GAN(G_S-T, G_T-S, D_S, D_T, S_real, T_real) = L_GAN1 + L_GAN2 = E_s~S_real[ (D_T(G_S-T(s)) - 1)² ] + E_t~T_real[ (D_S(G_T-S(t)) - 1)² ]   (2)
In the formula, L_GAN represents the total GAN loss value; G_S-T and G_T-S represent the two generators; D_S and D_T are the two discriminators; S_real represents the set of real pictures in the source domain; T_real represents the set of real pictures in the target domain; L_GAN1 represents the GAN loss value of the first generator; L_GAN2 represents the GAN loss value of the second generator; E_s~S_real denotes the expectation over images s of the source data domain S_real; s represents a picture in S_real; E_t~T_real denotes the expectation over images t of the target data domain T_real; t represents a picture in T_real;
The identity loss is calculated by formula (3), specifically as follows:
L_idt(G_S-T, G_T-S, S_real, T_real) = L_idt1 + L_idt2 = λ·E_t~T_real[ ||G_S-T(t) - t||_1 ] + λ·E_s~S_real[ ||G_T-S(s) - s||_1 ]   (3)
In the formula, L_idt represents the total identity loss; G_S-T and G_T-S represent the two generators; S_real represents the set of real pictures in the source domain; T_real represents the set of real pictures in the target domain; L_idt1 represents the identity loss value of the first generator; L_idt2 represents the identity loss value of the second generator; λ, λ1 and λ2 are constant coefficients for balancing the loss terms; E_s~S_real denotes the expectation over images s of the source data domain S_real; s represents a picture in S_real; E_t~T_real denotes the expectation over images t of the target data domain T_real; t represents a picture in T_real;
The generator loss function is calculated by formula (4), specifically as follows:
L_G(G_S-T, G_T-S, D_S, D_T, S_real, T_real) = L_GAN + L_cyc + L_idt   (4)
In the formula, L_G represents the total loss function value of the two generators; L_cyc represents the total cycle consistency loss value; L_GAN represents the total GAN loss value; L_idt represents the total identity loss value; G_S-T and G_T-S represent the two generators; S_real represents the set of real pictures in the source domain; T_real represents the set of real pictures in the target domain; D_S and D_T are the two discriminators;
The discriminator loss function is calculated by formula (5), specifically as follows:
L_D = L_DT + L_DS = E_t~T_real[ (D_T(t) - 1)² ] + E_t'~T_fake[ (D_T(t'))² ] + E_s~S_real[ (D_S(s) - 1)² ] + E_s'~S_fake[ (D_S(s'))² ]   (5)
In the formula, L_D represents the total loss value of the two discriminators; L_DT represents the loss value of discriminator D_T; L_DS represents the loss value of discriminator D_S; E_t'~T_fake denotes the expectation over fake images t' of the fake image set T_fake with target-domain characteristics; E_s'~S_fake denotes the expectation over fake images s' of the fake image set S_fake with source-domain characteristics; t' represents a picture in T_fake; s' represents a picture in S_fake; D_S and D_T are the two discriminators; E_s~S_real denotes the expectation over images s of the source data domain S_real; s represents a picture in S_real; E_t~T_real denotes the expectation over images t of the target data domain T_real; t represents a picture in T_real;
In summary, the CycleGAN loss function is calculated by formula (6), specifically as follows:
L_total = L_G + L_D   (6)
In the formula, L_total represents the total loss function value of the CycleGAN; L_D represents the total loss value of the two discriminators; L_G represents the total loss function value of the two generators; L_cyc represents the total cycle consistency loss value; L_GAN represents the total GAN loss value; L_idt represents the total identity loss value; G_S-T and G_T-S represent the two generators; S_real represents the set of real pictures in the source domain; T_real represents the set of real pictures in the target domain; D_S and D_T are the two discriminators; L_DT represents the loss value of discriminator D_T; L_DS represents the loss value of discriminator D_S.
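For clarity, the loss terms above can be assembled as in the following PyTorch-style sketch. It is a minimal illustration assuming least-squares GAN scores in [0, 1] and L1 cycle/identity terms; the names G_s2t, D_t, lam1, lam2 and lam_idt are placeholders, not the exact implementation of the invention.

import torch.nn.functional as F

def cyclegan_losses(G_s2t, G_t2s, D_s, D_t, s_real, t_real, lam1=1.0, lam2=1.0, lam_idt=0.5):
    """Sketch of the CycleGAN loss terms of formulas (1)-(6); all GAN terms use squared error."""
    t_fake = G_s2t(s_real)              # composite image with target-domain style
    s_fake = G_t2s(t_real)              # composite image with source-domain style

    # (1) cycle consistency: s -> t_fake -> back to s, and t -> s_fake -> back to t
    L_cyc = lam1 * F.l1_loss(G_t2s(t_fake), s_real) + lam2 * F.l1_loss(G_s2t(s_fake), t_real)

    # (2) GAN loss for the generators: composite images should score close to 1
    L_gan = ((D_t(t_fake) - 1) ** 2).mean() + ((D_s(s_fake) - 1) ** 2).mean()

    # (3) identity loss: a generator fed an image of its own output domain should change it little
    L_idt = lam_idt * (F.l1_loss(G_s2t(t_real), t_real) + F.l1_loss(G_t2s(s_real), s_real))

    # (4) total generator loss
    L_G = L_gan + L_cyc + L_idt

    # (5) discriminator loss: real images score close to 1, composite images close to 0
    L_D = ((D_t(t_real) - 1) ** 2).mean() + (D_t(t_fake.detach()) ** 2).mean() \
        + ((D_s(s_real) - 1) ** 2).mean() + (D_s(s_fake.detach()) ** 2).mean()

    # (6) overall CycleGAN objective
    return L_G, L_D, L_G + L_D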
The specific method of the second step is as follows:
Using the CycleGAN model of step one, the composite images are obtained as training set data, and a picture folder, a data folder and an image annotation folder are made.
The picture folder contains all jpg images of the training set, validation set and test set; the data folder contains three txt files, which record respectively all picture names of the training set, all picture names of the validation set and all picture names of the test set; the image annotation folder contains, for the training set, validation set and test set, the top-left corner coordinates (xmin, ymin) and bottom-right corner coordinates (xmax, ymax) of all label boxes and the category of the object in each box (where 0, 1, 2, … represent the different object categories), i.e. one xml file for each picture.
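As an illustration of the data layout described above, the following sketch reads one hypothetical VOC-style per-picture xml file and collects the box coordinates and class indices; the folder layout, tag names and the assumption that the class is stored as a numeric id are illustrative, not the exact files of the invention.

import xml.etree.ElementTree as ET

def read_annotation(xml_path):
    """Parse one per-picture xml file into a list of (xmin, ymin, xmax, ymax, class_id)."""
    root = ET.parse(xml_path).getroot()
    boxes = []
    for obj in root.iter("object"):            # assumed VOC-style <object> entries
        cls = int(obj.find("name").text)       # assumed to store 0, 1, 2, ... for the categories
        bb = obj.find("bndbox")
        boxes.append((int(bb.find("xmin").text), int(bb.find("ymin").text),
                      int(bb.find("xmax").text), int(bb.find("ymax").text), cls))
    return boxes

# example usage (path is a placeholder):
# print(read_annotation("Annotations/000001.xml"))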
The concrete method of the third step is as follows:
The txt file contains all target object label boxes of the training set; K-means clustering is performed on all label boxes according to their width ω and height h using formulas (9)-(14), so as to obtain the size statistics of the training set label boxes;
d = 1 - imp_IOU(box, centroid)   (9)
imp_IOU = (IOU + gIOU + cIOU) / 3   (10)
IOU = |box ∩ centroid| / |box ∪ centroid|   (11)
gIOU = IOU - (|C| - |box ∪ centroid|) / |C|   (12)
cIOU = IOU - ρ²(b_box, b_centroid) / c² - α·v   (13)
v = (4 / π²)·(arctan(ω_centroid / h_centroid) - arctan(ω_box / h_box))²   (14)
wherein d represents the distance value; centroid represents the determined cluster center box; box represents any other box; box ∩ centroid represents the area of the intersection of the two boxes, and box ∪ centroid represents the area of their union; imp_IOU represents the modified IOU value; ρ represents the distance between the center points of the two cluster boxes box and centroid; c represents the diagonal length between the top-left and bottom-right points of the smallest enclosing box C; |C| represents the area of box C; b_box represents any other box; b_centroid represents the determined cluster center box; α represents a coefficient; v is the aspect-ratio term of the cIOU and reflects the difference between the aspect ratios of the two boxes; ω_centroid and h_centroid represent the width and height of the cluster center box; ω_box and h_box represent the width and height of any box.
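A minimal numpy sketch of the modified K-means distance in formulas (9)-(14), assuming boxes are represented only by width and height (both boxes share a common origin, so the ρ²/c² term vanishes as stated later in the description); the α weighting follows the common cIOU convention and is an assumption.

import numpy as np

def imp_iou(box, centroid):
    """box, centroid: (w, h). Returns the averaged IOU/gIOU/cIOU used as clustering similarity."""
    wb, hb = box
    wc, hc = centroid
    inter = min(wb, wc) * min(hb, hc)                 # boxes share the top-left corner
    union = wb * hb + wc * hc - inter
    iou = inter / union

    # gIOU: smallest enclosing box C minus the union, relative to C
    c_area = max(wb, wc) * max(hb, hc)
    giou = iou - (c_area - union) / c_area

    # cIOU: center-distance term is 0 here (only w and h are clustered); aspect-ratio term v remains
    v = (4 / np.pi ** 2) * (np.arctan(wc / hc) - np.arctan(wb / hb)) ** 2
    alpha = v / (1 - iou + v + 1e-9)                  # assumed balancing coefficient
    ciou = iou - alpha * v

    return (iou + giou + ciou) / 3.0

def kmeans_distance(box, centroid):
    """Formula (9): d = 1 - imp_IOU(box, centroid)."""
    return 1.0 - imp_iou(box, centroid)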
The concrete method of the fourth step is as follows:
The feature extraction network adopted in the YOLOv3 algorithm is Darknet-53, i.e. there are 53 network layers, including a feature pyramid structure; the network consists of:
41) the feature extraction front-end network;
The whole feature extraction part contains 5 residual groups with different numbers of convolution kernels, and each residual group consists of a different number of residual blocks; the numbers of residual blocks in the 5 residual groups are 1, 2, 8, 8 and 4 respectively. A convolution layer with stride 2 precedes the first residual group and is arranged between every two adjacent residual groups to halve the size, so that the feature map input into the next residual group is half the previous size; there are 5 such stride-2 convolution layers in total, and the finally output feature map size is 1/32 of the training set image size, i.e. n/32 × n/32;
42) the improved feature pyramid;
421. fusing the information of the lower-layer and deep-layer feature maps of the feature pyramid structure;
The feature map output after feature extraction by the 5 residual groups has size n/32 × n/32 and serves as the input of the feature pyramid network. This structure outputs three feature maps of different sizes, which are, from small to large: the feature map obtained by fusing the feature maps output by residual groups 3, 4 and 5, with the smallest size n/32 × n/32; the feature map obtained by fusing the feature maps output by residual groups 2, 3 and 4, with size n/16 × n/16; and the feature map obtained by fusing the feature maps output by residual groups 1, 2 and 3, with the largest size n/8 × n/8. The number of output channels N is calculated by formula (15), specifically as follows:
N=num×(score+location+label) (15)
where num represents the number of prior boxes drawn in each cell; score represents the confidence probability value of each prediction box (1 value): each box has one score, with a value between 0 and 1; location represents the position coordinates of each prediction box and comprises 4 coordinate values (t_x, t_y, t_ω, t_h), the predicted coordinate offsets of each prediction box, from which the center coordinates and the width and height of each prediction box are calculated; label represents the probability values output by each prediction box for each category of target object, the number of output values being the number of categories of target objects to be detected;
422. adding an attention module;
the attention module includes two sub-modules of channel attention and spatial attention that focus on different features in two dimensions, channel and space, respectively.
Channel attention module: for an input n × n × m feature map, features on each channel are first extracted by a global average pooling layer, which outputs a 1 × 1 × m vector; the correlations between channels are then extracted by two successive fully connected layers whose output channel numbers are m/4 and m respectively; a relu activation function follows the first fully connected layer and a sigmoid activation function follows the second, fixing the output values to the range 0-1; finally, the feature map input into the channel attention module is multiplied by the feature map output by the sigmoid layer, and the finally output feature map also has size n × n × m;
Spatial attention module: the input n × n × m feature map first passes through two parallel convolution branches whose kernel sizes are 1 × 9 and 9 × 1 respectively; the pixel values at corresponding positions of the feature maps output by the two branches are added, and a sigmoid function fixes the output values to the range 0-1, giving a feature map of size n × n × 1; finally, the feature map input into the module is multiplied by this single-channel feature map to obtain the final output feature map of size n × n × m.
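The two attention sub-modules can be sketched in PyTorch as follows. This is an illustrative reading of the description above (global average pooling, two fully connected layers with m/4 and m outputs, and two parallel 1 × 9 / 9 × 1 convolution branches); the exact layer configuration of the invention may differ.

import torch
import torch.nn as nn

class ChannelAttention(nn.Module):
    def __init__(self, channels):                      # channels = m
        super().__init__()
        self.pool = nn.AdaptiveAvgPool2d(1)             # global average over each channel
        self.fc = nn.Sequential(
            nn.Linear(channels, channels // 4), nn.ReLU(inplace=True),
            nn.Linear(channels // 4, channels), nn.Sigmoid())    # weights fixed to [0, 1]

    def forward(self, x):                               # x: (B, m, n, n)
        w = self.fc(self.pool(x).flatten(1)).view(x.size(0), -1, 1, 1)
        return x * w                                    # reweighted map, still (B, m, n, n)

class SpatialAttention(nn.Module):
    def __init__(self, channels):
        super().__init__()
        self.branch_a = nn.Conv2d(channels, 1, kernel_size=(1, 9), padding=(0, 4))
        self.branch_b = nn.Conv2d(channels, 1, kernel_size=(9, 1), padding=(4, 0))

    def forward(self, x):                               # x: (B, m, n, n)
        w = torch.sigmoid(self.branch_a(x) + self.branch_b(x))   # (B, 1, n, n) in [0, 1]
        return x * w                                    # broadcast over channels

# usage sketch: attn = nn.Sequential(ChannelAttention(256), SpatialAttention(256))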
The concrete method of step five is as follows:
After the feature extraction backbone network is built, for the input training set pictures, the data, labels, number of categories and prior box size information are loaded into the network, and the weight file obtained by training on the coco2014 dataset is loaded as the pre-training weights of the YOLOv3 model, i.e. the initial weight parameters of each network layer; during training, forward propagation calculates the YOLOv3 loss function: location loss, confidence loss and category loss.
The concrete method of the sixth step is as follows:
The test set pictures are cropped and input into the trained model; during testing, a Soft-NMS algorithm is used to screen out the redundant boxes among all prediction boxes output by the backbone network. The specific method is as follows:
61) for all prediction boxes B = {b1, b2, ..., bn} output by the network, b1 represents the 1st prediction box, b2 the 2nd prediction box and bn the nth prediction box; their corresponding confidence scores are S = {s1, s2, ..., sn}, where s1 is the confidence score predicted for the 1st prediction box, s2 for the 2nd prediction box and sn for the nth prediction box; an intersection-over-union threshold t, a score threshold σ and a confidence threshold α are set;
62) the box with the highest score of the same category among the prediction boxes is found and denoted bm, with score sm; for each box bi other than the highest-scoring box bm, with score si, the IOU value IOU(bi, bm) between bi and bm is calculated; if IOU(bi, bm) ≥ t, then
si = si · e^( -IOU(bi, bm)² / σ )
otherwise the value of si is unchanged; wherein si represents the confidence score of prediction box bi; bi represents any prediction box; bm represents the prediction box with the highest score; σ represents the score threshold;
63) step 62) is repeated for the prediction boxes of the next category until the prediction boxes of all categories of targets have been traversed, giving the new confidence scores S′ = {s1′, s2′, ..., sn′} corresponding to all prediction boxes;
64) the new confidence scores are screened with the set confidence threshold α: for the score sj′ of any prediction box, if sj′ < α, the box bj is suppressed and not output.
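The screening in steps 61)-64) can be sketched as below for one class of prediction boxes; the Gaussian score decay exp(-IOU²/σ) applied when the overlap exceeds t is one common Soft-NMS variant and is an assumption about the exact decay used here.

import numpy as np

def soft_nms(boxes, scores, t=0.5, sigma=0.5, alpha=0.001):
    """boxes: (n, 4) as (x1, y1, x2, y2) for one class; returns the kept boxes and scores."""
    boxes, scores = boxes.astype(float).copy(), scores.astype(float).copy()
    keep_boxes, keep_scores = [], []
    while len(scores) > 0:
        m = int(np.argmax(scores))                       # b_m: highest-scoring box
        bm = boxes[m]
        keep_boxes.append(bm); keep_scores.append(scores[m])
        boxes = np.delete(boxes, m, axis=0); scores = np.delete(scores, m)
        if len(scores) == 0:
            break
        # IOU of every remaining box with b_m
        x1 = np.maximum(bm[0], boxes[:, 0]); y1 = np.maximum(bm[1], boxes[:, 1])
        x2 = np.minimum(bm[2], boxes[:, 2]); y2 = np.minimum(bm[3], boxes[:, 3])
        inter = np.clip(x2 - x1, 0, None) * np.clip(y2 - y1, 0, None)
        area_m = (bm[2] - bm[0]) * (bm[3] - bm[1])
        area_i = (boxes[:, 2] - boxes[:, 0]) * (boxes[:, 3] - boxes[:, 1])
        iou = inter / (area_m + area_i - inter)
        # decay the scores of heavily overlapping boxes instead of discarding them outright
        decay = np.where(iou >= t, np.exp(-(iou ** 2) / sigma), 1.0)
        scores = scores * decay
        keep = scores >= alpha                           # confidence threshold alpha
        boxes, scores = boxes[keep], scores[keep]
    return np.array(keep_boxes), np.array(keep_scores)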
The concrete method of the seventh step is as follows:
For two differently distributed data domains, the easily collected data domain is taken as the source domain and labeled; the source domain data are converted by the CycleGAN model into fake pictures with the target domain style, the improved YOLOv3 model is trained with these fake pictures, and the trained model is used to detect the target objects in the target domain pictures.
The invention has the beneficial effects that:
1) the improved YOLOv3 network structure of the invention extracts image features more fully, and the trained model has higher detection accuracy on the target objects in the test set;
2) the invention provides a YOLOv3 model with improved accuracy that can be applied to cross-domain target detection for automatic driving automobiles, with the training set data generated by the unsupervised CycleGAN algorithm. The method effectively avoids the heavy work of labeling the dataset target objects in different scenes, transfers a model trained in one scene to target detection in another scene, and improves the robustness and transferability of the model.
Drawings
FIG. 1 is a flow chart of the algorithm of the present invention;
FIG. 2 is a schematic diagram of the cyclic adversarial generation network (CycleGAN) of the present invention;
FIG. 3 is a diagram of the generator network structure in the CycleGAN model of the present invention;
FIG. 4 is a diagram of the discriminator network structure in the CycleGAN model of the present invention;
FIG. 5 is a result graph of the K-means algorithm for determining the cluster number;
FIG. 6 is a diagram of a channel attention module network architecture;
FIG. 7 is a diagram of a spatial attention module network architecture;
FIG. 8 is a diagram of residual block in a feature extraction network;
FIG. 9 is a modified YOLOv3 network architecture diagram;
FIG. 10 is a diagram of the effect of cross-domain detection of different weather;
FIG. 11 is a diagram of the effect of cross-domain detection at different locations;
FIG. 12 is a cross-domain detection effect graph at different time periods;
FIG. 13 is a diagram of the effect of cross-domain detection of different weather at different locations;
fig. 14 is a diagram of the effect of cross-domain detection at different locations and different time periods.
Detailed Description
The invention is described in further detail below with reference to the accompanying drawings:
a cross-domain target detection method of an automatic driving automobile based on improved YOLOv3 comprises the following steps:
Step one, inputting source domain images and target domain images into the generative adversarial network CycleGAN model for training to obtain composite images;
The CycleGAN model, proposed in 2017, is a cyclic generative adversarial network. Its main effect is: for a source data domain and a target data domain, style transfer of the training data can be achieved without establishing a one-to-one mapping between them, i.e. the source data domain can be converted into images with the style of the target domain, and the target data domain can likewise be converted into images with the style of the source domain. To realize this function, the model consists of four parts: two generators, two discriminators, the source data domain images and the target data domain images. In such an unsupervised learning model, the generator and the discriminator are trained alternately in a competing manner: the generator tries to generate fake pictures that are as realistic as possible in order to fool the discriminator, while the discriminator tries to distinguish the fake pictures from the really captured pictures. Compared with traditional unsupervised adversarial networks, the CycleGAN model introduces a cycle consistency loss, which guarantees a one-to-one mapping for the image style conversion between the two data domains and has important application value for the unpaired image style conversion problem.
For the cross-domain detection problem in which the training set and the test set come from two different data domains, the training set pictures are usually easy to obtain, but the model trained on the training set data has not learned the image characteristics of the test set, so its detection accuracy on the test set pictures is very low. Therefore, the unsupervised generative adversarial network CycleGAN algorithm is adopted to complete the adaptation between the source domain images and the target domain images. The adversarial network CycleGAN comprises: two generators, two discriminators, the source data domain images and the target data domain images.
The function of the generator is: to obtain, from an input image of the source data domain, a composite image with the target domain style, and to train the generator continuously so that the composite image is as realistic as possible, in order to fool the discriminator so that it cannot tell whether the image really exists or was synthesized.
The function of the discriminator is: to distinguish the composite images generated by the generator from the real images; the purpose of training the discriminator is to output a score as high as possible (close to 1) for an original image and as low as possible (close to 0) for a composite image, so that the discriminator is not fooled by the composite images generated by the generator and can better discriminate between real images and composite images.
The method specifically comprises the following steps:
11) the cyclic adversarial generation network CycleGAN;
The source domain images input into the CycleGAN model pass through a generator to obtain composite images with the target domain style, and the generator is trained continuously;
A discriminator distinguishes the composite images synthesized by the generator from the real images of the original data domain;
12) calculating the CycleGAN loss function;
The CycleGAN loss function comprises a generator loss function and a discriminator loss function;
The generator loss function comprises the cycle consistency loss of the two generators, the GAN loss and the identity loss;
The cycle consistency loss of the two generators is calculated by formula (1), specifically as follows:
L_cyc(G_S-T, G_T-S, S_real, T_real) = L_cyc1 + L_cyc2 = λ1·E_s~S_real[ ||G_T-S(G_S-T(s)) - s||_1 ] + λ2·E_t~T_real[ ||G_S-T(G_T-S(t)) - t||_1 ]   (1)
In the formula, G_S-T and G_T-S represent the two generators; L_cyc represents the total cycle consistency loss value; S_real represents the set of real pictures in the source domain; T_real represents the set of real pictures in the target domain; L_cyc1 represents the cycle consistency loss value in the first direction; L_cyc2 represents the cycle consistency loss value in the second direction; λ1 and λ2 are constant coefficients for balancing the loss terms; E_s~S_real denotes the expectation over images s of the source data domain S_real; E_t~T_real denotes the expectation over images t of the target data domain T_real; s represents a picture in S_real; t represents a picture in T_real;
An image of the source domain is passed through the generator G_S-T, and the resulting composite image is input into the discriminator, which outputs a score between 0 and 1; the generator wishes the fake image it generates to fool the discriminator, i.e. the score should be as close to 1 as possible, so the GAN loss is calculated between that score and 1.
The GAN loss is calculated by formula (2), specifically as follows:
L_GAN(G_S-T, G_T-S, D_S, D_T, S_real, T_real) = L_GAN1 + L_GAN2 = E_s~S_real[ (D_T(G_S-T(s)) - 1)² ] + E_t~T_real[ (D_S(G_T-S(t)) - 1)² ]   (2)
In the formula, L_GAN represents the total GAN loss value; G_S-T and G_T-S represent the two generators; D_S and D_T are the two discriminators; S_real represents the set of real pictures in the source domain; T_real represents the set of real pictures in the target domain; L_GAN1 represents the GAN loss value of the first generator; L_GAN2 represents the GAN loss value of the second generator; E_s~S_real denotes the expectation over images s of the source data domain S_real; s represents a picture in S_real; E_t~T_real denotes the expectation over images t of the target data domain T_real; t represents a picture in T_real.
In order to prevent overfitting during the training of the generator, an image of the source data domain is used as the input of the generator, and the corresponding output should differ little from the input, which verifies whether the generator overfits.
The intrinsic loss is calculated by formula (3), specifically as follows:
Figure BDA0002904914560000122
in the formula, LidtRepresents the total loss per se; gS-TAnd GT-STwo generators are represented; s _ real represents a real picture set in a source domain; t _ real represents a real picture set in the target domain; l isidt1A loss value indicating the first generator itself; l isidt2To representThe intrinsic loss value of the second generator; lambda, lambda1、λ2A constant coefficient for balancing the loss value; es~S_realRepresents an image S in the source data field S _ real; s represents a picture in S _ real; et~T_realRepresents the image T in the target data field T _ real; t represents a picture in T _ real;
The generator loss function is calculated by formula (4), specifically as follows:
L_G(G_S-T, G_T-S, D_S, D_T, S_real, T_real) = L_GAN + L_cyc + L_idt   (4)
In the formula, L_G represents the total loss function value of the two generators; L_cyc represents the total cycle consistency loss value; L_GAN represents the total GAN loss value; L_idt represents the total identity loss value; G_S-T and G_T-S represent the two generators; S_real represents the set of real pictures in the source domain; T_real represents the set of real pictures in the target domain; D_S and D_T are the two discriminators.
Discriminator loss function: the composite images generated from the input source data domain and the real images of the target data domain are distinguished by the discriminator, whose output score should be as close to 0 as possible for a fake image and as close to 1 as possible for a real image.
The discriminator loss function is calculated by formula (5), specifically as follows:
L_D = L_DT + L_DS = E_t~T_real[ (D_T(t) - 1)² ] + E_t'~T_fake[ (D_T(t'))² ] + E_s~S_real[ (D_S(s) - 1)² ] + E_s'~S_fake[ (D_S(s'))² ]   (5)
In the formula, L_D represents the total loss value of the two discriminators; L_DT represents the loss value of discriminator D_T; L_DS represents the loss value of discriminator D_S; E_t'~T_fake denotes the expectation over fake images t' of the fake image set T_fake with target-domain characteristics; E_s'~S_fake denotes the expectation over fake images s' of the fake image set S_fake with source-domain characteristics; t' represents a picture in T_fake; s' represents a picture in S_fake; D_S and D_T are the two discriminators; E_s~S_real denotes the expectation over images s of the source data domain S_real; s represents a picture in S_real; E_t~T_real denotes the expectation over images t of the target data domain T_real; t represents a picture in T_real;
In summary, the CycleGAN loss function is calculated by formula (6), specifically as follows:
L_total = L_G + L_D   (6)
In the formula, L_total represents the total loss function value of the CycleGAN; L_D represents the total loss value of the two discriminators; L_G represents the total loss function value of the two generators; L_cyc represents the total cycle consistency loss value; L_GAN represents the total GAN loss value; L_idt represents the total identity loss value; G_S-T and G_T-S represent the two generators; S_real represents the set of real pictures in the source domain; T_real represents the set of real pictures in the target domain; D_S and D_T are the two discriminators; L_DT represents the loss value of discriminator D_T; L_DS represents the loss value of discriminator D_S.
Step two, taking the composite images as the training set and the target domain images as the test set;
the specific method of the second step is as follows:
Using the CycleGAN model of step one, the composite images are obtained as training set data, and a picture folder, a data folder and an image annotation folder are made.
The picture folder contains all jpg images of the training set, validation set and test set; the data folder contains three txt files, which record respectively all picture names of the training set, all picture names of the validation set and all picture names of the test set; the image annotation folder contains, for the training set, validation set and test set, the top-left corner coordinates (xmin, ymin) and bottom-right corner coordinates (xmax, ymax) of all label boxes and the category of the object in each box (where 0, 1, 2, … represent the different object categories), i.e. one xml file for each picture.
Step three, clustering the training set label boxes with the K-means clustering algorithm, determining the number of clusters and calculating the prior box sizes;
The txt file contains all target object label boxes of the training set, and K-means clustering is performed on all label boxes according to their width ω and height h using the following formulas, so as to obtain the size statistics of the training set label boxes.
d = 1 - IOU(box, centroid)   (7)
IOU = |box ∩ centroid| / |box ∪ centroid|   (8)
wherein centroid is the determined cluster center box, box is any other box, and the IOU (intersection over union) value between centroid and box is calculated; d represents the distance value, and the K-means algorithm clusters exactly according to this distance value. box ∩ centroid represents the area of the intersection of the two boxes, and box ∪ centroid represents the area of their union. The larger the IOU value, the more the two boxes overlap and the more similar their shapes, so they should be grouped into the same class during clustering; therefore the K-means distance value and the IOU value between the boxes are inversely related.
For the detection of many different classes of target objects, whose shapes and sizes may differ greatly, the K-means clustering algorithm clusters all label boxes only according to the intersection over union between two boxes. This clustering criterion does not reflect the shape differences between label boxes of different classes, and the label boxes of frequently occurring target objects far outnumber those of rarely occurring ones, so the dominant label box sizes easily bias the final clustering result. In order to reflect the shape differences of different classes of target objects during clustering, the K-means clustering formula is improved: besides the IOU value of the boxes, the cIOU and gIOU values, which reflect box shape, are added. The improved clustering formulas between the cluster center box centroid and any box are as follows:
d = 1 - imp_IOU(box, centroid)   (9)
imp_IOU = (IOU + gIOU + cIOU) / 3   (10)
IOU = |box ∩ centroid| / |box ∪ centroid|   (11)
gIOU = IOU - (|C| - |box ∪ centroid|) / |C|   (12)
cIOU = IOU - ρ²(b_box, b_centroid) / c² - α·v   (13)
v = (4 / π²)·(arctan(ω_centroid / h_centroid) - arctan(ω_box / h_box))²   (14)
wherein d represents the distance value; centroid represents the determined cluster center box; box represents any other box; box ∩ centroid represents the area of the intersection of the two boxes, and box ∪ centroid represents the area of their union; imp_IOU represents the improved IOU value, obtained by calculating the IOU, cIOU and gIOU values of all label boxes and taking the average of the three as the basis for computing the distance. gIOU formula: on the basis of the IOU, the smallest enclosing box C of the two boxes is found, and the union area of the cluster boxes box and centroid is subtracted from the area of box C and the difference is divided by the area of box C. cIOU formula: ρ represents the distance between the center points of the two cluster boxes box and centroid, and c represents the diagonal length between the top-left and bottom-right points of the found box C; the value ρ²/c² reflects the distance between the two boxes, but because only the width and height of the boxes are clustered and their positions are not considered, ρ²/c² = 0 here. The value v in the cIOU is computed from the aspect ratio of each cluster box and reflects the difference between the width-to-height ratios of the two boxes: if v is large, the cIOU value is small, meaning the two boxes differ greatly in shape and should not be clustered into the same class; α is a coefficient used to balance the two terms in the cIOU formula; b_box represents any other box; b_centroid represents the determined cluster center box; ω_centroid and h_centroid represent the width and height of the cluster center box; ω_box and h_box represent the width and height of any box.
Step four, building the improved YOLOv3 network as the feature extraction backbone network;
the concrete method of the fourth step is as follows:
The feature extraction network adopted in the YOLOv3 algorithm is Darknet-53, i.e. there are 53 network layers, including a feature pyramid structure; the network consists of:
41) the feature extraction front-end network;
411. Convolution layers: if the input image size is n × n, the numbers of convolution kernels of the first two convolution layers are 32 and 64 respectively, and by setting the stride of the convolution to 2 the output feature map size is halved to n/2 × n/2;
412. Residual group structure: several stacked residual blocks form the backbone of the Darknet-53 feature extraction network structure; each residual block spans two convolution layers between its input and output; the number of residual blocks in the first group is 1; the input and output feature maps of a residual block have the same size, still n/2 × n/2, and the same number of channels, i.e. the residual blocks only extract data features and do not change the feature map size;
413. Convolution layer with stride 2: in Darknet-53, a convolution layer with stride 2 replaces the pooling layer to reduce the feature size, namely: it halves the size of the input feature map to n/4 × n/4 and doubles the number of channels;
414. Stacking of residual groups and convolution layers: the network structure of steps 412 and 413 is repeated 4 more times as a whole, with the numbers of residual blocks in the step-412 structures being 2, 8, 8 and 4 respectively; a stride-2 convolution layer accompanies each step-412 structure to halve the feature map size (no further downsampling follows the last residual group), so the feature map sizes obtained after each repetition are, in turn: n/4 × n/4, n/8 × n/8, n/16 × n/16, n/32 × n/32.
That is: the whole feature extraction part contains 5 residual groups with different numbers of convolution kernels, and each residual group consists of a different number of residual blocks; the numbers of residual blocks in the 5 residual groups are 1, 2, 8, 8 and 4 respectively. A convolution layer with stride 2 precedes the first residual group and is arranged between every two adjacent residual groups to halve the size, so that the feature map input into the next residual group is half the previous size; there are 5 such stride-2 convolution layers in total, and the finally output feature map size is 1/32 of the training set image size, i.e. n/32 × n/32;
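A condensed PyTorch sketch of the residual blocks and stride-2 downsampling convolutions described in 41); the channel numbers and activation choices follow the common Darknet-53 layout and are assumptions where the text does not state them.

import torch.nn as nn

def conv_bn(c_in, c_out, k, s):
    return nn.Sequential(nn.Conv2d(c_in, c_out, k, stride=s, padding=k // 2, bias=False),
                         nn.BatchNorm2d(c_out), nn.LeakyReLU(0.1, inplace=True))

class ResidualBlock(nn.Module):
    """Two convolutions spanned by a shortcut; feature map size and channels are unchanged."""
    def __init__(self, channels):
        super().__init__()
        self.body = nn.Sequential(conv_bn(channels, channels // 2, 1, 1),
                                  conv_bn(channels // 2, channels, 3, 1))

    def forward(self, x):
        return x + self.body(x)

def residual_group(channels, n_blocks):
    return nn.Sequential(*[ResidualBlock(channels) for _ in range(n_blocks)])

# backbone skeleton: 5 residual groups with 1, 2, 8, 8, 4 blocks, each preceded by a
# stride-2 convolution that halves the feature map size (n -> n/32 overall)
def darknet53_trunk():
    layers, channels = [conv_bn(3, 32, 3, 1)], 32
    for n_blocks in (1, 2, 8, 8, 4):
        layers.append(conv_bn(channels, channels * 2, 3, 2))    # downsample, double channels
        channels *= 2
        layers.append(residual_group(channels, n_blocks))
    return nn.Sequential(*layers)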
42) the improved feature pyramid;
421. fusing the information of the lower-layer and deep-layer feature maps of the feature pyramid structure;
The feature map output after feature extraction by the 5 residual groups has size n/32 × n/32 and serves as the input of the feature pyramid network. This structure outputs three feature maps of different sizes, which are, from small to large: the feature map obtained by fusing the feature maps output by residual groups 3, 4 and 5, with the smallest size n/32 × n/32; the feature map obtained by fusing the feature maps output by residual groups 2, 3 and 4, with size n/16 × n/16; and the feature map obtained by fusing the feature maps output by residual groups 1, 2 and 3, with the largest size n/8 × n/8. The number of output channels N is calculated by formula (15), specifically as follows:
N=num×(score+location+label) (15)
where num represents the number of prior boxes drawn in each cell; score represents the confidence probability value of each prediction box (1 value): each box has one score, with a value between 0 and 1; location represents the position coordinates of each prediction box and comprises 4 coordinate values (t_x, t_y, t_ω, t_h), the predicted coordinate offsets of each prediction box, from which the center coordinates and the width and height of each prediction box are calculated; label represents the probability values output by each prediction box for each category of target object, the number of output values being the number of categories of target objects to be detected.
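The channel count N of formula (15) and the decoding of the predicted offsets (t_x, t_y, t_ω, t_h) into box centers and sizes can be illustrated as follows; the sigmoid/exponential decoding is the usual YOLOv3 convention and is stated here as an assumption rather than the exact formulation of the invention.

import math

# formula (15): with num = 3 prior boxes per cell, 1 confidence score,
# 4 location values and 5 target categories, N = 3 * (1 + 4 + 5) = 30
N = 3 * (1 + 4 + 5)

def decode_box(tx, ty, tw, th, cx, cy, pw, ph, stride):
    """Map predicted offsets to a box: (cx, cy) is the cell index, (pw, ph) the prior box size."""
    bx = (1.0 / (1.0 + math.exp(-tx)) + cx) * stride   # center x in pixels
    by = (1.0 / (1.0 + math.exp(-ty)) + cy) * stride   # center y in pixels
    bw = pw * math.exp(tw)                             # width scaled from the prior box
    bh = ph * math.exp(th)                             # height scaled from the prior box
    return bx, by, bw, bh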
The feature map output by a low-layer network, obtained after only a few convolution layers, extracts the image features only preliminarily: it contains more noise vectors but preserves the edge and line features of the image well. The feature map output by a deep network, obtained after many convolution layers, extracts the overall pixel-level features completely and contains less noise, but edge contour information is lost and the edges of target objects become blurred. Therefore, the deep feature maps and the shallow feature maps are fused: on the basis of extracting the target object features well, more contour line features of the target objects are retained, so that the feature maps extracted by the neural network preserve the features of the real target objects to the greatest extent, which helps to improve the target detection rate.
422. Adding an attention module;
the attention module includes two sub-modules of channel attention and spatial attention that focus on different features in two dimensions, channel and space, respectively.
Channel attention module: for an input n × n × m feature map, features on each channel are first extracted by a global average pooling layer, which outputs a 1 × 1 × m vector; the correlations between channels are then extracted by two successive fully connected layers whose output channel numbers are m/4 and m respectively; a relu activation function follows the first fully connected layer and a sigmoid activation function follows the second, fixing the output values to the range 0-1; finally, the feature map input into the channel attention module is multiplied by the feature map output by the sigmoid layer, and the finally output feature map also has size n × n × m;
Spatial attention module: the input n × n × m feature map first passes through two parallel convolution branches whose kernel sizes are 1 × 9 and 9 × 1 respectively; the pixel values at corresponding positions of the feature maps output by the two branches are added, and a sigmoid function fixes the output values to the range 0-1, giving a feature map of size n × n × 1; finally, the feature map input into the module is multiplied by this single-channel feature map to obtain the final output feature map of size n × n × m.
For low-layer feature maps, the features extracted across channels still contain random vectors such as noise, so a spatial attention module is added to the low-layer feature maps: it only attends to the difference information in the same spatial dimension and computes the average across channels at the same pixel position to suppress the noise vectors. For deep feature maps, feature extraction is complete, so a channel attention module is added to capture the partial differences between the different feature channels of the same feature map. For the feature maps of the middle layer, some noise vectors remain and the features extracted by the channels also differ, so both the channel attention and the spatial attention modules are added.
Step five, training the improved YOLOv3 network on the fake images generated by the CycleGAN model;
The concrete method of step five is as follows:
After the feature extraction backbone network is built, for the input training set pictures, the data, labels, number of categories and prior box size information are loaded into the network, and the weight file obtained by training on the coco2014 dataset is loaded as the pre-training weights of the YOLOv3 model, i.e. the initial weight parameters of each network layer; during training, forward propagation calculates the YOLOv3 loss function: location loss, confidence loss and category loss.
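A hedged sketch of the training step described above: pre-trained weights are loaded as the initial parameters and each forward pass accumulates location, confidence and category losses. The function and file names (build_improved_yolov3, yolov3_coco_pretrained.pth, loss_fn, etc.) are placeholders rather than the exact implementation of the invention.

import torch

def train_one_epoch(model, loader, optimizer, loss_fn, device="cuda"):
    """One epoch of improved-YOLOv3 training on the CycleGAN fake images (sketch)."""
    model.train()
    for images, targets in loader:                   # targets: label boxes and class ids
        images = images.to(device)
        preds = model(images)                        # three feature maps from the backbone
        loc_loss, conf_loss, cls_loss = loss_fn(preds, targets)
        loss = loc_loss + conf_loss + cls_loss       # YOLOv3 loss = location + confidence + category
        optimizer.zero_grad()
        loss.backward()
        optimizer.step()

# initial weights: parameters pre-trained on the coco2014 dataset (file name is a placeholder)
# model = build_improved_yolov3(num_classes=5, priors=priors)
# model.load_state_dict(torch.load("yolov3_coco_pretrained.pth"), strict=False)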
Step six, detecting the test set images with the trained model and calculating the average detection precision;
the concrete method of the sixth step is as follows:
The test set pictures are cropped to a certain size, which can be set to 704 × 704, and input into the trained model; during testing, a Soft-NMS algorithm is used to screen out the redundant boxes among all prediction boxes output by the backbone network. The specific method is as follows:
61) for all prediction boxes B = {b1, b2, ..., bn} output by the network, b1 represents the 1st prediction box, b2 the 2nd prediction box and bn the nth prediction box; their corresponding confidence scores are S = {s1, s2, ..., sn}, where s1 is the confidence score predicted for the 1st prediction box, s2 for the 2nd prediction box and sn for the nth prediction box; an intersection-over-union threshold t, a score threshold σ and a confidence threshold α are set;
62) the box with the highest score of the same category among the prediction boxes is found and denoted bm, with score sm; for each box bi other than the highest-scoring box bm, with score si, the IOU value IOU(bi, bm) between bi and bm is calculated; if IOU(bi, bm) ≥ t, then
si = si · e^( -IOU(bi, bm)² / σ )
otherwise the value of si is unchanged; wherein si represents the confidence score of prediction box bi; bi represents any prediction box; bm represents the prediction box with the highest score; σ represents the score threshold;
63) step 62) is repeated for the prediction boxes of the next category until the prediction boxes of all categories of targets have been traversed, giving the new confidence scores S′ = {s1′, s2′, ..., sn′} corresponding to all prediction boxes;
64) the new confidence scores are screened with the set confidence threshold α: for the score sj′ of any prediction box, if sj′ < α, the box bj is suppressed and not output.
Step seven, performing cross-domain target detection.
The concrete method of the seventh step is as follows:
For images from two different data domains, the domain whose images are easy to collect and acquire is called the source data domain; the CycleGAN model takes the source data and the target data as input and converts the source domain data into fake pictures with the target domain style. Compared with the source data, the positions and shapes of the target objects in these fake pictures are unchanged, while the picture background information takes on the target domain style. Therefore, for two differently distributed data domains, the easily collected data domain is taken as the source domain and labeled, the source domain data are converted by the CycleGAN model into fake pictures with the target domain style, the improved YOLOv3 model is trained with these fake pictures, and the trained model is used to detect the target objects in the target domain pictures; the detection accuracy is higher because the model has learned information such as the image background and style of the test set.
Examples
A cross-domain target detection method for an automatic driving automobile based on improved YOLOv3, taking target detection under different weather as an example, comprises the following specific steps:
Step 1: the cross-domain target detection problem is made concrete by considering target detection under different weather, i.e. the training set and the test set data are collected under different weather. For example, the training set data KITTI are pictures taken on sunny days, and the test set data vKITTI-rainy are pictures of the same places on rainy days.
As shown in FIG. 2, S_real is the source domain data KITTI, T_real is the target domain data vKITTI-rainy, T_fake is the composite image set with the target domain style, and S_fake is the composite image set with the source domain style; G_S-T denotes the image generator from the source domain to the target domain, G_T-S the image generator from the target domain to the source domain, D_S the source domain image discriminator and D_T the target domain image discriminator. The generator and discriminator network structures in the CycleGAN are shown in FIG. 3 and FIG. 4 respectively; the symbols mean: Conv, convolution layer; Deconv, deconvolution layer; BN, batch normalization layer; relu, activation function; Leakyrelu, activation function; 9x indicates that the residual block is repeated 9 times.
The method uses the CycleGAN unsupervised generative adversarial network to generate a composite image set T_fake with the target domain style from the source domain images. For two differently distributed data domains, the CycleGAN network only converts the image style; information such as the positions and classes of the target objects in the images is not changed, so the composite images T_fake do not need to be re-labeled. T_fake is taken as the training set data and the annotation files of the original KITTI data are taken as the annotation files of the T_fake images to train the improved YOLOv3 model of the invention. The original image size is 1280 × 720, with 3568 training images and 892 test images; the training images input into the YOLOv3 network are cropped to 704 × 704 and then feature extraction is performed. A train.txt document and a test.txt document are obtained to record the image names of the training set and the test set; the corresponding training set images are found from the image names in the train.txt document, and the 4 position coordinates (xmin, ymin, xmax, ymax) of the top-left and bottom-right corners of all label boxes in the training images, together with the target category information in each box, are recorded in a 2007_train.txt document. The categories of targets to be detected are 5: vehicles, pedestrians, buses, traffic lights and traffic signs.
All images are used to train the CycleGAN for 200 epochs. The learning rate is decayed during training: it is held at 2 × 10^-4 for the first 100 epochs, at 2 × 10^-5 for epochs 100-150, and at 2 × 10^-6 for epochs 150-200. The parameters λ1 = λ2 in the generator cycle-consistency loss are set to 5, the parameter λ in the identity loss is set to 0.5, and the loss calculation uses a squared-error loss in place of the cross-entropy loss. Training is carried out under Ubuntu 18.04.4 with the PyTorch framework, the number of pictures per training step (batch_size) is 1, and the weight coefficients of the generators and discriminators are updated with Adam.
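The staged learning-rate schedule and Adam updates described above can be sketched in PyTorch as follows. The stand-in generator modules, the Adam betas and the placement of the scheduler step are assumptions for illustration; the discriminators would follow the same pattern with their own optimizer.

```python
import itertools
import torch
import torch.nn as nn

# stand-ins for the two generators; the real networks follow the structure of figure 3
G_s2t = nn.Conv2d(3, 3, 3, padding=1)
G_t2s = nn.Conv2d(3, 3, 3, padding=1)

def lr_multiplier(epoch):
    """Staged schedule from the embodiment, relative to the base rate 2e-4."""
    if epoch < 100:
        return 1.0    # 2e-4 for the first 100 epochs
    if epoch < 150:
        return 0.1    # 2e-5 for epochs 100-150
    return 0.01       # 2e-6 for epochs 150-200

g_params = itertools.chain(G_s2t.parameters(), G_t2s.parameters())
opt_G = torch.optim.Adam(g_params, lr=2e-4, betas=(0.5, 0.999))
sched_G = torch.optim.lr_scheduler.LambdaLR(opt_G, lr_lambda=lr_multiplier)

for epoch in range(200):
    # ... one pass over the unpaired source/target images (batch_size = 1) goes here ...
    sched_G.step()
```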
Step 2: the K-means clustering algorithm calculates the prior box size.
First, the number of clusters for all annotation boxes in the training set is determined. For each candidate number of prior boxes per cell (1, 2, 3, ..., 9), the intersection-over-union (IOU) between the clustered prior boxes and the annotation boxes is calculated, and the inflection point of the resulting curve is found, i.e. the point after which the IOU no longer increases noticeably; the number of cluster boxes at that point is the final number per cell. As shown in fig. 5, the inflection point occurs when the number of clusters is 3, i.e. when 3 prior boxes are drawn in each cell, with an IOU of 67.63%; since the YOLOv3 model predicts on three feature-map scales, all annotation boxes are therefore clustered into 9 classes.
The imp_IOU value between the cluster center box (centroid) and any box, and the corresponding distance value, are calculated according to the following formulas, from which the clustered prior box sizes are obtained by the K-means clustering algorithm.
d=1-imp_IOU(box,centroid) (16)
imp_IOU(box, centroid) = [IOU(box, centroid) + cIOU(box, centroid) + gIOU(box, centroid)] / 3    (17)
The improvement is that the IOU, cIOU and gIOU values of all annotation boxes are calculated, and the average of the three is used as the basis for the distance calculation.
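A minimal numpy sketch of K-means anchor clustering with the modified distance d = 1 - imp_IOU, where imp_IOU averages the IOU, gIOU and cIOU of each box against the cluster center. Treating boxes as width-height pairs with aligned centers (so the cIOU center-distance term vanishes) is a simplifying assumption of this sketch, as is the random initialisation.

```python
import numpy as np

def iou_terms(boxes, centroids):
    """IOU, gIOU and cIOU between (N,2) width-height boxes and (K,2) centroids, centers aligned."""
    w1, h1 = boxes[:, None, 0], boxes[:, None, 1]
    w2, h2 = centroids[None, :, 0], centroids[None, :, 1]
    inter = np.minimum(w1, w2) * np.minimum(h1, h2)
    union = w1 * h1 + w2 * h2 - inter
    iou = inter / union
    enclose = np.maximum(w1, w2) * np.maximum(h1, h2)       # smallest enclosing box
    giou = iou - (enclose - union) / enclose
    v = (4 / np.pi ** 2) * (np.arctan(w1 / h1) - np.arctan(w2 / h2)) ** 2
    alpha = v / (1 - iou + v)
    ciou = iou - alpha * v                                  # center-distance term is zero here
    return iou, giou, ciou

def kmeans_anchors(boxes, k=9, iters=300, seed=0):
    """Cluster annotation-box sizes with distance d = 1 - mean(IOU, gIOU, cIOU)."""
    rng = np.random.default_rng(seed)
    centroids = boxes[rng.choice(len(boxes), k, replace=False)]
    for _ in range(iters):
        iou, giou, ciou = iou_terms(boxes, centroids)
        d = 1 - (iou + giou + ciou) / 3
        assign = d.argmin(axis=1)
        new = np.array([boxes[assign == j].mean(axis=0) if np.any(assign == j) else centroids[j]
                        for j in range(k)])
        if np.allclose(new, centroids):
            break
        centroids = new
    return centroids[np.argsort(centroids.prod(axis=1))]    # sorted by area, smallest first
```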
The calculated prior box sizes are: [(26,20), (38,28), (51,37), (69,41), (83,48), (111,56), (147,81), (202,121), (311,177)]. The YOLOv3 network outputs feature maps of three sizes, and the prior boxes drawn in each cell of the feature maps, from small to large, correspond to these 9 sizes. That is, for a 704 × 704 input image, the three feature maps output by the backbone network have sizes 22 × 22, 44 × 44 and 88 × 88: the 3 box sizes drawn in each cell of the 22 × 22 feature map are [(26,20), (38,28), (51,37)]; the 3 box sizes drawn in each cell of the 44 × 44 feature map are [(69,41), (83,48), (111,56)]; and the 3 box sizes drawn in each cell of the 88 × 88 feature map are [(147,81), (202,121), (311,177)]. Each prior box covers part of the features of the input image and outputs the extracted position coordinates of the target object, the predicted confidence score, and the class information.
Step 3: construct the improved YOLOv3 feature extraction backbone network.
The feature extraction network adopted in the YOLOv3 algorithm is Darknet-53, i.e. it has 53 network layers and contains a feature pyramid structure. The front-end network is composed mainly of a number of residual blocks and convolution layers with stride 2: the residual blocks continuously extract features from the input feature maps, while the stride-2 convolution layers replace pooling layers to halve the image size before output. Darknet-53 contains 5 stride-2 convolution layers, so the feature map fed into the feature pyramid is 1/32 of the input image size, i.e. 22 × 22. The output of each feature map includes the confidence value of being predicted as a foreground box, the four position coordinates (tx, ty, tω, th), and the probability value for each of the five target categories, so each of the three feature maps output by the network has 3 × (1 + 4 + 5) = 30 channels. Accordingly, Darknet-53 yields three feature maps of 22 × 22 × 30, 44 × 44 × 30, and 88 × 88 × 30.
Fig. 9 is the structure diagram of the improved YOLOv3 feature extraction backbone network, in which Conv3x3_s2 denotes a convolution layer with kernel size 3x3 and stride 2, which halves the size of the input feature map. Conv1x1 and Conv3x3 are the convolution layers used in the feature pyramid, and x1, x2, x8, x8 and x4 indicate how many times each residual block is repeated in the front-end network. Fig. 8 shows the basic structure of the residual block in fig. 9: it consists of a convolution layer Conv1x1_s1 with kernel size 1x1 and stride 1 followed by a convolution layer Conv3x3_s1 with kernel size 3x3 and stride 1, and the input is added to the output of these two convolution layers to form the final output of the residual block. The network structures of the channel attention and spatial attention modules in fig. 9 are shown in figs. 6 and 7 respectively. In fig. 6, global average pooling takes a feature map as input and outputs a one-dimensional vector of the average pixel value of each channel, FC is a fully connected layer used to convert a multi-dimensional feature map into a one-dimensional column vector, and the Sigmoid activation function adjusts the output value to lie between 0 and 1, i.e. outputs a probability value.
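The channel and spatial attention modules of figs. 6 and 7 might look like the following PyTorch sketch. The m/4 channel reduction and the 1×9 and 9×1 kernels follow the description, while the padding, the single-channel branch outputs and the order in which the two modules are applied are assumptions of this sketch.

```python
import torch
import torch.nn as nn

class ChannelAttention(nn.Module):
    """Fig. 6: global average pooling -> FC(m/4) -> ReLU -> FC(m) -> Sigmoid -> rescale channels."""
    def __init__(self, channels):
        super().__init__()
        self.pool = nn.AdaptiveAvgPool2d(1)
        self.fc = nn.Sequential(
            nn.Linear(channels, channels // 4),
            nn.ReLU(inplace=True),
            nn.Linear(channels // 4, channels),
            nn.Sigmoid(),
        )

    def forward(self, x):
        b, c, _, _ = x.shape
        w = self.fc(self.pool(x).view(b, c)).view(b, c, 1, 1)
        return x * w

class SpatialAttention(nn.Module):
    """Fig. 7: parallel 1x9 and 9x1 convolutions, summed, sigmoid, rescale spatial positions."""
    def __init__(self, channels):
        super().__init__()
        self.branch_a = nn.Conv2d(channels, 1, kernel_size=(1, 9), padding=(0, 4))
        self.branch_b = nn.Conv2d(channels, 1, kernel_size=(9, 1), padding=(4, 0))

    def forward(self, x):
        mask = torch.sigmoid(self.branch_a(x) + self.branch_b(x))   # n x n x 1 attention map
        return x * mask

# usage on a backbone feature map of shape (batch, m, n, n); the application order is assumed
feat = torch.randn(1, 256, 22, 22)
feat = SpatialAttention(256)(ChannelAttention(256)(feat))
```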
Step 4: train the improved YOLOv3 network.
The YOLOv3 model is built under the Keras framework. The training set images are the composite images with vKITTI-rainy style generated from the KITTI images, and the annotation files are the same as those of the KITTI data set. The input image size is 1280 × 720, and each image is cropped to 704 × 704 when input to YOLOv3. The weights trained on the coco2014 data set are loaded as the initial weights. The number of training epochs is set to 500; the training-set and validation-set loss values are output every epoch, and if the loss does not decrease for 6 consecutive epochs, the loss is considered to have converged and training is stopped, giving the final model.
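The stopping rule described above (pause training once the validation loss has not decreased for 6 consecutive epochs) maps naturally onto a Keras EarlyStopping callback. The sketch below uses a trivial stand-in model and dummy data, since the real improved YOLOv3 network, its loss and its data generators are defined elsewhere.

```python
import numpy as np
from tensorflow.keras import layers, models
from tensorflow.keras.callbacks import EarlyStopping

# stand-in model and data; the real network is the improved YOLOv3 backbone described above
model = models.Sequential([
    layers.Conv2D(4, 3, input_shape=(704, 704, 3)),
    layers.GlobalAveragePooling2D(),
    layers.Dense(1),
])
model.compile(optimizer="adam", loss="mse")
x = np.zeros((8, 704, 704, 3), np.float32)
y = np.zeros((8, 1), np.float32)

# stop once the validation loss has not decreased for 6 consecutive epochs, as in the embodiment
stopper = EarlyStopping(monitor="val_loss", patience=6, restore_best_weights=True)
model.fit(x, y, validation_split=0.25, epochs=500, callbacks=[stopper], verbose=0)
```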
Step 5: cross-domain target detection.
The test images come from the vKITTI-rainy data set, while the model is trained on images synthesized with CycleGAN, in which the position, class and other information of the target objects are the same as in the KITTI data set but the environmental background is that of vKITTI-rainy. Because the model has learned the background information of the test set, the detection precision on the test set is greatly improved compared with a model trained directly on the original training-set images. The results show that when a YOLOv3 model trained on the KITTI data set detects target objects in the vKITTI-rainy data set, the vehicle detection precision is 38.00%; when a YOLOv3 model trained on the false KITTI images synthesized by CycleGAN is used to detect target objects in vKITTI-rainy, the vehicle detection precision is 44.15%; and the model of the invention reaches 48.94%. These detection results show that the model greatly improves the accuracy of cross-domain target detection. FIG. 10 shows the effect of applying the algorithm of the invention to the vKITTI-rainy data set.
Step 6: cross-domain target detection in other scenarios.
The training and detection process of the invention has been described in detail above, taking target detection across different weather conditions as an example. The method is also suitable for cross-domain target detection in further scenes, such as different places, different time periods, different places with different weather conditions, and different places with different time periods. Cross-domain detection results in four different scenes are shown in figs. 11-14: fig. 11 is the detection effect in clear daytime, fig. 12 at night, fig. 13 in fog, and fig. 14 in the evening.

Claims (9)

1. A cross-domain target detection method of an automatic driving automobile based on improved YOLOv3 is characterized by comprising the following steps:
inputting a source domain image and a target domain image into a countermeasure generation network cycleGAN model for training to obtain a synthetic graph;
step two, taking the synthetic image as a training set, and taking the target domain image as a test set;
clustering the training set mark boxes through a K-means clustering algorithm, determining the clustering number and calculating the prior box size;
step four, building an improved YOLOv3 feature extraction backbone network;
fifthly, training the improved YOLOv3 network with the false images generated by the CycleGAN model;
step six, detecting the images of the test set by using the model obtained by training and calculating the average detection precision;
and seventhly, detecting the cross-domain target.
2. The improved YOLOv 3-based cross-domain target detection method for the automatic driven vehicle as claimed in claim 1, wherein the first step adopts an unsupervised generation countermeasure network cycleGAN algorithm to complete the adaptation between the source domain image and the target domain image; the method specifically comprises the following steps:
11) the cyclic countermeasure generation network CycleGAN;
the source domain images input into the countermeasure generation network CycleGAN model pass through a generator to obtain synthetic graphs with the target domain style, and are trained continuously;
the discriminator distinguishes the synthetic graphs produced by the generator from the real images in the original data domain;
12) calculating the countermeasure generation network CycleGAN loss function;
the countermeasure generation network CycleGAN loss function comprises a generator loss function and a discriminator loss function;
the generator loss function comprises the cycle consistency losses, the GAN losses and the intrinsic losses of the two generators;
the cycle consistency loss of the two generators is calculated by formula (1), specifically as follows:
L_cyc(G_S-T, G_T-S, S_real, T_real) = L_cyc1 + L_cyc2 = λ1·E_s~S_real[‖G_T-S(G_S-T(s)) − s‖_1] + λ1·E_t~T_real[‖G_S-T(G_T-S(t)) − t‖_1]    (1)
in the formula, G_S-T and G_T-S represent the two generators; L_cyc represents the cycle consistency loss value; S_real represents the set of real pictures in the source domain; T_real represents the set of real pictures in the target domain; L_cyc1 represents the cycle consistency loss value in the first direction; L_cyc2 represents the cycle consistency loss value in the second direction; λ1 is a constant coefficient for balancing the loss value; E_s~S_real denotes the expectation over images s in the source data domain S_real; E_t~T_real denotes the expectation over images t in the target data domain T_real; s represents a picture in S_real; t represents a picture in T_real;
the GAN loss is calculated by equation (2) as follows:
L_GAN(G_S-T, G_T-S, D_S, D_T, S_real, T_real) = L_GAN1 + L_GAN2 = E_s~S_real[(D_T(G_S-T(s)) − 1)^2] + E_t~T_real[(D_S(G_T-S(t)) − 1)^2]    (2)
in the formula, L_GAN represents the total GAN loss value; G_S-T and G_T-S represent the two generators; D_S and D_T are the two discriminators; S_real represents the set of real pictures in the source domain; T_real represents the set of real pictures in the target domain; L_GAN1 represents the GAN loss value of the first generator; L_GAN2 represents the GAN loss value of the second generator; E_s~S_real denotes the expectation over images s in the source data domain S_real; s represents a picture in S_real; E_t~T_real denotes the expectation over images t in the target data domain T_real; t represents a picture in T_real;
the intrinsic loss is calculated by formula (3), specifically as follows:
L_idt(G_S-T, G_T-S, S_real, T_real) = L_idt1 + L_idt2 = λ·λ1·E_t~T_real[‖G_S-T(t) − t‖_1] + λ·λ2·E_s~S_real[‖G_T-S(s) − s‖_1]    (3)
in the formula, L_idt represents the total intrinsic loss; G_S-T and G_T-S represent the two generators; S_real represents the set of real pictures in the source domain; T_real represents the set of real pictures in the target domain; L_idt1 represents the intrinsic loss value of the first generator; L_idt2 represents the intrinsic loss value of the second generator; λ, λ1 and λ2 are constant coefficients for balancing the loss values; E_s~S_real denotes the expectation over images s in the source data domain S_real; s represents a picture in S_real; E_t~T_real denotes the expectation over images t in the target data domain T_real; t represents a picture in T_real;
the generator loss function is calculated by equation (4) as follows:
L_G(G_S-T, G_T-S, D_S, D_T, S_real, T_real) = L_cyc + L_GAN + L_idt    (4)
in the formula, L_G represents the total loss function value of the two generators; L_cyc represents the total cycle consistency loss value; L_GAN represents the total GAN loss value; L_idt represents the total intrinsic loss value; G_S-T and G_T-S represent the two generators; S_real represents the set of real pictures in the source domain; T_real represents the set of real pictures in the target domain; D_S and D_T are the two discriminators;
the discriminator loss function is calculated by formula (5), specifically as follows:
L_D = L_D_T + L_D_S = E_t~T_real[(D_T(t) − 1)^2] + E_t'~T_fake[(D_T(t'))^2] + E_s~S_real[(D_S(s) − 1)^2] + E_s'~S_fake[(D_S(s'))^2]    (5)
in the formula, L_D represents the total loss value of the two discriminators; L_D_T represents the loss value of the discriminator D_T; L_D_S represents the loss value of the discriminator D_S; E_t'~T_fake denotes the expectation over false images t' in the false-picture set T_fake with target-domain characteristics; E_s'~S_fake denotes the expectation over false images s' in the false-picture set S_fake with source-data-domain characteristics; t' represents a picture in T_fake; s' represents a picture in S_fake; D_S and D_T are the two discriminators; E_s~S_real denotes the expectation over images s in the source data domain S_real; s represents a picture in S_real; E_t~T_real denotes the expectation over images t in the target data domain T_real; t represents a picture in T_real;
in summary, the countermeasure generation network CycleGAN loss function is calculated by equation (6) as follows:
L_total(G_S-T, G_T-S, D_S, D_T, S_real, T_real) = L_G + L_D = L_cyc + L_GAN + L_idt + L_D_T + L_D_S    (6)
in the formula, L_total represents the total loss function value of the countermeasure generation network CycleGAN; L_D represents the total loss value of the two discriminators; L_G represents the total loss function value of the two generators; L_cyc represents the total cycle consistency loss value; L_GAN represents the total GAN loss value; L_idt represents the total intrinsic loss value; G_S-T and G_T-S represent the two generators; S_real represents the set of real pictures in the source domain; T_real represents the set of real pictures in the target domain; D_S and D_T are the two discriminators; L_D_T represents the loss value of the discriminator D_T; L_D_S represents the loss value of the discriminator D_S.
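As an illustration of how the loss terms of claim 2 fit together, the following PyTorch-style sketch assembles the generator and discriminator losses. The least-squares adversarial form and the weighting constants mirror the reconstruction above and are assumptions, not the exact formulas of the original filing.

```python
import torch
import torch.nn.functional as F

def generator_loss(G_st, G_ts, D_s, D_t, s_real, t_real, lam=5.0, lam_idt=0.5):
    """L_G = L_cyc + L_GAN + L_idt, following claim 2 (least-squares adversarial terms assumed)."""
    t_fake, s_fake = G_st(s_real), G_ts(t_real)
    # cycle consistency: s -> t_fake -> s and t -> s_fake -> t should reproduce the inputs
    l_cyc = lam * (F.l1_loss(G_ts(t_fake), s_real) + F.l1_loss(G_st(s_fake), t_real))
    # adversarial terms: each generator pushes its discriminator's output on fakes towards 1
    p_t, p_s = D_t(t_fake), D_s(s_fake)
    l_gan = F.mse_loss(p_t, torch.ones_like(p_t)) + F.mse_loss(p_s, torch.ones_like(p_s))
    # intrinsic (identity) terms: a target-domain image fed to G_st should come back unchanged
    l_idt = lam_idt * lam * (F.l1_loss(G_st(t_real), t_real) + F.l1_loss(G_ts(s_real), s_real))
    return l_cyc + l_gan + l_idt

def discriminator_loss(D, real, fake):
    """Per-discriminator L_D term: real images scored towards 1, synthesized images towards 0."""
    p_real, p_fake = D(real), D(fake.detach())
    return F.mse_loss(p_real, torch.ones_like(p_real)) + F.mse_loss(p_fake, torch.zeros_like(p_fake))
```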
3. The method for detecting the cross-domain target of the autonomous vehicle based on the improved YOLOv3 as claimed in claim 1, wherein the specific method in the second step is as follows:
and (4) obtaining a synthetic graph as training set data by using the confrontation generation network cycleGAN model in the step one, and making a picture folder, a data folder and an image marking folder.
4. The improved YOLOv 3-based cross-domain target detection method for the automatic driving vehicle as claimed in claim 3, wherein the picture folder contains all the jpg images in the training set, the verification set and the test set; the data folder comprises three txt files, which respectively record all picture names of the training set, all picture names of the verification set and all picture names of the test set; the image annotation folder comprises, for every annotation box in the training set, verification set and test set, the upper-left corner coordinates (xmin, ymin), the lower-right corner coordinates (xmax, ymax) and the category of the object in the box (where 0, 1, 2, ... respectively represent the different object categories), i.e. one xml file for each picture.
5. The method for detecting the cross-domain target of the autonomous vehicle based on the improved YOLOv3 as claimed in claim 4, wherein the specific method of the third step is as follows:
the txt file comprises all target object marking frames of the training set, and K-means clustering is carried out on all marking frames according to the width omega and the height h of the frames and according to the formula (9) -14, so that the statistical condition of the sizes of the marking frames of the training set is obtained;
d=1-imp_IOU(box,centroid) (9)
imp_IOU(box, centroid) = [IOU(box, centroid) + gIOU(box, centroid) + cIOU(box, centroid)] / 3    (10)
IOU(box, centroid) = (box ∩ centroid) / (box ∪ centroid)    (11)
gIOU(box, centroid) = IOU(box, centroid) − |C − box ∪ centroid| / |C|    (12)
cIOU(box, centroid) = IOU(box, centroid) − ρ²(b_box, b_centroid) / c² − αv    (13)
v = (4/π²)·(arctan(ω_centroid / h_centroid) − arctan(ω_box / h_box))²,  α = v / (1 − IOU(box, centroid) + v)    (14)
wherein d represents the distance value; centroid represents the determined cluster center box; box represents any other box; box ∩ centroid represents the area of the intersection between the two boxes, and box ∪ centroid represents the area of their union; imp_IOU represents the modified IOU value; ρ represents the distance between the center points of the box and the cluster center box; c represents the diagonal length between the upper-left and lower-right points of the enclosing box C found above; b_box denotes any other box and b_centroid the determined cluster center box; α represents a coefficient; v is the aspect-ratio term calculated in the cIOU and represents the difference between the aspect ratios of the two boxes; ω_centroid and h_centroid represent the width and height of the cluster center box; ω_box and h_box represent the width and height of any box.
6. The method for detecting the cross-domain target of the autonomous vehicle based on the improved YOLOv3 of claim 1, wherein the concrete method of the fourth step is as follows:
the characteristic extraction network adopted in the YOLOv3 algorithm is Darknet-53, namely 53 network layers are provided and comprise a characteristic pyramid structure; the network layer consists of:
41) the feature extraction front-end network;
the whole feature extraction part comprises 5 residual groups with different numbers of convolution kernels, each residual group consisting of a different number of residual blocks; the numbers of residual blocks in the 5 residual groups are 1, 2, 8, 8 and 4 respectively; a convolution layer with stride 2 is placed between every two residual groups to halve the size, so that the feature map input to the next residual group is half the previous size; the network contains 5 such stride-2 convolution layers, so the finally output feature map size is 1/32 of the training set image size, i.e. n/32 × n/32;
42) improving the characteristic pyramid;
421. fusing the information of the characteristic pyramid structure lower layer and the deep layer characteristic diagram;
the feature map output after feature extraction by the 5 residual groups has size n/32 × n/32 and is used as the input of the feature pyramid network; this structure outputs three feature maps of different sizes, which from small to large are: the feature map obtained by fusing the feature maps output by residual groups 3, 4 and 5, with the minimum size n/32 × n/32; the feature map obtained by fusing the feature maps output by residual groups 2, 3 and 4, with size n/16 × n/16; and the feature map obtained by fusing the feature maps output by residual groups 1, 2 and 3, with the maximum size n/8 × n/8; the number of output channels N is calculated by formula (15), specifically as follows:
N=num×(score+location+label) (15)
where num represents the number of prior boxes drawn in each cell; score represents the confidence probability value of each prediction box, one score per box with a value between 0 and 1; location represents the position coordinates of each prediction box and comprises 4 coordinate values (tx, ty, tω, th), the predicted coordinate offsets of each prediction box, from which the center-point coordinates and the width and height of each prediction box are calculated; label represents the probability value output by each prediction box for each target category, the number of these values being the number of target categories to be detected;
422. adding an attention module;
the attention module includes two sub-modules of channel attention and spatial attention that focus on different features in two dimensions, channel and space, respectively.
wherein the channel attention module: for an input n × n × m feature map, features on each channel are first extracted by a global average pooling layer, which outputs a 1 × m vector; then the correlated features between channels are extracted by two successive fully connected layers, whose output dimensions are m/4 and m respectively; a relu activation function follows the first fully connected layer, and a sigmoid activation function follows the second fully connected layer to fix the output values within the range 0-1; finally, the feature map input to the channel attention module is multiplied by the feature map output by the sigmoid layer, and the final output feature map also has size n × n × m;
the spatial attention module: the input n × n × m feature map first passes through two parallel convolution branches with kernel sizes 1 × 9 and 9 × 1 respectively; the pixel values at corresponding positions of the feature maps output by the two branches are added, and a sigmoid function fixes the output values within the range 0-1, giving an attention map of size n × n × 1; finally, the feature map input to the module is multiplied by this 1-dimensional attention map to obtain the final output feature map of size n × n × m.
7. The method for detecting the cross-domain target of the autonomous vehicle based on the improved YOLOv3 as claimed in claim 6, wherein the concrete method of the fifth step is as follows:
after a characteristic extraction backbone network is built, for an input training set picture, loading data, labels, category numbers and prior frame size information into the network, and loading a weight file obtained by training on a coco2014 data set as a pre-training weight of a YOLOv3 model, namely an initial weight parameter of each network layer; forward propagation during training calculates the YOLOv3 loss function: location loss, confidence loss, category loss.
8. The method for detecting the cross-domain target of the autonomous vehicle based on the improved YOLOv3 as claimed in claim 6, wherein the concrete method of the sixth step is as follows:
cutting a test set picture and inputting the cut test set picture into a trained model, and screening redundant frames for all prediction frames output by a backbone network in the test process by adopting a Soft-NMS algorithm, wherein the specific method comprises the following steps:
61) for all the prediction boxes output by the network, B = {b1, b2, ..., bn}, where b1 represents the 1st prediction box, b2 represents the 2nd prediction box, and bn represents the nth prediction box; the corresponding confidence scores are S = {s1, s2, ..., sn}, where s1 represents the confidence score predicted for the 1st prediction box, s2 the confidence score predicted for the 2nd prediction box, and sn the confidence score predicted for the nth prediction box; an intersection-over-union threshold t, a score threshold σ and a confidence threshold α are set;
62) the box with the highest score within the same category among the prediction boxes is found and denoted bm, with score sm; for every box bi other than the highest-scoring box bm, with score si, the IOU value IOU(bi, bm) between bi and bm is calculated; if IOU(bi, bm) ≥ t, then
si = si · e^(−IOU(bi, bm)² / σ)
otherwise, the value of si is unchanged; where si represents the confidence score of the prediction box bi; bi represents any prediction box; bm represents the prediction box with the highest score; σ represents the score threshold;
63) step 62) is repeated for the prediction boxes of the next category until the prediction boxes of all categories of targets have been traversed, at which point the new confidence scores corresponding to all prediction boxes are S' = {s1', s2', ..., sn'};
64) the new confidence scores are screened according to the set confidence threshold α: for the score sj' of any prediction box, if sj' < α, the box bj is suppressed and not output.
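A minimal per-class numpy sketch of the Soft-NMS screening in steps 61)-64). The Gaussian decay e^(−IOU²/σ) applied when the IOU exceeds t matches the reconstruction of the formula above and is an assumption, as are the default threshold values.

```python
import numpy as np

def iou(box, boxes):
    """IOU between one box and an array of boxes, all given as (xmin, ymin, xmax, ymax)."""
    x1 = np.maximum(box[0], boxes[:, 0]); y1 = np.maximum(box[1], boxes[:, 1])
    x2 = np.minimum(box[2], boxes[:, 2]); y2 = np.minimum(box[3], boxes[:, 3])
    inter = np.clip(x2 - x1, 0, None) * np.clip(y2 - y1, 0, None)
    area = lambda b: (b[..., 2] - b[..., 0]) * (b[..., 3] - b[..., 1])
    return inter / (area(np.asarray(box)) + area(boxes) - inter)

def soft_nms(boxes, scores, t=0.5, sigma=0.5, alpha=0.001):
    """Decay the scores of boxes overlapping the current best box, then drop boxes below alpha."""
    boxes, scores = boxes.astype(float).copy(), scores.astype(float).copy()
    keep_boxes, keep_scores = [], []
    idx = np.arange(len(boxes))
    while len(idx) > 0:
        m = idx[np.argmax(scores[idx])]                 # highest-scoring remaining box b_m
        keep_boxes.append(boxes[m]); keep_scores.append(scores[m])
        idx = idx[idx != m]
        if len(idx) == 0:
            break
        overlaps = iou(boxes[m], boxes[idx])
        decay = np.where(overlaps >= t, np.exp(-overlaps ** 2 / sigma), 1.0)
        scores[idx] *= decay                            # s_i shrinks instead of being removed outright
        idx = idx[scores[idx] >= alpha]                 # suppress boxes whose new score falls below alpha
    return np.array(keep_boxes), np.array(keep_scores)
```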
9. The method for detecting the cross-domain target of the autonomous vehicle based on the improved YOLOv3 as claimed in claim 6, wherein the concrete method of the seventh step is as follows:
and for two data fields with different distributions, the data field which is easy to collect is taken as a source field and marked, the source field data is generated into false pictures with a target field style through a CycleGAN model, the improved YOLOv3 model is trained by the false pictures, and the trained model is used for detecting the target object in the target field picture.
CN202110068030.6A 2021-01-19 2021-01-19 Improved YOLOv 3-based cross-domain target detection method for automatic driving automobile Active CN112800906B (en)


Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202110068030.6A CN112800906B (en) 2021-01-19 2021-01-19 Improved YOLOv 3-based cross-domain target detection method for automatic driving automobile

Publications (2)

Publication Number Publication Date
CN112800906A true CN112800906A (en) 2021-05-14
CN112800906B CN112800906B (en) 2022-08-30

Family

ID=75810387

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202110068030.6A Active CN112800906B (en) 2021-01-19 2021-01-19 Improved YOLOv 3-based cross-domain target detection method for automatic driving automobile

Country Status (1)

Country Link
CN (1) CN112800906B (en)



Patent Citations (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN109978035A (en) * 2019-03-18 2019-07-05 西安电子科技大学 Pedestrian detection method based on improved k-means and loss function
CN111091151A (en) * 2019-12-17 2020-05-01 大连理工大学 Method for generating countermeasure network for target detection data enhancement
CN111012301A (en) * 2019-12-19 2020-04-17 北京理工大学 Head-mounted visual accurate aiming system
CN111680556A (en) * 2020-04-29 2020-09-18 平安国际智慧城市科技股份有限公司 Method, device and equipment for identifying vehicle type at traffic gate and storage medium

Cited By (10)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN113642418A (en) * 2021-07-23 2021-11-12 南京富岛软件有限公司 Improved intelligent identification method for safety protection in 5T operation and maintenance
CN113837087A (en) * 2021-09-24 2021-12-24 上海交通大学宁波人工智能研究院 Animal target detection system and method based on YOLOv3
CN113837087B (en) * 2021-09-24 2023-08-29 上海交通大学宁波人工智能研究院 Animal target detection system and method based on YOLOv3
CN113822248A (en) * 2021-11-23 2021-12-21 江苏金晓电子信息股份有限公司 Cross-domain vehicle detection method for generating countermeasure network based on cycleGAN
CN114863426A (en) * 2022-05-05 2022-08-05 北京科技大学 Micro target detection method for coupling target feature attention and pyramid
CN114863426B (en) * 2022-05-05 2022-12-13 北京科技大学 Micro target detection method for coupling target feature attention and pyramid
CN116246128A (en) * 2023-02-28 2023-06-09 深圳市锐明像素科技有限公司 Training method and device of detection model crossing data sets and electronic equipment
CN116246128B (en) * 2023-02-28 2023-10-27 深圳市锐明像素科技有限公司 Training method and device of detection model crossing data sets and electronic equipment
CN116883681A (en) * 2023-08-09 2023-10-13 北京航空航天大学 Domain generalization target detection method based on countermeasure generation network
CN116883681B (en) * 2023-08-09 2024-01-30 北京航空航天大学 Domain generalization target detection method based on countermeasure generation network

Also Published As

Publication number Publication date
CN112800906B (en) 2022-08-30

Similar Documents

Publication Publication Date Title
CN112800906B (en) Improved YOLOv 3-based cross-domain target detection method for automatic driving automobile
Garcia-Garcia et al. A survey on deep learning techniques for image and video semantic segmentation
CN110111335B (en) Urban traffic scene semantic segmentation method and system for adaptive countermeasure learning
WO2022083784A1 (en) Road detection method based on internet of vehicles
CN110163187B (en) F-RCNN-based remote traffic sign detection and identification method
CN110796168A (en) Improved YOLOv 3-based vehicle detection method
Cao et al. A low-cost pedestrian-detection system with a single optical camera
CN109543695A (en) General density people counting method based on multiple dimensioned deep learning
CN108537824B (en) Feature map enhanced network structure optimization method based on alternating deconvolution and convolution
CN107609602A (en) A kind of Driving Scene sorting technique based on convolutional neural networks
CN107392131A (en) A kind of action identification method based on skeleton nodal distance
Turay et al. Toward performing image classification and object detection with convolutional neural networks in autonomous driving systems: A survey
CN113255589B (en) Target detection method and system based on multi-convolution fusion network
CN110263786A (en) A kind of road multi-targets recognition system and method based on characteristic dimension fusion
CN112434723B (en) Day/night image classification and object detection method based on attention network
Lian et al. A dense Pointnet++ architecture for 3D point cloud semantic segmentation
CN111008979A (en) Robust night image semantic segmentation method
CN115661777A (en) Semantic-combined foggy road target detection algorithm
CN115690549A (en) Target detection method for realizing multi-dimensional feature fusion based on parallel interaction architecture model
CN114973199A (en) Rail transit train obstacle detection method based on convolutional neural network
Zimmer et al. Real-time and robust 3d object detection within road-side lidars using domain adaptation
CN104008374B (en) Miner&#39;s detection method based on condition random field in a kind of mine image
CN113902753A (en) Image semantic segmentation method and system based on dual-channel and self-attention mechanism
CN112668662A (en) Outdoor mountain forest environment target detection method based on improved YOLOv3 network
CN110852255A (en) Traffic target detection method based on U-shaped characteristic pyramid

Legal Events

Date Code Title Description
PB01 Publication
PB01 Publication
SE01 Entry into force of request for substantive examination
SE01 Entry into force of request for substantive examination
GR01 Patent grant
GR01 Patent grant