CN110363122B - Cross-domain target detection method based on multi-layer feature alignment - Google Patents

Cross-domain target detection method based on multi-layer feature alignment

Info

Publication number
CN110363122B
Authority
CN
China
Prior art keywords
feature
target
default
layer
frame
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN201910594012.4A
Other languages
Chinese (zh)
Other versions
CN110363122A (en)
Inventor
王蒙
李威
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Kunming University of Science and Technology
Original Assignee
Kunming University of Science and Technology
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Kunming University of Science and Technology filed Critical Kunming University of Science and Technology
Priority to CN201910594012.4A priority Critical patent/CN110363122B/en
Publication of CN110363122A publication Critical patent/CN110363122A/en
Application granted granted Critical
Publication of CN110363122B publication Critical patent/CN110363122B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical


Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F 18/00 Pattern recognition
    • G06F 18/20 Analysing
    • G06F 18/21 Design or setup of recognition systems or techniques; Extraction of features in feature space; Blind source separation
    • G06F 18/214 Generating training patterns; Bootstrap methods, e.g. bagging or boosting
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N 3/00 Computing arrangements based on biological models
    • G06N 3/02 Neural networks
    • G06N 3/04 Architecture, e.g. interconnection topology
    • G06N 3/045 Combinations of networks
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06V IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V 10/00 Arrangements for image or video recognition or understanding
    • G06V 10/20 Image preprocessing
    • G06V 10/25 Determination of region of interest [ROI] or a volume of interest [VOI]
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06V IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V 20/00 Scenes; Scene-specific elements
    • G06V 20/50 Context or environment of the image
    • G06V 20/56 Context or environment of the image exterior to a vehicle by using sensors mounted on the vehicle
    • G06V 20/58 Recognition of moving objects or obstacles, e.g. vehicles or pedestrians; Recognition of traffic objects, e.g. traffic signs, traffic lights or roads
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06V IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V 20/00 Scenes; Scene-specific elements
    • G06V 20/50 Context or environment of the image
    • G06V 20/56 Context or environment of the image exterior to a vehicle by using sensors mounted on the vehicle
    • G06V 20/58 Recognition of moving objects or obstacles, e.g. vehicles or pedestrians; Recognition of traffic objects, e.g. traffic signs, traffic lights or roads
    • G06V 20/584 Recognition of moving objects or obstacles, e.g. vehicles or pedestrians; Recognition of traffic objects, e.g. traffic signs, traffic lights or roads of vehicle lights or traffic lights

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • General Physics & Mathematics (AREA)
  • Data Mining & Analysis (AREA)
  • Multimedia (AREA)
  • General Engineering & Computer Science (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Artificial Intelligence (AREA)
  • Evolutionary Computation (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • General Health & Medical Sciences (AREA)
  • Bioinformatics & Computational Biology (AREA)
  • Health & Medical Sciences (AREA)
  • Biomedical Technology (AREA)
  • Biophysics (AREA)
  • Computational Linguistics (AREA)
  • Evolutionary Biology (AREA)
  • Molecular Biology (AREA)
  • Computing Systems (AREA)
  • Mathematical Physics (AREA)
  • Software Systems (AREA)
  • Bioinformatics & Cheminformatics (AREA)
  • Image Analysis (AREA)

Abstract

The invention discloses a cross-domain target detection method based on multi-layer feature alignment. First, a detector is trained on a source domain dataset with bounding-box annotations using a deep convolutional neural network. The trained detector is then used as a pre-trained model, and features are extracted from images of the source domain and of the target domain (which has no bounding-box annotations) through the deep convolutional neural network VGG-16, with the source domain and the target domain sharing feature parameters. Next, a domain classifier is designed: the extracted multi-layer feature maps of the source and target domains are fed to the domain classifier, which judges whether each feature map comes from the source domain or the target domain. Then, through adversarial (GAN-style) training, the feature distributions of the source and target domains are aligned, reducing the data shift between the two domains. Finally, the detector and the discriminator are trained jointly to obtain the final model. The method thus transfers source-domain knowledge to the target domain and improves detection accuracy on target-domain data without bounding-box annotations.

Description

Cross-domain target detection method based on multi-layer feature alignment
Technical Field
The invention relates to the field of computer vision, in particular to a cross-domain target detection method based on multilayer feature alignment.
Background Art
In recent years, with the development of deep neural networks, data-driven computer vision has made significant progress. Object detection, as a basic task of computer vision, has attracted wide attention from researchers and plays an increasingly important role in people's intelligent life. For example, in intelligent driving, 3D object detection can rapidly locate passing vehicles and the surrounding scene, so that the car can avoid obstacles and run smoothly; in intelligent retail, accurate object detection can be used to check the goods on a shelf, realizing unattended sales and saving labor; in an intelligent community, license plate detection can quickly identify passing vehicles and realize automatic gate release. In summary, object detection is becoming an indispensable part of our lives.
Despite significant advances in data-driven object detection, practical production applications still face many difficulties. A target detection algorithm based on a deep convolutional neural network needs a large amount of bounding-box and classification annotation, and in real business scenarios a great deal of manpower and material resources must be spent from data collection to annotation. On the one hand, for many scenes, such as road scenes in good weather, a large amount of data has already been collected and good results have been achieved in the related businesses. On the other hand, in some special cases, such as road scene pictures in rare haze weather, data are difficult to collect. How to apply the existing massive data to new business scenarios so as to save manpower and material resources is therefore a pressing problem.
Currently, there are two main solutions: weakly supervised target detection methods and methods based on domain alignment. Weakly supervised target detection uses only classification labels, so its localization accuracy is poor in practice and it is hard to use. The other is the domain-alignment method based on transfer learning. In such methods, the distance between the source domain and the target domain is usually reduced by minimizing a first- or second-order statistical distance, or by projecting the source domain and the target domain into a common feature space. However, because deep convolutional neural networks project both the source domain and the target domain into a high-dimensional data space, such distance-metric-based approaches are often ineffective and sometimes even counterproductive. With the development of deep generative adversarial networks, adversarial training has been used for feature alignment in domain adaptation. Inspired by this, the present invention merges a generative adversarial network into the target detection network and trains the detection network and the adversarial network jointly, so that during training the source domain features are aligned with the target domain features and the detection model trained on the source domain generalizes to the target domain data.
Object detection, as a basic task in computer vision, mainly involves two processes, localization and recognition. Since R-CNN was proposed, detection models based on convolutional neural networks have developed rapidly. At present there are two main types: target detection methods based on candidate regions and methods without candidate regions. In candidate-region-based detection models, detection is divided into two steps. Candidate regions of interest are first obtained with a region proposal network, mainly by judging whether a region contains an object to be detected. After the candidate regions are obtained, each region is classified and its coordinate position is obtained by classification and regression. Representative detectors of this kind are R-CNN, Fast R-CNN and Mask R-CNN; they are accurate but slow. Methods without candidate regions mainly divide the picture into a grid and then perform regression and classification directly, realizing localization and recognition in a single pass; typical representatives are YOLO and SSD. Such methods are faster but less accurate due to the absence of the candidate-region extraction step. Although these fully supervised detection methods achieve good results, they rely heavily on large amounts of annotated data, cannot effectively utilize the large amounts of data that already exist, and require a great deal of manpower and material resources for data collection and annotation.
Disclosure of Invention
The invention mainly solves the technical problem of providing a multi-layer domain-adaptive cross-domain target detection method. By adding a generative adversarial network, the discriminator cannot distinguish whether a feature layer comes from the source domain or the target domain, so the feature distributions of the source and target domains become similar; the annotation knowledge of the source domain dataset is thereby transferred to the target domain, improving the performance of the detector on the target domain.
The invention provides a cross-domain target detection method based on multi-layer feature alignment, which comprises: preprocessing the source domain data and target domain data, training a detection model with the source domain data, extracting and aligning the source and target domain features, and jointly training the domain-adaptive network and the detection network to obtain the final detection model. Specifically, the method comprises the following steps:
Step1: preprocess the source domain data and target domain data. Specifically, the data are augmented by adjusting brightness, saturation and hue, randomly cropping a fixed area, randomly cropping a random size, randomly flipping by a fixed angle and randomly flipping by a random angle; the augmented pictures are then normalized and resized to a fixed size of 300 × 300.
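For illustration only, a minimal sketch of such an augmentation pipeline using torchvision is given below. The jitter strengths, crop scales and normalization statistics are assumptions chosen for the example, and a real detection pipeline must also transform the annotation boxes consistently with the image, which this sketch omits.

```python
# Illustrative preprocessing/augmentation sketch (assumed parameter values).
import torchvision.transforms as T

def build_preprocess(train=True):
    augment = [
        T.ColorJitter(brightness=0.3, saturation=0.3, hue=0.05),  # brightness/saturation/hue jitter
        T.RandomResizedCrop(300, scale=(0.5, 1.0)),               # random crop of random size
        T.RandomHorizontalFlip(p=0.5),                            # random flip
    ] if train else [T.Resize((300, 300))]
    return T.Compose(augment + [
        T.ToTensor(),
        T.Normalize(mean=[0.485, 0.456, 0.406],                   # ImageNet statistics, assumed
                    std=[0.229, 0.224, 0.225]),
    ])
```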
Step2: inputting the preprocessed source domain data into an SSD detection model, carrying out end-to-end training on the source domain data with the labeled information to obtain a fully supervised initial detection model, and using the fully supervised initial detection model as a pre-training model for the next training;
the specific process is as follows: an SSD detection model is constructed, taking VGG16 with the fully connected layers removed as the base network model and then adding six convolutional layers Conv6, Conv7, Conv8, Conv9, Conv10 and Conv11; Conv4_3, Conv7, Conv8_2, Conv9_2, Conv10_2 and Conv11_2 are extracted as feature layers; the preprocessed source domain data is input into the SSD detection model for feature extraction, 6 default boxes with different scales and aspect ratios are generated at each pixel of each feature layer, and the default boxes are matched with the annotated boxes to obtain the positive and negative samples used in training;
firstly, the default box with the largest intersection-over-union (IoU) with each annotated box is matched, and then any default box whose IoU with an annotated box is greater than 0.5 is also matched;
specifically, the default box size on the feature maps of the Conv7, Conv8_2, Conv9_2, Conv10_2 and Conv11_2 layers, as a ratio of the original image, is:
s_k = s_{min} + \frac{s_{max} - s_{min}}{u - 1}(k - 1), \quad k \in [1, u]

where u = 5 denotes the number of prediction layers, s_{min} = 0.2, s_{max} = 0.9, and s_k is the ratio of the default box size of the k-th prediction layer to the original image; the ratio of the default box size on the prediction layer Conv4_3 to the original image is 0.1, that is, for k = 6, s_k = 0.1. Let S_k denote the default box size of the k-th prediction layer; then S_k = 300 × s_k.
For the default box aspect ratios, α_r ∈ {1, 2, 3, 1/2, 1/3} is taken, so the first five default boxes of the k-th layer have widths and heights

w_k^a = S_k \sqrt{\alpha_r} \quad and \quad h_k^a = S_k / \sqrt{\alpha_r}

The sixth default box S'_k of the k-th layer has aspect ratio 1 and size

S'_k = \sqrt{S_k S_{k+1}}

with S_7 = 312, so that each pixel of each prediction layer generates different default boxes according to the set sizes and aspect ratios;
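As a sketch of how the default-box scales and shapes defined by the formulas above can be computed (an illustrative reading of those formulas, not the patent's own code; the helper names are assumptions):

```python
import math

def default_box_scales(u=5, s_min=0.2, s_max=0.9, img_size=300):
    """Absolute default-box sizes S_k = 300 * s_k: the fixed 0.1 ratio for Conv4_3,
    followed by the u evenly spaced ratios for Conv7 ... Conv11_2."""
    ratios = [0.1] + [s_min + (s_max - s_min) / (u - 1) * (k - 1) for k in range(1, u + 1)]
    return [r * img_size for r in ratios]

def default_box_shapes(S_k, S_k_next, aspect_ratios=(1, 2, 3, 1/2, 1/3)):
    """Widths/heights of the 6 default boxes of one layer: five from the aspect ratios
    (w = S_k * sqrt(a), h = S_k / sqrt(a)), plus the extra ratio-1 box of size sqrt(S_k * S_{k+1})."""
    boxes = [(S_k * math.sqrt(a), S_k / math.sqrt(a)) for a in aspect_ratios]
    extra = math.sqrt(S_k * S_k_next)
    boxes.append((extra, extra))
    return boxes

sizes = default_box_scales()          # [30.0, 60.0, 112.5, 165.0, 217.5, 270.0]
sizes.append(312)                     # S_7 = 312 for the extra box of the last layer
boxes_conv7 = default_box_shapes(sizes[1], sizes[2])   # 6 (w, h) pairs for the Conv7 layer
```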
in addition, each pixel of each prediction feature layer is fed through a convolution of size 3 × 3 × q_m to further obtain a classification confidence score and a regression offset, where q_m is the number of channels of the m-th feature layer;
during the training process, the objective loss function is:
L(x, c, l, g) = \frac{1}{N}(L_{conf}(x, c) + \alpha L_{loc}(x, l, g))

where N is the number of default boxes matched with annotated boxes, L_loc is the localization loss function, L_conf is the classification-confidence loss, α is a regularization parameter, x is the input image, c is the target class, l is the model-predicted box, and g is the annotated box. The localization loss regresses the position offsets of the default boxes. A box is defined as {cx, cy, w, h}, where (cx, cy) are the box center coordinates and w and h are respectively its width and height; with the default box denoted d, L_loc is:

L_{loc}(x, l, g) = \sum_{i \in Pos}^{N} \sum_{m \in \{cx, cy, w, h\}} x_{ij}^{k} \, smooth_{L1}(l_i^m - \hat{g}_j^m)

where

\hat{g}_j^{cx} = (g_j^{cx} - d_i^{cx}) / d_i^{w}, \quad \hat{g}_j^{cy} = (g_j^{cy} - d_i^{cy}) / d_i^{h}, \quad \hat{g}_j^{w} = \log(g_j^{w} / d_i^{w}), \quad \hat{g}_j^{h} = \log(g_j^{h} / d_i^{h})

(g_j^{cx}, g_j^{cy}), g_j^{w} and g_j^{h} are respectively the center coordinates, width and height of the j-th annotated box, and (d_i^{cx}, d_i^{cy}), d_i^{w} and d_i^{h} are respectively the center coordinates, width and height of the i-th default box. smooth_{L1} is the smooth L1 function:

smooth_{L1}(x) = 0.5 x^2, if |x| < 1; |x| - 0.5, otherwise

x_{ij}^{k} \in \{0, 1\} is an indicator: when x_{ij}^{k} = 1, the i-th default box is matched with the j-th annotated box of class k and the default box is placed in the positive sample set Pos; otherwise x_{ij}^{k} = 0.
The classification loss L_conf is the softmax loss function:
L_{conf}(x, c) = -\sum_{i \in Pos}^{N} x_{ij}^{p} \log(\hat{c}_i^{p}) - \sum_{i \in Neg} \log(\hat{c}_i^{0}), \quad with \quad \hat{c}_i^{p} = \frac{\exp(c_i^{p})}{\sum_p \exp(c_i^{p})}

where \hat{c}_i^{p} is the probability that the i-th default box belongs to class p, \hat{c}_i^{0} is the classification score when the i-th default box is predicted as the background, Neg denotes the matched negative sample set, i.e. the background, Pos denotes the matched positive sample set, and p is the class to be detected. When x_{ij}^{p} = 1, the i-th default box is matched with the j-th annotated box of class p and the default box is placed in the positive sample set Pos; otherwise x_{ij}^{p} = 0 and the default box is placed in the negative sample set Neg;
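To make the objective above concrete, here is a minimal sketch of the combined loss (smooth-L1 localization plus softmax confidence loss, weighted by α and normalized by the number N of matched default boxes). The tensor layout is an assumption, and SSD's hard negative mining is omitted for brevity:

```python
import torch.nn.functional as F

def multibox_loss(loc_pred, conf_pred, loc_target, cls_target, alpha=1.0):
    """
    loc_pred:   (B, M, 4)  predicted offsets for M default boxes
    conf_pred:  (B, M, C)  classification scores (class 0 = background)
    loc_target: (B, M, 4)  encoded ground-truth offsets g_hat for matched boxes
    cls_target: (B, M)     matched class index per default box (long tensor, 0 = background)
    """
    pos = cls_target > 0                                 # positive (matched) default boxes
    n = pos.sum().clamp(min=1).float()                   # N, number of matched default boxes

    # Localization loss: smooth L1 over positive boxes only
    l_loc = F.smooth_l1_loss(loc_pred[pos], loc_target[pos], reduction="sum")

    # Confidence loss: softmax cross-entropy over all boxes (positives and negatives)
    l_conf = F.cross_entropy(conf_pred.reshape(-1, conf_pred.size(-1)),
                             cls_target.reshape(-1), reduction="sum")

    return (l_conf + alpha * l_loc) / n
```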
inputting the preprocessed source domain data into the SSD detection model, training through the above processes to obtain a fully supervised detection model, and taking the trained model as a pre-training model for next training.
Step3: performing feature extraction on the preprocessed source domain data and target domain data using the pre-trained model; specifically, feature extraction is performed on the preprocessed source domain data and target domain data at the convolutional layers Conv1, Conv2, Conv3 and Conv4 of the pre-trained model.
Step4: using a generative adversarial network to realize multi-layer feature alignment of the source domain and the target domain; after the feature extraction of Step3, the source domain feature F_m(x_S) and the target domain feature F_m(x_T) are obtained, where m denotes the m-th convolutional feature layer; a generative adversarial network is then used to align the source and target domain features, where the discriminator of the adversarial network is D_m, a binary classification network, and the feature extractor producing F_m(x_T) acts as the generator; for the discriminator, the source domain class is set to 1 and the target domain class to 0, so the discriminator's training objective at the m-th layer is denoted
L_D^m, as a binary cross-entropy loss function:

L_D^m = -E_{x_S \sim p(x_S)}[\log D_m(F_m(x_S))] - E_{x_T \sim p(x_T)}[\log(1 - D_m(F_m(x_T)))]

where x_S is sampled from the source domain data distribution p(x_S), x_T is sampled from the target domain data distribution p(x_T), E(·) denotes the expectation, D_m(F_m(x_S)) denotes the probability that the m-th layer feature of the source domain data x_S belongs to the real data, and D_m(F_m(x_T)) denotes the probability that the m-th layer feature of the target domain data belongs to the generated data.

In the generator, the source domain dataset is treated as the real data and the target domain data as the generated data, so the generator objective at the m-th layer, L_F^m, is:

L_F^m = -E_{x_T \sim p(x_T)}[\log D_m(F_m(x_T))]

Combining the two equations above: when the discriminator's input features come from the source domain, D_m(F_m(x_S)) should approach 1; when they come from the target domain, D_m(F_m(x_T)) should approach 0. For the generator, i.e. the feature extraction network producing F_m(x_T), the goal is to make D_m(F_m(x_T)) approach 1. Training the m-th feature layer is therefore a minimax game with value function V(D_m, F_m):

\min_{F_m} \max_{D_m} V(D_m, F_m) = E_{x_S \sim p(x_S)}[\log D_m(F_m(x_S))] + E_{x_T \sim p(x_T)}[\log(1 - D_m(F_m(x_T)))]

that is, during training the discriminator D_m is updated to minimize L_D^m while the feature extractor F_m is updated to minimize L_F^m, alternately.
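The two adversarial objectives above reduce to binary cross-entropy terms with domain labels 1 (source) and 0 (target). A minimal sketch follows, assuming the discriminator D_m outputs probabilities (e.g. ends with a sigmoid); the features are detached in the discriminator loss so that only D_m is updated by it:

```python
import torch
import torch.nn.functional as F

def discriminator_loss(D_m, feat_src, feat_tgt):
    """L_D^m: classify source-domain features as 1 and target-domain features as 0."""
    pred_src = D_m(feat_src.detach())        # detach: this loss updates only D_m
    pred_tgt = D_m(feat_tgt.detach())
    return (F.binary_cross_entropy(pred_src, torch.ones_like(pred_src)) +
            F.binary_cross_entropy(pred_tgt, torch.zeros_like(pred_tgt)))

def generator_loss(D_m, feat_tgt):
    """L_F^m: push target-domain features toward the source label 1, updating the feature extractor."""
    pred_tgt = D_m(feat_tgt)
    return F.binary_cross_entropy(pred_tgt, torch.ones_like(pred_tgt))
```

Alternating these two updates realizes the minimax game: D_m learns to separate the domains while the feature extractor learns target features that D_m can no longer distinguish from source features.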
Step5: the adversarial generative network, i.e. the alignment network model, and the pre-trained model are trained jointly to obtain the final detection model, well trained on the source domain dataset; the loss functions in the training process are as follows:
L_D = \sum_m L_D^m, \quad L_F = L_{det} + \sum_m \lambda_m L_F^m

where L_D is the sum of the discriminator loss functions L_D^m of the different feature layers, L_F is the sum of the detection loss L_det on the source domain data and the generator loss functions L_F^m of the different feature layers, and λ_m are the weights of the adversarial losses of the different feature layers.
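Read this way, the joint objective splits into a discriminator objective and a detector/generator objective. A small helper combining them is sketched below; the placement of the weights λ_m follows the description above, and the function and argument names are assumptions:

```python
def joint_losses(det_loss, adv_terms, lambdas):
    """
    det_loss:  detection loss L_det on source-domain data
    adv_terms: dict m -> (L_D^m, L_F^m), adversarial losses of each aligned feature layer
    lambdas:   dict m -> lambda_m, weight of that layer's adversarial loss
    Returns (L_D, L_F): the discriminator objective and the detector/generator objective.
    """
    L_D = sum(d for d, _ in adv_terms.values())
    L_F = det_loss + sum(lambdas[m] * g for m, (_, g) in adv_terms.items())
    return L_D, L_F
```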
The invention has the beneficial effects that:
according to the method, the confrontation generation domain self-adaptive network is added into the detection model, so that the source domain and the target domain are similar in feature distribution, and the detection performance of the detector with well-trained source domain data on the target domain is improved.
In step1, the data are preprocessed, increasing the amount and diversity of the data so that the model fits and generalizes better.
In step2, a well-performing detector can be obtained by training on the annotated source domain dataset. This model is then used as a pre-trained model for fine-tuning, so the annotation information in the source domain data is fully utilized while transfer learning is carried out through fine-tuning.
In step3 and step4, different information is obtained by extracting features from different convolutional layers. Generally, the shallow layers of a deep convolutional neural network produce relatively generic features, mainly carrying spatial position information of the image with little semantic information; the deep layers are more task-specific, mainly carrying semantic information with little spatial position information. Extracting features from different convolutional layers and feeding each into its own discriminator achieves feature-distribution alignment at different levels.
In step5, the adversarial generative network and the detector are jointly trained, so that while the detector is trained on source domain data, the feature distribution of the target domain is continuously drawn closer to that of the source domain, improving the generalization of the detector on target domain data.
Drawings
FIG. 1 is a schematic flow chart of a preferred embodiment of the present invention;
FIG. 2 is a basic detection model SSD of an embodiment of the present invention;
FIG. 3 shows the discriminators used in an embodiment of the present invention, where (a) is the D_4 network model and (b) is the D_10 network model.
Detailed Description
The following description of the embodiments of the present invention with reference to the accompanying drawings is provided to make the advantages and features of the present invention more comprehensible to those skilled in the art, and to make the scope of the present invention more clearly defined.
Example 1: The invention mainly relates to a target detection method based on multi-layer feature alignment, which integrates an adversarial mechanism into an existing detection model such as SSD or YOLO and jointly trains the adversarial generative network and the detector, so that the multi-layer feature distributions of the source domain data and the target domain data become similar, further improving the detector's performance on the target domain dataset.
The invention has wide application fields. For example, in intelligent driving it can be applied to detection tasks in different scenes: by aligning the feature distributions of a large amount of annotated scene data with those of unseen scenes, detection can be migrated across scenes, the robustness of the detector is improved, and the cost of annotating massive data is reduced. A concrete use case is illustrated with scene detection under different weather conditions in automatic driving. The source domain data are easily obtained images in good weather and the target domain data are images in haze weather; 2795 source-domain and target-domain images are provided, the target domain provides a 500-image test set, and the source and target domains share 8 detection categories. The experiments use Ubuntu 18.04, an Intel i7-8700K CPU at 3.7 GHz × 6 cores, Python 3.6, an NVIDIA GeForce RTX 2070 graphics card, and the PyTorch 1.0 deep learning framework.
The specific implementation process is as follows:
Step1: Input the source domain data x_S and the target domain data x_T and preprocess them. The main methods are brightness, saturation and hue adjustment, random cropping of a fixed area, random cropping of a random size, random flipping by a fixed angle, random flipping by a random angle, and so on. The augmented pictures are then normalized, and the pictures fed into the deep convolutional neural network are resized to a fixed size of 300 × 300.
Step2: The preprocessed source domain dataset is fed to the detector SSD, whose specific model structure is shown in fig. 2, and an initial detection model is obtained by end-to-end training on the source domain data with annotation information. In SSD training, the batch size is 16 and there are 44000 iterations in total; SGD is used for optimization with an initial learning rate of 0.01. After 28000 iterations the learning rate is reduced to 0.1 times the initial value, and after 36000 iterations it is reduced by a factor of 0.1 again, until the final detection model is obtained.
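A sketch of the optimizer and step-decay schedule just described (SGD, initial learning rate 0.01, decayed by 0.1 at 28000 and again at 36000 of the 44000 iterations). Here ssd_model, its loss method and source_loader are placeholders, and the momentum/weight-decay values are assumptions:

```python
import torch

optimizer = torch.optim.SGD(ssd_model.parameters(), lr=0.01,
                            momentum=0.9, weight_decay=5e-4)        # momentum/decay assumed
scheduler = torch.optim.lr_scheduler.MultiStepLR(optimizer,
                                                 milestones=[28000, 36000], gamma=0.1)

for it in range(44000):
    images, targets = next(source_loader)    # batch of 16 labeled source-domain images
    loss = ssd_model.loss(images, targets)   # SSD multibox loss of Step2
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    scheduler.step()                          # learning-rate decay keyed to iteration count
```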
Specifically, feature maps of the Conv4_3, Conv7, Conv8_2, Conv9_2, Conv10_2 and Conv11_2 layers are extracted, with sizes 38 × 38, 19 × 19, 10 × 10, 5 × 5, 3 × 3 and 1 × 1 respectively, and then 6 default boxes with different scales and aspect ratios are generated at each pixel of each feature map. The ratio of the default box size on the Conv7, Conv8_2, Conv9_2, Conv10_2 and Conv11_2 feature maps to the original image is given by formula (1):
s_k = s_{min} + \frac{s_{max} - s_{min}}{u - 1}(k - 1), \quad k \in [1, u]    (1)

where u = 5 is the number of prediction layers, s_{min} = 0.2, s_{max} = 0.9, and s_k is the ratio of the default box size of the k-th prediction layer to the original image. The ratio for the prediction layer Conv4_3 is fixed at 0.1. Let S_k denote the default box size of the k-th prediction layer; then S_k = 300 × s_k.

For the default box aspect ratios, α_r ∈ {1, 2, 3, 1/2, 1/3} is typically taken, so the first five default boxes of the k-th layer have widths and heights

w_k^a = S_k \sqrt{\alpha_r} \quad and \quad h_k^a = S_k / \sqrt{\alpha_r}

In addition, the sixth default box S'_k of the k-th layer has aspect ratio 1 and size

S'_k = \sqrt{S_k S_{k+1}}

with S_7 = 312 generally taken. Therefore, each pixel of each prediction layer generates different default boxes according to the set sizes and aspect ratios.
In addition, each pixel of each prediction feature layer is fed through a convolution of size 3 × 3 × q_m to further obtain a classification confidence score and a regression offset, where q_m is the number of channels of the m-th feature layer.
During the training process, the objective loss function is as shown in equation (2):
L(x, c, l, g) = \frac{1}{N}(L_{conf}(x, c) + \alpha L_{loc}(x, l, g))    (2)

where N is the number of default boxes matched with annotated boxes, L_loc is the localization loss function, L_conf is the classification-confidence loss, α is a regularization parameter, x is the input image, c is the target category, l is the model-predicted box, and g is the annotated box. The localization loss regresses the position offsets of the default boxes. A box is defined as {cx, cy, w, h}, where (cx, cy) are the box center coordinates and w and h are respectively its width and height; with the default box denoted d, L_loc is given by formula (3):

L_{loc}(x, l, g) = \sum_{i \in Pos}^{N} \sum_{m \in \{cx, cy, w, h\}} x_{ij}^{k} \, smooth_{L1}(l_i^m - \hat{g}_j^m)    (3)

In formula (3),

\hat{g}_j^{cx} = (g_j^{cx} - d_i^{cx}) / d_i^{w}, \quad \hat{g}_j^{cy} = (g_j^{cy} - d_i^{cy}) / d_i^{h}, \quad \hat{g}_j^{w} = \log(g_j^{w} / d_i^{w}), \quad \hat{g}_j^{h} = \log(g_j^{h} / d_i^{h})

where (g_j^{cx}, g_j^{cy}), g_j^{w} and g_j^{h} are respectively the center coordinates, width and height of the j-th annotated box, and (d_i^{cx}, d_i^{cy}), d_i^{w} and d_i^{h} are respectively the center coordinates, width and height of the i-th default box. smooth_{L1} is the smooth L1 function:

smooth_{L1}(x) = 0.5 x^2, if |x| < 1; |x| - 0.5, otherwise

x_{ij}^{k} \in \{0, 1\} is an indicator: when x_{ij}^{k} = 1, the i-th default box is matched with the j-th real annotated box of category k and the default box is placed in the positive sample set Pos; otherwise x_{ij}^{k} = 0.
The classification loss L_conf is a softmax loss function, as shown in formula (4):
L_{conf}(x, c) = -\sum_{i \in Pos}^{N} x_{ij}^{p} \log(\hat{c}_i^{p}) - \sum_{i \in Neg} \log(\hat{c}_i^{0}), \quad with \quad \hat{c}_i^{p} = \frac{\exp(c_i^{p})}{\sum_p \exp(c_i^{p})}    (4)

where \hat{c}_i^{p} is the probability that the i-th default box belongs to category p, \hat{c}_i^{0} is the classification score when the i-th default box is predicted as the background, Neg denotes the matched negative sample set, i.e. the background, Pos denotes the matched positive sample set, and p is the category to be detected. When x_{ij}^{p} = 1, the i-th default box is matched with the j-th real annotated box of category p and the default box is placed in the positive sample set Pos; otherwise x_{ij}^{p} = 0 and the default box is placed in the negative sample set Neg.
Inputting preprocessed source domain data into a basic detection model SSD, training through the above processes to obtain a fully supervised detection model, and taking the trained model as a pre-training model for next training.
Step3: The detection model obtained in Step2 is taken as the pre-trained model and used to initialize the SSD detection model inside the feature-alignment network model. The overall framework flow is shown in fig. 1. The preprocessed source domain and target domain data are input into the detector's base network VGG16 to obtain the multi-layer source and target domain features F_m(x_S) and F_m(x_T) respectively, where m denotes different convolutional layers. In a deep convolutional neural network, shallow features are relatively generic and contain rich spatial information, while deep features are more specific and contain richer semantic information. Therefore, in the practical implementation, m = 4 and m = 10 are taken to obtain the shallow and deep features, corresponding to Conv2_2 and Conv4_3 in the VGG16 network. The feature layer F_4 has size 75 × 75 and the feature layer F_10 has size 38 × 38.
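One way to realize this step is to tap the shared VGG-16 backbone at the two chosen layers. The sketch below taps torchvision's vgg16().features by index; mapping the patent's 4th and 10th convolutional layers (Conv2_2, Conv4_3) onto indices 8 and 22 is an assumption made for illustration:

```python
import torch
import torchvision

vgg = torchvision.models.vgg16().features   # shared feature extractor (same weights for both domains)
TAP_IDX = {4: 8, 10: 22}                     # assumed positions of the ReLUs after Conv2_2 and Conv4_3

def extract_features(x):
    """Return {m: F_m(x)} for the shallow (m=4) and deep (m=10) feature layers."""
    feats = {}
    for i, layer in enumerate(vgg):
        x = layer(x)
        for m, idx in TAP_IDX.items():
            if i == idx:
                feats[m] = x
    return feats

feats_src = extract_features(torch.randn(1, 3, 300, 300))   # F_m(x_S), source-domain batch
feats_tgt = extract_features(torch.randn(1, 3, 300, 300))   # F_m(x_T), target-domain batch (same weights)
```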
Step4: The discriminator D is designed with a structure similar to PatchGAN. For the feature layers F_4 and F_10, the discriminators D_4 and D_10 are designed separately; their structures are shown in fig. 3. Source domain features are labeled 1 and target domain features are labeled 0, and both are fed to the binary-classification discriminators. The difference is that the shallow feature layer contains more positional/spatial information, so D_4 is designed to classify each pixel of the feature map and consists of three convolutional layers; the deep feature contains rich semantic information, so D_10 applies convolutional layers to F_10 followed by fully connected layers for binary classification, and consists of three convolutional layers and three fully connected layers.
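A minimal sketch of the two discriminators as described: D_4 classifies every pixel of the shallow feature map with three convolutions (PatchGAN-style), while D_10 applies three convolutions followed by three fully connected layers for a single binary output. The channel widths, kernel sizes and the assumed 128×75×75 / 512×38×38 input shapes are illustrative guesses, since FIG. 3 is not reproduced here:

```python
import torch.nn as nn

class D4(nn.Module):
    """Patch-level discriminator for the shallow feature layer F_4 (three convolutional layers)."""
    def __init__(self, in_ch=128):                               # channel count of F_4 assumed
        super().__init__()
        self.net = nn.Sequential(
            nn.Conv2d(in_ch, 64, 3, padding=1), nn.LeakyReLU(0.2, inplace=True),
            nn.Conv2d(64, 64, 3, padding=1),    nn.LeakyReLU(0.2, inplace=True),
            nn.Conv2d(64, 1, 3, padding=1),     nn.Sigmoid(),    # per-pixel source/target probability
        )

    def forward(self, x):
        return self.net(x)

class D10(nn.Module):
    """Image-level discriminator for the deep feature layer F_10 (three convs + three FC layers)."""
    def __init__(self, in_ch=512):                               # F_10 assumed to be 512 x 38 x 38
        super().__init__()
        self.conv = nn.Sequential(
            nn.Conv2d(in_ch, 256, 3, stride=2, padding=1), nn.LeakyReLU(0.2, inplace=True),
            nn.Conv2d(256, 128, 3, stride=2, padding=1),   nn.LeakyReLU(0.2, inplace=True),
            nn.Conv2d(128, 64, 3, stride=2, padding=1),    nn.LeakyReLU(0.2, inplace=True),
        )
        self.fc = nn.Sequential(
            nn.Flatten(),
            nn.Linear(64 * 5 * 5, 256), nn.ReLU(inplace=True),   # 38 -> 19 -> 10 -> 5 spatially
            nn.Linear(256, 64),         nn.ReLU(inplace=True),
            nn.Linear(64, 1),           nn.Sigmoid(),            # single source/target probability
        )

    def forward(self, x):
        return self.fc(self.conv(x))
```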
Step5: Joint training is performed on the discriminators and the detector SSD, with the following training loss functions:
L_D = \sum_{m \in \{4, 10\}} L_D^m, \quad L_F = L_{det} + \sum_{m \in \{4, 10\}} \lambda_m L_F^m

During the joint training of the discriminators and the detector SSD, λ_4 and λ_10 are set to 0.1 and 0.2 respectively. For the SSD model, SGD is used for optimization with a learning rate of 0.00001. For the discriminators D_4 and D_10, the Adam algorithm is used for optimization with a learning rate of 0.0001. The batch size is 8 and the number of iterations is 20000.
Through the above steps, a more robust detector is finally obtained. Table 1 gives the detection performance on the target domain data before and after feature alignment, showing that the detection accuracy improves after feature alignment.
TABLE 1 Detection performance on target domain data before and after feature alignment

Method                    | Original SSD model | Feature layer 4 aligned | Feature layer 10 aligned | Feature layers 4 and 10 aligned
Detection accuracy (mAP)  | 16.9               | 22.3                    | 24.5                     | 27.9
Compared with the prior art, in this embodiment an initial detector is trained on the SSD detection model using the source domain dataset with annotation information, and the detector is then used as a pre-trained model to extract features from the source domain and the target domain respectively. The source and target domains share parameters during feature extraction, and their features are fed into separate discriminator models. Considering that features from different convolutional layers differ greatly, a shallow feature layer and a deep feature layer are simultaneously used as discriminator inputs, and the discriminators and the detector are trained jointly, achieving feature alignment of the source and target domains at different levels; the detection model trained on source domain data can thus be generalized to the target domain, improving its performance on target domain data without annotation information.
The above description is only an embodiment of the present invention, and not intended to limit the scope of the present invention, and all modifications of equivalent structures and equivalent processes performed by the present specification and drawings, or directly or indirectly applied to other related technical fields, are included in the scope of the present invention.

Claims (4)

1. A cross-domain target detection method based on multi-layer feature alignment is characterized by comprising the following steps:
step1: preprocessing source domain data and target domain data;
step2: inputting the preprocessed source domain data into an SSD detection model, performing end-to-end training on the source domain data with the labeled information to obtain a fully supervised initial detection model, and using the fully supervised initial detection model as a pre-training model for the next training;
step3: performing feature extraction on the preprocessed source domain data and target domain data by using a pre-training model;
step4: using a countermeasure generation network to realize multi-layer feature alignment of a source domain and a target domain;
the Step4 is specifically as follows: after the feature extraction of Step3, the source domain feature F_m(x_S) and the target domain feature F_m(x_T) are obtained, where m denotes the m-th convolutional feature layer; a generative adversarial network is then used to align the source and target domain features, where the discriminator of the adversarial network is D_m, a binary classification network, and the feature extractor F_m acts as the generator; for the discriminator, the source domain class is set to 1 and the target domain class to 0, so the discriminator's training objective at the m-th layer is denoted
L_D^m, as a binary cross-entropy loss function:

L_D^m = -E_{x_S \sim p(x_S)}[\log D_m(F_m(x_S))] - E_{x_T \sim p(x_T)}[\log(1 - D_m(F_m(x_T)))]

where x_S is sampled from the source domain data distribution p(x_S), x_T is sampled from the target domain data distribution p(x_T), E(·) denotes the expectation, D_m(F_m(x_S)) denotes the probability that the m-th layer feature of the source domain data x_S belongs to the real data, and D_m(F_m(x_T)) denotes the probability that the m-th layer feature of the target domain data belongs to the generated data;

in the generator, the source domain dataset is taken as the real data and the target domain data as the generated data, so the generator objective at the m-th layer, L_F^m, is:

L_F^m = -E_{x_T \sim p(x_T)}[\log D_m(F_m(x_T))]

combining the above two equations, when the input features of the discriminator come from the source domain, D_m(F_m(x_S)) approaches 1, and when the input features of the discriminator come from the target domain, D_m(F_m(x_T)) approaches 0; for the generator, i.e. the feature extraction network F_m, the purpose is to make D_m(F_m(x_T)) approach 1; therefore, the training of the m-th feature layer is a minimax game with value function V(D_m, F_m):

\min_{F_m} \max_{D_m} V(D_m, F_m) = E_{x_S \sim p(x_S)}[\log D_m(F_m(x_S))] + E_{x_T \sim p(x_T)}[\log(1 - D_m(F_m(x_T)))]

that is, during training the discriminator D_m is updated to minimize L_D^m while the feature extractor F_m is updated to minimize L_F^m;
step5: the adversarial generative network, i.e. the alignment network model, and the pre-trained model are trained jointly to obtain the final detection model trained on the source domain dataset;
the alignment network model and the pre-trained model are jointly trained to obtain the final detection model, wherein the loss functions in the training process are as follows:
L_D = \sum_m L_D^m, \quad L_F = L_{det} + \sum_m \lambda_m L_F^m

where L_D is the sum of the discriminator loss functions L_D^m of the different feature layers, L_F is the sum of the detection loss L_det on the source domain data and the generator loss functions L_F^m of the different feature layers, and λ_m are the weights of the adversarial losses of the different feature layers.
2. The method as claimed in claim 1, wherein the preprocessing of Step1 is data augmentation, specifically: an augmented picture is obtained by adjusting brightness, saturation and hue, randomly cropping a fixed area, randomly cropping a random size, randomly flipping by a fixed angle and randomly flipping by a random angle; the picture is then normalized and cropped to a fixed size of 300 × 300.
3. The method for cross-domain target detection based on multilayer feature alignment as claimed in claim 1, wherein the specific process of Step2 is as follows: an SSD detection model is constructed, taking VGG16 with the fully connected layers removed as the base network model and then adding six convolutional layers Conv6, Conv7, Conv8, Conv9, Conv10 and Conv11; Conv4_3, Conv7, Conv8_2, Conv9_2, Conv10_2 and Conv11_2 are extracted as feature layers; the preprocessed source domain data is input into the SSD detection model for feature extraction, 6 default boxes with different scales and aspect ratios are generated at each pixel of each feature layer, and the default boxes are matched with the annotated boxes to obtain the positive and negative samples used in training;
firstly, the default box with the largest intersection-over-union (IoU) with each annotated box is matched, and then any default box whose IoU with an annotated box is greater than 0.5 is also matched;
specifically, the default box size on the feature maps of the Conv7, Conv8_2, Conv9_2, Conv10_2 and Conv11_2 layers, as a ratio of the original image, is:
s_k = s_{min} + \frac{s_{max} - s_{min}}{u - 1}(k - 1), \quad k \in [1, u]

where u = 5 denotes the number of prediction layers, s_{min} = 0.2, s_{max} = 0.9, and s_k is the ratio of the default box size of the k-th prediction layer to the original image; the ratio of the default box size on the prediction layer Conv4_3 to the original image is 0.1, that is, for k = 6, s_k = 0.1; let S_k denote the default box size of the k-th prediction layer, then S_k = 300 × s_k;

for the default box aspect ratios, α_r ∈ {1, 2, 3, 1/2, 1/3} is taken, so the first five default boxes of the k-th layer have widths and heights

w_k^a = S_k \sqrt{\alpha_r} \quad and \quad h_k^a = S_k / \sqrt{\alpha_r}

the sixth default box S'_k of the k-th layer has aspect ratio 1 and size

S'_k = \sqrt{S_k S_{k+1}}

with S_7 = 312, so that each pixel of each prediction layer generates different default boxes according to the set sizes and aspect ratios;
in addition, each pixel of each prediction feature layer is fed through a convolution of size 3 × 3 × q_m to further obtain a classification confidence score and a regression offset, where q_m is the number of channels of the m-th feature layer;
during the training process, the objective loss function is:
L(x, c, l, g) = \frac{1}{N}(L_{conf}(x, c) + \alpha L_{loc}(x, l, g))

wherein N is the number of default boxes matched with annotated boxes, L_loc is the localization loss function, L_conf is the classification-confidence loss, α is a regularization parameter, x is the input image, c is the target category, l is the model-predicted box, and g is the annotated box; the localization loss regresses the position offsets of the default boxes; a box is defined as {cx, cy, w, h}, where (cx, cy) are the box center coordinates and w and h are respectively its width and height; with the default box denoted d, L_loc is:

L_{loc}(x, l, g) = \sum_{i \in Pos}^{N} \sum_{m \in \{cx, cy, w, h\}} x_{ij}^{k} \, smooth_{L1}(l_i^m - \hat{g}_j^m)

wherein

\hat{g}_j^{cx} = (g_j^{cx} - d_i^{cx}) / d_i^{w}, \quad \hat{g}_j^{cy} = (g_j^{cy} - d_i^{cy}) / d_i^{h}, \quad \hat{g}_j^{w} = \log(g_j^{w} / d_i^{w}), \quad \hat{g}_j^{h} = \log(g_j^{h} / d_i^{h})

(g_j^{cx}, g_j^{cy}), g_j^{w} and g_j^{h} are respectively the center coordinates, width and height of the j-th annotated box, and (d_i^{cx}, d_i^{cy}), d_i^{w} and d_i^{h} are respectively the center coordinates, width and height of the i-th default box; smooth_{L1} is the smooth L1 function:

smooth_{L1}(x) = 0.5 x^2, if |x| < 1; |x| - 0.5, otherwise

x_{ij}^{k} \in \{0, 1\} is an indicator: when x_{ij}^{k} = 1, the i-th default box is matched with the j-th annotated box of category k and the default box is placed in the positive sample set Pos; otherwise x_{ij}^{k} = 0;
the classification loss L_conf is a softmax loss function:
L_{conf}(x, c) = -\sum_{i \in Pos}^{N} x_{ij}^{p} \log(\hat{c}_i^{p}) - \sum_{i \in Neg} \log(\hat{c}_i^{0}), \quad with \quad \hat{c}_i^{p} = \frac{\exp(c_i^{p})}{\sum_p \exp(c_i^{p})}

wherein \hat{c}_i^{p} is the probability that the i-th default box belongs to category p, \hat{c}_i^{0} is the classification score when the i-th default box is predicted as the background, Neg denotes the matched negative sample set, i.e. the background, Pos denotes the matched positive sample set, and p is the category to be detected; when x_{ij}^{p} = 1, the i-th default box is matched with the j-th annotated box of category p and the default box is placed in the positive sample set Pos; otherwise x_{ij}^{p} = 0 and the default box is placed in the negative sample set Neg;
inputting the preprocessed source domain data into the SSD detection model, training through the above processes to obtain a fully supervised detection model, and taking the trained model as a pre-training model for next training.
4. The multi-layer feature alignment-based cross-domain target detection method according to claim 1, wherein Step3 is specifically: respectively performing feature extraction on the preprocessed source domain data and target domain data at the convolutional layers Conv1, Conv2, Conv3 and Conv4 in the pre-trained model.
CN201910594012.4A 2019-07-03 2019-07-03 Cross-domain target detection method based on multi-layer feature alignment Active CN110363122B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201910594012.4A CN110363122B (en) 2019-07-03 2019-07-03 Cross-domain target detection method based on multi-layer feature alignment

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201910594012.4A CN110363122B (en) 2019-07-03 2019-07-03 Cross-domain target detection method based on multi-layer feature alignment

Publications (2)

Publication Number Publication Date
CN110363122A CN110363122A (en) 2019-10-22
CN110363122B true CN110363122B (en) 2022-10-11

Family

ID=68217903

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201910594012.4A Active CN110363122B (en) 2019-07-03 2019-07-03 Cross-domain target detection method based on multi-layer feature alignment

Country Status (1)

Country Link
CN (1) CN110363122B (en)

Families Citing this family (26)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
WO2021097774A1 (en) * 2019-11-21 2021-05-27 Beijing Didi Infinity Technology And Development Co., Ltd. Systems and methods for multi-source domain adaptation for semantic segmentation
CN111126446B (en) * 2019-11-29 2023-04-07 西安工程大学 Method for amplifying defect image data of robot vision industrial product
CN113128287B (en) * 2019-12-31 2024-01-02 暗物质(北京)智能科技有限公司 Method and system for training cross-domain facial expression recognition model and facial expression recognition
CN111060318B (en) * 2020-01-09 2021-12-28 山东科技大学 Bearing fault diagnosis method based on deep countermeasure migration network
WO2021147366A1 (en) * 2020-01-23 2021-07-29 华为技术有限公司 Image processing method and related device
CN111340021B (en) * 2020-02-20 2022-07-15 中国科学技术大学 Unsupervised domain adaptive target detection method based on center alignment and relation significance
CN111382568B (en) * 2020-05-29 2020-09-11 腾讯科技(深圳)有限公司 Training method and device of word segmentation model, storage medium and electronic equipment
CN111860494B (en) * 2020-06-16 2023-07-07 北京航空航天大学 Optimization method and device for image target detection, electronic equipment and storage medium
CN112115009B (en) * 2020-08-13 2022-02-18 中国科学院计算技术研究所 Fault detection method for neural network processor
CN112115834A (en) * 2020-09-11 2020-12-22 昆明理工大学 Standard certificate photo detection method based on small sample matching network
CN112287963B (en) * 2020-09-14 2024-03-05 江南大学 Small sample target detection method under classifier hot start training mechanism
CN112115916B (en) * 2020-09-29 2023-05-02 西安电子科技大学 Domain adaptive Faster R-CNN semi-supervised SAR detection method
CN112395951B (en) * 2020-10-23 2022-06-24 中国地质大学(武汉) Complex scene-oriented domain-adaptive traffic target detection and identification method
CN112232293B (en) * 2020-11-09 2022-08-26 腾讯科技(深圳)有限公司 Image processing model training method, image processing method and related equipment
CN112861616B (en) * 2020-12-31 2022-10-11 电子科技大学 Passive field self-adaptive target detection method
CN112668594B (en) * 2021-01-26 2021-10-26 华南理工大学 Unsupervised image target detection method based on antagonism domain adaptation
CN112906763B (en) * 2021-02-01 2024-06-14 南京航空航天大学 Automatic digital image labeling method utilizing cross-task information
CN113706440A (en) * 2021-03-12 2021-11-26 腾讯科技(深圳)有限公司 Image processing method, image processing device, computer equipment and storage medium
CN113052184B (en) * 2021-03-12 2022-11-18 电子科技大学 Target detection method based on two-stage local feature alignment
CN113222997A (en) * 2021-03-31 2021-08-06 上海商汤智能科技有限公司 Neural network generation method, neural network image processing device, electronic device, and medium
CN114495265B (en) * 2021-07-15 2023-04-07 电子科技大学 Human behavior recognition method based on activity graph weighting under multi-cross-domain scene
CN113837300B (en) * 2021-09-29 2024-03-12 上海海事大学 Automatic driving cross-domain target detection method based on block chain
CN113870258B (en) * 2021-12-01 2022-03-25 浙江大学 Counterwork learning-based label-free pancreas image automatic segmentation system
CN114386527B (en) * 2022-01-18 2022-12-09 湖南大学无锡智能控制研究院 Category regularization method and system for domain adaptive target detection
CN116188830B (en) * 2022-11-01 2023-09-29 青岛柯锐思德电子科技有限公司 Hyperspectral image cross-domain classification method based on multi-level feature alignment
CN117934869B (en) * 2024-03-22 2024-06-18 中铁大桥局集团有限公司 Target detection method, system, computing device and medium

Citations (11)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN107423760A (en) * 2017-07-21 2017-12-01 西安电子科技大学 Based on pre-segmentation and the deep learning object detection method returned
CN108710896A (en) * 2018-04-24 2018-10-26 浙江工业大学 The field learning method of learning network is fought based on production
CN109272024A (en) * 2018-08-29 2019-01-25 昆明理工大学 A kind of image interfusion method based on convolutional neural networks
CN109376620A (en) * 2018-09-30 2019-02-22 华北电力大学 A kind of migration diagnostic method of gearbox of wind turbine failure
CN109409365A (en) * 2018-10-25 2019-03-01 江苏德劭信息科技有限公司 It is a kind of that method is identified and positioned to fruit-picking based on depth targets detection
CN109447149A (en) * 2018-10-25 2019-03-08 腾讯科技(深圳)有限公司 A kind of training method of detection model, device and terminal device
CN109543640A (en) * 2018-11-29 2019-03-29 中国科学院重庆绿色智能技术研究院 A kind of biopsy method based on image conversion
CN109583342A (en) * 2018-11-21 2019-04-05 重庆邮电大学 Human face in-vivo detection method based on transfer learning
CN109671018A (en) * 2018-12-12 2019-04-23 华东交通大学 A kind of image conversion method and system based on production confrontation network and ResNets technology
CN109685066A (en) * 2018-12-24 2019-04-26 中国矿业大学(北京) A kind of mine object detection and recognition method based on depth convolutional neural networks
CN109829537A (en) * 2019-01-30 2019-05-31 华侨大学 Style transfer method and equipment based on deep learning GAN network children's garment clothes

Family Cites Families (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US10474929B2 (en) * 2017-04-25 2019-11-12 Nec Corporation Cyclic generative adversarial network for unsupervised cross-domain image generation
US11543830B2 (en) * 2017-12-06 2023-01-03 Petuum, Inc. Unsupervised real-to-virtual domain unification for end-to-end highway driving

Patent Citations (11)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN107423760A (en) * 2017-07-21 2017-12-01 西安电子科技大学 Based on pre-segmentation and the deep learning object detection method returned
CN108710896A (en) * 2018-04-24 2018-10-26 浙江工业大学 The field learning method of learning network is fought based on production
CN109272024A (en) * 2018-08-29 2019-01-25 昆明理工大学 A kind of image interfusion method based on convolutional neural networks
CN109376620A (en) * 2018-09-30 2019-02-22 华北电力大学 A kind of migration diagnostic method of gearbox of wind turbine failure
CN109409365A (en) * 2018-10-25 2019-03-01 江苏德劭信息科技有限公司 It is a kind of that method is identified and positioned to fruit-picking based on depth targets detection
CN109447149A (en) * 2018-10-25 2019-03-08 腾讯科技(深圳)有限公司 A kind of training method of detection model, device and terminal device
CN109583342A (en) * 2018-11-21 2019-04-05 重庆邮电大学 Human face in-vivo detection method based on transfer learning
CN109543640A (en) * 2018-11-29 2019-03-29 中国科学院重庆绿色智能技术研究院 A kind of biopsy method based on image conversion
CN109671018A (en) * 2018-12-12 2019-04-23 华东交通大学 A kind of image conversion method and system based on production confrontation network and ResNets technology
CN109685066A (en) * 2018-12-24 2019-04-26 中国矿业大学(北京) A kind of mine object detection and recognition method based on depth convolutional neural networks
CN109829537A (en) * 2019-01-30 2019-05-31 华侨大学 Style transfer method and equipment based on deep learning GAN network children's garment clothes

Non-Patent Citations (8)

* Cited by examiner, † Cited by third party
Title
EmotionGAN: Unsupervised Domain Adaptation for Learning Discrete Probability Distributions of Image Emotions;Sicheng Zhao等;《Session: Multimedia-2 (Socical & Emotional Multimedia)》;20181026;第1319-1327页 *
Unsupervised Domain Adaptation for Classification of Histopathology Whole-Slide Images;Jian Ren等;《Frontiers in Bioengineering and Biotechnology》;20190515;第7卷;第1-12页 *
Weakly Supervised Object Localization with Progressive Domain Adaptation;Dong Li等;《CVPR》;20161231;第3512-3520页 *
Research and Application of Malicious Code Recognition Based on Generative Adversarial Networks; Cao Qiyun; China Master's Theses Full-text Database, Information Science and Technology; 20190115 (No. 01, 2019); I139-270 *
Person Re-identification Research Based on Pedestrian Parts, Group Similarity and Data Augmentation; Zeng Qixun; China Master's Theses Full-text Database, Information Science and Technology; 20190615 (No. 06, 2019); I136-233 *
Research on Key Technologies of Skin Beautification and Overall Stylization of Digital Face Images; Xie Lu; China Master's Theses Full-text Database, Information Science and Technology; 20190115 (No. 01, 2019); I138-3080 *
Feature-Enhanced SSD Algorithm and Its Application in Object Detection; Tan Hongchen et al.; Journal of Computer-Aided Design & Computer Graphics; 20190430; Vol. 31, No. 4; pp. 573-579 *
Research on Generative Adversarial Networks and Their Application in Image Style Transfer; Dong Wei; China Master's Theses Full-text Database, Information Science and Technology; 20190615 (No. 06, 2019); I140-90 *

Also Published As

Publication number Publication date
CN110363122A (en) 2019-10-22

Similar Documents

Publication Publication Date Title
CN110363122B (en) Cross-domain target detection method based on multi-layer feature alignment
CN104599275B (en) The RGB-D scene understanding methods of imparametrization based on probability graph model
WO2019169816A1 (en) Deep neural network for fine recognition of vehicle attributes, and training method thereof
Lynen et al. Placeless place-recognition
Zhou et al. Robust vehicle detection in aerial images using bag-of-words and orientation aware scanning
CN112016605B (en) Target detection method based on corner alignment and boundary matching of bounding box
CN105138998B (en) Pedestrian based on the adaptive sub-space learning algorithm in visual angle recognition methods and system again
Kang et al. Pairwise relational networks for face recognition
CN111583263A (en) Point cloud segmentation method based on joint dynamic graph convolution
CN111639564B (en) Video pedestrian re-identification method based on multi-attention heterogeneous network
CN110175615B (en) Model training method, domain-adaptive visual position identification method and device
Wang et al. Traffic sign detection using a cascade method with fast feature extraction and saliency test
CN101630363A (en) Rapid detection method of face in color image under complex background
CN110287798B (en) Vector network pedestrian detection method based on feature modularization and context fusion
CN108230330B (en) Method for quickly segmenting highway pavement and positioning camera
WO2022218396A1 (en) Image processing method and apparatus, and computer readable storage medium
Redondo-Cabrera et al. All together now: Simultaneous object detection and continuous pose estimation using a hough forest with probabilistic locally enhanced voting
CN112949510A (en) Human detection method based on fast R-CNN thermal infrared image
CN105868776A (en) Transformer equipment recognition method and device based on image processing technology
CN111540203A (en) Method for adjusting green light passing time based on fast-RCNN
Xiang et al. A real-time vehicle traffic light detection algorithm based on modified YOLOv3
CN116994164A (en) Multi-mode aerial image fusion and target detection combined learning method
CN116468935A (en) Multi-core convolutional network-based stepwise classification and identification method for traffic signs
CN111539362A (en) Unmanned aerial vehicle image target detection device and method
CN113705731A (en) End-to-end image template matching method based on twin network

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant