CN110363122B - Cross-domain target detection method based on multi-layer feature alignment - Google Patents

Cross-domain target detection method based on multi-layer feature alignment

Info

Publication number
CN110363122B
Authority
CN
China
Prior art keywords
feature
target
default
layer
frame
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN201910594012.4A
Other languages
Chinese (zh)
Other versions
CN110363122A (en)
Inventor
王蒙
李威
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Kunming University of Science and Technology
Original Assignee
Kunming University of Science and Technology
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Kunming University of Science and Technology filed Critical Kunming University of Science and Technology
Priority to CN201910594012.4A priority Critical patent/CN110363122B/en
Publication of CN110363122A publication Critical patent/CN110363122A/en
Application granted granted Critical
Publication of CN110363122B publication Critical patent/CN110363122B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical


Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F 18/00 Pattern recognition
    • G06F 18/20 Analysing
    • G06F 18/21 Design or setup of recognition systems or techniques; Extraction of features in feature space; Blind source separation
    • G06F 18/214 Generating training patterns; Bootstrap methods, e.g. bagging or boosting
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N 3/00 Computing arrangements based on biological models
    • G06N 3/02 Neural networks
    • G06N 3/04 Architecture, e.g. interconnection topology
    • G06N 3/045 Combinations of networks
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06V IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V 10/00 Arrangements for image or video recognition or understanding
    • G06V 10/20 Image preprocessing
    • G06V 10/25 Determination of region of interest [ROI] or a volume of interest [VOI]
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06V IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V 20/00 Scenes; Scene-specific elements
    • G06V 20/50 Context or environment of the image
    • G06V 20/56 Context or environment of the image exterior to a vehicle by using sensors mounted on the vehicle
    • G06V 20/58 Recognition of moving objects or obstacles, e.g. vehicles or pedestrians; Recognition of traffic objects, e.g. traffic signs, traffic lights or roads
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06V IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V 20/00 Scenes; Scene-specific elements
    • G06V 20/50 Context or environment of the image
    • G06V 20/56 Context or environment of the image exterior to a vehicle by using sensors mounted on the vehicle
    • G06V 20/58 Recognition of moving objects or obstacles, e.g. vehicles or pedestrians; Recognition of traffic objects, e.g. traffic signs, traffic lights or roads
    • G06V 20/584 Recognition of moving objects or obstacles, e.g. vehicles or pedestrians; Recognition of traffic objects, e.g. traffic signs, traffic lights or roads of vehicle lights or traffic lights

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • General Physics & Mathematics (AREA)
  • Data Mining & Analysis (AREA)
  • Multimedia (AREA)
  • General Engineering & Computer Science (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Artificial Intelligence (AREA)
  • Evolutionary Computation (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • General Health & Medical Sciences (AREA)
  • Bioinformatics & Computational Biology (AREA)
  • Health & Medical Sciences (AREA)
  • Biomedical Technology (AREA)
  • Biophysics (AREA)
  • Computational Linguistics (AREA)
  • Evolutionary Biology (AREA)
  • Molecular Biology (AREA)
  • Computing Systems (AREA)
  • Mathematical Physics (AREA)
  • Software Systems (AREA)
  • Bioinformatics & Cheminformatics (AREA)
  • Image Analysis (AREA)

Abstract

The invention discloses a cross-domain target detection method based on multi-layer feature alignment. First, a detector is trained on a source domain dataset with bounding-box annotations using a deep convolutional neural network. The trained detector is then used as a pre-trained model, and features are extracted from images of the source domain and of the target domain (which has no bounding-box annotations) through the deep convolutional neural network VGG-16, with the source domain and the target domain sharing feature parameters. Next, a domain classifier is designed: the extracted multi-layer feature maps of the source and target domains are fed to the domain classifier, which judges whether each feature map comes from the source domain or the target domain. Then, through adversarial (GAN-style) training, the feature distributions of the source and target domains are aligned, reducing the data shift between the two domains. Finally, the detector and the discriminator are trained jointly to obtain the final model. The method thus transfers source-domain knowledge to the target domain and improves detection accuracy on target-domain data without bounding-box annotations.

Description

Cross-domain target detection method based on multi-layer feature alignment
Technical Field
The invention relates to the field of computer vision, in particular to a cross-domain target detection method based on multilayer feature alignment.
Background Art
In recent years, with the development of deep neural networks, data-driven computer vision has made significant progress. Object detection, as a basic task of computer vision, has attracted wide attention from researchers and plays an increasingly important role in people's intelligent life. For example, in intelligent driving, 3D object detection can rapidly locate passing vehicles and the surrounding scene, so that the car can avoid obstacles and run smoothly; in intelligent retail, accurate object detection can be used to check the goods on a shelf, realizing unattended sales and saving labor; in an intelligent community, license plate detection can quickly identify passing vehicles and realize automatic gate release. In summary, object detection is becoming an indispensable part of our lives.
Despite significant advances in data-driven object detection, practical production applications still face many difficulties. A target detection algorithm based on a deep convolutional neural network needs a large amount of bounding-box and classification annotation, and in real business scenarios a great deal of manpower and material resources must be spent from data collection to annotation. On the one hand, for many scenes, such as road scenes in good weather, a large amount of data has already been collected and good results have been achieved in the related businesses. On the other hand, in some special cases, such as road scene pictures in rare haze weather, data are difficult to collect. How to apply the existing massive data to new business scenarios so as to save manpower and material resources is therefore a pressing problem.
Currently, there are two main solutions: weakly supervised target detection methods and methods based on domain alignment. Weakly supervised target detection uses only classification labels, so its localization accuracy is poor in practice and it is hard to use. The other is the domain-alignment method based on transfer learning. In such methods, the distance between the source domain and the target domain is usually reduced by minimizing a first- or second-order statistical distance, or by projecting the source domain and the target domain into a common feature space. However, because deep convolutional neural networks project both the source domain and the target domain into a high-dimensional data space, such distance-metric-based approaches are often ineffective and sometimes even counterproductive. With the development of deep generative adversarial networks, adversarial training has been used for feature alignment in domain adaptation. Inspired by this, the present invention merges a generative adversarial network into the target detection network and trains the detection network and the adversarial network jointly, so that during training the source domain features are aligned with the target domain features and the detection model trained on the source domain generalizes to the target domain data.
Object detection, as a basic task in computer vision, mainly involves two processes, localization and recognition. Since R-CNN was proposed, detection models based on convolutional neural networks have developed rapidly. At present there are two main types: target detection methods based on candidate regions and methods without candidate regions. In candidate-region-based detection models, detection is divided into two steps. Candidate regions of interest are first obtained with a region proposal network, mainly by judging whether a region contains an object to be detected. After the candidate regions are obtained, each region is classified and its coordinate position is obtained by classification and regression. Representative detectors of this kind are R-CNN, Fast R-CNN and Mask R-CNN; they are accurate but slow. Methods without candidate regions mainly divide the picture into a grid and then perform regression and classification directly, realizing localization and recognition in a single pass; typical representatives are YOLO and SSD. Such methods are faster but less accurate due to the absence of the candidate-region extraction step. Although these fully supervised detection methods achieve good results, they rely heavily on large amounts of annotated data, cannot effectively utilize the large amounts of data that already exist, and require a great deal of manpower and material resources for data collection and annotation.
Disclosure of Invention
The invention mainly solves the technical problem of providing a multi-layer domain-adaptive cross-domain target detection method. By adding a generative adversarial network, the discriminator cannot distinguish whether a feature layer comes from the source domain or the target domain, so the feature distributions of the source and target domains become similar; the annotation knowledge of the source domain dataset is thereby transferred to the target domain, improving the performance of the detector on the target domain.
The invention provides a cross-domain target detection method based on multi-layer feature alignment, which comprises: preprocessing the source domain data and target domain data, training a detection model with the source domain data, extracting and aligning the source and target domain features, and jointly training the domain-adaptive network and the detection network to obtain the final detection model. Specifically, the method comprises the following steps:
Step1: preprocess the source domain data and target domain data. Specifically, the data are augmented by adjusting brightness, saturation and hue, randomly cropping a fixed area, randomly cropping a random size, randomly flipping by a fixed angle and randomly flipping by a random angle; the augmented pictures are then normalized and resized to a fixed size of 300 × 300.
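For illustration only, a minimal sketch of such an augmentation pipeline using torchvision is given below. The jitter strengths, crop scales and normalization statistics are assumptions chosen for the example, and a real detection pipeline must also transform the annotation boxes consistently with the image, which this sketch omits.

```python
# Illustrative preprocessing/augmentation sketch (assumed parameter values).
import torchvision.transforms as T

def build_preprocess(train=True):
    augment = [
        T.ColorJitter(brightness=0.3, saturation=0.3, hue=0.05),  # brightness/saturation/hue jitter
        T.RandomResizedCrop(300, scale=(0.5, 1.0)),               # random crop of random size
        T.RandomHorizontalFlip(p=0.5),                            # random flip
    ] if train else [T.Resize((300, 300))]
    return T.Compose(augment + [
        T.ToTensor(),
        T.Normalize(mean=[0.485, 0.456, 0.406],                   # ImageNet statistics, assumed
                    std=[0.229, 0.224, 0.225]),
    ])
```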
Step2: inputting the preprocessed source domain data into an SSD detection model, carrying out end-to-end training on the source domain data with the labeled information to obtain a fully supervised initial detection model, and using the fully supervised initial detection model as a pre-training model for the next training;
the specific process is as follows: an SSD detection model is constructed, taking VGG16 with the fully connected layers removed as the base network model and then adding six convolutional layers Conv6, Conv7, Conv8, Conv9, Conv10 and Conv11; Conv4_3, Conv7, Conv8_2, Conv9_2, Conv10_2 and Conv11_2 are extracted as feature layers; the preprocessed source domain data is input into the SSD detection model for feature extraction, 6 default boxes with different scales and aspect ratios are generated at each pixel of each feature layer, and the default boxes are matched with the annotated boxes to obtain the positive and negative samples used in training;
firstly, the default box with the largest intersection-over-union (IoU) with each annotated box is matched, and then any default box whose IoU with an annotated box is greater than 0.5 is also matched;
specifically, the default box size on the feature maps of the Conv7, Conv8_2, Conv9_2, Conv10_2 and Conv11_2 layers, as a ratio of the original image, is:
s_k = s_{min} + \frac{s_{max} - s_{min}}{u - 1}(k - 1), \quad k \in [1, u]

where u = 5 denotes the number of prediction layers, s_{min} = 0.2, s_{max} = 0.9, and s_k is the ratio of the default box size of the k-th prediction layer to the original image; the ratio of the default box size on the prediction layer Conv4_3 to the original image is 0.1, that is, for k = 6, s_k = 0.1. Let S_k denote the default box size of the k-th prediction layer; then S_k = 300 × s_k.
For the default box aspect ratios, α_r ∈ {1, 2, 3, 1/2, 1/3} is taken, so the first five default boxes of the k-th layer have widths and heights

w_k^a = S_k \sqrt{\alpha_r} \quad and \quad h_k^a = S_k / \sqrt{\alpha_r}

The sixth default box S'_k of the k-th layer has aspect ratio 1 and size

S'_k = \sqrt{S_k S_{k+1}}

with S_7 = 312, so that each pixel of each prediction layer generates different default boxes according to the set sizes and aspect ratios;
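As a sketch of how the default-box scales and shapes defined by the formulas above can be computed (an illustrative reading of those formulas, not the patent's own code; the helper names are assumptions):

```python
import math

def default_box_scales(u=5, s_min=0.2, s_max=0.9, img_size=300):
    """Absolute default-box sizes S_k = 300 * s_k: the fixed 0.1 ratio for Conv4_3,
    followed by the u evenly spaced ratios for Conv7 ... Conv11_2."""
    ratios = [0.1] + [s_min + (s_max - s_min) / (u - 1) * (k - 1) for k in range(1, u + 1)]
    return [r * img_size for r in ratios]

def default_box_shapes(S_k, S_k_next, aspect_ratios=(1, 2, 3, 1/2, 1/3)):
    """Widths/heights of the 6 default boxes of one layer: five from the aspect ratios
    (w = S_k * sqrt(a), h = S_k / sqrt(a)), plus the extra ratio-1 box of size sqrt(S_k * S_{k+1})."""
    boxes = [(S_k * math.sqrt(a), S_k / math.sqrt(a)) for a in aspect_ratios]
    extra = math.sqrt(S_k * S_k_next)
    boxes.append((extra, extra))
    return boxes

sizes = default_box_scales()          # [30.0, 60.0, 112.5, 165.0, 217.5, 270.0]
sizes.append(312)                     # S_7 = 312 for the extra box of the last layer
boxes_conv7 = default_box_shapes(sizes[1], sizes[2])   # 6 (w, h) pairs for the Conv7 layer
```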
in addition, each pixel of each prediction feature layer is fed through a convolution of size 3 × 3 × q_m to further obtain a classification confidence score and a regression offset, where q_m is the number of channels of the m-th feature layer;
during the training process, the objective loss function is:
L(x, c, l, g) = \frac{1}{N}(L_{conf}(x, c) + \alpha L_{loc}(x, l, g))

where N is the number of default boxes matched with annotated boxes, L_loc is the localization loss function, L_conf is the classification-confidence loss, α is a regularization parameter, x is the input image, c is the target class, l is the model-predicted box, and g is the annotated box. The localization loss regresses the position offsets of the default boxes. A box is defined as {cx, cy, w, h}, where (cx, cy) are the box center coordinates and w and h are respectively its width and height; with the default box denoted d, L_loc is:

L_{loc}(x, l, g) = \sum_{i \in Pos}^{N} \sum_{m \in \{cx, cy, w, h\}} x_{ij}^{k} \, smooth_{L1}(l_i^m - \hat{g}_j^m)

where

\hat{g}_j^{cx} = (g_j^{cx} - d_i^{cx}) / d_i^{w}, \quad \hat{g}_j^{cy} = (g_j^{cy} - d_i^{cy}) / d_i^{h}, \quad \hat{g}_j^{w} = \log(g_j^{w} / d_i^{w}), \quad \hat{g}_j^{h} = \log(g_j^{h} / d_i^{h})

(g_j^{cx}, g_j^{cy}), g_j^{w} and g_j^{h} are respectively the center coordinates, width and height of the j-th annotated box, and (d_i^{cx}, d_i^{cy}), d_i^{w} and d_i^{h} are respectively the center coordinates, width and height of the i-th default box. smooth_{L1} is the smooth L1 function:

smooth_{L1}(x) = 0.5 x^2, if |x| < 1; |x| - 0.5, otherwise

x_{ij}^{k} \in \{0, 1\} is an indicator: when x_{ij}^{k} = 1, the i-th default box is matched with the j-th annotated box of class k and the default box is placed in the positive sample set Pos; otherwise x_{ij}^{k} = 0.
The classification loss L_conf is the softmax loss function:
L_{conf}(x, c) = -\sum_{i \in Pos}^{N} x_{ij}^{p} \log(\hat{c}_i^{p}) - \sum_{i \in Neg} \log(\hat{c}_i^{0}), \quad with \quad \hat{c}_i^{p} = \frac{\exp(c_i^{p})}{\sum_p \exp(c_i^{p})}

where \hat{c}_i^{p} is the probability that the i-th default box belongs to class p, \hat{c}_i^{0} is the classification score when the i-th default box is predicted as the background, Neg denotes the matched negative sample set, i.e. the background, Pos denotes the matched positive sample set, and p is the class to be detected. When x_{ij}^{p} = 1, the i-th default box is matched with the j-th annotated box of class p and the default box is placed in the positive sample set Pos; otherwise x_{ij}^{p} = 0 and the default box is placed in the negative sample set Neg;
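To make the objective above concrete, here is a minimal sketch of the combined loss (smooth-L1 localization plus softmax confidence loss, weighted by α and normalized by the number N of matched default boxes). The tensor layout is an assumption, and SSD's hard negative mining is omitted for brevity:

```python
import torch.nn.functional as F

def multibox_loss(loc_pred, conf_pred, loc_target, cls_target, alpha=1.0):
    """
    loc_pred:   (B, M, 4)  predicted offsets for M default boxes
    conf_pred:  (B, M, C)  classification scores (class 0 = background)
    loc_target: (B, M, 4)  encoded ground-truth offsets g_hat for matched boxes
    cls_target: (B, M)     matched class index per default box (long tensor, 0 = background)
    """
    pos = cls_target > 0                                 # positive (matched) default boxes
    n = pos.sum().clamp(min=1).float()                   # N, number of matched default boxes

    # Localization loss: smooth L1 over positive boxes only
    l_loc = F.smooth_l1_loss(loc_pred[pos], loc_target[pos], reduction="sum")

    # Confidence loss: softmax cross-entropy over all boxes (positives and negatives)
    l_conf = F.cross_entropy(conf_pred.reshape(-1, conf_pred.size(-1)),
                             cls_target.reshape(-1), reduction="sum")

    return (l_conf + alpha * l_loc) / n
```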
inputting the preprocessed source domain data into the SSD detection model, training through the above processes to obtain a fully supervised detection model, and taking the trained model as a pre-training model for next training.
Step3: performing feature extraction on the preprocessed source domain data and target domain data using the pre-trained model; specifically, feature extraction is performed on the preprocessed source domain data and target domain data at the convolutional layers Conv1, Conv2, Conv3 and Conv4 of the pre-trained model.
Step4: using a generative adversarial network to realize multi-layer feature alignment of the source domain and the target domain; after the feature extraction of Step3, the source domain feature F_m(x_S) and the target domain feature F_m(x_T) are obtained, where m denotes the m-th convolutional feature layer; a generative adversarial network is then used to align the source and target domain features, where the discriminator of the adversarial network is D_m, a binary classification network, and the feature extractor producing F_m(x_T) acts as the generator; for the discriminator, the source domain class is set to 1 and the target domain class to 0, so the discriminator's training objective at the m-th layer is denoted
L_D^m, as a binary cross-entropy loss function:

L_D^m = -E_{x_S \sim p(x_S)}[\log D_m(F_m(x_S))] - E_{x_T \sim p(x_T)}[\log(1 - D_m(F_m(x_T)))]

where x_S is sampled from the source domain data distribution p(x_S), x_T is sampled from the target domain data distribution p(x_T), E(·) denotes the expectation, D_m(F_m(x_S)) denotes the probability that the m-th layer feature of the source domain data x_S belongs to the real data, and D_m(F_m(x_T)) denotes the probability that the m-th layer feature of the target domain data belongs to the generated data.

In the generator, the source domain dataset is treated as the real data and the target domain data as the generated data, so the generator objective at the m-th layer, L_F^m, is:

L_F^m = -E_{x_T \sim p(x_T)}[\log D_m(F_m(x_T))]

Combining the two equations above: when the discriminator's input features come from the source domain, D_m(F_m(x_S)) should approach 1; when they come from the target domain, D_m(F_m(x_T)) should approach 0. For the generator, i.e. the feature extraction network producing F_m(x_T), the goal is to make D_m(F_m(x_T)) approach 1. Training the m-th feature layer is therefore a minimax game with value function V(D_m, F_m):

\min_{F_m} \max_{D_m} V(D_m, F_m) = E_{x_S \sim p(x_S)}[\log D_m(F_m(x_S))] + E_{x_T \sim p(x_T)}[\log(1 - D_m(F_m(x_T)))]

that is, during training the discriminator D_m is updated to minimize L_D^m while the feature extractor F_m is updated to minimize L_F^m, alternately.
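The two adversarial objectives above reduce to binary cross-entropy terms with domain labels 1 (source) and 0 (target). A minimal sketch follows, assuming the discriminator D_m outputs probabilities (e.g. ends with a sigmoid); the features are detached in the discriminator loss so that only D_m is updated by it:

```python
import torch
import torch.nn.functional as F

def discriminator_loss(D_m, feat_src, feat_tgt):
    """L_D^m: classify source-domain features as 1 and target-domain features as 0."""
    pred_src = D_m(feat_src.detach())        # detach: this loss updates only D_m
    pred_tgt = D_m(feat_tgt.detach())
    return (F.binary_cross_entropy(pred_src, torch.ones_like(pred_src)) +
            F.binary_cross_entropy(pred_tgt, torch.zeros_like(pred_tgt)))

def generator_loss(D_m, feat_tgt):
    """L_F^m: push target-domain features toward the source label 1, updating the feature extractor."""
    pred_tgt = D_m(feat_tgt)
    return F.binary_cross_entropy(pred_tgt, torch.ones_like(pred_tgt))
```

Alternating these two updates realizes the minimax game: D_m learns to separate the domains while the feature extractor learns target features that D_m can no longer distinguish from source features.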
Step5: the adversarial generative network, i.e. the alignment network model, and the pre-trained model are trained jointly to obtain the final detection model, well trained on the source domain dataset; the loss functions in the training process are as follows:
L_D = \sum_m L_D^m, \quad L_F = L_{det} + \sum_m \lambda_m L_F^m

where L_D is the sum of the discriminator loss functions L_D^m of the different feature layers, L_F is the sum of the detection loss L_det on the source domain data and the generator loss functions L_F^m of the different feature layers, and λ_m are the weights of the adversarial losses of the different feature layers.
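Read this way, the joint objective splits into a discriminator objective and a detector/generator objective. A small helper combining them is sketched below; the placement of the weights λ_m follows the description above, and the function and argument names are assumptions:

```python
def joint_losses(det_loss, adv_terms, lambdas):
    """
    det_loss:  detection loss L_det on source-domain data
    adv_terms: dict m -> (L_D^m, L_F^m), adversarial losses of each aligned feature layer
    lambdas:   dict m -> lambda_m, weight of that layer's adversarial loss
    Returns (L_D, L_F): the discriminator objective and the detector/generator objective.
    """
    L_D = sum(d for d, _ in adv_terms.values())
    L_F = det_loss + sum(lambdas[m] * g for m, (_, g) in adv_terms.items())
    return L_D, L_F
```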
The invention has the beneficial effects that:
according to the method, the confrontation generation domain self-adaptive network is added into the detection model, so that the source domain and the target domain are similar in feature distribution, and the detection performance of the detector with well-trained source domain data on the target domain is improved.
In step1, the data are preprocessed, increasing the amount and diversity of the data so that the model fits and generalizes better.
In step2, a well-performing detector can be obtained by training on the annotated source domain dataset. This model is then used as a pre-trained model for fine-tuning, so the annotation information in the source domain data is fully utilized while transfer learning is carried out through fine-tuning.
In step3 and step4, different information is obtained by extracting features from different convolutional layers. Generally, the shallow layers of a deep convolutional neural network produce relatively generic features, mainly carrying spatial position information of the image with little semantic information; the deep layers are more task-specific, mainly carrying semantic information with little spatial position information. Extracting features from different convolutional layers and feeding each into its own discriminator achieves feature-distribution alignment at different levels.
In step5, the adversarial generative network and the detector are jointly trained, so that while the detector is trained on source domain data, the feature distribution of the target domain is continuously drawn closer to that of the source domain, improving the generalization of the detector on target domain data.
Drawings
FIG. 1 is a schematic flow chart of a preferred embodiment of the present invention;
FIG. 2 is a basic detection model SSD of an embodiment of the present invention;
FIG. 3 shows the discriminators used in an embodiment of the present invention, where (a) is the D_4 network model and (b) is the D_10 network model.
Detailed Description
The following description of the embodiments of the present invention with reference to the accompanying drawings is provided to make the advantages and features of the present invention more comprehensible to those skilled in the art, and to make the scope of the present invention more clearly defined.
Example 1: The invention mainly relates to a target detection method based on multi-layer feature alignment, which integrates an adversarial mechanism into an existing detection model such as SSD or YOLO and jointly trains the adversarial generative network and the detector, so that the multi-layer feature distributions of the source domain data and the target domain data become similar, further improving the detector's performance on the target domain dataset.
The invention has wide application fields. For example, in intelligent driving it can be applied to detection tasks in different scenes: by aligning the feature distributions of a large amount of annotated scene data with those of unseen scenes, detection can be migrated across scenes, the robustness of the detector is improved, and the cost of annotating massive data is reduced. A concrete use case is illustrated with scene detection under different weather conditions in automatic driving. The source domain data are easily obtained images in good weather and the target domain data are images in haze weather; 2795 source-domain and target-domain images are provided, the target domain provides a 500-image test set, and the source and target domains share 8 detection categories. The experiments use Ubuntu 18.04, an Intel i7-8700K CPU at 3.7 GHz × 6 cores, Python 3.6, an NVIDIA GeForce RTX 2070 graphics card, and the PyTorch 1.0 deep learning framework.
The specific implementation process is as follows:
Step1: Input the source domain data x_S and the target domain data x_T and preprocess them. The main methods are brightness, saturation and hue adjustment, random cropping of a fixed area, random cropping of a random size, random flipping by a fixed angle, random flipping by a random angle, and so on. The augmented pictures are then normalized, and the pictures fed into the deep convolutional neural network are resized to a fixed size of 300 × 300.
Step2: The preprocessed source domain dataset is fed to the detector SSD, whose specific model structure is shown in fig. 2, and an initial detection model is obtained by end-to-end training on the source domain data with annotation information. In SSD training, the batch size is 16 and there are 44000 iterations in total; SGD is used for optimization with an initial learning rate of 0.01. After 28000 iterations the learning rate is reduced to 0.1 times the initial value, and after 36000 iterations it is reduced by a factor of 0.1 again, until the final detection model is obtained.
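A sketch of the optimizer and step-decay schedule just described (SGD, initial learning rate 0.01, decayed by 0.1 at 28000 and again at 36000 of the 44000 iterations). Here ssd_model, its loss method and source_loader are placeholders, and the momentum/weight-decay values are assumptions:

```python
import torch

optimizer = torch.optim.SGD(ssd_model.parameters(), lr=0.01,
                            momentum=0.9, weight_decay=5e-4)        # momentum/decay assumed
scheduler = torch.optim.lr_scheduler.MultiStepLR(optimizer,
                                                 milestones=[28000, 36000], gamma=0.1)

for it in range(44000):
    images, targets = next(source_loader)    # batch of 16 labeled source-domain images
    loss = ssd_model.loss(images, targets)   # SSD multibox loss of Step2
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    scheduler.step()                          # learning-rate decay keyed to iteration count
```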
Specifically, feature maps of the Conv4_3, Conv7, Conv8_2, Conv9_2, Conv10_2 and Conv11_2 layers are extracted, with sizes 38 × 38, 19 × 19, 10 × 10, 5 × 5, 3 × 3 and 1 × 1 respectively, and then 6 default boxes with different scales and aspect ratios are generated at each pixel of each feature map. The ratio of the default box size on the Conv7, Conv8_2, Conv9_2, Conv10_2 and Conv11_2 feature maps to the original image is given by formula (1):
s_k = s_{min} + \frac{s_{max} - s_{min}}{u - 1}(k - 1), \quad k \in [1, u]    (1)

where u = 5 is the number of prediction layers, s_{min} = 0.2, s_{max} = 0.9, and s_k is the ratio of the default box size of the k-th prediction layer to the original image. The ratio for the prediction layer Conv4_3 is fixed at 0.1. Let S_k denote the default box size of the k-th prediction layer; then S_k = 300 × s_k.

For the default box aspect ratios, α_r ∈ {1, 2, 3, 1/2, 1/3} is typically taken, so the first five default boxes of the k-th layer have widths and heights

w_k^a = S_k \sqrt{\alpha_r} \quad and \quad h_k^a = S_k / \sqrt{\alpha_r}

In addition, the sixth default box S'_k of the k-th layer has aspect ratio 1 and size

S'_k = \sqrt{S_k S_{k+1}}

with S_7 = 312 generally taken. Therefore, each pixel of each prediction layer generates different default boxes according to the set sizes and aspect ratios.
In addition, each pixel of each prediction feature layer is fed through a convolution of size 3 × 3 × q_m to further obtain a classification confidence score and a regression offset, where q_m is the number of channels of the m-th feature layer.
During the training process, the objective loss function is as shown in equation (2):
L(x, c, l, g) = \frac{1}{N}(L_{conf}(x, c) + \alpha L_{loc}(x, l, g))    (2)

where N is the number of default boxes matched with annotated boxes, L_loc is the localization loss function, L_conf is the classification-confidence loss, α is a regularization parameter, x is the input image, c is the target category, l is the model-predicted box, and g is the annotated box. The localization loss regresses the position offsets of the default boxes. A box is defined as {cx, cy, w, h}, where (cx, cy) are the box center coordinates and w and h are respectively its width and height; with the default box denoted d, L_loc is given by formula (3):

L_{loc}(x, l, g) = \sum_{i \in Pos}^{N} \sum_{m \in \{cx, cy, w, h\}} x_{ij}^{k} \, smooth_{L1}(l_i^m - \hat{g}_j^m)    (3)

In formula (3),

\hat{g}_j^{cx} = (g_j^{cx} - d_i^{cx}) / d_i^{w}, \quad \hat{g}_j^{cy} = (g_j^{cy} - d_i^{cy}) / d_i^{h}, \quad \hat{g}_j^{w} = \log(g_j^{w} / d_i^{w}), \quad \hat{g}_j^{h} = \log(g_j^{h} / d_i^{h})

where (g_j^{cx}, g_j^{cy}), g_j^{w} and g_j^{h} are respectively the center coordinates, width and height of the j-th annotated box, and (d_i^{cx}, d_i^{cy}), d_i^{w} and d_i^{h} are respectively the center coordinates, width and height of the i-th default box. smooth_{L1} is the smooth L1 function:

smooth_{L1}(x) = 0.5 x^2, if |x| < 1; |x| - 0.5, otherwise

x_{ij}^{k} \in \{0, 1\} is an indicator: when x_{ij}^{k} = 1, the i-th default box is matched with the j-th real annotated box of category k and the default box is placed in the positive sample set Pos; otherwise x_{ij}^{k} = 0.
The classification loss L_conf is a softmax loss function, as shown in formula (4):
L_{conf}(x, c) = -\sum_{i \in Pos}^{N} x_{ij}^{p} \log(\hat{c}_i^{p}) - \sum_{i \in Neg} \log(\hat{c}_i^{0}), \quad with \quad \hat{c}_i^{p} = \frac{\exp(c_i^{p})}{\sum_p \exp(c_i^{p})}    (4)

where \hat{c}_i^{p} is the probability that the i-th default box belongs to category p, \hat{c}_i^{0} is the classification score when the i-th default box is predicted as the background, Neg denotes the matched negative sample set, i.e. the background, Pos denotes the matched positive sample set, and p is the category to be detected. When x_{ij}^{p} = 1, the i-th default box is matched with the j-th real annotated box of category p and the default box is placed in the positive sample set Pos; otherwise x_{ij}^{p} = 0 and the default box is placed in the negative sample set Neg.
Inputting preprocessed source domain data into a basic detection model SSD, training through the above processes to obtain a fully supervised detection model, and taking the trained model as a pre-training model for next training.
Step3: The detection model obtained in Step2 is taken as the pre-trained model and used to initialize the SSD detection model inside the feature-alignment network model. The overall framework flow is shown in fig. 1. The preprocessed source domain and target domain data are input into the detector's base network VGG16 to obtain the multi-layer source and target domain features F_m(x_S) and F_m(x_T) respectively, where m denotes different convolutional layers. In a deep convolutional neural network, shallow features are relatively generic and contain rich spatial information, while deep features are more specific and contain richer semantic information. Therefore, in the practical implementation, m = 4 and m = 10 are taken to obtain the shallow and deep features, corresponding to Conv2_2 and Conv4_3 in the VGG16 network. The feature layer F_4 has size 75 × 75 and the feature layer F_10 has size 38 × 38.
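One way to realize this step is to tap the shared VGG-16 backbone at the two chosen layers. The sketch below taps torchvision's vgg16().features by index; mapping the patent's 4th and 10th convolutional layers (Conv2_2, Conv4_3) onto indices 8 and 22 is an assumption made for illustration:

```python
import torch
import torchvision

vgg = torchvision.models.vgg16().features   # shared feature extractor (same weights for both domains)
TAP_IDX = {4: 8, 10: 22}                     # assumed positions of the ReLUs after Conv2_2 and Conv4_3

def extract_features(x):
    """Return {m: F_m(x)} for the shallow (m=4) and deep (m=10) feature layers."""
    feats = {}
    for i, layer in enumerate(vgg):
        x = layer(x)
        for m, idx in TAP_IDX.items():
            if i == idx:
                feats[m] = x
    return feats

feats_src = extract_features(torch.randn(1, 3, 300, 300))   # F_m(x_S), source-domain batch
feats_tgt = extract_features(torch.randn(1, 3, 300, 300))   # F_m(x_T), target-domain batch (same weights)
```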
Step4: The discriminator D is designed with a structure similar to PatchGAN. For the feature layers F_4 and F_10, the discriminators D_4 and D_10 are designed separately; their structures are shown in fig. 3. Source domain features are labeled 1 and target domain features are labeled 0, and both are fed to the binary-classification discriminators. The difference is that the shallow feature layer contains more positional/spatial information, so D_4 is designed to classify each pixel of the feature map and consists of three convolutional layers; the deep feature contains rich semantic information, so D_10 applies convolutional layers to F_10 followed by fully connected layers for binary classification, and consists of three convolutional layers and three fully connected layers.
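A minimal sketch of the two discriminators as described: D_4 classifies every pixel of the shallow feature map with three convolutions (PatchGAN-style), while D_10 applies three convolutions followed by three fully connected layers for a single binary output. The channel widths, kernel sizes and the assumed 128×75×75 / 512×38×38 input shapes are illustrative guesses, since FIG. 3 is not reproduced here:

```python
import torch.nn as nn

class D4(nn.Module):
    """Patch-level discriminator for the shallow feature layer F_4 (three convolutional layers)."""
    def __init__(self, in_ch=128):                               # channel count of F_4 assumed
        super().__init__()
        self.net = nn.Sequential(
            nn.Conv2d(in_ch, 64, 3, padding=1), nn.LeakyReLU(0.2, inplace=True),
            nn.Conv2d(64, 64, 3, padding=1),    nn.LeakyReLU(0.2, inplace=True),
            nn.Conv2d(64, 1, 3, padding=1),     nn.Sigmoid(),    # per-pixel source/target probability
        )

    def forward(self, x):
        return self.net(x)

class D10(nn.Module):
    """Image-level discriminator for the deep feature layer F_10 (three convs + three FC layers)."""
    def __init__(self, in_ch=512):                               # F_10 assumed to be 512 x 38 x 38
        super().__init__()
        self.conv = nn.Sequential(
            nn.Conv2d(in_ch, 256, 3, stride=2, padding=1), nn.LeakyReLU(0.2, inplace=True),
            nn.Conv2d(256, 128, 3, stride=2, padding=1),   nn.LeakyReLU(0.2, inplace=True),
            nn.Conv2d(128, 64, 3, stride=2, padding=1),    nn.LeakyReLU(0.2, inplace=True),
        )
        self.fc = nn.Sequential(
            nn.Flatten(),
            nn.Linear(64 * 5 * 5, 256), nn.ReLU(inplace=True),   # 38 -> 19 -> 10 -> 5 spatially
            nn.Linear(256, 64),         nn.ReLU(inplace=True),
            nn.Linear(64, 1),           nn.Sigmoid(),            # single source/target probability
        )

    def forward(self, x):
        return self.fc(self.conv(x))
```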
Step5: Joint training is performed on the discriminators and the detector SSD, with the following training loss functions:
L_D = \sum_{m \in \{4, 10\}} L_D^m, \quad L_F = L_{det} + \sum_{m \in \{4, 10\}} \lambda_m L_F^m

During the joint training of the discriminators and the detector SSD, λ_4 and λ_10 are set to 0.1 and 0.2 respectively. For the SSD model, SGD is used for optimization with a learning rate of 0.00001. For the discriminators D_4 and D_10, the Adam algorithm is used for optimization with a learning rate of 0.0001. The batch size is 8 and the number of iterations is 20000.
Through the above steps, a more robust detector is finally obtained. Table 1 gives the detection performance on the target domain data before and after feature alignment, showing that the detection accuracy improves after feature alignment.
TABLE 1 Detection performance on target domain data before and after feature alignment

Method                    | Original SSD model | Feature layer 4 aligned | Feature layer 10 aligned | Feature layers 4 and 10 aligned
Detection accuracy (mAP)  | 16.9               | 22.3                    | 24.5                     | 27.9
Compared with the prior art, in this embodiment an initial detector is trained on the SSD detection model using the source domain dataset with annotation information, and the detector is then used as a pre-trained model to extract features from the source domain and the target domain respectively. The source and target domains share parameters during feature extraction, and their features are fed into separate discriminator models. Considering that features from different convolutional layers differ greatly, a shallow feature layer and a deep feature layer are simultaneously used as discriminator inputs, and the discriminators and the detector are trained jointly, achieving feature alignment of the source and target domains at different levels; the detection model trained on source domain data can thus be generalized to the target domain, improving its performance on target domain data without annotation information.
The above description is only an embodiment of the present invention, and not intended to limit the scope of the present invention, and all modifications of equivalent structures and equivalent processes performed by the present specification and drawings, or directly or indirectly applied to other related technical fields, are included in the scope of the present invention.

Claims (4)

1. A cross-domain target detection method based on multi-layer feature alignment is characterized by comprising the following steps:
step1: preprocessing source domain data and target domain data;
step2: inputting the preprocessed source domain data into an SSD detection model, performing end-to-end training on the source domain data with the labeled information to obtain a fully supervised initial detection model, and using the fully supervised initial detection model as a pre-training model for the next training;
step3: performing feature extraction on the preprocessed source domain data and target domain data by using a pre-training model;
step4: using a countermeasure generation network to realize multi-layer feature alignment of a source domain and a target domain;
the Step4 is specifically as follows: after the feature extraction of Step3, the source domain feature F_m(x_S) and the target domain feature F_m(x_T) are obtained, where m denotes the m-th convolutional feature layer; a generative adversarial network is then used to align the source and target domain features, where the discriminator of the adversarial network is D_m, a binary classification network, and the feature extractor F_m acts as the generator; for the discriminator, the source domain class is set to 1 and the target domain class to 0, so the discriminator's training objective at the m-th layer is denoted
L_D^m, as a binary cross-entropy loss function:

L_D^m = -E_{x_S \sim p(x_S)}[\log D_m(F_m(x_S))] - E_{x_T \sim p(x_T)}[\log(1 - D_m(F_m(x_T)))]

where x_S is sampled from the source domain data distribution p(x_S), x_T is sampled from the target domain data distribution p(x_T), E(·) denotes the expectation, D_m(F_m(x_S)) denotes the probability that the m-th layer feature of the source domain data x_S belongs to the real data, and D_m(F_m(x_T)) denotes the probability that the m-th layer feature of the target domain data belongs to the generated data;

in the generator, the source domain dataset is taken as the real data and the target domain data as the generated data, so the generator objective at the m-th layer, L_F^m, is:

L_F^m = -E_{x_T \sim p(x_T)}[\log D_m(F_m(x_T))]

combining the above two equations, when the input features of the discriminator come from the source domain, D_m(F_m(x_S)) approaches 1, and when the input features of the discriminator come from the target domain, D_m(F_m(x_T)) approaches 0; for the generator, i.e. the feature extraction network F_m, the purpose is to make D_m(F_m(x_T)) approach 1; therefore, the training of the m-th feature layer is a minimax game with value function V(D_m, F_m):

\min_{F_m} \max_{D_m} V(D_m, F_m) = E_{x_S \sim p(x_S)}[\log D_m(F_m(x_S))] + E_{x_T \sim p(x_T)}[\log(1 - D_m(F_m(x_T)))]

that is, during training the discriminator D_m is updated to minimize L_D^m while the feature extractor F_m is updated to minimize L_F^m;
step5: the adversarial generative network, i.e. the alignment network model, and the pre-trained model are trained jointly to obtain the final detection model trained on the source domain dataset;
the alignment network model and the pre-trained model are jointly trained to obtain the final detection model, wherein the loss functions in the training process are as follows:
L_D = \sum_m L_D^m, \quad L_F = L_{det} + \sum_m \lambda_m L_F^m

where L_D is the sum of the discriminator loss functions L_D^m of the different feature layers, L_F is the sum of the detection loss L_det on the source domain data and the generator loss functions L_F^m of the different feature layers, and λ_m are the weights of the adversarial losses of the different feature layers.
2. The method as claimed in claim 1, wherein the preprocessing of Step1 is data augmentation, specifically: an augmented picture is obtained by adjusting brightness, saturation and hue, randomly cropping a fixed area, randomly cropping a random size, randomly flipping by a fixed angle and randomly flipping by a random angle; the picture is then normalized and cropped to a fixed size of 300 × 300.
3. The method for cross-domain target detection based on multilayer feature alignment as claimed in claim 1, wherein the specific process of Step2 is as follows: an SSD detection model is constructed, taking VGG16 with the fully connected layers removed as the base network model and then adding six convolutional layers Conv6, Conv7, Conv8, Conv9, Conv10 and Conv11; Conv4_3, Conv7, Conv8_2, Conv9_2, Conv10_2 and Conv11_2 are extracted as feature layers; the preprocessed source domain data is input into the SSD detection model for feature extraction, 6 default boxes with different scales and aspect ratios are generated at each pixel of each feature layer, and the default boxes are matched with the annotated boxes to obtain the positive and negative samples used in training;
firstly, the default box with the largest intersection-over-union (IoU) with each annotated box is matched, and then any default box whose IoU with an annotated box is greater than 0.5 is also matched;
specifically, the default box size on the feature maps of the Conv7, Conv8_2, Conv9_2, Conv10_2 and Conv11_2 layers, as a ratio of the original image, is:
s_k = s_{min} + \frac{s_{max} - s_{min}}{u - 1}(k - 1), \quad k \in [1, u]

where u = 5 denotes the number of prediction layers, s_{min} = 0.2, s_{max} = 0.9, and s_k is the ratio of the default box size of the k-th prediction layer to the original image; the ratio of the default box size on the prediction layer Conv4_3 to the original image is 0.1, that is, for k = 6, s_k = 0.1; let S_k denote the default box size of the k-th prediction layer, then S_k = 300 × s_k;

for the default box aspect ratios, α_r ∈ {1, 2, 3, 1/2, 1/3} is taken, so the first five default boxes of the k-th layer have widths and heights

w_k^a = S_k \sqrt{\alpha_r} \quad and \quad h_k^a = S_k / \sqrt{\alpha_r}

the sixth default box S'_k of the k-th layer has aspect ratio 1 and size

S'_k = \sqrt{S_k S_{k+1}}

with S_7 = 312, so that each pixel of each prediction layer generates different default boxes according to the set sizes and aspect ratios;
in addition, each pixel of each prediction feature layer is fed through a convolution of size 3 × 3 × q_m to further obtain a classification confidence score and a regression offset, where q_m is the number of channels of the m-th feature layer;
during the training process, the objective loss function is:
L(x, c, l, g) = \frac{1}{N}(L_{conf}(x, c) + \alpha L_{loc}(x, l, g))

wherein N is the number of default boxes matched with annotated boxes, L_loc is the localization loss function, L_conf is the classification-confidence loss, α is a regularization parameter, x is the input image, c is the target category, l is the model-predicted box, and g is the annotated box; the localization loss regresses the position offsets of the default boxes; a box is defined as {cx, cy, w, h}, where (cx, cy) are the box center coordinates and w and h are respectively its width and height; with the default box denoted d, L_loc is:

L_{loc}(x, l, g) = \sum_{i \in Pos}^{N} \sum_{m \in \{cx, cy, w, h\}} x_{ij}^{k} \, smooth_{L1}(l_i^m - \hat{g}_j^m)

wherein

\hat{g}_j^{cx} = (g_j^{cx} - d_i^{cx}) / d_i^{w}, \quad \hat{g}_j^{cy} = (g_j^{cy} - d_i^{cy}) / d_i^{h}, \quad \hat{g}_j^{w} = \log(g_j^{w} / d_i^{w}), \quad \hat{g}_j^{h} = \log(g_j^{h} / d_i^{h})

(g_j^{cx}, g_j^{cy}), g_j^{w} and g_j^{h} are respectively the center coordinates, width and height of the j-th annotated box, and (d_i^{cx}, d_i^{cy}), d_i^{w} and d_i^{h} are respectively the center coordinates, width and height of the i-th default box; smooth_{L1} is the smooth L1 function:

smooth_{L1}(x) = 0.5 x^2, if |x| < 1; |x| - 0.5, otherwise

x_{ij}^{k} \in \{0, 1\} is an indicator: when x_{ij}^{k} = 1, the i-th default box is matched with the j-th annotated box of category k and the default box is placed in the positive sample set Pos; otherwise x_{ij}^{k} = 0;
the classification loss L_conf is a softmax loss function:
L_{conf}(x, c) = -\sum_{i \in Pos}^{N} x_{ij}^{p} \log(\hat{c}_i^{p}) - \sum_{i \in Neg} \log(\hat{c}_i^{0}), \quad with \quad \hat{c}_i^{p} = \frac{\exp(c_i^{p})}{\sum_p \exp(c_i^{p})}

wherein \hat{c}_i^{p} is the probability that the i-th default box belongs to category p, \hat{c}_i^{0} is the classification score when the i-th default box is predicted as the background, Neg denotes the matched negative sample set, i.e. the background, Pos denotes the matched positive sample set, and p is the category to be detected; when x_{ij}^{p} = 1, the i-th default box is matched with the j-th annotated box of category p and the default box is placed in the positive sample set Pos; otherwise x_{ij}^{p} = 0 and the default box is placed in the negative sample set Neg;
inputting the preprocessed source domain data into the SSD detection model, training through the above processes to obtain a fully supervised detection model, and taking the trained model as a pre-training model for next training.
4. The multi-layer feature alignment-based cross-domain target detection method according to claim 1, wherein Step3 is specifically: respectively performing feature extraction on the preprocessed source domain data and target domain data at the convolutional layers Conv1, Conv2, Conv3 and Conv4 in the pre-trained model.
CN201910594012.4A 2019-07-03 2019-07-03 Cross-domain target detection method based on multi-layer feature alignment Active CN110363122B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201910594012.4A CN110363122B (en) 2019-07-03 2019-07-03 Cross-domain target detection method based on multi-layer feature alignment

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201910594012.4A CN110363122B (en) 2019-07-03 2019-07-03 Cross-domain target detection method based on multi-layer feature alignment

Publications (2)

Publication Number Publication Date
CN110363122A CN110363122A (en) 2019-10-22
CN110363122B true CN110363122B (en) 2022-10-11

Family

ID=68217903

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201910594012.4A Active CN110363122B (en) 2019-07-03 2019-07-03 Cross-domain target detection method based on multi-layer feature alignment

Country Status (1)

Country Link
CN (1) CN110363122B (en)

Families Citing this family (26)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
WO2021097774A1 (en) * 2019-11-21 2021-05-27 Beijing Didi Infinity Technology And Development Co., Ltd. Systems and methods for multi-source domain adaptation for semantic segmentation
CN111126446B (en) * 2019-11-29 2023-04-07 西安工程大学 Method for amplifying defect image data of robot vision industrial product
CN113128287B (en) * 2019-12-31 2024-01-02 暗物质(北京)智能科技有限公司 Method and system for training cross-domain facial expression recognition model and facial expression recognition
CN111060318B (en) * 2020-01-09 2021-12-28 山东科技大学 Bearing fault diagnosis method based on deep countermeasure migration network
WO2021147366A1 (en) * 2020-01-23 2021-07-29 华为技术有限公司 Image processing method and related device
CN111340021B (en) * 2020-02-20 2022-07-15 中国科学技术大学 Unsupervised domain adaptive target detection method based on center alignment and relation significance
CN111382568B (en) * 2020-05-29 2020-09-11 腾讯科技(深圳)有限公司 Training method and device of word segmentation model, storage medium and electronic equipment
CN111860494B (en) * 2020-06-16 2023-07-07 北京航空航天大学 Optimization method and device for image target detection, electronic equipment and storage medium
CN112115009B (en) * 2020-08-13 2022-02-18 中国科学院计算技术研究所 Fault detection method for neural network processor
CN112115834A (en) * 2020-09-11 2020-12-22 昆明理工大学 Standard certificate photo detection method based on small sample matching network
CN112287963B (en) * 2020-09-14 2024-03-05 江南大学 Small sample target detection method under classifier hot start training mechanism
CN112115916B (en) * 2020-09-29 2023-05-02 西安电子科技大学 Domain adaptive Faster R-CNN semi-supervised SAR detection method
CN112395951B (en) * 2020-10-23 2022-06-24 中国地质大学(武汉) Complex scene-oriented domain-adaptive traffic target detection and identification method
CN112232293B (en) * 2020-11-09 2022-08-26 腾讯科技(深圳)有限公司 Image processing model training method, image processing method and related equipment
CN112861616B (en) * 2020-12-31 2022-10-11 电子科技大学 Passive field self-adaptive target detection method
CN112668594B (en) * 2021-01-26 2021-10-26 华南理工大学 Unsupervised image target detection method based on antagonism domain adaptation
CN112906763B (en) * 2021-02-01 2024-06-14 南京航空航天大学 Automatic digital image labeling method utilizing cross-task information
CN113706440A (en) * 2021-03-12 2021-11-26 腾讯科技(深圳)有限公司 Image processing method, image processing device, computer equipment and storage medium
CN113052184B (en) * 2021-03-12 2022-11-18 电子科技大学 Target detection method based on two-stage local feature alignment
CN113222997A (en) * 2021-03-31 2021-08-06 上海商汤智能科技有限公司 Neural network generation method, neural network image processing device, electronic device, and medium
CN114495265B (en) * 2021-07-15 2023-04-07 电子科技大学 Human behavior recognition method based on activity graph weighting under multi-cross-domain scene
CN113837300B (en) * 2021-09-29 2024-03-12 上海海事大学 Automatic driving cross-domain target detection method based on block chain
CN113870258B (en) * 2021-12-01 2022-03-25 浙江大学 Counterwork learning-based label-free pancreas image automatic segmentation system
CN114386527B (en) * 2022-01-18 2022-12-09 湖南大学无锡智能控制研究院 Category regularization method and system for domain adaptive target detection
CN116188830B (en) * 2022-11-01 2023-09-29 青岛柯锐思德电子科技有限公司 Hyperspectral image cross-domain classification method based on multi-level feature alignment
CN117934869B (en) * 2024-03-22 2024-06-18 中铁大桥局集团有限公司 Target detection method, system, computing device and medium

Citations (11)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN107423760A (en) * 2017-07-21 2017-12-01 西安电子科技大学 Based on pre-segmentation and the deep learning object detection method returned
CN108710896A (en) * 2018-04-24 2018-10-26 浙江工业大学 The field learning method of learning network is fought based on production
CN109272024A (en) * 2018-08-29 2019-01-25 昆明理工大学 A kind of image interfusion method based on convolutional neural networks
CN109376620A (en) * 2018-09-30 2019-02-22 华北电力大学 A kind of migration diagnostic method of gearbox of wind turbine failure
CN109409365A (en) * 2018-10-25 2019-03-01 江苏德劭信息科技有限公司 It is a kind of that method is identified and positioned to fruit-picking based on depth targets detection
CN109447149A (en) * 2018-10-25 2019-03-08 腾讯科技(深圳)有限公司 A kind of training method of detection model, device and terminal device
CN109543640A (en) * 2018-11-29 2019-03-29 中国科学院重庆绿色智能技术研究院 A kind of biopsy method based on image conversion
CN109583342A (en) * 2018-11-21 2019-04-05 重庆邮电大学 Human face in-vivo detection method based on transfer learning
CN109671018A (en) * 2018-12-12 2019-04-23 华东交通大学 A kind of image conversion method and system based on production confrontation network and ResNets technology
CN109685066A (en) * 2018-12-24 2019-04-26 中国矿业大学(北京) A kind of mine object detection and recognition method based on depth convolutional neural networks
CN109829537A (en) * 2019-01-30 2019-05-31 华侨大学 Style transfer method and equipment based on deep learning GAN network children's garment clothes

Family Cites Families (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US10474929B2 (en) * 2017-04-25 2019-11-12 Nec Corporation Cyclic generative adversarial network for unsupervised cross-domain image generation
US11543830B2 (en) * 2017-12-06 2023-01-03 Petuum, Inc. Unsupervised real-to-virtual domain unification for end-to-end highway driving

Patent Citations (11)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN107423760A (en) * 2017-07-21 2017-12-01 西安电子科技大学 Based on pre-segmentation and the deep learning object detection method returned
CN108710896A (en) * 2018-04-24 2018-10-26 浙江工业大学 The field learning method of learning network is fought based on production
CN109272024A (en) * 2018-08-29 2019-01-25 昆明理工大学 A kind of image interfusion method based on convolutional neural networks
CN109376620A (en) * 2018-09-30 2019-02-22 华北电力大学 A kind of migration diagnostic method of gearbox of wind turbine failure
CN109409365A (en) * 2018-10-25 2019-03-01 江苏德劭信息科技有限公司 It is a kind of that method is identified and positioned to fruit-picking based on depth targets detection
CN109447149A (en) * 2018-10-25 2019-03-08 腾讯科技(深圳)有限公司 A kind of training method of detection model, device and terminal device
CN109583342A (en) * 2018-11-21 2019-04-05 重庆邮电大学 Human face in-vivo detection method based on transfer learning
CN109543640A (en) * 2018-11-29 2019-03-29 中国科学院重庆绿色智能技术研究院 A kind of biopsy method based on image conversion
CN109671018A (en) * 2018-12-12 2019-04-23 华东交通大学 A kind of image conversion method and system based on production confrontation network and ResNets technology
CN109685066A (en) * 2018-12-24 2019-04-26 中国矿业大学(北京) A kind of mine object detection and recognition method based on depth convolutional neural networks
CN109829537A (en) * 2019-01-30 2019-05-31 华侨大学 Style transfer method and equipment based on deep learning GAN network children's garment clothes

Non-Patent Citations (8)

* Cited by examiner, † Cited by third party
Title
EmotionGAN: Unsupervised Domain Adaptation for Learning Discrete Probability Distributions of Image Emotions;Sicheng Zhao等;《Session: Multimedia-2 (Socical & Emotional Multimedia)》;20181026;第1319-1327页 *
Unsupervised Domain Adaptation for Classification of Histopathology Whole-Slide Images;Jian Ren等;《Frontiers in Bioengineering and Biotechnology》;20190515;第7卷;第1-12页 *
Weakly Supervised Object Localization with Progressive Domain Adaptation;Dong Li等;《CVPR》;20161231;第3512-3520页 *
Research and Application of Malicious Code Recognition Based on Generative Adversarial Networks; Cao Qiyun; China Master's Theses Full-text Database, Information Science and Technology; 20190115 (No. 01, 2019); I139-270 *
Person Re-identification Research Based on Pedestrian Parts, Group Similarity and Data Augmentation; Zeng Qixun; China Master's Theses Full-text Database, Information Science and Technology; 20190615 (No. 06, 2019); I136-233 *
Research on Key Technologies of Skin Beautification and Overall Stylization of Digital Face Images; Xie Lu; China Master's Theses Full-text Database, Information Science and Technology; 20190115 (No. 01, 2019); I138-3080 *
Feature-Enhanced SSD Algorithm and Its Application in Object Detection; Tan Hongchen et al.; Journal of Computer-Aided Design & Computer Graphics; 20190430; Vol. 31, No. 4; pp. 573-579 *
Research on Generative Adversarial Networks and Their Application in Image Style Transfer; Dong Wei; China Master's Theses Full-text Database, Information Science and Technology; 20190615 (No. 06, 2019); I140-90 *

Also Published As

Publication number Publication date
CN110363122A (en) 2019-10-22

Similar Documents

Publication Publication Date Title
CN110363122B (en) Cross-domain target detection method based on multi-layer feature alignment
CN104599275B (en) The RGB-D scene understanding methods of imparametrization based on probability graph model
WO2019169816A1 (en) Deep neural network for fine recognition of vehicle attributes, and training method thereof
Lynen et al. Placeless place-recognition
Zhou et al. Robust vehicle detection in aerial images using bag-of-words and orientation aware scanning
CN112016605B (en) Target detection method based on corner alignment and boundary matching of bounding box
CN105138998B (en) Pedestrian based on the adaptive sub-space learning algorithm in visual angle recognition methods and system again
Kang et al. Pairwise relational networks for face recognition
CN111583263A (en) Point cloud segmentation method based on joint dynamic graph convolution
CN111639564B (en) Video pedestrian re-identification method based on multi-attention heterogeneous network
CN110175615B (en) Model training method, domain-adaptive visual position identification method and device
Wang et al. Traffic sign detection using a cascade method with fast feature extraction and saliency test
CN101630363A (en) Rapid detection method of face in color image under complex background
CN110287798B (en) Vector network pedestrian detection method based on feature modularization and context fusion
CN108230330B (en) Method for quickly segmenting highway pavement and positioning camera
WO2022218396A1 (en) Image processing method and apparatus, and computer readable storage medium
Redondo-Cabrera et al. All together now: Simultaneous object detection and continuous pose estimation using a hough forest with probabilistic locally enhanced voting
CN112949510A (en) Human detection method based on fast R-CNN thermal infrared image
CN105868776A (en) Transformer equipment recognition method and device based on image processing technology
CN111540203A (en) Method for adjusting green light passing time based on fast-RCNN
Xiang et al. A real-time vehicle traffic light detection algorithm based on modified YOLOv3
CN116994164A (en) Multi-mode aerial image fusion and target detection combined learning method
CN116468935A (en) Multi-core convolutional network-based stepwise classification and identification method for traffic signs
CN111539362A (en) Unmanned aerial vehicle image target detection device and method
CN113705731A (en) End-to-end image template matching method based on twin network

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant