WO2023092582A1 - A scene adaptive target detection method based on motion foreground - Google Patents

A scene adaptive target detection method based on motion foreground

Info

Publication number
WO2023092582A1
WO2023092582A1 PCT/CN2021/134085 CN2021134085W WO2023092582A1 WO 2023092582 A1 WO2023092582 A1 WO 2023092582A1 CN 2021134085 W CN2021134085 W CN 2021134085W WO 2023092582 A1 WO2023092582 A1 WO 2023092582A1
Authority
WO
WIPO (PCT)
Prior art keywords
target
domain
features
source domain
boxes
Prior art date
Application number
PCT/CN2021/134085
Other languages
French (fr)
Inventor
Haimiao Hu
Mingzhu Li
Yidan ZHANG
Hongxu Jiang
Original Assignee
Hangzhou Innovation Institute, Beihang University
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Hangzhou Innovation Institute, Beihang University filed Critical Hangzhou Innovation Institute, Beihang University
Publication of WO2023092582A1 publication Critical patent/WO2023092582A1/en

Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06VIMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V10/00Arrangements for image or video recognition or understanding
    • G06V10/70Arrangements for image or video recognition or understanding using pattern recognition or machine learning
    • G06V10/82Arrangements for image or video recognition or understanding using pattern recognition or machine learning using neural networks
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00Pattern recognition
    • G06F18/20Analysing
    • G06F18/22Matching criteria, e.g. proximity measures
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00Pattern recognition
    • G06F18/20Analysing
    • G06F18/24Classification techniques
    • G06F18/241Classification techniques relating to the classification model, e.g. parametric or non-parametric approaches
    • G06F18/2415Classification techniques relating to the classification model, e.g. parametric or non-parametric approaches based on parametric or probabilistic models, e.g. based on likelihood ratio or false acceptance rate versus a false rejection rate
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/02Neural networks
    • G06N3/04Architecture, e.g. interconnection topology
    • G06N3/045Combinations of networks
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/02Neural networks
    • G06N3/08Learning methods
    • G06N3/084Backpropagation, e.g. using gradient descent
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06VIMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V10/00Arrangements for image or video recognition or understanding
    • G06V10/40Extraction of image or video features
    • G06V10/62Extraction of image or video features relating to a temporal dimension, e.g. time-based feature extraction; Pattern tracking
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06VIMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V2201/00Indexing scheme relating to image or video recognition or understanding
    • G06V2201/07Target detection

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Evolutionary Computation (AREA)
  • Data Mining & Analysis (AREA)
  • General Physics & Mathematics (AREA)
  • Artificial Intelligence (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • General Engineering & Computer Science (AREA)
  • General Health & Medical Sciences (AREA)
  • Software Systems (AREA)
  • Computing Systems (AREA)
  • Health & Medical Sciences (AREA)
  • Molecular Biology (AREA)
  • Mathematical Physics (AREA)
  • Biomedical Technology (AREA)
  • Biophysics (AREA)
  • Computational Linguistics (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Bioinformatics & Computational Biology (AREA)
  • Bioinformatics & Cheminformatics (AREA)
  • Evolutionary Biology (AREA)
  • Multimedia (AREA)
  • Databases & Information Systems (AREA)
  • Probability & Statistics with Applications (AREA)
  • Medical Informatics (AREA)
  • Image Analysis (AREA)

Abstract

With the development of deep learning technology, the requirements for model generalization performance in real environments are increasing. The influence of illumination and background on model generalization performance has received wide attention. The invention discloses a scene adaptive target detection method based on motion foreground. In this method, the motion foreground target boxes are used effectively based on the prior that the motion foreground and the global target data distribution are consistent, and the instance feature similarity is calculated through a decoder, which greatly improves the effect of the model in the target domain. The experimental results show that the target detection effect of the method is greatly improved in a real environment.

Description

A scene adaptive target detection method based on motion foreground Field of the invention
The invention relates to a scene adaptive target detection method based on motion foreground.
Background art
In the field of computer vision, target detection is an important topic. The task of target detection is to find the regions of interest in images and videos, and to determine their categories and locations. At present, many methods based on deep learning can achieve good results on benchmark datasets. However, due to the existence of domain differences, the performance of a model degrades when the target size, camera angle, lighting and background environment change. The simplest and most effective way to solve this problem is to train models in the same domain. However, on the one hand, manual annotation of datasets costs a lot of manpower and resources; on the other hand, many practical fields cannot be manually annotated. Therefore, in order to solve the degradation of model performance caused by different data distributions, target detection methods based on domain adaptation came into being.
At present, target detection methods based on domain adaptation include feature-based methods and model-based methods. The most classical method is to minimize the domain difference of features through adversarial training, so that the domain of proposal box features cannot be distinguished. This algorithm is called DA-FasterRCNN, and many related algorithms improve on it. Another class of algorithms realizes pixel-level domain alignment through adversarial generation.
However, the above algorithms only consider the domain differences in classification, while the domain differences in regression are not considered, so the model effect is not ideal when the scene changes. In addition, because the data distribution is unknown, suitable proposal boxes cannot be extracted in the candidate-region (RPN) stage of two-stage target detection, and it is also impossible to determine which feature regions need to be aligned during the feature alignment stage.
Summary of the invention
To solve the above technical problems in the prior art, the present invention provides a scene adaptive target detection method based on motion foreground.
According to one aspect of the invention, a scene adaptive target detection method based on motion foreground is provided, which includes the following steps:
A) obtaining the source domain dataset and the target domain dataset; the source domain dataset contains source domain RGB images, target detection manual labels and motion foreground target box labels; the target domain dataset contains target domain RGB images and motion foreground target box labels;
B) sending the source domain dataset and the target domain dataset into the feature extraction module to obtain the source domain features and the target domain features;
C) sending the source domain features and the target domain features from step B) into the first proposal boxes and foreground boxes aggregation module to obtain the source domain instance features and the target domain instance features;
D) sending the source domain features from step B) into the second proposal boxes and foreground boxes aggregation module to obtain the source domain classification regression features;
E) sending the source domain instance features and the target domain instance features from step C) into the generative similarity measurement module to compute loss; the network is optimized and the domain difference is reduced in this way;
F) sending the source domain classification regression features from step D) into the classification regression module to compute loss and to optimize the network;
G) sending the source domain features and the target domain features from step B) into the global feature alignment module to compute loss and to optimize the network.
The acquisition methods of the motion foreground target boxes in step A) include ViBe, the Gaussian mixture model, the frame difference method and optical flow.
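By way of illustration, the following Python sketch extracts motion foreground target boxes with the frame difference method, one of the options listed above; the embodiment described later uses ViBe, and the OpenCV calls, threshold values and minimum area here are illustrative assumptions rather than the patent's implementation.

```python
import cv2
import numpy as np

def foreground_boxes(prev_frame, curr_frame, diff_thresh=25, min_area=100):
    # Frame difference between two consecutive grayscale frames.
    g1 = cv2.cvtColor(prev_frame, cv2.COLOR_BGR2GRAY)
    g2 = cv2.cvtColor(curr_frame, cv2.COLOR_BGR2GRAY)
    diff = cv2.absdiff(g1, g2)
    # Binarize and dilate to get a motion mask.
    _, mask = cv2.threshold(diff, diff_thresh, 255, cv2.THRESH_BINARY)
    mask = cv2.dilate(mask, np.ones((3, 3), np.uint8), iterations=2)
    # Each sufficiently large connected region becomes a motion foreground target box.
    contours, _ = cv2.findContours(mask, cv2.RETR_EXTERNAL, cv2.CHAIN_APPROX_SIMPLE)
    boxes = []
    for c in contours:
        if cv2.contourArea(c) >= min_area:
            x, y, w, h = cv2.boundingRect(c)
            boxes.append((x, y, x + w, y + h))
    return boxes
```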
In the training process, the RPN proposal boxes with high confidence are combined with the motion foreground target boxes. After the samples are equalized, the source domain instance features and the target domain instance features are extracted.
In the training process, the RPN proposal boxes are combined with the motion foreground target boxes. Then, the source domain classification regression features are extracted.
In the training process, the decoder is used to reconstruct the source domain instance features and the target domain instance features, and the decoding features are obtained. Then, the similarity loss of the decoding features is calculated to achieve instance feature alignment.
In the training process, the classification regression loss of the source domain dataset is calculated using the ground-truth labels of the source domain target boxes, to ensure the accuracy of source domain target detection.
The global feature alignment module includes gradient reversal layer (GRL) and classifier for achieving image level feature alignment.
The improvements of the present invention over the prior art include that a scene adaptive target detection method based on motion foreground is provided in which the prior knowledge of the motion foreground is effectively utilized, that the decoder is used for feature alignment to obtain a better detection effect, and that the generalization performance of the model in new scenes is improved.
Brief description of the drawings
FIG. 1 is a flow chart of a scene adaptive target detection method based on the motion foreground according to an embodiment of the invention;
FIG. 2 is the schematic diagram of first proposal boxes and foreground boxes aggregation module (PFA1) according to an embodiment of the invention;
FIG. 3 is the schematic diagram of second proposal boxes and foreground boxes aggregation module (PFA2) according to an embodiment of the invention;
FIG. 4 is the schematic diagram of generative similarity measurement (GSM) module according to an embodiment of the invention;
FIG. 5 is the schematic diagram of global feature alignment (GFA) module according to an embodiment of the invention.
Detailed description of the invention
In the embodiment according to the present invention as shown in Fig. 1, the source domain dataset is D_S, where n_S represents the actual number of samples in the source domain; each source domain sample i is associated with the sample image, the set of target box coordinate values for the source domain sample i, and the target category for the source domain sample i. There is only a single category, pedestrian, in this embodiment. Each source domain sample i is also associated with the set of motion foreground target box coordinate values; the number of boxes in the target box set and the number of boxes in the motion foreground target box set may be inconsistent. The target domain dataset is D_T, where n_T represents the actual number of samples in the target domain; each target domain sample i is associated with the sample image and the set of motion foreground target box coordinate values for the target domain sample i.
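For illustration only, the per-sample records implied by these dataset definitions might be organized as below; the field names and the box coordinate convention are assumptions, not the patent's notation.

```python
from dataclasses import dataclass, field
from typing import List, Tuple

Box = Tuple[float, float, float, float]  # (x1, y1, x2, y2), an assumed convention

@dataclass
class SourceSample:
    image_path: str
    target_boxes: List[Box]                                    # manually annotated target boxes
    category: str = "pedestrian"                               # single category in this embodiment
    foreground_boxes: List[Box] = field(default_factory=list)  # motion foreground target boxes

@dataclass
class TargetSample:
    image_path: str
    foreground_boxes: List[Box] = field(default_factory=list)  # no manual labels needed
```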
The scene adaptive target detection method based on the motion foreground according to the present invention trains the model using the source domain data and the motion foreground data of the target domain, so that the model can have a good detection effect even without an annotated target domain (T) dataset; the method includes the following steps:
A) sending the continuous frame sample set in the source domain and the continuous frame sample set in the target domain into the ViBe motion target detection algorithm to obtain the source domain motion foreground target boxes and the target domain motion foreground target boxes (S represents the source domain and T represents the target domain), so that the source domain dataset D_S and the target domain dataset D_T are obtained;
B) sending the source domain dataset D_S and the target domain dataset D_T into the feature extraction module (S101) to obtain the source domain features f1 and the target domain features f2; ResNet-101, for example, is used as the backbone network of the feature extraction module in an embodiment of the invention, as sketched in code after step G) below;
C) sending the source domain features f1 and the source domain motion foreground target boxes into PFA1 to obtain the source domain instance features pfs (S112), and sending the target domain features f2 and the target domain motion foreground target boxes into PFA1 to obtain the target domain instance features pft (S113);
D) sending the source domain features f1 and the source domain motion foreground target boxes into PFA2 to obtain the source domain classification regression features crs (S112);
E) sending the source domain classification regression features crs into the classification regression module (S121), where the classification regression loss of the source domain dataset is calculated using the ground-truth labels of the source domain target boxes, so that the network weights of the feature extraction module and the classification regression module are optimized by training on the source domain dataset;
F) sending the source domain instance features pfs and the target domain instance features pft into the generative similarity measurement (GSM) module (S122), so that, by training on the source domain dataset and the target domain dataset, the instance features of the source domain and the target domain are made as similar as possible, the network weights of the feature extraction module and the GSM module are optimized, and the generalization performance of the model is improved;
G) sending the source domain features f1 and the target domain features f2 into the global feature alignment (GFA) module, so that, by training on the source domain dataset and the target domain dataset, the features of the source domain and the target domain are made as similar as possible, the network weights of the feature extraction module and the GFA module (gradient reversal layer (GRL) and classifier) are optimized, and the domain of the source domain feature f1 and of the target domain feature f2 cannot be distinguished.
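As noted in step B) above, ResNet-101 may serve as the backbone of the feature extraction module. A minimal PyTorch sketch is given below, assuming a recent torchvision ResNet-101 truncated before its pooling and classification head; the class name and truncation point are assumptions rather than the patent's exact network.

```python
import torch.nn as nn
import torchvision

class FeatureExtractor(nn.Module):
    def __init__(self):
        super().__init__()
        # ImageNet-pretrained ResNet-101 (requires torchvision >= 0.13 for the weights API).
        resnet = torchvision.models.resnet101(weights="IMAGENET1K_V1")
        # Keep all layers up to and including layer4, drop avgpool/fc,
        # so the module outputs a spatial feature map (f1 or f2).
        self.body = nn.Sequential(*list(resnet.children())[:-2])

    def forward(self, x):
        return self.body(x)

# Usage: f1 = FeatureExtractor()(source_batch)  # shape (N, 2048, H/32, W/32)
```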
According to a further aspect of the invention, as shown in FIG. 2, the first proposal boxes and foreground boxes aggregation module PFA1 comprises sub-modules for performing the following steps respectively:
Step S201: sending the continuous frame sample set in the source domain and the continuous frame sample set in the target domain into the RPN network (region proposal network) to obtain the set of positive and negative proposal boxes, where the jth proposal box is generated for the ith image sample of the source domain or the target domain, and C represents the number of proposal boxes generated in the RPN network, which is set to 64 in an embodiment of the invention;
Step S202: sending the continuous frame sample set in the source domain and the continuous frame sample set in the target domain into the ViBe motion target detection algorithm to obtain the source domain motion foreground target boxes and the target domain motion foreground target boxes, where fb_i represents the ith set of motion foreground target boxes;
Step S211: selecting, from the set of positive and negative proposal boxes, the proposal boxes whose confidence is greater than the preset threshold TH, where TH is set to 0.7 in an embodiment of the invention;
Step S212: merging the proposal boxes obtained in S211 with the motion foreground target boxes obtained in S202;
Step S213: sending the output of S212 into the sample equalization filter to obtain the source domain PFA1 proposal box set and the target domain PFA1 proposal box set, where {b_if}_j represents the jth box of the PFA1 proposal box set in the ith image sample of the dataset, f denotes the set of proposal boxes generated in PFA1, S represents the source domain, T represents the target domain, C_Sf and C_Tf respectively represent the number of boxes in the combined set of proposal boxes and motion foreground target boxes in the source domain and in the target domain, and C_Sf = C_Tf.
In the sample equalization filter, a fixed sample number f_num is set, which is set to 8 in an embodiment of the invention, so that the number of PFA1 proposal boxes in the ith sample in the source domain (S) is consistent with that in the ith sample in the target domain (T), so as to eliminate sample imbalance.
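A minimal NumPy sketch of the S211-S213 merge-and-equalize logic follows, assuming boxes are plain (x1, y1, x2, y2) arrays and at least one box is available per image; the function name, the random duplication strategy used to reach f_num, and the default values are illustrative assumptions.

```python
import numpy as np

def pfa1_boxes(rpn_boxes, rpn_scores, fg_boxes, th=0.7, f_num=8, rng=None):
    rng = rng or np.random.default_rng()
    fg_boxes = np.asarray(fg_boxes, dtype=float)
    # S211: keep only RPN proposals whose confidence exceeds TH.
    confident = rpn_boxes[rpn_scores > th]
    # S212: merge confident proposals with the motion foreground target boxes.
    merged = np.concatenate([confident, fg_boxes], axis=0) if len(confident) else fg_boxes
    # S213: sample equalization filter - force exactly f_num boxes per image by
    # random sub-sampling or duplication, so source and target samples match.
    idx = rng.choice(len(merged), size=f_num, replace=len(merged) < f_num)
    return merged[idx]
```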
According to a further aspect of the invention, as shown in FIG. 3, the second proposal boxes and foreground boxes aggregation module PFA2 comprises sub-modules for performing the following steps respectively:
Step S301: sending the continuous frame sample set in the source domain into the RPN network (region proposal network) to obtain the set of source domain positive and negative proposal boxes, where the jth proposal box is generated for the ith image sample of the source domain, and C represents the number of proposal boxes generated in the RPN network, which is set to 64 in an embodiment of the invention;
Step S302: sending the continuous frame sample set in the source domain into the ViBe motion target detection algorithm to obtain the source domain motion foreground target boxes, where fb_i represents the ith set of motion foreground target boxes;
Step S311: merging the set of source domain positive and negative proposal boxes obtained in S301 with the source domain motion foreground target boxes obtained in S302 to obtain the source domain PFA2 proposal box set; by adding the motion foreground target boxes, this overcomes the problem that the two domains cannot generate accurate proposal boxes when the target size difference is large. Here {b_ia}_j represents the jth box of the PFA2 proposal box set in the ith image sample of the dataset, "a" denotes the set of proposal boxes generated in PFA2, S represents the source domain, and C_Sa represents the number of boxes in the combined set of proposal boxes and motion foreground target boxes in the source domain.
According to a further aspect of the invention, in the classification and regression module S121, the source domain feature f1 and the PFA2 proposal box set are input into classifiers and regressors to complete regression and classification of samples, and the loss function of this part is as follows:
L_det = L_RPN + L_T,
where L_det represents the loss function of the source domain detection, including L_RPN (RPN loss function) and L_T (classification regression loss function); det refers to the total loss function of the classification regression module, RPN refers to the loss function of the first-stage RPN stage of the two-stage target detection framework, and T refers to the loss function of the second-stage classification regression stage of the two-stage target detection framework. In an embodiment, cross entropy loss is used for the classification loss and mean square error (MSE) loss is used for the regression loss.
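The composition L_det = L_RPN + L_T can be sketched as below, assuming each stage contributes a cross-entropy classification term and an MSE regression term as stated for the embodiment; the tensor shapes and the helper name are assumptions.

```python
import torch.nn.functional as F

def detection_loss(rpn_cls_logits, rpn_labels, rpn_deltas, rpn_targets,
                   roi_cls_logits, roi_labels, roi_deltas, roi_targets):
    # L_RPN: first-stage objectness classification + box regression.
    l_rpn = (F.cross_entropy(rpn_cls_logits, rpn_labels)
             + F.mse_loss(rpn_deltas, rpn_targets))
    # L_T: second-stage category classification + box regression.
    l_t = (F.cross_entropy(roi_cls_logits, roi_labels)
           + F.mse_loss(roi_deltas, roi_targets))
    return l_rpn + l_t  # L_det
```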
According to a further aspect of the invention, as shown in FIG. 4, the generative similarity measurement module includes sub-modules for performing the following operations respectively:
Step S401: applying the PFA1 proposal box set generated by the PFA1 module to the source domain feature f1 and the target domain feature f2 extracted by the feature extraction module, to obtain the source domain instance feature f_S and the target domain instance feature f_T;
Step S402: sending the source domain instance feature f_S and the target domain instance feature f_T into an adaptive average pooling layer to obtain the source domain pooling feature and the target domain pooling feature f_Ss402, f_Ts402, which have an output size of 8*8 and a number of channels equal to the number of channels of the source domain instance feature f_S;
Step S403: sending the output from S402 into a first 1*1 convolution layer to obtain the first source domain convolution feature and the first target domain convolution feature f_Ss403, f_Ts403, which each have a number of channels of 1024 in an embodiment of the invention;
Step S404: sending the output from S403 into a first up-sampling layer to obtain the first source domain up-sampling feature and the first target domain up-sampling feature f_Ss404, f_Ts404, which each have an output size of 16*16 and a number of channels of 256 in an embodiment of the invention, where the up-sampling module comprises an interpolation up-sampling layer, a convolution layer and a batch normalization layer;
Step S405: sending the output from S404 into a second up-sampling layer to obtain the second source domain up-sampling feature and the second target domain up-sampling feature f_Ss405, f_Ts405, which each have an output size of 32*32 and a number of channels of 256 in an embodiment of the invention;
Step S406: sending the output from S405 into a third up-sampling layer to obtain the third source domain up-sampling feature and the third target domain up-sampling feature f_Ss406, f_Ts406, which each have an output size of 64*64 and a number of channels of 256 in an embodiment of the invention;
Step S407: sending the output from S406 into a second 1*1 convolution layer to obtain the source domain decoding feature and the target domain decoding feature f_SG, f_TG, which each have a number of channels of 3 in an embodiment of the invention;
computing the perceptual loss of the source domain decoding feature and the target domain decoding feature f_SG, f_TG to obtain the loss L_ins:
L_ins = E(G(S), G(T)),
where E represents the perceptual loss, a loss function used to measure the similarity between images; L_ins represents the perceptual loss value of the source domain decoding feature and the target domain decoding feature f_SG, f_TG; E is the calculation function of the perceptual loss (the perceptual loss function is existing technology); and G(S), G(T) respectively represent the source domain decoding features and the target domain decoding features f_SG, f_TG generated by steps S402-S407 when the source domain instance feature and the target domain instance feature f_S, f_T are input into the shared decoder G. This scheme can effectively measure the similarity between the instance features of the two domains (the source domain and the target domain). Through the training of the feature extraction module and the GSM module, the instance features of the source domain and of the target domain can be made as similar as possible, and the accuracy of the classification regression module in the target domain can be guaranteed. Moreover, the use of the decoder enhances the generalization performance of the model, reduces the risk of overfitting and reduces the failure rate of model training.
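A minimal PyTorch sketch of such a shared decoder G is given below, following the 8*8 pooling, 1*1 convolution to 1024 channels, three interpolation up-sampling blocks at 256 channels, and final 1*1 convolution to 3 channels described in steps S402-S407; the activation functions and the input channel count are assumptions. The perceptual loss E would then be computed on the two decoded maps, for example with a fixed pretrained feature network, although the patent does not fix a particular implementation.

```python
import torch
import torch.nn as nn

class UpBlock(nn.Module):
    # Interpolation up-sampling layer + convolution layer + batch normalization layer (S404-S406).
    def __init__(self, in_ch, out_ch):
        super().__init__()
        self.up = nn.Upsample(scale_factor=2, mode="bilinear", align_corners=False)
        self.conv = nn.Conv2d(in_ch, out_ch, kernel_size=3, padding=1)
        self.bn = nn.BatchNorm2d(out_ch)

    def forward(self, x):
        return torch.relu(self.bn(self.conv(self.up(x))))

class Decoder(nn.Module):
    def __init__(self, in_ch=2048):          # in_ch is an assumption (backbone-dependent)
        super().__init__()
        self.pool = nn.AdaptiveAvgPool2d(8)   # S402: 8x8 pooling
        self.conv1 = nn.Conv2d(in_ch, 1024, 1)  # S403: 1x1 conv, 1024 channels
        self.up1 = UpBlock(1024, 256)         # S404: 16x16, 256 channels
        self.up2 = UpBlock(256, 256)          # S405: 32x32, 256 channels
        self.up3 = UpBlock(256, 256)          # S406: 64x64, 256 channels
        self.conv2 = nn.Conv2d(256, 3, 1)     # S407: 1x1 conv, 3 channels

    def forward(self, f_inst):
        x = self.conv1(self.pool(f_inst))
        x = self.up3(self.up2(self.up1(x)))
        return self.conv2(x)                  # decoding feature f_SG or f_TG
```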
According to a further aspect of the invention, as shown in Figure 5, the global feature alignment module GFA comprises sub-modules for performing the following operations respectively:
Step S501: obtaining the source domain feature f1 and the target domain feature f2 generated by the feature extraction module;
Step S502: sending the source domain feature f1 and the target domain feature f2 into the GRL (gradient reversal layer). In conventional back propagation, the loss (the difference between the predicted value and the real value) is propagated layer by layer, and each layer calculates the gradient according to the propagated loss and then updates its parameters. The GRL inverts the gradients transmitted to it, so that the network training objectives before and after the GRL are opposite, so as to achieve an adversarial effect;
Step S503: sending the source domain feature f1 and the target domain feature f2 into a classifier to distinguish the source domain feature from the target domain feature, where the classifier includes convolution layers and an activation layer for performing the operations in Steps S511-S513 respectively;
where the loss function of the global feature alignment module GFA is the loss function of the classifier, L_img. In an embodiment of the invention, it is the cross entropy loss function:
L_img = -(1/N) Σ_{i=1}^{N} [y_i log(p_i) + (1 - y_i) log(1 - p_i)],
where N is the number of all samples in the source domain and the target domain, i is the sample index, y_i indicates whether the actual label of the sample belongs to the source domain or the target domain, and p_i is the probability that the sample belongs to the respective category output by the classifier.
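A minimal PyTorch sketch of the GFA path (steps S501-S503) is given below: a gradient reversal layer implemented as a custom autograd function, followed by a small convolutional domain classifier trained with cross entropy; the classifier widths, kernel sizes and helper names are assumptions rather than the patent's exact layers.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F
from torch.autograd import Function

class GradReverse(Function):
    # S502: identity in the forward pass, negated (optionally scaled) gradient in
    # the backward pass, so the feature extractor is trained adversarially.
    @staticmethod
    def forward(ctx, x, alpha=1.0):
        ctx.alpha = alpha
        return x.view_as(x)

    @staticmethod
    def backward(ctx, grad_output):
        return grad_output.neg() * ctx.alpha, None

def grad_reverse(x, alpha=1.0):
    return GradReverse.apply(x, alpha)

class DomainClassifier(nn.Module):
    # S503: convolution and activation layers that predict, per spatial location,
    # whether a feature map comes from the source domain.
    def __init__(self, in_ch=2048):
        super().__init__()
        self.conv1 = nn.Conv2d(in_ch, 256, kernel_size=1)
        self.conv2 = nn.Conv2d(256, 1, kernel_size=1)

    def forward(self, feat, alpha=1.0):
        x = grad_reverse(feat, alpha)
        x = F.relu(self.conv1(x))
        return self.conv2(x)  # domain logits

def image_alignment_loss(logits, is_source):
    # L_img: cross entropy between the predicted domain and the true domain label.
    target = torch.full_like(logits, 1.0 if is_source else 0.0)
    return F.binary_cross_entropy_with_logits(logits, target)
```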
In an embodiment of the invention, the final global loss function is:
L = L_det + λ_1 L_ins + λ_2 L_img,
where λ_1, λ_2 are empirical values for weighting the contribution of the three losses to the final loss. In this embodiment, both are taken as 1.
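A short sketch of composing the final global loss is given below, assuming the three per-module losses have already been computed as in the sketches above; the helper name is an assumption, and the weight defaults of 1 mirror the embodiment.

```python
def total_loss(l_det, l_ins, l_img, lambda_1=1.0, lambda_2=1.0):
    # L = L_det + lambda_1 * L_ins + lambda_2 * L_img (both weights are 1 in the embodiment).
    return l_det + lambda_1 * l_ins + lambda_2 * l_img

# Usage during a training step (optimizer construction assumed elsewhere):
# loss = total_loss(l_det, l_ins, l_img)
# loss.backward(); optimizer.step(); optimizer.zero_grad()
```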
The advantages of the invention include:
(1) The prior of the motion foreground is fully utilized in the invention and is well integrated into the training framework. By using PFA1 and PFA2, the proposal boxes extracted from the RPN network are effectively fused with the motion foreground target boxes, so that the effect of the model is optimized by the two kinds of proposal boxes complementing each other.
(2) In order to reduce the risk of model overfitting and improve the accuracy of target box regression, the present invention abandons the existing classifier-based feature alignment mode. For instance feature alignment, a decoder is used to reduce overfitting, with the perceptual loss as its loss function. The effect of the model in the target domain is greatly improved in this way.
(3) In the fusion of the proposal boxes extracted from the RPN network and the motion foreground target boxes, sample equalization is effectively realized through the sample equalization filter.
In order to verify the validity of the method, the inventors conducted the following experiments, in which the testing process only followed the testing process of the two-stage detection algorithm, so the inference speed was consistent with that of the conventional two-stage algorithm. By adding the above components during the model training process, the trained model achieved good results in both the source and target domains.
The source domain dataset and target domain dataset adopted in the specific implementation scheme were captured in a real world scenario. They were named DML dataset and ZN dataset respectively, where DML dataset was the source domain dataset and ZN dataset was the target domain dataset.
Experimental details: In all experiments, the parameters were consistent with the original DA-FasterRCNN algorithm. ResNet-50 was used as the backbone network, and ImageNet pre-training weights were used to initialize the backbone network. After training on 70,000 images, the average precision (mAP) on the target domain was calculated. All experiments were based on the PyTorch framework, and the NVIDIA GTX-2080Ti hardware platform was used.
A comparison of experimental results is shown in Table 1. DA-FasterRCNN is a classical domain adaptive detection algorithm. Method PFA1 is an improved method obtained by adding the first proposal boxes and foreground boxes aggregation (PFA1) module to the classical algorithm, that is, adding the motion foreground target boxes to the RPN proposal boxes. It can be seen that the method of the invention effectively improved the detection effect in the target domain.
Table 1: Results of domain adaptive detection

Claims (6)

  1. A scene adaptive target detecting method based on motion foreground, for training a model based on source domain data and target domain foreground data so that the model has a good detection effect also in the target domain, including the steps of:
    A) feeding a set of source domain consecutive frame samples and a set of target domain consecutive frame samples into a motion target detection algorithm to obtain output of motion foreground target boxes of the source domain consecutive frame samples and motion foreground target boxes of the target domain consecutive frame samples, wherein the motion foreground target boxes form the source domain data set and the target domain data set together with the source domain labeling tags;
    B) feeding the source domain data set and target domain data set into feature extraction module to obtain source domain features and target domain features;
    C) feeding the source domain features, the target domain features, and the motion foreground target box into first proposal box and foreground box aggregation module (PFA1) to obtain source domain instance features and target domain instance features;
    D) feeding the source domain features and source domain motion foreground target box into second proposal box and foreground box aggregation module (PFA2) to obtain source domain classification regression features;
    E) sending the source domain classification regression features into classification regression module to calculate loss with source target boxes truth label so as to obtain optimized detection effect on source domain;
    F) sending the source domain instance features and target domain instance features into generative similarity measurement module (GSM) to allow the source domain instance features and target domain instance features to be as similar as possible and to improve generalization performance, thereby reducing over-fitting;
    G) sending the source domain features and the target domain features into global feature alignment module (GFA) to align image features so that the domains which the source domain features and target domain features belong to cannot be distinguished,
    wherein the first proposal box and foreground box aggregation module (PFA1) includes sub-modules for performing the following  operations respectively:
    step S201: sending the source domain consecutive frame samples and the target domain consecutive frame samples into RPN network to generate set of source domain positive and negative proposal boxes and set of target domain positive and negative proposal boxes;
    step S211: selecting the source domain positive and negative proposal boxes and the target domain positive and negative proposal boxes whose confidence is greater than a preset threshold TH from the set of source domain positive and negative proposal boxes and the set of target domain positive and negative proposal boxes generated in step S201;
    step S202: obtaining source domain motion foreground target boxes and target domain motion foreground target boxes by the motion target detection algorithm;
    step S212: obtaining source domain combined target boxes and target domain combined target boxes by combining the source domain positive and negative proposal boxes and the target domain positive and negative proposal boxes with confidence greater than the preset threshold TH obtained in step S211 with the source domain motion foreground target boxes and the target domain motion foreground target boxes obtained in step S202;
    step S213: obtaining source domain proposal boxes and target domain proposal boxes of the first proposal box and foreground box aggregation module (PFA1) by sample equalization filter,
    wherein the sample equalization filter, by copying or deleting the source domain combined target boxes and the target domain combined target boxes obtained in step S212, allows that the number of source domain combined target boxes included in the ith sample in the source domain (S) and the number of the target domain combined target boxes included in the ith sample in the target domain (T) are the same, to effectively use motion foreground priori and eliminate sample imbalance,
    wherein said second proposal box and foreground box aggregation module (PFA2) includes sub-modules for performing the following operations respectively:
    step S301: allowing the source domain consecutive frame samples to go through the RPN network to generate set of source domain positive and negative proposal boxes;
    step S302: obtaining the source domain motion foreground target box using the motion target detection algorithm;
    step S311: adding the set of source domain positive and negative proposal boxes and the source domain motion foreground target box to generate set of source domain proposal boxes of the second proposal box and foreground box aggregation module (PFA2) , to avoid that good  proposal target boxes cannot be generated when the target sizes of the source domain and the target domain differ too much,
    wherein the generative similarity measurement module (GSM) includes sub-modules for performing the following operations respectively:
    step S401: intercepting the source domain instance features in the source domain features using the source domain proposal box of the first proposal box and foreground box aggregation module (PFA1) generated in step S213, and intercepting the target domain instance features in the target domain features using the target domain proposal box of the first proposal box and foreground box aggregation module (PFA1) generated in step S213;
    step S402: sending the source domain instance features and the target domain instance features into an adaptive average pooling layer, which outputs pooling features having a size of 8*8 and a channel number that equals the channel number of the source domain instance features;
    step S403: sending the pooling features obtained in step S402 into a first 1*1 convolution layer, which is a 1*1 convolution layer and which outputs first source domain convolution layer features and first target domain convolution layer features;
    step S404: sending the first source domain convolution layer features and first target domain convolution layer features obtained in step S403 into a first up-sampling module for performing interpolation up-sampling, convolution, and/or batch normalization layer operations, to output first source domain up-sampling layer features and first target domain up-sampling layer features;
    step S405, sending the output of the first up-sampling module into a second up-sampling module for performing interpolation up-sampling, convolution, and/or batch normalization layer operations, to output second source domain up-sampling layer features and second target domain up-sampling layer features;
    step S406, sending the output of the second up-sampling module into a third up-sampling module for performing interpolation up-sampling, convolution, and/or batch normalization layer operations, to output third source domain up-sampling layer features and third target domain up-sampling layer features;
    step S407, sending the output of the third up-sampling module into a second 1*1 convolution layer, which is a 1*1 convolution layer, to generate source domain decoding features having a channel number of 3 and target domain decoding features having a channel number of 3, and to calculate the perceptual loss of the source domain decoding features and the target domain decoding features to obtain the loss L_ins as:
    L_ins = E(G(S), G(T)),
    where L_ins is the perceptual loss value of the source domain decoding features and the target domain decoding features, E is the perceptual loss calculation function, G(S) refers to the source domain decoding feature generated from the source domain instance feature by steps S402-S407, and G(T) refers to the target domain decoding feature generated from the target domain instance feature by steps S402-S407,
    where the global feature alignment module (GFA) includes sub-modules for performing the following operations respectively:
    step S501: obtaining the source domain features and the target domain features;
    step S502: sending the source domain features and the target domain features into a gradient reversal layer, which reverses the errors transmitted thereto so that the network training goals before and after the gradient reversal layer are opposite to each other so as to achieve the effect of confrontation, and which outputs classification features,
    step S503: sending the classification features into a classifier, which includes a first classifier convolution layer, a first classifier activation layer, and a second classifier convolution layer, to distinguish the source domain features and the target domain features,
    where the gradient reversal layer achieves a certain degree of feature alignment at image level, and the loss function of the global feature alignment module (GFA) is the loss function of the classifier.
  2. The scene adaptive target detection method based on motion foreground according to claim 1, wherein:
    the step B) includes sending the source domain consecutive frame samples and target domain consecutive frame samples into a ResNet-101 that functions as a feature extraction network, and using the last layer of the obtained features as the source domain features and the target domain features.
  3. The scene adaptive target detection method based on motion foreground according to claim 1, wherein:
    the classification regression module, using the source domain proposal boxes of the second proposal box and foreground box aggregation module (PFA2) generated in step S311, accesses a first classification regression module convolution layer to perform regression and classification of the source domain consecutive frame samples and the target domain consecutive frame samples; the loss functions involved include a classification regression loss function L_T and an RPN loss function L_RPN, and the source domain target detection algorithm loss function L_det is:
    L_det = L_RPN + L_T,
    where L_RPN and L_T are the RPN loss function and the classification regression loss function respectively, the subscript det represents the total loss function of the classification regression module, RPN represents the loss function of the RPN of the first stage of a two-stage target detection framework, and T represents the loss function of the classification and regression of the second stage of the two-stage target detection framework.
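
As a rough illustration of the loss composition in claim 3 (not the patent's exact network), a standard two-stage detector such as torchvision's Faster R-CNN returns per-component losses during training, which can be grouped into L_RPN and L_T; the grouping and the detector choice are assumptions.

```python
import torch
import torchvision

# A generic two-stage detector used only to illustrate the loss composition.
model = torchvision.models.detection.fasterrcnn_resnet50_fpn(
    weights=None, weights_backbone=None, num_classes=2)
model.train()

images = [torch.rand(3, 600, 800)]
targets = [{"boxes": torch.tensor([[100.0, 120.0, 300.0, 400.0]]),
            "labels": torch.tensor([1])}]
loss_dict = model(images, targets)

# First-stage (RPN) loss and second-stage classification/regression loss of claim 3.
L_RPN = loss_dict["loss_objectness"] + loss_dict["loss_rpn_box_reg"]
L_T = loss_dict["loss_classifier"] + loss_dict["loss_box_reg"]
L_det = L_RPN + L_T
```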
  4. The scene adaptive target detection method based on motion foreground according to claim 1, wherein:
    the loss function L_img of the global feature alignment module is a cross-entropy loss function as:
    L_img = -(1/N) Σ_{i=1}^{N} [y_i·log(p_i) + (1 - y_i)·log(1 - p_i)],
    where N is the number of all samples in the source domain and the target domain, i is the sample index, y_i is the actual label of the sample indicating whether the sample belongs to the source domain or the target domain, and p_i is the probability of the sample belonging to its respective category after passing through the classifier.
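
Steps S501-S503 together with the loss L_img of claim 4 might be sketched as below; the gradient reversal coefficient, the channel widths, and the use of a logit-based binary cross-entropy (equivalent to the formula above applied to sigmoid probabilities p_i) are assumptions.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class GradReverse(torch.autograd.Function):
    """Gradient reversal layer (step S502): identity in the forward pass,
    gradients multiplied by -lamb in the backward pass, so the feature
    extractor and the domain classifier are trained adversarially."""
    @staticmethod
    def forward(ctx, x, lamb):
        ctx.lamb = lamb
        return x.view_as(x)

    @staticmethod
    def backward(ctx, grad_output):
        return -ctx.lamb * grad_output, None

class DomainClassifier(nn.Module):
    """Classifier of step S503: convolution, activation, convolution (widths assumed)."""
    def __init__(self, in_ch=2048):
        super().__init__()
        self.conv1 = nn.Conv2d(in_ch, 256, kernel_size=1)
        self.conv2 = nn.Conv2d(256, 1, kernel_size=1)

    def forward(self, feat, lamb=1.0):
        feat = GradReverse.apply(feat, lamb)
        logits = self.conv2(F.relu(self.conv1(feat)))
        return logits.mean(dim=(2, 3))   # one domain logit per image

clf = DomainClassifier()
src_feat = torch.randn(4, 2048, 19, 25)   # source domain features
tgt_feat = torch.randn(4, 2048, 19, 25)   # target domain features
logits = torch.cat([clf(src_feat), clf(tgt_feat)])
labels = torch.cat([torch.zeros(4, 1), torch.ones(4, 1)])   # y_i: 0 = source, 1 = target
L_img = F.binary_cross_entropy_with_logits(logits, labels)
```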
  5. The scene adaptive target detection method based on motion foreground according to claim 4, wherein the global loss function is:
    L = L_det + λ_1·L_ins + λ_2·L_img,
    where λ_1 and λ_2 take empirical values that measure the contribution of each of the three losses to the final loss.
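
A minimal sketch of the global loss of claim 5; the loss values and the empirical weights λ_1 and λ_2 below are placeholders, since the claim only states that they take empirical values.

```python
import torch

L_det = torch.tensor(1.2)   # detection loss of claim 3 (placeholder value)
L_ins = torch.tensor(0.4)   # instance-level perceptual loss of claim 1 (placeholder value)
L_img = torch.tensor(0.7)   # image-level domain loss of claim 4 (placeholder value)

lambda_1, lambda_2 = 0.1, 0.1                       # assumed empirical weights
L = L_det + lambda_1 * L_ins + lambda_2 * L_img     # global loss of claim 5
```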
  6. The scene adaptive target detection method based on motion foreground according to claim 1, wherein:
    the motion target detection algorithm includes a frame difference method and/or a background elimination method.
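
For claim 6, a minimal OpenCV sketch of the two motion target detection options, the frame difference method and background elimination (background subtraction); the threshold value and the MOG2 subtractor parameters are assumptions, not requirements of the claim.

```python
import cv2

def frame_difference_foreground(prev_frame, curr_frame, thresh=25):
    """Frame difference method: threshold the absolute difference of consecutive frames."""
    prev_gray = cv2.cvtColor(prev_frame, cv2.COLOR_BGR2GRAY)
    curr_gray = cv2.cvtColor(curr_frame, cv2.COLOR_BGR2GRAY)
    diff = cv2.absdiff(curr_gray, prev_gray)
    _, mask = cv2.threshold(diff, thresh, 255, cv2.THRESH_BINARY)
    return mask

# Background elimination: maintain a background model and subtract it from each frame.
subtractor = cv2.createBackgroundSubtractorMOG2(history=200, varThreshold=16, detectShadows=False)

def background_elimination_foreground(frame):
    return subtractor.apply(frame)
```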
PCT/CN2021/134085 2021-11-25 2021-11-29 A scene adaptive target detection method based on motion foreground WO2023092582A1 (en)

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
CN202111416174.2 2021-11-25
CN202111416174.2A CN114399697A (en) 2021-11-25 2021-11-25 Scene self-adaptive target detection method based on moving foreground

Publications (1)

Publication Number Publication Date
WO2023092582A1 true WO2023092582A1 (en) 2023-06-01

Family

ID=81225521

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/CN2021/134085 WO2023092582A1 (en) 2021-11-25 2021-11-29 A scene adaptive target detection method based on motion foreground

Country Status (2)

Country Link
CN (1) CN114399697A (en)
WO (1) WO2023092582A1 (en)

Families Citing this family (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN115049870A (en) * 2022-05-07 2022-09-13 电子科技大学 Target detection method based on small sample

Citations (7)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN110321813A (en) * 2019-06-18 2019-10-11 南京信息工程大学 Cross-domain pedestrian recognition methods again based on pedestrian's segmentation
US20200125925A1 (en) * 2018-10-18 2020-04-23 Deepnorth Inc. Foreground Attentive Feature Learning for Person Re-Identification
CN112183274A (en) * 2020-09-21 2021-01-05 深圳中兴网信科技有限公司 Mud car detection method and computer-readable storage medium
CN113052187A (en) * 2021-03-23 2021-06-29 电子科技大学 Global feature alignment target detection method based on multi-scale feature fusion
CN113052184A (en) * 2021-03-12 2021-06-29 电子科技大学 Target detection method based on two-stage local feature alignment
CN113158943A (en) * 2021-04-29 2021-07-23 杭州电子科技大学 Cross-domain infrared target detection method
CN113343989A (en) * 2021-07-09 2021-09-03 中山大学 Target detection method and system based on self-adaption of foreground selection domain

Also Published As

Publication number Publication date
CN114399697A (en) 2022-04-26

Legal Events

Date Code Title Description
121 EP: The EPO has been informed by WIPO that EP was designated in this application (Ref document number: 21965311; Country of ref document: EP; Kind code of ref document: A1)
NENP Non-entry into the national phase (Ref country code: DE)