CN115439791A - Cross-domain video action recognition method, device, equipment and computer-readable storage medium - Google Patents

Cross-domain video action recognition method, device, equipment and computer-readable storage medium Download PDF

Info

Publication number
CN115439791A
CN115439791A · Application CN202211171602.4A
Authority
CN
China
Prior art keywords
domain
sample
module
feature
video
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202211171602.4A
Other languages
Chinese (zh)
Inventor
周冕
田壮
高文杰
高赞
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Tianjin University of Technology
Original Assignee
Tianjin University of Technology
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Tianjin University of Technology filed Critical Tianjin University of Technology
Priority to CN202211171602.4A
Publication of CN115439791A
Legal status: Pending

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06V IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V20/00 Scenes; Scene-specific elements
    • G06V20/40 Scenes; Scene-specific elements in video content
    • G06V20/46 Extracting features or characteristics from the video content, e.g. video fingerprints, representative shots or key frames
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06V IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V10/00 Arrangements for image or video recognition or understanding
    • G06V10/70 Arrangements for image or video recognition or understanding using pattern recognition or machine learning
    • G06V10/764 Arrangements for image or video recognition or understanding using pattern recognition or machine learning using classification, e.g. of video objects
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06V IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V10/00 Arrangements for image or video recognition or understanding
    • G06V10/70 Arrangements for image or video recognition or understanding using pattern recognition or machine learning
    • G06V10/77 Processing image or video features in feature spaces; using data integration or data reduction, e.g. principal component analysis [PCA] or independent component analysis [ICA] or self-organising maps [SOM]; Blind source separation
    • G06V10/774 Generating sets of training patterns; Bootstrap methods, e.g. bagging or boosting
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06V IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V10/00 Arrangements for image or video recognition or understanding
    • G06V10/70 Arrangements for image or video recognition or understanding using pattern recognition or machine learning
    • G06V10/77 Processing image or video features in feature spaces; using data integration or data reduction, e.g. principal component analysis [PCA] or independent component analysis [ICA] or self-organising maps [SOM]; Blind source separation
    • G06V10/80 Fusion, i.e. combining data from various sources at the sensor level, preprocessing level, feature extraction level or classification level

Abstract

The invention belongs to the technical field of computer vision and pattern recognition, and relates to an unsupervised cross-domain video action recognition method based on resampling and feature weighting. The method comprises the following steps: source domain sample resampling, video preprocessing, feature extraction, motion excitation module construction, feature fusion module construction, intermediate domain weighting module construction, classification module construction, and domain contrast learning module construction. Under the condition that the target domain has no sample labels, the method effectively alleviates the insufficient generalization ability of a model across multiple data sets and improves model transferability; it addresses the inconsistent data distribution between different data sets, solves cross-domain action recognition when the training set of the target data set is unlabelled, and achieves accurate recognition on the target domain test set by using the information of the source domain data set and of the unlabelled target domain training set. The invention is particularly suitable for the field of public safety.

Description

Cross-domain video action recognition method, device, equipment and computer-readable storage medium
Technical Field
The invention belongs to the technical field of computer vision and pattern recognition and relates to an unsupervised cross-domain video action recognition method based on resampling and feature weighting. Under the condition that the target domain has no sample labels, the method effectively alleviates the insufficient generalization ability of a model across multiple data sets and improves model transferability; its effectiveness has been verified on multiple cross-domain video action data sets.
Background
Over the last few years, many deep-architecture approaches have emerged in the field of action recognition. For example, Two-Stream (two-stream convolutional neural network) proposes a dual-stream architecture in which two 2D convolutional branches are jointly trained on RGB and optical-flow inputs to model temporal information; TRN proposes a temporal relation network, using a dedicated pooling layer to model the temporal relations between video frames; C3D directly learns the spatio-temporal features of video data through 3D convolutions; I3D is a deep network that inflates pre-trained 2D convolution filters into 3D in order to exploit large-scale 2D pre-training; P3D splits a 3D convolution into a 3×1×1 temporal convolution and a 1×3×3 spatial convolution, reducing the number of parameters.
However, the above methods often cannot be applied directly to cross-domain action recognition, because they are trained and tested on identically distributed data, i.e. all samples come from the same data set. In cross-domain tasks, the training and test samples usually come from different data sets, so their distributions differ; in this situation these methods cannot adequately eliminate the distribution gap between samples, the classification performance of the model drops significantly, and they cannot be applied effectively to cross-domain tasks.
In the related research fields of computer vision and pattern recognition, the cross-domain task of transfer learning has always been one of the most active research directions. Cross-domain tasks already have mature methods in the image domain, and existing methods differ greatly in their strategies for coping with domain shift. One class of methods performs domain distribution alignment by matching the first- and second-order statistical moments of the source and target data distributions. Another prominent strategy is adversarial training, in which discriminative and domain-agnostic feature representations are learned by coupling a domain discriminator with the source classification loss. Compared with cross-domain image recognition, cross-domain action recognition is more difficult because temporal information must also be aligned, and only a few cross-domain action recognition methods exist. For example, DAAA proposes an end-to-end adversarial learning framework to align the two domains; TA3N proposes a temporal attentive adversarial adaptation network in which an attention mechanism is used to align temporal dynamics together with spatial features; TCoN aligns the feature distributions of videos of the same class in the source and target domains with a co-attention method. While these methods can produce satisfactory results, most of them require samples with label information. Therefore, in this work we focus on the unsupervised cross-domain action recognition task.
Disclosure of Invention
The invention aims at the cross-domain action recognition task and addresses cross-domain action recognition when the training set of the target data set is unlabelled; it provides an unsupervised cross-domain video action recognition method, device, equipment and computer storage medium based on resampling and feature weighting.
The cross-domain video action identification method specifically comprises the following steps:
1) Source domain sample resampling: the number of source domain samples is unevenly distributed across categories; training the model with such unbalanced source domain samples makes the model fit the categories with many samples and neglect the categories with few samples. The per-category sample distribution of the target domain often differs from that of the source domain, so the performance of a model trained on the source domain degrades when it is transferred to the target domain. In view of these characteristics, the source domain samples are resampled so that the number of samples in each source domain category is balanced;
2) Video preprocessing: cross-domain action data set samples contain many frames, and feeding all frames into the network is computationally expensive; since adjacent frames are highly similar, it is more efficient to extract a fixed number of frames and feed only these into the network. Accordingly, N video samples are drawn at random from the source domain and from the target domain, respectively, and combined into one batch of data samples; each video is divided into k = 6 segments, sixteen frames are randomly extracted from each segment, and these ninety-six frame images are taken as the representation of the action sample. At the same time, conventional data enhancement is applied to the frame images, i.e. random cropping, random horizontal flipping and normalization; in the testing stage, only frame extraction is applied to the test set video samples, without data enhancement;
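The segment-based frame sampling and the conventional augmentations above can be sketched as follows; this is a minimal illustration rather than the patented implementation, and the crop size and normalization statistics are assumptions (it also assumes each video has at least k frames):

```python
import random
from torchvision import transforms

def sample_frame_indices(num_frames, k=6, frames_per_segment=16):
    """Split a video of num_frames frames into k segments and draw
    frames_per_segment random frame indices from each segment (96 in total)."""
    seg_len = max(num_frames // k, 1)
    indices = []
    for s in range(k):
        pool = list(range(s * seg_len, min((s + 1) * seg_len, num_frames)))
        indices.extend(sorted(random.choices(pool, k=frames_per_segment)))
    return indices

# Training-time data enhancement: random crop, random horizontal flip,
# normalization; test-set samples only go through frame extraction.
train_transform = transforms.Compose([
    transforms.RandomCrop(224),                       # assumed crop size
    transforms.RandomHorizontalFlip(),
    transforms.ToTensor(),
    transforms.Normalize(mean=[0.485, 0.456, 0.406],  # assumed statistics
                         std=[0.229, 0.224, 0.225]),
])
```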
3) Feature extraction: an I3D model pre-trained on the Kinetics data set is used as the feature extractor of the network to extract features from the video samples processed in steps 1) and 2), obtaining a segment-level feature sequence F_S ∈ R^{T×C}, where T = 6 indicates that each video sample yields 6 segment features and C is the feature channel dimension;
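As an illustration of step 3), segment-level features might be gathered from a Kinetics-pretrained I3D backbone roughly as follows; `load_pretrained_i3d` is a hypothetical loader standing in for whichever I3D implementation is used:

```python
import torch

def extract_segment_features(i3d, video_segments):
    """video_segments: tensor of shape (T, 3, frames, H, W), with T = 6
    segments of 16 frames each; returns segment-level features of shape (T, C)."""
    feats = [i3d(seg.unsqueeze(0)).squeeze(0) for seg in video_segments]
    return torch.stack(feats)  # (T, C); C = 1024 for the standard I3D backbone

# i3d = load_pretrained_i3d()               # hypothetical Kinetics-pretrained I3D
# f_s = extract_segment_features(i3d, segs)  # segment-level feature sequence F_S
```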
4) Constructing a motion excitation module: the segment-level features F_S are input into the motion excitation module, which enhances the motion information along the time dimension and improves the quality of the motion features, yielding the enhanced segment-level features F^M ∈ R^{T×C};
5) Constructing a feature fusion module: the segment-level features F^M modeled by the motion excitation module are taken as input, and the feature fusion module fuses the T = 6 segment-level features into a video-level feature F ∈ R^{C};
6) Constructing an intermediate domain weighting module: one goal of cross-domain action recognition is to learn more domain-invariant features, i.e. intrinsic action features that do not change as the domain changes. Accordingly, an intermediate domain weighting module is designed. For a batch of 2N input video-level sample features {F_1, F_2, …, F_2N}, the intermediate domain weighting layer judges whether each feature is biased toward the source domain or target domain data distribution, or lies in the intermediate domain (i.e. the intersection of the source domain and target domain feature distributions). Sample features in the intermediate domain are more domain-invariant than samples biased toward the domain-specific distribution of the source or target domain. The intermediate domain weighting module computes, from the distance between a feature F and the intermediate domain and according to different rules, a classification weight vector x_c and a domain contrast weight vector x_d; the classification weight vector x_c weights the video-level feature F to obtain the feature F_C used for classification, which is input into the subsequent classification module, and the domain contrast weight vector x_d weights the video-level feature F to obtain the feature F_D used for domain contrast learning;
7) Constructing a classification module: the video-level feature F_C weighted by the intermediate domain weighting module is input into the classification module, which computes the classification probability vector l_C of the sample over the classes. On the one hand, the classification module uses the real labels Y_S of the source domain samples and the classification probability vectors l_S ∈ l_C of the source domain samples to compute the classification loss of the source domain samples with a cross-entropy function, optimizing the classification ability of the whole network; the cross-entropy loss function is defined as:
L_cls = -(1/N) Σ_{i=1..N} y_i · log(ŷ_i)
where N is the number of source domain samples in the current training batch, y_i is the label of the i-th sample, ŷ_i is the predicted value for the i-th sample, and log(·) is the logarithm operation.
On the other hand, the classification module uses the classification probability vector l_T ∈ l_C of the target domain samples and, according to the formula
Ŷ_T = argmax(l_T),
obtains the pseudo labels Ŷ_T of the target domain samples;
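A minimal sketch of the classification branch described in step 7), assuming a single linear classification head (the head architecture is not specified in the text):

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class ClassificationModule(nn.Module):
    def __init__(self, feat_dim, num_classes):
        super().__init__()
        self.head = nn.Linear(feat_dim, num_classes)  # assumed simple head

    def forward(self, f_c):
        return self.head(f_c)  # class logits; softmax of these gives l_C

def source_classification_loss(logits_s, labels_s):
    # Cross-entropy over the labelled source-domain samples of the batch.
    return F.cross_entropy(logits_s, labels_s)

def target_pseudo_labels(logits_t):
    # Pseudo label = argmax of the target-domain classification probabilities;
    # the confidence can be used to flag unreliable pseudo labels (the set Ω).
    probs = torch.softmax(logits_t, dim=1)
    confidence, pseudo = probs.max(dim=1)
    return pseudo, confidence
```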
8) Constructing a domain contrast learning module: the domain contrast learning module mainly realizes inter-domain contrastive learning; by shortening the distance between features of the same category in the source domain sample features and the target domain sample features, and pushing apart the features of different categories, it reduces domain shift and improves the cross-domain ability of the network. The video-level feature F_D weighted by the intermediate domain weighting module is input into the domain contrast learning module, which maps it into the domain contrast feature F_d; the domain contrast loss is computed from the source domain real labels Y_S, the target domain pseudo labels Ŷ_T and the domain contrast features F_d, and the network is optimized to reduce domain shift. The domain contrast loss L_dc is defined by the formula shown in the drawings of the original publication, where N is the number of source domain or target domain samples per batch; F_{d,i}^S denotes the i-th source domain contrast feature of each batch and F_{d,j}^T the j-th target domain contrast feature of each batch; 1[·] is an indicator function that equals 1 if its argument is true and 0 otherwise; Ω denotes the set of target domain samples with unreliable pseudo labels; D(·) is a function that returns the domain of a sample; and τ > 0 is a temperature hyperparameter.
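Because the exact domain contrast loss appears only as a drawing, the sketch below is an assumed stand-in: a standard cross-domain supervised contrastive loss with temperature τ, in which target samples with unreliable pseudo labels (the set Ω) are excluded:

```python
import torch
import torch.nn.functional as F

def domain_contrast_loss(f_s, y_s, f_t, y_t_pseudo, reliable_t, tau=0.5):
    """f_s, f_t: source/target domain contrast features (N, D); y_s: source
    labels; y_t_pseudo: target pseudo labels; reliable_t: boolean mask that
    removes target samples with unreliable pseudo labels (the set Ω)."""
    f_s = F.normalize(f_s, dim=1)
    f_t = F.normalize(f_t[reliable_t], dim=1)
    y_t = y_t_pseudo[reliable_t]
    if f_t.shape[0] == 0:
        return f_s.new_zeros(())
    sim = f_s @ f_t.t() / tau                            # (N_s, N_t) similarities
    pos = y_s.unsqueeze(1).eq(y_t.unsqueeze(0)).float()  # same-class pairs are positives
    log_prob = sim - torch.logsumexp(sim, dim=1, keepdim=True)
    has_pos = pos.sum(dim=1) > 0
    if not has_pos.any():
        return f_s.new_zeros(())
    loss = -(pos * log_prob).sum(dim=1) / pos.sum(dim=1).clamp(min=1)
    return loss[has_pos].mean()
```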
9) Constructing a joint training loss function: the whole network is trained jointly with the loss functions proposed in steps 6), 7) and 8); the overall loss of the network is a weighted sum of the domain classification loss, the source domain classification loss and the domain contrast loss, where α, β and γ are the corresponding hyper-parameters.
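A sketch of the joint objective; which hyper-parameter weights which loss term is an assumption here, since the combined formula appears only in the drawings (the default values follow the embodiment described later):

```python
def total_loss(loss_cls, loss_dom, loss_dc, alpha=1.0, beta=1.0, gamma=1.5):
    # Weighted combination of the source classification loss (step 7), the
    # domain classification loss of the intermediate domain weighting module
    # (step 6) and the domain contrast loss (step 8).
    return alpha * loss_cls + beta * loss_dom + gamma * loss_dc
```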
The step 1) comprises the following specific steps:
counting the number of samples of each category according to the real labels of the source domain samples, and computing a sampling ratio for each category;
sampling ratio = (number of samples of the category / maximum number of samples of any category);
weighted random sampling is then performed on each category according to this sampling ratio, so that the number of samples of each category in the whole source domain becomes consistent; no resampling is performed on the target domain data set, since it has no real labels.
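A sketch of the class-balanced resampling of the source domain with weighted random sampling; the inverse-class-frequency weight used here is an assumption, chosen so that every category is drawn with roughly equal frequency as described above:

```python
import torch
from collections import Counter
from torch.utils.data import WeightedRandomSampler

def balanced_source_sampler(source_labels):
    """source_labels: list of real class labels of the source-domain samples.
    Returns a sampler that draws each category with (approximately) equal
    frequency; the target domain is never resampled, since it has no labels."""
    counts = Counter(source_labels)
    weights = torch.tensor([1.0 / counts[y] for y in source_labels],
                           dtype=torch.double)
    return WeightedRandomSampler(weights, num_samples=len(source_labels),
                                 replacement=True)

# loader = DataLoader(source_dataset, batch_size=N,
#                     sampler=balanced_source_sampler(source_labels))
```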
The step 4) comprises the following specific steps:
in order to model and enhance the motion information of the segment-level features, the segment-level features F_S extracted by the feature extraction module are input into the motion excitation module; the motion excitation module takes the difference between the features of segment t and segment t+1 along the temporal dimension, and this adjacent-segment difference highlights the features in which the two segments differ. Such difference features are usually caused by motion, so the located motion features are added back onto the original segment-level features, thereby enhancing the motion information.
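A minimal sketch of the adjacent-segment differencing idea; the channel-attention form and the squeeze ratio are assumptions of this illustration, not the exact module of the invention:

```python
import torch
import torch.nn as nn

class MotionExcitation(nn.Module):
    def __init__(self, channels, reduction=16):
        super().__init__()
        self.squeeze = nn.Linear(channels, channels // reduction)
        self.expand = nn.Linear(channels // reduction, channels)

    def forward(self, f):                    # f: (B, T, C) segment-level features
        diff = f[:, 1:, :] - f[:, :-1, :]    # difference of segment t+1 and segment t
        diff = torch.cat([diff, diff[:, -1:, :]], dim=1)   # pad back to T steps
        attn = torch.sigmoid(self.expand(torch.relu(self.squeeze(diff))))
        return f + f * attn                  # add the located motion features back
```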
The step 6) comprises the following specific steps:
in order to obtain the distance between a sample feature and the intermediate domain, the intermediate domain weighting module is implemented as a binary classifier that judges whether an input sample feature comes from the source domain or the target domain; its input is the video-level feature F extracted by the network, and at the same time a domain label d_i ∈ {0, 1} is generated according to whether the feature comes from the source domain or the target domain.
The module is trained with supervision on the domain labels to improve the classification accuracy of the binary classifier; the loss function of the intermediate domain weighting module is the binary cross-entropy loss (BCE loss), and the domain classification loss of the samples is defined as:
L_dom = -(1/N) Σ_{i=1..N} [ d_i · log(d̂_i) + (1 - d_i) · log(1 - d̂_i) ]
where N is the number of samples in the current training batch, d_i is the domain label of the i-th sample, d̂_i is the prediction of the intermediate domain weighting module for the i-th sample feature with respect to the domain d ∈ {S, T}, and log(·) is the logarithm operation;
as training proceeds, the classification probability output by the binary classifier reflects the distance between a sample feature and the intermediate domain: the closer the output probability is to 0.5, the harder it is for the classifier to decide whether the sample comes from the source domain or the target domain, which indicates that the feature is more domain-invariant and its distance to the intermediate domain is smaller. The final weighted feature of each sample is then computed from the classification probability output by the binary classifier using the following formulas.
The feature weighting formula for the classification module branch is:
F_C = F · e^{-3|λ-0.5|}
The feature weighting formula for the domain contrast module branch is:
F_D = F · e^{3|λ-0.5|}
wherein F represents the video-level sample characteristics fused by the characteristic fusion module, and λ represents the classification probability output by the intermediate domain weighting module.
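A sketch of the intermediate domain weighting layer: a binary domain classifier produces the probability λ, which is supervised with BCE loss and then used in the two weighting formulas above; the hidden width of the classifier is an assumption of this sketch:

```python
import torch
import torch.nn as nn

class IntermediateDomainWeighting(nn.Module):
    def __init__(self, feat_dim, hidden=256):
        super().__init__()
        self.domain_clf = nn.Sequential(
            nn.Linear(feat_dim, hidden), nn.ReLU(), nn.Linear(hidden, 1))
        self.bce = nn.BCEWithLogitsLoss()

    def forward(self, f, domain_labels):
        logits = self.domain_clf(f).squeeze(1)
        loss_dom = self.bce(logits, domain_labels.float())  # supervised BCE loss
        lam = torch.sigmoid(logits).unsqueeze(1)            # λ, per-sample probability
        w = torch.abs(lam - 0.5)
        f_c = f * torch.exp(-3.0 * w)  # larger weight for intermediate-domain samples
        f_d = f * torch.exp(3.0 * w)   # larger weight for domain-specific samples
        return f_c, f_d, loss_dom
```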
The present invention also provides a video motion detection apparatus, the apparatus comprising:
the source domain sample resampling module balances the number of samples of the source domain samples in each category, prevents the model from over-fitting some categories, and is beneficial to improving the migration performance of the model to the target domain;
the characteristic extraction module is used for extracting the fragment-level characteristics of the sample by utilizing the pre-training I3D model;
the motion excitation module is used for modeling and enhancing the motion information of the segment-level features, and is beneficial to the classification and domain adaptation of subsequent videos;
the feature fusion module is used for fusing the 6 segments of segment-level features of each sample into video-level features;
the intermediate domain weighting module is used for judging the features extracted by the model in order to find the intermediate domain samples, extracted by the feature extraction module, that have domain invariance; by assigning different weights to the samples it, on the one hand, drives the network to learn domain-invariant features more strongly and improves the transferability of the network and, on the other hand, improves the efficiency of domain contrast learning and better reduces domain shift;
the classification module is used for processing the characteristics weighted by the intermediate domain weighting module to obtain the classification loss of the source domain sample and the target domain pseudo label, and is responsible for outputting the final classification result of the target domain test sample in the test stage;
and the domain contrast module is used for processing the features weighted by the intermediate domain weighting module and performing contrastive learning between the source domain and the target domain, pushing apart the feature distributions of samples of different classes and pulling together the feature distributions of samples of the same class across different domains, thereby reducing the domain shift of the features extracted by the network.
The present invention also provides a video motion recognition apparatus, comprising:
a memory for storing an executable computer program;
and the processor is used for realizing the cross-domain video action identification method when executing the executable computer program stored in the memory.
The invention also provides a computer readable storage medium, which stores a computer program for implementing the cross-domain video action recognition method when being executed by a processor.
The invention has the advantages and beneficial effects that:
1) Through resampling of the source domain samples, the number of source domain samples in each category is balanced, which prevents the model from over-fitting certain categories and improves the transfer performance of the model to the target domain;
2) The motion excitation module enhances the motion information in the sample characteristics, and is beneficial to the classification and domain adaptation of subsequent videos.
3) The intermediate domain weighting layer judges the features extracted by the model to find the intermediate domain samples, extracted by the feature extraction module, that have domain invariance; by assigning different weights to the samples it, on the one hand, drives the network to learn domain-invariant features more strongly and improves the transferability of the network and, on the other hand, improves the efficiency of domain contrast learning and better reduces domain shift.
4) The domain contrast learning module performs contrastive learning between the source domain and the target domain, pushing apart the feature distributions of samples of different classes and pulling together the feature distributions of samples of the same class across different domains, which improves the domain adaptation ability of the network.
The training and test sets of common action recognition methods are split from the same data set, so they cannot effectively handle the cross-domain problem; the invention addresses the inconsistent data distribution between different data sets, solves cross-domain action recognition when the training set of the target data set is unlabelled, and achieves accurate recognition on the target domain test set by using the information of the source domain data set and of the unlabelled target domain training set.
Drawings
FIG. 1 is a block diagram of the present invention.
Detailed Description
The invention is further described below with reference to the accompanying drawings;
example 1:
data set
UCF-Olympic has 6 shared classes from the UCF50 and Olympic Sports data sets, containing in total 601 training videos and 240 test videos from the UCF50 data set, and 240 training samples and 54 test samples from the Olympic Sports data set.
UCF-HMDB full has 12 shared categories from UCF101 and HMDB51, respectively, and contains 3209 videos in total; UCF provides 1438 videos for training and 571 videos for testing, and HMDB provides 840 videos for training and 360 for testing.
Table 1 compares the performance of embodiment 1 of the present invention with classical cross-domain action recognition algorithms on the UCF-HMDB full and UCF-Olympic cross-domain action data sets; the documents referenced in Table 1 are as follows:
[1] Arshad Jamal, Vinay P Namboodiri, Dipti Deodhare, and KS Venkatesh. Deep domain adaptation in action space. In BMVC, 2018.
[2] Min-Hung Chen, Zsolt Kira, Ghassan AlRegib, Jaekwon Yoo, Ruxin Chen, and Jian Zheng. Temporal attentive alignment for large-scale video domain adaptation. In ICCV, 2019.
[3] Boxiao Pan, Zhangjie Cao, Ehsan Adeli, and Juan Carlos Niebles. Adversarial cross-domain action recognition with co-attention. In AAAI, 2020.
[4] Jinwoo Choi, Gaurav Sharma, Samuel Schulter, and Jia-Bin Huang. Shuffle and attend: Video domain adaptation. In ECCV, 2020.
TABLE 1
(Table 1 is provided as an image in the original publication and its numerical results are not reproduced here.)
As shown in fig. 1, which is the operation flowchart of the unsupervised cross-domain video action recognition method based on resampling and feature weighting according to this embodiment, the operation steps of the method include:
1) Source domain sample resampling: the per-category sample distributions of cross-domain action recognition data sets are not necessarily consistent; for some cross-domain action recognition data sets the number of source domain samples is unevenly distributed across categories, and training the model with such source domain samples makes the model fit the categories with many samples and neglect the categories with few samples. The sample distribution of the target domain often differs from that of the source domain, so the performance of a model trained on the source domain degrades when it is transferred to the target domain. Therefore, before training, the categories with fewer samples in the source domain are resampled so that the number of samples in each source domain category is balanced. Concretely, the number of samples of each category is counted according to the real labels of the source domain samples, and a sampling ratio is computed for each category;
sampling ratio = (number of samples of the category / maximum number of samples of any category);
weighted random sampling is then performed on each category according to this sampling ratio, so that the number of samples of each category becomes consistent.
2) Video preprocessing: cross-domain action data set samples contain many frames, and feeding all frames into the network is computationally expensive; since adjacent frames are highly similar, it is more efficient to extract a fixed number of frames and feed only these into the network. Accordingly, N video samples are drawn at random from the source domain and from the target domain, respectively, and combined into one batch of data samples; each video is divided into k = 6 segments, sixteen frames are randomly extracted from each segment, and these ninety-six frame images are taken as the representation of the action sample. At the same time, conventional data enhancement is applied to the frame images, i.e. random cropping, random horizontal flipping and normalization; in the testing stage, only frame extraction is applied to the test set video samples, without data enhancement.
3) Feature extraction: an I3D model pre-trained on the Kinetics data set is used as the feature extractor of the network to extract features from the video samples processed in steps 1) and 2), obtaining a segment-level feature sequence F_S ∈ R^{T×C}, where T = 6 indicates that each video sample yields 6 segment features and C is the feature channel dimension;
4) Constructing a motion excitation module: in order to model and enhance the motion information of the segment-level features, the segment-level features F_S extracted by the feature extraction module are input into the motion excitation module, which enhances the motion information along the time dimension and improves the quality of the motion features. Concretely, the features of segment t and segment t+1 are differenced along the temporal dimension; this adjacent-segment difference highlights the features in which the two segments differ, and since such differences are usually caused by motion, the located motion features are added back onto the original segment-level features to enhance the motion information, finally obtaining the enhanced segment-level features F^M ∈ R^{T×C}.
5) Constructing a feature fusion module: to generate the video-level feature, the segment-level features are fused into a video-level feature; the segment-level features are not simply added together, but are weighted according to a fusion weight vector. The weight vector is computed by the feature fusion module, which is implemented as a simple multilayer perceptron (MLP) with the architecture Linear/ReLU/Linear/Sigmoid; it takes the T = 6 segment-level features F^M modeled by the motion excitation module as input and fuses them into a video-level feature F ∈ R^{C}.
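A sketch of the fusion MLP described above (Linear/ReLU/Linear/Sigmoid producing one weight per segment); the hidden width and the weighted-sum combination are assumptions of this illustration:

```python
import torch
import torch.nn as nn

class FeatureFusion(nn.Module):
    def __init__(self, channels, hidden=256):
        super().__init__()
        self.mlp = nn.Sequential(
            nn.Linear(channels, hidden), nn.ReLU(),
            nn.Linear(hidden, 1), nn.Sigmoid())

    def forward(self, f_m):                 # f_m: (B, T, C) motion-enhanced features
        w = self.mlp(f_m)                   # (B, T, 1) per-segment fusion weights
        return (f_m * w).sum(dim=1)         # (B, C) video-level feature F
```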
6) Constructing an intermediate domain weighting module: one goal of cross-domain action recognition is to learn more domain-invariant features, i.e. intrinsic action features that do not change as the domain changes. Accordingly, an intermediate domain weighting module is designed; the role of the intermediate domain weighting layer is to judge, for the 2N video-level sample features {F_1, F_2, …, F_2N} of the same input batch, whether each feature is biased toward the source domain or target domain data distribution, or lies in the intermediate domain (i.e. the intersection of the source domain and target domain feature distributions). Sample features in the intermediate domain are more domain-invariant than samples biased toward the source domain or target domain data distribution, and the module weights the samples according to the distance between the feature and the intermediate domain. The weighting is of two kinds, depending on the module into which the features subsequently flow: the first assigns larger weights to intermediate domain samples and feeds them into the classification module, so that they carry more weight in the subsequent classification computation, which helps the classification module fit the domain-invariant features better and improves its generalization ability; the second applies the reverse weighting, assigning smaller weights to intermediate domain samples and feeding them into the domain contrast learning module, so that they carry less weight in the domain contrast computation, i.e. features with larger domain differences carry more weight in the domain contrast learning. Samples whose distribution leans toward the source domain (or target domain) are more valuable for domain contrast learning than intermediate domain samples, because the domain contrast module computes differences by comparing samples, and selecting sample features with larger domain differences lets the domain contrast module learn the domain shift better.
In order to obtain the distance between a sample feature and the intermediate domain, the intermediate domain weighting module is implemented as a binary classifier that judges whether an input sample feature comes from the source domain or the target domain; its input is the video-level feature F extracted by the network, and at the same time a domain label d_i ∈ {0, 1} is generated according to whether the feature comes from the source domain or the target domain.
The module is trained with supervision on the domain labels to improve the classification accuracy of the binary classifier. The loss function of the intermediate domain weighting module is the binary cross-entropy loss (BCE loss), and the domain classification loss of the samples is defined as:
L_dom = -(1/N) Σ_{i=1..N} [ d_i · log(d̂_i) + (1 - d_i) · log(1 - d̂_i) ]
where N is the number of samples in the current training batch, d_i is the domain label of the i-th sample, d̂_i is the prediction of the intermediate domain weighting module for the i-th sample feature with respect to the domain d ∈ {S, T}, and log(·) is the logarithm operation;
as training proceeds, the classification probability output by the binary classifier reflects the distance between a sample feature and the intermediate domain: the closer the output probability is to 0.5, the harder it is for the classifier to decide whether the sample comes from the source domain or the target domain, which indicates that the feature is more domain-invariant and its distance to the intermediate domain is smaller. The intermediate domain weighted feature of each sample is then computed from the classification probability output by the binary classifier using the following formulas.
The feature weighting formula for the classification module branch is:
F_C = F · e^{-3|λ-0.5|}
The feature weighting formula for the domain contrast module branch is:
F_D = F · e^{3|λ-0.5|}
wherein F represents the video-level sample characteristics fused by the characteristic fusion module, and λ represents the classification probability output by the intermediate domain weighting module.
7) Constructing a classification module: the video-level feature F_C weighted by the intermediate domain weighting module is input into the classification module, which computes the classification probability vector l_C of the sample over the classes. The classification module mainly realizes two functions. The first is to optimize the separability of the network over the classes by computing a supervised classification loss on the labelled source domain samples: the classification module uses the real labels Y_S of the source domain samples and the classification probability vectors l_S ∈ l_C of the source domain samples to compute the classification loss of the source domain samples with a cross-entropy function, optimizing the classification ability of the whole network; the cross-entropy loss function is defined as:
L_cls = -(1/N) Σ_{i=1..N} y_i · log(ŷ_i)
where N is the number of source domain samples in the current training batch, y_i is the label of the i-th sample, ŷ_i is the predicted value for the i-th sample, and log(·) is the logarithm operation.
The second function is that the classification module uses the classification probability vector l_T ∈ l_C of the target domain samples and, according to the formula
Ŷ_T = argmax(l_T),
obtains the pseudo labels of the target domain samples;
8) Constructing a domain contrast learning module: the domain contrast learning module mainly realizes inter-domain contrastive learning; by shortening the distance between features of the same category in the source domain sample features and the target domain sample features, and pushing apart the features of different categories, it reduces domain shift and improves the cross-domain ability of the network. The video-level feature F_D weighted by the intermediate domain weighting module is input into the domain contrast learning module, which maps it into the domain contrast feature F_d; the domain contrast loss is computed from the source domain real labels Y_S, the target domain pseudo labels Ŷ_T and the domain contrast features F_d, and the network is optimized to reduce domain shift. The domain contrast loss L_dc is defined by the formula shown in the drawings of the original publication, where N is the number of source domain or target domain samples per batch; F_{d,i}^S denotes the i-th source domain contrast feature of each batch and F_{d,j}^T the j-th target domain contrast feature of each batch; 1[·] is an indicator function that equals 1 if its argument is true and 0 otherwise; Ω denotes the set of target domain samples with unreliable pseudo labels; D(·) is a function that returns the domain of a sample; and τ > 0 is a temperature hyperparameter.
9) Constructing a joint training loss function: the whole network is trained jointly with the loss functions proposed in steps 6), 7) and 8); the overall loss of the network is a weighted sum of the domain classification loss, the source domain classification loss and the domain contrast loss, where α, β and γ are the corresponding hyper-parameters.
Correspondingly, the present embodiment further provides a video motion detection apparatus, including:
the source domain sample resampling module balances the number of samples of the source domain samples in each category, prevents the model from over-fitting some categories, and is beneficial to improving the migration performance of the model to the target domain;
the characteristic extraction module is used for extracting the fragment-level characteristics of the sample by utilizing the pre-training I3D model;
the motion excitation module is used for modeling and enhancing the motion information of the segment-level features, and is beneficial to the classification and domain adaptation of subsequent videos;
the feature fusion module is used for fusing the 6 segments of segment-level features of each sample into video-level features;
the intermediate domain weighting module is used for judging the features extracted by the model in order to find the intermediate domain samples, extracted by the feature extraction module, that have domain invariance; by assigning different weights to the samples it, on the one hand, drives the network to learn domain-invariant features more strongly and improves the transferability of the network and, on the other hand, improves the efficiency of domain contrast learning and better reduces domain shift;
the classification module is used for processing the characteristics weighted by the intermediate domain weighting module to obtain the classification loss of the source domain sample and the target domain pseudo label, and is responsible for outputting the final classification result of the target domain test sample in the test stage;
and the domain contrast module is used for processing the features weighted by the intermediate domain weighting module and performing contrastive learning between the source domain and the target domain, pushing apart the feature distributions of samples of different classes and pulling together the feature distributions of samples of the same class across different domains, thereby reducing the domain shift of the features extracted by the network.
The present embodiment further provides a video motion recognition device, where the device includes:
a memory for storing an executable computer program;
and the processor is used for realizing the cross-domain video action identification method when executing the executable computer program stored in the memory.
The present embodiment also provides a computer-readable storage medium, which stores a computer program, and when being executed by a processor, the computer program implements the cross-domain video motion recognition method.
To validate the effectiveness of the present invention, evaluations were performed on the action data sets UCF-HMDB full and UCF-Olympic. The experiments use 20 epochs and the SGD optimizer with a default learning rate of 0.01, and the loss function hyper-parameters are set to α = 1.0, β = 1.0, γ = 1.5 and τ = 0.5. The I3D network is initialized with model parameters pre-trained on Kinetics-400.
In the testing stage, test samples are segmented and frames are extracted in the same way as in the training stage, but no resampling or data enhancement is performed. A comparison of the experimental results of this embodiment with unsupervised methods is given in Table 1. As can be seen from Table 1, the unsupervised cross-domain video action recognition model based on resampling and feature weighting provided by the invention achieves better recognition performance on the unsupervised cross-domain action recognition target data sets.
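The training configuration of this embodiment could be set up roughly as follows; the momentum and weight-decay values are assumptions, while the epoch count, learning rate and hyper-parameters follow the numbers given above:

```python
import torch

def train(model, source_loader, target_loader, epochs=20, lr=0.01):
    """model is assumed to return the joint loss of steps 6)-8) for a batch of
    labelled source clips and unlabelled target clips."""
    optimizer = torch.optim.SGD(model.parameters(), lr=lr,
                                momentum=0.9, weight_decay=1e-4)  # assumed values
    for _ in range(epochs):
        for (clips_s, labels_s), clips_t in zip(source_loader, target_loader):
            loss = model(clips_s, labels_s, clips_t)
            optimizer.zero_grad()
            loss.backward()
            optimizer.step()
```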

Claims (7)

1. A cross-domain video action recognition method is characterized by comprising the following steps:
1) Source domain sample resampling: the number of source domain samples is unevenly distributed across categories, and training the model with such unbalanced source domain samples makes the model fit the categories with many samples and neglect the categories with few samples; the per-category sample distribution of the target domain often differs from that of the source domain, so the performance of a model trained on the source domain degrades when it is transferred to the target domain; in view of these characteristics, the source domain samples are resampled so that the number of samples in each source domain category is balanced;
2) Video preprocessing: cross-domain action data set samples contain many frames, and feeding all frames into the network is computationally expensive; since adjacent frames are highly similar, it is more efficient to extract a fixed number of frames and feed only these into the network; accordingly, N video samples are drawn at random from the source domain and from the target domain, respectively, and combined into one batch of data samples; each video is divided into k = 6 segments, sixteen frames are randomly extracted from each segment, and these ninety-six frame images are taken as the representation of the action sample; at the same time, conventional data enhancement is applied to the frame images, i.e. random cropping, random horizontal flipping and normalization; in the testing stage, only frame extraction is applied to the test set video samples, without data enhancement;
3) Feature extraction: an I3D model pre-trained on the Kinetics data set is used as the feature extractor of the network to extract features from the video samples processed in steps 1) and 2), obtaining a segment-level feature sequence F_S ∈ R^{T×C}, where T = 6 indicates that each video sample yields 6 segment features and C is the feature channel dimension;
4) Constructing a motion excitation module: the segment-level features F_S are input into the motion excitation module, which enhances the motion information along the time dimension and improves the quality of the motion features, yielding the enhanced segment-level features F^M ∈ R^{T×C};
5) Constructing a feature fusion module: the segment-level features F^M modeled by the motion excitation module are taken as input, and the feature fusion module fuses the T = 6 segment-level features into a video-level feature F ∈ R^{C};
6) Constructing an intermediate domain weighting module: one goal of cross-domain action recognition is to learn more domain-invariant features, i.e. intrinsic action features that do not change as the domain changes; accordingly, an intermediate domain weighting module is designed, and the role of the intermediate domain weighting layer is to judge, for the 2N video-level sample features {F_1, F_2, …, F_2N} of the same input batch, whether each feature is biased toward the source domain or target domain data distribution, or lies in the intermediate domain (i.e. the intersection of the source domain and target domain feature distributions); sample features in the intermediate domain are more domain-invariant than samples biased toward the domain-specific distribution of the source or target domain; the intermediate domain weighting module computes, from the distance between a feature F and the intermediate domain and according to different rules, a classification weight vector x_c and a domain contrast weight vector x_d; the classification weight vector x_c weights the video-level feature F to obtain the feature F_C used for classification, which is input into the subsequent classification module, and the domain contrast weight vector x_d weights the video-level feature F to obtain the feature F_D used for domain contrast learning;
7) Constructing a classification module: the video-level feature F_C weighted by the intermediate domain weighting module is input into the classification module, which computes the classification probability vector l_C of the sample over the classes; on the one hand, the classification module uses the real labels Y_S of the source domain samples and the classification probability vectors l_S ∈ l_C of the source domain samples to compute the classification loss of the source domain samples with a cross-entropy function, optimizing the classification ability of the whole network; the cross-entropy loss function is defined as:
L_cls = -(1/N) Σ_{i=1..N} y_i · log(ŷ_i)
where N is the number of source domain samples in the current training batch, y_i is the label of the i-th sample, ŷ_i is the predicted value for the i-th sample, and log(·) is the logarithm operation;
on the other hand, the classification module uses the classification probability vector l_T ∈ l_C of the target domain samples and, according to the formula
Ŷ_T = argmax(l_T),
obtains the pseudo labels Ŷ_T of the target domain samples;
8) Constructing a domain contrast learning module: the domain contrast learning module mainly realizes inter-domain contrastive learning; by shortening the distance between features of the same category in the source domain sample features and the target domain sample features, and pushing apart the features of different categories, it reduces domain shift and improves the cross-domain ability of the network; the video-level feature F_D weighted by the intermediate domain weighting module is input into the domain contrast learning module, which maps it into the domain contrast feature F_d; the domain contrast loss is computed from the source domain real labels Y_S, the target domain pseudo labels Ŷ_T and the domain contrast features F_d, and the network is optimized to reduce domain shift; the domain contrast loss L_dc is defined by the formula shown in the drawings of the original publication, where N is the number of source domain or target domain samples per batch; F_{d,i}^S denotes the i-th source domain contrast feature of each batch and F_{d,j}^T the j-th target domain contrast feature of each batch; 1[·] is an indicator function that equals 1 if its argument is true and 0 otherwise; Ω denotes the set of target domain samples with unreliable pseudo labels; D(·) is a function that returns the domain of a sample; and τ > 0 is a temperature hyperparameter;
9) Constructing a joint training loss function: the whole network is trained jointly with the loss functions proposed in steps 6), 7) and 8); the overall loss of the network is a weighted sum of the domain classification loss, the source domain classification loss and the domain contrast loss, where α, β and γ are the corresponding hyper-parameters.
2. The cross-domain video motion recognition method according to claim 1, wherein the specific steps of step 1) are as follows:
counting the number of samples of each category according to the real labels of the source domain samples, and computing a sampling ratio for each category;
sampling ratio = (number of samples of the category / maximum number of samples of any category);
weighted random sampling is then performed on each category according to this sampling ratio, so that the number of samples of each category in the whole source domain becomes consistent; no resampling is performed on the target domain data set, since it has no real labels.
3. The cross-domain video motion recognition method according to claim 1, wherein the step 4) comprises the following specific steps:
in order to model and enhance the motion information of the segment-level features, the segment-level features F_S extracted by the feature extraction module are input into the motion excitation module; the motion excitation module takes the difference between the features of segment t and segment t+1 along the temporal dimension, and this adjacent-segment difference highlights the features in which the two segments differ; such difference features are usually caused by motion, so the located motion features are added back onto the original segment-level features, thereby enhancing the motion information.
4. The cross-domain video motion recognition method according to claim 1, wherein the step 6) comprises the following specific steps:
in order to obtain the distance between a sample feature and the intermediate domain, the intermediate domain weighting module is implemented as a binary classifier that judges whether an input sample feature comes from the source domain or the target domain; its input is the video-level feature F extracted by the network, and at the same time a domain label d_i ∈ {0, 1} is generated according to whether the feature comes from the source domain or the target domain;
Figure FDA0003862947890000044
where N represents the number of samples in the current network training batch,
Figure FDA0003862947890000045
a domain label representing the ith sample,
Figure FDA0003862947890000046
a predicted value of the ith sample characteristic related to the domain d epsilon { S, T } for the middle domain weighting module; log () is a log-taking operation;
as training proceeds, the classification probability output by the binary classifier reflects the distance between a sample feature and the intermediate domain: the closer the output probability is to 0.5, the harder it is for the classifier to decide whether the sample comes from the source domain or the target domain, which indicates that the feature is more domain-invariant and its distance to the intermediate domain is smaller; the intermediate domain weighted feature of each sample is then computed from the classification probability output by the binary classifier using the following formulas;
the feature weighting formula for the classification module branch is:
F_C = F · e^{-3|λ-0.5|}
the feature weighting formula for the domain contrast module branch is:
F_D = F · e^{3|λ-0.5|}
wherein F represents the video-level sample characteristics fused by the characteristic fusion module, and λ represents the classification probability output by the intermediate domain weighting module.
5. A video motion detection apparatus, the apparatus comprising:
the source domain sample resampling module balances the number of samples of the source domain samples in each category, prevents the model from over-fitting some categories, and is beneficial to improving the migration performance of the model to the target domain;
the characteristic extraction module is used for extracting the fragment-level characteristics of the sample by utilizing the pre-training I3D model;
the motion excitation module is used for modeling and enhancing the motion information of the segment-level features, and is beneficial to the classification and domain adaptation of subsequent videos;
the characteristic fusion module is used for fusing the 6-segment fragment-level characteristics of each sample into video-level characteristics;
the intermediate domain weighting module is used for judging the features extracted by the model in order to find the intermediate domain samples, extracted by the feature extraction module, that have domain invariance; by assigning different weights to the samples it, on the one hand, drives the network to learn domain-invariant features more strongly and improves the transferability of the network and, on the other hand, improves the efficiency of domain contrast learning and better reduces domain shift;
the classification module is used for processing the characteristics weighted by the intermediate domain weighting module to obtain the classification loss of the source domain sample and the target domain pseudo label, and is responsible for outputting the final classification result of the target domain test sample in the test stage;
and the domain contrast module is used for processing the features weighted by the intermediate domain weighting module and performing contrastive learning between the source domain and the target domain, pushing apart the feature distributions of samples of different classes and pulling together the feature distributions of samples of the same class across different domains, thereby reducing the domain shift of the features extracted by the network.
6. A video motion recognition device, the device comprising:
a memory for storing an executable computer program;
a processor for implementing the method of any one of claims 1 to 4 when executing an executable computer program stored in the memory.
7. A computer-readable storage medium, in which a computer program is stored which, when being executed by a processor, carries out the method of any one of claims 1 to 4.
CN202211171602.4A 2022-09-26 2022-09-26 Cross-domain video action recognition method, device, equipment and computer-readable storage medium Pending CN115439791A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202211171602.4A CN115439791A (en) 2022-09-26 2022-09-26 Cross-domain video action recognition method, device, equipment and computer-readable storage medium

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202211171602.4A CN115439791A (en) 2022-09-26 2022-09-26 Cross-domain video action recognition method, device, equipment and computer-readable storage medium

Publications (1)

Publication Number Publication Date
CN115439791A true CN115439791A (en) 2022-12-06

Family

ID=84249846

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202211171602.4A Pending CN115439791A (en) 2022-09-26 2022-09-26 Cross-domain video action recognition method, device, equipment and computer-readable storage medium

Country Status (1)

Country Link
CN (1) CN115439791A (en)

Cited By (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN115861901A (en) * 2022-12-30 2023-03-28 深圳大学 Video classification method, device, equipment and storage medium
CN115861901B (en) * 2022-12-30 2023-06-30 深圳大学 Video classification method, device, equipment and storage medium

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination