CN115439791A - Cross-domain video action recognition method, device, equipment and computer-readable storage medium - Google Patents

Cross-domain video action recognition method, device, equipment and computer-readable storage medium Download PDF

Info

Publication number
CN115439791A
CN115439791A · Application CN202211171602.4A
Authority
CN
China
Prior art keywords
domain
sample
module
feature
video
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202211171602.4A
Other languages
Chinese (zh)
Inventor
周冕
田壮
高文杰
高赞
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Tianjin University of Technology
Original Assignee
Tianjin University of Technology
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Tianjin University of Technology filed Critical Tianjin University of Technology
Priority to CN202211171602.4A
Publication of CN115439791A
Legal status: Pending

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06V IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V20/00 Scenes; Scene-specific elements
    • G06V20/40 Scenes; Scene-specific elements in video content
    • G06V20/46 Extracting features or characteristics from the video content, e.g. video fingerprints, representative shots or key frames
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06V IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V10/00 Arrangements for image or video recognition or understanding
    • G06V10/70 Arrangements for image or video recognition or understanding using pattern recognition or machine learning
    • G06V10/764 Arrangements for image or video recognition or understanding using pattern recognition or machine learning using classification, e.g. of video objects
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06V IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V10/00 Arrangements for image or video recognition or understanding
    • G06V10/70 Arrangements for image or video recognition or understanding using pattern recognition or machine learning
    • G06V10/77 Processing image or video features in feature spaces; using data integration or data reduction, e.g. principal component analysis [PCA] or independent component analysis [ICA] or self-organising maps [SOM]; Blind source separation
    • G06V10/774 Generating sets of training patterns; Bootstrap methods, e.g. bagging or boosting
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06V IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V10/00 Arrangements for image or video recognition or understanding
    • G06V10/70 Arrangements for image or video recognition or understanding using pattern recognition or machine learning
    • G06V10/77 Processing image or video features in feature spaces; using data integration or data reduction, e.g. principal component analysis [PCA] or independent component analysis [ICA] or self-organising maps [SOM]; Blind source separation
    • G06V10/80 Fusion, i.e. combining data from various sources at the sensor level, preprocessing level, feature extraction level or classification level

Abstract

The invention belongs to the technical field of computer vision and pattern recognition, and relates to an unsupervised cross-domain video action recognition method based on resampling and feature weighting. The method comprises the following steps: source domain sample resampling, video preprocessing, feature extraction, motion excitation module construction, feature fusion module construction, intermediate domain weighting module construction, classification module construction, and domain contrast learning module construction. Under the condition that the target domain has no sample labels, the method effectively alleviates the insufficient generalization ability of a model across multiple data sets and improves model transferability; it addresses the inconsistent data distribution between different data sets, solves cross-domain action recognition when the training set of the target data set is unlabelled, and achieves accurate recognition on the target domain test set by using the information of the source domain data set and of the unlabelled target domain training set. The invention is particularly suitable for the field of public safety.

Description

Cross-domain video action recognition method, device, equipment and computer-readable storage medium
Technical Field
The invention belongs to the technical field of computer vision and pattern recognition and relates to an unsupervised cross-domain video action recognition method based on resampling and feature weighting. Under the condition that the target domain has no sample labels, the method effectively alleviates the insufficient generalization ability of a model across multiple data sets and improves model transferability; its effectiveness has been verified on multiple cross-domain video action data sets.
Background
Over the last few years, many deep-architecture approaches have emerged in the field of action recognition. For example, Two-Stream (two-stream convolutional neural network) proposes a dual-stream architecture in which two 2D convolutional branches are jointly trained on RGB and optical-flow inputs to model temporal information; TRN proposes a temporal relation network, using a dedicated pooling layer to model the temporal relations between video frames; C3D directly learns the spatio-temporal features of video data through 3D convolutions; I3D is a deep network that inflates pre-trained 2D convolution filters into 3D in order to exploit large-scale 2D pre-training; P3D splits a 3D convolution into a 3×1×1 temporal convolution and a 1×3×3 spatial convolution, reducing the number of parameters.
However, the above methods often cannot be applied directly to cross-domain action recognition, because they are trained and tested on identically distributed data, i.e. all samples come from the same data set. In cross-domain tasks, the training and test samples usually come from different data sets, so their distributions differ; in this situation these methods cannot adequately eliminate the distribution gap between samples, the classification performance of the model drops significantly, and they cannot be applied effectively to cross-domain tasks.
In the related research fields of computer vision and pattern recognition, the cross-domain task of transfer learning has always been one of the most active research directions. Cross-domain tasks already have mature methods in the image domain, and existing methods differ greatly in their strategies for coping with domain shift. One class of methods performs domain distribution alignment by matching the first- and second-order statistical moments of the source and target data distributions. Another prominent strategy is adversarial training, in which discriminative and domain-agnostic feature representations are learned by coupling a domain discriminator with the source classification loss. Compared with cross-domain image recognition, cross-domain action recognition is more difficult because temporal information must also be aligned, and only a few cross-domain action recognition methods exist. For example, DAAA proposes an end-to-end adversarial learning framework to align the two domains; TA3N proposes a temporal attentive adversarial adaptation network in which an attention mechanism is used to align temporal dynamics together with spatial features; TCoN aligns the feature distributions of videos of the same class in the source and target domains with a co-attention method. While these methods can produce satisfactory results, most of them require samples with label information. Therefore, in this work we focus on the unsupervised cross-domain action recognition task.
Disclosure of Invention
The invention aims at the cross-domain action recognition task and addresses cross-domain action recognition when the training set of the target data set is unlabelled; it provides an unsupervised cross-domain video action recognition method, device, equipment and computer storage medium based on resampling and feature weighting.
The cross-domain video action identification method specifically comprises the following steps:
1) Source domain sample resampling: the number of source domain samples is unevenly distributed across categories; training the model with such unbalanced source domain samples makes the model fit the categories with many samples and neglect the categories with few samples. The per-category sample distribution of the target domain often differs from that of the source domain, so the performance of a model trained on the source domain degrades when it is transferred to the target domain. In view of these characteristics, the source domain samples are resampled so that the number of samples in each source domain category is balanced;
2) Video preprocessing: cross-domain action data set samples contain many frames, and feeding all frames into the network is computationally expensive; since adjacent frames are highly similar, it is more efficient to extract a fixed number of frames and feed only these into the network. Accordingly, N video samples are drawn at random from the source domain and from the target domain, respectively, and combined into one batch of data samples; each video is divided into k = 6 segments, sixteen frames are randomly extracted from each segment, and these ninety-six frame images are taken as the representation of the action sample. At the same time, conventional data enhancement is applied to the frame images, i.e. random cropping, random horizontal flipping and normalization; in the testing stage, only frame extraction is applied to the test set video samples, without data enhancement;
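The segment-based frame sampling and the conventional augmentations above can be sketched as follows; this is a minimal illustration rather than the patented implementation, and the crop size and normalization statistics are assumptions (it also assumes each video has at least k frames):

```python
import random
from torchvision import transforms

def sample_frame_indices(num_frames, k=6, frames_per_segment=16):
    """Split a video of num_frames frames into k segments and draw
    frames_per_segment random frame indices from each segment (96 in total)."""
    seg_len = max(num_frames // k, 1)
    indices = []
    for s in range(k):
        pool = list(range(s * seg_len, min((s + 1) * seg_len, num_frames)))
        indices.extend(sorted(random.choices(pool, k=frames_per_segment)))
    return indices

# Training-time data enhancement: random crop, random horizontal flip,
# normalization; test-set samples only go through frame extraction.
train_transform = transforms.Compose([
    transforms.RandomCrop(224),                       # assumed crop size
    transforms.RandomHorizontalFlip(),
    transforms.ToTensor(),
    transforms.Normalize(mean=[0.485, 0.456, 0.406],  # assumed statistics
                         std=[0.229, 0.224, 0.225]),
])
```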
3) Feature extraction: an I3D model pre-trained on the Kinetics data set is used as the feature extractor of the network to extract features from the video samples processed in steps 1) and 2), obtaining a segment-level feature sequence F_S ∈ R^{T×C}, where T = 6 indicates that each video sample yields 6 segment features and C is the feature channel dimension;
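As an illustration of step 3), segment-level features might be gathered from a Kinetics-pretrained I3D backbone roughly as follows; `load_pretrained_i3d` is a hypothetical loader standing in for whichever I3D implementation is used:

```python
import torch

def extract_segment_features(i3d, video_segments):
    """video_segments: tensor of shape (T, 3, frames, H, W), with T = 6
    segments of 16 frames each; returns segment-level features of shape (T, C)."""
    feats = [i3d(seg.unsqueeze(0)).squeeze(0) for seg in video_segments]
    return torch.stack(feats)  # (T, C); C = 1024 for the standard I3D backbone

# i3d = load_pretrained_i3d()               # hypothetical Kinetics-pretrained I3D
# f_s = extract_segment_features(i3d, segs)  # segment-level feature sequence F_S
```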
4) Constructing a motion excitation module: the segment-level features F_S are input into the motion excitation module, which enhances the motion information along the time dimension and improves the quality of the motion features, yielding the enhanced segment-level features F^M ∈ R^{T×C};
5) Constructing a feature fusion module: the segment-level features F^M modeled by the motion excitation module are taken as input, and the feature fusion module fuses the T = 6 segment-level features into a video-level feature F ∈ R^{C};
6) Constructing an intermediate domain weighting module: one goal of cross-domain action recognition is to learn more domain-invariant features, i.e. intrinsic action features that do not change as the domain changes. Accordingly, an intermediate domain weighting module is designed. For a batch of 2N input video-level sample features {F_1, F_2, …, F_2N}, the intermediate domain weighting layer judges whether each feature is biased toward the source domain or target domain data distribution, or lies in the intermediate domain (i.e. the intersection of the source domain and target domain feature distributions). Sample features in the intermediate domain are more domain-invariant than samples biased toward the domain-specific distribution of the source or target domain. The intermediate domain weighting module computes, from the distance between a feature F and the intermediate domain and according to different rules, a classification weight vector x_c and a domain contrast weight vector x_d; the classification weight vector x_c weights the video-level feature F to obtain the feature F_C used for classification, which is input into the subsequent classification module, and the domain contrast weight vector x_d weights the video-level feature F to obtain the feature F_D used for domain contrast learning;
7) Constructing a classification module: the video-level feature F_C weighted by the intermediate domain weighting module is input into the classification module, which computes the classification probability vector l_C of the sample over the classes. On the one hand, the classification module uses the real labels Y_S of the source domain samples and the classification probability vectors l_S ∈ l_C of the source domain samples to compute the classification loss of the source domain samples with a cross-entropy function, optimizing the classification ability of the whole network; the cross-entropy loss function is defined as:
L_cls = -(1/N) Σ_{i=1..N} y_i · log(ŷ_i)
where N is the number of source domain samples in the current training batch, y_i is the label of the i-th sample, ŷ_i is the predicted value for the i-th sample, and log(·) is the logarithm operation.
On the other hand, the classification module uses the classification probability vector l_T ∈ l_C of the target domain samples and, according to the formula
Ŷ_T = argmax(l_T),
obtains the pseudo labels Ŷ_T of the target domain samples;
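A minimal sketch of the classification branch described in step 7), assuming a single linear classification head (the head architecture is not specified in the text):

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class ClassificationModule(nn.Module):
    def __init__(self, feat_dim, num_classes):
        super().__init__()
        self.head = nn.Linear(feat_dim, num_classes)  # assumed simple head

    def forward(self, f_c):
        return self.head(f_c)  # class logits; softmax of these gives l_C

def source_classification_loss(logits_s, labels_s):
    # Cross-entropy over the labelled source-domain samples of the batch.
    return F.cross_entropy(logits_s, labels_s)

def target_pseudo_labels(logits_t):
    # Pseudo label = argmax of the target-domain classification probabilities;
    # the confidence can be used to flag unreliable pseudo labels (the set Ω).
    probs = torch.softmax(logits_t, dim=1)
    confidence, pseudo = probs.max(dim=1)
    return pseudo, confidence
```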
8) Constructing a domain contrast learning module: the domain contrast learning module mainly realizes inter-domain contrastive learning; by shortening the distance between features of the same category in the source domain sample features and the target domain sample features, and pushing apart the features of different categories, it reduces domain shift and improves the cross-domain ability of the network. The video-level feature F_D weighted by the intermediate domain weighting module is input into the domain contrast learning module, which maps it into the domain contrast feature F_d; the domain contrast loss is computed from the source domain real labels Y_S, the target domain pseudo labels Ŷ_T and the domain contrast features F_d, and the network is optimized to reduce domain shift. The domain contrast loss L_dc is defined by the formula shown in the drawings of the original publication, where N is the number of source domain or target domain samples per batch; F_{d,i}^S denotes the i-th source domain contrast feature of each batch and F_{d,j}^T the j-th target domain contrast feature of each batch; 1[·] is an indicator function that equals 1 if its argument is true and 0 otherwise; Ω denotes the set of target domain samples with unreliable pseudo labels; D(·) is a function that returns the domain of a sample; and τ > 0 is a temperature hyperparameter.
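Because the exact domain contrast loss appears only as a drawing, the sketch below is an assumed stand-in: a standard cross-domain supervised contrastive loss with temperature τ, in which target samples with unreliable pseudo labels (the set Ω) are excluded:

```python
import torch
import torch.nn.functional as F

def domain_contrast_loss(f_s, y_s, f_t, y_t_pseudo, reliable_t, tau=0.5):
    """f_s, f_t: source/target domain contrast features (N, D); y_s: source
    labels; y_t_pseudo: target pseudo labels; reliable_t: boolean mask that
    removes target samples with unreliable pseudo labels (the set Ω)."""
    f_s = F.normalize(f_s, dim=1)
    f_t = F.normalize(f_t[reliable_t], dim=1)
    y_t = y_t_pseudo[reliable_t]
    if f_t.shape[0] == 0:
        return f_s.new_zeros(())
    sim = f_s @ f_t.t() / tau                            # (N_s, N_t) similarities
    pos = y_s.unsqueeze(1).eq(y_t.unsqueeze(0)).float()  # same-class pairs are positives
    log_prob = sim - torch.logsumexp(sim, dim=1, keepdim=True)
    has_pos = pos.sum(dim=1) > 0
    if not has_pos.any():
        return f_s.new_zeros(())
    loss = -(pos * log_prob).sum(dim=1) / pos.sum(dim=1).clamp(min=1)
    return loss[has_pos].mean()
```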
9) Constructing a joint training loss function: the whole network is trained jointly with the loss functions proposed in steps 6), 7) and 8); the overall loss of the network is a weighted sum of the domain classification loss, the source domain classification loss and the domain contrast loss, where α, β and γ are the corresponding hyper-parameters.
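A sketch of the joint objective; which hyper-parameter weights which loss term is an assumption here, since the combined formula appears only in the drawings (the default values follow the embodiment described later):

```python
def total_loss(loss_cls, loss_dom, loss_dc, alpha=1.0, beta=1.0, gamma=1.5):
    # Weighted combination of the source classification loss (step 7), the
    # domain classification loss of the intermediate domain weighting module
    # (step 6) and the domain contrast loss (step 8).
    return alpha * loss_cls + beta * loss_dom + gamma * loss_dc
```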
The step 1) comprises the following specific steps:
counting the number of samples of each category according to the real labels of the source domain samples, and computing a sampling ratio for each category;
sampling ratio = (number of samples of the category / maximum number of samples of any category);
weighted random sampling is then performed on each category according to this sampling ratio, so that the number of samples of each category in the whole source domain becomes consistent; no resampling is performed on the target domain data set, since it has no real labels.
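A sketch of the class-balanced resampling of the source domain with weighted random sampling; the inverse-class-frequency weight used here is an assumption, chosen so that every category is drawn with roughly equal frequency as described above:

```python
import torch
from collections import Counter
from torch.utils.data import WeightedRandomSampler

def balanced_source_sampler(source_labels):
    """source_labels: list of real class labels of the source-domain samples.
    Returns a sampler that draws each category with (approximately) equal
    frequency; the target domain is never resampled, since it has no labels."""
    counts = Counter(source_labels)
    weights = torch.tensor([1.0 / counts[y] for y in source_labels],
                           dtype=torch.double)
    return WeightedRandomSampler(weights, num_samples=len(source_labels),
                                 replacement=True)

# loader = DataLoader(source_dataset, batch_size=N,
#                     sampler=balanced_source_sampler(source_labels))
```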
The step 4) comprises the following specific steps:
in order to model and enhance the motion information of the segment-level features, the segment-level features F_S extracted by the feature extraction module are input into the motion excitation module; the motion excitation module takes the difference between the features of segment t and segment t+1 along the temporal dimension, and this adjacent-segment difference highlights the features in which the two segments differ. Such difference features are usually caused by motion, so the located motion features are added back onto the original segment-level features, thereby enhancing the motion information.
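A minimal sketch of the adjacent-segment differencing idea; the channel-attention form and the squeeze ratio are assumptions of this illustration, not the exact module of the invention:

```python
import torch
import torch.nn as nn

class MotionExcitation(nn.Module):
    def __init__(self, channels, reduction=16):
        super().__init__()
        self.squeeze = nn.Linear(channels, channels // reduction)
        self.expand = nn.Linear(channels // reduction, channels)

    def forward(self, f):                    # f: (B, T, C) segment-level features
        diff = f[:, 1:, :] - f[:, :-1, :]    # difference of segment t+1 and segment t
        diff = torch.cat([diff, diff[:, -1:, :]], dim=1)   # pad back to T steps
        attn = torch.sigmoid(self.expand(torch.relu(self.squeeze(diff))))
        return f + f * attn                  # add the located motion features back
```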
The step 6) comprises the following specific steps:
in order to obtain the distance between a sample feature and the intermediate domain, the intermediate domain weighting module is implemented as a binary classifier that judges whether an input sample feature comes from the source domain or the target domain; its input is the video-level feature F extracted by the network, and at the same time a domain label d_i ∈ {0, 1} is generated according to whether the feature comes from the source domain or the target domain.
The module is trained with supervision on the domain labels to improve the classification accuracy of the binary classifier; the loss function of the intermediate domain weighting module is the binary cross-entropy loss (BCE loss), and the domain classification loss of the samples is defined as:
L_dom = -(1/N) Σ_{i=1..N} [ d_i · log(d̂_i) + (1 - d_i) · log(1 - d̂_i) ]
where N is the number of samples in the current training batch, d_i is the domain label of the i-th sample, d̂_i is the prediction of the intermediate domain weighting module for the i-th sample feature with respect to the domain d ∈ {S, T}, and log(·) is the logarithm operation;
as training proceeds, the classification probability output by the binary classifier reflects the distance between a sample feature and the intermediate domain: the closer the output probability is to 0.5, the harder it is for the classifier to decide whether the sample comes from the source domain or the target domain, which indicates that the feature is more domain-invariant and its distance to the intermediate domain is smaller. The final weighted feature of each sample is then computed from the classification probability output by the binary classifier using the following formulas.
The feature weighting formula for the classification module branch is:
F_C = F · e^{-3|λ-0.5|}
The feature weighting formula for the domain contrast module branch is:
F_D = F · e^{3|λ-0.5|}
wherein F represents the video-level sample characteristics fused by the characteristic fusion module, and λ represents the classification probability output by the intermediate domain weighting module.
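A sketch of the intermediate domain weighting layer: a binary domain classifier produces the probability λ, which is supervised with BCE loss and then used in the two weighting formulas above; the hidden width of the classifier is an assumption of this sketch:

```python
import torch
import torch.nn as nn

class IntermediateDomainWeighting(nn.Module):
    def __init__(self, feat_dim, hidden=256):
        super().__init__()
        self.domain_clf = nn.Sequential(
            nn.Linear(feat_dim, hidden), nn.ReLU(), nn.Linear(hidden, 1))
        self.bce = nn.BCEWithLogitsLoss()

    def forward(self, f, domain_labels):
        logits = self.domain_clf(f).squeeze(1)
        loss_dom = self.bce(logits, domain_labels.float())  # supervised BCE loss
        lam = torch.sigmoid(logits).unsqueeze(1)            # λ, per-sample probability
        w = torch.abs(lam - 0.5)
        f_c = f * torch.exp(-3.0 * w)  # larger weight for intermediate-domain samples
        f_d = f * torch.exp(3.0 * w)   # larger weight for domain-specific samples
        return f_c, f_d, loss_dom
```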
The present invention also provides a video motion detection apparatus, the apparatus comprising:
the source domain sample resampling module balances the number of samples of the source domain samples in each category, prevents the model from over-fitting some categories, and is beneficial to improving the migration performance of the model to the target domain;
the characteristic extraction module is used for extracting the fragment-level characteristics of the sample by utilizing the pre-training I3D model;
the motion excitation module is used for modeling and enhancing the motion information of the segment-level features, and is beneficial to the classification and domain adaptation of subsequent videos;
the feature fusion module is used for fusing the 6 segments of segment-level features of each sample into video-level features;
the intermediate domain weighting module is used for judging the features extracted by the model in order to find the intermediate domain samples, extracted by the feature extraction module, that have domain invariance; by assigning different weights to the samples it, on the one hand, drives the network to learn domain-invariant features more strongly and improves the transferability of the network and, on the other hand, improves the efficiency of domain contrast learning and better reduces domain shift;
the classification module is used for processing the characteristics weighted by the intermediate domain weighting module to obtain the classification loss of the source domain sample and the target domain pseudo label, and is responsible for outputting the final classification result of the target domain test sample in the test stage;
and the domain contrast module is used for processing the features weighted by the intermediate domain weighting module and performing contrastive learning between the source domain and the target domain, pushing apart the feature distributions of samples of different classes and pulling together the feature distributions of samples of the same class across different domains, thereby reducing the domain shift of the features extracted by the network.
The present invention also provides a video motion recognition apparatus, comprising:
a memory for storing an executable computer program;
and the processor is used for realizing the cross-domain video action identification method when executing the executable computer program stored in the memory.
The invention also provides a computer readable storage medium, which stores a computer program for implementing the cross-domain video action recognition method when being executed by a processor.
The invention has the advantages and beneficial effects that:
1) Through resampling of the source domain samples, the number of source domain samples in each category is balanced, which prevents the model from over-fitting certain categories and improves the transfer performance of the model to the target domain;
2) The motion excitation module enhances the motion information in the sample characteristics, and is beneficial to the classification and domain adaptation of subsequent videos.
3) The intermediate domain weighting layer judges the features extracted by the model to find the intermediate domain samples, extracted by the feature extraction module, that have domain invariance; by assigning different weights to the samples it, on the one hand, drives the network to learn domain-invariant features more strongly and improves the transferability of the network and, on the other hand, improves the efficiency of domain contrast learning and better reduces domain shift.
4) The domain contrast learning module performs contrastive learning between the source domain and the target domain, pushing apart the feature distributions of samples of different classes and pulling together the feature distributions of samples of the same class across different domains, which improves the domain adaptation ability of the network.
The training and test sets of common action recognition methods are split from the same data set, so they cannot effectively handle the cross-domain problem; the invention addresses the inconsistent data distribution between different data sets, solves cross-domain action recognition when the training set of the target data set is unlabelled, and achieves accurate recognition on the target domain test set by using the information of the source domain data set and of the unlabelled target domain training set.
Drawings
FIG. 1 is a block diagram of the present invention.
Detailed Description
The invention is further described below with reference to the accompanying drawings;
example 1:
data set
UCF-Olympic has 6 shared classes from the UCF50 and Olympic Sports data sets, containing in total 601 training videos and 240 test videos from the UCF50 data set, and 240 training samples and 54 test samples from the Olympic Sports data set.
UCF-HMDB full has 12 shared categories from UCF101 and HMDB51, respectively, and contains 3209 videos in total; UCF provides 1438 videos for training and 571 videos for testing, and HMDB provides 840 videos for training and 360 for testing.
Table 1 compares the performance of embodiment 1 of the present invention with classical cross-domain action recognition algorithms on the UCF-HMDB full and UCF-Olympic cross-domain action data sets; the documents referenced in Table 1 are as follows:
[1] Arshad Jamal, Vinay P Namboodiri, Dipti Deodhare, and KS Venkatesh. Deep domain adaptation in action space. In BMVC, 2018.
[2] Min-Hung Chen, Zsolt Kira, Ghassan AlRegib, Jaekwon Yoo, Ruxin Chen, and Jian Zheng. Temporal attentive alignment for large-scale video domain adaptation. In ICCV, 2019.
[3] Boxiao Pan, Zhangjie Cao, Ehsan Adeli, and Juan Carlos Niebles. Adversarial cross-domain action recognition with co-attention. In AAAI, 2020.
[4] Jinwoo Choi, Gaurav Sharma, Samuel Schulter, and Jia-Bin Huang. Shuffle and attend: Video domain adaptation. In ECCV, 2020.
TABLE 1
(Table 1 is provided as an image in the original publication and its numerical results are not reproduced here.)
As shown in fig. 1, which is the operation flowchart of the unsupervised cross-domain video action recognition method based on resampling and feature weighting according to this embodiment, the operation steps of the method include:
1) Source domain sample resampling: the per-category sample distributions of cross-domain action recognition data sets are not necessarily consistent; for some cross-domain action recognition data sets the number of source domain samples is unevenly distributed across categories, and training the model with such source domain samples makes the model fit the categories with many samples and neglect the categories with few samples. The sample distribution of the target domain often differs from that of the source domain, so the performance of a model trained on the source domain degrades when it is transferred to the target domain. Therefore, before training, the categories with fewer samples in the source domain are resampled so that the number of samples in each source domain category is balanced. Concretely, the number of samples of each category is counted according to the real labels of the source domain samples, and a sampling ratio is computed for each category;
sampling ratio = (number of samples of the category / maximum number of samples of any category);
weighted random sampling is then performed on each category according to this sampling ratio, so that the number of samples of each category becomes consistent.
2) Video preprocessing: cross-domain action data set samples contain many frames, and feeding all frames into the network is computationally expensive; since adjacent frames are highly similar, it is more efficient to extract a fixed number of frames and feed only these into the network. Accordingly, N video samples are drawn at random from the source domain and from the target domain, respectively, and combined into one batch of data samples; each video is divided into k = 6 segments, sixteen frames are randomly extracted from each segment, and these ninety-six frame images are taken as the representation of the action sample. At the same time, conventional data enhancement is applied to the frame images, i.e. random cropping, random horizontal flipping and normalization; in the testing stage, only frame extraction is applied to the test set video samples, without data enhancement.
3) Feature extraction: an I3D model pre-trained on the Kinetics data set is used as the feature extractor of the network to extract features from the video samples processed in steps 1) and 2), obtaining a segment-level feature sequence F_S ∈ R^{T×C}, where T = 6 indicates that each video sample yields 6 segment features and C is the feature channel dimension;
4) Constructing a motion excitation module: in order to model and enhance the motion information of the segment-level features, the segment-level features F_S extracted by the feature extraction module are input into the motion excitation module, which enhances the motion information along the time dimension and improves the quality of the motion features. Concretely, the features of segment t and segment t+1 are differenced along the temporal dimension; this adjacent-segment difference highlights the features in which the two segments differ, and since such differences are usually caused by motion, the located motion features are added back onto the original segment-level features to enhance the motion information, finally obtaining the enhanced segment-level features F^M ∈ R^{T×C}.
5) Constructing a feature fusion module: to generate the video-level feature, the segment-level features are fused into a video-level feature; the segment-level features are not simply added together, but are weighted according to a fusion weight vector. The weight vector is computed by the feature fusion module, which is implemented as a simple multilayer perceptron (MLP) with the architecture Linear/ReLU/Linear/Sigmoid; it takes the T = 6 segment-level features F^M modeled by the motion excitation module as input and fuses them into a video-level feature F ∈ R^{C}.
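A sketch of the fusion MLP described above (Linear/ReLU/Linear/Sigmoid producing one weight per segment); the hidden width and the weighted-sum combination are assumptions of this illustration:

```python
import torch
import torch.nn as nn

class FeatureFusion(nn.Module):
    def __init__(self, channels, hidden=256):
        super().__init__()
        self.mlp = nn.Sequential(
            nn.Linear(channels, hidden), nn.ReLU(),
            nn.Linear(hidden, 1), nn.Sigmoid())

    def forward(self, f_m):                 # f_m: (B, T, C) motion-enhanced features
        w = self.mlp(f_m)                   # (B, T, 1) per-segment fusion weights
        return (f_m * w).sum(dim=1)         # (B, C) video-level feature F
```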
6) Constructing an intermediate domain weighting module: one goal of cross-domain action recognition is to learn more domain-invariant features, i.e. intrinsic action features that do not change as the domain changes. Accordingly, an intermediate domain weighting module is designed; the role of the intermediate domain weighting layer is to judge, for the 2N video-level sample features {F_1, F_2, …, F_2N} of the same input batch, whether each feature is biased toward the source domain or target domain data distribution, or lies in the intermediate domain (i.e. the intersection of the source domain and target domain feature distributions). Sample features in the intermediate domain are more domain-invariant than samples biased toward the source domain or target domain data distribution, and the module weights the samples according to the distance between the feature and the intermediate domain. The weighting is of two kinds, depending on the module into which the features subsequently flow: the first assigns larger weights to intermediate domain samples and feeds them into the classification module, so that they carry more weight in the subsequent classification computation, which helps the classification module fit the domain-invariant features better and improves its generalization ability; the second applies the reverse weighting, assigning smaller weights to intermediate domain samples and feeding them into the domain contrast learning module, so that they carry less weight in the domain contrast computation, i.e. features with larger domain differences carry more weight in the domain contrast learning. Samples whose distribution leans toward the source domain (or target domain) are more valuable for domain contrast learning than intermediate domain samples, because the domain contrast module computes differences by comparing samples, and selecting sample features with larger domain differences lets the domain contrast module learn the domain shift better.
In order to obtain the distance between a sample feature and the intermediate domain, the intermediate domain weighting module is implemented as a binary classifier that judges whether an input sample feature comes from the source domain or the target domain; its input is the video-level feature F extracted by the network, and at the same time a domain label d_i ∈ {0, 1} is generated according to whether the feature comes from the source domain or the target domain.
The module is trained with supervision on the domain labels to improve the classification accuracy of the binary classifier. The loss function of the intermediate domain weighting module is the binary cross-entropy loss (BCE loss), and the domain classification loss of the samples is defined as:
L_dom = -(1/N) Σ_{i=1..N} [ d_i · log(d̂_i) + (1 - d_i) · log(1 - d̂_i) ]
where N is the number of samples in the current training batch, d_i is the domain label of the i-th sample, d̂_i is the prediction of the intermediate domain weighting module for the i-th sample feature with respect to the domain d ∈ {S, T}, and log(·) is the logarithm operation;
as training proceeds, the classification probability output by the binary classifier reflects the distance between a sample feature and the intermediate domain: the closer the output probability is to 0.5, the harder it is for the classifier to decide whether the sample comes from the source domain or the target domain, which indicates that the feature is more domain-invariant and its distance to the intermediate domain is smaller. The intermediate domain weighted feature of each sample is then computed from the classification probability output by the binary classifier using the following formulas.
The feature weighting formula for the classification module branch is:
F_C = F · e^{-3|λ-0.5|}
The feature weighting formula for the domain contrast module branch is:
F_D = F · e^{3|λ-0.5|}
wherein F represents the video-level sample characteristics fused by the characteristic fusion module, and λ represents the classification probability output by the intermediate domain weighting module.
7) Constructing a classification module: the video-level feature F_C weighted by the intermediate domain weighting module is input into the classification module, which computes the classification probability vector l_C of the sample over the classes. The classification module mainly realizes two functions. The first is to optimize the separability of the network over the classes by computing a supervised classification loss on the labelled source domain samples: the classification module uses the real labels Y_S of the source domain samples and the classification probability vectors l_S ∈ l_C of the source domain samples to compute the classification loss of the source domain samples with a cross-entropy function, optimizing the classification ability of the whole network; the cross-entropy loss function is defined as:
L_cls = -(1/N) Σ_{i=1..N} y_i · log(ŷ_i)
where N is the number of source domain samples in the current training batch, y_i is the label of the i-th sample, ŷ_i is the predicted value for the i-th sample, and log(·) is the logarithm operation.
The second function is that the classification module uses the classification probability vector l_T ∈ l_C of the target domain samples and, according to the formula
Ŷ_T = argmax(l_T),
obtains the pseudo labels of the target domain samples;
8) Constructing a domain contrast learning module: the domain contrast learning module mainly realizes inter-domain contrastive learning; by shortening the distance between features of the same category in the source domain sample features and the target domain sample features, and pushing apart the features of different categories, it reduces domain shift and improves the cross-domain ability of the network. The video-level feature F_D weighted by the intermediate domain weighting module is input into the domain contrast learning module, which maps it into the domain contrast feature F_d; the domain contrast loss is computed from the source domain real labels Y_S, the target domain pseudo labels Ŷ_T and the domain contrast features F_d, and the network is optimized to reduce domain shift. The domain contrast loss L_dc is defined by the formula shown in the drawings of the original publication, where N is the number of source domain or target domain samples per batch; F_{d,i}^S denotes the i-th source domain contrast feature of each batch and F_{d,j}^T the j-th target domain contrast feature of each batch; 1[·] is an indicator function that equals 1 if its argument is true and 0 otherwise; Ω denotes the set of target domain samples with unreliable pseudo labels; D(·) is a function that returns the domain of a sample; and τ > 0 is a temperature hyperparameter.
9) Constructing a joint training loss function: the whole network is trained jointly with the loss functions proposed in steps 6), 7) and 8); the overall loss of the network is a weighted sum of the domain classification loss, the source domain classification loss and the domain contrast loss, where α, β and γ are the corresponding hyper-parameters.
Correspondingly, the present embodiment further provides a video motion detection apparatus, including:
the source domain sample resampling module balances the number of samples of the source domain samples in each category, prevents the model from over-fitting some categories, and is beneficial to improving the migration performance of the model to the target domain;
the characteristic extraction module is used for extracting the fragment-level characteristics of the sample by utilizing the pre-training I3D model;
the motion excitation module is used for modeling and enhancing the motion information of the segment-level features, and is beneficial to the classification and domain adaptation of subsequent videos;
the feature fusion module is used for fusing the 6 segments of segment-level features of each sample into video-level features;
the intermediate domain weighting module is used for judging the features extracted by the model in order to find the intermediate domain samples, extracted by the feature extraction module, that have domain invariance; by assigning different weights to the samples it, on the one hand, drives the network to learn domain-invariant features more strongly and improves the transferability of the network and, on the other hand, improves the efficiency of domain contrast learning and better reduces domain shift;
the classification module is used for processing the characteristics weighted by the intermediate domain weighting module to obtain the classification loss of the source domain sample and the target domain pseudo label, and is responsible for outputting the final classification result of the target domain test sample in the test stage;
and the domain contrast module is used for processing the features weighted by the intermediate domain weighting module and performing contrastive learning between the source domain and the target domain, pushing apart the feature distributions of samples of different classes and pulling together the feature distributions of samples of the same class across different domains, thereby reducing the domain shift of the features extracted by the network.
The present embodiment further provides a video motion recognition device, where the device includes:
a memory for storing an executable computer program;
and the processor is used for realizing the cross-domain video action identification method when executing the executable computer program stored in the memory.
The present embodiment also provides a computer-readable storage medium, which stores a computer program, and when being executed by a processor, the computer program implements the cross-domain video motion recognition method.
To validate the effectiveness of the present invention, evaluations were performed on the action data sets UCF-HMDB full and UCF-Olympic. The experiments use 20 epochs and the SGD optimizer with a default learning rate of 0.01, and the loss function hyper-parameters are set to α = 1.0, β = 1.0, γ = 1.5 and τ = 0.5. The I3D network is initialized with model parameters pre-trained on Kinetics-400.
In the testing stage, test samples are segmented and frames are extracted in the same way as in the training stage, but no resampling or data enhancement is performed. A comparison of the experimental results of this embodiment with unsupervised methods is given in Table 1. As can be seen from Table 1, the unsupervised cross-domain video action recognition model based on resampling and feature weighting provided by the invention achieves better recognition performance on the unsupervised cross-domain action recognition target data sets.
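The training configuration of this embodiment could be set up roughly as follows; the momentum and weight-decay values are assumptions, while the epoch count, learning rate and hyper-parameters follow the numbers given above:

```python
import torch

def train(model, source_loader, target_loader, epochs=20, lr=0.01):
    """model is assumed to return the joint loss of steps 6)-8) for a batch of
    labelled source clips and unlabelled target clips."""
    optimizer = torch.optim.SGD(model.parameters(), lr=lr,
                                momentum=0.9, weight_decay=1e-4)  # assumed values
    for _ in range(epochs):
        for (clips_s, labels_s), clips_t in zip(source_loader, target_loader):
            loss = model(clips_s, labels_s, clips_t)
            optimizer.zero_grad()
            loss.backward()
            optimizer.step()
```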

Claims (7)

1. A cross-domain video action recognition method is characterized by comprising the following steps:
1) Source domain sample resampling: the number of source domain samples is unevenly distributed across categories, and training the model with such unbalanced source domain samples makes the model fit the categories with many samples and neglect the categories with few samples; the per-category sample distribution of the target domain often differs from that of the source domain, so the performance of a model trained on the source domain degrades when it is transferred to the target domain; in view of these characteristics, the source domain samples are resampled so that the number of samples in each source domain category is balanced;
2) Video preprocessing: cross-domain action data set samples contain many frames, and feeding all frames into the network is computationally expensive; since adjacent frames are highly similar, it is more efficient to extract a fixed number of frames and feed only these into the network; accordingly, N video samples are drawn at random from the source domain and from the target domain, respectively, and combined into one batch of data samples; each video is divided into k = 6 segments, sixteen frames are randomly extracted from each segment, and these ninety-six frame images are taken as the representation of the action sample; at the same time, conventional data enhancement is applied to the frame images, i.e. random cropping, random horizontal flipping and normalization; in the testing stage, only frame extraction is applied to the test set video samples, without data enhancement;
3) Feature extraction: an I3D model pre-trained on the Kinetics data set is used as the feature extractor of the network to extract features from the video samples processed in steps 1) and 2), obtaining a segment-level feature sequence F_S ∈ R^{T×C}, where T = 6 indicates that each video sample yields 6 segment features and C is the feature channel dimension;
4) Constructing a motion excitation module: the segment-level features F_S are input into the motion excitation module, which enhances the motion information along the time dimension and improves the quality of the motion features, yielding the enhanced segment-level features F^M ∈ R^{T×C};
5) Constructing a feature fusion module: the segment-level features F^M modeled by the motion excitation module are taken as input, and the feature fusion module fuses the T = 6 segment-level features into a video-level feature F ∈ R^{C};
6) Constructing an intermediate domain weighting module: one goal of cross-domain action recognition is to learn more domain-invariant features, i.e. intrinsic action features that do not change as the domain changes; accordingly, an intermediate domain weighting module is designed, and the role of the intermediate domain weighting layer is to judge, for the 2N video-level sample features {F_1, F_2, …, F_2N} of the same input batch, whether each feature is biased toward the source domain or target domain data distribution, or lies in the intermediate domain (i.e. the intersection of the source domain and target domain feature distributions); sample features in the intermediate domain are more domain-invariant than samples biased toward the domain-specific distribution of the source or target domain; the intermediate domain weighting module computes, from the distance between a feature F and the intermediate domain and according to different rules, a classification weight vector x_c and a domain contrast weight vector x_d; the classification weight vector x_c weights the video-level feature F to obtain the feature F_C used for classification, which is input into the subsequent classification module, and the domain contrast weight vector x_d weights the video-level feature F to obtain the feature F_D used for domain contrast learning;
7) Constructing a classification module: the video-level feature F_C weighted by the intermediate domain weighting module is input into the classification module, which computes the classification probability vector l_C of the sample over the classes; on the one hand, the classification module uses the real labels Y_S of the source domain samples and the classification probability vectors l_S ∈ l_C of the source domain samples to compute the classification loss of the source domain samples with a cross-entropy function, optimizing the classification ability of the whole network; the cross-entropy loss function is defined as:
L_cls = -(1/N) Σ_{i=1..N} y_i · log(ŷ_i)
where N is the number of source domain samples in the current training batch, y_i is the label of the i-th sample, ŷ_i is the predicted value for the i-th sample, and log(·) is the logarithm operation;
on the other hand, the classification module uses the classification probability vector l_T ∈ l_C of the target domain samples and, according to the formula
Ŷ_T = argmax(l_T),
obtains the pseudo labels Ŷ_T of the target domain samples;
8) Constructing a domain contrast learning module: the domain contrast learning module mainly realizes inter-domain contrastive learning; by shortening the distance between features of the same category in the source domain sample features and the target domain sample features, and pushing apart the features of different categories, it reduces domain shift and improves the cross-domain ability of the network; the video-level feature F_D weighted by the intermediate domain weighting module is input into the domain contrast learning module, which maps it into the domain contrast feature F_d; the domain contrast loss is computed from the source domain real labels Y_S, the target domain pseudo labels Ŷ_T and the domain contrast features F_d, and the network is optimized to reduce domain shift; the domain contrast loss L_dc is defined by the formula shown in the drawings of the original publication, where N is the number of source domain or target domain samples per batch; F_{d,i}^S denotes the i-th source domain contrast feature of each batch and F_{d,j}^T the j-th target domain contrast feature of each batch; 1[·] is an indicator function that equals 1 if its argument is true and 0 otherwise; Ω denotes the set of target domain samples with unreliable pseudo labels; D(·) is a function that returns the domain of a sample; and τ > 0 is a temperature hyperparameter;
9) Constructing a joint training loss function: the whole network is trained jointly with the loss functions proposed in steps 6), 7) and 8); the overall loss of the network is a weighted sum of the domain classification loss, the source domain classification loss and the domain contrast loss, where α, β and γ are the corresponding hyper-parameters.
2. The cross-domain video motion recognition method according to claim 1, wherein the specific steps of step 1) are as follows:
counting the number of samples of each category according to the real labels of the source domain samples, and computing a sampling ratio for each category;
sampling ratio = (number of samples of the category / maximum number of samples of any category);
weighted random sampling is then performed on each category according to this sampling ratio, so that the number of samples of each category in the whole source domain becomes consistent; no resampling is performed on the target domain data set, since it has no real labels.
3. The cross-domain video motion recognition method according to claim 1, wherein the step 4) comprises the following specific steps:
in order to model and enhance the motion information of the segment-level features, the segment-level features F_S extracted by the feature extraction module are input into the motion excitation module; the motion excitation module takes the difference between the features of segment t and segment t+1 along the temporal dimension, and this adjacent-segment difference highlights the features in which the two segments differ; such difference features are usually caused by motion, so the located motion features are added back onto the original segment-level features, thereby enhancing the motion information.
4. The cross-domain video motion recognition method according to claim 1, wherein the step 6) comprises the following specific steps:
in order to obtain the distance between a sample feature and the intermediate domain, the intermediate domain weighting module is implemented as a binary classifier that judges whether an input sample feature comes from the source domain or the target domain; its input is the video-level feature F extracted by the network, and at the same time a domain label d_i ∈ {0, 1} is generated according to whether the feature comes from the source domain or the target domain;
Figure FDA0003862947890000044
where N represents the number of samples in the current network training batch,
Figure FDA0003862947890000045
a domain label representing the ith sample,
Figure FDA0003862947890000046
a predicted value of the ith sample characteristic related to the domain d epsilon { S, T } for the middle domain weighting module; log () is a log-taking operation;
as training proceeds, the classification probability output by the binary classifier reflects the distance between a sample feature and the intermediate domain: the closer the output probability is to 0.5, the harder it is for the classifier to decide whether the sample comes from the source domain or the target domain, which indicates that the feature is more domain-invariant and its distance to the intermediate domain is smaller; the intermediate domain weighted feature of each sample is then computed from the classification probability output by the binary classifier using the following formulas;
the feature weighting formula for the classification module branch is:
F_C = F · e^{-3|λ-0.5|}
the feature weighting formula for the domain contrast module branch is:
F_D = F · e^{3|λ-0.5|}
wherein F represents the video-level sample characteristics fused by the characteristic fusion module, and λ represents the classification probability output by the intermediate domain weighting module.
5. A video motion detection apparatus, the apparatus comprising:
the source domain sample resampling module balances the number of samples of the source domain samples in each category, prevents the model from over-fitting some categories, and is beneficial to improving the migration performance of the model to the target domain;
the characteristic extraction module is used for extracting the fragment-level characteristics of the sample by utilizing the pre-training I3D model;
the motion excitation module is used for modeling and enhancing the motion information of the segment-level features, and is beneficial to the classification and domain adaptation of subsequent videos;
the characteristic fusion module is used for fusing the 6-segment fragment-level characteristics of each sample into video-level characteristics;
the intermediate domain weighting module is used for judging the features extracted by the model in order to find the intermediate domain samples, extracted by the feature extraction module, that have domain invariance; by assigning different weights to the samples it, on the one hand, drives the network to learn domain-invariant features more strongly and improves the transferability of the network and, on the other hand, improves the efficiency of domain contrast learning and better reduces domain shift;
the classification module is used for processing the characteristics weighted by the intermediate domain weighting module to obtain the classification loss of the source domain sample and the target domain pseudo label, and is responsible for outputting the final classification result of the target domain test sample in the test stage;
and the domain contrast module is used for processing the features weighted by the intermediate domain weighting module and performing contrastive learning between the source domain and the target domain, pushing apart the feature distributions of samples of different classes and pulling together the feature distributions of samples of the same class across different domains, thereby reducing the domain shift of the features extracted by the network.
6. A video motion recognition device, the device comprising:
a memory for storing an executable computer program;
a processor for implementing the method of any one of claims 1 to 4 when executing an executable computer program stored in the memory.
7. A computer-readable storage medium, in which a computer program is stored which, when being executed by a processor, carries out the method of any one of claims 1 to 4.
CN202211171602.4A 2022-09-26 2022-09-26 Cross-domain video action recognition method, device, equipment and computer-readable storage medium Pending CN115439791A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202211171602.4A CN115439791A (en) 2022-09-26 2022-09-26 Cross-domain video action recognition method, device, equipment and computer-readable storage medium

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202211171602.4A CN115439791A (en) 2022-09-26 2022-09-26 Cross-domain video action recognition method, device, equipment and computer-readable storage medium

Publications (1)

Publication Number Publication Date
CN115439791A true CN115439791A (en) 2022-12-06

Family

ID=84249846

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202211171602.4A Pending CN115439791A (en) 2022-09-26 2022-09-26 Cross-domain video action recognition method, device, equipment and computer-readable storage medium

Country Status (1)

Country Link
CN (1) CN115439791A (en)

Cited By (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN115861901A (en) * 2022-12-30 2023-03-28 深圳大学 Video classification method, device, equipment and storage medium
CN115861901B (en) * 2022-12-30 2023-06-30 深圳大学 Video classification method, device, equipment and storage medium

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination