CN115439791A - Cross-domain video action recognition method, device, equipment and computer-readable storage medium
- Publication number
- CN115439791A (application CN202211171602.4A)
- Authority
- CN
- China
- Prior art keywords
- domain
- sample
- module
- feature
- video
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Pending
Classifications
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06V—IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
- G06V20/00—Scenes; Scene-specific elements
- G06V20/40—Scenes; Scene-specific elements in video content
- G06V20/46—Extracting features or characteristics from the video content, e.g. video fingerprints, representative shots or key frames
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06V—IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
- G06V10/00—Arrangements for image or video recognition or understanding
- G06V10/70—Arrangements for image or video recognition or understanding using pattern recognition or machine learning
- G06V10/764—Arrangements for image or video recognition or understanding using pattern recognition or machine learning using classification, e.g. of video objects
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06V—IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
- G06V10/00—Arrangements for image or video recognition or understanding
- G06V10/70—Arrangements for image or video recognition or understanding using pattern recognition or machine learning
- G06V10/77—Processing image or video features in feature spaces; using data integration or data reduction, e.g. principal component analysis [PCA] or independent component analysis [ICA] or self-organising maps [SOM]; Blind source separation
- G06V10/774—Generating sets of training patterns; Bootstrap methods, e.g. bagging or boosting
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06V—IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
- G06V10/00—Arrangements for image or video recognition or understanding
- G06V10/70—Arrangements for image or video recognition or understanding using pattern recognition or machine learning
- G06V10/77—Processing image or video features in feature spaces; using data integration or data reduction, e.g. principal component analysis [PCA] or independent component analysis [ICA] or self-organising maps [SOM]; Blind source separation
- G06V10/80—Fusion, i.e. combining data from various sources at the sensor level, preprocessing level, feature extraction level or classification level
Landscapes
- Engineering & Computer Science (AREA)
- Theoretical Computer Science (AREA)
- Multimedia (AREA)
- Computer Vision & Pattern Recognition (AREA)
- General Physics & Mathematics (AREA)
- Physics & Mathematics (AREA)
- Evolutionary Computation (AREA)
- General Health & Medical Sciences (AREA)
- Medical Informatics (AREA)
- Software Systems (AREA)
- Health & Medical Sciences (AREA)
- Databases & Information Systems (AREA)
- Computing Systems (AREA)
- Artificial Intelligence (AREA)
- Image Analysis (AREA)
Abstract
The invention belongs to the technical field of computer vision and pattern recognition, and relates to an unsupervised cross-domain video action recognition method based on resampling and feature weighting. The method comprises the following steps: source domain sample resampling, video preprocessing, feature extraction, motion excitation module construction, feature fusion module construction, intermediate domain weighting module construction, classification module construction and domain contrastive learning module construction. When the target domain has no sample labels, the method effectively mitigates the insufficient generalization of a model across multiple data sets and improves model transferability. It addresses the inconsistent data distribution between different data sets and the cross-domain action recognition problem in which the training set of the target data set is unlabeled, and achieves accurate recognition on the target domain test set by using the information of the source domain data set and of the unlabeled target domain training set. The invention is particularly suitable for the field of public safety.
Description
Technical Field
The invention belongs to the technical field of computer vision and pattern recognition, and relates to an unsupervised cross-domain video action recognition method based on resampling and feature weighting. When the target domain has no sample labels, the method effectively mitigates the insufficient generalization of a model across multiple data sets and improves model transferability; its effectiveness is verified on several cross-domain video action data sets.
Background
Over the last few years, many deep architectures have emerged in the field of action recognition. For example: Two-Stream (two-stream convolutional neural network) proposes a two-stream architecture in which two 2D convolutional networks are jointly trained on RGB and optical flow information to model temporal information; TRN proposes a deep temporal relation network that uses a dedicated pooling layer to model the temporal relations between video frames; C3D learns the spatiotemporal features of video data directly through 3D convolution; I3D is a deep network that inflates 2D convolution filters into 3D to take advantage of large-scale pre-trained 2D models; P3D splits the 3D convolution into a 3×1×1 1D temporal convolution and a 1×3×3 2D spatial convolution, reducing the number of parameters.
However, the above methods often cannot be applied directly to cross-domain action recognition, because they are trained and tested on identically distributed data, i.e. all samples come from the same data set. In cross-domain tasks the training and test samples often come from different data sets, so their distributions differ. In that case the above methods cannot eliminate the distribution gap between the samples, the classification performance of the model drops sharply, and the methods cannot be applied effectively to cross-domain tasks.
In computer vision and pattern recognition, the cross-domain task of transfer learning has long been one of the most active research areas. Mature cross-domain methods already exist in the image domain, and they differ greatly in their strategies for handling domain shift. One class of methods aligns domain distributions by matching the first- and second-order statistical moments of the source and target data distributions. Another prominent strategy is adversarial training, in which discriminative and domain-agnostic feature representations are learned by coupling a domain discriminator with the source classification loss. Compared with cross-domain image recognition, cross-domain action recognition is more difficult because of the temporal information involved, and only a few cross-domain action recognition methods have been proposed. For example, DAAA proposes an end-to-end adversarial learning framework for aligning the two domains; TA3N proposes a temporal attentive adversarial adaptation network that uses an attention mechanism to align temporal dynamics and spatial features; TCoN aligns the feature distributions of videos of the same class in the source and target domains using a co-attention method. While these methods can produce satisfactory results, most of them require labeled samples. Therefore, this work focuses on the unsupervised cross-domain action recognition task.
Disclosure of Invention
The invention addresses the cross-domain task of action recognition. It aims to solve cross-domain action recognition when the training set of the target data set is unlabeled, and provides an unsupervised cross-domain video action recognition method, device, equipment and computer storage medium based on resampling and feature weighting.
The cross-domain video action recognition method specifically comprises the following steps:
1) Source domain sample resampling: the number of source domain samples is unevenly distributed across categories, and training the model on such uneven samples makes it fit the categories with many samples and neglect the categories with few; moreover, the per-category sample distribution of the target domain often differs from that of the source domain, so a model trained on the source domain performs worse when transferred to the target domain. For these reasons, the source domain samples are resampled to balance the number of samples in each category;
2) Video preprocessing: samples in cross-domain action data sets contain many frames, so feeding all frames into the network is computationally expensive; because adjacent frames are highly similar, extracting a fixed number of frame images and feeding them into the network is more efficient. Accordingly, N video samples are randomly drawn from the source domain and from the target domain respectively and combined into one batch of data samples; each video is divided into k = 6 segments, sixteen frames are randomly extracted from each segment, and the resulting ninety-six frame images represent the action sample; conventional data augmentation, namely random cropping, random horizontal flipping and normalization, is also applied to the frame images; in the test stage, the test set video samples undergo frame extraction only, without data augmentation;
3) Feature extraction: an I3D model pre-trained on the Kinetics data set is used as the feature extractor of the network to extract features from the video samples processed in steps 1) and 2), obtaining a segment-level feature sequence in $\mathbb{R}^{T \times C}$, where T = 6 indicates that each video sample yields 6 segment features and C is the feature channel dimension;
4) Constructing a motion excitation module: the segment-level features F are input into the motion excitation module, which enhances motion information along the time dimension and improves the quality of the motion features, yielding enhanced segment-level features;
5) Constructing a feature fusion module: the segment-level features enhanced by the motion excitation module are taken as input, and the feature fusion module fuses the T = 6 segment-level features into a video-level feature;
6) Constructing an intermediate domain weighting module: one goal of cross-domain action recognition is to learn more domain-invariant features, i.e. intrinsic action features that do not change with the domain. To this end an intermediate domain weighting module is designed. For the 2N video-level sample features $\{F_1, F_2, \dots, F_{2N}\}$ of one input batch, the intermediate domain weighting layer determines whether each feature is biased toward the source domain or target domain data distribution, or lies in the intermediate domain (i.e. the intersection of the source domain and target domain feature distributions). Sample features in the intermediate domain are more domain-invariant than samples biased toward a source-specific or target-specific data distribution. According to the distance between a feature F and the intermediate domain, the intermediate domain weighting module computes two weight vectors under different rules: the classification weight vector $x_c$ and the domain contrast weight vector $x_d$. The classification weight vector $x_c$ weights the video-level feature F to obtain the feature $F_C$ for classification, which is input into the subsequent classification module; the domain contrast weight vector $x_d$ weights the video-level feature F to obtain the feature $F_D$ for domain contrastive learning;
7) Constructing a classification module: the video-level feature $F_C$ weighted by the intermediate domain weighting module is input into the classification module, which computes the classification probability vector $l_C$ of the sample over the classes. On the one hand, the classification module uses the real labels $Y_S$ of the source domain samples and the classification probability vectors $l_S \in l_C$ of the source domain samples to calculate the classification loss of the source domain samples with a cross-entropy function, optimizing the classification capability of the whole network; the cross-entropy loss function is defined as:
$\mathcal{L}_{cls} = -\frac{1}{N}\sum_{i=1}^{N} y_i \log(\hat{y}_i)$
wherein N represents the number of source domain samples in the current training batch, $y_i$ represents the label of the i-th sample, $\hat{y}_i$ represents the predicted value for the i-th sample, and log(·) is the logarithm operation;
on the other hand, the classification module uses the classification probability vector $l_T \in l_C$ of the target domain samples, according to the formula:
8) Constructing a domain contrastive learning module: the domain contrastive learning module mainly performs inter-domain contrastive learning. It reduces domain shift and improves the cross-domain capability of the network by shortening the distance between source domain and target domain sample features that belong to the same category and enlarging the distance between features of different categories. The video-level feature $F_D$ weighted by the intermediate domain weighting module is input into the domain contrastive learning module, which maps it to the domain contrast feature $F_d$; the source domain real labels $Y_S$, the target domain pseudo-labels and the domain contrast features $F_d$ are then used to calculate the domain contrastive loss, and the network is optimized to reduce domain shift. The domain contrastive loss is defined as:
wherein N represents the number of source domain or target domain samples per batch; the loss is computed over the i-th source domain contrast feature and the j-th target domain contrast feature of each batch; the indicator function equals 1 if its argument is true and 0 otherwise;
Ω represents the set of target domain samples with unreliable pseudo-labels, and D(·) is a function that returns the domain of a sample;
where τ > 0 is a temperature hyperparameter.
9) Constructing a joint training loss function: the whole network is trained jointly with the loss functions proposed in steps 6), 7) and 8); the overall loss function of the network is defined as:
wherein α, β and γ are hyperparameters.
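The source domain classification loss of step 7) can be illustrated with a minimal plain-Python sketch. It assumes the standard mean cross-entropy over a batch, with labels given as integer class indices; the function name `cross_entropy_loss` and the `eps` clamp are illustrative choices, not taken from the patent text.

```python
import math

def cross_entropy_loss(probs, labels, eps=1e-12):
    """Mean cross-entropy over a batch (assumed standard form).

    probs  : list of N class-probability vectors, one per source sample
    labels : list of N integer class indices (the real source labels)
    eps    : hypothetical clamp that avoids log(0)
    """
    total = 0.0
    for p, y in zip(probs, labels):
        # Only the probability assigned to the true class contributes.
        total += math.log(max(p[y], eps))
    return -total / len(labels)

# Two source samples, two classes.
batch_probs = [[0.5, 0.5], [0.25, 0.75]]
batch_labels = [0, 1]
loss = cross_entropy_loss(batch_probs, batch_labels)
```

The loss shrinks toward zero as the probability assigned to each true class approaches 1.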
Step 1) comprises the following specific steps:
the number of samples in each category is counted according to the real labels of the source domain samples, and a sampling ratio is calculated for each category:
sampling ratio = (number of samples in the category) / (number of samples in the largest category);
weighted random sampling is then performed on each category according to the sampling ratio, so that the number of samples per category becomes consistent across the whole set of source domain samples; no resampling is performed on the target domain data set, because it has no real labels.
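A minimal sketch of the resampling step, using only the Python standard library. The text gives the ratio but not how it is turned into sampling weights; the sketch assumes each sample is weighted by the reciprocal of its category's ratio, so that smaller categories are drawn more often. The function name and the `num_draws` and `seed` parameters are hypothetical.

```python
import random
from collections import Counter

def class_balanced_resample(labels, num_draws, seed=0):
    """Draw sample indices so that every category is equally likely."""
    counts = Counter(labels)
    max_count = max(counts.values())
    # Ratio as stated in the text: category count / largest category count.
    ratio = {c: n / max_count for c, n in counts.items()}
    # Assumption: per-sample weight is the reciprocal of the category ratio.
    weights = [1.0 / ratio[y] for y in labels]
    rng = random.Random(seed)
    return rng.choices(range(len(labels)), weights=weights, k=num_draws)

# Imbalanced toy source domain: 90 samples of one class, 10 of another.
toy_labels = ["run"] * 90 + ["jump"] * 10
picked = class_balanced_resample(toy_labels, num_draws=10000)
picked_counts = Counter(toy_labels[i] for i in picked)
```

Under this assumption both classes carry the same total weight, so each receives roughly half of the draws even though the raw data is split 90/10.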
Step 4) comprises the following specific steps:
to model and enhance the motion information of the segment-level features, the segment-level features extracted by the feature extraction module are input into the motion excitation module; the motion excitation module takes the difference between the features of segment t and segment t+1 along the temporal channel. Differencing two adjacent segments highlights the features that differ between them, and such differences are usually caused by motion; the features carrying motion information are thus located and added back to the original segment-level features, which enhances the motion information.
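The differencing idea above can be sketched in plain Python. The text does not specify how the located motion features are added back, so this sketch makes two assumptions: the per-channel gate is `tanh(|difference|)`, and the last segment, which has no successor, receives a zero difference. All names are illustrative.

```python
import math

def motion_excitation(segments):
    """segments: list of T feature vectors (lists of C floats).

    Returns T enhanced vectors: f_t + f_t * gate_t, where gate_t is
    derived from the difference between segment t+1 and segment t.
    """
    T = len(segments)
    enhanced = []
    for t in range(T):
        if t + 1 < T:
            diff = [b - a for a, b in zip(segments[t], segments[t + 1])]
        else:
            # Assumption: the last segment has no successor, so no motion.
            diff = [0.0] * len(segments[t])
        # Assumption: gate in [0, 1), zero where adjacent segments agree.
        gate = [math.tanh(abs(d)) for d in diff]
        enhanced.append([a + a * g for a, g in zip(segments[t], gate)])
    return enhanced

toy_segments = [[1.0, 2.0], [1.0, 5.0]]
toy_enhanced = motion_excitation(toy_segments)
```

Channels that do not change between adjacent segments pass through unchanged, while channels with a large temporal difference are amplified by up to a factor of two.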
Step 6) comprises the following specific steps:
to obtain the distance between a sample feature and the intermediate domain, the intermediate domain weighting module is implemented as a binary classifier that judges whether an input sample feature comes from the source domain or the target domain. Its input is the video-level feature F extracted by the network; a domain label $d_i$ is generated according to whether the feature comes from the source domain or the target domain, and the module is trained with supervision on the domain labels to improve the accuracy of the binary classifier. The loss function of the intermediate domain weighting module is the binary cross-entropy loss (BCELoss), and the domain classification loss of the samples is defined as:
$\mathcal{L}_{d} = -\frac{1}{N}\sum_{i=1}^{N}\left[ d_i \log(\hat{d}_i) + (1 - d_i)\log(1 - \hat{d}_i) \right]$
where N represents the number of samples in the current training batch, $d_i$ represents the domain label of the i-th sample, $\hat{d}_i$ is the intermediate domain weighting module's predicted value for the domain $d \in \{S, T\}$ of the i-th sample feature, and log(·) is the logarithm operation;
as training proceeds, the classification probability output by the binary classifier reflects the distance between a sample feature and the intermediate domain: the closer the output probability is to 0.5, the harder it is for the classifier to tell whether the sample comes from the source domain or the target domain, which indicates that the feature is more domain-invariant and closer to the intermediate domain. The final weighted feature of each sample is calculated from the classification probability output by the binary classifier using the following formulas;
the feature weighting formula of the classification module branch is:
$F_C = F \cdot e^{-3|\lambda - 0.5|}$
the feature weighting formula of the domain contrastive learning module branch is:
$F_D = F \cdot e^{3|\lambda - 0.5|}$
wherein F represents the video-level sample feature fused by the feature fusion module, and λ represents the classification probability output by the intermediate domain weighting module.
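The two weighting branches and the domain classifier loss can be sketched in plain Python as follows. The exponential factors follow the formulas above; the binary cross-entropy is written in its standard form, and encoding the two domains as 0 and 1 is an assumption, as are the function names and the `eps` clamp.

```python
import math

def bce_domain_loss(domain_labels, probs, eps=1e-7):
    """Standard binary cross-entropy; labels are 0/1 (assumed encoding)."""
    total = 0.0
    for d, p in zip(domain_labels, probs):
        p = min(max(p, eps), 1.0 - eps)  # keep log() finite
        total += d * math.log(p) + (1 - d) * math.log(1.0 - p)
    return -total / len(probs)

def intermediate_domain_weighting(feature, lam):
    """Return (F_C, F_D) for one video-level feature.

    feature : list of floats (video-level feature F)
    lam     : probability output by the domain classifier (lambda)
    """
    dist = abs(lam - 0.5)  # distance from the intermediate domain
    f_c = [x * math.exp(-3.0 * dist) for x in feature]  # classification branch
    f_d = [x * math.exp(3.0 * dist) for x in feature]   # domain contrast branch
    return f_c, f_d

# A sample the classifier cannot place (lam = 0.5) keeps its full feature
# in both branches; a confidently placed sample (lam = 1.0) is damped for
# classification and amplified for domain contrastive learning.
mid_c, mid_d = intermediate_domain_weighting([2.0, 4.0], 0.5)
far_c, far_d = intermediate_domain_weighting([1.0], 1.0)
```

With λ = 1.0 the classification branch scales the feature by e^{-1.5} (about 0.22) and the contrast branch by e^{1.5} (about 4.48), which shows how strongly the two branches diverge for domain-specific samples.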
The present invention also provides a cross-domain video action recognition device, the device comprising:
a source domain sample resampling module, which balances the number of source domain samples in each category, prevents the model from over-fitting certain categories, and helps improve the transfer performance of the model to the target domain;
a feature extraction module, which extracts the segment-level features of a sample using the pre-trained I3D model;
a motion excitation module, which models and enhances the motion information of the segment-level features, benefiting subsequent video classification and domain adaptation;
a feature fusion module, which fuses the 6 segment-level features of each sample into a video-level feature;
an intermediate domain weighting module, which examines the features extracted by the feature extraction module to find intermediate domain samples with domain invariance and assigns different weights to the samples; on the one hand this makes the network learn domain-invariant features more strongly and improves its transfer capability, and on the other hand it improves the efficiency of domain contrastive learning and better reduces domain shift;
a classification module, which processes the features weighted by the intermediate domain weighting module to obtain the classification loss of the source domain samples and the target domain pseudo-labels, and outputs the final classification result of the target domain test samples in the test stage;
and a domain contrastive learning module, which processes the features weighted by the intermediate domain weighting module and performs contrastive learning between the source domain and the target domain, pushing apart the feature distributions of samples of different classes in the same domain and pulling together the feature distributions of samples of the same class in different domains, thereby reducing the domain shift of the features extracted by the network.
The present invention also provides cross-domain video action recognition equipment, comprising:
a memory for storing an executable computer program;
and a processor for implementing the cross-domain video action recognition method when executing the executable computer program stored in the memory.
The invention also provides a computer-readable storage medium storing a computer program which, when executed by a processor, implements the cross-domain video action recognition method.
The invention has the following advantages and beneficial effects:
1) Resampling the source domain samples balances the number of source domain samples in each category, prevents the model from over-fitting certain categories, and improves the transfer performance of the model to the target domain.
2) The motion excitation module enhances the motion information in the sample features, benefiting subsequent video classification and domain adaptation.
3) The intermediate domain weighting layer examines the features extracted by the feature extraction module to find intermediate domain samples with domain invariance and assigns different weights to the samples; on the one hand this makes the network learn domain-invariant features more strongly and improves its transfer capability, and on the other hand it improves the efficiency of domain contrastive learning and better reduces domain shift.
4) The domain contrastive learning module performs contrastive learning between the source domain and the target domain, pushing apart the feature distributions of samples of different classes in the same domain and pulling together the feature distributions of samples of the same class in different domains, which improves the domain adaptation of the network.
Common action recognition methods draw their training and test sets from the same data set and therefore cannot effectively handle the cross-domain problem. The invention addresses the inconsistent data distribution between different data sets and the cross-domain action recognition problem in which the training set of the target data set is unlabeled, and achieves accurate recognition on the target domain test set by using the information of the source domain data set and of the unlabeled target domain training set.
Drawings
FIG. 1 is a block diagram of the present invention.
Detailed Description
The invention is further described below with reference to the accompanying drawings.
Example 1:
Data sets
UCF-Olympic has 6 classes shared between the UCF50 and Olympic Sports data sets, with 601 training videos and 240 test videos from UCF50, and 240 training videos and 54 test videos from the Olympic Sports data set.
UCF-HMDB_full has 12 classes shared between UCF101 and HMDB51 and contains 3209 videos in total: UCF provides 1438 training videos and 571 test videos, and HMDB provides 840 training videos and 360 test videos.
Table 1 compares the performance of Example 1 of the invention with classical cross-domain action recognition algorithms on the UCF-HMDB_full and UCF-Olympic cross-domain action data sets; the references cited in Table 1 are as follows:
[1] Arshad Jamal, Vinay P. Namboodiri, Dipti Deodhare, and K. S. Venkatesh. Deep domain adaptation in action space. In BMVC, 2018.
[2] Min-Hung Chen, Zsolt Kira, Ghassan AlRegib, Jaekwon Yoo, Ruxin Chen, and Jian Zheng. Temporal attentive alignment for large-scale video domain adaptation. In ICCV, 2019.
[3] Boxiao Pan, Zhangjie Cao, Ehsan Adeli, and Juan Carlos Niebles. Adversarial cross-domain action recognition with co-attention. In AAAI, 2020.
[4] Jinwoo Choi, Gaurav Sharma, Samuel Schulter, and Jia-Bin Huang. Shuffle and attend: Video domain adaptation. In ECCV, 2020.
TABLE 1
FIG. 1 shows the operation flowchart of the unsupervised cross-domain video action recognition method based on resampling and feature weighting according to this example; the operation steps of the method include:
1) Source domain sample resampling: the number of samples per class in cross-domain action recognition data sets is not necessarily uniform. For some of these data sets the source domain samples are unevenly distributed across classes, and training the model on such samples makes it fit the classes with many samples and neglect the classes with few; moreover, the class distribution of the target domain often differs from that of the source domain, so a model trained on the source domain performs worse when transferred to the target domain. Therefore, before training, the classes with fewer samples in the source domain are resampled to balance the number of source domain samples per class. Specifically, the number of samples in each class is counted according to the real labels of the source domain samples, and a sampling ratio is calculated for each class:
sampling ratio = (number of samples in the class) / (number of samples in the largest class);
weighted random sampling is then performed on each class according to the sampling ratio, so that the number of samples in each class becomes consistent.
2) Video preprocessing: samples in cross-domain action data sets contain many frames, so feeding all frames into the network is computationally expensive; because adjacent frames are highly similar, extracting a fixed number of frame images and feeding them into the network is more efficient. Accordingly, N video samples are randomly drawn from the source domain and from the target domain respectively and combined into one batch of data samples; each video is divided into k = 6 segments, sixteen frames are randomly extracted from each segment, and the resulting ninety-six frame images represent the action sample; conventional data augmentation, namely random cropping, random horizontal flipping and normalization, is also applied to the frame images; in the test stage, the test set video samples undergo frame extraction only, without data augmentation.
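The segment-based frame extraction of step 2) can be sketched with the Python standard library. The text does not state whether the sixteen frames of a segment are consecutive; this sketch assumes they are drawn independently at random and then sorted, and that segments shorter than sixteen frames are sampled with replacement. The function name and the `seed` parameter are hypothetical.

```python
import random

def sample_frame_indices(num_frames, k=6, frames_per_segment=16, seed=None):
    """Return k * frames_per_segment frame indices for one video."""
    if num_frames < k:
        raise ValueError("video has fewer frames than segments")
    rng = random.Random(seed)
    # Segment boundaries: k roughly equal, contiguous index ranges.
    bounds = [round(i * num_frames / k) for i in range(k + 1)]
    indices = []
    for s in range(k):
        pool = range(bounds[s], bounds[s + 1])
        if len(pool) >= frames_per_segment:
            chosen = rng.sample(pool, frames_per_segment)
        else:
            # Assumption: short segments are sampled with replacement.
            chosen = rng.choices(pool, k=frames_per_segment)
        indices.extend(sorted(chosen))
    return indices

# A 300-frame video yields 6 segments of 50 frames and 96 sampled indices.
frame_ids = sample_frame_indices(300, seed=0)
```

Sorting within each segment keeps the temporal order of the sampled frames, which matters when the frames of a segment are passed to a spatiotemporal feature extractor.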
3) Feature extraction: an I3D model pre-trained on the Kinetics data set is used as the feature extractor of the network to extract features from the video samples processed in steps 1) and 2), obtaining a segment-level feature sequence in $\mathbb{R}^{T \times C}$, where T = 6 indicates that each video sample yields 6 segment features and C is the feature channel dimension;
4) Constructing a motion excitation module: to model and enhance the motion information of the segment-level features, the segment-level features $F_S$ extracted by the feature extraction module are input into the motion excitation module, which enhances motion information along the time dimension and improves the quality of the motion features. Specifically, the difference between the features of segment t and segment t+1 is taken along the temporal channel; differencing two adjacent segments highlights the features that differ between them, and such differences are usually caused by motion. The features carrying motion information are thus located and added back to the original segment-level features to enhance the motion information, finally yielding the enhanced segment-level features.
5) Constructing a feature fusion module: to generate video-level features, the segment-level features are fused into a video-level feature; the segment-level features are not simply summed but are weighted according to a fusion weight vector. The weight vector is computed by a simple feature fusion module, implemented as a multilayer perceptron (MLP) with the architecture Linear/ReLU/Linear/Sigmoid, which receives as input the T = 6 segment-level features enhanced by the motion excitation module; the feature fusion module then fuses the T = 6 segment-level features into a video-level feature F.
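The weighted fusion can be sketched in plain Python. The text fixes the layer order (Linear/ReLU/Linear/Sigmoid) but not the layer sizes or how the MLP is applied; this sketch assumes one shared MLP that maps each segment feature to a scalar weight in (0, 1), followed by a weighted sum over segments. The helper names, the hidden size and the random initialisation are all illustrative.

```python
import math
import random

def linear(x, W, b):
    """Dense layer: W is a list of rows, b a list of biases."""
    return [sum(w * xi for w, xi in zip(row, x)) + bi for row, bi in zip(W, b)]

def init_params(C, hidden, seed=0):
    """Hypothetical small random initialisation for the two linear layers."""
    rng = random.Random(seed)
    W1 = [[rng.uniform(-0.1, 0.1) for _ in range(C)] for _ in range(hidden)]
    W2 = [[rng.uniform(-0.1, 0.1) for _ in range(hidden)]]
    return W1, [0.0] * hidden, W2, [0.0]

def fusion_weights(segments, params):
    """One weight per segment via Linear/ReLU/Linear/Sigmoid."""
    W1, b1, W2, b2 = params
    weights = []
    for f in segments:
        h = [max(0.0, v) for v in linear(f, W1, b1)]   # Linear + ReLU
        z = linear(h, W2, b2)[0]                        # Linear to one score
        weights.append(1.0 / (1.0 + math.exp(-z)))      # Sigmoid
    return weights

def fuse(segments, params):
    """Weighted sum of the T segment features into one video-level feature."""
    w = fusion_weights(segments, params)
    C = len(segments[0])
    return [sum(w[t] * segments[t][c] for t in range(len(segments)))
            for c in range(C)]

T, C = 6, 4
toy_segments = [[float(t + c) for c in range(C)] for t in range(T)]
toy_params = init_params(C, hidden=8)
video_feature = fuse(toy_segments, toy_params)
```

Because the weights come from a sigmoid, each segment contributes a fraction of its feature between 0 and 1, so uninformative segments can be suppressed without being dropped outright.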
6) Constructing an intermediate domain weighting module: one goal of cross-domain action recognition is to learn more domain-invariant features, i.e. intrinsic action features that do not change with the domain. To this end an intermediate domain weighting module is designed. For the 2N video-level sample features $\{F_1, F_2, \dots, F_{2N}\}$ of one input batch, the intermediate domain weighting layer determines whether each feature is biased toward the source domain or target domain data distribution, or lies in the intermediate domain (i.e. the intersection of the source domain and target domain feature distributions). Sample features in the intermediate domain are more domain-invariant than samples biased toward the source domain or target domain data distribution, and the module weights the samples according to the distance between their features and the intermediate domain. The weighting takes two forms, depending on which module the features flow into next. The first gives larger weights to intermediate domain samples and feeds them into the classification module, so that they carry more weight in the subsequent classification calculation; this helps the classification module fit domain-invariant features and improves its generalization. The second applies the reverse weighting, giving smaller weights to intermediate domain samples, and feeds them into the domain contrastive learning module, so that features with larger domain differences carry more weight in the domain contrastive learning calculation. Samples biased toward the source domain (or target domain) are more valuable for domain contrastive learning than intermediate domain samples, because the domain contrastive learning module computes its loss from the contrast between samples, and selecting sample features with larger domain differences lets the module learn the domain shift better;
To obtain the distance between a sample feature and the intermediate domain, the intermediate domain weighting module is implemented as a binary classifier that judges whether an input sample feature comes from the source domain or the target domain. Its input is the video-level feature F extracted by the network. A domain label is generated for each feature according to whether it comes from the source domain or the target domain, and the module is trained with supervision on these domain labels to improve the accuracy of the binary classifier. The loss function of the intermediate domain weighting module is the binary cross-entropy loss (BCELoss), and the domain classification loss of the samples is defined as:
where N denotes the number of samples in the current training batch; the loss is computed from the domain label of the i-th sample and from the intermediate domain weighting module's predicted value for the i-th sample feature with respect to the domain d ∈ {S, T}; log(·) denotes the logarithm;
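For reference, the standard binary cross-entropy loss that matches these definitions can be written as below. The symbols $\mathcal{L}_{dom}$, $d_i$ (domain label, for example 0 for source and 1 for target) and $\hat{d}_i$ (predicted probability) are notation assumed here for illustration; they are not taken from the original formula.

```latex
\mathcal{L}_{dom} = -\frac{1}{N} \sum_{i=1}^{N}
  \left[ d_i \log \hat{d}_i + (1 - d_i) \log\left(1 - \hat{d}_i\right) \right]
```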
As training proceeds, the classification probability output by the binary classifier reflects the distance between a sample feature and the intermediate domain: the closer the output probability is to 0.5, the harder it is for the classifier to judge whether the sample comes from the source domain or the target domain, which indicates that the feature is more domain-invariant and lies closer to the intermediate domain. The intermediate domain weighted features of each sample are then calculated from the classification probability output by the binary classifier, using the following formulas.
The feature weighting formula for the classification module branch is:
$F_C = F \cdot e^{-3|\lambda - 0.5|}$
The feature weighting formula for the domain contrast module branch is:
$F_D = F \cdot e^{3|\lambda - 0.5|}$
where F denotes the video-level sample feature produced by the feature fusion module, and λ denotes the classification probability output by the intermediate domain weighting module.
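The two weighting formulas can be illustrated with a short sketch. The function names and the example feature are hypothetical; the factor 3 and the distance |λ − 0.5| come from the formulas above.

```python
import math

def intermediate_domain_weights(lam, k=3.0):
    """Return (classification weight, domain-contrast weight) for domain probability lam."""
    distance = abs(lam - 0.5)  # 0 when the classifier cannot tell the domains apart
    return math.exp(-k * distance), math.exp(k * distance)

def weight_feature(feature, lam):
    """Scale a video-level feature F into F_C and F_D."""
    w_c, w_d = intermediate_domain_weights(lam)
    f_c = [w_c * v for v in feature]
    f_d = [w_d * v for v in feature]
    return f_c, f_d

# A sample in the intermediate domain (lam = 0.5) is left unchanged in both branches.
mid_weights = intermediate_domain_weights(0.5)
# A sample the classifier assigns confidently to one domain (lam = 1.0) is
# down-weighted for classification and up-weighted for domain contrast learning.
far_weights = intermediate_domain_weights(1.0)
f_c, f_d = weight_feature([2.0, -1.0], 1.0)
```

For λ = 0.5 both weights equal 1, while for λ = 0 or λ = 1 the classification weight falls to about 0.22 and the domain contrast weight rises to about 4.48, so the two branches always receive reciprocal weights.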
7) Constructing a classification module: the video-level feature $F_C$ weighted by the intermediate domain weighting module is input to the classification module, which computes the classification probability vector $l_C$ of each sample over the classes. The classification module serves two functions. The first is to optimize the network's ability to separate classes by computing a supervised classification loss on the labeled source domain samples: from the true labels $Y_S$ of the source domain samples and their classification probability vectors $l_S \in l_C$, the classification module computes the source domain classification loss with a cross-entropy function, which optimizes the classification ability of the whole network. The cross-entropy loss function is defined as:
where N denotes the number of source domain samples in the current training batch, $y_i$ denotes the label of the i-th sample, and the remaining term is the predicted value for the i-th sample; log(·) denotes the logarithm;
The second function is pseudo-labeling: the classification module uses the classification probability vectors $l_T \in l_C$ of the target domain samples, according to the formula:
to obtain the pseudo-label of each target domain sample;
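Both functions of the classification module can be sketched in a few lines. The cross-entropy below is the standard definition; the pseudo-label rule (taking the most probable class) is an assumption made for illustration, and the function names and example probabilities are hypothetical.

```python
import math

def cross_entropy(probabilities, labels):
    """Mean negative log-probability of the true class over a batch."""
    eps = 1e-12  # guards against log(0)
    losses = [-math.log(max(p[y], eps)) for p, y in zip(probabilities, labels)]
    return sum(losses) / len(losses)

def pseudo_labels(probabilities):
    """Assumed rule: the pseudo-label is the index of the most probable class."""
    return [max(range(len(p)), key=p.__getitem__) for p in probabilities]

# Source domain samples have true labels, so they contribute a supervised loss.
source_probs = [[0.7, 0.2, 0.1], [0.1, 0.8, 0.1]]
source_labels = [0, 1]
loss = cross_entropy(source_probs, source_labels)

# Target domain samples have no labels, so labels are inferred from the predictions.
target_probs = [[0.2, 0.3, 0.5], [0.6, 0.3, 0.1]]
labels_t = pseudo_labels(target_probs)
```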
8) Constructing a domain contrast learning module: the domain contrast learning module mainly performs inter-domain contrastive learning. It reduces domain shift and improves the cross-domain ability of the network by shortening the distance between source domain and target domain sample features of the same class and increasing the distance between features of different classes. The video-level feature $F_D$ weighted by the intermediate domain weighting module is input to the domain contrast learning module, which maps it to a domain contrast feature $F_d$. The domain contrast loss is computed from the source domain true labels $Y_S$, the target domain pseudo-labels, and the domain contrast features $F_d$, and the network is optimized to reduce domain shift. The domain contrast loss is defined as:
where N denotes the number of source domain (or target domain) samples per batch; the loss involves the i-th source domain contrast feature and the j-th target domain contrast feature of each batch; the indicator function equals 1 if its argument is true and 0 otherwise;
Ω denotes the set of target domain samples with unreliable pseudo-labels, and D(·) is a function that returns the domain of a sample;
where τ > 0 is a temperature hyperparameter.
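The general idea of a cross-domain contrastive loss with a temperature τ can be illustrated with the sketch below. It is not the loss defined in this embodiment: it assumes an InfoNCE-style form with cosine similarity, source samples as anchors, same-class target samples as positives, and target samples in Ω simply excluded. The function names and the example features are hypothetical.

```python
import math

def cosine(u, v):
    """Cosine similarity between two non-zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm

def cross_domain_contrastive_loss(src_feats, src_labels, tgt_feats, tgt_pseudo,
                                  unreliable, tau=0.5):
    """InfoNCE-style loss: pull same-class source/target pairs together."""
    # Assumption: target samples with unreliable pseudo-labels are left out entirely.
    reliable = [j for j in range(len(tgt_feats)) if j not in unreliable]
    losses = []
    for f_s, y in zip(src_feats, src_labels):
        scores = {j: math.exp(cosine(f_s, tgt_feats[j]) / tau) for j in reliable}
        positive = sum(s for j, s in scores.items() if tgt_pseudo[j] == y)
        if positive > 0.0:  # skip anchors that have no same-class target sample
            losses.append(-math.log(positive / sum(scores.values())))
    return sum(losses) / len(losses) if losses else 0.0

src = [[1.0, 0.0], [0.0, 1.0]]
tgt = [[1.0, 0.1], [0.1, 1.0]]
# Pseudo-labels that agree with feature similarity give a low loss ...
aligned = cross_domain_contrastive_loss(src, [0, 1], tgt, [0, 1], unreliable=set())
# ... and pseudo-labels that contradict it give a high loss.
misaligned = cross_domain_contrastive_loss(src, [0, 1], tgt, [1, 0], unreliable=set())
```

A smaller τ sharpens the softmax over similarities, so hard negatives contribute more to the gradient.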
9) Constructing a joint training loss function: the whole network is trained jointly using the loss functions proposed in steps 6), 7) and 8). The overall loss function of the network is defined as:
where α, β and γ are hyperparameters.
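A weighted sum is the usual way to combine three losses with three hyperparameters, and can be sketched as below. Which hyperparameter multiplies which loss is an assumption here, as are the function name and the example loss values; the defaults follow the values reported for the experiments (α = 1.0, β = 1.0, γ = 1.5).

```python
def joint_loss(classification_loss, domain_loss, contrast_loss,
               alpha=1.0, beta=1.0, gamma=1.5):
    """Weighted sum of the three training losses (assumed pairing of weights)."""
    return alpha * classification_loss + beta * domain_loss + gamma * contrast_loss

# Illustrative loss values for one training batch.
total = joint_loss(classification_loss=0.29, domain_loss=0.60, contrast_loss=0.20)
```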
Correspondingly, the present embodiment further provides a video motion detection apparatus, including:
the source domain sample resampling module, which balances the number of source domain samples in each category, prevents the model from over-fitting some categories, and helps improve the transfer performance of the model on the target domain;
the feature extraction module, which extracts the segment-level features of each sample using the pre-trained I3D model;
the motion excitation module, which models and enhances the motion information of the segment-level features, benefiting subsequent video classification and domain adaptation;
the feature fusion module, which fuses the 6 segment-level features of each sample into a video-level feature;
the intermediate domain weighting module, which examines the features extracted by the feature extraction module to find the intermediate-domain samples that are domain-invariant, and assigns different weights to the samples; on the one hand, this makes the network learn domain-invariant features more strongly and improves its transfer ability, and on the other hand, it makes domain contrast learning more efficient and reduces domain shift more effectively;
the classification module, which processes the features weighted by the intermediate domain weighting module to obtain the classification loss of the source domain samples and the pseudo-labels of the target domain samples, and outputs the final classification result for target domain test samples in the test stage;
and the domain contrast module, which processes the features weighted by the intermediate domain weighting module and performs contrastive learning between the source domain and the target domain, pushing apart the feature distributions of samples of different classes and pulling together the feature distributions of samples of the same class from different domains, thereby reducing the domain shift of the features extracted by the network.
The present embodiment further provides a video motion recognition device, where the device includes:
a memory for storing an executable computer program;
and the processor is used for realizing the cross-domain video action identification method when executing the executable computer program stored in the memory.
The present embodiment also provides a computer-readable storage medium, which stores a computer program, and when being executed by a processor, the computer program implements the cross-domain video motion recognition method.
To validate the effectiveness of the present invention, evaluations were performed on the action data sets UCF-HMDBfull and UCF-Olympic. Training ran for 20 epochs with the SGD optimizer and a default learning rate of 0.01, and the loss function hyperparameters were set to α = 1.0, β = 1.0, γ = 1.5 and τ = 0.5. The I3D network was initialized with model parameters pre-trained on Kinetics-400.
During testing, test samples are segmented and frames are extracted in the same way as in the training stage, but no resampling or data augmentation is performed. Table 1 compares the experimental results of this embodiment with those of other unsupervised methods. As Table 1 shows, the unsupervised cross-domain video action recognition model based on resampling and feature weighting provided by the invention achieves better recognition performance on the unsupervised cross-domain action recognition target data sets.
Claims (7)
1. A cross-domain video action recognition method is characterized by comprising the following steps:
1) Source domain sample resampling: the number of source domain samples is unevenly distributed across categories, and training the model on such uneven source domain samples makes it fit the categories with many samples and neglect the categories with few samples; the distribution of target domain samples across categories often differs from that of the source domain, so a model trained on the source domain performs worse when transferred to the target domain; based on this, the source domain samples are resampled so that the number of samples in each category is balanced;
2) Video preprocessing: samples in cross-domain action data sets contain many frames, so feeding all frames into the network is computationally expensive, and because adjacent frames are highly similar, it is more efficient to extract a limited number of frames and feed those into the network; based on this, N video samples are taken at random from each of the source domain and the target domain and combined into one batch of data samples; each video is divided into k = 6 segments, sixteen frames are randomly extracted from each segment, and the resulting ninety-six frame images are used as the representation of the action sample; conventional data augmentation is also applied to the frame images, namely random cropping, random horizontal flipping and normalization; in the testing stage, only frame extraction is applied to the test set video samples, without data augmentation;
3) Feature extraction: an I3D model pre-trained on the Kinetics data set is used as the feature extractor of the network to extract features from the video samples processed in steps 1) and 2), yielding a segment-level feature sequence, where T = 6 indicates that each video sample yields 6 segment features, and C is the feature channel dimension;
4) Constructing a motion excitation module: the segment-level features $F_S$ are input to the motion excitation module, which enhances the motion information along the time dimension to improve the quality of the motion features and obtain enhanced segment-level features;
5) Constructing a feature fusion module: taking the segment-level features modeled by the motion excitation module as input, the feature fusion module fuses the T = 6 segment-level features into a video-level feature;
6) Constructing an intermediate domain weighting module: one goal of cross-domain action recognition is to learn more domain-invariant features, i.e., intrinsic action features that do not change when the domain changes; based on this, the intermediate domain weighting module is designed; for the 2N video-level sample features $\{F_1, F_2, \dots, F_{2N}\}$ of an input batch, the intermediate domain weighting layer judges whether each feature is biased toward the source domain or target domain data distribution, or lies in the intermediate domain (namely the intersection of the source domain and target domain feature distributions); sample features in the intermediate domain are more domain-invariant than samples biased toward the source domain or target domain data distribution; the intermediate domain weighting module computes two weight vectors from the distance between the feature F and the intermediate domain, using different rules: a classification weight vector and a domain contrast weight vector; the classification weight vector $x_c$ weights the video-level feature F to obtain the feature $F_C$ for classification, which is input to the subsequent classification module; the domain contrast weight vector $x_d$ weights the video-level feature F to obtain the feature $F_D$ for domain contrast learning;
7) Constructing a classification module: the video-level feature $F_C$ weighted by the intermediate domain weighting module is input to the classification module, which computes the classification probability vector $l_C$ of each sample over the classes; on the one hand, from the true labels $Y_S$ of the source domain samples and their classification probability vectors $l_S \in l_C$, the classification module computes the source domain classification loss with a cross-entropy function, optimizing the classification ability of the whole network; the cross-entropy loss function is defined as:
where N denotes the number of source domain samples in the current training batch, $y_i$ denotes the label of the i-th sample, and the remaining term is the predicted value for the i-th sample; log(·) denotes the logarithm;
on the other hand, the classification module uses the classification probability vectors $l_T \in l_C$ of the target domain samples to obtain the pseudo-label of each target domain sample, according to the formula:
8) Constructing a domain contrast learning module: the domain contrast learning module mainly performs inter-domain contrastive learning, and reduces domain shift and improves the cross-domain ability of the network by shortening the distance between source domain and target domain sample features of the same class and increasing the distance between features of different classes; the video-level feature $F_D$ weighted by the intermediate domain weighting module is input to the domain contrast learning module, which maps it to a domain contrast feature $F_d$; the domain contrast loss is computed from the source domain true labels $Y_S$, the target domain pseudo-labels, and the domain contrast features $F_d$, and the network is optimized to reduce domain shift; the domain contrast loss is defined as:
where N denotes the number of source domain (or target domain) samples per batch; the loss involves the i-th source domain contrast feature and the j-th target domain contrast feature of each batch; the indicator function equals 1 if its argument is true and 0 otherwise;
Ω denotes the set of target domain samples with unreliable pseudo-labels, and D(·) is a function that returns the domain of a sample;
where τ > 0 is a temperature hyperparameter;
9) Constructing a joint training loss function: the whole network is trained jointly using the loss functions proposed in steps 6), 7) and 8); the overall loss function of the network is defined as:
where α, β and γ are hyperparameters.
2. The cross-domain video motion recognition method according to claim 1, wherein the specific steps of step 1) are as follows:
counting the number of samples in each category according to the true labels of the source domain samples, and calculating a sampling rate for each category;
sampling rate = (number of samples in the category / number of samples in the largest category);
performing weighted random sampling on each category according to its sampling rate, so that the number of samples in each category is consistent across the whole set of source domain samples; no resampling is performed on the target domain data set, because it has no true labels.
3. The cross-domain video motion recognition method according to claim 1, wherein the step 4) comprises the following specific steps:
to model and enhance the motion information of the segment-level features, the segment-level features extracted by the feature extraction module are input to the motion excitation module; along the temporal dimension, the motion excitation module takes the difference between the features of segment t and the features of segment t+1; this difference between adjacent segments highlights the features that change between them, and such changes are usually caused by motion, so the features carrying motion information are located and added back to the original segment-level features, thereby enhancing the motion information.
4. The cross-domain video motion recognition method according to claim 1, wherein the step 6) comprises the following specific steps:
to obtain the distance between a sample feature and the intermediate domain, the intermediate domain weighting module is implemented as a binary classifier that judges whether an input sample feature comes from the source domain or the target domain; its input is the video-level feature extracted by the network; a domain label is generated for each feature according to whether it comes from the source domain or the target domain, and the module is trained with supervision on these domain labels to improve the accuracy of the binary classifier; the loss function of the intermediate domain weighting module is the binary cross-entropy loss (BCELoss), and the domain classification loss of the samples is defined as:
where N denotes the number of samples in the current training batch; the loss is computed from the domain label of the i-th sample and from the intermediate domain weighting module's predicted value for the i-th sample feature with respect to the domain d ∈ {S, T}; log(·) denotes the logarithm;
as training proceeds, the classification probability output by the binary classifier reflects the distance between a sample feature and the intermediate domain: the closer the output probability is to 0.5, the harder it is for the classifier to judge whether the sample comes from the source domain or the target domain, which indicates that the feature is more domain-invariant and lies closer to the intermediate domain; the intermediate domain weighted features of each sample are calculated from the classification probability output by the binary classifier, using the following formulas;
the feature weighting formula for the classification module branch is:
$F_C = F \cdot e^{-3|\lambda - 0.5|}$
the feature weighting formula for the domain contrast module branch is:
$F_D = F \cdot e^{3|\lambda - 0.5|}$
where F denotes the video-level sample feature produced by the feature fusion module, and λ denotes the classification probability output by the intermediate domain weighting module.
5. A video motion detection apparatus, the apparatus comprising:
the source domain sample resampling module, which balances the number of source domain samples in each category, prevents the model from over-fitting some categories, and helps improve the transfer performance of the model on the target domain;
the feature extraction module, which extracts the segment-level features of each sample using the pre-trained I3D model;
the motion excitation module, which models and enhances the motion information of the segment-level features, benefiting subsequent video classification and domain adaptation;
the feature fusion module, which fuses the 6 segment-level features of each sample into a video-level feature;
the intermediate domain weighting module, which examines the features extracted by the feature extraction module to find the intermediate-domain samples that are domain-invariant, and assigns different weights to the samples; on the one hand, this makes the network learn domain-invariant features more strongly and improves its transfer ability, and on the other hand, it makes domain contrast learning more efficient and reduces domain shift more effectively;
the classification module, which processes the features weighted by the intermediate domain weighting module to obtain the classification loss of the source domain samples and the pseudo-labels of the target domain samples, and outputs the final classification result for target domain test samples in the test stage;
and the domain contrast module, which processes the features weighted by the intermediate domain weighting module and performs contrastive learning between the source domain and the target domain, pushing apart the feature distributions of samples of different classes and pulling together the feature distributions of samples of the same class from different domains, thereby reducing the domain shift of the features extracted by the network.
6. A video motion recognition device, the device comprising:
a memory for storing an executable computer program;
a processor for implementing the method of any one of claims 1 to 4 when executing an executable computer program stored in the memory.
7. A computer-readable storage medium, in which a computer program is stored which, when being executed by a processor, carries out the method of any one of claims 1 to 4.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202211171602.4A CN115439791A (en) | 2022-09-26 | 2022-09-26 | Cross-domain video action recognition method, device, equipment and computer-readable storage medium |
Applications Claiming Priority (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202211171602.4A CN115439791A (en) | 2022-09-26 | 2022-09-26 | Cross-domain video action recognition method, device, equipment and computer-readable storage medium |
Publications (1)
Publication Number | Publication Date |
---|---|
CN115439791A true CN115439791A (en) | 2022-12-06 |
Family
ID=84249846
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN202211171602.4A Pending CN115439791A (en) | 2022-09-26 | 2022-09-26 | Cross-domain video action recognition method, device, equipment and computer-readable storage medium |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN115439791A (en) |
Cited By (2)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN115861901A (en) * | 2022-12-30 | 2023-03-28 | 深圳大学 | Video classification method, device, equipment and storage medium |
CN115861901B (en) * | 2022-12-30 | 2023-06-30 | 深圳大学 | Video classification method, device, equipment and storage medium |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
PB01 | Publication | ||
SE01 | Entry into force of request for substantive examination | ||