CN116721458A - Cross-modal time sequence contrast learning-based self-supervision action recognition method - Google Patents

Cross-modal time sequence contrast learning-based self-supervision action recognition method Download PDF

Info

Publication number
CN116721458A
CN116721458A (application CN202310490527.6A)
Authority
CN
China
Prior art keywords
learning
self
cross
data
time sequence
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202310490527.6A
Other languages
Chinese (zh)
Inventor
徐增敏
王露露
蒙儒省
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Guilin Anview Technology Co ltd
Guilin University of Electronic Technology
Original Assignee
Guilin Anview Technology Co ltd
Guilin University of Electronic Technology
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Guilin Anview Technology Co ltd, Guilin University of Electronic Technology filed Critical Guilin Anview Technology Co ltd
Priority to CN202310490527.6A priority Critical patent/CN116721458A/en
Publication of CN116721458A publication Critical patent/CN116721458A/en
Pending legal-status Critical Current

Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06VIMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V40/00Recognition of biometric, human-related or animal-related patterns in image or video data
    • G06V40/20Movements or behaviour, e.g. gesture recognition
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06VIMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V10/00Arrangements for image or video recognition or understanding
    • G06V10/70Arrangements for image or video recognition or understanding using pattern recognition or machine learning
    • G06V10/764Arrangements for image or video recognition or understanding using pattern recognition or machine learning using classification, e.g. of video objects
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06VIMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V10/00Arrangements for image or video recognition or understanding
    • G06V10/70Arrangements for image or video recognition or understanding using pattern recognition or machine learning
    • G06V10/77Processing image or video features in feature spaces; using data integration or data reduction, e.g. principal component analysis [PCA] or independent component analysis [ICA] or self-organising maps [SOM]; Blind source separation
    • G06V10/774Generating sets of training patterns; Bootstrap methods, e.g. bagging or boosting
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06VIMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V10/00Arrangements for image or video recognition or understanding
    • G06V10/70Arrangements for image or video recognition or understanding using pattern recognition or machine learning
    • G06V10/82Arrangements for image or video recognition or understanding using pattern recognition or machine learning using neural networks
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06VIMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V40/00Recognition of biometric, human-related or animal-related patterns in image or video data
    • G06V40/70Multimodal biometrics, e.g. combining information from different biometric modalities
    • YGENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
    • Y02TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
    • Y02TCLIMATE CHANGE MITIGATION TECHNOLOGIES RELATED TO TRANSPORTATION
    • Y02T10/00Road transport of goods or passengers
    • Y02T10/10Internal combustion engine [ICE] based vehicles
    • Y02T10/40Engine management systems

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Multimedia (AREA)
  • Physics & Mathematics (AREA)
  • General Physics & Mathematics (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • General Health & Medical Sciences (AREA)
  • Evolutionary Computation (AREA)
  • Health & Medical Sciences (AREA)
  • Computing Systems (AREA)
  • Artificial Intelligence (AREA)
  • Databases & Information Systems (AREA)
  • Medical Informatics (AREA)
  • Software Systems (AREA)
  • Human Computer Interaction (AREA)
  • Social Psychology (AREA)
  • Psychiatry (AREA)
  • Image Analysis (AREA)

Abstract

The invention relates to the technical field of video processing, and in particular to a self-supervision action recognition method based on cross-modal time sequence contrast learning. The method generates RGB frames and optical flow data from unlabeled video samples, obtains different augmented views of the input data by adopting different data augmentation methods, obtains the encoded features of the different views through an encoder, and establishes a dual-path self-supervision action recognition network; it captures the global semantic dependency of the video sequence through contrastive learning with instance discrimination and models the temporal motion characteristics between local segments through local time sequence contrast learning tasks, obtaining, after initialization training, a preliminary model in which the two branches are trained independently; collaborative training is then performed based on the cross-modal self-supervision action recognition and local time sequence contrast learning method to obtain a final model; finally, the parameters of the final model are fine-tuned with labeled data and the effect is evaluated with test data to obtain the recognition performance of the model.

Description

Cross-modal time sequence contrast learning-based self-supervision action recognition method
Technical Field
The invention relates to the technical field of video processing, and in particular to a self-supervision action recognition method based on cross-modal time sequence contrast learning.
Background
With the rapid development of internet technology, multimedia data are generated continuously, and the enormous volume of video data poses great challenges to data analysis and understanding. Computer vision technology can rapidly capture effective information from such data in real time, enabling tasks such as target detection, tracking and behavior recognition in videos. Human action recognition is a research hotspot in the field of computer vision; it determines human behavior mainly by analyzing the correlation and visual appearance characteristics of image frames in a video sequence, and therefore has broad application prospects in fields such as intelligent monitoring, intelligent security and video retrieval.
Currently, video representation methods based on supervised learning achieve excellent performance in many visual recognition tasks, but they rely on large amounts of manually labeled training data, and acquiring these datasets requires significant labor and time costs. Self-supervised learning has attracted extensive attention from researchers because of its efficient use of data and its generalization ability. Self-supervised learning mainly uses pretext tasks to mine supervisory information from unlabeled data, learns a generalizable representation through network training, and transfers that representation to downstream tasks to improve model performance. Among existing self-supervised learning algorithms, contrastive learning methods based on instance discrimination have been studied by many researchers. The core idea of contrastive learning is to construct a representation by encoding the similarity or dissimilarity of two things. Although instance-level contrastive learning alone achieves remarkable results in many video understanding tasks, it encourages the model to learn similar features within the same instance, ignores the temporal variation characteristics of the video, and extracts relatively coarse clip-level features, so it is severely limited in some fine-grained scenarios. Early contrastive learning methods also largely ignored cross-modal semantic information interaction, and a single modality, because of its semantic limitations, is unfavorable to the clustering effect of the model.
Existing patents related to action recognition fall into the following two areas:
The field of supervised action recognition: the 28th Research Institute of China Electronics Technology Group Corporation disclosed in 2023 the invention patent "a video target behavior recognition method based on a spatial grid", which analyzes the behavior of targets in video frames through target detection and motion detection, combined with spatial grid positioning and the conditions around the grid; an invention patent "a human behavior recognition method with a dual-channel hybrid graph convolutional network", disclosed by a university in 2023, builds a feature graph and an individual feature graph for the action data on the basis of a graph convolutional network and improves behavior recognition performance through a dual-channel hybrid graph convolutional network that integrates feature similarity and individual characteristics; the invention patent "a real-time behavior recognition method based on a temporal attention mechanism and a two-stream network", disclosed by Xi'an Jiaotong University in 2023, improves recognition accuracy by extracting video features at different frame rates and applying temporal attention weighting to the images that contribute more to the network; and an invention patent "a video behavior recognition method based on an attention mechanism" distinguishes, at a fine granularity, the importance of different channels of frame-level features, so that key information in the video feature representation is retained more fully.
The field of self-supervised action recognition: the invention patent "a method and system for contrastive self-supervised human behavior recognition based on spatio-temporal information aggregation", disclosed by the Shenzhen research institute of Peking University in 2022, effectively aggregates the spatio-temporal information of videos by performing intra-data fusion and inter-data voting on skeleton action sequences, motion information and bone information, obtaining a more reliable representation; the invention patent "a human action recognition method and system based on a graph neural network", disclosed by Hefei University of Technology in 2022, uses down-sampling with skip connections and corresponding up-sampling layers to realize 2D feature extraction and joint-point recognition, and feeds the resulting 2D joint information into a graph neural network to improve 3D action recognition; the invention patent "an action recognition method and electronic device based on multi-task self-supervised learning", published by a university in Beijing in 2022, improves the accuracy of action recognition by designing several self-supervised tasks including motion prediction, jigsaw solving and contrastive learning; and the Shandong provincial artificial intelligence research institute disclosed in 2023 an unsupervised cross-domain video action recognition method based on multi-discriminator collaboration and a strong-weak sharing mechanism.
Disclosure of Invention
The invention aims to provide a self-supervision action recognition method based on cross-modal time sequence contrast learning, so as to solve the problem that existing human action recognition methods obtain little human action information from unlabeled data.
In order to achieve the above purpose, the present invention provides a self-supervision action recognition method based on cross-modal time sequence contrast learning, comprising the following steps:
step 1: generating RGB frames and optical flow data from unlabeled video samples;
step 2: obtaining different augmented views of the input data by adopting different data augmentation methods;
step 3: obtaining the encoded features of the different views through an encoder, and establishing a dual-path self-supervision action recognition network;
step 4: capturing the global semantic dependency of the video sequence through contrastive learning with instance discrimination;
step 5: modeling the temporal motion characteristics between local segments through local time sequence contrast learning tasks;
step 6: performing initialization training on the dual-path self-supervision action recognition network by combining step 4 and step 5 to obtain a preliminary model in which the two branches are trained independently;
step 7: performing information interaction between the modalities through a cross-modal consistency mining method;
step 8: collaboratively training the preliminary models of the two modalities based on the cross-modal self-supervision action recognition and local time sequence contrast learning method to obtain a final model;
step 9: fine-tuning the parameters of the final model with labeled data, and evaluating the model effect with test data to obtain the recognition performance of the model.
Preferably, the process of generating RGB frames and optical flow data from unlabeled video samples is specifically to extract a frame-level video sequence from the unlabeled video samples and to extract the corresponding optical flow maps from the frame sequence using the unsupervised TV-L1 algorithm.
Preferably, obtaining different augmented views of the input data by adopting different data augmentation methods specifically means performing random temporal cropping and sampling on the input data and augmenting the RGB frame and optical flow sampled segments through random cropping, horizontal flipping, Gaussian blur and color jittering data augmentation strategies, so that different augmented views of the same instance are obtained as positive sample pairs.
Preferably, in the process of obtaining the encoded features of different views through an encoder and establishing a dual-path self-supervision action recognition network, the encoder is the deep convolutional neural network S3D, and the dual-path self-supervision action recognition network consists of an RGB branch and an optical flow branch; for each branch, the augmented sample data are respectively input into the encoder to obtain visual representations of the features, and the features are then projected into a low-dimensional embedding space through an MLP layer.
Preferably, in the process of capturing the global semantic dependency of the video sequence through contrastive learning with instance discrimination, instance discrimination is performed on the output features of the dual-path self-supervision action recognition network by maximizing the semantic consistency of different views of the same video, thereby capturing the global semantic dependency of the video.
Preferably, the implementation process of modeling the time sequence motion characteristics among the local fragments through the local time sequence contrast learning task is as follows:
the feature based on global contrast learning is difficult to model local motion and time sequence change of video actions, and the model feature expression capability is poor in a fine-grained scene. In order to fully capture the difference between the same video actions and enable the model to learn the time sequence information between frames, a local time sequence comparison learning module is designed and mainly comprises two comparison learning tasks: a local comparison task learns the similarity between different local fragments of the same video, distinguishes the characteristics from different examples, and increases the fine granularity of characterization; and (3) a local time sequence comparison task, namely learning the distinction between non-overlapping local fragments of the same video, and increasing the time sequence of characterization.
Preferably, the obtaining process of the preliminary model of the two-branch independent training specifically includes respectively independently training the RGB and optical flow networks, and performing iterative optimization on the model through example comparison learning and local time sequence comparison learning on the same example to obtain the preliminary training model of the two modes.
Preferably, the implementation process of information interaction among multiple modes through a cross-mode consistency mining method comprises the following steps:
the cross-mode consistency mining method is to collect positive samples for one network to perform information interaction among modes, and specifically comprises the following steps: inputting the optical flow sample into an encoder to extract features; and (3) comparing the similarity between the characteristic and other characteristics in a storage library, and selecting the first k similar examples in the optical flow embedding space as positive samples of an RGB network, wherein similarly, the RGB network can also be used for selecting the positive samples for the optical flow network.
Preferably, in the process of collaboratively training the preliminary models of the two modalities based on the cross-modal self-supervision action recognition and local time sequence contrast learning method to obtain the final model, the preliminary training models of the two modalities are trained alternately, and intra-modal data association and inter-modal semantic collaborative interaction are realized by combining the cross-modal global consistency mining network with the local time sequence contrast learning method.
Preferably, the parameters of the final model are subjected to fine tuning training by using the tagged data, the model effect is evaluated by using the test data, and the recognition performance of the model is obtained.
The invention provides a self-supervision action recognition method based on cross-modal time sequence contrast learning, which generates RGB frames and optical flow data from unlabeled video samples, obtains different augmented views of the input data by adopting different data augmentation methods, obtains the encoded features of the different views through an encoder, and establishes a dual-path self-supervision action recognition network; it further captures the global semantic dependency of the video sequence through contrastive learning with instance discrimination, models the temporal motion characteristics between local segments through local time sequence contrast learning tasks, and performs initialization training on the network to obtain a preliminary model in which the two branches are trained independently; information interaction between the modalities is then performed through a cross-modal consistency mining method, and the preliminary models of the two modalities are collaboratively trained based on the cross-modal self-supervision action recognition and local time sequence contrast learning method to obtain a final model; finally, the parameters of the final model are fine-tuned with labeled data, and the model effect is evaluated with test data to obtain the recognition performance of the model. Through intra-modal data association and inter-modal semantic collaborative interaction, the invention obtains related information deeper than that of the classification task, thereby improving the accuracy of human action recognition.
Drawings
In order to more clearly illustrate the embodiments of the invention or the technical solutions in the prior art, the drawings that are required in the embodiments or the description of the prior art will be briefly described, it being obvious that the drawings in the following description are only some embodiments of the invention, and that other drawings may be obtained according to these drawings without inventive effort for a person skilled in the art.
FIG. 1 is a flow chart of the self-supervision action recognition method based on cross-modal time sequence contrast learning.
Detailed Description
Embodiments of the present invention are described in detail below, examples of which are illustrated in the accompanying drawings, wherein like or similar reference numerals refer to like or similar elements or elements having like or similar functions throughout. The embodiments described below by referring to the drawings are illustrative and intended to explain the present invention and should not be construed as limiting the invention.
Referring to FIG. 1, the present invention provides a self-supervision action recognition method based on cross-modal time sequence contrast learning, which is further described below with reference to the specific steps:
s1: generating RGB frames and optical flow data from unlabeled video samples;
specifically, given a training data set containing N video instances, set to v= { V 1 ,v 2 ,…,v N Extracting a frame-level video sequence, and extracting a corresponding light flow map m= { M from the frame sequence using an unsupervised TV-LI algorithm 1 ,m 2 ,…,m N }。
S2: different amplification views of the input data are obtained by adopting different data enhancement methods;
specifically, from a video sequence v i Randomly sampling a segment, amplifying the input segment by data enhancement strategies such as random cutting, horizontal overturning, gaussian blur, color dithering and the like, and obtaining different enhancement views of the same instance as positive sample pairsThe rest video clips are negative samples N - V, i.e j ∈N - J+.i, and stored in a queue in a memory bank, an expanded view of the optical flow data can be obtained similarly.
S3: the coding characteristics of different views are obtained through the coder, and a dual-path self-supervision action recognition network is established;
specifically, the feature encoder is a deep convolutional neural network S3D, the dual-path self-supervision action recognition network is an RGB and optical flow two-branch, and for each branch, a sample after data enhancement is input into the feature encoder f q And f k A visual representation of the features is obtained, through the MLP layer g(.) project features into a low-dimensional embedding space. For example, for RGB branching, viewsIs expressed as +.>The embedding characteristic of the positive sample is +.>The embedding characteristic of the negative sample is->
S4: capturing global semantic dependence of a video sequence through comparative learning of instance discrimination;
specifically, for the output characteristics of the dual-path self-supervision action recognition network, the instance discrimination is performed by maximizing the semantic consistency of different views of the same video, and the following loss functions need to be minimized to constrain the model:
wherein z.z + Is the dot product between the two vectors, τ is the temperature coefficient, and the example contrast loss for the optical flow branches can be similarly defined. Through instance contrast learning, the model learning draws similar instances closer in the embedding space, pushing different instances farther.
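A minimal sketch of this InfoNCE objective is given below (MoCo-style, with negatives drawn from the memory-bank queue); the tensor shapes and the default temperature value are assumptions.

```python
import torch
import torch.nn.functional as F

def info_nce(z, z_pos, z_neg, tau=0.07):
    """z: (B, D) query embeddings, z_pos: (B, D) positive keys,
    z_neg: (K, D) negatives from the memory bank; all L2-normalized."""
    pos = (z * z_pos).sum(dim=1, keepdim=True) / tau   # (B, 1) dot products
    neg = z @ z_neg.t() / tau                           # (B, K)
    logits = torch.cat([pos, neg], dim=1)
    # The positive sits at index 0 of every row, so the loss is cross-entropy
    # against an all-zero target.
    target = torch.zeros(z.size(0), dtype=torch.long, device=z.device)
    return F.cross_entropy(logits, target)
```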
S5: the generalization of the characterization is enhanced through the local time sequence contrast learning task, and the two branches are independently trained to obtain a preliminary model;
specifically, the difference between the same video actions is captured by the local time sequence comparison learning module, so that the model learns the time sequence information between frames, and 4 non-overlapping local fragments { s ] are randomly sampled from a video sequence 1 ,s 2 ,s 3 ,s 4 S in local contrast learning 1 Is a sample, { s 1 a ,s 2 ,s 3 ,s 4 The positive sample of the sample is considered, and the local contrast loss between fragments is:
wherein z is i For sample s i Is characterized by (i=1, …, 4), a is data enhancement.
The fine granularity of characterization is increased through local contrast learning, and in order to further learn time sequence change information in the video, the local time sequence contrast learning is designed to make s 1 S as input samples 1 a Considered as a positive sample of the sample, { s 2 ,s 3 ,s 4 Regarded as negative samples, the local time-sequence contrast loss between fragments is
The loss of the local timing comparison module is:
L LTCL =L local +L LT
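The following sketch mirrors these two losses for a single video; the per-anchor multi-positive averaging and the source of the other-video negatives (a memory bank) are assumptions consistent with the definitions above.

```python
import torch

def _nce(anchor, positives, negatives, tau=0.07):
    """InfoNCE averaged over a set of positives for one anchor embedding (D,)."""
    neg = torch.exp(anchor @ negatives.t() / tau).sum()
    losses = []
    for p in positives:
        pos = torch.exp(anchor @ p / tau)
        losses.append(-torch.log(pos / (pos + neg)))
    return torch.stack(losses).mean()

def local_temporal_contrast(z1, z1a, z2, z3, z4, bank_neg, tau=0.07):
    """z*: (D,) segment embeddings of one video; bank_neg: (K, D) embeddings of
    segments from other videos. Returns L_LTCL = L_local + L_LT."""
    l_local = _nce(z1, [z1a, z2, z3, z4], bank_neg, tau)    # similarity task
    l_lt = _nce(z1, [z1a], torch.stack([z2, z3, z4]), tau)  # temporal task
    return l_local + l_lt
```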
s6: performing constraint training on the network by combining the step 4 and the step 5 to obtain a preliminary model with two branches trained independently;
specifically, iterative optimization is performed on the model through example contrast learning and local time sequence contrast learning on the same example, so as to obtain a preliminary training model of two modes, and the following loss functions are required to be minimized in the initial training stage:
L 1 =αL InfoNCE +βL LTMC
where α and β are coefficients of balance loss, setting α=β=1 makes the algorithm more general.
S7: information interaction among multiple modes is carried out through a cross-mode consistency mining method;
specifically, the cross-modal consistency mining method uses one network to collect positive samples for the other network to perform information interaction between modes, for example, uses the following stepsThe optical flow network collects positive samples for the RGB network, and the optical flow samples { m } in batches are needed to be collected first 1 ,m 2 ,…,m B Input to encoder to extract featuresB is the batch size, similarity comparison is carried out on the characteristics and other characteristics in a storage library, k neighbor samples in an optical flow embedded space are selected to be used as supplements of RGB network positive samples, and at the moment, the positive sample set of the RGB network is a sample video sequence v i Data enhancement of (c) plus v i The first k nearest neighbors in the optical flow feature space, positive sample set P 1i The expression is as follows:
wherein, the liquid crystal display device comprises a liquid crystal display device,for RGB sample v i Data enhancement view of->The similarity between the ith and jth videos in the optical flow view, topK (·) refers to selecting k most similar sample features from N samples, and outputting a sample index value, and similarly, an RGB network may also be used to select positive samples for the optical flow network.
In the cross-modal consistency mining module, the loss function of the auxiliary RGB network using the optical flow network is as follows:
wherein, the liquid crystal display device comprises a liquid crystal display device,for video sample v i Is characterized by (1)>For the positive sample feature of the RGB view, < > for>Is a negative example.
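A sketch of this mining step and of the flow-assisted RGB loss L_RF is given below; the use of cosine similarity over L2-normalized embeddings, the value of k, and the exact composition of the denominator are assumptions.

```python
import torch

def mine_cross_modal_positives(flow_batch_z, flow_bank, k=5):
    """flow_batch_z: (B, D) flow embeddings of the current batch,
    flow_bank: (N, D) flow memory bank; both L2-normalized.
    Returns (B, k) indices of the top-k nearest neighbours in the flow
    embedding space, used as extra positives for the RGB network."""
    sim = flow_batch_z @ flow_bank.t()        # cosine similarity, (B, N)
    _, topk_idx = sim.topk(k, dim=1)
    return topk_idx

def flow_assisted_rgb_loss(z_rgb, rgb_bank, pos_idx, tau=0.07):
    """Multi-positive contrastive loss: each RGB query treats the RGB-bank
    entries indexed by its flow-mined neighbours as positives (the instance's
    own augmented view is handled as in the InfoNCE sketch above)."""
    sim = torch.exp(z_rgb @ rgb_bank.t() / tau)   # (B, N)
    denom = sim.sum(dim=1, keepdim=True)          # (B, 1)
    pos = torch.gather(sim, 1, pos_idx)           # (B, k)
    return (-torch.log(pos / denom)).mean()
```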
S8: based on a cross-mode self-supervision action recognition and local time sequence contrast learning method, the preliminary models of the two modes are cooperatively trained to obtain a final model;
specifically, the two branch preliminary networks are trained alternately, and the inter-mode data association and inter-mode semantic collaborative interaction are realized by combining a cross-mode global consistency mining network and a local time sequence contrast learning method, so that a final model is obtained, and the following loss functions are required to be minimized in the model training process:
L 2 =αL RF +βL LTMC
where α and β are coefficients of balance loss, setting α=β=1 makes the algorithm more general.
S9: and performing fine tuning training on parameters of the final model by using the labeled data, and evaluating the model effect by using the test data to obtain the recognition performance of the model.
Specifically, the performance of the model is verified through fine-tuning on the action recognition task: a linear classifier (i.e., a fully connected layer and a softmax layer) is added after the self-supervised pre-trained encoder, the whole model is then trained on the action recognition task in a supervised manner, and the trained model is evaluated with test data to obtain the recognition performance of the model.
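A sketch of this fine-tuning and evaluation protocol follows; the feature dimension, class count, optimizer and schedule are assumptions, and cross-entropy is used because it applies the softmax internally.

```python
import torch
import torch.nn as nn

def build_finetune_model(pretrained_encoder, feat_dim=1024, num_classes=101):
    """Attach a linear classification head to the pre-trained encoder
    (the projection head used during pre-training is discarded)."""
    return nn.Sequential(pretrained_encoder, nn.Linear(feat_dim, num_classes))

def finetune_and_evaluate(model, train_loader, test_loader, epochs=30, lr=1e-3):
    opt = torch.optim.SGD(model.parameters(), lr=lr, momentum=0.9)
    ce = nn.CrossEntropyLoss()        # log-softmax + NLL, i.e. a softmax classifier
    for _ in range(epochs):
        model.train()
        for clips, labels in train_loader:
            opt.zero_grad()
            loss = ce(model(clips), labels)
            loss.backward()
            opt.step()
    # Evaluation: top-1 accuracy on the held-out test data.
    model.eval()
    correct = total = 0
    with torch.no_grad():
        for clips, labels in test_loader:
            pred = model(clips).argmax(dim=1)
            correct += (pred == labels).sum().item()
            total += labels.numel()
    return correct / max(total, 1)
```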
Deep neural networks based on supervised learning have been applied successfully to various computer vision tasks, but they must be trained on labeled datasets, so model performance depends to a certain extent on the quantity and quality of the annotations, and annotating such data manually consumes substantial resources and is costly. Compared with supervised deep learning methods, self-supervised learning obtains generalizable representations by supervising itself with large amounts of unlabeled data and transferring them to downstream tasks; it can therefore effectively improve the performance of human behavior recognition models while avoiding the need for large amounts of labeled data.
Although self-supervised deep learning methods have achieved significant results in the image field, their development in the video field has been relatively slow. On the one hand, video is more difficult to process than images: when processing images, only information in the spatial dimension generally needs to be considered, whereas to analyze the content of a video accurately, information in both the spatial and the temporal dimension must be taken into account. On the other hand, applying self-supervised deep learning methods to the video field requires substantial computational resources.
In recent years, contrastive learning methods have attracted renewed research interest; they do not rely on labeled data and mainly use the data itself as supervisory information to learn more valuable feature representations.
The self-supervision action recognition method based on cross-modal time sequence contrast learning of the invention generates RGB frames and optical flow data from unlabeled video sample data; based on a dual-path time sequence contrast learning framework, the two modalities of data, RGB frames and optical flow, are trained independently to obtain preliminary models; the initialized models are then trained alternately, with one network collecting positive samples for the other network to perform cross-modal information interaction, yielding a self-supervised video representation model; the model effect is evaluated with test data on the action recognition task to obtain the recognition performance of the model. By using contrastive learning to build a self-supervised model, the invention mines more task-relevant and valuable information from multi-modal video data, thereby solving the problem that existing human behavior recognition methods obtain little human behavior information from unlabeled data.
The above disclosure is only a preferred embodiment of the present invention, and it should be understood that the scope of the invention is not limited thereto, and those skilled in the art will appreciate that all or part of the procedures described above can be performed according to the equivalent changes of the claims, and still fall within the scope of the present invention.

Claims (10)

1. A self-supervision action recognition method based on cross-modal time sequence contrast learning is characterized by comprising the following steps:
step 1: generating RGB frames and optical flow data from unlabeled video samples;
step 2: different amplification views of the input data are obtained by adopting different data enhancement methods;
step 3: the coding characteristics of different views are obtained through the coder, and a dual-path self-supervision action recognition network is established;
step 4: capturing global semantic dependence of a video sequence through comparative learning of instance discrimination;
step 5: modeling time sequence motion characteristics among the local fragments through local time sequence comparison learning tasks;
step 6: initializing the dual-path self-supervision action recognition network by combining the step 4 and the step 5 to obtain a primary model with two branches independently trained;
step 7: information interaction among multiple modes is carried out through a cross-mode consistency mining method;
step 8: based on a cross-mode self-supervision action recognition and local time sequence contrast learning method, the preliminary models of the two modes are cooperatively trained to obtain a final model;
step 9: and performing fine tuning training on parameters of the final model by using the labeled data, and evaluating the model effect by using the test data to obtain the recognition performance of the model.
2. The self-supervision action recognition method based on cross-modal time sequence contrast learning of claim 1,
the process of generating RGB frames and optical flow data from unlabeled video samples is specifically to extract a frame-level video sequence from the unlabeled video samples and to extract the corresponding optical flow maps from the frame sequence using the unsupervised TV-L1 algorithm.
3. The self-supervision action recognition method based on cross-modal time sequence contrast learning of claim 2,
the process of obtaining different amplified views of input data by adopting different data enhancement methods comprises the steps of carrying out random time clipping sampling on the input data, amplifying RGB frames and optical flow sampling fragments by random clipping, horizontal overturning, gaussian blur and color dithering data enhancement strategies, and obtaining different amplified views of the same instance as positive sample pairs.
4. The self-supervision action recognition method based on cross-modal time sequence contrast learning of claim 3,
in the process of obtaining coding features of different views through an encoder and establishing a dual-path self-supervision action recognition network, the encoder is a deep convolutional neural network S3D, the dual-path self-supervision action recognition network is an RGB branch and an optical flow branch, for each branch, sample data after data enhancement is respectively input into the encoder to obtain visual representation of the features, and then the features are projected into a low-dimensional embedding space through an MLP layer.
5. The self-supervision action recognition method based on cross-modal time sequence contrast learning of claim 4, wherein
in the process of capturing global semantic dependency of a video sequence through comparison learning of instance discrimination, the instance discrimination is carried out by maximizing semantic consistency of different views of the same video for the output characteristics of the dual-path self-supervision action recognition network, and the global semantic dependency of the video is captured.
6. The self-supervision action recognition method based on cross-modal time sequence contrast learning of claim 5,
In modeling the temporal motion characteristics between local segments through local time sequence contrast learning tasks, the local time sequence contrast learning tasks include two contrastive learning tasks: a local contrast task, which learns the similarity between different local segments of the same video, distinguishes them from the features of other instances, and increases the fine granularity of the representation; and a local time sequence contrast task, which learns the distinction between non-overlapping local segments of the same video and increases the temporal sensitivity of the representation.
7. The self-supervision action recognition method based on cross-modal time sequence contrast learning of claim 6, wherein
the obtaining process of the preliminary model of the two-branch independent training is specifically to independently train RGB and optical flow networks respectively, and the preliminary training models of two modes are obtained through iterative optimization of example comparison learning and local time sequence comparison learning on the same example.
8. The self-supervision action recognition method based on cross-modal time sequence contrast learning of claim 7,
In the process of performing information interaction between the modalities through the cross-modal consistency mining method, the optical flow samples are input into the encoder to extract features; the extracted features are compared for similarity against the other features in a memory bank, and the top-k most similar instances in the optical flow embedding space are selected as positive samples for the RGB network; a similar procedure is used when the RGB network is employed to select positive samples for the optical flow network.
9. The self-supervision action recognition method based on cross-modal time sequence contrast learning of claim 8,
In the process of collaboratively training the preliminary models of the two modalities based on the cross-modal self-supervision action recognition and local time sequence contrast learning method to obtain the final model, the preliminary training models of the two modalities are trained alternately, and intra-modal data association and inter-modal semantic collaborative interaction are realized by combining the cross-modal global consistency mining network with the local time sequence contrast learning method.
10. The self-supervision action recognition method based on cross-modal time sequence contrast learning of claim 9,
the method comprises the steps of performing fine tuning training on parameters of a final model by using tagged data, evaluating the model effect by using test data, and obtaining the recognition performance of the model.
CN202310490527.6A 2023-05-04 2023-05-04 Cross-modal time sequence contrast learning-based self-supervision action recognition method Pending CN116721458A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202310490527.6A CN116721458A (en) 2023-05-04 2023-05-04 Cross-modal time sequence contrast learning-based self-supervision action recognition method

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202310490527.6A CN116721458A (en) 2023-05-04 2023-05-04 Cross-modal time sequence contrast learning-based self-supervision action recognition method

Publications (1)

Publication Number Publication Date
CN116721458A true CN116721458A (en) 2023-09-08

Family

ID=87863852

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202310490527.6A Pending CN116721458A (en) 2023-05-04 2023-05-04 Cross-modal time sequence contrast learning-based self-supervision action recognition method

Country Status (1)

Country Link
CN (1) CN116721458A (en)

Cited By (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN117612071A (en) * 2024-01-23 2024-02-27 中国科学技术大学 Video action recognition method based on transfer learning
CN117612071B (en) * 2024-01-23 2024-04-19 中国科学技术大学 Video action recognition method based on transfer learning

Similar Documents

Publication Publication Date Title
CN111259786B (en) Pedestrian re-identification method based on synchronous enhancement of appearance and motion information of video
CN113936339B (en) Fighting identification method and device based on double-channel cross attention mechanism
CN109977773B (en) Human behavior identification method and system based on multi-target detection 3D CNN
Wang et al. Tree leaves detection based on deep learning
CN111639544A (en) Expression recognition method based on multi-branch cross-connection convolutional neural network
CN109977893B (en) Deep multitask pedestrian re-identification method based on hierarchical saliency channel learning
CN112819065B (en) Unsupervised pedestrian sample mining method and unsupervised pedestrian sample mining system based on multi-clustering information
CN111582095B (en) Light-weight rapid detection method for abnormal behaviors of pedestrians
CN112861575A (en) Pedestrian structuring method, device, equipment and storage medium
CN104700100A (en) Feature extraction method for high spatial resolution remote sensing big data
CN116721458A (en) Cross-modal time sequence contrast learning-based self-supervision action recognition method
Wang et al. Mutuality-oriented reconstruction and prediction hybrid network for video anomaly detection
CN108491751B (en) Complex action identification method for exploring privilege information based on simple action
CN116704433A (en) Self-supervision group behavior recognition method based on context-aware relationship predictive coding
Guo et al. Varied channels region proposal and classification network for wildlife image classification under complex environment
CN116229580A (en) Pedestrian re-identification method based on multi-granularity pyramid intersection network
CN115797827A (en) ViT human body behavior identification method based on double-current network architecture
CN115588217A (en) Face attribute detection method based on deep self-attention network
CN114529894A (en) Rapid scene text detection method fusing hole convolution
Jia et al. Dynamic thresholding for video anomaly detection
Qi et al. Class-Aware Dual-Supervised Aggregation Network for Video Object Detection
Peng et al. Pedestrian motion recognition via Conv‐VLAD integrated spatial‐temporal‐relational network
CN112016540B (en) Behavior identification method based on static image
Nan et al. 3D RES-inception network transfer learning for multiple label crowd behavior recognition
CN111046869B (en) Salient region extraction method and system based on deep learning

Legal Events

Date Code Title Description
PB01 Publication
PB01 Publication
SE01 Entry into force of request for substantive examination
SE01 Entry into force of request for substantive examination