CN117612071A - Video action recognition method based on transfer learning - Google Patents

Video action recognition method based on transfer learning

Info

Publication number
CN117612071A
Authority
CN
China
Prior art keywords
video
text
implicit
video frame
explicit
Prior art date
Legal status
Granted
Application number
CN202410090020.6A
Other languages
Chinese (zh)
Other versions
CN117612071B (en)
Inventor
张信明
刘语西
陈思宏
Current Assignee
University of Science and Technology of China USTC
Shenzhen Tencent Computer Systems Co Ltd
Original Assignee
University of Science and Technology of China USTC
Shenzhen Tencent Computer Systems Co Ltd
Priority date
Filing date
Publication date
Application filed by University of Science and Technology of China USTC, Shenzhen Tencent Computer Systems Co Ltd filed Critical University of Science and Technology of China USTC
Priority to CN202410090020.6A priority Critical patent/CN117612071B/en
Publication of CN117612071A publication Critical patent/CN117612071A/en
Application granted granted Critical
Publication of CN117612071B publication Critical patent/CN117612071B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical

Links

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06V IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V20/00 Scenes; Scene-specific elements
    • G06V20/40 Scenes; Scene-specific elements in video content
    • G06V20/41 Higher-level, semantic clustering, classification or understanding of video scenes, e.g. detection, labelling or Markovian modelling of sport events or news items
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 Computing arrangements based on biological models
    • G06N3/02 Neural networks
    • G06N3/08 Learning methods
    • G06N3/096 Transfer learning
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06V IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V10/00 Arrangements for image or video recognition or understanding
    • G06V10/70 Arrangements for image or video recognition or understanding using pattern recognition or machine learning
    • G06V10/82 Arrangements for image or video recognition or understanding using pattern recognition or machine learning using neural networks
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06V IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V20/00 Scenes; Scene-specific elements
    • G06V20/40 Scenes; Scene-specific elements in video content
    • G06V20/46 Extracting features or characteristics from the video content, e.g. video fingerprints, representative shots or key frames
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06V IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V40/00 Recognition of biometric, human-related or animal-related patterns in image or video data
    • G06V40/20 Movements or behaviour, e.g. gesture recognition

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • General Physics & Mathematics (AREA)
  • Multimedia (AREA)
  • General Health & Medical Sciences (AREA)
  • Software Systems (AREA)
  • Evolutionary Computation (AREA)
  • Health & Medical Sciences (AREA)
  • Artificial Intelligence (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Computing Systems (AREA)
  • Computational Linguistics (AREA)
  • Databases & Information Systems (AREA)
  • Medical Informatics (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Biomedical Technology (AREA)
  • Biophysics (AREA)
  • Data Mining & Analysis (AREA)
  • Molecular Biology (AREA)
  • General Engineering & Computer Science (AREA)
  • Mathematical Physics (AREA)
  • Psychiatry (AREA)
  • Social Psychology (AREA)
  • Human Computer Interaction (AREA)
  • Image Analysis (AREA)

Abstract

The invention discloses a video action recognition method based on transfer learning, in which the network model is trained as follows. S1: construct a training set. S2: process the video text labels to obtain text features, and concatenate the text features as an initialized classification matrix. S3: extract frame images from the video and process them to obtain a video frame feature map. S4: input the video frame feature map into an implicit temporal modeling module to output an implicit video representation. S5: randomly shuffle the implicit video frame representation map by rows several times, input the resulting video frame feature sequences into an explicit temporal modeling module, output the correct explicit video representation, and connect the explicit and implicit video representations through a residual connection to obtain the representation of the whole video. S6: perform an inner product between the whole-video representation and the classification matrix, and compute prediction scores to obtain the prediction result. The video action recognition method improves the accuracy of video recognition and prediction.

Description

Video action recognition method based on transfer learning
Technical Field
The invention relates to the technical field of transfer learning, in particular to a video action recognition method based on transfer learning.
Background
Video understanding, and in particular action recognition, is one of the fundamental tasks of computer vision. It involves identifying and understanding the actions and gestures of humans or objects captured in image, video or sensor data. This task is widely used in applications including video surveillance, human-computer interaction, virtual reality, somatosensory gaming, autonomous driving, and the like.
Currently, mainstream video action recognition algorithms mainly adopt CNN-based or Transformer-based structures. Because CNNs are relatively mature in the image field, early action recognition frameworks directly applied 2D convolutions to video frames and introduced traditional algorithms such as optical flow and motion trajectories to supplement temporal information. Later, 3D convolutions appeared, which can simultaneously learn spatial features and temporal features between adjacent frames from video segments, and the resulting video features are used for classification.
In recent years, video models based on the Transformer structure have shown better results. The open-source model CLIP proposed by OpenAI is widely used in transfer learning from the image domain to the video domain due to its excellent generalization performance and image representation capability. Part of the work based on CLIP extracts the features of individual video frames with CLIP and then designs a temporal modeling module to extract the features of the whole video. Methods based on CLIP can be fine-tuned end to end, or parameter-efficient fine-tuning can be adopted, in which the CLIP parameters are frozen and only an adapter module with a small number of parameters is trained, improving training efficiency. However, on the one hand, video contains richer spatio-temporal information than images, and current methods use either one-hot labels or single-word labels for classification, which cannot fully describe the complex content of a video, so the video space and the text space are insufficiently aligned; on the other hand, in current transfer learning methods the temporal modeling stage mostly lets the model learn the temporal relationship of video frames through a learnable position encoding and self-attention, and such implicit temporal mining makes it difficult to establish an efficient temporal model for multi-content, long-duration videos.
Disclosure of Invention
Based on the technical problems described in the background above, the invention provides a video action recognition method based on transfer learning, which considers the two modalities of video text labels and video, and improves the accuracy of video recognition and prediction.
The video motion recognition method based on transfer learning inputs video information into a network model to output a prediction result;
the training process of the network model is as follows:
s1: constructing a training set, wherein the training set comprises a video and a video text label;
s2: expanding the video text label through the large language model, encoding the expanded video text label through the CLIP model to obtain text features, splicing the text features to serve as an initialized classification matrix, and freezing the classification matrix in the network model training process;
s3: extracting a frame image in the video, and encoding the enhanced frame image through a CLIP model to obtain a video frame feature map;
S4: build an implicit temporal modeling module that learns a position encoding, based on an encoder-only Transformer; input the video frame feature map into the implicit temporal modeling module and compute it through a self-attention mechanism to obtain the implicit video frame representation map; apply an average pooling operation to obtain the implicit video representation output by the implicit temporal modeling module;
S5: build an explicit temporal modeling module based on a cross-attention mechanism; randomly shuffle the implicit video frame representation map by rows S times to obtain S+1 video frame feature sequences; input these video frame feature sequences into the explicit temporal modeling module to obtain S+1 explicit video representations, which form 1 positive sample pair and S negative sample pairs with the label text description; align the text-video space using a contrastive learning paradigm and output the correct explicit video representation; connect the explicit video representation and the implicit video representation through a residual connection to obtain the representation of the whole video;
S6: will characterizeAnd performing inner product operation on the classification matrix, and calculating a prediction score to obtain a prediction result.
Further, step S2 specifically includes:
Taking the video text label set as the processing object, a large language model is used to expand each video text label into a detailed tag description;
The tag description is converted into a word vector by a tokenizer, with a length determined by the description string length and the text vector length;
The text encoder of the CLIP model encodes the word vector to obtain a text feature of a given feature dimension;
The text features of the tag descriptions of all label classes are merged into a classification matrix, in which each row is the text feature of one label class.
Further, step S3 specifically includes:
Uniformly sampling a number of frames from the video and performing data enhancement on the resulting video frame set, where each frame has a given height, width and number of channels;
Stacking the data-enhanced video frames along the channel dimension to obtain the input image of the network model, whose spatial size is the crop size used in data enhancement;
Encoding the input image with the image encoder of the CLIP model to obtain the video frame feature map of a given feature dimension.
Further, in step S4, the implicit temporal modeling module includes a position encoding and an encoder-only self-attention mechanism, in which the implicit video frame representation map is computed by the self-attention mechanism and the implicit video representation is obtained from it by average pooling.
Further, step S5 specifically includes:
Taking the order of the implicit video frame representation map as the correct-order video frame features;
Generating a random number sequence of the same length as the implicit video frame representation and using it to shuffle the video frame representation by slicing, repeating this S times to obtain S wrong-order video frame representations;
Using the video frame feature map as the Query and fixing the correct-order and wrong-order video frame features as Key and Value respectively, computing the Query, Key and Value with a masked cross-attention mechanism, and reducing the dimension of the computed features through a fully connected layer to obtain S+1 video representations attending to different orders;
Extracting a vector containing text information from the text features and adding this vector to the S+1 video representations to obtain the explicit output values;
Taking the inner product of the explicit output values and the text feature of the corresponding video label to obtain a score column vector, and inputting the score column vector into a temporal loss function for contrastive learning to obtain the correct explicit video representation;
Connecting the explicit video representation and the implicit video representation through a residual connection to obtain the representation of the whole video.
Further, in the temporal loss function, the label text feature of the current video frame is compared against the outputs of the explicit temporal modeling module by dot product: the output produced by the correct-order video frame features is the positive sample, the outputs produced by all orderings of the video frame representations are the candidates, S is the number of negative samples, i.e. the number of shuffles, and a temperature coefficient scales the dot products; the loss is the contrastive loss over these scaled scores.
Further, in step S6, a prediction loss function is constructed from the prediction scores using the cross-entropy loss, in which the prediction score of each sample is compared against that sample's label.
The video action recognition method based on transfer learning has the following advantages: the method considers the two modalities of video text labels and video. On the text side, to match the rich information contained in a video, a large language model is used to expand the original simple action and behavior labels into detailed descriptions, which improves the alignment of the text-video space during network model learning; meanwhile, the text descriptions are encoded with the pretrained text encoder of CLIP to form the classification matrix, which reduces the training parameters and shortens the training time. To address the implicit temporal mining problem of position-encoding-dependent temporal modeling in transfer learning, an explicit temporal modeling module is designed that interacts with the tag text descriptions, fully mining the information of the two modalities of text and video.
Drawings
FIG. 1 is a schematic flow chart of the present invention;
FIG. 2 is a flow chart of network model training;
FIG. 3 is a training flow diagram of an explicit timing modeling module.
Detailed Description
In the following detailed description of the present invention, numerous specific details are set forth in order to provide a thorough understanding of the present invention. The invention may be embodied in many other forms than described herein and similarly modified by those skilled in the art without departing from the spirit or scope of the invention, which is therefore not limited to the specific embodiments disclosed below.
As shown in fig. 1 to 3, the video motion recognition method based on the transfer learning provided by the invention inputs video information into a network model to output a prediction result.
The training process of the network model comprises the following steps S1 to S6.
S1: constructing a training set, wherein the training set comprises a video and a video text label;
S2: expanding the video text label through a large language model, encoding the expanded video text label through the CLIP model to obtain text features, splicing the text features to serve as an initialized classification matrix, and freezing the classification matrix during network model training, which specifically comprises steps S21 to S24;
S21: taking the video text label set as the processing object, a large language model is used to expand each video text label into a detailed tag description;
S22: the tag description is converted into a word vector by a tokenizer, with a length determined by the description string length and the text vector length;
S23: the text encoder of the CLIP model encodes the word vector to obtain a text feature of a given feature dimension;
wherein the text encoder of CLIP uses an encoder-decoder Transformer architecture, and by default CLIP operates on the text through its text encoder;
S24: the text features of the tag descriptions of all label classes are merged into a classification matrix, in which each row is the text feature of one label class.
Through steps S21 to S24, a video tag expansion process is designed using a large language model to generate a more complete description of the labeled actions. In other words, in this embodiment, on the text side, to match the rich information contained in a video, the large language model expands the original simple action and behavior labels into detailed descriptions, which improves the alignment of the text-video space during network model learning; meanwhile, the text descriptions are encoded with the pretrained text encoder of CLIP to form the classification matrix, which reduces the training parameters and shortens the training time.
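As an illustration of steps S21 to S24, the following minimal Python sketch uses PyTorch and OpenAI's open-source clip package; the expand_label function is a hypothetical stand-in for the large language model call, and the prompt wording, model choice ("ViT-B/16") and label strings are assumptions rather than values taken from this patent.

import torch
import clip  # OpenAI's open-source CLIP package

device = "cuda" if torch.cuda.is_available() else "cpu"
model, _ = clip.load("ViT-B/16", device=device)  # frozen text + image encoders

def expand_label(label: str) -> str:
    # Hypothetical placeholder for the LLM expansion of S21: in practice a large
    # language model is prompted to turn the short action label into a detailed description.
    return f"a video of a person {label}, showing the complete motion in detail"

@torch.no_grad()
def build_classification_matrix(labels):
    # S22: tokenize the expanded descriptions into word vectors.
    descriptions = [expand_label(y) for y in labels]
    tokens = clip.tokenize(descriptions).to(device)
    # S23: encode with the frozen CLIP text encoder to get one text feature per class.
    text_features = model.encode_text(tokens).float()
    text_features = text_features / text_features.norm(dim=-1, keepdim=True)
    # S24: the stacked rows form the K x D classification matrix, frozen during training.
    return text_features

W = build_classification_matrix(["riding a bike", "playing guitar", "high jump"])
print(W.shape)  # (3, D)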
S3: extracting a frame image in a video, and encoding the enhanced frame image through a CLIP model to obtain a video frame feature map, wherein the method specifically comprises steps S31 to S33;
S31: uniformly sample a number of frames from the video and perform data enhancement on the resulting video frame set, where each frame has a given height, width and number of channels;
For the visual side, video frame sampling is performed first, with different sampling strategies for the training set and the test set. For the training set, frames are uniformly sampled from the video, so that one video is represented by a fixed number of frames in the training phase. For the test set, the video is divided into several segments, each segment is uniformly sampled, and each frame is randomly cropped several times, so that one video is represented by a correspondingly larger number of frames.
Data enhancement is used, including but not limited to the following:
a1) Cropping: including random cropping, center cropping, etc.; the picture is cropped to a given size at a random position. a2) Random grayscaling: the picture is converted to grayscale with a certain probability. a3) Random erasing and flipping: part of the picture is cut out with a certain probability, and the picture is then flipped horizontally or vertically.
S32: stack the data-enhanced video frames along the channel dimension to obtain the input image of the network model, whose spatial size is the crop size used in data enhancement;
S33: encode the input image with the image encoder of the CLIP model to obtain the video frame feature map, whose rows correspond to the sampled frames and whose columns correspond to the feature dimension.
The picture encoder may use ViT (Vision Transformer) or ResNet.
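A minimal sketch of the visual-side processing in S31 to S33 is given below, reusing the CLIP model loaded in the previous sketch; the crop size, augmentation probabilities and normalization statistics are illustrative assumptions rather than values specified in this patent.

import torch
import torchvision.transforms as T

# Illustrative training-time augmentation matching options a1)-a3):
# random crop, random grayscale, random horizontal flip.
train_transform = T.Compose([
    T.RandomResizedCrop(224),
    T.RandomGrayscale(p=0.2),
    T.RandomHorizontalFlip(p=0.5),
    T.ToTensor(),
    T.Normalize((0.48145466, 0.4578275, 0.40821073),
                (0.26862954, 0.26130258, 0.27577711)),  # CLIP image statistics
])

@torch.no_grad()
def encode_video_frames(frames):
    # S32: augment the T sampled frames (PIL images) and stack them.
    x = torch.stack([train_transform(f) for f in frames]).to(device)  # (T, 3, 224, 224)
    # S33: encode each frame with the frozen CLIP image encoder -> (T, D) feature map.
    feats = model.encode_image(x).float()
    return feats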
S4: build an implicit temporal modeling module that learns a position encoding, based on an encoder-only Transformer; input the video frame feature map into the implicit temporal modeling module and compute it through a self-attention mechanism to obtain the implicit video frame representation map; apply an average pooling operation to obtain the implicit video representation output by the implicit temporal modeling module.
The video frame feature map is input into the implicit temporal modeling module for calculation.
The implicit temporal modeling module includes a position encoding and an encoder-only self-attention mechanism. The implicit video frame representation map is calculated by the self-attention mechanism, and average pooling fuses all the video frame features learned through the position encoding into the implicit video representation of the implicit temporal modeling module, which is later residual-connected with the explicit video representation obtained by the explicit temporal modeling module.
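The following PyTorch sketch illustrates one way the implicit temporal modeling module described above could be realized: a learnable position embedding, an encoder-only Transformer layer, and average pooling over frames. The layer count, head count and class name are assumptions for illustration.

import torch
import torch.nn as nn

class ImplicitTemporalModule(nn.Module):
    # Learnable position encoding + encoder-only self-attention + average pooling (S4).
    def __init__(self, dim, num_frames, num_layers=1, num_heads=8):
        super().__init__()
        self.pos_embed = nn.Parameter(torch.zeros(1, num_frames, dim))
        layer = nn.TransformerEncoderLayer(d_model=dim, nhead=num_heads, batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, num_layers=num_layers)

    def forward(self, frame_feats):                      # frame_feats: (B, T, D)
        z = self.encoder(frame_feats + self.pos_embed)   # implicit video frame representation map
        video_repr = z.mean(dim=1)                       # average pooling over the T frames
        return z, video_repr                             # (B, T, D) and (B, D)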
S5: build an explicit temporal modeling module based on a cross-attention mechanism; randomly shuffle the implicit video frame representation map by rows S times to obtain S+1 video frame feature sequences; input these video frame feature sequences into the explicit temporal modeling module to obtain S+1 explicit video representations, which form 1 positive sample pair and S negative sample pairs with the label text description; align the text-video space using a contrastive learning paradigm and output the correct explicit video representation; connect the explicit video representation and the implicit video representation through a residual connection to obtain the representation of the whole video.
As shown in fig. 3, in this embodiment the proposed Shuffle Contrastive Learning (SCL) module is used as the framework of the explicit temporal modeling module, and step S5 specifically includes the following steps.
S51: the order of the implicit video frame representation map is taken as the correct-order video frame features.
The implicit video frame representation map has a given length, and the correct-order video frame features are its rows in their original temporal order.
S52: a random number sequence of the same length as the implicit video frame representation map is generated and used to shuffle the video frame representation map by slicing; this is repeated S times to obtain S wrong-order video frame representations.
It should be noted that each shuffle yields one wrong-order video frame representation, so repeating S times yields S wrong-order video frame representations. The 1 correct-order video frame feature sequence and the S wrong-order video frame representations contain the same video features in different orders, and together they form S+1 video frame feature sequences, with the correct-order sequence serving as the positive sample and the wrong-order sequences as negative samples.
S53: will beThe method comprises the steps of respectively fixing the features of video frames as keys and values, inputting the feature images of the video frames as Query into an explicit time sequence modeling module, calculating the keys, the values and the Query based on a cross attention mechanism, and obtaining the features of different orders of attention through full-connection layer dimension reductionS+1 video characterizations;
wherein the video frame feature map is used as a Query,the values of the key and the value are the same and fixed in training, the Query, the key and the value are input into a mask cross attention (mask cross attention) module to be calculated, and as the training learning process is carried out, the Query is correspondingly changed due to the parameter change of the network model based on the mask cross attention, the calculation process is as follows:
wherein,representing an activation function->Corresponding Query, < >>Corresponding key(s)>The value of the value is corresponding to the value,the causal masking operation is represented in order to focus a certain video frame feature only on video frame features that occur before it. />And the scaling factor is represented, so that the stability of the cross attention module training is ensured.
When implicit video frame characterization map is used asWith the right order video frame features (+)>,/>) The cross-attention calculation can be considered as focusing on the video frame characterization in the correct sequence; when implicit video frame representation is taken as +.>Characterization with error sequential video frame (++>,/>) The cross-attention calculation may be considered to be concerned with the characterization of video frames in the wrong order. The dimension of the output is identical to the dimension of the original video frame of the input +.>The whole connection layer is used again to reduce the dimension to +.>Resulting in S +1 video characterizations of different orders of interest.
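A sketch of the shuffling of S52 and the masked cross-attention of S53 follows; the single attention layer, the use of torch.tril for the causal mask, and the pooling before the fully connected reduction are illustrative assumptions, since the exact layer structure is not recoverable from this text.

import torch
import torch.nn as nn
import torch.nn.functional as F

def shuffled_orders(z, num_shuffles):
    # S52: return the correct order plus num_shuffles random row permutations of z (B, T, D).
    orders = [z]
    for _ in range(num_shuffles):
        perm = torch.randperm(z.size(1), device=z.device)
        orders.append(z[:, perm, :])
    return orders  # index 0 is the correct (positive) order

class MaskedCrossAttention(nn.Module):
    # Query = video frame feature map; Key/Value = one (correct or shuffled) frame sequence.
    def __init__(self, dim, out_dim):
        super().__init__()
        self.q_proj = nn.Linear(dim, dim)
        self.k_proj = nn.Linear(dim, dim)
        self.v_proj = nn.Linear(dim, dim)
        self.reduce = nn.Linear(dim, out_dim)  # fully connected dimension reduction
        self.scale = dim ** -0.5               # scaling factor for training stability

    def forward(self, query, kv):              # query, kv: (B, T, D)
        q, k, v = self.q_proj(query), self.k_proj(kv), self.v_proj(kv)
        attn = (q @ k.transpose(-2, -1)) * self.scale                 # (B, T, T) dot products
        causal = torch.tril(torch.ones(q.size(1), q.size(1), device=q.device, dtype=torch.bool))
        attn = attn.masked_fill(~causal, float("-inf"))               # attend only to earlier frames
        out = F.softmax(attn, dim=-1) @ v                             # (B, T, D)
        return self.reduce(out.mean(dim=1))                           # one video representation per order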
S54: a vector containing text information is extracted from the text features and added to the S+1 video representations to obtain the explicit output values.
In the computation of step S53, the design interacts with the text features: a fully connected layer is applied to the text feature to reduce its dimension and extract the text information vector.
S55: the inner product of the explicit output values and the text features gives a score column vector; the score column vector is input into the temporal loss function for contrastive learning, yielding the correct explicit video representation.
Of the explicit outputs, one is produced by the correct-order video frame features and the rest by the wrong-order video frame representations; the outputs are concatenated and the extracted text information vector is added to them, giving the explicit output values of the stated feature dimension.
The explicit output values are combined with the text feature of the corresponding video label by an inner product, giving a score column vector in which only the prediction score of the first row (the correct order) should approach 1 and the scores of the remaining rows should approach 0. The score column vector divided by the temperature coefficient is used as the prediction, a column vector whose first entry is 1 and whose remaining S entries are 0 is used as the target, and both are input into the temporal loss function to calculate the contrastive learning loss. This training task is a proxy task, i.e. the training purpose is achieved indirectly through a task that is unrelated or only weakly related to the downstream task.
In the temporal loss function, the label text feature of the current video frame is compared against the module's outputs by dot product: the output produced by the correct-order video frame features is the positive sample, the outputs produced by all orderings of the video frame representations (correct and wrong) are the candidates, S is the number of negative samples, i.e. the number of shuffles, and a temperature coefficient scales the dot products; the loss is the contrastive loss over these scaled scores.
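The description of temperature-scaled dot products, with the correct-order output as the positive sample and the shuffled-order outputs as negatives, suggests an InfoNCE-style formulation; the sketch below is an assumed reconstruction consistent with that description rather than a verbatim copy of the original formula, and the temperature value is illustrative.

import torch
import torch.nn.functional as F

def temporal_contrastive_loss(text_feat, explicit_outputs, temperature=0.07):
    # text_feat: (B, D) label text feature of the current video.
    # explicit_outputs: (B, S + 1, D); index 0 is the correct-order output (positive),
    # indices 1..S are the shuffled-order outputs (negatives).
    scores = torch.einsum("bd,bsd->bs", text_feat, explicit_outputs) / temperature
    targets = torch.zeros(scores.size(0), dtype=torch.long, device=scores.device)  # positive at index 0
    return F.cross_entropy(scores, targets)  # -log softmax score of the correct order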
During training the wrong-order video frame representations are generated and used in the computation, whereas in actual use of the network model they are not needed; the correct-order video frame representation is computed directly and used as the video representation.
S56: the explicit video representation and the implicit video representation are residual-connected to obtain the representation of the whole video.
Through steps S51 to S56, an explicit temporal modeling module is designed for the implicit temporal mining problem of position-encoding-dependent temporal modeling in transfer learning, and it interacts with the tag text descriptions; the explicit temporal modeling module can be used in a plug-and-play manner in any transfer learning task from the image domain to the video domain. Specifically, considering that part of the temporal information of video frames is lost when implicit temporal modeling relies only on a learnable position encoding, and that interaction with the text side is absent during network model learning, this embodiment designs an explicit temporal modeling module in which the different frame representation orders obtained by shuffling the video frame representations form positive and negative sample pairs with the text description, and a contrastive learning paradigm is used to align the text-video space. In this way, the information of the two modalities of text and video is fully mined and used for migrating an image model to the video action recognition task; tests on video datasets show that inserting the explicit temporal modeling module SCL into existing models improves recognition accuracy, illustrating the effectiveness of the method of this embodiment.
S6: will characterizeAnd performing inner product operation on the classification matrix, and calculating a prediction score to obtain a prediction result.
Will characterizeNormalized and then subjected to inner product with a classification matrix to obtain a prediction score +.>
,/>And (5) representing the label category, and selecting the label with the highest probability as a prediction result of video action recognition.
In the prediction score stage, a prediction loss function is constructed from the prediction scores using the cross-entropy loss: for each sample, the prediction score is compared against that sample's label.
The overall loss function of the network model is therefore a weighted combination of the prediction loss and the temporal loss, in which a parameter controls the ratio between the two.
Through steps S1 to S6, this embodiment designs a video action recognition method based on transfer learning. The method considers the two modalities of video text labels and video. First, a pre-task is designed on the text side: a large language model (LLM) is used to reasonably expand the video text labels, improving the completeness and accuracy of the video text labels in describing the video and helping the network model better align the two different modal feature spaces of text and video. The tag description text and the video frame pictures are then encoded by the text encoder and the image encoder of CLIP, respectively, to obtain the text and video frame features. Second, given the implicit nature of temporal modeling with self-attention, this embodiment designs a Shuffle Contrastive Learning (SCL) model as an explicit temporal modeling module to explicitly extract the video temporal representation. SCL realizes the migration of the recognition model from the image domain to the video domain by letting the model learn the order of video frames while interacting with the text features. Finally, the video representation extracted by the model is combined with the classification matrix by an inner product to obtain the prediction scores. In the training stage, besides the prediction scores and the loss function for the final label, this embodiment also designs a temporal loss function in SCL, forcing the network model to learn the temporal information of the video and ultimately improving the network model's prediction results on videos.
The foregoing is only a preferred embodiment of the present invention, but the scope of the present invention is not limited thereto, and any person skilled in the art, who is within the scope of the present invention, should make equivalent substitutions or modifications according to the technical scheme of the present invention and the inventive concept thereof, and should be covered by the scope of the present invention.

Claims (7)

1. A video motion recognition method based on transfer learning is characterized in that video information is input into a network model to output a prediction result;
the training process of the network model is as follows:
s1: constructing a training set, wherein the training set comprises a video and a video text label;
s2: expanding the video text label through the large language model, encoding the expanded video text label through the CLIP model to obtain text features, splicing the text features to serve as an initialized classification matrix, and freezing the classification matrix in the network model training process;
s3: extracting a frame image in the video, and encoding the enhanced frame image through a CLIP model to obtain a video frame feature map;
s4: the method comprises the steps of constructing a position coding learning implicit time sequence modeling module based on a converter of an encodable-only architecture, inputting a video frame feature map into the implicit time sequence modeling module, and calculating the video frame feature map through a self-attention mechanism to obtain an implicit video frame feature mapRepresentation of implicit video frames>Obtaining implicit video representation output by an implicit time sequence modeling module after carrying out average pooling operation>
S5: building an explicit time sequence modeling module based on a cross attention mechanism, and representing an implicit video frame into a graphRandom scrambling according to rows>For times, get->Video frame feature, will->The video frame characteristics are input into an explicit time sequence modeling module to obtain +.>-individual explicit video characterization, said->The individual explicit video representations form 1 positive sample pair sum +.>Aligning the text-video space by using a contrast learning paradigm and outputting the correct explicit video representation +.>Explicit video characterization +.>Implicit video characterizationResidual connection is performed to obtain the representation of the whole video>
S6: will characterizeAnd performing inner product operation on the classification matrix, and calculating a prediction score to obtain a prediction result.
2. The video motion recognition method based on transfer learning according to claim 1, wherein step S2 specifically comprises:
taking the video text label set as the processing object, using a large language model to expand each video text label into a detailed tag description;
converting the tag description into a word vector through a tokenizer, with a length determined by the description string length and the text vector length;
encoding the word vector with the text encoder of the CLIP model to obtain a text feature of a given feature dimension;
merging the text features of the tag descriptions of all label classes into a classification matrix, in which each row is the text feature of one label class.
3. The video motion recognition method based on transfer learning according to claim 1, wherein step S3 specifically comprises:
uniformly sampling a number of frames from the video and performing data enhancement on the resulting video frame set, where each frame has a given height, width and number of channels;
stacking the data-enhanced video frames along the channel dimension to obtain the input image of the network model, whose spatial size is the crop size used in data enhancement;
encoding the input image with the image encoder of the CLIP model to obtain the video frame feature map of a given feature dimension.
4. The video motion recognition method based on transfer learning of claim 3, wherein in step S4 the implicit temporal modeling module includes a position encoding and an encoder-only self-attention mechanism, in which the implicit video frame representation map is computed by the self-attention mechanism and the implicit video representation is obtained from it by average pooling.
5. The video motion recognition method based on transfer learning according to claim 4, wherein step S5 specifically comprises:
taking the order of the implicit video frame representation map as the correct-order video frame features;
generating a random number sequence of the same length as the implicit video frame representation and using it to shuffle the video frame representation by slicing, repeating this S times to obtain S wrong-order video frame representations;
using the video frame feature map as the Query and fixing the correct-order and wrong-order video frame features as Key and Value respectively, computing the Query, Key and Value with a masked cross-attention mechanism, and reducing the dimension of the computed features through a fully connected layer to obtain S+1 video representations attending to different orders;
extracting a vector containing text information from the text features and adding this vector to the S+1 video representations to obtain the explicit output values;
taking the inner product of the explicit output values and the text feature of the corresponding video label to obtain a score column vector, and inputting the score column vector into a temporal loss function for contrastive learning to obtain the correct explicit video representation;
connecting the explicit video representation and the implicit video representation through a residual connection to obtain the representation of the whole video.
6. The video motion recognition method based on transfer learning of claim 5, wherein in the temporal loss function, the label text feature of the current video frame is compared against the outputs of the explicit temporal modeling module by dot product: the output produced by the correct-order video frame features is the positive sample, the outputs produced by all orderings of the video frame representations are the candidates, S is the number of negative samples, i.e. the number of shuffles, and a temperature coefficient scales the dot products; the loss is the contrastive loss over these scaled scores.
7. The video motion recognition method based on transfer learning of claim 5, wherein in step S6 a prediction loss function is constructed from the prediction scores using the cross-entropy loss, in which the prediction score of each sample is compared against that sample's label.
CN202410090020.6A 2024-01-23 2024-01-23 Video action recognition method based on transfer learning Active CN117612071B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202410090020.6A CN117612071B (en) 2024-01-23 2024-01-23 Video action recognition method based on transfer learning

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202410090020.6A CN117612071B (en) 2024-01-23 2024-01-23 Video action recognition method based on transfer learning

Publications (2)

Publication Number Publication Date
CN117612071A true CN117612071A (en) 2024-02-27
CN117612071B CN117612071B (en) 2024-04-19

Family

ID=89951979

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202410090020.6A Active CN117612071B (en) 2024-01-23 2024-01-23 Video action recognition method based on transfer learning

Country Status (1)

Country Link
CN (1) CN117612071B (en)

Citations (16)


Patent Citations (16)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20170357720A1 (en) * 2016-06-10 2017-12-14 Disney Enterprises, Inc. Joint heterogeneous language-vision embeddings for video tagging and search
US20200241545A1 (en) * 2019-01-30 2020-07-30 Perceptive Automata, Inc. Automatic braking of autonomous vehicles using machine learning based prediction of behavior of a traffic entity
CN111079532A (en) * 2019-11-13 2020-04-28 杭州电子科技大学 Video content description method based on text self-encoder
WO2022083335A1 (en) * 2020-10-20 2022-04-28 神思电子技术股份有限公司 Self-attention mechanism-based behavior recognition method
US20230154169A1 (en) * 2021-11-15 2023-05-18 Qualcomm Incorporated Video processing using delta distillation
US20230316749A1 (en) * 2022-03-04 2023-10-05 Samsung Electronics Co., Ltd. Method and apparatus for video action classification
CN114724240A (en) * 2022-03-22 2022-07-08 南京甄视智能科技有限公司 Behavior recognition method, electronic device and storage medium
WO2023229094A1 (en) * 2022-05-27 2023-11-30 주식회사 엔씨소프트 Method and apparatus for predicting actions
US20240013558A1 (en) * 2022-07-07 2024-01-11 Beijing Baidu Netcom Science Technology Co., Ltd. Cross-modal feature extraction, retrieval, and model training method and apparatus, and medium
CN115393949A (en) * 2022-07-14 2022-11-25 河北大学 Continuous sign language recognition method and device
CN115796029A (en) * 2022-11-28 2023-03-14 东南大学 NL2SQL method based on explicit and implicit characteristic decoupling
CN116109980A (en) * 2023-02-14 2023-05-12 杭州电子科技大学 Action recognition method based on video text matching
CN116383671A (en) * 2023-03-27 2023-07-04 武汉大学 Text image cross-mode pedestrian retrieval method and system with implicit relation reasoning alignment
CN116721458A (en) * 2023-05-04 2023-09-08 桂林电子科技大学 Cross-modal time sequence contrast learning-based self-supervision action recognition method
CN116958677A (en) * 2023-07-25 2023-10-27 重庆邮电大学 Internet short video classification method based on multi-mode big data
CN117351392A (en) * 2023-09-28 2024-01-05 西北工业大学 Method for detecting abnormal behavior of video

Non-Patent Citations (6)

* Cited by examiner, † Cited by third party
Title
DING Jiang, et al.: "Cross-modal implicit relation reasoning and aligning for text-to-image person retrieval", Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 31 December 2023, pages 2787-2797
HUI Zheng, et al.: "A cross view learning approach for skeleton-based action recognition", IEEE Transactions on Circuits and Systems for Video Technology, vol. 32, no. 5, 26 July 2021, pages 3061-3072
HUI Zheng, et al.: "A cross-modal learning approach for recognizing human actions", IEEE Systems Journal, vol. 15, no. 2, 30 June 2021, pages 2022-2330
YUE Liu, et al.: "Contrastive predictive coding with transformer for video representation learning", Neurocomputing, vol. 482, 14 April 2022, pages 154-162, XP086978621, DOI: 10.1016/j.neucom.2021.11.031
ZHANG Chao et al.: "Multi-modal multi-label emotion recognition algorithm based on label embedding", Network Security and Data Governance, vol. 41, no. 7, 31 January 2022, pages 101-107
ZHANG Jingran: "Research on video spatio-temporal modeling and its robustness in action recognition", China Doctoral Dissertations Full-text Database (Information Science and Technology), no. 1, 15 January 2023, pages 138-93

Also Published As

Publication number Publication date
CN117612071B (en) 2024-04-19

Similar Documents

Publication Publication Date Title
CN107239801B (en) Video attribute representation learning method and video character description automatic generation method
CN112070114B (en) Scene character recognition method and system based on Gaussian constraint attention mechanism network
CN113283336A (en) Text recognition method and system
CN112488055A (en) Video question-answering method based on progressive graph attention network
CN113516152B (en) Image description method based on composite image semantics
CN116912642A (en) Multimode emotion analysis method, device and medium based on dual-mode and multi-granularity interaction
CN114495129A (en) Character detection model pre-training method and device
CN115130591A (en) Cross supervision-based multi-mode data classification method and device
CN111340006B (en) Sign language recognition method and system
CN113806747B (en) Trojan horse picture detection method and system and computer readable storage medium
Chen et al. Cross-lingual text image recognition via multi-task sequence to sequence learning
CN117934803A (en) Visual positioning method based on multi-modal feature alignment
CN115761764A (en) Chinese handwritten text line recognition method based on visual language joint reasoning
CN117809218B (en) Electronic shop descriptive video processing system and method
CN117456581A (en) Method for recognizing facial expression from image pre-training model to video
CN114821420B (en) Time sequence action positioning method based on multi-time resolution temporal semantic aggregation network
CN117612071B (en) Video action recognition method based on transfer learning
CN110929013A (en) Image question-answer implementation method based on bottom-up entry and positioning information fusion
Pan et al. Quality-aware clip for blind image quality assessment
Han et al. NSNP-DFER: a nonlinear spiking neural P network for dynamic facial expression recognition
CN114254080A (en) Text matching method, device and equipment
CN114529908A (en) Offline handwritten chemical reaction type image recognition technology
CN111339782A (en) Sign language translation system and method based on multilevel semantic analysis
Zhang Design and Implementation of the Chinese Character Font Recognition System Based on Binary Convolutional Encoding and Decoding Network
CN118536049B (en) Content main body discovery method based on multi-mode abnormal content understanding

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant