CN117612071A - Video action recognition method based on transfer learning - Google Patents

Video action recognition method based on transfer learning

Info

Publication number
CN117612071A
Authority
CN
China
Prior art keywords
video
text
implicit
video frame
explicit
Prior art date
Legal status
Granted
Application number
CN202410090020.6A
Other languages
Chinese (zh)
Other versions
CN117612071B (en)
Inventor
张信明
刘语西
陈思宏
Current Assignee
University of Science and Technology of China USTC
Shenzhen Tencent Computer Systems Co Ltd
Original Assignee
University of Science and Technology of China USTC
Shenzhen Tencent Computer Systems Co Ltd
Priority date
Filing date
Publication date
Application filed by University of Science and Technology of China USTC, Shenzhen Tencent Computer Systems Co Ltd filed Critical University of Science and Technology of China USTC
Priority to CN202410090020.6A priority Critical patent/CN117612071B/en
Publication of CN117612071A publication Critical patent/CN117612071A/en
Application granted granted Critical
Publication of CN117612071B publication Critical patent/CN117612071B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical

Links

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06V IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V20/00 Scenes; Scene-specific elements
    • G06V20/40 Scenes; Scene-specific elements in video content
    • G06V20/41 Higher-level, semantic clustering, classification or understanding of video scenes, e.g. detection, labelling or Markovian modelling of sport events or news items
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 Computing arrangements based on biological models
    • G06N3/02 Neural networks
    • G06N3/08 Learning methods
    • G06N3/096 Transfer learning
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06V IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V10/00 Arrangements for image or video recognition or understanding
    • G06V10/70 Arrangements for image or video recognition or understanding using pattern recognition or machine learning
    • G06V10/82 Arrangements for image or video recognition or understanding using pattern recognition or machine learning using neural networks
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06V IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V20/00 Scenes; Scene-specific elements
    • G06V20/40 Scenes; Scene-specific elements in video content
    • G06V20/46 Extracting features or characteristics from the video content, e.g. video fingerprints, representative shots or key frames
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06V IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V40/00 Recognition of biometric, human-related or animal-related patterns in image or video data
    • G06V40/20 Movements or behaviour, e.g. gesture recognition

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • General Physics & Mathematics (AREA)
  • Multimedia (AREA)
  • General Health & Medical Sciences (AREA)
  • Software Systems (AREA)
  • Evolutionary Computation (AREA)
  • Health & Medical Sciences (AREA)
  • Artificial Intelligence (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Computing Systems (AREA)
  • Computational Linguistics (AREA)
  • Databases & Information Systems (AREA)
  • Medical Informatics (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Biomedical Technology (AREA)
  • Biophysics (AREA)
  • Data Mining & Analysis (AREA)
  • Molecular Biology (AREA)
  • General Engineering & Computer Science (AREA)
  • Mathematical Physics (AREA)
  • Psychiatry (AREA)
  • Social Psychology (AREA)
  • Human Computer Interaction (AREA)
  • Image Analysis (AREA)

Abstract

The invention discloses a video action recognition method based on transfer learning, in which the network model is trained as follows. S1: construct a training set. S2: process the video text labels to obtain text features, and concatenate the text features as an initialized classification matrix. S3: extract frame images from the video and process them to obtain a video frame feature map. S4: input the video frame feature map into an implicit temporal modeling module to output an implicit video representation. S5: randomly shuffle the implicit video frame representation map by rows several times, input the resulting video frame feature sequences into an explicit temporal modeling module, output the correct explicit video representation, and connect the explicit and implicit video representations through a residual connection to obtain the representation of the whole video. S6: perform an inner product between the whole-video representation and the classification matrix, and compute prediction scores to obtain the prediction result. The video action recognition method improves the accuracy of video recognition and prediction.

Description

Video action recognition method based on transfer learning
Technical Field
The invention relates to the technical field of transfer learning, in particular to a video action recognition method based on transfer learning.
Background
Video understanding, and in particular action recognition, is one of the fundamental tasks of computer vision. It involves identifying and understanding the actions and gestures of humans or objects captured in image, video or sensor data. This task is widely used in applications including video surveillance, human-computer interaction, virtual reality, somatosensory gaming, autonomous driving, and the like.
Currently, mainstream video action recognition algorithms mainly adopt CNN-based or Transformer-based structures. Because CNNs are relatively mature in the image field, early action recognition frameworks directly applied 2D convolutions to video frames and introduced traditional algorithms such as optical flow and motion trajectories to supplement temporal information. Later, 3D convolutions appeared, which can simultaneously learn spatial features and temporal features between adjacent frames from video segments, and the resulting video features are used for classification.
In recent years, video models based on the Transformer structure have shown better results. The open-source model CLIP proposed by OpenAI is widely used in transfer learning from the image domain to the video domain due to its excellent generalization performance and image representation capability. Part of the work based on CLIP extracts the features of individual video frames with CLIP and then designs a temporal modeling module to extract the features of the whole video. Methods based on CLIP can be fine-tuned end to end, or parameter-efficient fine-tuning can be adopted, in which the CLIP parameters are frozen and only an adapter module with a small number of parameters is trained, improving training efficiency. However, on the one hand, video contains richer spatio-temporal information than images, and current methods use either one-hot labels or single-word labels for classification, which cannot fully describe the complex content of a video, so the video space and the text space are insufficiently aligned; on the other hand, in current transfer learning methods the temporal modeling stage mostly lets the model learn the temporal relationship of video frames through a learnable position encoding and self-attention, and such implicit temporal mining makes it difficult to establish an efficient temporal model for multi-content, long-duration videos.
Disclosure of Invention
Based on the technical problems described in the background above, the invention provides a video action recognition method based on transfer learning, which considers the two modalities of video text labels and video, and improves the accuracy of video recognition and prediction.
The video motion recognition method based on transfer learning inputs video information into a network model to output a prediction result;
the training process of the network model is as follows:
s1: constructing a training set, wherein the training set comprises a video and a video text label;
s2: expanding the video text label through the large language model, encoding the expanded video text label through the CLIP model to obtain text features, splicing the text features to serve as an initialized classification matrix, and freezing the classification matrix in the network model training process;
s3: extracting a frame image in the video, and encoding the enhanced frame image through a CLIP model to obtain a video frame feature map;
S4: build an implicit temporal modeling module that learns a position encoding, based on an encoder-only Transformer; input the video frame feature map into the implicit temporal modeling module and compute it through a self-attention mechanism to obtain the implicit video frame representation map; apply an average pooling operation to obtain the implicit video representation output by the implicit temporal modeling module;
S5: build an explicit temporal modeling module based on a cross-attention mechanism; randomly shuffle the implicit video frame representation map by rows S times to obtain S+1 video frame feature sequences; input these video frame feature sequences into the explicit temporal modeling module to obtain S+1 explicit video representations, which form 1 positive sample pair and S negative sample pairs with the label text description; align the text-video space using a contrastive learning paradigm and output the correct explicit video representation; connect the explicit video representation and the implicit video representation through a residual connection to obtain the representation of the whole video;
S6: will characterizeAnd performing inner product operation on the classification matrix, and calculating a prediction score to obtain a prediction result.
Further, step S2 specifically includes:
Taking the video text label set as the processing object, a large language model is used to expand each video text label into a detailed tag description;
The tag description is converted into a word vector by a tokenizer, with a length determined by the description string length and the text vector length;
The text encoder of the CLIP model encodes the word vector to obtain a text feature of a given feature dimension;
The text features of the tag descriptions of all label classes are merged into a classification matrix, in which each row is the text feature of one label class.
Further, step S3 specifically includes:
Uniformly sampling a number of frames from the video and performing data enhancement on the resulting video frame set, where each frame has a given height, width and number of channels;
Stacking the data-enhanced video frames along the channel dimension to obtain the input image of the network model, whose spatial size is the crop size used in data enhancement;
Encoding the input image with the image encoder of the CLIP model to obtain the video frame feature map of a given feature dimension.
Further, in step S4, the implicit temporal modeling module includes a position encoding and an encoder-only self-attention mechanism, in which the implicit video frame representation map is computed by the self-attention mechanism and the implicit video representation is obtained from it by average pooling.
Further, step S5 specifically includes:
Taking the order of the implicit video frame representation map as the correct-order video frame features;
Generating a random number sequence of the same length as the implicit video frame representation and using it to shuffle the video frame representation by slicing, repeating this S times to obtain S wrong-order video frame representations;
Using the video frame feature map as the Query and fixing the correct-order and wrong-order video frame features as Key and Value respectively, computing the Query, Key and Value with a masked cross-attention mechanism, and reducing the dimension of the computed features through a fully connected layer to obtain S+1 video representations attending to different orders;
Extracting a vector containing text information from the text features and adding this vector to the S+1 video representations to obtain the explicit output values;
Taking the inner product of the explicit output values and the text feature of the corresponding video label to obtain a score column vector, and inputting the score column vector into a temporal loss function for contrastive learning to obtain the correct explicit video representation;
Connecting the explicit video representation and the implicit video representation through a residual connection to obtain the representation of the whole video.
Further, in the temporal loss function, the label text feature of the current video frame is compared against the outputs of the explicit temporal modeling module by dot product: the output produced by the correct-order video frame features is the positive sample, the outputs produced by all orderings of the video frame representations are the candidates, S is the number of negative samples, i.e. the number of shuffles, and a temperature coefficient scales the dot products; the loss is the contrastive loss over these scaled scores.
Further, in step S6, a prediction loss function is constructed from the prediction scores using the cross-entropy loss, in which the prediction score of each sample is compared against that sample's label.
The video action recognition method based on transfer learning has the following advantages: the method considers the two modalities of video text labels and video. On the text side, to match the rich information contained in a video, a large language model is used to expand the original simple action and behavior labels into detailed descriptions, which improves the alignment of the text-video space during network model learning; meanwhile, the text descriptions are encoded with the pretrained text encoder of CLIP to form the classification matrix, which reduces the training parameters and shortens the training time. To address the implicit temporal mining problem of position-encoding-dependent temporal modeling in transfer learning, an explicit temporal modeling module is designed that interacts with the tag text descriptions, fully mining the information of the two modalities of text and video.
Drawings
FIG. 1 is a schematic flow chart of the present invention;
FIG. 2 is a flow chart of network model training;
FIG. 3 is a training flow diagram of an explicit timing modeling module.
Detailed Description
In the following detailed description of the present invention, numerous specific details are set forth in order to provide a thorough understanding of the present invention. The invention may be embodied in many other forms than described herein and similarly modified by those skilled in the art without departing from the spirit or scope of the invention, which is therefore not limited to the specific embodiments disclosed below.
As shown in fig. 1 to 3, the video motion recognition method based on the transfer learning provided by the invention inputs video information into a network model to output a prediction result.
The training process of the network model comprises the following steps S1 to S6.
S1: constructing a training set, wherein the training set comprises a video and a video text label;
S2: expanding the video text label through a large language model, encoding the expanded video text label through the CLIP model to obtain text features, splicing the text features to serve as an initialized classification matrix, and freezing the classification matrix during network model training, which specifically comprises steps S21 to S24;
S21: taking the video text label set as the processing object, a large language model is used to expand each video text label into a detailed tag description;
S22: the tag description is converted into a word vector by a tokenizer, with a length determined by the description string length and the text vector length;
S23: the text encoder of the CLIP model encodes the word vector to obtain a text feature of a given feature dimension;
wherein the text encoder of CLIP uses an encoder-decoder Transformer architecture, and by default CLIP operates on the text through its text encoder;
S24: the text features of the tag descriptions of all label classes are merged into a classification matrix, in which each row is the text feature of one label class.
Through steps S21 to S24, a video tag expansion process is designed using a large language model to generate a more complete description of the labeled actions. In other words, in this embodiment, on the text side, to match the rich information contained in a video, the large language model expands the original simple action and behavior labels into detailed descriptions, which improves the alignment of the text-video space during network model learning; meanwhile, the text descriptions are encoded with the pretrained text encoder of CLIP to form the classification matrix, which reduces the training parameters and shortens the training time.
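As an illustration of steps S21 to S24, the following minimal Python sketch uses PyTorch and OpenAI's open-source clip package; the expand_label function is a hypothetical stand-in for the large language model call, and the prompt wording, model choice ("ViT-B/16") and label strings are assumptions rather than values taken from this patent.

import torch
import clip  # OpenAI's open-source CLIP package

device = "cuda" if torch.cuda.is_available() else "cpu"
model, _ = clip.load("ViT-B/16", device=device)  # frozen text + image encoders

def expand_label(label: str) -> str:
    # Hypothetical placeholder for the LLM expansion of S21: in practice a large
    # language model is prompted to turn the short action label into a detailed description.
    return f"a video of a person {label}, showing the complete motion in detail"

@torch.no_grad()
def build_classification_matrix(labels):
    # S22: tokenize the expanded descriptions into word vectors.
    descriptions = [expand_label(y) for y in labels]
    tokens = clip.tokenize(descriptions).to(device)
    # S23: encode with the frozen CLIP text encoder to get one text feature per class.
    text_features = model.encode_text(tokens).float()
    text_features = text_features / text_features.norm(dim=-1, keepdim=True)
    # S24: the stacked rows form the K x D classification matrix, frozen during training.
    return text_features

W = build_classification_matrix(["riding a bike", "playing guitar", "high jump"])
print(W.shape)  # (3, D)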
S3: extracting a frame image in a video, and encoding the enhanced frame image through a CLIP model to obtain a video frame feature map, wherein the method specifically comprises steps S31 to S33;
S31: uniformly sample a number of frames from the video and perform data enhancement on the resulting video frame set, where each frame has a given height, width and number of channels;
For the visual side, video frame sampling is performed first, with different sampling strategies for the training set and the test set. For the training set, frames are uniformly sampled from the video, so that one video is represented by a fixed number of frames in the training phase. For the test set, the video is divided into several segments, each segment is uniformly sampled, and each frame is randomly cropped several times, so that one video is represented by a correspondingly larger number of frames.
Data enhancement is used, including but not limited to the following:
a1) Cropping: including random cropping, center cropping, etc.; the picture is cropped to a given size at a random position. a2) Random grayscaling: the picture is converted to grayscale with a certain probability. a3) Random erasing and flipping: part of the picture is cut out with a certain probability, and the picture is then flipped horizontally or vertically.
S32: stack the data-enhanced video frames along the channel dimension to obtain the input image of the network model, whose spatial size is the crop size used in data enhancement;
S33: encode the input image with the image encoder of the CLIP model to obtain the video frame feature map, whose rows correspond to the sampled frames and whose columns correspond to the feature dimension.
The picture encoder may use ViT (Vision Transformer) or ResNet.
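A minimal sketch of the visual-side processing in S31 to S33 is given below, reusing the CLIP model loaded in the previous sketch; the crop size, augmentation probabilities and normalization statistics are illustrative assumptions rather than values specified in this patent.

import torch
import torchvision.transforms as T

# Illustrative training-time augmentation matching options a1)-a3):
# random crop, random grayscale, random horizontal flip.
train_transform = T.Compose([
    T.RandomResizedCrop(224),
    T.RandomGrayscale(p=0.2),
    T.RandomHorizontalFlip(p=0.5),
    T.ToTensor(),
    T.Normalize((0.48145466, 0.4578275, 0.40821073),
                (0.26862954, 0.26130258, 0.27577711)),  # CLIP image statistics
])

@torch.no_grad()
def encode_video_frames(frames):
    # S32: augment the T sampled frames (PIL images) and stack them.
    x = torch.stack([train_transform(f) for f in frames]).to(device)  # (T, 3, 224, 224)
    # S33: encode each frame with the frozen CLIP image encoder -> (T, D) feature map.
    feats = model.encode_image(x).float()
    return feats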
S4: build an implicit temporal modeling module that learns a position encoding, based on an encoder-only Transformer; input the video frame feature map into the implicit temporal modeling module and compute it through a self-attention mechanism to obtain the implicit video frame representation map; apply an average pooling operation to obtain the implicit video representation output by the implicit temporal modeling module.
The video frame feature map is input into the implicit temporal modeling module for calculation.
The implicit temporal modeling module includes a position encoding and an encoder-only self-attention mechanism. The implicit video frame representation map is calculated by the self-attention mechanism, and average pooling fuses all the video frame features learned through the position encoding into the implicit video representation of the implicit temporal modeling module, which is later residual-connected with the explicit video representation obtained by the explicit temporal modeling module.
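The following PyTorch sketch illustrates one way the implicit temporal modeling module described above could be realized: a learnable position embedding, an encoder-only Transformer layer, and average pooling over frames. The layer count, head count and class name are assumptions for illustration.

import torch
import torch.nn as nn

class ImplicitTemporalModule(nn.Module):
    # Learnable position encoding + encoder-only self-attention + average pooling (S4).
    def __init__(self, dim, num_frames, num_layers=1, num_heads=8):
        super().__init__()
        self.pos_embed = nn.Parameter(torch.zeros(1, num_frames, dim))
        layer = nn.TransformerEncoderLayer(d_model=dim, nhead=num_heads, batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, num_layers=num_layers)

    def forward(self, frame_feats):                      # frame_feats: (B, T, D)
        z = self.encoder(frame_feats + self.pos_embed)   # implicit video frame representation map
        video_repr = z.mean(dim=1)                       # average pooling over the T frames
        return z, video_repr                             # (B, T, D) and (B, D)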
S5: build an explicit temporal modeling module based on a cross-attention mechanism; randomly shuffle the implicit video frame representation map by rows S times to obtain S+1 video frame feature sequences; input these video frame feature sequences into the explicit temporal modeling module to obtain S+1 explicit video representations, which form 1 positive sample pair and S negative sample pairs with the label text description; align the text-video space using a contrastive learning paradigm and output the correct explicit video representation; connect the explicit video representation and the implicit video representation through a residual connection to obtain the representation of the whole video.
As shown in fig. 3, in this embodiment the proposed Shuffle Contrastive Learning (SCL) module is used as the framework of the explicit temporal modeling module, and step S5 specifically includes the following steps.
S51: the order of the implicit video frame representation map is taken as the correct-order video frame features.
The implicit video frame representation map has a given length, and the correct-order video frame features are its rows in their original temporal order.
S52: a random number sequence of the same length as the implicit video frame representation map is generated and used to shuffle the video frame representation map by slicing; this is repeated S times to obtain S wrong-order video frame representations.
It should be noted that each shuffle yields one wrong-order video frame representation, so repeating S times yields S wrong-order video frame representations. The 1 correct-order video frame feature sequence and the S wrong-order video frame representations contain the same video features in different orders, and together they form S+1 video frame feature sequences, with the correct-order sequence serving as the positive sample and the wrong-order sequences as negative samples.
S53: will beThe method comprises the steps of respectively fixing the features of video frames as keys and values, inputting the feature images of the video frames as Query into an explicit time sequence modeling module, calculating the keys, the values and the Query based on a cross attention mechanism, and obtaining the features of different orders of attention through full-connection layer dimension reductionS+1 video characterizations;
wherein the video frame feature map is used as a Query,the values of the key and the value are the same and fixed in training, the Query, the key and the value are input into a mask cross attention (mask cross attention) module to be calculated, and as the training learning process is carried out, the Query is correspondingly changed due to the parameter change of the network model based on the mask cross attention, the calculation process is as follows:
wherein,representing an activation function->Corresponding Query, < >>Corresponding key(s)>The value of the value is corresponding to the value,the causal masking operation is represented in order to focus a certain video frame feature only on video frame features that occur before it. />And the scaling factor is represented, so that the stability of the cross attention module training is ensured.
When implicit video frame characterization map is used asWith the right order video frame features (+)>,/>) The cross-attention calculation can be considered as focusing on the video frame characterization in the correct sequence; when implicit video frame representation is taken as +.>Characterization with error sequential video frame (++>,/>) The cross-attention calculation may be considered to be concerned with the characterization of video frames in the wrong order. The dimension of the output is identical to the dimension of the original video frame of the input +.>The whole connection layer is used again to reduce the dimension to +.>Resulting in S +1 video characterizations of different orders of interest.
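A sketch of the shuffling of S52 and the masked cross-attention of S53 follows; the single attention layer, the use of torch.tril for the causal mask, and the pooling before the fully connected reduction are illustrative assumptions, since the exact layer structure is not recoverable from this text.

import torch
import torch.nn as nn
import torch.nn.functional as F

def shuffled_orders(z, num_shuffles):
    # S52: return the correct order plus num_shuffles random row permutations of z (B, T, D).
    orders = [z]
    for _ in range(num_shuffles):
        perm = torch.randperm(z.size(1), device=z.device)
        orders.append(z[:, perm, :])
    return orders  # index 0 is the correct (positive) order

class MaskedCrossAttention(nn.Module):
    # Query = video frame feature map; Key/Value = one (correct or shuffled) frame sequence.
    def __init__(self, dim, out_dim):
        super().__init__()
        self.q_proj = nn.Linear(dim, dim)
        self.k_proj = nn.Linear(dim, dim)
        self.v_proj = nn.Linear(dim, dim)
        self.reduce = nn.Linear(dim, out_dim)  # fully connected dimension reduction
        self.scale = dim ** -0.5               # scaling factor for training stability

    def forward(self, query, kv):              # query, kv: (B, T, D)
        q, k, v = self.q_proj(query), self.k_proj(kv), self.v_proj(kv)
        attn = (q @ k.transpose(-2, -1)) * self.scale                 # (B, T, T) dot products
        causal = torch.tril(torch.ones(q.size(1), q.size(1), device=q.device, dtype=torch.bool))
        attn = attn.masked_fill(~causal, float("-inf"))               # attend only to earlier frames
        out = F.softmax(attn, dim=-1) @ v                             # (B, T, D)
        return self.reduce(out.mean(dim=1))                           # one video representation per order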
S54: a vector containing text information is extracted from the text features and added to the S+1 video representations to obtain the explicit output values.
In the computation of step S53, the design interacts with the text features: a fully connected layer is applied to the text feature to reduce its dimension and extract the text information vector.
S55: the inner product of the explicit output values and the text features gives a score column vector; the score column vector is input into the temporal loss function for contrastive learning, yielding the correct explicit video representation.
Of the explicit outputs, one is produced by the correct-order video frame features and the rest by the wrong-order video frame representations; the outputs are concatenated and the extracted text information vector is added to them, giving the explicit output values of the stated feature dimension.
The explicit output values are combined with the text feature of the corresponding video label by an inner product, giving a score column vector in which only the prediction score of the first row (the correct order) should approach 1 and the scores of the remaining rows should approach 0. The score column vector divided by the temperature coefficient is used as the prediction, a column vector whose first entry is 1 and whose remaining S entries are 0 is used as the target, and both are input into the temporal loss function to calculate the contrastive learning loss. This training task is a proxy task, i.e. the training purpose is achieved indirectly through a task that is unrelated or only weakly related to the downstream task.
In the temporal loss function, the label text feature of the current video frame is compared against the module's outputs by dot product: the output produced by the correct-order video frame features is the positive sample, the outputs produced by all orderings of the video frame representations (correct and wrong) are the candidates, S is the number of negative samples, i.e. the number of shuffles, and a temperature coefficient scales the dot products; the loss is the contrastive loss over these scaled scores.
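The description of temperature-scaled dot products, with the correct-order output as the positive sample and the shuffled-order outputs as negatives, suggests an InfoNCE-style formulation; the sketch below is an assumed reconstruction consistent with that description rather than a verbatim copy of the original formula, and the temperature value is illustrative.

import torch
import torch.nn.functional as F

def temporal_contrastive_loss(text_feat, explicit_outputs, temperature=0.07):
    # text_feat: (B, D) label text feature of the current video.
    # explicit_outputs: (B, S + 1, D); index 0 is the correct-order output (positive),
    # indices 1..S are the shuffled-order outputs (negatives).
    scores = torch.einsum("bd,bsd->bs", text_feat, explicit_outputs) / temperature
    targets = torch.zeros(scores.size(0), dtype=torch.long, device=scores.device)  # positive at index 0
    return F.cross_entropy(scores, targets)  # -log softmax score of the correct order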
During training the wrong-order video frame representations are generated and used in the computation, whereas in actual use of the network model they are not needed; the correct-order video frame representation is computed directly and used as the video representation.
S56: the explicit video representation and the implicit video representation are residual-connected to obtain the representation of the whole video.
Through steps S51 to S56, an explicit temporal modeling module is designed for the implicit temporal mining problem of position-encoding-dependent temporal modeling in transfer learning, and it interacts with the tag text descriptions; the explicit temporal modeling module can be used in a plug-and-play manner in any transfer learning task from the image domain to the video domain. Specifically, considering that part of the temporal information of video frames is lost when implicit temporal modeling relies only on a learnable position encoding, and that interaction with the text side is absent during network model learning, this embodiment designs an explicit temporal modeling module in which the different frame representation orders obtained by shuffling the video frame representations form positive and negative sample pairs with the text description, and a contrastive learning paradigm is used to align the text-video space. In this way, the information of the two modalities of text and video is fully mined and used for migrating an image model to the video action recognition task; tests on video datasets show that inserting the explicit temporal modeling module SCL into existing models improves recognition accuracy, illustrating the effectiveness of the method of this embodiment.
S6: will characterizeAnd performing inner product operation on the classification matrix, and calculating a prediction score to obtain a prediction result.
Will characterizeNormalized and then subjected to inner product with a classification matrix to obtain a prediction score +.>
,/>And (5) representing the label category, and selecting the label with the highest probability as a prediction result of video action recognition.
In the prediction score stage, a prediction loss function is constructed from the prediction scores using the cross-entropy loss: for each sample, the prediction score is compared against that sample's label.
The overall loss function of the network model is therefore a weighted combination of the prediction loss and the temporal loss, in which a parameter controls the ratio between the two.
Through steps S1 to S6, this embodiment designs a video action recognition method based on transfer learning. The method considers the two modalities of video text labels and video. First, a pre-task is designed on the text side: a large language model (LLM) is used to reasonably expand the video text labels, improving the completeness and accuracy of the video text labels in describing the video and helping the network model better align the two different modal feature spaces of text and video. The tag description text and the video frame pictures are then encoded by the text encoder and the image encoder of CLIP, respectively, to obtain the text and video frame features. Second, given the implicit nature of temporal modeling with self-attention, this embodiment designs a Shuffle Contrastive Learning (SCL) model as an explicit temporal modeling module to explicitly extract the video temporal representation. SCL realizes the migration of the recognition model from the image domain to the video domain by letting the model learn the order of video frames while interacting with the text features. Finally, the video representation extracted by the model is combined with the classification matrix by an inner product to obtain the prediction scores. In the training stage, besides the prediction scores and the loss function for the final label, this embodiment also designs a temporal loss function in SCL, forcing the network model to learn the temporal information of the video and ultimately improving the network model's prediction results on videos.
The foregoing is only a preferred embodiment of the present invention, but the scope of the present invention is not limited thereto, and any person skilled in the art, who is within the scope of the present invention, should make equivalent substitutions or modifications according to the technical scheme of the present invention and the inventive concept thereof, and should be covered by the scope of the present invention.

Claims (7)

1. A video motion recognition method based on transfer learning is characterized in that video information is input into a network model to output a prediction result;
the training process of the network model is as follows:
s1: constructing a training set, wherein the training set comprises a video and a video text label;
s2: expanding the video text label through the large language model, encoding the expanded video text label through the CLIP model to obtain text features, splicing the text features to serve as an initialized classification matrix, and freezing the classification matrix in the network model training process;
s3: extracting a frame image in the video, and encoding the enhanced frame image through a CLIP model to obtain a video frame feature map;
s4: the method comprises the steps of constructing a position coding learning implicit time sequence modeling module based on a converter of an encodable-only architecture, inputting a video frame feature map into the implicit time sequence modeling module, and calculating the video frame feature map through a self-attention mechanism to obtain an implicit video frame feature mapRepresentation of implicit video frames>Obtaining implicit video representation output by an implicit time sequence modeling module after carrying out average pooling operation>
S5: building an explicit time sequence modeling module based on a cross attention mechanism, and representing an implicit video frame into a graphRandom scrambling according to rows>For times, get->Video frame feature, will->The video frame characteristics are input into an explicit time sequence modeling module to obtain +.>-individual explicit video characterization, said->The individual explicit video representations form 1 positive sample pair sum +.>Aligning the text-video space by using a contrast learning paradigm and outputting the correct explicit video representation +.>Explicit video characterization +.>Implicit video characterizationResidual connection is performed to obtain the representation of the whole video>
S6: will characterizeAnd performing inner product operation on the classification matrix, and calculating a prediction score to obtain a prediction result.
2. The video motion recognition method based on transfer learning according to claim 1, wherein step S2 specifically comprises:
taking the video text label set as the processing object, using a large language model to expand each video text label into a detailed tag description;
converting the tag description into a word vector through a tokenizer, with a length determined by the description string length and the text vector length;
encoding the word vector with the text encoder of the CLIP model to obtain a text feature of a given feature dimension;
merging the text features of the tag descriptions of all label classes into a classification matrix, in which each row is the text feature of one label class.
3. The video motion recognition method based on transfer learning according to claim 1, wherein step S3 specifically comprises:
uniformly sampling a number of frames from the video and performing data enhancement on the resulting video frame set, where each frame has a given height, width and number of channels;
stacking the data-enhanced video frames along the channel dimension to obtain the input image of the network model, whose spatial size is the crop size used in data enhancement;
encoding the input image with the image encoder of the CLIP model to obtain the video frame feature map of a given feature dimension.
4. The video motion recognition method based on transfer learning of claim 3, wherein in step S4 the implicit temporal modeling module includes a position encoding and an encoder-only self-attention mechanism, in which the implicit video frame representation map is computed by the self-attention mechanism and the implicit video representation is obtained from it by average pooling.
5. The video motion recognition method based on transfer learning according to claim 4, wherein step S5 specifically comprises:
taking the order of the implicit video frame representation map as the correct-order video frame features;
generating a random number sequence of the same length as the implicit video frame representation and using it to shuffle the video frame representation by slicing, repeating this S times to obtain S wrong-order video frame representations;
using the video frame feature map as the Query and fixing the correct-order and wrong-order video frame features as Key and Value respectively, computing the Query, Key and Value with a masked cross-attention mechanism, and reducing the dimension of the computed features through a fully connected layer to obtain S+1 video representations attending to different orders;
extracting a vector containing text information from the text features and adding this vector to the S+1 video representations to obtain the explicit output values;
taking the inner product of the explicit output values and the text feature of the corresponding video label to obtain a score column vector, and inputting the score column vector into a temporal loss function for contrastive learning to obtain the correct explicit video representation;
connecting the explicit video representation and the implicit video representation through a residual connection to obtain the representation of the whole video.
6. The video motion recognition method based on transfer learning of claim 5, wherein in the temporal loss function, the label text feature of the current video frame is compared against the outputs of the explicit temporal modeling module by dot product: the output produced by the correct-order video frame features is the positive sample, the outputs produced by all orderings of the video frame representations are the candidates, S is the number of negative samples, i.e. the number of shuffles, and a temperature coefficient scales the dot products; the loss is the contrastive loss over these scaled scores.
7. The video motion recognition method based on transfer learning of claim 5, wherein in step S6 a prediction loss function is constructed from the prediction scores using the cross-entropy loss, in which the prediction score of each sample is compared against that sample's label.
CN202410090020.6A 2024-01-23 2024-01-23 Video action recognition method based on transfer learning Active CN117612071B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202410090020.6A CN117612071B (en) 2024-01-23 2024-01-23 Video action recognition method based on transfer learning

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202410090020.6A CN117612071B (en) 2024-01-23 2024-01-23 Video action recognition method based on transfer learning

Publications (2)

Publication Number Publication Date
CN117612071A true CN117612071A (en) 2024-02-27
CN117612071B CN117612071B (en) 2024-04-19

Family

ID=89951979

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202410090020.6A Active CN117612071B (en) 2024-01-23 2024-01-23 Video action recognition method based on transfer learning

Country Status (1)

Country Link
CN (1) CN117612071B (en)

Citations (16)


Patent Citations (16)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20170357720A1 (en) * 2016-06-10 2017-12-14 Disney Enterprises, Inc. Joint heterogeneous language-vision embeddings for video tagging and search
US20200241545A1 (en) * 2019-01-30 2020-07-30 Perceptive Automata, Inc. Automatic braking of autonomous vehicles using machine learning based prediction of behavior of a traffic entity
CN111079532A (en) * 2019-11-13 2020-04-28 杭州电子科技大学 Video content description method based on text self-encoder
WO2022083335A1 (en) * 2020-10-20 2022-04-28 神思电子技术股份有限公司 Self-attention mechanism-based behavior recognition method
US20230154169A1 (en) * 2021-11-15 2023-05-18 Qualcomm Incorporated Video processing using delta distillation
US20230316749A1 (en) * 2022-03-04 2023-10-05 Samsung Electronics Co., Ltd. Method and apparatus for video action classification
CN114724240A (en) * 2022-03-22 2022-07-08 南京甄视智能科技有限公司 Behavior recognition method, electronic device and storage medium
WO2023229094A1 (en) * 2022-05-27 2023-11-30 주식회사 엔씨소프트 Method and apparatus for predicting actions
US20240013558A1 (en) * 2022-07-07 2024-01-11 Beijing Baidu Netcom Science Technology Co., Ltd. Cross-modal feature extraction, retrieval, and model training method and apparatus, and medium
CN115393949A (en) * 2022-07-14 2022-11-25 河北大学 Continuous sign language recognition method and device
CN115796029A (en) * 2022-11-28 2023-03-14 东南大学 NL2SQL method based on explicit and implicit characteristic decoupling
CN116109980A (en) * 2023-02-14 2023-05-12 杭州电子科技大学 Action recognition method based on video text matching
CN116383671A (en) * 2023-03-27 2023-07-04 武汉大学 Text image cross-mode pedestrian retrieval method and system with implicit relation reasoning alignment
CN116721458A (en) * 2023-05-04 2023-09-08 桂林电子科技大学 Cross-modal time sequence contrast learning-based self-supervision action recognition method
CN116958677A (en) * 2023-07-25 2023-10-27 重庆邮电大学 Internet short video classification method based on multi-mode big data
CN117351392A (en) * 2023-09-28 2024-01-05 西北工业大学 Method for detecting abnormal behavior of video

Non-Patent Citations (6)

* Cited by examiner, † Cited by third party
Title
DING Jiang, et al.: "Cross-modal implicit relation reasoning and aligning for text-to-image person retrieval", Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 31 December 2023, pages 2787-2797
HUI Zheng, et al.: "A cross view learning approach for skeleton-based action recognition", IEEE Transactions on Circuits and Systems for Video Technology, vol. 32, no. 5, 26 July 2021, pages 3061-3072
HUI Zheng, et al.: "A cross-modal learning approach for recognizing human actions", IEEE Systems Journal, vol. 15, no. 2, 30 June 2021, pages 2022-2330
YUE Liu, et al.: "Contrastive predictive coding with transformer for video representation learning", Neurocomputing, vol. 482, 14 April 2022, pages 154-162, XP086978621, DOI: 10.1016/j.neucom.2021.11.031
ZHANG Chao et al.: "Multi-modal multi-label emotion recognition algorithm based on label embedding", Network Security and Data Governance, vol. 41, no. 7, 31 January 2022, pages 101-107
ZHANG Jingran: "Research on video spatio-temporal modeling and its robustness in action recognition", China Doctoral Dissertations Full-text Database (Information Science and Technology), no. 1, 15 January 2023, pages 138-93

Also Published As

Publication number Publication date
CN117612071B (en) 2024-04-19

Similar Documents

Publication Publication Date Title
CN107239801B (en) Video attribute representation learning method and video character description automatic generation method
CN112070114B (en) Scene character recognition method and system based on Gaussian constraint attention mechanism network
CN113283336A (en) Text recognition method and system
CN112488055A (en) Video question-answering method based on progressive graph attention network
CN113516152B (en) Image description method based on composite image semantics
CN116912642A (en) Multimode emotion analysis method, device and medium based on dual-mode and multi-granularity interaction
CN114495129A (en) Character detection model pre-training method and device
CN115130591A (en) Cross supervision-based multi-mode data classification method and device
CN111340006B (en) Sign language recognition method and system
CN113806747B (en) Trojan horse picture detection method and system and computer readable storage medium
Chen et al. Cross-lingual text image recognition via multi-task sequence to sequence learning
CN117934803A (en) Visual positioning method based on multi-modal feature alignment
CN115761764A (en) Chinese handwritten text line recognition method based on visual language joint reasoning
CN117809218B (en) Electronic shop descriptive video processing system and method
CN117456581A (en) Method for recognizing facial expression from image pre-training model to video
CN114821420B (en) Time sequence action positioning method based on multi-time resolution temporal semantic aggregation network
CN117612071B (en) Video action recognition method based on transfer learning
CN110929013A (en) Image question-answer implementation method based on bottom-up entry and positioning information fusion
Pan et al. Quality-aware clip for blind image quality assessment
Han et al. NSNP-DFER: a nonlinear spiking neural P network for dynamic facial expression recognition
CN114254080A (en) Text matching method, device and equipment
CN114529908A (en) Offline handwritten chemical reaction type image recognition technology
CN111339782A (en) Sign language translation system and method based on multilevel semantic analysis
Zhang Design and Implementation of the Chinese Character Font Recognition System Based on Binary Convolutional Encoding and Decoding Network
CN118536049B (en) Content main body discovery method based on multi-mode abnormal content understanding

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant