CN117876940B - Video language task execution and model training method, device, equipment and medium thereof


Info

Publication number
CN117876940B
Authority
CN
China
Prior art keywords
video
frame
text
features
language
Prior art date
Legal status
Active
Application number
CN202410270242.6A
Other languages
Chinese (zh)
Other versions
CN117876940A (en)
Inventor
金良
赵雅倩
闫瑞栋
范宝余
郭振华
尹云峰
Current Assignee
Inspur Electronic Information Industry Co Ltd
Original Assignee
Inspur Electronic Information Industry Co Ltd
Priority date
Filing date
Publication date
Application filed by Inspur Electronic Information Industry Co Ltd filed Critical Inspur Electronic Information Industry Co Ltd
Priority to CN202410270242.6A priority Critical patent/CN117876940B/en
Publication of CN117876940A publication Critical patent/CN117876940A/en
Application granted granted Critical
Publication of CN117876940B publication Critical patent/CN117876940B/en

Classifications

    • G06V 20/41 Scenes; Scene-specific elements in video content; Higher-level, semantic clustering, classification or understanding of video scenes, e.g. detection, labelling or Markovian modelling of sport events or news items
    • G06N 3/0499 Computing arrangements based on biological models; Neural networks; Architecture, e.g. interconnection topology; Feedforward networks
    • G06N 3/08 Computing arrangements based on biological models; Neural networks; Learning methods
    • G06V 10/82 Arrangements for image or video recognition or understanding using pattern recognition or machine learning using neural networks

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Software Systems (AREA)
  • General Physics & Mathematics (AREA)
  • Evolutionary Computation (AREA)
  • Computing Systems (AREA)
  • Health & Medical Sciences (AREA)
  • Artificial Intelligence (AREA)
  • Computational Linguistics (AREA)
  • General Health & Medical Sciences (AREA)
  • Biomedical Technology (AREA)
  • Molecular Biology (AREA)
  • General Engineering & Computer Science (AREA)
  • Data Mining & Analysis (AREA)
  • Mathematical Physics (AREA)
  • Biophysics (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Multimedia (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Databases & Information Systems (AREA)
  • Medical Informatics (AREA)
  • Image Analysis (AREA)

Abstract

The invention discloses a video language task execution and model training method, device, equipment and medium, applied to the technical field of video understanding. The method comprises: inputting a video sample carrying a text label, a video parameter to be learned and a frame parameter to be learned into a video language model; extracting visual features and parameter features with a visual language pre-training model; converting the visual features, based on the frame parameter to be learned, into frame visual information meeting the requirements of the visual language pre-training model by means of a video frame adapter; extracting video visual information with a video adapter based on the video parameter to be learned; and iteratively updating the video language model according to the loss information between the frame visual information, the video visual information and the text semantic features until a preset model training ending condition is met. The method and the device can solve the problems of slow convergence and time- and resource-consuming training of video language models in the related art, effectively improve the training efficiency of the video language model, and save the computing resources required by model training.

Description

Video language task execution and model training method, device, equipment and medium thereof
Technical Field
The present invention relates to the field of video understanding technologies, and in particular, to a method and apparatus for executing a video language task and training a model thereof, an electronic device, and a readable storage medium.
Background
The video language model is capable of understanding the inherent relationship of visual modalities to language modalities and may be used to perform video language related tasks including, but not limited to, video content understanding and classification tasks, video subtitle translation and generation tasks.
Video language models in the related art suffer from weak correlation between the visual modality and the text modality and from differing focus ranges of the text over the video, so that the models converge slowly and training is time- and resource-consuming.
In view of this, improving training efficiency of video language models is a technical problem that needs to be solved by those skilled in the art.
Disclosure of Invention
The invention provides a video language task execution and model training method, device, electronic equipment and readable storage medium thereof, which can effectively improve the training efficiency of a video language model and save the calculation resources required by model training.
In order to solve the technical problems, the invention provides the following technical scheme:
in one aspect, the present invention provides a video language model training method, including:
Acquiring a video sample data set carrying a text description tag, preset video parameters to be learned and frame parameters to be learned;
Inputting a video sample in the video sample data set, the video parameter to be learned and the frame parameter to be learned into the video language model; the video language model comprises a target visual language pre-training model, a video frame adapter and a video adapter; the target visual language pre-training model is used for extracting visual characteristics and parameter characteristics and inputting the visual characteristics and the parameter characteristics into the video frame adapter and the video adapter correspondingly, the video frame adapter is used for converting the visual characteristics into frame visual information meeting the requirements of the target visual language pre-training model, and the video adapter is used for extracting video visual information;
According to the frame visual information, the video visual information and the loss information of text semantic features, iteratively updating the video language model until a preset model training ending condition is met;
The parameter features corresponding to the video parameters to be learned are input to the video adapter, and the parameter features corresponding to the frame parameters to be learned are input to the video frame adapter, so that text-related visual information is obtained by using the frame parameters to be learned.
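For orientation, the following PyTorch-style sketch shows one way the components named above could be wired together: a frozen visual language pre-training backbone, a trainable video frame adapter and video adapter, and the shared frame parameters and video parameters to be learned. All class names, tensor shapes and the forward interface are illustrative assumptions rather than the patented implementation.

```python
import torch
import torch.nn as nn

class VideoLanguageModel(nn.Module):
    """Hypothetical skeleton: frozen VLP backbone, two adapters, learnable parameters."""
    def __init__(self, vlp_image_encoder, vlp_text_encoder, frame_adapter, video_adapter,
                 dim=768, n_frame_tokens=16, n_video_tokens=32):
        super().__init__()
        self.image_encoder = vlp_image_encoder   # frozen target VLP image encoder
        self.text_encoder = vlp_text_encoder     # frozen target VLP text encoder
        self.frame_adapter = frame_adapter       # trainable video frame adapter
        self.video_adapter = video_adapter       # trainable video adapter
        # frame parameters to be learned, shared by all video frames
        self.frame_params = nn.Parameter(torch.randn(n_frame_tokens, dim) * 0.02)
        # video parameters to be learned
        self.video_params = nn.Parameter(torch.randn(n_video_tokens, dim) * 0.02)

    def forward(self, frames, frame_text_ids, video_text_ids):
        # frames: (B, T, C, H, W) sampled images of one video sample per batch element
        B, T = frames.shape[:2]
        visual = self.image_encoder(frames.flatten(0, 1))          # (B*T, N, D) visual features
        visual = visual.reshape(B, T, *visual.shape[1:])
        frame_text = self.text_encoder(frame_text_ids)             # video frame text features
        video_text = self.text_encoder(video_text_ids)             # video text features
        # frame adapter: frame parameters + frame text + visual features -> frame visual information
        frame_info = self.frame_adapter(self.frame_params, frame_text, visual)
        # video adapter: video parameters + (visual features, frame visual information) -> video visual information
        video_info = self.video_adapter(self.video_params, visual, frame_info)
        return frame_info, video_info, frame_text, video_text
```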
In a first exemplary embodiment, the parameter features corresponding to the frame parameters to be learned are frame parameter features, the text description tag includes a video frame description text tag, the text semantic features corresponding to the video frame description text tag are video frame text features, and the video frame adapter includes a frame input layer, a text encoding layer, a cross-modal fusion layer, a feature enhancement layer and a frame output layer;
the frame input layer is used for receiving the splicing result of the frame parameter characteristics and the video frame text characteristics; the text coding layer is used for coding the splicing result based on the current attention mask to obtain frame parameter coding characteristics; the cross-modal fusion layer is used for carrying out cross-modal fusion processing on the frame parameter coding features and the visual features; the feature enhancement layer is used for carrying out feature enhancement processing on the fusion result and inputting enhanced features into the text coding layer; the frame output layer is used for outputting frame visual information.
In a second exemplary embodiment, the cross-modal fusion layer is a cross-modal attention mechanism layer, and the cross-modal fusion processing of the frame parameter coding feature and the visual feature includes:
Taking the frame parameter coding features as query vectors and the visual features as a group of value vectors and key vectors, the frame parameter coding features and the visual features are encoded based on a cross-modal attention mechanism, and the encoding result is taken as the fusion result.
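Cross-modal fusion of this form is commonly realized with multi-head attention in which the frame parameter coding features supply the queries and the visual features supply the keys and values. The snippet below is a hedged sketch using the standard PyTorch attention module; dimensions and names are illustrative assumptions.

```python
import torch
import torch.nn as nn

class CrossModalFusion(nn.Module):
    """Sketch of the cross-modal fusion layer: queries come from the frame parameter
    coding features, keys/values come from the visual features of one frame."""
    def __init__(self, dim=768, num_heads=8):
        super().__init__()
        self.attn = nn.MultiheadAttention(embed_dim=dim, num_heads=num_heads, batch_first=True)

    def forward(self, frame_param_codes, visual_feats):
        # frame_param_codes: (B, Q, D) query vectors
        # visual_feats:      (B, N, D) key/value vectors from the image encoder
        fused, _ = self.attn(query=frame_param_codes, key=visual_feats, value=visual_feats)
        return fused  # fusion result passed on to the feature enhancement layer

# toy usage
fusion = CrossModalFusion()
out = fusion(torch.randn(2, 16, 768), torch.randn(2, 197, 768))
print(out.shape)  # torch.Size([2, 16, 768])
```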
In a third exemplary embodiment, the feature enhancement layer includes a first feature enhancement layer, an interactive feature extraction layer, and a second feature enhancement layer;
The first characteristic enhancement layer is used for carrying out layer normalization processing on the fusion result and obtaining a first interaction enhancement characteristic through residual error connection;
the interaction feature extraction layer is used for extracting features of the first interaction enhancement features to obtain second interaction enhancement features;
and the second feature enhancement layer is used for carrying out layer normalization processing on the second interaction enhancement features and applying a residual connection.
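The three sub-layers described above (layer normalization with a residual connection, interactive feature extraction, and a second normalization with a residual connection) resemble a post-attention Transformer block. The sketch below is written under that assumption; the feed-forward form of the interactive feature extraction layer is an assumed choice, not something the patent specifies.

```python
import torch.nn as nn

class FeatureEnhancement(nn.Module):
    """Sketch: residual + LayerNorm, feed-forward interaction, residual + LayerNorm."""
    def __init__(self, dim=768, hidden=3072):
        super().__init__()
        self.norm1 = nn.LayerNorm(dim)                       # first feature enhancement layer
        self.ffn = nn.Sequential(                            # interactive feature extraction layer
            nn.Linear(dim, hidden), nn.GELU(), nn.Linear(hidden, dim))
        self.norm2 = nn.LayerNorm(dim)                       # second feature enhancement layer

    def forward(self, fusion_result, residual):
        x = self.norm1(fusion_result + residual)             # first interaction enhancement features
        y = self.ffn(x)                                      # second interaction enhancement features
        return self.norm2(y + x)                             # enhanced features fed back to the text encoding layer
```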
In a fourth exemplary embodiment, the training process of the video frame adapter includes:
Extracting the characteristics of the frame visual information corresponding to the current frame to obtain the image frame characteristics corresponding to the current frame image;
extracting text features of the video frames corresponding to the current frames to obtain text features of the image frames corresponding to the current frame images;
And carrying out iterative updating on the video frame adapter according to the loss information between each image frame characteristic and the corresponding image frame text characteristic.
In a fifth exemplary embodiment, the iteratively updating the video frame adapter according to the loss information between each image frame feature and the corresponding image frame text feature includes:
Determining a frame-text matching penalty by predicting whether an image frame feature and an image frame text feature are positively matched or negatively mismatched using the video frame adapter;
Determining a frame-text contrast penalty by comparing the similarity between the image frame features and the image frame text features;
Masking off part of the video frame text features, predicting the masked video frame text features with the video frame adapter being trained, based on the image frame text features corresponding to the remaining video frame text features and the image frame features, and determining a text generation loss;
determining a penalty function for the video frame adapter based on the frame-to-text matching penalty, the frame-to-text contrast penalty, and the text generation penalty.
In a sixth exemplary embodiment, the determining the frame-text contrast loss by comparing the similarity between the image frame feature and the image frame text feature includes:
Taking the image frame characteristics and the image frame text characteristics which are positively matched as a group of positive samples, and taking the image frame characteristics and the image frame text characteristics which are negatively unmatched as a group of negative samples;
Calculating positive similarity between the image frame features and the image frame text features in each group of positive samples, and calculating negative similarity between the image frame features and the image frame text features in each group of negative samples;
a frame-text contrast loss is determined by comparing the positive similarity to the negative similarity.
In a seventh exemplary embodiment, the determining the frame-text contrast loss by comparing the similarity between the image frame feature and the image frame text feature includes:
Invoking a contrast loss function relation, and calculating frame-text contrast loss; the contrast loss function relationship is:
$$\mathrm{Loss}_{ITG} = -\frac{1}{N_{ITG}} \sum_{i=1}^{N_{ITG}} \log \frac{\exp\!\big(\theta(Z_i, T_i)/\tau\big)}{\sum_{j=1}^{N_{ITG}} \exp\!\big(\theta(Z_i, T_j)/\tau\big)}$$

where Loss_ITG is the frame-text contrast loss, exp is the exponential function, Z_i is the i-th image frame feature, T_i is the image frame text feature matching the i-th image frame feature, T_j is the j-th image frame text feature that does not match the i-th image frame feature, N_ITG is the total number of image frame text features matching image frame features, θ(·,·) is the similarity between an image frame feature and an image frame text feature, and τ is a temperature parameter to be optimized.
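For reference, a direct implementation of a contrastive loss of the reconstructed form above might look as follows; cosine similarity is an assumed choice for the similarity θ, and the batch-wise arrangement of positives and negatives is an illustrative assumption.

```python
import torch
import torch.nn.functional as F

def frame_text_contrastive_loss(frame_feats, text_feats, tau=0.07):
    """Sketch of Loss_ITG: frame_feats and text_feats are (N, D) tensors whose i-th
    rows form a positive pair; every other pairing in the batch is treated as negative.
    Cosine similarity is assumed for theta, tau is the temperature parameter."""
    z = F.normalize(frame_feats, dim=-1)
    t = F.normalize(text_feats, dim=-1)
    logits = z @ t.t() / tau                      # (N, N) similarity matrix scaled by temperature
    targets = torch.arange(z.size(0), device=z.device)
    return F.cross_entropy(logits, targets)       # negative log-softmax of the positive pair, averaged over N

# toy usage
loss = frame_text_contrastive_loss(torch.randn(8, 768), torch.randn(8, 768))
```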
In an eighth exemplary embodiment, the determining the loss function of the video frame adapter based on the frame-text matching loss, the frame-text contrast loss, and the text generation loss comprises:
determining an image frame-image frame text penalty from the frame-text matching penalty, the frame-text contrast penalty, and the text generation penalty;
Masking target image frames of the video samples, predicting the target image frames through a video frame adapter trained based on image frame text features and image frame features corresponding to the masked video samples, and determining video frame mask loss;
and determining a loss function of the video frame adapter according to the image frame-image frame text loss and the video frame mask loss.
In a ninth exemplary embodiment, the determining the video frame mask loss includes:
Invoking a video frame mask loss function relation, and calculating video frame mask loss; the video frame mask loss function relationship is:
$$\mathrm{Loss}_{MEF} = \mathbb{E}_{(V,T)\sim D}\left[\frac{1}{K}\sum_{k=1}^{K}\big\|\,O(V_m^{k}) - \mathrm{model}_k(\hat{V},\,T)\,\big\|^{2}\right]$$

where Loss_MEF is the video frame mask loss, the expectation E_{(V,T)~D} is taken over the random distribution D inside the mini-batch of video samples, V denotes the image frame features, V_m^k is the k-th masked target image frame, O(V_m^k) is the corresponding target image frame feature, V̂ denotes the image frame features of the video sample that are not masked, T denotes the image frame text features corresponding to V̂, K is the number of image frames masked inside the mini-batch of video samples, and model_k(·) denotes the prediction result of the video frame adapter for the k-th masked image frame.
In a tenth exemplary embodiment, the parameter features corresponding to the video parameters to be learned are video parameter features, and the video adapter includes a video input layer, a parameter encoder layer, a feature fusion layer, a feature extraction layer, and a video output layer;
Wherein the video input layer is configured to receive the joint feature of the visual feature and the frame visual information; the parameter encoder layer is used for encoding the video parameter characteristics to obtain video parameter encoding characteristics; the feature fusion layer is used for carrying out fusion processing on the video parameter coding features and the joint features; the feature extraction layer is used for extracting features of the fusion processing result and transmitting the extracted features to the parameter encoder layer; the video output layer is used for outputting video visual information.
In an eleventh exemplary embodiment, the feature fusion layer includes a first video feature enhancement layer, a cross-modality learning layer, and a second video feature enhancement layer;
The first video feature enhancement layer is used for carrying out residual connection on the video parameter coding features and the video parameter features, and carrying out layer normalization processing to obtain parameter enhancement features;
The cross-modal learning layer is used for carrying out fusion processing on the video parameter coding feature and the joint feature based on a cross-modal attention mechanism by taking the parameter enhancement feature as a query vector and taking the joint feature as a group of value vectors and key vectors to obtain a multi-modal fusion feature;
and the second video feature enhancement layer is used for carrying out residual connection on the multi-mode fusion features and carrying out layer normalization processing to obtain a fusion processing result.
In a twelfth exemplary embodiment, the video language model further comprises a docking network layer; the docking network layer comprises a first converter model, a video feature extraction layer and a joint layer;
The first converter model is used for fusing the visual features based on a self-attention mechanism to obtain visual fusion features; the video feature extraction layer is used for extracting features of the visual fusion features and converting the dimensions of the extracted features into dimensions identical to the input dimensions of the video adapter; the joint layer is used for combining the frame visual information and the output of the video feature extraction layer and inputting joint features to the video adapter.
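A hedged sketch of such a docking network layer is given below: a self-attention Transformer encoder (the first converter model) fuses the visual features, a projection layer matches the video adapter input dimension, and the joint layer concatenates the projected features with the frame visual information. Layer counts and dimensions are assumptions.

```python
import torch
import torch.nn as nn

class DockingNetwork(nn.Module):
    """Sketch: first converter (Transformer) model + video feature extraction layer + joint layer."""
    def __init__(self, dim=768, adapter_dim=768, num_layers=2, num_heads=8):
        super().__init__()
        layer = nn.TransformerEncoderLayer(d_model=dim, nhead=num_heads, batch_first=True)
        self.fuser = nn.TransformerEncoder(layer, num_layers=num_layers)   # self-attention fusion
        self.proj = nn.Linear(dim, adapter_dim)                            # video feature extraction layer

    def forward(self, visual_feats, frame_visual_info):
        # visual_feats:      (B, T*N, D) visual features of all sampled frames
        # frame_visual_info: (B, M, adapter_dim) output of the video frame adapter
        fused = self.fuser(visual_feats)                                   # visual fusion features
        projected = self.proj(fused)                                       # match the video adapter input dimension
        return torch.cat([frame_visual_info, projected], dim=1)            # joint features for the video adapter
```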
In a thirteenth exemplary embodiment, the text description tag includes a video description text tag, the text semantic feature corresponding to the video description text tag is a video text feature, and the training process of the video adapter includes:
extracting video features of the video visual information;
Extracting coding text features corresponding to the video text features;
And carrying out iterative updating on the video adapter according to the loss information between the video characteristics and the coded text characteristics.
In a fourteenth exemplary embodiment, the step of determining the loss information between the video feature and the encoded text feature includes:
invoking a video-text loss calculation relation, and calculating video-text loss of the video adapter, wherein the video-text loss calculation relation is as follows:
$$\mathrm{Loss}_{G} = -\frac{1}{N_{G}} \sum_{i=1}^{N_{G}} \log \frac{\exp\!\big(\theta(V_i, T_i)/\tau\big)}{\sum_{j=1}^{N_{G}} \exp\!\big(\theta(V_i, T_j)/\tau\big)}$$

where Loss_G is the video-text loss, N_G is the total number of matched pairs of video features and encoded text features in the current batch, V_i is the i-th video feature in the current batch, T_i is the encoded text feature matching the i-th video feature, T_j is the j-th encoded text feature, which for j ≠ i does not match the i-th video feature, θ(·,·) represents the similarity between a video feature and an encoded text feature, and τ is a temperature parameter to be optimized.
In a fifteenth exemplary embodiment, the inputting the video sample, the preset video parameter to be learned, and the frame parameter to be learned into the video language model includes:
Performing image sampling processing on the video sample to obtain a multi-frame sample image;
Extracting image features of each frame of sample image by using an image encoder of the target visual language pre-training model to obtain visual features;
Extracting text semantic features of text description tags of the video samples by using a text encoder of the target visual language pre-training model;
And respectively extracting the video parameters to be learned and the parameter characteristics corresponding to the frame parameters to be learned by using a text encoder of the target visual language pre-training model.
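The input processing steps above can be summarized in the following sketch; the frame-sampling stride, the encoder interfaces, and the handling of the parameters to be learned are assumptions made for illustration.

```python
import torch

def prepare_inputs(video_frames, frame_text_ids, video_text_ids, video_params,
                   frame_params, image_encoder, text_encoder, stride=8):
    """Hedged sketch of the input processing: sample frames, extract visual and text
    semantic features with the frozen VLP encoders, and produce the parameter features
    for the two groups of parameters to be learned."""
    sampled = video_frames[::stride]                     # image sampling of the video sample
    with torch.no_grad():                                # the VLP encoders stay frozen
        visual_feats = image_encoder(sampled)            # visual features of each sampled frame
        frame_text_feats = text_encoder(frame_text_ids)  # video frame text features
        video_text_feats = text_encoder(video_text_ids)  # video text features
    # frame parameter features: the random initialisation result is used directly
    frame_param_feats = frame_params
    # video parameter features: encoded by the text encoder (assumed interface); gradients
    # still flow back to video_params even though the encoder weights are frozen
    video_param_feats = text_encoder(video_params)
    return visual_feats, frame_text_feats, video_text_feats, frame_param_feats, video_param_feats
```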
In a sixteenth exemplary embodiment, the extracting, by the text encoder using the target visual language pre-training model, the parameter features corresponding to the video parameter to be learned and the frame parameter to be learned includes:
carrying out random initialization processing on the frame parameters to be learned by using a text encoder of the target visual language pre-training model, and taking a random initialization result of the frame parameters to be learned as frame parameter characteristics;
And utilizing a text encoder of the target visual language pre-training model to encode the video parameters to be learned based on the current attention mask, so as to obtain video parameter characteristics.
In a seventeenth exemplary embodiment, the text description tags include a video description text tag and a video frame description text tag, the text encoder using the target visual language pre-training model extracting text semantic features of the text description tag of the video sample, comprising:
extracting video text features of the video description text labels by using a text encoder of the target visual language pre-training model;
Performing tokenization processing on the video frame description text labels by using a text encoder of the target visual language pre-training model, and performing word embedding processing on the tokenization processing result to obtain video frame text characteristics;
And utilizing a text encoder of the target visual language pre-training model to encode the video description text label based on the current attention mask so as to obtain video text characteristics.
In an eighteenth exemplary embodiment, the extracting, by the image encoder using the target visual language pre-training model, image features of each frame of sample image to obtain visual features includes:
dividing the current frame image into a plurality of image blocks with non-overlapping contents;
converting each image block into a one-dimensional representation through linear mapping, and adding position coding information to the corresponding image block;
And inputting the image block subjected to linear mapping and position coding to an encoder of a second converter model, and extracting the characteristics of the output of the encoder of the second converter model to obtain the visual characteristics of the video sample.
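This per-frame encoding follows the familiar ViT-style pipeline of non-overlapping patches, linear projection, position encoding, and a Transformer encoder (the second converter model). A compact sketch, with the patch size and dimensions as assumptions:

```python
import torch
import torch.nn as nn

class FramePatchEncoder(nn.Module):
    """Sketch of per-frame visual feature extraction: patchify, linearly map,
    add position encodings, and run a Transformer encoder."""
    def __init__(self, img_size=224, patch=16, dim=768, num_layers=6, num_heads=8):
        super().__init__()
        self.num_patches = (img_size // patch) ** 2
        # Conv2d with stride == kernel size splits the image into non-overlapping blocks
        self.patch_embed = nn.Conv2d(3, dim, kernel_size=patch, stride=patch)
        self.pos_embed = nn.Parameter(torch.zeros(1, self.num_patches, dim))
        layer = nn.TransformerEncoderLayer(d_model=dim, nhead=num_heads, batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, num_layers=num_layers)

    def forward(self, frame):                    # frame: (B, 3, H, W) current frame image
        x = self.patch_embed(frame)              # (B, D, H/patch, W/patch)
        x = x.flatten(2).transpose(1, 2)         # (B, num_patches, D) one-dimensional patch tokens
        x = x + self.pos_embed                   # add position coding information
        return self.encoder(x)                   # visual features of the frame
```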
In a nineteenth exemplary embodiment, the parameter features corresponding to the frame parameters to be learned are frame parameter features, the text description tag includes a video frame description text tag and a video description text tag, the text semantic features corresponding to the video frame description text tag are video frame text features, the text semantic features corresponding to the video description text tag are video text features, and the training process of the video language model includes:
taking the video frame description text label, the frame parameters to be learned and the video sample data set as inputs, freezing the image encoder of the target visual language pre-training model, and training the video frame adapter so that the frame parameters to be learned are used to acquire the visual information corresponding to the video frame description text label;
When the training of the video frame adapter is completed, taking the video frame description text label, the frame parameters to be learned, the video parameters to be learned and the video sample data set as inputs to train the video adapter.
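The two-stage schedule described above (first train the video frame adapter with the image encoder frozen, then train the video adapter) might be organized as in the sketch below; the optimizer, the learning rates, and the loss helper methods are assumptions.

```python
import torch

def two_stage_training(model, frame_loader, video_loader, lr1=1e-4, lr2=1e-5):
    """Hypothetical sketch: stage 1 trains the video frame adapter and the shared frame
    parameters; stage 2 trains the video adapter and the video parameters. The VLP
    image/text encoders stay frozen throughout."""
    for p in model.image_encoder.parameters():
        p.requires_grad_(False)                  # freeze the image encoder of the VLP model
    for p in model.text_encoder.parameters():
        p.requires_grad_(False)

    # Stage 1: video frame adapter + frame parameters to be learned
    stage1_params = list(model.frame_adapter.parameters()) + [model.frame_params]
    opt1 = torch.optim.AdamW(stage1_params, lr=lr1)
    for batch in frame_loader:
        opt1.zero_grad()
        loss = model.frame_adapter_loss(batch)   # assumed helper computing Loss_frame
        loss.backward()
        opt1.step()

    # Stage 2: video adapter + video parameters, typically with a reduced learning rate
    stage2_params = list(model.video_adapter.parameters()) + [model.video_params]
    opt2 = torch.optim.AdamW(stage2_params, lr=lr2)
    for batch in video_loader:
        opt2.zero_grad()
        loss = model.video_language_loss(batch)  # assumed helper computing the total Loss
        loss.backward()
        opt2.step()
```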
In a twentieth exemplary embodiment, before said training said video adapter, further comprising:
When a learning rate adjustment instruction is received, updating the current learning rate according to the new learning rate of the learning rate adjustment instruction; the new learning rate is less than the current learning rate.
In a twenty-first exemplary embodiment, the training the video frame adapter comprises:
invoking a video frame adapter loss function, and training the video frame adapter; the video frame adapter loss function is:
$$\mathrm{Loss}_{frame} = \alpha_{0}\,\mathrm{Loss}_{ITM} + \alpha_{1}\,\mathrm{Loss}_{ITC} + \alpha_{2}\,\mathrm{Loss}_{ITG} + \beta\,\mathrm{Loss}_{MEF}$$

where Loss_frame represents the video frame adapter loss function, Loss_ITM is the frame-text matching loss, Loss_ITC is the text generation loss, Loss_ITG is the frame-text contrast loss, Loss_MEF is the video frame mask loss, α_0 is the frame-text matching loss coefficient, α_1 is the text generation loss coefficient, α_2 is the frame-text contrast loss coefficient, and β is the video frame mask loss coefficient.
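A literal implementation of this weighted sum is straightforward; the default coefficient values in the sketch are placeholders rather than values disclosed by the patent.

```python
def video_frame_adapter_loss(loss_itm, loss_itc, loss_itg, loss_mef,
                             a0=1.0, a1=1.0, a2=1.0, beta=1.0):
    """Sketch of Loss_frame as the weighted sum above; coefficients are placeholders."""
    return a0 * loss_itm + a1 * loss_itc + a2 * loss_itg + beta * loss_mef
```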
In a twenty-second exemplary embodiment, the training the video adapter comprises:
invoking a video language loss function to train the video adapter; the video language loss function is:
$$\mathrm{Loss} = \alpha\,\mathrm{Loss}_{frame} + \gamma\,\mathrm{Loss}_{G}$$

where Loss represents the video language loss function, Loss_frame is the video frame adapter loss function, α is the video frame adapter loss function coefficient, Loss_G is the video-text loss, and γ is the video-text loss coefficient.
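The overall objective combines the video frame adapter loss with the video-text loss in the same way; again the coefficient values below are placeholders.

```python
def video_language_loss(loss_frame, loss_g, alpha=1.0, gamma=1.0):
    """Sketch of the overall training objective: weighted combination of the video
    frame adapter loss and the video-text loss; coefficient values are assumptions."""
    return alpha * loss_frame + gamma * loss_g
```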
The second aspect of the present invention provides a video language task execution method, including:
training to obtain a video language model by using the video language model training method according to any one of the previous claims;
Acquiring a video language task to be executed and a corresponding video language task training sample set;
Based on the video language task, utilizing the video language task training sample set to finely tune the video language model;
And executing the video language task by utilizing the fine-tuned video language model.
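A hedged sketch of this fine-tuning step is given below; the task head, the pooling of the video visual information, and the data loader format are assumptions.

```python
import torch
import torch.nn as nn

def finetune_for_task(video_language_model, task_head, task_loader, epochs=3, lr=1e-5):
    """Sketch: attach a task-specific head (e.g. a classifier for video content
    understanding) and fine-tune on the video language task training sample set."""
    params = list(video_language_model.parameters()) + list(task_head.parameters())
    opt = torch.optim.AdamW(params, lr=lr)
    loss_fn = nn.CrossEntropyLoss()
    for _ in range(epochs):
        for frames, frame_text, video_text, labels in task_loader:
            opt.zero_grad()
            # assumed forward interface returning (frame_info, video_info, frame_text, video_text)
            _, video_info, _, _ = video_language_model(frames, frame_text, video_text)
            logits = task_head(video_info.mean(dim=1))   # pool the video visual information
            loss = loss_fn(logits, labels)
            loss.backward()
            opt.step()
```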
In a first exemplary embodiment, the video language task to be performed is a video content understanding task, and the video language task training sample set is a video sample set of a plurality of video samples carrying video content tags; the fine tuning of the video language model based on the video language task using the video language task training sample set includes:
And based on the video content understanding task, utilizing the video sample set to finely tune the video language model so as to execute the video content understanding task by utilizing the finely tuned video language model.
A third aspect of the present invention provides a video language model training apparatus, comprising:
the data acquisition module is used for acquiring a video sample data set carrying a text description tag, preset video parameters to be learned and preset frame parameters to be learned;
The input data processing module is used for inputting the video sample in the video sample data set, the video parameter to be learned and the frame parameter to be learned into the video language model; the video language model comprises a target visual language pre-training model, a video frame adapter and a video adapter; the target visual language pre-training model is used for extracting visual characteristics and parameter characteristics and inputting the visual characteristics and the parameter characteristics into the video frame adapter and the video adapter correspondingly, the video frame adapter is used for converting the visual characteristics into frame visual information meeting the requirements of the target visual language pre-training model, and the video adapter is used for extracting video visual information; the method comprises the steps that parameter characteristics corresponding to video parameters to be learned are input to a video adapter, and parameter characteristics corresponding to frame parameters to be learned are input to the video frame adapter so as to obtain text-related visual information by utilizing the frame parameters to be learned;
And the model parameter updating module is used for carrying out iterative updating on the video language model according to the frame visual information, the video visual information and the loss information of the text semantic features until the preset model training ending condition is met.
A fourth aspect of the present invention provides a video language task execution device, including:
the model training module is used for training to obtain a video language model by utilizing the video language model training method according to any one of the previous claims;
the data acquisition module is used for acquiring a video language task to be executed and a corresponding video language task sample set;
The model fine tuning module is used for carrying out fine tuning on the video language model by utilizing the video language task sample set based on the video language task;
and the task execution module is used for executing the video language task by utilizing the fine-tuned video language model.
The fifth aspect of the present invention also provides an electronic device comprising a processor for implementing the steps of the video language model training method according to any one of the preceding claims when executing a computer program stored in a memory.
The invention finally provides a readable storage medium having stored thereon a computer program which, when executed by a processor, implements the steps of the video language model training method of any one of the preceding claims.
The technical scheme provided by the invention has the following advantages. The video frame adapter converts the visual features of the video sample into frame visual information that meets the requirements of the visual language pre-training model, and the frame parameter to be learned is used to learn text-related visual information, assisting the model in building associations between different frames and guiding the model to attend to different visual information; this alleviates the weak correlation between the video modality and the language modality and adapts the visual language pre-training model to video frames. The video adapter integrates the frame text visual information, addressing the problem of the language modality focusing range; the video parameter to be learned assists the model in establishing semantic correspondences over the whole video sequence and understanding the global information of the video, and integrating the global video information with the frame text visual information compensates for the information loss caused by the attention bias of the frame text visual information. As a result, visual information of the video at different levels, such as local details and global semantics, is fully utilized, the visual understanding capability of the model for video is improved, and the representation capability of the video language model is enhanced, so that the video language model converges rapidly during training, the training efficiency of the video language model is improved, and the computing resources required for training are saved. Furthermore, the video language model is built by adding the video frame adapter and the video adapter on the basis of an existing visual language pre-training model, without changing the original visual language pre-training model structure, redesigning the entire network structure, or retraining the video language model with a large amount of video text data; the rich visual representations and strong cross-modal interaction capability of the visual language pre-training model can thus be migrated to video language tasks, preserving the original model performance while enhancing the expansibility and flexibility of the visual language pre-training model.
In addition, the invention also provides a corresponding video language task execution method, an implementation device, electronic equipment and a readable storage medium aiming at the video language model training method, so that the method has more practicability, and the video language task execution method, the video language task execution device, the electronic equipment and the readable storage medium have corresponding advantages.
It is to be understood that both the foregoing general description and the following detailed description are exemplary and explanatory only and are not restrictive of the invention as claimed.
Drawings
For a clearer description of the present invention or of the technical solutions related thereto, the following brief description will be given of the drawings used in the description of the embodiments or of the related art, it being obvious that the drawings in the following description are only some embodiments of the present invention, and that other drawings may be obtained from these drawings without the inventive effort of a person skilled in the art.
FIG. 1 is a schematic flow chart of a video language model training method provided by the invention;
FIG. 2 is a schematic diagram of a video frame adapter in an exemplary application scenario according to the present invention;
FIG. 3 is a schematic diagram of a training process for a video frame adapter in an exemplary application scenario provided by the present invention;
FIG. 4 is a schematic diagram of a video adapter in an exemplary application scenario according to the present invention;
FIG. 5 is a schematic diagram of a video adapter and docking network in an exemplary application scenario according to the present invention;
FIG. 6 is a schematic diagram of a training process for a video adapter in an exemplary application scenario provided by the present invention;
FIG. 7 is a schematic flow chart of a video language task execution method provided by the invention;
FIG. 8 is a schematic diagram of a hardware framework of an exemplary application scenario of the video language task execution method provided by the present invention;
FIG. 9 is a schematic diagram of a video language model in an exemplary application scenario according to the present invention;
FIG. 10 is a block diagram of an embodiment of a video language model training apparatus provided by the present invention;
FIG. 11 is a block diagram of an embodiment of a video language task execution device provided by the present invention;
fig. 12 is a block diagram of an embodiment of an electronic device according to the present invention.
Detailed Description
In order to make the technical scheme of the present invention better understood by those skilled in the art, the present invention will be further described in detail with reference to the accompanying drawings and the detailed description. Wherein the terms "first," "second," "third," "fourth," and the like in the description and in the above figures are used for distinguishing between different objects and not necessarily for describing a particular sequential or chronological order. Furthermore, the terms "comprise" and "have," as well as any variations of the two, are intended to cover a non-exclusive inclusion. The term "exemplary" means "serving as an example, embodiment, or illustration. Any embodiment described herein as "exemplary" is not necessarily to be construed as preferred or advantageous over other embodiments.
The video language model is a cross-modal model capable of deeply understanding the internal relation between a visual mode and a language mode, and is widely applied to various application scenes related to video language, for example, the video language model can help a user to quickly locate and understand video contents in the application scenes of searching and annotating video. The ability to accurately analyze user interests and preferences via a video language model can also drive the development of video summary generation and personalized recommendation systems. In addition, the video language model also understands the video contents such as actual scenes, objects, actions, emotion and the like, and is beneficial to the development of the related fields of natural language description, such as the generation of titles, subtitles and storylines. For application scenes of video questions and answers and conversations, the video language model enables a computer to understand and accurately respond to video related questions.
However, current video language models face the problem of the language modality focusing range: the portion of the video that a text description focuses on varies; for example, a text description may focus on a few frames, on a segment of the video, or on the whole video, which slows the convergence of video language model training and causes the execution result of the final video language task to fall short of user requirements. In addition, video comment text descriptions are short, and this sparsity of video text descriptions makes it more difficult to model the association between video features and text descriptions, again slowing training convergence and degrading video language model performance. In the current video language model training process, there is also inconsistency between the visual information in a sample video and the semantic information in its description: a scene in a video may not establish a clear correspondence with a specific semantic concept in the description, which reduces the correlation between the description and the video, slows convergence, and prevents the video language model from meeting user needs. Further, video language tasks often require training complex deep learning models with large numbers of parameters and hierarchical structures, which increases the computing and memory resources required during training; within a certain range, the model structure and the scale of the training samples are also positively correlated with model performance, so obtaining a well-performing video language model requires multiple iterations and experiments for training and tuning, further increasing training time and the consumption of computing resources.
Therefore, video language models in the related art suffer from weak correlation between video and language and from differing focus ranges of the text over the video, so that the models converge slowly and training is time- and resource-consuming. Furthermore, video language pre-training models in the related art require redesigning the entire network structure and then retraining the model on a large amount of video text data: video features and text features are extracted first, and the visual information and text information are then mapped to a unified semantic space through contrastive learning or an attention-based Transformer (converter) model. This is time-consuming and labor-intensive, generally requires a large amount of computing resources, and offers poor flexibility and extensibility.
In view of this, in order to solve the problems in the related art that video language pre-training consumes excessive resources, that video is weakly correlated with language, and that the language modality focuses on varying ranges of the video modality, the invention completes the adaptation of a visual language pre-training model by adding, on top of a visual language pre-training model of the related art, a video frame adapter, a video adapter, a frame parameter to be learned shared by all video frames, and a small number of video parameters to be learned, without changing the original model parameters. This transfers the rich visual representations and strong cross-modal interaction capability of the visual language pre-training model to video language tasks, preserving the original model performance, enhancing the expansibility and flexibility of the video language model, effectively improving the training efficiency of the video language model, and saving the computing resources required for model training. Having described aspects of the invention, various non-limiting embodiments of the invention are described in detail below. Numerous specific details are set forth in the following description in order to provide a better understanding of the invention. It will be understood by those skilled in the art that the present invention may be practiced without these specific details. In other instances, well known methods and means have not been described in detail so as not to obscure the present invention.
Referring to fig. 1, fig. 1 is a flow chart of a video language model training method provided in this embodiment, where the embodiment may include the following:
S101: and acquiring a video sample data set carrying a text description tag, a preset video parameter to be learned and a preset frame parameter to be learned.
In this embodiment, the video sample data set is a training sample data set used for training a video language model, which may include a large number of video samples covering a plurality of rich scenes, and the number of video samples may be flexibly determined according to actual requirements, which does not affect the implementation of the present invention. Each video sample is provided with a text description tag, the text description tag is used for describing the video sample, the text description tag is text data, the video samples in the video sample data set are all labeled with text description in advance, and labeled information is the text description tag.
The frame parameter to be learned is a variable or a group of variables that enables the video frames to learn text-related visual information during training of the video language model; it can be flexibly determined according to the specific practical application scenario. The frame parameter to be learned is a learnable parameter of the video frames, and all video frames share one frame parameter to be learned; the shared frame parameter helps the video language model establish associations and understand correlations among different video frames. In a video, different frames may contain the same person, the same scene, or related objects. Sharing the frame parameter to be learned assists the model in building associations between different frames, for example helping other frames find related persons or objects, thereby improving the accuracy and consistency of the video language model in multi-frame video understanding and analysis. In addition, since the visual information corresponding to the text can be obtained through the frame parameter to be learned, different frames can obtain different visual information, which is equivalent to providing different viewing angles and information; sharing the frame parameter to be learned among different frames guides the video language model to attend to this information while comprehensively using the complementary information, so that the understanding and representation capability of the video language model for video language tasks is improved. The video parameter to be learned is likewise a variable or a group of variables that enables the video to learn and acquire text-related visual information during training of the video language model, and can be flexibly determined according to the specific practical application scenario; it is a learnable parameter of the video and assists the video language model in establishing semantic correspondences over the whole video sequence and understanding the global information of the video. The video parameter to be learned enables the video language model to establish semantic correspondences over the whole video sequence and understand global video information, while the frame parameter to be learned enables local alignment and understanding at the video frame level. Combining the video parameter to be learned and the frame parameter to be learned achieves fine-grained alignment across time scales, makes full use of visual information of the video at different levels such as local details and global semantics, improves the understanding and expression capability of the model for the global and local features of the video to be processed or the video sample, improves its visual understanding capability, and enhances the representation capability of the video language model.
S102: and inputting the video sample, the video parameter to be learned and the frame parameter to be learned in the video sample data set into the video language model.
In this step, the video language model is a pre-training model frame pre-built based on the target visual language pre-training model, the video frame adapter and the video adapter, that is, the model structure of the video language model includes the target visual language pre-training model, the video frame adapter and the video adapter. The target visual language pre-training model is used for extracting visual characteristics and parameter characteristics and inputting the visual characteristics and the parameter characteristics into the video frame adapter and the video adapter correspondingly, the target visual language pre-training model can simultaneously understand and fuse visual and language information, cross-modal visual and language pre-training models are realized, the target visual language pre-training model can learn the association between the visual information and the language information by pre-training data on a large-scale image-text pair, and therefore the model can simultaneously process and understand the image and text information and has good performance and generalization capability in a plurality of visual and language tasks.
The target visual language pre-training model of the present embodiment may employ any VLP (Vision-Language Pre-training) model of the related art, including but not limited to the ViLBERT (Vision-and-Language BERT, where BERT denotes Bidirectional Encoder Representations from Transformers) model, the LXMERT (Learning Cross-Modality Encoder Representations from Transformers) model, the VideoBERT model, the VisualBERT model, the UniVL (Unified Video and Language Pre-training) model, the UNITER (Universal Image-Text Representation) model, the CLIP (Contrastive Language-Image Pre-training) model, the OSCAR (Object-Semantics Aligned Pre-training) model, and the MOCA model. The VideoBERT model is a joint model that can be used for video and language representation learning; through pre-training it can extract and understand information from video and the text associated with it, allowing the model to understand visual concepts in the video and how those concepts relate to natural language descriptions. The UniVL model is a unified video and language pre-training model for multimodal understanding and generation; it enables a unified representation of video and language by jointly optimizing understanding and generation tasks, so that the model can both understand and generate language descriptions related to the video. The VisualBERT model is a simple and efficient baseline model for vision and language learning that combines visual and language information and learns multimodal representations through pre-training and fine-tuning tasks. The MOCA model performs video and text representation learning through memory-augmented contrastive learning: it builds positive and negative pairs between video and text for self-supervised training and uses a memory mechanism to enhance the effect of contrastive learning. The ViLBERT model realizes cross-modal understanding and reasoning over images and text by jointly training an image encoder and a text encoder to learn the interrelationship between images and text. The UNITER model is pre-trained on large-scale image-text pair data, learns visual and linguistic representations, and has multimodal understanding and generation capabilities. The CLIP model is a visual language pre-training model trained through contrastive learning; by learning the similarity between images and text, it acquires a strong image-text matching capability and can judge the relevance of an image to a piece of text. The OSCAR model is a VLP model targeting object and semantic alignment; it learns visual and linguistic representations by jointly training visual and language encoders and enables alignment and correlation between images and text.
In this step, after a corresponding number of video samples are selected from the video sample data set according to preset training parameters and input to the video language model, the image encoder of the target visual language pre-training model is used to extract the visual features of the video samples, and the text encoder of the target visual language pre-training model is used to extract the text semantic features of the text description tags of the video samples and the parameter features corresponding to the parameters to be learned. The image encoder of the target visual language pre-training model transmits the extracted visual features to the video frame adapter and the video adapter respectively, the parameter features corresponding to the video parameters to be learned are input to the video adapter, and the text encoder of the target visual language pre-training model inputs the parameter features corresponding to the frame parameters to be learned to the video frame adapter.
In this embodiment, because the image-text correlation of the target visual language pre-training model is often weak, and this weak correlation is even more pronounced in video language data, the image features extracted by the target visual language pre-training model, although rich in semantic information and possessing strong semantic consistency and generalization performance, still cannot meet the requirement of adapting video frames; the image features therefore need to be converted by the video frame adapter into frame features that meet the requirements of the visual language pre-training model. In other words, according to the received visual features, the video frame adapter forces the frame parameters to be learned to acquire text-related visual information and thereby converts the visual features into frame visual information that meets the requirements of the target visual language pre-training model. It can be understood that the frame text visual information extracted by the video frame adapter is fine-grained, frame-level information and lacks an understanding of the video as a whole; on this basis, the video adapter combines the visual features extracted by the image encoder of the target visual language pre-training model and learns, through the video parameters to be learned, the visual information related to the video text. In other words, the video adapter receives the visual features, learns the text information of the video through the video parameters to be learned, assists the model in establishing semantic correspondences over the whole video sequence and understanding the global information of the video, and extracts the video visual information.
S103: and carrying out iterative updating on the video language model according to the frame visual information, the video visual information and the loss information of the text semantic features until the preset model training ending condition is met.
After the frame visual information and the video visual information corresponding to the video sample are obtained in the previous step, they are compared with the text semantic features of the text description tags corresponding to the video sample, and the model parameters of the video language model are updated by continually reducing the difference between them and the text semantic features. The video language model may be trained with mini-batch stochastic gradient descent until a preset model training ending condition is reached; for example, the condition may be that the number of iterations reaches a preset value, that the video language model converges, or that the precision of the video language model reaches a preset precision threshold, none of which affects the implementation of the method. Before the gradient update iterations, the model initializes the gradient descent algorithm and sets the epoch (training period), batch_size (batch size), weight update period t, and number of iterations. For example, the video sample data set may contain 60,000 video samples, and the video language model is trained for at least 100 training periods; one training period means that all training samples in the training set are used once, without repetition, to update the model parameters of the neural network, taking one batch of data at a time to update the model parameters of the video language model and thereby complete one pass of training. In the gradient update iteration process, 500 video samples are used per iteration update; these 500 video samples are referred to as one batch of data, i.e., batch_size samples. The number of iterations refers to the number of updates performed using batch_size samples, so completing one epoch requires iteration = 60000 / 500 = 120 iterations. The weight update period means that, during training of the video language model, the weights are updated once every t iterations. When the preset model training ending condition is reached, the trained video language model can make full use of visual information of the video at different levels and, through joint learning of video and language representations, effectively improves the capability of understanding and reasoning about video content, so that the capability of the visual language pre-training model is transferred to the video language pre-training task, the performance of the original model is preserved, and its expansibility and flexibility are enhanced, making it suitable for a wider range of multimodal applications.
In the technical scheme provided by this embodiment, the video frame adapter converts the visual features of the video sample into frame visual information that meets the requirements of the visual language pre-training model, and the frame parameter to be learned is used to learn text-related visual information, assisting the model in building associations between different frames and guiding the model to attend to different visual information; this alleviates the weak correlation between the video modality and the language modality and adapts the visual language pre-training model to video frames. The video adapter integrates the frame text visual information, addressing the problem of the language modality focusing range; the video parameter to be learned assists the model in establishing semantic correspondences over the whole video sequence and understanding the global information of the video, and integrating the global video information with the frame text visual information compensates for the information loss caused by the attention bias of the frame text visual information. As a result, visual information of the video at different levels, such as local details and global semantics, is fully utilized, the visual understanding capability of the model for video is improved, and the representation capability of the video language model is enhanced, so that the video language model converges rapidly during training, the training efficiency of the video language model is improved, and the computing resources required for training are saved. Furthermore, the video language model is built by adding the video frame adapter and the video adapter on the basis of an existing visual language pre-training model, without changing the original visual language pre-training model structure, redesigning the entire network structure, or retraining the video language model with a large amount of video text data; the rich visual representations and strong cross-modal interaction capability of the visual language pre-training model can thus be migrated to video language tasks, preserving the original model performance while enhancing the expansibility and flexibility of the visual language pre-training model.
In the above embodiment, the structure of the video frame adapter is not limited, and based on the above embodiment, as shown in fig. 2, the present invention also provides an exemplary structure of the video frame adapter, which may include the following:
in this embodiment, the video frame adapter may include a frame input layer, a text encoding layer, a cross-modal fusion layer, a feature enhancement layer, and a frame output layer. The frame input layer is used for receiving the splicing result of the frame parameter features and the video frame text features; the text encoding layer has a built-in text encoder, which is used to encode the splicing result based on the current attention mask to obtain frame parameter coding features and frame text coding features; the cross-modal fusion layer is used for performing cross-modal fusion processing on the frame parameter coding features and the visual features; the feature enhancement layer is used for performing feature enhancement processing on the fusion result and feeding the enhanced features back into the text encoding layer, this process being repeated until a first preset repetition count is reached, for example M1 times, where M1 can be 200, to obtain the frame visual information; and the frame output layer is used for outputting the frame visual information.
For convenience of representation, the parameter features corresponding to the frame parameters to be learned may be defined as frame parameter features. The text description tag includes a video frame description text label and a video description text label: the video description text label is the text data corresponding to the whole video, and the video frame description text label is the text data corresponding to the current frame image. For example, the video description text label may be "an indicator light on a server panel is flashing", and the video frame description text label may be "an indicator light on the server panel is lit and emits green light". For ease of description, the text semantic features corresponding to the video frame description text labels may be defined as video frame text features. The frame parameter features are the features output after the text encoder of the target visual language pre-training model encodes the frame parameters to be learned, and the video frame text features are the features obtained after the text encoder of the target visual language pre-training model encodes the video frame description text labels; the frame parameters to be learned and the video frame description text labels can be encoded correspondingly according to the model type adopted by the target visual language pre-training model. The frame input layer may preset an attention mask, whose values indicate which parts of the input are encoded. For example, the attention mask can be expressed as mask = [m_1^lc, ..., m_c^lc, m_1^w, ..., m_m^w], where m_i^lc is the mask of the i-th frame parameter to be learned and m_i^w is the mask of the i-th word; different mask values represent encoding different types of data. For example, when only the frame parameters to be learned are encoded, the video frame description text label is masked off: each m_i^lc is set to 1, and if a video frame description text label is input, each m_i^w is set to 0; if no video frame description text label is input, no setting is needed. When only the video frame description text label is encoded, the frame parameters to be learned are masked off: each m_i^w is set to 1, and if frame parameters to be learned are input, each m_i^lc is set to 0; if no frame parameters to be learned are input, no setting is needed. If the frame parameters to be learned and the video frame description text labels are encoded at the same time, all values in the attention mask can be set to 1, indicating simultaneous encoding of the two modalities. The frame text coding features are the features obtained after the text encoder of the video frame adapter encodes the video frame text features, and the frame parameter coding features are the features obtained after the text encoder of the video frame adapter encodes the frame parameter features. The text encoder built into the text encoding layer may be of the same type as, or a different type from, the text encoder of the target visual language pre-training model; this does not affect the implementation of the present invention.
The cross-modal fusion layer can adopt any mechanism capable of realizing cross-modal data fusion: it forces the frame parameters to be learned to extract visual features related to the text and, acting as a bridge, promotes the interaction and integration of information between the two modalities. Internally it can adopt any method capable of extracting the intrinsic relationship between different modal data and fusing them, such as any attention mechanism or a cross-modal attention mechanism. Taking a cross-modal attention mechanism as an example, the cross-modal fusion layer can be a cross-modal attention mechanism layer: the frame parameter coding features serve as the query vectors, the visual features serve as a set of value vectors and key vectors, the frame parameter coding features and the visual features are encoded based on the cross-modal attention mechanism, and the encoding result is taken as the fusion result. In order to further improve the precision of the extracted fusion features and accelerate the convergence of the video frame adapter, layer normalization can be used to accelerate model convergence, while a residual structure further improves the generalization capability. Correspondingly, the feature enhancement layer can include a first feature enhancement layer, an interaction feature extraction layer and a second feature enhancement layer: the first feature enhancement layer performs layer normalization on the fusion result and obtains a first interaction enhancement feature through a residual connection; the interaction feature extraction layer extracts features from the first interaction enhancement feature to obtain a second interaction enhancement feature; and the second feature enhancement layer performs layer normalization on the second interaction enhancement feature and applies a residual connection. The interaction feature extraction layer can be any model structure capable of extracting deep features, for example a feedforward neural network, a fully connected neural network or a video feature extraction layer; a feedforward neural network first maps the data to a high-dimensional space and then to a low-dimensional space through linear transformations, thereby extracting deeper features. For example, let the visual features corresponding to the current image frame be v and the frame parameter coding features be p; the result of encoding them based on the cross-modal attention mechanism may be expressed as feat_0 = CrossAttention(p, v, v). Layer normalization in the first feature enhancement layer with a residual connection gives the first interaction enhancement feature feat_1 = LayerNorm(feat_0 + α_0·p), where α_0 is a residual coefficient. The interaction feature extraction layer is a feedforward neural network (i.e., Feed Forward); the feed-forward operation on the first interaction enhancement feature gives the second interaction enhancement feature FFN(feat_1). Layer normalization and a residual connection are applied again to obtain feat_2 = LayerNorm(FFN(feat_1) + α_1·feat_1), where α_1 is the residual coefficient. The resulting feat_2 is input into the text encoding layer, the above steps are executed in sequence using the first attention mask mode described above, and the process is repeated for the first preset number of repetitions to obtain the final frame feature F_frame (i.e., the frame visual information), whose dimensions are the same as those of the input p.
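For illustration only, a minimal sketch of one such fusion-and-enhancement block is given below, assuming a PyTorch implementation; the layer sizes, residual coefficients, head count and the repetition count M1 are illustrative assumptions and not limiting.

```python
# A minimal sketch of one video frame adapter block; all hyperparameters are assumptions.
import torch
import torch.nn as nn

class FrameAdapterBlock(nn.Module):
    def __init__(self, dim=768, heads=8, alpha0=1.0, alpha1=1.0):
        super().__init__()
        self.cross_attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.norm1 = nn.LayerNorm(dim)
        self.norm2 = nn.LayerNorm(dim)
        self.ffn = nn.Sequential(nn.Linear(dim, 4 * dim), nn.GELU(), nn.Linear(4 * dim, dim))
        self.alpha0, self.alpha1 = alpha0, alpha1

    def forward(self, frame_param_code, visual_feat):
        # cross-modal fusion: frame parameter coding features are queries, visual features are keys/values
        fused, _ = self.cross_attn(frame_param_code, visual_feat, visual_feat)
        feat1 = self.norm1(fused + self.alpha0 * frame_param_code)   # first feature enhancement layer
        feat2 = self.norm2(self.ffn(feat1) + self.alpha1 * feat1)    # extraction + second enhancement layer
        return feat2   # fed back to the text encoding layer and repeated M1 times
```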
To make the encoding process clearer to those skilled in the art, this embodiment takes the target visual language pre-training model to be CLIP as an example, whose corresponding text encoder is the CLIP text Encoder. For example, the text encoder of the target visual language pre-training model can be used to randomly initialize the frame parameters to be learned, and the random initialization result is taken as the frame parameter features; the video frame description text labels are tokenized with the text encoder of the target visual language pre-training model, and word embedding is applied to the tokenization result to obtain the video frame text features. For example, the random initialization of the frame parameters to be learned (which may be called the frame learnable context) can be recorded as flc ∈ R^(C×D), where R is the set of real numbers, C is the embedding length, and D is the embedding dimension, equal to the dimension of the tokenized video frame text features. When initializing the frame parameters to be learned, the mean can be 0 and the variance can be 1. The tokenization result of the video frame description text label can be recorded as token = Tokenizer(text), where text is the video frame description text label and Tokenizer can be the tokenizer of any BERT model; word embedding can then be applied to the tokenized result to obtain the video frame text features, and the word embedding result can be expressed as w = word_embedding(token). Here word_embedding is the embedding corresponding to the video frame adapter; taking the target visual language pre-training model to be CLIP as an example, with the CLIP text Encoder as the corresponding text encoder, the corresponding outputs are flc and w. The random initialization result flc of the frame parameters to be learned is spliced with the video frame text features w, and the splicing result is taken as the input of the frame input layer of the video frame adapter: input = [f_1^lc, ..., f_c^lc, w_1, ..., w_m], where f_i^lc is the frame parameter feature corresponding to the i-th frame parameter to be learned and w_i is the embedding of the i-th word, i.e., the video frame text features. The random initialization result flc and the video frame text features are encoded with the text encoder built into the text encoding layer of the video frame adapter, and the encoding result can be recorded as output = [p_1, ..., p_c, q_1, ..., q_m], where the first c entries are the frame parameter coding features corresponding to the frame parameter features, which may be denoted P = [p_1, ..., p_c], and the last m entries are the encodings of the video frame text features, i.e., the frame text coding features, which may be denoted Q = [q_1, ..., q_m]. This embodiment needs to encode both the frame parameters to be learned and the video frame description text labels, so the attention mask here adopts the third mode described above.
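A small sketch of building the frame input layer's input follows; the tokenizer, the embedding table, the vocabulary size and the dimensions C, D and m are stand-in assumptions rather than the CLIP implementation.

```python
# Illustrative sketch of the splicing result fed to the frame input layer; all values are assumptions.
import torch
import torch.nn as nn

C, D, vocab_size, m = 256, 768, 49408, 32

# frame learnable context: random init with mean 0 and variance 1
flc = nn.Parameter(torch.randn(C, D))

word_embedding = nn.Embedding(vocab_size, D)           # stands in for the adapter's word_embedding
token_ids = torch.randint(0, vocab_size, (m,))         # stands in for Tokenizer("indicator light ...")
w = word_embedding(token_ids)                          # video frame text features, shape (m, D)

frame_input = torch.cat([flc, w], dim=0)               # splicing result, shape (C + m, D)
attention_mask = torch.ones(C + m)                     # third mode: encode both modalities together
```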
As can be seen from the above, in this embodiment the video frame adapter migrates and adapts the visual information learned by the target visual language pre-training model, completing the frame feature extraction of the video language pre-training model and alleviating the weak correlation between video and language at the frame level. The layer normalization processing accelerates the convergence of the video frame adapter, and the residual structure further improves its generalization capability, which is beneficial to the training efficiency and performance of the whole video language model.
Based on the model structure of the video frame adapter determined in any of the above embodiments, the present embodiment further needs to train the video frame adapter, and the training process of the video frame adapter may include the following:
Extracting the characteristics of the frame visual information corresponding to the current frame to obtain the image frame characteristics corresponding to the current frame image; extracting the characteristics of the text characteristics of the video frame corresponding to the current frame to obtain the text characteristics of the image frame corresponding to the current frame image; and carrying out iterative updating on the video frame adapter according to the loss information between each image frame characteristic and the corresponding image frame text characteristic.
In this embodiment, as shown in fig. 3, the network structure interfacing with the video frame adapter may be used to extract the features of the video frame text features and of the frame visual information. A fully connected layer can be used to process the frame visual information corresponding to the current frame, and the result can be defined as the image frame feature Z = FC(F_frame); the input dimension of the fully connected layer equals the dimension of the frame visual information, and the output dimension is the same as that of the image frame text feature. A feedforward neural network can be used to extract the features of the video frame text features corresponding to the current frame, which can be defined as the image frame text feature T. Because the input layer of the video frame adapter receives the splicing result of the frame parameter features and the video frame text features, the image frame text feature corresponding to the current frame image can also be extracted directly from the frame text coding features corresponding to the current frame; correspondingly, T = FC(Q), where the input dimension of this fully connected layer equals the dimension of the frame text coding features and the output dimension is the same as that of the image frame feature.
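A minimal sketch of the two projection heads described above is given below; the dimensions and head structures are illustrative assumptions.

```python
# Hedged sketch of extracting the features compared during video frame adapter training.
import torch
import torch.nn as nn

dim = 768
frame_head = nn.Linear(dim, dim)                                   # FC over the frame visual information
text_head = nn.Sequential(nn.Linear(dim, dim), nn.ReLU(),
                          nn.Linear(dim, dim))                     # feed-forward over frame text coding features

frame_visual_info = torch.randn(1, dim)    # output of the video frame adapter for the current frame
frame_text_code = torch.randn(1, dim)      # frame text coding feature for the current frame

Z = frame_head(frame_visual_info)          # image frame feature
T = text_head(frame_text_code)             # image frame text feature
```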
The above embodiments do not limit how to iteratively update the video frame adapter according to the loss information between each image frame feature and the corresponding image frame text feature, i.e. do not limit any loss function adopted by the video frame adapter. Of course, as a simple implementation, the loss function of the video frame adapter may be based directly on a mean square error or cross entropy error or other related art loss function. In order to improve the performance of the video frame adapter, this embodiment also provides a determination manner of the loss function of the video frame adapter, which may include the following:
Determining a frame-text matching loss by using the video frame adapter to predict whether image frame features and image frame text features form a positive match or a negative mismatch; determining a frame-text contrast loss by comparing the similarity between the image frame features and the image frame text features; masking off part of the video frame text features, predicting the masked video frame text features with the video frame adapter being trained, based on the image frame text features corresponding to the remaining video frame text features and on the image frame features, and determining a text generation loss; and determining the loss function of the video frame adapter based on the frame-text matching loss, the frame-text contrast loss, and the text generation loss.
Here, the frame-text matching loss measures whether a video frame matches a text; its goal is to learn fine-grained alignment between video frames and text representations as a binary classification task. The video frame adapter is required to predict whether an image-text pair is a positive match or a negative mismatch, i.e., to predict y(V, T) ∈ {matched, unmatched}, where V and T are the video frame and text representations respectively, matched denotes a matching video frame and text representation (a positive match), and unmatched denotes a non-matching video frame and text representation (a negative mismatch). The frame-text contrast loss contrasts video frames with text; its goal is to achieve alignment by maximizing the mutual information between video frame and text representations. For example, the process of calculating the frame-text contrast loss may include: taking positively matched image frame features and image frame text features as a group of positive samples, and negatively mismatched image frame features and image frame text features as a group of negative samples; calculating the positive similarity between the image frame features and the image frame text features in each group of positive samples, and the negative similarity between the image frame features and the image frame text features in each group of negative samples; and determining the frame-text contrast loss by comparing the positive and negative similarities. In other words, the loss is obtained by comparing the video frame-text similarity of the positive sample pairs with that of the negative sample pairs. This embodiment can acquire visual information related to the text through the frame parameters to be learned and then compute its similarity with the text representation. The text generation loss is the loss of generating text from video frames: part of the text is masked off, the masked part is predicted from the image frames and the remaining text, and the final loss is obtained by cross entropy. To further enhance the joint understanding of image frames and text, a video frame mask loss may also be added to the calculation of the loss function of the video frame adapter on the basis of the above embodiments. For example, a certain frame feature in the video sample can first be masked, the masked video frame feature can be predicted from the other frames and the frame texts, and the difference between the ground truth and the prediction can be compared to obtain the corresponding video frame mask loss. Based on this, the calculation of the loss function of the video frame adapter may include: determining an image frame-image frame text loss from the frame-text matching loss, the frame-text contrast loss, and the text generation loss; masking target image frames of the video samples, predicting the target image frames with the video frame adapter being trained, based on the image frame text features and image frame features corresponding to the masked video samples, and determining a video frame mask loss; and determining the loss function of the video frame adapter based on the image frame-image frame text loss and the video frame mask loss.
In order to improve the model training efficiency, a contrast loss function relation and a video frame mask loss function relation can be stored in advance; the contrast loss function relation is called to calculate the frame-text contrast loss, and the video frame mask loss function relation is called to calculate the video frame mask loss. The contrast loss function relation can be expressed as:

Loss_ITG = -(1/N_ITG) · Σ_i log( exp(θ(Z_i, T_i)/τ) / ( exp(θ(Z_i, T_i)/τ) + Σ_j exp(θ(Z_i, T_j)/τ) ) )

The video frame mask loss function relation may be expressed as:

Loss_MTF = E_{V~D} [ (1/K) · Σ_{k=1}^{K} || O(V_m^k) - model(V̂, T) ||^2 ]

where Loss_ITG is the frame-text contrast loss, exp is the exponential function, Z_i is the i-th image frame feature, T_i is the image frame text feature matching the i-th image frame feature, T_j is the j-th image frame text feature that does not match the i-th image frame feature, N_ITG is the total number of image frame text features matching image frame features, θ is the similarity between an image frame feature and an image frame text feature, and τ is a parameter to be optimized. Loss_MTF is the video frame mask loss, E_{V~D} is the expectation over the random distribution inside the small batch of video samples, D represents the random distribution, V represents the image frame features, V_m^k is the k-th target (masked) image frame, O(V_m^k) is the target image frame feature, V̂ denotes the image frame features of the video sample that are not masked, T represents the image frame text features corresponding to V̂, k indexes the k-th image frame feature masked inside the small batch of video samples, K is the number of images masked inside the small batch of video samples, and model(·) represents the prediction result.
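A sketch of computing the two losses under the reconstruction above is given below; the batch construction, the cosine-similarity choice for θ and the masking strategy are assumptions made for the example.

```python
# Hedged sketch of the frame-text contrast loss and video frame mask loss.
import torch
import torch.nn.functional as F

def frame_text_contrast_loss(Z, T, tau=0.07):
    # Z: (N, d) image frame features, T: (N, d) matching image frame text features
    Z, T = F.normalize(Z, dim=-1), F.normalize(T, dim=-1)
    logits = Z @ T.t() / tau                 # theta(Z_i, T_j) taken as cosine similarity
    targets = torch.arange(Z.size(0), device=Z.device)
    return F.cross_entropy(logits, targets)  # matched pair against all mismatched pairs in the batch

def video_frame_mask_loss(target_frame_feat, predicted_frame_feat):
    # difference between the real (masked-out) frame feature and the model's prediction
    return F.mse_loss(predicted_frame_feat, target_frame_feat)
```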
From the above, the image frame-image frame text loss in this embodiment completes alignment and learning between modalities at three different levels, which helps the video frame adapter learn the fine-grained correspondence between image frames and text, improves the performance of semantic understanding and cross-modal tasks, and enhances the representation capability of the video language model on multi-modal data. Predicting the masked image frames from the other frames, the texts corresponding to the other frames and the text corresponding to the masked frame further enables the video frame adapter to learn richer modal representations and improves the joint understanding of image frames and text.
The above embodiments do not limit the structure of the video adapter. Based on the above embodiments, as shown in fig. 4, the present invention also provides an exemplary structure of the video adapter, which may include the following:
the video adapter comprises a video input layer, a parameter encoder layer, a feature fusion layer, a feature extraction layer and a video output layer. The video input layer is used for receiving the joint features of the visual features and the frame visual information; the parameter encoder layer is used for encoding the video parameter features to obtain video parameter encoding features; the feature fusion layer is used for fusing the video parameter encoding features with the joint features; the feature extraction layer is used for extracting features from the fusion result and transmitting the extracted features back to the parameter encoder layer. This process is repeated until a second preset repetition count is reached, for example M2 times, where M2 can be 300, to obtain the video visual information, and the video output layer outputs the video visual information.
In this embodiment, for convenience of representation, the parameter features corresponding to the video parameters to be learned may be defined as video parameter features, and the text semantic features corresponding to the video description text labels may be defined as video text features. The video parameter features are the features output after the text encoder of the target visual language pre-training model encodes the video parameters to be learned, and the video text features are the features obtained after the text encoder of the target visual language pre-training model encodes the video description text label; the video parameters to be learned and the video description text label can be encoded correspondingly according to the model type adopted by the target visual language pre-training model. Similarly, the text encoder of the target visual language pre-training model may also preset an attention mask, whose values indicate which parts of the input are encoded. Correspondingly, the text encoder of the target visual language pre-training model can encode the video parameters to be learned based on the current attention mask to obtain the video parameter features, and can extract the video text features of the video description text label. Taking the target visual language pre-training model to be CLIP as an example, the video parameters to be learned and the video description text label are encoded by the CLIP Text Encoder; the input of the CLIP Text Encoder can be recorded as [vlc; text_video], where vlc denotes the video parameters to be learned and text_video is the video description text label, and the output can be recorded as [g_1, ..., g_c, u_1, ..., u_m], where the encodings of the first c video parameters to be learned, i.e., the video parameter features, are denoted G_lc = [g_1, ..., g_c], and the last m entries are the video text features, denoted U = [u_1, ..., u_m]. The attention mask here sets the corresponding values in the third mode described above.
The parameter encoder layer can employ any self-attention mechanism to encode the video parameter features G_lc; correspondingly, the parameter encoder layer may be denoted a self-attention mechanism layer, as shown in fig. 5, and its output, the video parameter encoding feature, may be defined as gl_0 = SelfAttention(G_lc), where SelfAttention represents any self-attention mechanism, for example MultiHeadAttention with identical input and output dimensions and the number of heads set to 6. The feature fusion layer is used to learn visual information relevant to the input video parameter encoding features and can adopt any attention mechanism for this learning, such as a self-attention mechanism or a cross-attention mechanism, which does not affect the implementation of the invention. In order to further improve the precision of the extracted fusion features and accelerate the convergence of the video adapter, layer normalization can be used to accelerate model convergence, while a residual structure further improves the generalization capability. Correspondingly, the feature fusion layer can learn the visual information relevant to the video parameter encoding features with a cross-attention mechanism; it can thus be a cross-attention mechanism layer and can include a first video feature enhancement layer, a cross-modal learning layer and a second video feature enhancement layer. The first video feature enhancement layer applies a residual connection between the video parameter encoding features and the video parameter features and performs layer normalization to obtain a parameter enhancement feature; the cross-modal learning layer takes the parameter enhancement feature as the query vector and the joint feature as a set of value vectors and key vectors, and fuses the video parameter encoding features with the joint features based on a cross-modal attention mechanism to obtain a multi-modal fusion feature; and the second video feature enhancement layer applies a residual connection to the multi-modal fusion feature and performs layer normalization to obtain the fusion result. The feature extraction layer may be any model structure capable of extracting deep features, for example a feedforward neural network, a fully connected neural network layer or a video feature extraction layer, where a feedforward neural network first maps the data to a high-dimensional space and then to a low-dimensional space through linear transformations, thereby extracting deeper features. For example, a residual connection and layer normalization give gl_1 = LayerNorm(gl_0 + β_0·G_lc), where β_0 is a residual coefficient; the cross-modal cross-attention mechanism then learns the visual information related to the video parameters to be learned, i.e., gl_2 = CrossAttention(gl_1, joint, joint). A residual connection and layer normalization are applied again to obtain gl_3 = LayerNorm(gl_2 + β_1·gl_1), where β_1 is the residual coefficient. The feed-forward operation on the features gives gl_4 = FeedForward(gl_3). gl_4 is taken as the input of the corresponding step of the parameter encoder layer, and the process is repeated M2 times to obtain the video visual information G_video, i.e., the text-related visual information acquired by the video parameters to be learned.
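For illustration only, a minimal sketch of one such video adapter block follows, assuming a PyTorch implementation; the residual coefficients, head count, layer sizes and the repetition count M2 are illustrative assumptions.

```python
# A minimal sketch of one video adapter block; all hyperparameters are assumptions.
import torch
import torch.nn as nn

class VideoAdapterBlock(nn.Module):
    def __init__(self, dim=768, heads=6, beta0=1.0, beta1=1.0):
        super().__init__()
        self.self_attn = nn.MultiheadAttention(dim, heads, batch_first=True)   # parameter encoder layer
        self.cross_attn = nn.MultiheadAttention(dim, heads, batch_first=True)  # cross-modal learning layer
        self.norm1, self.norm2 = nn.LayerNorm(dim), nn.LayerNorm(dim)
        self.ffn = nn.Sequential(nn.Linear(dim, 4 * dim), nn.GELU(), nn.Linear(4 * dim, dim))
        self.beta0, self.beta1 = beta0, beta1

    def forward(self, video_param_feat, joint_feat):
        gl0, _ = self.self_attn(video_param_feat, video_param_feat, video_param_feat)
        gl1 = self.norm1(gl0 + self.beta0 * video_param_feat)        # first video feature enhancement layer
        gl2, _ = self.cross_attn(gl1, joint_feat, joint_feat)        # fuse with the joint features
        gl3 = self.norm2(gl2 + self.beta1 * gl1)                     # second video feature enhancement layer
        return self.ffn(gl3)                                         # feature extraction layer, fed back M2 times
```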
Based on the above embodiments, it can be appreciated that the video frame adapter can obtain visual information related to the text through the frame parameters to be learned; this information is beneficial to the visual information extraction of the whole video, but at the same time attention deviation occurs, resulting in a lack of visual information about the video as a whole. To further enhance the performance of the final video language model, this embodiment may, in addition to the original frame visual information, encode again the visual features output by the image encoder of the target visual language pre-training model to obtain the visual information of the whole video, and then fuse and extract the complete visual information of the video to fill the information loss of the video frame adapter caused by the attention deviation, which may include the following:
As shown in fig. 5, the video language model further includes a docking network layer; the docking network layer comprises a first converter model, a video feature extraction layer and a joint layer. The first converter model is used for fusing the visual features based on a self-attention mechanism to obtain visual fusion features; the video feature extraction layer is used for extracting features from the visual fusion features and converting the dimension of the extracted features into the same dimension as the input of the video adapter; and the joint layer is used for combining the frame visual information with the output of the video feature extraction layer and inputting the joint features to the video adapter. In this embodiment, the first converter model may be a Transformer: for the visual features F, the visual information can be fused with the Transformer self-attention mechanism to obtain f_tf = TF(F), where TF is any Transformer or encoder model; for example, the simplest MultiHeadAttention (multi-head attention mechanism) can be used, with identical input and output dimensions and the number of heads set to 3. The video information is then obtained by the video feature extraction layer, which is a combination of a linear rectification function and fully connected layers, such as a multi-layer perceptron MLP: f_v = MLP(f_tf). Richer visual information of the video can be obtained through the further abstraction and combination of visual semantic information by the video feature extraction layer, while f_v is mapped to the same dimension as the output of the video frame adapter. Given the result of the video frame adapter processing, F_frame, i.e., the frame visual information related to the video frame text, the joint layer combines the frame visual information F_frame and the video visual information f_v, and the joint feature can be defined as joint = [F_frame; f_v].
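A hedged sketch of the docking network layer is given below; the encoder choice, the MLP structure and the dimensions are illustrative assumptions.

```python
# Sketch of the docking network layer: a Transformer encoder fuses the per-frame visual
# features, an MLP reshapes them, and the joint layer concatenates the result with the
# frame visual information.
import torch
import torch.nn as nn

class DockingNetwork(nn.Module):
    def __init__(self, visual_dim=768, adapter_dim=768, heads=3):
        super().__init__()
        self.encoder = nn.TransformerEncoderLayer(d_model=visual_dim, nhead=heads, batch_first=True)
        self.mlp = nn.Sequential(nn.Linear(visual_dim, adapter_dim), nn.ReLU(),
                                 nn.Linear(adapter_dim, adapter_dim))

    def forward(self, visual_feats, frame_visual_info):
        # visual_feats: (B, N, visual_dim) features of the N sampled frames
        fused = self.encoder(visual_feats)          # self-attention fusion across frames
        video_info = self.mlp(fused)                # video feature extraction layer
        return torch.cat([frame_visual_info, video_info], dim=1)   # joint feature for the video adapter
```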
As can be seen from the above, in this embodiment the video adapter integrates the frame text visual information and the global video information, which compensates for the information loss caused by the attention deviation of the frame text visual information; the layer normalization processing accelerates the convergence of the video adapter, and the residual structure further improves the generalization capability, which is beneficial to the training efficiency and performance of the whole video language model.
Based on the model structure of the video adapter determined in any of the foregoing embodiments, this embodiment further needs to train the video adapter, and as shown in fig. 6, the training process of the video adapter may include the following:
Extracting video characteristics of video visual information; extracting coding text features corresponding to the video text features; and carrying out iterative updating on the video adapter according to the loss information between the video characteristics and the coded text characteristics.
In this embodiment, the network structure interfacing with the video adapter may be used to extract the video text features and the features of the video visual information. A fully connected layer processes the video visual information G_video, and the result may be defined as the video feature h = FC(G_video); the input dimension of the fully connected layer equals the dimension of the video visual information, and the output dimension is the same as that of the encoded text feature. A feedforward neural network can be used to extract the features of the video text features, which may be defined as the encoded text feature e = FFN(U); the input dimension of the feedforward neural network here equals the dimension of the video text features, and the output dimension is the same as that of the video feature.
The above embodiments do not limit how to iteratively update the video adapter according to the loss information between the video feature and the encoded text feature, i.e. the loss function employed by the video adapter. Of course, as a simple implementation, the loss function of the video adapter may be based directly on a mean square error or cross entropy error or other related art loss function. In order to improve the performance of the video adapter, this embodiment also provides a determination manner of the loss function of the video adapter, which may include the following:
A video-text loss calculation relation is stored locally in advance, and the video-text loss of the video adapter is calculated by calling it. The video-text loss calculation relation can be expressed as:

Loss_G = -(1/N_G) · Σ_{i'} log( exp(θ(h_{i'}, e_{i'})/τ) / ( exp(θ(h_{i'}, e_{i'})/τ) + Σ_{j'} exp(θ(h_{i'}, e_{j'})/τ) ) )

where Loss_G is the video-text loss, N_G is the total number of matched video feature and encoded text feature pairs in the current batch, h_{i'} is the i'-th video feature in the current batch, e_{i'} is the encoded text feature matching the i'-th video feature, e_{j'} is the j'-th encoded text feature that does not match the i'-th video feature, θ represents the similarity between a video feature and an encoded text feature, and τ is a parameter to be optimized. For the calculation relations involved in the present invention, since the base of the logarithm is a fixed number and does not affect the model training process, the base may be omitted; those skilled in the art can select the required base according to the actual situation, which does not affect the implementation of the present invention.
From the above, the video adapter with better performance can be trained by determining the loss function based on the visual characteristics and the text characterization similarity, and the performance of the video language model is improved.
The above embodiment does not limit how to process the input video sample, the preset video parameter to be learned and the frame parameter to be learned by using the target visual language pre-training model, and based on this, the present embodiment further provides an exemplary implementation, which may include the following:
Performing image sampling processing on the video samples to obtain multi-frame sample images; extracting image features of each frame of sample image by using an image encoder of the target visual language pre-training model to obtain visual features; extracting text semantic features of text description tags of the video samples by using a text encoder of the target visual language pre-training model; and respectively extracting parameter characteristics corresponding to the video parameters to be learned and the frame parameters to be learned by using a text encoder of the target visual language pre-training model.
In order to enable the image encoder of the target visual language pre-training model to output visual features quickly, a video frame disassembly operation can be performed, i.e., the continuous video stream is converted into individual image frames; a fixed time interval can be set in the time dimension, and N frames are selected by uniform sampling. The result for a single video sample can be recorded as V = {v_1, v_2, ..., v_N}, where N is the number of sampled frames and v_i is the i-th frame obtained after the current video is split into frames.
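A simple sketch of such uniform frame sampling follows, assuming OpenCV is available; the sample count N and the decoding pipeline are illustrative choices.

```python
# Hedged sketch of uniform frame sampling from a video file.
import cv2
import numpy as np

def sample_frames(video_path: str, n_frames: int = 8):
    cap = cv2.VideoCapture(video_path)
    total = int(cap.get(cv2.CAP_PROP_FRAME_COUNT))
    indices = np.linspace(0, total - 1, n_frames).astype(int)   # fixed, evenly spaced positions
    frames = []
    for idx in indices:
        cap.set(cv2.CAP_PROP_POS_FRAMES, int(idx))
        ok, frame = cap.read()
        if ok:
            frames.append(frame)
    cap.release()
    return frames   # V = {v_1, ..., v_N}
```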
In order to extract visual features with rich semantics and achieve effective representation of image visual information and cross-modal learning, the process by which the image encoder of the target visual language pre-training model extracts the image features of each sample frame may include: dividing the current frame image into a plurality of non-overlapping image blocks; converting each image block into a one-dimensional representation through a linear mapping and adding position encoding information to the corresponding image block; and inputting the linearly mapped and position-encoded image blocks into the encoder of the second converter model, then extracting features from the output of the encoder of the second converter model to obtain the visual features of the video sample. The final result, the video frame features (i.e., visual features) extracted by the image encoder of the target visual language pre-training model, can be recorded as F = {f_1, f_2, ..., f_N}, where f_i is the video feature of the i-th frame. Taking the target visual language pre-training model to be CLIP as an example, the image visual features are extracted by a Vision Transformer (ViT, visual encoder), which divides the input image into non-overlapping tiles of size 16x16. These tiles represent local areas of the image. Each tile is linearly mapped and converted from a two-dimensional representation to a one-dimensional representation. At the same time, a position code is added to each tile to capture its spatial position in the image. The linearly mapped and position-encoded tile sequence passes through a 12-layer Transformer encoder. The Transformer encoder uses a self-attention mechanism to integrate the context information in the tile sequence, model the global information of the image, and facilitate the interaction and fusion between different tiles. At the output of the Transformer encoder, feature extraction may be performed with a multi-layer perceptron in order to extract the image features and pass feature information between different locations, capturing details and semantic associations in the image. Through the above steps, visual features with rich semantics can be extracted from the image; these features capture the content and semantic information of the image for cross-modal matching and learning with the text input. The extracted image feature dimension is 197x768, where 197 is the length of the tile sequence and 768 is the feature dimension of each tile.
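A hedged sketch of the patch embedding step described above (16x16 tiles, linear mapping, position codes, sequence length 197) is given below; it is not the CLIP implementation itself, and the image size and dimensions are assumptions.

```python
# Sketch of splitting an image into tiles, linearly mapping them and adding position codes.
import torch
import torch.nn as nn

class PatchEmbed(nn.Module):
    def __init__(self, img_size=224, patch=16, dim=768):
        super().__init__()
        self.proj = nn.Conv2d(3, dim, kernel_size=patch, stride=patch)     # linear map of each 16x16 tile
        n_patches = (img_size // patch) ** 2
        self.cls_token = nn.Parameter(torch.zeros(1, 1, dim))
        self.pos_embed = nn.Parameter(torch.zeros(1, n_patches + 1, dim))  # 196 tiles + class token = 197

    def forward(self, x):                                                  # x: (B, 3, 224, 224)
        tokens = self.proj(x).flatten(2).transpose(1, 2)                   # (B, 196, 768)
        tokens = torch.cat([self.cls_token.expand(x.size(0), -1, -1), tokens], dim=1)
        return tokens + self.pos_embed                                     # sequence of length 197, dim 768
```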
From the above, through the video frame splitting and the above-mentioned visual feature extraction mode, the visual features with rich semantics can be extracted, the capability of effectively representing the visual information of the image and cross-modal learning is realized, and the performance of the video language model is effectively improved.
Furthermore, considering video language data or video language tasks, weak correlation between video and text can cause more difficult semantic understanding and modal correlation of a video language model, and if end-to-end direct training is adopted, model convergence can be slow, so the invention also provides a training mode of the video language model, which can comprise the following contents:
Taking the video frame description text labels, the frame parameters to be learned and the video sample data set as inputs, the video frame adapter is trained by freezing the image encoder of the target visual language pre-training model and using the frame parameters to be learned to acquire the visual information corresponding to the video frame description text labels; when training of the video frame adapter is completed, the video frame description text labels, the frame parameters to be learned, the video parameters to be learned and the video sample data set are taken as inputs to train the video adapter. Of course, before model training, the relevant training parameters need to be preset, including but not limited to setting the dimension of the frame parameters to be learned, for example C being 256 or 768, and setting the number of training rounds, the learning rate, the optimizer, the frame disassembly interval, the size of the images input to the image encoder and the image enhancement method. On the basis of completing the video frame adaptation, the learning rate of the video frame adapter can be lowered when training the video adapter: when a learning rate adjustment instruction is received, the current learning rate is updated to the new learning rate carried by the instruction, where the new learning rate is less than the current learning rate, for example reducing the previous 3e-3 to 5e-4. When training the video adapter, besides the video part, the video frame text description labels, the video text description label, the frame parameters to be learned and the video parameters to be learned are input, i.e., the current input is (V, text_video, vlc, text_frame, flc), where text_video is the video text description label, vlc denotes the video parameters to be learned, and text_frame and flc are respectively the video frame text description label and the frame parameters to be learned. When the video adapter is trained, combining the frame parameters to be learned with the video parameters to be learned achieves fine-grained alignment across time scales and improves the model's ability to understand and express the global and local features of the video. By combining the video frame adapter and the video adapter, feature extraction and adaptation are completed at the video frame level and the video level respectively, and the visual information of different levels in the video, including local details and global semantics, can be fully utilized, thereby improving the model's visual understanding of the video.
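A hedged sketch of this two-stage schedule follows: the image encoder is frozen while the video frame adapter is trained, and the video adapter is then trained with the frame adapter's learning rate reduced. The optimizer choice, the module attribute names and the loader formats are placeholders, not the patent's implementation; the 3e-3 and 5e-4 values follow the example above.

```python
# Sketch of the two-stage training schedule under the stated assumptions.
import torch

def train_two_stages(model, frame_loader, video_loader, frame_loss_fn, video_loss_fn):
    for p in model.image_encoder.parameters():       # freeze the image encoder
        p.requires_grad = False

    # stage 1: train the video frame adapter
    opt1 = torch.optim.AdamW(model.frame_adapter.parameters(), lr=3e-3)
    for videos, frame_texts in frame_loader:
        loss = frame_loss_fn(model, videos, frame_texts)
        opt1.zero_grad(); loss.backward(); opt1.step()

    # stage 2: train the video adapter with the frame adapter's learning rate lowered
    opt2 = torch.optim.AdamW([
        {"params": model.frame_adapter.parameters(), "lr": 5e-4},
        {"params": model.video_adapter.parameters(), "lr": 3e-3},
    ])
    for videos, frame_texts, video_texts in video_loader:
        loss = video_loss_fn(model, videos, frame_texts, video_texts)
        opt2.zero_grad(); loss.backward(); opt2.step()
```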
Furthermore, in order to improve the model training efficiency, the video frame adapter loss function and the video language loss function can be stored in advance and called directly during training. When training the video frame adapter, the video frame adapter loss function can be called to calculate its loss; when training the video adapter, the video language loss function can be called to calculate the loss of the video adapter. The video frame adapter loss function is:

Loss_frame = α_0 · Loss_ITM + α_1 · Loss_ITC + α_2 · Loss_ITG + β · Loss_MTF

and the video language loss function is:

Loss = α · Loss_frame + γ · Loss_G

where Loss represents the video language loss function, Loss_frame represents the video frame adapter loss function, Loss_ITM represents the frame-text matching loss, Loss_ITC represents the text generation loss, Loss_ITG represents the frame-text contrast loss, Loss_MTF represents the video frame mask loss, α_0 represents the frame-text matching loss coefficient, α_1 represents the text generation loss coefficient, α_2 represents the frame-text contrast loss coefficient, and β represents the video frame mask loss coefficient; α is the video frame adapter loss function coefficient, Loss_G is the video-text loss, and γ is the video-text loss function coefficient.
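For illustration, the weighted combination reconstructed above can be expressed as follows; the coefficient values are arbitrary placeholders.

```python
# Hedged sketch of combining the individual losses into the overall training objectives.
def video_frame_adapter_loss(loss_itm, loss_itc, loss_itg, loss_mtf,
                             a0=1.0, a1=1.0, a2=1.0, beta=1.0):
    return a0 * loss_itm + a1 * loss_itc + a2 * loss_itg + beta * loss_mtf

def video_language_loss(loss_frame, loss_g, alpha=1.0, gamma=1.0):
    return alpha * loss_frame + gamma * loss_g
```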
As can be seen from the foregoing, in this embodiment the video frame adapter of the video language model is trained first, and the video adapter is then trained with the learning strength of the video frame adapter reduced, thereby completing the training of the whole video language pre-training model. This can improve the semantic understanding capability of the video language model, increase its convergence speed, and further improve the training speed and performance of the video language model.
It will be appreciated that, in processing video language related tasks, a pre-training model involves a pre-training process and a fine tuning process; the fine tuning process applies the pre-trained model to the data set of the current downstream video language application task and adapts the model parameters to that data set. This embodiment trains a video language model suitable for video language tasks; the video language model is a pre-trained large model with very strong generalization capability, and this embodiment fine-tunes the pre-trained large model to obtain a video language model for executing a specified video language task. Based on this, the present invention also provides a method for executing video language tasks; referring to fig. 7, it may include the following contents:
S701: training a video language model;
S702: acquiring a video language task to be executed and a corresponding video language task sample set;
S703: based on the video language task, utilizing the video language task sample set to finely adjust the video language model;
s704: and executing the video language task by utilizing the trimmed video language model.
In the pre-training stage, training is generally performed on a large-scale corpus, and a large-scale neural network structure is trained to learn and realize a specific language model; the resulting large-scale neural network structure and parameters constitute the pre-training model, i.e., the video language model described in this embodiment. In this embodiment, the video language model is obtained by training with the video language model training method described in any of the previous embodiments. In the fine tuning stage, small-scale training is performed on a specific task target (the downstream task) and its task data (the downstream data), so that the parameters of the video language model are slightly adjusted, finally yielding a video language model adapted to the specific task and data. In this embodiment, the video language task to be executed and the corresponding video language task training sample set are acquired; based on the video language task, the video language model is fine-tuned with the video language task training sample set; the downstream task is the video language task to be executed, and the task data is the video language task training sample set. Finally, the video language task is executed with the fine-tuned video language model.
Among them, the video language tasks to be performed include, but are not limited to, video content understanding and classifying tasks, video subtitle generating and translating tasks, video question and answer tasks, video summary and highlight generating tasks, and video retrieving and recommending tasks. The task of understanding and classifying video content refers to that video content is understood by using a video language model and classified into different categories, such as films, sporting events, news stories and the like, such as video classification and video library content management. The video subtitle generation and translation task is to understand video content and dialogue by using a video language model, automatically generate subtitles, and even perform multi-language translation. Such as automatically generating subtitles for a movie or television program, accessing cross-language video content. Video question-answering tasks refer to the use of a video language model to understand video content and answer questions about the video, such as interactive learning on an educational platform, and automatic question answering in customer services. The video abstraction and highlight generation task is to automatically identify key moments in video by using a video language model and generate abstracts or highlight fragments, which are suitable for quick browsing of long video contents. Such as highlight playback of a sporting event, and a summary of the critical content of the meeting video. The video retrieval and recommendation task refers to improving the accuracy and relevance of video search, such as search and recommendation of an online video platform and video retrieval of a digital library, by understanding video content and user query. The video language task training sample set is a training sample data set corresponding to the video language task to be executed, namely a training sample set used by the pre-training model in the fine tuning process applicable to the video language task to be executed, taking the video language task to be executed as a video question-answering task as an example, and the video language task training sample set comprises a plurality of video samples of different types, wherein each video sample is marked with corresponding questions and corresponding answers through manual or automatic labels in advance. Taking a video language task to be executed as a video content understanding task as an example, the video language task training sample set is a video sample set of a plurality of video samples carrying video content labels, namely the video language task training sample set is a video content understanding task training sample set, the video content understanding task training sample set comprises a plurality of video samples of different types, each video sample carries a label corresponding to video content, so that a video language model learns the video content, and automatically understands and measures the video content of an input video.
As can be seen from the foregoing, in this embodiment, the pre-training model is obtained by training the model training method described in the foregoing embodiment, and then the parameters of the video language model are fine-tuned by the downstream video language task to be applied, so as to obtain a video language model capable of executing the downstream video language task, which is beneficial to improving the execution efficiency and execution precision of the video language task, and meets the execution requirement of the user on the video language related task.
It should be noted that, in the present invention, the steps are not strictly executed sequentially, so long as they conform to the logic sequence, and the steps may be executed simultaneously or according to a certain preset sequence, and fig. 1 and fig. 7 are only schematic, and do not represent only such an execution sequence.
Finally, based on the above technical solution of the present invention, the following description will exemplify some possible application scenarios related to the technical solution of the present invention with reference to fig. 8, and fig. 8 is a schematic diagram of a hardware composition framework to which the video language task execution method provided by the present invention is applicable, where the following may be included:
The hardware component framework may include a first electronic device 81 and a second electronic device 82, where the first electronic device 81 and the second electronic device 82 are connected through a network 83. The first electronic device 81 deploys a processor for executing the video language model training method described in any of the above embodiments, and transmits the trained video language model to the second electronic device 82. The second electronic device 82 deploys an interface for providing human-computer interaction, stores the video language model after the pre-training stage, and when receiving the video question-answering task, acquires a training sample set corresponding to the training video question-answering task; based on the video question-answering task, utilizing a video question-answering task sample set to finely adjust a video language model; and executing a video question-answering task by using the trimmed video language model.
The first electronic device 81 completes all or part of the steps of training the video language model according to the above embodiments, and the built video language model is shown in fig. 9; the video language model includes the image encoder and text encoder of the target visual language pre-training model, the video frame adapter, the video adapter, and a video frame disassembly module. Its inputs comprise the video frame description text labels, the frame parameters to be learned, the video parameters to be learned and the video sample data set. The target visual language pre-training model is used for extracting the visual features, the frame parameter features and the video parameter features, which are correspondingly input to the video frame adapter and the video adapter; the video frame adapter is used for converting the visual features into frame visual information meeting the requirements of the target visual language pre-training model, and the video adapter is used for extracting the video visual information.
It should be noted that the above application scenario is only shown for the convenience of understanding the idea and principle of the present invention, and the embodiment of the present invention is not limited in any way. Rather, embodiments of the invention may be applied to any scenario where applicable.
As can be seen from the above, the embodiment of the present invention can improve the execution efficiency and execution accuracy of the video question-answering task and satisfy the user's requirements for executing the video question-answering task.
The invention also provides a corresponding device for the video language model training method and the video language task execution method, so that the method has more practicability. Wherein the device may be described separately from the functional module and the hardware. In this embodiment, the video language model training device and the video language task execution device may include or be divided into one or more program modules, where the one or more program modules are stored in a storage medium and executed by one or more processors, to complete the video language model training method and the video language task execution method according to the first embodiment of the present disclosure. Program modules in the present embodiment refer to a series of computer program instruction segments capable of performing a specific function, and are more suitable than programs themselves for describing the execution of the video language model training apparatus and the video language task execution apparatus in a storage medium. The following description will specifically describe the functions of each program module of the present embodiment, and the video language model training device and the video language task execution device described below may be referred to correspondingly to the corresponding video language model training method and video language task execution method described above.
Based on the angles of the functional modules, referring to fig. 10, fig. 10 is a block diagram of a video language model training device provided in this embodiment under a specific implementation manner, where the device may include:
The data acquisition module 101 is configured to acquire a video sample data set carrying a text description tag, a preset video parameter to be learned, and a preset frame parameter to be learned;
An input data processing module 102, configured to input a video sample in the video sample data set, a video parameter to be learned, and a frame parameter to be learned into the video language model; the video language model comprises a target visual language pre-training model, a video frame adapter and a video adapter; the target visual language pre-training model is used for extracting visual characteristics and parameter characteristics and inputting the visual characteristics and the parameter characteristics into the video frame adapter and the video adapter correspondingly, the video frame adapter is used for converting the visual characteristics into frame visual information meeting the requirements of the target visual language pre-training model, and the video adapter is used for extracting video visual information; the method comprises the steps that parameter characteristics corresponding to video parameters to be learned are input to a video adapter, and parameter characteristics corresponding to frame parameters to be learned are input to the video frame adapter so as to obtain text-related visual information by utilizing the frame parameters to be learned;
the model parameter updating module 103 is configured to iteratively update the video language model according to the frame visual information, the video visual information, and the loss information of the text semantic features until a preset model training end condition is satisfied.
Illustratively, in some implementations of the present embodiment, the video frame adapter may include a frame input layer, a text encoding layer, a cross-modal fusion layer, a feature enhancement layer, and a frame output layer;
The frame input layer is used for receiving the splicing result of the frame parameter characteristics and the video frame text characteristics; the text coding layer is used for coding the splicing result based on the current attention mask to obtain frame parameter coding features and frame text coding features; the cross-modal fusion layer is used for carrying out cross-modal fusion processing on the frame parameter coding features and the visual features; the feature enhancement layer is used for carrying out feature enhancement processing on the fusion result and inputting the enhanced features into the text coding layer; repeating for a plurality of times until the first preset repetition times are reached, and obtaining frame visual information; and the frame output layer is used for outputting the frame visual information.
In some exemplary implementations of this embodiment, the cross-modal fusion layer is a cross-modal attention mechanism layer, and it is configured to take the frame parameter coding features as the query vectors and the visual features as the set of value vectors and key vectors, and to encode the frame parameter coding features and the visual features based on the cross-modal attention mechanism to obtain the fusion result.
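As an illustrative aid (not part of the patent), this cross-modal attention step can be sketched in PyTorch roughly as follows; the module name, feature dimensions, and head count are assumptions.

```python
import torch
import torch.nn as nn

class CrossModalFusionLayer(nn.Module):
    """Frame parameter coding features attend to the visual features of a frame."""
    def __init__(self, dim: int = 768, num_heads: int = 8):
        super().__init__()
        self.cross_attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)

    def forward(self, frame_param_enc: torch.Tensor, visual_feat: torch.Tensor) -> torch.Tensor:
        # query: frame parameter coding features; key/value: visual features of the frame
        fused, _ = self.cross_attn(query=frame_param_enc, key=visual_feat, value=visual_feat)
        return fused  # fusion result handed on to the feature enhancement layer

# toy shapes: 32 learnable frame-parameter tokens, 257 visual tokens per frame
fused = CrossModalFusionLayer()(torch.randn(2, 32, 768), torch.randn(2, 257, 768))
```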
In other exemplary implementations of this embodiment, the feature enhancement layer includes a first feature enhancement layer, an interactive feature extraction layer, and a second feature enhancement layer; the first feature enhancement layer is used for performing layer normalization on the fusion result and obtaining a first interaction enhancement feature through a residual connection; the interactive feature extraction layer is used for extracting features from the first interaction enhancement feature to obtain a second interaction enhancement feature; and the second feature enhancement layer is used for performing layer normalization on the second interaction enhancement feature followed by a residual connection.
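For illustration only, the feature enhancement stage could be sketched as below; treating the interactive feature extraction layer as a feed-forward network is an assumption, as are the dimensions.

```python
import torch
import torch.nn as nn

class FeatureEnhancementLayer(nn.Module):
    """LayerNorm + residual, interaction feed-forward extraction, then LayerNorm + residual."""
    def __init__(self, dim: int = 768, hidden: int = 3072):
        super().__init__()
        self.norm1 = nn.LayerNorm(dim)
        self.interact = nn.Sequential(nn.Linear(dim, hidden), nn.GELU(), nn.Linear(hidden, dim))
        self.norm2 = nn.LayerNorm(dim)

    def forward(self, fused: torch.Tensor, residual: torch.Tensor) -> torch.Tensor:
        # first feature enhancement layer: normalize the fusion result, add the residual branch
        first_enhanced = self.norm1(fused) + residual
        # interactive feature extraction on the first interaction enhancement feature
        second = self.interact(first_enhanced)
        # second feature enhancement layer: layer normalization plus residual connection
        return self.norm2(second) + first_enhanced
```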
In other exemplary implementations of the present embodiment, the model parameter updating module 103 includes a video frame adapter training module, where the video frame adapter training module is configured to:
Extracting the characteristics of the frame visual information corresponding to the current frame to obtain the image frame characteristics corresponding to the current frame image; extracting frame text coding features corresponding to the current frame to obtain image frame text features corresponding to the current frame image; and carrying out iterative updating on the video frame adapter according to the loss information between each image frame characteristic and the corresponding image frame text characteristic.
As an exemplary implementation of the present embodiment, the video frame adapter training module may further be configured to:
Determining a frame-text matching loss by predicting whether the image frame features and the image frame text features are positively matched or negatively unmatched using the video frame adapter; determining a frame-text contrast penalty by comparing the similarity between the image frame features and the image frame text features; masking off part of the text features of the video frames, predicting the text features of the video frames which are masked off through a video frame adapter trained based on the text features of the image frames corresponding to the rest of the text features of the video frames and the image frame features, and determining text generation loss; a penalty function for the video frame adapter is determined based on the frame-to-text matching penalty, the frame-to-text contrast penalty, and the text generation penalty.
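To make the frame-text matching term concrete, a hypothetical binary matching head might look like the following sketch; the classifier design and the way negative pairs are drawn are assumptions, not the patent's specification.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class FrameTextMatchingHead(nn.Module):
    """Predicts whether an image frame feature and an image frame text feature form a positive pair."""
    def __init__(self, dim: int = 768):
        super().__init__()
        self.classifier = nn.Linear(2 * dim, 2)  # two classes: matched / unmatched

    def forward(self, frame_feat: torch.Tensor, text_feat: torch.Tensor) -> torch.Tensor:
        return self.classifier(torch.cat([frame_feat, text_feat], dim=-1))

def frame_text_matching_loss(head, frame_feat, matched_text, unmatched_text):
    # positive (matched) pairs are labelled 1, negative (unmatched) pairs are labelled 0
    logits = torch.cat([head(frame_feat, matched_text), head(frame_feat, unmatched_text)])
    labels = torch.cat([torch.ones(len(frame_feat)), torch.zeros(len(frame_feat))]).long().to(logits.device)
    return F.cross_entropy(logits, labels)
```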
As another exemplary implementation of the present embodiment, the video frame adapter training module may further be configured to:
Taking the image frame characteristics and the image frame text characteristics which are positively matched as a group of positive samples, and taking the image frame characteristics and the image frame text characteristics which are negatively unmatched as a group of negative samples;
Calculating positive similarity between the image frame features and the image frame text features in each group of positive samples, and calculating negative similarity between the image frame features and the image frame text features in each group of negative samples;
The frame-text contrast loss is determined by comparing the positive and negative similarities.
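A compact sketch of this positive/negative similarity comparison follows; the use of in-batch negatives and cosine similarity with a temperature parameter are assumptions made for illustration.

```python
import torch
import torch.nn.functional as F

def frame_text_contrastive_loss(frame_feat: torch.Tensor, text_feat: torch.Tensor, tau: float = 0.07):
    """Diagonal (matched) pairs are positives; every other pairing in the batch is a negative."""
    z = F.normalize(frame_feat, dim=-1)   # image frame features, shape (N, d)
    t = F.normalize(text_feat, dim=-1)    # matching image frame text features, shape (N, d)
    sim = z @ t.t() / tau                 # pairwise similarities scaled by the temperature
    targets = torch.arange(z.size(0), device=z.device)
    # push each positive similarity above the negative similarities in the same batch
    return F.cross_entropy(sim, targets)
```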
As an exemplary implementation of the present embodiment, the video frame adapter training module may further be configured to:
invoking a contrast loss function relation, and calculating frame-text contrast loss; the contrast loss function relationship is:
where Loss ITG is the frame-text contrast loss, Z i is the i-th image frame feature, T i is the image frame text feature that matches the i-th image frame feature, T j is the j-th image frame text feature that does not match the image frame feature, N ITG is the total number of image frame text features that match an image frame feature, θ represents the similarity between an image frame feature and an image frame text feature, and τ is a parameter to be optimized.
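The formula image itself is not reproduced in this text. Based solely on the symbol definitions above, a plausible reconstruction of the softmax-style contrast loss is the following (a reconstruction, not a verbatim copy of the patent's equation):

\[
\mathrm{Loss}_{ITG} \;=\; -\frac{1}{N_{ITG}}\sum_{i=1}^{N_{ITG}} \log \frac{\exp\big(\theta(Z_i, T_i)/\tau\big)}{\sum_{j}\exp\big(\theta(Z_i, T_j)/\tau\big)}
\]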
As another exemplary implementation of the present embodiment, the video frame adapter training module may further be configured to:
determining an image frame-image frame text penalty from the frame-text matching penalty, the frame-text contrast penalty, and the text generation penalty;
Masking target image frames of the video samples, predicting the target image frames through a video frame adapter trained based on image frame text features and image frame features corresponding to the masked video samples, and determining video frame mask loss;
A loss function of the video frame adapter is determined based on the image frame-to-image frame text loss and the video frame mask loss.
As an exemplary implementation of the present embodiment, the video frame adapter training module may further be configured to:
invoking a video frame mask loss function relation, and calculating video frame mask loss; the video frame mask loss function relationship is:
where Loss MTF is the video frame mask loss; the expectation is taken over the random distribution D inside a mini-batch of video samples; V represents the image frame features; V m is the target (masked) image frame and O(V m) is the target image frame feature; the prediction is conditioned on the image frame features of the video sample that are not masked and on T, the image frame text features corresponding to those unmasked frames; the k-th masked image frame feature inside the mini-batch of video samples is the prediction target, K is the number of images masked inside the mini-batch of video samples, and model represents the prediction result.
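The mask-loss formula image is likewise missing here. One plausible form consistent with the definitions above, stated only as an assumption (the patent's exact distance measure cannot be recovered from this text), is a regression between each masked frame's feature and its prediction from the unmasked frames and their text features:

\[
\mathrm{Loss}_{MTF} \;=\; \mathbb{E}_{V\sim D}\left[\frac{1}{K}\sum_{k=1}^{K}\big\lVert\, O(V_m^{k}) - \mathrm{model}\big(\hat{V}, T\big)_{k} \,\big\rVert^{2}\right]
\]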
The video adapter comprises a video input layer, a parameter encoder layer, a feature fusion layer, a feature extraction layer and a video output layer;
The video input layer is used for receiving the joint features of the visual features and the frame visual information; the parameter encoder layer is used for encoding the video parameter features to obtain video parameter coding features; the feature fusion layer is used for performing fusion processing on the video parameter coding features and the joint features; the feature extraction layer is used for extracting features from the fusion processing result and transmitting the extracted features back to the parameter encoder layer; this process is repeated until a second preset number of repetitions is reached, yielding the video visual information, which is output by the video output layer.
Illustratively, in other implementations of the present embodiment, the feature fusion layer includes a first video feature enhancement layer, a cross-modal learning layer, and a second video feature enhancement layer;
The first video feature enhancement layer is used for carrying out residual connection on video parameter coding features and video parameter features and carrying out layer normalization processing to obtain parameter enhancement features;
The cross-modal learning layer is used for carrying out fusion processing on the video parameter coding feature and the joint feature based on a cross-modal attention mechanism by taking the parameter enhancement feature as a query vector and taking the joint feature as a group of value vectors and key vectors to obtain a multi-modal fusion feature;
And the second video feature enhancement layer is used for carrying out residual connection on the multi-mode fusion features and carrying out layer normalization processing to obtain a fusion processing result.
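For illustration, one round of the video adapter (parameter encoding, fusion with the joint features, feature extraction) might be sketched as follows; every module choice and dimension here is an assumption rather than the patent's concrete design.

```python
import torch
import torch.nn as nn

class VideoAdapterBlock(nn.Module):
    """One repetition of the video adapter pipeline described above."""
    def __init__(self, dim: int = 768, num_heads: int = 8, hidden: int = 3072):
        super().__init__()
        self.param_encoder = nn.TransformerEncoderLayer(dim, num_heads, hidden, batch_first=True)
        self.norm_q = nn.LayerNorm(dim)
        self.cross_attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        self.norm_fused = nn.LayerNorm(dim)
        self.extract = nn.Sequential(nn.Linear(dim, hidden), nn.GELU(), nn.Linear(hidden, dim))

    def forward(self, video_param_feat: torch.Tensor, joint_feat: torch.Tensor) -> torch.Tensor:
        enc = self.param_encoder(video_param_feat)             # parameter encoder layer
        q = self.norm_q(enc + video_param_feat)                # residual + layer norm: parameter enhancement features
        fused, _ = self.cross_attn(query=q, key=joint_feat, value=joint_feat)  # cross-modal learning layer
        fused = self.norm_fused(fused + q)                     # residual + layer norm: fusion processing result
        return self.extract(fused)                             # feature extraction layer, fed back to the encoder
```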
In some exemplary implementations of the present embodiments, the video language model further includes a docking network layer; the docking network layer comprises a first converter model, a video feature extraction layer and a joint layer;
The first converter model is used for fusing visual features based on a self-attention mechanism to obtain visual fusion features; the video feature extraction layer is used for extracting features of the visual fusion features and converting the dimensions of the extracted features into dimensions identical to the input dimensions of the video adapter; and the joint layer is used for combining the frame visual information and the output of the video feature extraction layer and inputting the joint features to the video adapter.
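A hypothetical sketch of the docking network layer follows; the transformer depth, the dimensions, and the use of concatenation in the joint layer are assumptions.

```python
import torch
import torch.nn as nn

class DockingNetworkLayer(nn.Module):
    """Self-attention fusion of visual features, projection to the adapter dimension, then joining."""
    def __init__(self, in_dim: int = 1024, adapter_dim: int = 768, num_heads: int = 8, depth: int = 2):
        super().__init__()
        enc_layer = nn.TransformerEncoderLayer(in_dim, num_heads, 4 * in_dim, batch_first=True)
        self.first_converter = nn.TransformerEncoder(enc_layer, num_layers=depth)
        self.video_feature_extraction = nn.Linear(in_dim, adapter_dim)

    def forward(self, visual_feat: torch.Tensor, frame_visual_info: torch.Tensor) -> torch.Tensor:
        fusion = self.first_converter(visual_feat)               # visual fusion features (self-attention)
        projected = self.video_feature_extraction(fusion)        # matched to the video adapter input dimension
        # joint layer: frame_visual_info is assumed to already have the adapter dimension
        return torch.cat([frame_visual_info, projected], dim=1)  # joint features for the video adapter
```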
In other exemplary implementations of the present embodiment, the model parameter updating module 103 includes a video adapter training module, where the video adapter training module is configured to:
Extracting video characteristics of video visual information;
extracting coding text features corresponding to the video text features;
And carrying out iterative updating on the video adapter according to the loss information between the video characteristics and the coded text characteristics.
As an exemplary implementation of the present embodiment, the video adapter training module may further be configured to:
Invoking a video-text loss calculation relation, and calculating video-text loss of the video adapter, wherein the video-text loss calculation relation is as follows:
where Loss G is the video-text loss, N G is the total number of matched pairs of video features and encoded text features in the current batch, the i'-th video feature in the current batch is paired with its matching encoded text feature, the j'-th encoded text feature is one that does not match the i'-th video feature, θ represents the similarity between a video feature and an encoded text feature, and τ is the parameter to be optimized.
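As with the other equations, the formula image is absent from this text. Reading the definitions above, a plausible reconstruction (with F standing for a video feature and W for an encoded text feature, symbols chosen here only for readability) is:

\[
\mathrm{Loss}_{G} \;=\; -\frac{1}{N_{G}}\sum_{i'=1}^{N_{G}} \log \frac{\exp\big(\theta(F_{i'}, W_{i'})/\tau\big)}{\sum_{j'}\exp\big(\theta(F_{i'}, W_{j'})/\tau\big)}
\]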
Illustratively, in other implementations of the present embodiment, the input data processing module 102 may be further configured to:
performing image sampling processing on the video samples to obtain multi-frame sample images;
extracting image features of each frame of sample image by using an image encoder of the target visual language pre-training model to obtain visual features;
Extracting text semantic features of text description tags of the video samples by using a text encoder of the target visual language pre-training model;
and respectively extracting parameter characteristics corresponding to the video parameters to be learned and the frame parameters to be learned by using a text encoder of the target visual language pre-training model.
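The input-side processing can be illustrated with the following sketch; `image_encoder` and `text_encoder` stand in for the frozen encoders of the target visual language pre-training model and are placeholders, as is the uniform frame-sampling strategy.

```python
import torch

def sample_frames(video: torch.Tensor, num_frames: int = 8) -> torch.Tensor:
    """Uniformly sample num_frames images from a (T, C, H, W) video clip."""
    idx = torch.linspace(0, video.shape[0] - 1, num_frames).long()
    return video[idx]

def build_model_inputs(video, caption, frame_params, video_params, image_encoder, text_encoder):
    frames = sample_frames(video)                      # multi-frame sample images
    visual_feat = image_encoder(frames)                # visual features, one token sequence per frame
    text_sem_feat = text_encoder(caption)              # text semantic features of the description tag
    frame_param_feat = text_encoder(frame_params)      # frame parameter features (to the video frame adapter)
    video_param_feat = text_encoder(video_params)      # video parameter features (to the video adapter)
    return visual_feat, text_sem_feat, frame_param_feat, video_param_feat
```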
In some exemplary implementations of the present embodiment, the input data processing module 102 may be further configured to:
Carrying out random initialization processing on frame parameters to be learned by using a text encoder of a target visual language pre-training model, and taking a random initialization result of the frame parameters to be learned as frame parameter characteristics;
And utilizing a text encoder of the target visual language pre-training model to encode the video parameters to be learned based on the current attention mask, so as to obtain video parameter characteristics.
In other exemplary implementations of the present embodiment, the input data processing module 102 may be further configured to:
extracting video text features of the video description text labels by using a text encoder of the target visual language pre-training model;
performing tokenization processing on the video frame description text labels by using a text encoder of the target visual language pre-training model, and performing word embedding processing on the tokenization processing result to obtain video frame text characteristics;
and a text encoder of the target visual language pre-training model is utilized to encode the video description text label based on the current attention mask, so that video text characteristics are obtained.
In other exemplary implementations of the present embodiment, the input data processing module 102 may be further configured to:
dividing the current frame image into a plurality of image blocks with non-overlapping contents;
converting each image block into a one-dimensional representation through linear mapping, and adding position coding information to the corresponding image block;
And inputting the image block subjected to linear mapping and position coding to an encoder of the second converter model, and extracting the characteristics of the output of the encoder of the second converter model to obtain the visual characteristics of the video sample.
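This patching step corresponds to the standard vision-transformer input pipeline; a minimal sketch (patch size, image size, and dimensions are assumed) is:

```python
import torch
import torch.nn as nn

class PatchEmbedding(nn.Module):
    """Cut a frame into non-overlapping blocks, linearly map each block, add position codes."""
    def __init__(self, img_size: int = 224, patch_size: int = 16, dim: int = 768):
        super().__init__()
        num_patches = (img_size // patch_size) ** 2
        # a strided convolution is equivalent to splitting patches and applying a shared linear map
        self.proj = nn.Conv2d(3, dim, kernel_size=patch_size, stride=patch_size)
        self.pos_embed = nn.Parameter(torch.zeros(1, num_patches, dim))

    def forward(self, frame: torch.Tensor) -> torch.Tensor:    # frame: (B, 3, H, W)
        patches = self.proj(frame).flatten(2).transpose(1, 2)   # (B, num_patches, dim)
        return patches + self.pos_embed                          # input to the second converter model's encoder
```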
Illustratively, in other implementations of the present embodiment, the model parameter updating module 103 may further be configured to:
Taking a video frame description text label, frame parameters to be learned and a video sample data set as inputs, and training a video frame adapter by freezing an image encoder of a target visual language pre-training model and utilizing the frame parameters to be learned to acquire visual information corresponding to the video frame description text label;
When training of the video frame adapter is completed, taking the video frame description text label, the frame parameters to be learned, the video parameters to be learned and the video sample data set as inputs to train the video adapter.
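The two-stage schedule can be illustrated as follows; `frame_adapter_loss`, `video_language_loss`, and the attribute names on `model` are placeholder names chosen for this sketch, not APIs defined by the patent.

```python
import torch

def train_two_stages(model, frame_loader, video_loader, lr: float = 1e-4):
    # Stage 1: freeze the image encoder of the pre-trained model, train only the video frame adapter
    for p in model.image_encoder.parameters():
        p.requires_grad = False
    opt = torch.optim.AdamW(model.video_frame_adapter.parameters(), lr=lr)
    for batch in frame_loader:
        loss = model.frame_adapter_loss(batch)       # frame-level losses described above
        opt.zero_grad(); loss.backward(); opt.step()

    # Stage 2: with the frame adapter trained, train the video adapter using both parameter sets as input
    opt = torch.optim.AdamW(model.video_adapter.parameters(), lr=lr)
    for batch in video_loader:
        loss = model.video_language_loss(batch)      # overall video language loss
        opt.zero_grad(); loss.backward(); opt.step()
```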
In some exemplary implementations of the present embodiment, the model parameter updating module 103 may further be configured to:
When a learning rate adjustment instruction is received, updating the current learning rate according to the new learning rate of the learning rate adjustment instruction; the new learning rate is less than the current learning rate.
In other exemplary implementations of the present embodiment, the model parameter updating module 103 may be further configured to:
Invoking a video frame adapter loss function, and training the video frame adapter; the video frame adapter loss function is:
where Loss frame represents the video frame adapter loss function, Loss ITM is the frame-text matching loss, Loss ITC is the text generation loss, Loss ITG is the frame-text contrast loss, Loss MTF is the video frame mask loss, α 0 is the frame-text matching loss coefficient, α 1 is the text generation loss coefficient, α 2 is the frame-text contrast loss coefficient, and β is the video frame mask loss coefficient.
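The equation image is absent from this text; from the coefficient definitions above, the loss reads, by reconstruction, as the weighted sum:

\[
\mathrm{Loss}_{frame} \;=\; \alpha_{0}\,\mathrm{Loss}_{ITM} \;+\; \alpha_{1}\,\mathrm{Loss}_{ITC} \;+\; \alpha_{2}\,\mathrm{Loss}_{ITG} \;+\; \beta\,\mathrm{Loss}_{MTF}
\]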
In other exemplary implementations of the present embodiment, the model parameter updating module 103 may be further configured to:
Calling a video language loss function, and training a video adapter; the video language loss function is:
where Loss represents the video language loss function, α is the video frame adapter loss function coefficient, Loss G is the video-text loss, and γ is the video-text loss coefficient.
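Again, the equation image is not reproduced here; by reconstruction from the coefficient definitions above, the overall objective is the weighted combination:

\[
\mathrm{Loss} \;=\; \alpha\,\mathrm{Loss}_{frame} \;+\; \gamma\,\mathrm{Loss}_{G}
\]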
From the perspective of functional modules, referring to fig. 11, fig. 11 is a block diagram of a video language task execution device provided in this embodiment in a specific implementation, where the device may include:
the model training module 111 is used for training to obtain a video language model;
the data acquisition module 112 is configured to acquire a video language task to be executed and a corresponding video language task sample set;
The model fine tuning module 113 is configured to fine tune the video language model based on the video language task by using the video language task sample set;
The task execution module 114 is configured to execute the video language task using the trimmed video language model.
The functions of each functional module of the video language model training device and the video language task execution device in this embodiment may be specifically implemented according to the method in the above method embodiment, and the specific implementation process may refer to the related description of the above method embodiment, which is not repeated herein.
As can be seen from the above, this embodiment can effectively improve the training efficiency of the video language model and save the computing resources required for model training.
The video language model training device and the video language task execution device mentioned above are described from the perspective of functional modules, and further, the invention also provides an electronic device, which is described from the perspective of hardware. Fig. 12 is a schematic structural diagram of an electronic device according to an embodiment of the present invention. As shown in fig. 12, the electronic device comprises a memory 120 for storing a computer program; a processor 121 for implementing the steps of the video language model training method and/or the video language task execution method as mentioned in any of the above embodiments when executing a computer program.
The processor 121 may include one or more processing cores, such as a 4-core or 8-core processor, and may also be a controller, microcontroller, microprocessor, or other data processing chip. The processor 121 may be implemented in at least one hardware form among a DSP (Digital Signal Processing) chip, an FPGA (Field-Programmable Gate Array), and a PLA (Programmable Logic Array). The processor 121 may also include a main processor and a coprocessor: the main processor, also referred to as a CPU (Central Processing Unit), processes data in the awake state, while the coprocessor is a low-power processor that processes data in the standby state. In some embodiments, the processor 121 may be integrated with a GPU (Graphics Processing Unit) for rendering and drawing the content to be displayed by the display screen. In some embodiments, the processor 121 may also include an AI (Artificial Intelligence) processor for handling computing operations related to machine learning.
The memory 120 may include one or more computer-readable storage media, which may be non-transitory. The memory 120 may also include high-speed random access memory as well as non-volatile memory, such as one or more magnetic disk storage devices or flash memory storage devices. In some embodiments, the memory 120 may be an internal storage unit of the electronic device, such as a hard disk of a server. In other embodiments, the memory 120 may be an external storage device of the electronic device, such as a plug-in hard disk, a Smart Media Card (SMC), a Secure Digital (SD) card, or a flash card provided on a server. Further, the memory 120 may include both an internal storage unit and an external storage device of the electronic device. The memory 120 may be used to store application software installed in the electronic device and various types of data, such as the code of programs during execution of the video language model training method and the video language task execution method, and may also be used to temporarily store data that has been output or is to be output. In this embodiment, the memory 120 is at least used for storing a computer program 1201, which, after being loaded and executed by the processor 121, can implement the relevant steps of the video language model training method and the video language task execution method disclosed in any of the foregoing embodiments. In addition, the resources stored in the memory 120 may further include an operating system 1202, data 1203, and the like, stored either transiently or permanently. The operating system 1202 may include Windows, Unix, Linux, and the like. The data 1203 may include, but is not limited to, video language model training results, data corresponding to video language task execution results, and the like.
In some embodiments, the electronic device may further include a display 122, an input/output interface 123, a communication interface 124 (also referred to as a network interface), a power supply 125, and a communication bus 126. The display 122 and the input/output interface 123 (such as a keyboard) belong to the user interface, which may optionally also include standard wired and wireless interfaces. Optionally, in some embodiments, the display may be an LED display, a liquid crystal display, a touch liquid crystal display, an OLED (Organic Light-Emitting Diode) touch display, or the like. The display, which may also be referred to as a display screen or display unit, is used to display information processed in the electronic device and to present a visual user interface. The communication interface 124 may include a wired interface and/or a wireless interface, such as a WI-FI interface or a Bluetooth interface, and is typically used to establish a communication connection between the electronic device and other electronic devices. The communication bus 126 may be a Peripheral Component Interconnect (PCI) bus, an Extended Industry Standard Architecture (EISA) bus, or the like. The bus may be divided into an address bus, a data bus, a control bus, and so on. For ease of illustration, only one thick line is shown in fig. 12, but this does not mean that there is only one bus or one type of bus.
Those skilled in the art will appreciate that the configuration shown in fig. 12 does not limit the electronic device, which may include more or fewer components than shown; for example, it may further include a sensor 127 to implement various functions.
The functions of each functional module of the electronic device in this embodiment may be specifically implemented according to the method in the foregoing method embodiment, and the specific implementation process may refer to the relevant description of the foregoing method embodiment, which is not repeated herein.
As can be seen from the above, this embodiment can effectively improve the training efficiency of the video language model and save the computing resources required for model training.
It will be appreciated that if the video language model training method and the video language task execution method in the above embodiments are implemented in the form of software functional units and sold or used as independent products, they may be stored in a computer-readable storage medium. Based on such understanding, the part of the technical solution of the present invention that contributes to the related art, or all or part of the technical solution, may be embodied in the form of a software product stored in a storage medium, which performs all or part of the steps of the methods of the various embodiments of the present invention. The aforementioned storage medium includes any medium that can store program code, such as a USB flash drive, a removable hard disk, a read-only memory (ROM), a random access memory (RAM), an electrically erasable programmable ROM, registers, a hard disk, a multimedia card, a card-type memory (e.g., SD or DX memory), a magnetic memory, a removable disk, a CD-ROM, a magnetic disk, or an optical disk.
Based on this, the present invention also provides a readable storage medium storing a computer program which, when executed by a processor, performs the steps of the video language model training method and/or the video language task execution method according to any one of the embodiments above.
In this specification, each embodiment is described in a progressive manner; each embodiment focuses on its differences from the other embodiments, and the same or similar parts of the embodiments may be cross-referenced. For the hardware disclosed in the embodiments, including the devices and the electronic equipment, the description is relatively brief because they correspond to the methods disclosed in the embodiments; for relevant details, refer to the description of the methods.
Those of skill would further appreciate that the various illustrative elements and algorithm steps described in connection with the embodiments disclosed herein may be implemented as electronic hardware, computer software, or combinations of both, and that the various illustrative elements and steps are described above generally in terms of functionality in order to clearly illustrate the interchangeability of hardware and software. Whether such functionality is implemented as hardware or software depends upon the particular application and design constraints imposed on the solution. Skilled artisans may implement the described functionality in varying ways for each particular application, but such implementation decisions should not be interpreted as causing a departure from the scope of the present invention.
The video language task execution and the model training method, the device, the electronic equipment and the readable storage medium thereof provided by the invention are described in detail. The principles and embodiments of the present invention have been described herein with reference to specific examples, the description of which is intended only to facilitate an understanding of the method of the present invention and its core ideas. It should be noted that, based on the embodiments of the present invention, all other embodiments obtained by a person skilled in the art without making any inventive effort fall within the scope of protection of the present invention. The present invention is capable of numerous modifications and adaptations without departing from the principles of the present invention, and such modifications and adaptations are intended to be within the scope of the present invention.

Claims (26)

1. A method for training a video language model, comprising:
Acquiring a video sample data set carrying a text description tag, preset video parameters to be learned and frame parameters to be learned;
Inputting a video sample in the video sample data set, the video parameter to be learned and the frame parameter to be learned into the video language model; the video language model comprises a target visual language pre-training model, a video frame adapter and a video adapter; the target visual language pre-training model is used for extracting visual characteristics and parameter characteristics and inputting the visual characteristics and the parameter characteristics into the video frame adapter and the video adapter correspondingly, the video frame adapter is used for converting the visual characteristics into frame visual information meeting the requirements of the target visual language pre-training model, and the video adapter is used for extracting video visual information;
According to the frame visual information, the video visual information and the loss information of text semantic features, iteratively updating the video language model until a preset model training ending condition is met;
The method comprises the steps that parameter characteristics corresponding to video parameters to be learned are input to a video adapter, and parameter characteristics corresponding to frame parameters to be learned are input to the video frame adapter so as to obtain text-related visual information by utilizing the frame parameters to be learned;
The image encoder of the target visual language pre-training model is used for extracting visual characteristics of a video sample, the text encoder of the target visual language pre-training model is used for extracting text semantic characteristics of a text description tag of the video sample, and extracting video parameters to be learned and parameter characteristics corresponding to the frame parameters to be learned;
The frame parameter to be learned is characterized in that the parameter characteristic corresponding to the frame parameter to be learned is a frame parameter characteristic, the text description tag comprises a video frame description text tag, the text semantic characteristic corresponding to the video frame description text tag is a video frame text characteristic, and the video frame adapter comprises a frame input layer, a text coding layer, a cross-mode fusion layer, a characteristic enhancement layer and a frame output layer; the frame input layer is used for receiving the splicing result of the frame parameter characteristics and the video frame text characteristics; the text coding layer is used for coding the splicing result based on the current attention mask to obtain frame parameter coding characteristics; the cross-modal fusion layer is used for carrying out cross-modal fusion processing on the frame parameter coding features and the visual features; the feature enhancement layer is used for carrying out feature enhancement processing on the fusion result and inputting enhanced features into the text coding layer; the frame output layer is used for outputting frame visual information;
The video adapter comprises a video input layer, a parameter encoder layer, a feature fusion layer, a feature extraction layer and a video output layer, wherein the parameter features corresponding to the video parameters to be learned are video parameter features; the video input layer is used for receiving the combined characteristics of the visual characteristics and the frame visual information; the parameter encoder layer is used for encoding the video parameter characteristics to obtain video parameter encoding characteristics; the feature fusion layer is used for carrying out fusion processing on the video parameter coding features and the joint features; the feature extraction layer is used for extracting features of the fusion processing result and transmitting the extracted features to the parameter encoder layer; the video output layer is used for outputting video visual information;
Each video sample is provided with a text description tag, the video description text tag is text data corresponding to the description of the whole video, the video frame description text tag is text data corresponding to the description of the current frame of image, and when frame visual information and video visual information corresponding to the video sample are obtained, the frame visual information and the video visual information are compared with text semantic features of the text description tag corresponding to the video sample, and model parameters of the video language model are updated by continuously reducing the difference between the frame visual information and the video visual information.
2. The video language model training method of claim 1, wherein the cross-modal fusion layer is a cross-modal attention mechanism layer, and the cross-modal fusion processing of the frame parameter coding feature and the visual feature comprises:
And taking the frame parameter coding feature as a query vector, taking the visual feature as a group of value vectors and key vectors, and coding the frame parameter coding feature and the visual feature based on a cross-modal attention mechanism to be taken as a fusion result.
3. The video language model training method of claim 1, wherein the feature enhancement layer comprises a first feature enhancement layer, an interactive feature extraction layer, and a second feature enhancement layer;
The first characteristic enhancement layer is used for carrying out layer normalization processing on the fusion result and obtaining a first interaction enhancement characteristic through residual error connection;
the interaction feature extraction layer is used for extracting features of the first interaction enhancement features to obtain second interaction enhancement features;
and the second characteristic enhancement layer is used for carrying out layer normalization processing on the second interaction enhancement characteristic and is connected through residual errors.
4. The video language model training method of claim 1, wherein the training process of the video frame adapter comprises:
Extracting the characteristics of the frame visual information corresponding to the current frame to obtain the image frame characteristics corresponding to the current frame image;
extracting text features of the video frames corresponding to the current frames to obtain text features of the image frames corresponding to the current frame images;
And carrying out iterative updating on the video frame adapter according to the loss information between each image frame characteristic and the corresponding image frame text characteristic.
5. The method according to claim 4, wherein iteratively updating the video frame adapter according to loss information between each image frame feature and a corresponding image frame text feature comprises:
Determining a frame-text matching penalty by predicting whether an image frame feature and an image frame text feature are positively matched or negatively mismatched using the video frame adapter;
Determining a frame-text contrast penalty by comparing the similarity between the image frame features and the image frame text features;
Masking off part of the text features of the video frames, predicting the text features of the video frames which are masked off through a video frame adapter trained based on the text features of the image frames corresponding to the rest of the text features of the video frames and the image frame features, and determining text generation loss;
determining a penalty function for the video frame adapter based on the frame-to-text matching penalty, the frame-to-text contrast penalty, and the text generation penalty.
6. The video language model training method of claim 5, wherein said determining a frame-text contrast loss by comparing similarities between image frame features and image frame text features comprises:
Taking the image frame characteristics and the image frame text characteristics which are positively matched as a group of positive samples, and taking the image frame characteristics and the image frame text characteristics which are negatively unmatched as a group of negative samples;
Calculating positive similarity between the image frame features and the image frame text features in each group of positive samples, and calculating negative similarity between the image frame features and the image frame text features in each group of negative samples;
a frame-text contrast loss is determined by comparing the positive similarity to the negative similarity.
7. The video language model training method of claim 5, wherein said determining a frame-text contrast loss by comparing similarities between image frame features and image frame text features comprises:
Invoking a contrast loss function relation, and calculating frame-text contrast loss; the contrast loss function relationship is:
Where Loss ITG is the frame-text contrast loss, exp is an exponential function, Z i is the i-th image frame feature, T i is the image frame text feature matching the i-th image frame feature, T j is the j-th image frame text feature not matching the image frame feature, N ITG is the total number of image frame text features matching the image frame features, θ is the similarity between the image frame features and the image frame text features, and τ is a parameter to be optimized.
8. The video language model training method of claim 5, wherein said determining a penalty function for said video frame adapter based on said frame-to-text matching penalty, said frame-to-text contrast penalty, and said text generation penalty comprises:
determining an image frame-image frame text penalty from the frame-text matching penalty, the frame-text contrast penalty, and the text generation penalty;
Masking target image frames of the video samples, predicting the target image frames through a video frame adapter trained based on image frame text features and image frame features corresponding to the masked video samples, and determining video frame mask loss;
and determining a loss function of the video frame adapter according to the image frame-image frame text loss and the video frame mask loss.
9. The video language model training method of claim 8, wherein said determining video frame mask loss comprises:
Invoking a video frame mask loss function relation, and calculating video frame mask loss; the video frame mask loss function relationship is:
where Loss MTF is the video frame mask loss; the expectation is taken over the random distribution D inside a mini-batch of video samples; V represents the image frame features; V m is the target (masked) image frame and O(V m) is the target image frame feature; the prediction is conditioned on the image frame features of the video sample that are not masked and on T, the image frame text features corresponding to those unmasked frames; the k-th masked image frame feature inside the mini-batch of video samples is the prediction target, K is the number of images masked inside the mini-batch of video samples, and model represents the prediction result.
10. The video language model training method of claim 1, wherein the feature fusion layer comprises a first video feature enhancement layer, a cross-modal learning layer, and a second video feature enhancement layer;
The first video feature enhancement layer is used for carrying out residual connection on the video parameter coding features and the video parameter features, and carrying out layer normalization processing to obtain parameter enhancement features;
The cross-modal learning layer is used for carrying out fusion processing on the video parameter coding feature and the joint feature based on a cross-modal attention mechanism by taking the parameter enhancement feature as a query vector and taking the joint feature as a group of value vectors and key vectors to obtain a multi-modal fusion feature;
and the second video feature enhancement layer is used for carrying out residual connection on the multi-mode fusion features and carrying out layer normalization processing to obtain a fusion processing result.
11. The video language model training method of claim 9, wherein the video language model further comprises a docking network layer; the docking network layer comprises a first converter model, a video feature extraction layer and a joint layer;
The first converter model is used for fusing the visual features based on a self-attention mechanism to obtain visual fusion features; the video feature extraction layer is used for extracting features of the visual fusion features and converting the dimensions of the extracted features into dimensions identical to the input dimensions of the video adapter; the joint layer is used for combining the frame visual information and the output of the video feature extraction layer and inputting joint features to the video adapter.
12. The method for training a video language model according to claim 9, wherein the text description tag comprises a video description text tag, the text semantic feature corresponding to the video description text tag is a video text feature, and the training process of the video adapter comprises:
extracting video features of the video visual information;
Extracting coding text features corresponding to the video text features;
And carrying out iterative updating on the video adapter according to the loss information between the video characteristics and the coded text characteristics.
13. The video language model training method of claim 12, wherein said determining loss information between said video feature and said encoded text feature comprises:
invoking a video-text loss calculation relation, and calculating video-text loss of the video adapter, wherein the video-text loss calculation relation is as follows:
where Loss G is the video-text loss, N G is the total number of matched pairs of video features and encoded text features in the current batch, the i'-th video feature in the current batch is paired with its matching encoded text feature, the j'-th encoded text feature is one that does not match the i'-th video feature, θ represents the similarity between a video feature and an encoded text feature, and τ is the parameter to be optimized.
14. The video language model training method of claim 1, wherein said inputting the video samples in the video sample dataset, the video parameters to be learned, and the frame parameters to be learned to the video language model comprises:
Performing image sampling processing on the video sample to obtain a multi-frame sample image;
Extracting image features of each frame of sample image by using an image encoder of the target visual language pre-training model to obtain visual features;
Extracting text semantic features of text description tags of the video samples by using a text encoder of the target visual language pre-training model;
And respectively extracting the video parameters to be learned and the parameter characteristics corresponding to the frame parameters to be learned by using a text encoder of the target visual language pre-training model.
15. The method for training a video language model according to claim 14, wherein the text encoder using the target visual language pre-training model extracts the parameters corresponding to the video parameters to be learned and the frame parameters to be learned, respectively, comprising:
carrying out random initialization processing on the frame parameters to be learned by using a text encoder of the target visual language pre-training model, and taking a random initialization result of the frame parameters to be learned as frame parameter characteristics;
And utilizing a text encoder of the target visual language pre-training model to encode the video parameters to be learned based on the current attention mask, so as to obtain video parameter characteristics.
16. The video language model training method of claim 14, wherein the text description tags comprise video description text tags and video frame description text tags, the text encoder utilizing the target visual language pre-training model extracting text semantic features of the text description tags of the video samples comprising:
extracting video text features of the video description text labels by using a text encoder of the target visual language pre-training model;
Performing tokenization processing on the video frame description text labels by using a text encoder of the target visual language pre-training model, and performing word embedding processing on the tokenization processing result to obtain video frame text characteristics;
And utilizing a text encoder of the target visual language pre-training model to encode the video description text label based on the current attention mask so as to obtain video text characteristics.
17. The method for training a video language model according to claim 14, wherein the extracting the image features of each frame of sample image by the image encoder of the target visual language pre-training model to obtain the visual features comprises:
dividing the current frame image into a plurality of image blocks with non-overlapping contents;
converting each image block into a one-dimensional representation through linear mapping, and adding position coding information to the corresponding image block;
And inputting the image block subjected to linear mapping and position coding to an encoder of a second converter model, and extracting the characteristics of the output of the encoder of the second converter model to obtain the visual characteristics of the video sample.
18. The training method of a video language model according to any one of claims 1 to 17, wherein the parameter feature corresponding to the frame parameter to be learned is a frame parameter feature, the text description tag includes a video frame description text tag and a video description text tag, the text semantic feature corresponding to the video frame description text tag is a video frame text feature, the text semantic feature corresponding to the video description text tag is a video text feature, and the training process of the video language model includes:
taking the video frame description text label, the frame parameters to be learned and the video sample data set as inputs, and training the video frame adapter by freezing an image encoder of the target visual language pre-training model and utilizing the frame parameters to be learned to acquire visual information corresponding to the video frame description text label;
When the training of the video frame adapter is completed, taking the video frame description text label, the frame parameters to be learned, the video parameters to be learned and the video sample data set as inputs to train the video adapter.
19. The video language model training method of claim 18, wherein said training said video frame adapter comprises:
invoking a video frame adapter loss function, and training the video frame adapter; the video frame adapter loss function is:
Where Loss frame represents the video frame adapter penalty function, Loss ITM is the frame-text matching penalty, Loss ITC is the text generation penalty, Loss ITG is the frame-text contrast penalty, Loss MTF is the video frame mask penalty, α 0 is the frame-text matching penalty coefficient, α 1 is the text generation penalty coefficient, α 2 is the frame-text contrast penalty coefficient, and β is the video frame mask penalty coefficient.
20. The video language model training method of claim 19, wherein said training said video frame adapter comprises:
invoking a video language loss function to train the video adapter; the video language loss function is:
Where Loss represents the video language penalty function, α is the video frame adapter penalty function coefficient, Loss G is the video-text penalty, and γ is the video-text penalty function coefficient.
21. A method for executing a video language task, comprising:
Training to obtain a video language model by using the video language model training method as claimed in any one of claims 1 to 20;
Acquiring a video language task to be executed and a corresponding video language task training sample set;
Based on the video language task, utilizing the video language task training sample set to finely tune the video language model;
And executing the video language task by utilizing the trimmed video language model.
22. The video language task execution method according to claim 21, wherein the video language task to be executed is a video content understanding task, and the video language task training sample set is a video sample set of a plurality of video samples carrying video content tags; the fine tuning of the video language model based on the video language task using the video language task training sample set includes:
And based on the video content understanding task, utilizing the video sample set to finely tune the video language model so as to execute the video content understanding task by utilizing the finely tuned video language model.
23. A video language model training apparatus, comprising:
the data acquisition module is used for acquiring a video sample data set carrying a text description tag, preset video parameters to be learned and preset frame parameters to be learned;
The input data processing module is used for inputting the video sample in the video sample data set, the video parameter to be learned and the frame parameter to be learned into the video language model; the video language model comprises a target visual language pre-training model, a video frame adapter and a video adapter; the target visual language pre-training model is used for extracting visual characteristics and parameter characteristics and inputting the visual characteristics and the parameter characteristics into the video frame adapter and the video adapter correspondingly, the video frame adapter is used for converting the visual characteristics into frame visual information meeting the requirements of the target visual language pre-training model, and the video adapter is used for extracting video visual information; the method comprises the steps that parameter characteristics corresponding to video parameters to be learned are input to a video adapter, and parameter characteristics corresponding to frame parameters to be learned are input to the video frame adapter so as to obtain text-related visual information by utilizing the frame parameters to be learned; the image encoder of the target visual language pre-training model is used for extracting visual characteristics of a video sample, the text encoder of the target visual language pre-training model is used for extracting text semantic characteristics of a text description tag of the video sample, and extracting video parameters to be learned and parameter characteristics corresponding to the frame parameters to be learned; the model parameter updating module is used for carrying out iterative updating on the video language model according to the frame visual information, the video visual information and the loss information of text semantic features until a preset model training ending condition is met;
The frame parameter to be learned is characterized in that the parameter characteristic corresponding to the frame parameter to be learned is a frame parameter characteristic, the text description tag comprises a video frame description text tag, the text semantic characteristic corresponding to the video frame description text tag is a video frame text characteristic, and the video frame adapter comprises a frame input layer, a text coding layer, a cross-mode fusion layer, a characteristic enhancement layer and a frame output layer; the frame input layer is used for receiving the splicing result of the frame parameter characteristics and the video frame text characteristics; the text coding layer is used for coding the splicing result based on the current attention mask to obtain frame parameter coding characteristics; the cross-modal fusion layer is used for carrying out cross-modal fusion processing on the frame parameter coding features and the visual features; the feature enhancement layer is used for carrying out feature enhancement processing on the fusion result and inputting enhanced features into the text coding layer; the frame output layer is used for outputting frame visual information;
The video adapter comprises a video input layer, a parameter encoder layer, a feature fusion layer, a feature extraction layer and a video output layer, wherein the parameter features corresponding to the video parameters to be learned are video parameter features; the video input layer is used for receiving the combined characteristics of the visual characteristics and the frame visual information; the parameter encoder layer is used for encoding the video parameter characteristics to obtain video parameter encoding characteristics; the feature fusion layer is used for carrying out fusion processing on the video parameter coding features and the joint features; the feature extraction layer is used for extracting features of the fusion processing result and transmitting the extracted features to the parameter encoder layer; the video output layer is used for outputting video visual information;
Wherein the model parameter updating module is further configured to: each video sample is provided with a text description tag, the video description text tag is text data corresponding to the description of the whole video, the video frame description text tag is text data corresponding to the description of the current frame of image, when frame visual information and video visual information corresponding to the video sample are obtained, the frame visual information and the video visual information are compared with text semantic features of the text description tag corresponding to the video sample, and model parameters of the video language model are updated by continuously reducing the difference between the frame visual information and the video visual information.
24. A video language task execution device, comprising:
a model training module, configured to train to obtain a video language model by using the video language model training method according to any one of claims 1 to 20;
the data acquisition module is used for acquiring a video language task to be executed and a corresponding video language task sample set;
The model fine tuning module is used for carrying out fine tuning on the video language model by utilizing the video language task sample set based on the video language task;
and the task execution module is used for executing the video language task by utilizing the trimmed video language model.
25. An electronic device comprising a processor and a memory, the processor being configured to implement the steps of the video language model training method of any one of claims 1 to 20 and/or the video language task execution method of claim 21 or 22 when executing a computer program stored in the memory.
26. A readable storage medium, characterized in that the readable storage medium has stored thereon a computer program which, when executed by a processor, implements the steps of the video language model training method according to any one of claims 1 to 20 and/or the video language task execution method according to claim 21 or 22.
CN202410270242.6A 2024-03-11 2024-03-11 Video language task execution and model training method, device, equipment and medium thereof Active CN117876940B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202410270242.6A CN117876940B (en) 2024-03-11 2024-03-11 Video language task execution and model training method, device, equipment and medium thereof

Publications (2)

Publication Number Publication Date
CN117876940A CN117876940A (en) 2024-04-12
CN117876940B true CN117876940B (en) 2024-05-31

Family

ID=90595230

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202410270242.6A Active CN117876940B (en) 2024-03-11 2024-03-11 Video language task execution and model training method, device, equipment and medium thereof

Country Status (1)

Country Link
CN (1) CN117876940B (en)

Families Citing this family (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN118379749B (en) * 2024-06-20 2024-08-27 清华大学 Visual language model parameter alignment method and device, storage medium and electronic equipment
CN118520933B (en) * 2024-07-25 2024-09-17 山东海量信息技术研究院 Visual language model training method, device, medium and computer program product

Family Cites Families (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN112668671B (en) * 2021-03-15 2021-12-24 北京百度网讯科技有限公司 Method and device for acquiring pre-training model

Patent Citations (8)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN114996513A (en) * 2022-05-11 2022-09-02 湖南大学 Video question-answering method and system based on cross-modal prompt learning
CN115563342A (en) * 2022-10-19 2023-01-03 国家计算机网络与信息安全管理中心广东分中心 Method, system, equipment and storage medium for video theme retrieval
CN116363560A (en) * 2023-03-23 2023-06-30 上海人工智能创新中心 Video mask self-coding method and system
CN116861995A (en) * 2023-07-10 2023-10-10 京东科技信息技术有限公司 Training of multi-mode pre-training model and multi-mode data processing method and device
CN117037176A (en) * 2023-08-03 2023-11-10 厦门大学 Pre-training language model adaptation method for vision-language task
CN117253112A (en) * 2023-08-29 2023-12-19 哈尔滨工业大学 Large-model visual language cross-modal learning method for structural health diagnosis
CN117541956A (en) * 2023-10-26 2024-02-09 浙江大学 Video semantic feature extraction method based on self-supervised learning
CN117609550A (en) * 2024-01-17 2024-02-27 腾讯科技(深圳)有限公司 Video title generation method and training method of video title generation model

Non-Patent Citations (2)

* Cited by examiner, † Cited by third party
Title
Relation-mining-driven automatic video description generation; Huang Yi; Bao Bingkun; Xu Changsheng; Journal of Nanjing University of Information Science & Technology (Natural Science Edition); 2017-11-28 (Issue 06); full text *
Research on text word vectors and pre-trained language models; Xu Feifei; Feng Dongsheng; Journal of Shanghai University of Electric Power; 2020-08-15 (Issue 04); full text *

Also Published As

Publication number Publication date
CN117876940A (en) 2024-04-12

Similar Documents

Publication Publication Date Title
CN112487182B (en) Training method of text processing model, text processing method and device
WO2022057776A1 (en) Model compression method and apparatus
CN117876940B (en) Video language task execution and model training method, device, equipment and medium thereof
CN111741330B (en) Video content evaluation method and device, storage medium and computer equipment
CN113704460B (en) Text classification method and device, electronic equipment and storage medium
CN114676234A (en) Model training method and related equipment
CN110795944A (en) Recommended content processing method and device, and emotion attribute determining method and device
CN113627447A (en) Label identification method, label identification device, computer equipment, storage medium and program product
CN115221846A (en) Data processing method and related equipment
CN112749556B (en) Multi-language model training method and device, storage medium and electronic equipment
CN117540221B (en) Image processing method and device, storage medium and electronic equipment
CN116432019A (en) Data processing method and related equipment
CN116541492A (en) Data processing method and related equipment
Anitha Kumari et al. Automated image captioning for flickr8k dataset
CN117313728A (en) Entity recognition method, model training method, device, equipment and storage medium
CN117765450B (en) Video language understanding method, device, equipment and readable storage medium
CN115269781A (en) Modal association degree prediction method, device, equipment, storage medium and program product
CN117711001B (en) Image processing method, device, equipment and medium
CN112100389A (en) Long text classification method and device
CN117216255A (en) Classification model training method and related equipment
CN117034133A (en) Data processing method, device, equipment and medium
CN118228035B (en) Content tag determination method and related equipment
CN118135466B (en) Data processing method, device, computer, storage medium and program product
CN118227910B (en) Media resource aggregation method, device, equipment and storage medium
CN116882398B (en) Implicit chapter relation recognition method and system based on phrase interaction

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant