CN117351387A - Video dialogue and model training method, device, equipment and storage medium - Google Patents

Video dialogue and model training method, device, equipment and storage medium

Info

Publication number
CN117351387A
Authority
CN
China
Prior art keywords
video
representation
embedding
processing
module
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202311176255.9A
Other languages
Chinese (zh)
Inventor
宋雨鑫
戎康
刘芳龙
张琦
李鑫
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Beijing Baidu Netcom Science and Technology Co Ltd
Original Assignee
Beijing Baidu Netcom Science and Technology Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Beijing Baidu Netcom Science and Technology Co Ltd filed Critical Beijing Baidu Netcom Science and Technology Co Ltd
Priority to CN202311176255.9A
Publication of CN117351387A

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06V IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V20/00 Scenes; Scene-specific elements
    • G06V20/40 Scenes; Scene-specific elements in video content
    • G06V20/41 Higher-level, semantic clustering, classification or understanding of video scenes, e.g. detection, labelling or Markovian modelling of sport events or news items
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06V IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V10/00 Arrangements for image or video recognition or understanding
    • G06V10/70 Arrangements for image or video recognition or understanding using pattern recognition or machine learning
    • G06V10/77 Processing image or video features in feature spaces; using data integration or data reduction, e.g. principal component analysis [PCA] or independent component analysis [ICA] or self-organising maps [SOM]; Blind source separation
    • G06V10/774 Generating sets of training patterns; Bootstrap methods, e.g. bagging or boosting
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06V IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V10/00 Arrangements for image or video recognition or understanding
    • G06V10/70 Arrangements for image or video recognition or understanding using pattern recognition or machine learning
    • G06V10/82 Arrangements for image or video recognition or understanding using pattern recognition or machine learning using neural networks
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06V IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V20/00 Scenes; Scene-specific elements
    • G06V20/40 Scenes; Scene-specific elements in video content
    • G06V20/46 Extracting features or characteristics from the video content, e.g. video fingerprints, representative shots or key frames
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06V IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V20/00 Scenes; Scene-specific elements
    • G06V20/60 Type of objects
    • G06V20/62 Text, e.g. of license plates, overlay texts or captions on TV images

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Multimedia (AREA)
  • Physics & Mathematics (AREA)
  • General Physics & Mathematics (AREA)
  • Evolutionary Computation (AREA)
  • Software Systems (AREA)
  • Health & Medical Sciences (AREA)
  • Artificial Intelligence (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Computing Systems (AREA)
  • Databases & Information Systems (AREA)
  • General Health & Medical Sciences (AREA)
  • Medical Informatics (AREA)
  • Computational Linguistics (AREA)
  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)

Abstract

The disclosure provides a video dialogue and model training method, device, equipment and storage medium, relates to the technical field of artificial intelligence, in particular to the technical fields of computer vision, deep learning, large models and the like, and can be applied to scenes such as AIGC. The video dialogue method includes the following steps: performing representation extraction processing on a target video to obtain an initial video representation of the target video; performing space-time processing on the initial video representation to obtain a target video representation of the target video; performing conversion processing on the target video representation to obtain a video embedding of the target video, where the dimension of the video embedding is the same as the dimension of the text embedding of the question text; and performing dialogue processing on the video embedding and the text embedding to obtain answer text.

Description

Video dialogue and model training method, device, equipment and storage medium
Technical Field
The disclosure relates to the technical field of artificial intelligence, in particular to the technical fields of computer vision, deep learning, large models and the like, can be applied to scenes such as AIGC (AI-generated content), and particularly relates to a video dialogue and model training method, device, equipment and storage medium.
Background
A vision-centric multimodal dialog system is an important area of research. Such dialog systems typically use pre-trained large language models (Large Language Model, LLM), in conjunction with image encoders and other learnable modules, to perform image-related tasks in a dialog with a user.
The above scheme is mainly aimed at conducting dialogues about images. How to apply it to video scenes so as to implement video dialogue is a problem to be solved.
Disclosure of Invention
The present disclosure provides a video conversation and model training method, apparatus, device, and storage medium.
According to an aspect of the present disclosure, there is provided a video dialogue method, including: performing representation extraction processing on a target video to obtain an initial video representation of the target video; performing space-time processing on the initial video representation to obtain a target video representation of the target video; performing conversion processing on the target video representation to obtain a video embedding of the target video, where the dimension of the video embedding is the same as the dimension of the text embedding of the question text; and performing dialogue processing on the video embedding and the text embedding to obtain answer text.
According to another aspect of the present disclosure, there is provided a training method of a video dialogue model, the video dialogue model including an operation timing module, the method including: performing representation extraction processing on a video sample to obtain an initial video representation of the video sample; performing space-time processing on the initial video representation by adopting the operation timing module to obtain a target video representation of the video sample; performing conversion processing on the target video representation to obtain a video embedding of the video sample, where the dimension of the video embedding is the same as the dimension of the text embedding of the question sample; performing dialogue processing on the video embedding and the text embedding to obtain a predicted answer; constructing a loss function based on the predicted answer; and adjusting parameters of the operation timing module by adopting the loss function.
According to another aspect of the present disclosure, there is provided a video dialogue apparatus, including: an extraction module, configured to perform representation extraction processing on a target video to obtain an initial video representation of the target video; a processing module, configured to perform space-time processing on the initial video representation to obtain a target video representation of the target video; a conversion module, configured to perform conversion processing on the target video representation to obtain a video embedding of the target video, where the dimension of the video embedding is the same as the dimension of the text embedding of the question text; and a generating module, configured to perform dialogue processing on the video embedding and the text embedding to obtain answer text.
According to another aspect of the present disclosure, there is provided a training apparatus of a video dialogue model, the video dialogue model including an operation timing module, the apparatus including: an extraction module, configured to perform representation extraction processing on a video sample to obtain an initial video representation of the video sample; a processing module, configured to perform space-time processing on the initial video representation by adopting the operation timing module to obtain a target video representation of the video sample; a conversion module, configured to perform conversion processing on the target video representation to obtain a video embedding of the video sample, where the dimension of the video embedding is the same as the dimension of the text embedding of the question sample; a generating module, configured to perform dialogue processing on the video embedding and the text embedding to obtain a predicted answer; a construction module, configured to construct a loss function based on the predicted answer; and an adjusting module, configured to adjust parameters of the operation timing module by adopting the loss function.
According to another aspect of the present disclosure, there is provided an electronic device including: at least one processor; and a memory communicatively coupled to the at least one processor; wherein the memory stores instructions executable by the at least one processor to enable the at least one processor to perform the method of any one of the above aspects.
According to another aspect of the present disclosure, there is provided a non-transitory computer readable storage medium storing computer instructions for causing the computer to perform the method according to any one of the above aspects.
According to another aspect of the present disclosure, there is provided a computer program product comprising a computer program which, when executed by a processor, implements a method according to any of the above aspects.
According to the technical scheme, the video dialogue effect can be improved.
It should be understood that the description in this section is not intended to identify key or critical features of the embodiments of the disclosure, nor is it intended to be used to limit the scope of the disclosure. Other features of the present disclosure will become apparent from the following specification.
Drawings
The drawings are for a better understanding of the present solution and are not to be construed as limiting the present disclosure. Wherein:
FIG. 1 is a schematic diagram according to a first embodiment of the present disclosure;
fig. 2 is a schematic diagram of an application scenario provided according to an embodiment of the present disclosure;
FIG. 3 is a schematic diagram of the overall architecture of a dialog system provided in accordance with an embodiment of the present disclosure;
FIG. 4 is a schematic diagram according to a second embodiment of the present disclosure;
FIG. 5 is a schematic diagram according to a third embodiment of the present disclosure;
FIG. 6 is a schematic diagram according to a fourth embodiment of the present disclosure;
FIG. 7 is a schematic diagram according to a fifth embodiment of the present disclosure;
fig. 8 is a schematic diagram of an electronic device for implementing a video conversation method or training method for a video conversation model in accordance with an embodiment of the present disclosure.
Detailed Description
Exemplary embodiments of the present disclosure are described below in conjunction with the accompanying drawings, which include various details of the embodiments of the present disclosure to facilitate understanding, and should be considered as merely exemplary. Accordingly, one of ordinary skill in the art will recognize that various changes and modifications of the embodiments described herein can be made without departing from the scope and spirit of the present disclosure. Also, descriptions of well-known functions and constructions are omitted in the following description for clarity and conciseness.
In the related art, for a video conversation, the video content is typically converted into text, the text is input to the LLM as a prompt, and the conversation with the user is carried out through the LLM.
However, converting video content to text can undoubtedly lead to visual information loss and excessive simplification of spatiotemporal complexity, affecting the reasoning effect of the dialog system.
In order to enhance the video dialog effect, the present disclosure provides the following embodiments.
Fig. 1 is a schematic diagram according to a first embodiment of the present disclosure. The embodiment provides a video dialogue method, which comprises the following steps:
101. Performing representation extraction processing on the target video to obtain an initial video representation of the target video.
102. Performing space-time processing on the initial video representation to obtain a target video representation of the target video.
103. Performing conversion processing on the target video representation to obtain a video embedding of the target video; wherein the dimension of the video embedding is the same as the dimension of the text embedding of the question text.
104. Performing dialogue processing on the video embedding and the text embedding to obtain answer text.
The target video is a video to be processed, and can be uploaded to the dialogue system by a user.
The question text is the question content about the target video that is input by the user. For example, the user uploads a target video to the dialogue system and enters question text for the target video; specifically, after the user uploads the target video, the question text may be "what program is this video".
The answer text refers to the answer content obtained based on the target video and the question text. Following the above example, the answer text is "this video is program XX", where XX is the specific program name determined according to the actual situation.
For distinction, the video representation of the target video is divided into an initial video representation and a target video representation, wherein the video representation before the spatio-temporal processing is called the initial video representation and the video representation after the spatio-temporal processing is called the target video representation.
The initial video representation can be obtained by adopting a visual encoder to perform representation extraction processing on the target video. The visual encoder is, for example, a Vision Transformer (ViT) model. The ViT model is a representation extraction model for vision-related tasks based on the Transformer encoder. Specifically, the target video may be input into the ViT model, and the output is the initial video representation.
The space-time processing refers to processing the initial video representation in the time dimension and the space dimension. Specifically, it can include performing the four arithmetic operations on the initial video representation, where the four arithmetic operations include addition, subtraction, multiplication and division. Further, the four arithmetic operations may be implemented by a convolution module configured to perform a convolution operation, which may include a one-dimensional (1D) convolution operation and/or a two-dimensional (2D) convolution operation. The specific structure of the convolution module can be set according to actual needs.
The target video representation is a video representation obtained by performing space-time processing on the initial video representation.
Through space-time processing, the target video representation can be made to contain abundant space-time information. In particular, the target video characterization may characterize average properties (e.g., intensity of images or cumulative state of images), varying properties (e.g., passage of time, trend of motion), relevance, consistency, etc. in the target video. Wherein, the average property can be represented by an addition operation, the variation property can be represented by a subtraction operation, the correlation can be represented by a multiplication operation, and the consistency can be represented by a division operation.
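To make these four operations concrete, the following minimal PyTorch sketch (not the patent's implementation; the tensor shapes and the one-frame temporal context are illustrative assumptions) applies addition, subtraction, multiplication and division between a per-frame representation and a temporally shifted copy of it:

```python
import torch

T, C, H, W = 8, 64, 14, 14
X = torch.randn(T, C, H, W)                  # initial video representation
X_next = torch.roll(X, shifts=-1, dims=0)    # crude one-frame temporal context

average_like = X + X_next                    # addition: average property (e.g. accumulated state)
change       = X_next - X                    # subtraction: changing property (passage of time, motion trend)
relevance    = X * X_next                    # multiplication: relevance between adjacent frames
consistency  = X / (X_next.abs() + 1e-6)     # division: consistency (small epsilon for numerical stability)
```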
The text embedding can be obtained by embedding the question text with a text encoder.
In order to ensure the accuracy of dialogue processing, mapping processing can be performed on the target video representation to obtain video embedding, wherein the video embedding and the text embedding are in the same vector space, i.e. the dimensions of the video embedding and the text embedding are the same.
After the video embedding and the text embedding are obtained, dialogue processing can be carried out on the video embedding and the text embedding to obtain an answer text.
In this embodiment, the initial video representation is obtained by performing representation extraction processing on the target video, and since the processing is based on the video representation, the visual information of the target video can be fully utilized; in addition, the initial video representation is subjected to space-time processing to obtain the target video representation, so that the rich space-time information of the target video can be utilized. Thus, the video dialogue effect can be improved.
In order to better understand the embodiments of the present disclosure, application scenarios to which the embodiments of the present disclosure are applicable are described.
Fig. 2 is a schematic diagram of an application scenario provided in an embodiment of the present disclosure. The scene comprises: user terminal 201 and server 202, user terminal 201 may include: personal computers (Personal Computer, PCs), cell phones, tablet computers, notebook computers, smart wearable devices, and the like. The server 202 may be a cloud server or a local server, and the user terminal 201 and the server 202 may communicate using a communication network, for example, a wired network and/or a wireless network.
The user can send the target video and the question text to the server 202 through the user terminal 201, and the server 202 performs dialogue processing based on the target video and the question text to obtain an answer text; the server 202 transmits the answer text to the user terminal 201 for display.
Specifically, a dialogue system can be deployed on the server, and man-machine dialogue is performed through the dialogue system. I.e., using a dialog system, answer text is generated based on the target video and the question text.
The dialogue system can conduct video dialogue based on an image-text multimodal dialogue large model.
The image-text multimodal dialogue large model may specifically be a Large Language and Vision Assistant (LLaVA) model.
The LLaVA model is a large multi-modal model that connects a visual encoder and LLM for general visual and linguistic understanding. Wherein the visual encoder may be a ViT model.
LLM has been a hot topic in the field of artificial intelligence in recent years. An LLM is a pre-trained language model that learns rich language knowledge and world knowledge by pre-training on massive text data, and can therefore achieve remarkable results on various natural language processing (Natural Language Processing, NLP) tasks. ERNIE Bot (Wenxin Yiyan), ChatGPT and the like are all applications developed based on LLMs, and can generate fluent, logical and creative text content and even conduct natural dialogues with humans.
Taking the LLaVA model as an example, the LLM corresponding to the LLaVA model can specifically be selected as StableVicuna, which is an LLM.
As shown in fig. 3, the ViT model includes a plurality of ViT blocks. In this embodiment, two adjacent ViT blocks may be selected, and an operation timing module (Arithmetic Temporal Module, ATM) is additionally introduced between the two ViT blocks, so that space-time processing is performed through the ATM.
It is assumed that the ATM is introduced between the i-th ViT block and the (i+1)-th ViT block, where the specific value of i can be set according to actual needs.
The input of the ViT model (the 1st ViT block) is the target video, and the output of the i-th ViT block is the initial video representation X; the ATM performs space-time processing on the initial video representation X to obtain the target video representation X'.
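As an illustration of this wiring only (the class and attribute names are assumptions, not the patent's code), the sketch below inserts an ATM-style module between the i-th and (i+1)-th blocks of a block-list encoder:

```python
import torch.nn as nn

class VideoEncoderWithATM(nn.Module):
    """Wraps a stack of ViT blocks with an ATM inserted after the i-th block."""
    def __init__(self, vit_blocks: nn.ModuleList, atm: nn.Module, i: int):
        super().__init__()
        self.blocks, self.atm, self.i = vit_blocks, atm, i

    def forward(self, frames):
        x = frames
        for block in self.blocks[: self.i]:
            x = block(x)                   # output of the i-th block: initial representation X
        x_prime = self.atm(x)              # space-time processing: target representation X'
        x = x + x_prime                    # the added representation feeds the (i+1)-th block (described later in the text)
        for block in self.blocks[self.i:]:
            x = block(x)
        return x
```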
For the ATM, it mainly includes the four arithmetic operation modules (addition, subtraction, multiplication and division), and may further include: a context generation module, a feature extraction module, and a region transformation module. The feature extraction module may include a two-dimensional convolution (Conv2D) module, and the region transformation module may include a regrouping (Regroup) module and a two-dimensional convolution (Conv2D) module.
Specifically, the dimension of the initial video representation X is T×C×H×W, where T is the number of frames of the target video, C is the number of channels, and H and W are the height and width of each image in the target video.
The context generation module is used for performing context representation extraction processing on the initial video representation X, i.e., acquiring the representations of a preset number of adjacent images for each image. The output of the context generation module is a video representation containing context information, which may be referred to as the context representation, and its dimension is T×Z×C×H×W, where Z is the number of context images; for example, as shown in fig. 3, the representations of 5 adjacent images may be obtained for each image, so Z=5. The context generation module can be composed of a convolution module, and its specific structure can be set according to actual needs.
The arithmetic operation module is used for performing the four arithmetic operations on the initial video representation and the context representation. During operation, the initial video representation can be padded to match the dimension of the context representation, and then the four operations are performed.
The arithmetic operation module can be composed of convolution modules (1D convolution modules and/or 2D convolution modules) to realize the four arithmetic operations, and its specific structure can be set according to actual requirements. The model parameters of the convolution modules in the arithmetic operation module can be learned in the training stage, so that in the reasoning stage the model structure and model parameters of the arithmetic operation module are determined, and the determined arithmetic operations can then be performed.
The feature extraction module may specifically consist of a two-dimensional convolution module (Conv2D), and performs feature extraction on the representation after the arithmetic operations to obtain a feature-extracted representation. The dimensions of both the arithmetic-operated representation and the feature-extracted representation are T×Z×C×H×W.
And the region transformation module is used for carrying out region transformation processing on the characterization after the feature extraction to obtain a target video characterization.
The region transformation module comprises a reorganization module and a 2D convolution module.
The reorganization module is used for performing dimension reorganization on the feature-extracted representation. Specifically, the dimension of the input representation of the reorganization module is T×Z×C×H×W, the dimension of the output representation is T×ZC×H×W, and ZC represents the product of Z and C.
The 2D convolution module in the region transformation module is used for performing 2D convolution processing on the dimension-reorganized representation to obtain the target video representation X'. The dimension of the target video representation X' is T×C×H×W.
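A possible PyTorch sketch of such a module is given below. It is only an assumption-laden illustration: the context is taken as Z temporally adjacent frames, the four arithmetic branches are simply summed (the text does not specify how they are fused), and the convolution layer choices are illustrative rather than the patent's actual design:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class ArithmeticTemporalModule(nn.Module):
    """Sketch of an ATM: context generation, four arithmetic operations,
    2D-conv feature extraction, then regrouping + 2D conv back to T×C×H×W."""
    def __init__(self, channels: int, z: int = 5):
        super().__init__()
        self.z = z
        self.feature = nn.Conv2d(channels, channels, kernel_size=3, padding=1)   # feature extraction
        self.domain = nn.Conv2d(z * channels, channels, kernel_size=1)            # region transformation

    def forward(self, x):                                     # x: (T, C, H, W), initial representation X
        t, c, h, w = x.shape
        pad = self.z // 2
        padded = F.pad(x, (0, 0, 0, 0, 0, 0, pad, pad))       # pad along the time dimension
        # context generation: Z adjacent frame representations per frame -> (T, Z, C, H, W)
        context = torch.stack([padded[i:i + t] for i in range(self.z)], dim=1)
        x_rep = x.unsqueeze(1)                                # broadcast X against its context
        # four arithmetic operations; summing the four branches is an assumption of this sketch
        mixed = (x_rep + context) + (context - x_rep) + (x_rep * context) \
                + x_rep / (context.abs() + 1e-6)              # (T, Z, C, H, W)
        y = self.feature(mixed.reshape(t * self.z, c, h, w))  # 2D conv over each context slice
        y = y.reshape(t, self.z * c, h, w)                    # regroup: T×Z×C×H×W -> T×ZC×H×W
        return self.domain(y)                                 # target representation X': (T, C, H, W)

# x = torch.randn(8, 64, 14, 14); ArithmeticTemporalModule(64)(x).shape -> torch.Size([8, 64, 14, 14])
```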
After the target video representation is obtained, the initial video representation X and the target video representation X' can be added, and the added video representation (X+X') is used as the input of the (i+1)-th ViT block. The (i+1)-th ViT block further performs representation extraction processing on the input representation to obtain its output representation, which, after conversion processing, yields the key embedding (K) and the value embedding (V) required by the attention network.
In addition, the input of the attention network also includes a query embedding (Q). In this embodiment, the Q input to the QFormer may be referred to as the target query embedding, which includes two parts: one is the original query embedding of the QFormer, and the other is an additional learnable Q (the newly added query embedding). The original query embedding is kept unchanged (frozen) during the training phase; the initial value of the newly added query embedding may be a random vector, and its final value is obtained through continual learning (adjustment) during the training phase. The unchanged original query embedding and the newly added query embedding together form the target query embedding, which is used as the input of the QFormer model.
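The following sketch (names and sizes are illustrative assumptions, not the patent's code) shows one way such a target query embedding could be realized, with the original part frozen and the newly added part learnable:

```python
import torch
import torch.nn as nn

class TargetQueryEmbedding(nn.Module):
    """Frozen original query embedding + learnable newly added query embedding."""
    def __init__(self, original_queries: torch.Tensor, num_new: int):
        super().__init__()
        # original query embedding: registered as a buffer so it stays frozen
        self.register_buffer("original", original_queries)
        # newly added query embedding: random initial value, adjusted during training
        self.new = nn.Parameter(torch.randn(num_new, original_queries.shape[-1]) * 0.02)

    def forward(self) -> torch.Tensor:
        # target query embedding Q = [original ; newly added]
        return torch.cat([self.original, self.new], dim=0)

# q = TargetQueryEmbedding(torch.randn(32, 768), num_new=16)()   # shape (48, 768)
```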
The QFormer model is based on an attention network. After K, V and Q are obtained, they can be processed by the attention network to obtain the output representation of the QFormer model, which is then processed by a linear layer to obtain the video embedding.
Through the linear layer, dimension adjustment can be performed on the representation; for example, the target video representation with dimension T×C×H×W is converted into a video embedding with dimension C×N, where N is obtained by compressing the product of the three parameters T, H and W.
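The sketch below is a simplified stand-in for this step, not the real QFormer: a single cross-attention layer is assumed in place of the full querier, the dimensions in the usage comment are made up, and the trainable linear layer simply maps the attended output into the text-embedding space:

```python
import torch
import torch.nn as nn

class QueryToVideoEmbedding(nn.Module):
    """Queries attend to K/V from the visual encoder; a linear layer maps the
    result to the text-embedding dimension so both live in the same space."""
    def __init__(self, dim: int, text_dim: int, num_heads: int = 8):
        super().__init__()
        self.attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        self.proj = nn.Linear(dim, text_dim)      # trainable linear layer

    def forward(self, q: torch.Tensor, kv: torch.Tensor) -> torch.Tensor:
        out, _ = self.attn(q, kv, kv)             # query-processed video representation
        return self.proj(out)                     # video embedding, aligned with the text embedding

# emb = QueryToVideoEmbedding(768, 4096)(torch.randn(1, 48, 768), torch.randn(1, 256, 768))
```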
During the training phase, the model parameters of the QFormer model remain unchanged (frozen) while the model parameters of the linear layer are adjusted.
Model parameters of the LLM remain unchanged during the training phase. The LLM can perform dialogue processing based on the text embedding and the video embedding to generate the answer text. The text embedding and the video embedding are representations with the same dimension, for example, both of dimension C×N.
The LLaVA model is an image-text multimodal model, i.e., it can understand images and text. In this embodiment, the ViT model, the ATM, the QFormer model, the linear layer and the LLM may use a processing manner similar to that for images and text to generate answer text based on the video embedding and the text embedding, so as to migrate the understanding of images and text to the understanding of videos and text and realize knowledge migration. By using the LLaVA model as a warm start, the reasoning speed and the reasoning effect can be improved.
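A hedged sketch of this parameter-efficient setup is given below; the attribute names (vit, qformer, llm, atm, linear) are assumptions used only for illustration of which parts stay frozen and which remain trainable:

```python
def freeze(module):
    """Disable gradient updates for all parameters of a module."""
    for p in module.parameters():
        p.requires_grad = False

def configure_trainable(model):
    """Freeze the pretrained components; keep only the newly introduced ones trainable."""
    freeze(model.vit)        # visual encoder (ViT)
    freeze(model.qformer)    # querier
    freeze(model.llm)        # large language model
    # model.atm, the newly added query embedding and model.linear stay trainable
    return [p for p in model.parameters() if p.requires_grad]
```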
In combination with the above application scenario, the present disclosure further provides the following embodiments.
Fig. 4 is a schematic diagram of a second embodiment of the present disclosure, which provides a video dialogue method, as shown in fig. 4, including:
401. Adopting a visual encoder to perform representation extraction processing on the target video to obtain an initial video representation of the target video.
Referring to fig. 3, the visual encoder may be a ViT model, the output representation of the i-th ViT block of the ViT model is taken as the initial video representation, and the specific value of i may be set according to actual needs.
402. Performing space-time processing on the initial video representation by adopting an operation timing module (ATM) in a video dialogue model to obtain a target video representation of the target video.
Referring to fig. 3, the ATM may be introduced between the i-th ViT block and the (i+1)-th ViT block. The ATM performs space-time processing on the input initial video representation X, and the output is the target video representation X'.
In this embodiment, because the ATM is trainable, the ATM with better performance can be obtained through the training process, so that during reasoning, the ATM with better performance can be adopted to perform space-time processing on the initial video representation, so as to obtain the target video representation with better performance, and further improve the video dialogue effect.
As shown in fig. 3, the ATM may include: a context generation module, an arithmetic operation module, a feature extraction module and a region transformation module.
Correspondingly, adopting the pre-trained operation timing module to perform space-time processing on the initial video representation to obtain the target video representation includes:
performing context representation extraction processing on the initial video representation by adopting the context generation module to obtain a context representation;
performing the four arithmetic operations on the initial video representation and the context representation by adopting the arithmetic operation module to obtain an arithmetic-operated representation;
performing feature extraction processing on the arithmetic-operated representation by adopting the feature extraction module to obtain a feature-extracted representation;
and performing region transformation processing on the feature-extracted representation by adopting the region transformation module to obtain the target video representation.
Rich space-time information of the target video is extracted through the four arithmetic operations, for example, representing the average property of the target video (such as the intensity of an image or the accumulated state of images) through the addition operation, representing the changing property of the target video (such as the passage of time and the trend of motion) through the subtraction operation, representing the relevance of the target video through the multiplication operation, and representing the consistency of the target video through the division operation.
Therefore, based on the four arithmetic operations, a target video representation containing abundant space-time information can be obtained; in addition, the representation capability of the target video representation can be further enhanced through the processing of context representation extraction, feature extraction, region transformation and the like, thereby further improving the video dialogue effect.
403. Performing query processing on the target video representation to obtain a video representation after query processing.
404. Mapping the video representation after the query processing to obtain the video embedding; wherein the dimensions of the video embedding are the same as the dimensions of the text embedding of the question text.
In the embodiment, the target video representation is subjected to query processing to obtain the video representation after query processing, and subsequent processing is performed based on the video representation after query processing, so that the representation capacity of the video representation can be further improved, and the video dialogue effect is improved; in addition, by mapping the video representation after query processing, video embedding in the same vector space as text embedding can be obtained, and further dialogue processing is performed based on the video embedding and the text embedding, so that the accuracy of dialogue processing can be improved.
A querier in the video dialogue model can be adopted to query the target video representation so as to obtain the video representation after query processing; wherein parameters of the querier remain unchanged during training of the video session model.
As shown in fig. 3, the querier may be embodied as a Querying Transformer (QFormer) model. The parameters of the QFormer model remain unchanged (frozen) during the training of the video dialogue model.
Specifically, the initial video representation and the target video representation may be added to obtain an added video representation, the added video representation is input into the (i+1)-th ViT block, and the key embedding (K) and the value embedding (V) are obtained based on the output representation of the (i+1)-th ViT block.
The above takes the ATM being introduced between the i-th ViT block and the (i+1)-th ViT block as an example; it will be appreciated that if the ATM is connected after the last ViT block, the target video representation can be directly converted into K and V, or the video representation obtained by adding the initial video representation and the target video representation can be converted into K and V.
The QFormer model processes the input K and V, as well as the query embedding (Q), and the output is the video representation after query processing.
In this embodiment, the query device is used to perform query processing on the target video representation, so as to obtain the video representation after query processing, which can further improve the representation capability of the video representation and improve the accuracy of the video dialogue. In addition, parameters of the inquirer are kept unchanged in the training process of the video dialogue model, the parameter quantity required to be adjusted can be reduced, and the model training efficiency is improved.
In order to enable the video embedding and the text embedding to be in the same vector space, mapping processing can be carried out on the video representation after query processing to obtain the video embedding in the same vector space as the text embedding, namely the dimensions of the video embedding and the text embedding are the same.
As shown in fig. 3, the mapping process may be performed using a linear layer in the video session model. Wherein the parameters of the linear layer are adjustable during the training of the video session model.
In this embodiment, the video embedding and the text embedding can be in the same vector space through the linear layer with the parameter capable of being learned, so as to further improve the video dialogue effect.
For the query embedding, as shown in fig. 3, the query embedding (Q) input to the QFormer model may be referred to as the target query embedding, which includes an original query embedding (shown on the left) that remains unchanged (frozen) during training and a newly added query embedding (shown on the right) that is adjustable (learnable) during training.
In this embodiment, by additionally introducing a new query embedding (newly added query embedding), and the newly added query embedding is learnable, the context modeling can be better performed on the target video, so as to obtain a compact LLM compatible video embedding, and improve the video dialogue effect.
405. Performing dialogue processing on the video embedding and the text embedding by adopting a Large Language Model (LLM) in a video dialogue model to obtain answer text; wherein the parameters of LLM remain unchanged (frozen) during the training of the video session model.
In this embodiment, since LLM is an existing large language model and has excellent text generation capability, the answer text is obtained by using LLM, so that the accuracy of the answer text can be improved by using the excellent performance of LLM. In addition, parameters of the LLM remain unchanged in the training process of the video dialogue model, so that the parameter quantity required to be adjusted can be reduced, and the model training efficiency is improved.
The above embodiments describe a video session process, and in particular, a video session model may be used to conduct a video session. For a training process of the video dialog model, see the following examples.
Fig. 5 is a schematic diagram of a third embodiment of the present disclosure, where the present embodiment provides a training method of a video session model, the video session model includes: an operational timing module (ATM), the method comprising:
501. Performing representation extraction processing on the video sample to obtain an initial video representation of the video sample.
502. Performing space-time processing on the initial video representation by adopting the operation timing module to obtain a target video representation of the video sample.
503. Performing conversion processing on the target video representation to obtain a video embedding of the video sample; wherein the dimension of the video embedding is the same as the dimension of the text embedding of the question sample.
504. Performing dialogue processing on the video embedding and the text embedding to obtain a predicted answer.
505. Constructing a loss function based on the predicted answer.
506. Adjusting the parameters of the operation timing module by adopting the loss function.
In the training phase, samples can be obtained using existing data sources or through manual collection, including video samples and question samples.
The process of obtaining a predicted answer based on the video sample and the question sample may be similar to the process of obtaining answer text during a video conversation described above.
After obtaining the predicted answer, a loss function may be constructed based on the predicted answer and the question sample, or a loss function may be constructed based on the predicted answer and the real answer. The loss function is, for example, an autoregressive loss function.
After constructing the loss function, a Back Propagation (BP) algorithm may be used to perform parameter adjustment.
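The following sketch illustrates one such training step under simple assumptions: a single sample, a standard next-token cross-entropy as the autoregressive loss, and an illustrative model(...) interface that returns per-position answer logits (none of these interfaces are specified by the patent):

```python
import torch
import torch.nn.functional as F

def training_step(model, optimizer, video_sample, question_ids, answer_ids):
    """One parameter-update step: autoregressive loss on the predicted answer,
    followed by back propagation through the trainable parameters only."""
    logits = model(video_sample, question_ids)         # (answer_len, vocab_size) predicted-answer logits
    # autoregressive loss: each position predicts the next answer token
    loss = F.cross_entropy(logits[:-1], answer_ids[1:])
    optimizer.zero_grad()
    loss.backward()                                    # back propagation (BP)
    optimizer.step()                                   # adjusts ATM / new query embedding / linear layer
    return loss.item()
```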
In the embodiment, the initial video representation is obtained by carrying out representation extraction processing on the video sample, and the visual information of the video sample can be fully utilized due to the processing based on the video representation; in addition, the target video representation is obtained by performing space-time processing on the initial video representation, and rich space-time information of the video sample can be utilized. Thus, the effect of the video dialog model can be improved.
In some embodiments, the operation timing module includes: a context generation module, an arithmetic operation module, a feature extraction module and a region transformation module;
the performing space-time processing on the initial video representation by adopting the operation timing module to obtain a target video representation of the video sample includes:
performing context representation extraction processing on the initial video representation by adopting the context generation module to obtain a context representation;
performing the four arithmetic operations on the initial video representation and the context representation by adopting the arithmetic operation module to obtain an arithmetic-operated representation;
performing feature extraction processing on the arithmetic-operated representation by adopting the feature extraction module to obtain a feature-extracted representation;
and performing region transformation processing on the feature-extracted representation by adopting the region transformation module to obtain the target video representation.
Therefore, based on the four arithmetic operations, a target video representation containing abundant space-time information can be obtained; in addition, the representation capability of the target video representation can be further enhanced through the processing of context representation extraction, feature extraction, region transformation and the like, thereby further improving the video dialogue effect.
In some embodiments, the converting the target video representation to obtain video embedding of the target video includes:
the target video representation is subjected to query processing to obtain a video representation after query processing;
and mapping the video representation after the query processing to obtain the video embedding.
In the embodiment, the target video representation is subjected to query processing to obtain the video representation after query processing, and subsequent processing is performed based on the video representation after query processing, so that the representation capacity of the video representation can be further improved, and the video dialogue effect is improved; in addition, by mapping the video representation after query processing, video embedding in the same vector space as text embedding can be obtained, and further dialogue processing is performed based on the video embedding and the text embedding, so that the accuracy of dialogue processing can be improved.
In some embodiments, the video dialog model further comprises: a querier;
the query processing is performed on the target video representation to obtain a video representation after query processing, including:
adopting the inquirer to inquire the target video representation so as to obtain an inquired video representation; wherein parameters of the querier remain unchanged during training of the video session model.
In this embodiment, the query device is used to perform query processing on the target video representation, so as to obtain the video representation after query processing, which can further improve the representation capability of the video representation and improve the accuracy of the video dialogue. In addition, parameters of the inquirer are kept unchanged in the training process of the video dialogue model, the parameter quantity required to be adjusted can be reduced, and the model training efficiency is improved.
In some embodiments, the query processing, with the querier, the target video representation to obtain a query processed video representation includes:
adopting the querier to query the target video representation based on target query embedding so as to obtain the video representation after query processing; wherein the target query embedding comprises: original query embedding and newly added query embedding, wherein parameters of the original query embedding are kept unchanged in the training process;
The method further comprises the steps of:
and adjusting the parameters embedded in the newly added query by adopting the loss function.
In this embodiment, by additionally introducing a new query embedding (newly added query embedding), and the newly added query embedding is learnable, the context modeling can be better performed on the target video, so as to obtain a compact LLM compatible video embedding, and improve the video dialogue effect.
In some embodiments, the video dialog model further comprises: a linear layer;
the mapping processing is performed on the video representation after the query processing to obtain the video embedding, which comprises the following steps:
mapping the video representation after the query processing by adopting the linear layer to obtain the video embedding;
the method further comprises the steps of:
and adjusting parameters of the linear layer by adopting the loss function.
In this embodiment, the video embedding and the text embedding can be in the same vector space through the linear layer with the parameter capable of being learned, so as to further improve the video dialogue effect.
In some embodiments, the video dialog model further comprises: a large language model;
said dialogue processing said video embedding and said text embedding to obtain a predicted answer, comprising:
Performing dialogue processing on the video embedding and the text embedding by adopting the large language model to obtain a prediction answer; wherein parameters of the large language model remain unchanged during training of the video dialog model.
In this embodiment, since LLM is an existing large language model and has excellent text generation capability, the answer text is obtained by using LLM, so that the accuracy of the answer text can be improved by using the excellent performance of LLM. In addition, parameters of the LLM remain unchanged in the training process of the video dialogue model, so that the parameter quantity required to be adjusted can be reduced, and the model training efficiency is improved.
In addition, on the whole, since only the parameters of the ATM, the parameters of the newly added query embedding and the parameters of the linear layer are adjusted during model training, while the parameters of the other models (such as the QFormer model and the LLM) are kept unchanged, efficient parameter adjustment can be realized, the quantity of parameters to be adjusted can be reduced, the data quantity of instruction data (question samples) can be reduced, and the model training efficiency is improved. In addition, the image-text multimodal large model is used as a warm start, so that the knowledge of image and text understanding can be migrated to video and text understanding, the training efficiency of the video dialogue model is improved, and the dialogue generation effect of the video dialogue model is improved.
Fig. 6 is a schematic diagram according to a fourth embodiment of the present disclosure. The present embodiment provides a video dialogue apparatus, as shown in fig. 6, the apparatus 600 includes: an extraction module 601, a processing module 602, a conversion module 603 and a generation module 604.
The extraction module 601 is configured to perform representation extraction processing on a target video to obtain an initial video representation of the target video; the processing module 602 is configured to perform space-time processing on the initial video representation to obtain a target video representation of the target video; the conversion module 603 is configured to perform conversion processing on the target video representation to obtain a video embedding of the target video, where the dimension of the video embedding is the same as the dimension of the text embedding of the question text; the generating module 604 is configured to perform dialogue processing on the video embedding and the text embedding to obtain answer text.
In this embodiment, the initial video representation is obtained by performing representation extraction processing on the target video, and since the processing is based on the video representation, the visual information of the target video can be fully utilized; in addition, the initial video representation is subjected to space-time processing to obtain the target video representation, so that the rich space-time information of the target video can be utilized. Thus, the video dialogue effect can be improved.
In some embodiments, the processing module 602 is further configured to: performing space-time processing on the initial video representation by adopting an operation time sequence module in a video dialogue model so as to obtain the target video representation; wherein parameters of the operational timing module are adjustable during training of the video session model.
In this embodiment, because the ATM is trainable, the ATM with better performance can be obtained through the training process, so that during reasoning, the ATM with better performance can be adopted to perform space-time processing on the initial video representation, so as to obtain the target video representation with better performance, and further improve the video dialogue effect.
In some embodiments, the operation timing module includes: a context generation module, an arithmetic operation module, a feature extraction module and a region transformation module; the processing module 602 is further configured to:
perform context representation extraction processing on the initial video representation by adopting the context generation module to obtain a context representation;
perform the four arithmetic operations on the initial video representation and the context representation by adopting the arithmetic operation module to obtain an arithmetic-operated representation;
perform feature extraction processing on the arithmetic-operated representation by adopting the feature extraction module to obtain a feature-extracted representation;
and perform region transformation processing on the feature-extracted representation by adopting the region transformation module to obtain the target video representation.
Therefore, based on the four arithmetic operations, a target video representation containing abundant space-time information can be obtained; in addition, the representation capability of the target video representation can be further enhanced through the processing of context representation extraction, feature extraction, region transformation and the like, thereby further improving the video dialogue effect.
In some embodiments, the conversion module 603 is further configured to: inquiring the target video representation to obtain an inquired video representation; and mapping the video representation after the query processing to obtain the video embedding.
In the embodiment, the target video representation is subjected to query processing to obtain the video representation after query processing, and subsequent processing is performed based on the video representation after query processing, so that the representation capacity of the video representation can be further improved, and the video dialogue effect is improved; in addition, by mapping the video representation after query processing, video embedding in the same vector space as text embedding can be obtained, and further dialogue processing is performed based on the video embedding and the text embedding, so that the accuracy of dialogue processing can be improved.
In some embodiments, the conversion module 603 is further configured to: adopting a querier in the video dialogue model to query the target video representation so as to obtain the video representation after query processing; wherein parameters of the querier remain unchanged during training of the video session model.
In this embodiment, the query device is used to perform query processing on the target video representation, so as to obtain the video representation after query processing, which can further improve the representation capability of the video representation and improve the accuracy of the video dialogue. In addition, parameters of the inquirer are kept unchanged in the training process of the video dialogue model, the parameter quantity required to be adjusted can be reduced, and the model training efficiency is improved.
In some embodiments, the conversion module 603 is further configured to: adopting the querier to query the target video representation based on target query embedding so as to obtain the video representation after query processing; wherein the target query embedding comprises: original query embedding and newly added query embedding, the parameters of the original query embedding remaining unchanged during the training process, the parameters of the newly added query embedding being adjustable during the training process.
In this embodiment, by additionally introducing a new query embedding (newly added query embedding), and the newly added query embedding is learnable, the context modeling can be better performed on the target video, so as to obtain a compact LLM compatible video embedding, and improve the video dialogue effect.
In some embodiments, the conversion module 603 is further configured to: mapping the video characterization after query processing by adopting a linear layer in a video dialogue model to obtain the video embedding; wherein parameters of the linear layer are adjustable during training of the video session model.
In this embodiment, the video embedding and the text embedding can be in the same vector space through the linear layer with the parameter capable of being learned, so as to further improve the video dialogue effect.
In some embodiments, the generating module 604 is further configured to: performing dialogue processing on the video embedding and the text characterization by adopting a large language model in a video dialogue model to obtain the answer text; wherein parameters of the large language model remain unchanged during training of the video dialog model.
In this embodiment, since LLM is an existing large language model and has excellent text generation capability, the answer text is obtained by using LLM, so that the accuracy of the answer text can be improved by using the excellent performance of LLM. In addition, parameters of the LLM remain unchanged in the training process of the video dialogue model, so that the parameter quantity required to be adjusted can be reduced, and the model training efficiency is improved.
Fig. 7 is a schematic diagram according to a fifth embodiment of the present disclosure. The present embodiment provides a training device for a video session model, where the video session model includes: the operation timing module, as shown in fig. 7, the apparatus 700 includes: an extraction module 701, a processing module 702, a conversion module 703, a generation module 704, a construction module 705 and an adjustment module 706.
The extraction module 701 is configured to perform representation extraction processing on a video sample to obtain an initial video representation of the video sample; the processing module 702 is configured to perform space-time processing on the initial video representation by adopting the operation timing module to obtain a target video representation of the video sample; the conversion module 703 is configured to perform conversion processing on the target video representation to obtain a video embedding of the video sample, where the dimension of the video embedding is the same as the dimension of the text embedding of the question sample; the generating module 704 is configured to perform dialogue processing on the video embedding and the text embedding to obtain a predicted answer; the construction module 705 is configured to construct a loss function based on the predicted answer; the adjusting module 706 is configured to adjust parameters of the operation timing module by adopting the loss function.
In the embodiment, the initial video representation is obtained by carrying out representation extraction processing on the video sample, and the visual information of the video sample can be fully utilized due to the processing based on the video representation; in addition, the target video representation is obtained by performing space-time processing on the initial video representation, and rich space-time information of the video sample can be utilized. Thus, the effect of the video dialog model can be improved.
In some embodiments, the operation timing module includes: a context generation module, a four-rule operation module, a feature extraction module and a region transformation module; the processing module 702 is further configured to:
perform context representation extraction processing on the initial video representation by adopting the context generation module to obtain a context representation;
adopt the four-rule operation module to perform four-rule operation processing on the initial video representation and the context representation to obtain a representation after the four-rule operation;
adopt the feature extraction module to perform feature extraction processing on the representation after the four-rule operation to obtain a feature-extracted representation;
and carry out region transformation processing on the feature-extracted representation by adopting the region transformation module so as to obtain the target video representation.
Therefore, based on the four-rule operation processing, a target video representation containing rich space-time information can be obtained; in addition, the processing of context representation extraction, feature extraction and region transformation can further enhance the representation capability of the target video representation, thereby improving the video dialogue effect.
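A minimal sketch of one way such an operation timing module could be assembled is given below. The choice of a temporal mean for the context, element-wise addition, subtraction, multiplication and division as the four operations, and simple linear layers for feature extraction and region transformation are all assumptions for illustration, not the design fixed by this disclosure.

import torch
import torch.nn as nn

class OperationTimingModule(nn.Module):
    def __init__(self, dim=256):
        super().__init__()
        self.context_gen = nn.Linear(dim, dim)                 # context generation module
        self.feature_extract = nn.Sequential(                  # feature extraction module
            nn.Linear(4 * dim, dim), nn.GELU())
        self.region_transform = nn.Linear(dim, dim)            # region transformation module

    def forward(self, x):                                      # x: (B, T, dim) initial video representation
        ctx = self.context_gen(x.mean(dim=1, keepdim=True))    # temporal context representation
        # four-rule operation between the representation and the context
        # (division uses the context magnitude plus eps to stay finite in this toy example)
        four = torch.cat([x + ctx, x - ctx, x * ctx, x / (ctx.abs() + 1e-6)], dim=-1)
        feat = self.feature_extract(four)                      # feature-extracted representation
        return self.region_transform(feat)                     # target video representation

target_rep = OperationTimingModule()(torch.randn(2, 8, 256))   # (2, 8, 256)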
In some embodiments, the conversion module 703 is further configured to: perform query processing on the target video representation to obtain a video representation after query processing; and perform mapping processing on the video representation after query processing to obtain the video embedding.
In this embodiment, the target video representation is subjected to query processing to obtain the video representation after query processing, and the subsequent processing is based on this queried representation, which further improves the representation capability of the video representation and therefore the video dialogue effect. In addition, by performing mapping processing on the video representation after query processing, a video embedding in the same vector space as the text embedding can be obtained, and dialogue processing based on the video embedding and the text embedding is accordingly more accurate.
In some embodiments, the video dialogue model further comprises: a querier; the conversion module 703 is further configured to: adopt the querier to perform query processing on the target video representation so as to obtain the video representation after query processing; wherein parameters of the querier remain unchanged during training of the video dialogue model.
In this embodiment, the querier is used to perform query processing on the target video representation to obtain the video representation after query processing, which further improves the representation capability of the video representation and the accuracy of the video dialogue. In addition, because the parameters of the querier remain unchanged during training of the video dialogue model, the number of parameters that need to be adjusted is reduced, which improves model training efficiency.
In some embodiments, the conversion module 703 is further configured to: adopt the querier to perform query processing on the target video representation based on a target query embedding, so as to obtain the video representation after query processing; wherein the target query embedding comprises: an original query embedding and a newly added query embedding, and parameters of the original query embedding remain unchanged during training; the adjustment module 706 is further configured to: adjust the parameters of the newly added query embedding by using the loss function.
In this embodiment, a newly added query embedding is introduced in addition to the original query embedding, and the newly added query embedding is learnable; this allows better context modeling of the target video, yields a compact, LLM-compatible video embedding, and improves the video dialogue effect.
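The sketch below illustrates, under assumed names and sizes, how frozen original query embeddings and learnable newly added query embeddings could be concatenated and used by a frozen cross-attention querier over the target video representation. Neither the dimensions nor the single-attention-layer querier are prescribed by this disclosure.

import torch
import torch.nn as nn

dim, n_original, n_new = 256, 32, 8
original_query = nn.Parameter(torch.randn(n_original, dim), requires_grad=False)  # frozen
new_query = nn.Parameter(torch.randn(n_new, dim))                                 # learnable
querier = nn.MultiheadAttention(dim, num_heads=4, batch_first=True)
for p in querier.parameters():
    p.requires_grad_(False)                                  # querier parameters remain unchanged

def query_video(target_video_rep):                           # (B, T, dim) target video representation
    batch = target_video_rep.size(0)
    target_query = torch.cat([original_query, new_query], dim=0).expand(batch, -1, -1)
    queried, _ = querier(target_query, target_video_rep, target_video_rep)
    return queried                                           # (B, n_original + n_new, dim)

queried_rep = query_video(torch.randn(2, 8, dim))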
In some embodiments, the video dialogue model further comprises: a linear layer; the conversion module 703 is further configured to: perform mapping processing on the video representation after query processing by adopting the linear layer, so as to obtain the video embedding; the adjustment module 706 is further configured to: adjust parameters of the linear layer by using the loss function.
In this embodiment, through the linear layer with learnable parameters, the video embedding and the text embedding can be placed in the same vector space, which further improves the video dialogue effect.
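As a small sketch, the linear layer can be a single projection from the querier output dimension to the language model's embedding dimension; the concrete dimensions below are placeholders, not values taken from this disclosure.

import torch
import torch.nn as nn

query_dim, llm_dim = 256, 4096
video_proj = nn.Linear(query_dim, llm_dim)      # learnable linear layer, adjusted with the loss function

queried_rep = torch.randn(2, 40, query_dim)     # (B, num_queries, query_dim) from the querier
video_embedding = video_proj(queried_rep)       # (B, num_queries, llm_dim), same dimension as the text embedding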
In some embodiments, the video dialogue model further comprises: a large language model; the generation module 704 is further configured to: perform dialogue processing on the video embedding and the text embedding by adopting the large language model to obtain the prediction answer; wherein parameters of the large language model remain unchanged during training of the video dialogue model.
In this embodiment, since the LLM is an existing large language model with excellent text generation capability, the prediction answer obtained with the LLM benefits from that strong performance, which improves answer accuracy. In addition, because the parameters of the LLM remain unchanged during training of the video dialogue model, the number of parameters that need to be adjusted is reduced, which improves model training efficiency.
It is to be understood that, in the embodiments of the present disclosure, reference may be made between different embodiments for the same or similar content.
It can be understood that "first", "second", etc. in the embodiments of the present disclosure are only used for distinguishing, and do not indicate the importance level, the time sequence, etc.
In the technical scheme of the disclosure, the collection, storage, use, processing, transmission, provision, disclosure and other processing of users' personal information all comply with the relevant laws and regulations, and do not violate public order and good customs.
According to embodiments of the present disclosure, the present disclosure also provides an electronic device, a readable storage medium and a computer program product.
Fig. 8 illustrates a schematic block diagram of an example electronic device 800 that may be used to implement embodiments of the present disclosure. Electronic device 800 is intended to represent various forms of digital computers, such as laptops, desktops, workstations, servers, blade servers, mainframes, and other appropriate computers. The electronic device may also represent various forms of mobile apparatuses, such as personal digital assistants, cellular telephones, smartphones, wearable devices, and other similar computing apparatuses. The components shown herein, their connections and relationships, and their functions, are meant to be exemplary only, and are not meant to limit implementations of the disclosure described and/or claimed herein.
As shown in fig. 8, the electronic device 800 includes a computing unit 801 that can perform various appropriate actions and processes according to a computer program stored in a Read Only Memory (ROM) 802 or a computer program loaded from a storage unit 808 into a Random Access Memory (RAM) 803. In the RAM 803, various programs and data required for the operation of the electronic device 800 can also be stored. The computing unit 801, the ROM 802, and the RAM 803 are connected to each other by a bus 804. An input/output (I/O) interface 805 is also connected to the bus 804.
Various components in electronic device 800 are connected to I/O interface 805, including: an input unit 806 such as a keyboard, mouse, etc.; an output unit 807 such as various types of displays, speakers, and the like; a storage unit 808, such as a magnetic disk, optical disk, etc.; and a communication unit 809, such as a network card, modem, wireless communication transceiver, or the like. The communication unit 809 allows the electronic device 800 to exchange information/data with other devices through a computer network such as the internet and/or various telecommunication networks.
The computing unit 801 may be a variety of general and/or special purpose processing components having processing and computing capabilities. Some examples of computing unit 801 include, but are not limited to, a Central Processing Unit (CPU), a Graphics Processing Unit (GPU), various specialized Artificial Intelligence (AI) computing chips, various computing units running machine learning model algorithms, a Digital Signal Processor (DSP), and any suitable processor, controller, microcontroller, etc. The computing unit 801 performs the respective methods and processes described above, for example, the training method of the video dialogue model or the video dialogue method. For example, in some embodiments, the training method of the video dialogue model or the video dialogue method may be implemented as a computer software program tangibly embodied on a machine-readable medium, such as the storage unit 808. In some embodiments, part or all of the computer program may be loaded and/or installed onto the electronic device 800 via the ROM 802 and/or the communication unit 809. When the computer program is loaded into RAM 803 and executed by computing unit 801, one or more steps of the training method of the video dialogue model or of the video dialogue method described above may be performed. Alternatively, in other embodiments, the computing unit 801 may be configured to perform the training method of the video dialogue model or the video dialogue method in any other suitable way (e.g., by means of firmware).
Various implementations of the systems and techniques described here may be implemented in digital electronic circuitry, integrated circuitry, field programmable gate arrays (FPGAs), application specific integrated circuits (ASICs), application specific standard products (ASSPs), systems on chips (SOCs), complex programmable logic devices (CPLDs), computer hardware, firmware, software, and/or combinations thereof. These various embodiments may include: being implemented in one or more computer programs, where the one or more computer programs may be executed and/or interpreted on a programmable system including at least one programmable processor, which may be a special purpose or general-purpose programmable processor that may receive data and instructions from, and transmit data and instructions to, a storage system, at least one input device, and at least one output device.
Program code for carrying out methods of the present disclosure may be written in any combination of one or more programming languages. These program code may be provided to a processor or controller of a general purpose computer, special purpose computer, or other programmable data processing apparatus such that the program code, when executed by the processor or controller, causes the functions/operations specified in the flowchart and/or block diagram to be implemented. The program code may execute entirely on the machine, partly on the machine, as a stand-alone software package, partly on the machine and partly on a remote machine or entirely on the remote machine or server.
In the context of this disclosure, a machine-readable medium may be a tangible medium that can contain, or store a program for use by or in connection with an instruction execution system, apparatus, or device. The machine-readable medium may be a machine-readable signal medium or a machine-readable storage medium. The machine-readable medium may include, but is not limited to, an electronic, magnetic, optical, electromagnetic, infrared, or semiconductor system, apparatus, or device, or any suitable combination of the foregoing. More specific examples of a machine-readable storage medium would include an electrical connection based on one or more wires, a portable computer diskette, a hard disk, a Random Access Memory (RAM), a read-only memory (ROM), an erasable programmable read-only memory (EPROM or flash memory), an optical fiber, a portable compact disc read-only memory (CD-ROM), an optical storage device, a magnetic storage device, or any suitable combination of the foregoing.
To provide for interaction with a user, the systems and techniques described here can be implemented on a computer having: a display device (e.g., a CRT (cathode ray tube) or LCD (liquid crystal display) monitor) for displaying information to a user; and a keyboard and pointing device (e.g., a mouse or trackball) by which a user can provide input to the computer. Other kinds of devices may also be used to provide for interaction with a user; for example, feedback provided to the user may be any form of sensory feedback (e.g., visual feedback, auditory feedback, or tactile feedback); and input from the user may be received in any form, including acoustic input, speech input, or tactile input.
The systems and techniques described here can be implemented in a computing system that includes a back-end component (e.g., a data server), or that includes a middleware component (e.g., an application server), or that includes a front-end component (e.g., a user computer having a graphical user interface or a web browser through which a user can interact with an implementation of the systems and techniques described here), or any combination of such back-end, middleware, or front-end components. The components of the system can be interconnected by any form or medium of digital data communication (e.g., a communication network). Examples of communication networks include: local area networks (LANs), wide area networks (WANs), and the internet.
The computer system may include a client and a server. The client and server are typically remote from each other and typically interact through a communication network. The relationship of client and server arises by virtue of computer programs running on the respective computers and having a client-server relationship to each other. The server may be a cloud server, also called a cloud computing server or a cloud host, which is a host product in a cloud computing service system and overcomes the defects of high management difficulty and weak service scalability found in traditional physical hosts and VPS ("Virtual Private Server") services. The server may also be a server of a distributed system or a server combined with a blockchain.
It should be appreciated that various forms of the flows shown above may be used to reorder, add, or delete steps. For example, the steps recited in the present disclosure may be performed in parallel or sequentially or in a different order, provided that the desired results of the technical solutions of the present disclosure are achieved, and are not limited herein.
The above detailed description should not be taken as limiting the scope of the present disclosure. It will be apparent to those skilled in the art that various modifications, combinations, sub-combinations and alternatives are possible, depending on design requirements and other factors. Any modifications, equivalent substitutions and improvements made within the spirit and principles of the present disclosure are intended to be included within the scope of the present disclosure.

Claims (33)

1. A video dialogue method, comprising:
performing representation extraction processing on a target video to obtain an initial video representation of the target video;
performing space-time processing on the initial video representation to obtain a target video representation of the target video;
performing conversion processing on the target video representation to obtain video embedding of the target video; the dimension of the video embedding is the same as the dimension of the text embedding of the question text;
and carrying out dialogue processing on the video embedding and the text embedding to obtain answer text.
2. The method of claim 1, wherein the performing space-time processing on the initial video representation to obtain a target video representation of the target video comprises:
performing space-time processing on the initial video representation by adopting an operation timing module in a video dialogue model so as to obtain the target video representation; wherein parameters of the operation timing module are adjustable during training of the video dialogue model.
3. The method of claim 2, wherein,
the operation timing module includes: a context generation module, a four-rule operation module, a feature extraction module and a region transformation module;
the performing space-time processing on the initial video representation by adopting the operation timing module to obtain the target video representation comprises:
performing context representation extraction processing on the initial video representation by adopting the context generation module to obtain a context representation;
adopting the four-rule operation module to perform four-rule operation processing on the initial video representation and the context representation to obtain a representation after the four-rule operation;
adopting the feature extraction module to perform feature extraction processing on the representation after the four-rule operation to obtain a feature-extracted representation;
and carrying out region transformation processing on the feature-extracted representation by adopting the region transformation module so as to obtain the target video representation.
4. The method of claim 1, wherein the performing conversion processing on the target video representation to obtain video embedding of the target video comprises:
performing query processing on the target video representation to obtain a video representation after query processing;
and mapping the video representation after the query processing to obtain the video embedding.
5. The method of claim 4, wherein the performing query processing on the target video representation to obtain a video representation after query processing comprises:
adopting a querier in the video dialogue model to perform query processing on the target video representation so as to obtain the video representation after query processing; wherein parameters of the querier remain unchanged during training of the video dialogue model.
6. The method of claim 5, wherein the adopting a querier in the video dialogue model to perform query processing on the target video representation to obtain the video representation after query processing comprises:
Adopting the querier to query the target video representation based on target query embedding so as to obtain the video representation after query processing; wherein the target query embedding comprises: original query embedding and newly added query embedding, the parameters of the original query embedding remaining unchanged during the training process, the parameters of the newly added query embedding being adjustable during the training process.
7. The method of claim 4, wherein the mapping the query-processed video representation to obtain the video embedding comprises:
mapping the video representation after query processing by adopting a linear layer in a video dialogue model to obtain the video embedding; wherein parameters of the linear layer are adjustable during training of the video dialogue model.
8. The method of any of claims 1-7, wherein said dialogue processing the video and text embeddings to obtain answer text comprises:
performing dialogue processing on the video embedding and the text embedding by adopting a large language model in a video dialogue model to obtain the answer text; wherein parameters of the large language model remain unchanged during training of the video dialog model.
9. A method of training a video dialogue model, the video dialogue model comprising: an operation timing module, the method comprising:
performing representation extraction processing on a video sample to obtain an initial video representation of the video sample;
performing space-time processing on the initial video representation by adopting the operation timing module so as to obtain a target video representation of the video sample;
performing conversion processing on the target video representation to obtain video embedding of the video sample; the dimension of the video embedding is the same as the dimension of the text embedding of the question sample;
performing dialogue processing on the video embedding and the text embedding to obtain a prediction answer;
constructing a loss function based on the predicted answer;
and adjusting parameters of the operation time sequence module by adopting the loss function.
10. The method of claim 9, wherein,
the operation timing module includes: a context generation module, a four-rule operation module, a feature extraction module and a region transformation module;
the performing space-time processing on the initial video representation by adopting the operation timing module to obtain a target video representation of the video sample comprises:
performing context representation extraction processing on the initial video representation by adopting the context generation module to obtain a context representation;
adopting the four-rule operation module to perform four-rule operation processing on the initial video representation and the context representation to obtain a representation after the four-rule operation;
adopting the feature extraction module to perform feature extraction processing on the representation after the four-rule operation to obtain a feature-extracted representation;
and carrying out region transformation processing on the feature-extracted representation by adopting the region transformation module so as to obtain the target video representation.
11. The method of claim 9, wherein the performing conversion processing on the target video representation to obtain video embedding of the video sample comprises:
performing query processing on the target video representation to obtain a video representation after query processing;
and mapping the video representation after the query processing to obtain the video embedding.
12. The method of claim 11, wherein,
the video dialog model further includes: a querier;
the performing query processing on the target video representation to obtain a video representation after query processing comprises:
adopting the querier to perform query processing on the target video representation so as to obtain the video representation after query processing; wherein parameters of the querier remain unchanged during training of the video dialogue model.
13. The method of claim 12, wherein the adopting the querier to perform query processing on the target video representation to obtain the video representation after query processing comprises:
adopting the querier to query the target video representation based on target query embedding so as to obtain the video representation after query processing; wherein the target query embedding comprises: original query embedding and newly added query embedding, wherein parameters of the original query embedding are kept unchanged in the training process;
the method further comprises:
adjusting the parameters of the newly added query embedding by adopting the loss function.
14. The method of claim 11, wherein,
the video dialog model further includes: a linear layer;
the performing mapping processing on the video representation after the query processing to obtain the video embedding comprises:
mapping the video representation after the query processing by adopting the linear layer to obtain the video embedding;
the method further comprises:
adjusting parameters of the linear layer by adopting the loss function.
15. The method according to any one of claims 9-14, wherein,
the video dialog model further includes: a large language model;
said dialogue processing said video embedding and said text embedding to obtain a predicted answer, comprising:
performing dialogue processing on the video embedding and the text embedding by adopting the large language model to obtain a prediction answer; wherein parameters of the large language model remain unchanged during training of the video dialog model.
16. A video dialog device comprising:
the extraction module is used for performing representation extraction processing on the target video so as to obtain an initial video representation of the target video;
the processing module is used for carrying out space-time processing on the initial video representation so as to obtain a target video representation of the target video;
the conversion module is used for carrying out conversion processing on the target video representation so as to obtain video embedding of the target video; the dimension of the video embedding is the same as the dimension of the text embedding of the question text;
and the generating module is used for carrying out dialogue processing on the video embedding and the text embedding so as to obtain answer text.
17. The apparatus of claim 16, wherein the processing module is further to:
performing space-time processing on the initial video representation by adopting an operation timing module in a video dialogue model so as to obtain the target video representation; wherein parameters of the operation timing module are adjustable during training of the video dialogue model.
18. The apparatus of claim 17, wherein,
the operation timing module includes: a context generation module, a four-rule operation module, a feature extraction module and a region transformation module;
the processing module is further to:
performing context representation extraction processing on the initial video representation by adopting the context generation module to obtain a context representation;
adopting the four-rule operation module to perform four-rule operation processing on the initial video representation and the context representation to obtain a representation after the four-rule operation;
adopting the feature extraction module to perform feature extraction processing on the representation after the four-rule operation to obtain a feature-extracted representation;
and carrying out region transformation processing on the feature-extracted representation by adopting the region transformation module so as to obtain the target video representation.
19. The apparatus of claim 16, wherein the conversion module is further to:
performing query processing on the target video representation to obtain a video representation after query processing;
and mapping the video representation after the query processing to obtain the video embedding.
20. The apparatus of claim 19, wherein the conversion module is further to:
adopting a querier in the video dialogue model to perform query processing on the target video representation so as to obtain the video representation after query processing; wherein parameters of the querier remain unchanged during training of the video dialogue model.
21. The apparatus of claim 20, wherein the conversion module is further to:
adopting the querier to query the target video representation based on target query embedding so as to obtain the video representation after query processing; wherein the target query embedding comprises: original query embedding and newly added query embedding, the parameters of the original query embedding remaining unchanged during the training process, the parameters of the newly added query embedding being adjustable during the training process.
22. The apparatus of claim 19, wherein the conversion module is further to:
mapping the video representation after query processing by adopting a linear layer in a video dialogue model to obtain the video embedding; wherein parameters of the linear layer are adjustable during training of the video dialogue model.
23. The apparatus of any of claims 16-22, wherein the generation module is further to:
performing dialogue processing on the video embedding and the text embedding by adopting a large language model in a video dialogue model to obtain the answer text; wherein parameters of the large language model remain unchanged during training of the video dialog model.
24. A training device for a video dialogue model, the video dialogue model comprising: an operation timing module, the apparatus comprising:
the extraction module is used for performing representation extraction processing on the video sample so as to obtain an initial video representation of the video sample;
the processing module is used for performing space-time processing on the initial video representation by adopting the operation timing module so as to obtain a target video representation of the video sample;
the conversion module is used for carrying out conversion processing on the target video representation so as to obtain video embedding of the video sample; the dimension of the video embedding is the same as the dimension of the text embedding of the question sample;
the generation module is used for carrying out dialogue processing on the video embedding and the text embedding so as to obtain a prediction answer;
the construction module is used for constructing a loss function based on the prediction answers;
and the adjusting module is used for adjusting the parameters of the operation time sequence module by adopting the loss function.
25. The apparatus of claim 24, wherein,
the operation timing module includes: a context generation module, a four-rule operation module, a feature extraction module and a region transformation module;
the processing module is further to:
performing context representation extraction processing on the initial video representation by adopting the context generation module to obtain a context representation;
adopting the four-rule operation module to perform four-rule operation processing on the initial video representation and the context representation to obtain a representation after the four-rule operation;
adopting the feature extraction module to perform feature extraction processing on the representation after the four-rule operation to obtain a feature-extracted representation;
and carrying out region transformation processing on the feature-extracted representation by adopting the region transformation module so as to obtain the target video representation.
26. The apparatus of claim 24, wherein the conversion module is further to:
performing query processing on the target video representation to obtain a video representation after query processing;
and mapping the video representation after the query processing to obtain the video embedding.
27. The apparatus of claim 26, wherein,
the video dialog model further includes: a querier;
the conversion module is further configured to:
adopting the querier to perform query processing on the target video representation so as to obtain the video representation after query processing; wherein parameters of the querier remain unchanged during training of the video dialogue model.
28. The apparatus of claim 26, wherein the conversion module is further to:
adopting the querier to query the target video representation based on target query embedding so as to obtain the video representation after query processing; wherein the target query embedding comprises: original query embedding and newly added query embedding, wherein parameters of the original query embedding are kept unchanged in the training process;
The adjustment module is also used for:
and adjusting the parameters of the newly added query embedding by adopting the loss function.
29. The apparatus of claim 26, wherein,
the video dialog model further includes: a linear layer;
the conversion module is further configured to:
mapping the video representation after the query processing by adopting the linear layer to obtain the video embedding;
the adjustment module is also used for:
and adjusting parameters of the linear layer by adopting the loss function.
30. The apparatus of any one of claims 24-29, wherein,
the video dialog model further includes: a large language model;
the generation module is further configured to:
performing dialogue processing on the video embedding and the text embedding by adopting the large language model to obtain a prediction answer; wherein parameters of the large language model remain unchanged during training of the video dialog model.
31. An electronic device, comprising:
at least one processor; and
a memory communicatively coupled to the at least one processor; wherein,
the memory stores instructions executable by the at least one processor to enable the at least one processor to perform the method of any one of claims 1-15.
32. A non-transitory computer readable storage medium storing computer instructions for causing the computer to perform the method of any one of claims 1-15.
33. A computer program product comprising a computer program which, when executed by a processor, implements the method according to any of claims 1-15.