CN116246213A - Data processing method, device, equipment and medium

Data processing method, device, equipment and medium

Info

Publication number
CN116246213A
Authority
CN
China
Prior art keywords
video
text
attention
sample
initial
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Granted
Application number
CN202310506746.9A
Other languages
Chinese (zh)
Other versions
CN116246213B (en)
Inventor
项进喜
余剑扬
罗凤
关永航
赵创钿
张军
邵纪春
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Tencent Technology Shenzhen Co Ltd
Original Assignee
Tencent Technology Shenzhen Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Tencent Technology Shenzhen Co Ltd filed Critical Tencent Technology Shenzhen Co Ltd
Priority to CN202310506746.9A priority Critical patent/CN116246213B/en
Publication of CN116246213A publication Critical patent/CN116246213A/en
Application granted granted Critical
Publication of CN116246213B publication Critical patent/CN116246213B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06V IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V20/00 Scenes; Scene-specific elements
    • G06V20/40 Scenes; Scene-specific elements in video content
    • G06V20/41 Higher-level, semantic clustering, classification or understanding of video scenes, e.g. detection, labelling or Markovian modelling of sport events or news items
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00 Handling natural language data
    • G06F40/20 Natural language analysis
    • G06F40/279 Recognition of textual entities
    • G06F40/289 Phrasal analysis, e.g. finite state techniques or chunking
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00 Handling natural language data
    • G06F40/30 Semantic analysis
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 Computing arrangements based on biological models
    • G06N3/02 Neural networks
    • G06N3/08 Learning methods
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06V IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V10/00 Arrangements for image or video recognition or understanding
    • G06V10/70 Arrangements for image or video recognition or understanding using pattern recognition or machine learning
    • G06V10/77 Processing image or video features in feature spaces; using data integration or data reduction, e.g. principal component analysis [PCA] or independent component analysis [ICA] or self-organising maps [SOM]; Blind source separation
    • G06V10/80 Fusion, i.e. combining data from various sources at the sensor level, preprocessing level, feature extraction level or classification level
    • G06V10/806 Fusion, i.e. combining data from various sources at the sensor level, preprocessing level, feature extraction level or classification level of extracted features
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06V IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V10/00 Arrangements for image or video recognition or understanding
    • G06V10/70 Arrangements for image or video recognition or understanding using pattern recognition or machine learning
    • G06V10/82 Arrangements for image or video recognition or understanding using pattern recognition or machine learning using neural networks
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06V IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V20/00 Scenes; Scene-specific elements
    • G06V20/40 Scenes; Scene-specific elements in video content
    • G06V20/46 Extracting features or characteristics from the video content, e.g. video fingerprints, representative shots or key frames
    • Y GENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
    • Y02 TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
    • Y02D CLIMATE CHANGE MITIGATION TECHNOLOGIES IN INFORMATION AND COMMUNICATION TECHNOLOGIES [ICT], I.E. INFORMATION AND COMMUNICATION TECHNOLOGIES AIMING AT THE REDUCTION OF THEIR OWN ENERGY USE
    • Y02D10/00 Energy efficient computing, e.g. low power processors, power management or thermal management

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • General Physics & Mathematics (AREA)
  • Health & Medical Sciences (AREA)
  • General Health & Medical Sciences (AREA)
  • Artificial Intelligence (AREA)
  • Evolutionary Computation (AREA)
  • Software Systems (AREA)
  • Multimedia (AREA)
  • Computational Linguistics (AREA)
  • General Engineering & Computer Science (AREA)
  • Computing Systems (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Databases & Information Systems (AREA)
  • Medical Informatics (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Biomedical Technology (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Biophysics (AREA)
  • Data Mining & Analysis (AREA)
  • Molecular Biology (AREA)
  • Mathematical Physics (AREA)
  • Signal Processing For Digital Recording And Reproducing (AREA)
  • Compression Or Coding Systems Of Tv Signals (AREA)

Abstract

The embodiment of the application provides a data processing method, a device, equipment and a medium, and the method can be applied to fields such as video comment generation, subtitle generation and video content understanding. The method comprises the following steps: acquiring video data and video text data corresponding to the video data; acquiring video representation information corresponding to the video data and text representation information corresponding to the video text data; performing time sequence sampling processing on the video representation information to obtain video time sequence sampling information corresponding to the video data, and combining the video time sequence sampling information and the text representation information into a multi-mode combination feature; and encoding the multi-mode combination feature to obtain a multi-mode fusion encoding feature, and performing text decoding processing on the multi-mode fusion encoding feature to obtain a video content description text associated with the video data. By means of the embodiment of the application, the accuracy of describing the video content can be improved.

Description

Data processing method, device, equipment and medium
Technical Field
The present application relates to the field of artificial intelligence technologies, and in particular, to a data processing method, apparatus, device, and medium.
Background
Video content understanding can be understood as a service that converts video content into a description in natural language, i.e., video content description. Video content understanding can be applied to fields such as video comments, video subtitles and video content summaries; for example, comments and subtitles that fit the context can be automatically generated through video content understanding.
Current video processing flows are generally divided into two stages: encoding and decoding. Encoding means that an encoder extracts image features from each frame image of the original video; decoding means that a decoder predicts, from the image features of each frame image extracted by the encoder, a text used to describe the video content (which may be a comment, a subtitle or the like of the original video). Although the existing technical solutions realize video content understanding and automatically generate text describing the video content for the original video, the image features extracted by the encoder for each frame image are independent of one another, and the association relationship among the frame images of the original video is not considered. As a result, the text finally generated by the decoder may not match the original video content, and the accuracy of describing the video content with this text is too low.
Disclosure of Invention
The embodiment of the application provides a data processing method, a device, equipment and a medium, which can improve the description accuracy of video content.
In one aspect, an embodiment of the present application provides a data processing method, including:
acquiring video data and video text distribution data corresponding to the video data;
acquiring video representation information corresponding to video data and acquiring text representation information corresponding to video text distribution data;
performing time sequence sampling processing on the video representation information to obtain video time sequence sampling information corresponding to video data, and combining the video time sequence sampling information and the text representation information into a multi-mode combination feature;
and carrying out encoding processing on the multi-mode combination features to obtain multi-mode fusion encoding features, and carrying out text decoding processing on the multi-mode fusion encoding features to obtain video content description text associated with the video data.
In one aspect, an embodiment of the present application provides a data processing method, including:
acquiring a sample video, and sample text data and sample content description text corresponding to the sample video;
outputting sample video representation information corresponding to the sample video through a visual encoder in an initial generation model, and outputting sample text representation information corresponding to the sample text data through a text encoder in the initial generation model; the visual encoder and the text encoder are trained based on a plurality of image-text data sets, wherein one image-text data set comprises a sample image and image text data;
acquiring an attention mask matrix in an initial time sequence sampler contained in the initial generation model, performing time sequence sampling processing on the sample video representation information according to the attention mask matrix and the mutual attention component and the self attention component in the initial time sequence sampler to obtain sample time sequence sampling information, and combining the sample time sequence sampling information and the sample text representation information into a sample multi-mode feature;
performing encoding processing on the sample multi-mode feature through an initial multi-mode encoder in the initial generation model to obtain a sample fusion encoding feature, and outputting a text prediction probability matrix corresponding to the sample fusion encoding feature through an initial text decoder in the initial generation model; the text prediction probability matrix is used for representing the prediction probability corresponding to each unit character in the sample content description text;
correcting network parameters of the initial time sequence sampler, the initial multi-mode encoder and the initial text decoder according to the attention mask matrix, the text prediction probability matrix and the sample content description text, and determining the initial generation model containing the corrected network parameters as a target generation model; the target generation model is used for generating video content description text for input video data and video text data.
An aspect of an embodiment of the present application provides a data processing apparatus, including:
the first data acquisition module is used for acquiring video data and video text distribution data corresponding to the video data;
the first feature extraction module is used for acquiring video representation information corresponding to the video data and acquiring text representation information corresponding to the video text distribution data;
the first time sequence sampling module is used for performing time sequence sampling processing on the video representation information to obtain video time sequence sampling information corresponding to the video data, and combining the video time sequence sampling information and the text representation information into a multi-mode combination characteristic;
the first text generation module is used for carrying out coding processing on the multi-mode combination characteristics to obtain multi-mode fusion coding characteristics, and carrying out text decoding processing on the multi-mode fusion coding characteristics to obtain video content description text associated with video data.
The first feature extraction module obtains video representation information corresponding to video data, and the first feature extraction module comprises:
carrying out frame division processing on video data to obtain a video frame sequence, dividing each video frame in the video frame sequence into a plurality of image blocks with fixed sizes, and obtaining an image block set corresponding to each video frame in the video frame sequence;
Acquiring image input features corresponding to a video frame i according to an image block set corresponding to the video frame i contained in the video frame sequence, and inputting the image input features to a visual encoder in a target generation model; i is a positive integer less than or equal to the number of video frames corresponding to the video frame sequence;
coding the image input characteristics according to the visual encoder to obtain image representation information corresponding to the video frame i;
and combining the image representation information corresponding to each video frame in the video frame sequence into video representation information corresponding to the video data.
The first feature extraction module performs encoding processing on the image input features according to the visual encoder to obtain image representation information corresponding to the video frame i, and the first feature extraction module comprises:
according to the attention coding component in the visual encoder, outputting attention coding features corresponding to the image input features, and combining the image input features and the attention coding features into image joint features;
acquiring an implicit weight matrix and an offset vector corresponding to a multi-layer perceptron in the visual encoder, determining an image transformation feature corresponding to the video frame i based on the offset vector and the dot multiplication between the implicit weight matrix and the image joint feature, and combining the image joint feature and the image transformation feature into the image representation information corresponding to the video frame i.
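For reference, the multi-layer perceptron step described above can be sketched as follows. This is a minimal illustration which assumes that the "combination" of the image joint feature and the image transformation feature is a residual addition; the variable names are hypothetical and not taken from the application.

import numpy as np

def mlp_representation(image_joint_feature, W_hidden, bias):
    # image_joint_feature: (num_blocks, d) image joint feature of video frame i
    # W_hidden:            (d, d) implicit weight matrix of the multi-layer perceptron
    # bias:                (d,)   offset vector
    # dot multiplication with the implicit weight matrix, plus the offset vector
    image_transform_feature = image_joint_feature @ W_hidden + bias
    # combine the joint feature and the transformation feature (residual addition assumed)
    return image_joint_feature + image_transform_feature  # image representation information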
The first feature extraction module outputs attention coding features corresponding to the image input features according to an attention coding component in the visual encoder, and the first feature extraction module comprises:
acquiring a transformation weight matrix corresponding to an attention coding component in a visual encoder, and transforming the image input characteristics into a first query matrix, a first key matrix and a first value matrix based on the transformation weight matrix of the attention coding component;
performing dot multiplication operation on the first query matrix and the transposed matrix of the first key matrix to obtain candidate weight matrices, and obtaining the number of columns corresponding to the first query matrix;
and carrying out normalization processing on the ratio between the candidate weight matrix and the square root of the column number to obtain a first attention weight matrix, and determining attention coding features corresponding to the image input features according to dot multiplication between the first attention weight matrix and the first value matrix.
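A minimal sketch of the attention encoding step described above is given below. It assumes a single attention head and standard scaled dot-product attention; the matrix names are illustrative.

import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def attention_encoding(X, W_q, W_k, W_v):
    # X: (num_blocks, d) image input features; W_q, W_k, W_v: transformation weight matrices
    Q = X @ W_q                                   # first query matrix
    K = X @ W_k                                   # first key matrix
    V = X @ W_v                                   # first value matrix
    candidate = Q @ K.T                           # dot multiplication with the transposed key matrix
    d_k = Q.shape[1]                              # number of columns of the first query matrix
    weights = softmax(candidate / np.sqrt(d_k))   # normalization -> first attention weight matrix
    return weights @ V                            # attention coding features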
The first feature extraction module obtains text representation information corresponding to video text data, and the method comprises the following steps:
dividing video text matching data into D unit characters, and obtaining unit word vectors corresponding to the D unit characters respectively; d is a positive integer;
according to semantic information of the D unit characters in the video text distribution data, text vectors corresponding to the D unit characters are obtained;
According to the text positions of the D unit characters in the video text distribution data, obtaining position vectors corresponding to the D unit characters respectively;
superposing the unit word vector, the text vector and the position vector to obtain text input characteristics corresponding to the video text matching data;
and inputting the text input characteristics to a text encoder in the target generation model, and encoding the text input characteristics through the text encoder to obtain text representation information corresponding to the video text distribution data.
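The superposition of unit word vectors, text vectors and position vectors described above corresponds to the usual BERT-style embedding sum. The sketch below is illustrative only; the embedding tables and identifier names are hypothetical.

import numpy as np

def text_input_features(char_ids, segment_ids, word_table, text_table, pos_table):
    # char_ids:    (D,) ids of the D unit characters
    # segment_ids: (D,) ids carrying the semantic/segment information of each unit character
    D = len(char_ids)
    unit_word_vectors = word_table[char_ids]        # unit word vectors
    text_vectors = text_table[segment_ids]          # text vectors
    position_vectors = pos_table[np.arange(D)]      # position vectors
    # superpose the three kinds of vectors to obtain the text input features, shape (D, d)
    return unit_word_vectors + text_vectors + position_vectors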
The first time sequence sampling module performs time sequence sampling processing on the video representation information to obtain video time sequence sampling information corresponding to video data, and the first time sequence sampling module comprises:
performing position coding processing on the video representation information to obtain position coding information corresponding to the video representation information, and combining the video representation information and the position coding information into video description information;
acquiring initial time sequence characteristics with the same dimension as the text representation information, and inputting the initial time sequence characteristics and the video description information into a time sequence sampler in a target generation model;
and carrying out iterative updating on the initial time sequence characteristics through the time sequence sampler and the video description information to obtain video time sequence sampling information corresponding to the video data.
The video representation information comprises L pieces of image representation information, wherein L is a positive integer;
the first time sequence sampling module performs position coding processing on the video representation information to obtain position coding information corresponding to the video representation information, and the method comprises the following steps:
acquiring index positions of L image representation information in video data, and dividing the index positions of the L image representation information into even index positions and odd index positions;
sinusoidal position coding is carried out on even index positions in the video representation information, and sinusoidal coding information corresponding to the even index positions is obtained;
cosine position coding is carried out on odd index positions in the video representation information, and cosine coding information corresponding to the odd index positions is obtained;
and determining the sine coding information and the cosine coding information as position coding information corresponding to the video representation information.
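A sketch of the position coding step is shown below. The even/odd split over frame index positions follows the wording above, while the 10000-based frequency term is an assumption borrowed from the standard Transformer position encoding.

import numpy as np

def frame_position_coding(L, d_model):
    positions = np.arange(L)[:, None]                                 # index positions of the L image representations
    freq = np.power(10000.0, np.arange(d_model)[None, :] / d_model)   # assumed frequency scaling
    even_mask = (positions % 2 == 0)
    sine_coding = np.sin(positions / freq)      # sinusoidal coding for even index positions
    cosine_coding = np.cos(positions / freq)    # cosine coding for odd index positions
    return np.where(even_mask, sine_coding, cosine_coding)   # position coding information, shape (L, d_model)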
The time sequence sampler in the target generation model comprises N mutual attention components and N self attention components, wherein the N mutual attention components and the N self attention components are alternately connected, and N is a positive integer;
the first time sequence sampling module carries out iterative updating on initial time sequence characteristics through a time sequence sampler and video description information to obtain video time sequence sampling information corresponding to video data, and the method comprises the following steps:
Acquiring input characteristics of a j-th mutual attention component in a time sequence sampler; when j is 1, the input features of the j-th mutual attention component comprise video description information and initial time sequence features; when j is not 1, the input characteristics of the j-th mutual attention component comprise video description information and the output characteristics of the j-1-th self attention component; j is a positive integer less than or equal to N;
acquiring a first weight matrix, a second weight matrix and a third weight matrix corresponding to the j-th mutual attention component, and performing dot multiplication operation on the first weight matrix and the output characteristics of the j-1-th self attention component (or the initial time sequence characteristics when j is 1) to obtain a second query matrix;
performing dot multiplication operation on the second weight matrix and the video description information to obtain a second key matrix, and performing dot multiplication operation on the third weight matrix and the video description information to obtain a second value matrix;
determining output characteristics of a j-th mutual attention component according to the second query matrix, the second key matrix and the second value matrix;
inputting the output characteristics of the jth mutual attention component into the jth self attention component in the time sequence sampler, and performing self attention coding processing on the output characteristics of the jth mutual attention component through the jth self attention component to obtain the output characteristics of the jth self attention component;
And determining the output characteristics of the Nth self-attention component in the time sequence sampler as video time sequence sampling information corresponding to the video data.
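The alternating mutual-attention/self-attention structure of the time sequence sampler can be sketched as follows. Single-head attention and the dictionary-style weight containers are simplifications, so the sketch is illustrative rather than a definitive implementation.

import numpy as np

def _softmax(x):
    e = np.exp(x - x.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)

def _attend(Q, K, V):
    return _softmax(Q @ K.T / np.sqrt(Q.shape[-1])) @ V

def timing_sampler(video_desc, init_timing, cross_blocks, self_blocks):
    # video_desc:  (T, d) video description information (video representation + position coding)
    # init_timing: (M, d) initial time sequence features, same dimension as the text representation
    x = init_timing
    for cross, self_blk in zip(cross_blocks, self_blocks):   # N alternating component pairs
        # j-th mutual attention: queries from x, keys/values from the video description information
        x = _attend(x @ cross["W1"], video_desc @ cross["W2"], video_desc @ cross["W3"])
        # j-th self attention over the mutual attention output
        x = _attend(x @ self_blk["Wq"], x @ self_blk["Wk"], x @ self_blk["Wv"])
    return x   # output of the N-th self attention component = video time sequence sampling information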
The method comprises the steps of performing encoding processing on the multi-mode combination characteristics by a first text generation module to obtain multi-mode fusion encoding characteristics, performing text decoding processing on the multi-mode fusion encoding characteristics to obtain video content description text associated with video data, and comprising the following steps:
inputting the multi-mode combination features into a multi-mode encoder in a target generation model, and performing bidirectional feature coding processing on the multi-mode combination features through the multi-mode encoder to obtain multi-mode fusion coding features;
inputting the multi-mode fusion coding features to a text decoder in a target generation model, and performing attention aggregation operation on the multi-mode fusion coding features through the text decoder to obtain attention aggregation features;
and performing autoregressive processing on the attention-aggregated features to obtain a text probability output matrix, and determining a video content description text associated with the video data according to the text probability output matrix.
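An illustrative sketch of the encode-then-decode flow follows. The multimodal_encoder and text_decoder callables are stand-ins for the modules described above, and greedy decoding is assumed purely for illustration; the text above only requires autoregressive processing.

import numpy as np

def generate_description_text(video_ts_info, text_repr, multimodal_encoder, text_decoder,
                              bos_id, eos_id, max_len=30):
    # combine the video time sequence sampling information and the text representation information
    multimodal_feature = np.concatenate([video_ts_info, text_repr], axis=0)
    fused = multimodal_encoder(multimodal_feature)      # multi-mode fusion encoding features

    char_ids = [bos_id]
    for _ in range(max_len):                            # autoregressive text decoding
        next_probs = text_decoder(fused, char_ids)      # probability distribution over unit characters
        next_id = int(np.argmax(next_probs))            # greedy choice (assumption)
        if next_id == eos_id:
            break
        char_ids.append(next_id)
    return char_ids                                     # unit characters of the video content description text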
An aspect of an embodiment of the present application provides a data processing apparatus, including:
the second data acquisition module is used for acquiring the sample video, and sample text matching data and sample content description text corresponding to the sample video;
The second feature extraction module is used for outputting sample video representing information corresponding to the sample video through a visual encoder in the initial generation model and outputting sample text representing information corresponding to the sample text data through a text encoder in the initial generation model; the visual encoder and the text encoder are trained based on a plurality of image-text data sets, wherein one image-text data set comprises a sample image and image text data;
the second time sequence sampling module is used for acquiring an attention mask matrix in an initial time sequence sampler contained in the initial generation model, performing time sequence sampling processing on sample video representing information according to the attention mask matrix and a mutual attention component and a self attention component in the initial time sequence sampler to obtain sample time sequence sampling information, and combining the sample time sequence sampling information and sample text representing information into a sample multi-mode characteristic;
the second text generation module is used for carrying out coding processing on the sample multi-mode characteristics through an initial multi-mode coder in the initial generation model to obtain sample fusion coding characteristics, and outputting a text prediction probability matrix corresponding to the sample fusion coding characteristics through an initial text decoder in the initial generation model; the text prediction probability matrix is used for representing the prediction probability corresponding to each unit character in the sample content description text;
the network parameter correction module is used for correcting network parameters of the initial time sequence sampler, the initial multi-mode encoder and the initial text decoder according to the attention mask matrix, the text prediction probability matrix and the sample content description text, and determining the initial generation model containing the corrected network parameters as a target generation model; the target generation model is used for generating video content description text for input video data and video text data.
The second time sequence sampling module performs time sequence sampling processing on sample video representing information according to the attention mask matrix, a mutual attention component and a self attention component in the initial time sequence sampler to obtain sample time sequence sampling information, and the second time sequence sampling module comprises:
acquiring initial sample time sequence characteristics with the same dimension as the sample text expression information, and performing dot multiplication operation on the initial sample time sequence characteristics and a first transformation matrix corresponding to a mutual attention component in an initial time sequence sampler to obtain a third query matrix;
performing dot multiplication operation on the sample video representation information and a second transformation matrix corresponding to the mutual attention component in the initial time sequence sampler to obtain a third key matrix, and performing dot multiplication operation on the sample video representation information and a third transformation matrix corresponding to the mutual attention component in the initial time sequence sampler to obtain a third value matrix;
determining the mutual attention characteristics associated with the sample video representation information according to the third query matrix, the third key matrix and the third value matrix;
performing activation processing on the attention mask matrix to obtain attention activation characteristics, and determining the product of the attention activation characteristics and the mutual attention characteristics as the output characteristics of the mutual attention component in the initial time sequence sampler;
and inputting the output characteristics of the mutual attention component in the initial time sequence sampler into the self attention component in the initial time sequence sampler, and performing self attention coding processing on the output characteristics of the mutual attention component through the self attention component to obtain sample time sequence sampling information corresponding to the sample video representation information.
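A sketch of the masked mutual-attention step in the initial time sequence sampler follows. The sigmoid used to activate the attention mask matrix is an assumption (the text only requires "activation processing"), and all variable names are illustrative.

import numpy as np

def masked_mutual_attention(sample_video_repr, init_sample_timing, W1, W2, W3, attn_mask):
    Q = init_sample_timing @ W1                   # third query matrix
    K = sample_video_repr @ W2                    # third key matrix
    V = sample_video_repr @ W3                    # third value matrix
    scores = Q @ K.T / np.sqrt(Q.shape[-1])
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)
    mutual_attention = weights @ V                # mutual attention characteristics
    # attention activation characteristics (sigmoid assumed); attn_mask is assumed to broadcast
    mask_activation = 1.0 / (1.0 + np.exp(-attn_mask))
    return mask_activation * mutual_attention     # output characteristics of the mutual attention component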
The network parameter correction module corrects network parameters of the initial time sequence sampler, the initial multi-mode encoder and the initial text decoder according to the attention mask matrix, the text prediction probability matrix and the sample content description text, and determines an initial generation model containing the corrected network parameters as a target generation model, wherein the method comprises the following steps of:
determining the sum of absolute values of all elements in the attention mask matrix as an attention mask activation value, and determining the product between the attention mask activation value and the sparse constraint parameter as a sparse attention constraint corresponding to the initial time sequence sampler;
Determining the autoregressive loss associated with the initial text decoder according to the number of sample videos and the prediction probability of each unit character in the sample content description text in the text prediction probability matrix;
determining the sum of sparse attention constraint and autoregressive loss as the model total loss corresponding to the initial generation model;
and carrying out iterative training on network parameters of the initial time sequence sampler, the initial multi-mode encoder and the initial text decoder based on the model total loss, stopping training until the model total loss meets the training ending condition, and determining the visual encoder, the text encoder, the initial time sequence sampler, the initial multi-mode encoder and the initial text decoder at the end of training as target generation models.
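The combination of the sparse attention constraint and the autoregressive loss described above can be sketched as below; the variable names are illustrative.

import numpy as np

def model_total_loss(attn_mask, sparsity_param, autoregressive_loss):
    attention_mask_activation = np.abs(attn_mask).sum()              # sum of absolute values of all elements
    sparse_constraint = sparsity_param * attention_mask_activation   # sparse attention constraint
    return sparse_constraint + autoregressive_loss                   # model total loss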
The network parameter correction module determines an autoregressive loss associated with an initial text decoder according to the number of sample videos and the prediction probability of each unit character in a text prediction probability matrix in a sample content description text, and comprises the following steps:
obtaining the prediction probability corresponding to each unit character in the sample content description text from the text prediction probability matrix, and carrying out logarithmic operation on the prediction probability corresponding to each unit character in the sample content description text to obtain a logarithmic probability value corresponding to each unit character in the sample content description text;
And accumulating the logarithmic probability values corresponding to each unit character in the sample content description text to obtain a logarithmic probability total value corresponding to the sample video, and determining the autoregressive loss associated with the initial text decoder according to the ratio between the logarithmic probability total value and the number of the sample videos.
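For a single sample video, the autoregressive loss computation can be sketched as follows. The negative sign follows the usual negative log-likelihood convention and is an assumption not spelled out above; the names are illustrative.

import numpy as np

def autoregressive_loss(text_pred_probs, target_char_ids, num_sample_videos):
    # text_pred_probs: (num_chars, vocab_size) text prediction probability matrix
    # target_char_ids: (num_chars,) unit characters of the sample content description text
    char_probs = text_pred_probs[np.arange(len(target_char_ids)), target_char_ids]
    log_prob_total = np.log(char_probs).sum()     # accumulated logarithmic probability total value
    return -log_prob_total / num_sample_videos    # ratio between the total value and the number of sample videos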
An aspect of the embodiments of the present application provides a computer device, including a memory and a processor, where the memory is connected to the processor, and the memory is used to store a computer program, and the processor is used to call the computer program, so that the computer device performs the method provided in the foregoing aspect of the embodiments of the present application.
An aspect of the present application provides a computer readable storage medium, in which a computer program is stored, the computer program being adapted to be loaded and executed by a processor, to cause a computer device having a processor to perform the method provided in the above aspect of the embodiments of the present application.
According to one aspect of the present application, there is provided a computer program product or computer program comprising computer instructions stored in a computer readable storage medium. The processor of the computer device reads the computer instructions from the computer-readable storage medium, and the processor executes the computer instructions, so that the computer device performs the method provided in the above aspect.
In the embodiment of the application, by acquiring the video representation information corresponding to the video data and the text representation information of the video distribution data corresponding to the video data, the time sequence sampling processing can be further performed on the video representation information to acquire the time sequence information in the video data, so that the representation capability of video features (such as the video time sequence sampling information) can be improved. The text representing information and the video time sequence sampling information are combined into the multi-mode combination characteristic, the multi-mode combination characteristic is subjected to coding processing to obtain the multi-mode fusion coding characteristic, further, the multi-mode fusion coding characteristic is subjected to text decoding processing, the video content description text corresponding to the video data is obtained through prediction, and the description accuracy of the video content can be improved.
Drawings
In order to more clearly illustrate the embodiments of the present application or the technical solutions in the prior art, the drawings that are required in the embodiments or the description of the prior art will be briefly described below, it being obvious that the drawings in the following description are only some embodiments of the present application, and that other drawings may be obtained according to these drawings without inventive effort for a person skilled in the art.
Fig. 1 is a schematic structural diagram of a network architecture according to an embodiment of the present application;
fig. 2 is a schematic diagram of a video comment generation scene provided in an embodiment of the present application;
FIG. 3 is a flowchart illustrating a data processing method according to an embodiment of the present application;
fig. 4 is a schematic diagram of feature extraction of video data and video text data according to an embodiment of the present application;
FIG. 5 is a second flow chart of a data processing method according to an embodiment of the present disclosure;
FIG. 6 is a schematic diagram of a timing sampler in a target generation model according to an embodiment of the present application;
fig. 7 is a flowchart illustrating a data processing method according to an embodiment of the present application;
FIG. 8 is a schematic diagram of a network framework of an initial generative model provided in an embodiment of the present application;
FIG. 9 is a schematic diagram of an attention mask matrix provided by an embodiment of the present application;
FIG. 10 is a schematic diagram of the structure of an initial multi-mode encoder and an initial text decoder provided in an embodiment of the present application;
FIG. 11 is a schematic diagram of a data processing apparatus according to an embodiment of the present application;
FIG. 12 is a schematic diagram of a second embodiment of a data processing apparatus;
Fig. 13 is a schematic structural diagram of a computer device according to an embodiment of the present application.
Detailed Description
The following description of the embodiments of the present application will be made clearly and fully with reference to the accompanying drawings, in which it is evident that the embodiments described are only some, but not all, of the embodiments of the present application. All other embodiments, which can be made by one of ordinary skill in the art based on the embodiments herein without making any inventive effort, are intended to be within the scope of the present application.
For ease of understanding, the following describes the basic technical concepts related to the embodiments of the present application:
computer Vision technology (CV): computer vision is a science of how to make a machine "look at", more specifically, it means that a camera and a computer are used to replace human eyes to identify, locate and measure targets, and further perform graphic processing, so that the computer is processed into images more suitable for human eyes to observe or transmit to an instrument to detect. As a scientific discipline, computer vision research-related theory and technology has attempted to build artificial intelligence systems that can acquire information from images or multidimensional data. Computer vision techniques typically include image processing, image recognition, image semantic understanding, image retrieval, OCR (Optical Character Recognition ), video processing, video semantic understanding, video content/behavior recognition, three-dimensional object reconstruction, 3D techniques, virtual reality, augmented reality, synchronous positioning, map construction, and other techniques, as well as common biometric recognition techniques such as face recognition, fingerprint recognition, and the like.
Natural language processing (Nature Language processing, NLP): natural language processing is an important direction in the fields of computer science and artificial intelligence, and is a method for researching various theories for realizing effective communication between a person and a computer in natural language. Natural language processing is a science that integrates linguistics, computer science, and mathematics. Thus, the research in this field will involve natural language, i.e. language that people use daily, so it has a close relationship with the research in linguistics. Natural language processing techniques typically include text processing, semantic understanding, machine translation, robotic questions and answers, knowledge graph techniques, and the like.
The embodiment of the application particularly relates to technologies such as video semantic understanding and text processing under the computer vision technology and the natural language processing technology. According to the embodiments of the application, video data and the video text data corresponding to the video data can be acquired, and feature extraction can be performed on the video data and the video text data through a pre-trained image-text model, so that a multi-mode combination feature containing video information and text information can be obtained; the video content description text corresponding to the video data can then be automatically generated from the multi-mode combination feature, and the quality of the video content description text can be improved.
Referring to fig. 1, fig. 1 is a schematic structural diagram of a network architecture provided in an embodiment of the present application, where the network architecture may include a server 10d and a terminal cluster, and the terminal cluster may include one or more terminal devices, where the number of terminal devices included in the terminal cluster is not limited. As shown in fig. 1, the terminal cluster may specifically include a terminal device 10a, a terminal device 10b, a terminal device 10c, and the like; all terminal devices in the terminal cluster (which may include, for example, terminal device 10a, terminal device 10b, and terminal device 10c, etc.) may be in network connection with the server 10d, so that each terminal device may interact with the server 10d through the network connection.
The terminal devices of the terminal cluster may include, but are not limited to: electronic devices such as smart phones, tablet computers, notebook computers, palm computers, mobile internet devices (mobile internet device, MID), wearable devices (such as smart watches, smart bracelets and the like), intelligent voice interaction devices, intelligent household appliances (such as smart televisions and the like), vehicle-mounted devices, aircrafts and the like, and the type of terminal devices is not limited. It will be appreciated that each terminal device in the terminal cluster shown in fig. 1 may be provided with an application client, and when the application client runs in each terminal device, data interaction may be performed between the application client and the server 10d shown in fig. 1. The application client running in each terminal device may be an independent client, or may be an embedded sub-client integrated in a certain client, which is not limited in this application.
The application client may specifically include, but is not limited to: a client having video-text processing functions such as a browser, a vehicle-mounted client, a smart home client, an entertainment client (e.g., a game client), a multimedia client (e.g., a video client, a short video client), a conference client, and a social client. If the terminal device included in the terminal cluster is a vehicle-mounted device, the vehicle-mounted device may be an intelligent terminal in an intelligent traffic scene, and an application client running in the vehicle-mounted device may be referred to as a vehicle-mounted client.
The server 10d may be an independent physical server, a server cluster or a distributed system formed by a plurality of physical servers, or a cloud server that provides cloud services, cloud databases, cloud computing, cloud functions, cloud storage, network services, cloud communication, middleware services, domain name services, security services, a content delivery network (Content Delivery Network, CDN), basic cloud computing services such as big data and an artificial intelligence platform, and the type of the server is not limited in this application.
It may be appreciated that the application client installed in each terminal device shown in fig. 1 may invoke a pre-trained object generation model, e.g., the trained object generation model may be online in the application client. If the object a wants to upload the video data authored or edited by itself and the corresponding video distribution data to the application client installed in the terminal device (for example, the terminal device 10 a), the terminal device 10a may acquire the video data and the video distribution data uploaded to the application client by the object a, and further may process the video data and the video distribution data by using the trained target generation model, and automatically generate the video content description text conforming to the context for the video data. The video content description text related to the embodiment of the application may refer to a natural sentence for describing video content, and the video content description text may include, but is not limited to, information such as comments of video data, subtitles, video content summaries, and the like.
For ease of understanding, the foregoing object generation model may refer to a text generation model for processing video data and its corresponding video textual data, input data of the object generation model including the video data and its corresponding video textual data, and output data of the object generation model refers to video content description text automatically generated for the video data. The text generation model of the training phase (i.e., the text generation model that has not completed training) may be referred to as an initial generation model. The initial generation model and the target generation model are text generation models in different stages, that is, the initial generation model and the target generation model have the same network structure, and network parameters between the initial generation model and the target generation model may be different.
It should be noted that, the process of training the initial generation model by using the video data and the corresponding video text data, and the process of applying the target recognition model in the video content understanding service may be executed by the computer device; that is, the data processing method provided in the embodiment of the present application may be performed by a computer device, where the computer device may be a server 10d (the server 10d may be a background server corresponding to an application client) in the network architecture shown in fig. 1, or any one of terminal devices in a terminal cluster, or may be a computer program (including program code, for example, an application client integrated by the terminal device), which is not limited in this embodiment of the present application.
The application process of the trained target generation model is described below by taking a video comment generation scene as an example. Referring to fig. 2, fig. 2 is a schematic view of a video comment generation scene provided in an embodiment of the present application. When the object a creates video data 20a and adds video distribution data 20b thereto, the object a may upload the video data 20a and the video distribution data 20b created thereby (for example, the video distribution data 20b may be the text content "record today's loveliness, share today's happiness") to any one of the terminal devices in the terminal cluster shown in fig. 1 (which may also be understood as the terminal device used by the object, such as the terminal device 10a shown in fig. 1). The terminal device 10a may acquire the video data 20a and the video distribution data 20b authored by the object a in the application client, and acquire a target generation model that has been trained (the target generation model may be a model formally online in the application client); the target generation model may include a visual encoder 20c for extracting image features in video data, a text encoder 20e for extracting text features, a timing sampler 20g for mining video timing dimension information, a multi-mode encoder 20i for fusing video information and text information, a text decoder 20j for generating video content description text of the video data, and the like.
The video data 20a and the video context data 20b may be used as input data of a target generation model, specifically, each video frame in the video data 20a may be sequentially input to a visual encoder 20c in the target generation model, image feature extraction may be performed on each video frame in the video data 20a by the visual encoder 20c, image representation information corresponding to each video frame in the video data 20a may be output, and further, image representation information corresponding to each video frame in the video data 20a may be combined to obtain video representation information 20d corresponding to the video data 20 a. In other words, the visual encoder 20c in the object generation model may be understood as an image feature extractor.
The video distribution data 20b may be subjected to word segmentation processing to obtain a plurality of unit characters contained in the video distribution data 20b, where the unit characters may be single characters or single words in the video distribution data 20b, so that each unit character in the video distribution data 20b may be converted into a vector representation, and the vector representations of each unit character in the video distribution data 20b are combined to obtain text input features corresponding to the video distribution data 20 b. The text input features are input to a text encoder 20e in the target generation model, and text feature extraction can be performed on the video text data 20b by the text encoder 20e, so as to output text representation information 20f corresponding to the video text data 20 b. In other words, the text encoder 20e in the object generation model may be understood as a text feature extractor.
Wherein the image representation information extracted by the visual encoder 20c is obtained from still video frames, and the video data 20a itself is time-series-associated data composed of a plurality of consecutive video frames, there is no time-series association between the image representation information corresponding to each video frame extracted by the visual encoder 20 c. In order to hold the timing information of the video data 20a, the video presentation information 20d extracted by the visual encoder 20c may be input to a timing sampler 20g in the target generation model, and the video presentation information 20d may be subjected to timing sampling processing by the timing sampler 20g to obtain video timing sampling information 20h corresponding to the video data 20 a. In other words, the video data 20a can be treated as a plurality of independent video frames by the visual encoder 20c, but there is substantially a timing correlation between the plurality of video frames in the video data 20a, so that the information inside the video data 20a can be mined by the timing sampler 20g, the input video representation information 20d is subjected to the dimension reduction processing, redundant information between the respective video frames is reduced, and finally the video timing sampling information 20h is output.
It will be appreciated that the video timing sample information 20h output by the timing sampler 20g and the text representation information 20f output by the text encoder 20e may have the same dimensions, so that the video timing sample information 20h and the text representation information 20f may be combined (e.g., by feature stitching) to obtain the multi-mode combination feature; the multi-mode combination feature here may refer to multi-modal information including video and text. The multi-mode combination feature may be input to the multi-mode encoder 20i in the target generation model, and the multi-mode encoder 20i performs fusion encoding processing on the input multi-mode combination feature (which may also be understood as performing fusion encoding processing on the input video information and text information) to obtain the multi-mode fusion encoding feature. Further, the multi-mode fusion encoding feature output by the multi-mode encoder 20i may be input to the text decoder 20j in the target generation model, and the multi-mode fusion encoding feature is text-decoded by the text decoder 20j to generate comment data 20k for the video data 20a (for example, the comment data 20k may be a short comment praising the content of the video). Here, the embodiment of the present application only briefly describes the roles of the visual encoder 20c, the text encoder 20e, the timing sampler 20g, the multi-mode encoder 20i and the text decoder 20j in the target generation model; their specific network structures will be described in detail later.
Further, when the video data 20a and the video distribution data 20b are successfully distributed in the application client, comment data 20k generated by content understanding of the input video data 20a and video distribution data 20b using the object generation model may also be distributed in the comment area of the video data 20a. A registration object in the application client (including the aforementioned object a) can view and play the video data 20a successfully published in the application client.
For example, the registration object B in the application client may play the video data 20a published in the application client through the terminal device it uses, such as the video data 20a may be played in the page 20m of the application client. The page 20m may display information such as a playing progress bar, a played time length, an overall time length (for example, the overall time length of the video data 20a is 20 seconds) of the video data 20a, and may also display a sound control 20r, a full screen control 20s, an expansion control 20n, and the like; wherein the sound control 20r may be used to adjust the play volume of the video data 20a, the full screen control 20s may be used to control the play display size of the video data 20a in the page 20m, etc. When the registered object B performs a triggering operation on the expansion control 20n in the page 20m, a menu bar 20p may be displayed in the page 20m, and the menu bar 20p may include favorite, comment, share, and the like controls. When the registration object B performs a triggering operation on the comment control in the menu bar 20p, a page 20q may be displayed, and the page 20q may be used to present all comment data of the video data 20a, such as comment data 20k automatically generated for the video data 20a using the target generation model as described above. Optionally, the registration object B may also comment on the video data 20a through a comment posting control 20t in the page 20 q. The page 20q may be a partial area in the page 20m, or may be a sub-page independently displayed on the page 20m, or may be a page jumped from the page 20m, or the like, which is not limited in this application.
According to the embodiment of the application, the characteristics of the input video data and the video text data corresponding to the video data can be extracted through the visual encoder and the text encoder, the time sequence information among all video frames in the video data is mined through the time sequence sampler, the representation capability of multi-mode combination information can be improved, the multi-mode combination characteristics are subjected to fusion encoding, comment data corresponding to the video data is further generated through the text decoder, and the effect of the comment data can be improved.
Referring to fig. 3, fig. 3 is a schematic flow chart of a data processing method according to an embodiment of the present application; it will be appreciated that the data processing method may be performed by a computer device, which may be a server, or may be a terminal device, which is not limited in this application. As shown in fig. 3, the data processing method may include the following steps S101 to S104:
step S101, obtaining video data and video text data corresponding to the video data.
In particular, the computer device may obtain video data and video distribution data corresponding to the video data from different channels. Herein, a channel may refer to a manner of acquiring video data and video distribution data, and may include, but is not limited to: the method comprises the steps of adopting the camera equipment to directly shoot, directly downloading from application clients such as a video platform, a short video platform and the like, adopting a video making tool to make creation and generation of a plurality of static images and the like, and not carrying out one-to-one example here. For example, the video data may be short videos created for each object, or may be video clips intercepted in videos such as dramas, movies, shows, cartoons, news, music clips, live broadcast, etc. in the application client; the short video is a short video, is an internet content transmission mode, and is usually a video with a transmission duration of less than 5 minutes on new internet media. The video distribution data may refer to text matched to the video data, and the video distribution data may be natural sentences of different language types.
Step S102, obtaining video representation information corresponding to the video data and obtaining text representation information corresponding to the video distribution data.
Specifically, after the video data and the corresponding video text data are obtained, a trained target generation model can be obtained. The target generation model can perform content understanding on the video data and the video text data and automatically generate video content description text that fits the context, where the video content description text may include, but is not limited to, comment data, subtitle data, a video content summary and the like of the video data. The target generation model may include an image-text feature extraction sub-model, a time sequence sampler and an encoder-decoder generation sub-model. The image-text feature extraction sub-model may include a visual encoder for extracting image features and a text encoder for extracting text features. The network structure of the visual encoder may include, but is not limited to, ResNet (a residual network model), DenseNet (a densely connected network model), a Transformer-based network (a network model using an attention mechanism), a hybrid network composed of a convolutional neural network and a Transformer, a variant of any one of the above feature extraction networks, or a combination of any two or more of the above networks, and the like; the application does not limit the network structure of the visual encoder. The network structure of the text encoder may include, but is not limited to, any one of a BERT (Bidirectional Encoder Representation from Transformers, a natural language processing model) structure, a BART structure (a natural language generation model modified based on BERT), an RNN (Recurrent Neural Network) and an LSTM (Long Short-Term Memory network), or a variant of any one of the above natural language processing networks, or a combination of any two or more of the above networks, and the like; the application does not limit the network structure of the text encoder. The image-text feature extraction sub-model formed by the visual encoder and the text encoder may be a BLIP (Bootstrapping Language-Image Pre-training, a language-image pre-training framework for unified visual-language understanding and generation) model, a BEiT-3 (a general-purpose multi-modal foundation model) or a CoCa (a multi-modal image-text foundation model) model, and the like. The encoder-decoder generation sub-model may include a multi-mode encoder and a text decoder, and the application does not limit the network structures of the multi-mode encoder and the text decoder.
In the embodiment of the application, the image feature extraction can be performed on each video frame in the video data through the visual encoder in the image-text feature extraction sub-model, so as to obtain video representation information corresponding to the video data, wherein the video representation information can be used for representing the video features in the video data. In a possible embodiment, in the process of extracting image features of video data by using a visual encoder, firstly, the video data may be subjected to framing processing to obtain a video frame sequence, and then each video frame in the video frame sequence may be divided into a plurality of image blocks with fixed sizes, so as to obtain an image block set corresponding to each video frame in the video frame sequence, where the fixed size may be determined according to actual requirements of an application scene, for example, the fixed size may be 16×16, or may be 32×32 or other sizes. Acquiring image input features corresponding to a video frame i according to an image block set corresponding to the video frame i contained in the video frame sequence, and inputting the image input features to a visual encoder in a target generation model; the image input feature may be formed by stitching feature vectors of respective image blocks in an image block set corresponding to a video frame i, where i is a positive integer less than or equal to the number of video frames corresponding to the video frame sequence. The image representing information corresponding to the video frame i can be output by carrying out encoding processing on the image input characteristics corresponding to the video frame i according to the visual encoder; and further, the image representation information corresponding to each video frame in the video frame sequence can be combined into the video representation information corresponding to the video data.
For example, assuming that the spatial resolution of the video data is 224×224 and the video data is divided into 10 video frames, the size of each video frame is 224×224. When the fixed size of an image block is set to 16×16, each video frame in the video data may be divided into (224/16)×(224/16) image blocks, that is, the image block set corresponding to the aforementioned video frame i (where i may be a positive integer less than or equal to 10) may contain 196 image blocks of size 16×16. The image representation information output after the video frame i passes through the visual encoder may be regarded as a feature representation with a sequence length of 196, and the video representation information output after the video data consisting of 10 video frames passes through the visual encoder may be regarded as a feature representation with a sequence length of 196×10.
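For illustration only (this sketch is not part of the original disclosure), a possible Python/PyTorch implementation of the image block division and flattening described above could look as follows; the projection to a hidden size of 1024 and all function names are assumptions.

```python
import torch
import torch.nn as nn

def patchify_frame(frame: torch.Tensor, patch: int = 16) -> torch.Tensor:
    """Split one (C, H, W) frame into fixed-size blocks and flatten each block."""
    c, h, w = frame.shape                                # e.g. (3, 224, 224)
    blocks = frame.reshape(c, h // patch, patch, w // patch, patch)
    blocks = blocks.permute(1, 3, 0, 2, 4)               # (H/p, W/p, C, p, p)
    return blocks.reshape(-1, c * patch * patch)         # (196, 768) for 224 / 16

class PatchEmbedding(nn.Module):
    """Projects the flattened image blocks to an assumed hidden size."""
    def __init__(self, patch: int = 16, channels: int = 3, hidden: int = 1024):
        super().__init__()
        self.proj = nn.Linear(channels * patch * patch, hidden)

    def forward(self, frame: torch.Tensor) -> torch.Tensor:
        return self.proj(patchify_frame(frame))          # (196, hidden)

video = torch.randn(10, 3, 224, 224)                     # 10 frames of one video
embed = PatchEmbedding()
image_input_features = torch.stack([embed(f) for f in video])   # (10, 196, 1024)
```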
Similarly, text feature extraction is carried out on the video text data through the text encoder in the image-text feature extraction sub-model, so as to obtain the text representation information corresponding to the video text data, where the text representation information can be used to represent the text features in the video text data. In one or more embodiments, in the process of extracting text features from the video text data by the text encoder, word segmentation processing may first be performed on the video text data to obtain a plurality of unit characters (where a unit character may be a single character or a single word in the video text data), and vector conversion may then be performed on each unit character to obtain the vector representation corresponding to each unit character; for example, a one-hot code may be used to vectorize each unit character, or a word2vec (a word vector model) may be used to vectorize each unit character, which is not limited in this application. Further, the vector representations corresponding to the unit characters contained in the video text data can be combined to obtain the text input feature corresponding to the video text data; the text input feature is input to the text encoder, and the text encoder performs feature extraction (such as encoding processing) on the text input feature, so that the text representation information corresponding to the video text data can be output.
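For illustration only, a minimal sketch of the character-level vectorization mentioned above (one-hot variant) is given below; the vocabulary and the tokens are invented for the example, and word2vec would simply replace the one-hot lookup.

```python
import torch
import torch.nn.functional as F

# Hypothetical vocabulary built from the unit characters of a caption.
vocab = ["capturing", "lovely", "today", "sharing", "happy"]
char_to_idx = {c: i for i, c in enumerate(vocab)}

def one_hot_caption(unit_characters):
    """Turn each unit character into a one-hot vector over the assumed vocabulary."""
    ids = torch.tensor([char_to_idx[c] for c in unit_characters])
    return F.one_hot(ids, num_classes=len(vocab)).float()

# Vector representations of the unit characters, combined into the text input feature.
text_input_feature = one_hot_caption(
    ["capturing", "lovely", "today", "sharing", "happy", "today"])
print(text_input_feature.shape)    # torch.Size([6, 5])
```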
Referring to fig. 4, fig. 4 is a schematic diagram illustrating feature extraction of video data and video distribution data according to an embodiment of the present application. As shown in fig. 4, after the video data and the video distribution data 30b corresponding thereto are acquired, the acquired video data may be subjected to framing processing to obtain a video frame sequence; since the feature extraction process in the visual encoder is the same for each video frame in the video frame sequence, any video frame (for example, the video frame 30a) in the video frame sequence is taken as an example below. The video frame 30a may be cut into a plurality of image blocks with a fixed size to obtain the image block set 30c corresponding to the video frame 30a, and image block embedding may then be performed on each image block in the image block set 30c; for example, each image block in the image block set 30c may be flattened into a vector sequence, and the vector sequences corresponding to all image blocks in the image block set 30c may be spliced into an overall vector sequence, which may be used as the image input feature corresponding to the video frame 30a. The image input feature corresponding to the video frame 30a is input to the visual encoder, and the image input feature corresponding to the video frame 30a is encoded by the visual encoder to obtain the image representation information 30d corresponding to the video frame 30a. Similarly, each video frame in the video data may have its corresponding image representation information extracted after passing through the visual encoder, and the image representation information corresponding to the respective video frames is then combined into the video representation information corresponding to the video data.
It should be noted that the dimensions of the image input feature corresponding to the video frame 30a need to satisfy the input data dimensions of the visual encoder, and different network structures of the visual encoder may require different processing of the video frame 30a before it is input to the visual encoder (for example, when the visual encoder adopts a ResNet structure, the image blocks may be input to the visual encoder without being flattened into vector sequences), which is not limited in this application.
As shown in fig. 4, the video text data 30b corresponding to the video data is "capturing lovely today, sharing happy today", and the unit character set 30e may be obtained by performing word segmentation processing on the video text data 30b, where the unit character set 30e may include unit characters such as "capturing", "lovely", and the like. Word embedding processing is carried out on each unit character in the unit character set 30e, namely each unit character in the unit character set 30e is converted into vector representation, so that text input characteristics corresponding to the video text matching data 30b are obtained; the text input feature is input to a text encoder, and the text input feature is encoded by the text encoder, so that text representation information 30f corresponding to the video text distribution data 30b can be obtained.
Step S103, performing time sequence sampling processing on the video representation information to obtain video time sequence sampling information corresponding to the video data, and combining the video time sequence sampling information and the text representation information into a multi-mode combination feature.
Specifically, since the video data is composed of a plurality of continuous video frames, the time sequence sampling processing can be performed on the video representation information through the time sequence sampler in the target generation model, so as to obtain the video time sequence sampling information corresponding to the video data. The time sequence sampler can perform dimension reduction processing on the input video representation information and mine time sequence information among video frames contained in the video data; that is, the dimension of the video timing sampling information is smaller than the dimension of the video representation information, and the dimension of the video timing sampling information is the same as the dimension of the text representation information. By introducing a time sequence sampler into the target generation model, time sequence information in video data can be acquired, and therefore the representation capability of video features is improved; the computational complexity in the subsequent processing can be reduced by the dimension reduction processing.
In this embodiment of the present application, the time sequence sampler may be formed by alternately connecting a plurality of mutual attention components (Cross Attention) and a plurality of self attention components (Self Attention), where the number of mutual attention components is the same as the number of self attention components. The mutual attention components in the time sequence sampler introduce an initial time sequence feature; the initial time sequence feature and the video representation information may be input into the time sequence sampler, the initial time sequence feature may be iteratively updated based on the video representation information in the time sequence sampler, and the video time sequence sampling information is finally output. The video representation information corresponding to the video data may be input into all the mutual attention components in the time sequence sampler, and the input data of a mutual attention component in the time sequence sampler may include, in addition to the video representation information, the output feature of the preceding self attention component; of course, the input data of the first mutual attention component in the time sequence sampler includes the video representation information and the initial time sequence feature, and the dimension of the initial time sequence feature is the dimension of the video time sequence sampling information finally output by the time sequence sampler. The input data of each self attention component in the time sequence sampler is the output feature of the preceding mutual attention component. The calculation process of the mutual attention components and the self attention components in the time sequence sampler will be described in detail below.
The time sequence sampler in the target generation model can align and fuse the video representation information output by the visual encoder and the text representation information output by the text encoder, and the dimension of the video time sequence sampling information output by the time sequence sampler is the same as that of the text representation information, so that the video time sequence sampling information and the text representation information can be spliced into multi-mode combination features directly.
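For illustration only, the direct splicing described above can be sketched as follows, using assumed shapes (a 35×1024 sampler output and a 12×1024 text representation); only the shared hidden size matters for the concatenation.

```python
import torch

video_timing_sampling = torch.randn(35, 1024)   # output of the time sequence sampler (assumed shape)
text_representation   = torch.randn(12, 1024)   # output of the text encoder (assumed shape)

# Same hidden size, so the two sequences can be concatenated along the sequence axis
# to form the multi-mode combination feature.
multimodal_combination = torch.cat([video_timing_sampling, text_representation], dim=0)
print(multimodal_combination.shape)             # torch.Size([47, 1024])
```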
Step S104, encoding the multi-mode combination features to obtain multi-mode fusion encoding features, and performing text decoding processing on the multi-mode fusion encoding features to obtain video content description text associated with the video data.
Specifically, the computer device may input the multi-mode combination feature to an encoder-decoder generation sub-model in the target generation model, and encode the multi-mode combination feature by using a multi-mode encoder in the encoder-decoder generation sub-model to obtain a multi-mode fusion encoding feature, where the multi-mode fusion encoding feature may refer to a hidden state feature output by a last network layer in the multi-mode encoder. The multi-mode fusion coding feature is input to a text decoder in an encoder-decoder generation sub-model, and text decoding processing is carried out on the multi-mode fusion coding feature through the text decoder, so that video content description text meeting the context of the video data is generated, and comments, subtitles, video content summaries and the like meeting the context can be automatically generated for the video data.
The encoder-decoder generation sub-model in the target generation model may adopt a BERT structure or a BART structure commonly used in NLP; compared with the existing BERT or BART structure, the encoder-decoder generation sub-model in the embodiment of the present application changes the input data from plain text information into multi-mode combination features containing both video and text information. For example, assuming that the encoder-decoder generation sub-model in the target generation model adopts a BART structure, the multi-mode combination feature can be input to the multi-mode encoder in the target generation model, and the multi-mode combination feature is subjected to bidirectional feature encoding by the multi-mode encoder to obtain the multi-mode fusion encoding feature; the multi-mode fusion encoding feature is input to the text decoder in the target generation model, and an attention aggregation operation is performed on the multi-mode fusion encoding feature by the text decoder to obtain an attention aggregation feature; autoregressive processing is performed on the attention aggregation feature to obtain a text probability output matrix, and the video content description text associated with the video data is determined according to the text probability output matrix. The text probability output matrix may include the prediction probabilities of the respective predicted characters (single characters or single words); the optimal predicted character string is determined based on these prediction probabilities, and the optimal predicted character string is determined as the video content description text associated with the video data.
In other words, the multi-mode encoder in the encoder-decoder generation sub-model can perform bidirectional feature representation encoding, calculation and feature extraction on the input multi-mode combined features; the text decoder in the encoder-decoder generating sub-model can perform Attention aggregation calculation by using the mutual Attention aggregation operation (Cross Attention) and the hidden state result output by the last network layer of the multi-mode encoder, so as to generate the video content description text in an autoregressive mode represented by the unidirectional characteristic, and the specific calculation process is the same as the calculation process in the currently commonly used BART structure, which is not repeated in the embodiment of the present application.
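For illustration only, a rough sketch of such an encoder-decoder generation sub-model is given below using standard PyTorch Transformer layers; the layer counts, vocabulary size, token ids and greedy decoding are assumptions and do not reproduce the exact BERT/BART configuration.

```python
import torch
import torch.nn as nn

d_model, vocab_size, bos_id, eos_id = 1024, 30000, 1, 2        # assumed values

multimodal_encoder = nn.TransformerEncoder(
    nn.TransformerEncoderLayer(d_model, nhead=8, batch_first=True), num_layers=6)
text_decoder = nn.TransformerDecoder(
    nn.TransformerDecoderLayer(d_model, nhead=8, batch_first=True), num_layers=6)
token_embed = nn.Embedding(vocab_size, d_model)
lm_head = nn.Linear(d_model, vocab_size)

def generate(multimodal_combination: torch.Tensor, max_len: int = 30):
    # Bidirectional encoding of the multi-mode combination feature.
    memory = multimodal_encoder(multimodal_combination.unsqueeze(0))
    tokens = [bos_id]
    for _ in range(max_len):
        tgt = token_embed(torch.tensor([tokens]))
        causal = torch.triu(torch.full((len(tokens), len(tokens)), float("-inf")), diagonal=1)
        # Cross-attention over the multi-mode fusion encoding feature, unidirectional decoding.
        hidden = text_decoder(tgt, memory, tgt_mask=causal)
        next_id = int(lm_head(hidden[:, -1]).argmax(-1))       # text probability output
        if next_id == eos_id:
            break
        tokens.append(next_id)
    return tokens[1:]                                           # predicted character ids

description_token_ids = generate(torch.randn(47, d_model))
```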
In the embodiment of the application, video representing information can be obtained after video data passes through a visual encoder in a target generation model, and text representing information can be obtained after video text matching data corresponding to the video data passes through a text encoder; further, the video representation information can be subjected to time sequence sampling processing through a time sequence sampler introduced into the target generation model so as to obtain video time sequence sampling information, and the text representation information and the video time sequence sampling information are combined into a multi-mode combination characteristic; the multi-modal combination feature is passed through a multi-modal encoder and then predicted using a text decoder to produce the final video content description text. The time sequence information among all video frames in the video data is acquired through the time sequence sampler, so that the representation capability of video features can be improved, and further, the description accuracy of video content can be improved.
Referring to fig. 5, fig. 5 is a second flowchart of a data processing method according to an embodiment of the present application; it will be appreciated that the data processing method may be performed by a computer device, which may be a server, or may be a terminal device, which is not limited in this application. As shown in fig. 5, the data processing method may include the following steps S201 to S209:
step S201, obtaining video data and video text data corresponding to the video data.
The specific implementation process of step S201 may refer to the description of step S101 in the embodiment corresponding to fig. 3, and will not be described herein.
Step S202, obtaining an image input feature corresponding to a video frame i in video data, outputting an attention coding feature corresponding to the image input feature according to an attention coding component in a visual encoder, and combining the image input feature and the attention coding feature into an image joint feature.
Specifically, the computer device may perform frame division processing on the video data to obtain a video frame sequence, and divide each video frame in the video frame sequence into a plurality of image blocks with a fixed size to obtain an image block set corresponding to each video frame in the video frame sequence; and further, according to the image block set corresponding to the video frame i in the video frame sequence, the image input characteristic corresponding to the video frame i is obtained, wherein i is a positive integer less than or equal to the number of video frames contained in the video frame sequence. The process of acquiring the image input features may refer to the related description in step S102 in the embodiment corresponding to fig. 3, which is not described herein.
In the embodiment of the present application, the network structure adopted by the visual encoder in the target generation model is described by taking a ViT (Vision Transformer) model framework as an example. The processing of an input video frame i by the visual encoder may then include image block division, image block embedding, self-attention component iteration, and other processes, finally obtaining the image representation information corresponding to the video frame i. The implementation of image block division and image block embedding can be understood as the process of acquiring the image input feature corresponding to the video frame i, which is not repeated here; the self-attention component iteration may refer to the processing of the image input feature by structures such as normalization layers, attention encoding components, and multi-layer perceptrons in the visual encoder. It will be appreciated that the specific network structure of the visual encoder may be created according to actual requirements, which is not limited in this application; the attention encoding component in the visual encoder may include one or more self-attention components.
For ease of understanding, the following description takes the case where the attention encoding component contains only one self-attention component. After the image input feature corresponding to the video frame i is input to the visual encoder, a transformation weight matrix corresponding to the attention encoding component in the visual encoder can be obtained, and the image input feature is transformed into a first query matrix, a first key matrix and a first value matrix based on the transformation weight matrix of the attention encoding component. The transformation weight matrix corresponding to the attention encoding component may comprise three parameter matrices, such as a parameter matrix $W_q$, a parameter matrix $W_k$ and a parameter matrix $W_v$; the transformation weight matrix corresponding to the attention encoding component is a parameter learned during the training of the image-text feature extraction sub-model. Dot-multiplying the image input feature with the parameter matrix $W_q$ gives the first query matrix (which may be denoted as $Q_1$), dot-multiplying the image input feature with the parameter matrix $W_k$ gives the first key matrix (denoted as $K_1$), and dot-multiplying the image input feature with the parameter matrix $W_v$ gives the first value matrix (denoted as $V_1$). Each query vector in the first query matrix may be used to encode the similarity relationship between one feature and the other features, which determines the dependency information between that feature and the preceding features.

Further, the first query matrix and the transpose of the first key matrix are dot-multiplied to obtain a candidate weight matrix, i.e. $Q_1 K_1^{\top}$. The candidate weight matrix can be regarded as the inner product (also called dot product) of each row vector in the first query matrix $Q_1$ with the first key matrix $K_1$; in order to prevent the inner product from becoming excessively large, the number of columns $d_1$ of the first query matrix $Q_1$ may be obtained (the first query matrix $Q_1$ and the first key matrix $K_1$ have the same number of columns, which may also be called the vector dimension). The ratio of the candidate weight matrix to the square root of the number of columns, $Q_1 K_1^{\top}/\sqrt{d_1}$, is normalized to obtain the first attention weight matrix, and the attention encoding feature corresponding to the image input feature is determined from the dot multiplication between the first attention weight matrix and the first value matrix.

The first attention weight matrix may be expressed as $\mathrm{softmax}(Q_1 K_1^{\top}/\sqrt{d_1})$, where the softmax function is used for normalization and can be used to compute the self-attention coefficients of a single feature with respect to the other features; softmax is applied to each row of $Q_1 K_1^{\top}/\sqrt{d_1}$. The dot multiplication between the first attention weight matrix and the first value matrix $V_1$ is determined as the output feature of the attention encoding component, which may be expressed as $O_1 = \mathrm{softmax}(Q_1 K_1^{\top}/\sqrt{d_1})\,V_1$. The output feature of the attention encoding component may then be referred to as the attention encoding feature corresponding to the image input feature (for ease of understanding, the attention encoding feature here may be called the image attention encoding feature).
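For illustration only, the single self-attention computation reconstructed above can be sketched as follows; the shapes (196 blocks, hidden size 1024) are taken from the earlier example, and the weight matrices are random stand-ins for the trained parameters.

```python
import torch

def self_attention(x, w_q, w_k, w_v):
    q1, k1, v1 = x @ w_q, x @ w_k, x @ w_v                       # first query/key/value matrices
    scores = q1 @ k1.transpose(-2, -1) / (q1.shape[-1] ** 0.5)   # candidate weight matrix / sqrt(d1)
    attn = torch.softmax(scores, dim=-1)                         # first attention weight matrix
    return attn @ v1                                             # attention encoding feature O1

x = torch.randn(196, 1024)                                       # image input feature of one frame
w_q, w_k, w_v = [torch.randn(1024, 1024) for _ in range(3)]
image_attention_encoding = self_attention(x, w_q, w_k, w_v)      # (196, 1024)
```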
Alternatively, if the attention encoding component in the visual encoder contains a plurality of self-attention components, each self-attention component in the attention encoding component corresponds to an output feature, such as output feature O1, output feature O2, output feature O3, … …; the output features corresponding to the respective self-attention components can then be spliced into the image attention encoding feature corresponding to the image input feature, where the splicing may be a concat operation. Further, the image input feature and the image attention encoding feature output by the attention encoding component may be combined into an image joint feature.
Optionally, in one or more embodiments, before the input attention encoding component inputs the image input feature, the image input feature may be subjected to position encoding to obtain a position encoding feature corresponding to the image input feature (for convenience of understanding, the position encoding feature may be referred to as a first position encoding feature herein), and the image input feature and the first position encoding feature are added to obtain an image initial characterization matrix; wherein the relevant description of the position encoding operation will be described in the subsequent steps. Optionally, the image initial characterization matrix may be normalized to obtain a normalized image initial characterization matrix, and then the normalized image initial characterization matrix may be calculated according to one or more self-attention components in the attention coding component to obtain an attention coding feature corresponding to the image input feature. Further, the image joint feature at this time may be a feature obtained by adding the attention code feature and the image initial characterization matrix.
Step S203, an implicit weight matrix and a bias vector corresponding to the multi-layer perceptron in the visual encoder are obtained, an image transformation feature corresponding to the video frame i is determined based on the bias vector and the dot multiplication between the implicit weight matrix and the image joint feature, the image transformation feature and the image joint feature are combined into the image representation information corresponding to the video frame i, and the image representation information corresponding to each video frame in the video frame sequence is combined into the video representation information corresponding to the video data.
Specifically, the image joint feature can be normalized to obtain a normalized image joint feature, and the normalized image joint feature is then linearly transformed according to the implicit weight matrix and the bias vector corresponding to the multi-layer perceptron in the visual encoder, finally outputting the image transformation feature; alternatively, the image joint feature can be directly linearly transformed according to the implicit weight matrix and the bias vector corresponding to the multi-layer perceptron in the visual encoder, finally outputting the image transformation feature, which is not limited in this application. Further, the image joint feature and the image transformation feature may be combined into the image representation information corresponding to the video frame i.
It can be understood that, through the foregoing operation, the image representation information corresponding to each video frame in the video data can be obtained, and the image representation information corresponding to each video frame in the video frame sequence is combined into the video representation information corresponding to the video data.
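For illustration only, one way to compose the attention, residual additions, normalization and multi-layer perceptron of steps S202-S203 into a single visual encoder block is sketched below; the use of nn.MultiheadAttention and the hidden sizes are assumptions.

```python
import torch
import torch.nn as nn

class VisualEncoderBlock(nn.Module):
    def __init__(self, hidden: int = 1024, mlp_hidden: int = 4096):
        super().__init__()
        self.norm1 = nn.LayerNorm(hidden)
        self.attn = nn.MultiheadAttention(hidden, num_heads=8, batch_first=True)
        self.norm2 = nn.LayerNorm(hidden)
        self.mlp = nn.Sequential(nn.Linear(hidden, mlp_hidden), nn.GELU(),
                                 nn.Linear(mlp_hidden, hidden))

    def forward(self, image_input_feature: torch.Tensor) -> torch.Tensor:
        h = self.norm1(image_input_feature)
        attn_out, _ = self.attn(h, h, h)                  # attention encoding feature
        joint = image_input_feature + attn_out            # image joint feature
        transform = self.mlp(self.norm2(joint))           # image transformation feature
        return joint + transform                          # image representation information

block = VisualEncoderBlock()
frame_tokens = torch.randn(1, 196, 1024)                  # image input feature of one frame
image_representation = block(frame_tokens)                # (1, 196, 1024)
```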
Step S204, unit word vectors, text vectors and position vectors corresponding to the video text data are obtained, and the unit word vectors, the text vectors and the position vectors are overlapped to obtain text input features corresponding to the video text data.
In the embodiment of the present application, a network structure adopted by a text encoder in a target generation model is exemplified by a BERT structure, and processing procedures of the text encoder on input video text data at this time may include, but are not limited to, word segmentation, word embedding, iteration of a self-attention component, and other procedures, and finally text representation information corresponding to the video text data is output.
The processing of the video text data by the text encoder may include, but is not limited to: the video text data can be divided into D unit characters, and the unit word vectors corresponding to the D unit characters respectively are obtained, where D is a positive integer, for example, D can take the value 1, 2, … …; according to the semantic information of the D unit characters in the video text data, the text vectors corresponding to the D unit characters are obtained; according to the text positions of the D unit characters in the video text data, the position vectors corresponding to the D unit characters are obtained; the unit word vectors, the text vectors and the position vectors are superposed to obtain the text input feature corresponding to the video text data; and the text input feature is input to the text encoder in the target generation model and encoded by the text encoder to obtain the text representation information corresponding to the video text data.
The computer device may convert each unit character in the video text data into a word vector (i.e., a unit word vector) by querying a word vector table, where the word vector table may contain the word vectors corresponding to all common characters; the word vector table may be understood as a "dictionary" containing the vectors of all common characters, and a unit character may refer to a single character or a single word in the video text data. The text vector can be used to describe the global semantic information of the video text data and can be fused with the unit word vectors. Because the semantic information carried by unit characters appearing at different positions of the video text data differs (for example, the semantic information carried by "I" and "you" differs), different position vectors can be added for the D unit characters in the video text data respectively in order to distinguish them; the sum of the unit word vectors, the text vectors and the position vectors can then be used as the text input feature corresponding to the video text data.
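For illustration only, the superposition of unit word vectors, text vectors and position vectors can be sketched as a BERT-style embedding sum; the vocabulary size, the character ids and the hidden size are assumptions.

```python
import torch
import torch.nn as nn

vocab_size, max_len, hidden = 21128, 512, 1024              # assumed sizes
word_embed = nn.Embedding(vocab_size, hidden)               # unit word vectors
text_embed = nn.Embedding(2, hidden)                        # text (segment) vectors
pos_embed  = nn.Embedding(max_len, hidden)                  # position vectors

unit_char_ids = torch.tensor([[101, 2769, 791, 1921, 102]]) # D = 5 assumed character ids
positions = torch.arange(unit_char_ids.size(1)).unsqueeze(0)
segments = torch.zeros_like(unit_char_ids)

# Superposition of the three vectors gives the text input feature.
text_input_feature = (word_embed(unit_char_ids)
                      + text_embed(segments)
                      + pos_embed(positions))               # (1, 5, 1024)
```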
Step S205, inputting the text input features to a text encoder in the target generation model, and encoding the text input features through the text encoder to obtain text representation information corresponding to the video text distribution data.
Specifically, when the text encoder adopts the BERT structure, the main structure of the text encoder is a Transformer; that is, the text encoder may also include an attention encoding component (which may be a multi-head self-attention component), a first normalization network layer, a multi-layer perceptron (which may also be referred to as a feed-forward network layer), a second normalization network layer, and so on. Thus, after the text input feature is input into the text encoder, the text input feature may be processed using the attention encoding component in the text encoder; alternatively, the text input feature may be position-coded to obtain a position coding feature corresponding to the text input feature (for convenience of understanding, this position coding feature may be referred to as a second position coding feature), the text input feature and the second position coding feature are added to obtain text initial characterization information, and the attention encoding component in the text encoder is then used to process the text initial characterization information. The processing of the text input feature (or the text initial characterization information) by the attention encoding component in the text encoder may refer to the related description in the foregoing step S202, and will not be repeated here.
The feature obtained after the text input feature corresponding to the video text matching data passes through the attention encoding component in the text encoder can be called a text attention encoding feature, further, the text attention encoding feature can be normalized according to a first normalization network layer to obtain a first normalized text feature, and the first normalized text feature and the text input feature (or text initial characterization information) are combined into a text joint feature. According to a feedforward network layer (or a multi-layer perceptron) in the text encoder, carrying out feature transformation on the text joint features to obtain text transformation features, further carrying out normalization processing on the text transformation features according to a second normalization network layer to obtain second normalization text features, and combining the second normalization text features and the text joint features into text representation information corresponding to video text data.
Step S206, performing position coding processing on the video representation information to obtain position coding information corresponding to the video representation information, and combining the video representation information and the position coding information into video description information.
Specifically, after the video representation information corresponding to the video data is obtained through the visual encoder in the target generation model, the time sequence sampler in the target generation model may be used to obtain the timing information between the video frames contained in the video data. In order to maintain the timing information of the video representation information, the video representation information may be subjected to position coding processing to obtain the position coding information corresponding to the video representation information (for ease of understanding, the position coding information may also be referred to as a third position coding feature), and the video representation information and the position coding information may be added to obtain the video description information. The position coding manner in the embodiment of the application may include, but is not limited to, sine and cosine position encoding (2D sine position embedding), learnable position encoding (learnable position embedding), and the like; for ease of understanding, the position encoding process of the video representation information is described below by taking sine and cosine position encoding as an example.
Assume the video representation information includes L pieces of image representation information, where L denotes the number of pieces of image representation information contained in the video representation information and L can take the value 1, 2, … …. The sine and cosine position encoding process of the video representation information may include: acquiring the index positions of the L pieces of image representation information in the video data, and dividing these index positions into even index positions and odd index positions; performing sinusoidal position coding on the even index positions in the video representation information to obtain the sine coding information corresponding to the even index positions; performing cosine position coding on the odd index positions in the video representation information to obtain the cosine coding information corresponding to the odd index positions; and determining the sine coding information and the cosine coding information as the position coding information corresponding to the video representation information. The sine and cosine position encoding process can be expressed as the following formulas (1) and (2):

$PE(pos, 2t) = \sin\!\left(pos / 10000^{2t/d_2}\right)$  (1)

$PE(pos, 2t+1) = \cos\!\left(pos / 10000^{2t/d_2}\right)$  (2)

In the above formula (1) and formula (2), pos represents the position of the current image feature in the video representation information, 2t represents an even index position, and 2t+1 represents an odd index position; that is, each index position of the position code can be seen as a sinusoid whose wavelength forms a geometric progression from 2π to 10000×2π. $d_2$ represents the dimension of a single feature in the video representation information, which may be regarded as the height of the video representation information, or understood as the number of columns of the video representation information. Formula (1) gives the sine coding information of the image features at the even index positions, and formula (2) gives the cosine coding information of the image features at the odd index positions.
The relationship among the video representation information, the position coding information, and the video description information can be expressed as the following formula (3):

$z_0 = x_{in} + E_{pos}$  (3)

In formula (3), $z_0$ represents the video description information corresponding to the video data, i.e., the video information input to the time sequence sampler; $x_{in}$ represents the video representation information corresponding to the video data; and $E_{pos}$ represents the position coding information corresponding to the video representation information.
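For illustration only, formulas (1)-(3) can be sketched as follows; the base 10000 follows the standard sinusoidal encoding, and the sequence length and feature dimension match the example given below.

```python
import torch

def sinusoidal_position_encoding(seq_len: int, d2: int) -> torch.Tensor:
    pos = torch.arange(seq_len, dtype=torch.float32).unsqueeze(1)    # (L, 1)
    t = torch.arange(0, d2, 2, dtype=torch.float32)                  # even dimensions 2t
    angle = pos / torch.pow(torch.tensor(10000.0), t / d2)
    e_pos = torch.zeros(seq_len, d2)
    e_pos[:, 0::2] = torch.sin(angle)    # formula (1): sine coding at even index positions
    e_pos[:, 1::2] = torch.cos(angle)    # formula (2): cosine coding at odd index positions
    return e_pos

x_in = torch.randn(1960, 1024)                            # video representation information
z0 = x_in + sinusoidal_position_encoding(1960, 1024)      # formula (3): video description information
```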
Step S207, acquiring an initial time sequence feature with the same dimension as the text representation information, and inputting the initial time sequence feature and the video description information to a time sequence sampler in the target generation model.
Specifically, an initial time sequence feature with the same dimension as the text representation information may be obtained in the time sequence sampler, where the initial time sequence feature may be denoted as $F^{(1)}$. The initial time sequence feature $F^{(1)}$ is a learnable feature sequence, and the result of iteratively updating the initial time sequence feature in the time sequence sampler can finally be used as the video time sequence sampling information corresponding to the video data. The initial time sequence feature and the video description information may be input to the time sequence sampler in the target generation model.
Step S208, the initial time sequence characteristics are iteratively updated through the time sequence sampler and the video description information, and video time sequence sampling information corresponding to the video data is obtained.
Specifically, the time sequence sampler in the target generation model may include N mutual attention components and N self attention components, where N is a positive integer; for example, N may take a value of 6 or 12, which is not limited in this application. Furthermore, the initial time sequence feature can be iteratively updated based on the video description information containing the video representation information and the position coding information, together with the N mutual attention components and N self attention components in the time sequence sampler, finally outputting the video time sequence sampling information corresponding to the video data.
The processing of the video description information and the initial time sequence feature in the time sequence sampler may include: the computer device may obtain the input features of the j-th mutual attention component in the time sequence sampler; when j is 1, the input features of the j-th mutual attention component (i.e., the first mutual attention component in the time sequence sampler) include the video description information and the initial time sequence feature; when j is not 1, the input features of the j-th mutual attention component (any one of the second through N-th mutual attention components in the time sequence sampler) include the video description information and the output feature of the (j-1)-th self attention component; j is a positive integer less than or equal to N. Further, a first weight matrix, a second weight matrix and a third weight matrix corresponding to the j-th mutual attention component can be obtained; a dot multiplication operation is performed on the first weight matrix and the output feature of the (j-1)-th self attention component to obtain a second query matrix; a dot multiplication operation is performed on the second weight matrix and the video description information to obtain a second key matrix, and a dot multiplication operation is performed on the third weight matrix and the video description information to obtain a second value matrix. From the second query matrix, the second key matrix, and the second value matrix, the output feature of the j-th mutual attention component may be determined; the representation of the output feature of the j-th mutual attention component may refer to the representation of the output feature of the attention encoding component in step S202. The output feature of the j-th mutual attention component is input into the j-th self attention component in the time sequence sampler, and self attention coding processing is performed on the output feature of the j-th mutual attention component by the j-th self attention component to obtain the output feature of the j-th self attention component; and the output feature of the N-th self attention component in the time sequence sampler is determined as the video time sequence sampling information corresponding to the video data.
Referring to fig. 6, fig. 6 is a schematic structural diagram of a timing sampler in a target generation model according to an embodiment of the present application. As shown in fig. 6, the time sequence sampler of the target generation model may be alternately connected by N mutual attention components and N self attention components, for example, when N is 6, the 6 mutual attention components and the 6 self attention components may be alternately connected. The N mutual attention components may be denoted as mutual attention component 1, mutual attention component 2, … …, mutual attention component N; the N self-attention components may be denoted as self-attention component 1, self-attention component 2, … …, self-attention component N. In this embodiment, the calculation process of the time sequence sampler is described by taking the spatial resolution of the video data as 224×224 and the fixed size of the image block as 16×16 as an example.
Assuming that the video data acquired by the computer device is composed of 10 video frames, the sequence length of the video representation information obtained after the video data passes through the visual encoder in the target generation model is (224/16)×(224/16)×10 = 1960, which can also be understood as the video representation information $x_{in}$ being a feature representation of width 1960. In order to maintain the timing information of the video representation information, the video representation information can be subjected to position coding processing to obtain the position coding information $E_{pos}$ of the video representation information; further, based on the aforementioned formula (3), the video representation information $x_{in}$ and the position coding information $E_{pos}$ can be added to obtain the video description information 40a (which may be denoted as $z_0$), and the video description information 40a may be used as input data for the N mutual attention components in the time sequence sampler. It will be appreciated that the video representation information $x_{in}$, the position coding information $E_{pos}$ and the video description information 40a have the same dimension; for example, this dimension may be expressed as 1960×1024, with a width of 1960 and a height of 1024.
The mutual attention component in the time sequence sampler may iteratively update a learnable sequence F, where the sequence length of the learnable sequence F may be set to 35, and the dimension of the learnable sequence F may be represented as 35×1024, and the height of the learnable sequence F is the same as the height of the video description information 40 a. It should be noted that, before the learnable sequence F is input to the mutual attention component of the time sequence sampler, the learnable sequence F may be initialized, and in this embodiment of the present application, the initialized learnable sequence F may be referred to as an initial time sequence feature 40b; the dimension of the video time sequence sampling information finally output by the time sequence sampler is consistent with the dimension of the learnable sequence F, and the video time sequence sampling information finally output by the time sequence sampler can be regarded as the result of the learnable sequence F (namely the initial time sequence characteristic) after initialization after passing through N mutual attention components and N self attention components. The initialization method herein may include, but is not limited to, constant initialization, gaussian distribution initialization (which may refer to random initialization of parameters in the learnable sequence F if they conform to gaussian distribution), uniform distribution initialization, bilinear initialization, and the like.
As shown in fig. 6, after the video description information 40a is input to the time sequence sampler, the calculation process of the N mutual attention components in the time sequence sampler may be expressed as the following formula (4):

$\hat{F}^{(j)} = \mathrm{softmax}\!\left(\dfrac{Q_{cj} K_{cj}^{\top}}{\sqrt{d_c}}\right) V_{cj}$, with $Q_{cj} = F^{(j)} W_{cq}^{(j)}$, $K_{cj} = z_0 W_{ck}^{(j)}$, $V_{cj} = z_0 W_{cv}^{(j)}$  (4)

In formula (4), $W_{cq}^{(j)}$, $W_{ck}^{(j)}$ and $W_{cv}^{(j)}$ denote, in order, the first weight matrix, the second weight matrix and the third weight matrix corresponding to the j-th mutual attention component in the time sequence sampler; these three weight matrices are obtained by training during the training process of the target generation model. $Q_{cj}$ denotes the second query matrix in the j-th mutual attention component; for example, the second query matrix in the first mutual attention component may be denoted as $Q_{c1}$, the second query matrix in the second mutual attention component as $Q_{c2}$, and so on. $K_{cj}$ denotes the second key matrix in the j-th mutual attention component; for example, the second key matrix in the first mutual attention component may be denoted as $K_{c1}$, the second key matrix in the second mutual attention component as $K_{c2}$, and so on. $V_{cj}$ denotes the second value matrix in the j-th mutual attention component; for example, the second value matrix in the first mutual attention component may be denoted as $V_{c1}$, the second value matrix in the second mutual attention component as $V_{c2}$, and so on. $d_c$ denotes the number of columns of the second query matrix; here $d_c$ may take the value 1024. $F^{(j)}$ denotes the learnable matrix input into the j-th mutual attention component of the time sequence sampler; for example, $F^{(1)}$ denotes the learnable matrix input into the first mutual attention component of the time sequence sampler, i.e., the initial time sequence feature 40b of dimension 35×1024. $\hat{F}^{(j)}$ denotes the output feature of the j-th mutual attention component in the time sequence sampler, which can also be regarded as the result of the learnable matrix F after iteration through j mutual attention components and j-1 self attention components; for example, $\hat{F}^{(1)}$ denotes the output feature of the learnable matrix F after passing through the first mutual attention component in the time sequence sampler, namely the feature 40c of dimension 35×1024 shown in fig. 6.
Since the mutual attention components and the self attention components in the time sequence sampler are alternately connected, the output feature $\hat{F}^{(j)}$ of the j-th mutual attention component can be input to the j-th self attention component for calculation; for example, the output feature of the first mutual attention component (the feature 40c) may be input to the first self attention component for calculation. The calculation process of the N self attention components in the time sequence sampler can be expressed as the following formula (5):

$F^{(j+1)} = \mathrm{softmax}\!\left(\dfrac{Q_{sj} K_{sj}^{\top}}{\sqrt{d_s}}\right) V_{sj}$, with $Q_{sj} = \hat{F}^{(j)} W_{sq}^{(j)}$, $K_{sj} = \hat{F}^{(j)} W_{sk}^{(j)}$, $V_{sj} = \hat{F}^{(j)} W_{sv}^{(j)}$  (5)

In formula (5), $W_{sq}^{(j)}$, $W_{sk}^{(j)}$ and $W_{sv}^{(j)}$ denote, in order, the three weight matrices corresponding to the j-th self attention component in the time sequence sampler; these three weight matrices are also obtained by training during the training process of the target generation model. $Q_{sj}$ denotes the query matrix in the j-th self attention component (which may be referred to as a fourth query matrix for ease of understanding); for example, the query matrix in the first self attention component may be denoted as $Q_{s1}$, the query matrix in the second self attention component as $Q_{s2}$, and so on. $K_{sj}$ denotes the key matrix in the j-th self attention component (which may be referred to as a fourth key matrix); for example, the key matrix in the first self attention component may be denoted as $K_{s1}$, the key matrix in the second self attention component as $K_{s2}$, and so on. $V_{sj}$ denotes the value matrix in the j-th self attention component (which may be referred to as a fourth value matrix); for example, the value matrix in the first self attention component may be denoted as $V_{s1}$, the value matrix in the second self attention component as $V_{s2}$, and so on. $\hat{F}^{(j)}$ denotes the learnable matrix input to the j-th self attention component, i.e., the output feature of the j-th mutual attention component. $F^{(j+1)}$ denotes the output feature of the j-th self attention component in the time sequence sampler, which can also be regarded as the result of the learnable matrix F after iteration through j mutual attention components and j self attention components; its dimension is still 35×1024.
Further, by repeating the above formula (4) and formula (5) N times, the time sequence sampler finally obtains the output after time sequence sampling, that is, the video time sequence sampling information 40d, whose dimension is 35×1024. It can be seen that the dimension (35×1024) of the video time sequence sampling information obtained by the time sequence sampler is much smaller than the dimension (1960×1024) of the video representation information, so that the subsequent computational complexity can be reduced. It can be understood that the dimensions of the video description information, the video representation information, and the video time sequence sampling information shown in fig. 6 are merely illustrative, and may be determined according to the actual requirements of the application scenario, which is not limited in the embodiment of the present application.
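For illustration only, the alternating iteration of formulas (4) and (5) can be sketched as a single-head attention loop with the example dimensions above ($z_0$: 1960×1024, learnable sequence F: 35×1024, N = 6); the random weight matrices stand in for the trained parameters.

```python
import torch

def attention(q, k, v):
    scores = q @ k.transpose(-2, -1) / (q.shape[-1] ** 0.5)
    return torch.softmax(scores, dim=-1) @ v

N, d = 6, 1024
z0 = torch.randn(1960, d)                        # video description information 40a
f = torch.randn(35, d)                           # initial time sequence feature F(1), 40b
cross_w = [[torch.randn(d, d) for _ in range(3)] for _ in range(N)]
self_w  = [[torch.randn(d, d) for _ in range(3)] for _ in range(N)]

for j in range(N):
    wq, wk, wv = cross_w[j]
    f_hat = attention(f @ wq, z0 @ wk, z0 @ wv)            # formula (4): mutual attention
    wq, wk, wv = self_w[j]
    f = attention(f_hat @ wq, f_hat @ wk, f_hat @ wv)      # formula (5): self attention

video_timing_sampling_information = f                      # (35, 1024), feature 40d
```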
Step S209, encoding the multi-mode combination features to obtain multi-mode fusion encoding features, and performing text decoding processing on the multi-mode fusion encoding features to obtain video content description text associated with the video data.
The specific implementation process of step S209 may refer to the description of step S104 in the embodiment corresponding to fig. 3, and will not be described herein.
In the embodiment of the application, video representing information can be obtained after video data passes through a visual encoder in a target generation model, and text representing information can be obtained after video text matching data corresponding to the video data passes through a text encoder; further, the video representation information can be subjected to time sequence sampling processing through a time sequence sampler introduced into the target generation model so as to obtain video time sequence sampling information, and the text representation information and the video time sequence sampling information are combined into a multi-mode combination characteristic; the multi-modal combination feature is passed through a multi-modal encoder and then predicted using a text decoder to produce the final video content description text. The time sequence information among all video frames in the video data is acquired through the time sequence sampler, so that the representation capability of video features can be improved, and further, the description accuracy of video content can be improved; the video content description text based on the video data can promote the transmissibility of the video data.
It can be understood that the target generation model is a text generation model after training, that is, the text generation model after training the model can be formally applied to the video content understanding scene; for ease of understanding, the text generation model in the training phase may be referred to as an initial generation model. The training process of the initial generation model will be described with reference to fig. 7 to 10.
Referring to fig. 7, fig. 7 is a flowchart illustrating a data processing method according to an embodiment of the present application; it will be appreciated that the data processing method may be performed by a computer device, which may be a server, or may be a terminal device, which is not limited in this application. As shown in fig. 7, the data processing method may include the following steps S301 to S305:
step S301, a sample video, sample text data and sample content description text corresponding to the sample video are obtained.
Specifically, during the training process of the initial generation model, the computer device may obtain a sample data set for training the initial generation model, where the sample data set may include a large amount of sample data (for example, one million sample data items may be collected), and one sample data item may be composed of one sample video, one piece of sample text data, and one sample content description text (such as comment data, caption data, etc.); the sample video and the sample text data in the sample data can be used as input data of the initial generation model, and the sample content description text can be used as label information of the sample data to adjust the network parameters of the initial generation model.
Step S302, outputting sample video representing information corresponding to the sample video through a visual encoder in the initial generation model, and outputting sample text representing information corresponding to the sample text data through a text encoder in the initial generation model; the visual encoder and the text encoder are trained based on a plurality of sets of teletext data, one set of teletext data comprising one sample image and one image-text data.
Specifically, the computer device may train the initial generation model using the obtained sample data set. Before describing the training process of the initial generation model, the network framework of the initial generation model is briefly introduced. Referring to fig. 8, fig. 8 is a schematic diagram of a network framework of an initial generation model according to an embodiment of the present application. As shown in fig. 8, the network framework of the initial generation model may include a visual encoder, a text encoder, an initial time sequence sampler, an initial multi-modal encoder, and an initial text decoder; the visual encoder and the text encoder may constitute an image-text feature extraction sub-model, and the initial multi-modal encoder and the initial text decoder may be referred to as an encoder-decoder generation sub-model. Feature extraction is performed on the sample video and the sample text data input into the initial generation model by the visual encoder and the text encoder respectively, where the visual encoder and the text encoder are initialized from image-text pre-training, that is, they are migrated from a trained image-text model. For example, the visual encoder and the text encoder are obtained by training in advance on a plurality of image-text data sets, where one image-text data set comprises one sample image and one piece of image text data; for instance, a large amount of image text data can be used and, following the training scheme of BLIP, the model can be optimized with three self-supervised loss functions covering contrastive learning, image-text matching and generation, so that the trained visual encoder and text encoder have good image and text characterization capability. In other words, since the visual encoder and the text encoder are pre-trained models, no further adjustment of the network parameters in the visual encoder and the text encoder is required during the training of the initial generation model.
Since a sample video typically contains multiple video frames, the video feature sequence (e.g., the sample video representation information) extracted by the visual encoder has a very high dimensionality, and the initial time sequence sampler may be used to reduce the dimensionality of the video feature sequence and to mine the timing information in the video. After the video feature sequence (e.g., the sample time sequence sampling information) and the text feature sequence (e.g., the sample text representation information) are obtained, they may be input to the encoder-decoder generation sub-model (including the initial multi-modal encoder and the initial text decoder) for text prediction. The encoder-decoder generation sub-model is a common generation structure widely used in language tasks such as text generation and abstract generation; the embodiment of the present application applies the encoder-decoder generation sub-model to the multi-modal video-text setting. The multi-modality referred to in the embodiments of the present application refers to different multimedia data types, such as video and text, image and text, etc.
After the computer device acquires the sample video and the sample text matching data and sample content description text corresponding to the sample video, the sample video can be input to the pre-trained visual encoder in the initial generation model, and image feature extraction is carried out on each video frame in the sample video through the visual encoder to obtain image representation information corresponding to each video frame in the sample video, so that the image representation information corresponding to each video frame in the sample video can be combined into the sample video representation information; the process of acquiring the sample video representation information may refer to the description of the video representation information acquisition process in step S202 and step S203 shown in fig. 5, and will not be repeated here. The sample text matching data corresponding to the sample video can be input to the pre-trained text encoder in the initial generation model, and text feature extraction can be carried out on the sample text matching data through the text encoder to obtain the sample text representation information corresponding to the sample text matching data; the process of acquiring the sample text representation information may refer to the description of the text representation information acquisition process in step S204 and step S205 shown in fig. 5, and will not be repeated here.
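For illustration only (not part of the claimed embodiment), the following PyTorch-style sketch shows how pre-trained, frozen visual and text encoders might be wrapped to produce the sample video representation information and the sample text representation information; the encoder modules, tensor shapes and class names here are assumptions.

```python
# Illustrative sketch: frozen image-text pre-trained encoders extracting
# sample video / text representation information. Encoder modules and shapes
# are assumed placeholders, not the actual networks of this embodiment.
import torch
import torch.nn as nn

class FrozenImageTextEncoders(nn.Module):
    def __init__(self, visual_encoder: nn.Module, text_encoder: nn.Module):
        super().__init__()
        self.visual_encoder = visual_encoder
        self.text_encoder = text_encoder
        # Pre-trained on image-text pairs; parameters are not updated during
        # training of the initial generation model.
        for p in self.parameters():
            p.requires_grad_(False)

    @torch.no_grad()
    def forward(self, frames: torch.Tensor, token_ids: torch.Tensor):
        # frames: [B, T, C, H, W]; token_ids: [B, L]
        b, t = frames.shape[:2]
        frame_feats = self.visual_encoder(frames.flatten(0, 1))        # [B*T, P, D] per-frame features
        video_repr = frame_feats.view(b, t * frame_feats.size(1), -1)  # sample video representation
        text_repr = self.text_encoder(token_ids)                       # sample text representation
        return video_repr, text_repr
```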
Step S303, an attention mask matrix in an initial time sequence sampler contained in the initial generation model is obtained, time sequence sampling processing is carried out on the sample video representation information according to the attention mask matrix and the mutual attention component and self attention component in the initial time sequence sampler to obtain sample time sequence sampling information, and the sample time sequence sampling information and the sample text representation information are combined into a sample multi-modal feature.
Specifically, the initial timing sampler in the initial generation model may include N mutual attention components and N self attention components, where N is a positive integer, and the N mutual attention components and the N self attention components in the initial timing sampler are alternately connected. After the sample video representation information is input to the initial timing sampler, the feature processing procedure in the initial timing sampler may include: acquiring the attention mask matrix in the initial timing sampler contained in the initial generation model, acquiring an initial sample timing feature with the same dimension as the sample text representation information, and performing dot multiplication operation on the initial sample timing feature and a first transformation matrix corresponding to the mutual attention component in the initial timing sampler to obtain a third query matrix; performing dot multiplication operation on the sample video representation information and a second transformation matrix corresponding to the mutual attention component in the initial timing sampler to obtain a third key matrix, and performing dot multiplication operation on the sample video representation information and a third transformation matrix corresponding to the mutual attention component in the initial timing sampler to obtain a third value matrix; determining the mutual attention feature associated with the sample video representation information according to the third query matrix, the third key matrix and the third value matrix; performing activation processing on the attention mask matrix to obtain an attention activation feature, and determining the product of the attention activation feature and the mutual attention feature as the output feature of the mutual attention component in the initial timing sampler; and inputting the output feature of the mutual attention component into the self attention component in the initial timing sampler, and performing self attention coding processing on the output feature of the mutual attention component through the self attention component to obtain the sample timing sampling information corresponding to the sample video representation information.
It will be appreciated that, in the training phase of the initial generation model, an attention sparsity constraint is additionally added to assist the training of the initial timing sampler; this constraint may be added in the mutual attention component of the initial timing sampler, so a learnable attention mask (attention mask) matrix may be introduced in the mutual attention component of the initial timing sampler. Referring to fig. 9, fig. 9 is a schematic diagram of an attention mask matrix according to an embodiment of the present application. As shown in fig. 9, assuming that the sample video is composed of 10 video frames, the sequence length (width) of the sample video representation information obtained after the 10 video frames pass through the visual encoder may be denoted as M1, and the sequence length of the sample timing sampling information finally output by the initial timing sampler, which may also be regarded as the sequence length (width) of the foregoing initial sample timing feature, may be denoted as M2; the dimension of the attention mask matrix introduced in the mutual attention component of the initial timing sampler may then be denoted as M1×M2. It will be appreciated that the attention mask matrix is a sparse matrix, i.e. the attention mask matrix contains only a small number of non-zero elements, and most of its elements are zero.
After the attention mask matrix is introduced into the mutual attention components of the initial time sequence sampler, the calculation process of the N mutual attention components in the initial time sequence sampler in the training stage can be shown in the following formula (6):
$$
\tilde{Z}_{j}=\mathrm{ReLU}\!\left(M_{A}\right)\odot \mathrm{softmax}\!\left(\frac{\tilde{Q}_{j}\tilde{K}_{j}^{\top}}{\sqrt{d_{k}}}\right)\tilde{V}_{j},\qquad \tilde{Q}_{j}=\tilde{Z}_{j-1}\tilde{W}_{Q}^{j},\; \tilde{K}_{j}=\tilde{F}\tilde{W}_{K}^{j},\; \tilde{V}_{j}=\tilde{F}\tilde{W}_{V}^{j} \qquad (6)
$$

wherein $M_{A}$ in formula (6) represents the attention mask matrix introduced in the mutual attention component of the initial timing sampler; with the video resolution and the fixed size of the image blocks in the embodiment corresponding to fig. 6 described above, M1 may be 35 and M2 may be 1960. $\mathrm{ReLU}(\cdot)$ represents an activation function, and other activation functions such as tanh may be used in an actual application scenario, which is not limited in the embodiments of the present application. The term $\mathrm{softmax}\!\left(\tilde{Q}_{j}\tilde{K}_{j}^{\top}/\sqrt{d_{k}}\right)$ represents the attention score of the jth mutual attention component in the initial timing sampler. $\tilde{F}$ represents the sample video description information obtained by adding the sample video representation information input to the initial timing sampler and its position coding information. $\tilde{W}_{Q}^{j}$, $\tilde{W}_{K}^{j}$ and $\tilde{W}_{V}^{j}$ represent, in sequence, the first transformation matrix, the second transformation matrix and the third transformation matrix (e.g., the three weight matrices in the embodiment corresponding to fig. 6) corresponding to the jth mutual attention component in the initial timing sampler, and these three transformation matrices may be continuously learned and updated in the training phase of the initial generation model. $\tilde{Q}_{j}$ represents the third query matrix, $\tilde{K}_{j}$ the third key matrix and $\tilde{V}_{j}$ the third value matrix in the jth mutual attention component of the initial timing sampler. $d_{k}$ represents the number of columns of the third query matrix, which is the same as the number of columns of the second query matrix. $\tilde{Z}_{j-1}$ represents the learnable matrix input into the jth mutual attention component of the initial timing sampler; when j is 1, it is the initial sample timing feature. $\tilde{Z}_{j}$ represents the output feature of the jth mutual attention component in the initial timing sampler, which may also be regarded as the result of the initial sample timing feature after iteration through j mutual attention components and j-1 self attention components; for example, $\tilde{Z}_{1}$ represents the output feature of the initial sample timing feature after passing through the first mutual attention component in the initial timing sampler.
It should be noted that, compared to the calculation of the mutual attention component in the application phase (as in the foregoing formula (4)), the calculation of the mutual attention component in the training phase (as in formula (6)) is identical except that the attention mask matrix is introduced (the different symbols in formula (4) and formula (6) are only used to distinguish the weight parameters of the time sequence sampler in the training phase and the application phase). By adding the attention sparsity constraint to the mutual attention component of the time sequence sampler during the training phase, redundant information between the individual video frames contained in the sample video can be reduced, so that more effective video content can be better captured. In addition, the calculation of the self attention components in the initial timing sampler is the same as the calculation process shown in the foregoing formula (5), and the calculation processes of the self attention components in the timing sampler in the training phase and the application phase are the same; the calculation processes of the N self attention components in the initial timing sampler may refer to the calculation process shown in formula (5) in the embodiment corresponding to fig. 6, which is not repeated here.
Through the calculation of the mutual attention component shown in the formula (6) and the calculation of the self attention component shown in the formula (5), the initial time sequence sampler can finally output sample time sequence sampling information corresponding to the sample video, and further can combine the sample time sequence sampling information and the sample text representation information into the sample multi-modal characteristics.
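For illustration only, the following sketch implements one reading of the training-phase mutual attention component of formula (6), in which the ReLU-activated learnable attention mask gates the attention scores before they are applied to the value matrix; the class name, single-head simplification, and tensor shapes are assumptions, not the claimed implementation.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class MaskedMutualAttention(nn.Module):
    """Single-head sketch of formula (6): masked cross-attention in the sampler."""
    def __init__(self, dim: int, m1: int, m2: int):
        super().__init__()
        self.w_q = nn.Linear(dim, dim, bias=False)  # first transformation matrix
        self.w_k = nn.Linear(dim, dim, bias=False)  # second transformation matrix
        self.w_v = nn.Linear(dim, dim, bias=False)  # third transformation matrix
        # Learnable attention mask M_A of size M1 x M2, trained under an L1 sparsity constraint.
        self.attn_mask = nn.Parameter(torch.zeros(m1, m2))

    def forward(self, z_prev: torch.Tensor, video_desc: torch.Tensor) -> torch.Tensor:
        # z_prev: [B, M1, D] timing features; video_desc: [B, M2, D] video repr + position codes
        q = self.w_q(z_prev)       # third query matrix
        k = self.w_k(video_desc)   # third key matrix
        v = self.w_v(video_desc)   # third value matrix
        scores = torch.softmax(q @ k.transpose(-2, -1) / q.size(-1) ** 0.5, dim=-1)  # [B, M1, M2]
        gated = F.relu(self.attn_mask) * scores   # element-wise gating by ReLU(M_A)
        return gated @ v                          # output features of the mutual attention component
```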
Step S304, the sample multi-mode features are encoded through an initial multi-mode encoder in the initial generation model to obtain sample fusion coding features, and a text prediction probability matrix corresponding to the sample fusion coding features is output through an initial text decoder in the initial generation model; the text prediction probability matrix is used for representing the prediction probability corresponding to each unit character in the sample content description text.
Specifically, after the sample multi-modal characteristics corresponding to the sample video and the sample text data are obtained, the sample multi-modal characteristics can be input into an encoder-decoder generation sub-model in an initial generation model, and the sample multi-modal characteristics are subjected to characteristic fusion coding processing through an initial multi-modal encoder in the encoder-decoder generation sub-model to obtain sample fusion coding characteristics. The sample fusion coding feature can then be input to an initial text decoder in an encoder-decoder generation sub-model, through which a text prediction probability matrix corresponding to the sample fusion coding feature is output, the text prediction probability matrix being used to characterize the prediction probabilities corresponding to each unit character in the sample content description text. That is, the initial text decoder may perform text decoding processing on the sample fusion coding feature, and the last network layer (e.g., softmax layer) of the initial text decoder outputs a text prediction probability matrix, and by using the prediction probabilities corresponding to the respective character positions included in the text prediction probability matrix, the characters corresponding to the maximum prediction probability at the respective character positions are combined into the sample prediction text corresponding to the sample video, so that a forward calculation process in the initial generation model may be completed. The process of obtaining the sample prediction text may refer to the description of the process of obtaining the video content description text in step S104 in the embodiment corresponding to fig. 3, which is not described herein.
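As a hedged sketch of the forward computation just described (the module objects, vocabulary and id-to-character mapping are assumed for illustration), the greedy read-out of the sample prediction text from the text prediction probability matrix might look as follows.

```python
import torch
import torch.nn as nn

def predict_sample_text(multimodal_feat: torch.Tensor,
                        mm_encoder: nn.Module,
                        text_decoder: nn.Module,
                        id_to_char: list) -> list:
    fused = mm_encoder(multimodal_feat)    # sample fusion coding feature
    prob_matrix = text_decoder(fused)      # [B, U, vocab] text prediction probability matrix
    char_ids = prob_matrix.argmax(dim=-1)  # character with the maximum prediction probability per position
    return ["".join(id_to_char[i] for i in row.tolist()) for row in char_ids]
```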
Referring to fig. 10, fig. 10 is a schematic structural diagram of an initial multi-mode encoder and an initial text decoder according to an embodiment of the present application. As shown in fig. 10, the initial multi-mode encoder in the initial generation model may be a self-encoding structure with bi-directional feature representation and may include a plurality of Transformer coding blocks, for example, 6 or 12 Transformer coding blocks; its structure may be similar to the BERT model structure, and the specific network structure of the initial multi-mode encoder is not limited in the embodiments of the present application. The initial text decoder in the initial generation model may be an autoregressive structure with uni-directional feature representation and may also include a plurality of Transformer coding blocks, for example, 6 or 12 Transformer coding blocks; its structure may be similar to the GPT-2 (Generative Pre-Training 2.0) structure, and the specific network structure of the initial text decoder is not limited in the embodiments of the present application.
It will be appreciated that one of the main factors determining whether a model structure uses a bi-directional feature representation or a uni-directional feature representation is the mask matrix (attention mask matrix) in the model structure, which determines, for each text sequence position input into the model, whether the encoded information of only its preceding positions or of both its preceding and following positions is available. As shown in fig. 10, when the sample multi-modal feature input into the initial multi-modal encoder includes text information at 5 text sequence positions (A, B, C, D and E, respectively), the mask matrix in the initial multi-modal encoder may be shown as mask matrix 1; each element in mask matrix 1 may be 1, which means that in the initial multi-modal encoder each text sequence position in the sample multi-modal feature can obtain the encoded information of its preceding and following positions. The mask matrix in the initial text decoder may be shown as mask matrix 2; mask matrix 2 contains both zero elements and one elements, which means that in the initial text decoder each text sequence position in the sample fusion coding feature can obtain only the encoded information of its preceding positions.
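A small illustrative sketch (not part of the embodiment) contrasting the two mask matrices: an all-ones matrix for the bi-directional multi-modal encoder and a lower-triangular matrix for the uni-directional text decoder.

```python
import torch

seq_len = 5  # e.g., the five text sequence positions A, B, C, D, E
mask_matrix_1 = torch.ones(seq_len, seq_len)              # every position attends to all positions
mask_matrix_2 = torch.tril(torch.ones(seq_len, seq_len))  # each position attends only to itself and earlier positions
print(mask_matrix_2)
```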
It should be noted that, during the training of the initial multi-modal encoder, a part of the information in the sample multi-modal feature input to the initial multi-modal encoder may be masked; as shown in fig. 10, C and D in the sample multi-modal feature input to the initial multi-modal encoder are masked. To provide more useful information to the initial text decoder, the sample content description text corresponding to the sample video may be input into the initial text decoder, with a designated character serving as the start symbol of the sample content description text; as shown in fig. 8, the sample content description text corresponding to the sample video, prefixed with this start symbol, is input to the initial text decoder, and the initial text decoder obtains the sample prediction text by prediction based on the input sample content description text and the sample multi-modal feature.
Step S305, correcting network parameters of the initial timing sampler, the initial multi-mode encoder and the initial text decoder according to the attention mask matrix, the text prediction probability matrix and the sample content description text, and determining an initial generation model containing the corrected network parameters as a target generation model.
Specifically, the computer device may determine the sum of the absolute values of the elements in the attention mask matrix as an attention mask activation value, and determine the product between the attention mask activation value and a sparsity constraint parameter as the sparse attention constraint corresponding to the initial timing sampler. The sparse attention constraint may be expressed as the following formula (7):

$$
\mathcal{L}_{sparse}=\lambda \sum_{p,q}\left|M_{p,q}\right| \qquad (7)
$$

wherein $\mathcal{L}_{sparse}$ in formula (7) represents the sparse attention constraint added in the mutual attention component of the initial timing sampler; $\lambda$ represents the sparsity constraint parameter, which can be used to control the sparsity of the attention mask matrix $M_{A}$; $M_{p,q}$ represents the element value in row p and column q of the attention mask matrix $M_{A}$, and its absolute value may also be referred to as an activation value of the attention mask matrix $M_{A}$. In effect, an L1 constraint $\mathcal{L}_{sparse}$ is introduced on the attention mask matrix $M_{A}$ in the mutual attention component of the initial timing sampler.
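A minimal sketch of the sparse attention constraint of formula (7); the default value of the sparsity parameter is an assumption for illustration.

```python
import torch

def sparse_attention_constraint(attn_mask: torch.Tensor, lam: float = 1e-3) -> torch.Tensor:
    # Attention mask activation value = sum of absolute element values, scaled by lambda.
    return lam * attn_mask.abs().sum()
```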
The computer device may determine the autoregressive loss associated with the initial text decoder according to the number of sample videos and the prediction probability of each unit character in the sample content description text in the text prediction probability matrix. More specifically, the prediction probability corresponding to each unit character in the sample content description text can be obtained from the text prediction probability matrix, and a logarithmic operation is performed on the prediction probability corresponding to each unit character in the sample content description text to obtain a logarithmic probability value corresponding to each unit character in the sample content description text; the logarithmic probability values corresponding to each unit character in the sample content description text are accumulated to obtain a total logarithmic probability value corresponding to the sample videos, and the autoregressive loss associated with the initial text decoder is determined according to the ratio between the total logarithmic probability value and the number of sample videos. The autoregressive loss can be expressed as the following formula (8):

$$
\mathcal{L}_{AR}=-\frac{1}{|R|}\sum_{x \in R}\sum_{u=1}^{U}\log P_{\theta}\!\left(x_{u}\mid x_{<u}\right) \qquad (8)
$$

wherein $\mathcal{L}_{AR}$ in formula (8) represents the autoregressive loss corresponding to the encoder-decoder generation sub-model in the initial generation model; $R$ represents the sample data set used for training the initial generation model, and $|R|$ represents the total number of samples in the sample data set used for training the initial generation model; $\theta$ represents the network parameters in the encoder-decoder generation sub-model; $x$ represents the sample content description text (the label information corresponding to the sample video), here referring to the real label information (e.g., real comment data of the sample video) that needs to be fitted during the training phase; $x_{u}$ represents the u-th unit character (character or word) in the sample content description text, where u is a positive integer less than or equal to the total number of unit characters (denoted as U) contained in the sample content description text; $x_{<u}$ represents the unit characters preceding the u-th unit character in the sample content description text; $P_{\theta}(x_{u}\mid x_{<u})$ represents the prediction probability of the u-th unit character output by the initial text decoder based on the preceding unit characters; and $\log(\cdot)$ represents the logarithmic operation.
Further, the sum of the sparse attention constraint and the autoregressive loss is determined as the total model loss $\mathcal{L}_{total}$ corresponding to the initial generation model, and the total model loss $\mathcal{L}_{total}$ can be expressed as the following formula (9):

$$
\mathcal{L}_{total}=\mathcal{L}_{AR}+\mathcal{L}_{sparse} \qquad (9)
$$
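A hedged sketch of formulas (8) and (9), assuming the decoder already outputs per-position probabilities over the character vocabulary; the tensor shapes and the numerical clamp are assumptions.

```python
import torch

def autoregressive_loss(prob_matrix: torch.Tensor, target_ids: torch.Tensor) -> torch.Tensor:
    # prob_matrix: [B, U, vocab] prediction probabilities; target_ids: [B, U] unit-character ids
    per_char = prob_matrix.gather(-1, target_ids.unsqueeze(-1)).squeeze(-1)  # probability of each unit character
    log_prob_total = per_char.clamp_min(1e-12).log().sum()                   # accumulated log probabilities
    return -log_prob_total / prob_matrix.size(0)                             # ratio to the number of sample videos

def total_model_loss(ar_loss: torch.Tensor, sparse_loss: torch.Tensor) -> torch.Tensor:
    return ar_loss + sparse_loss  # formula (9)
```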
Iterative training is performed on the network parameters of the initial time sequence sampler, the initial multi-mode encoder and the initial text decoder based on the model total loss, and training is stopped when the model total loss meets the training ending condition; the visual encoder, the text encoder, the initial time sequence sampler, the initial multi-mode encoder and the initial text decoder at the end of training are determined as the target generation model. The training ending condition may include, but is not limited to, reaching a preset maximum number of iterations, or the value of the model total loss being less than or equal to a preset error value, which is not limited in the present application.
In one or more embodiments, in the training phase of the initial generation model, the learning rate during training may be set to 1.5×10⁻⁵; AdamW (an optimization method) may be used as the optimizer, with cosine learning-rate decay, and training may be performed on 32 Nvidia V100 GPUs (a model of graphics processor); the training batch size may be set to 1024, i.e., the number of samples used in one forward computation of the initial generation model. In the training phase, the initial generation model may be trained for 10 epochs, where one epoch represents one complete pass over all sample data in the sample data set, and epoch = 10 represents 10 complete passes over all sample data in the sample data set.
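The training configuration above might be set up as in the following sketch; the placeholder model and step counts are assumptions and do not reflect the actual networks.

```python
import torch
import torch.nn as nn
from torch.optim.lr_scheduler import CosineAnnealingLR

model = nn.Linear(8, 8)              # placeholder for the trainable sub-models (sampler, encoder, decoder)
epochs, steps_per_epoch = 10, 100    # 10 epochs; steps per epoch assumed for illustration

trainable = [p for p in model.parameters() if p.requires_grad]   # frozen pre-trained encoders would be excluded
optimizer = torch.optim.AdamW(trainable, lr=1.5e-5)
scheduler = CosineAnnealingLR(optimizer, T_max=epochs * steps_per_epoch)  # cosine learning-rate decay
```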
In the embodiment of the application, in the training stage of the initial generation model, the pre-trained visual encoder and the pre-trained text encoder are adopted to directly extract the characteristics of the sample video and the sample text matching data, so that the image-text characterization network structure migration can be applied to the characteristic extraction of the video-text, and the training efficiency of the initial generation model can be improved. The initial time sequence sampler is used for mining time sequence information among all video frames in the sample video, so that the representation capability of video representation information can be improved, meanwhile, attention sparse constraint can be introduced into a mutual attention component of the initial time sequence sampler, redundant information among all video frames contained in the sample video can be reduced through the attention sparse constraint, and effective information in the sample video can be captured better. The initial multi-mode encoder in the initial generation model carries out multi-mode fusion encoding processing on the sample time sequence sampling information output by the initial time sequence sampler and the text representation information output by the text encoder, and then the initial decoder predicts the final sample prediction text, so that the description accuracy of the sample prediction text can be improved.
It will be appreciated that in particular embodiments of the present application, video data (e.g., portrait video, etc.) of a user may be involved, and that when the above embodiments of the present application are applied to particular products or technologies, permissions or consents of the user or the like may need to be obtained, and the collection, use and processing of relevant data may need to comply with relevant laws and regulations and standards of the relevant country and region.
Referring to fig. 11, fig. 11 is a schematic structural diagram of a data processing apparatus according to an embodiment of the present application. As shown in fig. 11, the data processing apparatus 1 includes: a first data acquisition module 11, a first feature extraction module 12, a first timing sampling module 13, and a first text generation module 14;
the first data acquisition module 11 is configured to acquire video data and video text data corresponding to the video data;
the first feature extraction module 12 is configured to obtain video representation information corresponding to video data, and obtain text representation information corresponding to video text distribution data;
the first timing sampling module 13 is configured to perform timing sampling processing on the video representation information to obtain video timing sampling information corresponding to the video data, and combine the video timing sampling information and the text representation information into a multi-mode combination feature;
the first text generation module 14 is configured to perform encoding processing on the multi-mode combination feature to obtain a multi-mode fusion encoding feature, and perform text decoding processing on the multi-mode fusion encoding feature to obtain a video content description text associated with the video data.
In one or more embodiments, the first feature extraction module 12 obtains video representation information corresponding to video data, including:
Carrying out frame division processing on video data to obtain a video frame sequence, dividing each video frame in the video frame sequence into a plurality of image blocks with fixed sizes, and obtaining an image block set corresponding to each video frame in the video frame sequence;
acquiring image input features corresponding to a video frame i according to an image block set corresponding to the video frame i contained in the video frame sequence, and inputting the image input features to a visual encoder in a target generation model; i is a positive integer less than or equal to the number of video frames corresponding to the video frame sequence;
coding the image input characteristics according to the visual encoder to obtain image representation information corresponding to the video frame i;
and combining the image representation information corresponding to each video frame in the video frame sequence into video representation information corresponding to the video data.
In one or more embodiments, the first feature extraction module 12 encodes the image input feature according to a visual encoder to obtain image representation information corresponding to the video frame i, including:
according to the attention coding component in the visual encoder, outputting attention coding features corresponding to the image input features, and combining the image input features and the attention coding features into image joint features;
acquiring an implicit weight matrix and an offset vector corresponding to a multi-layer perceptron in the visual encoder, determining an image transformation feature corresponding to the video frame i based on the offset vector and the dot multiplication between the implicit weight matrix and the image joint feature, and combining the image joint feature and the image transformation feature into image representation information corresponding to the video frame i.
In one or more embodiments, the first feature extraction module 12 outputs attention encoding features corresponding to image input features according to an attention encoding component in a visual encoder, including:
acquiring a transformation weight matrix corresponding to an attention coding component in a visual encoder, and transforming the image input characteristics into a first query matrix, a first key matrix and a first value matrix based on the transformation weight matrix of the attention coding component;
performing dot multiplication operation on the first query matrix and the transposed matrix of the first key matrix to obtain candidate weight matrices, and obtaining the number of columns corresponding to the first query matrix;
and carrying out normalization processing on the ratio between the candidate weight matrix and the square root of the column number to obtain a first attention weight matrix, and determining attention coding features corresponding to the image input features according to dot multiplication between the first attention weight matrix and the first value matrix.
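For illustration (the shapes and single-head simplification are assumptions), the scaled dot-product attention described in the preceding steps can be sketched as:

```python
import torch

def scaled_dot_product_attention(q: torch.Tensor, k: torch.Tensor, v: torch.Tensor) -> torch.Tensor:
    d = q.size(-1)                                          # number of columns of the first query matrix
    candidate = q @ k.transpose(-2, -1)                     # dot product with the transposed first key matrix
    weights = torch.softmax(candidate / d ** 0.5, dim=-1)   # normalized first attention weight matrix
    return weights @ v                                      # attention coding features
```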
In one or more embodiments, the first feature extraction module 12 obtains text representation information corresponding to video textual data, including:
dividing video text matching data into D unit characters, and obtaining unit word vectors corresponding to the D unit characters respectively; d is a positive integer;
according to semantic information of the D unit characters in the video text distribution data, text vectors corresponding to the D unit characters are obtained;
according to the text positions of the D unit characters in the video text distribution data, obtaining position vectors corresponding to the D unit characters respectively;
superposing the unit word vector, the text vector and the position vector to obtain text input characteristics corresponding to the video text matching data;
and inputting the text input characteristics to a text encoder in the target generation model, and encoding the text input characteristics through the text encoder to obtain text representation information corresponding to the video text distribution data.
In one or more embodiments, the first timing sampling module 13 performs timing sampling processing on the video representation information to obtain video timing sampling information corresponding to the video data, where the processing includes:
performing position coding processing on the video representation information to obtain position coding information corresponding to the video representation information, and combining the video representation information and the position coding information into video description information;
Acquiring initial time sequence characteristics with the same dimension as the text representation information, and inputting the initial time sequence characteristics and the video description information into a time sequence sampler in a target generation model;
and carrying out iterative updating on the initial time sequence characteristics through the time sequence sampler and the video description information to obtain video time sequence sampling information corresponding to the video data.
In one or more embodiments, the video representation information includes L image representation information, L being a positive integer;
the first timing sampling module 13 performs a position encoding process on the video representation information to obtain position encoding information corresponding to the video representation information, including:
acquiring index positions of L image representation information in video data, and dividing the index positions of the L image representation information into even index positions and odd index positions;
sinusoidal position coding is carried out on even index positions in the video representation information, and sinusoidal coding information corresponding to the even index positions is obtained;
cosine position coding is carried out on odd index positions in the video representation information, and cosine coding information corresponding to the odd index positions is obtained;
and determining the sine coding information and the cosine coding information as position coding information corresponding to the video representation information.
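A minimal sketch of the position coding just described, where even index positions receive sinusoidal coding and odd index positions receive cosine coding; the frequency term follows the common Transformer convention and is an assumption.

```python
import torch

def position_encoding(num_positions: int, dim: int) -> torch.Tensor:
    pos = torch.arange(num_positions, dtype=torch.float32).unsqueeze(1)     # index positions [N, 1]
    div = torch.pow(10000.0, torch.arange(dim, dtype=torch.float32) / dim)  # per-dimension frequencies
    angles = pos / div                                                      # [N, dim]
    pe = torch.zeros(num_positions, dim)
    pe[0::2] = torch.sin(angles[0::2])  # sinusoidal coding for even index positions
    pe[1::2] = torch.cos(angles[1::2])  # cosine coding for odd index positions
    return pe
```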
In one or more embodiments, a timing sampler in a target generation model includes N mutual attention components and N self attention components, alternating connections are made between the N mutual attention components and the N self attention components, N is a positive integer;
the first timing sampling module 13 performs iterative update on the initial timing characteristics through the timing sampler and the video description information to obtain video timing sampling information corresponding to the video data, including:
acquiring input characteristics of a j-th mutual attention component in a time sequence sampler; when j is 1, the input features of the j-th mutual attention component comprise video description information and initial time sequence features; when j is not 1, the input characteristics of the j-th mutual attention component comprise video description information and the output characteristics of the j-1-th self attention component; j is a positive integer less than or equal to N;
acquiring a first weight matrix, a second weight matrix and a third weight matrix corresponding to the jth mutual attention component, and performing dot multiplication operation on the first weight matrix and the output characteristics of the (j-1)-th self attention component to obtain a second query matrix;
performing dot multiplication operation on the second weight matrix and the video description information to obtain a second key matrix, and performing dot multiplication operation on the third weight matrix and the video description information to obtain a second value matrix;
Determining output characteristics of a j-th mutual attention component according to the second query matrix, the second key matrix and the second value matrix;
inputting the output characteristics of the jth mutual attention component into the jth self attention component in the time sequence sampler, and performing self attention coding processing on the output characteristics of the jth mutual attention component through the jth self attention component to obtain the output characteristics of the jth self attention component;
and determining the output characteristics of the Nth self-attention component in the time sequence sampler as video time sequence sampling information corresponding to the video data.
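The alternating connection of the N mutual attention components and N self attention components can be sketched as follows; the component modules are assumed to be provided externally and are placeholders for illustration.

```python
import torch
import torch.nn as nn

class TimingSampler(nn.Module):
    def __init__(self, mutual_blocks: nn.ModuleList, self_blocks: nn.ModuleList):
        super().__init__()
        self.mutual_blocks = mutual_blocks   # N mutual attention components
        self.self_blocks = self_blocks       # N self attention components

    def forward(self, init_timing: torch.Tensor, video_desc: torch.Tensor) -> torch.Tensor:
        z = init_timing                          # initial timing features
        for mutual, self_attn in zip(self.mutual_blocks, self.self_blocks):
            z = mutual(z, video_desc)            # j-th mutual attention component
            z = self_attn(z)                     # j-th self attention component
        return z                                 # video timing sampling information
```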
In one or more embodiments, the first text generation module 14 encodes the multi-modal combined feature to obtain a multi-modal fusion encoded feature, and decodes the multi-modal fusion encoded feature to obtain a video content description text associated with the video data, including:
inputting the multi-mode combination features into a multi-mode encoder in a target generation model, and performing bidirectional feature coding processing on the multi-mode combination features through the multi-mode encoder to obtain multi-mode fusion coding features;
inputting the multi-mode fusion coding features to a text decoder in a target generation model, and performing attention aggregation operation on the multi-mode fusion coding features through the text decoder to obtain attention aggregation features;
And performing autoregressive processing on the attention-aggregated features to obtain a text probability output matrix, and determining a video content description text associated with the video data according to the text probability output matrix.
According to one embodiment of the present application, the steps involved in the data processing method shown in fig. 3 and 5 described above may be performed by the respective modules in the data processing apparatus 1 shown in fig. 11. For example, step S101 shown in fig. 3 may be performed by the first data acquisition module 11 shown in fig. 11, step S102 shown in fig. 3 may be performed by the first feature extraction module 12 shown in fig. 11, step S103 shown in fig. 3 may be performed by the first timing sampling module 13 shown in fig. 11, step S104 shown in fig. 3 may be performed by the first text generation module 14 shown in fig. 11, and so on.
According to an embodiment of the present application, each module in the data processing apparatus 1 shown in fig. 11 may be separately or completely combined into one or several units, or some of the units may be further split into at least two sub-units with smaller functions, which can implement the same operation without affecting the technical effects of the embodiments of the present application. The above modules are divided based on logic functions; in practical application, the functions of one module may be implemented by at least two units, or the functions of at least two modules may be implemented by one unit. In other embodiments of the present application, the data processing apparatus 1 may also include other units; in practical applications, these functions may also be implemented with the assistance of other units, and may be implemented by the cooperation of at least two units.
In the embodiment of the application, video representing information can be obtained after video data passes through a visual encoder in a target generation model, and text representing information can be obtained after video text matching data corresponding to the video data passes through a text encoder; further, the video representation information can be subjected to time sequence sampling processing through a time sequence sampler introduced into the target generation model so as to obtain video time sequence sampling information, and the text representation information and the video time sequence sampling information are combined into a multi-mode combination characteristic; the multi-modal combination feature is passed through a multi-modal encoder and then predicted using a text decoder to produce the final video content description text. The time sequence information among all video frames in the video data is acquired through the time sequence sampler, so that the representation capability of video features can be improved, and further, the description accuracy of video content can be improved; the video content description text based on the video data can promote the transmissibility of the video data.
Referring to fig. 12, fig. 12 is a schematic diagram of a data processing apparatus according to an embodiment of the present disclosure. As shown in fig. 12, the data processing apparatus 2 includes: a second data acquisition module 21, a second feature extraction module 22, a second timing sampling module 23, a second text generation module 24, and a network parameter correction module 25;
A second data obtaining module 21, configured to obtain a sample video, and sample text matching data and sample content description text corresponding to the sample video;
the second feature extraction module 22 is configured to output, by using a visual encoder in the initial generation model, sample video representation information corresponding to the sample video, and output, by using a text encoder in the initial generation model, sample text representation information corresponding to the sample text data; the visual encoder and the text encoder are trained based on a plurality of image-text data sets, wherein one image-text data set comprises a sample image and image text data;
the second time sequence sampling module 23 is configured to obtain an attention mask matrix in an initial time sequence sampler included in the initial generation model, perform time sequence sampling processing on the sample video representation information according to the attention mask matrix and the mutual attention component and self attention component in the initial time sequence sampler to obtain sample time sequence sampling information, and combine the sample time sequence sampling information and the sample text representation information into a sample multi-mode feature;
the second text generation module 24 is configured to perform encoding processing on the sample multi-mode feature by using an initial multi-mode encoder in the initial generation model to obtain a sample fusion encoding feature, and output a text prediction probability matrix corresponding to the sample fusion encoding feature by using an initial text decoder in the initial generation model; the text prediction probability matrix is used for representing the prediction probability corresponding to each unit character in the sample content description text;
A network parameter correction module 25, configured to correct network parameters of the initial time sequence sampler, the initial multi-mode encoder, and the initial text decoder according to the attention mask matrix, the text prediction probability matrix, and the sample content description text, and determine an initial generation model including the corrected network parameters as a target generation model; the target generation model is used for generating video content description text for input video data and video text distribution data.
In one or more embodiments, the second timing sampling module 23 performs timing sampling processing on the sample video representation information according to the attention mask matrix and the mutual attention component and the self attention component in the initial timing sampler, to obtain sample timing sampling information, including:
acquiring initial sample time sequence characteristics with the same dimension as the sample text expression information, and performing dot multiplication operation on the initial sample time sequence characteristics and a first transformation matrix corresponding to a mutual attention component in an initial time sequence sampler to obtain a third query matrix;
performing dot multiplication operation on the sample video representation information and a second transformation matrix corresponding to the mutual attention component in the initial time sequence sampler to obtain a third key matrix, and performing dot multiplication operation on the sample video representation information and a third transformation matrix corresponding to the mutual attention component in the initial time sequence sampler to obtain a third value matrix;
Determining the mutual attention characteristic of the sample video representation information association according to the third query matrix, the third key matrix and the third value matrix;
performing activation processing on the attention mask matrix to obtain attention activation characteristics, and determining the product of the attention activation characteristics and the mutual attention characteristics as the output characteristics of the mutual attention components in the initial time sequence sampling component;
and inputting the output characteristics of the mutual attention component in the initial time sequence sampling component into the self-attention component in the initial time sequence sampling component, and performing self-attention coding processing on the output characteristics of the mutual attention component in the initial time sequence sampling component through the self-attention component in the initial time sequence sampling component to obtain sample time sequence sampling information corresponding to the sample video representation information.
In one or more embodiments, the network parameter modification module 25 modifies network parameters of the initial timing sampler, the initial multi-mode encoder, and the initial text decoder according to the attention mask matrix, the text prediction probability matrix, and the sample content descriptive text, and determines an initial generation model including the modified network parameters as a target generation model, including:
Determining the sum of absolute values of all elements in the attention mask matrix as an attention mask activation value, and determining the product between the attention mask activation value and the sparse constraint parameter as a sparse attention constraint corresponding to the initial time sequence sampler;
determining the autoregressive loss associated with the initial text decoder according to the number of sample videos and the prediction probability of each unit character in the sample content description text in the text prediction probability matrix;
determining the sum of sparse attention constraint and autoregressive loss as the model total loss corresponding to the initial generation model;
and carrying out iterative training on the network parameters of the initial time sequence sampler, the initial multi-mode encoder and the initial text decoder based on the model total loss, stopping training when the model total loss meets the training ending condition, and determining the visual encoder, the text encoder, the initial time sequence sampler, the initial multi-mode encoder and the initial text decoder at the end of training as the target generation model.
In one or more embodiments, the network parameter modification module 25 determines an initial text decoder-associated autoregressive loss based on the number of sample videos and the predicted probability of each unit character in the sample content description text in the text prediction probability matrix, including:
Obtaining the prediction probability corresponding to each unit character in the sample content description text from the text prediction probability matrix, and carrying out logarithmic operation on the prediction probability corresponding to each unit character in the sample content description text to obtain a logarithmic probability value corresponding to each unit character in the sample content description text;
and accumulating the logarithmic probability values corresponding to each unit character in the sample content description text to obtain a logarithmic probability total value corresponding to the sample video, and determining the autoregressive loss associated with the initial text decoder according to the ratio between the logarithmic probability total value and the number of the sample videos.
According to an embodiment of the present application, the steps involved in the data processing method shown in fig. 7 above may be performed by the respective modules in the data processing apparatus 2 shown in fig. 12. For example, step S301 shown in fig. 7 may be performed by the second data acquisition module 21 shown in fig. 12, step S302 shown in fig. 7 may be performed by the second feature extraction module 22 shown in fig. 12, step S303 shown in fig. 7 may be performed by the second timing sampling module 23 shown in fig. 12, step S304 shown in fig. 7 may be performed by the second text generation module 24 shown in fig. 12, step S305 shown in fig. 7 may be performed by the network parameter correction module 25 shown in fig. 12, and so on.
According to an embodiment of the present application, each module in the data processing apparatus 2 shown in fig. 12 may be separately or completely combined into one or several units, or some of the units may be further split into at least two sub-units with smaller functions, which can implement the same operation without affecting the technical effects of the embodiments of the present application. The above modules are divided based on logic functions; in practical application, the functions of one module may be implemented by at least two units, or the functions of at least two modules may be implemented by one unit. In other embodiments of the present application, the data processing apparatus 2 may also include other units; in practical applications, these functions may also be implemented with the assistance of other units, and may be implemented by the cooperation of at least two units.
In the embodiment of the application, in the training stage of the initial generation model, the pre-trained visual encoder and the pre-trained text encoder are adopted to directly extract the characteristics of the sample video and the sample text matching data, so that the image-text characterization network structure migration can be applied to the characteristic extraction of the video-text, and the training efficiency of the initial generation model can be improved. The initial time sequence sampler is used for mining time sequence information among all video frames in the sample video, so that the representation capability of video representation information can be improved, meanwhile, attention sparse constraint can be introduced into a mutual attention component of the initial time sequence sampler, redundant information among all video frames contained in the sample video can be reduced through the attention sparse constraint, and effective information in the sample video can be captured better. The initial multi-mode encoder in the initial generation model carries out multi-mode fusion encoding processing on the sample time sequence sampling information output by the initial time sequence sampler and the text representation information output by the text encoder, and then the initial decoder predicts the final sample prediction text, so that the description accuracy of the sample prediction text can be improved.
Further, referring to fig. 13, fig. 13 is a schematic structural diagram of a computer device according to an embodiment of the present application. As shown in fig. 13, the computer device 1000 may be a terminal device, for example, the terminal device 10a in the embodiment corresponding to fig. 1, or a server, for example, the server 10d in the embodiment corresponding to fig. 1, which is not limited herein. For ease of understanding, the present application takes the case where the computer device is a terminal device as an example. The computer device 1000 may include: a processor 1001, a network interface 1004, and a memory 1005; in addition, the computer device 1000 may further include: a user interface 1003 and at least one communication bus 1002. The communication bus 1002 is used to enable connected communication between these components. The user interface 1003 may also include a standard wired interface and a wireless interface. The network interface 1004 may optionally include a standard wired interface and a wireless interface (e.g., a WI-FI interface). The memory 1005 may be a high-speed RAM memory or a non-volatile memory (non-volatile memory), such as at least one disk memory. The memory 1005 may optionally also be at least one storage device located remotely from the processor 1001. As shown in fig. 13, the memory 1005, which is a computer-readable storage medium, may include an operating system, a network communication module, a user interface module, and a device control application program.
The network interface 1004 in the computer device 1000 may also provide network communication functions, and the optional user interface 1003 may also include a Display screen (Display) and a Keyboard (Keyboard). In the computer device 1000 shown in FIG. 13, the network interface 1004 may provide network communication functions; while user interface 1003 is primarily used as an interface for providing input to a user; and the processor 1001 may be used to invoke a device control application stored in the memory 1005 to implement:
acquiring video data and video text distribution data corresponding to the video data;
acquiring video representation information corresponding to video data and acquiring text representation information corresponding to video text distribution data;
performing time sequence sampling processing on the video representation information to obtain video time sequence sampling information corresponding to video data, and combining the video time sequence sampling information and the text representation information into a multi-mode combination feature;
and carrying out encoding processing on the multi-mode combination features to obtain multi-mode fusion encoding features, and carrying out text decoding processing on the multi-mode fusion encoding features to obtain video content description text associated with the video data.
Alternatively, the processor 1001 may be configured to invoke a device control application program stored in the memory 1005 to implement:
acquiring a sample video, and sample text matching data and sample content description text corresponding to the sample video;
outputting sample video representation information corresponding to the sample video through a visual encoder in the initial generation model, and outputting sample text representation information corresponding to the sample text data through a text encoder in the initial generation model; the visual encoder and the text encoder are trained based on a plurality of image-text data sets, wherein one image-text data set comprises a sample image and image text data;
acquiring an attention mask matrix in an initial time sequence sampler contained in the initial generation model, performing time sequence sampling processing on the sample video representation information according to the attention mask matrix and the mutual attention component and self attention component in the initial time sequence sampler to obtain sample time sequence sampling information, and combining the sample time sequence sampling information and the sample text representation information into a sample multi-mode characteristic;
the method comprises the steps of performing coding processing on sample multi-mode features through an initial multi-mode coder in an initial generation model to obtain sample fusion coding features, and outputting a text prediction probability matrix corresponding to the sample fusion coding features through an initial text decoder in the initial generation model; the text prediction probability matrix is used for representing the prediction probability corresponding to each unit character in the sample content description text;
correcting network parameters of the initial time sequence sampler, the initial multi-mode encoder and the initial text decoder according to the attention mask matrix, the text prediction probability matrix and the sample content description text, and determining an initial generation model containing the corrected network parameters as a target generation model; the target generation model is used for generating video content description text for input video data and video text distribution data.
It should be understood that the computer device 1000 described in the embodiments of the present application may perform the description of the data processing method in any of the embodiments of fig. 3, 5 and 7, or may perform the description of the data processing apparatus 1 in the embodiment corresponding to fig. 11, or perform the description of the data processing apparatus 2 in the embodiment corresponding to fig. 12, which will not be repeated herein. In addition, the description of the beneficial effects of the same method is omitted.
Furthermore, it should be noted here that: the embodiments of the present application further provide a computer readable storage medium, in which the aforementioned computer program executed by the data processing apparatus 1 or the data processing apparatus 2 is stored, and the computer program includes program instructions, when executed by a processor, can execute the description of the data processing method in any of the foregoing embodiments of fig. 3, 5 and 7, and therefore, a detailed description will not be given here. In addition, the description of the beneficial effects of the same method is omitted. The storage medium may be a magnetic disk, an optical disk, a Read-Only Memory (ROM), a random access Memory (Random Access Memory, RAM), or the like. For technical details not disclosed in the embodiments of the computer-readable storage medium according to the present application, please refer to the description of the method embodiments of the present application. As an example, program instructions may be deployed to be executed on one computing device or on multiple computing devices at one site or, alternatively, across multiple computing devices distributed across multiple sites and interconnected by a communication network, where the multiple computing devices distributed across multiple sites and interconnected by the communication network may constitute a blockchain system.
In addition, it should be noted that: embodiments of the present application also provide a computer program product or computer program that may include computer instructions that may be stored in a computer-readable storage medium. The processor of the computer device reads the computer instructions from the computer readable storage medium, and the processor can execute the computer instructions, so that the computer device performs the description of the data processing method in any one of the embodiments of fig. 3, fig. 5, and fig. 7, which will not be described herein. In addition, the description of the beneficial effects of the same method is omitted. For technical details not disclosed in the computer program product or the computer program embodiments related to the present application, please refer to the description of the method embodiments of the present application.
The terms first, second and the like in the description and in the claims and drawings of the embodiments of the present application are used for distinguishing between different media content and not for describing a particular sequential order. Furthermore, the term "include" and any variations thereof is intended to cover a non-exclusive inclusion. For example, a process, method, apparatus, article, or device that comprises a list of steps or elements is not limited to the list of steps or modules but may, in the alternative, include other steps or modules not listed or inherent to such process, method, apparatus, article, or device.
Those of ordinary skill in the art will appreciate that the elements and algorithm steps described in connection with the embodiments disclosed herein may be implemented in electronic hardware, in computer software, or in a combination of the two. To clearly illustrate the interchangeability of hardware and software, the elements and steps of the examples have been described above generally in terms of their functions. Whether such functionality is implemented as hardware or software depends upon the particular application and the design constraints imposed on the solution. Skilled artisans may implement the described functionality in varying ways for each particular application, but such implementation decisions should not be interpreted as causing a departure from the scope of the present application.
The methods and related devices provided in the embodiments of the present application are described with reference to the method flowcharts and/or structure diagrams provided in the embodiments of the present application. Each flow and/or block of the method flowcharts and/or structure diagrams, and combinations of flows and/or blocks therein, may be implemented by computer program instructions. These computer program instructions may be provided to a processor of a general purpose computer, a special purpose computer, an embedded processor, or other programmable data processing apparatus to produce a machine, such that the instructions executed by the processor of the computer or other programmable data processing apparatus create means for implementing the functions specified in one or more flows of the flowcharts and/or one or more blocks of the structure diagrams. These computer program instructions may also be stored in a computer readable memory that can direct a computer or other programmable data processing apparatus to function in a particular manner, such that the instructions stored in the computer readable memory produce an article of manufacture including instruction means which implement the functions specified in one or more flows of the flowcharts and/or one or more blocks of the structure diagrams. These computer program instructions may also be loaded onto a computer or other programmable data processing apparatus so that a series of operational steps are performed on the computer or other programmable apparatus to produce a computer implemented process, such that the instructions executed on the computer or other programmable apparatus provide steps for implementing the functions specified in one or more flows of the flowcharts and/or one or more blocks of the structure diagrams.
The foregoing disclosure is merely illustrative of preferred embodiments of the present application and is not intended to limit the scope of the claims of the present application; equivalent variations made according to the claims of the present application therefore still fall within the scope of the present application.

Claims (17)

1. A method of data processing, comprising:
acquiring video data and video text distribution data corresponding to the video data;
acquiring video representation information corresponding to the video data and acquiring text representation information corresponding to the video text distribution data;
performing time sequence sampling processing on the video representation information to obtain video time sequence sampling information corresponding to the video data, and combining the video time sequence sampling information and the text representation information into a multi-mode combination feature;
and carrying out coding processing on the multi-mode combination characteristic to obtain a multi-mode fusion coding characteristic, and carrying out text decoding processing on the multi-mode fusion coding characteristic to obtain a video content description text associated with the video data.
2. The method according to claim 1, wherein the acquiring video representation information corresponding to the video data includes:
carrying out frame division processing on the video data to obtain a video frame sequence, dividing each video frame in the video frame sequence into a plurality of image blocks with fixed sizes, and obtaining an image block set corresponding to each video frame in the video frame sequence;
acquiring an image input feature corresponding to a video frame i according to an image block set corresponding to the video frame i contained in the video frame sequence, and inputting the image input feature to a visual encoder in a target generation model; i is a positive integer less than or equal to the number of video frames corresponding to the video frame sequence;
encoding the image input feature according to the visual encoder to obtain image representation information corresponding to the video frame i;
and combining the image representation information corresponding to each video frame in the video frame sequence into the video representation information corresponding to the video data.
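Purely as an illustration of claim 2 (not the authoritative implementation), one common way to turn a frame's fixed-size image blocks into an image input feature is a shared linear projection over non-overlapping patches, sketched below in PyTorch; the names FramePatchEmbed, patch_size and embed_dim are hypothetical.

```python
import torch
import torch.nn as nn

class FramePatchEmbed(nn.Module):
    """Split one video frame into fixed-size image blocks and project them."""
    def __init__(self, patch_size=16, in_chans=3, embed_dim=768):
        super().__init__()
        # A strided convolution is equivalent to cutting non-overlapping blocks
        # and applying the same linear projection to each block.
        self.proj = nn.Conv2d(in_chans, embed_dim, kernel_size=patch_size, stride=patch_size)

    def forward(self, frame):                 # frame: (C, H, W)
        x = self.proj(frame.unsqueeze(0))     # (1, D, H/p, W/p)
        x = x.flatten(2).transpose(1, 2)      # (1, num_blocks, D)
        return x.squeeze(0)                   # image input feature for frame i

# Usage with one 224x224 RGB frame: returns a (196, 768) feature.
feature_i = FramePatchEmbed()(torch.randn(3, 224, 224))
```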
3. The method according to claim 2, wherein the encoding the image input feature according to the visual encoder to obtain the image representation information corresponding to the video frame i includes:
outputting attention coding features corresponding to the image input features according to an attention coding component in the visual encoder, and combining the image input features and the attention coding features into image joint features;
acquiring an implicit weight matrix and an offset vector corresponding to a multi-layer perceptron in the visual encoder, determining an image transformation feature corresponding to the video frame i based on the offset vector and dot multiplication between the implicit weight matrix and the image joint feature, and combining the image transformation feature and the image joint feature into the image representation information corresponding to the video frame i.
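To make the data flow of claim 3 concrete, the sketch below follows one plausible reading (an assumption, not the claimed implementation): the attention output is combined with the input into the image joint feature, a multi-layer perceptron parameterized by an implicit weight matrix and an offset vector produces the image transformation feature, and the two are combined into the image representation information.

```python
import torch
import torch.nn as nn

class VisualEncoderBlock(nn.Module):
    def __init__(self, dim=768, heads=8):
        super().__init__()
        self.attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.w = nn.Parameter(torch.randn(dim, dim) * 0.02)   # implicit weight matrix
        self.b = nn.Parameter(torch.zeros(dim))                # offset vector

    def forward(self, image_input_feature):                    # (num_blocks, dim)
        x = image_input_feature.unsqueeze(0)
        attn_out, _ = self.attn(x, x, x)                        # attention coding feature
        joint = x + attn_out                                    # image joint feature
        transform = joint @ self.w + self.b                     # dot multiplication plus offset
        return (joint + transform).squeeze(0)                   # image representation information
```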
4. A method according to claim 3, wherein said outputting attention coding features corresponding to said image input features according to an attention coding component in said visual encoder comprises:
acquiring a transformation weight matrix corresponding to an attention coding component in the visual encoder, and transforming the image input features into a first query matrix, a first key matrix and a first value matrix based on the transformation weight matrix of the attention coding component;
performing dot multiplication operation on the first query matrix and the transposed matrix of the first key matrix to obtain a candidate weight matrix, and obtaining the number of columns corresponding to the first query matrix;
and normalizing the ratio between the candidate weight matrix and the square root of the column number to obtain a first attention weight matrix, and determining attention coding features corresponding to the image input features according to dot multiplication between the first attention weight matrix and the first value matrix.
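The computation recited in claim 4 has the familiar scaled dot-product form softmax(Q·Kᵀ/√d)·V. A minimal single-head sketch, assuming the transformation weight matrices are given as plain tensors:

```python
import math
import torch
import torch.nn.functional as F

def attention_coding(image_input_feature, wq, wk, wv):
    q = image_input_feature @ wq                 # first query matrix
    k = image_input_feature @ wk                 # first key matrix
    v = image_input_feature @ wv                 # first value matrix
    candidate = q @ k.transpose(-2, -1)          # dot product with the transposed key matrix
    d = q.shape[-1]                              # number of columns of the query matrix
    attn = F.softmax(candidate / math.sqrt(d), dim=-1)   # first attention weight matrix
    return attn @ v                              # attention coding feature
```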
5. The method according to claim 1, wherein the obtaining text representation information corresponding to the video text distribution data includes:
dividing the video text distribution data into D unit characters, and obtaining unit word vectors corresponding to the D unit characters respectively; D is a positive integer;
acquiring text vectors respectively corresponding to the D unit characters according to semantic information of the D unit characters in the video text distribution data;
acquiring position vectors corresponding to the D unit characters respectively according to the text positions of the D unit characters in the video text distribution data;
superposing the unit word vector, the text vector and the position vector to obtain text input characteristics corresponding to the video text distribution data;
and inputting the text input characteristics to a text encoder in a target generation model, and encoding the text input characteristics through the text encoder to obtain text representation information corresponding to the video text distribution data.
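Claim 5 obtains the text input feature by superposing a word vector, a text (semantic) vector and a position vector for each unit character. The sketch below assumes BERT-style embedding tables; the table sizes and names are illustrative only.

```python
import torch
import torch.nn as nn

class TextInputEmbedding(nn.Module):
    def __init__(self, vocab_size=30522, max_len=512, num_text_types=2, dim=768):
        super().__init__()
        self.word = nn.Embedding(vocab_size, dim)        # unit word vectors
        self.text = nn.Embedding(num_text_types, dim)    # text vectors (semantic information)
        self.pos = nn.Embedding(max_len, dim)            # position vectors

    def forward(self, char_ids, type_ids):               # both of shape (D,) for D unit characters
        positions = torch.arange(char_ids.size(0))
        # Superpose the three vectors to obtain the text input feature.
        return self.word(char_ids) + self.text(type_ids) + self.pos(positions)
```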
6. The method according to claim 1, wherein the performing time-series sampling processing on the video representation information to obtain video time-series sampling information corresponding to the video data includes:
performing position coding processing on the video representation information to obtain position coding information corresponding to the video representation information, and combining the video representation information and the position coding information into video description information;
acquiring initial time sequence characteristics with the same dimension as the text representation information, and inputting the initial time sequence characteristics and the video description information into a time sequence sampler in a target generation model;
and carrying out iterative updating on the initial time sequence characteristics through the time sequence sampler and the video description information to obtain video time sequence sampling information corresponding to the video data.
7. The method of claim 6, wherein the video representation information comprises L image representation information, L being a positive integer;
the step of performing position coding processing on the video representation information to obtain position coding information corresponding to the video representation information includes:
acquiring index positions of the L image representation information in the video data, and dividing the index positions of the L image representation information into even index positions and odd index positions;
sinusoidal position coding is carried out on even index positions in the video representation information, and sinusoidal coding information corresponding to the even index positions is obtained;
cosine position coding is carried out on odd index positions in the video representation information, and cosine coding information corresponding to the odd index positions is obtained;
and determining the sine coding information and the cosine coding information as position coding information corresponding to the video representation information.
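Read literally, claim 7 applies sine encoding at even index positions and cosine encoding at odd index positions of the frame sequence. One sketch of that reading follows; the 10000 frequency base is an assumption carried over from the standard sinusoidal scheme, not something the claim specifies.

```python
import math
import torch

def position_coding(num_frames, dim):
    pos = torch.arange(num_frames, dtype=torch.float32).unsqueeze(1)        # index positions
    div = torch.exp(torch.arange(dim, dtype=torch.float32) * (-math.log(10000.0) / dim))
    angles = pos * div                                                       # (num_frames, dim)
    pe = torch.zeros(num_frames, dim)
    pe[0::2] = torch.sin(angles[0::2])    # sinusoidal coding information, even index positions
    pe[1::2] = torch.cos(angles[1::2])    # cosine coding information, odd index positions
    return pe                             # combined with the image representation information
```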
8. The method of claim 6, wherein the time sequence sampler in the target generation model comprises N mutual attention components and N self attention components, the N mutual attention components and the N self attention components are alternately connected, and N is a positive integer;
the step of iteratively updating the initial time sequence feature through the time sequence sampler and the video description information to obtain video time sequence sampling information corresponding to the video data comprises the following steps:
acquiring input characteristics of a jth mutual attention component in the time sequence sampler; when j is 1, the input features of the jth mutual attention component comprise the video description information and the initial timing feature; when j is not 1, the input characteristics of the j-th mutual attention component comprise the video description information and the output characteristics of the j-1-th self attention component; j is a positive integer less than or equal to N;
acquiring a first weight matrix, a second weight matrix and a third weight matrix corresponding to the jth mutual attention component, and performing dot multiplication operation on the first weight matrix and the output characteristics of the j-1-th self attention component to obtain a second query matrix;
performing dot multiplication operation on the second weight matrix and the video description information to obtain a second key matrix, and performing dot multiplication operation on the third weight matrix and the video description information to obtain a second value matrix;
determining output features of the jth mutual attention component according to the second query matrix, the second key matrix and the second value matrix;
inputting the output characteristics of the jth mutual attention component into a jth self-attention component in the time sequence sampler, and performing self-attention coding processing on the output characteristics of the jth mutual attention component through the jth self-attention component to obtain the output characteristics of the jth self-attention component;
and determining the output characteristics of the Nth self-attention component in the time sequence sampler as video time sequence sampling information corresponding to the video data.
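Claim 8 alternates N mutual (cross) attention components with N self attention components, repeatedly refining a learned initial timing feature against the video description information. A compact sketch under the assumption of PyTorch MultiheadAttention, with illustrative head and query counts (residual connections and normalization are omitted):

```python
import torch
import torch.nn as nn

class TimingSampler(nn.Module):
    def __init__(self, dim=768, depth=2, num_queries=32, heads=8):
        super().__init__()
        self.init_timing = nn.Parameter(torch.randn(num_queries, dim) * 0.02)   # initial timing feature
        self.cross = nn.ModuleList([nn.MultiheadAttention(dim, heads, batch_first=True) for _ in range(depth)])
        self.self_attn = nn.ModuleList([nn.MultiheadAttention(dim, heads, batch_first=True) for _ in range(depth)])

    def forward(self, video_description):             # (num_frames, dim)
        kv = video_description.unsqueeze(0)
        x = self.init_timing.unsqueeze(0)              # queries for the first mutual attention component
        for cross, self_attn in zip(self.cross, self.self_attn):
            x, _ = cross(x, kv, kv)                     # j-th mutual attention over the video description
            x, _ = self_attn(x, x, x)                   # j-th self attention over the updated queries
        return x.squeeze(0)                             # video time sequence sampling information
```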
9. The method according to claim 1, wherein the encoding the multi-mode combination feature to obtain a multi-mode fusion encoding feature, and the text decoding of the multi-mode fusion encoding feature to obtain a video content description text associated with the video data, comprises:
inputting the multi-mode combination features to a multi-mode encoder in a target generation model, and performing bidirectional feature encoding processing on the multi-mode combination features through the multi-mode encoder to obtain multi-mode fusion encoding features;
inputting the multi-mode fusion encoding features to a text decoder in the target generation model, and performing attention aggregation operation on the multi-mode fusion encoding features through the text decoder to obtain attention aggregation features;
and performing autoregressive processing on the attention aggregation features to obtain a text probability output matrix, and determining the video content description text associated with the video data according to the text probability output matrix.
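Claim 9 ends with an autoregressive pass that yields a per-step text probability output matrix from which the description text is read out. A greedy read-out is one simple choice; the claim does not fix the decoding strategy, so the end-of-sequence id and vocabulary mapping below are assumptions.

```python
import torch

def decode_description(text_probability_matrix, id_to_char, eos_id=102):
    """text_probability_matrix: (steps, vocab_size) rows of per-step output probabilities."""
    chars = []
    for step_probs in text_probability_matrix:
        char_id = int(torch.argmax(step_probs))   # greedy: most probable unit character
        if char_id == eos_id:
            break
        chars.append(id_to_char[char_id])
    return "".join(chars)                          # video content description text
```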
10. A method of data processing, comprising:
acquiring sample video, sample text matching data corresponding to the sample video and sample content description text;
outputting sample video representation information corresponding to the sample video through a visual encoder in an initial generation model, and outputting sample text representation information corresponding to the sample text matching data through a text encoder in the initial generation model; the visual encoder and the text encoder are trained based on a plurality of image-text data sets, wherein one image-text data set comprises one sample image and one piece of image text data;
acquiring an attention mask matrix in an initial time sequence sampler contained in the initial generation model, performing time sequence sampling processing on the sample video representation information according to the attention mask matrix and a mutual attention component and a self attention component in the initial time sequence sampler to obtain sample time sequence sampling information, and combining the sample time sequence sampling information and the sample text representation information into a sample multi-mode feature;
encoding the sample multi-mode feature through an initial multi-mode encoder in the initial generation model to obtain a sample fusion encoding feature, and outputting, through an initial text decoder in the initial generation model, a text prediction probability matrix corresponding to the sample fusion encoding feature; the text prediction probability matrix is used for representing the prediction probability corresponding to each unit character in the sample content description text;
correcting network parameters of the initial time sequence sampler, the initial multi-mode encoder and the initial text decoder according to the attention mask matrix, the text prediction probability matrix and the sample content description text, and determining an initial generation model containing the corrected network parameters as a target generation model; the target generation model is used for generating video content description text for input video data and video text matching data.
11. The method according to claim 10, wherein the performing time sequence sampling processing on the sample video representation information according to the attention mask matrix and the mutual attention component and the self attention component in the initial time sequence sampler to obtain sample time sequence sampling information includes:
acquiring initial sample time sequence characteristics with the same dimension as the sample text representation information, and performing dot multiplication operation on the initial sample time sequence characteristics and a first transformation matrix corresponding to the mutual attention component in the initial time sequence sampler to obtain a third query matrix;
performing dot multiplication operation on the sample video representation information and a second transformation matrix corresponding to the mutual attention component in the initial time sequence sampler to obtain a third key matrix, and performing dot multiplication operation on the sample video representation information and a third transformation matrix corresponding to the mutual attention component in the initial time sequence sampler to obtain a third value matrix;
determining mutual attention characteristics associated with the sample video representation information according to the third query matrix, the third key matrix and the third value matrix;
performing activation processing on the attention mask matrix to obtain attention activation characteristics, and determining the product of the attention activation characteristics and the mutual attention characteristics as the output characteristics of the mutual attention component in the initial time sequence sampler;
and inputting the output characteristics of the mutual attention component in the initial time sequence sampler to the self attention component in the initial time sequence sampler, and carrying out self-attention coding processing on the output characteristics of the mutual attention component through the self attention component to obtain sample time sequence sampling information corresponding to the sample video representation information.
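In claim 11 the attention mask matrix is activated and then multiplied into the mutual attention output. The sketch below assumes a sigmoid activation and an element-wise product with a broadcastable mask, which is only one plausible reading of "activation processing" and "product".

```python
import math
import torch
import torch.nn.functional as F

def masked_mutual_attention(sample_timing, sample_video_repr, w1, w2, w3, attention_mask):
    q = sample_timing @ w1                                    # third query matrix
    k = sample_video_repr @ w2                                # third key matrix
    v = sample_video_repr @ w3                                # third value matrix
    weights = F.softmax(q @ k.transpose(-2, -1) / math.sqrt(q.shape[-1]), dim=-1)
    mutual = weights @ v                                      # mutual attention characteristics
    gate = torch.sigmoid(attention_mask)                      # attention activation characteristics (assumed sigmoid)
    return gate * mutual                                      # output of the mutual attention component
```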
12. The method of claim 10, wherein the correcting network parameters of the initial time sequence sampler, the initial multi-mode encoder and the initial text decoder according to the attention mask matrix, the text prediction probability matrix and the sample content description text, and determining an initial generation model containing the corrected network parameters as a target generation model, comprises:
determining the sum of absolute values of all elements in the attention mask matrix as an attention mask activation value, and determining the product between the attention mask activation value and a sparse constraint parameter as a sparse attention constraint corresponding to the initial time sequence sampler;
determining an autoregressive loss associated with the initial text decoder according to the number of the sample videos and the prediction probability of each unit character in the sample content description text in the text prediction probability matrix;
determining the sum of the sparse attention constraint and the autoregressive loss as a model total loss corresponding to the initial generation model;
and carrying out iterative training on the network parameters of the initial time sequence sampler, the initial multi-mode encoder and the initial text decoder based on the model total loss, stopping the training when the model total loss meets a training ending condition, and determining the visual encoder, the text encoder, and the initial time sequence sampler, the initial multi-mode encoder and the initial text decoder obtained at the end of training as the target generation model.
13. The method of claim 12, wherein the determining an autoregressive loss associated with the initial text decoder according to the number of the sample videos and the prediction probability of each unit character in the sample content description text in the text prediction probability matrix comprises:
obtaining the prediction probability corresponding to each unit character in the sample content description text from the text prediction probability matrix, and carrying out logarithmic operation on the prediction probability corresponding to each unit character in the sample content description text to obtain a logarithmic probability value corresponding to each unit character in the sample content description text;
and accumulating the logarithmic probability values corresponding to each unit character in the sample content description text to obtain a logarithmic probability total value corresponding to the sample video, and determining the autoregressive loss associated with the initial text decoder according to the ratio between the logarithmic probability total value and the number of the sample videos.
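Claims 12 and 13 combine a sparsity constraint on the attention mask with an autoregressive term built from per-character log probabilities. A minimal sketch follows; treating the autoregressive loss as the negative of the averaged log-probability total is an assumption, since the claims only fix the ratio itself.

```python
import torch

def total_model_loss(attention_mask, char_probs_per_video, sparse_lambda=0.01):
    """char_probs_per_video: one tensor per sample video, holding the predicted
    probability of every unit character of that video's content description text."""
    # Sparse attention constraint: sparse constraint parameter times the sum of absolute mask elements.
    sparse_constraint = sparse_lambda * attention_mask.abs().sum()
    # Autoregressive loss: accumulate log probabilities per video, divide by the number of sample videos.
    log_prob_total = sum(torch.log(p).sum() for p in char_probs_per_video)
    autoregressive_loss = -log_prob_total / len(char_probs_per_video)
    return sparse_constraint + autoregressive_loss             # model total loss
```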
14. A data processing apparatus, comprising:
the first data acquisition module is used for acquiring video data and video text distribution data corresponding to the video data;
the first feature extraction module is used for acquiring video representation information corresponding to the video data and acquiring text representation information corresponding to the video text distribution data;
the first time sequence sampling module is used for performing time sequence sampling processing on the video representation information to obtain video time sequence sampling information corresponding to the video data, and combining the video time sequence sampling information and the text representation information into a multi-mode combination feature;
the first text generation module is used for carrying out coding processing on the multi-mode combination characteristics to obtain multi-mode fusion coding characteristics, and carrying out text decoding processing on the multi-mode fusion coding characteristics to obtain video content description text associated with the video data.
15. A data processing apparatus, comprising:
the second data acquisition module is used for acquiring a sample video, sample text matching data corresponding to the sample video and sample content description text;
the second feature extraction module is used for outputting sample video representation information corresponding to the sample video through a visual encoder in an initial generation model, and outputting sample text representation information corresponding to the sample text matching data through a text encoder in the initial generation model; the visual encoder and the text encoder are trained based on a plurality of image-text data sets, wherein one image-text data set comprises one sample image and one piece of image text data;
the second time sequence sampling module is used for acquiring an attention mask matrix in an initial time sequence sampler contained in the initial generation model, performing time sequence sampling processing on the sample video representation information according to the attention mask matrix and a mutual attention component and a self attention component in the initial time sequence sampler to obtain sample time sequence sampling information, and combining the sample time sequence sampling information and the sample text representation information into a sample multi-mode feature;
the second text generation module is used for carrying out coding processing on the sample multi-mode feature through an initial multi-mode encoder in the initial generation model to obtain sample fusion coding characteristics, and outputting a text prediction probability matrix corresponding to the sample fusion coding characteristics through an initial text decoder in the initial generation model; the text prediction probability matrix is used for representing the prediction probability corresponding to each unit character in the sample content description text;
the network parameter correction module is used for correcting network parameters of the initial time sequence sampler, the initial multi-mode encoder and the initial text decoder according to the attention mask matrix, the text prediction probability matrix and the sample content description text, and determining an initial generation model containing the corrected network parameters as a target generation model; the target generation model is used for generating video content description text for input video data and video text matching data.
16. A computer device comprising a memory and a processor;
the memory is connected to the processor, the memory is configured to store a computer program, and the processor is configured to invoke the computer program to cause the computer device to perform the method of any one of claims 1 to 9 or the method of any one of claims 10 to 13.
17. A computer readable storage medium, characterized in that the computer readable storage medium has stored therein a computer program adapted to be loaded and executed by a processor to cause a computer device having the processor to perform the method of any of claims 1 to 9 or to perform the method of any of claims 10 to 13.
CN202310506746.9A 2023-05-08 2023-05-08 Data processing method, device, equipment and medium Active CN116246213B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202310506746.9A CN116246213B (en) 2023-05-08 2023-05-08 Data processing method, device, equipment and medium

Publications (2)

Publication Number Publication Date
CN116246213A (en) 2023-06-09
CN116246213B (en) 2023-07-28

Family

ID=86631660

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202310506746.9A Active CN116246213B (en) 2023-05-08 2023-05-08 Data processing method, device, equipment and medium

Country Status (1)

Country Link
CN (1) CN116246213B (en)

Patent Citations (13)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN111464881A (en) * 2019-01-18 2020-07-28 复旦大学 Full-convolution video description generation method based on self-optimization mechanism
CN111625660A (en) * 2020-05-27 2020-09-04 腾讯科技(深圳)有限公司 Dialog generation method, video comment method, device, equipment and storage medium
CN113392686A (en) * 2020-10-09 2021-09-14 腾讯科技(深圳)有限公司 Video analysis method, device and storage medium
WO2021184026A1 (en) * 2021-04-08 2021-09-16 Innopeak Technology, Inc. Audio-visual fusion with cross-modal attention for video action recognition
WO2023035610A1 (en) * 2021-09-09 2023-03-16 中山大学 Video question-answering method and system based on keyword perception multi-modal attention
CN113902964A (en) * 2021-09-09 2022-01-07 中山大学 Multi-mode attention video question-answering method and system based on keyword perception
CN114064967A (en) * 2022-01-18 2022-02-18 之江实验室 Cross-modal time sequence behavior positioning method and device of multi-granularity cascade interactive network
CN114694001A (en) * 2022-02-15 2022-07-01 北京深睿博联科技有限责任公司 Target detection method and device based on multi-modal image fusion
CN114627162A (en) * 2022-04-01 2022-06-14 杭州电子科技大学 Multimodal dense video description method based on video context information fusion
CN115100090A (en) * 2022-06-09 2022-09-23 北京邮电大学 Monocular image depth estimation system based on space-time attention
CN115361595A (en) * 2022-07-28 2022-11-18 华中科技大学 Video bullet screen generation method
CN115307780A (en) * 2022-09-29 2022-11-08 中国海洋大学 Sea surface temperature prediction method, system and application based on time-space information interaction fusion
CN115809438A (en) * 2023-01-18 2023-03-17 中国科学技术大学 Multi-modal emotion analysis method, system, device and storage medium

Cited By (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN116935063A (en) * 2023-07-24 2023-10-24 北京中科睿途科技有限公司 Method for generating driver state text in intelligent cabin environment and related equipment
CN116935063B (en) * 2023-07-24 2024-03-08 北京中科睿途科技有限公司 Method for generating driver state text in intelligent cabin environment and related equipment
CN116630633A (en) * 2023-07-26 2023-08-22 上海蜜度信息技术有限公司 Automatic labeling method and system for semantic segmentation, storage medium and electronic equipment
CN116630633B (en) * 2023-07-26 2023-11-07 上海蜜度信息技术有限公司 Automatic labeling method and system for semantic segmentation, storage medium and electronic equipment

Also Published As

Publication number Publication date
CN116246213B (en) 2023-07-28

Similar Documents

Publication Publication Date Title
CN109947912B (en) Model method based on intra-paragraph reasoning and joint question answer matching
CN116246213B (en) Data processing method, device, equipment and medium
CN107705784B (en) Text regularization model training method and device, and text regularization method and device
CN108153913B (en) Training method of reply information generation model, reply information generation method and device
CN110326002B (en) Sequence processing using online attention
CN109740158B (en) Text semantic parsing method and device
CN114676234A (en) Model training method and related equipment
CN113705313A (en) Text recognition method, device, equipment and medium
CN115601485B (en) Data processing method of task processing model and virtual character animation generation method
CN113392265A (en) Multimedia processing method, device and equipment
CN116050496A (en) Determination method and device, medium and equipment of picture description information generation model
CN117033609B (en) Text visual question-answering method, device, computer equipment and storage medium
CN114611498A (en) Title generation method, model training method and device
CN114282013A (en) Data processing method, device and storage medium
CN113705315A (en) Video processing method, device, equipment and storage medium
CN113240115A (en) Training method for generating face change image model and related device
CN116975350A (en) Image-text retrieval method, device, equipment and storage medium
Chowdhury et al. A cascaded long short-term memory (LSTM) driven generic visual question answering (VQA)
CN116611459B (en) Translation model training method and device, electronic equipment and storage medium
CN116958738A (en) Training method and device of picture recognition model, storage medium and electronic equipment
CN114333069B (en) Object posture processing method, device, equipment and storage medium
CN113591493B (en) Translation model training method and translation model device
CN114625759A (en) Model training method, intelligent question answering method, device, medium, and program product
CN111325068B (en) Video description method and device based on convolutional neural network
CN114372499A (en) Natural language processing method, device, equipment and computer readable storage medium

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant
REG Reference to a national code
Ref country code: HK
Ref legal event code: DE
Ref document number: 40086894
Country of ref document: HK