CN113810730B - Video-based real-time text generation method and device and computing equipment - Google Patents

Video-based real-time text generation method and device and computing equipment Download PDF

Info

Publication number
CN113810730B
CN113810730B CN202111091882.3A CN202111091882A
Authority
CN
China
Prior art keywords
feature vector
video
video image
frame
coding
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN202111091882.3A
Other languages
Chinese (zh)
Other versions
CN113810730A (en)
Inventor
吴志勇
史佳慧
裴兴
郭宇
斯凌
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
China Mobile Communications Group Co Ltd
MIGU Digital Media Co Ltd
MIGU Culture Technology Co Ltd
Original Assignee
China Mobile Communications Group Co Ltd
MIGU Digital Media Co Ltd
MIGU Culture Technology Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by China Mobile Communications Group Co Ltd, MIGU Digital Media Co Ltd, MIGU Culture Technology Co Ltd filed Critical China Mobile Communications Group Co Ltd
Priority to CN202111091882.3A priority Critical patent/CN113810730B/en
Publication of CN113810730A publication Critical patent/CN113810730A/en
Application granted granted Critical
Publication of CN113810730B publication Critical patent/CN113810730B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical

Links

Classifications

    • HELECTRICITY
    • H04ELECTRIC COMMUNICATION TECHNIQUE
    • H04NPICTORIAL COMMUNICATION, e.g. TELEVISION
    • H04N21/00Selective content distribution, e.g. interactive television or video on demand [VOD]
    • H04N21/20Servers specifically adapted for the distribution of content, e.g. VOD servers; Operations thereof
    • H04N21/21Server components or server architectures
    • H04N21/218Source of audio or video content, e.g. local disk arrays
    • H04N21/2187Live feed
    • HELECTRICITY
    • H04ELECTRIC COMMUNICATION TECHNIQUE
    • H04NPICTORIAL COMMUNICATION, e.g. TELEVISION
    • H04N21/00Selective content distribution, e.g. interactive television or video on demand [VOD]
    • H04N21/20Servers specifically adapted for the distribution of content, e.g. VOD servers; Operations thereof
    • H04N21/23Processing of content or additional data; Elementary server operations; Server middleware
    • H04N21/234Processing of video elementary streams, e.g. splicing of video streams, manipulating MPEG-4 scene graphs
    • H04N21/23418Processing of video elementary streams, e.g. splicing of video streams, manipulating MPEG-4 scene graphs involving operations for analysing video streams, e.g. detecting features or characteristics
    • HELECTRICITY
    • H04ELECTRIC COMMUNICATION TECHNIQUE
    • H04NPICTORIAL COMMUNICATION, e.g. TELEVISION
    • H04N21/00Selective content distribution, e.g. interactive television or video on demand [VOD]
    • H04N21/20Servers specifically adapted for the distribution of content, e.g. VOD servers; Operations thereof
    • H04N21/25Management operations performed by the server for facilitating the content distribution or administrating data related to end-users or client devices, e.g. end-user or client device authentication, learning user preferences for recommending movies
    • H04N21/266Channel or content management, e.g. generation and management of keys and entitlement messages in a conditional access system, merging a VOD unicast channel into a multicast channel
    • H04N21/26603Channel or content management, e.g. generation and management of keys and entitlement messages in a conditional access system, merging a VOD unicast channel into a multicast channel for automatically generating descriptors from content, e.g. when it is not made available by its provider, using content analysis techniques
    • HELECTRICITY
    • H04ELECTRIC COMMUNICATION TECHNIQUE
    • H04NPICTORIAL COMMUNICATION, e.g. TELEVISION
    • H04N21/00Selective content distribution, e.g. interactive television or video on demand [VOD]
    • H04N21/40Client devices specifically adapted for the reception of or interaction with content, e.g. set-top-box [STB]; Operations thereof
    • H04N21/43Processing of content or additional data, e.g. demultiplexing additional data from a digital video stream; Elementary client operations, e.g. monitoring of home network or synchronising decoder's clock; Client middleware
    • H04N21/44Processing of video elementary streams, e.g. splicing a video clip retrieved from local storage with an incoming video stream, rendering scenes according to MPEG-4 scene graphs
    • H04N21/44008Processing of video elementary streams, e.g. splicing a video clip retrieved from local storage with an incoming video stream, rendering scenes according to MPEG-4 scene graphs involving operations for analysing video streams, e.g. detecting features or characteristics in the video stream
    • HELECTRICITY
    • H04ELECTRIC COMMUNICATION TECHNIQUE
    • H04NPICTORIAL COMMUNICATION, e.g. TELEVISION
    • H04N21/00Selective content distribution, e.g. interactive television or video on demand [VOD]
    • H04N21/40Client devices specifically adapted for the reception of or interaction with content, e.g. set-top-box [STB]; Operations thereof
    • H04N21/45Management operations performed by the client for facilitating the reception of or the interaction with the content or administrating data related to the end-user or to the client device itself, e.g. learning user preferences for recommending movies, resolving scheduling conflicts
    • H04N21/466Learning process for intelligent management, e.g. learning user preferences for recommending movies
    • H04N21/4662Learning process for intelligent management, e.g. learning user preferences for recommending movies characterized by learning algorithms
    • H04N21/4666Learning process for intelligent management, e.g. learning user preferences for recommending movies characterized by learning algorithms using neural networks, e.g. processing the feedback provided by the user
    • HELECTRICITY
    • H04ELECTRIC COMMUNICATION TECHNIQUE
    • H04NPICTORIAL COMMUNICATION, e.g. TELEVISION
    • H04N21/00Selective content distribution, e.g. interactive television or video on demand [VOD]
    • H04N21/80Generation or processing of content or additional data by content creator independently of the distribution process; Content per se
    • H04N21/83Generation or processing of protective or descriptive data associated with content; Content structuring
    • H04N21/84Generation or processing of descriptive data, e.g. content descriptors

Landscapes

  • Engineering & Computer Science (AREA)
  • Multimedia (AREA)
  • Signal Processing (AREA)
  • Databases & Information Systems (AREA)
  • Artificial Intelligence (AREA)
  • Evolutionary Computation (AREA)
  • Processing Or Creating Images (AREA)
  • Compression Or Coding Systems Of Tv Signals (AREA)

Abstract

The invention discloses a video-based real-time text generation method, apparatus and computing device. The method comprises the following steps: performing feature processing on a currently played video image and the played video images of the video to obtain a first video frame feature vector and a second video frame feature vector; determining the association relation between the currently played video image and the played video images according to the first and second video frame feature vectors, extracting a key feature vector corresponding to the currently played video image based on the association relation, and correcting the first video frame feature vector according to the key feature vector to obtain a coding feature vector corresponding to the currently played video image; acquiring derived data associated with the video and vectorizing the derived data to obtain a derived data feature vector; and decoding the coding feature vector and the derived data feature vector to obtain the real-time video text corresponding to the currently played video image, thereby generating vivid, flexible and accurate video text.

Description

Video-based real-time text generation method and device and computing equipment
Technical Field
The present invention relates to the field of computer technologies, and in particular, to a method and apparatus for generating real-time text based on video, and a computing device.
Background
In the prior art, text is mainly generated automatically by adversarial networks, deep learning models and the like, but most of the text produced by these techniques is generic and cannot be applied to special scenes such as live sports commentary in an arena. In particular, for games such as ice hockey, which involve a great deal of background knowledge and rapidly changing real-time scenes, the text produced by general-purpose text generation techniques clearly cannot meet the requirements. Thus, there is a need for a solution that enables vivid, flexible and accurate real-time generation of video text.
Disclosure of Invention
In view of the foregoing, embodiments of the present invention are directed to providing a method, apparatus, and computing device for video-based real-time text generation that overcomes or at least partially solves the foregoing problems.
According to an aspect of an embodiment of the present invention, there is provided a real-time text generation method based on video, including:
performing feature processing on a currently played video image and a played video image of the video to obtain a first video frame feature vector and a second video frame feature vector, wherein the currently played video image is the video image that is currently being played;
determining the association relation between the current playing video image and the played video image according to the first video frame feature vector and the second video frame feature vector, extracting a key feature vector corresponding to the current playing video image based on the association relation, and correcting the first video frame feature vector according to the key feature vector to obtain a coding feature vector corresponding to the current playing video image;
acquiring derivative data associated with a video, and carrying out vectorization processing on the derivative data to obtain derivative data feature vectors;
and decoding the coded feature vector and the derived data feature vector to obtain the real-time video text corresponding to the current playing video image.
According to another aspect of an embodiment of the present invention, there is provided a real-time text generation apparatus based on video, including:
the feature processing module is suitable for carrying out feature processing on a currently played video image and a played video image of the video to obtain a first video frame feature vector and a second video frame feature vector, wherein the currently played video image is the video image that is currently being played;
the coding module is suitable for determining the association relation between the current playing video image and the played video image according to the first video frame feature vector and the second video frame feature vector, extracting a key feature vector corresponding to the current playing video image based on the association relation, and correcting the first video frame feature vector according to the key feature vector to obtain a coding feature vector corresponding to the current playing video image;
the derived data processing module is suitable for acquiring derived data associated with the video, and carrying out vectorization processing on the derived data to obtain a derived data feature vector;
and the decoding module is suitable for decoding the coded feature vector and the derived data feature vector to obtain a real-time video text corresponding to the current playing video image.
According to yet another aspect of an embodiment of the present invention, there is provided a computing device including: the device comprises a processor, a memory, a communication interface and a communication bus, wherein the processor, the memory and the communication interface are communicated with each other through the communication bus;
the memory is used for storing at least one executable instruction, and the executable instruction enables the processor to execute the operation corresponding to the video-based real-time text generation method.
According to still another aspect of the embodiments of the present invention, there is provided a computer storage medium having at least one executable instruction stored therein, the executable instruction causing a processor to perform operations corresponding to the video-based real-time text generation method described above.
According to the solution provided by the invention, the association relation between the video frame image at the current moment and the already-played video frame images can be learned, the key features of the current video frame image can be mined based on that association relation, and data such as the development history of the sport, team information and player information can be drawn upon, so that real-time video text is generated vividly, flexibly and efficiently. This solves the problems of information loss, inflexible results and templated output in current intelligent real-time text generation methods for sports events, provides reliable support for event broadcasting and the popularization of the sport, and reduces the operating cost of event rebroadcasting.
The foregoing description is only an overview of the technical solutions of the embodiments of the present invention, and may be implemented according to the content of the specification, so that the technical means of the embodiments of the present invention can be more clearly understood, and the following specific implementation of the embodiments of the present invention will be more apparent.
Drawings
Various other advantages and benefits will become apparent to those of ordinary skill in the art upon reading the following detailed description of the preferred embodiments. The drawings are only for purposes of illustrating the preferred embodiments and are not to be construed as limiting the invention. Also, like reference numerals are used to designate like parts throughout the figures. In the drawings:
FIG. 1A shows a flow chart of a video-based real-time text generation method provided by an embodiment of the invention;
FIG. 1B is a schematic diagram of text generation model training;
fig. 2 is a schematic structural diagram of a real-time video-based text generating apparatus according to an embodiment of the present invention;
FIG. 3 illustrates a schematic diagram of a computing device provided by an embodiment of the present invention.
Detailed Description
Exemplary embodiments of the present invention will be described in more detail below with reference to the accompanying drawings. While exemplary embodiments of the present invention are shown in the drawings, it should be understood that the present invention may be embodied in various forms and should not be limited to the embodiments set forth herein. Rather, these embodiments are provided so that this disclosure will be thorough and complete, and will fully convey the scope of the invention to those skilled in the art.
Fig. 1A shows a flowchart of a video-based real-time text generation method according to an embodiment of the present invention. As shown in fig. 1A, the method comprises the steps of:
step S101, performing feature processing on a currently played video image and a played video image of a video to obtain a first video frame feature vector and a second video frame feature vector, wherein the currently played video image is the currently played video image.
The video in this embodiment is a video of a game in an arena that is currently being broadcast live, for example an ice hockey game video, a basketball game video or a table tennis game video. In this step, the currently played video image of the video is the video image being played at the current moment, and the played video images are the video images that have already been played. The purpose of this embodiment is to generate real-time video text for the video image currently being played; that is, as the game proceeds and time elapses, the video image is continuously updated, and the video text corresponding to the latest video image is generated in real time.
Specifically, the video frame features of the current playing video image and the played video image may be obtained based on the 3D convolutional neural network, for example, the current playing video image and the played video image are input into the 3D convolutional neural network, and the first video frame feature vector corresponding to the current playing video image and the second video frame feature vector corresponding to the played video image are output.
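As an illustrative sketch only (not the patented implementation), the following Python code shows one way such 3D-convolutional frame features could be extracted; the torchvision r3d_18 backbone, the clip lengths and the pooling strategy are assumptions.

```python
# Hypothetical sketch: extracting a feature vector for the window of frames
# around the currently played image and for the already-played frames with a
# 3D CNN. The backbone choice and shapes are assumptions, not the patent's
# concrete network.
import torch
from torchvision.models.video import r3d_18

backbone = r3d_18(weights=None)          # pretrained weights could be loaded instead
backbone.fc = torch.nn.Identity()        # keep the 512-d pooled feature
backbone.eval()

def clip_feature(frames: torch.Tensor) -> torch.Tensor:
    """frames: (T, 3, H, W) in [0, 1] -> (512,) feature vector."""
    clip = frames.permute(1, 0, 2, 3).unsqueeze(0)   # (1, 3, T, H, W)
    with torch.no_grad():
        return backbone(clip).squeeze(0)

current_clip = torch.rand(16, 3, 112, 112)   # frames around the current image
played_clip = torch.rand(64, 3, 112, 112)    # frames already played
first_vec = clip_feature(current_clip)       # "first video frame feature vector"
second_vec = clip_feature(played_clip)       # "second video frame feature vector"
```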
Step S102, determining the association relation between the current playing video image and the played video image according to the first video frame feature vector and the second video frame feature vector, extracting the key feature vector corresponding to the current playing video image based on the association relation, and correcting the first video frame feature vector according to the key feature vector to obtain the coding feature vector corresponding to the current playing video image.
In order to generate vivid, flexible and accurate video text, after the first video frame feature vector corresponding to the currently played video image and the second video frame feature vector corresponding to the played video images are obtained, the association relation between the currently played video image and the played video images can be determined from these two feature vectors. The association relation reflects the correlation between features; for example, if two shots on goal appear in the played video images and a shot still appears in the currently played video image, it can be determined that the currently played video image shows the third shot. After the association relation is determined, the key feature vector corresponding to the currently played video image is extracted based on it. The key features are the important features of the currently played video image, for example whether it shows a penalty, a goal, a shot, a save, a celebration, a replay, a close-up shot and the like. After the key feature vector is extracted, the first video frame feature vector is corrected according to the key feature vector, and the corrected result is the coding feature vector corresponding to the currently played video image.
Preferably, the key feature vector and the first video frame feature vector can be cross-multiplied to obtain the coding feature vector corresponding to the currently played video image. The role of the cross multiplication is to distinguish the importance of different positions in the currently played video image and the relevance of the currently played video image relative to the played video images.
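A minimal sketch of this correction step, assuming the "vector cross multiplication" is realized as an outer product of the two vectors (the patent does not fix the exact operator):

```python
# Hedged sketch: "correcting" the first video frame feature vector with the
# key feature vector. Interpreting the cross multiplication as an outer
# product, and the vector sizes, are assumptions for illustration.
import torch

key_vec = torch.rand(8)          # key feature vector (assumed dimension)
first_vec = torch.rand(512)      # first video frame feature vector

encoded = torch.outer(key_vec, first_vec)    # (8, 512): position/key importance map
encoded_vec = encoded.flatten()              # flattened coding feature vector
```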
Optionally, in this embodiment a multi-head multi-layer self-attention text generation model is trained in advance; it is composed of an encoding network and a decoding network. The first video frame feature vector and the second video frame feature vector can be input into the encoding network of the pre-trained model. The encoding network determines the association relation between the currently played video image and the played video images, extracts the key feature vector corresponding to the currently played video image based on that relation, and corrects the first video frame feature vector according to the key feature vector to obtain the coding feature vector corresponding to the currently played video image.
Step S103, obtaining derivative data associated with the video, and carrying out vectorization processing on the derivative data to obtain derivative data feature vectors.
Derived data is data related to the video, including but not limited to the following information: athlete information, coach information, referee information, game item history information, playing field information, pre-game news information, and team history information. The derived data covers background knowledge of the game, real-time scene changes and the like, and serves as prior knowledge of the video to supplement the commentary.
In this step, the derived data associated with the video is obtained and then vectorized to obtain derived data feature vectors. Since there may be multiple pieces of derived data, each piece can be vectorized separately to obtain multiple derived data feature vectors, which are then fused into a single derived data feature vector.
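For illustration, a toy sketch of vectorizing several pieces of derived data and fusing them into a single feature vector; the hashing-based vectorizer and the mean fusion are assumptions, not the patented method:

```python
# Hedged sketch: vectorize each piece of derived data separately, then fuse.
# The text-to-vector mapping and mean fusion are illustrative assumptions.
import torch

def vectorize(text: str, dim: int = 128) -> torch.Tensor:
    """Toy deterministic text-to-vector mapping (bag of hashed tokens)."""
    vec = torch.zeros(dim)
    for token in text.lower().split():
        vec[hash(token) % dim] += 1.0
    return vec

derived_items = [
    "player roster and season statistics",   # hypothetical derived data items
    "team head-to-head history",
    "pre-game news summary",
]
item_vectors = [vectorize(item) for item in derived_items]
derived_vec = torch.stack(item_vectors).mean(dim=0)   # fused derived data feature vector
```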
This embodiment does not limit the execution order of steps S101-S102 relative to step S103.
And step S104, decoding the coded feature vector and the derived data feature vector to obtain a real-time video text corresponding to the current playing video image.
After the coding feature vector and the derived data feature vector are determined, decoding the coding feature vector and the derived data feature vector to obtain the real-time video text corresponding to the current playing video image. The real-time video text may be a video caption corresponding to the currently played video image.
Preferably, vector addition can be performed on the coding feature vector and the derived data feature vector to obtain a fused coding feature vector, and the fused coding feature vector is then decoded by the decoding network of the pre-trained multi-head multi-layer self-attention text generation model to obtain the real-time video text corresponding to the currently played video image.
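A hedged sketch of this fusion-then-decode step, where a generic PyTorch Transformer decoder and the vector dimensions stand in for the patent's decoding network:

```python
# Sketch of the fusion step: the coding feature vector and the derived-data
# feature vector are added, then handed to a (here randomly initialized)
# Transformer decoder. Dimensions and the use of nn.TransformerDecoder are
# illustrative assumptions.
import torch
from torch import nn

d_model = 128
encoded_vec = torch.rand(1, 1, d_model)     # coding feature vector
derived_vec = torch.rand(1, 1, d_model)     # derived data feature vector
memory = encoded_vec + derived_vec          # fused coding feature vector

decoder_layer = nn.TransformerDecoderLayer(d_model=d_model, nhead=8, batch_first=True)
decoder = nn.TransformerDecoder(decoder_layer, num_layers=2)

caption_so_far = torch.rand(1, 5, d_model)  # embeddings of already-emitted caption tokens
out = decoder(tgt=caption_so_far, memory=memory)  # hidden states for the next caption token
```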
In an alternative embodiment of the present invention, the text generation model training process includes:
obtaining sample video images of each frame in a sample video and labeling sample video texts corresponding to the sample video images of each frame;
obtaining derivative data associated with the sample video, and carrying out vectorization processing on the derivative data to obtain derivative data feature vectors;
performing feature processing on the T-th frame sample video image and the previous T-1 frame sample video image to obtain a third video frame feature vector and a fourth video frame feature vector;
training a coding network of the multi-head multi-layer self-attention text generation model according to the position coding of the T-frame sample video image, the third video frame feature vector and the fourth video frame feature vector, determining the association relation between the T-frame sample video image and the previous T-1 frame sample video image, extracting the key feature vector corresponding to the T-frame sample video image according to the association relation, and correcting the third video frame feature vector according to the key feature vector to obtain the coding feature vector corresponding to the T-frame sample video image;
training a decoding network of the multi-head multi-layer self-attention text generation model based on coding network weight parameters, coding feature vectors and derived data feature vectors of coding network training to obtain a video text corresponding to a T frame sample video image;
obtaining a text generation model loss function according to text loss between the video text corresponding to the T-th frame sample video image and the labeling sample video text, and updating the coding network weight parameter and the decoding network weight parameter according to the text generation model loss function;
and iteratively executing the above steps until the preset convergence condition is met; a schematic sketch of such a training loop is given below.
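The following sketch outlines such a training loop. The model, optimizer, dataloader and convergence test are placeholders, and token-level cross-entropy against the labeled sample captions is assumed as the text loss; none of these specifics are prescribed by the patent.

```python
# Schematic training loop for the encoder-decoder text generation model.
# `model`, `optimizer` and `dataloader` are hypothetical placeholders.
import torch
from torch import nn

criterion = nn.CrossEntropyLoss()

def train(model, optimizer, dataloader, max_epochs=50, tol=1e-3):
    prev_loss = float("inf")
    for epoch in range(max_epochs):
        epoch_loss = 0.0
        for frames, derived_vec, caption_ids in dataloader:
            # teacher forcing: feed the caption prefix, predict the next tokens
            logits = model(frames, derived_vec, caption_ids[:, :-1])
            loss = criterion(logits.reshape(-1, logits.size(-1)),
                             caption_ids[:, 1:].reshape(-1))
            optimizer.zero_grad()
            loss.backward()        # updates both encoding- and decoding-network weights
            optimizer.step()
            epoch_loss += loss.item()
        if abs(prev_loss - epoch_loss) < tol:   # stand-in for the preset convergence condition
            break
        prev_loss = epoch_loss
```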
It should be noted that the output and input of the previous layer multi-head attention are combined as the input of the next layer multi-head attention.
For ease of understanding, the training process of the text generation model is described below by taking an ice hockey game video as the video and generating the caption of its currently played video image as an example. FIG. 1B is a schematic diagram of the text generation model training:
step 1: the method comprises the steps of acquiring a video set of the ice hockey game through data collection, acquiring derivative data of each ice hockey game as sample video, specifically including but not limited to related information, personal data and the like of athletes, coaches and referees, various news information about team games, such as social attention, and the like, team history information about both players, local information about game places, such as fan reflection, and the like, carrying out vectorization processing on the derivative data to obtain derivative data feature vectors, and recording the derivative data feature vectors as p.
Step 2: and acquiring complete caption text corresponding to the ice hockey game video, taking the complete caption text as a training learning target, namely marking the sample video text, carrying out vectorization processing on the caption text, and obtaining a caption text feature vector which is marked as n.
Step 3: and encoding the video of the ice hockey game. The feature vector of the video frame is obtained by using a 3DCNN algorithm, and is marked as T, wherein T is a video frame feature vector of a time sequence, and the specific expression is shown in a formula (1), and it is to be noted that T is the video frame feature vector of the current moment, namely, the obtained video frame feature vector contains the video frame feature vector of the T-th frame sample video image and the video frame feature vector of the previous T-1 frame sample video image.
t = [t_1, t_2, …, t_T]^T    (1)
The video frame feature vectors are input into the coding network of the multi-head multi-layer self-attention text generation model, and the coding network is trained. Based on the multi-head attention method, the parameters Q, K and V are:

Q_i = t × W_i^Q    (2)

K_i = t × W_i^K    (3)

V_i = t × W_i^V    (4)
where W_i^Q ∈ R^(d×d), W_i^K ∈ R^(d×d) and W_i^V ∈ R^(d×d) are the weight parameters of the attention layer's query, key and value, and i is the number of the self-attention head. Taking the output of the multi-head attention as z, there is

z = concat(z_1, z_2, …, z_i) × w    (5)

where w is the point multiplication between the self-attentions of the different heads, w = [w_1, w_2, …, w_i]^T, used to mine the correlation information between the T-th frame sample video image and the previous T-1 frame sample video images in the video. Each head is computed by scaled dot-product attention:

z_i = softmax(Q_i K_i^T / √d_i) V_i    (6)

where √d_i is the scale of the i-th head, used to prevent the result from becoming too large; its specific value depends on the concrete situation. In the coding part, the output and input of the previous multi-head attention layer are added and passed to a normalization layer; the normalized data is learned by a feed-forward neural network and used as the input of the next multi-head attention layer, and so on, so as to determine the association relation between the T-th frame sample video image and the previous T-1 frame sample video images and to extract the key feature vector of the T-th frame sample video image based on that association relation. The key features record, for example, whether the frame belongs to a penalty, a goal, a shot, a save, a celebration, a replay, a close-up shot and the like, and the extracted output is denoted Φ. Φ is cross-multiplied with the feature vector t_i of the T-th frame sample video image, and the result is taken as the coding feature vector corresponding to the T-th frame sample video image; the cross multiplication serves to distinguish the importance of different positions of the T-th frame sample video image and its relevance relative to the previous T-1 frame sample video images. The coding feature vector is denoted e, with

e = Φ × t_i    (7)
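For illustration only, the following sketch implements the encoder-side computation of equations (2)-(7) as reconstructed above; the dimensions, the per-head weight vector w and the reading of the cross multiplication as an outer product are assumptions.

```python
# Hedged sketch of the encoder-side computation of equations (2)-(7).
# Dimensions, the learned per-head weights w, and the interpretation of
# "cross multiplication" as an outer product are assumptions for illustration.
import math
import torch

T, d, heads = 10, 64, 4                    # frames, feature dim, attention heads
t = torch.rand(T, d)                        # t = [t_1, ..., t_T]^T, eq. (1)

W_q = torch.rand(heads, d, d)               # W_i^Q, eq. (2)
W_k = torch.rand(heads, d, d)               # W_i^K, eq. (3)
W_v = torch.rand(heads, d, d)               # W_i^V, eq. (4)
w = torch.rand(heads)                       # per-head weights w = [w_1, ..., w_i]^T

head_outputs = []
for i in range(heads):
    Q_i, K_i, V_i = t @ W_q[i], t @ W_k[i], t @ W_v[i]
    attn = torch.softmax(Q_i @ K_i.T / math.sqrt(d), dim=-1)  # scaled dot product, eq. (6)
    head_outputs.append(w[i] * (attn @ V_i))                  # weight each head by w_i

z = torch.cat(head_outputs, dim=-1)         # z = concat(z_1, ..., z_i) × w, eq. (5)

phi = z.mean(dim=0)                         # stand-in for the key feature vector Φ
e = torch.outer(phi, t[-1])                 # e = Φ × t_i, eq. (7)
```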
Step 4: the encoded feature vector is added to the derived data. Merging the derivative data feature vector p of the derivative data such as team information, player information, pre-competition news and the like subjected to vectorization into the coding feature vector corresponding to the T-th frame sample video image, and recording the coding feature vector added with the derivative data as beta
β = e + p    (8)
Step 5: and decoding video frame content. After the vectorization processing of the caption text of the ice hockey game video, the feature vector of the caption text is n= [ n ] 1 ,n 2 ,…,n l ] T Where l represents the amount of information of the caption text, video frame decoding is based on the encoded feature vector added to the derived data as input, the decoding network is trained, e.g., frame-by-frame recognition and training to predict the caption of a sequence of video, and the untrained predicted caption text is mask processed to mask the effect of posterior data on the overall decoding result. Taking the text of the caption with trained prediction asWherein τ < l. The caption text to be predicted is +.>During the training of the decoding network, the trained caption text is also used as the input of the decoding part +.>Where τ < l, in order to facilitate more accurate decoding of the network being trained. />The parameters Q, K, and V are:
wherein, the liquid crystal display device comprises a liquid crystal display device,the weight parameters of the attribute layer query, key and value are that i is the head number of self-attribute, K and Q are the values obtained by adding K and Q after the training of the coding network and the derived data, and V is the result after the last training of the decoding network. Taking the output of the multi-layer multi-head attention as ψ, ψ=concat (ψ) 12 ,…,ψ τ θ), there are:
where θ is the point multiplication between different head self-attentions, θ= [ θ ] 12 ,…,θ i ] T The method is used for mining the correlation information of the T-frame sample video image and the previous T-1 frame sample video image in the video, and comprises the following steps:
wherein, the liquid crystal display device comprises a liquid crystal display device,the scale of the ith head is used for preventing the result from being oversized, and the specific value is based on the specific situation. The output of the previous multi-head attention input feedforward neural network is taken as the input of the next multi-head attention, and the output of the decoding part is +.>
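A minimal sketch of the masking described above, assuming the usual additive upper-triangular (causal) mask that hides the not-yet-predicted caption positions; the exact mask form is not fixed by the patent.

```python
# Hedged sketch: mask the caption positions that have not yet been predicted
# so that "posterior" tokens cannot influence decoding. The additive -inf
# mask follows common Transformer practice and is an assumption.
import torch

caption_len = 12                                   # caption length l (τ tokens already predicted)
mask = torch.full((caption_len, caption_len), float("-inf"))
mask = torch.triu(mask, diagonal=1)                # position j may attend only to positions <= j

scores = torch.rand(caption_len, caption_len)      # raw attention scores over caption tokens
masked_attn = torch.softmax(scores + mask, dim=-1) # future positions receive zero weight
```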
Step 6: video frame caption training predictions at some point. Decoding information based on step 5After a linear function and a normalized softmax function, the next caption text is predicted:
where f is a linear function. And circularly executing the steps until the preset convergence condition is met, and finally obtaining the multi-head multi-layer attention text generation model.
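A short sketch of this prediction step, assuming a vocabulary-sized linear layer for f and greedy selection of the next caption token (both assumptions for illustration):

```python
# Hedged sketch of step 6: a linear projection f followed by softmax over the
# vocabulary to predict the next caption token. Vocabulary size, model width
# and greedy (argmax) decoding are assumptions.
import torch
from torch import nn

d_model, vocab_size = 128, 5000
f = nn.Linear(d_model, vocab_size)      # the linear function f

psi_last = torch.rand(d_model)          # decoder output ψ at the last predicted position
probs = torch.softmax(f(psi_last), dim=-1)
next_token_id = int(probs.argmax())     # next caption token (greedy choice)
```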
According to the solution provided by the invention, the association relation between the video frame image at the current moment and the already-played video frame images can be learned, the key features of the current video frame image can be mined based on that association relation, and data such as the development history of the sport, team information and player information can be drawn upon, so that real-time video text is generated vividly, flexibly and efficiently. This solves the problems of information loss, inflexible results and templated output in current intelligent real-time text generation methods for sports events, provides reliable support for event broadcasting and the popularization of the sport, and reduces the operating cost of event rebroadcasting.
Fig. 2 shows a schematic structural diagram of a real-time video-based text generating apparatus according to an embodiment of the present invention. As shown in fig. 2, the apparatus includes: the device comprises a feature processing module 201, an encoding module 202, a derivative data processing module 203 and a decoding module 204.
The feature processing module 201 is adapted to perform feature processing on a currently played video image and a played video image of the video to obtain a first video frame feature vector and a second video frame feature vector, wherein the currently played video image is the video image that is currently being played;
the encoding module 202 is adapted to determine an association relationship between the currently played video image and the played video image according to the first video frame feature vector and the second video frame feature vector, extract a key feature vector corresponding to the currently played video image based on the association relationship, and correct the first video frame feature vector according to the key feature vector to obtain an encoded feature vector corresponding to the currently played video image;
the derived data processing module 203 is adapted to obtain derived data associated with the video, and perform vectorization processing on the derived data to obtain a derived data feature vector;
the decoding module 204 is adapted to decode the encoded feature vector and the derived data feature vector to obtain a real-time video text corresponding to the currently played video image.
Optionally, the encoding module is further adapted to: and carrying out vector cross multiplication on the key feature vector and the first video frame feature vector to obtain the coding feature vector corresponding to the current playing video image.
Optionally, the decoding module is further adapted to: vector addition is carried out on the coding feature vector and the derivative data feature vector, and the fused coding feature vector is obtained;
and decoding the fused coding feature vectors by utilizing a decoding network of the pre-trained multi-head multi-layer self-attention text generation model to obtain the real-time video text corresponding to the current playing video image.
Optionally, the encoding module is further adapted to: inputting the first video frame feature vector and the second video frame feature vector into a coding network of a pre-trained multi-head multi-layer self-attention text generation model, determining the association relation between a current playing video image and a played video image by using the coding network, extracting a key feature vector corresponding to the current playing video image based on the association relation, and correcting the first video frame feature vector according to the key feature vector to obtain a coding feature vector corresponding to the current playing video image.
Optionally, the apparatus further comprises: the text generation model training module is suitable for acquiring each frame of sample video image in the sample video and the labeling sample video text corresponding to the frame of sample video image;
obtaining derivative data associated with the sample video, and carrying out vectorization processing on the derivative data to obtain derivative data feature vectors;
performing feature processing on the T-th frame sample video image and the previous T-1 frame sample video image to obtain a third video frame feature vector and a fourth video frame feature vector;
training a coding network of the multi-head multi-layer self-attention text generation model according to the position coding of the T-frame sample video image, the third video frame feature vector and the fourth video frame feature vector, determining the association relation between the T-frame sample video image and the previous T-1 frame sample video image, extracting the key feature vector corresponding to the T-frame sample video image according to the association relation, and correcting the third video frame feature vector according to the key feature vector to obtain the coding feature vector corresponding to the T-frame sample video image;
training a decoding network of the multi-head multi-layer self-attention text generation model based on coding network weight parameters, coding feature vectors and derived data feature vectors of coding network training to obtain a video text corresponding to a T frame sample video image;
obtaining a text generation model loss function according to text loss between the video text corresponding to the T-th frame sample video image and the labeling sample video text, and updating the coding network weight parameter and the decoding network weight parameter according to the text generation model loss function;
and iteratively executing the steps until the preset convergence condition is met.
Optionally, the output and input of the previous layer multi-head attention are combined as the input of the next layer multi-head attention.
Optionally, the derived data comprises: athlete information, coach information, referee information, game item history information, playing field information, pre-game news information, and team history information.
According to the solution provided by the invention, the association relation between the video frame image at the current moment and the already-played video frame images can be learned, the key features of the current video frame image can be mined based on that association relation, and data such as the development history of the sport, team information and player information can be drawn upon, so that real-time video text is generated vividly, flexibly and efficiently. This solves the problems of information loss, inflexible results and templated output in current intelligent real-time text generation methods for sports events, provides reliable support for event broadcasting and the popularization of the sport, and reduces the operating cost of event rebroadcasting.
Embodiments of the present invention provide a non-volatile computer storage medium storing at least one executable instruction that may perform the video-based real-time text generation method of any of the above method embodiments.
FIG. 3 illustrates a schematic diagram of a computing device according to an embodiment of the present invention, and the embodiment of the present invention is not limited to a specific implementation of the computing device.
As shown in fig. 3, the computing device may include: a processor (processor), a communication interface (Communications Interface), a memory (memory), and a communication bus.
Wherein: the processor, communication interface, and memory communicate with each other via a communication bus. A communication interface for communicating with network elements of other devices, such as clients or other servers, etc. A processor for executing a program, and in particular, may perform the relevant steps in the video-based real-time text generation method embodiment for a computing device.
In particular, the program may include program code including computer-operating instructions.
The processor may be a central processing unit (CPU), an application-specific integrated circuit (ASIC), or one or more integrated circuits configured to implement embodiments of the present invention. The one or more processors included in the computing device may be processors of the same type, such as one or more CPUs, or processors of different types, such as one or more CPUs and one or more ASICs.
And the memory is used for storing programs. The memory may comprise high-speed RAM memory or may further comprise non-volatile memory, such as at least one disk memory.
The program may be specifically adapted to cause a processor to perform the video-based real-time text generation method in any of the method embodiments described above. Specific implementation of each step in the program may refer to corresponding descriptions in the corresponding steps and units in the video-based real-time text generation embodiment, which are not repeated herein. It will be clear to those skilled in the art that, for convenience and brevity of description, specific working procedures of the apparatus and modules described above may refer to corresponding procedure descriptions in the foregoing method embodiments, which are not repeated herein.
The algorithms or displays presented herein are not inherently related to any particular computer, virtual system, or other apparatus. Various general-purpose systems may also be used with the teachings herein. The required structure for a construction of such a system is apparent from the description above. In addition, embodiments of the present invention are not directed to any particular programming language. It will be appreciated that the teachings of embodiments of the present invention described herein may be implemented in a variety of programming languages, and the above description of specific languages is provided for disclosure of enablement and best mode of the embodiments of the present invention.
In the description provided herein, numerous specific details are set forth. However, it is understood that embodiments of the invention may be practiced without these specific details. In some instances, well-known methods, structures and techniques have not been shown in detail in order not to obscure an understanding of this description.
Similarly, it should be appreciated that in the above description of exemplary embodiments of the invention, various features of the embodiments of the invention are sometimes grouped together in a single embodiment, figure, or description thereof for the purpose of streamlining the disclosure and aiding in the understanding of one or more of the various inventive aspects. However, the disclosed method should not be construed as reflecting the intention that: i.e., an embodiment of the invention that is claimed, requires more features than are expressly recited in each claim. Rather, as the following claims reflect, inventive aspects lie in less than all features of a single foregoing disclosed embodiment. Thus, the claims following the detailed description are hereby expressly incorporated into this detailed description, with each claim standing on its own as a separate embodiment of this invention.
Those skilled in the art will appreciate that the modules in the apparatus of the embodiments may be adaptively changed and disposed in one or more apparatuses different from the embodiments. The modules or units or components of the embodiments may be combined into one module or unit or component and, furthermore, they may be divided into a plurality of sub-modules or sub-units or sub-components. Any combination of all features disclosed in this specification (including any accompanying claims, abstract and drawings), and all of the processes or units of any method or apparatus so disclosed, may be used in combination, except insofar as at least some of such features and/or processes or units are mutually exclusive. Each feature disclosed in this specification (including any accompanying claims, abstract and drawings), may be replaced by alternative features serving the same, equivalent or similar purpose, unless expressly stated otherwise.
Furthermore, those skilled in the art will appreciate that while some embodiments herein include some features but not others included in other embodiments, combinations of features of different embodiments are meant to be within the scope of the invention and form different embodiments. For example, in the following claims, any of the claimed embodiments can be used in any combination.
Various component embodiments of the invention may be implemented in hardware, or in software modules running on one or more processors, or in a combination thereof. Those skilled in the art will appreciate that some or all of the functionality of some or all of the components according to embodiments of the present invention may be implemented in practice using a microprocessor or Digital Signal Processor (DSP). Embodiments of the present invention may also be implemented as a device or apparatus program (e.g., a computer program and a computer program product) for performing a portion or all of the methods described herein. Such a program embodying the embodiments of the present invention may be stored on a computer readable medium, or may have the form of one or more signals. Such signals may be downloaded from an internet website, provided on a carrier signal, or provided in any other form.
It should be noted that the above-mentioned embodiments illustrate rather than limit the invention, and that those skilled in the art will be able to design alternative embodiments without departing from the scope of the appended claims. In the claims, any reference signs placed between parentheses shall not be construed as limiting the claim. The word "comprising" does not exclude the presence of elements or steps not listed in a claim. The word "a" or "an" preceding an element does not exclude the presence of a plurality of such elements. Embodiments of the invention may be implemented by means of hardware comprising several distinct elements, and by means of a suitably programmed computer. In the unit claims enumerating several means, several of these means may be embodied by one and the same item of hardware. The use of the words first, second, third, etc. do not denote any order. These words may be interpreted as names. The steps in the above embodiments should not be construed as limiting the order of execution unless specifically stated.

Claims (10)

1. A video-based real-time text generation method, comprising:
performing feature processing on a currently played video image and a played video image of the video to obtain a first video frame feature vector and a second video frame feature vector;
determining the association relation between a current playing video image and a played video image according to the first video frame feature vector and the second video frame feature vector, extracting a key feature vector corresponding to the current playing video image based on the association relation, and correcting the first video frame feature vector according to the key feature vector to obtain a coding feature vector corresponding to the current playing video image;
acquiring derivative data associated with the video, and carrying out vectorization processing on the derivative data to obtain derivative data feature vectors;
and decoding the coding feature vector and the derivative data feature vector to obtain the real-time video text corresponding to the current playing video image.
2. The method of claim 1, wherein the correcting the first video frame feature vector according to the key feature vector to obtain the encoded feature vector corresponding to the currently playing video image further comprises:
and carrying out vector cross multiplication on the key feature vector and the first video frame feature vector to obtain a coding feature vector corresponding to the current playing video image.
3. The method according to claim 1 or 2, wherein the decoding the encoded feature vector and the derived data feature vector to obtain the real-time video text corresponding to the currently playing video image further comprises:
vector addition is carried out on the coding feature vector and the derivative data feature vector, and a fused coding feature vector is obtained;
and decoding the fused coding feature vectors by utilizing a decoding network of the pre-trained multi-head multi-layer self-attention text generation model to obtain the real-time video text corresponding to the current playing video image.
4. The method of claim 3, wherein the determining the association between the currently played video image and the played video image according to the first video frame feature vector and the second video frame feature vector, extracting the key feature vector corresponding to the currently played video image based on the association, and correcting the first video frame feature vector according to the key feature vector, so as to obtain the encoded feature vector corresponding to the currently played video image further comprises:
inputting the first video frame feature vector and the second video frame feature vector into a coding network of a pre-trained multi-head multi-layer self-attention text generation model, determining the association relation between a current playing video image and a played video image by utilizing the coding network, extracting a key feature vector corresponding to the current playing video image based on the association relation, and correcting the first video frame feature vector according to the key feature vector to obtain a coding feature vector corresponding to the current playing video image.
5. The method of claim 3, wherein the text generation model training process comprises:
obtaining sample video images of each frame in a sample video and labeling sample video texts corresponding to the sample video images of each frame;
obtaining derivative data associated with the sample video, and carrying out vectorization processing on the derivative data to obtain derivative data feature vectors;
performing feature processing on the T-th frame sample video image and the previous T-1 frame sample video image to obtain a third video frame feature vector and a fourth video frame feature vector;
training a coding network of a multi-head multi-layer self-attention text generation model according to the position coding of the T-frame sample video image, the third video frame feature vector and the fourth video frame feature vector, determining the association relation between the T-frame sample video image and the previous T-1 frame sample video image, extracting a key feature vector corresponding to the T-frame sample video image according to the association relation, and correcting the third video frame feature vector according to the key feature vector to obtain a coding feature vector corresponding to the T-frame sample video image;
training a decoding network of the multi-head multi-layer self-attention text generation model based on coding network weight parameters, coding feature vectors and derived data feature vectors of coding network training to obtain a video text corresponding to a T frame sample video image;
obtaining a text generation model loss function according to text loss between the video text corresponding to the T-frame sample video image and the labeling sample video text, and updating the coding network weight parameter and the decoding network weight parameter according to the text generation model loss function;
and iteratively executing the steps until the preset convergence condition is met.
6. The method of claim 5, wherein the output and input of a previous layer of multi-headed attention are combined as the input of a next layer of multi-headed attention.
7. The method of claim 1 or 2, wherein the derivative data comprises: athlete information, coach information, referee information, game item history information, playing field information, pre-game news information, and team history information.
8. A video-based real-time text generation apparatus, comprising:
the feature processing module is suitable for carrying out feature processing on a currently played video image and a played video image of the video to obtain a first video frame feature vector and a second video frame feature vector, wherein the currently played video image is the video image that is currently being played;
the coding module is suitable for determining the association relation between the current playing video image and the played video image according to the first video frame feature vector and the second video frame feature vector, extracting a key feature vector corresponding to the current playing video image based on the association relation, and correcting the first video frame feature vector according to the key feature vector to obtain a coding feature vector corresponding to the current playing video image;
the derived data processing module is suitable for acquiring derived data associated with the video, and carrying out vectorization processing on the derived data to obtain a derived data feature vector;
and the decoding module is suitable for decoding the coding feature vector and the derivative data feature vector to obtain a real-time video text corresponding to the current playing video image.
9. A computing device, comprising: the device comprises a processor, a memory, a communication interface and a communication bus, wherein the processor, the memory and the communication interface complete communication with each other through the communication bus;
the memory is configured to store at least one executable instruction that causes the processor to perform operations corresponding to the video-based real-time text generation method according to any one of claims 1-7.
10. A computer storage medium having stored therein at least one executable instruction for causing a processor to perform operations corresponding to the video-based real-time text generation method of any one of claims 1-7.
CN202111091882.3A 2021-09-17 2021-09-17 Video-based real-time text generation method and device and computing equipment Active CN113810730B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202111091882.3A CN113810730B (en) 2021-09-17 2021-09-17 Video-based real-time text generation method and device and computing equipment

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202111091882.3A CN113810730B (en) 2021-09-17 2021-09-17 Video-based real-time text generation method and device and computing equipment

Publications (2)

Publication Number Publication Date
CN113810730A CN113810730A (en) 2021-12-17
CN113810730B true CN113810730B (en) 2023-08-01

Family

ID=78939503

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202111091882.3A Active CN113810730B (en) 2021-09-17 2021-09-17 Video-based real-time text generation method and device and computing equipment

Country Status (1)

Country Link
CN (1) CN113810730B (en)

Citations (7)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN107391646A (en) * 2017-07-13 2017-11-24 清华大学 A kind of Semantic features extraction method and device of video image
CN109874029A (en) * 2019-04-22 2019-06-11 腾讯科技(深圳)有限公司 Video presentation generation method, device, equipment and storage medium
CN110826361A (en) * 2018-08-09 2020-02-21 北京优酷科技有限公司 Method and device for explaining sports game
CN111372116A (en) * 2020-03-27 2020-07-03 咪咕文化科技有限公司 Video playing prompt information processing method and device, electronic equipment and storage medium
CN112040272A (en) * 2020-09-08 2020-12-04 海信电子科技(武汉)有限公司 Intelligent explanation method for sports events, server and display equipment
CN113111842A (en) * 2021-04-26 2021-07-13 浙江商汤科技开发有限公司 Action recognition method, device, equipment and computer readable storage medium
CN113392717A (en) * 2021-05-21 2021-09-14 杭州电子科技大学 Video dense description generation method based on time sequence characteristic pyramid

Family Cites Families (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US8265450B2 (en) * 2009-01-16 2012-09-11 Apple Inc. Capturing and inserting closed captioning data in digital video
WO2017176808A1 (en) * 2016-04-04 2017-10-12 Twitter, Inc. Live video classification and preview selection
US11275948B2 (en) * 2019-12-10 2022-03-15 Accenture Global Solutions Limited Utilizing machine learning models to identify context of content for policy compliance determination

Patent Citations (7)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN107391646A (en) * 2017-07-13 2017-11-24 清华大学 A kind of Semantic features extraction method and device of video image
CN110826361A (en) * 2018-08-09 2020-02-21 北京优酷科技有限公司 Method and device for explaining sports game
CN109874029A (en) * 2019-04-22 2019-06-11 腾讯科技(深圳)有限公司 Video presentation generation method, device, equipment and storage medium
CN111372116A (en) * 2020-03-27 2020-07-03 咪咕文化科技有限公司 Video playing prompt information processing method and device, electronic equipment and storage medium
CN112040272A (en) * 2020-09-08 2020-12-04 海信电子科技(武汉)有限公司 Intelligent explanation method for sports events, server and display equipment
CN113111842A (en) * 2021-04-26 2021-07-13 浙江商汤科技开发有限公司 Action recognition method, device, equipment and computer readable storage medium
CN113392717A (en) * 2021-05-21 2021-09-14 杭州电子科技大学 Video dense description generation method based on time sequence characteristic pyramid

Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
Semantic mining of news video based on the MWH model; Xu Xinwen; Li Guohui; Fu Changjian; Computer Engineering (Issue 17); full text *

Also Published As

Publication number Publication date
CN113810730A (en) 2021-12-17

Similar Documents

Publication Publication Date Title
Wang et al. Image captioning with deep bidirectional LSTMs
Wang et al. Ai coach: Deep human pose estimation and analysis for personalized athletic training assistance
CN107423274B (en) Artificial intelligence-based game comment content generation method and device and storage medium
US11055537B2 (en) Systems and methods for determining actions depicted in media contents based on attention weights of media content frames
Gan et al. Semantic compositional networks for visual captioning
CN109145784B (en) Method and apparatus for processing video
CN109691124B (en) Method and system for automatically generating video highlights
Kim et al. Image captioning with very scarce supervised data: Adversarial semi-supervised learning approach
CN110245259B (en) Video labeling method and device based on knowledge graph and computer readable medium
CN109063568B (en) Method for automatically scoring pattern skating video based on deep learning
CN112819852A (en) Evaluating gesture-based motion
Awad et al. An overview on the evaluated video retrieval tasks at TRECVID 2022
CN113515997B (en) Video data processing method and device and readable storage medium
CN111954860A (en) System and method for predicting fine-grained antagonistic multi-player movements
CN109977752B (en) Badminton technical and tactical analysis method based on sequence mode mining
CN113810730B (en) Video-based real-time text generation method and device and computing equipment
KR20200061747A (en) Apparatus and method for recognizing events in sports video
JP6677320B2 (en) Sports motion analysis support system, method and program
US11769327B2 (en) Automatically and precisely generating highlight videos with artificial intelligence
CN113792183B (en) Text generation method and device and computing equipment
CN116433808A (en) Character animation generation method, animation generation model training method and device
CN115668095A (en) Semi-supervised action-actor detection from tracking data in sports
Abideen et al. Ball-by-Ball Cricket Commentary Generation using Stateful Sequence-to-Sequence Model
AU2014201191A1 (en) A web server, computer readable storage medium, and a computer implemented method for generating animation data representing a recreational animation of a swim event
WO2023002716A1 (en) Computer vision system, computer vision method, computer vision program, and learning method

Legal Events

Date Code Title Description
PB01 Publication
PB01 Publication
SE01 Entry into force of request for substantive examination
SE01 Entry into force of request for substantive examination
GR01 Patent grant
GR01 Patent grant