CN114996515A - Training method of video feature extraction model, text generation method and device - Google Patents

Training method of video feature extraction model, text generation method and device

Info

Publication number
CN114996515A
Authority
CN
China
Prior art keywords: text, video, image, feature extraction, model
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202210615076.XA
Other languages
Chinese (zh)
Inventor
林和政
吴翔宇
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Beijing Dajia Internet Information Technology Co Ltd
Original Assignee
Beijing Dajia Internet Information Technology Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Beijing Dajia Internet Information Technology Co Ltd
Priority to CN202210615076.XA
Publication of CN114996515A
Legal status: Pending

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F 16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F 16/70 Information retrieval; Database structures therefor; File system structures therefor of video data
    • G06F 16/78 Retrieval characterised by using metadata, e.g. metadata not derived from the content or metadata generated manually
    • G06F 16/783 Retrieval characterised by using metadata, e.g. metadata not derived from the content or metadata generated manually using metadata automatically derived from the content
    • G06F 16/7844 Retrieval characterised by using metadata, e.g. metadata not derived from the content or metadata generated manually using metadata automatically derived from the content using original textual content or text extracted from visual content or transcript of audio data
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F 16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F 16/70 Information retrieval; Database structures therefor; File system structures therefor of video data
    • G06F 16/78 Retrieval characterised by using metadata, e.g. metadata not derived from the content or metadata generated manually
    • G06F 16/783 Retrieval characterised by using metadata, e.g. metadata not derived from the content or metadata generated manually using metadata automatically derived from the content
    • G06F 16/7847 Retrieval characterised by using metadata, e.g. metadata not derived from the content or metadata generated manually using metadata automatically derived from the content using low-level visual features of the video content
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N 3/00 Computing arrangements based on biological models
    • G06N 3/02 Neural networks
    • G06N 3/08 Learning methods
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06V IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V 10/00 Arrangements for image or video recognition or understanding
    • G06V 10/70 Arrangements for image or video recognition or understanding using pattern recognition or machine learning
    • G06V 10/77 Processing image or video features in feature spaces; using data integration or data reduction, e.g. principal component analysis [PCA] or independent component analysis [ICA] or self-organising maps [SOM]; Blind source separation
    • G06V 10/774 Generating sets of training patterns; Bootstrap methods, e.g. bagging or boosting
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06V IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V 10/00 Arrangements for image or video recognition or understanding
    • G06V 10/70 Arrangements for image or video recognition or understanding using pattern recognition or machine learning
    • G06V 10/82 Arrangements for image or video recognition or understanding using pattern recognition or machine learning using neural networks
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06V IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V 20/00 Scenes; Scene-specific elements
    • G06V 20/40 Scenes; Scene-specific elements in video content
    • G06V 20/46 Extracting features or characteristics from the video content, e.g. video fingerprints, representative shots or key frames
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06V IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V 20/00 Scenes; Scene-specific elements
    • G06V 20/60 Type of objects
    • G06V 20/62 Text, e.g. of license plates, overlay texts or captions on TV images
    • G06V 20/635 Overlay text, e.g. embedded captions in a TV program

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Multimedia (AREA)
  • Physics & Mathematics (AREA)
  • General Physics & Mathematics (AREA)
  • Evolutionary Computation (AREA)
  • Library & Information Science (AREA)
  • Databases & Information Systems (AREA)
  • Data Mining & Analysis (AREA)
  • Software Systems (AREA)
  • General Health & Medical Sciences (AREA)
  • Computing Systems (AREA)
  • General Engineering & Computer Science (AREA)
  • Artificial Intelligence (AREA)
  • Health & Medical Sciences (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Medical Informatics (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Biomedical Technology (AREA)
  • Biophysics (AREA)
  • Computational Linguistics (AREA)
  • Molecular Biology (AREA)
  • Mathematical Physics (AREA)
  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)

Abstract

The disclosure relates to a training method for a video feature extraction model, a text generation method, and an apparatus, and belongs to the field of computer technology. In the embodiments of the disclosure, the image information and text information of a sample video, together with its text label and image label, are used to train the video feature extraction model, providing a model training method based on dual training tasks.

Description

Training method of video feature extraction model, text generation method and device
Technical Field
The present disclosure relates to the field of computer technologies, and in particular, to a training method for a video feature extraction model, and a text generation method and apparatus.
Background
With the rapid development of computer technology and internet technology, video processing technology is gradually becoming an emerging research hotspot. In the video processing technology, it is generally required to extract video features capable of characterizing video content, and then perform processing procedures such as video recommendation, video classification, or video search by using the video features.
At present, before feature extraction is performed on a video, a video classification model is generally trained on the image information and category labels of a plurality of sample videos, and the trained video classification model is then used to process images in the video to obtain category features of the video. However, such a video classification model has weak feature extraction capability, which is detrimental to subsequent processing such as video recommendation, video classification, or video search.
Disclosure of Invention
The present disclosure provides a training method for a video feature extraction model, a text generation method, and an apparatus, which can train a video feature extraction model with better text generation capability and improve its training effect. The technical solution of the disclosure is as follows:
according to a first aspect of the embodiments of the present disclosure, there is provided a training method for a video feature extraction model, the method including:
acquiring image information, text information, an image tag and a text tag of a sample video, wherein the image tag represents image reconstruction characteristics, and the text tag represents a content description text of the sample video;
inputting the image information and the text information into a video feature extraction model, performing feature extraction on the image information through an image feature extraction sub-model of the video feature extraction model to obtain image features of the sample video, processing the text information through an embedding layer of a feature fusion sub-model of the video feature extraction model to obtain text features of the sample video, and performing feature fusion on the image features and the text features through a feature fusion layer of the feature fusion sub-model to obtain fusion features of the sample video;
performing image restoration on the image features in the fusion features through an image reconstruction sub-model of the video feature extraction model to obtain an image training result at the original image size, and processing the fusion features through a text generation sub-model of the video feature extraction model to obtain a text training result;
and adjusting model parameters of the image feature extraction submodel, the feature fusion submodel, the image reconstruction submodel and the text generation submodel based on the image training result, the text training result, the image label and the text label of the sample video so as to train the video feature extraction model.
In the embodiment of the disclosure, the video feature extraction model is trained using the image information and text information of a sample video together with the text label and image label of that sample video. An image feature extraction sub-model is constructed in the video feature extraction model so that the image features of the sample video can be extracted accurately, and a feature fusion sub-model is constructed so that the text features of the sample video can be acquired and fused with the image features. On the basis of the fusion features, the model can, on the one hand, perform image reconstruction for the sample video to obtain high-quality image features and, on the other hand, generate the content description text of the sample video. This provides a model training method based on dual training tasks, with the text generation task as the main task and the image reconstruction task as the auxiliary task. Because the image label of the sample video represents the image reconstruction feature, the model's ability to extract image features is improved during training, so that high-quality image features are obtained; on that basis, a video feature extraction model with better text generation capability can be trained, improving the training effect of the video feature extraction model.
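For concreteness, the following is a minimal PyTorch-style sketch of how the four sub-models described above could be wired together. The backbone choices, layer sizes, tensor shapes, and module names are illustrative assumptions, not the implementation mandated by the disclosure.

```python
# Illustrative sketch only; backbones, dimensions and shapes are assumptions.
import torch
import torch.nn as nn

class VideoFeatureExtractionModel(nn.Module):
    def __init__(self, vocab_size, d_model=512):
        super().__init__()
        # Image feature extraction sub-model: a small convolutional encoder.
        self.image_encoder = nn.Sequential(
            nn.Conv2d(3, 64, 3, stride=2, padding=1), nn.ReLU(),
            nn.Conv2d(64, 128, 3, stride=2, padding=1), nn.ReLU(),
            nn.AdaptiveAvgPool2d(1), nn.Flatten(), nn.Linear(128, d_model),
        )
        # Feature fusion sub-model: an embedding layer for the text information
        # plus a self-attention feature fusion layer.
        self.text_embedding = nn.Embedding(vocab_size, d_model)
        fusion_layer = nn.TransformerEncoderLayer(d_model, nhead=8, batch_first=True)
        self.fusion = nn.TransformerEncoder(fusion_layer, num_layers=2)
        # Image reconstruction sub-model: restores the image feature to an image;
        # a real decoder would upsample to the original image size.
        self.image_decoder = nn.Sequential(
            nn.Linear(d_model, 128 * 8 * 8), nn.ReLU(),
            nn.Unflatten(1, (128, 8, 8)),
            nn.ConvTranspose2d(128, 64, 4, stride=2, padding=1), nn.ReLU(),
            nn.ConvTranspose2d(64, 3, 4, stride=2, padding=1),
        )
        # Text generation sub-model: a self-attention decoder with a vocabulary head.
        decoder_layer = nn.TransformerDecoderLayer(d_model, nhead=8, batch_first=True)
        self.text_decoder = nn.TransformerDecoder(decoder_layer, num_layers=2)
        self.lm_head = nn.Linear(d_model, vocab_size)

    def forward(self, images, text_ids, target_ids):
        # images: (B, 3, H, W); text_ids: (B, L) ids of the spliced text information;
        # target_ids: (B, T) ids of the content description text label (teacher forcing).
        img_feat = self.image_encoder(images).unsqueeze(1)            # (B, 1, d)
        txt_feat = self.text_embedding(text_ids)                      # (B, L, d)
        fused = self.fusion(torch.cat([img_feat, txt_feat], dim=1))   # fusion features
        # Image training result: reconstruct from the image part of the fusion features.
        image_out = self.image_decoder(fused[:, 0])
        # Text training result: decode the description conditioned on the fusion features.
        tgt = self.text_embedding(target_ids)
        causal = nn.Transformer.generate_square_subsequent_mask(tgt.size(1)).to(tgt.device)
        text_out = self.lm_head(self.text_decoder(tgt, fused, tgt_mask=causal))
        return image_out, text_out
```

In this sketch the first token of the fused sequence carries the image feature, which is why the reconstruction branch decodes `fused[:, 0]`; the disclosure leaves the exact arrangement open.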
In some embodiments, the obtaining of the image information of the sample video includes at least one of:
acquiring a cover image of the sample video; or acquiring at least one frame of image from the sample video.
In the embodiment of the disclosure, the image information of the sample video can be quickly acquired by acquiring the cover image of the sample video or the image frame included in the sample video, the image information acquiring efficiency is ensured, the types of the image information are enriched, and the flexibility of acquiring the image information is improved.
In some embodiments, the obtaining of the text information of the sample video includes at least one of:
obtaining the description information of the sample video; acquiring title information of the sample video; acquiring subtitle information of the sample video; acquiring a character recognition result of the sample video, wherein the character recognition result is obtained by performing character recognition on at least one frame of image in the sample video; and acquiring an audio identification result of the sample video, wherein the audio identification result is obtained by performing audio identification on the background audio of the sample video.
In the embodiment of the disclosure, the text information of the sample video can be quickly acquired by acquiring the description information, the title information, the caption information, the character recognition result or the audio recognition result of the sample video, so that the efficiency of acquiring the text information is ensured, the types of the text information are enriched, and the flexibility of acquiring the text information is improved.
In some embodiments, the content description text is at least one of a content category description text, a content form description text, a content subject description text, and a content detail description text.
In the embodiment of the disclosure, by setting multiple types of content description texts, on one hand, a content description text with more expressive ability can be generated, and on the other hand, the multiple types of content description texts can describe the content of the video from different dimensions, so that the types of the generated content description texts are enriched, and the video can be represented more fully and completely.
In some embodiments, feature fusion of the image feature and the text feature by the feature fusion layer of the feature fusion sub-model to obtain a fusion feature of the sample video includes any one of:
processing the image feature and the text feature through a self-attention layer included in the feature fusion sub-model to obtain a fusion feature of the sample video;
or processing the image feature and the text feature through a deep belief network included in the feature fusion sub-model to obtain the fusion feature of the sample video.
In the embodiment of the disclosure, a self-attention layer or a deep belief network is set in the feature fusion sub-model, and the self-attention mechanism or the deep belief network is then used for feature fusion, so that features with better video representation capability can be obtained and the accuracy of feature fusion is improved.
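A minimal sketch of the self-attention option follows, assuming the image and text features have already been projected to a common dimension; the deep belief network alternative is not shown.

```python
# Minimal sketch of the self-attention fusion option (names and dimensions are assumptions).
import torch
import torch.nn as nn

class SelfAttentionFusion(nn.Module):
    def __init__(self, d_model=512, num_heads=8):
        super().__init__()
        self.attn = nn.MultiheadAttention(d_model, num_heads, batch_first=True)
        self.norm = nn.LayerNorm(d_model)

    def forward(self, image_features, text_features):
        # image_features: (B, Ni, d), text_features: (B, Nt, d)
        tokens = torch.cat([image_features, text_features], dim=1)
        # Self-attention lets every image token attend to every text token and vice versa.
        attended, _ = self.attn(tokens, tokens, tokens)
        return self.norm(tokens + attended)   # fusion features of the sample video
```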
In some embodiments, processing the fusion feature through a text generation submodel of the video feature extraction model to obtain a text training result includes:
and processing the fusion characteristics through a self-attention layer included in the text generation sub-model to obtain the text training result.
In the embodiment of the disclosure, the self-attention layer is arranged in the text generation submodel, and then the content description text is generated by using the self-attention mechanism, so that the accuracy of text generation is improved.
In some embodiments, the text training results include a plurality of types of content description text;
before the text generation submodel of the video feature extraction model processes the fusion feature to obtain the text training result, the method further comprises:
adding a type identifier for each type to the fusion features;
processing the fusion features through a text generation submodel of the video feature extraction model, and obtaining a text training result comprises the following steps:
inputting the fusion features with the added type identifiers into the text generation sub-model, and processing the fusion features through the text generation sub-model based on the processing mechanism corresponding to each type identifier to obtain the content description texts of the multiple types.
In the embodiment of the disclosure, a type identifier for each type is added to the fusion features, so that the text generation sub-model in the video feature extraction model can generate multiple types of content description text for the sample video based on each type identifier, ensuring that text generation proceeds smoothly.
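One possible way to realize the type identifiers, sketched under the assumption that each type (for example content category, form, subject, and detail) is a learned embedding token prepended to the fusion features:

```python
# Sketch: learned type-identifier tokens prepended to the fusion features (assumed design).
import torch
import torch.nn as nn

class TypeConditionedInput(nn.Module):
    def __init__(self, num_types=4, d_model=512):
        super().__init__()
        # e.g. 0: content category, 1: content form, 2: content subject, 3: content detail
        self.type_embedding = nn.Embedding(num_types, d_model)

    def forward(self, fused, type_id):
        # fused: (B, N, d) fusion features; type_id: int selecting the description type.
        batch = fused.size(0)
        ids = torch.full((batch, 1), type_id, dtype=torch.long, device=fused.device)
        tag = self.type_embedding(ids)           # (B, 1, d) type-identifier token
        return torch.cat([tag, fused], dim=1)    # fusion features with type identifier added
```

The text generation sub-model is then run once per type identifier, producing one content description text for each type.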
In some embodiments, there are a plurality of pieces of text information;
before the text information is processed by the embedding layer of the feature fusion submodel of the video feature extraction model to obtain the text features of the sample video, the method further comprises the following steps:
splicing the plurality of pieces of text information to obtain the spliced text information;
and based on the spliced text information, executing the step of processing the text information through the embedded layer of the feature fusion sub-model of the video feature extraction model to obtain the text features of the sample video.
In the embodiment of the disclosure, when there are a plurality of pieces of text information, they are spliced to obtain the spliced text information, and text feature extraction is then performed on the spliced text information, so that multiple types of text information are taken into account and the accuracy of text feature extraction is improved.
In some embodiments, after the plurality of pieces of text information are spliced to obtain the spliced text information, the method further includes:
extracting the first target number of characters from the spliced text information;
and executing the step of processing the text information through the embedded layer of the feature fusion sub-model of the video feature extraction model based on the extracted characters to obtain the text features of the sample video.
In the embodiment of the disclosure, the first target number of characters are extracted from the spliced text information, so that subsequent text feature extraction is performed on this fixed number of characters, which reduces the computation of the video feature extraction model while ensuring sufficient text input and improves the efficiency of text feature extraction.
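A minimal sketch of the splicing and truncation step, assuming a generic `tokenizer` callable and a target character count of 256; neither is fixed by the disclosure:

```python
# Sketch: splice multiple pieces of text information, keep only the first N characters.
from typing import Callable, List

def prepare_text_input(texts: List[str],
                       tokenizer: Callable[[str], List[int]],
                       target_num_chars: int = 256) -> List[int]:
    spliced = " ".join(texts)                  # spliced text information
    truncated = spliced[:target_num_chars]     # first target number of characters
    return tokenizer(truncated)                # ids fed to the embedding layer

# Example (a trivial character-level tokenizer stands in for a real one):
ids = prepare_text_input(["title of the video", "caption text", "OCR result"],
                         tokenizer=lambda s: [ord(c) for c in s])
```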
In some embodiments, adjusting the model parameters of the image feature extraction sub-model, the feature fusion sub-model, the image reconstruction sub-model, and the text generation sub-model based on the image training result, the text training result, and the image tag and the text tag of the sample video to train the video feature extraction model comprises:
in the ith iteration process of model training, determining a model loss value of the ith iteration process based on an image training result and a text training result of the ith iteration process and an image label and a text label of the sample video, wherein i is a positive integer greater than 1;
and adjusting the model parameters of the image feature extraction sub-model, the feature fusion sub-model, the image reconstruction sub-model and the text generation sub-model determined in the (i-1)-th iteration process based on the model loss value of the i-th iteration process, and repeating the above training iteration process until the training meets the target condition.
In the embodiment of the disclosure, in any iteration process of model training, the model loss value of the iteration process is used to adjust the model parameters of each sub-model in the video feature extraction model, so as to improve the text generation capability of the video feature extraction model, thereby training the video feature extraction model with higher text generation capability.
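The iterative adjustment can be sketched as an ordinary gradient-descent loop; the optimizer, learning rate, and stopping test below are assumptions, since the disclosure only requires repeating until the target condition is met:

```python
# Sketch of the i-th training iteration (optimizer choice and stop test are assumptions).
import torch

def train(model, dataloader, compute_model_loss, max_iterations=10000, loss_threshold=0.01):
    optimizer = torch.optim.Adam(model.parameters(), lr=1e-4)
    for i, (images, text_ids, image_labels, text_labels) in enumerate(dataloader, start=1):
        # text_labels serve both as teacher-forcing input and as the text label in this sketch.
        image_result, text_result = model(images, text_ids, text_labels)
        loss = compute_model_loss(image_result, text_result, image_labels, text_labels)
        optimizer.zero_grad()
        loss.backward()
        optimizer.step()   # adjusts all four sub-models' parameters jointly
        # Target condition: e.g. loss convergence or an iteration budget.
        if loss.item() < loss_threshold or i >= max_iterations:
            break
    return model
```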
In some embodiments, determining the model loss value for the ith iterative process based on the image training result and the text training result for the ith iterative process and the image label and the text label for the sample video comprises:
determining an image reconstruction loss value of the ith iteration process based on the image training result of the ith iteration process and the image label of the sample video, wherein the image reconstruction loss value represents the difference between the image training result and the image label;
determining a text generation loss value of the ith iteration process based on the text training result of the ith iteration process and the text label of the sample video, wherein the text generation loss value represents the difference between the text training result and the text label;
and determining the model loss value of the ith iteration process based on the image reconstruction loss value and the text generation loss value.
In some embodiments, determining the model loss value for the i-th iterative process based on the image reconstruction loss value and the text generation loss value comprises:
and carrying out weighted summation based on the image reconstruction loss value, the weight coefficient corresponding to the image reconstruction loss value, the text generation loss value and the weight coefficient corresponding to the text generation loss value to obtain the model loss value of the ith iteration process.
In the embodiment of the disclosure, the weight coefficients corresponding to the tasks are respectively set for the image reconstruction task and the text generation task of the video feature extraction model, and then the loss value of each task and the weight coefficient corresponding to each task are used for determining the model loss value, so that the accuracy of determining the model loss value is improved.
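A minimal sketch of this weighted sum, assuming mean-squared error for the image reconstruction loss, cross-entropy for the text generation loss, and example weight coefficients; none of these specific choices is mandated by the disclosure:

```python
# Sketch: model loss = w_img * image reconstruction loss + w_txt * text generation loss.
import torch.nn.functional as F

def compute_model_loss(image_result, text_result, image_label, text_label,
                       w_img=0.3, w_txt=1.0):
    # image_result / image_label: (B, 3, h, w); text_result: (B, T, V); text_label: (B, T)
    image_recon_loss = F.mse_loss(image_result, image_label)
    text_gen_loss = F.cross_entropy(text_result.transpose(1, 2), text_label)
    return w_img * image_recon_loss + w_txt * text_gen_loss
```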
In some embodiments, the text training results include a plurality of types of content description text;
determining a text generation loss value of the ith iteration process based on the text training result of the ith iteration process and the text label of the sample video comprises:
for any type, determining a loss value of the ith iteration process on the type based on a text training result of the ith iteration process on the type and a text label of the sample video on the type;
and carrying out weighted summation based on the loss values of the ith iteration process on the multiple types and the weight coefficients of the video feature extraction network on the multiple types to obtain a text generation loss value of the ith iteration process.
In the embodiment of the disclosure, the weight coefficients corresponding to the types are respectively set for the types related to the text generation, and then the loss values corresponding to the types and the weight coefficients corresponding to the types are used for determining the text generation loss values, so that the accuracy of determining the text generation loss values is improved.
In some embodiments, before obtaining the text generation loss value of the ith iteration process, the method further includes:
for any type, determining the correct proportion of the ith iteration process on the type based on the correct text quantity of the ith iteration process on the type and the total text quantity, wherein the correct proportion represents the proportion of the correct text quantity in the ith iteration process to the total text quantity;
and determining the weight coefficient of the video feature extraction network on the type based on the correct proportion of the ith iteration process on the type, wherein the correct proportion is in negative correlation with the weight coefficient.
In the embodiment of the disclosure, for each type involved in text generation, the weight coefficient corresponding to that type is determined according to its correct proportion. Because the correct proportion represents the proportion of correctly generated texts to the total number of texts, and because the correct proportion is negatively correlated with the weight coefficient, a smaller weight coefficient is assigned to a type with a large correct proportion and a larger weight coefficient is assigned to a type with a small correct proportion when the text generation loss value is calculated. This improves the accuracy of determining the weight coefficients and, in turn, the accuracy of determining the text generation loss value.
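The per-type weighting can be sketched as follows; realizing the negative correlation as weight = 1 - correct proportion is one possible choice among several, not a value fixed by the disclosure:

```python
# Sketch: per-type text-generation loss with weights negatively correlated
# with each type's correct proportion in the current iteration.
from typing import Dict

def text_generation_loss(per_type_loss: Dict[str, float],
                         correct_counts: Dict[str, int],
                         total_counts: Dict[str, int]) -> float:
    loss = 0.0
    for t, l in per_type_loss.items():
        correct_proportion = correct_counts[t] / max(total_counts[t], 1)
        weight = 1.0 - correct_proportion   # larger weight for types generated less accurately
        loss += weight * l
    return loss

# Example with the four assumed description types:
loss = text_generation_loss(
    per_type_loss={"category": 0.8, "form": 0.5, "subject": 1.2, "detail": 1.6},
    correct_counts={"category": 90, "form": 70, "subject": 40, "detail": 20},
    total_counts={"category": 100, "form": 100, "subject": 100, "detail": 100},
)
```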
According to a second aspect of the embodiments of the present disclosure, there is provided a text generation method based on a video feature extraction model, where the video feature extraction model is trained based on the first aspect or a training method shown in any embodiment of the first aspect, and the method includes:
acquiring image information and text information of a target video;
inputting the image information and the text information into the video feature extraction model, performing feature extraction on the image information through an image feature extraction sub-model of the video feature extraction model to obtain image features of the target video, processing the text information through an embedding layer of a feature fusion sub-model of the video feature extraction model to obtain text features of the target video, and performing feature fusion on the image features and the text features through a feature fusion layer of the feature fusion sub-model to obtain fusion features of the target video;
and processing the fusion features through a text generation sub-model of the video feature extraction model, outputting a plurality of characters meeting text generation conditions, and generating a content description text of the target video based on the characters.
In the embodiment of the disclosure, by constructing the image feature extraction sub-model in the video feature extraction model, the image features of the target video can be accurately extracted; by constructing the feature fusion sub-model, the text features of the target video can be acquired and fused with the image features. Subsequent processing based on the fusion features can then output a plurality of characters that meet the text generation condition, and the content description text of the target video can be automatically generated from the output characters.
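At inference time the characters satisfying the text generation condition can be produced, for example, by greedy decoding over the fusion features. The sketch below reuses the module names from the earlier training sketch, and the condition (argmax decoding until an assumed end token or a length limit) is an illustrative assumption:

```python
# Sketch of inference-time text generation (greedy decoding; condition is assumed).
import torch

@torch.no_grad()
def generate_description(model, images, text_ids, bos_id=1, eos_id=2, max_len=32):
    img_feat = model.image_encoder(images).unsqueeze(1)
    txt_feat = model.text_embedding(text_ids)
    fused = model.fusion(torch.cat([img_feat, txt_feat], dim=1))    # fusion features
    out_ids = torch.full((images.size(0), 1), bos_id, dtype=torch.long, device=images.device)
    for _ in range(max_len):
        tgt = model.text_embedding(out_ids)
        logits = model.lm_head(model.text_decoder(tgt, fused))      # (B, T, V)
        next_id = logits[:, -1].argmax(dim=-1, keepdim=True)        # character with highest score
        out_ids = torch.cat([out_ids, next_id], dim=1)
        if (next_id == eos_id).all():
            break
    return out_ids[:, 1:]   # token ids of the content description text
```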
In some embodiments, the method further comprises:
and performing image restoration on the image characteristics in the fusion characteristics through an image reconstruction sub-model of the video characteristic extraction model to obtain image reconstruction characteristics of the original image size of the target video.
According to a third aspect of the embodiments of the present disclosure, there is provided an apparatus for training a video feature extraction model, the apparatus including:
an acquisition unit configured to perform acquisition of image information, text information, an image tag, and a text tag of a sample video, the image tag representing an image reconstruction feature, the text tag representing a content description text of the sample video;
the input unit is configured to input the image information and the text information into a video feature extraction model, perform feature extraction on the image information through an image feature extraction sub-model of the video feature extraction model to obtain image features of the sample video, process the text information through an embedding layer of a feature fusion sub-model of the video feature extraction model to obtain text features of the sample video, and perform feature fusion on the image features and the text features through a feature fusion layer of the feature fusion sub-model to obtain fusion features of the sample video;
the processing unit is configured to execute image restoration on the image features in the fusion features through an image reconstruction sub-model of the video feature extraction model to obtain an image training result of the size of an original image, and process the fusion features through a text generation sub-model of the video feature extraction model to obtain a text training result;
and the training unit is configured to execute adjusting model parameters of the image feature extraction submodel, the feature fusion submodel, the image reconstruction submodel and the text generation submodel based on the image training result, the text training result and the image label and the text label of the sample video so as to train the video feature extraction model.
In some embodiments, the obtaining unit is configured to perform at least one of:
acquiring a cover image of the sample video; or acquiring at least one frame of image from the sample video.
In some embodiments, the obtaining unit is configured to perform at least one of:
obtaining the description information of the sample video; acquiring title information of the sample video; acquiring caption information of the sample video; acquiring a character recognition result of the sample video, wherein the character recognition result is obtained by performing character recognition on at least one frame of image in the sample video; and acquiring an audio identification result of the sample video, wherein the audio identification result is obtained by performing audio identification on the background audio of the sample video.
In some embodiments, the content description text is at least one of a content category description text, a content form description text, a content subject description text, and a content detail description text.
In some embodiments, the input unit comprises a processing subunit configured to perform any one of:
processing the image feature and the text feature through a self-attention layer included in the feature fusion sub-model to obtain a fusion feature of the sample video;
or processing the image feature and the text feature through a deep belief network included in the feature fusion sub-model to obtain the fusion feature of the sample video.
In some embodiments, the processing unit comprises a text generation subunit configured to perform:
and processing the fusion characteristics through a self-attention layer included in the text generation sub-model to obtain the text training result.
In some embodiments, the text training results include a plurality of types of content description text;
the device also comprises an adding unit which is configured to add type identifications of various types on the fusion characteristics;
the processing unit comprises a text generation subunit and is also configured to input the fusion features added with the type identifiers into the text generation submodel, and the fusion features are processed through the text generation submodel based on the processing mechanisms corresponding to the type identifiers respectively to obtain the content description texts of the multiple types.
In some embodiments, there are a plurality of pieces of text information;
the device also comprises a splicing unit which is configured to splice the plurality of pieces of text information to obtain the spliced text information;
the input unit is further configured to execute the step of processing the text information through the embedding layer of the feature fusion sub-model of the video feature extraction model based on the spliced text information to obtain the text features of the sample video.
In some embodiments, the input unit is further configured to perform:
extracting the first target number of characters from the spliced text information;
and executing the step of processing the text information through the embedded layer of the feature fusion sub-model of the video feature extraction model based on the extracted characters to obtain the text features of the sample video.
In some embodiments, the training unit comprises:
a determining subunit, configured to perform, in an ith iteration process of model training, determining a model loss value of the ith iteration process based on an image training result and a text training result of the ith iteration process and an image label and a text label of the sample video, where i is a positive integer greater than 1;
and the adjusting subunit is configured to perform adjustment on the model parameters of the video feature extraction model determined in the (i-1)-th iteration process based on the model loss value of the i-th iteration process, and repeat the above training iteration process until the training meets the target condition.
In some embodiments, the determining subunit includes:
an image reconstruction loss value determination subunit configured to perform determining an image reconstruction loss value of the ith iteration process based on the image training result of the ith iteration process and the image label of the sample video, the image reconstruction loss value representing a difference between the image training result and the image label;
a text generation loss value determination subunit configured to perform determining a text generation loss value of the ith iteration process based on the text training result of the ith iteration process and the text label of the sample video, the text generation loss value representing a difference between the text training result and the text label;
a model loss value determination subunit configured to perform determining a model loss value for the i-th iteration process based on the image reconstruction loss value and the text generation loss value.
In some embodiments, the model loss value determination subunit is configured to perform:
and carrying out weighted summation based on the image reconstruction loss value, the weight coefficient corresponding to the image reconstruction loss value, the text generation loss value and the weight coefficient corresponding to the text generation loss value to obtain the model loss value of the ith iteration process.
In some embodiments, the text training results include a plurality of types of content description text;
the text generation loss value determination subunit configured to perform:
for any type, determining a loss value of the ith iteration process on the type based on a text training result of the ith iteration process on the type and a text label of the sample video on the type;
and carrying out weighted summation based on the loss values of the ith iteration process on the multiple types and the weighting coefficients of the video feature extraction network on the multiple types to obtain a text generation loss value of the ith iteration process.
In some embodiments, the apparatus further comprises a determining unit configured to perform:
for any type, determining the correct proportion of the ith iteration process on the type based on the correct text quantity of the ith iteration process on the type and the total text quantity, wherein the correct proportion represents the proportion of the correct text quantity in the ith iteration process to the total text quantity;
and determining the weight coefficient of the video feature extraction network on the type based on the correct proportion of the ith iteration process on the type, wherein the correct proportion is in negative correlation with the weight coefficient.
According to a fourth aspect of the embodiments of the present disclosure, there is provided a text generation apparatus based on a video feature extraction model, where the video feature extraction model is obtained by training based on the first aspect or the training method shown in any embodiment of the first aspect, the apparatus includes:
an acquisition unit configured to perform acquisition of image information and text information of a target video;
the input unit is configured to input the image information and the text information into the video feature extraction model, perform feature extraction on the image information through an image feature extraction sub-model of the video feature extraction model to obtain image features of the target video, process the text information through an embedding layer of a feature fusion sub-model of the video feature extraction model to obtain text features of the target video, and perform feature fusion on the image features and the text features through a feature fusion layer of the feature fusion sub-model to obtain fusion features of the target video;
and the processing unit is configured to execute processing on the fusion feature through a text generation sub-model of the video feature extraction model, output a plurality of characters meeting text generation conditions, and generate a content description text of the target video based on the plurality of characters.
In some embodiments, the processing unit is further configured to perform:
and performing image restoration on the image characteristics in the fusion characteristics through an image reconstruction sub-model of the video characteristic extraction model to obtain image reconstruction characteristics of the original image size of the target video.
According to a fifth aspect of embodiments of the present disclosure, there is provided a computer apparatus including:
one or more processors;
a memory for storing the processor executable program code;
wherein the processor is configured to execute the program code to implement the training method of the video feature extraction model shown in the first aspect or any embodiment of the first aspect, or the text generation method based on the video feature extraction model shown in the second aspect or any embodiment of the second aspect.
According to a sixth aspect of embodiments of the present disclosure, there is provided a computer-readable storage medium including: the program code in the computer readable storage medium, when executed by a processor of a computer device, enables the computer device to perform a training method for a video feature extraction model as shown in the first aspect or any embodiment of the first aspect, or a text generation method based on a video feature extraction model as shown in the second aspect or any embodiment of the second aspect.
According to a seventh aspect of the embodiments of the present disclosure, there is provided a computer program product, which includes a computer program, and when executed by a processor, the computer program implements a training method of a video feature extraction model shown in the first aspect or any embodiment of the first aspect, or a text generation method based on a video feature extraction model shown in the second aspect or any embodiment of the second aspect.
It is to be understood that both the foregoing general description and the following detailed description are exemplary and explanatory only and are not restrictive of the disclosure.
Drawings
The accompanying drawings, which are incorporated in and constitute a part of this specification, illustrate embodiments consistent with the present disclosure and, together with the description, serve to explain the principles of the disclosure and are not to be construed as limiting the disclosure.
FIG. 1 is a schematic diagram illustrating an environment for implementing a method for training a video feature extraction model according to an exemplary embodiment;
FIG. 2 is a flow diagram illustrating a method of training a video feature extraction model in accordance with an exemplary embodiment;
FIG. 3 is a flow diagram illustrating a method for text generation based on a video feature extraction model in accordance with an exemplary embodiment;
FIG. 4 is a flowchart illustrating a method of training a video feature extraction model in accordance with an exemplary embodiment;
FIG. 5 is a framework of a video feature extraction model according to an exemplary embodiment;
FIG. 6 is a flow diagram illustrating a method for text generation based on a video feature extraction model in accordance with an exemplary embodiment;
FIG. 7 is a block diagram illustrating a training apparatus for a video feature extraction model in accordance with an exemplary embodiment;
FIG. 8 is a block diagram illustrating a video feature extraction model based text generation apparatus in accordance with an illustrative embodiment;
FIG. 9 is a block diagram illustrating a terminal in accordance with an exemplary embodiment;
FIG. 10 is a block diagram illustrating a server in accordance with an example embodiment.
Detailed Description
In order to make the technical solutions of the present disclosure better understood, the technical solutions in the embodiments of the present disclosure will be clearly and completely described below with reference to the accompanying drawings.
It should be noted that the terms "first," "second," and the like in the description and claims of the present disclosure and in the foregoing drawings are used for distinguishing between similar elements and not necessarily for describing a particular sequential or chronological order. It is to be understood that the data so used is interchangeable under appropriate circumstances such that the embodiments of the disclosure described herein are capable of operation in sequences other than those illustrated or otherwise described herein. The implementations described in the exemplary embodiments below are not intended to represent all implementations consistent with the present disclosure. Rather, they are merely examples of apparatus and methods consistent with certain aspects of the present disclosure, as detailed in the appended claims.
It should be noted that information (including but not limited to user equipment information, user personal information, etc.), data (including but not limited to data for analysis, stored data, displayed data, etc.), and signals involved in the embodiments of the present disclosure are authorized by the user or sufficiently authorized by various parties, and the collection, use, and processing of the relevant data requires compliance with relevant laws and regulations and standards in relevant countries and regions. For example, the image information and the text information related to the embodiments of the present disclosure are acquired with sufficient authorization.
In some embodiments, a terminal provides a permission inquiry page used to ask whether the user grants permission to acquire the image information and text information of a video. The page displays an authorization approval control and an authorization denial control. When a trigger operation on the authorization approval control by the user is detected, the image information and text information of the sample video are acquired using the training method of the video feature extraction model provided by the embodiments of the present disclosure, and the video feature extraction model is then trained based on this image information and text information.
Fig. 1 is a schematic diagram of an implementation environment of a training method for a video feature extraction model according to an exemplary embodiment, and referring to fig. 1, the implementation environment includes: a terminal 101 and a server 102.
The terminal 101 may be at least one of a smartphone, a smart watch, a desktop computer, a laptop computer, a virtual reality terminal, an augmented reality terminal, a wireless terminal, a laptop portable computer, and the like. The terminal 101 has a communication function and can access a wired network or a wireless network. The terminal 101 may be generally referred to as one of a plurality of terminals, and the embodiment is only illustrated by the terminal 101. Those skilled in the art will appreciate that the number of terminals may be greater or less.
The server 102 may be an independent physical server, a server cluster or a distributed file system formed by a plurality of physical servers, or a cloud server providing basic cloud computing services such as cloud service, a cloud database, cloud computing, cloud functions, cloud storage, Network service, cloud communication, middleware service, domain name service, security service, CDN (Content Delivery Network), big data and artificial intelligence platform, and the like. In some embodiments, the server 102 and the terminal 101 are connected directly or indirectly through wired or wireless communication, which is not limited in the embodiments of the present disclosure. Alternatively, the number of the servers 102 may be more or less, and the embodiment of the disclosure does not limit this. Of course, the server 102 may also include other functional servers to provide more comprehensive and diverse services.
In some embodiments, the training method for the video feature extraction model provided by the embodiments of the present disclosure is performed by the terminal 101, for example, the terminal 101 performs model training on the video feature extraction model by using the training method for the video feature extraction model provided by the embodiments of the present disclosure in response to a training operation on the video feature extraction model; in still other embodiments, the training method for the video feature extraction model provided by the embodiments of the present disclosure is performed by the terminal 101 and the server 102 together, for example, the terminal 101 sends the training data of the video feature extraction model to the server 102 in response to an uploading operation of the training data of the video feature extraction model, and then the server 102 receives the training data uploaded by the terminal 101 and performs model training on the video feature extraction model by using the training method for the video feature extraction model provided by the embodiments of the present disclosure.
It should be noted that the video feature extraction model provided by the embodiment of the present disclosure may be applied in scenes of video recommendation, video classification, or video search. In some embodiments, the video feature extraction model trained by the embodiments of the present disclosure is used to obtain a content description text of a video, and then the obtained content description text is used to implement functions of video recommendation, video classification or video search.
Fig. 2 is a flowchart illustrating a method for training a video feature extraction model according to an exemplary embodiment, where the method is performed by a computer device, such as the terminal or the server illustrated in fig. 1, and the method illustratively includes the following steps:
in step 201, a computer device obtains image information, text information, an image tag and a text tag of a sample video, wherein the image tag represents an image reconstruction feature, and the text tag represents a content description text of the sample video.
In step 202, the computer device inputs the image information and the text information into a video feature extraction model, performs feature extraction on the image information through an image feature extraction sub-model of the video feature extraction model to obtain image features of the sample video, processes the text information through an embedding layer of a feature fusion sub-model of the video feature extraction model to obtain text features of the sample video, and performs feature fusion on the image features and the text features through a feature fusion layer of the feature fusion sub-model to obtain fusion features of the sample video.
In step 203, the computer device performs image restoration on the image features in the fusion features through the image reconstruction submodel of the video feature extraction model to obtain an image training result of the original image size, and processes the fusion features through the text generation submodel of the video feature extraction model to obtain a text training result.
In step 204, the computer device adjusts model parameters of the image feature extraction sub-model, the feature fusion sub-model, the image reconstruction sub-model and the text generation sub-model based on the image training result, the text training result, and the image tag and the text tag of the sample video, so as to train the video feature extraction model.
The technical solution provided by the embodiment of the disclosure trains the video feature extraction model using the image information and text information of a sample video together with the text label and image label of that sample video. An image feature extraction sub-model constructed in the video feature extraction model allows the image features of the sample video to be extracted accurately, and a feature fusion sub-model constructed in the model allows the text features of the sample video to be acquired and fused with the image features. On the basis of the fusion features, the subsequent process can, on the one hand, perform image reconstruction for the sample video to obtain high-quality image features and, on the other hand, generate the content description text of the sample video. A model training method based on dual training tasks is thus provided, with the text generation task as the main task and the image reconstruction task as the auxiliary task. Because the image label of the sample video represents the image reconstruction feature, the model's ability to extract image features is improved during training, so that high-quality image features are obtained; on that basis, a video feature extraction model with better text generation capability can be trained, improving the training effect of the video feature extraction model.
In some embodiments, the obtaining of the image information of the sample video includes at least one of:
acquiring a cover image of the sample video; or acquiring at least one frame of image from the sample video.
In some embodiments, the obtaining of the text information of the sample video includes at least one of:
obtaining the description information of the sample video; acquiring title information of the sample video; acquiring subtitle information of the sample video; acquiring a character recognition result of the sample video, wherein the character recognition result is obtained by performing character recognition on at least one frame of image in the sample video; and acquiring an audio identification result of the sample video, wherein the audio identification result is obtained by performing audio identification on the background audio of the sample video.
In some embodiments, the content description text is at least one of a content category description text, a content form description text, a content subject description text, and a content detail description text.
In some embodiments, feature fusion of the image feature and the text feature by the feature fusion layer of the feature fusion sub-model to obtain a fusion feature of the sample video includes any one of:
processing the image feature and the text feature through a self-attention layer included in the feature fusion sub-model to obtain a fusion feature of the sample video;
or processing the image feature and the text feature through a deep belief network included in the feature fusion sub-model to obtain the fusion feature of the sample video.
In some embodiments, processing the fusion feature through a text generation submodel of the video feature extraction model to obtain a text training result includes:
and processing the fusion features through a self-attention layer included by the text generation sub-model to obtain the text training result.
In some embodiments, the text training results include a plurality of types of content description text;
before the text generation submodel of the video feature extraction model processes the fusion feature and obtains the text training result, the method further comprises the following steps:
adding type identifications of the various types to the fusion characteristics;
processing the fusion features through a text generation submodel of the video feature extraction model, and obtaining a text training result comprises the following steps:
inputting the fusion features added with the type identifications into the text generation sub-model, and processing the fusion features through the text generation sub-model based on the processing mechanisms corresponding to the type identifications respectively to obtain the content description texts of multiple types.
In some embodiments, there are a plurality of pieces of text information;
before the text information is processed through the embedding layer of the feature fusion submodel of the video feature extraction model to obtain the text features of the sample video, the method further comprises the following steps:
splicing the plurality of pieces of text information to obtain the spliced text information;
and based on the spliced text information, executing the step of processing the text information through the embedded layer of the feature fusion sub-model of the video feature extraction model to obtain the text features of the sample video.
In some embodiments, after the splicing is performed on a plurality of pieces of text information and the spliced text information is obtained, the method further includes:
extracting the first target number of characters from the spliced text information;
and executing the step of processing the text information through the embedded layer of the feature fusion sub-model of the video feature extraction model based on the extracted characters to obtain the text features of the sample video.
In some embodiments, adjusting the model parameters of the image feature extraction sub-model, the feature fusion sub-model, the image reconstruction sub-model, and the text generation sub-model based on the image training result, the text training result, and the image tag and the text tag of the sample video to train the video feature extraction model comprises:
in the ith iteration process of model training, determining a model loss value of the ith iteration process based on an image training result and a text training result of the ith iteration process and an image label and a text label of the sample video, wherein i is a positive integer greater than 1;
and adjusting the model parameters of the image feature extraction sub-model, the feature fusion sub-model, the image reconstruction sub-model and the text generation sub-model determined in the (i-1)-th iteration process based on the model loss value of the i-th iteration process, and repeating the above training iteration process until the training meets the target condition.
In some embodiments, determining the model loss value for the ith iterative process based on the image training result and the text training result for the ith iterative process and the image label and the text label for the sample video comprises:
determining an image reconstruction loss value of the ith iteration process based on the image training result of the ith iteration process and the image label of the sample video, wherein the image reconstruction loss value represents the difference between the image training result and the image label;
determining a text generation loss value of the ith iteration process based on the text training result of the ith iteration process and the text label of the sample video, wherein the text generation loss value represents the difference between the text training result and the text label;
and determining the model loss value of the ith iteration process based on the image reconstruction loss value and the text generation loss value.
In some embodiments, determining the model loss value for the i-th iterative process based on the image reconstruction loss value and the text generation loss value comprises:
and carrying out weighted summation based on the image reconstruction loss value, the weight coefficient corresponding to the image reconstruction loss value, the text generation loss value and the weight coefficient corresponding to the text generation loss value to obtain a model loss value of the ith iteration process.
In some embodiments, the text training results include a plurality of types of content description text;
determining a text generation loss value of the ith iteration process based on the text training result of the ith iteration process and the text label of the sample video comprises:
for any type, determining a loss value of the ith iteration process on the type based on a text training result of the ith iteration process on the type and a text label of the sample video on the type;
and carrying out weighted summation based on the loss values of the ith iteration process on the multiple types and the weight coefficients of the video feature extraction network on the multiple types to obtain a text generation loss value of the ith iteration process.
In some embodiments, before obtaining the text generation loss value of the ith iteration process, the method further includes:
for any type, determining the correct proportion of the ith iteration process on the type based on the correct text quantity of the ith iteration process on the type and the total text quantity, wherein the correct proportion represents the proportion of the correct text quantity in the ith iteration process to the total text quantity;
and determining the weight coefficient of the video feature extraction network on the type based on the correct proportion of the ith iteration process on the type, wherein the correct proportion is in negative correlation with the weight coefficient.
Fig. 3 is a flowchart illustrating a text generation method based on a video feature extraction model according to an exemplary embodiment, and as shown in fig. 3, the method is executed by a computer device, which may be provided as the terminal or the server shown in fig. 1, and the method illustratively includes the following steps:
in step 301, a computer device obtains image information and text information of a target video.
In step 302, the computer device inputs the image information and the text information into the video feature extraction model, performs feature extraction on the image information through an image feature extraction sub-model of the video feature extraction model to obtain image features of the target video, processes the text information through an embedding layer of a feature fusion sub-model of the video feature extraction model to obtain text features of the target video, and performs feature fusion on the image features and the text features through a feature fusion layer of the feature fusion sub-model to obtain fusion features of the target video.
In step 303, the computer device processes the fusion feature through a text generation submodel of the video feature extraction model, outputs a plurality of characters satisfying a text generation condition, and generates a content description text of the target video based on the plurality of characters.
According to the technical scheme provided by the embodiment of the disclosure, the image feature extraction submodel is built in the video feature extraction model, the image feature of the target video can be accurately extracted, the feature fusion submodel is built in the video feature extraction model, the text feature of the target video can be obtained, the image feature and the text feature of the target video can be subjected to feature fusion, the subsequent processing based on the fusion feature can be performed, a plurality of characters meeting the text generation condition can be output, the content description text of the target video can be automatically generated based on the plurality of output characters, the video feature extraction model based on text generation is provided, the generated content description text contains abundant information, the target video can be better represented, and the accuracy of video representation is improved.
In some embodiments, the method further comprises:
and performing image restoration on the image characteristics in the fusion characteristics through an image reconstruction sub-model of the video characteristic extraction model to obtain image reconstruction characteristics of the original image size of the target video.
Fig. 2 to fig. 3 are only basic processes of the present disclosure, and the following further explains a scheme provided by the present disclosure based on a specific implementation, and fig. 4 is a flowchart of a training method of a video feature extraction model according to an exemplary embodiment, and referring to fig. 4, the method includes:
in step 401, the computer device obtains image information, text information, an image tag, and a text tag of a sample video, the image tag representing an image reconstruction feature and the text tag representing a content description text of the sample video.
In the embodiment of the present disclosure, the sample video is used to refer to a training video for training the video feature extraction model, and in some embodiments, the number of the sample videos is multiple. It should be understood that the image information, text information, image labels, and text labels of the sample video are obtained as a training data set.
In some embodiments, the process of the computer device obtaining image information of the sample video includes at least one of: acquiring a cover image of the sample video; or, at least one frame of image in the sample video is acquired.
The number of the at least one frame of image may be one frame, two frames or more than two frames, such as three frames. In some embodiments, the process of acquiring the at least one frame of image by the computer device is: randomly extracting a preset number of images from the multiple frames of images included in the sample video, for example, randomly extracting three frames of images; or, uniformly extracting a preset number of images from the multiple frames of images included in the sample video, for example, extracting three frames of images at equal intervals; or, extracting key frames from the multiple frames of images included in the sample video. Of course, the computer device can also obtain at least one image in the sample video based on other ways, which is not limited by the embodiment of the present disclosure.
In the embodiment, the image information of the sample video can be quickly acquired by acquiring the cover image of the sample video or the image frame included in the sample video, the efficiency of acquiring the image information is ensured, the types of the image information are enriched, and the flexibility of acquiring the image information is improved.
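As a hedged illustration of the random and equal-interval sampling strategies mentioned above, a small helper could look like the following Python sketch; the function name, the fixed default of three frames, and the omission of key-frame extraction (which would require decoding the video) are assumptions made only for the example.

```python
import random

def sample_frame_indices(num_frames, k=3, mode="uniform"):
    """Return indices of k frames out of num_frames using either random
    sampling or equal-interval (uniform) sampling, as described above."""
    if mode == "random":
        return sorted(random.sample(range(num_frames), k))
    step = num_frames // k
    return [i * step for i in range(k)]   # equal-interval sampling

print(sample_frame_indices(300, 3, "uniform"))   # e.g. [0, 100, 200]
print(sample_frame_indices(300, 3, "random"))
```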
In some embodiments, the process of the computer device obtaining textual information of the sample video includes at least one of: obtaining the description information of the sample video; acquiring title information of the sample video; acquiring subtitle information of the sample video; acquiring a character recognition result of the sample video; and acquiring an audio identification result of the sample video.
Wherein the description information is information for describing a video subject of the sample video, such as subject description information or topic description information (topic); alternatively, the description information is information for describing the video content of the sample video, such as content description information (hashtag). The title information refers to a video title (title) of the sample video. In some embodiments, the description information and the title information are information set by a publisher of the sample video, for example, when the publisher of the sample video publishes the sample video, a terminal corresponding to the publisher is provided with a description information entry frame and a title information entry frame, through which the publisher can set the description information and the title information of the sample video, and further, the computer device can acquire the description information and the title information of the sample video while acquiring the sample video.
The subtitle information refers to the captions included in the images of the sample video, and in some embodiments, the computer device extracts the subtitle information of the sample video by using a subtitle extraction tool. In some embodiments, the computer device performs text recognition on the multiple frames of images included in the sample video by using an OCR (Optical Character Recognition) technique to obtain the character recognition result of the sample video. In some embodiments, the computer device performs audio recognition on the background audio of the sample video by using an ASR (Automatic Speech Recognition) technique to obtain the audio recognition result of the sample video.
In the embodiment, the text information of the sample video can be quickly acquired by acquiring the description information, the title information, the caption information, the character recognition result or the audio recognition result of the sample video, so that the efficiency of acquiring the text information is ensured, the types of the text information are enriched, and the flexibility of acquiring the text information is improved.
The image label represents the image reconstruction characteristics of the sample video after image reconstruction, wherein the image reconstruction refers to image restoration or image restoration of the image in the video to obtain the complete image characteristics. The text label represents content description text of the sample video, and the content description text refers to a sentence for describing the content of the sample video.
In some embodiments, the content description text is at least one of a content category description text, a content form description text, a content subject description text, and a content detail description text.
In some embodiments, the content category description text includes multiple levels of category description text, for example, a primary category description text and a secondary category description text, wherein the primary category description text refers to a sentence describing the primary category of the video and the secondary category description text refers to a sentence describing the secondary category of the video. It should be understood that the primary category refers to the general classification of the video, and the secondary category refers to a sub-classification of the video under the primary category, wherein the secondary categories are tree-structured with respect to the primary category, that is, one primary category may include a plurality of secondary categories. For example, the primary category may be a life category with secondary categories such as life records, good-item sharing, or health care, or the primary category may be a beauty makeup category with secondary categories such as beauty makeup teaching, beauty makeup evaluation, or skin care.
The content form description text refers to a sentence for describing the content form of a video, which represents the presentation form (or shooting form) of the video. Illustratively, the content form may be a documentary form, a short scene drama form, a street interview form, or the like. The content subject description text refers to a sentence for describing the content subject of a video, that is, the theme that the video is about. The content detail description text refers to a sentence for describing the content details of a video, for example, a sentence describing the picture content of the video.
In the embodiment, by setting multiple types of content description texts, on one hand, the content description texts with more expressive ability can be generated, and on the other hand, the multiple types of content description texts can describe the content of the video from different dimensions, so that the types of the generated content description texts are enriched, and the video can be represented more fully and completely.
In step 402, the computer device inputs the image information and the text information into a video feature extraction model.
In the embodiment of the disclosure, the video feature extraction model is provided with functions of image reconstruction and text generation. In some embodiments, the video feature extraction model is an Encoder-Decoder architecture based on the self-attention mechanism. The self-attention mechanism learns the meaning of features based on the dependency relationships between them. For each input feature, the similarity or correlation between the feature and its neighboring features is computed, for example as a vector dot product, a vector similarity, or a score produced by an additional neural network, yielding a score between the feature and each of its neighbors. The scores are then numerically converted, for example with a softmax function (an activation function): on the one hand, the scores are converted into a probability distribution whose element weights sum to 1, realizing normalization; on the other hand, the intrinsic behavior of the softmax function highlights the weights of the important elements. Finally, the element weights are used to compute a weighted sum, which is output as the self-attention score.
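To make the computation above concrete, the following is a minimal single-head dot-product self-attention sketch with softmax normalization; the use of NumPy, the scaling by the square root of the feature dimension, and the tensor shapes are illustrative assumptions rather than details taken from the disclosure.

```python
import numpy as np

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)  # numerical stability
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def self_attention(features):
    """features: (seq_len, dim) array.
    Scores each feature against all others via dot products, normalizes
    the scores with softmax, and returns the weighted sum of features."""
    d = features.shape[-1]
    scores = features @ features.T / np.sqrt(d)   # pairwise similarity
    weights = softmax(scores, axis=-1)            # each row sums to 1
    return weights @ features                     # weighted sum per position

tokens = np.random.randn(6, 512)   # 6 features of dimension 512
print(self_attention(tokens).shape)   # (6, 512)
```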
Illustratively, fig. 5 is a block diagram illustrating a video feature extraction model according to an exemplary embodiment, referring to fig. 5, the video feature extraction model includes an image feature extraction sub-model, a feature fusion sub-model, an image reconstruction sub-model, and a text generation sub-model, wherein an encoder of the video feature extraction model includes the feature fusion sub-model, and a decoder of the video feature extraction model includes the text generation sub-model. The following describes a training method of a video feature extraction model provided in the embodiment of the present disclosure based on the video feature extraction model shown in fig. 5.
In step 403, feature extraction is performed on the image information through the image feature extraction submodel of the video feature extraction model, so as to obtain the image features of the sample video.
In the embodiment of the present disclosure, the image feature extraction submodel is provided with a function of extracting image features of a video. In some embodiments, the image feature extraction submodel is a ResNet (residual network), a ViT (Vision Transformer), a Swin-Tiny model, or the like.
In some embodiments, after the computer device inputs the image information and the text information into the video feature extraction model, the image information is input into an image feature extraction submodel of the video feature extraction model through the video feature extraction model, and feature extraction is performed on the image information through the image feature extraction submodel, so that image features with predetermined dimensions, such as 512-dimensional (or other number-dimensional) image features, can be obtained.
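As a hedged sketch of this step, the code below projects backbone features to a 512-dimensional image feature. The choice of a torchvision ResNet-50 backbone and a linear projection head are assumptions made for the example, since the disclosure only names ResNet, ViT, and Swin-Tiny as possible submodels.

```python
import torch
import torch.nn as nn
from torchvision import models

class ImageFeatureExtractor(nn.Module):
    """Backbone plus a linear projection to a fixed feature dimension."""
    def __init__(self, out_dim=512):
        super().__init__()
        backbone = models.resnet50(weights=None)   # any backbone named in the text could be used
        backbone.fc = nn.Identity()                # drop the classification head
        self.backbone = backbone
        self.proj = nn.Linear(2048, out_dim)       # ResNet-50 outputs 2048-dim features

    def forward(self, images):                     # images: (B, 3, H, W)
        return self.proj(self.backbone(images))    # (B, out_dim)

frames = torch.randn(3, 3, 224, 224)               # e.g. three sampled frames
print(ImageFeatureExtractor()(frames).shape)       # torch.Size([3, 512])
```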
In step 404, the text information is processed by the embedding layer of the feature fusion submodel of the video feature extraction model, so as to obtain the text features of the sample video.
Wherein the embedding layer is configured to convert the values into vectors having a fixed size. In some embodiments, after the computer device inputs the image information and the text information into the video feature extraction model, the video feature extraction model inputs the text information into the feature fusion submodel of the video feature extraction model, and the text information is processed by the embedding layer of the feature fusion submodel, so that text features with predetermined dimensions, such as text features with 512 dimensions (or other number of dimensions), can be obtained. It should be noted that the dimension of the text feature is the same as the dimension of the image feature.
In some embodiments, the number of pieces of text information is multiple (e.g., two or more). Accordingly, before the feature extraction is performed on the text information, the method further includes: splicing the plurality of pieces of text information to obtain spliced text information, and executing the above step 404 based on the spliced text information. Illustratively, taking the five pieces of text information shown in step 401 as an example, the description information, the title information, the subtitle information, the character recognition result, and the audio recognition result of the sample video are spliced to obtain the spliced text information, and the spliced text information is input into the embedding layer of the feature fusion submodel to execute step 404. In this embodiment, when the number of pieces of text information is multiple, the pieces of text information are spliced to obtain the spliced text information, and the text feature extraction process is then performed on the spliced text information, so that multiple types of text information are referred to and the accuracy of extracting the text features is improved.
Further, in some embodiments, after the spliced text information is obtained, the method further includes: extracting the first target number of characters from the spliced text information, and performing the above step 404 based on the extracted characters. The target number is a predetermined fixed number, such as 200. In this embodiment, by extracting the first target number of characters from the spliced text information, the subsequent text feature extraction is performed on a fixed number of characters, which reduces the computation load of the video feature extraction model while still ensuring that sufficient text information is input, thereby improving the efficiency of extracting the text features.
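The following sketch shows one way the splicing, truncation to the first 200 characters, and embedding-layer lookup could be implemented. The character-level tokenization, vocabulary size, and class names are hypothetical; only the splice-truncate-embed flow follows the description above.

```python
import torch
import torch.nn as nn

def splice_and_truncate(texts, max_chars=200):
    """Concatenate multiple text fields and keep only the first
    max_chars characters (200 is the example target number above)."""
    return "".join(texts)[:max_chars]

class TextEmbedding(nn.Module):
    """Embedding layer converting token ids into fixed-size vectors."""
    def __init__(self, vocab_size=10000, dim=512):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, dim)

    def forward(self, token_ids):          # token_ids: (seq_len,)
        return self.embed(token_ids)       # (seq_len, dim)

texts = ["title ...", "description ...", "subtitle ...", "ocr ...", "asr ..."]
spliced = splice_and_truncate(texts)
token_ids = torch.tensor([ord(c) % 10000 for c in spliced])  # toy character tokenization
print(TextEmbedding()(token_ids).shape)    # (seq_len, 512)
```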
In step 405, feature fusion is performed on the image feature and the text feature through a feature fusion layer of the feature fusion sub-model, so as to obtain a fusion feature of the sample video.
The feature fusion submodel has the function of performing feature fusion on the image features and the text features so as to output features with video representation capability. In some embodiments, the feature fusion layer is provided as a self-attention layer, such as a transformer layer based on the self-attention mechanism; accordingly, the image features and the text features are processed through the self-attention layers included in the feature fusion submodel to obtain the fusion features of the sample video. In other embodiments, the feature fusion layer is provided as a deep confidence network; accordingly, the image features and the text features are processed through the deep confidence network included in the feature fusion submodel to obtain the fusion features of the sample video. In this embodiment, by setting the self-attention layer or the deep confidence network in the feature fusion submodel and performing feature fusion with the self-attention mechanism or the deep confidence network, features with better video representation capability can be obtained, and the accuracy of feature fusion is improved. Of course, in other embodiments, other network layers with a feature fusion function may be further disposed in the feature fusion submodel to implement the feature fusion function, which is not limited in this disclosure.
In the embodiment of the present disclosure, the feature fusion submodel includes an embedding layer and a feature fusion layer (such as multiple self-attention layers or a deep confidence network), so that the embedding layer of the feature fusion submodel is used to extract the text features of the sample video, the image features output by the image feature extraction submodel are further combined, and feature fusion is performed by the feature fusion layer of the feature fusion submodel.
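A minimal sketch of such a feature fusion layer, assuming a standard transformer encoder applied to the concatenated image and text features; the layer count, head count, and batch-first layout are assumptions made for the example.

```python
import torch
import torch.nn as nn

class FeatureFusion(nn.Module):
    """Concatenates image features and text features into one sequence
    and passes it through self-attention (transformer encoder) layers
    to produce the fused features."""
    def __init__(self, dim=512, n_layers=2, n_heads=8):
        super().__init__()
        layer = nn.TransformerEncoderLayer(d_model=dim, nhead=n_heads, batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, num_layers=n_layers)

    def forward(self, image_feats, text_feats):
        # image_feats: (B, n_img, dim), text_feats: (B, n_txt, dim)
        tokens = torch.cat([image_feats, text_feats], dim=1)
        return self.encoder(tokens)        # (B, n_img + n_txt, dim)

fused = FeatureFusion()(torch.randn(1, 3, 512), torch.randn(1, 200, 512))
print(fused.shape)   # torch.Size([1, 203, 512])
```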
In step 406, image restoration is performed on the image features in the fusion features through the image reconstruction submodel of the video feature extraction model, so as to obtain an image training result of the original image size.
The image reconstruction submodel is provided with a function of performing image reconstruction processing on the video to output image reconstruction characteristics of the video, and in some embodiments, the image reconstruction submodel is provided with a function of performing image restoration on an image in the video to output image reconstruction characteristics of the original image size. Image restoration is the process of reconstructing a degraded image so as to restore it to the original image. The original image size is the size of the image in the video, and in some embodiments, the original image size is determined based on the horizontal pixels, the vertical pixels, and the color information of the image. Therefore, the image characteristics of the sample video are restored to the image reconstruction characteristics of the original image size, so that high-quality image reconstruction characteristics are obtained, and the model training process is then carried out by utilizing the high-quality image reconstruction characteristics. The image training result is the image reconstruction characteristic obtained in the model training process. In some embodiments, the image reconstruction submodel includes a plurality of MLP (multilayer perceptron) networks or other network layers with an image reconstruction function, which is not limited by the embodiments of the present disclosure.
In some embodiments, a preset number of image features are extracted from the fusion features output by the feature fusion submodel, the extracted image features are input into the image reconstruction submodel of the video feature extraction model, and then the extracted image features are processed by the image reconstruction submodel of the video feature extraction model, and the image reconstruction features of the sample video are output, that is, the image training result is obtained. The preset number refers to the number of image features, it should be noted that the feature fusion layer of the feature fusion submodel includes multiple self-attention layers, and the self-attention layers can output the same number of features on the premise of inputting a certain number of features, so that the image features are extracted according to the preset number, sufficient image features can be extracted, and the image features and the text features pass through the feature fusion submodel, so that the output image features are the features fused with the text features, that is, the accuracy of image reconstruction is improved.
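The sketch below illustrates an MLP-style image reconstruction head that takes the image positions of the fused sequence and maps them back to pixel values of a fixed size; the hidden width, the 64x64 output resolution, and the number of image tokens are illustrative assumptions, not values from the disclosure.

```python
import torch
import torch.nn as nn

class ImageReconstruction(nn.Module):
    """MLP head mapping fused image-position features back to an image
    of the (here illustrative) original size (3, 64, 64)."""
    def __init__(self, dim=512, out_h=64, out_w=64):
        super().__init__()
        self.mlp = nn.Sequential(
            nn.Linear(dim, 1024),
            nn.ReLU(),
            nn.Linear(1024, out_h * out_w * 3),
        )
        self.out_shape = (3, out_h, out_w)

    def forward(self, fused, n_image_tokens=3):
        img_tokens = fused[:, :n_image_tokens, :]   # extract the preset number of image features
        recon = self.mlp(img_tokens)                # (B, n_img, 3*H*W)
        return recon.view(*recon.shape[:2], *self.out_shape)

out = ImageReconstruction()(torch.randn(1, 203, 512))
print(out.shape)   # torch.Size([1, 3, 3, 64, 64])
```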
In step 407, the fusion feature is processed through the text generation submodel of the video feature extraction model to obtain a text training result.
The text generation submodel is provided with a function of performing text generation processing on the video so as to output a content description text of the video. The text training result is a content description text generated in the model training process.
In some embodiments, the text generation submodel includes multiple self-attention layers, and accordingly, the fusion feature is input into the text generation submodel of the video feature extraction model through the video feature extraction model, and the fusion feature is processed through the self-attention layer included in the text generation submodel to obtain the text training result. In some embodiments, the fusion feature is processed through a plurality of self-attention layers included in the text generation sub-model, a plurality of characters with self-attention scores reaching a text generation condition are output, and a content description text of the sample video is generated based on the output characters, that is, the text training result is obtained. For example, the text generation condition may be that the self-attention score reaches a score threshold.
Based on the multiple types of content description texts shown in step 401, in an alternative embodiment, the text generation sub-model is provided with a function of executing multiple types of text generation tasks, and for any type, the fusion features are processed according to the processing mechanisms of the text generation tasks corresponding to the multiple types respectively through the multiple layers of attention layers included in the text generation sub-model, so as to generate multiple types of content description texts of the sample video. Further, aiming at any type, processing a first section of feature sequence in the fusion features through a plurality of self-attention layers included in the text generation sub-model, and outputting characters of which self-attention scores reach text generation conditions in the first section of feature sequence; based on the features output in the first section of feature sequence, continuing to process the second section of feature sequence in the fusion features, and outputting characters of which the self-attention scores reach text generation conditions in the second section of feature sequence; and continuously processing the third section of feature sequence in the fusion features based on the features output in the first section of feature sequence and the features output in the second section of feature sequence, outputting characters with self-attention scores reaching text generation conditions in the third section of feature sequence, further, repeatedly executing the processing process and the output process based on the output characters, outputting characters with self-attention scores reaching the text generation conditions in the next section of feature sequence, obtaining a plurality of characters output by the text generation submodel, and splicing according to the output time sequence of the characters to obtain the content description text of the sample video, namely obtaining the text training result.
In the embodiment, the self-attention layer is arranged in the text generation sub-model, and the content description text is generated by using the self-attention mechanism, so that the accuracy of text generation is improved.
In some embodiments, the text generation submodel further provides a mask mechanism, where the mask mechanism shields the feature sequence segments that do not need to be attended to, so as to avoid their influence on the video feature extraction model and improve the accuracy of the self-attention mechanism.
In some embodiments, where the text generation sub-model is for performing multiple types of text generation tasks, prior to entering the fused features into the text generation sub-model, the method further comprises: type identifiers of the respective types are added to the fusion feature, and the above step 407 is executed based on the fusion feature to which the type identifiers are added. The type identification is used for indicating the text generation task of the corresponding type. In some embodiments, the video feature extraction model is used for inputting the fusion features added with the type identifiers into a text generation sub-model, and the text generation sub-model is used for processing the fusion features based on the processing mechanisms corresponding to the type identifiers respectively to obtain the content description texts of multiple types. In this embodiment, by adding the type identifiers of the respective types to the fusion features, the text generation submodel in the video feature extraction model can trigger generation of the content description text of the sample video on the respective types based on the type identifiers of the respective types, thereby ensuring smooth text generation.
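A hedged sketch of a type-conditioned text generation submodel: a type identifier embedding is prepended to the fused features and characters are decoded one at a time. Greedy decoding stands in for the score-threshold condition described above, the causal mask of the mask mechanism is omitted for brevity, and all sizes and names are assumptions.

```python
import torch
import torch.nn as nn

class TextGeneration(nn.Module):
    """Decoder conditioned on the fused features plus a type identifier;
    emits one character token at a time (greedy decoding sketch)."""
    def __init__(self, vocab_size=10000, dim=512, n_types=4, n_layers=2):
        super().__init__()
        self.type_embed = nn.Embedding(n_types, dim)    # one id per text generation task type
        self.char_embed = nn.Embedding(vocab_size, dim)
        layer = nn.TransformerDecoderLayer(d_model=dim, nhead=8, batch_first=True)
        self.decoder = nn.TransformerDecoder(layer, num_layers=n_layers)
        self.head = nn.Linear(dim, vocab_size)

    @torch.no_grad()
    def generate(self, fused, type_id, max_len=30, bos_id=1):
        # prepend the type identifier to the fused features
        memory = torch.cat([self.type_embed(type_id).unsqueeze(1), fused], dim=1)
        tokens = torch.full((fused.size(0), 1), bos_id, dtype=torch.long)
        for _ in range(max_len):
            tgt = self.char_embed(tokens)
            out = self.decoder(tgt, memory)             # causal mask omitted for brevity
            next_tok = self.head(out[:, -1]).argmax(-1, keepdim=True)
            tokens = torch.cat([tokens, next_tok], dim=1)
        return tokens

gen = TextGeneration()
ids = gen.generate(torch.randn(1, 203, 512), torch.tensor([0]))
print(ids.shape)   # torch.Size([1, 31])
```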
It should be noted that, for example, in the above steps 406 to 407, the image training result is output first, and then the text training result is output, in other embodiments, the computer device can also output the text training result first and then the image training result, or the computer device can also output the image training result and the text training result at the same time, and the execution order of steps 406 and 407 is not limited in the embodiment of the present disclosure.
In step 408, the computer device adjusts model parameters of the image feature extraction sub-model, the feature fusion sub-model, the image reconstruction sub-model and the text generation sub-model based on the image training result, the text training result and the image tag and the text tag of the sample video, so as to train the video feature extraction model.
For the above steps 402 to 408, in some embodiments, in the first iteration of model training, the computer device inputs the image information and the text information of the sample video into an initial video feature extraction model, triggers the video feature extraction model to execute the processing of steps 402 to 407 to obtain the image training result and the text training result of the first iteration, and adjusts the model parameters of the image feature extraction submodel, the feature fusion submodel, the image reconstruction submodel, and the text generation submodel in the initial video feature extraction model based on the image training result and the text training result of the first iteration and the image label and the text label of the sample video. If the adjusted video feature extraction model does not meet the target condition, the next iteration is performed based on the adjusted model parameters. Further, in the ith iteration of model training, the image information and the text information of the sample video are input into the video feature extraction model determined in the (i-1)-th iteration, the video feature extraction model is triggered to execute the processing of steps 402 to 407 to obtain the image training result and the text training result of the ith iteration, and the model parameters of the image feature extraction submodel, the feature fusion submodel, the image reconstruction submodel, and the text generation submodel determined in the (i-1)-th iteration are adjusted based on the image training result and the text training result of the ith iteration and the image label and the text label of the sample video. If the adjusted video feature extraction model still does not meet the target condition, the (i+1)-th iteration is performed based on the adjusted model parameters, and the above training iteration process is repeated until the training meets the target condition, wherein i is a positive integer greater than 1.
In some embodiments, the target condition met by training is that the number of training iterations of the video feature extraction model reaches a target number, which is a preset number of training iterations, such as 1000; alternatively, the target condition met by the training is that the model loss value meets a target threshold condition, such as a loss value less than 0.00001. The embodiments of the present disclosure do not limit the setting of the target conditions.
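The toy training loop below sketches the iterate-until-target-condition logic with both example stopping rules (a target iteration count or a loss below a threshold). The model, the data, and the placeholder loss are invented solely for the example; the actual model loss value of this disclosure is the weighted combination described in steps 408A to 408C below.

```python
import torch
import torch.nn as nn

def train(model, batches, optimizer, target_iters=1000, loss_threshold=1e-5):
    """Repeat the training iteration until the target iteration count is
    reached or the loss value meets the threshold (both stopping rules
    are the examples given in the text)."""
    for i, (inputs, image_labels, text_labels) in enumerate(batches, start=1):
        image_out, text_out = model(inputs)
        # placeholder loss: an image term plus a text term, standing in
        # for the weighted model loss value of formula (4) below
        loss = nn.functional.mse_loss(image_out, image_labels) \
             + nn.functional.cross_entropy(text_out, text_labels)
        optimizer.zero_grad()
        loss.backward()
        optimizer.step()
        if i >= target_iters or loss.item() < loss_threshold:
            break
    return i, loss.item()

class ToyModel(nn.Module):
    def __init__(self):
        super().__init__()
        self.img_head = nn.Linear(8, 8)
        self.txt_head = nn.Linear(8, 5)
    def forward(self, x):
        return self.img_head(x), self.txt_head(x)

model = ToyModel()
batches = [(torch.randn(4, 8), torch.randn(4, 8), torch.randint(0, 5, (4,)))
           for _ in range(3)]
opt = torch.optim.Adam(model.parameters(), lr=1e-3)
print(train(model, batches, opt, target_iters=3))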
In some embodiments, in an ith iteration process of model training, based on an image training result and a text training result of the ith iteration process and an image tag and a text tag of the sample video, a model loss value of the ith iteration process is determined; and adjusting the model parameters of the video feature extraction model determined in the (i-1)-th iteration process based on the model loss value of the ith iteration process. The following describes a process of determining the model loss value of the i-th iteration process by the computer device based on steps 408A to 408C:
in step 408A, the computer device determines an image reconstruction loss value for the ith iterative process based on the image training result for the ith iterative process and the image label for the sample video, the image reconstruction loss value representing a difference between the image training result and the image label.
In some embodiments, the computer device determines a mean square error loss (MSELoss) value of the ith iterative process based on the image training result of the ith iterative process and the image label of the sample video, and uses the determined MSELoss value as the image reconstruction loss value.
In step 408B, the computer device determines a text generation loss value for the ith iterative process based on the text training result for the ith iterative process and the text label for the sample video, the text generation loss value representing a difference between the text training result and the text label.
In some embodiments, in a case that the text training result includes a plurality of types of content description text, for any type, determining a loss value of the i-th iterative process on the type based on the text training result of the i-th iterative process on the type and a text label of the sample video on the type; and carrying out weighted summation based on the loss values of the ith iteration process on the multiple types and the weight coefficients of the video feature extraction network on the multiple types to obtain a text generation loss value of the ith iteration process.
In some embodiments, for any type, the computer device determines a cross entropy loss (CEloss) value of the ith iterative process on the type based on the text training result of the ith iterative process on the type and the text label of the sample video on the type, and performs weighted summation based on the loss values of the ith iterative process on the multiple types and the weight coefficients of the video feature extraction network on the multiple types to obtain the text generation loss value of the ith iterative process.
With respect to the above-mentioned CEloss value, in some embodiments, the process of the computer device determining the CEloss value comprises: for any type, the computer device determines the cross entropy loss value of the ith iterative process on the type based on the text training result of the ith iterative process on the type, the text label of the sample video on the type, the number of sample videos, and the following equation (1).
CEloss = -(1/M) * Σ_{k=1}^{M} y_k * log(P_k)    (1)
Wherein, CEloss represents the cross entropy loss value; M represents the number of sample videos in the training dataset; y_k represents the text training result of the video feature extraction model for the kth sample video; P_k represents the correct probability of the text training result of the video feature extraction model for the kth sample video, for example, the correct probability may be the similarity between the text training result and the text label.
For the weighting coefficients corresponding to the plurality of types, in some embodiments, the process of determining the weighting coefficients by the computer device includes: for any type, determining a correct proportion of the ith iteration process on the type based on the correct text quantity of the ith iteration process on the type and the total text quantity, wherein the correct proportion represents the proportion of the correct text quantity to the total text quantity in the ith iteration process, and determining a weight coefficient of the video feature extraction network on the type based on the correct proportion of the ith iteration process on the type, wherein the correct proportion is in negative correlation with the weight coefficient.
The correct text number refers to the number of correct text training results generated in the model training process, for example, the number of text training results with a correct probability reaching a probability threshold. The total text quantity refers to the total quantity of text training results generated in the model training process.
In some embodiments, the computer device determines the weight coefficient of the video feature extraction network on the type based on the correct amount of text on the type for the ith iteration, the total amount of text on the type for the ith iteration, and the weight coefficient formula (2) below.
W=1-(correct/total) (2)
In the formula, W represents the weight coefficient of the video feature extraction network on the type; correct represents the correct text quantity on the type; total represents the total text quantity on the type.
In the above embodiment, for each type involved in text generation, the weighting coefficients corresponding to each type are determined according to the correct proportion corresponding to each type, and since the correct proportion represents the proportion of the correct text number to the total text number, and since the correct proportion and the weighting coefficients are in negative correlation, in the case of calculating the text generation loss value, a smaller weighting coefficient is set for a type with a large correct proportion, and a larger weighting coefficient is set for a type with a small correct proportion, so that the accuracy of determining the weighting coefficients is improved, and the accuracy of determining the text generation loss value is also improved.
For the ith iteration process, in some embodiments, the cross entropy loss values corresponding to the multiple types are calculated based on the above loss value formula (1), the weight coefficients corresponding to the multiple types are calculated based on the above weight coefficient formula (2), and then weighted summation is performed based on the cross entropy loss values corresponding to the multiple types, the weight coefficients corresponding to the multiple types, and the following loss value formula (3), so as to obtain the text generation loss value of the ith iteration process.
loss_text_generation = Σ_{s=1}^{N} W_s * CEloss_s    (3)
In the formula, loss_text_generation represents the text generation loss value; N represents the number of the plurality of types; W_s represents the weight coefficient corresponding to type s; CEloss_s represents the cross entropy loss value corresponding to type s.
In the above embodiment, the weighting coefficients corresponding to the types are respectively set for the types involved in the text generation, and further, the cross entropy loss values on the types and the weighting coefficients corresponding to the types are used to determine the text generation loss values, so that the accuracy of determining the text generation loss values is improved.
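A sketch combining formulas (1) to (3): per-type cross entropy losses weighted by W = 1 - correct/total. The use of PyTorch's F.cross_entropy as the CEloss computation, the toy vocabulary, and the example counts are assumptions made only for illustration.

```python
import torch
import torch.nn.functional as F

def type_weight(correct, total):
    """Formula (2): W = 1 - correct/total, so types that are already
    predicted well receive smaller weights."""
    return 1.0 - correct / total

def text_generation_loss(logits_per_type, labels_per_type, weights):
    """Formula (3): weighted sum of the per-type cross entropy losses
    (formula (1) is computed here with F.cross_entropy)."""
    loss = 0.0
    for logits, labels, w in zip(logits_per_type, labels_per_type, weights):
        loss = loss + w * F.cross_entropy(logits, labels)
    return loss

# toy example with N = 2 types (e.g. category and content detail)
logits = [torch.randn(4, 100), torch.randn(4, 100)]   # 4 sample videos, 100-token vocabulary
labels = [torch.randint(0, 100, (4,)), torch.randint(0, 100, (4,))]
weights = [type_weight(correct=3, total=4), type_weight(correct=1, total=4)]
print(text_generation_loss(logits, labels, weights))
```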
In step 408C, the computer device determines a model loss value for the i-th iteration based on the image reconstruction loss value and the text generation loss value.
In some embodiments, a weighted sum is performed based on the image reconstruction loss value, the weight coefficient corresponding to the image reconstruction loss value, the text generation loss value, and the weight coefficient corresponding to the text generation loss value, so as to obtain a model loss value of the i-th iteration.
In some embodiments, a weighted summation is performed based on the image reconstruction loss value, the weight coefficient corresponding to the image reconstruction loss value, the text generation loss value, the weight coefficient corresponding to the text generation loss value, and the following loss value formula (4), so as to obtain a model loss value of the i-th iteration process.
Totalloss = W_image_reconstruction * loss_image_reconstruction + W_text_generation * loss_text_generation    (4)
Wherein, Totalloss represents the model loss value; W_image_reconstruction represents the weight coefficient corresponding to the image reconstruction loss value; loss_image_reconstruction represents the image reconstruction loss value; W_text_generation represents the weight coefficient corresponding to the text generation loss value; loss_text_generation represents the text generation loss value.
In the embodiment of the disclosure, the weight coefficients corresponding to the tasks are respectively set for the image reconstruction task and the text generation task of the video feature extraction model, and then the loss value of each task and the weight coefficient corresponding to each task are used for determining the model loss value, so that the accuracy of determining the model loss value is improved.
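A sketch of formula (4), computing the image reconstruction loss as MSE (as in step 408A) and combining it with a given text generation loss value. The weight values are placeholders chosen only to reflect that the text treats text generation as the main task and image reconstruction as the auxiliary task.

```python
import torch
import torch.nn.functional as F

def total_loss(image_out, image_label, text_gen_loss,
               w_image=0.5, w_text=1.0):
    """Formula (4): weighted sum of the image reconstruction loss and
    the text generation loss; weight values here are placeholders."""
    image_recon_loss = F.mse_loss(image_out, image_label)   # step 408A
    return w_image * image_recon_loss + w_text * text_gen_loss

print(total_loss(torch.randn(2, 3, 64, 64), torch.randn(2, 3, 64, 64),
                 torch.tensor(2.3)))
```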
In the above embodiment, in any iteration process of model training, the model loss value of the iteration process is used to adjust the model parameters of each sub-model in the video feature extraction model, so as to improve the text generation capability of the video feature extraction model, thereby training the video feature extraction model with higher text generation capability.
In the embodiment, the image feature extraction submodel is built in the video feature extraction model, so that the image features of the sample video can be accurately extracted, the feature fusion submodel is built in the video feature extraction model, not only can the text features of the sample video be obtained, but also the image features and the text features of the sample video can be subjected to feature fusion, so that on the basis of the fusion features, on one hand, the image reconstruction can be performed on the sample video to obtain the high-quality image features, on the other hand, the content description text of the sample video can be generated, the training method of the video feature extraction model combining the image reconstruction task and the text generation task is provided, and the training effect of the video feature extraction model is improved.
The technical scheme provided by the embodiment of the disclosure is that a video feature extraction model is subjected to model training by utilizing image information and text information of a sample video and a text label and an image label of the sample video, wherein an image feature extraction submodel is constructed in the video feature extraction model, so that the image feature of the sample video can be accurately extracted, and a feature fusion submodel is constructed in the video feature extraction model, so that not only the text feature of the sample video can be obtained, but also the image feature and the text feature of the sample video can be subjected to feature fusion, so that the subsequent fusion features are based on one hand, the image reconstruction can be performed on the sample video to obtain the high-quality image feature, and on the other hand, the content description text of the sample video can be generated, thus, the model training method based on the double training tasks is provided, under the condition that the text generation task is used as a main task and the image reconstruction task is used as an auxiliary task, the image label of the sample video represents the image reconstruction characteristics, so that the extraction capability of the video characteristic extraction model for the image characteristics can be improved in the model training process, the high-quality image characteristics can be obtained, the video characteristic extraction model with better text generation capability can be trained on the basis of obtaining the high-quality image characteristics, and the training effect of the video characteristic extraction model is improved.
In the scheme shown in fig. 4, a training method for a video feature extraction model is provided, in some embodiments, a text generation method based on the video feature extraction model trained by the training method can be implemented, fig. 6 is a flowchart of a text generation method based on the video feature extraction model according to an exemplary embodiment, and referring to fig. 6, the method includes:
in step 601, the computer device obtains image information and text information of the target video.
In the embodiment of the present disclosure, a target video is used to refer to a video to be subjected to text generation. In some embodiments, taking a computer device as a terminal as an example, the target video is a video locally stored in the terminal, or a video downloaded by the terminal, etc.; in other embodiments, for example, the computer device is provided as a server, and the target video is a video in a video database associated with the server, or a video uploaded by the terminal, and the like. The disclosed embodiments do not limit the source of the target video.
In some embodiments, the process of the computer device obtaining image information of the target video includes at least one of: acquiring a cover image of the target video; or acquiring at least one frame of image in the target video. It should be noted that, regarding the process of acquiring the image information of the target video, reference is made to the process of acquiring the image information of the sample video in step 401, and details are not repeated.
In some embodiments, the process of the computer device obtaining the text information of the target video comprises at least one of: acquiring description information of the target video; acquiring the title information of the target video; acquiring subtitle information of the target video; acquiring a character recognition result of the target video; and acquiring an audio recognition result of the target video. It should be noted that, regarding the process of acquiring the text information of the target video, reference is made to the process of acquiring the text information of the sample video in step 401, and details are not repeated.
In step 602, the computer device inputs the image information and the text information into the video feature extraction model, and performs feature extraction on the image information through an image feature extraction sub-model of the video feature extraction model to obtain image features of the target video.
It should be noted that, for the process of acquiring the image feature of the target video, refer to the process of acquiring the image feature of the sample video in step 403, and details are not repeated.
In step 603, the text information is processed through the embedding layer of the feature fusion submodel of the video feature extraction model, so as to obtain the text feature of the target video.
It should be noted that, regarding the process of obtaining the text feature of the target video, reference is made to the process of obtaining the text feature of the sample video in step 404, and details are not repeated.
In step 604, feature fusion is performed on the image feature and the text feature through the feature fusion layer of the feature fusion submodel to obtain a fusion feature of the target video.
It should be noted that, for the process of obtaining the fusion feature of the target video, refer to the process of obtaining the fusion feature of the sample video in step 405, and details are not repeated.
In step 605, the fusion feature is processed by the text generation submodel of the video feature extraction model, a plurality of characters satisfying the text generation condition are output, and the content description text of the target video is generated based on the plurality of characters.
It should be noted that, for the process of obtaining the content description text of the target video, reference is made to the process of obtaining the text training result of the sample video in step 407, and details are not repeated.
In the above embodiment, the computer device inputs the image information and the text information into the video feature extraction model, processes the image information and the text information through the video feature extraction model, outputs a plurality of characters meeting the text generation condition, and generates the content description text of the target video based on the plurality of characters, thereby providing a model for text generation based on both image information and text information, which refers to multi-modal information of the image modality and the text modality and increases the amount of information referred to by the video feature extraction model. A modality refers to a representation or presentation form of information; it should be understood that each medium or form of information may be referred to as a modality, for example, media of information such as images, text, and audio. In other embodiments, the computer device can also utilize information of other modalities to perform the text generation process, such as publishing information of the target video, and the like.
In some embodiments, the method further comprises: and performing image restoration on the image characteristics in the fusion characteristics through an image reconstruction sub-model of the video characteristic extraction model to obtain image reconstruction characteristics of the original image size of the target video. It should be noted that, for the process of obtaining the image reconstruction feature of the target video, refer to the process of obtaining the image training result of the sample video in step 406, and details are not repeated.
According to the technical scheme provided by the embodiment of the disclosure, the image feature of the target video can be accurately extracted by constructing the image feature extraction submodel in the video feature extraction model, the text feature of the target video can be acquired, and the image feature and the text feature of the target video can be subjected to feature fusion by constructing the feature fusion submodel in the video feature extraction model, so that on the basis of the fusion feature, on one hand, the image reconstruction of the target video can be performed to obtain the high-quality image feature, on the other hand, the content description text of the target video can be generated, the video feature extraction model combining the image reconstruction task and the text generation task is provided, the target video can be better characterized, and the accuracy of video characterization is improved.
FIG. 7 is a block diagram illustrating a training apparatus for a video feature extraction model according to an example embodiment. Referring to fig. 7, the apparatus includes an acquisition unit 701, an input unit 702, a processing unit 703 and a training unit 704.
An obtaining unit 701 configured to perform obtaining image information, text information, an image tag, and a text tag of a sample video, the image tag representing an image reconstruction feature, the text tag representing a content description text of the sample video;
an input unit 702 configured to perform inputting the image information and the text information into a video feature extraction model, performing feature extraction on the image information through an image feature extraction sub-model of the video feature extraction model to obtain image features of the sample video, processing the text information through an embedding layer of a feature fusion sub-model of the video feature extraction model to obtain text features of the sample video, and performing feature fusion on the image features and the text features through a feature fusion layer of the feature fusion sub-model to obtain fusion features of the sample video;
the processing unit 703 is configured to perform image restoration on the image features in the fusion features through an image reconstruction submodel of the video feature extraction model to obtain an image training result of an original image size, and process the fusion features through a text generation submodel of the video feature extraction model to obtain a text training result;
a training unit 704 configured to perform adjusting model parameters of the image feature extraction sub-model, the feature fusion sub-model, the image reconstruction sub-model, and the text generation sub-model based on the image training result, the text training result, and the image tag and the text tag of the sample video, so as to train the video feature extraction model.
The technical scheme provided by the embodiment of the disclosure is that a video feature extraction model is subjected to model training by utilizing image information and text information of a sample video and a text label and an image label of the sample video, wherein the image feature of the sample video can be accurately extracted by constructing an image feature extraction submodel in the video feature extraction model, and the feature fusion submodel is constructed in the video feature extraction model, so that not only can the text feature of the sample video be obtained, but also the image feature and the text feature of the sample video can be subjected to feature fusion, and the subsequent process is based on the fusion feature, on one hand, the image reconstruction can be performed on the sample video to obtain the high-quality image feature, and on the other hand, the content description text of the sample video can be generated, thus, the model training method based on the double training tasks is provided, under the condition that the text generation task is used as a main task and the image reconstruction task is used as an auxiliary task, the image label of the sample video represents the image reconstruction characteristics, so that the extraction capability of the video characteristic extraction model for the image characteristics can be improved in the model training process, the high-quality image characteristics can be obtained, the video characteristic extraction model with better text generation capability can be trained on the basis of obtaining the high-quality image characteristics, and the training effect of the video characteristic extraction model is improved.
In some embodiments, the obtaining unit 701 is configured to perform at least one of the following:
acquiring a cover image of the sample video; or, at least one frame of image in the sample video is acquired.
In some embodiments, the obtaining unit 701 is configured to perform at least one of the following:
obtaining the description information of the sample video; acquiring title information of the sample video; acquiring subtitle information of the sample video; acquiring a character recognition result of the sample video, wherein the character recognition result is obtained by performing character recognition on at least one frame of image in the sample video; and acquiring an audio identification result of the sample video, wherein the audio identification result is obtained by performing audio identification on the background audio of the sample video.
In some embodiments, the content description text is at least one of a content category description text, a content form description text, a content subject description text, and a content detail description text.
In some embodiments, the input unit 702 comprises a processing subunit configured to perform any one of:
processing the image feature and the text feature through a self-attention layer included in the feature fusion sub-model to obtain a fusion feature of the sample video;
and processing the image characteristic and the text characteristic through a depth confidence network included in the characteristic fusion submodel to obtain the fusion characteristic of the sample video.
In some embodiments, the processing unit 703 comprises a text generation sub-unit configured to perform:
and processing the fusion features through a self-attention layer included by the text generation sub-model to obtain the text training result.
In some embodiments, the text training results include a plurality of types of content description text;
the device also comprises an adding unit which is configured to add type identifications of various types on the fusion characteristics;
the processing unit 703 includes a text generation subunit further configured to input the fusion features with the added type identifiers into the text generation sub-model, and to process the fusion features through the text generation sub-model based on the processing mechanism corresponding to each type identifier, respectively, to obtain the content description texts of the multiple types.
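As a sketch of this multi-type variant, a learned type-identifier embedding can be added to the fusion features and the text generation sub-model run once per type. The type names and module sizes are illustrative assumptions.

```python
import torch
import torch.nn as nn

dim, vocab_size = 512, 8000
types = ["category", "form", "subject", "detail"]
type_embedding = nn.Embedding(len(types), dim)
text_generator = nn.Sequential(
    nn.TransformerEncoderLayer(d_model=dim, nhead=8, batch_first=True),
    nn.Linear(dim, vocab_size),
)

fusion_features = torch.randn(2, 21, dim)
outputs = {}
for type_id, name in enumerate(types):
    tag = type_embedding(torch.tensor([type_id]))   # (1, dim) type identifier
    tagged = fusion_features + tag                  # broadcast over all tokens
    outputs[name] = text_generator(tagged)          # description-text logits for this type
print({k: v.shape for k, v in outputs.items()})
```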
In some embodiments, there are multiple pieces of text information;
the apparatus further comprises a splicing unit configured to splice the multiple pieces of text information to obtain spliced text information;
the input unit is further configured to perform, based on the spliced text information, the step of processing the text information through the embedding layer of the feature fusion sub-model of the video feature extraction model to obtain the text features of the sample video.
In some embodiments, the input unit is further configured to perform:
extracting the first target number of characters from the spliced text information;
and performing, based on the extracted characters, the step of processing the text information through the embedding layer of the feature fusion sub-model of the video feature extraction model to obtain the text features of the sample video.
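A simple sketch of the splicing and truncation embodiments above: the pieces of text information are joined into one string and only the first target number of characters are kept before they are fed to the embedding layer. The separator and length are assumptions.

```python
def splice_text_information(pieces, target_length=256, separator=" "):
    """pieces: e.g. [description, title, subtitles, OCR result, ASR result]."""
    spliced = separator.join(p for p in pieces if p)   # splice the text information
    return spliced[:target_length]                     # keep the first target_length characters

text = splice_text_information(
    ["a cooking tutorial", "5-minute fried rice", "add the eggs first"],
    target_length=40,
)
print(text)  # "a cooking tutorial 5-minute fried rice a" (truncated to 40 chars)
```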
In some embodiments, the training unit 704 includes:
a determining subunit, configured to perform, in an ith iteration process of model training, determining a model loss value of the ith iteration process based on an image training result and a text training result of the ith iteration process and an image label and a text label of the sample video, where i is a positive integer greater than 1;
and an adjusting subunit configured to adjust, based on the model loss value of the ith iteration process, the model parameters of the video feature extraction model determined in the (i-1)-th iteration process, and to repeat the training iteration process until the training meets the target condition.
In some embodiments, the determining subunit includes:
an image reconstruction loss value determination subunit configured to perform determining an image reconstruction loss value of the ith iterative process based on an image training result of the ith iterative process and an image label of the sample video, the image reconstruction loss value representing a difference between the image training result and the image label;
a text generation loss value determination subunit configured to perform determining a text generation loss value of the ith iteration process based on the text training result of the ith iteration process and the text label of the sample video, the text generation loss value representing a difference between the text training result and the text label;
a model loss value determination subunit configured to perform determining a model loss value for the i-th iteration process based on the image reconstruction loss value and the text generation loss value.
In some embodiments, the model loss value determination subunit is configured to perform:
and carrying out weighted summation based on the image reconstruction loss value, the weight coefficient corresponding to the image reconstruction loss value, the text generation loss value and the weight coefficient corresponding to the text generation loss value to obtain the model loss value of the ith iteration process.
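A hedged sketch of this loss combination follows: an image reconstruction loss, a text generation loss, and their weighted sum as the model loss value of the i-th iteration. MSE for the reconstruction loss and the weights 0.3 / 1.0 (image task as auxiliary, text task as main) are illustrative assumptions.

```python
import torch
import torch.nn.functional as F

def model_loss(image_result, image_label, text_logits, text_label,
               w_image=0.3, w_text=1.0):
    # Image reconstruction loss: difference between image training result and image label.
    image_reconstruction_loss = F.mse_loss(image_result, image_label)
    # Text generation loss: difference between text training result and text label.
    text_generation_loss = F.cross_entropy(
        text_logits.reshape(-1, text_logits.size(-1)), text_label.reshape(-1)
    )
    return w_image * image_reconstruction_loss + w_text * text_generation_loss

# One training iteration then looks like:
#   image_result, text_logits = model(image, text_ids)
#   loss = model_loss(image_result, image_label, text_logits, text_label)
#   optimizer.zero_grad(); loss.backward(); optimizer.step()
```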
In some embodiments, the text training results include a plurality of types of content description text;
the text generation loss value determination subunit is configured to perform:
for any type, determining a cross-entropy loss value of the ith iteration process on the type based on the text training result of the ith iteration process on the type and the description text label of the sample video on the type;
and carrying out weighted summation based on the cross entropy loss values of the ith iteration process on the multiple types and the weight coefficients of the video feature extraction network on the multiple types to obtain a text generation loss value of the ith iteration process.
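The per-type text generation loss can be sketched as below: a cross-entropy loss is computed per description-text type and the losses are combined with per-type weight coefficients. The type names and dict-based interface are assumptions.

```python
import torch
import torch.nn.functional as F

def text_generation_loss(per_type_logits, per_type_labels, per_type_weights):
    """All three arguments are dicts keyed by type name (e.g. 'category', 'detail')."""
    total = torch.tensor(0.0)
    for t, logits in per_type_logits.items():
        ce = F.cross_entropy(logits.reshape(-1, logits.size(-1)),
                             per_type_labels[t].reshape(-1))
        total = total + per_type_weights[t] * ce   # weighted summation over types
    return total
```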
In some embodiments, the apparatus further comprises a determining unit configured to perform:
for any type, determining the correct proportion of the ith iteration process on the type based on the correct text quantity of the ith iteration process on the type and the total text quantity, wherein the correct proportion represents the proportion of the correct text quantity to the total text quantity in the ith iteration process;
and determining the weight coefficient of the video feature extraction network on the type based on the correct proportion of the ith iteration process on the type, wherein the correct proportion is in negative correlation with the weight coefficient.
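A sketch of this weight update: the correct proportion (accuracy) of each type in the i-th iteration is computed, and the weight coefficient is made to decrease as that proportion increases. The specific mapping (1 - proportion, then normalization) is an assumption; the disclosure only requires a negative correlation.

```python
def type_weights(correct_counts, total_counts):
    """Both arguments are dicts keyed by type name."""
    proportions = {t: correct_counts[t] / total_counts[t] for t in correct_counts}
    raw = {t: 1.0 - p for t, p in proportions.items()}   # negative correlation with accuracy
    norm = sum(raw.values()) or 1.0
    return {t: w / norm for t, w in raw.items()}

print(type_weights({"category": 80, "detail": 20}, {"category": 100, "detail": 100}))
# {'category': 0.2, 'detail': 0.8} - the harder type gets the larger weight
```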
Fig. 8 is a block diagram illustrating a text generation apparatus based on a video feature extraction model according to an example embodiment. Referring to fig. 8, the apparatus includes an acquisition unit 801, an input unit 802, and a processing unit 803.
An acquisition unit 801 configured to perform acquisition of image information and text information of a target video;
an input unit 802 configured to perform inputting the image information and the text information into the video feature extraction model, performing feature extraction on the image information through an image feature extraction sub-model of the video feature extraction model to obtain image features of the target video, processing the text information through an embedding layer of a feature fusion sub-model of the video feature extraction model to obtain text features of the target video, and performing feature fusion on the image features and the text features through a feature fusion layer of the feature fusion sub-model to obtain fusion features of the target video;
a processing unit 803 configured to perform processing on the fusion feature through a text generation sub-model of the video feature extraction model, output a plurality of characters satisfying a text generation condition, and generate a content description text of the target video based on the plurality of characters.
According to this technical solution, the image feature extraction sub-model built into the video feature extraction model accurately extracts the image features of the target video, and the feature fusion sub-model both obtains the text features of the target video and fuses the image features with the text features. The subsequent processing based on the fusion features can therefore output a plurality of characters that meet the text generation condition, and the content description text of the target video can then be generated automatically from the output characters. A text-generation-based video feature extraction model is thus provided; the generated content description text contains rich information, better represents the target video, and improves the accuracy of video representation.
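A minimal greedy-decoding sketch for this inference path: characters are emitted one at a time until an end token or a length limit is reached (two assumed examples of a "text generation condition") and are then joined into the content description text. The vocabulary and scoring function are stand-ins, not the disclosure's actual components.

```python
import torch

def generate_description(step_logits_fn, id_to_char, end_id=0, max_len=30):
    """step_logits_fn(prefix_ids) -> (vocab,) logits for the next character."""
    prefix = []
    while len(prefix) < max_len:                  # generation condition: length limit
        next_id = int(torch.argmax(step_logits_fn(prefix)))
        if next_id == end_id:                     # generation condition: end token
            break
        prefix.append(next_id)
    return "".join(id_to_char[i] for i in prefix)

# Toy usage with a fixed fake scorer that prefers character id 1 three times, then ends.
fake_vocab = {1: "a", 2: "b"}
fake_scores = lambda prefix: (torch.tensor([0.1, 1.0, 0.5]) if len(prefix) < 3
                              else torch.tensor([2.0, 0.1, 0.1]))
print(generate_description(fake_scores, fake_vocab))  # "aaa"
```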
In some embodiments, the processing unit 803 is further configured to perform:
performing image restoration on the image features in the fusion features through an image reconstruction sub-model of the video feature extraction model to obtain image reconstruction features of the original image size of the target video.
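A hedged sketch of this optional restoration step: the image part of the fusion features is upsampled back to the original image size with transposed convolutions. The layer sizes and the 224x224 target are assumptions.

```python
import torch
import torch.nn as nn

dim = 512
image_reconstruction = nn.Sequential(
    nn.Linear(dim, 256 * 7 * 7),
    nn.Unflatten(1, (256, 7, 7)),
    nn.ConvTranspose2d(256, 64, kernel_size=4, stride=4),   # 7x7 -> 28x28
    nn.ReLU(),
    nn.ConvTranspose2d(64, 3, kernel_size=8, stride=8),     # 28x28 -> 224x224
)

image_token = torch.randn(2, dim)                 # image part of the fusion features
print(image_reconstruction(image_token).shape)    # torch.Size([2, 3, 224, 224])
```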
It should be noted that the training apparatus for the video feature extraction model provided in the foregoing embodiment is described using the above division of functional modules only as an example; in practical applications, the above functions may be distributed among different functional modules as needed, that is, the internal structure of the device may be divided into different functional modules to complete all or part of the functions described above. In addition, the training apparatus for the video feature extraction model provided in the above embodiment and the embodiment of the training method for the video feature extraction model belong to the same concept; the specific implementation process is described in detail in the method embodiment and is not repeated here.
The computer device mentioned in the embodiments of the present disclosure may be provided as a terminal. Fig. 9 shows a block diagram of a terminal 900 according to an exemplary embodiment of the disclosure. The terminal 900 may be a smart phone, a tablet computer, an MP3 player (Moving Picture Experts Group Audio Layer III), an MP4 player (Moving Picture Experts Group Audio Layer IV), a notebook computer, or a desktop computer. Terminal 900 may also be referred to by other names such as user equipment, portable terminal, laptop terminal, or desktop terminal.
In general, terminal 900 includes: a processor 901 and a memory 902.
Processor 901 may include one or more processing cores, such as a 4-core processor, an 8-core processor, and so forth. The processor 901 may be implemented in at least one hardware form of a DSP (Digital Signal Processing), an FPGA (Field-Programmable Gate Array), and a PLA (Programmable Logic Array). Processor 901 may also include a main processor and a coprocessor, where the main processor is a processor for Processing data in a wake state, and is also called a Central Processing Unit (CPU); a coprocessor is a low power processor for processing data in a standby state. In some embodiments, the processor 901 may be integrated with a GPU (Graphics Processing Unit), which is responsible for rendering and drawing the content required to be displayed by the display screen. In some embodiments, the processor 901 may further include an AI (Artificial Intelligence) processor for processing computing operations related to machine learning.
Memory 902 may include one or more computer-readable storage media, which may be non-transitory. The memory 902 may also include high-speed random access memory, as well as non-volatile memory, such as one or more magnetic disk storage devices, flash memory storage devices. In some embodiments, a non-transitory computer readable storage medium in the memory 902 is used for storing at least one program code for execution by the processor 901 to implement a process executed by a terminal in a training method of a video feature extraction model or a text generation method based on a video feature extraction model provided by method embodiments in the present disclosure.
In some embodiments, terminal 900 can also optionally include: a peripheral interface 903 and at least one peripheral. The processor 901, memory 902, and peripheral interface 903 may be connected by buses or signal lines. Various peripheral devices may be connected to the peripheral interface 903 via a bus, signal line, or circuit board. Specifically, the peripheral device includes: at least one of a radio frequency circuit 904, a display screen 905, a camera assembly 906, an audio circuit 907, a positioning assembly 908, and a power supply 909.
The peripheral interface 903 may be used to connect at least one peripheral related to I/O (Input/Output) to the processor 901 and the memory 902. In some embodiments, the processor 901, memory 902, and peripheral interface 903 are integrated on the same chip or circuit board; in some other embodiments, any one or two of the processor 901, the memory 902, and the peripheral device interface 903 may be implemented on a separate chip or circuit board, which is not limited by the embodiment.
The Radio Frequency circuit 904 is used for receiving and transmitting RF (Radio Frequency) signals, also called electromagnetic signals. The radio frequency circuitry 904 communicates with communication networks and other communication devices via electromagnetic signals. The radio frequency circuit 904 converts an electrical signal into an electromagnetic signal to transmit, or converts a received electromagnetic signal into an electrical signal. Optionally, the radio frequency circuit 904 comprises: an antenna system, an RF transceiver, one or more amplifiers, a tuner, an oscillator, a digital signal processor, a codec chipset, a subscriber identity module card, and so forth. The radio frequency circuit 904 may communicate with other terminals via at least one wireless communication protocol. The wireless communication protocols include, but are not limited to: metropolitan area networks, various generation mobile communication networks (2G, 3G, 4G, and 5G), Wireless local area networks, and/or WiFi (Wireless Fidelity) networks. In some embodiments, the radio frequency circuit 904 may also include NFC (Near Field Communication) related circuits, which are not limited by this disclosure.
The display screen 905 is used to display a UI (User Interface). The UI may include graphics, text, icons, video, and any combination thereof. When the display screen 905 is a touch display screen, the display screen 905 also has the ability to capture touch signals on or over its surface. The touch signal may be input to the processor 901 as a control signal for processing. At this point, the display screen 905 may also be used to provide virtual buttons and/or a virtual keyboard, also referred to as soft buttons and/or a soft keyboard. In some embodiments, there may be one display screen 905 disposed on the front panel of the terminal 900; in other embodiments, there may be at least two display screens 905, each disposed on a different surface of the terminal 900 or in a foldable design; in still other embodiments, the display screen 905 may be a flexible display disposed on a curved or folded surface of the terminal 900. The display screen 905 may even be arranged as a non-rectangular irregular figure, i.e., an irregularly shaped screen. The display screen 905 may be an LCD (Liquid Crystal Display), an OLED (Organic Light-Emitting Diode) display, or the like.
The camera assembly 906 is used to capture images or video. Optionally, camera assembly 906 includes a front camera and a rear camera. Generally, a front camera is disposed at a front panel of the terminal, and a rear camera is disposed at a rear surface of the terminal. In some embodiments, the number of the rear cameras is at least two, and each rear camera is any one of a main camera, a depth-of-field camera, a wide-angle camera and a telephoto camera, so that the main camera and the depth-of-field camera are fused to realize a background blurring function, and the main camera and the wide-angle camera are fused to realize panoramic shooting and VR (Virtual Reality) shooting functions or other fusion shooting functions. In some embodiments, camera assembly 906 may also include a flash. The flash lamp can be a monochrome temperature flash lamp or a bicolor temperature flash lamp. The double-color-temperature flash lamp is a combination of a warm-light flash lamp and a cold-light flash lamp, and can be used for light compensation at different color temperatures.
Audio circuitry 907 may include a microphone and speaker. The microphone is used for collecting sound waves of a user and the environment, converting the sound waves into electric signals, and inputting the electric signals to the processor 901 for processing, or inputting the electric signals to the radio frequency circuit 904 for realizing voice communication. For stereo sound acquisition or noise reduction purposes, the microphones may be multiple and disposed at different locations of the terminal 900. The microphone may also be an array microphone or an omni-directional pick-up microphone. The speaker is used to convert electrical signals from the processor 901 or the radio frequency circuit 904 into sound waves. The loudspeaker can be a traditional film loudspeaker or a piezoelectric ceramic loudspeaker. When the speaker is a piezoelectric ceramic speaker, the speaker can be used for purposes such as converting an electric signal into a sound wave audible to a human being, or converting an electric signal into a sound wave inaudible to a human being to measure a distance. In some embodiments, audio circuit 907 may also include a headphone jack.
The positioning component 908 is used to locate the current geographic Location of the terminal 900 for navigation or LBS (Location Based Service).
Power supply 909 is used to provide power to the various components in terminal 900. The power supply 909 may be an alternating current source, a direct current source, a disposable battery, or a rechargeable battery. When the power supply 909 includes a rechargeable battery, the rechargeable battery may support wired charging or wireless charging. The rechargeable battery may also be used to support fast charge technology.
In some embodiments, terminal 900 can also include one or more sensors 910. The one or more sensors 910 include, but are not limited to: an acceleration sensor 911, a gyro sensor 912, a pressure sensor 913, a fingerprint sensor 914, an optical sensor 915, and a proximity sensor 916.
The acceleration sensor 911 can detect the magnitude of acceleration in three coordinate axes of the coordinate system established with the terminal 900. For example, the acceleration sensor 911 may be used to detect the components of the gravitational acceleration in three coordinate axes. The processor 901 can control the display screen 905 to display the user interface in a landscape view or a portrait view according to the gravitational acceleration signal collected by the acceleration sensor 911. The acceleration sensor 911 may also be used for acquisition of motion data of a game or a user.
The gyro sensor 912 may detect a body direction and a rotation angle of the terminal 900, and the gyro sensor 912 may cooperate with the acceleration sensor 911 to acquire a 3D motion of the user on the terminal 900. The processor 901 can implement the following functions according to the data collected by the gyro sensor 912: motion sensing (such as changing the UI according to a user's tilting operation), image stabilization at the time of photographing, game control, and inertial navigation.
The pressure sensor 913 may be disposed on a side bezel of the terminal 900 and/or underneath the display 905. When the pressure sensor 913 is disposed on the side frame of the terminal 900, the user's holding signal of the terminal 900 may be detected, and the processor 901 performs left-right hand recognition or shortcut operation according to the holding signal collected by the pressure sensor 913. When the pressure sensor 913 is disposed at a lower layer of the display screen 905, the processor 901 controls the operability control on the UI interface according to the pressure operation of the user on the display screen 905. The operability control comprises at least one of a button control, a scroll bar control, an icon control and a menu control.
The optical sensor 915 is used to collect the ambient light intensity. In one embodiment, the processor 901 may control the display brightness of the display screen 905 according to the ambient light intensity collected by the optical sensor 915. Specifically, when the ambient light intensity is high, the display brightness of the display screen 905 is increased; when the ambient light intensity is low, the display brightness of the display screen 905 is reduced. In another embodiment, the processor 901 may also dynamically adjust the shooting parameters of the camera assembly 906 according to the ambient light intensity collected by the optical sensor 915.
The proximity sensor 916, also known as a distance sensor, is typically provided on the front panel of the terminal 900. The proximity sensor 916 is used to collect the distance between the user and the front surface of the terminal 900. In one embodiment, when the proximity sensor 916 detects that the distance between the user and the front face of the terminal 900 gradually decreases, the processor 901 controls the display screen 905 to switch from the bright-screen state to the screen-off state; when the proximity sensor 916 detects that the distance between the user and the front face of the terminal 900 gradually increases, the processor 901 controls the display screen 905 to switch from the screen-off state to the bright-screen state.
Those skilled in the art will appreciate that the configuration shown in fig. 9 is not limiting to terminal 900 and may include more or fewer components than shown, or some components may be combined, or a different arrangement of components may be used.
The computer device mentioned in the embodiments of the present disclosure may also be provided as a server. Fig. 10 is a block diagram of a server according to an exemplary embodiment. The server 1000 may vary greatly in configuration or performance, and may include one or more processors (CPUs) 1001 and one or more memories 1002, where the one or more memories 1002 store at least one program code that is loaded and executed by the one or more processors 1001 to implement the processes executed by the server in the training method of the video feature extraction model provided by the above method embodiments. Of course, the server 1000 may also have components such as a wired or wireless network interface, a keyboard, and an input/output interface for input and output, and the server 1000 may also include other components for implementing the functions of the device, which are not described herein again.
In an exemplary embodiment, a computer readable storage medium comprising program code, such as the memory 1002 comprising program code, executable by the processor 1001 of the server 1000 to perform the above-described training method of the video feature extraction model is also provided. Alternatively, the computer-readable storage medium may be a ROM (Read-Only Memory), a RAM (Random Access Memory), a CD-ROM (Compact-Disc Read-Only Memory), a magnetic tape, a floppy disk, an optical data storage device, and the like.
In an exemplary embodiment, a computer program product is also provided, which comprises a computer program that, when being executed by a processor, implements the method of training a video feature extraction model as described above.
In some embodiments, a computer program according to the embodiments of the present disclosure may be deployed to be executed on one computer device, on multiple computer devices located at one site, or on multiple computer devices distributed at multiple sites and interconnected by a communication network, and the multiple computer devices distributed at the multiple sites and interconnected by the communication network may constitute a blockchain system.
Other embodiments of the disclosure will be apparent to those skilled in the art from consideration of the specification and practice of the disclosure disclosed herein. This disclosure is intended to cover any variations, uses, or adaptations of the disclosure following, in general, the principles of the disclosure and including such departures from the present disclosure as come within known or customary practice within the art to which the disclosure pertains. It is intended that the specification and examples be considered as exemplary only, with a true scope and spirit of the disclosure being indicated by the following claims.
It will be understood that the present disclosure is not limited to the precise arrangements described above and shown in the drawings and that various modifications and changes may be made without departing from the scope thereof. The scope of the present disclosure is limited only by the appended claims.

Claims (10)

1. A training method of a video feature extraction model is characterized by comprising the following steps:
acquiring image information, text information, an image tag and a text tag of a sample video, wherein the image tag represents image reconstruction characteristics, and the text tag represents a content description text of the sample video;
inputting the image information and the text information into a video feature extraction model, performing feature extraction on the image information through an image feature extraction submodel of the video feature extraction model to obtain image features of the sample video, processing the text information through an embedding layer of a feature fusion submodel of the video feature extraction model to obtain text features of the sample video, and performing feature fusion on the image features and the text features through a feature fusion layer of the feature fusion submodel to obtain fusion features of the sample video;
carrying out image restoration on the image features in the fusion features through an image reconstruction sub-model of the video feature extraction model to obtain an image training result of the size of an original image, and processing the fusion features through a text generation sub-model of the video feature extraction model to obtain a text training result;
and adjusting model parameters of the image feature extraction submodel, the feature fusion submodel, the image reconstruction submodel and the text generation submodel based on the image training result, the text training result, the image label and the text label of the sample video so as to train the video feature extraction model.
2. The method for training the video feature extraction model according to claim 1, wherein the content description text is at least one of a content category description text, a content form description text, a content subject description text, and a content detail description text.
3. The method for training the video feature extraction model according to claim 1, wherein the feature fusion of the image feature and the text feature by the feature fusion layer of the feature fusion submodel to obtain the fusion feature of the sample video includes any one of the following:
processing the image features and the text features through a self-attention layer included in the feature fusion sub-model to obtain fusion features of the sample video;
and processing the image features and the text features through a deep belief network included in the feature fusion sub-model to obtain fusion features of the sample video.
4. The method of claim 1, wherein the processing the fusion features through a text generation submodel of the video feature extraction model to obtain a text training result comprises:
and processing the fusion features through a self-attention layer included in the text generation sub-model to obtain the text training result.
5. A text generation method based on a video feature extraction model, wherein the video feature extraction model is obtained by training based on the training method of any one of the claims 1 to 4, and the method comprises the following steps:
acquiring image information and text information of a target video;
inputting the image information and the text information into the video feature extraction model, performing feature extraction on the image information through an image feature extraction submodel of the video feature extraction model to obtain image features of the target video, processing the text information through an embedding layer of a feature fusion submodel of the video feature extraction model to obtain text features of the target video, and performing feature fusion on the image features and the text features through a feature fusion layer of the feature fusion submodel to obtain fusion features of the target video;
and processing the fusion features through a text generation sub-model of the video feature extraction model, outputting a plurality of characters meeting text generation conditions, and generating a content description text of the target video based on the characters.
6. An apparatus for training a video feature extraction model, the apparatus comprising:
an acquisition unit configured to perform acquisition of image information, text information, an image tag, and a text tag of a sample video, the image tag representing an image reconstruction feature, the text tag representing a content description text of the sample video;
the input unit is configured to input the image information and the text information into a video feature extraction model, perform feature extraction on the image information through an image feature extraction sub-model of the video feature extraction model to obtain image features of the sample video, process the text information through an embedding layer of a feature fusion sub-model of the video feature extraction model to obtain text features of the sample video, and perform feature fusion on the image features and the text features through a feature fusion layer of the feature fusion sub-model to obtain fusion features of the sample video;
the processing unit is configured to execute image restoration on the image features in the fusion features through an image reconstruction submodel of the video feature extraction model to obtain an image training result of the size of an original image, and process the fusion features through a text generation submodel of the video feature extraction model to obtain a text training result;
a training unit configured to perform training on the video feature extraction model by adjusting model parameters of the image feature extraction submodel, the feature fusion submodel, the image reconstruction submodel, and the text generation submodel based on the image training result, the text training result, and the image tag and the text tag of the sample video.
7. A text generation device based on a video feature extraction model, wherein the video feature extraction model is obtained by training based on the training method of any one of the above claims 1 to 4, the device comprises:
an acquisition unit configured to perform acquisition of image information and text information of a target video;
the input unit is configured to input the image information and the text information into the video feature extraction model, perform feature extraction on the image information through an image feature extraction sub-model of the video feature extraction model to obtain image features of the target video, process the text information through an embedding layer of a feature fusion sub-model of the video feature extraction model to obtain text features of the target video, and perform feature fusion on the image features and the text features through a feature fusion layer of the feature fusion sub-model to obtain fusion features of the target video;
and the processing unit is configured to execute processing on the fusion feature through a text generation sub-model of the video feature extraction model, output a plurality of characters meeting text generation conditions, and generate a content description text of the target video based on the plurality of characters.
8. A computer device, characterized in that the computer device comprises:
one or more processors;
a memory for storing the processor executable program code;
wherein the processor is configured to execute the program code to implement the training method of the video feature extraction model according to any one of claims 1 to 4 or the text generation method based on the video feature extraction model according to claim 5.
9. A computer-readable storage medium, wherein program code in the computer-readable storage medium, when executed by a processor of a computer device, enables the computer device to perform the method of training a video feature extraction model according to any one of claims 1 to 4, or the method of generating text based on a video feature extraction model according to claim 5.
10. A computer program product comprising a computer program, wherein the computer program, when executed by a processor, implements the method for training a video feature extraction model according to any one of claims 1 to 4, or the method for generating text based on a video feature extraction model according to claim 5.
CN202210615076.XA 2022-05-31 2022-05-31 Training method of video feature extraction model, text generation method and device Pending CN114996515A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202210615076.XA CN114996515A (en) 2022-05-31 2022-05-31 Training method of video feature extraction model, text generation method and device

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202210615076.XA CN114996515A (en) 2022-05-31 2022-05-31 Training method of video feature extraction model, text generation method and device

Publications (1)

Publication Number Publication Date
CN114996515A true CN114996515A (en) 2022-09-02

Family

ID=83032111

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202210615076.XA Pending CN114996515A (en) 2022-05-31 2022-05-31 Training method of video feature extraction model, text generation method and device

Country Status (1)

Country Link
CN (1) CN114996515A (en)

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination