CN113747168A - Training method of multimedia data description model and generation method of description information - Google Patents

Training method of multimedia data description model and generation method of description information

Info

Publication number
CN113747168A
CN113747168A (application CN202010478653.6A)
Authority
CN
China
Prior art keywords
multimedia data
description
sample
decoder
coding
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202010478653.6A
Other languages
Chinese (zh)
Inventor
林科
甘卓欣
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Beijing Samsung Telecom R&D Center
Samsung Electronics Co Ltd
Original Assignee
Samsung Electronics Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Samsung Electronics Co Ltd filed Critical Samsung Electronics Co Ltd
Priority to CN202010478653.6A priority Critical patent/CN113747168A/en
Publication of CN113747168A publication Critical patent/CN113747168A/en
Pending legal-status Critical Current

Classifications

    • HELECTRICITY
    • H04ELECTRIC COMMUNICATION TECHNIQUE
    • H04NPICTORIAL COMMUNICATION, e.g. TELEVISION
    • H04N19/00Methods or arrangements for coding, decoding, compressing or decompressing digital video signals
    • H04N19/42Methods or arrangements for coding, decoding, compressing or decompressing digital video signals characterised by implementation details or hardware specially adapted for video compression or decompression, e.g. dedicated software implementation
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/02Neural networks
    • G06N3/04Architecture, e.g. interconnection topology
    • G06N3/045Combinations of networks
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/02Neural networks
    • G06N3/04Architecture, e.g. interconnection topology
    • G06N3/049Temporal neural networks, e.g. delay elements, oscillating neurons or pulsed inputs
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/02Neural networks
    • G06N3/08Learning methods
    • HELECTRICITY
    • H04ELECTRIC COMMUNICATION TECHNIQUE
    • H04NPICTORIAL COMMUNICATION, e.g. TELEVISION
    • H04N21/00Selective content distribution, e.g. interactive television or video on demand [VOD]
    • H04N21/20Servers specifically adapted for the distribution of content, e.g. VOD servers; Operations thereof
    • H04N21/23Processing of content or additional data; Elementary server operations; Server middleware
    • H04N21/234Processing of video elementary streams, e.g. splicing of video streams, manipulating MPEG-4 scene graphs
    • H04N21/2343Processing of video elementary streams, e.g. splicing of video streams, manipulating MPEG-4 scene graphs involving reformatting operations of video signals for distribution or compliance with end-user requests or end-user device requirements
    • HELECTRICITY
    • H04ELECTRIC COMMUNICATION TECHNIQUE
    • H04NPICTORIAL COMMUNICATION, e.g. TELEVISION
    • H04N21/00Selective content distribution, e.g. interactive television or video on demand [VOD]
    • H04N21/40Client devices specifically adapted for the reception of or interaction with content, e.g. set-top-box [STB]; Operations thereof
    • H04N21/43Processing of content or additional data, e.g. demultiplexing additional data from a digital video stream; Elementary client operations, e.g. monitoring of home network or synchronising decoder's clock; Client middleware
    • H04N21/44Processing of video elementary streams, e.g. splicing a video clip retrieved from local storage with an incoming video stream, rendering scenes according to MPEG-4 scene graphs
    • H04N21/4402Processing of video elementary streams, e.g. splicing a video clip retrieved from local storage with an incoming video stream, rendering scenes according to MPEG-4 scene graphs involving reformatting operations of video signals for household redistribution, storage or real-time display

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • General Health & Medical Sciences (AREA)
  • Computing Systems (AREA)
  • Biomedical Technology (AREA)
  • Biophysics (AREA)
  • Computational Linguistics (AREA)
  • Data Mining & Analysis (AREA)
  • Evolutionary Computation (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Molecular Biology (AREA)
  • Artificial Intelligence (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Mathematical Physics (AREA)
  • Software Systems (AREA)
  • Health & Medical Sciences (AREA)
  • Multimedia (AREA)
  • Signal Processing (AREA)
  • Compression Or Coding Systems Of Tv Signals (AREA)

Abstract

The application provides a training method and a generation method of description information of a multimedia data description model, wherein the model comprises an encoding module and a decoding module which are sequentially cascaded, the decoding module comprises at least one decoder, and the training method comprises the following steps: acquiring a training data set; training the model based on the training data set until the total loss function of the model is converged; the total loss function comprises a first loss function, and during training, for each sample multimedia data in a training data set, the sample multimedia data is input into a coding module to obtain coding characteristics, and the coding characteristics are respectively input into each decoder to obtain a first decoding result of each decoder; the value of the first loss function is calculated based on the respective descriptive labels of the respective sample multimedia data and the first decoding results corresponding to the respective decoders. Based on the method provided by the application, the accuracy of the description information of the generated multimedia data can be effectively improved.

Description

Training method of multimedia data description model and generation method of description information
Technical Field
The application relates to the field of computer vision and the field of artificial intelligence, in particular to a training method of a multimedia data description model and a generation method of description information.
Background
In computer vision technology, Video Captioning or Image Captioning refers to outputting a textual description for a given video or image. As shown in fig. 1, when a video including the plurality of frames shown in fig. 1 is given, the textual description "a child is cleaning the floor" may be automatically output for the video.
Taking video description as an example, an existing video description method generally selects several frames from a video, extracts full-image features from the selected frames, decodes these features, and generates a textual description of the video according to maximum likelihood probability; the principle of image description is similar. As can be seen from the above, existing video description models basically adopt an encoder-decoder structure: the encoder is responsible for extracting the features of the video frames, and the decoder is responsible for decoding those features and generating the textual description.
An example of an existing video description algorithm is shown in fig. 2: the plurality of frames shown in fig. 2 are selected from a video, each frame is passed through CNN (Convolutional Neural Network) encoders to extract the features of each selected video frame, and the extracted features are decoded by an LSTM (Long Short-Term Memory) decoder to generate the corresponding textual description "a man is putting a pizza into an oven". Although there are many ways of generating description information of multimedia data (video or image) in the prior art, the accuracy of the generated description information still needs to be improved.
Disclosure of Invention
The purpose of the present application is to provide a training method of a multimedia data description model and a generation method of description information, so as to improve the accuracy of the generated multimedia data description information. The scheme provided by the embodiment of the application is as follows:
in one aspect, the present application provides a training method for a multimedia data description model, where the multimedia data description model includes an encoding module and a decoding module that are sequentially cascaded, and the decoding module includes at least one decoder, the training method includes:
acquiring a training data set, wherein the training data set comprises sample multimedia data and at least one description label of each sample multimedia data;
training the multimedia data description model based on the training data set until the total loss function of the multimedia data description model is converged;
the total loss function comprises a first loss function, and during training, for each sample multimedia data, the sample multimedia data is input into an encoding module to obtain encoding characteristics of the sample multimedia data, and the encoding characteristics are respectively input into each decoder to obtain a first decoding result corresponding to each decoder; the value of the first loss function is calculated based on the respective descriptive labels of the respective sample multimedia data and the first decoding results corresponding to the respective decoders.
In another aspect, the present application provides a method for generating description information of multimedia data, including:
inputting multimedia data into a coding module of a multimedia data description model to obtain coding characteristics of the multimedia data, wherein the multimedia data description model comprises a coding module and a decoding module which are sequentially cascaded, and the decoding module comprises at least one decoder;
respectively inputting the coding characteristics into each decoder, and obtaining the description information of the multimedia data based on the decoding result of each decoder;
the multimedia data description model is obtained by training through the training method of the multimedia data description model provided by the application.
In another aspect, the present application provides a training apparatus for a multimedia data description model, where the multimedia data description model includes an encoding module and a decoding module, which are sequentially cascaded, and the decoding module includes at least one decoder, the training apparatus includes:
the training data acquisition module is used for acquiring a training data set, wherein the training data set comprises multimedia data of each sample and at least one description label of the multimedia data of each sample;
the training module is used for training the multimedia data description model based on the training data set until the total loss function of the multimedia data description model is converged;
wherein the total loss function comprises a first loss function, and the training module, when training the multimedia data description model based on the training data set, is configured to:
for each sample multimedia data, inputting the sample multimedia data into an encoding module to obtain encoding characteristics of the sample multimedia data, and respectively inputting the encoding characteristics into each decoder to obtain a first decoding result corresponding to each decoder;
the value of the first loss function is calculated based on the respective descriptive labels of the respective sample multimedia data and the first decoding results corresponding to the respective decoders.
In another aspect, the present application provides an apparatus for generating description information of multimedia data, the apparatus including:
the coding feature acquisition module is used for inputting multimedia data into a coding module of a multimedia data description model to obtain coding characteristics of the multimedia data, wherein the multimedia data description model comprises a coding module and a decoding module which are sequentially cascaded, and the decoding module comprises at least one decoder;
the textual description generation module is used for respectively inputting the coding characteristics into each decoder and obtaining the description information of the multimedia data based on the decoding result of each decoder;
the multimedia data description model is obtained by training through the training method of the multimedia data description model provided by the application.
In another aspect, the present application provides an electronic device comprising a memory and a processor;
the memory has a computer program stored therein;
and the processor is used for executing the training method of the multimedia data description model provided by the application or executing the generation method of the description information of the multimedia data provided by the embodiment of the application when the computer program is run.
In another aspect, the present application provides a computer-readable storage medium, in which a computer program is stored, and when the computer program is executed by a processor, the computer program performs a training method of a multimedia data description model provided in the present application, or performs a generation method of description information of multimedia data provided in an embodiment of the present application.
The advantageous effects brought by the technical solutions provided by the embodiments of the present application will be described in detail in the following description of the specific embodiments with reference to various alternative embodiments, and will not be further described herein.
Drawings
In order to more clearly illustrate the technical solutions in the embodiments of the present application, the drawings used in the description of the embodiments of the present application will be briefly described below.
Fig. 1 is a schematic diagram of an application scenario of the present application;
fig. 2 is a schematic flowchart of a conventional method for generating video description information;
FIG. 3 is a schematic flow chart illustrating a method for training a multimedia data description model according to the present application;
FIG. 4 is a flow chart illustrating a method for obtaining video description information through a multimedia data description model;
FIG. 5 is a schematic diagram illustrating a method for training a multimedia data description model according to an example of the present application;
FIG. 6 is a schematic diagram illustrating a structure of a video description model provided in the present application;
FIG. 7 is a schematic diagram illustrating a principle of frame masking a video according to the present application;
FIG. 8 is a schematic structural diagram of a training apparatus for a multimedia data description model provided in the present application;
fig. 9 is a schematic structural diagram of an electronic device provided in the present application.
Detailed Description
Reference will now be made in detail to embodiments of the present application, examples of which are illustrated in the accompanying drawings, wherein like or similar reference numerals refer to the same or similar elements or elements having the same or similar function throughout. The embodiments described below with reference to the drawings are exemplary only for the purpose of explaining the present application and are not to be construed as limiting the present invention.
As used herein, the singular forms "a", "an" and "the" are intended to include the plural forms as well, unless the context clearly indicates otherwise. It will be further understood that the terms "comprises" and/or "comprising," when used in this specification, specify the presence of stated features, integers, steps, operations, elements, and/or components, but do not preclude the presence or addition of one or more other features, integers, steps, operations, elements, components, and/or groups thereof. It will be understood that when an element is referred to as being "connected" or "coupled" to another element, it can be directly connected or coupled to the other element or intervening elements may also be present. Further, "connected" or "coupled" as used herein may include wirelessly connected or wirelessly coupled. As used herein, the term "and/or" includes all or any element and all combinations of one or more of the associated listed items.
For better understanding and description of the solutions provided by the embodiments of the present application, the related art to which the present application relates will be described first.
In computer vision technology, video/image description refers to outputting a textual description for a given video/image, a task that lies at the intersection of computer vision and natural language processing. Compared to other computer vision tasks such as object detection and image segmentation, video/image description is more challenging: it not only requires a comprehensive understanding of the video or image, but also requires expressing the content of the video or image in natural language.
As can be seen from the foregoing description, existing video/image description models basically adopt an encoder-decoder structure. The encoder, which is usually designed based on a CNN and may therefore also be referred to as a CNN encoder, is responsible for extracting features of the image; the decoder, which is usually designed based on an RNN (e.g., LSTM) and may therefore also be referred to as an RNN decoder, is responsible for decoding the features of the image to generate the textual description.
To improve the encoding or decoding capability of the model, some methods use multiple encoders to improve encoding capability and some use multiple decoders to improve decoding capability. Multi-decoder approaches generally use the similarity of the decoders' outputs as a loss function (e.g., K-L divergence); this is also referred to as co-learning training, which pulls the outputs of the decoders closer to each other so that the outputs of the other decoders can guide the current decoder. Co-learning can improve the performance of multiple decoders compared to that of a single decoder, but the method is still not perfect and needs to be improved.
In addition, there is a "one-to-many" problem in existing video/image description training: one training video/image in the training data usually corresponds to multiple description labels, and during model training (e.g., training with a cross-entropy loss), this uncertainty, where one input corresponds to multiple (labeled) outputs, degrades the performance of the video/image description network. Therefore, improving the decoding capability of multi-decoder video/image description models and addressing the "one-to-many" problem in training are problems to be solved.
In order to solve at least one of the above problems, the present application provides a training method for a multimedia data description model, which can effectively improve the accuracy of generated description information when the description information of multimedia data is generated by using the multimedia data description model obtained by training based on the training method.
In order to make the objects, technical solutions and advantages of the present application clearer, various alternative embodiments of the present application and how the technical solutions of the embodiments of the present application solve the above technical problems will be described in detail below with reference to specific embodiments and drawings. The following several specific embodiments may be combined with each other, and details of the same or similar concepts or processes may not be repeated in some embodiments. Embodiments of the present application will be described below with reference to the accompanying drawings. Fig. 3 illustrates a training method for a multimedia data description model provided by an embodiment of the present application, where the multimedia data description model includes an encoding module and a decoding module that are sequentially cascaded, where the encoding module includes at least one encoder, and the decoding module includes at least one decoder, as shown in fig. 3, the training method may include:
step S110: acquiring a training data set;
step S120: and training the multimedia data description model based on the training data set until the total loss function of the multimedia data description model converges.
The training data set comprises sample multimedia data and at least one description label of each sample multimedia data; the total loss function of the model comprises a first loss function, and during training, for each sample multimedia data, the sample multimedia data is input into an encoding module to obtain the encoding characteristics of the sample multimedia data, and the encoding characteristics are respectively input into each decoder to obtain a first decoding result corresponding to each decoder; the value of the first loss function is calculated based on the respective descriptive labels of the respective sample multimedia data and the first decoding results corresponding to the respective decoders.
It can be seen that the value of the first loss function characterizes the difference of the descriptive label of the sample multimedia data and the first decoding result. Specifically, for each description label of each sample multimedia data, the difference between the description label and the first decoding result of each decoder may be calculated, and finally, the differences corresponding to the description labels corresponding to the sample multimedia data are summed to obtain the value of the first loss function.
In practical applications, which specific function is selected as the first loss function is not limited in the embodiments of the present application; optionally, the first loss function may be a cross-entropy loss function.
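As an illustrative sketch only (not part of the claimed method), the following PyTorch-style pseudocode, under the assumption of hypothetical decoder and feature interfaces and pre-tokenized description labels, shows how such a cross-entropy first loss could be accumulated over every description label and every decoder:

```python
import torch
import torch.nn.functional as F

def first_loss(decoders, enc_feat, label_token_ids):
    """Hypothetical sketch: cross-entropy between every description label of
    one sample and the first decoding result of every decoder.

    decoders        -- iterable of decoder modules, each mapping the encoding
                       features to per-step vocabulary logits of shape [T, V]
    enc_feat        -- encoding features produced by the encoding module
    label_token_ids -- list of tensors, one per description label, shape [T]
    """
    total = enc_feat.new_zeros(())
    for decoder in decoders:
        logits = decoder(enc_feat)               # first decoding result [T, V]
        for tokens in label_token_ids:
            total = total + F.cross_entropy(logits, tokens)
    return total
```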
In the embodiment of the application, the multimedia data description model can be an image description model and can also be a video description model, the description information of the image can be obtained through the image description model, and the description information of the video can be obtained through the video description model.
Accordingly, for the image description model, the sample multimedia data is a sample image (an image used for model training, which may also be referred to as a training image), and the encoding features of the sample multimedia data may include encoding features of each target region in the sample image. For the video description model, the sample multimedia data is sample video (also referred to as training video), and the encoding features of the sample multimedia data include encoding features of frames of the sample video.
In practical applications, for a video, when obtaining description information of the video through a video description model, several frames are generally selected from the video, and the several frames are input into the video description model to obtain the description information of the video. Similarly, when training the model, a plurality of frames of the sample video may be selected, and correspondingly, the coding feature of each frame of the sample video may be the coding feature of the selected plurality of frames. Optionally, in this embodiment of the present application, when processing a video, processing may be performed on several frames in the video.
For the multimedia data description model, in practical applications, in order to improve the codec capability of the model, the encoding module usually includes a plurality of (i.e., two or more) encoders, and the decoding module usually includes a plurality of decoders. Which specific encoders and decoders are adopted can be selected or configured according to actual requirements. For example, for the image description model, the encoding module may include, but is not limited to, one or more of a local encoder, a global encoder, and a semantic encoder; for the video description model, the encoding module may include, but is not limited to, one or more of a local encoder, a global encoder, a semantic encoder, and a spatiotemporal visual feature encoder (for extracting spatiotemporal visual features of the video).
As an example, fig. 4 is a flowchart illustrating a method for obtaining video description information through a video description model. As shown in fig. 4, the input video passes through multiple encoders to obtain features; the multiple encoders in this example include a global encoder, a local encoder, and a semantic encoder, so global features, local features, and semantic features of the input video can be obtained. The resulting features are then decoded by a multi-decoder, which in this example includes two decoders, each decoding based on the features extracted by the multi-encoder, and the outputs of the decoders are fused to obtain the final output.
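The flow of fig. 4 can be summarized, purely as a hypothetical PyTorch-style sketch with simplified feature shapes and a simple averaging fusion (both assumptions, not mandated by the application), as:

```python
import torch

def describe_video(video, encoders, decoders):
    """Hypothetical sketch of the fig. 4 flow: multiple encoders extract
    features, each decoder decodes them, and the decoders' word probability
    distributions are fused (here by averaging) to obtain the final output."""
    feats = [enc(video) for enc in encoders]           # global / local / semantic features
    fused_feat = torch.cat(feats, dim=-1)              # one possible way to combine encoder outputs
    probs = [dec(fused_feat).softmax(dim=-1) for dec in decoders]
    fused = torch.stack(probs).mean(dim=0)             # fuse decoder outputs
    return fused.argmax(dim=-1)                        # word indices of the generated description
```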
As can be seen from the foregoing description, there is a "one-to-many" problem in the current training method of multimedia data description model, and for the multimedia data description model including multiple decoders, the decoding capability of the multiple decoders still needs to be improved.
For the "one-to-many problem", the embodiment of the present application proposes to use a frame mask method (for an image, an image mask method), which has a principle that, for each description label of sample multimedia data, it can be encoded as a description label feature using a language coding model (e.g., a coding model based on a BERT (coder of a Bidirectional Transformer)), and the coding feature of the sample multimedia data extracted by an Encoder of a multimedia data description model (e.g., a multi-frame video feature of a sample video) is also mapped to a description label feature space, that is, both the description label and the coding feature are mapped to the same feature space. For sample video, the relevance (e.g., similarity) of the video features and the description label features of each frame may be calculated, each frame may be sorted according to the relevance, and the frames with lower ranks may be masked, that is, a smaller weight may be applied to the coding features of the frames with lower ranks, or the weight thereof may be set to zero (that is, the coding features of the frames do not participate in the candidate decoding process). This brings different frame masks for different description labels, which can turn a "one-to-many" mapping into a "one-to-one" mapping. For the sample image, each target region (which may also be referred to as an object region) in the image may be masked. Alternative embodiments for solving the "one-to-many" problem using the frame mask approach will be described in detail later.
In order to improve the decoding performance of the decoders, the embodiment of the present application proposes to use an enhanced co-learning approach. Enhanced co-learning means performing data enhancement on the input of the model; the enhanced data is also decoded by the multiple decoders, and a similarity loss is applied across the decoder outputs for the enhanced data and the original data. Compared with ordinary co-learning, enhanced co-learning can further improve performance.
The multimedia description model training task is to input multimedia data into the model so that the output of the model is as close as possible to the description labels of the multimedia data. Taking the video description model as an example, a given training video may be represented as X = [x1, x2, …, xN], where N denotes the number of selected video frames and xN is the N-th selected frame image; the H description labels of the video may be expressed as Y = {y1, y2, …, yH}, yt ∈ 𝕐, t = 1, 2, …, H, where 𝕐 is the lexicon. The model training task is to take X as input and generate video description information close to Y.
Various alternative embodiments provided by the present application are described in detail below.
In an alternative embodiment of the present application, the decoding module includes a plurality of decoders respectively connected to the encoding modules; for sample multimedia data, the method may further comprise:
performing data enhancement processing on the sample multimedia data to obtain enhanced multimedia data;
inputting the enhanced multimedia data into a coding module to obtain the coding characteristics of the enhanced multimedia data;
respectively inputting the coding characteristics of the enhanced multimedia data into each decoder to obtain second decoding results corresponding to each decoder;
calculating a value of a second loss function based on each first decoding result corresponding to each sample multimedia data and each second decoding result corresponding to the enhanced multimedia data corresponding to each sample multimedia data; wherein the total loss function further comprises a second loss function.
Specifically, calculating a value of the second loss function based on each first decoding result corresponding to each sample multimedia data and each second decoding result corresponding to the enhanced multimedia data corresponding to each sample multimedia data includes:
calculating the consistency loss value between every two decoding results in each first decoding result and each second decoding result;
the values of the consistency losses are added to obtain the value of the second loss function.
The scheme is the enhanced co-learning (training) method provided by the embodiment of the application.
In order to improve the decoding capability of multimedia data description, the decoding module of the multimedia data description model may adopt multiple decoders. The performance of a multi-decoder improves as the number of decoders increases, but computational resource consumption and model size also increase; in practical applications, considering performance and consumption comprehensively, the number of decoders is usually 2 or 3. Which specific decoder or decoders are used is not limited in the embodiments of the present application. Optionally, each decoder of the multi-decoder may be a recurrent-neural-network-based decoder or a self-attention-based decoder (e.g., a Transformer network). As an alternative, the multi-decoder may include a GRU (Gated Recurrent Unit)-based decoder and a Transformer-based decoder, where GRU units with temporal attention may be used in the GRU-based decoder in place of LSTM units.
When a multi-decoder structure is adopted and the multiple decoders are trained simultaneously, if the outputs of the decoders are not constrained, the performance difference between the decoders may be large and performance may decrease after fusion. To address this problem, the conventional approach is to use a co-learning algorithm to train the multiple decoders simultaneously. Assuming that there are two decoders whose outputs are two probability distributions p1 and p2, respectively, the difference between the outputs of the two decoders can be constrained by a consistency loss, which can be defined as follows:

D_KL(p1 || p2)

where D_KL denotes the K-L divergence; that is, training can be constrained by the KL divergence of the decoding results of the two decoders. Training multiple decoders with this loss is co-training.
Although co-training can improve the decoding performance of multiple decoders, the decoding performance remains to be improved. The method for enhancing the co-training can further improve the decoding effect of the model. The enhanced co-training method is described in detail below with a video as an example.
Specifically, the original video data v (i.e., the sample video) may first be subjected to data enhancement; the data enhancement mode is not limited in this application. For example, the data enhancement method may be one or more of translation, rotation, flipping, frame removal, auto-augmentation (automatic data enhancement), random augmentation (random data enhancement), and the like. The data-enhanced video is denoted as v′. When the video description model is trained, the original video data v and the data-enhanced video v′ are respectively input into the model. If there are two decoders, the outputs of the two decoders corresponding to v are probability distributions p1 and p2, and the outputs of the two decoders corresponding to v′ are probability distributions p′1 and p′2, then the consistency loss, i.e., the second loss function, is as follows:

D_KL(p1 || p2) + D_KL(p1 || p′1) + D_KL(p2 || p′2) + D_KL(p′1 || p′2)

Training with this loss is enhanced co-training. The performance of the model under enhanced co-training is further improved compared with that under ordinary co-training. The values of the consistency losses corresponding to the sample videos are summed to obtain the value of the second loss function.
In practical applications, when data enhancement is performed, enhancement processing may be performed once, or enhancement processing may be performed multiple times to obtain multiple enhanced multimedia data, and when a corresponding loss is calculated, a loss between each two of a decoding result of sample multimedia data and a decoding result corresponding to each enhanced multimedia data may be calculated.
In the enhanced co-learning scheme provided by the embodiment of the present application, in order to improve the performance of the decoders, on the one hand the decoders can learn from each other: for each decoder, external guidance from the other decoders can be used for learning (i.e., learning of a decoder is constrained by losses between the decoding results of different decoders, such as D_KL(p′1 || p′2)). On the other hand, in enhanced co-learning using sample multimedia data and enhanced multimedia data, the present application may further provide that, for each decoder, the difference between that decoder's decoding result on the original multimedia data and its decoding result on the enhanced multimedia data is calculated as the internal loss of the decoder to guide its learning (i.e., learning of a decoder is constrained by the loss between its decoding result on the original data and its decoding result on the enhanced data, such as D_KL(p1 || p′1)). Based on this scheme, the posterior entropy of the decoder can be further reduced and the performance of the decoder improved.
In an alternative embodiment of the present application, for sample multimedia data, the method further comprises:
respectively performing mask processing on the coding features of the sample multimedia data based on each description label of the sample multimedia data to obtain masked coding features;
correspondingly, the above inputting the encoding characteristics into each decoder respectively to obtain a first decoding result corresponding to each decoder, and calculating the value of the first loss function based on each description label of each sample multimedia data and the first decoding result corresponding to each decoder, includes:
respectively inputting the coded features of the mask corresponding to each description label of the sample multimedia data into each decoder to obtain a first decoding result of each description label corresponding to each decoder;
the value of the first loss function is calculated based on each descriptive label of the sample multimedia data and the first decoding result of the decoder corresponding to the descriptive label.
When the sample multimedia data has at least two description labels, the scheme provided by the present application can convert the "one-to-many" problem existing in the prior art into a "one-to-one" problem. In practical applications, taking a video as an example, for each description label of the video, the degree of association (i.e., contribution) of the same frame to different description labels is different, and the contributions of different frames to the same description label also differ. Therefore, for each description label, the video may be masked based on the correlation between that description label and each frame; for example, different weights are determined for the frames based on the correlation, or frames with lower correlation are screened out so that they do not participate in the generation of the video description information. In this way, one description label corresponds to one masked video, and the decoding result corresponding to that description label is obtained.
In an optional embodiment of the present application, for each description label based on the sample multimedia data, respectively performing mask processing on the coding features of the sample multimedia data to obtain masked coding features, where the mask processing includes:
obtaining description labeling characteristics of each description label of the sample multimedia data;
for each description label of the sample multimedia data, determining the correlation degree of the coding characteristics of the sample multimedia data and the description label characteristics of the description label;
and for each description label of the sample multimedia data, weighting the coding characteristics of the sample multimedia data based on the corresponding correlation degree of the description label to obtain the weighted coding characteristics.
That is to say, for each description label, masking of the sample multimedia data can be realized through the correlation between the description label feature of that description label and the coding features of the sample multimedia data: the corresponding weighting weights are determined based on the correlation, and the coding features are weighted based on these weights, where the weighted coding features are the masked coding features of the sample multimedia data. The weighted coding features are respectively input into each decoder to obtain the first decoding result of the description label corresponding to each decoder, and the value of the first loss function may be calculated based on each description label of the sample multimedia data and the first decoding results of the decoders corresponding to that description label.
For any description label of the sample multimedia data, the description label characteristic of the description label can be obtained through a language coding model. The specific model architecture of the language coding model is not limited in the embodiments of the present application, and may be a BERT coding model, a transform coding model, or the like.
In this alternative, when the sample multimedia data has multiple (two or more) description labels, for any description label of the sample multimedia data, the decoding result of each decoder corresponding to that description label can be obtained through the multimedia data description model based on this scheme. When calculating the first loss function, the loss value corresponding to a description label is based on the decoding results of the decoders corresponding to that description label, rather than on one decoding result shared by multiple description labels; that is, the loss is not calculated from multiple description labels against the same decoding result. With this scheme, the "one-to-many" mapping of the sample multimedia data (one sample, i.e., one shared decoding result, corresponding to multiple description labels) is changed into a "one-to-one" mapping (each description label corresponds to its own decoding result).
The specific functional form of the first loss function is not limited in this embodiment; the value of the loss function represents the difference between each description label and its corresponding decoding result, and the function may be selected according to actual needs. For example, a cross-entropy loss function may be used: for a description label, the cross-entropy losses between that description label and the decoding results of the decoders corresponding to it may be calculated respectively, and the cross-entropy losses corresponding to all description labels are summed to obtain the value of the first loss function.
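As a hypothetical sketch of the frame-mask weighting described above (the "1 minus L2 loss" correlation appears later in this description; the clamping and normalisation steps are assumptions about one possible realisation):

```python
import torch

def frame_masked_features(frame_feats, label_feat):
    """Hypothetical sketch of the frame-mask step: the correlation between
    each frame's coding feature and one description label feature is taken
    as 1 minus a per-frame L2 loss, and the frame features are weighted by
    that correlation before being fed to the decoders.

    frame_feats -- [N, D] coding features of the N selected frames, already
                   mapped into the description label feature space
    label_feat  -- [D] description label feature (e.g. from a BERT encoder)
    """
    l2 = ((frame_feats - label_feat) ** 2).mean(dim=-1)     # per-frame L2 loss
    corr = (1.0 - l2).clamp(min=0.0)                        # correlation, clipped to [0, 1]
    weights = corr / (corr.max() + 1e-8)                    # one possible normalisation
    return frame_feats * weights.unsqueeze(-1)              # masked / weighted coding features
```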
Specifically, for the video description model, the sample multimedia data is a sample video, and the encoding features of the sample multimedia data include the encoding features of each frame of the sample video. For a sample video with multiple description labels, the correlation of a given frame with different description labels is likely to be different; that is, each frame plays a different role for different description labels. For example, for a description label A and a description label B, the correlation of a frame with description label A and the correlation of that same frame with description label B are likely to differ. Therefore, based on this scheme, for each description label of the sample video, the degree of correlation between the coding feature of each frame and that description label is calculated, and the coding features are weighted according to this degree of correlation to obtain the weighted coding features corresponding to that description label, so that the decoding result corresponding to the description label is obtained based on the weighted features.
That is, for the video description model, the above-mentioned determining, for each description label of the sample multimedia data, a correlation between the encoding characteristic of the sample multimedia data and the description label characteristic of the description label may include:
for each description label of the sample video, respectively determining the correlation degree of the coding characteristics of each frame of the sample video and the description label characteristics of the description label;
correspondingly, for each description label of the sample video, weighting the coding features of the sample multimedia data based on the correlation corresponding to the description label to obtain weighted coding features, including:
and weighting the coding features of the frames based on the corresponding correlation of the frames of the sample video to obtain the weighted coding features of the frames.
For the image description model, the sample multimedia data is a sample image, and the encoding features of the sample multimedia data include the encoding features of each target region in the sample image. Similarly, when the sample image has multiple description labels, the roles of the target regions for different description labels are likely to differ; therefore, for one description label, the target regions may be weighted based on the correlation between the coding features of the target regions and that description label, so that the decoding result corresponding to the description label is obtained based on the weighted coding features.
That is, for the image description model, the above-mentioned determining, for each description label of the sample multimedia data, a degree of correlation between the encoding characteristic of the sample multimedia data and the description label characteristic of the description label includes:
for each description label of the sample image, respectively determining the correlation degree of the coding feature of each target area of the sample image and the description label feature of the description label;
correspondingly, for each description label of the sample image, weighting the coding feature of the sample multimedia data based on the relevance of the description label to obtain a weighted coding feature, including:
and weighting the coding features of the target areas based on the corresponding correlation degrees of the target areas of the sample image to obtain the weighted coding features of the target areas.
The coding features of each target region of the sample image can be extracted by a local coder.
The calculation method of the above correlation is not limited in the embodiments of the present application; for example, the correlation can be calculated via the L2 loss, or by calculating the distance between the description label feature and the coding feature. Taking the L2 loss as an example, for a description label, the corresponding correlation may be calculated as follows:
calculate the L2 loss between the description label feature of that description label and the coding feature; this loss represents the difference between the description label feature and the coding feature, and 1 minus the L2 loss can be used as the correlation between the two.
Optionally, the above-mentioned weighting, based on the correlation, the encoding characteristic of the sample multimedia data for one description label of the sample multimedia data to obtain a weighted encoding characteristic may include:
determining a weight of the coding feature based on the correlation;
the encoding features are weighted based on the determined weights.
Specifically, when weighting the sample multimedia data, the correlation itself may be used as the weight for the coding feature, or a weight (i.e., coefficient) may be determined based on the correlation, with a higher correlation corresponding to a relatively larger weight; the weighted feature is obtained by multiplying the coding feature by that weight.
For the video description model, for each frame of the video, the weight of the frame may be obtained based on the correlation corresponding to the frame, and the weighted feature corresponding to the frame may be obtained by multiplying the weight by the coding feature of the frame.
Optionally, for each frame of the video, if the correlation corresponding to a frame is greater than a set value, the weight of that frame may be determined to be 1, and if it is not greater than the set value, the weight may be determined to be 0; alternatively, the frames may be sorted according to their corresponding correlations, the weights of the lower-ranked frames (i.e., frames with lower correlation) determined to be 0, and the weights of the remaining frames determined to be 1.
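A minimal sketch of these two weighting strategies (the threshold value and the keep ratio are hypothetical parameters, not values taken from this application):

```python
import torch

def binary_frame_weights(corr, threshold=None, keep_ratio=0.8):
    """Hypothetical sketch of the two weighting strategies described above:
    either threshold the per-frame correlation, or keep only the top-ranked
    frames and zero out the rest."""
    if threshold is not None:
        return (corr > threshold).float()
    k = max(1, int(keep_ratio * corr.numel()))
    idx = corr.topk(k).indices          # indices of the k most correlated frames
    weights = torch.zeros_like(corr)
    weights[idx] = 1.0
    return weights
```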
In an alternative embodiment of the present application, the total loss function further includes a third loss function, and the third loss function characterizes a difference between the encoding characteristic of the sample multimedia data and the description label characteristic of each description label.
As can be seen from the foregoing, for a video, the contribution of each frame differs for different video description labels, and for an image, the contributions of different target regions to an image description label also differ. The "one-to-many" problem can be converted into a "one-to-one" mapping relationship by the frame mask method (i.e., weighting the coding features based on the correlation). In addition, for a description label, during model training the difference between the coding features and the description label features can be further constrained by the third loss function, so that the coding features of the sample multimedia data and the description label features of the description label are as close as possible, in order to improve the performance of the model.
The specific form of the third loss function is not limited in this application. Alternatively, the third loss function may be an L2 loss (i.e., MSE (Mean Squared Error)). For example, for the i-th description label of a sample video, its corresponding L2 loss Lf can be expressed as:

Lf = Σ_{j=1}^{N} || f_i − e_j ||²

where N is the number of frames of the sample video (generally, the N selected frames), j denotes the j-th frame of the sample video, f_i denotes the description label feature of the i-th description label, and e_j denotes the coding feature of the j-th frame. Lf is the L2 loss corresponding to the i-th description label of the sample video, i.e., the sum of the L2 losses between the description label feature of that label and the coding feature of each frame.
It can be understood that, in practical applications, for the video description model, the L2 loss corresponding to the model is the sum of the L2 losses corresponding to the description labels of the sample videos.
By the same principle, for the image description model, the L2 loss corresponding to the model is the sum of the L2 losses corresponding to the description labels of the sample images, and the L2 loss corresponding to one description label of a sample image may be the sum of the L2 losses between the description label feature of that label and the coding feature of each target region.
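A possible PyTorch-style sketch of this third loss (the summation over labels and frames/regions follows the description above; the tensor layout is an assumption):

```python
import torch.nn.functional as F

def third_loss(frame_feats, label_feats):
    """Hypothetical sketch of the third loss: for every description label of
    the sample, sum the squared-error (L2) loss between its label feature and
    the coding feature of every frame (or every target region, for images).

    frame_feats -- [N, D] coding features of the frames / target regions
    label_feats -- list of [D] description label features, one per label
    """
    loss = frame_feats.new_zeros(())
    for label_feat in label_feats:
        loss = loss + F.mse_loss(frame_feats,
                                 label_feat.expand_as(frame_feats),
                                 reduction='sum')
    return loss
```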
For the "one-to-many" problem, the above-mentioned optional embodiments provided in this application may enable, for each description label, the encoding feature to play a corresponding role in the subsequent decoding process based on the correlation between the encoding feature and the description label, so as to obtain the decoding result corresponding to the description label. Based on the mode, the performance of the multimedia data description model can be effectively improved.
In an alternative embodiment of the present application, the decoding module includes a plurality of decoders respectively connected to the encoding modules; for sample multimedia data, the method may further comprise:
performing data enhancement processing on the sample multimedia data to obtain enhanced multimedia data;
inputting the enhanced multimedia data into a coding module to obtain the coding characteristics of the enhanced multimedia data;
for each description label of the sample multimedia data, determining a second degree of correlation between the coding characteristics of the enhanced multimedia data corresponding to the sample multimedia data and the description label characteristics of the description label;
weighting the coding features of the enhanced multimedia data based on the second degree of correlation corresponding to the description label to obtain second weighted coding features;
inputting the second weighted coding features into each decoder respectively to obtain a third decoding result of the description label corresponding to each decoder;
calculating a value of a fourth loss function based on each description label of each sample multimedia data corresponding to the first decoding result and the third decoding result of each decoder;
wherein the total loss function further comprises a fourth loss function.
That is, the frame mask processing and the enhanced co-learning may be used in combination. In this case, the frame mask processing may be applied to the sample multimedia data and the corresponding enhanced multimedia data at the same time: for one description label, the coding features of the sample multimedia data are weighted based on the correlation between the description label feature of that label and the coding features of the sample multimedia data, and the coding features of the enhanced multimedia data are weighted based on the correlation between the description label feature and the features of the enhanced multimedia data. When the pairwise losses among the first decoding results and the third decoding results are calculated, the first and third decoding results are those decoded from the corresponding weighted features.
It can be understood that the principle of the fourth loss function in this alternative is the same as that of the second loss function in the foregoing, the scheme corresponding to the second loss function is a scheme that does not consider the frame mask processing manner, and the scheme corresponding to the fourth loss function is a scheme that combines the frame mask processing and the reinforcement co-learning at the same time.
In an alternative embodiment of the present application, the method further comprises:
obtaining the weight of each loss function contained in the total loss function;
the value of the total loss function is obtained by weighting and summing the loss functions contained in the total loss function based on the weights of the loss functions contained in the total loss function.
Because different loss functions correspond to different aspects of model performance, the influence of each loss function on the performance of the model differs, and the proportion of the different losses can be controlled through the weight of each loss function. The weight of each loss function may be configured according to the application scenario or requirements, and is not limited in this application.
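For illustration only, the weighted total loss could be assembled as follows (the individual loss values and their weights are assumed to be available from the steps above):

```python
def total_loss(losses, weights):
    """Hypothetical sketch: the total loss is a weighted sum of the individual
    loss terms (e.g. the first, second/fourth and third losses), with weights
    chosen per application scenario."""
    return sum(w * l for w, l in zip(weights, losses))
```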
It can be understood that, in practical applications, the alternative solution to the "one-to-many" problem provided by the present application and the solution for enhancing the co-learning training can be implemented individually or in combination, and can be selected according to practical application scenarios.
In order to better explain the scheme provided by the present application, the following describes a training method of a multimedia data description model provided by the present application with a specific example. In this example, the multimedia data description model takes a video description model as an example, fig. 5 shows a schematic diagram of a training method provided by this example, fig. 6 shows a schematic diagram of a structure and an operation principle of an optional video description model provided by this application, fig. 7 is an example of a frame mask processing manner provided by this application, and the following describes a scheme of this application with reference to fig. 5 to 7.
1. Extraction of coding features of video
As shown in fig. 5, the encoding module using the video description model in this example is a multi-encoder structure, and uses a multi-encoder to extract video features of a training video (i.e., a sample video).
The multi-encoder may include a global encoder, a local encoder, and a semantic encoder. The global encoder may be a 3D convolutional neural network in various forms, such as C3D (Convolutional 3D, a three-dimensional convolutional network), I3D (Inflated 3D ConvNet, a two-stream inflated three-dimensional convolutional network), P3D (Pseudo-3D network), ECO (Efficient Convolutional network for Online video understanding), and the like, which is not limited in this application.
Optionally, as shown in fig. 6, an ECO may be used to extract global features. For a sample video (e.g., the man's paper-cutting video shown in the figure), several frames of the video may be selected and input into the encoder. The two-dimensional feature of each frame, i.e., its spatial feature, may be obtained directly from the output of the 2D network of the ECO, and the three-dimensional feature of the video may be obtained from the output of the 3D network. Further, the output of the 3D network may be processed by a maximum pooling layer, and the output of the maximum pooling layer may be used as the spatio-temporal feature of the video. For each frame, the two-dimensional feature of the frame and the spatio-temporal feature of the video may be spliced (concatenated) to obtain the spatio-temporal feature of the frame.
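As a non-limiting illustration of this splicing step (the tensor shapes and the global max pooling choice below are assumptions; the ECO backbone itself is not reproduced), the per-frame 2D features and the pooled 3D video feature could be combined as follows:

```python
# Sketch: combine per-frame 2D features with a pooled 3D (spatio-temporal)
# video feature by concatenation, as described for the global encoder.
# Shapes are illustrative assumptions, not values fixed by this application.
import torch

num_frames, dim_2d, dim_3d = 8, 1024, 512

frame_feats_2d = torch.randn(num_frames, dim_2d)   # 2D-network output, one vector per frame
feats_3d = torch.randn(dim_3d, 4, 7, 7)            # 3D-network output (C, T, H, W)

# Global max pooling over the temporal and spatial axes gives one video-level vector.
video_feat = feats_3d.flatten(1).max(dim=1).values  # (dim_3d,)

# Splice (concat) the video-level vector onto every frame's 2D feature.
global_feats = torch.cat(
    [frame_feats_2d, video_feat.unsqueeze(0).expand(num_frames, -1)], dim=-1
)                                                    # (num_frames, dim_2d + dim_3d)
print(global_feats.shape)
```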
The local encoder encodes each object (i.e., target) in each frame of the video and, optionally, may also comprehensively consider information such as the attributes of the objects and the relationships between objects. The specific method is as follows: first, object detection is performed on each frame to obtain a plurality of objects and their features. The object detection network may be a Faster R-CNN (Faster Region-CNN), a YOLO (You Only Look Once) network, or the like, which is not limited in the present application; a Faster R-CNN is shown in fig. 6. Then, the attributes of the objects and the relationships between the objects are predicted using an attribute prediction network and a relationship detection network, respectively. In this way, a scene graph can be obtained, which includes a plurality of nodes and a plurality of connecting edges, where the features and attributes of the objects are nodes (the features and attributes of an object may also be combined), and the relationships among the objects are edges. The scene graph is then fed as input to a Graph Convolutional Network (GCN), which updates the features of the current node according to the information of the edges and of the adjacent nodes; the updated features are the local features of the objects. As an alternative, the formula of the graph convolution network is as follows:
$$\hat{v}_i = \sigma\Big(\sum_{v_j \in N(v_i)} W_{dir(v_i, v_j)}\, v_j + b_{label(v_i, v_j)}\Big)$$

where $v_i$ is the feature of an input node, i.e., an object feature and/or an attribute feature of the object; $N(v_i)$ represents the set of nodes adjacent to $v_i$; $W$ and $b$ are the weight and bias parameters that the graph convolution network needs to learn; $dir(v_i, v_j)$ denotes the direction of the edge, which has two possible values (from $v_i$ to $v_j$ or from $v_j$ to $v_i$), so that $W_{dir(v_i, v_j)}$ may correspond to two results; $label(v_i, v_j)$ identifies the relationship between $v_j$ and $v_i$, and different relationships may have different bias values; $\sigma$ is a non-linear activation function; and $\hat{v}_i$ is the feature after graph convolution, i.e., the local feature.
The features learned by the above alternative are based on a graph structure and incorporate the objects, the relationships among the objects, and the attribute information of the objects, which is helpful for understanding and describing videos.
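A non-limiting sketch of one possible implementation of this node update follows; the class name, the two-entry direction-indexed weight table, the relation-indexed bias table and all dimensions are assumptions for illustration, not the implementation prescribed by this application.

```python
# Sketch of the graph-convolution update: each node feature is refreshed from
# its neighbours, with a weight chosen by edge direction and a bias chosen by
# the relation label, followed by a non-linear activation.
import torch
import torch.nn as nn

class SimpleSceneGCN(nn.Module):
    def __init__(self, dim, num_relations):
        super().__init__()
        self.w = nn.Parameter(torch.randn(2, dim, dim) * 0.01)   # one weight per edge direction
        self.b = nn.Parameter(torch.zeros(num_relations, dim))   # one bias per relation label
        self.act = nn.ReLU()

    def forward(self, node_feats, edges):
        # edges: list of (src, dst, direction, relation) tuples
        out = torch.zeros_like(node_feats)
        for src, dst, direction, relation in edges:
            out[dst] = out[dst] + node_feats[src] @ self.w[direction] + self.b[relation]
        return self.act(out)

gcn = SimpleSceneGCN(dim=256, num_relations=5)
nodes = torch.randn(4, 256)                      # object/attribute node features
edges = [(0, 1, 0, 2), (1, 0, 1, 2), (2, 3, 0, 4)]
local_feats = gcn(nodes, edges)                  # updated node features = local features
print(local_feats.shape)
```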
The semantic encoder is a multi-classification network that extracts semantic features from each frame of the video to predict the attribute or topic information appearing in the video. In this way, more diverse features are integrated into the generation of the video description information, which assists the video description, enhances the expressive power of the generated description, and improves its precision. Optionally, as shown in fig. 6, the semantic encoder may include a plurality of cascaded fully-connected layers.
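A minimal sketch of such a semantic encoder is given below; the layer sizes, the number of semantic classes and the sigmoid multi-label output are illustrative assumptions rather than values fixed by this application.

```python
# Sketch: semantic encoder as cascaded fully-connected layers followed by a
# multi-label (sigmoid) output over assumed attribute/topic classes.
import torch
import torch.nn as nn

semantic_encoder = nn.Sequential(
    nn.Linear(2048, 1024), nn.ReLU(),
    nn.Linear(1024, 512), nn.ReLU(),
    nn.Linear(512, 300),          # 300 assumed semantic concepts
)

frame_feats = torch.randn(8, 2048)               # one pooled feature per sampled frame (assumed shape)
semantic_logits = semantic_encoder(frame_feats)
semantic_probs = torch.sigmoid(semantic_logits)  # per-frame probabilities over the semantic concepts
print(semantic_probs.shape)
```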
Local features, global features and semantic features of each frame of the video can be extracted through the multi-encoder. The local features, the global features and the semantic features of each frame can be spliced to serve as the coding features of each frame.
The spliced coding features may be input to each decoder respectively to obtain the decoded output of each decoder. For each sample video, the value of the first loss function may then be calculated based on each video description label of the video and the corresponding decoded outputs; optionally, the first loss function may use a cross-entropy loss.
Assuming that the number of decoders is m, each decoder may be a GRU-based decoder or a Transformer-based decoder. The m decoders can be denoted $\Theta_1, \Theta_2, \ldots, \Theta_m$. For the i-th of the m decoders and one sample video $V$, the cross-entropy loss $L_C(\Theta_i)$ corresponding to this decoder can be expressed as follows:

$$L_C(\Theta_i) = -\sum_{h=1}^{H}\sum_{t=1}^{T} \log P_{\Theta_i}(y_{h,t} \mid y_{h,1:t-1}, V)$$

where $H$ represents the number of description labels of the sample video, $T$ represents the number of words contained in a description label, $y_{h,t}$ represents the t-th word of the h-th description label, $y_{h,1:t-1}$ represents the 1st to (t-1)-th words of the h-th description label, and $P_{\Theta_i}(y_{h,t} \mid y_{h,1:t-1}, V)$ denotes the probability that the t-th word output by the i-th decoder is the t-th word of the description label, given the video and the 1st to (t-1)-th words.
The cross-entropy losses of each sample video for each decoder are calculated and added up to obtain the value of the first loss function of the model.
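The following non-limiting sketch shows how such a per-decoder, per-label cross-entropy sum could be assembled; the stand-in decoders, the teacher-forcing assumption and all shapes are illustrative and not prescribed by this application.

```python
# Sketch: first loss term = sum of per-decoder, per-label cross-entropy losses.
# The decoders here are placeholders returning (T, vocab_size) logits given the
# video encoding and the ground-truth prefix (teacher forcing is assumed).
import torch
import torch.nn.functional as F

vocab_size, T = 1000, 12
decoders = [lambda enc, tgt: torch.randn(T, vocab_size, requires_grad=True) for _ in range(2)]

def first_loss(encoding, labels):
    """labels: list of H description labels, each a (T,) tensor of word ids."""
    loss = 0.0
    for decoder in decoders:
        for label in labels:
            logits = decoder(encoding, label)                 # (T, vocab_size)
            loss = loss + F.cross_entropy(logits, label, reduction="sum")
    return loss

encoding = torch.randn(8, 1536)                               # per-frame coding features (assumed shape)
labels = [torch.randint(0, vocab_size, (T,)) for _ in range(5)]
print(first_loss(encoding, labels))
```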
2. Multi-decoder and enhanced co-training algorithm
As shown in fig. 5, the decoding module of the video description model in this example includes two decoders, namely decoder 1 and decoder 2. Optionally, as shown in fig. 6, decoder 1 is a GRU-based decoder and decoder 2 is a Transformer-based decoder.
For each training video (the input video shown in the figure), the data enhancement shown in the figure can be performed to obtain an enhanced video; the flow indicated by the dotted lines in fig. 5 and fig. 6 is the flow corresponding to the enhanced video. As shown in fig. 5, both the input video and the enhanced video are used as inputs of the multi-encoder during training, the coding features of the videos are extracted by the multi-encoder, and the coding features are decoded by the two decoders respectively to obtain the decoding results. That is, the two decoder outputs (two probability distributions) for the input video are $p_1$ and $p_2$, and the two decoder outputs for the enhanced video are $p'_1$ and $p'_2$. The consistency loss used for training (corresponding to the second loss function) can be obtained by calculating the K-L divergences between pairs of these four decoded outputs; for the two decoders it can be expressed as:

$$D_{KL}(p_1 \| p_2) + D_{KL}(p_1 \| p'_1) + D_{KL}(p_2 \| p'_2) + D_{KL}(p'_1 \| p'_2)$$

where $D_{KL}(p_1 \| p_2)$, $D_{KL}(p_1 \| p'_1)$, $D_{KL}(p_2 \| p'_2)$ and $D_{KL}(p'_1 \| p'_2)$ are the K-L divergences shown in fig. 5.
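The four pairwise terms above can be sketched as follows; the decoder outputs are simulated with softmax over random logits, so only the K-L computation itself is illustrated, under assumed shapes.

```python
# Sketch: consistency loss as K-L divergences between the decoder outputs for
# the input video (p1, p2) and for the enhanced video (p1_aug, p2_aug).
import torch
import torch.nn.functional as F

def kl(p, q, eps=1e-8):
    """K-L divergence D_KL(p || q) between two probability distributions."""
    return torch.sum(p * (torch.log(p + eps) - torch.log(q + eps)), dim=-1).mean()

vocab_size, T = 1000, 12
p1, p2, p1_aug, p2_aug = [F.softmax(torch.randn(T, vocab_size), dim=-1) for _ in range(4)]

consistency = kl(p1, p2) + kl(p1, p1_aug) + kl(p2, p2_aug) + kl(p1_aug, p2_aug)
print(consistency)
```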
As can be seen from the foregoing description, the consistency loss includes a consistency loss corresponding to the mutual learning between decoders (denoted as $L_e(\Theta_i)$) and a loss between the decoding results of the original sample video and the decoding results of each enhanced video of the same decoder (denoted as $L_a(\Theta_i)$).
More generally, taking the m decoders described above as an example, for the i-th decoder and one sample video $V$, the consistency loss $L_e(\Theta_i)$ corresponding to that decoder can be expressed as follows:

$$L_e(\Theta_i) = \sum_{j=1,\, j \neq i}^{m} D_{KL}\big(P_{\Theta_j}(y_{h,t} \mid y_{h,1:t-1}, V)\,\big\|\,P_{\Theta_i}(y_{h,t} \mid y_{h,1:t-1}, V)\big)$$

where m is the number of decoders, $V$ denotes the sample video, $\tilde{V}_j$ denotes the j-th enhanced video (enhancement may be performed K times to obtain K enhanced videos), $P_{\Theta_j}(y_{h,t} \mid y_{h,1:t-1}, V)$ represents the probability that the t-th word output by the j-th decoder is the t-th word of the description label given the 1st to (t-1)-th words of the video, and $P_{\Theta_i}(y_{h,t} \mid y_{h,1:t-1}, V)$ represents the corresponding probability for the i-th decoder. Thus, for each decoder, the corresponding output is a probability distribution, and for the i-th decoder, its consistency loss part is obtained from the KL divergences between its output and the outputs of each of the other decoders (including the outputs corresponding to the sample video and to the enhanced videos).

As can be seen from the foregoing description, for each decoder, the intrinsic loss of the decoder can also be calculated from the decoding result of that decoder for the original sample video and its decoding results for the enhanced sample videos, so as to better improve the performance of the decoder. For a sample video $V$, assuming it is enhanced K times, the K enhanced videos can be denoted $\tilde{V}_1, \tilde{V}_2, \ldots, \tilde{V}_K$. For the i-th decoder, the intrinsic loss $L_a(\Theta_i)$ corresponding to one sample video can be obtained by the following expression:

$$L_a(\Theta_i) = \sum_{j=1}^{K} D_{KL}\big(P_{\Theta_i}(y_{h,t} \mid y_{h,1:t-1}, V)\,\big\|\,P_{\Theta_i}(y_{h,t} \mid y_{h,1:t-1}, \tilde{V}_j)\big)$$

where $P_{\Theta_i}(y_{h,t} \mid y_{h,1:t-1}, \tilde{V}_j)$ indicates the probability that the t-th word output by the i-th decoder for the j-th enhanced sample video is the t-th word of the description label, given the 1st to (t-1)-th words.
3. Frame masking
In order to solve the problem of one-to-many mapping in video description training, the present application proposes a scheme of frame masking. As shown in fig. 5, the frame mask processing flow is as follows:
For each video description label of a training video (the input description label shown in fig. 5 to fig. 7), a text mapper (the language encoder shown in fig. 6 and fig. 7) may map the description label into a high-dimensional space to obtain the description label feature. The text mapper may be a BERT network, a Transformer network, or a recurrent neural network, which is not limited in this application. The text mapper shown in fig. 6 and fig. 7 is based on BERT: the features of the description label are extracted through the BERT structure, processed by mean pooling, and then passed through a fully-connected layer to obtain the description label feature. In order to place the description label feature and the coding feature of each video frame in the same mapping space, the video features extracted by the multi-encoder can be mapped into that space by a visual mapper (such as a fully connected network).
It will be appreciated that, as shown in fig. 6, the text mapper is a BERT-based mapper and the visual mapper is a network containing two fully-connected layers (FC1 and FC2 shown in the figure); in practical applications, the specific network structures of the text mapper and the visual mapper may be selected and configured according to actual requirements.
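A minimal sketch of the two mappers is given below; the BERT encoder is replaced by placeholder token features, and the layer sizes and pooling step are assumptions used only to show how both modalities land in one shared space.

```python
# Sketch: map a description label and per-frame visual features into a shared
# space. A real implementation could use a BERT encoder on the text side;
# here placeholder token features stand in for it, so dimensions are illustrative.
import torch
import torch.nn as nn

token_dim, visual_dim, shared_dim = 300, 1536, 512

text_mapper = nn.Linear(token_dim, shared_dim)                 # applied after mean pooling of token features
visual_mapper = nn.Sequential(nn.Linear(visual_dim, 1024), nn.ReLU(),
                              nn.Linear(1024, shared_dim))     # FC1 + FC2

token_feats = torch.randn(10, token_dim)            # stand-in for BERT token outputs of one label
label_feat = text_mapper(token_feats.mean(dim=0))   # mean pooling, then FC -> (shared_dim,)

frame_feats = torch.randn(8, visual_dim)            # coding features of 8 sampled frames
frame_embed = visual_mapper(frame_feats)            # (8, shared_dim), same space as label_feat
print(label_feat.shape, frame_embed.shape)
```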
For the coding features and the description label features mapped into the same space, the correlation between the coding feature of each frame and the description label feature can be calculated. In this example, as shown in fig. 5 to 7, the L2 loss can be used to calculate the correlation between the text mapping feature (i.e., the description label feature) and the visual mapping feature of each frame (the visual feature shown in the figure, i.e., the coding feature of each frame), and the frames can be sorted accordingly. The coding features of the frames are then weighted based on the sorting result.
For the video shown in fig. 7, there are five description labels, respectively shown in the figure, such as "an oriental woman is cutting carrots into slices", "deliciously laying on a table", …, "a girl is cutting some green vegetables". For each description label, its corresponding description label feature, i.e., the language representation, can be obtained by the language encoder. For the video, several frames can be selected and input into the encoding module of the video description model (the visual coding shown in fig. 7) to obtain the coding feature of each frame, and these coding features can be mapped by the visual mapper (the visual coding shown in fig. 6, the visual embedding shown in fig. 7) into the same feature space as the description label features to obtain the mapped coding feature of each frame (the visual representation shown in fig. 7). Then, for each description label feature, the correlation between the description label feature and the converted coding feature of each frame can be calculated, and the coding feature of each frame is weighted based on that correlation, i.e., the frame mask processing shown in fig. 6 and 7.
Specifically, as shown in fig. 7, 8 frames of the video are selected. For each description label, the correlation between the description label and the coding feature of each frame may be calculated, and the result is used to weight each frame. For example, for the description label "an oriental woman is cutting carrots into slices" shown in the figure, the correlation between the description label and the 1st to 3rd frames and the 6th frame shown in the figure is high, so the coding features of these frames can be given high weights, while the other frames can be masked. For another example, for the description label "a girl is cutting some green vegetables", since the correlation between the 4th to 6th frames and the description label is high, these frames can be given high weights, and the other frames are masked.
For a sample video, assume that it has H description labels, which can be denoted $\{y_1, y_2, \ldots, y_H\}$; the description label features obtained after each description label passes through the text mapper can be denoted $\{e_1, e_2, \ldots, e_H\}$. Assuming that N frames of the video are selected, the coding features of the frames after passing through the visual mapper can be denoted $\{f_1, f_2, \ldots, f_N\}$, where the $e_h$ and the $f_j$ belong to the same mapping space.

For each description label feature $e_h$, by calculating the MSE (i.e., the L2 loss) between $e_h$ and each frame feature $f_j$, the correlation between each frame and the h-th description label can be obtained, as well as the loss between the label feature and the coding features, i.e., the value of the fourth loss function corresponding to one description label. Specifically, the fourth loss function $L_f$ corresponding to the description label can be expressed as follows:

$$r_j = \lVert e_h - f_j \rVert_2^2, \qquad L_f = \frac{1}{N}\sum_{j=1}^{N} r_j$$

where $r_j$ represents the L2 loss corresponding to the j-th frame. The larger the L2 loss, the smaller the correlation between that frame and the description label; conversely, the smaller the L2 loss, the greater the correlation between the frame and the description label.
Therefore, for a description label, based on the obtained r values (i.e., the L2 losses), the frames with lower r values can be retained and the frames with higher r values can be masked. Based on this scheme, a different "frame mask" can be obtained for each description label, so that the different description labels of the video map to separate "frame masks", i.e., a "one-to-one" mapping is obtained. As can be seen from the foregoing description, the frame masking scheme is also applicable to the enhanced sample video; that is, the enhanced sample video may be weighted by calculating the correlation between each description label and the enhanced sample video, and the corresponding decoding results are obtained from the weighted features.
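The per-frame correlation computation can be sketched as follows; the mapped label and frame features are random stand-ins and the shapes are assumptions continuing the earlier sketch.

```python
# Sketch: for one description label, compute the L2 loss r_j between the label
# feature and each mapped frame feature; a small r_j means high correlation.
import torch

shared_dim, num_frames = 512, 8
label_feat = torch.randn(shared_dim)               # mapped description-label feature (assumed)
frame_embed = torch.randn(num_frames, shared_dim)  # mapped per-frame coding features (assumed)

r = ((frame_embed - label_feat) ** 2).mean(dim=-1)  # (num_frames,) per-frame L2/MSE loss
frame_mask_loss = r.mean()                          # loss term between label and coding features
ranking = torch.argsort(r)                          # frames ordered from most to least relevant
print(r, frame_mask_loss, ranking)
```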
Optionally, when masking the frames whose correlation with the text mapping feature ranks lower, the present application provides two optional processing manners: one may be called Hard Masking and the other may be called Soft Masking, specifically as follows:
Soft Masking: a coefficient is obtained according to the degree of correlation and multiplied with the video feature; that is, the coding feature (visual feature) of each frame is multiplied by a weight. As can be seen from the foregoing description, the greater the L2 loss corresponding to a frame, the smaller the correlation between the frame and the description label, so the weight of each frame is inversely proportional to its L2 loss. Optionally, the weight of each frame can be the reciprocal of its L2 loss, i.e.:

$$w_j = \frac{1}{r_j}$$

The coding features of the masked video can then be expressed as:

$$\tilde{V} = \mathrm{diag}(w_s)\, V$$

where $V$ represents the video features extracted by the encoder, $\tilde{V}$ represents the masked video features, and $\mathrm{diag}(w_s)$ is a diagonal matrix whose diagonal elements are the weights corresponding to the frames.
Hard Masking: several frames with higher correlation to the description label, i.e., several frames with lower r values, are selected, and the other frames with higher r values are masked. That is, the weights of the frames whose correlation ranks lower (i.e., whose L2 loss is greater) can be set to 0, meaning those frames are not used in the calculation, and the weights of the other frames are set to 1. The expression of the weights is as follows:

$$w_j = \begin{cases} 1, & \text{if the correlation of the } j\text{-th frame ranks among the highest} \\ 0, & \text{otherwise} \end{cases}$$

where $W_h = \mathrm{diag}(w_1, \ldots, w_N)$ represents the weight matrix corresponding to the frames of the video and $w_j$ indicates the weight corresponding to the j-th frame. Accordingly, the masked video features $\tilde{V}$ can be expressed as:

$$\tilde{V} = W_h\, V$$

The frame masks corresponding to different description labels of the same video are generally different, and the masked video features can be regarded as being in a one-to-one mapping relationship with the description labels, thereby solving the one-to-many mapping problem. A sketch of both masking modes is given below.
Then, based on the correlation corresponding to each frame, the coding features of the frames are weighted, and the two decoders decode the weighted features respectively to obtain the two decoding results corresponding to the above description label. By calculating the loss between the description label and each of the two decoding results, the loss corresponding to that description label (i.e., the first loss function, such as a cross-entropy loss) is obtained. The losses between each description label of each training sample video and the corresponding decoding results are added to obtain the value of the first loss function of the model.
4. Training of video description models (which may also be referred to as video description networks)
Alternatively, the loss function (i.e., the total loss function) of the video description model can be expressed as:
$$L(\Theta_i) = L_C(\Theta_i) + \lambda_1 L_e(\Theta_i) + \lambda_2 L_a(\Theta_i) + \lambda_3 L_f(\Theta_i) \qquad (11)$$

where $\lambda_1$, $\lambda_2$ and $\lambda_3$ are three weighting coefficients that control the proportions of the corresponding loss terms. The specific values of the weighting coefficients and the manner of choosing them are not limited in this application.
When the video description model is trained, the model may be trained based on each training video and the total loss function until the total loss function converges.
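A schematic training loop under equation (11) might look like the following; the model, loss computation, data pipeline and convergence test are placeholders, and only the overall control flow is illustrated.

```python
# Sketch of the outer training loop: compute the individual loss terms for each
# batch of training videos, combine them with the weighting coefficients, and
# update the model until the total loss stops improving. All components here
# are placeholders standing in for the real model and data pipeline.
import torch

lambda1, lambda2, lambda3 = 0.5, 0.5, 0.1      # illustrative weighting coefficients
params = [torch.nn.Parameter(torch.randn(4, 4))]
optimizer = torch.optim.Adam(params, lr=1e-4)

def compute_losses(batch):
    # Placeholder: a real implementation would run the encoders/decoders here.
    base = params[0].pow(2).mean()
    return base, 0.1 * base, 0.1 * base, 0.05 * base   # L_C, L_e, L_a, L_f

prev, step = float("inf"), 0
while step < 1000:                              # stand-in for "until convergence"
    l_c, l_e, l_a, l_f = compute_losses(batch=None)
    total = l_c + lambda1 * l_e + lambda2 * l_a + lambda3 * l_f
    optimizer.zero_grad()
    total.backward()
    optimizer.step()
    if abs(prev - total.item()) < 1e-6:
        break
    prev, step = total.item(), step + 1
```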
In practical applications, with the multi-encoder structure, when the coding features of each frame of the video or of the image extracted by the multi-encoder are weighted according to the correlation, the features extracted by the encoders may be fused first and then weighted, or only some or all of the features extracted by the encoders may be weighted. For an image, only the local features of each object may be weighted, and decoding may then be performed based on the weighted local features together with the global features and so on. The specific processing manner may be configured according to actual requirements and application scenarios (such as video processing or image processing), which is not limited in the present application.
In addition, in practical applications, the frame mask processing manner and the enhanced co-training manner can be adopted at the same time when training the model. In this case, when the second loss function corresponding to the enhanced co-training manner is calculated, i.e., when the difference (e.g., the K-L divergence) between each decoding result corresponding to the sample multimedia data (before enhancement) and each decoding result corresponding to the enhanced sample multimedia data is calculated, the decoding results for the sample multimedia data (before enhancement) may be obtained based on its coding features or based on the weighted coding features; similarly, the decoding results for the enhanced sample multimedia data may be obtained based on its coding features or based on the weighted coding features.
After the trained model is obtained, the description information of a video can be generated by the model. In practical applications (including testing and deployment of the model), since there is no description label, frame masking cannot be performed; in this case the mask weight corresponding to each frame may be taken as 1, or as another set value. As shown in fig. 6, for a video to be processed or a test video, the weights corresponding to the frames may all be 1, i.e., decoding is performed directly based on the coding features of the video (or the coding features after spatial conversion), and the decoding results of the decoders are fused to obtain the model output, i.e., the corresponding video description information.
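A non-limiting sketch of the inference-time fusion step follows; averaging the per-step word distributions of the decoders is an assumed fusion strategy, and the decoder outputs are simulated with random logits.

```python
# Sketch: generate description information at test time by decoding the
# (unmasked) coding features with every decoder and fusing their word
# distributions. Averaging the distributions is an assumed fusion choice.
import torch
import torch.nn.functional as F

vocab_size, T = 1000, 12
decoder_outputs = [F.softmax(torch.randn(T, vocab_size), dim=-1) for _ in range(2)]

fused = torch.stack(decoder_outputs).mean(dim=0)   # (T, vocab_size) fused distribution
word_ids = fused.argmax(dim=-1)                    # greedy word choice per step -> description tokens
print(word_ids)
```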
Based on the training method provided by the embodiment, the performance of the model can be effectively improved, and more accurate video description information or image description information can be generated based on the model.
The embodiment of the application also provides a method for generating the description information of the multimedia data, which comprises the following steps:
inputting multimedia data into a coding module of a multimedia data description model to obtain coding characteristics of the multimedia data, wherein the multimedia data description model comprises a coding module and a decoding module which are sequentially cascaded, and the decoding module comprises at least one decoder;
respectively inputting the coding characteristics into each decoder, and obtaining the description information of the multimedia data based on the decoding result of each decoder;
the multimedia data description model is obtained by training by using the training method provided by any optional embodiment of the application.
The multimedia data description model can be a video description model, the multimedia data can be a video, and the description information which is described more accurately can be obtained by inputting the video into the model. Similarly, the multimedia data description model may also be an image description model, and more accurate description information of an image can be obtained through the model.
Corresponding to the training method provided by the present application, an embodiment of the present application further provides a training apparatus for a multimedia data description model, where the multimedia data description model includes an encoding module and a decoding module, which are sequentially cascaded, and the decoding module includes at least one decoder, as shown in fig. 8, and the training apparatus 100 for a multimedia data description model includes a training data obtaining module 110 and a training module 120.
A training data obtaining module 110, configured to obtain a training data set, where the training data set includes each sample multimedia data and at least one description label of each sample multimedia data;
a training module 120, configured to train the multimedia data description model based on a training data set until a total loss function of the multimedia data description model converges;
wherein the total loss function comprises a first loss function, and the training module, when training the multimedia data description model based on the training data set, is configured to:
for each sample multimedia data, inputting the sample multimedia data into an encoding module to obtain encoding characteristics of the sample multimedia data, and respectively inputting the encoding characteristics into each decoder to obtain a first decoding result corresponding to each decoder;
the value of the first loss function is calculated based on the respective descriptive labels of the respective sample multimedia data and the first decoding results corresponding to the respective decoders.
Optionally, the decoding module includes a plurality of decoders respectively connected to the encoding module; for sample multimedia data, the training module is further to:
performing data enhancement processing on the sample multimedia data to obtain enhanced multimedia data;
inputting the enhanced multimedia data into a coding module to obtain the coding characteristics of the enhanced multimedia data;
respectively inputting the coding characteristics of the enhanced multimedia data into each decoder to obtain second decoding results corresponding to each decoder;
calculating a value of a second loss function based on each first decoding result corresponding to each sample multimedia data and each second decoding result corresponding to the enhanced multimedia data corresponding to each sample multimedia data;
the total loss function also includes a second loss function.
Optionally, the training module calculates a value of the second loss function based on each first decoding result corresponding to each sample multimedia data and each second decoding result corresponding to the enhanced multimedia data corresponding to each sample multimedia data, and may be configured to:
calculating the consistency loss value between every two decoding results in each first decoding result and each second decoding result;
the values of the consistency losses are added to obtain the value of the second loss function.
Optionally, for the sample multimedia data, the training module may be further configured to:
respectively performing mask processing on the coding features of the sample multimedia data based on each description label of the sample multimedia data to obtain masked coding features;
correspondingly, when the training module inputs the encoding features into each decoder respectively to obtain a first decoding result corresponding to each decoder, and calculates a value of the first loss function based on each description label of each sample multimedia data and the first decoding result corresponding to each decoder, the training module may be configured to:
respectively inputting the coded features of the mask corresponding to each description label of the sample multimedia data into each decoder to obtain a first decoding result of each description label corresponding to each decoder;
the value of the first loss function is calculated based on each descriptive label of the sample multimedia data and the first decoding result of the decoder corresponding to the descriptive label.
Optionally, the training module, when performing mask processing on the coding features of the sample multimedia data based on each description label of the sample multimedia data to obtain masked coding features, may be configured to:
obtaining description labeling characteristics of each description label of the sample multimedia data;
for each description label of the sample multimedia data, determining the correlation degree of the coding characteristics of the sample multimedia data and the description label characteristics of the description label;
for each description label of the sample multimedia data, weighting the coding features of the sample multimedia data based on the correlation corresponding to the description label to obtain weighted coding features (namely, the masked features).
Optionally, the training module is further configured to calculate a third loss function, where the third loss function represents a difference between the encoding characteristic of the sample multimedia data and the description label characteristic of each description label, and the total loss function includes the third loss function.
Optionally, the decoding module includes a plurality of decoders respectively connected to the encoding module; for sample multimedia data, the training module may be further operable to:
performing data enhancement processing on the sample multimedia data to obtain enhanced multimedia data;
inputting the enhanced multimedia data into a coding module to obtain the coding characteristics of the enhanced multimedia data;
for each description label of the sample multimedia data, determining a second degree of correlation between the coding characteristics of the enhanced multimedia data corresponding to the sample multimedia data and the description label characteristics of the description label;
weighting the coding features of the enhanced multimedia data based on the second degree of correlation corresponding to the description label to obtain second weighted coding features;
inputting the second weighted coding features into each decoder respectively to obtain a third decoding result of the description label corresponding to each decoder;
calculating a value of a fourth loss function based on each description label of each sample multimedia data corresponding to the first decoding result and the third decoding result of each decoder;
the total loss function further includes a fourth loss function.
Optionally, the multimedia data description model is a video description model or an image description model;
for the video description model, the sample multimedia data is a sample video, the encoding features of the sample multimedia data include encoding features of frames of the sample video, and for each description label of the sample multimedia data, the training module, when determining the correlation between the encoding features of the sample multimedia data and the description label features of the description label, may be configured to:
for each description label of the sample video, respectively determining the correlation degree of the coding characteristics of each frame of the sample video and the description label characteristics of the description label;
each description label of the sample multimedia data, and the training module, when performing weighting processing on the coding feature of the sample multimedia data based on the correlation degree corresponding to the description label to obtain a weighted coding feature, may be configured to:
and weighting the coding features of the frames based on the corresponding correlation of the frames of the sample video to obtain the weighted coding features of the frames.
For the image description model, the sample multimedia data is a sample image, and the coding features of the sample multimedia data comprise the coding features of each target area in the sample image; for each annotation of the sample multimedia data, the training module, when determining the correlation between the encoding characteristic of the sample multimedia data and the annotation describing characteristic of the annotation, may be configured to:
for each description label of the sample image, respectively determining the correlation degree of the coding feature of each target area of the sample image and the description label feature of the description label;
for each description label of the sample image, the training module may be configured to, when performing weighting processing on the encoding feature of the sample multimedia data based on the relevance corresponding to the description label to obtain a weighted encoding feature:
and weighting the coding features of the target areas based on the corresponding correlation degrees of the target areas of the sample image to obtain the weighted coding features of the target areas.
Optionally, the training module, when calculating the value of the total loss function, may be configured to:
obtaining the weight of each loss function contained in the total loss function;
and carrying out weighted summation on each loss function contained in the total loss function based on the weight of each loss function contained in the total loss function to obtain the value of the total loss function.
The present application further provides a device for generating description information of multimedia data, where the device may specifically obtain the description information of the multimedia data through a multimedia data description model, and the device may include:
the multimedia data description model comprises a coding module and a decoding module which are sequentially cascaded, wherein the decoding module comprises at least one decoder;
the character description generation module is used for respectively inputting the coding characteristics into each decoder and obtaining the description information of the multimedia data based on the decoding result of each decoder;
the multimedia data description model is obtained by training through the training method provided by any optional embodiment of the application.
The present application further provides an electronic device comprising a memory and a processor; wherein the memory has stored therein a computer program; the processor is adapted to perform the method provided in any of the alternative embodiments of the present application when running the computer program.
The present application also provides a computer-readable storage medium having stored thereon a computer program which, when executed by a processor, performs the method provided in any of the alternative embodiments of the present application.
As an alternative, fig. 9 shows a schematic structural diagram of an electronic device to which the embodiment of the present application is applicable, and as shown in fig. 9, the electronic device 4000 may include a processor 4001 and a memory 4003. Processor 4001 is coupled to memory 4003, such as via bus 4002. Optionally, the electronic device 4000 may further comprise a transceiver 4004. In addition, the transceiver 4004 is not limited to one in practical applications, and the structure of the electronic device 4000 is not limited to the embodiment of the present application.
The Processor 4001 may be a CPU (Central Processing Unit), a general-purpose Processor, a DSP (Digital Signal Processor), an ASIC (Application Specific Integrated Circuit), an FPGA (Field Programmable Gate Array) or other Programmable logic device, a transistor logic device, a hardware component, or any combination thereof. Which may implement or perform the various illustrative logical blocks, modules, and circuits described in connection with the disclosure. The processor 4001 may also be a combination that performs a computational function, including, for example, a combination of one or more microprocessors, a combination of a DSP and a microprocessor, or the like.
Bus 4002 may include a path that carries information between the aforementioned components. The bus 4002 may be a PCI (Peripheral Component Interconnect) bus, an EISA (Extended Industry Standard Architecture) bus, or the like. The bus 4002 may be divided into an address bus, a data bus, a control bus, and the like. For ease of illustration, only one thick line is shown in FIG. 9, but this does not indicate only one bus or one type of bus.
The Memory 4003 may be a ROM (Read Only Memory) or other types of static storage devices that can store static information and instructions, a RAM (Random Access Memory) or other types of dynamic storage devices that can store information and instructions, an EEPROM (Electrically Erasable Programmable Read Only Memory), a CD-ROM (Compact Disc Read Only Memory) or other optical Disc storage, optical Disc storage (including Compact Disc, laser Disc, optical Disc, digital versatile Disc, blu-ray Disc, etc.), a magnetic Disc storage medium or other magnetic storage devices, or any other medium that can be used to carry or store desired program code in the form of instructions or data structures and that can be accessed by a computer, but is not limited to these.
The memory 4003 is used for storing application codes for executing the scheme of the present application, and the execution is controlled by the processor 4001. The processor 4001 is configured to execute application program code (computer program) stored in the memory 4003 to implement the contents shown in any one of the foregoing method embodiments.
In the embodiments provided in the present application, the above-mentioned description information generation method performed by the electronic device may be performed using an artificial intelligence model.
According to an embodiment of the application, the method performed in the electronic device may obtain output data identifying an image or image content features in the image by using the image data or video data as input data for an artificial intelligence model. The artificial intelligence model may be obtained through training. Here, "obtained by training" means that a basic artificial intelligence model is trained with a plurality of pieces of training data by a training algorithm to obtain a predefined operation rule or artificial intelligence model configured to perform a desired feature (or purpose). The artificial intelligence model can include a plurality of neural network layers. Each of the plurality of neural network layers includes a plurality of weight values, and the neural network calculation is performed by a calculation between a calculation result of a previous layer and the plurality of weight values.
Visual understanding is a technique for recognizing and processing things like human vision, and includes, for example, object recognition, object tracking, image retrieval, human recognition, scene recognition, 3D reconstruction/localization, or image enhancement.
In embodiments provided herein, at least one of the plurality of modules may be implemented by an AI model. The functions associated with the AI may be performed by the non-volatile memory, the volatile memory, and the processor.
The processor may include one or more processors. At this time, the one or more processors may be general-purpose processors (e.g., a Central Processing Unit (CPU), an Application Processor (AP), etc.), graphics-dedicated processors (e.g., a Graphics Processing Unit (GPU), a Vision Processing Unit (VPU)), and/or AI-dedicated processors (e.g., a Neural Processing Unit (NPU)).
The one or more processors control the processing of the input data according to predefined operating rules or Artificial Intelligence (AI) models stored in the non-volatile memory and the volatile memory. Predefined operating rules or artificial intelligence models are provided through training or learning.
Here, the provision by learning means that a predefined operation rule or an AI model having a desired characteristic is obtained by applying a learning algorithm to a plurality of learning data. This learning may be performed in the device itself in which the AI according to the embodiment is performed, and/or may be implemented by a separate server/system.
The AI model may consist of a plurality of neural network layers. Each layer has a plurality of weight values, and the calculation of one layer is performed using the calculation result of the previous layer and the plurality of weights of the current layer. Examples of neural networks include, but are not limited to, Convolutional Neural Networks (CNNs), Deep Neural Networks (DNNs), Recurrent Neural Networks (RNNs), Restricted Boltzmann Machines (RBMs), Deep Belief Networks (DBNs), Bidirectional Recurrent Deep Neural Networks (BRDNNs), Generative Adversarial Networks (GANs), and deep Q-networks.
A learning algorithm is a method of training a predetermined target device (e.g., a robot) using a plurality of learning data to make, allow, or control the target device to make a determination or prediction. Examples of the learning algorithm include, but are not limited to, supervised learning, unsupervised learning, semi-supervised learning, or reinforcement learning.
It should be understood that, although the steps in the flowcharts of the figures are shown in an order indicated by the arrows, they are not necessarily performed in that order. Unless explicitly stated herein, the steps need not be performed in the exact order shown and may be performed in other orders. Moreover, at least a portion of the steps in the flowcharts may include multiple sub-steps or stages, which are not necessarily performed at the same time but may be performed at different times, and which need not be performed in sequence but may be performed in turn or alternately with other steps or with at least a portion of the sub-steps or stages of other steps.
The foregoing is only a partial embodiment of the present invention. It should be noted that those skilled in the art can make various modifications and improvements without departing from the principle of the present invention, and these modifications and improvements should also be regarded as falling within the protection scope of the present invention.

Claims (14)

1. A method for training a multimedia data description model, wherein the multimedia data description model comprises an encoding module and a decoding module which are sequentially cascaded, and the decoding module comprises at least one decoder, the method comprising:
acquiring a training data set, wherein the training data set comprises sample multimedia data and at least one description label of each sample multimedia data;
training the multimedia data description model based on the training data set until a total loss function of the multimedia data description model converges;
the total loss function comprises a first loss function, during training, for each sample multimedia data, the sample multimedia data is input into an encoding module to obtain encoding characteristics of the sample multimedia data, the encoding characteristics are respectively input into each decoder to obtain a first decoding result corresponding to each decoder, and a value of the first loss function is calculated based on each description label of each sample multimedia data and the first decoding result corresponding to each decoder.
2. The method of claim 1, wherein the decoding module comprises a plurality of decoders respectively connected to the encoding module; for the sample multimedia data, the method further comprises:
performing data enhancement processing on the sample multimedia data to obtain enhanced multimedia data;
inputting the enhanced multimedia data into a coding module to obtain the coding characteristics of the enhanced multimedia data;
respectively inputting the coding characteristics of the enhanced multimedia data into each decoder to obtain second decoding results corresponding to each decoder;
calculating a value of a second loss function based on each first decoding result corresponding to each sample multimedia data and each second decoding result corresponding to the enhanced multimedia data corresponding to each sample multimedia data;
the total loss function also includes a second loss function.
3. The method of claim 2, wherein the calculating a value of the second loss function based on the corresponding first decoding result of each sample multimedia data and the corresponding second decoding result of the enhanced multimedia data corresponding to each sample multimedia data comprises:
calculating the consistency loss value between every two decoding results in each first decoding result and each second decoding result;
the values of the consistency losses are added to obtain the value of the second loss function.
4. The method of claim 1, wherein for sample multimedia data, the method further comprises:
respectively performing mask processing on the coding features of the sample multimedia data based on each description label of the sample multimedia data to obtain masked coding features;
the step of inputting the encoding characteristics into each decoder respectively to obtain first decoding results corresponding to each decoder, and calculating a value of a first loss function based on each description label of each sample multimedia data and the first decoding results corresponding to each decoder includes:
respectively inputting the coded features of the mask corresponding to each description label of the sample multimedia data into each decoder to obtain a first decoding result of each description label corresponding to each decoder;
the value of the first loss function is calculated based on each descriptive label of the sample multimedia data and the first decoding result of the decoder corresponding to the descriptive label.
5. The method of claim 4, wherein for the sample multimedia data, the masking the coding features of the sample multimedia data based on each description label of the sample multimedia data to obtain masked coding features comprises:
obtaining description labeling characteristics of each description label of the sample multimedia data;
for each description label of the sample multimedia data, determining the correlation degree of the coding characteristics of the sample multimedia data and the description label characteristics of the description label;
and for each description label of the sample multimedia data, weighting the coding features of the sample multimedia data based on the correlation corresponding to the description label to obtain the weighted coding features.
6. The method of claim 5, wherein the overall loss function further comprises a third loss function characterizing a difference between the encoding characteristics of the sample multimedia data and the annotation characteristics of each annotation.
7. The method according to claim 5 or 6, wherein the decoding module comprises a plurality of decoders respectively connected to the encoding modules; for sample multimedia data, the method further comprises:
performing data enhancement processing on the sample multimedia data to obtain enhanced multimedia data;
inputting the enhanced multimedia data into a coding module to obtain the coding characteristics of the enhanced multimedia data;
for each description label of the sample multimedia data, determining a second degree of correlation between the coding characteristics of the enhanced multimedia data corresponding to the sample multimedia data and the description label characteristics of the description label;
weighting the coding features of the enhanced multimedia data based on the second degree of correlation corresponding to the description label to obtain second weighted coding features;
inputting the second weighted coding features into each decoder respectively to obtain a third decoding result of the description label corresponding to each decoder;
calculating a value of a fourth loss function based on each description label of each sample multimedia data corresponding to the first decoding result and the third decoding result of each decoder;
the total loss function further includes the fourth loss function.
8. The method of claim 7, wherein the multimedia data description model is a video description model or an image description model;
for the video description model, the sample multimedia data is a sample video, and the coding features of the sample multimedia data comprise the coding features of each frame of the sample video; for each description label of the sample multimedia data, determining the correlation between the coding feature of the sample multimedia data and the description label feature of the description label includes:
for each description label of the sample video, respectively determining the correlation degree of the coding characteristics of each frame of the sample video and the description label characteristics of the description label;
the weighting processing is performed on the coding features of the sample multimedia data based on the correlation corresponding to the description label to obtain weighted coding features, and the weighting processing comprises the following steps:
weighting the coding features of each frame based on the corresponding correlation of each frame of the sample video to obtain the weighted coding features of each frame;
for the image description model, sample multimedia data are sample images, and the coding features of the sample multimedia data comprise the coding features of each target area in the sample images; for each description label of the sample multimedia data, determining the correlation between the coding feature of the sample multimedia data and the description label feature of the description label includes:
for each description label of the sample image, respectively determining the correlation degree of the coding feature of each target area of the sample image and the description label feature of the description label;
for each description label of a sample image, weighting the coding features of the sample multimedia data based on the relevance corresponding to the description label to obtain weighted coding features, including:
and weighting the coding features of the target areas based on the corresponding correlation degrees of the target areas of the sample image to obtain the weighted coding features of the target areas.
9. The method of any one of claims 1 to 8, further comprising:
obtaining the weight of each loss function contained in the total loss function;
wherein the value of the total loss function is obtained by weighting and summing the loss functions included in the total loss function based on the weights of the loss functions included in the total loss function.
10. A method for generating description information of multimedia data, comprising:
inputting multimedia data into a coding module of a multimedia data description model to obtain coding characteristics of the multimedia data, wherein the multimedia data description model comprises a coding module and a decoding module which are sequentially cascaded, and the decoding module comprises at least one decoder;
inputting the coding features into each decoder respectively, and obtaining the description information of the multimedia data based on the decoding result of each decoder;
wherein the multimedia data description model is trained using the method of any one of claims 1 to 9.
11. An apparatus for training a multimedia data description model, wherein the multimedia data description model comprises an encoding module and a decoding module which are sequentially cascaded, and the decoding module comprises at least one decoder, the apparatus comprising:
the training data acquisition module is used for acquiring a training data set, wherein the training data set comprises multimedia data of each sample and at least one description label of the multimedia data of each sample;
the training module is used for training the multimedia data description model based on the training data set until the total loss function of the multimedia data description model is converged;
wherein the total loss function comprises a first loss function, the training module when training the multimedia data description model based on the training data set is to:
for each sample multimedia data, inputting the sample multimedia data into an encoding module to obtain encoding characteristics of the sample multimedia data, and respectively inputting the encoding characteristics into each decoder to obtain a first decoding result corresponding to each decoder;
the value of the first loss function is calculated based on the respective descriptive labels of the respective sample multimedia data and the first decoding results corresponding to the respective decoders.
12. An apparatus for generating description information of multimedia data, comprising:
the multimedia data description model comprises a coding module and a decoding module which are sequentially cascaded, wherein the decoding module comprises at least one decoder;
the character description generation module is used for respectively inputting the coding characteristics into each decoder and obtaining the description information of the multimedia data based on the decoding result of each decoder;
wherein the multimedia data description model is trained using the method of any one of claims 1 to 9.
13. An electronic device comprising a memory and a processor;
the memory has stored therein a computer program;
the processor, when running the computer program, is configured to perform the method of any one of claims 1 to 9, or to perform the method of claim 10.
14. A computer-readable storage medium, in which a computer program is stored which, when being executed by a processor, carries out the method of one of the claims 1 to 9 or carries out the method of claim 10.