CN111723937A - Method, device, equipment and medium for generating description information of multimedia data - Google Patents

Method, device, equipment and medium for generating description information of multimedia data

Info

Publication number
CN111723937A
Authority
CN
China
Prior art keywords
description
multimedia data
image
feature
information
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202010152713.5A
Other languages
Chinese (zh)
Inventor
林科
甘卓欣
姜映映
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Beijing Samsung Telecom R&D Center
Beijing Samsung Telecommunications Technology Research Co Ltd
Samsung Electronics Co Ltd
Original Assignee
Beijing Samsung Telecommunications Technology Research Co Ltd
Samsung Electronics Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Beijing Samsung Telecommunications Technology Research Co Ltd, Samsung Electronics Co Ltd filed Critical Beijing Samsung Telecommunications Technology Research Co Ltd
Priority to US17/292,627 priority Critical patent/US20220014807A1/en
Priority to PCT/KR2020/003972 priority patent/WO2020190112A1/en
Priority to EP20774058.0A priority patent/EP3892005A4/en
Priority to KR1020217028919A priority patent/KR102593440B1/en
Publication of CN111723937A publication Critical patent/CN111723937A/en
Pending legal-status Critical Current

Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N20/00Machine learning
    • HELECTRICITY
    • H04ELECTRIC COMMUNICATION TECHNIQUE
    • H04NPICTORIAL COMMUNICATION, e.g. TELEVISION
    • H04N21/00Selective content distribution, e.g. interactive television or video on demand [VOD]
    • H04N21/40Client devices specifically adapted for the reception of or interaction with content, e.g. set-top-box [STB]; Operations thereof
    • H04N21/47End-user applications
    • H04N21/488Data services, e.g. news ticker
    • H04N21/4884Data services, e.g. news ticker for displaying subtitles
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00Pattern recognition
    • G06F18/20Analysing
    • G06F18/24Classification techniques
    • G06F18/243Classification techniques relating to the number of classes
    • G06F18/2431Multiple classes
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/02Neural networks
    • G06N3/04Architecture, e.g. interconnection topology
    • G06N3/044Recurrent networks, e.g. Hopfield networks
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/02Neural networks
    • G06N3/04Architecture, e.g. interconnection topology
    • G06N3/045Combinations of networks
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/02Neural networks
    • G06N3/08Learning methods
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N5/00Computing arrangements using knowledge-based models
    • G06N5/02Knowledge representation; Symbolic representation
    • G06N5/022Knowledge engineering; Knowledge acquisition
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06VIMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V10/00Arrangements for image or video recognition or understanding
    • G06V10/70Arrangements for image or video recognition or understanding using pattern recognition or machine learning
    • G06V10/82Arrangements for image or video recognition or understanding using pattern recognition or machine learning using neural networks
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06VIMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V20/00Scenes; Scene-specific elements
    • G06V20/40Scenes; Scene-specific elements in video content
    • G06V20/41Higher-level, semantic clustering, classification or understanding of video scenes, e.g. detection, labelling or Markovian modelling of sport events or news items
    • HELECTRICITY
    • H04ELECTRIC COMMUNICATION TECHNIQUE
    • H04NPICTORIAL COMMUNICATION, e.g. TELEVISION
    • H04N21/00Selective content distribution, e.g. interactive television or video on demand [VOD]
    • H04N21/20Servers specifically adapted for the distribution of content, e.g. VOD servers; Operations thereof
    • H04N21/23Processing of content or additional data; Elementary server operations; Server middleware
    • H04N21/234Processing of video elementary streams, e.g. splicing of video streams or manipulating encoded video stream scene graphs
    • H04N21/23418Processing of video elementary streams, e.g. splicing of video streams or manipulating encoded video stream scene graphs involving operations for analysing video streams, e.g. detecting features or characteristics
    • HELECTRICITY
    • H04ELECTRIC COMMUNICATION TECHNIQUE
    • H04NPICTORIAL COMMUNICATION, e.g. TELEVISION
    • H04N21/00Selective content distribution, e.g. interactive television or video on demand [VOD]
    • H04N21/40Client devices specifically adapted for the reception of or interaction with content, e.g. set-top-box [STB]; Operations thereof
    • H04N21/43Processing of content or additional data, e.g. demultiplexing additional data from a digital video stream; Elementary client operations, e.g. monitoring of home network or synchronising decoder's clock; Client middleware
    • H04N21/44Processing of video elementary streams, e.g. splicing a video clip retrieved from local storage with an incoming video stream or rendering scenes according to encoded video stream scene graphs
    • H04N21/44008Processing of video elementary streams, e.g. splicing a video clip retrieved from local storage with an incoming video stream or rendering scenes according to encoded video stream scene graphs involving operations for analysing video streams, e.g. detecting features or characteristics in the video stream
    • HELECTRICITY
    • H04ELECTRIC COMMUNICATION TECHNIQUE
    • H04NPICTORIAL COMMUNICATION, e.g. TELEVISION
    • H04N21/00Selective content distribution, e.g. interactive television or video on demand [VOD]
    • H04N21/80Generation or processing of content or additional data by content creator independently of the distribution process; Content per se
    • H04N21/81Monomedia components thereof
    • H04N21/8126Monomedia components thereof involving additional data, e.g. news, sports, stocks, weather forecasts
    • H04N21/8133Monomedia components thereof involving additional data, e.g. news, sports, stocks, weather forecasts specifically related to the content, e.g. biography of the actors in a movie, detailed information about an article seen in a video program
    • HELECTRICITY
    • H04ELECTRIC COMMUNICATION TECHNIQUE
    • H04NPICTORIAL COMMUNICATION, e.g. TELEVISION
    • H04N21/00Selective content distribution, e.g. interactive television or video on demand [VOD]
    • H04N21/80Generation or processing of content or additional data by content creator independently of the distribution process; Content per se
    • H04N21/83Generation or processing of protective or descriptive data associated with content; Content structuring
    • H04N21/84Generation or processing of descriptive data, e.g. content descriptors
    • HELECTRICITY
    • H04ELECTRIC COMMUNICATION TECHNIQUE
    • H04NPICTORIAL COMMUNICATION, e.g. TELEVISION
    • H04N5/00Details of television systems
    • H04N5/222Studio circuitry; Studio devices; Studio equipment
    • H04N5/262Studio circuits, e.g. for mixing, switching-over, change of character of image, other special effects ; Cameras specially adapted for the electronic generation of special effects
    • H04N5/278Subtitling

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Multimedia (AREA)
  • Physics & Mathematics (AREA)
  • Signal Processing (AREA)
  • Software Systems (AREA)
  • Evolutionary Computation (AREA)
  • General Physics & Mathematics (AREA)
  • Artificial Intelligence (AREA)
  • General Engineering & Computer Science (AREA)
  • Data Mining & Analysis (AREA)
  • Computing Systems (AREA)
  • Mathematical Physics (AREA)
  • Computational Linguistics (AREA)
  • General Health & Medical Sciences (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Health & Medical Sciences (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Molecular Biology (AREA)
  • Biophysics (AREA)
  • Biomedical Technology (AREA)
  • Medical Informatics (AREA)
  • Evolutionary Biology (AREA)
  • Databases & Information Systems (AREA)
  • Bioinformatics & Computational Biology (AREA)
  • Bioinformatics & Cheminformatics (AREA)
  • Image Analysis (AREA)

Abstract

Embodiments of the present application provide a method, an apparatus, a device and a medium for generating description information of multimedia data. The method includes: extracting feature information of multimedia data to be processed, where the multimedia data includes a video or an image; and generating a textual description of the multimedia data based on the extracted feature information. The method provided by the embodiments of the present application can effectively improve the accuracy of the generated textual description of multimedia data.

Description

Method, device, equipment and medium for generating description information of multimedia data
Technical Field
The present application relates to the field of computer technologies, and in particular, to a method and an apparatus for generating description information of multimedia data, an electronic device, and a storage medium.
Background
In computer vision, video description (Video Captioning) or image description (Image Captioning) refers to outputting a textual description for a given video or image. For example, for a video in which a child is cleaning the floor, video description automatically outputs the textual description "a child is cleaning the floor". The task lies at the intersection of computer vision and natural language processing.
Existing video description approaches generally select a number of frames from a video, extract full-image features from the selected frames, decode those features, and generate a textual description of the video according to the maximum likelihood probability; image description works on a similar principle. Thus, existing video description models basically adopt an encoder-decoder structure: the encoder is responsible for extracting the features of the video frames, and the decoder is responsible for decoding those features and generating the textual description. Although there are a variety of ways to generate video description information, the accuracy of the generated descriptions still needs to be improved.
Disclosure of Invention
An object of the embodiment of the present application is to provide a method and an apparatus for generating video description information, an electronic device, and a storage medium, so as to improve accuracy of the generated video description information. The scheme provided by the embodiment of the application is as follows:
in a first aspect, an embodiment of the present application provides a method for generating description information of multimedia data, where the method includes:
extracting feature information of multimedia data to be processed, wherein the multimedia data comprises a video or an image;
based on the extracted feature information, a textual description of the multimedia data is generated.
In a second aspect, an embodiment of the present application provides an apparatus for generating description information of multimedia data, where the apparatus includes:
a feature information extraction module, configured to extract feature information of multimedia data to be processed, where the multimedia data includes a video or an image; and
a description information generation module, configured to generate a textual description of the multimedia data based on the extracted feature information.
In a third aspect, an embodiment of the present application provides an electronic device, which includes a memory and a processor; wherein the memory has stored therein a computer program; the processor is used for executing the method provided by the embodiment of the application when the computer program runs.
In a fourth aspect, the present application provides a computer-readable storage medium, in which a computer program is stored, and when the computer program is executed by a processor, the computer program performs the method provided by the present application.
The advantageous effects of the technical solutions provided in the embodiments of the present application are described in detail in the following description of specific implementations with reference to various alternative embodiments, and are not repeated here.
Drawings
In order to more clearly illustrate the technical solutions in the embodiments of the present application, the drawings used in the description of the embodiments of the present application will be briefly described below.
FIG. 1 is a schematic diagram of an example of an image description;
FIG. 2 is a schematic diagram of an example of a video description;
FIG. 3 is a schematic diagram of a conventional video description algorithm;
FIG. 4 is a schematic diagram of a training process of a conventional supervised learning-based video description algorithm;
fig. 5 is a schematic flowchart of a method for generating description information of multimedia data according to an embodiment of the present application;
FIG. 6 is a schematic diagram illustrating semantic features obtained through a semantic prediction network according to an example of the present application;
FIG. 7a is a schematic diagram of a spatial scene graph in an example of the present application;
FIG. 7b is a schematic view of a spatial scene graph in another example of the present application;
FIG. 8 is a schematic diagram illustrating an example of a relationship feature obtained through a relationship prediction network;
FIG. 9 is a schematic diagram illustrating an example of a principle of obtaining attribute characteristics through an attribute prediction network;
FIG. 10 is a schematic diagram of a spatio-temporal scene graph in an example of the present application;
FIG. 11 is a schematic diagram of a spatio-temporal scene graph in another example of the present application;
FIG. 12 is a schematic diagram of a feature selection network provided in an example of the present application;
FIG. 13a is a block diagram of a self-attention-based codec according to an example of the present application;
FIG. 13b is a block diagram of a self-attention-based codec according to an example of the present application;
fig. 14, 15 and 16 are schematic diagrams of methods for generating video description information provided in three examples of the present application, respectively;
fig. 17a and 17b are schematic diagrams of video description information obtained in two alternative examples of the present application;
fig. 18 and 19 are schematic flow charts of methods for generating video description information in two other examples of the present application;
fig. 20 is a schematic diagram of a principle of obtaining video description information provided in an example of the present application;
fig. 21 is a flowchart illustrating a method for generating image description information according to an embodiment of the present application;
fig. 22 and 23 are schematic structural diagrams of codecs provided in two alternative examples of the present application;
fig. 24 is a flowchart illustrating a method for training a multimedia data description model according to an embodiment of the present application;
FIG. 25 is a diagram of a sample video with a video description label (i.e., an original description label) in an example of the present application;
FIG. 26 is a schematic diagram illustrating a method for training a video description model according to an example of the present application;
fig. 27 is a flowchart illustrating a method for obtaining enhanced multimedia data description information according to an embodiment of the present application;
FIGS. 28a and 28b are schematic structural diagrams of two codecs provided in two alternative examples of the present application;
FIG. 29 is a schematic flowchart of a training method for an image description model according to an embodiment of the present disclosure;
fig. 30 and fig. 31 are schematic diagrams of the principle of the generation method of video description information provided in two examples of the present application;
fig. 32 is a schematic structural diagram of an apparatus for generating description information of multimedia data according to an embodiment of the present application;
fig. 33 is a schematic structural diagram of an electronic device to which the embodiment of the present application is applied.
Detailed Description
The following detailed description of the embodiments of the present application is provided with reference to the accompanying drawings for purposes of explanation and to give a thorough understanding of the embodiments, which are defined by the claims and their equivalents. It includes various specific details to facilitate understanding, but these examples and details are to be regarded as illustrative only and not as limiting the application. Accordingly, those of ordinary skill in the art will recognize that changes and modifications of the described embodiments can be made without departing from the scope and spirit of the application. In addition, some descriptions of well-known functions and constructions may be omitted in the following description for clarity and conciseness.
As used herein, the singular forms "a", "an" and "the" are intended to include the plural forms as well, unless the context clearly indicates otherwise. It will be further understood that the terms "comprises" and/or "comprising", when used in this specification, specify the presence of stated features, integers, steps, operations, elements and/or components, but do not preclude the presence or addition of one or more other features, integers, steps, operations, elements, components and/or groups thereof. It will be understood that when an element is referred to as being "connected" or "coupled" to another element, it can be directly connected or coupled to the other element, or intervening elements may also be present. Further, "connected" or "coupled" as used herein may include wirelessly connected or wirelessly coupled. As used herein, the term "and/or" includes any and all combinations of one or more of the associated listed items.
First, it should be noted that the method for generating description information of multimedia data provided in the embodiment of the present application may be used to generate description information of a video including multiple frames of images, and may also be used to generate description information of an image, where a source of the image is not limited in the embodiment of the present application, and the image may be an image acquired, downloaded, or received, or an image in the video, such as a key frame image or a specific frame image. That is, the generation method of the embodiment of the present application may be a generation method of video description information or a generation method of image description information.
For better illustration and understanding of the solutions provided by the embodiments of the present application, the following first describes the technologies related to the embodiments of the present application.
In computer vision technology, video/image description refers to outputting a textual description for a given video or image, and lies at the intersection of computer vision and natural language processing. Compared with other computer vision tasks such as object detection and image segmentation, video/image description is a more challenging task: it not only requires a more comprehensive understanding of the video or image, but must also express its content in natural language. As shown in fig. 1, when the image shown in fig. 1 is given, the textual description "a boy is playing tennis" may be automatically output for the image. As shown in fig. 2, when a video including the multiple frames shown in fig. 2 is given, the textual description "a child is cleaning the floor" may be automatically output for the video.
At present, the existing image description model basically adopts the structure of an encoder-decoder. The encoder is usually designed based on a Convolutional Neural Network (CNN) and is responsible for extracting features of an image, and the decoder is usually designed based on a Recurrent Neural Network (RNN) and is responsible for decoding features of an image and generating a textual description.
Similarly, existing video description models generally select a number of frames from a video, use a CNN to extract full-image features from the selected frames, then use an RNN to decode the features of all the frames and generate a textual description of the video according to the maximum likelihood probability. As can be seen, existing video description algorithms basically adopt an encoder-decoder structure: the CNN encodes the video frames and is responsible for extracting their features, and may therefore also be referred to as the encoder or CNN encoder; the RNN decodes the video frames and is responsible for decoding their features and generating the textual description, and may therefore also be referred to as the decoder or RNN decoder. The RNN may be a Long Short-Term Memory network (LSTM), in which case it may be referred to as an LSTM decoder.
As an example, fig. 3 shows a schematic diagram of a conventional video description model. As shown in fig. 3, a number of frames are selected from a video (the ellipses in the figure indicate omitted frames and frames not shown), each frame is processed by a CNN encoder to extract its features, and the extracted features are decoded by an LSTM decoder to generate the corresponding textual description "a man is putting pizza in an oven".
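As a purely illustrative sketch of the conventional encoder-decoder pipeline described above (not the model of the present application), the following code shows a CNN encoder producing per-frame features and an LSTM decoder generating a word sequence; the backbone, dimensions and mean-pooling choice are assumptions.

```python
import torch
import torch.nn as nn
import torchvision.models as models

class CnnLstmCaptioner(nn.Module):
    """Conventional CNN-encoder / LSTM-decoder video captioner (illustrative only)."""
    def __init__(self, vocab_size: int, embed_dim: int = 512, hidden_dim: int = 512):
        super().__init__()
        backbone = models.resnet50(weights=None)      # CNN encoder (assumed backbone)
        backbone.fc = nn.Identity()                   # keep the 2048-d pooled feature
        self.encoder = backbone
        self.frame_proj = nn.Linear(2048, hidden_dim)
        self.embed = nn.Embedding(vocab_size, embed_dim)
        self.decoder = nn.LSTM(embed_dim, hidden_dim, batch_first=True)
        self.out = nn.Linear(hidden_dim, vocab_size)

    def forward(self, frames: torch.Tensor, captions: torch.Tensor) -> torch.Tensor:
        # frames: (B, T, 3, H, W) selected frames; captions: (B, L) token ids
        b, t = frames.shape[:2]
        feats = self.encoder(frames.flatten(0, 1)).view(b, t, -1)   # (B, T, 2048)
        video_feat = self.frame_proj(feats).mean(dim=1)             # mean-pool over frames
        h0 = video_feat.unsqueeze(0)                                # init decoder state
        c0 = torch.zeros_like(h0)
        dec_out, _ = self.decoder(self.embed(captions), (h0, c0))   # (B, L, hidden)
        return self.out(dec_out)                                    # word logits (B, L, vocab)
```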
Although the prior art is able to generate textual descriptions of videos or images, the inventors of the present application have found through analysis and study that at least the following technical problems still exist in the prior art:
(1) Existing decoders, such as RNNs, have a recurrent structure and must be trained step by step, which leads to low training speed and low training efficiency, as well as difficulty in learning long-range correlations and insufficient expressive capability.
(2) In the data sets commonly used in the video/image description field, the description information of a training sample (i.e., a sample video or sample image) is limited; for example, a sample image usually has only 5 description labels, and it is usually difficult to completely express the information in an image with only 5 labels. Moreover, due to the diversity of natural language, the same meaning can be expressed in many different ways. The poor diversity of the description information of training samples is therefore also a problem that hinders further development in this field.
(3) For videos containing multiple frames of images, the prior art does not consider intra-frame information, although this information is significant for generating a more accurate video description; the problem of how to fully utilize intra-frame information therefore needs to be solved.
(4) The prior art does not consider semantic information of video or images, which is significant for generating more accurate video description.
(5) Existing video description algorithms are trained on annotated data in which each training video corresponds to one or more labeled video descriptions, for example using a reinforcement learning method. As shown in fig. 4, for data labeled with video descriptions, video data P is input into a video description model K, the video description model K analyzes and processes the video data P to generate a corresponding video description, the value of a loss function T_mark(α) is calculated based on the labeled video description Q and the generated video description, and the loss function T_mark(α) guides the learning of the video description model K. However, annotating videos with descriptions requires considerable labor and time, which not only makes the number of samples in existing video description data sets relatively limited, but also, because the number of samples is limited, leads to poor accuracy and precision of video description models trained on these data sets.
(6) In existing approaches to generating video or image description information, the length of the generated description is not controllable, so user requirements for descriptions of different lengths in different application scenarios cannot be met.
In order to solve at least one of the above technical problems in the prior art, embodiments of the present application provide a method, an apparatus, an electronic device, and a storage medium for generating description information of multimedia data, where the multimedia data may be a video or an image.
In order to make the objects, technical solutions and advantages of the present application clearer, various alternative embodiments of the present application and how the technical solutions of the embodiments of the present application solve the above technical problems will be described in detail below with reference to specific embodiments and drawings. The following several specific embodiments may be combined with each other, and details of the same or similar concepts or processes may not be repeated in some embodiments. Embodiments of the present application will be described below with reference to the accompanying drawings.
Fig. 5 is a flowchart illustrating a method for generating description information of multimedia data according to an embodiment of the present application, where as shown in the diagram, the method mainly includes the following steps:
step S101: extracting feature information of multimedia data to be processed;
step S102: based on the extracted feature information, a textual description of the multimedia data is generated.
In the embodiment of the present application, the multimedia data includes a video or an image, and based on the method provided in the embodiment of the present application, generation of video description information or image description information can be realized.
In an optional embodiment of the present application, the extracting feature information of the multimedia data to be processed includes at least one of:
extracting local visual features of targets contained in each target region of each image in the multimedia data;
extracting semantic features of the multimedia data;
if the multimedia data is a video, extracting the space-time visual features of the multimedia data;
extracting global visual features of the multimedia data;
extracting attribute features of targets contained in each target area of each image in the multimedia data;
global attribute features of each image in the multimedia data are extracted.
That is, the feature information of the multimedia data may include one or more of a local visual feature, a semantic feature, a spatio-temporal visual feature, a global visual feature, a local attribute feature (i.e., the attribute features of the targets) and a global attribute feature. A local visual feature is local relative to the image to which the target region belongs; that is, it is the visual feature of one target region in the image.
For example, for a given video, several frames may be selected at equal intervals, several image frames (i.e., key frames) may be selected using a key frame algorithm, or frame selection may be performed by a neural network. In the following description, for a video, the images of the multimedia data are exemplified by the frames selected from the video.
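As a minimal sketch of the equal-interval frame selection mentioned above (the number of selected frames and the rounding rule are assumptions):

```python
def select_frames(num_frames: int, num_selected: int = 8) -> list:
    """Pick frame indices at roughly equal intervals from a video with num_frames frames."""
    if num_frames <= num_selected:
        return list(range(num_frames))
    step = num_frames / num_selected
    return [int(i * step + step / 2) for i in range(num_selected)]

# e.g. select_frames(300, 8) -> [18, 56, 93, 131, 168, 206, 243, 281]
```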
Visual features reflect pixel information in the images, and attribute features reflect attribute information of each target in the images, so both can be used to generate the description information of videos or images. Local visual features reflect the information of each target region in each image more accurately and more finely; they therefore make fuller use of the intra-frame information of each image, and a more accurate textual description of the video or image can be generated.
In practical applications, besides the visual and attribute features of images, other features also contribute to describing multimedia data. For example, the spatio-temporal visual features of a video can effectively reflect its dynamic changes in space and time, and the semantic features of a video or image reflect the semantic information of its content. Therefore, when generating the textual description of a video or image, the spatio-temporal visual features of the video, the semantic features of the video or image, and so on can also be extracted through neural networks, so that more diverse features are incorporated into the generation of the description information, enhancing the expressive power of the generated description and improving its precision.
Optionally, for each image, a feature extraction network may be used to obtain a number of target regions and the region features of each target region (i.e., the above-mentioned local visual features, which may also be referred to as target features), and the attribute features may be obtained by an attribute prediction network. For example, a Faster Region-based Convolutional Neural Network (Faster R-CNN) may be applied to each image to extract several target regions and the local visual features (i.e., local/regional features) of the respective target regions. The Faster R-CNN can be pre-trained on a sample data set, such as ImageNet or the Visual Genome data set.
It should be noted that extracting the target regions and their local features through the Faster R-CNN is merely exemplary; the present application is not limited thereto, and the feature extraction network may be implemented by any other available neural network.
The spatio-temporal visual features of the video may be obtained using a spatio-temporal feature extraction network; for example, a three-dimensional visual feature extraction model based on an ECO (Efficient Convolutional network for Online video understanding) model or a 3D-CNN (three-dimensional convolutional neural network) may be used, and of course other spatio-temporal visual feature extraction networks may also be used. The semantic features can be extracted through a semantic prediction network.
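The following is a hedged sketch of how target regions and their local visual features might be obtained with a pre-trained detector; the torchvision models, score threshold and crop-and-encode approximation are assumptions rather than the exact feature extraction network of this application.

```python
import torch
import torchvision
from torchvision.models.detection import fasterrcnn_resnet50_fpn
from torchvision.transforms.functional import resized_crop

detector = fasterrcnn_resnet50_fpn(weights="DEFAULT").eval()   # pre-trained detector
cnn = torchvision.models.resnet50(weights="DEFAULT")
cnn.fc = torch.nn.Identity()                                   # 2048-d region descriptor
cnn.eval()

@torch.no_grad()
def region_features(image: torch.Tensor, score_thresh: float = 0.7, max_regions: int = 10):
    """image: (3, H, W) float tensor in [0, 1]; returns detected boxes and one feature per box.
    Normalization is omitted for brevity; cropping and re-encoding each box only approximates
    pooling region features inside the detector itself."""
    det = detector([image])[0]
    keep = det["scores"] > score_thresh
    boxes = det["boxes"][keep][:max_regions]
    feats = []
    for x1, y1, x2, y2 in boxes.round().int().tolist():
        crop = resized_crop(image, y1, x1, max(y2 - y1, 1), max(x2 - x1, 1), [224, 224])
        feats.append(cnn(crop.unsqueeze(0)).squeeze(0))        # local visual feature (2048,)
    return boxes, torch.stack(feats) if feats else torch.empty(0, 2048)
```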
Specifically, a trained semantic prediction network may be applied to obtain the semantic features of the entire video or image. As an example, fig. 6 shows a schematic structural diagram of a semantic prediction network. As shown in the figure, the semantic prediction network may include a CNN and a multi-classification structure (the multi-classification shown in the figure). Taking the video shown in fig. 6 as an example, the selected frames of the video are input into the semantic prediction network, the video features of each frame are extracted through the CNN, and the extracted features are subjected to a multi-classification operation by the multi-classification structure, so that the probabilities corresponding to a plurality of predefined semantic features of the video are obtained; finally, one or more of the predefined semantic features are output according to these probabilities. As shown in fig. 6, based on the input frames (each image including a person and a dog), the probabilities of semantic features such as "person", "dog" and "road" can be obtained through the semantic prediction network, and based on the predicted probabilities and a preset semantic feature screening rule, the semantic features whose probability exceeds a set threshold, or a set number of semantic features with the highest probabilities, can be output; in the figure, the probabilities for "person" and "dog" are 0.95 and 0.8, respectively.
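A minimal sketch of such a semantic prediction network, i.e., a CNN backbone followed by a multi-label classification layer whose per-frame outputs are aggregated over the selected frames; the backbone, vocabulary size, aggregation and threshold are assumptions.

```python
import torch
import torch.nn as nn
import torchvision.models as models

class SemanticPredictor(nn.Module):
    """CNN + multi-label classifier over a predefined semantic vocabulary (sketch)."""
    def __init__(self, num_concepts: int = 1000):
        super().__init__()
        backbone = models.resnet50(weights=None)
        backbone.fc = nn.Linear(2048, num_concepts)   # one logit per semantic concept
        self.net = backbone

    def forward(self, frames: torch.Tensor) -> torch.Tensor:
        # frames: (T, 3, H, W) selected frames of one video
        logits = self.net(frames)                     # (T, num_concepts)
        return torch.sigmoid(logits).mean(dim=0)      # per-concept probability for the video

# usage: keep concepts whose probability exceeds a threshold (or the top-k)
# probs = SemanticPredictor()(frames); kept = (probs > 0.5).nonzero().flatten()
```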
In an optional embodiment of the present application, the feature information of the multimedia data includes local visual features of objects included in each object region in each image of the multimedia data, and the generating of the text description of the multimedia data based on the extracted feature information includes:
for each image, obtaining the relationship features among the targets according to the local visual features of the targets in the image, and constructing a scene graph of the image based on the local visual features and the relationship features among the targets;
for each image, obtaining the graph convolution features of the image according to the scene graph of the image; and
obtaining the textual description of the multimedia data based on the graph convolution features of each image of the multimedia data.
The Scene Graph is a graph structure that represents the local visual features, attribute features (described in detail later), relationship features and so on of each target region in an image in graph form. The scene graph may include a plurality of nodes, each of which represents a target feature (i.e., the above-mentioned local visual feature) of the target (i.e., object) contained in a target region, or an attribute feature of that target, and a plurality of connecting edges, each of which represents a relationship feature between nodes.
As an example, fig. 7a shows the scene graph corresponding to one frame of an image. As shown in fig. 7a, some nodes in the scene graph represent the local visual features of the targets in the image, such as the nodes "person", "dog" and "skateboard" in fig. 7a, which represent the feature vectors of "person", "dog" and "skateboard"; some nodes represent attribute features of a target, such as the node "wears blue clothes", which is an attribute feature of the target "person", while "black" represents an attribute feature of "dog". The connecting edges between nodes in the scene graph represent the relationship between the two connected nodes, i.e., between two targets; for example, the relationship between the node "person" and the node "dog" is "pulling", and the relationship between the node "person" and the node "skateboard" is "sliding". As can be seen from fig. 7a, the scene graph reflects the spatial relationships between the targets in an image, and may therefore also be referred to as a spatial scene graph.
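One possible, purely illustrative in-memory representation of such a spatial scene graph is a small node/edge structure like the following; the field names are assumptions.

```python
from dataclasses import dataclass, field
import torch

@dataclass
class Node:
    label: str                     # e.g. "person", "dog", "skateboard"
    feature: torch.Tensor          # local visual feature of the target region
    attributes: list = field(default_factory=list)   # e.g. ["wears blue clothes"]

@dataclass
class Edge:
    subj: int                      # index of the subject node (e.g. "person")
    obj: int                       # index of the object node (e.g. "dog")
    relation: str                  # e.g. "pulling", "sliding"

@dataclass
class SceneGraph:
    nodes: list                    # one Node per detected target region
    edges: list                    # directed Edges, subject -> object
```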
In practical applications, for each image, a feature extraction network may be specifically used to obtain several target regions and region features (i.e., the local visual features) of the respective target regions, and a relationship prediction network may be used to obtain relationship features between the respective target regions. Specifically, after obtaining the local features of each target region of the image through the feature extraction network, the relationship features between the target regions can be obtained from the extracted target region features by using a pre-trained relationship prediction network, and the region features of each target region and the relationship features between the region features can be represented in a graph mode to obtain a scene graph of each frame.
The relation prediction network is a classification network used to predict the relationships between target regions, and its specific network structure can be selected according to actual requirements. As an optional scheme, the relation prediction network may include a fully-connected layer, a feature splicing layer and a softmax layer. The fully-connected layer can be used to extract features from the local visual features of the target regions and to reduce their dimensionality so as to suit the subsequent processing of the feature splicing layer and the softmax layer. The relation prediction network may be trained on a sample data set; for example, the Visual Genome data set is a commonly used relationship and attribute learning data set with a large number of object attribute and relationship labels, and can therefore be used to train the relation prediction network.
Specifically, when obtaining the relationship features between target regions through the relation prediction network, for the region features of at least two target regions, the fully-connected layer first performs feature extraction; the features corresponding to each target region extracted by the fully-connected layer are input to the feature splicing layer for concatenation and then fed into the softmax layer; and the relationship feature between the at least two target regions is obtained according to the probabilities that the softmax layer outputs for each candidate relationship.
As an example, fig. 8 is a schematic diagram illustrating relationship feature prediction using the relation prediction network. As shown in fig. 8, the region features of the targets "person" and "dog" in an image frame are respectively input to the fully-connected layers of the relation prediction network, the features output by the fully-connected layers are spliced and then input to the softmax layer, and the probabilities of a plurality of predefined relationship features between the person and the dog are obtained; finally, one or more of the predefined relationship features are output according to these probabilities. For example, if the probability of the relationship feature "on top" is 0.12, the probability of "on the left" is 0.20, and the probability of "on the back" is 0.40, the relationship feature with the highest probability is output. Of course, in practical applications, for at least two target regions, the output may be a single relationship feature, e.g., the one with the maximum probability as in the above example, or multiple relationship features, such as those whose probability exceeds a set threshold, or a set number (e.g., two) of relationship features with the highest probabilities.
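A hedged sketch of such a relation prediction network (fully-connected layers, feature concatenation and a softmax over predefined relationships); the dimensions and the number of relationship classes are assumptions.

```python
import torch
import torch.nn as nn

class RelationPredictor(nn.Module):
    """Fully-connected layers -> feature concatenation -> softmax over predefined relations."""
    def __init__(self, feat_dim: int = 2048, hidden_dim: int = 512, num_relations: int = 50):
        super().__init__()
        self.fc_subj = nn.Linear(feat_dim, hidden_dim)   # reduce subject-region feature
        self.fc_obj = nn.Linear(feat_dim, hidden_dim)    # reduce object-region feature
        self.classifier = nn.Linear(2 * hidden_dim, num_relations)

    def forward(self, subj_feat: torch.Tensor, obj_feat: torch.Tensor) -> torch.Tensor:
        # subj_feat, obj_feat: (B, feat_dim) local visual features of two target regions
        fused = torch.cat([self.fc_subj(subj_feat), self.fc_obj(obj_feat)], dim=-1)
        return torch.softmax(self.classifier(fused), dim=-1)   # (B, num_relations)

# usage: probs = RelationPredictor()(person_feat, dog_feat); relation = probs.argmax(dim=-1)
```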
When acquiring the textual description of a video or image, the method provided by the embodiments of the present application is based on the local visual features (also referred to as regional features or local features) of each target region (also referred to as a target candidate region) of each image. Because these local visual features reflect the information of each local region in each image more accurately and more finely, the method makes full use of the intra-frame information of each image and can therefore generate a more accurate textual description of the video or image.
Further, the method of the embodiments of the application can determine the relationship features between target regions based on the local features, construct the spatial scene graph of each image (for an image input, there is a single image) according to these relationship features, obtain the graph convolution features of each image based on the spatial scene graph, and then obtain the textual description of the video or image based on the graph convolution features of each image. Because the spatial scene graph reflects the objects in the image and the relationships among them, and these relationships are very helpful for understanding and describing the image content, graph convolution features based on the spatial scene graph can further improve the accuracy of the textual description of the video or image.
For convenience of description, in the following description, for a target region, a node corresponding to a local visual feature of the target region may be simply referred to as a target node, a local visual feature represented by the target node may also be referred to as a target feature, and a node of an attribute feature of the target region may be simply referred to as an attribute node.
In an optional embodiment of the present application, the feature information of the multimedia data may include attribute features of objects included in respective object regions of each image in the multimedia data;
the constructing the scene graph of the image based on the local visual features of the targets and the relationship features between the targets may include:
and constructing a scene graph of the image based on the local visual features of the targets, the relation features between the targets and the attribute features of the targets, wherein each node in the scene graph represents the local visual features or the attribute features of the targets corresponding to the target area.
In this alternative, the scene graph is constructed in graph form from the local visual features, attribute features and relationship features corresponding to the respective target regions; that is, the scene graph may include both target nodes and attribute nodes. In this case, the nodes in the scene graph represent the local visual features or attribute features of each target region, i.e., the target features of the targets contained in each target region and the attributes of those targets, and the corresponding connecting edges may include edges representing the relationships between targets (e.g., the edge between "person" and "dog" in the spatial scene graph shown in fig. 7a) and edges representing the relationships between targets and attributes (e.g., the edge between "person" and "blue clothes" in fig. 7a).
In practical applications, in order to reduce connection redundancy, the attribute node and the target node of the same target region may be merged; that is, for a target region, the local visual feature and the attribute feature may be represented by the same node. For example, for the scene graph shown in fig. 7a, by merging the target node and the attribute node of the same target, the scene graph shown in fig. 7b can be obtained: "person" and "wears blue clothes" in fig. 7a respectively represent the target feature and an attribute feature of the person in the image, and can therefore be merged into the single node "person, wears blue clothes" shown in fig. 7b, which reflects both the category and the attribute of the person in the corresponding target region; similarly, "dog" and "black" in fig. 7a respectively represent the target feature and an attribute feature of the dog, and can also be merged into the node "dog, black" shown in fig. 7b.
Wherein the attribute feature may be derived based on a local visual feature of the target region. In the optional scheme, when the scene graph is constructed, the attribute characteristics of each target area in the image are also considered, and the attributes of each target in the image are also very helpful for describing the image content, so that the video can be more accurately described based on the scene graph with the attribute characteristics blended.
Optionally, the attribute features corresponding to the target regions may be obtained from the extracted local visual features using an attribute prediction network. The attribute prediction network is a multi-classification network, and can be trained based on a sample data set (such as a visual genome data set), and in practical application, a specific network structure of the attribute prediction network can be selected according to actual requirements.
In an alternative embodiment of the present application, the attribute prediction network may comprise a plurality of attribute classifiers, wherein each classifier corresponds to a type of attribute prediction.
The specific division manner of the type of the attribute may be configured according to actual requirements, and as an optional manner, the type of the attribute may specifically refer to a part-of-speech corresponding to the attribute, for example, the part-of-speech corresponding to the attribute may include a noun, a verb, an adjective, and some other relatively rare attribute types. The prediction of attribute features is performed by employing a classifier that includes classes corresponding to a plurality of attribute types.
Compared with the traditional way of constructing a spatial scene graph, in which the attribute types of objects (i.e., targets) are not distinguished and all attributes are classified by a single classifier, so that the accuracy of the obtained attributes is low, in the present scheme different classifiers can be used for different types of attributes. The obtained attributes are therefore more accurate and more diverse, so more accurate description information can be generated based on the predicted attribute features. In addition, in order to reduce redundancy, the attribute nodes and target nodes can be merged to improve data processing efficiency.
As an alternative, the attribute prediction network may include a fully-connected layer and a multi-classification layer. The fully-connected layer can be used to extract the attribute-related features of the target region and to reduce their dimensionality so as to suit the subsequent multi-classification layer; optionally, the multi-classification layer may be implemented by multiple sigmoid units. When the attribute prediction network is used to obtain the attribute features of a target region, the input of the network is the local visual feature of that target region, and the output is one or more of the predefined attribute features.
As an example, fig. 9 shows a schematic diagram of predicting the attribute features of a target region using the attribute prediction network. As shown in fig. 9, the local visual features (the local features shown in the figure) of the target "person" in one image frame are input to the fully-connected layer of the attribute prediction network, and the features output by the fully-connected layer are input to the multi-classification layer for a multi-classification operation, so that the probabilities of a plurality of predefined attribute features for the person are obtained; finally, some of the predefined attribute features are output according to these probabilities, for example the attribute feature "blue" with probability 0.92 and the attribute feature "high" with probability 0.54. The output attribute features may be one or more; specifically, in practical applications, for one target region, the attribute feature with the highest probability may be output as the attribute feature of that region, or the attribute features whose probabilities exceed a set threshold, or a predetermined number of attribute features with the highest probabilities, may be output.
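A minimal sketch of such an attribute prediction network (a fully-connected layer followed by independent sigmoid outputs over predefined attributes); the dimensions and number of attribute classes are assumptions, and per-type classifiers could be modelled as separate heads of the same form.

```python
import torch
import torch.nn as nn

class AttributePredictor(nn.Module):
    """Fully-connected layer + independent sigmoid outputs over predefined attributes."""
    def __init__(self, feat_dim: int = 2048, hidden_dim: int = 512, num_attributes: int = 400):
        super().__init__()
        self.fc = nn.Linear(feat_dim, hidden_dim)
        self.heads = nn.Linear(hidden_dim, num_attributes)   # one sigmoid output per attribute

    def forward(self, region_feat: torch.Tensor) -> torch.Tensor:
        # region_feat: (B, feat_dim) local visual feature of one target region
        hidden = torch.relu(self.fc(region_feat))
        return torch.sigmoid(self.heads(hidden))             # (B, num_attributes) probabilities

# usage: probs = AttributePredictor()(person_feat)
#        attrs = (probs > 0.5).nonzero()   # or take the top-k attributes
```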
It should be noted that, as an alternative of the present application, the above step of obtaining the attribute features corresponding to each target region may be used or omitted; that is, when constructing the scene graph, the local visual features, relationship features and attribute features may all be used, or the scene graph may be constructed based on the local visual features and relationship features only, without the attribute features. When the step of obtaining the attribute features is omitted, the nodes in the constructed scene graph represent the target features corresponding to the target regions, i.e., the local visual features. When the scene graph is constructed based on the relationship features and attribute features as well, each node in the scene graph represents the target feature and/or attributes corresponding to a target region, and each connecting edge represents a relationship between the targets corresponding to the nodes.
Still taking fig. 7a as an example, after the target features, attribute features and relationship features of the person and the dog are extracted from the image, a scene graph for the image can be constructed in graph form. In the scene graph shown in fig. 7a, each square block represents a target feature or an attribute feature of a target region. A connecting line between squares representing target features indicates the relationship between the targets, i.e., the corresponding relationship feature; the oval box on the connecting line indicates the relationship feature between the target regions, and the arrow of the connecting line (i.e., the direction of the connecting edge) indicates which target is the subject and which is the object. In the relationship between "person" and "dog" in the scene graph of fig. 7a, "person" is the subject and "dog" is the object, so the connecting line points from "person" to "dog". A connecting line between a node representing a target feature and a node representing an attribute feature indicates ownership; as shown between "person" and "blue clothes" in fig. 7a, wearing blue clothes is an attribute of the person, and the direction of the arrow indicates that this attribute belongs to the person. The scene graph shown in fig. 7a thus clearly shows the targets contained in the image as well as their relative positions, attributes and behavioral relationships.
In an optional embodiment of the present application, if the multimedia data is a video, the images of the multimedia data are a plurality of frames selected from the video, and if the targets contained in the target regions of two adjacent frames are the same, temporal edges exist between the nodes (target nodes) corresponding to the same target in the scene graphs of the two adjacent frames; that is, the plurality of connecting edges further includes temporal edges.
In order to make better use of temporal information, in this alternative, for a video, the temporal information between adjacent frames among the selected frames can be considered and added to the scene graphs of the frames. Specifically, if the targets corresponding to target regions in two adjacent frames are the same, a temporal edge is added between the target nodes of the target regions containing the same target in the scene graphs of the two adjacent frames. The scene graph with temporal edges added can reflect both the spatial and the temporal relationships between targets, and may therefore be referred to as a spatio-temporal scene graph (spatial-temporal scene graph).
As an example, fig. 10 shows a schematic diagram of a spatio-temporal scene graph, in which temporal edges have been added to the scene graph of each frame. In the scene graphs of two adjacent frames, if target regions in the two frames contain targets of the same class, a temporal edge is added between the two target regions. For example, in the scene graphs corresponding to the first frame and the second frame shown in fig. 10, the target classes of the person, the oven and the pizza are the same in the two frames, so temporal edges can be added between the corresponding target regions in the scene graphs of the two frames, e.g., between "person" and "person", between "oven" and "oven", and between "pizza" and "pizza".
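A hedged sketch of adding such temporal edges, building on the illustrative SceneGraph structure above: nodes carrying the same target class in the scene graphs of adjacent frames are linked (matching by class label here is an assumption; any identity-matching criterion could be used).

```python
def add_temporal_edges(frame_graphs: list) -> list:
    """frame_graphs: per-frame SceneGraph objects whose nodes carry a class label.
    Returns (frame_index, node_index, next_frame_node_index) triples linking targets
    of the same class in consecutive frames."""
    temporal_edges = []
    for t in range(len(frame_graphs) - 1):
        current, nxt = frame_graphs[t], frame_graphs[t + 1]
        for i, node in enumerate(current.nodes):
            for j, other in enumerate(nxt.nodes):
                if node.label == other.label:      # same target class in adjacent frames
                    temporal_edges.append((t, i, j))
                    break                          # link each node at most once
    return temporal_edges
```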
Compared with the spatial scene graph, the spatio-temporal scene graph adds relationships between objects (i.e., targets) in the temporal dimension, so it can better describe the spatio-temporal information of the video. In addition, the spatio-temporal scene graph may further include motion information (described in detail later) of the targets corresponding to the temporal edges, so as to improve the accuracy of motion description.
In this alternative of the embodiments of the application, inter-frame temporal information is also considered: a spatio-temporal scene graph is obtained by adding temporal edges when building the scene graph. By incorporating temporal information, the correlation between frames is fully considered, and because temporal edges are established between the same targets in adjacent frames, the continuity information of targets across different images can be better learned when graph convolution features are extracted based on the scene graph, so a better video description is obtained while making full use of both the intra-frame and the inter-frame information of each frame.
In an optional embodiment of the present application, obtaining a map convolution feature of the image according to the scene graph of the image includes:
and coding the nodes and the connecting edges in the scene graph to obtain the feature vectors with the same target dimensionality, and obtaining the graph convolution characteristics by using a graph convolution network according to the obtained feature vectors.
It should be noted that, in practical applications, if the dimensions of the acquired local visual features, attribute features and relationship features are already the same, this encoding step may be performed or may be omitted.
Specifically, when a node in the constructed scene graph represents a target feature or an attribute feature of a target region, the obtained relationship features and attribute features of the target regions may be encoded into feature vectors with the same target dimension, and a graph convolution network is then applied to the encoded feature vectors to learn the relationships between adjacent nodes and connecting edges in the scene graph, so as to obtain the graph-convolved feature (i.e. graph convolution feature) of each node in the scene graph. Features learned by the graph convolution network are based on the graph structure (the scene graph), so the graph-convolved features may include target features, attribute information and relationship information.
When the nodes in the constructed scene graph only represent the target features, the features after graph convolution may include the target features and the relationship information, and at this time, the attribute features of each target region may be obtained from the extracted target region features without using an attribute prediction network.
As an example, encoding all or part of the nodes (e.g. target nodes) and edges in the scene graph into feature vectors with the same target dimension (a fixed dimension, related to the dimension of the input vectors of the subsequent decoder) may be implemented using a fully connected matrix. For example, when the dimension of the relationship feature in the scene graph is 512 and the target dimension is 1024, a 512 × 1024 matrix may be applied to the relationship feature so that it takes on the target dimension of 1024.
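A minimal sketch of this encoding step, assuming PyTorch-style modules (the variable names are hypothetical): a fully connected (linear) layer plays the role of the 512 × 1024 matrix that maps a relationship feature to the target dimension.

```python
import torch
import torch.nn as nn

# A 512 x 1024 fully connected matrix that projects a 512-dimensional relationship
# feature of the scene graph onto the 1024-dimensional target dimension.
target_dim = 1024
project_relation = nn.Linear(512, target_dim, bias=False)

relation_feature = torch.randn(512)            # one relationship feature from the scene graph
encoded = project_relation(relation_feature)   # shape: (1024,), same as the target dimension
print(encoded.shape)                           # torch.Size([1024])
```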
After obtaining the feature vector with the same dimension as the target, for each node in the obtained feature vector, the graph convolution formula can be used to obtain the graph convolution feature of each node, and one of the simplest graph convolution formulas without weight is shown in the following equation (1):
\hat{v}_i = \sigma\left( \sum_{v_j \in N(v_i)} W v_j \right)    (1)

where v_i is the feature vector of node i in the scene graph, i.e. a target feature or attribute feature vector; N(v_i) is the set of nodes adjacent to node i in the scene graph (i.e. in the same frame image), and the adjacent nodes of node i generally do not include node i itself; v_j is the feature vector of a node j adjacent to node i in the same frame image (i.e. the same scene graph); W is the network weight parameter of the graph convolution network to be learned; \hat{v}_i is the graph-convolved feature (graph convolution feature) of node i; and σ is the nonlinear activation function.
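The following is a minimal sketch of the unweighted graph convolution in equation (1), assuming PyTorch and a dense adjacency matrix (both assumptions for illustration only): each node feature is updated by summing the linearly transformed features of its neighbours and applying a nonlinear activation.

```python
import torch
import torch.nn as nn

class SimpleGraphConv(nn.Module):
    """Sketch of equation (1): v_i_hat = sigma( sum over neighbours j of W v_j )."""
    def __init__(self, dim):
        super().__init__()
        self.W = nn.Linear(dim, dim, bias=False)   # shared weight of the graph convolution
        self.act = nn.ReLU()                       # sigma: nonlinear activation

    def forward(self, node_feats, adjacency):
        # node_feats: (N, dim) feature vectors of the scene-graph nodes
        # adjacency:  (N, N) 0/1 matrix, adjacency[i, j] = 1 if node j is a neighbour of node i
        aggregated = adjacency @ self.W(node_feats)   # sum of transformed neighbour features
        return self.act(aggregated)

# toy usage: 3 nodes with 1024-dimensional features
feats = torch.randn(3, 1024)
adj = torch.tensor([[0., 1., 1.],
                    [1., 0., 0.],
                    [1., 0., 0.]])
conv = SimpleGraphConv(1024)
out = conv(feats, adj)   # (3, 1024) graph-convolved node features
```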
In practical applications, the relationships between different nodes in the scene graph are of different importance to the image, i.e. the relationships between different targets contribute differently to the description information of the multimedia data to be generated, so the edges in the scene graph may be weighted. As another alternative, the graph convolution feature of each node may therefore be obtained using the following equation (2):

\hat{v}_i = \sigma\left( \sum_{v_j \in N(v_i)} \left( W_{dir(v_i, v_j)} v_j + b_{label(v_i, v_j)} \right) \right)    (2)

where v_i and N(v_i) have the same meanings as in equation (1), except that N(v_i) may include v_i itself; W and b are the weight and bias parameters of the graph convolution network to be learned; dir(v_i, v_j) indicates the direction of the connecting edge and takes one of two values, from v_i to v_j (i.e. the edge points from node i to node j) or from v_j to v_i, so that W_{dir(v_i, v_j)} has two corresponding values, one for each direction; label(v_i, v_j) denotes the relationship between v_i and v_j, and different relationships may have different bias values; σ is the nonlinear activation function; and \hat{v}_i is the graph convolution feature of node i.
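A small sketch of the direction- and relation-dependent update in equation (2), assuming PyTorch; the module and parameter names ("in"/"out", relation ids) are hypothetical. The weight matrix is chosen by the edge direction and the bias by the relation label, matching the description above.

```python
import torch
import torch.nn as nn

class DirectedGraphConv(nn.Module):
    """Sketch of equation (2): weight depends on edge direction, bias on the relation label."""
    def __init__(self, dim, num_relations):
        super().__init__()
        self.W = nn.ModuleDict({
            "out": nn.Linear(dim, dim, bias=False),  # edge points from node i to node j
            "in":  nn.Linear(dim, dim, bias=False),  # edge points from node j to node i
        })
        self.bias = nn.Embedding(num_relations, dim)  # b_{label(v_i, v_j)}
        self.act = nn.ReLU()

    def forward(self, node_feats, edges):
        # edges: list of (i, j, direction, relation_id); neighbours may include node i itself
        out = torch.zeros_like(node_feats)
        for i, j, direction, rel in edges:
            out[i] += self.W[direction](node_feats[j]) + self.bias(torch.tensor(rel))
        return self.act(out)
```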
As another alternative, equation (2) can be extended to the form shown in equation (3) below, in which one weight is used for the node itself (i.e. different nodes also have different importance), two weights are used for the neighboring nodes depending on the subject/object role, and two weights are used for the relationship feature depending on the direction of the connecting edge:

\hat{v}_i = \sigma\left( W_s v_i + \sum_{v_j \in N(v_i)} \alpha_{ij} \left( W_{(sub,obj)} v_j + W_{(in,out)} e_{r_{ij}} \right) \right)    (3)

where σ, v_i, v_j and N(v_i) have the same meanings as the corresponding parameters above and are not repeated here; W_s, W_{(sub,obj)}, W_{(in,out)} and W_a are all parameter weight matrices to be learned. Specifically, W_s is the parameter weight matrix of v_i. W_{(sub,obj)} is the parameter weight matrix of the neighboring node feature v_j, where (sub, obj) indicates the dependency of the relationship and takes one of two values depending on whether v_i is the subject or the object; in the relationship between "person" and "dog" in the scene graph shown in fig. 7a, "person" is the subject and "dog" is the object, so if v_j is the subject and v_i is the object, W_{obj} is used (for example, for the "dog" node in fig. 7a, whose neighbor "person" is the subject), otherwise W_{sub} is used. W_{(in,out)} is the parameter weight matrix of the relationship, where (in, out) indicates the direction of the connecting edge and takes one of two values depending on whether the edge is output from node i or input to node i; for the edge between "person" and "dog" in fig. 7a, the edge is output from the "person" node and input to the "dog" node, so if v_i is the subject and v_j is the object, W_{(in)} is used, otherwise W_{(out)} is used. e_{r_{ij}} is the relationship feature vector between node i and node j, which may take different values or the same value depending on the direction of the connecting edge, i.e. on the subject/object roles of the two nodes; in the above example, the relationship between "person" and "dog" is "leads" from the "person" side and "is led" from the "dog" side. α_{ij} is the attention weight produced by the attention layer (specifically, an attention parameter matrix), and W_a is the weight of the attention layer.
In another alternative, for a video, if the constructed scene graph is a space-time scene graph, the processing of each frame may further consider inter-frame information in addition to intra-frame information, i.e. the graph convolution feature of a frame may be obtained based on the scene graph of that frame together with the scene graphs of its adjacent frames. Specifically, all or part of the nodes and connecting edges in the space-time scene graph may be encoded into feature vectors with the same target dimension, and the obtained feature vectors are then processed by a graph convolution network to obtain the graph convolution features.
Optionally, the graph convolution feature may be obtained by the following expression (4):
\hat{v}_i = \sigma\left( W_s v_i + \sum_{v_j \in N(v_i)} \alpha_{ij} \left( W_{(sub,obj)} v_j + W_{(in,out)} e_{r_{ij}} \right) + \sum_{v_j \in N_b(v_i)} W_{(pre,aft)} v_j \right)    (4)

where the parameters that also appear in expression (3) have the same meanings. N_b(v_i) is the set of nodes in the frames adjacent to the current frame that have the same category as node i, i.e. the set of the same targets in adjacent frames; in the example shown in fig. 10, for the "person" node of the 2nd frame (taken as v_i), N_b(v_i) is the set of "person" nodes in the 1st and/or 3rd frames shown in the figure. W_{(pre,aft)} is the parameter weight matrix of the same target in adjacent frames, determined by the order in the video between the current frame to which node i belongs and the frame to which node j in N_b(v_i) belongs; it takes one of two values depending on whether the current frame is the previous or the next frame relative to the adjacent frame, i.e. whether the frame containing node i is earlier or later in time: if v_j belongs to the previous frame, W_{(pre)} is used, otherwise W_{(aft)} is used.
In an optional embodiment of the present application, when constructing the temporal-spatial scene graph, the method may further include:
determining the action characteristics of the target corresponding to the time connection edge;
at this time, for each frame image, a temporal-spatial scene graph may be constructed based on the local visual features, the attribute features (optional), the relationship features, and the motion features of the objects corresponding to the temporal edges, which correspond to the target regions of the frame image.
That is, the action features of the target nodes connected by time edges may be added to the scene graph. Optionally, a target common to adjacent frames may be identified using an object tracking method; for this common target, its action class (also referred to as an action relation) in the image may be identified using a pre-trained action classifier (action detector), and the feature vector of the action class is used as the action feature of the target.
As shown in the space-time scene graph of fig. 11, the common targets contained in each frame include "person", "pizza" and "oven". The action corresponding to "person" is "open", i.e. the value on the time edge of this common target between adjacent frames in the space-time scene graph is "open"; the action corresponding to "pizza" is "held"; and the action corresponding to "oven" is "opened". Compared with the scene graph in fig. 10, this scene graph additionally contains the action information of the targets common to adjacent frames, so that when description information is generated based on the scene graph, more image detail information can be used and the accuracy of the generated description information can be further improved.
Corresponding to the foregoing scheme, the graph convolution feature of each node in the scene graph can be calculated using the following formula:

\hat{v}_i = \sigma\left( W_s v_i + \sum_{v_j \in N(v_i)} \alpha_{ij} \left( W_{(sub,obj)} v_j + W_{(in,out)} e_{r_{ij}} \right) + \sum_{v_j \in N_b(v_i)} \left( W_{(pre,aft)} v_j + W_T t_{ij} \right) \right)    (5)

where the parameters that also appear in equation (4) have the same meanings. W_T is the parameter weight matrix of the action relation (i.e. action class) of the same target in adjacent frames, and t_{ij} denotes the action class (specifically, the feature vector of the action class), i.e. the action feature of the same target in adjacent frames. In the example shown in fig. 11, the action relation of the target "person" common to adjacent frames is "open"; for the scene graph of the first frame, the action class corresponding to its "oven" node (node i in this case) and the "oven" node of the second frame's scene graph (node j in the adjacent frame) is the feature vector of the action "being opened". W_T is also a weight matrix to be learned and may take different weight values for different action classes. α_{ij} is the attention parameter matrix, which can assign different weights to different targets; in the example shown in fig. 7b, when the feature of the "dog" node is updated, the "person" node is more closely related to the "dog" node than the "skateboard" node, so a higher weight is assigned to the "person" node.
The above equation (5) can be understood more intuitively as follows: for a given target (i.e. target node) with feature v_i, such as the "person" node shown in fig. 11, after passing through the graph convolution network the updated feature, i.e. the graph convolution feature, contains the feature v_i of the target itself, the features of the targets connected to it, such as the oven and the pizza, and also the features of the corresponding target nodes in adjacent frames.
Compared with existing graph convolution feature extraction schemes, the scheme for obtaining graph convolution features provided in the embodiment of the present application differs in two respects. One difference is that attention (i.e. the attention weight α_{ij}) is introduced when updating a node feature, so that the neighbors of a node may be given different weights when its feature is updated. In the above example, when the feature of the "dog" node is updated, the "person" node has a closer relationship with the "dog" node than the "skateboard" node, so the "person" node is given a higher weight. Another difference is that for two adjacent targets, different weight parameter matrices can be used to update the features, depending on which is the subject and which is the object, and on their temporal order. For example, in the relationship between the "person" node and the "skateboard" node, "person" is the subject and "skateboard" is the object; when updating the "person" node the weight parameter matrix W_{sub} is used, and when updating the "skateboard" node the weight parameter matrix W_{obj} is used.
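The following is a rough sketch of the update rule in equations (3)-(5), assuming PyTorch. The data structures (dictionaries of spatial and temporal neighbours) and names such as `W_dep`, `W_dir`, `W_time` are assumptions made for illustration; the sketch combines the self term, the attended spatial neighbours (subject/object and in/out weights applied to neighbour and relation features), and the temporal neighbours (previous/next-frame weights plus an action-class feature).

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class SpatioTemporalGraphConv(nn.Module):
    """Sketch of the node update in equations (3)-(5): self term + attended spatial
    neighbours + temporal neighbours with an action-class feature."""
    def __init__(self, dim):
        super().__init__()
        self.W_s = nn.Linear(dim, dim, bias=False)                       # node itself
        self.W_dep = nn.ModuleDict({"sub": nn.Linear(dim, dim, bias=False),
                                    "obj": nn.Linear(dim, dim, bias=False)})
        self.W_dir = nn.ModuleDict({"in": nn.Linear(dim, dim, bias=False),
                                    "out": nn.Linear(dim, dim, bias=False)})
        self.W_time = nn.ModuleDict({"pre": nn.Linear(dim, dim, bias=False),
                                     "aft": nn.Linear(dim, dim, bias=False)})
        self.W_T = nn.Linear(dim, dim, bias=False)                       # action-class feature
        self.W_a = nn.Linear(2 * dim, 1, bias=False)                     # attention layer
        self.act = nn.ReLU()

    def forward(self, feats, spatial_nbrs, temporal_nbrs):
        # spatial_nbrs[i]:  list of (j, dep, direction, e_r) for neighbours in the same frame
        # temporal_nbrs[i]: list of (j, order, action_feat) for the same target in adjacent frames
        out = torch.zeros_like(feats)
        for i in range(feats.size(0)):
            msg = self.W_s(feats[i])
            if spatial_nbrs.get(i):
                scores = torch.stack([self.W_a(torch.cat([feats[i], feats[j]])).squeeze()
                                      for j, _, _, _ in spatial_nbrs[i]])
                alpha = F.softmax(scores, dim=0)              # attention over spatial neighbours
                for a, (j, dep, dire, e_r) in zip(alpha, spatial_nbrs[i]):
                    msg = msg + a * (self.W_dep[dep](feats[j]) + self.W_dir[dire](e_r))
            for j, order, action_feat in temporal_nbrs.get(i, []):
                msg = msg + self.W_time[order](feats[j]) + self.W_T(action_feat)
            out[i] = self.act(msg)
        return out
```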
In practical applications, the attribute feature and the target feature of the same target in the scene graph may be merged, i.e. a node in the scene graph may represent both the target feature and the attribute feature. Through the various alternatives of the present application, the graph convolution features of the nodes in the scene graph of each image can be obtained through the graph convolution network, and the graph convolution feature of an image consists of the graph convolution features of the nodes contained in its scene graph. Alternatively, the attribute feature and the target feature of the same target may be kept separate, i.e. the scene graph may contain both target nodes and attribute nodes, in which case the graph convolution feature of each target node and each attribute node is obtained. When obtaining the graph convolution features of the nodes in the above optional manners, if some parameter in the above expressions does not exist for a certain node, a pre-configured value may be used, such as a zero vector or another pre-configured feature vector.
In an optional embodiment of the present application, if the feature information of the multimedia data includes at least two of a local visual feature, a semantic feature, a spatiotemporal visual feature, and a global feature, generating a text description of the multimedia data based on the extracted feature information includes:
determining the weight of each kind of characteristic information;
performing weighting processing on each characteristic information based on the weight of each characteristic information;
and generating the character description of the multimedia data based on the weighted characteristic information.
In practical applications, the importance of each type of feature information is likely to differ for different multimedia data (e.g. different videos or images). Different types of features can therefore be given different weights so that they play different roles, which makes the scheme of the embodiment of the present application adaptive when generating description information for different videos, i.e. the multiple types of feature information can each play a different role for different videos.
Optionally, the weight of each type of feature information may be determined using a feature selection network. The feature selection network is trained so that, for different multimedia data, it can select the feature information used to generate the description information of that multimedia data; that is, for a given piece of multimedia data, the feature selection network can determine a respective weight for each type of feature information.
As an example, fig. 12 shows a schematic diagram of a feature selection network. As shown in fig. 12, the feature information consists of the graph convolution feature V_{GCN}, the spatio-temporal visual feature V_{3d} and the semantic feature V_{SF}. In this example, the feature selection network may be expressed by the following formula:

a_t = \mathrm{softmax}\left( W_{att} \tanh\left( W_{3d} V_{3d} + W_{GCN} V_{GCN} + W_{SF} V_{SF} + W_e E_{1:t-1} \right) \right)

where a_t is the set of weight values output by the feature selection network at time t, i.e. the weight of each type of feature information at time t; E_{1:t-1} is the embedding of the words from time 1 to time t-1, i.e. the feature vectors of the first t-1 decoded words when the t-th word of the video description information is being decoded; and W_{3d}, W_{GCN}, W_{SF} and W_e are parameter weight matrices of the network, each of which converts the corresponding feature into the same dimensionality so that the results can be added. After the sum passes through the nonlinear layer tanh, the parameter weight matrix W_{att} turns it into a 3 × 1 vector, which is finally normalized by softmax; each dimension represents the weight of a different feature, and the weights sum to 1. Intuitively, the formula performs an attention operation on each feature to obtain the attention weight of each feature.
In the example shown in fig. 12, the weights of the spatio-temporal visual feature, the graph convolution feature and the semantic feature are 0.3, 0.2 and 0.5, respectively.
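A minimal sketch of such a feature selection gate, assuming PyTorch; the class name, layer names and dimensions are assumptions. The three features and the embedding of the previously decoded words are projected into a common space, passed through tanh, mapped to a 3-dimensional score and normalized with softmax.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class FeatureSelectionGate(nn.Module):
    """Sketch of the feature selection network: outputs one weight per feature type."""
    def __init__(self, d3d, dgcn, dsf, demb, hidden):
        super().__init__()
        self.W3d = nn.Linear(d3d, hidden, bias=False)
        self.Wgcn = nn.Linear(dgcn, hidden, bias=False)
        self.Wsf = nn.Linear(dsf, hidden, bias=False)
        self.We = nn.Linear(demb, hidden, bias=False)
        self.Watt = nn.Linear(hidden, 3, bias=False)   # one score per feature type

    def forward(self, v3d, vgcn, vsf, prev_word_emb):
        h = torch.tanh(self.W3d(v3d) + self.Wgcn(vgcn) + self.Wsf(vsf) + self.We(prev_word_emb))
        return F.softmax(self.Watt(h), dim=-1)         # weights sum to 1, e.g. (0.3, 0.2, 0.5)

# usage at decoding step t: obtain a_t and weight/fuse the three features with it
gate = FeatureSelectionGate(1024, 1024, 1024, 512, 512)
a_t = gate(torch.randn(1024), torch.randn(1024), torch.randn(1024), torch.randn(512))
```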
It can be understood that the time 1, the time t-1 and the time t are all relative time concepts, and when the decoder decodes the output video description information, the decoder decodes the output video description information to obtain the relative decoding time of the 1 st word, the t-1 st word and the t-th word in the video description information. The weight of each feature information at each time other than time 1 may be obtained based on each feature information and each word decoded before the current time.
The scheme provided in the embodiment of the present application uses several different features to express the video information. In addition to the spatio-temporal scene graph feature, i.e. the graph convolution feature, the spatio-temporal visual feature and the semantic feature are also used, i.e. three features may be adopted: the graph convolution feature pays more attention to the relationships and attributes between targets, the spatio-temporal visual feature pays more attention to temporal information, and the semantic feature pays more attention to the overall semantic information contained in the video. A feature selection network (feature selection gate) can select different features for different videos; the output of the feature selection gate is a set of weight values a_t, whose components represent the weights of the different features. For example, some videos are longer and temporal information is more important, so the weight of the spatio-temporal visual feature is higher; other videos are shorter and contain more objects, so the relationships between the targets and the attributes of the targets are more important, and the weight of the graph convolution feature is higher.
After the weights of the feature information are obtained, when subsequent processing is performed based on the feature information, one option is to weight each feature by its respective weight and use the weighted features for subsequent processing; another option is to weight and fuse the features based on the weights and perform subsequent processing on the fused feature. In the example shown in fig. 12, the fused feature (i.e. 0.3 × spatio-temporal visual feature + 0.2 × graph convolution feature + 0.5 × semantic feature, as shown in the figure) may be used for subsequent processing, or the weighted features (0.3 × spatio-temporal visual feature, 0.2 × graph convolution feature and 0.5 × semantic feature) may be processed separately. Adaptively giving different weights to different features allows different types of features to play roles of different importance when generating the text description, according to the characteristics of the multimedia data.
In an alternative embodiment of the present application, generating a text description of multimedia data based on the extracted feature information may include:
performing encoding processing on each obtained feature information by using an encoder based on self attention;
inputting the feature information after the coding processing into a decoder to generate a text description of the multimedia data;
wherein, if the multimedia data is an image, the self-attention-based encoder is a self-attention-based intra-frame encoder; if the multimedia data is video, the self-attention based encoder includes a self-attention based intra-frame encoder and/or a self-attention based inter-frame encoder.
That is, the obtained feature information (weighted features if weighted, weighted fused features if weighted fusion) may be encoded separately using a self-attention-based intra-frame encoder to obtain deeper and higher level feature information, and the encoded features may be input to a decoder to generate corresponding textual descriptions; for video, a self-attention-based inter-frame encoder may also be used to encode the obtained video features and input the encoded features to a decoder to generate a textual description of the video. The decoder may also adopt a self-attention-based decoder, such as an attention-based intra-frame decoder for an image to better learn intra-frame information during decoding, and a self-attention-based intra-frame decoder and/or a self-attention-based inter-frame decoder for a video to learn better intra-frame information and/or inter-frame information during decoding, so as to obtain more accurate video description.
Taking the generation of the text description of a video based on the graph convolution features as an example (it is understood that the features may also include spatio-temporal visual features and/or semantic features in addition to the graph convolution features), a vector related to the graph convolution feature may be obtained for each selected frame, and these feature vectors may be input to a self-attention-based decoder to learn the inter-frame information between frames.
When generating the text description of the video based on the graph convolution features, the decoder outputs, at each time step, the candidate words and their output probabilities according to the decoder input and the graph convolution features. As an example, the self-attention-based decoder may be implemented by a transformer decoder.
In the process of generating the textual description of the video, assuming that at a first time instant, the decoder inputs are global features and a start symbol, the decoder output based on self-attention is a set of probability values, each value representing a word, and the word with the highest probability is selected, i.e., the output at the first time instant. The input at the second moment is the global feature + the start character + the output at the first moment, and the word with the highest probability is still selected as the output. And the input at the third moment is the global feature + the initial character + the output at the first moment and the output at the second moment, and the circulation is carried out until the word with the highest probability at a certain moment is the terminator, and the circulation is ended, so that a sentence sequence is obtained, namely the output of the decoder based on self-attention is the final sentence sequence related to the video description.
For example, the transform decoder may output words a and b and an output probability (e.g., 60%) of a word a and an output probability (e.g., 40%) of a word b that may be output at a first time, words c, d, and e and an output probability (e.g., 60%) of a word c that may be output at a second time, an output probability (e.g., 20%) of a word d and an output probability (e.g., 20%) of a word e, and so on at a later time. In this case, according to an exemplary embodiment of the present application, the video description sentence may be generated by a greedy decoding method, that is, by combining words, which are most likely to be output at each time, in chronological order. However, the present disclosure is not limited thereto, and other decoding methods may be used to generate the video description sentence.
According to an exemplary embodiment of the present application, the video description sentence may be obtained by combining words, which are possible to be output at each time, with the highest output probability in chronological order until the end when the probability of the terminator of the output is the highest. The role of the self-attention-based decoder is to learn information between frames, which has a self-attention mechanism-based structure including a multi-headed attention layer, a layer normalization layer, and a forward network layer. The self-attention-based decoder has the advantages of fast training speed, few parameters and easy learning of long-distance correlation compared with a decoder having an RNN structure.
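The iterative decoding described above can be sketched as a simple greedy loop, assuming Python; the `decoder` callable and its signature are hypothetical, standing in for the self-attention-based decoder that scores the next word given the encoded features, the global feature and the words decoded so far.

```python
import torch

def greedy_decode(decoder, encoder_out, global_feat, bos_id, eos_id, max_len=30):
    """Greedy decoding: keep the most probable word at each step until the end symbol wins."""
    tokens = [bos_id]
    for _ in range(max_len):
        # (vocab,) scores for the next word given the features and the words so far
        logits = decoder(encoder_out, global_feat, torch.tensor(tokens))
        next_id = int(torch.argmax(logits, dim=-1))
        if next_id == eos_id:          # the terminator has the highest probability: stop
            break
        tokens.append(next_id)
    return tokens[1:]                  # drop the start symbol; the rest is the sentence sequence
```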
As an example, fig. 13a shows a schematic structural diagram of a self-attention-based codec model provided in an embodiment of the present application. As shown in fig. 13a, the model is divided into two parts: a self-attention-based encoder and a self-attention-based decoder. Optionally, the self-attention-based codec may be implemented by a transformer codec. The feature information of the video in this example is again illustrated by taking the graph convolution feature as an example. As shown in fig. 13a, the self-attention-based encoder in this example is composed of a multi-head attention layer, a forward network and layer normalization layers, and the self-attention-based decoder may be composed of a masked multi-head attention layer, multi-head attention layers, layer normalization layers and a forward network layer.
The encoder with the structure based on the attention mechanism shown in fig. 13a may be a multi-block structure, the structure of each block may be the same or different, and the multi-block structure may be cascaded in sequence, that is, the output of the current block is the input of the next block. As an alternative, for example, the encoder may include 6 blocks (one block is shown in fig. 13 a) with the same structure, each block may mainly include two parts, namely, a multi-head attention layer and a location-by-location fully-connected forward network, the forward network may be implemented by two linear prediction layers, a ReLU activation operation is included between the two linear prediction layers, the multi-head attention layer and the forward network part in each block may correspond to one layer normalization layer, specifically, as shown in fig. 13a, each block may be sequentially composed of a multi-head attention layer, a layer normalization layer, a forward network layer, and a layer normalization layer, and the respective blocks may be stacked to obtain the encoder. The encoder input is the graph convolution feature.
When the graph convolution characteristic is coded by using a self-attention-based coder, coder Embedding (Embedding) processing can be performed on the graph convolution characteristic, the dimensionality of the characteristic information is changed so as to be suitable for the subsequent coder processing, and the characteristic information output after the Embedding processing is input into the coder for coding processing.
The following description takes the processing of the first block in the encoder as an example. The graph convolution feature is first processed by the multi-head attention layer; the output is fused (e.g. added) with the output of the encoder embedding and then layer-normalized. The normalized result is processed by the forward network, fused (e.g. added) with the output of the previous layer normalization layer, and layer-normalized again to obtain the output of the first block. The output of the first block is used as the input of the second block, and encoding proceeds in sequence to obtain the output of the encoder (i.e. the encoder output in fig. 13a).
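A minimal sketch of one such encoder block, assuming PyTorch and hypothetical dimensions: multi-head self-attention, add-and-layer-norm, a position-wise forward network of two linear layers with ReLU, then another add-and-layer-norm; stacking several identical blocks yields the intra-frame encoder.

```python
import torch
import torch.nn as nn

class IntraFrameEncoderBlock(nn.Module):
    """One self-attention encoder block: attention -> add & norm -> forward net -> add & norm."""
    def __init__(self, dim=512, heads=8, ffn_dim=2048):
        super().__init__()
        self.attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.norm1 = nn.LayerNorm(dim)
        self.ffn = nn.Sequential(nn.Linear(dim, ffn_dim), nn.ReLU(), nn.Linear(ffn_dim, dim))
        self.norm2 = nn.LayerNorm(dim)

    def forward(self, x):                  # x: (batch, num_nodes, dim) embedded graph-conv features
        attn_out, _ = self.attn(x, x, x)   # self-attention over the nodes of one frame
        x = self.norm1(x + attn_out)       # fuse (add) with the block input, then layer-norm
        x = self.norm2(x + self.ffn(x))    # forward network, add, layer-norm
        return x

# stacking, e.g., 6 identical blocks gives the encoder
encoder = nn.Sequential(*[IntraFrameEncoderBlock() for _ in range(6)])
out = encoder(torch.randn(2, 10, 512))     # (batch=2, 10 nodes, 512) -> same shape
```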
The decoder of the self-attention mechanism shown in fig. 13a may also be a multi-block structure; each block may have the same or a different structure, and the blocks may be cascaded in sequence. For example, the decoder may include 6 blocks with the same structure, and each block may mainly include three parts: masked multi-head attention, multi-head attention applied to the features, and a forward network. The multi-head attention and forward network parts in each block may each correspond to a layer normalization layer; specifically, each block may be composed, in order, of a masked multi-head attention layer, a layer normalization layer, a multi-head attention layer, a layer normalization layer, a forward network layer and a layer normalization layer, and the blocks may be stacked to obtain the decoder. The decoder inputs in the figure are global feature information and word vectors. The feature vectors of the target regions extracted from each frame by the feature extraction network may be called local features or regional features, so regional features of a plurality of target regions can be obtained for a frame; after averaging these regional features, the global feature corresponding to the frame can be obtained (the global feature may also be obtained in other ways, such as weighting). In addition, the start symbol and the word vectors already predicted in the iterative prediction process can be obtained (for the first prediction only the start symbol is available; during training of the model, all word vectors can be input). The above decoder input (i.e. the global feature, the start symbol and the word vectors predicted in the iterative prediction process) may undergo decoder embedding processing to change the dimensionality of the feature information so that it is suitable for subsequent processing by the decoder, and the global feature information, start symbol and word vectors output after the embedding processing may be input to the decoder for decoding.
The following description takes the processing of the first block as an example. The global feature information, start symbol and word vectors output after the embedding processing are first processed by the masked multi-head attention layer; the result is fused (e.g. added) with the output of the decoder embedding and then layer-normalized. The normalized output and the encoder output (if the encoder includes an inter-frame encoder, the encoder output is the inter-frame encoder output; if the encoder includes only intra-frame encoders, the encoder output may be obtained by fusing the outputs of the intra-frame encoders, e.g. by concatenating them) are processed together by the multi-head attention layer, after which the result is fused (e.g. added) with the output of the previous layer normalization layer and layer-normalized again. That output is processed by the forward network, fused with the output of the previous layer normalization layer, and processed by a further layer normalization layer; the result is the output of the first block. The output of the first block is used as the input of the second block, and decoding proceeds in sequence to obtain the output of the decoder.
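The corresponding decoder block can be sketched as follows, assuming PyTorch and hypothetical dimensions: masked self-attention over the decoder input, add-and-layer-norm, multi-head attention against the encoder output, add-and-layer-norm, then the forward network and a final layer norm.

```python
import torch
import torch.nn as nn

class DecoderBlock(nn.Module):
    """One decoder block: masked self-attention, cross-attention to the encoder output, forward net."""
    def __init__(self, dim=512, heads=8, ffn_dim=2048):
        super().__init__()
        self.self_attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.cross_attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.norm1, self.norm2, self.norm3 = (nn.LayerNorm(dim) for _ in range(3))
        self.ffn = nn.Sequential(nn.Linear(dim, ffn_dim), nn.ReLU(), nn.Linear(ffn_dim, dim))

    def forward(self, tgt, enc_out):
        t = tgt.size(1)
        causal = torch.triu(torch.ones(t, t, dtype=torch.bool), diagonal=1)  # mask future positions
        sa, _ = self.self_attn(tgt, tgt, tgt, attn_mask=causal)
        x = self.norm1(tgt + sa)
        ca, _ = self.cross_attn(x, enc_out, enc_out)   # attend to the encoder output
        x = self.norm2(x + ca)
        return self.norm3(x + self.ffn(x))

# the decoder output then goes through a linear layer and softmax to obtain word probabilities
```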
The output result of the decoder is processed by the softmax layer after being subjected to linear transformation processing by the linear layer, and possible word vectors output at the current moment (namely, the iterative prediction) and corresponding output probabilities, such as the words a and b, the output probability of the word a and the output probability of the word b, are output. And repeating the iterative prediction process by the decoder, the linear layer and the softmax layer until the probability of the output terminator is maximum, and obtaining the description information corresponding to the video according to the word vector obtained in each iteration.
It can be understood that the above examples use the graph convolution feature as an illustration. In practical applications, in addition to the graph convolution feature, the spatio-temporal visual feature and/or the semantic feature of the video may also be used. In this case, the encoding process may encode each type of feature information separately, and the decoder may then decode a feature obtained by fusing the encoded features: the weights of the encoded features may be determined through the feature selection network, and the encoded features fused based on these weights to serve as the encoder output, from which the decoder obtains the text description of the video. Alternatively, the weighted encoded features may be input separately into different cross-attention layers of the decoder, from which the decoder obtains the text description of the video.
In an alternative embodiment of the present application, generating a textual description of multimedia data based on the extracted feature information includes:
inputting the extracted feature information into a plurality of decoders, respectively;
and obtaining the text description of the multimedia data based on the decoding results of the decoders.
In order to improve the decoding capability and the representation capability of the description information, in the solution of the embodiment of the present application, when processing the encoding result, a decoder group (decoder bank) including a plurality of decoders may be used to decode the encoding result separately so as to enhance the decoding capability, and the final text description information is obtained based on the decoding results of the decoders; for example, the final output may be obtained by averaging the decoding results of the decoders. The decoder bank may include 2 or more decoders, and the type of each decoder in the bank is not limited in this embodiment: it may include, for example, an LSTM-based decoder, a Gated Recurrent Unit (GRU) based decoder, a self-attention-based decoder, a transformer decoder, and the like, and the outputs of the decoders are averaged to obtain the final output result.
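A minimal sketch of one decoding step of such a decoder bank, assuming PyTorch; the decoder callables and their signature are hypothetical. Each decoder scores the next word from the same encoded features, and the word probability distributions are averaged.

```python
import torch

def decoder_bank_step(decoders, encoder_out, prev_tokens):
    """Run each decoder on the same encoded features and average their word distributions."""
    probs = [torch.softmax(d(encoder_out, prev_tokens), dim=-1) for d in decoders]
    return torch.stack(probs).mean(dim=0)   # averaged probability distribution over the vocabulary
```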
Tests show that as the number of decoders in a decoder group increases from 2, the effect improves; however, beyond 4 decoders the improvement in decoding performance levels off, while the additional decoders increase the complexity of the system. In practical applications the number of decoders therefore needs to be chosen as a trade-off between performance and complexity; optionally, 2 or 3 decoders can generally be selected. For example, 2 decoders may be selected for an on-device system, and 3 or more decoders may be used on the cloud side.
As for the selection of the decoders in the decoder group, a plurality of decoder groups may be trained in advance; different decoder groups may contain different decoders and different numbers of decoders. In practical applications, the decoder group with the best effect on the validation set or test set may be selected from the plurality of decoder groups, and the selection may consider both the decoding efficiency and the decoding performance of the decoder group.
When multiple decoders are used for decoding, in order to make the output of each decoder approach the ground truth when the decoder group is trained, a consistency loss is added over the outputs of the decoders as a constraint. This avoids large performance differences between the decoders in the group, which could make the decoder group perform worse than a single decoder. Assuming a decoder group with two decoders whose outputs are two probability distributions p1 and p2, the consistency loss may be defined as follows:

loss = D_{KL}(p1 || p2)

where D_{KL} denotes the K-L divergence.
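A small sketch of this consistency constraint, assuming PyTorch; the explicit K-L computation below is one straightforward way to realise D_{KL}(p1 || p2) between two decoder output distributions.

```python
import torch

def consistency_loss(p1, p2, eps=1e-8):
    """K-L divergence D_KL(p1 || p2) between the word distributions of two decoders."""
    return torch.sum(p1 * (torch.log(p1 + eps) - torch.log(p2 + eps)), dim=-1).mean()

# toy usage: two decoders, batch of 4, vocabulary of 1000 words
p1 = torch.softmax(torch.randn(4, 1000), dim=-1)
p2 = torch.softmax(torch.randn(4, 1000), dim=-1)
loss = consistency_loss(p1, p2)
```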
In various alternatives of the embodiments of the present application, an attention-based neural network may be used when performing encoding or decoding processing, because the attention-based neural network may simultaneously map global dependencies between different input and target locations, and therefore, may better learn long-term dependencies, and this type of neural network allows more efficient parallel computation when performing data processing. In addition, in the encoder or decoder based on self-attention, especially for the decoder based on self-attention, a plurality of cross attention layers can be stacked, because the neural network based on self-attention can well learn the associated information between elements of the same feature vector, and the cross attention (the core is multi-head attention) can well learn the associated information between different feature vectors, therefore, in the decoder based on self-attention, by adding the cross attention layers, the decoder can well learn the associated features between the elements of the feature vectors and the associated features between different feature vectors, thereby better processing various different types of features and obtaining better description information.
As an example, fig. 13b shows a schematic structural diagram of a self-attention-based decoder provided by an embodiment of the present application, in which an encoder portion may include a spatio-temporal visual encoder (spatio-temporal feature extraction network) and a semantic encoder (semantic prediction network), and encoder outputs of the spatio-temporal visual feature and the semantic feature may be included through the two encoders. When decoding is performed by using the decoder in fig. 13b, the output of the semantic encoder is input to the semantic cross attention layer, and the output of the spatiotemporal visual feature encoder is input to the spatiotemporal visual cross attention layer. The masked self-attention layer can ensure that the previous time will not receive the information of the later time when decoding, and mask the input of the later time, and the input of the masked self-attention layer can correspond to the input of the decoder shown in fig. 13a, including the start symbol, for example, the feature vector can be processed by the embedded layer of the decoder.
In an alternative embodiment of the present application, generating a text description of multimedia data based on the extracted feature information includes:
acquiring length information of the character description to be generated;
and generating the text description of the video based on the length information and the extracted characteristic information.
In order to solve the problem that video descriptions or image descriptions with different lengths cannot be generated for users in the prior art, in the scheme provided by the application, text descriptions with corresponding lengths can be generated by acquiring length information of the text descriptions to be generated, so as to meet the requirements of different application scenes. Wherein, the length information can be a relative length information, such as "long" (e.g. more than 20 words of generated description information), "moderate" (e.g. 10-20 words of generated description information), "short" (e.g. less than 10 words of generated description information), etc. The length information can be obtained from a user, for example, the user is prompted to want to generate a long description or a short description, and the user can give a corresponding instruction according to the prompt; the length information can also be obtained by analyzing the video, if the video is the video collected in real time, the current application scene can be determined by analyzing the video, and different length information can be determined for different application scenes.
For this scheme, when training the decoder, unlike the prior art, the start symbol of the decoder may be a start symbol containing length information. For each training sample, the start symbol may be, for example, a symbol indicating that a longer description needs to be generated or a symbol indicating that a shorter description needs to be generated, and the description label information of different samples may correspond to different start symbols. When the decoder is trained on such samples, it can learn the mapping relationship between the start symbols corresponding to different length information and description information of the corresponding length. Therefore, when decoding with the trained decoder, the start symbol corresponding to the acquired length information can be used as the decoding start symbol, so that a video description or image description meeting the length requirement can be generated.
That is, in the scheme of the embodiment of the present application, during training the conventional "BOS" (begin-of-sentence) symbol is replaced with length information, such as "short", "medium" or "long", to control the length of the output description information; in actual training, different length identifiers may be used for different length information. Specifically, when a short description is to be output during training, the start symbol is "short"; a moderate description corresponds to the "medium" start symbol, and a long description corresponds to the "long" start symbol, so that the sentence length is associated with the "short", "medium" and "long" symbols during training. When the system is used online, a "short", "medium" or "long" start symbol is input according to the user's requirement, to obtain description information of the corresponding length.
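A minimal sketch of this length-controlled start symbol, in plain Python; the token strings and the word-count thresholds below (reusing the "fewer than 10", "10-20", "more than 20" ranges mentioned earlier) are assumptions for illustration.

```python
# Map a reference caption to a length start token during training;
# at inference time the user's requested length picks the start token directly.
LENGTH_TOKENS = {"short": "<short>", "medium": "<medium>", "long": "<long>"}

def start_token_for(reference_caption: str) -> str:
    n = len(reference_caption.split())
    if n < 10:
        return LENGTH_TOKENS["short"]
    if n <= 20:
        return LENGTH_TOKENS["medium"]
    return LENGTH_TOKENS["long"]

# e.g. tokens = [LENGTH_TOKENS["short"]] -> the decoder generates a description of fewer than 10 words
```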
Based on the method in the optional embodiments of the present application, the intraframe information (such as the objects, attributes, and relationships of the images, and the semantic information, the spatiotemporal visual characteristics, and the like of the video or the images) of each frame or each image in the video can be analyzed in detail, and the information of the images can be fully utilized to generate more accurate text description. As can be seen from the above description, based on the generation manner of the video description information provided by the embodiment of the present application, in practical applications, a variety of different specific implementations may be selected according to practical application requirements.
In addition, according to the scheme provided by the embodiment of the present application, when extracting feature information of multimedia data, in addition to extracting features of regions of an image by using a feature extraction network, an encoder (i.e., a relationship prediction network) for learning a relationship between features of the regions is further added, and the encoder may be implemented by a self-attention-based encoder (e.g., a transform encoder), so as to enhance the performance of obtaining video or image description information by improving the performance of feature encoding. Furthermore, embodiments of the present application may not employ a conventional RNN structure decoder but employ a self-attention-based decoder (e.g., a transform decoder) when acquiring the description information, which has advantages of fast training speed, few parameters, and easy learning of long-distance correlation compared to the conventional RNN.
The following describes a method for generating description information of multimedia data provided in the embodiments of the present application again with reference to several alternative embodiments, taking a video as an example.
Example 1
Fig. 14 is a schematic flow chart illustrating a video description information generating method according to an alternative embodiment of the present application, and as shown in fig. 14, the generating method may include the following steps:
step S301: selecting a plurality of frames from a video;
step S302: respectively constructing a scene graph for each frame in the plurality of frames;
step S303: for each frame, obtaining the image convolution characteristics of each frame by using an image convolution network according to the constructed scene graph;
step S304: generating a textual description about the video based on the obtained graph convolution features.
Optionally, after obtaining the image convolution feature of each frame, the text description of the video may be obtained based on the image convolution feature. For example, the obtained feature of the image convolution may be input to a decoder, and a textual description for a given video may be obtained by decoding the obtained feature of the image convolution. As an alternative, a self-attention-based decoder may be used to generate a textual description of the video from the image convolution characteristics of each of several frames, however, the application is not limited thereto.
Example two
A flow diagram of an alternative method for generating video description information given in this example is shown in fig. 15, and as shown in fig. 15, this alternative embodiment may include the following steps:
step S1201: selecting a number of frames from a given video, such as 501 in fig. 17a and 1001 in fig. 17 b;
step S1202: for each of the selected frames, a feature extraction network is used to obtain a plurality of target regions and the features of the respective target regions (i.e. regional features or local features), as shown at 502 in fig. 17a and 1002 in fig. 17b; for each image frame, a Fast R-CNN algorithm may be used to extract the target regions and their features.
Step S1203: a relationship prediction network is applied to the extracted region features of the respective target regions to obtain relationship features between the respective target regions, as an example shown in fig. 8.
Step S1204: a scene map for each image frame is constructed based on the obtained relational features between the respective target regions, as indicated at 503 in fig. 17 a.
Step S1205: obtaining a graph convolution feature of each frame using a graph convolution network based on nodes and edges in the scene graph of each frame, as in 504 of fig. 17 a;
step S1206: and generating the video text description according to the graph convolution characteristics. Alternatively, a self-attention-based decoder may be used to learn the inter-frame information of selected frames based on the obtained image convolution features to generate a textual description of a given video. For example, a vector of the feature of the image convolution may be obtained for each selected frame, and the feature vectors are input to a self-attention-based decoder for learning to obtain inter-frame information from frame to frame, as shown in 505 in fig. 17a, the vector of the feature of the image convolution of each frame may be input to an inter-frame transformer decoder, and a textual description of the video is obtained based on the decoder output: i.e. "a person is putting a pizza in the oven".
Example three
Fig. 16 shows a schematic flow diagram of an optional method for generating video description information in this example. Comparing fig. 15 and fig. 16, this example is the same as the second example in the first three steps, which are not repeated here. The difference is that step S1304 is added in this example: an attribute prediction network is applied to the extracted features of each target region to obtain the attribute features of each target region. For example, the attribute prediction network may be trained based on the Visual Genome dataset, and the trained attribute prediction network is then applied to obtain the attribute features of each target region, as shown in fig. 9.
Accordingly, in step S1305, when constructing the scene graph of each frame, specifically, the scene graph may be constructed based on the obtained attribute features of each target area and the relationship features between each target area, and then in step S1306, the graph convolution feature of each frame is obtained according to the scene graph established based on the attribute features and the relationship features.
It should be noted that the execution sequence between step S1303 and step S1304 may be changed or executed simultaneously.
After the graph convolution feature of each frame is obtained, the text description of the video can be generated according to the obtained graph convolution feature in step S1307. Wherein a self-attention-based decoder may be used to learn inter-frame information for selected frames to generate a textual description of a given video.
For example, as shown at 505 in FIG. 17a, an attention-based decoder (such as the interframe transformer decoder shown in the figure) can be used to learn interframe information to generate a video textual description according to graph convolution features.
For another example, as shown at 1005, 1006 and 1007 in fig. 17b, the obtained graph convolution features may be encoded separately, and the encoded features may be processed to obtain feature vectors with the same target dimension. The cross-connection lines drawn in fig. 17b between the constructed scene graphs and the graph convolution features indicate that the adopted scene graph may be a spatio-temporal scene graph, i.e. inter-frame information may be considered when constructing the scene graph; of course, a spatial scene graph may also be adopted. Specifically, the encoding of the graph convolution feature of each frame may be performed separately using a self-attention-based intra-frame encoder (such as the intra-frame transformer encoder shown in fig. 17b), which is used to learn the intra-frame information, i.e. to further learn the correlation information between the targets within a frame using the self-attention mechanism. The output of the self-attention-based intra-frame encoder is then processed to obtain, for each frame, one feature vector with the same target dimension.
For example, assuming that the dimension of the output sequence from the self-attention-based intra encoder is T × C, where T represents the number of nodes in the scene graph and C is the feature dimension of the feature vector corresponding to each node, the self-attention-based intra encoder learns information such as the relationship of the output sequence by using the self-attention mechanism and outputs the learned sequence. Here, the output sequence length is the same as the input sequence length, i.e., T × C. Averaging the output can result in a feature vector with a dimension of 1 × C. Thus each frame gets a feature vector of 1 × C.
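A minimal sketch of the pooling step just described, assuming PyTorch and illustrative dimensions: the intra-frame encoder output for one frame has shape T × C (T scene-graph nodes, C-dimensional features), and averaging over the node dimension yields one 1 × C vector per frame for the subsequent inter-frame encoder.

```python
import torch

T, C = 10, 512
intra_out = torch.randn(T, C)                      # output sequence of the intra-frame encoder
frame_vec = intra_out.mean(dim=0, keepdim=True)    # shape (1, C): one vector for this frame
```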
One encoded feature vector may be obtained for each selected frame, and these encoded feature vectors are input to a self-attention-based inter-frame encoder (such as the inter-frame transformer encoder shown in fig. 17 b) for encoding again to obtain the feature vectors with the same target dimension.
The inter-frame information is then learned using a self-attention-based decoder (such as the inter-frame transform decoder shown in fig. 17 b) based on the encoded features to generate a textual description of the given video. The encoded features are input to a self-attention-based interframe decoder for learning to obtain interframe information between frames, and text descriptions about a given video are generated through the learning of the input features.
As another example, the feature of the image convolution of each of the obtained frames may be encoded separately using only a self-attention-based intra-frame encoder, and then the encoded feature may be input to a decoder to generate a textual description of the video. Alternatively, the obtained feature of the image volume of each of the several frames may be encoded using only a self-attention-based inter-frame encoder, and the encoded feature may be input to a decoder to generate a textual description of the video. That is, only the 1005 operation of fig. 17b, or only the 1006 operation of fig. 17b may be performed, or both the 1005 operation of fig. 17b and the 1006 operation of fig. 17b may be performed together.
After this sequence of processing of a given video, the text description of the given video can be obtained; as shown in fig. 17b, the text description "a man is putting a pizza in the oven" may be generated from the selected frames.
Example four
A flowchart of an alternative video description information generation method given in this example is shown in fig. 18, and as shown in fig. 18, as can be seen from comparing fig. 18 and fig. 16, this example differs from the third example in that in step S1505, in constructing the scene graph of each frame, temporal information is also considered in this example, specifically, a spatial scene graph for each image frame is constructed based on the obtained attribute features of each target region and the relationship features between each target region, and temporal information is added between the scene graphs for each frame to obtain a spatial-temporal scene graph, as shown in fig. 10.
It should be noted that this example may be implemented on the basis of the first example, that is, step S1504 may be omitted, and the attribute characteristics of the target area may not be considered.
After the spatio-temporal scene graph of each frame is obtained, in step S1506 a graph convolution network is used to obtain the graph convolution feature of each frame based on the nodes and edges in the spatio-temporal scene graph of each frame. Then, in step S1507, the text description of the given video is generated from the obtained graph convolution features; for example, the graph convolution features may be encoded using a self-attention-based encoder, and a self-attention-based decoder may learn inter-frame information from the encoded features to generate the text description of the video.
Example five
Fig. 19 is a schematic flowchart of an alternative video description information generation method in this example. As can be seen from comparing fig. 19 with fig. 16, this example differs from the third example as follows:

Step S1602: and obtaining, for each selected frame of the plurality of frames, a plurality of target regions, the features of the target regions (i.e., region features or local features), and the spatio-temporal visual features of the video by using the feature extraction network.

Compared with the region feature extraction step in the above examples, the features extracted in this step may additionally include the spatio-temporal visual features of the video. Optionally, as shown in fig. 20, the spatio-temporal visual features may be obtained through a spatio-temporal feature extraction network.
Step S1603: and extracting the semantic features of the video through a semantic feature extraction network based on the selected frames. As shown in fig. 20, semantic features may be derived by a semantic prediction network based on several frames.
Steps S1604 to S1607 correspond to the steps of obtaining the relationship features and the attribute features, constructing the scene graph (the spatial scene graph or the spatio-temporal scene graph), and extracting the graph convolution features in the previous examples, and are not repeated here.

Step S1608: and generating the textual description of the video according to the graph convolution features, the spatio-temporal visual features, and the semantic features of each frame. In particular, inter-frame information may be learned using multiple decoders (the decoder group shown in fig. 20) based on the spatio-temporal visual features, semantic features, and graph convolution features to generate the textual description of the video.

Each decoder may be self-attention-based or RNN-based. Specifically, the spatio-temporal visual features, the semantic features, and the graph convolution features may be input to the decoder group to learn the inter-frame information between frames, and the results of the decoders are then averaged to obtain the final decoding result; the textual description of the given video is generated through this learning of the input features. A minimal sketch of this decoder-group averaging follows.
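The sketch below illustrates the decoder group described above; the choice of transformer decoders, the number of layers, and the averaging of softmax outputs are assumptions for illustration only.

```python
import torch
import torch.nn as nn

def make_decoder(dim, nhead):
    layer = nn.TransformerDecoderLayer(d_model=dim, nhead=nhead, batch_first=True)
    return nn.TransformerDecoder(layer, num_layers=2)

class DecoderGroup(nn.Module):
    # One decoder per feature type (spatio-temporal visual, semantic, graph convolution);
    # the per-step word distributions of the decoders are averaged for the final result.
    def __init__(self, dim=512, vocab=10000, nhead=8):
        super().__init__()
        self.decoders = nn.ModuleList([make_decoder(dim, nhead) for _ in range(3)])
        self.word_emb = nn.Embedding(vocab, dim)
        self.out = nn.Linear(dim, vocab)

    def forward(self, word_ids, st_feats, sem_feats, graph_feats):
        # word_ids: (B, L); each *_feats: (B, S_i, dim) encoded features of one type
        tgt = self.word_emb(word_ids)
        logits = [self.out(dec(tgt, mem))
                  for dec, mem in zip(self.decoders, (st_feats, sem_feats, graph_feats))]
        return torch.stack(logits).softmax(dim=-1).mean(dim=0)   # averaged word probabilities
```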
The video description information generation method of the present application addresses the insufficient precision caused by existing video description algorithms ignoring intra-frame information, and provides an improved video description scheme. When the method provided by the present application is implemented, features can be obtained based on a graph convolution network, and the textual description of the video can be obtained from the decoding output of a self-attention structure. Specifically, after the graph convolution features are obtained, they may be directly input to a self-attention-based decoder for decoding to output the textual description of the given video, or the obtained graph convolution features may first be encoded and the encoded features then input to a self-attention-based codec for inter-frame encoding and decoding to output the textual description of the given video.
It should be noted that the precision of the generated description information can be further improved by using the self-attention-based intra-frame encoder and the self-attention-based inter-frame encoder. Optionally, the obtained graph convolution features of each frame may be encoded separately by a self-attention-based intra-frame encoder and, after the encoded features are fused, input to a decoder to generate the textual description of the video; or the obtained graph convolution features of the frames may be encoded by a self-attention-based inter-frame encoder and input to the decoder to generate the textual description of the video. That is, the intra-frame encoder and the inter-frame encoder can be used selectively. The scheme provided by the embodiments of the present application can make full use of intra-frame information and/or inter-frame information, so that a more accurate textual description is generated for the given video.
In the following, some optional embodiments of the method for generating image description information will be described, taking the description of a single image as an example.
Example five
Fig. 21 is a flow chart illustrating a method for generating image description information according to an embodiment of the present application, where as shown in the drawing, the method includes:
step S10: extracting characteristic information corresponding to the image;
step S20: and acquiring description information corresponding to the image according to the extracted feature information.
Wherein the images may be retrieved from a local memory or local database as desired or received from an external data source (e.g., internet, server, database, etc.) via an input device or transmission medium.
Specifically, the feature information corresponding to the image may be extracted through a feature extraction network. As an alternative, the local features of each target region may be extracted through a trained Faster R-CNN; for example, the average-pooled feature vectors of the RoI pooling (Pool5) layer feature maps may be selected as the features.
After the feature information of the image is acquired, the description information corresponding to the image can be obtained through a decoder according to the extracted feature information. The specific structure of the decoder is not limited in the embodiments of the present application. For example, the decoder may be implemented by a self-attention-based decoder (e.g., a Transformer decoder). Specifically, the decoder may output, at each time, the words that may be output and their output probabilities (which may be normalized probabilities) based on the extracted feature information and the input word vectors (which may include a start symbol and the word vectors predicted in the iterative prediction process). As an alternative, the self-attention-based decoder may be composed of a multi-head attention layer with a mask, a multi-head attention layer, a layer normalization layer, and a forward network layer.
For example, at a first time the decoder may output words a and b with output probabilities of, e.g., 60% for word a and 40% for word b; at a second time it may output words c, d, and e with output probabilities of, e.g., 60% for word c, 20% for word d, and 20% for word e; and so on at later times. In this case, according to an exemplary embodiment of the present application, the image description sentence may be generated by a greedy decoding method, that is, by combining, in chronological order, the word with the highest output probability at each time. Alternatively, according to another exemplary embodiment of the present application, the image description sentence may be generated by a Monte Carlo sampling method, that is, by performing Monte Carlo sampling based on the output probabilities of the words that may be output at each time.
Accordingly, when generating the description information of an image, the image description sentence can be obtained by combining, in chronological order, the word with the highest output probability at each time, until the terminator has the highest output probability, at which point generation ends. A minimal illustration is given below.
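A small, purely illustrative Python snippet of the two decoding strategies, using the example probabilities mentioned above (the word lists and probabilities are assumed):

```python
import random

# Assumed per-step candidate words and output probabilities from the decoder,
# matching the example above (a/b at the first time, c/d/e at the second time).
steps = [
    {"a": 0.6, "b": 0.4},
    {"c": 0.6, "d": 0.2, "e": 0.2},
]

# Greedy decoding: take the most probable word at each time.
greedy = [max(p, key=p.get) for p in steps]

# Monte Carlo sampling: draw one word per time according to its output probability.
sampled = [random.choices(list(p), weights=list(p.values()))[0] for p in steps]

print("greedy:", " ".join(greedy), "| sampled:", " ".join(sampled))
```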
Optionally, the step S10 may further include: obtaining global features corresponding to the images;
accordingly, in step S20, the text description of the image may be obtained based on the obtained local feature and the global feature.
In order to obtain more accurate image description, after the local features of each target region of the image are acquired, the global features of the image can be further obtained based on each local feature, so that more accurate image description information can be obtained based on the local and global feature information.
Optionally, the obtaining of the global feature corresponding to the image may include: and obtaining the global features of the image based on the local features of the image, or extracting the global features through a feature extraction network based on the image.
Specifically, local feature information corresponding to each target candidate region of the image can be extracted through a feature extraction network, and the global feature information corresponding to the image is obtained based on the local feature information. Correspondingly, the description information corresponding to the image can be obtained through the decoder according to the local feature information and the global feature information. As an alternative, the global feature information may be obtained by averaging the local feature information corresponding to each target candidate region of the image. Alternatively, the global feature information may be extracted from the image by a feature extraction network (e.g., a CNN), such as by using ResNet to extract the feature map of each layer (i.e., each channel) of the image and performing average pooling. A minimal sketch of the averaging option follows.
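A minimal sketch of the averaging option, assuming 36 regions with 2048-dimensional RoI features (both numbers are illustrative):

```python
import torch

# Assumed: 36 region-level (local) features of dimension 2048 from the detector.
local_feats = torch.randn(36, 2048)

# Global feature obtained by averaging the local features, as one of the options above.
global_feat = local_feats.mean(dim=0)      # shape: (2048,)
```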
In an optional embodiment of the present application, the local features may include local image features and/or local attribute features, and the global features include global image features and/or global attribute features; correspondingly, the global image characteristics corresponding to the image can be obtained based on the local image characteristics; and/or obtaining the global attribute feature corresponding to the image based on the local attribute feature.
That is, the obtained local feature information may include local text attribute information in addition to the local image feature information. Thus, when extracting local features through the feature extraction network, the feature extraction network may further include an attribute prediction network. The attribute prediction network may be a multi-label classification network and, optionally, may be obtained by a weakly supervised training method such as noisy-OR. In practical applications, the attributes can be divided according to nouns, verbs, adjectives, relatively rare words, and topics; each type of attribute is obtained by a dedicated attribute prediction network (such as a Multiple Instance Learning (MIL) network), and finally the various attribute features can be concatenated to obtain the final text attribute feature.
Optionally, when the obtained local feature information includes local image feature information and local text attribute information, the obtained global feature information may also include global image feature information and global text attribute information. Similarly, the global image feature information and the global text attribute information may be obtained based on the corresponding local feature information, that is, the global image feature information corresponding to the image may be obtained based on the local image feature information and the global text attribute information corresponding to the image may be obtained based on the local text attribute information, or the global image feature information and the global text attribute information may be extracted from the image through a neural network. Of course, one of the local feature information and the global feature information may include both image feature information and text attribute information while the other includes only image feature information or only text attribute information; this may be configured according to application requirements.
In an alternative embodiment of the present application, obtaining a textual description of an image based on a local feature and the global feature includes:
and encoding each local feature according to all the extracted local features to obtain the encoded local features, and obtaining the textual description of the image according to the encoded local features and the global feature.
That is, after obtaining each local feature of the image, the encoder may further perform encoding processing on each piece of local feature information based on all pieces of extracted local feature information, to obtain each piece of encoded local feature information. Wherein the encoder is operable to learn the relationship between the local features of the respective target candidate regions based on all the extracted local feature information.
Optionally, the encoder may be implemented by a self-attention-based encoder, which encodes each piece of local feature information based on all the extracted local feature information to obtain the encoded local feature information. Accordingly, when the decoder obtains the description information corresponding to the image according to the encoded local feature information and the global feature information, it may output, at each time, the words that may be output and their output probabilities (which may be normalized probabilities) based on the input word vectors (which may include a start symbol and the word vectors predicted in the iterative prediction process). The image description sentence is then obtained by combining, in chronological order, the word with the highest output probability at each time, until the terminator has the highest output probability, at which point generation ends.
As an alternative, the self-attention-based encoder described above may include a multi-head attention layer, a layer normalization layer, and a forward network layer in sequential cascade.
As an example, fig. 22 shows a schematic diagram of a codec structure provided in an embodiment of the present application. As shown in the figure, in order to obtain the input information of the encoder, for the image to be processed (the image shown in the lower right corner of the figure), the local features of each target region of the image may be extracted through a feature extraction network (e.g., the Faster R-CNN shown in the figure). Specifically, the feature extraction network may divide the input image into a plurality of target candidate regions (i.e., target regions) and obtain a feature vector (i.e., a local feature) from each target candidate region, so as to obtain a plurality of feature vectors, such as {Vj} shown in fig. 22, where each vector in {Vj} represents the local feature information of one target candidate region (the region corresponding to a rectangular frame marked in the lower-left image).
Regional feature Embedding (Embedding) processing can be performed on each piece of local feature information extracted by the feature extraction network, the purpose is to change the dimension of the feature information so as to be suitable for a subsequent encoder to perform processing, and each piece of local feature information output after the Embedding processing is input into the encoder to perform encoding processing.
Alternatively, the encoder shown in fig. 22 may have a one-block or multi-block structure, and in the case of the multi-block structure, the structure of each block may be the same or different. As an example, it is assumed that the encoder may include 6 blocks with the same structure, and the 6 blocks are sequentially cascaded, wherein each block may include two parts, i.e., a multi-head attention layer and a forward network that is fully connected position by position, and optionally, the forward network may be implemented by two linear prediction layers, and a ReLU activation operation may be included between the two linear prediction layers. The multi-headed attention layer and the forward network portion in each block may correspond to a layer normalization layer, as shown in fig. 22, and each block in this example may be composed of the multi-headed attention layer, the layer normalization layer, the forward network layer, and the layer normalization layer in turn, and the respective blocks may be stacked to obtain the encoder.
The processing of the first block is taken as an example: each piece of local feature information is first processed by the multi-head attention layer (self-attention); the output is fused (e.g., added) with the output of the region feature Embedding, followed by layer normalization; the normalized result is processed by the forward network, fused (e.g., added) with the output of the previous layer normalization layer, and then layer-normalized again to obtain the output of the first block. The output of the first block is used as the input of the second block, and the encoding process continues in sequence, thereby obtaining the output of the encoder (i.e., the encoder output in fig. 22). A minimal sketch of such a block is given below.
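The following is a minimal PyTorch-style sketch of one such encoder block in the order described above; the two-linear-layer forward network with ReLU follows the description, while the remaining hyperparameters are assumptions.

```python
import torch
import torch.nn as nn

class EncoderBlock(nn.Module):
    # One block as described above: multi-head self-attention, add & layer-norm,
    # position-wise feed-forward (two linear layers with ReLU), add & layer-norm.
    def __init__(self, dim=512, nhead=8, ffn_dim=2048):
        super().__init__()
        self.attn = nn.MultiheadAttention(dim, nhead, batch_first=True)
        self.norm1 = nn.LayerNorm(dim)
        self.ffn = nn.Sequential(nn.Linear(dim, ffn_dim), nn.ReLU(), nn.Linear(ffn_dim, dim))
        self.norm2 = nn.LayerNorm(dim)

    def forward(self, x):
        # x: (batch, num_regions, dim) embedded local features
        a, _ = self.attn(x, x, x)
        x = self.norm1(x + a)           # fuse (add) with the block input, then layer-norm
        return self.norm2(x + self.ffn(x))

# Stacking several identical blocks, as described for the 6-block encoder.
encoded = nn.Sequential(*[EncoderBlock() for _ in range(6)])(torch.randn(1, 36, 512))
```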
Optionally, as shown in fig. 22, the global feature information may be further obtained based on the pieces of local feature information extracted by the feature extraction network; for example, the pieces of local feature information may be averaged to obtain the global feature information, such as the averaged feature vector shown in fig. 22.
In addition, a start symbol and the word vectors predicted in the iterative prediction process may be obtained (for the first prediction of the iterative prediction, only the start symbol is available); these are denoted as w in fig. 22. When the model is trained, all word vectors corresponding to the samples may be input.
The global feature information, the start symbol, and the predicted word vectors w may be subjected to decoder Embedding processing so as to change the dimension of the feature information to be suitable for the subsequent decoder processing, and the global feature information, the start symbol, and the word vectors output after the Embedding may be input to the decoder for decoding.
Optionally, the decoder in fig. 22 may also consist of one or more blocks, and the structure of each block may be the same; for example, the decoder may include 6 blocks with the same structure. As an alternative structure, each block may mainly include three parts, namely a multi-head attention layer with a mask, a multi-head attention layer corresponding to the features, and a forward network. The multi-head attention layers and the forward network portion in each block may each be followed by a layer normalization layer; specifically, each block may be composed, in sequence, of a multi-head attention layer with a mask, a layer normalization layer, a multi-head attention layer, a layer normalization layer, a forward network layer, and a layer normalization layer, and the blocks may be stacked to obtain the decoder.
The processing of the first block is taken as an example: the global feature information, the start symbol, and the word vectors w are first processed by the multi-head attention layer with a mask; the result is fused (e.g., added) with the output of the decoder Embedding, followed by layer normalization. The normalized output and the encoder output are then processed together by the multi-head attention layer, fused (e.g., added) with the output of the previous layer normalization layer, and layer-normalized again. The normalized output is processed by the forward network, fused with the output of the previous layer normalization layer, and then layer-normalized; the result is the output of the first block. The output of the first block is used as the input of the second block, and the decoding process continues in sequence to obtain the output of the decoder.
The output of the decoder is processed by the linear layer and then by the softmax layer, which outputs the possible word vectors and the corresponding output probabilities at the current iteration of prediction, such as words a and b with the output probability of word a and the output probability of word b. The decoder, the linear layer, and the softmax layer repeat this iterative prediction process until the terminator has the highest output probability, and the description information corresponding to the input image can then be obtained from the word vectors obtained in each iteration. A minimal sketch of a decoder block follows.
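The sketch below shows one decoder block in the order described above (masked self-attention, attention over the encoder output, forward network, each followed by add and layer normalization); hyperparameters are assumptions.

```python
import torch
import torch.nn as nn

class DecoderBlock(nn.Module):
    # Masked multi-head self-attention, add & layer-norm, multi-head attention over the
    # encoder output, add & layer-norm, feed-forward, add & layer-norm.
    def __init__(self, dim=512, nhead=8, ffn_dim=2048):
        super().__init__()
        self.self_attn = nn.MultiheadAttention(dim, nhead, batch_first=True)
        self.cross_attn = nn.MultiheadAttention(dim, nhead, batch_first=True)
        self.norm1, self.norm2, self.norm3 = (nn.LayerNorm(dim) for _ in range(3))
        self.ffn = nn.Sequential(nn.Linear(dim, ffn_dim), nn.ReLU(), nn.Linear(ffn_dim, dim))

    def forward(self, tgt, enc_out):
        # tgt: (1, L, dim) embedded global feature + start symbol + predicted words
        # enc_out: (1, num_regions, dim) encoder output
        causal = torch.triu(torch.ones(tgt.shape[1], tgt.shape[1]), diagonal=1).bool()
        a, _ = self.self_attn(tgt, tgt, tgt, attn_mask=causal)   # masked self-attention
        x = self.norm1(tgt + a)
        c, _ = self.cross_attn(x, enc_out, enc_out)              # attend to encoded local features
        x = self.norm2(x + c)
        return self.norm3(x + self.ffn(x))
```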
In the case where the local features include local image features and local text attribute information, the encoder used may include an image feature encoder part and an attribute feature encoder part, which encode the local image feature information and the local text attribute information, respectively. Specifically, the image feature encoder part may encode each piece of local image feature information based on all the extracted local image feature information to obtain the encoded local image feature information, and the attribute feature encoder part may encode each piece of local text attribute information based on all the extracted local text attribute information to obtain the encoded local text attribute information. Correspondingly, the decoder may then obtain the description information corresponding to the image according to the encoded local image feature information, the encoded local text attribute information, the global image feature information, and the global text attribute information.
As another example, fig. 23 shows a schematic structural diagram of a codec provided in an embodiment of the present application. As can be seen from fig. 22 and fig. 23, the decoder of fig. 23 may adopt a structure similar to that of the decoder of fig. 22. The encoder in this example may include an image feature encoder part and an attribute feature encoder part, whose structures may be identical or different.
As shown in fig. 23, for the image to be processed (the image shown in the lower left corner of the figure), local image feature vectors of a plurality of target candidate regions can be obtained through a feature extraction network (e.g., Faster R-CNN), such as {vj} shown in fig. 23, where each vector in {vj} represents the local image feature vector of one target candidate region (the region corresponding to a rectangular frame marked in the image processed by the Faster R-CNN), and a plurality of local text attribute vectors can also be obtained, such as {aj} shown in fig. 23, where each vector in {aj} represents the local text attribute vector of one target candidate region (the region corresponding to a rectangular frame marked in the lower-left image).
Region image feature Embedding processing can be carried out on each piece of extracted local image feature information, and each piece of local image feature information output after the Embedding is input into the image feature encoder for encoding; region attribute feature Embedding processing can be carried out on each piece of extracted local text attribute information, and each piece of local text attribute information output after the Embedding is input into the attribute feature encoder for encoding.
In this example, the structures of the image feature encoder and the attribute feature encoder shown in fig. 23 are described by taking the structure of the encoder shown in fig. 22 as an example: each of the two may include 6 blocks with the same structure, each block has the structure shown in fig. 23, and the blocks may be stacked to obtain the encoder. The processing flow of each block can refer to the description of the encoder block in fig. 22. When feature encoding is performed by the image feature encoder and the attribute feature encoder, the only difference is that the input of the image feature encoder is the feature obtained after the region image feature Embedding processing of the local image feature information, while the input of the attribute feature encoder is the feature obtained after the region attribute feature Embedding processing of the local text attributes; after the encoding process, the output results of the encoders (i.e., the output of the image feature encoder and the output of the attribute feature encoder in fig. 23) are obtained.
Further, the pieces of local image feature information may be averaged to obtain the global image feature information (the averaged image feature vector in fig. 23), and the global text attribute information (the averaged attribute feature vector in fig. 23) can be obtained by averaging the pieces of local text attribute information.
A start symbol and the word vectors predicted in the iterative prediction process may also be obtained (for the first prediction of the iterative prediction, only the start symbol is available); these are shown as w in fig. 23. When training the model, all word vectors of a sample may be input.
The global image feature information, the global text attribute information, the start symbol, and the predicted word vectors w may be subjected to decoder Embedding processing so as to change the dimension of the feature information to be suitable for the subsequent decoder processing, and the global image feature information, the global text attribute information, the start symbol, and the word vectors output after the Embedding may be input to the decoder for decoding.
As an alternative structure, the decoder shown in fig. 23 may consist of one or more blocks, and in the case of multiple blocks the structure of each block may be the same or different. For example, the decoder may contain 6 blocks with the same structure, and each block may be composed of four parts, namely a multi-head attention layer with a mask, a multi-head attention layer corresponding to the image features, a multi-head attention layer corresponding to the attribute features, and a forward network. The multi-head attention layers and the forward network portion in each block may each be followed by a layer normalization layer; specifically, as shown in fig. 23, each block may be composed, in sequence, of a multi-head attention layer with a mask, a layer normalization layer, a multi-head attention layer for the image features, a layer normalization layer, a multi-head attention layer for the attribute features, a layer normalization layer, a forward network layer, and a layer normalization layer, and the blocks may be stacked to obtain the decoder.
The processing of the first block is taken as an example: the global image feature information, the global text attribute information, the start symbol, and the word vectors w predicted in the iterative prediction process are first processed by the multi-head attention layer with a mask; the result is fused (e.g., added) with the output of the decoder Embedding, followed by layer normalization. The normalized output and the result output by the image feature encoder are then processed by a multi-head attention layer, fused (e.g., added) with the output of the previous layer normalization layer, and layer-normalized. The normalized output and the result output by the attribute feature encoder are then processed by another multi-head attention layer, fused (e.g., added) with the output of the previous layer normalization layer, and layer-normalized. Finally, the normalized output is processed by the forward network, fused with the output of the previous layer normalization layer, and layer-normalized; the result is the output of the first block. The output of the first block is used as the input of the second block, and the decoding process continues in sequence to obtain the output of the decoder. A minimal sketch of such a block is given below.
The output of the decoder is processed by the linear layer and then by the softmax layer, which outputs the possible word vectors and the corresponding output probabilities at the current iteration of prediction, such as words a and b with the output probability of word a and the output probability of word b. The decoder, the linear layer, and the softmax layer repeat this iterative prediction process until the terminator has the highest output probability, and the description information corresponding to the input image can then be obtained from the word vectors obtained in each iteration.
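The sketch below shows one decoder block of fig. 23 with the two cross-attention parts (image features first, then attribute features); hyperparameters are assumptions.

```python
import torch
import torch.nn as nn

class DualCrossDecoderBlock(nn.Module):
    # Masked self-attention, cross-attention over the image feature encoder output,
    # cross-attention over the attribute feature encoder output, then a feed-forward
    # layer, each part followed by add & layer-norm.
    def __init__(self, dim=512, nhead=8, ffn_dim=2048):
        super().__init__()
        self.self_attn = nn.MultiheadAttention(dim, nhead, batch_first=True)
        self.img_attn = nn.MultiheadAttention(dim, nhead, batch_first=True)
        self.attr_attn = nn.MultiheadAttention(dim, nhead, batch_first=True)
        self.norms = nn.ModuleList([nn.LayerNorm(dim) for _ in range(4)])
        self.ffn = nn.Sequential(nn.Linear(dim, ffn_dim), nn.ReLU(), nn.Linear(ffn_dim, dim))

    def forward(self, tgt, img_enc, attr_enc):
        causal = torch.triu(torch.ones(tgt.shape[1], tgt.shape[1]), diagonal=1).bool()
        x = self.norms[0](tgt + self.self_attn(tgt, tgt, tgt, attn_mask=causal)[0])
        x = self.norms[1](x + self.img_attn(x, img_enc, img_enc)[0])     # image feature encoder output
        x = self.norms[2](x + self.attr_attn(x, attr_enc, attr_enc)[0])  # attribute feature encoder output
        return self.norms[3](x + self.ffn(x))
```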
As can be seen from the description of the method for generating image description information in the above optional embodiments, the method for generating image description information may be specifically implemented by using an image description model, that is, an image may be input into the image description model, and a text description of the image may be obtained based on model output. The specific neural network structure of the image description model is not limited in the embodiments of the present application, and a codec network structure including, but not limited to, a self-attention-based encoder and a self-attention-based decoder shown in fig. 22 or fig. 23 may be adopted.
It should be understood that the schemes in the above examples are only some optional examples of the present application and are not intended to limit the scheme of the present application. The above examples apply to the generation of image description information as well as video description information; the only difference when generating description information for an image is that, since there is only one image, information between frames (i.e., between adjacent images), such as the temporal edges and inter-frame encoder described above, is not considered.
As can be seen from the above description of the method for generating description information of multimedia data provided in the various alternative embodiments of the present application, the generation of the description information of multimedia data can be realized by a multimedia data description model. For a video, the multimedia data description model is a video description model; for an image, it is an image description model. The video description model and the image description model may be different models or the same model, that is, a model suitable for generating image descriptions and/or video descriptions. Optionally, the video description model may be an RNN-based model or a model based on other network structures, such as a Transformer-based model; in practical applications, the specific structure of the model may be set according to actual requirements, which is not limited in the embodiments of the present application.
For a video for which video description information needs to be obtained, the video or several frames selected from it can be input into the video description model, and the textual description of the video is obtained based on the output of the video description model. The training manner is not limited in the embodiments of the present application; the original video description model may be trained with video samples, and the trained video description model is then used to generate video description information.
Specifically, in an optional embodiment of the present application, the text description of the multimedia data is obtained through a multimedia data description model, where the multimedia data description model is obtained through training in the following manner:
obtaining a training sample, wherein the training sample comprises first sample multimedia data with description labels;
and training the initial description model based on the first sample multimedia data until the model loss function is converged, and taking the trained description model as a multimedia data description model.
It can be understood that, for the video description model, the sample multimedia data is a sample video and the description label is the description label of the video; for the image description model, the sample multimedia data is a sample image and the description label is the description label of the sample image. The specific form of the model loss function can be configured according to actual requirements; for example, a loss function commonly used in the training of video description models or image description models can be selected. During training, the value of the model loss function represents the difference between the description information predicted by the model for the multimedia data and the description annotation information, or whether the predicted description information meets other preset conditions; through continuous training, the description information predicted by the model approaches the annotation information or meets the other preset conditions.
In order to improve the accuracy of the description information of the generated multimedia data, fig. 24 illustrates a method for training a multimedia data description model provided in an alternative embodiment of the present application, where, as shown in the figure, the training samples further include second sample multimedia data without description labels, and the model loss function includes a first loss function and a second loss function, and the method may include the following steps S201 to S203 when training an initial description model based on the first sample multimedia data.
Step S201: training a preset description model based on first sample multimedia data to obtain a value of a first loss function, and training the description model based on second sample multimedia data to obtain a value of a second loss function;
specifically, the embodiment of the present application may train the preset video description model by using the first sample multimedia data with the description label and the second sample multimedia data without the description label at the same time.
The sources of the first sample multimedia data and the second sample multimedia data are not limited in the embodiments of the present application. Taking a video as an example, the original video description corresponding to the first sample video data may be manually annotated by a technician, such as the video shown in fig. 25, and the technician may annotate the video with a video description of "a child is cleaning the floor". The second sample video data may be any acquired video without video description, such as a video acquired from a video website, or a video shot by the user himself. The specific function forms of the first loss function and the second loss function are not limited in the embodiments of the present application, and may be configured according to the actual application requirements.
Step S202: obtaining a final loss function value based on the first loss function value and the second loss function value;
Optionally, each of the different loss functions may also be assigned a respective weight, so that the importance of the loss functions in the training process differs. For example, since the first sample multimedia data has original description labels and the second sample multimedia data does not, the label information of the first sample multimedia data (i.e., the original description labels) is very accurate, and the weight of the first loss function may therefore be greater than the weight of the second loss function. When the different loss functions have respective weights, the final loss function of the multimedia data description model may be determined based on these weights; for example, the final loss function may be a weighted sum of the loss functions.
That is, the step of obtaining a value of a model loss function (which may also be referred to as a final loss function) based on the value of the first loss function and the value of the second loss function may include:
obtaining a corresponding target first loss function value based on the preset weight of the first loss function, and obtaining a corresponding target second loss function value based on the preset weight of the second loss function;
and taking the sum of the value of the target first loss function and the value of the target second loss function as the value of the final loss function.
Specifically, the value of the final loss function can be calculated by the formula:
min_θ J = J_label(θ) + ε · J_unlabel(θ)

where ε is a hyperparameter in this example, J_label(θ) is the first loss function, and J_unlabel(θ) is the second loss function; the weight of the first loss function may be set to 1 and the weight of the second loss function is ε. Thus, the product of the first loss function and its weight is the target first loss function, the product of the second loss function and its weight is the target second loss function, and the sum of the target first loss function and the target second loss function is the final loss function.
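A trivial sketch of this weighted combination (the value of ε is illustrative):

```python
def final_loss(j_label, j_unlabel, epsilon=0.1):
    # Weighted sum of the two losses: the first loss has weight 1 and the second
    # is weighted by the hyperparameter epsilon (the value 0.1 is an assumption).
    return j_label + epsilon * j_unlabel
```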
Step S203: and training the description model based on the value of the final loss function until the final loss function is converged to obtain the trained multimedia data description model.
Specifically, after a final loss function of the video description model is obtained, model parameters of the video description model are updated based on the final loss function until the final loss function converges based on a minimum value, and the trained video description model is obtained. The final loss function of the video description model is determined by the first loss function and the second loss function, and the convergence of the final loss function based on the minimum value may be that the function converges based on the minimum value, or that both the first loss function and the second loss function converge based on the minimum value.
In the embodiments of the present application, when first sample multimedia data with description labels is received, the preset multimedia data description model is trained based on the first sample multimedia data and its description labels to obtain the value of the first loss function; when second sample multimedia data without description labels is received, the description model is trained based on the second sample multimedia data to obtain the value of the second loss function. The final loss function value of the multimedia data description model is then obtained from the first loss function and the second loss function, and the multimedia data description model is trained based on the final loss function until the final loss function converges to a minimum, yielding the trained multimedia data description model.
In this way, the optional embodiments of the present application can train the multimedia data description model using sample data with description labels as well as sample multimedia data without description labels. This greatly reduces the labor and time cost of annotating description information on sample multimedia data, especially when the amount of sample data is large, and improves the accuracy and precision of the multimedia data description model because more sample data can be used. In addition, the algorithm of the embodiments of the present application is applicable to different models, such as RNN-based models or Transformer-based models; it is a general training method.
In an optional embodiment of the application, in the step S201, obtaining a value of the first loss function based on the first sample multimedia data may include:
inputting the first sample multimedia data into a video description model to obtain predicted target description information;
and obtaining the value of the first loss function based on the target description information and the corresponding description label.
Wherein the value of the first loss function characterizes a difference between the target description information obtained based on the model output and the corresponding annotated description information.
As an example, fig. 26 is a schematic diagram illustrating a principle of a training method for a multimedia data description model according to an embodiment of the present application, where the example is illustrated by taking a video as an example, labeled data shown in the diagram corresponds to video data in first sample video data, and unlabeled data shown in the diagram corresponds to video data in second sample video data. This training method will be described below with reference to fig. 26.
As shown in fig. 26, specifically, the video data V with the label may be input into a video description model M, the video data is analyzed and processed through the video description model to generate a corresponding target video description, and then a value of the first loss function may be calculated based on the original video description (corresponding to label y in fig. 26) in the first sample video data and the target video description. In this example, the first loss function may be a cross-entropy loss function, which is shown in formula (1):
J_label(θ) = -Σ_{t=1}^{T} log p_θ(y_t | y_{1:t-1}, V)    formula (1)

where J_label(θ) denotes the cross-entropy loss, θ denotes the model parameters of the video description model, t denotes the current time, T denotes the maximum time, y_t denotes the ground-truth word at the current time, y_{1:t-1} denotes the ground-truth words from time 1 to time t-1, V denotes the video, and p_θ(y_t | y_{1:t-1}, V) denotes the probability that the word predicted by the model at the current time is the corresponding labeled word. The meaning of this loss function is that, when the inputs at the times before the current time are the correct words, the probability that the output at the current time is also the correct word is maximized.
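A minimal sketch of formula (1) with teacher forcing; the captioning model interface (a callable returning per-step logits) is a hypothetical assumption.

```python
import torch
import torch.nn.functional as F

def labeled_loss(caption_model, video_feats, target_ids):
    # caption_model(video_feats, prefix_ids) -> (1, T, vocab) per-step logits
    # (hypothetical interface). With teacher forcing, J_label is the summed
    # token-level cross-entropy: -sum_t log p_theta(y_t | y_{1:t-1}, V).
    logits = caption_model(video_feats, target_ids[:, :-1])
    return F.cross_entropy(logits.reshape(-1, logits.shape[-1]),
                           target_ids[:, 1:].reshape(-1),
                           reduction="sum")
```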
For example, the video shown in fig. 25 is analyzed by the video description model. Assuming the current time is t = 2, the word y_0 output at the initial time t = 0 is "a" and the word y_1 output at time t = 1 is "child"; then, when the word y_1 ("child") input at time t = 2 is the correct word, the probability that y_2 outputs "is" is maximized.
Assuming that the video obtained by analyzing the video shown in fig. 25 through the video description model is described as "a child is sweeping the floor", the video description model is trained based on "a child is sweeping the floor" and the original video description "a child is cleaning the floor".
In practical applications, a word stock may be preset in the embodiments of the present application, and the word output at each time is determined from the word stock. The word y_0 output at the initial time t = 0 can be determined based on the start symbol. For example, for the video shown in fig. 25, the word "a" output at time t = 0 is determined in this way. Of course, in practical applications, other ways may be used to determine the first word of the video description, and the embodiments of the present application are not limited thereto.
In an optional embodiment of the present application, training the description model based on the second sample multimedia data to obtain a value of the second loss function includes:
performing data enhancement on the second sample multimedia data at least once to obtain third sample multimedia data;
inputting the second sample multimedia data into a description model to obtain at least one multimedia description;
determining a score of each multimedia description based on the second sample multimedia data and the third sample multimedia data;
the value of the second loss function is derived based on the scores of the multimedia descriptions.
That is, when the model is trained based on the second sample multimedia data without the description label, the sample multimedia data may be enhanced, scores of the description information obtained through the model based on the second sample multimedia data are determined based on the third sample multimedia data obtained after enhancement and the second sample video data, and a value of the second loss function is obtained based on each score.
Optionally, for example, the second sample multimedia data may be input into the description model to obtain the first description information and the second description information, the first score value of the first description information and the second score value of the second description information may be determined based on the second sample multimedia data and the third sample multimedia data, and the value of the second loss function may be obtained based on the first score value and the second score value.
According to the scheme of the embodiments of the present application, when the multimedia data description model is trained using sample multimedia data without description labels, data enhancement is performed on the second sample multimedia data K times (K ≥ 1) to obtain the third sample multimedia data, and the description model is trained on the basis of both. Since the second sample multimedia data and the third sample multimedia data are the same or similar, their description information should also be the same or similar; on this basis the value of the second loss function can be calculated and the description model trained with it, so that the accuracy and precision of the description model can be further improved.
In an optional embodiment of the present application, inputting the second sample multimedia data into the multimedia data description model to obtain corresponding first description information and second description information, which may specifically include:
inputting the second sample multimedia data into the description model, and determining the first description information from the output result of the description model based on a greedy algorithm;

inputting the second sample multimedia data into the description model, and determining the second description information from the output result of the description model based on probabilistic sampling.
A greedy algorithm (also called greedy search) always makes the choice that looks best at the current step when solving a problem. That is, instead of considering the global optimum, it produces a locally optimal solution in some sense.
Optionally, taking a video as an example, the first description information is a first video description and the second description information is a second video description. The first video description c_g obtained based on the greedy algorithm can be as shown in formula (2):
c_g = {c_g(1), c_g(2), ..., c_g(T)}    formula (2)

where c_g(t) (t = 1, 2, ..., T) denotes the word output at the current time t, and optionally c_g(t) = argmax_{y∈Y} p_θ(y | c_g,1:t-1, V), where V denotes the second sample video data and c_g,1:t-1 denotes the sequence of words output from the initial time up to time t-1, i.e., the words output before the current time. In this case c_g(t) selects the word with the highest probability at the current time t as the output word; the output probability of each candidate word at the current time is determined based on the words output at the times before the current time and the video V, and the word finally output at this time is the candidate word with the highest probability.
After the words output at the respective times are obtained, they are arranged in output order to obtain the first video description c_g.
For the second description information, probability sampling (also called random sampling) means that every unit in the population has a known, non-zero probability of being drawn; the sample is drawn on the basis of probability theory and the principle of randomization. The probability that a population unit is drawn can be specified by the sample design and achieved by some randomization mechanism, although the random sample will generally not be exactly the same as the population.
Optionally, the second video description c_s obtained based on probabilistic sampling can be as shown in formula (3):
c_s = {c_s(1), c_s(2), ..., c_s(T)}    formula (3)

where c_s(t) (t = 1, 2, ..., T) denotes the word output at the current time t, and optionally c_s(t) = multinomial_{y∈Y} p_θ(y | c_s,1:t-1, V), where c_s,1:t-1 denotes the sequence of words output from the initial time up to time t-1, i.e., the words output before the current time, and V denotes the second sample video data. Here c_s(t) is obtained by sampling according to the output probabilities of the words at the current time and outputting the sampled result; the output probability of each candidate word at the current time is determined based on the words output at the times before the current time and the video V, and the word finally output at this time is obtained by randomly sampling from these output probabilities.
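A minimal sketch of the two decoding strategies of formulas (2) and (3); the step_probs interface, which returns the per-step distribution p_θ(· | prefix, V), is a hypothetical assumption.

```python
import torch

def decode(step_probs, video_feats, max_len=20, sample=False, bos=1, eos=2):
    # step_probs(video_feats, prefix) -> (vocab,) distribution p_theta(. | prefix, V)
    # (hypothetical interface). sample=False reproduces c_g (argmax per step);
    # sample=True reproduces c_s (multinomial sampling from the per-step distribution).
    words = [bos]
    for _ in range(max_len):
        probs = step_probs(video_feats, torch.tensor(words))
        nxt = torch.multinomial(probs, 1).item() if sample else probs.argmax().item()
        words.append(nxt)
        if nxt == eos:
            break
    return words[1:]
```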
Still taking the example in fig. 26, for the second sample video data, i.e., the unlabeled data shown in fig. 26, data enhancement may be performed K times to obtain the third sample video data (e.g., the enhanced video data V' in fig. 26). The data enhancement may consist of randomly removing several frames of the video, applying transformations such as rotation and cropping to each frame, or other enhancement methods. The second sample video data is input into the video description model M to obtain the corresponding first video description c_g and second video description c_s, where c_g can be obtained by formula (2) above and c_s by formula (3) above.
Specifically, for example, by analyzing a video with the same content as fig. 25 but without original description information through the video description model, the words (output results) corresponding to the respective times can be obtained as: "a" (corresponding to c_g(1)), "child" (corresponding to c_g(2)), "is" (corresponding to c_g(3)), "cleaning" (corresponding to c_g(4)), and "the floor" (corresponding to c_g(5)), where the output probabilities of the three candidate words at time c_g(4) are 0.5 ("cleaning"), 0.2 ("sweeping"), and 0.3 ("tidying"). Based on the greedy algorithm, "cleaning", the candidate with the highest output probability, is taken as the output word at c_g(4) without considering the other words, and therefore the final c_g generated by the greedy algorithm is "a child is cleaning the floor".
As another example, for c_s(4), the three candidate words are again "cleaning", "sweeping", and "tidying", with output probabilities of 0.5, 0.2, and 0.3, respectively. Without considering the output probabilities of other words, three video descriptions can then be generated: "a child is sweeping the floor" with probability 0.2, "a child is tidying the floor" with probability 0.3, and "a child is cleaning the floor" with probability 0.5. Therefore, the final c_s generated based on probability sampling may be any one of these three video descriptions.
That is, for the above case, if the second sample video data is input to the video description model and 10 video descriptions are generated from the output result based on the greedy algorithm, all 10 descriptions will be "a child is cleaning the floor"; if 10 video descriptions are generated from the output result based on probability sampling, roughly 2 of them may be "a child is sweeping the floor", 3 may be "a child is tidying the floor", and 5 may be "a child is cleaning the floor".
In an optional embodiment of the present application, obtaining the first score value of the first description information and the second score value of the second description information based on the second sample multimedia data and the third sample multimedia data may specifically include:
inputting the first description information into a description model together with the second sample multimedia data and the third sample multimedia data respectively to obtain a first output probability distribution of the second sample multimedia data and a second output probability distribution of the third sample multimedia data respectively, and obtaining a first score value of the first description information based on the first output probability distribution and the second output probability distribution;
and inputting the second description information into the description model together with the second sample multimedia data and the third sample multimedia data respectively to obtain a third output probability distribution of the second sample multimedia data and a fourth output probability distribution of the third sample multimedia data respectively, and obtaining a second score value of the second description information based on the third output probability distribution and the fourth output probability distribution.
Specifically, taking a video as an example, for the first video description: the first video description is input, as the ground truth, together with the second sample video data into the video description model to obtain the first output probability distribution of the first video description at each time; meanwhile, the first video description is input, as the ground truth, together with the third sample video data into the video description model to obtain the second output probability distribution of the first video description at each time. The KL divergence between the first output probability distribution and the second output probability distribution can then be calculated, and the first score value r_g is obtained based on this KL divergence.
As an alternative, the KL divergence may be multiplied by a temporal weight and negated to obtain the first score r_g of the first video description, as shown in formula (4):

r_g = -Σ_{t=1}^{T} W_t · KL( p_θ(· | c_g,1:t-1, V) || p_θ(· | c_g,1:t-1, V') )    formula (4)

where W_t is a temporal weight that gives higher weight to the earlier words of the first video description and lower weight to the later words, so as to balance the problem of error accumulation. Since the enhanced video V' may include K instances, K values of r_g may be obtained; the final first score value r'_g can then be obtained from these K values, for example by averaging them, or in other ways such as a weighted average in which the videos V' obtained by different enhancement processing have different weights.
Similarly, for the second video description, the second video description is used as the ground truth and is input into the video description model together with the second sample video data to obtain a third output probability distribution of the second video description at each moment; meanwhile, the second video description is used as the ground truth and is input into the video description model together with the third sample video data to obtain a fourth output probability distribution of the second video description at each moment. The KL divergence between the third output probability distribution and the fourth output probability distribution is then calculated, multiplied by the temporal weight and negated to obtain the second score value r_s of the second video description, as shown in formula (5):

r_s = -\sum_{t=1}^{T} W_t \, D_{\mathrm{KL}}\left(q_t(V) \,\|\, q_t(V')\right)    (5)

where q_t(V) and q_t(V') denote the third and fourth output probability distributions at moment t. Since V' includes K pieces of data, the K values of r_s can, for example, be averaged to obtain the second score value r'_s.
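As a minimal sketch under assumed array shapes (not the patent's code), the following computes a time-weighted, negated KL-divergence score of the kind used for r'_g in formula (4) (and analogously for r'_s in formula (5)), averaged over the K enhanced videos; the form W_t = (T - t)/T is the assumption stated above.

```python
import numpy as np

def kl_divergence(p, q, eps=1e-12):
    # KL divergence between two discrete distributions over the vocabulary.
    p = np.asarray(p, dtype=float) + eps
    q = np.asarray(q, dtype=float) + eps
    return float(np.sum(p * np.log(p / q)))

def score_against_enhanced(dist_orig, dists_enh):
    # dist_orig: (T, vocab) output distributions obtained with the original video V
    # dists_enh: list of K arrays, each (T, vocab), obtained with the K enhanced videos V'
    T = dist_orig.shape[0]
    weights = np.array([(T - t) / T for t in range(1, T + 1)])   # assumed form of W_t
    scores = []
    for dist_e in dists_enh:
        kl_per_step = np.array([kl_divergence(dist_orig[t], dist_e[t]) for t in range(T)])
        scores.append(-float(np.sum(weights * kl_per_step)))     # negate: small KL => high score
    return float(np.mean(scores))                                 # average over the K enhanced videos
```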
After r'_g and r'_s are obtained, the value of the second loss function of the video description model can be calculated from these two score values.
In the example shown in fig. 26, after the first video description c_g is obtained, c_g and the second sample video data V are input into the video description model M, and the first output probability distribution at each moment of c_g is obtained based on the model output. For each enhanced video V' (one V' is shown in the figure), c_g and the video V' are input into the video description model M to obtain the second output probability distribution at each moment of c_g corresponding to that V'. The value of r_g corresponding to the first output probability distribution and each second output probability distribution is then calculated by formula (4), and r'_g (the KL divergence corresponding to c_g in the figure) can be obtained by averaging or other means. By the same calculation principle, r'_s (the KL divergence corresponding to c_s in the figure) can be obtained based on formula (5).
In an alternative embodiment of the present application, the step of obtaining the value of the second loss function based on the first score value and the second score value may include:
taking the difference value of the first score value and the second score value as a reward value;
a second loss function describing the model is derived based on the reward and the second description information.
Specifically, the second loss function may be a policy gradient loss function, whose gradient is shown in formula (6):

\nabla_\theta L_2(\theta) = -\,(r'_s - r'_g)\,\nabla_\theta \log p_\theta(c_s)    (6)

where (r'_s - r'_g) is the reward value, i.e. the difference between the second score value and the first score value, θ denotes the parameters of the description model, p_θ(c_s) is the probability of the sampled description c_s under the current model, and ∇_θ denotes the gradient with respect to θ. After the second loss function is obtained, the description model is trained using the policy gradient. From the above description it can be seen that, if the words obtained by sampling are more correct, the KL divergence between the third output probability distribution and the fourth output probability distribution is smaller and the reward is larger, so that after the model is updated the probability of outputting these words becomes larger. Conversely, if the words obtained by sampling are poor, the KL divergence between the third output probability distribution and the fourth output probability distribution is large, the reward is small, and after the model is updated the probability of outputting these words becomes smaller.
Still taking video as an example, as shown in fig. 26, a reward value (the reward shown in the figure) may be obtained based on r'_g (the KL divergence corresponding to c_g in the figure) and r'_s (the KL divergence corresponding to c_s in the figure), and the value of the policy gradient loss may be calculated from this reward by formula (6) above, so that the value of the final loss function is derived from the value of the first loss function (i.e. the value of the cross-entropy loss shown in fig. 26) and the value of the second loss function (i.e. the value of the policy gradient loss shown in fig. 26).
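The following is a minimal PyTorch-style sketch, under stated assumptions, of the policy-gradient loss in formula (6): the reward is r'_s - r'_g, and `log_probs_sampled` is assumed to hold the per-step log-probabilities of the sampled description c_s under the current model.

```python
import torch

def policy_gradient_loss(log_probs_sampled: torch.Tensor, r_s: float, r_g: float) -> torch.Tensor:
    # log_probs_sampled: per-step log p_theta of the sampled description c_s
    reward = r_s - r_g                          # difference of the two score values
    return -(reward * log_probs_sampled.sum())  # minimizing this maximizes the expected reward
```

Calling `loss.backward()` on this value propagates a gradient scaled by the reward, so sampled words with smaller KL divergence (larger reward) become more likely after the update, matching the behavior described above.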
As is apparent from the above description, only one kind of description information may be generated based on the second sample multimedia data, in which case the value of the second loss function may be obtained based on that description information alone. Taking formula (6) above as an example, r'_g may be removed, so that formula (6) is rewritten as follows:

\nabla_\theta L_2(\theta) = -\,r'_s\,\nabla_\theta \log p_\theta(c_s)

That is, the corresponding score value (the second score value in the above example) may be obtained based only on the second description information, and the value of the second loss function may be obtained based on the second description information and this score value.
At present, in a data set commonly used in the field of video description or image description, description labels of videos or images are generally few, for example, description labels of one training sample image are generally only 5, and it is generally difficult to fully express information in the image by using only 5 description labels. In order to improve the diversity of the description of the training sample, the embodiment of the application also provides a method for obtaining the description of the multimedia data, and data enhancement can be performed on the description label of the sample multimedia data based on the method to obtain enhanced description information so as to increase the description number of sample data, so that a multimedia data description model with better effect can be obtained by training based on the sample data added with the enhanced description information.
Accordingly, in an alternative embodiment of the present application, the description label of the first sample multimedia data may include at least one original description label of the first sample multimedia data, and an enhanced description label corresponding to each original description label.
Fig. 27 is a flowchart illustrating a method for obtaining a multimedia data description according to an embodiment of the present application, where the method may include:
step S2501: acquiring at least one original description label corresponding to the multimedia data;
for the first sample multimedia data, the original description labels of the first sample multimedia data are obtained.
The multimedia data may be sample data in a training image database or a training video database acquired from a local storage or a local database as needed, or may be training samples in a training image database or a training video database received from an external data source through an input device or a transmission medium. Taking an image as an example, the training image may include a predetermined number N of image description labels, where N may be a positive integer not less than 1. For example, the image in this scheme may be a training image in a training image database (e.g., data set MS-COCO) commonly used in the field of image description, where the image in the commonly used training image database generally has 5 image description labels, and the 5 image description labels for the same training image are different from each other but have similar semantics.
Step S2502, according to each original description label corresponding to the multimedia data, enhanced description information corresponding to each original description label, namely enhanced description information, is respectively generated.
Specifically, the enhanced description information corresponding to each original description label can be generated by a generator according to each original description label corresponding to the multimedia data. The generator is used to generate a description sentence whose semantics are similar to those of the original description label. That is, when the sentence of an original description label is input into the generator, the generator may generate, based on that sentence, a description sentence that has similar semantics but differs from the original description label.
The process of generating a sentence by the generator is a sequential (time-step) process. As an optional mode, the sentence can generally be generated by greedy decoding: at the first moment, the input word vector is the initial character and the output is the first word with the largest predicted output probability; at the second moment, the input is the initial character together with the word output at the first moment, and the output is the second word with the largest predicted output probability; and so on, until the output word is the termination character.
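A minimal sketch of the greedy decoding loop just described is given below; `step` is an assumed callable that returns the next-word distribution given the tokens generated so far, and `bos_id`/`eos_id` are the assumed initial and termination symbols.

```python
def greedy_decode(step, bos_id: int, eos_id: int, max_len: int = 20):
    tokens = [bos_id]
    for _ in range(max_len):
        probs = step(tokens)                       # distribution over the vocabulary
        next_id = max(range(len(probs)), key=probs.__getitem__)
        if next_id == eos_id:                      # stop at the termination character
            break
        tokens.append(next_id)
    return tokens[1:]                              # drop the initial character
```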
The specific network structure of the generator is not limited in the embodiments of the present application, and as an alternative, the generator may be implemented by a self-attention-based encoder and a self-attention-based decoder.
As two examples, fig. 28a and 28b show schematic diagrams of a network structure of a generator provided by an embodiment of the present application. The generator may be composed of a self-attention-based encoder (e.g., a Transformer encoder) and a self-attention-based decoder (e.g., a Transformer decoder). As shown in fig. 28a and 28b, the encoder in the generator may be composed of a multi-head attention layer, a layer normalization layer and a feed-forward network layer, and is used to encode the input original description label. The decoder in the generator may be composed of a masked multi-head attention layer, a multi-head attention layer, a layer normalization layer and a feed-forward network layer, and is configured to decode the encoded image description label or video description label to obtain enhanced image description information or enhanced video description information. For a detailed description of the structures of the parts of the encoder and decoder shown in fig. 28a and 28b, reference may be made to the corresponding descriptions of the encoder and decoder shown in fig. 13, 22 or 23 above.
It should be noted that the network structure of the generator in the embodiment of the present application may include, but is not limited to, the structure shown in the above example, and the generator may be implemented by using any other available encoder and decoder.
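As an illustrative sketch only (not the patent's exact generator), the following shows how a self-attention-based encoder-decoder could be assembled with PyTorch's `nn.Transformer` to map an original description label to an enhanced description; the vocabulary size, model dimension and layer counts are assumed values.

```python
import torch
import torch.nn as nn

class ParaphraseGenerator(nn.Module):
    def __init__(self, vocab_size=10000, d_model=512, nhead=8, num_layers=3):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, d_model)
        self.transformer = nn.Transformer(d_model=d_model, nhead=nhead,
                                          num_encoder_layers=num_layers,
                                          num_decoder_layers=num_layers,
                                          batch_first=True)
        self.out = nn.Linear(d_model, vocab_size)

    def forward(self, src_ids, tgt_ids):
        # src_ids: tokens of the original description label; tgt_ids: shifted target tokens
        tgt_mask = self.transformer.generate_square_subsequent_mask(tgt_ids.size(1))
        hidden = self.transformer(self.embed(src_ids), self.embed(tgt_ids), tgt_mask=tgt_mask)
        return self.out(hidden)                    # per-step vocabulary logits
```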
Wherein the generator also needs to be trained in order to guarantee the accuracy of the enhanced image description information generated by the generator. As an alternative, the generator may be trained by:
acquiring a training database, wherein the training database comprises a plurality of training sample data, each training sample data comprises N original description labels, and N is a positive integer not less than 1;
training a generator based on original description labels of a plurality of training sample data in a training database, wherein the generator is used for generating description information with similar semantics and different from the original description labels.
In addition, in order to improve the effect of the generator, as an optional scheme, when the generator is trained, a discriminator may be introduced, and the generator is trained by adopting a counter-training mode, specifically, the step of training the generator may include:
and alternately training the generator and the discriminator until the similarity value of the description information generated by the generator aiming at each original description label of each training sample data meets a preset condition, wherein the discriminator is particularly used for discriminating the probability that the description information generated by the generator is a real original description label.
The specific network structure of the discriminator can be configured according to actual requirements. It will be appreciated that the arbiter also needs to be trained. When the trained discriminators discriminate that the probability that the description sentence generated by the generator is a true original description label is high (e.g., exceeds a predetermined threshold), it is illustrated that the description sentence generated by the generator is close to the description of the true sample (i.e., the true original description label), which can "trick" the trained discriminators. In this case, such a description sentence can be applied as enhanced description information in the training process to achieve an improvement in sample diversity.
Specifically, during training, the generator and the discriminator may be alternately trained until the similarity value of the description information generated by the generator for each original description label of each training sample data satisfies a preset condition. The specific calculation manner of the similarity value is not limited in the embodiments of the present application.
As an alternative, the similarity value may be a CIDEr value. CIDEr is a commonly used evaluation index for evaluating description performance; the higher the CIDEr value, the more similar the generated description sentence is to the real original description labels. The CIDEr index treats each sentence as a "document", represents it as a tf-idf vector, and calculates the cosine similarity between the reference (i.e. real) description sentences and the generated description sentence as the score, thereby obtaining the CIDEr value. Therefore, according to an exemplary embodiment of the present application, the CIDEr value of a generated description sentence (image description sentence or video description sentence) may be calculated based on the similarity between the description sentence and the N original description labels of the training sample data to which the original description label used to generate the sentence belongs. For example, taking an image as an example, when the CIDEr values of the image description sentences generated by the generator for each original image description label of each training image satisfy a preset condition, the generator can already generate image description sentences whose semantics are similar to the real image description labels, i.e. the training of the generator and the discriminator can be completed.
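The following is a simplified sketch of the tf-idf-plus-cosine-similarity idea behind CIDEr (the real metric also uses n-grams up to length 4 and corpus-level idf statistics); it is shown only to illustrate how a generated sentence could be scored against the N original description labels.

```python
from collections import Counter
import math

def tf_vector(sentence):
    words = sentence.lower().split()
    counts = Counter(words)
    total = len(words)
    return {w: c / total for w, c in counts.items()}

def cosine(a, b):
    dot = sum(a[w] * b.get(w, 0.0) for w in a)
    na = math.sqrt(sum(v * v for v in a.values()))
    nb = math.sqrt(sum(v * v for v in b.values()))
    return dot / (na * nb) if na and nb else 0.0

def similarity_score(generated, references):
    # Average cosine similarity of the generated sentence to the N reference labels.
    gen_vec = tf_vector(generated)
    return sum(cosine(gen_vec, tf_vector(r)) for r in references) / len(references)
```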
The preset condition may include that a similarity value of the description information generated for each original description label of each training sample data reaches a predetermined threshold, or an average similarity value of the image description information generated for each original description label of each training sample data reaches a predetermined threshold. The preset condition may be a default of the system or may be set by a user as needed or experienced. In addition, whether the training of the generator and the discriminator is completed can be determined according to the needs or experience of the user. For example, when the generator and the arbiter are trained to a certain extent, the user can use a batch of training sample data to test the generator, see whether the output of the generator is satisfactory, and when the output of the generator is satisfactory, the training of the generator and the arbiter can be completed.
In an alternative embodiment of the present application, the step of training the generator and the arbiter alternately comprises:
training the discriminator under the condition of fixing the parameters of the generator;
the generator is trained with the parameters of the trained discriminators fixed.
That is, in training the generator and the discriminator alternately, the discriminator may be trained first with the parameters of the generator fixed, and then with the parameters of the trained discriminator fixed. The training process may be repeated for different sets of training sample data. For example, taking an image as an example, for the first training image set, training of the discriminators and the generator is performed once on the basis of the original parameters (i.e., network structure parameters) of the generators and discriminators. Subsequently, for the second training image set, training of the discriminator and generator is performed once more, based on the parameters of the discriminator and generator trained for the first training image set. Subsequently, for the third training image set, training of the discriminator and generator is performed once more, based on the parameters of the discriminator and generator trained for the second training image set. And so on until the similarity value of the image description information generated by the generator aiming at each image description label of each training image meets the preset condition or the output result of the user test generator is satisfactory.
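A minimal sketch of this alternating (adversarial) training schedule is given below, assuming the generator and discriminator are `torch.nn.Module` instances and that `d_step`/`g_step` are caller-supplied training-step callables; none of these names come from the patent.

```python
def train_adversarial(generator, discriminator, data_batches, d_step, g_step):
    for batch in data_batches:
        for p in generator.parameters():          # fix generator parameters
            p.requires_grad = False
        d_step(discriminator, generator, batch)   # train the discriminator

        for p in generator.parameters():
            p.requires_grad = True
        for p in discriminator.parameters():      # fix trained discriminator parameters
            p.requires_grad = False
        g_step(generator, discriminator, batch)   # train the generator

        for p in discriminator.parameters():
            p.requires_grad = True
```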
In an alternative embodiment of the present application, the arbiter may be trained by:
the following operations are performed for each original description label of each training sample data (the number of the original description labels of the sample data in the operation mode is greater than 1, that is, N is greater than 1):
pairing the original description label with the other N-1 original description labels of the training sample data respectively to generate N-1 first pairings; inputting the original description label into the generator, generating description information with the generator, and pairing the generated description information with the original description label to generate a second pairing; based on the N-1 first pairings and the second pairing, the discriminator can be trained with a cross-entropy loss function, where the output of the discriminator is, for each pairing, the probability value that the pairing consists of two real original description labels.
That is, based on an original description label (referred to as the benchmark label for short), N-1 reference pairs, i.e. sample pairs, are obtained by pairing it with the N-1 other original description labels; based on the benchmark label, N-1 pieces of description information can be generated with the generator, and the benchmark label is paired with each generated piece of description information to obtain the prediction pairs. The value of the loss function is calculated based on the corresponding sample pairs and prediction pairs, and the network parameters of the discriminator are adjusted based on the value of the loss function until a preset condition is satisfied, for example, for an image, until the probability value output by the discriminator for each reference pair (i.e. a pairing of two real image description labels) is greater than a set threshold.
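The sketch below, under assumed interfaces, builds the pairings used to train the discriminator: the N-1 "real" pairings of a benchmark label with the other original labels (target 1) and a "fake" pairing with a generated description (target 0), scored with binary cross-entropy. `discriminator(a, b)` is assumed to return a probability tensor and `generator.generate` is an assumed greedy-decoding helper.

```python
import torch
import torch.nn.functional as F

def discriminator_loss(discriminator, generator, labels):
    # labels: list of the N original description labels (token tensors) of one sample
    losses = []
    for i, benchmark in enumerate(labels):
        for j, other in enumerate(labels):
            if i == j:
                continue
            real_prob = discriminator(benchmark, other)            # first pairing (real/real)
            losses.append(F.binary_cross_entropy(real_prob, torch.ones_like(real_prob)))
        generated = generator.generate(benchmark)                  # assumed greedy decoding
        fake_prob = discriminator(benchmark, generated)            # second pairing (real/generated)
        losses.append(F.binary_cross_entropy(fake_prob, torch.zeros_like(fake_prob)))
    return torch.stack(losses).mean()
```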
In an alternative embodiment of the present application (which may be referred to as scheme one for short), the step of training the generator with the parameters of the trained discriminator fixed may include:
performing the following operations for each original description label of each training sample data:
inputting the original description label into a generator, and generating description information by using the generator according to a greedy decoding mode; and for the generated description information, performing the following operations:
calculating a similarity value corresponding to the generated description information based on the generated description information and the N original description labels of the corresponding training images; pairing the generated description information with the original description label to generate a second pairing; obtaining, with the trained discriminator, the probability value that the second pairing consists of two real original description labels; weighting and summing the calculated similarity value and the obtained probability value to obtain a reward; and adjusting the parameters of the generator according to the obtained reward.
In another alternative embodiment of the present application (which may be referred to as scheme two for short), the step of training the generator with the parameters of the trained discriminator fixed may include:
performing the following operations for each original description label of each training sample data:
inputting the original description label into a generator, and generating first description information by using the generator according to a greedy decoding mode;
inputting the original description label into a generator, and generating second description information by using the generator according to a Monte Carlo sampling mode;
for the generated first description information, the following operations are performed:
calculating a first similarity value corresponding to the generated first description information based on the generated first description information and N original description labels of corresponding training images, pairing the generated first description information and the original description labels to generate a second pairing, obtaining a first probability value of two real original description labels of the second pairing by using a trained discriminator, and carrying out weighted summation on the calculated first similarity value and the obtained first probability value to obtain a first reward;
for the generated second description information, the following operations are performed:
calculating a second similarity value corresponding to the generated second description information based on the generated second description information and the N original description labels of the corresponding training images, pairing the generated second description information and the original description labels to generate a second pairing, obtaining second probability values of the two real original description labels of the second pairing by using a trained discriminator, and performing weighted summation on the calculated second similarity value and the obtained second probability values to obtain a second reward;
and adjusting the parameters of the generator according to the difference value of the first reward and the second reward as a final reward.
In practical applications, because text data is discrete, it is difficult to back-propagate the gradient of the discriminator to the generator. To solve this problem, as an alternative, a policy gradient method may be adopted: a reward is calculated based on the description sentences generated by the generator; the higher the reward, the better the currently generated description sentences are, and the more the parameters of the generator are adjusted in that direction. In the traditional method the reward consists only of the output of the discriminator, whereas the reward in the above optional embodiment of the present application may include two parts, namely the output of the discriminator and the similarity value (e.g. the CIDEr value), and the final reward is obtained by a weighted summation of the two. Since the reward used to adjust the generator's parameters is determined from more diverse data, the generator can learn more information and can generate enhanced description information that is similar to but different from the original description labels. Better enhanced description information can thus be obtained from the trained generator, providing a larger and better data basis for training the multimedia data description model on sample data containing the enhanced description information.
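A minimal sketch of this reward construction follows: the discriminator probability and the similarity (e.g. CIDEr) value are weighted and summed, and under the self-evaluation variant the greedy reward is used as a baseline. The weight `tau` and all inputs are assumptions for illustration.

```python
def reward(disc_prob: float, cider: float, tau: float = 0.5) -> float:
    # Weighted sum of the discriminator output and the similarity value.
    return tau * disc_prob + (1.0 - tau) * cider

def final_reward(sampled_disc, sampled_cider, greedy_disc, greedy_cider, tau=0.5):
    # Scheme two: reward of the Monte Carlo sample minus the greedy baseline.
    return reward(sampled_disc, sampled_cider, tau) - reward(greedy_disc, greedy_cider, tau)
```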
For better understanding and description of the training scheme of the generator provided in the embodiments of the present application, the training scheme is further described in detail below with reference to fig. 28a and 28b, respectively. The example is given by way of example in which the multimedia data is an image, it being understood that the principles of the example apply equally to video.
As an alternative example, corresponding to scheme one, as shown in fig. 28a, when the generator is trained, the following operations may be performed for each image description label of each training image: the image description label (X_{1:T} in the figure) is input into the generator, and image description information is generated by the generator with greedy decoding; for the generated image description information (e.g. Y_{1:T}), the following operations are performed: a similarity value corresponding to the generated image description information is calculated based on the generated image description information and the N image description labels of the corresponding training image; the generated image description information is paired with the image description label to generate a second pairing (e.g. (X_{1:T}, Y_{1:T})); the probability value that the second pairing consists of two real image description labels is obtained with the trained discriminator; the calculated similarity value and the obtained probability value are weighted and summed to obtain the reward; and the parameters of the generator are adjusted according to the obtained reward.

Specifically, as shown in fig. 28a, the CIDEr value calculated for the image description sentence y_b generated by the generator with greedy decoding and the probability value obtained by the discriminator for that same sentence y_b are weighted and summed to obtain the reward r(y_b). The formula for the weighted sum is as follows:
r(y_b) = \tau \cdot D(X_{1:T}, y_b) + (1 - \tau) \cdot C(y_b)

where r (corresponding to r(y_b) in fig. 28a) is the reward, τ is a weight coefficient, D(X_{1:T}, y_b) is the probability value output by the discriminator, and C(y_b) is the CIDEr value (corresponding to the CIDEr score in fig. 28a).
In the example shown in fig. 28a, the structure of the discriminator may be a CNN-based structure, such as may include the convolutional layer, max-pooling layer, etc. shown in the figure. Specifically, for each pair of image description labels and corresponding generated image description information, that is, each second pair, the pair may be processed by embedding to obtain a corresponding feature vector, convolution processing may be performed on convolution layers using various convolution processing parameters based on the feature vector, pooling processing is performed on each convolution result through a maximum pooling layer, the pooling results are concatenated, and a probability value corresponding to each second pair is obtained based on vector prediction after concatenation.
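As a minimal sketch of a CNN-based discriminator of the kind outlined for fig. 28a, the following embeds the pairing, applies convolutions with several kernel sizes, max-pools, concatenates the pooling results, and maps them to a probability; all sizes are assumed values.

```python
import torch
import torch.nn as nn

class PairDiscriminator(nn.Module):
    def __init__(self, vocab_size=10000, embed_dim=128, kernel_sizes=(2, 3, 4), channels=64):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, embed_dim)
        self.convs = nn.ModuleList([nn.Conv1d(embed_dim, channels, k) for k in kernel_sizes])
        self.fc = nn.Linear(channels * len(kernel_sizes), 1)

    def forward(self, pair_ids):                      # pair_ids: (batch, seq_len) token ids of a pairing
        x = self.embed(pair_ids).transpose(1, 2)      # (batch, embed_dim, seq_len)
        pooled = [conv(x).relu().max(dim=2).values for conv in self.convs]
        features = torch.cat(pooled, dim=1)           # concatenate the pooling results
        return torch.sigmoid(self.fc(features))       # probability that the pair is two real labels
```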
In addition, according to another alternative, namely the second alternative, in order to further improve the effect, a self-evaluation mechanism may be used to train the generator, that is, a difference between rewards of an image description sentence obtained by monte carlo sampling and an image description sentence obtained by greedy decoding is used as a final reward, and parameters of the generator are adjusted according to the obtained final reward.
In the example shown in fig. 28b, the CIDEr value calculated for the image description sentence y_b generated by the generator with greedy decoding and the probability value obtained by the discriminator for that sentence y_b are weighted and summed to obtain the reward r(y_b). Likewise, the CIDEr value calculated for the image description sentence y_s generated by the generator with Monte Carlo sampling and the probability value obtained by the discriminator for that sentence y_s are weighted and summed to obtain the reward r(y_s). The parameters of the generator are then adjusted with r(y_s) - r(y_b) as the final reward. The specific scheme for weighting and summing the CIDEr value and the probability value of the sentence y_s obtained by Monte Carlo sampling to obtain the reward r(y_s) is the same in principle as the description of r(y_b) for fig. 28a above, and is not repeated here.
In an optional embodiment of the present application, in order to avoid repeated information in each generated enhanced description information, the method may further include:
when repeated enhanced description information exists, the original description label corresponding to the repeated enhanced description information is input into the generator again, and the enhanced description information is regenerated by the generator with an adjusted beam width based on a beam search method.
Taking the image as an example, after the generator and the discriminator are trained, the trained generator can be used to generate enhanced image description information corresponding to each image description label according to each image description label corresponding to the image. These enhanced image descriptions are different from the actual image description annotations, but they may be repeated with respect to each other.
In order to solve the problem that the enhanced description information may be duplicated, as an alternative, a beam search method may be adopted to regenerate the enhanced description information. The generator generates enhanced description information based on the maximum probability, that is, the word with the maximum predicted probability is output at each moment (corresponding to a beam width of 1), and the beam search method can change the generation result of the generator by changing the beam width (e.g., 2, 3, etc.). For example, when there are two identical enhanced descriptions, the real description (i.e. the original description label) corresponding to one of them may be input into the generator with the beam width set to 2, and a different enhanced description is generated by the generator. Concretely, at the first moment the generator may output the two words with the highest probabilities, say {a} and {b}; at the next moment it outputs, for each of {a} and {b}, the two words with the highest probabilities, giving, say, {a, c}, {a, d}, {b, e} and {b, f}; the two sequences with the highest probabilities, say {a, c} and {b, e}, are then kept from these four sequences; and so on for the later moments. As another example, when there are three identical enhanced descriptions, the original description label corresponding to one of them may be input into the generator with the beam width set to 2 to generate a different enhanced description, and the real description corresponding to another one may be input into the generator with the beam width set to 3 to regenerate a different enhanced description, and so on. In this way, enhanced description sentences can be generated with different beam widths, changing the generated results and solving the duplication problem.
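A minimal beam-search sketch, under assumed interfaces, is shown below; `step` returns the log-probabilities of the next token given a prefix, and re-running with a different `beam_width` can yield a different enhanced description when duplicates occur, as described above.

```python
def beam_search(step, bos_id, eos_id, beam_width=2, max_len=20):
    beams = [([bos_id], 0.0)]                        # (token sequence, cumulative log-probability)
    for _ in range(max_len):
        candidates = []
        for seq, score in beams:
            if seq[-1] == eos_id:                    # keep finished sequences as-is
                candidates.append((seq, score))
                continue
            log_probs = step(seq)
            top = sorted(range(len(log_probs)), key=log_probs.__getitem__, reverse=True)[:beam_width]
            candidates.extend((seq + [i], score + log_probs[i]) for i in top)
        beams = sorted(candidates, key=lambda c: c[1], reverse=True)[:beam_width]
    return beams[0][0][1:]                           # best sequence, initial character dropped

# If two enhanced descriptions collide, re-run with beam_width=2, then 3, and so on.
```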
After the enhanced description information is obtained by adopting the method for obtaining the enhanced description information provided by the application embodiment, each original description label of the multimedia data and each corresponding enhanced description information can be used as the label information of the multimedia data, and an initial multimedia data description model is trained on the basis of a multimedia data sample containing more label information, so that a description model with a better effect is obtained. Specifically, taking an image as an example, the image description model may be obtained by training in the following manner:
acquiring training samples, wherein each sample image in the training samples corresponds to labeling information, and the labeling information comprises at least one image description label of the sample image and enhanced image description information corresponding to each image description label;
training the initial image description model based on each sample image until a preset training end condition is met to obtain a trained image description model;
for each sample image, the enhanced image description information corresponding to the sample image is obtained by the method for obtaining the image description provided in any optional embodiment of the present application. The specific network structure of the image description model is not limited in the embodiments of the present application; for example, it may be an image description model based on an encoder and a decoder, such as the codec-based image description model shown in fig. 22 or fig. 23.
As an example, fig. 29 is a flowchart illustrating a method for training an image description model provided in an embodiment of the present application, where the image description model is an image description model based on an encoder and a decoder, and as shown in the diagram, the method may include the following steps:
step S2701: performing a first training on the encoder and decoder with a cross entropy loss function for each training image (i.e., sample image) in a training image database;
optionally, the training image in this step may be a training image in a training image database, and the training image may include a predetermined number N of image description labels, where N may be a positive integer greater than or equal to 1. For example, the training image may be a training image typically having 5 image description labels in a training image database (e.g., data set MS-COCO) commonly used in the field of image description.
Specifically, description information corresponding to the training image may be obtained by using the method shown in fig. 24, enhanced image description information corresponding to the training image may be obtained by the method shown in fig. 27, and the encoder and the decoder may be trained with a cross-entropy loss function based on the obtained description information, the image description labels of the training image, and the enhanced image description information. For example, the training may be performed with the following cross-entropy loss function:
J_{xe}(\theta) = -\sum_{t=1}^{T} \log p_\theta\left(y_t \mid y_{1:t-1}, I\right)

where J_xe(θ) represents the loss, θ represents the parameters of the Transformer encoder 302 and the Transformer decoder 303, t represents the current moment, T represents the maximum moment, y_t represents the word output at the current moment, y_{1:t-1} represents the ground-truth words of the previous moments, I represents the current image, and p_θ represents the probability that the output word is the ground truth. Here, the first image description sentence is composed of the words with the maximum output probability at each moment, so y_t at each moment can be obtained from the first image description sentence. Furthermore, the ground-truth words at the various moments can be obtained from each of the image description labels and the enhanced image descriptions of the training image.
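The following is a minimal PyTorch-style sketch of this cross-entropy (first) training stage: the model's per-step vocabulary logits are scored against the ground-truth words of an image description label or an enhanced description. The model call signature and tensor shapes are assumptions for illustration.

```python
import torch.nn.functional as F

def cross_entropy_step(model, image_features, target_ids, optimizer):
    # target_ids: (batch, T) ground-truth word ids; logits: (batch, T-1, vocab)
    logits = model(image_features, target_ids[:, :-1])
    loss = F.cross_entropy(logits.reshape(-1, logits.size(-1)),
                           target_ids[:, 1:].reshape(-1))
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()
```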
Step S2702: when the encoder and decoder are trained based on the first training, a second training may be performed on the encoder and decoder trained by the first training using a policy gradient and/or a self-evaluation mechanism for each training image in the training image database.
Specifically, the policy gradient is used because the optimization target of the cross-entropy loss differs from the index (e.g. the CIDEr value) used to evaluate the description; to address this, the policy gradient is used to directly optimize the CIDEr value.
The formula is as follows:
J(\theta) = -\,\mathbb{E}_{y_s \sim p_\theta}\left[r(y_s)\right]

where J(θ) is the loss, θ represents the encoder and decoder parameters, E represents the expectation, y_s is a sampled image description sentence, r(y_s) is the CIDEr value, i.e. the reward, and y_s ∼ p_θ represents the set of image description sentences sampled with the existing network parameters.
The self-evaluation mechanism sets the reward to the difference between the CIDEr value of the image description sentence obtained by Monte Carlo sampling and the CIDEr value of the image description sentence obtained by greedy decoding, i.e. the reward is constrained by the greedy-decoding baseline, which gives a better effect. The formula is as follows:

\nabla_\theta L(\theta) = -\left(r(y_s) - r(\hat{y})\right)\nabla_\theta \log p_\theta(y_s)

where \hat{y} is the image description sentence obtained by greedy decoding, y_s is the image description sentence obtained by Monte Carlo sampling, r is the calculated CIDEr value, ∇_θ L(θ) is the gradient of the loss, and p_θ(y_s) is the probability corresponding to sampling y_s.
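A minimal sketch of the self-evaluation (self-critical) update implied by this formula is given below: the CIDEr reward of the sampled sentence minus that of the greedy sentence scales the log-probability of the sample. The tensor type of `log_probs_sampled` and the callables that produce the CIDEr values are assumptions.

```python
def self_critical_loss(log_probs_sampled, cider_sampled: float, cider_greedy: float):
    # log_probs_sampled: per-step log p_theta(y_s); the advantage acts as a constant weight.
    advantage = cider_sampled - cider_greedy
    return -(advantage * log_probs_sampled.sum())
```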
Optionally, when performing the second training, a first image description sentence may be obtained by greedy decoding with reference to the method shown in fig. 24, and a second image description sentence may be obtained by Monte Carlo sampling according to the same method, so as to perform the second training on the encoder and decoder that underwent the first training by using the policy gradient and self-evaluation mechanism. Specifically, the CIDEr value of the first image description sentence may be calculated based on its similarity to the N image description labels of the corresponding training image, the CIDEr value of the second image description sentence may be calculated based on its similarity to the N image description labels of the corresponding training image, the difference between the CIDEr value of the second image description sentence and that of the first image description sentence may be calculated to obtain the reward, and the parameters of the encoder and decoder from the first training may be adjusted according to the obtained reward.
It can be understood that the training method of the image description model provided in the embodiment of the present application is only an optional training method. As long as the enhanced image descriptions obtained by the scheme for obtaining enhanced image descriptions provided in the embodiment of the present application are also used as annotation information of the sample images, the quantity and diversity of the annotation data of the sample images are increased, and training the image description model on training data containing such sample images can effectively improve the effect of the model. It is also clear to a person skilled in the art that the above scheme applied to images is equally applicable to the processing of video, and the principle is the same.
Furthermore, the expressive power of the description labels of the training samples is improved. In the optional embodiments of the application, a generative adversarial network can be used to enhance the description data, and the samples with enhanced description data are used to train the description model, so that the sample diversity is improved and the effect of the video description model or the image description model is improved.
The following describes the generating method of the description information of the multimedia data provided in the present application in a general manner with reference to two schematic diagrams.
Taking a video as an example, fig. 30 shows a flowchart of a method for generating video description information according to the present application. As shown in the figure, for a given video, several frames of the video may be selected through a frame selection step. The region encoder shown in the figure is used to extract the local visual features of each target region in each frame image of the selected frames, and optionally the region encoder may include a region feature extraction network, a relationship detector (i.e. a relationship prediction network), an attribute detector (i.e. an attribute prediction network) and a motion detector (i.e. a motion classifier).
In order to obtain a trained video description model, before the model of the codec structure shown in this example is used for video description, it may be trained by semi-supervised learning, i.e. semi-supervised training. As shown in fig. 30, the model may be trained with both labeled videos (i.e. videos with description labels) and unlabeled videos (i.e. videos without description labels). During training, several frames of each video may be used. For unlabeled videos, data enhancement processing may be applied, at least one video description may be obtained from the unlabeled video, and scores of these descriptions may be obtained based on the enhanced videos so as to obtain the value of the second loss function. For labeled videos, the value of the first loss function may be obtained based on the target description information output by the model and the corresponding label information. The value of the total loss function of the model is then obtained from the value of the first loss function and the value of the second loss function, and this value guides the training of the model until the total loss function converges towards its minimum value.
In addition, for the part shown in the figure for acquiring enhanced description information, in order to obtain more numerous and more diverse annotation information when training the model, the generator can be used to generate enhanced description information corresponding to each annotation based on the original description labels (i.e. the real description information) of the labeled videos, and both the original description labels and the enhanced description labels are used as annotation information of the sample video data during training. This increases the quantity and diversity of the annotation information, and guiding model training with more annotation information together with the description information predicted by the model's decoder can further improve the stability of the model and the accuracy of the generated description information.
As shown in fig. 30, in this example, when video processing is performed based on the trained video description model, the region encoder may extract the local visual features, relationship features, attribute features, etc. of each target region in each of several frames of the video, and a scene graph of each frame may be constructed from the extracted features. The scene graph in this example may be a spatio-temporal scene graph into which temporal information is merged, and the corresponding updated features, i.e. graph convolution features, may be obtained through a graph convolution network. Accordingly, for the decoder part of the model, a self-attention-based intra-frame decoder and a self-attention-based inter-frame decoder may be used in this example. In addition, when generating description information through decoding, information about the description the user desires to generate may be obtained so as to produce a textual description of the video that better meets the user's requirements. For example, when a user needs to be prompted during driving, e.g. when there is a potential danger in front of the user, a real-time video in front of the user's line of sight can be collected and analyzed, and a corresponding prompt can be provided to the user based on the description information generated for the video, or the description information can be played to the user.
As another example, fig. 31 shows a flowchart of a method for generating video description information according to the present application. The 3D visual feature encoder (i.e. the spatio-temporal feature extraction network), the region encoder and the semantic encoder (i.e. the semantic prediction network) shown in the figure are encoders for extracting, respectively, the spatio-temporal visual features of the video, the local visual features of each target region in each frame image (i.e. the local features shown in the figure), and the semantic features. A spatio-temporal scene graph of each frame image can be constructed based on the local visual features, and the graph convolution features (the updated local features shown in the figure) can then be obtained through a graph convolution network. In this example, the spatio-temporal visual features of the video may be extracted by the 3D visual feature encoder and the semantic features of the video by the semantic encoder. The obtained spatio-temporal visual features, semantic features and graph convolution features may then undergo feature selection by a feature selection network, i.e. the weight of each feature may be determined, and the features may be weighted and fused based on these weights to obtain a fused feature. A decoder group (a decoder composed of several decoders, i.e. the Decoders shown in the figure) may decode the fused feature together with the length information of the desired description information, and the final description information (the output description information shown in the figure) may be obtained from the results of the decoders, for example by averaging the decoding results of the decoders and obtaining the final description information based on the averaged result. Optionally, a self-attention-based intra-frame decoder may be included in the decoder group, and the length information of the desired description information may be input into the decoders so that the decoders control the length of the finally generated description information.
Similarly, in order to obtain a trained video description model, before the model of the codec structure shown in this example is used for video description, the model may be trained in a semi-supervised learning manner, i.e. semi-supervised training, and it may also be trained in an adversarial training manner. For the training process, reference may be made to the model training described above and to the corresponding parts of fig. 30, which will not be repeated here.
In addition, as can be seen from the foregoing description, in practical applications, after each feature information of a video is obtained by the encoding portion shown in fig. 30 and fig. 31, each extracted feature information may be encoded by an intra-frame or inter-frame encoder based on attention before being decoded by a decoder, and the features after the encoding process may be input to the decoder again.
Based on the same principle as the method for generating the description information of the multimedia data provided by the embodiment of the present application, the present application also provides a device for generating the description information of the multimedia data, and as shown in fig. 32, the device 100 for generating the description information may include a feature information extraction module 110 and a description information generation module 120. Wherein:
a feature information extraction module 110, configured to extract feature information of multimedia data to be processed, where the multimedia data includes a video or an image;
and a description information generating module 120, configured to generate a text description of the multimedia data based on the extracted feature information.
Optionally, the description information generating module 120 is specifically configured to execute at least one of the following:
extracting local visual features of targets contained in each target region of each image in the multimedia data;
extracting semantic features of the multimedia data;
if the multimedia data is a video, extracting the space-time visual features of the multimedia data;
extracting global visual features of the multimedia data;
extracting attribute features of targets contained in each target area of each image in the multimedia data;
global attribute features of each image in the multimedia data are extracted.
Optionally, the feature information includes local visual features of objects included in each object region in each image of the multimedia data, and the description information generating module 120 is specifically configured to:
for each image, obtaining the relation characteristics among the targets according to the local visual characteristics of the targets in the image, and constructing a scene graph of the image based on the local visual characteristics and the relation characteristics among the targets; for each image, obtaining the image convolution characteristics of the image according to the scene image of the image; and obtaining the text description of the multimedia data based on the image convolution characteristics of each image of the multimedia data.
Optionally, the scene graph includes a plurality of nodes and a plurality of connected edges, where each node represents a local visual feature, and each connected edge represents a relationship feature between the two nodes it connects.
Optionally, the feature information includes attribute features of objects included in each object region of each image in the multimedia data; the description information generating module 120 is specifically configured to, when constructing the scene graph of the image: and constructing a scene graph of the image based on the local visual features of the targets, the relation features between the targets and the attribute features of the targets, wherein each node in the scene graph represents the local visual features or the attribute features of the targets corresponding to the target area.
Optionally, if the multimedia data is a video, each image of the multimedia data is a plurality of frames selected from the video, and if the objects included in the object regions in two adjacent frames are the same, time edges exist between nodes corresponding to the object regions including the same object in the scene graphs of the two adjacent frames.
Optionally, when obtaining the map convolution feature of the image according to the scene map of the image, the description information generating module 120 is configured to: and coding the nodes and the connecting edges in the scene graph to obtain the feature vectors with the same target dimensionality, and obtaining the graph convolution characteristics by using a graph convolution network according to the obtained feature vectors.
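As a minimal sketch under assumed shapes (not the patent's network), the following updates the same-dimension node features of a scene graph using its adjacency structure (connected edges, including time edges between frames) with a simple graph convolution.

```python
import torch
import torch.nn as nn

class SimpleGraphConv(nn.Module):
    def __init__(self, dim=512):
        super().__init__()
        self.linear = nn.Linear(dim, dim)

    def forward(self, node_feats, adjacency):
        # node_feats: (num_nodes, dim); adjacency: (num_nodes, num_nodes) with self-loops
        degree = adjacency.sum(dim=1, keepdim=True).clamp(min=1.0)
        aggregated = adjacency @ node_feats / degree        # mean over connected nodes
        return torch.relu(self.linear(aggregated))          # graph convolution features
```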
Optionally, if the feature information of the multimedia data includes at least two of a local visual feature, a semantic feature, a spatiotemporal visual feature, and a global feature, the description information generating module 120 may be configured to:
determining the weight of each kind of characteristic information; performing weighting processing on each characteristic information based on the weight of each characteristic information; and generating the character description of the multimedia data based on the weighted characteristic information.
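A minimal sketch of this feature-selection idea is shown below: a weight is learned for each kind of feature information and the features are fused by a weighted sum. The scoring layer and dimensions are assumptions for illustration.

```python
import torch
import torch.nn as nn

class FeatureSelection(nn.Module):
    def __init__(self, dim=512):
        super().__init__()
        self.scorer = nn.Linear(dim, 1)

    def forward(self, feature_list):
        # feature_list: list of (batch, dim) tensors, one per kind of feature information
        stacked = torch.stack(feature_list, dim=1)            # (batch, kinds, dim)
        weights = torch.softmax(self.scorer(stacked), dim=1)  # per-kind weights
        return (weights * stacked).sum(dim=1)                 # weighted fusion of the features
```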
Optionally, the description information generating module 120 may be configured to:
performing encoding processing on each obtained feature information by using an encoder based on self attention; inputting the feature information after the coding processing into a decoder to generate a text description of the multimedia data; wherein, if the multimedia data is an image, the self-attention-based encoder is a self-attention-based intra-frame encoder; if the multimedia data is video, the self-attention based encoder includes a self-attention based intra-frame encoder and/or a self-attention based inter-frame encoder.
Optionally, the description information generating module 120 may be configured to: and respectively inputting the extracted characteristic information into a plurality of decoders, and obtaining the character description of the multimedia data based on the decoding result of each decoder.
Optionally, the description information generating module 120 may be configured to: and acquiring length information of the character description to be generated, and generating the character description of the video based on the length information and the extracted characteristic information.
Optionally, the description information generating module 120 may specifically use a multimedia data description model to obtain a textual description of the multimedia data, where the multimedia data description model is obtained by training a model training device, where the model training device may include:
the system comprises a sample acquisition module, a data processing module and a data processing module, wherein the sample acquisition module is used for acquiring a training sample, and the training sample comprises first sample multimedia data with description labels;
and the model training module is used for training the initial description model based on the first sample multimedia data until the model loss function is converged, and taking the trained description model as the multimedia data description model.
Optionally, the training sample further includes second sample multimedia data without description labels, and the model loss function includes a first loss function and a second loss function; the model training module may be to:
training a preset description model based on first sample multimedia data to obtain a value of a first loss function, and training the description model based on second sample multimedia data to obtain a value of a second loss function;
and obtaining a final loss function value based on the first loss function value and the second loss function value, and training the description model based on the final loss function value until the final loss function is converged.
Optionally, the model training module may specifically be configured to, when training the description model based on the second sample multimedia data to obtain a value of the second loss function:
performing data enhancement on the second sample multimedia data at least once to obtain third sample multimedia data; inputting the second sample multimedia data into a description model to obtain at least one multimedia description; determining a score for each multimedia description based on the second sample multimedia data and the third sample multimedia data; the value of the second loss function is derived based on the scores of the multimedia descriptions.
Optionally, the description label of the first sample multimedia data includes at least one original description label of the first sample multimedia data and an enhanced description label corresponding to each original description label; wherein the enhanced description label is obtained by the following method:
and generating enhanced image description labels corresponding to the original description labels respectively according to the original description labels of the first sample multimedia data.
Based on the same principle as the method and the apparatus provided by the embodiment of the present application, an embodiment of the present application further provides an electronic device, where the electronic device includes a memory and a processor, where the memory stores a computer program, and the processor is configured to, when running the computer program, be able to perform the method shown in any optional embodiment of the present application.
Embodiments of the present application further provide a computer-readable storage medium, in which a computer program is stored, and the computer program, when executed by a processor, is capable of executing the method shown in any optional embodiment of the present application.
As an example, fig. 33 shows a schematic structural diagram of an electronic device to which an embodiment of the present application is applied, and as shown in fig. 33, an electronic device 4000 shown in fig. 33 includes: a processor 4001 and a memory 4003. Processor 4001 is coupled to memory 4003, such as via bus 4002. Optionally, the electronic device 4000 may further comprise a transceiver 4004. In addition, the transceiver 4004 is not limited to one in practical use, and the configuration of the electronic apparatus 4000 is not limited to the embodiment of the present application.
The processor 4001 may be a CPU (Central Processing Unit), a general-purpose processor, a DSP (Digital Signal Processor), an ASIC (Application-Specific Integrated Circuit), an FPGA (Field Programmable Gate Array) or another programmable logic device, a transistor logic device, a hardware component, or any combination thereof. It may implement or perform the various illustrative logical blocks, modules, and circuits described in connection with the present disclosure. The processor 4001 may also be a combination that performs a computing function, for example, a combination of one or more microprocessors, or a combination of a DSP and a microprocessor.
The bus 4002 may include a path that carries information between the aforementioned components. The bus 4002 may be a PCI (Peripheral Component Interconnect) bus, an EISA (Extended Industry Standard Architecture) bus, or the like. The bus 4002 may be divided into an address bus, a data bus, a control bus, and so on. For ease of illustration, only one thick line is shown in Fig. 33, but this does not mean that there is only one bus or one type of bus.
The memory 4003 may be a ROM (Read-Only Memory) or another type of static storage device capable of storing static information and instructions, a RAM (Random Access Memory) or another type of dynamic storage device capable of storing information and instructions, an EEPROM (Electrically Erasable Programmable Read-Only Memory), a CD-ROM (Compact Disc Read-Only Memory) or other optical disc storage (including compact disc, laser disc, digital versatile disc, Blu-ray disc, etc.), a magnetic disk storage medium or other magnetic storage device, or any other medium that can be used to carry or store desired program code in the form of instructions or data structures and that can be accessed by a computer, but is not limited thereto.
The memory 4003 is used for storing the application code for executing the solution of the present application, and the execution is controlled by the processor 4001. The processor 4001 is configured to execute the application code stored in the memory 4003 to implement the content shown in any of the foregoing method embodiments.
It can be understood that the methods and models (such as the video description model and the image description model) provided in the optional embodiments of the present application may be run on any terminal (which may be a user terminal, a server, or the like) that needs to perform video description information generation or image description information generation. Optionally, the terminal may have the following features:
(1) In terms of hardware architecture, the device has a central processing unit, a memory, an input unit and an output unit; that is, the device is often a microcomputer device with a communication function. In addition, it can provide multiple input modes, such as a keyboard, a mouse, a touch screen, a microphone and a camera, which can be adjusted as needed. Meanwhile, the device often provides multiple output modes, such as a receiver and a display screen, which can likewise be adjusted as needed;
(2) In terms of software, the device must have an operating system, such as Windows Mobile, Symbian, Palm, Android or iOS. These operating systems are increasingly open, and a variety of personalized applications developed on such open operating system platforms, such as an address book, a calendar, a notepad, a calculator and various games, largely meet the needs of individual users;
(3) In terms of communication capability, the device has a flexible access mode and high-bandwidth communication performance, and can automatically adjust the selected communication mode according to the selected service and the environment, thereby facilitating use. The device can support GSM (Global System for Mobile Communication), WCDMA (Wideband Code Division Multiple Access), CDMA2000 (Code Division Multiple Access 2000), TD-SCDMA (Time Division-Synchronous Code Division Multiple Access), Wi-Fi (Wireless Fidelity), WiMAX (Worldwide Interoperability for Microwave Access) and the like, thereby adapting to various types of networks and supporting not only voice services but also various wireless data services;
(4) In terms of function, the device places more emphasis on humanization, personalization and multi-functionality. With the development of computer technology, the device has moved to a 'human-centered' mode, integrating embedded computing, control technology, artificial intelligence technology and biometric authentication technology, which fully embodies the people-oriented purpose. Owing to the development of software technology, the device can be adjusted and configured according to individual needs and is thus more personalized. Meanwhile, the device integrates a variety of software and hardware, and its functions are increasingly powerful.
It should be understood that, although the steps in the flowcharts of the figures are shown in the order indicated by the arrows, they are not necessarily performed in that order. Unless explicitly stated herein, the execution of these steps is not strictly limited in order, and they may be performed in other orders. Moreover, at least some of the steps in the flowcharts may include multiple sub-steps or multiple stages, which are not necessarily performed at the same time, but may be performed at different times and in a different order, and may be performed alternately or in turns with other steps or with at least some of the sub-steps or stages of other steps.
The foregoing describes only some embodiments of the present application. It should be noted that those skilled in the art can make several improvements and modifications without departing from the principle of the present application, and these improvements and modifications shall also fall within the protection scope of the present application.

Claims (18)

1. A method for generating description information of multimedia data, comprising:
extracting feature information of multimedia data to be processed, wherein the multimedia data comprises a video or an image;
and generating a textual description of the multimedia data based on the extracted feature information.
2. The method of claim 1, wherein the extracting feature information of the multimedia data to be processed comprises at least one of:
extracting local visual features of targets contained in each target region of each image in the multimedia data;
extracting semantic features of the multimedia data;
if the multimedia data is a video, extracting spatiotemporal visual features of the multimedia data;
extracting global visual features of the multimedia data;
extracting attribute features of targets contained in each target area of each image in the multimedia data;
extracting global attribute features of each image in the multimedia data.
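As a non-limiting illustration of one item in this list, the sketch below extracts a global visual feature of an image with an off-the-shelf CNN backbone; the choice of ResNet-50 from torchvision and the 2048-dimensional output are assumptions made for the example (weights=None avoids a download here, whereas in practice pretrained weights would normally be loaded).

```python
import torch
import torch.nn as nn
import torchvision.models as models

# Global visual feature of one image via a CNN backbone (illustrative choice).
backbone = models.resnet50(weights=None)
backbone.fc = nn.Identity()           # keep the 2048-d pooled feature
backbone.eval()

image = torch.randn(1, 3, 224, 224)   # an already preprocessed image tensor
with torch.no_grad():
    global_visual_feature = backbone(image)
print(global_visual_feature.shape)    # torch.Size([1, 2048])
```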
3. The method of claim 2, wherein the feature information comprises local visual features of targets contained in target regions of each image in the multimedia data, and wherein generating the textual description of the multimedia data based on the extracted feature information comprises:
for each image, obtaining relationship features between the targets according to the local visual features of the targets in the image, and constructing a scene graph of the image based on the local visual features and the relationship features between the targets;
for each image, obtaining graph convolution features of the image according to the scene graph of the image;
and obtaining the textual description of the multimedia data based on the graph convolution features of the images of the multimedia data.
4. The method of claim 3, wherein the scene graph comprises a plurality of nodes and a plurality of connecting edges, wherein a node represents a local visual feature of a target, and each connecting edge represents a relationship feature between the two nodes it connects.
5. The method according to claim 3, wherein the feature information includes attribute features of targets contained in respective target regions of each image in the multimedia data;
the constructing a scene graph of the image based on the local visual features of the targets and the relationship features between the targets comprises:
constructing a scene graph of the image based on the local visual features of the targets, the relationship features between the targets and the attribute features of the targets, wherein one node in the scene graph represents a local visual feature or an attribute feature of one target.
6. The method according to claim 3 or 4, wherein, if the multimedia data is a video, the images of the multimedia data are a plurality of frames selected from the video, and if the targets contained in the target regions of two adjacent frames are the same, the nodes corresponding to the same targets in the scene graphs of the two adjacent frames are connected by time edges.
7. The method according to any one of claims 3 to 6, wherein obtaining the graph convolution features of the image according to the scene graph of the image comprises:
encoding the nodes and connecting edges in the scene graph to obtain feature vectors of a same target dimension;
and obtaining the graph convolution features by using a graph convolution network based on the obtained feature vectors.
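As a non-limiting illustration of the graph convolution described in claims 3 to 7 (not part of the claims), the following sketch projects node features of a scene graph to a common target dimension and aggregates them over an adjacency matrix built from the connecting edges. The single layer, the degree normalization and the mean pooling are assumptions made for the example.

```python
import torch
import torch.nn as nn

class SimpleSceneGraphConv(nn.Module):
    """Minimal graph convolution over a scene graph: node features are
    encoded to a target dimension and aggregated along the connecting
    edges; layer sizes are illustrative choices."""
    def __init__(self, node_dim, hidden_dim):
        super().__init__()
        self.encode = nn.Linear(node_dim, hidden_dim)   # map nodes to the target dimension
        self.conv = nn.Linear(hidden_dim, hidden_dim)

    def forward(self, node_feats, adjacency):
        # node_feats: (num_nodes, node_dim); adjacency: (num_nodes, num_nodes), 0/1
        h = torch.relu(self.encode(node_feats))
        # Normalize by node degree (with a self-loop) before aggregation.
        adj = adjacency + torch.eye(adjacency.size(0))
        adj = adj / adj.sum(dim=1, keepdim=True)
        h = torch.relu(self.conv(adj @ h))
        # Pool node features into one graph convolution feature for the image.
        return h.mean(dim=0)

# Toy example: 3 detected targets, connecting edges 0-1 and 1-2.
nodes = torch.randn(3, 2048)
adj = torch.tensor([[0., 1., 0.], [1., 0., 1.], [0., 1., 0.]])
feature = SimpleSceneGraphConv(2048, 512)(nodes, adj)
print(feature.shape)  # torch.Size([512])
```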
8. The method according to any one of claims 2 to 7, wherein, if the feature information of the multimedia data includes at least two of a local visual feature, a semantic feature, a spatiotemporal visual feature and a global visual feature, the generating the textual description of the multimedia data based on the extracted feature information comprises:
determining a weight of each kind of feature information;
performing weighting processing on each kind of feature information based on the weight of that feature information;
and generating the textual description of the multimedia data based on the weighted feature information.
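As a non-limiting illustration of the weighting in claim 8 (not part of the claims), the sketch below assumes each kind of feature information has already been projected to a common dimension and learns a scalar weight per feature kind with a small scoring layer; the softmax scoring and weighted-sum fusion are illustrative choices.

```python
import torch
import torch.nn as nn

class FeatureWeighting(nn.Module):
    """Sketch: learn a weight for each kind of feature information and fuse
    the weighted features by summation (an illustrative fusion choice)."""
    def __init__(self, feature_dim):
        super().__init__()
        self.score = nn.Linear(feature_dim, 1)   # scalar score per feature kind

    def forward(self, features):
        # features: (num_feature_kinds, feature_dim), e.g. local visual,
        # semantic, spatiotemporal and global visual features after projection.
        weights = torch.softmax(self.score(features), dim=0)   # (num_kinds, 1)
        weighted = weights * features                          # weighting processing
        return weighted.sum(dim=0), weights.squeeze(-1)

fusion = FeatureWeighting(feature_dim=512)
feats = torch.randn(4, 512)
fused, weights = fusion(feats)
print(fused.shape, weights)   # torch.Size([512]) plus the four learned weights
```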
9. The method of any of claims 2 to 8, wherein generating the textual description of the multimedia data based on the extracted feature information comprises:
performing encoding processing on each piece of obtained feature information by using a self-attention-based encoder;
inputting the encoded feature information into a decoder to generate the textual description of the multimedia data;
wherein, if the multimedia data is an image, the self-attention-based encoder is a self-attention-based intra-frame encoder; and if the multimedia data is a video, the self-attention-based encoder includes a self-attention-based intra-frame encoder and/or a self-attention-based inter-frame encoder.
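As a non-limiting illustration of the self-attention-based intra-frame and inter-frame encoding in claim 9 (not part of the claims), the sketch below applies a standard Transformer encoder first over the target regions of each frame and then over the per-frame features; the layer counts, dimensions and mean pooling are assumptions for the example.

```python
import torch
import torch.nn as nn

d_model, nhead = 512, 8
intra_frame_encoder = nn.TransformerEncoder(
    nn.TransformerEncoderLayer(d_model, nhead, batch_first=True), num_layers=2)
inter_frame_encoder = nn.TransformerEncoder(
    nn.TransformerEncoderLayer(d_model, nhead, batch_first=True), num_layers=2)

frames = torch.randn(1, 10, 36, d_model)     # (batch, frames, regions per frame, dim)
b, t, r, d = frames.shape
# Intra-frame: self-attention over the target regions of each frame.
intra = intra_frame_encoder(frames.view(b * t, r, d)).view(b, t, r, d)
# Inter-frame: self-attention over frames, using a pooled feature per frame.
inter = inter_frame_encoder(intra.mean(dim=2))   # (batch, frames, dim)
print(inter.shape)
```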
10. The method according to any one of claims 1 to 9, wherein the generating a textual description of the multimedia data based on the extracted feature information comprises:
inputting the extracted feature information into a plurality of decoders, respectively;
and obtaining the text description of the multimedia data based on the decoding result of each decoder.
11. The method according to any one of claims 1 to 10, wherein the generating a textual description of the multimedia data based on the extracted feature information comprises:
acquiring length information of the textual description to be generated;
and generating the textual description of the multimedia data based on the length information and the extracted feature information.
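As a non-limiting illustration of how length information might condition generation (claim 11, not part of the claims), the sketch below embeds a target-length bucket and adds it to the encoded feature information before decoding; the bucketing scheme and the additive conditioning are assumptions, not details taken from the application.

```python
import torch
import torch.nn as nn

class LengthConditionedDecoderInput(nn.Module):
    """Sketch of one way to expose the desired description length to a decoder:
    embed a target-length bucket and add it to the encoded feature information."""
    def __init__(self, d_model, num_length_buckets=5):
        super().__init__()
        self.length_embedding = nn.Embedding(num_length_buckets, d_model)

    def forward(self, encoded_features, target_length):
        # encoded_features: (batch, seq, d_model); target_length: desired word count.
        bucket = min(target_length // 5, self.length_embedding.num_embeddings - 1)
        bucket = torch.tensor([bucket], device=encoded_features.device)
        return encoded_features + self.length_embedding(bucket).unsqueeze(1)

conditioner = LengthConditionedDecoderInput(d_model=512)
features = torch.randn(2, 36, 512)
conditioned = conditioner(features, target_length=12)   # "about 12 words"
print(conditioned.shape)  # torch.Size([2, 36, 512])
```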
12. The method according to any one of claims 1 to 11, wherein the textual description of the multimedia data is obtained by a multimedia data description model, wherein the multimedia data description model is trained by:
obtaining a training sample, wherein the training sample comprises first sample multimedia data with description labels;
and training an initial description model based on the first sample multimedia data until a model loss function is converged, and taking the trained description model as the multimedia data description model.
13. The method of claim 12, wherein the training samples further comprise second sample multimedia data without description labels, and wherein the model loss function comprises a first loss function and a second loss function;
the training an initial description model based on the first sample multimedia data until a model loss function converges includes:
training the initial description model based on the first sample multimedia data to obtain a value of the first loss function, and training the description model based on the second sample multimedia data to obtain a value of the second loss function;
obtaining a value of a final loss function based on the value of the first loss function and the value of the second loss function,
and training the description model based on the value of the final loss function until the final loss function converges.
14. The method of claim 13, wherein training the description model based on the second sample multimedia data to obtain a value for the second loss function comprises:
performing data enhancement on the second sample multimedia data at least once to obtain third sample multimedia data;
inputting the second sample multimedia data into the description model to obtain at least one multimedia description;
determining a score of each multimedia description based on the second sample multimedia data and the third sample multimedia data;
and deriving the value of the second loss function based on the scores of the multimedia descriptions.
15. The method according to any one of claims 1 to 14, wherein the description label of the first sample multimedia data comprises at least one original description label of the first sample multimedia data and an enhanced description label corresponding to each original description label;
wherein the enhanced description label is obtained by the following method:
generating, according to each original description label of the first sample multimedia data, an enhanced description label corresponding to that original description label.
16. An apparatus for generating description information of multimedia data, comprising:
a feature information extraction module, configured to extract feature information of multimedia data to be processed, wherein the multimedia data comprises a video or an image;
and a description information generation module, configured to generate a textual description of the multimedia data based on the extracted feature information.
17. An electronic device comprising a memory and a processor;
the memory has stored therein a computer program;
the processor, when executing the computer program, is configured to perform the method of any of claims 1 to 15.
18. A computer-readable storage medium, in which a computer program is stored which, when being executed by a processor, carries out the method of any one of claims 1 to 15.
CN202010152713.5A 2019-03-21 2020-03-06 Method, device, equipment and medium for generating description information of multimedia data Pending CN111723937A (en)

Priority Applications (4)

Application Number Priority Date Filing Date Title
US17/292,627 US20220014807A1 (en) 2019-03-21 2020-03-23 Method, apparatus, device and medium for generating captioning information of multimedia data
PCT/KR2020/003972 WO2020190112A1 (en) 2019-03-21 2020-03-23 Method, apparatus, device and medium for generating captioning information of multimedia data
EP20774058.0A EP3892005A4 (en) 2019-03-21 2020-03-23 Method, apparatus, device and medium for generating captioning information of multimedia data
KR1020217028919A KR102593440B1 (en) 2019-03-21 2020-03-23 Method, apparatus, devices and media for generating captioning information of multimedia data

Applications Claiming Priority (6)

Application Number Priority Date Filing Date Title
CN2019102190094 2019-03-21
CN201910219009 2019-03-21
CN201910270450 2019-04-04
CN2019102704505 2019-04-04
CN2019111151474 2019-11-14
CN201911115147 2019-11-14

Publications (1)

Publication Number Publication Date
CN111723937A true CN111723937A (en) 2020-09-29

Family

ID=72564086

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202010152713.5A Pending CN111723937A (en) 2019-03-21 2020-03-06 Method, device, equipment and medium for generating description information of multimedia data

Country Status (2)

Country Link
KR (1) KR102593440B1 (en)
CN (1) CN111723937A (en)

Families Citing this family (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
KR102598678B1 (en) * 2021-11-01 2023-11-03 연세대학교 산학협력단 Method and apparatus for image captioning
CN114548181A (en) * 2022-02-28 2022-05-27 北京理工大学 Time sequence data attack detection method and system
CN116132756B (en) * 2023-01-06 2024-05-03 重庆大学 End-to-end video subtitle generating method based on deep learning
KR102616354B1 (en) * 2023-06-19 2023-12-20 영남대학교 산학협력단 Apparatus and method for generating video descriptions based on artificial neural networks

Family Cites Families (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US10395118B2 (en) * 2015-10-29 2019-08-27 Baidu Usa Llc Systems and methods for video paragraph captioning using hierarchical recurrent neural networks
US9807473B2 (en) * 2015-11-20 2017-10-31 Microsoft Technology Licensing, Llc Jointly modeling embedding and translation to bridge video and language

Cited By (28)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN112508096A (en) * 2020-12-08 2021-03-16 电子科技大学 Automatic image annotation method based on geometric self-attention mechanism
CN114611499A (en) * 2020-12-09 2022-06-10 阿里巴巴集团控股有限公司 Information extraction model training method, information extraction device and electronic equipment
CN112801017A (en) * 2021-02-09 2021-05-14 成都视海芯图微电子有限公司 Visual scene description method and system
CN112801017B (en) * 2021-02-09 2023-08-04 成都视海芯图微电子有限公司 Visual scene description method and system
CN113095431A (en) * 2021-04-27 2021-07-09 中山大学 Image description method, system and device based on attention mechanism
CN113095431B (en) * 2021-04-27 2023-08-18 中山大学 Image description method, system and device based on attention mechanism
CN113052149A (en) * 2021-05-20 2021-06-29 平安科技(深圳)有限公司 Video abstract generation method and device, computer equipment and medium
CN113052149B (en) * 2021-05-20 2021-08-13 平安科技(深圳)有限公司 Video abstract generation method and device, computer equipment and medium
CN113298151A (en) * 2021-05-26 2021-08-24 中国电子科技集团公司第五十四研究所 Remote sensing image semantic description method based on multi-level feature fusion
CN113506610A (en) * 2021-07-08 2021-10-15 联仁健康医疗大数据科技股份有限公司 Method and device for generating annotation specification, electronic equipment and storage medium
CN113553445B (en) * 2021-07-28 2022-03-29 北京理工大学 Method for generating video description
CN113553445A (en) * 2021-07-28 2021-10-26 北京理工大学 Method for generating video description
CN113792166A (en) * 2021-08-18 2021-12-14 北京达佳互联信息技术有限公司 Information acquisition method and device, electronic equipment and storage medium
CN113784199A (en) * 2021-09-10 2021-12-10 中国科学院计算技术研究所 System and method for generating video description text
WO2023050295A1 (en) * 2021-09-30 2023-04-06 中远海运科技股份有限公司 Multimodal heterogeneous feature fusion-based compact video event description method
CN114661953A (en) * 2022-03-18 2022-06-24 北京百度网讯科技有限公司 Video description generation method, device, equipment and storage medium
CN114861654A (en) * 2022-03-31 2022-08-05 重庆邮电大学 Defense method for countertraining based on part-of-speech fusion in Chinese text
CN114925660A (en) * 2022-05-23 2022-08-19 马上消费金融股份有限公司 Text processing model training method and device and text processing method and device
CN114925660B (en) * 2022-05-23 2023-07-28 马上消费金融股份有限公司 Text processing model training method and device, text processing method and device
CN115243107A (en) * 2022-07-08 2022-10-25 华人运通(上海)云计算科技有限公司 Method, device, system, electronic equipment and medium for playing short video
CN115243107B (en) * 2022-07-08 2023-11-21 华人运通(上海)云计算科技有限公司 Method, device, system, electronic equipment and medium for playing short video
CN115019182B (en) * 2022-07-28 2023-03-24 北京卫星信息工程研究所 Method, system, equipment and storage medium for identifying fine granularity of remote sensing image target
CN115019182A (en) * 2022-07-28 2022-09-06 北京卫星信息工程研究所 Remote sensing image target fine-grained identification method, system, equipment and storage medium
CN115984574A (en) * 2023-03-20 2023-04-18 北京航空航天大学 Image information extraction model and method based on cyclic transform and application thereof
CN116089653A (en) * 2023-03-20 2023-05-09 山东大学 Video retrieval model based on scene information
CN116089653B (en) * 2023-03-20 2023-06-27 山东大学 Video retrieval method based on scene information
CN115984574B (en) * 2023-03-20 2023-09-19 北京航空航天大学 Image information extraction model and method based on cyclic transducer and application thereof
CN117854517A (en) * 2024-02-05 2024-04-09 南京龙垣信息科技有限公司 Vehicle-mounted multi-person real-time intelligent voice interaction system

Also Published As

Publication number Publication date
KR20210114074A (en) 2021-09-17
KR102593440B1 (en) 2023-10-24

Similar Documents

Publication Publication Date Title
CN111723937A (en) Method, device, equipment and medium for generating description information of multimedia data
US20220014807A1 (en) Method, apparatus, device and medium for generating captioning information of multimedia data
CN111294646B (en) Video processing method, device, equipment and storage medium
CN109508377A (en) Text feature, device, chat robots and storage medium based on Fusion Model
CN108388900A (en) The video presentation method being combined based on multiple features fusion and space-time attention mechanism
CN112163165A (en) Information recommendation method, device, equipment and computer readable storage medium
CN109919221B (en) Image description method based on bidirectional double-attention machine
CN113297370B (en) End-to-end multi-modal question-answering method and system based on multi-interaction attention
CN114969316B (en) Text data processing method, device, equipment and medium
Li et al. A deep reinforcement learning framework for Identifying funny scenes in movies
Cideron et al. Self-educated language agent with hindsight experience replay for instruction following
CN115293348A (en) Pre-training method and device for multi-mode feature extraction network
Chien et al. Hierarchical and self-attended sequence autoencoder
CN114330736A (en) Latent variable generative model with noise contrast prior
Zhou et al. Research on fast pedestrian detection algorithm based on autoencoding neural network and adaboost
Deb et al. Variational stacked local attention networks for diverse video captioning
CN115186147A (en) Method and device for generating conversation content, storage medium and terminal
CN113254575B (en) Machine reading understanding method and system based on multi-step evidence reasoning
CN117556276A (en) Method and device for determining similarity between text and video
CN115525740A (en) Method and device for generating dialogue response sentence, electronic equipment and storage medium
CN113408721A (en) Neural network structure searching method, apparatus, computer device and storage medium
CN116975347A (en) Image generation model training method and related device
CN116977509A (en) Virtual object action generation method, device, computer equipment and storage medium
US11868857B2 (en) Video augmentation apparatus and a method for its use
Zhao et al. Research on video captioning based on multifeature fusion

Legal Events

Date Code Title Description
PB01 Publication
PB01 Publication
SE01 Entry into force of request for substantive examination
SE01 Entry into force of request for substantive examination