CN108200483B - Dynamic multi-modal video description generation method - Google Patents

Dynamic multi-modal video description generation method

Info

Publication number
CN108200483B
CN108200483B (application CN201711433810.6A)
Authority
CN
China
Prior art keywords
visual
auditory
video
modal
memory unit
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN201711433810.6A
Other languages
Chinese (zh)
Other versions
CN108200483A (en)
Inventor
张兆翔
郝王丽
关赫
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Institute of Automation of Chinese Academy of Science
Original Assignee
Institute of Automation of Chinese Academy of Science
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Institute of Automation of Chinese Academy of Science filed Critical Institute of Automation of Chinese Academy of Science
Priority to CN201711433810.6A priority Critical patent/CN108200483B/en
Publication of CN108200483A publication Critical patent/CN108200483A/en
Application granted granted Critical
Publication of CN108200483B publication Critical patent/CN108200483B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical

Classifications

    • HELECTRICITY
    • H04ELECTRIC COMMUNICATION TECHNIQUE
    • H04NPICTORIAL COMMUNICATION, e.g. TELEVISION
    • H04N21/00Selective content distribution, e.g. interactive television or video on demand [VOD]
    • H04N21/80Generation or processing of content or additional data by content creator independently of the distribution process; Content per se
    • H04N21/83Generation or processing of protective or descriptive data associated with content; Content structuring
    • H04N21/84Generation or processing of descriptive data, e.g. content descriptors
    • HELECTRICITY
    • H04ELECTRIC COMMUNICATION TECHNIQUE
    • H04NPICTORIAL COMMUNICATION, e.g. TELEVISION
    • H04N21/00Selective content distribution, e.g. interactive television or video on demand [VOD]
    • H04N21/40Client devices specifically adapted for the reception of or interaction with content, e.g. set-top-box [STB]; Operations thereof
    • H04N21/43Processing of content or additional data, e.g. demultiplexing additional data from a digital video stream; Elementary client operations, e.g. monitoring of home network or synchronising decoder's clock; Client middleware
    • H04N21/44Processing of video elementary streams, e.g. splicing a video clip retrieved from local storage with an incoming video stream or rendering scenes according to encoded video stream scene graphs
    • H04N21/44008Processing of video elementary streams, e.g. splicing a video clip retrieved from local storage with an incoming video stream or rendering scenes according to encoded video stream scene graphs involving operations for analysing video streams, e.g. detecting features or characteristics in the video stream
    • HELECTRICITY
    • H04ELECTRIC COMMUNICATION TECHNIQUE
    • H04NPICTORIAL COMMUNICATION, e.g. TELEVISION
    • H04N21/00Selective content distribution, e.g. interactive television or video on demand [VOD]
    • H04N21/40Client devices specifically adapted for the reception of or interaction with content, e.g. set-top-box [STB]; Operations thereof
    • H04N21/45Management operations performed by the client for facilitating the reception of or the interaction with the content or administrating data related to the end-user or to the client device itself, e.g. learning user preferences for recommending movies, resolving scheduling conflicts
    • H04N21/466Learning process for intelligent management, e.g. learning user preferences for recommending movies
    • H04N21/4662Learning process for intelligent management, e.g. learning user preferences for recommending movies characterized by learning algorithms
    • H04N21/4666Learning process for intelligent management, e.g. learning user preferences for recommending movies characterized by learning algorithms using neural networks, e.g. processing the feedback provided by the user

Landscapes

  • Engineering & Computer Science (AREA)
  • Multimedia (AREA)
  • Signal Processing (AREA)
  • Databases & Information Systems (AREA)
  • Artificial Intelligence (AREA)
  • Evolutionary Computation (AREA)
  • Compression Or Coding Systems Of Tv Signals (AREA)

Abstract

The invention belongs to the field of video description and specifically relates to a dynamic multi-modal video description generation method. It aims to capture the resonance information of the audio-visual modalities in order to produce an ideal video description and, in addition, to handle cases where the auditory modality of the video is damaged or missing. In the feature-encoding stage of the visual and auditory modalities, the multi-modal video description generation system provided by the invention either shares the weights of the LSTM internal memory units or shares an external memory unit, thereby modeling the temporal dependencies between the visual and auditory modalities and capturing their resonance information. In addition, the invention infers the corresponding auditory modality information from the known visual modality information through an auditory inference system. The method can generate video descriptions quickly and effectively.

Description

Dynamic multi-modal video description generation method
Technical Field
The invention belongs to the field of video description, and particularly relates to a dynamic multi-modal video description generation method.
Background
A video description generation system automatically generates a corresponding natural-language description for a given video. Video description generation is inspired by image description generation, which iteratively generates an image or video description from the features of the given image or video frames.
Conventional video description generation can be roughly classified into three types of methods:
The first category is template-based methods. They first determine the semantic concepts contained in the video, then infer the sentence structure from a predefined sentence template, and finally use a probabilistic graphical model to collect the most relevant content from which a sentence description can be generated. Although grammatically correct sentences can be generated in this way, the sentences lack richness and flexibility.
The second category treats video description as a retrieval problem. The video is first tagged with metadata, and then sentence descriptions and videos are matched according to the corresponding tags. Sentences generated in this way are more natural than those of the first category, but they are largely limited by the metadata.
A third type of video description generation method directly maps a video representation onto a particular sentence. Venugopalan et al. first perform feature extraction on all image frames in the video using a convolutional neural network (CNN) and apply an average-pooling operation to them to obtain a fixed-length video representation; a corresponding video description is then generated from this representation with an LSTM decoder. Although this method conveniently obtains a video description, it ignores the temporal characteristics implicit in the video. To explore the role of temporal features in video description generation, Venugopalan et al. used an LSTM to encode the features of the video frames into a fixed-length video representation. Yao et al. explored the effect of the implicit local and global temporal properties of video on video description generation: the local temporal properties are encoded by a spatio-temporal convolutional neural network, and the global temporal properties are encoded by a temporal attention mechanism. To further improve the performance of video description generation, Ballas et al. studied the role of intermediate-layer video representations. Pan et al. proposed a hierarchical recurrent neural network decoder to explore the contribution of temporal properties at different granularities to video description generation. To produce a more detailed description of a video, Haonan Yu et al. proposed hierarchical recurrent neural networks that generate a paragraph description of the video. Their hierarchical model contains a sentence generator and a paragraph generator: the sentence generator produces a simple sentence description for a particular short video interval, and the paragraph generator produces a paragraph description for the video by capturing inter-sentence dependencies based on the multiple sentence encodings produced by the sentence generator.
The above video description generation models have a common property: their video representation is based on visual information only. In order to fully utilize the information contained in a video, many researchers have proposed fusing the multi-modal information in the video to obtain a better video description. Vasili Ramanishka et al. directly concatenate the corresponding visual and auditory features to obtain a multi-modal video representation. Qin Jin et al. use a multi-layer feed-forward neural network to fuse the visual and auditory information. Although these methods may improve the performance of video description to some extent, they are still limited in the following ways. Directly concatenating visual and auditory features can lead to aliasing of information and degradation of performance. In addition, the model proposed by Qin Jin et al. obtains video features by pooling all video frame features or audio segment features, ignoring the temporal dependencies between the visual and auditory features.
The multi-modal video description generation system provided by the invention models the temporal dependencies between the visual and auditory modalities by sharing the weights of the LSTM internal memory units, or by sharing an external memory unit, in the feature-encoding stage of the two modalities; it thereby captures the resonance information of the audio-visual modalities and generates an ideal video description. In addition, the auditory modality of a video may in some cases be damaged or missing because of environmental influences, sensor interference, and the like. In that case, in order to prevent the performance of the video description generation system from being greatly affected, the invention proposes an auditory inference system that infers the corresponding auditory modality information from the known visual modality information; the completed visual and auditory modality information is then input into the multi-modal video description generation system to generate the video description.
Disclosure of Invention
In order to solve the above problem, namely that a video description cannot be generated accurately when the auditory modality is damaged or missing, the dynamic multi-modal video description generation method provided by the invention comprises the following steps:
step S1: extracting the corresponding visual CNN features and auditory MFCC features from the video, and judging whether the auditory MFCC features are damaged or missing; if they are damaged or missing, performing step S2, otherwise performing step S3;
step S2: inferring complete auditory MFCC features from the visual CNN features through an auditory inference model based on an encoding-decoding scheme;
step S3: using the visual CNN features and the auditory MFCC features, encoding and interactively fusing the two audio-visual modalities with a multi-modal encoder based on the temporal dependencies between vision and audition to obtain a fused feature, and iteratively decoding the fused feature with a decoder to generate the video description. A minimal sketch of this dispatch logic is given after these steps.
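The following Python sketch summarizes the dispatch of steps S1-S3. It is only an illustration under assumed interfaces: the helper callables (cnn, mfcc_extractor, auditory_inference, multimodal_encoder, text_decoder) are hypothetical names, not an API defined by the invention.

import numpy as np

def describe_video(frames, audio, cnn, mfcc_extractor,
                   auditory_inference, multimodal_encoder, text_decoder):
    # Step S1: extract visual CNN features and auditory MFCC features.
    visual_feats = np.stack([cnn(f) for f in frames])            # (T, d_v)
    auditory_feats = mfcc_extractor(audio) if audio is not None else None

    # Step S2: if the auditory modality is damaged or missing, infer it
    # from the visual CNN features with the encoding-decoding model.
    if auditory_feats is None or np.isnan(auditory_feats).any():
        auditory_feats = auditory_inference(visual_feats)        # (T, d_a)

    # Step S3: fuse the two modalities with the multi-modal encoder and
    # iteratively decode the fused representation into a sentence.
    fused = multimodal_encoder(visual_feats, auditory_feats)
    return text_decoder(fused)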
Preferably, the auditory inference model generates the auditory MFCC features as follows:
encoding the visual CNN features with an encoder to obtain high-level semantics;
decoding the corresponding auditory MFCC features with a decoder.
Preferably, the multi-modal encoder is a shared-weight multi-modal LSTM encoder, or a shared-memory-unit multi-modal memory unit encoder.
Preferably, the shared-weight multi-modal LSTM encoder includes two LSTM neural networks, which are used to encode the visual CNN features and the auditory MFCC features, respectively; the internal memory units of the two LSTM neural networks share weights.
Preferably, the modeling formula of the multi-modal LSTM encoder based on the shared weight is as follows:
i_t^s = σ(W_i^s x_t^s + U_i h_{t-1}^s + b_i^s)    (1)
f_t^s = σ(W_f^s x_t^s + U_f h_{t-1}^s + b_f^s)    (2)
o_t^s = σ(W_o^s x_t^s + U_o h_{t-1}^s + b_o^s)    (3)
\tilde{c}_t^s = tanh(W_c^s x_t^s + U_c h_{t-1}^s + b_c^s)    (4)
c_t^s = f_t^s ⊙ c_{t-1}^s + i_t^s ⊙ \tilde{c}_t^s    (5)
h_t^s = o_t^s ⊙ tanh(c_t^s)    (6)
wherein:
i_t, f_t, o_t and \tilde{c}_t are the input gate, the forget gate, the output gate and the memory unit, respectively;
the superscript s is the index of the modality: s = 0 denotes the LSTM-based auditory information encoder, and s = 1 denotes the LSTM-based visual information encoder;
x^0 is the auditory MFCC feature and x^1 is the visual CNN feature;
W, U and b are the weight matrices of the corresponding terms, where U denotes the hidden-layer weights shared by the LSTM-based audio-visual encoders;
σ is the sigmoid function;
i is the input gate of the LSTM;
h is the hidden state of the LSTM, and h_t, h_{t-1} are the hidden states of the LSTM at times t and t-1;
W_i, W_f, W_o, W_c are the weights of the input gate, forget gate, output gate and memory unit with respect to the input x;
x_{t-1} is the input at time t-1;
U_i, U_f, U_o, U_c are the weights of the input gate, forget gate, output gate and memory unit with respect to the hidden state h;
b_i, b_f, b_o, b_c are the bias terms of the input gate, forget gate, output gate and memory unit;
c_t, c_{t-1} are the values of the memory unit at times t and t-1.
Preferably, the shared-memory-unit multi-modal memory unit encoder comprises two LSTM neural networks for encoding the visual CNN features and the auditory MFCC features, respectively; and the internal memory units of the two LSTM neural networks update their information through the external memory unit.
Preferably, the "internal memory units of the two LSTM neural networks perform information updating through the external memory units" includes:
reading information from an external memory unit;
and respectively fusing the information read by the external memory unit and the internal memory units of the two LSTM neural networks, and updating the memory units of the two LSTM neural networks.
Preferably, the method for extracting the visual CNN features and the auditory MFCC features corresponding to the video comprises:
extracting the visual CNN features of the video frames in the video with a convolutional neural network;
and extracting the auditory MFCC features from the audio segments corresponding to the video frames with a convolutional neural network.
Preferably, the multi-modal encoder based on the shared external memory unit realizes the interactive fusion of the visual CNN features and the auditory MFCC features through read-write operations, with the following specific steps:
step S11: reading information from the memory unit of the multi-modal encoder;
step S12: fusing the information read from the memory unit with the visual CNN features and the auditory MFCC features obtained from the multi-modal encoder, respectively, to obtain fused information;
step S13: storing the fused information in the memory unit of the multi-modal encoder.
Preferably, the visual CNN features and the auditory MFCC features do not share weights before being input to the multimodal encoder.
According to the technical scheme above, the invention provides a dynamic multi-modal video description generation model and thereby a fast and effective method for video description. Compared with the prior art, the invention has the following advantages:
(1) The input information of the invention comes from the two modalities of vision and audition; compared with a conventional video description generation system based on visual information only, higher performance can be obtained.
(2) The invention fuses audio-visual information by modeling the temporal dependencies between the audio-visual modalities. These temporal dependencies represent, to some extent, the resonance information between the audio-visual modalities, i.e., the events that actually occur in the video. The invention can effectively model this resonance information and thus generate an ideal video description.
(3) The invention establishes a unified video description system both for videos with complete modal information and for videos with a missing auditory modality. If the input video has both complete modalities, they are input directly into the multi-modal video description generation system; if the auditory modality of the input video is damaged or missing, the auditory inference system first generates the auditory features from the visual features, and the two completed modalities are then input into the multi-modal video description generation system.
Drawings
FIG. 1 is a flow chart of a method for generating a dynamic multi-modal video description according to an embodiment of the present invention;
FIG. 2 is a schematic diagram of a dynamic multi-modal video description generation model according to an embodiment of the present invention;
FIG. 3 is a schematic diagram of a multi-modal visual-auditory fusion coding model based on a shared weight according to an embodiment of the present invention;
FIG. 4 is a schematic diagram of a multi-modal audio-visual fusion coding model based on a shared external memory unit according to an embodiment of the present invention;
FIG. 5 shows video descriptions generated by various baseline models and the multi-modal fusion model according to an embodiment of the present invention;
FIG. 6 is a diagram of an auditory inference model in accordance with an embodiment of the present invention;
fig. 7 shows video descriptions generated by various baseline models after the auditory modality has been supplemented, according to an embodiment of the present invention.
Detailed Description
Preferred embodiments of the present invention are described below with reference to the accompanying drawings. It should be understood by those skilled in the art that these embodiments are only for explaining the technical principle of the present invention, and are not intended to limit the scope of the present invention.
As shown in fig. 1, the dynamic multi-modal video description generation method provided by the present invention specifically includes the following steps:
step S1: extracting the corresponding visual CNN features and auditory MFCC features from the video, and judging whether the auditory MFCC features are damaged or missing; if they are damaged or missing, performing step S2, otherwise performing step S3;
step S2: inferring complete auditory MFCC features from the visual CNN features through an auditory inference model based on an encoding-decoding scheme;
step S3: using the visual CNN features and the auditory MFCC features, encoding and interactively fusing the two audio-visual modalities with a multi-modal encoder based on the temporal dependencies between vision and audition to obtain a fused feature, and iteratively decoding the fused feature with a decoder to generate the video description.
The dynamic multi-modal video description generation method of the invention realizes the video description function through an audio-visual multi-modal video description generation system and an auditory inference model, wherein the preferred encoder of the audio-visual multi-modal video description generation system is a multi-modal LSTM encoder.
Fig. 2 shows a schematic diagram of an audiovisual multimodal video description generation system according to the present invention. The specific steps of the embodiment are as follows:
step S11: if the modalities of the input video are complete, they are input directly into the multi-modal video description generation system; if the auditory MFCC features of the input video are damaged or missing, the auditory inference system generates the auditory MFCC features from the known visual CNN features, and the two completed modalities are input into the multi-modal video description generation system;
step S12: the visual CNN features of the video frames are extracted with a convolutional neural network, and the auditory MFCC features of the audio segments corresponding to the video frames are extracted. The visual CNN and auditory MFCC features of the video are then input separately into the multi-modal encoder for encoding, and finally a text decoder iteratively generates the video description based on the feature representation provided by the encoder. A sketch of this feature-extraction step is given after this paragraph.
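As an illustration of the feature extraction in step S12, the following sketch pairs a generic CNN callable with librosa's MFCC routine; the library choice and parameters are assumptions, since the patent only requires CNN frame features and MFCC audio features.

import librosa
import numpy as np

def extract_features(frames, waveform, sample_rate, cnn, n_mfcc=13):
    # Visual stream: one CNN feature vector per sampled video frame.
    visual = np.stack([cnn(frame) for frame in frames])          # (T, d_v)

    # Auditory stream: MFCCs of the audio aligned with the video frames.
    mfcc = librosa.feature.mfcc(y=waveform, sr=sample_rate, n_mfcc=n_mfcc)
    auditory = mfcc.T                                            # (T_a, n_mfcc)
    return visual, auditory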
In this embodiment, step S11 is specifically: if the auditory features extracted by the audio-visual multi-modal video description generation system are damaged or lost because of the external environment or electromagnetic interference, the auditory inference model first encodes the visual features to obtain the high-level semantics, its decoder then decodes the corresponding auditory MFCC features, and the result is input into the multi-modal video description generation system.
The multi-modal encoder of the audio-visual multi-modal video description generation system is either a shared-weight multi-modal LSTM encoder or a shared-memory-unit multi-modal memory unit encoder. The multi-modal LSTM encoder is taken as the example here. As shown in fig. 3, an LSTM can model the temporal dependencies in sequence data; both the visual and auditory sequences in a video have temporal structure, and there is a resonance relationship between the two. To model this temporal dependency, the invention uses two LSTM neural networks to encode the features of the two audio-visual modalities separately, with weight sharing between the internal memory units of the two LSTM neural networks. A multi-modal encoder designed in this way can capture the temporal resonance information of the co-occurring audio-visual modalities.
Modeling of a multimodal LSTM encoder based on shared weights is shown in equations (1) - (6):
i_t^s = σ(W_i^s x_t^s + U_i h_{t-1}^s + b_i^s)    (1)
f_t^s = σ(W_f^s x_t^s + U_f h_{t-1}^s + b_f^s)    (2)
o_t^s = σ(W_o^s x_t^s + U_o h_{t-1}^s + b_o^s)    (3)
\tilde{c}_t^s = tanh(W_c^s x_t^s + U_c h_{t-1}^s + b_c^s)    (4)
c_t^s = f_t^s ⊙ c_{t-1}^s + i_t^s ⊙ \tilde{c}_t^s    (5)
h_t^s = o_t^s ⊙ tanh(c_t^s)    (6)
wherein:
i_t, f_t, o_t and \tilde{c}_t are the input gate, the forget gate, the output gate and the memory unit, respectively;
the superscript s is the index of the modality: s = 0 denotes the LSTM-based auditory information encoder, and s = 1 denotes the LSTM-based visual information encoder;
x^0 is the auditory MFCC feature and x^1 is the visual CNN feature;
W, U and b are the weight matrices of the corresponding terms, where U denotes the hidden-layer weights shared by the LSTM-based audio-visual encoders;
σ is the sigmoid function;
i is the input gate of the LSTM;
h is the hidden state of the LSTM, and h_t, h_{t-1} are the hidden states of the LSTM at times t and t-1;
W_i, W_f, W_o, W_c are the weights of the input gate, forget gate, output gate and memory unit with respect to the input x;
x_{t-1} is the input at time t-1;
U_i, U_f, U_o, U_c are the weights of the input gate, forget gate, output gate and memory unit with respect to the hidden state h;
b_i, b_f, b_o, b_c are the bias terms of the input gate, forget gate, output gate and memory unit;
c_t, c_{t-1} are the values of the memory unit at times t and t-1.
The visual CNN features and the auditory MFCC features do not share weights before being input to the multi-modal encoder, for two reasons: first, each modality learns its own specific mapping from the visual or auditory feature space to the hidden-layer space; second, the visual and auditory inputs have different dimensions, and separate weight matrices handle the differing modality dimensions more conveniently. A sketch of such a shared-weight two-stream LSTM cell is given after this paragraph.
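A minimal sketch of such a shared-weight two-stream LSTM cell follows, written in PyTorch as an assumed framework. Each modality keeps its own input transform W^s x_t^s + b^s, while the hidden-layer transform U h_{t-1}^s is a single shared module, mirroring equations (1)-(6); the stacked-gate layout and dimensions are implementation choices, not requirements of the patent.

import torch
import torch.nn as nn

class SharedWeightLSTMCell(nn.Module):
    def __init__(self, visual_dim, audio_dim, hidden_dim):
        super().__init__()
        # Modality-specific input transforms (the four gates are stacked).
        self.w_in = nn.ModuleList([
            nn.Linear(audio_dim, 4 * hidden_dim),   # s = 0: auditory MFCC
            nn.Linear(visual_dim, 4 * hidden_dim),  # s = 1: visual CNN
        ])
        # Shared hidden-layer transform U, used by both modalities.
        self.u_shared = nn.Linear(hidden_dim, 4 * hidden_dim, bias=False)

    def forward(self, x_t, h_prev, c_prev, s):
        # One step for modality s (0 = auditory, 1 = visual).
        gates = self.w_in[s](x_t) + self.u_shared(h_prev)
        i, f, o, g = gates.chunk(4, dim=-1)
        i, f, o = torch.sigmoid(i), torch.sigmoid(f), torch.sigmoid(o)
        c_t = f * c_prev + i * torch.tanh(g)     # equation (5)
        h_t = o * torch.tanh(c_t)                # equation (6)
        return h_t, c_t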
As shown in fig. 4, although an LSTM can model temporal dependencies, the dependencies it captures are relatively short-term. To further explore the effect of long-term dependencies on video description generation, the invention provides a multi-modal encoder based on a shared external memory unit. The shared-memory-unit multi-modal memory unit encoder comprises two LSTM neural networks, which encode the visual CNN features and the auditory MFCC features, respectively; the internal memory units of the two LSTM neural networks update their information through the external memory unit, with the following specific steps:
step S21: reading information from the external memory unit;
step S22: fusing the information read from the external memory unit with the internal memory units of the two LSTM neural networks, respectively, and updating the memory units of the two LSTM neural networks.
In this embodiment, the shared-memory-unit multi-modal memory unit encoder realizes the interactive fusion of the visual CNN features and the auditory MFCC features through read-write operations, with the following specific steps (a sketch of this read-fuse-write cycle is given after the list):
step S31: reading information from the memory unit of the multi-modal encoder;
step S32: fusing the information read from the memory unit with the visual CNN features and the auditory MFCC features obtained from the multi-modal encoder, respectively, to obtain fused information;
step S33: storing the fused information in the memory unit of the multi-modal encoder.
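The read-fuse-write cycle of steps S31-S33 can be sketched as follows (again in PyTorch, as an assumption). The content-based read, the fusion layer and the slot-replacement policy are illustrative choices; the patent itself only specifies that both LSTM streams read the shared external memory, fuse what they read with their internal memory units, and store the fused information back.

import torch
import torch.nn as nn

class SharedExternalMemory(nn.Module):
    def __init__(self, hidden_dim, mem_slots=32):
        super().__init__()
        self.register_buffer("memory", torch.zeros(mem_slots, hidden_dim))
        self.fuse = nn.Linear(2 * hidden_dim, hidden_dim)

    def read(self, query):
        # Step S31: content-based read over the memory slots.
        attn = torch.softmax(self.memory @ query, dim=0)    # (mem_slots,)
        return attn @ self.memory                           # (hidden_dim,)

    def fuse_and_write(self, read_vec, c_visual, c_audio):
        # Step S32: fuse the read vector with each stream's cell state.
        c_visual_new = torch.tanh(self.fuse(torch.cat([read_vec, c_visual])))
        c_audio_new = torch.tanh(self.fuse(torch.cat([read_vec, c_audio])))
        # Step S33: write the fused information back to the external memory
        # (here the slot with the smallest norm is overwritten, one simple
        # replacement policy).
        fused = 0.5 * (c_visual_new + c_audio_new)
        idx = torch.argmin(self.memory.norm(dim=1))
        self.memory[idx] = fused.detach()
        return c_visual_new, c_audio_new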
To verify the performance of the multi-modal video description generation system of the present invention, we compared the video descriptions generated by the following models, as shown in fig. 5:
Audio: a video description is generated using the auditory MFCC features alone;
Visual: a video description is generated using the visual CNN features alone;
V-Cat-A: the visual CNN features and auditory MFCC features are directly concatenated as the video feature to generate the video description;
V-ShaMem-A: the visual CNN features and auditory MFCC features share an external memory unit to obtain the final video feature and generate the video description; this model is intended to verify the effect of long-term temporal dependencies between vision and audition on video description generation;
V-ShaWei-A: the visual CNN features and auditory MFCC features share the weights of the LSTM internal memory units to obtain the final video feature and generate the video description; this model is intended to verify the effect of the temporal dependencies between vision and audition on video description generation.
For the first video in FIG. 5, the sentence generated by the Visual model focuses more on the visual information and ignores the auditory information, so it produces the wrong content ("to a man" vs. "news"); the V-Cat-A model generates the accurate object ("a man") and behavior ("talking") but loses the content ("news"), because directly concatenating the audio-visual features leads to aliasing of information and part of the information is lost; V-ShaMem-A and V-ShaWei-A can generate sentences very close to the reference sentence with the help of the auditory information, and the sentence generated by V-ShaWei-A is more accurate ("news" vs. "bathing"), because V-ShaMem-A focuses more on long-term information and therefore generates the more abstract word "bathing", while V-ShaWei-A focuses more on the fine-grained events that actually play a role.
For the second video in FIG. 5, all models can generate the related behavior ("swimming") and goal ("inter water"). However, only the V-ShaWei-A model generated the more accurate object ("fish" vs. "man" and "person"), because the V-ShaWei-A model is more concerned with the events to which the audio-visual modalities are jointly sensitive, i.e., the events caused by audio-visual resonance.
For the third video in FIG. 5, only the V-ShaWei-A model generated the more relevant behavior ("shoving" vs. "playing"), illustrating that the V-ShaWei-A model can capture the nature of the behavior.
For the fourth video in FIG. 5, the V-Cat-A and V-ShaWei-A models may generate more relevant behaviors with the help of sound cues ("knocking on a wall", "using a phone"); the V-ShaMem-A model focuses more on global events, thus providing a sentence ("lying on bed"); meanwhile, Visual models focus more on Visual information and also generate descriptions ("lying on bed").
For the fifth video in fig. 5, the events occurring in the video are more related to the visual information. Therefore, the Visual, V-ShaMem-A and V-ShaWei-A models all generated accurate behavior ("planning" vs. "playing" or "singing"); moreover, the V-ShaMem-A and V-ShaWei-A models generated more accurate objects ("a group of" vs. "a girl", "acarton filter" and "someone"), indicating that temporal dependencies are helpful for object localization; meanwhile, the V-ShaWei-A model provided the most relevant objects ("cars characters" vs. "peoples"), indicating that short temporal dependencies are more effective. The English sentences in fig. 5 are the outputs produced by the different models when describing the videos, and are given to show the technical effect and for comparison.
The auditory inference model is based on an encoding-decoding scheme, as shown in fig. 6. The co-occurring audio-visual modalities share the same high-level semantic representation; based on this representation, damaged or missing auditory MFCC features can be generated from the known visual CNN features. Specifically, an encoder first encodes the video frame features to obtain the high-level semantics, and a decoder then decodes the corresponding auditory MFCC features. Here 1024, 512 and 256 denote the numbers of neurons in the network layers of the encoder or decoder. A minimal sketch of such an encoder-decoder is given after this paragraph.
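A minimal sketch of such an encoder-decoder follows. The 1024/512/256 layer widths come from the description above, but their exact arrangement into encoder and decoder, the ReLU activations, and the L2 objective mentioned in the closing comment are assumptions made for illustration.

import torch
import torch.nn as nn

class AuditoryInference(nn.Module):
    def __init__(self, visual_dim, mfcc_dim):
        super().__init__()
        # Encoder: visual CNN features -> shared high-level semantics.
        self.encoder = nn.Sequential(
            nn.Linear(visual_dim, 1024), nn.ReLU(),
            nn.Linear(1024, 512), nn.ReLU(),
            nn.Linear(512, 256), nn.ReLU(),
        )
        # Decoder: high-level semantics -> auditory MFCC features.
        self.decoder = nn.Sequential(
            nn.Linear(256, 512), nn.ReLU(),
            nn.Linear(512, 1024), nn.ReLU(),
            nn.Linear(1024, mfcc_dim),
        )

    def forward(self, visual_feats):
        # visual_feats: (T, visual_dim) -> inferred MFCCs: (T, mfcc_dim).
        return self.decoder(self.encoder(visual_feats))

# Training could minimize an L2 loss between inferred and ground-truth MFCCs
# on videos whose auditory modality is intact.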
To verify the performance of the multi-modal video description generation system of the present invention, the video descriptions generated by several models are compared, as shown in fig. 7: GA, which generates the video description using the generated auditory MFCC features alone; Visual, which generates the video description using the visual CNN features alone; and V-ShaWei-GA, which generates the video description using the visual CNN features and the generated auditory MFCC features with shared weights.
For the first video in FIG. 7, the Visual model focuses primarily on visual cues and thus generates the wrong content "Piano", because the object behind the child looks very much like a piano and takes up more space in the picture; V-ShaWei-GA can capture the more accurate sounding object "Violin", because it can model the resonance information between the audio-visual modalities; the GA model can also generate a more relevant object description, "violin".
For the second video in FIG. 7, V-ShaWei-GA can generate a more accurate behavioral description ("eating the sounding"), indicating that V-ShaWei-GA can capture the behavior to which vision and audition are jointly sensitive, i.e., the resonance information; the GA model can also generate an accurate behavioral description, "pointing", which shows the significance of the generated auditory modality.
For the third video in FIG. 7, both the V-ShaWei-GA and GA models can generate a related object description ("girl" vs. "man"). The English sentences in fig. 7 are the outputs produced by the different models when describing the videos, and are given to show the technical effect and for comparison.
Those of skill in the art will appreciate that the various illustrative models, elements, and method steps described in connection with the embodiments disclosed herein may be implemented as electronic hardware, computer software, or combinations of both, and that the various illustrative components and steps have been described above generally in terms of their functionality in order to clearly illustrate the interchangeability of electronic hardware and software. Whether such functionality is implemented as electronic hardware or software depends upon the particular application and design constraints imposed on the solution. Skilled artisans may implement the described functionality in varying ways for each particular application, but such implementation decisions should not be interpreted as causing a departure from the scope of the present invention.
So far, the technical solutions of the present invention have been described in connection with the preferred embodiments shown in the drawings, but it is easily understood by those skilled in the art that the scope of the present invention is obviously not limited to these specific embodiments. Equivalent changes or substitutions of related technical features can be made by those skilled in the art without departing from the principle of the invention, and the technical scheme after the changes or substitutions can fall into the protection scope of the invention.

Claims (7)

1. A dynamic multi-modal video description generation method, characterized by comprising the following steps:
step S1: extracting the corresponding visual CNN features and auditory MFCC features from the video, and judging whether the auditory MFCC features are damaged or missing; if they are damaged or missing, performing step S2, otherwise performing step S3;
step S2: inferring complete auditory MFCC features from the visual CNN features through an auditory inference model based on an encoding-decoding scheme;
step S3: using the visual CNN features and the auditory MFCC features, encoding and interactively fusing the two audio-visual modalities with a multi-modal encoder based on the temporal dependencies between vision and audition to obtain a fused feature, and iteratively decoding the fused feature with a decoder to generate the video description;
wherein the multi-modal encoder is a shared-weight multi-modal LSTM encoder which comprises two LSTM neural networks used to encode the visual CNN features and the auditory MFCC features, respectively, with weight sharing between the internal memory units of the two LSTM neural networks; or
the multi-modal encoder is a shared-memory-unit multi-modal memory unit encoder which comprises two LSTM neural networks used to encode the visual CNN features and the auditory MFCC features, respectively, with the internal memory units of the two LSTM neural networks updating their information through an external memory unit.
2. The dynamic multi-modal video description generation method according to claim 1, wherein the auditory inference model generates the auditory MFCC features as follows:
encoding the CNN features of the video with an encoder to obtain high-level semantics;
decoding the corresponding auditory MFCC features with a decoder;
wherein the decoder is the decoder of the auditory inference model.
3. The method of generating dynamic multi-modal video description according to claim 1, wherein the modeling formula of the weight-sharing based multi-modal LSTM encoder is as follows:
i_t^s = σ(W_i^s x_t^s + U_i h_{t-1}^s + b_i^s)    (1)
f_t^s = σ(W_f^s x_t^s + U_f h_{t-1}^s + b_f^s)    (2)
o_t^s = σ(W_o^s x_t^s + U_o h_{t-1}^s + b_o^s)    (3)
\tilde{c}_t^s = tanh(W_c^s x_t^s + U_c h_{t-1}^s + b_c^s)    (4)
c_t^s = f_t^s ⊙ c_{t-1}^s + i_t^s ⊙ \tilde{c}_t^s    (5)
h_t^s = o_t^s ⊙ tanh(c_t^s)    (6)
wherein:
i_t, f_t, o_t and \tilde{c}_t are the input gate, the forget gate, the output gate and the memory unit, respectively;
the superscript s is the index of the modality: s = 0 denotes the LSTM-based auditory information encoder, and s = 1 denotes the LSTM-based visual information encoder;
x^0 is the auditory MFCC feature and x^1 is the visual CNN feature;
W, U and b are the weight matrices of the corresponding terms, where U denotes the hidden-layer weights shared by the LSTM-based audio-visual encoders;
σ is the sigmoid function;
i is the input gate of the LSTM;
h is the hidden state of the LSTM, and h_t, h_{t-1} are the hidden states of the LSTM at times t and t-1;
W_i, W_f, W_o, W_c are the weights of the input gate, forget gate, output gate and memory unit with respect to the input x;
x_{t-1} is the input at time t-1;
U_i, U_f, U_o, U_c are the weights of the input gate, forget gate, output gate and memory unit with respect to the hidden state h;
b_i, b_f, b_o, b_c are the bias terms of the input gate, forget gate, output gate and memory unit;
c_t, c_{t-1} are the values of the memory unit at times t and t-1.
4. The dynamic multi-modal video description generation method according to claim 1, wherein the internal memory units of the two LSTM neural networks update their information through the external memory unit as follows:
reading information from the external memory unit;
and fusing the information read from the external memory unit with the internal memory units of the two LSTM neural networks, respectively, and updating the memory units of the two LSTM neural networks.
5. The dynamic multi-modal video description generation method according to any one of claims 1 to 4, wherein the method for extracting the corresponding visual CNN features and auditory MFCC features from the video comprises:
extracting the visual CNN features of the video frames in the video with a convolutional neural network;
and extracting the auditory MFCC features from the audio segments corresponding to the video frames with a convolutional neural network.
6. The video description generation method according to claim 5, wherein the multi-modal encoder based on the shared external memory unit realizes the interactive fusion of the visual CNN features and the auditory MFCC features through read-write operations, with the following specific steps:
step S11: reading information from the memory unit of the multi-modal encoder;
step S12: fusing the information read from the memory unit with the visual CNN features and the auditory MFCC features obtained from the multi-modal encoder, respectively, to obtain fused information;
step S13: storing the fused information in the memory unit of the multi-modal encoder.
7. The video description generation method of claim 5, wherein the visual CNN features and the auditory MFCC features do not share weights before being input to the multi-modal encoder.
CN201711433810.6A 2017-12-26 2017-12-26 Dynamic multi-modal video description generation method Active CN108200483B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201711433810.6A CN108200483B (en) 2017-12-26 2017-12-26 Dynamic multi-modal video description generation method

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201711433810.6A CN108200483B (en) 2017-12-26 2017-12-26 Dynamic multi-modal video description generation method

Publications (2)

Publication Number Publication Date
CN108200483A CN108200483A (en) 2018-06-22
CN108200483B true CN108200483B (en) 2020-02-28

Family

ID=62584286

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201711433810.6A Active CN108200483B (en) 2017-12-26 2017-12-26 Dynamic multi-modal video description generation method

Country Status (1)

Country Link
CN (1) CN108200483B (en)

Families Citing this family (13)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN109190683A (en) * 2018-08-14 2019-01-11 电子科技大学 A kind of classification method based on attention mechanism and bimodal image
US10846522B2 (en) * 2018-10-16 2020-11-24 Google Llc Speaking classification using audio-visual data
CN111464881B (en) * 2019-01-18 2021-08-13 复旦大学 Full-convolution video description generation method based on self-optimization mechanism
CN109885723B (en) * 2019-02-20 2023-10-13 腾讯科技(深圳)有限公司 Method for generating video dynamic thumbnail, method and device for model training
CN110110636B (en) * 2019-04-28 2021-03-02 清华大学 Video logic mining device and method based on multi-input single-output coding and decoding model
CN110222227B (en) * 2019-05-13 2021-03-23 西安交通大学 Chinese folk song geographical classification method integrating auditory perception features and visual features
CN110826397B (en) * 2019-09-20 2022-07-26 浙江大学 Video description method based on high-order low-rank multi-modal attention mechanism
CN111309971B (en) * 2020-01-19 2022-03-25 浙江工商大学 Multi-level coding-based text-to-video cross-modal retrieval method
CN111859005B (en) * 2020-07-01 2022-03-29 江西理工大学 Cross-layer multi-model feature fusion and image description method based on convolutional decoding
CN112069361A (en) * 2020-08-27 2020-12-11 新华智云科技有限公司 Video description text generation method based on multi-mode fusion
CN112287893B (en) * 2020-11-25 2023-07-18 广东技术师范大学 Sow lactation behavior identification method based on audio and video information fusion
CN112331337B (en) 2021-01-04 2021-04-16 中国科学院自动化研究所 Automatic depression detection method, device and equipment
CN114581749B (en) * 2022-05-09 2022-07-26 城云科技(中国)有限公司 Audio-visual feature fusion target behavior identification method and device and application

Citations (7)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN105005558A (en) * 2015-08-14 2015-10-28 武汉大学 Multi-modal data fusion method based on crowd sensing
CN105279495A (en) * 2015-10-23 2016-01-27 天津大学 Video description method based on deep learning and text summarization
CN106708890A (en) * 2015-11-17 2017-05-24 创意引晴股份有限公司 Intelligent high fault-tolerant video identification system based on multimoding fusion and identification method thereof
CN107220591A (en) * 2017-04-28 2017-09-29 哈尔滨工业大学深圳研究生院 Multi-modal intelligent mood sensing system
CN107256221A (en) * 2017-04-26 2017-10-17 苏州大学 Video presentation method based on multi-feature fusion
CN107391646A (en) * 2017-07-13 2017-11-24 清华大学 A kind of Semantic features extraction method and device of video image
CN107451552A (en) * 2017-07-25 2017-12-08 北京联合大学 A kind of gesture identification method based on 3D CNN and convolution LSTM

Patent Citations (7)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN105005558A (en) * 2015-08-14 2015-10-28 武汉大学 Multi-modal data fusion method based on crowd sensing
CN105279495A (en) * 2015-10-23 2016-01-27 天津大学 Video description method based on deep learning and text summarization
CN106708890A (en) * 2015-11-17 2017-05-24 创意引晴股份有限公司 Intelligent high fault-tolerant video identification system based on multimoding fusion and identification method thereof
CN107256221A (en) * 2017-04-26 2017-10-17 苏州大学 Video presentation method based on multi-feature fusion
CN107220591A (en) * 2017-04-28 2017-09-29 哈尔滨工业大学深圳研究生院 Multi-modal intelligent mood sensing system
CN107391646A (en) * 2017-07-13 2017-11-24 清华大学 A kind of Semantic features extraction method and device of video image
CN107451552A (en) * 2017-07-25 2017-12-08 北京联合大学 A kind of gesture identification method based on 3D CNN and convolution LSTM

Also Published As

Publication number Publication date
CN108200483A (en) 2018-06-22

Similar Documents

Publication Publication Date Title
CN108200483B (en) Dynamic multi-modal video description generation method
Cannizzaro Internet memes as internet signs: A semiotic view of digital culture
Xu et al. Dual-stream recurrent neural network for video captioning
US11409791B2 (en) Joint heterogeneous language-vision embeddings for video tagging and search
US20210390700A1 (en) Referring image segmentation
CN111209440B (en) Video playing method, device and storage medium
Chen et al. Learning a recurrent visual representation for image caption generation
Prabhakaran Multimedia database management systems
Xie et al. Attention-based dense LSTM for speech emotion recognition
CN112104919A (en) Content title generation method, device, equipment and computer readable storage medium based on neural network
Chang et al. The prompt artists
CN114339450B (en) Video comment generation method, system, device and storage medium
CN116681810B (en) Virtual object action generation method, device, computer equipment and storage medium
KR102165160B1 (en) Apparatus for predicting sequence of intention using recurrent neural network model based on sequential information and method thereof
CN113704460A (en) Text classification method and device, electronic equipment and storage medium
Niu et al. Improvement on speech emotion recognition based on deep convolutional neural networks
CN113157941B (en) Service characteristic data processing method, service characteristic data processing device, text generating method, text generating device and electronic equipment
CN113010780B (en) Model training and click rate estimation method and device
CN112417118B (en) Dialog generation method based on marked text and neural network
Rastgoo et al. All You Need In Sign Language Production
Rodriguez et al. How important is motion in sign language translation?
CN115169472A (en) Music matching method and device for multimedia data and computer equipment
Chu et al. The forgettable-watcher model for video question answering
CN116824461B (en) Question understanding guiding video question answering method and system
CN116737756B (en) Data query method, device, equipment and storage medium

Legal Events

Date Code Title Description
PB01 Publication
PB01 Publication
SE01 Entry into force of request for substantive examination
SE01 Entry into force of request for substantive examination
GR01 Patent grant
GR01 Patent grant