CN108200483B - Dynamic multi-modal video description generation method - Google Patents

Dynamic multi-modal video description generation method

Info

Publication number
CN108200483B
CN108200483B (application CN201711433810.6A)
Authority
CN
China
Prior art keywords
visual
auditory
video
modal
memory unit
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN201711433810.6A
Other languages
Chinese (zh)
Other versions
CN108200483A (en)
Inventor
张兆翔
郝王丽
关赫
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Institute of Automation of Chinese Academy of Science
Original Assignee
Institute of Automation of Chinese Academy of Science
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Institute of Automation of Chinese Academy of Science filed Critical Institute of Automation of Chinese Academy of Science
Priority to CN201711433810.6A priority Critical patent/CN108200483B/en
Publication of CN108200483A publication Critical patent/CN108200483A/en
Application granted granted Critical
Publication of CN108200483B publication Critical patent/CN108200483B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical

Classifications

    • HELECTRICITY
    • H04ELECTRIC COMMUNICATION TECHNIQUE
    • H04NPICTORIAL COMMUNICATION, e.g. TELEVISION
    • H04N21/00Selective content distribution, e.g. interactive television or video on demand [VOD]
    • H04N21/80Generation or processing of content or additional data by content creator independently of the distribution process; Content per se
    • H04N21/83Generation or processing of protective or descriptive data associated with content; Content structuring
    • H04N21/84Generation or processing of descriptive data, e.g. content descriptors
    • HELECTRICITY
    • H04ELECTRIC COMMUNICATION TECHNIQUE
    • H04NPICTORIAL COMMUNICATION, e.g. TELEVISION
    • H04N21/00Selective content distribution, e.g. interactive television or video on demand [VOD]
    • H04N21/40Client devices specifically adapted for the reception of or interaction with content, e.g. set-top-box [STB]; Operations thereof
    • H04N21/43Processing of content or additional data, e.g. demultiplexing additional data from a digital video stream; Elementary client operations, e.g. monitoring of home network or synchronising decoder's clock; Client middleware
    • H04N21/44Processing of video elementary streams, e.g. splicing a video clip retrieved from local storage with an incoming video stream or rendering scenes according to encoded video stream scene graphs
    • H04N21/44008Processing of video elementary streams, e.g. splicing a video clip retrieved from local storage with an incoming video stream or rendering scenes according to encoded video stream scene graphs involving operations for analysing video streams, e.g. detecting features or characteristics in the video stream
    • HELECTRICITY
    • H04ELECTRIC COMMUNICATION TECHNIQUE
    • H04NPICTORIAL COMMUNICATION, e.g. TELEVISION
    • H04N21/00Selective content distribution, e.g. interactive television or video on demand [VOD]
    • H04N21/40Client devices specifically adapted for the reception of or interaction with content, e.g. set-top-box [STB]; Operations thereof
    • H04N21/45Management operations performed by the client for facilitating the reception of or the interaction with the content or administrating data related to the end-user or to the client device itself, e.g. learning user preferences for recommending movies, resolving scheduling conflicts
    • H04N21/466Learning process for intelligent management, e.g. learning user preferences for recommending movies
    • H04N21/4662Learning process for intelligent management, e.g. learning user preferences for recommending movies characterized by learning algorithms
    • H04N21/4666Learning process for intelligent management, e.g. learning user preferences for recommending movies characterized by learning algorithms using neural networks, e.g. processing the feedback provided by the user

Landscapes

  • Engineering & Computer Science (AREA)
  • Multimedia (AREA)
  • Signal Processing (AREA)
  • Databases & Information Systems (AREA)
  • Artificial Intelligence (AREA)
  • Evolutionary Computation (AREA)
  • Compression Or Coding Systems Of Tv Signals (AREA)

Abstract

The invention belongs to the field of video description and specifically relates to a dynamic multi-modal video description generation method. It aims to capture the resonance information of the audio-visual modalities in order to produce an ideal video description and, in addition, to handle cases where the auditory modality of the video is damaged or missing. In the feature-encoding stage of the visual and auditory modalities, the multi-modal video description generation system provided by the invention either shares the weights of the LSTM internal memory units or shares an external memory unit, thereby modeling the temporal dependencies between the visual and auditory modalities and capturing their resonance information. In addition, the invention infers the corresponding auditory modality information from the known visual modality information through an auditory inference system. The method can generate video descriptions quickly and effectively.

Description

Dynamic multi-modal video description generation method
Technical Field
The invention belongs to the field of video description, and particularly relates to a dynamic multi-modal video description generation method.
Background
A video description generation system automatically generates a corresponding natural-language description for a given video. Video description generation is inspired by image description generation, which iteratively generates an image or video description from the features of the given image or video frames.
Conventional video description generation can be roughly classified into three types of methods:
The first category is template-based methods. They first determine the semantic concepts contained in the video, then infer the sentence structure from a predefined sentence template, and finally use a probabilistic graphical model to collect the most relevant content from which a sentence description can be generated. Although grammatically correct sentences can be generated in this way, the sentences lack richness and flexibility.
The second category treats video description as a retrieval problem. The video is first tagged with metadata, and then sentence descriptions and videos are matched according to the corresponding tags. Sentences generated in this way are more natural than those of the first category, but they are largely limited by the metadata.
A third type of video description generation method directly maps a video representation onto a particular sentence. Venugopalan et al. first perform feature extraction on all image frames in the video using a convolutional neural network (CNN) and apply an average-pooling operation to them to obtain a fixed-length video representation; a corresponding video description is then generated from this representation with an LSTM decoder. Although this method conveniently obtains a video description, it ignores the temporal characteristics implicit in the video. To explore the role of temporal features in video description generation, Venugopalan et al. used an LSTM to encode the features of the video frames into a fixed-length video representation. Yao et al. explored the effect of the implicit local and global temporal properties of video on video description generation: the local temporal properties are encoded by a spatio-temporal convolutional neural network, and the global temporal properties are encoded by a temporal attention mechanism. To further improve the performance of video description generation, Ballas et al. studied the role of intermediate-layer video representations. Pan et al. proposed a hierarchical recurrent neural network decoder to explore the contribution of temporal properties at different granularities to video description generation. To produce a more detailed description of a video, Haonan Yu et al. proposed hierarchical recurrent neural networks that generate a paragraph description of the video. Their hierarchical model contains a sentence generator and a paragraph generator: the sentence generator produces a simple sentence description for a particular short video interval, and the paragraph generator produces a paragraph description for the video by capturing inter-sentence dependencies based on the multiple sentence encodings produced by the sentence generator.
The above video description generation models have a common property: their video representation is based on visual information only. In order to fully utilize the information contained in a video, many researchers have proposed fusing the multi-modal information in the video to obtain a better video description. Vasili Ramanishka et al. directly concatenate the corresponding visual and auditory features to obtain a multi-modal video representation. Qin Jin et al. use a multi-layer feed-forward neural network to fuse the visual and auditory information. Although these methods may improve the performance of video description to some extent, they are still limited in the following ways. Directly concatenating visual and auditory features can lead to aliasing of information and degradation of performance. In addition, the model proposed by Qin Jin et al. obtains video features by pooling all video frame features or audio segment features, ignoring the temporal dependencies between the visual and auditory features.
The multi-modal video description generation system provided by the invention models the temporal dependencies between the visual and auditory modalities by sharing the weights of the LSTM internal memory units, or by sharing an external memory unit, in the feature-encoding stage of the two modalities; it thereby captures the resonance information of the audio-visual modalities and generates an ideal video description. In addition, the auditory modality of a video may in some cases be damaged or missing because of environmental influences, sensor interference, and the like. In that case, in order to prevent the performance of the video description generation system from being greatly affected, the invention proposes an auditory inference system that infers the corresponding auditory modality information from the known visual modality information; the completed visual and auditory modality information is then input into the multi-modal video description generation system to generate the video description.
Disclosure of Invention
In order to solve the above problem, namely that a video description cannot be generated accurately when the auditory modality is damaged or missing, the dynamic multi-modal video description generation method provided by the invention comprises the following steps:
step S1: extracting the corresponding visual CNN features and auditory MFCC features from the video, and judging whether the auditory MFCC features are damaged or missing; if they are damaged or missing, performing step S2, otherwise performing step S3;
step S2: inferring complete auditory MFCC features from the visual CNN features through an auditory inference model based on an encoding-decoding scheme;
step S3: using the visual CNN features and the auditory MFCC features, encoding and interactively fusing the two audio-visual modalities with a multi-modal encoder based on the temporal dependencies between vision and audition to obtain a fused feature, and iteratively decoding the fused feature with a decoder to generate the video description. A minimal sketch of this dispatch logic is given after these steps.
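The following Python sketch summarizes the dispatch of steps S1-S3. It is only an illustration under assumed interfaces: the helper callables (cnn, mfcc_extractor, auditory_inference, multimodal_encoder, text_decoder) are hypothetical names, not an API defined by the invention.

import numpy as np

def describe_video(frames, audio, cnn, mfcc_extractor,
                   auditory_inference, multimodal_encoder, text_decoder):
    # Step S1: extract visual CNN features and auditory MFCC features.
    visual_feats = np.stack([cnn(f) for f in frames])            # (T, d_v)
    auditory_feats = mfcc_extractor(audio) if audio is not None else None

    # Step S2: if the auditory modality is damaged or missing, infer it
    # from the visual CNN features with the encoding-decoding model.
    if auditory_feats is None or np.isnan(auditory_feats).any():
        auditory_feats = auditory_inference(visual_feats)        # (T, d_a)

    # Step S3: fuse the two modalities with the multi-modal encoder and
    # iteratively decode the fused representation into a sentence.
    fused = multimodal_encoder(visual_feats, auditory_feats)
    return text_decoder(fused)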
Preferably, the auditory inference model generates the auditory MFCC features as follows:
encoding the visual CNN features with an encoder to obtain high-level semantics;
decoding the corresponding auditory MFCC features with a decoder.
Preferably, the multi-modal encoder is a shared-weight multi-modal LSTM encoder, or a shared-memory-unit multi-modal memory unit encoder.
Preferably, the shared-weight multi-modal LSTM encoder includes two LSTM neural networks, which are used to encode the visual CNN features and the auditory MFCC features, respectively; the internal memory units of the two LSTM neural networks share weights.
Preferably, the modeling formula of the multi-modal LSTM encoder based on the shared weight is as follows:
i_t^s = σ(W_i^s x_t^s + U_i h_{t-1}^s + b_i^s)    (1)
f_t^s = σ(W_f^s x_t^s + U_f h_{t-1}^s + b_f^s)    (2)
o_t^s = σ(W_o^s x_t^s + U_o h_{t-1}^s + b_o^s)    (3)
\tilde{c}_t^s = tanh(W_c^s x_t^s + U_c h_{t-1}^s + b_c^s)    (4)
c_t^s = f_t^s ⊙ c_{t-1}^s + i_t^s ⊙ \tilde{c}_t^s    (5)
h_t^s = o_t^s ⊙ tanh(c_t^s)    (6)
wherein:
i_t, f_t, o_t and \tilde{c}_t are the input gate, the forget gate, the output gate and the memory unit, respectively;
the superscript s is the index of the modality: s = 0 denotes the LSTM-based auditory information encoder, and s = 1 denotes the LSTM-based visual information encoder;
x^0 is the auditory MFCC feature and x^1 is the visual CNN feature;
W, U and b are the weight matrices of the corresponding terms, where U denotes the hidden-layer weights shared by the LSTM-based audio-visual encoders;
σ is the sigmoid function;
i is the input gate of the LSTM;
h is the hidden state of the LSTM, and h_t, h_{t-1} are the hidden states of the LSTM at times t and t-1;
W_i, W_f, W_o, W_c are the weights of the input gate, forget gate, output gate and memory unit with respect to the input x;
x_{t-1} is the input at time t-1;
U_i, U_f, U_o, U_c are the weights of the input gate, forget gate, output gate and memory unit with respect to the hidden state h;
b_i, b_f, b_o, b_c are the bias terms of the input gate, forget gate, output gate and memory unit;
c_t, c_{t-1} are the values of the memory unit at times t and t-1.
Preferably, the shared-memory-unit multi-modal memory unit encoder comprises two LSTM neural networks for encoding the visual CNN features and the auditory MFCC features, respectively; and the internal memory units of the two LSTM neural networks update their information through the external memory unit.
Preferably, the "internal memory units of the two LSTM neural networks perform information updating through the external memory units" includes:
reading information from an external memory unit;
and respectively fusing the information read by the external memory unit and the internal memory units of the two LSTM neural networks, and updating the memory units of the two LSTM neural networks.
Preferably, the method for extracting the visual CNN features and the auditory MFCC features corresponding to the video comprises:
extracting the visual CNN features of the video frames in the video with a convolutional neural network;
and extracting the auditory MFCC features from the audio segments corresponding to the video frames with a convolutional neural network.
Preferably, the multi-modal encoder based on the shared external memory unit realizes the interactive fusion of the visual CNN features and the auditory MFCC features through read-write operations, with the following specific steps:
step S11: reading information from the memory unit of the multi-modal encoder;
step S12: fusing the information read from the memory unit with the visual CNN features and the auditory MFCC features obtained from the multi-modal encoder, respectively, to obtain fused information;
step S13: storing the fused information in the memory unit of the multi-modal encoder.
Preferably, the visual CNN features and the auditory MFCC features do not share weights before being input to the multimodal encoder.
According to the technical scheme above, the invention provides a dynamic multi-modal video description generation model and thereby a fast and effective method for video description. Compared with the prior art, the invention has the following advantages:
(1) The input information of the invention comes from the two modalities of vision and audition; compared with a conventional video description generation system based on visual information only, higher performance can be obtained.
(2) The invention fuses audio-visual information by modeling the temporal dependencies between the audio-visual modalities. These temporal dependencies represent, to some extent, the resonance information between the audio-visual modalities, i.e., the events that actually occur in the video. The invention can effectively model this resonance information and thus generate an ideal video description.
(3) The invention establishes a unified video description system both for videos with complete modal information and for videos with a missing auditory modality. If the input video has both complete modalities, they are input directly into the multi-modal video description generation system; if the auditory modality of the input video is damaged or missing, the auditory inference system first generates the auditory features from the visual features, and the two completed modalities are then input into the multi-modal video description generation system.
Drawings
FIG. 1 is a flow chart of a method for generating a dynamic multi-modal video description according to an embodiment of the present invention;
FIG. 2 is a schematic diagram of a dynamic multi-modal video description generation model according to an embodiment of the present invention;
FIG. 3 is a schematic diagram of a multi-modal visual-auditory fusion coding model based on a shared weight according to an embodiment of the present invention;
FIG. 4 is a schematic diagram of a multi-modal audio-visual fusion coding model based on a shared external memory unit according to an embodiment of the present invention;
FIG. 5 shows video descriptions generated by various baseline models and the multi-modal fusion model according to an embodiment of the present invention;
FIG. 6 is a diagram of an auditory inference model in accordance with an embodiment of the present invention;
fig. 7 shows video descriptions generated by various baseline models after the auditory modality has been supplemented, according to an embodiment of the present invention.
Detailed Description
Preferred embodiments of the present invention are described below with reference to the accompanying drawings. It should be understood by those skilled in the art that these embodiments are only for explaining the technical principle of the present invention, and are not intended to limit the scope of the present invention.
As shown in fig. 1, the dynamic multi-modal video description generation method provided by the present invention specifically includes the following steps:
step S1: extracting the corresponding visual CNN features and auditory MFCC features from the video, and judging whether the auditory MFCC features are damaged or missing; if they are damaged or missing, performing step S2, otherwise performing step S3;
step S2: inferring complete auditory MFCC features from the visual CNN features through an auditory inference model based on an encoding-decoding scheme;
step S3: using the visual CNN features and the auditory MFCC features, encoding and interactively fusing the two audio-visual modalities with a multi-modal encoder based on the temporal dependencies between vision and audition to obtain a fused feature, and iteratively decoding the fused feature with a decoder to generate the video description.
The dynamic multi-modal video description generation method of the invention realizes the video description function through an audio-visual multi-modal video description generation system and an auditory inference model, wherein the preferred encoder of the audio-visual multi-modal video description generation system is a multi-modal LSTM encoder.
Fig. 2 shows a schematic diagram of an audiovisual multimodal video description generation system according to the present invention. The specific steps of the embodiment are as follows:
step S11: if the modalities of the input video are complete, they are input directly into the multi-modal video description generation system; if the auditory MFCC features of the input video are damaged or missing, the auditory inference system generates the auditory MFCC features from the known visual CNN features, and the two completed modalities are input into the multi-modal video description generation system;
step S12: the visual CNN features of the video frames are extracted with a convolutional neural network, and the auditory MFCC features of the audio segments corresponding to the video frames are extracted. The visual CNN and auditory MFCC features of the video are then input separately into the multi-modal encoder for encoding, and finally a text decoder iteratively generates the video description based on the feature representation provided by the encoder. A sketch of this feature-extraction step is given after this paragraph.
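As an illustration of the feature extraction in step S12, the following sketch pairs a generic CNN callable with librosa's MFCC routine; the library choice and parameters are assumptions, since the patent only requires CNN frame features and MFCC audio features.

import librosa
import numpy as np

def extract_features(frames, waveform, sample_rate, cnn, n_mfcc=13):
    # Visual stream: one CNN feature vector per sampled video frame.
    visual = np.stack([cnn(frame) for frame in frames])          # (T, d_v)

    # Auditory stream: MFCCs of the audio aligned with the video frames.
    mfcc = librosa.feature.mfcc(y=waveform, sr=sample_rate, n_mfcc=n_mfcc)
    auditory = mfcc.T                                            # (T_a, n_mfcc)
    return visual, auditory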
In this embodiment, step S11 is specifically: if the auditory features extracted by the audio-visual multi-modal video description generation system are damaged or lost because of the external environment or electromagnetic interference, the auditory inference model first encodes the visual features to obtain the high-level semantics, its decoder then decodes the corresponding auditory MFCC features, and the result is input into the multi-modal video description generation system.
The multi-modal encoder of the audio-visual multi-modal video description generation system is either a shared-weight multi-modal LSTM encoder or a shared-memory-unit multi-modal memory unit encoder. The multi-modal LSTM encoder is taken as the example here. As shown in fig. 3, an LSTM can model the temporal dependencies in sequence data; both the visual and auditory sequences in a video have temporal structure, and there is a resonance relationship between the two. To model this temporal dependency, the invention uses two LSTM neural networks to encode the features of the two audio-visual modalities separately, with weight sharing between the internal memory units of the two LSTM neural networks. A multi-modal encoder designed in this way can capture the temporal resonance information of the co-occurring audio-visual modalities.
Modeling of a multimodal LSTM encoder based on shared weights is shown in equations (1) - (6):
i_t^s = σ(W_i^s x_t^s + U_i h_{t-1}^s + b_i^s)    (1)
f_t^s = σ(W_f^s x_t^s + U_f h_{t-1}^s + b_f^s)    (2)
o_t^s = σ(W_o^s x_t^s + U_o h_{t-1}^s + b_o^s)    (3)
\tilde{c}_t^s = tanh(W_c^s x_t^s + U_c h_{t-1}^s + b_c^s)    (4)
c_t^s = f_t^s ⊙ c_{t-1}^s + i_t^s ⊙ \tilde{c}_t^s    (5)
h_t^s = o_t^s ⊙ tanh(c_t^s)    (6)
wherein:
i_t, f_t, o_t and \tilde{c}_t are the input gate, the forget gate, the output gate and the memory unit, respectively;
the superscript s is the index of the modality: s = 0 denotes the LSTM-based auditory information encoder, and s = 1 denotes the LSTM-based visual information encoder;
x^0 is the auditory MFCC feature and x^1 is the visual CNN feature;
W, U and b are the weight matrices of the corresponding terms, where U denotes the hidden-layer weights shared by the LSTM-based audio-visual encoders;
σ is the sigmoid function;
i is the input gate of the LSTM;
h is the hidden state of the LSTM, and h_t, h_{t-1} are the hidden states of the LSTM at times t and t-1;
W_i, W_f, W_o, W_c are the weights of the input gate, forget gate, output gate and memory unit with respect to the input x;
x_{t-1} is the input at time t-1;
U_i, U_f, U_o, U_c are the weights of the input gate, forget gate, output gate and memory unit with respect to the hidden state h;
b_i, b_f, b_o, b_c are the bias terms of the input gate, forget gate, output gate and memory unit;
c_t, c_{t-1} are the values of the memory unit at times t and t-1.
The visual CNN features and the auditory MFCC features do not share weights before being input to the multi-modal encoder, for two reasons: first, each modality learns its own specific mapping from the visual or auditory feature space to the hidden-layer space; second, the visual and auditory inputs have different dimensions, and separate weight matrices handle the differing modality dimensions more conveniently. A sketch of such a shared-weight two-stream LSTM cell is given after this paragraph.
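A minimal sketch of such a shared-weight two-stream LSTM cell follows, written in PyTorch as an assumed framework. Each modality keeps its own input transform W^s x_t^s + b^s, while the hidden-layer transform U h_{t-1}^s is a single shared module, mirroring equations (1)-(6); the stacked-gate layout and dimensions are implementation choices, not requirements of the patent.

import torch
import torch.nn as nn

class SharedWeightLSTMCell(nn.Module):
    def __init__(self, visual_dim, audio_dim, hidden_dim):
        super().__init__()
        # Modality-specific input transforms (the four gates are stacked).
        self.w_in = nn.ModuleList([
            nn.Linear(audio_dim, 4 * hidden_dim),   # s = 0: auditory MFCC
            nn.Linear(visual_dim, 4 * hidden_dim),  # s = 1: visual CNN
        ])
        # Shared hidden-layer transform U, used by both modalities.
        self.u_shared = nn.Linear(hidden_dim, 4 * hidden_dim, bias=False)

    def forward(self, x_t, h_prev, c_prev, s):
        # One step for modality s (0 = auditory, 1 = visual).
        gates = self.w_in[s](x_t) + self.u_shared(h_prev)
        i, f, o, g = gates.chunk(4, dim=-1)
        i, f, o = torch.sigmoid(i), torch.sigmoid(f), torch.sigmoid(o)
        c_t = f * c_prev + i * torch.tanh(g)     # equation (5)
        h_t = o * torch.tanh(c_t)                # equation (6)
        return h_t, c_t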
As shown in fig. 4, although an LSTM can model temporal dependencies, the dependencies it captures are relatively short-term. To further explore the effect of long-term dependencies on video description generation, the invention provides a multi-modal encoder based on a shared external memory unit. The shared-memory-unit multi-modal memory unit encoder comprises two LSTM neural networks, which encode the visual CNN features and the auditory MFCC features, respectively; the internal memory units of the two LSTM neural networks update their information through the external memory unit, with the following specific steps:
step S21: reading information from the external memory unit;
step S22: fusing the information read from the external memory unit with the internal memory units of the two LSTM neural networks, respectively, and updating the memory units of the two LSTM neural networks.
In this embodiment, the shared-memory-unit multi-modal memory unit encoder realizes the interactive fusion of the visual CNN features and the auditory MFCC features through read-write operations, with the following specific steps (a sketch of this read-fuse-write cycle is given after the list):
step S31: reading information from the memory unit of the multi-modal encoder;
step S32: fusing the information read from the memory unit with the visual CNN features and the auditory MFCC features obtained from the multi-modal encoder, respectively, to obtain fused information;
step S33: storing the fused information in the memory unit of the multi-modal encoder.
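The read-fuse-write cycle of steps S31-S33 can be sketched as follows (again in PyTorch, as an assumption). The content-based read, the fusion layer and the slot-replacement policy are illustrative choices; the patent itself only specifies that both LSTM streams read the shared external memory, fuse what they read with their internal memory units, and store the fused information back.

import torch
import torch.nn as nn

class SharedExternalMemory(nn.Module):
    def __init__(self, hidden_dim, mem_slots=32):
        super().__init__()
        self.register_buffer("memory", torch.zeros(mem_slots, hidden_dim))
        self.fuse = nn.Linear(2 * hidden_dim, hidden_dim)

    def read(self, query):
        # Step S31: content-based read over the memory slots.
        attn = torch.softmax(self.memory @ query, dim=0)    # (mem_slots,)
        return attn @ self.memory                           # (hidden_dim,)

    def fuse_and_write(self, read_vec, c_visual, c_audio):
        # Step S32: fuse the read vector with each stream's cell state.
        c_visual_new = torch.tanh(self.fuse(torch.cat([read_vec, c_visual])))
        c_audio_new = torch.tanh(self.fuse(torch.cat([read_vec, c_audio])))
        # Step S33: write the fused information back to the external memory
        # (here the slot with the smallest norm is overwritten, one simple
        # replacement policy).
        fused = 0.5 * (c_visual_new + c_audio_new)
        idx = torch.argmin(self.memory.norm(dim=1))
        self.memory[idx] = fused.detach()
        return c_visual_new, c_audio_new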
To verify the performance of the multi-modal video description generation system of the present invention, we compared the video descriptions generated by the following models, as shown in fig. 5:
Audio: a video description is generated using the auditory MFCC features alone;
Visual: a video description is generated using the visual CNN features alone;
V-Cat-A: the visual CNN features and auditory MFCC features are directly concatenated as the video feature to generate the video description;
V-ShaMem-A: the visual CNN features and auditory MFCC features share an external memory unit to obtain the final video feature and generate the video description; this model is intended to verify the effect of long-term temporal dependencies between vision and audition on video description generation;
V-ShaWei-A: the visual CNN features and auditory MFCC features share the weights of the LSTM internal memory units to obtain the final video feature and generate the video description; this model is intended to verify the effect of the temporal dependencies between vision and audition on video description generation.
For the first video in FIG. 5, the sentence generated by the Visual model focuses more on the visual information and ignores the auditory information, so it produces the wrong content ("to a man" vs. "news"); the V-Cat-A model generates the accurate object ("a man") and behavior ("talking") but loses the content ("news"), because directly concatenating the audio-visual features leads to aliasing of information and part of the information is lost; V-ShaMem-A and V-ShaWei-A can generate sentences very close to the reference sentence with the help of the auditory information, and the sentence generated by V-ShaWei-A is more accurate ("news" vs. "bathing"), because V-ShaMem-A focuses more on long-term information and therefore generates the more abstract word "bathing", while V-ShaWei-A focuses more on the fine-grained events that actually play a role.
For the second video in FIG. 5, all models can generate the related behavior ("swimming") and goal ("inter water"). However, only the V-ShaWei-A model generated the more accurate object ("fish" vs. "man" and "person"), because the V-ShaWei-A model is more concerned with the events to which the audio-visual modalities are jointly sensitive, i.e., the events caused by audio-visual resonance.
For the third video in FIG. 5, only the V-ShaWei-A model generated the more relevant behavior ("shoving" vs. "playing"), illustrating that the V-ShaWei-A model can capture the nature of the behavior.
For the fourth video in FIG. 5, the V-Cat-A and V-ShaWei-A models may generate more relevant behaviors with the help of sound cues ("knocking on a wall", "using a phone"); the V-ShaMem-A model focuses more on global events, thus providing a sentence ("lying on bed"); meanwhile, Visual models focus more on Visual information and also generate descriptions ("lying on bed").
For the fifth video in fig. 5, the events occurring in the video are more related to the visual information. Therefore, the Visual, V-ShaMem-A and V-ShaWei-A models all generated accurate behavior ("planning" vs. "playing" or "singing"); moreover, the V-ShaMem-A and V-ShaWei-A models generated more accurate objects ("a group of" vs. "a girl", "acarton filter" and "someone"), indicating that temporal dependencies are helpful for object localization; meanwhile, the V-ShaWei-A model provided the most relevant objects ("cars characters" vs. "peoples"), indicating that short temporal dependencies are more effective. The English sentences in fig. 5 are the outputs produced by the different models when describing the videos, and are given to show the technical effect and for comparison.
The auditory inference model is based on an encoding-decoding scheme, as shown in fig. 6. The co-occurring audio-visual modalities share the same high-level semantic representation; based on this representation, damaged or missing auditory MFCC features can be generated from the known visual CNN features. Specifically, an encoder first encodes the video frame features to obtain the high-level semantics, and a decoder then decodes the corresponding auditory MFCC features. Here 1024, 512 and 256 denote the numbers of neurons in the network layers of the encoder or decoder. A minimal sketch of such an encoder-decoder is given after this paragraph.
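A minimal sketch of such an encoder-decoder follows. The 1024/512/256 layer widths come from the description above, but their exact arrangement into encoder and decoder, the ReLU activations, and the L2 objective mentioned in the closing comment are assumptions made for illustration.

import torch
import torch.nn as nn

class AuditoryInference(nn.Module):
    def __init__(self, visual_dim, mfcc_dim):
        super().__init__()
        # Encoder: visual CNN features -> shared high-level semantics.
        self.encoder = nn.Sequential(
            nn.Linear(visual_dim, 1024), nn.ReLU(),
            nn.Linear(1024, 512), nn.ReLU(),
            nn.Linear(512, 256), nn.ReLU(),
        )
        # Decoder: high-level semantics -> auditory MFCC features.
        self.decoder = nn.Sequential(
            nn.Linear(256, 512), nn.ReLU(),
            nn.Linear(512, 1024), nn.ReLU(),
            nn.Linear(1024, mfcc_dim),
        )

    def forward(self, visual_feats):
        # visual_feats: (T, visual_dim) -> inferred MFCCs: (T, mfcc_dim).
        return self.decoder(self.encoder(visual_feats))

# Training could minimize an L2 loss between inferred and ground-truth MFCCs
# on videos whose auditory modality is intact.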
To verify the performance of the multi-modal video description generation system of the present invention, the video descriptions generated by several models are compared, as shown in fig. 7: GA, which generates the video description using the generated auditory MFCC features alone; Visual, which generates the video description using the visual CNN features alone; and V-ShaWei-GA, which generates the video description using the visual CNN features and the generated auditory MFCC features with shared weights.
For the first video in FIG. 7, the Visual model focuses primarily on visual cues and thus generates the wrong content "Piano", because the object behind the child looks very much like a piano and takes up more space in the picture; V-ShaWei-GA can capture the more accurate sounding object "Violin", because it can model the resonance information between the audio-visual modalities; the GA model can also generate a more relevant object description, "violin".
For the second video in FIG. 7, V-ShaWei-GA can generate a more accurate behavioral description ("eating the sounding"), indicating that V-ShaWei-GA can capture the behavior to which vision and audition are jointly sensitive, i.e., the resonance information; the GA model can also generate an accurate behavioral description, "pointing", which shows the significance of the generated auditory modality.
For the third video in FIG. 7, both the V-ShaWei-GA and GA models can generate a related object description ("girl" vs. "man"). The English sentences in fig. 7 are the outputs produced by the different models when describing the videos, and are given to show the technical effect and for comparison.
Those of skill in the art will appreciate that the various illustrative models, elements, and method steps described in connection with the embodiments disclosed herein may be implemented as electronic hardware, computer software, or combinations of both, and that the various illustrative components and steps have been described above generally in terms of their functionality in order to clearly illustrate the interchangeability of electronic hardware and software. Whether such functionality is implemented as electronic hardware or software depends upon the particular application and design constraints imposed on the solution. Skilled artisans may implement the described functionality in varying ways for each particular application, but such implementation decisions should not be interpreted as causing a departure from the scope of the present invention.
So far, the technical solutions of the present invention have been described in connection with the preferred embodiments shown in the drawings, but it is easily understood by those skilled in the art that the scope of the present invention is obviously not limited to these specific embodiments. Equivalent changes or substitutions of related technical features can be made by those skilled in the art without departing from the principle of the invention, and the technical scheme after the changes or substitutions can fall into the protection scope of the invention.

Claims (7)

1. A dynamic multi-modal video description generation method, characterized by comprising the following steps:
step S1: extracting the corresponding visual CNN features and auditory MFCC features from the video, and judging whether the auditory MFCC features are damaged or missing; if they are damaged or missing, performing step S2, otherwise performing step S3;
step S2: inferring complete auditory MFCC features from the visual CNN features through an auditory inference model based on an encoding-decoding scheme;
step S3: using the visual CNN features and the auditory MFCC features, encoding and interactively fusing the two audio-visual modalities with a multi-modal encoder based on the temporal dependencies between vision and audition to obtain a fused feature, and iteratively decoding the fused feature with a decoder to generate the video description;
wherein the multi-modal encoder is a shared-weight multi-modal LSTM encoder which comprises two LSTM neural networks used to encode the visual CNN features and the auditory MFCC features, respectively, with weight sharing between the internal memory units of the two LSTM neural networks; or
the multi-modal encoder is a shared-memory-unit multi-modal memory unit encoder which comprises two LSTM neural networks used to encode the visual CNN features and the auditory MFCC features, respectively, with the internal memory units of the two LSTM neural networks updating their information through an external memory unit.
2. The dynamic multi-modal video description generation method according to claim 1, wherein the auditory inference model generates the auditory MFCC features as follows:
encoding the CNN features of the video with an encoder to obtain high-level semantics;
decoding the corresponding auditory MFCC features with a decoder;
wherein the decoder is the decoder of the auditory inference model.
3. The method of generating dynamic multi-modal video description according to claim 1, wherein the modeling formula of the weight-sharing based multi-modal LSTM encoder is as follows:
i_t^s = σ(W_i^s x_t^s + U_i h_{t-1}^s + b_i^s)    (1)
f_t^s = σ(W_f^s x_t^s + U_f h_{t-1}^s + b_f^s)    (2)
o_t^s = σ(W_o^s x_t^s + U_o h_{t-1}^s + b_o^s)    (3)
\tilde{c}_t^s = tanh(W_c^s x_t^s + U_c h_{t-1}^s + b_c^s)    (4)
c_t^s = f_t^s ⊙ c_{t-1}^s + i_t^s ⊙ \tilde{c}_t^s    (5)
h_t^s = o_t^s ⊙ tanh(c_t^s)    (6)
wherein:
i_t, f_t, o_t and \tilde{c}_t are the input gate, the forget gate, the output gate and the memory unit, respectively;
the superscript s is the index of the modality: s = 0 denotes the LSTM-based auditory information encoder, and s = 1 denotes the LSTM-based visual information encoder;
x^0 is the auditory MFCC feature and x^1 is the visual CNN feature;
W, U and b are the weight matrices of the corresponding terms, where U denotes the hidden-layer weights shared by the LSTM-based audio-visual encoders;
σ is the sigmoid function;
i is the input gate of the LSTM;
h is the hidden state of the LSTM, and h_t, h_{t-1} are the hidden states of the LSTM at times t and t-1;
W_i, W_f, W_o, W_c are the weights of the input gate, forget gate, output gate and memory unit with respect to the input x;
x_{t-1} is the input at time t-1;
U_i, U_f, U_o, U_c are the weights of the input gate, forget gate, output gate and memory unit with respect to the hidden state h;
b_i, b_f, b_o, b_c are the bias terms of the input gate, forget gate, output gate and memory unit;
c_t, c_{t-1} are the values of the memory unit at times t and t-1.
4. The dynamic multi-modal video description generation method according to claim 1, wherein the internal memory units of the two LSTM neural networks update their information through the external memory unit as follows:
reading information from the external memory unit;
and fusing the information read from the external memory unit with the internal memory units of the two LSTM neural networks, respectively, and updating the memory units of the two LSTM neural networks.
5. The dynamic multi-modal video description generation method according to any one of claims 1 to 4, wherein the method for extracting the corresponding visual CNN features and auditory MFCC features from the video comprises:
extracting the visual CNN features of the video frames in the video with a convolutional neural network;
and extracting the auditory MFCC features from the audio segments corresponding to the video frames with a convolutional neural network.
6. The video description generation method according to claim 5, wherein the multi-modal encoder based on the shared external memory unit realizes the interactive fusion of the visual CNN features and the auditory MFCC features through read-write operations, with the following specific steps:
step S11: reading information from the memory unit of the multi-modal encoder;
step S12: fusing the information read from the memory unit with the visual CNN features and the auditory MFCC features obtained from the multi-modal encoder, respectively, to obtain fused information;
step S13: storing the fused information in the memory unit of the multi-modal encoder.
7. The video description generation method of claim 5, wherein the visual CNN features and the auditory MFCC features do not share weights before being input to the multi-modal encoder.
CN201711433810.6A 2017-12-26 2017-12-26 Dynamic multi-modal video description generation method Active CN108200483B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201711433810.6A CN108200483B (en) 2017-12-26 2017-12-26 Dynamic multi-modal video description generation method

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201711433810.6A CN108200483B (en) 2017-12-26 2017-12-26 Dynamic multi-modal video description generation method

Publications (2)

Publication Number Publication Date
CN108200483A CN108200483A (en) 2018-06-22
CN108200483B true CN108200483B (en) 2020-02-28

Family

ID=62584286

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201711433810.6A Active CN108200483B (en) 2017-12-26 2017-12-26 Dynamic multi-modal video description generation method

Country Status (1)

Country Link
CN (1) CN108200483B (en)

Families Citing this family (13)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN109190683A (en) * 2018-08-14 2019-01-11 电子科技大学 A kind of classification method based on attention mechanism and bimodal image
US10846522B2 (en) * 2018-10-16 2020-11-24 Google Llc Speaking classification using audio-visual data
CN111464881B (en) * 2019-01-18 2021-08-13 复旦大学 Full-convolution video description generation method based on self-optimization mechanism
CN109885723B (en) * 2019-02-20 2023-10-13 腾讯科技(深圳)有限公司 Method for generating video dynamic thumbnail, method and device for model training
CN110110636B (en) * 2019-04-28 2021-03-02 清华大学 Video logic mining device and method based on multi-input single-output coding and decoding model
CN110222227B (en) * 2019-05-13 2021-03-23 西安交通大学 Chinese folk song geographical classification method integrating auditory perception features and visual features
CN110826397B (en) * 2019-09-20 2022-07-26 浙江大学 Video description method based on high-order low-rank multi-modal attention mechanism
CN111309971B (en) * 2020-01-19 2022-03-25 浙江工商大学 Multi-level coding-based text-to-video cross-modal retrieval method
CN111859005B (en) * 2020-07-01 2022-03-29 江西理工大学 Cross-layer multi-model feature fusion and image description method based on convolutional decoding
CN112069361A (en) * 2020-08-27 2020-12-11 新华智云科技有限公司 Video description text generation method based on multi-mode fusion
CN112287893B (en) * 2020-11-25 2023-07-18 广东技术师范大学 Sow lactation behavior identification method based on audio and video information fusion
CN112331337B (en) 2021-01-04 2021-04-16 中国科学院自动化研究所 Automatic depression detection method, device and equipment
CN114581749B (en) * 2022-05-09 2022-07-26 城云科技(中国)有限公司 Audio-visual feature fusion target behavior identification method and device and application

Citations (7)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN105005558A (en) * 2015-08-14 2015-10-28 武汉大学 Multi-modal data fusion method based on crowd sensing
CN105279495A (en) * 2015-10-23 2016-01-27 天津大学 Video description method based on deep learning and text summarization
CN106708890A (en) * 2015-11-17 2017-05-24 创意引晴股份有限公司 Intelligent high fault-tolerant video identification system based on multimoding fusion and identification method thereof
CN107220591A (en) * 2017-04-28 2017-09-29 哈尔滨工业大学深圳研究生院 Multi-modal intelligent mood sensing system
CN107256221A (en) * 2017-04-26 2017-10-17 苏州大学 Video presentation method based on multi-feature fusion
CN107391646A (en) * 2017-07-13 2017-11-24 清华大学 A kind of Semantic features extraction method and device of video image
CN107451552A (en) * 2017-07-25 2017-12-08 北京联合大学 A kind of gesture identification method based on 3D CNN and convolution LSTM

Patent Citations (7)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN105005558A (en) * 2015-08-14 2015-10-28 武汉大学 Multi-modal data fusion method based on crowd sensing
CN105279495A (en) * 2015-10-23 2016-01-27 天津大学 Video description method based on deep learning and text summarization
CN106708890A (en) * 2015-11-17 2017-05-24 创意引晴股份有限公司 Intelligent high fault-tolerant video identification system based on multimoding fusion and identification method thereof
CN107256221A (en) * 2017-04-26 2017-10-17 苏州大学 Video presentation method based on multi-feature fusion
CN107220591A (en) * 2017-04-28 2017-09-29 哈尔滨工业大学深圳研究生院 Multi-modal intelligent mood sensing system
CN107391646A (en) * 2017-07-13 2017-11-24 清华大学 A kind of Semantic features extraction method and device of video image
CN107451552A (en) * 2017-07-25 2017-12-08 北京联合大学 A kind of gesture identification method based on 3D CNN and convolution LSTM

Also Published As

Publication number Publication date
CN108200483A (en) 2018-06-22

Similar Documents

Publication Publication Date Title
CN108200483B (en) Dynamic multi-modal video description generation method
Cannizzaro Internet memes as internet signs: A semiotic view of digital culture
Xu et al. Dual-stream recurrent neural network for video captioning
US11409791B2 (en) Joint heterogeneous language-vision embeddings for video tagging and search
US20210390700A1 (en) Referring image segmentation
CN111209440B (en) Video playing method, device and storage medium
Chen et al. Learning a recurrent visual representation for image caption generation
Prabhakaran Multimedia database management systems
Xie et al. Attention-based dense LSTM for speech emotion recognition
CN112104919A (en) Content title generation method, device, equipment and computer readable storage medium based on neural network
Chang et al. The prompt artists
CN114339450B (en) Video comment generation method, system, device and storage medium
CN116681810B (en) Virtual object action generation method, device, computer equipment and storage medium
KR102165160B1 (en) Apparatus for predicting sequence of intention using recurrent neural network model based on sequential information and method thereof
CN113704460A (en) Text classification method and device, electronic equipment and storage medium
Niu et al. Improvement on speech emotion recognition based on deep convolutional neural networks
CN113157941B (en) Service characteristic data processing method, service characteristic data processing device, text generating method, text generating device and electronic equipment
CN113010780B (en) Model training and click rate estimation method and device
CN112417118B (en) Dialog generation method based on marked text and neural network
Rastgoo et al. All You Need In Sign Language Production
Rodriguez et al. How important is motion in sign language translation?
CN115169472A (en) Music matching method and device for multimedia data and computer equipment
Chu et al. The forgettable-watcher model for video question answering
CN116824461B (en) Question understanding guiding video question answering method and system
CN116737756B (en) Data query method, device, equipment and storage medium

Legal Events

Date Code Title Description
PB01 Publication
PB01 Publication
SE01 Entry into force of request for substantive examination
SE01 Entry into force of request for substantive examination
GR01 Patent grant
GR01 Patent grant