CN108200483A - Dynamic multi-modal video description generation method - Google Patents

Dynamic multi-modal video description generation method

Info

Publication number
CN108200483A
CN108200483A (application CN201711433810.6A)
Authority
CN
China
Prior art keywords
auditory
video description
multi-modal
features
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Granted
Application number
CN201711433810.6A
Other languages
Chinese (zh)
Other versions
CN108200483B (en)
Inventor
张兆翔
郝王丽
关赫
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Institute of Automation of Chinese Academy of Science
Original Assignee
Institute of Automation of Chinese Academy of Science
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Institute of Automation of Chinese Academy of Science filed Critical Institute of Automation of Chinese Academy of Science
Priority to CN201711433810.6A priority Critical patent/CN108200483B/en
Publication of CN108200483A publication Critical patent/CN108200483A/en
Application granted granted Critical
Publication of CN108200483B publication Critical patent/CN108200483B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical

Classifications

    • HELECTRICITY
    • H04ELECTRIC COMMUNICATION TECHNIQUE
    • H04NPICTORIAL COMMUNICATION, e.g. TELEVISION
    • H04N21/00Selective content distribution, e.g. interactive television or video on demand [VOD]
    • H04N21/80Generation or processing of content or additional data by content creator independently of the distribution process; Content per se
    • H04N21/83Generation or processing of protective or descriptive data associated with content; Content structuring
    • H04N21/84Generation or processing of descriptive data, e.g. content descriptors
    • HELECTRICITY
    • H04ELECTRIC COMMUNICATION TECHNIQUE
    • H04NPICTORIAL COMMUNICATION, e.g. TELEVISION
    • H04N21/00Selective content distribution, e.g. interactive television or video on demand [VOD]
    • H04N21/40Client devices specifically adapted for the reception of or interaction with content, e.g. set-top-box [STB]; Operations thereof
    • H04N21/43Processing of content or additional data, e.g. demultiplexing additional data from a digital video stream; Elementary client operations, e.g. monitoring of home network or synchronising decoder's clock; Client middleware
    • H04N21/44Processing of video elementary streams, e.g. splicing a video clip retrieved from local storage with an incoming video stream, rendering scenes according to MPEG-4 scene graphs
    • H04N21/44008Processing of video elementary streams, e.g. splicing a video clip retrieved from local storage with an incoming video stream, rendering scenes according to MPEG-4 scene graphs involving operations for analysing video streams, e.g. detecting features or characteristics in the video stream
    • HELECTRICITY
    • H04ELECTRIC COMMUNICATION TECHNIQUE
    • H04NPICTORIAL COMMUNICATION, e.g. TELEVISION
    • H04N21/00Selective content distribution, e.g. interactive television or video on demand [VOD]
    • H04N21/40Client devices specifically adapted for the reception of or interaction with content, e.g. set-top-box [STB]; Operations thereof
    • H04N21/45Management operations performed by the client for facilitating the reception of or the interaction with the content or administrating data related to the end-user or to the client device itself, e.g. learning user preferences for recommending movies, resolving scheduling conflicts
    • H04N21/466Learning process for intelligent management, e.g. learning user preferences for recommending movies
    • H04N21/4662Learning process for intelligent management, e.g. learning user preferences for recommending movies characterized by learning algorithms
    • H04N21/4666Learning process for intelligent management, e.g. learning user preferences for recommending movies characterized by learning algorithms using neural networks, e.g. processing the feedback provided by the user

Landscapes

  • Engineering & Computer Science (AREA)
  • Multimedia (AREA)
  • Signal Processing (AREA)
  • Databases & Information Systems (AREA)
  • Artificial Intelligence (AREA)
  • Evolutionary Computation (AREA)
  • Compression Or Coding Systems Of Tv Signals (AREA)

Abstract

The invention belongs to the field of video description, and in particular relates to a dynamic multi-modal video description generation method. It aims to capture the resonance information between the visual and auditory modalities so as to generate better video descriptions, and additionally to handle the case where the audio modality of a video is damaged or missing. In the proposed multi-modal video description generation system, the feature-encoding stage of the visual and auditory modalities shares the weights of the LSTM internal memory units, or shares an external memory unit, so as to model the temporal dependency between the audio and visual streams and capture the resonance information between the two modalities. In addition, based on an auditory inference system, the present invention infers the corresponding auditory modality information from the known visual modality information. The invention can generate video descriptions quickly and effectively.

Description

Dynamic multi-modal video description generation method
Technical field
The invention belongs to the field of video description, and in particular relates to a dynamic multi-modal video description generation method.
Background art
A video description generation system automatically produces a corresponding natural-language description for a given video. Video description generation is inspired by image captioning: in both cases, the description of the image or video is generated iteratively from the features of the given image or video frames.
Traditional video description generation methods can be roughly divided into three classes:
The first class comprises template-based methods. They first determine the semantic concepts contained in the video, then infer the sentence structure from predefined sentence templates, and finally use a probabilistic graphical model to collect the most relevant content from which a descriptive sentence can be generated. Although such methods can produce grammatically correct sentences, the sentences lack richness and flexibility.
The second class treats video description as a retrieval problem. The videos are first annotated with metadata, and the sentence descriptions and videos are then classified according to the corresponding annotations. Compared with the first class, the sentences produced in this way are more natural, but they are largely limited by the metadata.
The third class of video description generation methods maps the video representation directly onto a specific sentence. Venugopalan et al. first extract features from all the frames of a video with a convolutional neural network (CNN) and apply average pooling to obtain a fixed-length video representation; an LSTM decoder then generates the corresponding video description from this representation. Although this approach obtains a video description easily, it ignores the temporal characteristics implicit in the video. To exploit the role of temporal structure in video description generation, Venugopalan et al. explored encoding the video-frame features with an LSTM to obtain a fixed-length video representation. Yao et al., in turn, explored the effect of the local and global temporal structure implicit in the video on description generation: the local temporal structure is encoded by a spatio-temporal convolutional neural network, while the global temporal structure is encoded by a temporal attention mechanism. To further improve the performance of video description generation, Ballas et al. studied the effect of intermediate-layer video representations, and Pan et al. proposed a hierarchical recurrent neural network decoder to explore the role of temporal structure at different granularities. To generate more detailed video descriptions, Haonan Yu et al. proposed using a hierarchical recurrent neural network to generate paragraph descriptions of videos. Their hierarchical model consists of a sentence generator and a paragraph generator: the sentence generator produces a simple sentence for a specific short interval of the video, and the paragraph generator encodes the sentences produced by the sentence generator, capturing the dependencies between them, to generate a paragraph description for the video.
The above video description generation models share a common property: their video representation is based only on visual information. To make full use of the information contained in a video, many researchers have proposed fusing the multi-modal information in the video to obtain better descriptions. Vasili Ramanishka et al. directly concatenate the corresponding visual and auditory features to obtain a multi-modal video representation. Qin Jin et al. use a multi-layer feed-forward neural network to fuse the audio-visual information. Although these methods improve the performance of video description to some degree, they still suffer from the following limitations. Directly concatenating the audio-visual features causes aliasing of information and a drop in performance. Moreover, the model proposed by Qin Jin et al. obtains the video feature by pooling all the frame or clip features, which ignores the temporal dependency between the (visual-auditory) features.
The multi-modal video description generation system proposed by the present invention shares the weights of the LSTM internal memory units, or shares an external memory unit, in the feature-encoding stage of the visual and auditory modalities, thereby modeling the temporal dependency between the audio and visual streams, capturing the resonance information between the two modalities, and generating better video descriptions. In addition, owing to factors such as environmental influence and sensor disturbance, the audio modality of a video may in some cases be damaged or missing. So that the performance of the video description generation system is essentially unaffected in this case, the present invention proposes an auditory inference system that infers the corresponding auditory modality information from the known visual modality information; the completed audio-visual modality information is then input to the multi-modal video description generation system to generate the video description.
Summary of the invention
To solve the above problems, and in particular the problem that a video description cannot be generated accurately when the audio modality is damaged or missing, the dynamic multi-modal video description generation method proposed by the present invention comprises the following steps:
Step S1: extract the corresponding visual CNN features and auditory MFCC features from the video, and determine whether the auditory MFCC features are damaged or missing; if so, perform step S2, otherwise perform step S3;
Step S2: infer complete auditory MFCC features from the visual CNN features with an auditory inference model based on an encoding-decoding process;
Step S3: using the visual CNN features and auditory MFCC features, and based on the temporal dependency between the audio and visual streams, encode the two modalities with a multi-modal encoder and fuse their interaction to obtain a fused feature; the fused feature is then decoded iteratively by a decoder to generate the video description.
Preferably, the auditory inference model generates the auditory MFCC features as follows:
the visual CNN features are encoded with an encoder to obtain a high-level semantic representation;
the corresponding auditory MFCC features are decoded with a decoder.
Preferably, the multi-modal encoder is a multi-modal LSTM encoder based on shared weights or a multi-modal memory-unit encoder based on a shared memory unit.
Preferably, the multi-modal LSTM encoder based on shared weights comprises two LSTM neural networks, used to encode the visual CNN features and the auditory MFCC features respectively; weights are shared between the internal memory units of the two LSTM neural networks.
Preferably, the multi-modal LSTM encoder based on shared weights is modeled as follows:

i_t^s = σ(W_i^s x_t^s + U_i h_{t-1}^s + b_i^s)    (1)
f_t^s = σ(W_f^s x_t^s + U_f h_{t-1}^s + b_f^s)    (2)
o_t^s = σ(W_o^s x_t^s + U_o h_{t-1}^s + b_o^s)    (3)
c̃_t^s = tanh(W_c^s x_t^s + U_c h_{t-1}^s + b_c^s)    (4)
c_t^s = f_t^s ⊙ c_{t-1}^s + i_t^s ⊙ c̃_t^s    (5)
h_t^s = o_t^s ⊙ tanh(c_t^s)    (6)

where:
i_t, f_t, o_t and c̃_t are the input gate, forget gate, output gate and candidate memory cell, respectively;
the superscript s is the modality index: s = 0 denotes the LSTM-based auditory encoder and s = 1 denotes the LSTM-based visual encoder;
x^0 is the auditory MFCC feature and x^1 is the visual CNN feature;
W, U, b are the weight matrices and biases of the corresponding terms, where the U matrices are the hidden-unit weights shared between the two LSTM-based audio-visual encoders;
σ is the sigmoid function and ⊙ denotes element-wise multiplication;
h_t, h_{t-1} are the LSTM hidden states at times t and t-1;
W_i, W_f, W_o, W_c are the weights of the input gate, forget gate, output gate and candidate memory cell with respect to the input x;
x_t is the input at time t;
U_i, U_f, U_o, U_c are the weights of the input gate, forget gate, output gate and candidate memory cell with respect to the hidden state h;
b_i, b_f, b_o, b_c are the bias terms of the input gate, forget gate, output gate and candidate memory cell;
c_t, c_{t-1} are the memory-cell values at times t and t-1.
Preferably, the multi-modal memory-unit encoder based on a shared memory unit comprises two LSTM neural networks, used to encode the visual CNN features and the auditory MFCC features respectively; the internal memory units of the two LSTM neural networks are updated with information through an external memory unit.
Preferably, "the internal memory units of the two LSTM neural networks are updated with information through an external memory unit" means:
information is read from the external memory unit;
the information read from the external memory unit is fused with the internal memory units of the two LSTM neural networks respectively, and the memory units of the two LSTM neural networks are updated.
Preferably, "extracting the corresponding visual CNN features and auditory MFCC features from the video" is performed as follows (a feature-extraction sketch is given below):
visual CNN features are extracted from the video frames with a convolutional neural network;
auditory MFCC features are extracted from the audio segments corresponding to the video frames with a convolutional neural network.
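As a concrete illustration of this extraction step, the following is a minimal sketch in Python. It assumes a pretrained VGG-16 from torchvision for the frame-level visual CNN features and librosa for the auditory MFCC features; the model choice, 16 kHz sample rate and number of MFCC coefficients are illustrative assumptions, and the MFCCs are computed directly rather than through a convolutional neural network as stated above.

# Minimal feature-extraction sketch (assumptions noted in the lead-in above).
import torch
import torchvision.models as models
import torchvision.transforms as T
import librosa

cnn = models.vgg16(weights=models.VGG16_Weights.DEFAULT)
cnn.classifier = cnn.classifier[:-1]      # keep the 4096-d fc7 output, drop the classification layer
cnn.eval()

preprocess = T.Compose([
    T.ToPILImage(),
    T.Resize((224, 224)),
    T.ToTensor(),
    T.Normalize(mean=[0.485, 0.456, 0.406], std=[0.229, 0.224, 0.225]),
])

def visual_cnn_features(frames):
    """frames: list of HxWx3 uint8 arrays (sampled video frames) -> (T, 4096) tensor."""
    batch = torch.stack([preprocess(f) for f in frames])
    with torch.no_grad():
        return cnn(batch)                 # one CNN feature vector per frame

def auditory_mfcc_features(wav_path, n_mfcc=39):
    """Audio track of the video -> (T, n_mfcc) MFCC features, time-major like the frames."""
    y, sr = librosa.load(wav_path, sr=16000)
    return librosa.feature.mfcc(y=y, sr=sr, n_mfcc=n_mfcc).T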
Preferably, the multi-modal encoder based on a shared external memory unit realizes the interaction fusion of the visual CNN features and the auditory MFCC features through read-write operations, specifically:
Step S11: information is read from the memory unit of the multi-modal encoder;
Step S12: the information read from the memory unit is fused with the visual CNN features and the auditory MFCC features obtained by the multi-modal encoder's encoding, respectively, to obtain fused information;
Step S13: the fused information is stored into the memory unit of the multi-modal encoder.
Preferably, the visual CNN features and the auditory MFCC features do not share weights before being input to the multi-modal encoder.
It can be seen from the above technical solution that the present invention proposes a dynamic multi-modal video description generation model, providing a fast and effective method for video description. Compared with the prior art, the present invention has the following advantages:
(1) The input information of the invention comes from both the visual and auditory modalities; compared with traditional description generation systems based only on visual information, it achieves higher performance.
(2) The invention fuses the audio-visual information by modeling the temporal dependency between the audio and visual modalities. This temporal dependency can, to some extent, represent the resonance information between the modalities, i.e. the events that actually occur in the video. Because the invention models this resonance information effectively, it can generate better video descriptions.
(3) The invention builds a unified video description system for videos with complete modalities and for videos whose audio modality is missing. If the input video has both complete modalities, they are fed directly into the multi-modal video description generation system; if the audio modality of the input video is damaged or missing, the auditory inference system generates the auditory features from the visual features, and the completed pair of modalities is then fed into the multi-modal video description generation system.
Description of the drawings
Fig. 1 is a flow chart of the dynamic multi-modal video description generation method according to an embodiment of the present invention;
Fig. 2 is a schematic diagram of the dynamic multi-modal video description generation model according to an embodiment of the present invention;
Fig. 3 is a schematic diagram of the multi-modal audio-visual fusion encoding model based on shared weights according to an embodiment of the present invention;
Fig. 4 is a schematic diagram of the multi-modal audio-visual fusion encoding model based on a shared external memory unit according to an embodiment of the present invention;
Fig. 5 shows the video descriptions generated by each basic model and the multi-modal fusion models according to an embodiment of the present invention;
Fig. 6 is a schematic diagram of the auditory inference model according to an embodiment of the present invention;
Fig. 7 shows the video descriptions generated by each basic model after supplementing the audio modality according to an embodiment of the present invention.
Specific embodiment
The preferred embodiments of the present invention are described below with reference to the accompanying drawings. It will be apparent to those skilled in the art that these embodiments are only used to explain the technical principle of the present invention and are not intended to limit the protection scope of the present invention.
The dynamic multi-modal video description generation method provided by the present invention, as shown in Fig. 1, comprises the following steps:
Step S1: extract the corresponding visual CNN features and auditory MFCC features from the video, and determine whether the auditory MFCC features are damaged or missing; if so, perform step S2, otherwise perform step S3;
Step S2: infer complete auditory MFCC features from the visual CNN features with the auditory inference model based on an encoding-decoding process;
Step S3: using the visual CNN features and auditory MFCC features, and based on the temporal dependency between the audio and visual streams, encode the two modalities with the multi-modal encoder and fuse their interaction to obtain a fused feature; the fused feature is then decoded iteratively by a decoder to generate the video description (sketched below).
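The following is a minimal Python sketch of how steps S1-S3 fit together. The callables passed in (feats, infer_audio, mm_encoder, text_decoder) are illustrative placeholders for the feature extractors, the auditory inference model, the multi-modal encoder and the text decoder described in this document; their names are assumptions, not taken from the patent.

# Minimal sketch of the dynamic pipeline of steps S1-S3.
def describe_video(video_frames, audio_track, feats, infer_audio, mm_encoder, text_decoder):
    # Step S1: extract visual CNN features and auditory MFCC features, and check
    # whether the auditory features are damaged or missing.
    v = feats.visual_cnn(video_frames)
    a = feats.auditory_mfcc(audio_track) if audio_track is not None else None
    audio_missing = a is None or feats.is_damaged(a)

    # Step S2: if so, infer complete auditory MFCC features from the visual CNN
    # features with the auditory inference model.
    if audio_missing:
        a = infer_audio(v)

    # Step S3: encode both modalities and fuse their interaction based on the
    # audio-visual temporal dependency, then decode the fused feature iteratively
    # into the video description.
    fused = mm_encoder(v, a)
    return text_decoder.generate(fused)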
The dynamic multi-modal video description generation method proposed by the present invention realizes the video description function jointly through the audio-visual multi-modal video description generation system and the auditory inference model; a preferred implementation of the encoder of the audio-visual multi-modal video description generation system is the multi-modal LSTM encoder.
Fig. 2 shows a schematic diagram of the audio-visual multi-modal video description generation system of the present invention. The specific steps of this embodiment are:
Step S11: if the modalities of the input video are complete, they are directly input to the multi-modal video description generation system; if the auditory MFCC features of the input video are damaged or missing, the auditory inference system generates auditory MFCC features from the known visual CNN features, and the completed pair of modalities is then input to the multi-modal video description generation system;
Step S12: visual CNN features are extracted from the video frames with a convolutional neural network, and auditory MFCC features are extracted from the audio segments corresponding to the video frames. The visual CNN features and auditory MFCC features of the video are then input to the multi-modal encoder for encoding, and finally a text decoder iteratively generates the video description based on the feature representation provided by the encoder (a minimal decoder sketch follows below).
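A minimal sketch of such a text decoder is given below in PyTorch; it conditions the initial LSTM state on the fused audio-visual representation and greedily emits one word at a time. The layer sizes, vocabulary handling and greedy decoding scheme are assumptions for illustration, not details taken from the patent.

# Minimal text-decoder sketch (assumed layer sizes and greedy decoding).
import torch
import torch.nn as nn

class TextDecoder(nn.Module):
    def __init__(self, vocab_size, feat_dim, embed_dim=256, hidden_dim=512):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, embed_dim)
        self.init_h = nn.Linear(feat_dim, hidden_dim)   # seed the state from the fused feature
        self.init_c = nn.Linear(feat_dim, hidden_dim)
        self.lstm = nn.LSTMCell(embed_dim, hidden_dim)
        self.out = nn.Linear(hidden_dim, vocab_size)

    @torch.no_grad()
    def generate(self, fused, bos_id=1, eos_id=2, max_len=20):
        h, c = torch.tanh(self.init_h(fused)), torch.tanh(self.init_c(fused))
        word = torch.full((fused.size(0),), bos_id, dtype=torch.long, device=fused.device)
        sentence = []
        for _ in range(max_len):                        # greedy decoding, word by word
            h, c = self.lstm(self.embed(word), (h, c))
            word = self.out(h).argmax(dim=-1)
            sentence.append(word)
            if (word == eos_id).all():
                break
        return torch.stack(sentence, dim=1)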
In this embodiment, step S11 is specifically: if the auditory features extracted by the audio-visual multi-modal video description generation system are lost or missing due to the influence of the external environment or electromagnetic interference, the auditory inference model encodes the visual features to obtain the high-level semantics, then uses its decoder to decode the corresponding auditory MFCC features, which are input to the multi-modal video description generation system.
The multi-modal encoder of the audio-visual multi-modal video description generation system is a multi-modal LSTM encoder based on shared weights or a multi-modal memory-unit encoder based on a shared memory unit. Taking the multi-modal LSTM encoder as an example, as shown in Fig. 3, the multi-modal LSTM encoder based on shared weights exploits the fact that an LSTM can model the temporal dependency in sequential data. The audio and visual sequences in a video each contain temporal structure, and there is a resonance relationship between them. To model this temporal dependency, the present invention encodes the features of the two modalities with two separate LSTM neural networks while sharing the weights between the internal memory units of the two LSTMs; the multi-modal encoder designed in this way can capture the temporal resonance information in the co-occurring audio-visual modalities.
The multi-modal LSTM encoder based on shared weights is modeled by formulas (1)-(6):

i_t^s = σ(W_i^s x_t^s + U_i h_{t-1}^s + b_i^s)    (1)
f_t^s = σ(W_f^s x_t^s + U_f h_{t-1}^s + b_f^s)    (2)
o_t^s = σ(W_o^s x_t^s + U_o h_{t-1}^s + b_o^s)    (3)
c̃_t^s = tanh(W_c^s x_t^s + U_c h_{t-1}^s + b_c^s)    (4)
c_t^s = f_t^s ⊙ c_{t-1}^s + i_t^s ⊙ c̃_t^s    (5)
h_t^s = o_t^s ⊙ tanh(c_t^s)    (6)

where:
i_t, f_t, o_t and c̃_t are the input gate, forget gate, output gate and candidate memory cell, respectively;
the superscript s is the modality index: s = 0 denotes the LSTM-based auditory encoder and s = 1 denotes the LSTM-based visual encoder;
x^0 is the auditory MFCC feature and x^1 is the visual CNN feature;
W, U, b are the weight matrices and biases of the corresponding terms, where the U matrices are the hidden-unit weights shared between the two LSTM-based audio-visual encoders;
σ is the sigmoid function and ⊙ denotes element-wise multiplication;
h_t, h_{t-1} are the LSTM hidden states at times t and t-1;
W_i, W_f, W_o, W_c are the weights of the input gate, forget gate, output gate and candidate memory cell with respect to the input x;
x_t is the input at time t;
U_i, U_f, U_o, U_c are the weights of the input gate, forget gate, output gate and candidate memory cell with respect to the hidden state h;
b_i, b_f, b_o, b_c are the bias terms of the input gate, forget gate, output gate and candidate memory cell;
c_t, c_{t-1} are the memory-cell values at times t and t-1.
The visual CNN features and auditory MFCC features do not share weights before being input to the multi-modal encoder. There are two reasons why the input audio-visual features do not share weights: first, each modality learns its own mapping function from its feature space to the hidden space; second, the visual and auditory inputs have different dimensions, and separate weight matrices handle the different input dimensions of the two modalities more conveniently.
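The following is a minimal PyTorch sketch of the shared-weight multi-modal LSTM encoder defined by formulas (1)-(6): each modality keeps its own input projection W^s and bias b^s, while the hidden-state projection U is shared between the auditory stream (s = 0) and the visual stream (s = 1). The class name and tensor layout are illustrative assumptions.

# Minimal sketch of the shared-weight multi-modal LSTM encoder, Eqs. (1)-(6).
import torch
import torch.nn as nn

class SharedWeightMultimodalLSTM(nn.Module):
    def __init__(self, audio_dim, visual_dim, hidden_dim):
        super().__init__()
        # Per-modality input projections W^s, b^s (4*hidden: input, forget, output gates and candidate).
        self.W = nn.ModuleList([nn.Linear(audio_dim, 4 * hidden_dim),
                                nn.Linear(visual_dim, 4 * hidden_dim)])
        # Hidden-state projection U shared across the two modalities (no bias here:
        # the per-modality bias terms b^s live in the projections above).
        self.U = nn.Linear(hidden_dim, 4 * hidden_dim, bias=False)
        self.hidden_dim = hidden_dim

    def step(self, s, x_t, h_prev, c_prev):
        gates = self.W[s](x_t) + self.U(h_prev)
        i, f, o, g = gates.chunk(4, dim=-1)
        i, f, o = torch.sigmoid(i), torch.sigmoid(f), torch.sigmoid(o)
        c_t = f * c_prev + i * torch.tanh(g)            # Eq. (5)
        h_t = o * torch.tanh(c_t)                       # Eq. (6)
        return h_t, c_t

    def forward(self, audio_seq, visual_seq):
        """audio_seq: (T, B, audio_dim); visual_seq: (T, B, visual_dim)."""
        B = audio_seq.size(1)
        h = [audio_seq.new_zeros(B, self.hidden_dim) for _ in range(2)]
        c = [audio_seq.new_zeros(B, self.hidden_dim) for _ in range(2)]
        for x_a, x_v in zip(audio_seq, visual_seq):
            h[0], c[0] = self.step(0, x_a, h[0], c[0])  # auditory stream (s = 0)
            h[1], c[1] = self.step(1, x_v, h[1], c[1])  # visual stream (s = 1)
        return h[0], h[1]                               # final encodings of both streams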
As shown in Fig. 4, the multi-modal memory-unit encoder based on a shared memory unit addresses the fact that, although an LSTM can model temporal dependency, it does so only over relatively short spans. To further explore the effect of long-term dependency on video description generation, the present invention proposes a multi-modal encoder based on a shared external memory unit. The multi-modal memory-unit encoder with a shared memory unit comprises two LSTM neural networks, used to encode the visual CNN features and the auditory MFCC features respectively; the internal memory units of the two LSTM neural networks are updated with information through the external memory unit, specifically:
Step S21: information is read from the external memory unit;
Step S22: the information read from the external memory unit is fused with the internal memory units of the two LSTM neural networks respectively, and the memory units of the two LSTM neural networks are updated.
The multi-modal memory-unit encoder with a shared memory unit in this embodiment realizes the interaction fusion of the visual CNN features and the auditory MFCC features through read-write operations (sketched below), specifically:
Step S31: information is read from the memory unit of the multi-modal encoder;
Step S32: the information read from the memory unit is fused with the visual CNN features and the auditory MFCC features obtained by the multi-modal encoder's encoding, respectively, to obtain fused information;
Step S33: the fused information is stored into the memory unit of the multi-modal encoder.
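The following is a minimal sketch of one plausible realization of these read-fuse-write operations in PyTorch. The specific projection layers and tanh fusions are assumptions for illustration, since the patent only specifies the three steps themselves.

# Minimal sketch of a shared external memory with read / fuse / write operations.
import torch
import torch.nn as nn

class SharedExternalMemory(nn.Module):
    def __init__(self, hidden_dim, mem_dim):
        super().__init__()
        self.read_proj = nn.Linear(mem_dim, hidden_dim)
        self.fuse = nn.Linear(2 * hidden_dim, hidden_dim)
        self.write_proj = nn.Linear(2 * hidden_dim, mem_dim)

    def read(self, memory):
        # Step S31: read the current content of the external memory unit.
        return torch.tanh(self.read_proj(memory))

    def update(self, memory, h_audio, h_visual):
        # Step S32: fuse the read-out with each modality's encoding, giving
        # memory-augmented states for both LSTM streams.
        r = self.read(memory)
        h_audio = torch.tanh(self.fuse(torch.cat([h_audio, r], dim=-1)))
        h_visual = torch.tanh(self.fuse(torch.cat([h_visual, r], dim=-1)))
        # Step S33: write the fused audio-visual information back to the memory unit.
        memory = torch.tanh(self.write_proj(torch.cat([h_audio, h_visual], dim=-1)))
        return memory, h_audio, h_visual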
To verify the performance of the multi-modal video description generation system of the present invention, as shown in Fig. 5, we compare the video descriptions generated by the following models:
Audio: generates the video description using only the auditory MFCC features;
Visual: generates the video description using only the visual CNN features;
V-Cat-A: directly concatenates the visual CNN features and the auditory MFCC features as the video feature to generate the video description;
V-ShaMem-A: the visual CNN features and auditory MFCC features share an external memory unit to obtain the final video feature used for description generation; this model is designed to verify the effect of the long-term temporal dependency between the audio and visual streams on video description generation;
V-ShaWei-A: the visual CNN features and auditory MFCC features share the weights of the LSTM internal memory units to obtain the final video feature used for description generation; this model is designed to verify the effect of the short-term temporal dependency between the audio and visual streams on video description generation.
For the first video in Fig. 5, the sentence generated by the Visual model focuses too much on the visual information and ignores the auditory information, so it produces wrong content ("to a man" vs. "news"); the V-Cat-A model generates the accurate object ("a man") and behavior ("talking") but loses the content ("news"), because directly concatenating the audio-visual features causes aliasing of information and thus loses part of it; with the help of the auditory information, V-ShaMem-A and V-ShaWei-A generate sentences close to the reference sentence, and the sentence generated by V-ShaWei-A is more accurate ("news" vs. "something"): V-ShaMem-A attends more to long-term information and therefore generates the more abstract word "something", whereas V-ShaWei-A attends more to the fine-grained events that actually matter.
For the second video in Fig. 5, all models generate the relevant behavior ("swimming") and target ("in the water"). However, only the V-ShaWei-A model generates the more accurate object ("fish" vs. "man" and "person"), because it attends more to the events to which both the audio and visual modalities are sensitive, i.e. the events caused by audio-visual resonance.
For the third video in Fig. 5, only the V-ShaWei-A model generates the more relevant behavior ("showing" vs. "playing"), which shows that it can capture the essence of the behavior.
For the fourth video in Fig. 5, the V-Cat-A and V-ShaWei-A models generate more relevant behaviors with the help of the sound cue ("knocking on a wall", "using a phone"); the V-ShaMem-A model attends more to the global event and therefore produces the sentence ("lying on bed"); meanwhile, the Visual model attends more to the visual information and likewise produces the description ("lying on bed").
For the fifth video in Fig. 5, the event occurring in the video is more related to the visual information, so the Visual, V-ShaMem-A and V-ShaWei-A models all generate the accurate behavior ("dancing" vs. "playing", "singing"); moreover, V-ShaMem-A and V-ShaWei-A generate the more accurate object ("a group of" vs. "a girl", "a cartoon character" and "someone"), which shows that the temporal dependency helps locate the object; meanwhile, the V-ShaWei-A model gives the most relevant object ("cartoon characters" vs. "people"), which shows that the short-term temporal dependency is more effective. The English sentences in Fig. 5 are the video descriptions output by the different models, shown for demonstrating and comparing the technical effect.
The auditory inference model is based on an encoding-decoding process, as shown in Fig. 6. Specifically, the co-occurring audio and visual modalities share the same high-level semantic representation; based on this representation, the damaged or missing auditory MFCC features can be generated from the known visual CNN features. In the present invention, the video-frame features are first encoded with an encoder to obtain the high-level semantics, and the decoder is then used to decode the corresponding auditory MFCC features. The numbers 1024, 512 and 256 denote the number of neurons in each network layer of the encoder and the decoder.
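A minimal PyTorch sketch of this encoder-decoder auditory inference model is given below. The 1024-512-256 layer widths follow the description of Fig. 6; the input/output dimensions, the mirrored decoder widths and the ReLU activations are assumptions for illustration.

# Minimal sketch of the auditory inference model (encoder-decoder over visual features).
import torch.nn as nn

class AuditoryInferenceModel(nn.Module):
    def __init__(self, visual_dim=4096, mfcc_dim=39):
        super().__init__()
        self.encoder = nn.Sequential(       # visual CNN features -> high-level semantics
            nn.Linear(visual_dim, 1024), nn.ReLU(),
            nn.Linear(1024, 512), nn.ReLU(),
            nn.Linear(512, 256), nn.ReLU(),
        )
        self.decoder = nn.Sequential(       # high-level semantics -> auditory MFCC features
            nn.Linear(256, 512), nn.ReLU(),
            nn.Linear(512, 1024), nn.ReLU(),
            nn.Linear(1024, mfcc_dim),
        )

    def forward(self, visual_feat):
        return self.decoder(self.encoder(visual_feat))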
To verify the performance of the multi-modal video description generation system of the present invention, as shown in Fig. 7, we compare the video descriptions generated by the following models: GA, which generates the video description using only the generated auditory MFCC features; Visual, which generates the video description using only the visual CNN features; and V-ShaWei-GA, which generates the video description using the visual CNN features and the generated auditory MFCC features with shared weights.
For the first video in Fig. 7, the Visual model attends mainly to the visual cue and therefore generates the wrong content "piano", because the object behind the child looks very much like a piano and occupies too much of the frame; V-ShaWei-GA captures the sound-producing object "violin" more accurately, because it can model the resonance information between the audio and visual modalities; the GA model can also generate the more relevant object "violin".
For the second video in Fig. 7, V-ShaWei-GA generates a more accurate description of the behavior ("pouring sauce into a pot" vs. "cooking the something"), which shows that it can capture the behaviors to which both audio and vision are sensitive, i.e. the resonance information; the GA model also generates the accurate behavior "pouring", which shows that the generated audio modality is meaningful.
For the third video in Fig. 7, both the V-ShaWei-GA and GA models generate the relevant object ("girl" vs. "man"). The English sentences in Fig. 7 are the video descriptions output by the different models, shown for demonstrating and comparing the technical effect.
Those skilled in the art should appreciate that the exemplary models, units and method steps described in connection with the embodiments disclosed herein can be implemented in electronic hardware, computer software, or a combination of the two. To clearly illustrate the interchangeability of electronic hardware and software, the components and steps of each example have been described above generally in terms of their functions. Whether these functions are performed in electronic hardware or software depends on the specific application and design constraints of the technical solution. Those skilled in the art may use different methods to implement the described functions for each particular application, but such implementations should not be considered beyond the scope of the present invention.
So far, the technical solutions of the present invention have been described with reference to the preferred embodiments shown in the accompanying drawings. However, those skilled in the art will readily appreciate that the protection scope of the present invention is obviously not limited to these specific embodiments. Without departing from the principle of the present invention, those skilled in the art can make equivalent changes or substitutions to the relevant technical features, and the technical solutions after such changes or substitutions fall within the protection scope of the present invention.

Claims (10)

1. A dynamic multi-modal video description generation method, characterized by comprising the following steps:
Step S1: extracting the corresponding visual CNN features and auditory MFCC features from a video, and determining whether the auditory MFCC features are damaged or missing; if so, performing step S2, otherwise performing step S3;
Step S2: inferring complete auditory MFCC features from the visual CNN features with an auditory inference model based on an encoding-decoding process;
Step S3: using the visual CNN features and the auditory MFCC features, and based on the temporal dependency between the audio and visual streams, encoding the two modalities with a multi-modal encoder and fusing their interaction to obtain a fused feature, and iteratively decoding the fused feature with a decoder to generate the video description.
2. The dynamic multi-modal video description generation method according to claim 1, characterized in that the auditory inference model generates the auditory MFCC features by:
encoding the visual CNN features with an encoder to obtain a high-level semantic representation;
decoding the corresponding auditory MFCC features with a decoder.
3. The dynamic multi-modal video description generation method according to claim 1, characterized in that the multi-modal encoder is a multi-modal LSTM encoder based on shared weights or a multi-modal memory-unit encoder based on a shared memory unit.
4. The dynamic multi-modal video description generation method according to claim 1, characterized in that the multi-modal LSTM encoder based on shared weights comprises two LSTM neural networks, used to encode the visual CNN features and the auditory MFCC features respectively, and that weights are shared between the internal memory units of the two LSTM neural networks.
5. The dynamic multi-modal video description generation method according to claim 4, characterized in that the multi-modal LSTM encoder based on shared weights is modeled as follows:

i_t^s = σ(W_i^s x_t^s + U_i h_{t-1}^s + b_i^s)    (1)
f_t^s = σ(W_f^s x_t^s + U_f h_{t-1}^s + b_f^s)    (2)
o_t^s = σ(W_o^s x_t^s + U_o h_{t-1}^s + b_o^s)    (3)
c̃_t^s = tanh(W_c^s x_t^s + U_c h_{t-1}^s + b_c^s)    (4)
c_t^s = f_t^s ⊙ c_{t-1}^s + i_t^s ⊙ c̃_t^s    (5)
h_t^s = o_t^s ⊙ tanh(c_t^s)    (6)

where:
i_t, f_t, o_t and c̃_t are the input gate, forget gate, output gate and candidate memory cell, respectively;
the superscript s is the modality index: s = 0 denotes the LSTM-based auditory encoder and s = 1 denotes the LSTM-based visual encoder;
x^0 is the auditory MFCC feature and x^1 is the visual CNN feature;
W, U, b are the weight matrices and biases of the corresponding terms, where the U matrices are the hidden-unit weights shared between the two LSTM-based audio-visual encoders;
σ is the sigmoid function and ⊙ denotes element-wise multiplication;
h_t, h_{t-1} are the LSTM hidden states at times t and t-1;
W_i, W_f, W_o, W_c are the weights of the input gate, forget gate, output gate and candidate memory cell with respect to the input x;
x_t is the input at time t;
U_i, U_f, U_o, U_c are the weights of the input gate, forget gate, output gate and candidate memory cell with respect to the hidden state h;
b_i, b_f, b_o, b_c are the bias terms of the input gate, forget gate, output gate and candidate memory cell;
c_t, c_{t-1} are the memory-cell values at times t and t-1.
6. The dynamic multi-modal video description generation method according to claim 4, characterized in that the multi-modal memory-unit encoder based on a shared memory unit comprises two LSTM neural networks, used to encode the visual CNN features and the auditory MFCC features respectively, and that the internal memory units of the two LSTM neural networks are updated with information through an external memory unit.
7. The dynamic multi-modal video description generation method according to claim 4, characterized in that "the internal memory units of the two LSTM neural networks are updated with information through an external memory unit" means:
reading information from the external memory unit;
fusing the information read from the external memory unit with the internal memory units of the two LSTM neural networks respectively, and updating the memory units of the two LSTM neural networks.
8. The dynamic multi-modal video description generation method according to any one of claims 1-7, characterized in that "extracting the corresponding visual CNN features and auditory MFCC features from a video" comprises:
extracting visual CNN features from the video frames with a convolutional neural network;
extracting auditory MFCC features from the audio segments corresponding to the video frames with a convolutional neural network.
9. The video description generation method according to claim 8, characterized in that the multi-modal encoder based on a shared external memory unit realizes the interaction fusion of the visual CNN features and the auditory MFCC features through read-write operations, specifically:
Step S11: reading information from the memory unit of the multi-modal encoder;
Step S12: fusing the information read from the memory unit with the visual CNN features and the auditory MFCC features obtained by the multi-modal encoder's encoding, respectively, to obtain fused information;
Step S13: storing the fused information into the memory unit of the multi-modal encoder.
10. The video description generation method according to claim 8, characterized in that the visual CNN features and the auditory MFCC features do not share weights before being input to the multi-modal encoder.
CN201711433810.6A 2017-12-26 2017-12-26 Dynamic multi-modal video description generation method Active CN108200483B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201711433810.6A CN108200483B (en) 2017-12-26 2017-12-26 Dynamic multi-modal video description generation method

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201711433810.6A CN108200483B (en) 2017-12-26 2017-12-26 Dynamic multi-modal video description generation method

Publications (2)

Publication Number Publication Date
CN108200483A true CN108200483A (en) 2018-06-22
CN108200483B CN108200483B (en) 2020-02-28

Family

ID=62584286

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201711433810.6A Active CN108200483B (en) 2017-12-26 2017-12-26 Dynamic multi-modal video description generation method

Country Status (1)

Country Link
CN (1) CN108200483B (en)

Patent Citations (7)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN105005558A (en) * 2015-08-14 2015-10-28 武汉大学 Multi-modal data fusion method based on crowd sensing
CN105279495A (en) * 2015-10-23 2016-01-27 天津大学 Video description method based on deep learning and text summarization
CN106708890A (en) * 2015-11-17 2017-05-24 创意引晴股份有限公司 Intelligent high fault-tolerant video identification system based on multimoding fusion and identification method thereof
CN107256221A (en) * 2017-04-26 2017-10-17 苏州大学 Video presentation method based on multi-feature fusion
CN107220591A (en) * 2017-04-28 2017-09-29 哈尔滨工业大学深圳研究生院 Multi-modal intelligent mood sensing system
CN107391646A (en) * 2017-07-13 2017-11-24 清华大学 A kind of Semantic features extraction method and device of video image
CN107451552A (en) * 2017-07-25 2017-12-08 北京联合大学 A kind of gesture identification method based on 3D CNN and convolution LSTM

Cited By (21)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN109190683A (en) * 2018-08-14 2019-01-11 电子科技大学 A kind of classification method based on attention mechanism and bimodal image
CN112889108B (en) * 2018-10-16 2022-08-16 谷歌有限责任公司 Speech classification using audiovisual data
CN112889108A (en) * 2018-10-16 2021-06-01 谷歌有限责任公司 Speech classification using audiovisual data
CN111464881A (en) * 2019-01-18 2020-07-28 复旦大学 Full-convolution video description generation method based on self-optimization mechanism
CN109885723A (en) * 2019-02-20 2019-06-14 腾讯科技(深圳)有限公司 A kind of generation method of video dynamic thumbnail, the method and device of model training
CN109885723B (en) * 2019-02-20 2023-10-13 腾讯科技(深圳)有限公司 Method for generating video dynamic thumbnail, method and device for model training
CN110110636A (en) * 2019-04-28 2019-08-09 清华大学 Video logic mining model and method based on multiple input single output coding/decoding model
CN110110636B (en) * 2019-04-28 2021-03-02 清华大学 Video logic mining device and method based on multi-input single-output coding and decoding model
CN110222227A (en) * 2019-05-13 2019-09-10 西安交通大学 A kind of Chinese folk song classification of countries method merging auditory perceptual feature and visual signature
CN110826397A (en) * 2019-09-20 2020-02-21 浙江大学 Video description method based on high-order low-rank multi-modal attention mechanism
CN110826397B (en) * 2019-09-20 2022-07-26 浙江大学 Video description method based on high-order low-rank multi-modal attention mechanism
CN111309971A (en) * 2020-01-19 2020-06-19 浙江工商大学 Multi-level coding-based text-to-video cross-modal retrieval method
CN111859005A (en) * 2020-07-01 2020-10-30 江西理工大学 Cross-layer multi-model feature fusion and image description method based on convolutional decoding
CN111859005B (en) * 2020-07-01 2022-03-29 江西理工大学 Cross-layer multi-model feature fusion and image description method based on convolutional decoding
CN112069361A (en) * 2020-08-27 2020-12-11 新华智云科技有限公司 Video description text generation method based on multi-mode fusion
CN112287893A (en) * 2020-11-25 2021-01-29 广东技术师范大学 Sow lactation behavior identification method based on audio and video information fusion
CN112287893B (en) * 2020-11-25 2023-07-18 广东技术师范大学 Sow lactation behavior identification method based on audio and video information fusion
CN112331337A (en) * 2021-01-04 2021-02-05 中国科学院自动化研究所 Automatic depression detection method, device and equipment
US11266338B1 (en) 2021-01-04 2022-03-08 Institute Of Automation, Chinese Academy Of Sciences Automatic depression detection method and device, and equipment
CN114581749B (en) * 2022-05-09 2022-07-26 城云科技(中国)有限公司 Audio-visual feature fusion target behavior identification method and device and application
CN114581749A (en) * 2022-05-09 2022-06-03 城云科技(中国)有限公司 Audio-visual feature fusion target behavior identification method and device and application

Also Published As

Publication number Publication date
CN108200483B (en) 2020-02-28

Similar Documents

Publication Publication Date Title
CN108200483A (en) Dynamically multi-modal video presentation generation method
Park et al. A metaverse: Taxonomy, components, applications, and open challenges
Wang et al. T3: Tree-autoencoder constrained adversarial text generation for targeted attack
CN109219812A (en) Spatial term in spoken dialogue system
Hossain et al. Text to image synthesis for improved image captioning
CN111104512B (en) Game comment processing method and related equipment
CN110234018A (en) Multimedia content description generation method, training method, device, equipment and medium
Zhang et al. Image captioning via semantic element embedding
Dong The sociolinguistics of voice in globalising China
Ayers The limits of transactional identity: Whiteness and embodiment in digital facial replacement
Studt Virtual reality documentaries and the illusion of presence
CN107122393A (en) Electron album generation method and device
Rastgoo et al. A survey on recent advances in Sign Language Production
Rastgoo et al. All You Need In Sign Language Production
CN109635303A (en) The recognition methods of specific area metasemy word
Rose Our Posthuman Past: Transhumanism, Posthumanism and Ethical Futures
Park et al. OLKAVS: An Open Large-Scale Korean Audio-Visual Speech Dataset
Zajko ‘What Difference Was Made?’: Feminist Models of Reception
Nam et al. A survey on multimodal bidirectional machine learning translation of image and natural language processing
Zhao et al. Research on video captioning based on multifeature fusion
Liu Research on virtual interactive animation design system based on deep learning
Starc et al. Constructing a Natural Language Inference dataset using generative neural networks
Wang China in the Age of Global Capitalism: Jia Zhangke's Filmic World
Haugen The construction of Beijing as an Olympic City
Lash et al. Gilles Deleuze and Film Criticism: Philosophy, Theory, and the Individual Film

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant