CN108200483A - Dynamic multi-modal video description generation method - Google Patents
- Publication number
- CN108200483A CN108200483A CN201711433810.6A CN201711433810A CN108200483A CN 108200483 A CN108200483 A CN 108200483A CN 201711433810 A CN201711433810 A CN 201711433810A CN 108200483 A CN108200483 A CN 108200483A
- Authority
- CN
- China
- Prior art keywords
- hearing
- sense
- video presentation
- modal
- features
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Granted
Classifications
-
- H—ELECTRICITY
- H04—ELECTRIC COMMUNICATION TECHNIQUE
- H04N—PICTORIAL COMMUNICATION, e.g. TELEVISION
- H04N21/00—Selective content distribution, e.g. interactive television or video on demand [VOD]
- H04N21/80—Generation or processing of content or additional data by content creator independently of the distribution process; Content per se
- H04N21/83—Generation or processing of protective or descriptive data associated with content; Content structuring
- H04N21/84—Generation or processing of descriptive data, e.g. content descriptors
-
- H—ELECTRICITY
- H04—ELECTRIC COMMUNICATION TECHNIQUE
- H04N—PICTORIAL COMMUNICATION, e.g. TELEVISION
- H04N21/00—Selective content distribution, e.g. interactive television or video on demand [VOD]
- H04N21/40—Client devices specifically adapted for the reception of or interaction with content, e.g. set-top-box [STB]; Operations thereof
- H04N21/43—Processing of content or additional data, e.g. demultiplexing additional data from a digital video stream; Elementary client operations, e.g. monitoring of home network or synchronising decoder's clock; Client middleware
- H04N21/44—Processing of video elementary streams, e.g. splicing a video clip retrieved from local storage with an incoming video stream, rendering scenes according to MPEG-4 scene graphs
- H04N21/44008—Processing of video elementary streams, e.g. splicing a video clip retrieved from local storage with an incoming video stream, rendering scenes according to MPEG-4 scene graphs involving operations for analysing video streams, e.g. detecting features or characteristics in the video stream
-
- H—ELECTRICITY
- H04—ELECTRIC COMMUNICATION TECHNIQUE
- H04N—PICTORIAL COMMUNICATION, e.g. TELEVISION
- H04N21/00—Selective content distribution, e.g. interactive television or video on demand [VOD]
- H04N21/40—Client devices specifically adapted for the reception of or interaction with content, e.g. set-top-box [STB]; Operations thereof
- H04N21/45—Management operations performed by the client for facilitating the reception of or the interaction with the content or administrating data related to the end-user or to the client device itself, e.g. learning user preferences for recommending movies, resolving scheduling conflicts
- H04N21/466—Learning process for intelligent management, e.g. learning user preferences for recommending movies
- H04N21/4662—Learning process for intelligent management, e.g. learning user preferences for recommending movies characterized by learning algorithms
- H04N21/4666—Learning process for intelligent management, e.g. learning user preferences for recommending movies characterized by learning algorithms using neural networks, e.g. processing the feedback provided by the user
Abstract
The invention belongs to the field of video description and specifically relates to a dynamic multi-modal video description generation method. It aims to capture the resonance information between the visual and auditory modalities so as to generate better video descriptions, and additionally to handle the case in which the audio modality of a video is damaged or missing. The multi-modal video description generation system proposed by the invention models the temporal dependence between the visual and auditory modalities and captures their resonance information by sharing the weights of the LSTM internal memory units, or by sharing an external memory unit, during the feature-encoding stage of the two modalities. In addition, based on an auditory inference system, the invention infers the corresponding audio-modality information from the known visual-modality information. The invention can generate video descriptions quickly and effectively.
Description
Technical field
The invention belongs to the field of video description and specifically relates to a dynamic multi-modal video description generation method.
Background art
A video description generation system automatically generates a corresponding natural-language description for a given video. Video description generation is inspired by image description generation: both iteratively generate a description based on the features of a given image or of video frames.
Traditional video description generation methods can be roughly divided into three classes:
The first class is template-based methods. They first determine the semantic concepts contained in the video, then infer the sentence structure from predefined sentence templates, and finally use a probabilistic graphical model to collect the most relevant content from which a sentence description can be generated. Although this approach can generate grammatically correct sentences, the sentences lack richness and flexibility.
The second class treats video description as a retrieval problem. The video is first annotated with metadata, and sentence descriptions and videos are then classified according to the corresponding annotations. Compared with the first class, the sentences generated in this way are more natural; however, they are largely limited by the metadata.
The third class of methods maps the video representation directly onto a specific sentence. Venugopalan et al. first extract features from all frames of the video with a convolutional neural network (CNN) and apply mean pooling to obtain a fixed-length video representation; an LSTM decoder then generates the corresponding video description from this representation. Although this method obtains a video description easily, it ignores the temporal structure implicit in the video. To exploit the role of temporal structure in video description generation, Venugopalan et al. explored encoding the video-frame features with an LSTM to obtain a fixed-length video representation. Yao et al., on the other hand, explored the effect of the local and global temporal structure implicit in the video on description generation: the local temporal structure can be encoded by a spatio-temporal convolutional neural network, and the global temporal structure by a temporal attention mechanism. To further improve the performance of video description generation, Ballas et al. studied the effect of intermediate-layer video representations, and Pan et al. proposed a hierarchical recurrent neural network decoder to explore the role of temporal structure at different granularities. To generate more detailed video descriptions, Haonan Yu et al. proposed using a hierarchical recurrent neural network to generate paragraph descriptions of videos. Their hierarchical model contains a sentence generator and a paragraph generator: the sentence generator produces a simple sentence description for a specific short interval of the video, and the paragraph generator, by capturing the dependence between the sentences produced by the sentence generator, encodes them into a paragraph description of the video.
The video description generation models above share a common property: their video representations are based only on visual information. To make full use of the information contained in a video, many researchers have proposed fusing the multi-modal information in the video to obtain better descriptions. Vasili Ramanishka et al. directly concatenate the corresponding visual and auditory features to obtain a multi-modal video representation. Qin Jin et al. use a multi-layer feed-forward neural network to fuse the audio-visual information. Although these methods can improve the performance of video description to some degree, they still suffer from the following problems. For example, directly concatenating audio-visual features can lead to aliasing of information and a decline in performance. In addition, the model proposed by Qin Jin et al. obtains the video features by pooling all frame features or clip features, ignoring the temporal dependence between the (visual, auditory) features.
The multi-modal video description generation system proposed by the present invention models the temporal dependence between the visual and auditory modalities, captures their resonance information, and thereby generates better video descriptions, by sharing the weights of the LSTM internal memory units, or by sharing an external memory unit, during the feature-encoding stage of the two modalities. In addition, owing to factors such as environmental influence and sensor interference, the audio modality of a video may in some cases be damaged or missing. So that the performance of the video description generation system remains essentially unaffected in such cases, the present invention proposes an auditory inference system that infers the corresponding audio-modality information from the known visual-modality information; the completed audio-visual modal information is then fed into the multi-modal video description generation system to generate the video description.
Summary of the invention
To solve the above problems, in particular the problem that a video description cannot be accurately generated when the audio modality is damaged or missing, the dynamic multi-modal video description generation method proposed by the present invention comprises the following steps:
Step S1: extract the corresponding visual CNN features and auditory MFCC features from the video, and judge whether the auditory MFCC features are damaged or missing; if so, perform step S2, otherwise perform step S3;
Step S2: infer the complete auditory MFCC features from the visual CNN features with an auditory inference model based on an encoding-decoding process;
Step S3: using the visual CNN features and auditory MFCC features, and based on the temporal dependence between the visual and auditory modalities, encode and fuse the interaction of the two modalities with a multi-modal encoder to obtain fused features, and generate the video description by iteratively decoding the fused features with a decoder.
Preferably, the auditory inference model generates the auditory MFCC features as follows:
encode the visual CNN features with an encoder to obtain a high-level semantic representation;
decode the corresponding auditory MFCC features with a decoder.
Preferably, the multi-modal encoder is a multi-modal LSTM encoder based on shared weights or a multi-modal memory-unit encoder based on a shared memory unit.
Preferably, the multi-modal LSTM encoder based on shared weights contains two LSTM neural networks, used respectively to encode the visual CNN features and the auditory MFCC features; the weights are shared between the internal memory units of the two LSTM neural networks.
Preferably, the multi-modal LSTM encoder based on shared weights is modeled as follows:

i_t^s = σ(W_i^s · x_t^s + U_i · h_{t-1}^s + b_i^s)
f_t^s = σ(W_f^s · x_t^s + U_f · h_{t-1}^s + b_f^s)
o_t^s = σ(W_o^s · x_t^s + U_o · h_{t-1}^s + b_o^s)
c̃_t^s = tanh(W_c^s · x_t^s + U_c · h_{t-1}^s + b_c^s)
c_t^s = f_t^s ⊙ c_{t-1}^s + i_t^s ⊙ c̃_t^s
h_t^s = o_t^s ⊙ tanh(c_t^s)

where i_t, f_t, o_t and c̃_t are the input gate, forget gate, output gate and memory-unit candidate respectively; the superscript s is the index of the modality, s=0 denoting the LSTM-based auditory-information encoder and s=1 the LSTM-based visual-information encoder; x^0 is the auditory MFCC feature and x^1 the visual CNN feature; W, U and b are the weight matrices and biases of the respective terms, the U matrices being the weights shared in the hidden-layer units by the two LSTM-based audio-visual encoders; σ is the sigmoid function; h_t and h_{t-1} are the hidden states of the LSTM at times t and t-1; W_i, W_f, W_o and W_c are the weights of the input gate, forget gate, output gate and memory unit with respect to the input x; x_{t-1} is the input at time t-1; U_i, U_f, U_o and U_c are the weights of the input gate, forget gate, output gate and memory unit with respect to the hidden state h; b_i, b_f, b_o and b_c are the corresponding bias terms; and c_t and c_{t-1} are the values of the memory unit at times t and t-1.
Preferably, the multi-modal memory-unit encoder based on a shared memory unit contains two LSTM neural networks, used respectively to encode the visual CNN features and the auditory MFCC features; the internal memory units of the two LSTM neural networks are updated with information through an external memory unit.
Preferably, the method by which "the internal memory units of the two LSTM neural networks are updated with information through an external memory unit" is:
read information from the external memory unit;
fuse the information read from the external memory unit with the internal memory units of the two LSTM neural networks respectively, and update the memory units of the two LSTM neural networks.
Preferably, the method of "extracting the corresponding visual CNN features and auditory MFCC features from the video" is:
extract the visual CNN features from the video frames in the video with a convolutional neural network;
extract the audio MFCC features from the audio segments corresponding to the video frames with a convolutional neural network.
Preferably, the multi-modal encoder based on the shared external memory unit realizes the interactive fusion of the visual CNN features and the auditory MFCC features through read and write operations, with the following specific steps:
Step S11: read information from the memory unit of the multi-modal encoder;
Step S12: fuse the information read from the memory unit with the visual CNN features and auditory MFCC features obtained from the multi-modal encoder's encoding, respectively, to obtain fused information;
Step S13: store the fused information into the memory unit of the multi-modal encoder.
Preferably, the visual CNN features and auditory MFCC features do not share weights before being input into the multi-modal encoder.
From the above technical solution it can be seen that the present invention proposes a dynamic multi-modal video description generation model, providing a quick and effective method for video description. Compared with the prior art, the present invention has the following advantages:
(1) The input information of the invention comes from both the visual and the auditory modality; compared with a traditional description generation system based only on visual information, it can achieve higher performance.
(2) The invention fuses the audio-visual information by modeling the temporal dependence between the audio-visual modalities. This temporal dependence can, to a certain extent, represent the resonance information between the modalities, i.e. the events that actually occur in the video. The invention can effectively model the resonance information in the video and can therefore generate ideal video descriptions.
(3) The invention can establish a unified video description system for videos with complete modal information and for videos with a missing audio modality. If the input video has both modalities complete, they are input directly into the multi-modal video description generation system; if the audio modality of the input video is damaged or missing, the auditory inference system generates the auditory features based on the visual features, and the two completed modalities are then input into the multi-modal video description generation system.
Description of the drawings
Fig. 1 is a flow diagram of the dynamic multi-modal video description generation method of an embodiment of the present invention;
Fig. 2 is a schematic diagram of the dynamic multi-modal video description generation model of an embodiment of the present invention;
Fig. 3 is a schematic diagram of the multi-modal audio-visual fusion encoding model based on shared weights of an embodiment of the present invention;
Fig. 4 is a schematic diagram of the multi-modal audio-visual fusion encoding model based on a shared external memory unit of an embodiment of the present invention;
Fig. 5 shows the video descriptions generated by each basic model and by the multi-modal fusion models of an embodiment of the present invention;
Fig. 6 is a schematic diagram of the auditory inference model of an embodiment of the present invention;
Fig. 7 shows the video descriptions generated by each basic model after supplementing the audio modality, for an embodiment of the present invention.
Specific embodiments
The preferred embodiments of the present invention are described below with reference to the accompanying drawings. It will be apparent to those skilled in the art that these embodiments are only used to explain the technical principles of the present invention and are not intended to limit the scope of the invention.
The specific steps of the dynamic multi-modal video description generation method provided by the invention, as shown in Fig. 1, are:
Step S1: extract the corresponding visual CNN features and auditory MFCC features from the video, and judge whether the auditory MFCC features are damaged or missing; if so, perform step S2, otherwise perform step S3;
Step S2: infer the complete auditory MFCC features from the visual CNN features with an auditory inference model based on an encoding-decoding process;
Step S3: using the visual CNN features and auditory MFCC features, and based on the temporal dependence between the visual and auditory modalities, encode and fuse the interaction of the two modalities with a multi-modal encoder to obtain fused features, and generate the video description by iteratively decoding the fused features with a decoder.
The dynamic multi-modal video description generation method proposed by the present invention realizes the video description function jointly through an audio-visual multi-modal video description generation system and an auditory inference model, where the preferred encoder of the audio-visual multi-modal video description generation system is a multi-modal LSTM encoder.
Fig. 2 shows a schematic diagram of the audio-visual multi-modal video description generation system of the present invention. The specific steps of this embodiment are:
Step S11: if the modalities of the input video are complete, they are input directly into the multi-modal video description generation system; if the auditory MFCC features of the input video are damaged or missing, the auditory inference system generates the auditory MFCC features from the known visual CNN features, and the two completed modalities are then input into the multi-modal video description generation system;
Step S12: extract visual CNN features from the video frames in the video with a convolutional neural network, while extracting auditory MFCC features from the audio segments corresponding to the video frames. The visual CNN features and auditory MFCC features of the video are then input into the multi-modal encoder respectively for encoding, and finally the text decoder iteratively generates the video description based on the feature representation provided by the encoder.
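The pairing of video frames with their corresponding audio snippets in step S12 can be sketched as follows. This is an illustrative alignment only: the 16 kHz sample rate, 25 fps frame rate, and the helper name `frame_audio_pairs` are assumptions made here, and in the actual system the paired frame and snippet would each be passed through a CNN and an MFCC extractor respectively.

```python
import numpy as np

def frame_audio_pairs(num_frames, fps, audio, sample_rate):
    """Pair each sampled video frame with the audio segment that covers
    the same span of time, so that visual CNN features and auditory MFCC
    features can later be extracted for corresponding instants."""
    samples_per_frame = int(sample_rate / fps)
    pairs = []
    for f in range(num_frames):
        start = f * samples_per_frame
        segment = audio[start:start + samples_per_frame]
        pairs.append((f, segment))
    return pairs

audio = np.zeros(16000)  # one second of (silent) audio at an assumed 16 kHz
pairs = frame_audio_pairs(num_frames=25, fps=25, audio=audio, sample_rate=16000)
# at 25 fps each frame owns 640 audio samples
```

Each `(frame_index, audio_segment)` pair then feeds the visual and auditory feature extractors in lock-step, which is what lets the encoders model the temporal dependence between the two modalities.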
In this embodiment, step S11 is specifically: if the auditory features extracted by the audio-visual multi-modal video description generation system are lost or missing owing to the influence of the external environment or to electromagnetic interference, the auditory inference model encodes the visual features; after the high-level semantic representation is obtained, the decoder of the auditory inference model is used to decode the corresponding auditory MFCC features, which are then input into the multi-modal video description generation system.
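The dynamic dispatch of steps S1-S3 can be sketched as follows. This is a minimal illustration, not the patented implementation: the function names `infer_audio` and `describe`, the use of `None` to mark a missing audio modality, and the feature dimensionalities (2048-dimensional CNN features, 39-dimensional MFCC) are assumptions made here.

```python
import numpy as np

def infer_audio(visual_feats):
    # Stand-in for the auditory inference model: map visual CNN features
    # to auditory MFCC features; here a fixed random projection replaces
    # the learned encoder-decoder.
    rng = np.random.default_rng(0)
    proj = rng.standard_normal((visual_feats.shape[1], 39)) * 0.01
    return visual_feats @ proj

def describe(visual_feats, audio_feats):
    # Stand-in for the multi-modal encoder plus text decoder.
    return "a generated video description"

def generate_description(visual_feats, audio_feats=None):
    """Step S1: check whether the auditory features are damaged or missing;
    step S2: if so, infer them from the visual features;
    step S3: encode, fuse and decode to produce the video description."""
    if audio_feats is None or np.isnan(audio_feats).any():
        audio_feats = infer_audio(visual_feats)   # step S2
    return describe(visual_feats, audio_feats)    # step S3

frames = np.random.default_rng(1).standard_normal((20, 2048))  # 20 frames
print(generate_description(frames))  # audio missing -> inferred first
```

The same entry point serves both the complete-modality case and the damaged-audio case, which is the "unified system" claimed in advantage (3).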
The multi-modal encoder of the audio-visual multi-modal video description generation system is a multi-modal LSTM encoder based on shared weights or a multi-modal memory-unit encoder based on a shared memory unit. Taking the multi-modal LSTM encoder as an example, as shown in Fig. 3, in the multi-modal LSTM encoder based on shared weights the LSTM can model the temporal dependence in sequence data. The audio-visual sequences in a video contain temporal structure as well as a resonance relationship between the two modalities. To model this temporal dependence, the present invention encodes the features of the two audio-visual modalities with two LSTM neural networks respectively, and shares the weights between the internal memory units of the two LSTM neural networks; a multi-modal encoder designed in this way can capture the temporal resonance information in the co-occurring audio-visual modalities.
The modeling of the multi-modal LSTM encoder based on shared weights is shown in formulas (1)-(6):

i_t^s = σ(W_i^s · x_t^s + U_i · h_{t-1}^s + b_i^s)    (1)
f_t^s = σ(W_f^s · x_t^s + U_f · h_{t-1}^s + b_f^s)    (2)
o_t^s = σ(W_o^s · x_t^s + U_o · h_{t-1}^s + b_o^s)    (3)
c̃_t^s = tanh(W_c^s · x_t^s + U_c · h_{t-1}^s + b_c^s)    (4)
c_t^s = f_t^s ⊙ c_{t-1}^s + i_t^s ⊙ c̃_t^s    (5)
h_t^s = o_t^s ⊙ tanh(c_t^s)    (6)

where i_t, f_t, o_t and c̃_t are the input gate, forget gate, output gate and memory-unit candidate respectively; the superscript s is the index of the modality, s=0 denoting the LSTM-based auditory-information encoder and s=1 the LSTM-based visual-information encoder; x^0 is the auditory MFCC feature and x^1 the visual CNN feature; W, U and b are the weight matrices and biases of the respective terms, the U matrices being the weights shared in the hidden-layer units by the two LSTM-based audio-visual encoders; σ is the sigmoid function; h_t and h_{t-1} are the hidden states of the LSTM at times t and t-1; W_i, W_f, W_o and W_c are the weights of the input gate, forget gate, output gate and memory unit with respect to the input x; x_{t-1} is the input at time t-1; U_i, U_f, U_o and U_c are the weights of the input gate, forget gate, output gate and memory unit with respect to the hidden state h; b_i, b_f, b_o and b_c are the corresponding bias terms; and c_t and c_{t-1} are the values of the memory unit at times t and t-1.
The visual CNN features and auditory MFCC features do not share weights before being input into the multi-modal encoder. There are two reasons why the input audio-visual features do not share weights. First, each modality learns its own specific mapping function from its feature space to the hidden space. Second, the input dimensions of the visual and auditory features differ, and separate weight matrices handle the differing dimensions of the two modalities more conveniently.
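Under these design choices — modality-specific W and b, shared U — a single step of the shared-weight multi-modal LSTM encoder of formulas (1)-(6) can be sketched in NumPy as follows. The dimensionalities (39-dimensional MFCC, 2048-dimensional CNN features, 64 hidden units) are assumptions made for illustration.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

class SharedWeightLSTM:
    """Two LSTM encoders (s=0 auditory, s=1 visual) whose hidden-to-hidden
    matrices U are shared; W and b stay modality-specific because the
    input dimensions of the two modalities differ."""
    def __init__(self, dims, hidden, seed=0):
        rng = np.random.default_rng(seed)
        # modality-specific input weights: one (4*hidden, dim) block per modality
        self.W = [rng.standard_normal((4 * hidden, d)) * 0.1 for d in dims]
        self.b = [np.zeros(4 * hidden) for _ in dims]
        # shared hidden-to-hidden weights U (gate blocks stacked)
        self.U = rng.standard_normal((4 * hidden, hidden)) * 0.1
        self.hidden = hidden

    def step(self, s, x, h_prev, c_prev):
        n = self.hidden
        z = self.W[s] @ x + self.U @ h_prev + self.b[s]
        i = sigmoid(z[0*n:1*n])   # input gate,   formula (1)
        f = sigmoid(z[1*n:2*n])   # forget gate,  formula (2)
        o = sigmoid(z[2*n:3*n])   # output gate,  formula (3)
        g = np.tanh(z[3*n:4*n])   # candidate,    formula (4)
        c = f * c_prev + i * g    # memory unit,  formula (5)
        h = o * np.tanh(c)        # hidden state, formula (6)
        return h, c

enc = SharedWeightLSTM(dims=[39, 2048], hidden=64)  # MFCC dim 39, CNN dim 2048
h = c = np.zeros(64)
h_a, c_a = enc.step(0, np.ones(39), h, c)    # one auditory step
h_v, c_v = enc.step(1, np.ones(2048), h, c)  # one visual step
```

Because both modalities drive the same U, the temporal dynamics learned from one modality constrain the other, which is what lets the encoder capture the resonance between co-occurring audio and visual events.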
As shown in Fig. 4, in the multi-modal memory-unit encoder based on a shared memory unit: although the LSTM can model temporal dependence, its time span is short. To further explore the effect of long-range dependence on video description generation, the present invention proposes a multi-modal encoder based on a shared external memory unit. The multi-modal memory-unit encoder with a shared memory unit contains two LSTM neural networks, used respectively to encode the visual CNN features and the auditory MFCC features; the internal memory units of the two LSTM neural networks are updated with information through the external memory unit, with the following specific steps:
Step S21: read information from the external memory unit;
Step S22: fuse the information read from the external memory unit with the internal memory units of the two LSTM neural networks respectively, and update the memory units of the two LSTM neural networks.
The multi-modal memory-unit encoder with a shared memory unit of this embodiment realizes the interactive fusion of the visual CNN features and auditory MFCC features through read and write operations, with the following specific steps:
Step S31: read information from the memory unit of the multi-modal encoder;
Step S32: fuse the information read from the memory unit with the visual CNN features and auditory MFCC features obtained from the multi-modal encoder's encoding, respectively, to obtain fused information;
Step S33: store the fused information into the memory unit of the multi-modal encoder.
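Steps S31-S33 can be sketched as follows. The class name `SharedExternalMemory` and the simple averaging fusion are stand-ins invented here for the learned read/write operations, which the patent does not spell out; the sketch only shows how one external store mediates between the two encoders.

```python
import numpy as np

class SharedExternalMemory:
    """A single external memory shared by the visual and auditory LSTM
    encoders: each step reads the memory, fuses the read-out with one
    encoder's internal memory unit, and writes the fused state back."""
    def __init__(self, size):
        self.memory = np.zeros(size)

    def read(self):
        # step S31: read information from the memory unit
        return self.memory.copy()

    def fuse_and_update(self, internal_cell):
        # step S32: fuse the read-out with one modality's internal memory
        # (a plain average stands in for the learned fusion)
        fused = 0.5 * (self.read() + internal_cell)
        # step S33: store the fused information back into the memory
        self.memory = fused
        return fused

mem = SharedExternalMemory(size=64)
c_visual = np.ones(64)          # internal memory unit of the visual LSTM
c_audio = np.full(64, 3.0)      # internal memory unit of the auditory LSTM
c_visual = mem.fuse_and_update(c_visual)  # memory now holds 0.5 everywhere
c_audio = mem.fuse_and_update(c_audio)    # memory now holds 1.75 everywhere
```

Because every update passes through the one shared store, information from each modality persists across many time steps, which is the long-range dependence the external memory is meant to capture.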
To verify the performance of the multi-modal video description generation system of the present invention, as shown in Fig. 5, we compared the video descriptions generated by the following models:
Audio, which generates the video description using the auditory MFCC features alone;
Visual, which generates the video description using the visual CNN features alone;
V-Cat-A, which directly concatenates the visual CNN features and auditory MFCC features as the video features to generate the video description;
V-ShaMem-A, in which the visual CNN features and auditory MFCC features share an external memory unit to obtain the final video features for generating the video description; the goal of this model is to verify the effect of long-range temporal dependence between the audio-visual modalities on video description generation;
V-ShaWei-A, in which the visual CNN features and auditory MFCC features share the weights of the LSTM internal memory units to obtain the final video features for generating the video description; the goal of this model is to verify the effect of short-range temporal dependence between the audio-visual modalities on video description generation.
For the first video in Fig. 5, the sentence generated by the Visual model focuses too much on the visual information and ignores the auditory information, so it produces erroneous content ("to a man" vs. "news"); the V-Cat-A model generates an accurate object ("a man") and behavior ("talking") but loses content ("news"), because directly concatenating the audio-visual features leads to aliasing of information and thus to the loss of part of it; with the help of the auditory information, V-ShaMem-A and V-ShaWei-A can generate sentences close to the reference sentence, and the sentence generated by V-ShaWei-A is more accurate ("news" vs. "something"), because V-ShaMem-A pays more attention to long-range information, so the word it generates is the more abstract "something", whereas V-ShaWei-A pays more attention to the fine-grained events that are actually taking place.
For the second video in Fig. 5, all models can generate the relevant behavior ("swimming") and target ("in the water"). However, only the V-ShaWei-A model generates the more accurate object ("fish" vs. "man" and "person"), because the V-ShaWei-A model pays more attention to events to which the audio-visual modalities are sensitive, i.e. events produced by audio-visual resonance.
For the third video in Fig. 5, only the V-ShaWei-A model generates the more relevant behavior ("showing" vs. "playing"), showing that the V-ShaWei-A model can capture the essence of the behavior.
For the fourth video in Fig. 5, the V-Cat-A and V-ShaWei-A models can, with the help of sound cues, generate more relevant behaviors ("knocking on a wall", "using a phone"); the V-ShaMem-A model pays more attention to the global event and therefore gives the sentence ("lying on bed"); meanwhile, the Visual model pays more attention to the visual information and also produces the description ("lying on bed").
For the fifth video in Fig. 5, the events occurring in the video are more related to the visual information. Therefore the Visual, V-ShaMem-A and V-ShaWei-A models generate an accurate behavior ("dancing" vs. "playing", "singing"); moreover, the V-ShaMem-A and V-ShaWei-A models generate more accurate objects ("a group of" vs. "a girl", "a cartoon character" and "someone"), showing that temporal dependence is helpful for locating objects; meanwhile, the V-ShaWei-A model gives the most relevant object ("cartoon characters" vs. "people"), showing that short-range temporal dependence is more effective. The English sentences in Fig. 5 are the video descriptions output by the different models, given for the display and comparison of the technical effect.
The auditory inference model is based on an encoding-decoding process, as shown in Fig. 6. Specifically, the co-occurring audio-visual modalities have the same high-level semantic representation; based on this representation, the damaged or missing auditory MFCC features can be generated from the known visual CNN features. Specifically, in the present invention the video-frame features are first encoded with the encoder; after the high-level semantic representation is obtained, the decoder is used to decode the corresponding auditory MFCC features. The numbers 1024, 512 and 256 denote the number of neurons contained in each network layer of the encoder or decoder.
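An encoder-decoder of this shape, with 1024-, 512- and 256-neuron layers, can be sketched as follows. The mirrored decoder sizes, the ReLU activations, and the 39-dimensional MFCC output are assumptions made here for illustration and are not fixed by the text above.

```python
import numpy as np

def relu(z):
    return np.maximum(z, 0.0)

class AuditoryInference:
    """Encoder-decoder sketch of the auditory inference model: the encoder
    compresses visual CNN features through 1024-, 512- and 256-neuron
    layers into a high-level semantic code, and the decoder expands the
    code back out to auditory MFCC features."""
    def __init__(self, visual_dim=2048, mfcc_dim=39, seed=0):
        rng = np.random.default_rng(seed)
        sizes = [visual_dim, 1024, 512, 256, 512, 1024, mfcc_dim]
        # one (out, in) weight matrix per layer, He-style scaled
        self.layers = [rng.standard_normal((m, n)) * np.sqrt(2.0 / n)
                       for n, m in zip(sizes[:-1], sizes[1:])]

    def __call__(self, visual_feat):
        h = visual_feat
        for i, W in enumerate(self.layers):
            h = W @ h
            if i < len(self.layers) - 1:
                h = relu(h)  # hidden layers; the output layer stays linear
        return h

model = AuditoryInference()
mfcc = model(np.random.default_rng(1).standard_normal(2048))
# mfcc stands in for the inferred auditory MFCC features of one frame
```

In the full system this inferred MFCC vector would replace the damaged or missing auditory features before being fed into the multi-modal encoder.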
To verify the performance of the multi-modal video description generation system of the present invention, as shown in Fig. 7, we compared the video descriptions generated by the following models: GA, which generates the video description using the generated auditory MFCC features alone; Visual, which generates the video description using the visual CNN features alone; and V-ShaWei-GA, which generates the video description using the visual CNN features and the generated auditory MFCC features with shared weights.
For the first video in Fig. 7, the Visual model attends mainly to visual cues and therefore generates the erroneous content "piano", because the object behind the child looks very much like a piano and occupies too much space in the frame. V-ShaWei-GA captures the more accurate sound-producing object "violin", because it can model the resonance information between the audio-visual modalities; the GA model likewise generates the more relevant object description "violin".
For the second video in Fig. 7, V-ShaWei-GA generates a more accurate behavior description ("pouring sauce into a pot" vs. "cooking the something"), showing that V-ShaWei-GA can capture behaviors to which both audio and vision are sensitive, i.e. resonance information. The GA model also generates the accurate behavior description "pouring", showing that the generated audio modality is meaningful.
For the third video in Fig. 7, both V-ShaWei-GA and the GA model generate the relevant object description ("girl" vs. "man"). The English sentences in Fig. 7 are the video descriptions output by the different models, shown to demonstrate the technical effect and for comparison.
Those skilled in the art should recognize that the exemplary models, units and method steps described in connection with the embodiments disclosed herein can be implemented as electronic hardware, computer software, or a combination of the two. To clearly illustrate the interchangeability of electronic hardware and software, the composition and steps of each example have been described above generally in terms of their functions. Whether these functions are actually performed in electronic hardware or software depends on the particular application and design constraints of the technical solution. Those skilled in the art may use different methods to realize the described functions for each particular application, but such realizations should not be considered beyond the scope of the present invention.
Heretofore, the technical solutions of the present invention have been described with reference to the preferred embodiments shown in the drawings; however, those skilled in the art will readily appreciate that the protection scope of the present invention is obviously not limited to these specific embodiments. Without departing from the principles of the present invention, those skilled in the art may make equivalent changes or substitutions to the relevant technical features, and the technical solutions after such changes or substitutions all fall within the protection scope of the present invention.
Claims (10)
1. A dynamic multi-modal video description generation method, characterized by comprising the following steps:
Step S1: extracting the corresponding visual CNN features and auditory MFCC features from a video, and judging whether the auditory MFCC features are damaged or missing; if damaged or missing, performing Step S2, otherwise performing Step S3;
Step S2: inferring complete auditory MFCC features from the visual CNN features through an auditory inference model based on an encoding-decoding process;
Step S3: using the visual CNN features and the auditory MFCC features, based on the temporal dependency between the audio and visual modalities, performing encoding through a multi-modal encoder and interactive fusion of the two audio-visual modalities to obtain a fused feature, and generating the video description after the fused feature is iteratively decoded by a decoder.
2. The dynamic multi-modal video description generation method according to claim 1, characterized in that the method by which the auditory inference model generates the auditory MFCC features is:
encoding the visual CNN features using an encoder to obtain high-level semantics;
decoding the corresponding auditory MFCC features using a decoder.
3. The dynamic multi-modal video description generation method according to claim 1, characterized in that the multi-modal encoder is a multi-modal LSTM encoder based on shared weights, or a multi-modal memory-unit encoder based on a shared memory unit.
4. The dynamic multi-modal video description generation method according to claim 1, characterized in that the multi-modal LSTM encoder based on shared weights comprises two LSTM neural networks, used respectively to encode the visual CNN features and the auditory MFCC features; weights are shared between the internal memory units of the two LSTM neural networks.
5. The dynamic multi-modal video description generation method according to claim 4, characterized in that the modeling formulas of the multi-modal LSTM encoder based on shared weights are as follows:
$$i_t^s = \sigma(W_i x_t^s + U_i h_{t-1}^s + b_i)$$
$$f_t^s = \sigma(W_f x_t^s + U_f h_{t-1}^s + b_f)$$
$$o_t^s = \sigma(W_o x_t^s + U_o h_{t-1}^s + b_o)$$
$$\tilde{c}_t^s = \tanh(W_c x_t^s + U_c h_{t-1}^s + b_c)$$
$$c_t^s = f_t^s \odot c_{t-1}^s + i_t^s \odot \tilde{c}_t^s$$
$$h_t^s = o_t^s \odot \tanh(c_t^s)$$
wherein,
i_t, f_t, o_t and c̃_t are the input gate, forget gate, output gate and memory unit, respectively;
the superscript s is the index of the modality;
s = 0 denotes the LSTM-based auditory information encoder;
s = 1 denotes the LSTM-based visual information encoder;
x^0 is the auditory MFCC features;
x^1 is the visual CNN features;
W, U and b are the weight matrices of the respective terms, wherein U denotes the weights shared by the LSTM-based audio-visual encoders in their hidden-layer units;
σ is the sigmoid function;
i is the input gate of the LSTM;
h is the hidden state of the LSTM;
h_t and h_{t-1} are the hidden states of the LSTM at times t and t-1;
W_i, W_f, W_o and W_c are the weights of the input gate, forget gate, output gate and memory unit terms with respect to the input x;
x_{t-1} is the input at time t-1;
U_i, U_f, U_o and U_c are the weights of the input gate, forget gate, output gate and memory unit terms with respect to the hidden state h;
b_i, b_f, b_o and b_c are the bias terms of the input gate, forget gate, output gate and memory unit terms;
c_t and c_{t-1} are the values of the memory unit at times t and t-1.
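One step of a shared-weight encoder of this kind can be sketched numerically as below: each modality s keeps its own input weights W^s and biases b^s, while the hidden-to-hidden matrices U are shared between the two streams. All dimensions and the random placeholder weights are assumptions for illustration, not trained parameters.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

class SharedWeightLSTM:
    """Two LSTM streams (s=0 auditory MFCC, s=1 visual CNN) whose
    hidden-to-hidden matrices U are shared across modalities, as in
    claim 5.  Feature dims (39, 2048) and hidden size 4 are assumed."""

    def __init__(self, dims=(39, 2048), hid=4, seed=0):
        rng = np.random.default_rng(seed)
        # per-modality input weights W^s and biases b^s for gates i, f, o, c
        self.W = [{g: rng.standard_normal((d, hid)) * 0.1 for g in "ifoc"}
                  for d in dims]
        self.b = [{g: np.zeros(hid) for g in "ifoc"} for _ in dims]
        # hidden-to-hidden weights U, shared by both modality streams
        self.U = {g: rng.standard_normal((hid, hid)) * 0.1 for g in "ifoc"}

    def step(self, s, x, h_prev, c_prev):
        W, U, b = self.W[s], self.U, self.b[s]
        i = sigmoid(x @ W["i"] + h_prev @ U["i"] + b["i"])  # input gate
        f = sigmoid(x @ W["f"] + h_prev @ U["f"] + b["f"])  # forget gate
        o = sigmoid(x @ W["o"] + h_prev @ U["o"] + b["o"])  # output gate
        c_tilde = np.tanh(x @ W["c"] + h_prev @ U["c"] + b["c"])
        c = f * c_prev + i * c_tilde   # update the memory unit
        h = o * np.tanh(c)             # new hidden state
        return h, c

lstm = SharedWeightLSTM()
h0 = c0 = np.zeros(4)
h_a, c_a = lstm.step(0, np.ones(39), h0, c0)    # auditory stream
h_v, c_v = lstm.step(1, np.ones(2048), h0, c0)  # visual stream
print(h_a.shape, h_v.shape)
```

Because `self.U` is a single dictionary used by both calls, the hidden-state weights are literally the same arrays for the auditory and visual streams, which is the weight sharing the claim describes.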
6. The dynamic multi-modal video description generation method according to claim 4, characterized in that the multi-modal memory-unit encoder based on a shared memory unit comprises two LSTM neural networks, used respectively to encode the visual CNN features and the auditory MFCC features; the internal memory units of the two LSTM neural networks are updated with information through an external memory unit.
7. The dynamic multi-modal video description generation method according to claim 4, characterized in that the method by which "the internal memory units of the two LSTM neural networks are updated with information through an external memory unit" is:
reading information from the external memory unit;
fusing the information read from the external memory unit with the internal memory units of the two LSTM neural networks respectively, and updating the memory units of the two LSTM neural networks.
8. The dynamic multi-modal video description generation method according to any one of claims 1-7, characterized in that the method of "extracting the corresponding visual CNN features and auditory MFCC features from a video" is:
extracting visual CNN features from the video frames through a convolutional neural network;
extracting auditory MFCC features from the audio fragments corresponding to the video frames through a convolutional neural network.
9. The video description generation method according to claim 8, characterized in that the multi-modal encoder based on a shared external memory unit realizes the interactive fusion of the visual CNN features and the auditory MFCC features through read-write operations, the specific steps being:
Step S11: reading information from the memory unit of the multi-modal encoder;
Step S12: fusing the information read from the memory unit with the visual CNN features and the auditory MFCC features obtained by the encoding of the multi-modal encoder, respectively, to obtain fused information;
Step S13: storing the fused information into the memory unit of the multi-modal encoder.
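Steps S11-S13 can be sketched as one read-fuse-write cycle over a slot matrix; mean-pooled reading, averaging fusion and append-style writing are illustrative assumptions here, not the patent's exact operators.

```python
import numpy as np

def memory_read_write(mem, vis_feat, aud_feat):
    """Claim 9 sketch of steps S11-S13: read from the shared memory
    unit, fuse the read vector with each modality's encoded features,
    and store the fused information back into the memory."""
    read = mem.mean(axis=0)               # S11: read from the memory unit
    fused_v = (vis_feat + read) / 2.0     # S12: fuse read info with visual
    fused_a = (aud_feat + read) / 2.0     # S12: fuse read info with audio
    mem = np.vstack([mem, (fused_v + fused_a) / 2.0])  # S13: store fusion
    return mem, fused_v, fused_a

mem0 = np.zeros((2, 4))   # 2 memory slots of width 4
mem1, fv, fa = memory_read_write(mem0, np.ones(4), np.full(4, 3.0))
print(mem1.shape)  # (3, 4)
```

Writing the fused vector back means the next read already reflects both modalities, which is how the read-write cycle realizes their interactive fusion.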
10. The video description generation method according to claim 8, characterized in that the visual CNN features and the auditory MFCC features do not share weights before being input to the multi-modal encoder.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201711433810.6A CN108200483B (en) | 2017-12-26 | 2017-12-26 | Dynamic multi-modal video description generation method |
Publications (2)
Publication Number | Publication Date |
---|---|
CN108200483A true CN108200483A (en) | 2018-06-22 |
CN108200483B CN108200483B (en) | 2020-02-28 |
Family
ID=62584286
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN201711433810.6A Active CN108200483B (en) | 2017-12-26 | 2017-12-26 | Dynamic multi-modal video description generation method |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN108200483B (en) |
Citations (7)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN105005558A (en) * | 2015-08-14 | 2015-10-28 | 武汉大学 | Multi-modal data fusion method based on crowd sensing |
CN105279495A (en) * | 2015-10-23 | 2016-01-27 | 天津大学 | Video description method based on deep learning and text summarization |
CN106708890A (en) * | 2015-11-17 | 2017-05-24 | 创意引晴股份有限公司 | Intelligent high fault-tolerant video identification system based on multimoding fusion and identification method thereof |
CN107220591A (en) * | 2017-04-28 | 2017-09-29 | 哈尔滨工业大学深圳研究生院 | Multi-modal intelligent mood sensing system |
CN107256221A (en) * | 2017-04-26 | 2017-10-17 | 苏州大学 | Video presentation method based on multi-feature fusion |
CN107391646A (en) * | 2017-07-13 | 2017-11-24 | 清华大学 | A kind of Semantic features extraction method and device of video image |
CN107451552A (en) * | 2017-07-25 | 2017-12-08 | 北京联合大学 | A kind of gesture identification method based on 3D CNN and convolution LSTM |
Cited By (21)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN109190683A (en) * | 2018-08-14 | 2019-01-11 | 电子科技大学 | A kind of classification method based on attention mechanism and bimodal image |
CN112889108B (en) * | 2018-10-16 | 2022-08-16 | 谷歌有限责任公司 | Speech classification using audiovisual data |
CN112889108A (en) * | 2018-10-16 | 2021-06-01 | 谷歌有限责任公司 | Speech classification using audiovisual data |
CN111464881A (en) * | 2019-01-18 | 2020-07-28 | 复旦大学 | Full-convolution video description generation method based on self-optimization mechanism |
CN109885723A (en) * | 2019-02-20 | 2019-06-14 | 腾讯科技(深圳)有限公司 | A kind of generation method of video dynamic thumbnail, the method and device of model training |
CN109885723B (en) * | 2019-02-20 | 2023-10-13 | 腾讯科技(深圳)有限公司 | Method for generating video dynamic thumbnail, method and device for model training |
CN110110636A (en) * | 2019-04-28 | 2019-08-09 | 清华大学 | Video logic mining model and method based on multiple input single output coding/decoding model |
CN110110636B (en) * | 2019-04-28 | 2021-03-02 | 清华大学 | Video logic mining device and method based on multi-input single-output coding and decoding model |
CN110222227A (en) * | 2019-05-13 | 2019-09-10 | 西安交通大学 | A kind of Chinese folk song classification of countries method merging auditory perceptual feature and visual signature |
CN110826397A (en) * | 2019-09-20 | 2020-02-21 | 浙江大学 | Video description method based on high-order low-rank multi-modal attention mechanism |
CN110826397B (en) * | 2019-09-20 | 2022-07-26 | 浙江大学 | Video description method based on high-order low-rank multi-modal attention mechanism |
CN111309971A (en) * | 2020-01-19 | 2020-06-19 | 浙江工商大学 | Multi-level coding-based text-to-video cross-modal retrieval method |
CN111859005A (en) * | 2020-07-01 | 2020-10-30 | 江西理工大学 | Cross-layer multi-model feature fusion and image description method based on convolutional decoding |
CN111859005B (en) * | 2020-07-01 | 2022-03-29 | 江西理工大学 | Cross-layer multi-model feature fusion and image description method based on convolutional decoding |
CN112069361A (en) * | 2020-08-27 | 2020-12-11 | 新华智云科技有限公司 | Video description text generation method based on multi-mode fusion |
CN112287893A (en) * | 2020-11-25 | 2021-01-29 | 广东技术师范大学 | Sow lactation behavior identification method based on audio and video information fusion |
CN112287893B (en) * | 2020-11-25 | 2023-07-18 | 广东技术师范大学 | Sow lactation behavior identification method based on audio and video information fusion |
CN112331337A (en) * | 2021-01-04 | 2021-02-05 | 中国科学院自动化研究所 | Automatic depression detection method, device and equipment |
US11266338B1 (en) | 2021-01-04 | 2022-03-08 | Institute Of Automation, Chinese Academy Of Sciences | Automatic depression detection method and device, and equipment |
CN114581749B (en) * | 2022-05-09 | 2022-07-26 | 城云科技(中国)有限公司 | Audio-visual feature fusion target behavior identification method and device and application |
CN114581749A (en) * | 2022-05-09 | 2022-06-03 | 城云科技(中国)有限公司 | Audio-visual feature fusion target behavior identification method and device and application |
Also Published As
Publication number | Publication date |
---|---|
CN108200483B (en) | 2020-02-28 |
Similar Documents
Publication | Publication Date | Title |
---|---|---|
CN108200483A (en) | Dynamically multi-modal video presentation generation method | |
Park et al. | A metaverse: Taxonomy, components, applications, and open challenges | |
Wang et al. | T3: Tree-autoencoder constrained adversarial text generation for targeted attack | |
CN109219812A (en) | Spatial term in spoken dialogue system | |
Hossain et al. | Text to image synthesis for improved image captioning | |
CN111104512B (en) | Game comment processing method and related equipment | |
CN110234018A (en) | Multimedia content description generation method, training method, device, equipment and medium | |
Zhang et al. | Image captioning via semantic element embedding | |
Dong | The sociolinguistics of voice in globalising China | |
Ayers | The limits of transactional identity: Whiteness and embodiment in digital facial replacement | |
Studt | Virtual reality documentaries and the illusion of presence | |
CN107122393A (en) | Electron album generation method and device | |
Rastgoo et al. | A survey on recent advances in Sign Language Production | |
Rastgoo et al. | All You Need In Sign Language Production | |
CN109635303A (en) | The recognition methods of specific area metasemy word | |
Rose | Our Posthuman Past: Transhumanism, Posthumanism and Ethical Futures | |
Park et al. | OLKAVS: An Open Large-Scale Korean Audio-Visual Speech Dataset | |
Zajko | ‘What Difference Was Made?’: Feminist Models of Reception | |
Nam et al. | A survey on multimodal bidirectional machine learning translation of image and natural language processing | |
Zhao et al. | Research on video captioning based on multifeature fusion | |
Liu | Research on virtual interactive animation design system based on deep learning | |
Starc et al. | Constructing a Natural Language Inference dataset using generative neural networks | |
Wang | China in the Age of Global Capitalism: Jia Zhangke's Filmic World | |
Haugen | The construction of Beijing as an Olympic City | |
Lash et al. | Gilles Deleuze and Film Criticism: Philosophy, Theory, and the Individual Film |
Legal Events
Code | Title
---|---
PB01 | Publication
SE01 | Entry into force of request for substantive examination
GR01 | Patent grant