CN108200483A - Dynamic multi-modal video description generation method - Google Patents
- Publication number
- CN108200483A CN108200483A CN201711433810.6A CN201711433810A CN108200483A CN 108200483 A CN108200483 A CN 108200483A CN 201711433810 A CN201711433810 A CN 201711433810A CN 108200483 A CN108200483 A CN 108200483A
- Authority
- CN
- China
- Prior art keywords
- hearing
- sense
- video presentation
- modal
- features
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Granted
Classifications
-
- H—ELECTRICITY
- H04—ELECTRIC COMMUNICATION TECHNIQUE
- H04N—PICTORIAL COMMUNICATION, e.g. TELEVISION
- H04N21/00—Selective content distribution, e.g. interactive television or video on demand [VOD]
- H04N21/80—Generation or processing of content or additional data by content creator independently of the distribution process; Content per se
- H04N21/83—Generation or processing of protective or descriptive data associated with content; Content structuring
- H04N21/84—Generation or processing of descriptive data, e.g. content descriptors
-
- H—ELECTRICITY
- H04—ELECTRIC COMMUNICATION TECHNIQUE
- H04N—PICTORIAL COMMUNICATION, e.g. TELEVISION
- H04N21/00—Selective content distribution, e.g. interactive television or video on demand [VOD]
- H04N21/40—Client devices specifically adapted for the reception of or interaction with content, e.g. set-top-box [STB]; Operations thereof
- H04N21/43—Processing of content or additional data, e.g. demultiplexing additional data from a digital video stream; Elementary client operations, e.g. monitoring of home network or synchronising decoder's clock; Client middleware
- H04N21/44—Processing of video elementary streams, e.g. splicing a video clip retrieved from local storage with an incoming video stream, rendering scenes according to MPEG-4 scene graphs
- H04N21/44008—Processing of video elementary streams, e.g. splicing a video clip retrieved from local storage with an incoming video stream, rendering scenes according to MPEG-4 scene graphs involving operations for analysing video streams, e.g. detecting features or characteristics in the video stream
-
- H—ELECTRICITY
- H04—ELECTRIC COMMUNICATION TECHNIQUE
- H04N—PICTORIAL COMMUNICATION, e.g. TELEVISION
- H04N21/00—Selective content distribution, e.g. interactive television or video on demand [VOD]
- H04N21/40—Client devices specifically adapted for the reception of or interaction with content, e.g. set-top-box [STB]; Operations thereof
- H04N21/45—Management operations performed by the client for facilitating the reception of or the interaction with the content or administrating data related to the end-user or to the client device itself, e.g. learning user preferences for recommending movies, resolving scheduling conflicts
- H04N21/466—Learning process for intelligent management, e.g. learning user preferences for recommending movies
- H04N21/4662—Learning process for intelligent management, e.g. learning user preferences for recommending movies characterized by learning algorithms
- H04N21/4666—Learning process for intelligent management, e.g. learning user preferences for recommending movies characterized by learning algorithms using neural networks, e.g. processing the feedback provided by the user
Abstract
The invention belongs to the field of video description and specifically relates to a dynamic multi-modal video description generation method. It aims to capture the resonance information between the visual and auditory modalities so as to generate better video descriptions, and additionally to handle the case in which the audio modality of a video is damaged or missing. The multi-modal video description generation system proposed by the invention models the temporal dependence between the visual and auditory modalities and captures their resonance information by sharing the weights of the LSTM internal memory units, or by sharing an external memory unit, during the feature-encoding stage of the two modalities. In addition, based on an auditory inference system, the invention infers the corresponding audio-modality information from the known visual-modality information. The invention can generate video descriptions quickly and effectively.
Description
Technical field
The invention belongs to the field of video description and specifically relates to a dynamic multi-modal video description generation method.
Background art
A video description generation system automatically generates a corresponding natural-language description for a given video. Video description generation is inspired by image description generation: both iteratively generate a description based on the features of a given image or of video frames.
Traditional video description generation methods can be roughly divided into three classes:
The first class is template-based methods. They first determine the semantic concepts contained in the video, then infer the sentence structure from predefined sentence templates, and finally use a probabilistic graphical model to collect the most relevant content from which a sentence description can be generated. Although this approach can generate grammatically correct sentences, the sentences lack richness and flexibility.
The second class treats video description as a retrieval problem. The video is first annotated with metadata, and sentence descriptions and videos are then classified according to the corresponding annotations. Compared with the first class, the sentences generated in this way are more natural; however, they are largely limited by the metadata.
The third class of methods maps the video representation directly onto a specific sentence. Venugopalan et al. first extract features from all frames of the video with a convolutional neural network (CNN) and apply mean pooling to obtain a fixed-length video representation; an LSTM decoder then generates the corresponding video description from this representation. Although this method obtains a video description easily, it ignores the temporal structure implicit in the video. To exploit the role of temporal structure in video description generation, Venugopalan et al. explored encoding the video-frame features with an LSTM to obtain a fixed-length video representation. Yao et al., on the other hand, explored the effect of the local and global temporal structure implicit in the video on description generation: the local temporal structure can be encoded by a spatio-temporal convolutional neural network, and the global temporal structure by a temporal attention mechanism. To further improve the performance of video description generation, Ballas et al. studied the effect of intermediate-layer video representations, and Pan et al. proposed a hierarchical recurrent neural network decoder to explore the role of temporal structure at different granularities. To generate more detailed video descriptions, Haonan Yu et al. proposed using a hierarchical recurrent neural network to generate paragraph descriptions of videos. Their hierarchical model contains a sentence generator and a paragraph generator: the sentence generator produces a simple sentence description for a specific short interval of the video, and the paragraph generator, by capturing the dependence between the sentences produced by the sentence generator, encodes them into a paragraph description of the video.
The video description generation models above share a common property: their video representations are based only on visual information. To make full use of the information contained in a video, many researchers have proposed fusing the multi-modal information in the video to obtain better descriptions. Vasili Ramanishka et al. directly concatenate the corresponding visual and auditory features to obtain a multi-modal video representation. Qin Jin et al. use a multi-layer feed-forward neural network to fuse the audio-visual information. Although these methods can improve the performance of video description to some degree, they still suffer from the following problems. For example, directly concatenating audio-visual features can lead to aliasing of information and a decline in performance. In addition, the model proposed by Qin Jin et al. obtains the video features by pooling all frame features or clip features, ignoring the temporal dependence between the (visual, auditory) features.
The multi-modal video description generation system proposed by the present invention models the temporal dependence between the visual and auditory modalities, captures their resonance information, and thereby generates better video descriptions, by sharing the weights of the LSTM internal memory units, or by sharing an external memory unit, during the feature-encoding stage of the two modalities. In addition, owing to factors such as environmental influence and sensor interference, the audio modality of a video may in some cases be damaged or missing. So that the performance of the video description generation system remains essentially unaffected in such cases, the present invention proposes an auditory inference system that infers the corresponding audio-modality information from the known visual-modality information; the completed audio-visual modal information is then fed into the multi-modal video description generation system to generate the video description.
Summary of the invention
To solve the above problems, in particular the problem that a video description cannot be accurately generated when the audio modality is damaged or missing, the dynamic multi-modal video description generation method proposed by the present invention comprises the following steps:
Step S1: extract the corresponding visual CNN features and auditory MFCC features from the video, and judge whether the auditory MFCC features are damaged or missing; if so, perform step S2, otherwise perform step S3;
Step S2: infer the complete auditory MFCC features from the visual CNN features with an auditory inference model based on an encoding-decoding process;
Step S3: using the visual CNN features and auditory MFCC features, and based on the temporal dependence between the visual and auditory modalities, encode and fuse the interaction of the two modalities with a multi-modal encoder to obtain fused features, and generate the video description by iteratively decoding the fused features with a decoder.
Preferably, the auditory inference model generates the auditory MFCC features as follows:
encode the visual CNN features with an encoder to obtain a high-level semantic representation;
decode the corresponding auditory MFCC features with a decoder.
Preferably, the multi-modal encoder is a multi-modal LSTM encoder based on shared weights or a multi-modal memory-unit encoder based on a shared memory unit.
Preferably, the multi-modal LSTM encoder based on shared weights contains two LSTM neural networks, used respectively to encode the visual CNN features and the auditory MFCC features; the weights are shared between the internal memory units of the two LSTM neural networks.
Preferably, the multi-modal LSTM encoder based on shared weights is modeled as follows:

i_t^s = σ(W_i^s · x_t^s + U_i · h_{t-1}^s + b_i^s)
f_t^s = σ(W_f^s · x_t^s + U_f · h_{t-1}^s + b_f^s)
o_t^s = σ(W_o^s · x_t^s + U_o · h_{t-1}^s + b_o^s)
c̃_t^s = tanh(W_c^s · x_t^s + U_c · h_{t-1}^s + b_c^s)
c_t^s = f_t^s ⊙ c_{t-1}^s + i_t^s ⊙ c̃_t^s
h_t^s = o_t^s ⊙ tanh(c_t^s)

where i_t, f_t, o_t and c̃_t are the input gate, forget gate, output gate and memory-unit candidate respectively; the superscript s is the index of the modality, s=0 denoting the LSTM-based auditory-information encoder and s=1 the LSTM-based visual-information encoder; x^0 is the auditory MFCC feature and x^1 the visual CNN feature; W, U and b are the weight matrices and biases of the respective terms, the U matrices being the weights shared in the hidden-layer units by the two LSTM-based audio-visual encoders; σ is the sigmoid function; h_t and h_{t-1} are the hidden states of the LSTM at times t and t-1; W_i, W_f, W_o and W_c are the weights of the input gate, forget gate, output gate and memory unit with respect to the input x; x_{t-1} is the input at time t-1; U_i, U_f, U_o and U_c are the weights of the input gate, forget gate, output gate and memory unit with respect to the hidden state h; b_i, b_f, b_o and b_c are the corresponding bias terms; and c_t and c_{t-1} are the values of the memory unit at times t and t-1.
Preferably, the multi-modal memory-unit encoder based on a shared memory unit contains two LSTM neural networks, used respectively to encode the visual CNN features and the auditory MFCC features; the internal memory units of the two LSTM neural networks are updated with information through an external memory unit.
Preferably, the method by which "the internal memory units of the two LSTM neural networks are updated with information through an external memory unit" is:
read information from the external memory unit;
fuse the information read from the external memory unit with the internal memory units of the two LSTM neural networks respectively, and update the memory units of the two LSTM neural networks.
Preferably, the method of "extracting the corresponding visual CNN features and auditory MFCC features from the video" is:
extract the visual CNN features from the video frames in the video with a convolutional neural network;
extract the audio MFCC features from the audio segments corresponding to the video frames with a convolutional neural network.
Preferably, the multi-modal encoder based on the shared external memory unit realizes the interactive fusion of the visual CNN features and the auditory MFCC features through read and write operations, with the following specific steps:
Step S11: read information from the memory unit of the multi-modal encoder;
Step S12: fuse the information read from the memory unit with the visual CNN features and auditory MFCC features obtained from the multi-modal encoder's encoding, respectively, to obtain fused information;
Step S13: store the fused information into the memory unit of the multi-modal encoder.
Preferably, the visual CNN features and auditory MFCC features do not share weights before being input into the multi-modal encoder.
From the above technical solution it can be seen that the present invention proposes a dynamic multi-modal video description generation model, providing a quick and effective method for video description. Compared with the prior art, the present invention has the following advantages:
(1) The input information of the invention comes from both the visual and the auditory modality; compared with a traditional description generation system based only on visual information, it can achieve higher performance.
(2) The invention fuses the audio-visual information by modeling the temporal dependence between the audio-visual modalities. This temporal dependence can, to a certain extent, represent the resonance information between the modalities, i.e. the events that actually occur in the video. The invention can effectively model the resonance information in the video and can therefore generate ideal video descriptions.
(3) The invention can establish a unified video description system for videos with complete modal information and for videos with a missing audio modality. If the input video has both modalities complete, they are input directly into the multi-modal video description generation system; if the audio modality of the input video is damaged or missing, the auditory inference system generates the auditory features based on the visual features, and the two completed modalities are then input into the multi-modal video description generation system.
Description of the drawings
Fig. 1 is a flow diagram of the dynamic multi-modal video description generation method of an embodiment of the present invention;
Fig. 2 is a schematic diagram of the dynamic multi-modal video description generation model of an embodiment of the present invention;
Fig. 3 is a schematic diagram of the multi-modal audio-visual fusion encoding model based on shared weights of an embodiment of the present invention;
Fig. 4 is a schematic diagram of the multi-modal audio-visual fusion encoding model based on a shared external memory unit of an embodiment of the present invention;
Fig. 5 shows the video descriptions generated by each basic model and by the multi-modal fusion models of an embodiment of the present invention;
Fig. 6 is a schematic diagram of the auditory inference model of an embodiment of the present invention;
Fig. 7 shows the video descriptions generated by each basic model after supplementing the audio modality, for an embodiment of the present invention.
Specific embodiments
The preferred embodiments of the present invention are described below with reference to the accompanying drawings. It will be apparent to those skilled in the art that these embodiments are only used to explain the technical principles of the present invention and are not intended to limit the scope of the invention.
The specific steps of the dynamic multi-modal video description generation method provided by the invention, as shown in Fig. 1, are:
Step S1: extract the corresponding visual CNN features and auditory MFCC features from the video, and judge whether the auditory MFCC features are damaged or missing; if so, perform step S2, otherwise perform step S3;
Step S2: infer the complete auditory MFCC features from the visual CNN features with an auditory inference model based on an encoding-decoding process;
Step S3: using the visual CNN features and auditory MFCC features, and based on the temporal dependence between the visual and auditory modalities, encode and fuse the interaction of the two modalities with a multi-modal encoder to obtain fused features, and generate the video description by iteratively decoding the fused features with a decoder.
The dynamic multi-modal video description generation method proposed by the present invention realizes the video description function jointly through an audio-visual multi-modal video description generation system and an auditory inference model, where the preferred encoder of the audio-visual multi-modal video description generation system is a multi-modal LSTM encoder.
Fig. 2 shows a schematic diagram of the audio-visual multi-modal video description generation system of the present invention. The specific steps of this embodiment are:
Step S11: if the modalities of the input video are complete, they are input directly into the multi-modal video description generation system; if the auditory MFCC features of the input video are damaged or missing, the auditory inference system generates the auditory MFCC features from the known visual CNN features, and the two completed modalities are then input into the multi-modal video description generation system;
Step S12: extract visual CNN features from the video frames in the video with a convolutional neural network, while extracting auditory MFCC features from the audio segments corresponding to the video frames. The visual CNN features and auditory MFCC features of the video are then input into the multi-modal encoder respectively for encoding, and finally the text decoder iteratively generates the video description based on the feature representation provided by the encoder.
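The pairing of video frames with their corresponding audio snippets in step S12 can be sketched as follows. This is an illustrative alignment only: the 16 kHz sample rate, 25 fps frame rate, and the helper name `frame_audio_pairs` are assumptions made here, and in the actual system the paired frame and snippet would each be passed through a CNN and an MFCC extractor respectively.

```python
import numpy as np

def frame_audio_pairs(num_frames, fps, audio, sample_rate):
    """Pair each sampled video frame with the audio segment that covers
    the same span of time, so that visual CNN features and auditory MFCC
    features can later be extracted for corresponding instants."""
    samples_per_frame = int(sample_rate / fps)
    pairs = []
    for f in range(num_frames):
        start = f * samples_per_frame
        segment = audio[start:start + samples_per_frame]
        pairs.append((f, segment))
    return pairs

audio = np.zeros(16000)  # one second of (silent) audio at an assumed 16 kHz
pairs = frame_audio_pairs(num_frames=25, fps=25, audio=audio, sample_rate=16000)
# at 25 fps each frame owns 640 audio samples
```

Each `(frame_index, audio_segment)` pair then feeds the visual and auditory feature extractors in lock-step, which is what lets the encoders model the temporal dependence between the two modalities.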
In this embodiment, step S11 is specifically: if the auditory features extracted by the audio-visual multi-modal video description generation system are lost or missing owing to the influence of the external environment or to electromagnetic interference, the auditory inference model encodes the visual features; after the high-level semantic representation is obtained, the decoder of the auditory inference model is used to decode the corresponding auditory MFCC features, which are then input into the multi-modal video description generation system.
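The dynamic dispatch of steps S1-S3 can be sketched as follows. This is a minimal illustration, not the patented implementation: the function names `infer_audio` and `describe`, the use of `None` to mark a missing audio modality, and the feature dimensionalities (2048-dimensional CNN features, 39-dimensional MFCC) are assumptions made here.

```python
import numpy as np

def infer_audio(visual_feats):
    # Stand-in for the auditory inference model: map visual CNN features
    # to auditory MFCC features; here a fixed random projection replaces
    # the learned encoder-decoder.
    rng = np.random.default_rng(0)
    proj = rng.standard_normal((visual_feats.shape[1], 39)) * 0.01
    return visual_feats @ proj

def describe(visual_feats, audio_feats):
    # Stand-in for the multi-modal encoder plus text decoder.
    return "a generated video description"

def generate_description(visual_feats, audio_feats=None):
    """Step S1: check whether the auditory features are damaged or missing;
    step S2: if so, infer them from the visual features;
    step S3: encode, fuse and decode to produce the video description."""
    if audio_feats is None or np.isnan(audio_feats).any():
        audio_feats = infer_audio(visual_feats)   # step S2
    return describe(visual_feats, audio_feats)    # step S3

frames = np.random.default_rng(1).standard_normal((20, 2048))  # 20 frames
print(generate_description(frames))  # audio missing -> inferred first
```

The same entry point serves both the complete-modality case and the damaged-audio case, which is the "unified system" claimed in advantage (3).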
The multi-modal encoder of the audio-visual multi-modal video description generation system is a multi-modal LSTM encoder based on shared weights or a multi-modal memory-unit encoder based on a shared memory unit. Taking the multi-modal LSTM encoder as an example, as shown in Fig. 3, in the multi-modal LSTM encoder based on shared weights the LSTM can model the temporal dependence in sequence data. The audio-visual sequences in a video contain temporal structure as well as a resonance relationship between the two modalities. To model this temporal dependence, the present invention encodes the features of the two audio-visual modalities with two LSTM neural networks respectively, and shares the weights between the internal memory units of the two LSTM neural networks; a multi-modal encoder designed in this way can capture the temporal resonance information in the co-occurring audio-visual modalities.
The modeling of the multi-modal LSTM encoder based on shared weights is shown in formulas (1)-(6):

i_t^s = σ(W_i^s · x_t^s + U_i · h_{t-1}^s + b_i^s)    (1)
f_t^s = σ(W_f^s · x_t^s + U_f · h_{t-1}^s + b_f^s)    (2)
o_t^s = σ(W_o^s · x_t^s + U_o · h_{t-1}^s + b_o^s)    (3)
c̃_t^s = tanh(W_c^s · x_t^s + U_c · h_{t-1}^s + b_c^s)    (4)
c_t^s = f_t^s ⊙ c_{t-1}^s + i_t^s ⊙ c̃_t^s    (5)
h_t^s = o_t^s ⊙ tanh(c_t^s)    (6)

where i_t, f_t, o_t and c̃_t are the input gate, forget gate, output gate and memory-unit candidate respectively; the superscript s is the index of the modality, s=0 denoting the LSTM-based auditory-information encoder and s=1 the LSTM-based visual-information encoder; x^0 is the auditory MFCC feature and x^1 the visual CNN feature; W, U and b are the weight matrices and biases of the respective terms, the U matrices being the weights shared in the hidden-layer units by the two LSTM-based audio-visual encoders; σ is the sigmoid function; h_t and h_{t-1} are the hidden states of the LSTM at times t and t-1; W_i, W_f, W_o and W_c are the weights of the input gate, forget gate, output gate and memory unit with respect to the input x; x_{t-1} is the input at time t-1; U_i, U_f, U_o and U_c are the weights of the input gate, forget gate, output gate and memory unit with respect to the hidden state h; b_i, b_f, b_o and b_c are the corresponding bias terms; and c_t and c_{t-1} are the values of the memory unit at times t and t-1.
The visual CNN features and auditory MFCC features do not share weights before being input into the multi-modal encoder. There are two reasons why the input audio-visual features do not share weights. First, each modality learns its own specific mapping function from its feature space to the hidden space. Second, the input dimensions of the visual and auditory features differ, and separate weight matrices handle the differing dimensions of the two modalities more conveniently.
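Under these design choices — modality-specific W and b, shared U — a single step of the shared-weight multi-modal LSTM encoder of formulas (1)-(6) can be sketched in NumPy as follows. The dimensionalities (39-dimensional MFCC, 2048-dimensional CNN features, 64 hidden units) are assumptions made for illustration.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

class SharedWeightLSTM:
    """Two LSTM encoders (s=0 auditory, s=1 visual) whose hidden-to-hidden
    matrices U are shared; W and b stay modality-specific because the
    input dimensions of the two modalities differ."""
    def __init__(self, dims, hidden, seed=0):
        rng = np.random.default_rng(seed)
        # modality-specific input weights: one (4*hidden, dim) block per modality
        self.W = [rng.standard_normal((4 * hidden, d)) * 0.1 for d in dims]
        self.b = [np.zeros(4 * hidden) for _ in dims]
        # shared hidden-to-hidden weights U (gate blocks stacked)
        self.U = rng.standard_normal((4 * hidden, hidden)) * 0.1
        self.hidden = hidden

    def step(self, s, x, h_prev, c_prev):
        n = self.hidden
        z = self.W[s] @ x + self.U @ h_prev + self.b[s]
        i = sigmoid(z[0*n:1*n])   # input gate,   formula (1)
        f = sigmoid(z[1*n:2*n])   # forget gate,  formula (2)
        o = sigmoid(z[2*n:3*n])   # output gate,  formula (3)
        g = np.tanh(z[3*n:4*n])   # candidate,    formula (4)
        c = f * c_prev + i * g    # memory unit,  formula (5)
        h = o * np.tanh(c)        # hidden state, formula (6)
        return h, c

enc = SharedWeightLSTM(dims=[39, 2048], hidden=64)  # MFCC dim 39, CNN dim 2048
h = c = np.zeros(64)
h_a, c_a = enc.step(0, np.ones(39), h, c)    # one auditory step
h_v, c_v = enc.step(1, np.ones(2048), h, c)  # one visual step
```

Because both modalities drive the same U, the temporal dynamics learned from one modality constrain the other, which is what lets the encoder capture the resonance between co-occurring audio and visual events.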
As shown in Fig. 4, in the multi-modal memory-unit encoder based on a shared memory unit: although the LSTM can model temporal dependence, its time span is short. To further explore the effect of long-range dependence on video description generation, the present invention proposes a multi-modal encoder based on a shared external memory unit. The multi-modal memory-unit encoder with a shared memory unit contains two LSTM neural networks, used respectively to encode the visual CNN features and the auditory MFCC features; the internal memory units of the two LSTM neural networks are updated with information through the external memory unit, with the following specific steps:
Step S21: read information from the external memory unit;
Step S22: fuse the information read from the external memory unit with the internal memory units of the two LSTM neural networks respectively, and update the memory units of the two LSTM neural networks.
The multi-modal memory-unit encoder with a shared memory unit of this embodiment realizes the interactive fusion of the visual CNN features and auditory MFCC features through read and write operations, with the following specific steps:
Step S31: read information from the memory unit of the multi-modal encoder;
Step S32: fuse the information read from the memory unit with the visual CNN features and auditory MFCC features obtained from the multi-modal encoder's encoding, respectively, to obtain fused information;
Step S33: store the fused information into the memory unit of the multi-modal encoder.
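Steps S31-S33 can be sketched as follows. The class name `SharedExternalMemory` and the simple averaging fusion are stand-ins invented here for the learned read/write operations, which the patent does not spell out; the sketch only shows how one external store mediates between the two encoders.

```python
import numpy as np

class SharedExternalMemory:
    """A single external memory shared by the visual and auditory LSTM
    encoders: each step reads the memory, fuses the read-out with one
    encoder's internal memory unit, and writes the fused state back."""
    def __init__(self, size):
        self.memory = np.zeros(size)

    def read(self):
        # step S31: read information from the memory unit
        return self.memory.copy()

    def fuse_and_update(self, internal_cell):
        # step S32: fuse the read-out with one modality's internal memory
        # (a plain average stands in for the learned fusion)
        fused = 0.5 * (self.read() + internal_cell)
        # step S33: store the fused information back into the memory
        self.memory = fused
        return fused

mem = SharedExternalMemory(size=64)
c_visual = np.ones(64)          # internal memory unit of the visual LSTM
c_audio = np.full(64, 3.0)      # internal memory unit of the auditory LSTM
c_visual = mem.fuse_and_update(c_visual)  # memory now holds 0.5 everywhere
c_audio = mem.fuse_and_update(c_audio)    # memory now holds 1.75 everywhere
```

Because every update passes through the one shared store, information from each modality persists across many time steps, which is the long-range dependence the external memory is meant to capture.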
To verify the performance of the multi-modal video description generation system of the present invention, as shown in Fig. 5, we compared the video descriptions generated by the following models:
Audio, which generates the video description using the auditory MFCC features alone;
Visual, which generates the video description using the visual CNN features alone;
V-Cat-A, which directly concatenates the visual CNN features and auditory MFCC features as the video features to generate the video description;
V-ShaMem-A, in which the visual CNN features and auditory MFCC features share an external memory unit to obtain the final video features for generating the video description; the goal of this model is to verify the effect of long-range temporal dependence between the audio-visual modalities on video description generation;
V-ShaWei-A, in which the visual CNN features and auditory MFCC features share the weights of the LSTM internal memory units to obtain the final video features for generating the video description; the goal of this model is to verify the effect of short-range temporal dependence between the audio-visual modalities on video description generation.
For the first video in Fig. 5, the sentence generated by the Visual model focuses too much on the visual information and ignores the auditory information, so it produces erroneous content ("to a man" vs. "news"); the V-Cat-A model generates an accurate object ("a man") and behavior ("talking") but loses content ("news"), because directly concatenating the audio-visual features leads to aliasing of information and thus to the loss of part of it; with the help of the auditory information, V-ShaMem-A and V-ShaWei-A can generate sentences close to the reference sentence, and the sentence generated by V-ShaWei-A is more accurate ("news" vs. "something"), because V-ShaMem-A pays more attention to long-range information, so the word it generates is the more abstract "something", whereas V-ShaWei-A pays more attention to the fine-grained events that are actually taking place.
For the second video in Fig. 5, all models can generate the relevant behavior ("swimming") and target ("in the water"). However, only the V-ShaWei-A model generates the more accurate object ("fish" vs. "man" and "person"), because the V-ShaWei-A model pays more attention to events to which the audio-visual modalities are sensitive, i.e. events produced by audio-visual resonance.
For the third video in Fig. 5, only the V-ShaWei-A model generates the more relevant behavior ("showing" vs. "playing"), showing that the V-ShaWei-A model can capture the essence of the behavior.
For the fourth video in Fig. 5, the V-Cat-A and V-ShaWei-A models can, with the help of sound cues, generate more relevant behaviors ("knocking on a wall", "using a phone"); the V-ShaMem-A model pays more attention to the global event and therefore gives the sentence ("lying on bed"); meanwhile, the Visual model pays more attention to the visual information and also produces the description ("lying on bed").
For the fifth video in Fig. 5, the events occurring in the video are more related to the visual information. Therefore the Visual, V-ShaMem-A and V-ShaWei-A models generate an accurate behavior ("dancing" vs. "playing", "singing"); moreover, the V-ShaMem-A and V-ShaWei-A models generate more accurate objects ("a group of" vs. "a girl", "a cartoon character" and "someone"), showing that temporal dependence is helpful for locating objects; meanwhile, the V-ShaWei-A model gives the most relevant object ("cartoon characters" vs. "people"), showing that short-range temporal dependence is more effective. The English sentences in Fig. 5 are the video descriptions output by the different models, given for the display and comparison of the technical effect.
The auditory inference model is based on an encoding-decoding process, as shown in Fig. 6. Specifically, the co-occurring audio-visual modalities have the same high-level semantic representation; based on this representation, the damaged or missing auditory MFCC features can be generated from the known visual CNN features. Specifically, in the present invention the video-frame features are first encoded with the encoder; after the high-level semantic representation is obtained, the decoder is used to decode the corresponding auditory MFCC features. The numbers 1024, 512 and 256 denote the number of neurons contained in each network layer of the encoder or decoder.
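An encoder-decoder of this shape, with 1024-, 512- and 256-neuron layers, can be sketched as follows. The mirrored decoder sizes, the ReLU activations, and the 39-dimensional MFCC output are assumptions made here for illustration and are not fixed by the text above.

```python
import numpy as np

def relu(z):
    return np.maximum(z, 0.0)

class AuditoryInference:
    """Encoder-decoder sketch of the auditory inference model: the encoder
    compresses visual CNN features through 1024-, 512- and 256-neuron
    layers into a high-level semantic code, and the decoder expands the
    code back out to auditory MFCC features."""
    def __init__(self, visual_dim=2048, mfcc_dim=39, seed=0):
        rng = np.random.default_rng(seed)
        sizes = [visual_dim, 1024, 512, 256, 512, 1024, mfcc_dim]
        # one (out, in) weight matrix per layer, He-style scaled
        self.layers = [rng.standard_normal((m, n)) * np.sqrt(2.0 / n)
                       for n, m in zip(sizes[:-1], sizes[1:])]

    def __call__(self, visual_feat):
        h = visual_feat
        for i, W in enumerate(self.layers):
            h = W @ h
            if i < len(self.layers) - 1:
                h = relu(h)  # hidden layers; the output layer stays linear
        return h

model = AuditoryInference()
mfcc = model(np.random.default_rng(1).standard_normal(2048))
# mfcc stands in for the inferred auditory MFCC features of one frame
```

In the full system this inferred MFCC vector would replace the damaged or missing auditory features before being fed into the multi-modal encoder.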
To verify the performance of the multi-modal video description generation system of the present invention, as shown in Fig. 7, we compared the video descriptions generated by the following models: GA, which generates the video description using the generated auditory MFCC features alone; Visual, which generates the video description using the visual CNN features alone; and V-ShaWei-GA, which generates the video description using the visual CNN features and the generated auditory MFCC features with shared weights.
For the first video in Fig. 7, the Visual model attends mainly to visual cues and therefore generates the erroneous content "piano", because the object behind the child looks very much like a piano and occupies too much space in the frame. V-ShaWei-GA captures the more accurate sound-producing object "violin", because it can model the resonance information between the audio-visual modalities; the GA model likewise generates the more relevant object description "violin".
For the second video in Fig. 7, V-ShaWei-GA generates a more accurate behavior description ("pouring sauce into a pot" vs. "cooking the something"), showing that V-ShaWei-GA can capture behaviors to which both audio and vision are sensitive, i.e. resonance information. The GA model also generates the accurate behavior description "pouring", showing that the generated audio modality is meaningful.
For the third video in Fig. 7, both V-ShaWei-GA and the GA model generate the relevant object description ("girl" vs. "man"). The English sentences in Fig. 7 are the video descriptions output by the different models, shown to demonstrate the technical effect and for comparison.
Those skilled in the art should recognize that the exemplary models, units and method steps described in connection with the embodiments disclosed herein can be implemented as electronic hardware, computer software, or a combination of the two. To clearly illustrate the interchangeability of electronic hardware and software, the composition and steps of each example have been described above generally in terms of their functions. Whether these functions are actually performed in electronic hardware or software depends on the particular application and design constraints of the technical solution. Those skilled in the art may use different methods to realize the described functions for each particular application, but such realizations should not be considered beyond the scope of the present invention.
Heretofore, the technical solutions of the present invention have been described with reference to the preferred embodiments shown in the drawings; however, those skilled in the art will readily appreciate that the protection scope of the present invention is obviously not limited to these specific embodiments. Without departing from the principles of the present invention, those skilled in the art may make equivalent changes or substitutions to the relevant technical features, and the technical solutions after such changes or substitutions all fall within the protection scope of the present invention.
Claims (10)
1. A dynamic multi-modal video description generation method, characterized by comprising the following steps:
Step S1: extracting the corresponding visual CNN features and auditory MFCC features from a video, and judging whether the auditory MFCC features are damaged or missing; if damaged or missing, performing Step S2, otherwise performing Step S3;
Step S2: inferring complete auditory MFCC features from the visual CNN features through an auditory inference model based on an encoding-decoding process;
Step S3: using the visual CNN features and the auditory MFCC features, based on the temporal dependency between the audio and visual modalities, performing encoding through a multi-modal encoder and interactive fusion of the two audio-visual modalities to obtain a fused feature, and generating the video description after the fused feature is iteratively decoded by a decoder.
2. The dynamic multi-modal video description generation method according to claim 1, characterized in that the method by which the auditory inference model generates the auditory MFCC features is:
encoding the visual CNN features using an encoder to obtain high-level semantics;
decoding the corresponding auditory MFCC features using a decoder.
3. The dynamic multi-modal video description generation method according to claim 1, characterized in that the multi-modal encoder is a multi-modal LSTM encoder based on shared weights, or a multi-modal memory-unit encoder based on a shared memory unit.
4. The dynamic multi-modal video description generation method according to claim 1, characterized in that the multi-modal LSTM encoder based on shared weights comprises two LSTM neural networks, used respectively to encode the visual CNN features and the auditory MFCC features; weights are shared between the internal memory units of the two LSTM neural networks.
5. The dynamic multi-modal video description generation method according to claim 4, characterized in that the modeling formulas of the multi-modal LSTM encoder based on shared weights are as follows:
$$i_t^s = \sigma(W_i x_t^s + U_i h_{t-1}^s + b_i)$$
$$f_t^s = \sigma(W_f x_t^s + U_f h_{t-1}^s + b_f)$$
$$o_t^s = \sigma(W_o x_t^s + U_o h_{t-1}^s + b_o)$$
$$\tilde{c}_t^s = \tanh(W_c x_t^s + U_c h_{t-1}^s + b_c)$$
$$c_t^s = f_t^s \odot c_{t-1}^s + i_t^s \odot \tilde{c}_t^s$$
$$h_t^s = o_t^s \odot \tanh(c_t^s)$$
wherein,
i_t, f_t, o_t and c̃_t are the input gate, forget gate, output gate and memory unit, respectively;
the superscript s is the index of the modality;
s = 0 denotes the LSTM-based auditory information encoder;
s = 1 denotes the LSTM-based visual information encoder;
x^0 is the auditory MFCC features;
x^1 is the visual CNN features;
W, U and b are the weight matrices of the respective terms, wherein U denotes the weights shared by the LSTM-based audio-visual encoders in their hidden-layer units;
σ is the sigmoid function;
i is the input gate of the LSTM;
h is the hidden state of the LSTM;
h_t and h_{t-1} are the hidden states of the LSTM at times t and t-1;
W_i, W_f, W_o and W_c are the weights of the input gate, forget gate, output gate and memory unit terms with respect to the input x;
x_{t-1} is the input at time t-1;
U_i, U_f, U_o and U_c are the weights of the input gate, forget gate, output gate and memory unit terms with respect to the hidden state h;
b_i, b_f, b_o and b_c are the bias terms of the input gate, forget gate, output gate and memory unit terms;
c_t and c_{t-1} are the values of the memory unit at times t and t-1.
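One step of a shared-weight encoder of this kind can be sketched numerically as below: each modality s keeps its own input weights W^s and biases b^s, while the hidden-to-hidden matrices U are shared between the two streams. All dimensions and the random placeholder weights are assumptions for illustration, not trained parameters.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

class SharedWeightLSTM:
    """Two LSTM streams (s=0 auditory MFCC, s=1 visual CNN) whose
    hidden-to-hidden matrices U are shared across modalities, as in
    claim 5.  Feature dims (39, 2048) and hidden size 4 are assumed."""

    def __init__(self, dims=(39, 2048), hid=4, seed=0):
        rng = np.random.default_rng(seed)
        # per-modality input weights W^s and biases b^s for gates i, f, o, c
        self.W = [{g: rng.standard_normal((d, hid)) * 0.1 for g in "ifoc"}
                  for d in dims]
        self.b = [{g: np.zeros(hid) for g in "ifoc"} for _ in dims]
        # hidden-to-hidden weights U, shared by both modality streams
        self.U = {g: rng.standard_normal((hid, hid)) * 0.1 for g in "ifoc"}

    def step(self, s, x, h_prev, c_prev):
        W, U, b = self.W[s], self.U, self.b[s]
        i = sigmoid(x @ W["i"] + h_prev @ U["i"] + b["i"])  # input gate
        f = sigmoid(x @ W["f"] + h_prev @ U["f"] + b["f"])  # forget gate
        o = sigmoid(x @ W["o"] + h_prev @ U["o"] + b["o"])  # output gate
        c_tilde = np.tanh(x @ W["c"] + h_prev @ U["c"] + b["c"])
        c = f * c_prev + i * c_tilde   # update the memory unit
        h = o * np.tanh(c)             # new hidden state
        return h, c

lstm = SharedWeightLSTM()
h0 = c0 = np.zeros(4)
h_a, c_a = lstm.step(0, np.ones(39), h0, c0)    # auditory stream
h_v, c_v = lstm.step(1, np.ones(2048), h0, c0)  # visual stream
print(h_a.shape, h_v.shape)
```

Because `self.U` is a single dictionary used by both calls, the hidden-state weights are literally the same arrays for the auditory and visual streams, which is the weight sharing the claim describes.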
6. The dynamic multi-modal video description generation method according to claim 4, characterized in that the multi-modal memory-unit encoder based on a shared memory unit comprises two LSTM neural networks, used respectively to encode the visual CNN features and the auditory MFCC features; the internal memory units of the two LSTM neural networks are updated with information through an external memory unit.
7. The dynamic multi-modal video description generation method according to claim 4, characterized in that the method by which "the internal memory units of the two LSTM neural networks are updated with information through an external memory unit" is:
reading information from the external memory unit;
fusing the information read from the external memory unit with the internal memory units of the two LSTM neural networks respectively, and updating the memory units of the two LSTM neural networks.
8. The dynamic multi-modal video description generation method according to any one of claims 1-7, characterized in that the method of "extracting the corresponding visual CNN features and auditory MFCC features from a video" is:
extracting visual CNN features from the video frames through a convolutional neural network;
extracting auditory MFCC features from the audio fragments corresponding to the video frames through a convolutional neural network.
9. The video description generation method according to claim 8, characterized in that the multi-modal encoder based on a shared external memory unit realizes the interactive fusion of the visual CNN features and the auditory MFCC features through read-write operations, the specific steps being:
Step S11: reading information from the memory unit of the multi-modal encoder;
Step S12: fusing the information read from the memory unit with the visual CNN features and the auditory MFCC features obtained by the encoding of the multi-modal encoder, respectively, to obtain fused information;
Step S13: storing the fused information into the memory unit of the multi-modal encoder.
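Steps S11-S13 can be sketched as one read-fuse-write cycle over a slot matrix; mean-pooled reading, averaging fusion and append-style writing are illustrative assumptions here, not the patent's exact operators.

```python
import numpy as np

def memory_read_write(mem, vis_feat, aud_feat):
    """Claim 9 sketch of steps S11-S13: read from the shared memory
    unit, fuse the read vector with each modality's encoded features,
    and store the fused information back into the memory."""
    read = mem.mean(axis=0)               # S11: read from the memory unit
    fused_v = (vis_feat + read) / 2.0     # S12: fuse read info with visual
    fused_a = (aud_feat + read) / 2.0     # S12: fuse read info with audio
    mem = np.vstack([mem, (fused_v + fused_a) / 2.0])  # S13: store fusion
    return mem, fused_v, fused_a

mem0 = np.zeros((2, 4))   # 2 memory slots of width 4
mem1, fv, fa = memory_read_write(mem0, np.ones(4), np.full(4, 3.0))
print(mem1.shape)  # (3, 4)
```

Writing the fused vector back means the next read already reflects both modalities, which is how the read-write cycle realizes their interactive fusion.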
10. The video description generation method according to claim 8, characterized in that the visual CNN features and the auditory MFCC features do not share weights before being input to the multi-modal encoder.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201711433810.6A CN108200483B (en) | 2017-12-26 | 2017-12-26 | Dynamic multi-modal video description generation method |
Publications (2)
Publication Number | Publication Date |
---|---|
CN108200483A true CN108200483A (en) | 2018-06-22 |
CN108200483B CN108200483B (en) | 2020-02-28 |
Family
ID=62584286
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN201711433810.6A Active CN108200483B (en) | 2017-12-26 | 2017-12-26 | Dynamic multi-modal video description generation method |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN108200483B (en) |
Citations (7)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN105005558A (en) * | 2015-08-14 | 2015-10-28 | 武汉大学 | Multi-modal data fusion method based on crowd sensing |
CN105279495A (en) * | 2015-10-23 | 2016-01-27 | 天津大学 | Video description method based on deep learning and text summarization |
CN106708890A (en) * | 2015-11-17 | 2017-05-24 | 创意引晴股份有限公司 | Intelligent high fault-tolerant video identification system based on multimoding fusion and identification method thereof |
CN107220591A (en) * | 2017-04-28 | 2017-09-29 | 哈尔滨工业大学深圳研究生院 | Multi-modal intelligent mood sensing system |
CN107256221A (en) * | 2017-04-26 | 2017-10-17 | 苏州大学 | Video presentation method based on multi-feature fusion |
CN107391646A (en) * | 2017-07-13 | 2017-11-24 | 清华大学 | A kind of Semantic features extraction method and device of video image |
CN107451552A (en) * | 2017-07-25 | 2017-12-08 | 北京联合大学 | A kind of gesture identification method based on 3D CNN and convolution LSTM |
Cited By (21)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN109190683A (en) * | 2018-08-14 | 2019-01-11 | 电子科技大学 | A kind of classification method based on attention mechanism and bimodal image |
CN112889108B (en) * | 2018-10-16 | 2022-08-16 | 谷歌有限责任公司 | Speech classification using audiovisual data |
CN112889108A (en) * | 2018-10-16 | 2021-06-01 | 谷歌有限责任公司 | Speech classification using audiovisual data |
CN111464881A (en) * | 2019-01-18 | 2020-07-28 | 复旦大学 | Full-convolution video description generation method based on self-optimization mechanism |
CN109885723A (en) * | 2019-02-20 | 2019-06-14 | 腾讯科技(深圳)有限公司 | A kind of generation method of video dynamic thumbnail, the method and device of model training |
CN109885723B (en) * | 2019-02-20 | 2023-10-13 | 腾讯科技(深圳)有限公司 | Method for generating video dynamic thumbnail, method and device for model training |
CN110110636A (en) * | 2019-04-28 | 2019-08-09 | 清华大学 | Video logic mining model and method based on multiple input single output coding/decoding model |
CN110110636B (en) * | 2019-04-28 | 2021-03-02 | 清华大学 | Video logic mining device and method based on multi-input single-output coding and decoding model |
CN110222227A (en) * | 2019-05-13 | 2019-09-10 | 西安交通大学 | A kind of Chinese folk song classification of countries method merging auditory perceptual feature and visual signature |
CN110826397A (en) * | 2019-09-20 | 2020-02-21 | 浙江大学 | Video description method based on high-order low-rank multi-modal attention mechanism |
CN110826397B (en) * | 2019-09-20 | 2022-07-26 | 浙江大学 | Video description method based on high-order low-rank multi-modal attention mechanism |
CN111309971A (en) * | 2020-01-19 | 2020-06-19 | 浙江工商大学 | Multi-level coding-based text-to-video cross-modal retrieval method |
CN111859005A (en) * | 2020-07-01 | 2020-10-30 | 江西理工大学 | Cross-layer multi-model feature fusion and image description method based on convolutional decoding |
CN111859005B (en) * | 2020-07-01 | 2022-03-29 | 江西理工大学 | Cross-layer multi-model feature fusion and image description method based on convolutional decoding |
CN112069361A (en) * | 2020-08-27 | 2020-12-11 | 新华智云科技有限公司 | Video description text generation method based on multi-mode fusion |
CN112287893A (en) * | 2020-11-25 | 2021-01-29 | 广东技术师范大学 | Sow lactation behavior identification method based on audio and video information fusion |
CN112287893B (en) * | 2020-11-25 | 2023-07-18 | 广东技术师范大学 | Sow lactation behavior identification method based on audio and video information fusion |
CN112331337A (en) * | 2021-01-04 | 2021-02-05 | 中国科学院自动化研究所 | Automatic depression detection method, device and equipment |
US11266338B1 (en) | 2021-01-04 | 2022-03-08 | Institute Of Automation, Chinese Academy Of Sciences | Automatic depression detection method and device, and equipment |
CN114581749B (en) * | 2022-05-09 | 2022-07-26 | 城云科技(中国)有限公司 | Audio-visual feature fusion target behavior identification method and device and application |
CN114581749A (en) * | 2022-05-09 | 2022-06-03 | 城云科技(中国)有限公司 | Audio-visual feature fusion target behavior identification method and device and application |
Also Published As
Publication number | Publication date |
---|---|
CN108200483B (en) | 2020-02-28 |
Similar Documents
Publication | Publication Date | Title |
---|---|---|
CN108200483A (en) | Dynamically multi-modal video presentation generation method | |
Park et al. | A metaverse: Taxonomy, components, applications, and open challenges | |
Wang et al. | T3: Tree-autoencoder constrained adversarial text generation for targeted attack | |
CN109219812A (en) | Spatial term in spoken dialogue system | |
Hossain et al. | Text to image synthesis for improved image captioning | |
CN111104512B (en) | Game comment processing method and related equipment | |
CN110234018A (en) | Multimedia content description generation method, training method, device, equipment and medium | |
Zhang et al. | Image captioning via semantic element embedding | |
Dong | The sociolinguistics of voice in globalising China | |
Ayers | The limits of transactional identity: Whiteness and embodiment in digital facial replacement | |
Studt | Virtual reality documentaries and the illusion of presence | |
CN107122393A (en) | Electron album generation method and device | |
Rastgoo et al. | A survey on recent advances in Sign Language Production | |
Rastgoo et al. | All You Need In Sign Language Production | |
CN109635303A (en) | The recognition methods of specific area metasemy word | |
Rose | Our Posthuman Past: Transhumanism, Posthumanism and Ethical Futures | |
Park et al. | OLKAVS: An Open Large-Scale Korean Audio-Visual Speech Dataset | |
Zajko | ‘What Difference Was Made?’: Feminist Models of Reception | |
Nam et al. | A survey on multimodal bidirectional machine learning translation of image and natural language processing | |
Zhao et al. | Research on video captioning based on multifeature fusion | |
Liu | Research on virtual interactive animation design system based on deep learning | |
Starc et al. | Constructing a Natural Language Inference dataset using generative neural networks | |
Wang | China in the Age of Global Capitalism: Jia Zhangke's Filmic World | |
Haugen | The construction of Beijing as an Olympic City | |
Lash et al. | Gilles Deleuze and Film Criticism: Philosophy, Theory, and the Individual Film |
Legal Events
Code | Title
---|---
PB01 | Publication
SE01 | Entry into force of request for substantive examination
GR01 | Patent grant