CN109472232A

CN109472232A - Video semantic representation method, system and medium based on multimodal fusion mechanism

Info

Publication number: CN109472232A
Application number: CN201811289502.5A
Authority: CN
Inventors: 侯素娟; 车统统; 王海帅; 郑元杰; 王静; 贾伟宽; 史云峰
Original assignee: Shandong Normal University
Current assignee: Suzhou Wuyun Pen And Ink Education Technology Co ltd
Priority date: 2018-10-31
Filing date: 2018-10-31
Publication date: 2019-03-15
Anticipated expiration: 2038-10-31
Also published as: CN109472232B

Abstract

The present disclosure discloses a video semantic representation method, system and medium based on a multi-modal fusion mechanism. Feature extraction: extracts the visual features, voice features, motion features, text features and domain features of the video itself; Speech, motion, text features and domain features are fused through the constructed multi-level implicit Dirichlet distribution topic model; feature mapping: the fused features are mapped to a high-level semantic space, and the fused feature representation sequence is obtained . The model takes advantage of the unique advantages of topic models in the field of semantic analysis, and the video representations obtained by the training of the proposed model have ideal distinction in the semantic space.

Description

Video semantic representation method, system and medium based on multimodal fusion mechanism

技术领域technical field

本公开涉及基于多模态融合机制的视频语义表征方法、系统及介质。The present disclosure relates to a video semantic representation method, system and medium based on a multimodal fusion mechanism.

背景技术Background technique

随着网络时代数据量呈爆炸式增长，加速了媒体大数据时代的到来。其中，视频作为多媒体信息的重要载体，与人们的生活息息相关。海量数据的演化不仅要求对数据的处理方式产生极大的变革，同时也对视频的存储、处理和应用带来很大的挑战。一个亟需解决的问题是如何对数据进行有效的组织和管理。随着数据源源不断地产生，由于硬件条件的限制，使得数据只能被分段或分时进行存储，这不可避免地会造成不同程度的信息缺失。因此为视频提供一种简洁高效的数据表征方法，对视频分析和提高数据管理效率是有意义的。With the explosive growth of the amount of data in the Internet age, the arrival of the era of media big data has been accelerated. Among them, video, as an important carrier of multimedia information, is closely related to people's lives. The evolution of massive data not only requires great changes in data processing methods, but also brings great challenges to video storage, processing and application. An urgent problem to be solved is how to organize and manage data effectively. With the continuous generation of data, due to the limitation of hardware conditions, data can only be stored in segments or time-sharing, which will inevitably result in different degrees of information loss. Therefore, providing a concise and efficient data representation method for video is meaningful for video analysis and improving data management efficiency.

视频数据具有如下特点：1)在数据形式上，视频数据具有多模复杂的结构，它是一种未完全结构化的数据流。每个视频都是由一系列的图像帧沿时间轴分布的流式结构，在时空多维空间上表现出视觉和运动等多种特性，同时在时间跨度上又融入了音频特性。其表现力强、信息量大，所蕴含的内容具有丰富性、海量性、非结构化等特征。视频中蕴含的这种多模特性给视频表征带来了巨大挑战；2)在内容构成上，视频又具有很强的逻辑性。它是由一系列的逻辑单元组成，蕴含丰富的语义信息，通过连续的若干帧就可以刻画出发生在特定时空环境下的事件，来表达特定的语义内容。视频内容的多样性以及对视频内容理解的差异性和模糊性，使得对表征视频数据的特征提取变得困难，进而使得基于语义信息的视频理解更具挑战性。Video data has the following characteristics: 1) In terms of data form, video data has a multi-modal and complex structure, which is an incompletely structured data stream. Each video is a streaming structure composed of a series of image frames distributed along the time axis, showing various characteristics such as vision and motion in the multi-dimensional space of space and time, and at the same time incorporating audio characteristics in the time span. It has strong expressiveness and large amount of information, and the content it contains is rich, massive, and unstructured. The multimodality contained in videos brings great challenges to video representation; 2) In terms of content composition, videos have strong logic. It is composed of a series of logical units and contains rich semantic information. Through several consecutive frames, events that occur in a specific time and space environment can be depicted to express specific semantic content. The diversity of video content and the difference and ambiguity in understanding video content make it difficult to extract features that characterize video data, which in turn makes video understanding based on semantic information more challenging.

传统的数据表征方法，如基于视觉的视频特征学习方法，可以得到视频的简洁表征，然而要想合理构建良好的特征，需要一定的经验和专业领域特征。深度学习方法的运用使视觉任务取得显著进展，但仍存在“语义鸿沟”和“多模异构鸿沟”等问题。目前，通过采用多模态融合技术建立对视频的有效表征，是跨越“多模异构”鸿沟的有效途径。理解视频，最自然的方式就是基于视频中的多模态信息，利用人思维中的高层概念将视频的内容表达出来，这也是跨越“语义鸿沟”的最佳途径。然而，对于特定领域的视频分析，需要综合运用相应的领域特征和现有的多模融合技术挖掘有效的表征模式来完成特定的任务。尽管计算机技术不断发展，如何让计算机准确地理解视频中的语义概念仍是个难题。Traditional data representation methods, such as vision-based video feature learning methods, can obtain concise representations of videos. However, in order to reasonably construct good features, certain experience and professional domain characteristics are required. The application of deep learning methods has made significant progress in vision tasks, but there are still problems such as "semantic gap" and "multimodal heterogeneity gap". At present, establishing an effective representation of video by adopting multimodal fusion technology is an effective way to cross the "multimodal heterogeneity" gap. The most natural way to understand video is to express the content of the video based on the multimodal information in the video and use the high-level concepts in human thinking, which is also the best way to cross the "semantic gap". However, for video analysis in a specific domain, it is necessary to comprehensively use the corresponding domain features and existing multimodal fusion techniques to mine effective representation patterns to complete specific tasks. Despite the continuous development of computer technology, how to make computers accurately understand the semantic concepts in videos is still a difficult problem.

发明内容SUMMARY OF THE INVENTION

为了解决现有技术的不足，本公开提供了基于多模态融合机制的视频语义表征方法、系统及介质，它是一个可扩展的通用表征模型，在模型训练和整体优化过程中，不仅对于单模态信息的个数是可扩展的，而且可将任意类型的视频所蕴含的领域特征融入到模型中。模型充分考虑了各模态之间的关系，并将多模态交互过程融入到对整个模型的联合训练和整体优化过程中。模型利用主题模型在语义分析领域的独特优势，在其基础上提出的模型所训练获得的视频表征方式在语义空间具有较理想的区分性。In order to solve the deficiencies of the prior art, the present disclosure provides a video semantic representation method, system and medium based on a multimodal fusion mechanism, which is an extensible general representation model. In the process of model training and overall optimization, not only for single The number of modal information is scalable, and domain features contained in any type of video can be incorporated into the model. The model fully considers the relationship between the modalities, and integrates the multi-modal interaction process into the joint training and overall optimization process of the entire model. The model takes advantage of the unique advantages of topic models in the field of semantic analysis, and the video representations obtained by the training of the proposed model have ideal distinction in the semantic space.

为了解决上述技术问题，本公开采用如下技术方案：In order to solve the above-mentioned technical problems, the present disclosure adopts the following technical solutions:

作为本公开的第一方面，提供了基于多模态融合机制的视频语义表征方法；As a first aspect of the present disclosure, a video semantic representation method based on a multimodal fusion mechanism is provided;

基于多模态融合机制的视频语义表征方法，包括：Video semantic representation methods based on multimodal fusion mechanism, including:

特征提取：提取视频自身的视觉特征、语音特征、运动特征、文本特征和领域特征；Feature extraction: extract the visual features, voice features, motion features, text features and domain features of the video itself;

特征融合：将提取的视觉、语音、运动、文本特征和领域特征，通过构建的多层次的隐含狄利克雷分布主题模型进行特征融合；Feature fusion: The extracted visual, speech, motion, text features and domain features are used for feature fusion through the constructed multi-level implicit Dirichlet distribution topic model;

特征映射：将融合后的特征映射到一个高层语义空间，得到融合后的特征表示序列。Feature mapping: Map the fused features to a high-level semantic space to obtain the fused feature representation sequence.

作为一些可能的实现方式，提取视频的视觉特征的具体步骤为：As some possible implementations, the specific steps for extracting visual features of a video are:

预处理：视频分割，即将视频分割为若干个镜头；各镜头内的图像帧按照时间顺序组成图像帧序列；Preprocessing: video segmentation, that is, dividing the video into several shots; the image frames in each shot form a sequence of image frames in chronological order;

步骤(a1)：建立深度学习神经网络模型；Step (a1): establish a deep learning neural network model;

所述深度学习神经网络模型，包括：依次连接的输入层、第一卷积层C1、第一池化层S2、第二卷积层C3、第二池化层S4、第三卷积层C5、全连接层F6和输出层；The deep learning neural network model includes: an input layer, a first convolution layer C1, a first pooling layer S2, a second convolution layer C3, a second pooling layer S4, and a third convolution layer C5 connected in sequence , the fully connected layer F6 and the output layer;

步骤(a2)：将视频各镜头的图像帧序列输入到深度学习神经网络模型的输入层，输入层将图像帧传递给第一卷积层C1；Step (a2): input the image frame sequence of each shot of the video into the input layer of the deep learning neural network model, and the input layer transfers the image frame to the first convolution layer C1;

第一卷积层C1，用一组可训练的卷积核对视频的图像帧序列中的每一帧图像进行卷积，将卷积得到的各层特征图求均值得到平均特征图，再将得到的平均特征图连同偏置输入激活函数，输出一组特征映射图；The first convolution layer C1 uses a set of trainable convolution kernels to convolve each frame of the image frame sequence of the video, and averages the feature maps of each layer obtained by the convolution to obtain the average feature map, and then obtains the average feature map. The average feature map of , together with the biased input activation function, outputs a set of feature maps;

第一池化层S2，对第一卷积层C1得到的特征映射图的每个像素点的像素值进行重叠池化操作，减小第一卷积层输出特征映射图矩阵的长和宽；然后将运算结果传递给第二卷积层C3；The first pooling layer S2 performs an overlapping pooling operation on the pixel value of each pixel of the feature map obtained by the first convolution layer C1 to reduce the length and width of the output feature map matrix of the first convolution layer; Then pass the operation result to the second convolutional layer C3;

第二卷积层C3，用于对第一池化层S2的运算结果进行卷积操作；第二卷积层C3的卷积核的个数是第一卷积层C1的卷积核个数的二倍；The second convolution layer C3 is used to perform a convolution operation on the operation result of the first pooling layer S2; the number of convolution kernels of the second convolution layer C3 is the number of convolution kernels of the first convolution layer C1 twice;

第二池化层S4，用于对第二卷积层C3输出的特征映射图进行重叠池化操作来降低特征图矩阵的大小；The second pooling layer S4 is used to perform an overlapping pooling operation on the feature map output by the second convolution layer C3 to reduce the size of the feature map matrix;

第三卷积层C5，采用与第二池化层S4相同尺寸大小的卷积核对第二池化层S4的结果进行卷积操作，最终得到若干个1×1像素的特征图；The third convolution layer C5 uses a convolution kernel of the same size as the second pooling layer S4 to perform convolution operations on the results of the second pooling layer S4, and finally obtains several 1×1 pixel feature maps;

全连接层F6，将本层每个神经元与第三卷积层C5中的各个神经元进行全连接，将第三卷积层C5得到的结果表示为特征向量；The fully connected layer F6 fully connects each neuron in this layer with each neuron in the third convolution layer C5, and expresses the result obtained by the third convolution layer C5 as a feature vector;

输出层，用于将全连接层F6输出的特征向量输入到分类器中进行分类，计算分类准确率，当分类准确率低于设定阈值时，通过反向传播调整参数，重复执行步骤(a2)，直至分类准确率高于设定阈值；当分类准确率高于设定阈值时，高于设定阈值的分类准确率所对应的特征向量作为最终视频视觉特征的学习结果。The output layer is used to input the feature vector output by the fully connected layer F6 into the classifier for classification, and calculate the classification accuracy. When the classification accuracy is lower than the set threshold, adjust the parameters through back propagation, and repeat step (a2 ) until the classification accuracy is higher than the set threshold; when the classification accuracy is higher than the set threshold, the feature vector corresponding to the classification accuracy higher than the set threshold is used as the learning result of the final video visual feature.

作为一些可能的实现方式，提取视频的语音特征的具体步骤为：As some possible implementations, the specific steps for extracting the voice features of the video are:

提取视频中的语音信号，将音频数据转换成声谱图，以声谱图作为深度学习神经网络模型输入，然后通过深度学习神经网络模型对音频信息进行无监督学习，并通过全连接层，得到视频语音特征的向量表示。Extract the speech signal in the video, convert the audio data into spectrogram, use the spectrogram as the input of the deep learning neural network model, and then perform unsupervised learning on the audio information through the deep learning neural network model, and through the fully connected layer, get Vector representation of video speech features.

作为一些可能的实现方式，提取视频的运动特征的具体步骤为：As some possible implementations, the specific steps for extracting the motion features of the video are:

提取视频中的光流场，并对光流方向进行加权统计，得到光流方向信息直方图特征(HOF Histogram of Oriented Optical Flow)，作为运动特征的向量表示。The optical flow field in the video is extracted, and the optical flow direction is weighted and counted to obtain the histogram feature of the optical flow direction information (HOF Histogram of Oriented Optical Flow), which is used as the vector representation of the motion feature.

作为一些可能的实现方式，提取视频的文本特征的具体步骤为：As some possible implementations, the specific steps for extracting the text features of the video are as follows:

采集视频帧中的文字和视频的周边文本信息(如视频标题、标注等)，采用词袋模型从文本信息中提取文本特征。Collect the text in the video frame and the surrounding text information of the video (such as video title, annotation, etc.), and use the bag-of-words model to extract text features from the text information.

领域特征是指视频所属领域所设定的规则特征。例如，足球视频会根据足球比赛的规则和转播规范做一些场景规范(如前场、后场和禁区等)和事件定义(如射门、角球、任意球等)。新闻视频具有基本一致的时序结构和场景语义，即新闻镜头在播音员和新闻报道之间按时序切换。广告视频中通常包含所推广商品或服务相关联的logo信息。The domain feature refers to the rule feature set by the domain to which the video belongs. For example, a football video will do some scene specifications (such as frontcourt, backcourt and penalty area, etc.) and event definitions (such as shooting, corner kick, free kick, etc.) according to the rules and broadcast specifications of football games. News videos have basically the same temporal structure and scene semantics, that is, news footage switches chronologically between announcers and news reports. Advertising videos usually contain logo information associated with the advertised goods or services.

作为一些可能的实现方式，所述多模态特征融合的具体步骤为：As some possible implementations, the specific steps of the multimodal feature fusion are:

步骤(a1)：运用LDA隐含狄利克雷分布主题模型(LDA Latent DirichletAllocation)，将视频的视觉特征向量从视觉特征空间映射到语义特征空间Γ下；LDA隐含狄利克雷分布主题模型的输入为视频的视觉特征向量，LDA隐含狄利克雷分布主题模型输出为在特征空间Γ上的语义表征；Step (a1): Use LDA Latent Dirichlet Allocation to map the visual feature vector of the video from the visual feature space to the semantic feature space Γ; the input of the LDA latent Dirichlet distribution topic model is the visual feature vector of the video, and the output of the LDA implicit Dirichlet distribution topic model is the semantic representation on the feature space Γ;

步骤(a2)：运用LDA隐含狄利克雷分布主题模型，将视频的语音特征向量从语音特征空间映射到语义特征空间Γ下；LDA隐含狄利克雷分布主题模型的输入为视频的语音特征向量，输出为在特征空间Γ上的语义表征；Step (a2): Use the LDA implicit Dirichlet distribution topic model to map the speech feature vector of the video from the speech feature space to the semantic feature space Γ; the input of the LDA implicit Dirichlet distribution topic model is the speech feature of the video vector, the output is the semantic representation on the feature space Γ;

步骤(a3)：运用LDA隐含狄利克雷分布主题模型，将视频的光流方向信息直方图特征从运动特征空间映射到语义特征空间Γ下；LDA的输入为视频的光流方向信息直方图特征，输出为在特征空间Γ上的语义表征；Step (a3): Use the LDA implicit Dirichlet distribution topic model to map the optical flow direction information histogram feature of the video from the motion feature space to the semantic feature space Γ; the input of LDA is the video optical flow direction information histogram feature, the output is the semantic representation on the feature space Γ;

步骤(a4)：LDA隐含狄利克雷分布主题模型，将视频的文本特征从文本特征空间映射到语义特征空间Γ下；LDA的输入为视频的文本，输出为在特征空间Γ上的语义表征；Step (a4): LDA implicit Dirichlet distribution topic model, which maps the text features of the video from the text feature space to the semantic feature space Γ; the input of LDA is the text of the video, and the output is the semantic representation on the feature space Γ ;

步骤(a5)：将视频领域特征转换为先验知识Ω；Step (a5): Convert the video domain features into prior knowledge Ω;

步骤(a6)：运用LDA隐含狄利克雷分布主题模型，将步骤(a1)-步骤(a4)得到的各模态特征在语义特征空间Γ上的语义表征，设定各模态特征的权重，通过加权融合获得在模态融合后的视频表征。Step (a6): Using the LDA implicit Dirichlet distribution topic model, the semantic representation of each modal feature obtained from steps (a1) to (a4) on the semantic feature space Γ is used to set the weight of each modal feature. , the video representation after modal fusion is obtained by weighted fusion.

所述各模态特征的权重的求取过程如下：The calculation process of the weight of each modal feature is as follows:

步骤(a61)：选择一个话题分布θ|α～Dir(α)，其中α为Dirichlet先验分布的先验参数；Step (a61): Select a topic distribution θ|α～Dir(α), where α is the prior parameter of Dirichlet prior distribution;

步骤(a62)：对训练样本视频中每个词，选择一个顶层话题分配话题服从多项式分布；Step (a62): For each word in the training sample video, select a top-level topic assignment The topic obeys a multinomial distribution;

步骤(a63)：对于各模态特征权重ρ∈NV＝{NV个模态特征词典}，选择一个底层话题分配话题服从多项式分布；Step (a63): For each modal feature weight ρ∈NV={NV modal feature dictionaries}, select an underlying topic assignment The topic obeys a multinomial distribution;

步骤(a64)：在各模态特征权重ρ下，基于已选择的话题，结合领域知识Ω，从分布生成一个词。Step (a64): Under each modal feature weight ρ, based on the selected topic, combined with domain knowledge Ω, from the distribution generate a word.

对于单个视频d，给定α、β和ρ，话题θ和顶层话题在从NV个单模态空间联合映射到高层语义空间，话题θ和顶层话题联合分布概率p(θ,z^top,d|α,Ω,ρ,β)p(β)为：For a single video d, given α, β and ρ, topic θ and top topic In joint mapping from NV unimodal space to high-level semantic space, topic θ and top-level topic The joint distribution probability p(θ,z ^top ,d|α,Ω,ρ,β)p(β) is:

其中，参数θ和z^top为隐变量；通过求边缘分布将参数θ和z^top消除。Among them, the parameters θ and z ^top are hidden variables; the parameters θ and z ^top are eliminated by finding the marginal distribution.

其中，p(β^η)表示在η模态空间下词典元素之间的先验关系，where p(β ^η ) represents the prior relationship between dictionary elements in η modal space,

采用高斯-马尔科夫随机场先验模型，即：The Gauss-Markov random field prior model is used, namely:

其中，Π^j表示在η模态空间下具有先验关系的词的集合，σ_i为模型的平滑系数，用于调整先验模型；exp表示以自然常数e为底的指数函数；Among them, Π ^j represents the set of words with a priori relation in n modal space, σ _i is the smoothing coefficient of the model, used to adjust the prior model; exp represents the exponential function with the natural constant e as the base;

对于含有M个视频的视频语料库D来讲，其生成概率通过对M个视频的边缘概率连乘得到：For a video corpus D containing M videos, its generation probability is obtained by multiplying the edge probabilities of the M videos:

目标函数设定为D的似然函数，即 The objective function is set to the likelihood function of D, namely

D的似然函数最大化时，所对应的参数ρ就是各单模态特征对应的权重，log表示以a为底的对数，表示似然函数。When the likelihood function of D is maximized, the corresponding parameter ρ is the weight corresponding to each unimodal feature, log represents the logarithm with the base a, represents the likelihood function.

作为本公开的第二方面，提供了基于多模态融合机制的视频语义表征系统；As a second aspect of the present disclosure, a video semantic representation system based on a multimodal fusion mechanism is provided;

基于多模态融合机制的视频语义表征系统，包括：存储器、处理器以及存储在存储器上并在处理器上运行的计算机指令，所述计算机指令被处理器运行时，完成上述任一方法所述的步骤。A video semantic representation system based on a multi-modal fusion mechanism includes: a memory, a processor, and computer instructions stored in the memory and running on the processor, and when the computer instructions are run by the processor, complete any of the methods described above. A step of.

作为本公开的第三方面，提供了一种计算机可读存储介质；As a third aspect of the present disclosure, a computer-readable storage medium is provided;

一种计算机可读存储介质，其上存储有计算机指令，所述计算机指令被处理器运行时，完成上述任一方法所述的步骤。A computer-readable storage medium having computer instructions stored thereon, when the computer instructions are executed by a processor, the steps described in any of the above methods are completed.

与现有技术相比，本公开的有益效果是：Compared with the prior art, the beneficial effects of the present disclosure are:

(1)本公开重点研究一种基于多模态融合机制的视频语义表征方法，综合利用图像处理、模式识别和机器学习等相关领域中的算法来处理视频中的序列信息。将为不同领域的视频表征分析提供新的研究视角和理论借鉴。(1) This disclosure focuses on a video semantic representation method based on a multimodal fusion mechanism, which comprehensively utilizes algorithms in related fields such as image processing, pattern recognition, and machine learning to process sequence information in videos. It will provide new research perspectives and theoretical references for video representation analysis in different fields.

(2)将传统方法和深度学习方法结合，在语义层次研究对视频的有效表征，有效缩短视频理解中普遍存在的“多模鸿沟”和“语义鸿沟”。(2) Combining traditional methods and deep learning methods to study the effective representation of videos at the semantic level, effectively shortening the "multimodal gap" and "semantic gap" that are common in video understanding.

(3)提出基于自适应学习机制的深度视觉特征学习模型，自动学习机制的自适应性主要表现在两个方面：一是采用镜头检测技术使得深度模型的输入是一组可变长度的帧序列，帧的数目可以根据镜头的长度自适应调整；二是在C2池化层，根据特征图的尺度动态计算池化窗口的大小和步长，从而保证所有的镜头视频的数据表征维度一致。(3) Propose a deep visual feature learning model based on an adaptive learning mechanism. The adaptability of the automatic learning mechanism is mainly manifested in two aspects: First, the use of shot detection technology makes the input of the deep model a set of variable-length frame sequences , the number of frames can be adaptively adjusted according to the length of the shot; the second is in the C2 pooling layer, the size and step size of the pooling window are dynamically calculated according to the scale of the feature map, so as to ensure that the data representation dimensions of all shot videos are consistent.

(4)设计视频镜头自适应3D深度学习神经网络，对视频特征的视觉特征自动学习算法进行研究，提高了分类器的性能和优化了整个系统的参数，以一种最有效的方式来表征视频的视觉信息。(4) Design video lens adaptive 3D deep learning neural network, study the visual feature automatic learning algorithm of video features, improve the performance of the classifier and optimize the parameters of the whole system, and characterize the video in the most effective way. visual information.

(5)提出一种多模态多层次的主题表征模型，该模型主要的特色有以下三个方面：一是它是一个可扩展的通用表征模型，单模信息的个数是可扩展的，也可将任意类型的视频所蕴含的领域特征融入到模型中，提高了视频表征能力的针对性；二是模型充分考虑了各模态之间的关系，将多模态交互过程融入到对整个模型的联合训练和整体优化中；三是利用主题模型在语义分析领域的独特的优势，所训练获得的视频表征方式在语义空间中有较理想的区分性，能有效获得视频的简洁表征。(5) A multi-modal and multi-level topic representation model is proposed. The main features of this model are as follows: First, it is an extensible general representation model, and the number of single-modal information is extensible. The domain features contained in any type of video can also be incorporated into the model, which improves the pertinence of video representation capabilities; second, the model fully considers the relationship between various modalities, and integrates the multi-modal interaction process into the whole process. In the joint training and overall optimization of the model; the third is to use the unique advantages of the topic model in the field of semantic analysis, the video representation obtained by training has an ideal distinction in the semantic space, and can effectively obtain the concise representation of the video.

附图说明Description of drawings

构成本申请的一部分的说明书附图用来提供对本申请的进一步理解，本申请的示意性实施例及其说明用于解释本申请，并不构成对本申请的不当限定。The accompanying drawings that form a part of the present application are used to provide further understanding of the present application, and the schematic embodiments and descriptions of the present application are used to explain the present application and do not constitute improper limitations on the present application.

图1视频镜头自适应3D深度学习架构；Figure 1. Video lens adaptive 3D deep learning architecture;

图2自适应3D卷积过程；Figure 2 Adaptive 3D convolution process;

图3使用卷积核进行卷积计算的过程；Figure 3 The process of using convolution kernel for convolution calculation;

图4视频多模态融合机制的总体框架；Figure 4. The overall framework of the video multimodal fusion mechanism;

图5多模态多层次的主题生成模型。Figure 5. Multimodal and multi-level topic generation model.

具体实施方式Detailed ways

应该指出，以下详细说明都是例示性的，旨在对本申请提供进一步的说明。除非另有指明，本文使用的所有技术和科学术语具有与本申请所属技术领域的普通技术人员通常理解的相同含义。It should be noted that the following detailed description is exemplary and intended to provide further explanation of the application. Unless otherwise defined, all technical and scientific terms used herein have the same meaning as commonly understood by one of ordinary skill in the art to which this application belongs.

需要注意的是，这里所使用的术语仅是为了描述具体实施方式，而非意图限制根据本申请的示例性实施方式。如在这里所使用的，除非上下文另外明确指出，否则单数形式也意图包括复数形式，此外，还应当理解的是，当在本说明书中使用术语“包含”和/或“包括”时，其指明存在特征、步骤、操作、器件、组件和/或它们的组合。It should be noted that the terminology used herein is for the purpose of describing specific embodiments only, and is not intended to limit the exemplary embodiments according to the present application. As used herein, unless the context clearly dictates otherwise, the singular is intended to include the plural as well, furthermore, it is to be understood that when the terms "comprising" and/or "including" are used in this specification, it indicates that There are features, steps, operations, devices, components and/or combinations thereof.

本公开首先提出一种自适应帧选取机制的时空特征学习模型，以获得视频的视觉特征。然后在此基础上，进一步提出能够结合领域特征，将视觉特征和其他模态特征有效融合的模型，实现对视频的语义表征。The present disclosure first proposes a spatiotemporal feature learning model of an adaptive frame selection mechanism to obtain visual features of a video. Then, on this basis, we further propose a model that can effectively integrate visual features and other modal features in combination with domain features to achieve semantic representation of videos.

为实现上述目的，本公开所述的视频表征模型结合传统方法和深度学习方法，综合运用传统的特征选取技术、深度学习机制和主题模型理论的优势对视频的多模态融合机制展开研究，并进一步在语义层次研究对视频的有效表征。In order to achieve the above purpose, the video representation model described in this disclosure combines traditional methods and deep learning methods, and comprehensively uses the advantages of traditional feature selection technology, deep learning mechanism and topic model theory to conduct research on the multi-modal fusion mechanism of video. Efficient representation of videos is further investigated at the semantic level.

具体的研究技术方案如下：The specific research technical plans are as follows:

首先对视频的时空域信息表征学习机制进行深入分析，在保证时空信息连续性和完整性的基础上获得对视频视觉信息的有效表征；接着，对多模态信息的融合机制进行研究，同时把视频的领域特征融入到视频的多模态信息融合过程中，最终建立一套领域视频的语义表征模型。Firstly, an in-depth analysis of the learning mechanism of video spatiotemporal information representation is carried out, and an effective representation of video visual information is obtained on the basis of ensuring the continuity and integrity of spatiotemporal information; then, the fusion mechanism of multimodal information is studied, and the The domain features of the video are integrated into the multi-modal information fusion process of the video, and finally a set of semantic representation models of the domain video are established.

(1)视频特征的时空域深度特征自动学习(1) Automatic learning of spatiotemporal deep features of video features

设计一种具有较强的数据拟合能力和学习能力的视频镜头特征学习模型，能够充分发挥层层特征抽取的优势。运用镜头检测技术，以镜头长度作为自适应的学习单元，采用层层抽取的方式挖掘视频镜头中包含的时空序列信息。因此设计了视频镜头自适应的3D深度学习网络模型(见图1)。Designing a video shot feature learning model with strong data fitting ability and learning ability can give full play to the advantages of layer-by-layer feature extraction. Using shot detection technology, the length of shot is used as an adaptive learning unit, and the spatiotemporal sequence information contained in the video shot is mined by layer-by-layer extraction. Therefore, a 3D deep learning network model for video lens adaptation is designed (see Figure 1).

过程如下：The process is as follows:

步骤1：运用视频镜头检测技术对视频进行镜头分割。Step 1: Use video shot detection technology to segment the video.

步骤2：以镜头的一组视频帧作为模型的输入，信息再依次传递到不同层，每层通过一组滤波器来获得观测数据在不同类别上的最显著特征信息。Step 2: Take a set of video frames of the shot as the input of the model, and then pass the information to different layers in turn, and each layer passes a set of filters to obtain the most salient feature information of the observation data in different categories.

步骤3:最终所有的镜头帧的像素值被光栅化，并连接成一个向量。Step 3: Finally all the pixel values of the shot frames are rasterized and concatenated into a vector.

上述自适应3D卷积过程体现在C1卷积层，图2给出对某个包含长度为L的镜头帧进行卷积的过程，即以L个帧序列作为输入，通过一组可学习的滤波器分别对不同帧的相应位置进行卷积，再将得到的各个神经元进行融合平均，最后通过一个激活函数输出一组特征映射图。对于视频帧来讲，我们认为帧内部的空间联系是局部的，因此将每个神经元设置成仅感知局部区域。The above adaptive 3D convolution process is embodied in the C1 convolution layer. Figure 2 shows the process of convolving a shot frame containing a length of L, that is, taking L frame sequences as input, through a set of learnable filters The device convolves the corresponding positions of different frames, then fuses and averages the obtained neurons, and finally outputs a set of feature maps through an activation function. For video frames, we consider the spatial connections within the frame to be local, so each neuron is set to perceive only local regions.

卷积过程中，同一平面层的神经元权值采用权值共享的方式，图3给出利用卷积核进行非线性转化的过程。During the convolution process, the weights of neurons in the same plane layer are shared by weights. Figure 3 shows the process of nonlinear transformation using convolution kernels.

图3中W＝(w₁,w₂,…,w_K)表示卷积核在该卷积层的权重，它是一组可学习的参数；a＝(a₁,a₂,…,a_L)为L个帧中对应位置的局部感受野，其中a_i＝(a_i1,a_i2,…,a_iK)。在用某个卷积核对图像执行卷积运算时，参与卷积运算的是输入图像的某个区域，这个区域的大小就是感受野。In Figure 3, W=(w ₁ ,w ₂ ,...,w _K ) represents the weight of the convolution kernel in the convolution layer, which is a set of learnable parameters; a=(a ₁ ,a ₂ ,...,a _L ) is the local receptive field of the corresponding position in L frames, where a _i =(a _i1 ,a _i2 ,...,a _iK ). When performing a convolution operation on an image with a certain convolution kernel, a certain area of the input image is involved in the convolution operation, and the size of this area is the receptive field.

在S2池化层，首先通过对C1卷积层得到的特征映射图的像素单元进行加权计算，然后运行非线性函数将运算结果向下一层继续传递。对于池化操作，实施过程中采用重叠池化，即设置的步长小于池化窗口大小。In the S2 pooling layer, the pixel unit of the feature map obtained by the C1 convolution layer is firstly weighted and calculated, and then the non-linear function is run to pass the operation result to the next layer. For the pooling operation, overlapping pooling is used in the implementation process, that is, the set step size is smaller than the pooling window size.

鉴于不同数据来源的视频帧的大小有差异，在C1卷积操作后，得到的特征图大小可能不一致，这会造成全连接层每个视频镜头的特征维度差异，进而导致所产生的数据表征不一致的问题。本公开中采取的策略是根据特征图的尺度动态计算池化窗口的大小和步长，从而保证对所有视频镜头表征向量在维度上的一致性。In view of the differences in the size of video frames from different data sources, after the C1 convolution operation, the size of the obtained feature maps may be inconsistent, which will cause the feature dimension of each video shot in the fully connected layer to be different, which will lead to inconsistent data representation. The problem. The strategy adopted in this disclosure is to dynamically calculate the size and step size of the pooling window according to the scale of the feature map, thereby ensuring the dimensional consistency of the representation vectors for all video shots.

在C3卷积层，设计个数2倍于C1卷积层的卷积核作用于S2层，目的是随着空间分辨率在各层之间递减，能够检测到更多的特征信息。S4池化层采用与S2类似的操作，目的是通过下采样技术使运算结果继续向下一层进行传递。在C5层，采用与S4相同尺寸大小的卷积核对S4层得到的特征图进行卷积运算，最终得到一系列大小为1×1的特征图。F6为全连接图，通过将每个神经元与C5中的全部神经元相连，实现将输入最终表示为一定长度的特征向量的表征目的。接着将训练得到的特征送入分类器，进一步提高分类器的性能来优化整个系统的参数，以一种最有效的方式来表征视频的视觉信息。In the C3 convolution layer, the convolution kernels with twice the number of the C1 convolution layer are designed to act on the S2 layer, in order to detect more feature information as the spatial resolution decreases between layers. The S4 pooling layer adopts a similar operation to that of S2, and the purpose is to make the operation result continue to be passed to the next layer through the downsampling technique. In the C5 layer, the convolution kernel of the same size as the S4 layer is used to perform the convolution operation on the feature map obtained by the S4 layer, and finally a series of feature maps with a size of 1 × 1 are obtained. F6 is a fully connected graph. By connecting each neuron with all neurons in C5, the representation purpose of finally expressing the input as a feature vector of a certain length is achieved. Then, the features obtained by training are sent to the classifier to further improve the performance of the classifier to optimize the parameters of the whole system, and to represent the visual information of the video in the most effective way.

(2)视频表征的多模态信息融合(2) Multimodal Information Fusion for Video Representation

在对视频进行多模态特征融合的预处理阶段，需要对视频各模态特征进行分别提取和表征。一般来讲，视频的特征有两大类：一类是通用特征，包括：In the preprocessing stage of multi-modal feature fusion of the video, it is necessary to extract and characterize the modal features of the video separately. Generally speaking, there are two categories of video features: one is general features, including:

1)含时间序列的视觉特征，包含时间维和空间维信息；1) Visual features with time series, including time dimension and space dimension information;

2)文本特征，包括视频帧中的文字和视频周边文本，采用词袋模型将文本信息转化为可建模的数字描述；2) Text features, including the text in the video frame and the surrounding text of the video, using the bag-of-words model to convert the text information into a modelable digital description;

3)运动特征，首先提取视频中的光流信息，然后采用光流直方图HOF(Histogramof Oriented Optical Flow)进行描述；3) Motion features, first extract the optical flow information in the video, and then use the optical flow histogram HOF (Histogram of Oriented Optical Flow) to describe;

4)音频特征，首先将视频中的音频信息转换成声谱图，再以声谱图作为输入，通过微调现有的网络模型进行无监督学习，得到对语音信息的向量表示。4) Audio features, first convert the audio information in the video into a spectrogram, and then use the spectrogram as an input to perform unsupervised learning by fine-tuning the existing network model to obtain a vector representation of the speech information.

另一类是领域特征，这与视频种类和具体的应用领域相关。The other category is domain features, which are related to video types and specific application domains.

对视频中各模态特征处理的过程不是各类特征的简单组合，而是几种不同模态特征的交互和融合。本公开以主题模型中流行的潜在狄利克雷分布为切入点，综合机器学习、图像处理和语音识别等学科理论，对上述视频各模态信息进行融合。通过构建多层次的主题模型将视频数据从各模态空间和领域特征有机映射到一个高层空间，得到在该高层空间上的视频层次的表征序列。图4给出多模态融合机制的总体框架。The process of processing the modal features in the video is not a simple combination of various features, but the interaction and fusion of several different modal features. The present disclosure takes the popular latent Dirichlet distribution in the subject model as an entry point, and integrates subject theories such as machine learning, image processing, and speech recognition to fuse the above-mentioned video modal information. By constructing a multi-level topic model, the video data is organically mapped from each modal space and domain features to a high-level space, and the video-level representation sequence on the high-level space is obtained. Figure 4 presents the overall framework of the multimodal fusion mechanism.

对于以上所述(2)，视频表征的多模态信息融合需要对视频各模态特征分别提取。然后构建多层次的主题模型进行多特征融合。过程如下：For the above (2), the multi-modal information fusion of video representation needs to extract each modal feature of the video separately. Then a multi-level topic model is constructed for multi-feature fusion. The process is as follows:

(1)对视频各模态特征分别提取，即分别提取视频的视觉信息、语音信息、运动信息、文本信息。(1) Extract each modal feature of the video separately, that is, extract the visual information, voice information, motion information, and text information of the video respectively.

对于视觉信息，以镜头(一组视频帧)为单位作为模型的输入，通过微调现有的网络模型如AlexNet，GoogleNet，对视觉信息进行无监督学习，最终，所有的镜头帧的像素值被光栅化，连接成一个向量；对于语音信息，将音频数据转换成声谱图，以声谱图信息作为模型的输入，然后通过微调现有的网络模型对音频信息进行无监督学习，并通过全连接层，得到一个向量；对于运动信息，拟首先提取视频中的光流信息，然后采用光流直方图HOF(Histogram of Oriented Optical Flow)进行表征；对于文本信息，包括视频帧中的文字和视频周边文本，拟采用词袋模型将文本信息转化为可建模的数字描述。For visual information, the unit of shot (a set of video frames) is used as the input of the model. By fine-tuning existing network models such as AlexNet, GoogleNet, unsupervised learning of visual information is performed. Finally, the pixel values of all shot frames are rasterized. For speech information, convert the audio data into a spectrogram, use the spectrogram information as the input of the model, and then perform unsupervised learning on the audio information by fine-tuning the existing network model, and through the full connection layer to obtain a vector; for motion information, it is proposed to first extract the optical flow information in the video, and then use the optical flow histogram HOF (Histogram of Oriented Optical Flow) to characterize; for text information, including the text in the video frame and the video surrounding Text, a bag-of-words model is proposed to convert textual information into a modelable digital description.

(2)构建多层次的主题模型实现多特征融合(2) Build a multi-level topic model to achieve multi-feature fusion

通过构建多层次的主题模型(图5)将视频数据从各模态特征空间和领域特征有机映射到一个高层语义空间，实现多模态融合。具体实施过程如下：By constructing a multi-level topic model (Fig. 5), the video data is organically mapped from each modal feature space and domain features to a high-level semantic space to achieve multi-modal fusion. The specific implementation process is as follows:

模型假设语料库中含有M个视频，记为D＝{d₁,d₂,…,d_M}，每个视频d_i(1≤i≤M)包含了一组潜在的主题信息，我们认为这些主题信息是由各模态空间下的词典元素在一定的先验条件下，按照一定的分布有机映射到一个高层次语义空间中产生的。模型以视频作为处理单元，共涉及了两种层次的主题模型，实现以领域特征作为先验知识的多模态信息融合，最终得到向量形式的话题表征。模型共包含两个层次的主题，分别以Z^top和Z^low表示，Z^top表示视频融合后的主题，Z^low表示融合前的主题，前者是由后者按照以ρ为参数的多项式分布组成，参数Ω对应视频的领域特征。模型认为不同模态空间下词之间呈独立同分布。图特征词典的构造采用K-means聚类技术，以词袋模型的方式进行构建。The model assumes that the corpus contains M videos, denoted as D={d ₁ ,d ₂ ,...,d _M }, and each video d _i (1≤i≤M) contains a set of potential topic information, we consider these Topic information is generated by the organic mapping of dictionary elements in each modal space to a high-level semantic space according to a certain distribution under certain prior conditions. The model uses video as the processing unit, involving a total of two levels of topic models, realizing multi-modal information fusion with domain features as prior knowledge, and finally obtaining topic representations in the form of vectors. The model contains two levels of topics, represented by Z ^top and Z ^low respectively, Z ^top represents the topic after video fusion, Z ^low represents the topic before fusion, the former is composed of the latter according to the polynomial distribution with ρ as the parameter, The parameter Ω corresponds to the domain feature of the video. The model considers that words in different modal spaces are independent and identically distributed. The construction of the graph feature dictionary adopts the K-means clustering technology and constructs it in the way of a bag of words model.

模型中参数θ服从以α为先验参数的Dirichlet分布，表示当前处理的视频所蕴含的分布。参数NV为多模态的数量，β表示不同模态空间下的词典。模型通过解决参数ρ实现了在多模态空间向语义空间转换时不同模态的权重设置问题。The parameter θ in the model obeys the Dirichlet distribution with α as the prior parameter, which represents the distribution contained in the currently processed video. The parameter NV is the number of multimodalities, and β represents dictionaries in different modal spaces. The model realizes the problem of weight setting of different modalities when converting from multimodal space to semantic space by solving the parameter ρ.

每个视频语料的生成过程如下表所示：The generation process of each video corpus is shown in the following table:

上述生成过程：The above generation process:

第一步：选择一个话题分布θ|α～Dir(α)，其中α为Dirichlet先验分布的先验参数；Step 1: Select a topic distribution θ|α～Dir(α), where α is the prior parameter of Dirichlet prior distribution;

第二步：对视频中每个词，选择一个顶层话题分配话题服从多项式分布；Step 2: For each word in the video, select a top-level topic assignment The topic obeys a multinomial distribution;

第三步：对于NV个模态空间，在模态空间p下，选择一个底层话题分配话题服从多项式分布；Step 3: For NV modal spaces, under modal space p, select an underlying topic assignment The topic obeys a multinomial distribution;

第四步：根据已选择的话题，从分布生成一个词。Step 4: According to the selected topic, from the distribution generate a word.

对于单个视频d，在给定参数α,β,ρ时，结合领域知识Ω，模型中话题θ，顶层话题在从NV个模态空间联合映射到高层空间时，它们的联合分布概率为：For a single video d, given parameters α, β, ρ, combine domain knowledge Ω, topic θ in the model, top topic When jointly mapping from the NV modal space to the high-level space, their joint distribution probability is:

其中参数θ和z^top为隐变量。可以通过求边缘分布将其消除。The parameters θ and z ^top are hidden variables. It can be eliminated by finding the marginal distribution.

上述p(β^η)表示在η模态空间下词典元素之间的先验关系，The above p(β ^η ) represents the prior relationship between dictionary elements in η modal space,

采用典型的高斯-马尔科夫随机场先验模型，即：A typical Gauss-Markov random field prior model is used, namely:

上式中Π^j表示在η模态空间下具有先验关系的词的集合，σ_i为模型的平滑系数，用于调整先验模型。In the above formula, ^Πj represents the set of words with a priori relationship in the _n -modal space, and σi is the smoothing coefficient of the model, which is used to adjust the priori model.

对于含有M个视频的视频语料库D来讲，其似然值可以通过对M个视频的边缘概率连乘得到:For a video corpus D containing M videos, its likelihood value can be obtained by multiplying the edge probabilities of the M videos:

期望找到合适的参数α,ρ,β能够最大化语料集的下述似然函数，即目标函数表示成：It is expected that finding suitable parameters α, ρ, β can maximize the following likelihood function of the corpus, that is, the objective function is expressed as:

通过解决上述模型，可以实现多模态特征和视频领域特征的有机融合，最终得到对视频的语义表征。By solving the above model, the organic fusion of multimodal features and video domain features can be achieved, and finally the semantic representation of the video can be obtained.

以上过程的总体思路如图4所示。总结以上过程，本公开提出的多模态多层次主题模型相对于现有的方法具有以下特色：The general idea of the above process is shown in Figure 4. Summarizing the above process, the multi-modal and multi-level topic model proposed by the present disclosure has the following characteristics compared with the existing methods:

1)它是一个可扩展的通用表征模型，在模型训练和整体优化过程中，不仅对于单模态信息的个数是可扩展的，而且可将任意类型的视频所蕴含的领域特征融入到模型中，提高了视频表征能力的针对性；1) It is an extensible general representation model. In the process of model training and overall optimization, it is not only extensible for the number of single-modal information, but also can incorporate the domain features contained in any type of video into the model. , which improves the pertinence of video representation capabilities;

2)模型充分考虑了各模态之间的关系，并将多模态交互过程融入到对整个模型的联合训练和整体优化过程中；2) The model fully considers the relationship between the modalities, and integrates the multi-modal interaction process into the joint training and overall optimization process of the entire model;

3)主题模型本身在语义分析领域有其独特的优势，在此基础上提出的模型所训练获得的视频表征方式在语义空间中具有较理想的区分性，这也是获得视频简洁表征的有效方式之一。3) The topic model itself has its unique advantages in the field of semantic analysis. On this basis, the video representation obtained by the model training has an ideal distinction in the semantic space, which is also one of the effective ways to obtain concise representation of video. one.

以上所述仅为本申请的优选实施例而已，并不用于限制本申请，对于本领域的技术人员来说，本申请可以有各种更改和变化。凡在本申请的精神和原则之内，所作的任何修改、等同替换、改进等，均应包含在本申请的保护范围之内。The above descriptions are only preferred embodiments of the present application, and are not intended to limit the present application. For those skilled in the art, the present application may have various modifications and changes. Any modification, equivalent replacement, improvement, etc. made within the spirit and principle of this application shall be included within the protection scope of this application.

Claims

1. The video semantic representation method based on the multi-mode fusion mechanism is characterized by comprising the following steps:

feature extraction: extracting visual features, voice features, motion features, text features and field features of the video;

feature fusion: performing feature fusion on the extracted visual, voice, motion and text features and the field features through a constructed multilayer hidden Dirichlet distributed topic model;

characteristic mapping: and mapping the fused features to a high-level semantic space to obtain a fused feature representation sequence.

2. The method for video semantic representation based on the multi-modal fusion mechanism as claimed in claim 1, wherein the specific steps for extracting the visual features of the video are as follows:

pretreatment: video segmentation, namely segmenting a video into a plurality of shots; the image frames in each lens form an image frame sequence according to the time sequence;

step (a 1): establishing a deep learning neural network model;

the deep learning neural network model comprises: the input layer, the first scrolling layer C1, the first pooling layer S2, the second scrolling layer C3, the second pooling layer S4, the third scrolling layer C5, the full-connection layer F6 and the output layer are connected in sequence;

step (a 2): inputting the image frame sequence of each lens of the video into an input layer of the deep learning neural network model, and transmitting the image frames to a first convolution layer C1 by the input layer;

the first convolution layer C1 is used for performing convolution on each frame image in the image frame sequence of the video by using a group of trainable convolution cores, averaging all layers of feature maps obtained by the convolution to obtain an average feature map, and outputting a group of feature mapping maps by using the obtained average feature map and a bias input activation function;

the first pooling layer S2 is used for performing overlapping pooling operation on the pixel value of each pixel point of the feature map obtained by the first convolution layer C1, so that the length and the width of the first convolution layer output feature map matrix are reduced; then the operation result is transmitted to a second convolution layer C3;

a second convolution layer C3 for performing convolution operation on the operation result of the first pooling layer S2; the number of convolution kernels of the second convolution layer C3 is twice the number of convolution kernels of the first convolution layer C1;

a second pooling layer S4 for performing an overlap pooling operation on the feature map output by the second convolutional layer C3 to reduce the size of the feature map matrix;

a third convolution layer C5, which performs convolution operation on the result of the second pooling layer S4 by using a convolution kernel with the same size as the second pooling layer S4 to finally obtain a plurality of characteristic graphs of 1 × 1 pixels;

a full-connection layer F6 for fully connecting each neuron of the layer with each neuron in the third convolutional layer C5, and expressing the result obtained by the third convolutional layer C5 as a feature vector;

the output layer is used for inputting the feature vectors output by the full connection layer F6 into a classifier for classification, calculating the classification accuracy, and when the classification accuracy is lower than a set threshold, adjusting parameters through back propagation to repeatedly execute the step (a2) until the classification accuracy is higher than the set threshold; and when the classification accuracy is higher than the set threshold, the feature vector corresponding to the classification accuracy higher than the set threshold is used as the final learning result of the video visual features.

3. The method for video semantic representation based on the multi-modal fusion mechanism as claimed in claim 1, wherein the specific steps for extracting the voice features of the video are as follows:

extracting voice signals in the video, converting audio data into a spectrogram, inputting the spectrogram serving as a deep learning neural network model, performing unsupervised learning on audio information through the deep learning neural network model, and obtaining vector representation of video voice characteristics through a full connection layer.

4. The method for semantic representation of video based on multi-modal fusion mechanism as claimed in claim 1, wherein the specific steps for extracting the motion features of the video are as follows:

and extracting an optical flow field in the video, and carrying out weighted statistics on the optical flow direction to obtain optical flow direction information histogram features as vector representation of the motion features.

5. The method for video semantic representation based on the multi-modal fusion mechanism as claimed in claim 1, wherein the specific steps for extracting the text features of the video are as follows:

the method comprises the steps of collecting characters in a video frame and peripheral text information of a video, and extracting text features from the text information by adopting a word bag model.

6. The method as claimed in claim 1, wherein the domain feature is a rule feature set by a domain to which the video belongs.

7. The method for video semantic representation based on multi-modal fusion mechanism as claimed in claim 1, wherein,

the specific steps of the multi-modal feature fusion are as follows:

step (a 1): mapping visual feature vectors of the video from a visual feature space to a semantic feature space gamma by using an LDA hidden Dirichlet distributed topic model; the input of the LDA hidden Dirichlet distribution topic model is a visual feature vector of the video, and the output of the LDA hidden Dirichlet distribution topic model is a semantic representation on a feature space gamma;

step (a 2): mapping the voice feature vector of the video from a voice feature space to a semantic feature space gamma by using an LDA hidden Dirichlet distributed topic model; the input of an LDA hidden Dirichlet distributed topic model is a voice feature vector of a video, and the output is a semantic representation on a feature space gamma;

step (a 3): mapping the optical flow direction information histogram feature of the video from a motion feature space to a semantic feature space gamma by using an LDA hidden Dirichlet distribution topic model; the input of the LDA is the optical flow direction information histogram feature of the video, and the output is the semantic representation on the feature space gamma;

step (a 4): the LDA implies a Dirichlet distribution topic model, and the text features of the video are mapped to a semantic feature space gamma from a text feature space; the input of the LDA is a text of a video, and the output is a semantic representation on a feature space gamma;

step (a 5): converting the video domain features into a priori knowledge omega;

step (a 6): and (3) by using an LDA hidden Dirichlet distributed topic model, setting the weight of each modal feature on the semantic feature space gamma of each modal feature obtained in the steps (a1) and (a4), and obtaining the video representation after modal fusion through weighted fusion.

8. The video semantic representation system based on the multi-mode fusion mechanism is characterized by comprising the following steps: a memory, a processor, and computer instructions stored on the memory and executed on the processor, the computer instructions, when executed by the processor, performing the steps of the method of any of claims 1-7.

9. A computer-readable storage medium having stored thereon computer instructions which, when executed by a processor, perform the steps of the method of any one of claims 1 to 7.