CN109472232A - Video semantic representation method, system and medium based on multi-modal fusion mechanism - Google Patents

Video semantic representation method, system and medium based on multi-modal fusion mechanism

Info

Publication number
CN109472232A
Authority
CN
China
Prior art keywords
video
feature
layer
features
convolution
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Granted
Application number
CN201811289502.5A
Other languages
Chinese (zh)
Other versions
CN109472232B (en)
Inventor
侯素娟
车统统
王海帅
郑元杰
王静
贾伟宽
史云峰
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Suzhou Wuyun Pen And Ink Education Technology Co ltd
Original Assignee
Shandong Normal University
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Shandong Normal University filed Critical Shandong Normal University
Priority to CN201811289502.5A priority Critical patent/CN109472232B/en
Publication of CN109472232A publication Critical patent/CN109472232A/en
Application granted granted Critical
Publication of CN109472232B publication Critical patent/CN109472232B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical

Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06VIMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V20/00Scenes; Scene-specific elements
    • G06V20/40Scenes; Scene-specific elements in video content
    • G06V20/41Higher-level, semantic clustering, classification or understanding of video scenes, e.g. detection, labelling or Markovian modelling of sport events or news items
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/02Neural networks
    • G06N3/04Architecture, e.g. interconnection topology
    • G06N3/045Combinations of networks
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/02Neural networks
    • G06N3/08Learning methods

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Computational Linguistics (AREA)
  • Software Systems (AREA)
  • General Physics & Mathematics (AREA)
  • General Health & Medical Sciences (AREA)
  • General Engineering & Computer Science (AREA)
  • Data Mining & Analysis (AREA)
  • Evolutionary Computation (AREA)
  • Biomedical Technology (AREA)
  • Molecular Biology (AREA)
  • Computing Systems (AREA)
  • Biophysics (AREA)
  • Artificial Intelligence (AREA)
  • Mathematical Physics (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Health & Medical Sciences (AREA)
  • Multimedia (AREA)
  • Image Analysis (AREA)

Abstract

The present disclosure provides a video semantic representation method, system and medium based on a multi-modal fusion mechanism. Feature extraction: visual features, voice features, motion features, text features and domain features of the video are extracted. Feature fusion: the extracted visual, voice, motion and text features, together with the domain features, are fused through a constructed multi-level latent Dirichlet allocation topic model. Feature mapping: the fused features are mapped to a high-level semantic space to obtain a fused feature representation sequence. The model exploits the unique advantages of topic models in semantic analysis, and the video representation obtained by training the proposed model on this basis has good discriminability in the semantic space.

Description

Video semantic representation method, system and medium based on multi-modal fusion mechanism
Technical Field
The disclosure relates to a video semantic representation method, system and medium based on a multi-modal fusion mechanism.
Background
With the explosive growth of data volume in the Internet era, the arrival of the big-data era for media has accelerated. Video is an important carrier of multimedia information and is closely related to people's daily life. The explosion of mass data not only demands great changes in the way data are processed, but also poses great challenges to the storage, processing and application of video. One problem that needs to be addressed is how to organize and manage the data efficiently. As data are continuously generated, hardware limitations mean that they can only be stored in segments or in time slices, which inevitably causes information loss to different degrees. Therefore, providing a simple and efficient data representation for video is significant for video analysis and for improving the efficiency of data management.
Video data have the following characteristics. 1) In data form, video data have a multi-modal complex structure and constitute an incompletely structured data stream. Each video is a streaming structure formed by a series of image frames distributed along a time axis; it exhibits visual, motion and other characteristics in a spatio-temporal multidimensional space while also integrating audio characteristics over the time span. Video is highly expressive and information-rich, and its content is rich, massive and unstructured. The multi-modal characteristics of video pose great challenges for video representation. 2) In content composition, video is strongly logical. It is composed of a series of logical units and contains rich semantic information; through a number of consecutive frames it can depict events occurring in a specific spatio-temporal environment and express specific semantic content. The diversity of video content and the diversity and ambiguity of video content understanding make it difficult to extract features that characterize video data, which makes video understanding based on semantic information even more challenging.
Traditional data characterization methods, such as vision-based video feature learning, can obtain concise representations of videos, but constructing good features requires experience and professional domain knowledge. The application of deep learning has brought remarkable progress to visual tasks, but problems such as the semantic gap and the multi-modal heterogeneous gap still exist. Establishing an effective representation of video with multi-modal fusion technology is an effective way to span the multi-modal heterogeneous gap. The most natural way to understand video is to express its content with the high-level concepts of human thinking based on the multi-modal information in the video, which is also the best way to cross the "semantic gap". However, for video analysis in a specific domain, the corresponding domain features and existing multi-modal fusion techniques need to be applied together to mine effective representations and complete the specific task. Despite the continuous development of computer technology, how to make a computer accurately understand the semantic concepts in videos remains a difficult problem.
Disclosure of Invention
In order to overcome the defects of the prior art, the present disclosure provides a video semantic representation method, system and medium based on a multi-modal fusion mechanism. The proposed model is an extensible general representation model: the number of single-modality information channels can be extended, and the domain features contained in any type of video can be fused into the model during model training and overall optimization. The model fully considers the relations among the modalities, and the multi-modal interaction process is integrated into the joint training and overall optimization of the whole model. The model exploits the unique advantages of topic models in semantic analysis, and the video representation obtained by training on this basis has good discriminability in the semantic space.
In order to solve the technical problem, the following technical scheme is adopted in the disclosure:
as a first aspect of the present disclosure, a video semantic representation method based on a multi-modal fusion mechanism is provided;
the video semantic representation method based on the multi-modal fusion mechanism comprises the following steps:
feature extraction: extracting visual features, voice features, motion features, text features and domain features of the video;
feature fusion: performing feature fusion on the extracted visual, voice, motion and text features, together with the domain features, through a constructed multi-level latent Dirichlet allocation (LDA) topic model;
feature mapping: mapping the fused features to a high-level semantic space to obtain a fused feature representation sequence.
As some possible implementations, the specific steps of extracting the visual features of the video are:
preprocessing: video segmentation, namely segmenting the video into a plurality of shots; the image frames in each shot form an image frame sequence in temporal order;
step (a 1): establishing a deep learning neural network model;
the deep learning neural network model comprises: an input layer, a first convolutional layer C1, a first pooling layer S2, a second convolutional layer C3, a second pooling layer S4, a third convolutional layer C5, a fully connected layer F6 and an output layer, connected in sequence;
step (a2): inputting the image frame sequence of each shot of the video into the input layer of the deep learning neural network model, which passes the image frames to the first convolutional layer C1;
the first convolutional layer C1 is used for convolving each frame image in the image frame sequence of the video with a group of trainable convolution kernels, averaging the feature maps obtained from all frames to obtain an average feature map, and feeding the average feature map plus a bias into an activation function to output a group of feature maps;
the first pooling layer S2 is used for performing an overlapping pooling operation on the pixel values of the feature maps obtained from the first convolutional layer C1, so that the length and width of the feature-map matrices output by the first convolutional layer are reduced; the result is then passed to the second convolutional layer C3;
the second convolutional layer C3 is used for performing a convolution operation on the output of the first pooling layer S2; the number of convolution kernels of the second convolutional layer C3 is twice that of the first convolutional layer C1;
the second pooling layer S4 is used for performing an overlapping pooling operation on the feature maps output by the second convolutional layer C3 to reduce the size of the feature-map matrices;
the third convolutional layer C5 convolves the output of the second pooling layer S4 with convolution kernels of the same size as the S4 feature maps, finally obtaining a number of 1 × 1 feature maps;
the fully connected layer F6 connects each of its neurons with every neuron of the third convolutional layer C5 and expresses the result obtained by the third convolutional layer C5 as a feature vector;
the output layer feeds the feature vector output by the fully connected layer F6 into a classifier for classification and computes the classification accuracy; when the classification accuracy is lower than a set threshold, the parameters are adjusted by back-propagation and step (a2) is repeated until the classification accuracy is higher than the set threshold; when the classification accuracy is higher than the set threshold, the corresponding feature vector is taken as the final learning result of the video visual features (an illustrative sketch of this architecture is given below).
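The architecture above can be sketched as follows. This is a minimal PyTorch sketch for a single shot, assuming illustrative channel counts, kernel sizes, frame resolution and number of classes (the patent does not fix these values), and emulating the C5 step, whose kernel matches the S4 map size, with adaptive pooling followed by a 1×1 convolution.

```python
# Minimal PyTorch sketch of the C1-S2-C3-S4-C5-F6 pipeline described above.
# Channel counts, kernel sizes, input resolution and number of classes are assumptions.
import torch
import torch.nn as nn

class ShotVisualNet(nn.Module):
    def __init__(self, num_classes=10, c1_channels=16):
        super().__init__()
        # C1: one set of trainable kernels shared across all frames of the shot
        self.c1 = nn.Conv2d(3, c1_channels, kernel_size=5, padding=2)
        # S2/S4: overlapping pooling (stride smaller than the window size)
        self.s2 = nn.MaxPool2d(kernel_size=3, stride=2)
        # C3: twice as many kernels as C1
        self.c3 = nn.Conv2d(c1_channels, 2 * c1_channels, kernel_size=5, padding=2)
        self.s4 = nn.MaxPool2d(kernel_size=3, stride=2)
        # C5: emulated with adaptive pooling + 1x1 convolution so each map collapses to 1x1
        self.c5 = nn.Sequential(nn.AdaptiveAvgPool2d(1),
                                nn.Conv2d(2 * c1_channels, 4 * c1_channels, kernel_size=1))
        self.f6 = nn.Linear(4 * c1_channels, 128)   # fully connected feature vector
        self.out = nn.Linear(128, num_classes)      # classifier used to tune the features

    def forward(self, frames):                      # frames: (L, 3, H, W), one shot
        maps = torch.relu(self.c1(frames))          # per-frame convolution
        avg = maps.mean(dim=0, keepdim=True)        # average the per-frame feature maps
        x = self.s2(avg)
        x = self.s4(torch.relu(self.c3(x)))
        x = torch.relu(self.c5(x)).flatten(1)       # a set of 1x1 maps -> vector
        feat = torch.relu(self.f6(x))               # F6 feature vector (the shot descriptor)
        return feat, self.out(feat)

shot = torch.randn(24, 3, 112, 112)                 # 24 frames from one shot (illustrative)
features, logits = ShotVisualNet()(shot)
```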
As some possible implementation manners, the specific steps of extracting the voice features of the video are as follows:
extracting the voice signal from the video, converting the audio data into a spectrogram, taking the spectrogram as the input of a deep learning neural network model, performing unsupervised learning on the audio information through the deep learning neural network model, and obtaining a vector representation of the video voice features through a fully connected layer.
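As an illustration of the preprocessing for this branch, the sketch below converts an extracted audio track into a log-mel spectrogram with librosa; the file path, sample rate and number of mel bands are assumptions, not values taken from the patent.

```python
# Turn an extracted audio track into a log-mel spectrogram image for the speech branch.
import librosa
import numpy as np

def audio_to_spectrogram(wav_path, sr=16000, n_mels=64):
    y, sr = librosa.load(wav_path, sr=sr)                      # mono waveform
    mel = librosa.feature.melspectrogram(y=y, sr=sr, n_mels=n_mels)
    log_mel = librosa.power_to_db(mel, ref=np.max)             # spectrogram in dB
    return log_mel                                             # shape: (n_mels, frames)
```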
As some possible implementation manners, the specific steps of extracting the motion features of the video are as follows:
extracting the optical flow field in the video and performing weighted statistics over the optical flow directions to obtain a Histogram of Oriented Optical Flow (HOF) feature as the vector representation of the motion features.
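A rough sketch of this step, using OpenCV's Farnebäck dense optical flow and a magnitude-weighted orientation histogram; the number of orientation bins and the flow parameters are illustrative assumptions.

```python
# Dense optical flow between consecutive grayscale frames, then a
# magnitude-weighted histogram of flow directions (HOF).
import cv2
import numpy as np

def hof_descriptor(gray_frames, n_bins=8):
    hist = np.zeros(n_bins, dtype=np.float64)
    for prev, curr in zip(gray_frames[:-1], gray_frames[1:]):
        flow = cv2.calcOpticalFlowFarneback(prev, curr, None,
                                            0.5, 3, 15, 3, 5, 1.2, 0)
        mag, ang = cv2.cartToPolar(flow[..., 0], flow[..., 1])  # angle in radians
        h, _ = np.histogram(ang.ravel(), bins=n_bins,
                            range=(0, 2 * np.pi), weights=mag.ravel())
        hist += h                                               # weighted statistics
    return hist / (hist.sum() + 1e-8)                           # normalised HOF vector
```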
As some possible implementation manners, the specific steps of extracting the text features of the video are as follows:
collecting the characters in the video frames and the peripheral text information of the video (such as the video title, tags, etc.), and extracting text features from the text information with a bag-of-words model.
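A minimal sketch of this step with scikit-learn's CountVectorizer, assuming the frame text and surrounding text of each video have already been gathered into one string; the example strings are placeholders.

```python
# Bag-of-words text features, one document per video.
from sklearn.feature_extraction.text import CountVectorizer

docs = ["goal free kick replay ...",             # placeholder video-level text
        "anchor studio news report ..."]
vectorizer = CountVectorizer()
text_features = vectorizer.fit_transform(docs)   # sparse (n_videos, vocabulary) counts
```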
The domain features refer to rule features determined by the domain to which the video belongs. For example, a football video has certain scene specifications (such as the upper and lower parts of the field and the penalty area) and event definitions (such as shots, corner kicks, free kicks, etc.) determined by the rules and broadcasting conventions of a football match. News videos have a largely consistent temporal structure and scene semantics, i.e., the news footage switches chronologically between the anchor and the news story. Advertisement videos typically contain logo information associated with the promoted goods or services.
As some possible implementations, the specific steps of the multi-modal feature fusion are:
step (a1): mapping the visual feature vector of the video from the visual feature space to the semantic feature space Γ using a latent Dirichlet allocation (LDA) topic model; the input of the LDA topic model is the visual feature vector of the video, and its output is the semantic representation in the feature space Γ;
step (a2): mapping the voice feature vector of the video from the voice feature space to the semantic feature space Γ using an LDA topic model; the input is the voice feature vector of the video, and the output is the semantic representation in the feature space Γ;
step (a3): mapping the optical flow direction histogram (HOF) feature of the video from the motion feature space to the semantic feature space Γ using an LDA topic model; the input of the LDA is the HOF feature of the video, and the output is the semantic representation in the feature space Γ;
step (a4): mapping the text features of the video from the text feature space to the semantic feature space Γ using an LDA topic model; the input of the LDA is the text of the video, and the output is the semantic representation in the feature space Γ;
step (a5): converting the video domain features into prior knowledge Ω;
step (a6): using the multi-level LDA topic model, setting the weight of each modal feature over the semantic representations in the feature space Γ obtained in steps (a1) to (a4), and obtaining the modality-fused video representation through weighted fusion (a simplified sketch is given after this step).
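A simplified sketch of steps (a1) to (a6): each modality's bag-of-features counts are mapped into a shared K-topic space with a separate LDA model, and the per-modality topic vectors are combined by weighted fusion. In the patent the weights ρ are learned jointly with the multi-level topic model and the domain prior Ω; here they are assumed to be given, purely for illustration.

```python
# Per-modality LDA mapping into a shared topic space, then weighted fusion.
import numpy as np
from sklearn.decomposition import LatentDirichletAllocation

def fuse_modalities(count_matrices, weights, n_topics=32):
    # count_matrices: list of (n_videos, vocab_size_eta) count arrays, one per modality
    topic_vectors = []
    for counts in count_matrices:
        lda = LatentDirichletAllocation(n_components=n_topics, random_state=0)
        theta = lda.fit_transform(counts)          # (n_videos, n_topics), rows sum to 1
        topic_vectors.append(theta)
    weights = np.asarray(weights, dtype=float)
    weights = weights / weights.sum()              # normalised modality weights (assumed given)
    fused = sum(w * t for w, t in zip(weights, topic_vectors))
    return fused                                   # fused video representation in the topic space
```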
The process of obtaining the weight of each modal feature is as follows:
step (a61): select a topic distribution θ | α ~ Dir(α), where α is the prior parameter of the Dirichlet prior distribution;
step (a62): for each word in a training sample video, select a top-level topic assignment z^top ~ Multinomial(θ); the topic obeys a multinomial distribution;
step (a63): for each modal feature weight ρ among the NV modal feature dictionaries, select a bottom-level topic assignment z^low; the topic obeys a multinomial distribution;
step (a64): under each modal feature weight ρ, based on the selected topic and combined with the domain knowledge Ω, generate a word from the corresponding distribution.
For a single video d, given α, β and ρ, the topic distribution θ and the top-level topics z^top are jointly mapped from the NV single-modality spaces to the high-level semantic space; their joint distribution probability p(θ, z^top, d | α, Ω, ρ, β)·p(β) is

p(θ, z^top, d | α, Ω, ρ, β)·p(β) = p(θ | α) · ∏_{η=1}^{NV} ∏_{n=1}^{N_η} p(z^top_{η,n} | θ) · p(z^low_{η,n} | z^top_{η,n}, ρ) · p(w_{η,n} | z^low_{η,n}, β_η, Ω) · ∏_{η=1}^{NV} p(β_η),

where N_η is the number of words of video d in the η-th modal space; the parameters θ and z^top are hidden variables and are eliminated by computing the marginal distribution.
Here, p(β_η) represents the prior relationship between the dictionary elements in the η-th modal space.
A Gaussian-Markov random field prior model is adopted, namely:

p(β_η) ∝ exp( − Σ_i Σ_{j∈Π_i} (β_{η,i} − β_{η,j})² / (2σ_i²) ),

where Π_i denotes the set of words having a prior relationship with word i in the η-th modal space, and σ_i is the smoothing coefficient of the model, used to adjust the prior; exp denotes the exponential function with the natural constant e as its base.
for a video corpus D containing M videos, the generation probability is obtained by multiplying the edge probabilities of the M videos:
the target function being set to the likelihood function of D, i.e.
When the likelihood function of D is maximized, soThe corresponding parameter p is the weight corresponding to each single-mode feature, log represents the logarithm with a as the base,representing a likelihood function.
As a second aspect of the present disclosure, a video semantic representation system based on a multi-modal fusion mechanism is provided;
the video semantic representation system based on the multi-modal fusion mechanism comprises: the computer program product comprises a memory, a processor, and computer instructions stored on the memory and executed on the processor, wherein the computer instructions, when executed by the processor, perform the steps of any of the above methods.
As a third aspect of the present disclosure, there is provided a computer-readable storage medium;
a computer readable storage medium having stored thereon computer instructions which, when executed by a processor, perform the steps of any of the above methods.
Compared with the prior art, the beneficial effects of the present disclosure are:
(1) The disclosure focuses on a video semantic representation method based on a multi-modal fusion mechanism, and comprehensively uses algorithms from related fields such as image processing, pattern recognition and machine learning to process the sequence information in videos. It provides a new research perspective and a theoretical reference for video representation analysis in different fields.
(2) Traditional methods and deep learning methods are combined to study effective video representation at the semantic level, effectively narrowing the ubiquitous "multi-modal gap" and "semantic gap" in video understanding.
(3) A deep visual feature learning model based on an adaptive learning mechanism is proposed. The adaptivity of the automatic learning mechanism is mainly reflected in two aspects: first, shot detection is adopted so that the input of the deep model is a group of frame sequences of variable length, and the number of frames can be adaptively adjusted according to the shot length; second, in the S2 pooling layer, the size and stride of the pooling window are computed dynamically according to the scale of the feature map, thereby ensuring that the data representation dimensions of all shot videos are consistent.
(4) A video-shot-adaptive 3D deep learning neural network is designed and an automatic visual feature learning algorithm for video features is studied; the classifier performance is improved and the parameters of the whole system are optimized, so that the visual information of the video is represented in the most effective way.
(5) A multi-modal, multi-level topic representation model is proposed, whose main characteristics lie in three aspects: first, it is an extensible general representation model; the number of single-modality information channels is extensible, and the domain features contained in any type of video can be integrated into the model, improving the pertinence of the video representation; second, the model fully considers the relations among the modalities, and the multi-modal interaction process is integrated into the joint training and overall optimization of the whole model; third, by exploiting the unique advantages of topic models in semantic analysis, the trained video representation has good discriminability in the semantic space, and a concise representation of the video can be obtained effectively.
Drawings
The accompanying drawings, which are incorporated in and constitute a part of this application, illustrate embodiments of the application and, together with the description, serve to explain the application and are not intended to limit the application.
FIG. 1 is the video shot adaptive 3D deep learning architecture;
FIG. 2 is the adaptive 3D convolution process;
FIG. 3 is a process for performing convolution calculations using convolution kernels;
FIG. 4 is an overall framework of the video multimodal fusion mechanism;
FIG. 5 is a model of multi-modal multi-level topic generation.
Detailed Description
It should be noted that the following detailed description is exemplary and is intended to provide further explanation of the disclosure. Unless defined otherwise, all technical and scientific terms used herein have the same meaning as commonly understood by one of ordinary skill in the art to which this application belongs.
It is noted that the terminology used herein is for the purpose of describing particular embodiments only and is not intended to be limiting of example embodiments according to the present application. As used herein, the singular forms "a", "an" and "the" are intended to include the plural forms as well, and it should be understood that when the terms "comprises" and/or "comprising" are used in this specification, they specify the presence of stated features, steps, operations, devices, components, and/or combinations thereof, unless the context clearly indicates otherwise.
The present disclosure first proposes a spatio-temporal feature learning model with an adaptive frame selection mechanism to obtain the visual features of a video. Then, on this basis, a model that can effectively fuse the visual features with the other modal features in combination with the domain features is further proposed, so as to realize the semantic representation of the video.
To achieve this purpose, the video representation model of the present disclosure combines traditional methods with deep learning, making comprehensive use of the advantages of traditional feature selection techniques, deep learning mechanisms and topic model theory to study the multi-modal fusion mechanism of videos, and further studies effective video representation at the semantic level.
The specific research technical scheme is as follows:
the method comprises the steps that firstly, a time-space domain information representation learning mechanism of a video is deeply analyzed, and effective representation of video visual information is obtained on the basis of guaranteeing continuity and integrity of time-space information; and then, researching a fusion mechanism of the multi-mode information, simultaneously fusing the domain characteristics of the video into the multi-mode information fusion process of the video, and finally establishing a set of semantic representation models of the domain video.
(1) Automatic learning of spatio-temporal deep features of videos
A video shot feature learning model with strong data fitting and learning capability is designed, which can give full play to the advantage of layer-by-layer feature extraction. Using shot detection technology and taking the shot length as the adaptive learning unit, the spatio-temporal sequence information contained in a video shot is mined in a layer-by-layer manner. To this end, a video-shot-adaptive 3D deep learning network model is designed (see FIG. 1).
The process is as follows:
step 1: and performing shot segmentation on the video by using a video shot detection technology.
Step 2: a group of video frames of a lens is used as the input of a model, information is sequentially transmitted to different layers, and the most significant characteristic information of observation data in different categories is obtained by each layer through a group of filters.
And 3, finally rasterizing pixel values of all lens frames and connecting the pixel values into a vector.
The adaptive 3D convolution process is embodied in the C1 convolutional layer. FIG. 2 shows the process of convolving a shot of length L: taking the L-frame sequence as input, the corresponding positions of the different frames are convolved with a set of learnable filters, the resulting neurons are fused and averaged, and a set of feature maps is finally output through an activation function. For video frames we consider the spatial connections inside a frame to be local, so each neuron is set to perceive only a local region.
During convolution, the weights of the neurons in the same feature plane are shared; FIG. 3 shows the process of performing a nonlinear transformation with a convolution kernel.
In FIG. 3, W = (w_1, w_2, …, w_K) represents the weights of a convolution kernel on the convolutional layer, which is a set of learnable parameters; A = (a_1, a_2, …, a_L) are the local receptive fields at the corresponding positions in the L frames, where a_i = (a_{i1}, a_{i2}, …, a_{iK}). When a convolution kernel is used to convolve an image, a region of the input image is involved in the convolution operation; the size of this region is the receptive field.
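A toy NumPy illustration of this C1 computation: the same kernel W is applied to the local receptive field at one position in each of the L frames, the L responses are fused by averaging, a bias is added, and an activation function is applied. The kernel size, shot length and tanh activation are assumptions for illustration.

```python
# One neuron of the C1 feature map: shared kernel, fusion averaging over L frames, activation.
import numpy as np

def c1_response(receptive_fields, W, b):
    # receptive_fields: array (L, K) -- the K-pixel local patch a_i from each of L frames
    # W: (K,) kernel weights shared across frames; b: scalar bias
    per_frame = receptive_fields @ W          # one response per frame
    fused = per_frame.mean()                  # fusion averaging across the shot
    return np.tanh(fused + b)                 # activation -> one neuron of the feature map

L, K = 24, 9                                  # 24 frames, 3x3 kernel (illustrative)
a = np.random.rand(L, K)
w = np.random.randn(K)
print(c1_response(a, w, b=0.1))
```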
In the S2 pooling layer, the pixel units of the feature maps obtained from the C1 convolutional layer are weighted, a nonlinear function is applied, and the result is passed on to the next layer. For the pooling operation, overlapping pooling is used in the implementation, i.e. the stride is set smaller than the pooling window size.
Since the sizes of video frames from different data sources differ, the sizes of the feature maps obtained after the C1 convolution may be inconsistent, which would make the feature dimensions of each video shot differ at the fully connected layer and thus lead to inconsistent data representations. The strategy adopted in this disclosure is to compute the size and stride of the pooling window dynamically according to the scale of the feature map, thereby ensuring that all video shot representation vectors have a consistent dimension.
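One simple way to realise this dynamic pooling strategy is sketched below: the window size and stride are derived from the incoming feature-map scale so that every shot yields the same output grid regardless of the source frame size. The formulas and the fixed output size are assumptions for illustration, not taken from the patent.

```python
# Derive an overlapping pooling window and stride that always produce an out_size x out_size grid.
def dynamic_pool_params(in_size, out_size):
    stride = in_size // out_size                     # floor division
    window = in_size - (out_size - 1) * stride       # last window ends exactly at the border
    return window, stride                            # window >= stride -> overlapping pooling

for h in (55, 62, 110):                              # feature maps of different scales
    print(h, "->", dynamic_pool_params(h, out_size=6))
```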
In the C3 convolutional layer, twice as many convolution kernels as in the C1 convolutional layer are applied to the S2 layer, so that more feature information can be detected as the spatial resolution decreases from layer to layer. The S4 pooling layer performs an operation similar to S2, passing the result on to the next layer through down-sampling. At the C5 layer, the feature maps obtained at the S4 layer are convolved with convolution kernels of the same size as the S4 feature maps, finally producing a series of 1 × 1 feature maps. F6 is a fully connected layer; its goal is to express the input as a final feature vector of a certain length by connecting each neuron to all neurons in C5. The features obtained by training are then fed into a classifier, so that the classifier performance is further improved to optimize the parameters of the whole system and represent the visual information of the video in the most effective way.
(2) Multimodal information fusion for video representation
In the preprocessing stage of multi-modal feature fusion, the features of each modality of the video need to be extracted and characterized separately. In general, video features fall into two broad categories. One category is generic features, including:
1) visual features comprising a time series, including both time and space dimensional information;
2) text features, including the characters in video frames and the text around the video, converted into a modelable numerical description with a bag-of-words model;
3) motion features, i.e., extracting the optical flow information in the video and describing it with the Histogram of Oriented Optical Flow (HOF);
4) audio features, i.e., converting the audio information in the video into a spectrogram, taking the spectrogram as input, and performing unsupervised learning by fine-tuning an existing network model to obtain the vector representation of the voice information.
Another class is domain features, which relate to the video category and the specific application domain.
Processing the modal features of a video is not a simple combination of various features, but the interaction and fusion of several different modal features. Taking the popular latent Dirichlet allocation topic model as the entry point, and integrating theories from machine learning, image processing, speech recognition and other disciplines, the disclosed method fuses the information of each modality of the video. By constructing a multi-level topic model, the video data are organically mapped from each modal space and the domain features to a high-level space, and the video-level representation sequence in the high-level space is obtained. FIG. 4 gives the overall framework of the multi-modal fusion mechanism.
For (2) above, the multi-modal information fusion for video representation requires extracting the features of each modality of the video separately, and then constructing a multi-level topic model for multi-feature fusion. The process is as follows:
(1) Extract the modal features of the video separately, i.e., extract the visual, voice, motion and text information of the video.
For visual information, a shot (a group of video frames) is taken as the input of the model; unsupervised learning is performed on the visual information by fine-tuning existing network models such as AlexNet and GoogLeNet, and finally the pixel values of all shot frames are rasterized and concatenated into a vector. For voice information, the audio data are converted into a spectrogram, the spectrogram is taken as the input of the model, unsupervised learning is then performed on the audio information by fine-tuning an existing network model, and a vector is obtained through the fully connected layer. For motion information, the optical flow information in the video is extracted first and then characterized with the Histogram of Oriented Optical Flow (HOF). For text information, including the text in video frames and the text around the video, a bag-of-words model is used to convert the text information into a modelable numerical description.
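As an illustration of the "fine-tune an existing network" option for the visual branch, the sketch below reuses a pretrained AlexNet from torchvision (a recent torchvision version is assumed) as a frame-level feature extractor and averages the frame descriptors over the shot; the chosen layer and frame count are assumptions, not prescribed by the patent.

```python
# Reuse a pretrained AlexNet as a frame-level descriptor and pool it over the shot.
import torch
import torchvision.models as models

alexnet = models.alexnet(weights=models.AlexNet_Weights.DEFAULT)
alexnet.classifier = alexnet.classifier[:-1]      # drop the final classification layer
alexnet.eval()

frames = torch.randn(24, 3, 224, 224)             # one shot, resized frames (illustrative)
with torch.no_grad():
    frame_feats = alexnet(frames)                 # (24, 4096) per-frame descriptors
shot_feat = frame_feats.mean(dim=0)               # one visual vector per shot
```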
(2) Multi-feature fusion by constructing multi-level topic model
By constructing a multi-level topic model (FIG. 5), the video data are mapped from each modal feature space and the domain features to a high-level semantic space, thereby realizing multi-modal fusion. The specific implementation process is as follows:
the model assumes that the corpus contains M videos, denoted as D ═ D1,d2,…,dMD, each videoiThe (1 ≦ i ≦ M) contains a group of potential subject information, which is generated by mapping dictionary elements in each modal space to a high-level semantic space according to a certain distribution in a certain prior condition. The model takes video as a processing unit, relates to two levels of topic models, realizes multi-mode information fusion by taking domain features as prior knowledge, and finally obtains a vector-form topic representation. The model comprises two levels of topics, respectively represented by ZtopAnd ZlowIs represented by ZtopRepresenting a video-fused topic, ZlowThe theme before fusion is shown, the former is composed of the latter according to polynomial distribution with rho as a parameter, and the parameter omega corresponds to the domain feature of the video. The model considers that the words are independently and uniformly distributed under different modal spaces. The construction of the graph feature dictionary adopts a K-means clustering technology and is constructed in a word bag model mode.
The parameter θ in the model follows a Dirichlet distribution with α as prior parameter and represents the topic distribution of the currently processed video; the parameter NV is the number of modalities; β represents the dictionaries in the different modal spaces. By solving for the parameter ρ, the model sets the weights of the different modalities when the multi-modal spaces are converted into the semantic space.
The generation process of each video in the corpus comprises the following steps:
The first step: select a topic distribution θ | α ~ Dir(α), where α is the prior parameter of the Dirichlet prior distribution;
The second step: for each word in a video, select a top-level topic assignment z^top ~ Multinomial(θ); the topic obeys a multinomial distribution;
The third step: for the NV modal spaces, under modal space ρ, select a bottom-level topic assignment z^low; the topic obeys a multinomial distribution;
The fourth step: generate a word from the distribution according to the selected topic.
For a single video d, given the parameters α, ρ and β and combined with the domain knowledge Ω, when the topic θ and the top-level topics z^top of the model are jointly mapped from the NV modal spaces to the high-level space, their joint distribution probability is

p(θ, z^top, d | α, Ω, ρ, β) = p(θ | α) · ∏_{η=1}^{NV} ∏_{n=1}^{N_η} p(z^top_{η,n} | θ) · p(z^low_{η,n} | z^top_{η,n}, ρ) · p(w_{η,n} | z^low_{η,n}, β_η, Ω),

where the parameters θ and z^top are hidden variables; they can be eliminated by computing the marginal distribution.
The above p(β_η) represents the prior relationship between the dictionary elements in the η-th modal space.
A typical Gaussian-Markov random field prior model is used, namely:

p(β_η) ∝ exp( − Σ_i Σ_{j∈Π_i} (β_{η,i} − β_{η,j})² / (2σ_i²) ),

where Π_i denotes the set of words having a prior relationship with word i in the η-th modal space, and σ_i is the smoothing coefficient of the model, used to adjust the prior.
For a video corpus D containing M videos, the likelihood is obtained by multiplying the marginal probabilities of the M videos:

p(D | α, Ω, ρ, β) = ∏_{d=1}^{M} p(d | α, Ω, ρ, β).

We seek suitable parameters α, ρ and β that maximize this likelihood for the corpus, i.e. the objective function is expressed as

(α*, ρ*, β*) = argmax_{α, ρ, β} Σ_{d=1}^{M} log p(d | α, Ω, ρ, β).
by solving the model, the organic fusion of the multi-modal characteristics and the video field characteristics can be realized, and finally the semantic representation of the video is obtained.
The general idea of the above process is shown in FIG. 4. Summarizing, compared with existing approaches the multi-modal, multi-level topic model proposed in this disclosure has the following features:
1) it is an extensible general representation model: during model training and overall optimization, the number of single-modality information channels is extensible, and the domain features contained in any type of video can be integrated into the model, improving the pertinence of the video representation;
2) the model fully considers the relations among the modalities, and the multi-modal interaction process is integrated into the joint training and overall optimization of the whole model;
3) the topic model has unique advantages in semantic analysis; the video representation obtained by training the model on this basis has good discriminability in the semantic space, which is one of the effective ways to obtain a concise video representation.
The above description is only a preferred embodiment of the present application and is not intended to limit the present application, and various modifications and changes may be made by those skilled in the art. Any modification, equivalent replacement, improvement and the like made within the spirit and principle of the present application shall be included in the protection scope of the present application.

Claims (9)

1. The video semantic representation method based on the multi-mode fusion mechanism is characterized by comprising the following steps:
feature extraction: extracting visual features, voice features, motion features, text features and domain features of the video;
feature fusion: performing feature fusion on the extracted visual, voice, motion and text features, together with the domain features, through a constructed multi-level latent Dirichlet allocation (LDA) topic model;
feature mapping: mapping the fused features to a high-level semantic space to obtain a fused feature representation sequence.
2. The method for video semantic representation based on the multi-modal fusion mechanism as claimed in claim 1, wherein the specific steps for extracting the visual features of the video are as follows:
preprocessing: video segmentation, namely segmenting the video into a plurality of shots; the image frames in each shot form an image frame sequence in temporal order;
step (a 1): establishing a deep learning neural network model;
the deep learning neural network model comprises: an input layer, a first convolutional layer C1, a first pooling layer S2, a second convolutional layer C3, a second pooling layer S4, a third convolutional layer C5, a fully connected layer F6 and an output layer, connected in sequence;
step (a2): inputting the image frame sequence of each shot of the video into the input layer of the deep learning neural network model, which passes the image frames to the first convolutional layer C1;
the first convolutional layer C1 is used for convolving each frame image in the image frame sequence of the video with a group of trainable convolution kernels, averaging the feature maps obtained from all frames to obtain an average feature map, and feeding the average feature map plus a bias into an activation function to output a group of feature maps;
the first pooling layer S2 is used for performing an overlapping pooling operation on the pixel values of the feature maps obtained from the first convolutional layer C1, so that the length and width of the feature-map matrices output by the first convolutional layer are reduced; the result is then passed to the second convolutional layer C3;
the second convolutional layer C3 is used for performing a convolution operation on the output of the first pooling layer S2; the number of convolution kernels of the second convolutional layer C3 is twice that of the first convolutional layer C1;
the second pooling layer S4 is used for performing an overlapping pooling operation on the feature maps output by the second convolutional layer C3 to reduce the size of the feature-map matrices;
the third convolutional layer C5 convolves the output of the second pooling layer S4 with convolution kernels of the same size as the S4 feature maps, finally obtaining a number of 1 × 1 feature maps;
the fully connected layer F6 connects each of its neurons with every neuron of the third convolutional layer C5 and expresses the result obtained by the third convolutional layer C5 as a feature vector;
the output layer feeds the feature vector output by the fully connected layer F6 into a classifier for classification and computes the classification accuracy; when the classification accuracy is lower than a set threshold, the parameters are adjusted by back-propagation and step (a2) is repeated until the classification accuracy is higher than the set threshold; when the classification accuracy is higher than the set threshold, the corresponding feature vector is taken as the final learning result of the video visual features.
3. The method for video semantic representation based on the multi-modal fusion mechanism as claimed in claim 1, wherein the specific steps for extracting the voice features of the video are as follows:
extracting the voice signal from the video, converting the audio data into a spectrogram, taking the spectrogram as the input of a deep learning neural network model, performing unsupervised learning on the audio information through the deep learning neural network model, and obtaining a vector representation of the video voice features through a fully connected layer.
4. The method for semantic representation of video based on multi-modal fusion mechanism as claimed in claim 1, wherein the specific steps for extracting the motion features of the video are as follows:
extracting the optical flow field in the video and performing weighted statistics over the optical flow directions to obtain a Histogram of Oriented Optical Flow (HOF) feature as the vector representation of the motion features.
5. The method for video semantic representation based on the multi-modal fusion mechanism as claimed in claim 1, wherein the specific steps for extracting the text features of the video are as follows:
the method comprises the steps of collecting characters in a video frame and peripheral text information of a video, and extracting text features from the text information by adopting a word bag model.
6. The method as claimed in claim 1, wherein the domain feature is a rule feature set by a domain to which the video belongs.
7. The method for video semantic representation based on multi-modal fusion mechanism as claimed in claim 1, wherein,
the specific steps of the multi-modal feature fusion are as follows:
step (a1): mapping the visual feature vector of the video from the visual feature space to the semantic feature space Γ using a latent Dirichlet allocation (LDA) topic model; the input of the LDA topic model is the visual feature vector of the video, and its output is the semantic representation in the feature space Γ;
step (a2): mapping the voice feature vector of the video from the voice feature space to the semantic feature space Γ using an LDA topic model; the input is the voice feature vector of the video, and the output is the semantic representation in the feature space Γ;
step (a3): mapping the optical flow direction histogram (HOF) feature of the video from the motion feature space to the semantic feature space Γ using an LDA topic model; the input of the LDA is the HOF feature of the video, and the output is the semantic representation in the feature space Γ;
step (a4): mapping the text features of the video from the text feature space to the semantic feature space Γ using an LDA topic model; the input of the LDA is the text of the video, and the output is the semantic representation in the feature space Γ;
step (a5): converting the video domain features into prior knowledge Ω;
step (a6): using the multi-level LDA topic model, setting the weight of each modal feature over the semantic representations in the feature space Γ obtained in steps (a1) to (a4), and obtaining the modality-fused video representation through weighted fusion.
8. A video semantic representation system based on the multi-modal fusion mechanism, characterized by comprising: a memory, a processor, and computer instructions stored in the memory and run on the processor, wherein the computer instructions, when executed by the processor, perform the steps of the method of any one of claims 1-7.
9. A computer-readable storage medium having stored thereon computer instructions which, when executed by a processor, perform the steps of the method of any one of claims 1 to 7.
CN201811289502.5A 2018-10-31 2018-10-31 Video semantic representation method, system and medium based on multi-mode fusion mechanism Active CN109472232B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201811289502.5A CN109472232B (en) 2018-10-31 2018-10-31 Video semantic representation method, system and medium based on multi-mode fusion mechanism

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201811289502.5A CN109472232B (en) 2018-10-31 2018-10-31 Video semantic representation method, system and medium based on multi-mode fusion mechanism

Publications (2)

Publication Number Publication Date
CN109472232A true CN109472232A (en) 2019-03-15
CN109472232B CN109472232B (en) 2020-09-29

Family

ID=65666408

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201811289502.5A Active CN109472232B (en) 2018-10-31 2018-10-31 Video semantic representation method, system and medium based on multi-mode fusion mechanism

Country Status (1)

Country Link
CN (1) CN109472232B (en)

Cited By (24)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN110046279A (en) * 2019-04-18 2019-07-23 网易传媒科技(北京)有限公司 Prediction technique, medium, device and the calculating equipment of video file feature
CN110162669A (en) * 2019-04-04 2019-08-23 腾讯科技(深圳)有限公司 Visual classification processing method, device, computer equipment and storage medium
CN110234018A (en) * 2019-07-09 2019-09-13 腾讯科技(深圳)有限公司 Multimedia content description generation method, training method, device, equipment and medium
CN110390311A (en) * 2019-07-27 2019-10-29 苏州过来人科技有限公司 A kind of video analysis algorithm based on attention and subtask pre-training
CN110399934A (en) * 2019-07-31 2019-11-01 北京达佳互联信息技术有限公司 A kind of video classification methods, device and electronic equipment
CN110489593A (en) * 2019-08-20 2019-11-22 腾讯科技(深圳)有限公司 Topic processing method, device, electronic equipment and the storage medium of video
CN110580509A (en) * 2019-09-12 2019-12-17 杭州海睿博研科技有限公司 multimodal data processing system and method for generating countermeasure model based on hidden representation and depth
CN110647804A (en) * 2019-08-09 2020-01-03 中国传媒大学 Violent video identification method, computer system and storage medium
CN110674348A (en) * 2019-09-27 2020-01-10 北京字节跳动网络技术有限公司 Video classification method and device and electronic equipment
CN111235709A (en) * 2020-03-18 2020-06-05 东华大学 Online detection system for spun yarn evenness of ring spinning based on machine vision
CN111401259A (en) * 2020-03-18 2020-07-10 南京星火技术有限公司 Model training method, system, computer readable medium and electronic device
CN111414959A (en) * 2020-03-18 2020-07-14 南京星火技术有限公司 Image recognition method and device, computer readable medium and electronic equipment
CN111723239A (en) * 2020-05-11 2020-09-29 华中科技大学 Multi-mode-based video annotation method
WO2021134277A1 (en) * 2019-12-30 2021-07-08 深圳市优必选科技股份有限公司 Emotion recognition method, intelligent device, and computer-readable storage medium
CN113094550A (en) * 2020-01-08 2021-07-09 百度在线网络技术(北京)有限公司 Video retrieval method, device, equipment and medium
CN113177914A (en) * 2021-04-15 2021-07-27 青岛理工大学 Robot welding method and system based on semantic feature clustering
CN113239184A (en) * 2021-07-09 2021-08-10 腾讯科技(深圳)有限公司 Knowledge base acquisition method and device, computer equipment and storage medium
CN113806609A (en) * 2021-09-26 2021-12-17 郑州轻工业大学 Multi-modal emotion analysis method based on MIT and FSM
CN113903358A (en) * 2021-10-15 2022-01-07 北京房江湖科技有限公司 Voice quality inspection method, readable storage medium and computer program product
CN114519809A (en) * 2022-02-14 2022-05-20 复旦大学 Audio-visual video analysis device and method based on multi-scale semantic network
CN114863202A (en) * 2022-03-23 2022-08-05 腾讯科技(深圳)有限公司 Video representation method and device
JP2022135930A (en) * 2021-03-05 2022-09-15 ベイジン バイドゥ ネットコム サイエンス テクノロジー カンパニー リミテッド Video classification method, apparatus, device, and storage medium
WO2022198854A1 (en) * 2021-03-24 2022-09-29 北京百度网讯科技有限公司 Method and apparatus for extracting multi-modal poi feature
CN117933269A (en) * 2024-03-22 2024-04-26 合肥工业大学 Multi-mode depth model construction method and system based on emotion distribution

Citations (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US7920761B2 (en) * 2006-08-21 2011-04-05 International Business Machines Corporation Multimodal identification and tracking of speakers in video
CN103778443A (en) * 2014-02-20 2014-05-07 公安部第三研究所 Method for achieving scene analysis description based on theme model method and field rule library
CN104090971A (en) * 2014-07-17 2014-10-08 中国科学院自动化研究所 Cross-network behavior association method for individual application
CN105760507A (en) * 2016-02-23 2016-07-13 复旦大学 Cross-modal subject correlation modeling method based on deep learning

Patent Citations (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US7920761B2 (en) * 2006-08-21 2011-04-05 International Business Machines Corporation Multimodal identification and tracking of speakers in video
CN103778443A (en) * 2014-02-20 2014-05-07 公安部第三研究所 Method for achieving scene analysis description based on theme model method and field rule library
CN104090971A (en) * 2014-07-17 2014-10-08 中国科学院自动化研究所 Cross-network behavior association method for individual application
CN105760507A (en) * 2016-02-23 2016-07-13 复旦大学 Cross-modal subject correlation modeling method based on deep learning

Non-Patent Citations (3)

* Cited by examiner, † Cited by third party
Title
LIU ZHENG ET.AL: "MMDF-LDA: An improved Multi-Modal Latent Dirichlet Allocation model for social image annotation", 《EXPERT SYSTEMS WITH APPLICATIONS》 *
QIN JIN ET.AL: "Describing Videos using Multi-modal Fusion", 《PROCEEDINGS OF THE 24TH ACM INTERNATIONAL CONFERENCE ON MULTIMEDIA》 *
ZHANG DE ET AL.: "Video multi-modal content analysis technology based on unified semantic space representation", Video Engineering *

Cited By (38)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN110162669A (en) * 2019-04-04 2019-08-23 腾讯科技(深圳)有限公司 Visual classification processing method, device, computer equipment and storage medium
CN110162669B (en) * 2019-04-04 2021-07-02 腾讯科技(深圳)有限公司 Video classification processing method and device, computer equipment and storage medium
CN110046279A (en) * 2019-04-18 2019-07-23 网易传媒科技(北京)有限公司 Prediction technique, medium, device and the calculating equipment of video file feature
CN110046279B (en) * 2019-04-18 2022-02-25 网易传媒科技(北京)有限公司 Video file feature prediction method, medium, device and computing equipment
CN110234018B (en) * 2019-07-09 2022-05-31 腾讯科技(深圳)有限公司 Multimedia content description generation method, training method, device, equipment and medium
CN110234018A (en) * 2019-07-09 2019-09-13 腾讯科技(深圳)有限公司 Multimedia content description generation method, training method, device, equipment and medium
CN110390311A (en) * 2019-07-27 2019-10-29 苏州过来人科技有限公司 A kind of video analysis algorithm based on attention and subtask pre-training
CN110399934A (en) * 2019-07-31 2019-11-01 北京达佳互联信息技术有限公司 A kind of video classification methods, device and electronic equipment
CN110647804A (en) * 2019-08-09 2020-01-03 中国传媒大学 Violent video identification method, computer system and storage medium
CN110489593A (en) * 2019-08-20 2019-11-22 腾讯科技(深圳)有限公司 Topic processing method, device, electronic equipment and the storage medium of video
CN110580509A (en) * 2019-09-12 2019-12-17 杭州海睿博研科技有限公司 multimodal data processing system and method for generating countermeasure model based on hidden representation and depth
CN110674348B (en) * 2019-09-27 2023-02-03 北京字节跳动网络技术有限公司 Video classification method and device and electronic equipment
CN110674348A (en) * 2019-09-27 2020-01-10 北京字节跳动网络技术有限公司 Video classification method and device and electronic equipment
WO2021134277A1 (en) * 2019-12-30 2021-07-08 深圳市优必选科技股份有限公司 Emotion recognition method, intelligent device, and computer-readable storage medium
CN113094550A (en) * 2020-01-08 2021-07-09 百度在线网络技术(北京)有限公司 Video retrieval method, device, equipment and medium
CN113094550B (en) * 2020-01-08 2023-10-24 百度在线网络技术(北京)有限公司 Video retrieval method, device, equipment and medium
CN111401259B (en) * 2020-03-18 2024-02-02 南京星火技术有限公司 Model training method, system, computer readable medium and electronic device
CN111235709A (en) * 2020-03-18 2020-06-05 东华大学 Online detection system for spun yarn evenness of ring spinning based on machine vision
CN111414959B (en) * 2020-03-18 2024-02-02 南京星火技术有限公司 Image recognition method, device, computer readable medium and electronic equipment
CN111401259A (en) * 2020-03-18 2020-07-10 南京星火技术有限公司 Model training method, system, computer readable medium and electronic device
CN111414959A (en) * 2020-03-18 2020-07-14 南京星火技术有限公司 Image recognition method and device, computer readable medium and electronic equipment
CN111723239A (en) * 2020-05-11 2020-09-29 华中科技大学 Multi-mode-based video annotation method
CN111723239B (en) * 2020-05-11 2023-06-16 华中科技大学 Video annotation method based on multiple modes
US12094208B2 (en) 2021-03-05 2024-09-17 Beijing Baidu Netcom Science Technology Co., Ltd. Video classification method, electronic device and storage medium
JP2022135930A (en) * 2021-03-05 2022-09-15 ベイジン バイドゥ ネットコム サイエンス テクノロジー カンパニー リミテッド Video classification method, apparatus, device, and storage medium
JP7334395B2 (en) 2021-03-05 2023-08-29 ベイジン バイドゥ ネットコム サイエンス テクノロジー カンパニー リミテッド Video classification methods, devices, equipment and storage media
JP2023529939A (en) * 2021-03-24 2023-07-12 ベイジン バイドゥ ネットコム サイエンス テクノロジー カンパニー リミテッド Multimodal POI feature extraction method and apparatus
WO2022198854A1 (en) * 2021-03-24 2022-09-29 北京百度网讯科技有限公司 Method and apparatus for extracting multi-modal poi feature
CN113177914A (en) * 2021-04-15 2021-07-27 青岛理工大学 Robot welding method and system based on semantic feature clustering
CN113177914B (en) * 2021-04-15 2023-02-17 青岛理工大学 Robot welding method and system based on semantic feature clustering
CN113239184A (en) * 2021-07-09 2021-08-10 腾讯科技(深圳)有限公司 Knowledge base acquisition method and device, computer equipment and storage medium
CN113806609A (en) * 2021-09-26 2021-12-17 郑州轻工业大学 Multi-modal emotion analysis method based on MIT and FSM
CN113806609B (en) * 2021-09-26 2022-07-12 郑州轻工业大学 Multi-modal emotion analysis method based on MIT and FSM
CN113903358B (en) * 2021-10-15 2022-11-04 贝壳找房(北京)科技有限公司 Voice quality inspection method, readable storage medium and computer program product
CN113903358A (en) * 2021-10-15 2022-01-07 北京房江湖科技有限公司 Voice quality inspection method, readable storage medium and computer program product
CN114519809A (en) * 2022-02-14 2022-05-20 复旦大学 Audio-visual video analysis device and method based on multi-scale semantic network
CN114863202A (en) * 2022-03-23 2022-08-05 腾讯科技(深圳)有限公司 Video representation method and device
CN117933269A (en) * 2024-03-22 2024-04-26 合肥工业大学 Multi-mode depth model construction method and system based on emotion distribution

Also Published As

Publication number Publication date
CN109472232B (en) 2020-09-29

Similar Documents

Publication Publication Date Title
CN109472232B (en) Video semantic representation method, system and medium based on multi-mode fusion mechanism
CN110322446B (en) Domain self-adaptive semantic segmentation method based on similarity space alignment
Köpüklü et al. You only watch once: A unified cnn architecture for real-time spatiotemporal action localization
WO2023280065A1 (en) Image reconstruction method and apparatus for cross-modal communication system
CN112232425B (en) Image processing method, device, storage medium and electronic equipment
CN109145712B (en) Text information fused GIF short video emotion recognition method and system
US20180114071A1 (en) Method for analysing media content
CN112446476A (en) Neural network model compression method, device, storage medium and chip
CN107683469A (en) A kind of product classification method and device based on deep learning
CN109783666A (en) A kind of image scene map generation method based on iteration fining
CN115223020B (en) Image processing method, apparatus, device, storage medium, and computer program product
Mukushev et al. Evaluation of manual and non-manual components for sign language recognition
WO2023173552A1 (en) Establishment method for target detection model, application method for target detection model, and device, apparatus and medium
CN112364168A (en) Public opinion classification method based on multi-attribute information fusion
CN111598183A (en) Multi-feature fusion image description method
CN110852199A (en) Foreground extraction method based on double-frame coding and decoding model
Wang et al. Basketball shooting angle calculation and analysis by deeply-learned vision model
Zhenhua et al. FTCF: Full temporal cross fusion network for violence detection in videos
Zhao et al. Multifeature fusion action recognition based on key frames
Sun et al. Video understanding: from video classification to captioning
Wei et al. Lightweight multimodal feature graph convolutional network for dangerous driving behavior detection
US11948090B2 (en) Method and apparatus for video coding
Li A deep learning-based text detection and recognition approach for natural scenes
Qiao et al. Two-Stream Convolutional Neural Network for Video Action Recognition
CN116932788A (en) Cover image extraction method, device, equipment and computer storage medium

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant
TR01 Transfer of patent right

Effective date of registration: 20210506

Address after: Room 1605, Kangzhen building, 18 Louyang Road, Suzhou Industrial Park, 215000, Jiangsu Province

Patentee after: Suzhou Wuyun pen and ink Education Technology Co.,Ltd.

Address before: No.1 Daxue Road, University Science Park, Changqing District, Jinan City, Shandong Province

Patentee before: SHANDONG NORMAL University