CN101299241B - Multimodal Video Semantic Concept Detection Method Based on Tensor Representation - Google Patents


Info

Publication number
CN101299241B
CN101299241B, CN2008100591256A, CN200810059125A
Authority
CN
China
Prior art keywords
tensor
video
audio
shot
semantic concept
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Expired - Fee Related
Application number
CN2008100591256A
Other languages
Chinese (zh)
Other versions
CN101299241A (en)
Inventor
吴飞
庄越挺
刘亚楠
郭同强
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Zhejiang University ZJU
Original Assignee
Zhejiang University ZJU
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Zhejiang University ZJU filed Critical Zhejiang University ZJU
Priority to CN2008100591256A priority Critical patent/CN101299241B/en
Publication of CN101299241A publication Critical patent/CN101299241A/en
Application granted granted Critical
Publication of CN101299241B publication Critical patent/CN101299241B/en

Landscapes

  • Image Processing (AREA)
  • Image Analysis (AREA)

Abstract

The invention discloses a multimodal video semantic concept detection method based on tensor representation, comprising the following steps: 1) low-level features of three modalities (image, audio, and text) are extracted from the video shots in both the training set and the test set, and each video tensor shot is expressed as a 3rd-order tensor formed from these three kinds of low-level features; 2) according to the intrinsic manifold structure of the set of video tensor shots, dimensionality reduction and subspace embedding of the original high-dimensional tensors are achieved by seeking transformation matrices; 3) a support tensor machine is used to build a classifier model on the dimension-reduced set of video tensor shots; 4) a test shot is first projected with the transformation matrices computed from the training set, and semantic concept detection is then performed with the classifier model. The invention makes full use of the multimodal data in video, represents each video shot as a 3rd-order tensor, proposes a subspace-embedding dimensionality reduction method based on this representation, and realizes semantic concept detection for video shots, achieving good analysis and understanding of video semantics.

Description

Method for multimodal video semantic concept detection based on tensor representation
Technical field
The present invention relates to a method for multimodal video semantic concept detection based on tensor representation. The method represents each video shot as a 3rd-order tensor, seeks an effective dimensionality reduction method to project it into a low-dimensional semantic space, and then detects the semantic concepts of video tensor shots with a trained classifier model. It belongs to the field of video semantic analysis and understanding.
Background technology
With the development and popularization of digital imaging devices, and the rapid progress of the film and television industry, computer technology, communication technology, multimedia processing, compression coding, and the Internet, large amounts of video data are produced in fields such as news, film, historical archives, and surveillance. Video data carries rich semantics such as people, scenes, objects, and events; video is also time-series data, containing three kinds of media (image, audio, and text) that exhibit temporally correlated co-occurrence. At the same time, the fusion and cooperation of multiple modalities plays an important role in narrowing the "semantic gap" between low-level features and high-level semantics. How to effectively exploit the multimodal and temporal characteristics of video to mine its semantic information, and thereby support effective video retrieval and bring the resource-sharing advantages of video data into play, is a challenging research question.
As for how to represent the multiple media in video, the traditional approach concatenates image, audio, and text features into a single vector. Such high-dimensional vectors, however, tend to cause the "curse of dimensionality", and the temporally correlated co-occurrence relations among the modalities in video are ignored. In recent years, multilinear algebra, that is, high-order tensors, has been widely applied in fields such as computer vision, information retrieval, and signal processing. A tensor is a natural generalization and extension of vectors and matrices, and tensor algebra defines a series of multilinear operations over sets of vector spaces. Meanwhile, the supervised tensor learning framework, which takes tensors as input, adopts an alternating projection optimization procedure to find the optimal solution, combining convex optimization with multilinear operations. Based on the supervised tensor learning framework, the traditional support vector machine can be extended to the support tensor machine for training and applying classifier models.
Summary of the invention
The purpose of this invention is to provide a method for multimodal video semantic concept detection based on tensor representation, comprising the following steps:
1) Low-level features of three modalities (image, audio, and text) are extracted from the video shots in both the training set and the test set, and each video tensor shot is expressed as a 3rd-order tensor formed from these three kinds of low-level features;
2) According to the intrinsic manifold structure of the set of video tensor shots, dimensionality reduction and subspace embedding of the original high-dimensional tensors are achieved by seeking transformation matrices;
3) A support tensor machine is used to build a classifier model on the dimension-reduced set of video tensor shots;
4) A test shot is first projected with the transformation matrices computed from the training set, and semantic concept detection is then performed with the classifier model.
Extracting the low-level features of the three modalities (image, audio, and text) from the shots in the training set and the test set: a key frame is chosen from each shot as its representative image, and the color histogram, texture, and Canny edges are extracted as image features; the audio segment corresponding to a shot is extracted as an audio clip and divided into overlapping short-time audio frames, and features of each short-time frame are extracted, including MFCC, centroid, roll-off frequency, spectral flux, and zero-crossing rate, to form frame feature vectors; statistics of the short-time frame feature vectors then serve as the audio features of the shot; TF*IDF values are extracted from the recognized transcript of the video as text features.
The representation of a video tensor shot: based on the image, audio, and text low-level features extracted from the video, each video shot is represented by a 3rd-order tensor $S \in \mathbb{R}^{I_1 \times I_2 \times I_3}$, where $I_1$, $I_2$, and $I_3$ are the dimensions of the image, audio, and text features, respectively. The value of each element $s_{i_1 i_2 i_3}$ is defined as follows: $s_{i_1,1,1}$ ($1 \le i_1 \le I_1$) holds the image feature values, $s_{2,i_2,2}$ ($1 \le i_2 \le I_2$) holds the audio feature values, $s_{3,3,i_3}$ ($1 \le i_3 \le I_3$) holds the text feature values, and all other elements are initially set to zero.
The method for dimensionality reduction and subspace embedding of the original high-dimensional tensors according to the intrinsic manifold structure of the video tensor shots, achieved by seeking transformation matrices, is as follows: given the shot data set $X = \{X_1, X_2, \ldots, X_N\}$ in the space $\mathbb{R}^{I_1 \times I_2 \times I_3}$, according to the intrinsic manifold structure of the tensor shots and spectral graph theory, three transformation matrices are sought for each tensor shot $X_i|_{i=1}^N$ in $X$: $T_1^i$ of size $J_1 \times I_1$, $T_2^i$ of size $J_2 \times I_2$, and $T_3^i$ of size $J_3 \times I_3$, which map the $N$ data points to $Y = \{Y_1, Y_2, \ldots, Y_N\}$ in the space $\mathbb{R}^{J_1 \times J_2 \times J_3}$ ($J_1 < I_1$, $J_2 < I_2$, $J_3 < I_3$), satisfying $Y_i|_{i=1}^N = X_i \times_1 T_1^{iT} \times_2 T_2^{iT} \times_3 T_3^{iT}$, thereby realizing dimensionality reduction and subspace embedding of the original high-dimensional tensors. To obtain $T_1^i|_{i=1}^N$, the optimal intermediate transformation matrix $V_1$ is computed by solving the generalized eigenvector problem $(D_U - W_U)V_1 = \lambda D_U V_1$, where $D_U = \sum_i D_{ii} U_1^i U_1^{iT}$, $W_U = \sum_{ij} W_{ij} U_1^i U_1^{jT}$, $W$ is the weight matrix of the nearest-neighbor graph constructed from the training set $X$, $D$ is the diagonal matrix of $W$ with $D_{ii} = \sum_j W_{ij}$, and $U_1^i$ is the left matrix obtained by SVD of the mode-1 unfolding matrix $X_{(1)}^i$ of $X_i|_{i=1}^N$. The transformation matrices are then finally computed as $T_1^i|_{i=1}^N = V_1^T U_1^i \in \mathbb{R}^{J_1 \times I_1}$; $T_2^i$ and $T_3^i$ are obtained in the same way.
The method for building the classifier model on the dimension-reduced set of video tensor shots with a support tensor machine is as follows: the input of the classifier model is the low-dimensional tensors obtained by subspace embedding and dimensionality reduction, $Y_i|_{i=1}^N = X_i \times_1 T_1^{iT} \times_2 T_2^{iT} \times_3 T_3^{iT} \in \mathbb{R}^{J_1 \times J_2 \times J_3}$, together with the corresponding class labels $y_i \in \{+1, -1\}$; the output is the tensor hyperplane parameters of the classifier model, $w_k|_{k=1}^3 \in \mathbb{R}^{J_k}$ and $b \in \mathbb{R}$. These are obtained by iteratively solving the optimization problem
$$\min_{w_j, b, \xi} J_{C\text{-}STM}(w_j, b, \xi) = \frac{\eta}{2}\|w_j\|_{Fro}^2 + c\sum_{i=1}^N \xi_i \quad \text{s.t.} \quad y_i\Big[w_j^T\Big(Y_i \prod_{\substack{k=1 \\ k \ne j}}^{3} \times_k w_k\Big) + b\Big] \ge 1 - \xi_i,\ 1 \le i \le N,\ \xi \ge 0$$
which yields $w_k|_{k=1}^3$ and $b$, where the parameter $j$ cycles from 1 to 3, $c$ is a constant, $\xi$ is the slack variable, and $\eta = \prod_{1 \le k \le 3,\, k \ne j} \|w_k\|_{Fro}^2$.
For a test shot, after projection with the transformation matrices computed from the training set, semantic concept detection is performed with the classifier model: new data outside the training set, $X_t \in \mathbb{R}^{I_1 \times I_2 \times I_3}$, is mapped into the low-dimensional subspace by the transformation matrices $T_1^t = V_1^T U_1^t \in \mathbb{R}^{J_1 \times I_1}$, $T_2^t = V_2^T U_2^t \in \mathbb{R}^{J_2 \times I_2}$, and $T_3^t = V_3^T U_3^t \in \mathbb{R}^{J_3 \times I_3}$ as $Y_t = X_t \times_1 T_1^{tT} \times_2 T_2^{tT} \times_3 T_3^{tT} \in \mathbb{R}^{J_1 \times J_2 \times J_3}$; classification is then performed with the classifier model, that is, $z_t = \mathrm{sign}(Y_t \times_1 w_1 \times_2 w_2 \times_3 w_3 + b)$ is computed to obtain the class label $z_t \in \{+1, -1\}$ of the test data.
Beneficial effects of the present invention:
1) The invention replaces the traditional vector representation of video with tensors, which effectively alleviates the problems brought by the "curse of dimensionality";
2) The invention takes into account the multiple modalities in video (image, audio, and text) as well as the temporally correlated co-occurrence characteristic of video data; the fusion and cooperation of multiple modalities based on this characteristic plays an important role in narrowing the "semantic gap" between low-level features and high-level semantics;
3) The proposed subspace embedding and dimension reduction method for tensor shots, which preserves the intrinsic manifold structure of the set of tensor shots and follows spectral graph theory, not only effectively resolves the difficulties brought by high dimensionality but, being a linear method, can also directly project new data outside the training set;
4) The invention adopts the support tensor machine to train the classifier model, which has good classification and detection capability.
Description of drawings
Fig. 1 is the flowchart of the multimodal video semantic concept detection method based on tensor representation;
Fig. 2 shows the detection results of the present invention for the semantic concept "Explosion", compared with the ISOMAP and PCA methods and plotted as ROC curves;
Fig. 3 shows the detection results of the present invention for the semantic concept "Sports", compared with the ISOMAP and PCA methods and plotted as ROC curves.
Embodiment
The method for multimodal video semantic concept detection based on tensor representation comprises the following steps:
1) Low-level features of three modalities (image, audio, and text) are extracted from the video shots in both the training set and the test set, and each video tensor shot is expressed as a 3rd-order tensor formed from these three kinds of low-level features;
2) According to the intrinsic manifold structure of the set of video tensor shots, dimensionality reduction and subspace embedding of the original high-dimensional tensors are achieved by seeking transformation matrices;
3) A support tensor machine is used to build a classifier model on the dimension-reduced set of video tensor shots;
4) A test shot is first projected with the transformation matrices computed from the training set, and semantic concept detection is then performed with the classifier model.
Extracting the low-level features of the three modalities (image, audio, and text) from the shots in the training set and the test set: low-level features are features extracted directly from the video source data, as distinguished from the high-level features represented by semantic concepts. We extract low-level features from each video shot, covering image, audio, and text.
Image features: with the shot as the basic processing unit, a key frame is chosen from each shot as its representative image, and the color histogram, texture, and Canny edges of the key frame are extracted as image features;
Audio features: the audio segment corresponding to a shot is extracted as an audio clip and divided into overlapping short-time audio frames; features of each short-time frame are extracted, including MFCC, centroid, roll-off frequency, spectral flux, and zero-crossing rate, to form frame feature vectors; statistics of the short-time frame feature vectors (mean or variance) then serve as the audio features of the shot;
Text features: we extract features from the transcript text of the video obtained by recognition. Because the dimension of text features is much larger than that of the other modalities, and the text contains rich semantic information, Latent Semantic Analysis (LSA) can first be applied to the text for dimensionality reduction.
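For illustration only (the patent prescribes no implementation), the sketch below realizes the three extractors with OpenCV, librosa, and scikit-learn; the library choices, parameter values, the simplified edge statistic, and the omission of the texture and spectral-flux features are our own assumptions:

```python
# Hypothetical per-shot feature extractors; all parameters are assumptions.
import cv2
import librosa
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer

def image_features(keyframe_bgr):
    """Color histogram + a Canny edge statistic from the shot's key frame
    (texture features from the patent are omitted here for brevity)."""
    hist = cv2.calcHist([keyframe_bgr], [0, 1, 2], None,
                        [8, 8, 8], [0, 256] * 3).flatten()
    gray = cv2.cvtColor(keyframe_bgr, cv2.COLOR_BGR2GRAY)
    edges = cv2.Canny(gray, 100, 200)
    return np.concatenate([hist / hist.sum(), [edges.mean() / 255.0]])

def audio_features(wav_path):
    """Mean/variance statistics of short-time frame features over the clip
    (spectral flux omitted; librosa uses overlapping frames by default)."""
    y, sr = librosa.load(wav_path, sr=None)
    frames = np.vstack([
        librosa.feature.mfcc(y=y, sr=sr, n_mfcc=13),    # MFCC
        librosa.feature.spectral_centroid(y=y, sr=sr),  # centroid
        librosa.feature.spectral_rolloff(y=y, sr=sr),   # roll-off frequency
        librosa.feature.zero_crossing_rate(y),          # zero-crossing rate
    ])
    return np.concatenate([frames.mean(axis=1), frames.var(axis=1)])

def text_features(transcripts):
    """TF*IDF vectors for the recognized transcripts, one row per shot."""
    return TfidfVectorizer().fit_transform(transcripts).toarray()
```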
The representation of a video tensor shot: based on the image, audio, and text low-level features extracted from the video, each video shot is represented by a 3rd-order tensor $S \in \mathbb{R}^{I_1 \times I_2 \times I_3}$, where $I_1$, $I_2$, and $I_3$ are the dimensions of the image, audio, and text features, respectively. The value of each element $s_{i_1 i_2 i_3}$ is defined as follows: $s_{i_1,1,1}$ ($1 \le i_1 \le I_1$) holds the image feature values, $s_{2,i_2,2}$ ($1 \le i_2 \le I_2$) holds the audio feature values, $s_{3,3,i_3}$ ($1 \le i_3 \le I_3$) holds the text feature values, and all other elements are initially set to zero.
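A minimal numpy sketch of this fiber placement (0-based indices; it assumes each feature dimension is at least 3, and the helper name is ours):

```python
import numpy as np

def build_shot_tensor(img_feat, aud_feat, txt_feat):
    """Place the three modality feature vectors on the designated fibers of a
    3rd-order tensor S in R^{I1 x I2 x I3}; all other entries stay zero.
    Element definition above, shifted to 0-based indexing."""
    I1, I2, I3 = len(img_feat), len(aud_feat), len(txt_feat)
    S = np.zeros((I1, I2, I3))
    S[:, 0, 0] = img_feat   # s_{i1,1,1}: image feature values
    S[1, :, 1] = aud_feat   # s_{2,i2,2}: audio feature values
    S[2, 2, :] = txt_feat   # s_{3,3,i3}: text feature values
    return S
```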
The method for dimensionality reduction and subspace embedding of the original high-dimensional tensors according to the intrinsic manifold structure of the video tensor shots, achieved by seeking transformation matrices, is as follows: given the shot data set $X = \{X_1, X_2, \ldots, X_N\}$ in the space $\mathbb{R}^{I_1 \times I_2 \times I_3}$, according to the intrinsic manifold structure of the tensor shots and spectral graph theory, three transformation matrices are sought for each tensor shot $X_i|_{i=1}^N$ in $X$: $T_1^i$ of size $J_1 \times I_1$, $T_2^i$ of size $J_2 \times I_2$, and $T_3^i$ of size $J_3 \times I_3$, which map the $N$ data points to the set $Y = \{Y_1, Y_2, \ldots, Y_N\}$ in the space $\mathbb{R}^{J_1 \times J_2 \times J_3}$ ($J_1 < I_1$, $J_2 < I_2$, $J_3 < I_3$), satisfying $Y_i|_{i=1}^N = X_i \times_1 T_1^{iT} \times_2 T_2^{iT} \times_3 T_3^{iT}$. The low-dimensional data set $Y$ then reflects the intrinsic geometric topology of the manifold space of the set $X$; at the same time, the mapping remains linear, that is, for a data point $X_t$ outside the training set, its mapping in the low-dimensional subspace can be computed directly from the transformation matrices obtained in advance by training.
Let $X \in \mathbb{R}^{I_1 \times I_2 \times I_3}$ denote a 3rd-order tensor shot. Given a data set $X = \{X_1, X_2, \ldots, X_N\}$ of $N$ tensor shots distributed on a manifold $M \in \mathbb{R}^{I_1 \times I_2 \times I_3}$, we can construct a nearest-neighbor graph $G$ to model the local geometric structure of $M$. The weight matrix $W$ of $G$ is defined as
$$W_{ij} = \begin{cases} e^{-\|X_i - X_j\|^2 / c}, & \text{if } X_i \text{ and } X_j \text{ are neighbors in } G \\ 0, & \text{otherwise} \end{cases}$$
where $c$ is a constant.
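A sketch of constructing $W$ and $D$ under these definitions, assuming a symmetrized k-nearest-neighbor graph and Frobenius distances between tensor shots ($k$ and $c$ are tuning parameters the patent leaves open):

```python
import numpy as np

def weight_matrix(X, k=5, c=1.0):
    """Heat-kernel weights on a k-NN graph over the tensor shots.
    X: array of shape (N, I1, I2, I3); returns (W, D)."""
    N = X.shape[0]
    flat = X.reshape(N, -1)
    d2 = ((flat[:, None, :] - flat[None, :, :]) ** 2).sum(-1)  # ||X_i - X_j||^2
    W = np.zeros((N, N))
    for i in range(N):
        nbrs = np.argsort(d2[i])[1:k + 1]   # k nearest neighbors, skipping self
        W[i, nbrs] = np.exp(-d2[i, nbrs] / c)
    W = np.maximum(W, W.T)                  # symmetrize the graph
    D = np.diag(W.sum(axis=1))              # D_ii = sum_j W_ij
    return W, D
```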
For each tensor shot $X_i$ ($1 \le i \le N$), following the higher-order singular value decomposition (HOSVD), we perform SVD on each of the mode-$k$ ($k = 1, 2, 3$) unfolding matrices $X_{(1)}^i$, $X_{(2)}^i$, $X_{(3)}^i$ and compute the left matrices $U_1^i$, $U_2^i$, $U_3^i$. For instance, $U_1^i$ is the left matrix obtained by SVD of the mode-1 unfolding matrix $X_{(1)}^i$ of $X_i$.
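These two operations can be written compactly as below; the sketch assumes the common "move mode k to the front and flatten the rest" unfolding convention, which changes only the column ordering and therefore leaves the left singular matrix unaffected:

```python
import numpy as np

def mode_unfolding(X, k):
    """Mode-k unfolding X_(k): mode k becomes the rows, the rest the columns."""
    return np.moveaxis(X, k, 0).reshape(X.shape[k], -1)

def left_svd_matrix(X, k):
    """Left singular matrix U_k of the mode-k unfolding, as in HOSVD."""
    U, _, _ = np.linalg.svd(mode_unfolding(X, k), full_matrices=True)
    return U  # shape (I_k, I_k)
```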
Now, given $U_1^i \in \mathbb{R}^{I_1 \times I_1}$ ($1 \le i \le N$), we want to find an $I_1 \times J_1$ matrix $V_1$ that maps $U_1^i$ to $T_1^{iT} \in \mathbb{R}^{I_1 \times J_1}$, that is, $T_1^i = V_1^T U_1^i \in \mathbb{R}^{J_1 \times I_1}$. We consider this problem from two angles. On the one hand, to preserve the intrinsic manifold structure, the optimal solution of the following objective function must be sought:
$$\min_{V_1} \sum_{ij} \|V_1^T U_1^i - V_1^T U_1^j\|^2 W_{ij}$$
That is to say, minimizing $\sum_{ij} \|V_1^T U_1^i - V_1^T U_1^j\|^2 W_{ij}$ guarantees that when $U_1^i$ and $U_1^j$ are "close", $V_1^T U_1^i$ and $V_1^T U_1^j$ are also "close".
$D$ is the diagonal matrix of $W$, that is, $D_{ii} = \sum_j W_{ij}$, and for a matrix $A$, the trace satisfies $\|A\|^2 = \mathrm{tr}(AA^T)$. We therefore have:
$$\begin{aligned}
\frac{1}{2}\sum_{ij} \|V_1^T U_1^i - V_1^T U_1^j\|^2 W_{ij}
&= \frac{1}{2}\sum_{ij} \mathrm{tr}\big((T_1^i - T_1^j)(T_1^i - T_1^j)^T\big) W_{ij} \\
&= \frac{1}{2}\sum_{ij} \mathrm{tr}\big(T_1^i T_1^{iT} + T_1^j T_1^{jT} - T_1^i T_1^{jT} - T_1^j T_1^{iT}\big) W_{ij} \\
&= \mathrm{tr}\Big(\sum_i D_{ii} T_1^i T_1^{iT} - \sum_{ij} W_{ij} T_1^i T_1^{jT}\Big) \\
&= \mathrm{tr}\Big(\sum_i D_{ii} V_1^T U_1^i U_1^{iT} V_1 - \sum_{ij} W_{ij} V_1^T U_1^i U_1^{jT} V_1\Big) \\
&= \mathrm{tr}\Big(V_1^T \Big(\sum_i D_{ii} U_1^i U_1^{iT} - \sum_{ij} W_{ij} U_1^i U_1^{jT}\Big) V_1\Big) \\
&= \mathrm{tr}\big(V_1^T (D_U - W_U) V_1\big)
\end{aligned}$$
where $D_U = \sum_i D_{ii} U_1^i U_1^{iT}$ and $W_U = \sum_{ij} W_{ij} U_1^i U_1^{jT}$. From the derivation above it can be seen that to solve $\min_{V_1} \sum_{ij} \|V_1^T U_1^i - V_1^T U_1^j\|^2 W_{ij}$, we need to minimize $\mathrm{tr}(V_1^T (D_U - W_U) V_1)$.
On the other hand, besides preserving the graph structure of the manifold, the overall variance on the manifold space also needs to be maximized. In general, the variance of a random variable $x$ is
$$\mathrm{var}(x) = \int_M (x - \mu)^2 \, dP(x), \qquad \mu = \int_M x \, dP(x)$$
where $M$ is the manifold of the data, $\mu$ is the expectation, and $dP$ is the probability density function. By spectral graph theory, $dP$ can be discretely estimated from the diagonal matrix $D$ ($D_{ii} = \sum_j W_{ij}$) of the sample points. We thus have the following derivation:
$$\mathrm{var}(T_1) = \sum_i \|T_1^i\|^2 D_{ii} = \sum_i \mathrm{tr}(T_1^i T_1^{iT}) D_{ii} = \sum_i \mathrm{tr}(V_1^T U_1^i U_1^{iT} V_1) D_{ii} = \mathrm{tr}\Big(V_1^T \Big(\sum_i D_{ii} U_1^i U_1^{iT}\Big) V_1\Big) = \mathrm{tr}(V_1^T D_U V_1)$$
Combining the constraints from the two aspects above, we obtain the following optimization problem:
$$\min_{V_1} \frac{\mathrm{tr}\big(V_1^T (D_U - W_U) V_1\big)}{\mathrm{tr}\big(V_1^T D_U V_1\big)}$$
Obviously, the optimal $V_1$ is given by the generalized eigenvectors of $(D_U - W_U, D_U)$. We can therefore obtain the optimal $V_1$ by solving the following generalized eigenvector problem:
$$(D_U - W_U) V_1 = \lambda D_U V_1$$
Once $V_1$ is computed, $T_1^i = V_1^T U_1^i \in \mathbb{R}^{J_1 \times I_1}$ can be obtained from $U_1^i \in \mathbb{R}^{I_1 \times I_1}$ ($1 \le i \le N$). Likewise, the intermediate transformation matrices $V_2$ and $V_3$ for the audio and text modalities can be computed by the same method, so that $T_2^i = V_2^T U_2^i \in \mathbb{R}^{J_2 \times I_2}$ is obtained from $U_2^i \in \mathbb{R}^{I_2 \times I_2}$ ($1 \le i \le N$) and $V_2$, and $T_3^i = V_3^T U_3^i \in \mathbb{R}^{J_3 \times I_3}$ is obtained from $U_3^i \in \mathbb{R}^{I_3 \times I_3}$ ($1 \le i \le N$) and $V_3$. The data in the low-dimensional video tensor shot set $Y$ are then $Y_i|_{i=1}^N = X_i \times_1 T_1^{iT} \times_2 T_2^{iT} \times_3 T_3^{iT} \in \mathbb{R}^{J_1 \times J_2 \times J_3}$.
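A sketch of this step with scipy, forming $D_U$ and $W_U$ from the per-shot left matrices and keeping the $J$ eigenvectors with the smallest eigenvalues (it assumes $D_U$ is positive definite so the generalized symmetric solver applies):

```python
import numpy as np
from scipy.linalg import eigh

def intermediate_matrix(U_list, W, D, J):
    """Solve (D_U - W_U) v = lambda * D_U v for one mode.
    U_list: per-shot left SVD matrices U^i (each I x I); W, D as above.
    Returns the intermediate matrix V (I x J) and per-shot T^i = V^T U^i."""
    N, I = len(U_list), U_list[0].shape[0]
    D_U = sum(D[i, i] * U_list[i] @ U_list[i].T for i in range(N))
    W_U = np.zeros((I, I))
    for i in range(N):
        for j in range(N):
            if W[i, j] != 0.0:
                W_U += W[i, j] * U_list[i] @ U_list[j].T
    # scipy's eigh solves the generalized symmetric problem A v = lambda B v
    vals, vecs = eigh(D_U - W_U, D_U)
    V = vecs[:, :J]                      # smallest-eigenvalue directions
    T = [V.T @ U_i for U_i in U_list]    # T^i = V^T U^i, each J x I
    return V, T
```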
Below is the algorithm for subspace embedding and dimension reduction of tensor shots (an illustrative sketch of the final mode-product step follows the listing).
Input: the original training tensor shot set $X = \{X_1, X_2, \ldots, X_N\} \subset \mathbb{R}^{I_1 \times I_2 \times I_3}$;
Output: the mapped low-dimensional tensor shot set $Y = \{Y_1, Y_2, \ldots, Y_N\} \subset \mathbb{R}^{J_1 \times J_2 \times J_3}$; the intermediate transformation matrices $V_1 \in \mathbb{R}^{I_1 \times J_1}$, $V_2 \in \mathbb{R}^{I_2 \times J_2}$, and $V_3 \in \mathbb{R}^{I_3 \times J_3}$; and the transformation matrices $T_1^i = V_1^T U_1^i \in \mathbb{R}^{J_1 \times I_1}$, $T_2^i = V_2^T U_2^i \in \mathbb{R}^{J_2 \times I_2}$, and $T_3^i = V_3^T U_3^i \in \mathbb{R}^{J_3 \times I_3}$, satisfying $Y_i|_{i=1}^N = X_i \times_1 T_1^{iT} \times_2 T_2^{iT} \times_3 T_3^{iT}$.
Algorithm description:
Step 1: construct a nearest-neighbor graph $G$;
Step 2: compute the weight matrix $W$;
Step 3: For $k = 1$ to 3
Step 4: For $i = 1$ to $N$
Step 5: compute the left matrix $U_{(k)}^i$ of the SVD of the mode-$k$ unfolding matrix $X_{(k)}^i$ of $X_i$;
Step 6: End;
Step 7: $D_U = \sum_i D_{ii} U_{(k)}^i U_{(k)}^{iT}$;
Step 8: $W_U = \sum_{ij} W_{ij} U_{(k)}^i U_{(k)}^{jT}$;
Step 9: solve the generalized eigenvector problem $(D_U - W_U) V_k = \lambda D_U V_k$ to obtain the optimal $V_k$;
Step 10: For $i = 1$ to $N$
Step 11: $T_{(k)}^i = V_k^T U_{(k)}^i \in \mathbb{R}^{J_k \times I_k}$;
Step 12: End;
Step 13: End;
Step 14: For $i = 1$ to $N$
Step 15: $Y_i = X_i \times_1 T_1^{iT} \times_2 T_2^{iT} \times_3 T_3^{iT}$;
Step 16: End.
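Steps 14 to 16 are plain mode products; a numpy sketch with each $T_k$ of size $J_k \times I_k$:

```python
import numpy as np

def embed_shot(X, T1, T2, T3):
    """Multilinear projection Y = X x1 T1 x2 T2 x3 T3 for one shot,
    with T_k of shape (J_k, I_k); yields Y in R^{J1 x J2 x J3}."""
    Y = np.einsum('abc,ja->jbc', X, T1)  # mode-1 product with T1
    Y = np.einsum('jbc,kb->jkc', Y, T2)  # mode-2 product with T2
    Y = np.einsum('jkc,lc->jkl', Y, T3)  # mode-3 product with T3
    return Y
```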
The method for building the classifier model on the dimension-reduced set of video tensor shots with a support tensor machine is as follows: in this step, we adopt the support tensor machine to train the classifier of tensor shots. The input of the training model is the low-dimensional tensors $Y_i$ obtained in the previous step by subspace embedding and dimensionality reduction, rather than the original $X_i$. This treatment not only improves accuracy but also improves the efficiency of training and classification.
The algorithm for training the classifier with the support tensor machine is as follows (an illustrative sketch follows the listing).
Input: the mapped low-dimensional tensor shot set $Y_i|_{i=1}^N = X_i \times_1 T_1^{iT} \times_2 T_2^{iT} \times_3 T_3^{iT} \in \mathbb{R}^{J_1 \times J_2 \times J_3}$, and the corresponding class labels $y_i \in \{+1, -1\}$;
Output: the tensor hyperplane parameters of the classifier model, $w_k|_{k=1}^3 \in \mathbb{R}^{J_k}$ and $b \in \mathbb{R}$;
Algorithm description:
Step 1: set $w_k|_{k=1}^3$ to random unit vectors in $\mathbb{R}^{J_k}$;
Step 2: repeat Steps 3 to 5 until convergence;
Step 3: For $j = 1$ to 3
Step 4: obtain $w_j \in \mathbb{R}^{J_j}$ and $b$ by solving the optimization problem
$$\min_{w_j, b, \xi} J_{C\text{-}STM}(w_j, b, \xi) = \frac{\eta}{2}\|w_j\|_{Fro}^2 + c\sum_{i=1}^N \xi_i \quad \text{s.t.} \quad y_i\Big[w_j^T\Big(Y_i \prod_{\substack{k=1 \\ k \ne j}}^{3} \times_k w_k\Big) + b\Big] \ge 1 - \xi_i,\ 1 \le i \le N,\ \xi \ge 0$$
where $c$ is a constant, $\xi$ is the slack variable, and $\eta = \prod_{1 \le k \le 3,\, k \ne j} \|w_k\|_{Fro}^2$;
Step 5: End;
Step 6: check convergence: if $\sum_{k=1}^3 \big[\,|w_{k,t}^T w_{k,t-1}| \cdot \|w_{k,t}\|_{Fro}^{-2} - 1\,\big] \le \epsilon$, then $w_k|_{k=1}^3$ has converged; here $w_{k,t}$ is the current projection vector and $w_{k,t-1}$ is the previous projection vector;
Step 7: End.
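A hedged sketch of this alternating procedure: under the standard reduction, each inner problem in Step 4 is a linear SVM on the mode-$j$ contractions, with the $\eta$ factor folded into the SVM's C parameter (our reduction); scikit-learn's SVC stands in for the patent's unspecified solver, and the stopping rule mirrors Step 6:

```python
import numpy as np
from sklearn.svm import SVC

def train_stm(Y, y, c=1.0, eps=1e-4, max_iter=50):
    """Alternating-projection support tensor machine (illustrative).
    Y: (N, J1, J2, J3) embedded shots; y: labels in {+1, -1}."""
    J = Y.shape[1:]
    rng = np.random.default_rng(0)
    w = [rng.standard_normal(Jk) for Jk in J]
    w = [wk / np.linalg.norm(wk) for wk in w]   # Step 1: random unit vectors
    b = 0.0
    for _ in range(max_iter):
        w_prev = [wk.copy() for wk in w]
        for j in range(3):                       # Step 3: loop over modes
            # contract Y_i with w_k along every mode k other than j
            others = [k for k in range(3) if k != j]
            Xj = Y
            for k in sorted(others, reverse=True):
                Xj = np.tensordot(Xj, w[k], axes=([k + 1], [0]))
            eta = np.prod([np.dot(w[k], w[k]) for k in others])
            svm = SVC(kernel='linear', C=c / eta).fit(Xj, y)   # Step 4
            w[j], b = svm.coef_.ravel(), float(svm.intercept_[0])
        # Step 6: stop when consecutive projection vectors stop changing
        drift = sum(abs(abs(np.dot(w[k], w_prev[k])) / np.dot(w[k], w[k]) - 1.0)
                    for k in range(3))
        if drift <= eps:
            break
    return w, b
```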
For a test shot, after projection with the transformation matrices computed from the training set, semantic concept detection is performed with the classifier model: in this step, we use the classifier model obtained by the preceding training to detect new data outside the training set. Because our dimensionality reduction method is linear, new data can be mapped directly into the low-dimensional subspace and then classified by the classifier.
Let $X_t$ be a detection example outside the training set; the following algorithm gives the detection procedure (an illustrative sketch follows the listing).
Input: the shot to be detected $X_t \in \mathbb{R}^{I_1 \times I_2 \times I_3}$, the intermediate transformation matrices $V_1$, $V_2$, $V_3$, and the classifier parameters $w_k|_{k=1}^3 \in \mathbb{R}^{J_k}$ and $b \in \mathbb{R}$;
Output: the class label $z_t \in \{+1, -1\}$ of $X_t$;
Algorithm description:
Step 1: For $k = 1$ to 3
Step 2: compute the left matrix $U_{(k)}^t$ of the SVD of the mode-$k$ unfolding matrix $X_{(k)}^t$ of $X_t$;
Step 3: compute $T_{(k)}^t = V_k^T U_{(k)}^t$;
Step 4: End;
Step 5: compute $Y_t = X_t \times_1 T_1^{tT} \times_2 T_2^{tT} \times_3 T_3^{tT}$;
Step 6: compute $z_t = \mathrm{sign}(Y_t \times_1 w_1 \times_2 w_2 \times_3 w_3 + b)$;
Step 7: End.
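The detection procedure, reusing the left_svd_matrix and embed_shot sketches above (again an illustration under our assumptions, not the patent's reference implementation):

```python
import numpy as np

def detect(Xt, V, w, b):
    """Classify an unseen shot Xt (I1 x I2 x I3): project each mode with the
    intermediate matrices V = (V1, V2, V3) learned on the training set, then
    evaluate the tensor hyperplane."""
    T = [V[k].T @ left_svd_matrix(Xt, k) for k in range(3)]  # T_k^t = V_k^T U_k^t
    Yt = embed_shot(Xt, T[0], T[1], T[2])                    # low-dim embedding
    score = np.einsum('abc,a,b,c->', Yt, w[0], w[1], w[2])   # Y_t x1 w1 x2 w2 x3 w3
    return int(np.sign(score + b))                           # z_t in {+1, -1}
```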

Claims (5)

1. A multimodal video semantic concept detection method based on tensor representation, characterized by comprising the following steps:
1) extracting low-level features of three modalities (image, audio, and text) from the video shots in both the training set and the test set, each video tensor shot being expressed as a 3rd-order tensor formed from these three kinds of low-level features;
2) according to the intrinsic manifold structure of the set of video tensor shots, achieving dimensionality reduction and subspace embedding of the original high-dimensional tensors by seeking transformation matrices;
3) building a classifier model on the dimension-reduced set of video tensor shots with a support tensor machine;
4) for a test shot, after projection with the transformation matrices computed from the training set, performing semantic concept detection with the classifier model;
wherein the representation of said video tensor shot is: based on the image, audio, and text low-level features extracted from the video, each video shot is represented by a 3rd-order tensor $S \in \mathbb{R}^{I_1 \times I_2 \times I_3}$, where $I_1$, $I_2$, and $I_3$ are the dimensions of the image, audio, and text features, respectively; the value of each element $s_{i_1 i_2 i_3}$ is defined as follows: $s_{i_1,1,1}$ is the value of an image feature, with $1 \le i_1 \le I_1$; $s_{2,i_2,2}$ is the value of an audio feature, with $1 \le i_2 \le I_2$; $s_{3,3,i_3}$ is the value of a text feature, with $1 \le i_3 \le I_3$; the values of all other elements are initially set to zero.

2. The multimodal video semantic concept detection method based on tensor representation according to claim 1, characterized in that said extracting of low-level features of the three modalities (image, audio, and text) from the video shots in the training set and the test set is: a key frame is selected from each video shot as its representative image, and the color histogram, texture, and Canny edges are extracted as image features; the audio segment corresponding to a video shot is extracted as an audio clip and divided into overlapping short-time audio frames, and the features of each short-time audio frame, including MFCC, centroid, roll-off frequency, spectral flux, and zero-crossing rate, are extracted to form frame feature vectors, statistics of the short-time frame feature vectors then serving as the audio features of the video shot; TF*IDF values are extracted from the recognized transcript of the video as text features.

3. The multimodal video semantic concept detection method based on tensor representation according to claim 1, characterized in that said method of achieving dimensionality reduction and subspace embedding of the original high-dimensional tensors by seeking transformation matrices according to the intrinsic manifold structure of the video tensor shots is: given the shot data set $X = \{X_1, X_2, \ldots, X_N\}$ in the space $\mathbb{R}^{I_1 \times I_2 \times I_3}$, according to the intrinsic manifold structure of the tensor shots and spectral graph theory, three transformation matrices are sought for each tensor shot $X_i|_{i=1}^N$ in $X$: $T_1^i$ of size $J_1 \times I_1$, $T_2^i$ of size $J_2 \times I_2$, and $T_3^i$ of size $J_3 \times I_3$, which map the $N$ data points to $Y = \{Y_1, Y_2, \ldots, Y_N\}$ in the space $\mathbb{R}^{J_1 \times J_2 \times J_3}$, satisfying $Y_i|_{i=1}^N = X_i \times_1 T_1^{iT} \times_2 T_2^{iT} \times_3 T_3^{iT}$, where $J_1 < I_1$, $J_2 < I_2$, $J_3 < I_3$, thereby realizing dimensionality reduction and subspace embedding of the original high-dimensional tensors; when obtaining $T_1^i|_{i=1}^N$, the optimal intermediate transformation matrix $V_1$ is computed by solving the generalized eigenvector problem $(D_U - W_U)V_1 = \lambda D_U V_1$, where $D_U = \sum_i D_{ii} U_1^i U_1^{iT}$, $W_U = \sum_{ij} W_{ij} U_1^i U_1^{jT}$, $W$ is the weight matrix of the nearest-neighbor graph constructed from the training set $X$, $D$ is the diagonal matrix of $W$ with $D_{ii} = \sum_j W_{ij}$, and $U_1^i$ is the left matrix obtained by SVD of the mode-1 unfolding matrix $X_{(1)}^i$ of $X_i|_{i=1}^N$; the transformation matrices $T_1^i|_{i=1}^N = V_1^T U_1^i \in \mathbb{R}^{J_1 \times I_1}$ can then finally be computed; $T_2^i$ and $T_3^i$ are obtained in the same way.

4. The multimodal video semantic concept detection method based on tensor representation according to claim 1, characterized in that said method of building a classifier model on the dimension-reduced set of video tensor shots with a support tensor machine is: the input of the classifier model is the low-dimensional tensors $Y_i|_{i=1}^N \in \mathbb{R}^{J_1 \times J_2 \times J_3}$ obtained by subspace embedding and dimensionality reduction, together with the corresponding class labels $y_i \in \{+1, -1\}$, and the output is the tensor hyperplane parameters $w_k|_{k=1}^3 \in \mathbb{R}^{J_k}$ and $b \in \mathbb{R}$ of the classifier model; $w_k|_{k=1}^3$ and $b$ are obtained by iteratively solving the optimization problem
$$\min_{w_j, b, \xi} J_{C\text{-}STM}(w_j, b, \xi) = \frac{\eta}{2}\|w_j\|_{Fro}^2 + c\sum_{i=1}^N \xi_i \quad \text{s.t.} \quad y_i\Big[w_j^T\Big(Y_i \prod_{\substack{k=1 \\ k \ne j}}^{3} \times_k w_k\Big) + b\Big] \ge 1 - \xi_i,\ 1 \le i \le N,\ \xi \ge 0$$
where the parameter $j$ cycles from 1 to 3, $c$ is a constant, $\xi$ is the slack variable, and $\eta = \prod_{1 \le k \le 3,\, k \ne j} \|w_k\|_{Fro}^2$.

5. The multimodal video semantic concept detection method based on tensor representation according to claim 1, characterized in that said detection for a test shot, projected with the transformation matrices computed from the training set and then subjected to semantic concept detection with the classifier model, is: new data outside the training set, $X_t \in \mathbb{R}^{I_1 \times I_2 \times I_3}$, is mapped into the low-dimensional subspace by the transformation matrices $T_1^t = V_1^T U_1^t \in \mathbb{R}^{J_1 \times I_1}$, $T_2^t = V_2^T U_2^t \in \mathbb{R}^{J_2 \times I_2}$, and $T_3^t = V_3^T U_3^t \in \mathbb{R}^{J_3 \times I_3}$ as $Y_t = X_t \times_1 T_1^{tT} \times_2 T_2^{tT} \times_3 T_3^{tT} \in \mathbb{R}^{J_1 \times J_2 \times J_3}$; class detection is then performed with the classifier model, that is, $z_t = \mathrm{sign}(Y_t \times_1 w_1 \times_2 w_2 \times_3 w_3 + b)$ is computed to obtain the class label $z_t \in \{+1, -1\}$ of the test data.
CN2008100591256A 2008-01-14 2008-01-14 Multimodal Video Semantic Concept Detection Method Based on Tensor Representation Expired - Fee Related CN101299241B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN2008100591256A CN101299241B (en) 2008-01-14 2008-01-14 Multimodal Video Semantic Concept Detection Method Based on Tensor Representation

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN2008100591256A CN101299241B (en) 2008-01-14 2008-01-14 Multimodal Video Semantic Concept Detection Method Based on Tensor Representation

Publications (2)

Publication Number Publication Date
CN101299241A CN101299241A (en) 2008-11-05
CN101299241B true CN101299241B (en) 2010-06-02

Family

ID=40079065

Family Applications (1)

Application Number Title Priority Date Filing Date
CN2008100591256A Expired - Fee Related CN101299241B (en) 2008-01-14 2008-01-14 Multimodal Video Semantic Concept Detection Method Based on Tensor Representation

Country Status (1)

Country Link
CN (1) CN101299241B (en)

Families Citing this family (26)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20110078224A1 (en) * 2009-09-30 2011-03-31 Wilson Kevin W Nonlinear Dimensionality Reduction of Spectrograms
CN101996327B (en) * 2010-09-02 2012-08-08 西安电子科技大学 Video anomaly detection method based on weighted tensor subspace background modeling
US8819019B2 (en) * 2010-11-18 2014-08-26 Qualcomm Incorporated Systems and methods for robust pattern classification
CN103312938B (en) * 2012-03-16 2016-07-06 富士通株式会社 Video process apparatus, method for processing video frequency and equipment
CN102750349B (en) * 2012-06-08 2014-10-08 华南理工大学 Video browsing method based on video semantic modeling
CN103455705A (en) * 2013-05-24 2013-12-18 中国科学院自动化研究所 Analysis and prediction system for cooperative correlative tracking and global situation of network social events
CN103336832A (en) * 2013-07-10 2013-10-02 中国科学院自动化研究所 Video classifier construction method based on quality metadata
CN103473308B (en) * 2013-09-10 2017-02-01 浙江大学 High-dimensional multimedia data classifying method based on maximum margin tensor study
CN103473307B (en) * 2013-09-10 2016-07-13 浙江大学 Cross-media sparse hash index method
US9818032B2 (en) * 2015-10-28 2017-11-14 Intel Corporation Automatic video summarization
CN105701504B (en) * 2016-01-08 2019-09-13 天津大学 Multimodal Manifold Embedding Method for Zero-Shot Learning
CN105701514B (en) * 2016-01-15 2019-05-21 天津大学 A method of the multi-modal canonical correlation analysis for zero sample classification
CN105718940B (en) * 2016-01-15 2019-03-29 天津大学 The zero sample image classification method based on factorial analysis between multiple groups
CN106529435B (en) * 2016-10-24 2019-10-15 天津大学 Action recognition method based on tensor quantization
CN107341522A (en) * 2017-07-11 2017-11-10 重庆大学 A kind of text based on density semanteme subspace and method of the image without tag recognition
CN108595555B (en) * 2018-04-11 2020-12-08 西安电子科技大学 Image retrieval method based on semi-supervised tensor quantum space regression
CN109214302A (en) * 2018-08-13 2019-01-15 湖南志东科技有限公司 One kind being based on multispectral substance identification
CN109936766B (en) * 2019-01-30 2021-04-13 天津大学 An end-to-end audio generation method for water scenes
CN110209758B (en) * 2019-04-18 2021-09-03 同济大学 Text increment dimension reduction method based on tensor decomposition
CN110097010A (en) * 2019-05-06 2019-08-06 北京达佳互联信息技术有限公司 Picture and text detection method, device, server and storage medium
CN112257857B (en) * 2019-07-22 2024-06-04 中科寒武纪科技股份有限公司 Tensor processing method and related product
CN111400601B (en) * 2019-09-16 2023-03-10 腾讯科技(深圳)有限公司 Video recommendation method and related equipment
CN111222011B (en) * 2020-01-06 2023-11-14 腾讯科技(深圳)有限公司 Video vector determining method and device
CN111460971B (en) * 2020-03-27 2023-09-12 北京百度网讯科技有限公司 Video concept detection method and device and electronic equipment
CN112015955B (en) * 2020-09-01 2021-07-30 清华大学 A multimodal data association method and device
CN112804558B (en) * 2021-04-14 2021-06-25 腾讯科技(深圳)有限公司 Video splitting method, device and equipment

Also Published As

Publication number Publication date
CN101299241A (en) 2008-11-05

Similar Documents

Publication Publication Date Title
CN101299241B (en) Multimodal Video Semantic Concept Detection Method Based on Tensor Representation
CN101894276B (en) Training method of human action recognition and recognition method
Changpinyo et al. Synthesized classifiers for zero-shot learning
Shah et al. Multi-view action recognition using contrastive learning
Liu et al. $ p $-Laplacian regularized sparse coding for human activity recognition
CN107392147A (en) A kind of image sentence conversion method based on improved production confrontation network
CN102663015A (en) Video semantic labeling method based on characteristics bag models and supervised learning
Zhang et al. Recognition of emotions in user-generated videos with kernelized features
CN111506773A (en) Video duplicate removal method based on unsupervised depth twin network
Liang et al. 3D human action recognition using a single depth feature and locality-constrained affine subspace coding
CN117150076B (en) A self-supervised video summarization method
CN107967441B (en) Video behavior identification method based on two-channel 3D-2D RBM model
Zhang et al. Video action recognition with Key-detail Motion Capturing based on motion spectrum analysis and multiscale feature fusion
Zhang [Retracted] Sports Action Recognition Based on Particle Swarm Optimization Neural Networks
Roy et al. Sparsity-inducing dictionaries for effective action classification
Li et al. Action recognition with spatio-temporal augmented descriptor and fusion method
CN116385946B (en) Video-oriented target segment location method, system, storage medium and device
Tong et al. Unconstrained Facial expression recognition based on feature enhanced CNN and cross-layer LSTM
Zhang et al. A multi-view camera-based anti-fraud system and its applications
Ma et al. Motion feature retrieval in basketball match video based on multisource motion feature fusion
Arif et al. Trajectory-Based 3D Convolutional Descriptors for Human Action Recognition.
CN109857906B (en) Multi-video abstraction method based on query unsupervised deep learning
Li et al. Action recognition based on depth motion map and hybrid classifier
Xiong Motion Recognition Model of Sports Video Based on Feature Extraction Algorithm
Wang et al. Self-trained video anomaly detection based on teacher-student model

Legal Events

Date Code Title Description
C06 Publication
PB01 Publication
C10 Entry into substantive examination
SE01 Entry into force of request for substantive examination
C14 Grant of patent or utility model
GR01 Patent grant
CF01 Termination of patent right due to non-payment of annual fee

Granted publication date: 20100602

Termination date: 20150114

EXPY Termination of patent right or utility model