CN101299241B - Method for detecting multi-mode video semantic conception based on tensor representation - Google Patents


Info

Publication number
CN101299241B
CN101299241B · CN2008100591256A · CN200810059125A
Authority
CN
China
Prior art keywords
tensor
video
lens
dimension
camera lens
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Expired - Fee Related
Application number
CN2008100591256A
Other languages
Chinese (zh)
Other versions
CN101299241A (en)
Inventor
吴飞
庄越挺
刘亚楠
郭同强
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Zhejiang University ZJU
Original Assignee
Zhejiang University ZJU
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Zhejiang University ZJU filed Critical Zhejiang University ZJU
Priority to CN2008100591256A priority Critical patent/CN101299241B/en
Publication of CN101299241A publication Critical patent/CN101299241A/en
Application granted granted Critical
Publication of CN101299241B publication Critical patent/CN101299241B/en

Landscapes

  • Image Processing (AREA)
  • Image Analysis (AREA)

Abstract

The invention discloses a multi-modal video semantic concept detection method based on tensor representation, comprising the following steps: 1) extract the low-level features of the image, audio and text modalities from the video shots in the training set and test set, each video shot being expressed as a third-order tensor built from these three kinds of features; 2) according to the intrinsic manifold structure of the set of video tensor shots, reduce the dimensionality of the original high-dimensional tensors and embed them in a subspace by seeking transformation matrices; 3) train a classifier model on the dimension-reduced set of video tensor shots with a support tensor machine; 4) project each test shot with the transformation matrices computed from the training set, then perform semantic concept detection with the classifier. The invention makes full use of the multi-modal data in video, represents each video shot as a third-order tensor, and proposes a subspace-embedding dimension reduction method based on this representation, thereby detecting semantic concepts in video shots and supporting better analysis and understanding of video semantics.

Description

Method for detecting multi-modal video semantic concepts based on tensor representation
Technical field
The present invention relates to a method for detecting multi-modal video semantic concepts based on tensor representation. The method expresses each video shot as a third-order tensor, seeks an effective dimension reduction that projects it into a low-dimensional semantic space, and then detects the semantic concepts of the video tensor shots with a trained classifier model. It belongs to the field of video semantic analysis and understanding.
Background technology
With the development and spread of digital imaging devices, and the rapid progress of the film and television industry, computer technology, communication technology, multimedia processing, compression coding and the Internet, large volumes of video data have been produced in fields such as news, film, historical archives and surveillance. Video data carries rich semantics such as people, scenes, objects and events. Video is also time-series data: it contains image, audio and text media, which exhibit temporally correlated co-occurrence. Meanwhile, the fusion and cooperation of multiple modalities plays an important role in narrowing the "semantic gap" between low-level features and high-level semantics. How to exploit the multi-modal and temporal characteristics of video to mine its semantic information, so as to support effective video retrieval and realize the resource-sharing advantages of video data, is a challenging research question.
To represent the multiple media modalities in video, the traditional approach is to concatenate the image, audio and text features into a single vector. Such high-dimensional vectors, however, tend to cause the "curse of dimensionality", and the temporally correlated co-occurrence relations among the modalities in the video are ignored. In recent years, multilinear algebra, that is, higher-order tensors, has been widely applied in fields such as computer vision, information retrieval and signal processing. A tensor is a natural generalization and extension of vectors and matrices, and tensor algebra defines a family of multilinear operations over sets of vector spaces. Moreover, the supervised tensor learning framework, which takes tensors as input, finds the optimal solution by an alternating projection optimization procedure that combines convex optimization with multilinear operations. Based on the supervised tensor learning framework, the traditional support vector machine can be extended to the support tensor machine, enabling the training and application of classifier models.
Summary of the invention
The purpose of this invention is to provide a method for detecting multi-modal video semantic concepts based on tensor representation, comprising the steps:
1) extract the low-level features of the image, audio and text modalities from the video shots in both the training set and the test set; each video tensor shot is expressed as a third-order tensor built from these three kinds of low-level features;
2) according to the intrinsic manifold structure of the set of video tensor shots, reduce the dimensionality of the original high-dimensional tensors and embed them in a subspace by seeking transformation matrices;
3) train a classifier model on the dimension-reduced set of video tensor shots with a support tensor machine;
4) for each test shot, project it with the transformation matrices computed from the training set, then perform semantic concept detection with the classifier model.
Extracting the low-level features of the image, audio and text modalities from the shots in the training and test sets: choose one key frame in each shot as a representative image, then extract its color histogram, texture and Canny edges as the image features. Extract the audio segment corresponding to the shot as an audio clip, divide the clip into overlapping short-time audio frames, and extract from each frame features including MFCC, centroid, roll-off frequency, spectral flux and zero-crossing rate to form a frame feature vector; statistics of the short-time frame feature vectors then serve as the audio features of the shot. From the transcript text obtained by speech recognition, extract TF*IDF values as the text features.
The representation of a video tensor shot: based on the image, audio and text low-level features extracted from the video, each video shot is represented by a third-order tensor S ∈ R^(I1×I2×I3), where I1, I2 and I3 are the dimensionalities of the image, audio and text features respectively. The value of each element s_(i1,i2,i3) is defined as follows: s_(i1,1,1) (1 ≤ i1 ≤ I1) holds the image feature values, s_(2,i2,2) (1 ≤ i2 ≤ I2) holds the audio feature values, and s_(3,3,i3) (1 ≤ i3 ≤ I3) holds the text feature values; all other elements are initially set to zero.
Reducing the dimensionality of the original high-dimensional tensors and embedding them in a subspace by seeking transformation matrices, according to the intrinsic manifold structure of the video tensor shots: given a shot data set X = {X_1, X_2, …, X_N} in the space R^(I1×I2×I3), and following the intrinsic manifold structure of the tensor shots and spectral graph theory, seek for each tensor shot X_i (i = 1, …, N) three transformation matrices: T_1^i of size J1×I1, T_2^i of size J2×I2 and T_3^i of size J3×I3, which map the N data points to a set Y = {Y_1, Y_2, …, Y_N} in the space R^(J1×J2×J3) (J1 < I1, J2 < I2, J3 < I3) satisfying Y_i = X_i ×_1 T_1^iT ×_2 T_2^iT ×_3 T_3^iT (here ×_k denotes the mode-k tensor-matrix product), thereby achieving the dimension reduction and subspace embedding of the original high-dimensional tensors. To obtain T_1^i (i = 1, …, N), solve the generalized eigenvector problem (D^U − W^U)V_1 = λ D^U V_1 for the optimal intermediate matrix V_1, where D^U = Σ_i D_ii U_1^i U_1^iT, W^U = Σ_ij W_ij U_1^i U_1^jT, W is the weight matrix of the nearest-neighbor graph constructed from the training set X, D is the diagonal matrix of W with D_ii = Σ_j W_ij, and U_1^i is the left matrix obtained by the SVD of the mode-1 unfolding matrix X_(1)^i of X_i. Finally the transformation matrices T_1^i = V_1^T U_1^i ∈ R^(J1×I1) can be computed; T_2^i and T_3^i are obtained in the same way.
Building the classifier model with a support tensor machine on the dimension-reduced set of video tensor shots: the inputs of the classifier model are the low-dimensional tensors obtained by the subspace embedding and dimension reduction, Y_i = X_i ×_1 T_1^iT ×_2 T_2^iT ×_3 T_3^iT ∈ R^(J1×J2×J3) (i = 1, …, N), and the corresponding class labels y_i ∈ {+1, −1}; the outputs are the tensor hyperplane parameters of the classifier model, w_k ∈ R^(Jk) (k = 1, 2, 3) and b ∈ R. They are obtained by iteratively solving the optimization problem

min_{w_j, b, ξ} J_{C-STM}(w_j, b, ξ) = (η/2)·||w_j||_Fro² + c·Σ_{i=1}^N ξ_i
s.t. y_i[w_j^T(Y_i Π_{k=1, k≠j}^3 ×_k w_k) + b] ≥ 1 − ξ_i, 1 ≤ i ≤ N, ξ ≥ 0,

where the parameter j cycles from 1 to 3, c is a constant, ξ is a slack variable, and η = Π_{1≤k≤3, k≠j} ||w_k||_Fro².
Detecting semantic concepts for a test shot after projection by the transformation matrices computed from the training set: a new datum X_t ∈ R^(I1×I2×I3) outside the training set is mapped into the low-dimensional subspace as Y_t = X_t ×_1 T_1^tT ×_2 T_2^tT ×_3 T_3^tT ∈ R^(J1×J2×J3) by the transformation matrices T_1^t = V_1^T U_1^t ∈ R^(J1×I1), T_2^t = V_2^T U_2^t ∈ R^(J2×I2) and T_3^t = V_3^T U_3^t ∈ R^(J3×I3); classification is then performed by the classifier model, i.e. computing z_t = sign(Y_t ×_1 w_1 ×_2 w_2 ×_3 w_3 + b) yields the class label z_t ∈ {+1, −1} of the test datum.
Beneficial effects of the present invention:
1) the invention replaces the traditional vector representation of video with tensors, which effectively mitigates the problems brought by the "curse of dimensionality";
2) the invention takes into account the multiple modalities in video (image, audio and text) and the temporally correlated co-occurrence characteristic of video data; fusion and cooperation of the modalities based on this characteristic plays an important role in narrowing the "semantic gap" between low-level features and high-level semantics;
3) by preserving the intrinsic manifold structure of the tensor shot set and following spectral graph theory, the proposed tensor-shot subspace embedding and dimension reduction method not only overcomes the difficulties brought by high dimensionality, but, being a linear method, can also directly project new data outside the training set;
4) the invention trains the classifier model with a support tensor machine, which has good classification and detection capability.
Description of drawings
Fig. 1 is the flow chart of the multi-modal video semantic concept detection method based on tensor representation;
Fig. 2 shows the detection results of the present invention for the semantic concept "Explosion", compared with the ISOMAP and PCA methods, expressed as ROC curves;
Fig. 3 shows the detection results of the present invention for the semantic concept "Sports", compared with the ISOMAP and PCA methods, expressed as ROC curves.
Embodiment
The method for detecting multi-modal video semantic concepts based on tensor representation comprises the steps:
1) extract the low-level features of the image, audio and text modalities from the video shots in both the training set and the test set; each video tensor shot is expressed as a third-order tensor built from these three kinds of low-level features;
2) according to the intrinsic manifold structure of the set of video tensor shots, reduce the dimensionality of the original high-dimensional tensors and embed them in a subspace by seeking transformation matrices;
3) train a classifier model on the dimension-reduced set of video tensor shots with a support tensor machine;
4) for each test shot, project it with the transformation matrices computed from the training set, then perform semantic concept detection with the classifier model.
Extracting the low-level features of the image, audio and text modalities from the shots in the training and test sets: low-level features are features extracted directly from the video source data, as opposed to the high-level features represented by semantic concepts. We extract low-level features from each video shot, comprising image, audio and text features.
Image features: the shot is the basic processing unit; one key frame is chosen in each shot as a representative image, and the color histogram, texture and Canny edges of the key frame are extracted as the image features;
Audio features: the audio segment corresponding to a shot is extracted as an audio clip, the clip is divided into overlapping short-time audio frames, and from each frame the features MFCC, centroid, roll-off frequency, spectral flux and zero-crossing rate are extracted to form a frame feature vector; statistics (mean or variance) of the short-time frame feature vectors then serve as the audio features of the shot;
Text features: we extract features from the transcript text obtained by speech recognition. Because the dimensionality of the text features is much larger than that of the other modalities, and the text contains rich semantic information, Latent Semantic Analysis (LSA) can first be applied to reduce the dimensionality of the text features.
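As a concrete illustration of the LSA step, the following Python sketch (an assumption of this edit, not code from the patent) reduces a TF*IDF matrix with a truncated SVD using numpy; the term-by-shot matrix layout and the name `lsa_reduce` are hypothetical.

```python
import numpy as np

def lsa_reduce(tfidf, dim):
    """Latent Semantic Analysis as a truncated SVD: factor the term-by-shot
    TF*IDF matrix and keep the top `dim` singular directions, giving one
    dim-dimensional text feature vector per shot."""
    U, s, Vh = np.linalg.svd(tfidf, full_matrices=False)
    # Each shot (column of tfidf) is represented by its scaled coordinates
    # in the latent space: row j of (diag(s) @ Vh).T
    return (np.diag(s[:dim]) @ Vh[:dim]).T   # shape: shots x dim
```

When `dim` equals the full rank, the reduced vectors preserve all pairwise inner products between shots; smaller `dim` keeps only the dominant latent topics.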
The representation of a video tensor shot: based on the image, audio and text low-level features extracted from the video, each video shot is represented by a third-order tensor S ∈ R^(I1×I2×I3), where I1, I2 and I3 are the dimensionalities of the image, audio and text features respectively. The value of each element s_(i1,i2,i3) is defined as follows: s_(i1,1,1) (1 ≤ i1 ≤ I1) holds the image feature values, s_(2,i2,2) (1 ≤ i2 ≤ I2) holds the audio feature values, and s_(3,3,i3) (1 ≤ i3 ≤ I3) holds the text feature values; all other elements are initially set to zero.
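The tensor layout described above can be sketched in Python with numpy as follows; `build_shot_tensor` is a hypothetical helper, and the code assumes the stated 1-based index convention (and that each feature dimension is at least 3, so the fixed indices 2 and 3 exist).

```python
import numpy as np

def build_shot_tensor(img, aud, txt):
    """Assemble a third-order tensor S in R^(I1 x I2 x I3) from the three
    modality feature vectors, following the index convention stated above
    (1-based in the text, 0-based here):
      image values on the fibre s_(i1,1,1), audio on s_(2,i2,2),
      text on s_(3,3,i3); every other entry stays zero."""
    I1, I2, I3 = len(img), len(aud), len(txt)
    S = np.zeros((I1, I2, I3))
    S[:, 0, 0] = img   # s_(i1,1,1) = image feature values
    S[1, :, 1] = aud   # s_(2,i2,2) = audio feature values
    S[2, 2, :] = txt   # s_(3,3,i3) = text feature values
    return S
```

Note that the three fibres are disjoint, so no modality overwrites another; the tensor is extremely sparse, which is exactly what the subsequent dimension reduction step addresses.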
Reducing the dimensionality of the original high-dimensional tensors and embedding them in a subspace by seeking transformation matrices, according to the intrinsic manifold structure of the video tensor shots: given a shot data set X = {X_1, X_2, …, X_N} in the space R^(I1×I2×I3), and following the intrinsic manifold structure of the tensor shots and spectral graph theory, seek for each tensor shot X_i (i = 1, …, N) three transformation matrices: T_1^i of size J1×I1, T_2^i of size J2×I2 and T_3^i of size J3×I3, which map the N data points to a set Y = {Y_1, Y_2, …, Y_N} in the space R^(J1×J2×J3) (J1 < I1, J2 < I2, J3 < I3) satisfying Y_i = X_i ×_1 T_1^iT ×_2 T_2^iT ×_3 T_3^iT. The low-dimensional data set Y then reflects the intrinsic geometric topology of the manifold on which the set X lies. Moreover, this mapping is linear; that is to say, for a data point X_t outside the training set, its image in the low-dimensional subspace can be computed directly from the transformation matrices obtained in training.
Let X ∈ R^(I1×I2×I3) denote a third-order tensor shot. Given a data set X = {X_1, X_2, …, X_N} of N tensor shots distributed on a manifold M ⊂ R^(I1×I2×I3), we can build a nearest-neighbor graph G to model the local geometric structure of M. The weight matrix W of G is defined as:
[formula shown only as an image in the source]
where c is a constant.
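Since the exact weight formula survives only as an image in the source, the sketch below assumes the common heat-kernel choice used in Laplacian-eigenmaps-style graph constructions: W_ij = exp(−||X_i − X_j||_F² / c) for k-nearest neighbours and 0 otherwise. The function name, the neighbourhood size and the symmetrisation step are assumptions of this edit.

```python
import numpy as np

def knn_heat_weights(X, k=5, c=1.0):
    """Weight matrix W and degree matrix D of a k-nearest-neighbour graph
    over N tensor shots, with assumed heat-kernel weights
    W_ij = exp(-||X_i - X_j||_F^2 / c) on neighbouring pairs."""
    N = len(X)
    d = np.array([[np.linalg.norm(X[i] - X[j]) for j in range(N)]
                  for i in range(N)])            # pairwise Frobenius distances
    W = np.zeros((N, N))
    for i in range(N):
        nbrs = np.argsort(d[i])[1:k + 1]         # skip self (distance 0)
        W[i, nbrs] = np.exp(-d[i, nbrs] ** 2 / c)
    W = np.maximum(W, W.T)                       # symmetrise the graph
    D = np.diag(W.sum(axis=1))                   # degree matrix D_ii = sum_j W_ij
    return W, D
```

The symmetrised W and its degree matrix D are exactly the quantities consumed by the eigenproblem in the next step.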
For each tensor shot X_i (1 ≤ i ≤ N), following the Higher-Order Singular Value Decomposition (HOSVD), we apply SVD to each of the mode-k unfolding matrices X_(1)^i, X_(2)^i, X_(3)^i (k = 1, 2, 3) of X_i and compute the left matrices U_1^i, U_2^i, U_3^i. For instance, U_1^i is the left matrix obtained by the SVD of the mode-1 unfolding matrix X_(1)^i.
Now, given U_1^i ∈ R^(I1×I1) (1 ≤ i ≤ N), we want to find an I1×J1 matrix V_1 that maps each U_1^i to T_1^iT ∈ R^(I1×J1), i.e. T_1^i = V_1^T U_1^i ∈ R^(J1×I1). We consider this problem from two angles. On the one hand, to preserve the intrinsic manifold structure, the optimum of the following objective must be found:

min_{V_1} Σ_ij ||V_1^T U_1^i − V_1^T U_1^j||² W_ij

That is, minimizing Σ_ij ||V_1^T U_1^i − V_1^T U_1^j||² W_ij guarantees that when U_1^i and U_1^j are "close", V_1^T U_1^i and V_1^T U_1^j are also "close".

Let D be the diagonal matrix of W, i.e. D_ii = Σ_j W_ij, and note that for a matrix A the trace gives ||A||² = tr(AA^T). Then:

(1/2) Σ_ij ||V_1^T U_1^i − V_1^T U_1^j||² W_ij
= (1/2) Σ_ij tr((T_1^i − T_1^j)(T_1^i − T_1^j)^T) W_ij
= (1/2) Σ_ij tr(T_1^i T_1^iT + T_1^j T_1^jT − T_1^i T_1^jT − T_1^j T_1^iT) W_ij
= tr(Σ_i D_ii T_1^i T_1^iT − Σ_ij W_ij T_1^i T_1^jT)
= tr(Σ_i D_ii V_1^T U_1^i U_1^iT V_1 − Σ_ij W_ij V_1^T U_1^i U_1^jT V_1)
= tr(V_1^T (Σ_i D_ii U_1^i U_1^iT − Σ_ij W_ij U_1^i U_1^jT) V_1)
= tr(V_1^T (D^U − W^U) V_1)

where D^U = Σ_i D_ii U_1^i U_1^iT and W^U = Σ_ij W_ij U_1^i U_1^jT. From this derivation it can be seen that solving min_{V_1} Σ_ij ||V_1^T U_1^i − V_1^T U_1^j||² W_ij requires minimizing tr(V_1^T (D^U − W^U) V_1).
On the other hand, besides preserving the graph structure of the manifold, the overall variance on the manifold should be maximized. In general, the variance of a random variable x is

var(x) = ∫_M (x − μ)² dP(x), μ = ∫_M x dP(x)

where M is the data manifold, μ is the expectation and dP is the probability measure. By spectral graph theory, dP can be estimated by discretization through the diagonal matrix D of the sample points (D_ii = Σ_j W_ij). We then have the following derivation:

var(T_1)
= Σ_i ||T_1^i||² D_ii
= Σ_i tr(T_1^i T_1^iT) D_ii
= Σ_i tr(V_1^T U_1^i U_1^iT V_1) D_ii
= tr(V_1^T (Σ_i D_ii U_1^i U_1^iT) V_1)
= tr(V_1^T D^U V_1)
Combining the constraints from both aspects, we obtain the following optimization problem:

min_{V_1} tr(V_1^T (D^U − W^U) V_1) / tr(V_1^T D^U V_1)

Clearly, the optimal V_1 consists of generalized eigenvectors of the pair (D^U − W^U, D^U). We can therefore obtain the optimal V_1 by solving the generalized eigenvector problem

(D^U − W^U) V_1 = λ D^U V_1

Once V_1 is computed, T_1^i = V_1^T U_1^i ∈ R^(J1×I1) follows from U_1^i ∈ R^(I1×I1) (1 ≤ i ≤ N). Likewise, the intermediate matrices V_2 and V_3 for the audio and text modalities are computed with the same method, so T_2^i = V_2^T U_2^i ∈ R^(J2×I2) follows from U_2^i ∈ R^(I2×I2) and V_2, and T_3^i = V_3^T U_3^i ∈ R^(J3×I3) from U_3^i ∈ R^(I3×I3) and V_3. The low-dimensional video tensor shot set Y then consists of Y_i = X_i ×_1 T_1^iT ×_2 T_2^iT ×_3 T_3^iT ∈ R^(J1×J2×J3).
The subspace embedding and dimension reduction algorithm for tensor shots is as follows.
Input: original training tensor shot set X = {X_1, X_2, …, X_N} ⊂ R^(I1×I2×I3);
Output: mapped low-dimensional tensor shot set Y = {Y_1, Y_2, …, Y_N} ⊂ R^(J1×J2×J3); intermediate matrices V_1 ∈ R^(I1×J1), V_2 ∈ R^(I2×J2) and V_3 ∈ R^(I3×J3); and transformation matrices T_1^i = V_1^T U_1^i ∈ R^(J1×I1), T_2^i = V_2^T U_2^i ∈ R^(J2×I2) and T_3^i = V_3^T U_3^i ∈ R^(J3×I3), satisfying Y_i = X_i ×_1 T_1^iT ×_2 T_2^iT ×_3 T_3^iT.
Algorithm:
Step 1: build a nearest-neighbor graph G;
Step 2: compute the weight matrix W;
Step 3: For k = 1 to 3
Step 4:   For i = 1 to N
Step 5:     compute the left matrix U_(k)^i of the SVD of the mode-k unfolding matrix X_(k)^i of X_i;
Step 6:   End;
Step 7:   D^U = Σ_i D_ii U_(k)^i U_(k)^iT;
Step 8:   W^U = Σ_ij W_ij U_(k)^i U_(k)^jT;
Step 9:   solve the generalized eigenvector problem (D^U − W^U) V_k = λ D^U V_k for the optimal V_k;
Step 10:  For i = 1 to N
Step 11:    T_(k)^i = V_k^T U_(k)^i ∈ R^(Jk×Ik);
Step 12:  End;
Step 13: End;
Step 14: For i = 1 to N
Step 15:   Y_i = X_i ×_1 T_1^iT ×_2 T_2^iT ×_3 T_3^iT;
Step 16: End.
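A numpy sketch of Steps 1-16, assuming the graph weights W are already computed (names are hypothetical). One exact simplification is used: because each left SVD factor U_(k)^i is a square orthogonal matrix, D^U = Σ_i D_ii U_(k)^i U_(k)^iT reduces to (Σ_i D_ii)·I, so the generalized eigenproblem becomes an ordinary symmetric eigenproblem.

```python
import numpy as np

def unfold(X, k):
    """Mode-k unfolding: rows indexed by dimension k of X."""
    return np.moveaxis(X, k, 0).reshape(X.shape[k], -1)

def mode_mul(X, M, k):
    """Mode-k product X x_k M: contract dimension k of X with the columns of M."""
    return np.moveaxis(np.tensordot(M, X, axes=(1, k)), 0, k)

def tensor_subspace_embed(Xs, W, dims):
    """Steps 1-16 above. Xs: list of N third-order training tensors,
    W: precomputed graph weight matrix, dims = (J1, J2, J3).
    Returns the projected shots Ys, intermediate matrices Vs and the
    per-shot transformation matrices Ts[k][i] = V_k^T U_(k)^i."""
    N = len(Xs)
    deg = W.sum(axis=1)                                # D_ii = sum_j W_ij
    Vs, Ts = [], []
    for k in range(3):
        U = [np.linalg.svd(unfold(X, k))[0] for X in Xs]   # left factors U_(k)^i
        Ik = Xs[0].shape[k]
        DU = deg.sum() * np.eye(Ik)                    # = sum_i D_ii U U^T (U orthogonal)
        WU = sum(W[i, j] * U[i] @ U[j].T
                 for i in range(N) for j in range(N))  # W^U
        vals, vecs = np.linalg.eigh(DU - WU)           # symmetric eigenproblem
        Vk = vecs[:, :dims[k]]                         # J_k smallest eigenvectors
        Vs.append(Vk)
        Ts.append([Vk.T @ U[i] for i in range(N)])     # T_(k)^i = V_k^T U_(k)^i
    Ys = []
    for i in range(N):
        Y = Xs[i]
        for k in range(3):
            Y = mode_mul(Y, Ts[k][i], k)               # Y_i = X_i x_1 ... x_3 ...
        Ys.append(Y)
    return Ys, Vs, Ts
```

The returned Vs are all that must be stored to project unseen shots later, since each test shot contributes its own U factors.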
Building the classifier model with a support tensor machine on the dimension-reduced set of video tensor shots: in this step, we use a support tensor machine to train the classifier for tensor shots. The input of training is the low-dimensional tensor Y_i obtained in the previous step by subspace embedding and dimension reduction, rather than the original X_i. This not only improves accuracy but also improves the efficiency of training and classification.
The support tensor machine training algorithm is as follows.
Input: mapped low-dimensional tensor shot set Y_i = X_i ×_1 T_1^iT ×_2 T_2^iT ×_3 T_3^iT ∈ R^(J1×J2×J3) (i = 1, …, N), and corresponding class labels y_i ∈ {+1, −1};
Output: tensor hyperplane parameters of the classifier model, w_k ∈ R^(Jk) (k = 1, 2, 3) and b ∈ R;
Algorithm:
Step 1: initialize each w_k (k = 1, 2, 3) as a random unit vector in R^(Jk);
Step 2: repeat Steps 3-5 until convergence;
Step 3: For j = 1 to 3
Step 4:   obtain w_j ∈ R^(Jj) and b by solving the optimization problem
          min_{w_j, b, ξ} J_{C-STM}(w_j, b, ξ) = (η/2)·||w_j||_Fro² + c·Σ_{i=1}^N ξ_i
          s.t. y_i[w_j^T(Y_i Π_{k=1, k≠j}^3 ×_k w_k) + b] ≥ 1 − ξ_i, 1 ≤ i ≤ N, ξ ≥ 0,
          where c is a constant, ξ is a slack variable, and η = Π_{1≤k≤3, k≠j} ||w_k||_Fro²;
Step 5: End;
Step 6: check convergence: the w_k (k = 1, 2, 3) are deemed converged if Σ_{k=1}^3 [|w_{k,t}^T w_{k,t−1}| · ||w_{k,t}||_Fro^(−2) − 1] ≤ ε, where w_{k,t} is the current projection vector and w_{k,t−1} the previous one;
Step 7: End.
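The alternating structure of Steps 1-7 can be sketched as follows. This is a simplified sketch rather than the patent's solver: each per-mode subproblem, which is a linear SVM on the mode-j contractions, is solved here by plain subgradient descent on the hinge loss instead of the constrained QP, and all names are hypothetical.

```python
import numpy as np

def contract_others(Y, ws, j):
    """Project tensor Y onto w_k along every mode k != j, leaving a vector
    of length J_j (descending mode order keeps axis indices valid)."""
    x = Y
    for k in (2, 1, 0):
        if k != j:
            x = np.tensordot(x, ws[k], axes=(k, 0))
    return x

def stm_train(Ys, ys, c=1.0, iters=50, lr=0.01, seed=0):
    """Alternating-projection sketch of the support tensor machine above."""
    rng = np.random.default_rng(seed)
    ws = [rng.standard_normal(Ys[0].shape[k]) for k in range(3)]
    ws = [w / np.linalg.norm(w) for w in ws]     # Step 1: random unit vectors
    b = 0.0
    for _ in range(iters):                       # Step 2: repeat until converged
        for j in range(3):                       # Step 3: cycle over the modes
            eta = np.prod([np.linalg.norm(ws[k]) ** 2
                           for k in range(3) if k != j])
            X = np.array([contract_others(Y, ws, j) for Y in Ys])
            w = ws[j]
            for _ in range(20):                  # hinge-loss subgradient steps
                margins = ys * (X @ w + b)
                viol = margins < 1.0             # margin-violating samples
                gw = eta * w - c * (ys[viol, None] * X[viol]).sum(axis=0)
                gb = -c * ys[viol].sum()
                w, b = w - lr * gw, b - lr * gb
            ws[j] = w
    return ws, b

def stm_predict(Y, ws, b):
    """z = sign(Y x_1 w_1 x_2 w_2 x_3 w_3 + b)."""
    s = Y
    for k in (2, 1, 0):
        s = np.tensordot(s, ws[k], axes=(k, 0))
    return 1 if s + b >= 0 else -1
```

Fixing all modes but one makes the objective an ordinary soft-margin SVM in w_j, which is why the alternating scheme works; a production implementation would call a QP or SMO solver for Step 4 instead of the subgradient loop.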
Detecting semantic concepts for a test shot after projection by the transformation matrices computed from the training set: in this step, new data outside the training set are detected with the classifier model obtained by the preceding training. Because our dimension reduction method is linear, new data can be mapped directly into the low-dimensional subspace and then classified by the classifier.
Let X_t be a detection example outside the training set; the following algorithm gives the detection procedure.
Input: shot to be detected X_t ∈ R^(I1×I2×I3); intermediate matrices V_1, V_2, V_3; classifier parameters w_k ∈ R^(Jk) (k = 1, 2, 3) and b ∈ R;
Output: class label z_t ∈ {+1, −1} of X_t;
Algorithm:
Step 1: For k = 1 to 3
Step 2:   compute the left matrix U_(k)^t of the SVD of the mode-k unfolding matrix X_(k)^t of X_t;
Step 3:   compute T_(k)^t = V_k^T U_(k)^t;
Step 4: End;
Step 5: compute Y_t = X_t ×_1 T_1^tT ×_2 T_2^tT ×_3 T_3^tT;
Step 6: compute z_t = sign(Y_t ×_1 w_1 ×_2 w_2 ×_3 w_3 + b);
Step 7: End.
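The detection procedure above can be sketched in numpy as follows, reusing the unfolding and mode-product helpers; `detect_shot` is a hypothetical name.

```python
import numpy as np

def unfold(X, k):
    """Mode-k unfolding: rows indexed by dimension k of X."""
    return np.moveaxis(X, k, 0).reshape(X.shape[k], -1)

def mode_mul(X, M, k):
    """Mode-k product X x_k M: contract dimension k of X with the columns of M."""
    return np.moveaxis(np.tensordot(M, X, axes=(1, k)), 0, k)

def detect_shot(Xt, Vs, ws, b):
    """Steps 1-7 above: project the unseen shot X_t using the stored V_k and
    the mode-k left SVD factors of X_t itself, then classify with the tensor
    hyperplane (w_1, w_2, w_3, b)."""
    Yt = Xt
    for k in range(3):
        Uk = np.linalg.svd(unfold(Xt, k))[0]   # U_(k)^t from X_t's unfolding
        Tk = Vs[k].T @ Uk                      # T_(k)^t = V_k^T U_(k)^t
        Yt = mode_mul(Yt, Tk, k)               # shrink dimension k to J_k
    s = Yt
    for k in (2, 1, 0):                        # Y_t x_1 w_1 x_2 w_2 x_3 w_3
        s = np.tensordot(s, ws[k], axes=(k, 0))
    return Yt, (1 if s + b >= 0 else -1)
```

Note that only the intermediate matrices V_k are needed from training; the per-shot factors U_(k)^t are recomputed from the test shot, which is what makes the linear projection applicable to unseen data.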

Claims (5)

1. A method for detecting multi-modal video semantic concepts based on tensor representation, characterized by comprising the steps:
1) extract the low-level features of the image, audio and text modalities from the video shots in both the training set and the test set; each video tensor shot is expressed as a third-order tensor built from these three kinds of low-level features;
2) according to the intrinsic manifold structure of the set of video tensor shots, reduce the dimensionality of the original high-dimensional tensors and embed them in a subspace by seeking transformation matrices;
3) train a classifier model on the dimension-reduced set of video tensor shots with a support tensor machine;
4) for each test shot, project it with the transformation matrices computed from the training set, then perform semantic concept detection with the classifier model;
wherein the representation of a video tensor shot is: based on the image, audio and text low-level features extracted from the video, each video shot is represented by a third-order tensor S ∈ R^(I1×I2×I3), where I1, I2 and I3 are respectively the dimensionalities of the image, audio and text features; the value of each element s_(i1,i2,i3) is defined as: s_(i1,1,1) holds the image feature values, where 1 ≤ i1 ≤ I1; s_(2,i2,2) holds the audio feature values, where 1 ≤ i2 ≤ I2; s_(3,3,i3) holds the text feature values, where 1 ≤ i3 ≤ I3; all other elements are initially set to zero.
2. The method for detecting multi-modal video semantic concepts based on tensor representation according to claim 1, characterized in that the low-level features of the image, audio and text modalities are extracted from the video shots in the training and test sets as follows: choose one key frame in each video shot as a representative image, then extract its color histogram, texture and Canny edges as the image features; extract the audio segment corresponding to the video shot as an audio clip, divide the clip into overlapping short-time audio frames, and extract from each frame features including MFCC, centroid, roll-off frequency, spectral flux and zero-crossing rate to form a frame feature vector, with statistics of the short-time frame feature vectors serving as the audio features of the video shot; from the transcript text obtained by speech recognition, extract TF*IDF values as the text features.
3. The method for detecting multi-modal video semantic concepts based on tensor representation according to claim 1, characterized in that the dimensionality of the original high-dimensional tensors is reduced and the subspace embedding is realized by seeking transformation matrices according to the intrinsic manifold structure of the video tensor shots, as follows: given a shot data set X = {X_1, X_2, …, X_N} in the space R^(I1×I2×I3), and following the intrinsic manifold structure of the tensor shots and spectral graph theory, seek for each tensor shot X_i (i = 1, …, N) three transformation matrices, T_1^i of size J1×I1, T_2^i of size J2×I2 and T_3^i of size J3×I3, which map the N data points to Y = {Y_1, Y_2, …, Y_N} in the space R^(J1×J2×J3) satisfying Y_i = X_i ×_1 T_1^iT ×_2 T_2^iT ×_3 T_3^iT, where J1 < I1, J2 < I2, J3 < I3, thereby achieving the dimension reduction and subspace embedding of the original high-dimensional tensors; to obtain T_1^i (i = 1, …, N), solve the generalized eigenvector problem (D^U − W^U)V_1 = λ D^U V_1 for the optimal intermediate matrix V_1, where D^U = Σ_i D_ii U_1^i U_1^iT, W^U = Σ_ij W_ij U_1^i U_1^jT, W is the weight matrix of the nearest-neighbor graph constructed from the training set X, D is the diagonal matrix of W, i.e. D_ii = Σ_j W_ij, and U_1^i is the left matrix obtained by the SVD of the mode-1 unfolding matrix X_(1)^i of X_i; finally the transformation matrices T_1^i = V_1^T U_1^i ∈ R^(J1×I1) can be computed, and T_2^i and T_3^i are obtained in the same way.
4. The method for detecting multi-modal video semantic concepts based on tensor representation according to claim 1, characterized in that the classifier model is built with a support tensor machine on the dimension-reduced set of video tensor shots as follows: the inputs of the classifier model are the low-dimensional tensors obtained by the subspace embedding and dimension reduction, Y_i = X_i ×_1 T_1^iT ×_2 T_2^iT ×_3 T_3^iT ∈ R^(J1×J2×J3) (i = 1, …, N), and the corresponding class labels y_i ∈ {+1, −1}; the outputs are the tensor hyperplane parameters of the classifier model, w_k ∈ R^(Jk) (k = 1, 2, 3) and b ∈ R; they are obtained by iteratively solving the optimization problem min_{w_j, b, ξ} J_{C-STM}(w_j, b, ξ) = (η/2)·||w_j||_Fro² + c·Σ_{i=1}^N ξ_i s.t. y_i[w_j^T(Y_i Π_{k=1, k≠j}^3 ×_k w_k) + b] ≥ 1 − ξ_i, 1 ≤ i ≤ N, ξ ≥ 0, where the parameter j cycles from 1 to 3, c is a constant, ξ is a slack variable, and η = Π_{1≤k≤3, k≠j} ||w_k||_Fro².
5. The method for detecting multi-modal video semantic concepts based on tensor representation according to claim 1, characterized in that the semantic concept detection of a test shot after projection by the transformation matrices computed from the training set is performed as follows: a new datum X_t ∈ R^(I1×I2×I3) outside the training set is mapped into the low-dimensional subspace as Y_t = X_t ×_1 T_1^tT ×_2 T_2^tT ×_3 T_3^tT ∈ R^(J1×J2×J3) by the transformation matrices T_1^t = V_1^T U_1^t ∈ R^(J1×I1), T_2^t = V_2^T U_2^t ∈ R^(J2×I2) and T_3^t = V_3^T U_3^t ∈ R^(J3×I3); classification is then performed by the classifier model, i.e. computing z_t = sign(Y_t ×_1 w_1 ×_2 w_2 ×_3 w_3 + b) yields the class label z_t ∈ {+1, −1} of the test datum.
CN2008100591256A 2008-01-14 2008-01-14 Method for detecting multi-mode video semantic conception based on tensor representation Expired - Fee Related CN101299241B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN2008100591256A CN101299241B (en) 2008-01-14 2008-01-14 Method for detecting multi-mode video semantic conception based on tensor representation

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN2008100591256A CN101299241B (en) 2008-01-14 2008-01-14 Method for detecting multi-mode video semantic conception based on tensor representation

Publications (2)

Publication Number Publication Date
CN101299241A CN101299241A (en) 2008-11-05
CN101299241B true CN101299241B (en) 2010-06-02

Family

ID=40079065

Family Applications (1)

Application Number Title Priority Date Filing Date
CN2008100591256A Expired - Fee Related CN101299241B (en) 2008-01-14 2008-01-14 Method for detecting multi-mode video semantic conception based on tensor representation

Country Status (1)

Country Link
CN (1) CN101299241B (en)

Families Citing this family (26)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20110078224A1 (en) * 2009-09-30 2011-03-31 Wilson Kevin W Nonlinear Dimensionality Reduction of Spectrograms
CN101996327B (en) * 2010-09-02 2012-08-08 西安电子科技大学 Video anomaly detection method based on weighted tensor subspace background modeling
US8819019B2 (en) * 2010-11-18 2014-08-26 Qualcomm Incorporated Systems and methods for robust pattern classification
CN103312938B (en) * 2012-03-16 2016-07-06 富士通株式会社 Video process apparatus, method for processing video frequency and equipment
CN102750349B (en) * 2012-06-08 2014-10-08 华南理工大学 Video browsing method based on video semantic modeling
CN103455705A (en) * 2013-05-24 2013-12-18 中国科学院自动化研究所 Analysis and prediction system for cooperative correlative tracking and global situation of network social events
CN103336832A (en) * 2013-07-10 2013-10-02 中国科学院自动化研究所 Video classifier construction method based on quality metadata
CN103473308B (en) * 2013-09-10 2017-02-01 浙江大学 High-dimensional multimedia data classifying method based on maximum margin tensor study
CN103473307B * 2013-09-10 2016-07-13 浙江大学 Cross-media sparse hash indexing method
US9818032B2 (en) * 2015-10-28 2017-11-14 Intel Corporation Automatic video summarization
CN105701504B * 2016-01-08 2019-09-13 天津大学 Multi-modal manifold embedding method for zero-shot learning
CN105718940B (en) * 2016-01-15 2019-03-29 天津大学 The zero sample image classification method based on factorial analysis between multiple groups
CN105701514B * 2016-01-15 2019-05-21 天津大学 Multi-modal canonical correlation analysis method for zero-shot classification
CN106529435B (en) * 2016-10-24 2019-10-15 天津大学 Action identification method based on tensor quantization
CN107341522A (en) * 2017-07-11 2017-11-10 重庆大学 A kind of text based on density semanteme subspace and method of the image without tag recognition
CN108595555B * 2018-04-11 2020-12-08 西安电子科技大学 Image retrieval method based on semi-supervised tensor subspace regression
CN109214302A * 2018-08-13 2019-01-15 湖南志东科技有限公司 Multispectral-based substance identification method
CN109936766B (en) * 2019-01-30 2021-04-13 天津大学 End-to-end-based method for generating audio of water scene
CN110209758B (en) * 2019-04-18 2021-09-03 同济大学 Text increment dimension reduction method based on tensor decomposition
CN110097010A (en) * 2019-05-06 2019-08-06 北京达佳互联信息技术有限公司 Picture and text detection method, device, server and storage medium
CN112257857B (en) * 2019-07-22 2024-06-04 中科寒武纪科技股份有限公司 Tensor processing method and related product
CN110609955B (en) * 2019-09-16 2022-04-05 腾讯科技(深圳)有限公司 Video recommendation method and related equipment
CN111222011B (en) * 2020-01-06 2023-11-14 腾讯科技(深圳)有限公司 Video vector determining method and device
CN111460971B (en) * 2020-03-27 2023-09-12 北京百度网讯科技有限公司 Video concept detection method and device and electronic equipment
CN112015955B (en) * 2020-09-01 2021-07-30 清华大学 Multi-mode data association method and device
CN112804558B (en) * 2021-04-14 2021-06-25 腾讯科技(深圳)有限公司 Video splitting method, device and equipment

Also Published As

Publication number Publication date
CN101299241A (en) 2008-11-05

Similar Documents

Publication Publication Date Title
CN101299241B (en) Method for detecting multi-mode video semantic conception based on tensor representation
CN104899253B Cross-modal image-tag correlation learning method for social images
Hu et al. Learning spatial-temporal features for video copy detection by the combination of CNN and RNN
CN101894276B (en) Training method of human action recognition and recognition method
CN103440668B (en) Method and device for tracing online video target
Amiri et al. Hierarchical keyframe-based video summarization using QR-decomposition and modified k-means clustering
CN112364168A (en) Public opinion classification method based on multi-attribute information fusion
CN108154156A (en) Image Ensemble classifier method and device based on neural topic model
Olaode et al. Unsupervised image classification by probabilistic latent semantic analysis for the annotation of images
Lu et al. Latent semantic learning by efficient sparse coding with hypergraph regularization
CN116385946B (en) Video-oriented target fragment positioning method, system, storage medium and equipment
Wang et al. Action recognition using linear dynamic systems
Li et al. Action recognition with spatio-temporal augmented descriptor and fusion method
Zhang et al. Extractive Document Summarization based on hierarchical GRU
Rebecca et al. Predictive analysis of online television videos using machine learning algorithms
Li et al. Frame aggregation and multi-modal fusion framework for video-based person recognition
Wang Video description with GAN
CN109857906B (en) Multi-video abstraction method based on query unsupervised deep learning
Lu et al. MTCA: a multimodal summarization model based on two-stream cross attention
Hammad et al. Characterizing the impact of using features extracted from pre-trained models on the quality of video captioning sequence-to-sequence models
Luisier et al. BBN VISER TRECVID 2014 Multimedia Event Detection and Multimedia Event Recounting Systems.
Arif et al. Trajectory-Based 3D Convolutional Descriptors for Human Action Recognition.
Hui-bin et al. Recognition of individual object in focus people group based on deep learning
Sun et al. Multimodal micro-video classification based on 3D convolutional neural network
He et al. Human abnormal action identification method in different scenarios

Legal Events

Date Code Title Description
C06 Publication
PB01 Publication
C10 Entry into substantive examination
SE01 Entry into force of request for substantive examination
C14 Grant of patent or utility model
GR01 Patent grant
CF01 Termination of patent right due to non-payment of annual fee

Granted publication date: 20100602

Termination date: 20150114

EXPY Termination of patent right or utility model