CN101299241A - Method for detecting multi-mode video semantic conception based on tensor representation - Google Patents


Info

Publication number
CN101299241A
Authority
CN
China
Prior art keywords
tensor
video
shot
Prior art date
Legal status
Granted
Application number
CNA2008100591256A
Other languages
Chinese (zh)
Other versions
CN101299241B (en)
Inventor
吴飞
庄越挺
刘亚楠
郭同强
Current Assignee
Zhejiang University ZJU
Original Assignee
Zhejiang University ZJU
Priority date
Filing date
Publication date
Application filed by Zhejiang University ZJU filed Critical Zhejiang University ZJU
Priority to CN2008100591256A (granted as CN101299241B)
Publication of CN101299241A
Application granted
Publication of CN101299241B
Expired - Fee Related

Landscapes

  • Image Analysis (AREA)
  • Image Processing (AREA)

Abstract

The invention discloses a method for detecting multi-modal video semantic concepts based on tensor representation, comprising the following steps: 1. extract low-level image, audio, and text features from the video shots in the training and test sets, so that each video shot is expressed as a third-order tensor built from the three kinds of low-level features; 2. following the intrinsic manifold structure of the set of video tensor shots, reduce the dimensionality of the original high-dimensional tensors and embed them in a subspace by seeking transition matrices; 3. train a classifier model on the dimension-reduced set of video tensor shots with a support tensor machine; 4. project each test shot with the transition matrices computed on the training set, then detect semantic concepts with the classifier. The invention makes full use of the multi-modal data in video, represents each shot as a third-order tensor, and proposes a subspace-embedding dimensionality-reduction method on top of this representation, thereby detecting semantic concepts in video shots and supporting better analysis and understanding of video semantics.

Description

Method for detecting multi-mode video semantic conception based on tensor representation
Technical field
The present invention relates to a method for detecting multi-modal video semantic concepts based on tensor representation. The method expresses each video shot as a third-order tensor and seeks an effective dimensionality-reduction mapping that projects it into a low-dimensional semantic space, so that semantic concepts of video tensor shots can be detected with a trained classifier model. It belongs to the field of video semantic analysis and understanding.
Background technology
With the development and spread of digital imaging devices, and the rapid growth of the film and television industry, computer technology, communication technology, multimedia processing, compression coding, and the Internet, large amounts of video data are produced in fields such as news, film, historical archives, and surveillance. Video data carries rich semantics such as people, scenes, objects, and events. Video is also time-series data: it contains the three media of image, audio, and text, which exhibit temporally correlated co-occurrence. At the same time, the fusion and cooperation of multiple modalities play an important role in narrowing the "semantic gap" between low-level features and high-level semantics. How to effectively exploit the multi-modal and temporal characteristics of video to mine its semantic information, so as to support effective video retrieval and realize the resource-sharing advantages of video data, is a challenging research question.
As for how to express the multiple media modalities in video, the traditional approach represents the image, audio, and text features as one concatenated vector. Such a high-dimensional vector, however, tends to cause the "curse of dimensionality", and the temporally correlated co-occurrence relations among the modalities in the video are ignored. In recent years, multilinear algebra, that is, higher-order tensors, has been widely applied in fields such as computer vision, information retrieval, and signal processing. A tensor is a natural generalization and extension of vectors and matrices, and tensor algebra defines a series of multilinear operations over sets of vector spaces. Meanwhile, the supervised tensor learning framework, which takes tensors as input, adopts an alternating-projection optimization procedure to find the optimal solution; it is a combination of convex optimization and multilinear operations. Based on the supervised tensor learning framework, the traditional support vector machine can be extended to the support tensor machine, realizing the training and application of classifier models.
Summary of the invention
The purpose of this invention is to provide a method for detecting multi-modal video semantic concepts based on tensor representation.
The method comprises the following steps:
1) extract low-level features of the three modalities (image, audio, text) from every video shot in the training and test sets; each video tensor shot is then expressed as a third-order tensor built from these three kinds of low-level features;
2) following the intrinsic manifold structure of the set of video tensor shots, reduce the dimensionality of the original high-dimensional tensors and embed them in a subspace by seeking transition matrices;
3) train a classifier model on the dimension-reduced set of video tensor shots with a support tensor machine;
4) project each test shot with the transition matrices computed from the training set, then detect semantic concepts with the classifier model.
Extracting the low-level features of the three modalities (image, audio, text) from every shot in the training and test sets: choose one keyframe per shot as its representative image, then extract a color histogram, texture, and Canny edges as the image features. The audio segment corresponding to the shot is extracted as an audio clip and divided into overlapping short-time audio frames; from each frame, features including MFCCs, spectral centroid, roll-off frequency, spectral flux, and zero-crossing rate are extracted to form a frame feature vector, and statistics of the short-time frame feature vectors serve as the shot's audio features. TF*IDF values computed from the recognized transcript text of the video serve as the text features.
Representation of a video tensor shot: based on the low-level image, audio, and text features extracted from the video, each video shot is represented by a third-order tensor $S \in \mathbb{R}^{I_1 \times I_2 \times I_3}$, where $I_1$, $I_2$ and $I_3$ are the dimensions of the image, audio, and text features respectively. The elements of $S$ are defined as follows: $s_{i_1,1,1}\ (1 \le i_1 \le I_1)$ holds the image feature values, $s_{2,i_2,2}\ (1 \le i_2 \le I_2)$ the audio feature values, and $s_{3,3,i_3}\ (1 \le i_3 \le I_3)$ the text feature values; all other elements are initialized to zero.
Reducing the dimensionality of the original high-dimensional tensors and embedding them in a subspace by seeking transition matrices, following the intrinsic manifold structure of the video tensor shots: given the set of shot data $X = \{X_1, X_2, \ldots, X_N\}$ in the space $\mathbb{R}^{I_1 \times I_2 \times I_3}$, according to the intrinsic manifold structure of the tensor shots and spectral graph theory, seek for each tensor shot $X_i|_{i=1}^{N}$ three transition matrices: $T_1^i$ of size $J_1 \times I_1$, $T_2^i$ of size $J_2 \times I_2$, and $T_3^i$ of size $J_3 \times I_3$, which map the $N$ data points into the space $\mathbb{R}^{J_1 \times J_2 \times J_3}$ ($J_1 < I_1$, $J_2 < I_2$, $J_3 < I_3$) as $Y = \{Y_1, Y_2, \ldots, Y_N\}$ satisfying $Y_i|_{i=1}^{N} = X_i \otimes_1 T_1^{iT} \otimes_2 T_2^{iT} \otimes_3 T_3^{iT}$, thereby realizing the dimensionality reduction and subspace embedding of the original high-dimensional tensors. To obtain $T_1^i|_{i=1}^{N}$, solve the generalized eigenvector problem $(D_U - W_U)V_1 = \lambda D_U V_1$ for the optimal intermediate conversion matrix $V_1$, where $D_U = \sum_i D_{ii} U_1^i U_1^{iT}$, $W_U = \sum_{ij} W_{ij} U_1^i U_1^{jT}$, $W$ is the weight matrix of the nearest-neighbor graph constructed from the training set $X$, $D$ is the diagonal matrix of $W$ with $D_{ii} = \sum_j W_{ij}$, and $U_1^i$ is the left matrix obtained from the SVD of the mode-1 unfolding matrix $X_{(1)}^i$ of $X_i|_{i=1}^{N}$; the transition matrices are finally $T_1^i|_{i=1}^{N} = V_1^T U_1^i \in \mathbb{R}^{J_1 \times I_1}$. $T_2^i$ and $T_3^i$ are obtained in the same way.
Training a classifier model on the dimension-reduced set of video tensor shots with a support tensor machine: the inputs of the classifier model are the low-dimensional tensors obtained by the subspace embedding and dimensionality reduction, $Y_i|_{i=1}^{N} = X_i \otimes_1 T_1^{iT} \otimes_2 T_2^{iT} \otimes_3 T_3^{iT} \in \mathbb{R}^{J_1 \times J_2 \times J_3}$, together with the corresponding class labels $y_i \in \{+1, -1\}$; the outputs are the tensor-hyperplane parameters of the classifier model, $w_k|_{k=1}^{3} \in \mathbb{R}^{J_k}$ and $b \in \mathbb{R}$. The parameters $w_k|_{k=1}^{3}$ and $b$ are obtained by iteratively solving the optimization problem
$$\min_{w_j, b, \xi}\; J_{C\text{-}STM}(w_j, b, \xi) = \frac{\eta}{2}\|w_j\|_{Fro}^2 + c\sum_{i=1}^{N}\xi_i \quad \text{s.t.}\;\; y_i\Big[w_j^T\Big(Y_i \prod_{\substack{k=1\\k \ne j}}^{3} \times_k w_k\Big) + b\Big] \ge 1 - \xi_i,\; 1 \le i \le N,\; \xi \ge 0,$$
where the parameter $j$ cycles from 1 to 3, $c$ is a constant, $\xi$ is the vector of slack variables, and $\eta = \prod_{1 \le k \le 3,\, k \ne j} \|w_k\|_{Fro}^2$.
Projecting a test shot with the transition matrices computed from the training set, then detecting semantic concepts with the classifier model: a new datum outside the training set, $X_t \in \mathbb{R}^{I_1 \times I_2 \times I_3}$, is mapped into the low-dimensional subspace by the transition matrices $T_1^t = V_1^T U_1^t \in \mathbb{R}^{J_1 \times I_1}$, $T_2^t = V_2^T U_2^t \in \mathbb{R}^{J_2 \times I_2}$ and $T_3^t = V_3^T U_3^t \in \mathbb{R}^{J_3 \times I_3}$ as $Y_t = X_t \otimes_1 T_1^{tT} \otimes_2 T_2^{tT} \otimes_3 T_3^{tT} \in \mathbb{R}^{J_1 \times J_2 \times J_3}$, and then classified by the classifier model, i.e. $z_t = \operatorname{sign}(Y_t \otimes_1 w_1 \otimes_2 w_2 \otimes_3 w_3 + b)$ yields the class label $z_t \in \{+1, -1\}$ of the test datum.
Beneficial effects of the present invention:
1) the invention replaces the traditional vector representation of video with tensors, which effectively alleviates the problems brought by the "curse of dimensionality";
2) the invention takes into account the multiple modalities in video, namely image, audio, and text, together with the temporally correlated co-occurrence characteristic of video data; fusing and coordinating the modalities based on this characteristic plays an important role in narrowing the "semantic gap" between low-level features and high-level semantics;
3) following the intrinsic manifold structure of the tensor shot set and spectral graph theory, the proposed tensor-shot subspace-embedding and dimensionality-reduction method not only effectively resolves the difficulties brought by high dimensionality but, being a linear method, can also directly project new data outside the training set;
4) the invention trains the classifier model with a support tensor machine, which has good classification and detection ability.
Description of drawings
Fig. 1 is the flowchart of the method for detecting multi-modal video semantic concepts based on tensor representation;
Fig. 2 shows the detection results of the present invention for the semantic concept "Explosion", compared with the ISOMAP and PCA methods and plotted as ROC curves;
Fig. 3 shows the detection results of the present invention for the semantic concept "Sports", compared with the ISOMAP and PCA methods and plotted as ROC curves.
Embodiment
The method for detecting multi-modal video semantic concepts based on tensor representation comprises the following steps:
1) extract low-level features of the three modalities (image, audio, text) from every video shot in the training and test sets; each video tensor shot is then expressed as a third-order tensor built from these three kinds of low-level features;
2) following the intrinsic manifold structure of the set of video tensor shots, reduce the dimensionality of the original high-dimensional tensors and embed them in a subspace by seeking transition matrices;
3) train a classifier model on the dimension-reduced set of video tensor shots with a support tensor machine;
4) project each test shot with the transition matrices computed from the training set, then detect semantic concepts with the classifier model.
Extracting the low-level features of the three modalities (image, audio, text) from every shot in the training and test sets: low-level features are features extracted directly from the raw video data, as opposed to the high-level features represented by semantic concepts. We extract low-level features from each video shot, covering the image, audio, and text modalities.
Image features: the shot is the basic processing unit; one keyframe is chosen per shot as its representative image, and the keyframe's color histogram, texture, and Canny edges are extracted as image features;
Audio features: the audio segment corresponding to a shot is extracted as an audio clip and divided into overlapping short-time audio frames; from each frame, features including MFCCs, spectral centroid, roll-off frequency, spectral flux, and zero-crossing rate are extracted to form a frame feature vector, and statistics (mean or variance) of the short-time frame feature vectors serve as the shot's audio features;
Text features: we extract features from the recognized transcript text of the video. Because the dimensionality of the text features is much larger than that of the other modalities, while the text carries rich semantic information, Latent Semantic Analysis (LSA) can first be applied to the text for dimensionality reduction.
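As an illustration only (not part of the patent), TF*IDF weighting followed by an LSA-style truncated SVD can be sketched in plain numpy; the transcript snippets, the whitespace tokenization, and the plain logarithmic IDF variant are all assumptions:

```python
import numpy as np

# Hypothetical ASR transcript snippets standing in for three shots.
docs = ["explosion smoke fire downtown",
        "football match goal score",
        "goal in the second half of the match"]

tokenized = [d.split() for d in docs]
vocab = sorted({w for doc in tokenized for w in doc})
tf = np.array([[doc.count(w) for w in vocab] for doc in tokenized], float)
df = (tf > 0).sum(axis=0)            # document frequency per term
idf = np.log(len(docs) / df)         # plain log IDF (one common variant)
tfidf = tf * idf                     # TF*IDF matrix, shots x terms

# LSA: a truncated SVD keeps the k strongest latent "topics" of the text.
k = 2
U, s, Vt = np.linalg.svd(tfidf, full_matrices=False)
text_features = U[:, :k] * s[:k]     # low-dimensional text features per shot
print(text_features.shape)           # (3, 2)
```

The reduced `text_features` rows would then play the role of the text-modality vector of each shot.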
Representation of a video tensor shot: based on the low-level image, audio, and text features extracted from the video, each video shot is represented by a third-order tensor $S \in \mathbb{R}^{I_1 \times I_2 \times I_3}$, where $I_1$, $I_2$ and $I_3$ are the dimensions of the image, audio, and text features respectively. The elements of $S$ are defined as follows: $s_{i_1,1,1}\ (1 \le i_1 \le I_1)$ holds the image feature values, $s_{2,i_2,2}\ (1 \le i_2 \le I_2)$ the audio feature values, and $s_{3,3,i_3}\ (1 \le i_3 \le I_3)$ the text feature values; all other elements are initialized to zero.
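The sparse third-order layout described above can be sketched as follows; the feature dimensions and the random feature values are toy assumptions (indices below are 0-based, the patent's are 1-based):

```python
import numpy as np

rng = np.random.default_rng(0)
I1, I2, I3 = 5, 4, 3            # toy feature dimensions (assumptions)
img = rng.random(I1)            # image feature vector
aud = rng.random(I2)            # audio feature vector
txt = rng.random(I3)            # text feature vector

# Third-order tensor for one shot: the three feature vectors occupy the
# fibres s_{i1,1,1}, s_{2,i2,2}, s_{3,3,i3}; every other entry stays zero.
S = np.zeros((I1, I2, I3))
S[:, 0, 0] = img                # image fibre
S[1, :, 1] = aud                # audio fibre
S[2, 2, :] = txt                # text fibre

print(S.shape, np.count_nonzero(S))
```

The three fibres do not intersect, so the tensor holds exactly $I_1 + I_2 + I_3$ nonzero values.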
Reducing the dimensionality of the original high-dimensional tensors and embedding them in a subspace by seeking transition matrices, following the intrinsic manifold structure of the video tensor shots: given the set of shot data $X = \{X_1, X_2, \ldots, X_N\}$ in the space $\mathbb{R}^{I_1 \times I_2 \times I_3}$, according to the intrinsic manifold structure of the tensor shots and spectral graph theory, we seek for each tensor shot $X_i|_{i=1}^{N}$ three transition matrices, $T_1^i$ of size $J_1 \times I_1$, $T_2^i$ of size $J_2 \times I_2$, and $T_3^i$ of size $J_3 \times I_3$, which map the $N$ data points into the space $\mathbb{R}^{J_1 \times J_2 \times J_3}$ ($J_1 < I_1$, $J_2 < I_2$, $J_3 < I_3$) as the set $Y = \{Y_1, Y_2, \ldots, Y_N\}$ satisfying $Y_i|_{i=1}^{N} = X_i \otimes_1 T_1^{iT} \otimes_2 T_2^{iT} \otimes_3 T_3^{iT}$. The low-dimensional data set $Y$ then reflects the intrinsic geometric topology of the manifold underlying $X$. Moreover, the mapping is linear: for a data point $X_t$ outside the training set, its image in the low-dimensional subspace can be computed directly from the transition matrices obtained in training.
Let $X_i \in \mathbb{R}^{I_1 \times I_2 \times I_3}$ denote a third-order tensor shot. Given $N$ tensor shots forming a data set $X = \{X_1, X_2, \ldots, X_N\}$ distributed on a manifold $M \subset \mathbb{R}^{I_1 \times I_2 \times I_3}$, we can build a nearest-neighbor graph $G$ to model the local geometric structure of $M$. The weight matrix $W$ of $G$ is defined by an equation involving a constant $c$ (the defining equation appears only as an image in the original and is not reproduced here).
For each tensor shot $X_i$ ($1 \le i \le N$), following the Higher-Order Singular Value Decomposition (HOSVD), we apply the SVD to each mode-$k$ unfolding matrix $X_{(k)}^i|_{k=1}^{3}$, i.e. to $X_{(1)}^i$, $X_{(2)}^i$, $X_{(3)}^i$, and compute the left matrices $U_1^i$, $U_2^i$, $U_3^i$. For instance, $U_1^i$ is the left matrix obtained from the SVD of the mode-1 unfolding matrix $X_{(1)}^i$ of $X_i$.
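A minimal numpy sketch of this step: mode-k unfolding followed by one SVD per mode. The unfolding convention (mode-k fibres as columns, remaining axes kept in their original order) and the toy tensor are assumptions:

```python
import numpy as np

def unfold(X, k):
    """Mode-k unfolding: mode-k fibres become the columns of a matrix."""
    return np.moveaxis(X, k, 0).reshape(X.shape[k], -1)

rng = np.random.default_rng(1)
X = rng.random((5, 4, 3))        # one toy tensor shot (sizes are assumptions)

# Left singular matrices U^(k) of each mode-k unfolding, as in HOSVD.
U = [np.linalg.svd(unfold(X, k), full_matrices=True)[0] for k in range(3)]

for k in range(3):
    print(U[k].shape)            # (I_k, I_k): square orthogonal factor
```

Each `U[k]` is orthogonal, which is what lets the transition matrices later act as projections of the per-mode factors.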
Now, given $U_1^i \in \mathbb{R}^{I_1 \times I_1}$ ($1 \le i \le N$), we want to find a matrix $V_1$ of size $I_1 \times J_1$ that maps $U_1^i$ to $T_1^{iT} \in \mathbb{R}^{I_1 \times J_1}$, i.e. $T_1^i = V_1^T U_1^i \in \mathbb{R}^{J_1 \times I_1}$. We consider this problem from two angles. On the one hand, to preserve the intrinsic structure of the manifold, we must find the optimum of the objective
$$\min_{V_1} \sum_{ij} \| V_1^T U_1^i - V_1^T U_1^j \|^2\, W_{ij}.$$
That is, minimizing $\sum_{ij} \| V_1^T U_1^i - V_1^T U_1^j \|^2 W_{ij}$ guarantees that when $U_1^i$ and $U_1^j$ are "close", $V_1^T U_1^i$ and $V_1^T U_1^j$ are also "close".
$D$ is the diagonal matrix of $W$, i.e. $D_{ii} = \sum_j W_{ij}$, and for any matrix $A$ the norm satisfies $\|A\|^2 = \mathrm{tr}(AA^T)$. Hence:
$$\begin{aligned}
\frac{1}{2}\sum_{ij} \| V_1^T U_1^i - V_1^T U_1^j \|^2 W_{ij}
&= \frac{1}{2}\sum_{ij} \mathrm{tr}\big( (T_1^i - T_1^j)(T_1^i - T_1^j)^T \big) W_{ij} \\
&= \frac{1}{2}\sum_{ij} \mathrm{tr}\big( T_1^i T_1^{iT} + T_1^j T_1^{jT} - T_1^i T_1^{jT} - T_1^j T_1^{iT} \big) W_{ij} \\
&= \mathrm{tr}\Big( \sum_i D_{ii} T_1^i T_1^{iT} - \sum_{ij} W_{ij} T_1^i T_1^{jT} \Big) \\
&= \mathrm{tr}\Big( \sum_i D_{ii} V_1^T U_1^i U_1^{iT} V_1 - \sum_{ij} W_{ij} V_1^T U_1^i U_1^{jT} V_1 \Big) \\
&= \mathrm{tr}\Big( V_1^T \Big( \sum_i D_{ii} U_1^i U_1^{iT} - \sum_{ij} W_{ij} U_1^i U_1^{jT} \Big) V_1 \Big) \\
&= \mathrm{tr}\big( V_1^T (D_U - W_U) V_1 \big),
\end{aligned}$$
where $D_U = \sum_i D_{ii} U_1^i U_1^{iT}$ and $W_U = \sum_{ij} W_{ij} U_1^i U_1^{jT}$. From this derivation it can be seen that solving $\min_{V_1} \sum_{ij} \| V_1^T U_1^i - V_1^T U_1^j \|^2 W_{ij}$ amounts to minimizing $\mathrm{tr}\big(V_1^T (D_U - W_U) V_1\big)$.
On the other hand, besides preserving the graph structure of the manifold, we also need to maximize the overall variance on the manifold. In general, the variance of a random variable $x$ is
$$\mathrm{var}(x) = \int_M (x - \mu)^2\, dP(x), \qquad \mu = \int_M x\, dP(x),$$
where $M$ is the manifold of the data, $\mu$ is the expectation, and $dP$ is the probability density. According to spectral graph theory, $dP$ can be estimated in discretized form by the diagonal matrix $D$ of the sample points ($D_{ii} = \sum_j W_{ij}$).
We therefore have the following derivation:
$$\begin{aligned}
\mathrm{var}(T_1) &= \sum_i \| T_1^i \|^2 D_{ii}
= \sum_i \mathrm{tr}\big( T_1^i T_1^{iT} \big) D_{ii} \\
&= \sum_i \mathrm{tr}\big( V_1^T U_1^i U_1^{iT} V_1 \big) D_{ii}
= \mathrm{tr}\Big( V_1^T \Big( \sum_i D_{ii} U_1^i U_1^{iT} \Big) V_1 \Big)
= \mathrm{tr}\big( V_1^T D_U V_1 \big).
\end{aligned}$$
Combining the constraints from the two aspects above, we obtain the following optimization problem:
$$\min_{V_1} \frac{\mathrm{tr}\big( V_1^T (D_U - W_U) V_1 \big)}{\mathrm{tr}\big( V_1^T D_U V_1 \big)}.$$
Obviously, the optimal $V_1$ consists of generalized eigenvectors of the pencil $(D_U - W_U, D_U)$. We can therefore obtain the optimal $V_1$ by solving the generalized eigenvector problem
$$(D_U - W_U) V_1 = \lambda D_U V_1.$$
Once $V_1$ has been computed, the transition matrices $T_1^i = V_1^T U_1^i$ follow. In the same way, the intermediate conversion matrices $V_2$ and $V_3$ of the audio and text modes can be computed, so that $T_2^i$ is obtained from $V_2$ and $T_3^i$ from $V_3$. The data of the low-dimensional video tensor shot set $Y$ are then $Y_i|_{i=1}^{N} = X_i \otimes_1 T_1^{iT} \otimes_2 T_2^{iT} \otimes_3 T_3^{iT} \in \mathbb{R}^{J_1 \times J_2 \times J_3}$.
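The mode-$k$ product $\otimes_k$ used throughout can be implemented directly. The sketch below applies three toy $J_k \times I_k$ matrices to a random tensor (all sizes and data are assumptions), mirroring how $Y_i$ is formed from $X_i$:

```python
import numpy as np

def mode_k_product(X, M, k):
    """Multiply tensor X by matrix M along mode k (the mode-k product)."""
    Xk = np.moveaxis(X, k, 0).reshape(X.shape[k], -1)    # mode-k unfolding
    out = M @ Xk                                          # act on mode-k fibres
    new_shape = (M.shape[0],) + tuple(s for i, s in enumerate(X.shape) if i != k)
    return np.moveaxis(out.reshape(new_shape), 0, k)

rng = np.random.default_rng(3)
X = rng.random((5, 4, 3))
T1, T2, T3 = rng.random((2, 5)), rng.random((2, 4)), rng.random((2, 3))

# Y = X x_1 T1 x_2 T2 x_3 T3 : the low-dimensional embedded shot.
Y = X
for k, T in enumerate((T1, T2, T3)):
    Y = mode_k_product(Y, T, k)
print(Y.shape)                                           # (2, 2, 2)
```

Note the convention: multiplying by a $J_k \times I_k$ matrix directly corresponds, in the patent's notation, to the product with $T_k^{iT}$ transposed appropriately.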
The algorithm for the subspace embedding and dimension reduction of tensor shots is given below.
Input: the original training tensor shot set $X = \{X_1, X_2, \ldots, X_N\} \in \mathbb{R}^{I_1 \times I_2 \times I_3}$;
Output: the mapped low-dimensional tensor shot set $Y = \{Y_1, Y_2, \ldots, Y_N\} \in \mathbb{R}^{J_1 \times J_2 \times J_3}$, the intermediate conversion matrices $V_1 \in \mathbb{R}^{I_1 \times J_1}$, $V_2 \in \mathbb{R}^{I_2 \times J_2}$ and $V_3 \in \mathbb{R}^{I_3 \times J_3}$, and the transition matrices $T_1^i$, $T_2^i$ and $T_3^i$, satisfying $Y_i|_{i=1}^{N} = X_i \otimes_1 T_1^{iT} \otimes_2 T_2^{iT} \otimes_3 T_3^{iT}$;
Algorithm description:
Step 1: build a nearest-neighbor graph G;
Step 2: compute the weight matrix W;
Step 3: For k = 1 to 3
Step 4:   For i = 1 to N
Step 5:     compute the left matrix $U_{(k)}^i$ of the SVD of the mode-$k$ unfolding matrix $X_{(k)}^i$ of $X_i$;
Step 6:   End;
Step 7:   $D_U = \sum_i D_{ii} U_{(k)}^i U_{(k)}^{iT}$;
Step 8:   $W_U = \sum_{ij} W_{ij} U_{(k)}^i U_{(k)}^{jT}$;
Step 9:   solve the generalized eigenvector problem $(D_U - W_U)V_k = \lambda D_U V_k$ to obtain the optimal $V_k$;
Step 10:  For i = 1 to N
Step 11:    $T_{(k)}^i = V_k^T U_{(k)}^i \in \mathbb{R}^{J_k \times I_k}$;
Step 12:  End;
Step 13: End;
Step 14: For i = 1 to N
Step 15:   $Y_i = X_i \otimes_1 T_1^{iT} \otimes_2 T_2^{iT} \otimes_3 T_3^{iT}$;
Step 16: End.
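The sixteen steps above can be sketched end-to-end in numpy. The heat-kernel weights for W are an assumption (the patent's defining equation for W survives only as an image), and all sizes and data are toys:

```python
import numpy as np

def unfold(X, k):
    """Mode-k unfolding: mode-k fibres become the columns."""
    return np.moveaxis(X, k, 0).reshape(X.shape[k], -1)

def mode_product(X, M, k):
    """Tensor-matrix product along mode k."""
    shape = (M.shape[0],) + tuple(s for i, s in enumerate(X.shape) if i != k)
    return np.moveaxis((M @ unfold(X, k)).reshape(shape), 0, k)

rng = np.random.default_rng(4)
N, dims, out_dims = 8, (5, 4, 3), (2, 2, 2)             # toy sizes (assumptions)
shots = [rng.random(dims) for _ in range(N)]

# Steps 1-2: neighbour-graph weights (heat-kernel weights are an assumption).
flat = np.array([X.ravel() for X in shots])
d2 = ((flat[:, None] - flat[None, :]) ** 2).sum(-1)
W = np.exp(-d2 / d2.mean())
np.fill_diagonal(W, 0)
D = W.sum(axis=1)                                       # D_ii = sum_j W_ij

Ts = [[None] * 3 for _ in range(N)]                     # transition matrices T_k^i
for k in range(3):                                      # Steps 3-13, per mode
    Us = [np.linalg.svd(unfold(X, k))[0] for X in shots]        # Step 5
    DU = sum(D[i] * Us[i] @ Us[i].T for i in range(N))          # Step 7
    WU = sum(W[i, j] * Us[i] @ Us[j].T
             for i in range(N) for j in range(N))               # Step 8
    vals, vecs = np.linalg.eig(np.linalg.solve(DU, DU - WU))    # Step 9
    Vk = vecs.real[:, np.argsort(vals.real)[:out_dims[k]]]
    for i in range(N):                                          # Steps 10-12
        Ts[i][k] = Vk.T @ Us[i]

# Steps 14-16: project every shot into the low-dimensional tensor space.
Y = [mode_product(mode_product(mode_product(X, T[0], 0), T[1], 1), T[2], 2)
     for X, T in zip(shots, Ts)]
print(Y[0].shape)                                       # (2, 2, 2)
```

Only the shapes and the data flow reflect the algorithm; with real shot features, W would come from the patent's (unreproduced) weight definition.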
Training the classifier model on the dimension-reduced video tensor shot set with a support tensor machine: in this step, we use the support tensor machine to train the classifier of tensor shots. The input of the training model is the low-dimensional tensor $Y_i$ obtained in the previous step through subspace embedding and dimensionality reduction, rather than the original $X_i$. This not only improves accuracy but also improves the efficiency of training and classification.
The algorithm for training the classifier with a support tensor machine is as follows.
Input: the mapped low-dimensional tensor shot set $Y_i|_{i=1}^{N} = X_i \otimes_1 T_1^{iT} \otimes_2 T_2^{iT} \otimes_3 T_3^{iT} \in \mathbb{R}^{J_1 \times J_2 \times J_3}$, and the corresponding class labels $y_i \in \{+1, -1\}$;
Output: the tensor-hyperplane parameters of the classifier model, $w_k|_{k=1}^{3} \in \mathbb{R}^{J_k}$ and $b \in \mathbb{R}$;
Algorithm description:
Step 1: initialize $w_k|_{k=1}^{3}$ as random unit vectors in $\mathbb{R}^{J_k}$;
Step 2: repeat Steps 3-5 until convergence;
Step 3: For j = 1 to 3
Step 4:   obtain $w_j \in \mathbb{R}^{J_j}$ and $b$ by solving the optimization problem
$$\min_{w_j, b, \xi}\; J_{C\text{-}STM}(w_j, b, \xi) = \frac{\eta}{2}\|w_j\|_{Fro}^2 + c\sum_{i=1}^{N}\xi_i \quad \text{s.t.}\;\; y_i\Big[w_j^T\Big(Y_i \prod_{\substack{k=1\\k \ne j}}^{3} \times_k w_k\Big) + b\Big] \ge 1 - \xi_i,\; 1 \le i \le N,\; \xi \ge 0,$$
  where $c$ is a constant, $\xi$ is the vector of slack variables, and $\eta = \prod_{1 \le k \le 3,\, k \ne j} \|w_k\|_{Fro}^2$;
Step 5: End;
Step 6: convergence check: $w_k|_{k=1}^{3}$ is considered converged if $\sum_{k=1}^{3} \big[\, |w_{k,t}^T w_{k,t-1}| \cdot \|w_{k,t}\|_{Fro}^{-2} - 1 \big] \le \epsilon$, where $w_{k,t}$ is the current projection vector and $w_{k,t-1}$ the previous one;
Step 7: End.
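A hedged sketch of the alternating scheme: for fixed $w_k$ ($k \ne j$), contracting each $Y_i$ along the other two modes reduces Step 4 to an ordinary linear SVM in $\mathbb{R}^{J_j}$. The tiny subgradient hinge-loss solver below stands in for a real SVM solver, the all-ones initialization (for reproducibility) replaces the patent's random unit vectors, and the synthetic separable data are assumptions:

```python
import numpy as np

def mode_product(X, M, k):
    Xk = np.moveaxis(X, k, 0).reshape(X.shape[k], -1)
    shape = (M.shape[0],) + tuple(s for i, s in enumerate(X.shape) if i != k)
    return np.moveaxis((M @ Xk).reshape(shape), 0, k)

def linear_svm(X, y, c=1.0, lr=0.01, epochs=200):
    """Tiny hinge-loss subgradient solver standing in for a real SVM solver."""
    w, b = np.zeros(X.shape[1]), 0.0
    for _ in range(epochs):
        mask = y * (X @ w + b) < 1                 # violated margin constraints
        w -= lr * (w - c * (y[mask, None] * X[mask]).sum(axis=0))
        b += lr * c * y[mask].sum()
    return w, b

rng = np.random.default_rng(5)
N, dims = 20, (2, 2, 2)
Y = [rng.random(dims) + (1.0 if i < N // 2 else -1.0) for i in range(N)]
labels = np.array([1.0] * (N // 2) + [-1.0] * (N // 2))

ws = [np.ones(d) / np.sqrt(d) for d in dims]       # Step 1 (ones, not random)
b = 0.0
for _ in range(10):                                # Steps 2-5: alternate
    for j in range(3):
        k1, k2 = (j + 1) % 3, (j + 2) % 3
        # Contract each shot along the other two modes: the tensor problem for
        # w_j becomes an ordinary vector SVM in R^{J_j}.
        Xj = np.array([mode_product(mode_product(Yi, ws[k1][None, :], k1),
                                    ws[k2][None, :], k2).ravel() for Yi in Y])
        ws[j], b = linear_svm(Xj, labels)

score = [mode_product(mode_product(mode_product(Yi, ws[0][None, :], 0),
                                   ws[1][None, :], 1),
                      ws[2][None, :], 2).ravel()[0] + b for Yi in Y]
pred = np.sign(score)
print((pred == labels).mean())
```

A fixed number of outer sweeps replaces the patent's convergence test in Step 6; a production version would check the stated criterion against $\epsilon$.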
Projecting a test shot with the transition matrices computed from the training set, then detecting semantic concepts with the classifier model: in this step, new data outside the training set are detected with the classifier model obtained by the foregoing training. Because our dimensionality-reduction method is linear, new data can be mapped directly into the low-dimensional subspace and then classified by the classifier.
Let $X_t$ be a detection example outside the training set; the following algorithm gives the detection procedure.
Input: the shot to be detected $X_t \in \mathbb{R}^{I_1 \times I_2 \times I_3}$, the intermediate conversion matrices $V_1$, $V_2$, $V_3$, and the classifier parameters $w_k|_{k=1}^{3} \in \mathbb{R}^{J_k}$ and $b \in \mathbb{R}$;
Output: the class label $z_t \in \{+1, -1\}$ of $X_t$;
Algorithm description:
Step 1: For k = 1 to 3
Step 2:   compute the left matrix $U_{(k)}^t$ of the SVD of the mode-$k$ unfolding matrix $X_{(k)}^t$ of $X_t$;
Step 3:   compute $T_{(k)}^t = V_k^T U_{(k)}^t$;
Step 4: End;
Step 5: compute $Y_t = X_t \otimes_1 T_1^{tT} \otimes_2 T_2^{tT} \otimes_3 T_3^{tT}$;
Step 6: compute $z_t = \operatorname{sign}(Y_t \otimes_1 w_1 \otimes_2 w_2 \otimes_3 w_3 + b)$;
Step 7: End.
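The detection procedure can be sketched as follows; the learned quantities ($V_k$, $w_k$, $b$) are replaced here by random toy stand-ins, so only the shapes and the data flow reflect the algorithm:

```python
import numpy as np

def unfold(X, k):
    return np.moveaxis(X, k, 0).reshape(X.shape[k], -1)

def mode_product(X, M, k):
    shape = (M.shape[0],) + tuple(s for i, s in enumerate(X.shape) if i != k)
    return np.moveaxis((M @ unfold(X, k)).reshape(shape), 0, k)

rng = np.random.default_rng(6)
dims, out_dims = (5, 4, 3), (2, 2, 2)
Xt = rng.random(dims)                               # unseen test shot (toy)

# Toy stand-ins for quantities learned on the training set (assumptions).
Vs = [np.linalg.qr(rng.random((d, o)))[0] for d, o in zip(dims, out_dims)]
ws = [rng.standard_normal(o) for o in out_dims]     # classifier parameters
b = 0.1

# Steps 1-4: per-mode left SVD factor, then transition matrix T_k = V_k^T U_k.
Ts = [Vs[k].T @ np.linalg.svd(unfold(Xt, k))[0] for k in range(3)]

# Step 5: project into the low-dimensional subspace.
Yt = Xt
for k in range(3):
    Yt = mode_product(Yt, Ts[k], k)

# Step 6: contract with the tensor-hyperplane parameters and take the sign.
score = Yt
for k in range(3):
    score = mode_product(score, ws[k][None, :], k)
zt = np.sign(score.ravel()[0] + b)
print(Yt.shape, zt)
```

With trained $V_k$, $w_k$ and $b$, `zt` would be the semantic-concept decision for the shot.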

Claims (6)

1. A method for detecting multi-modal video semantic concepts based on tensor representation, characterized by comprising the steps of:
1) extracting low-level features of the three modalities (image, audio, text) from every video shot in the training and test sets, each video tensor shot then being expressed as a third-order tensor built from these three kinds of low-level features;
2) following the intrinsic manifold structure of the set of video tensor shots, reducing the dimensionality of the original high-dimensional tensors and embedding them in a subspace by seeking transition matrices;
3) training a classifier model on the dimension-reduced set of video tensor shots with a support tensor machine;
4) projecting each test shot with the transition matrices computed from the training set, then detecting semantic concepts with the classifier model.
2. The method for detecting multi-modal video semantic concepts based on tensor representation according to claim 1, characterized in that the low-level features of the three modalities (image, audio, text) are extracted from every shot in the training and test sets as follows: one keyframe is chosen per shot as its representative image, and its color histogram, texture, and Canny edges are extracted as the image features; the audio segment corresponding to the shot is extracted as an audio clip and divided into overlapping short-time audio frames, from each of which features including MFCCs, spectral centroid, roll-off frequency, spectral flux, and zero-crossing rate are extracted to form a frame feature vector, statistics of the short-time frame feature vectors serving as the shot's audio features; TF*IDF values computed from the recognized transcript text of the video serve as the text features.
3. The method for detecting multi-modal video semantic concepts based on tensor representation according to claim 1, characterized in that a video tensor shot is represented as follows: based on the low-level image, audio, and text features extracted from the video, each video shot is represented by a third-order tensor $S \in \mathbb{R}^{I_1 \times I_2 \times I_3}$, where $I_1$, $I_2$ and $I_3$ are the dimensions of the image, audio, and text features respectively; the elements are defined such that $s_{i_1,1,1}\ (1 \le i_1 \le I_1)$ holds the image feature values, $s_{2,i_2,2}\ (1 \le i_2 \le I_2)$ the audio feature values, and $s_{3,3,i_3}\ (1 \le i_3 \le I_3)$ the text feature values, with all other elements initialized to zero.
4. The method for detecting multi-modal video semantic concepts based on tensor representation according to claim 1, characterized in that, following the intrinsic manifold structure of the video tensor shots, the dimensionality of the original high-dimensional tensors is reduced and they are embedded in a subspace by seeking transition matrices, as follows: given the set of shot data $X = \{X_1, X_2, \ldots, X_N\}$ in the space $\mathbb{R}^{I_1 \times I_2 \times I_3}$, according to the intrinsic manifold structure of the tensor shots and spectral graph theory, three transition matrices are sought for each tensor shot $X_i|_{i=1}^{N}$: $T_1^i$ of size $J_1 \times I_1$, $T_2^i$ of size $J_2 \times I_2$, and $T_3^i$ of size $J_3 \times I_3$, which map the $N$ data points into the space $\mathbb{R}^{J_1 \times J_2 \times J_3}$ ($J_1 < I_1$, $J_2 < I_2$, $J_3 < I_3$) as $Y = \{Y_1, Y_2, \ldots, Y_N\}$ satisfying $Y_i|_{i=1}^{N} = X_i \otimes_1 T_1^{iT} \otimes_2 T_2^{iT} \otimes_3 T_3^{iT}$, thereby realizing the dimensionality reduction and subspace embedding of the original high-dimensional tensors; to obtain $T_1^i|_{i=1}^{N}$, the generalized eigenvector problem $(D_U - W_U)V_1 = \lambda D_U V_1$ is solved for the optimal intermediate conversion matrix $V_1$, where $D_U = \sum_i D_{ii} U_1^i U_1^{iT}$, $W_U = \sum_{ij} W_{ij} U_1^i U_1^{jT}$, $W$ is the weight matrix of the nearest-neighbor graph constructed from the training set $X$, $D$ is the diagonal matrix of $W$ with $D_{ii} = \sum_j W_{ij}$, and $U_1^i$ is the left matrix obtained from the SVD of the mode-1 unfolding matrix $X_{(1)}^i$ of $X_i|_{i=1}^{N}$; the transition matrices are finally $T_1^i|_{i=1}^{N} = V_1^T U_1^i \in \mathbb{R}^{J_1 \times I_1}$, and $T_2^i$ and $T_3^i$ are obtained in the same way.
5. The method for detecting multi-modal video semantic concepts based on tensor representation according to claim 1, characterized in that the classifier model is trained on the dimension-reduced set of video tensor shots with a support tensor machine as follows: the inputs of the classifier model are the low-dimensional tensors obtained by the subspace embedding and dimensionality reduction, $Y_i|_{i=1}^{N} = X_i \otimes_1 T_1^{iT} \otimes_2 T_2^{iT} \otimes_3 T_3^{iT} \in \mathbb{R}^{J_1 \times J_2 \times J_3}$, together with the corresponding class labels $y_i \in \{+1, -1\}$, and the outputs are the tensor-hyperplane parameters of the classifier model, $w_k|_{k=1}^{3} \in \mathbb{R}^{J_k}$ and $b \in \mathbb{R}$; the parameters $w_k|_{k=1}^{3}$ and $b$ are obtained by iteratively solving the optimization problem
$$\min_{w_j, b, \xi}\; J_{C\text{-}STM}(w_j, b, \xi) = \frac{\eta}{2}\|w_j\|_{Fro}^2 + c\sum_{i=1}^{N}\xi_i \quad \text{s.t.}\;\; y_i\Big[w_j^T\Big(Y_i \prod_{\substack{k=1\\k \ne j}}^{3} \times_k w_k\Big) + b\Big] \ge 1 - \xi_i,\; 1 \le i \le N,\; \xi \ge 0,$$
where the parameter $j$ cycles from 1 to 3, $c$ is a constant, $\xi$ is the vector of slack variables, and $\eta = \prod_{1 \le k \le 3,\, k \ne j} \|w_k\|_{Fro}^2$.
6. The method for detecting multi-modal video semantic concepts based on tensor representation according to claim 1, characterized in that, for a test shot, projection is first performed with the transformation matrices computed from the training set, and semantic concept detection is then performed by the classifier model, as follows: a new datum outside the training set, X^t ∈ R^(I_1×I_2×I_3), is mapped into the low-dimensional subspace by the transformation matrices T_1^t = V_1^T U_1^t ∈ R^(J_1×I_1), T_2^t = V_2^T U_2^t ∈ R^(J_2×I_2) and T_3^t = V_3^T U_3^t ∈ R^(J_3×I_3) as Y^t = X^t ×_1 T_1^(tT) ×_2 T_2^(tT) ×_3 T_3^(tT) ∈ R^(J_1×J_2×J_3); classification is then performed by the classifier model, i.e., z^t = sign(Y^t ×_1 w_1 ×_2 w_2 ×_3 w_3 + b) is computed, yielding the class label z^t ∈ {+1, −1} of the test datum.
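The test-time path of this claim (unfold the new tensor along each mode, take the left singular matrices, form T_k^t = V_k^T U_k^t, project, and take the sign of the tensor inner product with (w_1, w_2, w_3) plus b) can be sketched as below. The V_k, w_k and b here are random stand-ins for quantities the patent obtains from the training set, and all sizes are illustrative.

```python
import numpy as np

rng = np.random.default_rng(2)
I1, I2, I3 = 5, 4, 3
J1, J2, J3 = 2, 2, 2

# Stand-ins for quantities learned on the training set: the intermediate
# matrices V_k and the tensor-hyperplane parameters (w_1, w_2, w_3, b)
V = [rng.standard_normal((I, J)) for I, J in ((I1, J1), (I2, J2), (I3, J3))]
w = [rng.standard_normal(J) for J in (J1, J2, J3)]
b = 0.1

def mode_unfold(T, k):
    """Mode-k unfolding of a 3rd-order tensor: (.., I_k, ..) -> (I_k, rest)."""
    return np.moveaxis(T, k, 0).reshape(T.shape[k], -1)

Xt = rng.standard_normal((I1, I2, I3))   # a new shot outside the training set

# U_k^t: left singular matrix of each mode-k unfolding of X^t
U = [np.linalg.svd(mode_unfold(Xt, k), full_matrices=False)[0]
     for k in range(3)]

# Transformation matrices T_k^t = V_k^T U_k^t  (J_k x I_k)
T = [V[k].T @ U[k] for k in range(3)]

# Project: Y^t = X^t x_1 T_1^t x_2 T_2^t x_3 T_3^t
Yt = np.einsum('ia,jb,kc,abc->ijk', T[0], T[1], T[2], Xt)

# Classify: z^t = sign(Y^t x_1 w_1 x_2 w_2 x_3 w_3 + b)
zt = np.sign(np.einsum('ijk,i,j,k->', Yt, *w) + b)
```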
CN2008100591256A 2008-01-14 2008-01-14 Method for detecting multi-mode video semantic conception based on tensor representation Expired - Fee Related CN101299241B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN2008100591256A CN101299241B (en) 2008-01-14 2008-01-14 Method for detecting multi-mode video semantic conception based on tensor representation


Publications (2)

Publication Number Publication Date
CN101299241A true CN101299241A (en) 2008-11-05
CN101299241B CN101299241B (en) 2010-06-02

Family

ID=40079065

Family Applications (1)

Application Number Title Priority Date Filing Date
CN2008100591256A Expired - Fee Related CN101299241B (en) 2008-01-14 2008-01-14 Method for detecting multi-mode video semantic conception based on tensor representation

Country Status (1)

Country Link
CN (1) CN101299241B (en)


Cited By (42)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN102033853A (en) * 2009-09-30 2011-04-27 三菱电机株式会社 Method and system for reducing dimensionality of the spectrogram of a signal produced by a number of independent processes
CN101996327A (en) * 2010-09-02 2011-03-30 西安电子科技大学 Video anomaly detection method based on weighted tensor subspace background modeling
CN103221965A (en) * 2010-11-18 2013-07-24 高通股份有限公司 Systems and methods for robust pattern classification
CN103312938B (en) * 2012-03-16 2016-07-06 富士通株式会社 Video process apparatus, method for processing video frequency and equipment
CN103312938A (en) * 2012-03-16 2013-09-18 富士通株式会社 Video processing device, video processing method and equipment
CN102750349B (en) * 2012-06-08 2014-10-08 华南理工大学 Video browsing method based on video semantic modeling
CN102750349A (en) * 2012-06-08 2012-10-24 华南理工大学 Video browsing method based on video semantic modeling
CN103455705A (en) * 2013-05-24 2013-12-18 中国科学院自动化研究所 Analysis and prediction system for cooperative correlative tracking and global situation of network social events
CN103336832A (en) * 2013-07-10 2013-10-02 中国科学院自动化研究所 Video classifier construction method based on quality metadata
CN103473307A (en) * 2013-09-10 2013-12-25 浙江大学 Cross-media sparse Hash indexing method
CN103473308A (en) * 2013-09-10 2013-12-25 浙江大学 High-dimensional multimedia data classifying method based on maximum margin tensor study
CN103473308B (en) * 2013-09-10 2017-02-01 浙江大学 High-dimensional multimedia data classifying method based on maximum margin tensor study
CN103473307B (en) * 2013-09-10 2016-07-13 浙江大学 Across media sparse hash indexing means
CN108140032A (en) * 2015-10-28 2018-06-08 英特尔公司 Automatic video frequency is summarized
CN108140032B (en) * 2015-10-28 2022-03-11 英特尔公司 Apparatus and method for automatic video summarization
CN105701504B (en) * 2016-01-08 2019-09-13 天津大学 Multi-modal manifold embedding grammar for zero sample learning
CN105701504A (en) * 2016-01-08 2016-06-22 天津大学 Multimode manifold embedding method used for zero sample learning
CN105718940A (en) * 2016-01-15 2016-06-29 天津大学 Zero-sample image classification method based on multi-group factor analysis
CN105701514A (en) * 2016-01-15 2016-06-22 天津大学 Multi-modal canonical correlation analysis method for zero sample classification
CN105718940B (en) * 2016-01-15 2019-03-29 天津大学 The zero sample image classification method based on factorial analysis between multiple groups
CN105701514B (en) * 2016-01-15 2019-05-21 天津大学 A method of the multi-modal canonical correlation analysis for zero sample classification
CN106529435A (en) * 2016-10-24 2017-03-22 天津大学 Action recognition method based on sensor quantization
CN106529435B (en) * 2016-10-24 2019-10-15 天津大学 Action identification method based on tensor quantization
CN107341522A (en) * 2017-07-11 2017-11-10 重庆大学 A kind of text based on density semanteme subspace and method of the image without tag recognition
CN108595555A (en) * 2018-04-11 2018-09-28 西安电子科技大学 The image search method returned based on semi-supervised tensor subspace
CN108595555B (en) * 2018-04-11 2020-12-08 西安电子科技大学 Image retrieval method based on semi-supervised tensor quantum space regression
CN109214302A (en) * 2018-08-13 2019-01-15 湖南志东科技有限公司 One kind being based on multispectral substance identification
CN109936766B (en) * 2019-01-30 2021-04-13 天津大学 End-to-end-based method for generating audio of water scene
CN109936766A (en) * 2019-01-30 2019-06-25 天津大学 A kind of generation method based on water scene audio end to end
CN110209758B (en) * 2019-04-18 2021-09-03 同济大学 Text increment dimension reduction method based on tensor decomposition
CN110209758A (en) * 2019-04-18 2019-09-06 同济大学 A kind of text increment dimension reduction method based on tensor resolution
CN110097010A (en) * 2019-05-06 2019-08-06 北京达佳互联信息技术有限公司 Picture and text detection method, device, server and storage medium
CN112257857B (en) * 2019-07-22 2024-06-04 中科寒武纪科技股份有限公司 Tensor processing method and related product
CN112257857A (en) * 2019-07-22 2021-01-22 中科寒武纪科技股份有限公司 Tensor processing method and related product
CN111400601A (en) * 2019-09-16 2020-07-10 腾讯科技(深圳)有限公司 Video recommendation method and related equipment
CN111222011A (en) * 2020-01-06 2020-06-02 腾讯科技(深圳)有限公司 Video vector determination method and device
CN111222011B (en) * 2020-01-06 2023-11-14 腾讯科技(深圳)有限公司 Video vector determining method and device
CN111460971A (en) * 2020-03-27 2020-07-28 北京百度网讯科技有限公司 Video concept detection method and device and electronic equipment
CN111460971B (en) * 2020-03-27 2023-09-12 北京百度网讯科技有限公司 Video concept detection method and device and electronic equipment
CN112015955A (en) * 2020-09-01 2020-12-01 清华大学 Multi-mode data association method and device
CN112804558B (en) * 2021-04-14 2021-06-25 腾讯科技(深圳)有限公司 Video splitting method, device and equipment
CN112804558A (en) * 2021-04-14 2021-05-14 腾讯科技(深圳)有限公司 Video splitting method, device and equipment

Also Published As

Publication number Publication date
CN101299241B (en) 2010-06-02

Similar Documents

Publication Publication Date Title
CN101299241B (en) Method for detecting multi-mode video semantic conception based on tensor representation
CN104899253B (en) Towards the society image across modality images-label degree of correlation learning method
CN101894276B (en) Training method of human action recognition and recognition method
Bouchard et al. Semantic segmentation of motion capture using laban movement analysis
Chen et al. Efficient spatial temporal convolutional features for audiovisual continuous affect recognition
CN103210651A (en) Method and system for video summarization
Amiri et al. Hierarchical keyframe-based video summarization using QR-decomposition and modified-means clustering
CN108154156B (en) Image set classification method and device based on neural topic model
CN112364168A (en) Public opinion classification method based on multi-attribute information fusion
Dong et al. A procedural texture generation framework based on semantic descriptions
Olaode et al. Unsupervised image classification by probabilistic latent semantic analysis for the annotation of images
CN116385946B (en) Video-oriented target fragment positioning method, system, storage medium and equipment
Wang et al. Action recognition using linear dynamic systems
Wu et al. Double constrained bag of words for human action recognition
Naphade A probablistic framework for mapping audio-visual features to high-level semantics in terms of concepts and context
Richard et al. A BoW-equivalent Recurrent Neural Network for Action Recognition.
Gao et al. Discriminative optical flow tensor for video semantic analysis
Gayathri et al. An efficient video indexing and retrieval algorithm using ensemble classifier
CN109857906B (en) Multi-video abstraction method based on query unsupervised deep learning
Wang Video description with GAN
Arif et al. Trajectory-Based 3D Convolutional Descriptors for Human Action Recognition.
Hammad et al. Characterizing the impact of using features extracted from pre-trained models on the quality of video captioning sequence-to-sequence models
Sun et al. Multimodal micro-video classification based on 3D convolutional neural network
Lu et al. MTCA: a multimodal summarization model based on two-stream cross attention
Wang et al. Self-trained video anomaly detection based on teacher-student model

Legal Events

Date Code Title Description
C06 Publication
PB01 Publication
C10 Entry into substantive examination
SE01 Entry into force of request for substantive examination
C14 Grant of patent or utility model
GR01 Patent grant
CF01 Termination of patent right due to non-payment of annual fee

Granted publication date: 20100602

Termination date: 20150114

EXPY Termination of patent right or utility model