Summary of the invention
The purpose of this invention is to provide a method for detecting multi-modal video semantic concepts based on tensor representation. The method comprises the following steps:
1) for every video shot in the training set and the test set, extracting the low-level features of the three modalities image, audio and text, and expressing each video shot as a 3rd-order tensor formed from these three kinds of low-level features;
2) according to the intrinsic manifold structure of the set of video tensor shots, reducing the dimensionality of the original high-dimensional tensors and embedding them into a subspace by seeking transition matrices;
3) building a classifier model on the dimension-reduced set of video tensor shots with a support tensor machine;
4) for a test shot, projecting it with the transition matrices obtained from the training set, and then carrying out semantic concept detection with the classifier model.
Extracting the low-level features of the three modalities image, audio and text from the shots in the training set and the test set: one key frame is chosen from each shot as its representative image, and the color histogram, texture and Canny edges of the key frame are extracted as the image features; the audio segment corresponding to the shot is extracted as an audio clip, the audio clip is divided into overlapping short-time audio frames, and the features of each short-time frame, including MFCC, centroid, roll-off cutoff frequency, spectral flux and zero-crossing rate, are extracted to form a frame feature vector, after which statistics of the short-time frame feature vectors are taken as the audio features of the shot; the TF*IDF values of the transcript text obtained from the video by speech recognition are extracted as the text features.
Expression of the described video tensor shot: based on the image, audio and text low-level features extracted from the video, each video shot is represented by a 3rd-order tensor S ∈ R^{I1×I2×I3}, where I1, I2 and I3 are the dimensions of the image features, the audio features and the text features respectively. The value of each element s_{i1,i2,i3} is then defined as follows: s_{i1,1,1} (1 ≤ i1 ≤ I1) is the value of the image feature, s_{2,i2,2} (1 ≤ i2 ≤ I2) is the value of the audio feature, s_{3,3,i3} (1 ≤ i3 ≤ I3) is the value of the text feature, and the values of all other elements are initially set to zero.
The described method of reducing the dimensionality of the original high-dimensional tensors and embedding them into a subspace by seeking transition matrices according to the intrinsic manifold structure of the video tensor shots is as follows: given a shot data set X = {X_1, X_2, ..., X_N} in the space R^{I1×I2×I3}, according to the intrinsic manifold structure of the tensor shots and spectral graph theory, three transition matrices are sought for each tensor shot X_i (i = 1, ..., N): T_1^i of size J1×I1, T_2^i of size J2×I2 and T_3^i of size J3×I3, which map the N data points into the space R^{J1×J2×J3} (J1 < I1, J2 < I2, J3 < I3) as Y = {Y_1, Y_2, ..., Y_N} satisfying Y_i = X_i ×_1 T_1^i ×_2 T_2^i ×_3 T_3^i (×_k denoting the mode-k product),
thereby realizing the dimension reduction and subspace embedding of the original high-dimensional tensors. When solving for T_1^i (i = 1, ..., N), the optimal intermediate transition matrix V_1 is computed by solving the generalized eigenvector problem (D_U − W_U) V_1 = λ D_U V_1, where D_U = Σ_i D_ii U_1^i (U_1^i)^T and W_U = Σ_{i,j} W_ij U_1^i (U_1^j)^T, W is the weight matrix of the nearest neighbor graph constructed from the training set X, D is the diagonal matrix of W with D_ii = Σ_j W_ij, and U_1^i is the left matrix obtained by performing SVD on the mode-1 unfolding matrix X_(1)^i of X_i (i = 1, ..., N); the transition matrix is then finally computed as T_1^i = V_1^T (U_1^i)^T. T_2^i and T_3^i are obtained in the same way.
The described method of building a classifier model on the dimension-reduced set of video tensor shots with a support tensor machine is as follows: the input of the classifier model is the low-dimensional tensors Y_i obtained by the subspace embedding and dimension reduction, together with the corresponding class labels y_i ∈ {+1, −1}; the output is the tensor hyperplane parameters of the classifier model, w_k ∈ R^{J_k} (k = 1, 2, 3) and b ∈ R. The parameters w_k|_{k=1}^3 and b are obtained by iteratively solving the support tensor machine optimization problem, in which the parameter j is cycled from 1 to 3, c is a constant, and ξ is the slack factor.
For the described test shot, after it is projected with the transition matrices obtained from the training set, semantic concept detection is carried out with the classifier model: a new data point X_t outside the training set is mapped into the low-dimensional subspace as Y_t = X_t ×_1 T_1^t ×_2 T_2^t ×_3 T_3^t by the transition matrices T_1^t, T_2^t and T_3^t, and classification detection is then carried out with the classifier model, i.e. z_t = sign(Y_t ×_1 w_1 ×_2 w_2 ×_3 w_3 + b) is computed to obtain the class label z_t ∈ {+1, −1} of the test data.
Beneficial effects of the present invention:
1) the present invention replaces the traditional vector expression of video with tensors, which effectively alleviates the problems brought about by the "curse of dimensionality";
2) the present invention takes into account the multiple modalities in video, namely image, audio and text, as well as the temporal correlation and co-occurrence characteristics of video data; the fusion and cooperation of the multiple modalities based on these characteristics play an important role in reducing the "semantic gap" between low-level features and high-level semantics;
3) according to the preserved intrinsic manifold structure of the tensor shot set and spectral graph theory, the proposed tensor shot subspace embedding and dimension reduction method not only effectively solves the difficulties brought by high dimensionality, but also, being a linear method, allows new data outside the training set to be projected directly;
4) the present invention adopts a support tensor machine to train the classifier model, which gives good classification and detection ability.
Embodiment
The method for detecting multi-modal video semantic concepts based on tensor representation comprises the following steps:
1) for every video shot in the training set and the test set, extracting the low-level features of the three modalities image, audio and text, and expressing each video shot as a 3rd-order tensor formed from these three kinds of low-level features;
2) according to the intrinsic manifold structure of the set of video tensor shots, reducing the dimensionality of the original high-dimensional tensors and embedding them into a subspace by seeking transition matrices;
3) building a classifier model on the dimension-reduced set of video tensor shots with a support tensor machine;
4) for a test shot, projecting it with the transition matrices obtained from the training set, and then carrying out semantic concept detection with the classifier model.
Extracting the low-level features of the three modalities image, audio and text from the shots in the training set and the test set: low-level features are features extracted directly from the video source data, as opposed to the high-level features represented by semantic concepts. We extract low-level features from each video shot, comprising image, audio and text features.
Image features: the shot is the basic processing unit; one key frame is chosen from each shot as its representative image, and the color histogram, texture and Canny edges of the key frame are extracted as the image features;
Audio features: the audio segment corresponding to a shot is extracted as an audio clip, the audio clip is divided into overlapping short-time audio frames, and the features of each short-time frame, including MFCC, centroid, roll-off cutoff frequency, spectral flux and zero-crossing rate, are extracted to form a frame feature vector; statistics (mean or variance) of the short-time frame feature vectors are then taken as the audio features of the shot;
Text features: we extract features from the transcript text of the video obtained by speech recognition. Because the dimensionality of the text features is much larger than that of the other modalities, and the text contains rich semantic information, Latent Semantic Analysis (LSA) can first be applied to the text for dimension reduction.
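For illustration only, the following is a minimal sketch of how such text features could be computed from per-shot transcripts with TF*IDF followed by LSA; the library calls, the tiny example transcripts and the default target dimensionality are assumptions of this sketch, not fixed by the invention.

```python
# Sketch: TF*IDF text features per shot transcript, reduced with LSA (truncated SVD).
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.decomposition import TruncatedSVD

def text_features(transcripts, n_dims=100):
    """transcripts: list of ASR transcript strings, one per shot (n_dims is a placeholder)."""
    tfidf = TfidfVectorizer()                        # TF*IDF weighting of transcript terms
    X = tfidf.fit_transform(transcripts)             # (num_shots, vocabulary_size), sparse
    lsa = TruncatedSVD(n_components=min(n_dims, X.shape[1] - 1))
    return lsa.fit_transform(X)                      # (num_shots, reduced_dim) dense text features

features = text_features(["a plane lands on the runway", "crowd cheers at the stadium"], n_dims=1)
print(features.shape)
```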
Expression of the described video tensor shot: based on the image, audio and text low-level features extracted from the video, each video shot is represented by a 3rd-order tensor S ∈ R^{I1×I2×I3}, where I1, I2 and I3 are the dimensions of the image features, the audio features and the text features respectively. The value of each element s_{i1,i2,i3} is then defined as follows: s_{i1,1,1} (1 ≤ i1 ≤ I1) is the value of the image feature, s_{2,i2,2} (1 ≤ i2 ≤ I2) is the value of the audio feature, s_{3,3,i3} (1 ≤ i3 ≤ I3) is the value of the text feature, and the values of all other elements are initially set to zero.
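A minimal numpy sketch of this tensor construction is given below; the feature dimensions in the example call are placeholders, and only the element positions defined above are filled while everything else stays zero.

```python
# Sketch: build the 3rd-order tensor shot S from the three per-shot feature vectors.
# 0-based positions for the 1-based definitions: image values at s[i1, 0, 0],
# audio values at s[1, i2, 1], text values at s[2, 2, i3]; all other elements remain zero.
import numpy as np

def build_tensor_shot(img_feat, aud_feat, txt_feat):
    I1, I2, I3 = len(img_feat), len(aud_feat), len(txt_feat)
    S = np.zeros((I1, I2, I3))
    S[:, 0, 0] = img_feat
    S[1, :, 1] = aud_feat
    S[2, 2, :] = txt_feat
    return S

S = build_tensor_shot(np.random.rand(166), np.random.rand(50), np.random.rand(100))
print(S.shape)  # (166, 50, 100), placeholder dimensions
```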
The described method of reducing the dimensionality of the original high-dimensional tensors and embedding them into a subspace by seeking transition matrices according to the intrinsic manifold structure of the video tensor shots is as follows: given a shot data set X = {X_1, X_2, ..., X_N} in the space R^{I1×I2×I3}, according to the intrinsic manifold structure of the tensor shots and spectral graph theory, three transition matrices are sought for each tensor shot X_i (i = 1, ..., N): T_1^i of size J1×I1, T_2^i of size J2×I2 and T_3^i of size J3×I3, which map the N data points into the space R^{J1×J2×J3} (J1 < I1, J2 < I2, J3 < I3) as the set Y = {Y_1, Y_2, ..., Y_N} satisfying Y_i = X_i ×_1 T_1^i ×_2 T_2^i ×_3 T_3^i. The low-dimensional data set Y then reflects the intrinsic geometric and topological structure of the manifold of the set X; at the same time, the mapping is linear, that is to say, for a data point X_t outside the training set, its mapping in the low-dimensional subspace can be computed directly from the transition matrices obtained in advance from training.
Let X ∈ R^{I1×I2×I3} denote a 3rd-order tensor shot. Given N tensor shots distributed on a manifold M, forming the data set X = {X_1, X_2, ..., X_N}, we can construct a nearest neighbor graph G to model the local geometry of M. The weight matrix W of G is defined on the edges of G, so that W_ij measures how close the neighboring shots X_i and X_j are and W_ij = 0 when X_i and X_j are not neighbors, where c is a constant in the weight definition.
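As a sketch of one common way to realize such a graph and weight matrix (the k-nearest-neighbor rule and the heat-kernel weight exp(−||X_i − X_j||²/c) are assumptions here, since the text above only states that the weights depend on a constant c):

```python
# Sketch: k-nearest-neighbor graph over tensor shots with heat-kernel weights (assumed form).
import numpy as np

def weight_matrix(shots, k=5, c=1.0):
    """shots: list of N tensor shots (numpy arrays of equal shape)."""
    N = len(shots)
    flat = np.stack([s.ravel() for s in shots])                 # treat each shot as one point
    d2 = ((flat[:, None, :] - flat[None, :, :]) ** 2).sum(-1)   # pairwise squared distances
    W = np.zeros((N, N))
    for i in range(N):
        neighbors = np.argsort(d2[i])[1:k + 1]                  # k nearest shots, excluding i itself
        W[i, neighbors] = np.exp(-d2[i, neighbors] / c)         # heat-kernel weight (assumption)
    W = np.maximum(W, W.T)                                      # symmetrize the graph
    D = np.diag(W.sum(axis=1))                                  # degree matrix, D_ii = sum_j W_ij
    return W, D
```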
For each tensor shot X_i (1 ≤ i ≤ N), according to the higher-order singular value decomposition (HOSVD), we can perform SVD on the mode-k unfolding matrices of X_i (k = 1, 2, 3), namely X_(1)^i, X_(2)^i and X_(3)^i, and compute the corresponding left matrices U_1^i, U_2^i and U_3^i. For instance, U_1^i is the left matrix obtained by performing SVD on the mode-1 unfolding matrix X_(1)^i of X_i.
Now, having obtained the left matrices U_1^i (i = 1, ..., N), we want to find an I1×J1 matrix V_1 that maps each U_1^i into the lower-dimensional representation V_1^T U_1^i.
We consider this problem from two angles. On the one hand, to preserve the intrinsic structure of the manifold, the optimal solution of the following objective function needs to be sought:

min_{V_1} (1/2) Σ_{i,j} ||V_1^T U_1^i − V_1^T U_1^j||^2 W_ij.

That is to say, minimizing this objective guarantees that when U_1^i and U_1^j are "close", V_1^T U_1^i and V_1^T U_1^j are also "close".
D is the diagonal matrix of W, i.e. D_ii = Σ_j W_ij, and for a matrix A the trace satisfies ||A||^2 = tr(AA^T). We therefore have:

(1/2) Σ_{i,j} ||V_1^T U_1^i − V_1^T U_1^j||^2 W_ij = tr(V_1^T (D_U − W_U) V_1),

where D_U = Σ_i D_ii U_1^i (U_1^i)^T and W_U = Σ_{i,j} W_ij U_1^i (U_1^j)^T. From the above derivation it can be seen that solving the objective requires minimizing tr(V_1^T (D_U − W_U) V_1).
On the other hand, besides preserving the graph structure of the manifold, the overall variance on the manifold also needs to be maximized. In general, the variance of a random variable x is

var(x) = ∫_M (x − μ)^2 dP(x), μ = ∫_M x dP(x),

where M is the manifold of the data, μ is the expected value and dP is the probability measure. According to spectral graph theory, dP can be estimated in discretized form by the diagonal matrix D of the sample points (D_ii = Σ_j W_ij). We then have the following derivation:

var ≈ Σ_i ||V_1^T U_1^i||^2 D_ii = tr(V_1^T D_U V_1).
Combining the constraints of the above two aspects, we obtain the following optimization problem:

min_{V_1} tr(V_1^T (D_U − W_U) V_1) / tr(V_1^T D_U V_1),

which minimizes the graph-preserving term while maximizing the variance term. Obviously, the optimal solution V_1 is given by the generalized eigenvectors of (D_U − W_U, D_U). Therefore we can compute the following generalized eigenvector problem to obtain the optimal V_1:

(D_U − W_U) V_1 = λ D_U V_1.
Once V_1 has been computed, T_1^i = V_1^T (U_1^i)^T gives the transition matrices T_1^i (i = 1, ..., N). In the same manner, the intermediate transition matrices V_2 and V_3 for the audio and text modalities can be computed with the same method, so that T_2^i is obtained from U_2^i and V_2, and T_3^i is obtained from U_3^i and V_3. The data of the low-dimensional video tensor shot set Y are thus Y_i = X_i ×_1 T_1^i ×_2 T_2^i ×_3 T_3^i.
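Under the same conventions as the earlier sketches, the per-shot transition matrices and the embedded low-dimensional tensor could be computed as follows; the mode-k product is implemented here with numpy.tensordot, and the full SVD left matrix is assumed to be square.

```python
# Sketch: T_k^i = V_k^T (U_k^i)^T and Y_i = X_i x_1 T_1^i x_2 T_2^i x_3 T_3^i (x_k = mode-k product).
import numpy as np

def mode_k_product(X, M, k):
    """Multiply tensor X along mode k by matrix M of shape (new_dim, X.shape[k])."""
    return np.moveaxis(np.tensordot(M, X, axes=(1, k)), 0, k)

def embed_shot(X, V_list):
    """V_list: intermediate matrices V_1, V_2, V_3 of sizes I_k x J_k obtained from training."""
    Y = X
    for k in range(3):
        Xk = np.moveaxis(X, k, 0).reshape(X.shape[k], -1)   # mode-k unfolding of the original shot
        U_k, _, _ = np.linalg.svd(Xk)                        # full left matrix, I_k x I_k
        T_k = V_list[k].T @ U_k.T                            # T_k^i = V_k^T (U_k^i)^T, shape J_k x I_k
        Y = mode_k_product(Y, T_k, k)
    return Y                                                 # embedded tensor of shape (J1, J2, J3)
```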
The algorithm for the subspace embedding and dimension reduction of the tensor shots is given below.
Input: the original training tensor shot set X = {X_i ∈ R^{I1×I2×I3}, i = 1, ..., N};
Output: the mapped low-dimensional tensor shot set Y = {Y_i ∈ R^{J1×J2×J3}, i = 1, ..., N}, the intermediate transition matrices V_1, V_2 and V_3, and the transition matrices T_1^i, T_2^i and T_3^i, satisfying Y_i = X_i ×_1 T_1^i ×_2 T_2^i ×_3 T_3^i.
Algorithm description:
Step 1: construct a nearest neighbor graph G;
Step 2: compute the weight matrix W;
Step 3: For k = 1 to 3
Step 4: For i = 1 to N
Step 5: compute the left matrix U_k^i of the SVD of the mode-k unfolding matrix X_(k)^i of X_i;
Step 6: End;
Step 7: compute D_U = Σ_i D_ii U_k^i (U_k^i)^T;
Step 8: compute W_U = Σ_{i,j} W_ij U_k^i (U_k^j)^T;
Step 9: solve the following generalized eigenvector problem to obtain the optimal V_k: (D_U − W_U) V_k = λ D_U V_k;
Step 10: For i = 1 to N
Step 11: compute T_k^i = V_k^T (U_k^i)^T;
Step 12: End;
Step 13: End;
Step 14: For i = 1 to N
Step 15: compute Y_i = X_i ×_1 T_1^i ×_2 T_2^i ×_3 T_3^i;
Step 16: End.
The described method of building a classifier model on the dimension-reduced set of video tensor shots with a support tensor machine is as follows: in this step, we adopt a support tensor machine to train the classifier of the tensor shots. The input of the training model is the low-dimensional tensor Y_i obtained in the previous step by subspace embedding and dimension reduction, rather than the original X_i. Such processing not only improves the accuracy, but also improves the efficiency of training and classification.
The algorithm for training the classifier with the support tensor machine is as follows.
Input: the mapped low-dimensional tensor shot set Y = {Y_i, i = 1, ..., N} and the corresponding class labels y_i ∈ {+1, −1};
Output: the tensor hyperplane parameters of the classifier model, w_k ∈ R^{J_k} (k = 1, 2, 3) and b ∈ R;
Algorithm description:
Step 1: set w_k|_{k=1}^3 to random unit vectors in R^{J_k};
Step 2: repeat Steps 3-5 until convergence;
Step 3: For j = 1 to 3
Step 4: obtain w_j and b by solving the support tensor machine optimization sub-problem with respect to w_j, where c is a constant and ξ is the slack factor;
Step 5: End;
Step 6: check convergence: if the current projection vectors w_{k,t} and the previous projection vectors w_{k,t−1} satisfy the convergence criterion, then w_k|_{k=1}^3 have converged;
Step 7: End.
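The optimization sub-problem of Step 4 is not reproduced in the text above; as a hedged reference only, the standard alternating sub-problem of the support tensor machine, which matches the quantities named here (the mode index j being cycled, the constant c and the slack variables ξ_i), has the following form:

```latex
\min_{w_j,\, b,\, \xi}\;\; \frac{1}{2}\,\|w_j\|^2 \prod_{k \neq j} \|w_k\|^2 \;+\; c \sum_{i=1}^{N} \xi_i
\quad \text{s.t.} \quad
y_i\!\left( w_j^{T}\,\Bigl(Y_i \prod_{k \neq j} \times_k w_k \Bigr) + b \right) \ge 1 - \xi_i,\qquad \xi_i \ge 0.
```

With the other two modes fixed, this sub-problem reduces to a standard soft-margin SVM on the vectors obtained by contracting each Y_i with the fixed w_k, which is what makes the alternating iteration of Steps 2 to 5 tractable.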
For the described test shot, after it is projected with the transition matrices obtained from the training set, semantic concept detection is carried out with the classifier model: in this step, we detect new data outside the training set according to the classifier model obtained by the training above. Because our dimension reduction method is linear, new data can be mapped directly into the low-dimensional subspace and then classified by the classifier.
Let X_t be a detection example outside the training set; the following algorithm gives the detection procedure.
Input: the shot X_t ∈ R^{I1×I2×I3} to be detected, the intermediate transition matrices V_1, V_2 and V_3, and the classifier parameters w_k ∈ R^{J_k} (k = 1, 2, 3) and b ∈ R;
Output: the class label z_t ∈ {+1, −1} of X_t;
Algorithm description:
Step 1: For k = 1 to 3
Step 2: compute the left matrix U_k^t of the SVD of the mode-k unfolding matrix X_(k)^t of X_t;
Step 3: compute T_k^t = V_k^T (U_k^t)^T;
Step 4: End;
Step 5: compute Y_t = X_t ×_1 T_1^t ×_2 T_2^t ×_3 T_3^t;
Step 6: compute z_t = sign(Y_t ×_1 w_1 ×_2 w_2 ×_3 w_3 + b);
Step 7: End.
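A minimal end-to-end sketch of this detection procedure, reusing the conventions of the earlier sketches, could look as follows; as before, the unfolding convention and the use of the full SVD left matrix are assumptions of the sketch rather than details fixed by the text.

```python
# Sketch: classify one unseen tensor shot X_t with the trained V_k, w_k and b.
import numpy as np

def mode_k_product(X, M, k):
    """Multiply tensor X along mode k by matrix M of shape (new_dim, X.shape[k])."""
    return np.moveaxis(np.tensordot(M, X, axes=(1, k)), 0, k)

def detect(X_t, V_list, w_list, b):
    Y_t = X_t
    for k in range(3):
        Xk = np.moveaxis(X_t, k, 0).reshape(X_t.shape[k], -1)   # mode-k unfolding of X_t
        U_k, _, _ = np.linalg.svd(Xk)                            # left matrix of the SVD
        T_k = V_list[k].T @ U_k.T                                # T_k^t = V_k^T (U_k^t)^T
        Y_t = mode_k_product(Y_t, T_k, k)
    score = Y_t
    for k in range(3):
        score = np.tensordot(w_list[k], score, axes=(0, 0))      # contract Y_t with w_1, w_2, w_3
    return int(np.sign(score + b))                               # class label z_t in {+1, -1}
```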