CN101299241A - Method for detecting multi-mode video semantic conception based on tensor representation - Google Patents


Info

Publication number
CN101299241A
Authority
CN
China
Prior art keywords
tensor
video
shot
Prior art date
Legal status
Granted
Application number
CNA2008100591256A
Other languages
Chinese (zh)
Other versions
CN101299241B (en)
Inventor
吴飞
庄越挺
刘亚楠
郭同强
Current Assignee
Zhejiang University ZJU
Original Assignee
Zhejiang University ZJU
Priority date
Filing date
Publication date
Application filed by Zhejiang University ZJU filed Critical Zhejiang University ZJU
Priority to CN2008100591256A (granted as CN101299241B)
Publication of CN101299241A
Application granted
Publication of CN101299241B
Expired - Fee Related

Landscapes

  • Image Analysis (AREA)
  • Image Processing (AREA)

Abstract

The invention discloses a method for detecting multi-modal video semantic concepts based on tensor representation, comprising the following steps: 1. extract low-level image, audio, and text features from the video shots in the training and test sets, so that each video shot is expressed as a third-order tensor built from the three kinds of low-level features; 2. following the intrinsic manifold structure of the set of video tensor shots, reduce the dimensionality of the original high-dimensional tensors and embed them in a subspace by seeking transition matrices; 3. train a classifier model on the dimension-reduced set of video tensor shots with a support tensor machine; 4. project each test shot with the transition matrices computed on the training set, then detect semantic concepts with the classifier. The invention makes full use of the multi-modal data in video, represents each shot as a third-order tensor, and proposes a subspace-embedding dimensionality-reduction method on top of this representation, thereby detecting semantic concepts in video shots and supporting better analysis and understanding of video semantics.

Description

Method for detecting multi-mode video semantic conception based on tensor representation
Technical field
The present invention relates to a method for detecting multi-modal video semantic concepts based on tensor representation. The method expresses each video shot as a third-order tensor and seeks an effective dimensionality-reduction mapping that projects it into a low-dimensional semantic space, so that semantic concepts of video tensor shots can be detected with a trained classifier model. It belongs to the field of video semantic analysis and understanding.
Background technology
With the development and spread of digital imaging devices, and the rapid growth of the film and television industry, computer technology, communication technology, multimedia processing, compression coding, and the Internet, large amounts of video data are produced in fields such as news, film, historical archives, and surveillance. Video data carries rich semantics such as people, scenes, objects, and events. Video is also time-series data: it contains the three media of image, audio, and text, which exhibit temporally correlated co-occurrence. At the same time, the fusion and cooperation of multiple modalities play an important role in narrowing the "semantic gap" between low-level features and high-level semantics. How to effectively exploit the multi-modal and temporal characteristics of video to mine its semantic information, so as to support effective video retrieval and realize the resource-sharing advantages of video data, is a challenging research question.
As for how to express the multiple media modalities in video, the traditional approach represents the image, audio, and text features as one concatenated vector. Such a high-dimensional vector, however, tends to cause the "curse of dimensionality", and the temporally correlated co-occurrence relations among the modalities in the video are ignored. In recent years, multilinear algebra, that is, higher-order tensors, has been widely applied in fields such as computer vision, information retrieval, and signal processing. A tensor is a natural generalization and extension of vectors and matrices, and tensor algebra defines a series of multilinear operations over sets of vector spaces. Meanwhile, the supervised tensor learning framework, which takes tensors as input, adopts an alternating-projection optimization procedure to find the optimal solution; it is a combination of convex optimization and multilinear operations. Based on the supervised tensor learning framework, the traditional support vector machine can be extended to the support tensor machine, realizing the training and application of classifier models.
Summary of the invention
The purpose of this invention is to provide a method for detecting multi-modal video semantic concepts based on tensor representation.
The method comprises the following steps:
1) extract low-level features of the three modalities (image, audio, text) from every video shot in the training and test sets; each video tensor shot is then expressed as a third-order tensor built from these three kinds of low-level features;
2) following the intrinsic manifold structure of the set of video tensor shots, reduce the dimensionality of the original high-dimensional tensors and embed them in a subspace by seeking transition matrices;
3) train a classifier model on the dimension-reduced set of video tensor shots with a support tensor machine;
4) project each test shot with the transition matrices computed from the training set, then detect semantic concepts with the classifier model.
Extracting the low-level features of the three modalities (image, audio, text) from every shot in the training and test sets: choose one keyframe per shot as its representative image, then extract a color histogram, texture, and Canny edges as the image features. The audio segment corresponding to the shot is extracted as an audio clip and divided into overlapping short-time audio frames; from each frame, features including MFCCs, spectral centroid, roll-off frequency, spectral flux, and zero-crossing rate are extracted to form a frame feature vector, and statistics of the short-time frame feature vectors serve as the shot's audio features. TF*IDF values computed from the recognized transcript text of the video serve as the text features.
Representation of a video tensor shot: based on the low-level image, audio, and text features extracted from the video, each video shot is represented by a third-order tensor $S \in \mathbb{R}^{I_1 \times I_2 \times I_3}$, where $I_1$, $I_2$ and $I_3$ are the dimensions of the image, audio, and text features respectively. The elements of $S$ are defined as follows: $s_{i_1,1,1}\ (1 \le i_1 \le I_1)$ holds the image feature values, $s_{2,i_2,2}\ (1 \le i_2 \le I_2)$ the audio feature values, and $s_{3,3,i_3}\ (1 \le i_3 \le I_3)$ the text feature values; all other elements are initialized to zero.
Reducing the dimensionality of the original high-dimensional tensors and embedding them in a subspace by seeking transition matrices, following the intrinsic manifold structure of the video tensor shots: given the set of shot data $X = \{X_1, X_2, \ldots, X_N\}$ in the space $\mathbb{R}^{I_1 \times I_2 \times I_3}$, according to the intrinsic manifold structure of the tensor shots and spectral graph theory, seek for each tensor shot $X_i|_{i=1}^{N}$ three transition matrices: $T_1^i$ of size $J_1 \times I_1$, $T_2^i$ of size $J_2 \times I_2$, and $T_3^i$ of size $J_3 \times I_3$, which map the $N$ data points into the space $\mathbb{R}^{J_1 \times J_2 \times J_3}$ ($J_1 < I_1$, $J_2 < I_2$, $J_3 < I_3$) as $Y = \{Y_1, Y_2, \ldots, Y_N\}$ satisfying $Y_i|_{i=1}^{N} = X_i \otimes_1 T_1^{iT} \otimes_2 T_2^{iT} \otimes_3 T_3^{iT}$, thereby realizing the dimensionality reduction and subspace embedding of the original high-dimensional tensors. To obtain $T_1^i|_{i=1}^{N}$, solve the generalized eigenvector problem $(D_U - W_U)V_1 = \lambda D_U V_1$ for the optimal intermediate conversion matrix $V_1$, where $D_U = \sum_i D_{ii} U_1^i U_1^{iT}$, $W_U = \sum_{ij} W_{ij} U_1^i U_1^{jT}$, $W$ is the weight matrix of the nearest-neighbor graph constructed from the training set $X$, $D$ is the diagonal matrix of $W$ with $D_{ii} = \sum_j W_{ij}$, and $U_1^i$ is the left matrix obtained from the SVD of the mode-1 unfolding matrix $X_{(1)}^i$ of $X_i|_{i=1}^{N}$; the transition matrices are finally $T_1^i|_{i=1}^{N} = V_1^T U_1^i \in \mathbb{R}^{J_1 \times I_1}$. $T_2^i$ and $T_3^i$ are obtained in the same way.
Training a classifier model on the dimension-reduced set of video tensor shots with a support tensor machine: the inputs of the classifier model are the low-dimensional tensors obtained by the subspace embedding and dimensionality reduction, $Y_i|_{i=1}^{N} = X_i \otimes_1 T_1^{iT} \otimes_2 T_2^{iT} \otimes_3 T_3^{iT} \in \mathbb{R}^{J_1 \times J_2 \times J_3}$, together with the corresponding class labels $y_i \in \{+1, -1\}$; the outputs are the tensor-hyperplane parameters of the classifier model, $w_k|_{k=1}^{3} \in \mathbb{R}^{J_k}$ and $b \in \mathbb{R}$. The parameters $w_k|_{k=1}^{3}$ and $b$ are obtained by iteratively solving the optimization problem
$$\min_{w_j, b, \xi}\; J_{C\text{-}STM}(w_j, b, \xi) = \frac{\eta}{2}\|w_j\|_{Fro}^2 + c\sum_{i=1}^{N}\xi_i \quad \text{s.t.}\;\; y_i\Big[w_j^T\Big(Y_i \prod_{\substack{k=1\\k \ne j}}^{3} \times_k w_k\Big) + b\Big] \ge 1 - \xi_i,\; 1 \le i \le N,\; \xi \ge 0,$$
where the parameter $j$ cycles from 1 to 3, $c$ is a constant, $\xi$ is the vector of slack variables, and $\eta = \prod_{1 \le k \le 3,\, k \ne j} \|w_k\|_{Fro}^2$.
Projecting a test shot with the transition matrices computed from the training set, then detecting semantic concepts with the classifier model: a new datum outside the training set, $X_t \in \mathbb{R}^{I_1 \times I_2 \times I_3}$, is mapped into the low-dimensional subspace by the transition matrices $T_1^t = V_1^T U_1^t \in \mathbb{R}^{J_1 \times I_1}$, $T_2^t = V_2^T U_2^t \in \mathbb{R}^{J_2 \times I_2}$ and $T_3^t = V_3^T U_3^t \in \mathbb{R}^{J_3 \times I_3}$ as $Y_t = X_t \otimes_1 T_1^{tT} \otimes_2 T_2^{tT} \otimes_3 T_3^{tT} \in \mathbb{R}^{J_1 \times J_2 \times J_3}$, and then classified by the classifier model, i.e. $z_t = \operatorname{sign}(Y_t \otimes_1 w_1 \otimes_2 w_2 \otimes_3 w_3 + b)$ yields the class label $z_t \in \{+1, -1\}$ of the test datum.
Beneficial effects of the present invention:
1) the invention replaces the traditional vector representation of video with tensors, which effectively alleviates the problems brought by the "curse of dimensionality";
2) the invention takes into account the multiple modalities in video, namely image, audio, and text, together with the temporally correlated co-occurrence characteristic of video data; fusing and coordinating the modalities based on this characteristic plays an important role in narrowing the "semantic gap" between low-level features and high-level semantics;
3) following the intrinsic manifold structure of the tensor shot set and spectral graph theory, the proposed tensor-shot subspace-embedding and dimensionality-reduction method not only effectively resolves the difficulties brought by high dimensionality but, being a linear method, can also directly project new data outside the training set;
4) the invention trains the classifier model with a support tensor machine, which has good classification and detection ability.
Description of drawings
Fig. 1 is the flowchart of the method for detecting multi-modal video semantic concepts based on tensor representation;
Fig. 2 shows the detection results of the present invention for the semantic concept "Explosion", compared with the ISOMAP and PCA methods and plotted as ROC curves;
Fig. 3 shows the detection results of the present invention for the semantic concept "Sports", compared with the ISOMAP and PCA methods and plotted as ROC curves.
Embodiment
The method for detecting multi-modal video semantic concepts based on tensor representation comprises the following steps:
1) extract low-level features of the three modalities (image, audio, text) from every video shot in the training and test sets; each video tensor shot is then expressed as a third-order tensor built from these three kinds of low-level features;
2) following the intrinsic manifold structure of the set of video tensor shots, reduce the dimensionality of the original high-dimensional tensors and embed them in a subspace by seeking transition matrices;
3) train a classifier model on the dimension-reduced set of video tensor shots with a support tensor machine;
4) project each test shot with the transition matrices computed from the training set, then detect semantic concepts with the classifier model.
Extracting the low-level features of the three modalities (image, audio, text) from every shot in the training and test sets: low-level features are features extracted directly from the raw video data, as opposed to the high-level features represented by semantic concepts. We extract low-level features from each video shot, covering the image, audio, and text modalities.
Image features: the shot is the basic processing unit; one keyframe is chosen per shot as its representative image, and the keyframe's color histogram, texture, and Canny edges are extracted as image features;
Audio features: the audio segment corresponding to a shot is extracted as an audio clip and divided into overlapping short-time audio frames; from each frame, features including MFCCs, spectral centroid, roll-off frequency, spectral flux, and zero-crossing rate are extracted to form a frame feature vector, and statistics (mean or variance) of the short-time frame feature vectors serve as the shot's audio features;
Text features: we extract features from the recognized transcript text of the video. Because the dimensionality of the text features is much larger than that of the other modalities, while the text carries rich semantic information, Latent Semantic Analysis (LSA) can first be applied to the text for dimensionality reduction.
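As an illustration only (not part of the patent), TF*IDF weighting followed by an LSA-style truncated SVD can be sketched in plain numpy; the transcript snippets, the whitespace tokenization, and the plain logarithmic IDF variant are all assumptions:

```python
import numpy as np

# Hypothetical ASR transcript snippets standing in for three shots.
docs = ["explosion smoke fire downtown",
        "football match goal score",
        "goal in the second half of the match"]

tokenized = [d.split() for d in docs]
vocab = sorted({w for doc in tokenized for w in doc})
tf = np.array([[doc.count(w) for w in vocab] for doc in tokenized], float)
df = (tf > 0).sum(axis=0)            # document frequency per term
idf = np.log(len(docs) / df)         # plain log IDF (one common variant)
tfidf = tf * idf                     # TF*IDF matrix, shots x terms

# LSA: a truncated SVD keeps the k strongest latent "topics" of the text.
k = 2
U, s, Vt = np.linalg.svd(tfidf, full_matrices=False)
text_features = U[:, :k] * s[:k]     # low-dimensional text features per shot
print(text_features.shape)           # (3, 2)
```

The reduced `text_features` rows would then play the role of the text-modality vector of each shot.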
Representation of a video tensor shot: based on the low-level image, audio, and text features extracted from the video, each video shot is represented by a third-order tensor $S \in \mathbb{R}^{I_1 \times I_2 \times I_3}$, where $I_1$, $I_2$ and $I_3$ are the dimensions of the image, audio, and text features respectively. The elements of $S$ are defined as follows: $s_{i_1,1,1}\ (1 \le i_1 \le I_1)$ holds the image feature values, $s_{2,i_2,2}\ (1 \le i_2 \le I_2)$ the audio feature values, and $s_{3,3,i_3}\ (1 \le i_3 \le I_3)$ the text feature values; all other elements are initialized to zero.
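The sparse third-order layout described above can be sketched as follows; the feature dimensions and the random feature values are toy assumptions (indices below are 0-based, the patent's are 1-based):

```python
import numpy as np

rng = np.random.default_rng(0)
I1, I2, I3 = 5, 4, 3            # toy feature dimensions (assumptions)
img = rng.random(I1)            # image feature vector
aud = rng.random(I2)            # audio feature vector
txt = rng.random(I3)            # text feature vector

# Third-order tensor for one shot: the three feature vectors occupy the
# fibres s_{i1,1,1}, s_{2,i2,2}, s_{3,3,i3}; every other entry stays zero.
S = np.zeros((I1, I2, I3))
S[:, 0, 0] = img                # image fibre
S[1, :, 1] = aud                # audio fibre
S[2, 2, :] = txt                # text fibre

print(S.shape, np.count_nonzero(S))
```

The three fibres do not intersect, so the tensor holds exactly $I_1 + I_2 + I_3$ nonzero values.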
Reducing the dimensionality of the original high-dimensional tensors and embedding them in a subspace by seeking transition matrices, following the intrinsic manifold structure of the video tensor shots: given the set of shot data $X = \{X_1, X_2, \ldots, X_N\}$ in the space $\mathbb{R}^{I_1 \times I_2 \times I_3}$, according to the intrinsic manifold structure of the tensor shots and spectral graph theory, we seek for each tensor shot $X_i|_{i=1}^{N}$ three transition matrices, $T_1^i$ of size $J_1 \times I_1$, $T_2^i$ of size $J_2 \times I_2$, and $T_3^i$ of size $J_3 \times I_3$, which map the $N$ data points into the space $\mathbb{R}^{J_1 \times J_2 \times J_3}$ ($J_1 < I_1$, $J_2 < I_2$, $J_3 < I_3$) as the set $Y = \{Y_1, Y_2, \ldots, Y_N\}$ satisfying $Y_i|_{i=1}^{N} = X_i \otimes_1 T_1^{iT} \otimes_2 T_2^{iT} \otimes_3 T_3^{iT}$. The low-dimensional data set $Y$ then reflects the intrinsic geometric topology of the manifold underlying $X$. Moreover, the mapping is linear: for a data point $X_t$ outside the training set, its image in the low-dimensional subspace can be computed directly from the transition matrices obtained in training.
Let $X_i \in \mathbb{R}^{I_1 \times I_2 \times I_3}$ denote a third-order tensor shot. Given $N$ tensor shots forming a data set $X = \{X_1, X_2, \ldots, X_N\}$ distributed on a manifold $M \subset \mathbb{R}^{I_1 \times I_2 \times I_3}$, we can build a nearest-neighbor graph $G$ to model the local geometric structure of $M$. The weight matrix $W$ of $G$ is defined by an equation involving a constant $c$ (the defining equation appears only as an image in the original and is not reproduced here).
For each tensor shot $X_i$ ($1 \le i \le N$), following the Higher-Order Singular Value Decomposition (HOSVD), we apply the SVD to each mode-$k$ unfolding matrix $X_{(k)}^i|_{k=1}^{3}$, i.e. to $X_{(1)}^i$, $X_{(2)}^i$, $X_{(3)}^i$, and compute the left matrices $U_1^i$, $U_2^i$, $U_3^i$. For instance, $U_1^i$ is the left matrix obtained from the SVD of the mode-1 unfolding matrix $X_{(1)}^i$ of $X_i$.
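A minimal numpy sketch of this step: mode-k unfolding followed by one SVD per mode. The unfolding convention (mode-k fibres as columns, remaining axes kept in their original order) and the toy tensor are assumptions:

```python
import numpy as np

def unfold(X, k):
    """Mode-k unfolding: mode-k fibres become the columns of a matrix."""
    return np.moveaxis(X, k, 0).reshape(X.shape[k], -1)

rng = np.random.default_rng(1)
X = rng.random((5, 4, 3))        # one toy tensor shot (sizes are assumptions)

# Left singular matrices U^(k) of each mode-k unfolding, as in HOSVD.
U = [np.linalg.svd(unfold(X, k), full_matrices=True)[0] for k in range(3)]

for k in range(3):
    print(U[k].shape)            # (I_k, I_k): square orthogonal factor
```

Each `U[k]` is orthogonal, which is what lets the transition matrices later act as projections of the per-mode factors.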
Now, given $U_1^i \in \mathbb{R}^{I_1 \times I_1}$ ($1 \le i \le N$), we want to find a matrix $V_1$ of size $I_1 \times J_1$ that maps $U_1^i$ to $T_1^{iT} \in \mathbb{R}^{I_1 \times J_1}$, i.e. $T_1^i = V_1^T U_1^i \in \mathbb{R}^{J_1 \times I_1}$. We consider this problem from two angles. On the one hand, to preserve the intrinsic structure of the manifold, we must find the optimum of the objective
$$\min_{V_1} \sum_{ij} \| V_1^T U_1^i - V_1^T U_1^j \|^2\, W_{ij}.$$
That is, minimizing $\sum_{ij} \| V_1^T U_1^i - V_1^T U_1^j \|^2 W_{ij}$ guarantees that when $U_1^i$ and $U_1^j$ are "close", $V_1^T U_1^i$ and $V_1^T U_1^j$ are also "close".
$D$ is the diagonal matrix of $W$, i.e. $D_{ii} = \sum_j W_{ij}$, and for any matrix $A$ the norm satisfies $\|A\|^2 = \mathrm{tr}(AA^T)$. Hence:
$$\begin{aligned}
\frac{1}{2}\sum_{ij} \| V_1^T U_1^i - V_1^T U_1^j \|^2 W_{ij}
&= \frac{1}{2}\sum_{ij} \mathrm{tr}\big( (T_1^i - T_1^j)(T_1^i - T_1^j)^T \big) W_{ij} \\
&= \frac{1}{2}\sum_{ij} \mathrm{tr}\big( T_1^i T_1^{iT} + T_1^j T_1^{jT} - T_1^i T_1^{jT} - T_1^j T_1^{iT} \big) W_{ij} \\
&= \mathrm{tr}\Big( \sum_i D_{ii} T_1^i T_1^{iT} - \sum_{ij} W_{ij} T_1^i T_1^{jT} \Big) \\
&= \mathrm{tr}\Big( \sum_i D_{ii} V_1^T U_1^i U_1^{iT} V_1 - \sum_{ij} W_{ij} V_1^T U_1^i U_1^{jT} V_1 \Big) \\
&= \mathrm{tr}\Big( V_1^T \Big( \sum_i D_{ii} U_1^i U_1^{iT} - \sum_{ij} W_{ij} U_1^i U_1^{jT} \Big) V_1 \Big) \\
&= \mathrm{tr}\big( V_1^T (D_U - W_U) V_1 \big),
\end{aligned}$$
where $D_U = \sum_i D_{ii} U_1^i U_1^{iT}$ and $W_U = \sum_{ij} W_{ij} U_1^i U_1^{jT}$. From this derivation it can be seen that solving $\min_{V_1} \sum_{ij} \| V_1^T U_1^i - V_1^T U_1^j \|^2 W_{ij}$ amounts to minimizing $\mathrm{tr}\big(V_1^T (D_U - W_U) V_1\big)$.
On the other hand, besides preserving the graph structure of the manifold, we also need to maximize the overall variance on the manifold. In general, the variance of a random variable $x$ is
$$\mathrm{var}(x) = \int_M (x - \mu)^2\, dP(x), \qquad \mu = \int_M x\, dP(x),$$
where $M$ is the manifold of the data, $\mu$ is the expectation, and $dP$ is the probability density. According to spectral graph theory, $dP$ can be estimated in discretized form by the diagonal matrix $D$ of the sample points ($D_{ii} = \sum_j W_{ij}$).
We therefore have the following derivation:
$$\begin{aligned}
\mathrm{var}(T_1) &= \sum_i \| T_1^i \|^2 D_{ii}
= \sum_i \mathrm{tr}\big( T_1^i T_1^{iT} \big) D_{ii} \\
&= \sum_i \mathrm{tr}\big( V_1^T U_1^i U_1^{iT} V_1 \big) D_{ii}
= \mathrm{tr}\Big( V_1^T \Big( \sum_i D_{ii} U_1^i U_1^{iT} \Big) V_1 \Big)
= \mathrm{tr}\big( V_1^T D_U V_1 \big).
\end{aligned}$$
Combining the constraints from the two aspects above, we obtain the following optimization problem:
$$\min_{V_1} \frac{\mathrm{tr}\big( V_1^T (D_U - W_U) V_1 \big)}{\mathrm{tr}\big( V_1^T D_U V_1 \big)}.$$
Obviously, the optimal $V_1$ consists of generalized eigenvectors of the pencil $(D_U - W_U, D_U)$. We can therefore obtain the optimal $V_1$ by solving the generalized eigenvector problem
$$(D_U - W_U) V_1 = \lambda D_U V_1.$$
Once $V_1$ has been computed, the transition matrices $T_1^i = V_1^T U_1^i$ follow. In the same way, the intermediate conversion matrices $V_2$ and $V_3$ of the audio and text modes can be computed, so that $T_2^i$ is obtained from $V_2$ and $T_3^i$ from $V_3$. The data of the low-dimensional video tensor shot set $Y$ are then $Y_i|_{i=1}^{N} = X_i \otimes_1 T_1^{iT} \otimes_2 T_2^{iT} \otimes_3 T_3^{iT} \in \mathbb{R}^{J_1 \times J_2 \times J_3}$.
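The mode-$k$ product $\otimes_k$ used throughout can be implemented directly. The sketch below applies three toy $J_k \times I_k$ matrices to a random tensor (all sizes and data are assumptions), mirroring how $Y_i$ is formed from $X_i$:

```python
import numpy as np

def mode_k_product(X, M, k):
    """Multiply tensor X by matrix M along mode k (the mode-k product)."""
    Xk = np.moveaxis(X, k, 0).reshape(X.shape[k], -1)    # mode-k unfolding
    out = M @ Xk                                          # act on mode-k fibres
    new_shape = (M.shape[0],) + tuple(s for i, s in enumerate(X.shape) if i != k)
    return np.moveaxis(out.reshape(new_shape), 0, k)

rng = np.random.default_rng(3)
X = rng.random((5, 4, 3))
T1, T2, T3 = rng.random((2, 5)), rng.random((2, 4)), rng.random((2, 3))

# Y = X x_1 T1 x_2 T2 x_3 T3 : the low-dimensional embedded shot.
Y = X
for k, T in enumerate((T1, T2, T3)):
    Y = mode_k_product(Y, T, k)
print(Y.shape)                                           # (2, 2, 2)
```

Note the convention: multiplying by a $J_k \times I_k$ matrix directly corresponds, in the patent's notation, to the product with $T_k^{iT}$ transposed appropriately.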
The algorithm for the subspace embedding and dimension reduction of tensor shots is given below.
Input: the original training tensor shot set $X = \{X_1, X_2, \ldots, X_N\} \in \mathbb{R}^{I_1 \times I_2 \times I_3}$;
Output: the mapped low-dimensional tensor shot set $Y = \{Y_1, Y_2, \ldots, Y_N\} \in \mathbb{R}^{J_1 \times J_2 \times J_3}$, the intermediate conversion matrices $V_1 \in \mathbb{R}^{I_1 \times J_1}$, $V_2 \in \mathbb{R}^{I_2 \times J_2}$ and $V_3 \in \mathbb{R}^{I_3 \times J_3}$, and the transition matrices $T_1^i$, $T_2^i$ and $T_3^i$, satisfying $Y_i|_{i=1}^{N} = X_i \otimes_1 T_1^{iT} \otimes_2 T_2^{iT} \otimes_3 T_3^{iT}$;
Algorithm description:
Step 1: build a nearest-neighbor graph G;
Step 2: compute the weight matrix W;
Step 3: For k = 1 to 3
Step 4:   For i = 1 to N
Step 5:     compute the left matrix $U_{(k)}^i$ of the SVD of the mode-$k$ unfolding matrix $X_{(k)}^i$ of $X_i$;
Step 6:   End;
Step 7:   $D_U = \sum_i D_{ii} U_{(k)}^i U_{(k)}^{iT}$;
Step 8:   $W_U = \sum_{ij} W_{ij} U_{(k)}^i U_{(k)}^{jT}$;
Step 9:   solve the generalized eigenvector problem $(D_U - W_U)V_k = \lambda D_U V_k$ to obtain the optimal $V_k$;
Step 10:  For i = 1 to N
Step 11:    $T_{(k)}^i = V_k^T U_{(k)}^i \in \mathbb{R}^{J_k \times I_k}$;
Step 12:  End;
Step 13: End;
Step 14: For i = 1 to N
Step 15:   $Y_i = X_i \otimes_1 T_1^{iT} \otimes_2 T_2^{iT} \otimes_3 T_3^{iT}$;
Step 16: End.
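The sixteen steps above can be sketched end-to-end in numpy. The heat-kernel weights for W are an assumption (the patent's defining equation for W survives only as an image), and all sizes and data are toys:

```python
import numpy as np

def unfold(X, k):
    """Mode-k unfolding: mode-k fibres become the columns."""
    return np.moveaxis(X, k, 0).reshape(X.shape[k], -1)

def mode_product(X, M, k):
    """Tensor-matrix product along mode k."""
    shape = (M.shape[0],) + tuple(s for i, s in enumerate(X.shape) if i != k)
    return np.moveaxis((M @ unfold(X, k)).reshape(shape), 0, k)

rng = np.random.default_rng(4)
N, dims, out_dims = 8, (5, 4, 3), (2, 2, 2)             # toy sizes (assumptions)
shots = [rng.random(dims) for _ in range(N)]

# Steps 1-2: neighbour-graph weights (heat-kernel weights are an assumption).
flat = np.array([X.ravel() for X in shots])
d2 = ((flat[:, None] - flat[None, :]) ** 2).sum(-1)
W = np.exp(-d2 / d2.mean())
np.fill_diagonal(W, 0)
D = W.sum(axis=1)                                       # D_ii = sum_j W_ij

Ts = [[None] * 3 for _ in range(N)]                     # transition matrices T_k^i
for k in range(3):                                      # Steps 3-13, per mode
    Us = [np.linalg.svd(unfold(X, k))[0] for X in shots]        # Step 5
    DU = sum(D[i] * Us[i] @ Us[i].T for i in range(N))          # Step 7
    WU = sum(W[i, j] * Us[i] @ Us[j].T
             for i in range(N) for j in range(N))               # Step 8
    vals, vecs = np.linalg.eig(np.linalg.solve(DU, DU - WU))    # Step 9
    Vk = vecs.real[:, np.argsort(vals.real)[:out_dims[k]]]
    for i in range(N):                                          # Steps 10-12
        Ts[i][k] = Vk.T @ Us[i]

# Steps 14-16: project every shot into the low-dimensional tensor space.
Y = [mode_product(mode_product(mode_product(X, T[0], 0), T[1], 1), T[2], 2)
     for X, T in zip(shots, Ts)]
print(Y[0].shape)                                       # (2, 2, 2)
```

Only the shapes and the data flow reflect the algorithm; with real shot features, W would come from the patent's (unreproduced) weight definition.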
Training the classifier model on the dimension-reduced video tensor shot set with a support tensor machine: in this step, we use the support tensor machine to train the classifier of tensor shots. The input of the training model is the low-dimensional tensor $Y_i$ obtained in the previous step through subspace embedding and dimensionality reduction, rather than the original $X_i$. This not only improves accuracy but also improves the efficiency of training and classification.
The algorithm for training the classifier with a support tensor machine is as follows.
Input: the mapped low-dimensional tensor shot set $Y_i|_{i=1}^{N} = X_i \otimes_1 T_1^{iT} \otimes_2 T_2^{iT} \otimes_3 T_3^{iT} \in \mathbb{R}^{J_1 \times J_2 \times J_3}$, and the corresponding class labels $y_i \in \{+1, -1\}$;
Output: the tensor-hyperplane parameters of the classifier model, $w_k|_{k=1}^{3} \in \mathbb{R}^{J_k}$ and $b \in \mathbb{R}$;
Algorithm description:
Step 1: initialize $w_k|_{k=1}^{3}$ as random unit vectors in $\mathbb{R}^{J_k}$;
Step 2: repeat Steps 3-5 until convergence;
Step 3: For j = 1 to 3
Step 4:   obtain $w_j \in \mathbb{R}^{J_j}$ and $b$ by solving the optimization problem
$$\min_{w_j, b, \xi}\; J_{C\text{-}STM}(w_j, b, \xi) = \frac{\eta}{2}\|w_j\|_{Fro}^2 + c\sum_{i=1}^{N}\xi_i \quad \text{s.t.}\;\; y_i\Big[w_j^T\Big(Y_i \prod_{\substack{k=1\\k \ne j}}^{3} \times_k w_k\Big) + b\Big] \ge 1 - \xi_i,\; 1 \le i \le N,\; \xi \ge 0,$$
  where $c$ is a constant, $\xi$ is the vector of slack variables, and $\eta = \prod_{1 \le k \le 3,\, k \ne j} \|w_k\|_{Fro}^2$;
Step 5: End;
Step 6: convergence check: $w_k|_{k=1}^{3}$ is considered converged if $\sum_{k=1}^{3} \big[\, |w_{k,t}^T w_{k,t-1}| \cdot \|w_{k,t}\|_{Fro}^{-2} - 1 \big] \le \epsilon$, where $w_{k,t}$ is the current projection vector and $w_{k,t-1}$ the previous one;
Step 7: End.
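A hedged sketch of the alternating scheme: for fixed $w_k$ ($k \ne j$), contracting each $Y_i$ along the other two modes reduces Step 4 to an ordinary linear SVM in $\mathbb{R}^{J_j}$. The tiny subgradient hinge-loss solver below stands in for a real SVM solver, the all-ones initialization (for reproducibility) replaces the patent's random unit vectors, and the synthetic separable data are assumptions:

```python
import numpy as np

def mode_product(X, M, k):
    Xk = np.moveaxis(X, k, 0).reshape(X.shape[k], -1)
    shape = (M.shape[0],) + tuple(s for i, s in enumerate(X.shape) if i != k)
    return np.moveaxis((M @ Xk).reshape(shape), 0, k)

def linear_svm(X, y, c=1.0, lr=0.01, epochs=200):
    """Tiny hinge-loss subgradient solver standing in for a real SVM solver."""
    w, b = np.zeros(X.shape[1]), 0.0
    for _ in range(epochs):
        mask = y * (X @ w + b) < 1                 # violated margin constraints
        w -= lr * (w - c * (y[mask, None] * X[mask]).sum(axis=0))
        b += lr * c * y[mask].sum()
    return w, b

rng = np.random.default_rng(5)
N, dims = 20, (2, 2, 2)
Y = [rng.random(dims) + (1.0 if i < N // 2 else -1.0) for i in range(N)]
labels = np.array([1.0] * (N // 2) + [-1.0] * (N // 2))

ws = [np.ones(d) / np.sqrt(d) for d in dims]       # Step 1 (ones, not random)
b = 0.0
for _ in range(10):                                # Steps 2-5: alternate
    for j in range(3):
        k1, k2 = (j + 1) % 3, (j + 2) % 3
        # Contract each shot along the other two modes: the tensor problem for
        # w_j becomes an ordinary vector SVM in R^{J_j}.
        Xj = np.array([mode_product(mode_product(Yi, ws[k1][None, :], k1),
                                    ws[k2][None, :], k2).ravel() for Yi in Y])
        ws[j], b = linear_svm(Xj, labels)

score = [mode_product(mode_product(mode_product(Yi, ws[0][None, :], 0),
                                   ws[1][None, :], 1),
                      ws[2][None, :], 2).ravel()[0] + b for Yi in Y]
pred = np.sign(score)
print((pred == labels).mean())
```

A fixed number of outer sweeps replaces the patent's convergence test in Step 6; a production version would check the stated criterion against $\epsilon$.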
Projecting a test shot with the transition matrices computed from the training set, then detecting semantic concepts with the classifier model: in this step, new data outside the training set are detected with the classifier model obtained by the foregoing training. Because our dimensionality-reduction method is linear, new data can be mapped directly into the low-dimensional subspace and then classified by the classifier.
Let $X_t$ be a detection example outside the training set; the following algorithm gives the detection procedure.
Input: the shot to be detected $X_t \in \mathbb{R}^{I_1 \times I_2 \times I_3}$, the intermediate conversion matrices $V_1$, $V_2$, $V_3$, and the classifier parameters $w_k|_{k=1}^{3} \in \mathbb{R}^{J_k}$ and $b \in \mathbb{R}$;
Output: the class label $z_t \in \{+1, -1\}$ of $X_t$;
Algorithm description:
Step 1: For k = 1 to 3
Step 2:   compute the left matrix $U_{(k)}^t$ of the SVD of the mode-$k$ unfolding matrix $X_{(k)}^t$ of $X_t$;
Step 3:   compute $T_{(k)}^t = V_k^T U_{(k)}^t$;
Step 4: End;
Step 5: compute $Y_t = X_t \otimes_1 T_1^{tT} \otimes_2 T_2^{tT} \otimes_3 T_3^{tT}$;
Step 6: compute $z_t = \operatorname{sign}(Y_t \otimes_1 w_1 \otimes_2 w_2 \otimes_3 w_3 + b)$;
Step 7: End.
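The detection procedure can be sketched as follows; the learned quantities ($V_k$, $w_k$, $b$) are replaced here by random toy stand-ins, so only the shapes and the data flow reflect the algorithm:

```python
import numpy as np

def unfold(X, k):
    return np.moveaxis(X, k, 0).reshape(X.shape[k], -1)

def mode_product(X, M, k):
    shape = (M.shape[0],) + tuple(s for i, s in enumerate(X.shape) if i != k)
    return np.moveaxis((M @ unfold(X, k)).reshape(shape), 0, k)

rng = np.random.default_rng(6)
dims, out_dims = (5, 4, 3), (2, 2, 2)
Xt = rng.random(dims)                               # unseen test shot (toy)

# Toy stand-ins for quantities learned on the training set (assumptions).
Vs = [np.linalg.qr(rng.random((d, o)))[0] for d, o in zip(dims, out_dims)]
ws = [rng.standard_normal(o) for o in out_dims]     # classifier parameters
b = 0.1

# Steps 1-4: per-mode left SVD factor, then transition matrix T_k = V_k^T U_k.
Ts = [Vs[k].T @ np.linalg.svd(unfold(Xt, k))[0] for k in range(3)]

# Step 5: project into the low-dimensional subspace.
Yt = Xt
for k in range(3):
    Yt = mode_product(Yt, Ts[k], k)

# Step 6: contract with the tensor-hyperplane parameters and take the sign.
score = Yt
for k in range(3):
    score = mode_product(score, ws[k][None, :], k)
zt = np.sign(score.ravel()[0] + b)
print(Yt.shape, zt)
```

With trained $V_k$, $w_k$ and $b$, `zt` would be the semantic-concept decision for the shot.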

Claims (6)

1. A method for detecting multi-modal video semantic concepts based on tensor representation, characterized by comprising the steps of:
1) extracting low-level features of the three modalities (image, audio, text) from every video shot in the training and test sets, each video tensor shot then being expressed as a third-order tensor built from these three kinds of low-level features;
2) following the intrinsic manifold structure of the set of video tensor shots, reducing the dimensionality of the original high-dimensional tensors and embedding them in a subspace by seeking transition matrices;
3) training a classifier model on the dimension-reduced set of video tensor shots with a support tensor machine;
4) projecting each test shot with the transition matrices computed from the training set, then detecting semantic concepts with the classifier model.
2. The method for detecting multi-modal video semantic concepts based on tensor representation according to claim 1, characterized in that the low-level features of the three modalities (image, audio, text) are extracted from every shot in the training and test sets as follows: one keyframe is chosen per shot as its representative image, and its color histogram, texture, and Canny edges are extracted as the image features; the audio segment corresponding to the shot is extracted as an audio clip and divided into overlapping short-time audio frames, from each of which features including MFCCs, spectral centroid, roll-off frequency, spectral flux, and zero-crossing rate are extracted to form a frame feature vector, statistics of the short-time frame feature vectors serving as the shot's audio features; TF*IDF values computed from the recognized transcript text of the video serve as the text features.
3. The method for detecting multi-modal video semantic concepts based on tensor representation according to claim 1, characterized in that a video tensor shot is represented as follows: based on the low-level image, audio, and text features extracted from the video, each video shot is represented by a third-order tensor $S \in \mathbb{R}^{I_1 \times I_2 \times I_3}$, where $I_1$, $I_2$ and $I_3$ are the dimensions of the image, audio, and text features respectively; the elements are defined such that $s_{i_1,1,1}\ (1 \le i_1 \le I_1)$ holds the image feature values, $s_{2,i_2,2}\ (1 \le i_2 \le I_2)$ the audio feature values, and $s_{3,3,i_3}\ (1 \le i_3 \le I_3)$ the text feature values, with all other elements initialized to zero.
4. The method for detecting multi-modal video semantic concepts based on tensor representation according to claim 1, characterized in that, following the intrinsic manifold structure of the video tensor shots, the dimensionality of the original high-dimensional tensors is reduced and they are embedded in a subspace by seeking transition matrices, as follows: given the set of shot data $X = \{X_1, X_2, \ldots, X_N\}$ in the space $\mathbb{R}^{I_1 \times I_2 \times I_3}$, according to the intrinsic manifold structure of the tensor shots and spectral graph theory, three transition matrices are sought for each tensor shot $X_i|_{i=1}^{N}$: $T_1^i$ of size $J_1 \times I_1$, $T_2^i$ of size $J_2 \times I_2$, and $T_3^i$ of size $J_3 \times I_3$, which map the $N$ data points into the space $\mathbb{R}^{J_1 \times J_2 \times J_3}$ ($J_1 < I_1$, $J_2 < I_2$, $J_3 < I_3$) as $Y = \{Y_1, Y_2, \ldots, Y_N\}$ satisfying $Y_i|_{i=1}^{N} = X_i \otimes_1 T_1^{iT} \otimes_2 T_2^{iT} \otimes_3 T_3^{iT}$, thereby realizing the dimensionality reduction and subspace embedding of the original high-dimensional tensors; to obtain $T_1^i|_{i=1}^{N}$, the generalized eigenvector problem $(D_U - W_U)V_1 = \lambda D_U V_1$ is solved for the optimal intermediate conversion matrix $V_1$, where $D_U = \sum_i D_{ii} U_1^i U_1^{iT}$, $W_U = \sum_{ij} W_{ij} U_1^i U_1^{jT}$, $W$ is the weight matrix of the nearest-neighbor graph constructed from the training set $X$, $D$ is the diagonal matrix of $W$ with $D_{ii} = \sum_j W_{ij}$, and $U_1^i$ is the left matrix obtained from the SVD of the mode-1 unfolding matrix $X_{(1)}^i$ of $X_i|_{i=1}^{N}$; the transition matrices are finally $T_1^i|_{i=1}^{N} = V_1^T U_1^i \in \mathbb{R}^{J_1 \times I_1}$, and $T_2^i$ and $T_3^i$ are obtained in the same way.
5. The method for detecting multi-modal video semantic concepts based on tensor representation according to claim 1, characterized in that the classifier model is trained on the dimension-reduced set of video tensor shots with a support tensor machine as follows: the inputs of the classifier model are the low-dimensional tensors obtained by the subspace embedding and dimensionality reduction, $Y_i|_{i=1}^{N} = X_i \otimes_1 T_1^{iT} \otimes_2 T_2^{iT} \otimes_3 T_3^{iT} \in \mathbb{R}^{J_1 \times J_2 \times J_3}$, together with the corresponding class labels $y_i \in \{+1, -1\}$, and the outputs are the tensor-hyperplane parameters of the classifier model, $w_k|_{k=1}^{3} \in \mathbb{R}^{J_k}$ and $b \in \mathbb{R}$; the parameters $w_k|_{k=1}^{3}$ and $b$ are obtained by iteratively solving the optimization problem
$$\min_{w_j, b, \xi}\; J_{C\text{-}STM}(w_j, b, \xi) = \frac{\eta}{2}\|w_j\|_{Fro}^2 + c\sum_{i=1}^{N}\xi_i \quad \text{s.t.}\;\; y_i\Big[w_j^T\Big(Y_i \prod_{\substack{k=1\\k \ne j}}^{3} \times_k w_k\Big) + b\Big] \ge 1 - \xi_i,\; 1 \le i \le N,\; \xi \ge 0,$$
where the parameter $j$ cycles from 1 to 3, $c$ is a constant, $\xi$ is the vector of slack variables, and $\eta = \prod_{1 \le k \le 3,\, k \ne j} \|w_k\|_{Fro}^2$.
6. The method for detecting multi-modal video semantic concepts based on tensor representation according to claim 1, characterized in that, for a test shot, projection is first performed with the transformation matrices computed from the training set, and semantic concept detection is then performed by the classifier model, as follows: a new datum outside the training set, X^t ∈ R^(I_1×I_2×I_3), is mapped into the low-dimensional subspace by the transformation matrices T_1^t = V_1^T U_1^t ∈ R^(J_1×I_1), T_2^t = V_2^T U_2^t ∈ R^(J_2×I_2) and T_3^t = V_3^T U_3^t ∈ R^(J_3×I_3) as Y^t = X^t ×_1 T_1^(tT) ×_2 T_2^(tT) ×_3 T_3^(tT) ∈ R^(J_1×J_2×J_3); classification is then performed by the classifier model, i.e., z^t = sign(Y^t ×_1 w_1 ×_2 w_2 ×_3 w_3 + b) is computed, yielding the class label z^t ∈ {+1, −1} of the test datum.
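The test-time path of this claim (unfold the new tensor along each mode, take the left singular matrices, form T_k^t = V_k^T U_k^t, project, and take the sign of the tensor inner product with (w_1, w_2, w_3) plus b) can be sketched as below. The V_k, w_k and b here are random stand-ins for quantities the patent obtains from the training set, and all sizes are illustrative.

```python
import numpy as np

rng = np.random.default_rng(2)
I1, I2, I3 = 5, 4, 3
J1, J2, J3 = 2, 2, 2

# Stand-ins for quantities learned on the training set: the intermediate
# matrices V_k and the tensor-hyperplane parameters (w_1, w_2, w_3, b)
V = [rng.standard_normal((I, J)) for I, J in ((I1, J1), (I2, J2), (I3, J3))]
w = [rng.standard_normal(J) for J in (J1, J2, J3)]
b = 0.1

def mode_unfold(T, k):
    """Mode-k unfolding of a 3rd-order tensor: (.., I_k, ..) -> (I_k, rest)."""
    return np.moveaxis(T, k, 0).reshape(T.shape[k], -1)

Xt = rng.standard_normal((I1, I2, I3))   # a new shot outside the training set

# U_k^t: left singular matrix of each mode-k unfolding of X^t
U = [np.linalg.svd(mode_unfold(Xt, k), full_matrices=False)[0]
     for k in range(3)]

# Transformation matrices T_k^t = V_k^T U_k^t  (J_k x I_k)
T = [V[k].T @ U[k] for k in range(3)]

# Project: Y^t = X^t x_1 T_1^t x_2 T_2^t x_3 T_3^t
Yt = np.einsum('ia,jb,kc,abc->ijk', T[0], T[1], T[2], Xt)

# Classify: z^t = sign(Y^t x_1 w_1 x_2 w_2 x_3 w_3 + b)
zt = np.sign(np.einsum('ijk,i,j,k->', Yt, *w) + b)
```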
CN2008100591256A 2008-01-14 2008-01-14 Method for detecting multi-mode video semantic conception based on tensor representation Expired - Fee Related CN101299241B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN2008100591256A CN101299241B (en) 2008-01-14 2008-01-14 Method for detecting multi-mode video semantic conception based on tensor representation


Publications (2)

Publication Number Publication Date
CN101299241A true CN101299241A (en) 2008-11-05
CN101299241B CN101299241B (en) 2010-06-02

Family

ID=40079065

Family Applications (1)

Application Number Title Priority Date Filing Date
CN2008100591256A Expired - Fee Related CN101299241B (en) 2008-01-14 2008-01-14 Method for detecting multi-mode video semantic conception based on tensor representation

Country Status (1)

Country Link
CN (1) CN101299241B (en)


Cited By (42)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN102033853A (en) * 2009-09-30 2011-04-27 三菱电机株式会社 Method and system for reducing dimensionality of the spectrogram of a signal produced by a number of independent processes
CN101996327A (en) * 2010-09-02 2011-03-30 西安电子科技大学 Video anomaly detection method based on weighted tensor subspace background modeling
CN103221965A (en) * 2010-11-18 2013-07-24 高通股份有限公司 Systems and methods for robust pattern classification
CN103312938B (en) * 2012-03-16 2016-07-06 富士通株式会社 Video process apparatus, method for processing video frequency and equipment
CN103312938A (en) * 2012-03-16 2013-09-18 富士通株式会社 Video processing device, video processing method and equipment
CN102750349B (en) * 2012-06-08 2014-10-08 华南理工大学 Video browsing method based on video semantic modeling
CN102750349A (en) * 2012-06-08 2012-10-24 华南理工大学 Video browsing method based on video semantic modeling
CN103455705A (en) * 2013-05-24 2013-12-18 中国科学院自动化研究所 Analysis and prediction system for cooperative correlative tracking and global situation of network social events
CN103336832A (en) * 2013-07-10 2013-10-02 中国科学院自动化研究所 Video classifier construction method based on quality metadata
CN103473307A (en) * 2013-09-10 2013-12-25 浙江大学 Cross-media sparse Hash indexing method
CN103473308A (en) * 2013-09-10 2013-12-25 浙江大学 High-dimensional multimedia data classifying method based on maximum margin tensor study
CN103473308B (en) * 2013-09-10 2017-02-01 浙江大学 High-dimensional multimedia data classifying method based on maximum margin tensor study
CN103473307B (en) * 2013-09-10 2016-07-13 浙江大学 Across media sparse hash indexing means
CN108140032A (en) * 2015-10-28 2018-06-08 英特尔公司 Automatic video frequency is summarized
CN108140032B (en) * 2015-10-28 2022-03-11 英特尔公司 Apparatus and method for automatic video summarization
CN105701504B (en) * 2016-01-08 2019-09-13 天津大学 Multi-modal manifold embedding grammar for zero sample learning
CN105701504A (en) * 2016-01-08 2016-06-22 天津大学 Multimode manifold embedding method used for zero sample learning
CN105718940A (en) * 2016-01-15 2016-06-29 天津大学 Zero-sample image classification method based on multi-group factor analysis
CN105701514A (en) * 2016-01-15 2016-06-22 天津大学 Multi-modal canonical correlation analysis method for zero sample classification
CN105718940B (en) * 2016-01-15 2019-03-29 天津大学 The zero sample image classification method based on factorial analysis between multiple groups
CN105701514B (en) * 2016-01-15 2019-05-21 天津大学 A method of the multi-modal canonical correlation analysis for zero sample classification
CN106529435A (en) * 2016-10-24 2017-03-22 天津大学 Action recognition method based on sensor quantization
CN106529435B (en) * 2016-10-24 2019-10-15 天津大学 Action identification method based on tensor quantization
CN107341522A (en) * 2017-07-11 2017-11-10 重庆大学 A kind of text based on density semanteme subspace and method of the image without tag recognition
CN108595555A (en) * 2018-04-11 2018-09-28 西安电子科技大学 The image search method returned based on semi-supervised tensor subspace
CN108595555B (en) * 2018-04-11 2020-12-08 西安电子科技大学 Image retrieval method based on semi-supervised tensor quantum space regression
CN109214302A (en) * 2018-08-13 2019-01-15 湖南志东科技有限公司 One kind being based on multispectral substance identification
CN109936766B (en) * 2019-01-30 2021-04-13 天津大学 End-to-end-based method for generating audio of water scene
CN109936766A (en) * 2019-01-30 2019-06-25 天津大学 A kind of generation method based on water scene audio end to end
CN110209758B (en) * 2019-04-18 2021-09-03 同济大学 Text increment dimension reduction method based on tensor decomposition
CN110209758A (en) * 2019-04-18 2019-09-06 同济大学 A kind of text increment dimension reduction method based on tensor resolution
CN110097010A (en) * 2019-05-06 2019-08-06 北京达佳互联信息技术有限公司 Picture and text detection method, device, server and storage medium
CN112257857B (en) * 2019-07-22 2024-06-04 中科寒武纪科技股份有限公司 Tensor processing method and related product
CN112257857A (en) * 2019-07-22 2021-01-22 中科寒武纪科技股份有限公司 Tensor processing method and related product
CN111400601A (en) * 2019-09-16 2020-07-10 腾讯科技(深圳)有限公司 Video recommendation method and related equipment
CN111222011A (en) * 2020-01-06 2020-06-02 腾讯科技(深圳)有限公司 Video vector determination method and device
CN111222011B (en) * 2020-01-06 2023-11-14 腾讯科技(深圳)有限公司 Video vector determining method and device
CN111460971A (en) * 2020-03-27 2020-07-28 北京百度网讯科技有限公司 Video concept detection method and device and electronic equipment
CN111460971B (en) * 2020-03-27 2023-09-12 北京百度网讯科技有限公司 Video concept detection method and device and electronic equipment
CN112015955A (en) * 2020-09-01 2020-12-01 清华大学 Multi-mode data association method and device
CN112804558B (en) * 2021-04-14 2021-06-25 腾讯科技(深圳)有限公司 Video splitting method, device and equipment
CN112804558A (en) * 2021-04-14 2021-05-14 腾讯科技(深圳)有限公司 Video splitting method, device and equipment

Also Published As

Publication number Publication date
CN101299241B (en) 2010-06-02

Similar Documents

Publication Publication Date Title
CN101299241B (en) Method for detecting multi-mode video semantic conception based on tensor representation
CN104899253B (en) Towards the society image across modality images-label degree of correlation learning method
CN101894276B (en) Training method of human action recognition and recognition method
Bouchard et al. Semantic segmentation of motion capture using laban movement analysis
Chen et al. Efficient spatial temporal convolutional features for audiovisual continuous affect recognition
CN103210651A (en) Method and system for video summarization
Amiri et al. Hierarchical keyframe-based video summarization using QR-decomposition and modified-means clustering
CN108154156B (en) Image set classification method and device based on neural topic model
CN112364168A (en) Public opinion classification method based on multi-attribute information fusion
Dong et al. A procedural texture generation framework based on semantic descriptions
Olaode et al. Unsupervised image classification by probabilistic latent semantic analysis for the annotation of images
CN116385946B (en) Video-oriented target fragment positioning method, system, storage medium and equipment
Wang et al. Action recognition using linear dynamic systems
Wu et al. Double constrained bag of words for human action recognition
Naphade A probablistic framework for mapping audio-visual features to high-level semantics in terms of concepts and context
Richard et al. A BoW-equivalent Recurrent Neural Network for Action Recognition.
Gao et al. Discriminative optical flow tensor for video semantic analysis
Gayathri et al. An efficient video indexing and retrieval algorithm using ensemble classifier
CN109857906B (en) Multi-video abstraction method based on query unsupervised deep learning
Wang Video description with GAN
Arif et al. Trajectory-Based 3D Convolutional Descriptors for Human Action Recognition.
Hammad et al. Characterizing the impact of using features extracted from pre-trained models on the quality of video captioning sequence-to-sequence models
Sun et al. Multimodal micro-video classification based on 3D convolutional neural network
Lu et al. MTCA: a multimodal summarization model based on two-stream cross attention
Wang et al. Self-trained video anomaly detection based on teacher-student model

Legal Events

Date Code Title Description
C06 Publication
PB01 Publication
C10 Entry into substantive examination
SE01 Entry into force of request for substantive examination
C14 Grant of patent or utility model
GR01 Patent grant
CF01 Termination of patent right due to non-payment of annual fee

Granted publication date: 20100602

Termination date: 20150114

EXPY Termination of patent right or utility model