CN101299241B - Multimodal Video Semantic Concept Detection Method Based on Tensor Representation - Google Patents


Info

Publication number
CN101299241B
CN101299241B, CN2008100591256A, CN200810059125A
Authority
CN
China
Prior art keywords
tensor
video
audio
shot
semantic concept
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Expired - Fee Related
Application number
CN2008100591256A
Other languages
Chinese (zh)
Other versions
CN101299241A (en)
Inventor
吴飞
庄越挺
刘亚楠
郭同强
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Zhejiang University ZJU
Original Assignee
Zhejiang University ZJU
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Zhejiang University ZJU filed Critical Zhejiang University ZJU
Priority to CN2008100591256A priority Critical patent/CN101299241B/en
Publication of CN101299241A publication Critical patent/CN101299241A/en
Application granted granted Critical
Publication of CN101299241B publication Critical patent/CN101299241B/en

Landscapes

  • Image Processing (AREA)
  • Image Analysis (AREA)

Abstract

The invention discloses a multimodal video semantic concept detection method based on tensor representation, comprising the following steps: 1) low-level features of three modalities (image, audio, and text) are extracted from the video shots in both the training set and the test set, and each video tensor shot is expressed as a 3rd-order tensor formed from these three kinds of low-level features; 2) according to the intrinsic manifold structure of the set of video tensor shots, dimensionality reduction and subspace embedding of the original high-dimensional tensors are achieved by seeking transformation matrices; 3) a support tensor machine is used to build a classifier model on the dimension-reduced set of video tensor shots; 4) a test shot is first projected with the transformation matrices computed from the training set, and semantic concept detection is then performed with the classifier model. The invention makes full use of the multimodal data in video, represents each video shot as a 3rd-order tensor, proposes a subspace-embedding dimensionality reduction method based on this representation, and realizes semantic concept detection for video shots, achieving good analysis and understanding of video semantics.

Description

Method for multimodal video semantic concept detection based on tensor representation
Technical field
The present invention relates to a method for multimodal video semantic concept detection based on tensor representation. The method represents each video shot as a 3rd-order tensor, seeks an effective dimensionality reduction method to project it into a low-dimensional semantic space, and then detects the semantic concepts of video tensor shots with a trained classifier model. It belongs to the field of video semantic analysis and understanding.
Background technology
With the development and popularization of digital imaging devices, and the rapid progress of the film and television industry, computer technology, communication technology, multimedia processing, compression coding, and the Internet, large amounts of video data are produced in fields such as news, film, historical archives, and surveillance. Video data carries rich semantics such as people, scenes, objects, and events; video is also time-series data, containing three kinds of media (image, audio, and text) that exhibit temporally correlated co-occurrence. At the same time, the fusion and cooperation of multiple modalities plays an important role in narrowing the "semantic gap" between low-level features and high-level semantics. How to effectively exploit the multimodal and temporal characteristics of video to mine its semantic information, and thereby support effective video retrieval and bring the resource-sharing advantages of video data into play, is a challenging research question.
As for how to represent the multiple media in video, the traditional approach concatenates image, audio, and text features into a single vector. Such high-dimensional vectors, however, tend to cause the "curse of dimensionality", and the temporally correlated co-occurrence relations among the modalities in video are ignored. In recent years, multilinear algebra, that is, high-order tensors, has been widely applied in fields such as computer vision, information retrieval, and signal processing. A tensor is a natural generalization and extension of vectors and matrices, and tensor algebra defines a series of multilinear operations over sets of vector spaces. Meanwhile, the supervised tensor learning framework, which takes tensors as input, adopts an alternating projection optimization procedure to find the optimal solution, combining convex optimization with multilinear operations. Based on the supervised tensor learning framework, the traditional support vector machine can be extended to the support tensor machine for training and applying classifier models.
Summary of the invention
The purpose of this invention is to provide a method for multimodal video semantic concept detection based on tensor representation, comprising the following steps:
1) Low-level features of three modalities (image, audio, and text) are extracted from the video shots in both the training set and the test set, and each video tensor shot is expressed as a 3rd-order tensor formed from these three kinds of low-level features;
2) According to the intrinsic manifold structure of the set of video tensor shots, dimensionality reduction and subspace embedding of the original high-dimensional tensors are achieved by seeking transformation matrices;
3) A support tensor machine is used to build a classifier model on the dimension-reduced set of video tensor shots;
4) A test shot is first projected with the transformation matrices computed from the training set, and semantic concept detection is then performed with the classifier model.
Extracting the low-level features of the three modalities (image, audio, and text) from the shots in the training set and the test set: a key frame is chosen from each shot as its representative image, and the color histogram, texture, and Canny edges are extracted as image features; the audio segment corresponding to a shot is extracted as an audio clip and divided into overlapping short-time audio frames, and features of each short-time frame are extracted, including MFCC, centroid, roll-off frequency, spectral flux, and zero-crossing rate, to form frame feature vectors; statistics of the short-time frame feature vectors then serve as the audio features of the shot; TF*IDF values are extracted from the recognized transcript of the video as text features.
The representation of a video tensor shot: based on the image, audio, and text low-level features extracted from the video, each video shot is represented by a 3rd-order tensor $S \in \mathbb{R}^{I_1 \times I_2 \times I_3}$, where $I_1$, $I_2$, and $I_3$ are the dimensions of the image, audio, and text features, respectively. The value of each element $s_{i_1 i_2 i_3}$ is defined as follows: $s_{i_1,1,1}$ ($1 \le i_1 \le I_1$) holds the image feature values, $s_{2,i_2,2}$ ($1 \le i_2 \le I_2$) holds the audio feature values, $s_{3,3,i_3}$ ($1 \le i_3 \le I_3$) holds the text feature values, and all other elements are initially set to zero.
The method for dimensionality reduction and subspace embedding of the original high-dimensional tensors according to the intrinsic manifold structure of the video tensor shots, achieved by seeking transformation matrices, is as follows: given the shot data set $X = \{X_1, X_2, \ldots, X_N\}$ in the space $\mathbb{R}^{I_1 \times I_2 \times I_3}$, according to the intrinsic manifold structure of the tensor shots and spectral graph theory, three transformation matrices are sought for each tensor shot $X_i|_{i=1}^N$ in $X$: $T_1^i$ of size $J_1 \times I_1$, $T_2^i$ of size $J_2 \times I_2$, and $T_3^i$ of size $J_3 \times I_3$, which map the $N$ data points to $Y = \{Y_1, Y_2, \ldots, Y_N\}$ in the space $\mathbb{R}^{J_1 \times J_2 \times J_3}$ ($J_1 < I_1$, $J_2 < I_2$, $J_3 < I_3$), satisfying $Y_i|_{i=1}^N = X_i \times_1 T_1^{iT} \times_2 T_2^{iT} \times_3 T_3^{iT}$, thereby realizing dimensionality reduction and subspace embedding of the original high-dimensional tensors. To obtain $T_1^i|_{i=1}^N$, the optimal intermediate transformation matrix $V_1$ is computed by solving the generalized eigenvector problem $(D_U - W_U)V_1 = \lambda D_U V_1$, where $D_U = \sum_i D_{ii} U_1^i U_1^{iT}$, $W_U = \sum_{ij} W_{ij} U_1^i U_1^{jT}$, $W$ is the weight matrix of the nearest-neighbor graph constructed from the training set $X$, $D$ is the diagonal matrix of $W$ with $D_{ii} = \sum_j W_{ij}$, and $U_1^i$ is the left matrix obtained by SVD of the mode-1 unfolding matrix $X_{(1)}^i$ of $X_i|_{i=1}^N$. The transformation matrices are then finally computed as $T_1^i|_{i=1}^N = V_1^T U_1^i \in \mathbb{R}^{J_1 \times I_1}$; $T_2^i$ and $T_3^i$ are obtained in the same way.
The method for building the classifier model on the dimension-reduced set of video tensor shots with a support tensor machine is as follows: the input of the classifier model is the low-dimensional tensors obtained by subspace embedding and dimensionality reduction, $Y_i|_{i=1}^N = X_i \times_1 T_1^{iT} \times_2 T_2^{iT} \times_3 T_3^{iT} \in \mathbb{R}^{J_1 \times J_2 \times J_3}$, together with the corresponding class labels $y_i \in \{+1, -1\}$; the output is the tensor hyperplane parameters of the classifier model, $w_k|_{k=1}^3 \in \mathbb{R}^{J_k}$ and $b \in \mathbb{R}$. These are obtained by iteratively solving the optimization problem
$$\min_{w_j, b, \xi} J_{C\text{-}STM}(w_j, b, \xi) = \frac{\eta}{2}\|w_j\|_{Fro}^2 + c\sum_{i=1}^N \xi_i \quad \text{s.t.} \quad y_i\Big[w_j^T\Big(Y_i \prod_{\substack{k=1 \\ k \ne j}}^{3} \times_k w_k\Big) + b\Big] \ge 1 - \xi_i,\ 1 \le i \le N,\ \xi \ge 0$$
which yields $w_k|_{k=1}^3$ and $b$, where the parameter $j$ cycles from 1 to 3, $c$ is a constant, $\xi$ is the slack variable, and $\eta = \prod_{1 \le k \le 3,\, k \ne j} \|w_k\|_{Fro}^2$.
For a test shot, after projection with the transformation matrices computed from the training set, semantic concept detection is performed with the classifier model: new data outside the training set, $X_t \in \mathbb{R}^{I_1 \times I_2 \times I_3}$, is mapped into the low-dimensional subspace by the transformation matrices $T_1^t = V_1^T U_1^t \in \mathbb{R}^{J_1 \times I_1}$, $T_2^t = V_2^T U_2^t \in \mathbb{R}^{J_2 \times I_2}$, and $T_3^t = V_3^T U_3^t \in \mathbb{R}^{J_3 \times I_3}$ as $Y_t = X_t \times_1 T_1^{tT} \times_2 T_2^{tT} \times_3 T_3^{tT} \in \mathbb{R}^{J_1 \times J_2 \times J_3}$; classification is then performed with the classifier model, that is, $z_t = \mathrm{sign}(Y_t \times_1 w_1 \times_2 w_2 \times_3 w_3 + b)$ is computed to obtain the class label $z_t \in \{+1, -1\}$ of the test data.
Beneficial effects of the present invention:
1) The invention replaces the traditional vector representation of video with tensors, which effectively alleviates the problems brought by the "curse of dimensionality";
2) The invention takes into account the multiple modalities in video (image, audio, and text) as well as the temporally correlated co-occurrence characteristic of video data; the fusion and cooperation of multiple modalities based on this characteristic plays an important role in narrowing the "semantic gap" between low-level features and high-level semantics;
3) The proposed subspace embedding and dimension reduction method for tensor shots, which preserves the intrinsic manifold structure of the set of tensor shots and follows spectral graph theory, not only effectively resolves the difficulties brought by high dimensionality but, being a linear method, can also directly project new data outside the training set;
4) The invention adopts the support tensor machine to train the classifier model, which has good classification and detection capability.
Description of drawings
Fig. 1 is the flowchart of the multimodal video semantic concept detection method based on tensor representation;
Fig. 2 shows the detection results of the present invention for the semantic concept "Explosion", compared with the ISOMAP and PCA methods and plotted as ROC curves;
Fig. 3 shows the detection results of the present invention for the semantic concept "Sports", compared with the ISOMAP and PCA methods and plotted as ROC curves.
Embodiment
The method for multimodal video semantic concept detection based on tensor representation comprises the following steps:
1) Low-level features of three modalities (image, audio, and text) are extracted from the video shots in both the training set and the test set, and each video tensor shot is expressed as a 3rd-order tensor formed from these three kinds of low-level features;
2) According to the intrinsic manifold structure of the set of video tensor shots, dimensionality reduction and subspace embedding of the original high-dimensional tensors are achieved by seeking transformation matrices;
3) A support tensor machine is used to build a classifier model on the dimension-reduced set of video tensor shots;
4) A test shot is first projected with the transformation matrices computed from the training set, and semantic concept detection is then performed with the classifier model.
Extracting the low-level features of the three modalities (image, audio, and text) from the shots in the training set and the test set: low-level features are features extracted directly from the video source data, as distinguished from the high-level features represented by semantic concepts. We extract low-level features from each video shot, covering image, audio, and text.
Image features: with the shot as the basic processing unit, a key frame is chosen from each shot as its representative image, and the color histogram, texture, and Canny edges of the key frame are extracted as image features;
Audio features: the audio segment corresponding to a shot is extracted as an audio clip and divided into overlapping short-time audio frames; features of each short-time frame are extracted, including MFCC, centroid, roll-off frequency, spectral flux, and zero-crossing rate, to form frame feature vectors; statistics of the short-time frame feature vectors (mean or variance) then serve as the audio features of the shot;
Text features: we extract features from the transcript text of the video obtained by recognition. Because the dimension of text features is much larger than that of the other modalities, and the text contains rich semantic information, Latent Semantic Analysis (LSA) can first be applied to the text for dimensionality reduction.
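For illustration only (the patent prescribes no implementation), the sketch below realizes the three extractors with OpenCV, librosa, and scikit-learn; the library choices, parameter values, the simplified edge statistic, and the omission of the texture and spectral-flux features are our own assumptions:

```python
# Hypothetical per-shot feature extractors; all parameters are assumptions.
import cv2
import librosa
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer

def image_features(keyframe_bgr):
    """Color histogram + a Canny edge statistic from the shot's key frame
    (texture features from the patent are omitted here for brevity)."""
    hist = cv2.calcHist([keyframe_bgr], [0, 1, 2], None,
                        [8, 8, 8], [0, 256] * 3).flatten()
    gray = cv2.cvtColor(keyframe_bgr, cv2.COLOR_BGR2GRAY)
    edges = cv2.Canny(gray, 100, 200)
    return np.concatenate([hist / hist.sum(), [edges.mean() / 255.0]])

def audio_features(wav_path):
    """Mean/variance statistics of short-time frame features over the clip
    (spectral flux omitted; librosa uses overlapping frames by default)."""
    y, sr = librosa.load(wav_path, sr=None)
    frames = np.vstack([
        librosa.feature.mfcc(y=y, sr=sr, n_mfcc=13),    # MFCC
        librosa.feature.spectral_centroid(y=y, sr=sr),  # centroid
        librosa.feature.spectral_rolloff(y=y, sr=sr),   # roll-off frequency
        librosa.feature.zero_crossing_rate(y),          # zero-crossing rate
    ])
    return np.concatenate([frames.mean(axis=1), frames.var(axis=1)])

def text_features(transcripts):
    """TF*IDF vectors for the recognized transcripts, one row per shot."""
    return TfidfVectorizer().fit_transform(transcripts).toarray()
```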
The representation of a video tensor shot: based on the image, audio, and text low-level features extracted from the video, each video shot is represented by a 3rd-order tensor $S \in \mathbb{R}^{I_1 \times I_2 \times I_3}$, where $I_1$, $I_2$, and $I_3$ are the dimensions of the image, audio, and text features, respectively. The value of each element $s_{i_1 i_2 i_3}$ is defined as follows: $s_{i_1,1,1}$ ($1 \le i_1 \le I_1$) holds the image feature values, $s_{2,i_2,2}$ ($1 \le i_2 \le I_2$) holds the audio feature values, $s_{3,3,i_3}$ ($1 \le i_3 \le I_3$) holds the text feature values, and all other elements are initially set to zero.
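A minimal numpy sketch of this fiber placement (0-based indices; it assumes each feature dimension is at least 3, and the helper name is ours):

```python
import numpy as np

def build_shot_tensor(img_feat, aud_feat, txt_feat):
    """Place the three modality feature vectors on the designated fibers of a
    3rd-order tensor S in R^{I1 x I2 x I3}; all other entries stay zero.
    Element definition above, shifted to 0-based indexing."""
    I1, I2, I3 = len(img_feat), len(aud_feat), len(txt_feat)
    S = np.zeros((I1, I2, I3))
    S[:, 0, 0] = img_feat   # s_{i1,1,1}: image feature values
    S[1, :, 1] = aud_feat   # s_{2,i2,2}: audio feature values
    S[2, 2, :] = txt_feat   # s_{3,3,i3}: text feature values
    return S
```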
The method for dimensionality reduction and subspace embedding of the original high-dimensional tensors according to the intrinsic manifold structure of the video tensor shots, achieved by seeking transformation matrices, is as follows: given the shot data set $X = \{X_1, X_2, \ldots, X_N\}$ in the space $\mathbb{R}^{I_1 \times I_2 \times I_3}$, according to the intrinsic manifold structure of the tensor shots and spectral graph theory, three transformation matrices are sought for each tensor shot $X_i|_{i=1}^N$ in $X$: $T_1^i$ of size $J_1 \times I_1$, $T_2^i$ of size $J_2 \times I_2$, and $T_3^i$ of size $J_3 \times I_3$, which map the $N$ data points to the set $Y = \{Y_1, Y_2, \ldots, Y_N\}$ in the space $\mathbb{R}^{J_1 \times J_2 \times J_3}$ ($J_1 < I_1$, $J_2 < I_2$, $J_3 < I_3$), satisfying $Y_i|_{i=1}^N = X_i \times_1 T_1^{iT} \times_2 T_2^{iT} \times_3 T_3^{iT}$. The low-dimensional data set $Y$ then reflects the intrinsic geometric topology of the manifold space of the set $X$; at the same time, the mapping remains linear, that is, for a data point $X_t$ outside the training set, its mapping in the low-dimensional subspace can be computed directly from the transformation matrices obtained in advance by training.
Let $X \in \mathbb{R}^{I_1 \times I_2 \times I_3}$ denote a 3rd-order tensor shot. Given a data set $X = \{X_1, X_2, \ldots, X_N\}$ of $N$ tensor shots distributed on a manifold $M \in \mathbb{R}^{I_1 \times I_2 \times I_3}$, we can construct a nearest-neighbor graph $G$ to model the local geometric structure of $M$. The weight matrix $W$ of $G$ is defined as
$$W_{ij} = \begin{cases} e^{-\|X_i - X_j\|^2 / c}, & \text{if } X_i \text{ and } X_j \text{ are neighbors in } G \\ 0, & \text{otherwise} \end{cases}$$
where $c$ is a constant.
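A sketch of constructing $W$ and $D$ under these definitions, assuming a symmetrized k-nearest-neighbor graph and Frobenius distances between tensor shots ($k$ and $c$ are tuning parameters the patent leaves open):

```python
import numpy as np

def weight_matrix(X, k=5, c=1.0):
    """Heat-kernel weights on a k-NN graph over the tensor shots.
    X: array of shape (N, I1, I2, I3); returns (W, D)."""
    N = X.shape[0]
    flat = X.reshape(N, -1)
    d2 = ((flat[:, None, :] - flat[None, :, :]) ** 2).sum(-1)  # ||X_i - X_j||^2
    W = np.zeros((N, N))
    for i in range(N):
        nbrs = np.argsort(d2[i])[1:k + 1]   # k nearest neighbors, skipping self
        W[i, nbrs] = np.exp(-d2[i, nbrs] / c)
    W = np.maximum(W, W.T)                  # symmetrize the graph
    D = np.diag(W.sum(axis=1))              # D_ii = sum_j W_ij
    return W, D
```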
For each tensor shot $X_i$ ($1 \le i \le N$), following the higher-order singular value decomposition (HOSVD), we perform SVD on each of the mode-$k$ ($k = 1, 2, 3$) unfolding matrices $X_{(1)}^i$, $X_{(2)}^i$, $X_{(3)}^i$ and compute the left matrices $U_1^i$, $U_2^i$, $U_3^i$. For instance, $U_1^i$ is the left matrix obtained by SVD of the mode-1 unfolding matrix $X_{(1)}^i$ of $X_i$.
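These two operations can be written compactly as below; the sketch assumes the common "move mode k to the front and flatten the rest" unfolding convention, which changes only the column ordering and therefore leaves the left singular matrix unaffected:

```python
import numpy as np

def mode_unfolding(X, k):
    """Mode-k unfolding X_(k): mode k becomes the rows, the rest the columns."""
    return np.moveaxis(X, k, 0).reshape(X.shape[k], -1)

def left_svd_matrix(X, k):
    """Left singular matrix U_k of the mode-k unfolding, as in HOSVD."""
    U, _, _ = np.linalg.svd(mode_unfolding(X, k), full_matrices=True)
    return U  # shape (I_k, I_k)
```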
Now, given $U_1^i \in \mathbb{R}^{I_1 \times I_1}$ ($1 \le i \le N$), we want to find an $I_1 \times J_1$ matrix $V_1$ that maps $U_1^i$ to $T_1^{iT} \in \mathbb{R}^{I_1 \times J_1}$, that is, $T_1^i = V_1^T U_1^i \in \mathbb{R}^{J_1 \times I_1}$. We consider this problem from two angles. On the one hand, to preserve the intrinsic manifold structure, the optimal solution of the following objective function must be sought:
$$\min_{V_1} \sum_{ij} \|V_1^T U_1^i - V_1^T U_1^j\|^2 W_{ij}$$
That is to say, minimizing $\sum_{ij} \|V_1^T U_1^i - V_1^T U_1^j\|^2 W_{ij}$ guarantees that when $U_1^i$ and $U_1^j$ are "close", $V_1^T U_1^i$ and $V_1^T U_1^j$ are also "close".
$D$ is the diagonal matrix of $W$, that is, $D_{ii} = \sum_j W_{ij}$, and for a matrix $A$, the trace satisfies $\|A\|^2 = \mathrm{tr}(AA^T)$. We therefore have:
$$\begin{aligned}
\frac{1}{2}\sum_{ij} \|V_1^T U_1^i - V_1^T U_1^j\|^2 W_{ij}
&= \frac{1}{2}\sum_{ij} \mathrm{tr}\big((T_1^i - T_1^j)(T_1^i - T_1^j)^T\big) W_{ij} \\
&= \frac{1}{2}\sum_{ij} \mathrm{tr}\big(T_1^i T_1^{iT} + T_1^j T_1^{jT} - T_1^i T_1^{jT} - T_1^j T_1^{iT}\big) W_{ij} \\
&= \mathrm{tr}\Big(\sum_i D_{ii} T_1^i T_1^{iT} - \sum_{ij} W_{ij} T_1^i T_1^{jT}\Big) \\
&= \mathrm{tr}\Big(\sum_i D_{ii} V_1^T U_1^i U_1^{iT} V_1 - \sum_{ij} W_{ij} V_1^T U_1^i U_1^{jT} V_1\Big) \\
&= \mathrm{tr}\Big(V_1^T \Big(\sum_i D_{ii} U_1^i U_1^{iT} - \sum_{ij} W_{ij} U_1^i U_1^{jT}\Big) V_1\Big) \\
&= \mathrm{tr}\big(V_1^T (D_U - W_U) V_1\big)
\end{aligned}$$
where $D_U = \sum_i D_{ii} U_1^i U_1^{iT}$ and $W_U = \sum_{ij} W_{ij} U_1^i U_1^{jT}$. From the derivation above it can be seen that to solve $\min_{V_1} \sum_{ij} \|V_1^T U_1^i - V_1^T U_1^j\|^2 W_{ij}$, we need to minimize $\mathrm{tr}(V_1^T (D_U - W_U) V_1)$.
On the other hand, besides preserving the graph structure of the manifold, the overall variance on the manifold space also needs to be maximized. In general, the variance of a random variable $x$ is
$$\mathrm{var}(x) = \int_M (x - \mu)^2 \, dP(x), \qquad \mu = \int_M x \, dP(x)$$
where $M$ is the manifold of the data, $\mu$ is the expectation, and $dP$ is the probability density function. By spectral graph theory, $dP$ can be discretely estimated from the diagonal matrix $D$ ($D_{ii} = \sum_j W_{ij}$) of the sample points. We thus have the following derivation:
$$\mathrm{var}(T_1) = \sum_i \|T_1^i\|^2 D_{ii} = \sum_i \mathrm{tr}(T_1^i T_1^{iT}) D_{ii} = \sum_i \mathrm{tr}(V_1^T U_1^i U_1^{iT} V_1) D_{ii} = \mathrm{tr}\Big(V_1^T \Big(\sum_i D_{ii} U_1^i U_1^{iT}\Big) V_1\Big) = \mathrm{tr}(V_1^T D_U V_1)$$
Combining the constraints from the two aspects above, we obtain the following optimization problem:
$$\min_{V_1} \frac{\mathrm{tr}\big(V_1^T (D_U - W_U) V_1\big)}{\mathrm{tr}\big(V_1^T D_U V_1\big)}$$
Obviously, the optimal $V_1$ is given by the generalized eigenvectors of $(D_U - W_U, D_U)$. We can therefore obtain the optimal $V_1$ by solving the following generalized eigenvector problem:
$$(D_U - W_U) V_1 = \lambda D_U V_1$$
Once $V_1$ is computed, $T_1^i = V_1^T U_1^i \in \mathbb{R}^{J_1 \times I_1}$ can be obtained from $U_1^i \in \mathbb{R}^{I_1 \times I_1}$ ($1 \le i \le N$). Likewise, the intermediate transformation matrices $V_2$ and $V_3$ for the audio and text modalities can be computed by the same method, so that $T_2^i = V_2^T U_2^i \in \mathbb{R}^{J_2 \times I_2}$ is obtained from $U_2^i \in \mathbb{R}^{I_2 \times I_2}$ ($1 \le i \le N$) and $V_2$, and $T_3^i = V_3^T U_3^i \in \mathbb{R}^{J_3 \times I_3}$ is obtained from $U_3^i \in \mathbb{R}^{I_3 \times I_3}$ ($1 \le i \le N$) and $V_3$. The data in the low-dimensional video tensor shot set $Y$ are then $Y_i|_{i=1}^N = X_i \times_1 T_1^{iT} \times_2 T_2^{iT} \times_3 T_3^{iT} \in \mathbb{R}^{J_1 \times J_2 \times J_3}$.
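A sketch of this step with scipy, forming $D_U$ and $W_U$ from the per-shot left matrices and keeping the $J$ eigenvectors with the smallest eigenvalues (it assumes $D_U$ is positive definite so the generalized symmetric solver applies):

```python
import numpy as np
from scipy.linalg import eigh

def intermediate_matrix(U_list, W, D, J):
    """Solve (D_U - W_U) v = lambda * D_U v for one mode.
    U_list: per-shot left SVD matrices U^i (each I x I); W, D as above.
    Returns the intermediate matrix V (I x J) and per-shot T^i = V^T U^i."""
    N, I = len(U_list), U_list[0].shape[0]
    D_U = sum(D[i, i] * U_list[i] @ U_list[i].T for i in range(N))
    W_U = np.zeros((I, I))
    for i in range(N):
        for j in range(N):
            if W[i, j] != 0.0:
                W_U += W[i, j] * U_list[i] @ U_list[j].T
    # scipy's eigh solves the generalized symmetric problem A v = lambda B v
    vals, vecs = eigh(D_U - W_U, D_U)
    V = vecs[:, :J]                      # smallest-eigenvalue directions
    T = [V.T @ U_i for U_i in U_list]    # T^i = V^T U^i, each J x I
    return V, T
```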
Below is the algorithm for subspace embedding and dimension reduction of tensor shots (an illustrative sketch of the final mode-product step follows the listing).
Input: the original training tensor shot set $X = \{X_1, X_2, \ldots, X_N\} \subset \mathbb{R}^{I_1 \times I_2 \times I_3}$;
Output: the mapped low-dimensional tensor shot set $Y = \{Y_1, Y_2, \ldots, Y_N\} \subset \mathbb{R}^{J_1 \times J_2 \times J_3}$; the intermediate transformation matrices $V_1 \in \mathbb{R}^{I_1 \times J_1}$, $V_2 \in \mathbb{R}^{I_2 \times J_2}$, and $V_3 \in \mathbb{R}^{I_3 \times J_3}$; and the transformation matrices $T_1^i = V_1^T U_1^i \in \mathbb{R}^{J_1 \times I_1}$, $T_2^i = V_2^T U_2^i \in \mathbb{R}^{J_2 \times I_2}$, and $T_3^i = V_3^T U_3^i \in \mathbb{R}^{J_3 \times I_3}$, satisfying $Y_i|_{i=1}^N = X_i \times_1 T_1^{iT} \times_2 T_2^{iT} \times_3 T_3^{iT}$.
Algorithm description:
Step 1: construct a nearest-neighbor graph $G$;
Step 2: compute the weight matrix $W$;
Step 3: For $k = 1$ to 3
Step 4: For $i = 1$ to $N$
Step 5: compute the left matrix $U_{(k)}^i$ of the SVD of the mode-$k$ unfolding matrix $X_{(k)}^i$ of $X_i$;
Step 6: End;
Step 7: $D_U = \sum_i D_{ii} U_{(k)}^i U_{(k)}^{iT}$;
Step 8: $W_U = \sum_{ij} W_{ij} U_{(k)}^i U_{(k)}^{jT}$;
Step 9: solve the generalized eigenvector problem $(D_U - W_U) V_k = \lambda D_U V_k$ to obtain the optimal $V_k$;
Step 10: For $i = 1$ to $N$
Step 11: $T_{(k)}^i = V_k^T U_{(k)}^i \in \mathbb{R}^{J_k \times I_k}$;
Step 12: End;
Step 13: End;
Step 14: For $i = 1$ to $N$
Step 15: $Y_i = X_i \times_1 T_1^{iT} \times_2 T_2^{iT} \times_3 T_3^{iT}$;
Step 16: End.
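Steps 14 to 16 are plain mode products; a numpy sketch with each $T_k$ of size $J_k \times I_k$:

```python
import numpy as np

def embed_shot(X, T1, T2, T3):
    """Multilinear projection Y = X x1 T1 x2 T2 x3 T3 for one shot,
    with T_k of shape (J_k, I_k); yields Y in R^{J1 x J2 x J3}."""
    Y = np.einsum('abc,ja->jbc', X, T1)  # mode-1 product with T1
    Y = np.einsum('jbc,kb->jkc', Y, T2)  # mode-2 product with T2
    Y = np.einsum('jkc,lc->jkl', Y, T3)  # mode-3 product with T3
    return Y
```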
The method for building the classifier model on the dimension-reduced set of video tensor shots with a support tensor machine is as follows: in this step, we adopt the support tensor machine to train the classifier of tensor shots. The input of the training model is the low-dimensional tensors $Y_i$ obtained in the previous step by subspace embedding and dimensionality reduction, rather than the original $X_i$. This treatment not only improves accuracy but also improves the efficiency of training and classification.
The algorithm for training the classifier with the support tensor machine is as follows (an illustrative sketch follows the listing).
Input: the mapped low-dimensional tensor shot set $Y_i|_{i=1}^N = X_i \times_1 T_1^{iT} \times_2 T_2^{iT} \times_3 T_3^{iT} \in \mathbb{R}^{J_1 \times J_2 \times J_3}$, and the corresponding class labels $y_i \in \{+1, -1\}$;
Output: the tensor hyperplane parameters of the classifier model, $w_k|_{k=1}^3 \in \mathbb{R}^{J_k}$ and $b \in \mathbb{R}$;
Algorithm description:
Step 1: set $w_k|_{k=1}^3$ to random unit vectors in $\mathbb{R}^{J_k}$;
Step 2: repeat Steps 3 to 5 until convergence;
Step 3: For $j = 1$ to 3
Step 4: obtain $w_j \in \mathbb{R}^{J_j}$ and $b$ by solving the optimization problem
$$\min_{w_j, b, \xi} J_{C\text{-}STM}(w_j, b, \xi) = \frac{\eta}{2}\|w_j\|_{Fro}^2 + c\sum_{i=1}^N \xi_i \quad \text{s.t.} \quad y_i\Big[w_j^T\Big(Y_i \prod_{\substack{k=1 \\ k \ne j}}^{3} \times_k w_k\Big) + b\Big] \ge 1 - \xi_i,\ 1 \le i \le N,\ \xi \ge 0$$
where $c$ is a constant, $\xi$ is the slack variable, and $\eta = \prod_{1 \le k \le 3,\, k \ne j} \|w_k\|_{Fro}^2$;
Step 5: End;
Step 6: check convergence: if $\sum_{k=1}^3 \big[\,|w_{k,t}^T w_{k,t-1}| \cdot \|w_{k,t}\|_{Fro}^{-2} - 1\,\big] \le \epsilon$, then $w_k|_{k=1}^3$ has converged; here $w_{k,t}$ is the current projection vector and $w_{k,t-1}$ is the previous projection vector;
Step 7: End.
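A hedged sketch of this alternating procedure: under the standard reduction, each inner problem in Step 4 is a linear SVM on the mode-$j$ contractions, with the $\eta$ factor folded into the SVM's C parameter (our reduction); scikit-learn's SVC stands in for the patent's unspecified solver, and the stopping rule mirrors Step 6:

```python
import numpy as np
from sklearn.svm import SVC

def train_stm(Y, y, c=1.0, eps=1e-4, max_iter=50):
    """Alternating-projection support tensor machine (illustrative).
    Y: (N, J1, J2, J3) embedded shots; y: labels in {+1, -1}."""
    J = Y.shape[1:]
    rng = np.random.default_rng(0)
    w = [rng.standard_normal(Jk) for Jk in J]
    w = [wk / np.linalg.norm(wk) for wk in w]   # Step 1: random unit vectors
    b = 0.0
    for _ in range(max_iter):
        w_prev = [wk.copy() for wk in w]
        for j in range(3):                       # Step 3: loop over modes
            # contract Y_i with w_k along every mode k other than j
            others = [k for k in range(3) if k != j]
            Xj = Y
            for k in sorted(others, reverse=True):
                Xj = np.tensordot(Xj, w[k], axes=([k + 1], [0]))
            eta = np.prod([np.dot(w[k], w[k]) for k in others])
            svm = SVC(kernel='linear', C=c / eta).fit(Xj, y)   # Step 4
            w[j], b = svm.coef_.ravel(), float(svm.intercept_[0])
        # Step 6: stop when consecutive projection vectors stop changing
        drift = sum(abs(abs(np.dot(w[k], w_prev[k])) / np.dot(w[k], w[k]) - 1.0)
                    for k in range(3))
        if drift <= eps:
            break
    return w, b
```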
For a test shot, after projection with the transformation matrices computed from the training set, semantic concept detection is performed with the classifier model: in this step, we use the classifier model obtained by the preceding training to detect new data outside the training set. Because our dimensionality reduction method is linear, new data can be mapped directly into the low-dimensional subspace and then classified by the classifier.
Let $X_t$ be a detection example outside the training set; the following algorithm gives the detection procedure (an illustrative sketch follows the listing).
Input: the shot to be detected $X_t \in \mathbb{R}^{I_1 \times I_2 \times I_3}$, the intermediate transformation matrices $V_1$, $V_2$, $V_3$, and the classifier parameters $w_k|_{k=1}^3 \in \mathbb{R}^{J_k}$ and $b \in \mathbb{R}$;
Output: the class label $z_t \in \{+1, -1\}$ of $X_t$;
Algorithm description:
Step 1: For $k = 1$ to 3
Step 2: compute the left matrix $U_{(k)}^t$ of the SVD of the mode-$k$ unfolding matrix $X_{(k)}^t$ of $X_t$;
Step 3: compute $T_{(k)}^t = V_k^T U_{(k)}^t$;
Step 4: End;
Step 5: compute $Y_t = X_t \times_1 T_1^{tT} \times_2 T_2^{tT} \times_3 T_3^{tT}$;
Step 6: compute $z_t = \mathrm{sign}(Y_t \times_1 w_1 \times_2 w_2 \times_3 w_3 + b)$;
Step 7: End.
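The detection procedure, reusing the left_svd_matrix and embed_shot sketches above (again an illustration under our assumptions, not the patent's reference implementation):

```python
import numpy as np

def detect(Xt, V, w, b):
    """Classify an unseen shot Xt (I1 x I2 x I3): project each mode with the
    intermediate matrices V = (V1, V2, V3) learned on the training set, then
    evaluate the tensor hyperplane."""
    T = [V[k].T @ left_svd_matrix(Xt, k) for k in range(3)]  # T_k^t = V_k^T U_k^t
    Yt = embed_shot(Xt, T[0], T[1], T[2])                    # low-dim embedding
    score = np.einsum('abc,a,b,c->', Yt, w[0], w[1], w[2])   # Y_t x1 w1 x2 w2 x3 w3
    return int(np.sign(score + b))                           # z_t in {+1, -1}
```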

Claims (5)

1. A multimodal video semantic concept detection method based on tensor representation, characterized by comprising the following steps:
1) extracting low-level features of three modalities (image, audio, and text) from the video shots in both the training set and the test set, each video tensor shot being expressed as a 3rd-order tensor formed from these three kinds of low-level features;
2) according to the intrinsic manifold structure of the set of video tensor shots, achieving dimensionality reduction and subspace embedding of the original high-dimensional tensors by seeking transformation matrices;
3) building a classifier model on the dimension-reduced set of video tensor shots with a support tensor machine;
4) for a test shot, after projection with the transformation matrices computed from the training set, performing semantic concept detection with the classifier model;
wherein the representation of said video tensor shot is: based on the image, audio, and text low-level features extracted from the video, each video shot is represented by a 3rd-order tensor $S \in \mathbb{R}^{I_1 \times I_2 \times I_3}$, where $I_1$, $I_2$, and $I_3$ are the dimensions of the image, audio, and text features, respectively; the value of each element $s_{i_1 i_2 i_3}$ is defined as follows: $s_{i_1,1,1}$ is the value of an image feature, with $1 \le i_1 \le I_1$; $s_{2,i_2,2}$ is the value of an audio feature, with $1 \le i_2 \le I_2$; $s_{3,3,i_3}$ is the value of a text feature, with $1 \le i_3 \le I_3$; the values of all other elements are initially set to zero.

2. The multimodal video semantic concept detection method based on tensor representation according to claim 1, characterized in that said extracting of low-level features of the three modalities (image, audio, and text) from the video shots in the training set and the test set is: a key frame is selected from each video shot as its representative image, and the color histogram, texture, and Canny edges are extracted as image features; the audio segment corresponding to a video shot is extracted as an audio clip and divided into overlapping short-time audio frames, and the features of each short-time audio frame, including MFCC, centroid, roll-off frequency, spectral flux, and zero-crossing rate, are extracted to form frame feature vectors, statistics of the short-time frame feature vectors then serving as the audio features of the video shot; TF*IDF values are extracted from the recognized transcript of the video as text features.

3. The multimodal video semantic concept detection method based on tensor representation according to claim 1, characterized in that said method of achieving dimensionality reduction and subspace embedding of the original high-dimensional tensors by seeking transformation matrices according to the intrinsic manifold structure of the video tensor shots is: given the shot data set $X = \{X_1, X_2, \ldots, X_N\}$ in the space $\mathbb{R}^{I_1 \times I_2 \times I_3}$, according to the intrinsic manifold structure of the tensor shots and spectral graph theory, three transformation matrices are sought for each tensor shot $X_i|_{i=1}^N$ in $X$: $T_1^i$ of size $J_1 \times I_1$, $T_2^i$ of size $J_2 \times I_2$, and $T_3^i$ of size $J_3 \times I_3$, which map the $N$ data points to $Y = \{Y_1, Y_2, \ldots, Y_N\}$ in the space $\mathbb{R}^{J_1 \times J_2 \times J_3}$, satisfying $Y_i|_{i=1}^N = X_i \times_1 T_1^{iT} \times_2 T_2^{iT} \times_3 T_3^{iT}$, where $J_1 < I_1$, $J_2 < I_2$, $J_3 < I_3$, thereby realizing dimensionality reduction and subspace embedding of the original high-dimensional tensors; when obtaining $T_1^i|_{i=1}^N$, the optimal intermediate transformation matrix $V_1$ is computed by solving the generalized eigenvector problem $(D_U - W_U)V_1 = \lambda D_U V_1$, where $D_U = \sum_i D_{ii} U_1^i U_1^{iT}$, $W_U = \sum_{ij} W_{ij} U_1^i U_1^{jT}$, $W$ is the weight matrix of the nearest-neighbor graph constructed from the training set $X$, $D$ is the diagonal matrix of $W$ with $D_{ii} = \sum_j W_{ij}$, and $U_1^i$ is the left matrix obtained by SVD of the mode-1 unfolding matrix $X_{(1)}^i$ of $X_i|_{i=1}^N$; the transformation matrices $T_1^i|_{i=1}^N = V_1^T U_1^i \in \mathbb{R}^{J_1 \times I_1}$ can then finally be computed; $T_2^i$ and $T_3^i$ are obtained in the same way.

4. The multimodal video semantic concept detection method based on tensor representation according to claim 1, characterized in that said method of building a classifier model on the dimension-reduced set of video tensor shots with a support tensor machine is: the input of the classifier model is the low-dimensional tensors $Y_i|_{i=1}^N \in \mathbb{R}^{J_1 \times J_2 \times J_3}$ obtained by subspace embedding and dimensionality reduction, together with the corresponding class labels $y_i \in \{+1, -1\}$, and the output is the tensor hyperplane parameters $w_k|_{k=1}^3 \in \mathbb{R}^{J_k}$ and $b \in \mathbb{R}$ of the classifier model; $w_k|_{k=1}^3$ and $b$ are obtained by iteratively solving the optimization problem
$$\min_{w_j, b, \xi} J_{C\text{-}STM}(w_j, b, \xi) = \frac{\eta}{2}\|w_j\|_{Fro}^2 + c\sum_{i=1}^N \xi_i \quad \text{s.t.} \quad y_i\Big[w_j^T\Big(Y_i \prod_{\substack{k=1 \\ k \ne j}}^{3} \times_k w_k\Big) + b\Big] \ge 1 - \xi_i,\ 1 \le i \le N,\ \xi \ge 0$$
where the parameter $j$ cycles from 1 to 3, $c$ is a constant, $\xi$ is the slack variable, and $\eta = \prod_{1 \le k \le 3,\, k \ne j} \|w_k\|_{Fro}^2$.

5. The multimodal video semantic concept detection method based on tensor representation according to claim 1, characterized in that said detection for a test shot, projected with the transformation matrices computed from the training set and then subjected to semantic concept detection with the classifier model, is: new data outside the training set, $X_t \in \mathbb{R}^{I_1 \times I_2 \times I_3}$, is mapped into the low-dimensional subspace by the transformation matrices $T_1^t = V_1^T U_1^t \in \mathbb{R}^{J_1 \times I_1}$, $T_2^t = V_2^T U_2^t \in \mathbb{R}^{J_2 \times I_2}$, and $T_3^t = V_3^T U_3^t \in \mathbb{R}^{J_3 \times I_3}$ as $Y_t = X_t \times_1 T_1^{tT} \times_2 T_2^{tT} \times_3 T_3^{tT} \in \mathbb{R}^{J_1 \times J_2 \times J_3}$; class detection is then performed with the classifier model, that is, $z_t = \mathrm{sign}(Y_t \times_1 w_1 \times_2 w_2 \times_3 w_3 + b)$ is computed to obtain the class label $z_t \in \{+1, -1\}$ of the test data.
CN2008100591256A 2008-01-14 2008-01-14 Multimodal Video Semantic Concept Detection Method Based on Tensor Representation Expired - Fee Related CN101299241B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN2008100591256A CN101299241B (en) 2008-01-14 2008-01-14 Multimodal Video Semantic Concept Detection Method Based on Tensor Representation

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN2008100591256A CN101299241B (en) 2008-01-14 2008-01-14 Multimodal Video Semantic Concept Detection Method Based on Tensor Representation

Publications (2)

Publication Number Publication Date
CN101299241A CN101299241A (en) 2008-11-05
CN101299241B true CN101299241B (en) 2010-06-02

Family

ID=40079065

Family Applications (1)

Application Number Title Priority Date Filing Date
CN2008100591256A Expired - Fee Related CN101299241B (en) 2008-01-14 2008-01-14 Multimodal Video Semantic Concept Detection Method Based on Tensor Representation

Country Status (1)

Country Link
CN (1) CN101299241B (en)

Families Citing this family (26)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20110078224A1 (en) * 2009-09-30 2011-03-31 Wilson Kevin W Nonlinear Dimensionality Reduction of Spectrograms
CN101996327B (en) * 2010-09-02 2012-08-08 西安电子科技大学 Video anomaly detection method based on weighted tensor subspace background modeling
US8819019B2 (en) * 2010-11-18 2014-08-26 Qualcomm Incorporated Systems and methods for robust pattern classification
CN103312938B (en) * 2012-03-16 2016-07-06 富士通株式会社 Video process apparatus, method for processing video frequency and equipment
CN102750349B (en) * 2012-06-08 2014-10-08 华南理工大学 Video browsing method based on video semantic modeling
CN103455705A (en) * 2013-05-24 2013-12-18 中国科学院自动化研究所 Analysis and prediction system for cooperative correlative tracking and global situation of network social events
CN103336832A (en) * 2013-07-10 2013-10-02 中国科学院自动化研究所 Video classifier construction method based on quality metadata
CN103473308B (en) * 2013-09-10 2017-02-01 浙江大学 High-dimensional multimedia data classifying method based on maximum margin tensor study
CN103473307B (en) * 2013-09-10 2016-07-13 浙江大学 Cross-media sparse hash index method
US9818032B2 (en) * 2015-10-28 2017-11-14 Intel Corporation Automatic video summarization
CN105701504B (en) * 2016-01-08 2019-09-13 天津大学 Multimodal Manifold Embedding Method for Zero-Shot Learning
CN105701514B (en) * 2016-01-15 2019-05-21 天津大学 A method of the multi-modal canonical correlation analysis for zero sample classification
CN105718940B (en) * 2016-01-15 2019-03-29 天津大学 The zero sample image classification method based on factorial analysis between multiple groups
CN106529435B (en) * 2016-10-24 2019-10-15 天津大学 Action recognition method based on tensor quantization
CN107341522A (en) * 2017-07-11 2017-11-10 重庆大学 A kind of text based on density semanteme subspace and method of the image without tag recognition
CN108595555B (en) * 2018-04-11 2020-12-08 西安电子科技大学 Image retrieval method based on semi-supervised tensor quantum space regression
CN109214302A (en) * 2018-08-13 2019-01-15 湖南志东科技有限公司 One kind being based on multispectral substance identification
CN109936766B (en) * 2019-01-30 2021-04-13 天津大学 An end-to-end audio generation method for water scenes
CN110209758B (en) * 2019-04-18 2021-09-03 同济大学 Text increment dimension reduction method based on tensor decomposition
CN110097010A (en) * 2019-05-06 2019-08-06 北京达佳互联信息技术有限公司 Picture and text detection method, device, server and storage medium
CN112257857B (en) * 2019-07-22 2024-06-04 中科寒武纪科技股份有限公司 Tensor processing method and related product
CN111400601B (en) * 2019-09-16 2023-03-10 腾讯科技(深圳)有限公司 Video recommendation method and related equipment
CN111222011B (en) * 2020-01-06 2023-11-14 腾讯科技(深圳)有限公司 Video vector determining method and device
CN111460971B (en) * 2020-03-27 2023-09-12 北京百度网讯科技有限公司 Video concept detection method and device and electronic equipment
CN112015955B (en) * 2020-09-01 2021-07-30 清华大学 A multimodal data association method and device
CN112804558B (en) * 2021-04-14 2021-06-25 腾讯科技(深圳)有限公司 Video splitting method, device and equipment

Also Published As

Publication number Publication date
CN101299241A (en) 2008-11-05

Similar Documents

Publication Publication Date Title
CN101299241B (en) Multimodal Video Semantic Concept Detection Method Based on Tensor Representation
CN101894276B (en) Training method of human action recognition and recognition method
Changpinyo et al. Synthesized classifiers for zero-shot learning
Shah et al. Multi-view action recognition using contrastive learning
Liu et al. $ p $-Laplacian regularized sparse coding for human activity recognition
CN107392147A (en) A kind of image sentence conversion method based on improved production confrontation network
CN102663015A (en) Video semantic labeling method based on characteristics bag models and supervised learning
Zhang et al. Recognition of emotions in user-generated videos with kernelized features
CN111506773A (en) Video duplicate removal method based on unsupervised depth twin network
Liang et al. 3D human action recognition using a single depth feature and locality-constrained affine subspace coding
CN117150076B (en) A self-supervised video summarization method
CN107967441B (en) Video behavior identification method based on two-channel 3D-2D RBM model
Zhang et al. Video action recognition with Key-detail Motion Capturing based on motion spectrum analysis and multiscale feature fusion
Zhang [Retracted] Sports Action Recognition Based on Particle Swarm Optimization Neural Networks
Roy et al. Sparsity-inducing dictionaries for effective action classification
Li et al. Action recognition with spatio-temporal augmented descriptor and fusion method
CN116385946B (en) Video-oriented target segment location method, system, storage medium and device
Tong et al. Unconstrained Facial expression recognition based on feature enhanced CNN and cross-layer LSTM
Zhang et al. A multi-view camera-based anti-fraud system and its applications
Ma et al. Motion feature retrieval in basketball match video based on multisource motion feature fusion
Arif et al. Trajectory-Based 3D Convolutional Descriptors for Human Action Recognition.
CN109857906B (en) Multi-video abstraction method based on query unsupervised deep learning
Li et al. Action recognition based on depth motion map and hybrid classifier
Xiong Motion Recognition Model of Sports Video Based on Feature Extraction Algorithm
Wang et al. Self-trained video anomaly detection based on teacher-student model

Legal Events

Date Code Title Description
C06 Publication
PB01 Publication
C10 Entry into substantive examination
SE01 Entry into force of request for substantive examination
C14 Grant of patent or utility model
GR01 Patent grant
CF01 Termination of patent right due to non-payment of annual fee

Granted publication date: 20100602

Termination date: 20150114

EXPY Termination of patent right or utility model