CN104036023A - Method for creating context fusion tree video semantic indexes - Google Patents

Method for creating context fusion tree video semantic indexes

Info

Publication number
CN104036023A
Authority
CN
China
Prior art keywords
camera lens
video
semantic
scene
context
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Granted
Application number
CN201410297974.0A
Other languages
Chinese (zh)
Other versions
CN104036023B (en)
Inventor
余春艳
苏晨涵
翁子林
陈昭炯
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Fuzhou University
Original Assignee
Fuzhou University
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Fuzhou University
Priority to CN201410297974.0A
Publication of CN104036023A
Application granted
Publication of CN104036023B
Expired - Fee Related
Anticipated expiration

Links

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/70 Information retrieval; Database structures therefor; File system structures therefor of video data
    • G06F16/71 Indexing; Data structures therefor; Storage structures

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Software Systems (AREA)
  • Multimedia (AREA)
  • Data Mining & Analysis (AREA)
  • Databases & Information Systems (AREA)
  • Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Studio Devices (AREA)
  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)

Abstract

The invention belongs to the field of video retrieval technology and discloses a method for building tree-structured video semantic indexes. A video semantic index built with the method contains video semantics of multiple granularities and fuses the context among those semantics; the semantics of different granularities are linked to one another through that context so that a tree structure is formed. The method comprises the steps of: extracting the shot semantic set of each shot, shot by shot; obtaining the context among the video shot semantics under supervision and representing it with a context label tree; inferring scene semantics by combining the shot semantic sets with the context information; and embedding the shot semantic sets and the scene semantics into the context label tree to obtain the video index. After a semantic index has been built for a video with the method, users can retrieve the video by entering keywords of different granularities, and the context information in the index narrows the search space, so the efficiency of the retrieval system is improved.

Description

A method for building a context-fused tree-structured video semantic index
Technical field
The invention belongs to the field of video retrieval technology and is a method that builds a video semantic index from the shot semantics of a video, its scene semantics, and the context among those semantics.
Background technology
Video data has already become one of the most important kinds of data on the Internet. With the explosive growth of video data, however, managing and retrieving video efficiently has become a very difficult problem. A user typically enters a keyword when retrieving video, and a video search engine then finds the relevant video data according to that keyword. This requires building a suitable semantic index over the video in order to improve the efficiency and hit rate of the user's search. In semantic video indexing, a computer automatically analyzes the visual features of a video to obtain the semantic information it contains and then uses that semantic information as the index of the video, so that the user can retrieve video by entering keywords.
Users' demands on video search engines keep rising, and users enter keywords of different granularities depending on their needs; for example, when searching for football-related video a user may enter keywords as varied as "football", "highlight", "shooting" or "referee close-up". A traditional single-granularity, flat video semantic index cannot satisfy such search requirements. Moreover, the semantic content of video is rich: besides the semantic information itself there is a large amount of contextual information. Contextual information can help a search engine understand the interaction between semantics of different granularities and establish relations between the semantics of different granularities in a video, so that related video can be retrieved through these relations. Contextual information can also narrow the search space while preserving the search hit rate, improving search efficiency. On this basis, the present invention realizes a video semantic index that fuses context, improving the effectiveness of video indexing.
Summary of the invention
The object of the invention is to realize a method that builds a tree-structured video semantic index fused with contextual information. The method incorporates contextual information into the video semantic index and improves the hit rate and efficiency of video retrieval.
The invention adopts the following scheme: a method for building a context-fused tree-structured video semantic index, characterized in that the method comprises the following steps:
Step 1: input n training video segments video_j, j ∈ {1, ..., n}; preprocess each video_j, then manually annotate the shot semantic set of every shot of video_j, shot by shot, construct a shot-semantic training set for every class of shot semantic to train classifiers, and obtain the shot semantic analyzers. Input the m video segments video_k, k ∈ {1, ..., m}, for which tree indexes are to be built, preprocess each video_k, and use the shot semantic analyzers to extract the shot semantic set of every shot of video_k;
Step 2: taking a video segment as the unit, manually annotate the context among the shot semantics of video_j and represent it with a context label tree LT_j carrying context labels, and build a context training set. Train a structural support vector machine (SVM-Struct) to obtain the context label tree analyzer. Use the context analyzer to extract the context label tree LT_k of video_k;
Step 3: taking a scene of video_j as the unit, manually annotate the scene semantics and build a scene-semantics training set. Train a C4.5 classifier to obtain the scene semantic analyzer. Use the scene semantic analyzer to extract the scene semantic of every scene of video_k;
Step 4: embed the shot semantic set of every shot of video_k obtained in step 1 and the scene semantic of every scene of video_k obtained in step 3 into the corresponding nodes of the LT_k obtained in step 2, and take the LT_k carrying shot semantics and scene semantics as the video index of video_k.
Further, step 1 is carried out as follows:
Step 2.1: segment the n training video segments video_j into shots, obtaining r training shots; extract and quantize the visual features of each shot to form a visual feature vector v;
Step 2.2: define the annotation semantic set Semantic = {Sem_t | t = 1, ..., e}; manually annotate the semantics Sem_t that appear in the r shots, adding each to the shot semantic set of its shot; then construct a shot-semantic training set for every class of shot semantic Sem_t, obtaining e shot-semantic training sets Tra_t = {(v_i, s_i) | i = 1, ..., r}, where s_i = 1 if the semantic Sem_t appears in shot i and s_i = 0 otherwise;
Step 2.3: using the SVM classifier as the classification model, train one classifier SVM_t for every semantic Sem_t; the discriminant function of SVM_t has the form f_t(v) = sgn[g(v)], where g(v) = <w, v> + b; the objective of training SVM_t on the training set Tra_t is therefore:
\min_{w,b}\ \tfrac{1}{2}\|w\|^2 \quad \text{s.t.}\quad s_i(\langle w, v_i\rangle + b) - 1 \ge 0 \qquad (1)
Using a Lagrangian function to merge the objective and its constraints, (1) is converted into:
\max_{\alpha}\ \sum_{i=1}^{r}\alpha_i - \tfrac{1}{2}\sum_{i,h=1}^{r}\alpha_i\alpha_h s_i s_h \langle v_i, v_h\rangle \quad \text{s.t.}\quad \alpha_i \ge 0,\ \sum_{i=1}^{r}\alpha_i s_i = 0 \qquad (2)
Introducing a kernel function K(v_i, v_h), formula (2) becomes:
\max_{\alpha}\ \sum_{i=1}^{r}\alpha_i - \tfrac{1}{2}\sum_{i,h=1}^{r}\alpha_i\alpha_h s_i s_h K(v_i, v_h) \quad \text{s.t.}\quad \alpha_i \ge 0,\ \sum_{i=1}^{r}\alpha_i s_i = 0 \qquad (3)
The kernel function is chosen to be the radial basis function, defined as:
K(v_i, v_h) = \exp\!\left(-\frac{\|v_i - v_h\|^2}{2\sigma^2}\right) \qquad (4)
where exp(·) is the exponential function and σ is a parameter.
After training, a set of α_i is determined, and with it the discriminant function of the shot semantic Sem_t:
f_t(v) = \operatorname{sgn}\!\left[\sum_{i=1}^{r}\alpha_i s_i K(v_i, v) + b_0\right] \qquad (5)
where b_0 is a parameter.
Step 2.4: after the classifiers SVM_t of all Sem_t have been trained according to step 2.3, the discriminant functions of the e shot semantics are obtained; together they form the shot semantic analyzer group.
Step 2.5: segment the m video segments video_k, for which tree indexes are to be built, into shots, then extract the visual features of every shot to form a feature vector v; input v into the shot semantic analyzer group to decide which semantics appear in the shot, and add the detected semantics to the shot semantic set of that shot.
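To make step 1 concrete, the following minimal sketch trains one RBF-kernel SVM per shot semantic and applies the resulting analyzer group to a new shot, along the lines of formulas (1)-(5). It assumes the shot visual features are already extracted as fixed-length vectors; the semantic names, the random feature data and the σ value are illustrative placeholders, not part of the patent.

```python
# Sketch of step 1: one binary RBF-kernel SVM per shot semantic (formulas (1)-(5)).
# Feature data, semantic names and sigma are illustrative placeholders.
import numpy as np
from sklearn.svm import SVC

semantic_set = ["goal", "close_up", "audience"]            # Semantic = {Sem_t}
rng = np.random.default_rng(0)
train_vectors = rng.random((200, 64))                       # v_i: quantized visual features of r shots
train_labels = {s: rng.integers(0, 2, 200) for s in semantic_set}   # s_i in {0, 1} per semantic

sigma = 0.5
shot_analyzers = {}
for sem in semantic_set:                                    # train SVM_t for every Sem_t
    clf = SVC(kernel="rbf", gamma=1.0 / (2 * sigma ** 2))   # RBF kernel of formula (4)
    clf.fit(train_vectors, train_labels[sem])
    shot_analyzers[sem] = clf

def shot_semantic_set(v):
    """Return the shot semantic set: every Sem_t whose discriminant f_t(v) fires (formula (5))."""
    v = np.asarray(v).reshape(1, -1)
    return {sem for sem, clf in shot_analyzers.items() if clf.predict(v)[0] == 1}

print(shot_semantic_set(rng.random(64)))
```

Whether a semantic "fires" here is simply the SVM's sign decision, matching the sgn[·] form of formula (5).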
Further, step 2 is carried out as follows:
Step 3.1: from the shot semantic set of every shot of video_j, take one shot semantic to represent that shot, and form the shot semantic sequence wu_j according to temporal order;
Step 3.2: manually annotate the context of wu_j and represent the contextual information with a context label tree LT_j; a context label tree is a five-tuple LT = <L, Video, Scene, NL, P>, where L is the set of shot semantic labels, whose elements are the shot semantics representing the shots in wu_j; Video is the "video context" label, whose context is the content of the whole video segment expressed jointly by its child nodes; Scene is the "scene context" label, whose context is the content of one scene expressed jointly by its child nodes; NL is the set of context labels other than Video and Scene, each element representing one kind of context relation; and P is the set of context rules, each element representing one context rule;
Step 3.3: construct the context training set from the n shot semantic sequences wu_j and their corresponding context label trees:
Context = {(x_j, y_j) | j = 1, ..., n}, where x_j is a shot semantic sequence and y_j is the corresponding context label tree;
Step 3.4: train a structural support vector machine (SVM-Struct) on the context training set; the concrete operations are:
Step 3.4.1: construct the mapping from a shot semantic sequence to a context label tree as h(x; W) = argmax_{y ∈ Y} f(x, y; W),
where f(x, y; W) = <W, ψ(x, y)> is the discriminant function, W is the weight vector, and ψ(x, y) is the joint feature vector of a shot semantic sequence in the training data and its corresponding context label tree; ψ(x, y) is constructed as:
\psi(x, y) = (a_1, a_2, \ldots, a_N)^{\top}
where a_i is the number of times the context rule p_i (i ∈ [1, N]) occurs in the context rule set P of this context label tree, and N is the total number of context rule classes occurring in the context training set;
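As an illustration of step 3.4.1, the sketch below builds ψ(x, y) as a count vector over a fixed inventory of context rules; the rule strings and the small inventory are assumptions made for the example, standing in for the N rule classes collected from the context training set.

```python
# Sketch of step 3.4.1: psi(x, y) as a vector of context-rule counts (a_1, ..., a_N).
# The rule inventory below is an illustrative stand-in; its order fixes the vector layout.
from collections import Counter

RULES = ["Video -> Scene Scene", "Scene -> nl1 l3", "nl1 -> l1 l2"]   # N = 3 here

def joint_feature(tree_rules):
    """tree_rules: the context rule set P of one context label tree y, as rule strings."""
    counts = Counter(tree_rules)
    return [counts.get(rule, 0) for rule in RULES]                    # [a_1, ..., a_N]

print(joint_feature(["nl1 -> l1 l2", "Scene -> nl1 l3", "nl1 -> l1 l2"]))   # [0, 1, 2]
```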
Step 3.4.2: training SVM-Struct is cast as the optimization problem:
\min_{W,\varepsilon}\ \tfrac{1}{2}\|W\|^2 + \frac{C}{n}\sum_{j=1}^{n}\varepsilon_j \quad \text{s.t.}\quad \forall j,\ \forall y \in Y:\ \langle W,\ \psi(x_j, y_j) - \psi(x_j, y)\rangle \ge \Delta(y_j, y) - \varepsilon_j \qquad (6)
where ε_j is a slack variable, C > 0 is the penalty for misclassified samples, and Δ(y_j, y) is the loss function, taken as Δ(y_j, y) = 1 − F_1(y_j, y); here y_j is the true context label tree of a shot semantic sequence in the context training set, y is the context label tree predicted during training, and F_1 is computed as follows:
\text{Precision} = \frac{|E(y_j) \cap E(y)|}{|E(y)|}
\text{Recall} = \frac{|E(y_j) \cap E(y)|}{|E(y_j)|}
F_1 = \frac{2 \cdot \text{Precision} \cdot \text{Recall}}{\text{Precision} + \text{Recall}}
where Precision is the precision of the node predictions in the context label tree, Recall is the recall of the node predictions in the context label tree, E(y_j) is the edge set of y_j, and E(y) is the edge set of y;
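The loss Δ(y_j, y) = 1 − F_1 above depends only on the edge sets of the two trees; a small sketch of that computation follows, with edges written as (parent, child) pairs — a representation assumed here purely for illustration.

```python
# Sketch of the loss in step 3.4.2: Delta(y_j, y) = 1 - F1 over the edge sets E(y_j), E(y).
def tree_loss(true_edges, pred_edges):
    true_edges, pred_edges = set(true_edges), set(pred_edges)
    if not true_edges or not pred_edges:
        return 1.0
    overlap = len(true_edges & pred_edges)
    precision = overlap / len(pred_edges)
    recall = overlap / len(true_edges)
    if precision + recall == 0:
        return 1.0
    f1 = 2 * precision * recall / (precision + recall)
    return 1.0 - f1

# Edges as (parent, child) pairs; the labels are illustrative.
print(tree_loss({("Video", "Scene"), ("Scene", "l1"), ("Scene", "l2")},
                {("Video", "Scene"), ("Scene", "l1"), ("Scene", "l3")}))   # 1 - 2/3
```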
Step 3.4.3: convert formula (6) into its dual form:
\max_{\alpha}\ \sum_{j,\, y \ne y_j}\alpha_{jy} - \tfrac{1}{2}\sum_{\substack{j,\, y \ne y_j \\ z,\, \bar{y} \ne y_z}}\alpha_{jy}\,\alpha_{z\bar{y}}\,\langle \delta\psi_j(y), \delta\psi_z(\bar{y})\rangle \quad \text{s.t.}\quad \forall j,\ \forall y \in Y \setminus y_j:\ \alpha_{jy} \ge 0 \qquad (7)
where the α_{jy} are Lagrange multipliers; for the soft margin there is, in addition, the group of constraints:
\forall j:\ n\sum_{y \ne y_j}\alpha_{jy}\,\Delta(y_j, y) \le C
Step 3.4.4: solve formula (7) on the context training set Context; once an optimal set of α_{jy} has been found, the weight vector W is determined and the context label tree analyzer is obtained;
Step 3.5: extract the shot semantic sequence wu_k of video_k in the same way as in step 3.1, input wu_k into the video context label tree analyzer, and obtain the LT_k of wu_k.
Further, step 3 is carried out as follows:
Step 4.1: according to the "scene context" labels Scene in LT_j, take the shots corresponding to the leaf nodes under each Scene label as one complete video scene, thereby realizing the scene segmentation of the video; then, scene by scene, manually annotate the scene semantics of video_j;
Step 4.2: use the shot semantic set of every shot in each scene and the contextual information in the corresponding LT_j to construct the scene-semantics training set, in which the features of a scene are of two kinds:
A. shot semantic features: if a shot semantic appears in the scene, the value of that shot semantic feature is 1, otherwise 0;
B. contextual features: a contextual feature is the context relation between two shot semantics; since each shot semantic corresponds to a leaf node in LT_j, the contextual feature value of two shot semantics is the context label at the nearest common ancestor node of their two leaf nodes;
Step 4.3: taking the C4.5 algorithm as the classification model, select attributes as nodes according to the information gain ratio of each feature attribute in the scene-semantics training set, finally generating the decision tree that analyzes video scene semantics, and take this decision tree as the scene semantic analyzer;
Step 4.4: according to the LT_k of wu_k, divide video_k into scenes with the same method as in step 4.1, and extract the feature vector of each scene, scene by scene; input the feature vector of every scene of video_k into the scene semantic analyzer to obtain the scene semantic of every scene of video_k.
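The scene analyzer of step 3 can be sketched as follows. The feature vector concatenates the shot-semantic presence bits (feature A) with an index-encoded context label (feature B); scikit-learn's CART tree with the entropy criterion stands in for C4.5 here, and the semantic names, label names and toy training scenes are assumptions made for the example.

```python
# Sketch of step 3: scene feature vectors (presence bits + context label) and a decision tree.
# CART with the entropy criterion stands in for C4.5; names and toy data are illustrative.
from sklearn.tree import DecisionTreeClassifier

semantic_set = ["goal", "close_up", "audience"]
context_labels = ["none", "nl1", "Scene"]            # possible nearest-common-ancestor labels

def scene_feature_vector(shot_semantic_sets, lca_label):
    """shot_semantic_sets: shot semantic sets of the shots in one scene (feature A).
    lca_label: context label at the nearest common ancestor of a semantic pair (feature B)."""
    presence = [int(any(sem in s for s in shot_semantic_sets)) for sem in semantic_set]
    return presence + [context_labels.index(lca_label)]

X = [scene_feature_vector([{"goal", "close_up"}], "nl1"),
     scene_feature_vector([{"audience"}], "none")]
y = ["shooting_scene", "crowd_scene"]

scene_analyzer = DecisionTreeClassifier(criterion="entropy").fit(X, y)
print(scene_analyzer.predict([scene_feature_vector([{"goal"}], "nl1")]))
```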
Further, step 4 is carried out as follows:
Step 5.1: replace the shot semantic label in every leaf node of LT_k with the shot semantic set of the shot that leaf represents;
Step 5.2: replace every Scene label in LT_k with the corresponding scene semantic;
Step 5.3: take the LT_k containing shot semantics and scene semantics as the video index of video_k.
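A minimal sketch of step 4 follows, using a nested-dict representation of LT_k that is assumed here for illustration (the patent does not prescribe a data structure): each leaf label is replaced by its shot's semantic set and each Scene label by its scene semantic.

```python
# Sketch of step 4: embed shot semantic sets and scene semantics into LT_k to form the index.
# The nested-dict tree layout and the shot_id/scene_id fields are assumed for illustration.
def embed_index(node, shot_semantics, scene_semantics):
    if not node.get("children"):                       # step 5.1: leaf -> shot semantic set
        node["semantics"] = sorted(shot_semantics[node["shot_id"]])
        return node
    if node["label"] == "Scene":                       # step 5.2: Scene -> scene semantic
        node["semantics"] = scene_semantics[node["scene_id"]]
    for child in node["children"]:
        embed_index(child, shot_semantics, scene_semantics)
    return node

lt_k = {"label": "Video", "children": [
    {"label": "Scene", "scene_id": 0, "children": [
        {"label": "nl1", "children": [
            {"label": "l1", "shot_id": 0, "children": []},
            {"label": "l2", "shot_id": 1, "children": []}]}]}]}

index = embed_index(lt_k,                              # step 5.3: the filled tree is the index
                    shot_semantics={0: {"goal"}, 1: {"close_up", "audience"}},
                    scene_semantics={0: "shooting_scene"})
print(index)
```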
The beneficial effect of the invention is as follows: after a semantic index has been built for a video with the method of the invention, users can retrieve the video with keywords of different granularities, and the contextual information in the index narrows the search space and improves the efficiency of the retrieval system.
Brief description of the drawings
Fig. 1 is the flow of building the tree-structured video semantic index.
Fig. 2 is the model of a video context label tree.
Fig. 3 is a tree-structured video index model.
Embodiment
Please refer to Fig. 1. In the method for building a context-fused tree-structured video index, the semantic information of each shot is first extracted shot by shot; then the context among the video shot semantics is obtained under supervision and represented with a tree structure; next, the scene semantics are inferred by combining the shot semantics with their context; finally, the shot semantics and scene semantics are embedded into the tree structure, which serves as the index of the video. The details are as follows:
1. Segment the n training video segments video_j into shots, obtaining r training shots. Extract and quantize the visual features of each shot to form a visual feature vector v.
Define the annotation semantic set Semantic = {Sem_t | t = 1, ..., e}. Manually annotate the semantics Sem_t that appear in the r shots, adding each to the shot semantic set of its shot; then construct a shot-semantic training set for every class of shot semantic Sem_t, obtaining e shot-semantic training sets Tra_t = {(v_i, s_i) | i = 1, ..., r}, where s_i = 1 if the semantic Sem_t appears in shot i and s_i = 0 otherwise.
Using the SVM classifier as the classification model, train one classifier SVM_t for every semantic Sem_t. The discriminant function of SVM_t has the form f_t(v) = sgn[g(v)], where g(v) = <w, v> + b. The objective of training SVM_t on the training set Tra_t is therefore:
\min_{w,b}\ \tfrac{1}{2}\|w\|^2 \quad \text{s.t.}\quad s_i(\langle w, v_i\rangle + b) - 1 \ge 0 \qquad (1)
Using a Lagrangian function to merge the objective and its constraints, (1) is converted into:
\max_{\alpha}\ \sum_{i=1}^{r}\alpha_i - \tfrac{1}{2}\sum_{i,h=1}^{r}\alpha_i\alpha_h s_i s_h \langle v_i, v_h\rangle \quad \text{s.t.}\quad \alpha_i \ge 0,\ \sum_{i=1}^{r}\alpha_i s_i = 0 \qquad (2)
Introducing a kernel function K(v_i, v_h), formula (2) becomes:
\max_{\alpha}\ \sum_{i=1}^{r}\alpha_i - \tfrac{1}{2}\sum_{i,h=1}^{r}\alpha_i\alpha_h s_i s_h K(v_i, v_h) \quad \text{s.t.}\quad \alpha_i \ge 0,\ \sum_{i=1}^{r}\alpha_i s_i = 0 \qquad (3)
The kernel function is chosen to be the radial basis function, defined as:
K(v_i, v_h) = \exp\!\left(-\frac{\|v_i - v_h\|^2}{2\sigma^2}\right) \qquad (4)
where exp(·) is the exponential function and σ is a parameter.
After training, a set of α_i is determined, and with it the discriminant function of the shot semantic Sem_t:
f_t(v) = \operatorname{sgn}\!\left[\sum_{i=1}^{r}\alpha_i s_i K(v_i, v) + b_0\right] \qquad (5)
where b_0 is a parameter.
After the classifier SVM_t corresponding to every Sem_t has been trained, the shot semantic analyzer group containing e shot semantic analyzers is obtained.
Segment the m video segments video_k, for which tree indexes are to be built, into shots, then extract the visual features of every shot to form a feature vector v. Input v into the shot semantic analyzer group to decide which semantics appear in the shot, and add the detected semantics to the shot semantic set of that shot.
2. From the shot semantic set of every shot of video_j, take one shot semantic to represent that shot, and form the shot semantic sequence wu_j according to temporal order.
Taking a video segment as the unit, manually annotate the context of the shot semantic sequence wu_j of the training video segment and represent the contextual information with the corresponding context label tree LT_j. A context label tree is formally defined as the five-tuple LT = <L, Video, Scene, NL, P>, where L is the set of shot semantic labels, whose elements are the shot semantics representing the shots in wu_j; Video is the "video context" label, whose context is the content of the whole video segment expressed jointly by its child nodes; Scene is the "scene context" label, whose context is the content of one scene expressed jointly by its child nodes; NL is the set of context labels other than Video and Scene, each element representing one kind of context relation; and P is the set of context rules, each element representing one kind of context rule. For example, in Fig. 2 the leaf nodes l_1 and l_2 together with their parent node nl_1 form a rule, which can be written formally as nl_1 → l_1 l_2.
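To connect the five-tuple definition with the joint feature of step 3.4.1, the sketch below reads the context rule set P off a label tree shaped like the Fig. 2 example (nl_1 → l_1 l_2 and so on); the nested-dict layout is an assumed representation, not mandated by the method.

```python
# Sketch: deriving the context rule set P of a context label tree as "parent -> children"
# productions. The nested-dict layout mirrors the Fig. 2 example and is assumed here.
def production_rules(node):
    """Collect every 'parent -> child1 child2 ...' rule of a context label tree."""
    rules = []
    children = node.get("children", [])
    if children:
        rules.append(node["label"] + " -> " + " ".join(c["label"] for c in children))
        for child in children:
            rules.extend(production_rules(child))
    return rules

fig2_tree = {"label": "Video", "children": [
    {"label": "Scene", "children": [
        {"label": "nl1", "children": [{"label": "l1"}, {"label": "l2"}]},
        {"label": "l3"}]}]}
print(production_rules(fig2_tree))    # ['Video -> Scene', 'Scene -> nl1 l3', 'nl1 -> l1 l2']
```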
Construct the context training set from the n shot semantic sequences wu_j and their corresponding context label trees: Context = {(x_j, y_j) | j = 1, ..., n}, where x_j is a shot semantic sequence and y_j is the corresponding context label tree.
Train a structural support vector machine (SVM-Struct) on the context training set. The mapping from a shot semantic sequence to a context label tree is constructed as:
h(x; W) = argmax_{y ∈ Y} f(x, y; W),
where f(x, y; W) = <W, ψ(x, y)> is the discriminant function, W is the weight vector, and ψ(x, y) is the joint feature vector of a shot semantic sequence in the training data and its corresponding context label tree. ψ(x, y) is constructed as:
\psi(x, y) = (a_1, a_2, \ldots, a_N)^{\top}
where a_i is the number of times the context rule p_i (i ∈ [1, N]) occurs in the context rule set P of this context label tree, and N is the total number of context rule classes occurring in the context training set.
Training SVM-Struct is cast as the optimization problem:
\min_{W,\varepsilon}\ \tfrac{1}{2}\|W\|^2 + \frac{C}{n}\sum_{j=1}^{n}\varepsilon_j \quad \text{s.t.}\quad \forall j,\ \forall y \in Y:\ \langle W,\ \psi(x_j, y_j) - \psi(x_j, y)\rangle \ge \Delta(y_j, y) - \varepsilon_j \qquad (6)
where ε_j is a slack variable, C > 0 is the penalty for misclassified samples, and Δ(y_j, y) is the loss function, taken as Δ(y_j, y) = 1 − F_1(y_j, y). Here y_j is the true context label tree of a shot semantic sequence in the context training set, y is the context label tree predicted during training, and F_1 is computed as follows:
\text{Precision} = \frac{|E(y_j) \cap E(y)|}{|E(y)|}
\text{Recall} = \frac{|E(y_j) \cap E(y)|}{|E(y_j)|}
F_1 = \frac{2 \cdot \text{Precision} \cdot \text{Recall}}{\text{Precision} + \text{Recall}}
where Precision is the precision of the node predictions in the context label tree, Recall is the recall of the node predictions in the context label tree, E(y_j) is the edge set of y_j, and E(y) is the edge set of y.
Formula (6) is converted into its dual form:
\max_{\alpha}\ \sum_{j,\, y \ne y_j}\alpha_{jy} - \tfrac{1}{2}\sum_{\substack{j,\, y \ne y_j \\ z,\, \bar{y} \ne y_z}}\alpha_{jy}\,\alpha_{z\bar{y}}\,\langle \delta\psi_j(y), \delta\psi_z(\bar{y})\rangle \quad \text{s.t.}\quad \forall j,\ \forall y \in Y \setminus y_j:\ \alpha_{jy} \ge 0 \qquad (7)
where the α_{jy} are Lagrange multipliers. For the soft margin there is, in addition, the group of constraints:
\forall j:\ n\sum_{y \ne y_j}\alpha_{jy}\,\Delta(y_j, y) \le C
After the penalty value C has been set, solve formula (7) on the context training set Context; once an optimal set of α_{jy} has been found, the weight vector W is determined and the context label tree analyzer is obtained.
Extract the shot semantic sequence wu_k of video_k, input wu_k into the video context label tree analyzer, and obtain the LT_k of wu_k.
3. According to the "scene context" labels Scene in LT_j, take the shots corresponding to the leaf nodes under each Scene label as one complete video scene, thereby realizing the scene segmentation of the video. Then, scene by scene, manually annotate the scene semantics of video_j.
Use the shot semantic set of every shot in each scene and the contextual information in the corresponding LT_j to construct the scene-semantics training set, in which the features of a scene are of two kinds:
A. shot semantic features: if a shot semantic appears in the scene, the value of that shot semantic feature is 1, otherwise 0;
B. contextual features: a contextual feature is the context relation between two shot semantics; since each shot semantic corresponds to a leaf node in LT_j, the contextual feature value of two shot semantics is the context label at the nearest common ancestor node of their two leaf nodes. For example, in Fig. 2 the contextual feature of l_1 and l_2 is "nl_1", and the contextual feature of l_1 and l_3 is "Scene".
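A small sketch of this contextual feature (feature B) follows: the label at the nearest common ancestor of two leaves in the context label tree, reproducing the Fig. 2 example just given; the nested-dict tree layout is again an assumed representation.

```python
# Sketch of feature B: the contextual feature of two shot semantics is the label of their
# nearest common ancestor in the label tree (Fig. 2: LCA(l1, l2) = nl1, LCA(l1, l3) = Scene).
def lca_label(node, leaf_a, leaf_b):
    """Return the label of the nearest common ancestor of the two leaf labels, or None."""
    children = node.get("children", [])
    if not children:
        return node["label"] if node["label"] in (leaf_a, leaf_b) else None
    hits = [r for r in (lca_label(c, leaf_a, leaf_b) for c in children) if r is not None]
    if len(hits) >= 2:                     # the two leaves lie in different subtrees: LCA found
        return node["label"]
    return hits[0] if hits else None

fig2_tree = {"label": "Video", "children": [
    {"label": "Scene", "children": [
        {"label": "nl1", "children": [{"label": "l1"}, {"label": "l2"}]},
        {"label": "l3"}]}]}
print(lca_label(fig2_tree, "l1", "l2"), lca_label(fig2_tree, "l1", "l3"))   # nl1 Scene
```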
Taking the C4.5 algorithm as the classification model, select attributes as nodes according to the information gain ratio of each feature attribute in the scene-semantics training set, finally generating the decision tree that analyzes video scene semantics. Take this decision tree as the scene semantic analyzer.
According to the "scene context" labels Scene in the LT_k of wu_k, divide video_k into scenes, and, scene by scene, extract the shot semantic features and contextual features of each scene to form its feature vector. Input the feature vector of every scene of video_k into the scene semantic analyzer to obtain the scene semantic of every scene of video_k.
4. Replace the shot semantic label in every leaf node of LT_k with the shot semantic set of the shot that leaf represents, then replace every Scene label in LT_k with the corresponding scene semantic, and finally take the LT_k containing shot semantics and scene semantics as the video index of video_k.
The foregoing is only a preferred embodiment of the invention; all equivalent changes and modifications made within the scope of the claims of the present application shall fall within the scope of protection of the invention.

Claims (5)

1. A method for building a context-fused tree-structured video semantic index, characterized in that the method comprises the following steps:
Step 1: input n training video segments video_j, j ∈ {1, ..., n}; preprocess each video_j, then manually annotate the shot semantic set of every shot of video_j, shot by shot, construct a shot-semantic training set for every class of shot semantic to train classifiers, and obtain the shot semantic analyzers; input the m video segments video_k, k ∈ {1, ..., m}, for which tree indexes are to be built, preprocess each video_k, and use the shot semantic analyzers to extract the shot semantic set of every shot of video_k;
Step 2: taking a video segment as the unit, manually annotate the context among the shot semantics of video_j and represent it with a context label tree LT_j carrying context labels, and build a context training set; train a structural support vector machine SVM-Struct and obtain the context label tree analyzer; use the context analyzer to extract the context label tree LT_k of video_k;
Step 3: taking a scene of video_j as the unit, manually annotate the scene semantics and build a scene-semantics training set; train a C4.5 classifier and obtain the scene semantic analyzer; use the scene semantic analyzer to extract the scene semantic of every scene of video_k;
Step 4: embed the shot semantic set of every shot of video_k obtained in step 1 and the scene semantic of every scene of video_k obtained in step 3 into the corresponding nodes of the LT_k obtained in step 2, and take the LT_k carrying shot semantics and scene semantics as the video index of video_k.
2. The method for building a context-fused tree-structured video semantic index according to claim 1, characterized in that step 1 is carried out as follows:
Step 2.1: segment the n training video segments video_j into shots, obtaining r training shots; extract and quantize the visual features of each shot to form a visual feature vector v;
Step 2.2: define the annotation semantic set Semantic = {Sem_t | t = 1, ..., e}; manually annotate the semantics Sem_t that appear in the r shots, adding each to the shot semantic set of its shot; then construct a shot-semantic training set for every class of shot semantic Sem_t, obtaining e shot-semantic training sets Tra_t = {(v_i, s_i) | i = 1, ..., r}, where s_i = 1 if the semantic Sem_t appears in shot i and s_i = 0 otherwise;
Step 2.3: using the SVM classifier as the classification model, train one classifier SVM_t for every semantic Sem_t; the discriminant function of SVM_t has the form f_t(v) = sgn[g(v)], where g(v) = <w, v> + b; the objective of training SVM_t on the training set Tra_t is therefore:
\min_{w,b}\ \tfrac{1}{2}\|w\|^2 \quad \text{s.t.}\quad s_i(\langle w, v_i\rangle + b) - 1 \ge 0 \qquad (1)
using a Lagrangian function to merge the objective and its constraints, (1) is converted into:
\max_{\alpha}\ \sum_{i=1}^{r}\alpha_i - \tfrac{1}{2}\sum_{i,h=1}^{r}\alpha_i\alpha_h s_i s_h \langle v_i, v_h\rangle \quad \text{s.t.}\quad \alpha_i \ge 0,\ \sum_{i=1}^{r}\alpha_i s_i = 0 \qquad (2)
introducing a kernel function K(v_i, v_h), formula (2) becomes:
\max_{\alpha}\ \sum_{i=1}^{r}\alpha_i - \tfrac{1}{2}\sum_{i,h=1}^{r}\alpha_i\alpha_h s_i s_h K(v_i, v_h) \quad \text{s.t.}\quad \alpha_i \ge 0,\ \sum_{i=1}^{r}\alpha_i s_i = 0 \qquad (3)
the kernel function is chosen to be the radial basis function, defined as:
K(v_i, v_h) = \exp\!\left(-\frac{\|v_i - v_h\|^2}{2\sigma^2}\right) \qquad (4)
where exp(·) is the exponential function and σ is a parameter;
after training, a set of α_i is determined, and with it the discriminant function of the shot semantic Sem_t:
f_t(v) = \operatorname{sgn}\!\left[\sum_{i=1}^{r}\alpha_i s_i K(v_i, v) + b_0\right] \qquad (5)
where b_0 is a parameter;
Step 2.4: after the classifiers SVM_t of all Sem_t have been trained according to step 2.3, the discriminant functions of the e shot semantics are obtained; together they form the shot semantic analyzer group;
Step 2.5: segment the m video segments video_k, for which tree indexes are to be built, into shots, then extract the visual features of every shot to form a feature vector v; input v into the shot semantic analyzer group to decide which semantics appear in the shot, and add the detected semantics to the shot semantic set of that shot.
3. The method for building a context-fused tree-structured video semantic index according to claim 1, characterized in that step 2 is carried out as follows:
Step 3.1: from the shot semantic set of every shot of video_j, take one shot semantic to represent that shot, and form the shot semantic sequence wu_j according to temporal order;
Step 3.2: manually annotate the context of wu_j and represent the contextual information with a context label tree LT_j; a context label tree is a five-tuple LT = <L, Video, Scene, NL, P>, where L is the set of shot semantic labels, whose elements are the shot semantics representing the shots in wu_j; Video is the "video context" label, whose context is the content of the whole video segment expressed jointly by its child nodes; Scene is the "scene context" label, whose context is the content of one scene expressed jointly by its child nodes; NL is the set of context labels other than Video and Scene, each element representing one kind of context relation; P is the set of context rules, each element representing one context rule;
Step 3.3: construct the context training set from the n shot semantic sequences wu_j and their corresponding context label trees: Context = {(x_j, y_j) | j = 1, ..., n}, where x_j is a shot semantic sequence and y_j is the corresponding context label tree;
Step 3.4: train a structural support vector machine SVM-Struct on the context training set; the concrete operations are:
Step 3.4.1: construct the mapping from a shot semantic sequence to a context label tree as:
h(x; W) = argmax_{y ∈ Y} f(x, y; W),
where f(x, y; W) = <W, ψ(x, y)> is the discriminant function, W is the weight vector, and ψ(x, y) is the joint feature vector of a shot semantic sequence in the training data and its corresponding context label tree; ψ(x, y) is constructed as:
\psi(x, y) = (a_1, a_2, \ldots, a_N)^{\top}
where a_i is the number of times the context rule p_i (i ∈ [1, N]) occurs in the context rule set P of this context label tree, and N is the total number of context rule classes occurring in the context training set;
Step 3.4.2: training SVM-Struct is cast as the optimization problem:
\min_{W,\varepsilon}\ \tfrac{1}{2}\|W\|^2 + \frac{C}{n}\sum_{j=1}^{n}\varepsilon_j \quad \text{s.t.}\quad \forall j,\ \forall y \in Y:\ \langle W,\ \psi(x_j, y_j) - \psi(x_j, y)\rangle \ge \Delta(y_j, y) - \varepsilon_j \qquad (6)
where ε_j is a slack variable, C > 0 is the penalty for misclassified samples, and Δ(y_j, y) is the loss function, taken as Δ(y_j, y) = 1 − F_1(y_j, y); here y_j is the true context label tree of a shot semantic sequence in the context training set, y is the context label tree predicted during training, and F_1 is computed as follows:
\text{Precision} = \frac{|E(y_j) \cap E(y)|}{|E(y)|}
\text{Recall} = \frac{|E(y_j) \cap E(y)|}{|E(y_j)|}
F_1 = \frac{2 \cdot \text{Precision} \cdot \text{Recall}}{\text{Precision} + \text{Recall}}
where Precision is the precision of the node predictions in the context label tree, Recall is the recall of the node predictions in the context label tree, E(y_j) is the edge set of y_j, and E(y) is the edge set of y;
Step 3.4.3: convert formula (6) into its dual form:
\max_{\alpha}\ \sum_{j,\, y \ne y_j}\alpha_{jy} - \tfrac{1}{2}\sum_{\substack{j,\, y \ne y_j \\ z,\, \bar{y} \ne y_z}}\alpha_{jy}\,\alpha_{z\bar{y}}\,\langle \delta\psi_j(y), \delta\psi_z(\bar{y})\rangle \quad \text{s.t.}\quad \forall j,\ \forall y \in Y \setminus y_j:\ \alpha_{jy} \ge 0 \qquad (7)
where the α_{jy} are Lagrange multipliers; for the soft margin there is, in addition, the group of constraints:
\forall j:\ n\sum_{y \ne y_j}\alpha_{jy}\,\Delta(y_j, y) \le C
Step 3.4.4: solve formula (7) on the context training set Context; once an optimal set of α_{jy} has been found, the weight vector W is determined and the context label tree analyzer is obtained;
Step 3.5: extract the shot semantic sequence wu_k of video_k in the same way as in step 3.1, input wu_k into the video context label tree analyzer, and obtain the LT_k corresponding to wu_k.
4. The method for building a context-fused tree-structured video semantic index according to claim 1, characterized in that step 3 is carried out as follows:
Step 4.1: according to the "scene context" labels Scene in LT_j, take the shots corresponding to the leaf nodes under each Scene label as one complete video scene, thereby realizing the scene segmentation of the video; then, scene by scene, manually annotate the scene semantics of video_j;
Step 4.2: use the shot semantic set of every shot in each scene and the contextual information in the corresponding LT_j to construct the scene-semantics training set, in which the features of a scene are of two kinds:
A. shot semantic features: if a shot semantic appears in the scene, the value of that shot semantic feature is 1, otherwise 0;
B. contextual features: a contextual feature is the context relation between two shot semantics; since each shot semantic corresponds to a leaf node in LT_j, the contextual feature value of two shot semantics is the context label at the nearest common ancestor node of their two leaf nodes;
Step 4.3: taking the C4.5 algorithm as the classification model, select attributes as nodes according to the information gain ratio of each feature attribute in the scene-semantics training set, finally generating the decision tree that analyzes video scene semantics, and take this decision tree as the scene semantic analyzer;
Step 4.4: according to the LT_k of wu_k, divide video_k into scenes with the same method as in step 4.1, and extract the feature vector of each scene, scene by scene; input the feature vector of every scene of video_k into the scene semantic analyzer to obtain the scene semantic of every scene of video_k.
5. The method for building a context-fused tree-structured video semantic index according to claim 1, characterized in that step 4 is carried out as follows:
Step 5.1: replace the shot semantic label in every leaf node of LT_k with the shot semantic set of the shot that leaf represents;
Step 5.2: replace every Scene label in LT_k with the corresponding scene semantic;
Step 5.3: take the LT_k containing shot semantics and scene semantics as the video index of video_k.
CN201410297974.0A 2014-06-26 2014-06-26 Method for creating context fusion tree video semantic indexes Expired - Fee Related CN104036023B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201410297974.0A CN104036023B (en) 2014-06-26 2014-06-26 Method for creating context fusion tree video semantic indexes

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201410297974.0A CN104036023B (en) 2014-06-26 2014-06-26 Method for creating context fusion tree video semantic indexes

Publications (2)

Publication Number Publication Date
CN104036023A true CN104036023A (en) 2014-09-10
CN104036023B CN104036023B (en) 2017-05-10

Family

ID=51466793

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201410297974.0A Expired - Fee Related CN104036023B (en) 2014-06-26 2014-06-26 Method for creating context fusion tree video semantic indexes

Country Status (1)

Country Link
CN (1) CN104036023B (en)

Cited By (13)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN104506947A (en) * 2014-12-24 2015-04-08 福州大学 Video fast forward/fast backward speed self-adaptive regulating method based on semantic content
CN106878632A (en) * 2017-02-28 2017-06-20 北京知慧教育科技有限公司 A kind for the treatment of method and apparatus of video data
CN107590442A (en) * 2017-08-22 2018-01-16 华中科技大学 A kind of video semanteme Scene Segmentation based on convolutional neural networks
CN108027834A (en) * 2015-09-21 2018-05-11 高通股份有限公司 Semantic more sense organ insertions for the video search by text
CN109344887A (en) * 2018-09-18 2019-02-15 山东大学 Short video classification methods, system and medium based on multi-modal dictionary learning
CN109685144A (en) * 2018-12-26 2019-04-26 上海众源网络有限公司 The method, apparatus and electronic equipment that a kind of pair of Video Model does to assess
CN110097094A (en) * 2019-04-15 2019-08-06 天津大学 It is a kind of towards personage interaction multiple semantic fusion lack sample classification method
CN110275744A (en) * 2018-03-14 2019-09-24 Tcl集团股份有限公司 It is a kind of for making the method and system of scalable user interface
CN110545299A (en) * 2018-05-29 2019-12-06 腾讯科技(深圳)有限公司 content list information acquisition method, content list information providing method, content list information acquisition device, content list information providing device and content list information equipment
CN110765314A (en) * 2019-10-21 2020-02-07 长沙品先信息技术有限公司 Video semantic structural extraction and labeling method
CN111435453A (en) * 2019-01-14 2020-07-21 中国科学技术大学 Fine-grained image zero sample identification method
US20210182558A1 (en) * 2017-11-10 2021-06-17 Samsung Electronics Co., Ltd. Apparatus for generating user interest information and method therefor
CN114302224A (en) * 2021-12-23 2022-04-08 新华智云科技有限公司 Intelligent video editing method, device, equipment and storage medium

Citations (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20080252727A1 (en) * 2006-06-16 2008-10-16 Lisa Marie Brown People searches by multisensor event correlation
CN103593363A (en) * 2012-08-15 2014-02-19 中国科学院声学研究所 Video content indexing structure building method and video searching method and device

Patent Citations (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20080252727A1 (en) * 2006-06-16 2008-10-16 Lisa Marie Brown People searches by multisensor event correlation
CN103593363A (en) * 2012-08-15 2014-02-19 中国科学院声学研究所 Video content indexing structure building method and video searching method and device

Non-Patent Citations (2)

* Cited by examiner, † Cited by third party
Title
陈丹雯 (Chen Danwen) et al.: "Co-Concept-Boosting video semantic indexing method" (Co-Concept-Boosting视频语义索引方法), Journal of Chinese Computer Systems (《小型微型计算机系统》) *
韩智广 (Han Zhiguang) et al.: "A new semantic index for video retrieval" (一种新的用于视频检索的语义索引), Harmonious Human-Machine Environment 2008 (《和谐人机环境2008》) *

Cited By (21)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN104506947B (en) * 2014-12-24 2017-09-05 福州大学 A kind of video fast forward based on semantic content/rewind speeds self-adapting regulation method
CN104506947A (en) * 2014-12-24 2015-04-08 福州大学 Video fast forward/fast backward speed self-adaptive regulating method based on semantic content
CN108027834A (en) * 2015-09-21 2018-05-11 高通股份有限公司 Semantic more sense organ insertions for the video search by text
CN106878632A (en) * 2017-02-28 2017-06-20 北京知慧教育科技有限公司 A kind for the treatment of method and apparatus of video data
CN106878632B (en) * 2017-02-28 2020-07-10 北京知慧教育科技有限公司 Video data processing method and device
CN107590442A (en) * 2017-08-22 2018-01-16 华中科技大学 A kind of video semanteme Scene Segmentation based on convolutional neural networks
US20210182558A1 (en) * 2017-11-10 2021-06-17 Samsung Electronics Co., Ltd. Apparatus for generating user interest information and method therefor
US11678012B2 (en) * 2017-11-10 2023-06-13 Samsung Electronics Co., Ltd. Apparatus and method for user interest information generation
CN110275744A (en) * 2018-03-14 2019-09-24 Tcl集团股份有限公司 It is a kind of for making the method and system of scalable user interface
CN110275744B (en) * 2018-03-14 2021-11-23 Tcl科技集团股份有限公司 Method and system for making scalable user interface
CN110545299B (en) * 2018-05-29 2022-04-05 腾讯科技(深圳)有限公司 Content list information acquisition method, content list information providing method, content list information acquisition device, content list information providing device and content list information equipment
CN110545299A (en) * 2018-05-29 2019-12-06 腾讯科技(深圳)有限公司 content list information acquisition method, content list information providing method, content list information acquisition device, content list information providing device and content list information equipment
CN109344887A (en) * 2018-09-18 2019-02-15 山东大学 Short video classification methods, system and medium based on multi-modal dictionary learning
CN109685144A (en) * 2018-12-26 2019-04-26 上海众源网络有限公司 The method, apparatus and electronic equipment that a kind of pair of Video Model does to assess
CN111435453A (en) * 2019-01-14 2020-07-21 中国科学技术大学 Fine-grained image zero sample identification method
CN111435453B (en) * 2019-01-14 2022-07-22 中国科学技术大学 Fine-grained image zero sample identification method
CN110097094A (en) * 2019-04-15 2019-08-06 天津大学 It is a kind of towards personage interaction multiple semantic fusion lack sample classification method
CN110097094B (en) * 2019-04-15 2023-06-13 天津大学 Multiple semantic fusion few-sample classification method for character interaction
CN110765314A (en) * 2019-10-21 2020-02-07 长沙品先信息技术有限公司 Video semantic structural extraction and labeling method
CN114302224A (en) * 2021-12-23 2022-04-08 新华智云科技有限公司 Intelligent video editing method, device, equipment and storage medium
CN114302224B (en) * 2021-12-23 2023-04-07 新华智云科技有限公司 Intelligent video editing method, device, equipment and storage medium

Also Published As

Publication number Publication date
CN104036023B (en) 2017-05-10

Similar Documents

Publication Publication Date Title
CN104036023A (en) Method for creating context fusion tree video semantic indexes
Chang et al. Semantic pooling for complex event analysis in untrimmed videos
Duan et al. Exploiting web images for event recognition in consumer videos: A multiple source domain adaptation approach
Habibian et al. Videostory: A new multimedia embedding for few-example recognition and translation of events
US20170357878A1 (en) Multi-dimensional realization of visual content of an image collection
Garcia et al. Context-aware embeddings for automatic art analysis
US20180293313A1 (en) Video content retrieval system
CN102799684B (en) The index of a kind of video and audio file cataloguing, metadata store index and searching method
Zhou et al. Conceptlearner: Discovering visual concepts from weakly labeled image collections
Dal Bianco et al. A practical and effective sampling selection strategy for large scale deduplication
CN102890700A (en) Method for retrieving similar video clips based on sports competition videos
CN103425757A (en) Cross-medial personage news searching method and system capable of fusing multi-mode information
Zhang et al. Enhancing video event recognition using automatically constructed semantic-visual knowledge base
CN104391924A (en) Mixed audio and video search method and system
CN105678244B (en) A kind of near video search method based on improved edit-distance
CN107515934A (en) A kind of film semanteme personalized labels optimization method based on big data
CN106649663A (en) Video copy detection method based on compact video representation
CN107291895A (en) A kind of quick stratification document searching method
CN103617263A (en) Television advertisement film automatic detection method based on multi-mode characteristics
CN103761286B (en) A kind of Service Source search method based on user interest
CN103778206A (en) Method for providing network service resources
CN108241713A (en) A kind of inverted index search method based on polynary cutting
CN107818183A (en) A kind of Party building video pushing method based on three stage combination recommended technologies
CN110968721A (en) Method and system for searching infringement of mass images and computer readable storage medium thereof
CN104657376A (en) Searching method and searching device for video programs based on program relationship

Legal Events

Date Code Title Description
C06 Publication
PB01 Publication
C10 Entry into substantive examination
SE01 Entry into force of request for substantive examination
GR01 Patent grant
GR01 Patent grant
CF01 Termination of patent right due to non-payment of annual fee
CF01 Termination of patent right due to non-payment of annual fee

Granted publication date: 20170510

Termination date: 20200626