CN104036023B - Method for creating context fusion tree video semantic indexes - Google Patents

Method for creating context fusion tree video semantic indexes

Info

Publication number
CN104036023B
CN104036023B
Authority
CN
China
Prior art keywords
shot
video
semantic
scene
context
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Expired - Fee Related
Application number
CN201410297974.0A
Other languages
Chinese (zh)
Other versions
CN104036023A (en)
Inventor
余春艳
苏晨涵
翁子林
陈昭炯
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Fuzhou University
Original Assignee
Fuzhou University
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Fuzhou University filed Critical Fuzhou University
Priority to CN201410297974.0A priority Critical patent/CN104036023B/en
Publication of CN104036023A publication Critical patent/CN104036023A/en
Application granted granted Critical
Publication of CN104036023B publication Critical patent/CN104036023B/en
Expired - Fee Related
Anticipated expiration


Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/70 Information retrieval; Database structures therefor; File system structures therefor of video data
    • G06F16/71 Indexing; Data structures therefor; Storage structures

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Software Systems (AREA)
  • Multimedia (AREA)
  • Data Mining & Analysis (AREA)
  • Databases & Information Systems (AREA)
  • Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)
  • Studio Devices (AREA)

Abstract

The invention belongs to the field of video retrieval technology and discloses a method for creating tree-structured video semantic indexes. A video semantic index built by the method contains video semantics of several granularities, fuses the context among those semantics, and connects the semantics of different granularities according to that context, so that the index forms a tree structure. The method comprises the steps of: extracting the shot semantic set of each shot one by one; learning the context among the video's shot semantics under supervision and representing it with a context label tree; combining the shot semantic sets with the context information to infer scene semantics; and embedding the shot semantic sets and the scene semantics into the context label tree to obtain the video index. After a semantic index has been created for a video with this method, users can retrieve the video with keywords of different granularities, and the context information in the index narrows the search space, which improves the efficiency of the retrieval system.

Description

A method for creating context-fused, tree-structured video semantic indexes
Technical field
The invention belongs to the field of video retrieval technology. It is a method that builds a video semantic index from the shot semantics and scene semantics of a video and the context between those semantics.
Background art
Video data has become one of the most important kinds of data on the Internet. With the explosive growth of video data, however, managing and retrieving video efficiently has become a very difficult problem. Typically a user enters a keyword when retrieving video, and the video search engine must then find the relevant video data from that keyword. This requires building a suitable semantic index over the video before the efficiency and hit rate of user retrieval can be improved. Building a video index based on video semantics means having a computer automatically analyze the visual features of a video to obtain the semantic information the video contains, and then using that semantic information as the video's index; when retrieving, the user can search for the video by entering keywords.
However, users' demands on video search engines keep rising, and users often enter keywords of different granularities depending on their needs. A user searching for football-related video, for example, may enter keywords of varying granularity such as "football", "highlights", "shot on goal", or "referee close-up". A traditional single-granularity, flat video semantic index therefore cannot satisfy users' retrieval needs. In addition, the semantic content of video is rich: besides the semantic information itself there is also a large amount of contextual information. That contextual information can help a search engine understand the interaction between semantics of different granularities and establish relationships among a video's semantics at different granularities, so that related videos can be found through those relationships at retrieval time. Contextual information can narrow the search space and improve retrieval efficiency while preserving the search hit rate. On this basis, the present invention realizes a video semantic index capable of fusing context, so as to improve the effectiveness of video indexing.
Content of the invention
The purpose of the present invention is to realize a method for building a tree-structured video semantic index that fuses contextual information. The method incorporates contextual information into the video semantic index, improving the hit rate and efficiency of video retrieval.
The present invention is realized by the following scheme: a method for creating a context-fused, tree-structured video semantic index, characterized in that the method comprises the following steps:
Step 1: Input n training video segments video_j, j ∈ {1, …, n}, and preprocess each video_j. Then manually label, shot by shot, the shot semantic set of every shot of video_j, and construct a shot semantic training set for every class of shot semantic to train a classifier, obtaining the shot semantic analyzer. Input the m video segments video_k, k ∈ {1, …, m}, for which tree indexes are to be built, preprocess each video_k, and use the shot semantic analyzer to extract the shot semantic set of every shot of video_k;
Step 2: Taking video segments as the unit, manually label the context among the shot semantics of video_j, represent it with a context label tree LT_j, and build the context training set. Train a structural support vector machine (SVM-Struct) to obtain the context label tree analyzer, and use it to extract the context label tree LT_k of video_k;
Step 3: Taking scenes of video_j as the unit, manually label scene semantics and build the scene semantic training set. Train a C4.5 classifier to obtain the scene semantic analyzer, and use it to extract the scene semantics of every scene of video_k;
Step 4: Embed the shot semantic sets of video_k's shots obtained in step 1 and the scene semantics of video_k's scenes obtained in step 3 into the corresponding nodes of the LT_k obtained in step 2, and take the LT_k carrying the shot semantics and scene semantics as the video index of video_k.
Further, step 1 is carried out as follows:
Step 2.1: Perform shot segmentation on the n training video segments video_j, obtaining r training video shots; extract and quantify the visual features of each shot, constructing its visual feature vector v;
Step 2.2: Fix the set of semantics to label, Semantic = {Sem_t | t = 1, …, e}. Manually label the semantics Sem_t that appear in the r shots, adding them to each shot's shot semantic set, and then construct a shot semantic training set for every class of shot semantic Sem_t, obtaining e shot semantic training sets Tra_t = {(v_i, s_i) | i = 1, …, r}, where s_i = 1 if the semantic Sem_t appears in the shot and s_i = 0 otherwise;
Step 2.3: Using an SVM classifier as the classification model, train one classifier SVM_t for each semantic Sem_t. The discriminant function of SVM_t has the form f_t(v) = sgn[g(v)], where g(v) = ⟨w, v⟩ + b. Training SVM_t on the training set Tra_t then has the optimization objective:

$$\min\ \frac{1}{2}\|w\|^2 \quad \text{s.t.}\quad s_i\big(\langle w, v_i\rangle + b\big) - 1 \ \ge\ 0 \qquad (1)$$

Using Lagrange multipliers to merge the optimization problem with its constraints, formula (1) becomes:

$$\max_{\alpha}\ \sum_{i=1}^{r}\alpha_i - \frac{1}{2}\sum_{i,h=1}^{r}\alpha_i\alpha_h s_i s_h\, v_i\cdot v_h \quad \text{s.t.}\quad \alpha_i \ge 0,\ \sum_{i=1}^{r}\alpha_i s_i = 0 \qquad (2)$$

Introducing a kernel function K(v_i, v_h), formula (2) is converted to:

$$\max_{\alpha}\ \sum_{i=1}^{r}\alpha_i - \frac{1}{2}\sum_{i,h=1}^{r}\alpha_i\alpha_h s_i s_h\, K(v_i, v_h) \quad \text{s.t.}\quad \alpha_i \ge 0,\ \sum_{i=1}^{r}\alpha_i s_i = 0 \qquad (3)$$

The kernel function is chosen to be the radial basis function (RBF), defined as:

$$K(v_i, v_h) = \exp\!\left(-\frac{(v_i - v_h)^2}{2\sigma^2}\right) \qquad (4)$$

where exp(·) is the exponential function and σ is a parameter.

Once training has determined a group of α_i, the discriminant function of the shot semantic Sem_t is also determined:

$$f_t(v) = \operatorname{sgn}\!\left[\sum_{i=1}^{r}\alpha_i s_i K(v_i, v) + b_0\right] \qquad (5)$$

where b_0 is a parameter.
Step 2.4: After the classifiers SVM_t of all semantics Sem_t have been trained as in step 2.3, e shot semantic discriminant functions are obtained; together they form the shot semantic analyzer group.
Step 2.5: Perform shot segmentation on the m video segments video_k for which tree indexes are to be built, then extract the visual features of each shot to form its feature vector v. Input v into the shot semantic analyzer group to decide which semantics appear in the shot, and add the semantics that appear to that shot's shot semantic set.
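For concreteness, the per-semantic training of steps 2.1–2.5 can be sketched as follows. This is a minimal illustration, assuming scikit-learn; the variable names (`features`, `labels_per_semantic`) and the use of scikit-learn's SVC are choices of this sketch, not of the patent, and feature extraction is taken as given.

```python
# Minimal sketch of steps 2.1-2.5: one binary RBF-kernel SVM per shot
# semantic. `features` is assumed to be an (r, d) array of quantified
# visual feature vectors, one row per training shot.
import numpy as np
from sklearn.svm import SVC

def train_shot_analyzers(features, labels_per_semantic):
    """labels_per_semantic maps each semantic Sem_t to an array s of
    length r with s[i] = 1 if Sem_t appears in shot i, else 0."""
    analyzers = {}
    for sem, s in labels_per_semantic.items():
        clf = SVC(kernel="rbf", gamma="scale")  # RBF kernel as in Eq. (4)
        clf.fit(features, s)                    # solves the dual, Eq. (3)
        analyzers[sem] = clf
    return analyzers

def shot_semantic_set(analyzers, v):
    """Return the shot semantic set of one shot with feature vector v."""
    v = np.asarray(v).reshape(1, -1)
    return {sem for sem, clf in analyzers.items() if clf.predict(v)[0] == 1}
```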
Further, step 2 is carried out as follows:
Step 3.1: From the shot semantic set of each shot of video_j, pick one shot semantic to represent the shot, then compose the shot semantic sequence wu_j in temporal order;
Step 3.2: Manually label the context of wu_j and represent the contextual information with a context label tree LT_j. A context label tree is a five-tuple LT = ⟨L, Video, Scene, NL, P⟩, where L is the set of shot semantic labels, whose elements are the shot semantics representing the shots in wu_j; Video is the "video context" label, whose context is that its child nodes jointly express the content of the video segment; Scene is the "scene context" label, whose context is that its child nodes jointly express the content of one scene; NL is the set of context labels other than Video and Scene, each element of which represents one kind of context relation; and P is the set of context rules, each element of which represents one context rule;
Step 3.3: Pair the n sequences wu_j with their corresponding context label trees to form the context training set Context = {(x_j, y_j) | j = 1, …, n}, where x_j is a shot semantic sequence and y_j is the corresponding context label tree;
Step 3.4: Train the structural support vector machine SVM-Struct on the context training set. The concrete operations are:
Step 3.4.1: Construct the mapping function from shot semantic sequences to context label trees:

$$F(x; W) = \arg\max_{y \in Y} f(x, y; W)$$

where f(x, y; W) = ⟨W, ψ(x, y)⟩ is the discriminant function, Y is the set of all context label trees that can be constructed over x, W is the weight vector, and ψ(x, y) is the joint feature vector of a shot semantic sequence in the training data and its corresponding context label tree. ψ(x, y) is constructed as

$$\psi(x, y) = (a_1, \ldots, a_N)^{\mathsf T}$$

where p_i and a_i, i ∈ [1, N], are respectively a rule in the context rule set P of the context label tree and the number of times that rule occurs, and N is the total number of context rule classes occurring in the context training set;
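A hedged sketch of this joint feature map: since ψ depends on y only through its rule counts, it can be computed by one walk over the tree. The nested `(label, children)` pair representation and the `rule_index` mapping are conveniences of this illustration, not structures prescribed by the patent.

```python
# Sketch of psi from step 3.4.1: entry i counts how often context rule
# p_i (e.g. "nl1 -> l1 l2") occurs in the context label tree y.
import numpy as np

def rules_of(node):
    """Yield every 'parent -> child1 child2 ...' rule in the tree."""
    label, children = node
    if children:
        yield label + " -> " + " ".join(c[0] for c in children)
        for c in children:
            yield from rules_of(c)

def psi(y, rule_index):
    """rule_index maps each rule string p_i to its position in [0, N)."""
    a = np.zeros(len(rule_index))      # a_i: occurrence count of rule p_i
    for rule in rules_of(y):
        if rule in rule_index:
            a[rule_index[rule]] += 1
    return a
```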
Step 3.4.2: Convert the training of SVM-Struct into the optimization problem:

$$\min\ \frac{1}{2}\|W\|^2 + \frac{C}{n}\sum_{j=1}^{n}\varepsilon_j \quad \text{s.t.}\quad \forall y \in Y\setminus y_j:\ \big\langle W,\ \psi(x_j, y_j) - \psi(x_j, y)\big\rangle \ \ge\ \Delta(y_j, y) - \varepsilon_j \qquad (6)$$

where ε_j is a slack variable, C > 0 is the penalty for misclassified samples, and Δ(y_j, y) is the loss function. Let the loss function be Δ(y_j, y) = 1 − F_1(y_j, y), where y_j is the true context label tree of a shot semantic sequence in the context training set and y is a context label tree predicted during training. F_1 is computed as follows:

$$\text{Precision} = \frac{|E(y_j)\cap E(y)|}{|E(y)|},\qquad \text{Recall} = \frac{|E(y_j)\cap E(y)|}{|E(y_j)|},\qquad F_1 = \frac{2\cdot\text{Precision}\cdot\text{Recall}}{\text{Precision}+\text{Recall}}$$

where Precision is the accuracy of each node prediction in the context label tree, Recall is the recall of each node prediction in the context label tree, E(y_j) is the edge set of y_j, and E(y) is the edge set of y;
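The loss Δ(y_j, y) = 1 − F_1(y_j, y) is directly computable from the two edge sets. A small sketch, with edge sets assumed to be Python sets of (parent, child) pairs:

```python
# Loss of step 3.4.2: Delta(y_j, y) = 1 - F1 over the trees' edge sets.
def tree_loss(edges_true, edges_pred):
    common = len(edges_true & edges_pred)
    if common == 0:
        return 1.0                          # no overlap: F1 = 0, maximal loss
    precision = common / len(edges_pred)    # |E(y_j) ∩ E(y)| / |E(y)|
    recall = common / len(edges_true)       # |E(y_j) ∩ E(y)| / |E(y_j)|
    f1 = 2 * precision * recall / (precision + recall)
    return 1.0 - f1
```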
Step 3.4.3: Convert formula (6) into its dual form:

$$\max_{\alpha}\ \sum_{j,\,y\ne y_j}\alpha_{jy}\ -\ \frac{1}{2}\sum_{j,\,y\ne y_j}\ \sum_{z,\,\bar y\ne y_z}\alpha_{jy}\,\alpha_{z\bar y}\,\big\langle \psi(x_j,y_j)-\psi(x_j,y),\ \psi(x_z,y_z)-\psi(x_z,\bar y)\big\rangle \quad \text{s.t.}\quad \forall j,\ \forall y\in Y\setminus y_j:\ \alpha_{jy}\ge 0 \qquad (7)$$

where α_jy and α_{z ȳ} are Lagrange multipliers. For the soft margin there is additionally a group of constraints:

$$\forall j:\quad n\sum_{y\ne y_j}\alpha_{jy}\,\Delta(y_j, y)\ \le\ C$$

Step 3.4.4: Compute formula (7) on the context training set Context; once the optimal group of α_jy has been found, the weight vector W is determined, giving the context label tree analyzer;
Step 3.5: Extract the shot semantic sequence wu_k of video_k in the same way as in step 3.1, and input wu_k into the video context label tree analyzer to obtain the LT_k of wu_k.
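Once W is learned, applying the analyzer in step 3.5 amounts to the argmax of step 3.4.1. A sketch, reusing `psi` from the sketch above and assuming some candidate generator `candidates(x)` as a stand-in for SVM-Struct's search over the (in general very large) set Y:

```python
# Prediction F(x; W) = argmax over candidate trees y of <W, psi(x, y)>.
import numpy as np

def predict_tree(x, W, candidates, rule_index):
    """candidates(x) yields candidate context label trees for sequence x."""
    return max(candidates(x),
               key=lambda y: float(np.dot(W, psi(y, rule_index))))
```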
Further, step 3 is carried out as follows:
Step 4.1: According to the "scene context" labels Scene in LT_j, take the shots corresponding to the leaf nodes under each Scene label as one complete video scene, realizing the scene segmentation of the video; then, taking scenes as the unit, manually label the scene semantics of video_j's scenes;
Step 4.2: Construct the scene semantic training set from the shot semantic sets of the shots in each scene together with the contextual information in the corresponding LT_j. The features of a scene semantic are of two kinds:
a. Shot semantic features: if a given shot semantic appears in the scene, the value of that shot semantic feature is 1, and otherwise 0;
b. Context features: a context feature is the context relation between two shot semantics. A shot semantic corresponds to one leaf node in LT_j, so the context feature value of two shot semantics is the context label at the nearest common ancestor node of the two leaf nodes;
Step 4.3: Using the C4.5 algorithm as the classification model, select attributes as nodes according to the information gain ratio of each feature attribute in the scene semantic training set, ultimately generating a decision tree that analyzes the semantics of video scenes; take this decision tree as the scene semantic analyzer;
Step 4.4: According to the LT_k of wu_k, divide video_k into scenes by the same method as in step 4.1, and extract the feature vector of each scene; input the feature vector of each of video_k's scenes into the scene semantic analyzer to obtain the scene semantics of each scene of video_k.
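A sketch of steps 4.2–4.4 follows. One substitution is worth naming plainly: scikit-learn provides CART with an entropy criterion rather than a true C4.5 with information gain ratio, so the classifier below only approximates the step above; the feature-encoding names are illustrative assumptions.

```python
# Scene semantic analysis sketch: one feature vector per scene built from
# (a) shot semantic indicators and (b) context labels at the nearest
# common ancestor of shot-semantic pairs, then a decision tree fit.
from sklearn.tree import DecisionTreeClassifier

def scene_features(shot_semantics, lca_labels, all_semantics, all_pairs):
    """shot_semantics: set of shot semantics occurring in the scene.
    lca_labels: maps a pair of shot semantics to an integer code of the
    context label at their nearest common ancestor in LT_j (0 = absent)."""
    sem = [1 if s in shot_semantics else 0 for s in all_semantics]   # (a)
    ctx = [lca_labels.get(p, 0) for p in all_pairs]                  # (b)
    return sem + ctx

scene_analyzer = DecisionTreeClassifier(criterion="entropy")  # C4.5 stand-in
# scene_analyzer.fit(X_train, y_train)   # X: scene feature vectors,
#                                        # y: manually labeled scene semantics
```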
Further, step 4 is carried out as follows:
Step 5.1: Replace the shot semantic label in each leaf node of LT_k with the shot semantic set of the shot it represents;
Step 5.2: Replace each Scene in LT_k with the corresponding scene semantic;
Step 5.3: Take the LT_k containing the shot semantics and scene semantics as the video index of video_k.
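The assembly of steps 5.1–5.3 is a single tree walk. A minimal sketch, with trees as the `(label, children)` pairs used above and the analyzers' outputs supplied left to right (both representations are assumptions of this illustration):

```python
# Steps 5.1-5.3: replace each leaf label with the shot's semantic set and
# each Scene label with the inferred scene semantic, yielding the index.
def build_index(node, shot_sets, scene_sems):
    """shot_sets / scene_sems: iterators over the analyzers' outputs, in
    left-to-right (temporal) order of leaves and Scene nodes."""
    label, children = node
    if not children:                         # leaf node = one shot
        return (next(shot_sets), [])
    new_label = next(scene_sems) if label == "Scene" else label
    return (new_label,
            [build_index(c, shot_sets, scene_sems) for c in children])

# index = build_index(LT_k, iter(per_shot_semantic_sets), iter(per_scene_semantics))
```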
The beneficial effects of the invention are: after a semantic index has been built for a video with the method of the present invention, users can enter keywords of different granularities to retrieve the video, and the contextual information in the index narrows the search space, improving the efficiency of the retrieval system.
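As a purely hypothetical illustration of how such an index narrows retrieval (the patent does not specify a search procedure), a keyword of any granularity can be matched against scene-level semantics at internal nodes or against shot semantic sets at leaves, pruning non-matching subtrees:

```python
# Hypothetical retrieval sketch over the finished index: a scene-level
# match returns the whole subtree at once, so its interior need not be
# searched further; otherwise the search descends toward the leaves.
def search(node, keyword):
    label, children = node
    if not children:
        return [node] if keyword in label else []   # leaf label is a set
    if label == keyword:
        return [node]             # whole scene matches; prune its subtree
    hits = []
    for c in children:
        hits.extend(search(c, keyword))
    return hits
```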
Description of the drawings
Fig. 1 is the flow of building the tree-structured video semantic index.
Fig. 2 is an example of a video context label tree.
Fig. 3 is an example of a tree-structured video index.
Specific embodiment
Referring to Fig. 1, the method for building a context-fused, tree-structured video index first extracts semantic information shot by shot, then obtains the context of the video's shot semantics under supervision and represents that context with a tree structure. It then combines the shot semantics with their context to infer scene semantics. Finally it embeds the shot semantics and scene semantics into the tree structure and takes the result as the video's index. The details are as follows:
1. Perform shot segmentation on the n training video segments video_j, obtaining r training video shots. Extract and quantify the visual features of each shot, constructing its visual feature vector v.
Fix the set of semantics to label, Semantic = {Sem_t | t = 1, …, e}. Manually label the semantics Sem_t that appear in the r shots, adding them to each shot's shot semantic set, and then construct a shot semantic training set for every class of shot semantic Sem_t, obtaining e shot semantic training sets Tra_t = {(v_i, s_i) | i = 1, …, r}, where s_i = 1 if the semantic Sem_t appears in the shot and s_i = 0 otherwise.
Using an SVM classifier as the classification model, train one classifier SVM_t for each semantic Sem_t. The discriminant function of SVM_t has the form f_t(v) = sgn[g(v)], where g(v) = ⟨w, v⟩ + b. Training SVM_t on the training set Tra_t then has the optimization objective:

$$\min\ \frac{1}{2}\|w\|^2 \quad \text{s.t.}\quad s_i\big(\langle w, v_i\rangle + b\big) - 1 \ \ge\ 0 \qquad (1)$$

Using Lagrange multipliers to merge the optimization problem with its constraints, formula (1) becomes:

$$\max_{\alpha}\ \sum_{i=1}^{r}\alpha_i - \frac{1}{2}\sum_{i,h=1}^{r}\alpha_i\alpha_h s_i s_h\, v_i\cdot v_h \quad \text{s.t.}\quad \alpha_i \ge 0,\ \sum_{i=1}^{r}\alpha_i s_i = 0 \qquad (2)$$

Introducing a kernel function K(v_i, v_h), formula (2) is converted to:

$$\max_{\alpha}\ \sum_{i=1}^{r}\alpha_i - \frac{1}{2}\sum_{i,h=1}^{r}\alpha_i\alpha_h s_i s_h\, K(v_i, v_h) \quad \text{s.t.}\quad \alpha_i \ge 0,\ \sum_{i=1}^{r}\alpha_i s_i = 0 \qquad (3)$$

The kernel function is chosen to be the radial basis function (RBF), defined as:

$$K(v_i, v_h) = \exp\!\left(-\frac{(v_i - v_h)^2}{2\sigma^2}\right) \qquad (4)$$

where exp(·) is the exponential function and σ is a parameter.
Once training has determined a group of α_i, the discriminant function of the shot semantic Sem_t is also determined:

$$f_t(v) = \operatorname{sgn}\!\left[\sum_{i=1}^{r}\alpha_i s_i K(v_i, v) + b_0\right] \qquad (5)$$

where b_0 is a parameter.
After the classifiers SVM_t corresponding to the semantics Sem_t have been trained, the shot semantic analyzer group containing e shot semantic analyzers is obtained.
Perform shot segmentation on the m video segments video_k for which tree indexes are to be built, then extract the visual features of each shot to form its feature vector v. Input v into the shot semantic analyzer group to decide which semantics appear in the shot, and add the semantics that appear to that shot's shot semantic set.
2. From the shot semantic set of each shot of video_j, pick one shot semantic to represent the shot, then compose the shot semantic sequence wu_j in temporal order.
Taking video segments as the unit, manually label the context of the semantic sequences wu_j of the training video segments, and represent the contextual information with the corresponding context label trees LT_j. A context label tree is formally defined as the five-tuple LT = ⟨L, Video, Scene, NL, P⟩. L is the set of shot semantic labels, whose elements are the shot semantics representing the shots in wu_j. Video is the "video context" label; the context it represents is that its child nodes jointly express the content of the video segment. Scene is the "scene context" label; the context it represents is that its child nodes jointly express the content of one scene. NL is the set of context labels other than Video and Scene, each element of which represents one kind of context relation. P is the set of context rules, each element of which represents one kind of context rule. For example, the leaf nodes l1 and l2 of Fig. 2 form a rule with their parent node nl1, which can be written formally as: nl1 → l1 l2.
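To make the five-tuple and the Fig. 2 rule concrete, here is one possible plain-Python rendering; the dataclass and the nested `(label, children)` tree are choices of this sketch, not structures prescribed by the patent, and Video and Scene appear as the fixed labels of the tree itself:

```python
# Illustration of LT = <L, Video, Scene, NL, P> for a tree shaped like
# Fig. 2: Video at the root, one Scene, nl1 over leaves l1 and l2, and a
# third leaf l3. Rules in P are written "parent -> children".
from dataclasses import dataclass

@dataclass
class ContextLabelTree:
    L: set        # shot semantic labels (the leaves)
    NL: set       # context labels other than Video and Scene
    P: list       # context rules
    root: tuple   # nested (label, children) pairs, rooted at "Video"

lt = ContextLabelTree(
    L={"l1", "l2", "l3"},
    NL={"nl1"},
    P=["nl1 -> l1 l2", "Scene -> nl1 l3", "Video -> Scene"],
    root=("Video", [("Scene", [("nl1", [("l1", []), ("l2", [])]),
                               ("l3", [])])]),
)
```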
Pair the n sequences wu_j with their corresponding context label trees to form the context training set Context = {(x_j, y_j) | j = 1, …, n}, where x_j is a shot semantic sequence and y_j is the corresponding context label tree.
Train the structural support vector machine SVM-Struct on the context training set. The mapping function constructed from shot semantic sequences to context label trees is:

$$F(x; W) = \arg\max_{y \in Y} f(x, y; W)$$

where f(x, y; W) = ⟨W, ψ(x, y)⟩ is the discriminant function, Y is the set of all context label trees that can be constructed over x, W is the weight vector, and ψ(x, y) is the joint feature vector of a shot semantic sequence in the training data and its corresponding context label tree. ψ(x, y) is constructed as

$$\psi(x, y) = (a_1, \ldots, a_N)^{\mathsf T}$$

where p_i and a_i, i ∈ [1, N], are respectively a context rule in the context rule set P of the context label tree and the number of times that rule occurs, and N is the total number of context rule classes occurring in the context training set.
Convert the training of SVM-Struct into the optimization problem:

$$\min\ \frac{1}{2}\|W\|^2 + \frac{C}{n}\sum_{j=1}^{n}\varepsilon_j \quad \text{s.t.}\quad \forall y \in Y\setminus y_j:\ \big\langle W,\ \psi(x_j, y_j) - \psi(x_j, y)\big\rangle \ \ge\ \Delta(y_j, y) - \varepsilon_j \qquad (6)$$

where ε_j is a slack variable, C > 0 is the penalty for misclassified samples, and Δ(y_j, y) is the loss function. Let the loss function be Δ(y_j, y) = 1 − F_1(y_j, y), where y_j is the true context label tree of a shot semantic sequence in the context training set and y is a context label tree predicted during training. F_1 is computed as follows:

$$\text{Precision} = \frac{|E(y_j)\cap E(y)|}{|E(y)|},\qquad \text{Recall} = \frac{|E(y_j)\cap E(y)|}{|E(y_j)|},\qquad F_1 = \frac{2\cdot\text{Precision}\cdot\text{Recall}}{\text{Precision}+\text{Recall}}$$

where Precision is the accuracy of each node prediction in the context label tree, Recall is the recall of each node prediction in the context label tree, E(y_j) is the edge set of y_j, and E(y) is the edge set of y.
Convert formula (6) into its dual form:

$$\max_{\alpha}\ \sum_{j,\,y\ne y_j}\alpha_{jy}\ -\ \frac{1}{2}\sum_{j,\,y\ne y_j}\ \sum_{z,\,\bar y\ne y_z}\alpha_{jy}\,\alpha_{z\bar y}\,\big\langle \psi(x_j,y_j)-\psi(x_j,y),\ \psi(x_z,y_z)-\psi(x_z,\bar y)\big\rangle \quad \text{s.t.}\quad \forall j,\ \forall y\in Y\setminus y_j:\ \alpha_{jy}\ge 0 \qquad (7)$$

where α_jy and α_{z ȳ} are Lagrange multipliers. For the soft margin there is additionally a group of constraints:

$$\forall j:\quad n\sum_{y\ne y_j}\alpha_{jy}\,\Delta(y_j, y)\ \le\ C$$

After the penalty C has been set, compute formula (7) on the context training set Context; finding the optimal group of α_jy also determines the weight vector W, giving the context label tree analyzer.
Extract the shot semantic sequence wu_k of video_k, and input wu_k into the video context label tree analyzer to obtain the LT_k of wu_k.
3. According to the "scene context" labels Scene in LT_j, take the shots corresponding to the leaf nodes under each Scene label as one complete video scene, realizing the scene segmentation of the video. Then, taking scenes as the unit, manually label the scene semantics of video_j's scenes.
Construct the scene semantic training set from the shot semantic sets of the shots in each scene together with the contextual information in the corresponding LT_j. The features of a scene semantic are of two kinds:
a. Shot semantic features: if a given shot semantic appears in the scene, the value of that shot semantic feature is 1, and otherwise 0;
b. Context features: a context feature is the context relation between two shot semantics. A shot semantic corresponds to one leaf node in LT_j, so the context feature value of two shot semantics is the context label at the nearest common ancestor node of the two leaf nodes. For example, in Fig. 2 the context feature of l1 and l2 is "nl1", and the context feature of l1 and l3 is "Scene".
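The context feature of item b is mechanical to compute from root-to-leaf paths. A sketch over the `(label, children)` trees used above, which reproduces the Fig. 2 example in its final comment:

```python
# Context feature of two shot semantics: the label at the nearest common
# ancestor of their leaf nodes in LT_j.
def path_to(node, leaf_label, path=()):
    label, children = node
    path = path + (node,)
    if not children:
        return path if label == leaf_label else None
    for c in children:
        p = path_to(c, leaf_label, path)
        if p is not None:
            return p
    return None

def context_feature(root, a, b):
    pa, pb = path_to(root, a), path_to(root, b)
    lca = None
    for na, nb in zip(pa, pb):
        if na is nb:
            lca = na                 # deepest node shared by both paths
    return lca[0]   # Fig. 2: ("l1","l2") -> "nl1"; ("l1","l3") -> "Scene"
```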
Using the C4.5 algorithm as the classification model, select attributes as nodes according to the information gain ratio of each feature attribute in the scene semantic training set, ultimately generating a decision tree that analyzes the semantics of video scenes. Take this decision tree as the scene semantic analyzer.
According to the "scene context" labels Scene in the LT_k of wu_k, divide video_k into scenes, and, taking scenes as the unit, extract each scene's shot semantic features and context features to form its feature vector. Input the feature vector of each of video_k's scenes into the scene semantic analyzer to obtain the scene semantics of each scene of video_k.
4. Replace the shot semantic label in each leaf node of LT_k with the shot semantic set of the shot it represents, then replace each Scene in LT_k with the corresponding scene semantic, and finally take the LT_k containing the shot semantics and scene semantics as the video index of video_k.
The foregoing are merely preferred embodiments of the present invention; all equal changes and modifications made within the scope of the patent of the present invention shall fall within the scope covered by the present invention.

Claims (2)

1. A method for creating a context-fused, tree-structured video semantic index, characterized in that the method comprises the following steps:
Step 1: Input n training video segments video_j, j ∈ {1, …, n}, and preprocess each video_j; then manually label, shot by shot, the shot semantic set of every shot of video_j, and construct a shot semantic training set for every class of shot semantic to train a classifier, obtaining the shot semantic analyzer; input the m video segments video_k, k ∈ {1, …, m}, for which tree indexes are to be built, preprocess each video_k, and use the shot semantic analyzer to extract the shot semantic set of every shot of video_k;
Step 2: Taking video segments as the unit, manually label the context among the shot semantics of video_j, represent it with a context label tree LT_j, and build the context training set; train a structural support vector machine (SVM-Struct) to obtain the context label tree analyzer; use the context label tree analyzer to extract the context label tree LT_k of video_k;
Step 3: Taking scenes of video_j as the unit, manually label scene semantics and build the scene semantic training set; train a C4.5 classifier to obtain the scene semantic analyzer; use the scene semantic analyzer to extract the scene semantics of every scene of video_k;
Step 4: Embed the shot semantic sets of video_k's shots obtained in step 1 and the scene semantics of video_k's scenes obtained in step 3 into the corresponding nodes of the LT_k obtained in step 2, and take the LT_k carrying the shot semantics and scene semantics as the video index of video_k;
wherein step 2 is carried out as follows:
Step 3.1: From the shot semantic set of each shot of video_j, pick one shot semantic to represent the shot, then compose the shot semantic sequence wu_j in temporal order;
Step 3.2: Manually label the context of wu_j and represent the contextual information with a context label tree LT_j; a context label tree is a five-tuple LT_j = ⟨L, Video, Scene, NL, P⟩, where L is the set of shot semantic labels, whose elements are the shot semantics representing the shots in wu_j; Video is the "video context" label, whose context is that its child nodes jointly express the content of the video segment; Scene is the "scene context" label, whose context is that its child nodes jointly express the content of one scene; NL is the set of context labels other than Video and Scene, each element of which represents one kind of context relation; and P is the set of context rules, each element of which represents one context rule;
Step 3.3: Pair the n sequences wu_j with their corresponding context label trees to form the context training set Context = {(x_j, y_j) | j = 1, …, n}, where x_j is a shot semantic sequence in the context training set and y_j is the context label tree in the context training set corresponding to x_j;
Step 3.4: Train the structural support vector machine SVM-Struct on the context training set; the concrete operations are:
Step 3.4.1: Construct the mapping function from shot semantic sequences to context label trees:

$$F(x_j; W) = \arg\max_{y \in Y} f(x_j, y; W)$$

where f(x_j, y_j; W) = ⟨W, ψ(x_j, y_j)⟩ is the discriminant function, Y is the set of all context label trees that can be constructed over x_j, W is the weight vector, and ψ(x_j, y_j) is the joint feature vector of a shot semantic sequence in the training data and its corresponding context label tree; ψ(x_j, y_j) is constructed as

$$\psi(x_j, y_j) = (a_1, \ldots, a_N)^{\mathsf T}$$

where p_i and a_i, i ∈ [1, N], are respectively a rule in the context rule set P of the context label tree and the number of times that rule occurs, and N is the total number of context rule classes occurring in the context training set;
Step 3.4.2: Convert the training of SVM-Struct into the optimization problem:

$$\min\ \frac{1}{2}\|W\|^2 + \frac{C}{n}\sum_{j=1}^{n}\varepsilon_j \quad \text{s.t.}\quad \forall y \in Y\setminus y_j:\ \big\langle W,\ \psi(x_j, y_j) - \psi(x_j, y)\big\rangle \ \ge\ \Delta(y_j, y) - \varepsilon_j \qquad (6)$$

where ε_j is a slack variable, C > 0 is the penalty for misclassified samples, and Δ(y_j, y) is the loss function; let the loss function be Δ(y_j, y) = 1 − F_1(y_j, y), where y_j is the true context label tree of a shot semantic sequence in the context training set and y is a context label tree predicted during training; F_1 is computed as follows:

$$\text{Precision} = \frac{|E(y_j)\cap E(y)|}{|E(y)|},\qquad \text{Recall} = \frac{|E(y_j)\cap E(y)|}{|E(y_j)|},\qquad F_1 = \frac{2\cdot\text{Precision}\cdot\text{Recall}}{\text{Precision}+\text{Recall}}$$

where Precision is the accuracy of each node prediction in the context label tree, Recall is the recall of each node prediction in the context label tree, E(y_j) is the edge set of y_j, and E(y) is the edge set of y;
Step 3.4.3: Convert formula (6) into its dual form:

$$\max_{\alpha}\ \sum_{j,\,y\ne y_j}\alpha_{jy}\ -\ \frac{1}{2}\sum_{j,\,y\ne y_j}\ \sum_{z,\,\bar y\ne y_z}\alpha_{jy}\,\alpha_{z\bar y}\,\big\langle \psi(x_j,y_j)-\psi(x_j,y),\ \psi(x_z,y_z)-\psi(x_z,\bar y)\big\rangle \quad \text{s.t.}\quad \forall j,\ \forall y\in Y\setminus y_j:\ \alpha_{jy}\ge 0 \qquad (7)$$

where α_jy and α_{z ȳ} are Lagrange multipliers; for the soft margin there is additionally a group of constraints:

$$\forall j:\quad n\sum_{y\ne y_j}\alpha_{jy}\,\Delta(y_j, y)\ \le\ C$$
Step 3.4.4: Compute formula (7) on the context training set Context; once the optimal group of α_jy has been found, the weight vector W is determined, giving the context label tree analyzer;
Step 3.5: Extract the shot semantic sequence wu_k of video_k in the same way as in step 3.1, and input wu_k into the video context label tree analyzer to obtain the LT_k corresponding to wu_k;
wherein step 3 is carried out as follows:
Step 4.1: According to the "scene context" labels Scene in LT_j, take the shots corresponding to the leaf nodes under each Scene label as one complete video scene, realizing the scene segmentation of the video; then, taking scenes as the unit, manually label the scene semantics of video_j's scenes;
Step 4.2: Construct the scene semantic training set from the shot semantic sets of the shots in each scene together with the contextual information in the corresponding LT_j; the features of a scene semantic are of two kinds:
a. shot semantic features: if a given shot semantic appears in the scene, the value of that shot semantic feature is 1, and otherwise 0;
b. context features: a context feature is the context relation between two shot semantics; a shot semantic corresponds to one leaf node in LT_j, so the context feature value of two shot semantics is the context label at the nearest common ancestor node of the two leaf nodes;
Step 4.3: Using the C4.5 algorithm as the classification model, select attributes as nodes according to the information gain ratio of each feature attribute in the scene semantic training set, ultimately generating a decision tree that analyzes the semantics of video scenes, and take this decision tree as the scene semantic analyzer;
Step 4.4: According to the LT_k of wu_k, divide video_k into scenes by the same method as in step 4.1, and extract the feature vector of each scene; input the feature vector of each of video_k's scenes into the scene semantic analyzer to obtain the scene semantics of each scene of video_k;
wherein step 4 is carried out as follows:
Step 5.1: Replace the shot semantic label in each leaf node of LT_k with the shot semantic set of the shot it represents;
Step 5.2: Replace each Scene in LT_k with the corresponding scene semantic;
Step 5.3: Take the LT_k containing the shot semantics and scene semantics as the video index of video_k.
2. The method for creating a context-fused, tree-structured video semantic index according to claim 1, characterized in that step 1 is carried out as follows:
Step 2.1: Perform shot segmentation on the n training video segments video_j, obtaining r training video shots shot_1, shot_2, …, shot_r; extract and quantify the visual features of each shot shot_i, constructing its visual feature vector v_i;
Step 2.2: Fix the set of semantics to label, Semantic = {Sem_t | t = 1, …, e}; manually label the semantics Sem_t that appear in the r shots, adding them to each shot's shot semantic set, and then construct a shot semantic training set Tra_t = {(v_i, s_i) | i = 1, …, r} for every class of shot semantic Sem_t, where s_i = 1 if the semantic Sem_t appears in shot shot_i and s_i = 0 otherwise; this finally gives the e shot semantic training sets Tra_1, Tra_2, …, Tra_e;
Step 2.3: Using an SVM classifier as the classification model, train one classifier SVM_t for each semantic Sem_t; the discriminant function of SVM_t has the form f_t(v_i) = sgn[g(v_i)], where g(v_i) = ⟨w, v_i⟩ + b, w and b are the parameters to be optimized, and v_i is the visual feature vector of video shot shot_i;
Training SVM_t on the training set Tra_t has the optimization objective:

$$\min\ \frac{1}{2}\|w\|^2 \quad \text{s.t.}\quad s_i\big(\langle w, v_i\rangle + b\big) - 1 \ \ge\ 0 \qquad (1)$$

Using Lagrange multipliers to merge the optimization problem with its constraints, formula (1) becomes:

$$\max_{\alpha}\ \sum_{i=1}^{r}\alpha_i - \frac{1}{2}\sum_{i,h=1}^{r}\alpha_i\alpha_h s_i s_h\, v_i\cdot v_h \quad \text{s.t.}\quad \alpha_i \ge 0,\ \sum_{i=1}^{r}\alpha_i s_i = 0 \qquad (2)$$

where α = {α_1, α_2, …, α_r} are the Lagrange multipliers, h and i are indices, and v_i and v_h are the visual feature vectors corresponding to shots shot_i and shot_h;
Introducing a kernel function K(v_i, v_h), formula (2) is converted to:

$$\max_{\alpha}\ \sum_{i=1}^{r}\alpha_i - \frac{1}{2}\sum_{i,h=1}^{r}\alpha_i\alpha_h s_i s_h\, K(v_i, v_h) \quad \text{s.t.}\quad \alpha_i \ge 0,\ \sum_{i=1}^{r}\alpha_i s_i = 0 \qquad (3)$$

The kernel function is chosen to be the radial basis function (RBF), defined as:

$$K(v_i, v_h) = \exp\!\left(-\frac{(v_i - v_h)^2}{2\sigma^2}\right) \qquad (4)$$

where exp(·) is the exponential function and σ is a parameter;
Once training has determined a group of α_i, the discriminant function of the shot semantic Sem_t is also determined:

$$f_t(v) = \operatorname{sgn}\!\left[\sum_{i=1}^{r}\alpha_i s_i K(v_i, v) + b_0\right] \qquad (5)$$

where b_0 is a parameter;
Step 2.4: After the classifiers SVM_t of all semantics Sem_t have been trained as in step 2.3, e shot semantic discriminant functions are obtained; together they form the shot semantic analyzer group;
Step 2.5: Perform shot segmentation on the m video segments video_k for which tree indexes are to be built, then extract the visual features of each shot to form its feature vector v; input v into the shot semantic analyzer group to decide which semantics appear in the shot, and add the semantics that appear to that shot's shot semantic set.
CN201410297974.0A 2014-06-26 2014-06-26 Method for creating context fusion tree video semantic indexes Expired - Fee Related CN104036023B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201410297974.0A CN104036023B (en) 2014-06-26 2014-06-26 Method for creating context fusion tree video semantic indexes

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201410297974.0A CN104036023B (en) 2014-06-26 2014-06-26 Method for creating context fusion tree video semantic indexes

Publications (2)

Publication Number Publication Date
CN104036023A CN104036023A (en) 2014-09-10
CN104036023B true CN104036023B (en) 2017-05-10

Family

ID=51466793

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201410297974.0A Expired - Fee Related CN104036023B (en) 2014-06-26 2014-06-26 Method for creating context fusion tree video semantic indexes

Country Status (1)

Country Link
CN (1) CN104036023B (en)

Families Citing this family (13)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN104506947B (en) * 2014-12-24 2017-09-05 福州大学 A kind of video fast forward based on semantic content/rewind speeds self-adapting regulation method
US20170083623A1 (en) * 2015-09-21 2017-03-23 Qualcomm Incorporated Semantic multisensory embeddings for video search by text
CN106878632B (en) * 2017-02-28 2020-07-10 北京知慧教育科技有限公司 Video data processing method and device
CN107590442A (en) * 2017-08-22 2018-01-16 华中科技大学 A kind of video semanteme Scene Segmentation based on convolutional neural networks
KR102387767B1 (en) * 2017-11-10 2022-04-19 삼성전자주식회사 Apparatus and method for user interest information generation
US10860649B2 (en) * 2018-03-14 2020-12-08 TCL Research America Inc. Zoomable user interface for TV
CN110545299B (en) * 2018-05-29 2022-04-05 腾讯科技(深圳)有限公司 Content list information acquisition method, content list information providing method, content list information acquisition device, content list information providing device and content list information equipment
CN109344887B (en) * 2018-09-18 2020-07-07 山东大学 Short video classification method, system and medium based on multi-mode dictionary learning
CN109685144B (en) * 2018-12-26 2021-02-12 上海众源网络有限公司 Method and device for evaluating video model and electronic equipment
CN111435453B (en) * 2019-01-14 2022-07-22 中国科学技术大学 Fine-grained image zero sample identification method
CN110097094B (en) * 2019-04-15 2023-06-13 天津大学 Multiple semantic fusion few-sample classification method for character interaction
CN110765314A (en) * 2019-10-21 2020-02-07 长沙品先信息技术有限公司 Video semantic structural extraction and labeling method
CN114302224B (en) * 2021-12-23 2023-04-07 新华智云科技有限公司 Intelligent video editing method, device, equipment and storage medium

Citations (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN103593363A (en) * 2012-08-15 2014-02-19 中国科学院声学研究所 Video content indexing structure building method and video searching method and device

Family Cites Families (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US10078693B2 (en) * 2006-06-16 2018-09-18 International Business Machines Corporation People searches by multisensor event correlation

Patent Citations (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN103593363A (en) * 2012-08-15 2014-02-19 中国科学院声学研究所 Video content indexing structure building method and video searching method and device

Non-Patent Citations (2)

* Cited by examiner, † Cited by third party
Title
"Co-Concept-Boosting视频语义索引方法";陈丹雯等;《小型微型计算机系统》;20120731;第33卷(第7期);1603-1607 *
"一种新的用于视频检索的语义索引";韩智广等;《和谐人机环境2008》;20081231;454-459 *

Also Published As

Publication number Publication date
CN104036023A (en) 2014-09-10

Similar Documents

Publication Publication Date Title
CN104036023B (en) Method for creating context fusion tree video semantic indexes
Qu et al. Dynamic modality interaction modeling for image-text retrieval
Gorti et al. X-pool: Cross-modal language-video attention for text-video retrieval
Liang et al. Jointly learning aspect-focused and inter-aspect relations with graph convolutional networks for aspect sentiment analysis
Chen et al. Hierarchical visual-textual graph for temporal activity localization via language
Perez-Martin et al. Improving video captioning with temporal composition of a visual-syntactic embedding
CN106951438A (en) A kind of event extraction system and method towards open field
CN105760507A (en) Cross-modal subject correlation modeling method based on deep learning
CN103970733B (en) A kind of Chinese new word identification method based on graph structure
CN103778227A (en) Method for screening useful images from retrieved images
Zablocki et al. Context-aware zero-shot learning for object recognition
Hii et al. Multigap: Multi-pooled inception network with text augmentation for aesthetic prediction of photographs
CN109948668A (en) A kind of multi-model fusion method
CN105849720A (en) Visual semantic complex network and method for forming network
CN105824862A (en) Image classification method based on electronic equipment and electronic equipment
CN106203296B (en) The video actions recognition methods of one attribute auxiliary
CN104376108B (en) A kind of destructuring natural language information abstracting method based on the semantic marks of 6W
CN104537028B (en) A kind of Web information processing method and device
CN106649663A (en) Video copy detection method based on compact video representation
CN103761286B (en) A kind of Service Source search method based on user interest
Hinami et al. Discriminative learning of open-vocabulary object retrieval and localization by negative phrase augmentation
Jung et al. Devil's on the edges: Selective quad attention for scene graph generation
CN113076483A (en) Case element heteromorphic graph-based public opinion news extraction type summarization method
CN109376964A (en) A kind of criminal case charge prediction technique based on Memory Neural Networks
Wang et al. Topic scene graph generation by attention distillation from caption

Legal Events

Date Code Title Description
C06 Publication
PB01 Publication
C10 Entry into substantive examination
SE01 Entry into force of request for substantive examination
GR01 Patent grant
GR01 Patent grant
CF01 Termination of patent right due to non-payment of annual fee

Granted publication date: 20170510

Termination date: 20200626

CF01 Termination of patent right due to non-payment of annual fee