CN104036023B - Method for creating context fusion tree video semantic indexes - Google Patents

Method for creating context fusion tree video semantic indexes

Info

Publication number
CN104036023B
CN104036023B
Authority
CN
China
Prior art keywords
shot
video
semantic
scene
context
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Expired - Fee Related
Application number
CN201410297974.0A
Other languages
Chinese (zh)
Other versions
CN104036023A (en)
Inventor
余春艳
苏晨涵
翁子林
陈昭炯
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Fuzhou University
Original Assignee
Fuzhou University
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Fuzhou University filed Critical Fuzhou University
Priority to CN201410297974.0A priority Critical patent/CN104036023B/en
Publication of CN104036023A publication Critical patent/CN104036023A/en
Application granted granted Critical
Publication of CN104036023B publication Critical patent/CN104036023B/en
Expired - Fee Related
Anticipated expiration


Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/70 Information retrieval; Database structures therefor; File system structures therefor of video data
    • G06F16/71 Indexing; Data structures therefor; Storage structures

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Software Systems (AREA)
  • Multimedia (AREA)
  • Data Mining & Analysis (AREA)
  • Databases & Information Systems (AREA)
  • Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)
  • Studio Devices (AREA)

Abstract

The invention belongs to the field of video retrieval technology and discloses a method for creating tree-structured video semantic indexes. A video semantic index built by the method contains video semantics of several granularities, fuses the context among those semantics, and connects the semantics of different granularities according to that context, so that the index forms a tree structure. The method comprises the steps of: extracting the shot semantic set of each shot one by one; learning the context among the video's shot semantics under supervision and representing it with a context label tree; combining the shot semantic sets with the context information to infer scene semantics; and embedding the shot semantic sets and the scene semantics into the context label tree to obtain the video index. After a semantic index has been created for a video with this method, users can retrieve the video with keywords of different granularities, and the context information in the index narrows the search space, which improves the efficiency of the retrieval system.

Description

A method for creating context-fused, tree-structured video semantic indexes
Technical field
The invention belongs to the field of video retrieval technology. It is a method that builds a video semantic index from the shot semantics and scene semantics of a video and the context between those semantics.
Background art
Video data has become one of the most important kinds of data on the Internet. With the explosive growth of video data, however, managing and retrieving video efficiently has become a very difficult problem. Typically a user enters a keyword when retrieving video, and the video search engine must then find the relevant video data from that keyword. This requires building a suitable semantic index over the video before the efficiency and hit rate of user retrieval can be improved. Building a video index based on video semantics means having a computer automatically analyze the visual features of a video to obtain the semantic information the video contains, and then using that semantic information as the video's index; when retrieving, the user can search for the video by entering keywords.
However, users' demands on video search engines keep rising, and users often enter keywords of different granularities depending on their needs. A user searching for football-related video, for example, may enter keywords of varying granularity such as "football", "highlights", "shot on goal", or "referee close-up". A traditional single-granularity, flat video semantic index therefore cannot satisfy users' retrieval needs. In addition, the semantic content of video is rich: besides the semantic information itself there is also a large amount of contextual information. That contextual information can help a search engine understand the interaction between semantics of different granularities and establish relationships among a video's semantics at different granularities, so that related videos can be found through those relationships at retrieval time. Contextual information can narrow the search space and improve retrieval efficiency while preserving the search hit rate. On this basis, the present invention realizes a video semantic index capable of fusing context, so as to improve the effectiveness of video indexing.
Content of the invention
The purpose of the present invention is to realize a method for building a tree-structured video semantic index that fuses contextual information. The method incorporates contextual information into the video semantic index, improving the hit rate and efficiency of video retrieval.
The present invention is realized by the following scheme: a method for creating a context-fused, tree-structured video semantic index, characterized in that the method comprises the following steps:
Step 1: Input n training video segments video_j, j ∈ {1, …, n}, and preprocess each video_j. Then manually label, shot by shot, the shot semantic set of every shot of video_j, and construct a shot semantic training set for every class of shot semantic to train a classifier, obtaining the shot semantic analyzer. Input the m video segments video_k, k ∈ {1, …, m}, for which tree indexes are to be built, preprocess each video_k, and use the shot semantic analyzer to extract the shot semantic set of every shot of video_k;
Step 2: Taking video segments as the unit, manually label the context among the shot semantics of video_j, represent it with a context label tree LT_j, and build the context training set. Train a structural support vector machine (SVM-Struct) to obtain the context label tree analyzer, and use it to extract the context label tree LT_k of video_k;
Step 3: Taking scenes of video_j as the unit, manually label scene semantics and build the scene semantic training set. Train a C4.5 classifier to obtain the scene semantic analyzer, and use it to extract the scene semantics of every scene of video_k;
Step 4: Embed the shot semantic sets of video_k's shots obtained in step 1 and the scene semantics of video_k's scenes obtained in step 3 into the corresponding nodes of the LT_k obtained in step 2, and take the LT_k carrying the shot semantics and scene semantics as the video index of video_k.
Further, step 1 is carried out as follows:
Step 2.1: Perform shot segmentation on the n training video segments video_j, obtaining r training video shots; extract and quantify the visual features of each shot, constructing its visual feature vector v;
Step 2.2: Fix the set of semantics to label, Semantic = {Sem_t | t = 1, …, e}. Manually label the semantics Sem_t that appear in the r shots, adding them to each shot's shot semantic set, and then construct a shot semantic training set for every class of shot semantic Sem_t, obtaining e shot semantic training sets Tra_t = {(v_i, s_i) | i = 1, …, r}, where s_i = 1 if the semantic Sem_t appears in the shot and s_i = 0 otherwise;
Step 2.3: Using an SVM classifier as the classification model, train one classifier SVM_t for each semantic Sem_t. The discriminant function of SVM_t has the form f_t(v) = sgn[g(v)], where g(v) = ⟨w, v⟩ + b. Training SVM_t on the training set Tra_t then has the optimization objective:

$$\min\ \frac{1}{2}\|w\|^2 \quad \text{s.t.}\quad s_i\big(\langle w, v_i\rangle + b\big) - 1 \ \ge\ 0 \qquad (1)$$

Using Lagrange multipliers to merge the optimization problem with its constraints, formula (1) becomes:

$$\max_{\alpha}\ \sum_{i=1}^{r}\alpha_i - \frac{1}{2}\sum_{i,h=1}^{r}\alpha_i\alpha_h s_i s_h\, v_i\cdot v_h \quad \text{s.t.}\quad \alpha_i \ge 0,\ \sum_{i=1}^{r}\alpha_i s_i = 0 \qquad (2)$$

Introducing a kernel function K(v_i, v_h), formula (2) is converted to:

$$\max_{\alpha}\ \sum_{i=1}^{r}\alpha_i - \frac{1}{2}\sum_{i,h=1}^{r}\alpha_i\alpha_h s_i s_h\, K(v_i, v_h) \quad \text{s.t.}\quad \alpha_i \ge 0,\ \sum_{i=1}^{r}\alpha_i s_i = 0 \qquad (3)$$

The kernel function is chosen to be the radial basis function (RBF), defined as:

$$K(v_i, v_h) = \exp\!\left(-\frac{(v_i - v_h)^2}{2\sigma^2}\right) \qquad (4)$$

where exp(·) is the exponential function and σ is a parameter.

Once training has determined a group of α_i, the discriminant function of the shot semantic Sem_t is also determined:

$$f_t(v) = \operatorname{sgn}\!\left[\sum_{i=1}^{r}\alpha_i s_i K(v_i, v) + b_0\right] \qquad (5)$$

where b_0 is a parameter.
Step 2.4: After the classifiers SVM_t of all semantics Sem_t have been trained as in step 2.3, e shot semantic discriminant functions are obtained; together they form the shot semantic analyzer group.
Step 2.5: Perform shot segmentation on the m video segments video_k for which tree indexes are to be built, then extract the visual features of each shot to form its feature vector v. Input v into the shot semantic analyzer group to decide which semantics appear in the shot, and add the semantics that appear to that shot's shot semantic set.
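For concreteness, the per-semantic training of steps 2.1–2.5 can be sketched as follows. This is a minimal illustration, assuming scikit-learn; the variable names (`features`, `labels_per_semantic`) and the use of scikit-learn's SVC are choices of this sketch, not of the patent, and feature extraction is taken as given.

```python
# Minimal sketch of steps 2.1-2.5: one binary RBF-kernel SVM per shot
# semantic. `features` is assumed to be an (r, d) array of quantified
# visual feature vectors, one row per training shot.
import numpy as np
from sklearn.svm import SVC

def train_shot_analyzers(features, labels_per_semantic):
    """labels_per_semantic maps each semantic Sem_t to an array s of
    length r with s[i] = 1 if Sem_t appears in shot i, else 0."""
    analyzers = {}
    for sem, s in labels_per_semantic.items():
        clf = SVC(kernel="rbf", gamma="scale")  # RBF kernel as in Eq. (4)
        clf.fit(features, s)                    # solves the dual, Eq. (3)
        analyzers[sem] = clf
    return analyzers

def shot_semantic_set(analyzers, v):
    """Return the shot semantic set of one shot with feature vector v."""
    v = np.asarray(v).reshape(1, -1)
    return {sem for sem, clf in analyzers.items() if clf.predict(v)[0] == 1}
```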
Further, step 2 is carried out as follows:
Step 3.1: From the shot semantic set of each shot of video_j, pick one shot semantic to represent the shot, then compose the shot semantic sequence wu_j in temporal order;
Step 3.2: Manually label the context of wu_j and represent the contextual information with a context label tree LT_j. A context label tree is a five-tuple LT = ⟨L, Video, Scene, NL, P⟩, where L is the set of shot semantic labels, whose elements are the shot semantics representing the shots in wu_j; Video is the "video context" label, whose context is that its child nodes jointly express the content of the video segment; Scene is the "scene context" label, whose context is that its child nodes jointly express the content of one scene; NL is the set of context labels other than Video and Scene, each element of which represents one kind of context relation; and P is the set of context rules, each element of which represents one context rule;
Step 3.3: Pair the n sequences wu_j with their corresponding context label trees to form the context training set Context = {(x_j, y_j) | j = 1, …, n}, where x_j is a shot semantic sequence and y_j is the corresponding context label tree;
Step 3.4: Train the structural support vector machine SVM-Struct on the context training set. The concrete operations are:
Step 3.4.1: Construct the mapping function from shot semantic sequences to context label trees:

$$F(x; W) = \arg\max_{y \in Y} f(x, y; W)$$

where f(x, y; W) = ⟨W, ψ(x, y)⟩ is the discriminant function, Y is the set of all context label trees that can be constructed over x, W is the weight vector, and ψ(x, y) is the joint feature vector of a shot semantic sequence in the training data and its corresponding context label tree. ψ(x, y) is constructed as

$$\psi(x, y) = (a_1, \ldots, a_N)^{\mathsf T}$$

where p_i and a_i, i ∈ [1, N], are respectively a rule in the context rule set P of the context label tree and the number of times that rule occurs, and N is the total number of context rule classes occurring in the context training set;
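A hedged sketch of this joint feature map: since ψ depends on y only through its rule counts, it can be computed by one walk over the tree. The nested `(label, children)` pair representation and the `rule_index` mapping are conveniences of this illustration, not structures prescribed by the patent.

```python
# Sketch of psi from step 3.4.1: entry i counts how often context rule
# p_i (e.g. "nl1 -> l1 l2") occurs in the context label tree y.
import numpy as np

def rules_of(node):
    """Yield every 'parent -> child1 child2 ...' rule in the tree."""
    label, children = node
    if children:
        yield label + " -> " + " ".join(c[0] for c in children)
        for c in children:
            yield from rules_of(c)

def psi(y, rule_index):
    """rule_index maps each rule string p_i to its position in [0, N)."""
    a = np.zeros(len(rule_index))      # a_i: occurrence count of rule p_i
    for rule in rules_of(y):
        if rule in rule_index:
            a[rule_index[rule]] += 1
    return a
```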
Step 3.4.2: Convert the training of SVM-Struct into the optimization problem:

$$\min\ \frac{1}{2}\|W\|^2 + \frac{C}{n}\sum_{j=1}^{n}\varepsilon_j \quad \text{s.t.}\quad \forall y \in Y\setminus y_j:\ \big\langle W,\ \psi(x_j, y_j) - \psi(x_j, y)\big\rangle \ \ge\ \Delta(y_j, y) - \varepsilon_j \qquad (6)$$

where ε_j is a slack variable, C > 0 is the penalty for misclassified samples, and Δ(y_j, y) is the loss function. Let the loss function be Δ(y_j, y) = 1 − F_1(y_j, y), where y_j is the true context label tree of a shot semantic sequence in the context training set and y is a context label tree predicted during training. F_1 is computed as follows:

$$\text{Precision} = \frac{|E(y_j)\cap E(y)|}{|E(y)|},\qquad \text{Recall} = \frac{|E(y_j)\cap E(y)|}{|E(y_j)|},\qquad F_1 = \frac{2\cdot\text{Precision}\cdot\text{Recall}}{\text{Precision}+\text{Recall}}$$

where Precision is the accuracy of each node prediction in the context label tree, Recall is the recall of each node prediction in the context label tree, E(y_j) is the edge set of y_j, and E(y) is the edge set of y;
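The loss Δ(y_j, y) = 1 − F_1(y_j, y) is directly computable from the two edge sets. A small sketch, with edge sets assumed to be Python sets of (parent, child) pairs:

```python
# Loss of step 3.4.2: Delta(y_j, y) = 1 - F1 over the trees' edge sets.
def tree_loss(edges_true, edges_pred):
    common = len(edges_true & edges_pred)
    if common == 0:
        return 1.0                          # no overlap: F1 = 0, maximal loss
    precision = common / len(edges_pred)    # |E(y_j) ∩ E(y)| / |E(y)|
    recall = common / len(edges_true)       # |E(y_j) ∩ E(y)| / |E(y_j)|
    f1 = 2 * precision * recall / (precision + recall)
    return 1.0 - f1
```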
Step 3.4.3: Convert formula (6) into its dual form:

$$\max_{\alpha}\ \sum_{j,\,y\ne y_j}\alpha_{jy}\ -\ \frac{1}{2}\sum_{j,\,y\ne y_j}\ \sum_{z,\,\bar y\ne y_z}\alpha_{jy}\,\alpha_{z\bar y}\,\big\langle \psi(x_j,y_j)-\psi(x_j,y),\ \psi(x_z,y_z)-\psi(x_z,\bar y)\big\rangle \quad \text{s.t.}\quad \forall j,\ \forall y\in Y\setminus y_j:\ \alpha_{jy}\ge 0 \qquad (7)$$

where α_jy and α_{z ȳ} are Lagrange multipliers. For the soft margin there is additionally a group of constraints:

$$\forall j:\quad n\sum_{y\ne y_j}\alpha_{jy}\,\Delta(y_j, y)\ \le\ C$$

Step 3.4.4: Compute formula (7) on the context training set Context; once the optimal group of α_jy has been found, the weight vector W is determined, giving the context label tree analyzer;
Step 3.5: Extract the shot semantic sequence wu_k of video_k in the same way as in step 3.1, and input wu_k into the video context label tree analyzer to obtain the LT_k of wu_k.
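Once W is learned, applying the analyzer in step 3.5 amounts to the argmax of step 3.4.1. A sketch, reusing `psi` from the sketch above and assuming some candidate generator `candidates(x)` as a stand-in for SVM-Struct's search over the (in general very large) set Y:

```python
# Prediction F(x; W) = argmax over candidate trees y of <W, psi(x, y)>.
import numpy as np

def predict_tree(x, W, candidates, rule_index):
    """candidates(x) yields candidate context label trees for sequence x."""
    return max(candidates(x),
               key=lambda y: float(np.dot(W, psi(y, rule_index))))
```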
Further, step 3 is carried out as follows:
Step 4.1: According to the "scene context" labels Scene in LT_j, take the shots corresponding to the leaf nodes under each Scene label as one complete video scene, realizing the scene segmentation of the video; then, taking scenes as the unit, manually label the scene semantics of video_j's scenes;
Step 4.2: Construct the scene semantic training set from the shot semantic sets of the shots in each scene together with the contextual information in the corresponding LT_j. The features of a scene semantic are of two kinds:
a. Shot semantic features: if a given shot semantic appears in the scene, the value of that shot semantic feature is 1, and otherwise 0;
b. Context features: a context feature is the context relation between two shot semantics. A shot semantic corresponds to one leaf node in LT_j, so the context feature value of two shot semantics is the context label at the nearest common ancestor node of the two leaf nodes;
Step 4.3: Using the C4.5 algorithm as the classification model, select attributes as nodes according to the information gain ratio of each feature attribute in the scene semantic training set, ultimately generating a decision tree that analyzes the semantics of video scenes; take this decision tree as the scene semantic analyzer;
Step 4.4: According to the LT_k of wu_k, divide video_k into scenes by the same method as in step 4.1, and extract the feature vector of each scene; input the feature vector of each of video_k's scenes into the scene semantic analyzer to obtain the scene semantics of each scene of video_k.
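A sketch of steps 4.2–4.4 follows. One substitution is worth naming plainly: scikit-learn provides CART with an entropy criterion rather than a true C4.5 with information gain ratio, so the classifier below only approximates the step above; the feature-encoding names are illustrative assumptions.

```python
# Scene semantic analysis sketch: one feature vector per scene built from
# (a) shot semantic indicators and (b) context labels at the nearest
# common ancestor of shot-semantic pairs, then a decision tree fit.
from sklearn.tree import DecisionTreeClassifier

def scene_features(shot_semantics, lca_labels, all_semantics, all_pairs):
    """shot_semantics: set of shot semantics occurring in the scene.
    lca_labels: maps a pair of shot semantics to an integer code of the
    context label at their nearest common ancestor in LT_j (0 = absent)."""
    sem = [1 if s in shot_semantics else 0 for s in all_semantics]   # (a)
    ctx = [lca_labels.get(p, 0) for p in all_pairs]                  # (b)
    return sem + ctx

scene_analyzer = DecisionTreeClassifier(criterion="entropy")  # C4.5 stand-in
# scene_analyzer.fit(X_train, y_train)   # X: scene feature vectors,
#                                        # y: manually labeled scene semantics
```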
Further, step 4 is carried out as follows:
Step 5.1: Replace the shot semantic label in each leaf node of LT_k with the shot semantic set of the shot it represents;
Step 5.2: Replace each Scene in LT_k with the corresponding scene semantic;
Step 5.3: Take the LT_k containing the shot semantics and scene semantics as the video index of video_k.
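The assembly of steps 5.1–5.3 is a single tree walk. A minimal sketch, with trees as the `(label, children)` pairs used above and the analyzers' outputs supplied left to right (both representations are assumptions of this illustration):

```python
# Steps 5.1-5.3: replace each leaf label with the shot's semantic set and
# each Scene label with the inferred scene semantic, yielding the index.
def build_index(node, shot_sets, scene_sems):
    """shot_sets / scene_sems: iterators over the analyzers' outputs, in
    left-to-right (temporal) order of leaves and Scene nodes."""
    label, children = node
    if not children:                         # leaf node = one shot
        return (next(shot_sets), [])
    new_label = next(scene_sems) if label == "Scene" else label
    return (new_label,
            [build_index(c, shot_sets, scene_sems) for c in children])

# index = build_index(LT_k, iter(per_shot_semantic_sets), iter(per_scene_semantics))
```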
The beneficial effects of the invention are: after a semantic index has been built for a video with the method of the present invention, users can enter keywords of different granularities to retrieve the video, and the contextual information in the index narrows the search space, improving the efficiency of the retrieval system.
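As a purely hypothetical illustration of how such an index narrows retrieval (the patent does not specify a search procedure), a keyword of any granularity can be matched against scene-level semantics at internal nodes or against shot semantic sets at leaves, pruning non-matching subtrees:

```python
# Hypothetical retrieval sketch over the finished index: a scene-level
# match returns the whole subtree at once, so its interior need not be
# searched further; otherwise the search descends toward the leaves.
def search(node, keyword):
    label, children = node
    if not children:
        return [node] if keyword in label else []   # leaf label is a set
    if label == keyword:
        return [node]             # whole scene matches; prune its subtree
    hits = []
    for c in children:
        hits.extend(search(c, keyword))
    return hits
```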
Description of the drawings
Fig. 1 is the flow of building the tree-structured video semantic index.
Fig. 2 is an example of a video context label tree.
Fig. 3 is an example of a tree-structured video index.
Specific embodiment
Referring to Fig. 1, the method for building a context-fused, tree-structured video index first extracts semantic information shot by shot, then obtains the context of the video's shot semantics under supervision and represents that context with a tree structure. It then combines the shot semantics with their context to infer scene semantics. Finally it embeds the shot semantics and scene semantics into the tree structure and takes the result as the video's index. The details are as follows:
1. Perform shot segmentation on the n training video segments video_j, obtaining r training video shots. Extract and quantify the visual features of each shot, constructing its visual feature vector v.
Fix the set of semantics to label, Semantic = {Sem_t | t = 1, …, e}. Manually label the semantics Sem_t that appear in the r shots, adding them to each shot's shot semantic set, and then construct a shot semantic training set for every class of shot semantic Sem_t, obtaining e shot semantic training sets Tra_t = {(v_i, s_i) | i = 1, …, r}, where s_i = 1 if the semantic Sem_t appears in the shot and s_i = 0 otherwise.
Using an SVM classifier as the classification model, train one classifier SVM_t for each semantic Sem_t. The discriminant function of SVM_t has the form f_t(v) = sgn[g(v)], where g(v) = ⟨w, v⟩ + b. Training SVM_t on the training set Tra_t then has the optimization objective:

$$\min\ \frac{1}{2}\|w\|^2 \quad \text{s.t.}\quad s_i\big(\langle w, v_i\rangle + b\big) - 1 \ \ge\ 0 \qquad (1)$$

Using Lagrange multipliers to merge the optimization problem with its constraints, formula (1) becomes:

$$\max_{\alpha}\ \sum_{i=1}^{r}\alpha_i - \frac{1}{2}\sum_{i,h=1}^{r}\alpha_i\alpha_h s_i s_h\, v_i\cdot v_h \quad \text{s.t.}\quad \alpha_i \ge 0,\ \sum_{i=1}^{r}\alpha_i s_i = 0 \qquad (2)$$

Introducing a kernel function K(v_i, v_h), formula (2) is converted to:

$$\max_{\alpha}\ \sum_{i=1}^{r}\alpha_i - \frac{1}{2}\sum_{i,h=1}^{r}\alpha_i\alpha_h s_i s_h\, K(v_i, v_h) \quad \text{s.t.}\quad \alpha_i \ge 0,\ \sum_{i=1}^{r}\alpha_i s_i = 0 \qquad (3)$$

The kernel function is chosen to be the radial basis function (RBF), defined as:

$$K(v_i, v_h) = \exp\!\left(-\frac{(v_i - v_h)^2}{2\sigma^2}\right) \qquad (4)$$

where exp(·) is the exponential function and σ is a parameter.
Once training has determined a group of α_i, the discriminant function of the shot semantic Sem_t is also determined:

$$f_t(v) = \operatorname{sgn}\!\left[\sum_{i=1}^{r}\alpha_i s_i K(v_i, v) + b_0\right] \qquad (5)$$

where b_0 is a parameter.
After the classifiers SVM_t corresponding to the semantics Sem_t have been trained, the shot semantic analyzer group containing e shot semantic analyzers is obtained.
Perform shot segmentation on the m video segments video_k for which tree indexes are to be built, then extract the visual features of each shot to form its feature vector v. Input v into the shot semantic analyzer group to decide which semantics appear in the shot, and add the semantics that appear to that shot's shot semantic set.
2. From the shot semantic set of each shot of video_j, pick one shot semantic to represent the shot, then compose the shot semantic sequence wu_j in temporal order.
Taking video segments as the unit, manually label the context of the semantic sequences wu_j of the training video segments, and represent the contextual information with the corresponding context label trees LT_j. A context label tree is formally defined as the five-tuple LT = ⟨L, Video, Scene, NL, P⟩. L is the set of shot semantic labels, whose elements are the shot semantics representing the shots in wu_j. Video is the "video context" label; the context it represents is that its child nodes jointly express the content of the video segment. Scene is the "scene context" label; the context it represents is that its child nodes jointly express the content of one scene. NL is the set of context labels other than Video and Scene, each element of which represents one kind of context relation. P is the set of context rules, each element of which represents one kind of context rule. For example, the leaf nodes l1 and l2 of Fig. 2 form a rule with their parent node nl1, which can be written formally as: nl1 → l1 l2.
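To make the five-tuple and the Fig. 2 rule concrete, here is one possible plain-Python rendering; the dataclass and the nested `(label, children)` tree are choices of this sketch, not structures prescribed by the patent, and Video and Scene appear as the fixed labels of the tree itself:

```python
# Illustration of LT = <L, Video, Scene, NL, P> for a tree shaped like
# Fig. 2: Video at the root, one Scene, nl1 over leaves l1 and l2, and a
# third leaf l3. Rules in P are written "parent -> children".
from dataclasses import dataclass

@dataclass
class ContextLabelTree:
    L: set        # shot semantic labels (the leaves)
    NL: set       # context labels other than Video and Scene
    P: list       # context rules
    root: tuple   # nested (label, children) pairs, rooted at "Video"

lt = ContextLabelTree(
    L={"l1", "l2", "l3"},
    NL={"nl1"},
    P=["nl1 -> l1 l2", "Scene -> nl1 l3", "Video -> Scene"],
    root=("Video", [("Scene", [("nl1", [("l1", []), ("l2", [])]),
                               ("l3", [])])]),
)
```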
Pair the n sequences wu_j with their corresponding context label trees to form the context training set Context = {(x_j, y_j) | j = 1, …, n}, where x_j is a shot semantic sequence and y_j is the corresponding context label tree.
Train the structural support vector machine SVM-Struct on the context training set. The mapping function constructed from shot semantic sequences to context label trees is:

$$F(x; W) = \arg\max_{y \in Y} f(x, y; W)$$

where f(x, y; W) = ⟨W, ψ(x, y)⟩ is the discriminant function, Y is the set of all context label trees that can be constructed over x, W is the weight vector, and ψ(x, y) is the joint feature vector of a shot semantic sequence in the training data and its corresponding context label tree. ψ(x, y) is constructed as

$$\psi(x, y) = (a_1, \ldots, a_N)^{\mathsf T}$$

where p_i and a_i, i ∈ [1, N], are respectively a context rule in the context rule set P of the context label tree and the number of times that rule occurs, and N is the total number of context rule classes occurring in the context training set.
Convert the training of SVM-Struct into the optimization problem:

$$\min\ \frac{1}{2}\|W\|^2 + \frac{C}{n}\sum_{j=1}^{n}\varepsilon_j \quad \text{s.t.}\quad \forall y \in Y\setminus y_j:\ \big\langle W,\ \psi(x_j, y_j) - \psi(x_j, y)\big\rangle \ \ge\ \Delta(y_j, y) - \varepsilon_j \qquad (6)$$

where ε_j is a slack variable, C > 0 is the penalty for misclassified samples, and Δ(y_j, y) is the loss function. Let the loss function be Δ(y_j, y) = 1 − F_1(y_j, y), where y_j is the true context label tree of a shot semantic sequence in the context training set and y is a context label tree predicted during training. F_1 is computed as follows:

$$\text{Precision} = \frac{|E(y_j)\cap E(y)|}{|E(y)|},\qquad \text{Recall} = \frac{|E(y_j)\cap E(y)|}{|E(y_j)|},\qquad F_1 = \frac{2\cdot\text{Precision}\cdot\text{Recall}}{\text{Precision}+\text{Recall}}$$

where Precision is the accuracy of each node prediction in the context label tree, Recall is the recall of each node prediction in the context label tree, E(y_j) is the edge set of y_j, and E(y) is the edge set of y.
Convert formula (6) into its dual form:

$$\max_{\alpha}\ \sum_{j,\,y\ne y_j}\alpha_{jy}\ -\ \frac{1}{2}\sum_{j,\,y\ne y_j}\ \sum_{z,\,\bar y\ne y_z}\alpha_{jy}\,\alpha_{z\bar y}\,\big\langle \psi(x_j,y_j)-\psi(x_j,y),\ \psi(x_z,y_z)-\psi(x_z,\bar y)\big\rangle \quad \text{s.t.}\quad \forall j,\ \forall y\in Y\setminus y_j:\ \alpha_{jy}\ge 0 \qquad (7)$$

where α_jy and α_{z ȳ} are Lagrange multipliers. For the soft margin there is additionally a group of constraints:

$$\forall j:\quad n\sum_{y\ne y_j}\alpha_{jy}\,\Delta(y_j, y)\ \le\ C$$

After the penalty C has been set, compute formula (7) on the context training set Context; finding the optimal group of α_jy also determines the weight vector W, giving the context label tree analyzer.
Extract the shot semantic sequence wu_k of video_k, and input wu_k into the video context label tree analyzer to obtain the LT_k of wu_k.
3. According to the "scene context" labels Scene in LT_j, take the shots corresponding to the leaf nodes under each Scene label as one complete video scene, realizing the scene segmentation of the video. Then, taking scenes as the unit, manually label the scene semantics of video_j's scenes.
Construct the scene semantic training set from the shot semantic sets of the shots in each scene together with the contextual information in the corresponding LT_j. The features of a scene semantic are of two kinds:
a. Shot semantic features: if a given shot semantic appears in the scene, the value of that shot semantic feature is 1, and otherwise 0;
b. Context features: a context feature is the context relation between two shot semantics. A shot semantic corresponds to one leaf node in LT_j, so the context feature value of two shot semantics is the context label at the nearest common ancestor node of the two leaf nodes. For example, in Fig. 2 the context feature of l1 and l2 is "nl1", and the context feature of l1 and l3 is "Scene".
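The context feature of item b is mechanical to compute from root-to-leaf paths. A sketch over the `(label, children)` trees used above, which reproduces the Fig. 2 example in its final comment:

```python
# Context feature of two shot semantics: the label at the nearest common
# ancestor of their leaf nodes in LT_j.
def path_to(node, leaf_label, path=()):
    label, children = node
    path = path + (node,)
    if not children:
        return path if label == leaf_label else None
    for c in children:
        p = path_to(c, leaf_label, path)
        if p is not None:
            return p
    return None

def context_feature(root, a, b):
    pa, pb = path_to(root, a), path_to(root, b)
    lca = None
    for na, nb in zip(pa, pb):
        if na is nb:
            lca = na                 # deepest node shared by both paths
    return lca[0]   # Fig. 2: ("l1","l2") -> "nl1"; ("l1","l3") -> "Scene"
```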
Using the C4.5 algorithm as the classification model, select attributes as nodes according to the information gain ratio of each feature attribute in the scene semantic training set, ultimately generating a decision tree that analyzes the semantics of video scenes. Take this decision tree as the scene semantic analyzer.
According to the "scene context" labels Scene in the LT_k of wu_k, divide video_k into scenes, and, taking scenes as the unit, extract each scene's shot semantic features and context features to form its feature vector. Input the feature vector of each of video_k's scenes into the scene semantic analyzer to obtain the scene semantics of each scene of video_k.
4. Replace the shot semantic label in each leaf node of LT_k with the shot semantic set of the shot it represents, then replace each Scene in LT_k with the corresponding scene semantic, and finally take the LT_k containing the shot semantics and scene semantics as the video index of video_k.
The foregoing are merely preferred embodiments of the present invention; all equal changes and modifications made within the scope of the patent of the present invention shall fall within the scope covered by the present invention.

Claims (2)

1. A method for creating a context-fused, tree-structured video semantic index, characterized in that the method comprises the following steps:
Step 1: Input n training video segments video_j, j ∈ {1, …, n}, and preprocess each video_j; then manually label, shot by shot, the shot semantic set of every shot of video_j, and construct a shot semantic training set for every class of shot semantic to train a classifier, obtaining the shot semantic analyzer; input the m video segments video_k, k ∈ {1, …, m}, for which tree indexes are to be built, preprocess each video_k, and use the shot semantic analyzer to extract the shot semantic set of every shot of video_k;
Step 2: Taking video segments as the unit, manually label the context among the shot semantics of video_j, represent it with a context label tree LT_j, and build the context training set; train a structural support vector machine (SVM-Struct) to obtain the context label tree analyzer; use the context label tree analyzer to extract the context label tree LT_k of video_k;
Step 3: Taking scenes of video_j as the unit, manually label scene semantics and build the scene semantic training set; train a C4.5 classifier to obtain the scene semantic analyzer; use the scene semantic analyzer to extract the scene semantics of every scene of video_k;
Step 4: Embed the shot semantic sets of video_k's shots obtained in step 1 and the scene semantics of video_k's scenes obtained in step 3 into the corresponding nodes of the LT_k obtained in step 2, and take the LT_k carrying the shot semantics and scene semantics as the video index of video_k;
wherein step 2 is carried out as follows:
Step 3.1: From the shot semantic set of each shot of video_j, pick one shot semantic to represent the shot, then compose the shot semantic sequence wu_j in temporal order;
Step 3.2: Manually label the context of wu_j and represent the contextual information with a context label tree LT_j; a context label tree is a five-tuple LT_j = ⟨L, Video, Scene, NL, P⟩, where L is the set of shot semantic labels, whose elements are the shot semantics representing the shots in wu_j; Video is the "video context" label, whose context is that its child nodes jointly express the content of the video segment; Scene is the "scene context" label, whose context is that its child nodes jointly express the content of one scene; NL is the set of context labels other than Video and Scene, each element of which represents one kind of context relation; and P is the set of context rules, each element of which represents one context rule;
Step 3.3: Pair the n sequences wu_j with their corresponding context label trees to form the context training set Context = {(x_j, y_j) | j = 1, …, n}, where x_j is a shot semantic sequence in the context training set and y_j is the context label tree in the context training set corresponding to x_j;
Step 3.4: Train the structural support vector machine SVM-Struct on the context training set; the concrete operations are:
Step 3.4.1: Construct the mapping function from shot semantic sequences to context label trees:

$$F(x_j; W) = \arg\max_{y \in Y} f(x_j, y; W)$$

where f(x_j, y_j; W) = ⟨W, ψ(x_j, y_j)⟩ is the discriminant function, Y is the set of all context label trees that can be constructed over x_j, W is the weight vector, and ψ(x_j, y_j) is the joint feature vector of a shot semantic sequence in the training data and its corresponding context label tree; ψ(x_j, y_j) is constructed as

$$\psi(x_j, y_j) = (a_1, \ldots, a_N)^{\mathsf T}$$

where p_i and a_i, i ∈ [1, N], are respectively a rule in the context rule set P of the context label tree and the number of times that rule occurs, and N is the total number of context rule classes occurring in the context training set;
Step 3.4.2: Convert the training of SVM-Struct into the optimization problem:

$$\min\ \frac{1}{2}\|W\|^2 + \frac{C}{n}\sum_{j=1}^{n}\varepsilon_j \quad \text{s.t.}\quad \forall y \in Y\setminus y_j:\ \big\langle W,\ \psi(x_j, y_j) - \psi(x_j, y)\big\rangle \ \ge\ \Delta(y_j, y) - \varepsilon_j \qquad (6)$$

where ε_j is a slack variable, C > 0 is the penalty for misclassified samples, and Δ(y_j, y) is the loss function; let the loss function be Δ(y_j, y) = 1 − F_1(y_j, y), where y_j is the true context label tree of a shot semantic sequence in the context training set and y is a context label tree predicted during training; F_1 is computed as follows:

$$\text{Precision} = \frac{|E(y_j)\cap E(y)|}{|E(y)|},\qquad \text{Recall} = \frac{|E(y_j)\cap E(y)|}{|E(y_j)|},\qquad F_1 = \frac{2\cdot\text{Precision}\cdot\text{Recall}}{\text{Precision}+\text{Recall}}$$

where Precision is the accuracy of each node prediction in the context label tree, Recall is the recall of each node prediction in the context label tree, E(y_j) is the edge set of y_j, and E(y) is the edge set of y;
Step 3.4.3: Convert formula (6) into its dual form:

$$\max_{\alpha}\ \sum_{j,\,y\ne y_j}\alpha_{jy}\ -\ \frac{1}{2}\sum_{j,\,y\ne y_j}\ \sum_{z,\,\bar y\ne y_z}\alpha_{jy}\,\alpha_{z\bar y}\,\big\langle \psi(x_j,y_j)-\psi(x_j,y),\ \psi(x_z,y_z)-\psi(x_z,\bar y)\big\rangle \quad \text{s.t.}\quad \forall j,\ \forall y\in Y\setminus y_j:\ \alpha_{jy}\ge 0 \qquad (7)$$

where α_jy and α_{z ȳ} are Lagrange multipliers; for the soft margin there is additionally a group of constraints:

$$\forall j:\quad n\sum_{y\ne y_j}\alpha_{jy}\,\Delta(y_j, y)\ \le\ C$$
Step 3.4.4: Compute formula (7) on the context training set Context; once the optimal group of α_jy has been found, the weight vector W is determined, giving the context label tree analyzer;
Step 3.5: Extract the shot semantic sequence wu_k of video_k in the same way as in step 3.1, and input wu_k into the video context label tree analyzer to obtain the LT_k corresponding to wu_k;
wherein step 3 is carried out as follows:
Step 4.1: According to the "scene context" labels Scene in LT_j, take the shots corresponding to the leaf nodes under each Scene label as one complete video scene, realizing the scene segmentation of the video; then, taking scenes as the unit, manually label the scene semantics of video_j's scenes;
Step 4.2: Construct the scene semantic training set from the shot semantic sets of the shots in each scene together with the contextual information in the corresponding LT_j; the features of a scene semantic are of two kinds:
a. shot semantic features: if a given shot semantic appears in the scene, the value of that shot semantic feature is 1, and otherwise 0;
b. context features: a context feature is the context relation between two shot semantics; a shot semantic corresponds to one leaf node in LT_j, so the context feature value of two shot semantics is the context label at the nearest common ancestor node of the two leaf nodes;
Step 4.3: Using the C4.5 algorithm as the classification model, select attributes as nodes according to the information gain ratio of each feature attribute in the scene semantic training set, ultimately generating a decision tree that analyzes the semantics of video scenes, and take this decision tree as the scene semantic analyzer;
Step 4.4: According to the LT_k of wu_k, divide video_k into scenes by the same method as in step 4.1, and extract the feature vector of each scene; input the feature vector of each of video_k's scenes into the scene semantic analyzer to obtain the scene semantics of each scene of video_k;
wherein step 4 is carried out as follows:
Step 5.1: Replace the shot semantic label in each leaf node of LT_k with the shot semantic set of the shot it represents;
Step 5.2: Replace each Scene in LT_k with the corresponding scene semantic;
Step 5.3: Take the LT_k containing the shot semantics and scene semantics as the video index of video_k.
2. The method for creating a context-fused, tree-structured video semantic index according to claim 1, characterized in that step 1 is carried out as follows:
Step 2.1: Perform shot segmentation on the n training video segments video_j, obtaining r training video shots shot_1, shot_2, …, shot_r; extract and quantify the visual features of each shot shot_i, constructing its visual feature vector v_i;
Step 2.2: Fix the set of semantics to label, Semantic = {Sem_t | t = 1, …, e}; manually label the semantics Sem_t that appear in the r shots, adding them to each shot's shot semantic set, and then construct a shot semantic training set Tra_t = {(v_i, s_i) | i = 1, …, r} for every class of shot semantic Sem_t, where s_i = 1 if the semantic Sem_t appears in shot shot_i and s_i = 0 otherwise; this finally gives the e shot semantic training sets Tra_1, Tra_2, …, Tra_e;
Step 2.3: Using an SVM classifier as the classification model, train one classifier SVM_t for each semantic Sem_t; the discriminant function of SVM_t has the form f_t(v_i) = sgn[g(v_i)], where g(v_i) = ⟨w, v_i⟩ + b, w and b are the parameters to be optimized, and v_i is the visual feature vector of video shot shot_i;
Training SVM_t on the training set Tra_t has the optimization objective:

$$\min\ \frac{1}{2}\|w\|^2 \quad \text{s.t.}\quad s_i\big(\langle w, v_i\rangle + b\big) - 1 \ \ge\ 0 \qquad (1)$$

Using Lagrange multipliers to merge the optimization problem with its constraints, formula (1) becomes:

$$\max_{\alpha}\ \sum_{i=1}^{r}\alpha_i - \frac{1}{2}\sum_{i,h=1}^{r}\alpha_i\alpha_h s_i s_h\, v_i\cdot v_h \quad \text{s.t.}\quad \alpha_i \ge 0,\ \sum_{i=1}^{r}\alpha_i s_i = 0 \qquad (2)$$

where α = {α_1, α_2, …, α_r} are the Lagrange multipliers, h and i are indices, and v_i and v_h are the visual feature vectors corresponding to shots shot_i and shot_h;
Introducing a kernel function K(v_i, v_h), formula (2) is converted to:

$$\max_{\alpha}\ \sum_{i=1}^{r}\alpha_i - \frac{1}{2}\sum_{i,h=1}^{r}\alpha_i\alpha_h s_i s_h\, K(v_i, v_h) \quad \text{s.t.}\quad \alpha_i \ge 0,\ \sum_{i=1}^{r}\alpha_i s_i = 0 \qquad (3)$$

The kernel function is chosen to be the radial basis function (RBF), defined as:

$$K(v_i, v_h) = \exp\!\left(-\frac{(v_i - v_h)^2}{2\sigma^2}\right) \qquad (4)$$

where exp(·) is the exponential function and σ is a parameter;
Once training has determined a group of α_i, the discriminant function of the shot semantic Sem_t is also determined:

$$f_t(v) = \operatorname{sgn}\!\left[\sum_{i=1}^{r}\alpha_i s_i K(v_i, v) + b_0\right] \qquad (5)$$

where b_0 is a parameter;
Step 2.4: After the classifiers SVM_t of all semantics Sem_t have been trained as in step 2.3, e shot semantic discriminant functions are obtained; together they form the shot semantic analyzer group;
Step 2.5: Perform shot segmentation on the m video segments video_k for which tree indexes are to be built, then extract the visual features of each shot to form its feature vector v; input v into the shot semantic analyzer group to decide which semantics appear in the shot, and add the semantics that appear to that shot's shot semantic set.
CN201410297974.0A 2014-06-26 2014-06-26 Method for creating context fusion tree video semantic indexes Expired - Fee Related CN104036023B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201410297974.0A CN104036023B (en) 2014-06-26 2014-06-26 Method for creating context fusion tree video semantic indexes

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201410297974.0A CN104036023B (en) 2014-06-26 2014-06-26 Method for creating context fusion tree video semantic indexes

Publications (2)

Publication Number Publication Date
CN104036023A CN104036023A (en) 2014-09-10
CN104036023B true CN104036023B (en) 2017-05-10

Family

ID=51466793

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201410297974.0A Expired - Fee Related CN104036023B (en) 2014-06-26 2014-06-26 Method for creating context fusion tree video semantic indexes

Country Status (1)

Country Link
CN (1) CN104036023B (en)

Families Citing this family (13)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN104506947B (en) * 2014-12-24 2017-09-05 福州大学 A kind of video fast forward based on semantic content/rewind speeds self-adapting regulation method
US20170083623A1 (en) * 2015-09-21 2017-03-23 Qualcomm Incorporated Semantic multisensory embeddings for video search by text
CN106878632B (en) * 2017-02-28 2020-07-10 北京知慧教育科技有限公司 Video data processing method and device
CN107590442A (en) * 2017-08-22 2018-01-16 华中科技大学 A kind of video semanteme Scene Segmentation based on convolutional neural networks
KR102387767B1 (en) * 2017-11-10 2022-04-19 삼성전자주식회사 Apparatus and method for user interest information generation
US10860649B2 (en) * 2018-03-14 2020-12-08 TCL Research America Inc. Zoomable user interface for TV
CN110545299B (en) * 2018-05-29 2022-04-05 腾讯科技(深圳)有限公司 Content list information acquisition method, content list information providing method, content list information acquisition device, content list information providing device and content list information equipment
CN109344887B (en) * 2018-09-18 2020-07-07 山东大学 Short video classification method, system and medium based on multi-mode dictionary learning
CN109685144B (en) * 2018-12-26 2021-02-12 上海众源网络有限公司 Method and device for evaluating video model and electronic equipment
CN111435453B (en) * 2019-01-14 2022-07-22 中国科学技术大学 Fine-grained image zero sample identification method
CN110097094B (en) * 2019-04-15 2023-06-13 天津大学 Multiple semantic fusion few-sample classification method for character interaction
CN110765314A (en) * 2019-10-21 2020-02-07 长沙品先信息技术有限公司 Video semantic structural extraction and labeling method
CN114302224B (en) * 2021-12-23 2023-04-07 新华智云科技有限公司 Intelligent video editing method, device, equipment and storage medium

Citations (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN103593363A (en) * 2012-08-15 2014-02-19 中国科学院声学研究所 Video content indexing structure building method and video searching method and device

Family Cites Families (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US10078693B2 (en) * 2006-06-16 2018-09-18 International Business Machines Corporation People searches by multisensor event correlation

Patent Citations (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN103593363A (en) * 2012-08-15 2014-02-19 中国科学院声学研究所 Video content indexing structure building method and video searching method and device

Non-Patent Citations (2)

* Cited by examiner, † Cited by third party
Title
"Co-Concept-Boosting视频语义索引方法";陈丹雯等;《小型微型计算机系统》;20120731;第33卷(第7期);1603-1607 *
"一种新的用于视频检索的语义索引";韩智广等;《和谐人机环境2008》;20081231;454-459 *

Also Published As

Publication number Publication date
CN104036023A (en) 2014-09-10

Similar Documents

Publication Publication Date Title
CN104036023B (en) Method for creating context fusion tree video semantic indexes
Qu et al. Dynamic modality interaction modeling for image-text retrieval
Gorti et al. X-pool: Cross-modal language-video attention for text-video retrieval
Liang et al. Jointly learning aspect-focused and inter-aspect relations with graph convolutional networks for aspect sentiment analysis
Chen et al. Hierarchical visual-textual graph for temporal activity localization via language
Perez-Martin et al. Improving video captioning with temporal composition of a visual-syntactic embedding
CN106951438A (en) A kind of event extraction system and method towards open field
CN105760507A (en) Cross-modal subject correlation modeling method based on deep learning
CN103970733B (en) A kind of Chinese new word identification method based on graph structure
CN103778227A (en) Method for screening useful images from retrieved images
Zablocki et al. Context-aware zero-shot learning for object recognition
Hii et al. Multigap: Multi-pooled inception network with text augmentation for aesthetic prediction of photographs
CN109948668A (en) A kind of multi-model fusion method
CN105849720A (en) Visual semantic complex network and method for forming network
CN105824862A (en) Image classification method based on electronic equipment and electronic equipment
CN106203296B (en) The video actions recognition methods of one attribute auxiliary
CN104376108B (en) A kind of destructuring natural language information abstracting method based on the semantic marks of 6W
CN104537028B (en) A kind of Web information processing method and device
CN106649663A (en) Video copy detection method based on compact video representation
CN103761286B (en) A kind of Service Source search method based on user interest
Hinami et al. Discriminative learning of open-vocabulary object retrieval and localization by negative phrase augmentation
Jung et al. Devil's on the edges: Selective quad attention for scene graph generation
CN113076483A (en) Case element heteromorphic graph-based public opinion news extraction type summarization method
CN109376964A (en) A kind of criminal case charge prediction technique based on Memory Neural Networks
Wang et al. Topic scene graph generation by attention distillation from caption

Legal Events

Date Code Title Description
C06 Publication
PB01 Publication
C10 Entry into substantive examination
SE01 Entry into force of request for substantive examination
GR01 Patent grant
GR01 Patent grant
CF01 Termination of patent right due to non-payment of annual fee

Granted publication date: 20170510

Termination date: 20200626

CF01 Termination of patent right due to non-payment of annual fee