CN104036023A - Method for creating context fusion tree video semantic indexes - Google Patents

Method for creating context fusion tree video semantic indexes

Info

Publication number
CN104036023A
Authority
CN
China
Prior art keywords
camera lens
video
semantic
scene
context
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Granted
Application number
CN201410297974.0A
Other languages
Chinese (zh)
Other versions
CN104036023B (en)
Inventor
余春艳
苏晨涵
翁子林
陈昭炯
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Fuzhou University
Original Assignee
Fuzhou University
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Fuzhou University
Priority to CN201410297974.0A
Publication of CN104036023A
Application granted
Publication of CN104036023B
Expired - Fee Related
Anticipated expiration

Links

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/70 Information retrieval; Database structures therefor; File system structures therefor of video data
    • G06F16/71 Indexing; Data structures therefor; Storage structures

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Software Systems (AREA)
  • Multimedia (AREA)
  • Data Mining & Analysis (AREA)
  • Databases & Information Systems (AREA)
  • Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Studio Devices (AREA)
  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)

Abstract

The invention belongs to the field of video retrieval technology and discloses a method for building tree-structured video semantic indexes. A video semantic index built with the method contains video semantics of multiple granularities and fuses the context among those semantics; the semantics of different granularities are linked to one another through that context so that a tree structure is formed. The method comprises the steps of: extracting the shot semantic set of each shot, shot by shot; obtaining the context among the video shot semantics under supervision and representing it with a context label tree; inferring scene semantics by combining the shot semantic sets with the context information; and embedding the shot semantic sets and the scene semantics into the context label tree to obtain the video index. After a semantic index has been built for a video with the method, users can retrieve the video by entering keywords of different granularities, and the context information in the index narrows the search space, so the efficiency of the retrieval system is improved.

Description

A method for building a context-fused tree-structured video semantic index
Technical field
The invention belongs to the field of video retrieval technology and is a method that builds a video semantic index from the shot semantics of a video, its scene semantics, and the context among those semantics.
Background technology
Video data has already become one of the most important kinds of data on the Internet. With the explosive growth of video data, however, managing and retrieving video efficiently has become a very difficult problem. A user typically enters a keyword when retrieving video, and a video search engine then finds the relevant video data according to that keyword. This requires building a suitable semantic index over the video in order to improve the efficiency and hit rate of the user's search. In semantic video indexing, a computer automatically analyzes the visual features of a video to obtain the semantic information it contains and then uses that semantic information as the index of the video, so that the user can retrieve video by entering keywords.
Users' demands on video search engines keep rising, and users enter keywords of different granularities depending on their needs; for example, when searching for football-related video a user may enter keywords as varied as "football", "highlight", "shooting" or "referee close-up". A traditional single-granularity, flat video semantic index cannot satisfy such search requirements. Moreover, the semantic content of video is rich: besides the semantic information itself there is a large amount of contextual information. Contextual information can help a search engine understand the interaction between semantics of different granularities and establish relations between the semantics of different granularities in a video, so that related video can be retrieved through these relations. Contextual information can also narrow the search space while preserving the search hit rate, improving search efficiency. On this basis, the present invention realizes a video semantic index that fuses context, improving the effectiveness of video indexing.
Summary of the invention
The object of the invention is to realize a method that builds a tree-structured video semantic index fused with contextual information. The method incorporates contextual information into the video semantic index and improves the hit rate and efficiency of video retrieval.
The invention adopts the following scheme: a method for building a context-fused tree-structured video semantic index, characterized in that the method comprises the following steps:
Step 1: input n training video segments video_j, j ∈ {1, ..., n}; preprocess each video_j, then manually annotate the shot semantic set of every shot of video_j, shot by shot, construct a shot-semantic training set for every class of shot semantic to train classifiers, and obtain the shot semantic analyzers. Input the m video segments video_k, k ∈ {1, ..., m}, for which tree indexes are to be built, preprocess each video_k, and use the shot semantic analyzers to extract the shot semantic set of every shot of video_k;
Step 2: taking a video segment as the unit, manually annotate the context among the shot semantics of video_j and represent it with a context label tree LT_j carrying context labels, and build a context training set. Train a structural support vector machine (SVM-Struct) to obtain the context label tree analyzer. Use the context analyzer to extract the context label tree LT_k of video_k;
Step 3: taking a scene of video_j as the unit, manually annotate the scene semantics and build a scene-semantics training set. Train a C4.5 classifier to obtain the scene semantic analyzer. Use the scene semantic analyzer to extract the scene semantic of every scene of video_k;
Step 4: embed the shot semantic set of every shot of video_k obtained in step 1 and the scene semantic of every scene of video_k obtained in step 3 into the corresponding nodes of the LT_k obtained in step 2, and take the LT_k carrying shot semantics and scene semantics as the video index of video_k.
Further, step 1 is carried out as follows:
Step 2.1: segment the n training video segments video_j into shots, obtaining r training shots; extract and quantize the visual features of each shot to form a visual feature vector v;
Step 2.2: define the annotation semantic set Semantic = {Sem_t | t = 1, ..., e}; manually annotate the semantics Sem_t that appear in the r shots, adding each to the shot semantic set of its shot; then construct a shot-semantic training set for every class of shot semantic Sem_t, obtaining e shot-semantic training sets Tra_t = {(v_i, s_i) | i = 1, ..., r}, where s_i = 1 if the semantic Sem_t appears in shot i and s_i = 0 otherwise;
Step 2.3: using the SVM classifier as the classification model, train one classifier SVM_t for every semantic Sem_t; the discriminant function of SVM_t has the form f_t(v) = sgn[g(v)], where g(v) = <w, v> + b; the objective of training SVM_t on the training set Tra_t is therefore:
\min_{w,b}\ \tfrac{1}{2}\|w\|^2 \quad \text{s.t.}\quad s_i(\langle w, v_i\rangle + b) - 1 \ge 0 \qquad (1)
Using a Lagrangian function to merge the objective and its constraints, (1) is converted into:
\max_{\alpha}\ \sum_{i=1}^{r}\alpha_i - \tfrac{1}{2}\sum_{i,h=1}^{r}\alpha_i\alpha_h s_i s_h \langle v_i, v_h\rangle \quad \text{s.t.}\quad \alpha_i \ge 0,\ \sum_{i=1}^{r}\alpha_i s_i = 0 \qquad (2)
Introducing a kernel function K(v_i, v_h), formula (2) becomes:
\max_{\alpha}\ \sum_{i=1}^{r}\alpha_i - \tfrac{1}{2}\sum_{i,h=1}^{r}\alpha_i\alpha_h s_i s_h K(v_i, v_h) \quad \text{s.t.}\quad \alpha_i \ge 0,\ \sum_{i=1}^{r}\alpha_i s_i = 0 \qquad (3)
The kernel function is chosen to be the radial basis function, defined as:
K(v_i, v_h) = \exp\!\left(-\frac{\|v_i - v_h\|^2}{2\sigma^2}\right) \qquad (4)
where exp(·) is the exponential function and σ is a parameter.
After training, a set of α_i is determined, and with it the discriminant function of the shot semantic Sem_t:
f_t(v) = \operatorname{sgn}\!\left[\sum_{i=1}^{r}\alpha_i s_i K(v_i, v) + b_0\right] \qquad (5)
where b_0 is a parameter.
Step 2.4: after the classifiers SVM_t of all Sem_t have been trained according to step 2.3, the discriminant functions of the e shot semantics are obtained; together they form the shot semantic analyzer group.
Step 2.5: segment the m video segments video_k, for which tree indexes are to be built, into shots, then extract the visual features of every shot to form a feature vector v; input v into the shot semantic analyzer group to decide which semantics appear in the shot, and add the detected semantics to the shot semantic set of that shot.
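To make step 1 concrete, the following minimal sketch trains one RBF-kernel SVM per shot semantic and applies the resulting analyzer group to a new shot, along the lines of formulas (1)-(5). It assumes the shot visual features are already extracted as fixed-length vectors; the semantic names, the random feature data and the σ value are illustrative placeholders, not part of the patent.

```python
# Sketch of step 1: one binary RBF-kernel SVM per shot semantic (formulas (1)-(5)).
# Feature data, semantic names and sigma are illustrative placeholders.
import numpy as np
from sklearn.svm import SVC

semantic_set = ["goal", "close_up", "audience"]            # Semantic = {Sem_t}
rng = np.random.default_rng(0)
train_vectors = rng.random((200, 64))                       # v_i: quantized visual features of r shots
train_labels = {s: rng.integers(0, 2, 200) for s in semantic_set}   # s_i in {0, 1} per semantic

sigma = 0.5
shot_analyzers = {}
for sem in semantic_set:                                    # train SVM_t for every Sem_t
    clf = SVC(kernel="rbf", gamma=1.0 / (2 * sigma ** 2))   # RBF kernel of formula (4)
    clf.fit(train_vectors, train_labels[sem])
    shot_analyzers[sem] = clf

def shot_semantic_set(v):
    """Return the shot semantic set: every Sem_t whose discriminant f_t(v) fires (formula (5))."""
    v = np.asarray(v).reshape(1, -1)
    return {sem for sem, clf in shot_analyzers.items() if clf.predict(v)[0] == 1}

print(shot_semantic_set(rng.random(64)))
```

Whether a semantic "fires" here is simply the SVM's sign decision, matching the sgn[·] form of formula (5).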
Further, step 2 is carried out as follows:
Step 3.1: from the shot semantic set of every shot of video_j, take one shot semantic to represent that shot, and form the shot semantic sequence wu_j according to temporal order;
Step 3.2: manually annotate the context of wu_j and represent the contextual information with a context label tree LT_j; a context label tree is a five-tuple LT = <L, Video, Scene, NL, P>, where L is the set of shot semantic labels, whose elements are the shot semantics representing the shots in wu_j; Video is the "video context" label, whose context is the content of the whole video segment expressed jointly by its child nodes; Scene is the "scene context" label, whose context is the content of one scene expressed jointly by its child nodes; NL is the set of context labels other than Video and Scene, each element representing one kind of context relation; and P is the set of context rules, each element representing one context rule;
Step 3.3: construct the context training set from the n shot semantic sequences wu_j and their corresponding context label trees:
Context = {(x_j, y_j) | j = 1, ..., n}, where x_j is a shot semantic sequence and y_j is the corresponding context label tree;
Step 3.4: train a structural support vector machine (SVM-Struct) on the context training set; the concrete operations are:
Step 3.4.1: construct the mapping from a shot semantic sequence to a context label tree as h(x; W) = argmax_{y ∈ Y} f(x, y; W),
where f(x, y; W) = <W, ψ(x, y)> is the discriminant function, W is the weight vector, and ψ(x, y) is the joint feature vector of a shot semantic sequence in the training data and its corresponding context label tree; ψ(x, y) is constructed as:
\psi(x, y) = (a_1, a_2, \ldots, a_N)^{\top}
where a_i is the number of times the context rule p_i (i ∈ [1, N]) occurs in the context rule set P of this context label tree, and N is the total number of context rule classes occurring in the context training set;
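As an illustration of step 3.4.1, the sketch below builds ψ(x, y) as a count vector over a fixed inventory of context rules; the rule strings and the small inventory are assumptions made for the example, standing in for the N rule classes collected from the context training set.

```python
# Sketch of step 3.4.1: psi(x, y) as a vector of context-rule counts (a_1, ..., a_N).
# The rule inventory below is an illustrative stand-in; its order fixes the vector layout.
from collections import Counter

RULES = ["Video -> Scene Scene", "Scene -> nl1 l3", "nl1 -> l1 l2"]   # N = 3 here

def joint_feature(tree_rules):
    """tree_rules: the context rule set P of one context label tree y, as rule strings."""
    counts = Counter(tree_rules)
    return [counts.get(rule, 0) for rule in RULES]                    # [a_1, ..., a_N]

print(joint_feature(["nl1 -> l1 l2", "Scene -> nl1 l3", "nl1 -> l1 l2"]))   # [0, 1, 2]
```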
Step 3.4.2: training SVM-Struct is cast as the optimization problem:
\min_{W,\varepsilon}\ \tfrac{1}{2}\|W\|^2 + \frac{C}{n}\sum_{j=1}^{n}\varepsilon_j \quad \text{s.t.}\quad \forall j,\ \forall y \in Y:\ \langle W,\ \psi(x_j, y_j) - \psi(x_j, y)\rangle \ge \Delta(y_j, y) - \varepsilon_j \qquad (6)
where ε_j is a slack variable, C > 0 is the penalty for misclassified samples, and Δ(y_j, y) is the loss function, taken as Δ(y_j, y) = 1 − F_1(y_j, y); here y_j is the true context label tree of a shot semantic sequence in the context training set, y is the context label tree predicted during training, and F_1 is computed as follows:
\text{Precision} = \frac{|E(y_j) \cap E(y)|}{|E(y)|}
\text{Recall} = \frac{|E(y_j) \cap E(y)|}{|E(y_j)|}
F_1 = \frac{2 \cdot \text{Precision} \cdot \text{Recall}}{\text{Precision} + \text{Recall}}
where Precision is the precision of the node predictions in the context label tree, Recall is the recall of the node predictions in the context label tree, E(y_j) is the edge set of y_j, and E(y) is the edge set of y;
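The loss Δ(y_j, y) = 1 − F_1 above depends only on the edge sets of the two trees; a small sketch of that computation follows, with edges written as (parent, child) pairs — a representation assumed here purely for illustration.

```python
# Sketch of the loss in step 3.4.2: Delta(y_j, y) = 1 - F1 over the edge sets E(y_j), E(y).
def tree_loss(true_edges, pred_edges):
    true_edges, pred_edges = set(true_edges), set(pred_edges)
    if not true_edges or not pred_edges:
        return 1.0
    overlap = len(true_edges & pred_edges)
    precision = overlap / len(pred_edges)
    recall = overlap / len(true_edges)
    if precision + recall == 0:
        return 1.0
    f1 = 2 * precision * recall / (precision + recall)
    return 1.0 - f1

# Edges as (parent, child) pairs; the labels are illustrative.
print(tree_loss({("Video", "Scene"), ("Scene", "l1"), ("Scene", "l2")},
                {("Video", "Scene"), ("Scene", "l1"), ("Scene", "l3")}))   # 1 - 2/3
```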
Step 3.4.3: convert formula (6) into its dual form:
\max_{\alpha}\ \sum_{j,\, y \ne y_j}\alpha_{jy} - \tfrac{1}{2}\sum_{\substack{j,\, y \ne y_j \\ z,\, \bar{y} \ne y_z}}\alpha_{jy}\,\alpha_{z\bar{y}}\,\langle \delta\psi_j(y), \delta\psi_z(\bar{y})\rangle \quad \text{s.t.}\quad \forall j,\ \forall y \in Y \setminus y_j:\ \alpha_{jy} \ge 0 \qquad (7)
where the α_{jy} are Lagrange multipliers; for the soft margin there is, in addition, the group of constraints:
\forall j:\ n\sum_{y \ne y_j}\alpha_{jy}\,\Delta(y_j, y) \le C
Step 3.4.4: solve formula (7) on the context training set Context; once an optimal set of α_{jy} has been found, the weight vector W is determined and the context label tree analyzer is obtained;
Step 3.5: extract the shot semantic sequence wu_k of video_k in the same way as in step 3.1, input wu_k into the video context label tree analyzer, and obtain the LT_k of wu_k.
Further, step 3 is carried out as follows:
Step 4.1: according to the "scene context" labels Scene in LT_j, take the shots corresponding to the leaf nodes under each Scene label as one complete video scene, thereby realizing the scene segmentation of the video; then, scene by scene, manually annotate the scene semantics of video_j;
Step 4.2: use the shot semantic set of every shot in each scene and the contextual information in the corresponding LT_j to construct the scene-semantics training set, in which the features of a scene are of two kinds:
A. shot semantic features: if a shot semantic appears in the scene, the value of that shot semantic feature is 1, otherwise 0;
B. contextual features: a contextual feature is the context relation between two shot semantics; since each shot semantic corresponds to a leaf node in LT_j, the contextual feature value of two shot semantics is the context label at the nearest common ancestor node of their two leaf nodes;
Step 4.3: taking the C4.5 algorithm as the classification model, select attributes as nodes according to the information gain ratio of each feature attribute in the scene-semantics training set, finally generating the decision tree that analyzes video scene semantics, and take this decision tree as the scene semantic analyzer;
Step 4.4: according to the LT_k of wu_k, divide video_k into scenes with the same method as in step 4.1, and extract the feature vector of each scene, scene by scene; input the feature vector of every scene of video_k into the scene semantic analyzer to obtain the scene semantic of every scene of video_k.
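The scene analyzer of step 3 can be sketched as follows. The feature vector concatenates the shot-semantic presence bits (feature A) with an index-encoded context label (feature B); scikit-learn's CART tree with the entropy criterion stands in for C4.5 here, and the semantic names, label names and toy training scenes are assumptions made for the example.

```python
# Sketch of step 3: scene feature vectors (presence bits + context label) and a decision tree.
# CART with the entropy criterion stands in for C4.5; names and toy data are illustrative.
from sklearn.tree import DecisionTreeClassifier

semantic_set = ["goal", "close_up", "audience"]
context_labels = ["none", "nl1", "Scene"]            # possible nearest-common-ancestor labels

def scene_feature_vector(shot_semantic_sets, lca_label):
    """shot_semantic_sets: shot semantic sets of the shots in one scene (feature A).
    lca_label: context label at the nearest common ancestor of a semantic pair (feature B)."""
    presence = [int(any(sem in s for s in shot_semantic_sets)) for sem in semantic_set]
    return presence + [context_labels.index(lca_label)]

X = [scene_feature_vector([{"goal", "close_up"}], "nl1"),
     scene_feature_vector([{"audience"}], "none")]
y = ["shooting_scene", "crowd_scene"]

scene_analyzer = DecisionTreeClassifier(criterion="entropy").fit(X, y)
print(scene_analyzer.predict([scene_feature_vector([{"goal"}], "nl1")]))
```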
Further, step 4 is carried out as follows:
Step 5.1: replace the shot semantic label in every leaf node of LT_k with the shot semantic set of the shot that leaf represents;
Step 5.2: replace every Scene label in LT_k with the corresponding scene semantic;
Step 5.3: take the LT_k containing shot semantics and scene semantics as the video index of video_k.
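A minimal sketch of step 4 follows, using a nested-dict representation of LT_k that is assumed here for illustration (the patent does not prescribe a data structure): each leaf label is replaced by its shot's semantic set and each Scene label by its scene semantic.

```python
# Sketch of step 4: embed shot semantic sets and scene semantics into LT_k to form the index.
# The nested-dict tree layout and the shot_id/scene_id fields are assumed for illustration.
def embed_index(node, shot_semantics, scene_semantics):
    if not node.get("children"):                       # step 5.1: leaf -> shot semantic set
        node["semantics"] = sorted(shot_semantics[node["shot_id"]])
        return node
    if node["label"] == "Scene":                       # step 5.2: Scene -> scene semantic
        node["semantics"] = scene_semantics[node["scene_id"]]
    for child in node["children"]:
        embed_index(child, shot_semantics, scene_semantics)
    return node

lt_k = {"label": "Video", "children": [
    {"label": "Scene", "scene_id": 0, "children": [
        {"label": "nl1", "children": [
            {"label": "l1", "shot_id": 0, "children": []},
            {"label": "l2", "shot_id": 1, "children": []}]}]}]}

index = embed_index(lt_k,                              # step 5.3: the filled tree is the index
                    shot_semantics={0: {"goal"}, 1: {"close_up", "audience"}},
                    scene_semantics={0: "shooting_scene"})
print(index)
```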
The beneficial effect of the invention is as follows: after a semantic index has been built for a video with the method of the invention, users can retrieve the video with keywords of different granularities, and the contextual information in the index narrows the search space and improves the efficiency of the retrieval system.
Brief description of the drawings
Fig. 1 is the flow of building the tree-structured video semantic index.
Fig. 2 is the model of a video context label tree.
Fig. 3 is a tree-structured video index model.
Embodiment
Please refer to Fig. 1. In the method for building a context-fused tree-structured video index, the semantic information of each shot is first extracted shot by shot; then the context among the video shot semantics is obtained under supervision and represented with a tree structure; next, the scene semantics are inferred by combining the shot semantics with their context; finally, the shot semantics and scene semantics are embedded into the tree structure, which serves as the index of the video. The details are as follows:
1. Segment the n training video segments video_j into shots, obtaining r training shots. Extract and quantize the visual features of each shot to form a visual feature vector v.
Define the annotation semantic set Semantic = {Sem_t | t = 1, ..., e}. Manually annotate the semantics Sem_t that appear in the r shots, adding each to the shot semantic set of its shot; then construct a shot-semantic training set for every class of shot semantic Sem_t, obtaining e shot-semantic training sets Tra_t = {(v_i, s_i) | i = 1, ..., r}, where s_i = 1 if the semantic Sem_t appears in shot i and s_i = 0 otherwise.
Using the SVM classifier as the classification model, train one classifier SVM_t for every semantic Sem_t. The discriminant function of SVM_t has the form f_t(v) = sgn[g(v)], where g(v) = <w, v> + b. The objective of training SVM_t on the training set Tra_t is therefore:
\min_{w,b}\ \tfrac{1}{2}\|w\|^2 \quad \text{s.t.}\quad s_i(\langle w, v_i\rangle + b) - 1 \ge 0 \qquad (1)
Using a Lagrangian function to merge the objective and its constraints, (1) is converted into:
\max_{\alpha}\ \sum_{i=1}^{r}\alpha_i - \tfrac{1}{2}\sum_{i,h=1}^{r}\alpha_i\alpha_h s_i s_h \langle v_i, v_h\rangle \quad \text{s.t.}\quad \alpha_i \ge 0,\ \sum_{i=1}^{r}\alpha_i s_i = 0 \qquad (2)
Introducing a kernel function K(v_i, v_h), formula (2) becomes:
\max_{\alpha}\ \sum_{i=1}^{r}\alpha_i - \tfrac{1}{2}\sum_{i,h=1}^{r}\alpha_i\alpha_h s_i s_h K(v_i, v_h) \quad \text{s.t.}\quad \alpha_i \ge 0,\ \sum_{i=1}^{r}\alpha_i s_i = 0 \qquad (3)
The kernel function is chosen to be the radial basis function, defined as:
K(v_i, v_h) = \exp\!\left(-\frac{\|v_i - v_h\|^2}{2\sigma^2}\right) \qquad (4)
where exp(·) is the exponential function and σ is a parameter.
After training, a set of α_i is determined, and with it the discriminant function of the shot semantic Sem_t:
f_t(v) = \operatorname{sgn}\!\left[\sum_{i=1}^{r}\alpha_i s_i K(v_i, v) + b_0\right] \qquad (5)
where b_0 is a parameter.
After the classifier SVM_t corresponding to every Sem_t has been trained, the shot semantic analyzer group containing e shot semantic analyzers is obtained.
Segment the m video segments video_k, for which tree indexes are to be built, into shots, then extract the visual features of every shot to form a feature vector v. Input v into the shot semantic analyzer group to decide which semantics appear in the shot, and add the detected semantics to the shot semantic set of that shot.
2. From the shot semantic set of every shot of video_j, take one shot semantic to represent that shot, and form the shot semantic sequence wu_j according to temporal order.
Taking a video segment as the unit, manually annotate the context of the shot semantic sequence wu_j of the training video segment and represent the contextual information with the corresponding context label tree LT_j. A context label tree is formally defined as the five-tuple LT = <L, Video, Scene, NL, P>, where L is the set of shot semantic labels, whose elements are the shot semantics representing the shots in wu_j; Video is the "video context" label, whose context is the content of the whole video segment expressed jointly by its child nodes; Scene is the "scene context" label, whose context is the content of one scene expressed jointly by its child nodes; NL is the set of context labels other than Video and Scene, each element representing one kind of context relation; and P is the set of context rules, each element representing one kind of context rule. For example, in Fig. 2 the leaf nodes l_1 and l_2 together with their parent node nl_1 form a rule, which can be written formally as nl_1 → l_1 l_2.
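To connect the five-tuple definition with the joint feature of step 3.4.1, the sketch below reads the context rule set P off a label tree shaped like the Fig. 2 example (nl_1 → l_1 l_2 and so on); the nested-dict layout is an assumed representation, not mandated by the method.

```python
# Sketch: deriving the context rule set P of a context label tree as "parent -> children"
# productions. The nested-dict layout mirrors the Fig. 2 example and is assumed here.
def production_rules(node):
    """Collect every 'parent -> child1 child2 ...' rule of a context label tree."""
    rules = []
    children = node.get("children", [])
    if children:
        rules.append(node["label"] + " -> " + " ".join(c["label"] for c in children))
        for child in children:
            rules.extend(production_rules(child))
    return rules

fig2_tree = {"label": "Video", "children": [
    {"label": "Scene", "children": [
        {"label": "nl1", "children": [{"label": "l1"}, {"label": "l2"}]},
        {"label": "l3"}]}]}
print(production_rules(fig2_tree))    # ['Video -> Scene', 'Scene -> nl1 l3', 'nl1 -> l1 l2']
```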
Construct the context training set from the n shot semantic sequences wu_j and their corresponding context label trees: Context = {(x_j, y_j) | j = 1, ..., n}, where x_j is a shot semantic sequence and y_j is the corresponding context label tree.
Train a structural support vector machine (SVM-Struct) on the context training set. The mapping from a shot semantic sequence to a context label tree is constructed as:
h(x; W) = argmax_{y ∈ Y} f(x, y; W),
where f(x, y; W) = <W, ψ(x, y)> is the discriminant function, W is the weight vector, and ψ(x, y) is the joint feature vector of a shot semantic sequence in the training data and its corresponding context label tree. ψ(x, y) is constructed as:
\psi(x, y) = (a_1, a_2, \ldots, a_N)^{\top}
where a_i is the number of times the context rule p_i (i ∈ [1, N]) occurs in the context rule set P of this context label tree, and N is the total number of context rule classes occurring in the context training set.
Training SVM-Struct is cast as the optimization problem:
\min_{W,\varepsilon}\ \tfrac{1}{2}\|W\|^2 + \frac{C}{n}\sum_{j=1}^{n}\varepsilon_j \quad \text{s.t.}\quad \forall j,\ \forall y \in Y:\ \langle W,\ \psi(x_j, y_j) - \psi(x_j, y)\rangle \ge \Delta(y_j, y) - \varepsilon_j \qquad (6)
where ε_j is a slack variable, C > 0 is the penalty for misclassified samples, and Δ(y_j, y) is the loss function, taken as Δ(y_j, y) = 1 − F_1(y_j, y). Here y_j is the true context label tree of a shot semantic sequence in the context training set, y is the context label tree predicted during training, and F_1 is computed as follows:
\text{Precision} = \frac{|E(y_j) \cap E(y)|}{|E(y)|}
\text{Recall} = \frac{|E(y_j) \cap E(y)|}{|E(y_j)|}
F_1 = \frac{2 \cdot \text{Precision} \cdot \text{Recall}}{\text{Precision} + \text{Recall}}
where Precision is the precision of the node predictions in the context label tree, Recall is the recall of the node predictions in the context label tree, E(y_j) is the edge set of y_j, and E(y) is the edge set of y.
Formula (6) is converted into its dual form:
\max_{\alpha}\ \sum_{j,\, y \ne y_j}\alpha_{jy} - \tfrac{1}{2}\sum_{\substack{j,\, y \ne y_j \\ z,\, \bar{y} \ne y_z}}\alpha_{jy}\,\alpha_{z\bar{y}}\,\langle \delta\psi_j(y), \delta\psi_z(\bar{y})\rangle \quad \text{s.t.}\quad \forall j,\ \forall y \in Y \setminus y_j:\ \alpha_{jy} \ge 0 \qquad (7)
where the α_{jy} are Lagrange multipliers. For the soft margin there is, in addition, the group of constraints:
\forall j:\ n\sum_{y \ne y_j}\alpha_{jy}\,\Delta(y_j, y) \le C
After the penalty value C has been set, solve formula (7) on the context training set Context; once an optimal set of α_{jy} has been found, the weight vector W is determined and the context label tree analyzer is obtained.
Extract the shot semantic sequence wu_k of video_k, input wu_k into the video context label tree analyzer, and obtain the LT_k of wu_k.
3. According to the "scene context" labels Scene in LT_j, take the shots corresponding to the leaf nodes under each Scene label as one complete video scene, thereby realizing the scene segmentation of the video. Then, scene by scene, manually annotate the scene semantics of video_j.
Use the shot semantic set of every shot in each scene and the contextual information in the corresponding LT_j to construct the scene-semantics training set, in which the features of a scene are of two kinds:
A. shot semantic features: if a shot semantic appears in the scene, the value of that shot semantic feature is 1, otherwise 0;
B. contextual features: a contextual feature is the context relation between two shot semantics; since each shot semantic corresponds to a leaf node in LT_j, the contextual feature value of two shot semantics is the context label at the nearest common ancestor node of their two leaf nodes. For example, in Fig. 2 the contextual feature of l_1 and l_2 is "nl_1", and the contextual feature of l_1 and l_3 is "Scene".
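A small sketch of this contextual feature (feature B) follows: the label at the nearest common ancestor of two leaves in the context label tree, reproducing the Fig. 2 example just given; the nested-dict tree layout is again an assumed representation.

```python
# Sketch of feature B: the contextual feature of two shot semantics is the label of their
# nearest common ancestor in the label tree (Fig. 2: LCA(l1, l2) = nl1, LCA(l1, l3) = Scene).
def lca_label(node, leaf_a, leaf_b):
    """Return the label of the nearest common ancestor of the two leaf labels, or None."""
    children = node.get("children", [])
    if not children:
        return node["label"] if node["label"] in (leaf_a, leaf_b) else None
    hits = [r for r in (lca_label(c, leaf_a, leaf_b) for c in children) if r is not None]
    if len(hits) >= 2:                     # the two leaves lie in different subtrees: LCA found
        return node["label"]
    return hits[0] if hits else None

fig2_tree = {"label": "Video", "children": [
    {"label": "Scene", "children": [
        {"label": "nl1", "children": [{"label": "l1"}, {"label": "l2"}]},
        {"label": "l3"}]}]}
print(lca_label(fig2_tree, "l1", "l2"), lca_label(fig2_tree, "l1", "l3"))   # nl1 Scene
```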
Taking the C4.5 algorithm as the classification model, select attributes as nodes according to the information gain ratio of each feature attribute in the scene-semantics training set, finally generating the decision tree that analyzes video scene semantics. Take this decision tree as the scene semantic analyzer.
According to the "scene context" labels Scene in the LT_k of wu_k, divide video_k into scenes, and, scene by scene, extract the shot semantic features and contextual features of each scene to form its feature vector. Input the feature vector of every scene of video_k into the scene semantic analyzer to obtain the scene semantic of every scene of video_k.
4. Replace the shot semantic label in every leaf node of LT_k with the shot semantic set of the shot that leaf represents, then replace every Scene label in LT_k with the corresponding scene semantic, and finally take the LT_k containing shot semantics and scene semantics as the video index of video_k.
The foregoing is only a preferred embodiment of the invention; all equivalent changes and modifications made within the scope of the claims of the present application shall fall within the scope of protection of the invention.

Claims (5)

1. A method for building a context-fused tree-structured video semantic index, characterized in that the method comprises the following steps:
Step 1: input n training video segments video_j, j ∈ {1, ..., n}; preprocess each video_j, then manually annotate the shot semantic set of every shot of video_j, shot by shot, construct a shot-semantic training set for every class of shot semantic to train classifiers, and obtain the shot semantic analyzers; input the m video segments video_k, k ∈ {1, ..., m}, for which tree indexes are to be built, preprocess each video_k, and use the shot semantic analyzers to extract the shot semantic set of every shot of video_k;
Step 2: taking a video segment as the unit, manually annotate the context among the shot semantics of video_j and represent it with a context label tree LT_j carrying context labels, and build a context training set; train a structural support vector machine SVM-Struct and obtain the context label tree analyzer; use the context analyzer to extract the context label tree LT_k of video_k;
Step 3: taking a scene of video_j as the unit, manually annotate the scene semantics and build a scene-semantics training set; train a C4.5 classifier and obtain the scene semantic analyzer; use the scene semantic analyzer to extract the scene semantic of every scene of video_k;
Step 4: embed the shot semantic set of every shot of video_k obtained in step 1 and the scene semantic of every scene of video_k obtained in step 3 into the corresponding nodes of the LT_k obtained in step 2, and take the LT_k carrying shot semantics and scene semantics as the video index of video_k.
2. The method for building a context-fused tree-structured video semantic index according to claim 1, characterized in that step 1 is carried out as follows:
Step 2.1: segment the n training video segments video_j into shots, obtaining r training shots; extract and quantize the visual features of each shot to form a visual feature vector v;
Step 2.2: define the annotation semantic set Semantic = {Sem_t | t = 1, ..., e}; manually annotate the semantics Sem_t that appear in the r shots, adding each to the shot semantic set of its shot; then construct a shot-semantic training set for every class of shot semantic Sem_t, obtaining e shot-semantic training sets Tra_t = {(v_i, s_i) | i = 1, ..., r}, where s_i = 1 if the semantic Sem_t appears in shot i and s_i = 0 otherwise;
Step 2.3: using the SVM classifier as the classification model, train one classifier SVM_t for every semantic Sem_t; the discriminant function of SVM_t has the form f_t(v) = sgn[g(v)], where g(v) = <w, v> + b; the objective of training SVM_t on the training set Tra_t is therefore:
\min_{w,b}\ \tfrac{1}{2}\|w\|^2 \quad \text{s.t.}\quad s_i(\langle w, v_i\rangle + b) - 1 \ge 0 \qquad (1)
using a Lagrangian function to merge the objective and its constraints, (1) is converted into:
\max_{\alpha}\ \sum_{i=1}^{r}\alpha_i - \tfrac{1}{2}\sum_{i,h=1}^{r}\alpha_i\alpha_h s_i s_h \langle v_i, v_h\rangle \quad \text{s.t.}\quad \alpha_i \ge 0,\ \sum_{i=1}^{r}\alpha_i s_i = 0 \qquad (2)
introducing a kernel function K(v_i, v_h), formula (2) becomes:
\max_{\alpha}\ \sum_{i=1}^{r}\alpha_i - \tfrac{1}{2}\sum_{i,h=1}^{r}\alpha_i\alpha_h s_i s_h K(v_i, v_h) \quad \text{s.t.}\quad \alpha_i \ge 0,\ \sum_{i=1}^{r}\alpha_i s_i = 0 \qquad (3)
the kernel function is chosen to be the radial basis function, defined as:
K(v_i, v_h) = \exp\!\left(-\frac{\|v_i - v_h\|^2}{2\sigma^2}\right) \qquad (4)
where exp(·) is the exponential function and σ is a parameter;
after training, a set of α_i is determined, and with it the discriminant function of the shot semantic Sem_t:
f_t(v) = \operatorname{sgn}\!\left[\sum_{i=1}^{r}\alpha_i s_i K(v_i, v) + b_0\right] \qquad (5)
where b_0 is a parameter;
Step 2.4: after the classifiers SVM_t of all Sem_t have been trained according to step 2.3, the discriminant functions of the e shot semantics are obtained; together they form the shot semantic analyzer group;
Step 2.5: segment the m video segments video_k, for which tree indexes are to be built, into shots, then extract the visual features of every shot to form a feature vector v; input v into the shot semantic analyzer group to decide which semantics appear in the shot, and add the detected semantics to the shot semantic set of that shot.
3. The method for building a context-fused tree-structured video semantic index according to claim 1, characterized in that step 2 is carried out as follows:
Step 3.1: from the shot semantic set of every shot of video_j, take one shot semantic to represent that shot, and form the shot semantic sequence wu_j according to temporal order;
Step 3.2: manually annotate the context of wu_j and represent the contextual information with a context label tree LT_j; a context label tree is a five-tuple LT = <L, Video, Scene, NL, P>, where L is the set of shot semantic labels, whose elements are the shot semantics representing the shots in wu_j; Video is the "video context" label, whose context is the content of the whole video segment expressed jointly by its child nodes; Scene is the "scene context" label, whose context is the content of one scene expressed jointly by its child nodes; NL is the set of context labels other than Video and Scene, each element representing one kind of context relation; P is the set of context rules, each element representing one context rule;
Step 3.3: construct the context training set from the n shot semantic sequences wu_j and their corresponding context label trees: Context = {(x_j, y_j) | j = 1, ..., n}, where x_j is a shot semantic sequence and y_j is the corresponding context label tree;
Step 3.4: train a structural support vector machine SVM-Struct on the context training set; the concrete operations are:
Step 3.4.1: construct the mapping from a shot semantic sequence to a context label tree as:
h(x; W) = argmax_{y ∈ Y} f(x, y; W),
where f(x, y; W) = <W, ψ(x, y)> is the discriminant function, W is the weight vector, and ψ(x, y) is the joint feature vector of a shot semantic sequence in the training data and its corresponding context label tree; ψ(x, y) is constructed as:
\psi(x, y) = (a_1, a_2, \ldots, a_N)^{\top}
where a_i is the number of times the context rule p_i (i ∈ [1, N]) occurs in the context rule set P of this context label tree, and N is the total number of context rule classes occurring in the context training set;
Step 3.4.2: training SVM-Struct is cast as the optimization problem:
\min_{W,\varepsilon}\ \tfrac{1}{2}\|W\|^2 + \frac{C}{n}\sum_{j=1}^{n}\varepsilon_j \quad \text{s.t.}\quad \forall j,\ \forall y \in Y:\ \langle W,\ \psi(x_j, y_j) - \psi(x_j, y)\rangle \ge \Delta(y_j, y) - \varepsilon_j \qquad (6)
where ε_j is a slack variable, C > 0 is the penalty for misclassified samples, and Δ(y_j, y) is the loss function, taken as Δ(y_j, y) = 1 − F_1(y_j, y); here y_j is the true context label tree of a shot semantic sequence in the context training set, y is the context label tree predicted during training, and F_1 is computed as follows:
\text{Precision} = \frac{|E(y_j) \cap E(y)|}{|E(y)|}
\text{Recall} = \frac{|E(y_j) \cap E(y)|}{|E(y_j)|}
F_1 = \frac{2 \cdot \text{Precision} \cdot \text{Recall}}{\text{Precision} + \text{Recall}}
where Precision is the precision of the node predictions in the context label tree, Recall is the recall of the node predictions in the context label tree, E(y_j) is the edge set of y_j, and E(y) is the edge set of y;
Step 3.4.3: convert formula (6) into its dual form:
\max_{\alpha}\ \sum_{j,\, y \ne y_j}\alpha_{jy} - \tfrac{1}{2}\sum_{\substack{j,\, y \ne y_j \\ z,\, \bar{y} \ne y_z}}\alpha_{jy}\,\alpha_{z\bar{y}}\,\langle \delta\psi_j(y), \delta\psi_z(\bar{y})\rangle \quad \text{s.t.}\quad \forall j,\ \forall y \in Y \setminus y_j:\ \alpha_{jy} \ge 0 \qquad (7)
where the α_{jy} are Lagrange multipliers; for the soft margin there is, in addition, the group of constraints:
\forall j:\ n\sum_{y \ne y_j}\alpha_{jy}\,\Delta(y_j, y) \le C
Step 3.4.4: solve formula (7) on the context training set Context; once an optimal set of α_{jy} has been found, the weight vector W is determined and the context label tree analyzer is obtained;
Step 3.5: extract the shot semantic sequence wu_k of video_k in the same way as in step 3.1, input wu_k into the video context label tree analyzer, and obtain the LT_k corresponding to wu_k.
4. The method for building a context-fused tree-structured video semantic index according to claim 1, characterized in that step 3 is carried out as follows:
Step 4.1: according to the "scene context" labels Scene in LT_j, take the shots corresponding to the leaf nodes under each Scene label as one complete video scene, thereby realizing the scene segmentation of the video; then, scene by scene, manually annotate the scene semantics of video_j;
Step 4.2: use the shot semantic set of every shot in each scene and the contextual information in the corresponding LT_j to construct the scene-semantics training set, in which the features of a scene are of two kinds:
A. shot semantic features: if a shot semantic appears in the scene, the value of that shot semantic feature is 1, otherwise 0;
B. contextual features: a contextual feature is the context relation between two shot semantics; since each shot semantic corresponds to a leaf node in LT_j, the contextual feature value of two shot semantics is the context label at the nearest common ancestor node of their two leaf nodes;
Step 4.3: taking the C4.5 algorithm as the classification model, select attributes as nodes according to the information gain ratio of each feature attribute in the scene-semantics training set, finally generating the decision tree that analyzes video scene semantics, and take this decision tree as the scene semantic analyzer;
Step 4.4: according to the LT_k of wu_k, divide video_k into scenes with the same method as in step 4.1, and extract the feature vector of each scene, scene by scene; input the feature vector of every scene of video_k into the scene semantic analyzer to obtain the scene semantic of every scene of video_k.
5. The method for building a context-fused tree-structured video semantic index according to claim 1, characterized in that step 4 is carried out as follows:
Step 5.1: replace the shot semantic label in every leaf node of LT_k with the shot semantic set of the shot that leaf represents;
Step 5.2: replace every Scene label in LT_k with the corresponding scene semantic;
Step 5.3: take the LT_k containing shot semantics and scene semantics as the video index of video_k.
CN201410297974.0A 2014-06-26 2014-06-26 Method for creating context fusion tree video semantic indexes Expired - Fee Related CN104036023B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201410297974.0A CN104036023B (en) 2014-06-26 2014-06-26 Method for creating context fusion tree video semantic indexes

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201410297974.0A CN104036023B (en) 2014-06-26 2014-06-26 Method for creating context fusion tree video semantic indexes

Publications (2)

Publication Number Publication Date
CN104036023A true CN104036023A (en) 2014-09-10
CN104036023B CN104036023B (en) 2017-05-10

Family

ID=51466793

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201410297974.0A Expired - Fee Related CN104036023B (en) 2014-06-26 2014-06-26 Method for creating context fusion tree video semantic indexes

Country Status (1)

Country Link
CN (1) CN104036023B (en)

Cited By (13)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN104506947A (en) * 2014-12-24 2015-04-08 福州大学 Video fast forward/fast backward speed self-adaptive regulating method based on semantic content
CN106878632A (en) * 2017-02-28 2017-06-20 北京知慧教育科技有限公司 A kind for the treatment of method and apparatus of video data
CN107590442A (en) * 2017-08-22 2018-01-16 华中科技大学 A kind of video semanteme Scene Segmentation based on convolutional neural networks
CN108027834A (en) * 2015-09-21 2018-05-11 高通股份有限公司 Semantic more sense organ insertions for the video search by text
CN109344887A (en) * 2018-09-18 2019-02-15 山东大学 Short video classification methods, system and medium based on multi-modal dictionary learning
CN109685144A (en) * 2018-12-26 2019-04-26 上海众源网络有限公司 The method, apparatus and electronic equipment that a kind of pair of Video Model does to assess
CN110097094A (en) * 2019-04-15 2019-08-06 天津大学 It is a kind of towards personage interaction multiple semantic fusion lack sample classification method
CN110275744A (en) * 2018-03-14 2019-09-24 Tcl集团股份有限公司 It is a kind of for making the method and system of scalable user interface
CN110545299A (en) * 2018-05-29 2019-12-06 腾讯科技(深圳)有限公司 content list information acquisition method, content list information providing method, content list information acquisition device, content list information providing device and content list information equipment
CN110765314A (en) * 2019-10-21 2020-02-07 长沙品先信息技术有限公司 Video semantic structural extraction and labeling method
CN111435453A (en) * 2019-01-14 2020-07-21 中国科学技术大学 Fine-grained image zero sample identification method
US20210182558A1 (en) * 2017-11-10 2021-06-17 Samsung Electronics Co., Ltd. Apparatus for generating user interest information and method therefor
CN114302224A (en) * 2021-12-23 2022-04-08 新华智云科技有限公司 Intelligent video editing method, device, equipment and storage medium

Citations (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20080252727A1 (en) * 2006-06-16 2008-10-16 Lisa Marie Brown People searches by multisensor event correlation
CN103593363A (en) * 2012-08-15 2014-02-19 中国科学院声学研究所 Video content indexing structure building method and video searching method and device

Patent Citations (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20080252727A1 (en) * 2006-06-16 2008-10-16 Lisa Marie Brown People searches by multisensor event correlation
CN103593363A (en) * 2012-08-15 2014-02-19 中国科学院声学研究所 Video content indexing structure building method and video searching method and device

Non-Patent Citations (2)

* Cited by examiner, † Cited by third party
Title
陈丹雯 (Chen Danwen) et al.: "Co-Concept-Boosting video semantic indexing method" (Co-Concept-Boosting视频语义索引方法), Journal of Chinese Computer Systems (《小型微型计算机系统》) *
韩智广 (Han Zhiguang) et al.: "A new semantic index for video retrieval" (一种新的用于视频检索的语义索引), Harmonious Human-Machine Environment 2008 (《和谐人机环境2008》) *

Cited By (21)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN104506947B (en) * 2014-12-24 2017-09-05 福州大学 A kind of video fast forward based on semantic content/rewind speeds self-adapting regulation method
CN104506947A (en) * 2014-12-24 2015-04-08 福州大学 Video fast forward/fast backward speed self-adaptive regulating method based on semantic content
CN108027834A (en) * 2015-09-21 2018-05-11 高通股份有限公司 Semantic more sense organ insertions for the video search by text
CN106878632A (en) * 2017-02-28 2017-06-20 北京知慧教育科技有限公司 A kind for the treatment of method and apparatus of video data
CN106878632B (en) * 2017-02-28 2020-07-10 北京知慧教育科技有限公司 Video data processing method and device
CN107590442A (en) * 2017-08-22 2018-01-16 华中科技大学 A kind of video semanteme Scene Segmentation based on convolutional neural networks
US20210182558A1 (en) * 2017-11-10 2021-06-17 Samsung Electronics Co., Ltd. Apparatus for generating user interest information and method therefor
US11678012B2 (en) * 2017-11-10 2023-06-13 Samsung Electronics Co., Ltd. Apparatus and method for user interest information generation
CN110275744A (en) * 2018-03-14 2019-09-24 Tcl集团股份有限公司 It is a kind of for making the method and system of scalable user interface
CN110275744B (en) * 2018-03-14 2021-11-23 Tcl科技集团股份有限公司 Method and system for making scalable user interface
CN110545299B (en) * 2018-05-29 2022-04-05 腾讯科技(深圳)有限公司 Content list information acquisition method, content list information providing method, content list information acquisition device, content list information providing device and content list information equipment
CN110545299A (en) * 2018-05-29 2019-12-06 腾讯科技(深圳)有限公司 content list information acquisition method, content list information providing method, content list information acquisition device, content list information providing device and content list information equipment
CN109344887A (en) * 2018-09-18 2019-02-15 山东大学 Short video classification methods, system and medium based on multi-modal dictionary learning
CN109685144A (en) * 2018-12-26 2019-04-26 上海众源网络有限公司 The method, apparatus and electronic equipment that a kind of pair of Video Model does to assess
CN111435453A (en) * 2019-01-14 2020-07-21 中国科学技术大学 Fine-grained image zero sample identification method
CN111435453B (en) * 2019-01-14 2022-07-22 中国科学技术大学 Fine-grained image zero sample identification method
CN110097094A (en) * 2019-04-15 2019-08-06 天津大学 It is a kind of towards personage interaction multiple semantic fusion lack sample classification method
CN110097094B (en) * 2019-04-15 2023-06-13 天津大学 Multiple semantic fusion few-sample classification method for character interaction
CN110765314A (en) * 2019-10-21 2020-02-07 长沙品先信息技术有限公司 Video semantic structural extraction and labeling method
CN114302224A (en) * 2021-12-23 2022-04-08 新华智云科技有限公司 Intelligent video editing method, device, equipment and storage medium
CN114302224B (en) * 2021-12-23 2023-04-07 新华智云科技有限公司 Intelligent video editing method, device, equipment and storage medium

Also Published As

Publication number Publication date
CN104036023B (en) 2017-05-10

Similar Documents

Publication Publication Date Title
CN104036023A (en) Method for creating context fusion tree video semantic indexes
Chang et al. Semantic pooling for complex event analysis in untrimmed videos
Duan et al. Exploiting web images for event recognition in consumer videos: A multiple source domain adaptation approach
Habibian et al. Videostory: A new multimedia embedding for few-example recognition and translation of events
US20170357878A1 (en) Multi-dimensional realization of visual content of an image collection
Garcia et al. Context-aware embeddings for automatic art analysis
US20180293313A1 (en) Video content retrieval system
CN102799684B (en) The index of a kind of video and audio file cataloguing, metadata store index and searching method
Zhou et al. Conceptlearner: Discovering visual concepts from weakly labeled image collections
Dal Bianco et al. A practical and effective sampling selection strategy for large scale deduplication
CN102890700A (en) Method for retrieving similar video clips based on sports competition videos
CN103425757A (en) Cross-medial personage news searching method and system capable of fusing multi-mode information
Zhang et al. Enhancing video event recognition using automatically constructed semantic-visual knowledge base
CN104391924A (en) Mixed audio and video search method and system
CN105678244B (en) A kind of near video search method based on improved edit-distance
CN107515934A (en) A kind of film semanteme personalized labels optimization method based on big data
CN106649663A (en) Video copy detection method based on compact video representation
CN107291895A (en) A kind of quick stratification document searching method
CN103617263A (en) Television advertisement film automatic detection method based on multi-mode characteristics
CN103761286B (en) A kind of Service Source search method based on user interest
CN103778206A (en) Method for providing network service resources
CN108241713A (en) A kind of inverted index search method based on polynary cutting
CN107818183A (en) A kind of Party building video pushing method based on three stage combination recommended technologies
CN110968721A (en) Method and system for searching infringement of mass images and computer readable storage medium thereof
CN104657376A (en) Searching method and searching device for video programs based on program relationship

Legal Events

Date Code Title Description
C06 Publication
PB01 Publication
C10 Entry into substantive examination
SE01 Entry into force of request for substantive examination
GR01 Patent grant
GR01 Patent grant
CF01 Termination of patent right due to non-payment of annual fee
CF01 Termination of patent right due to non-payment of annual fee

Granted publication date: 20170510

Termination date: 20200626