CN101430689A - Detection method for figure action in video - Google Patents
- Publication number
- CN101430689A (application CN200810137508A / CNA2008101375080A)
- Authority
- CN
- China
- Prior art keywords
- video
- action
- steps
- model
- space
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Pending
Landscapes
- Information Retrieval, Db Structures And Fs Structures Therefor (AREA)
Abstract
The invention provides a method for detecting figure actions in video. It relates to content-based video detection and solves the problem that existing multimedia retrieval methods cannot detect the action information in video material. The method comprises the following steps: video shots are segmented by a shot-boundary detection method based on a graph partition model; for consecutive video frames, a spatio-temporal saliency map is obtained by building a dynamic saliency model on the basis of the per-frame saliency maps; the attention shift of the spatio-temporal saliency map is computed by setting a threshold and separating out the attention-shift values that exceed it; within the same action, the separated attention-shift values are subjected to 3D sequence slicing, in which the frames are stacked, so as to establish the action detection model. With the invention, massive video material can be indexed according to the figure-action semantic information it contains, making it convenient for users to quickly browse and retrieve video and watch the content they are interested in.
Description
Technical field
The invention belongs to the field of content-based video detection. By extracting figure actions from video content and indexing them efficiently, it achieves strong robustness to viewpoint changes in the general sense and to duration changes, thereby realizing a method for action-based video indexing and retrieval.
Background technology
The large-scale emergence of multimedia information on the Internet has drawn broad attention to techniques for organizing, indexing, and retrieving it. At present, however, multimedia retrieval mainly relies on keyword matching (as in the video search engines of Google and Baidu). Keyword-matching methods do not understand the video content; instead, a video is classified according to the understanding of the web-page author or of the person who shot or produced it.
In recent years, content-based multimedia retrieval has gradually matured: the content of the multimedia material is analyzed, its low-level features (such as color and texture) are extracted, and these serve as a new matching criterion. Although matching on low-level features can, to some extent, reflect the similarity in content between two pieces of multimedia, the objectively existing semantic gap remains a problem that this technology has not yet overcome. Extracting mid-level semantics from multimedia content, particularly from images and video, is considered an important way to bridge the semantic gap, as has been verified in sports-video analysis. Action information in video material is a very important kind of semantic information; especially in film and television videos, the development of the story tends to be presented through specific actions, which are also the focus of users' browsing and retrieval. If video material can be indexed by action information, it will greatly help users browse and retrieve the video clips they are interested in.
Summary of the invention
The present invention provides a method for detecting figure actions in video, to solve the problem that existing multimedia retrieval methods cannot detect the action information in video material. The invention comprises the following steps:
Step 1: segment the video into shots by a shot-boundary detection method based on a graph partition model;
Step 2: for consecutive video frames, obtain a spatio-temporal saliency map by building a dynamic saliency model on the basis of the per-frame saliency maps;
Step 3: compute the attention-shift variable A_Shift of the spatio-temporal saliency map by formula (1), where CenterDis() denotes the distance between the centers of the foci of attention of adjacent frames, and DiameterVar() denotes the change in radius of the circumscribed circles of the foci of attention of adjacent frames;
Step 4: set a threshold and separate out the attention-shift values A_Shift that exceed it;
Step 5: within the same action, apply 3D sequence slicing, in which the frames are stacked, to the separated attention-shift values A_Shift, so as to establish the action detection model.
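Steps 3 and 4 can be sketched as follows. The patent's formula for A_Shift appears only as an image and is not reproduced in this text, so the combination of CenterDis() and DiameterVar() used below (a weighted sum, with hypothetical weights `w_center` and `w_radius`) is an assumption for illustration; each frame's focus of attention is represented here as the center and radius of its circumscribed circle.

```python
import math

def center_dis(focus_a, focus_b):
    """CenterDis(): distance between the attention-focus centers of adjacent frames.
    Each focus is a (cx, cy, radius) tuple of its circumscribed circle."""
    return math.hypot(focus_b[0] - focus_a[0], focus_b[1] - focus_a[1])

def diameter_var(focus_a, focus_b):
    """DiameterVar(): change in circumscribed-circle radius between adjacent frames."""
    return abs(focus_b[2] - focus_a[2])

def attention_shift(focus_a, focus_b, w_center=1.0, w_radius=1.0):
    # Hypothetical combination: the patent's formula (1) is not reproduced in
    # this text, so a weighted sum of the two terms is assumed here.
    return (w_center * center_dis(focus_a, focus_b)
            + w_radius * diameter_var(focus_a, focus_b))

def segment_actions(foci, threshold):
    """Step 4: frame indices where A_Shift exceeds the threshold are taken
    as boundaries between actions."""
    boundaries = []
    for i in range(1, len(foci)):
        if attention_shift(foci[i - 1], foci[i]) > threshold:
            boundaries.append(i)
    return boundaries
```

With this sketch, a sequence of nearly stationary foci followed by a sudden jump yields a single boundary at the jump.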
Beneficial effects: massive video material can be indexed according to the figure-action semantic information it contains, making it convenient for users to quickly browse and retrieve video and watch the content they are interested in. First, the invention provides a model based on saliency jumps for segmenting video actions. Second, it proposes analyzing the physical relations within a shot to effectively extract the location semantics of the video material. Further, it provides a novel similarity computation model that makes action-similarity computation insensitive to viewpoint change, scale change, gradual appearance change, and duration change. Finally, it proposes an index structure based on hierarchical clustering of local features, indexing by 3D visual words, thereby achieving high accuracy in real-time retrieval.
The objective of the invention is to extract and exploit the figure-action semantic information in video material, build an index of the video-material library, and thereby enable users to browse or retrieve video material by figure action. The significance of the invention is that it proposes the generation of sliced visual words and a scalable similarity-matching algorithm which, combined with analysis of a human-attention model and hierarchical clustering of local features, realize effective search and browsing of figure actions in video.
Embodiment
Embodiment one: this embodiment consists of the following steps:
Step 1: segment the video into shots by a shot-boundary detection method based on a graph partition model;
Step 2: for consecutive video frames, obtain a spatio-temporal saliency map by building a dynamic saliency model on the basis of the per-frame saliency maps;
Step 3: compute the attention-shift variable A_Shift of the spatio-temporal saliency map by formula (1), where CenterDis() denotes the distance between the centers of the foci of attention of adjacent frames, and DiameterVar() denotes the change in radius of the circumscribed circles of the foci of attention of adjacent frames;
Step 4: set a threshold and separate out the attention-shift values A_Shift that exceed it; once an A_Shift value exceeds the threshold range, it is considered that a switch of the action in focus occurs in the shot at that moment;
Step 5: within the same action, apply 3D sequence slicing, in which the frames are stacked, to the separated attention-shift values A_Shift, so as to establish the action detection model. The slices generated in this step can be regarded as spatio-temporal ordered sets formed by stacking multiple frames, and these sets constitute the structural primitives of the action index model.
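The frame-stacking of step 5 can be sketched as follows: the focus-of-attention patch of each frame is cropped and stacked along the time axis, forming a spatio-temporal ordered set. The fixed square patch size and the (cx, cy) focus representation are simplifying assumptions for illustration, not the patent's exact construction.

```python
import numpy as np

def sequence_slice(frames, foci, size=32):
    """Stack the focus-of-attention patch from each frame along the time
    axis, forming the spatio-temporal ordered set described in step 5.
    `frames` is a list of HxW grayscale arrays; `foci` gives the (cx, cy)
    focus center per frame.  The fixed patch `size` is an assumption."""
    half = size // 2
    patches = []
    for frame, (cx, cy) in zip(frames, foci):
        h, w = frame.shape
        # Clamp the crop window so it stays inside the image bounds.
        x0 = min(max(cx - half, 0), w - size)
        y0 = min(max(cy - half, 0), h - size)
        patches.append(frame[y0:y0 + size, x0:x0 + size])
    return np.stack(patches, axis=0)  # shape: (T, size, size)
```

The resulting (T, size, size) volume is the kind of structural primitive from which the spatio-temporal features of embodiment two would be extracted.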
This embodiment first removes close-up shots with a high character-occupancy rate from the video, then filters out the background information of the scene with a visual-attention computation model, and subsequently generates and quantizes temporally sliced local features; combined with dynamic time warping, the similarity between corresponding actions is computed effectively. For indexing the action data, the invention proposes an indexing algorithm based on the idea of hierarchical clustering of local features, which effectively meets the real-time requirement of retrieval, thereby realizing fast and accurate video browsing and retrieval based on figure actions.
Embodiment two: on the basis of embodiment one, this embodiment further specifies that establishing the action detection model described in step 5 comprises the following steps:
Step A1: describe each 3D sequence slice in space-time with 3D-SIFT spatio-temporal features;
Step A2: quantize all extracted 3D-SIFT spatio-temporal features in a high-dimensional space, and organize the quantization results into a hierarchical clustering model by hierarchical K-means clustering;
Step A3: at the leaves of this hierarchical clustering model, describe each clustered feature subspace as a visual word; the visual word quantizes all 3D-SIFT features converging to its cluster center into one word, and an inverted index is built over the 3D sequence slices from which these 3D-SIFT features were extracted.
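Steps A2 and A3 can be sketched as follows: a hierarchical K-means over descriptor vectors whose leaves become visual words, plus an inverted index mapping each word to the 3D sequence slices that contain it. The minimal k-means below (deterministic initialization, plain Lloyd iterations) and the `branch`/`depth` parameters are illustrative stand-ins, and small toy vectors stand in for real 3D-SIFT descriptors.

```python
import numpy as np

def kmeans(X, k, iters=20):
    """Minimal Lloyd's k-means (a stand-in for the hierarchical K-means step).
    Deterministic initialization: evenly spaced samples as initial centers."""
    X = np.asarray(X, dtype=float)
    centers = X[np.linspace(0, len(X) - 1, k).astype(int)].copy()
    for _ in range(iters):
        labels = np.argmin(((X[:, None] - centers[None]) ** 2).sum(-1), axis=1)
        for j in range(k):
            if np.any(labels == j):
                centers[j] = X[labels == j].mean(axis=0)
    return centers, labels

def build_vocabulary(descriptors, branch=4, depth=2):
    """Hierarchical clustering of descriptors: recursively split each cluster
    into `branch` children; the leaf centers are the visual words (step A3)."""
    descriptors = np.asarray(descriptors, dtype=float)
    if depth == 0 or len(descriptors) <= branch:
        return [descriptors.mean(axis=0)]
    _, labels = kmeans(descriptors, branch)
    words = []
    for j in range(branch):
        if np.any(labels == j):
            words.extend(build_vocabulary(descriptors[labels == j], branch, depth - 1))
    return words

def inverted_index(slice_descriptors, words):
    """Map each visual word to the set of 3D sequence slices whose
    descriptors quantize to it (the inverted index of step A3)."""
    words = np.asarray(words)
    index = {}
    for slice_id, descs in slice_descriptors.items():
        for d in descs:
            w = int(np.argmin(((words - d) ** 2).sum(-1)))
            index.setdefault(w, set()).add(slice_id)
    return index
```

At query time the inverted index lets a slice be retrieved through the visual words it contains, without scanning the whole library.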
An action sequence passed through this inverted index is called a 3D visual sentence in this embodiment, because it is composed of 3D visual words arranged in temporal order. Further, this embodiment uses term frequency-inverse document frequency (TF-IDF) from text retrieval to compute the importance of each word in a 3D visual sentence, thereby assigning a different weight to each 3D visual word in the sentence. A 3D visual word encodes the temporal information of an action, its spatial correspondences, its motion information, and the appearance attributes of the moving object.
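The TF-IDF weighting borrowed from text retrieval can be sketched as follows, treating each 3D visual sentence as a document of visual-word IDs. The exact TF and IDF normalizations used by the patent are not specified in this text, so the common tf x log(N/df) form is assumed here.

```python
import math
from collections import Counter

def tfidf_weights(sentences):
    """TF-IDF weights for 3D visual words within 3D visual sentences, as in
    text retrieval: a word frequent in one action but rare across the corpus
    receives a high weight.  `sentences` is a list of lists of word IDs."""
    n = len(sentences)
    # Document frequency: in how many sentences does each word appear?
    df = Counter()
    for s in sentences:
        df.update(set(s))
    weights = []
    for s in sentences:
        tf = Counter(s)
        weights.append({w: (tf[w] / len(s)) * math.log(n / df[w]) for w in tf})
    return weights
```

A word appearing in every sentence gets weight 0 (log(N/N) = 0), which is the desired behavior for uninformative words.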
Embodiment three: on the basis of embodiment two, this embodiment further specifies that the building method of the hierarchical clustering model described in step A3 comprises the following steps:
Step B1: search for the two 3D vocabularies to be matched through the hierarchical structure of the model, and judge whether the number of co-occurring visual words in the two vocabularies exceeds a threshold; if yes, go to step B2; if no, repeat step B1 and search again;
Step B2: compute the similarity by dynamic time warping.
To achieve rotation, scale, and viewpoint invariance in action matching, this embodiment proposes a 3D visual sentence matching algorithm based on dynamic time warping. The dynamic time warping algorithm measures the distance between two feature strings of unequal length in temporal order. At each feature match, dynamic time warping searches the remaining features for the currently optimal matching feature point. It adopts the idea of dynamic programming and therefore yields a near-optimal matching result.
First define two 3D visual sentences C = &lt;c_0, c_1, c_2, ..., c_m&gt; and C' = &lt;c_0', c_1', c_2', ..., c_m'&gt;. Each 3D visual sentence represents one action extracted in embodiment one, and the lengths m and m' are not necessarily equal. To measure the similarity of these two 3D visual sentences, we define the truncation of a visual sentence as Tail(C) = &lt;c_1, c_2, ..., c_m&gt;, and then compute the similarity of the two 3D visual sentences by formula (2):

DTW(&lt;&gt;, &lt;&gt;) = 0
DTW(C, &lt;&gt;) = DTW(&lt;&gt;, C') = ∞ (2)
DTW(C, C') = ||c_0 - c_0'|| + min{ DTW(Tail(C), Tail(C')), DTW(Tail(C), C'), DTW(C, Tail(C')) }
The computation of this similarity is carried out by dynamic programming. In general, ||c_i - c_j|| can be the L2 or cosine distance between two 3D visual words. In the implementation, since all 3D visual sentences have been extracted in advance, this computation can be completed efficiently.
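The recurrence of formula (2) can be implemented directly with memoized dynamic programming. This is a generic DTW sketch in which the word distance ||c_i - c_j|| is passed in as a function (e.g. L2 or cosine), as the text allows; it is an illustration of the technique, not the patent's exact implementation.

```python
from functools import lru_cache

def dtw_distance(c, c_prime, dist):
    """Dynamic time warping between two 3D visual sentences C and C' per
    formula (2): empty vs empty costs 0, empty vs non-empty costs infinity,
    otherwise the head-pair distance plus the cheapest of the three tail
    alignments (Tail removes the first word of a sentence)."""
    @lru_cache(maxsize=None)
    def rec(i, j):
        # rec(i, j) is DTW over the suffixes c[i:] and c_prime[j:].
        if i == len(c) and j == len(c_prime):
            return 0.0
        if i == len(c) or j == len(c_prime):
            return float("inf")
        return dist(c[i], c_prime[j]) + min(
            rec(i + 1, j + 1),  # DTW(Tail(C), Tail(C'))
            rec(i + 1, j),      # DTW(Tail(C), C')
            rec(i, j + 1),      # DTW(C, Tail(C'))
        )
    return rec(0, 0)
```

Because each word may align with several consecutive words of the other sentence, two sentences of different lengths that describe the same action at different speeds can still reach a small distance.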
Claims (3)
1. A method for detecting figure actions in video, characterized in that it comprises the following steps:
Step 1: segment the video into shots by a shot-boundary detection method based on a graph partition model;
Step 2: for consecutive video frames, obtain a spatio-temporal saliency map by building a dynamic saliency model on the basis of the per-frame saliency maps;
Step 3: compute the attention-shift variable A_Shift of the spatio-temporal saliency map by formula (1), where CenterDis() denotes the distance between the centers of the foci of attention of adjacent frames, and DiameterVar() denotes the change in radius of the circumscribed circles of the foci of attention of adjacent frames;
Step 4: set a threshold and separate out the attention-shift values A_Shift that exceed it;
Step 5: within the same action, apply 3D sequence slicing, in which the frames are stacked, to the separated attention-shift values A_Shift, so as to establish the action detection model.
2. The method for detecting figure actions in video according to claim 1, characterized in that establishing the action detection model described in step 5 comprises the following steps:
Step A1: describe each 3D sequence slice in space-time with 3D-SIFT spatio-temporal features;
Step A2: quantize all extracted 3D-SIFT spatio-temporal features in a high-dimensional space, and organize the quantization results into a hierarchical clustering model by hierarchical K-means clustering;
Step A3: at the leaves of the hierarchical clustering model obtained in step A2, describe each clustered feature subspace as a visual word, the visual word being a word obtained by quantizing all 3D-SIFT features converging to each cluster center, and build an inverted index over the 3D sequence slice of each 3D-SIFT feature by its quantized value.
3. The method for detecting figure actions in video according to claim 2, characterized in that the building method of the hierarchical clustering model described in step A3 comprises the following steps:
Step B1: search for the two 3D vocabularies to be matched through the hierarchical structure of the model, and judge whether the number of co-occurring visual words in the two vocabularies exceeds a threshold; if yes, go to step B2; if no, repeat step B1 and search again;
Step B2: compute the similarity by dynamic time warping.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CNA2008101375080A CN101430689A (en) | 2008-11-12 | 2008-11-12 | Detection method for figure action in video |
Publications (1)
Publication Number | Publication Date |
---|---|
CN101430689A true CN101430689A (en) | 2009-05-13 |
Family
ID=40646094
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CNA2008101375080A Pending CN101430689A (en) | 2008-11-12 | 2008-11-12 | Detection method for figure action in video |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN101430689A (en) |
Cited By (10)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN102668548A (en) * | 2009-12-17 | 2012-09-12 | 佳能株式会社 | Video information processing method and video information processing apparatus |
CN102668548B (en) * | 2009-12-17 | 2015-04-15 | 佳能株式会社 | Video information processing method and video information processing apparatus |
CN105049674A (en) * | 2015-07-01 | 2015-11-11 | 中科创达软件股份有限公司 | Video image processing method and system |
CN106534951A (en) * | 2016-11-30 | 2017-03-22 | 北京小米移动软件有限公司 | Method and apparatus for video segmentation |
CN106534951B (en) * | 2016-11-30 | 2020-10-09 | 北京小米移动软件有限公司 | Video segmentation method and device |
WO2019144840A1 (en) * | 2018-01-25 | 2019-08-01 | 北京一览科技有限公司 | Method and apparatus for acquiring video semantic information |
CN109982109A (en) * | 2019-04-03 | 2019-07-05 | 睿魔智能科技(深圳)有限公司 | The generation method and device of short-sighted frequency, server and storage medium |
CN109982109B (en) * | 2019-04-03 | 2021-08-03 | 睿魔智能科技(深圳)有限公司 | Short video generation method and device, server and storage medium |
CN110097115A (en) * | 2019-04-28 | 2019-08-06 | 南开大学 | A kind of saliency object detecting method based on attention metastasis |
CN110097115B (en) * | 2019-04-28 | 2022-11-25 | 南开大学 | Video salient object detection method based on attention transfer mechanism |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
C06 | Publication | ||
PB01 | Publication | ||
C10 | Entry into substantive examination | ||
SE01 | Entry into force of request for substantive examination | ||
C12 | Rejection of a patent application after its publication | ||
RJ01 | Rejection of invention patent application after publication |
Open date: 20090513