CN101430689A - Detection method for figure action in video - Google Patents
- Publication number
- CN101430689A (application CN200810137508A / CNA2008101375080A)
- Authority
- CN
- China
- Prior art keywords
- video
- action
- steps
- model
- space
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Pending
Landscapes
- Information Retrieval, Db Structures And Fs Structures Therefor (AREA)
Abstract
The invention provides a method for detecting figure actions in video. It relates to content-based video detection and solves the problem that existing multimedia retrieval methods cannot detect the action information in video material. The method comprises the following steps: video shots are segmented by a shot-boundary detection method based on a graph partition model; for consecutive video frames, a spatio-temporal saliency map is obtained by building a dynamic saliency model on the basis of the per-frame saliency maps; the attention shift of the spatio-temporal saliency map is computed by setting a threshold and separating out the attention-shift values that exceed it; within the same action, the separated attention-shift values are subjected to 3D sequence slicing, in which the frames are stacked, so as to establish the action detection model. With the invention, massive video material can be indexed according to the figure-action semantic information it contains, making it convenient for users to quickly browse and retrieve video and watch the content they are interested in.
Description
Technical field
The invention belongs to the field of content-based video detection. By extracting figure actions from video content and indexing them efficiently, it achieves strong robustness to viewpoint changes in the general sense and to duration changes, thereby realizing a method for action-based video indexing and retrieval.
Background technology
The large-scale emergence of multimedia information on the Internet has drawn broad attention to techniques for organizing, indexing, and retrieving it. At present, however, multimedia retrieval mainly relies on keyword matching (as in the video search engines of Google and Baidu). Keyword-matching methods do not understand the video content; instead, a video is classified according to the understanding of the web-page author or of the person who shot or produced it.
In recent years, content-based multimedia retrieval has gradually matured: the content of the multimedia material is analyzed, its low-level features (such as color and texture) are extracted, and these serve as a new matching criterion. Although matching on low-level features can, to some extent, reflect the similarity in content between two pieces of multimedia, the objectively existing semantic gap remains a problem that this technology has not yet overcome. Extracting mid-level semantics from multimedia content, particularly from images and video, is considered an important way to bridge the semantic gap, as has been verified in sports-video analysis. Action information in video material is a very important kind of semantic information; especially in film and television videos, the development of the story tends to be presented through specific actions, which are also the focus of users' browsing and retrieval. If video material can be indexed by action information, it will greatly help users browse and retrieve the video clips they are interested in.
Summary of the invention
The present invention provides a method for detecting figure actions in video, to solve the problem that existing multimedia retrieval methods cannot detect the action information in video material. The invention comprises the following steps:
Step 1: segment the video into shots by a shot-boundary detection method based on a graph partition model;
Step 2: for consecutive video frames, obtain a spatio-temporal saliency map by building a dynamic saliency model on the basis of the per-frame saliency maps;
Step 3: compute the attention-shift variable A_Shift of the spatio-temporal saliency map by formula (1), where CenterDis() denotes the distance between the centers of the foci of attention of adjacent frames, and DiameterVar() denotes the change in radius of the circumscribed circles of the foci of attention of adjacent frames;
Step 4: set a threshold and separate out the attention-shift values A_Shift that exceed it;
Step 5: within the same action, apply 3D sequence slicing, in which the frames are stacked, to the separated attention-shift values A_Shift, so as to establish the action detection model.
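Steps 3 and 4 can be sketched as follows. The patent's formula for A_Shift appears only as an image and is not reproduced in this text, so the combination of CenterDis() and DiameterVar() used below (a weighted sum, with hypothetical weights `w_center` and `w_radius`) is an assumption for illustration; each frame's focus of attention is represented here as the center and radius of its circumscribed circle.

```python
import math

def center_dis(focus_a, focus_b):
    """CenterDis(): distance between the attention-focus centers of adjacent frames.
    Each focus is a (cx, cy, radius) tuple of its circumscribed circle."""
    return math.hypot(focus_b[0] - focus_a[0], focus_b[1] - focus_a[1])

def diameter_var(focus_a, focus_b):
    """DiameterVar(): change in circumscribed-circle radius between adjacent frames."""
    return abs(focus_b[2] - focus_a[2])

def attention_shift(focus_a, focus_b, w_center=1.0, w_radius=1.0):
    # Hypothetical combination: the patent's formula (1) is not reproduced in
    # this text, so a weighted sum of the two terms is assumed here.
    return (w_center * center_dis(focus_a, focus_b)
            + w_radius * diameter_var(focus_a, focus_b))

def segment_actions(foci, threshold):
    """Step 4: frame indices where A_Shift exceeds the threshold are taken
    as boundaries between actions."""
    boundaries = []
    for i in range(1, len(foci)):
        if attention_shift(foci[i - 1], foci[i]) > threshold:
            boundaries.append(i)
    return boundaries
```

With this sketch, a sequence of nearly stationary foci followed by a sudden jump yields a single boundary at the jump.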
Beneficial effects: massive video material can be indexed according to the figure-action semantic information it contains, making it convenient for users to quickly browse and retrieve video and watch the content they are interested in. First, the invention provides a model based on saliency jumps for segmenting video actions. Second, it proposes analyzing the physical relations within a shot to effectively extract the location semantics of the video material. Further, it provides a novel similarity computation model that makes action-similarity computation insensitive to viewpoint change, scale change, gradual appearance change, and duration change. Finally, it proposes an index structure based on hierarchical clustering of local features, indexing by 3D visual words, thereby achieving high accuracy in real-time retrieval.
The objective of the invention is to extract and exploit the figure-action semantic information in video material, build an index of the video-material library, and thereby enable users to browse or retrieve video material by figure action. The significance of the invention is that it proposes the generation of sliced visual words and a scalable similarity-matching algorithm which, combined with analysis of a human-attention model and hierarchical clustering of local features, realize effective search and browsing of figure actions in video.
Embodiment
Embodiment one: this embodiment consists of the following steps:
Step 1: segment the video into shots by a shot-boundary detection method based on a graph partition model;
Step 2: for consecutive video frames, obtain a spatio-temporal saliency map by building a dynamic saliency model on the basis of the per-frame saliency maps;
Step 3: compute the attention-shift variable A_Shift of the spatio-temporal saliency map by formula (1), where CenterDis() denotes the distance between the centers of the foci of attention of adjacent frames, and DiameterVar() denotes the change in radius of the circumscribed circles of the foci of attention of adjacent frames;
Step 4: set a threshold and separate out the attention-shift values A_Shift that exceed it; once an A_Shift value exceeds the threshold range, it is considered that a switch of the action in focus occurs in the shot at that moment;
Step 5: within the same action, apply 3D sequence slicing, in which the frames are stacked, to the separated attention-shift values A_Shift, so as to establish the action detection model. The slices generated in this step can be regarded as spatio-temporal ordered sets formed by stacking multiple frames, and these sets constitute the structural primitives of the action index model.
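The frame-stacking of step 5 can be sketched as follows: the focus-of-attention patch of each frame is cropped and stacked along the time axis, forming a spatio-temporal ordered set. The fixed square patch size and the (cx, cy) focus representation are simplifying assumptions for illustration, not the patent's exact construction.

```python
import numpy as np

def sequence_slice(frames, foci, size=32):
    """Stack the focus-of-attention patch from each frame along the time
    axis, forming the spatio-temporal ordered set described in step 5.
    `frames` is a list of HxW grayscale arrays; `foci` gives the (cx, cy)
    focus center per frame.  The fixed patch `size` is an assumption."""
    half = size // 2
    patches = []
    for frame, (cx, cy) in zip(frames, foci):
        h, w = frame.shape
        # Clamp the crop window so it stays inside the image bounds.
        x0 = min(max(cx - half, 0), w - size)
        y0 = min(max(cy - half, 0), h - size)
        patches.append(frame[y0:y0 + size, x0:x0 + size])
    return np.stack(patches, axis=0)  # shape: (T, size, size)
```

The resulting (T, size, size) volume is the kind of structural primitive from which the spatio-temporal features of embodiment two would be extracted.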
This embodiment first removes close-up shots with a high character-occupancy rate from the video, then filters out the background information of the scene with a visual-attention computation model, and subsequently generates and quantizes temporally sliced local features; combined with dynamic time warping, the similarity between corresponding actions is computed effectively. For indexing the action data, the invention proposes an indexing algorithm based on the idea of hierarchical clustering of local features, which effectively meets the real-time requirement of retrieval, thereby realizing fast and accurate video browsing and retrieval based on figure actions.
Embodiment two: on the basis of embodiment one, this embodiment further specifies that establishing the action detection model described in step 5 comprises the following steps:
Step A1: describe each 3D sequence slice in space-time with 3D-SIFT spatio-temporal features;
Step A2: quantize all extracted 3D-SIFT spatio-temporal features in a high-dimensional space, and organize the quantization results into a hierarchical clustering model by hierarchical K-means clustering;
Step A3: at the leaves of this hierarchical clustering model, describe each clustered feature subspace as a visual word; the visual word quantizes all 3D-SIFT features converging to its cluster center into one word, and an inverted index is built over the 3D sequence slices from which these 3D-SIFT features were extracted.
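Steps A2 and A3 can be sketched as follows: a hierarchical K-means over descriptor vectors whose leaves become visual words, plus an inverted index mapping each word to the 3D sequence slices that contain it. The minimal k-means below (deterministic initialization, plain Lloyd iterations) and the `branch`/`depth` parameters are illustrative stand-ins, and small toy vectors stand in for real 3D-SIFT descriptors.

```python
import numpy as np

def kmeans(X, k, iters=20):
    """Minimal Lloyd's k-means (a stand-in for the hierarchical K-means step).
    Deterministic initialization: evenly spaced samples as initial centers."""
    X = np.asarray(X, dtype=float)
    centers = X[np.linspace(0, len(X) - 1, k).astype(int)].copy()
    for _ in range(iters):
        labels = np.argmin(((X[:, None] - centers[None]) ** 2).sum(-1), axis=1)
        for j in range(k):
            if np.any(labels == j):
                centers[j] = X[labels == j].mean(axis=0)
    return centers, labels

def build_vocabulary(descriptors, branch=4, depth=2):
    """Hierarchical clustering of descriptors: recursively split each cluster
    into `branch` children; the leaf centers are the visual words (step A3)."""
    descriptors = np.asarray(descriptors, dtype=float)
    if depth == 0 or len(descriptors) <= branch:
        return [descriptors.mean(axis=0)]
    _, labels = kmeans(descriptors, branch)
    words = []
    for j in range(branch):
        if np.any(labels == j):
            words.extend(build_vocabulary(descriptors[labels == j], branch, depth - 1))
    return words

def inverted_index(slice_descriptors, words):
    """Map each visual word to the set of 3D sequence slices whose
    descriptors quantize to it (the inverted index of step A3)."""
    words = np.asarray(words)
    index = {}
    for slice_id, descs in slice_descriptors.items():
        for d in descs:
            w = int(np.argmin(((words - d) ** 2).sum(-1)))
            index.setdefault(w, set()).add(slice_id)
    return index
```

At query time the inverted index lets a slice be retrieved through the visual words it contains, without scanning the whole library.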
An action sequence passed through this inverted index is called a 3D visual sentence in this embodiment, because it is composed of 3D visual words arranged in temporal order. Further, this embodiment uses term frequency-inverse document frequency (TF-IDF) from text retrieval to compute the importance of each word in a 3D visual sentence, thereby assigning a different weight to each 3D visual word in the sentence. A 3D visual word encodes the temporal information of an action, its spatial correspondences, its motion information, and the appearance attributes of the moving object.
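The TF-IDF weighting borrowed from text retrieval can be sketched as follows, treating each 3D visual sentence as a document of visual-word IDs. The exact TF and IDF normalizations used by the patent are not specified in this text, so the common tf x log(N/df) form is assumed here.

```python
import math
from collections import Counter

def tfidf_weights(sentences):
    """TF-IDF weights for 3D visual words within 3D visual sentences, as in
    text retrieval: a word frequent in one action but rare across the corpus
    receives a high weight.  `sentences` is a list of lists of word IDs."""
    n = len(sentences)
    # Document frequency: in how many sentences does each word appear?
    df = Counter()
    for s in sentences:
        df.update(set(s))
    weights = []
    for s in sentences:
        tf = Counter(s)
        weights.append({w: (tf[w] / len(s)) * math.log(n / df[w]) for w in tf})
    return weights
```

A word appearing in every sentence gets weight 0 (log(N/N) = 0), which is the desired behavior for uninformative words.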
Embodiment three: on the basis of embodiment two, this embodiment further specifies that the building method of the hierarchical clustering model described in step A3 comprises the following steps:
Step B1: search for the two 3D vocabularies to be matched through the hierarchical structure of the model, and judge whether the number of co-occurring visual words in the two vocabularies exceeds a threshold; if yes, go to step B2; if no, repeat step B1 and search again;
Step B2: compute the similarity by dynamic time warping.
To achieve rotation, scale, and viewpoint invariance in action matching, this embodiment proposes a 3D visual sentence matching algorithm based on dynamic time warping. The dynamic time warping algorithm measures the distance between two feature strings of unequal length in temporal order. At each feature match, dynamic time warping searches the remaining features for the currently optimal matching feature point. It adopts the idea of dynamic programming and therefore yields a near-optimal matching result.
First define two 3D visual sentences C = &lt;c_0, c_1, c_2, ..., c_m&gt; and C' = &lt;c_0', c_1', c_2', ..., c_m'&gt;. Each 3D visual sentence represents one action extracted in embodiment one, and the lengths m and m' are not necessarily equal. To measure the similarity of these two 3D visual sentences, we define the truncation of a visual sentence as Tail(C) = &lt;c_1, c_2, ..., c_m&gt;, and then compute the similarity of the two 3D visual sentences by formula (2):

DTW(&lt;&gt;, &lt;&gt;) = 0
DTW(C, &lt;&gt;) = DTW(&lt;&gt;, C') = ∞ (2)
DTW(C, C') = ||c_0 - c_0'|| + min{ DTW(Tail(C), Tail(C')), DTW(Tail(C), C'), DTW(C, Tail(C')) }
The computation of this similarity is carried out by dynamic programming. In general, ||c_i - c_j|| can be the L2 or cosine distance between two 3D visual words. In the implementation, since all 3D visual sentences have been extracted in advance, this computation can be completed efficiently.
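The recurrence of formula (2) can be implemented directly with memoized dynamic programming. This is a generic DTW sketch in which the word distance ||c_i - c_j|| is passed in as a function (e.g. L2 or cosine), as the text allows; it is an illustration of the technique, not the patent's exact implementation.

```python
from functools import lru_cache

def dtw_distance(c, c_prime, dist):
    """Dynamic time warping between two 3D visual sentences C and C' per
    formula (2): empty vs empty costs 0, empty vs non-empty costs infinity,
    otherwise the head-pair distance plus the cheapest of the three tail
    alignments (Tail removes the first word of a sentence)."""
    @lru_cache(maxsize=None)
    def rec(i, j):
        # rec(i, j) is DTW over the suffixes c[i:] and c_prime[j:].
        if i == len(c) and j == len(c_prime):
            return 0.0
        if i == len(c) or j == len(c_prime):
            return float("inf")
        return dist(c[i], c_prime[j]) + min(
            rec(i + 1, j + 1),  # DTW(Tail(C), Tail(C'))
            rec(i + 1, j),      # DTW(Tail(C), C')
            rec(i, j + 1),      # DTW(C, Tail(C'))
        )
    return rec(0, 0)
```

Because each word may align with several consecutive words of the other sentence, two sentences of different lengths that describe the same action at different speeds can still reach a small distance.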
Claims (3)
1. A method for detecting figure actions in video, characterized in that it comprises the following steps:
Step 1: segment the video into shots by a shot-boundary detection method based on a graph partition model;
Step 2: for consecutive video frames, obtain a spatio-temporal saliency map by building a dynamic saliency model on the basis of the per-frame saliency maps;
Step 3: compute the attention-shift variable A_Shift of the spatio-temporal saliency map by formula (1), where CenterDis() denotes the distance between the centers of the foci of attention of adjacent frames, and DiameterVar() denotes the change in radius of the circumscribed circles of the foci of attention of adjacent frames;
Step 4: set a threshold and separate out the attention-shift values A_Shift that exceed it;
Step 5: within the same action, apply 3D sequence slicing, in which the frames are stacked, to the separated attention-shift values A_Shift, so as to establish the action detection model.
2. The method for detecting figure actions in video according to claim 1, characterized in that establishing the action detection model described in step 5 comprises the following steps:
Step A1: describe each 3D sequence slice in space-time with 3D-SIFT spatio-temporal features;
Step A2: quantize all extracted 3D-SIFT spatio-temporal features in a high-dimensional space, and organize the quantization results into a hierarchical clustering model by hierarchical K-means clustering;
Step A3: at the leaves of the hierarchical clustering model obtained in step A2, describe each clustered feature subspace as a visual word, the visual word being a word obtained by quantizing all 3D-SIFT features converging to each cluster center, and build an inverted index over the 3D sequence slice of each 3D-SIFT feature by its quantized value.
3. The method for detecting figure actions in video according to claim 2, characterized in that the building method of the hierarchical clustering model described in step A3 comprises the following steps:
Step B1: search for the two 3D vocabularies to be matched through the hierarchical structure of the model, and judge whether the number of co-occurring visual words in the two vocabularies exceeds a threshold; if yes, go to step B2; if no, repeat step B1 and search again;
Step B2: compute the similarity by dynamic time warping.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CNA2008101375080A CN101430689A (en) | 2008-11-12 | 2008-11-12 | Detection method for figure action in video |
Publications (1)
Publication Number | Publication Date |
---|---|
CN101430689A true CN101430689A (en) | 2009-05-13 |
Family
ID=40646094
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CNA2008101375080A Pending CN101430689A (en) | 2008-11-12 | 2008-11-12 | Detection method for figure action in video |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN101430689A (en) |
Cited By (10)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN102668548A (en) * | 2009-12-17 | 2012-09-12 | 佳能株式会社 | Video information processing method and video information processing apparatus |
CN102668548B (en) * | 2009-12-17 | 2015-04-15 | 佳能株式会社 | Video information processing method and video information processing apparatus |
CN105049674A (en) * | 2015-07-01 | 2015-11-11 | 中科创达软件股份有限公司 | Video image processing method and system |
CN106534951A (en) * | 2016-11-30 | 2017-03-22 | 北京小米移动软件有限公司 | Method and apparatus for video segmentation |
CN106534951B (en) * | 2016-11-30 | 2020-10-09 | 北京小米移动软件有限公司 | Video segmentation method and device |
WO2019144840A1 (en) * | 2018-01-25 | 2019-08-01 | 北京一览科技有限公司 | Method and apparatus for acquiring video semantic information |
CN109982109A (en) * | 2019-04-03 | 2019-07-05 | 睿魔智能科技(深圳)有限公司 | The generation method and device of short-sighted frequency, server and storage medium |
CN109982109B (en) * | 2019-04-03 | 2021-08-03 | 睿魔智能科技(深圳)有限公司 | Short video generation method and device, server and storage medium |
CN110097115A (en) * | 2019-04-28 | 2019-08-06 | 南开大学 | A kind of saliency object detecting method based on attention metastasis |
CN110097115B (en) * | 2019-04-28 | 2022-11-25 | 南开大学 | Video salient object detection method based on attention transfer mechanism |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
C06 | Publication | ||
PB01 | Publication | ||
C10 | Entry into substantive examination | ||
SE01 | Entry into force of request for substantive examination | ||
C12 | Rejection of a patent application after its publication | ||
RJ01 | Rejection of invention patent application after publication |
Open date: 20090513