CN102222101A - Method for video semantic mining - Google Patents
Method for video semantic mining
- Publication number: CN102222101A (application number CN201110168952)
- Authority: CN (China)
- Prior art keywords: video, recognition, semantic, word, chinese
- Prior art date: 2011-06-22
- Legal status: Pending (the legal status is an assumption and is not a legal conclusion; Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed)
Abstract
The invention relates to a method for video semantic mining, comprising the steps of: first performing Chinese continuous speech recognition, video object recognition, and video text recognition on the video to be processed; then performing Chinese word segmentation and part-of-speech tagging on the recognition results, retaining nouns and verbs as the vertices of a graph model, where the weight of an edge between two vertices is set to the Chinese semantic distance between the words the vertices represent; and finally mining the semantic information of the video with a dense-subgraph discovery algorithm. Semantic mining of the video is realized by fusing the three recognition results of Chinese continuous speech recognition, video object recognition, and video text recognition; the video is represented as a graph model whose vertices are the words extracted from the video and whose edge weights are set to the semantic distance between the two endpoint vertices, so that the video semantic mining algorithm is transformed into dense-subgraph discovery on the graph model. The method solves the problems of the high error rate of any single recognition result and the inability to effectively fuse multiple recognition results from Chinese continuous speech recognition, video object recognition, and video text recognition, as well as the problems of structured video representation and of realizing a video semantic mining algorithm. The method can be used for automatic annotation, classification, and semantic mining of videos in batches.
Description
Technical field
The present invention relates to the fields of digital media and machine learning. It performs semantic analysis on a user-supplied video and produces semantic annotations for the video by fusing speech, text, and image information.
Background technology
With the development of online video-sharing websites and video-processing technology, large amounts of video content are being produced. Because video data is unstructured and lacks the necessary descriptive information, it cannot be processed as easily as text. Manual semantic annotation of video is time-consuming and labor-intensive and cannot meet the demands of batch video processing. Content-based video processing is a current research focus, but existing techniques have a high annotation error rate and do not jointly exploit image, text, and speech content.

Image object recognition technology has matured steadily; in visual object category classification challenges, image object recognition has reached a practical level. Continuous speech recognition technology can transcribe speech signals into text. Video text recognition can identify text embedded in video and handle it as ordinary text. To combine these three recognition technologies, video semantic analysis needs an effective fusion method.

HowNet ("知网") is a Chinese semantic dictionary; using the concept hierarchy in HowNet, the semantic distance between two words can be computed. With this distance, the three kinds of recognition results can be measured semantically. Because the image, text, and speech modalities of a video are highly correlated, information from different modalities can be fused effectively and recognition errors removed. A graph model, consisting of vertices and edges, can express the relations among the concepts in an entire video, and a dense-subgraph discovery algorithm can find the clustered semantic relations in the video's graph model, achieving the goal of video semantic annotation.
Summary of the invention
Existing content-based video-processing techniques do not fully exploit the high-level semantic information carried by images, speech, and text, and cannot classify or mine videos at the semantic level. To remedy these deficiencies of the prior art, the present invention proposes a method for semantic mining of video.
To achieve the stated purpose, the invention provides a method for video representation and mining whose technical scheme comprises the following steps:
Step S1: for the video to be processed, perform Chinese continuous speech recognition, video object recognition, and video text recognition respectively;
Step S2: express each of the three recognition results of step S1 as a word vector; together the three vectors form a tensor that represents the video;
Step S3: perform Chinese word segmentation and part-of-speech tagging on the three word vectors of step S2, keeping only nouns and verbs;
Step S4: construct a graph model to represent the video, in which the vertices are the nouns and verbs obtained in step S3 and the edge weight is set to the semantic distance between the Chinese words represented by the two endpoint vertices;
Step S5: on the graph model constructed in step S4, apply a dense-subgraph discovery algorithm to mine the semantics in the graph model (a minimal orchestration sketch of these steps follows below).
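As a reading aid, a minimal orchestration sketch of steps S1-S5 in Python follows; every function it calls is a hypothetical placeholder for a component described in the embodiment below, where more detailed sketches accompany the individual steps:

```python
# Orchestration sketch of steps S1-S5. Every function called here is a
# hypothetical placeholder for a component described in the embodiment
# below, where more detailed sketches accompany the individual steps.
def mine_video_semantics(video_path):
    # S1: three independent recognizers over the same video
    text_asr = run_speech_recognition(video_path)      # Chinese ASR
    text_obj = run_object_recognition(video_path)      # key-frame objects
    text_ocr = run_video_text_recognition(video_path)  # embedded text
    # S2/S3: one noun/verb vector per modality -> the video tensor
    words = {"ASR": nouns_and_verbs(text_asr),
             "OBJECT": nouns_and_verbs(text_obj),
             "VOCR": nouns_and_verbs(text_ocr)}
    # S4: weighted graph, edge weights from the HowNet similarity
    graph = build_graph(words, similarity=hownet_similarity)
    # S5: words of the densest subgraph become the annotation
    return densest_subgraph_words(graph)
```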
Beneficial effects of the invention: it enables automatic semantic annotation, automatic classification, and similarity measurement for video. For massive video collections, the tedious and repetitive work of manual annotation can be avoided. The invention effectively fuses the results of Chinese continuous speech recognition, Chinese text recognition, and image object recognition; by expressing the video as a graph model, it represents the semantic-distance relations among the semantic concepts in the video, the distances being computed with a HowNet-based semantic metric; finally, annotation and mining of the semantic concepts in the video are realized with a dense-subgraph discovery algorithm.
Description of drawings
Fig. 1 is the overall video-processing flowchart of the present invention.
Fig. 2 is the Chinese continuous speech recognition flowchart of the present invention.
Fig. 3 is the video text recognition flowchart of the present invention.
Fig. 4 is the image object recognition flowchart of the present invention.
Fig. 5 is the semantic-distance hierarchy diagram of the present invention.
Fig. 6 illustrates dense-subgraph mining on the video graph of the present invention.
Fig. 7 shows a video annotation result of the present invention.
Embodiment
The detailed issues involved in the technical solution of the present invention are described below with reference to the accompanying drawings. It should be noted that the described embodiments are intended only to facilitate understanding of the present invention and do not limit it in any way.
The present invention proposes a method for video semantic mining. As shown in Fig. 1, the method is organized into four processing layers. At the bottom is the video library layer, which stores video resources of various formats. Above it is the multimodal fusion layer, where structural analysis of the video and the recognition and effective fusion of image, text, and speech are carried out. The next layer up is the video mining layer, which realizes the graph-model representation of the video and the dense-subgraph-based video mining algorithm; in addition, video classification and mining can be realized with a support vector machine model. The top layer is the transparent intelligent video service layer provided to the user. On the right side is the HowNet-based semantic computation support layer. Following this flow, the concrete implementation steps are as follows:
1. Video preprocessing
For the video to be processed, perform shot segmentation, extract a key frame for each shot, and save these key frames for subsequent image object recognition. Extract the audio signal from the video, sample it at 16 kHz with 16-bit samples, and save it in WAV format for subsequent speech recognition.
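A minimal preprocessing sketch, assuming ffmpeg is available on the system path (the patent does not name a tool); shot segmentation is reduced here to fixed-interval frame sampling for brevity:

```python
# Preprocessing sketch: key-frame extraction and 16 kHz/16-bit WAV audio.
# ffmpeg is an assumption; the patent does not prescribe a tool, and real
# shot segmentation would replace the fixed one-frame-per-second sampling.
import subprocess

def preprocess(video_path, frame_dir="keyframes", audio_path="audio.wav"):
    # Sample one frame per second as stand-in key frames.
    subprocess.run(["ffmpeg", "-i", video_path, "-vf", "fps=1",
                    f"{frame_dir}/frame_%05d.png"], check=True)
    # Extract mono 16 kHz, 16-bit PCM audio for the recognizer.
    subprocess.run(["ffmpeg", "-i", video_path, "-ar", "16000",
                    "-ac", "1", "-c:a", "pcm_s16le", audio_path],
                   check=True)
```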
2. Image object recognition
First download the visual object classes challenge image library (PASCAL VOC Challenge 2010 Database, http://pascallin.ecs.soton.ac.uk/challenges/VOC/voc2010/index.html). Extract the histogram of oriented gradients (HOG) feature of each picture in the library, and cluster the features with the k-means clustering algorithm; with the number of clusters set to 1000, this yields 1000 visual words. Then describe each picture with these 1000 visual words, so that each picture constitutes a bag of words as its intermediate feature. Finally, use the support vector machine (SVM) method on the bag-of-words features to train classification models for 20 visual categories: person, bird, cat, cow, dog, horse, sheep, aeroplane, boat, bicycle, motorbike, train, car, bus, bottle, chair, dining table, potted plant, sofa, and TV monitor. Use these classification models to perform object recognition on the video key-frame images, and save the recognition result as a text file, denoted Text_OBJECT. The processing flow of this step is shown in Fig. 2.
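A compact sketch of the bag-of-visual-words pipeline described above, assuming scikit-image and scikit-learn as library choices (the patent names only the techniques, HOG, k-means, bag of words, and SVM, not any library):

```python
# Bag-of-visual-words sketch: local HOG features -> k-means visual words
# -> per-image word histograms -> SVM over the 20 VOC classes.
# scikit-image/scikit-learn are assumptions, not the patent's choices.
import numpy as np
from skimage.feature import hog
from sklearn.cluster import KMeans
from sklearn.svm import LinearSVC

def local_hog_patches(image, patch=64, step=32):
    """HOG descriptor for each local patch of a 2-D grayscale image."""
    h, w = image.shape
    return np.array([hog(image[y:y+patch, x:x+patch])
                     for y in range(0, h - patch + 1, step)
                     for x in range(0, w - patch + 1, step)])

def bow_histogram(image, vocabulary):
    # Assign each patch descriptor to its nearest visual word.
    words = vocabulary.predict(local_hog_patches(image))
    return np.bincount(words, minlength=vocabulary.n_clusters)

def train(images, labels, n_words=1000):
    # Cluster all patch descriptors into 1000 visual words, then fit
    # an SVM on the per-image bag-of-words histograms.
    vocabulary = KMeans(n_clusters=n_words).fit(
        np.vstack([local_hog_patches(im) for im in images]))
    svm = LinearSVC().fit(
        np.array([bow_histogram(im, vocabulary) for im in images]), labels)
    return vocabulary, svm
```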
3. Chinese continuous speech recognition
First download the open-source code of the Sphinx continuous speech recognition system together with its Chinese language model, Chinese acoustic model, and Chinese dictionary file (Sphinx, http://sphinx.sourceforge.net). Perform continuous speech recognition on the audio signal obtained from video preprocessing, transcribing the audio signal into text, denoted Text_ASR. The processing flow of this step is shown in Fig. 3.
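A sketch of driving Sphinx from Python through its pocketsphinx_continuous command-line front end; the model paths are placeholders, and the option names should be verified against the installed Sphinx version:

```python
# ASR sketch: transcribe the preprocessed WAV with CMU Sphinx. The model
# and dictionary paths are placeholders; -hmm/-lm/-dict/-infile are
# pocketsphinx_continuous options, but verify them against the installed
# version before relying on this.
import subprocess

def transcribe(audio_path="audio.wav"):
    result = subprocess.run(
        ["pocketsphinx_continuous",
         "-hmm", "models/zh_cn.acoustic",   # Chinese acoustic model
         "-lm", "models/zh_cn.lm",          # Chinese language model
         "-dict", "models/zh_cn.dic",       # Chinese dictionary file
         "-infile", audio_path],
        capture_output=True, text=True, check=True)
    return result.stdout  # Text_ASR
```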
4. Video text recognition
For text localization in video images, a dual-edge model based on character strokes first yields candidate text regions; the candidate regions are then decomposed to obtain precisely located text blocks. The in-video text extraction algorithm takes one frame every several video frames and performs image-based text localization to obtain text objects; the text objects are then tracked forward and backward through the video frame sequence; finally, the text objects are recognized to obtain the text extraction result, which is saved as a text file, denoted Text_VOCR. The processing flow of this step is shown in Fig. 4.
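A simplified sketch of the frame-sampling and recognition stage, assuming OpenCV and pytesseract as stand-ins; the patent's stroke-based dual-edge localizer and forward/backward text-object tracking are not reproduced here:

```python
# Video OCR sketch: sample every Nth frame and OCR it. Generic Tesseract
# OCR substitutes for the patent's stroke-based dual-edge text localizer
# and its forward/backward text-object tracking.
import cv2
import pytesseract

def video_ocr(video_path, every_n=25, lang="chi_sim"):
    cap = cv2.VideoCapture(video_path)
    lines, idx = [], 0
    while True:
        ok, frame = cap.read()
        if not ok:
            break
        if idx % every_n == 0:  # one frame every `every_n` frames
            text = pytesseract.image_to_string(frame, lang=lang).strip()
            if text:
                lines.append(text)
        idx += 1
    cap.release()
    return "\n".join(lines)  # Text_VOCR
```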
5. For Text_OBJECT, Text_ASR, and Text_VOCR, perform Chinese word segmentation based on a hidden Markov model, remove meaningless "stop words", and apply Chinese part-of-speech tagging; retain the verbs and nouns for the next step of the analysis. After processing, the results are denoted Word_OBJECT, Word_ASR, and Word_VOCR respectively. The whole video can thus be expressed semantically as a tensor, namely

Ψ ∈ Word_OBJECT ⊗ Word_ASR ⊗ Word_VOCR

where Ψ denotes the semantic tensor feature of the video and Word_OBJECT ⊗ Word_ASR ⊗ Word_VOCR denotes the tensor space formed by the three vectors.
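A sketch of the segmentation and filtering step, assuming the jieba library, whose HMM-based segmenter matches the hidden-Markov-model approach named above, though the patent does not name a library:

```python
# Segmentation + POS tagging sketch with jieba, keeping only nouns (POS
# tags starting with 'n') and verbs (tags starting with 'v'). jieba is
# an assumption; the patent only specifies an HMM-based segmenter with
# stop-word removal and part-of-speech tagging.
import jieba.posseg as pseg

STOP_WORDS = {"的", "了", "是", "在"}  # illustrative stop-word list only

def nouns_and_verbs(text):
    return [w.word for w in pseg.cut(text)
            if w.word not in STOP_WORDS
            and (w.flag.startswith("n") or w.flag.startswith("v"))]

# One word vector per modality; together they form the video tensor Ψ:
# Word_OBJECT = nouns_and_verbs(Text_OBJECT), and likewise for ASR, VOCR.
```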
6. For the nouns and verbs obtained in the previous step, compute the semantic similarity between each pair of words of the same part of speech. The computation uses a hierarchical distance metric based on HowNet, with similarity defined between 0 and 1; for example, the similarity between "desk" and "chair" is 0.8, while the similarity between "landscape" and "steamship" is 0.1. The processing flow of this step is shown in Fig. 5.
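A generic sketch of a hierarchy-based similarity in the spirit of the HowNet metric, assuming a parent-pointer concept tree; the actual HowNet computation uses its sememe hierarchy and membership relations and is more involved:

```python
# Hierarchy-based similarity sketch (Wu-Palmer style): similarity grows
# with the depth of the deepest common ancestor of two concepts. This is
# a generic stand-in for the HowNet-based metric, not HowNet itself.
def path_to_root(node, parent):
    path = [node]
    while node in parent:
        node = parent[node]
        path.append(node)
    return path

def similarity(a, b, parent):
    pa, pb = path_to_root(a, parent), path_to_root(b, parent)
    common = set(pa) & set(pb)
    if not common:
        return 0.0
    depth = lambda n: len(path_to_root(n, parent)) - 1  # hops from root
    lca = max(common, key=depth)  # deepest common ancestor
    return 2.0 * depth(lca) / (depth(a) + depth(b))

# e.g. parent = {"desk": "furniture", "chair": "furniture",
#                "furniture": "artifact", "steamship": "vehicle",
#                "vehicle": "artifact"}
```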
7. Take the words in Word_OBJECT, Word_ASR, and Word_VOCR as the vertices V of a graph model, and define the word similarity computed in the previous step as the weight w of the edge between two vertices, constructing a graph model that represents the video. This is a weighted undirected graph G = (V, E), where V is the vertex set, E is the edge set, and |V| = n is the number of vertices; every edge e ∈ E carries a non-negative weight w(e), which is exactly the semantic similarity between the two words determined in the previous step.
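A sketch of the graph construction with networkx, an assumed library choice; vertices are tagged with their modality so that the three communities of step 9 are preserved:

```python
# Graph construction sketch: one vertex per retained word, edges weighted
# by pairwise semantic similarity. networkx is an assumption; vertex keys
# are (modality, word) pairs to mark the three communities of step 9.
import itertools
import networkx as nx

def build_graph(word_vectors, similarity):
    """word_vectors: dict modality -> word list, e.g. {'ASR': [...], ...}"""
    G = nx.Graph()
    vertices = [(m, w) for m, words in word_vectors.items() for w in words]
    for (m1, w1), (m2, w2) in itertools.combinations(vertices, 2):
        s = similarity(w1, w2)
        if s > 0:  # omit zero-weight edges
            G.add_edge((m1, w1), (m2, w2), weight=s)
    return G
```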
8. Since the images, text, and audio in a video jointly express a single theme, they are semantically consistent. The semantic distance between the three vectors of the video tensor should therefore be minimized, and words that violate this minimization principle, which arise from recognition errors, should be removed. This principle can be expressed as

max Σ_{m≠n} Σ_{i,j} f(w_i^m, w_j^n)

where w_i^m and w_j^n denote two words; m, n ∈ {Word_OBJECT, Word_ASR, Word_VOCR} denote vectors of the video tensor; 0 ≤ i ≤ |m| and 0 ≤ j ≤ |n| are word indices, each bounded above by the dimension of its vector; and f(·) gives the similarity value of a word pair, defined between 0 and 1, so that maximizing total similarity is equivalent to minimizing semantic distance.
9. The above optimization problem is converted into a dense-subgraph discovery problem on the graph model: in G = (V, E), find a subgraph H = (X, F), where H is a subgraph, X is its vertex set, and F is its edge set. The dense-subgraph discovery objective can be expressed as

max f(H) = (Σ_{l∈F} w(l)) / |X|

that is, maximize the average edge weight of the subgraph, where |X| is the number of subgraph vertices, F is the set of subgraph edges, and w(l) is the weight of edge l, computed with the same f(·) as in the previous step. The video semantic feature tensor space comprises the three vectors Word_OBJECT, Word_ASR, and Word_VOCR; each vector forms one community of the graph model, so the whole video is expressed as a graph model composed of three communities. The dense-subgraph discovery algorithm runs on this graph model, as shown in Fig. 6.
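A sketch of the discovery step as greedy peeling, which matches claim 5's description of repeatedly removing weakly connected ("isolated") vertices; Charikar-style peeling is a standard heuristic for this density objective, though the patent does not fix the exact algorithm:

```python
# Densest-subgraph sketch by greedy peeling: repeatedly delete the vertex
# with the smallest weighted degree and remember the vertex set with the
# best density f(H) = (sum of edge weights) / |X|. Peeling is an assumed
# realization of claim 5's "remove isolated vertices" strategy.
import networkx as nx

def densest_subgraph_words(G):
    H = G.copy()
    best_density, best_nodes = -1.0, set()
    while H.number_of_nodes() > 0:
        density = H.size(weight="weight") / H.number_of_nodes()
        if density > best_density:
            best_density, best_nodes = density, set(H.nodes())
        # peel the vertex with minimum weighted degree
        v = min(H.nodes(), key=lambda u: H.degree(u, weight="weight"))
        H.remove_node(v)
    return {word for _, word in best_nodes}  # vertices are (modality, word)
```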
10. For the dense subgraph found in the previous step, record the words represented by its vertices as the effective annotation of the video; this annotation embodies the semantic information of the video. A video annotation result is shown in Fig. 7.
Claims (5)
1. A video semantic mining method, characterized in that the steps of the method are as follows:
Step S1: for the video to be processed, perform Chinese continuous speech recognition, video object recognition, and video text recognition respectively;
Step S2: express each of the three recognition results of step S1 as a word vector; together the three vectors form a tensor that represents the video;
Step S3: perform Chinese word segmentation and part-of-speech tagging on the three word vectors of step S2, keeping only nouns and verbs;
Step S4: construct a graph model to represent the video, in which the vertices are the nouns and verbs obtained in step S3 and the edge weight is set to the semantic distance between the Chinese words represented by the two endpoint vertices;
Step S5: on the graph model constructed in step S4, apply a dense-subgraph discovery algorithm to mine the semantics in the graph model.
2. The video semantic mining method according to claim 1, characterized in that the video object recognition first extracts the histogram of oriented gradients (HOG) and scale-invariant feature transform (SIFT) features of the pictures in the visual object classes challenge image library (PASCAL VOC Challenge 2010), clusters these features with the k-means algorithm, and calls the resulting clusters visual words; the pictures in the library are then described by bags of words constructed from these visual words, a support vector machine (SVM) model is trained with the bag of words as the image feature, and the support vector machine model is used to perform object recognition on the video shot key-frame images.
3. The video semantic mining method according to claim 1, characterized in that the processing of the video fuses Chinese continuous speech recognition, video object recognition, and video text recognition, the three recognition results being uniformly handled as text features, the text processing comprising Chinese word segmentation and part-of-speech tagging.
4. The video semantic mining method according to claim 1, characterized in that, in the construction of the graph model, the vertices represent the nouns and verbs in the three recognition results of the video and the edge weights represent the semantic distance between vertices; the edge weights are computed with a HowNet-based semantic metric, the semantic distance between two words being calculated by querying the hierarchy and membership relations between the words in the HowNet semantic dictionary.
5. The video semantic mining method according to claim 1, characterized in that the dense-subgraph discovery algorithm is realized by repeatedly removing isolated vertices from the graph model, and the video semantic mining process is expressed as a dense-subgraph discovery problem on the graph model.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN 201110168952 CN102222101A (en) | 2011-06-22 | 2011-06-22 | Method for video semantic mining |
Publications (1)
Publication Number | Publication Date |
---|---|
CN102222101A (en) | 2011-10-19 |
Family
ID=44778653
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN 201110168952 | CN102222101A (en), pending | 2011-06-22 | 2011-06-22 |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN102222101A (en) |
Patent Citations (2)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
US20080059872A1 (en) * | 2006-09-05 | 2008-03-06 | National Cheng Kung University | Video annotation method by integrating visual features and frequent patterns |
CN101093500A (en) * | 2007-07-16 | 2007-12-26 | 武汉大学 | Method for recognizing semantics of events in video |
Non-Patent Citations (3)
Title |
---|
Liu Yanan, "Video semantic understanding with multimodal feature fusion and variable selection", China Doctoral Dissertations Full-text Database (electronic journal), 2010-12-15, chapter 4. |
Zhang Huansheng et al., "A survey of graph-based frequent substructure mining algorithms", 《信息化纵横》, October 2009, pp. 5-9. |
Liu Yanan et al., "Video semantic mining based on multimodal subspace correlation propagation", Journal of Computer Research and Development, January 2009, pp. 1-8. |
Cited By (14)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN106202421B (en) * | 2012-02-02 | 2020-01-31 | 联想(北京)有限公司 | method and device for obtaining video and method and device for playing video |
CN106202421A (en) * | 2012-02-02 | 2016-12-07 | 联想(北京)有限公司 | A kind of obtain the method for video, device and play the method for video, device |
WO2014146463A1 (en) * | 2013-03-19 | 2014-09-25 | 中国科学院自动化研究所 | Behaviour recognition method based on hidden structure reasoning |
CN105629747A (en) * | 2015-09-18 | 2016-06-01 | 宇龙计算机通信科技(深圳)有限公司 | Voice control method and device of smart home system |
CN107203586A (en) * | 2017-04-19 | 2017-09-26 | 天津大学 | A kind of automation result generation method based on multi-modal information |
CN108320318A (en) * | 2018-01-15 | 2018-07-24 | 腾讯科技(深圳)有限公司 | Image processing method, device, computer equipment and storage medium |
CN108427713A (en) * | 2018-02-01 | 2018-08-21 | 宁波诺丁汉大学 | A kind of video summarization method and system for homemade video |
CN108427713B (en) * | 2018-02-01 | 2021-11-16 | 宁波诺丁汉大学 | Video abstraction method and system for self-made video |
CN111107381A (en) * | 2018-10-25 | 2020-05-05 | 武汉斗鱼网络科技有限公司 | Live broadcast room bullet screen display method, storage medium, equipment and system |
CN111797850A (en) * | 2019-04-09 | 2020-10-20 | Oppo广东移动通信有限公司 | Video classification method and device, storage medium and electronic equipment |
CN110879974A (en) * | 2019-11-01 | 2020-03-13 | 北京微播易科技股份有限公司 | Video classification method and device |
CN111126373A (en) * | 2019-12-23 | 2020-05-08 | 北京中科神探科技有限公司 | Internet short video violation judgment device and method based on cross-modal identification technology |
CN113343974A (en) * | 2021-07-06 | 2021-09-03 | 国网天津市电力公司 | Multi-modal fusion classification optimization method considering inter-modal semantic distance measurement |
CN115203408A (en) * | 2022-06-24 | 2022-10-18 | 中国人民解放军国防科技大学 | Intelligent labeling method for multi-modal test data |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
| C06 | Publication | |
| PB01 | Publication | |
| C10 | Entry into substantive examination | |
| SE01 | Entry into force of request for substantive examination | |
| C02 | Deemed withdrawal of patent application after publication (patent law 2001) | |
| WD01 | Invention patent application deemed withdrawn after publication | Application publication date: 20111019 |