CN102222101A - Method for video semantic mining - Google Patents

Method for video semantic mining

Info

Publication number
CN102222101A
CN102222101A (application CN 201110168952)
Authority
CN
China
Prior art keywords
video
recognition
semantic
word
chinese
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN 201110168952
Other languages
Chinese (zh)
Inventor
Zhang Shilin (张师林)
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Beijing University of Technology
North China University of Technology
Original Assignee
North China University of Technology
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by North China University of Technology filed Critical North China University of Technology
Priority to CN 201110168952 priority Critical patent/CN102222101A/en
Publication of CN102222101A publication Critical patent/CN102222101A/en
Pending legal-status Critical Current

Links

Images

Landscapes

  • Character Discrimination (AREA)

Abstract

The invention relates to a method for video semantic mining, comprising the steps of: first performing Chinese continuous speech recognition, video object recognition and video text recognition on the video to be processed; then performing Chinese word segmentation and part-of-speech tagging on the recognition results, retaining the nouns and verbs as the vertices of a graph model, in which the weight of the edge between two vertices is set to the Chinese semantic distance between the words the vertices represent; and finally mining the semantic information of the video with a dense-subgraph discovery algorithm. Semantic mining of the video is realized by fusing the three recognition results of Chinese continuous speech recognition, video object recognition and video text recognition; the video is represented as a graph model whose vertices are the words occurring in the video and whose edge weights are set to the semantic distance between the two endpoint words; the video semantic mining task is thereby transformed into dense-subgraph discovery on the graph model. The method solves the problems of the high error rate of any single recognition result and of the inability to fuse multiple recognition results efficiently in Chinese continuous speech recognition, video object recognition and video text recognition, as well as the problems of structured video representation and of realizing a video semantic mining algorithm. The method can be used for automatic annotation, classification and semantic mining of videos in batches.

Description

A method for video semantic mining
Technical field
The present invention relates to the fields of digital media and machine learning. It performs semantic analysis on a video supplied by the user and annotates the video semantically by fusing its speech, text and image information.
Background technology
With the development of online video sharing websites and video processing technology, a large amount of content in video form has emerged. Because video data is unstructured and lacks the necessary descriptive information, it cannot be processed as easily as text. Manual semantic annotation of video is time-consuming and labor-intensive and cannot meet the requirements of batch video processing. Content-based video processing is a current research focus, but the prior art has a high error rate when annotating video content and does not jointly exploit the image, text and speech content. Image object recognition technology has matured gradually; in visual object category classification challenges, image object recognition has reached a practical level. Continuous speech recognition technology allows speech signals to be transcribed into text. Video text recognition can identify the text embedded in a video so that it can be handled as text. To combine these three recognition technologies, video semantic analysis needs an effective fusion method. HowNet is a Chinese semantic dictionary; using the concept hierarchy in HowNet, the semantic distance between two words can be computed. Based on semantic distance, the three kinds of recognition results can be measured semantically. Because the image, text and speech modalities of a video are highly correlated, information from different modalities can be fused effectively and recognition errors removed. A graph model, composed of vertices and edges, can express the relations among the concepts in a whole video. A dense-subgraph discovery algorithm can find the clustered semantic relations in the video's graph model, achieving the goal of video semantic annotation.
Summary of the invention
Existing content-based video processing techniques do not make full use of the high-level semantic information carried by the image, speech and text of a video, and cannot classify and mine videos at the semantic level. To remedy these deficiencies of the prior art, the present invention proposes a method for semantic mining of video.
To achieve the described purpose, the invention provides a method for expressing and mining video whose technical scheme comprises the following steps:
Step S1: for the video to be processed, perform Chinese continuous speech recognition, video object recognition and video text recognition respectively;
Step S2: express each of the three recognition results of step S1 as a word vector; together the three vectors form a tensor that expresses the video;
Step S3: perform Chinese word segmentation and part-of-speech tagging on the three word vectors of step S2, keeping only the nouns and verbs;
Step S4: construct a graph model expressing the video, in which the vertices of the graph are the nouns and verbs obtained in S3 and the edge weights of the graph are set to the semantic distance between the Chinese words represented by the two endpoint vertices;
Step S5: for the graph model constructed in step S4, use a dense-subgraph discovery algorithm to mine the semantics contained in the graph model.
Beneficial effects of the invention: automatic semantic annotation, automatic classification and video similarity measurement can be realized for video. For massive video data, this technique avoids the tedious work brought by manual annotation. The invention effectively fuses the results of Chinese continuous speech recognition, Chinese text recognition and image object recognition; by expressing the video as a graph model it represents the semantic distance relations among the semantic concepts in the video, the distance relations being realized by a semantic distance metric based on HowNet; finally, the annotation and mining of the semantic concepts in the video are realized by a dense-subgraph discovery algorithm.
Description of drawings
Fig. 1 is the overall video processing flowchart of the present invention.
Fig. 2 is the Chinese continuous speech recognition flowchart of the present invention.
Fig. 3 is the video text recognition flowchart of the present invention.
Fig. 4 is the image object recognition flowchart of the present invention.
Fig. 5 shows the hierarchical relations of the semantic distance metric of the present invention.
Fig. 6 illustrates the video dense-subgraph mining of the present invention.
Fig. 7 shows a video annotation result of the present invention.
Embodiment
The detailed problems involved in the technical solution of the present invention are described below with reference to the accompanying drawings. Note that the described embodiments are intended only to facilitate the understanding of the invention and do not limit it in any way.
The present invention proposes a method for video semantic mining. As shown in Fig. 1, the processing flow of the method is divided into four layers. At the bottom is the video library layer, which stores video resources of various formats. Above the video library is the multimodal fusion layer, which performs the structural analysis of the video and the recognition and effective fusion of image, text and speech. The next layer up is the video analysis layer, which realizes the graph-model representation of the video and the video mining algorithm based on dense-subgraph discovery; in addition, video classification and mining can be realized with a support vector machine model. The top layer is the transparent intelligent video service layer provided to the user. At the rightmost side is the semantic computation support layer based on HowNet. Following this flow, the concrete implementation steps are as follows:
1. Video preprocessing
Perform shot segmentation on the video to be processed, then extract a keyframe for each shot and save these keyframes for the subsequent image object recognition. Extract the audio signal from the video, sample it at 16 kHz with 16-bit samples, and save it in WAV format for the subsequent speech recognition.
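As an illustration, this preprocessing step might be sketched in Python as follows, assuming OpenCV and the ffmpeg command-line tool are available; the frame-difference shot detection and its threshold are assumptions, since the description does not specify a particular shot-segmentation method.

```python
import subprocess
import cv2

def extract_keyframes(video_path, diff_threshold=30.0):
    """Keep one keyframe per detected shot, using mean frame difference
    as a simple stand-in for shot segmentation."""
    cap = cv2.VideoCapture(video_path)
    keyframes, prev_gray = [], None
    while True:
        ok, frame = cap.read()
        if not ok:
            break
        gray = cv2.cvtColor(frame, cv2.COLOR_BGR2GRAY)
        if prev_gray is None or cv2.absdiff(gray, prev_gray).mean() > diff_threshold:
            keyframes.append(frame)  # treat a large change as a shot boundary
        prev_gray = gray
    cap.release()
    return keyframes

def extract_audio(video_path, wav_path="audio.wav"):
    """Save the audio track as 16 kHz, 16-bit mono WAV, as required above."""
    subprocess.run(["ffmpeg", "-y", "-i", video_path, "-vn",
                    "-ac", "1", "-ar", "16000", "-acodec", "pcm_s16le",
                    wav_path], check=True)
```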
2. Image object recognition
First download the visual object class challenge picture library (PASCAL VOC Challenge 2010 Database, http://pascallin.ecs.soton.ac.uk/challenges/VOC/voc2010/index.html). Extract the histogram-of-oriented-gradients (HOG) features of each picture in the library, and cluster the features of the whole library with the k-means clustering algorithm; the number of clusters can be set to 1000, which yields 1000 visual words. Then describe each picture with these 1000 visual words, so that each picture constitutes a bag of words serving as its intermediate feature. Finally, train support vector machine (SVM) classification models for 20 visual categories on the bag-of-words features of the pictures. The 20 categories are: person, bird, cat, cow, dog, horse, sheep, aeroplane, boat, bicycle, motorbike, train, car, bus, bottle, chair, dining table, potted plant, sofa, and TV monitor. Use these classification models to perform object recognition on the keyframe images of the video, and save the recognition result as a text denoted Text_OBJECT. The processing flow of this step is shown in Fig. 2.
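A condensed sketch of this bag-of-visual-words pipeline, using scikit-image and scikit-learn as assumed stand-ins for whatever implementation the inventors used; the patch size, stride and the choice of a linear SVM are illustrative.

```python
import numpy as np
from skimage.feature import hog
from sklearn.cluster import MiniBatchKMeans
from sklearn.svm import LinearSVC

N_WORDS = 1000  # size of the visual vocabulary, as in the description

def hog_patches(gray_image, patch=64, step=32):
    """HOG descriptor of every local patch of one grayscale image."""
    h, w = gray_image.shape
    return np.array([hog(gray_image[y:y + patch, x:x + patch])
                     for y in range(0, h - patch + 1, step)
                     for x in range(0, w - patch + 1, step)])

def build_vocabulary(train_images):
    """Cluster all patch descriptors into N_WORDS visual words (k-means)."""
    descriptors = np.vstack([hog_patches(img) for img in train_images])
    return MiniBatchKMeans(n_clusters=N_WORDS).fit(descriptors)

def bow_histogram(gray_image, vocab):
    """Bag-of-words histogram: how often each visual word occurs in the image."""
    words = vocab.predict(hog_patches(gray_image))
    return np.bincount(words, minlength=N_WORDS).astype(float)

def train_classifier(train_images, labels, vocab):
    """One-vs-rest linear SVMs over the 20 PASCAL VOC category labels."""
    X = np.array([bow_histogram(img, vocab) for img in train_images])
    return LinearSVC().fit(X, labels)
```

Keyframes from step 1 would then be converted to grayscale, turned into histograms with bow_histogram, and labelled by the trained classifier to produce Text_OBJECT.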
3. Chinese continuous speech recognition
First download the open-source code of the Sphinx continuous speech recognition system together with the matching Chinese language model, Chinese acoustic model and Chinese word list file (Sphinx, http://sphinx.sourceforge.net). Perform continuous speech recognition on the audio signal obtained by the video preprocessing, transcribing the audio signal into a text denoted Text_ASR. The processing flow of this step is shown in Fig. 3.
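A minimal sketch of this transcription step, assuming the pocketsphinx Python package (a binding to the CMU Sphinx engine named above); the Chinese model file paths are placeholders that would point at the downloaded acoustic model, language model and dictionary.

```python
from pocketsphinx import Decoder

# Placeholder paths for the downloaded Chinese models (assumptions).
decoder = Decoder(hmm='zh/acoustic-model',   # Chinese acoustic model
                  lm='zh/zh.lm.bin',         # Chinese language model
                  dict='zh/zh.dic')          # Chinese word list / dictionary

with open('audio.wav', 'rb') as f:
    f.read(44)                    # skip the canonical 44-byte WAV header
    decoder.start_utt()
    decoder.process_raw(f.read(), no_search=False, full_utt=True)
    decoder.end_utt()

text_asr = decoder.hyp().hypstr if decoder.hyp() else ''   # Text_ASR
```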
4. Video text recognition
For text localization in video images, a dual-edge model based on character strokes first produces candidate text regions, and the candidate regions are then decomposed to obtain precisely located text blocks. The video text extraction algorithm samples one frame every several video frames and applies the image-based text localization to obtain text objects; the text objects are then tracked forwards and backwards through the video frame sequence; finally the text objects are recognized to obtain the text extraction result, which is saved as a text denoted Text_VOCR. The processing flow of this step is shown in Fig. 4.
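A greatly simplified stand-in for this step: sample one frame every N frames and run an off-the-shelf OCR engine on each sample. pytesseract is assumed here purely for illustration; it replaces, rather than implements, the stroke-based localization and forward/backward tracking described above.

```python
import cv2
import pytesseract

def video_text(video_path, every_n=30, lang='chi_sim'):
    """Collect OCR output from one frame in every `every_n` frames."""
    cap = cv2.VideoCapture(video_path)
    texts, idx = [], 0
    while True:
        ok, frame = cap.read()
        if not ok:
            break
        if idx % every_n == 0:
            gray = cv2.cvtColor(frame, cv2.COLOR_BGR2GRAY)
            texts.append(pytesseract.image_to_string(gray, lang=lang))
        idx += 1
    cap.release()
    return '\n'.join(texts)   # Text_VOCR

text_vocr = video_text('input.mp4')
```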
5. Perform Chinese word segmentation based on a hidden Markov model on Text_OBJECT, Text_ASR and Text_VOCR, remove the meaningless stop words, and apply Chinese part-of-speech tagging; retain the verbs and nouns for the next step of analysis, denoting the processed results Word_OBJECT, Word_ASR and Word_VOCR respectively. The whole video can then be expressed semantically as a tensor

Ψ = Word_OBJECT ⊗ Word_ASR ⊗ Word_VOCR

where Ψ denotes the semantic tensor of the video, an element of the tensor space formed by the three vectors Word_OBJECT, Word_ASR and Word_VOCR.
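A sketch of this step using the jieba toolkit's part-of-speech mode as an assumed stand-in for the HMM-based segmenter (jieba itself uses an HMM for out-of-vocabulary words); the stop-word list is illustrative.

```python
import jieba.posseg as pseg

STOPWORDS = {'的', '了', '是', '在', '和'}   # illustrative stop-word list

def nouns_and_verbs(text):
    """Segment Chinese text, drop stop words, keep nouns (n*) and verbs (v*)."""
    return [word for word, flag in pseg.cut(text)
            if word not in STOPWORDS and flag[:1] in ('n', 'v')]

# text_object, text_asr and text_vocr are the outputs of steps 2-4.
word_object = nouns_and_verbs(text_object)   # Word_OBJECT
word_asr    = nouns_and_verbs(text_asr)      # Word_ASR
word_vocr   = nouns_and_verbs(text_vocr)     # Word_VOCR
```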
6. For the nouns and verbs obtained in the previous step, compute the semantic similarity between each pair of words of the same part of speech. The computation adopts the hierarchical distance metric based on HowNet, with similarity defined between 0 and 1; for example, the similarity between desk and chair is 0.8, while the similarity between landscape and steamer is 0.1. The processing flow of this step is shown in Fig. 5.
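A toy sketch of such a hierarchical metric: similarity decays with the path length between two concepts in a hypernym hierarchy. The hand-built PARENT fragment and the alpha/(alpha + distance) form are illustrative assumptions; a real implementation would query the HowNet concept hierarchy.

```python
PARENT = {  # tiny hand-built fragment of a concept hierarchy (illustrative)
    'desk': 'furniture', 'chair': 'furniture', 'furniture': 'artifact',
    'steamer': 'vehicle', 'vehicle': 'artifact', 'artifact': 'entity',
    'landscape': 'scene', 'scene': 'entity',
}

def ancestors(word):
    """The chain from a word up to the root of the hierarchy."""
    chain = [word]
    while chain[-1] in PARENT:
        chain.append(PARENT[chain[-1]])
    return chain

def similarity(w1, w2, alpha=1.6):
    """Similarity in (0, 1]: alpha / (alpha + path length to the nearest
    common ancestor), a common path-based form for HowNet-style metrics."""
    a1, a2 = ancestors(w1), ancestors(w2)
    common = next((c for c in a1 if c in a2), None)
    if common is None:
        return 0.0
    return alpha / (alpha + a1.index(common) + a2.index(common))
```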
7. The words in Word_OBJECT, Word_ASR and Word_VOCR serve as the vertices V of the graph model, and the similarity between words defined in the previous step is taken as the weight w of the edge between two vertices, constructing a graph model that expresses the video. This is a weighted undirected graph G = (V, E), where V is the vertex set, E is the edge set and |V| = n is the number of vertices; every edge e = (u, v) ∈ E carries a non-negative weight w(e) ≥ 0, defined as the semantic similarity between the corresponding words determined in the previous step.
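A sketch of this construction with networkx (an assumed choice of graph library), reusing the similarity function sketched above:

```python
import itertools
import networkx as nx

def build_graph(word_object, word_asr, word_vocr):
    """Vertices are the retained words; edge weights are semantic similarities."""
    G = nx.Graph()
    words = set(word_object) | set(word_asr) | set(word_vocr)
    for u, v in itertools.combinations(words, 2):
        w = similarity(u, v)   # metric from the previous step
        if w > 0:
            G.add_edge(u, v, weight=w)
    return G
```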
8. Since the image, text and audio of a video jointly express one theme, they are consistent semantically. The semantic distance between the three vectors of the video tensor should therefore be minimized, and words that violate this minimization principle, having been produced by recognition errors, should be removed. The principle can be expressed as maximizing

Σ_{i,j} f(m_i, n_j)

where m_i and n_j denote two words; m, n ∈ {Word_OBJECT, Word_ASR, Word_VOCR} are vectors of the video tensor; 0 ≤ i ≤ |m| and 0 ≤ j ≤ |n| are word indices, bounded by the dimension of the corresponding vector of the video tensor; and the similarity value f(·, ·) of a word pair is defined between 0 and 1, so that maximizing total similarity is equivalent to minimizing semantic distance.
9. The above optimization problem is transformed into the dense-subgraph discovery problem on the graph model: in G = (V, E), find a subgraph H = (X, F), where H is a subgraph, X is the subgraph vertex set and F is the subgraph edge set. The dense-subgraph discovery algorithm can be expressed as maximizing

(1/|X|) Σ_{l ∈ F} w(l)

that is, the sum of edge weights averaged over the subgraph vertices, where |X| is the number of subgraph vertices, F is the set of subgraph edges, and w(l) is the weight of edge l, computed with the same f(·, ·) as in the previous step. The video semantic feature tensor space comprises the three vectors Word_OBJECT, Word_ASR and Word_VOCR; each vector constructs one community of the graph model, so the whole video is expressed as a graph model composed of three communities. The dense-subgraph discovery algorithm is carried out on this graph model, as shown in Fig. 6.
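A sketch of the discovery step via greedy peeling, one standard way to optimize this density objective and consistent with claim 5's description of repeatedly removing isolated vertices; whether the inventors use exactly this procedure is an assumption.

```python
def densest_subgraph(G):
    """Greedy peeling: repeatedly delete the vertex with the smallest
    weighted degree, keeping the intermediate subgraph whose total edge
    weight per vertex (the objective of step 9) is highest."""
    H = G.copy()
    best_density, best_nodes = 0.0, set(H.nodes)
    while H.number_of_nodes() > 1:
        density = H.size(weight='weight') / H.number_of_nodes()
        if density > best_density:
            best_density, best_nodes = density, set(H.nodes)
        # peel off the vertex contributing the least total edge weight
        v = min(H.nodes, key=lambda n: H.degree(n, weight='weight'))
        H.remove_node(v)
    return best_nodes   # the words annotating the video (step 10)
```

The words returned here are exactly the dense-subgraph vertices that step 10 records as the video's annotation.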
10. For the dense subgraph found in the graph model in the previous step, record the words represented by the vertices of the dense subgraph as the effective annotation of the video; this annotation embodies the semantic information of the video. A video annotation result is shown in Fig. 7.

Claims (5)

1. A video semantic mining method, characterized in that the steps of the method are as follows:
Step S1: for the video to be processed, perform Chinese continuous speech recognition, video object recognition and video text recognition respectively;
Step S2: express each of the three recognition results of step S1 as a word vector; together the three vectors form a tensor that expresses the video;
Step S3: perform Chinese word segmentation and part-of-speech tagging on the three word vectors of step S2, keeping only the nouns and verbs;
Step S4: construct a graph model expressing the video, in which the vertices of the graph are the nouns and verbs obtained in S3 and the edge weights of the graph are set to the semantic distance between the Chinese words represented by the two endpoint vertices;
Step S5: for the graph model constructed in step S4, use a dense-subgraph discovery algorithm to mine the semantics contained in the graph model.
2. The video semantic mining method according to claim 1, characterized in that the video object recognition first extracts the histogram-of-oriented-gradients (HOG) features and scale-invariant features (SIFT) of the pictures in the visual object class challenge picture library (PASCAL VOC Challenge 2010), and clusters these features with the k-means algorithm, the resulting classes being called visual words; bags of words constructed from these visual words then describe the pictures in the library; a support vector machine (SVM) model is trained with the bags of words as image features; and the support vector machine model is used to perform object recognition on the video shot keyframe images.
3. The video semantic mining method according to claim 1, characterized in that the processing of the video fuses Chinese continuous speech recognition, video object recognition and video text recognition, the three recognition results being handled uniformly as textual features, and the text processing comprises Chinese word segmentation and part-of-speech tagging.
4. The video semantic mining method according to claim 1, characterized in that, in the construction of the graph model, the vertices represent the nouns and verbs in the three recognition results of the video and the edge weights represent the semantic distances between the vertices; the edge weights are computed with a semantic metric based on HowNet, the semantic distance between two words being calculated by querying the hierarchy and membership relations between words in the HowNet semantic dictionary.
5. The video semantic mining method according to claim 1, characterized in that the dense-subgraph discovery algorithm is realized by continually removing isolated vertices from the graph model, and the video semantic mining process is expressed as the dense-subgraph discovery problem on the graph model.
CN 201110168952 2011-06-22 2011-06-22 Method for video semantic mining Pending CN102222101A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN 201110168952 CN102222101A (en) 2011-06-22 2011-06-22 Method for video semantic mining

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN 201110168952 CN102222101A (en) 2011-06-22 2011-06-22 Method for video semantic mining

Publications (1)

Publication Number Publication Date
CN102222101A true CN102222101A (en) 2011-10-19

Family

ID=44778653

Family Applications (1)

Application Number Title Priority Date Filing Date
CN 201110168952 Pending CN102222101A (en) 2011-06-22 2011-06-22 Method for video semantic mining

Country Status (1)

Country Link
CN (1) CN102222101A (en)

Cited By (12)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
WO2014146463A1 (en) * 2013-03-19 2014-09-25 中国科学院自动化研究所 Behaviour recognition method based on hidden structure reasoning
CN105629747A (en) * 2015-09-18 2016-06-01 宇龙计算机通信科技(深圳)有限公司 Voice control method and device of smart home system
CN106202421A (en) * 2012-02-02 2016-12-07 联想(北京)有限公司 A kind of obtain the method for video, device and play the method for video, device
CN107203586A (en) * 2017-04-19 2017-09-26 天津大学 A kind of automation result generation method based on multi-modal information
CN108320318A (en) * 2018-01-15 2018-07-24 腾讯科技(深圳)有限公司 Image processing method, device, computer equipment and storage medium
CN108427713A (en) * 2018-02-01 2018-08-21 宁波诺丁汉大学 A kind of video summarization method and system for homemade video
CN110879974A (en) * 2019-11-01 2020-03-13 北京微播易科技股份有限公司 Video classification method and device
CN111107381A (en) * 2018-10-25 2020-05-05 武汉斗鱼网络科技有限公司 Live broadcast room bullet screen display method, storage medium, equipment and system
CN111126373A (en) * 2019-12-23 2020-05-08 北京中科神探科技有限公司 Internet short video violation judgment device and method based on cross-modal identification technology
CN111797850A (en) * 2019-04-09 2020-10-20 Oppo广东移动通信有限公司 Video classification method and device, storage medium and electronic equipment
CN113343974A (en) * 2021-07-06 2021-09-03 国网天津市电力公司 Multi-modal fusion classification optimization method considering inter-modal semantic distance measurement
CN115203408A (en) * 2022-06-24 2022-10-18 中国人民解放军国防科技大学 Intelligent labeling method for multi-modal test data

Citations (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN101093500A (en) * 2007-07-16 2007-12-26 武汉大学 Method for recognizing semantics of events in video
US20080059872A1 (en) * 2006-09-05 2008-03-06 National Cheng Kung University Video annotation method by integrating visual features and frequent patterns

Patent Citations (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20080059872A1 (en) * 2006-09-05 2008-03-06 National Cheng Kung University Video annotation method by integrating visual features and frequent patterns
CN101093500A (en) * 2007-07-16 2007-12-26 武汉大学 Method for recognizing semantics of events in video

Non-Patent Citations (3)

* Cited by examiner, † Cited by third party
Title
China Doctoral Dissertations Full-text Database (Electronic Journal), 2010-12-15, Liu Yanan, "Video Semantic Understanding with Multimodal Feature Fusion and Variable Selection", Chapter 4, relevant to claims 1-5 *
《信息化纵横》, October 2009, Zhang Huansheng et al., "A Survey of Graph-Based Frequent Substructure Mining Algorithms", pp. 5-9, relevant to claims 1-5 *
Journal of Computer Research and Development, January 2009, Liu Yanan et al., "Video Semantic Mining Based on Multimodal Subspace Correlation Propagation", pp. 1-8, relevant to claims 1-5 *

Cited By (14)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN106202421B (en) * 2012-02-02 2020-01-31 联想(北京)有限公司 method and device for obtaining video and method and device for playing video
CN106202421A (en) * 2012-02-02 2016-12-07 联想(北京)有限公司 A kind of obtain the method for video, device and play the method for video, device
WO2014146463A1 (en) * 2013-03-19 2014-09-25 中国科学院自动化研究所 Behaviour recognition method based on hidden structure reasoning
CN105629747A (en) * 2015-09-18 2016-06-01 宇龙计算机通信科技(深圳)有限公司 Voice control method and device of smart home system
CN107203586A (en) * 2017-04-19 2017-09-26 天津大学 A kind of automation result generation method based on multi-modal information
CN108320318A (en) * 2018-01-15 2018-07-24 腾讯科技(深圳)有限公司 Image processing method, device, computer equipment and storage medium
CN108427713A (en) * 2018-02-01 2018-08-21 宁波诺丁汉大学 A kind of video summarization method and system for homemade video
CN108427713B (en) * 2018-02-01 2021-11-16 宁波诺丁汉大学 Video abstraction method and system for self-made video
CN111107381A (en) * 2018-10-25 2020-05-05 武汉斗鱼网络科技有限公司 Live broadcast room bullet screen display method, storage medium, equipment and system
CN111797850A (en) * 2019-04-09 2020-10-20 Oppo广东移动通信有限公司 Video classification method and device, storage medium and electronic equipment
CN110879974A (en) * 2019-11-01 2020-03-13 北京微播易科技股份有限公司 Video classification method and device
CN111126373A (en) * 2019-12-23 2020-05-08 北京中科神探科技有限公司 Internet short video violation judgment device and method based on cross-modal identification technology
CN113343974A (en) * 2021-07-06 2021-09-03 国网天津市电力公司 Multi-modal fusion classification optimization method considering inter-modal semantic distance measurement
CN115203408A (en) * 2022-06-24 2022-10-18 中国人民解放军国防科技大学 Intelligent labeling method for multi-modal test data

Similar Documents

Publication Publication Date Title
CN102222101A (en) Method for video semantic mining
CN102663015B Video semantic labeling method based on bag-of-features models and supervised learning
Zhang et al. Fusion of multichannel local and global structural cues for photo aesthetics evaluation
CN106599029B (en) Chinese short text clustering method
CN103946838B Interactive multi-modal image search
US9251434B2 (en) Techniques for spatial semantic attribute matching for location identification
CN102414680B Semantic event detection using cross-domain knowledge
CN112004111B (en) News video information extraction method for global deep learning
CN102799684B Video and audio file cataloguing, metadata storage indexing and searching method
CN110162591B (en) Entity alignment method and system for digital education resources
CN111914107B (en) Instance retrieval method based on multi-channel attention area expansion
CN110619051B (en) Question sentence classification method, device, electronic equipment and storage medium
CN105100894A (en) Automatic face annotation method and system
Yasser et al. Saving cultural heritage with digital make-believe: machine learning and digital techniques to the rescue
CN114465737B (en) Data processing method and device, computer equipment and storage medium
CN109492168B (en) Visual tourism interest recommendation information generation method based on tourism photos
Kelm et al. A hierarchical, multi-modal approach for placing videos on the map using millions of flickr photographs
CN111309956B (en) Image retrieval-oriented extraction method
JP5433396B2 (en) Manga image analysis device, program, search device and method for extracting text from manga image
CN110287369A Semantic-based video retrieval method and system
CN107491814B (en) Construction method of process case layered knowledge model for knowledge push
CN113345053B (en) Intelligent color matching method and system
CN111242060B (en) Method and system for extracting key information of document image
Liu et al. The design patent images classification based on image caption model
Umamaheswaran et al. Variants, meta-heuristic solution approaches and applications for image retrieval in business–comprehensive review and framework

Legal Events

Date Code Title Description
C06 Publication
PB01 Publication
C10 Entry into substantive examination
SE01 Entry into force of request for substantive examination
C02 Deemed withdrawal of patent application after publication (patent law 2001)
WD01 Invention patent application deemed withdrawn after publication

Application publication date: 20111019