CN102222101A - Method for video semantic mining - Google Patents

Method for video semantic mining

Info

Publication number
CN102222101A
CN102222101A (application CN 201110168952)
Authority
CN
China
Prior art keywords
video
recognition
semantic
word
chinese
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN 201110168952
Other languages
Chinese (zh)
Inventor
Zhang Shilin (张师林)
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Beijing University of Technology
North China University of Technology
Original Assignee
North China University of Technology
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by North China University of Technology filed Critical North China University of Technology
Priority to CN 201110168952 priority Critical patent/CN102222101A/en
Publication of CN102222101A publication Critical patent/CN102222101A/en
Pending legal-status Critical Current

Links

Images

Landscapes

  • Character Discrimination (AREA)

Abstract

The invention relates to a method for video semantic mining, comprising the steps of: first performing Chinese continuous speech recognition, video object recognition and video text recognition on the video to be processed; then performing Chinese word segmentation and part-of-speech tagging on the recognition results, retaining the nouns and verbs as the vertices of a graph model, in which the weight of the edge between two vertices is set to the Chinese semantic distance between the words the vertices represent; and finally mining the semantic information of the video with a dense-subgraph discovery algorithm. Semantic mining of the video is realized by fusing the three recognition results of Chinese continuous speech recognition, video object recognition and video text recognition; the video is represented as a graph model whose vertices are the words occurring in the video and whose edge weights are set to the semantic distance between the two endpoint words; the video semantic mining task is thereby transformed into dense-subgraph discovery on the graph model. The method solves the problems of the high error rate of any single recognition result and of the inability to fuse multiple recognition results efficiently in Chinese continuous speech recognition, video object recognition and video text recognition, as well as the problems of structured video representation and of realizing a video semantic mining algorithm. The method can be used for automatic annotation, classification and semantic mining of videos in batches.

Description

A method for video semantic mining
Technical field
The present invention relates to the fields of digital media and machine learning. It performs semantic analysis on a video supplied by the user and annotates the video semantically by fusing its speech, text and image information.
Background technology
With the development of online video sharing websites and video processing technology, a large amount of content in video form has emerged. Because video data is unstructured and lacks the necessary descriptive information, it cannot be processed as easily as text. Manual semantic annotation of video is time-consuming and labor-intensive and cannot meet the requirements of batch video processing. Content-based video processing is a current research focus, but the prior art has a high error rate when annotating video content and does not jointly exploit the image, text and speech content. Image object recognition technology has matured gradually; in visual object category classification challenges, image object recognition has reached a practical level. Continuous speech recognition technology allows speech signals to be transcribed into text. Video text recognition can identify the text embedded in a video so that it can be handled as text. To combine these three recognition technologies, video semantic analysis needs an effective fusion method. HowNet is a Chinese semantic dictionary; using the concept hierarchy in HowNet, the semantic distance between two words can be computed. Based on semantic distance, the three kinds of recognition results can be measured semantically. Because the image, text and speech modalities of a video are highly correlated, information from different modalities can be fused effectively and recognition errors removed. A graph model, composed of vertices and edges, can express the relations among the concepts in a whole video. A dense-subgraph discovery algorithm can find the clustered semantic relations in the video's graph model, achieving the goal of video semantic annotation.
Summary of the invention
Existing content-based video processing techniques do not make full use of the high-level semantic information carried by the image, speech and text of a video, and cannot classify and mine videos at the semantic level. To remedy these deficiencies of the prior art, the present invention proposes a method for semantic mining of video.
To achieve the described purpose, the invention provides a method for expressing and mining video whose technical scheme comprises the following steps:
Step S1: for the video to be processed, perform Chinese continuous speech recognition, video object recognition and video text recognition respectively;
Step S2: express each of the three recognition results of step S1 as a word vector; together the three vectors form a tensor that expresses the video;
Step S3: perform Chinese word segmentation and part-of-speech tagging on the three word vectors of step S2, keeping only the nouns and verbs;
Step S4: construct a graph model expressing the video, in which the vertices of the graph are the nouns and verbs obtained in S3 and the edge weights of the graph are set to the semantic distance between the Chinese words represented by the two endpoint vertices;
Step S5: for the graph model constructed in step S4, use a dense-subgraph discovery algorithm to mine the semantics contained in the graph model.
Beneficial effects of the invention: automatic semantic annotation, automatic classification and video similarity measurement can be realized for video. For massive video data, this technique avoids the tedious work brought by manual annotation. The invention effectively fuses the results of Chinese continuous speech recognition, Chinese text recognition and image object recognition; by expressing the video as a graph model it represents the semantic distance relations among the semantic concepts in the video, the distance relations being realized by a semantic distance metric based on HowNet; finally, the annotation and mining of the semantic concepts in the video are realized by a dense-subgraph discovery algorithm.
Description of drawings
Fig. 1 is the overall video processing flowchart of the present invention.
Fig. 2 is the Chinese continuous speech recognition flowchart of the present invention.
Fig. 3 is the video text recognition flowchart of the present invention.
Fig. 4 is the image object recognition flowchart of the present invention.
Fig. 5 shows the hierarchical relations of the semantic distance metric of the present invention.
Fig. 6 illustrates the video dense-subgraph mining of the present invention.
Fig. 7 shows a video annotation result of the present invention.
Embodiment
The detailed problems involved in the technical solution of the present invention are described below with reference to the accompanying drawings. Note that the described embodiments are intended only to facilitate the understanding of the invention and do not limit it in any way.
The present invention proposes a method for video semantic mining. As shown in Fig. 1, the processing flow of the method is divided into four layers. At the bottom is the video library layer, which stores video resources of various formats. Above the video library is the multimodal fusion layer, which performs the structural analysis of the video and the recognition and effective fusion of image, text and speech. The next layer up is the video analysis layer, which realizes the graph-model representation of the video and the video mining algorithm based on dense-subgraph discovery; in addition, video classification and mining can be realized with a support vector machine model. The top layer is the transparent intelligent video service layer provided to the user. At the rightmost side is the semantic computation support layer based on HowNet. Following this flow, the concrete implementation steps are as follows:
1. Video preprocessing
Perform shot segmentation on the video to be processed, then extract a keyframe for each shot and save these keyframes for the subsequent image object recognition. Extract the audio signal from the video, sample it at 16 kHz with 16-bit samples, and save it in WAV format for the subsequent speech recognition.
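As an illustration, this preprocessing step might be sketched in Python as follows, assuming OpenCV and the ffmpeg command-line tool are available; the frame-difference shot detection and its threshold are assumptions, since the description does not specify a particular shot-segmentation method.

```python
import subprocess
import cv2

def extract_keyframes(video_path, diff_threshold=30.0):
    """Keep one keyframe per detected shot, using mean frame difference
    as a simple stand-in for shot segmentation."""
    cap = cv2.VideoCapture(video_path)
    keyframes, prev_gray = [], None
    while True:
        ok, frame = cap.read()
        if not ok:
            break
        gray = cv2.cvtColor(frame, cv2.COLOR_BGR2GRAY)
        if prev_gray is None or cv2.absdiff(gray, prev_gray).mean() > diff_threshold:
            keyframes.append(frame)  # treat a large change as a shot boundary
        prev_gray = gray
    cap.release()
    return keyframes

def extract_audio(video_path, wav_path="audio.wav"):
    """Save the audio track as 16 kHz, 16-bit mono WAV, as required above."""
    subprocess.run(["ffmpeg", "-y", "-i", video_path, "-vn",
                    "-ac", "1", "-ar", "16000", "-acodec", "pcm_s16le",
                    wav_path], check=True)
```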
2. Image object recognition
First download the visual object class challenge picture library (PASCAL VOC Challenge 2010 Database, http://pascallin.ecs.soton.ac.uk/challenges/VOC/voc2010/index.html). Extract the histogram-of-oriented-gradients (HOG) features of each picture in the library, and cluster the features of the whole library with the k-means clustering algorithm; the number of clusters can be set to 1000, which yields 1000 visual words. Then describe each picture with these 1000 visual words, so that each picture constitutes a bag of words serving as its intermediate feature. Finally, train support vector machine (SVM) classification models for 20 visual categories on the bag-of-words features of the pictures. The 20 categories are: person, bird, cat, cow, dog, horse, sheep, aeroplane, boat, bicycle, motorbike, train, car, bus, bottle, chair, dining table, potted plant, sofa, and TV monitor. Use these classification models to perform object recognition on the keyframe images of the video, and save the recognition result as a text denoted Text_OBJECT. The processing flow of this step is shown in Fig. 2.
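A condensed sketch of this bag-of-visual-words pipeline, using scikit-image and scikit-learn as assumed stand-ins for whatever implementation the inventors used; the patch size, stride and the choice of a linear SVM are illustrative.

```python
import numpy as np
from skimage.feature import hog
from sklearn.cluster import MiniBatchKMeans
from sklearn.svm import LinearSVC

N_WORDS = 1000  # size of the visual vocabulary, as in the description

def hog_patches(gray_image, patch=64, step=32):
    """HOG descriptor of every local patch of one grayscale image."""
    h, w = gray_image.shape
    return np.array([hog(gray_image[y:y + patch, x:x + patch])
                     for y in range(0, h - patch + 1, step)
                     for x in range(0, w - patch + 1, step)])

def build_vocabulary(train_images):
    """Cluster all patch descriptors into N_WORDS visual words (k-means)."""
    descriptors = np.vstack([hog_patches(img) for img in train_images])
    return MiniBatchKMeans(n_clusters=N_WORDS).fit(descriptors)

def bow_histogram(gray_image, vocab):
    """Bag-of-words histogram: how often each visual word occurs in the image."""
    words = vocab.predict(hog_patches(gray_image))
    return np.bincount(words, minlength=N_WORDS).astype(float)

def train_classifier(train_images, labels, vocab):
    """One-vs-rest linear SVMs over the 20 PASCAL VOC category labels."""
    X = np.array([bow_histogram(img, vocab) for img in train_images])
    return LinearSVC().fit(X, labels)
```

Keyframes from step 1 would then be converted to grayscale, turned into histograms with bow_histogram, and labelled by the trained classifier to produce Text_OBJECT.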
3. Chinese continuous speech recognition
First download the open-source code of the Sphinx continuous speech recognition system together with the matching Chinese language model, Chinese acoustic model and Chinese word list file (Sphinx, http://sphinx.sourceforge.net). Perform continuous speech recognition on the audio signal obtained by the video preprocessing, transcribing the audio signal into a text denoted Text_ASR. The processing flow of this step is shown in Fig. 3.
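A minimal sketch of this transcription step, assuming the pocketsphinx Python package (a binding to the CMU Sphinx engine named above); the Chinese model file paths are placeholders that would point at the downloaded acoustic model, language model and dictionary.

```python
from pocketsphinx import Decoder

# Placeholder paths for the downloaded Chinese models (assumptions).
decoder = Decoder(hmm='zh/acoustic-model',   # Chinese acoustic model
                  lm='zh/zh.lm.bin',         # Chinese language model
                  dict='zh/zh.dic')          # Chinese word list / dictionary

with open('audio.wav', 'rb') as f:
    f.read(44)                    # skip the canonical 44-byte WAV header
    decoder.start_utt()
    decoder.process_raw(f.read(), no_search=False, full_utt=True)
    decoder.end_utt()

text_asr = decoder.hyp().hypstr if decoder.hyp() else ''   # Text_ASR
```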
4. Video text recognition
For text localization in video images, a dual-edge model based on character strokes first produces candidate text regions, and the candidate regions are then decomposed to obtain precisely located text blocks. The video text extraction algorithm samples one frame every several video frames and applies the image-based text localization to obtain text objects; the text objects are then tracked forwards and backwards through the video frame sequence; finally the text objects are recognized to obtain the text extraction result, which is saved as a text denoted Text_VOCR. The processing flow of this step is shown in Fig. 4.
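A greatly simplified stand-in for this step: sample one frame every N frames and run an off-the-shelf OCR engine on each sample. pytesseract is assumed here purely for illustration; it replaces, rather than implements, the stroke-based localization and forward/backward tracking described above.

```python
import cv2
import pytesseract

def video_text(video_path, every_n=30, lang='chi_sim'):
    """Collect OCR output from one frame in every `every_n` frames."""
    cap = cv2.VideoCapture(video_path)
    texts, idx = [], 0
    while True:
        ok, frame = cap.read()
        if not ok:
            break
        if idx % every_n == 0:
            gray = cv2.cvtColor(frame, cv2.COLOR_BGR2GRAY)
            texts.append(pytesseract.image_to_string(gray, lang=lang))
        idx += 1
    cap.release()
    return '\n'.join(texts)   # Text_VOCR

text_vocr = video_text('input.mp4')
```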
5. Perform Chinese word segmentation based on a hidden Markov model on Text_OBJECT, Text_ASR and Text_VOCR, remove the meaningless stop words, and apply Chinese part-of-speech tagging; retain the verbs and nouns for the next step of analysis, denoting the processed results Word_OBJECT, Word_ASR and Word_VOCR respectively. The whole video can then be expressed semantically as a tensor

Ψ = Word_OBJECT ⊗ Word_ASR ⊗ Word_VOCR

where Ψ denotes the semantic tensor of the video, an element of the tensor space formed by the three vectors Word_OBJECT, Word_ASR and Word_VOCR.
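A sketch of this step using the jieba toolkit's part-of-speech mode as an assumed stand-in for the HMM-based segmenter (jieba itself uses an HMM for out-of-vocabulary words); the stop-word list is illustrative.

```python
import jieba.posseg as pseg

STOPWORDS = {'的', '了', '是', '在', '和'}   # illustrative stop-word list

def nouns_and_verbs(text):
    """Segment Chinese text, drop stop words, keep nouns (n*) and verbs (v*)."""
    return [word for word, flag in pseg.cut(text)
            if word not in STOPWORDS and flag[:1] in ('n', 'v')]

# text_object, text_asr and text_vocr are the outputs of steps 2-4.
word_object = nouns_and_verbs(text_object)   # Word_OBJECT
word_asr    = nouns_and_verbs(text_asr)      # Word_ASR
word_vocr   = nouns_and_verbs(text_vocr)     # Word_VOCR
```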
6. For the nouns and verbs obtained in the previous step, compute the semantic similarity between each pair of words of the same part of speech. The computation adopts the hierarchical distance metric based on HowNet, with similarity defined between 0 and 1; for example, the similarity between desk and chair is 0.8, while the similarity between landscape and steamer is 0.1. The processing flow of this step is shown in Fig. 5.
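A toy sketch of such a hierarchical metric: similarity decays with the path length between two concepts in a hypernym hierarchy. The hand-built PARENT fragment and the alpha/(alpha + distance) form are illustrative assumptions; a real implementation would query the HowNet concept hierarchy.

```python
PARENT = {  # tiny hand-built fragment of a concept hierarchy (illustrative)
    'desk': 'furniture', 'chair': 'furniture', 'furniture': 'artifact',
    'steamer': 'vehicle', 'vehicle': 'artifact', 'artifact': 'entity',
    'landscape': 'scene', 'scene': 'entity',
}

def ancestors(word):
    """The chain from a word up to the root of the hierarchy."""
    chain = [word]
    while chain[-1] in PARENT:
        chain.append(PARENT[chain[-1]])
    return chain

def similarity(w1, w2, alpha=1.6):
    """Similarity in (0, 1]: alpha / (alpha + path length to the nearest
    common ancestor), a common path-based form for HowNet-style metrics."""
    a1, a2 = ancestors(w1), ancestors(w2)
    common = next((c for c in a1 if c in a2), None)
    if common is None:
        return 0.0
    return alpha / (alpha + a1.index(common) + a2.index(common))
```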
7. The words in Word_OBJECT, Word_ASR and Word_VOCR serve as the vertices V of the graph model, and the similarity between words defined in the previous step is taken as the weight w of the edge between two vertices, constructing a graph model that expresses the video. This is a weighted undirected graph G = (V, E), where V is the vertex set, E is the edge set and |V| = n is the number of vertices; every edge e = (u, v) ∈ E carries a non-negative weight w(e) ≥ 0, defined as the semantic similarity between the corresponding words determined in the previous step.
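A sketch of this construction with networkx (an assumed choice of graph library), reusing the similarity function sketched above:

```python
import itertools
import networkx as nx

def build_graph(word_object, word_asr, word_vocr):
    """Vertices are the retained words; edge weights are semantic similarities."""
    G = nx.Graph()
    words = set(word_object) | set(word_asr) | set(word_vocr)
    for u, v in itertools.combinations(words, 2):
        w = similarity(u, v)   # metric from the previous step
        if w > 0:
            G.add_edge(u, v, weight=w)
    return G
```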
8. Since the image, text and audio of a video jointly express one theme, they are consistent semantically. The semantic distance between the three vectors of the video tensor should therefore be minimized, and words that violate this minimization principle, having been produced by recognition errors, should be removed. The principle can be expressed as maximizing

Σ_{i,j} f(m_i, n_j)

where m_i and n_j denote two words; m, n ∈ {Word_OBJECT, Word_ASR, Word_VOCR} are vectors of the video tensor; 0 ≤ i ≤ |m| and 0 ≤ j ≤ |n| are word indices, bounded by the dimension of the corresponding vector of the video tensor; and the similarity value f(·, ·) of a word pair is defined between 0 and 1, so that maximizing total similarity is equivalent to minimizing semantic distance.
9. The above optimization problem is transformed into the dense-subgraph discovery problem on the graph model: in G = (V, E), find a subgraph H = (X, F), where H is a subgraph, X is the subgraph vertex set and F is the subgraph edge set. The dense-subgraph discovery algorithm can be expressed as maximizing

(1/|X|) Σ_{l ∈ F} w(l)

that is, the sum of edge weights averaged over the subgraph vertices, where |X| is the number of subgraph vertices, F is the set of subgraph edges, and w(l) is the weight of edge l, computed with the same f(·, ·) as in the previous step. The video semantic feature tensor space comprises the three vectors Word_OBJECT, Word_ASR and Word_VOCR; each vector constructs one community of the graph model, so the whole video is expressed as a graph model composed of three communities. The dense-subgraph discovery algorithm is carried out on this graph model, as shown in Fig. 6.
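A sketch of the discovery step via greedy peeling, one standard way to optimize this density objective and consistent with claim 5's description of repeatedly removing isolated vertices; whether the inventors use exactly this procedure is an assumption.

```python
def densest_subgraph(G):
    """Greedy peeling: repeatedly delete the vertex with the smallest
    weighted degree, keeping the intermediate subgraph whose total edge
    weight per vertex (the objective of step 9) is highest."""
    H = G.copy()
    best_density, best_nodes = 0.0, set(H.nodes)
    while H.number_of_nodes() > 1:
        density = H.size(weight='weight') / H.number_of_nodes()
        if density > best_density:
            best_density, best_nodes = density, set(H.nodes)
        # peel off the vertex contributing the least total edge weight
        v = min(H.nodes, key=lambda n: H.degree(n, weight='weight'))
        H.remove_node(v)
    return best_nodes   # the words annotating the video (step 10)
```

The words returned here are exactly the dense-subgraph vertices that step 10 records as the video's annotation.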
10. For the dense subgraph found in the graph model in the previous step, record the words represented by the vertices of the dense subgraph as the effective annotation of the video; this annotation embodies the semantic information of the video. A video annotation result is shown in Fig. 7.

Claims (5)

1. A video semantic mining method, characterized in that the steps of the method are as follows:
Step S1: for the video to be processed, perform Chinese continuous speech recognition, video object recognition and video text recognition respectively;
Step S2: express each of the three recognition results of step S1 as a word vector; together the three vectors form a tensor that expresses the video;
Step S3: perform Chinese word segmentation and part-of-speech tagging on the three word vectors of step S2, keeping only the nouns and verbs;
Step S4: construct a graph model expressing the video, in which the vertices of the graph are the nouns and verbs obtained in S3 and the edge weights of the graph are set to the semantic distance between the Chinese words represented by the two endpoint vertices;
Step S5: for the graph model constructed in step S4, use a dense-subgraph discovery algorithm to mine the semantics contained in the graph model.
2. The video semantic mining method according to claim 1, characterized in that the video object recognition first extracts the histogram-of-oriented-gradients (HOG) features and scale-invariant features (SIFT) of the pictures in the visual object class challenge picture library (PASCAL VOC Challenge 2010), and clusters these features with the k-means algorithm, the resulting classes being called visual words; bags of words constructed from these visual words then describe the pictures in the library; a support vector machine (SVM) model is trained with the bags of words as image features; and the support vector machine model is used to perform object recognition on the video shot keyframe images.
3. The video semantic mining method according to claim 1, characterized in that the processing of the video fuses Chinese continuous speech recognition, video object recognition and video text recognition, the three recognition results being handled uniformly as textual features, and the text processing comprises Chinese word segmentation and part-of-speech tagging.
4. The video semantic mining method according to claim 1, characterized in that, in the construction of the graph model, the vertices represent the nouns and verbs in the three recognition results of the video and the edge weights represent the semantic distances between the vertices; the edge weights are computed with a semantic metric based on HowNet, the semantic distance between two words being calculated by querying the hierarchy and membership relations between words in the HowNet semantic dictionary.
5. The video semantic mining method according to claim 1, characterized in that the dense-subgraph discovery algorithm is realized by continually removing isolated vertices from the graph model, and the video semantic mining process is expressed as the dense-subgraph discovery problem on the graph model.
CN 201110168952 2011-06-22 2011-06-22 Method for video semantic mining Pending CN102222101A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN 201110168952 CN102222101A (en) 2011-06-22 2011-06-22 Method for video semantic mining

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN 201110168952 CN102222101A (en) 2011-06-22 2011-06-22 Method for video semantic mining

Publications (1)

Publication Number Publication Date
CN102222101A true CN102222101A (en) 2011-10-19

Family

ID=44778653

Family Applications (1)

Application Number Title Priority Date Filing Date
CN 201110168952 Pending CN102222101A (en) 2011-06-22 2011-06-22 Method for video semantic mining

Country Status (1)

Country Link
CN (1) CN102222101A (en)

Cited By (12)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
WO2014146463A1 (en) * 2013-03-19 2014-09-25 中国科学院自动化研究所 Behaviour recognition method based on hidden structure reasoning
CN105629747A (en) * 2015-09-18 2016-06-01 宇龙计算机通信科技(深圳)有限公司 Voice control method and device of smart home system
CN106202421A (en) * 2012-02-02 2016-12-07 联想(北京)有限公司 A kind of obtain the method for video, device and play the method for video, device
CN107203586A (en) * 2017-04-19 2017-09-26 天津大学 A kind of automation result generation method based on multi-modal information
CN108320318A (en) * 2018-01-15 2018-07-24 腾讯科技(深圳)有限公司 Image processing method, device, computer equipment and storage medium
CN108427713A (en) * 2018-02-01 2018-08-21 宁波诺丁汉大学 A kind of video summarization method and system for homemade video
CN110879974A (en) * 2019-11-01 2020-03-13 北京微播易科技股份有限公司 Video classification method and device
CN111107381A (en) * 2018-10-25 2020-05-05 武汉斗鱼网络科技有限公司 Live broadcast room bullet screen display method, storage medium, equipment and system
CN111126373A (en) * 2019-12-23 2020-05-08 北京中科神探科技有限公司 Internet short video violation judgment device and method based on cross-modal identification technology
CN111797850A (en) * 2019-04-09 2020-10-20 Oppo广东移动通信有限公司 Video classification method and device, storage medium and electronic equipment
CN113343974A (en) * 2021-07-06 2021-09-03 国网天津市电力公司 Multi-modal fusion classification optimization method considering inter-modal semantic distance measurement
CN115203408A (en) * 2022-06-24 2022-10-18 中国人民解放军国防科技大学 Intelligent labeling method for multi-modal test data

Citations (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN101093500A (en) * 2007-07-16 2007-12-26 武汉大学 Method for recognizing semantics of events in video
US20080059872A1 (en) * 2006-09-05 2008-03-06 National Cheng Kung University Video annotation method by integrating visual features and frequent patterns

Patent Citations (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20080059872A1 (en) * 2006-09-05 2008-03-06 National Cheng Kung University Video annotation method by integrating visual features and frequent patterns
CN101093500A (en) * 2007-07-16 2007-12-26 武汉大学 Method for recognizing semantics of events in video

Non-Patent Citations (3)

* Cited by examiner, † Cited by third party
Title
China Doctoral Dissertations Full-text Database (Electronic Journal), 2010-12-15, Liu Yanan, "Video Semantic Understanding with Multimodal Feature Fusion and Variable Selection", Chapter 4, relevant to claims 1-5 *
《信息化纵横》, October 2009, Zhang Huansheng et al., "A Survey of Graph-Based Frequent Substructure Mining Algorithms", pp. 5-9, relevant to claims 1-5 *
Journal of Computer Research and Development, January 2009, Liu Yanan et al., "Video Semantic Mining Based on Multimodal Subspace Correlation Propagation", pp. 1-8, relevant to claims 1-5 *

Cited By (14)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN106202421B (en) * 2012-02-02 2020-01-31 联想(北京)有限公司 method and device for obtaining video and method and device for playing video
CN106202421A (en) * 2012-02-02 2016-12-07 联想(北京)有限公司 A kind of obtain the method for video, device and play the method for video, device
WO2014146463A1 (en) * 2013-03-19 2014-09-25 中国科学院自动化研究所 Behaviour recognition method based on hidden structure reasoning
CN105629747A (en) * 2015-09-18 2016-06-01 宇龙计算机通信科技(深圳)有限公司 Voice control method and device of smart home system
CN107203586A (en) * 2017-04-19 2017-09-26 天津大学 A kind of automation result generation method based on multi-modal information
CN108320318A (en) * 2018-01-15 2018-07-24 腾讯科技(深圳)有限公司 Image processing method, device, computer equipment and storage medium
CN108427713A (en) * 2018-02-01 2018-08-21 宁波诺丁汉大学 A kind of video summarization method and system for homemade video
CN108427713B (en) * 2018-02-01 2021-11-16 宁波诺丁汉大学 Video abstraction method and system for self-made video
CN111107381A (en) * 2018-10-25 2020-05-05 武汉斗鱼网络科技有限公司 Live broadcast room bullet screen display method, storage medium, equipment and system
CN111797850A (en) * 2019-04-09 2020-10-20 Oppo广东移动通信有限公司 Video classification method and device, storage medium and electronic equipment
CN110879974A (en) * 2019-11-01 2020-03-13 北京微播易科技股份有限公司 Video classification method and device
CN111126373A (en) * 2019-12-23 2020-05-08 北京中科神探科技有限公司 Internet short video violation judgment device and method based on cross-modal identification technology
CN113343974A (en) * 2021-07-06 2021-09-03 国网天津市电力公司 Multi-modal fusion classification optimization method considering inter-modal semantic distance measurement
CN115203408A (en) * 2022-06-24 2022-10-18 中国人民解放军国防科技大学 Intelligent labeling method for multi-modal test data

Similar Documents

Publication Publication Date Title
CN102222101A (en) Method for video semantic mining
CN102663015B Video semantic labeling method based on bag-of-features models and supervised learning
Zhang et al. Fusion of multichannel local and global structural cues for photo aesthetics evaluation
CN106599029B (en) Chinese short text clustering method
CN103946838B Interactive multi-modal image search
US9251434B2 (en) Techniques for spatial semantic attribute matching for location identification
CN102414680B Semantic event detection using cross-domain knowledge
CN112004111B (en) News video information extraction method for global deep learning
CN102799684B Video and audio file cataloguing, metadata storage indexing and searching method
CN110162591B (en) Entity alignment method and system for digital education resources
CN111914107B (en) Instance retrieval method based on multi-channel attention area expansion
CN110619051B (en) Question sentence classification method, device, electronic equipment and storage medium
CN105100894A (en) Automatic face annotation method and system
Yasser et al. Saving cultural heritage with digital make-believe: machine learning and digital techniques to the rescue
CN114465737B (en) Data processing method and device, computer equipment and storage medium
CN109492168B (en) Visual tourism interest recommendation information generation method based on tourism photos
Kelm et al. A hierarchical, multi-modal approach for placing videos on the map using millions of flickr photographs
CN111309956B (en) Image retrieval-oriented extraction method
JP5433396B2 (en) Manga image analysis device, program, search device and method for extracting text from manga image
CN110287369A Semantic-based video retrieval method and system
CN107491814B (en) Construction method of process case layered knowledge model for knowledge push
CN113345053B (en) Intelligent color matching method and system
CN111242060B (en) Method and system for extracting key information of document image
Liu et al. The design patent images classification based on image caption model
Umamaheswaran et al. Variants, meta-heuristic solution approaches and applications for image retrieval in business–comprehensive review and framework

Legal Events

Date Code Title Description
C06 Publication
PB01 Publication
C10 Entry into substantive examination
SE01 Entry into force of request for substantive examination
C02 Deemed withdrawal of patent application after publication (patent law 2001)
WD01 Invention patent application deemed withdrawn after publication

Application publication date: 20111019