CN102222101A - Method for video semantic mining - Google Patents
Method for video semantic mining
- Publication number: CN102222101A
- Application number: CN 201110168952
- Authority: CN (China)
- Prior art keywords: video, recognition, semantic, words, vertices
- Prior art date: 2011-06-22
- Legal status: Pending (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Abstract
The invention relates to a video semantic mining method. The method first performs Chinese continuous speech recognition, video object recognition, and video text recognition on the video to be processed; it then applies Chinese word segmentation and part-of-speech tagging to the recognition results, retaining nouns and verbs as the vertices of a graph model, with the weight of each edge set to the Chinese semantic distance between the words represented by its two vertices; finally, the semantic information of the video is mined with a dense-subgraph discovery algorithm. The invention is characterized in that it realizes video semantic mining by fusing the three recognition results of Chinese continuous speech recognition, video object recognition, and video text recognition; the video is expressed as a graph model whose vertices are the words in the video and whose edge weights are the semantic distances between vertex pairs; the video semantic mining task is further transformed into a dense-subgraph discovery problem on the graph model. The invention addresses the high error rate of any single recognition result and the lack of effective fusion of multiple recognition results in Chinese continuous speech recognition, video object recognition, and video text recognition; it also solves the structured representation of video and the algorithmic realization of video semantic mining. The invention can be used for automatic labeling, classification, and semantic mining of batch videos.
Description
Technical Field
The invention relates to the fields of digital media and machine learning. It performs semantic analysis on user-supplied video and semantically annotates the video by fusing speech, text, and image information.
Background Art
With the development of online video-sharing websites and video processing technology, large volumes of content in video format have emerged. Because video is unformatted data lacking the necessary descriptive information, it cannot be processed as readily as text. Manual semantic annotation of video is time-consuming and labor-intensive and cannot meet the demands of batch video processing. Content-based video processing is a current research focus, but existing techniques label video content with a high error rate and do not comprehensively consider the effective fusion of image, text, and speech content. Image object recognition has gradually matured; in the visual object classification challenge it has reached a practical level. Continuous speech recognition allows speech signals to be transcribed into text. Video text recognition can extract the text embedded in a video and treat it as ordinary text. To combine these three recognition techniques, video semantic analysis requires an effective fusion method. HowNet is a Chinese semantic lexicon; using the conceptual hierarchy in HowNet, the semantic distance between two words can be computed. This semantic distance provides a common measure over the three kinds of recognition results. Because the image, text, and speech modalities of a video are highly correlated, information from the different modalities can be fused effectively and recognition errors removed. A graph model, consisting of vertices and edges, can express the relationships among the concepts of an entire video. A dense-subgraph discovery algorithm can find the clusters of related semantics in the video's graph model, achieving video semantic annotation.
Summary of the Invention
Existing content-based video processing techniques do not fully exploit the high-level semantic information carried by images, speech, and text, and cannot classify and mine video at the level of high-level semantics. To remedy these shortcomings of the prior art, the present invention proposes a method for semantic mining of video.
To achieve this purpose, the present invention provides a method of video representation and mining whose technical solution comprises the following steps:
Step S1: Perform Chinese continuous speech recognition, video object recognition, and video text recognition on the video to be processed;
Step S2: Express each of the three recognition results of step S1 as a text vector; together the three vectors form a tensor representing the video;
Step S3: Apply Chinese word segmentation and part-of-speech tagging to each of the three text vectors of step S2, retaining nouns and verbs;
Step S4: Construct a graph model representing the video, in which the vertices are the nouns and verbs obtained in step S3 and the weight of each edge is the semantic distance between the Chinese words represented by its two vertices;
Step S5: Mine the semantics of the graph model constructed in step S4 with a dense-subgraph discovery algorithm.
Beneficial effects of the invention: videos can be semantically annotated and classified automatically, and video similarity can be measured. For massive video collections the technique avoids the tedious labor of manual annotation. The invention effectively fuses the results of Chinese continuous speech recognition, Chinese text recognition, and image object recognition; by expressing the video as a graph model it exposes the semantic distance relationships among the semantic concepts of the video, the distances being computed by a HowNet-based semantic distance measure; finally, the dense-subgraph discovery algorithm realizes the annotation and mining of the video's semantic concepts.
Description of the Drawings
FIG. 1 is the overall video processing flowchart of the present invention.
FIG. 2 is the Chinese continuous speech recognition flowchart of the present invention.
FIG. 3 is the video text recognition flowchart of the present invention.
FIG. 4 is the image object recognition flowchart of the present invention.
FIG. 5 is the hierarchy diagram of the semantic distance measure of the present invention.
FIG. 6 illustrates dense-subgraph mining on the video graph of the present invention.
FIG. 7 shows the video annotation results of the present invention.
Detailed Description
The details of the technical solution of the present invention are described below with reference to the accompanying drawings. It should be noted that the described embodiments are intended only to facilitate understanding of the invention and do not limit it in any way.
The present invention proposes a method for video semantic mining. As shown in FIG. 1, the processing flow is organized into four layers. At the bottom is the video library layer, which stores video resources of various kinds. Above it is the multi-modal fusion layer, where structural analysis of the video and the recognition and effective fusion of images, text, and speech are carried out. The next layer up is the video mining layer, which realizes the graph-model representation of the video and the video mining algorithm based on dense-subgraph discovery; in addition, video classification and mining can be performed with a support vector machine model. The top layer is the transparent intelligent video service layer presented to users, and on the far right is the HowNet-based semantic computing support layer. The specific implementation steps of this flow are as follows:
1. Video preprocessing
Segment the video to be processed into shots, extract key frames from each shot, and save the key frames for the subsequent image object recognition. Sample the audio track of the video at 16 kHz with 16-bit resolution and save it in WAV format for the subsequent speech recognition.
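A minimal Python sketch of this step follows; the histogram-correlation shot detector and its threshold are illustrative assumptions (the patent does not prescribe a particular shot-segmentation algorithm), and `ffmpeg` is assumed to be available for the audio resampling.

```python
import subprocess
import cv2

def extract_keyframes(video_path, threshold=0.7):
    """Cut shots where color-histogram correlation drops; keep each shot's first frame."""
    cap = cv2.VideoCapture(video_path)
    keyframes, prev_hist = [], None
    while True:
        ok, frame = cap.read()
        if not ok:
            break
        hist = cv2.calcHist([frame], [0, 1, 2], None, [8, 8, 8],
                            [0, 256, 0, 256, 0, 256])
        hist = cv2.normalize(hist, hist).flatten()
        if prev_hist is None or cv2.compareHist(prev_hist, hist,
                                                cv2.HISTCMP_CORREL) < threshold:
            keyframes.append(frame)  # new shot detected
        prev_hist = hist
    cap.release()
    return keyframes

def extract_audio(video_path, wav_path="audio.wav"):
    """Resample the audio track to 16 kHz, 16-bit mono WAV for speech recognition."""
    subprocess.run(["ffmpeg", "-y", "-i", video_path, "-vn", "-ac", "1",
                    "-ar", "16000", "-sample_fmt", "s16", wav_path], check=True)
```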
2. Image object recognition
First download the PASCAL VOC Challenge 2010 image database (http://pascallin.ecs.soton.ac.uk/challenges/VOC/voc2010/index.html). Extract local gradient (HOG) features from every image in the database and cluster the features with the k-means algorithm; the number of clusters can be set to 1000, forming 1000 visual words. Describe each image with these 1000 visual words, so that each image becomes a bag of words serving as an intermediate feature. Finally, train classification models for 20 visual categories on the bag-of-words features with a support vector machine (SVM). The 20 categories are: person, bird, cat, cow, dog, horse, sheep, airplane, boat, bicycle, motorcycle, train, car, bus, bottle, chair, dining table, potted plant, sofa, and television. Use these classification models to perform object recognition on the video key-frame images and save the recognition results as a text, denoted Text_OBJECT. See FIG. 4 for the processing flow of this step.
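A condensed sketch of this bag-of-visual-words pipeline is given below; the dense patch-sampling grid and the HOG cell parameters are assumptions, while the 1000-word vocabulary and the SVM over the 20 VOC categories follow the description above.

```python
import numpy as np
from skimage.feature import hog
from sklearn.cluster import KMeans
from sklearn.svm import SVC

def dense_hog(gray, patch=32, stride=16):
    """Local gradient (HOG) descriptors computed on a dense grid of patches."""
    descriptors = []
    for y in range(0, gray.shape[0] - patch, stride):
        for x in range(0, gray.shape[1] - patch, stride):
            descriptors.append(hog(gray[y:y + patch, x:x + patch],
                                   pixels_per_cell=(8, 8), cells_per_block=(2, 2)))
    return np.array(descriptors)

def bag_of_words(descriptors, vocab):
    """Quantize descriptors against the visual vocabulary -> word histogram."""
    words = vocab.predict(descriptors)
    return np.bincount(words, minlength=vocab.n_clusters).astype(float)

# Training sketch: cluster all training descriptors into 1000 visual words,
# then fit an SVM over the 20 PASCAL VOC categories on the word histograms.
# vocab = KMeans(n_clusters=1000).fit(np.vstack(all_image_descriptors))
# clf = SVC().fit([bag_of_words(d, vocab) for d in all_image_descriptors], labels)
```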
3. Chinese continuous speech recognition
First download the open-source Sphinx continuous speech recognition system together with its Chinese language model, Chinese acoustic model, and Chinese lexicon files (http://sphinx.sourceforge.net). Perform continuous speech recognition on the audio signal obtained from the video preprocessing, transcribing the audio into text, denoted Text_ASR. See FIG. 2 for the processing flow of this step.
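A sketch of the transcription call, assuming the classic `pocketsphinx` Python bindings; the model paths are placeholders for the downloaded Chinese acoustic model, language model, and lexicon.

```python
from pocketsphinx import Decoder

config = Decoder.default_config()
config.set_string('-hmm', 'models/zh/acoustic')       # placeholder paths for
config.set_string('-lm', 'models/zh/language.lm')     # the Chinese models
config.set_string('-dict', 'models/zh/lexicon.dict')  # from the Sphinx site
decoder = Decoder(config)

with open('audio.wav', 'rb') as f:
    f.read(44)  # skip the 44-byte WAV header; raw 16 kHz/16-bit PCM follows
    decoder.start_utt()
    decoder.process_raw(f.read(), False, True)  # decode the full utterance
    decoder.end_utt()

text_asr = decoder.hyp().hypstr if decoder.hyp() else ""  # -> Text_ASR
```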
4. Video text recognition
To locate text in video images, first obtain candidate text regions from a double-edge model of character strokes, then decompose the candidate regions into precisely located text blocks. The text extraction algorithm samples one frame every several video frames and performs image-based text localization on it to obtain text objects, tracks the text objects forward and backward through the video frame sequence, and finally recognizes the text objects to obtain the extracted text. The recognition results are saved as a text, denoted Text_VOCR. See FIG. 3 for the processing flow of this step.
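The localization stage could be sketched as below; plain Canny edges plus morphological closing stand in for the double-edge stroke model, and the sampling rate, kernel size, and aspect-ratio filter are illustrative assumptions.

```python
import cv2

def text_candidates(frame):
    """Return bounding boxes of candidate text blocks in one sampled frame."""
    gray = cv2.cvtColor(frame, cv2.COLOR_BGR2GRAY)
    edges = cv2.Canny(gray, 100, 200)
    # Close the gaps between the parallel stroke edges into solid blocks.
    kernel = cv2.getStructuringElement(cv2.MORPH_RECT, (15, 3))
    blocks = cv2.morphologyEx(edges, cv2.MORPH_CLOSE, kernel)
    contours, _ = cv2.findContours(blocks, cv2.RETR_EXTERNAL,
                                   cv2.CHAIN_APPROX_SIMPLE)
    boxes = [cv2.boundingRect(c) for c in contours]
    # Text lines are typically wide and short: filter by aspect ratio and size.
    return [(x, y, w, h) for x, y, w, h in boxes if w > 2 * h and h > 8]

# Sample one frame out of every 25 (an assumed rate) for localization; the
# detected objects are then tracked forward/backward and passed to OCR.
```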
5. Apply Chinese word segmentation based on a Hidden Markov Model to Text_OBJECT, Text_ASR, and Text_VOCR, remove meaningless "stop words," and perform Chinese part-of-speech tagging, keeping the verbs and nouns for the next step of the analysis; the processed results are denoted Word_OBJECT, Word_ASR, and Word_VOCR respectively. The entire video can then be represented semantically as a tensor, i.e. Ψ ∈ Ω, where Ψ denotes the semantic tensor feature of the video and Ω = Word_OBJECT ⊗ Word_ASR ⊗ Word_VOCR denotes the tensor space formed by the three vectors.
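This step can be sketched with the `jieba` toolkit, whose segmenter/tagger is HMM-based like the method described; the three-entry stop-word set is a placeholder for a full list.

```python
import jieba.posseg as pseg

STOP_WORDS = {"的", "了", "是"}  # placeholder; use a complete stop-word list

def nouns_and_verbs(text):
    """Segment and POS-tag, keeping nouns ('n*') and verbs ('v*')."""
    return [word for word, flag in pseg.cut(text)
            if flag and flag[0] in ("n", "v") and word not in STOP_WORDS]

word_object = nouns_and_verbs(text_object)  # -> Word_OBJECT
word_asr    = nouns_and_verbs(text_asr)     # -> Word_ASR
word_vocr   = nouns_and_verbs(text_vocr)    # -> Word_VOCR
```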
6. For the nouns and verbs obtained in the previous step, compute the semantic similarity between each pair of words of the same part of speech. The computation uses a HowNet-based hierarchical distance measure, with similarity defined between 0 and 1; for example, the similarity between "table" and "chair" is 0.8, while the similarity between "scenery" and "ship" is 0.1. See FIG. 5 for the hierarchy used in this step.
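As an illustration of a hierarchical distance measure, the toy sketch below scores two words by the path length to their lowest common ancestor in a concept tree; the miniature taxonomy and the 1/(1+d) decay are assumptions standing in for HowNet's actual sememe hierarchy and formula.

```python
TAXONOMY = {  # child -> parent; a toy fragment of a concept hierarchy
    "桌子": "家具", "椅子": "家具", "家具": "人工物",
    "轮船": "交通工具", "交通工具": "人工物", "风景": "自然",
}

def path_to_root(word):
    path = [word]
    while path[-1] in TAXONOMY:
        path.append(TAXONOMY[path[-1]])
    return path

def similarity(w1, w2):
    """1/(1+d), where d is the hop count from both words to their common ancestor."""
    p1, p2 = path_to_root(w1), path_to_root(w2)
    common = next((node for node in p1 if node in p2), None)
    if common is None:
        return 0.0  # no shared ancestor -> unrelated words
    d = p1.index(common) + p2.index(common)
    return 1.0 / (1.0 + d)

# similarity("桌子", "椅子") -> 0.33 (siblings); similarity("桌子", "风景") -> 0.0
```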
7. Take the words in Word_OBJECT, Word_ASR, and Word_VOCR as the vertices V of a graph model, and set the similarity between words defined in the previous step as the weight w of the edge between two vertices, thereby constructing a graph model that represents the video. This is a weighted undirected graph G = (V, E), where V is the vertex set, E is the edge set, and |V| = n is the number of vertices; each edge (u, v) ∈ E carries a non-negative weight w(u, v) ≥ 0, defined as the semantic similarity between the two words determined in the previous step.
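Constructing this weighted undirected graph is straightforward; `networkx` is used here purely for illustration, and `sim` is the similarity function of step 6.

```python
import itertools
import networkx as nx

def build_video_graph(word_object, word_asr, word_vocr, sim):
    """G = (V, E): vertices are the words, edge weights their semantic similarity."""
    G = nx.Graph()
    vertices = set(word_object) | set(word_asr) | set(word_vocr)
    G.add_nodes_from(vertices)
    for u, v in itertools.combinations(vertices, 2):
        weight = sim(u, v)  # semantic similarity in [0, 1]
        if weight > 0:
            G.add_edge(u, v, weight=weight)
    return G
```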
8. Since the images, text, and audio of a video jointly express a single topic, they are semantically consistent. The semantic similarity distances among the three vectors of the video tensor should therefore be minimized; words that violate this minimization principle result from recognition errors and should be removed. The principle can be expressed as min Σ_{m≠n} Σ_{i,j} f(w_i^m, w_j^n), where w_i^m and w_j^n denote two words; m, n ∈ {Word_OBJECT, Word_ASR, Word_VOCR} denote vectors of the video tensor; 0 ≤ i ≤ |m| and 0 ≤ j ≤ |n| are word indices, bounded by the maximum dimension of a vector of the video tensor; and f(·) denotes the similarity distance between the words, defined between 0 and 1.
9. The above optimization problem is transformed into a dense-subgraph discovery problem on the graph model: in G = (V, E), find a subgraph H = (X, F), where H is the subgraph, X is its vertex set, and F is its edge set. The dense-subgraph discovery algorithm can be expressed as max Σ_{l∈F} w(l) / |X|, i.e. the total edge weight of the subgraph averaged over its vertices is maximized, where |X| is the number of subgraph vertices, F is the set of subgraph edges, and w(l) is the weight of edge l, computed in the same way as f(·) in the previous step. The semantic feature tensor space of the video contains three vectors, namely Word_OBJECT, Word_ASR, and Word_VOCR; each vector forms one community of the graph model, so the whole video is expressed as a graph model composed of three communities. The dense-subgraph discovery algorithm runs on this graph model, as shown in FIG. 6.
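One standard way to realize this discovery is greedy peeling (Charikar's approximation for the maximum-average-weight objective); the patent does not name a specific solver, so the sketch below is one possible instantiation.

```python
def densest_subgraph(G):
    """Greedily peel minimum-weighted-degree vertices; keep the densest snapshot."""
    H = G.copy()
    best_nodes, best_density = set(H.nodes), 0.0
    while H.number_of_nodes() > 1:
        total_weight = sum(w for _, _, w in H.edges(data="weight"))
        density = total_weight / H.number_of_nodes()  # average weight per vertex
        if density > best_density:
            best_density, best_nodes = density, set(H.nodes)
        weakest = min(H.nodes, key=lambda n: H.degree(n, weight="weight"))
        H.remove_node(weakest)
    return best_nodes  # these vertices become the video's semantic labels (step 10)
```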
10. For the dense subgraph found in the previous step, record the words represented by its vertices as the effective annotation of the video; this annotation embodies the semantic information of the video. The video annotation results are shown in FIG. 7.
Claims (5)
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN 201110168952 CN102222101A (en) | 2011-06-22 | 2011-06-22 | Method for video semantic mining |
Publications (1)
Publication Number | Publication Date |
---|---|
CN102222101A (en) | 2011-10-19 |
Family
ID=44778653
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN 201110168952, Pending, CN102222101A (en) | Method for video semantic mining | 2011-06-22 | 2011-06-22 |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN102222101A (en) |
Patent Citations (2)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
US20080059872A1 (en) * | 2006-09-05 | 2008-03-06 | National Cheng Kung University | Video annotation method by integrating visual features and frequent patterns |
CN101093500A (en) * | 2007-07-16 | 2007-12-26 | 武汉大学 | Method for recognizing semantics of events in video |
Non-Patent Citations (3)
Title |
---|
Liu Yanan, "Video semantic understanding with multi-modal feature fusion and variable selection," China Doctoral Dissertations Full-text Database (electronic journal), 2010-12-15, ch. 4 * |
Zhang Huansheng et al., "A survey of graph-based frequent substructure mining algorithms," Informatization (信息化纵横), 2009-10-31, pp. 5-9 * |
Liu Yanan et al., "Video semantic mining based on multi-modal subspace correlation propagation," Journal of Computer Research and Development, 2009-01-31, pp. 1-8 * |
Cited By (14)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN106202421B (en) * | 2012-02-02 | 2020-01-31 | 联想(北京)有限公司 | method and device for obtaining video and method and device for playing video |
CN106202421A (en) * | 2012-02-02 | 2016-12-07 | 联想(北京)有限公司 | A kind of obtain the method for video, device and play the method for video, device |
WO2014146463A1 (en) * | 2013-03-19 | 2014-09-25 | 中国科学院自动化研究所 | Behaviour recognition method based on hidden structure reasoning |
CN105629747A (en) * | 2015-09-18 | 2016-06-01 | 宇龙计算机通信科技(深圳)有限公司 | Voice control method and device of smart home system |
CN107203586A (en) * | 2017-04-19 | 2017-09-26 | 天津大学 | A kind of automation result generation method based on multi-modal information |
CN108320318A (en) * | 2018-01-15 | 2018-07-24 | 腾讯科技(深圳)有限公司 | Image processing method, device, computer equipment and storage medium |
CN108427713A (en) * | 2018-02-01 | 2018-08-21 | 宁波诺丁汉大学 | A kind of video summarization method and system for homemade video |
CN108427713B (en) * | 2018-02-01 | 2021-11-16 | 宁波诺丁汉大学 | Video abstraction method and system for self-made video |
CN111107381A (en) * | 2018-10-25 | 2020-05-05 | 武汉斗鱼网络科技有限公司 | Live broadcast room bullet screen display method, storage medium, equipment and system |
CN111797850A (en) * | 2019-04-09 | 2020-10-20 | Oppo广东移动通信有限公司 | Video classification method and device, storage medium and electronic equipment |
CN110879974A (en) * | 2019-11-01 | 2020-03-13 | 北京微播易科技股份有限公司 | Video classification method and device |
CN111126373A (en) * | 2019-12-23 | 2020-05-08 | 北京中科神探科技有限公司 | Device and method for judging Internet short video violations based on cross-modal recognition technology |
CN113343974A (en) * | 2021-07-06 | 2021-09-03 | 国网天津市电力公司 | Multi-modal fusion classification optimization method considering inter-modal semantic distance measurement |
CN115203408A (en) * | 2022-06-24 | 2022-10-18 | 中国人民解放军国防科技大学 | An intelligent labeling method for multimodal test data |
Similar Documents
Publication | Publication Date | Title |
---|---|---|
CN102222101A (en) | Method for video semantic mining | |
CN110489395B (en) | Method for automatically acquiring knowledge of multi-source heterogeneous data | |
WO2023065617A1 (en) | Cross-modal retrieval system and method based on pre-training model and recall and ranking | |
CN110598005B (en) | Public safety event-oriented multi-source heterogeneous data knowledge graph construction method | |
CN102663015B (en) | Video semantic labeling method based on characteristics bag models and supervised learning | |
CN101382937B (en) | Speech recognition-based multimedia resource processing method and its online teaching system | |
CN110309331A (en) | A Self-Supervised Cross-Modal Deep Hash Retrieval Method | |
CN110619051B (en) | Question sentence classification method, device, electronic equipment and storage medium | |
WO2018010365A1 (en) | Cross-media search method | |
CN112836487B (en) | Automatic comment method and device, computer equipment and storage medium | |
CN108268600B (en) | AI-based unstructured data management method and device | |
CN108132968A (en) | Network text is associated with the Weakly supervised learning method of Semantic unit with image | |
CN106484674A (en) | A kind of Chinese electronic health record concept extraction method based on deep learning | |
CN101877064B (en) | Image classification method and image classification device | |
CN106446109A (en) | Acquiring method and device for audio file abstract | |
CN105912625A (en) | Linked data oriented entity classification method and system | |
CN109034248B (en) | A deep learning-based classification method for images with noisy labels | |
CN112069826A (en) | Vertical Domain Entity Disambiguation Method Fusing Topic Models and Convolutional Neural Networks | |
CN113221882B (en) | Image text aggregation method and system for curriculum field | |
CN114997288B (en) | A design resource association method | |
WO2024130751A1 (en) | Text-to-image generation method and system based on local detail editing | |
CN110489548A (en) | A kind of Chinese microblog topic detecting method and system based on semanteme, time and social networks | |
CN108009135A (en) | The method and apparatus for generating documentation summary | |
CN110188359B (en) | Text entity extraction method | |
CN112347761A (en) | BERT-based drug relationship extraction method |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
C06 | Publication | ||
PB01 | Publication | ||
C10 | Entry into substantive examination | ||
SE01 | Entry into force of request for substantive examination | ||
C02 | Deemed withdrawal of patent application after publication (patent law 2001) | ||
WD01 | Invention patent application deemed withdrawn after publication |
Application publication date: 2011-10-19 |