CN102622451A - System for automatically generating television program labels - Google Patents

System for automatically generating television program labels

Info

Publication number: CN102622451A
Authority: CN
Grant status: Application
Prior art keywords: program, module, keyword, entry, label
Application number: CN 201210110031
Other languages: Chinese (zh)
Inventors: 朱其立, 王拯, 蔡智源
Original Assignee: 上海交通大学 (Shanghai Jiao Tong University)
Priority date: 2012-04-16 (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date: 2012-04-16
Publication date: 2012-08-01

Abstract

The invention provides a system for automatically generating television program labels. The system comprises: a program information acquisition module, which crawls web pages related to each program, prunes and filters them, and obtains the main content describing the program; an information keyword extraction module, which aggregates the main content and extracts keywords from it; a knowledge base module, which establishes a network of relationships among entries so that the extracted keywords can be expanded; a keyword expansion module, which uses the network provided by the knowledge base module to expand the keywords into a larger entry set; and a label generation module, which processes the entry sets associated with the keywords, filters out noise, computes scores, and generates the label set for each program. The system fills a gap in the automatic generation of television program labels. Because a knowledge base is introduced, the system is not restricted to the content of web pages, extends well, and discovers labels more effectively. The knowledge base can be built offline and the label generation algorithm is simple, so the system is efficient.

Description

System for Automatically Generating Television Program Labels

Technical Field

[0001] The present invention relates to a system in the field of computer application technology, and in particular to a system for automatically generating television program labels.

Background

[0002] How to help people make better choices has long been a significant and interesting problem. People make choices on the basis of information: combining the information they have gathered with their personal views and preferences produces a choice. Obtaining information, however, is not simple. In the past, when networks were undeveloped and exchanging information was inconvenient, scarce information and the difficulty of comparison were the obstacles to choosing. In the information age, information is a mouse click away, but this has created a different problem: information overload. Faced with massive amounts of information, people spend a great deal of time merely identifying and screening it, which is again an obstacle to choosing. Automatic label generation systems emerged to address this problem. Such a system extracts the main body of a piece of information, summarizes its content, analyzes its keywords, and generates a corresponding label set. With the label set, people can quickly grasp the gist of the information, and the labels also provide a basis for classifying it; both help people make choices.

[0003] Automatic label generation has been studied extensively, but most work concentrates on text, that is, on automatically generating labels related to a given document. Jialie Shen et al. [1] studied the automatic generation of music labels: audio features are extracted, manually labeled music serves as training material, a classifier is produced by machine learning, and the classifier is then used to label music. Stefan Siersdorfer et al. [2] proposed a scheme for supplementing video labels that uses existing video comparison techniques to merge the labels already attached to similar videos; this, however, is not automatic label generation in the true sense. At present, therefore, labeling video still relies mainly on manual work, and automatic label generation for television programs remains unexplored.

[0004] [1] Jialie Shen, Meng Wang, Shuicheng Yan, HweeHwa Pang, Xiansheng Hua. Effective Music Tagging through Advanced Statistical Modeling. SIGIR 2010.

[0005] [2] Stefan Siersdorfer, Jose San Pedro, Mark Sanderson. Automatic Video Tagging using Content Redundancy. SIGIR 2009.

Summary of the Invention

[0006] To address the above shortcomings of the prior art, the present invention provides a system for automatically generating television program labels. The system needs only the name of a television program: it automatically retrieves information related to the program from the web, summarizes and expands that information, and returns a set of labels related to the program.

[0007] The present invention is achieved through the following technical solution.

[0008] A system for automatically generating television program labels comprises a program information acquisition module, an information keyword extraction module, a keyword expansion module, and a label generation module connected in sequence, and further comprises a knowledge base module connected to the keyword expansion module, wherein:

[0009] - the program information acquisition module crawls web pages related to a program and, by pruning and filtering the pages, obtains the main content describing the program;

[0010] - the information keyword extraction module aggregates the main content obtained by the program information acquisition module and extracts keywords from it;

[0011] - the knowledge base module establishes a network of relationships among entries, which is used to expand the extracted keywords;

[0012] - the keyword expansion module uses the network provided by the knowledge base module to expand the keywords obtained by the information keyword extraction module into a larger entry set;

[0013] - the label generation module processes the entry sets associated with all keywords, filters out noise, computes scores, and finally generates the label set of the program.

[0014] The program information acquisition module includes an HTML parser. It receives the set of target television programs for which labels are to be generated and, with the aid of a search engine, obtains web pages for each program. The pages are processed by the HTML parser to obtain the main content, which is passed to the information keyword extraction module for further processing.

[0015] The information keyword extraction module includes a word segmenter and part-of-speech tagger. After the main content describing each program is obtained, the segmenter and tagger split the content into words, and only the nouns are retained.

[0016] Keywords are identified among the retained nouns by a statistical method.

[0017] The statistical method comprises the following steps:

[0018] Step 1: for a given program, divide the words into two groups, one drawn from the web pages related to that program and the other drawn from the web pages of the other programs in the set;

[0019] Step 2: compute the term frequency of each word in both groups, together with the mean and standard deviation, so that each word is characterized by four statistics: the mean and standard deviation of its term frequency in the pages related to the program, and the mean and standard deviation of its term frequency in the pages unrelated to the program;

[0020] Step 3: based on the relationship among the four statistics, identify the keywords that best characterize the program.

[0021] The knowledge base module uses Baidu Baike (Baidu Encyclopedia) as its data source and stores the data as a graph.

[0022] The organization of the Baidu Baike data comprises the following steps:

[0023] Step 1: every entry has a page that describes it; besides plain text, the page also cites other entries that already exist in Baidu Baike;

[0024] Step 2: in the knowledge base graph, there is a directed edge between each described entry and each entry it cites; applying the PageRank algorithm to this graph yields the importance of each entry;

[0025] Step 3: the weights of the entries and the citation relationships among them constitute the entire knowledge base.

[0026] For each keyword obtained by the information keyword extraction module, the keyword expansion module finds the other entries in the knowledge base graph that are connected to it by a path, and computes a weight for each such entry from the importance of the entry itself and its distance from the keyword.
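The text does not state how importance and distance are combined into a weight, nor how far the search extends. The following is a minimal sketch under stated assumptions: the graph is a plain dictionary of citation edges, the search is breadth-first, `max_distance` is a hypothetical cut-off, and the weight is hypothetically taken as importance divided by path length. None of these choices comes from the patent.

```python
from collections import deque


def expand_keyword(keyword, links, importance, max_distance=2):
    """Return {entry: weight} for entries reachable from keyword.

    links maps an entry to the entries its page cites (directed edges).
    importance maps an entry to its importance score.
    """
    distance = {keyword: 0}
    queue = deque([keyword])
    while queue:
        current = queue.popleft()
        if distance[current] == max_distance:
            continue  # hypothetical cut-off: do not search any deeper
        for neighbour in links.get(current, ()):
            if neighbour not in distance:
                distance[neighbour] = distance[current] + 1
                queue.append(neighbour)
    # Hypothetical weighting: importance discounted by path length.
    return {
        entry: importance.get(entry, 0.0) / hops
        for entry, hops in distance.items()
        if hops > 0  # the keyword itself is not one of its own expansions
    }
```

Any decreasing function of distance would fit the description equally well; division by the hop count is used here only because it is easy to read.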

[0027] The label generation module merges the associated entries of all keywords. When an entry is associated with several keywords, its weights under those keywords are added together. All entries are then sorted by total weight and the top several are returned as needed, giving the label set that describes the program.

[0028] In operation, the system is first given the set of target television programs for which labels are to be generated. With the aid of a search engine, the program information acquisition module obtains a number of web pages for each program. The module's HTML parser processes these pages to obtain the main content, which is passed to the information keyword extraction module for further processing.

After receiving the main content describing each program, the information keyword extraction module splits the content with its word segmenter and part-of-speech tagger and retains only the nouns. Keywords are then identified among these words by a statistical method, as follows. For a given program, the words are divided into two groups: one drawn from the web pages related to that program, the other from the web pages of the other programs in the set. Term frequencies are computed for both groups, together with their means and standard deviations. Each word is thus characterized by four statistics: the mean and standard deviation of its term frequency in the pages related to the program, and the mean and standard deviation of its term frequency in the pages unrelated to the program. From the relationship among these four statistics, the keywords that best characterize the program can be identified.

Keywords extracted from web pages already reflect the program's characteristics to some extent, but their range is limited: they must appear on the web pages. To overcome this limitation, an important aspect of the invention is the introduction of the knowledge base module. The knowledge base module uses Baidu Baike (Baidu Encyclopedia) as its data source and stores the data as a graph. Baidu Baike is organized so that every entry has a page that describes it; besides plain text, the page also cites other entries that already exist in Baidu Baike. In the knowledge base graph, there is a directed edge between each described entry and each entry it cites. Applying the PageRank algorithm to this graph yields the importance of each entry. The weights of the entries and the citation relationships among them constitute the entire knowledge base.

The task of the keyword expansion module is then simple: for each keyword obtained by the information keyword extraction module, it finds the other entries in the knowledge base graph that are connected to the keyword by a path, and computes a weight for each such entry from the importance of the entry itself and its distance from the keyword.

The label generation module is the last stage of the system. The information keyword extraction module yields a keyword set that reflects the program's characteristics, and the keyword expansion module yields, for each keyword, a set of associated entries, each with a weight. The label generation module integrates the two results by merging the associated entries of all keywords. When an entry is associated with several keywords, its weights under those keywords are added together. All entries are then sorted by total weight and the top several are returned as needed, giving the label set that describes the program.

[0029] Compared with the prior art, the present invention fills a gap in systems for automatically generating television program labels. Introducing the knowledge base frees the system from dependence on web page content, improves its extensibility, and improves its ability to discover labels. Because the knowledge base can be built offline and the label generation algorithm is simple, the system is also efficient.

Brief Description of the Drawings

[0030] Figure 1 is a block diagram of the system modules of the present invention;

[0031] Figure 2 shows implementation details of the program information acquisition module;

[0032] Figure 3 shows how the entry list is generated in the information keyword extraction module;

[0033] Figure 4 shows how the keywords are generated in the information keyword extraction module.

Detailed Description

[0034] An embodiment of the present invention is described in detail below with reference to the drawings. The embodiment is implemented on the basis of the technical solution of the invention, and a detailed implementation and a specific operating procedure are given, but the scope of protection of the invention is not limited to the embodiment described below.

[0035] The task of this embodiment is to automatically generate labels for a set of ten television programs: program 1, program 2, program 3, program 4, program 5, program 6, program 7, program 8, program 9, and program 10.

[0036] As shown in Figure 1, this embodiment includes five modules: a program information acquisition module, an information keyword extraction module, a knowledge base module, a keyword expansion module, and a label generation module. The program information acquisition module, the information keyword extraction module, the keyword expansion module, and the label generation module are connected in sequence, and the knowledge base module is connected to the keyword expansion module. The program information acquisition module crawls web pages related to the ten programs and, by pruning and filtering them, obtains the main content describing each program. The information keyword extraction module aggregates the main content obtained by the program information acquisition module and extracts keywords from it. The knowledge base module establishes a network of relationships among entries, which is used to expand the extracted keywords. The keyword expansion module uses the network provided by the knowledge base module to expand the keywords obtained by the information keyword extraction module into a larger entry set. The label generation module processes the entry sets, filters out noise, computes scores, and finally generates the label set of each program.

[0037] As shown in Figure 2, the program information acquisition module includes an HTML parser. It receives the set of target television programs for which labels are to be generated and, with the aid of a search engine, obtains web pages for each program. The pages are processed by the HTML parser to obtain the main content, which is passed to the information keyword extraction module for further processing. Specifically, the program information acquisition module uses a search engine to obtain ten pages, that is, HTML files, related to the target program. After useless markup such as advertisements, images, titles, and scripts is removed from the HTML files, ten documents describing the program are obtained.
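The markup-removal step can be illustrated with Python's standard `html.parser` module. This is a minimal sketch, not the patent's parser: the set `SKIPPED_TAGS` is an assumption, and detecting advertisements in real pages would need heuristics that the text does not describe.

```python
from html.parser import HTMLParser

# Tags whose contents are treated as non-body material. This set is an
# assumption; the source only names ads, images, titles, and scripts.
SKIPPED_TAGS = {"script", "style", "title", "noscript"}


class MainTextExtractor(HTMLParser):
    """Collects visible text while ignoring the contents of skipped tags."""

    def __init__(self):
        super().__init__()
        self._skip_depth = 0
        self._chunks = []

    def handle_starttag(self, tag, attrs):
        if tag in SKIPPED_TAGS:
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in SKIPPED_TAGS and self._skip_depth > 0:
            self._skip_depth -= 1

    def handle_data(self, data):
        text = data.strip()
        if text and self._skip_depth == 0:
            self._chunks.append(text)

    def text(self):
        return "\n".join(self._chunks)


def extract_main_text(html):
    """Return the text of one HTML page with skipped tags removed."""
    parser = MainTextExtractor()
    parser.feed(html)
    parser.close()
    return parser.text()
```

Images need no special handling here because an `img` tag carries no text of its own. Applying `extract_main_text` to each of the ten downloaded pages would give the ten documents that the paragraph mentions.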

[0038] As shown in Figure 3, the information keyword extraction module includes a word segmenter and part-of-speech tagger. After the main content describing each program is obtained, the segmenter and tagger split the content into words, and only the nouns are retained. Specifically, the documents returned by the program information acquisition module are first segmented and tagged by the information keyword extraction module, and only the nouns are retained, so that each document is converted into a set of words. Because the ten documents for one program contain repeated words, the words of the ten documents are hashed and the term frequency of each word in each document is counted. The result for each program is an entry list in which every item is a data structure holding the text of the entry and its term frequency in each of the ten documents.
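The entry list can be sketched with a hash-based counter. The sketch assumes that segmentation and tagging have already been done by some external tool and that each document arrives as a list of `(word, tag)` pairs. The tag name `"n"` for nouns and the use of raw counts as term frequency are both assumptions, not details given in the text.

```python
from collections import Counter


def noun_counts(tagged_tokens, noun_tags=("n",)):
    """Count how often each noun occurs in one tagged document."""
    return Counter(word for word, tag in tagged_tokens if tag in noun_tags)


def build_entry_list(tagged_documents):
    """Map each noun to its list of per-document frequencies.

    The result plays the role of the entry list: one item per word,
    holding the word and its frequency in each document of the program.
    """
    per_doc = [noun_counts(doc) for doc in tagged_documents]
    vocabulary = set().union(*per_doc) if per_doc else set()
    # Counter returns 0 for words that a document does not contain.
    return {word: [counts[word] for counts in per_doc] for word in vocabulary}
```

`Counter` is a dictionary subclass, so the counting is hash-based, which matches the hashing step mentioned in the paragraph.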

[0039] It should be noted that keywords are identified among the retained nouns by a statistical method.

[0040] The statistical method comprises the following steps:

[0041] Step 1: for a given program, divide the words into two groups, one drawn from the web pages related to that program and the other drawn from the web pages of the other programs in the set;

[0042] Step 2: compute the term frequency of each word in both groups, together with the mean and standard deviation, so that each word is characterized by four statistics: the mean and standard deviation of its term frequency in the pages related to the program, and the mean and standard deviation of its term frequency in the pages unrelated to the program;

[0043] Step 3: based on the relationship among the four statistics, identify the keywords that best characterize the program.

[0044] As shown in Figure 4, the entry list is processed further to obtain the final keyword list. For every word in the entry list of the target program, four statistics are computed: the mean and standard deviation of the word's term frequency in the target program, and the mean and standard deviation of its term frequency in the other programs. With the four statistics in hand, the words are first classified by the following rules:

[0045] Category 1: the mean and standard deviation of the term frequency in the other programs are both 0;

[0046] Category 2: the mean and standard deviation of the term frequency in the other programs are both nonzero, the mean in the target program is larger than the mean in the other programs, and the standard deviation in the target program is smaller than the standard deviation in the other programs;

[0047] Category 3: all cases not covered by categories 1 and 2.

[0048] A score is then computed for each category by the following rules:

[0049] Category 1: the mean in the target program divided by the standard deviation in the target program;

[0050] Category 2: the mean in the target program multiplied by the standard deviation in the other programs, divided by the standard deviation in the target program and then by the mean in the other programs.

[0051] Category 3: the score is set directly to 0.

[0052] The words are then sorted: category 1 ranks above category 2, and category 2 above category 3; within a category, words are ordered by score. Finally, the top 20 words are output as the keyword list.
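The classification, scoring, and ranking rules can be written out as a short sketch. Two points are assumptions: the population standard deviation is used, and a ratio whose denominator is 0 is treated as infinite, since the text does not say how a zero standard deviation in the target program is handled. The function names are illustrative only.

```python
import math
from statistics import fmean, pstdev


def _ratio(numerator, denominator):
    # The source does not cover a zero standard deviation; treating the
    # ratio as infinite is an assumption of this sketch.
    return math.inf if denominator == 0 else numerator / denominator


def classify_and_score(target_freqs, other_freqs):
    """Return (category, score) for one word; category 1 ranks highest."""
    t_mean, t_std = fmean(target_freqs), pstdev(target_freqs)
    o_mean, o_std = fmean(other_freqs), pstdev(other_freqs)
    if o_mean == 0 and o_std == 0:
        return 1, _ratio(t_mean, t_std)
    if o_mean != 0 and o_std != 0 and t_mean > o_mean and t_std < o_std:
        return 2, _ratio(t_mean * o_std, t_std * o_mean)
    return 3, 0.0


def top_keywords(stats, limit=20):
    """stats maps word -> (target_freqs, other_freqs); returns ranked words."""
    scored = {w: classify_and_score(t, o) for w, (t, o) in stats.items()}
    # Lower category first, then higher score within the same category.
    ranked = sorted(scored, key=lambda w: (scored[w][0], -scored[w][1]))
    return ranked[:limit]
```

Here `target_freqs` would hold a word's frequencies in the ten documents of the target program and `other_freqs` its frequencies in the documents of the other nine programs. A word that never appears elsewhere falls into category 1 and is ranked by how consistently it appears in the target documents.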

[0053] The knowledge base module uses Baidu Baike (Baidu Encyclopedia) as its data source and stores the data as a graph.

[0054] It should be noted that the organization of the Baidu Baike data comprises the following steps:

[0055] Step 1: every entry has a page that describes it; besides plain text, the page also cites other entries that already exist in Baidu Baike;

[0056] Step 2: in the knowledge base graph, there is a directed edge between each described entry and each entry it cites; applying the PageRank algorithm to this graph yields the importance of each entry;

[0057] Step 3: the weights of the entries and the citation relationships among them constitute the entire knowledge base.
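A minimal power-iteration sketch of PageRank over such a citation graph follows. The damping factor 0.85, the fixed iteration count, and the even redistribution of rank from entries without outgoing citations are conventional choices assumed here; the text names the algorithm but gives no parameters. Edges are taken to point from the described entry to the cited entry.

```python
def pagerank(links, damping=0.85, iterations=50):
    """Return {entry: importance}.

    links maps each entry to the list of entries that its page cites.
    """
    nodes = set(links) | {t for targets in links.values() for t in targets}
    n = len(nodes)
    if n == 0:
        return {}
    rank = {node: 1.0 / n for node in nodes}
    for _ in range(iterations):
        # Rank held by entries with no outgoing citations is spread evenly.
        dangling = sum(rank[v] for v in nodes if not links.get(v))
        new_rank = {v: (1 - damping) / n + damping * dangling / n for v in nodes}
        for source, targets in links.items():
            if targets:
                share = damping * rank[source] / len(targets)
                for target in targets:
                    new_rank[target] += share
        rank = new_rank
    return rank
```

Because the graph depends only on the encyclopedia and not on any particular program, this computation can be run once offline and its result stored with the graph.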

[0058] Through the keyword expansion module, each keyword in the keyword list obtains a set of associated entries, each with a weight. The label generation module integrates the two results by merging the associated entries of all keywords. When an entry is associated with several keywords, its weights under those keywords are added together. All entries are sorted by total weight and the top 20 are returned, giving the label set that describes the program.
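The merging step reduces to summing weights per entry and sorting. A minimal sketch, assuming the expansion results arrive as a dictionary from keyword to `{entry: weight}`; the function name and data layout are illustrative, and the noise filtering mentioned elsewhere in the text is not modelled because it is not specified.

```python
from collections import defaultdict


def generate_labels(expansions, limit=20):
    """Merge per-keyword entry weights and return the top entries.

    expansions maps keyword -> {entry: weight}.
    """
    totals = defaultdict(float)
    for entry_weights in expansions.values():
        for entry, weight in entry_weights.items():
            totals[entry] += weight  # entries shared by keywords accumulate
    ranked = sorted(totals.items(), key=lambda item: item[1], reverse=True)
    return [entry for entry, _ in ranked[:limit]]
```

An entry reached from several keywords therefore rises in the ranking, which favours labels that relate to the program as a whole over labels tied to a single keyword.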

[0059] Repeating the above process for the ten programs of this example completes the task of automatically generating labels for them.

[0060] Specific embodiments of the present invention have been described above. It should be understood that the invention is not limited to these particular embodiments; those skilled in the art may make various variations or modifications within the scope of the claims without affecting the substance of the invention.

Claims (9)

  1. 1. 一种电视节目标签自动生成系统,其特征在于,包括依次连接的节目信息获取模块、信息关键词提取模块、关键词扩展模块及标签生成模块,还包括与关键词扩展模块相连接的知识库模块,其中: -节目信息获取模块,用于从网上抓取与节目相关的页面,通过对页面的修剪和过滤,得到描述节目信息的主体内容; -信息关键词提取模块,用于汇总节目信息获取模块得到的主体内容,并从主体内容中抽取出关键词; -知识库模块,用于建立词条间的网络关系,以便用于对获取的关键词进行扩展; -关键词扩展模块,用于利用知识库模块提供的网络,将信息关键词提取模块得到的关键词进行扩展,得到一个更大的词条集; -标签生成模块,用于将得到的所有关键词的关联词条集进行处理,滤除噪声,计算分数,并最终生成节目的标签集。 1. A television program label automatic generation system, characterized by comprising program information acquisition module connected successively, keyword extraction module, extension modules and keyword tag generating module, and further comprising a knowledge keyword expansion module connected library modules, including: - a program information acquisition module, used to crawl from the internet related to the program page by page trim and filtered to give a description of the main content of program information; - keyword extraction module, a program summary content information acquiring module body obtained, and extracts keywords from the content body; - knowledge base module, the network used to establish the relationship between the entries, in order for the acquired extended keywords; - keyword expansion module, for utilizing the knowledge base module provides network, the keyword extraction module obtained expanded keyword to obtain a larger set of translation; - tag generating module, configured to obtain the set of all the keywords associated entry processing, noise filtering, calculating a score, and generates a final label set program.
  2. 2.根据权利要求I所述的电视节目标签自动生成系统,其特征在于,所述节目信息获取模块包括HTML解析器,接收需要生成标签的目标电视节目集合,在搜索引擎的辅助下,为每个节目获取网络页面,所述页面通过HTML解析器的处理,得到主体内容,所述主体内容传递给信息关键词提取模块作进一步处理。 The automatic generation system according to a television program label according to claim I, wherein said program information acquisition module comprises an HTML parser, certain television programs required to receive a set of tags to generate, with the aid of the search engine for each a program acquisition web page, the page processing by the HTML parser to obtain main content of the message body contents to keyword extraction module for further processing.
  3. 3.根据权利要求I所述的电视节目标签自动生成系统,其特征在于,所述信息关键词提取模块包括分词与词性标注器,得到描述每个节目信息的主体内容后,通过分词与词性标注器对内容进行划分,并仅保留名词词性的词语。 The television program label I according to claim automatic generation system, wherein the keyword extraction module includes a rear part of speech tagger word, to obtain main content of each program description information, and by word speech tagging the content is divided, and retaining only the words noun part of speech.
  4. 4.根据权利要求I所述的电视节目标签自动生成系统,其特征在于,所述名词词性的词语通过统计方法识别关键词。 4. The system automatically generates the television program label as claimed in claim I, wherein the noun part of speech of words by statistical methods to identify keywords.
  5. 5.根据权利要求4电视节目标签自动生成系统,其特征在于,所述统计方法包括以下步骤: 第一步,对于特定的某个节目,将词语划分为两组,一组来源于与该节目相关的网络页面,一组来源于节目集合中的其他网络页面; 第二步,对这两组词语计算词频,并统计出均值和标准差,这样,每个词语都用4个统计量描述其特征,所述4个统计量分别为这个词语在与节目相关页面的词频均值、标准差以及这个词语在与节目不相关页面的词频均值和标准差; 第三步,根据4个统计量间的关系,将最能表现节目特征的关键词识别出来。 5. The system automatically generates the television program label according to claim 4, wherein said statistical method comprises the steps of: a first step for a particular program, the word is divided into two groups, a group derived from the program related web page, a set of pages from other network program set; the second step, the two groups of word frequency calculation term, and the mean and standard deviation statistics, so that each word which are described by four statistics wherein the statistics are 4 this word in the word frequency mean, standard deviation, and this word is related to the program pages difference in the mean and standard word frequency is not related to the program page; a third step, according to statistics among the four relations, will show the best performance characteristics of keywords identified.
  6. 6.根据权利要求I所述的电视节目标签自动生成系统,其特征在于,所述知识库模块以百度百科作为数据源,以图的形式进行存储。 6. The television program label I according to claim automatic generation system, characterized in that said knowledge base module Baidu Encyclopedia as a data source, stored in the form of FIG.
  7. 7.根据权利要求6所述的电视节目标签自动生成系统,其特征在于,所述百度百科的组织方式包括以下步骤: 第一步,对于每个词条,均有一个页面对该词条进行描述,页面中除了纯文本外,还会将百度百科中已有的其他词条作引用; 第二步,在知识库的图中,每个这样的被描述的词条和引用的词条间都会有一条有向边,对这个图应用PageRank算法,得到每个词条的重要性; 第三步,词条的权重和词条间的相互引用关系,构成了整个知识库。 The automatic generation system according to television program label according to claim 6, characterized in that the organization Baidu Encyclopedia comprising: a first step, for each entry, the entry for the page has a between the second step, the knowledge base in FIG., each such entry are described and referenced entries; description, in addition to plain text page, will also Baidu Encyclopedia other terms as already cited there will be a directed edge of the map application PageRank algorithm, the importance of each entry; the third step, the right of entry and re-entry reference relationship with each other, constitute the entire knowledge base.
  8. The system for automatically generating television program labels according to claim 1, wherein, for each keyword obtained by the information keyword extraction module, the keyword expansion module finds the other entries in the knowledge base module's graph that are connected to the keyword by a path, and computes the weight of each such entry from the entry's own importance and its distance from the keyword.
  9. The system for automatically generating television program labels according to claim 1, wherein the label generation module merges the associated entries of all keywords; when an entry is associated with several keywords, its weights under the respective keywords are summed; all entries are then sorted by total weight and the top several are returned as needed, yielding the label set that describes the program's characteristics.
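
The statistical method of claim 5 can be illustrated with a minimal Python sketch. The claim does not state which relationship among the four statistics is used, so the scoring rule below (difference of the two means divided by the combined spread) is an assumption made only for illustration. The function names, the use of relative term frequency, and the `top_n` parameter are likewise hypothetical.

```python
from collections import Counter
from math import sqrt
from statistics import mean, pstdev


def term_frequencies(page_tokens):
    """Relative frequency of each word within one tokenized page."""
    counts = Counter(page_tokens)
    total = sum(counts.values()) or 1  # guard against empty pages
    return {word: count / total for word, count in counts.items()}


def four_statistics(word, related_tfs, other_tfs):
    """Mean and standard deviation of a word's frequency in each page group."""
    rel = [tf.get(word, 0.0) for tf in related_tfs]
    oth = [tf.get(word, 0.0) for tf in other_tfs]
    return mean(rel), pstdev(rel), mean(oth), pstdev(oth)


def keyword_score(stats, eps=1e-9):
    """ASSUMED relationship among the four statistics (not from the patent):
    a word scores high when it is frequent on related pages, rare elsewhere,
    and its frequency is stable within both groups."""
    mean_rel, std_rel, mean_oth, std_oth = stats
    return (mean_rel - mean_oth) / sqrt(std_rel ** 2 + std_oth ** 2 + eps)


def extract_keywords(related_pages, other_pages, top_n=5):
    """Rank the words of the related pages and return the top_n candidates."""
    related_tfs = [term_frequencies(p) for p in related_pages]
    other_tfs = [term_frequencies(p) for p in other_pages]
    vocabulary = {word for page in related_pages for word in page}
    scores = {
        word: keyword_score(four_statistics(word, related_tfs, other_tfs))
        for word in vocabulary
    }
    # Sort by descending score; break ties alphabetically for determinism.
    return sorted(scores, key=lambda w: (-scores[w], w))[:top_n]
```

Each page is passed in as a list of already segmented words; word segmentation of Chinese text is outside the scope of this sketch.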
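
Claims 7 to 9 describe a pipeline: rank the entries of the knowledge graph with PageRank, expand each keyword along graph paths, then sum and sort the resulting weights. The Python sketch below walks through that pipeline under several assumptions that are not in the claims: edges point from a describing entry to the entries it references, an entry's weight is its importance multiplied by `decay ** distance`, and the search is bounded by `max_distance`. All function and parameter names are hypothetical.

```python
from collections import defaultdict, deque


def pagerank(graph, damping=0.85, iterations=50):
    """Plain power-iteration PageRank over {entry: [referenced entries]}."""
    nodes = set(graph) | {v for targets in graph.values() for v in targets}
    n = len(nodes)
    rank = {u: 1.0 / n for u in nodes}
    for _ in range(iterations):
        # Rank held by entries with no outgoing references is spread evenly.
        dangling = sum(rank[u] for u in nodes if not graph.get(u))
        new_rank = {u: (1 - damping) / n + damping * dangling / n for u in nodes}
        for u, targets in graph.items():
            for v in targets:
                new_rank[v] += damping * rank[u] / len(targets)
        rank = new_rank
    return rank


def expand_keyword(graph, importance, keyword, max_distance=2, decay=0.5):
    """Breadth-first search from a keyword; returns {entry: weight}.

    ASSUMPTION: weight = importance * decay ** distance. The claim only says
    the weight depends on the entry's importance and its distance."""
    weights = {}
    seen = {keyword}
    queue = deque([(keyword, 0)])
    while queue:
        entry, distance = queue.popleft()
        if distance > 0:
            weights[entry] = importance.get(entry, 0.0) * decay ** distance
        if distance < max_distance:
            for neighbour in graph.get(entry, ()):
                if neighbour not in seen:
                    seen.add(neighbour)
                    queue.append((neighbour, distance + 1))
    return weights


def generate_labels(graph, importance, keywords, top_n=3):
    """Sum each entry's weights over all keywords and return the top_n entries."""
    totals = defaultdict(float)
    for keyword in keywords:
        for entry, weight in expand_keyword(graph, importance, keyword).items():
            totals[entry] += weight
    return sorted(totals, key=lambda e: (-totals[e], e))[:top_n]
```

Because `pagerank` depends only on the knowledge graph, it can be run once in advance and its result reused for every program, which matches the abstract's remark that the knowledge base can be built offline.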
Application CN 201210110031, priority date 2012-04-16, filing date 2012-04-16: System for automatically generating television program labels, published as CN102622451A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN 201210110031 CN102622451A (en) 2012-04-16 2012-04-16 System for automatically generating television program labels

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN 201210110031 CN102622451A (en) 2012-04-16 2012-04-16 System for automatically generating television program labels

Publications (1)

Publication Number Publication Date
CN102622451A (en) 2012-08-01

Family

ID=46562369

Family Applications (1)

Application Number Title Priority Date Filing Date
CN 201210110031 CN102622451A (en) 2012-04-16 2012-04-16 System for automatically generating television program labels

Country Status (1)

Country Link
CN (1) CN102622451A (en)

Citations (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN1596406A (en) * 2001-11-28 2005-03-16 皇家飞利浦电子股份有限公司 System and method for retrieving information related to targeted subjects
CN1640131A (en) * 2002-02-25 2005-07-13 皇家飞利浦电子股份有限公司 Method and system for retrieving information about television programs
WO2010117213A2 (en) * 2009-04-10 2010-10-14 Samsung Electronics Co., Ltd. Apparatus and method for providing information related to broadcasting programs
CN102075695A (en) * 2010-12-30 2011-05-25 中国科学院自动化研究所 New generation intelligent cataloging system and method facing large amount of broadcast television programs
CN102207945A (en) * 2010-05-11 2011-10-05 天津海量信息技术有限公司 Knowledge network-based text indexing system and method

Cited By (9)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN103186662A (en) * 2012-12-28 2013-07-03 中联竞成(北京)科技有限公司 System and method for extracting dynamic public sentiment keywords
CN103186662B (en) * 2012-12-28 2016-08-03 北京中油网资讯技术有限公司 A dynamic public opinion keyword extraction system and method
CN103152633A (en) * 2013-03-25 2013-06-12 天脉聚源(北京)传媒科技有限公司 Method and device for identifying key word
CN103152633B (en) * 2013-03-25 2015-12-23 天脉聚源(北京)传媒科技有限公司 Keyword recognition method and apparatus
CN103686406A (en) * 2013-12-03 2014-03-26 青岛海信传媒网络技术有限公司 Method and device for digital television to control intelligent terminal to display information
CN105704573A (en) * 2014-09-25 2016-06-22 财团法人资讯工业策进会 TV program-based shopping guide system and TV program-based shopping guide method
CN104978403A (en) * 2015-06-04 2015-10-14 无锡天脉聚源传媒科技有限公司 Generating method and apparatus for name of video album
CN104978400A (en) * 2015-06-04 2015-10-14 无锡天脉聚源传媒科技有限公司 Method for generating video album name and apparatus
CN105847948A (en) * 2016-03-28 2016-08-10 乐视控股(北京)有限公司 Data processing method and device

Legal Events

Date Code Title Description
C06 Publication
C10 Entry into substantive examination
C12 Rejection of a patent application after its publication