WO2015043075A1 - Microblog-oriented emotional entity search system - Google Patents


Info

Publication number
WO2015043075A1
Authority
WO
WIPO (PCT)
Prior art keywords
emotional
word
microblog
query
words
Prior art date
Application number
PCT/CN2013/088772
Other languages
French (fr)
Chinese (zh)
Inventor
郝志峰
温雯
蔡瑞初
杜慎芝
陆印章
程杰
Original Assignee
广东工业大学
Priority date
Filing date
Publication date
Priority to CN201310461443.6A (patent CN103544242B)
Application filed by 广东工业大学
Publication of WO2015043075A1


Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING; COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/90 Details of database functions independent of the retrieved data types
    • G06F16/95 Retrieval from the web
    • G06F16/951 Indexing; Web crawling techniques
    • G06F40/00 Handling natural language data
    • G06F40/30 Semantic analysis

Abstract

The present invention relates to a microblog-oriented emotional entity search system. The system comprises the following six modules: 1) a user interface for interaction between the system and a user, through which the user submits a query request and obtains the feedback result; 2) a query expansion module that mines word relationships from microblog corpus data and builds a weighted word relation graph in combination with the WordNet ontology; 3) a query processing module that converts the user's query request into query keywords and a query statement accepted by the index database, and expands the query using the word relation graph built by module 2); 4) an emotional information mining module that performs emotion mining on a microblog corpus and generates determination rules for emotional entities and emotional polarity; 5) an emotional information determination and index establishment module that determines the emotional entity and emotional polarity of microblog data, builds an emotional information index, and stores it; and 6) an inverted index establishment module that builds an inverted index over the microblog text and stores it. The invention addresses the difficult problems of microblog emotional entity extraction, emotional polarity analysis, and emotional entity search, thereby providing an intelligent search product for analyzing and monitoring public opinion on social networks.

Description

Microblog-oriented emotional entity search system
Technical Field
The present invention relates to the fields of text sentiment mining and information retrieval, and in particular to a microblog-oriented emotional entity search system; it is an innovation in microblog-oriented emotional entity search systems.
Background Art
In recent years, with the development of the Internet and social networks, social network data, including microblogs, has been growing exponentially. The continued growth of microblogs gives people ever more information to search, but the sheer volume of microblog data also makes it hard to find the needed information quickly and accurately. At the same time, because microblog writing is informal, extracting emotional information is harder than for traditional text. Microblog emotional information retrieval matters greatly to public opinion monitoring and product research, yet no mature technology or system exists in this field. Microblog-oriented emotional entity search methods and systems draw on three key background technologies: query expansion, emotional entity extraction, and emotional polarity determination. Each is described and analyzed below.
1 Query expansion technology
A traditional retrieval system or search engine that queries directly by keyword can return some relevant results, but simple matching is mechanical: it does not capture the user's query intent, so the results are unsatisfactory. Finding a method that understands the user's intent and improves the precision and recall of retrieval has therefore become a focus of research. Query expansion is such a method. It gives a more accurate picture of what the user is asking for and helps the user reach the needed information faster and more accurately. Classic query expansion methods fall into four types: those based on global analysis, local analysis, user query logs, and association rules. In recent years, scholars have also proposed query expansion methods based on ontologies (or domain ontologies) and semantic networks.
Query expansion based on global analysis expands a query by mining word relatedness across the whole data set or all documents in the database. Its advantage is that the entire data set is analyzed thoroughly, so every aspect of the documents is covered. Its drawback is that data sets are usually very large, so the analysis is demanding in both time and hardware and cannot be done online. Existing retrieval systems all perform global word analysis offline, and search engines with real-time requirements can hardly adopt this method.
Methods based on local analysis include relevance feedback and pseudo-relevance feedback. In relevance feedback, the user's initial query produces a result list, and the user then manually judges each result document as relevant or irrelevant, placing the documents in two separate sets. This yields labeled relevant documents, and only these documents need word analysis before query expansion. The advantage is that only the relevant documents are processed, which reduces the number of documents and improves relevance. The drawback is that it needs a great deal of manual feedback and therefore labor, and it still needs many experiments for tuning. Few existing retrieval systems or search engines use this method.
Pseudo-relevance feedback analyzes the first n results returned by the user's initial query. It assumes that documents relevant to the query terms appear at the top of the result list, that is, that these documents are the most relevant ones; expansion words are obtained by analyzing them and are then used to expand the query. Patent application CN20091032193.5, titled "Query expansion method and query expansion system", is an example of pseudo-relevance feedback. Its main idea is to cluster the top-ranked documents from the user's initial query, rank the resulting clusters, extract expansion words from a set number of top-ranked clusters, add these words to the original query to form an expanded word set, and then run a second retrieval. The drawback is that the top documents of the initial query are not guaranteed to be relevant; if they are not, the derived expansion words can make the second retrieval less relevant and lower retrieval performance.
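A minimal sketch of pseudo-relevance feedback as described above, in Python. It is an illustration only: it picks expansion words by plain term frequency in the top n documents rather than by the clustering used in the cited application, and the function name, parameters, and sample data are hypothetical.

```python
from collections import Counter

def pseudo_relevance_expand(query_terms, ranked_docs, n=3, k=2):
    """Expand a query with the k most frequent new terms from the top-n documents."""
    counts = Counter()
    for doc in ranked_docs[:n]:  # assumption: the top-n results are relevant
        counts.update(t for t in doc if t not in query_terms)
    expansion = [term for term, _ in counts.most_common(k)]
    return list(query_terms) + expansion

# Documents are token lists, already ranked by the initial query.
docs = [["phone", "battery", "life"], ["phone", "battery", "screen"], ["weather", "rain"]]
expanded = pseudo_relevance_expand(["phone"], docs, n=2, k=1)
```

If an irrelevant document sits in the top n, its terms enter the count in the same way, which is the failure mode the paragraph above points out.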
The method based on user query logs is an expansion method in common use by current search engines. It performs word analysis on users' query logs and takes co-occurring words as expansion words. Patent application CN200710097501.6, titled "Query expansion method and apparatus and related search term library", and patent application CN200810115470.7, titled "Method, apparatus and search engine system for expanding a query", analyze the query words entered by users to obtain related words and then use those words as expansion words. This approach first requires a large volume of query logs, which takes time to accumulate.
The association rule method is a classic data mining technique, often used to mine correlations between transactions. In query expansion it can be applied to many kinds of resources, for example to mine the relatedness between words in document collections or query logs. Patent application CN201010605956.6, titled "Method and server for expanding user search results", is an example of query expansion with association rules. It stores the established rules in an association rule database. The rules may be built by hand, or mined from specific documents using association rules under the support-confidence framework, and the generated rules are saved to the database. When the user enters a query word, the words related to it are first fetched from the rule database; the original query word, the related words, and combinations of the two then form the new query, and a second retrieval is run against the database. The drawback is that the method does not understand a word at the level of meaning; it stays at the level of word frequency, so the expansion does not capture the user's query intent well.
Query expansion based on an ontology or a semantic network expands words by using or building a word network. The semantic network may be an existing one, such as WordNet or HowNet, or one built for the purpose, such as domain knowledge or a domain ontology. A semantic network or ontology organizes words by several layers of relations, such as coordinate terms, hypernyms and hyponyms, concept words, and whole-part words, so that the words form a network. Patent application CN200810116729.X, titled "A semantic query expansion method based on domain knowledge", first builds a domain knowledge base from domain knowledge and an analysis of the features of user statements; it then uses the knowledge base to analyze the original query words semantically and obtain a list of semantic items, derives the expandable items by semantic computation, and finally adds the expansion items to the query set for a second retrieval against the database. Patent application CN20101084725.2, titled "A text-based query expansion and ranking method in image retrieval", uses WordNet and HowNet to analyze words semantically and obtain semantically expanded words; it applies them in an image retrieval system that analyzes text, and proposes an algorithm that optimizes the ranking of the returned results. Semantic expansion captures the user's query intent well, but the expansion words are not derived from an analysis of the database being queried, so the gain in retrieval performance is usually limited; building a domain ontology is also laborious and time-consuming.
2 Emotional entity extraction technology
An emotional object is the object at which an emotional expression is directed, usually a noun or noun phrase. Analyzing emotional orientation without knowing the emotional object is generally of little value. Extracting emotional objects is an important and challenging task in sentiment analysis and opinion mining, and it has attracted researchers' attention. Although there is already much research on emotional expressions and emotional objects, most of it analyzes product reviews or news.
Unlike traditional text, microblog data is shaped by length limits and by the informality of online writing. It contains many abbreviated expressions, misspellings, and special symbols (such as emoticons and links) that depart from standard written language, all of which make the data harder to analyze. Because sentiment analysis and opinion mining started late in China, because Chinese differs from English, and because the related technology is immature, there is still little research on emotional object recognition for microblogs.
One existing emotional object recognition technique is the patent filed by Beihang University, No. CN201210317183.0, titled "Opinion extraction method based on word dependency relations". The method extracts evaluation objects with a matching algorithm based on word dependency chains. It does not use other available auxiliary information to improve accuracy, and it is not necessarily suited to the special kind of text found in microblogs.
Emotional object extraction in the existing literature mainly targets product reviews. Because the product and domain are specified, the problem is more concrete and well defined, so extraction from topic-specific text usually works fairly well. It works poorly on topic-independent text, mainly because the objects commented on in such text are very mixed and the emotional words are diverse. Few emotional object recognition techniques exist for topic-independent microblogs. Most existing methods apply syntactic dependency analysis to the microblog, combine it with a sentiment lexicon to obtain paired <emotional word, emotional object> relations, and extract the emotional object from them. The recognition results are not ideal, and the approach has the following shortcomings:
(1) The extraction relies too heavily on the sentiment lexicon and on a few specific syntactic dependency relations. On the one hand, lexicon-based judgment is limited and strongly affected by domain knowledge, so misjudgments are frequent; on the other hand, because of the peculiarities of microblog wording, emotional words and emotional objects are not necessarily confined to a few specific dependency relations.
(2) In microblogs, an emotional word and its emotional object often do not appear together as a pair in the text: only the emotional word expresses the emotional tendency, while the emotional object does not appear explicitly in the sentence. The extraction process therefore cannot extract emotional objects that do not appear directly in the sentence text.
3 Emotional polarity determination technology
In terms of granularity, existing sentiment analysis systems and techniques concentrate on document-level and sentence-level analysis, and the very few entity-level techniques treat entity recognition and sentiment analysis as two independent tasks. In terms of the objects analyzed, current systems and techniques mainly target comment information such as news and microblogs, with a focus on analyzing public opinion.
Existing document-level and sentence-level sentiment analysis techniques mainly include: the patent of Northwestern Polytechnical University, application number CN200910219161.9, titled "Mixed-model-based WEB text sentiment topic recognition method"; the patent application of the Institute of Computing Technology, Chinese Academy of Sciences, application number CN200910083522.1, titled "Text sentiment orientation analysis method"; the patent application of the Institute of Automation, Chinese Academy of Sciences, application number CN201210088366.X, titled "A sentiment analysis method for microblog short text"; and the patent application of Fujitsu Limited, application number CN201010157784.0, titled "Sentiment orientation analysis method and apparatus".
The sentiment analysis techniques above mainly comprise two steps, training and sentiment judgment. Taking Northwestern Polytechnical University's "Mixed-model-based WEB text sentiment topic recognition method" as an example, its main training and judgment steps are described below; the other related techniques are broadly similar. The method comprises the following steps:
1. Manually annotate the text in the training set and estimate two sentiment models, a "commendatory" model and a "derogatory" model; also estimate a topic language model for each topic according to the way language is used in texts on different topics.
2. Use maximum likelihood estimation (MLE) to estimate the parameters of the sentiment models and topic models built in step 1.
3. For the text to be processed, compute the distance between its language model and the two sentiment models, and thereby judge the sentiment orientation and the topic of the text.
Current sentiment orientation techniques concentrate on the document level and the sentence level. Machine learning methods are widespread, while sentiment analysis techniques based on emotional landing points are rare. Existing sentiment analysis techniques based on emotional words have three main shortcomings:
A) The extraction of emotional phrases ignores adverbial modifiers, although adverbs generally qualify the degree of emotional words such as adjectives. Ignoring them easily skews the estimated emotional intensity.
B) Negation words are hard to identify and handle. The usual approach searches for negation words, which makes it difficult to determine what is being negated.
C) Some automatically generated lexicons of emotional word strength are unreliable, because strength is a basic attribute of an emotional word and is determined mainly by its original meaning.
Summary of the Invention
The object of the present invention is to overcome the above shortcomings of existing emotional entity search technology and to propose a microblog-oriented emotional entity search system that improves the accuracy of emotional polarity judgment.
The invention is implemented through the following technical solution. The microblog-oriented emotional entity search system of the invention comprises the following six modules:
1) a user interface for interaction between the system and the user, through which the user submits a query request and obtains the feedback result;
2) a query expansion module that mines word relationships from microblog corpus data and builds a weighted word relation graph in combination with the WordNet ontology;
3) a query processing module that converts the user's query request into query keywords and a query statement accepted by the index database, and expands the query using the word relation graph built by module 2);
4) an emotional information mining module that performs emotion mining on the microblog corpus and generates determination rules for emotional entities and emotional polarity;
5) an emotional information determination and index establishment module that determines the emotional entity and emotional polarity of microblog data, builds an emotional information index, and stores it;
6) an inverted index establishment module that builds an inverted index over the microblog text and stores it.
The query expansion module, module 2) above, implements query expansion through the following steps:
11) mine association rules from the data in the microblog corpus and output the resulting related word sets;
12) combine the frequent items obtained in 11) with the WordNet ontology to build the weighted word relation graph.
In step 11) above, the Eclat algorithm mines the frequent itemsets of the microblog corpus and generates the related word sets; the related word sets and the WordNet ontology graph are then combined, by mapping or insertion, into the weighted word relation graph.
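A minimal sketch of Eclat-style frequent itemset mining for step 11), in Python. It assumes each microblog post has already been segmented into a list of words; the function name, the `min_support` threshold, and the sample posts are hypothetical, and the merge with WordNet in step 12) is not shown.

```python
def eclat(transactions, min_support):
    """Return {frozenset(itemset): support} for all itemsets meeting min_support."""
    # Vertical layout: item -> set of ids of the transactions that contain it.
    tidlists = {}
    for tid, items in enumerate(transactions):
        for item in set(items):
            tidlists.setdefault(item, set()).add(tid)

    frequent = {}

    def extend(prefix, candidates):
        # Each candidate's tid set is already restricted to the current prefix.
        for i, (item, tids) in enumerate(candidates):
            if len(tids) < min_support:
                continue
            itemset = prefix | {item}
            frequent[frozenset(itemset)] = len(tids)
            # Intersect tid sets with the later items to grow the itemset depth-first.
            suffix = [(other, tids & other_tids) for other, other_tids in candidates[i + 1:]]
            extend(itemset, suffix)

    extend(set(), sorted(tidlists.items()))
    return frequent

posts = [["phone", "battery", "good"], ["phone", "battery", "bad"], ["phone", "screen", "good"]]
itemsets = eclat(posts, min_support=2)
# Frequent word pairs are candidates for the related word sets.
related = {tuple(sorted(s)): n for s, n in itemsets.items() if len(s) == 2}
```

Eclat counts support by intersecting transaction-id sets instead of rescanning the corpus, which is why the data is first turned into the vertical item-to-ids layout.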
When the weighted word relation graph above is built, the weight of a node $d$ is its degree:
$$f(d) = \deg(d) = \deg^{+}(d) + \deg^{-}(d),$$
where $\deg(d)$, $\deg^{+}(d)$ and $\deg^{-}(d)$ are the degree, out-degree and in-degree of the node. The weight of an edge $d_i \to d_j$ is:
$$\text{weight}(d_i \to d_j) = \begin{cases} 1, & \text{if } d_i, d_j \text{ are nodes of the original WordNet graph} \\ f(d_i \to d_j), & \text{if the node is formed only by rule words} \\ f(d_i \to d_j) + 1, & \text{if the node is both a WordNet graph node and a node formed by rule words} \end{cases}$$
Module 3) above implements query processing through the following steps:
31) receive the query words or statement entered by the user;
32) segment the user's input into words, remove stop words, and determine the center words, yielding one or more center words;
33) for the center words, select suitable expansion words from the weighted word relation graph built from the ontology and the rule words, and compute the weight of each expansion word;
34) add the p expansion words with the largest weights to the query word set, and pass the expanded word set to the query interface.
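A minimal sketch of steps 33) and 34), in Python. It assumes the center words from step 32) are already available, represents the word relation graph as a plain adjacency dictionary, and takes the per-pair relatedness as a caller-supplied function; the weight of an expansion word is the mean of its relatedness over all m center words. All names and the sample scores are hypothetical.

```python
def expand_query(center_words, graph, relatedness, p=2):
    """Add the p neighbours with the highest mean relatedness to the center words."""
    m = len(center_words)
    totals = {}
    for q in center_words:
        for d in graph.get(q, ()):  # neighbours of q in the word relation graph
            if d in center_words:
                continue
            totals[d] = totals.get(d, 0.0) + relatedness(q, d)
    # Mean over all m query terms: W(d) = sum_i W(q_i, d) / m
    weights = {d: total / m for d, total in totals.items()}
    best = sorted(weights, key=lambda d: (-weights[d], d))[:p]
    return list(center_words) + best, weights

graph = {"phone": ["mobile", "battery"], "screen": ["display", "battery"]}
scores = {("phone", "mobile"): 0.8, ("phone", "battery"): 0.6,
          ("screen", "display"): 0.5, ("screen", "battery"): 0.6}
expanded, weights = expand_query(["phone", "screen"], graph, lambda q, d: scores[(q, d)], p=2)
```

A word linked to several center words, such as "battery" here, accumulates relatedness from each of them and so tends to outrank a word linked to only one.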
In step 33) above, the weights of the expansion words are computed as follows.
Let the original query be $Q = (q_1, q_2, \ldots, q_m)$, where term $q_i$ has nearest-neighbor words $D_i = (d_{i1}, d_{i2}, \ldots)$. The relatedness between the original query term $q_i$ and its nearest-neighbor word $d_{ij}$ is computed as
$$W(q_i, d_{ij}) = [\ldots]\,\{\,\text{weight}(q_i, d_{ij}) \times \log_2[f(d_{ij}) + 1]\,\}$$
where $W(q_i, d_{ij})$ is the relatedness between the two words, $\text{weight}(q_i, d_{ij})$ is the weight of the two words, and $f(d_{ij})$ is the degree of word $d_{ij}$; $[\ldots]$ marks a part of the formula that is not legible in this text. The weight of each nearest-neighbor word $d_k$ is computed as
$$W(d_k) = \sum_{i=1}^{m} W(q_i, d_k) / m$$
Module 4) above identifies and determines emotional entities through the following steps:
41) collect representative microblog data;
42) preprocess the collected microblog data, including cleaning, conversion, sentence splitting, word segmentation, part-of-speech tagging, and syntactic parsing;
43) extract features from the microblog data and express them as feature vectors;
44) train the emotional entity recognition model and obtain the model parameters;
45) output the emotional entity determination model and store it.
Step 43) above implements feature extraction as follows: taking word context into account, design a custom dictionary that includes global features, extract features from the microblog data according to this dictionary, and convert the microblog data into the input format that the emotional entity recognition model can process.
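A minimal sketch of the kind of feature extraction step 43) describes, in Python: each token gets features from its local context plus features computed once for the whole post. The specific features (neighboring words, hashtag and mention flags) are assumptions chosen for illustration; the source does not list the contents of its custom dictionary.

```python
def token_features(tokens, tags, i, global_feats):
    """Feature dict for token i: local context window plus post-level (global) features."""
    feats = {"word": tokens[i], "pos": tags[i]}
    feats["prev_word"] = tokens[i - 1] if i > 0 else "<BOS>"
    feats["next_word"] = tokens[i + 1] if i < len(tokens) - 1 else "<EOS>"
    feats.update(global_feats)
    return feats

def post_to_features(tokens, tags):
    # Hypothetical global features, computed once per post and shared by all tokens.
    global_feats = {
        "has_hashtag": any(t.startswith("#") for t in tokens),
        "has_mention": any(t.startswith("@") for t in tokens),
    }
    return [token_features(tokens, tags, i, global_feats) for i in range(len(tokens))]

rows = post_to_features(["@shop", "phone", "great"], ["NR", "NN", "VA"])
```

One feature dictionary per token is a common input shape for sequence labeling models, which is why the sketch returns a list aligned with the token list.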
Step 44) above implements the emotional entity recognition model as follows: introduce global feature nodes into a conditional random field (CRF) model to build a GLCRF model that incorporates global features, and train it with the L-BFGS algorithm to obtain the model parameters.
Module 5) above determines the emotional polarity of a microblog through the following steps:
51) remove noise from the microblog data and convert its semantic form;
52) perform word segmentation, part-of-speech tagging, and Chinese syntactic parsing;
53) extract emotional phrases using a sentiment lexicon;
54) filter the emotional phrases;
55) determine the emotional polarity and output the result.
Step 53) above uses the sentiPY method to extract emotional phrases. Every emotional phrase is expressed in the uniform form phrase = modifier* sentiment, that is, a phrase contains one central emotional word, possibly accompanied by several modifying adverbs;
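A minimal sketch of extracting phrases of the form modifier* sentiment from part-of-speech-tagged tokens, in Python. It is not the sentiPY method itself, whose details the text does not give; it assumes a sentiment lexicon keyed by word and the adverb tag "AD" of the Penn Chinese Treebank tagset, and the sample words are hypothetical.

```python
def extract_sentiment_phrases(tagged_tokens, lexicon):
    """Return (modifiers, sentiment_word) pairs matching the form modifier* sentiment."""
    phrases, modifiers = [], []
    for word, pos in tagged_tokens:
        if word in lexicon:
            phrases.append((tuple(modifiers), word))
            modifiers = []
        elif pos == "AD":  # adverb tag, assumed to follow the Penn Chinese Treebank tagset
            modifiers.append(word)
        else:
            modifiers = []  # any other word breaks the run of adverbs
    return phrases

lexicon = {"good": 1.0, "bad": -1.0}
tagged = [("screen", "NN"), ("very", "AD"), ("not", "AD"), ("good", "VA"),
          ("battery", "NN"), ("bad", "VA")]
phrases = extract_sentiment_phrases(tagged, lexicon)
```

Keeping the adverbs attached to the emotional word lets a later step account for degree and negation, the two gaps listed as shortcomings A) and B) in the background.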
上述步骤 55 ) 中采用基于情感落点的混合决策算法对微博情感极性进 行判定, 判定过程包含以下步骤  In the above step 55), the emotion determination polarity of the microblog is determined by using a mixed decision algorithm based on emotional drop points, and the determination process includes the following steps.
551) Check whether the microblog contains a summarizing word. If not, go to step 552); if so, take the text following the summarizing word as the emotional drop point and output the polarity of the drop point as the emotional polarity of the microblog;
552) Take the beginning and the end of the microblog as emotional drop points and compare their emotional polarities. If the two polarities cancel each other out, go to step 553); otherwise, output the stronger of the two as the emotional polarity of the microblog;
553) Compute the intensity of every emotional word in the microblog, sum and average the intensities, and output the average intensity as the emotional polarity of the microblog.
The query expansion scheme of the invention for microblog emotional entity search is characterized by mining word relations from microblog corpus data, building a weighted word relationship graph in combination with the WordNet ontology, and expanding queries according to that graph so as to better understand the user's query intent. In query expansion, the invention solves the problem of effectively combining a semantic ontology with corpus-derived word relations, so that the purpose of the user's query is better understood and the query is converted into more suitable expansion words. In emotional entity extraction and sentiment analysis, it addresses the extraction of emotional objects and the judgment of emotional polarity in free-form text such as microblogs, handles entity extraction when the emotional object is implicit, improves the extraction of emotional entities, and raises the accuracy of polarity judgment. It thereby provides a solution for online public-opinion monitoring and product opinion analysis. The invention solves difficult problems such as microblog emotional entity extraction, emotional polarity analysis, and emotional entity search, and provides an intelligent search product for analyzing and monitoring public opinion on social networks.
DESCRIPTION OF THE DRAWINGS
Figure 1 is the overall structure diagram of the invention;
Figure 2 is a flow chart of the implementation and use of the invention;
Figure 3 is the system architecture diagram of the invention;
Figure 4 is a flow chart of the emotional polarity analysis method of the invention;
Figure 5 is an example of a graph structure based on adjacency relations, used in emotional intensity optimization;
Figure 6 is a flow chart of the emotional drop point algorithm;
Figure 7 is a workflow diagram of microblog emotional object extraction;
Figure 8 is a flow chart of data preprocessing;
Figure 9 is a schematic diagram of emotional object model training;
Figure 10 is the graph structure of the GLCRF model;
Figure 11 is the graph structure of the GLCRF model after it is extended with multiple global nodes.
DETAILED DESCRIPTION
Embodiments of the invention are further described below with reference to the drawings, but implementation of the invention is not limited to them.
Figure 1 shows the overall structure of the invention. A microblog-oriented emotional entity search system comprises: a user interface module, through which the user submits query requests and receives results; a query expansion module, which mines word relations from microblog corpus data and builds a weighted word relationship graph in combination with the WordNet ontology; a query processing module, which converts the user's query request into query keywords and query statements accepted by the index database and performs query expansion based on the word relationship graph built by the query expansion module; an emotional information mining module, which performs sentiment mining on the microblog corpus and generates decision rules for emotional entities and emotional polarity; an emotional information decision and index establishment module, which determines the emotional entity and emotional polarity of microblog data, builds an emotional information index, and stores it; and an inverted index establishment module, which builds and stores an inverted index over the microblog text.
Figure 2 shows the workflow of the query processing module of the invention.
Referring to Figure 2, the workflow comprises the following steps:
1. The query interface receives the query word or sentence entered by the user;
2. The input is segmented into words, stop words are removed, and the central words are determined, yielding one or more central words; a central word may be a keyword, a modifier, or another type of word;
3. Each central word is looked up in the weighted word relationship graph built from the ontology and the rule words to select candidate expansion words; the selected words are at distance 1, that is, they are the nearest neighbors of the central word;
4. Because step 3 may yield many expansion words, a weight is computed for each word to measure its importance, and the p words with the largest weights are added to the query word set;
5. The required expansion words are now available, but a mechanism is provided that shows them to the user and lets the user edit the expanded query word set, so that every expansion word matches the user's query intent;
6. The expanded word set is returned to the query entry point and an expanded search is run against the rich media database;
7. The search results are returned and displayed to the user.
Figure 3 shows how the query processing and query expansion modules of the invention are integrated.
Referring to Figure 3, the query processing and query expansion of the invention comprise two main parts, a background information processing process and a retrieval process, which can further be divided into five submodules: a microblog information extraction module, an index building module, a word relationship graph construction module, a user retrieval module, and an administrator and user operation module. The microblog information extraction module organizes the initial microblog data and performs appropriate cleaning, sentence splitting, word segmentation, and syntactic analysis on it. The index building module builds an index over the microblog dataset for fast retrieval. We use Lucene to build the inverted index. Lucene is an open-source full-text search engine library that provides a complete query engine and indexing engine and supports Boolean queries, fuzzy queries, grouped queries, and other operations. The inverted index is built with Lucene and then saved.
The word relationship graph construction module is the core of this work and its novel part. It consists of a word segmentation process, an Eclat association rule mining process, an association-rule word generation process, and the process of generating the weighted word relationship graph in combination with WordNet. Word segmentation splits the text into individual words. We use ICTCLAS, a system developed by the Chinese Academy of Sciences specifically for Chinese word segmentation, which achieves high accuracy on Chinese text. We first segment each document in the dataset one by one and then merge the documents of all categories into a single document collection for association rule mining. For rule mining we use the efficient Eclat algorithm, a depth-first algorithm; for large documents, related words can be mined block by block and the results merged at the end. The invention uses a support-interest association rule framework with two evaluation formulas:
(1) Support:

$$\mathrm{supp}(X \to Y) = \frac{|X \cup Y|}{|D|}$$

(2) Interest (lift):

$$\mathrm{lift}(X \to Y) = \frac{\mathrm{supp}(X \to Y)}{\mathrm{supp}(X) \times \mathrm{supp}(Y)}$$

where $|X \cup Y|$ is the number of transactions that contain both $X$ and $Y$, and $|D|$ is the total number of transactions in the database; $\mathrm{supp}(X \to Y)$ is the fraction of transactions in the database that contain both $X$ and $Y$, and $\mathrm{supp}(X)$ and $\mathrm{supp}(Y)$ are the fractions of transactions that contain $X$ and that contain $Y$, respectively.
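The two measures above can be illustrated with a minimal Python sketch. It assumes each transaction is the set of words of one segmented document; the function names and the toy corpus are hypothetical and are not taken from the patent.

```python
def support(transactions, items):
    """Fraction of transactions that contain every word in `items`."""
    items = set(items)
    hits = sum(1 for t in transactions if items <= set(t))
    return hits / len(transactions)


def lift(transactions, x, y):
    """lift(X -> Y) = supp(X -> Y) / (supp(X) * supp(Y))."""
    joint = support(transactions, [x, y])
    return joint / (support(transactions, [x]) * support(transactions, [y]))


# Toy corpus (made up): one set of words per document.
docs = [
    {"phone", "screen", "battery"},
    {"phone", "screen"},
    {"phone", "price"},
    {"weather", "rain"},
]

joint_support = support(docs, ["phone", "screen"])  # 2 of 4 documents
phone_screen_lift = lift(docs, "phone", "screen")   # 0.5 / (0.75 * 0.5)
```

A lift above 1 means the two words co-occur more often than independence would predict, which is the condition the text uses for keeping a rule.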
During mining, different support thresholds are set for different document collections, and a mined frequent itemset produces an association rule only when its interest value is greater than 1, because two words are considered positively correlated only when their interest value exceeds 1. The mining process also introduces compound words: when the interest value of two words is greater than 4, the antecedent and the consequent of the rule are merged into a compound word. The compound word forms a new rule with the antecedent and with the consequent of the original rule, each with the same interest value as the original rule, so that compound words can also be selected as expansion words. After the related words have been mined, the association-rule words are generated and saved in the form "X → Y". This completes the mining and analysis of the association-rule words.
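The thresholding described above can be sketched as follows. This is a minimal illustration, not the patent's implementation: the input is assumed to be a dictionary of precomputed interest values, the compound word is formed by simple concatenation (which suits Chinese), and the direction of the new rules (from each component to the compound word) is an assumption, since the text does not state it.

```python
def generate_rules(pair_lifts, rule_min=1.0, compound_min=4.0):
    """Turn {(x, y): lift} into a list of (antecedent, consequent, lift) rules."""
    rules = []
    for (x, y), value in pair_lifts.items():
        if value <= rule_min:
            continue  # not positively correlated: no rule is produced
        rules.append((x, y, value))
        if value > compound_min:
            compound = x + y  # assumption: concatenate the two words
            # Assumed direction: each component points to the compound word.
            rules.append((x, compound, value))
            rules.append((y, compound, value))
    return rules


# Made-up interest values for illustration.
example_lifts = {
    ("知识", "产权"): 5.2,
    ("phone", "screen"): 1.3,
    ("rain", "price"): 0.8,
}
example_rules = generate_rules(example_lifts)
```

With these inputs the first pair yields three rules (the original rule plus two involving the compound word "知识产权"), the second pair yields one rule, and the third pair yields none.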
The remaining step is to combine these rule words with the WordNet ontology into a weighted word relationship graph. WordNet is a lexical semantic network. It not only organizes words into concepts but also defines many semantic relations between concepts and words (such as coordinate terms, hypernyms and hyponyms, antonyms, holonyms and meronyms, entailment, and so on), so that the relations between words form a directed graph (see the example in Figure 3). In this process the rule-word items are mapped onto, or added to, the WordNet ontology in a fixed order. The construction principle of the weighted word relationship graph is: between the nodes of the two words of a rule, add a directed edge from the antecedent to the consequent. Adding rule words is fully automatic and falls into two cases. First, if the word already exists in the original WordNet graph, the word is simply mapped onto the graph and the node data are updated. Second, if the word does not exist in the original WordNet graph, the word is added first, then the edge is added and the data are updated. All node data are tallied one by one after the graph is complete.
The resulting graph can be represented as a 4-tuple $G = \langle V, E, f, g \rangle$, where $V$ is the set of nodes, $E$ is the set of edges, $f$ is a function from $V$ to the non-negative real numbers that gives the degree of a node, and $g$ is a function from $E$ to the non-negative real numbers that gives the weight of the edge between two nodes. Let $d_i, d_j \in V$, let $\deg(d)$ denote the degree of node $d$ (the sum of its out-degree and in-degree), and let $\mathrm{lift}(d_i \to d_j)$ denote the interest value of the node words $d_i$ and $d_j$. Then:

(1) $$f(d) = \deg(d)$$

(2) $$g(d_i, d_j) = \begin{cases} 1, & \text{if } d_i, d_j \text{ are nodes of the original WordNet graph} \\ \mathrm{lift}(d_i \to d_j), & \text{if } d_i, d_j \text{ are nodes formed only by rule words} \\ \mathrm{lift}(d_i \to d_j) + 1, & \text{if } d_i, d_j \text{ are both WordNet nodes and nodes formed by rule words} \end{cases}$$

In the weighted word relationship graph (see the example in Figure 4), the importance of a word in the whole graph is measured by the degree of its node, that is, the sum of the node's out-degree and in-degree (the integer beside each node in Figure 4). The value on an edge is its weight: the weight between ontology words of the original WordNet graph is set to 1 (the blue edges in Figure 4), the weight between words inserted by rules is set to the interest value of the two words (the blue edges in Figure 4), and if two words are related both in WordNet and by a rule, the weight is the interest value plus 1. The words pointed to by black edges in Figure 4 are compound words (such as "知识产权", intellectual property), whose weights with the two rule words are the same. This completes the construction of the weighted word relationship graph.
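The edge-weight and degree rules can be sketched in a few lines of Python. This is an illustration under stated assumptions: WordNet relations are given as a plain list of directed word pairs and rules as a dictionary of interest values; no real WordNet API is used, and all names and data are hypothetical.

```python
from collections import defaultdict


def build_graph(wordnet_edges, rule_lifts):
    """Return (g, f): edge weights g[(u, v)] and node degrees f[node]."""
    g = {}
    for edge in wordnet_edges:
        g[edge] = 1.0                  # edge known only from WordNet
    for edge, value in rule_lifts.items():
        if edge in g:
            g[edge] = value + 1.0      # WordNet relation and rule: lift + 1
        else:
            g[edge] = value            # edge known only from a rule: lift
    f = defaultdict(int)
    for u, v in g:
        f[u] += 1                      # counts toward the out-degree of u
        f[v] += 1                      # counts toward the in-degree of v
    return g, dict(f)


# Made-up example data.
wordnet_edges = [("car", "vehicle"), ("car", "wheel")]
rule_lifts = {("car", "wheel"): 2.5, ("car", "price"): 1.5}
g, f = build_graph(wordnet_edges, rule_lifts)
```

Here the edge car → wheel appears in both sources and therefore receives the weight 2.5 + 1, while the node "car" ends up with degree 3.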
The user retrieval module comprises query input, query analysis, expansion-word matching, generation of the expanded query word set, index retrieval, and result processing and display. Query input receives the query word or sentence entered by the user in the query interface. Query analysis segments the user's input, removes stop words, and determines the central words, yielding one or more central words. Expansion-word matching looks up the central words from the previous step in the weighted word relationship graph to select candidate expansion words, that is, the words in the graph closest to the original query word (those at distance 1). Generation of the expanded query word set computes a weight for each candidate from its relevance to the original query words and selects the top p candidates as the final expansion words. The invention defines a formula for computing these weights. From the structure of the weighted word relationship graph, the larger the weight between two nodes, the more strongly the two nodes are related, and the larger the degree of a node, the more important that node is.
Suppose the original query is $Q = (q_1, q_2, \ldots, q_m)$, where term $q_i$ has $n_i$ nearest-neighbor words $d_i = (d_{i1}, d_{i2}, \ldots, d_{in_i})$. The relevance between the original query term $q_i$ and its nearest-neighbor word $d_{ij}$ is computed as

$$w(q_i, d_{ij}) = \frac{g(q_i, d_{ij}) \times \log_2[f(d_{ij}) + 1]}{\sum_{k=1}^{n_i} \{ g(q_i, d_{ik}) \times \log_2[f(d_{ik}) + 1] \}}$$

where $w(q_i, d_{ij})$ is the relevance between word $q_i$ and word $d_{ij}$, $g(q_i, d_{ij})$ is the weight of the edge between the two words, and $f(d_{ij})$ is the degree of word $d_{ij}$. The weight of each nearest-neighbor word is computed as

$$W(d_k) = \frac{1}{m} \sum_{i=1}^{m} w(q_i, d_k)$$

where $W(d_k)$ is the weight of word $d_k$ and $m$ is the number of original query terms. After the weights of all candidate expansion words have been computed, the candidates are sorted by weight in descending order and the top p words are added to the original query to form the expanded word set, in which the original query terms all have weight 1.
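The weighting and top-p selection can be sketched as follows. The sketch makes several assumptions that should be checked against the published formula: relevance is taken as the edge weight times log2(degree + 1), normalized over the neighbors of each query term; only outgoing edges count as neighbors; and ties are broken alphabetically. The graph data and function names are hypothetical.

```python
import math


def expansion_words(g, f, query, p):
    """Return the p highest-weighted neighbor words as (word, weight) pairs.

    g maps directed edges (u, v) to weights; f maps nodes to degrees.
    """
    m = len(query)
    weights = {}
    for q in query:
        # Assumption: neighbors are targets of edges leaving q (distance 1).
        neighbors = [v for (u, v) in g if u == q and v not in query]
        raw = {d: g[(q, d)] * math.log2(f[d] + 1) for d in neighbors}
        total = sum(raw.values())
        for d, r in raw.items():
            # w(q, d) = r / total; W(d) averages w over the m query terms.
            weights[d] = weights.get(d, 0.0) + (r / total) / m
    ranked = sorted(weights.items(), key=lambda kv: (-kv[1], kv[0]))
    return ranked[:p]


# Made-up graph: edge weights and node degrees.
g = {("car", "vehicle"): 1.0, ("car", "wheel"): 3.5, ("car", "price"): 1.5}
f = {"car": 3, "vehicle": 1, "wheel": 1, "price": 1}
top_two = expansion_words(g, f, ["car"], 2)
```

For the single query term "car", the candidates are ranked wheel, price, vehicle, and the two best are kept.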
The previous step yields the expanded word set, which has the following form:

$$Q = (q_1, q_2, \ldots, q_m, d_1, d_2, \ldots, d_p) \quad (4)$$

The retrieval process returns the expanded word set to the query entry point and runs an expanded search against the rich media database. Result processing and display returns the ranked search results and displays them to the user.
Figure 4 is a flow chart of the emotional polarity analysis method proposed by the invention.
(1) Noise removal and semantic-form conversion of the review corpus:
Noise removal mainly eliminates interfering clauses, such as those in the subjunctive mood. Such sentences are not real, objective evaluations and would interfere with the later stages of analysis. Emoticons are replaced with the corresponding text, which converts the content into a form that is easier to process.
(2) Natural language processing: the Stanford NLP software is used to perform word segmentation, part-of-speech tagging, and Chinese syntactic parsing on the review corpus.
(3) Extraction of emotional phrases with the sentiment dictionary:
Because the POS tags of emotional words in the review corpus are concentrated on a small number of tags, these POS tags are combined with the sentiment dictionary to extract emotional phrases. The sentiPY method that we developed is used for the extraction. In this system, the form of an emotional phrase is unified as:
phrase: modifier* sentiment
that is, a phrase consists of one central emotional word, optionally accompanied by several modifying adverbs.
(4) Emotional phrase filtering: the coarse-grained emotional phrases extracted in step (3) are filtered so that the phrases are cleaner, which improves the accuracy of the final polarity classification.
(5) Sentiment analysis and output of the result:
We designed a hybrid decision algorithm based on the emotional drop point, which can effectively analyze review corpora from different domains.
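Steps (1) and (3) above can be illustrated with a small Python sketch. It is not the sentiPY method itself, whose internals are not described here: the emoticon map, the sentiment dictionary, and the modifier list are made-up toy data, the input is assumed to be already segmented, and POS-tag filtering is omitted.

```python
# Toy resources (hypothetical): emoticon codes, sentiment words, modifiers.
EMOTICONS = {"[haha]": "happy", "[tears]": "sad"}
SENTIMENT = {"clear": 1.0, "disappointing": -1.0, "happy": 1.0, "sad": -1.0}
MODIFIERS = {"very", "really", "not"}


def replace_emoticons(text):
    """Step (1): replace emoticon codes with ordinary words."""
    for code, word in EMOTICONS.items():
        text = text.replace(code, word)
    return text


def extract_phrases(tokens):
    """Step (3): collect phrases of the form modifier* sentiment."""
    phrases, pending = [], []
    for tok in tokens:
        if tok in MODIFIERS:
            pending.append(tok)        # possible modifier of a later word
        elif tok in SENTIMENT:
            phrases.append((tuple(pending), tok))
            pending = []
        else:
            pending = []               # modifiers must be directly adjacent
    return phrases


cleaned = replace_emoticons("great day [haha]")
phrases = extract_phrases(
    ["screen", "very", "clear", "battery", "really", "not", "disappointing"]
)
```

Each extracted phrase keeps its modifiers separately from the central emotional word, so a later step can, for example, treat a negating modifier differently from an intensifier.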
Figure 5 is an example of a graph structure based on adjacency relations, used in emotional intensity optimization. Referring to Figure 5, the emotional words in a review are treated as nodes of a graph, and a propagation-based algorithm can compute the emotional intensity in context. Based on the sentiment dictionary, the adjacency relations between emotional words are extracted and the weight between two emotional-word nodes is computed with NGD, which yields a directed graph. The figure shows the graph structure for a single comment.
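The text does not expand the abbreviation NGD. Assuming it denotes the Normalized Google Distance, the distance between two words can be computed from occurrence counts as in the sketch below. How the counts are obtained, and how the distance is turned into an edge weight, is not specified in the text; the function name and arguments are hypothetical.

```python
import math


def ngd(fx, fy, fxy, n):
    """Normalized Google Distance from counts.

    fx, fy: number of documents containing each word;
    fxy: number of documents containing both; n: total number of documents.
    """
    if fxy == 0:
        return math.inf                # the words never co-occur
    lx, ly, lxy = math.log(fx), math.log(fy), math.log(fxy)
    return (max(lx, ly) - lxy) / (math.log(n) - min(lx, ly))


always_together = ngd(100, 100, 100, 10000)
sometimes_together = ngd(100, 100, 10, 10000)
```

A distance of 0 means the two words always occur together; larger values mean a weaker association.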
Figure 6 is a flow chart of the emotional drop point algorithm. Referring to Figure 6, the goal of this step is to find the emotional drop point of a comment, that is, the part of the comment that carries the emotion the author mainly wants to express. The drop point is located mainly from summarizing words (such as "总体", "overall"), from a comparison of the emotional intensity at the beginning and at the end, and from the strongest emotional phrase in the sentence.
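The decision procedure of steps 551) to 553) can be sketched as follows. This is a simplified reading, not the patent's implementation: a microblog is assumed to be given as a list of clauses, each with the signed intensities of its emotional words; the clause containing a summarizing word and everything after it is treated as the text following that word; and the list of summarizing words is hypothetical.

```python
SUMMARY_WORDS = ("overall", "in short")   # hypothetical summarizing words


def mean(scores):
    """Mean signed intensity, or 0.0 when there are no emotional words."""
    return sum(scores) / len(scores) if scores else 0.0


def microblog_polarity(clauses):
    """clauses: list of (text, [signed intensities]) in reading order."""
    if not clauses:
        return 0.0
    # Step 551: a summarizing word makes the following text the drop point.
    for i, (text, _) in enumerate(clauses):
        if any(w in text.lower() for w in SUMMARY_WORDS):
            return mean([s for _, sc in clauses[i:] for s in sc])
    # Step 552: compare the beginning and the end of the microblog.
    first, last = mean(clauses[0][1]), mean(clauses[-1][1])
    if abs(first + last) > 1e-9:       # the two polarities do not cancel out
        return first if abs(first) > abs(last) else last
    # Step 553: average over all emotional words of the microblog.
    return mean([s for _, sc in clauses for s in sc])


with_summary = microblog_polarity(
    [("The screen is great", [0.8]), ("Overall I am disappointed", [-0.6])]
)
head_wins = microblog_polarity(
    [("Love the camera", [0.9]), ("battery is ok", []), ("a bit pricey", [-0.3])]
)
averaged = microblog_polarity(
    [("good", [0.5]), ("great middle", [0.7]), ("bad", [-0.5])]
)
```

The sign of the returned value gives the polarity and its magnitude the strength; mapping the value to discrete classes is left open here.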
Figure 7 shows the workflow of the invention for microblog emotional entity extraction.
Referring to Figure 7, the emotional entity extraction of the invention comprises microblog data collection, data preprocessing, feature extraction, dictionary loading, labeling and correction, model training, and emotional object extraction. The microblog data crawled from the Internet during data collection are saved as files; the emotional object extraction model obtained by model training is also saved for use in object extraction; and the results of emotional object extraction are saved as files so that users can review and correct the predictions.
Microblog data collection crawls microblog data from microblogging services on the Internet (such as Sina Weibo, Twitter, and Tencent Weibo) and saves the collected raw data as files organized in a defined way, providing the data for later processing by the system.
Data preprocessing performs preliminary processing on the raw microblog data to facilitate later feature extraction. It comprises data cleaning, data conversion, sentence splitting, word segmentation, part-of-speech tagging, and syntactic parsing; see Figure 8 for details.
Dictionary loading loads the dictionaries required by the data preprocessing and feature extraction steps, including the sentiment dictionary, the stop-word dictionary, and a dictionary of common Internet slang.
Feature extraction uses the dictionary data loaded by the dictionary loading module to extract predefined features from the preprocessed data, vectorizing the text into a format that the object extraction module can process.
Emotional object model training trains the emotional object extraction model at the core of the system. Training data in the required format are obtained from the labeling and correction module, and the CRF model built from the training data is trained with the L-BFGS algorithm. The CRF model used in the invention is derived from the linear CRF (linear conditional random field) model, and this is the first application of the CRF (conditional random field) model to emotional object recognition. By adding global variables to the conventional CRF model, the model can recognize cases in which the emotional object does not appear explicitly in the labeled sequence.
Emotional object extraction extracts emotional objects from microblog data, mainly by using the model produced by the model training module to make predictions.
Labeling and correction: the CRF model used in the invention is a supervised statistical learning method, so the data must be labeled. A feedback mechanism is also introduced so that the system learns from misclassified results. Existing methods generally do nothing with misclassified results, yet this feedback contains a great deal of useful information, and making full use of it is the key to the system's self-learning. With the feedback mechanism, the model can learn again from misclassified results, so the system becomes more accurate the more it is used.
Figure 8 shows the implementation of the data preprocessing step of the invention, which comprises the following steps:
(1) Data cleaning: data are read from the raw microblog data collected by the data collection module, and empty or invalid microblogs are filtered out.
(2) Data conversion: this step takes the data passed on from step (1) and converts parts of the microblog content to facilitate steps (3), (4), (5), and (6). The common cases are:
(a) microblogs often contain information that is irrelevant to the task, which is removed;
(b) links that are of no use to the task (such as image links and web links) and special strings are removed;
(c) topics marked with "#" and contacts marked with "@" are handled as follows: topics and contacts at the beginning or end of a microblog are deleted outright, while for those inside a sentence only the "#" and "@" symbols are deleted;
(d) microblogs often contain emoticons, which carry strong sentiment and are useful to the task but reduce the accuracy of word segmentation, part-of-speech (POS) tagging, and syntactic parsing, so they are extracted in this step;
(e) Internet slang is converted to standard wording, for example the slang "V5" is converted to the standard "威武" (mighty), which likewise improves the accuracy of word segmentation, POS tagging, and syntactic parsing.
(3) Sentence splitting: the conditional random field model of the emotional object recognition method performs sequence labeling at the sentence level for information extraction, but a microblog may contain more than one sentence, so it must be split into sentences. Splitting is based mainly on punctuation. Because of the particular nature of microblogs, however, punctuation alone is not enough: for convenience, many users separate sentences with spaces or special symbols (such as "~"), so these cases are handled as well.
(4) Word segmentation: the conditional random field model of the emotional object recognition method labels each word in a sentence-level sequence, so word segmentation is required. A dictionary of common Internet slang (such as "抓狂" and "围观") is used during segmentation to improve its accuracy.
(5) Part-of-speech tagging: each word is tagged with its part of speech after segmentation, which provides POS features for the feature extraction module of the invention.
(6) Syntactic parsing: a syntactic parsing tool is used to obtain the syntactic dependencies between the words of a sentence, which provides dependency features for the feature extraction module of the invention.
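Steps (2) and (3) above can be sketched with regular expressions. The sketch rests on assumptions that are not stated in the text: topics are written as "#topic#", contacts as "@name" followed by whitespace, and emoticons as bracketed codes such as "[哈哈]"; the slang dictionary contains only the example from the text; and the set of sentence delimiters is illustrative.

```python
import re

SLANG = {"V5": "威武"}                    # example from the text; toy dictionary
EDGE = r"(?:#[^#]+#|@\S+)"                # assumed topic / contact syntax
EMOTICON = r"\[[^\[\]]+\]"                # assumed bracketed emoticon code


def convert(text):
    """Step (2): return (converted text, extracted emoticons)."""
    text = re.sub(r"https?://\S+", "", text)           # (b) drop links
    text = re.sub(rf"^(?:\s*{EDGE})+", "", text)       # (c) topics/contacts at the head
    text = re.sub(rf"(?:{EDGE}\s*)+$", "", text)       # (c) topics/contacts at the tail
    text = text.replace("#", "").replace("@", "")      # (c) inside: drop symbols only
    emoticons = re.findall(EMOTICON, text)             # (d) pull emoticons out
    text = re.sub(EMOTICON, "", text)
    for slang, standard in SLANG.items():              # (e) normalize slang
        text = text.replace(slang, standard)
    return text.strip(), emoticons


def split_sentences(text):
    """Step (3): split on punctuation, whitespace, and "~"."""
    parts = re.split(r"[。！？!?；;~～\s]+", text)
    return [p for p in parts if p]


converted, emoticons = convert("#手机评测# 这款手机V5~屏幕很清晰 [哈哈] http://t.cn/abc @小明")
sentences = split_sentences(converted)
inner, _ = convert("今天和@小明 看#世界杯#比赛")
```

Keeping the extracted emoticons in a separate list preserves their sentiment signal for later steps while removing them from the text passed to the segmenter and parser.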
Figure 9 shows the implementation of the training step of the emotional object recognition model of the invention. Referring to Figure 9, the labeled training dataset comes from microblog data crawled from the Internet by the data collection module and processed by the data preprocessing module. The invention extracts emotional objects with the conditional random field (CRF) model, which is a supervised learning method, so the training dataset must also be labeled manually. To train the model, the dictionary loading module first loads the user dictionaries, including the sentiment word dictionary and the stop-word dictionary; next, the feature extraction module uses the dictionaries loaded in the previous step to extract features from the training dataset and normalize the data; finally, the model training module trains the model parameters on the normalized data using the L-BFGS algorithm.
The conditional random field model used in the present invention has the form shown in FIG. 10 and treats emotional object recognition as a sequence labeling problem. In the first layer of the model, $x$ denotes the input microblog sentence and $x_i$ denotes the word at the $i$-th position of the sentence. The $y_i$ in the second layer and the $g_1$, $g_2$ in the third layer are output states, whose labels can take the following values:

$L = \{\text{N-B}, \text{N-I}, \text{P-B}, \text{P-I}, \text{O}\}$

These five labels form the value space of the label at each position of the sequence. The N-B label marks the start of a negative emotional object, and the N-I label marks the continuation of a negative emotional object (its preceding label must be N-B or N-I). The P-B label marks the start of a positive emotional object, and the P-I label marks the continuation of a positive emotional object (likewise, its preceding label must be P-B or P-I). The O label covers everything else, so that $y_i \in L$. For example, in the sequence {"phone", "screen", "very", "clear"}, "phone screen" is a positive emotional object, and the labeling result is {"P-B", "P-I", "O", "O"}.
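The label scheme can be made concrete with a short sketch that checks the continuation constraint and decodes a labeled sequence into emotional objects. The function names `is_valid` and `extract_objects` are hypothetical, and joining tokens with a space is an assumption for the English rendering of the example.

```python
from typing import List, Tuple

LABELS = {"N-B", "N-I", "P-B", "P-I", "O"}


def is_valid(labels: List[str]) -> bool:
    """A continuation label must follow a label of the same polarity."""
    prev = "O"
    for lab in labels:
        if lab not in LABELS:
            return False
        if lab.endswith("-I") and prev[0] != lab[0]:
            return False
        prev = lab
    return True


def extract_objects(tokens: List[str], labels: List[str]) -> List[Tuple[str, str]]:
    """Return (object text, polarity) pairs from a labeled sequence."""
    if len(tokens) != len(labels) or not is_valid(labels):
        raise ValueError("invalid label sequence")
    objects: List[Tuple[str, str]] = []
    current: List[str] = []
    polarity = ""
    for tok, lab in zip(tokens, labels):
        if lab.endswith("-B"):
            if current:  # a new object starts right after another one
                objects.append((" ".join(current), polarity))
            current = [tok]
            polarity = "positive" if lab[0] == "P" else "negative"
        elif lab.endswith("-I"):
            current.append(tok)
        else:  # label O closes any open object
            if current:
                objects.append((" ".join(current), polarity))
            current = []
    if current:
        objects.append((" ".join(current), polarity))
    return objects


result = extract_objects(
    ["phone", "screen", "very", "clear"], ["P-B", "P-I", "O", "O"]
)
```

For the example sequence from the text, `result` contains the single pair `("phone screen", "positive")`.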
The model uses two global nodes $g_1$ and $g_2$ to represent two independent single emotional objects, so each can take only one of the three labels {N-B, P-B, O}: P-B for a positive emotional object, N-B for a negative emotional object, or O if it is not an emotional object. A global node cannot take the continuation labels N-I and P-I.
To improve the flexibility and extensibility of emotional object recognition, the conditional random field model adopted by the present invention is not limited to the graph structure shown in FIG. 9, nor to two hidden nodes $g_1$ and $g_2$; it can be extended to $g_1 \ldots g_n$ ($n \ge 1$), as shown in FIG. 11.
The specific embodiments described above explain the objects, technical solutions, and beneficial effects of the present invention in further detail. It should be understood that the foregoing is merely a description of specific embodiments of the present invention and is not intended to limit it; any modification, equivalent replacement, or improvement made within the spirit and principles of the present invention shall fall within its scope of protection.

Claims

1. A microblog-oriented emotional entity search system, characterized by comprising the following five modules:
1) a user interface, used for interaction between the system and the user, through which the user can submit a query request and obtain the returned result;
2) a query expansion module, used to mine word relationships from microblog corpus data and to build a weighted word relationship graph in combination with the WordNet ontology library;
3) a query processing module, used to convert the user's query request into query keywords and query statements acceptable to the index library, and to perform query expansion based on the word relationship graph built by module 2);
4) an emotional information mining module, used to perform emotional mining on the microblog corpus and to generate decision rules for emotional entities and emotional polarity;
5) an emotional information decision and index building module, used to determine the emotional entities and emotional polarity of microblog data, build an emotional information index, and store it;
6) an inverted index building module, used to build an inverted index over the microblog text and store it.
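The inverted index of module 6) can be sketched as follows. This is a minimal in-memory illustration, not the storage layout of the claimed system; the function names and the AND semantics of `search` are assumptions.

```python
from collections import defaultdict
from typing import Dict, List


def build_inverted_index(posts: Dict[int, List[str]]) -> Dict[str, List[int]]:
    """Map each term to the sorted list of post ids that contain it."""
    index = defaultdict(set)
    for post_id, tokens in posts.items():
        for tok in tokens:
            index[tok].add(post_id)
    return {term: sorted(ids) for term, ids in index.items()}


def search(index: Dict[str, List[int]], terms: List[str]) -> List[int]:
    """Return ids of posts that contain every query term."""
    if not terms:
        return []
    result = set(index.get(terms[0], []))
    for term in terms[1:]:
        result &= set(index.get(term, []))
    return sorted(result)


posts = {
    1: ["phone", "screen", "clear"],
    2: ["phone", "battery", "bad"],
    3: ["screen", "bad"],
}
index = build_inverted_index(posts)
hits = search(index, ["phone", "screen"])
```

Here the posts are assumed to be already segmented into tokens by the preprocessing steps.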
2. The microblog-oriented emotional entity search system according to claim 1, characterized in that module 1) implements query expansion through the following steps:
11) performing association rule mining on the data in the microblog corpus and outputting the resulting related word sets;
12) combining the frequent items obtained in 11) with the WordNet ontology library to build the weighted word relationship graph.
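Frequent itemset mining of the kind used in step 11) can be sketched with an Eclat-style search, which stores for each word the set of posts containing it and grows itemsets by intersecting those sets. This is a minimal sketch assuming each microblog post is a list of segmented words and that support is an absolute count; the function name `eclat` and its parameters are illustrative.

```python
from typing import Dict, FrozenSet, List, Set, Tuple


def eclat(transactions: List[List[str]], min_support: int) -> Dict[FrozenSet[str], int]:
    """Return every itemset whose support count is at least min_support."""
    # Vertical layout: item -> ids of the transactions that contain it.
    tidsets: Dict[str, Set[int]] = {}
    for tid, items in enumerate(transactions):
        for item in set(items):
            tidsets.setdefault(item, set()).add(tid)

    frequent: Dict[FrozenSet[str], int] = {}

    def recurse(prefix: FrozenSet[str], candidates: List[Tuple[str, Set[int]]]) -> None:
        for i, (item, tids) in enumerate(candidates):
            itemset = prefix | {item}
            frequent[itemset] = len(tids)
            # Extend the itemset with later items by intersecting id sets.
            extensions = [
                (other, tids & other_tids)
                for other, other_tids in candidates[i + 1:]
                if len(tids & other_tids) >= min_support
            ]
            if extensions:
                recurse(itemset, extensions)

    start = sorted(
        ((item, tids) for item, tids in tidsets.items() if len(tids) >= min_support),
        key=lambda pair: pair[0],
    )
    recurse(frozenset(), start)
    return frequent


itemsets = eclat(
    [
        ["phone", "screen", "clear"],
        ["phone", "screen"],
        ["phone", "battery"],
        ["screen", "clear"],
    ],
    min_support=2,
)
```

Word pairs that survive the support threshold, such as "phone" and "screen" here, are candidates for edges in the word relationship graph.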
3. The microblog-oriented emotional entity search system according to claim 1, characterized in that step 11) uses the Eclat algorithm to mine the frequent itemsets of the microblog corpus and generate related word sets, and combines the related word sets with the WordNet ontology graph by mapping, insertion, or similar means to form the weighted word relationship graph;
when the weighted word relationship graph is built, the node weight is computed as:
$f(d) = \deg(d) = \deg^{+}(d) + \deg^{-}(d)$
where $\deg(d)$, $\deg^{+}(d)$, and $\deg^{-}(d)$ denote the degree, out-degree, and in-degree of the node, respectively; the edge weight is computed as:
[Formula image imgf000025_0001: edge weight]
where the correlation term in the formula is the relevance of $d_i$ and $d_j$ obtained by the Eclat algorithm.
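The node weight in this claim is simply the degree of the node in a directed graph. A minimal sketch, assuming the graph is given as a list of (source, target) word pairs; the edge weight is not covered because its formula is only available as an image.

```python
from collections import defaultdict
from typing import Dict, List, Tuple


def node_weights(edges: List[Tuple[str, str]]) -> Dict[str, int]:
    """Compute f(d) = deg+(d) + deg-(d) for every node of a directed graph."""
    out_deg = defaultdict(int)
    in_deg = defaultdict(int)
    for src, dst in edges:
        out_deg[src] += 1  # out-degree of the source word
        in_deg[dst] += 1   # in-degree of the target word
    nodes = set(out_deg) | set(in_deg)
    return {node: out_deg[node] + in_deg[node] for node in nodes}


weights = node_weights(
    [("phone", "screen"), ("phone", "battery"), ("screen", "phone")]
)
```

In this example "phone" has out-degree 2 and in-degree 1, so its weight is 3.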
4. The microblog-oriented emotional entity search system according to claim 1, characterized in that module 3) implements query processing through the following steps:
31) receiving the query word or sentence entered by the user;
32) segmenting the user's input, removing stop words, and identifying center words, to obtain one or more center words;
33) selecting suitable expansion words for the center words from the weighted word relationship graph library built from the ontology and the rule words, and computing weights for the expansion words;
34) adding the p expansion words with the largest weights to the query word set, and passing the expanded word set to the query interface.
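Steps 33) and 34) can be sketched as a top-p selection over the neighbors of the center words. The sketch assumes the expansion weights have already been computed and stored in a nested dictionary; taking the maximum weight when a word neighbors several center words is an illustrative choice, not the weighting defined in the claims.

```python
import heapq
from typing import Dict, List


def expand_query(
    center_words: List[str],
    graph: Dict[str, Dict[str, float]],
    p: int,
) -> List[str]:
    """Return the center words followed by the p heaviest expansion words."""
    scores: Dict[str, float] = {}
    for word in center_words:
        for neighbor, weight in graph.get(word, {}).items():
            if neighbor in center_words:
                continue  # already part of the query
            scores[neighbor] = max(scores.get(neighbor, 0.0), weight)
    # Ties on weight are broken by the word itself to keep the result stable.
    top = heapq.nlargest(p, scores.items(), key=lambda kv: (kv[1], kv[0]))
    return list(center_words) + [word for word, _ in top]


graph = {
    "phone": {"mobile": 0.9, "screen": 0.6, "case": 0.2},
    "screen": {"display": 0.8, "phone": 0.5},
}
expanded = expand_query(["phone", "screen"], graph, p=2)
```

With p = 2 the query "phone screen" is extended by "mobile" and "display", the two heaviest neighbors that are not already in the query.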
5. The microblog-oriented emotional entity search system according to claim 4, characterized in that step 33) computes the weights of the expansion words as follows:
suppose the original query is $Q = \{q_1, q_2, \ldots\}$, where term $q_t$ has nearest neighbors $d_1, d_2, \ldots$; the relevance between the original query term $q_t$ and a nearest-neighbor term $d_k$ is computed as:
[Formula image imgf000025_0002: relevance between the query term and its nearest-neighbor term]
where the terms of the formula are the relevance of word $q_t$ and word $d_k$, the weights of the two words, and the degree of the word; the weight of each nearest-neighbor word is computed as:
$W(d_k) = \sum W(q_t, d_k) / m$
6. The microblog-oriented emotional entity search system according to claim 1, characterized in that module 4) implements the recognition and determination of emotional entities through the following steps:
41) collecting representative microblog data;
42) preprocessing the collected microblog data, including cleaning, conversion, sentence splitting, word segmentation, part-of-speech tagging, and syntactic parsing;
43) extracting features from the microblog data and expressing it as feature vectors;
44) training the emotional entity recognition model to obtain the model parameters;
45) outputting and storing the emotional entity decision model.
7. The microblog-oriented emotional entity search system according to claim 6, characterized in that step 43) implements feature extraction as follows: taking word context into account, a custom dictionary that includes global features is designed; features are extracted from the microblog data according to the custom dictionary, and the microblog data is converted into an input data format that the emotional entity recognition model can process.
8. The microblog-oriented emotional entity search system according to claim 6, characterized in that step 44) implements the emotional entity recognition model as follows: global feature nodes are introduced into the conditional random field (CRF) model to build a GLCRF model (global conditional random field model) that incorporates global features, and the model parameters are obtained by training with the L-BFGS algorithm.
9. The microblog-oriented emotional entity search system according to claim 1, characterized in that module 5) determines the emotional polarity of a microblog through the following steps:
51) removing noise from the microblog data and converting its semantic form;
52) word segmentation, part-of-speech tagging, and Chinese syntactic parsing;
53) extracting emotional phrases in combination with an emotional dictionary;
54) filtering the emotional phrases;
55) determining the emotional polarity and outputting the result.
10. The microblog-oriented emotional entity search system according to claim 9, characterized in that step 53) uses the sentiPY method to extract emotional phrases, the form of an emotional phrase being uniformly expressed as phrase: modifier* sentiment, that is, a phrase consists of one central emotional word (sentiment), possibly accompanied by several modifying adverbs (modifier); step 55) uses a hybrid decision algorithm based on the emotional focus to determine the emotional polarity of the microblog, the decision process comprising the following steps:
551) determining whether the sentence contains a summary word; if not, going to step 552); if so, taking the text after the summary word as the emotional focus and outputting the polarity of the emotional focus as the emotional polarity of the microblog;
552) taking the first and last sentences of the microblog as the emotional focus and comparing their emotional polarities; if the two polarities cancel each other, going to step 553); otherwise, outputting the stronger of the two as the emotional polarity of the microblog;
553) computing the strength of every emotional word in the whole microblog, summing and averaging, and outputting the average strength as the emotional polarity of the microblog.
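The three-stage decision of steps 551) to 553) can be sketched as follows. The summary words, the lexicon and its scores, the English tokenization, and the reading of "cancel each other" as a zero sum are all assumptions made for illustration; the returned signed number stands for polarity (sign) and strength (magnitude).

```python
import re
from typing import List

# Hypothetical resources; a real system would load them from dictionaries.
SUMMARY_WORDS = ("in short", "overall")
LEXICON = {"good": 1.0, "great": 2.0, "bad": -1.0, "terrible": -2.0}


def sentiment_scores(text: str) -> List[float]:
    """Scores of all lexicon words found in the text."""
    return [LEXICON[w] for w in re.findall(r"[a-z]+", text.lower()) if w in LEXICON]


def decide_polarity(post: str) -> float:
    text = post.lower()
    # Step 551: the text after a summary word is the emotional focus.
    for summary_word in SUMMARY_WORDS:
        pos = text.find(summary_word)
        if pos != -1:
            return sum(sentiment_scores(text[pos + len(summary_word):]))
    # Step 552: compare the first and the last sentence.
    sentences = [s for s in re.split(r"[.!?]+", text) if s.strip()]
    if not sentences:
        return 0.0
    first = sum(sentiment_scores(sentences[0]))
    last = sum(sentiment_scores(sentences[-1]))
    if first + last != 0:
        return first if abs(first) >= abs(last) else last
    # Step 553: the two cancel out, so average over the whole post.
    scores = sentiment_scores(text)
    return sum(scores) / len(scores) if scores else 0.0


a = decide_polarity("The screen is bad. Overall great phone.")
b = decide_polarity("Great camera. The battery is bad.")
c = decide_polarity("Good camera. Great screen. Bad battery.")
```

The first post is decided by the summary word, the second by the stronger of its first and last sentences, and the third falls through to the average because its first and last sentences cancel.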
PCT/CN2013/088772 2013-09-29 2013-12-06 Microblog-oriented emotional entity search system WO2015043075A1 (en)

Priority Applications (2)

Application Number Priority Date Filing Date Title
CN201310461443.6A CN103544242B (en) 2013-09-29 2013-09-29 Microblog-oriented emotion entity searching system
CN201310461443.6 2013-09-29

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
DE112013004082.4T DE112013004082T5 (en) 2013-09-29 2013-12-06 Search system of the emotion entity for the microblog

Publications (1)

Publication Number Publication Date
WO2015043075A1 true WO2015043075A1 (en) 2015-04-02

Family

ID=49967694

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/CN2013/088772 WO2015043075A1 (en) 2013-09-29 2013-12-06 Microblog-oriented emotional entity search system

Country Status (3)

Country Link
CN (1) CN103544242B (en)
DE (1) DE112013004082T5 (en)
WO (1) WO2015043075A1 (en)

Cited By (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN111400617A (en) * 2020-06-02 2020-07-10 四川大学 Social robot detection data set extension method and system based on active learning

Families Citing this family (23)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN105095271B (en) * 2014-05-12 2019-04-05 北京大学 Microblogging search method and microblogging retrieve device
CN105095270B (en) * 2014-05-12 2019-02-26 北京大学 Retrieve device and search method
CN104217026B (en) * 2014-09-28 2017-08-11 福州大学 A kind of Chinese micro-blog tendentiousness search method based on graph model
CN104346326A (en) * 2014-10-23 2015-02-11 苏州大学 Method and device for determining emotional characteristics of emotional texts
CN104516947B (en) * 2014-12-03 2017-08-22 浙江工业大学 A kind of Chinese microblog emotional analysis method for merging dominant and recessive character
CN104484437B (en) * 2014-12-24 2018-07-20 福建师范大学 A kind of network short commentary emotion method for digging
CN104598588B (en) * 2015-01-19 2017-08-11 河海大学 Microblog users label automatic generating calculation based on double focusing class
CN105989176A (en) * 2015-03-05 2016-10-05 北大方正集团有限公司 Data processing method and device
CN104794212B (en) * 2015-04-27 2018-04-10 清华大学 Context sensibility classification method and categorizing system based on user comment text
CN105183803A (en) * 2015-08-25 2015-12-23 天津大学 Personalized search method and search apparatus thereof in social network platform
CN105183807A (en) * 2015-08-26 2015-12-23 苏州大学张家港工业技术研究院 emotion reason event identifying method and system based on structure syntax
CN105045925A (en) * 2015-08-26 2015-11-11 苏州大学张家港工业技术研究院 Emotional cause event recognition method and system based on CRF model
CN106599737A (en) * 2015-10-19 2017-04-26 北京奇虎科技有限公司 Information display method, information display device, and terminal
CN106610990B (en) * 2015-10-22 2020-12-29 北京国双科技有限公司 Method and device for analyzing emotional tendency
CN106910512A (en) * 2015-12-18 2017-06-30 株式会社理光 The analysis method of voice document, apparatus and system
CN105589976B (en) * 2016-03-08 2019-03-12 重庆文理学院 Method and device is determined based on the target entity of semantic relevancy
CN107515877B (en) * 2016-06-16 2021-07-20 百度在线网络技术(北京)有限公司 Sensitive subject word set generation method and device
CN106339368A (en) * 2016-08-24 2017-01-18 乐视控股(北京)有限公司 Text emotional tendency acquiring method and device
CN106776566B (en) * 2016-12-22 2019-12-24 东软集团股份有限公司 Method and device for recognizing emotion vocabulary
CN107330041A (en) * 2017-06-27 2017-11-07 达而观信息科技(上海)有限公司 A kind of relevant search word method for digging decayed based on the time and system
CN108629005A (en) * 2018-05-04 2018-10-09 北京林业大学 A kind of detection method and device of the descriptor of earthquake emergency
CN109376239B (en) * 2018-09-29 2021-07-30 山西大学 Specific emotion dictionary generation method for Chinese microblog emotion classification
CN110110744A (en) * 2019-03-27 2019-08-09 平安国际智慧城市科技股份有限公司 Text matching method, device and computer equipment based on semantic understanding

Citations (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN102073692A (en) * 2010-12-16 2011-05-25 北京农业信息技术研究中心 Agricultural field ontology library based semantic retrieval system and method
WO2011112319A2 (en) * 2010-03-12 2011-09-15 Yahoo! Inc. Emotional targeting
CN102279890A (en) * 2011-09-02 2011-12-14 苏州大学 Sentiment word extracting and collecting method based on micro blog

Also Published As

Publication number Publication date
DE112013004082T5 (en) 2015-07-23
CN103544242A (en) 2014-01-29
CN103544242B (en) 2017-02-15

Legal Events

Date Code Title Description
WWE Wipo information: entry into national phase

Ref document number: 1120130040824

Country of ref document: DE

Ref document number: 112013004082

Country of ref document: DE

121 Ep: the epo has been informed by wipo that ep was designated in this application

Ref document number: 13894417

Country of ref document: EP

Kind code of ref document: A1

122 Ep: pct application non-entry in european phase

Ref document number: 13894417

Country of ref document: EP

Kind code of ref document: A1