CN104331506A

CN104331506A - Multiclass emotion analyzing method and system facing bilingual microblog text

Info

Publication number: CN104331506A
Application number: CN201410670909.8A
Authority: CN
Inventors: 礼欣; 栗雨晴; 韩煦; 宋丹丹; 廖乐健
Original assignee: Beijing Institute of Technology BIT
Current assignee: Beijing Institute of Technology BIT
Priority date: 2014-11-20
Filing date: 2014-11-20
Publication date: 2015-02-04

Abstract

The present invention relates to a multi-category sentiment analysis method and system for bilingual microblog text, belonging to the technical field of microblog text sentiment analysis. The method comprises the following steps: (1) bilingual dictionary construction: first collect a corpus of a certain scale with emotional tendencies and extract high-frequency words with emotional tendencies from it; then expand the sentiment dictionary using existing knowledge bases and a word similarity calculation model; finally add Internet slang and emoticons to the sentiment dictionary; (2) text preprocessing: segment the text to be recognized into words, remove stop words, and normalize English word forms; (3) text feature space representation: vectorize the text using the bilingual sentiment dictionary; (4) use a multi-category sentiment classification model to perform emotion recognition on the corpus text. The accuracy and F1 value of the method are higher than those of traditional classification methods; in particular, the semi-supervised Gaussian mixture model classification algorithm performs markedly better than other methods on small-scale training sets.

Description

A multi-category sentiment analysis method and system for bilingual microblog text

Technical Field

The invention relates to a sentiment analysis method and system, in particular to a multi-category sentiment analysis method and system for bilingual microblog text, and belongs to the technical field of microblog text sentiment analysis.

Background Art

With the rise of social media platforms and the widespread use of mobile devices, people have become accustomed to expressing feelings and demands in 140 characters. Posting microblogs has become an important way for individuals to express emotion, so analyzing the sentiment tendency of microblog text is of great practical significance. Sina Weibo is currently the main carrier of online public opinion in China, and a large number of users exchange information and express emotions through it. Developing a sentiment classification system for users' microblog text, and thereby recognizing emotions, provides an important reference in fields such as public opinion monitoring and product evaluation.

Most existing sentiment analysis systems classify microblog text into two categories, positive and negative. Human emotions, however, are complex and diverse: positive emotions include trust, gratitude, and relief, while negative emotions include pain, contempt, hatred, and jealousy. Simply dividing emotions into two categories cannot guarantee accurate emotion discrimination, and fine-grained sentiment classification systems that capture what groups care about are still lacking. Current microblog sentiment analysis systems mainly perform statistical analysis on text in a single language, namely Chinese. In recent years, however, with rising education levels in mainland China and the influence of internationalization, mixing Chinese with English or writing entirely in English has gradually become an important form of individual emotional expression. Such mixed Chinese-English microblog text brings new challenges to microblog sentiment analysis, and sentiment classification systems based on monolingual methods are no longer suited to the increasingly complex microblog language environment.

In addition, most current work on identifying sentiment words obtains them through machine translation. Microblog text, however, is short and limited to 140 characters, its vocabulary is complex, and the amount of English slang and popular Internet phrases grows daily, so the quality of machine translation cannot be guaranteed.

Summary of the Invention

The purpose of the invention is to solve the problems that existing microblog sentiment analysis methods classify at a coarse granularity, analyze mixed Chinese-English microblog text poorly, and rely on outdated methods for identifying sentiment words. To this end, the invention provides a method for constructing a Chinese-English bilingual sentiment dictionary from a microblog corpus, a multi-category microblog sentiment analysis method based on the bilingual dictionary, and a multi-category sentiment analysis system for bilingual microblog text, so that multi-category sentiment analysis can be performed on microblog text.

The idea of the technical solution is to collect a large corpus of microblog text with emotional tendencies, build a Chinese-English sentiment dictionary, and construct multiple sentiment classifiers using a mixture of semi-supervised and fully supervised models. After the bilingual text is preprocessed, it is represented in a feature space according to the emotion categories of its words, and the constructed classifiers then perform emotion recognition on the microblog text.

The specific implementation steps of the invention are as follows:

A method for constructing a Chinese-English bilingual sentiment dictionary, comprising the following steps:

Step 1: crawl microblog web pages, collect Chinese and English corpus text with emotional tendencies from them, extract high-frequency words with emotional tendencies from the corpus, and add them to the sentiment dictionary;

Step 2: expand the sentiment dictionary using existing knowledge bases;

Step 3: analyze the crawled microblog corpus and add emerging Internet slang and emoticons to the sentiment dictionary.

Preferably, the emotional tendencies comprise five categories: social care, happiness, sadness, anger, and fear.

Preferably, the knowledge bases include WordNet, NTUSD, and HowNet.

Preferably, the expansion in Step 2 computes, for each sentiment word in the knowledge bases, its average similarity to the words of each emotion category in the sentiment dictionary, and adds the word to the category with the highest average similarity.

Preferably, the emotional tendencies of the emerging Internet slang and emoticons are classified by a show-of-hands vote among several people.

A multi-category sentiment analysis method based on the bilingual dictionary, comprising the following steps:

Step 1: preprocess the corpus text;

Step 2: represent the corpus text in a feature space according to the Chinese-English bilingual sentiment dictionary;

Step 3: classify the sentiment of the corpus text using an established text sentiment classifier model.

Preferably, the preprocessing includes word segmentation and stop-word removal, and for English text also word-form normalization.

Preferably, the text feature space representation expresses each text in the corpus as a five-dimensional vector, each element of which is the number of words in the text that belong to the corresponding category of the Chinese-English bilingual sentiment dictionary.

Preferably, the sentiment classifier model is a semi-supervised Gaussian mixture model classification algorithm (Semi-GMM) or a K-nearest neighbor algorithm based on symmetric relative entropy (KNN-KL).

Preferably, the semi-supervised Gaussian mixture model classification algorithm learns a Gaussian mixture model from the labeled training corpus, then takes the model parameters and the probability distribution of the labeled samples as the initial parameter values of the Gaussian mixture model and learns iteratively over the labeled test corpus until the algorithm converges or the unlabeled set is empty.

Preferably, the K-nearest neighbor algorithm based on symmetric relative entropy uses relative entropy to measure the sentiment similarity of texts, and thus the distance between them, and determines the category of a sample to be classified from the categories of its neighboring samples.

Preferably, the relative entropy is calculated using the following formula:

$D(T_i \| T_j) = \frac{\sum_{k=1}^{5} \omega_{ik} \log_2 \frac{\omega_{ik}}{\omega_{jk}} \times \sum_{k=1}^{5} \omega_{jk} \log_2 \frac{\omega_{jk}}{\omega_{ik}}}{\sum_{k=1}^{5} \left( \omega_{ik} \log_2 \frac{\omega_{ik}}{\omega_{jk}} + \omega_{jk} \log_2 \frac{\omega_{jk}}{\omega_{ik}} \right)}$

where T_i is the normalized vector representation of a labeled text, T_j is the normalized vector representation of an unlabeled text, ω_ik and ω_jk are the k-th components of T_i and T_j respectively, and k is an integer from 1 to 5.
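The formula above can be sketched in a few lines of Python. This is a minimal illustration rather than the patent's implementation: the function names are hypothetical, and the small smoothing constant `eps` is an assumption added here so that the logarithm stays finite when a category count is zero (the source does not say how zero components are handled).

```python
import math

def kl_divergence(p, q, eps=1e-9):
    """Directed relative entropy D(p || q) in bits.

    eps is an assumption of this sketch: it is added to every component
    before normalization so that zero counts do not produce log(0).
    """
    p = [x + eps for x in p]
    q = [x + eps for x in q]
    sp, sq = sum(p), sum(q)
    return sum((a / sp) * math.log2((a / sp) / (b / sq)) for a, b in zip(p, q))

def symmetric_relative_entropy(t_i, t_j):
    """Product of the two directed divergences divided by their sum."""
    d_ij = kl_divergence(t_i, t_j)
    d_ji = kl_divergence(t_j, t_i)
    total = d_ij + d_ji
    if total <= 0.0:          # identical distributions
        return 0.0
    return max(0.0, d_ij * d_ji / total)

# Hypothetical five-dimensional count vectors
# (social care, happiness, sadness, anger, fear).
labeled = [3, 1, 0, 0, 0]
unlabeled = [0, 0, 2, 1, 0]
distance = symmetric_relative_entropy(labeled, unlabeled)
```

Because both the product and the sum are unchanged when the arguments are swapped, the resulting distance does not depend on which text is treated as the labeled one.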

A multi-category sentiment analysis system for bilingual microblog text, comprising a Chinese-English bilingual sentiment dictionary, a corpus preprocessing module, a corpus text feature space representation module, and a sentiment classifier recognition module. The Chinese-English bilingual sentiment dictionary is built using the bilingual sentiment dictionary construction method described above. The corpus preprocessing module performs word segmentation and stop-word removal on the corpus text to be analyzed, and additionally word-form normalization for English text. The corpus text feature space representation module vectorizes the text output by the preprocessing module as a five-dimensional vector whose five elements are the numbers of words in the text that appear in the Chinese-English bilingual sentiment dictionary under the categories social care, happiness, sadness, anger, and fear, respectively. The sentiment classifier recognition module applies the sentiment classifier model to the corpus text vector to recognize its emotion and determine the emotion category to which the corpus text belongs.
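One way to picture how the four components fit together is the sketch below. The class name, field names, toy lexicon, and the placeholder preprocessing and classification functions are all hypothetical; in particular, the placeholder classifier simply picks the category with the largest count and stands in for the Semi-GMM or KNN-KL model.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List, Set

@dataclass
class SentimentSystem:
    lexicon: Dict[str, Set[str]]              # bilingual sentiment dictionary
    preprocess: Callable[[str], List[str]]    # corpus preprocessing module
    classify: Callable[[List[int]], str]      # sentiment classifier module

    def vectorize(self, tokens: List[str]) -> List[int]:
        """Feature space representation: one count per emotion category."""
        return [sum(t in words for t in tokens) for words in self.lexicon.values()]

    def analyze(self, text: str) -> str:
        return self.classify(self.vectorize(self.preprocess(text)))

# Toy lexicon; the real dictionary holds thousands of entries per category.
toy_lexicon = {
    "social care": {"thank", "感谢"},
    "happiness": {"happy", "高兴"},
    "sadness": {"sad", "难过"},
    "anger": {"angry"},
    "fear": {"afraid"},
}
categories = list(toy_lexicon)

def largest_count(vec: List[int]) -> str:
    # Placeholder classifier, not one of the two models in the patent.
    return categories[max(range(len(vec)), key=vec.__getitem__)]

system = SentimentSystem(toy_lexicon, lambda s: s.lower().split(), largest_count)
result = system.analyze("so happy 高兴 today")
```

Passing the preprocessing and classification steps in as callables keeps the sketch independent of any particular segmenter or model.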

Beneficial Effects

In the field of microblog text sentiment analysis, the invention uses Sina Weibo message text and existing knowledge bases to build a fine-grained, five-category Chinese-English bilingual sentiment dictionary, and builds multi-category sentiment classifiers for bilingual microblogs based on a semi-supervised Gaussian mixture model and on a K-nearest neighbor algorithm with symmetric relative entropy, in order to analyze the sentiment of Chinese-English bilingual microblogs. Experimental results show that the accuracy and F1 value of the proposed sentiment classification method based on the bilingual sentiment dictionary are higher than those of traditional classification methods. In particular, the semi-supervised Gaussian mixture model classification algorithm performs markedly better than other methods on small-scale training sets.

Brief Description of the Drawings

Fig. 1 is a flowchart of the microblog text sentiment analysis method in an embodiment of the invention;

Fig. 2 is a schematic diagram of example bilingual microblog texts in an embodiment of the invention;

Fig. 3 is a schematic flowchart of the sentiment classification algorithm of the semi-supervised Gaussian mixture model in an embodiment of the invention;

Fig. 4 is a schematic diagram comparing the accuracy of several machine learning text sentiment classification algorithms in an embodiment of the invention;

Fig. 5 is a schematic diagram comparing the accuracy of several sentiment classification algorithms on bilingual microblog text in an embodiment of the invention;

Fig. 6 is a structural diagram of the multi-category sentiment analysis system for bilingual microblog text in an embodiment of the invention.

Detailed Description of the Embodiments

Fig. 1 is a flowchart of a multi-category sentiment analysis method for bilingual microblog text according to an embodiment of the invention. The main workflow of text emotion recognition is as follows:

(1) Bilingual sentiment dictionary construction: first collect a corpus of a certain scale with emotional tendencies and extract high-frequency words with emotional tendencies from it; then expand the sentiment dictionary using existing knowledge bases (WordNet, NTUSD, and HowNet) and a word similarity calculation model; finally add emerging Internet slang and emoticons to the sentiment dictionary;

(2) Text preprocessing: segment the text to be recognized into words and remove stop words. Stop words are function words in human language that carry no real meaning, such as determiners in English (“the”, “a”, “an”, “that”). English text additionally undergoes lemmatization and stemming;

(3) Text feature space representation: extract features from the words of the text using the constructed bilingual sentiment dictionary, and represent the text as a five-dimensional vector according to the emotion categories of its words;

(4) Use a multi-category sentiment classification model to perform emotion recognition on the corpus text.

1. Construction of the bilingual sentiment dictionary:

Before describing how the sentiment dictionary is built, we first introduce the classification of emotional tendencies.

· Classification of text sentiment tendency:

Human emotions are complex and diverse, and venting emotions on microblogs has become part of daily life. Simply dividing emotions into positive and negative cannot guarantee accurate emotion discrimination. In this embodiment, to better cover the various emotions in microblog text and make the sentiment classification system more fine-grained, we divide the emotion categories into five: social care, happiness, sadness, anger, and fear.

· Sentiment dictionary construction:

The construction of the sentiment dictionary depends on a microblog text collection. When texts are manually labeled with emotional tendencies, a small number of sentiment-bearing words in the text often play the decisive role. Sina Weibo has a very large collection of microblog text, and users can instantly publish texts of up to 140 characters, so building a sentiment dictionary suited to the sentiment classification of microblog text is the foundation of this invention. Current microblog research is mostly based on single-language corpora, but mixing Chinese and English has become a popular trend in individual expression. Building only a Chinese sentiment dictionary therefore cannot meet the sentiment classification needs of mixed Chinese-English microblog text. To further illustrate why an English sentiment dictionary is needed, Fig. 2 shows two posts published by microblog users; it can be seen that users with bilingual habits tend to use English sentiment words to express emotion when discussing a topic.

To build the bilingual sentiment dictionary, we first collect a corpus of a certain scale with emotional tendencies and manually extract from it a small number of high-frequency bilingual words with emotional tendencies (such as happy, sad, 高兴 (happy), 漂亮 (beautiful), 难过 (sad)) to add to the sentiment dictionary; these serve as seed sentiment words. In this embodiment, more than 7,000 microblog texts were collected from Sina Weibo as the corpus. We selected some words from the corpus and manually labeled them with the five emotion categories to form the seed word set seedset = {PA, PB, PC, PD, PE}, where PA, PB, PC, PD, and PE are the subsets for the respective emotions (social care, happiness, sadness, anger, fear).

Then, based on the selected seed word set, we compute the similarity between the seed sentiment words and the sentiment words in each knowledge base, and add knowledge-base words that are highly similar to the seed words to the five selected seed categories. In this embodiment, the existing knowledge bases WordNet, NTUSD, and HowNet are applied in turn to expand the sentiment dictionary. A suitable number of words is selected from the knowledge bases according to specific needs; in this example, most of the words in HowNet and WordNet were selected. In these knowledge bases each word v_k can be described by several concepts, each concept C_k is defined in the knowledge base's description language on the basis of sememes, and each concept contains several sememes.

For the semantic similarity between two Chinese words, the invention uses the HowNet word similarity calculation method, defined in formulas (1) and (2) below:

$\mathrm{similarity}(v_1, v_2) = \max_{i = 1 \ldots n,\ j = 1 \ldots m} \mathrm{similarity}(C_{1i}, C_{2j}) \qquad (1)$

$\mathrm{similarity}(C_{1i}, C_{2j}) = \frac{1}{p1 \cdot p2} \sum_{i=1}^{p1} \sum_{j=1}^{p2} \left( \frac{1}{2^{p1 - i + 1}} \times \frac{\alpha}{\alpha + d} \right) \qquad (2)$

In formula (1), similarity(v₁, v₂) denotes the similarity between two words. Word v₁ has n concepts, each containing p1 sememes; word v₂ has m concepts, each containing p2 sememes. C_1i denotes the i-th concept of v₁, and C_2j denotes the j-th concept of v₂.

In formula (2), p1 and p2 are the numbers of sememes in the two concepts, and 1/2^(p1−i+1) is the weight of the i-th sememe in concept C₁. The maximum sememe similarity between the two words, as in formula (1), is taken as the similarity of the two words. α is a positive adjustable parameter, and d is the distance between sememes in the HowNet sememe tree.
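Formulas (1) and (2) can be written out directly as code. The sketch below assumes that the sememe-tree distances d have already been looked up and are supplied as a small matrix per concept pair; the function names are hypothetical, and the default `alpha=1.6` is an assumed value chosen only for illustration, since the source says no more than that α is a positive adjustable parameter.

```python
def concept_similarity(dist, alpha=1.6):
    """Formula (2) for one pair of concepts.

    dist[i][j] is the sememe-tree distance d between the (i+1)-th sememe of
    the first concept and the (j+1)-th sememe of the second concept.
    alpha=1.6 is an assumed default, not a value given in the source.
    """
    p1, p2 = len(dist), len(dist[0])
    total = 0.0
    for i in range(1, p1 + 1):
        weight = 1.0 / 2 ** (p1 - i + 1)      # weight of the i-th sememe
        for j in range(1, p2 + 1):
            d = dist[i - 1][j - 1]
            total += weight * alpha / (alpha + d)
    return total / (p1 * p2)

def word_similarity(concept_pair_dists, alpha=1.6):
    """Formula (1): the maximum over all concept pairs of two words."""
    return max(concept_similarity(dist, alpha) for dist in concept_pair_dists)

# Hypothetical example: two concept pairs, each with a single sememe,
# at tree distances 0 and 4.
example = word_similarity([[[0]], [[4]]])
```

Read literally, formula (2) gives 0.5 rather than 1 for two single-sememe concepts at distance 0, which is what the example reproduces.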

For the semantic similarity between English words, the Lesk method in WordNet is used to measure the relatedness between words. Every concept (word sense) in WordNet is defined by a short gloss. The Lesk method computes the similarity similarity(v₁, v₂) between two words by finding and measuring the overlap between the glosses of two concepts. Because each English word has several morphological forms (for example, happily and happiness are both related to happy), the different forms of a word need to be merged by stripping affixes to obtain the root, that is, word-form normalization, which improves the efficiency of text processing. The invention normalizes English words using two normalization tools provided in NLTK, the Lancaster stemmer and the WordNet Lemmatizer.
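The gloss-overlap idea behind the Lesk method can be illustrated with a toy example. The glosses below are invented for illustration and are not taken from WordNet, the function names are hypothetical, and the overlap score is a simple count of shared words rather than the exact scoring used by any WordNet toolkit.

```python
GLOSS_STOP_WORDS = {"a", "an", "the", "of", "or", "and", "to", "in"}

def gloss_overlap(gloss_a, gloss_b):
    """Number of distinct non-stop-word tokens shared by two glosses."""
    tokens_a = {w for w in gloss_a.lower().split() if w not in GLOSS_STOP_WORDS}
    tokens_b = {w for w in gloss_b.lower().split() if w not in GLOSS_STOP_WORDS}
    return len(tokens_a & tokens_b)

def lesk_similarity(glosses_a, glosses_b):
    """Best overlap over all pairs of senses of the two words."""
    return max(gloss_overlap(a, b) for a in glosses_a for b in glosses_b)

# Invented glosses, one sense per word, for illustration only.
toy_glosses = {
    "happy": ["feeling great pleasure or joy"],
    "glad": ["feeling pleasure and joy about something"],
    "angry": ["feeling strong annoyance or hostility"],
}
happy_glad = lesk_similarity(toy_glosses["happy"], toy_glosses["glad"])
happy_angry = lesk_similarity(toy_glosses["happy"], toy_glosses["angry"])
```

In this toy data "happy" shares three gloss words with "glad" but only one with "angry", so "glad" would be judged the closer word.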

Once the similarity measure is determined, we use the following formula to compute the similarity between sentiment words in the knowledge bases and the words in the seed word set, and add each knowledge-base sentiment word to the corresponding emotion category of the sentiment dictionary:

$\varpi(v) = \underset{\{PA, PB, PC, PD, PE\}}{\arg\max} \left( \frac{1}{N_1} \sum_{n_1=1}^{N_1} \mathrm{similarity}(v, PA_{n_1}),\ \frac{1}{N_2} \sum_{n_2=1}^{N_2} \mathrm{similarity}(v, PB_{n_2}),\ \frac{1}{N_3} \sum_{n_3=1}^{N_3} \mathrm{similarity}(v, PC_{n_3}),\ \frac{1}{N_4} \sum_{n_4=1}^{N_4} \mathrm{similarity}(v, PD_{n_4}),\ \frac{1}{N_5} \sum_{n_5=1}^{N_5} \mathrm{similarity}(v, PE_{n_5}) \right)$

where N₁, N₂, N₃, N₄, and N₅ are the numbers of seed words in the respective emotion subsets, and ϖ(v) is the emotion category assigned to a non-seed word v, namely the category whose subset has the highest average similarity to v.
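The assignment rule can be sketched as follows. The function names, the toy seed sets, and the lookup-table similarity are hypothetical stand-ins; in the embodiment the similarity would come from the HowNet or Lesk measures described earlier.

```python
def assign_category(word, seed_sets, similarity):
    """Category whose seed words have the highest mean similarity to word."""
    def mean_similarity(seeds):
        return sum(similarity(word, seed) for seed in seeds) / len(seeds)
    return max(seed_sets, key=lambda category: mean_similarity(seed_sets[category]))

def expand_dictionary(candidates, seed_sets, similarity):
    """Add every candidate word to its best-matching category."""
    lexicon = {category: set(seeds) for category, seeds in seed_sets.items()}
    for word in candidates:
        lexicon[assign_category(word, seed_sets, similarity)].add(word)
    return lexicon

# Toy seed sets standing in for PA..PE.
seed_sets = {
    "social care": ["care"],
    "happiness": ["happy", "joy"],
    "sadness": ["sad"],
    "anger": ["angry"],
    "fear": ["afraid"],
}

# Invented similarity scores; unlisted pairs count as 0.0.
TOY_SIMILARITY = {
    ("glad", "happy"): 0.9, ("glad", "joy"): 0.7, ("glad", "sad"): 0.1,
    ("furious", "angry"): 0.8, ("furious", "sad"): 0.3,
}

def toy_similarity(a, b):
    return TOY_SIMILARITY.get((a, b), 0.0)

expanded = expand_dictionary(["glad", "furious"], seed_sets, toy_similarity)
```

Averaging over each seed subset, rather than taking the single best seed match, keeps one unusually similar seed word from deciding the category on its own.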

Beyond traditional sentiment words, users increasingly use emerging Internet slang and emoticons to express emotion in microblog text. Therefore, in addition to expanding the vocabulary from existing knowledge bases, we observed and summarized a large amount of microblog text, manually added Internet slang and emoticons to the sentiment dictionary, and classified their emotional tendencies by a show-of-hands vote among several people.

In summary, the dictionary constructed in this embodiment contains 7,590 Chinese sentiment words, 421 English sentiment words, 613 Internet slang terms, and 101 commonly used emoticons. It covers 971 words in the “social care” category, 2,731 in “happiness”, 2,289 in “sadness”, 1,458 in “anger”, and 1,276 in “fear”.

2. Text preprocessing:

The microblog message text is first segmented into words. The invention uses the ICTCLAS word segmentation system to identify words in Chinese text, and splits English text on whitespace. After a microblog message is segmented, stop words such as “的”, “a”, and “the” are removed. English text additionally undergoes lemmatization and stemming, in the same way as the English text was processed during sentiment dictionary construction.
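The preprocessing steps can be sketched with the standard library alone. Several things here are assumptions made to keep the example self-contained: the input is taken to be already segmented (Chinese words separated by spaces, as a segmenter such as ICTCLAS would produce), the stop-word list is a tiny sample, and `normalize_english` is a crude suffix stripper that merely stands in for the Lancaster stemmer and WordNet Lemmatizer named in the text.

```python
import string

STOP_WORDS = {"的", "了", "a", "an", "the", "that"}            # tiny sample list
SUFFIXES = ("iness", "ily", "ness", "ing", "ly", "ed", "s")    # stand-in rules

def normalize_english(token):
    """Crude suffix stripping; a placeholder for a real stemmer/lemmatizer."""
    for suffix in SUFFIXES:
        if token.endswith(suffix) and len(token) - len(suffix) >= 3:
            stem = token[: -len(suffix)]
            # 'happiness' and 'happily' both map back to 'happy'.
            return stem + "y" if suffix in ("iness", "ily") else stem
    return token

def preprocess(segmented_text):
    """Split on whitespace, drop punctuation and stop words, normalize English."""
    tokens = []
    for raw in segmented_text.split():
        token = raw.strip(string.punctuation + "，。！？、").lower()
        if not token or token in STOP_WORDS:
            continue
        if token.isascii():               # English tokens only
            token = normalize_english(token)
        tokens.append(token)
    return tokens

sample = preprocess("今天 的 心情 very happy , feeling the happiness !")
```

Note that stripping punctuation this way would also destroy text emoticons, so a fuller version would need to match dictionary emoticons before this step.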

3. Feature space representation of the text:

Let D = {d₁, d₂, ..., d_n} denote the set of all microblog message texts. Each microblog message d_i can be represented by a five-dimensional vector, each component of which is the number of words in the message that belong to the corresponding emotion category.
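As a small illustration of this representation, the sketch below counts dictionary hits per category. The toy lexicon and the names `CATEGORIES` and `to_feature_vector` are hypothetical; the real dictionary contains several thousand entries.

```python
CATEGORIES = ("social care", "happiness", "sadness", "anger", "fear")

# Toy bilingual lexicon for illustration only.
TOY_LEXICON = {
    "social care": {"thank", "感谢"},
    "happiness": {"happy", "高兴"},
    "sadness": {"sad", "难过"},
    "anger": {"angry"},
    "fear": {"afraid"},
}

def to_feature_vector(tokens, lexicon):
    """Count, for each category in order, the tokens found in its word set."""
    return [sum(1 for t in tokens if t in lexicon[category]) for category in CATEGORIES]

vector = to_feature_vector(
    ["今天", "心情", "very", "happy", "feel", "happy", "难过"], TOY_LEXICON
)
```

Repeated words are counted each time they occur, so the vector reflects how strongly each emotion is expressed rather than only whether it appears.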

4. Emotion recognition on microblog text using sentiment classification models:

Many existing methods and models can perform sentiment analysis on text; here we describe only two multi-category sentiment analysis models for microblog text. The semi-supervised Gaussian mixture model sentiment classification algorithm requires iterative training of the model, whereas the K-nearest neighbor sentiment classification algorithm based on symmetric relative entropy needs only the labeled text vectors as input and requires no learning.

· Semi-supervised Gaussian mixture sentiment classifier model:

The flow of the sentiment classification algorithm of the semi-supervised Gaussian mixture model is shown in Fig. 3. Gaussian mixture models usually use the expectation-maximization (EM) algorithm for parameter estimation. Training a Gaussian mixture model (GMM) means estimating the probability density distribution of the samples; the estimated model is a weighted sum of several Gaussian models, each of which represents one class. In this embodiment, five Gaussian models are used for the five emotion categories. Learning the Gaussian mixture model is the process of maximum likelihood estimation of the probability density of each Gaussian model and of the weights (π_i). π_i depends on the proportion of texts of each emotion category in the training data and changes during iteration as test-set samples are added; because the five emotion categories are equally represented in the training set, the initial values are all 1/5.
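For reference, the weighted sum described above corresponds to the standard Gaussian mixture density. The explicit form below is the textbook definition rather than a formula quoted from the patent, written with the symbols φ, θ_k, and π_k used in this section, and with u_j denoting a five-dimensional text vector:

$p(u_j) = \sum_{k=1}^{5} \pi_k \, \varphi(u_j \mid \theta_k), \qquad \sum_{k=1}^{5} \pi_k = 1, \quad \pi_k \ge 0$

With equal class proportions in the training set, each π_k starts at 1/5, as stated above.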

The semi-supervised Gaussian mixture model is a self-training algorithm. In this embodiment, the labeled microblog texts are therefore manually divided into two parts, a training set and a test set, so that in each training iteration texts are selected from the test set and added to the training set, achieving self-training. In each iteration, the probability values of each test-set text under the Gaussian models of the five emotion categories are compared, and the category with the highest probability is taken as the emotion category of that text. From all texts whose predicted category is correct, the text with the highest probability value is selected and added to the training set, and the Gaussian mixture model is then relearned on the new training set. The algorithm stops when the classification performance of the retrained model on the test set shows no difference, or a negligible difference, from that before the iteration, or when the test set is empty.

φ(u_j|θ_k) is the Gaussian probability density function; θ_k denotes the model parameters obtained by training, comprising the mean μ_k and the variance of each Gaussian model. The invention first learns the Gaussian mixture model from labeled microblog message texts, taking the labeled text vectors in the training set as input. It then takes these model parameters and the probability distribution of the labeled samples as the initial parameter values of the Gaussian mixture model and learns the existing model iteratively; the final output is the means and variances of the five Gaussian models, representing the five emotion categories. With the learned model parameters, an unlabeled text vector can be input, the five output probability values compared, and the text assigned to the emotion category with the highest probability.
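The self-training loop described above can be sketched in plain Python. This is a simplified illustration under several stated assumptions, not the patent's implementation: each class is modeled by one Gaussian with a diagonal covariance, parameters are fitted in closed form from the labeled vectors instead of by EM, a variance floor of 0.25 is added to avoid division by zero on tiny sets, and the performance-based stopping test is replaced by a `max_iter` cap plus the empty-pool condition. All function names are hypothetical.

```python
import math

def fit(train):
    """train: dict label -> list of vectors. Returns label -> (weight, means, variances)."""
    n_total = sum(len(vecs) for vecs in train.values())
    model = {}
    for label, vecs in train.items():
        dims = range(len(vecs[0]))
        means = [sum(v[d] for v in vecs) / len(vecs) for d in dims]
        # Variance floor of 0.25 is an assumption of this sketch.
        variances = [max(sum((v[d] - means[d]) ** 2 for v in vecs) / len(vecs), 0.25)
                     for d in dims]
        model[label] = (len(vecs) / n_total, means, variances)
    return model

def log_score(model, label, x):
    """log(pi_k * phi(x | theta_k)) for a diagonal Gaussian component."""
    weight, means, variances = model[label]
    score = math.log(weight)
    for xd, mean, var in zip(x, means, variances):
        score -= 0.5 * (math.log(2 * math.pi * var) + (xd - mean) ** 2 / var)
    return score

def predict(model, x):
    return max(model, key=lambda label: log_score(model, label, x))

def self_train(train, pool, max_iter=100):
    """pool: list of (vector, label) pairs held out as the 'test set'."""
    train = {label: list(vecs) for label, vecs in train.items()}
    pool = list(pool)
    model = fit(train)
    for _ in range(max_iter):
        # Correctly classified pool texts, scored under their own class.
        correct = [(log_score(model, y, x), i)
                   for i, (x, y) in enumerate(pool) if predict(model, x) == y]
        if not correct:                    # pool empty or nothing classified correctly
            break
        _, best = max(correct)
        x, y = pool.pop(best)              # move the most confident text
        train[y].append(x)
        model = fit(train)                 # relearn on the enlarged training set
    return model

# Toy data with two of the five categories, for brevity.
toy_train = {
    "happiness": [[0, 3, 0, 0, 0], [0, 2, 0, 0, 0]],
    "sadness": [[0, 0, 3, 0, 0], [0, 0, 2, 1, 0]],
}
toy_pool = [([0, 4, 0, 0, 0], "happiness"), ([0, 0, 4, 0, 0], "sadness")]
toy_model = self_train(toy_train, toy_pool)
```

Working in log space avoids numerical underflow when the five per-dimension densities are multiplied together.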

· K-nearest neighbor sentiment classifier model based on symmetric relative entropy:

In the K-nearest neighbor classification algorithm, the category of a sample is the category to which most of its k most similar samples, that is, its k nearest neighbors in the feature space, belong. The classification decision therefore depends only on the categories of the one or several nearest samples.

The choice of k, the distance measure, and the classification decision rule are the three basic elements of the K-nearest neighbor algorithm. The present invention uses relative entropy to measure the sentiment similarity of texts. Relative entropy is an asymmetric measure of the difference between two probability distributions X and Y over the same event space, denoted D(X||Y). Because it is defined on probability distributions, the text vector representation is normalized; the normalized text vector is denoted T_i.

The distance between microblog message texts T_i and T_j is defined as follows:

$D(T_i \| T_j) = \sum_{k=1}^{5} \omega_{ik} \log_2 \frac{\omega_{ik}}{\omega_{jk}}$

ω_ik and ω_jk denote the k-th components of T_i and T_j respectively, where k is an integer between 1 and 5. Because relative entropy is asymmetric, when it measures the difference between probability distributions X and Y, X represents the true distribution of the data and Y represents an approximation of X. Accordingly, when the distance between texts is computed, T_i is the normalized vector representation of a labeled text and T_j is the normalized vector representation of an unlabeled text. This asymmetric form, however, ignores the case in which X is treated as the approximation of Y. To remove the asymmetry of the conventional relative entropy, the symmetric relative entropy is defined as follows:

The input consists of the normalized vector representations of the labeled microblog texts that form the training set, together with the normalized vector representations of the unlabeled microblog texts. The distance between an unlabeled text vector and each labeled microblog text in the training set, that is, the similarity between microblog texts, can be computed with either the asymmetric or the symmetric relative entropy distance defined in the model above. The emotion that accounts for the largest share of the K nearest neighbors of the text vector in the training set is selected as the emotion category of that text.
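The classifier described above can be sketched in a few lines. This is an illustrative sketch only: the symmetric form used here (the average of the two directed divergences) is an assumption, as is the smoothing constant `EPS` that keeps zero counts from producing a division by zero; the source text does not state how zero components are handled. All function names are hypothetical.

```python
import math
from collections import Counter

EPS = 1e-9  # assumed smoothing constant for zero counts

def normalize(counts):
    """Turn a five-dimensional count vector into a probability distribution."""
    smoothed = [c + EPS for c in counts]
    total = sum(smoothed)
    return [c / total for c in smoothed]

def relative_entropy(t_i, t_j):
    """D(T_i || T_j) = sum_k w_ik * log2(w_ik / w_jk)."""
    return sum(p * math.log2(p / q) for p, q in zip(t_i, t_j))

def symmetric_relative_entropy(t_i, t_j):
    """Assumed symmetrization: mean of the two directed divergences."""
    return 0.5 * (relative_entropy(t_i, t_j) + relative_entropy(t_j, t_i))

def knn_classify(query, labeled, k=3, distance=symmetric_relative_entropy):
    """Label a count vector by majority vote among its k nearest labeled texts.

    labeled is a list of (count_vector, label) pairs. Following the text,
    the labeled vector plays the role of T_i and the query that of T_j.
    """
    q = normalize(query)
    nearest = sorted(labeled,
                     key=lambda item: distance(normalize(item[0]), q))[:k]
    votes = Counter(label for _, label in nearest)
    return votes.most_common(1)[0][0]
```

Passing `distance=relative_entropy` switches to the asymmetric variant, which makes it easy to compare the two measures on the same data.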

Figure 6 shows a multi-category sentiment analysis system for bilingual microblog text according to an embodiment of the present invention. It comprises a Chinese-English bilingual sentiment dictionary, a corpus preprocessing module, a corpus text feature space representation module, and a sentiment classifier recognition module. The Chinese-English bilingual sentiment dictionary is built with the bilingual sentiment dictionary construction method described above. The corpus preprocessing module performs word segmentation and stop-word removal on the corpus text to be analyzed, and additionally performs word-form normalization on English text. The corpus text feature space representation module vectorizes the text output by the preprocessing module, turning each text into a five-dimensional vector whose five elements are the numbers of words in the text that appear in the bilingual sentiment dictionary under the social care, happiness, sadness, anger, and fear categories respectively. The sentiment classifier recognition module applies the sentiment classifier model to the text vector to determine the emotion category of the corpus text.

When a bilingual microblog text is input to the multi-category sentiment analysis system of this embodiment, the system outputs the emotion category of the text: output A means the microblog belongs to the social care category, B to the happiness category, C to the sadness category, D to the anger category, and E to the fear category.
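The feature space representation and the output coding described above can be illustrated with a short sketch. The lexicon below is a hypothetical toy example, not the dictionary built by the invention, and the input is assumed to be already segmented, stop-word filtered, and (for English) normalized.

```python
CATEGORIES = ["social care", "happiness", "sadness", "anger", "fear"]
OUTPUT_CODES = ["A", "B", "C", "D", "E"]

# Hypothetical toy bilingual lexicon mapping a word to its emotion category.
TOY_LEXICON = {
    "help": "social care",
    "happy": "happiness",
    "开心": "happiness",
    "sad": "sadness",
    "angry": "anger",
    "害怕": "fear",
}

def vectorize(tokens, lexicon=TOY_LEXICON):
    """Count, per category, the tokens that appear in the sentiment lexicon."""
    counts = [0] * len(CATEGORIES)
    for token in tokens:
        category = lexicon.get(token)
        if category is not None:
            counts[CATEGORIES.index(category)] += 1
    return counts

def output_code(category):
    """Map an emotion category to the system's output letter A to E."""
    return OUTPUT_CODES[CATEGORIES.index(category)]
```

For example, `vectorize(["今天", "开心", "happy", "sad"])` counts two happiness words and one sadness word, and `output_code("fear")` yields the letter for the fear category.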

Evaluation metrics:

Students working in natural language research were invited to manually label texts crawled through the API provided by Sina with one of the five emotion categories. Part of the labeled texts serves as the training set and part as the test set. After the model is trained on the training set, its classification accuracy on the microblog texts of the test set is used to evaluate the model.
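For reference, accuracy and the F1 values reported in the experiments can be computed as sketched below. The text does not say whether F1 is averaged per class or pooled over all texts; the macro average used here is an assumption, and the function names are hypothetical.

```python
def accuracy(gold, pred):
    """Fraction of texts whose predicted category equals the gold label."""
    return sum(g == p for g, p in zip(gold, pred)) / len(gold)

def f1_for_class(gold, pred, label):
    """F1 of one category: harmonic mean of its precision and recall."""
    tp = sum(g == label and p == label for g, p in zip(gold, pred))
    fp = sum(g != label and p == label for g, p in zip(gold, pred))
    fn = sum(g == label and p != label for g, p in zip(gold, pred))
    if tp == 0:
        return 0.0
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)

def macro_f1(gold, pred, labels):
    """Assumed aggregation: unweighted mean of the per-category F1 values."""
    return sum(f1_for_class(gold, pred, label) for label in labels) / len(labels)
```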

Data set:

For the comparison of machine learning algorithms, the experimental data are 7170 Chinese microblog texts crawled through the Sina API. Twenty-five students working in natural language research were invited to manually label the texts with one of the five emotion categories, and the emotion category of each text is the one chosen by the majority of annotators. The distribution of the corpus across the emotion categories is shown in Table 1:

Table 1. Distribution of microblog texts across the 5 emotion categories

For the bilingual microblog sentiment classification experiment, we similarly used the Sina API to crawl 7000 bilingual microblog texts and invited 25 students working in natural language research to manually label the texts with one of the five emotion categories. The distribution of the corpus across the emotion categories is shown in Table 2:

Table 2. Distribution of microblog texts across the 5 emotion categories
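The majority-based labeling used for both data sets amounts to taking the most frequent annotation per text. A minimal sketch, with a hypothetical function name; how ties among annotators were resolved is not stated in the text, so this sketch simply keeps the tied category that was seen first.

```python
from collections import Counter

def majority_label(votes):
    """Return the category chosen by the most annotators.

    Assumption: ties fall to the category encountered first in votes.
    """
    return Counter(votes).most_common(1)[0][0]
```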

Experimental results:

From the 7170 Chinese microblogs we selected 3170 as the test set (see Table 1): 500 texts expressing social care, 1300 expressing happiness, 540 expressing sadness, 510 expressing anger, and 320 expressing fear. The training set consists of between 1000 and 4000 microblogs drawn from the remaining 4000.

We first compare the K-nearest neighbor classification algorithm based on asymmetric relative entropy with the one based on symmetric relative entropy. The results are shown in Table 3:

Table 3. Accuracy of the K-nearest neighbor classification algorithm with different distance measures under different training set sizes

The results show that the K-nearest neighbor classification algorithm based on symmetric relative entropy is slightly more accurate. In the subsequent comparison of machine learning classification algorithms, we therefore include only the K-nearest neighbor algorithm based on symmetric relative entropy.

We compare the majority vote algorithm (Majority vote), the support vector machine (SVM), and the K-nearest neighbor classification algorithm based on cosine distance (KNN-Cosine) with the two algorithms proposed in the present invention: the semi-supervised Gaussian mixture model classification algorithm (Semi-GMM) and the K-nearest neighbor algorithm based on symmetric relative entropy (KNN-KL). The comparison is shown in Figure 4. With a training set of 4000 texts, KNN-KL achieves the highest accuracy, 85.1%. With the same K-nearest neighbor algorithm, measuring text distance by symmetric relative entropy classifies better than measuring it by cosine distance.

When the training set contains fewer than 3000 texts, Semi-GMM performs better. As the number of training texts decreases, Semi-GMM is more stable than KNN-Cosine and KNN-KL: when the training set shrinks to 1000 texts, the accuracy of KNN-KL drops by 8.9%, whereas that of Semi-GMM drops by only 2.9%. This further confirms that Semi-GMM is better suited to small training sets, while KNN, a fully supervised learning algorithm, is easily swayed by the chosen number of neighbors k, which affects its classification performance.

SVM is mostly used for binary classification. Although multi-class support vector classifiers exist, their accuracy depends largely on the quality of the training data. SVM also has high complexity and is not well suited to large-scale classification problems. Because obtaining large-scale, high-quality training data for text classification is too costly, we consider Semi-GMM the more suitable choice for sentiment classification when the set of unlabeled texts is very large.

Table 4 shows the accuracy of KNN-KL on the five emotion categories for different training set sizes. When the training set shrinks to 1000 texts, classification of microblog texts expressing social care and sadness deteriorates sharply. Compared with the results for a training set of 4000 texts, 79 texts expressing social care were misclassified: 64 as happiness, 11 as sadness, and 4 as anger. Of the texts expressing sadness, 60 were misclassified: 8 as social care, 28 as happiness, 13 as anger, and 11 as fear.

Table 4. Text classification accuracy of Semi-GMM and KNN-KL under different training set sizes

Table 5 shows the F1 values of Semi-GMM and KNN-KL for different training set sizes, which further confirms the classification advantage of Semi-GMM on small training sets.

Table 5. Text classification F1 values of Semi-GMM and KNN-KL under different training set sizes

In the bilingual microblog sentiment classification experiment, we selected 3000 of the 7000 bilingual microblogs as the test set (see Table 2); the training set consists of between 1000 and 4000 microblogs drawn from the remaining 4000.

We compare two algorithms that recognize sentiment words with the Chinese sentiment dictionary only, namely the semi-supervised Gaussian mixture model classification algorithm (Semi-GMM(Ch.)) and the K-nearest neighbor algorithm based on symmetric relative entropy (KNN-KL(Ch.)), with algorithms that recognize sentiment words with the combined Chinese and English sentiment dictionaries: the majority vote algorithm (Majority vote(Ch.+Eng.)), SVM(Ch.+Eng.), the K-nearest neighbor classification algorithm based on cosine distance (KNN-Cosine(Ch.+Eng.)), and the two algorithms proposed in the present invention, Semi-GMM(Ch.+Eng.) and KNN-KL(Ch.+Eng.). The comparison is shown in Figure 5. The text sentiment classification algorithms that use the combined Chinese and English sentiment dictionaries are clearly more accurate than those that use the Chinese sentiment dictionary alone, which further confirms the effectiveness of the bilingual sentiment dictionary we built. When the training set shrinks to 1000 microblog texts, Semi-GMM(Ch.+Eng.) achieves the highest classification accuracy, 68.3%.

Table 6. Text classification accuracy of Semi-GMM and KNN-KL under different training set sizes

Table 7. Text classification F1 values of Semi-GMM and KNN-KL under different training set sizes

Tables 6 and 7 give the accuracy and F1 values of Semi-GMM and KNN-KL for five-category emotion recognition under different training set sizes. When the training set shrinks to 1000 texts, the F1 value of Semi-GMM exceeds that of KNN-KL. This further confirms that the presence of more than one language in a text does not affect the stability of Semi-GMM, and that Semi-GMM has the classification advantage on small training sets.

The bilingual dictionary-based multi-category microblog sentiment analysis method proposed by the present invention is therefore of considerable practical value.

To illustrate the content and implementation of the present invention, this specification gives a specific embodiment. The details introduced in the embodiment are not intended to limit the scope of the claims but to aid understanding of the method of the invention. Those skilled in the art will appreciate that various modifications, changes, or substitutions to the steps of the preferred embodiment are possible without departing from the spirit and scope of the invention and its appended claims. The invention should therefore not be limited to what is disclosed in the preferred embodiment and the drawings.

Claims

1. A method for constructing a Chinese-English bilingual sentiment dictionary, characterized in that it comprises the following steps:

Step 1: crawl microblog web pages, collect from them Chinese and English corpus texts with emotional tendency, and extract from the corpus high-frequency words with emotional tendency and add them to the sentiment dictionary;

Step 2: expand said sentiment dictionary using existing knowledge bases;

Step 3: analyze the crawled microblog corpus and add newly emerging network language and emoticons to said sentiment dictionary.

2. The method for constructing a Chinese-English bilingual sentiment dictionary according to claim 1, characterized in that: said emotional tendency comprises five categories: social care, happiness, sadness, anger, and fear.

3. The method for constructing a Chinese-English bilingual sentiment dictionary according to claim 1, characterized in that: the expansion in step 2 computes, for each sentiment word in each knowledge base, its average similarity to the words of each emotional tendency in the sentiment dictionary, and adds the sentiment word to the emotional tendency category with the highest similarity; said knowledge bases comprise WordNet, NTUSD, and HowNet.

4. The method for constructing a Chinese-English bilingual sentiment dictionary according to any one of claims 1-3, characterized in that: the emotional tendency of said network language and emoticons is classified by a show of hands among multiple people.

5. A multi-category sentiment analysis method based on a bilingual dictionary, the method comprising the following steps:

Step 1: preprocess the corpus text;

Step 2: represent said corpus text in the feature space according to said Chinese-English bilingual sentiment dictionary;

Step 3: classify the emotion of the corpus text with the established multi-category text sentiment classification model.

6. The multi-category microblog sentiment analysis method based on a bilingual dictionary according to claim 5, characterized in that: said preprocessing further comprises word segmentation and stop-word removal, and, for English text, word-form normalization.

7. The multi-category microblog sentiment analysis method based on a bilingual dictionary according to claim 5, characterized in that: said text feature space representation expresses each text in the corpus as a five-dimensional vector, each element of which is the number of sentiment words of the corresponding category of said Chinese-English bilingual sentiment dictionary contained in the text.

8. The multi-category microblog sentiment analysis method based on a bilingual dictionary according to claim 5, characterized in that:

Said multi-category sentiment classification model is the semi-supervised Gaussian mixture model classification algorithm or the K-nearest neighbor algorithm based on symmetric relative entropy;

Said semi-supervised Gaussian mixture model classification algorithm learns Gaussian mixture models from the labeled training corpus and then, taking the resulting model parameters and the probability distribution of the labeled samples as the initial parameter values of the Gaussian mixture models, learns iteratively on the test corpus until the algorithm converges or the unlabeled set is empty;

Said K-nearest neighbor algorithm based on symmetric relative entropy uses relative entropy as the distance between texts to measure text sentiment similarity, and determines the category of the sample to be classified from the categories of its neighboring samples.

9. The multi-category microblog sentiment analysis method based on a bilingual dictionary according to claim 8, characterized in that: said relative entropy is calculated with the following formula:

where T_i is the normalized vector representation of a labeled text, T_j is the normalized vector representation of an unlabeled text, ω_ik and ω_jk denote the k-th components of T_i and T_j respectively, and k is an integer between 1 and 5.

10. A multi-category sentiment analysis system for bilingual microblog text, characterized in that it comprises four modules: a Chinese-English bilingual sentiment dictionary, a corpus preprocessing module, a corpus text feature space representation module, and a sentiment classifier recognition module;

The Chinese-English bilingual sentiment dictionary is built with the method for constructing a Chinese-English bilingual sentiment dictionary according to claim 1;

The corpus preprocessing module performs word segmentation and stop-word removal on the corpus text to be analyzed, and additionally performs word-form normalization on English text;

The corpus text feature space representation module vectorizes the text output by the corpus preprocessing module, turning the text into a five-dimensional vector whose five elements are the numbers of words in the text that appear in said Chinese-English bilingual sentiment dictionary under the social care, happiness, sadness, anger, and fear categories respectively;

The sentiment classifier recognition module applies the sentiment classifier model according to claim 8 to the corpus text vector to determine the emotion category of the corpus text.