CN106547866B - Fine-grained sentiment classification method based on random emotional word co-occurrence networks - Google Patents


Info

Publication number: CN106547866B
Application number: CN201610936655A
Authority: CN
Grant status: Grant
Other languages: Chinese (zh)
Other versions: CN106547866A
Inventors: 马力, 刘锋, 李培, 白琳, 宫玉龙, 杨琳
Original Assignee: 西安邮电大学 (Xi'an University of Posts and Telecommunications)


Abstract

A fine-grained sentiment classification method based on a random co-occurrence network of sentiment words. Using random network theory and the word co-occurrence phenomenon, texts annotated with the affective lexicon ontology form a word-order-based random network model built from sentiment features, i.e., the sentiment-word co-occurrence network model. On this basis the model is reduced, and the Sentiment Word Longest Match method is combined with the TC algorithm for unsupervised SWLM-TC classification, or further combined with an HMM machine-learning algorithm to build a fine-grained sentiment classification model used for classification prediction. The invention achieves fine-grained sentiment classification of paragraph-level text, improves the precision of the plain TC algorithm so that classification is more accurate, and, by training HMM models on a sample set classified with SWLM-TC and then classifying the test sample library, improves the automation of plain machine-learning algorithms.

Description

Fine-grained sentiment classification method based on a random sentiment-word co-occurrence network

Technical Field

[0001] The present invention belongs to the technical field of information retrieval, and in particular relates to a fine-grained sentiment classification method based on a random sentiment-word co-occurrence network.

Background Art

[0002] In recent years, with the rapid development of the economy and of information technology, the Internet has profoundly shaped the development of society and strongly boosted the economy, and Internet users generate vast amounts of information. As the mobile Internet takes hold and smart mobile devices spread, information travels across the Internet at lower cost and higher speed. Different kinds of information have different effects: negative comments influence users negatively, and malicious group messages and public incidents affect not only individual feelings but can even cause enormous economic losses, so mining sentiment information has become a pressing problem. As for the construction of sentiment text corpora, existing corpora include the Pang corpus, the Whissell corpus, the Berardinelli movie-review corpus, and product-review corpora, while annotated Chinese sentiment corpora are comparatively scarce; Tsinghua University annotated a sentiment corpus of tourist-attraction descriptions to assist speech synthesis, but its scale is also small. Internationally, texts from blogs, forums, and online news comments are called new-style texts; such texts on the Web provide the data source for sentiment analysis, and their analysis and processing is becoming a hot topic of current research.
Today, in the information age, the network has become part of everyday life, and sentiment analysis has become an important reference for understanding what Internet users really think; in the emergency management of public events, studying user sentiment through new-style texts on the Web has become a new research direction.

[0003] Research on the polarity of text is already fairly mature and has been quite successful for product reviews and film reviews. Given the complexity of language, differences in individual expression, and the lack of a systematic description of how human emotion forms, fine-grained sentiment analysis remains rare. In its evolution, Chinese developed a free grammar, a large vocabulary, and relatively free forms, among many other traits that set it apart from English for sentiment analysis; the semantic analysis commonly used for English is very difficult to apply to Chinese, which causes many problems. Sentiment analysis is inseparable from psychology: psychological studies have found that the relation between words and human emotion is measurable, and that the semantic orientation of individual words or phrases is important for conveying human emotion. Studies show two main phenomena in the semantic orientation of words and phrases: 1) sentiment terms of the same orientation often co-occur; 2) sentiment terms of opposite orientation generally do not co-occur.
Because of these two phenomena, sentiment analysis can be simplified considerably. Studies have shown that word co-occurrence networks built from ordinary English and Chinese text satisfy the small-world property, and such networks have been used for research on text segmentation and topic extraction; random network models have been applied to text topic analysis and to polarity analysis, but no work has yet been reported that applies random network theory to fine-grained sentiment analysis of text.

Summary of the Invention

[0004] To overcome the above drawbacks of the prior art, the object of the present invention is to provide a fine-grained sentiment classification method based on a random sentiment-word co-occurrence network, which combines Sentiment Word Longest Match (SWLM) with machine-learning algorithms to achieve fine-grained sentiment classification of paragraph-level text.

[0005] To achieve the above object, the technical solution adopted by the present invention is:

[0006] A fine-grained sentiment classification method based on a random sentiment-word co-occurrence network: using random network theory and the word co-occurrence phenomenon, texts annotated with the affective lexicon ontology form a word-order-based random network model built from sentiment features, i.e., the sentiment-word co-occurrence network model. On this basis the model is reduced, and the Sentiment Word Longest Match method (SWLM) is combined with the TC algorithm for unsupervised SWLM-TC classification, or further combined with an HMM machine-learning algorithm to build a fine-grained sentiment classification model that is used for classification prediction.

[0007] The sentiment-word co-occurrence network model is constructed as follows:

[0008] 1) Perform sentence segmentation on each text to obtain an ordered sequence of sentences s1 → s2 → … → sn;

[0009] 2) Segment each sentence si into words, filter out stop words and meaningless content words, and annotate sentiment words with the affective lexicon ontology, obtaining an ordered sequence of sentiment words w1 → w2 → … → wn;

[0010] 3) For each sentence, slide a window of length WL (Word Length, the window size, typically 2) over the sentence and extract word pairs <wi, wj>. If wi ∉ W, add a new node wi to W and set its weight nwi to an initial value of 1; otherwise increment nwi by 1 (wj is handled in the same way). If (wi, wj) ∉ E, add a new edge (wi, wj) to E and set its weight nwi,wj to an initial value of 1; otherwise increment nwi,wj by 1;

[0011] 4) After all texts have been processed, the network model G is complete;

[0012] Here S denotes a sequence composed of multiple sentences; w denotes an extracted sentiment word, w ∈ Σ, where Σ is the Chinese vocabulary set, namely the sentiment ontology word set obtained after removing stop words and meaningless content words and annotating with the affective lexicon ontology; W is the node set of the network model G, W = {wi | i ∈ [1, N]}, where N is the number of nodes of G; E is the edge set of G, with M edges, E = {(wi, wj) | wi, wj ∈ W, and an ordered co-occurrence relation exists between wi and wj}, where (wi, wj) denotes a directed edge from node wi to node wj; Nw is the set of node weights of G, Nw = {nwi | wi ∈ W}; NE is the set of edge weights of G, where nwi,wj denotes the weight of the edge between nodes wi and wj, NE = {nwi,wj | (wi, wj) ∈ E}.
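The construction in steps 1)–4) can be sketched in code. This is an illustrative sketch, not the patented implementation: it assumes sentences are already segmented into token lists and that the affective lexicon ontology is represented simply as a set of sentiment words (a real pipeline would use a Chinese word segmenter and the full ontology annotation).

```python
from collections import defaultdict

def build_cooccurrence_network(texts, sentiment_lexicon, wl=2):
    """Build the directed sentiment-word co-occurrence network G.

    texts: documents, each a list of sentences, each sentence a list of
           already-segmented tokens.
    sentiment_lexicon: set of annotated sentiment words.
    wl: sliding-window length WL (the patent typically uses 2).
    Returns (node_weights Nw, edge_weights NE).
    """
    node_w = defaultdict(int)   # Nw: node weights
    edge_w = defaultdict(int)   # NE: edge weights keyed by ordered pair (wi, wj)
    for doc in texts:
        for sentence in doc:
            # keep only sentiment words, preserving their order in the sentence
            words = [w for w in sentence if w in sentiment_lexicon]
            for i, wi in enumerate(words):
                node_w[wi] += 1
                # extract ordered pairs <wi, wj> inside the WL-length window
                for wj in words[i + 1:i + wl]:
                    edge_w[(wi, wj)] += 1
    return node_w, edge_w
```

With WL = 2, only directly consecutive sentiment words form an edge, matching the tight ordered co-occurrence the model relies on.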

[0013] In the affective lexicon ontology, emotion is divided into 7 major categories and 21 subcategories. The categories are: joy {happiness (PA), relief (PE)}, liking {respect (PD), praise (PH), belief (PG), fondness (PB), wishing (PK)}, anger {anger (NA)}, sadness {sadness (NB), disappointment (NJ), guilt (NH), longing (PF)}, fear {nervousness (NI), fear (NC), shyness (NG)}, disgust {vexation (NE), hatred (ND), reproach (NN), jealousy (NK), suspicion (NL)}, and surprise {surprise (PC)}. Emotional intensity (power) takes five levels, 1, 3, 5, 7, 9, where 9 denotes the strongest intensity and 1 the weakest. The ontology covers 7 part-of-speech types: noun, verb, adjective (adj), adverb (adv), network word (nw), idiom, and prepositional phrase (prep), and contains 27,466 sentiment words in total.

[0014] The network model G is divided into 7 subnetworks according to the seven emotions joy, liking, anger, sadness, fear, disgust, and surprise. If a subnetwork breaks apart during the split, the disconnected sub-blocks are reconnected through the node with the highest weight, yielding seven subnetworks Gx, x ∈ {1, …, 7}, usable for fine-grained computation.

[0015] The Sentiment Word Longest Match method performs longest matching through the maximum-weight sentiment words, so that text can be classified accurately under the relevant emotion topic without disambiguation or noise suppression, and weights are computed over the seven sub-classification models to obtain parameters usable for machine-learning classification.

[0016] The following definitions are used for classification:

[0017] Longest-weight matching path length dmax(S): in a subnetwork Gx, x ∈ {1, 2, 3, 4, 5, 6, 7}, if two sentiment words occur consecutively, the directly connecting edge is used for matching; if there is a network gap between the two words in Gx, the path through the node with the largest weight is chosen. dmax(S) is the length of S, computed as follows:

[0018]

[formula image CN106547866BD00061: dmax(S)]

[0019] where dmax(wi, wi+x) is the maximum-weight matching path from the i-th word to the (i+x)-th word in the network;
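The path selection just described can be sketched as follows. Since the formula itself appears only as an image in the original, this sketch makes two stated assumptions: the path length sums edge weights, and a gap is bridged through a single intermediate node (one hop), chosen by maximum node weight.

```python
def dmax_path_length(words, node_w, edge_w):
    """Longest-weight matching path length over an ordered sentiment-word
    sequence. Consecutive words joined by an edge use that edge directly;
    a gap is bridged through the connecting node with the largest node
    weight (one-hop bridge, for simplicity of the sketch)."""
    total = 0
    for wi, wj in zip(words, words[1:]):
        if (wi, wj) in edge_w:
            total += edge_w[(wi, wj)]          # direct edge match
            continue
        # candidate bridge nodes v with edges wi -> v and v -> wj
        bridges = [v for v in node_w if (wi, v) in edge_w and (v, wj) in edge_w]
        if bridges:
            v = max(bridges, key=lambda n: node_w[n])   # max-weight node
            total += edge_w[(wi, v)] + edge_w[(v, wj)]
    return total
```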

[0020] Sentiment weight coefficient SW (Sentimental weight): in network G, the proportion of sentiment polarity held by each of the seven subnetworks. Using this coefficient makes classes more distinct and reduces classification problems caused by blurred boundaries. Let freq be the recurrence count of a word in the sentiment-word network and P its polarity strength; the computation is:

[formula image CN106547866BD00062: WC, Wy, and SWx]

[0024] where WC is the sentiment value of each word in a subnetwork, Wy is the sentiment value of subnetwork y, and SWx is the SW value of subnetwork x, i.e., its sentiment weight coefficient;
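A sketch of the SW computation, under stated assumptions since the image formula is not reproduced here: the per-word value WC is taken as freq × P, and SWx is the subnetwork value Wy normalized by the total over all subnetworks.

```python
def sentiment_weight_coefficients(subnetworks):
    """Sentiment weight coefficient SWx per emotion subnetwork.

    subnetworks: {emotion: [(freq, power), ...]} listing, for each word in
    the subnetwork, its recurrence count freq and polarity strength P.
    Assumes WC = freq * power per word and SWx = Wy / sum over all Wy."""
    wy = {x: sum(freq * p for freq, p in words)       # Wy per subnetwork
          for x, words in subnetworks.items()}
    total = sum(wy.values())
    return {x: (v / total if total else 0.0) for x, v in wy.items()}
```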

[0025] Classification coefficient CC (Classification coefficient): after the maximum-matching word path is determined, CC is computed from the recurrence degree Re and the emotional intensity power of the words on that path; assuming the path has n words, the computation is:

[formula image CN106547866BD00063: CCi and CC]

[0028] where CCi is the classification coefficient of a single word;
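A sketch of the CC computation. The exact image formula is not recoverable from the text, so this assumes the per-word coefficient CCi = Re × power and that CC averages CCi over the n words on the path.

```python
def classification_coefficient(path_words):
    """Classification coefficient CC of a maximum-matching word path.

    path_words: [(re, power), ...] giving each word's recurrence degree Re
    and emotional intensity power. Assumes CCi = re * power per word and
    CC = average of CCi over the n words on the path."""
    if not path_words:
        return 0.0
    return sum(re * p for re, p in path_words) / len(path_words)
```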

[0029] Classification prediction coefficient CPC (Classification prediction coefficient): the prediction mechanism used when a machine-learning classifier cannot decide the class of a sample. The SWx values are sorted in descending order: if SW1 + SW2 > 80% and SW1/SW2 > 1.5, the sample is assigned to SW1; if SW1 + SW2 > 80% and SW1/SW2 ≤ 1.5, it is assigned under both the SW1 and SW2 attributes; if SW1 + SW2 ≤ 80%, the classification of the text is more complex and the sample is assigned to the corresponding class according to the classification coefficient:

[formula image CN106547866BD00064]
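The CPC decision rules can be written directly as a decision function; a sketch assuming the SW values are fractions summing to 1 and that per-emotion CC values are available as the fallback:

```python
def cpc_predict(sw, cc):
    """Classification prediction coefficient CPC decision rules.

    sw: {emotion: SWx} sentiment weight coefficients (fractions of 1).
    cc: {emotion: CC} classification coefficients, used as the fallback.
    Returns the list of predicted emotion labels."""
    ranked = sorted(sw, key=sw.get, reverse=True)
    top1, top2 = ranked[0], ranked[1]
    if sw[top1] + sw[top2] > 0.8:
        # dominated by one emotion when the ratio exceeds 1.5
        if sw[top2] == 0 or sw[top1] / sw[top2] > 1.5:
            return [top1]
        return [top1, top2]        # ambiguous: keep both attributes
    # complex text: fall back to the classification-coefficient ranking
    return [max(cc, key=cc.get)]
```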

[0031] The SWLM-TC method comprises the following steps:

[0032] 1) Split the article to be classified into sentences, obtaining a sentence sequence s1 → s2 → … → sn;

[0033] 2) Segment each sentence into words in order, remove meaningless content words and particles, annotate with the affective lexicon ontology, and select the marked words in order, i.e., w1 → w2 → … → wn;

[0034] 3) Search the corresponding subnetwork according to the category of each annotated word;

[0035] 4) Select paths for the words in the network: for two adjacent words, use the directly connecting path; for two non-adjacent words, choose, along the paths connecting them, the words on the maximum-weight path, following the steps above to find the maximum-weight path dmax(S);

[0036] 5) Compute the classification coefficient CC on the maximum-weight path dmax(S);

[0037] 6) Compute the classification coefficient CC under each candidate subnetwork and compare their magnitudes. If they are equal, compare the sentiment weight coefficient SW (Sentimental weight), i.e., the weight of the classified emotion among the 7 subnetworks; if they differ, rank by the classification coefficient CC: if the first-ranked weight accounts for eighty percent, assign the text to the corresponding emotion network; if it does not exceed eighty percent, assign the text to the top two emotion networks by weight;

[0038] 7) If the classification still cannot be determined, classify the text by prediction using the classification prediction coefficient CPC.
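Steps 5)–7) reduce to a small decision procedure once the per-emotion CC and SW values have been computed. A sketch, with the 80% threshold taken from step 6:

```python
def swlm_tc_decide(cc, sw):
    """Decision stage of SWLM-TC, given per-emotion classification
    coefficients cc and sentiment weight coefficients sw.

    Rank emotions by CC, breaking CC ties with SW; if the top emotion
    holds at least 80% of the total CC mass it is the sole label,
    otherwise the top two emotions are returned."""
    ranked = sorted(cc, key=lambda e: (cc[e], sw.get(e, 0.0)), reverse=True)
    total = sum(cc.values())
    if total and cc[ranked[0]] / total >= 0.8:
        return ranked[:1]
    return ranked[:2]
```

Texts this procedure still cannot settle would then fall through to the CPC prediction of step 7.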

[0039] The method of building the fine-grained sentiment classification model and using it for classification prediction is:

[0040] 1) Use SWLM-TC to perform fine-grained classification on one part of the sample texts and compute the weight coefficient SWx of the emotion each text in the sample set belongs to; the remaining texts serve as the classification verification experiment;

[0041] 2) For every sample classified with SWLM-TC: compute the classification coefficient CC of each text and classify by step 6 of the SWLM-TC algorithm, adding the sample to the corresponding emotion training set TSx (Train Set); if the classification coefficient cannot decide, predict with step 7 of the SWLM-TC algorithm and assign the sample to the corresponding class;

[0042] 3) After computing the text sentiment of the sample data with the SWLM-TC algorithm, train an HMM classification model with the texts of each class, and then classify with the trained HMM classification model:

[0043] a) Classify the text under test with the HMM algorithm; if it can be classified correctly, the sub-emotion class of the text is determined;

[0044] b) For texts with no classification result, predict the class using the classification prediction coefficient CPC.

[0045] HMM is a machine-learning method. First, the SWLM-TC algorithm performs affective computation on the sample set, yielding a sample library classified into 7 sub-emotion text sets; the sample library is used to train the HMM models, and with the trained HMM models, text-classification tests can then be run on the remaining part of the text library for verification.
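The HMM stage can be illustrated with a minimal discrete HMM scored by the forward algorithm: one model per emotion class, with the test text assigned to the class whose model gives it the highest likelihood. This is a self-contained sketch; training the per-class models on the SWLM-TC-labeled sample sets (e.g. with Baum-Welch) is omitted, and the toy parameters below are invented for illustration.

```python
import math

def forward_log_likelihood(obs, start_p, trans_p, emit_p):
    """Log-likelihood of an integer observation sequence under a discrete
    HMM, computed with the forward algorithm (no scaling, so this sketch
    suits only short sequences)."""
    n = len(start_p)
    alpha = [start_p[s] * emit_p[s][obs[0]] for s in range(n)]
    for o in obs[1:]:
        alpha = [sum(alpha[s] * trans_p[s][t] for s in range(n)) * emit_p[t][o]
                 for t in range(n)]
    total = sum(alpha)
    return math.log(total) if total > 0 else float("-inf")

def hmm_classify(obs, models):
    """models: {emotion: (start_p, trans_p, emit_p)}, one trained HMM per
    emotion class; pick the class whose model scores obs highest."""
    return max(models, key=lambda e: forward_log_likelihood(obs, *models[e]))
```

Texts the models cannot separate would fall back to the CPC prediction, as in step b) above.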

[0046] Compared with the prior art, the beneficial effects of the present invention are:

[0047] 1. The method performs fine-grained classification of text sentiment; unlike traditional polarity computation, it yields a finer-grained classification.

[0048] 2. It improves the precision of the plain TC algorithm, making classification more accurate.

[0049] 3. Training HMM models on a sample set classified with SWLM-TC and then classifying the test sample library improves the automation of plain machine-learning algorithms.

Brief Description of the Drawings

[0050] Fig. 1 is the overall flowchart of the algorithm of the present invention.

[0051] Fig. 2 is the flowchart of the SWLM-TC algorithm of the present invention.

[0052] Fig. 3 is the flowchart of the SWLM-HMM algorithm of the present invention.

[0053] Fig. 4 is a line chart of experimental data for the frequency-based tagging algorithm TC.

[0054] Fig. 5 is a line chart of experimental data for the SWLM-TC heuristic algorithm.

[0055] Fig. 6 is a line chart of experimental data for the SWLM-HMM algorithm.

[0056] Fig. 7 shows the micro-averaged data of the experiments of the present invention.

[0057] Fig. 8 shows the macro-averaged data of the experiments of the present invention.

[0058] Fig. 9 shows the distribution of classified data in the experiments (correctly classified).

[0059] Fig. 10 shows the distribution of classified data in the experiments (misclassified into the class).

[0060] Fig. 11 shows the distribution of classified data in the experiments (belonging to the class but misclassified).

Detailed Description of Embodiments

[0061] Embodiments of the present invention are described in detail below with reference to the accompanying drawings and examples.

[0062] As shown in Fig. 1, the present invention is a fine-grained sentiment classification method based on a random sentiment-word co-occurrence network. First, using random network theory and the word co-occurrence phenomenon, texts annotated with the affective lexicon ontology form a word-order-based random network model built from sentiment features, i.e., the sentiment-word co-occurrence network model. On this basis the model is reduced, and the Sentiment Word Longest Match method (SWLM) is combined with the TC algorithm for unsupervised SWLM-TC classification, or further combined with an HMM machine-learning algorithm to build a fine-grained sentiment classification model used for classification prediction. The details are as follows:

[0063] 1 Sentiment-word co-occurrence model based on random networks

[0064] To facilitate fine-grained study of paragraph-level text and discover the intrinsic regularities among sentiment words, the present invention improves on the method proposed in [YANG Feng, PENG Qin-ke, XU Tao. Sentiment Classification for Comments Based on Random Network Theory. Acta Automatica Sinica, 2010.6, Vol. 36, No. 6] to build a sentiment-word co-occurrence network model suited to fine-grained sentiment analysis.

[0065] 1.1 Affective lexicon ontology

[0066] The Chinese affective lexicon ontology is a Chinese ontology resource compiled and annotated by the Information Retrieval Laboratory of Dalian University of Technology, through the efforts of all its members under the guidance of Professor Lin Hongfei. Its emotion classification system is built on Ekman's internationally influential six-category emotion classification system. On Ekman's basis, the lexicon ontology adds the emotion category "liking" to subdivide positive emotion more finely. The final ontology divides emotion into 7 major categories and 21 subcategories: joy {happiness (PA), relief (PE)}, liking {respect (PD), praise (PH), belief (PG), fondness (PB), wishing (PK)}, anger {anger (NA)}, sadness {sadness (NB), disappointment (NJ), guilt (NH), longing (PF)}, fear {nervousness (NI), fear (NC), shyness (NG)}, disgust {vexation (NE), hatred (ND), reproach (NN), jealousy (NK), suspicion (NL)}, and surprise {surprise (PC)}. Emotional intensity (power) takes five levels, 1, 3, 5, 7, 9, where 9 denotes the strongest intensity and 1 the weakest. The ontology covers 7 part-of-speech types: noun, verb, adjective (adj), adverb (adv), network word (nw), idiom, and prepositional phrase (prep), and contains 27,466 sentiment words in total.

[0067] 1.2 Network model

[0068] After the pioneering work on complex networks begun by the small-world network model introduced by Watts and Strogatz [Watts DJ, Strogatz SH. Collective dynamics of 'small-world' networks. Nature, 1998, 393 (6684): 440-442] and the scale-free network model proposed by Barabási and Albert [Barabási AL, Albert R. Emergence of scaling in random networks. Science, 1999, 286 (5439): 509-512], small-world and scale-free networks, compared with regular and random networks, are known to exhibit a small average path length and a large clustering coefficient. A co-occurrence model network built from the associations between words has the characteristics of a small-world network, so the small average path length and large clustering coefficient of the constructed word-sense network can be exploited for fast fine-grained emotion feature classification.

[0069] 1.3 Sentiment-word co-occurrence network model

[0070] In [Shi Jing, Hu Ming, Dai Guo-Zhong. Topic analysis of Chinese text based on small world model. Journal of Chinese Information Processing, 2007, 21 (3): 69-75], a random network model is built from ordinary co-occurrence relations between words; in [YANG Feng, PENG Qin-ke, XU Tao. Sentiment Classification for Comments Based on Random Network Theory. Acta Automatica Sinica, 2010.6, Vol. 36, No. 6], a random network model is built from ordered co-occurrence relations between words. The latter model creates an ordered word co-occurrence network incrementally for short news comments and, combined with the SCP algorithm, is very beneficial for short comments lacking a large word network; that algorithm is well suited to polarity computation, but not to finer-grained emotion classification.

[0071] To enable fine-grained sentiment analysis, the present invention builds an ordered co-occurrence random network model from sentiment words. The order in which sentiment words co-occur reflects semantic information about them, such as preceding and following modification, and the co-occurrence distance between sentiment words is strongly related to their semantic relation. The present invention models the tight ordered co-occurrence relations between sentiment words, i.e., the window length WL of the co-occurrence region is small (typically 2), and the order of word co-occurrence is taken into account.

[0072] After the sentiment-word co-occurrence network is built, the large network model is divided into seven small sentiment-word networks according to the seven emotion categories before further operations are performed.

[0073] To describe the construction of the sentiment-word co-occurrence network model, the following mathematical definitions are used:

[0074] Σ: the Chinese vocabulary set; the vocabulary set used in the present invention is the sentiment ontology word set obtained after removing stop words and meaningless content words and annotating with the affective lexicon ontology;

[0075] w: an extracted sentiment word, w ∈ Σ;

[0076] S: a sequence composed of multiple sentences;

[0077] N: the number of nodes of G;

[0078] M: the number of edges of G;

[0079] W = {wi | i ∈ [1, N]}: the node set of G;

[0080] E = {(wi, wj) | wi, wj ∈ W, and an ordered co-occurrence relation exists between wi and wj}: the edge set of G, where (wi, wj) denotes a directed edge from node wi to node wj;

[0081] Nw = {nwi | wi ∈ W}: the node weights of G;

[0082] NE = {nwi,wj | (wi, wj) ∈ E}: the edge weights of G, where nwi,wj denotes the weight of the edge between nodes wi and wj.

[0083] The sentiment-word co-occurrence network model G is constructed as follows:

[0084] 1) Perform clause segmentation on each text to obtain an ordered sequence of sentences S1 → S2 → … → Sn;

[0085] 2) Segment each sentence Si into words, filter out stop words and meaningless content words, and annotate the sentiment words using the sentiment vocabulary ontology, obtaining an ordered sequence of sentiment words w1 → w2 → … → wn;

[0086] 3) For each sentence Si, use a sliding window of WL positions (WL is typically 2) to extract word pairs <wi, wj> from the sentence. If wi ∉ W, add a new node wi to W and set its weight nwi to an initial value of 1; otherwise increment nwi by 1 (wj is handled analogously). If (wi, wj) ∉ E, add a new edge (wi, wj) to E and set its weight nwi,wj to an initial value of 1; otherwise increment nwi,wj by 1.

[0087] 4) After all texts have been processed, the network model G is complete.

[0088] 5) Split the network model G into 7 subnetworks according to the seven emotions (joy, like, anger, sorrow, fear, disgust, surprise). If a subnetwork breaks apart during splitting, connect the detached sub-block through its highest-weight node. This completes the seven subnetworks (G1, G2, G3, G4, G5, G6, G7) used for fine-grained computation.
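Steps 3) and 4) above can be sketched in Python. This is a minimal sketch assuming WL = 2 (adjacent ordered pairs) and sentences that are already segmented, filtered, and ontology-annotated; the function and variable names are illustrative, not from the patent:

```python
from collections import defaultdict

def build_cooccurrence_network(sentences):
    """Count node and directed-edge weights over adjacent ordered
    sentiment-word pairs (sliding window WL = 2).  `sentences` holds,
    per sentence, the ordered sentiment words that survive stop-word
    filtering and sentiment-ontology annotation."""
    node_w = defaultdict(int)   # Nw: node weights
    edge_w = defaultdict(int)   # NE: directed edge weights
    for words in sentences:
        for wi, wj in zip(words, words[1:]):  # ordered pairs <wi, wj>
            node_w[wi] += 1                   # first extraction sets 1,
            node_w[wj] += 1                   # later extractions increment
            edge_w[(wi, wj)] += 1
    return node_w, edge_w

# Example with two annotated "sentences" (placeholder words)
nw, ew = build_cooccurrence_network([["happy", "praise", "love"],
                                     ["happy", "praise"]])
```

Splitting into the seven emotion subnetworks would then partition `node_w` and `edge_w` by each word's ontology category.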

[0089] 2 Fine-grained sentiment feature classification for text

[0090] Earlier research focused on the sentiment polarity of text. As work in this area has deepened, the value and applications of fine-grained analysis have become apparent. Fine-grained analysis and polarity analysis have different emphases: fine-grained analysis is a multi-class problem, whereas polarity analysis only needs to compute the orientation of the text. The manually annotated lexicons also differ: polarity research only needs to label each word's orientation, while the fine-grained lexicon, the sentiment vocabulary ontology, annotates each sentiment word with its part of speech, intensity, polarity and other related features. The HMM machine learning algorithm is used in combination for fine-grained sentiment classification.

[0091] The sentiment-word longest match classification method (SWLM) performs longest matching through the maximum-weight sentiment words, so that texts can be classified fairly accurately under the relevant emotion topic without disambiguation or noise suppression. Weights computed through the seven small classification models then yield the parameters that the HMM uses for machine-learning classification.

[0092] The present invention makes the following definitions:

[0093] Definition 1 (longest weight-matching path length dmax(S))

[0094] In the network Gx, if two sentiment words occur consecutively, they are matched along the directly connecting edge; if the two sentiment words are separated in the network Gx, the path is chosen by matching through the maximum-weight nodes. This gives the length of the longest weight-matching path S, computed as follows:

[0095]

Figure CN106547866BD00101

[0096] where dmax(wi, wi+x) is the maximum-weight matching path from the i-th word to the (i+x)-th word in the network.
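The path selection in Definition 1 can be sketched as follows. The patent gives the exact search only as a formula image, so the greedy walk below, which always steps to the highest-weight successor as the definition suggests, is an illustrative assumption rather than the patented procedure:

```python
def dmax_path(graph, node_w, src, dst, max_hops=10):
    """Greedy sketch of the maximum-weight matching path between two
    sentiment words.  `graph` maps each node to its successor set and
    `node_w` holds node weights; directly connected words use their
    edge, otherwise we step through the highest-weight successor,
    stopping at `dst` or after `max_hops`."""
    if dst in graph.get(src, ()):        # directly connected: use that edge
        return [src, dst]
    path = [src]
    cur = src
    for _ in range(max_hops):
        nxt = graph.get(cur, set())
        if not nxt:
            return None                  # dead end, no matching path
        if dst in nxt:
            path.append(dst)
            return path
        cur = max(nxt, key=lambda n: node_w.get(n, 0))  # max-weight node
        if cur in path:                  # avoid cycles
            return None
        path.append(cur)
    return None
```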

[0097] Definition 2 (sentiment weight coefficient SW (Sentimental weight))

[0098] In the semantic network G, SW is the share of sentiment polarity held by each of the seven subnetworks. Using this coefficient makes the classification more distinct and reduces classification errors caused by blurred class boundaries. Let freq denote the number of recurrences of a word in the sentiment-word network and P its polarity intensity. The computation is as follows:

Figure CN106547866BD00102

[0102] where WC is the sentiment value of each word in a subnetwork, Wy is the sentiment value of the subnetwork, and SWx is the SW value of subnetwork X, i.e. the sentiment weight coefficient.

[0103] 2.1 SWLM-TC unsupervised learning classification using the TC algorithm

[0104] Definition 3 (classification coefficient CC (Classification coefficient))

[0105] The classification coefficient is defined for classification with the SWLM-TC unsupervised algorithm. After the maximum matching-word path has been determined, each word on this path has a recurrence degree Re and a sentiment intensity power. Assuming there are n words, the computation is as follows:

Figure CN106547866BD00103

[0108] where CCi is the classification coefficient of a single word.

[0109] Definition 4 (classification prediction coefficient CPC (Classification prediction coefficient))

[0110] The classification prediction coefficient is the prediction mechanism adopted when a machine learning algorithm cannot determine a sample's class. Sort the subnetworks by SWx: if SW1 + SW2 ≥ 80% and SW1/SW2 > 1.5, assign the text to SW1; if SW1 + SW2 ≥ 80% but SW1/SW2 ≤ 1.5, assign it under both SW1 and SW2; if SW1 + SW2 < 80%, the article's classification is more complex, and it is assigned to the corresponding class according to the classification coefficients:

Figure CN106547866BD00111
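Definitions 3 and 4 can be sketched as follows. The per-word coefficient CCi = Re × power is stated in the claims; summing CCi over the path is an assumption, since the source gives the path-level aggregation and the SW formula only as images:

```python
def classification_coefficient(path_words):
    """Definition 3 sketch: each word on the maximum matching path
    contributes CC_i = Re * power; the path-level CC is taken here as
    their sum (an assumption, the aggregation formula is an image in
    the source).  `path_words` is a list of (Re, power) pairs."""
    return sum(re * power for re, power in path_words)

def predict_class(sw):
    """Definition 4 (CPC): decide from the sorted sentiment weight
    coefficients SWx of the subnetworks.  Returns the predicted class
    list, or None when classification must fall back to the CC
    coefficients."""
    ranked = sorted(sw.items(), key=lambda kv: kv[1], reverse=True)
    (c1, s1), (c2, s2) = ranked[0], ranked[1]
    if s1 + s2 >= 0.80:
        if s2 and s1 / s2 > 1.5:
            return [c1]          # one dominant emotion
        return [c1, c2]          # ambiguous between the top two
    return None                  # classify by CC instead
```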

[0112] The sentiment words of a paragraph-level text trace the main emotional thread of the article. Through the use of the co-occurrence random network this emotional thread is preserved, so the sequential random co-occurrence network performs well.

[0113] Referring to FIG. 2, the processing steps of the TC algorithm with sentiment-word annotation based on SWLM are as follows:

[0114] 1) Split the article to be classified into clauses S1 → … → Sk;

[0115] 2) Segment each sentence in order into words, remove meaningless content words and particles, annotate with the sentiment vocabulary ontology, and select the labeled words in order, i.e. w1 → w2 → …;

[0116] 3) Search the corresponding network according to the home classes of the labeled words;

[0117] 4) Select paths for the words in the network: (I) if two words are adjacent, use the directly connecting path; (II) if two words are not adjacent, choose the connecting path that passes through the maximum-weight words, finding the maximum-weight path dmax(S) by the steps above;

[0118] 5) Compute the classification coefficient CC over the maximum-weight path dmax(S); the computation is detailed in Definition 3;

[0119] 6) Compute the classification coefficient CC under each home subnetwork and compare them: (I) if the coefficients are equal, use CC × SW; (II) if the coefficients differ, go to step (III); (III) rank by the final CC values: if the top class carries eighty percent of the weight, assign the text to the corresponding emotion network; if it does not reach eighty percent, assign the text to the top two classes under the weight coefficient CC.

[0120] 7) If the classification still cannot be determined, perform classification prediction on the text to be classified according to Definition 4.
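Step 6's comparison can be sketched as below. The source wording for the eighty-percent rule is garbled, so this reading, assign the single top emotion when its CC share reaches 80% of the total and otherwise keep the top two classes, is a hedged interpretation:

```python
def assign_by_cc(cc):
    """Step 6 sketch: compare the classification coefficients computed
    under each subnetwork and pick the home class(es).  `cc` maps each
    emotion class to its path-level CC value."""
    total = sum(cc.values())
    ranked = sorted(cc.items(), key=lambda kv: kv[1], reverse=True)
    top_cls, top_val = ranked[0]
    if total and top_val / total >= 0.80:
        return [top_cls]                 # one dominant emotion network
    return [c for c, _ in ranked[:2]]    # keep the top two classes
```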

[0121] 2.2 SWLM-HMM classification algorithm based on supervised machine learning

[0122] Machine learning plays a large role in text classification, and the HMM algorithm performs very well in NLP. Because the HMM algorithm is simple and computationally cheap and can be trained on sample sequences of variable length, it is applied here to learning fine-grained sentiment classification, improving the accuracy of SWLM-HMM classification.

[0123] When classifying with SWLM-HMM, the HMM cannot be trained directly on the corpus; processing the corpus with SWLM first and then training with the HMM algorithm improves classification accuracy and speeds up classification.

[0124] Referring to FIG. 3, the corpus is trained as follows:

[0125] (1) Take part of the texts in the sample library and classify them at fine granularity with SWLM-TC, following the SWLM-TC classification process described above, and compute the sentiment weight coefficient SWx of the emotion each sample belongs to;

[0126] (2) For all samples classified with SWLM-TC: compute the classification coefficient CC of each text and classify according to step 6 of the SWLM-TC algorithm, adding the sample to the corresponding sentiment training set TSx (Train Set) of class X; if the classification coefficients cannot decide, predict with step 7 of the SWLM-TC algorithm and assign the sample to the corresponding class;

[0127] (3) After the sampled portion of the library has been classified into sub-emotions with the SWLM-TC algorithm, use the resulting training texts to train the HMM classification model per class. In the classified texts the sentiment words are already annotated and form word chains in text order; during HMM training, the sentiment-word string of each text and its assigned sub-emotion are passed as parameters to the HMM model, and all samples are fed into the HMM algorithm for training;

[0128] a) Classify the remaining texts in the sample library with the HMM algorithm; if a text can be classified correctly, perform the corresponding classification computation.

[0129] b) For texts with no classification result, perform classification prediction using Definition 4.
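The per-class training loop can be sketched as below. A smoothed first-order Markov chain over sentiment-word strings stands in here for the full HMM (hidden states are omitted for brevity), and all names are illustrative:

```python
import math
from collections import defaultdict

class ClassSequenceModel:
    """Per-emotion sequence model for the SWLM-HMM stage: a Laplace-
    smoothed first-order Markov chain over annotated sentiment words."""
    def __init__(self):
        self.trans = defaultdict(lambda: defaultdict(int))  # bigram counts
        self.vocab = set()

    def train(self, sequences):
        for seq in sequences:
            for a, b in zip(seq, seq[1:]):
                self.trans[a][b] += 1
                self.vocab.update((a, b))

    def log_likelihood(self, seq):
        v = len(self.vocab) + 1
        ll = 0.0
        for a, b in zip(seq, seq[1:]):
            row = self.trans[a]
            count = row[b]
            ll += math.log((count + 1) / (sum(row.values()) + v))  # smoothed
        return ll

def classify(models, seq):
    """Assign the emotion whose model scores the word string highest."""
    return max(models, key=lambda c: models[c].log_likelihood(seq))
```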

[0130] 3 Fine-grained sentiment feature classification experiments

[0131] 3.1 Classification data

[0132] The experimental data combine collected blog data with the NLP&CC2014 data of the CCF Conference on Natural Language Processing and Chinese Computing evaluation. 7000 Weibo posts were crawled; 4000 blog entries among them were merged with 2000 posts on similar topics from the NLP&CC2014 data set. The final corpus contains about 6000 annotated samples, all of them emotion-bearing Weibo posts; emotionless Weibo and blog data were discarded. The data are composed as follows:

[0133] 1) TrainDataNet: the 6000 Weibo posts;

[0134] 2) TrainDataHMM: 5000 of the 6000 Weibo posts, sampled so as to contain Weibo data for the 7 emotions.

[0135] 3) TrainDataTest: the remaining 1000 posts not in TrainDataHMM.

[0136] Table 1 Data distribution

Figure CN106547866BD00121

[0139] Precision and recall are the two measures most frequently used in information retrieval and statistical classification to evaluate the quality of results.
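These measures, together with the F1 value used throughout the tables below, can be computed from the per-class counts (a generic sketch, not the patent's own code):

```python
def prf(tp, fp, fn):
    """Precision, recall and F1 for one emotion class, from true
    positives, false positives and false negatives."""
    p = tp / (tp + fp) if tp + fp else 0.0
    r = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * p * r / (p + r) if p + r else 0.0
    return p, r, f1
```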

[0140] The experiments use the collected data and the Chinese polarity evaluation data set of the Chinese NLP&CC2014. After system testing, the experimental results are as follows.

[0141] 3.2 Classification results

[0142] A. Raw experimental data

[0143] 1) Sentiment classification experiment with SWLM-TC

[0144] SWLM-TC is used to verify the effectiveness of the algorithm, with precision, recall and F-measure quantifying the effectiveness of the classification.

[0145] The seven classification results are shown in Table 2:

[0146] Table 2 SWLM-TC classification results

[0147]

Figure CN106547866BD00122

Figure CN106547866BD00131

[0148] 2) Sentiment classification experiment with SWLM-HMM

[0149] The experimental results of the SWLM-HMM algorithm are shown in Table 3

[0150] Table 3 SWLM-HMM classification results

[0151]

Figure CN106547866BD00132

[0153] 3) Sentiment classification experiment with the TC algorithm

[0154] The verification test results of the TC algorithm are shown in Table 4 below

[0155] Table 4 TC classification results

[0156]

Figure CN106547866BD00133

[0157] B. Precision, recall and F1 values

[0158] Computing over the experimental data above, the precision, recall and F1 values obtained are as follows:

[0159] 1) Sentiment classification experiment with SWLM-TC

[0160] The P, R and F1 values of the SWLM-TC algorithm are shown in Table 5 below

[0161] Table 5 P, R and F1 values of SWLM-TC

[0162]

Figure CN106547866BD00134

Figure CN106547866BD00141

[0163] 2) Sentiment classification experiment with SWLM-HMM

[0164] The P, R and F1 values of the SWLM-HMM algorithm are shown in Table 6 below

[0165] Table 6 P, R and F1 values of SWLM-HMM

[0166]

Figure CN106547866BD00142

[0168] 3) Sentiment classification experiment with the TC algorithm

[0169] The P, R and F1 values of the TC algorithm are shown in Table 7 below

[0170] Table 7 P, R and F1 values of the TC algorithm

[0171]

Figure CN106547866BD00143

[0172] C. Macro-average and micro-average

[0173] The macro-averaged and micro-averaged P, R and F1 values of each algorithm are shown in Table 8 below

[0174] Table 8 Macro-average and micro-average of each algorithm

[0175]

Figure CN106547866BD00144
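The difference between the two averaging schemes in Table 8 can be illustrated for precision (a generic sketch, not the patent's own code): the macro-average weights all seven classes equally, while the micro-average pools the counts first, so large classes dominate.

```python
def macro_micro(per_class):
    """Macro vs micro averaging of precision over emotion classes.
    `per_class` is a list of (tp, fp) pairs, one per class."""
    macro = sum(tp / (tp + fp) for tp, fp in per_class) / len(per_class)
    tp_sum = sum(tp for tp, _ in per_class)
    fp_sum = sum(fp for _, fp in per_class)
    micro = tp_sum / (tp_sum + fp_sum)
    return macro, micro
```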

[0176] Line charts comparing the experimental data of the frequency-based labeling algorithm TC, the SWLM-TC heuristic algorithm, and the SWLM-HMM algorithm are shown in FIG. 4, FIG. 5 and FIG. 6.

[0177] FIG. 4, FIG. 5 and FIG. 6 show the results of analysis using the same sentiment lexicon. Comparing the basic TC sentiment-analysis algorithm with the SWLM-TC and SWLM-HMM algorithms gives the following results:

[0178] 1) In precision, SWLM-HMM > SWLM-TC > TC. The precision of the TC algorithm over the 7 granularities ranges from 52.42% to 54.54%, that of SWLM-TC from 60.12% to 65.73%, and that of SWLM-HMM from 69.60% to 73.21%. Thus the TC algorithm trails SWLM-TC and SWLM-HMM by whole percentage points, and SWLM-TC trails SWLM-HMM.

[0179] 2) In recall, SWLM-HMM > SWLM-TC > TC. The recall of the TC algorithm over the 7 granularities ranges from 60.14% to 62.09%, that of SWLM-TC from 71.31% to 74.23%, and that of SWLM-HMM from 79.39% to 83.88%. The TC algorithm trails SWLM-TC and SWLM-HMM by whole percentage points, and SWLM-TC trails SWLM-HMM.

[0180] 3) In F1, SWLM-HMM > SWLM-TC > TC. On all the evaluation criteria, SWLM-TC and SWLM-HMM outperform the TC algorithm. The F1 of the TC algorithm over the 7 granularities ranges from 56.43% to 57.66%, that of SWLM-TC from 66.10% to 69.72%, and that of SWLM-HMM from 75.26% to 77.90%. The TC algorithm trails SWLM-TC and SWLM-HMM by whole percentage points, and SWLM-TC trails SWLM-HMM.

[0181] The comparison above shows that, in precision, recall and F1 alike, the SWLM-TC and SWLM-HMM algorithms both outperform the TC algorithm, demonstrating that the traditional TC algorithm alone performs relatively poorly at fine-grained computation.

[0182] The micro-averaged and macro-averaged data are shown in FIG. 7 and FIG. 8.

[0183] The macro-average and micro-average line charts show that SWLM-HMM and SWLM-TC outperform the TC algorithm, and that of the two, SWLM-HMM outperforms SWLM-TC. The three algorithms differ on the same data set by roughly 10%.

[0184] From the data-set distribution, bar charts were produced; the distributions of the three algorithms over the classifications are shown in FIG. 9, FIG. 10 and FIG. 11:

[0185] FIG. 9, FIG. 10 and FIG. 11 show that, among correct classifications, the TC algorithm correctly classifies the fewest texts and misassigns relatively many texts into each class; it also has the most texts of a given class misassigned elsewhere. SWLM-TC and SWLM-HMM lead the TC algorithm on the corresponding data.

[0186] 3.3 Experimental conclusions

[0187] From the raw experimental data and the computed precision, recall and F1 values, the following observations and conclusions can be drawn:

[0188] 1) The SWLM-TC algorithm is 7.7% to 13.31% more accurate than the traditional TC algorithm, and the SWLM-HMM algorithm is 9.48% to 13.09% more accurate than SWLM-TC. The data show that the SWLM-X algorithms are more accurate than the traditional algorithm because, before classification, SWLM-TC and SWLM-HMM pass through the sentiment-word random co-occurrence network stage; the role the co-occurrence word network plays in this invention is thus borne out, validating the original motivation for the algorithms based on the sentiment-word co-occurrence network. Comparing SWLM-TC with SWLM-HMM, the precision of SWLM-TC is lower because SWLM-HMM first trains on the training set produced by SWLM-TC and then classifies with the trained model, and additionally performs sentiment prediction on emotionally ambiguous texts; these two strategies let SWLM-HMM naturally surpass SWLM-TC and TC in precision.

[0189] The sentiment-word co-occurrence network compensates for missing sentiment in the text and highlights high-weight sentiment words; on both counts it gives the SWLM-X algorithms an advantage in fine-grained affective computing.

[0190] 2) In recall, the SWLM-TC algorithm exceeds the traditional TC algorithm by 11.17% to 14.09%, and the SWLM-HMM algorithm exceeds SWLM-TC by 8.08% to 12.57%. This shows that, when classifying a given class, SWLM-TC and SWLM-HMM classify correctly more often than the traditional algorithm, because the proposed algorithms and framework identify the sentiment words specific to that class more efficiently and accurately. The proposed algorithms are better at highlighting important sentiment words than traditional algorithms, and this invention exploits the sentiment polarity from the sentiment vocabulary ontology: multiplying co-occurrence counts by polarity makes strongly polarized sentiment words more prominent, reducing interference from other sentiment words and making classification more accurate.

[0191] 3) In the F1 comparison, the SWLM-TC algorithm exceeds the traditional algorithm by 9.67% to 13.29%, and the SWLM-HMM algorithm exceeds SWLM-TC by 9.16% to 11.8%. In the overall evaluation, the proposed SWLM-TC and SWLM-HMM algorithms are both better than the traditional algorithm; in this respect, the proposed algorithms exceed the traditional fine-grained sentiment classification algorithms in overall performance.

[0192] 4) The experimental data show that the proposed computing framework and algorithms perform well at classification, but fine-grained computation differs greatly from sentiment-polarity analysis: it requires not only that the algorithm classify at fine granularity but also that it resist interference well. For the sentiment words appearing in an article, a pure labeling approach is prone to interference and causes unnecessary multi-class confusion, whereas the sentiment orientation within an article is relatively stable. In the general case, sentiment polarity and co-occurrence are important properties that ordinary classification algorithms do not handle, so the proposed algorithms have an advantage on such texts. On ordinary text with strongly polarized sentiment words and weak co-occurrence, the proposed algorithm behaves like labeling-based algorithms; the difference is that this invention applies a prediction mechanism to emotionally ambiguous texts, handling them somewhat better than approaches without prediction. This invention adopts the SWLM algorithm in the fine-grained computing framework; its processing mechanism draws on knowledge of complex networks and performs well at completing the sentiment words of the texts to be classified. With these mechanisms, the fine-grained computing framework has been experimentally validated for fine-grained computation.

Claims (3)

  1. A fine-grained sentiment classification method based on a sentiment-word random co-occurrence network, which adopts random network theory and exploits word co-occurrence: after annotation with the sentiment-ontology lexicon, a word-order-based random network model built from sentiment features is formed, namely the sentiment-word co-occurrence network model; model reduction is performed on this basis, and the sentiment-word longest match method (SWLM, Sentimental Word Longest Match) is combined with the TC algorithm for SWLM-TC unsupervised learning classification, or the sentiment-word longest match method is further combined with the HMM machine learning algorithm to build a fine-grained sentiment classification model used for classification prediction, wherein the sentiment-word co-occurrence network model is constructed as follows: 1) perform clause segmentation on each text to obtain an ordered sequence of sentences S1 → S2 → … → Sn; 2) segment each sentence Si into words, filter out stop words and meaningless content words, and annotate the sentiment words with the sentiment vocabulary ontology, obtaining an ordered sequence of sentiment words; 3) for each sentence, use a WL-position sliding window to extract word pairs <wi, wj> from the sentence; if wi ∉ W, add a new node wi to W and set its weight nwi to an initial value of 1, otherwise increment nwi by 1; if (wi, wj) ∉ E, add a new edge (wi, wj) to E and set its weight nwi,wj to an initial value of 1, otherwise increment nwi,wj by 1; 4) after all texts have been processed, the network model G is complete; wherein S denotes a sequence composed of multiple sentences; w denotes an extracted sentiment word, w ∈ Σ, Σ being the Chinese lexicon, namely the sentiment-ontology word set obtained by removing stop words and meaningless content words and then annotating with the sentiment vocabulary ontology; W is the node set of the network model G, W = {wi | i ∈ [1, N]}, N being the number of nodes of G; E is the edge set of the network model G, the number of edges of G being M, E = {(wi, wj) | wi, wj ∈ W, and a sequential co-occurrence relation exists between wi and wj}, (wi, wj) denoting a directed edge from node wi to node wj; NW is the set of node weights of the network model G, NW = {nwi | wi ∈ W}; NE is the set of edge weights of the network model G, representing the weight of the edge between nodes wi and wj, NE = {nwi,wj | (wi, wj) ∈ E}; the network model G is divided into 7 subnetworks according to the seven emotions joy, like, anger, sorrow, fear, disgust and surprise; if a break occurs during subnetwork splitting, the detached network sub-block is connected through its highest-weight node, completing the seven subnetworks Gx usable for fine-grained computation; the method is characterized in that classification uses the following definitions: longest weight-matching path length dmax(S): in a network Gx, x ∈ {1,2,3,4,5,6,7}, if two sentiment words occur consecutively, they are matched along the directly connecting edge; if the two sentiment words are separated in the network Gx, the path is chosen by matching through the maximum-weight nodes, giving the length of S, computed as follows:
    Figure CN106547866BC00021
    where dmax(wi, wi+x) is the maximum-weight matching path from the i-th word to the (i+x)-th word in the network; sentiment weight coefficient SW (Sentimental weight): in the network G, the share of sentiment polarity held by each of the seven subnetworks; using this coefficient makes the classification more distinct and reduces classification errors caused by blurred class boundaries; let the number of recurrences of a word in the sentiment-word network be freq and the polarity intensity be P; the computation is as follows:
    Figure CN106547866BC00022
    where WC is the sentiment value of each word in a sub-network, Wy is the sentiment value of the sub-network, and SWx is the SW value of sub-network x, i.e., its sentiment weight coefficient; classification coefficient CC (classification coefficient): once the maximum matching word path has been determined, each word on this path has a recurrence degree Re and an emotional intensity power; assuming the path contains n words, the formula is as follows: CCi = Re × power
    Figure CN106547866BC00031
    where CCi is the classification coefficient of a single word; classification prediction coefficient CPC (classification prediction coefficient): the prediction mechanism applied, when classifying with a machine learning algorithm, to samples whose class cannot be determined directly; the sub-networks are sorted by SWx: if SW1 + SW2 > 80% and SW1/SW2 > 1.5, the sample is assigned to SW1; if SW1 + SW2 > 80% and SW1/SW2 <= 1.5, it is assigned to both SW1 and SW2; if SW1 + SW2 < 80%, the classification of the text is comparatively complex and the sample is assigned to the corresponding category according to the classification coefficients:
    Figure CN106547866BC00032
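The CPC decision rule above can be sketched in code. This is a minimal illustration with my own function and variable names (not from the patent), assuming the SW values of the seven sub-networks are given as fractions of the total:

```python
def cpc_assign(sw, cc=None):
    """Assign a sample to emotion sub-networks following the CPC rules.

    sw: dict mapping sub-network id (1..7) to its SW value (fractions of 1).
    cc: optional dict mapping sub-network id to its classification
        coefficient, used as a fallback when the top two SW values
        together do not dominate.
    """
    # Sort sub-networks by descending SW value.
    ranked = sorted(sw, key=sw.get, reverse=True)
    s1, s2 = ranked[0], ranked[1]
    if sw[s1] + sw[s2] > 0.8:
        # A dominant pair: keep only the leader if it clearly
        # outweighs the runner-up (ratio above 1.5).
        if sw[s1] / sw[s2] > 1.5:
            return [s1]
        return [s1, s2]
    # No dominant pair: fall back to the classification coefficients.
    if cc:
        return [max(cc, key=cc.get)]
    return [s1]
```

For example, `cpc_assign({1: 0.6, 2: 0.3, 3: 0.1})` keeps only sub-network 1, while `cpc_assign({1: 0.45, 2: 0.40, 3: 0.15})` keeps both of the top two.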
  2. The fine-grained sentiment classification method based on random emotional word co-occurrence networks according to claim 1, wherein, in the emotional vocabulary ontology, emotions are divided into 7 major categories and 21 subcategories: joy {happiness (PA), contentment (PE)}, love {respect (PD), praise (PH), trust (PG), liking (PB), wishing (PK)}, anger {anger (NA)}, sorrow {sadness (NB), disappointment (NJ), guilt (NH), longing (PF)}, fear {panic (NI), fear (NC), shame (NG)}, disgust {annoyance (NE), hatred (ND), blame (NN), jealousy (NK), suspicion (NL)}, and surprise {amazement (PC)}; the emotional intensity power takes five levels, 1, 3, 5, 7, and 9, where 9 denotes the greatest intensity and 1 the least; the words in the emotional vocabulary ontology fall into seven part-of-speech classes, namely noun, verb, adjective (adj), adverb (adv), network word (nw), idiom, and prepositional phrase (prep); the ontology contains 27,466 emotional words in total.
  3. The fine-grained sentiment classification method based on random emotional word co-occurrence networks according to claim 1, wherein the longest-match method for emotional words performs the longest match through the maximum-weight emotional-word vocabulary, so that texts can be accurately classified under the relevant emotional topics without disambiguation or noise-suppression processing, and weight calculations are carried out through the seven small classification models to obtain parameters that can be used for machine learning classification.
CN 201610936655 2016-10-24 2016-10-24 Fine-grained sentiment classification method based on random emotional word co-occurrence networks CN106547866B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN 201610936655 CN106547866B (en) 2016-10-24 2016-10-24 Fine-grained sentiment classification method based on random emotional word co-occurrence networks

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN 201610936655 CN106547866B (en) 2016-10-24 2016-10-24 Fine-grained sentiment classification method based on random emotional word co-occurrence networks

Publications (2)

Publication Number Publication Date
CN106547866A (en) 2017-03-29
CN106547866B (en) 2017-12-26

Family

ID=58392940

Family Applications (1)

Application Number Title Priority Date Filing Date
CN 201610936655 CN106547866B (en) 2016-10-24 2016-10-24 Fine-grained sentiment classification method based on random emotional word co-occurrence networks

Country Status (1)

Country Link
CN (1) CN106547866B (en)

Family Cites Families (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US8417713B1 (en) * 2007-12-05 2013-04-09 Google Inc. Sentiment detection as a ranking signal for reviewable entities
CN104899231A (en) * 2014-03-07 2015-09-09 上海市玻森数据科技有限公司 Sentiment analysis engine based on fine-granularity attributive classification

Also Published As

Publication number Publication date Type
CN106547866A (en) 2017-03-29 application

Similar Documents

Publication Publication Date Title
Galley et al. Identifying agreement and disagreement in conversational speech: Use of bayesian networks to model pragmatic dependencies
Nguyen et al. Author age prediction from text using linear regression
Speriosu et al. Twitter polarity classification with label propagation over lexical links and the follower graph
Kharde et al. Sentiment analysis of twitter data: a survey of techniques
Tur et al. What is left to be understood in ATIS?
CN101609459A (en) Extraction system of affective characteristic words
Torres-Moreno Automatic text summarization
CN103150367A (en) Method for analyzing emotional tendency of Chinese microblogs
CN103793503A (en) Opinion mining and classification method based on web texts
CN1952928A Computer system for constructing a natural language base and automatic dialogue retrieval
Zanzotto et al. Linguistic redundancy in twitter
CN103049435A (en) Text fine granularity sentiment analysis method and text fine granularity sentiment analysis device
CN103390051A (en) Topic detection and tracking method based on microblog data
CN102279890A Method for extracting collections of emotional words from microblogs
CN103544242A (en) Microblog-oriented emotion entity searching system
Chen et al. Improving sentiment analysis via sentence type classification using BiLSTM-CRF and CNN
CN103559233A (en) Extraction method for network new words in microblogs and microblog emotion analysis method and system
CN101470732A (en) Auxiliary word stock generation method and apparatus
CN103678278A (en) Chinese text emotion recognition method
Fernández-Gavilanes et al. Unsupervised method for sentiment analysis in online texts
CN102495892A (en) Webpage information extraction method
CN101436206A (en) Tourism request-answer system answer abstracting method based on ontology reasoning
CN103440235A (en) Method and device for identifying text emotion types based on cognitive structure model
US20130024407A1 (en) Text classifier system
CN103559176A (en) Microblog emotional evolution analysis method and system

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination