CN108647191B - A Sentiment Dictionary Construction Method Based on Supervised Sentiment Text and Word Vectors - Google Patents
- Publication number
- CN108647191B CN201810473308.6A
- Authority
- CN
- China
- Prior art keywords
- word
- text
- emotion
- words
- sentiment
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Active
Classifications
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F40/00—Handling natural language data
- G06F40/10—Text processing
- G06F40/103—Formatting, i.e. changing of presentation of documents
- G06F40/117—Tagging; Marking up; Designating a block; Setting of attributes
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F40/00—Handling natural language data
- G06F40/20—Natural language analysis
- G06F40/237—Lexical tools
- G06F40/242—Dictionaries
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F40/00—Handling natural language data
- G06F40/20—Natural language analysis
- G06F40/279—Recognition of textual entities
- G06F40/289—Phrasal analysis, e.g. finite state techniques or chunking
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06N—COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
- G06N3/00—Computing arrangements based on biological models
- G06N3/02—Neural networks
- G06N3/08—Learning methods
Abstract
本发明提出一种基于有监督情感文本和词向量的情感词典构建方法,包括数据处理阶段、词向量情感嵌入阶段、情感词典生成阶段共三个阶段。本方法使用神经网络生成词向量,将情感嵌入到词向量内部,挖掘词与词之间的内在联系,然后构建词关系图,使用标签传播算法传播情感标签,自动构建特定领域的情感词典。通过本发明解决了基于人工和基于知识库的方法所构造的情感词典在处理特定领域的情感分析任务时不准确的问题。
The invention proposes a sentiment dictionary construction method based on supervised sentiment text and word vectors, comprising three stages: a data processing stage, a word vector sentiment embedding stage, and a sentiment dictionary generation stage. The method uses a neural network to generate word vectors, embeds sentiment into the word vectors, mines the intrinsic relations between words, then builds a word relation graph and uses a label propagation algorithm to propagate sentiment labels, automatically constructing a domain-specific sentiment dictionary. The invention solves the problem that sentiment dictionaries constructed manually or from knowledge bases are inaccurate for domain-specific sentiment analysis tasks.
Description
技术领域technical field
本发明涉及情感分析领域,尤其是一种基于有监督情感文本和词向量的情感词典构建方法。The invention relates to the field of sentiment analysis, in particular to a construction method of sentiment dictionary based on supervised sentiment text and word vectors.
背景技术Background technique
随着互联网的飞速发展,诸如微博、贴吧、论坛等各类网络平台的流行,为人们提供了众多公开发声的机会。由此产生的公开的文本数据数量众多、易于获得,且含有巨大的商业和社会价值。为了获取这些文本中人们对事物或事件的情感倾向,情感分析技术便脱颖而出。With the rapid development of the Internet, the popularity of various network platforms such as Weibo, Tieba, and forums has provided people with many opportunities to speak out. The resulting public textual data is plentiful, readily available, and has enormous commercial and social value. Sentiment analysis techniques come to the fore in order to capture the sentimental tendencies of people towards things or events in these texts.
一直以来，情感词典都是情感分析的重要工具。一个优秀的情感词典可以极大地提升情感分析的效果。通常，随着应用领域的改变，词所体现的情感也会相应的改变。因此，在处理特定领域的情感分析任务时，人工整理情感词典变得费时费力，需要一种自动化的方法来构建情感词典。现有的情感词典自动构建方法分为两大类，分别是基于知识库的方法和基于语料库的方法。基于知识库的方法依赖于已有的语义知识库。这些经由人工整理的知识库中记录大量词的释义以及词与词之间的关系(如同义词、反义词)。基于知识库的方法通过这些已有的知识，构建具有高准确率和通用性的情感词典。然而，对于中文而言，整理完备的知识库相对稀缺，因此这种方法不能很好地应用于中文情感词典的构建。同时，这种方法生成的情感词典相对通用，不能很好地解决词语在不同领域情感变化的问题。基于语料库的方法可以用来生成特定领域的情感词典。这类方法对语料文本进行处理，挖掘语料中词与词之间的关系，如连词关系、共现关系等。其通过设置规则或使用统计学上的方法，将联系紧密的词聚集在一起，从而生成情感词典。这一类方法往往只考虑了词在文本中简单的关系，忽略了文本本身的复杂性，如一些复杂的句法以及否定词的影响等等，都会影响这类方法的效果。Sentiment dictionaries have long been an important tool for sentiment analysis: a good sentiment dictionary can greatly improve its results. Usually, as the application domain changes, the sentiment a word conveys changes accordingly. Hence, for domain-specific sentiment analysis tasks, manually compiling a sentiment dictionary is time-consuming and labor-intensive, and an automated construction method is needed. Existing automatic construction methods fall into two categories: knowledge-base-based methods and corpus-based methods. Knowledge-base-based methods rely on existing semantic knowledge bases, which are manually curated and record the definitions of a large number of words and the relations between them (such as synonymy and antonymy). From this existing knowledge they build sentiment dictionaries with high accuracy and generality. For Chinese, however, well-curated knowledge bases are relatively scarce, so this approach does not transfer well to the construction of Chinese sentiment dictionaries; moreover, the dictionaries it produces are relatively generic and cannot capture how a word's sentiment shifts across domains. Corpus-based methods can generate domain-specific sentiment dictionaries: they process corpus text and mine relations between words, such as conjunction and co-occurrence relations, clustering closely related words together by hand-written rules or statistical methods. Such methods usually consider only simple relations between words in the text and ignore the complexity of the text itself, such as complex syntax and the influence of negation words, which degrades their effectiveness.
发明内容SUMMARY OF THE INVENTION
发明目的：本发明针对基于语料库的情感词典自动构建方法的不足，提出一种基于有监督情感文本和词向量的情感词典构建方法，使用神经网络生成词向量，挖掘词与词之间的内在联系，进而生成情感词典。Purpose of the invention: Addressing the shortcomings of corpus-based automatic sentiment dictionary construction methods, the present invention proposes a sentiment dictionary construction method based on supervised sentiment text and word vectors, which uses a neural network to generate word vectors, mines the intrinsic relations between words, and then generates a sentiment dictionary.
技术方案:本发明提出的技术方案为:Technical scheme: The technical scheme proposed by the present invention is:
一种基于有监督情感文本和词向量的情感词典构建方法,包括步骤:A sentiment dictionary construction method based on supervised sentiment text and word vectors, including steps:
(1)获取文本数据集D,文本数据集D中包括具有正面情感标记的正面情感文本和具有负面情感标记的负面情感文本;(1) Obtaining a text dataset D, the text dataset D includes positive sentiment texts with positive sentiment marks and negative sentiment texts with negative sentiment marks;
(2)对文本数据集中的文本进行预处理;构建词汇表V,将预处理后的文本数据集中的词语逐个填入词汇表V中;(2) Preprocess the text in the text data set; construct a vocabulary V, and fill in the words in the preprocessed text data set into the vocabulary V one by one;
(3)采用SO-PMI方法计算词汇表V中各个词语的情感倾向值,根据情感倾向值确定相应词语的情感标记:(3) The SO-PMI method is used to calculate the emotional tendency value of each word in the vocabulary V, and the emotional mark of the corresponding word is determined according to the emotional tendency value:
label_w = 1 if SO-PMI(w) > 0, otherwise label_w = 0
其中，label_w表示词语w的情感标记，SO-PMI(w)表示词语w的情感倾向值；where label_w denotes the sentiment mark of word w, and SO-PMI(w) denotes its sentiment tendency value;
(4)构建具有词语级别监督的改进的skip-gram模型，改进的skip-gram模型以D中的词语为输入数据，预测词语的上下文和情感标记；计算预测上下文时的损失函数losscontext，以及预测情感标记时的损失函数lossword；(4) Construct an improved skip-gram model with word-level supervision, which takes the words in D as input and predicts each word's context and sentiment mark; compute the loss function loss_context for predicting the context and the loss function loss_word for predicting the sentiment mark;
losscontext与lossword的表达式分别为:The expressions of loss context and loss word are:
loss_context(w_t) = −∑_{−k≤j≤k, j≠0} log p(w_{t+j}|w_t)
loss_word(w_t) = −[label_w·log p(pos|w_t) + (1−label_w)·log p(neg|w_t)]
其中,wt表示词语,wt∈D;{wt-k,…,wt-1,wt+1,…,wt+k}表示预测出的上下文词语集合,集合中包括预测出的词语wt的前k个词和后k个词,p(wt+j|wt)表示wt+j被预测为wt的上下文的概率,p(pos|wt)表示wt被预测为具有正面情感标记的概率,p(neg|wt)表示wt被预测为具有负面情感标记的概率;where w t represents a word, w t ∈ D; {w tk ,…,w t-1 ,w t+1 ,…,w t+k } represents a set of predicted context words, and the set includes the predicted words The first k words and the last k words of w t , p(w t+j |w t ) represents the probability of w t+j being predicted as the context of w t , p(pos|w t ) means that w t is predicted is the probability of having a positive sentiment label, p(neg|w t ) represents the probability that wt is predicted to have a negative sentiment label;
(5)构建一个卷积神经网络模型作为文本级监督模型，文本级监督模型以文本数据集D中的文本为输入数据，预测文本的情感标记；计算预测出的情感标记与文本实际情感标记之间的损失函数lossdoc：(5) Construct a convolutional neural network model as the text-level supervision model, which takes the texts in dataset D as input and predicts their sentiment marks; compute the loss function loss_doc between the predicted and the actual sentiment mark of each text:
loss_doc(d_i) = −[label_{d_i}·log p(pos|d_i) + (1−label_{d_i})·log p(neg|d_i)]
其中，di表示文本，di∈D；label_{d_i}表示di的情感标签；p(pos|di)表示di被预测为具有正面情感标记的概率，p(neg|di)表示di被预测为具有负面情感标记的概率；where d_i denotes a text, d_i ∈ D; label_{d_i} is the sentiment label of d_i; p(pos|d_i) is the probability that d_i is predicted to have a positive sentiment mark, and p(neg|d_i) the probability that d_i is predicted to have a negative one;
(6)设置联合损失函数:(6) Set the joint loss function:
loss = α1·loss_context + α2·loss_doc + α3·loss_word
式中,α1、α2、α3分别为losscontext、lossdoc、lossword的权重系数;In the formula, α 1 , α 2 , and α 3 are the weight coefficients of loss context , loss doc , and loss word respectively;
(7)以文本数据集D、词语的情感标记label_w、文本的情感标记label_{d_i}为输入数据，利用反向传播算法训练联合损失函数，得到具有情感嵌入的词向量；(7) Taking the text dataset D, the word sentiment marks label_w, and the text sentiment marks label_{d_i} as input data, train the joint loss function with the back-propagation algorithm to obtain word vectors with sentiment embedding;
(8)根据步骤(7)获得的具有情感嵌入的词向量构建词关系图G;(8) constructing a word relation graph G according to the word vector with emotional embedding obtained in step (7);
(9)选取词关系图G中的部分词语作为种子词，为种子词标注情感标签，情感标签包括褒义、贬义和中性；然后使用标签传播算法将种子词的情感标签在关系图G中传播，生成情感词典。(9) Select some words in the word relation graph G as seed words and annotate them with sentiment labels (positive, negative, or neutral); then use the label propagation algorithm to spread the seed words' labels through G, generating the sentiment dictionary.
进一步的，所述情感倾向值的计算公式为：Further, the sentiment tendency value is calculated as:
SO-PMI(w) = log( p(w|pos) / p(w|neg) )
其中,SO-PMI(w)表示词语w的情感倾向值,pos表示正面情感文本,neg表示负面情感文本,p(w|pos)表示词语w在正面情感文本中出现的概率,p(w|neg)表示词语w在负面情感文本中出现的概率。Among them, SO-PMI(w) represents the sentiment tendency value of the word w, pos represents the positive sentiment text, neg represents the negative sentiment text, p(w|pos) represents the probability that the word w appears in the positive sentiment text, p(w| neg) represents the probability that the word w appears in the negative sentiment text.
进一步的，所述具有词语级别监督的改进的skip-gram模型包括输入层、投影层、输出层。输入层为文本数据集D中的词语wt，投影层将词语wt投影为词向量C(wt)，输出层根据C(wt)分别预测wt的上下文和情感标记label_w。Further, the improved skip-gram model with word-level supervision comprises an input layer, a projection layer, and an output layer. The input layer takes a word w_t from dataset D, the projection layer projects w_t into a word vector C(w_t), and the output layer predicts, from C(w_t), both the context of w_t and its sentiment mark label_w.
进一步的，所述文本级监督模型包括：输入层、卷积层、池化层、全连接层。其中，输入层为文本数据集D中的文本di；卷积层通过特征抽取器从文本di中抽取多个特征向量发送给池化层；池化层通过Max Pooling Over Time操作从特征向量中选取最重要的特征向量输出给全连接层；全连接层根据收到的特征向量，通过softmax函数预测输入文本di的情感标记label_{d_i}。Further, the text-level supervision model comprises an input layer, a convolution layer, a pooling layer, and a fully connected layer. The input layer takes a text d_i from the dataset D; the convolution layer extracts multiple feature vectors from d_i with feature extractors and passes them to the pooling layer; the pooling layer selects the most salient features from these vectors by a Max Pooling Over Time operation and outputs them to the fully connected layer; from the received features, the fully connected layer predicts the sentiment mark label_{d_i} of the input text d_i via the softmax function.
进一步的,所述构建词关系图G的具体步骤包括:Further, the specific steps of constructing the word relation graph G include:
1)对词汇表V,抽取其中的动词、形容词、副词构成新的词汇表V′;1) For vocabulary V, extract the verbs, adjectives and adverbs in it to form a new vocabulary V';
2)构建词关系图G,将V′中的词作为G中的顶点;2) Construct the word relation graph G, and use the words in V' as the vertices in G;
3)对于V′中的每个词wi，计算wi与V′中其他所有词在步骤(7)得到的词向量空间中的欧氏距离，选取欧式距离最近的k个词，在词关系图G中建立wi与这k个词之间的边，边的权重计算公式为：3) For each word w_i in V′, compute the Euclidean distance between w_i and every other word of V′ in the word vector space obtained in step (7), select the k words with the smallest distances, and create edges between w_i and these k words in the word relation graph G, with edge weights computed as:
其中，wij表示词wi和wj之间边的权重，xi、xj分别为词wi和wj的词向量，euclidean_dis(xi,xj)表示xi与xj之间的欧式距离；σ为常数参数，用于控制wij的取值。where w_ij denotes the weight of the edge between words w_i and w_j, x_i and x_j are their word vectors, euclidean_dis(x_i, x_j) is the Euclidean distance between x_i and x_j, and σ is a constant parameter that controls the magnitude of w_ij.
对于和词wi的距离最近的m个词之外的其他词，使wij=0。For all words other than the m words closest to word w_i, set w_ij = 0.
有益效果:与现有技术相比,本发明具有以下优势:Beneficial effect: Compared with the prior art, the present invention has the following advantages:
本发明基于有监督语料集生成情感词典，使用神经网络生成词向量，挖掘词与词之间的内在联系，使用标签传播算法传播情感标签，自动构建特定领域的情感词典。本发明既避免了基于知识库的情感词典构建方法无法用于特定领域情感分析的不足，相比于其他基于语料库的方法又加强了对文本中词的复杂关系的考虑，最终实现情感词典的自动构建。The invention generates a sentiment dictionary from a supervised corpus: it uses a neural network to generate word vectors, mines the intrinsic relations between words, and propagates sentiment labels with a label propagation algorithm to automatically construct a domain-specific sentiment dictionary. It avoids the limitation of knowledge-base-based construction methods, which cannot serve domain-specific sentiment analysis, while, compared with other corpus-based methods, it better accounts for the complex relations between words in the text, finally realizing automatic construction of the sentiment dictionary.
附图说明Description of drawings
图1为本发明的整体流程图;Fig. 1 is the overall flow chart of the present invention;
图2为改进的skip-gram模型的结构图;Fig. 2 is the structure diagram of the improved skip-gram model;
图3为卷积神经网络模型的结构图。Figure 3 is a structural diagram of a convolutional neural network model.
具体实施方式Detailed ways
下面结合附图对本发明作更进一步的说明。The present invention will be further described below in conjunction with the accompanying drawings.
图1所示为本发明的整体流程，本发明主要分为三个阶段：数据处理阶段、词向量情感嵌入阶段、情感词典生成阶段，下面结合附图1至3对各阶段的具体步骤进行详细描述。Figure 1 shows the overall flow of the invention, which is divided into three stages: the data processing stage, the word vector sentiment embedding stage, and the sentiment dictionary generation stage. The specific steps of each stage are described in detail below with reference to Figures 1 to 3.
一、数据处理阶段(步骤1-3):1. Data processing stage (steps 1-3):
步骤1是数据获取，即获取具有情感标签标注的文本数据集D。D中文本的情感标签分为正面和负面，使用标记label_{d_i}表示文本di的情感，其中label_{d_i}=0表示负面情感标签，label_{d_i}=1表示正面情感标签。Step 1 is data acquisition: obtain a text dataset D annotated with sentiment labels. The labels of the texts in D are either positive or negative; the mark label_{d_i} denotes the sentiment of text d_i, where label_{d_i}=0 means a negative sentiment label and label_{d_i}=1 a positive one.
步骤2是数据预处理，先使用开源工具jieba对文本进行分词和词性标注，然后使用停用词表去除文本中的停用词，得到词语序列，根据词语序列构建词汇表V，将预处理后的文本数据集表示为D={d1,d2,…,dn}。Step 2 is data preprocessing: first segment the texts and tag parts of speech with the open-source tool jieba, then remove stop words using a stop-word list to obtain word sequences; build the vocabulary V from these sequences, and denote the preprocessed dataset as D={d1,d2,…,dn}.
步骤3是使用SO-PMI方法计算词的大致情感，SO-PMI方法计算公式为：Step 3 computes each word's approximate sentiment with the SO-PMI method, whose formula is:
SO-PMI(w) = log( p(w|pos) / p(w|neg) )
其中,SO-PMI(w)表示词语w的情感倾向值,pos表示正面情感文本,neg表示负面情感文本,p(w|pos)表示词语w在正面情感文本中出现的概率,p(w|neg)表示词语w在负面情感文本中出现的概率。Among them, SO-PMI(w) represents the sentiment tendency value of the word w, pos represents the positive sentiment text, neg represents the negative sentiment text, p(w|pos) represents the probability that the word w appears in the positive sentiment text, p(w| neg) represents the probability that the word w appears in the negative sentiment text.
定义label_w表示词语w的情感标记：label_w = 1 if SO-PMI(w) > 0, otherwise label_w = 0。Define label_w as the sentiment mark of word w: label_w = 1 when SO-PMI(w) > 0, and label_w = 0 otherwise.
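The data processing stage above (steps 1-3) can be sketched in Python as follows. This is an illustrative sketch only: conditional probabilities are estimated by simple relative frequency, and the `eps` smoothing term is our addition to avoid taking the logarithm of zero (the patent does not specify a smoothing scheme).

```python
import math
from collections import Counter

def so_pmi(word, pos_texts, neg_texts, eps=1e-6):
    # Estimate p(w|pos) and p(w|neg) by relative frequency over the
    # pre-segmented positive and negative texts, then compare in log space.
    pos_counts = Counter(w for t in pos_texts for w in t)
    neg_counts = Counter(w for t in neg_texts for w in t)
    p_pos = pos_counts[word] / max(sum(pos_counts.values()), 1)
    p_neg = neg_counts[word] / max(sum(neg_counts.values()), 1)
    return math.log((p_pos + eps) / (p_neg + eps))

def word_label(word, pos_texts, neg_texts):
    # label_w = 1 (positive) when SO-PMI(w) > 0, otherwise 0 (negative)
    return 1 if so_pmi(word, pos_texts, neg_texts) > 0 else 0

pos_texts = [["good", "great"], ["good", "nice"]]   # toy segmented texts
neg_texts = [["bad", "awful"], ["bad"]]
print(word_label("good", pos_texts, neg_texts))     # → 1
```

A word that appears mostly in positive texts gets SO-PMI(w) > 0 and hence label_w = 1; one seen mostly in negative texts gets label_w = 0.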
二、词向量情感嵌入阶段(步骤4-6):Second, the word vector emotion embedding stage (steps 4-6):
步骤4是词语级监督模型的构建，即构建一个改进的Skip-gram模型，使用词语级别情感监督数据训练词向量。模型由输入层、投影层、输出层组成：输入层为训练数据中的一个词wt，投影层将词wt投影为词向量表示C(wt)，输出层使用C(wt)分别预测wt的上下文和情感标记label_w。Step 4 builds the word-level supervision model: an improved Skip-gram model trained with word-level sentiment supervision. It consists of an input layer, a projection layer, and an output layer: the input layer takes a word w_t from the training data, the projection layer maps w_t to its vector representation C(w_t), and the output layer uses C(w_t) to predict both the context of w_t and its sentiment mark label_w.
预测wt的上下文的损失函数为：The loss function for predicting the context of w_t is:
loss_context(w_t) = −∑_{−k≤j≤k, j≠0} log p(w_{t+j}|w_t)
预测情感标记时的损失函数为：The loss function for predicting the sentiment mark is:
loss_word(w_t) = −[label_w·log p(pos|w_t) + (1−label_w)·log p(neg|w_t)]
其中，wt表示词语，wt∈D；k表示预测上下文的范围；{wt-k,…,wt-1,wt+1,…,wt+k}表示预测出的上下文词语集合，包括词语wt的前k个词和后k个词；p(wt+j|wt)表示wt+j被预测为wt的上下文的概率，p(pos|wt)表示wt被预测为具有正面情感的概率，p(neg|wt)表示wt被预测为具有负面情感的概率。where w_t denotes a word, w_t ∈ D; k is the context window size; {w_{t-k},…,w_{t-1},w_{t+1},…,w_{t+k}} is the set of predicted context words, consisting of the k words before and the k words after w_t; p(w_{t+j}|w_t) is the probability that w_{t+j} is predicted as context of w_t; p(pos|w_t) is the probability that w_t is predicted to carry positive sentiment, and p(neg|w_t) the probability of negative sentiment.
步骤5是文本级监督模型的构建，即构建一个卷积神经网络模型，由输入层、卷积层、池化层、全连接层组成。其中，输入为文本数据集D中的一段文本di，卷积层通过特征抽取器从文本di中抽取多个特征向量，池化层通过Max Pooling Over Time操作从特征向量中抽取最重要的特征作为输出，全连接层通过softmax函数预测输入文本di的情感标记label_{d_i}，其损失函数为：Step 5 builds the text-level supervision model: a convolutional neural network composed of an input layer, a convolution layer, a pooling layer, and a fully connected layer. The input is a text d_i from dataset D; the convolution layer extracts multiple feature vectors from d_i with feature extractors; the pooling layer extracts the most salient features from these vectors by a Max Pooling Over Time operation; and the fully connected layer predicts the sentiment mark label_{d_i} of d_i via the softmax function. Its loss function is:
loss_doc(d_i) = −[label_{d_i}·log p(pos|d_i) + (1−label_{d_i})·log p(neg|d_i)]
其中,p(pos|di)表示di被预测为具有正面情感的概率,p(neg|di)表示di被预测为具有负面情感的概率。where p(pos|d i ) represents the probability that d i is predicted to have positive sentiment, and p(neg|d i ) represents the probability that d i is predicted to have negative sentiment.
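The Max Pooling Over Time operation of step 5 reduces the variable-length output of the convolution layer to a fixed-size vector by keeping, for each filter, its strongest activation over all time steps. A minimal numpy sketch (the filter activations below are made-up numbers, not values from the patent):

```python
import numpy as np

def max_pooling_over_time(feature_maps):
    # feature_maps: (num_filters, time_steps) activations from the conv layer;
    # keep each filter's maximum over time -> fixed-size (num_filters,) vector
    return feature_maps.max(axis=1)

def softmax(z):
    # numerically stable softmax, as used by the fully connected output layer
    z = z - z.max()
    e = np.exp(z)
    return e / e.sum()

maps = np.array([[0.1, 0.9, 0.3],    # filter 1 over 3 time steps
                 [0.5, 0.2, 0.8]])   # filter 2
pooled = max_pooling_over_time(maps)
print(pooled)                         # → [0.9 0.8]
```

The pooled vector is what the fully connected layer turns into the two class probabilities p(pos|d_i) and p(neg|d_i).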
步骤6是联合训练模型，即将词语级监督模型和文本级监督模型组合，设置损失函数为：Step 6 jointly trains the model: the word-level and text-level supervision models are combined, and the loss function is set to:
loss = α1·loss_context + α2·loss_doc + α3·loss_word
式中，α1、α2、α3分别为losscontext、lossdoc、lossword的权重系数，用于分别控制三个损失函数在最终损失函数中的权重。以文本数据集D、词语的情感标记label_w、文本的情感标记label_{d_i}为输入数据，使用随机梯度下降和误差反向传播算法对loss进行优化训练，获得具有情感嵌入的词向量。where α1, α2, α3 are the weight coefficients of loss_context, loss_doc, and loss_word, controlling the contribution of each loss to the final objective. Taking the text dataset D, the word sentiment marks label_w, and the text sentiment marks label_{d_i} as input data, loss is optimized with stochastic gradient descent and error back-propagation to obtain word vectors with sentiment embedding.
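Step 6's joint objective is a weighted sum of the three losses. The sketch below assumes the binary cross-entropy form of loss_word and loss_doc implied by the two-class labels; the α weights of 1.0 are illustrative defaults, not values stated in the patent:

```python
import math

def loss_word(p_pos, label_w):
    # -[label_w·log p(pos|w) + (1-label_w)·log p(neg|w)], with p(neg|w) = 1 - p(pos|w)
    return -(label_w * math.log(p_pos) + (1 - label_w) * math.log(1 - p_pos))

def loss_doc(p_pos, label_d):
    # same binary cross-entropy form, at the text level
    return -(label_d * math.log(p_pos) + (1 - label_d) * math.log(1 - p_pos))

def joint_loss(l_context, l_doc, l_word, a1=1.0, a2=1.0, a3=1.0):
    # loss = α1·loss_context + α2·loss_doc + α3·loss_word
    return a1 * l_context + a2 * l_doc + a3 * l_word

print(joint_loss(1.0, 2.0, 3.0))  # → 6.0
```

In training, each mini-batch would evaluate the three losses on the same word vectors and back-propagate through the combined scalar.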
三、情感词典生成阶段(步骤7-9):3. Sentiment dictionary generation stage (steps 7-9):
步骤7为构建词关系图，其具体步骤为：Step 7 builds the word relation graph; the specific steps are:
1)对词汇表V,抽取其中的动词、形容词、副词构成新的词汇表V′;1) For vocabulary V, extract the verbs, adjectives and adverbs in it to form a new vocabulary V';
2)构建词关系图G,将V′中的词作为G中的顶点;2) Construct the word relation graph G, and use the words in V' as the vertices in G;
3)对于V′中的每个词wi，计算wi与V′中其他所有词在步骤(7)得到的词向量空间中的欧氏距离，选取欧式距离最近的k个词，在词关系图G中建立wi与这k个词之间的边，边的权重计算公式为：3) For each word w_i in V′, compute the Euclidean distance between w_i and every other word of V′ in the word vector space obtained in step (7), select the k words with the smallest distances, and create edges between w_i and these k words in the word relation graph G, with edge weights computed as:
其中，wij表示词wi和wj之间边的权重，xi、xj分别为词wi和wj的词向量，euclidean_dis(xi,xj)表示xi与xj之间的欧式距离；σ为常数参数，用于控制wij的取值。where w_ij denotes the weight of the edge between words w_i and w_j, x_i and x_j are their word vectors, euclidean_dis(x_i, x_j) is the Euclidean distance between x_i and x_j, and σ is a constant parameter that controls the magnitude of w_ij.
对于和词wi的距离最近的m个词之外的其他词，使wij=0。For all words other than the m words closest to word w_i, set w_ij = 0.
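Step 7 can be sketched with numpy as below. Two points are assumptions on our part: the patent's edge-weight formula is not reproduced in this text, so a standard Gaussian kernel exp(−d²/(2σ²)) is used, which matches the stated roles of euclidean_dis and σ; and the k-nearest-neighbour edges are symmetrized into an undirected graph:

```python
import numpy as np

def build_word_graph(X, k=1, sigma=1.0):
    # X: (n_words, dim) sentiment-embedded word vectors from the training stage
    n = X.shape[0]
    dist = np.linalg.norm(X[:, None, :] - X[None, :, :], axis=-1)  # pairwise Euclidean
    W = np.zeros((n, n))
    for i in range(n):
        nearest = np.argsort(dist[i])[1:k + 1]          # k nearest, skipping the word itself
        W[i, nearest] = np.exp(-dist[i, nearest] ** 2 / (2 * sigma ** 2))
    return np.maximum(W, W.T)                           # symmetric; w_ij = 0 elsewhere

X = np.array([[0.0], [1.0], [3.0]])                     # toy 1-D "embeddings"
W = build_word_graph(X, k=1)
print(W[0, 1] > 0, W[0, 2] == 0)                        # → True True
```

Words close in the embedding space get heavy edges; distant pairs keep weight zero, which is what the propagation stage relies on.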
步骤8为使用标签传播算法传播情感标签，具体步骤为：Step 8 propagates sentiment labels with the label propagation algorithm; the specific steps are:
1)人工标注少量种子情感词,即人工标注少量具有褒义、贬义和中性的词语作为种子词;1) Manually mark a small number of seed emotional words, that is, manually mark a small number of words with positive, derogatory and neutral meanings as seed words;
2)定义标签矩阵Y，标签矩阵Y是一个大小为|V′|×3的矩阵，Y中每一行对应词汇表V′中的一个词，三列分别代表该词是褒义、贬义和中性的概率。根据人工标注的种子情感词，对标签矩阵进行初始化；本实施例中采用的初始化方法为：对于1)中人工标注的词语，如果是褒义，对应行初始化为[1,0,0]，如果为贬义，初始化为[0,1,0]，如果是中性，初始化为[0,0,1]；对于没有在1)中人工标注的词语，对应行初始化为[0,0,0]；2) Define the label matrix Y, a |V′|×3 matrix in which each row corresponds to a word of the vocabulary V′ and the three columns give the probabilities that the word is positive, negative, or neutral. Initialize Y from the manually annotated seed sentiment words; the initialization used in this embodiment is: for a word annotated in 1), its row is initialized to [1,0,0] if the word is positive, [0,1,0] if negative, and [0,0,1] if neutral; for words not annotated in 1), the row is initialized to [0,0,0];
3)定义概率转移矩阵T，使得T_ij = w_ij / Σ_k w_kj；3) Define the probability transition matrix T such that T_ij = w_ij / Σ_k w_kj;
4)进行情感标签传播:Y=TY4) Spread emotional labels: Y=TY
5)按照2)中的初始化方式重新初始化标签矩阵Y中的人工标注数据的标签概率分布5) Re-initialize the label probability distribution of the manually labeled data in the label matrix Y according to the initialization method in 2)
6)若标签矩阵Y收敛,则停止迭代,否则转到步骤4)。6) If the label matrix Y converges, stop the iteration, otherwise go to step 4).
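The propagation loop of steps 1)-6) can be sketched as follows. The transition matrix here is row-normalized from W, an assumption on our part since the patent's exact definition of T is not reproduced in this text; seed rows are re-clamped after every iteration exactly as in steps 4)-5):

```python
import numpy as np

def propagate_labels(W, seeds, max_iter=100, tol=1e-6):
    # W: (n, n) edge-weight matrix; seeds: {word_index: column}, where
    # columns 0/1/2 = positive/negative/neutral, as in the label matrix Y
    n = W.shape[0]
    Y = np.zeros((n, 3))
    for i, lab in seeds.items():                    # initialize seed rows
        Y[i, lab] = 1.0
    T = W / np.maximum(W.sum(axis=1, keepdims=True), 1e-12)  # row-normalized
    for _ in range(max_iter):
        Y_new = T @ Y                               # 4) propagate: Y <- T·Y
        for i, lab in seeds.items():                # 5) re-clamp labelled seeds
            Y_new[i] = 0.0
            Y_new[i, lab] = 1.0
        if np.abs(Y_new - Y).max() < tol:           # 6) stop on convergence
            Y = Y_new
            break
        Y = Y_new
    return Y.argmax(axis=1)                         # final polarity per word

W = np.array([[0.0, 2.0, 0.0],
              [2.0, 0.0, 1.0],
              [0.0, 1.0, 0.0]])
labels = propagate_labels(W, {0: 0, 2: 1})          # word 1 follows its heavier neighbour
```

On this toy graph the unlabelled middle word inherits the polarity of the seed it is more strongly connected to, which is the behaviour the dictionary-generation step relies on.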
步骤9为生成情感词，即根据步骤8中标签传播算法的结果，将词汇表V′中的词按照标签矩阵Y中的概率，分为褒义、贬义和中性；整理得到情感词典。Step 9 generates the sentiment words: according to the result of the label propagation in step 8, the words of V′ are classified as positive, negative, or neutral by the probabilities in the label matrix Y, and the sentiment dictionary is compiled.
步骤10为结束。Step 10: end.
图2为步骤4中词语级监督模型的结构图，其具体结构和设置如下：Figure 2 is a structural diagram of the word-level supervision model in step 4; its specific structure and settings are as follows:
1)设置词向量维度为100,上下文窗口的大小为3;初始化词向量矩阵W,其大小为|V|×100,其第i行代表词汇表V中第i个词的词向量;1) Set the word vector dimension to 100 and the size of the context window to 3; initialize the word vector matrix W, whose size is |V|×100, and the i-th row represents the word vector of the i-th word in the vocabulary V;
2)在输入层中,从文本数据集D中选取一词wt作为输入,其表示形式为wt的one-hot表示;2) In the input layer, the word wt is selected from the text data set D as input, and its representation is the one-hot representation of wt ;
3)在投影层中,从词向量矩阵W中输出wt的向量形式C(wt);3) In the projection layer, the vector form C(w t ) of w t is output from the word vector matrix W;
4)在输出层中，使用C(wt)预测词wt的上下文{wt-k,…,wt-1,wt+1,…,wt+k}，其损失函数记为：loss_context(w_t) = −∑_{−k≤j≤k, j≠0} log p(w_{t+j}|w_t)；4) In the output layer, C(w_t) is used to predict the context {w_{t-k},…,w_{t-1},w_{t+1},…,w_{t+k}} of word w_t; its loss function is loss_context(w_t) = −∑_{−k≤j≤k, j≠0} log p(w_{t+j}|w_t);
5)在输出层中，通过softmax函数使用C(wt)预测词wt的情感标记label_w，其损失函数记为：loss_word(w_t) = −[label_w·log p(pos|w_t) + (1−label_w)·log p(neg|w_t)]；5) In the output layer, C(w_t) is fed through the softmax function to predict the sentiment mark label_w of word w_t; its loss function is loss_word(w_t) = −[label_w·log p(pos|w_t) + (1−label_w)·log p(neg|w_t)];
6)结束。6) End.
图3为步骤5中文本级监督模型的结构图,其具体结构和设置如下:Figure 3 is a structural diagram of the text-level supervision model in step 5, and its specific structure and settings are as follows:
1)设置词向量维度m为100,初始化词向量矩阵W。根据文本数据集D中文本的长度设置模型输入的最大文本长度为L;1) Set the word vector dimension m to 100, and initialize the word vector matrix W. Set the maximum text length input by the model to L according to the length of the text in the text dataset D;
2)在输入层中,从文本数据集D中选取一段文本di。从词向量矩阵W中提取出文本di中每个词的词向量,相互连接构成大小为L×m的二维矩阵,将该矩阵作为模型输入;2) In the input layer, select a piece of text d i from the text dataset D. The word vector of each word in the text d i is extracted from the word vector matrix W, connected to each other to form a two-dimensional matrix of size L×m, and the matrix is used as the model input;
3)在卷积层中,使用200个滤波器进行卷积操作,获取特征向量;3) In the convolution layer, use 200 filters to perform convolution operations to obtain feature vectors;
4)在池化层中,使用Max Pooling Over Time操作从特征向量中抽取最重要的特征作为输出;4) In the pooling layer, use the Max Pooling Over Time operation to extract the most important features from the feature vector as output;
5)在全连接层中，使用一个全连接的神经网络层，并通过softmax函数预测输入文本di的情感标记label_{d_i}，其损失函数为：loss_doc(d_i) = −[label_{d_i}·log p(pos|d_i) + (1−label_{d_i})·log p(neg|d_i)]；5) In the fully connected layer, a fully connected neural network layer is used, and the sentiment mark label_{d_i} of the input text d_i is predicted via the softmax function; its loss function is loss_doc(d_i) = −[label_{d_i}·log p(pos|d_i) + (1−label_{d_i})·log p(neg|d_i)];
6)结束。6) End.
综上所述，本发明基于有监督语料集，使用神经网络生成具有情感嵌入的词向量，然后挖掘词与词之间的内在联系，使用标签传播算法传播情感标签，自动构建特定领域的情感词典。本发明既避免了基于知识库的情感词典构建方法无法用于特定领域情感分析的不足，相比于其他基于语料库的方法又加强了对文本中词的复杂关系的考虑，最终实现情感词典的自动构建。In summary, the invention generates word vectors with sentiment embedding from a supervised corpus using a neural network, mines the intrinsic relations between words, and propagates sentiment labels with a label propagation algorithm to automatically construct a domain-specific sentiment dictionary. It avoids the limitation of knowledge-base-based construction methods, which cannot serve domain-specific sentiment analysis, while, compared with other corpus-based methods, it better accounts for the complex relations between words in the text, finally realizing automatic construction of the sentiment dictionary.
以上所述仅是本发明的优选实施方式，应当指出：对于本技术领域的普通技术人员来说，在不脱离本发明原理的前提下，还可以做出若干改进和润饰，这些改进和润饰也应视为本发明的保护范围。The above is only a preferred embodiment of the present invention. It should be pointed out that those of ordinary skill in the art can make several improvements and refinements without departing from the principle of the invention, and these improvements and refinements should also be regarded as falling within the protection scope of the present invention.
Claims (5)
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201810473308.6A CN108647191B (en) | 2018-05-17 | 2018-05-17 | A Sentiment Dictionary Construction Method Based on Supervised Sentiment Text and Word Vectors |
Applications Claiming Priority (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201810473308.6A CN108647191B (en) | 2018-05-17 | 2018-05-17 | A Sentiment Dictionary Construction Method Based on Supervised Sentiment Text and Word Vectors |
Publications (2)
Publication Number | Publication Date |
---|---|
CN108647191A CN108647191A (en) | 2018-10-12 |
CN108647191B true CN108647191B (en) | 2021-06-25 |
Family
ID=63756399
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN201810473308.6A Active CN108647191B (en) | 2018-05-17 | 2018-05-17 | A Sentiment Dictionary Construction Method Based on Supervised Sentiment Text and Word Vectors |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN108647191B (en) |
Families Citing this family (8)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN109902300A (en) * | 2018-12-29 | 2019-06-18 | 深兰科技(上海)有限公司 | A kind of method, apparatus, electronic equipment and storage medium creating dictionary |
CN109885687A (en) * | 2018-12-29 | 2019-06-14 | 深兰科技(上海)有限公司 | A kind of sentiment analysis method, apparatus, electronic equipment and the storage medium of text |
CN110399595B (en) * | 2019-07-31 | 2024-04-05 | 腾讯科技(成都)有限公司 | Text information labeling method and related device |
CN110598207B (en) * | 2019-08-14 | 2020-09-01 | 华南师范大学 | Method, device and storage medium for obtaining word vector |
CN110717047B (en) * | 2019-10-22 | 2022-06-28 | 湖南科技大学 | Web service classification method based on graph convolution neural network |
CN114648015B (en) * | 2022-03-15 | 2022-11-15 | 北京理工大学 | Dependency relationship attention model-based aspect-level emotional word recognition method |
CN114822495B (en) * | 2022-06-29 | 2022-10-14 | 杭州同花顺数据开发有限公司 | Acoustic model training method and device and speech synthesis method |
CN116304028B (en) * | 2023-02-20 | 2023-10-03 | 重庆大学 | False news detection method based on social emotion resonance and relationship graph convolution network |
Family Cites Families (6)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN102663139B (en) * | 2012-05-07 | 2013-04-03 | 苏州大学 | Method and system for constructing emotional dictionary |
CN103544246A (en) * | 2013-10-10 | 2014-01-29 | 清华大学 | Method and system for constructing internet multi-sentiment dictionary |
CN104317965B (en) * | 2014-11-14 | 2018-04-03 | 南京理工大学 | Sentiment dictionary construction method based on language material |
CN107102989B (en) * | 2017-05-24 | 2020-09-29 | 南京大学 | Entity disambiguation method based on word vector and convolutional neural network |
CN107451118A (en) * | 2017-07-21 | 2017-12-08 | 西安电子科技大学 | Sentence-level sensibility classification method based on Weakly supervised deep learning |
CN107609132B (en) * | 2017-09-18 | 2020-03-20 | 杭州电子科技大学 | Semantic ontology base based Chinese text sentiment analysis method |
-
2018
- 2018-05-17 CN CN201810473308.6A patent/CN108647191B/en active Active
Also Published As
Publication number | Publication date |
---|---|
CN108647191A (en) | 2018-10-12 |
Similar Documents
Publication | Publication Date | Title |
---|---|---|
CN108647191B (en) | A Sentiment Dictionary Construction Method Based on Supervised Sentiment Text and Word Vectors | |
CN109948165B (en) | Fine-grained emotion polarity prediction method based on hybrid attention network | |
CN108733792B (en) | An Entity Relationship Extraction Method | |
CN109657239B (en) | Chinese Named Entity Recognition Method Based on Attention Mechanism and Language Model Learning | |
CN106599032B (en) | Text event extraction method combining sparse coding and structure sensing machine | |
CN113591483A (en) | Document-level event argument extraction method based on sequence labeling | |
CN109992780B (en) | Specific target emotion classification method based on deep neural network | |
CN107943784B (en) | Generative Adversarial Network-Based Relation Extraction Method | |
CN110609891A (en) | A Visual Dialogue Generation Method Based on Context-Aware Graph Neural Network | |
CN109711465B (en) | Image subtitle generation method based on MLL and ASCA-FR | |
CN110162749A (en) | Information extracting method, device, computer equipment and computer readable storage medium | |
CN107967318A (en) | A kind of Chinese short text subjective item automatic scoring method and system using LSTM neutral nets | |
CN110287323B (en) | Target-oriented emotion classification method | |
CN111460824B (en) | Unmarked named entity identification method based on anti-migration learning | |
CN107168955A (en) | Word insertion and the Chinese word cutting method of neutral net using word-based context | |
CN110826335A (en) | A method and apparatus for named entity recognition | |
CN110263325A (en) | Chinese automatic word-cut | |
CN110276396B (en) | Image description generation method based on object saliency and cross-modal fusion features | |
CN110852089B (en) | Operation and maintenance project management method based on intelligent word segmentation and deep learning | |
CN110134954A (en) | A Named Entity Recognition Method Based on Attention Mechanism | |
CN110569355B (en) | Viewpoint target extraction and target emotion classification combined method and system based on word blocks | |
CN114417851B (en) | Emotion analysis method based on keyword weighted information | |
CN111145914B (en) | Method and device for determining text entity of lung cancer clinical disease seed bank | |
CN111159405B (en) | Irony detection method based on background knowledge | |
CN113934835B (en) | Retrieval type reply dialogue method and system combining keywords and semantic understanding representation |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
PB01 | Publication | ||
PB01 | Publication | ||
SE01 | Entry into force of request for substantive examination | ||
SE01 | Entry into force of request for substantive examination | ||
GR01 | Patent grant | ||
GR01 | Patent grant |