CN108647191B - A Sentiment Dictionary Construction Method Based on Supervised Sentiment Text and Word Vectors - Google Patents
- Publication number
- CN108647191B CN201810473308.6A
- Authority
- CN
- China
- Prior art keywords
- word
- text
- emotion
- words
- sentiment
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Active
Classifications
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F40/00—Handling natural language data
- G06F40/10—Text processing
- G06F40/103—Formatting, i.e. changing of presentation of documents
- G06F40/117—Tagging; Marking up; Designating a block; Setting of attributes
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F40/00—Handling natural language data
- G06F40/20—Natural language analysis
- G06F40/237—Lexical tools
- G06F40/242—Dictionaries
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F40/00—Handling natural language data
- G06F40/20—Natural language analysis
- G06F40/279—Recognition of textual entities
- G06F40/289—Phrasal analysis, e.g. finite state techniques or chunking
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06N—COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
- G06N3/00—Computing arrangements based on biological models
- G06N3/02—Neural networks
- G06N3/08—Learning methods
Abstract
本发明提出一种基于有监督情感文本和词向量的情感词典构建方法,包括数据处理阶段、词向量情感嵌入阶段、情感词典生成阶段共三个阶段。本方法使用神经网络生成词向量,将情感嵌入到词向量内部,挖掘词与词之间的内在联系,然后构建词关系图,使用标签传播算法传播情感标签,自动构建特定领域的情感词典。通过本发明解决了基于人工和基于知识库的方法所构造的情感词典在处理特定领域的情感分析任务时不准确的问题。
The invention proposes a sentiment dictionary construction method based on supervised sentiment text and word vectors, comprising three stages: a data processing stage, a word vector sentiment embedding stage, and a sentiment dictionary generation stage. The method uses a neural network to generate word vectors, embeds sentiment into the word vectors, mines the intrinsic relations between words, then builds a word relation graph and uses a label propagation algorithm to propagate sentiment labels, automatically constructing a domain-specific sentiment dictionary. The invention solves the problem that sentiment dictionaries constructed manually or from knowledge bases are inaccurate for domain-specific sentiment analysis tasks.
Description
技术领域technical field
本发明涉及情感分析领域,尤其是一种基于有监督情感文本和词向量的情感词典构建方法。The invention relates to the field of sentiment analysis, in particular to a construction method of sentiment dictionary based on supervised sentiment text and word vectors.
背景技术Background technique
随着互联网的飞速发展,诸如微博、贴吧、论坛等各类网络平台的流行,为人们提供了众多公开发声的机会。由此产生的公开的文本数据数量众多、易于获得,且含有巨大的商业和社会价值。为了获取这些文本中人们对事物或事件的情感倾向,情感分析技术便脱颖而出。With the rapid development of the Internet, the popularity of various network platforms such as Weibo, Tieba, and forums has provided people with many opportunities to speak out. The resulting public textual data is plentiful, readily available, and has enormous commercial and social value. Sentiment analysis techniques come to the fore in order to capture the sentimental tendencies of people towards things or events in these texts.
一直以来，情感词典都是情感分析的重要工具。一个优秀的情感词典可以极大地提升情感分析的效果。通常，随着应用领域的改变，词所体现的情感也会相应的改变。因此，在处理特定领域的情感分析任务时，人工整理情感词典变得费时费力，需要一种自动化的方法来构建情感词典。现有的情感词典自动构建方法分为两大类，分别是基于知识库的方法和基于语料库的方法。基于知识库的方法依赖于已有的语义知识库。这些经由人工整理的知识库中记录大量词的释义以及词与词之间的关系(如同义词、反义词)。基于知识库的方法通过这些已有的知识，构建具有高准确率和通用性的情感词典。然而，对于中文而言，整理完备的知识库相对稀缺，因此这种方法不能很好地应用于中文情感词典的构建。同时，这种方法生成的情感词典相对通用，不能很好地解决词语在不同领域情感变化的问题。基于语料库的方法可以用来生成特定领域的情感词典。这类方法对语料文本进行处理，挖掘语料中词与词之间的关系，如连词关系、共现关系等。其通过设置规则或使用统计学上的方法，将联系紧密的词聚集在一起，从而生成情感词典。这一类方法往往只考虑了词在文本中简单的关系，忽略了文本本身的复杂性，如一些复杂的句法以及否定词的影响等等，都会影响这类方法的效果。Sentiment dictionaries have long been an important tool for sentiment analysis: a good sentiment dictionary can greatly improve its results. Usually, as the application domain changes, the sentiment a word conveys changes accordingly. Hence, for domain-specific sentiment analysis tasks, manually compiling a sentiment dictionary is time-consuming and labor-intensive, and an automated construction method is needed. Existing automatic construction methods fall into two categories: knowledge-base-based methods and corpus-based methods. Knowledge-base-based methods rely on existing semantic knowledge bases, which are manually curated and record the definitions of a large number of words and the relations between them (such as synonymy and antonymy). From this existing knowledge they build sentiment dictionaries with high accuracy and generality. For Chinese, however, well-curated knowledge bases are relatively scarce, so this approach does not transfer well to the construction of Chinese sentiment dictionaries; moreover, the dictionaries it produces are relatively generic and cannot capture how a word's sentiment shifts across domains. Corpus-based methods can generate domain-specific sentiment dictionaries: they process corpus text and mine relations between words, such as conjunction and co-occurrence relations, clustering closely related words together by hand-written rules or statistical methods. Such methods usually consider only simple relations between words in the text and ignore the complexity of the text itself, such as complex syntax and the influence of negation words, which degrades their effectiveness.
发明内容SUMMARY OF THE INVENTION
发明目的：本发明针对基于语料库的情感词典自动构建方法的不足，提出一种基于有监督情感文本和词向量的情感词典构建方法，使用神经网络生成词向量，挖掘词与词之间的内在联系，进而生成情感词典。Purpose of the invention: Addressing the shortcomings of corpus-based automatic sentiment dictionary construction methods, the present invention proposes a sentiment dictionary construction method based on supervised sentiment text and word vectors, which uses a neural network to generate word vectors, mines the intrinsic relations between words, and then generates a sentiment dictionary.
技术方案:本发明提出的技术方案为:Technical scheme: The technical scheme proposed by the present invention is:
一种基于有监督情感文本和词向量的情感词典构建方法,包括步骤:A sentiment dictionary construction method based on supervised sentiment text and word vectors, including steps:
(1)获取文本数据集D,文本数据集D中包括具有正面情感标记的正面情感文本和具有负面情感标记的负面情感文本;(1) Obtaining a text dataset D, the text dataset D includes positive sentiment texts with positive sentiment marks and negative sentiment texts with negative sentiment marks;
(2)对文本数据集中的文本进行预处理;构建词汇表V,将预处理后的文本数据集中的词语逐个填入词汇表V中;(2) Preprocess the text in the text data set; construct a vocabulary V, and fill in the words in the preprocessed text data set into the vocabulary V one by one;
(3)采用SO-PMI方法计算词汇表V中各个词语的情感倾向值,根据情感倾向值确定相应词语的情感标记:(3) The SO-PMI method is used to calculate the emotional tendency value of each word in the vocabulary V, and the emotional mark of the corresponding word is determined according to the emotional tendency value:
label_w = 1 if SO-PMI(w) > 0, otherwise label_w = 0
其中，label_w表示词语w的情感标记，SO-PMI(w)表示词语w的情感倾向值；where label_w denotes the sentiment mark of word w, and SO-PMI(w) denotes its sentiment tendency value;
(4)构建具有词语级别监督的改进的skip-gram模型，改进的skip-gram模型以D中的词语为输入数据，预测词语的上下文和情感标记；计算预测上下文时的损失函数losscontext，以及预测情感标记时的损失函数lossword；(4) Construct an improved skip-gram model with word-level supervision, which takes the words in D as input and predicts each word's context and sentiment mark; compute the loss function loss_context for predicting the context and the loss function loss_word for predicting the sentiment mark;
losscontext与lossword的表达式分别为:The expressions of loss context and loss word are:
loss_context(w_t) = −∑_{−k≤j≤k, j≠0} log p(w_{t+j}|w_t)
loss_word(w_t) = −[label_w·log p(pos|w_t) + (1−label_w)·log p(neg|w_t)]
其中,wt表示词语,wt∈D;{wt-k,…,wt-1,wt+1,…,wt+k}表示预测出的上下文词语集合,集合中包括预测出的词语wt的前k个词和后k个词,p(wt+j|wt)表示wt+j被预测为wt的上下文的概率,p(pos|wt)表示wt被预测为具有正面情感标记的概率,p(neg|wt)表示wt被预测为具有负面情感标记的概率;where w t represents a word, w t ∈ D; {w tk ,…,w t-1 ,w t+1 ,…,w t+k } represents a set of predicted context words, and the set includes the predicted words The first k words and the last k words of w t , p(w t+j |w t ) represents the probability of w t+j being predicted as the context of w t , p(pos|w t ) means that w t is predicted is the probability of having a positive sentiment label, p(neg|w t ) represents the probability that wt is predicted to have a negative sentiment label;
(5)构建一个卷积神经网络模型作为文本级监督模型，文本级监督模型以文本数据集D中的文本为输入数据，预测文本的情感标记；计算预测出的情感标记与文本实际情感标记之间的损失函数lossdoc：(5) Construct a convolutional neural network model as the text-level supervision model, which takes the texts in dataset D as input and predicts their sentiment marks; compute the loss function loss_doc between the predicted and the actual sentiment mark of each text:
loss_doc(d_i) = −[label_{d_i}·log p(pos|d_i) + (1−label_{d_i})·log p(neg|d_i)]
其中，di表示文本，di∈D；label_{d_i}表示di的情感标签；p(pos|di)表示di被预测为具有正面情感标记的概率，p(neg|di)表示di被预测为具有负面情感标记的概率；where d_i denotes a text, d_i ∈ D; label_{d_i} is the sentiment label of d_i; p(pos|d_i) is the probability that d_i is predicted to have a positive sentiment mark, and p(neg|d_i) the probability that d_i is predicted to have a negative one;
(6)设置联合损失函数:(6) Set the joint loss function:
loss = α1·loss_context + α2·loss_doc + α3·loss_word
式中,α1、α2、α3分别为losscontext、lossdoc、lossword的权重系数;In the formula, α 1 , α 2 , and α 3 are the weight coefficients of loss context , loss doc , and loss word respectively;
(7)以文本数据集D、词语的情感标记label_w、文本的情感标记label_{d_i}为输入数据，利用反向传播算法训练联合损失函数，得到具有情感嵌入的词向量；(7) Taking the text dataset D, the word sentiment marks label_w, and the text sentiment marks label_{d_i} as input data, train the joint loss function with the back-propagation algorithm to obtain word vectors with sentiment embedding;
(8)根据步骤(7)获得的具有情感嵌入的词向量构建词关系图G;(8) constructing a word relation graph G according to the word vector with emotional embedding obtained in step (7);
(9)选取词关系图G中的部分词语作为种子词，为种子词标注情感标签，情感标签包括褒义、贬义和中性；然后使用标签传播算法将种子词的情感标签在关系图G中传播，生成情感词典。(9) Select some words in the word relation graph G as seed words and annotate them with sentiment labels (positive, negative, or neutral); then use the label propagation algorithm to spread the seed words' labels through G, generating the sentiment dictionary.
进一步的，所述情感倾向值的计算公式为：Further, the sentiment tendency value is calculated as:
SO-PMI(w) = log( p(w|pos) / p(w|neg) )
其中,SO-PMI(w)表示词语w的情感倾向值,pos表示正面情感文本,neg表示负面情感文本,p(w|pos)表示词语w在正面情感文本中出现的概率,p(w|neg)表示词语w在负面情感文本中出现的概率。Among them, SO-PMI(w) represents the sentiment tendency value of the word w, pos represents the positive sentiment text, neg represents the negative sentiment text, p(w|pos) represents the probability that the word w appears in the positive sentiment text, p(w| neg) represents the probability that the word w appears in the negative sentiment text.
进一步的，所述具有词语级别监督的改进的skip-gram模型包括输入层、投影层、输出层。输入层为文本数据集D中的词语wt，投影层将词语wt投影为词向量C(wt)，输出层根据C(wt)分别预测wt的上下文和情感标记label_w。Further, the improved skip-gram model with word-level supervision comprises an input layer, a projection layer, and an output layer. The input layer takes a word w_t from dataset D, the projection layer projects w_t into a word vector C(w_t), and the output layer predicts, from C(w_t), both the context of w_t and its sentiment mark label_w.
进一步的，所述文本级监督模型包括：输入层、卷积层、池化层、全连接层。其中，输入层为文本数据集D中的文本di；卷积层通过特征抽取器从文本di中抽取多个特征向量发送给池化层；池化层通过Max Pooling Over Time操作从特征向量中选取最重要的特征向量输出给全连接层；全连接层根据收到的特征向量，通过softmax函数预测输入文本di的情感标记label_{d_i}。Further, the text-level supervision model comprises an input layer, a convolution layer, a pooling layer, and a fully connected layer. The input layer takes a text d_i from the dataset D; the convolution layer extracts multiple feature vectors from d_i with feature extractors and passes them to the pooling layer; the pooling layer selects the most salient features from these vectors by a Max Pooling Over Time operation and outputs them to the fully connected layer; from the received features, the fully connected layer predicts the sentiment mark label_{d_i} of the input text d_i via the softmax function.
进一步的,所述构建词关系图G的具体步骤包括:Further, the specific steps of constructing the word relation graph G include:
1)对词汇表V,抽取其中的动词、形容词、副词构成新的词汇表V′;1) For vocabulary V, extract the verbs, adjectives and adverbs in it to form a new vocabulary V';
2)构建词关系图G,将V′中的词作为G中的顶点;2) Construct the word relation graph G, and use the words in V' as the vertices in G;
3)对于V′中的每个词wi，计算wi与V′中其他所有词在步骤(7)得到的词向量空间中的欧氏距离，选取欧式距离最近的k个词，在词关系图G中建立wi与这k个词之间的边，边的权重计算公式为：3) For each word w_i in V′, compute the Euclidean distance between w_i and every other word of V′ in the word vector space obtained in step (7), select the k words with the smallest distances, and create edges between w_i and these k words in the word relation graph G, with edge weights computed as:
其中，wij表示词wi和wj之间边的权重，xi、xj分别为词wi和wj的词向量，euclidean_dis(xi,xj)表示xi与xj之间的欧式距离；σ为常数参数，用于控制wij的取值。where w_ij denotes the weight of the edge between words w_i and w_j, x_i and x_j are their word vectors, euclidean_dis(x_i, x_j) is the Euclidean distance between x_i and x_j, and σ is a constant parameter that controls the magnitude of w_ij.
对于和词wi的距离最近的m个词之外的其他词，使wij=0。For all words other than the m words closest to word w_i, set w_ij = 0.
有益效果:与现有技术相比,本发明具有以下优势:Beneficial effect: Compared with the prior art, the present invention has the following advantages:
本发明基于有监督语料集生成情感词典，使用神经网络生成词向量，挖掘词与词之间的内在联系，使用标签传播算法传播情感标签，自动构建特定领域的情感词典。本发明既避免了基于知识库的情感词典构建方法无法用于特定领域情感分析的不足，相比于其他基于语料库的方法又加强了对文本中词的复杂关系的考虑，最终实现情感词典的自动构建。The invention generates a sentiment dictionary from a supervised corpus: it uses a neural network to generate word vectors, mines the intrinsic relations between words, and propagates sentiment labels with a label propagation algorithm to automatically construct a domain-specific sentiment dictionary. It avoids the limitation of knowledge-base-based construction methods, which cannot serve domain-specific sentiment analysis, while, compared with other corpus-based methods, it better accounts for the complex relations between words in the text, finally realizing automatic construction of the sentiment dictionary.
附图说明Description of drawings
图1为本发明的整体流程图;Fig. 1 is the overall flow chart of the present invention;
图2为改进的skip-gram模型的结构图;Fig. 2 is the structure diagram of the improved skip-gram model;
图3为卷积神经网络模型的结构图。Figure 3 is a structural diagram of a convolutional neural network model.
具体实施方式Detailed ways
下面结合附图对本发明作更进一步的说明。The present invention will be further described below in conjunction with the accompanying drawings.
图1所示为本发明的整体流程，本发明主要分为三个阶段：数据处理阶段、词向量情感嵌入阶段、情感词典生成阶段，下面结合附图1至3对各阶段的具体步骤进行详细描述。Figure 1 shows the overall flow of the invention, which is divided into three stages: the data processing stage, the word vector sentiment embedding stage, and the sentiment dictionary generation stage. The specific steps of each stage are described in detail below with reference to Figures 1 to 3.
一、数据处理阶段(步骤1-3):1. Data processing stage (steps 1-3):
步骤1是数据获取，即获取具有情感标签标注的文本数据集D。D中文本的情感标签分为正面和负面，使用标记label_{d_i}表示文本di的情感，其中label_{d_i}=0表示负面情感标签，label_{d_i}=1表示正面情感标签。Step 1 is data acquisition: obtain a text dataset D annotated with sentiment labels. The labels of the texts in D are either positive or negative; the mark label_{d_i} denotes the sentiment of text d_i, where label_{d_i}=0 means a negative sentiment label and label_{d_i}=1 a positive one.
步骤2是数据预处理，先使用开源工具jieba对文本进行分词和词性标注，然后使用停用词表去除文本中的停用词，得到词语序列，根据词语序列构建词汇表V，将预处理后的文本数据集表示为D={d1,d2,…,dn}。Step 2 is data preprocessing: first segment the texts and tag parts of speech with the open-source tool jieba, then remove stop words using a stop-word list to obtain word sequences; build the vocabulary V from these sequences, and denote the preprocessed dataset as D={d1,d2,…,dn}.
步骤3是使用SO-PMI方法计算词的大致情感，SO-PMI方法计算公式为：Step 3 computes each word's approximate sentiment with the SO-PMI method, whose formula is:
SO-PMI(w) = log( p(w|pos) / p(w|neg) )
其中,SO-PMI(w)表示词语w的情感倾向值,pos表示正面情感文本,neg表示负面情感文本,p(w|pos)表示词语w在正面情感文本中出现的概率,p(w|neg)表示词语w在负面情感文本中出现的概率。Among them, SO-PMI(w) represents the sentiment tendency value of the word w, pos represents the positive sentiment text, neg represents the negative sentiment text, p(w|pos) represents the probability that the word w appears in the positive sentiment text, p(w| neg) represents the probability that the word w appears in the negative sentiment text.
定义label_w表示词语w的情感标记：label_w = 1 if SO-PMI(w) > 0, otherwise label_w = 0。Define label_w as the sentiment mark of word w: label_w = 1 when SO-PMI(w) > 0, and label_w = 0 otherwise.
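The data processing stage above (steps 1-3) can be sketched in Python as follows. This is an illustrative sketch only: conditional probabilities are estimated by simple relative frequency, and the `eps` smoothing term is our addition to avoid taking the logarithm of zero (the patent does not specify a smoothing scheme).

```python
import math
from collections import Counter

def so_pmi(word, pos_texts, neg_texts, eps=1e-6):
    # Estimate p(w|pos) and p(w|neg) by relative frequency over the
    # pre-segmented positive and negative texts, then compare in log space.
    pos_counts = Counter(w for t in pos_texts for w in t)
    neg_counts = Counter(w for t in neg_texts for w in t)
    p_pos = pos_counts[word] / max(sum(pos_counts.values()), 1)
    p_neg = neg_counts[word] / max(sum(neg_counts.values()), 1)
    return math.log((p_pos + eps) / (p_neg + eps))

def word_label(word, pos_texts, neg_texts):
    # label_w = 1 (positive) when SO-PMI(w) > 0, otherwise 0 (negative)
    return 1 if so_pmi(word, pos_texts, neg_texts) > 0 else 0

pos_texts = [["good", "great"], ["good", "nice"]]   # toy segmented texts
neg_texts = [["bad", "awful"], ["bad"]]
print(word_label("good", pos_texts, neg_texts))     # → 1
```

A word that appears mostly in positive texts gets SO-PMI(w) > 0 and hence label_w = 1; one seen mostly in negative texts gets label_w = 0.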
二、词向量情感嵌入阶段(步骤4-6):Second, the word vector emotion embedding stage (steps 4-6):
步骤4是词语级监督模型的构建，即构建一个改进的Skip-gram模型，使用词语级别情感监督数据训练词向量。模型由输入层、投影层、输出层组成：输入层为训练数据中的一个词wt，投影层将词wt投影为词向量表示C(wt)，输出层使用C(wt)分别预测wt的上下文和情感标记label_w。Step 4 builds the word-level supervision model: an improved Skip-gram model trained with word-level sentiment supervision. It consists of an input layer, a projection layer, and an output layer: the input layer takes a word w_t from the training data, the projection layer maps w_t to its vector representation C(w_t), and the output layer uses C(w_t) to predict both the context of w_t and its sentiment mark label_w.
预测wt的上下文的损失函数为：The loss function for predicting the context of w_t is:
loss_context(w_t) = −∑_{−k≤j≤k, j≠0} log p(w_{t+j}|w_t)
预测情感标记时的损失函数为：The loss function for predicting the sentiment mark is:
loss_word(w_t) = −[label_w·log p(pos|w_t) + (1−label_w)·log p(neg|w_t)]
其中，wt表示词语，wt∈D；k表示预测上下文的范围；{wt-k,…,wt-1,wt+1,…,wt+k}表示预测出的上下文词语集合，包括词语wt的前k个词和后k个词；p(wt+j|wt)表示wt+j被预测为wt的上下文的概率，p(pos|wt)表示wt被预测为具有正面情感的概率，p(neg|wt)表示wt被预测为具有负面情感的概率。where w_t denotes a word, w_t ∈ D; k is the context window size; {w_{t-k},…,w_{t-1},w_{t+1},…,w_{t+k}} is the set of predicted context words, consisting of the k words before and the k words after w_t; p(w_{t+j}|w_t) is the probability that w_{t+j} is predicted as context of w_t; p(pos|w_t) is the probability that w_t is predicted to carry positive sentiment, and p(neg|w_t) the probability of negative sentiment.
步骤5是文本级监督模型的构建，即构建一个卷积神经网络模型，由输入层、卷积层、池化层、全连接层组成。其中，输入为文本数据集D中的一段文本di，卷积层通过特征抽取器从文本di中抽取多个特征向量，池化层通过Max Pooling Over Time操作从特征向量中抽取最重要的特征作为输出，全连接层通过softmax函数预测输入文本di的情感标记label_{d_i}，其损失函数为：Step 5 builds the text-level supervision model: a convolutional neural network composed of an input layer, a convolution layer, a pooling layer, and a fully connected layer. The input is a text d_i from dataset D; the convolution layer extracts multiple feature vectors from d_i with feature extractors; the pooling layer extracts the most salient features from these vectors by a Max Pooling Over Time operation; and the fully connected layer predicts the sentiment mark label_{d_i} of d_i via the softmax function. Its loss function is:
loss_doc(d_i) = −[label_{d_i}·log p(pos|d_i) + (1−label_{d_i})·log p(neg|d_i)]
其中,p(pos|di)表示di被预测为具有正面情感的概率,p(neg|di)表示di被预测为具有负面情感的概率。where p(pos|d i ) represents the probability that d i is predicted to have positive sentiment, and p(neg|d i ) represents the probability that d i is predicted to have negative sentiment.
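The Max Pooling Over Time operation of step 5 reduces the variable-length output of the convolution layer to a fixed-size vector by keeping, for each filter, its strongest activation over all time steps. A minimal numpy sketch (the filter activations below are made-up numbers, not values from the patent):

```python
import numpy as np

def max_pooling_over_time(feature_maps):
    # feature_maps: (num_filters, time_steps) activations from the conv layer;
    # keep each filter's maximum over time -> fixed-size (num_filters,) vector
    return feature_maps.max(axis=1)

def softmax(z):
    # numerically stable softmax, as used by the fully connected output layer
    z = z - z.max()
    e = np.exp(z)
    return e / e.sum()

maps = np.array([[0.1, 0.9, 0.3],    # filter 1 over 3 time steps
                 [0.5, 0.2, 0.8]])   # filter 2
pooled = max_pooling_over_time(maps)
print(pooled)                         # → [0.9 0.8]
```

The pooled vector is what the fully connected layer turns into the two class probabilities p(pos|d_i) and p(neg|d_i).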
步骤6是联合训练模型，即将词语级监督模型和文本级监督模型组合，设置损失函数为：Step 6 jointly trains the model: the word-level and text-level supervision models are combined, and the loss function is set to:
loss = α1·loss_context + α2·loss_doc + α3·loss_word
式中，α1、α2、α3分别为losscontext、lossdoc、lossword的权重系数，用于分别控制三个损失函数在最终损失函数中的权重。以文本数据集D、词语的情感标记label_w、文本的情感标记label_{d_i}为输入数据，使用随机梯度下降和误差反向传播算法对loss进行优化训练，获得具有情感嵌入的词向量。where α1, α2, α3 are the weight coefficients of loss_context, loss_doc, and loss_word, controlling the contribution of each loss to the final objective. Taking the text dataset D, the word sentiment marks label_w, and the text sentiment marks label_{d_i} as input data, loss is optimized with stochastic gradient descent and error back-propagation to obtain word vectors with sentiment embedding.
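Step 6's joint objective is a weighted sum of the three losses. The sketch below assumes the binary cross-entropy form of loss_word and loss_doc implied by the two-class labels; the α weights of 1.0 are illustrative defaults, not values stated in the patent:

```python
import math

def loss_word(p_pos, label_w):
    # -[label_w·log p(pos|w) + (1-label_w)·log p(neg|w)], with p(neg|w) = 1 - p(pos|w)
    return -(label_w * math.log(p_pos) + (1 - label_w) * math.log(1 - p_pos))

def loss_doc(p_pos, label_d):
    # same binary cross-entropy form, at the text level
    return -(label_d * math.log(p_pos) + (1 - label_d) * math.log(1 - p_pos))

def joint_loss(l_context, l_doc, l_word, a1=1.0, a2=1.0, a3=1.0):
    # loss = α1·loss_context + α2·loss_doc + α3·loss_word
    return a1 * l_context + a2 * l_doc + a3 * l_word

print(joint_loss(1.0, 2.0, 3.0))  # → 6.0
```

In training, each mini-batch would evaluate the three losses on the same word vectors and back-propagate through the combined scalar.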
三、情感词典生成阶段(步骤7-9):3. Sentiment dictionary generation stage (steps 7-9):
步骤7为构建词关系图，其具体步骤为：Step 7 builds the word relation graph; the specific steps are:
1)对词汇表V,抽取其中的动词、形容词、副词构成新的词汇表V′;1) For vocabulary V, extract the verbs, adjectives and adverbs in it to form a new vocabulary V';
2)构建词关系图G,将V′中的词作为G中的顶点;2) Construct the word relation graph G, and use the words in V' as the vertices in G;
3)对于V′中的每个词wi，计算wi与V′中其他所有词在步骤(7)得到的词向量空间中的欧氏距离，选取欧式距离最近的k个词，在词关系图G中建立wi与这k个词之间的边，边的权重计算公式为：3) For each word w_i in V′, compute the Euclidean distance between w_i and every other word of V′ in the word vector space obtained in step (7), select the k words with the smallest distances, and create edges between w_i and these k words in the word relation graph G, with edge weights computed as:
其中，wij表示词wi和wj之间边的权重，xi、xj分别为词wi和wj的词向量，euclidean_dis(xi,xj)表示xi与xj之间的欧式距离；σ为常数参数，用于控制wij的取值。where w_ij denotes the weight of the edge between words w_i and w_j, x_i and x_j are their word vectors, euclidean_dis(x_i, x_j) is the Euclidean distance between x_i and x_j, and σ is a constant parameter that controls the magnitude of w_ij.
对于和词wi的距离最近的m个词之外的其他词，使wij=0。For all words other than the m words closest to word w_i, set w_ij = 0.
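Step 7 can be sketched with numpy as below. Two points are assumptions on our part: the patent's edge-weight formula is not reproduced in this text, so a standard Gaussian kernel exp(−d²/(2σ²)) is used, which matches the stated roles of euclidean_dis and σ; and the k-nearest-neighbour edges are symmetrized into an undirected graph:

```python
import numpy as np

def build_word_graph(X, k=1, sigma=1.0):
    # X: (n_words, dim) sentiment-embedded word vectors from the training stage
    n = X.shape[0]
    dist = np.linalg.norm(X[:, None, :] - X[None, :, :], axis=-1)  # pairwise Euclidean
    W = np.zeros((n, n))
    for i in range(n):
        nearest = np.argsort(dist[i])[1:k + 1]          # k nearest, skipping the word itself
        W[i, nearest] = np.exp(-dist[i, nearest] ** 2 / (2 * sigma ** 2))
    return np.maximum(W, W.T)                           # symmetric; w_ij = 0 elsewhere

X = np.array([[0.0], [1.0], [3.0]])                     # toy 1-D "embeddings"
W = build_word_graph(X, k=1)
print(W[0, 1] > 0, W[0, 2] == 0)                        # → True True
```

Words close in the embedding space get heavy edges; distant pairs keep weight zero, which is what the propagation stage relies on.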
步骤8为使用标签传播算法传播情感标签，具体步骤为：Step 8 propagates sentiment labels with the label propagation algorithm; the specific steps are:
1)人工标注少量种子情感词,即人工标注少量具有褒义、贬义和中性的词语作为种子词;1) Manually mark a small number of seed emotional words, that is, manually mark a small number of words with positive, derogatory and neutral meanings as seed words;
2)定义标签矩阵Y，标签矩阵Y是一个大小为|V′|×3的矩阵，Y中每一行对应词汇表V′中的一个词，三列分别代表该词是褒义、贬义和中性的概率。根据人工标注的种子情感词，对标签矩阵进行初始化；本实施例中采用的初始化方法为：对于1)中人工标注的词语，如果是褒义，对应行初始化为[1,0,0]，如果为贬义，初始化为[0,1,0]，如果是中性，初始化为[0,0,1]；对于没有在1)中人工标注的词语，对应行初始化为[0,0,0]；2) Define the label matrix Y, a |V′|×3 matrix in which each row corresponds to a word of the vocabulary V′ and the three columns give the probabilities that the word is positive, negative, or neutral. Initialize Y from the manually annotated seed sentiment words; the initialization used in this embodiment is: for a word annotated in 1), its row is initialized to [1,0,0] if the word is positive, [0,1,0] if negative, and [0,0,1] if neutral; for words not annotated in 1), the row is initialized to [0,0,0];
3)定义概率转移矩阵T，使得T_ij = w_ij / Σ_k w_kj；3) Define the probability transition matrix T such that T_ij = w_ij / Σ_k w_kj;
4)进行情感标签传播:Y=TY4) Spread emotional labels: Y=TY
5)按照2)中的初始化方式重新初始化标签矩阵Y中的人工标注数据的标签概率分布5) Re-initialize the label probability distribution of the manually labeled data in the label matrix Y according to the initialization method in 2)
6)若标签矩阵Y收敛,则停止迭代,否则转到步骤4)。6) If the label matrix Y converges, stop the iteration, otherwise go to step 4).
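The propagation loop of steps 1)-6) can be sketched as follows. The transition matrix here is row-normalized from W, an assumption on our part since the patent's exact definition of T is not reproduced in this text; seed rows are re-clamped after every iteration exactly as in steps 4)-5):

```python
import numpy as np

def propagate_labels(W, seeds, max_iter=100, tol=1e-6):
    # W: (n, n) edge-weight matrix; seeds: {word_index: column}, where
    # columns 0/1/2 = positive/negative/neutral, as in the label matrix Y
    n = W.shape[0]
    Y = np.zeros((n, 3))
    for i, lab in seeds.items():                    # initialize seed rows
        Y[i, lab] = 1.0
    T = W / np.maximum(W.sum(axis=1, keepdims=True), 1e-12)  # row-normalized
    for _ in range(max_iter):
        Y_new = T @ Y                               # 4) propagate: Y <- T·Y
        for i, lab in seeds.items():                # 5) re-clamp labelled seeds
            Y_new[i] = 0.0
            Y_new[i, lab] = 1.0
        if np.abs(Y_new - Y).max() < tol:           # 6) stop on convergence
            Y = Y_new
            break
        Y = Y_new
    return Y.argmax(axis=1)                         # final polarity per word

W = np.array([[0.0, 2.0, 0.0],
              [2.0, 0.0, 1.0],
              [0.0, 1.0, 0.0]])
labels = propagate_labels(W, {0: 0, 2: 1})          # word 1 follows its heavier neighbour
```

On this toy graph the unlabelled middle word inherits the polarity of the seed it is more strongly connected to, which is the behaviour the dictionary-generation step relies on.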
步骤9为生成情感词，即根据步骤8中标签传播算法的结果，将词汇表V′中的词按照标签矩阵Y中的概率，分为褒义、贬义和中性；整理得到情感词典。Step 9 generates the sentiment words: according to the result of the label propagation in step 8, the words of V′ are classified as positive, negative, or neutral by the probabilities in the label matrix Y, and the sentiment dictionary is compiled.
步骤10为结束。Step 10: end.
图2为步骤4中词语级监督模型的结构图，其具体结构和设置如下：Figure 2 is a structural diagram of the word-level supervision model in step 4; its specific structure and settings are as follows:
1)设置词向量维度为100,上下文窗口的大小为3;初始化词向量矩阵W,其大小为|V|×100,其第i行代表词汇表V中第i个词的词向量;1) Set the word vector dimension to 100 and the size of the context window to 3; initialize the word vector matrix W, whose size is |V|×100, and the i-th row represents the word vector of the i-th word in the vocabulary V;
2)在输入层中,从文本数据集D中选取一词wt作为输入,其表示形式为wt的one-hot表示;2) In the input layer, the word wt is selected from the text data set D as input, and its representation is the one-hot representation of wt ;
3)在投影层中,从词向量矩阵W中输出wt的向量形式C(wt);3) In the projection layer, the vector form C(w t ) of w t is output from the word vector matrix W;
4)在输出层中，使用C(wt)预测词wt的上下文{wt-k,…,wt-1,wt+1,…,wt+k}，其损失函数记为：loss_context(w_t) = −∑_{−k≤j≤k, j≠0} log p(w_{t+j}|w_t)；4) In the output layer, C(w_t) is used to predict the context {w_{t-k},…,w_{t-1},w_{t+1},…,w_{t+k}} of word w_t; its loss function is loss_context(w_t) = −∑_{−k≤j≤k, j≠0} log p(w_{t+j}|w_t);
5)在输出层中，通过softmax函数使用C(wt)预测词wt的情感标记label_w，其损失函数记为：loss_word(w_t) = −[label_w·log p(pos|w_t) + (1−label_w)·log p(neg|w_t)]；5) In the output layer, C(w_t) is fed through the softmax function to predict the sentiment mark label_w of word w_t; its loss function is loss_word(w_t) = −[label_w·log p(pos|w_t) + (1−label_w)·log p(neg|w_t)];
6)结束。6) End.
图3为步骤5中文本级监督模型的结构图,其具体结构和设置如下:Figure 3 is a structural diagram of the text-level supervision model in step 5, and its specific structure and settings are as follows:
1)设置词向量维度m为100,初始化词向量矩阵W。根据文本数据集D中文本的长度设置模型输入的最大文本长度为L;1) Set the word vector dimension m to 100, and initialize the word vector matrix W. Set the maximum text length input by the model to L according to the length of the text in the text dataset D;
2)在输入层中,从文本数据集D中选取一段文本di。从词向量矩阵W中提取出文本di中每个词的词向量,相互连接构成大小为L×m的二维矩阵,将该矩阵作为模型输入;2) In the input layer, select a piece of text d i from the text dataset D. The word vector of each word in the text d i is extracted from the word vector matrix W, connected to each other to form a two-dimensional matrix of size L×m, and the matrix is used as the model input;
3)在卷积层中,使用200个滤波器进行卷积操作,获取特征向量;3) In the convolution layer, use 200 filters to perform convolution operations to obtain feature vectors;
4)在池化层中,使用Max Pooling Over Time操作从特征向量中抽取最重要的特征作为输出;4) In the pooling layer, use the Max Pooling Over Time operation to extract the most important features from the feature vector as output;
5)在全连接层中，使用一个全连接的神经网络层，并通过softmax函数预测输入文本di的情感标记label_{d_i}，其损失函数为：loss_doc(d_i) = −[label_{d_i}·log p(pos|d_i) + (1−label_{d_i})·log p(neg|d_i)]；5) In the fully connected layer, a fully connected neural network layer is used, and the sentiment mark label_{d_i} of the input text d_i is predicted via the softmax function; its loss function is loss_doc(d_i) = −[label_{d_i}·log p(pos|d_i) + (1−label_{d_i})·log p(neg|d_i)];
6)结束。6) End.
综上所述，本发明基于有监督语料集，使用神经网络生成具有情感嵌入的词向量，然后挖掘词与词之间的内在联系，使用标签传播算法传播情感标签，自动构建特定领域的情感词典。本发明既避免了基于知识库的情感词典构建方法无法用于特定领域情感分析的不足，相比于其他基于语料库的方法又加强了对文本中词的复杂关系的考虑，最终实现情感词典的自动构建。In summary, the invention generates word vectors with sentiment embedding from a supervised corpus using a neural network, mines the intrinsic relations between words, and propagates sentiment labels with a label propagation algorithm to automatically construct a domain-specific sentiment dictionary. It avoids the limitation of knowledge-base-based construction methods, which cannot serve domain-specific sentiment analysis, while, compared with other corpus-based methods, it better accounts for the complex relations between words in the text, finally realizing automatic construction of the sentiment dictionary.
以上所述仅是本发明的优选实施方式，应当指出：对于本技术领域的普通技术人员来说，在不脱离本发明原理的前提下，还可以做出若干改进和润饰，这些改进和润饰也应视为本发明的保护范围。The above is only a preferred embodiment of the present invention. It should be pointed out that those of ordinary skill in the art can make several improvements and refinements without departing from the principle of the invention, and these improvements and refinements should also be regarded as falling within the protection scope of the present invention.
Claims (5)
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201810473308.6A CN108647191B (en) | 2018-05-17 | 2018-05-17 | A Sentiment Dictionary Construction Method Based on Supervised Sentiment Text and Word Vectors |
Applications Claiming Priority (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201810473308.6A CN108647191B (en) | 2018-05-17 | 2018-05-17 | A Sentiment Dictionary Construction Method Based on Supervised Sentiment Text and Word Vectors |
Publications (2)
Publication Number | Publication Date |
---|---|
CN108647191A CN108647191A (en) | 2018-10-12 |
CN108647191B true CN108647191B (en) | 2021-06-25 |
Family
ID=63756399
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN201810473308.6A Active CN108647191B (en) | 2018-05-17 | 2018-05-17 | A Sentiment Dictionary Construction Method Based on Supervised Sentiment Text and Word Vectors |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN108647191B (en) |
Families Citing this family (8)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN109902300A (en) * | 2018-12-29 | 2019-06-18 | 深兰科技(上海)有限公司 | A kind of method, apparatus, electronic equipment and storage medium creating dictionary |
CN109885687A (en) * | 2018-12-29 | 2019-06-14 | 深兰科技(上海)有限公司 | A kind of sentiment analysis method, apparatus, electronic equipment and the storage medium of text |
CN110399595B (en) * | 2019-07-31 | 2024-04-05 | 腾讯科技(成都)有限公司 | Text information labeling method and related device |
CN110598207B (en) * | 2019-08-14 | 2020-09-01 | 华南师范大学 | Method, device and storage medium for obtaining word vector |
CN110717047B (en) * | 2019-10-22 | 2022-06-28 | 湖南科技大学 | Web service classification method based on graph convolution neural network |
CN114648015B (en) * | 2022-03-15 | 2022-11-15 | 北京理工大学 | Dependency relationship attention model-based aspect-level emotional word recognition method |
CN114822495B (en) * | 2022-06-29 | 2022-10-14 | 杭州同花顺数据开发有限公司 | Acoustic model training method and device and speech synthesis method |
CN116304028B (en) * | 2023-02-20 | 2023-10-03 | 重庆大学 | False news detection method based on social emotion resonance and relationship graph convolution network |
Family Cites Families (6)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN102663139B (en) * | 2012-05-07 | 2013-04-03 | 苏州大学 | Method and system for constructing emotional dictionary |
CN103544246A (en) * | 2013-10-10 | 2014-01-29 | 清华大学 | Method and system for constructing internet multi-sentiment dictionary |
CN104317965B (en) * | 2014-11-14 | 2018-04-03 | 南京理工大学 | Sentiment dictionary construction method based on language material |
CN107102989B (en) * | 2017-05-24 | 2020-09-29 | 南京大学 | Entity disambiguation method based on word vector and convolutional neural network |
CN107451118A (en) * | 2017-07-21 | 2017-12-08 | 西安电子科技大学 | Sentence-level sensibility classification method based on Weakly supervised deep learning |
CN107609132B (en) * | 2017-09-18 | 2020-03-20 | 杭州电子科技大学 | Semantic ontology base based Chinese text sentiment analysis method |
-
2018
- 2018-05-17 CN CN201810473308.6A patent/CN108647191B/en active Active
Also Published As
Publication number | Publication date |
---|---|
CN108647191A (en) | 2018-10-12 |
Similar Documents
Publication | Publication Date | Title |
---|---|---|
CN108647191B (en) | A Sentiment Dictionary Construction Method Based on Supervised Sentiment Text and Word Vectors | |
CN109948165B (en) | Fine-grained emotion polarity prediction method based on hybrid attention network | |
CN108733792B (en) | An Entity Relationship Extraction Method | |
CN109657239B (en) | Chinese Named Entity Recognition Method Based on Attention Mechanism and Language Model Learning | |
CN106599032B (en) | Text event extraction method combining sparse coding and structure sensing machine | |
CN113591483A (en) | Document-level event argument extraction method based on sequence labeling | |
CN109992780B (en) | Specific target emotion classification method based on deep neural network | |
CN107943784B (en) | Generative Adversarial Network-Based Relation Extraction Method | |
CN110609891A (en) | A Visual Dialogue Generation Method Based on Context-Aware Graph Neural Network | |
CN109711465B (en) | Image subtitle generation method based on MLL and ASCA-FR | |
CN110162749A (en) | Information extracting method, device, computer equipment and computer readable storage medium | |
CN107967318A (en) | A kind of Chinese short text subjective item automatic scoring method and system using LSTM neutral nets | |
CN110287323B (en) | Target-oriented emotion classification method | |
CN111460824B (en) | Unmarked named entity identification method based on anti-migration learning | |
CN107168955A (en) | Word insertion and the Chinese word cutting method of neutral net using word-based context | |
CN110826335A (en) | A method and apparatus for named entity recognition | |
CN110263325A (en) | Chinese automatic word-cut | |
CN110276396B (en) | Image description generation method based on object saliency and cross-modal fusion features | |
CN110852089B (en) | Operation and maintenance project management method based on intelligent word segmentation and deep learning | |
CN110134954A (en) | A Named Entity Recognition Method Based on Attention Mechanism | |
CN110569355B (en) | Viewpoint target extraction and target emotion classification combined method and system based on word blocks | |
CN114417851B (en) | Emotion analysis method based on keyword weighted information | |
CN111145914B (en) | Method and device for determining text entity of lung cancer clinical disease seed bank | |
CN111159405B (en) | Irony detection method based on background knowledge | |
CN113934835B (en) | Retrieval type reply dialogue method and system combining keywords and semantic understanding representation |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
PB01 | Publication | ||
PB01 | Publication | ||
SE01 | Entry into force of request for substantive examination | ||
SE01 | Entry into force of request for substantive examination | ||
GR01 | Patent grant | ||
GR01 | Patent grant |