CN110837740B

CN110837740B - An opinion-level mining method for review aspects based on dictionary-based improved LDA model

Info

Publication number: CN110837740B
Application number: CN201911058218.1A
Authority: CN
Inventors: 袁凌; 冯晋田; 李金珊; 魏明; 杨雷
Original assignee: Huazhong University of Science and Technology; Wuhan Fiberhome Technical Services Co Ltd
Current assignee: Huazhong University of Science and Technology; Wuhan Fiberhome Technical Services Co Ltd
Priority date: 2019-10-31
Filing date: 2019-10-31
Publication date: 2021-04-20
Anticipated expiration: 2039-10-31
Also published as: CN110837740A

Abstract

The invention discloses an opinion-level mining method based on a dictionary-based improved LDA model for comments, and belongs to the field of network comment text mining. Including: constructing an inverted index list based on the original network comment database; performing stop-word processing on each sentence of the original network comment database to obtain the preprocessed network comment database; inputting the preprocessed network comment database into the improved LDA based on SentiWordNet and WordNet The model uses Gibbs sampling to obtain the sampling results; sort the sampling results, select the top m words belonging to the corresponding evaluation category, and find the specific sentence according to the inverted index of the word. The present invention directly sets the aspect of the network comment database as the seed word without manual annotation. The evaluation object words are separated from the comment opinions, and the LDA model parameters are biased by calculating the similarity between the words and the seed words, so as to improve the effect of the model. Based on the inverted index, the clustering results are linked with the seed words and the original text to improve the readability of the results.

Description

An opinion-level mining method for review aspects based on dictionary-based improved LDA model

技术领域technical field

本发明属于网络评论文本挖掘领域，更具体地，涉及一种基于词典改进LDA模型的评论方面观点级挖掘方法。The invention belongs to the field of network comment text mining, and more particularly, relates to an opinion-level mining method for comments based on a dictionary-based improved LDA model.

背景技术Background technique

移动互联网的迅猛发展和智能手机的普及，为人们随时随地发表评论和意见提供了有利条件。在Twitter、微博等社交平台，在淘宝、亚马逊、京东等网购平台，人们可以对不同领域的不同商品进行评价。有效的分析这些评价，能够辅助厂家进行销售、未来发展的决策，亦能帮助消费者筛选合乎自己期待的产品。但单纯对评论语句进行情感极性判断，不能提供有效的信息，还需要进一步确定情感词描述的对象。与新闻报道、博客等不同，网络评论一般内容较短。因服务内容不同，网络评论的点评对象所在领域也各不相同。且网络评论对象包含的属性较多，只有通过对方面级观点进行挖掘，才能掌握评论中的有效信息。The rapid development of the mobile Internet and the popularization of smart phones have provided favorable conditions for people to post comments and opinions anytime, anywhere. On social platforms such as Twitter and Weibo, and online shopping platforms such as Taobao, Amazon, and JD.com, people can evaluate different products in different fields. Effective analysis of these evaluations can assist manufacturers in making decisions on sales and future development, and can also help consumers select products that meet their expectations. However, simply judging the sentiment polarity of the comment sentence cannot provide effective information, and it is necessary to further determine the object described by the sentiment word. Unlike news reports, blogs, etc., online reviews are generally shorter in content. Due to different service contents, the fields of review objects of online reviews are also different. And the online comment object contains many attributes, only by mining aspect-level viewpoints, can we grasp the effective information in the comment.

评论的方面级观点挖掘能够从评论中抽取方面级评论对象和评论范畴，有着重要的研究意义与价值。方面级评论对象(Opinion Target Expression Extraction)是指情感观点词所修饰的实体本身或者属性。在评论信息的挖掘中，仅判断评论语句的情感极性对阅读评论的人毫无意义，人们更关心商品具体层面上的好与坏。因此确定评论的方面级评论对象具有十分重要的意义。如商品评论，“这个手机外观一般，电池电量足，信号也很强。”，如果直接判断该语句的情感极性，在用户尚未阅读原文的情况下，只能了解到有一个评论表示手机好，这显然对用户的价值不大。因此在进行评论挖掘时，要先对评论语句中的方面级评论对象词进行抽取，如针对上述语句，应该抽取出的词为“外观”、“电池”、“信号”。评论范畴识别(AspectCategory Identification)与方面级评论对象为属于关系。除了判断词属于某个评论范畴外，语句同样可以打上评论范畴的标签。Aspect-level opinion mining of reviews can extract aspect-level review objects and review categories from reviews, which has important research significance and value. The aspect-level comment object (Opinion Target Expression Extraction) refers to the entity itself or the attribute modified by the sentiment opinion word. In the mining of comment information, only judging the emotional polarity of the comment sentence is meaningless to the people who read the comment, and people are more concerned about the good or bad of the product at the specific level. Therefore, it is very important to determine the aspect-level comment object of the comment. For example, in a product review, "This phone has a normal appearance, sufficient battery power, and strong signal." If you directly judge the emotional polarity of the sentence, you can only know that there is one comment that indicates that the phone is good without reading the original text. , which is obviously of little value to the user. Therefore, when conducting comment mining, the aspect-level comment object words in the comment sentences should be extracted first. For example, for the above sentences, the words that should be extracted are "appearance", "battery", and "signal". AspectCategory Identification and aspect-level review objects belong to each other. In addition to judging that the word belongs to a certain review category, sentences can also be labeled with a review category.

然而海量的评论涉及的商品种类繁多，方面级观点挖掘所需数据标注的过程繁琐，为所有领域的评论建立规范标注语料库将耗费大量的资源。依赖于标注数据集的有监督方法将很难应用于缺乏标注语料的评论领域。如何在少监督及无监督情况下提高模型的效果，并使模型具有领域适应性(包括不同领域、不同语言)，是非常值得研究的课题。现有技术为MaxEnt-LDA模型，它引入了两个分布来指示评论对象词与情感词的分类和积极情感词与消极情感词的分类。但是，它存在以下缺陷：用来指示评论对象词和情感词分类的分类器使用了最大熵模型，需要对数据集进行大量标注。However, the massive reviews involve a wide variety of commodities, and the process of data labeling required for aspect-level opinion mining is cumbersome. It will consume a lot of resources to build a canonical labeling corpus for reviews in all fields. Supervised methods that rely on annotated datasets will be difficult to apply to the review domain that lacks annotated corpora. How to improve the effect of the model under the condition of less supervision and no supervision, and make the model have domain adaptability (including different domains and different languages), is a very worthwhile research topic. The prior art is the MaxEnt-LDA model, which introduces two distributions to indicate the classification of comment object words and sentiment words and the classification of positive sentiment words and negative sentiment words. However, it has the following drawbacks: the classifier used to indicate the classification of comment object words and sentiment words uses a maximum entropy model, which requires a large number of annotations on the dataset.

发明内容SUMMARY OF THE INVENTION

针对现有技术基于MaxEnt-LDA模型评论的方面级观点挖掘方法需要对数据集进行大量标注，本发明提供了一种基于词典改进LDA模型的评论方面观点级挖掘方法，其目的在于使用尽可能少的用标注数据来解决网络评论方面级观点挖掘的问题。In view of the fact that the prior art aspect-level opinion mining method based on MaxEnt-LDA model reviews requires a large number of data sets to be labeled, the present invention provides a review aspect-level opinion mining method based on dictionary improvement LDA model, the purpose of which is to use as little as possible. Using labeled data to solve the problem of aspect-level opinion mining in online reviews.

为实现上述目的，按照本发明的第一方面，提供了一种基于词典改进LDA模型的评论方面观点级挖掘方法，该方法包括以下步骤：In order to achieve the above object, according to the first aspect of the present invention, a method for opinion-level mining of comments based on a dictionary improved LDA model is provided, and the method includes the following steps:

S1.基于原始网络评论库，构建倒排索引列表；S1. Build an inverted index list based on the original network comment library;

S2.对原始网络评论库各句子进行去停词处理，得到预处理后网络评论库；S2. Perform stop-word processing on each sentence of the original online comment database, and obtain the pre-processed network comment database;

S3.将预处理后网络评论库输入基于SentiWordNet与WordNet的改进LDA模型，采用吉布斯抽样，得到抽样结果；S3. Input the preprocessed network comment library into the improved LDA model based on SentiWordNet and WordNet, and use Gibbs sampling to obtain the sampling result;

S4.对抽样结果进行排序，选取属于对应评价类别的概率排名前m的单词，根据单词的倒排索引找到具体的句子。S4. Sort the sampling results, select the top m words belonging to the corresponding evaluation category in terms of probability, and find the specific sentence according to the inverted index of the word.

具体地，步骤S1包括以下子步骤：Specifically, step S1 includes the following sub-steps:

S11.以二元组<a，b>的形式对原始网络评论库中各句子各单词进行编号，a表示所在句子的编号，b表示在句中单词的编号；S11. Number each word in each sentence in the original online comment database in the form of a binary group <a, b>, where a represents the number of the sentence in which it is located, and b represents the number of the word in the sentence;

S12.去除原始网络评论库中重复的词，并记录剩余单词编号；S12. Remove the repeated words in the original online comment database, and record the number of the remaining words;

S13.基于去重后的单词编号，生成倒排索引列表。S13. Based on the deduplicated word numbers, generate an inverted index list.

具体地，步骤S3包括以下子步骤：Specifically, step S3 includes the following sub-steps:

S31.将网络评论库的方面直接设置为种子词；S31. Directly set the aspect of the online comment base as the seed word;

S32.将网络评论库中的评论文本以句子为单位进行划分，形成一个评论文本句子集合；S32. Divide the comment text in the online comment database into sentence units to form a set of comment text sentences;

S33.基于单词与种子词之间的相似度，为每个句子设置不同参数α_d；基于单词与种子词之间的语义相似度，将为每个主题单独设置对方面级对象词、积极评论词、消极评论词分别设置参数β_t，A、β_t，P、β_t，N；S33. Based on the similarity between the word and the seed word, set different parameters α _d for each sentence; based on the semantic similarity between the word and the seed word, set the aspect-level object word, positive comments for each topic separately Set the parameters β _{t, A} , β _{t, P} , β _{t, N} for words and negative comment words respectively;

S34.采用吉布斯抽样评论文本句子集合，对基于SentiWordNet与WordNet的改进LDA模型进行参数估计与推理。S34. Using Gibbs sampling comment text sentence set, perform parameter estimation and inference on the improved LDA model based on SentiWordNet and WordNet.

具体地，specifically,

β_t，A＝sim(w，A)*β_base β _{t, A} =sim(w, A)*β _base

β_t，P＝sim(w，P)*β_base β _t,P =sim(w,P)*β _base

β_t，N＝sim(w，N)*β_base β _{t, N} =sim(w, N)*β _base

其中，N_d为当前句子中所有词的个数，T为主题个数，w_d，i为当前句子中的第i个词，t为种子词，sim(w,t)表示w与种子词t语义相似度，α_base表示标准LDA模型中主题服从狄利克雷分布的定值参数α；sim(w，A)表示w属于对象词的概率，sim(w，P)表示w属于积极词的概率，sim(w，N)表示w属于消极词的概率，β_base为标准LDA模型中单词服从狄利克雷分布的定值参数β。Among them, N _d is the number of all words in the current sentence, T is the number of topics, w _{d, i} is the ith word in the current sentence, t is the seed word, and sim(w, t) represents w and the seed word t Semantic similarity, α _base represents the fixed value parameter α of the topic obeying Dirichlet distribution in the standard LDA model; sim(w, A) represents the probability that w belongs to the object word, sim(w, P) represents that w belongs to the positive word Probability, sim(w, N) represents the probability that w belongs to a negative word, and β _base is the fixed value parameter β of the word obeying Dirichlet distribution in the standard LDA model.

具体地，步骤S34包括以下步骤：Specifically, step S34 includes the following steps:

(1)当语料中每个句子中的每个单词，随机赋一个主题编号，为句中所有单词随机设置指示变量y和指示变量v的值，其中，当y＝A时，表示当前单词为评论的方面级对象，当y＝O时，当前单词为评论观点，当v＝P时表示当前单词为积极情感，当v＝N时表示当前单词为消极情感；(1) When each word in each sentence in the corpus is randomly assigned a topic number, the values of the indicator variable y and the indicator variable v are randomly set for all the words in the sentence, where, when y=A, it means that the current word is The aspect-level object of the comment, when y=O, the current word is a comment point of view, when v=P, it means that the current word is a positive emotion, and when v=N, it means that the current word is a negative emotion;

(2)重新扫描语料库，对每个词，按照公式(1)重新采样更新它的主题编号，在语料库中进行更新该词的编号，根据公式(2)和(3)对指示变量y和指示变量v进行重新采样，并进行更新；(2) Rescan the corpus, for each word, resample and update its topic number according to formula (1), update the number of the word in the corpus, according to formulas (2) and (3) for the indicator variables y and indicator The variable v is resampled and updated;

其中，z_d，n表示第d个评论语句所属主题，t表示主题编号，y_d，n指示第d个句子第n个单词为方面级对象词或情感观点词，v_d，n指示第d个句子第n个单词为积极情感词或消极情感词，

表示主题为t类别为q的词v的个数，

表示主题为t类别为q的词v的狄利克雷分布参数，

表示主题为t类别为u的词v的个数，

表示主题为t类别为u的词v的狄利克雷分布参数，V表示语料库中单词的个数，w_d，n表示第d个评论语句中的第n个词，则

表示主题为t类别为u的词w_d，n的个数，

表示主题为t类别为u的词w_d，n的狄利克雷分布参数，

表示主题为t类别为q的词w_d，n的个数，

表示主题为t类别为q的词w_d，n的狄利克雷分布参数，n_d，t表示第d个句子的主题为t的词个数，α_d，t表示第d个评论语句主题为t的狄利克雷分布参数，

i表示非i；Among them, z _{d, n} indicates the topic of the d-th comment sentence, t indicates the topic number, y _{d, n} indicates that the n-th word of the d-th sentence is an aspect-level object word or sentiment opinion word, and v _{d, n} indicates the d-th sentence. The nth word of each sentence is a positive emotion word or a negative emotion word,

represents the number of words v whose topic is t and whose category is q,

represents the parameters of the Dirichlet distribution of word v with topic t and category q,

Represents the number of words v whose topic is t and whose category is u,

Represents the Dirichlet distribution parameter of the word v with the topic t and the category u, V represents the number of words in the corpus, w _{d, n} represents the nth word in the dth comment sentence, then

Represents the number of words w _{d and n} whose topic is t and whose category is u,

represents the parameters of the Dirichlet distribution of words w _{d, n} with topic t and category u,

Represents the number of words w _{d, n} whose topic is t and whose category is q,

Represents the Dirichlet distribution parameters of the word w _{d, n} with the topic t and the category q, n _{d, t} represents the number of words with the topic t of the d-th sentence, α _{d, t} represents the d-th comment sentence The topic is the parameters of the Dirichlet distribution of t,

i means not i;

(3)重复以上语料库的重新采样直至吉布斯采样收敛；(3) Repeat the resampling of the above corpus until the Gibbs sampling converges;

(4)统计语料库中各个句子各个单词的主题，得到句子-主题的概率分布

统计语料库中各个主题词的分布，得到主题与单词的概率分布

(4) Count the topics of each word in each sentence in the corpus, and obtain the probability distribution of sentence-topic

Calculate the distribution of each subject word in the corpus to obtain the probability distribution of the subject and the word

具体地，句子d主题t的概率分布计算公式如下：Specifically, the formula for calculating the probability distribution of sentence d and topic t is as follows:

以t为主题，单词w_d，n为评价的方面级对象词的概率分布计算公式如下：Taking t as the subject, and the words w _{d and n} as the evaluation aspect-level object words, the probability distribution calculation formula is as follows:

以t为主题，单词w_d，n为评价的积极观点词概率分布计算公式如下：Taking t as the theme and the words w _{d and n} as the evaluation, the probability distribution of positive opinion words is calculated as follows:

以t为主题，单词w_d，n为评价的消极观点词的概率分布计算公式如下：Taking t as the theme and the words w _{d and n} as the evaluation, the probability distribution of negative opinion words is calculated as follows:

其中，n_d表示第d个句子的词个数。Among them, n _d represents the number of words in the d-th sentence.

具体地，基于SentiWordNet与WordNet改进LDA模型的文档生成过程如下：Specifically, the document generation process based on SentiWordNet and WordNet to improve the LDA model is as follows:

(1)从狄利克雷分布β_t，A、β_t，P、β_t，N中取样，生成评论的方面级对象分布

评论的积极观点词分布

评论的消极观点词分布

(1) Sampling from the Dirichlet distribution β _t,A , β _t,P , β _t,N to generate an aspect-level object distribution of reviews

Positive opinion word distribution for reviews

Negative opinion word distribution of reviews

(2)对每个句子，从狄利克雷分布α_d中取样，生成主题分布θ_d～Dir(α_d)；(2) For each sentence, sample from Dirichlet distribution α _d to generate topic distribution θ _d ~Dir(α _d );

(3)从主题的多项分布θ_d取样，生成句子d中的单词w_d，n的主题Z_d，n～Multi(θ_d)；(3) Sampling from the multinomial distribution θ _d of the topic, and generating the topic Z _d,n ~Multi(θ _d ) of the words w _d,n in the sentence d;

(4)由WordNet和SentiWordNet计算，得到参数为π_d，n的{0,1}上的伯努利分布和参数为Ω_d，n的{0,1}上的伯努利分布；(4) Calculated by WordNet and SentiWordNet, the Bernoulli distribution on {0,1} with parameter π _{d, n} and the Bernoulli distribution on {0, 1} with parameter Ω _{d, n} are obtained;

(5)由参数为π_d，n的{0,1}上的伯努利分布抽取，获得指示单词w_d，n为评论的方面级对象词或评论观点词y_d，n，由参数为Ω_d，n的{0,1}上的伯努利分布抽取，获得指示单词w_d，n为积极观点评论词或消极观点评论词v_d，n；(5) Extract from the Bernoulli distribution on {0, 1} with parameters π _{d, n} , and obtain the indicator word w _{d, n} is the aspect-level object word of the review or the review opinion word y _{d, n} , with the parameters as Extract the Bernoulli distribution on {0,1} of Ω _d,n to obtain the indicator word w _d,n is a positive opinion comment word or a negative opinion comment word v _d,n ;

(6)按照以下公式，生成单词w_d，n (6) According to the following formula, generate the word w _{d, n}

具体地，步骤(4)包括以下步骤：Specifically, step (4) includes the following steps:

(4.1)在WordNet中查询当前单词w_d，n的语义解释为s_d，n，k，计算w_d，n各语义s_d，n，k与各种子词w_t语义s_t，k0之间的相似度Sim(S_d，n，k，s_t，k0)，在所有计算结果中，取相似度结果最大值，并确定结果最大时的k′，为当前单词w_d，n在该语句中所属语义s_d，n，k′；(4.1) Query the semantic interpretation of the current word w _{d, n} in WordNet as s _{d, n, k} , calculate the semantics of w _{d, n and} the semantics s _{d, n, k} of each subword w _t _{, k0} The similarity Sim(S _{d, n, k} , s _{t, k0} ), among all the calculation results, take the maximum similarity result, and determine the k' when the result is the largest, which is the current word w _{d, n} in this the semantics s _{d, n, k′} to which the sentence belongs;

(4.2)在SentiWordNet中查询s_d，n，k′的各情感的得分

(4.2) Query the scores of each emotion of s _{d, n, k'} in SentiWordNet

(4.3)基于各情感的得分

计算参数π_d，n和Ω_d，n。(4.3) Score based on each emotion

Calculate the parameters π _d,n and Ω _d,n .

具体地，specifically,

为实现上述目的，按照本发明的第二方面，提供了一种计算机可读存储介质，其特征在于，所述计算机可读存储介质上存储有计算机程序，所述计算机程序被处理器执行时实现如第一方面所述的基于词典改进LDA模型的评论方面观点级挖掘方法。In order to achieve the above object, according to a second aspect of the present invention, a computer-readable storage medium is provided, wherein a computer program is stored on the computer-readable storage medium, and the computer program is implemented when executed by a processor As described in the first aspect, the opinion-level mining method of the review aspect based on the dictionary-based improved LDA model.

总体而言，通过本发明所构思的以上技术方案，能够取得以下有益效果：In general, through the above technical solutions conceived by the present invention, the following beneficial effects can be achieved:

(1)针对标准LDA模型，最终得到的主题-文档概率和主题-词概率中，主题词并未确定，需要人工筛选确定，本发明将网络评论库的方面直接设置为种子词，不需要额外的标注数据集。(1) For the standard LDA model, in the topic-document probability and topic-word probability finally obtained, the topic word is not determined and needs to be determined manually. Annotated datasets.

(2)本发明在LDA模型的基础上，加入观点词与评价对象分类层，实现观点与方面的分离。借助用户配置的种子词和WordNet、SentiWordNet两大工具来解决少监督方面级观点挖掘问题，通过设置种子词，利用WordNet中词汇相似度的计算工具，计算文本中单词与种子词的相似度，并反映在LDA模型参数上。同时利用SentiWordNet中计算词汇情感的工具，将评价对象词与评论观点进行分离，并对评论观点的正负极性进行分类。通过计算语料库中的单词与种子词的相似度对LDA模型参数进行偏置，提高模型的效果。(2) On the basis of the LDA model, the present invention adds a classification layer of opinion words and evaluation objects to realize the separation of opinion and aspect. With the help of user-configured seed words and the two tools of WordNet and SentiWordNet to solve the problem of less-supervised aspect-level opinion mining, by setting seed words, using the word similarity calculation tool in WordNet, calculate the similarity between words and seed words in the text, and Reflected on the LDA model parameters. At the same time, the tool for calculating lexical sentiment in SentiWordNet is used to separate the evaluation object words from the comment opinions, and classify the positive and negative polarity of the comment opinions. By calculating the similarity between the words in the corpus and the seed words, the parameters of the LDA model are biased to improve the effect of the model.

(3)针对标准LDA模型中最终结果为主题-文档和主题-单词概率，缺乏单词和文档之间的联系，本发明基于倒排索引，由单词查询所属的语句，将聚类结果同种子词及原文建立联系，提高结果的可读性。(3) For the final result in the standard LDA model, the probability of topic-document and topic-word, lacking the connection between words and documents, the present invention is based on the inverted index, and the sentence to which the word belongs is queried, and the clustering result is the same as the seed word and the original text to establish a link to improve the readability of the results.

附图说明Description of drawings

图1为本发明实施例提供的一种基于词典改进LDA模型的评论方面观点级挖掘方法流程图；1 is a flowchart of a method for opinion-level mining of comments based on a dictionary-based improved LDA model provided by an embodiment of the present invention;

图2为本发明实施例提供的SWLDA模型的文档生成过程示意图。FIG. 2 is a schematic diagram of a document generation process of a SWLDA model provided by an embodiment of the present invention.

具体实施方式Detailed ways

为了使本发明的目的、技术方案及优点更加清楚明白，以下结合附图及实施例，对本发明进行进一步详细说明。应当理解，此处所描述的具体实施例仅仅用以解释本发明，并不用于限定本发明。此外，下面所描述的本发明各个实施方式中所涉及到的技术特征只要彼此之间未构成冲突就可以相互组合。In order to make the objectives, technical solutions and advantages of the present invention clearer, the present invention will be further described in detail below with reference to the accompanying drawings and embodiments. It should be understood that the specific embodiments described herein are only used to explain the present invention, but not to limit the present invention. In addition, the technical features involved in the various embodiments of the present invention described below can be combined with each other as long as they do not conflict with each other.

首先，对发明涉及的各术语和变量解释如下：First, the terms and variables involved in the invention are explained as follows:

评论：人对某种事物持有的观点,以及对所持有观点的阐释。Comment: A person's opinion about something, and an interpretation of that opinion.

方面：评论文本中所谈论的有关评论对象的属性等细节方面。评论者在对评论对象进行评论时，会首先确定要谈论的方面，选择相应的词语来表征该方面，然后根据自己对这一方面的观点选择具有特定情感倾向的观点词对其进行评价。例如，当评论者对某个宾馆进行评价时，会用price、value等词来表示所要谈论的“价格”这个方面，并随后选用high、cheap等词表达对其观点。Aspect: A detailed aspect about the properties of the review object that is talked about in the review text. When commenting on the object of the review, the reviewer will first determine the aspect to be discussed, select the corresponding words to represent the aspect, and then select the opinion words with specific emotional tendencies to evaluate it according to their own views on this aspect. For example, when reviewers evaluate a hotel, they will use words such as price and value to express the aspect of "price" they want to talk about, and then use words such as high and cheap to express their opinions.

评论的方面观点挖掘：利用文本分析与挖掘技术找到评论文本中谈论的有关评论的多个方面，并分析评论者对评论对象各个方面观点的情感倾向程度，以一定的形式将分析结果展示出来。在评论的方面级观点挖掘中，需要从文本信息中提取评论的方面、方面级评论对象和观点。例如，在餐馆评论中，“食物(food)”、“服务(service)”、“环境(ambiance)”即为评论的方面，在“食物”方面，“牛排(steak)”为方面级评论对象，而针对“牛排”的评价“好”则是评论的观点。Aspect mining of comments: Use text analysis and mining technology to find multiple aspects of comments discussed in the comment text, analyze the sentimental tendency of commenters to various aspects of the comments, and display the analysis results in a certain form. In the aspect-level opinion mining of reviews, it is necessary to extract the aspect of the review, the aspect-level review object and the opinion from the textual information. For example, in a restaurant review, "food", "service", and "ambiance" are aspects of the review, and in terms of "food", "steak" is an aspect-level review object , while a "good" rating for "steak" is a review's point of view.

主题模型：在评论与单词间建立“主题”这一桥梁。Topic Model: Build the "topic" bridge between comments and words.

表1符号说明Table 1 Symbol Description

本发明在已知网络评论库中的评论的方面的情况下，对各方面的方面级评论对象词和情感观点词进行抽取，同时确定语料库中句子所属的评论方面。如要进行面向评论方面的情感分析，需要进一步研究情感观点词和方面级评论对象词的联系，以及情感观点词所属的语句。The present invention extracts the aspect-level comment object words and sentiment opinion words of various aspects when the aspects of the comments in the network comment database are known, and simultaneously determines the comment aspects to which the sentences in the corpus belong. In order to conduct sentiment analysis for the comment aspect, it is necessary to further study the relationship between sentiment opinion words and aspect-level comment object words, as well as the sentences to which the sentiment opinion words belong.

如图1所示，本发明提出一种基于词典改进LDA模型的评论方面观点级挖掘方法，该方法包括以下步骤：As shown in FIG. 1 , the present invention proposes a method for opinion-level mining of comments based on a dictionary-based improved LDA model. The method includes the following steps:

步骤S1.基于原始网络评论库，构建倒排索引列表。Step S1. Construct an inverted index list based on the original network comment database.

网络评论信息中蕴含了大量观点信息，需要进行方面级观点挖掘以分析评论者对评论对象各个方面的意见，以求对评论对象有个完整的认识。不同于文档、博客、新闻等文字信息，网络评论往往篇幅较短，常以句子形式出现。Online comment information contains a lot of opinion information, and aspect-level opinion mining needs to be carried out to analyze the comments of commenters on various aspects of the comment object, in order to have a complete understanding of the comment object. Different from textual information such as documents, blogs, news, etc., online comments are often shorter in length and often appear in the form of sentences.

本实施例选用Restaurant英文评论集，以句1{“We,there were four of us,arrived at noon-the place was empty-and the staff acted like we were imposingon them and they were very rude.”“Everything is always cooked to perfection,the service is excellent,the decor cool and understated.”}为例，其评论的方面为{place,staff,cook,service,decor}；以句2{“This restaurant is quite complete andhas a good service attitude,but the location of the restaurant is notconvenient.”}为例，其评论的方面为{restaurant,service,location}，进行后续处理的说明。In this example, the English review collection of Restaurant is selected, and the sentence 1{“We,there were four of us,arrived at noon-the place was empty-and the staff acted like we were imposingon them and they were very rude.”“Everything is always cooked to perfection, the service is excellent, the decor cool and understated."} as an example, the aspect of the review is {place,staff,cook,service,decor}; with sentence 2{"This restaurant is quite complete and has a good service attitude, but the location of the restaurant is notconvenient."} as an example, the aspect of the comment is {restaurant, service, location}, and the description of the follow-up processing.

S11.以二元组<a，b>的形式对原始网络评论库中各句子各单词进行编号，a表示所在句子的编号，b表示在句中单词的编号。S11. Number each word in each sentence in the original online comment database in the form of a binary group <a, b>, where a represents the number of the sentence in which it is located, and b represents the number of the word in the sentence.

如句1中<1,16>表示staff，<1,36>表示service，<1,40>表示decor。句2中<2,10>表示service，<2,14>表示location。For example, in sentence 1, <1,16> means staff, <1,36> means service, and <1,40> means decor. In sentence 2, <2,10> means service, and <2,14> means location.

S12.去除原始网络评论库中重复的词，并记录剩余单词编号。S12. Remove the repeated words in the original online comment database, and record the number of the remaining words.

如<staff:1,16>，<decor:1,40>,<service:1,36；2,10>,<location:2,14>。倒排索引列表中保留了单词所在句子编号以及在句中的位置信息，便于之后利用上下文信息进行搜索。Such as <staff:1,16>, <decor:1,40>,<service:1,36; 2,10>,<location:2,14>. In the inverted index list, the sentence number in which the word is located and the position information in the sentence are retained, which is convenient for subsequent searches using contextual information.

步骤S2.对原始网络评论库各句子进行去停词处理，得到预处理后网络评论库。Step S2. Perform stop-word processing on each sentence of the original network comment database to obtain a preprocessed network comment database.

原始数据集格式为xml和csv两种，需要根据相应的标签和字段对评论语句进行提取。The format of the original data set is xml and csv, and the comment sentences need to be extracted according to the corresponding tags and fields.

网络评论库中的句子包含了许多无用的停词，如句1中的there、the等。在进一步处理之前去除这些停词，以免对结果产生干扰。The sentences in the online comment base contain many useless stop words, such as there, the, etc. in sentence 1. These stop words are removed before further processing so as not to interfere with the results.

步骤S3.将预处理后网络评论库输入基于SentiWordNet与WordNet的改进LDA模型，采用吉布斯抽样，得到抽样结果。Step S3. Input the preprocessed network comment library into the improved LDA model based on SentiWordNet and WordNet, and use Gibbs sampling to obtain the sampling result.

在SWLDA(SentiWordNet WordNet-Latent Dirichlet Allocation)模型中，以LDA主题模型为基础，引入了方面级评论与情感观点词分离层，并通过WordNet中的语义相似度计算以及SentiWordNet的情感因子计算来辅助解决该问题。先生成语句-主题分布，通过多项式分布为每个句子确定一个主题，再由WordNet和SentiWordNet来决定y_d与v_d两个影响因子以指示单词的类别，最终选择一个主题-单词分布并确定最终的单词。In the SWLDA (SentiWordNet WordNet-Latent Dirichlet Allocation) model, based on the LDA topic model, an aspect-level comment and sentiment opinion word separation layer is introduced, and the semantic similarity calculation in WordNet and the sentiment factor calculation of SentiWordNet are used to assist the solution. the question. First generate sentence-topic distribution, determine a topic for each sentence through multinomial distribution, and then use WordNet and SentiWordNet to determine the two impact factors y _d and v _d to indicate the category of words, and finally select a topic-word distribution and determine the final words.

如图2所示，SWLDA模型的文档生成过程如下：As shown in Figure 2, the document generation process of the SWLDA model is as follows:

评论的积极观点词分布

评论的消极观点词分布

Positive opinion word distribution for reviews

Negative opinion word distribution of reviews

(2)对每个句子，从狄利克雷分布α_d中取样，生成主题分布θ_d～Dir(α_d)。(2) For each sentence, sample from the Dirichlet distribution α _d to generate topic distributions θ _d to Dir(α _d ).

(3)从主题的多项分布θ_d取样，生成句子d中的单词w_d，n的主题z_d，n～Multi(θ_d)。(3) Sampling from the multinomial distribution θ _d of the topics, and generating the topic z _d,n ∼Multi(θ _d ) of the words w _d,n in the sentence d.

(4)由WordNet和SentiWordNet计算，得到参数为π_d，n的{0,1}上的伯努利分布和参数为Ω_d，n的{0,1}上的伯努利分布。(4) Calculated by WordNet and SentiWordNet, the Bernoulli distribution on {0, 1} with parameters π _{d, n} and the Bernoulli distribution on {0, 1} with parameters Ω _{d, n} are obtained.

(4.1)在WordNet中查询当前单词w_d，n的语义解释为s_d，n，k，计算w_d，n各语义s_d，n，k与各种子词w_t语义s_t，k0之间的相似度Sim(s_d，n，k，s_t，k0)，在所有计算结果中，取相似度结果最大值，并确定结果最大时的k′，为当前单词w_d，n在该语句中所属语义s_d，n，k，。(4.1) Query the semantic interpretation of the current word w _{d, n} in WordNet as s _{d, n, k} , calculate the semantics of w _{d, n and} the semantics s _{d, n, k} of each subword w _t _{, k0} The similarity Sim(s _{d, n, k} , s _{t, k0} ), among all the calculation results, take the maximum similarity result, and determine the k' when the result is the largest, which is the current word w _{d, n} in this The semantics s _{d, n, k, ,} to which the statement belongs.

(4.2)在SentiWordNet中查询s_d，n，k，的各情感的得分

(该语义为客观词)、

(该语义为积极情感)、

(该语义为消极情感)。(4.2) Query the scores of each emotion of s _{d, n, k,} in SentiWordNet

(the semantics are objective words),

(the semantic is positive emotion),

(The semantic is negative emotion).

SentiWordNet是基于WordNet的情感词典，针对WordNet中每个词的每个解释都给出了正面情感分数、负面情感分数以及中立情感分数，三者在0～1范围内，且相加和为1。本发明利用词向量的余弦距离来计算词之间相似度。SentiWordNet is a sentiment dictionary based on WordNet. For each explanation of each word in WordNet, a positive sentiment score, a negative sentiment score and a neutral sentiment score are given. The three are in the range of 0 to 1, and the sum is 1. The present invention uses the cosine distance of word vectors to calculate the similarity between words.

(4.3)基于各情感的得分

(该语义为客观词)、

(该语义为积极情感)、

计算参数π_d，n和Ω_d，n。(4.3) Score based on each emotion

(the semantics are objective words),

(the semantic is positive emotion),

Calculate the parameters π _d,n and Ω _d,n .

参数为π_d，n的伯努利分布(用于分离方面级对象词与情感观点词)与参数为Ω_d，n的伯努利分布(用于分离积极情感词和消极情感词)与WordNet和SentiWordNet有关。π_d，n的计算依赖于种子词w_t。Bernoulli distribution with parameters π _{d, n} (for separating aspect-level object words and sentiment opinion words) and Bernoulli distribution with parameters Ω _{d, n} (for separating positive sentiment words and negative sentiment words) and WordNet Related to SentiWordNet. The computation of π _d,n depends on the seed word w _t .

(5)由参数为π_d，n的{0,1}上的伯努利分布抽取，获得指示单词w_d，n为评论的方面级对象词或评论观点词y_d，n，由参数为Ω_d，n的{0,1}上的伯努利分布抽取，获得指示单词w_d，n为积极观点评论词或消极观点评论词v_d，n。(5) Extract from the Bernoulli distribution on {0, 1} with parameters π _{d, n} , and obtain the indicator word w _{d, n} is the aspect-level object word of the review or the review opinion word y _{d, n} , with the parameters as Ω _d,n is extracted from the Bernoulli distribution on {0,1}, and the indicator word w _d,n is obtained as a positive opinion comment word or a negative opinion comment word v _d,n .

参数为π_d，n的{0,1}上的伯努利分布抽取用来指示单词w_d，n为评论的方面级对象词或评论观点词y_d，n；参数为Ω_d，n的{0,1}上的伯努利分布抽取用来指示单词w_d，n为积极观点评论词或消极观点评论词v_d，n。The Bernoulli distribution over {0,1} with parameters π _d,n is used to indicate that the word w _d,n is the aspect-level object word of the review or the review opinion word y _d,n ; the parameter is Ω _d,n The Bernoulli distribution over {0,1} is used to indicate that the words w _d,n are positive opinion comments or negative opinion comments v _d,n .

为了表示评论的方面级对象和评论观点的分离，引入了变量y_d∈{A，O}进行表示。当y_d＝A时，表示当前单词为评论的方面级对象，当y_d＝O时，当前单词为评论观点。当v_d＝P时表示当前单词为积极情感，当v_d＝N时表示当前单词为消极情感。y_d与v_d都由基于WordNet和SentiWordNet的相关算法决定。In order to represent the separation of review aspect-level objects and review viewpoints, a variable y _d ∈ {A, O} is introduced to represent it. When y _d =A, it indicates that the current word is an aspect-level object of the review, and when y _d =O, the current word is a review point of view. When v _d =P, it means that the current word is a positive emotion, and when v _d =N, it means that the current word is a negative emotion. Both y _d and v _d are determined by related algorithms based on WordNet and SentiWordNet.

(6)按照以下公式，生成单词w_d，n。(6) According to the following formula, the word w _d,n is generated.

步骤S3包括以下子步骤：Step S3 includes the following sub-steps:

S31.将网络评论库的方面直接设置为种子词。S31. Directly set the aspect of the network comment base as the seed word.

将网络评论库的方面直接设置为种子词，如句1中的{place,staff,cook,service,decor}，种子词记为w_t，t∈{1，...，T}。设置种子词时，需要明确指出该词在WordNet中的语义，即确定种子词s_t在WordNet中的语义解释s_t，k0。当确定了各种子词的语义后，种子词即可当作主题。The aspect of the online review database is directly set as the seed word, such as {place, staff, cook, service, decor} in sentence 1, and the seed word is recorded as w _t , t∈{1,...,T}. When setting a seed word, it is necessary to clearly indicate the semantics of the word in WordNet, that is, to determine the semantic interpretation _{st, k0} of the seed word _st in WordNet. When the semantics of various subwords are determined, the seed words can be used as topics.

S32.将网络评论库中的评论文本以句子为单位进行划分，形成一个评论文本句子集合。S32. Divide the comment texts in the online comment database into sentence units to form a set of comment text sentences.

S33.基于单词与种子词之间的相似度，为每个句子设置不同参数α_d；基于单词与种子词之间的语义相似度，将为每个主题单独设置对方面级对象词、积极评论词、消极评论词分别设置参数β_t，A、β_t，P、β_t，N。S33. Based on the similarity between the word and the seed word, set different parameters α _d for each sentence; based on the semantic similarity between the word and the seed word, set the aspect-level object word, positive comments for each topic separately The parameters β _t,A , β _t,P , β _t,N are respectively set for words and negative comment words.

β_t，A＝sim(w，A)*β_base β _{t, A} =sim(w, A)*β _base

β_t，P＝sim(w，P)*β_base β _t,P =sim(w,P)*β _base

β_t，N＝sim(w，N)*β_base β _{t, N} =sim(w, N)*β _base

其中，N_d为当前句子中所有词的个数，T为主题个数，w_d，i为当前句子中的第i个词，t为种子词，sim(w,t)表示w与种子词t语义相似度，α_base表示标准LDA模型中主题服从狄利克雷分布的定值参数α。sim(w，A)表示w属于对象词，sim(w，P)表示w属于积极词的概率，sim(w，N)表示w属于消极词的概率，β_base为标准LDA模型中单词服从狄利克雷分布的定值参数β。Among them, N _d is the number of all words in the current sentence, T is the number of topics, w _{d, i} is the ith word in the current sentence, t is the seed word, and sim(w, t) represents w and the seed word t Semantic similarity, α _base represents the fixed value parameter α of the topic subject to Dirichlet distribution in the standard LDA model. sim(w, A) indicates that w belongs to the object word, sim(w, P) indicates the probability that w belongs to a positive word, sim(w, N) indicates the probability that w belongs to a negative word, and β _base is the standard LDA model. The fixed value parameter β of the Lickley distribution.

S34.采用吉布斯抽样，对基于SentiWordNet与WordNet的改进LDA模型进行参数估计与推理。S34. Using Gibbs sampling, parameter estimation and inference for the improved LDA model based on SentiWordNet and WordNet.

训练流程如下：The training process is as follows:

(1)当语料中每个句子中的每个单词，随机赋一个主题编号，为句中所有单词随机设置指示变量y和指示变量v的值。(1) For each word in each sentence in the corpus, a topic number is randomly assigned, and the values of the indicator variable y and the indicator variable v are randomly set for all the words in the sentence.

y∈{A，O}，v∈{P，N}。这里将y、u数字化，A和P对应0，O和N对应1，即y∈{0，1}，v∈{0，1}。y ∈ {A, O}, v ∈ {P, N}. Here, y and u are digitized, A and P correspond to 0, and O and N correspond to 1, that is, y∈{0,1}, v∈{0,1}.

(2)重新扫描语料库，对每个词，按照公式1重新采样更新它的主题编号，在语料库中进行更新该词的编号，根据公式2和3对指示变量y和指示变量v进行重新采样，并进行更新。(2) Rescan the corpus, for each word, resample and update its topic number according to formula 1, update the number of the word in the corpus, and resample the indicator variable y and the indicator variable v according to formulas 2 and 3, and update it.

SWLDA模型的吉布斯采样公式如下：The Gibbs sampling formula for the SWLDA model is as follows:

右边其实就是p(topic|doc)*p(word|topic)，这个概率其实就是doc→topic→word的路径概率。The right side is actually p(topic|doc)*p(word|topic), and this probability is actually the path probability of doc→topic→word.

其中，

表示非i。in,

means non-i.

(3)重复以上语料库的重新采样直至吉布斯采样收敛。(3) Repeat the resampling of the above corpus until the Gibbs sampling converges.

句子d主题t的狄利克雷分布期望计算公式如下：The formula for calculating the expectation of the Dirichlet distribution of the topic t of sentence d is as follows:

当

认为语句d的主题为t，即得评论主题-句子信息。when

Considering that the topic of sentence d is t, the comment topic-sentence information is obtained.

为单词w_d，n的主题为t，类别为y的概率，其中，y＝0表示评论对象词，y＝1表示观点词。具体计算方式如下：

is the probability that the topic of word w _{d, n} is t, and the category is y, where y=0 represents the comment object word, and y=1 represents the opinion word. The specific calculation method is as follows:

以t为主题，单词w_d，n为评价的方面级对象词的狄利克雷分布期望计算公式如下：Taking t as the subject and the words w _{d and n} as the evaluation aspect-level object words, the Dirichlet distribution expectation calculation formula is as follows:

以t为主题，单词w_d，n为评价的积极观点词的狄利克雷分布期望计算公式如下：Taking t as the theme and the words w _{d and n} as the evaluation of positive opinion words, the Dirichlet distribution expectation calculation formula is as follows:

以t为主题，单词w_d，n为评价的消极观点词的狄利克雷分布期望计算公式如下：Taking t as the theme and the words w _{d and n} as the evaluation of the negative opinion words, the Dirichlet distribution expectation calculation formula is as follows:

当

时认为单词的主题为t，类别为y，即得到评论主题-单词信息和评论主题-观点词信息。when

When the topic of the word is considered to be t and the category is y, the comment topic-word information and comment topic-opinion word information are obtained.

步骤S4.对抽样结果进行排序，选取属于对应评价类别的概率排名前m的单词，根据单词的倒排索引找到具体的句子。Step S4. Sort the sampling results, select the top m words that belong to the corresponding evaluation category in the probability ranking, and find the specific sentence according to the inverted index of the word.

以result来表示一条最终结果。result.tpic为主题信息，即设置的种子词，也可当作评论范畴词。result.wor保存了原始的单词。result.wordType为当前词的类别(方面级评论对象词、积极观点词、消极观点词)。result.senteces为该词所属的句子。result.prob为该词成为评论范畴下所属的类别词的概率。为所有主题下各个分类生成前m个result。基于倒排索引，可由单词查询所属的语句，另外可得<主题，评论对象，观点，原始语句>信息。A final result is represented by result. result.tpic is the topic information, that is, the set seed word, which can also be used as a comment category word. result.wor holds the original word. result.wordType is the category of the current word (aspect-level comment object word, positive opinion word, negative opinion word). result.sentices is the sentence to which the word belongs. result.prob is the probability that the word becomes a category word under the review category. Generate the top m results for each category under all topics. Based on the inverted index, the sentence to which it belongs can be queried by word, and the information of <subject, comment object, opinion, original sentence> can be obtained.

本领域的技术人员容易理解，以上所述仅为本发明的较佳实施例而已，并不用以限制本发明，凡在本发明的精神和原则之内所作的任何修改、等同替换和改进等，均应包含在本发明的保护范围之内。Those skilled in the art can easily understand that the above are only preferred embodiments of the present invention, and are not intended to limit the present invention. Any modifications, equivalent replacements and improvements made within the spirit and principles of the present invention, etc., All should be included within the protection scope of the present invention.

Claims

1. a review aspect opinion-level mining method based on dictionary improvement LDA model, it is characterized in that, described aspect is the attribute details of the relevant review object talked about in the review text, and the method comprises the following steps:

S1. Build an inverted index list based on the original network comment library;

S2. Perform stop-word processing on each sentence of the original online comment database, and obtain the pre-processed network comment database;

S3. Input the preprocessed network comment database into the improved LDA model based on the dictionary SentiWordNet and WordNet, and use Gibbs sampling to obtain the sampling result;

S4. Sort the sampling results, select the top m words belonging to the probability ranking of the corresponding evaluation category, and find the specific sentence according to the inverted index of the word;

Step S3 includes the following sub-steps:

S31. Directly set the aspect of the online comment base as the seed word;

S32. Divide the comment text in the online comment database into sentence units to form a set of comment text sentences;

S33. Based on the similarity between the word and the seed word, set different parameters α _d for each sentence; based on the semantic similarity between the word and the seed word, set the aspect-level object word, positive comments for each topic separately Set parameters β _t,A , β _t,P , β _t,N for words and negative comment words respectively;

S34. Using Gibbs sampling comment text sentence set, perform parameter estimation and reasoning on the improved LDA model based on dictionary SentiWordNet and WordNet.

2. The method of claim 1, wherein step S1 comprises the following substeps:

S11. Number each word in each sentence in the original online comment database in the form of a binary group <a, b>, where a represents the number of the sentence in which it is located, and b represents the number of the word in the sentence;

S12. Remove the repeated words in the original online comment database, and record the number of the remaining words;

S13. Based on the deduplicated word numbers, generate an inverted index list.

3. The method of claim 1, wherein

β _{t, A} =sim(w, A)*β _base

β _t,P =sim(w,P)*β _base

β _{t, N} =sim(w, N)*β _base

Among them, N _d is the number of all words in the current sentence, T is the number of topics, w _{d, i} is the ith word in the current sentence, t is the seed word, and sim(w, t) represents w and the seed word t Semantic similarity, α _base represents the fixed value parameter α of the topic obeying Dirichlet distribution in the standard LDA model; sim(w, A) represents the probability that w belongs to the object word, sim(w, P) represents that w belongs to the positive word Probability, sim(w, N) represents the probability that w belongs to a negative word, and β _base is the fixed value parameter β of the word obeying Dirichlet distribution in the standard LDA model.

4. The method of claim 1, wherein step S34 comprises the following steps:

(1) When each word in each sentence in the corpus is randomly assigned a topic number, the values of the indicator variable y and the indicator variable v are randomly set for all the words in the sentence. When y=A, it means that the current word is The aspect-level object of the comment, when y=O, it means that the current word is a comment opinion, when v=P, it means that the current word is a positive emotion, and when v=N, it means that the current word is a negative emotion;

(2) Rescan the corpus, for each word, resample and update its topic number according to formula (1), update the number of the word in the corpus, and change the indicator variables y and indicator according to formulas (2) and (3). The variable v is resampled and updated;

Among them, z _{d, n} indicates the topic of the d-th comment sentence, t indicates the topic number, y _{d, n} indicates that the n-th word of the d-th sentence is an aspect-level object word or sentiment opinion word, and v _{d, n} indicates the d-th sentence. The nth word of each sentence is a positive emotion word or a negative emotion word,

represents the number of words v whose topic is t and whose category is q,

Represents the number of words v whose topic is t and whose category is u,

Represents the Dirichlet distribution parameter of the word v with the topic t and the category u, y represents the number of words in the corpus, w _{d, n} represents the nth word in the dth comment sentence, then

means non-i;

(3) Repeat the resampling of the above corpus until the Gibbs sampling converges;

5. The method of claim 4, wherein

The formula for calculating the probability distribution of sentence d and topic t is as follows:

Taking t as the subject, and the words w _{d and n} as the evaluation aspect-level object words, the probability distribution calculation formula is as follows:

Taking t as the theme and the words w _{d and n} as the evaluation, the probability distribution of positive opinion words is calculated as follows:

Taking t as the theme and the words w _{d and n} as the evaluation, the probability distribution of negative opinion words is calculated as follows:

Among them, n _d represents the number of words in the d-th sentence.

6. method as claimed in claim 4 is characterized in that, the document generation process that improves LDA model based on dictionary SentiWordNet and WordNet is as follows:

Positive opinion word distribution for reviews

Negative opinion word distribution of reviews

(2) For each sentence, sample from Dirichlet distribution α _d to generate topic distribution θ _d ~Dir(α _d );

(3) Sampling from the multinomial distribution θ _d of the topic, and generating the topic z _d,n ~Multi(θ _d ) of the words w _{d and n} in the sentence d;

(4) Calculated by the dictionary WordNet and SentiWordNet to obtain the Bernoulli distribution on {0, 1} with parameters π _{d, n} and the Bernoulli distribution on {0, 1} with parameters Ω _{d, n} ;

(5) Extract from the Bernoulli distribution on {0, 1} with parameters π _{d, n} , and obtain the indication word w _{d, n} is the aspect-level object word of the review or the review opinion word y _{d, n} , with the parameters as Ω _d,n is extracted from the Bernoulli distribution on {0, 1}, and the indicator word w _d,n is obtained as a positive opinion comment word or a negative opinion comment word v _d,n ;

(6) According to the following formula, generate the word w _{d, n}

7. The method of claim 6, wherein step (4) comprises the following steps:

(4.1) Query the semantic interpretation of the current word w _{d, n} in WordNet as s _{d, n, k} , calculate the semantics of w _{d, n and} the semantics s _{d, n, k} of each subword w _t _{, k0} The similarity Sim(s _{d, n, k} , s _{t, k0} ), in all calculation results, take the maximum similarity result, and determine the k' when the result is the largest, which is the current word w _{d, n} in this the semantics s _{d, n, k′} to which the sentence belongs;

(4.2) Query the scores of each emotion of s _{d, n, k'} in SentiWordNet

(4.3) Score based on each emotion

Calculate the parameters π _d,n and Ω _d,n .

8. The method of claim 7, wherein

9 . A computer-readable storage medium, characterized in that, a computer program is stored on the computer-readable storage medium, and when the computer program is executed by a processor, the computer program according to any one of claims 1 to 8 is implemented. Lexicon-improving opinion-level mining methods for the comment aspect of LDA models.