WO2019218508A1 - Method for detecting deceptive e-commerce reviews based on topic-sentiment joint probability - Google Patents

Method for detecting deceptive e-commerce reviews based on topic-sentiment joint probability Download PDF

Info

Publication number
WO2019218508A1
WO2019218508A1 PCT/CN2018/100372 CN2018100372W WO2019218508A1 WO 2019218508 A1 WO2019218508 A1 WO 2019218508A1 CN 2018100372 W CN2018100372 W CN 2018100372W WO 2019218508 A1 WO2019218508 A1 WO 2019218508A1
Authority
WO
WIPO (PCT)
Prior art keywords
model
stm
emotion
subject
models
Prior art date
Application number
PCT/CN2018/100372
Other languages
English (en)
French (fr)
Inventor
纪淑娟
董鲁豫
张纯金
张琪
李达
Original Assignee
山东科技大学
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by 山东科技大学 filed Critical 山东科技大学
Priority to US16/769,009 priority Critical patent/US11100283B2/en
Publication of WO2019218508A1 publication Critical patent/WO2019218508A1/zh

Links

Images

Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00Handling natural language data
    • G06F40/20Natural language analysis
    • G06F40/205Parsing
    • G06F40/216Parsing using statistical methods
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00Pattern recognition
    • G06F18/20Analysing
    • G06F18/24Classification techniques
    • G06F18/241Classification techniques relating to the classification model, e.g. parametric or non-parametric approaches
    • G06F18/2411Classification techniques relating to the classification model, e.g. parametric or non-parametric approaches based on the proximity to a decision surface, e.g. support vector machines
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00Handling natural language data
    • G06F40/30Semantic analysis
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06QINFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR ADMINISTRATIVE, COMMERCIAL, FINANCIAL, MANAGERIAL OR SUPERVISORY PURPOSES; SYSTEMS OR METHODS SPECIALLY ADAPTED FOR ADMINISTRATIVE, COMMERCIAL, FINANCIAL, MANAGERIAL OR SUPERVISORY PURPOSES, NOT OTHERWISE PROVIDED FOR
    • G06Q30/00Commerce
    • G06Q30/018Certifying business or products
    • G06Q30/0185Product, service or business identity fraud

Definitions

  • the invention belongs to the fields of natural language processing, data mining and machine learning, and particularly relates to a method for detecting deceptive e-commerce reviews based on the topic-sentiment joint probability.
  • this patent proposes the Sentiment Joint Topic Model (STM), which extends the LDA model with a sentiment layer so that the topics in a review text and the sentiment information associated with them can be extracted.
  • like the JST and ASUM models, the STM model adds a sentiment layer to the LDA model, extending it to a four-layer model: document layer, topic layer, sentiment layer and word layer.
  • the biggest difference between the STM model and the JST and ASUM models, which also consider both sentiment and topic factors, is the structural hierarchy: JST and ASUM use “document-sentiment-topic-word” (the sentiments here are only positive and negative) and assume that topics depend on sentiments, i.e., each sentiment has a topic distribution; the STM model uses “document-topic-sentiment-word” and assumes that sentiment generation depends on the topic, i.e., each topic has a sentiment distribution.
  • the four-layer document-topic-sentiment-word structure of the STM model better matches the reviewer's thought process when writing a review.
  • the ASUM model assumes that words belonging to the same topic tend to appear close together in a review, and that words expressing sentiment about that topic also tend to be nearby. The ASUM model therefore assumes that all words in a single sentence come from the same topic and the same sentiment, i.e., the sentence is the smallest unit of topic and sentiment. In practice, such constraints do not always hold.
  • the essential difference between the STM model and the reversed JST model, which also considers both sentiment and topic factors, is that the former is fully unsupervised learning while the latter is a semi-supervised learning method.
  • the present invention proposes a method for detecting deceptive e-commerce reviews based on the topic-sentiment joint probability, which is reasonably designed, overcomes the deficiencies of the prior art, and achieves good results.
  • the STM model is a 9-tuple, STM = (α, μ, β, θ, δ, φ, z, s, w), where:
  • α is a hyperparameter reflecting the relative strength among the hidden topics and among the sentiments;
  • μ is a hyperparameter reflecting the sentiment probability distribution of a topic;
  • β is a hyperparameter of the word probability distribution;
  • θ is a K-dimensional Dirichlet random variable, the topic probability distribution matrix;
  • δ is a K×T-dimensional Dirichlet random variable, the sentiment probability distribution matrix;
  • φ is a K×T×V-dimensional Dirichlet random variable (V being the vocabulary size), the word probability distribution matrix;
  • z_{m,n} is the topic of the n-th word of document m;
  • s_{m,n} is the sentiment of the n-th word of document m;
  • w_{m,n} is the basic unit of discrete data, defined as the word with index n in document m;
  • the method for detecting deceptive e-commerce reviews based on the topic-sentiment joint probability includes the following steps:
  • Step 1: initialize the hyperparameters α, μ, β of the STM model;
  • Step 2: set an appropriate number of topics, number of sentiments, and maximum number of Gibbs sampling iterations;
  • Step 3: train the STM model until it converges stably;
  • Step 4: take the topic-sentiment joint probability matrix δ computed by the STM model as a feature and input it to a classifier for training;
  • Step 5: input new unlabeled samples into the STM model and train it to compute the topic-sentiment joint probability matrix δ of the new unlabeled samples;
  • Step 6: input the topic-sentiment joint probability matrix δ of the new unlabeled samples into the trained classifier for prediction;
  • Step 7: the classifier outputs the labels of the new samples.
  • the STM model outperforms the other comparison models on data from different domains, and shows a great advantage over the other models in handling unbalanced large-sample data sets; the STM model is therefore better suited to application in real e-commerce environments.
  • Figure 1 is the probabilistic graphical model of the STM model.
  • Figure 2 shows the impact of the number of Gibbs sampling iterations on perplexity for the LDA, JST and STM models.
  • Figure 3 shows the impact of the number of topics on perplexity for the LDA, JST and STM models.
  • Figure 4 shows the performance of the models on the balanced hotel dataset.
  • Figures 4(a), 4(b), and 4(c) show the results of the models on the Precision, Recall, and F1-Score metrics, respectively.
  • Figure 5 shows the performance of the models on the unbalanced hotel dataset.
  • Figures 5(a), 5(b), and 5(c) show the results of the models on the Precision, Recall, and F1-Score metrics, respectively.
  • Figure 6 shows the performance of the models on the balanced restaurant dataset.
  • Figures 6(a), 6(b), and 6(c) show the results of the models on the Precision, Recall, and F1-Score metrics, respectively.
  • Figure 7 shows the performance of the models on the unbalanced restaurant dataset.
  • Figures 7(a), 7(b), and 7(c) show the results of the models on the Precision, Recall, and F1-Score metrics, respectively.
  • Figure 1 shows the probabilistic graphical model of the STM model.
  • the part enclosed by the black box represents the sentiment layer that extends LDA.
  • the arc from α to θ indicates that for each document a topic probability distribution vector is generated according to the Dirichlet(α) function, i.e., θ ~ Dirichlet(α); the arc from μ to δ indicates that for each hidden topic a sentiment probability distribution vector is generated according to the Dirichlet(μ) function, i.e., δ ~ Dirichlet(μ); the arc from β to φ indicates that for each hidden topic and sentiment a word probability distribution vector is generated according to the Dirichlet(β) function, i.e., φ ~ Dirichlet(β).
  • the arc from θ to z_{m,n} indicates that for word w_{m,n} in document d_m a topic z_{m,n} is selected at random from the document-topic multinomial distribution.
  • the arc from δ to s_{m,n} and the arc from z_{m,n} to s_{m,n} indicate that, given the topic, a sentiment s_{m,n} is selected at random for word w_{m,n} in document d_m from the document-topic-sentiment multinomial distribution.
  • the three arcs from φ, z_{m,n} and s_{m,n} to w_{m,n} indicate that, with topic z_{m,n} and sentiment s_{m,n} known, a word w_{m,n} is selected from the topic-sentiment-word multinomial distribution.
  • for each document, a topic probability vector θ is selected, where θ obeys a Dirichlet distribution with hyperparameter α, i.e., θ ~ Dirichlet(α).
  • a sentiment dictionary consists of words with sentiment orientation, also called evaluation words or polarity words. For example, “good” and “bad” are two words that clearly express commendatory and derogatory sentiment, respectively. Sentiment words are usually used in review text to express the sentiment orientation of the reviewer, so the sentiment words in review text play an important role in sentiment analysis, and their recognition and polarity judgment have attracted the attention of researchers in the field of sentiment analysis.
  • the Gibbs sampling method causes the STM model to converge to a steady state after enough iterations. Once an appropriate number of iterations has been determined, the topic labels and sentiment labels assigned by the model to each word best approximate the true labeling of the text.
  • the topic step involves two structures, α → θ and θ → z; for different documents this generation process is independent, so for each document the probability of the topics can be generated according to formula (1).
  • the word step involves two structures, β → φ and φ → w; we take the word as the sampling unit, so the words are mutually independent, and the probability generation of the words can be computed by formula (4).
  • the STM model is a generative model. For the generation of each document d, the STM model first selects a topic k from the document-topic distribution θ_d; once topic k is determined, it selects a sentiment t from the topic-sentiment distribution δ_k; given topic k and sentiment t, each word in the document is generated from the topic-sentiment-word distribution φ_{k,t}.
  • the flow of the Gibbs-sampling-based STM solving algorithm is as follows:
  • each iteration follows the process described above.
  • the first frequency count is based on the initialization results of the model.
  • model initialization randomly assigns topic dimensions to all the words in the documents. In the initialization, however, the assignment of sentiments is not entirely random: since we want to incorporate sentiment prior information into the model, the sentiment initialization relies on the sentiment seed dictionary.
  • the initialization of the sentiment dimension proceeds as follows:
  • the second set of experiments was designed to assess the classification performance of the models on balanced and unbalanced datasets. The evaluation metrics are Precision, Recall and F1-Score, commonly used in classification tasks.
  • the purpose of the third set of experiments was to verify the performance of the proposed model and the comparison models on datasets from different domains.
  • the experimental dataset used in this application consists of labeled English review texts obtained from the review website Yelp.
  • Table 2 gives the statistical characteristics of the dataset, which carries two labels: genuine and deceptive. Deceptive reviews are those filtered out by Yelp's own filter; genuine reviews are those retained on the business page. The reviews come from two domains, hotels and restaurants: the hotel domain contains 780 deceptive reviews and 5078 genuine reviews, and the restaurant domain contains 8308 deceptive reviews and 58716 genuine reviews. From the statistics in Table 2 we can see that the class distribution of this Yelp dataset is unbalanced. Entries marked “ND” in the table denote the natural distribution.
  • we use the optimal parameter configuration throughout. For the unigram, character n-grams in token, and POS models, all features are weighted with TF-IDF.
  • for the LDA model we use the topic probability distribution vector (hidden variable θ) as the feature of the review text.
  • for the JST model we use the sentiment-topic joint probability distribution (hidden variable θ) as the text feature.
  • the topic-sentiment joint probability distribution (hidden variable δ) of the STM model is taken as its feature and can be computed by formula (8).
  • the Dirichlet prior parameters α, μ, β are assigned the values 0.1, 0.01, and 0.1, respectively.
  • the number of topics is assigned the values 5, 10, 15 and 20 in turn.
  • the number of sentiments is set to 2.
  • Precision measures the exactness of a retrieval system: the ratio of the number of relevant documents retrieved to the total number of documents retrieved.
  • Recall, also called the completeness rate, is the ratio of the number of relevant documents retrieved to the number of relevant documents in the collection, measuring the completeness of the retrieval system.
  • F1-Score combines precision and recall into a single metric.
  • F1-Score is calculated as follows:
  • TP refers to the number of positive examples predicted as positive;
  • TN refers to the number of negative examples predicted as negative;
  • FP refers to the number of negative examples predicted as positive;
  • FN refers to the number of positive examples predicted as negative.
  • perplexity measures how well a probability distribution or probability model predicts a sample, and it can also be used to compare two probability distributions or probability models.
  • the LDA model, the JST model, and the STM model used in this application are all probabilistic models, so we also use perplexity to compare the predictive ability of the three topic models. Perplexity decreases monotonically as the likelihood of the test data increases and is algebraically equivalent to the inverse of the geometric mean of the per-word likelihood. A lower perplexity value indicates that the probability model generalizes better.
  • perplexity is calculated as follows:
  • the perplexity value of the STM model is always lower than that of the LDA and JST models, which indicates that the STM model generalizes better than the other two models.
  • when the number of iterations is below 40, the three curves fall steeply.
  • when the number of iterations reaches 500, the curves remain essentially unchanged, which indicates that the probability models have basically converged; in the experiments below we therefore set the number of iterations to 500.
  • the character n-grams in token model (71.3, 79.52, and 75.19) works better than the unigram model (70.42, 75.63, and 72.93). This is because the character n-grams in token model not only inherits the advantages of the standard n-grams model but also yields fewer features.
  • the POS model is better than the unigram and character n-grams in token models, which also shows that even a simple stylometric feature method helps deceptive review detection.
  • the LDA model (76.34, 85.53, and 79.77) is better than the POS model (75.92, 82.42, and 79.04) because the LDA model captures the hidden semantic information in the review text.
  • when the number of topics is set to 20, the precision of LDA drops sharply while its recall rises, which indicates that the LDA model is affected by the number of topics. Comparing the results of the LDA model (76.34, 85.53, and 79.77), the JST model (83.5, 84.75, and 84.12) and the STM model (87.15, 83.02 and 85.03), it can be seen that considering both sentiment and topic information is more helpful for improving performance.
  • our STM model is also the best performer of all comparison models on the balanced hotel dataset. We can therefore conclude that the topic-sentiment joint probability distribution further improves the performance of deceptive review detection.
  • Figure 5 shows the performance of the comparison models on the unbalanced hotel dataset.
  • the character n-grams in token model still performs better than the unigram model.
  • the POS model is superior to the unigram and character n-grams in token models, because the frequency of part-of-speech tags often reflects the genre of a text.
  • the POS performance (74.06, 78.89, 76.39) is better than the unigram and character n-grams in token models because the POS model exploits the shallow grammatical information hidden in the text, while the unigram and character n-grams in token models use only the contextual information of the words.
  • the LDA model is clearly inferior to the JST model (80.5, 82.19, and 81.34) and the STM model (82.29, 85.62, and 83.92), because the JST and STM models consider both topic and sentiment information. Comparing the three subfigures of Figure 6, we can see that the precision of the topic models also increases as the number of topics grows from 5 to 15. At 20 topics, the LDA model shows a clear downward trend in all three figures, whereas the JST and STM models fluctuate less. Among these three topic models, the curve of the STM model always lies above the curves of the other benchmark models regardless of how the number of topics changes, which shows that our model performs best among these models.
  • Table 4 shows the experimental results of the models on the restaurant dataset when the number of topics is 15.

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • General Physics & Mathematics (AREA)
  • Artificial Intelligence (AREA)
  • General Engineering & Computer Science (AREA)
  • General Health & Medical Sciences (AREA)
  • Computational Linguistics (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Health & Medical Sciences (AREA)
  • Business, Economics & Management (AREA)
  • Probability & Statistics with Applications (AREA)
  • Data Mining & Analysis (AREA)
  • Evolutionary Biology (AREA)
  • Accounting & Taxation (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Bioinformatics & Computational Biology (AREA)
  • Bioinformatics & Cheminformatics (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Entrepreneurship & Innovation (AREA)
  • Evolutionary Computation (AREA)
  • Development Economics (AREA)
  • Economics (AREA)
  • Finance (AREA)
  • Marketing (AREA)
  • Strategic Management (AREA)
  • General Business, Economics & Management (AREA)
  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)

Abstract

A method for detecting deceptive e-commerce reviews based on the topic-sentiment joint probability, belonging to the fields of natural language processing, data mining and machine learning. On data from different domains, the STM model outperforms the other comparison models; compared with the other models, the STM model is a fully unsupervised statistical learning method (requiring no label information) and shows a great advantage in handling unbalanced large-sample data sets, so the STM model is better suited to application in real e-commerce environments.

Description

Method for detecting deceptive e-commerce reviews based on topic-sentiment joint probability
Technical Field
The invention belongs to the fields of natural language processing, data mining and machine learning, and in particular relates to a method for detecting deceptive e-commerce reviews based on the topic-sentiment joint probability.
Background Art
With the spread of smart devices and the development of the Internet, people's consumption attitudes and patterns have changed to varying degrees, and online shopping has become a common way of consuming. After experiencing a product or service, consumers publish their opinions and feelings about the product on its page, so this large body of historical online reviews provides a rich information resource for sellers and buyers alike. However, C2C e-commerce platforms suffer from information asymmetry: buyers hold relatively little information about sellers. Before making a purchase decision, buyers therefore read reviews to learn how earlier users experienced the product, hoping to obtain useful information that allows them to make a rational, correct purchase decision. This online review mechanism indeed helps provide information for indirect communication between users and merchants: on the one hand, merchants can use reviews for more precise marketing; on the other hand, users can consult the review information of historical transactions to find products they are satisfied with.
Research shows that consumer feedback can greatly influence the purchase motivation of potential consumers. After finding a large number of negative reviews on a product review page, 80% of users change their original purchase decision, and the great majority of users are more willing to buy a product after seeing positive reviews. By studying consumer reviews on the review website Yelp, Luca et al. found that each one-star increase in a hotel's review rating increases the hotel's revenue by 5%-9%.
Precisely because online reviews play a vital role in consumer decisions and merchant profits, some merchants use deceptive reviews to confuse consumers. Unscrupulous merchants exploit loopholes in the review mechanisms of e-commerce websites and, to raise their own profits and reputation, hire paid posters to write untruthful statements that mislead consumers. Such behavior not only interferes with consumers' purchase decisions but also reduces the profits of reputable merchants. How to filter out deceptive reviews and help users avoid adverse selection and make correct decisions has therefore long been a difficulty that researchers strive to overcome. This application accordingly uses the STM model to mine the topic-sentiment joint probability of review texts as evidence for distinguishing genuine reviews from deceptive ones, and then uses a classifier to judge whether a review is genuine or deceptive.
As is well known, when consumers review a product or service, they usually evaluate a particular aspect and express their sentiment about it. Yelp review texts, for example, have two main characteristics: first, the content of a review usually evaluates a particular aspect of a specific product or service; second, while addressing that aspect the review also expresses the corresponding sentiment orientation. Based on this habit of how people write reviews, we make the following assumption:
Assumption 1: in review text, the expressed sentiment depends on a specific topic (aspect).
To capture the hidden topics and sentiment information of the text mentioned above, this patent proposes the topic-sentiment joint probability model STM (Sentiment Joint Topic Model), which extends the LDA model with a sentiment layer so that the topics in review text and their associated sentiment information can be extracted. Like the JST and ASUM models, which are also topic-sentiment mixture models, the STM model adds a sentiment layer to the LDA model, extending it to a four-layer model: document layer, topic layer, sentiment layer and word layer.
The biggest difference between the STM model and the JST and ASUM models, which also consider both sentiment and topic factors, is this: the structural hierarchy of JST and ASUM is “document-sentiment-topic-word” (the sentiments here are only positive and negative), so they assume that topics depend on sentiments, i.e., each sentiment has a topic distribution; the structural hierarchy of the STM model is “document-topic-sentiment-word”, so the STM model assumes that sentiment generation depends on the topic, i.e., each topic has a sentiment distribution. The four-layer document-topic-sentiment-word structure of the STM model better matches the reviewer's thought process when writing a review. In addition, the ASUM model assumes that words of the same topic tend to appear close together in a review, and that the words expressing sentiment about that topic also tend to be nearby; it therefore assumes that all words in a single sentence come from the same topic and the same sentiment, i.e., the sentence is the smallest unit of topic and sentiment. In practice such constraints do not always hold. The essential difference between the STM model and the reversed JST model, which also considers both sentiment and topic factors, is that the former is fully unsupervised while the latter is a semi-supervised learning method.
Summary of the Invention
In view of the above technical problems in the prior art, the present invention proposes a method for detecting deceptive e-commerce reviews based on the topic-sentiment joint probability, which is reasonably designed, overcomes the deficiencies of the prior art, and achieves good results.
A method for detecting deceptive e-commerce reviews based on the topic-sentiment joint probability, with the following definitions made first:
The STM model is a 9-tuple, $STM=(\alpha,\mu,\beta,\theta,\delta,\varphi,z,s,w)$, where:
α is a hyperparameter reflecting the relative strength among the hidden topics and among the sentiments;
μ is a hyperparameter reflecting the sentiment probability distribution of a topic;
β is a hyperparameter of the word probability distribution;
θ is a K-dimensional Dirichlet random variable, the topic probability distribution matrix;
δ is a K×T-dimensional Dirichlet random variable, the sentiment probability distribution matrix;
φ is a K×T×V-dimensional Dirichlet random variable (V being the vocabulary size), the word probability distribution matrix;
z_{m,n} is the topic of the n-th word of document m;
s_{m,n} is the sentiment of the n-th word of document m;
w_{m,n} is the basic unit of discrete data, defined as the word with index n in document m.
The method for detecting deceptive e-commerce reviews based on the topic-sentiment joint probability specifically includes the following steps:
Step 1: initialize the hyperparameters α, μ, β of the STM model;
Step 2: set an appropriate number of topics, number of sentiments, and maximum number of Gibbs sampling iterations;
Step 3: train the STM model until the model converges stably;
Step 4: take the topic-sentiment joint probability matrix δ computed by the STM model as a feature and input it to a classifier for training;
Step 5: input new unlabeled samples into the STM model, train the STM model, and compute the topic-sentiment joint probability matrix δ of the new unlabeled samples as features;
Step 6: input the topic-sentiment joint probability matrix δ of the new unlabeled samples into the trained classifier for prediction;
Step 7: the classifier outputs the labels of the new samples.
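The end-to-end flow of steps 1 to 7 can be sketched in a few lines of Python. This is an illustrative sketch only: the STMModel class, its fit and transform methods and its delta attribute are hypothetical stand-ins for the Gibbs-sampled model defined above, not an API from the patent; only the random forest classifier mirrors the experimental setup described later.

    # Hypothetical STMModel stands in for the Gibbs-sampled STM defined above;
    # the random-forest classifier (scikit-learn) matches the paper's setup.
    from sklearn.ensemble import RandomForestClassifier

    # Steps 1-2: hyperparameters, topic/sentiment counts, Gibbs iteration cap.
    stm = STMModel(alpha=0.1, mu=0.01, beta=0.1,
                   n_topics=15, n_sentiments=2, max_iter=500)

    # Step 3: train until the sampler converges.
    stm.fit(train_docs)                          # train_docs: tokenized reviews

    # Step 4: per-document topic-sentiment matrix delta (K x T) as features.
    X_train = stm.delta.reshape(len(train_docs), -1)
    clf = RandomForestClassifier(n_estimators=100, random_state=0)
    clf.fit(X_train, train_labels)               # 1 = deceptive, 0 = genuine

    # Steps 5-6: infer delta for new unlabeled reviews and classify them.
    X_new = stm.transform(new_docs).reshape(len(new_docs), -1)

    # Step 7: predicted labels for the new samples.
    print(clf.predict(X_new))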
The beneficial technical effects of the invention:
On data from different domains, the STM model outperforms the other comparison models; compared with the other models, the STM model shows a great advantage in handling unbalanced large-sample data sets, so the STM model is better suited to application in real e-commerce environments.
Brief Description of the Drawings
Figure 1 is the probabilistic graphical model of the STM model.
Figure 2 shows the impact of the number of Gibbs sampling iterations on perplexity for the LDA, JST and STM models.
Figure 3 shows the impact of the number of topics on perplexity for the LDA, JST and STM models.
Figure 4 shows the performance of the models on the balanced hotel dataset, where Figures 4(a), 4(b) and 4(c) show the results on the Precision, Recall and F1-Score metrics, respectively.
Figure 5 shows the performance of the models on the unbalanced hotel dataset, where Figures 5(a), 5(b) and 5(c) show the results on the Precision, Recall and F1-Score metrics, respectively.
Figure 6 shows the performance of the models on the balanced restaurant dataset, where Figures 6(a), 6(b) and 6(c) show the results on the Precision, Recall and F1-Score metrics, respectively.
Figure 7 shows the performance of the models on the unbalanced restaurant dataset, where Figures 7(a), 7(b) and 7(c) show the results on the Precision, Recall and F1-Score metrics, respectively.
Detailed Description
The invention is described in further detail below in conjunction with the drawings and specific embodiments.
1. Topic-sentiment joint probability model
1.1 Model idea
Figure 1 shows the probabilistic graphical model of the STM model; the part enclosed by the black box represents the sentiment layer that extends LDA. In Figure 1, the arc from α to θ indicates that for each document a topic probability distribution vector is generated according to the Dirichlet(α) function, i.e., θ ~ Dirichlet(α). The arc from μ to δ indicates that for each hidden topic a sentiment probability distribution vector is generated according to the Dirichlet(μ) function, i.e., δ ~ Dirichlet(μ). The arc from β to φ indicates that for each hidden topic and sentiment a word probability distribution vector is generated according to the Dirichlet(β) function, i.e., φ ~ Dirichlet(β). The arc from θ to z_{m,n} indicates that for word w_{m,n} in document d_m a topic z_{m,n} is selected at random from the document-topic multinomial distribution. The arc from δ to s_{m,n} and the arc from z_{m,n} to s_{m,n} indicate that, given the topic, a sentiment s_{m,n} is selected at random for word w_{m,n} in document d_m from the document-topic-sentiment multinomial distribution. The three arcs from φ, z_{m,n} and s_{m,n} to w_{m,n} indicate that, with topic z_{m,n} and sentiment s_{m,n} known, a word w_{m,n} is selected from the topic-sentiment-word multinomial distribution.
Suppose we have a document collection containing D documents, defined as D = {d_1, d_2, ..., d_m}, where document d is a word sequence of length N_d, defined as $d=(w_{1},w_{2},\ldots,w_{N_{d}})$, and the vocabulary size of the document collection is V. The process by which the STM model generates a document is formalized as follows:
① For each topic z and sentiment s, select the word generation probability vector φ, which obeys a Dirichlet distribution with hyperparameter β, i.e., φ ~ Dirichlet(β).
② For each document d:
a. select the topic generation probability vector θ, which obeys a Dirichlet distribution with hyperparameter α, i.e., θ ~ Dirichlet(α);
b. given a topic, select the sentiment generation probability vector δ, which obeys a Dirichlet distribution with hyperparameter μ, i.e., δ ~ Dirichlet(μ);
c. for each word w in the document:
i. select a topic z, which obeys the multinomial distribution Multinomial(θ_d);
ii. given topic z, select a sentiment s, which obeys the multinomial distribution Multinomial(δ_{d,z});
iii. with topic z and sentiment s, select a word w, which obeys the multinomial distribution Multinomial(φ_{z,s}).
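Taken literally, the generative story above can be exercised with NumPy's Dirichlet and categorical samplers. The sketch below is illustrative: the sizes K, T and V, the hyperparameter values and the function name are assumptions chosen for the example, not values fixed by the patent.

    import numpy as np

    rng = np.random.default_rng(0)
    K, T, V = 15, 2, 5000        # assumed topic, sentiment and vocabulary sizes

    # Step 1: one word distribution phi[z, s] per topic-sentiment pair.
    beta = 0.1
    phi = rng.dirichlet(np.full(V, beta), size=(K, T))

    def generate_document(n_words, alpha=0.1, mu=0.01):
        """Sample one document following steps 2a-2c of the STM generative story."""
        theta = rng.dirichlet(np.full(K, alpha))        # a: topic vector theta
        delta = rng.dirichlet(np.full(T, mu), size=K)   # b: per-topic sentiment vectors
        words = []
        for _ in range(n_words):
            z = rng.choice(K, p=theta)                  # i:   draw topic z
            s = rng.choice(T, p=delta[z])               # ii:  draw sentiment s given z
            words.append(rng.choice(V, p=phi[z, s]))    # iii: draw word w given (z, s)
        return words

    doc = generate_document(120)   # a 120-word synthetic review (word indices)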
1.2 Sentiment dictionary
A sentiment dictionary consists of words with sentiment orientation, also called evaluation words or polarity words. For example, "good" and "bad" are two words that clearly express commendatory and derogatory sentiment, respectively. Sentiment words are usually used in review text to express the sentiment orientation of the reviewer, so the sentiment words in review text play a very important role in sentiment analysis, and the recognition and polarity judgment of sentiment words have attracted the attention of researchers in the field of sentiment analysis.
Consider the sentiment words "good, bad, disappointed, satisfied": excluding negation structures, the commendatory sentiment words "good, satisfied" can only appear in the positive sentiment category, and the derogatory sentiment words "bad, disappointed" can only appear in the negative sentiment category. Precisely because of this property of sentiment words, we introduce a seed sentiment dictionary into the model to initialize the sentiment labels of words, making the sentiment mining of review text more accurate.
Table 1: Sentiment seed words
As shown in Table 1, we use a positive sentiment dictionary and a negative sentiment dictionary, taken from the lexicon in Turney's work, which contains 7 positive sentiment words and 7 negative sentiment words.
1.3 Model solution
We use the Gibbs sampling method to estimate the parameters of the STM model. Gibbs sampling lets the STM model converge to a stable state after sufficiently many iterations. Once an appropriate number of iterations has been determined, the topic labels and sentiment labels that the model assigns to each word best approximate the true situation of the text.
In the STM model we need to estimate three hidden variables: the document-topic distribution θ, the topic-sentiment distribution δ, and the topic-sentiment-word distribution φ. To obtain these three hidden variables we perform inference with the Gibbs sampling method, and the parameter estimation can be divided into three steps.
Step 1: $\alpha\rightarrow\theta\rightarrow z$. This step involves two structures, $\alpha\rightarrow\theta$ and $\theta\rightarrow z$. For different documents this generation process is mutually independent, so for each document the probability of the topics can be generated according to formula (1):
$p(\mathbf{z}\mid\alpha)=\prod_{m=1}^{M}\frac{\Delta(\mathbf{n}_{m}+\alpha)}{\Delta(\alpha)} \qquad (1)$
where $n_{m}^{(k)}$ denotes the number of words in document m that belong to topic k, and $\Delta(\alpha)$ (with $\alpha=\{\alpha_{1},\alpha_{2},\ldots,\alpha_{K}\}$) is the normalization factor of the Dirichlet distribution Dirichlet(α), computed according to formula (2):
$\Delta(\alpha)=\frac{\prod_{k=1}^{K}\Gamma(\alpha_{k})}{\Gamma\left(\sum_{k=1}^{K}\alpha_{k}\right)} \qquad (2)$
Step 2: $\mu\rightarrow\delta\rightarrow s$. This step involves two structures: $\mu\rightarrow\delta$ corresponds to a Dirichlet structure and $\delta\rightarrow s$ corresponds to a multinomial distribution, so $\mu\rightarrow\delta\rightarrow s$ is a Dirichlet-Multinomial conjugate structure. We assume that the generation of sentiment depends on the topic; once the topic is determined, the generation of sentiment can be computed by formula (3):
$p(\mathbf{s}\mid\mathbf{z},\mu)=\prod_{m=1}^{M}\prod_{k=1}^{K}\frac{\Delta(\mathbf{n}_{m,k}+\mu)}{\Delta(\mu)} \qquad (3)$
where $n_{m,k}^{(t)}$ denotes the number of words in document m that belong to topic k and sentiment t.
Step 3: $\beta\rightarrow\varphi\rightarrow w$. This step involves two structures, $\beta\rightarrow\varphi$ and $\varphi\rightarrow w$. We take the word as the sampling unit, so the words are mutually independent, and the probability generation of the words can be computed by formula (4):
$p(\mathbf{w}\mid\mathbf{z},\mathbf{s},\beta)=\prod_{k=1}^{K}\prod_{t=1}^{T}\frac{\Delta(\mathbf{n}_{k,t}+\beta)}{\Delta(\beta)} \qquad (4)$
where $n_{k,t}^{(v)}$ denotes the number of words assigned to topic k and sentiment t.
Considering formulas (1), (3) and (4) together yields the joint probability distribution of the hidden variables, as shown in formula (5):
$p(\mathbf{w},\mathbf{z},\mathbf{s}\mid\alpha,\mu,\beta)=p(\mathbf{z}\mid\alpha)\,p(\mathbf{s}\mid\mathbf{z},\mu)\,p(\mathbf{w}\mid\mathbf{z},\mathbf{s},\beta) \qquad (5)$
On the basis of formula (5), the Gibbs sampling method gives formula (6):
$p(z_{i}=k,s_{i}=t\mid\mathbf{z}_{\neg i},\mathbf{s}_{\neg i},\mathbf{w})\propto\frac{n_{m,\neg i}^{(k)}+\alpha_{k}}{\sum_{k'}(n_{m,\neg i}^{(k')}+\alpha_{k'})}\cdot\frac{n_{m,k,\neg i}^{(t)}+\mu_{t}}{\sum_{t'}(n_{m,k,\neg i}^{(t')}+\mu_{t'})}\cdot\frac{n_{k,t,\neg i}^{(w_{i})}+\beta_{w_{i}}}{\sum_{v}(n_{k,t,\neg i}^{(v)}+\beta_{v})} \qquad (6)$
All the count terms n in formula (6) are word frequency statistics, and the i-th word is excluded from the counts.
The approximate document-topic probability distribution in document m is:
$\theta_{m,k}=\frac{n_{m}^{(k)}+\alpha_{k}}{\sum_{k'}(n_{m}^{(k')}+\alpha_{k'})} \qquad (7)$
When the topic in document m is k, the approximate topic-sentiment probability distribution is:
$\delta_{m,k,t}=\frac{n_{m,k}^{(t)}+\mu_{t}}{\sum_{t'}(n_{m,k}^{(t')}+\mu_{t'})} \qquad (8)$
When the topic is k and the sentiment is t, the approximate topic-sentiment-word probability distribution is:
$\varphi_{k,t,v}=\frac{n_{k,t}^{(v)}+\beta_{v}}{\sum_{v'}(n_{k,t}^{(v')}+\beta_{v'})} \qquad (9)$
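In code, formulas (7) to (9) are simple normalizations of the sampler's count tables. The sketch below assumes three count arrays maintained by the sampler, here called n_mk (M×K), n_mkt (M×K×T) and n_ktv (K×T×V); the array and function names are illustrative, not from the patent.

    import numpy as np

    def estimate_parameters(n_mk, n_mkt, n_ktv, alpha, mu, beta):
        """Point estimates of theta, delta and phi per formulas (7)-(9).

        n_mk : (M, K)     words in document m assigned to topic k
        n_mkt: (M, K, T)  words in document m assigned to topic k, sentiment t
        n_ktv: (K, T, V)  words of vocabulary item v assigned to (topic k, sentiment t)
        """
        theta = (n_mk + alpha) / (n_mk + alpha).sum(axis=1, keepdims=True)   # (7)
        delta = (n_mkt + mu) / (n_mkt + mu).sum(axis=2, keepdims=True)       # (8)
        phi = (n_ktv + beta) / (n_ktv + beta).sum(axis=2, keepdims=True)     # (9)
        return theta, delta, phi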
The STM model is a generative model. For the generation of each document d, the STM model first selects a topic k from the document-topic distribution θ_d; once topic k is determined, it selects a sentiment t from the topic-sentiment distribution δ_k; given topic k and sentiment t, each word in the document is generated from the topic-sentiment-word distribution φ_{k,t}.
The Gibbs-sampling-based solving procedure of the STM model iterates as follows. Every iteration follows the process described above: we take frequency counts of the results of the previous iteration, compute the distributions θ, δ, φ according to the formulas, and then select a topic label and a sentiment label for each word from these distributions, thereby updating the topic and sentiment dimensions of the words. The first frequency count is based on the initialization results of the model, which randomly assigns topic dimensions to all words in the documents. In the model initialization, however, the assignment of sentiments is not entirely random: since we want to incorporate sentiment prior information into the model, the initialization of sentiments relies on the sentiment seed dictionary. The sentiment dimension is initialized as follows (a code sketch of this initialization and of one sampling sweep appears after step b):
a. for each word in a document, check whether the word appears in the sentiment seed dictionary; if it does, record whether its sentiment polarity is positive or negative;
b. for a word that does not appear in the sentiment seed dictionary, assign its sentiment polarity at random.
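A compact sketch of the seeded initialization (steps a and b) and of one sampling sweep over a document, resampling each word's (topic, sentiment) pair from the full conditional of formula (6), might look as follows. The seed sets shown are only an illustrative subset of Table 1, and the count-array names follow the estimation sketch above; all identifiers are assumptions of the example.

    import numpy as np

    rng = np.random.default_rng(0)
    POS_SEEDS = {"good", "satisfied"}        # illustrative subset of Table 1
    NEG_SEEDS = {"bad", "disappointed"}

    def init_assignments(tokens, K):
        """Steps a-b: random topics; sentiments seeded from the dictionary."""
        z = list(rng.integers(K, size=len(tokens)))
        s = [0 if w in POS_SEEDS else 1 if w in NEG_SEEDS else int(rng.integers(2))
             for w in tokens]
        return z, s

    def gibbs_sweep(m, word_ids, z, s, n_mk, n_mkt, n_ktv, alpha, mu, beta):
        """One sweep of formula (6) over document m (word_ids: vocabulary indices)."""
        K, T = n_ktv.shape[0], n_ktv.shape[1]
        for i, v in enumerate(word_ids):
            k0, t0 = z[i], s[i]
            # exclude word i from the counts
            n_mk[m, k0] -= 1; n_mkt[m, k0, t0] -= 1; n_ktv[k0, t0, v] -= 1
            # full conditional over all (k, t) pairs; the per-document total
            # in the first factor is constant across k and can be dropped
            p = ((n_mk[m] + alpha)[:, None]
                 * (n_mkt[m] + mu) / (n_mkt[m] + mu).sum(axis=1, keepdims=True)
                 * (n_ktv[:, :, v] + beta) / (n_ktv + beta).sum(axis=2))
            p = (p / p.sum()).ravel()
            z[i], s[i] = divmod(int(rng.choice(K * T, p=p)), T)
            n_mk[m, z[i]] += 1; n_mkt[m, z[i], s[i]] += 1; n_ktv[z[i], s[i], v] += 1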
2. Deceptive review detection experiments and analysis of results
To demonstrate the performance of the STM model proposed in this application and to compare it with typical feature-based models in the deceptive review field, such as the unigram model, the character n-grams in token model, the POS model, the LDA model and the JST model, we designed and implemented three sets of experiments. In the first set, we used perplexity to compare the generalization ability of the LDA, JST and STM models. All three are generative probabilistic models and are simultaneously affected by two parameters, the number of Gibbs sampling iterations and the number of topics, so in the first set of experiments we focused on observing, through the perplexity metric, how changing these two parameters affects the three models. The second set of experiments was designed to assess the classification performance of the models on balanced and unbalanced datasets; the evaluation metrics are Precision, Recall and F1-Score, commonly used in classification tasks. The purpose of the third set of experiments was to verify the performance of the proposed model and the comparison models on datasets from different domains.
2.1 Dataset and experimental setup
The experimental dataset used in this application consists of labeled English review texts obtained from the review website Yelp. Table 2 gives the statistical characteristics of the dataset. The dataset carries two labels, genuine and deceptive: deceptive reviews are those filtered out by Yelp's own filter, while genuine reviews are those retained on the business page. The reviews come from datasets of two domains, hotels and restaurants: the hotel domain contains 780 deceptive reviews and 5078 genuine reviews, and the restaurant domain contains 8308 deceptive reviews and 58716 genuine reviews. From the statistics in Table 2 we can see that the class distribution of this Yelp dataset is extremely unbalanced; entries marked "ND" in the table denote the natural distribution. Researchers have shown that models trained on extremely unbalanced datasets often perform poorly, so to build better models we used the downsampling technique to construct the unbalanced datasets: downsampling randomly selects a subset of instances from the majority class to form, together with the minority class, an unbalanced training dataset whose class distribution is comparatively less skewed. In Table 2, datasets marked "#" are unbalanced and datasets marked "*" are balanced. To verify the applicability of the deceptive review detection models, we ran experiments on the datasets of both domains: the hotel dataset is used in the first and second sets of experiments, and the restaurant dataset in the third.
Table 2: Datasets for review detection
Before implementing the three sets of experiments, we first preprocessed the review text data. Because the texts are English reviews, we only need to tokenize the reviews on whitespace and then remove digits and punctuation; afterwards, we used the Stanford parser to obtain the part of speech of each word. A sketch of this preprocessing follows.
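A minimal sketch of this preprocessing, with NLTK's tagger standing in for the Stanford parser used in the experiments (the use of NLTK and the function name are assumptions of the example, not the patent's tooling):

    import re
    import nltk  # stand-in for the Stanford parser; tagger data must be downloaded

    def preprocess(review: str):
        """Whitespace tokenization, then digits and punctuation stripped."""
        tokens = [re.sub(r"[\W\d]+", "", t) for t in review.lower().split()]
        tokens = [t for t in tokens if t]          # drop tokens emptied by cleaning
        return tokens, nltk.pos_tag(tokens)        # tokens plus (word, POS) pairs

    tokens, tagged = preprocess("The staff was friendly , 5 stars !")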
In the experimental setup, all classification tasks use five-fold cross-validation. We chose the unigram model, the character n-grams in token model, the POS model, the LDA model and the JST model as comparison baselines, because in the field of deceptive review detection these are the typical representative feature-based models. All classification tasks use a random forest classifier; in particular, for models with high-dimensional features, such as the unigram model and the character n-grams in token model, we used an SVM, because the SVM model is suited to handling high-dimensional feature data.
In all experiments we used the optimal parameter configuration. For the unigram, character n-grams in token and POS models, all features are weighted with TF-IDF. For the LDA model, we use the topic probability distribution vector (hidden variable θ) as the feature of the review text. For the JST model, we use the sentiment-topic joint probability distribution (hidden variable θ) as the text feature. Similarly, the topic-sentiment joint probability distribution (hidden variable δ) of the STM model is used as its feature and can be computed by formula (8). Following the configuration in the paper, in the three sets of experiments the Dirichlet prior parameters α, μ, β are assigned the values 0.1, 0.01 and 0.1, respectively. In addition, in the topic model experiments the number of topics is assigned the values 5, 10, 15 and 20 in turn, and the number of sentiments is set to 2.
2.2 Evaluation criteria
This experiment performs feature extraction and model training with the features and classifiers introduced above. To measure the effect of different features and different classification models, we adopt the algorithm evaluation practice of the machine learning field; the three commonly used basic metrics are precision (Precision), recall (Recall) and F-value (F1-Score).
Precision measures the exactness of a retrieval system; it is the ratio of the number of relevant documents retrieved to the total number of documents retrieved:
$\mathrm{Precision}=\frac{TP}{TP+FP}$
Recall, also called the completeness rate, is the ratio of the number of relevant documents retrieved to the number of relevant documents in the collection; it measures the completeness of the retrieval system:
$\mathrm{Recall}=\frac{TP}{TP+FN}$
F1-Score combines precision and recall into a single measure and is calculated as follows:
$F_{1}=\frac{2\times\mathrm{Precision}\times\mathrm{Recall}}{\mathrm{Precision}+\mathrm{Recall}}$
where TP is the number of positive examples predicted as positive, TN the number of negative examples predicted as negative, FP the number of negative examples predicted as positive, and FN the number of positive examples predicted as negative. Because the experiments use five-fold cross-validation, the Precision, Recall and F1-Score values reported are computed with macro-averaging. The sketch below shows these metrics computed from raw counts.
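Computed from raw counts, the three metrics reduce to a small helper; this function is illustrative and not part of the patent.

    def precision_recall_f1(y_true, y_pred, positive=1):
        """Precision, Recall and F1-Score from TP/FP/FN counts."""
        tp = sum(t == positive and p == positive for t, p in zip(y_true, y_pred))
        fp = sum(t != positive and p == positive for t, p in zip(y_true, y_pred))
        fn = sum(t == positive and p != positive for t, p in zip(y_true, y_pred))
        precision = tp / (tp + fp) if tp + fp else 0.0
        recall = tp / (tp + fn) if tp + fn else 0.0
        f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
        return precision, recall, f1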
In information theory, perplexity measures how well a probability distribution or probability model predicts a sample, and it can also be used to compare two probability distributions or models. The LDA, JST and STM models used in this application are all probabilistic models, so we also use perplexity to compare the predictive ability of the three topic models. Perplexity decreases monotonically as the likelihood of the test data increases and is algebraically equivalent to the inverse of the geometric mean of the per-word likelihood; a lower perplexity indicates that the probability model generalizes better. Formally, for a test set of M documents, perplexity is computed as:
$\mathrm{perplexity}(D_{test})=\exp\left\{-\frac{\sum_{d=1}^{M}\log p(\mathbf{w}_{d})}{\sum_{d=1}^{M}N_{d}}\right\}$
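For a test set, the formula above reduces to a short helper; the per-document log-likelihoods log p(w_d) are assumed to be supplied by the trained model, and the numbers in the usage line are made-up placeholders.

    import numpy as np

    def perplexity(doc_log_likelihoods, doc_lengths):
        """exp(-sum_d log p(w_d) / sum_d N_d) over a test set of M documents."""
        return float(np.exp(-np.sum(doc_log_likelihoods) / np.sum(doc_lengths)))

    # e.g. two test documents with log-likelihoods from the model:
    print(perplexity([-850.3, -1210.7], [120, 175]))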
2.3 Analysis of experimental results
2.3.1 Results of varying the number of Gibbs sampling iterations and the number of topics
In the first set of experiments we varied the number of Gibbs sampling iterations and the number of topics of the topic models separately and observed the change of the three topic models in perplexity. Figures 2 and 3 show the results of the first set of experiments. In Figure 2 the horizontal axis is the number of iterations and the vertical axis is the perplexity value. As shown in Figure 2, the perplexity values of the LDA, JST and STM models all fall as the number of iterations increases. We know that a lower perplexity indicates better generalization ability of a probability model, so this shows that the three models gradually converge as the number of iterations increases. The perplexity of the STM model is always lower than that of the LDA and JST models, which shows that the STM model generalizes better than the other two. In addition, we find that when the number of iterations is below 40 the three curves fall steeply, and when the number of iterations reaches 500 the curves remain essentially unchanged, which shows that the probability models have basically converged. In the experiments below we therefore set the number of iterations to 500.
We varied the number of topics from 5 to 20 and observed the influence of the number of topics on the three topic models. In Figure 3 the horizontal axis is the number of topics and the vertical axis is the perplexity value. Similar to Figure 2, in Figure 3 the perplexity values of the three topic models fall as the number of topics increases. The curve of the STM model always lies below the curves of the LDA and JST models, which indicates that the generalization ability of the STM model is better than that of the other two topic models. The curves fall quickly as the number of topics goes from 5 to 10, and when the number of topics reaches 15 the three topic models can basically converge; from 15 to 20 topics the three curves change little, so we consider setting the number of topics to 15 appropriate in this application.
2.3.2 Experimental results of the models on balanced and unbalanced data
In the second set of experiments we compared the performance of our model and the baseline models on the balanced and unbalanced datasets. Since the LDA, JST and STM models are all affected by the number of topics, it is necessary to compare the three topic models under different numbers of topics. Figures 4 and 5 show the experimental results of the models on the balanced and unbalanced datasets, respectively, as the number of topics varies from 5 to 15. The horizontal axis represents the number of topics and the vertical axis represents the values of the three metrics Precision (P), Recall (R) and F1-Score (F) of the model results. The analysis of the first set of experiments showed that most topic models reach good performance with the number of topics set to 15; to compare the effects of the models further, the results obtained by each model with the number of topics set to 15 are listed in Table 3. All values in the figures and tables are percentages (%).
Table 3: Experimental results of the models on the hotel dataset with the number of topics set to 15
a. Results on the balanced hotel dataset
First we analyze the topic-independent models. With the random forest classifier, the character n-grams in token model (71.3, 79.52, and 75.19) works better than the unigram model (70.42, 75.63, and 72.93), because the character n-grams in token model not only inherits the advantages of the standard n-grams model but also yields fewer features. The POS model is better than the unigram and character n-grams in token models, which also shows that even a simple stylometric feature method helps deceptive review detection.
The LDA model (76.34, 85.53, and 79.77) is better than the POS model (75.92, 82.42, and 79.04) because the LDA model captures the hidden semantic information in the review text. When the number of topics is set to 20, the precision of LDA drops sharply while its recall rises, which indicates that the LDA model is affected by the number of topics. Comparing the results of the LDA model (76.34, 85.53, and 79.77), the JST model (83.5, 84.75, and 84.12) and the STM model (87.15, 83.02, and 85.03), it can be seen that considering both sentiment and topic information is more helpful for improving performance. Likewise, our STM model is the best performer of all comparison models on the balanced hotel dataset. We can therefore conclude that the topic-sentiment joint probability distribution can further improve the performance of deceptive review detection.
Conclusion 1: in deceptive review detection, experiments on the balanced hotel dataset verify that the STM model performs best among all comparison models.
b. Results on the unbalanced hotel dataset
Figure 5 shows the performance of the comparison models on the unbalanced hotel dataset. On the unbalanced dataset of the hotel domain we verified that every model performs worse than on the balanced dataset. The character n-grams in token model still performs better than the unigram model. The POS model is superior to the unigram and character n-grams in token models, because the frequency of part-of-speech tags often reflects the genre of a text.
When the number of topics is below 15, the topic-related models perform worse than the POS model. When the number of topics increases to 15, the three topic models reach their best performance and outperform the POS model. Choosing an appropriate number of topics is therefore crucial to the effect of deceptive review detection. From the unbalanced-data results listed in Table 4, we can see that the STM model has the highest Precision, Recall and F1-Score values.
Conclusion 2: when the STM model is applied to unbalanced datasets, its performance is the best and its advantage is more pronounced.
2.3.3 Experimental results of the models on data from different domains
To further compare the applicability of the models in different domains, we ran the third set of experiments on the restaurant dataset, which is larger than the hotel dataset. The experimental setup of the third set is the same as that of the second. Figures 6 and 7 show the results of the third set of experiments. The horizontal axis represents the number of topics and the vertical axis represents the values of the three metrics Precision (P), Recall (R) and F1-Score (F) obtained by the models. Similarly, the results with the number of topics set to 15 are listed separately in Table 4. All values in the tables and figures are percentages (%).
a. Results on the balanced restaurant dataset
In Figure 6, the unigram, character n-grams in token and POS models are unaffected by the number of topics. From Figure 6(a) and Table 4 we can conclude that, with random forest as the classifier, the character n-grams in token model (72.39, 78.99, and 75.57) beats the unigram model (71.16, 75.23, and 73.14) on all three metrics of Precision, Recall and F1-Score. Similarly, with the SVM classifier the character n-grams in token model still performs better than the unigram model, because the character n-grams in token model can better distinguish the writing style of deceptive reviewers than the n-grams model. By comparison, the POS performance (74.06, 78.89, 76.39) is better than both the unigram and character n-grams in token models, because the POS model exploits the shallow grammatical information hidden in the text, while the unigram and character n-grams in token models use only the contextual information of the words.
Among the three topic-related models, the LDA model is clearly inferior to the JST model (80.5, 82.19, and 81.34) and the STM model (82.29, 85.62, and 83.92), because the JST and STM models consider both topic and sentiment information. Comparing the three subfigures of Figure 6, we find that the precision of the topic models also improves as the number of topics rises from 5 to 15. At 20 topics, the LDA model shows a clear downward trend in all three figures, whereas the JST and STM models fluctuate less. Among these three topic models, the curve of the STM model always lies above the curves of the other baseline models regardless of how the number of topics changes, which shows that our model performs best among these models.
Table 4: Experimental results of the models on the restaurant dataset with the number of topics set to 15
b. Results on the unbalanced restaurant dataset
From Figure 7 and Table 4 we see that the POS model outperforms the unigram and character n-grams in token models, because POS exploits the part-of-speech features of words. With the random forest classifier, the character n-grams in token model (67.41, 69.63, and 68.5) loses some precision relative to the unigram model (65.46, 73.22, 69.12) but improves on recall and F1. With the SVM classifier, the character n-grams in token model (64.79, 69.03, and 66.84) beats the unigram model (64.15, 68.43, and 66.22), but the overall effect is worse than on the balanced dataset. We believe this phenomenon is caused by the difference in vocabulary size between the balanced and unbalanced datasets: when we construct the unbalanced dataset, we effectively increase the proportion of positive (genuine) samples in the training set, which makes deceptive reviews harder to detect on the unbalanced dataset.
Observing in Figure 7 how the topic models change with the number of topics, we find that the JST model performs worse than the LDA model. In addition, the LDA and JST models are more easily affected than the STM model: for example, when the number of topics is set to 20, the Precision, Recall and F1-Score values of the LDA and JST models all fall clearly compared with 15 topics, while the STM model falls much less and remains basically stable. In Figure 7 we can also observe that when the number of topics is 5 or 10, the curve of the POS model always lies above the curves of the other models; only when the number of topics reaches 15 do the three topic models improve, and our STM model is then the best of the models. This shows that the number-of-topics parameter affects the topic models: only when an appropriate number of topics is set do the topic models perform well.
Comparing the balanced and unbalanced results in Table 4, it is easy to see that the unbalanced results are correspondingly worse. For example, with the random forest classifier on the unigram model, the unbalanced dataset results (65.46, 73.22, and 69.12) are lower than the balanced dataset results (71.16, 75.23, and 73.14). This is reasonable in practice, because in a real e-commerce environment genuine reviews naturally outnumber deceptive ones, which makes detection difficult. Examining the results of all models on the unbalanced dataset listed in Table 4, our model is the best.
The experimental results above support our assumption that "sentiment depends on the topic" and show that mining the thought trajectory of review writers is useful. Further, from the analysis of these results we can conclude that the topic-sentiment joint probability feature can improve the performance of deceptive review detection. In particular, all models perform correspondingly worse on unbalanced datasets than on balanced ones, which also explains why detecting deceptive reviews in a real e-commerce environment is difficult. Compared with the other models, our model shows a great advantage in handling unbalanced datasets, especially with large samples (the unbalanced restaurant dataset), which shows that our model is suitable for application in real e-commerce environments.
Conclusion 3: on data from different domains, the STM model outperforms the other comparison models.
Conclusion 4: compared with the other models, the STM model shows a great advantage in handling unbalanced large-sample datasets, so the STM model is better suited to application in real e-commerce environments.
Of course, the above description is not a limitation of the present invention, and the present invention is not limited to the above examples; changes, modifications, additions or substitutions made by those skilled in the art within the essential scope of the present invention shall also fall within the protection scope of the present invention.

Claims (1)

  1. A method for detecting deceptive e-commerce reviews based on the topic-sentiment joint probability, characterized in that the following definitions are made first:
    the STM model is a 9-tuple, $STM=(\alpha,\mu,\beta,\theta,\delta,\varphi,z,s,w)$, where:
    α is a hyperparameter reflecting the relative strength among the hidden topics and among the sentiments;
    μ is a hyperparameter reflecting the sentiment probability distribution of a topic;
    β is a hyperparameter of the word probability distribution;
    θ is a K-dimensional Dirichlet random variable, the topic probability distribution matrix;
    δ is a K×T-dimensional Dirichlet random variable, the sentiment probability distribution matrix;
    φ is a K×T×V-dimensional Dirichlet random variable (V being the vocabulary size), the word probability distribution matrix;
    z_{m,n} is the topic of the n-th word of document m;
    s_{m,n} is the sentiment of the n-th word of document m;
    w_{m,n} is the basic unit of discrete data, defined as the word with index n in document m;
    the method for detecting deceptive e-commerce reviews based on the topic-sentiment joint probability specifically comprises the following steps:
    Step 1: initialize the hyperparameters α, μ, β of the STM model;
    Step 2: set an appropriate number of topics, number of sentiments and maximum number of Gibbs sampling iterations;
    Step 3: train the STM model until the model converges stably;
    Step 4: take the topic-sentiment joint probability matrix δ computed by the STM model as a feature and input it to a classifier for training;
    Step 5: input new unlabeled samples into the STM model, train the STM model, and compute the topic-sentiment joint probability matrix δ of the new unlabeled samples as features;
    Step 6: input the topic-sentiment joint probability matrix δ of the new unlabeled samples into the trained classifier for prediction;
    Step 7: the classifier outputs the labels of the new samples.
PCT/CN2018/100372 2018-05-16 2018-08-14 Method for detecting deceptive e-commerce reviews based on topic-sentiment joint probability WO2019218508A1 (zh)

Priority Applications (1)

Application Number Priority Date Filing Date Title
US16/769,009 US11100283B2 (en) 2018-05-16 2018-08-14 Method for detecting deceptive e-commerce reviews based on sentiment-topic joint probability

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
CN201810464828.0A CN108874768B (zh) 2018-05-16 2018-05-16 Method for detecting deceptive e-commerce reviews based on topic-sentiment joint probability
CN201810464828.0 2018-05-16

Publications (1)

Publication Number Publication Date
WO2019218508A1 true WO2019218508A1 (zh) 2019-11-21

Family

ID=64334442

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/CN2018/100372 WO2019218508A1 (zh) 2018-05-16 2018-08-14 Method for detecting deceptive e-commerce reviews based on topic-sentiment joint probability

Country Status (3)

Country Link
US (1) US11100283B2 (zh)
CN (1) CN108874768B (zh)
WO (1) WO2019218508A1 (zh)

Cited By (7)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN111143564A (zh) * 2019-12-27 2020-05-12 北京百度网讯科技有限公司 Unsupervised training method and device for multi-target document-level sentiment classification models
CN111310455A (zh) * 2020-02-11 2020-06-19 安徽理工大学 New sentiment word polarity computation method for online shopping reviews
CN113239150A (zh) * 2021-05-17 2021-08-10 平安科技(深圳)有限公司 Text matching method, system and device
CN113934846A (zh) * 2021-10-18 2022-01-14 华中师范大学 Online forum topic modeling method jointly modeling behavior, sentiment and time sequence
JP2022517845A (ja) * 2019-12-02 2022-03-10 広州大学 Fine-grained sentiment analysis method supporting cross-lingual transfer
CN115098683A (zh) * 2022-07-05 2022-09-23 东南大学 Machine-learning-based method for identifying and regulating sentiment impact factors of urban green space
CN116823321A (zh) * 2023-07-06 2023-09-29 青岛酒店管理职业技术学院 Economic management data analysis method and system for e-commerce

Families Citing this family (13)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN110472045B (zh) * 2019-07-11 2023-02-03 中山大学 Short-text false-question classification and prediction method and device based on document embedding
US11386273B2 (en) * 2019-11-18 2022-07-12 International Business Machines Corporation System and method for negation aware sentiment detection
CN111415171B (zh) * 2020-02-24 2020-11-10 柳州达迪通信技术股份有限公司 Data acquisition and verification system based on an SDH transmission system
US11494565B2 (en) * 2020-08-03 2022-11-08 Optum Technology, Inc. Natural language processing techniques using joint sentiment-topic modeling
US11966702B1 (en) * 2020-08-17 2024-04-23 Alphavu, Llc System and method for sentiment and misinformation analysis of digital conversations
CN112988975A (zh) * 2021-04-09 2021-06-18 北京语言大学 Opinion mining method based on ALBERT and knowledge distillation
CN112989056B (zh) * 2021-04-30 2021-07-30 中国人民解放军国防科技大学 False review detection method and device based on aspect features
US20230259711A1 (en) * 2022-02-11 2023-08-17 International Business Machines Corporation Topic labeling by sentiment polarity in topic modeling
CN114996390B (zh) * 2022-03-09 2024-06-18 华中师范大学 Online forum topic modeling method combining sentiment and discourse roles
CN114626885B (zh) * 2022-03-17 2022-11-15 华院分析技术(上海)有限公司 Retail management method and system based on big data
CN114429109B (zh) * 2022-04-06 2022-07-19 哈尔滨工业大学(深圳)(哈尔滨工业大学深圳科技创新研究院) Method for summarizing user reviews based on review helpfulness
CN115086186B (zh) * 2022-06-28 2024-06-04 清华大学 Method and device for generating data center network traffic demand data
CN115619041B (zh) * 2022-11-09 2023-11-21 哈尔滨工业大学 Method for predicting live-streaming effectiveness based on an LDA topic model and a fixed-effects model

Citations (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN101420321A (zh) * 2008-11-27 2009-04-29 电子科技大学 SDH network planning and design method for multi-modular optical fibers
CN101739430A (zh) * 2008-11-21 2010-06-16 中国科学院计算技术研究所 Keyword-based training method and classification method for text sentiment classifiers
CN104462408A (zh) * 2014-12-12 2015-03-25 浙江大学 Multi-granularity sentiment analysis method based on topic modeling
CN107341270A (zh) * 2017-07-28 2017-11-10 东北大学 User sentiment influence analysis method for social platforms

Family Cites Families (19)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US8630975B1 (en) * 2010-12-06 2014-01-14 The Research Foundation For The State University Of New York Knowledge discovery from citation networks
EP2546760A1 (en) * 2011-07-11 2013-01-16 Accenture Global Services Limited Provision of user input in systems for jointly discovering topics and sentiment
US20130080212A1 (en) * 2011-09-26 2013-03-28 Xerox Corporation Methods and systems for measuring engagement effectiveness in electronic social media
US11080721B2 (en) * 2012-04-20 2021-08-03 7.ai, Inc. Method and apparatus for an intuitive customer experience
US10832057B2 (en) * 2014-02-28 2020-11-10 Second Spectrum, Inc. Methods, systems, and user interface navigation of video content based spatiotemporal pattern recognition
US10521671B2 (en) * 2014-02-28 2019-12-31 Second Spectrum, Inc. Methods and systems of spatiotemporal pattern recognition for video content development
US10713494B2 (en) * 2014-02-28 2020-07-14 Second Spectrum, Inc. Data processing systems and methods for generating and interactive user interfaces and interactive game systems based on spatiotemporal analysis of video content
US11031133B2 (en) * 2014-11-06 2021-06-08 leso Digital Health Limited Analysing text-based messages sent between patients and therapists
CN104866468B (zh) * 2015-04-08 2017-09-29 清华大学深圳研究生院 Method for detecting false Chinese customer reviews
WO2017090051A1 (en) * 2015-11-27 2017-06-01 Giridhari Devanathan A method for text classification and feature selection using class vectors and the system thereof
US10410385B2 (en) * 2016-02-19 2019-09-10 International Business Machines Corporation Generating hypergraph representations of dialog
US10445356B1 (en) * 2016-06-24 2019-10-15 Pulselight Holdings, Inc. Method and system for analyzing entities
US9836183B1 (en) * 2016-09-14 2017-12-05 Quid, Inc. Summarized network graph for semantic similarity graphs of large corpora
CN106484679B (zh) * 2016-10-20 2020-02-11 北京邮电大学 False review information detection method and device applied to consumer platforms
US10552468B2 (en) * 2016-11-01 2020-02-04 Quid, Inc. Topic predictions based on natural language processing of large corpora
US11205103B2 (en) * 2016-12-09 2021-12-21 The Research Foundation for the State University Semisupervised autoencoder for sentiment analysis
US10162812B2 (en) * 2017-04-04 2018-12-25 Bank Of America Corporation Natural language processing system to analyze mobile application feedback
CN107392654A (zh) * 2017-07-04 2017-11-24 深圳齐心集团股份有限公司 E-commerce product review quality assessment system
US10489792B2 (en) * 2018-01-05 2019-11-26 Asapp, Inc. Maintaining quality of customer support messages

Patent Citations (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN101739430A (zh) * 2008-11-21 2010-06-16 中国科学院计算技术研究所 Keyword-based training method and classification method for text sentiment classifiers
CN101420321A (zh) * 2008-11-27 2009-04-29 电子科技大学 SDH network planning and design method for multi-modular optical fibers
CN104462408A (zh) * 2014-12-12 2015-03-25 浙江大学 Multi-granularity sentiment analysis method based on topic modeling
CN107341270A (zh) * 2017-07-28 2017-11-10 东北大学 User sentiment influence analysis method for social platforms

Cited By (11)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
JP2022517845A (ja) * 2019-12-02 2022-03-10 広州大学 Fine-grained sentiment analysis method supporting cross-lingual transfer
JP7253848B2 (ja) 2019-12-02 2023-04-07 広州大学 Fine-grained sentiment analysis method supporting cross-lingual transfer
CN111143564A (zh) * 2019-12-27 2020-05-12 北京百度网讯科技有限公司 Unsupervised training method and device for multi-target document-level sentiment classification models
CN111310455A (zh) * 2020-02-11 2020-06-19 安徽理工大学 New sentiment word polarity computation method for online shopping reviews
CN111310455B (zh) * 2020-02-11 2022-09-20 安徽理工大学 New sentiment word polarity computation method for online shopping reviews
CN113239150A (zh) * 2021-05-17 2021-08-10 平安科技(深圳)有限公司 Text matching method, system and device
CN113239150B (zh) * 2021-05-17 2024-02-27 平安科技(深圳)有限公司 Text matching method, system and device
CN113934846A (zh) * 2021-10-18 2022-01-14 华中师范大学 Online forum topic modeling method jointly modeling behavior, sentiment and time sequence
CN115098683A (zh) * 2022-07-05 2022-09-23 东南大学 Machine-learning-based method for identifying and regulating sentiment impact factors of urban green space
CN116823321A (zh) * 2023-07-06 2023-09-29 青岛酒店管理职业技术学院 Economic management data analysis method and system for e-commerce
CN116823321B (zh) * 2023-07-06 2024-02-06 青岛酒店管理职业技术学院 Economic management data analysis method and system for e-commerce

Also Published As

Publication number Publication date
US20210027016A1 (en) 2021-01-28
CN108874768A (zh) 2018-11-23
CN108874768B (zh) 2019-04-16
US11100283B2 (en) 2021-08-24

Similar Documents

Publication Publication Date Title
WO2019218508A1 (zh) Method for detecting deceptive e-commerce reviews based on topic-sentiment joint probability
Kaur et al. A deep learning-based model using hybrid feature extraction approach for consumer sentiment analysis
Xia et al. Sentiment analysis for online reviews using conditional random fields and support vector machines
Barushka et al. Review spam detection using word embeddings and deep neural networks
Liu et al. Detection of spam reviews through a hierarchical attention architecture with N-gram CNN and Bi-LSTM
Chang et al. Research on detection methods based on Doc2vec abnormal comments
Chen et al. A comparison of classical versus deep learning techniques for abusive content detection on social media sites
Ahmed Detecting opinion spam and fake news using n-gram analysis and semantic similarity
Ajitha et al. Design of text sentiment analysis tool using feature extraction based on fusing machine learning algorithms
Nithish et al. An Ontology based Sentiment Analysis for mobile products using tweets
Raghuvanshi et al. A brief review on sentiment analysis
Lim et al. Mitigating online product rating biases through the discovery of optimistic, pessimistic, and realistic reviewers
Oueslati et al. Sentiment analysis for helpful reviews prediction
Baishya et al. SAFER: sentiment analysis-based fake review detection in e-commerce using deep learning
Maurya et al. Deceptive opinion spam detection approaches: a literature survey
Gramyak et al. Intelligent Method of a Competitive Product Choosing based on the Emotional Feedbacks Coloring.
Hajek et al. Opinion mining of consumer reviews using deep neural networks with word-sentiment associations
Repke et al. Extraction and representation of financial entities from text
Yao et al. Online deception detection refueled by real world data collection
Duman Social media analytical CRM: a case study in a bank
Jiang et al. Detecting online fake reviews via hierarchical neural networks and multivariate features
Sangeetha et al. Comparison of sentiment analysis on online product reviews using optimised RNN-LSTM with support vector machine
Ma et al. Identifying purchase intention through deep learning: analyzing the Q&A text of an E-Commerce platform
Abbas et al. A deep learning approach for context-aware citation recommendation using rhetorical zone classification and similarity to overcome cold-start problem
Kim et al. Opinion Mining‐Based Term Extraction Sentiment Classification Modeling

Legal Events

Date Code Title Description
121 Ep: the epo has been informed by wipo that ep was designated in this application

Ref document number: 18919281

Country of ref document: EP

Kind code of ref document: A1

NENP Non-entry into the national phase

Ref country code: DE

122 Ep: pct application non-entry in european phase

Ref document number: 18919281

Country of ref document: EP

Kind code of ref document: A1