CN110751180B - Spurious comment group division method based on spectral clustering - Google Patents
- Publication number
- CN110751180B (application CN201910887582.2A)
- Authority
- CN
- China
- Prior art keywords
- user
- comment
- similarity
- spectral clustering
- graph
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Active
Classifications
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F18/00—Pattern recognition
- G06F18/20—Analysing
- G06F18/22—Matching criteria, e.g. proximity measures
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F18/00—Pattern recognition
- G06F18/20—Analysing
- G06F18/23—Clustering techniques
- G06F18/232—Non-hierarchical techniques
- G06F18/2321—Non-hierarchical techniques using statistics or function optimisation, e.g. modelling of probability density functions
- G06F18/23213—Non-hierarchical techniques using statistics or function optimisation, e.g. modelling of probability density functions with fixed number of clusters, e.g. K-means clustering
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06Q—INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR ADMINISTRATIVE, COMMERCIAL, FINANCIAL, MANAGERIAL OR SUPERVISORY PURPOSES; SYSTEMS OR METHODS SPECIALLY ADAPTED FOR ADMINISTRATIVE, COMMERCIAL, FINANCIAL, MANAGERIAL OR SUPERVISORY PURPOSES, NOT OTHERWISE PROVIDED FOR
- G06Q30/00—Commerce
- G06Q30/018—Certifying business or products
- G06Q30/0185—Product, service or business identity fraud
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06Q—INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR ADMINISTRATIVE, COMMERCIAL, FINANCIAL, MANAGERIAL OR SUPERVISORY PURPOSES; SYSTEMS OR METHODS SPECIALLY ADAPTED FOR ADMINISTRATIVE, COMMERCIAL, FINANCIAL, MANAGERIAL OR SUPERVISORY PURPOSES, NOT OTHERWISE PROVIDED FOR
- G06Q30/00—Commerce
- G06Q30/02—Marketing; Price estimation or determination; Fundraising
- G06Q30/0282—Rating or review of business operators or products
-
- Y—GENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
- Y02—TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
- Y02D—CLIMATE CHANGE MITIGATION TECHNOLOGIES IN INFORMATION AND COMMUNICATION TECHNOLOGIES [ICT], I.E. INFORMATION AND COMMUNICATION TECHNOLOGIES AIMING AT THE REDUCTION OF THEIR OWN ENERGY USE
- Y02D10/00—Energy efficient computing, e.g. low power processors, power management or thermal management
Abstract
Description
Technical Field
The present invention relates to the field of data mining and, more specifically, to a method for dividing users into fake-review groups based on spectral clustering.
Background
With the rapid development of the Internet, e-commerce platforms have changed how people shop, travel, and dine. In transactions on these platforms, reviews of a service or product play a key role in purchase decisions: products with more positive, more authentic reviews win users' favor, whereas users are less willing to buy products with more negative reviews. In recent years, as e-commerce platforms have diversified and market competition has intensified, many unscrupulous merchants have used various means to obtain fake positive reviews or to give competitors fake negative ones. Traditional approaches include sellers buying review services from intermediaries to obtain an above-average number of positive reviews, or inducing consumers to leave inauthentic positive reviews through "cash back for positive reviews". With the rise of social networking platforms, a new review-manipulation pattern has become popular: a product is promoted through a Key Opinion Leader (KOL) while the KOL's operations team simultaneously posts inauthentic reviews of it. Because a KOL spreads information efficiently, their followers post a large number of genuine post-purchase reviews within a short time. Genuine and fake reviews therefore burst out together, which makes fake-review groups much harder to detect.
Many existing algorithms for detecting fake-review groups cannot meet this new need. In the new pattern in particular, many ordinary consumers buy the same product at around the same time, so the number of reviews of that product rises sharply within a short period. Algorithms that look for a small amount of anomalous behavior, or that rely on review bursts, therefore perform poorly on this problem. A more effective method that meets the new need is required.
Summary of the Invention
To overcome the poor detection of fake-review groups in the prior art described above, the present invention provides a method for dividing fake-review groups based on spectral clustering.
The method comprises the following steps:
S1: Collect and clean review data from an e-commerce platform. The metadata used comprises: user id, review id, product id, rating, and the number of interactions each review received (for example, being "liked", marked "useful", or marked "interesting" by other users);
S2: From the metadata in S1, compute five similarity indicators: co-reviewing times, similarity of rating on the same product, interaction times, positive rating ratio, and negative rating ratio. The similarity between two users' rating ratios is measured with the Euclidean distance;
S3: Build a weighted reviewer graph: each user is a node, any two users who have reviewed the same product are connected by an undirected edge, and the edge weight is computed from the five indicators obtained in S2;
S4: Partition the adjacency matrix of the graph built in S3 with a spectral clustering algorithm to obtain a number of groups;
S5: Select reasonable analysis indicators and suitable thresholds, and manually judge the category of each group.
Preferably, the co-reviewing times (CRT) in S2 is calculated as:
$CRT(n_1, n_2) = |P_1 \cap P_2|$
where $n_1$ and $n_2$ are two different reviewers, and $P_1$ and $P_2$ are the sets of products that $n_1$ and $n_2$ have reviewed, respectively.
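The CRT definition above is a plain set intersection, which can be sketched in a few lines. This is only an illustration: the helper names `products_by_user` and `crt` and the toy review list are assumptions, not part of the patent.

```python
from collections import defaultdict


def products_by_user(reviews):
    """Map each user id to the set of product ids that user has reviewed."""
    products = defaultdict(set)
    for user_id, product_id in reviews:
        products[user_id].add(product_id)
    return dict(products)


def crt(products, n1, n2):
    """Co-reviewing times: size of the intersection of the two product sets."""
    return len(products.get(n1, set()) & products.get(n2, set()))


# Toy data (hypothetical): u1 reviewed p2 twice, but sets count it once.
reviews = [("u1", "p1"), ("u1", "p2"), ("u1", "p2"),
           ("u2", "p2"), ("u2", "p3"), ("u3", "p4")]
products = products_by_user(reviews)
print(crt(products, "u1", "u2"))  # 1
print(crt(products, "u1", "u3"))  # 0
```

Because $P_1$ and $P_2$ are sets, several reviews by one user of the same product contribute only once to CRT.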
Preferably, the similarity of rating on the same product (SRSP) in S2 is calculated as:
[formula not preserved in this text]
where the rating terms are the scores given by $n_1$ and $n_2$ in their $i$-th and $j$-th reviews of product $P$, respectively, and $N_1$ and $N_2$ are the numbers of reviews that $n_1$ and $n_2$ have posted on product $P$.
Preferably, the interaction times (IT) in S2 is calculated as:
[formula not preserved in this text]
where $C_{1i}$ and $C_{2i}$ denote the number of interactions of the $i$-th type for $n_1$ and $n_2$, respectively.
Preferably, the positive rating ratio (PR) of a user in S2 is calculated as:
$PR = \frac{\sum S_i^0}{\sum S^0}$
where $S_i$ and $S$ denote the rating of one of the user's reviews, $\sum S^0$ is the number of reviews the user has posted with a rating in {1, 2, 3, 4, 5}, and $\sum S_i^0$ is the number of reviews the user has posted with a rating in {4, 5}.
Preferably, the negative rating ratio (NR) of a user in S2 is calculated as:
[formula not preserved in this text]
Preferably, S2 uses the Euclidean distance to measure how similar two users' positive rating ratios and negative rating ratios are:
[formula not preserved in this text]
where $r_{n1}$ and $r_{n2}$ are the positive rating ratios, or the negative rating ratios, of reviewers $n_1$ and $n_2$.
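As a rough sketch of the rating-ratio indicators: the positive levels {4, 5} come from the text, while the negative levels {1, 2} are an assumption, because the NR formula is not preserved here. For two scalar ratios the Euclidean distance reduces to an absolute difference. The function names are illustrative only.

```python
POSITIVE_LEVELS = {4, 5}  # stated in the text
NEGATIVE_LEVELS = {1, 2}  # assumption: the text does not list the negative levels


def rating_ratio(ratings, levels):
    """Fraction of a user's ratings that fall in `levels` (0.0 if no ratings)."""
    if not ratings:
        return 0.0
    return sum(1 for s in ratings if s in levels) / len(ratings)


def ratio_distance(r1, r2):
    """Euclidean distance between two scalar ratios, which is |r1 - r2|."""
    return abs(r1 - r2)


# Hypothetical rating histories for two reviewers.
ratings_n1 = [5, 5, 4, 1]
ratings_n2 = [5, 3, 2, 1]

pr1 = rating_ratio(ratings_n1, POSITIVE_LEVELS)  # 0.75
pr2 = rating_ratio(ratings_n2, POSITIVE_LEVELS)  # 0.25
nr1 = rating_ratio(ratings_n1, NEGATIVE_LEVELS)  # 0.25
nr2 = rating_ratio(ratings_n2, NEGATIVE_LEVELS)  # 0.5

print(ratio_distance(pr1, pr2))  # 0.5
print(ratio_distance(nr1, nr2))  # 0.25
```

A smaller distance means the two reviewers' rating habits are closer; how that distance is turned into a similarity term for the edge weight is not specified in the surviving text.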
Preferably, S3 comprises the following steps:
S3.1: Import all users as graph nodes;
S3.2: Take each pair of nodes as a node combination and count the products the two nodes have both reviewed;
S3.3: Check whether the co-reviewing count from S3.2 is greater than 0. If so, go to S3.4; if not, return to S3.2 for the next node combination, until all combinations have been traversed;
S3.4: Compute the remaining similarity indicators for the node pair, compute the weight, and create an edge between the two nodes;
S3.5: The weighted reviewer graph is complete.
Preferably, the edge weight in S3 is calculated as:
[formula not preserved in this text]
where $\omega_{ij}$ is the weight of the edge between node $i$ and node $j$; $k$ is the number of indicators, here 5; $CRT_{ij}$ is the number of products co-reviewed by user $i$ and user $j$; $SRSP_{ij}$ is the similarity of the ratings user $i$ and user $j$ gave to the same products; $IT_{ij}$ is the interaction times of user $i$ and user $j$; and the two remaining terms are the closeness of the positive rating ratios and of the negative rating ratios of user $i$ and user $j$, respectively.
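Steps S3.1 to S3.5 can be sketched as a loop over all unordered user pairs. The edge-weight formula is not preserved in this text, so the sketch takes the weight as a caller-supplied function; `build_reviewer_graph`, the `edge_weight` callback signature, and the toy data are all hypothetical.

```python
from itertools import combinations


def build_reviewer_graph(products, edge_weight):
    """Return {(u, v): weight} for every user pair sharing a reviewed product.

    products:    dict mapping user id -> set of reviewed product ids
    edge_weight: hypothetical callback (u, v, crt) -> float that combines
                 the five indicators; its real formula is not shown here
    """
    edges = {}
    # S3.1 / S3.2: every user is a node; visit each unordered pair once.
    for u, v in combinations(sorted(products), 2):
        crt = len(products[u] & products[v])
        # S3.3: only pairs with at least one co-reviewed product get an edge.
        if crt > 0:
            # S3.4: compute the weight and create the undirected edge.
            edges[(u, v)] = edge_weight(u, v, crt)
    return edges  # S3.5


products = {"u1": {"p1", "p2"}, "u2": {"p2", "p3"}, "u3": {"p4"}}
# Stand-in weight that uses CRT alone, purely for demonstration.
edges = build_reviewer_graph(products, lambda u, v, crt: float(crt))
print(edges)  # {('u1', 'u2'): 1.0}
```

Visiting all pairs is quadratic in the number of users; an inverted index from product to reviewers would avoid touching pairs with no product in common, but that is an optimization the text does not discuss.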
Preferably, S4 comprises the following steps:
S4.1: Input the weighted reviewer graph $G$ and the number of clusters $n$;
S4.2: From $G$, compute the adjacency matrix $A$, the degree matrix $D$, and the Laplacian matrix $L = D - A$;
S4.3: Obtain the normalized Laplacian matrix:
$NL = D^{-1/2}(D - A)D^{-1/2} = D^{-1/2} L D^{-1/2}$
S4.4: Compute the $k$ smallest eigenvalues of $NL$ and their corresponding eigenvectors $f$, with $k$ equal to the number of clusters $n$;
S4.5: Stack the eigenvectors $f$ as columns into a $v \times k$ feature matrix $F$ and normalize it row by row, where $v$ is the number of samples, i.e. the number of nodes in $G$;
S4.6: Cluster the rows of the normalized feature matrix with K-Means to obtain $n$ candidate groups $C = (c_1, c_2, \ldots, c_n)$.
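Steps S4.1 to S4.6 can be sketched with NumPy as follows. This is a minimal illustration under stated assumptions, not the patent's implementation: it assumes a symmetric adjacency matrix with no isolated nodes, and the K-Means step is a small hand-written Lloyd's loop with a deterministic farthest-point initialization (an assumption made so the example is reproducible). The names `spectral_groups` and `_kmeans` are hypothetical.

```python
import numpy as np


def _kmeans(X, k, n_iter=100):
    """Plain Lloyd's algorithm with farthest-point initialization."""
    centers = [X[0]]
    for _ in range(1, k):
        dist = np.min([np.linalg.norm(X - c, axis=1) for c in centers], axis=0)
        centers.append(X[np.argmax(dist)])
    centers = np.array(centers)
    labels = np.zeros(len(X), dtype=int)
    for _ in range(n_iter):
        dists = np.linalg.norm(X[:, None, :] - centers[None, :, :], axis=2)
        labels = np.argmin(dists, axis=1)
        new_centers = np.array([
            X[labels == j].mean(axis=0) if np.any(labels == j) else centers[j]
            for j in range(k)
        ])
        if np.allclose(new_centers, centers):
            break
        centers = new_centers
    return labels


def spectral_groups(A, n_clusters):
    """Partition graph nodes into n_clusters groups (S4.1 to S4.6)."""
    A = np.asarray(A, dtype=float)
    degrees = A.sum(axis=1)                       # diagonal of D
    L = np.diag(degrees) - A                      # S4.2: L = D - A
    d_inv_sqrt = 1.0 / np.sqrt(degrees)           # assumes no zero degrees
    NL = d_inv_sqrt[:, None] * L * d_inv_sqrt[None, :]   # S4.3
    _, eigvecs = np.linalg.eigh(NL)               # eigenvalues ascending
    F = eigvecs[:, :n_clusters]                   # S4.4: k smallest
    F = F / np.linalg.norm(F, axis=1, keepdims=True)     # S4.5: row-normalize
    return _kmeans(F, n_clusters)                 # S4.6


# Two triangles of weight-1 edges joined by one weak edge (2-3, weight 0.1).
A = np.zeros((6, 6))
for i, j in [(0, 1), (0, 2), (1, 2), (3, 4), (3, 5), (4, 5)]:
    A[i, j] = A[j, i] = 1.0
A[2, 3] = A[3, 2] = 0.1
labels = spectral_groups(A, 2)
print(labels)
```

On this toy graph the two triangles end up in different groups. `np.linalg.eigh` is used because $NL$ is symmetric and it returns eigenvalues in ascending order, so the first $k$ columns are the ones S4.4 asks for.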
Compared with the prior art, the technical solution of the present invention has the following beneficial effects. Building on the idea of a weighted reviewer graph, the invention proposes five similarity indicators to measure how similar different users are in behavior. These indicators reflect behavioral similarity between users more precisely than existing algorithms based on weighted reviewer graphs, and so improve the accuracy of the subsequent partitioning.
The invention partitions the weighted reviewer graph with spectral clustering, which partitions better than some existing algorithms, such as K-Means, hierarchical clustering, and Louvain community detection, and is more generally applicable.
Brief Description of the Drawings
Fig. 1 is a flowchart of the spectral-clustering-based fake-review group division method of Embodiment 1.
Fig. 2 is a schematic flowchart of the spectral clustering algorithm.
Detailed Description
The drawings are for illustration only and shall not be construed as limiting this patent;
to better illustrate this embodiment, some components in the drawings may be omitted, enlarged, or reduced, and do not represent the dimensions of an actual product;
those skilled in the art will understand that some well-known structures and their descriptions may be omitted from the drawings.
The technical solution of the present invention is further described below with reference to the drawings and embodiments.
Embodiment 1
This embodiment provides a method for dividing fake-review groups based on spectral clustering. As shown in Fig. 1, the method comprises the following steps:
S1: Collect and clean review data from an e-commerce platform. The metadata used comprises: user id, review id, product id, rating, and the number of interactions each review received (for example, being "liked", marked "useful", or marked "interesting" by other users);
Here, metadata is data that defines data. Review data collected from an e-commerce platform has many dimensions, with fields such as user id, review time, and rating. Each such item, for example "user id" or "review time", is a piece of metadata, and the information in a review can be reconstructed from its metadata. This example uses only some of a review's data items (metadata); "review time", for instance, is not used.
S2: From the metadata in S1, compute five similarity indicators: co-reviewing times, similarity of rating on the same product, interaction times, positive rating ratio, and negative rating ratio. The similarity between two users' rating ratios is measured with the Euclidean distance;
S3: Build a weighted reviewer graph: each user is a node, any two users who have reviewed the same product are connected by an undirected edge, and the edge weight is computed from the five indicators obtained in S2;
S4: Partition the adjacency matrix of the graph built in S3 with a spectral clustering algorithm to obtain a number of groups;
S5: Select reasonable analysis indicators and suitable thresholds, and manually judge the category of each group. This example uses the extreme rating ratio, the repeated comment ratio, and the rating deviation as analysis indicators.
The co-reviewing times (CRT) in S2 is calculated as:
$CRT(n_1, n_2) = |P_1 \cap P_2|$
where $n_1$ and $n_2$ are two different reviewers, and $P_1$ and $P_2$ are the sets of products that $n_1$ and $n_2$ have reviewed, respectively.
The similarity of rating on the same product (SRSP) in S2 is calculated as:
[formula not preserved in this text]
where the rating terms are the scores given by $n_1$ and $n_2$ in their $i$-th and $j$-th reviews of product $P$, respectively, and $N_1$ and $N_2$ are the numbers of reviews that $n_1$ and $n_2$ have posted on product $P$.
The interaction times (IT) in S2 is calculated as:
[formula not preserved in this text]
where $C_{1i}$ and $C_{2i}$ denote the number of interactions of the $i$-th type for $n_1$ and $n_2$, respectively.
The positive rating ratio (PR) of a user in S2 is calculated as:
$PR = \frac{\sum S_i^0}{\sum S^0}$
where $S_i$ and $S$ denote the rating of one of the user's reviews, $\sum S^0$ is the number of reviews the user has posted with a rating in {1, 2, 3, 4, 5}, and $\sum S_i^0$ is the number of reviews the user has posted with a rating in {4, 5}.
The negative rating ratio (NR) of a user in S2 is calculated as:
[formula not preserved in this text]
S2 uses the Euclidean distance to measure how similar two users' positive rating ratios and negative rating ratios are:
[formula not preserved in this text]
where $r_{n1}$ and $r_{n2}$ are the positive rating ratios, or the negative rating ratios, of reviewers $n_1$ and $n_2$.
S3 comprises the following steps:
S3.1: Import all users as graph nodes;
S3.2: Take each pair of nodes as a node combination and count the products the two nodes have both reviewed;
S3.3: Check whether the co-reviewing count from S3.2 is greater than 0. If so, go to S3.4; if not, return to S3.2 for the next node combination, until all combinations have been traversed;
S3.4: Compute the remaining similarity indicators for the node pair, namely similarity of rating on the same product, interaction times, positive rating ratio, and negative rating ratio; then compute the weight and create an edge between the two nodes;
S3.5: The weighted reviewer graph is complete.
The edge weight in S3 is calculated as:
[formula not preserved in this text]
where $\omega_{ij}$ is the weight of the edge between node $i$ and node $j$; $k$ is the number of indicators, here 5; $CRT_{ij}$ is the number of products co-reviewed by user $i$ and user $j$; $SRSP_{ij}$ is the similarity of the ratings user $i$ and user $j$ gave to the same products; $IT_{ij}$ is the interaction times of user $i$ and user $j$; and the two remaining terms are the closeness of the positive rating ratios and of the negative rating ratios of user $i$ and user $j$, respectively.
S4 comprises the following steps:
S4.1: Input the weighted reviewer graph $G$ and the number of clusters $n$;
S4.2: From $G$, compute the adjacency matrix $A$, the degree matrix $D$, and the Laplacian matrix $L = D - A$;
S4.3: Obtain the normalized Laplacian matrix:
$NL = D^{-1/2}(D - A)D^{-1/2} = D^{-1/2} L D^{-1/2}$
S4.4: Compute the $k$ smallest eigenvalues of $NL$ and their corresponding eigenvectors $f$, with $k$ equal to the number of clusters $n$;
S4.5: Stack the eigenvectors $f$ as columns into a $v \times k$ feature matrix $F$ and normalize it row by row, where $v$ is the number of samples, i.e. the number of nodes in $G$;
S4.6: Cluster the rows of the normalized feature matrix with K-Means to obtain $n$ candidate groups $C = (c_1, c_2, \ldots, c_n)$.
As a specific embodiment, the similarity indicators discussed here may be omitted, replaced, or supplemented with other indicators according to the actual situation, and the ways the indicators are computed may be combined or modified.
Group division splits a large number of users into several groups whose members behave identically or similarly; whether a group is a fake-review group must then be judged manually. Because the types of data that e-commerce platforms generate and collect differ considerably, a practical implementation of the group division algorithm should be adjusted to the metadata types available in the given dataset. In principle, the idea of this embodiment can also be applied to fields such as public-opinion monitoring and marketing.
The fake-review group division method of this embodiment can be broken down into roughly five steps: data collection and cleaning, computing similarity statistics, building the weighted reviewer graph, dividing groups by spectral clustering, and manually judging group categories, where:
Data collection and cleaning: the raw data is collected, analyzed, and cleaned. Raw data often contains many erroneous or incomplete records; items with missing or outlier values can be handled by deleting them or filling them with the mean, among other approaches. This embodiment uses a research dataset, so it only needs to analyze the data, select the required data types, and delete unused data.
Computing similarity statistics: most of the indicators used in this embodiment are computed between pairs of users, so they are calculated while the weighted reviewer graph is being built. This step computes and stores in advance the per-user statistics that will be needed (such as each user's positive and negative rating ratios), so that the data is not counted again during later calculations, which would increase the algorithm's time complexity.
Building the weighted reviewer graph: the graph takes users as nodes. An edge is created when two users have reviewed the same product, and its weight reflects how similar the two nodes are. Graph construction and indicator calculation proceed together. First, as many nodes as there are users in the dataset are added to the graph, each named by its user id. Then the co-reviewing times is counted for each pair of nodes. If it is greater than or equal to 1, the remaining indicators are computed, all indicators are combined into the edge weight, and an edge connecting the two nodes is created. If it is 0, nothing is done and the co-reviewing times of the next node combination is computed, until every pair of nodes in the graph has been traversed.
Dividing groups by spectral clustering: as shown in Fig. 2, the number of groups $n$ is chosen first. The adjacency matrix and degree matrix are then computed from the weighted reviewer graph, and from these the Laplacian matrix and the normalized Laplacian matrix of the graph. Next, the $n$ smallest eigenvalues of the normalized Laplacian matrix and the corresponding $n$ eigenvectors are computed, and the eigenvectors are combined into a matrix $F$, which is normalized row by row. Finally, K-Means clusters the rows of $F$ to obtain $n$ candidate groups.
Manually judging group categories: this embodiment selects the discrimination indicators and their thresholds according to existing research in the field. The basis for judging a group's category differs from field to field.
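The data cleaning step described above (deleting incomplete records, or filling missing and outlier values with the mean) can be sketched in plain Python. The field names, the valid rating range of 1 to 5, and the choice to drop rows lacking an id while mean-filling bad ratings are all assumptions made for illustration; the embodiment itself only selected fields from a research dataset.

```python
from statistics import mean

# Hypothetical field names for the metadata listed in S1.
REQUIRED_IDS = ("user_id", "review_id", "product_id")
VALID_RATINGS = range(1, 6)  # assumption: integer ratings from 1 to 5


def clean_reviews(rows):
    """Drop rows missing an id; mean-fill missing or out-of-range ratings."""
    rows = [r for r in rows if all(r.get(f) for f in REQUIRED_IDS)]
    valid = [r["rating"] for r in rows if r.get("rating") in VALID_RATINGS]
    fill = mean(valid) if valid else None
    cleaned = []
    for r in rows:
        rating = r.get("rating")
        if rating not in VALID_RATINGS:
            if fill is None:      # nothing to fill from: drop the row
                continue
            rating = fill
        cleaned.append({**r, "rating": rating})
    return cleaned


raw = [
    {"user_id": "u1", "review_id": "r1", "product_id": "p1", "rating": 5},
    {"user_id": "u2", "review_id": "r2", "product_id": "p1", "rating": None},
    {"user_id": None, "review_id": "r3", "product_id": "p2", "rating": 4},
    {"user_id": "u3", "review_id": "r4", "product_id": "p2", "rating": 9},
    {"user_id": "u4", "review_id": "r5", "product_id": "p3", "rating": 3},
]
cleaned = clean_reviews(raw)
print([r["rating"] for r in cleaned])  # [5, 4, 4, 3]
```

Mean-filling can produce non-integer ratings, which would then count as neither positive nor negative under a {4, 5} or {1, 2} rule; deleting such rows instead is the other option the text mentions.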
This embodiment was applied to a research dataset from Yelp, the American review site, and other common fake-review group division methods were run as controls. Following current research in the field, the experiment used three representative fake-review group indicators to judge the quality of the division: Extreme Rating Ratio (ERR), Repeated Comment Ratio (RCR), and Rating Deviation (RD). On all three indicators, the algorithm of the invention outperformed the control methods: K-Means clustering, hierarchical clustering, and Louvain community detection.
Terms describing positional relationships in the drawings are for illustration only and shall not be construed as limiting this patent;
Obviously, the above embodiments are merely examples given to illustrate the present invention clearly and do not limit how the invention may be implemented. Those of ordinary skill in the art can make other changes or variations in different forms on the basis of the above description. It is neither necessary nor possible to list all implementations exhaustively here. Any modification, equivalent replacement, or improvement made within the spirit and principles of the present invention shall fall within the scope of protection of the claims of the present invention.
Claims (10)
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201910887582.2A CN110751180B (en) | 2019-09-19 | 2019-09-19 | Spurious comment group division method based on spectral clustering |
Publications (2)
Publication Number | Publication Date |
---|---|
CN110751180A (en) | 2020-02-04 |
CN110751180B (en) | 2023-06-20 |
Family
ID=69276657
Family Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201910887582.2A Active CN110751180B (en) | 2019-09-19 | 2019-09-19 | Spurious comment group division method based on spectral clustering |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN110751180B (en) |
Families Citing this family (1)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN117421492B (en) * | 2023-12-19 | 2024-04-05 | 四川久远银海软件股份有限公司 | Screening system and method for data element commodities |
Citations (2)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN108345587A (en) * | 2018-02-14 | 2018-07-31 | 广州大学 | A kind of the authenticity detection method and system of comment |
CN109829733A (en) * | 2019-01-31 | 2019-05-31 | 重庆大学 | A kind of false comment detection system and method based on Shopping Behaviors sequence data |
Non-Patent Citations (1)
Title |
---|
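Ma Xiaoning; Wang Ting; Dong Songyue. Identification of spam opinions in online public opinion based on PSO-SVM. Computer & Digital Engineering, 2018, (02), pp. 119-124. * |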
Also Published As
Publication number | Publication date |
---|---|
CN110751180A (en) | 2020-02-04 |
Similar Documents
Publication | Publication Date | Title |
---|---|---|
Malik et al. | EPR-ML: E-Commerce Product Recommendation Using NLP and Machine Learning Algorithm | |
CN112131480B (en) | Personalized commodity recommendation method and system based on multilayer heterogeneous attribute network representation learning | |
CN104778186B (en) | Merchandise items are mounted to the method and system of standardized product unit | |
TW201501059A (en) | Method and system for recommending information | |
CN113379494B (en) | Commodity recommendation method and device based on heterogeneous social relationship and electronic equipment | |
CN108921602B (en) | User purchasing behavior prediction method based on integrated neural network | |
CN105740430A (en) | Personalized recommendation method with socialization information fused | |
CN105608600A (en) | A method for evaluating and optimizing the effect of B2B sellers | |
CN110532429B (en) | Online user group classification method and device based on clustering and association rules | |
CN108648038B (en) | Credit frying and malicious evaluation identification method based on subgraph mining | |
KR102142126B1 (en) | Hierarchical Category Cluster Based Shopping Basket Associated Recommendation Method | |
CN111259140B (en) | False comment detection method based on LSTM multi-entity feature fusion | |
CN112070543B (en) | Method for detecting comment quality in E-commerce website | |
CN113763095A (en) | Information recommendation method and device and model training method and device | |
CN106126549A (en) | A kind of community's trust recommendation method decomposed based on probability matrix and system thereof | |
CN112613953A (en) | Commodity selection method, system and computer readable storage medium | |
CN112231583B (en) | E-commerce recommendation method based on dynamic interest group identification and generative adversarial network | |
CN113821827A (en) | Joint modeling method and device for protecting multi-party data privacy | |
CN111309815A (en) | Method and device for processing relation map and electronic equipment | |
CN110751180B (en) | Spurious comment group division method based on spectral clustering | |
CN110020918B (en) | Recommendation information generation method and system | |
CN114742564B (en) | False reviewer group detection method integrating complex relations | |
Liu et al. | Features for link prediction in social networks: A comprehensive study | |
CN118552235A (en) | Commodity data resource analysis method and system based on intelligent counting system | |
CN118780899A (en) | An e-commerce intelligent customer service product recommendation method based on customer behavior | |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
PB01 | Publication | ||
SE01 | Entry into force of request for substantive examination | ||
GR01 | Patent grant | ||