WO2021223275A1 - Online water army group detection method and apparatus - Google Patents

Online water army group detection method and apparatus

Info

Publication number
WO2021223275A1
WO2021223275A1 PCT/CN2020/092791 CN2020092791W WO2021223275A1 WO 2021223275 A1 WO2021223275 A1 WO 2021223275A1 CN 2020092791 W CN2020092791 W CN 2020092791W WO 2021223275 A1 WO2021223275 A1 WO 2021223275A1
Authority
WO
WIPO (PCT)
Prior art keywords
group
navy
value
target product
product
Prior art date
Application number
PCT/CN2020/092791
Other languages
French (fr)
Chinese (zh)
Other versions
WO2021223275A8 (en
Inventor
纪淑娟
张琪
李金鹏
许少华
伊磊
公茂果
Original Assignee
山东科技大学
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by 山东科技大学
Publication of WO2021223275A1
Publication of WO2021223275A8

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06Q INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR ADMINISTRATIVE, COMMERCIAL, FINANCIAL, MANAGERIAL OR SUPERVISORY PURPOSES; SYSTEMS OR METHODS SPECIALLY ADAPTED FOR ADMINISTRATIVE, COMMERCIAL, FINANCIAL, MANAGERIAL OR SUPERVISORY PURPOSES, NOT OTHERWISE PROVIDED FOR
    • G06Q30/00 Commerce
    • G06Q30/02 Marketing; Price estimation or determination; Fundraising
    • G06Q30/0201 Market modelling; Market analysis; Collecting market data
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/30 Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
    • G06F16/35 Clustering; Classification
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F17/00 Digital computing or data processing equipment or methods, specially adapted for specific functions
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00 Handling natural language data
    • G06F40/20 Natural language analysis
    • G06F40/279 Recognition of textual entities
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06Q INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR ADMINISTRATIVE, COMMERCIAL, FINANCIAL, MANAGERIAL OR SUPERVISORY PURPOSES; SYSTEMS OR METHODS SPECIALLY ADAPTED FOR ADMINISTRATIVE, COMMERCIAL, FINANCIAL, MANAGERIAL OR SUPERVISORY PURPOSES, NOT OTHERWISE PROVIDED FOR
    • G06Q30/00 Commerce
    • G06Q30/02 Marketing; Price estimation or determination; Fundraising
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06Q INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR ADMINISTRATIVE, COMMERCIAL, FINANCIAL, MANAGERIAL OR SUPERVISORY PURPOSES; SYSTEMS OR METHODS SPECIALLY ADAPTED FOR ADMINISTRATIVE, COMMERCIAL, FINANCIAL, MANAGERIAL OR SUPERVISORY PURPOSES, NOT OTHERWISE PROVIDED FOR
    • G06Q30/00 Commerce
    • G06Q30/02 Marketing; Price estimation or determination; Fundraising
    • G06Q30/0201 Market modelling; Market analysis; Collecting market data
    • G06Q30/0203 Market surveys; Market polls

Abstract

Disclosed is an online water army group detection method. The detection method comprises: acquiring review data information in a network, wherein the review data information comprises reviewed products, reviewers, review times, and the reviewers' ratings of the reviewed products (S101); identifying, on the basis of the review data information, a target product attacked by an online water army group (S102); and generating a candidate online water army group on the basis of the identified target product (S103). Because the method first locates the target products attacked by online water army groups and then detects the group attacking each target product, it can greatly improve the time and space efficiency of detecting online water army groups.

Description

Water army group detection method and apparatus
Cross-reference to related applications
This application claims priority to and the benefit of Chinese patent application No. 202010372504.1, filed with the China National Intellectual Property Administration on May 6, 2020, the disclosure of which is incorporated herein by reference in its entirety.
Technical field
This application relates to the field of network security and, more specifically, to a water army group detection method and apparatus.
Background
In e-commerce transactions, online product reviews strongly influence users' purchasing decisions. Users generally prefer products with high sales volume and many positive reviews over products with many negative reviews. Therefore, to boost sales volume, raise or lower a product's reputation, and earn more profit, many merchants hire fake reviewers to post large numbers of untruthful reviews that praise their own products or disparage competitors' products. A water army group is a group of people who post fake reviews in an organized, coordinated manner. Compared with an individual water army member, a water army group has greater influence. This is because a water army group is larger, can carry out fraud in an organized manner, and can even completely control opinion about a product, thereby misleading buyers' purchasing decisions, distorting the e-commerce reputation system, undermining fair competition among sellers on the e-commerce platform, and reducing the credibility of the trading environment, which ultimately affects the sustainable development of e-commerce companies and even the entire industry. Mining and discovering water army groups is therefore of great significance.
Since Jindal and Liu first posed the problem of detecting fake reviews (fake reviewers), more and more researchers have turned to it and produced much related work, including machine-learning-based, probability-based, behavior-feature-based, graph-based, and rule-based algorithms. In recent years, the detection of water army groups has attracted increasing attention.
Existing water army group detection algorithms fall into two classes: algorithms based on frequent itemset mining (FIM) and algorithms based on topological graphs. FIM-based algorithms assume that members of the same water army group tend to write fake reviews for the same product or service together, so-called co-reviewing. They use FIM to generate candidate water army groups and then build a model that ranks the groups by suspiciousness in order to find the real water army groups. However, co-reviewing does not necessarily imply collusive fraud (that is, several people working together to defraud the same target product). As recommender systems improve, many consumers may buy the same products or use the same services. In other words, co-reviewing is not a reliable signal, and normal reviewers are easily misjudged as water army members.
Moreover, FIM-based algorithms treat reviewers who have reviewed the same products as a candidate group, and the strength of the frequent itemset mining affects the reliability of FIM. If the strength is set too high (for example, requiring more than 5 co-reviewed products), only very tightly connected groups are produced and far fewer groups are mined. Conversely, if the strength is too low, the candidate groups contain many normal reviewers, and these algorithms make no provision for filtering normal reviewers out of the candidate groups.
Topological-graph-based algorithms model the relationships among reviewers (early studies used undirected graphs; directed weighted graphs are now common) and group the reviewers using graph partitioning or community detection algorithms. In general, such an algorithm first builds a reviewer graph from relational features of the reviewers (such as co-reviewing) and then generates candidate water army groups using graph partitioning algorithms, clustering algorithms, and the like. Because the reviewer graph is built from review metadata, building and processing it incurs high time and space complexity as the volume of review data grows rapidly. In particular, in graph-based algorithms the candidate water army groups are usually produced by a graph partitioning algorithm such as min-cut. However, groups carved out artificially by a graph partitioning algorithm may not match the actual water army groups.
Summary of the invention
To solve the above problems, the present invention provides a water army group detection method and apparatus, which not only improve detection efficiency but also better filter out genuine (or innocent) reviewers, thereby locating water army groups more accurately.
To achieve the above object, a water army group detection method is provided, the detection method comprising: acquiring review data information in a network, the review data information comprising reviewed products, reviewers, review times, and the reviewers' ratings of the reviewed products; identifying, based on the review data information, a target product attacked by a water army group; and generating a candidate water army group based on the identified target product.
Further, identifying the target product attacked by a water army group based on the review data information comprises: calculating a product rating distribution anomaly value and a product average rating distribution anomaly value based on the reviewers' ratings of the reviewed products; and calculating, from the product rating distribution anomaly value and the product average rating distribution anomaly value, a suspicious value of the target product attacked by a water army group, comparing the suspicious value with a set threshold for the target product suspicious value, and identifying the target product attacked by a water army group according to the comparison result.
Further, generating the candidate water army group based on the identified target product comprises: obtaining a review burst region of the identified target product using a kernel density estimation method, the review burst region being a region in which reviews of the identified target product surge within a short time; and obtaining the reviewers in the review burst region to generate the candidate water army group.
Further, the detection method also comprises: calculating a group fraud value of the candidate water army group, comparing the group size of the candidate water army group with a set value, comparing the group fraud value with a set threshold for the water army group fraud indicator, and outputting candidate water army groups according to the comparison results, wherein the group fraud value measures the degree of fraud of a water army group and the group size denotes the number of reviewers in the water army group.
Further, before the group fraud value is calculated, the group size and the group fraud value are compared with their set values, and the candidate water army groups are output, the detection method also comprises: calculating an individual fraud value for each reviewer in each candidate water army group, comparing the individual fraud value with a set threshold for the individual fraud indicator, and removing reviewers with low suspiciousness according to the comparison result to obtain a purified candidate group, wherein the individual fraud value measures the degree of fraud of a reviewer.
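The purification and output steps just described can be sketched as follows. This is a minimal illustration only: the function names, the `gss` callable, the direction of each comparison (keep values at or above the threshold), and the default minimum size are assumptions, since the text states only that the values are compared with set thresholds.

```python
from typing import Callable, Dict, List, Sequence


def purify_group(members: Sequence[str], iss: Dict[str, float],
                 iss_threshold: float) -> List[str]:
    """Drop reviewers whose individual fraud value is below the threshold."""
    return [a for a in members if iss[a] >= iss_threshold]


def select_groups(candidates: Sequence[Sequence[str]],
                  iss: Dict[str, float],
                  gss: Callable[[Sequence[str]], float],
                  iss_threshold: float,
                  gss_threshold: float,
                  min_size: int = 2) -> List[List[str]]:
    """Purify each candidate group, then keep groups that pass both checks.

    `gss` is a hypothetical callable returning the group fraud value of a
    member list; `min_size` stands in for the set value of the group size.
    """
    selected = []
    for members in candidates:
        purified = purify_group(members, iss, iss_threshold)
        if len(purified) >= min_size and gss(purified) >= gss_threshold:
            selected.append(purified)
    return selected
```

Purifying before scoring means the group fraud value is computed only over the reviewers that remain, which matches the order of steps given above.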
Further, the suspicious value $S_{TP}(p)$ of a target product attacked by a water army group is calculated by the following formula:
$$S_{TP}(p) = \omega S_{avg}(p) + (1 - \omega) S_{ext}(p)$$
where $p$ denotes the target product attacked by a water army group, $S_{avg}(p)$ is the product average rating distribution anomaly value, $S_{ext}(p)$ is the product rating distribution anomaly value, and $\omega$ is a weighting factor that balances $S_{avg}(p)$ and $S_{ext}(p)$ and takes a value between 0 and 1.
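As a small worked illustration of the weighted sum above, the sketch below computes the suspicious value and applies a threshold. The function names, the default weight, and the rule that a product is flagged when its value exceeds the threshold are assumptions made for the example.

```python
def target_product_suspicion(s_avg: float, s_ext: float, omega: float = 0.5) -> float:
    """S_TP(p) = omega * S_avg(p) + (1 - omega) * S_ext(p)."""
    if not 0.0 <= omega <= 1.0:
        raise ValueError("omega must lie between 0 and 1")
    return omega * s_avg + (1.0 - omega) * s_ext


def is_target_product(s_avg: float, s_ext: float,
                      omega: float = 0.5, threshold: float = 0.5) -> bool:
    """Flag the product when its suspicious value exceeds the threshold (assumed rule)."""
    return target_product_suspicion(s_avg, s_ext, omega) > threshold
```

For example, with `s_avg = 0.8`, `s_ext = 0.4`, and `omega = 0.25`, the suspicious value is 0.25 × 0.8 + 0.75 × 0.4 = 0.5.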
Further, obtaining the review burst region of the identified target product using the kernel density estimation method comprises: calculating the life cycle of the identified target product; modeling the reviews of the identified target product and the corresponding review time series using the kernel density estimation method; setting a time window size and dividing the life cycle of the identified target product into a plurality of sub-time-windows; selecting the upper bound of each sub-time-window and the number of reviews within that sub-time-window as a sample point; calculating the kernel density estimate according to
Figure PCTCN2020092791-appb-000001
and obtaining the set of extreme points of the number of reviews of the identified target product; calculating the average number of reviews per sub-time-window, where average number of reviews = total number of reviews / number of sub-time-windows; and determining whether the number of reviews in the sub-time-window containing each extreme point in the obtained set is greater than the average number of reviews and greater than 1, and obtaining the review burst region according to the determination result, wherein the review burst region is the region formed by adding and subtracting a set number of days to and from the time corresponding to each extreme point whose review count is greater than the average number of reviews and greater than 1.
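The burst-detection steps can be sketched roughly as below. The kernel density formula is available here only as an image reference, so this sketch assumes a Gaussian kernel weighted by the review count of each sub-time-window and considers only local maxima as extreme points; `window`, `bandwidth`, and `margin` are hypothetical parameter names and defaults, with times expressed as day numbers.

```python
import math


def review_burst_regions(days, window=7.0, bandwidth=7.0, margin=5.0):
    """Return (start, end) burst regions for one product's review days."""
    days = sorted(float(d) for d in days)
    if not days:
        return []
    start, end = days[0], days[-1]
    # Split the product life cycle into equal sub-time-windows.
    n_windows = max(1, math.ceil((end - start + 1.0) / window))
    counts = [0] * n_windows
    for d in days:
        counts[min(int((d - start) // window), n_windows - 1)] += 1
    # Sample points: (upper bound of the window, review count in the window).
    uppers = [start + window * (i + 1) for i in range(n_windows)]
    total = len(days)

    def density(x):
        # Assumed estimator: count-weighted Gaussian kernel density.
        s = sum(c * math.exp(-0.5 * ((x - u) / bandwidth) ** 2)
                for u, c in zip(uppers, counts))
        return s / (total * bandwidth * math.sqrt(2.0 * math.pi))

    dens = [density(u) for u in uppers]
    average = total / n_windows
    regions = []
    for i in range(n_windows):
        left = dens[i - 1] if i > 0 else -math.inf
        right = dens[i + 1] if i < n_windows - 1 else -math.inf
        is_peak = dens[i] >= left and dens[i] >= right
        # Keep peaks whose window holds more reviews than average and more than 1.
        if is_peak and counts[i] > average and counts[i] > 1:
            regions.append((uppers[i] - margin, uppers[i] + margin))
    return regions
```

The reviewers who posted inside a returned region would then form the candidate group for that product.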
Further, the group fraud value GSS(g) is obtained by the following formula:
Figure PCTCN2020092791-appb-000002
where g denotes a group formed by reviewers, GTW(g) is the group time window, GRD(g) is the group rating deviation, GS(g) is the group size, GRT(g) is the group review tightness, GOR(g) is the group one-day review count, GER(g) is the group extreme rating ratio, GCA(g) is the group co-activity, and GCAR(g) is the group co-active-period review ratio, wherein:
GTW(g) measures how active the group is;
GRD(g) reflects how far the group's ratings deviate from the average rating of the target product;
GRT(g) measures how closely the group members cooperate in writing fake reviews;
GOR(g) reflects the number of reviews a group posts in one day;
GER(g) is the average of the group members' extreme rating ratios;
GCA(g) indicates the degree to which the group members are active together within a given period;
GCAR(g) indicates the proportion of the group's total reviews that are reviews of the target product posted during the co-active period.
Further, the individual fraud value ISS(a) is obtained by the following formula:
$$ISS(a) = \frac{EXR(a) + RD(a) + MRO(a) + RTI(a) + AD(a) + ATR(a)}{6}$$
where a denotes a reviewer, EXR(a) is the extreme rating ratio, RD(a) is the rating deviation, MRO(a) is the maximum number of reviews in one day, RTI(a) is the review time interval, AD(a) is the account duration, and ATR(a) is the active-period review ratio, wherein:
EXR(a) is the ratio of the number of extreme ratings to the reviewer's total number of ratings;
RD(a) reflects how far the reviewer's ratings deviate from the overall rating of the product;
MRO(a) reflects the maximum number of reviews a reviewer posts in a single day;
RTI(a) indicates the length of the time interval between a reviewer's reviews;
AD(a) indicates the time interval between the first and the last review posted by the reviewer;
ATR(a) measures the relationship between the number of reviews posted during the reviewer's active period and the reviewer's total number of reviews.
According to another aspect of this application, a water army group detection apparatus is provided, the detection apparatus comprising: a data information acquisition module, which acquires review data information in a network, the review data information comprising reviewed products, reviewers, review times, and the reviewers' ratings of the reviewed products; an anomaly value calculation module, which calculates a product rating distribution anomaly value and a product average rating distribution anomaly value based on the reviewers' ratings of the reviewed products; a target product identification module, which calculates, from the product rating distribution anomaly value and the product average rating distribution anomaly value, a suspicious value of the target product attacked by a water army group, compares the suspicious value with a set threshold for the target product suspicious value, and identifies the target product attacked by a water army group according to the comparison result; and a candidate water army group generation module, which generates a candidate water army group based on the identified target product.
According to yet another aspect of this application, a computer device is provided, comprising a memory and a processor, the memory storing a computer program executable on the processor, wherein the processor implements the steps of the above water army group detection method when executing the computer program.
According to a further aspect of this application, a computer-readable storage medium is provided, on which a computer program is stored, wherein the computer program implements the steps of the above water army group detection method when executed by a processor.
The water army group detection method of this application detects the water army group attacking each target product by first locating the target products attacked by water army groups, which can greatly improve the time and space efficiency of detecting water army groups.
Brief description of the drawings
The accompanying drawings, which form a part of this application, provide a further understanding of the present invention. The exemplary embodiments of the present invention and their description serve to explain the invention and do not unduly limit it. In the drawings:
FIG. 1 shows a flowchart of the water army group detection method according to this application;
FIG. 2 shows a schematic diagram of a review burst according to this application;
FIG. 3 shows a flowchart of obtaining the review burst region of an identified target product using the kernel density estimation method according to an embodiment of this application;
FIG. 4 shows a flowchart of the water army group detection method according to a preferred embodiment of this application;
FIGS. 5a to 5f show comparisons of the CDF curves of the individual fraud indicators for the top 500 water army groups generated by GSBC and by the GSDB according to this application, respectively;
FIGS. 6a to 6i show comparisons of the CDF curves of the group fraud behavior indicators, and of the curve of the average of all group indicators, for the top 500 water army groups generated by GSBC and by the GSDB according to this application, respectively;
FIG. 7 shows a comparison of the sizes of the top 500 groups generated by GSBC and by the GSDB according to this application, respectively;
FIGS. 8a to 8c show comparisons of GSBC and the GSDB according to this application on the top n groups;
FIG. 9 shows a schematic structural diagram of the water army group detection apparatus according to this application.
Detailed description
To make the objects, technical solutions, and advantages of this application clearer, the application is described in further detail below with reference to the accompanying drawings and embodiments. It should be understood that the specific embodiments described here serve only to explain this application and are not intended to limit it.
According to this application, a water army group detection method is provided, the detection method comprising: acquiring review data information in a network, the review data information comprising reviewed products, reviewers, review times, and the reviewers' ratings of the reviewed products; identifying, based on the review data information, a target product attacked by a water army group; and generating a candidate water army group based on the identified target product.
The water army group detection method according to this application detects the water army group attacking each target product by first locating the target products attacked by water army groups, which can greatly improve the time and space efficiency of detecting water army groups.
When mining and discovering water army groups, a set of effective indicators (or features) is needed to evaluate the suspiciousness of individuals and of groups. Therefore, on the basis of big-data analysis, this application uses the following data features as indicators for evaluating the suspiciousness of individual water army members and of water army groups.
The individual fraud behavior indicators and the group fraud behavior indicators are described in detail below.
In this application, the individual fraud behavior indicators are the extreme rating ratio (EXR), the rating deviation (RD), the maximum number of reviews in one day (MRO), the review time interval (RTI), the account duration (AD), and the active-period review ratio (ATR).
The extreme rating ratio (EXR) reflects the ratio of the number of extreme ratings to the reviewer's total number of ratings. The higher the EXR, the more suspicious the reviewer. On a five-star scale, EXR(a) is calculated as follows:
Figure PCTCN2020092791-appb-000004
where $R_a$ is the set of ratings given by reviewer a, and $r_a$ is an element of $R_a$.
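The EXR definition can be illustrated with a short sketch. The formula itself is available here only as an image reference, so treating 1 and 5 stars as the extreme ratings on a five-star scale is an assumption, as are the function and parameter names.

```python
from typing import Iterable, Sequence


def extreme_rating_ratio(ratings: Sequence[int],
                         extremes: Iterable[int] = (1, 5)) -> float:
    """Share of a reviewer's ratings that are extreme (assumed: 1 or 5 stars)."""
    if not ratings:
        return 0.0
    extreme_set = set(extremes)
    return sum(r in extreme_set for r in ratings) / len(ratings)
```

A reviewer with ratings 5, 5, 1, and 3 would have three extreme ratings out of four.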
The rating deviation (RD) reflects how far a reviewer's ratings deviate from the overall rating of the product. The overall rating given by reviewers reflects the basic quality of a product. The higher the RD, the more suspicious the reviewer. RD(a) is calculated as follows:
Figure PCTCN2020092791-appb-000005
where $r_{ap}$ is the rating of product p by reviewer a, and
Figure PCTCN2020092791-appb-000006
is the average rating of product p. This application normalizes by dividing by 4, the largest possible rating deviation on a five-point scale.
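A minimal sketch of the rating deviation follows. The division by 4 comes from the text; averaging the normalized deviations over all products the reviewer rated is an assumption, because the formula is available here only as an image reference. The names are hypothetical.

```python
from typing import Dict


def rating_deviation(reviewer_ratings: Dict[str, float],
                     product_means: Dict[str, float]) -> float:
    """Mean of |r_ap - mean rating of p| / 4 over the products the reviewer rated."""
    deviations = [abs(r - product_means[p]) / 4.0
                  for p, r in reviewer_ratings.items()]
    return sum(deviations) / len(deviations) if deviations else 0.0
```

A reviewer who gives 5 stars to a product whose average rating is 1 reaches the maximum value of 1.0.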
The maximum number of reviews in one day (MRO) reflects the largest number of reviews a reviewer posts in a single day, normalized by the maximum over all reviewers. The higher the MRO, the more suspicious the reviewer. MRO(a) is calculated as follows:
$$MRO(a) = \frac{MaxRev(a)}{\max_{a' \in A} MaxRev(a')}$$
where MaxRev(a) is the maximum number of reviews posted by reviewer a in one day, and A is the set of reviewers.
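The normalization described for MRO can be sketched as follows, assuming each review carries a calendar-day label. The function names and the input layout (a mapping from reviewer to the list of days on which that reviewer posted) are hypothetical.

```python
from collections import Counter
from typing import Dict, Hashable, Sequence


def max_reviews_in_one_day(review_days: Sequence[Hashable]) -> int:
    """MaxRev(a): the largest number of reviews posted on any single day."""
    return max(Counter(review_days).values(), default=0)


def mro_scores(days_by_reviewer: Dict[str, Sequence[Hashable]]) -> Dict[str, float]:
    """MRO(a) = MaxRev(a) divided by the largest MaxRev over all reviewers."""
    max_rev = {a: max_reviews_in_one_day(d) for a, d in days_by_reviewer.items()}
    top = max(max_rev.values(), default=0)
    return {a: (m / top if top else 0.0) for a, m in max_rev.items()}
```

The reviewer with the highest single-day count always scores 1.0, and every other reviewer is scored relative to that reviewer.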
The review time interval (RTI) indicates the length of the time interval between a reviewer's reviews and reflects how active the reviewer is. The higher the RTI, the more suspicious the reviewer. RTI(a) is calculated as follows:
Figure PCTCN2020092791-appb-000008
where $T_a$ is the review time series of reviewer a,
Figure PCTCN2020092791-appb-000009
is the l-th element of $T_a$, and ρ is the threshold of the time interval (also called the time window). This threshold can be chosen according to the amount of data: a relatively small value when the data volume is large, and a relatively large value when the data volume is small.
The account duration (AD) indicates the time interval between the first and the last review posted by a reviewer. The higher the AD, the more suspicious the reviewer. AD(a) is calculated as follows:
Figure PCTCN2020092791-appb-000010
where
Figure PCTCN2020092791-appb-000011
and
Figure PCTCN2020092791-appb-000012
are the times at which reviewer a posted the first and the last review, respectively, and $t_{data}$ is the time span of the entire data set.
The active-period review ratio (ATR) measures the relationship between the number of reviews posted during a reviewer's active period and the reviewer's total number of reviews. As a rule, genuine reviewers post reviews as the need arises, so the timing and number of their reviews are largely random, whereas the posting of water army members has active periods, that is, they typically post a large number of fake reviews within a short time. The higher the ATR, the more suspicious the reviewer. ATR(a) is calculated as follows:
Figure PCTCN2020092791-appb-000013
where ActiveTimePeriod(a) denotes the set of reviews posted by reviewer a during the active period, and $R_a$ is the set of all reviews posted by reviewer a across all products.
In addition, according to this application, the average of the above six indicators is taken as the individual fraud value (ISS), which measures the degree of fraud of a reviewer. It is calculated as follows:
$$ISS(a) = \frac{EXR(a) + RD(a) + MRO(a) + RTI(a) + AD(a) + ATR(a)}{6}$$
This application uses eight group-level fraud indicators: group time window (GTW), group rating deviation (GRD), group size (GS), group review tightness (GRT), group one-day reviews (GOR), group extreme rating ratio (GER), group co-activity (GCA), and group co-active review ratio (GCAR).
A time window (TW) is commonly used to measure how active a spammer ("water army") group is. This application uses the group time window (GTW) for this purpose. The indicator was first proposed by Mukherjee et al., who considered the interval between the first and the last review time. Wang et al., in contrast, used the standard deviation of the group members' review times to measure the distribution of the group's review times as a whole.
This application sets a time window threshold, treats any time window below the threshold as an active window, and then computes the activity of the active windows. The higher the GTW, the more suspicious the group. It is calculated as follows:
Figure PCTCN2020092791-appb-000015
SD_g^p is the standard deviation of the times at which the members of group g reviewed product p, computed with the usual standard deviation formula. It shows whether the group's reviews of a target product are concentrated in time: concentrated review times give a small standard deviation, which is more suspicious.
T is a user-defined time window threshold used to judge whether a group's review times are concentrated. It is generally set to a fairly large value, such as 30 days. By contrast, ρ in the individual fraud indicators applies to the review intervals of a single reviewer and is generally set to a small value, such as 7 days. Both ρ and T can be set according to actual needs.
p ∈ P_g, where P_g is the target product set of group g, that is, the set of products reviewed by at least half of the members of group g.
The group rating deviation (GRD) reflects how far the group's ratings deviate from the average rating of its target products. The higher the GRD, the more suspicious the group. GRD(g) is calculated as follows:
Figure PCTCN2020092791-appb-000016
r_ap is the rating given to product p by user a in group g,
Figure PCTCN2020092791-appb-000017
is the average rating of product p, and p ∈ P_g, where P_g is the target product set of group g, that is, the set of products reviewed by at least half of the members of group g. RD_p(g) is the rating deviation of group g on target product p. It is normalized by dividing by 4, which is the maximum rating deviation on a five-point scale. GRD(g) is the average of the rating deviations of group g over all of its target products.
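As an illustrative sketch only, the target product set P_g and GRD(g) described above might be computed as follows. The data layout (a dictionary keyed by reviewer and product) and the function names are hypothetical. Because the patent gives the exact formula for RD_p(g) only as an image, the sketch assumes that RD_p(g) is the absolute difference between the group's mean rating of p and the overall mean rating of p, divided by 4, and that "at least half" includes exactly half.

```python
from collections import defaultdict


def target_products(group, ratings):
    """Return P_g: products reviewed by at least half of the group's members.

    ratings maps (reviewer, product) -> rating; this layout is an assumption.
    """
    counts = defaultdict(int)
    for reviewer, product in ratings:
        if reviewer in group:
            counts[product] += 1
    return {p for p, c in counts.items() if c >= len(group) / 2}


def group_rating_deviation(group, ratings):
    """GRD(g): mean over target products of the normalized rating deviation.

    Assumed form of RD_p(g): |group mean rating - product mean rating| / 4.
    """
    products = target_products(group, ratings)
    if not products:
        return 0.0
    deviations = []
    for p in products:
        all_r = [r for (a, q), r in ratings.items() if q == p]
        grp_r = [r for (a, q), r in ratings.items() if q == p and a in group]
        product_mean = sum(all_r) / len(all_r)
        group_mean = sum(grp_r) / len(grp_r)
        deviations.append(abs(group_mean - product_mean) / 4)
    return sum(deviations) / len(deviations)
```

Dividing by 4 keeps each per-product deviation in [0, 1] on a five-point scale, so the averaged GRD(g) is directly comparable with the other indicators.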
The group size (GS) is the number of reviewers in a spammer group and reflects how large the group is. The larger a group, the more suspicious and harmful it is, because small groups often form by chance whereas large groups generally form in pursuit of a specific goal. GS(g) is calculated as follows:
Figure PCTCN2020092791-appb-000018
where R_g is the set of members of group g and |R_g| is the number of members of group g.
The group review tightness (GRT) measures how closely the group members cooperate in writing fake reviews. GRT(g) is calculated as follows:
Figure PCTCN2020092791-appb-000019
where V_g is the set of reviews that the members of the group posted on the group's target products.
The group one-day reviews indicator (GOR) looks at the number of reviews a group posts within a single day. A group whose members often post many reviews in one day is highly suspicious. Mukherjee et al. estimate that spammers usually post at least 6 reviews a day, whereas normal reviewers usually post only 1 or 2.
In this application, the number of days on which a group member posted more than 5 reviews is counted, and the average over the group members is taken as the group one-day reviews value. GOR(g) is calculated as follows:
Figure PCTCN2020092791-appb-000020
T_a is the set of all review dates of group member a, t_a is an element of T_a, and CountRev(t_a) is the number of reviews that group member a posted on date t_a.
The group extreme rating ratio (GER) is defined as the average of the group members' extreme rating ratios. It is calculated as follows:
Figure PCTCN2020092791-appb-000021
R_a is the set of reviews of group member a, and r_a is an element of R_a.
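A minimal sketch of the GER definition above, given for illustration only. It assumes that ratings of 1 and 5 count as extreme, as this document does when discussing product rating distributions, and that each member's ratings are available as a list; the function names and data layout are hypothetical.

```python
def extreme_rating_ratio(ratings):
    """Fraction of one reviewer's ratings that are extreme (assumed: 1 or 5)."""
    if not ratings:
        return 0.0
    return sum(1 for r in ratings if r in (1, 5)) / len(ratings)


def group_extreme_rating_ratio(member_ratings):
    """GER(g): mean of the members' extreme rating ratios.

    member_ratings maps reviewer -> list of ratings (hypothetical layout).
    """
    ratios = [extreme_rating_ratio(rs) for rs in member_ratings.values()]
    return sum(ratios) / len(ratios) if ratios else 0.0
```

For example, a member with ratings [5, 5, 1, 3] has a ratio of 0.75, and a member with ratings [4, 2] has a ratio of 0, so a group of these two members has GER = 0.375.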
Group members posting reviews together within a short time can be regarded as suspicious co-active fraud. The group co-activity (GCA) indicates how often, or to what degree, the group members are active together within a period of time, and is normalized with the logistic function. GCA(g) is calculated as follows:
Figure PCTCN2020092791-appb-000022
CA_g is the set of co-active periods in which all members of the group posted reviews together within δ days, where δ is a preset threshold (for example, the group being co-active on 5 consecutive days). |CA_g| is the number of periods in which group g is co-active, measured in days.
The group co-active review ratio (GCAR) is the proportion of the group's total reviews that were posted on target products during its co-active periods. A group that has a large co-active ratio and posts many reviews while co-active exhibits suspicious fraudulent behavior. The higher the GCAR, the more suspicious the group. GCAR(g) is calculated as follows:
Figure PCTCN2020092791-appb-000023
Figure PCTCN2020092791-appb-000024
is the set of reviews posted by group g during its co-active periods.
This application takes the average of the above eight indicators as the group fraud value (GSS), which measures the degree of group fraud. It is calculated as follows:
Figure PCTCN2020092791-appb-000025
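Since the text defines GSS as the plain average of the eight group indicators, the computation can be sketched as below. This is an illustration under the assumption that each indicator has already been computed and normalized to a comparable range; the dictionary layout and names are hypothetical.

```python
GROUP_INDICATORS = ("GTW", "GRD", "GS", "GRT", "GOR", "GER", "GCA", "GCAR")


def group_fraud_value(indicators):
    """GSS(g): unweighted mean of the eight group fraud indicators.

    indicators maps indicator name -> value (hypothetical layout).
    """
    missing = [name for name in GROUP_INDICATORS if name not in indicators]
    if missing:
        raise KeyError(f"missing indicators: {missing}")
    return sum(indicators[name] for name in GROUP_INDICATORS) / len(GROUP_INDICATORS)
```

The individual fraud value ISS could be computed the same way from the six individual indicators.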
According to this application, the above data features are used as suspiciousness indicators for individual spammers and spammer groups, and candidate spammer groups are confirmed on the basis of these indicators.
The method for detecting spammers or spammer groups from the above indicators is described in detail below.
This application discovers candidate spammer groups from the product side: it first identifies the target products attacked by spammers and then finds candidate spammer groups on the basis of those target products.
Fig. 1 shows a flowchart of the spammer group detection method according to this application.
As shown in Fig. 1, the spammer group detection method includes:
S101: Obtain review data from the network, where the review data includes the reviewed product, the reviewer, the review time, and the reviewer's rating of the reviewed product;
S102: Identify the target products attacked by spammer groups on the basis of the review data;
Whether a product has been attacked by spammers is reflected in anomalies in its rating distribution, for example in the product's normal and abnormal average ratings and its normal and abnormal sets of ratings. This application combines the anomaly in a product's rating distribution with the anomaly in its average rating distribution to compute an anomaly value and thereby detect target products.
Specifically, according to a preferred embodiment of this application, the target products attacked by spammer groups can be identified from the review data as follows:
S1021: Compute the product rating distribution anomaly value and the product average rating distribution anomaly value from the reviewers' ratings of the reviewed product; and
S1022: Compute the suspicious value of the product as a target of spammer groups from these two anomaly values, compare it with a preset threshold on the target product suspicious value, and identify the target products attacked by spammer groups from the result of the comparison.
Target products attacked by spammers are known to show anomalies in the distribution of extreme ratings (1 or 5 points). In this application, the product rating distribution anomaly value S_ext(p) is computed from the proportion of extreme ratings of each product:
Figure PCTCN2020092791-appb-000026
r_p is a rating of product p. The higher S_ext(p), the more likely product p has been attacked.
In addition, most spammer accounts are used only once, that is, they belong to one-time reviewers. In this application, the product average rating distribution anomaly value S_avg(p) is obtained from the ratio of the average rating given by general reviewers (TR) to the average rating given by one-time reviewers (SR). S_avg(p) is calculated as follows:
Figure PCTCN2020092791-appb-000027
Figure PCTCN2020092791-appb-000028
is the average rating given to product p by the one-time reviewers SR, and
Figure PCTCN2020092791-appb-000029
is the average rating given to product p by the general reviewers TR.
The product rating distribution anomaly value and the product average rating distribution anomaly value are obtained from formulas (17) and (18) above.
According to this application, the suspicious value of a product is computed by combining the product rating distribution anomaly value with the product average rating distribution anomaly value, which quantifies how likely the product is to be a target product. According to a preferred embodiment of this application, the suspicious value S_TP(p) of a product targeted by spammer groups can be computed with the following formula:
S_TP(p) = ω·S_avg(p) + (1 − ω)·S_ext(p)       (19)
where p is the product under consideration, S_avg(p) is the product average rating distribution anomaly value, S_ext(p) is the product rating distribution anomaly value, and ω is a weighting factor that balances S_avg(p) and S_ext(p). ω takes a value between 0 and 1 and is preferably 0.5.
According to this application, the suspicious value S_TP(p) is computed for a product and compared with the preset threshold δ_TP. If S_TP(p) is greater than or equal to δ_TP, the product is regarded as a target product attacked by a spammer group. The threshold δ_TP is obtained experimentally by the difference method, taking the lowest value that gives good results; a higher value that gives good results may also be used.
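Formula (19) and the threshold test can be sketched as follows. The sketch takes S_avg(p) and S_ext(p) as already computed inputs, because their exact formulas appear only as images in this document; the function names, the dictionary layout, and the example threshold are hypothetical.

```python
def target_product_score(s_avg, s_ext, omega=0.5):
    """S_TP(p) = omega * S_avg(p) + (1 - omega) * S_ext(p), formula (19)."""
    if not 0.0 <= omega <= 1.0:
        raise ValueError("omega must lie between 0 and 1")
    return omega * s_avg + (1.0 - omega) * s_ext


def find_target_products(anomalies, delta_tp, omega=0.5):
    """Return the products whose suspicious value reaches the threshold.

    anomalies maps product -> (S_avg, S_ext); this layout is an assumption.
    """
    return {
        p
        for p, (s_avg, s_ext) in anomalies.items()
        if target_product_score(s_avg, s_ext, omega) >= delta_tp
    }
```

With ω = 0.5, a product with S_avg = 0.8 and S_ext = 0.6 has a suspicious value of about 0.7 and is kept for an illustrative threshold of 0.5, while a product with S_avg = 0.2 and S_ext = 0.3 is not.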
S103: Generate candidate spammer groups from the identified target products.
According to a preferred embodiment of this application, the spammer group detection method generates candidate spammer groups by kernel density estimation. Specifically, kernel density estimation is used to find the review burst regions of each identified target product, and the reviewers within a review burst region form a candidate spammer group. A review burst region is a region in which the reviews of an identified target product surge within a short time.
A surge in the reviews of a product within a short time is called a review burst. Fig. 2 schematically shows an example of a review burst; the horizontal axis is the normalized time span and the vertical axis is the number of reviews. The burst occurs between 0.5 and 0.6.
Assuming that a review burst signals fraudulent activity by a spammer group, candidate spammer group information can be obtained by first finding the review burst regions. This application uses kernel density estimation (KDE) to find them. Fig. 3 shows a flowchart for obtaining the review burst regions.
As shown in Fig. 3, using kernel density estimation to obtain the review burst regions of an identified target product mainly includes:
S201: Compute the life cycle of the identified target product (see line 1 of the algorithm below).
Suppose the product has m reviews in total, with the first review at time t_1 and the last review at time t_m. The interval between the first and the last review is the product's life cycle dur, so dur = t_m − t_1.
S202: Model the reviews of the identified target product and the review time series corresponding to those reviews for kernel density estimation (see lines 2-3 of the algorithm below).
In this step, the review sequence of product p and the review time sequence corresponding to the review set of product p are set up.
S203: Set the time window size and divide the life cycle of the identified target product into multiple sub-time windows (see line 4 of the algorithm below).
In this application, a suitable time window size ISIZE is chosen and the life cycle dur of a product is divided into small time windows (sub-time windows), so the number of sub-time windows is k = dur/ISIZE. According to a preferred embodiment of this application, ISIZE can be set to 7 days.
S204: Take the upper bound of each sub-time window and the number of reviews within that sub-time window as a sample point (see lines 5-9 of the algorithm below).
S205: Compute the kernel density estimate according to
Figure PCTCN2020092791-appb-000030
and obtain the set of extreme points of the review count of the identified target product (see lines 10-11 of the algorithm below).
The formula uses the Gaussian kernel
Figure PCTCN2020092791-appb-000031
and h controls the smoothness of the estimate. The value of h is chosen experimentally so that the estimated curve is neither too smooth nor jagged. Differentiating KDE(x) and setting the derivative to zero gives the extreme points of the estimated curve, each of which corresponds to a review count of the identified target product. This yields the set of extreme points of the review count of the identified target product.
S206: Compute the average number of reviews per sub-time window, where the average number of reviews = total number of reviews / number of sub-time windows (see line 12 of the algorithm below).
The average number of reviews is avg_rev = m/k, where m is the total number of reviews and k is the number of sub-time windows.
S207: Determine whether the number of reviews in the sub-time window containing each extreme point is greater than the average number of reviews and greater than 1, and obtain the review burst regions from the result. A review burst region is the interval formed by adding and subtracting a set number of days to the time of an extreme point whose window satisfies both conditions (see lines 13-19 of the algorithm below).
The goal of this application is to find review burst regions, that is, regions in which the reviews of the target product surge within a short time. Extreme points that fall in windows whose review count is less than or equal to the average are therefore ignored, as are extreme points in windows with at most one review. Only extreme points in windows whose review count is greater than the overall average and greater than 1 are considered.
Once the extreme points have been filtered, the interval formed by adding and subtracting a set number of days to the time of each remaining extreme point is taken as a review burst region. For example, the three days before and after that time, 7 days in total, can be taken as the review burst region.
The reviewers within a review burst region then form a candidate spammer group.
In this way, the candidate spammer groups attacking the target product are obtained.
The generation of candidate spammer groups by kernel density estimation described above can be implemented with the following algorithm:
1. Compute the life cycle of product p: dur = t_m − t_1
2. Let V_p = {v_1, v_2, ..., v_m};
3. Let T_p = {t_1, t_2, ..., t_m};
4. Set a suitable window size ISIZE and divide dur evenly into k windows
5. for window i ∈ {1, ..., k} do
6. Let R_i = {v_j | t_j ∈ (b_{i-1}, b_i]}, where b_i = i·ISIZE
7. Let b_i = b_i / dur
8. Let x_i = b_i, w_i = |R_i|
9. end for
10. Compute
Figure PCTCN2020092791-appb-000032
where
Figure PCTCN2020092791-appb-000033
11. Compute the derivative of KDE(x), set it to 0, and obtain the set of all extreme points
Figure PCTCN2020092791-appb-000034
12. Compute the average number of reviews per window: avg_rev = m/k
13. for
Figure PCTCN2020092791-appb-000035
in x_peak do
14. if
Figure PCTCN2020092791-appb-000036
falls in window i then
15. if |H_i| ≠ 1 and |H_i| > avg_rev then
16. Let t_peak equal
Figure PCTCN2020092791-appb-000037
17. Output the reviewers in the time interval {t_peak − 3, t_peak + 3} as a candidate spammer group
18. end if
19. end for
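The algorithm above can be illustrated with the following sketch, which is an approximation under stated assumptions rather than the patented implementation. The KDE formula is only available as an image, so the sketch assumes a weighted Gaussian KDE over the sample points (x_i, w_i). It locates local maxima numerically on a grid instead of solving for a zero derivative, and it assigns each peak to the window whose sample point x_i is nearest. The function names, the `grid` parameter, and the default bandwidth are hypothetical.

```python
import math


def gaussian(u):
    """Standard Gaussian kernel."""
    return math.exp(-0.5 * u * u) / math.sqrt(2.0 * math.pi)


def burst_candidate_groups(reviews, isize=7, h=0.05, grid=1000, margin=3):
    """Return one set of reviewers per detected review burst of a product.

    reviews: list of (reviewer, day) pairs, with day given as a number of days.
    isize, h and margin mirror ISIZE, h and the +/- 3 days of the text.
    """
    days = [d for _, d in reviews]
    if not days or max(days) == min(days):
        return []
    t1, dur = min(days), max(days) - min(days)      # line 1: life cycle
    k = math.ceil(dur / isize)                       # line 4: number of windows
    counts = [0] * k
    for d in days:                                   # lines 5-9: reviews per window
        i = min(k - 1, max(0, math.ceil((d - t1) / isize) - 1))
        counts[i] += 1
    xs = [(i + 1) * isize / dur for i in range(k)]   # normalized upper bounds x_i
    total = sum(counts)

    def kde(x):                                      # line 10 (assumed weighted form)
        return sum(w * gaussian((x - xi) / h) for xi, w in zip(xs, counts)) / (total * h)

    lo, hi = -3 * h, xs[-1] + 3 * h
    pts = [lo + (hi - lo) * j / (grid - 1) for j in range(grid)]
    vals = [kde(x) for x in pts]
    avg_rev = total / k                              # line 12
    groups = []
    for j in range(1, grid - 1):                     # lines 11 and 13-19
        if vals[j] > vals[j - 1] and vals[j] >= vals[j + 1]:
            # Assumption: a peak belongs to the window with the nearest sample point.
            i = min(k - 1, max(0, round(pts[j] * dur / isize) - 1))
            if counts[i] > 1 and counts[i] > avg_rev:
                t_peak = t1 + pts[j] * dur
                groups.append({a for a, d in reviews if abs(d - t_peak) <= margin})
    return groups
```

Because each sample point sits at the upper bound of its window, t_peak lies near the end of the bursty window, so the ±3-day interval also covers the first days of the following window. Normal reviewers who happen to post in that interval end up in the candidate group, which is why the later purification steps matter.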
According to a preferred embodiment of this application, the spammer group detection method can use a set of group fraud indicators to measure how suspicious each group is and discard groups with low suspiciousness, yielding purified candidate spammer groups.
Specifically, the spammer group detection method further includes: S105: Compute the group fraud value GSS(g) of each candidate spammer group obtained in step S103, compare the group size of the candidate spammer group with a set value, compare the group fraud value with the preset threshold of the group fraud indicator, and output candidate spammer groups according to the comparison results.
In this application, a candidate spammer group is output if its group size GS is greater than or equal to the set value (for example, 2) and its group fraud value is greater than the preset threshold of the group fraud indicator. This threshold is obtained experimentally by the difference method, taking the lowest value that gives good results; a higher value that gives good results may also be used.
To obtain more accurate spammer groups and to avoid misjudging normal reviewers who happen to review a target product during a review burst, according to a preferred embodiment of this application, before step S105 the method can also use a set of individual fraud indicators to measure how suspicious each individual reviewer is and remove reviewers with very low suspiciousness.
Specifically, before step S105, the spammer group detection method may further include: S104: Compute the individual fraud value ISS(a) of every reviewer in every candidate spammer group, compare it with the preset threshold of the individual fraud indicator, and remove reviewers with low suspiciousness according to the comparison result, yielding purified candidate groups.
In this application, a reviewer is removed if the reviewer's individual fraud value is less than the threshold of the individual fraud indicator, which yields the purified candidate groups. This threshold is also obtained experimentally by the difference method, taking the lowest value that gives good results; a higher value that gives good results may also be used.
Through steps S104 and S105, the candidate spammer groups can be purified and classified, giving more accurate spammer groups.
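Steps S104 and S105 can be sketched as a two-stage filter. The sketch is illustrative only: it assumes the individual fraud values are supplied as a dictionary and the group fraud value as a callable, and the function and parameter names are hypothetical.

```python
def purify_groups(candidates, iss, gss_of, iss_threshold, gss_threshold, min_size=2):
    """Apply S104 (individual filter) and then S105 (group filter).

    candidates: iterable of sets of reviewers.
    iss: dict mapping reviewer -> individual fraud value ISS(a).
    gss_of: callable mapping a frozenset of reviewers -> group fraud value GSS(g).
    """
    kept = []
    for group in candidates:
        # S104: remove reviewers whose ISS is below the individual threshold.
        cleaned = frozenset(a for a in group if iss.get(a, 0.0) >= iss_threshold)
        # S105: keep groups with GS >= min_size and GSS above the group threshold.
        if len(cleaned) >= min_size and gss_of(cleaned) > gss_threshold:
            kept.append(cleaned)
    return kept
```

Running S104 first means that GSS is computed on the purified membership, so a normal reviewer who was swept into a burst window does not dilute the group's score.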
The following experiments illustrate the advantages of the burst-based spammer group detection method (GSDB) of this application.
The experiments use the unlabeled Amazon review data set, namely the AmazonBooks review data from 1993 to 2014, which comprises 22,507,155 reviews, 8,026,324 reviewers, and 2,330,066 products. Because the data volume is so large, this application extracted only the review data for 2013, which comprises 6,990,316 reviews, 2,998,380 reviewers, and 1,079,741 products. Statistics of the processed data set are shown in Table 1.
Table 1. Overview of the data set
Dataset | Original Amazon Books data set | 2013 data
# Reviews | 22,507,155 | 6,990,316
# Reviewers | 8,026,324 | 2,998,380
# Products | 2,330,066 | 1,079,741
Spammer group detection is very challenging because no labeled benchmark data set (labeled fake/genuine) is available for model building or evaluation. Previous studies relied mainly on manual annotation to obtain labels. Mukherjee et al. and Xu et al. first used a frequent itemset mining (FIM) algorithm to obtain candidate spammer groups, which were then manually labeled by 8 experts. Wang et al. used a graph-topology-based algorithm to generate candidate spammer groups, which were manually labeled by 3 people.
The spammer group detection method of this application is fully unsupervised and needs no labels for model building. Labels are nevertheless essential for evaluating its performance. Because the collective fraud of a spammer group is relatively easy for a human to observe, manually labeling spammer groups is more practical than labeling individual spammers. In this application, three graduate students who are very familiar with e-commerce were therefore engaged to manually label the top 300 spammer groups detected by the GSDB and GSBC (Group Spam detection via Bi-Connected graphs) methods. Guided by earlier approaches to labeling spammer groups and by this application's own observations, the labeling sought to minimize human bias in the evaluation.
On the 2013 Amazon data set (detailed in the third column of Table 1), this application designed one set of experiments and three analyses. First, the algorithms are compared on the fraud indicators; second, they are compared on the sizes of the spammer groups they generate; finally, using the manual labels, they are compared on precision, recall, and F1 score.
The experiments and analyses use the GSBC algorithm as the baseline against which the GSDB algorithm of this application is compared. GSBC is the most recently proposed graph-topology-based spammer group detection algorithm and was also evaluated on the AmazonBooks data set. When proposing GSBC, Wang et al. already compared it with several typical earlier algorithms: GSBP, SCAN, FraudEagle, and SpEagle. GSBP and SCAN are unsupervised, whereas FraudEagle and SpEagle are supervised. GSBP and GSBC were both proposed by Wang et al., the latter being an improvement on the former. SCAN is a graph-based clustering algorithm. FraudEagle and SpEagle are based on probabilistic graphical models and use loopy belief propagation (LBP) to infer how fake reviews (and reviewers) are. The experimental results of Wang et al. show that GSBC produces higher-quality spammer groups than the other two unsupervised methods (GSBP and SCAN) and also achieves higher precision than the supervised algorithms (FraudEagle and SpEagle). It is therefore sufficient for this application to compare GSDB with GSBC.
Mukherjee et al. first proposed evaluating algorithm performance by comparing the cumulative distribution function (CDF) curves of the fraud indicators, an approach that has since been widely used in the prior art. This application likewise compares CDF curves to analyze algorithm performance. In addition, thanks to the manual labels, precision, recall, and F1 score can be used as evaluation criteria. The formulas are as follows.
$$\mathrm{Precision} = \frac{TP}{TP + FP}$$
$$\mathrm{Recall} = \frac{TP}{TP + FN}$$
$$F1 = \frac{2 \cdot \mathrm{Precision} \cdot \mathrm{Recall}}{\mathrm{Precision} + \mathrm{Recall}}$$
Here, TP (true positives) is the number of positive samples correctly classified as positive, FP (false positives) is the number of negative samples incorrectly labeled as positive, and FN (false negatives) is the number of positive samples incorrectly labeled as negative.
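As an illustration of how these three measures follow from the counts TP, FP, and FN, here is a minimal Python sketch. The function name `precision_recall_f1` and the example counts are hypothetical and do not come from the application's experiments; returning 0.0 when a denominator is zero is also an assumption.

```python
from typing import Tuple


def precision_recall_f1(tp: int, fp: int, fn: int) -> Tuple[float, float, float]:
    """Compute precision, recall and F1 from confusion counts.

    Returns 0.0 for any measure whose denominator is zero (an assumed convention).
    """
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    denom = precision + recall
    f1 = 2 * precision * recall / denom if denom else 0.0
    return precision, recall, f1


# Hypothetical counts: 8 groups correctly flagged, 2 flagged in error, 8 missed.
p, r, f = precision_recall_f1(tp=8, fp=2, fn=8)
print(f"precision={p:.3f} recall={r:.3f} f1={f:.3f}")
```

F1 is the harmonic mean of precision and recall, so it stays low unless both measures are high.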
This application compares the CDF curves of the spam behavior indicators of the spammer groups detected by GSDB and by GSBC. First, following the parameter settings of Wang et al., the GSBC algorithm was run on the dataset of this application and generated more than 500 groups. The parameter settings and the number of generated groups are shown in Table 2. τ is a user-specified co-review time window size; δ is the threshold on the edge weights of the reviewer graph; MP is a user-specified parameter; and MINSPAM is the threshold on the group spam score.
For fairness, the parameters of GSDB were tuned so that it produces a number of spammer groups comparable to GSBC. The parameter settings and the number of generated groups are shown in Table 3. As Tables 2 and 3 show, GSBC and GSDB produced 545 and 555 groups, respectively. The top 500 groups of each algorithm were extracted, and the CDF curves of the individual spam indicators and the group spam indicators defined in this application were plotted for comparison, as shown in Figures 5a to 5f and Figures 6a to 6f.
Table 2: GSBC parameter settings and number of generated groups
τ     δ     MP     MINSPAM   #Groups
30    0.1   1000   0.49      545
Table 3: GSDB parameter settings and number of generated groups
δ_TP   δ_I    δ_G    #Groups
0.1    0.43   0.54   555
Figures 5a to 5f compare the CDF curves of the individual spam indicators for the top 500 spammer groups generated by GSBC and by the GSDB of this application. Figures 6a to 6f compare the CDF curves of the group spam behavior indicators for the same groups, together with the curve of the average (AVG) over all group indicators. The horizontal axis is the normalized number of groups, and the vertical axis is the CDF value. The further right a curve lies, the better the algorithm performs. As Figures 5a to 5f show, GSDB scores higher than GSBC on the vast majority of indicators.
In Figures 6a to 6f, GSDB likewise scores higher than GSBC on the vast majority of indicators, and on the average (AVG) curve GSDB is consistently better than GSBC. The two indicators on which this application underperforms, EXR and GER, both concern extreme ratings; because GSBC filters user ratings, it scores higher on these two indicators. Overall, GSDB performs better.
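The CDF comparison can be illustrated with an empirical CDF over per-group indicator scores. The following Python sketch is an assumption-laden illustration, not the plotting code behind Figures 5 and 6: the function name `empirical_cdf` and the sample scores are hypothetical, and the scores are assumed to be normalized to [0, 1].

```python
from bisect import bisect_right
from typing import Callable, Sequence


def empirical_cdf(values: Sequence[float]) -> Callable[[float], float]:
    """Return F, where F(x) is the fraction of values less than or equal to x."""
    if not values:
        raise ValueError("values must not be empty")
    ordered = sorted(values)
    n = len(ordered)

    def cdf(x: float) -> float:
        # bisect_right counts how many sorted values are <= x.
        return bisect_right(ordered, x) / n

    return cdf


# Hypothetical indicator scores for two sets of detected groups.
cdf_a = empirical_cdf([0.6, 0.7, 0.8, 0.9])
cdf_b = empirical_cdf([0.3, 0.5, 0.6, 0.9])

# A lower CDF value at the same score means more groups lie above that score,
# which corresponds to a curve that sits further to the right.
for x in (0.4, 0.6, 0.8):
    print(x, cdf_a(x), cdf_b(x))
```

With these sample values, the first set has the lower CDF at every printed threshold, which is the pattern the text describes as a curve lying further to the right.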
This application also statistically analyzes the sizes of the groups generated by GSDB and GSBC (see Figure 7). Figure 7 shows that most groups generated by GSBC are small (2 to 4 members), whereas GSDB generates more large groups. As noted earlier, the larger a group, the greater the harm it causes. Because GSDB detects more and larger groups, it is more effective at reducing the harm caused by spammer groups.
Based on manual labels for the top 300 groups detected by GSDB and by GSBC, this application compares the precision of the two methods. Figures 8a to 8c show how the precision, recall, and F1 score of the two algorithms vary over the top-n groups.
As Figure 8a shows, the precision of GSDB is consistently higher than that of GSBC. Moreover, as n increases, the precision of GSDB declines slowly, whereas the precision of GSBC drops sharply and then recovers. In other words, the precision of the GSDB method of this application does not depend on the number of samples, whereas that of GSBC depends on it to a large extent.
As Figure 8b shows, GSDB still outperforms GSBC overall, although the gap is small. In addition, the recall curves increase linearly with n.
As Figure 8c shows, the GSDB method of this application is consistently better than GSBC in F1 score. In addition, both algorithms stabilize once enough samples are considered.
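The top-n curves of Figures 8a to 8c can be sketched as running counts over a ranked, manually labeled list of groups. The Python sketch below makes two assumptions that the text does not state: the groups are ordered by the algorithm's own ranking, and recall at n is measured against all groups labeled as spam within the labeled list. The function name `top_n_curves` and the sample labels are hypothetical.

```python
from typing import List, Sequence, Tuple


def top_n_curves(labels: Sequence[bool]) -> List[Tuple[int, float, float, float]]:
    """For each n, return (n, precision@n, recall@n, F1@n).

    labels[i] is True if the i-th ranked group was manually labeled as a spammer group.
    """
    total_positive = sum(labels)
    curves = []
    true_positive = 0
    for n, is_spam in enumerate(labels, start=1):
        true_positive += int(is_spam)
        precision = true_positive / n
        recall = true_positive / total_positive if total_positive else 0.0
        denom = precision + recall
        f1 = 2 * precision * recall / denom if denom else 0.0
        curves.append((n, precision, recall, f1))
    return curves


# Hypothetical labels for four ranked groups.
curves = top_n_curves([True, True, False, True])
for n, p, r, f in curves:
    print(n, round(p, 3), round(r, 3), round(f, 3))
```

Under this definition, recall grows steadily with n whenever most ranked groups are true spammer groups, which is consistent with the roughly linear recall curves reported for Figure 8b.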
These experimental results show that the GSDB method proposed in this application outperforms the GSBC method.
According to another aspect of the present application, a spammer group detection apparatus is provided. As shown in Figure 9, the apparatus includes: a data acquisition module 100, which acquires review data from a network, the review data including reviewed products, reviewers, review times, and the reviewers' ratings of the reviewed products; an anomaly value calculation module 200, which computes a product rating distribution anomaly value and a product average rating distribution anomaly value based on the reviewers' ratings of the reviewed products; a target product identification module 300, which computes the suspicion value of a target product attacked by spammer groups from the product rating distribution anomaly value and the product average rating distribution anomaly value, compares the suspicion value with a set threshold on the target product suspicion value, and identifies the target products attacked by spammer groups according to the comparison result; and a candidate spammer group generation module 400, which generates candidate spammer groups based on the identified target products.
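The behavior of the target product identification module can be sketched using the weighted formula given in claim 6, S_TP(p) = ω·S_avg(p) + (1 − ω)·S_ext(p). The Python sketch below is illustrative only: the function names, the dictionary layout, and the sample anomaly values are hypothetical, and treating a product as a target when its suspicion value exceeds the threshold is an assumption, since the application only says the decision follows from the comparison result.

```python
from typing import Dict, List, Tuple


def target_product_suspicion(s_avg: float, s_ext: float, omega: float = 0.5) -> float:
    """S_TP(p) = omega * S_avg(p) + (1 - omega) * S_ext(p), with omega in [0, 1]."""
    if not 0.0 <= omega <= 1.0:
        raise ValueError("omega must lie between 0 and 1")
    return omega * s_avg + (1.0 - omega) * s_ext


def identify_target_products(
    anomalies: Dict[str, Tuple[float, float]], omega: float, threshold: float
) -> List[str]:
    """Return product ids whose suspicion value exceeds the threshold (assumed direction).

    anomalies maps a product id to (S_avg, S_ext).
    """
    return [
        product
        for product, (s_avg, s_ext) in anomalies.items()
        if target_product_suspicion(s_avg, s_ext, omega) > threshold
    ]


# Hypothetical anomaly values; 0.1 mirrors the δ_TP setting listed in Table 3.
anomalies = {"p1": (0.2, 0.4), "p2": (0.0, 0.1)}
print(identify_target_products(anomalies, omega=0.5, threshold=0.1))
```

Choosing ω closer to 1 weights the average rating anomaly more heavily; choosing it closer to 0 weights the rating distribution anomaly more heavily.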
In one embodiment, a computer device is provided, including a memory and a processor. The memory stores a computer program that can run on the processor, and the processor implements the steps of the above spammer group detection method when executing the computer program.
In one embodiment, a computer-readable storage medium is provided, on which a computer program is stored; when executed by a processor, the computer program implements the steps of the above spammer group detection method. The computer-readable storage medium according to the present application may include, for example, non-volatile and/or volatile memory. Non-volatile memory may include read-only memory (ROM), programmable ROM (PROM), electrically programmable ROM (EPROM), electrically erasable programmable ROM (EEPROM), or flash memory. Volatile memory may include random access memory (RAM) or external cache memory. By way of illustration and not limitation, RAM is available in many forms, such as static RAM (SRAM), dynamic RAM (DRAM), synchronous DRAM (SDRAM), double data rate SDRAM (DDR SDRAM), enhanced SDRAM (ESDRAM), Synchlink DRAM (SLDRAM), Rambus direct RAM (RDRAM), direct Rambus dynamic RAM (DRDRAM), and Rambus dynamic RAM (RDRAM).
As described above, the spammer group detection method according to the present application starts from product ratings and screens out products that may have been attacked by spammer groups by checking whether the product rating distribution is abnormal. It then uses kernel density estimation to find review burst regions and treats all reviewers within a burst interval as a candidate spammer group. Because normal reviewers may happen to review a product during a burst and be misjudged, this application uses a series of individual spam indicators to measure the suspiciousness of each reviewer and removes reviewers with very low suspiciousness. It also uses a series of group spam indicators to measure the suspiciousness of each group and to classify groups. The spammer group detection method according to the present application can greatly improve the time and space efficiency of detecting spammer groups. Moreover, the method is consistent with the responsibilities and obligations of e-commerce platforms and allows them to supervise and manage sellers.
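The burst detection step can be sketched by following the sequence listed in claim 7: split the product's life cycle into sub-time windows, take each window's upper bound and review count as a sample point, estimate the kernel density, find the extreme points, and keep those whose window holds more reviews than the average and more than one review. The kernel formula in the application is only available as an image, so the Gaussian kernel, the bandwidth, and the use of strict local maxima below are assumptions; the function name `review_bursts` and all parameter defaults are hypothetical.

```python
import math
from typing import List, Sequence, Tuple


def gaussian_kernel(u: float) -> float:
    """Standard Gaussian kernel (assumed; the application's kernel is not reproduced here)."""
    return math.exp(-0.5 * u * u) / math.sqrt(2.0 * math.pi)


def review_bursts(
    review_days: Sequence[int], window: int = 7, bandwidth: float = 7.0, margin: int = 3
) -> List[Tuple[int, int]]:
    """Return (start_day, end_day) burst regions for one product's review days."""
    if not review_days:
        return []
    start, end = min(review_days), max(review_days)
    n_windows = max(1, math.ceil((end - start + 1) / window))

    # Count reviews per sub-time window; the sample point is each window's upper bound.
    counts = [0] * n_windows
    for day in review_days:
        counts[min((day - start) // window, n_windows - 1)] += 1
    uppers = [start + (i + 1) * window for i in range(n_windows)]
    total = len(review_days)

    def density(t: float) -> float:
        # Kernel density estimate with each sample point weighted by its review count.
        weighted = sum(c * gaussian_kernel((t - u) / bandwidth) for u, c in zip(uppers, counts))
        return weighted / (total * bandwidth)

    dens = [density(u) for u in uppers]
    average = total / n_windows  # average reviews per sub-time window

    bursts = []
    for i, d in enumerate(dens):
        left = dens[i - 1] if i > 0 else -math.inf
        right = dens[i + 1] if i < n_windows - 1 else -math.inf
        is_peak = d > left and d > right
        if is_peak and counts[i] > average and counts[i] > 1:
            bursts.append((uppers[i] - margin, uppers[i] + margin))
    return bursts


# Hypothetical review days: a cluster around days 30 to 34 among sparse reviews.
days = [0, 10, 20, 30, 31, 32, 33, 34, 40, 50, 60]
print(review_bursts(days, window=7, bandwidth=7.0, margin=3))
```

Reviewers who posted inside a returned region would then form a candidate group, to be filtered by the individual and group spam indicators.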
The above are only preferred embodiments of the present invention and are not intended to limit it. Those skilled in the art may make various modifications and changes to the present invention. Any modification, equivalent replacement, or improvement made within the spirit and principles of the present invention shall fall within its scope of protection.

Claims (10)

  1. A spammer (water army) group detection method, characterized in that the detection method comprises:
    acquiring review data from a network, the review data comprising: reviewed products, reviewers, review times, and the reviewers' ratings of the reviewed products;
    identifying, based on the review data, target products attacked by spammer groups; and
    generating candidate spammer groups based on the identified target products.
  2. The spammer group detection method according to claim 1, characterized in that identifying, based on the review data, the target products attacked by spammer groups comprises:
    computing a product rating distribution anomaly value and a product average rating distribution anomaly value based on the reviewers' ratings of the reviewed products; and
    computing the suspicion value of a target product attacked by spammer groups from the product rating distribution anomaly value and the product average rating distribution anomaly value, comparing the suspicion value with a set threshold on the target product suspicion value, and identifying the target products attacked by spammer groups according to the comparison result.
  3. The spammer group detection method according to claim 1 or 2, characterized in that generating candidate spammer groups based on the identified target products comprises:
    obtaining a review burst region of the identified target product using a kernel density estimation method, the review burst region being a region in which reviews of the identified target product surge within a short period of time; and
    obtaining the reviewers in the review burst region to generate a candidate spammer group.
  4. The spammer group detection method according to claim 3, characterized in that the detection method further comprises:
    computing the group spam score of the candidate spammer group, comparing the group size of the candidate spammer group with a set value, comparing the group spam score with a set threshold on the group spam indicator, and outputting candidate spammer groups according to the comparison results, wherein the group spam score measures the degree of spamming by the group and the group size denotes the number of reviewers in the group.
  5. The spammer group detection method according to claim 4, characterized in that, before computing the group spam score of the candidate spammer group, comparing the group size of the candidate spammer group with the set value, comparing the group spam score with the set threshold on the group spam indicator, and outputting candidate spammer groups according to the comparison results, the detection method further comprises:
    computing the individual spam score of each reviewer in each candidate spammer group, comparing the individual spam score with a set threshold on the individual spam indicator, and removing reviewers with low suspiciousness according to the comparison result to obtain purified candidate groups, wherein the individual spam score measures the degree of spamming by a reviewer.
  6. The spammer group detection method according to claim 2, characterized in that the suspicion value S_TP(p) of a target product attacked by spammer groups is computed by the following formula:
    $S_{TP}(p) = \omega S_{avg}(p) + (1 - \omega) S_{ext}(p)$
    where p denotes the target product attacked by spammer groups, S_avg(p) is the product average rating distribution anomaly value, S_ext(p) is the product rating distribution anomaly value, and ω is a weighting factor that balances S_avg(p) and S_ext(p), with a value between 0 and 1.
  7. The spammer group detection method according to claim 4, characterized in that obtaining the review burst region of the identified target product using the kernel density estimation method comprises:
    computing the life cycle of the identified target product;
    modeling, using the kernel density estimation method, the reviews of the identified target product and the time series of the corresponding review times;
    setting a time window size and dividing the life cycle of the identified target product into a plurality of sub-time windows;
    selecting the upper bound of each sub-time window and the number of reviews within that sub-time window as sample points;
    computing the kernel density estimate according to
    Figure PCTCN2020092791-appb-100001
    and obtaining the set of extreme points of the number of reviews of the identified target product;
    computing the average number of reviews per sub-time window, where the average number of reviews = total number of reviews / number of sub-time windows; and
    determining whether the number of reviews in the sub-time window containing each extreme point of the obtained set is greater than the average number of reviews and greater than 1, and obtaining the review burst region according to the result, wherein the review burst region is the region formed by the times corresponding to those extreme points of the obtained set whose review counts are greater than the average number of reviews and greater than 1, plus or minus a set number of days.
  8. The spammer group detection method according to claim 4, characterized in that the group spam score GSS(g) is obtained by the following formula:
    Figure PCTCN2020092791-appb-100002
    where g denotes a group formed by reviewers, GTW(g) is the group time window, GRD(g) is the group rating deviation, GS(g) is the group size, GRT(g) is the group review tightness, GOR(g) is the group's number of reviews in one day, GER(g) is the group extreme rating ratio, GCA(g) is the group co-activity degree, and GCAR(g) is the group's co-active-period review ratio;
    GTW(g) measures how active the group is;
    GRD(g) reflects how far the group's ratings deviate from the average rating of the target product;
    GRT(g) measures how closely group members collaborate in writing fake reviews;
    GOR(g) reflects the number of reviews a group posts in one day;
    GER(g) is the average of the group members' extreme rating ratios;
    GCA(g) indicates the degree to which group members are active together within a given period;
    GCAR(g) indicates the proportion of the group's total reviews that are reviews of the target product posted during the co-active period.
  9. The spammer group detection method according to claim 5, characterized in that the individual spam score ISS(a) is obtained by the following formula:
    Figure PCTCN2020092791-appb-100003
    where a denotes a reviewer, EXR(a) is the extreme rating ratio, RD(a) is the rating deviation, MRO(a) is the maximum number of reviews in one day, RTI(a) is the review time interval, AD(a) is the account lifetime, and ATR(a) is the active-period review ratio;
    EXR(a) is the ratio of the number of extreme ratings to the reviewer's total number of reviews;
    RD(a) reflects how far the reviewer's ratings deviate from the overall rating of the product;
    MRO(a) reflects the maximum number of reviews a reviewer posts in a single day;
    RTI(a) indicates the length of the time interval between a reviewer's reviews;
    AD(a) indicates the time interval between the first and the last review posted by the reviewer;
    ATR(a) measures the relationship between the number of reviews in the reviewer's active period and the reviewer's total number of reviews.
  10. A spammer group detection apparatus, characterized in that the detection apparatus comprises:
    a data acquisition module, which acquires review data from a network, the review data comprising: reviewed products, reviewers, review times, and the reviewers' ratings of the reviewed products;
    an anomaly value calculation module, which computes a product rating distribution anomaly value and a product average rating distribution anomaly value based on the reviewers' ratings of the reviewed products;
    a target product identification module, which computes the suspicion value of a target product attacked by spammer groups from the product rating distribution anomaly value and the product average rating distribution anomaly value, compares the suspicion value with a set threshold on the target product suspicion value, and identifies the target products attacked by spammer groups according to the comparison result; and
    a candidate spammer group generation module, which generates candidate spammer groups based on the identified target products.
PCT/CN2020/092791 2020-05-06 2020-05-28 Online water army group detection method and apparatus WO2021223275A1 (en)

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
CN202010372504.1A CN113627960A (en) 2020-05-06 2020-05-06 Water army group detection method and device
CN202010372504.1 2020-05-06

Publications (2)

Publication Number Publication Date
WO2021223275A1 true WO2021223275A1 (en) 2021-11-11
WO2021223275A8 WO2021223275A8 (en) 2021-12-23

Family

ID=78376540

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/CN2020/092791 WO2021223275A1 (en) 2020-05-06 2020-05-28 Online water army group detection method and apparatus

Country Status (2)

Country Link
CN (1) CN113627960A (en)
WO (1) WO2021223275A1 (en)

Cited By (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN116150507A (en) * 2023-04-04 2023-05-23 湖南蚁坊软件股份有限公司 Water army group identification method, device, equipment and medium
CN117076812B (en) * 2023-10-13 2023-12-12 西安康奈网络科技有限公司 Intelligent monitoring management system of network information release and propagation platform

Citations (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20090157417A1 (en) * 2007-12-18 2009-06-18 Changingworlds Ltd. Systems and methods for detecting click fraud
CN106484679A (en) * 2016-10-20 2017-03-08 北京邮电大学 A kind of false review information recognition methodss being applied on consumption platform and device
CN109460508A (en) * 2018-10-10 2019-03-12 浙江大学 A kind of efficient comment spam groups of users detection method

Family Cites Families (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN109408634A (en) * 2018-09-17 2019-03-01 重庆邮电大学 A kind of opinion junk user group's detection method based on factions' filtering

Non-Patent Citations (2)

* Cited by examiner, † Cited by third party
Title
ZHANG LU;ZHU HAITING: "An Efficient Distributed Detection Algorithm for Spammer Group", COMPUTER ENGINEERING, vol. 45, no. 7, 12 November 2018 (2018-11-12), pages 6 - 12, XP055863443, ISSN: 1000-3428, DOI: 10.19678/j.issn.1000-3428.0052048 *
ZHANG QI , JI SHUJUAN , FU QIANG , ZHANG CHUNJIN: "Weighted Reviewer Graph based Spammer Group Detection and Characteristic Analysis", JOURNAL OF COMPUTER APPLICATIONS, vol. 39, no. 6, 10 June 2019 (2019-06-10), pages 1595 - 1600, XP055863439, ISSN: 1001-9081, DOI: 10.11772/j.issn.1001-9081.2018122611 *

Also Published As

Publication number Publication date
WO2021223275A8 (en) 2021-12-23
CN113627960A (en) 2021-11-09

Legal Events

Date Code Title Description
121 Ep: the epo has been informed by wipo that ep was designated in this application

Ref document number: 20934842

Country of ref document: EP

Kind code of ref document: A1

NENP Non-entry into the national phase

Ref country code: DE

122 Ep: pct application non-entry in european phase

Ref document number: 20934842

Country of ref document: EP

Kind code of ref document: A1