CN112800304A - Microblog water army group detection method based on clustering - Google Patents

Info

Publication number
CN112800304A
Authority
CN
China
Prior art keywords
water army
cluster
detection method
blog
army group
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202110023795.8A
Other languages
Chinese (zh)
Inventor
马海峰
吴爱华
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Shanghai Maritime University
Original Assignee
Shanghai Maritime University
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Shanghai Maritime University
Priority to CN202110023795.8A
Publication of CN112800304A
Legal status: Pending

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/90 Details of database functions independent of the retrieved data types
    • G06F16/95 Retrieval from the web
    • G06F16/951 Indexing; Web crawling techniques
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/30 Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
    • G06F16/33 Querying
    • G06F16/3331 Query processing
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00 Pattern recognition
    • G06F18/20 Analysing
    • G06F18/23 Clustering techniques
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06Q INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR ADMINISTRATIVE, COMMERCIAL, FINANCIAL, MANAGERIAL OR SUPERVISORY PURPOSES; SYSTEMS OR METHODS SPECIALLY ADAPTED FOR ADMINISTRATIVE, COMMERCIAL, FINANCIAL, MANAGERIAL OR SUPERVISORY PURPOSES, NOT OTHERWISE PROVIDED FOR
    • G06Q50/00 Information and communication technology [ICT] specially adapted for implementation of business processes of specific business sectors, e.g. utilities or tourism
    • G06Q50/01 Social networking

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Data Mining & Analysis (AREA)
  • Physics & Mathematics (AREA)
  • General Physics & Mathematics (AREA)
  • Databases & Information Systems (AREA)
  • General Engineering & Computer Science (AREA)
  • Business, Economics & Management (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Health & Medical Sciences (AREA)
  • Evolutionary Biology (AREA)
  • Bioinformatics & Computational Biology (AREA)
  • Bioinformatics & Cheminformatics (AREA)
  • Artificial Intelligence (AREA)
  • Computational Linguistics (AREA)
  • Computing Systems (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Evolutionary Computation (AREA)
  • Economics (AREA)
  • General Health & Medical Sciences (AREA)
  • Human Resources & Organizations (AREA)
  • Marketing (AREA)
  • Primary Health Care (AREA)
  • Strategic Management (AREA)
  • Tourism & Hospitality (AREA)
  • General Business, Economics & Management (AREA)
  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)

Abstract

The invention discloses a clustering-based method for detecting water army (paid poster) groups on microblogs, comprising the following steps: processing the blog posts published by users; constructing a structure graph from the users who interact through comments under each blog post; processing the users' multi-feature attribute data to construct a discrete distance and a radius function; clustering the users with a DBSCAN algorithm to obtain a plurality of clusters; and comparing the clusters under different blog posts and partitioning them by node similarity to obtain the water army groups. The method reduces the need for semantic analysis, avoids being deceived by disguised water army features because it focuses on the overall interaction structure, and is more accurate and simpler than the prior art. It can effectively uncover water army groups in microblogs, including the distinct groups that may coexist.

Description

Microblog water army group detection method based on clustering
Technical Field
The invention relates to the field of water army group detection, in particular to a microblog water army group detection method based on clustering.
Background
In the prior art, one approach builds a composite factor from the feature attributes of an individual water army account (such as account ID, registration time, and follower-to-following ratio) in order to detect individual accounts. The accuracy of this approach is low, and it becomes less suitable as water army accounts disguise themselves better and behave more like normal users. A second approach starts from the overall user interaction structure and usually clusters users with various community detection algorithms. Although this avoids the difficulty of detecting water army accounts from their semantics and from their increasingly normal-looking features, it misjudges many normal users. In addition, most such methods perform only a binary classification and therefore identify only a single water army group.
Disclosure of Invention
The invention aims to provide a clustering-based method for detecting microblog water army groups that reduces the need for semantic analysis, avoids being deceived by disguised water army features because it focuses on the overall interaction structure, and is more accurate and simpler than the prior art. It can effectively uncover water army groups in microblogs, including the distinct groups that may coexist.
In order to achieve the purpose, the technical scheme adopted by the invention is as follows:
A clustering-based method for detecting microblog water army groups comprises the following steps:
processing the blog posts published by users;
constructing a structure graph from the users who interact through comments under each blog post;
processing the users' multi-feature attribute data to construct a discrete distance and a radius function;
clustering the users with a DBSCAN algorithm to obtain a plurality of clusters;
and comparing the clusters under different blog posts and partitioning them by node similarity to obtain the water army groups.
Further, the user multi-feature attributes include account features, social attributes, and content features.
Further, the account features include the ID, the registration date, whether the account holds a membership level, the number of microblogs, and the degree to which the basic profile is incomplete.
Further, the social attributes include the number of followers, the follower IDs, the number of followed accounts, and the IDs of the followed accounts.
Further, the content features include the topics of previously published blog posts, the publication times of the blog posts, and the total number of blog posts.
Further, the DBSCAN algorithm is an improved DBSCAN algorithm, specifically: the preprocessed data and its interaction graph structure are acquired, the distance calculation formula and the radius r are updated as the time period changes, and clustering is performed with the DBSCAN algorithm on this basis.
Further, the improved DBSCAN algorithm adds a radius variation function.
Further, the method also comprises:
preprocessing the users under the blog post to obtain a batch of screened user data.
Further, the method also comprises:
eliminating the noise in the water army group.
Further, the judgment is made according to: whether the account ID is a randomly generated garbled string; the time period in which the registration time falls; whether the account is a member; whether (number of microblogs / account age) > i; and whether (number of missing profile fields / total number of fillable profile fields) > j.
Compared with the prior art, the invention has at least one of the following advantages:
the method reduces the requirement of semantic analysis, avoids the deception of the characteristics of the water army by focusing on the integral interaction structure, and has higher accuracy and simple process compared with the prior art. The water army group in the microblog can be effectively excavated. And different water army groups which may exist can be effectively excavated.
Drawings
FIG. 1 is an overall method flow diagram of a method for microblog water army group identification according to the invention;
FIG. 2 is a diagram of a node interaction architecture in the present invention;
FIG. 3 is a chart of popularity of the present invention;
FIG. 4 is a cluster map based on the structure diagram of the present invention.
Detailed Description
The present invention is described in further detail below with reference to FIGS. 1 to 4 and the detailed description. The advantages and features of the invention will become more apparent from the following description. Note that the drawings are in a highly simplified form and not to precise scale; they serve only to aid in describing the embodiments conveniently and clearly. The structures, ratios, sizes, and the like shown in the drawings are provided only to accompany the disclosure of the specification so that those skilled in the art can understand and read it; they are not intended to limit the conditions under which the invention can be implemented and therefore have no substantive technical significance. Any structural modification, change of ratio, or adjustment of size that does not affect the efficacy and achievable purpose of the invention still falls within the scope of the invention.
Note that relational terms such as "first" and "second" are used herein solely to distinguish one entity or action from another, without necessarily requiring or implying any actual relationship or order between such entities or actions. The terms "comprise", "include", and any variant thereof are intended to cover a non-exclusive inclusion, so that a process, method, article, or field device that comprises a series of elements includes not only those elements but also other elements that are not explicitly listed, or elements inherent to such a process, method, article, or field device. Without further limitation, an element introduced by the phrase "comprising a ..." does not exclude the presence of additional identical elements in the process, method, article, or field device that comprises the element.
Referring to FIGS. 1 to 4, in the method for identifying microblog water army groups provided by this embodiment, the captured data are first preprocessed. Preprocessing means labeling each blog post with its topic and labeling each captured user node with its attributes: ID, interaction information between users, registration time, number of previously published blog posts, topics and times of previously published blog posts, number of followers, follower IDs, number of followed accounts, IDs of followed accounts, and so on. The user nodes under each topic are then connected according to their comment interactions to construct a structure graph. The edges between nodes carry a distance attribute; because the node attributes on which the distance is built are all unordered, the distance used here is the VDM distance. The radius determines the size of a cluster, and it is set as a function of time because a water army group acts in concert within the same time period.
The improved DBSCAN algorithm can then be used to cluster the structure graph of the data. The user nodes under each blog post form clusters, and the similarity between clusters under different blog posts is compared. The similarity is built on adjacency matrices and considers whether reachable neighbors share the same target-label structure and whether nodes have common neighbors. A structure whose nodes have a matrix similarity above the set threshold is a water army group.
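The graph construction described above can be sketched as follows. This is only an illustrative sketch, not the patented implementation: the comment tuple layout `(commenter_id, target_id, timestamp)`, the function names `period_index` and `build_interaction_graph`, and the choice to keep the earliest period per edge are all assumptions. When no crawl time is supplied, the latest timestamp is used as a stand-in for it.

```python
from collections import defaultdict

def period_index(ts, t_start, t_end, n_periods):
    """Map a timestamp to one of n equal periods, numbered 1..n_periods."""
    if t_end <= t_start:
        raise ValueError("t_end must be later than t_start")
    width = (t_end - t_start) / n_periods
    idx = int((ts - t_start) // width) + 1
    # Clamp so that the final instant falls into the last period.
    return min(max(idx, 1), n_periods)

def build_interaction_graph(comments, n_periods, t_crawl=None):
    """comments: iterable of (commenter_id, target_id, timestamp).

    Returns {commenter: {target: period}}; each edge is weighted by the
    earliest period in which the interaction was observed (an assumption).
    """
    comments = list(comments)
    t_start = min(c[2] for c in comments)  # first commenter's appearance
    t_end = t_crawl if t_crawl is not None else max(c[2] for c in comments)
    graph = defaultdict(dict)
    for src, dst, ts in comments:
        p = period_index(ts, t_start, t_end, n_periods)
        prev = graph[src].get(dst)
        graph[src][dst] = p if prev is None else min(prev, p)
    return {u: dict(nbrs) for u, nbrs in graph.items()}

comments = [("a", "b", 0), ("b", "c", 50), ("a", "c", 100), ("a", "b", 90)]
graph = build_interaction_graph(comments, n_periods=4)
```

Storing the period on the edge mirrors the statement that the edge weight is represented by a time period; other weightings would fit the same structure.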
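A DBSCAN variant with a varying radius can be sketched as below. The text does not specify how the time-dependent radius enters the algorithm, so this sketch assumes that each node carries its own radius (for example, the radius of the time period in which it joined) and that a node's neighbourhood is measured with that node's radius. The function name `dbscan_variable_radius` and the `min_pts` parameter are hypothetical; the distance matrix is assumed to be precomputed.

```python
def dbscan_variable_radius(dist, radius, min_pts):
    """DBSCAN over a precomputed distance matrix with a per-node radius.

    dist: symmetric n x n matrix (list of lists) of pairwise distances.
    radius: radius[i] is the neighbourhood radius used for node i.
    Returns a list of cluster labels; -1 marks noise.
    """
    n = len(dist)
    labels = [None] * n  # None means "not visited yet"

    def neighbours(i):
        # The node itself is included because dist[i][i] == 0.
        return [j for j in range(n) if dist[i][j] <= radius[i]]

    cluster = -1
    for i in range(n):
        if labels[i] is not None:
            continue
        nbrs = neighbours(i)
        if len(nbrs) < min_pts:
            labels[i] = -1  # provisional noise, may become a border point
            continue
        cluster += 1
        labels[i] = cluster
        queue = [j for j in nbrs if j != i]
        while queue:
            j = queue.pop()
            if labels[j] == -1:
                labels[j] = cluster  # border point reached from a core point
            if labels[j] is not None:
                continue
            labels[j] = cluster
            j_nbrs = neighbours(j)
            if len(j_nbrs) >= min_pts:  # j is a core point: keep expanding
                queue.extend(k for k in j_nbrs
                             if labels[k] is None or labels[k] == -1)
    return labels

def demo_matrix():
    """Six nodes: {0,1,2} tight, {3,4} tight, 2-3 moderately close, 5 isolated."""
    n = 6
    d = [[0.0 if a == b else 1.0 for b in range(n)] for a in range(n)]
    for a, b, v in [(0, 1, 0.1), (0, 2, 0.1), (1, 2, 0.1), (3, 4, 0.1), (2, 3, 0.5)]:
        d[a][b] = d[b][a] = v
    return d

fixed = dbscan_variable_radius(demo_matrix(), [0.2] * 6, min_pts=2)
widened = dbscan_variable_radius(demo_matrix(), [0.2, 0.2, 0.6, 0.2, 0.2, 0.2], min_pts=2)
```

With a uniform radius the two tight groups stay separate; enlarging the radius of node 2 alone lets it reach node 3, which merges the groups. This is the effect the varying radius is meant to have during periods in which a water army group joins.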
The crawled user data features are as follows:
Account features: ID, registration date, whether the account holds a membership level, number of microblogs, and degree to which the basic profile is incomplete.
Social attributes: number of followers, follower IDs, number of followed accounts, and IDs of followed accounts.
Content features: topics of previously published blog posts, publication times of the blog posts, and total number of blog posts.
The method specifically comprises the following steps:
S1: Construct the structure graph from the interactions between users. If node v_i has commented on or interacted with node v_j, an edge e_ij is defined between them. Referring to FIG. 3, the circles are user nodes and an edge between two nodes indicates that the two users have interacted; referring to FIG. 4, the clusters obtained by clustering on the basis of FIG. 3 are the user nodes inside the polygons (the frames in FIG. 4). The weight of an edge is represented by a time period. Because a water army group acts in concert within a time period and thereby makes a blog post spread rapidly in a short time, the interval from the first commenter's appearance to the crawl time is divided into n equal time periods. According to the change in the peak of user participation, the periods are grouped into four stages: gentle rise, rapid rise, rapid fall, and gentle fall. (In FIG. 2 the interval is divided into 12 equal periods, and the peak change appears mainly between T2 and T3: T1 and T2 form the gentle-rise stage; T3 to T5 form the rapid-rise stage, which is the period in which a water army group is most likely to join; and T6 to T12 form the falling stages, in which the participants are generally users who notice the blog post one after another once its peak popularity has passed.) The influx of a water army group makes the blog post spread more widely, so more users may join the discussion. It follows that the entry of a water army group is very likely to be accompanied by a rising stage. Accordingly, a different radius can be set for each time period T_i by constructing the radius function $r = (n_i / T_i) \times k$, where k is a default constant set according to the situation.
Here n_i / T_i is the peak value of each stage. The value of r necessarily grows when a water army group joins, which prevents water army nodes from being lost during clustering. The distance is then calculated. Because the feature attributes here are all discrete, the Minkowski distance is not applicable, and the VDM (Value Difference Metric) is introduced instead.
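Before turning to the distance, the radius function can be sketched as follows. The text does not define n_i precisely, so this sketch assumes that n_i is the number of users who join the discussion in period i and that T_i is the length of that period; the function name `radius_per_period` and the sample numbers are hypothetical.

```python
def radius_per_period(counts, durations, k):
    """Return r_i = (n_i / T_i) * k for every time period.

    counts[i]: assumed number of participating users n_i in period i.
    durations[i]: assumed length T_i of period i (same unit for all periods).
    k: tunable constant chosen for the data set.
    """
    if len(counts) != len(durations):
        raise ValueError("counts and durations must have the same length")
    return [(n / t) * k for n, t in zip(counts, durations)]

# Three equal periods of length 2.0; the middle one contains a participation peak.
radii = radius_per_period([2, 10, 4], [2.0, 2.0, 2.0], k=0.5)
```

The period with the participation peak receives the largest radius, so nodes that join during a surge are less likely to be left out of a cluster.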
Let $m_{u,a}$ denote the number of samples whose value on attribute $u$ is $a$, let $m_{u,a,i}$ denote the number of samples in the $i$-th sample cluster whose value on attribute $u$ is $a$, and let $k$ be the number of sample clusters. The VDM distance between two discrete values $a$ and $b$ on attribute $u$ is then:
$$\mathrm{VDM}_p(a,b)=\sum_{i=1}^{k}\left|\frac{m_{u,a,i}}{m_{u,a}}-\frac{m_{u,b,i}}{m_{u,b}}\right|^{p}$$
This defines a distance for measuring similarity in the clustering problem discussed below: the smaller the distance, the greater the similarity, and vice versa. Such a distance is called a non-metric distance.
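The VDM computation can be sketched as follows for a single discrete attribute. The sketch assumes that a cluster label is already available for every sample (the text does not say where these labels come from) and exposes the exponent p as a parameter; with p = 1 the result is the plain sum of absolute differences. The function name `vdm` and the toy data are hypothetical.

```python
from collections import Counter

def vdm(values, clusters, a, b, p=1):
    """VDM_p distance between two values a and b of one discrete attribute.

    values[i]: attribute value of sample i.
    clusters[i]: cluster label of sample i (assumed to be given).
    """
    if len(values) != len(clusters):
        raise ValueError("values and clusters must have the same length")
    m = Counter(values)                            # m_{u,a}
    m_by_cluster = Counter(zip(values, clusters))  # m_{u,a,i}
    if m[a] == 0 or m[b] == 0:
        raise ValueError("both values must occur in the sample")
    total = 0.0
    for c in set(clusters):
        diff = m_by_cluster[(a, c)] / m[a] - m_by_cluster[(b, c)] / m[b]
        total += abs(diff) ** p
    return total

# "x" occurs only in cluster 0 and "y" only in cluster 1: maximally different.
separated = vdm(["x", "x", "y", "y"], [0, 0, 1, 1], "x", "y")
# Both values are spread evenly over the clusters: distance zero.
mixed = vdm(["x", "x", "y", "y"], [0, 1, 0, 1], "x", "y")
```

Two attribute values are close when they are distributed over the clusters in the same proportions, which is why the measure works for unordered attributes.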
Based on the above formula, the distance between two discrete values a and b of attribute u for any two nodes i and j is VDM_p(a, b). The distance can be calculated over the discrete value range of each single attribute between nodes i and j, and in the same way the distances for attributes u1, u2, ... can be calculated. At this point all the conditions for running the DBSCAN algorithm are satisfied, and once a plurality of clusters has been obtained, the similarity between the clusters is compared. The similarity considers whether reachable neighbors share the same target-label structure and whether nodes have common neighbors. A structure whose nodes have a similarity above the set threshold is a water army group.
First, each cluster of the structure graph is converted into an adjacency matrix whose rows i and columns j are the ID numbers from the first node to the last node. If there is an interaction from i to j, the entry is marked with the time-period attribute of the interaction edge; if there is no interaction, it is marked 0. Each cluster therefore has a corresponding matrix, and the matrices are compared: a cluster is determined to be a water army group when the matrix similarity is greater than the set threshold. The similarity between nodes can generally be judged by common neighbors; to improve accuracy, a constraint is added on this basis: the proportion of matrices in which the same IDs have the same matrix value must exceed a set threshold. A group that satisfies this constraint is detected as a water army group. For example, if three of four matrices have the value 2 in row 2, column 3 and the frequency threshold is set to 1/2, then 3/4 is greater than 1/2, and the nodes of row 2 and column 3 form the resulting water army group.
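The frequency constraint in the example can be sketched as follows. Each cluster's adjacency matrix is represented sparsely as a dictionary of its non-zero entries, keyed by (row ID, column ID); this representation and the function names `frequent_entries` and `suspected_nodes` are assumptions made for illustration.

```python
from collections import Counter

def frequent_entries(matrices, threshold):
    """Find matrix entries shared by more than `threshold` of the matrices.

    matrices: list of {(src_id, dst_id): period} dicts, one per cluster,
              holding only the non-zero adjacency entries.
    Returns {(src_id, dst_id, period): fraction_of_matrices}.
    """
    counts = Counter()
    for mat in matrices:
        for (src, dst), period in mat.items():
            if period != 0:  # 0 means "no interaction"
                counts[(src, dst, period)] += 1
    total = len(matrices)
    return {key: c / total for key, c in counts.items() if c / total > threshold}

def suspected_nodes(entries):
    """Collect the node IDs that appear in the frequent entries."""
    nodes = set()
    for src, dst, _period in entries:
        nodes.update((src, dst))
    return nodes

# Three of four matrices hold the value 2 at row 2, column 3.
matrices = [
    {(2, 3): 2},
    {(2, 3): 2, (1, 2): 1},
    {(2, 3): 2},
    {(2, 3): 5},
]
entries = frequent_entries(matrices, threshold=0.5)
group = suspected_nodes(entries)
```

Here the entry at row 2, column 3 with value 2 has a share of 3/4, which exceeds 1/2, so nodes 2 and 3 are reported; the entries that occur in only one matrix are ignored.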
S2: Eliminate noise. The water army group obtained in step S1 is checked against the following criteria: whether the account ID is a randomly generated garbled string; whether the registration time falls in a particular time period (specifically, a certain week of a certain month of a certain year); whether the account is a member; whether (number of microblogs / account age in days) > i, where i is the mode obtained from the S1 data set; and whether (number of missing profile fields / total number of fillable profile fields) > j, where j is the mode obtained from the S1 data set.
Further criteria are whether the follower count / followed-account count, the blog post count / registration period, and the number of blog posts containing links / blog post count match the ratios of normal users. Each attribute thus has a corresponding reference value, and the attribute values of the nodes in the water army group obtained in S1 are compared with these reference values. If a value lies beyond the set threshold, the node is rejected, and what remains is the final result.
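The two ratio tests in S2 can be sketched as follows. The text leaves open how the individual criteria are combined and which direction counts as noise, so the sketch only computes the two flags and leaves the rejection rule to a caller-supplied predicate. The `Account` fields, the function names, the sample thresholds, and the example rule (an account that triggers neither flag is treated as a misjudged normal user) are all assumptions.

```python
from dataclasses import dataclass

@dataclass
class Account:
    user_id: str
    post_count: int        # number of microblogs
    account_age_days: int  # days since registration
    missing_fields: int    # unfilled profile fields
    total_fields: int      # fillable profile fields (assumed > 0)

def ratio_flags(acc, i, j):
    """Evaluate the two ratio criteria against thresholds i and j."""
    posts_per_day = acc.post_count / max(acc.account_age_days, 1)
    missing_ratio = acc.missing_fields / acc.total_fields
    return {
        "high_post_rate": posts_per_day > i,
        "sparse_profile": missing_ratio > j,
    }

def remove_noise(group, i, j, is_noise):
    """Drop every account for which the caller's rule is_noise(flags) is true."""
    return [acc for acc in group if not is_noise(ratio_flags(acc, i, j))]

group = [
    Account("u1", post_count=300, account_age_days=10, missing_fields=6, total_fields=8),
    Account("u2", post_count=20, account_age_days=400, missing_fields=1, total_fields=8),
]
# Assumed rule: an account that triggers no flag is treated as noise.
kept = remove_noise(group, i=5.0, j=0.5, is_noise=lambda flags: not any(flags.values()))
kept_ids = [acc.user_id for acc in kept]
```

Keeping the rule outside `remove_noise` makes it easy to try other combinations, such as requiring both flags or adding the ID, registration-week, and membership checks.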
While the present invention has been described in detail with reference to the preferred embodiments, it should be understood that the above description should not be taken as limiting the invention. Various modifications and alterations to this invention will become apparent to those skilled in the art upon reading the foregoing description. Accordingly, the scope of the invention should be determined from the following claims.

Claims (10)

1. A clustering-based method for detecting microblog water army groups, characterized by comprising the following steps:
processing the blog posts published by users;
constructing a structure graph from the users who interact through comments under each blog post;
processing the users' multi-feature attribute data to construct a discrete distance and a radius function;
clustering the users with a DBSCAN algorithm to obtain a plurality of clusters;
and comparing the clusters under different blog posts and partitioning them by node similarity to obtain the water army groups.
2. The clustering-based microblog water army group detection method of claim 1, wherein the user multi-feature attributes include account features, social attributes, and content features.
3. The clustering-based microblog water army group detection method of claim 2, wherein the account features include the ID, the registration date, whether the account holds a membership level, the number of microblogs, and the degree to which the basic profile is incomplete.
4. The clustering-based microblog water army group detection method of claim 2, wherein the social attributes include the number of followers, the follower IDs, the number of followed accounts, and the IDs of the followed accounts.
5. The clustering-based microblog water army group detection method of claim 2, wherein the content features include the topics of previously published blog posts, the publication times of the blog posts, and the total number of blog posts.
6. The clustering-based microblog water army group detection method of claim 1, wherein the DBSCAN algorithm is an improved DBSCAN algorithm, specifically: the preprocessed data and its interaction graph structure are acquired, the distance calculation formula and the radius r are updated as the time period changes, and clustering is performed with the DBSCAN algorithm on this basis.
7. The clustering-based microblog water army group detection method of claim 6, wherein the improved DBSCAN algorithm adds a radius variation function.
8. The clustering-based microblog water army group detection method of claim 1, further comprising:
preprocessing the users under the blog post to obtain a batch of screened user data.
9. The clustering-based microblog water army group detection method of claim 1, further comprising:
eliminating the noise in the water army group.
10. The clustering-based microblog water army group detection method of claim 9, wherein the judgment is made according to: whether the account ID is a randomly generated garbled string; the time period in which the registration time falls; whether the account is a member; whether (number of microblogs / account age) >= i; and whether (number of missing profile fields / total number of fillable profile fields) > j.
CN202110023795.8A 2021-01-08 2021-01-08 Microblog water army group detection method based on clustering Pending CN112800304A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202110023795.8A CN112800304A (en) 2021-01-08 2021-01-08 Microblog water army group detection method based on clustering

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202110023795.8A CN112800304A (en) 2021-01-08 2021-01-08 Microblog water army group detection method based on clustering

Publications (1)

Publication Number Publication Date
CN112800304A (en) 2021-05-14

Family

ID=75809402

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202110023795.8A Pending CN112800304A (en) 2021-01-08 2021-01-08 Microblog water army group detection method based on clustering

Country Status (1)

Country Link
CN (1) CN112800304A (en)

Cited By (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN116150507A (en) * 2023-04-04 2023-05-23 湖南蚁坊软件股份有限公司 Water army group identification method, device, equipment and medium

Citations (10)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN102571484A (en) * 2011-12-14 2012-07-11 上海交通大学 Method for detecting and finding online water army
CN105956184A (en) * 2016-06-01 2016-09-21 西安交通大学 Method for identifying collaborative and organized junk information release team in micro-blog social network
CN106940732A (en) * 2016-05-30 2017-07-11 国家计算机网络与信息安全管理中心 A kind of doubtful waterborne troops towards microblogging finds method
CN107688955A (en) * 2016-08-03 2018-02-13 浙江工业大学 A kind of city commercial circle group variety division methods based on adaptive DBSCAN Density Clusterings
CN107895010A (en) * 2017-11-13 2018-04-10 华东师范大学 A kind of method that detection network navy is thumbed up based on network
CN108052543A (en) * 2017-11-23 2018-05-18 北京工业大学 A kind of similar account detection method of microblogging based on map analysis cluster
CN109255125A (en) * 2018-08-17 2019-01-22 浙江工业大学 A kind of Web service clustering method based on improvement DBSCAN algorithm
CN110210575A (en) * 2019-06-13 2019-09-06 重庆亿创西北工业技术研究院有限公司 A kind of three clustering methods and system based on improvement DBSCAN
CN110956210A (en) * 2019-11-29 2020-04-03 重庆邮电大学 Semi-supervised network water force identification method and system based on AP clustering
CN111640033A (en) * 2020-04-11 2020-09-08 中国人民解放军战略支援部队信息工程大学 Detection method and device for network water army

Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
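Zhao Xingyu (赵星宇) et al.: "Identification of microblog advertisement publishers based on cluster analysis", Journal of Computer Applications (《计算机应用》), vol. 38, no. 5, pages 1267 - 1271 *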

Cited By (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN116150507A (en) * 2023-04-04 2023-05-23 湖南蚁坊软件股份有限公司 Water army group identification method, device, equipment and medium
CN116150507B (en) * 2023-04-04 2023-06-30 湖南蚁坊软件股份有限公司 Water army group identification method, device, equipment and medium

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
RJ01 Rejection of invention patent application after publication

Application publication date: 20210514