Disclosure of Invention
The invention aims to provide a clustering-based method for detecting water army groups (organized paid posters) on microblogs. The method reduces the need for semantic analysis and, by focusing on the overall interaction structure, avoids being deceived by disguised water army features. Compared with the prior art, it offers higher accuracy and a simpler process. It can effectively mine the water army groups in a microblog, including the different water army groups that may coexist.
In order to achieve the purpose, the technical scheme adopted by the invention is as follows:
A clustering-based microblog water army group detection method comprises the following steps:
processing the blog posts published by users;
constructing a graph structure from the users who interact through comments under each blog post;
processing the users' multi-feature attribute data to construct a discrete distance and a radius function;
clustering the users with a DBSCAN algorithm to obtain a plurality of clusters;
and comparing the clusters under the blog posts and dividing them by node similarity to obtain the water army group.
Further, the user multi-feature attributes include account features, social attributes, and content features.
Further, the account features include the ID, the registration date, the membership status, the number of microblog posts, and the degree to which basic profile data is missing.
Further, the social attributes include the number of followers, the follower IDs, the number of followed accounts, and the IDs of the followed accounts.
Further, the content features include the topics of historically published blog posts, the publication times of the blog posts, and the total number of blog posts.
Further, the DBSCAN algorithm is an improved DBSCAN algorithm, specifically: the preprocessed data and its interaction graph structure are acquired, the distance formula and the radius r are updated as the time period changes, and clustering is then performed with the DBSCAN algorithm on this basis.
Further, the improved DBSCAN algorithm adds a radius variation function.
Further, the method also includes:
preprocessing the users under the blog post to obtain a batch of screened user data.
Further, the method also includes:
eliminating the noise in the water army group.
Further, noise is judged according to whether the account ID is a randomly generated garbled string, the time period in which the registration time falls, whether the account is a member, whether the number of microblog posts divided by the account's age exceeds a value i, and whether the number of missing basic profile fields divided by the total number of fillable profile fields exceeds a value j.
Compared with the prior art, the invention has at least one of the following advantages:
The method reduces the need for semantic analysis and, by focusing on the overall interaction structure, avoids being deceived by disguised water army features. Compared with the prior art, it offers higher accuracy and a simpler process. It can effectively mine the water army groups in a microblog, including the different water army groups that may coexist.
Detailed Description
The present invention is described in further detail below with reference to FIGS. 1 to 4 and the detailed description. Its advantages and features will become more apparent from the following description. Note that the drawings are in a very simplified form and are not drawn to precise scale; they serve only to aid in describing the embodiments conveniently and clearly. It should be understood that the structures, ratios, sizes, and the like shown in the drawings only accompany the disclosure of the specification so that those skilled in the art can read and understand it. They do not limit the conditions under which the invention may be practiced and therefore carry no substantive technical significance. Any structural modification, change in ratio, or adjustment in size that does not affect the efficacy or the achievable purpose of the invention still falls within its scope.
It is noted that, herein, relational terms such as "first" and "second" are used solely to distinguish one entity or action from another and do not necessarily require or imply any actual relationship or order between such entities or actions. Also, the terms "comprise", "include", and any other variants thereof are intended to cover a non-exclusive inclusion, so that a process, method, article, or device that includes a series of elements includes not only those elements but also other elements not explicitly listed, as well as elements inherent to such a process, method, article, or device. Without further limitation, an element introduced by the phrase "comprising a ..." does not exclude the presence of additional identical elements in the process, method, article, or device that includes the element.
Referring to FIGS. 1 to 4, in the microblog water army group identification method of this embodiment, the crawled data is first preprocessed. Preprocessing here means labeling: the topic of each blog post is labeled, and each crawled user node is labeled with its attributes, including the ID, the interaction information among users, the registration time, the number of historically published blog posts, the topics and times of those posts, the number of followers, the follower IDs, the number of followed accounts, and the IDs of the followed accounts. The user nodes under each topic are then connected according to their comment interactions to construct a graph. The edges between nodes carry a distance attribute; because the node attributes on which the distance is built are all unordered, the VDM distance is used. The radius determines the size of a cluster, and because a water army group acts cooperatively within the same time period, the radius is set as a function of time.
The improved DBSCAN algorithm can then be used to cluster the graph of the data. The user nodes under each blog post form clusters, and the clusters under different blog posts are then compared for similarity. The similarity is computed on adjacency matrices and is based on the common neighbors and on whether the reachable neighbors share the same target label structure. A subgraph whose nodes have a matrix similarity above the set threshold is a water army group.
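As a rough illustration of the clustering step, the sketch below is a plain DBSCAN in which the neighbourhood radius is looked up per node instead of being one constant. It is not the patented implementation. The names `dbscan`, `distance`, `radius_of`, and `min_pts` are hypothetical, the distance is passed in as a callable (a VDM-based distance could be supplied), and tying each node's radius to the time period in which it commented is an assumption made here for illustration.

```python
from typing import Callable, Dict, Hashable, List

NOISE = -1  # label for nodes that belong to no cluster


def dbscan(
    nodes: List[Hashable],
    distance: Callable[[Hashable, Hashable], float],
    radius_of: Callable[[Hashable], float],
    min_pts: int,
) -> Dict[Hashable, int]:
    """DBSCAN whose neighbourhood radius may differ from node to node."""

    def neighbours(p: Hashable) -> List[Hashable]:
        r = radius_of(p)
        # A node counts as its own neighbour (distance 0), as in standard DBSCAN.
        return [q for q in nodes if distance(p, q) <= r]

    labels: Dict[Hashable, int] = {}
    cluster = 0
    for p in nodes:
        if p in labels:
            continue
        nbrs = neighbours(p)
        if len(nbrs) < min_pts:
            labels[p] = NOISE  # may later be relabelled as a border node
            continue
        labels[p] = cluster
        queue = [q for q in nbrs if q != p]
        while queue:
            q = queue.pop()
            if labels.get(q) == NOISE:
                labels[q] = cluster  # border node: joins but is not expanded
            if q in labels:
                continue
            labels[q] = cluster
            q_nbrs = neighbours(q)
            if len(q_nbrs) >= min_pts:  # q is a core node, so keep expanding
                queue.extend(q_nbrs)
        cluster += 1
    return labels


# Toy data: 1-D positions stand in for attribute distances.
position = {"a": 0.0, "b": 1.0, "c": 2.0, "d": 10.0, "e": 20.0}
period = {"a": "rise", "b": "rise", "c": "rise", "d": "fall", "e": "fall"}
radius = {"rise": 1.5, "fall": 0.5}  # larger radius in the rapid-rise period

labels = dbscan(
    list(position),
    distance=lambda p, q: abs(position[p] - position[q]),
    radius_of=lambda p: radius[period[p]],
    min_pts=2,
)
```

In this toy run the three users active in the rising period end up in one cluster, while the two isolated users are labelled as noise.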
The crawled user data features are as follows:
Account features: ID, registration date, membership status, number of microblog posts, and degree of missing basic profile data.
Social attributes: number of followers, follower IDs, number of followed accounts, and IDs of followed accounts.
Content attributes: topics of historically published blog posts, publication times of the blog posts, and total number of blog posts.
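One possible way to hold the crawled features listed above is a single record per user. The sketch below is only an illustration: the class name `UserRecord` and all field names are hypothetical, and deriving the follower and followed-account counts from the ID lists assumes the lists were crawled completely.

```python
from dataclasses import dataclass, field
from datetime import date, datetime
from typing import List


@dataclass
class UserRecord:
    # Account features
    user_id: str
    registration_date: date
    is_member: bool
    post_count: int
    profile_missing_ratio: float  # missing profile fields / fillable fields
    # Social attributes
    follower_ids: List[str] = field(default_factory=list)
    following_ids: List[str] = field(default_factory=list)
    # Content attributes
    post_topics: List[str] = field(default_factory=list)
    post_times: List[datetime] = field(default_factory=list)

    @property
    def follower_count(self) -> int:
        return len(self.follower_ids)

    @property
    def following_count(self) -> int:
        return len(self.following_ids)


example = UserRecord(
    user_id="u1",
    registration_date=date(2020, 1, 6),
    is_member=False,
    post_count=10,
    profile_missing_ratio=0.5,
    follower_ids=["a", "b"],
    following_ids=["c"],
)
```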
The method specifically comprises the following steps:
S1: The graph is constructed from the interactions between users. When node vi has a comment interaction with node vj, an edge eij is defined. In FIG. 3, the circles are user nodes and an edge between two nodes indicates that an interaction occurred between the two users. In FIG. 4, the clusters obtained by clustering the graph of FIG. 3 are the user nodes inside the polygons (the frames in FIG. 4). The weight of an edge is represented by a time period.
Because a water army group acts in a coordinated manner within one time period and thereby makes a blog post spread rapidly in a short time, the interval from the first appearance of a commenter to the crawl time is divided into n equal time periods. According to the change in the peakedness of user participation, these periods fall into four stages: gentle rise, rapid rise, rapid fall, and gentle fall. In FIG. 2, the interval is divided evenly into 12 time periods, but the peak change is mainly reflected between T2 and T3: T1 and T2 form the gentle rising stage; T3 to T5 form the rapid rising stage, which is the period in which a water army group is most likely to flood in; and T6 to T12 form the falling stages, in which users generally arrive one after another to follow the blog post after its peak popularity has passed. The influx of a water army group spreads the blog post more widely, so that more users may take part in it. It follows that a water army group is most likely to join during the rising stage. Different radii can therefore be set for the different time periods $T_i$ by constructing the radius function $r = (n_i / T_i) \times k$, where k is a default value set according to the circumstances.
Here $n_i / T_i$ is the peakedness value of the stage. The value of r necessarily increases when a water army group floods in, which prevents water army nodes from being lost during clustering.
The distance is then calculated. Because the feature attributes here are all discrete, the Minkowski distance is not applicable, and the VDM (Value Difference Metric) is introduced instead.
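Before the distance is defined, the radius function described above can be illustrated with a minimal sketch. It assumes that n_i is the number of comments falling in period i and that T_i is the length of that period; the source calls n_i/T_i the peakedness value without spelling this out. The function names and the sample timestamps are hypothetical.

```python
from typing import List


def count_per_period(timestamps: List[float], n_periods: int) -> List[int]:
    """Count comments in each of n_periods equal-width time periods."""
    start, end = min(timestamps), max(timestamps)
    width = (end - start) / n_periods or 1.0  # avoid zero width
    counts = [0] * n_periods
    for t in timestamps:
        idx = min(int((t - start) / width), n_periods - 1)
        counts[idx] += 1
    return counts


def radius_per_period(counts: List[int], period_length: float, k: float) -> List[float]:
    """r_i = (n_i / T_i) * k for every period i."""
    return [(n_i / period_length) * k for n_i in counts]


# Comment times in hours since the first comment (made-up values).
times = [0, 1, 2, 2.5, 2.6, 2.7, 10, 12]
counts = count_per_period(times, n_periods=4)
radii = radius_per_period(counts, period_length=3.0, k=0.5)
```

Periods with many comments receive a larger radius, so a burst of coordinated accounts is less likely to be split across clusters or dropped as noise.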
Let $m_{u,a}$ denote the number of samples that take value $a$ on attribute $u$, let $m_{u,a,i}$ denote the number of samples in the $i$-th sample cluster that take value $a$ on attribute $u$, and let $k$ be the number of sample clusters. The VDM distance between two discrete values $a$ and $b$ on attribute $u$ is then:
$$\mathrm{VDM}_p(a,b)=\sum_{i=1}^{k}\left|\frac{m_{u,a,i}}{m_{u,a}}-\frac{m_{u,b,i}}{m_{u,b}}\right|^{p}$$
This defines a distance calculation that measures similarity, as needed for the clustering problem discussed below: the smaller the distance, the greater the similarity, and vice versa. Such a measure is called a non-metric distance.
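The VDM computation can be sketched in a few lines. This is an illustrative sketch rather than the inventors' code: the function name `vdm` and its arguments are hypothetical, and the sample-cluster labels are taken as given because the source does not say how they are obtained. The exponent `p` is a parameter; with `p=1` the result is the plain sum of absolute differences.

```python
from collections import Counter, defaultdict
from typing import Hashable, Sequence


def vdm(
    values: Sequence[Hashable],
    cluster_ids: Sequence[int],
    a: Hashable,
    b: Hashable,
    p: int = 1,
) -> float:
    """VDM distance between values a and b of one discrete attribute."""
    total = Counter(values)             # m_{u,a}: samples with value a
    per_cluster = defaultdict(Counter)  # m_{u,a,i}: the same, inside cluster i
    for value, cluster in zip(values, cluster_ids):
        per_cluster[cluster][value] += 1
    if total[a] == 0 or total[b] == 0:
        raise ValueError("both values must occur in the sample")
    return sum(
        abs(counts[a] / total[a] - counts[b] / total[b]) ** p
        for counts in per_cluster.values()
    )


# Membership status of four users, split into two sample clusters.
membership = ["member", "member", "non-member", "non-member"]
clusters = [0, 0, 1, 1]
d_diff = vdm(membership, clusters, "member", "non-member")
d_same = vdm(membership, clusters, "member", "member")
```

Two values that are distributed identically across the clusters get distance 0, while values that never share a cluster get the largest distance.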
Based on the above formula, the distance between the two discrete values a and b that any two nodes i and j take on attribute u is VDM_p(a, b). A distance can be calculated in this way over the discrete value range of every single attribute between nodes i and j, that is, for the attributes u1, u2, and so on. At this point all the conditions for running the DBSCAN algorithm are satisfied, and once a plurality of clusters has been obtained, the similarity between the clusters is compared. The similarity is based on the common neighbors and on whether the reachable neighbors share the same target label structure. A subgraph whose nodes have a similarity above the set threshold is a water army group.
First, each cluster subgraph is converted into an adjacency matrix whose rows i and columns j are indexed by the node IDs, from the first node to the last. When i interacts with j, the entry is marked with the time-period attribute of the interaction edge; when there is no interaction, it is marked 0. Each cluster thus has a corresponding matrix, and a cluster is judged to be a water army group when the similarity between the matrices exceeds a set threshold. Similarity between nodes is usually judged by common neighbors; to improve accuracy, a constraint is added on this basis: the proportion of matrices in which the same IDs have the same matrix value must exceed a set threshold. Such a group is detected as a water army group. For example, if three of four matrices have the value 2 at row 2, column 3, and the frequency threshold is set to 1/2, then 3/4 is greater than 1/2, and the nodes of row 2 and column 3 form a detected water army group.
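The frequency constraint in the example above can be sketched as follows. Each cluster's adjacency matrix is stored sparsely as a dictionary from a pair of user IDs to the time-period value of their interaction, with a missing pair standing for 0. The names `frequent_edges` and `group_members` are hypothetical, and the sketch covers only the "same IDs, same matrix value" constraint, not the common-neighbour comparison.

```python
from collections import Counter
from typing import Dict, Iterable, List, Set, Tuple

Edge = Tuple[str, str]
Matrix = Dict[Edge, int]  # sparse adjacency matrix; an absent edge means 0


def frequent_edges(matrices: List[Matrix], threshold: float) -> Dict[Edge, int]:
    """Keep edges whose (IDs, value) pair recurs in more than `threshold` of the matrices."""
    counts: Counter = Counter()
    for matrix in matrices:
        for edge, period in matrix.items():
            if period != 0:
                counts[(edge, period)] += 1
    return {
        edge: period
        for (edge, period), seen in counts.items()
        if seen / len(matrices) > threshold
    }


def group_members(edges: Iterable[Edge]) -> Set[str]:
    """Collect the user IDs that appear on any retained edge."""
    return {user for edge in edges for user in edge}


# Four clusters from four blog posts; three share value 2 at (u2, u3).
matrices = [
    {("u2", "u3"): 2, ("u1", "u2"): 1},
    {("u2", "u3"): 2},
    {("u2", "u3"): 2},
    {("u2", "u3"): 4},
]
kept = frequent_edges(matrices, threshold=0.5)
members = group_members(kept)
```

With a threshold of 1/2, only the pair that shows the same value in three of the four matrices is retained, which mirrors the worked example in the text.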
S2: Eliminate noise values. The water army group obtained in step S1 is checked against the following criteria: whether the account ID is a randomly generated garbled string; whether the registration time falls within a given time period (specifically, a particular week of a particular month of a particular year); whether the account is a member; whether the number of microblog posts divided by the account's age in days exceeds i (i is obtained as the mode over the data set from S1); and whether the number of missing basic profile fields divided by the total number of fillable profile fields exceeds j (j is obtained as the mode over the data set from S1).
Further criteria are whether the ratio of the number of followers to the number of followed accounts, the ratio of the number of blog posts to the time since registration, and the ratio of the number of blog-post links to the number of blog posts match those of normal users. Each attribute set in this way has a corresponding value, taken from the values of the nodes in the water army group obtained in S1, and the attributes of each node are compared with it. Nodes whose values fall outside the set threshold are eliminated, and what remains is the final result.
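The noise-elimination step leaves some details open, so the following is only one possible reading, sketched under stated assumptions: each user is reduced to a set of boolean checks, the group's typical profile is the most common value of each check, and a user is removed when it disagrees with that profile on more than `max_mismatches` checks. All names, the choice of checks, and the thresholds are hypothetical.

```python
from collections import Counter
from typing import Dict, List

Flags = Dict[str, bool]


def ratio_flags(post_count: int, account_age_days: int,
                missing_fields: int, fillable_fields: int,
                i: float, j: float) -> Flags:
    """Two of the ratio checks from S2, compared against thresholds i and j."""
    return {
        "high_post_rate": post_count / max(account_age_days, 1) > i,
        "sparse_profile": missing_fields / max(fillable_fields, 1) > j,
    }


def group_profile(flags_by_user: Dict[str, Flags]) -> Flags:
    """Most common value of each check across the candidate group."""
    checks = next(iter(flags_by_user.values())).keys()
    return {
        check: Counter(f[check] for f in flags_by_user.values()).most_common(1)[0][0]
        for check in checks
    }


def remove_noise(flags_by_user: Dict[str, Flags], max_mismatches: int) -> List[str]:
    """Keep users that deviate from the group profile on few enough checks."""
    profile = group_profile(flags_by_user)
    kept = []
    for user, flags in flags_by_user.items():
        mismatches = sum(flags[check] != profile[check] for check in profile)
        if mismatches <= max_mismatches:
            kept.append(user)
    return kept


group = {
    "u1": {"random_id": True, "is_member": False, "high_post_rate": True},
    "u2": {"random_id": True, "is_member": False, "high_post_rate": True},
    "u3": {"random_id": True, "is_member": False, "high_post_rate": True},
    "u4": {"random_id": False, "is_member": True, "high_post_rate": False},
}
cleaned = remove_noise(group, max_mismatches=1)
```

Here `u4` differs from the group's typical profile on every check and is dropped as noise, while the other three users are retained.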
While the present invention has been described in detail with reference to the preferred embodiments, it should be understood that the above description should not be taken as limiting the invention. Various modifications and alterations to this invention will become apparent to those skilled in the art upon reading the foregoing description. Accordingly, the scope of the invention should be determined from the following claims.