CN108052543A

CN108052543A - A method for detecting similar microblog accounts based on graph-analysis clustering

Info

Publication number: CN108052543A
Application number: CN201711181758.XA
Authority: CN
Inventors: 姜伟; 田原; 庄俊玺; 吴贤达; 潘邵芹
Original assignee: Beijing University of Technology
Current assignee: Beijing University of Technology
Priority date: 2017-11-23
Filing date: 2017-11-23
Publication date: 2018-05-18
Anticipated expiration: 2037-11-23
Also published as: CN108052543B

Abstract

The invention discloses a method for detecting similar microblog accounts based on graph-analysis clustering and multidimensional similarity calculation. The method comprises: S1. converting the malicious-account identification problem into a user-similarity calculation problem; S2. building directed graphs from user information by graph computation; S3. clustering users with a graph-analysis algorithm; S4. introducing a density weight d to filter out users with sparse data; S5. introducing the MDUS algorithm, which calculates similarity from multidimensional information; S6. calculating the weight of each dimension with the analytic hierarchy process (AHP) to obtain a weighted similarity; S7. obtaining data for m users with a web crawler, running the experiment on Spark, inputting the target user's information, and taking the resulting set of similar accounts as suspicious malicious accounts, with the MDUS algorithm reaching an accuracy of 80%. By combining graph-analysis clustering with multidimensional similarity calculation, the method detects abnormal accounts quickly, which is important for keeping social networking sites stable.

Description

A method for detecting similar microblog accounts based on graph-analysis clustering

Technical field

The invention belongs to the field of information technology, and in particular relates to a method for detecting similar microblog accounts based on graph-analysis clustering and multidimensional similarity calculation. It finds similar microblog accounts quickly, perceives malicious group behavior, and effectively identifies Internet "water army" (paid-poster) accounts and "reincarnated" (re-registered) accounts, which is important for the governance of social networks.

Background technology

At present, social network analysis is becoming a hot spot and a trend in network technology research. Academia and industry have proposed numerous research schemes, including the analysis of user characteristics, user behavior patterns and network structure, which are of significant value for social network security, user privacy protection, and the monitoring of group events on the network. Many universities and research institutions at home and abroad have carried out in-depth research in this field: abroad, for example, the University of California, Berkeley and Carnegie Mellon University; in China, representative institutions include Microsoft Research, Tsinghua University and Peking University. Important research results have repeatedly appeared in top international conferences and journals in information security and data mining, such as S&P, CCS, USS, KDD and AAAI. Hundreds of these results concern social network security, and more than a thousand academic studies relate to the similarity of social network users. Current networks contain batches of accounts that carry out a large number of malicious acts concentrated within a certain period of time. These accounts may be fake accounts created in bulk by an attacker or accounts that have been stolen. Such accounts should be found and handled as early as possible to prevent security risks to social networking sites and to user privacy. However, the main means of monitoring such accounts still remains at the stage of manual review. A method that detects similar accounts quickly therefore helps to maintain the stability and security of the social network space efficiently.

Summary of the invention

The invention discloses a method for detecting similar microblog accounts based on Spark GraphX graph-analysis clustering and multidimensional similarity calculation. Users are clustered by GraphX graph analysis on the basis of their basic information and behavioral features, and within the resulting class the MDUS algorithm (which calculates similarity from multidimensional user information) produces a ranked sequence of similar accounts. The method comprises:

S1. Because malicious accounts created in batches are highly similar to one another, the problem of identifying malicious accounts is converted into a user-similarity calculation problem. A set of similar accounts is obtained by similarity calculation, and the top N accounts (N may be 3, 5, 10, etc.) are selected as suspicious malicious accounts for the network administrator to review;

S2. Microblog user information is used to build user profiles by large-scale parallel graph computation. The profiles comprise a user-relationship profile and a user-behavior profile, from which a user follow-relationship directed graph and a microblog repost-relationship directed graph are built respectively;

S3. With the directed graphs as the data source, users are clustered by their connectivity in the network using the Spark GraphX graph-analysis algorithm Connected Components. Users assigned to the same class are generally considered to be more similar to one another, so the users in the same class as the target user are taken as the data set to be tested, which reduces the complexity of calculating similarity among a massive number of users;

S4. A density weight d is introduced. It considers the ratios of the user's follower count m₁, following count m₂ and number of posted microblogs m₃ to the corresponding averages n_i (i = 1, 2, 3) over all users in the class to which the user belongs. The density weight d is obtained from these ratios by a formula (the formula itself is not reproduced in this text), and users with sparse data (d < α) are filtered out, which addresses the low accuracy of detection results caused by the sparsity of microblog user data;

S5. After the users with sparse data have been filtered out, the similarity Sim(u, u') between the target user u and every other user u' in the same class is calculated. The calculation uses the MDUS algorithm, that is, similarity is calculated from multidimensional user information: microblog user information is divided into four dimensions, namely background information, post information, follower and following information, and comment and repost information, whose similarities are calculated with edit distance, tf-idf, LDA and cosine similarity respectively;

S6. The total similarity is calculated by the following formula:

Sim(u, u') = w₁·Sim_background(u, u') + w₂·Sim_text(u, u') + w₃·Sim_fans(u, u') + w₄·Sim_tweet(u, u'),

where w₁ + w₂ + w₃ + w₄ = 1, and the values of w₁, w₂, w₃ and w₄ are derived from the judgment matrix of the analytic hierarchy process (AHP). The weighted similarity is finally calculated by this formula.
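As an illustration of how four weights summing to 1 might be derived from an AHP judgment matrix and then applied in the weighted sum, here is a minimal Python sketch. It is not the patent's implementation: the pairwise judgment values in `JUDGMENT`, the per-dimension similarity scores, and the use of the row geometric-mean approximation (rather than the principal eigenvector) are all assumptions made for the example.

```python
import math


def ahp_weights(judgment):
    """Approximate AHP priority weights with the row geometric-mean method.

    judgment[i][j] states how much more important dimension i is than
    dimension j; a consistent matrix has judgment[j][i] == 1 / judgment[i][j].
    """
    n = len(judgment)
    geo_means = [math.prod(row) ** (1.0 / n) for row in judgment]
    total = sum(geo_means)
    return [g / total for g in geo_means]  # normalised so the weights sum to 1


def weighted_similarity(sims, weights):
    """Sim(u, u') as the weighted sum of the per-dimension similarities."""
    return sum(w * s for w, s in zip(weights, sims))


# Hypothetical judgment matrix; row/column order is
# background, text, fans, tweet. The values are illustrative only.
JUDGMENT = [
    [1.0, 1 / 5, 1 / 3, 1 / 3],
    [5.0, 1.0, 2.0, 2.0],
    [3.0, 1 / 2, 1.0, 1.0],
    [3.0, 1 / 2, 1.0, 1.0],
]

weights = ahp_weights(JUDGMENT)

# Hypothetical per-dimension similarities for one pair (u, u').
total_sim = weighted_similarity([0.9, 0.4, 0.6, 0.5], weights)
print(weights, total_sim)
```

The geometric-mean method is a common shortcut that agrees with the eigenvector method when the judgment matrix is perfectly consistent; a fuller treatment would also check the consistency ratio before accepting the weights.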

S7. n groups (n > 500), each consisting of k followers (k > 100) of a well-known Sina user, are chosen as the training set for the similarity threshold. The similarity Sim(u, u') between accounts is calculated, the mean and the standard deviation σ of Sim(u, u') are computed for each group, and the similarity threshold μ is obtained from them by a formula (the formula itself is not reproduced in this text);

S8. Data for m microblog users (m > 100000) are obtained with a web crawler and the experiment is run on Spark. The information of the target user u is input to verify the accuracy of the MDUS algorithm. The results are sorted by similarity value, and the top N accounts in the similar-account set (N may be 3, 5, 10, etc.) are taken as suspicious malicious accounts. Detection accuracy is highest at N = 5.
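Two of the measures named in step S5, edit distance and cosine similarity, can be sketched in plain Python as follows. This is only an illustrative sketch: the function names are hypothetical, normalising the edit distance by the longer string length is an assumption (the text does not say how the distance is mapped to a similarity), and the term-count vectors stand in for whatever feature vectors the method actually uses.

```python
import math
from collections import Counter


def edit_distance(a, b):
    """Levenshtein distance computed with a rolling dynamic-programming row."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution or match
        prev = curr
    return prev[-1]


def edit_similarity(a, b):
    """Map edit distance to [0, 1]; this normalisation is an assumption."""
    longest = max(len(a), len(b))
    if longest == 0:
        return 1.0
    return 1.0 - edit_distance(a, b) / longest


def cosine_similarity(x, y):
    """Cosine of the angle between two sparse vectors stored as dicts."""
    dot = sum(value * y.get(key, 0) for key, value in x.items())
    norm_x = math.sqrt(sum(v * v for v in x.values()))
    norm_y = math.sqrt(sum(v * v for v in y.values()))
    if norm_x == 0 or norm_y == 0:
        return 0.0
    return dot / (norm_x * norm_y)


# Hypothetical inputs: two nicknames and two bags of reposted topics.
sim_background = edit_similarity("user_2017_a", "user_2017_b")
sim_tweet = cosine_similarity(Counter(["sports", "news", "news"]),
                              Counter(["news", "music"]))
print(sim_background, sim_tweet)
```

Both functions return values in [0, 1] for non-negative inputs, so their outputs can be combined directly in a weighted sum.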

Description of the drawings

Fig. 1 shows the calculation process of the MDUS algorithm;

Fig. 2 shows the overall framework of the method.

Specific embodiment

The method for detecting similar microblog accounts based on graph-analysis clustering and multidimensional-information similarity calculation comprises the following steps:

S1. Data for m microblog users (m > 100000) are obtained at random with a web crawler as the initial data set, including each user's background information, post information, follower and following information, and comment and repost information;

S2. The experiment is run on the Spark platform. The information of the target user u is input and, together with the m users, is used to build user profiles by large-scale parallel graph analysis. The profiles comprise a user-relationship profile and a user-behavior profile, from which a follow-relationship directed graph and a microblog repost-relationship directed graph are built over the users;

S3. With the directed graphs as the data source, users are clustered by their connectivity in the network using the Spark GraphX graph-analysis algorithm Connected Components. Users assigned to the same class are generally considered to be more similar to one another, so all users in the same class as the target user u are taken as the data set to be tested, which reduces the complexity of calculating similarity among a massive number of users;

S4. The density weight d of each user in the data set to be tested is calculated. It considers the ratios of the user's follower count m₁, following count m₂ and number of posted microblogs m₃ to the corresponding averages n_i (i = 1, 2, 3) over all users in the class. The density weight d is obtained from these ratios by a formula (the formula itself is not reproduced in this text), and users with sparse data (d < α, with α = 0.5 determined by cross-validation) are filtered out, which addresses the low accuracy of detection results caused by the sparsity of microblog user data;

S5. After the users with sparse data have been filtered out, edit distance, tf-idf, LDA and cosine similarity are used respectively to calculate, between user u and each user u' in the data set to be tested, the similarities in the four dimensions of background information, post information, follower and following information, and comment and repost information: Sim_background(u, u'), Sim_text(u, u'), Sim_fans(u, u') and Sim_tweet(u, u');

S6. The total similarity is calculated by the following formula:

Sim(u, u') = w₁·Sim_background(u, u') + w₂·Sim_text(u, u') + w₃·Sim_fans(u, u') + w₄·Sim_tweet(u, u'),

where w₁ + w₂ + w₃ + w₄ = 1, and the values of w₁, w₂, w₃ and w₄ are derived from the judgment matrix of the analytic hierarchy process (AHP). Because a user's background information, including e-mail address and real-name authentication, plays a smaller role in the similarity calculation, it is assigned a lower weight; conversely, microblog text information is assigned a higher weight;

S7. Using the MDUS algorithm, similarity is calculated from the multidimensional information and combined with the weights to obtain the weighted similarity between user u and each user u' in the data set to be tested;

S8. Through Spark distributed computation, the set of accounts similar to the target user u is obtained and sorted by similarity in descending order. Accuracy peaks at similarity threshold μ = 0.25, so μ = 0.25 is chosen; an account whose similarity value exceeds the similarity threshold is marked as a similar account. The top N accounts in the set (N may be 3, 5, 10, etc.) are taken as suspicious malicious accounts. Detection accuracy is highest at N = 5, so the top 5 are taken as the detection result, giving an accuracy of 80%.
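The clustering of step S3 and the selection of step S8 can be sketched together in single-machine Python. This sketch does not use Spark: a union-find structure stands in for the GraphX Connected Components algorithm, on the assumption that edge direction is ignored and each component is labelled by its smallest vertex id. The function names, the toy graph and the similarity scores are hypothetical; only μ = 0.25 and N = 5 come from the text.

```python
def connected_components(vertices, edges):
    """Label every vertex with the smallest vertex id in its component.

    Edge direction is ignored, so (src, dst) and (dst, src) are equivalent.
    """
    parent = {v: v for v in vertices}

    def find(v):
        while parent[v] != v:
            parent[v] = parent[parent[v]]  # path halving
            v = parent[v]
        return v

    for src, dst in edges:
        a, b = find(src), find(dst)
        if a != b:
            # Always keep the smaller id as the root of the merged set.
            if a < b:
                parent[b] = a
            else:
                parent[a] = b
    return {v: find(v) for v in vertices}


def select_suspicious(target, scores, labels, mu=0.25, top_n=5):
    """Keep users in the target's class whose similarity exceeds mu,
    sorted in descending order and truncated to the top N."""
    candidates = [
        (user, sim) for user, sim in scores.items()
        if user != target and labels[user] == labels[target] and sim > mu
    ]
    candidates.sort(key=lambda pair: pair[1], reverse=True)
    return candidates[:top_n]


# Toy follow graph: users 1, 2, 3, 6 are connected; 4 and 5 form a second
# class; 7 is isolated.
vertices = [1, 2, 3, 4, 5, 6, 7]
edges = [(1, 2), (2, 3), (4, 5), (3, 6)]
labels = connected_components(vertices, edges)

# Hypothetical weighted similarities Sim(u, u') for target user u = 1.
scores = {2: 0.8, 3: 0.2, 4: 0.9, 6: 0.5}
suspicious = select_suspicious(1, scores, labels)
print(labels, suspicious)
```

User 4 has the highest score but is excluded because it falls in a different class from the target, which is the point of clustering first: similarity only needs to be computed and ranked inside the target's own class.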

Claims

1. A method for detecting similar microblog accounts based on graph-analysis clustering, characterized in that:

S1. because malicious accounts created in batches are highly similar to one another, the problem of identifying malicious accounts is converted into a user-similarity calculation problem, a set of similar accounts is obtained by similarity calculation, and the top N accounts are selected as suspicious malicious accounts;

S2. microblog user information is used to build user profiles by parallel graph computation, the profiles comprising a user-relationship profile and a user-behavior profile, wherein the user-relationship profile includes following and followed-by information and the behavior profile includes comment and repost information, and a user follow-relationship directed graph and a microblog repost-relationship directed graph are built respectively;

S3. users are clustered by their connectivity in the network using the Spark GraphX graph-analysis algorithm Connected Components; users assigned to the same class are generally considered to be more similar to one another, so the users in the same class as the target user are taken as the data set to be tested, which reduces the complexity of calculating similarity among a massive number of users;

S4. a density weight d is introduced, which considers the ratios of the user's follower count m₁, following count m₂ and number of posted microblogs m₃ to the corresponding averages n_i over all users in the class to which the user belongs, where i = 1, 2, 3; the density weight d is obtained from these ratios by a formula (the formula itself is not reproduced in this text), and users with sparse data, that is, users with d < 0.5, are filtered out;

S5. after the users with sparse data have been filtered out, the similarity Sim(u, u') between the target user u and every other user u' in the same class is calculated; the calculation uses the MDUS algorithm, that is, similarity is calculated from multidimensional user information, wherein microblog user information is divided into four dimensions, namely background information, post information, follower and following information, and comment and repost information, whose similarities are calculated with edit distance, tf-idf, LDA and cosine similarity respectively;

S6. the total similarity is calculated by the following formula:

Sim(u, u') = w₁·Sim_background(u, u') + w₂·Sim_text(u, u') + w₃·Sim_fans(u, u') + w₄·Sim_tweet(u, u'),

where w₁ + w₂ + w₃ + w₄ = 1, and the values of w₁, w₂, w₃ and w₄ are derived from the judgment matrix of the analytic hierarchy process (AHP), the weighted similarity being finally calculated by this formula; w₁, w₂, w₃ and w₄ are the similarity weights of the background information, the post information, the follower and following information, and the comment and repost information respectively;

S7. n groups, each consisting of k followers of a well-known Sina user, are collected as the training set for the similarity threshold; the similarity Sim(u, u') between accounts is calculated, the mean and the standard deviation σ of Sim(u, u') are computed for each group, and the similarity threshold μ is obtained from them by a formula (the formula itself is not reproduced in this text);

S8. data for m microblog users are obtained with a web crawler and the information of the target user u is input; if the similarity value of an account exceeds the similarity threshold μ, the account is marked as a suspicious malicious account.