CN110598129A

CN110598129A - Cross-social network user identity recognition method based on two-stage information entropy

Info

Publication number: CN110598129A
Application number: CN201910865901.XA
Authority: CN
Inventors: 邢玲; 邓凯凯; 高建平; 吴红海; 谢萍; 张明川
Original assignee: Henan University of Science and Technology
Current assignee: Henan University of Science and Technology
Priority date: 2019-09-09
Filing date: 2019-09-09
Publication date: 2019-12-20
Anticipated expiration: 2039-09-09
Also published as: CN110598129B

Abstract

The invention discloses a cross-social network user identity recognition method based on two-level information entropy, which comprises the steps of crawling archive information and behavior information of respective users from two social networks, screening common attributes from the archive information attributes of the two social networks, extracting data corresponding to the common attributes from the archive information of each user, calculating the similarity of the common attributes of the users in the two social networks, extracting characteristic attributes of behaviors from the behavior information of each user, calculating the similarity of the behavior attributes of the users in the two social networks, performing weight distribution based on the two-level information entropy, weighting each attribute to obtain matching scores of the two users, and performing user matching according to the matching scores to obtain a user identity recognition result. The method for distributing the weight based on the two-level information entropy solves the problem of unbalance of multiple attributes of the user in the aspect of weight distribution, and improves the user identity recognition performance.

Description

Cross-social network user identity recognition method based on two-stage information entropy

Technical Field

The invention belongs to the technical field of data mining, and particularly relates to a cross-social-network user identity recognition method based on two-level information entropy.

Background

Social networks provide people with a rich social service. According to statistics, 42% of users have multiple social network accounts at the same time. Because different social networks have respective unique social modes and bring different social services to users, rich social user information is generated. However, the individual social network accounts are isolated and have no direct connection, so that the social information generated by the user account is distributed over multiple social networks. The identification of the user identity across the social networks refers to identifying virtual accounts belonging to the same real user in different social networks. The technical solution can provide comprehensive user information for network recommendation, user modeling and user behavior analysis, and realize full mining of the multisource social network big data.

The core idea of existing research is to use user profile information, network topology information and user behavior information to determine whether a candidate pair of user accounts belongs to the same user. Cross-social-network user identification consists essentially of three parts: user data extraction, data similarity calculation and account matching. First, user data is extracted, mainly with relatively efficient crawler technology that crawls, cleans and stores the data. Second, the similarity between user data is calculated from the extracted data with a similarity function; the greater the similarity, the greater the probability that the different virtual accounts belong to the same user. Finally, the accounts are matched with a matching strategy according to the calculated similarity.

Existing cross-social-network identification methods based on user profile information are vulnerable to forged user data, and users are paying increasing attention to privacy protection, so the recognition effect of such methods is not ideal. Methods based on network topology can obtain the friend relationships of users easily, but the connections between friends are sparse. Methods based on user behavior data identify users from the content they publish and thereby overcome the limitations of the other two kinds of methods. Existing research has also combined user profile information with network structure for identification, but such methods remain subject to the limitations above and do not achieve a good recognition effect.

Disclosure of Invention

The invention aims to overcome the defects of the prior art, provides a cross-social network user identity recognition method based on two-level information entropy, and provides a weight distribution method based on two-level information entropy, so that the problem of unbalance of multiple attributes of a user in the aspect of weight distribution is solved, and the user identity recognition performance is improved.

In order to achieve the purpose, the cross-social network user identity recognition method based on the two-level information entropy comprises the following steps:

S1: Crawl the profile information and behavior information of the users of social network A and social network B, and record the numbers of users in the two social networks as N_A and N_B, respectively;

S2: Screen out the common attributes from the profile information attributes of the two social networks, extract the data corresponding to the common attributes from the profile information of each user, and then calculate the similarity of the m-th common attribute between each user i in social network A and each user j in social network B, where i = 1,2,…,N_A, j = 1,2,…,N_B, m = 1,2,…,M, and M denotes the number of common attributes;

S3: Extract the data of N preset characteristic attributes from the behavior information of each user, and then calculate the similarity of the n-th characteristic attribute between each user i in social network A and each user j in social network B, where n = 1,2,…,N;

S4: Integrate the data of the M common attributes of all users extracted from the profile information and the data of the N characteristic attributes of all users extracted from the behavior information into data of H attributes, where H = M + N; then determine the weights of the H attributes by the entropy weight method and take them as the primary weights z_h of the attributes, h = 1,2,…,H;

Calculate the contribution probability normalization value P_h of each attribute:

Construct the variant weight R_h based on information entropy:

E(P_h) = -P_h log P_h

Calculate the attribute weight W_h based on two-level information entropy:

S5: Using the attribute weights W_h obtained in step S4, calculate the weighted sum of the similarities of the H attributes of each user i in social network A and each user j in social network B as the matching score score_{i,j} of the two users;

S6: Match the users in the two social networks according to the matching scores score_{i,j} of each user i in social network A and each user j in social network B to obtain the user identity recognition result.

The invention discloses a cross-social network user identity recognition method based on two-level information entropy, which comprises the steps of crawling archive information and behavior information of respective users from two social networks, screening common attributes from the archive information attributes of the two social networks, extracting data corresponding to the common attributes from the archive information of each user, calculating the similarity of the common attributes of the users in the two social networks, extracting characteristic attributes of behaviors from the behavior information of each user, calculating the similarity of the behavior attributes of the users in the two social networks, performing weight distribution based on the two-level information entropy, weighting each attribute to obtain matching scores of the two users, and performing user matching according to the matching scores to obtain a user identity recognition result.

The invention integrates two types of information which are most relevant to the user, namely user file information and user behavior information, so that the calculated similarity is more accurate, weight distribution is carried out based on two-stage information entropy, the problem of unbalance of multiple attributes of the user in the aspect of weight distribution is solved, the accuracy of user matching scoring can be improved, and the user identity identification performance is improved.

Drawings

FIG. 1 is a flowchart of an embodiment of a cross-social-network user identity recognition method based on two levels of information entropy;

FIG. 2 is a flowchart of a method for calculating the similarity of common attributes in the present embodiment;

FIG. 3 is a flowchart of a text information feature extraction calculation method based on frequent pattern mining in the present embodiment;

FIG. 4 is a graph comparing accuracy of the weight assignment method and the comparison method according to the present invention;

FIG. 5 is a chart comparing recall rates of the weight assignment method and the comparison method according to the present invention;

FIG. 6 is a comparison graph of F1 scores in the weight assignment method and the comparison method of the present invention;

FIG. 7 is a comparison graph of AUC of the weight assignment method and the comparison method of the present invention in this example;

FIG. 8 is a comparison chart of four evaluation indexes of the user identification method and two comparison methods according to the present invention.

Detailed Description

The following description of the embodiments of the present invention is provided in order to better understand the present invention for those skilled in the art with reference to the accompanying drawings. It is to be expressly noted that in the following description, a detailed description of known functions and designs will be omitted when it may obscure the subject matter of the present invention.

Examples

FIG. 1 is a flowchart of a specific embodiment of a cross-social-network user identity recognition method based on two levels of information entropy. As shown in fig. 1, the method for identifying the user identity across the social network based on the two-level information entropy includes the following specific steps:

S101: Acquiring user data:

Crawl the profile information and behavior information of the users of social network A and social network B, and record the numbers of users in the two social networks as N_A and N_B, respectively. In general, N_A = N_B may be chosen.

S102: calculating the similarity of the user profile information:

Screen out the common attributes from the profile information attributes of the two social networks, extract the data corresponding to the common attributes from the profile information of each user, and then calculate the similarity of the m-th common attribute between each user i in social network A and each user j in social network B, where i = 1,2,…,N_A, j = 1,2,…,N_B, m = 1,2,…,M, and M denotes the number of common attributes.

The user profile information includes a plurality of common attributes (17 in this embodiment), and the data format of each common attribute may differ, so the way of calculating the similarity has to be chosen per attribute according to the actual situation. Fig. 2 is a flowchart of a method for calculating the similarity of common attributes in this embodiment. As shown in fig. 2, the common attribute similarity in this embodiment is calculated in the following steps:

S201: First, judge whether the m-th common attribute is a preset key attribute. A key attribute is an attribute whose data must be consistent for two users to be considered similar; for example, the gender information of two users must both be "male" or both be "female" to indicate similarity between the two users. If the attribute is a key attribute, proceed to step S202; otherwise, proceed to step S203.

S202: determining similarity based on consistency:

Judge whether the m-th common attributes of the two users are consistent; if so, the similarity of the common attribute is set to 1, otherwise it is set to 0.

S203: Judge whether the data of the m-th common attribute can be vectorized. If so, proceed to step S204; otherwise, proceed to step S205.

S204: determining similarity based on cosine similarity:

Vectorize the data of the m-th common attribute of the two users, calculate the cosine similarity between the two vectors, and take it as the similarity of the m-th common attribute of the two users. The cosine similarity is calculated as follows:

$$\cos(A, B) = \frac{\sum_{q=1}^{Q} A_q B_q}{\sqrt{\sum_{q=1}^{Q} A_q^2}\,\sqrt{\sum_{q=1}^{Q} B_q^2}}$$

where A and B denote the vectors formed from the data of the two users, A_q and B_q denote the q-th components of vectors A and B, respectively, q = 1,2,…,Q, and Q denotes the vector dimension.

S205: determining similarity based on the Dice coefficient:

Take the data of the m-th common attribute of the two users as character strings, calculate the Dice coefficient between the two strings, and take it as the similarity of the m-th common attribute of the two users. The Dice coefficient is calculated as follows:

$$\mathrm{Dice}(a, b) = \frac{2 \cdot \mathrm{comm}(a, b)}{\mathrm{len}(a) + \mathrm{len}(b)}$$

where a and b denote the two character strings, comm(a, b) denotes the number of identical characters in a and b, and len() denotes the length of a character string.
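The dispatch in steps S201 to S205 can be sketched in Python as follows. This is a minimal illustration rather than the embodiment's implementation: the function names, the `is_key` flag and the `vectorize` callback are hypothetical, the 1/0 values for key attributes are an assumption, and comm(a, b) is interpreted here as a multiset intersection of characters.

```python
import math
from collections import Counter


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length numeric vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0  # assumption: an all-zero vector is similar to nothing
    return dot / (norm_a * norm_b)


def dice_coefficient(a, b):
    """Dice coefficient 2 * comm(a, b) / (len(a) + len(b)) of two strings."""
    if not a and not b:
        return 1.0  # assumption: two empty strings count as identical
    # assumption: comm() counts shared characters as a multiset intersection
    common = sum((Counter(a) & Counter(b)).values())
    return 2.0 * common / (len(a) + len(b))


def common_attribute_similarity(value_a, value_b, is_key=False, vectorize=None):
    """Choose the similarity measure for one common attribute (S201-S205)."""
    if is_key:
        # S202: key attributes must agree exactly (1/0 values are assumed)
        return 1.0 if value_a == value_b else 0.0
    if vectorize is not None:
        # S204: data that can be vectorized is compared by cosine similarity
        return cosine_similarity(vectorize(value_a), vectorize(value_b))
    # S205: everything else is compared as strings with the Dice coefficient
    return dice_coefficient(str(value_a), str(value_b))
```

For example, `common_attribute_similarity("male", "female", is_key=True)` takes the S202 branch, while a free-text field such as a user name would fall through to the Dice branch.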

S103: calculating the similarity of user behaviors:

Extract the data of N preset characteristic attributes from the behavior information of each user, and then calculate the similarity of the n-th characteristic attribute between each user i in social network A and each user j in social network B, where n = 1,2,…,N.

As for the feature attributes, the category of the feature attributes may be determined according to actual needs, and in the embodiment of the present invention, three feature attributes are adopted: text information features, punctuation features, and state timestamp features. The similarity calculation methods for the three behavior feature attributes are described below.

Text information feature:

First, extract the text information features of each user based on frequent pattern mining to obtain a number of frequent items and the support count of each frequent item, and then calculate the text information feature similarity of the two users with the following formula:

where F denotes a frequent item, the two support terms denote the support counts of frequent item F for user i in social network A and user j in social network B, respectively, and C_F denotes the number of item sets of frequent item F. The "1" is added to the formula to avoid high-frequency terms.

Fig. 3 is a flowchart of a text information feature extraction calculation method based on frequent pattern mining in this embodiment. As shown in fig. 3, the text information feature extraction method based on frequent pattern mining in this embodiment includes the specific steps of:

S301: Text word segmentation:

Perform word segmentation on each piece of text published by each user, take each word obtained by word segmentation as a transaction, and obtain the transaction set T from all the text published by the user.

S302: acquiring a frequent 1 item set:

Traverse all items in the transaction set T and calculate the support of each item to form the candidate 1-item set C_1. Filter out the item sets that do not meet the preset minimum support for 1-item sets to obtain the frequent 1-item set L_1. In this embodiment, the minimum support for 1-item sets is set to 2. Set the item-count parameter k = 1.

S303: Generating the frequent (k+1)-item set:

Join the frequent k-item set L_k with itself (merging its item sets with one another) to obtain the candidate (k+1)-item set C_{k+1}. Filter out the item sets that do not meet the preset minimum support for (k+1)-item sets to obtain the frequent (k+1)-item set L_{k+1}.

S304: Judge whether L_{k+1} is empty. If it is empty, none of the current (k+1)-item sets in C_{k+1} meets the minimum support, item set generation is finished, and the process proceeds to step S306; otherwise, proceed to step S305.

S305: Let k = k + 1 and return to step S303.

S306: Determining text information characteristics:

Obtain the frequent items corresponding to the text published by the current user, together with the support count of each frequent item.
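Steps S302 to S306 follow the usual Apriori-style level-wise search, which can be sketched as below. The sketch makes several assumptions that are not stated in the source: word segmentation has already been done, the words of one post form one transaction, and a single `min_support` threshold is used for every item-set size. The function name is hypothetical.

```python
from itertools import combinations


def frequent_itemsets(transactions, min_support=2):
    """Return {frozenset(items): support_count} for all frequent item sets."""
    transactions = [frozenset(t) for t in transactions]

    def support(itemset):
        # number of transactions that contain every item of the item set
        return sum(1 for t in transactions if itemset <= t)

    # S302: build candidate 1-item sets and keep those meeting min_support
    items = {item for t in transactions for item in t}
    current = {}
    for item in items:
        single = frozenset([item])
        count = support(single)
        if count >= min_support:
            current[single] = count
    result = dict(current)

    k = 1
    while current:  # S304: stop once the latest frequent level is empty
        # S303: join L_k with itself to form candidate (k+1)-item sets
        candidates = {a | b for a, b in combinations(current, 2)
                      if len(a | b) == k + 1}
        current = {}
        for candidate in candidates:
            count = support(candidate)
            if count >= min_support:
                current[candidate] = count
        result.update(current)
        k += 1  # S305
    return result  # S306: frequent items with their support counts
```

With the posts `[["rain", "cold"], ["rain", "cold", "wind"], ["wind"]]` and a minimum support of 2, the three single words and the pair {rain, cold} are frequent, and no 3-item set survives.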

Punctuation features:

From the text published by user i in social network A and user j in social network B, count the proportion of the uses of each punctuation mark in the total number of punctuation marks to form a punctuation vector for each user, and calculate the similarity between the two vectors as the punctuation similarity.
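A minimal sketch of the punctuation feature is given below. The source does not say which punctuation marks are counted or which vector similarity is used, so the `PUNCTUATION` set and the choice of cosine similarity are both assumptions made for illustration.

```python
import math
from collections import Counter

# hypothetical mark set; the source does not list the marks it counts
PUNCTUATION = ".,!?;:"


def punctuation_vector(posts, marks=PUNCTUATION):
    """Share of each punctuation mark among all punctuation marks used."""
    counts = Counter(ch for post in posts for ch in post if ch in marks)
    total = sum(counts.values())
    if total == 0:
        return [0.0] * len(marks)
    return [counts[m] / total for m in marks]


def punctuation_similarity(posts_a, posts_b, marks=PUNCTUATION):
    """Similarity of two users' punctuation vectors (cosine is assumed)."""
    va = punctuation_vector(posts_a, marks)
    vb = punctuation_vector(posts_b, marks)
    dot = sum(x * y for x, y in zip(va, vb))
    norm_a = math.sqrt(sum(x * x for x in va))
    norm_b = math.sqrt(sum(y * y for y in vb))
    if norm_a == 0 or norm_b == 0:
        return 0.0  # assumption: no punctuation means no evidence of similarity
    return dot / (norm_a * norm_b)
```

Because the vectors hold proportions rather than raw counts, a prolific user and an occasional user with the same punctuation habits still score close to 1.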

State timestamp feature:

Divide each day into G time periods, count the average number of status updates of each user in each time period over a preset range of dates, and calculate the state timestamp similarity of user i in social network A and user j in social network B with the following formula:

where V_i^A(g) and V_j^B(g) denote the average numbers of status updates of user i in social network A and user j in social network B in the g-th time period, respectively, and |·| denotes the absolute value.
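The per-period averages V(g) that feed the timestamp similarity can be computed as in the sketch below. It covers only the feature extraction, not the similarity formula itself. The function name, the equal-width periods and the explicit `n_days` argument (the length of the preset date range) are assumptions.

```python
from datetime import datetime


def average_activity(timestamps, n_days, periods=4):
    """Average number of posts per day in each of `periods` equal time slots.

    timestamps: datetime objects of one user's posts within the date range.
    n_days:     number of days in the preset date range (assumed known).
    """
    if n_days <= 0:
        raise ValueError("n_days must be positive")
    if 24 % periods != 0:
        raise ValueError("this sketch requires periods to divide 24")
    slot_hours = 24 // periods
    counts = [0] * periods
    for ts in timestamps:
        counts[ts.hour // slot_hours] += 1  # index g of the period
    return [c / n_days for c in counts]
```

For instance, three posts at 01:00, 13:00 and 02:00 spread over a two-day range with G = 4 give the vector `[1.0, 0.0, 0.5, 0.0]`.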

S104: Weight distribution based on two-level information entropy:

To fuse all the similarities obtained above, a weight has to be assigned to each attribute. To make the weights more reasonable, the invention provides a weight distribution method based on two-level information entropy, with the following specific steps:

Integrate the data of the M common attributes of all users extracted from the profile information and the data of the N characteristic attributes of all users extracted from the behavior information into data of H attributes, where H = M + N; then determine the weights of the H attributes by the entropy weight method and take them as the primary weights z_h of the attributes, h = 1,2,…,H.

The basic idea of the entropy weight method is that the more the values of an index differ, the larger the weight the index receives. The concept of information entropy can therefore be used to solve the weight assignment problem in user identification. The specific form of the entropy weight method may be set as required. In this embodiment, the information entropy E_h of each attribute is calculated first, then the posterior probability p(y_x | x) of each attribute of the user is obtained, and the primary weight of the attribute is calculated as z_h = p(y_x | x) × E_h. In this way, the influence of each attribute on the user identity recognition performance can be captured more accurately.
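For orientation, the textbook form of the entropy weight method is sketched below. It is not the posterior-weighted variant z_h = p(y_x | x) × E_h used in this embodiment, whose inputs are not fully specified here. The sketch assumes a matrix of non-negative attribute values (one row per sample, one column per attribute), and the function name is hypothetical.

```python
import math


def entropy_weights(matrix):
    """Textbook entropy weight method.

    matrix[r][h] is the non-negative value of attribute h for sample r.
    Attributes whose values vary more across samples get larger weights.
    """
    n = len(matrix)
    n_attrs = len(matrix[0])
    divergences = []
    for h in range(n_attrs):
        column = [row[h] for row in matrix]
        total = sum(column)
        if total == 0 or n < 2:
            divergences.append(0.0)  # no information in this attribute
            continue
        probs = [v / total for v in column]
        # normalized Shannon entropy of the column, in [0, 1]
        entropy = -sum(p * math.log(p) for p in probs if p > 0) / math.log(n)
        divergences.append(1.0 - entropy)
    total_divergence = sum(divergences)
    if total_divergence == 0:
        return [1.0 / n_attrs] * n_attrs  # fall back to uniform weights
    return [d / total_divergence for d in divergences]
```

A column whose values are identical for every sample has maximal entropy and therefore receives weight 0, which matches the idea that an attribute that never differs cannot help tell users apart.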

The output of Softmax characterizes the relative probability between different classes, so the invention uses the concept of Softmax to perform a secondary weight assignment on the attributes of the user. After the primary weights of the user attributes are obtained, the weights of all attributes are combined into an array Z = (z_1, z_2, …, z_H) as input, and the contribution probability normalization value P_h of each attribute is obtained with Softmax. The calculation formula is as follows:

$$P_h = \frac{e^{z_h}}{\sum_{k=1}^{H} e^{z_k}}$$

where P_h denotes the contribution probability normalization value of the h-th attribute, its value range is [0, 1], Σ_h P_h = 1, and e denotes the natural constant.
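The Softmax normalization of the primary weights can be sketched as follows. Subtracting the maximum before exponentiating is a standard numerical-stability step added here, not something the source describes; it does not change the result.

```python
import math


def softmax(z):
    """Map primary weights z_1..z_H to values P_h in [0, 1] that sum to 1."""
    shift = max(z)  # stability: exp() of large inputs would overflow
    exps = [math.exp(v - shift) for v in z]
    total = sum(exps)
    return [e / total for e in exps]
```

The ordering of the primary weights is preserved: a larger z_h always yields a larger P_h.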

The concept of information entropy is then used again to construct the variant weight R_h. The calculation formula is as follows:

E(P_h) = -P_h log P_h

Finally, the user attribute weight distribution based on two-level information entropy is obtained, that is, the attribute weight W_h is calculated as follows:

By assigning weights to each attribute of the user and calculating the variance of the weights produced by the different weight distribution methods, it can be seen that the weights produced by the method of the invention are more discriminative.

S105: Similarity fusion:

Using the attribute weights W_h obtained in step S104, calculate the weighted sum of the similarities of the H attributes of each user i in social network A and each user j in social network B as the matching score score_{i,j} of the two users:

$$\mathrm{score}_{i,j} = \sum_{h=1}^{H} W_h \, s_{i,j}^{h}$$

where W_h denotes the weight of the h-th of the H attributes, and s_{i,j}^{h} denotes the similarity of the h-th attribute between user i in social network A and user j in social network B.
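The similarity fusion is a plain weighted sum, sketched below. The nested-list layout of `sims` (one list of H attribute similarities per user pair) and both function names are illustrative assumptions.

```python
def matching_score(weights, similarities):
    """Weighted sum of the H attribute similarities of one user pair."""
    if len(weights) != len(similarities):
        raise ValueError("need exactly one similarity per attribute weight")
    return sum(w * s for w, s in zip(weights, similarities))


def score_matrix(weights, sims):
    """sims[i][j] holds the H similarities of user i in A and user j in B."""
    return [[matching_score(weights, pair) for pair in row] for row in sims]
```

If the weights sum to 1 and every similarity lies in [0, 1], each score also lies in [0, 1], which makes scores comparable across user pairs.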

The matching score score_{i,j} is used to determine whether the two social accounts belong to the same real-world user.

S106: matching users:

Match the users in the two social networks according to the matching scores score_{i,j} of each user i in social network A and each user j in social network B to obtain the user identity recognition result.

In this embodiment, a bidirectional stable marriage matching algorithm is used for user matching. The specific method is as follows: select the users i in social network A in turn, and initialize the set λ_i of users to be matched with user i to the set of all users in social network B. From λ_i, select the user j with the highest matching score with user i. If user j has not been matched with any other user in social network A, match user j with user i. If user j has already been matched with another user i' in social network A, compare the two scores: if the matching score of user i and user j is higher than that of user i' and user j, match user i with user j and delete the matching result of user i'; otherwise, delete user j from λ_i and select again, from the reduced λ_i, the user with the highest matching score with user i, until the matching user of user i in social network B is determined.
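The matching procedure can be sketched as follows. The text says that the matching result of a displaced user i' is deleted but does not say what happens to i' afterwards. The sketch assumes, as in the usual Gale-Shapley procedure, that i' is queued again and continues with its remaining candidates. The function name and the score-matrix layout are hypothetical.

```python
from collections import deque


def match_users(score):
    """score[i][j]: matching score of user i in A and user j in B.

    Returns a dict {i: j} of matched pairs.
    """
    n_a = len(score)
    n_b = len(score[0]) if n_a else 0
    candidates = [set(range(n_b)) for _ in range(n_a)]  # one set per user i
    partner_of_b = {}                                   # j -> matched i
    queue = deque(range(n_a))                           # users of A in order
    while queue:
        i = queue.popleft()
        while candidates[i]:
            # best remaining candidate of user i
            j = max(candidates[i], key=lambda b: score[i][b])
            rival = partner_of_b.get(j)
            if rival is None:
                partner_of_b[j] = i          # j is free: match directly
                break
            if score[i][j] > score[rival][j]:
                partner_of_b[j] = i          # i displaces the earlier match
                candidates[rival].discard(j)
                queue.append(rival)          # assumption: rival tries again
                break
            candidates[i].discard(j)         # j keeps its current partner
    return {i: j for j, i in partner_of_b.items()}
```

Each rejection or displacement removes one candidate from some user's set, so the loop terminates after at most N_A × N_B removals.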

To illustrate the technical effects of the invention, it was verified experimentally on a specific example. User data from two social networks, Facebook and Twitter, were selected for cross-social-network user identity recognition, and precision, recall, F-measure (F1) and AUC (Area Under Curve) were adopted as evaluation metrics.

AUC is the area under the ROC curve. False Positive Rate (FPR) is defined as X-axis and True Positive Rate (TPR) is defined as Y-axis. Because the result of the invention is divided into two categories, namely the same entity user and different entity users, the AUC can also be used for evaluating the quality of the identification result.

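The metrics are computed from the confusion counts in the standard way:

$$\mathrm{precision} = \frac{TP}{TP + FP}, \quad \mathrm{recall} = TPR = \frac{TP}{TP + FN}, \quad F1 = \frac{2 \cdot \mathrm{precision} \cdot \mathrm{recall}}{\mathrm{precision} + \mathrm{recall}}, \quad FPR = \frac{FP}{FP + TN}$$

where TP denotes matching pairs predicted positive that are actually positive, TN denotes matching pairs predicted negative that are actually negative, FP denotes matching pairs predicted positive that are actually negative, and FN denotes matching pairs predicted negative that are actually positive.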
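Precision, recall and F1 follow directly from these counts, as in the short sketch below. Returning 0.0 when a denominator is zero is a convention chosen here for illustration, not something the source specifies.

```python
def precision_recall_f1(tp, fp, fn):
    """Precision, recall and F1 from true/false positive and negative counts."""
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    if precision + recall == 0:
        return precision, recall, 0.0
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1
```

TN does not enter any of the three values; it is needed only for the false positive rate on the ROC curve.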

To illustrate the effectiveness of the weight distribution method based on two-level information entropy (TIW), it is compared with two other methods: an empirical-probability-based weight distribution method (EW) and a posterior-probability-based weight distribution method (PW). FIG. 4 is a graph comparing accuracy of the weight assignment method and the comparison method according to the present invention. FIG. 5 is a chart comparing recall rates of the weight assignment method and the comparison method according to the present invention. Fig. 6 is a comparison graph of F1 scores in the weight assignment method and the comparison method of the present invention in this embodiment. FIG. 7 is a comparison graph of AUC of the weight assignment method and the comparison method of the present invention in this example. As can be seen from fig. 4 to 7, the invention outperforms the other two methods on every evaluation metric. As the number of users increases, the metrics of all three methods decline to some extent, because with more user accounts it becomes more likely that accounts are highly similar without belonging to the same real user, and each such case affects the final matching result negatively. The decline of the invention is small, whereas the decline of the other two methods is relatively fast. Compared with the other two methods, the invention therefore performs better in cross-social-network user identification.

Next, the user identity recognition method of the invention (TIW-UI), which combines weight distribution based on two-level information entropy with user matching based on the bidirectional stable marriage matching algorithm, is compared with a random forest confirmation algorithm based on stable marriage matching (RFCA-SMM) and a ranking-based cross matching method (RCM). Fig. 8 is a comparison chart of four evaluation indexes of the user identification method and two comparison methods according to the present invention. As shown in FIG. 8, the invention is superior to RFCA-SMM and RCM in accuracy, recall, F1 score and AUC, which also demonstrates the effectiveness of the invention.

Although illustrative embodiments of the present invention have been described above to help those skilled in the art understand the invention, it should be understood that the invention is not limited to the scope of these embodiments. Various changes will be apparent to those skilled in the art, and as long as they fall within the spirit and scope of the invention as defined by the appended claims, everything that makes use of the inventive concept is protected.

Claims

1. A cross-social network user identity recognition method based on two-level information entropy is characterized by comprising the following steps:

S3: Extract the data of N preset characteristic attributes from the behavior information of each user, and then calculate the similarity of the n-th characteristic attribute between each user i in social network A and each user j in social network B;

Construct the variant weight R_h based on information entropy:

E(P_h) = -P_h log P_h

Calculate the attribute weight W_h based on two-level information entropy:

2. The cross-social-network user identity recognition method based on two-level information entropy according to claim 1, wherein the similarity of the common attributes is calculated as follows:

S2.1: First, judge whether the m-th common attribute is a preset key attribute; if so, proceed to step S2.2, otherwise proceed to step S2.3;

S2.2: Judge whether the m-th common attributes of the two users are consistent; if so, the similarity of the common attribute is set to 1, otherwise it is set to 0;

S2.3: Judge whether the data of the m-th common attribute can be vectorized; if so, proceed to step S2.4, otherwise proceed to step S2.5;

S2.4: Vectorize the data of the m-th common attribute of the two users, calculate the cosine similarity between the two vectors, and take it as the similarity of the m-th common attribute of the two users;

S2.5: Take the data of the m-th common attribute of the two users as character strings, calculate the Dice coefficient between the two strings, and take it as the similarity of the m-th common attribute of the two users.

3. The method for identifying the user identity across the social network based on the two-level information entropy of claim 1, wherein the feature attributes in the step S3 include a text information feature, a punctuation mark feature and a state timestamp feature, and the similarity calculation methods thereof respectively are as follows:

for the text information feature, first extract the text information features of each user based on frequent pattern mining to obtain a number of frequent items and the support count of each frequent item, and then calculate the text information feature similarity of the two users with the following formula:

where F denotes a frequent item, the two support terms denote the support counts of frequent item F for user i in social network A and user j in social network B, respectively, and C_F denotes the number of item sets of frequent item F;

for the punctuation mark feature, count, from the text published by user i in social network A and user j in social network B, the proportion of the uses of each punctuation mark in the total number of punctuation marks to form a punctuation vector for each user, and calculate the similarity between the two vectors as the punctuation similarity;

for the state timestamp feature, divide each day into G time periods, count the average number of status updates of each user in each time period over a preset range of dates, and calculate the state timestamp similarity of user i in social network A and user j in social network B with the following formula:

where V_i^A(g) and V_j^B(g) denote the average numbers of status updates of user i in social network A and user j in social network B in the g-th time period, respectively.

4. The cross-social-network user identity recognition method based on two-level information entropy according to claim 1, wherein the user matching in step S6 adopts a bidirectional stable marriage matching algorithm, and the specific method is as follows: select the users i in social network A in turn, and initialize the set λ_i of users to be matched with user i to the set of all users in social network B; from λ_i, select the user j with the highest matching score with user i, and match user j with user i if user j has not been matched with any other user in social network A; if user j has already been matched with another user i' in social network A, then, if the matching score of user i and user j is higher than that of user i' and user j, match user i with user j and delete the matching result of user i'; otherwise, delete user j from λ_i and select again, from the reduced λ_i, the user with the highest matching score with user i, until the matching user of user i in social network B is determined.