CN110598129B

CN110598129B - Cross-social network user identity recognition method based on two-stage information entropy

Info

Publication number: CN110598129B
Application number: CN201910865901.XA
Authority: CN
Inventors: 邢玲; 邓凯凯; 高建平; 吴红海; 谢萍; 张明川
Original assignee: Henan University of Science and Technology
Current assignee: Henan University of Science and Technology
Priority date: 2019-09-09
Filing date: 2019-09-09
Publication date: 2022-10-18
Anticipated expiration: 2039-09-09
Also published as: CN110598129A

Abstract

The invention discloses a cross-social-network user identity recognition method based on two-level information entropy. Profile information and behavior information of the respective users are crawled from two social networks. Common attributes are screened out from the profile information attributes of the two social networks, the data corresponding to the common attributes is extracted from the profile information of each user, and the similarity of each common attribute between users of the two social networks is calculated. Feature attributes are extracted from the behavior information of each user, and the similarity of each behavior feature attribute between users of the two social networks is calculated. Weights are then assigned based on two-level information entropy, the attribute similarities are weighted to obtain a matching score for each pair of users, and users are matched according to the matching scores to obtain the user identity recognition result. The weight assignment method based on two-level information entropy addresses the imbalance in weight assignment among a user's multiple attributes and improves user identity recognition performance.

Description

Cross-social network user identity recognition method based on two-stage information entropy

Technical Field

The invention belongs to the technical field of data mining, and particularly relates to a cross-social network user identity identification method based on two-level information entropy.

Background

Social networks provide people with a rich social service. According to statistics, 42% of users have multiple social network accounts at the same time. Because different social networks have respective unique social modes and bring different social services to users, rich social user information is generated. However, the individual social network accounts are isolated and have no direct connection, so that the social information generated by the user account is distributed over multiple social networks. The identification of the user identity across the social networks refers to identifying virtual accounts belonging to the same real user in different social networks. The technical solution can provide comprehensive user information for network recommendation, user modeling and user behavior analysis, and realize full mining of the multisource social network big data.

The core idea of the existing related research is to utilize user profile information, network topology information and user behavior information to calculate and analyze whether a user account matching pair is the same user. Cross-social network user identification consists essentially of three parts: user data extraction, data similarity calculation and account matching. The user data is extracted by mainly adopting a relatively efficient crawler technology to crawl, clean and store the data. Secondly, the similarity between the user data is calculated by using the extracted data and the similarity function, and the greater the similarity is, the greater the probability that different virtual accounts belong to the same user is. And finally, carrying out account matching by adopting a related matching strategy according to the calculated similarity.

Existing cross-social-network user identity recognition methods fall into three classes. Methods based on user profile information are exposed to forged user data, and because people pay increasing attention to privacy protection, their recognition effect is limited. Methods based on network topology can obtain a user's friend relationships easily, but the connections between friends are sparse. Methods based on user behavior data identify users from the content they publish and thereby avoid the limitations of the other two classes. Existing research has also combined user profile information with network structure, but such methods remain subject to the limitations above and therefore cannot achieve a good recognition effect.

Disclosure of Invention

The invention aims to overcome the defects of the prior art, provides a cross-social network user identity recognition method based on two-level information entropy, and provides a weight distribution method based on two-level information entropy, so that the problem of unbalance of multiple attributes of a user in the aspect of weight distribution is solved, and the user identity recognition performance is improved.

In order to achieve the purpose, the cross-social network user identity identification method based on the two-level information entropy comprises the following steps:

S1: Crawl the profile information and behavior information of the respective users from social network A and social network B, and denote the numbers of users in the two social networks as N_A and N_B, respectively;

S2: Screen out common attributes from the profile information attributes of the two social networks, extract the data corresponding to the common attributes from the profile information of each user, and then calculate the similarity of each common attribute between each user i in social network A and each user j in social network B, where i = 1, 2, …, N_A, j = 1, 2, …, N_B, m = 1, 2, …, M, and M represents the number of common attributes;

S3: Extract data of N preset feature attributes from the behavior information of each user, and then calculate the similarity of each feature attribute between each user i in social network A and each user j in social network B, where n = 1, 2, …, N;

S4: Integrate the data of the M common attributes of all users extracted from the profile information and the data of the N feature attributes of all users extracted from the behavior information into data of H attributes, H = M + N, and then determine the weights of the H attributes by the entropy weight method as the primary weights z_h of the attributes, h = 1, 2, …, H;

Calculate the contribution probability normalization value P_h of each attribute:

$$P_h = \frac{e^{z_h}}{\sum_{k=1}^{H} e^{z_k}}$$

Construct the variation weight R_h based on information entropy:

$$E(P_h) = -P_h \log P_h$$

Calculate the attribute weight W_h based on two-level information entropy:

S5: Using the attribute weights W_h obtained in step S4, calculate the weighted sum of the similarities of the H attributes between each user i in social network A and each user j in social network B as the matching score score_{i,j} of user i and user j;

S6: Match the users in the two social networks according to the matching scores score_{i,j} between each user i in social network A and each user j in social network B to obtain the user identity recognition result.

The invention discloses a cross-social network user identity recognition method based on two-level information entropy, which comprises the steps of crawling archive information and behavior information of respective users from two social networks, screening common attributes from the archive information attributes of the two social networks, extracting data corresponding to the common attributes from the archive information of each user, calculating the similarity of the common attributes of the users in the two social networks, extracting characteristic attributes of behaviors from the behavior information of each user, calculating the similarity of the behavior attributes of the users in the two social networks, performing weight distribution based on the two-level information entropy, weighting each attribute to obtain matching scores of the two users, and performing user matching according to the matching scores to obtain a user identity recognition result.

The invention integrates two types of information which are most relevant to the user, namely user file information and user behavior information, so that the calculated similarity is more accurate, weight distribution is carried out based on two-stage information entropy, the problem of unbalance of multiple attributes of the user in the aspect of weight distribution is solved, the accuracy of user matching scoring can be improved, and the user identity identification performance is improved.

Drawings

FIG. 1 is a flowchart of an embodiment of a cross-social-network user identity recognition method based on two levels of information entropy;

FIG. 2 is a flowchart of a method for calculating the similarity of common attributes in the present embodiment;

FIG. 3 is a flowchart of a text information feature extraction calculation method based on frequent pattern mining in this embodiment;

FIG. 4 is a graph comparing accuracy of the weight assignment method and the comparison method according to the present invention;

FIG. 5 is a chart comparing recall rates of the weight assignment method and the comparison method according to the present invention;

FIG. 6 is a comparison graph of F1 scores of the weight assignment method and the comparison method of the present invention in this embodiment;

FIG. 7 is a comparison graph of AUC of the weight assignment method and the comparison method of the present invention in this example;

FIG. 8 is a comparison chart of four evaluation indexes of the user identity recognition method of the present invention and two comparison methods.

Detailed Description

Embodiments of the present invention are described below with reference to the accompanying drawings so that those skilled in the art can better understand the invention. Note that in the following description, detailed descriptions of known functions and designs are omitted where they might obscure the subject matter of the present invention.

Examples

FIG. 1 is a flowchart of a specific embodiment of a cross-social-network user identity recognition method based on two levels of information entropy. As shown in FIG. 1, the method for identifying the identity of the user across the social network based on the two-level information entropy comprises the following specific steps:

S101: Acquire user data:

Crawl the profile information and behavior information of the respective users from social network A and social network B, and denote the numbers of users in the two social networks as N_A and N_B, respectively. In general, N_A = N_B may be chosen.

S102: calculating the similarity of the user profile information:

Screen out common attributes from the profile information attributes of the two social networks, extract the data corresponding to the common attributes from the profile information of each user, and then calculate the similarity of each common attribute between each user i in social network A and each user j in social network B, where i = 1, 2, …, N_A, j = 1, 2, …, N_B, m = 1, 2, …, M, and M represents the number of common attributes.

Since the user profile information includes a plurality of common attributes, for example, the user profile information includes 17 common attributes in this embodiment, and the data format corresponding to each common attribute may be different, it is necessary to select different ways to calculate the similarity of the common attributes according to the actual situation. Fig. 2 is a flowchart of a method for calculating the similarity of common attributes in this embodiment. As shown in fig. 2, the specific steps of the common attribute similarity in this embodiment include:

S201: First, judge whether the m-th common attribute is a preset key attribute. A key attribute is an attribute whose data must be identical for two users to be considered similar; for example, the gender of two users must both be "male" or both be "female" for the two users to be similar. If the attribute is a key attribute, proceed to step S202; otherwise, proceed to step S203.

S202: determining similarity based on consistency:

judging whether the m-th common attributes of the two users are consistent, if so, judging the similarity of the common attributes

Otherwise

S203: Judge whether the data of the m-th common attribute can be vectorized; if so, proceed to step S204; otherwise, proceed to step S205.

S204: determining similarity based on cosine similarity:

Vectorize the data of the m-th common attribute of the two users, calculate the cosine similarity between the two resulting vectors, and take the cosine similarity as the similarity of the m-th common attribute of the two users.

The cosine similarity is calculated as follows:

$$\cos(A, B) = \frac{\sum_{q=1}^{Q} A_q B_q}{\sqrt{\sum_{q=1}^{Q} A_q^2}\,\sqrt{\sum_{q=1}^{Q} B_q^2}}$$

where A and B represent the vectors formed from the two pieces of data, A_q and B_q respectively represent the q-th dimension of vectors A and B, q = 1, 2, …, Q, and Q represents the vector dimension.

S205: determining similarity based on the Dice coefficient:

Treat the data of the m-th common attribute of the two users as character strings, calculate the Dice coefficient between the two strings, and take the Dice coefficient as the similarity of the m-th common attribute of the two users.

The Dice coefficient is calculated as follows:

$$\mathrm{Dice}(a, b) = \frac{2 \cdot \mathrm{comm}(a, b)}{\mathrm{len}(a) + \mathrm{len}(b)}$$

where a and b represent the two character strings, comm(a, b) represents the number of identical characters in a and b, and len() represents the length of a character string.
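The decision flow of steps S201 to S205 can be sketched in Python as follows. This is a minimal illustration rather than the patent's implementation: the function names, the `vectorize` callback, the 1.0/0.0 values returned for key attributes, and the reading of comm(a, b) as a character overlap counted with multiplicity are all assumptions introduced here.

```python
import math
from collections import Counter
from typing import Callable, Optional, Sequence


def cosine_similarity(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine similarity of two equal-length vectors (step S204)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0  # assumption: a zero vector is similar to nothing
    return dot / (norm_a * norm_b)


def dice_coefficient(a: str, b: str) -> float:
    """Dice coefficient of two strings (step S205)."""
    if not a and not b:
        return 1.0  # assumption: two empty strings count as identical
    # Assumption: comm(a, b) counts shared characters with multiplicity.
    common = sum((Counter(a) & Counter(b)).values())
    return 2.0 * common / (len(a) + len(b))


def common_attribute_similarity(
    x,
    y,
    is_key: bool,
    vectorize: Optional[Callable[[object], Sequence[float]]] = None,
) -> float:
    """Dispatch between the three similarity measures (steps S201 to S205)."""
    if is_key:
        # Assumed values: 1.0 when the key attribute is consistent, else 0.0.
        return 1.0 if x == y else 0.0
    if vectorize is not None:
        return cosine_similarity(vectorize(x), vectorize(y))
    return dice_coefficient(str(x), str(y))
```

For example, `dice_coefficient("night", "nacht")` shares the characters n, h and t, giving 2 × 3 / (5 + 5) = 0.6.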

S103: calculating the similarity of user behaviors:

Extract data of N preset feature attributes from the behavior information of each user, and then calculate the similarity of each feature attribute between each user i in social network A and each user j in social network B, where n = 1, 2, …, N.

As for the feature attributes, the category of the feature attributes may be determined according to actual needs, and in the embodiment of the present invention, three feature attributes are adopted: text information features, punctuation features, and state timestamp features. The similarity calculation methods for the three behavior feature attributes are described below.

Text information feature:

First, extract the text information features of each user based on frequent pattern mining to obtain a number of frequent items and the support count of each frequent item; then calculate the text information feature similarity of the two users using the following formula:

where F represents a frequent item, the two support-count terms respectively represent the support counts of frequent item F for user i in social network A and user j in social network B, and C_F represents the number of item sets of frequent item F. The "1" is added in the formula to avoid high frequency terms.

Fig. 3 is a flowchart of a text information feature extraction calculation method based on frequent pattern mining in this embodiment. As shown in fig. 3, the text information feature extraction method based on frequent pattern mining in this embodiment includes the specific steps of:

S301: Text word segmentation:

Perform word segmentation on each piece of text published by the user, take each word obtained by word segmentation as a transaction, and obtain the transaction set T from all the text published by the user.

S302: acquiring a frequent 1 item set:

Traverse all items in the transaction set T and calculate the support of each item to form the 1-item set C_1. Filter out the item sets that do not meet the preset minimum support for 1-item sets to obtain the frequent 1-item set L_1. In this embodiment, the minimum support for 1-item sets is set to 2. Let the item-count parameter k = 1.

S303: Generate the frequent (k+1)-item set:

Join the frequent k-item set L_k with itself (merging its item sets with one another) to obtain the (k+1)-item set C_{k+1}, and filter out the item sets that do not meet the preset minimum support for (k+1)-item sets to obtain the frequent (k+1)-item set L_{k+1}.

S304: Judge whether L_{k+1} is empty. If it is empty, none of the current (k+1)-item sets in C_{k+1} meets the minimum support, so item set generation ends and the process proceeds to step S306; otherwise, proceed to step S305.

S305: let k = k +1, return to step S303.

S306: determining text information characteristics:

Obtain the frequent items corresponding to the text published by the current user, together with the support count of each frequent item.
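Steps S301 to S306 follow the familiar Apriori pattern, which can be sketched in Python as below. Several points are assumptions made only for illustration: each post is treated as one transaction and each word as one item, whitespace splitting stands in for a real word segmenter, a single `min_support` value is used for every item-set size, and the function name `frequent_itemsets` is hypothetical.

```python
from itertools import combinations
from typing import Dict, FrozenSet, Iterable, List


def frequent_itemsets(posts: List[str], min_support: int = 2) -> Dict[FrozenSet[str], int]:
    """Return every frequent item set of one user with its support count."""
    # S301: whitespace splitting stands in for a word segmenter (assumption).
    transactions = [frozenset(post.lower().split()) for post in posts]

    def support(itemset: FrozenSet[str]) -> int:
        # Number of transactions that contain every item of the set.
        return sum(1 for t in transactions if itemset <= t)

    def keep_frequent(candidates: Iterable[FrozenSet[str]]) -> Dict[FrozenSet[str], int]:
        counted = {c: support(c) for c in candidates}
        return {c: s for c, s in counted.items() if s >= min_support}

    # S302: candidate 1-item sets C_1 filtered into frequent 1-item sets L_1.
    current = keep_frequent({frozenset([w]) for t in transactions for w in t})
    result = dict(current)
    k = 1
    # S303 to S305: join L_k with itself, filter, and repeat until L_{k+1} is empty.
    while current:
        candidates = {a | b for a, b in combinations(current, 2) if len(a | b) == k + 1}
        current = keep_frequent(candidates)
        result.update(current)
        k += 1
    # S306: frequent items and their support counts.
    return result
```

With the posts `"good morning world"`, `"good morning"` and `"good night"` and a minimum support of 2, the frequent item sets are {good} with count 3, {morning} with count 2, and {good, morning} with count 2.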

Punctuation features:

From the text published by user i in social network A and user j in social network B, count the proportion of the number of uses of each punctuation mark to the total number of punctuation marks to form a punctuation vector for each user, and calculate the similarity between the two vectors as the punctuation similarity.
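A possible way to build and compare the punctuation vectors is sketched below. The text does not name the vector similarity measure, so cosine similarity is assumed here; restricting the vector to ASCII punctuation (`string.punctuation`) and the function names are likewise assumptions for illustration.

```python
import math
import string
from collections import Counter
from typing import List

# Assumption: only ASCII punctuation marks are counted.
PUNCTUATION = string.punctuation


def punctuation_vector(texts: List[str]) -> List[float]:
    """Share of each punctuation mark among all punctuation marks used."""
    counts = Counter(ch for text in texts for ch in text if ch in PUNCTUATION)
    total = sum(counts.values())
    return [counts[p] / total if total else 0.0 for p in PUNCTUATION]


def punctuation_similarity(texts_a: List[str], texts_b: List[str]) -> float:
    """Cosine similarity of two punctuation vectors (assumed measure)."""
    va, vb = punctuation_vector(texts_a), punctuation_vector(texts_b)
    dot = sum(x * y for x, y in zip(va, vb))
    norm_a = math.sqrt(sum(x * x for x in va))
    norm_b = math.sqrt(sum(y * y for y in vb))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0  # assumption: no punctuation at all means no evidence
    return dot / (norm_a * norm_b)
```

Two users who use the same marks in the same proportions score close to 1, while users with no marks in common score 0.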

State timestamp feature:

Divide each day into G time periods, count the average number of status updates of each user in each time period over a preset range of dates, and calculate the state timestamp similarity of user i in social network A and user j in social network B using the following formula:

where V_i^A(g) and V_j^B(g) respectively represent the average numbers of status updates of user i in social network A and user j in social network B in the g-th time period, and | | denotes the absolute value.
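The per-period averages can be computed as sketched below. Because the similarity formula itself is not reproduced in the text, `activity_similarity` uses a stand-in based on normalized absolute differences; that stand-in, the equal-length periods, and all function names are assumptions for illustration only and should not be read as the patent's formula.

```python
from datetime import datetime
from typing import List


def average_activity(timestamps: List[datetime], num_days: int, g: int = 24) -> List[float]:
    """Average number of status updates per day in each of G equal periods."""
    counts = [0] * g
    seconds_per_period = 86400 / g
    for ts in timestamps:
        second_of_day = ts.hour * 3600 + ts.minute * 60 + ts.second
        index = min(int(second_of_day // seconds_per_period), g - 1)
        counts[index] += 1
    return [c / num_days for c in counts]


def activity_similarity(va: List[float], vb: List[float]) -> float:
    """Stand-in similarity: 1 minus the normalized sum of absolute differences."""
    total = sum(va) + sum(vb)
    if total == 0:
        return 0.0  # assumption: two inactive users give no evidence
    diff = sum(abs(x - y) for x, y in zip(va, vb))
    return 1.0 - diff / total
```

For instance, three posts at 08:30, 08:45 (next day) and 20:00 over two days with G = 4 give the vector `[0.0, 1.0, 0.0, 0.5]`.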

S104: Weight assignment based on two-level information entropy:

To fuse all the similarities obtained above, a weight must be assigned to each attribute. To make the resulting weights more reasonable, the invention provides a weight assignment method based on two-level information entropy, with the following specific steps:

Integrate the data of the M common attributes of all users extracted from the profile information and the data of the N feature attributes of all users extracted from the behavior information into data of H attributes, H = M + N, and then determine the weights of the H attributes by the entropy weight method as the primary weights z_h of the attributes, h = 1, 2, …, H.

The basic idea of the entropy weight method is that the larger the degree of difference of an index, the larger the corresponding weight. The concept of information entropy can therefore be used to solve the weight assignment problem in user identity recognition. The specific form of the entropy weight method may be chosen as required; in this embodiment it is as follows: first, the information entropy E_h of each attribute is calculated; then the posterior probability p(y_x | x) of each attribute of the user is obtained, and the primary weight of the attribute is calculated as z_h = p(y_x | x) × E_h. In this way, the influence of each attribute on user identity recognition performance can be captured more accurately.
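The primary weight z_h = p(y_x | x) × E_h can be illustrated with the sketch below. The text does not state how E_h or the posterior probability is estimated, so two assumptions are made here and flagged in the code: E_h is taken as the Shannon entropy (natural logarithm) of the empirical distribution of an attribute's values, and the posterior probabilities are supplied by the caller. The function names are hypothetical.

```python
import math
from collections import Counter
from typing import Hashable, List, Sequence


def attribute_entropy(values: Sequence[Hashable]) -> float:
    """Assumed E_h: Shannon entropy of the empirical value distribution."""
    n = len(values)
    if n == 0:
        return 0.0
    counts = Counter(values)
    return -sum((c / n) * math.log(c / n) for c in counts.values())


def primary_weights(columns: List[Sequence[Hashable]], posteriors: List[float]) -> List[float]:
    """z_h = p(y_x | x) * E_h, with the posteriors given as inputs (assumption)."""
    if len(columns) != len(posteriors):
        raise ValueError("one posterior probability is needed per attribute")
    return [p * attribute_entropy(col) for col, p in zip(columns, posteriors)]
```

Under these assumptions, an attribute whose value is identical for every user has zero entropy and therefore receives a primary weight of zero.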

The output of Softmax characterizes the relative probabilities of different classes, so the invention uses the Softmax concept to perform a secondary weight assignment on the user's attributes. After the primary weights of the user attributes are obtained, the weight values of all attributes are combined into an array Z = (z_1, z_2, …, z_H) as input, and the contribution probability normalization value P_h of each attribute is obtained using the Softmax concept, calculated as follows:

$$P_h = \frac{e^{z_h}}{\sum_{k=1}^{H} e^{z_k}}$$

where P_h represents the contribution probability normalization value of the h-th attribute, its value range is [0, 1], Σ_h P_h = 1, and e represents the natural constant.
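The Softmax normalization of the primary weights can be sketched as follows. Subtracting the maximum before exponentiating is a common numerical-stability measure that leaves the result unchanged; it and the function name are choices made for this illustration, not details taken from the text.

```python
import math
from typing import List, Sequence


def softmax(z: Sequence[float]) -> List[float]:
    """Map primary weights z_1..z_H to values P_h in [0, 1] that sum to 1."""
    if not z:
        raise ValueError("at least one primary weight is required")
    shift = max(z)  # stability shift; cancels out in the ratio
    exps = [math.exp(v - shift) for v in z]
    total = sum(exps)
    return [e / total for e in exps]
```

Larger primary weights map to larger P_h, and equal primary weights map to equal shares.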

Using the concept of information entropy again, the variation weight R_h is constructed, calculated as follows:

$$E(P_h) = -P_h \log P_h$$

Finally, the user attribute weight assignment based on two-level information entropy is obtained; the attribute weight W_h is calculated as follows:

By assigning weights to each attribute item of the user and calculating the variance of the weights produced by different attribute weight assignment methods, it can be seen that the method of the invention is clearly more discriminative.

S105: Similarity fusion:

Using the attribute weights W_h obtained in step S104, calculate the weighted sum of the similarities of the H attributes between each user i in social network A and each user j in social network B as the matching score score_{i,j} of user i and user j:

where W_h represents the weight of the h-th of the H attributes, and the similarity term represents the similarity of the h-th attribute between user i in social network A and user j in social network B.
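The fusion step is a plain weighted sum, as the short sketch below shows. The nested-list layout `sims[i][j][h]` for the similarity of the h-th attribute between user i and user j, and the function names, are assumptions made for this example.

```python
from typing import List, Sequence


def matching_score(weights: Sequence[float], sims: Sequence[float]) -> float:
    """Weighted sum of the H attribute similarities for one user pair."""
    if len(weights) != len(sims):
        raise ValueError("weights and similarities must both have length H")
    return sum(w * s for w, s in zip(weights, sims))


def score_matrix(weights: Sequence[float], sims: List[List[Sequence[float]]]) -> List[List[float]]:
    """score[i][j] for every user i in network A and user j in network B."""
    return [[matching_score(weights, pair) for pair in row] for row in sims]
```

For example, weights (0.5, 0.3, 0.2) and similarities (1.0, 0.0, 0.5) give a matching score of 0.5 + 0 + 0.1 = 0.6.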

The matching score score_{i,j} is used to determine whether the two social accounts belong to the same real-world user.

S106: user matching:

Match the users in the two social networks according to the matching scores score_{i,j} between each user i in social network A and each user j in social network B to obtain the user identity recognition result.

In this embodiment, a bidirectional stable marriage matching algorithm is used for user matching, as follows: select the users i in social network A in turn, and set the candidate user set λ_i of user i to the set of all users in social network B. From the candidate set λ_i, select the user j with the highest matching score with user i; if user j has not been matched with another user in social network A, match user j with user i. If user j has already been matched with another user i' in social network A and the matching score of user i and user j is higher than that of user i' and user j, match user i with user j and delete the match of user i'; otherwise, delete user j from the candidate set λ_i and re-select from the reduced set λ_i the user with the highest matching score with user i, until the matching user of user i in social network B is determined.
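The matching procedure can be sketched in Python as below. One point is not specified in the text and is assumed here: a user i' whose match is deleted is put back in the queue and tries again with user j removed from its candidate set. The function name and the `score[i][j]` matrix layout are also assumptions.

```python
from collections import deque
from typing import Dict, List


def match_users(score: List[List[float]]) -> Dict[int, int]:
    """Return a mapping from each matched user i in A to a user j in B."""
    n_a = len(score)
    n_b = len(score[0]) if score else 0
    candidates = [set(range(n_b)) for _ in range(n_a)]  # candidate sets lambda_i
    holder_of: Dict[int, int] = {}  # user j in B -> currently matched user i in A
    queue = deque(range(n_a))
    while queue:
        i = queue.popleft()
        while candidates[i]:
            # Candidate in B with the highest matching score for user i.
            j = max(candidates[i], key=lambda c: score[i][c])
            rival = holder_of.get(j)
            if rival is None:
                holder_of[j] = i
                break
            if score[i][j] > score[rival][j]:
                holder_of[j] = i
                candidates[rival].discard(j)
                queue.append(rival)  # assumption: the displaced user retries
                break
            candidates[i].discard(j)  # rejected: drop j and re-select
    return {i: j for j, i in holder_of.items()}
```

With `score = [[0.6, 0.5], [0.9, 0.1]]`, user 0 first takes user 0 of network B, is then displaced by user 1 (0.9 > 0.6), and finally settles on user 1 of network B.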

To illustrate the technical effects of the invention, it was verified experimentally on a specific example. In the experiment, user data from two social networks, Facebook and Twitter, was selected for cross-social-network user identity recognition, and accuracy (precision), recall, F-measure (F1) and AUC (Area Under Curve) were used as evaluation criteria.

AUC is the area under the ROC curve. False Positive Rate (FPR) is defined as X-axis and True Positive Rate (TPR) is defined as Y-axis. Because the result of the invention is divided into two categories, namely the same entity user and different entity users, the AUC can also be used for evaluating the quality of the identification result.

Here, TP denotes the number of matching pairs predicted as positive that are actually positive, TN the number predicted as negative that are actually negative, FP the number predicted as positive that are actually negative, and FN the number predicted as negative that are actually positive.
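Assuming the standard definitions of these criteria (the formulas themselves are not reproduced in the text), precision, recall and F1 can be computed from the counts as sketched below. Returning 0.0 when a denominator is zero is a convention chosen for this example.

```python
from typing import Tuple


def precision_recall_f1(tp: int, fp: int, fn: int) -> Tuple[float, float, float]:
    """Standard precision, recall and F1 from confusion-matrix counts."""
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    denom = precision + recall
    f1 = 2 * precision * recall / denom if denom else 0.0
    return precision, recall, f1
```

For example, 8 true positives, 2 false positives and 2 false negatives give precision, recall and F1 of 0.8 each.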

To illustrate the effectiveness of the weight assignment method based on two-level information entropy (TIW), it was compared with two other methods: an empirical-probability-based weight assignment method (EW) and a posterior-probability-based weight assignment method (PW). FIG. 4 is a graph comparing the accuracy of the weight assignment method of the present invention and the comparison methods in this embodiment. FIG. 5 is a graph comparing their recall rates. FIG. 6 is a graph comparing their F1 scores. FIG. 7 is a graph comparing their AUC. As can be seen from FIGS. 4 to 7, the present invention is superior to the two comparison methods on every evaluation index. As the number of users increases, the evaluation indexes of all three methods decrease to some extent, because with more user accounts there are more cases in which accounts are highly similar but do not belong to the same real-world user, and such cases negatively affect the final matching result. The indexes of the present invention decrease slowly, whereas those of the two comparison methods decrease relatively quickly. The invention therefore performs better than the two comparison methods in cross-social-network user identity recognition.

Next, the user identity recognition method of the invention (TIW-UI), which combines weight assignment based on two-level information entropy with user matching based on the bidirectional stable marriage matching algorithm, was compared with a random forest confirmation algorithm based on stable marriage matching (RFCA-SMM) and a ranking-based cross-matching method (RCM). FIG. 8 is a comparison chart of four evaluation indexes of the user identity recognition method of the present invention and the two comparison methods. As shown in FIG. 8, the present invention outperforms RFCA-SMM and RCM in accuracy, recall, F1 score and AUC, which also demonstrates its effectiveness.

Although illustrative embodiments of the present invention have been described above to help those skilled in the art understand the invention, the invention is not limited to the scope of these embodiments. To those skilled in the art, various changes are apparent as long as they fall within the spirit and scope of the invention as defined and determined by the appended claims, and all inventions and creations that make use of the inventive concept are protected.

Claims

1. A cross-social network user identity recognition method based on two-level information entropy is characterized by comprising the following steps:

S1: Crawl the profile information and behavior information of the respective users from social network A and social network B, and denote the numbers of users in the two social networks as N_A and N_B, respectively;

M represents the number of common attributes;

S4: Integrate the data of the M common attributes of all users extracted from the profile information and the data of the N feature attributes of all users extracted from the behavior information into data of H attributes, H = M + N, and then determine the weights of the H attributes by the entropy weight method as the primary weights z_h of the attributes, h = 1, 2, …, H;

Wherein e represents a natural constant;

Construct the variation weight R_h based on information entropy:

$$E(P_h) = -P_h \log P_h$$

Calculate the attribute weight W_h based on two-level information entropy:

S5: Using the attribute weights W_h obtained in step S4, calculate the weighted sum of the similarities of the H attributes between each user i in social network A and each user j in social network B as the matching score score_{i,j} of user i and user j;

2. The cross-social-network user identity recognition method based on two-level information entropy according to claim 1, wherein the similarity of the common attributes is calculated as follows:

S2.1: First, judge whether the m-th common attribute is a preset key attribute; if so, proceed to step S2.2; otherwise, proceed to step S2.3;

s2.2: judging whether the m-th common attributes of the two users are consistent, if so, determining the similarity of the common attributes

Otherwise

S2.3: judging whether the m-th common attribute data is vectorized, if so, entering a step S2.4, and otherwise, entering a step S2.5;

S2.4: Vectorize the data of the m-th common attribute of the two users, calculate the cosine similarity between the two resulting vectors, and take the cosine similarity as the similarity of the m-th common attribute of the two users;

S2.5: Treat the data of the m-th common attribute of the two users as character strings, calculate the Dice coefficient between the two strings, and take the Dice coefficient as the similarity of the m-th common attribute of the two users.

3. The method for identifying the user identity across the social network based on the two-level information entropy of claim 1, wherein the feature attributes in the step S3 include a text information feature, a punctuation mark feature and a state timestamp feature, and the similarity calculation methods respectively include:

for the text information feature, first extract the text information features of each user based on frequent pattern mining to obtain a number of frequent items and the support count of each frequent item, and then calculate the text information feature similarity of the two users using the following formula:

where F represents a frequent item, the two support-count terms respectively represent the support counts of frequent item F for user i in social network A and user j in social network B, and C_F represents the number of item sets of frequent item F;

for the punctuation mark feature, count, from the text published by user i in social network A and user j in social network B, the proportion of the number of uses of each punctuation mark to the total number of punctuation marks to form a punctuation vector for each user, and calculate the similarity between the two vectors as the punctuation similarity;

for the state timestamp feature, divide each day into G time periods, count the average number of status updates of each user in each time period over a preset range of dates, and calculate the state timestamp similarity of user i in social network A and user j in social network B using the following formula:

where V_i^A(g) and V_j^B(g) respectively represent the average numbers of status updates of user i in social network A and user j in social network B in the g-th time period.

4. The cross-social-network user identity recognition method based on two-level information entropy according to claim 1, wherein a bidirectional stable marriage matching algorithm is used for user matching, as follows: select the users i in social network A in turn, and set the candidate user set λ_i of user i to the set of all users in social network B; from the candidate set λ_i, select the user j with the highest matching score with user i, and if user j has not been matched with another user in social network A, match user j with user i; if user j has already been matched with another user i' in social network A and the matching score of user i and user j is higher than that of user i' and user j, match user i with user j and delete the match of user i'; otherwise, delete user j from the candidate set λ_i and re-select from the reduced set λ_i the user with the highest matching score with user i, until the matching user of user i in social network B is determined.