CN107330020B - User entity analysis method based on structure and attribute similarity

Info

Publication number: CN107330020B
Application number: CN201710470266.6A
Authority: CN
Inventors: 徐杰; 刘震; 卢思变; 陈文龙
Original assignee: University of Electronic Science and Technology of China
Current assignee: University of Electronic Science and Technology of China
Priority date: 2017-06-20
Filing date: 2017-06-20
Publication date: 2020-03-24
Anticipated expiration: 2037-06-20
Also published as: CN107330020A

Abstract

The invention discloses a user entity analysis method based on structure and attribute similarity, which combines the friend relationship and user personal data in a social network, namely structure information and attribute information, by analyzing and modeling the social network, and achieves the purpose of user entity analysis across social platforms. In the process of entity analysis, a concept of dynamic threshold is introduced, different thresholds are used in different iterative periods to adapt to the data characteristics under the current condition, and the proportion of attributes and structures is regulated and controlled to obtain a more accurate result.

Description

User entity analysis method based on structure and attribute similarity

Technical Field

The invention belongs to the technical field of entity analysis, and particularly relates to a user entity analysis method based on structure and attribute similarity.

Background

In a data set, the real-world objects to which the data refer are generally called entities. The same entity may have many different representations or descriptions in different data sets, or even within a single data set, and when data sets from different sources are combined for analysis, the descriptions of the same entity are mixed together and cause some degree of duplication. Entity resolution, referred to herein also as entity analysis, is the process of identifying and linking the different descriptions in a data set and determining which descriptions map to the same real-world entity. Entity resolution is an important step in data preprocessing and is mainly used to solve data quality problems such as duplication and redundancy.

With the rapid development of social networks, the application of entity resolution in social networks is receiving increasing attention. Most social network users not only use one social network, but also use a plurality of social networks according to own interests and needs, and information among different social platforms is isolated and not intercommunicated, so that a method for directly identifying the virtual identity of the same user on different platforms is unavailable. The problem of cross-platform entity resolution of social networks is to match and identify accounts belonging to the same user entity on different social platforms, i.e. user identification or account matching. Through the matching of account identities, personalized services to users can be achieved, and some security issues of social networks can also be helped to be solved.

The concept of entity resolution was first proposed in 1959, in an article published in Science by Newcombe et al., which treated entity resolution as a statistical problem and described it from the perspective of probability. A decade later, in 1969, Fellegi and Sunter were the first to formalize the entity resolution problem: they treated it as a classification problem in machine learning, specified a set of symbols and definitions for entity resolution in their article, and thereby created the classical Fellegi-Sunter model. In subsequent research, many researchers improved and supplemented the Fellegi-Sunter model, chiefly Jaro, Winkler, Belin and Rubin, Ravikumar, Larsen, and Sadinle et al. Among them, Winkler did a great deal of work and adopted a Bayesian statistical model to make a series of improvements to the parameter calculation and matching rules of the Fellegi-Sunter model.

Entity resolution research for social networks has developed mainly in recent years. Most researchers approach the problem through the attributes, the structure, and the social content of social networks. The attributes are the personal details of a user, such as avatar, user name, gender, birthday, education background, and location; the structure is the set of friend relationships between accounts in the social network; and the social content is the information, such as text and pictures, that the user generates during social activity, for example blog posts, comments, and geographic locations.

Attribute-based algorithms mainly use the personal profile information of users in the social network, take each item of descriptive information as an attribute of the user, and convert the problem into the matching of attribute fields. For example, Zafarani and Liu proposed user entity resolution algorithms that use the user name and the URL of the user's personal home page, and Goga et al. proposed algorithms suitable for large-scale identification. Structure-based algorithms mainly use the friend relationship information of the social network: they abstract the social network into a graph and perform user entity resolution using information from the graph structure. Narayanan and Shmatikov [13] and Bartunov et al. studied algorithms from this perspective. Algorithms based on social content analyze text style together with information such as time and geographic location to achieve user entity resolution. For example, Almishari and Tsudik proposed a method for identifying users on different social platforms by analyzing the author's writing style, and Goga et al. proposed identifying users by jointly using the geographic location from which the user publishes content, the timestamp, and the writing style of the content.

For algorithms that rely on attribute information, account profiles on social platforms are to some degree missing and inaccurate; such abnormal data affects algorithm performance, and because the effect comes from the data itself it is very difficult to remove. Structure-based algorithms avoid inaccurate profile information, but when a small, tightly knit group exists, several accounts may be fully connected to one another, and distinguishing those accounts becomes very difficult. Structure-based algorithms therefore perform poorly when friend relationships are very dense. Content-based methods, in turn, are inconvenient to use because the relevant data are difficult to acquire and process. The method provided by the invention combines two types of information, attribute information and structure information, and avoids the defects of the individual methods as far as possible.

Disclosure of Invention

The invention aims to overcome the defects of the prior art and provide a user entity analysis method based on structure and attribute similarity, and the problem of cross-platform user entity analysis of a social network is solved by combining information of attributes and structures.

In order to achieve the above object, the present invention provides a method for analyzing a user entity based on structure and attribute similarity, comprising the following steps:

(1) establishing an attribute similarity matrix and an adjacency matrix

According to the attribute similarity between every pair of accounts on social platform A and social platform B, establishing an attribute similarity matrix S^{m×n}, where m and n are the total numbers of accounts on platforms A and B, respectively, and each element of S^{m×n} represents the attribute similarity between the corresponding two accounts;

establishing an adjacency matrix for social platform A and an adjacency matrix for social platform B according to whether the accounts on each platform are in a friend relationship, wherein each row and each column of an adjacency matrix represents an account on that platform, and each element indicates whether the two corresponding accounts are in a friend relationship: the element value is 1 if they are and 0 if they are not;

(2) establishing an association matrix

According to the adjacency matrices and the prior matching pairs, establishing, for each of social platform A and social platform B, an association matrix between the unidentified accounts and the identified accounts, wherein τ represents the number of prior matching pairs, each row of an association matrix represents an unidentified account, each column represents an identified account, and each element indicates whether the unidentified account and the identified account are in a friend relationship: the element value is 1 if they are and 0 if they are not;

(3) establishing a common friend matrix

According to the association matrices and the prior matching pairs, establishing the common friend matrix F^{(m-τ)×(n-τ)} of the unidentified accounts on social platform A and social platform B as the product of the association matrix of platform A and the transpose, denoted (·)^T, of the association matrix of platform B, wherein each row of the common friend matrix represents one unidentified account i on social platform A, each column represents one unidentified account j on social platform B, and element f_ij of the common friend matrix represents the number of common friends that accounts i and j have among the prior matching pairs;

(4) selecting from the common friend matrix the account pairs, each consisting of two unidentified accounts, that correspond to the largest non-zero element, and storing them in an account pair set Q, where Q = {(i, j) | f_ij = max(F^{(m-τ)×(n-τ)})};

(5) taking from the attribute similarity matrix S^{m×n} the attribute similarities of all account pairs in the account pair set Q and storing them in a similarity set S^*, where S^* = {s_ij | s_ij ∈ S^{m×n}, (i, j) ∈ Q};

(6) according to a preset initial threshold, deleting the elements of the similarity set S^* that are lower than the initial threshold, and deleting the corresponding elements from the account pair set Q;

(7) judging whether the account pair set Q is empty, if so, setting the maximum non-zero element in the common friend matrix to be 0, and returning to the step (4); if not, entering the step (8);

(8) extracting the largest element max(S^*) of the similarity set S^*, and selecting from the account pair set Q the account pair (i, j) corresponding to max(S^*); the pair of accounts corresponding to (i, j) is then marked as successfully matched and added to the result set M of the current iteration;

(9) deleting from the account pair set Q the account pair (i, j) that was added to the result set M, together with any account pair that shares an account with it, and deleting the corresponding elements from the similarity set S^*;

(10) judging whether elements exist in the account pair set Q or not, and if the elements exist, returning to the step (8); if not, outputting a result set M;

(11) adding the corresponding account pair in the result set M into the prior matching pair, returning to the step (2), performing the next iteration of the current round, and finishing the iteration of the current round when no new matching pair is output in the result set M;

(12) modifying the initial threshold, returning to step (2), and performing the next round of iteration; the iteration ends when no new matching pair is output in the result set M after the initial threshold has been modified, thereby completing the user entity analysis.
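Steps (2) to (10) above can be sketched in plain Python as a single pass over the data. This is a minimal illustration under stated assumptions, not the patented implementation: the function names (`association_matrix`, `common_friends`, `run_round`), the 0-based account indices, and the nested-list matrices are all hypothetical, and the early return when the common friend matrix holds no non-zero element is an assumption, since the text does not say what happens in that case.

```python
from typing import List, Tuple

Pair = Tuple[int, int]


def association_matrix(adj: List[List[int]], seeds: List[int]):
    """Step (2): rows are unidentified accounts, columns are identified (seed) accounts."""
    seed_set = set(seeds)
    rows = [v for v in range(len(adj)) if v not in seed_set]
    return rows, [[adj[v][s] for s in seeds] for v in rows]


def common_friends(c_a: List[List[int]], c_b: List[List[int]]) -> List[List[int]]:
    """Step (3): F = C_A * C_B^T, so f_ij counts the matched friends shared by i and j."""
    return [[sum(x * y for x, y in zip(ra, rb)) for rb in c_b] for ra in c_a]


def run_round(adj_a, adj_b, sim, seed_pairs: List[Pair], threshold: float) -> List[Pair]:
    """Steps (2)-(10): return the account pairs matched in one iteration."""
    rows_a, c_a = association_matrix(adj_a, [a for a, _ in seed_pairs])
    rows_b, c_b = association_matrix(adj_b, [b for _, b in seed_pairs])
    f = common_friends(c_a, c_b)
    while True:
        best = max((v for row in f for v in row), default=0)
        if best == 0:
            return []  # assumption: stop when no candidate shares a matched friend
        # Step (4): every pair that reaches the current maximum.
        q = [(i, j) for i, row in enumerate(f) for j, v in enumerate(row) if v == best]
        # Steps (5)-(6): drop pairs whose attribute similarity is below the threshold.
        q = [(i, j) for i, j in q if sim[rows_a[i]][rows_b[j]] >= threshold]
        if q:
            break
        # Step (7): Q is empty, so zero the maximum elements and select again.
        for row in f:
            for j, v in enumerate(row):
                if v == best:
                    row[j] = 0
    matched: List[Pair] = []
    while q:  # Steps (8)-(10)
        i, j = max(q, key=lambda p: sim[rows_a[p[0]]][rows_b[p[1]]])
        matched.append((rows_a[i], rows_b[j]))
        # Step (9): remove the matched pair and every pair sharing an account with it.
        q = [(x, y) for x, y in q if x != i and y != j]
    return matched
```

The returned pairs use the original account indices, so they can be appended directly to the seed list before the next call, which is what step (11) describes.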

The object of the invention is achieved as follows:

according to the user entity analysis method based on the structure and attribute similarity, through analysis and modeling of the social network, the friend relationship and the user personal data, namely the structure information and the attribute information, in the social network are combined, and the purpose of user entity analysis across social platforms is achieved. In the process of entity analysis, a concept of dynamic threshold is introduced, different thresholds are used in different iterative periods to adapt to the data characteristics under the current condition, and the proportion of attributes and structures is regulated and controlled to obtain a more accurate result.

Meanwhile, the user entity analysis method based on the structure and attribute similarity further has the following beneficial effects:

(1) The method combines attribute information and structure information, and avoids the defects of relying on a single kind of information and the resulting loss of accuracy, such as the effects of missing attributes and of dense friend relationships.

(2) A dynamic threshold is introduced: during iteration, the attribute similarity threshold is not constant but changes gradually within a certain range as the number of generated results increases, so as to adapt to the characteristics of different iteration periods. The threshold starts at the upper bound, which yields the most accurate results, and then gradually decreases to admit more true matching pairs whose attribute similarity is not high enough.

(3) The prior matching pairs serve as the starting point of the iteration, and the results of each iteration are fed back into the known conditions. No large amount of known conditions or training data is needed to build a model; the method can be carried out with only a few true matching pairs, which alleviates the problem of insufficient known conditions.
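The dynamic threshold of effect (2) and the feedback of results described in effect (3) can be pictured as two nested loops. The sketch below is only an illustration under assumptions: `resolve`, `run_round`, and `next_threshold` are hypothetical names, the matching pass and the threshold rule are supplied by the caller, and `run_round` is assumed to return only pairs that are not yet known.

```python
from typing import Callable, List, Tuple

Pair = Tuple[int, int]


def resolve(seeds: List[Pair],
            run_round: Callable[[List[Pair], float], List[Pair]],
            next_threshold: Callable[[int], float],
            upper: float) -> List[Pair]:
    """Return every pair matched beyond the seeds.

    run_round(known_pairs, threshold) performs one matching pass and returns
    only new pairs; next_threshold(n_results) gives the threshold to use once
    n_results pairs have been found. Both are hypothetical callables.
    """
    known = list(seeds)          # prior matching pairs, grown by feedback
    results: List[Pair] = []
    threshold = upper            # start from the upper bound
    threshold_modified = False
    while True:
        progressed = False
        while True:              # inner loop: iterate at a fixed threshold
            new_pairs = run_round(known, threshold)
            if not new_pairs:
                break
            known.extend(new_pairs)      # feed results back as known matches
            results.extend(new_pairs)
            progressed = True
        # Stop once a modified threshold produces no new matching pair.
        if threshold_modified and not progressed:
            return results
        threshold = next_threshold(len(results))
        threshold_modified = True
```

Passing the two callables in keeps the loop structure separate from how a single pass or the threshold update is computed, neither of which this sketch tries to fix.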

Drawings

FIG. 1 is a flow chart of a user entity resolution method based on structure and attribute similarity in accordance with the present invention;

FIG. 2 is a block diagram of a friendship structure of two social platforms in an example.

Detailed Description

The following description of the embodiments of the present invention is provided in order to better understand the present invention for those skilled in the art with reference to the accompanying drawings. It is to be expressly noted that in the following description, a detailed description of known functions and designs will be omitted when it may obscure the subject matter of the present invention.

Examples

FIG. 1 is a flow chart of a user entity parsing method based on structure and attribute similarity according to the present invention.

In this embodiment, we first describe the definitions of some names:

A social platform is modeled as an undirected graph in which accounts correspond to nodes and friend relationships between accounts correspond to edges between nodes, i.e. G = {V, E}, where G represents the social platform, V is the set of accounts on the platform, and E is the set of friend relationships on the platform. Friend relationships on social platforms are of two types, one-way and two-way. In theory, a platform with one-way relationships should be abstracted as a directed graph. In this algorithm, however, the friend relationship is a very important basis for user entity analysis, and accounts that follow each other only one way are not close enough to reflect the real friendships of the users who own them. Therefore, on a social platform with one-way connections, the algorithm considers only accounts that follow each other, treats that relationship as equivalent to a two-way friend relationship, and still models the platform as an undirected graph.
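As an illustration of this modeling choice, the following sketch builds a symmetric 0/1 adjacency matrix from a list of one-way "follow" relations and keeps an edge only when both accounts follow each other. The function name `mutual_adjacency` and the input format (account names plus follower/followee tuples) are assumptions made for the example.

```python
from typing import Iterable, List, Tuple


def mutual_adjacency(accounts: List[str],
                     follows: Iterable[Tuple[str, str]]) -> List[List[int]]:
    """Return an undirected adjacency matrix that keeps only mutual follows."""
    index = {name: k for k, name in enumerate(accounts)}
    # Directed edges as index pairs; self-follows are ignored.
    directed = {(index[a], index[b]) for a, b in follows if a != b}
    n = len(accounts)
    adj = [[0] * n for _ in range(n)]
    for a, b in directed:
        if (b, a) in directed:       # both directions present: treat as friends
            adj[a][b] = adj[b][a] = 1
    return adj
```

On a platform whose relationships are already two-way, every relation appears in both directions, so the same function reproduces the ordinary friendship graph.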

The personal profile items owned by each account on a social platform are collectively called the attributes of the node, and each attribute represents one profile item of the account, such as user name, gender, or age. C represents the set of attributes, C = (C₁, C₂, C₃, …), where C_i represents the name of an attribute.

A user entity is defined herein as the real-world owner of an account on a social networking site, i.e. the person who uses the account. The set of user entities is denoted by U = (u₁, u₂, u₃, …).

Assuming that the two social platforms are A and B, consider two accounts belonging to the two platforms respectively, account i on platform A and account j on platform B. If they point to the same entity, i.e. they are owned by the same person in the real world, the two accounts are said to match, expressed as M_{A,B}(i, j); otherwise, if they do not match, this is expressed as UM_{A,B}(i, j). If the user entity corresponding to the matched accounts in the real world is u_k, this can be expressed as:

Before account matching, a portion of account pairs already known to be correct matches is typically required; these are commonly called prior account matching pairs, or seed matching pairs. In practice, obtaining prior account matching pairs is difficult. Apart from manually labeling some prior matching pairs, the main method is to find a unique identifier that can determine the user entity, such as an e-mail address, a bound account, or an IP address, but such information is generally hard to obtain, so an alternative method needs to be considered.

When no information is available that directly determines the prior matching pairs, the attribute similarity of the accounts may be used to obtain them. Because the algorithm requires only a small number of prior matching pairs, a portion of the account pairs with the highest attribute similarity can be selected first, and from these the pairs whose accounts have the most friends can be chosen as the prior matching pairs, to ensure their importance in the network, before the algorithm is executed. Although matching pairs selected in this way are not guaranteed to point to the same entity, this alleviates to some extent the problem that prior matching pairs are difficult to obtain.
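One possible reading of this seed selection is sketched below. It is an assumption-laden illustration rather than the procedure of the invention: the name `select_seeds`, the parameters `n_candidates` and `n_seeds`, the one-to-one greedy choice of candidates, and the use of the summed friend counts of both accounts as the ranking key are all hypothetical, since the text does not fix these details.

```python
from typing import List, Tuple

Pair = Tuple[int, int]


def select_seeds(sim: List[List[float]],
                 adj_a: List[List[int]],
                 adj_b: List[List[int]],
                 n_candidates: int,
                 n_seeds: int) -> List[Pair]:
    """Pick seed pairs: most similar pairs first, then the best-connected among them."""
    ranked = sorted(
        ((sim[i][j], i, j) for i in range(len(sim)) for j in range(len(sim[0]))),
        reverse=True,
    )
    used_a, used_b = set(), set()
    candidates: List[Pair] = []
    for _, i, j in ranked:
        if len(candidates) >= n_candidates:
            break
        if i in used_a or j in used_b:
            continue                 # keep the candidate pairs one-to-one
        candidates.append((i, j))
        used_a.add(i)
        used_b.add(j)

    def friend_count(pair: Pair) -> int:
        # Assumed importance measure: friends of both accounts added together.
        return sum(adj_a[pair[0]]) + sum(adj_b[pair[1]])

    return sorted(candidates, key=friend_count, reverse=True)[:n_seeds]
```

Because such seeds may be wrong, a small `n_candidates` keeps the selection restricted to the pairs whose attribute similarity is highest.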

Referring to fig. 1, a detailed description is provided below of a user entity analysis method based on structural attribute similarity, which specifically includes the following steps:

S1, establishing the attribute similarity matrix and the adjacency matrices

Firstly, an attribute similarity matrix S^{m×n} is constructed according to the attribute similarity between every pair of accounts on social platform A and social platform B; m and n are both 7 in this example. The attribute similarity matrix is given directly here, with the portions belonging to the prior matching pairs removed, and is presented in tabular form in Table 1.

Table 1. Main part of the attribute similarity matrix.

Next, according to the friendship structures of the two platforms shown in FIG. 2(a) and FIG. 2(b), the adjacency matrices of social platform A and social platform B are established, respectively as follows:

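S2, establishing the association matrices

According to the adjacency matrices and the prior matching pairs, the association matrices between the unidentified accounts and the identified accounts on social platform A and social platform B are established. In this embodiment, the prior matching pairs are the two pairs (1,1) and (2,2), represented by solid nodes in FIG. 2; the association matrices are then respectively: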

S3, establishing the common friend matrix

According to the association matrices and the prior matching pairs, the common friend matrix of the unidentified accounts on social platform A and social platform B is established;

S4, selecting from the common friend matrix the account pairs, each consisting of two unidentified accounts, that correspond to the largest non-zero element, and storing them in an account pair set Q, where Q = {(i, j) | f_ij = max(F^{(m-τ)×(n-τ)})};

S5, taking from the attribute similarity matrix S^{m×n} the attribute similarities of all account pairs in the account pair set Q and storing them in a similarity set S^*, where S^* = {s_ij | s_ij ∈ S^{m×n}, (i, j) ∈ Q};

S6, setting an initial threshold: in this embodiment, the upper and lower bounds of the threshold are set to 0.8 and 0.2, respectively, so the initial threshold is 0.8. The elements of the similarity set S^* that are below the initial threshold are deleted, and the corresponding elements of the account pair set Q are deleted at the same time, at which point Q = {(3,3), (4,4)} and S^* = {0.85, 1};

S7, judging whether the account pair set Q is empty, if so, setting the maximum non-zero element in the common friend matrix to 0, and returning to the step S4; if not, go to step S8;

S8, extracting the largest element max(S^*) of the similarity set S^*, i.e. 1, and selecting from the account pair set Q the account pair (4,4) corresponding to it; the pair of accounts corresponding to (4,4) is then marked as successfully matched and added to the result set M of the current iteration;

S9, deleting from the account pair set Q the account pair (4,4) that was added to the result set M, together with any account pair that shares an account with (4,4), and deleting the corresponding elements from the similarity set S^*, at which point Q = {(3,3)} and S^* = {0.85};

S10, judging whether elements still exist in the account pair set Q; if so, returning to step S8 and repeating; if not, this iteration ends with (3,3) and (4,4) in the result set M, and the result set M is output;

S11, adding the account pairs in the result set M to the prior matching pairs, returning to step S2, reconstructing the association matrices and the common friend matrix, and performing the next iteration of the current round; these steps are repeated, and the current round ends when no new result is generated, at which point the three account pairs (3,3), (4,4) and (5,5) have been added to the result set M;

S12, modifying the initial threshold, returning to step S2, and performing the next round of iteration;

The formula for modifying the threshold is:

where th denotes the modified threshold, th_u and th_l denote the upper and lower bounds of the initial threshold, |M_c| represents the number of matching pairs in the current result set M, min(N_A, N_B) represents the smaller of the numbers of accounts on social platforms A and B, and τ represents the number of prior matching pairs.

According to the formula, the threshold used in the next iteration is:

The three account pairs in M are added to the prior matching pairs, and the method returns to step S2 and performs the subsequent steps with the new threshold.

The second round adds the account pair (6,6) to the result set, and a further iteration produces no result. After the threshold is changed to 0.32, another iteration is executed without any result, so the iteration ends, and four account matching results, (3,3), (4,4), (5,5) and (6,6), are finally produced.
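The threshold formula itself is not reproduced in this text. One form that is consistent with the symbols defined for it, and that yields the value 0.32 reported in this example once four pairs have been matched, is a linear decrease from the upper bound to the lower bound: th = th_u - (th_u - th_l) × |M_c| / (min(N_A, N_B) - τ). The sketch below implements that assumed form; the function name `next_threshold`, the clamp at the lower bound, and the handling of a zero denominator are additions for illustration, not part of the source.

```python
def next_threshold(th_u: float, th_l: float, n_matched: int,
                   n_a: int, n_b: int, tau: int) -> float:
    """Assumed linear threshold update, decreasing from th_u toward th_l.

    n_matched is |M_c|, n_a and n_b are the account counts of the two
    platforms, and tau is the number of prior matching pairs.
    """
    remaining = min(n_a, n_b) - tau      # accounts that could still be matched
    if remaining <= 0:
        return th_l                      # assumption: nothing left to match
    th = th_u - (th_u - th_l) * n_matched / remaining
    return max(th_l, th)                 # assumption: never drop below th_l


# Values of this embodiment: bounds 0.8 and 0.2, 7 accounts per platform, 2 seeds.
print(round(next_threshold(0.8, 0.2, 4, 7, 7, 2), 2))  # 0.32, as reported above
```

Under the same assumption, three matched pairs would give a threshold of 0.44 for the second round; that intermediate value is derived here and is not stated in the surviving text.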

Although illustrative embodiments of the present invention have been described above to help those skilled in the art understand the invention, it should be understood that the invention is not limited to the scope of these embodiments. To those skilled in the art, various changes are apparent so long as they fall within the spirit and scope of the invention as defined by the appended claims, and all inventions that make use of the inventive concept are protected.

Claims

1. A user entity analysis method based on structural attribute similarity is characterized by comprising the following steps:

(1) establishing an attribute similarity matrix and an adjacency matrix

Constructing an attribute similarity matrix S^{m×n} according to the attribute similarity between every pair of accounts on social platform A and social platform B, where m and n are the total numbers of accounts on platforms A and B, respectively, and each element of S^{m×n} represents the attribute similarity between the corresponding two accounts;

And

(2) establishing a correlation matrix

According to a adjacency matrix

(3) establishing a common friend matrix

According to the incidence matrix

Establishing a common friend matrix of unidentified accounts in the social platform A and the social platform B;

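Each column represents an unidentified account on social platform B, and element f_ij of the common friend matrix represents the number of common friends that the corresponding unidentified accounts of social platform A and social platform B have among the prior matching pairs;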

(5) taking from the attribute similarity matrix S^{m×n} the attribute similarities of all account pairs in the account pair set Q and storing them in a similarity set S^*, where S^* = {s_ij | s_ij ∈ S^{m×n}, (i, j) ∈ Q};

(8) extracting the largest element max(S^*) of the similarity set S^*, and selecting from the account pair set Q the account pair (i, j) corresponding to max(S^*), then the pair of accounts corresponding to (i, j)

(12) modifying the initial threshold, returning to step (2), and performing the next round of iteration; the iteration ends when no new matching pair exists in the result set M after the initial threshold has been modified, thereby completing the user entity analysis.

2. The method for analyzing user entity based on structural attribute similarity according to claim 1, wherein the method for modifying the initial threshold comprises: