CN114782209B

CN114782209B - Social network topological graph-based associated user identity recognition method

Info

Publication number: CN114782209B
Application number: CN202210429087.9A
Authority: CN
Inventors: 胡瑞敏; 甄宇; 任灵飞; 吴俊杭; 胡文怡; 李登实
Original assignee: Wuhan University WHU
Current assignee: Wuhan University WHU
Priority date: 2022-04-22
Filing date: 2022-04-22
Publication date: 2024-06-11
Anticipated expiration: 2042-04-22
Also published as: CN114782209A

Abstract

Most current approaches embed each social network into a low-dimensional vector space and then align users in that space. However, because social networks are extremely large and complex, the network embedding process is susceptible to error propagation and to noise from different neighbors. To address this, the invention provides an associated user identity recognition method based on a social network topological graph: it first forms the ego network of each user (that is, it extracts the local network formed by the user's first-order neighbors), then extracts user node sequences by random walk, then learns a low-dimensional vector representation of each user with a natural language model framework, and finally trains a mapping matrix that maps the two social networks into the same feature space for alignment. By using ego networks, the invention avoids the interference caused by high-order neighbors, thereby improving the node embedding result and the association accuracy.

Description

Social network topological graph-based associated user identity recognition method

Technical Field

The invention relates to the technical field of data analysis and mining across multiple social networks, and in particular to an associated user identity recognition method based on a social network topological graph.

Background

Associated user identity recognition aims to find the correspondence between the different identities of the same user on multiple social network platforms. It is a key technology in the analysis and mining of data from multiple social networks, is in wide commercial demand, and has important applications in network security and personalized recommendation.

Most current methods are based on DeepWalk (Perozzi B., Al-Rfou R., Skiena S. DeepWalk: Online learning of social representations[C]//Proceedings of the 20th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining. New York: ACM Press, 2014: 701-710.), which in turn borrows from Word2vec (Mikolov T., Sutskever I., Chen Kai, et al. Distributed representations of words and phrases and their compositionality[C]//Proceedings of the 26th International Conference on Neural Information Processing Systems. Red Hook: Curran Associates Inc., 2013: 3111-3119.). Word2vec is a method for obtaining word vectors in natural language processing; it converts sparse, high-dimensional discrete vectors into relatively dense, low-dimensional continuous vectors. Although it was designed for words, predicting the surrounding words from the vector of the center word, the same idea can be applied to node representations. Because both nodes in a social network and words in natural language follow power-law distributions, DeepWalk applies the word-vector method to social networks. It combines random walks with the Skip-gram model: random walks extract node sequences from the social network, and Skip-gram then produces an embedding vector for each node. However, applied to two networks, this method only yields two separate feature spaces and does not unify them.

Subsequently, in 2020, Zhou et al. proposed the ACCM method (Zhou F, Zhang K, Xie S, et al. Learning to Correlate Accounts Across Online Social Networks: An Embedding-Based Approach[J]. INFORMS Journal on Computing, 2020, 32.), which also extracts node sequences by random walk and then maps the set of node sequences into a feature vector space with the Skip-gram method, yielding a separate feature space for each of the two social networks. To unify the feature spaces, it additionally uses some known matched users as constraints to train a mapping matrix, so that the feature spaces of the two social networks can be projected into the same feature space. Similarity is then measured in the unified feature space, and user identities are associated according to the similarity results. Although this method reduces the matching errors caused by the two social networks having different feature vector spaces, it uses the whole social network during network embedding, so the influence of each node's high-order neighbors is too large. High-order neighbors often do not play a key role for the node and introduce additional noise, so the node embedding result is less accurate and more errors are introduced.

Disclosure of Invention

The invention aims to provide an associated user identity recognition method based on a social network topological graph, in order to solve the technical problem of low recognition accuracy caused by the excessive noise that high-order neighbors (that is, nodes not directly connected to the given node) introduce during node embedding in existing methods.

In order to solve the above technical problem, the invention provides an associated user identity recognition method based on network representation, which comprises the following steps:

S1: acquiring two known social network data sets, wherein the known social network data sets comprise friend relations between users, and the two social network data sets have associated users;

S2: constructing the topological graphs of the social networks $G_1$ and $G_2$ respectively according to the users and friend relations in the social network data sets, wherein each social network topological graph comprises nodes and edges, the nodes represent the users, and the edges represent the friend relations; forming a first-order ego network for each node of $G_1$ and $G_2$ respectively, wherein the first-order ego network graphs of all nodes in the $G_1$ network are combined to form one ego topological graph set, and the first-order ego network graphs of all nodes in the $G_2$ network are combined to form another ego topological graph set;

S3: for each of the two social networks $G_1$ and $G_2$, using its ego topological graph set to form s node sequences from the ego network of each node, wherein the node sequences are extracted by random walk, thereby forming the node sequence sets of the two social networks;

S4: respectively mapping the node sequence sets of the two formed social networks into two feature spaces by using a skip-gram model, and learning the low-dimensional vector representation of the nodes in the mapped feature spaces to obtain the feature vector representation of each node;

S5: training according to the associated users of the two social network data sets to obtain a target feature mapping matrix, mapping the two feature spaces into the same feature space, calculating the similarity between each new node in the social network $G_1$ and each node in the social network $G_2$, and carrying out associated user identity recognition according to the calculated similarity, wherein a new node in the social network $G_1$ is the node obtained by mapping an original node of $G_1$ with the trained target feature mapping matrix.

In one embodiment, the two social network datasets include dataset one and dataset two, step S2 comprising:

S2.1: constructing the topological graph of the social network $G_1$ from dataset one, wherein $G_1$ comprises n nodes $v_1, v_2, \dots, v_n$; starting from node $v_1$ in $G_1$, extracting the node and all of its first-order neighbors, then adding, according to the edges in $G_1$, the edges between the extracted node and its first-order neighbors and the edges among the first-order neighbors, thereby forming the ego network graph $G_{v_1}$ of node $v_1$; repeating this process for nodes $v_2$ to $v_n$ until the ego network graphs of all n nodes are formed, finally forming the ego network set $\{G_{v_1}, G_{v_2}, \dots, G_{v_n}\}$;

S2.2: constructing the topological graph of the social network $G_2$ from dataset two, wherein $G_2$ comprises m nodes $v'_1, v'_2, \dots, v'_m$; starting from node $v'_1$ in $G_2$, extracting the node and all of its first-order neighbors, then adding, according to the edges in $G_2$, the edges between the extracted node and its first-order neighbors and the edges among the first-order neighbors, thereby forming the ego network graph $G_{v'_1}$ of node $v'_1$; repeating this process for nodes $v'_2$ to $v'_m$ until the ego network graphs of all m nodes are formed, finally forming the ego network set $\{G_{v'_1}, G_{v'_2}, \dots, G_{v'_m}\}$.
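As a non-limiting illustration, the ego network construction of step S2 can be sketched in Python as follows. The function names, the dictionary-of-sets graph representation, and the toy edge list are assumptions made for this sketch and do not appear in the original disclosure.

```python
def build_adjacency(edges):
    """Build an undirected adjacency map {node: set(neighbors)} from an edge list."""
    adj = {}
    for a, b in edges:
        adj.setdefault(a, set()).add(b)
        adj.setdefault(b, set()).add(a)
    return adj


def ego_network(adj, center):
    """Return the first-order ego network of `center` as its own adjacency map.

    Kept nodes: the center and its direct neighbors.
    Kept edges: every edge of the full graph whose two endpoints are both kept.
    """
    kept = {center} | adj[center]
    return {node: adj[node] & kept for node in kept}


def ego_network_set(adj):
    """Form one ego network per node, keyed by the center node."""
    return {node: ego_network(adj, node) for node in adj}


# Toy friendship graph (hypothetical data, for illustration only).
edges = [("a", "b"), ("a", "c"), ("b", "c"), ("c", "d"), ("d", "e")]
adj = build_adjacency(edges)
egos = ego_network_set(adj)
```

In this toy graph, the ego network of node "a" keeps only "a", "b", and "c"; node "d", a second-order neighbor of "a", is excluded, which is the noise reduction the method relies on.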

In one embodiment, step S3 includes:

S3.1: using the ego network set formed from $G_1$, starting from node $v_1$, extracting s node sequences by random walk in the corresponding ego network $G_{v_1}$, wherein each sequence begins with node $v_1$ and has length t; repeating this process for the remaining nodes, so that s node sequences are extracted from the ego network of each node, giving $n \times s$ node sequences in total, which are combined into the node sequence set $L_1$ of $G_1$;

S3.2: using the ego network set formed from $G_2$, starting from node $v'_1$, extracting s node sequences by random walk in the corresponding ego network $G_{v'_1}$, wherein each sequence begins with node $v'_1$ and has length t; repeating this process for the remaining nodes, so that s node sequences are extracted from the ego network of each node, giving $m \times s$ node sequences in total, which are combined into the node sequence set $L_2$ of $G_2$.
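As a non-limiting illustration, the random walk extraction of step S3 can be sketched as follows. The sketch assumes that the sequence length t counts nodes (including the starting node) and that each step picks uniformly among the current node's neighbors inside the ego network; the function names, the fixed seed, and the toy ego networks are hypothetical.

```python
import random


def random_walks(ego, start, s, t, rng):
    """Extract s random walks of t nodes each from one ego network.

    Every walk begins at `start`; each next node is drawn uniformly from the
    current node's neighbors inside the ego network.
    """
    walks = []
    for _ in range(s):
        walk = [start]
        while len(walk) < t:
            neighbors = sorted(ego[walk[-1]])  # sorted only for reproducibility
            if not neighbors:                  # isolated node: walk cannot continue
                break
            walk.append(rng.choice(neighbors))
        walks.append(walk)
    return walks


def sequence_set(egos, s, t, seed=0):
    """Collect s walks from every node's ego network (n * s walks in total)."""
    rng = random.Random(seed)
    corpus = []
    for center in sorted(egos):
        corpus.extend(random_walks(egos[center], center, s, t, rng))
    return corpus


# Hypothetical ego networks: each maps a node to its neighbors inside that ego network.
egos = {
    "a": {"a": {"b", "c"}, "b": {"a", "c"}, "c": {"a", "b"}},
    "b": {"b": {"a", "c"}, "a": {"b", "c"}, "c": {"a", "b"}},
    "c": {"c": {"a", "b", "d"}, "a": {"b", "c"}, "b": {"a", "c"}, "d": {"c"}},
    "d": {"d": {"c"}, "c": {"d"}},
}
L1 = sequence_set(egos, s=3, t=5)
```

Because each walk is confined to the ego network of its starting node, no sequence can wander into high-order neighborhoods of that node.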

In one embodiment, S4 comprises:

S4.1: inputting the node sequence set obtained from the ego network set of $G_1$ into a skip-gram model as training data, adjusting the model parameters, and mapping each node into a p-dimensional feature vector, so that the $G_1$ network is finally mapped into a feature space $G_1 = \{u_1, u_2, \dots, u_n\}$ in which each node is represented by its feature vector;

S4.2: inputting the node sequence set obtained from the ego network set of $G_2$ into a skip-gram model as training data, adjusting the model parameters, and mapping each node into a p-dimensional feature vector, so that the $G_2$ network is finally mapped into a feature space $G_2 = \{u'_1, u'_2, \dots, u'_m\}$ in which each node is represented by its feature vector.
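As a non-limiting illustration of what the skip-gram model of step S4 consumes, the sketch below turns node sequences into (center, context) training pairs. It does not train a model; the window size is an assumed hyperparameter that the original text does not specify, and the function name and toy sequences are hypothetical.

```python
def skipgram_pairs(sequences, window):
    """Return the (center, context) node pairs a skip-gram model is trained on.

    For each position in each sequence, every node within `window` positions
    to the left or right is paired with the center node.
    """
    pairs = []
    for seq in sequences:
        for i, center in enumerate(seq):
            lo = max(0, i - window)
            hi = min(len(seq), i + window + 1)
            for j in range(lo, hi):
                if j != i:
                    pairs.append((center, seq[j]))
    return pairs


# Two hypothetical node sequences produced by random walks.
walks = [["a", "b", "c", "a"], ["d", "c", "d", "c"]]
pairs = skipgram_pairs(walks, window=1)
```

A skip-gram model then learns p-dimensional vectors such that each center node's vector is predictive of its context nodes, so nodes that co-occur in walks end up close together.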

In one embodiment, step S5 includes:

S5.1: using the associated users of the two social network data sets from step S1 as the mapping basis, training a target feature mapping matrix, and mapping the vector spaces of the two social networks into the same feature space based on the target feature mapping matrix;

S5.2: mapping the nodes in $G_1$ to $G_2$ according to the target feature mapping matrix to obtain the corresponding new nodes, calculating the similarity between each new node of $G_1$ and each node in $G_2$, and carrying out associated user identification according to the calculated similarity.

In one embodiment, step S5.1 comprises:

Constructing a mapping matrix from the two new feature spaces obtained in step S4, and training it by minimizing the objective function $W^* = \arg\min_W (Y - XW)^T (Y - XW)$ to obtain the final target mapping matrix $W^* = (X^T X)^{-1} X^T Y$, wherein X and Y respectively denote the two new feature spaces, W is the mapping matrix, and $W^*$ is the target mapping matrix.
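For reference, the closed-form least-squares solution can be derived as follows. The derivation assumes that the rows of X and Y are the embedding vectors of the known matched users, that the scalar objective is the trace of $(Y - XW)^T (Y - XW)$ (the squared Frobenius norm of the residual), and that $X^T X$ is invertible; these assumptions are not stated explicitly in the original text.

```latex
\begin{aligned}
J(W) &= \operatorname{tr}\!\left[(Y - XW)^{T}(Y - XW)\right] \\
\frac{\partial J}{\partial W} &= -2\,X^{T}(Y - XW) = 0 \\
X^{T}X\,W^{*} &= X^{T}Y \\
W^{*} &= (X^{T}X)^{-1}X^{T}Y
\end{aligned}
```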

In one embodiment, step S5.2 comprises:

According to the target mapping matrix, each node in $G_1$ is mapped into $G_2$ to obtain a corresponding new node, calculated as:

$$\hat{u}_1 = u_1 W^*$$

where $u_1$ is a node in $G_1$ and $\hat{u}_1$ is the new node corresponding to $u_1$, that is, the image of node $u_1$ of $G_1$ in $G_2$;

calculating the cosine similarity between each new node and each node in the social network $G_2$:

$$\mathrm{sim}(\hat{u}_1, u'_i) = \frac{\hat{u}_1 \cdot u'_i}{\lVert \hat{u}_1 \rVert \, \lVert u'_i \rVert}$$

where $u'_i$ is the i-th node in $G_2$ and $\mathrm{sim}(\hat{u}_1, u'_i)$ denotes the similarity between node $\hat{u}_1$ and $u'_i$;

by comparing the cosine similarity values of $\hat{u}_1$ with each node in the social network $G_2$, the nodes are sorted in descending order, and the top N are taken as the association matching results in the social network $G_2$ for node $u_1$ of the social network $G_1$.
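As a non-limiting illustration, the mapping, cosine similarity, and top-N ranking of step S5.2 can be sketched as follows. The sketch assumes node vectors are row vectors multiplied on the right by the mapping matrix; the function names, the 2-dimensional toy vectors, and the example matrix are hypothetical.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def map_vector(u, W):
    """Multiply row vector u (length p) by matrix W (p x p)."""
    return [sum(u[r] * W[r][c] for r in range(len(u))) for c in range(len(W[0]))]


def top_n_matches(u, W, g2_vectors, n):
    """Map u into the G2 space and return the n most similar G2 nodes."""
    mapped = map_vector(u, W)
    scored = [(cosine(mapped, vec), node) for node, vec in g2_vectors.items()]
    scored.sort(key=lambda item: item[0], reverse=True)  # descending similarity
    return [node for _, node in scored[:n]]


# Hypothetical trained mapping matrix that swaps the two coordinates.
W_star = [[0.0, 1.0], [1.0, 0.0]]
# Hypothetical G2 embeddings.
g2 = {"x": [0.0, 1.0], "y": [1.0, 0.0], "z": [1.0, 1.0]}
matches = top_n_matches([1.0, 0.0], W_star, g2, n=2)
```

Here the G1 vector [1, 0] is mapped to [0, 1], which coincides with node "x" (similarity 1), followed by "z" (similarity about 0.707), so the top-2 candidates are "x" and "z".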

The above technical solutions in the embodiments of the present application at least have one or more of the following technical effects:

In the associated user identity recognition method based on a social network topological graph, after the social network topological graph is built, and in order to obtain a better embedding, the ego network of each user (that is, the local network formed by the user's first-order neighbors) is formed first; a random walk is then used to extract user node sequences; a natural language model framework is then used to learn a low-dimensional vector representation of each user; and finally a mapping matrix is trained to map the two social networks into the same feature space for alignment. By using ego networks, the method avoids the interference caused by high-order neighbors, so that the node embedding result and the association accuracy are both improved.

Drawings

In order to more clearly illustrate the embodiments of the present invention or the technical solutions in the prior art, the drawings that are required in the embodiments or the description of the prior art will be briefly described, and it is obvious that the drawings in the following description are some embodiments of the present invention, and other drawings may be obtained according to these drawings without inventive effort for a person skilled in the art.

Fig. 1 is a flowchart of a method for identifying an associated user identity based on a network representation according to an embodiment of the present invention.

Detailed Description

User alignment across social networks refers to finding users with the same identity among multiple social networks. It has important applications in fields such as link prediction and personalized recommendation, and has research value in the data mining field. Through extensive research and practice, the present inventors have found that most current approaches embed each social network into a low-dimensional vector space and then align users in that space. However, because social networks are extremely large and complex, the network embedding process is susceptible to error propagation and to noise from different neighbors.

Based on this, to obtain a better embedding, the method of the present invention first forms the ego network of each user (that is, it extracts the local network formed by the user's first-order neighbors), then uses random walks to extract user node sequences, then uses a natural language model framework to learn a low-dimensional vector representation of each user, and finally trains a mapping matrix to map the two social networks into the same feature space for alignment. By using ego networks, the invention avoids the interference caused by high-order neighbors, thereby improving the node embedding result and the association accuracy.

For the purpose of making the objects, technical solutions and advantages of the embodiments of the present invention more apparent, the technical solutions of the embodiments of the present invention will be clearly and completely described below with reference to the accompanying drawings in the embodiments of the present invention, and it is apparent that the described embodiments are some embodiments of the present invention, but not all embodiments of the present invention. All other embodiments, which can be made by those skilled in the art based on the embodiments of the invention without making any inventive effort, are intended to be within the scope of the invention.

The embodiment of the invention provides an associated user identity recognition method based on a social network topological graph, which comprises the following steps:

Referring to Fig. 1, a flowchart of an associated user identity recognition method based on network representation according to an embodiment of the present invention is shown, in which the SG model denotes the skip-gram model.

Specifically, step S1 acquires the data sets. Step S2 forms the social network topological graphs from the given social network data sets and generates the ego network topological graphs by extracting the first-order neighbors of each node. Step S3 uses random walks to extract, from the ego network of each node, sequences containing node structure information; these node sequences form a corpus. Step S4 converts the node sequences in the corpus with the skip-gram model from natural language processing to obtain the representation vectors of the nodes. Step S5 obtains a space mapping matrix from the partially known associated nodes, maps the vector spaces of the two social networks into the same vector space, and computes node similarity using the newly obtained representation vectors in the new space.

In S2, the first-order neighbors of each node are extracted for the two social networks $G_1$ and $G_2$ respectively, forming the ego network sets $\{G_{v_1}, G_{v_2}, \dots, G_{v_n}\}$ and $\{G_{v'_1}, G_{v'_2}, \dots, G_{v'_m}\}$.

In S4, the node sequences are mapped into a vector matrix with the skip-gram model from natural language processing. Ready-made Python packages implement this model, so it only needs to be called with the required parameters adjusted.

In step S5, the present invention trains a mapping matrix W: from the k known matched associated nodes $X = \{u_1, u_2, \dots, u_k\}$ and $Y = \{u'_1, u'_2, \dots, u'_k\}$ it constructs the objective function $W^* = \arg\min_W (Y - XW)^T (Y - XW)$, and finds the final target mapping matrix by minimizing this objective function.

The scheme of the invention can be implemented in computer software so that the process runs automatically.

In one embodiment, step S3 includes:

In a specific implementation, the s node sequences in step S3.1 are denoted $l_1, l_2, \dots, l_s$.

In one embodiment, S4 comprises:

In one embodiment, step S5 includes:

Specifically, after the two feature spaces are formed in step S4, the two spaces are not consistent with each other because they were learned separately, so a feature mapping matrix needs to be trained to map the two feature spaces into one feature space, after which the similarity of each node is calculated.

In one embodiment, step S5.1 comprises:

In one embodiment, step S5.2 comprises:

Where $u'_i$ is the i-th node in $G_2$; i may range over every node in the social network $G_2$, that is, $i = 1, \dots, m$; and $\mathrm{sim}(\hat{u}_1, u'_i)$ denotes the similarity between node $\hat{u}_1$ and $u'_i$;

Specifically, since the two feature spaces (X, Y) of $G_1$ and $G_2$ are obtained separately, they are not the same space and similarity cannot be calculated directly, so a mapping matrix needs to be trained. In particular, the known partially associated nodes in the data set of step S1 can serve as the mapping basis to match the two feature spaces, so that the nodes in the $G_1$ network can be mapped into the $G_2$ network through the mapping matrix W.

For example, suppose the vector spaces $G_1 = \{u_1, u_2, \dots, u_n\}$ and $G_2 = \{u'_1, u'_2, \dots, u'_m\}$ of the two social networks have been obtained, where $u_i$ is the embedding vector of node $v_i$ produced by the skip-gram model, n is the total number of nodes in $G_1$, $u'_i$ is the embedding vector of node $v'_i$ produced by the skip-gram model, and m is the total number of nodes in $G_2$. The k known matched associated nodes $X = \{u_1, u_2, \dots, u_k\}$ and $Y = \{u'_1, u'_2, \dots, u'_k\}$ are selected from the data set of step S1, where the indices 1 to k renumber these k nodes and do not coincide with the previous numbering. A mapping matrix W is then constructed from only the two new vector spaces X and Y so that XW is as close to Y as possible. The final mapping matrix $W^* = (X^T X)^{-1} X^T Y$ is found by minimizing the objective function $W^* = \arg\min_W (Y - XW)^T (Y - XW)$. Finally, each node vector in the social network $G_1 = \{u_1, u_2, \dots, u_n\}$ can be mapped into the social network $G_2$ by the mapping matrix $W^*$; for example, for node $u_1$, the mapped node $\hat{u}_1 = u_1 W^*$ in the social network $G_2$ is obtained, and the remaining nodes are handled similarly.
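As a non-limiting illustration, the mapping matrix can be fitted by solving the least-squares normal equations. The sketch below uses only the Python standard library and assumes that each row of X and Y is the embedding of one known matched pair and that $X^T X$ is invertible; the helper names and the toy anchor vectors are hypothetical.

```python
def transpose(A):
    """Transpose a matrix given as a list of rows."""
    return [list(col) for col in zip(*A)]


def matmul(A, B):
    """Multiply matrix A (m x k) by matrix B (k x n)."""
    Bt = transpose(B)
    return [[sum(a * b for a, b in zip(row, col)) for col in Bt] for row in A]


def solve(A, B):
    """Solve A Z = B for Z by Gauss-Jordan elimination with partial pivoting."""
    n = len(A)
    M = [list(A[i]) + list(B[i]) for i in range(n)]  # augmented matrix [A | B]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(M[r][col]))
        if abs(M[pivot][col]) < 1e-12:
            raise ValueError("X^T X is singular; more anchor pairs are needed")
        M[col], M[pivot] = M[pivot], M[col]
        pv = M[col][col]
        M[col] = [x / pv for x in M[col]]
        for r in range(n):
            if r != col:
                f = M[r][col]
                M[r] = [x - f * y for x, y in zip(M[r], M[col])]
    return [row[n:] for row in M]


def fit_mapping(X, Y):
    """Least-squares mapping: solve (X^T X) W = X^T Y for W."""
    Xt = transpose(X)
    return solve(matmul(Xt, X), matmul(Xt, Y))


# Hypothetical anchors: each row of Y is the matching row of X with coordinates swapped.
X = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
Y = [[0.0, 1.0], [1.0, 0.0], [1.0, 1.0]]
W_star = fit_mapping(X, Y)
```

For these anchors the fitted matrix is the coordinate swap [[0, 1], [1, 0]], so XW reproduces Y exactly. Solving the normal equations directly avoids forming an explicit inverse; in practice a numerical library routine for least squares would typically be preferred.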

After multiplication by $W^*$, the nodes of the $G_1$ network are mapped into the $G_2$ network. Cosine similarity is then calculated between the new vector of each $G_1$ node and the vectors in the $G_2$ network, the nodes in $G_2$ are ranked by similarity, and the several most similar nodes are selected. For example, once the mapped node $\hat{u}_1$ of node $u_1$ is obtained, the cosine similarity $\mathrm{sim}(\hat{u}_1, u'_i)$ between $\hat{u}_1$ and each node in the social network $G_2$ is calculated; the cosine similarity values are then sorted in descending order, and the top N nodes are taken as the association matching results in the social network $G_2$ for node $u_1$ of the social network $G_1$. The other nodes in the social network $G_1$ are handled in the same way.

Finally, the results are compared with the ground truth to obtain an identity association index value. The index is computed from an indicator of whether the true associated user $a'_i$ of user $a_i$ is present among the first N selected predicted users (topN), where $a_1, \dots, a_n$ are the n user nodes in the above step and n is the total number of nodes.
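As a non-limiting illustration, one common form of such an index is the fraction of users whose true counterpart appears among their top N predictions (often called Precision@N). The normalization by the total number of users is an assumption of this sketch, since the exact formula is not reproduced in the text above; the function name and the toy predictions are hypothetical.

```python
def precision_at_n(predictions, truth, n):
    """Fraction of users whose true counterpart is among their top-n predictions.

    predictions: {user in G1: ranked list of candidate users in G2}
    truth:       {user in G1: true associated user in G2}
    """
    hits = sum(1 for user, ranked in predictions.items() if truth[user] in ranked[:n])
    return hits / len(predictions)


# Hypothetical ranked predictions for four G1 users.
predictions = {
    "a1": ["b1", "b3", "b2"],
    "a2": ["b3", "b1", "b2"],
    "a3": ["b3", "b2", "b1"],
    "a4": ["b1", "b2", "b4"],
}
truth = {"a1": "b1", "a2": "b2", "a3": "b3", "a4": "b4"}
score = precision_at_n(predictions, truth, n=2)
```

With N = 2, users "a1" and "a3" are matched correctly and the other two are not, so the index is 0.5; with N = 3 every true counterpart is covered and the index is 1.0.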

The applicant ran the method of this embodiment on a computer with an Intel(R) Core(TM) i5-9500 CPU @ 3.00 GHz and, on the public Foursquare-Twitter data set, compared it with the methods of the following documents: (Tan S, Guan Z, Cai D, Qin X, Bu J, Chen C (2014) Mapping users across networks by manifold alignment on hypergraph. Proc. AAAI Conf. Artificial Intelligence, Quebec City, Canada, 159-165.), (Perozzi B., Al-Rfou R., Skiena S. DeepWalk: Online learning of social representations[C]//Proceedings of the 20th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining. New York: ACM Press, 2014: 701-710.), and (Zhou F, Zhang K, Xie S, et al. Learning to Correlate Accounts Across Online Social Networks: An Embedding-Based Approach[J]. INFORMS Journal on Computing, 2020, 32.). The identity association performance was improved, and the method can be applied in fields such as recommendation systems and network security.

The specific embodiments described herein are offered by way of example only to illustrate the spirit of the invention. Various modifications may be made to the particular embodiments described, or equivalents may be substituted, by those skilled in the art without departing from the spirit of the invention or exceeding the scope of the invention as defined by the appended claims.

Claims

1. The method for identifying the identity of the associated user based on the social network topological graph is characterized by comprising the following steps of:

S5: training according to the associated users of the two social network data sets to obtain a target feature mapping matrix, mapping the two feature spaces into the same feature space, then calculating the similarity between each new node in the social network $G_1$ and each node in the social network $G_2$, and carrying out associated user identity recognition according to the calculated similarity, wherein a new node in the social network $G_1$ is the node obtained by mapping an original node of $G_1$ with the trained target feature mapping matrix;

Wherein, step S5 includes:

S5.2: according to the target feature mapping matrix, mapping the nodes in $G_1$ to $G_2$ to obtain the corresponding new nodes, then calculating the similarity between each new node of $G_1$ and each node in $G_2$, and carrying out associated user identification according to the calculated similarity;

Step S5.1 includes:

Constructing a mapping matrix from the two new feature spaces obtained in step S4, and training it by minimizing the objective function $W^* = \arg\min_W (Y - XW)^T (Y - XW)$ to obtain the final target mapping matrix $W^* = (X^T X)^{-1} X^T Y$, wherein X and Y respectively represent the two new feature spaces, W is the mapping matrix, and $W^*$ is the target mapping matrix;

step S5.2 comprises:

where $u'_i$ is the i-th node in $G_2$, and $\mathrm{sim}(\hat{u}_1, u'_i)$ denotes the similarity between node $\hat{u}_1$ and $u'_i$;

2. The social network topology-based associated user identification method of claim 1, wherein the two social network datasets comprise dataset one and dataset two, step S2 comprising:

3. The social network topology-based associated user identification method of claim 1, wherein step S3 comprises:

4. The social network topology-based associated user identification method of claim 1, wherein S4 comprises: