CN113434782A

CN113434782A - Cross-social network user identity recognition method based on joint embedded learning model

Info

Publication number: CN113434782A
Application number: CN202110718740.9A
Authority: CN
Inventors: 王李冬; 关佶红; 常乐; 曹世华; 胡克用
Original assignee: Qianjiang College of Hangzhou Normal University
Current assignee: Guangzhou Dayu Chuangfu Technology Co ltd
Priority date: 2021-06-28
Filing date: 2021-06-28
Publication date: 2021-09-24
Anticipated expiration: 2041-06-28
Also published as: CN113434782B

Abstract

The invention discloses a cross-social-network user identity recognition method based on a joint embedded learning model. First, candidate user pairs are selected from two social networks using username similarity and network structure. Next, a user-pair network graph (UPG) is constructed with all candidate user pairs as nodes. Then, on the basis of the constructed UPG and the labeled user-pair data, the labels of the labeled matched user pairs, the structure information and the attribute information are fused to build a joint embedded learning model, which is designed as a deep neural network with one input and two outputs. Finally, the loss function of the joint embedded model is minimized using stochastic gradient descent; once training is complete, the learned model parameters are used to predict, for each user pair to be predicted, whether the two accounts belong to the same user. The method can effectively predict whether two users from different networks are the same user, which plays a vital role in commercial cross-social-network applications.

Description

Cross-social network user identity recognition method based on joint embedded learning model

Technical Field

The invention relates to the field of user relationship mining for social networks, and in particular to a cross-social-network user identity recognition method based on a joint embedded learning model.

Background

From early email and BBS to today's social media networks (SMNs), more and more users have become accustomed to interacting and acquiring information on social networks every day. People often need to register as users of different websites in order to enjoy the services those websites provide, so it is common for an ordinary user to own virtual accounts on several different social networking sites. Because each social networking site is independent, data is not shared between sites, and the network lacks a uniform identifier that uniquely identifies a netizen, the accounts that belong to the same netizen on different sites are not directly linked. To obtain a complete profile of a user, the user's data on different social networks must be integrated, which requires associating user identities across social platforms, i.e., identifying the accounts a user holds on multiple social networks. In recent years, social network identity recognition methods based on representation learning have become prevalent, and researchers have begun to identify users across multiple social networks using network-embedding algorithms. However, the following problems remain in realizing cross-social-network user identification based on representation learning:

1. Existing representation-learning-based methods are either supervised or unsupervised. The former requires a large amount of labeled data, which is difficult to obtain and costly in manual effort; the latter requires no labeled data, but its results are often unsatisfactory.

2. The accuracy of user identity recognition can be improved by jointly using multi-modal data such as user attribute information, network structure information and user label information, but how to embed this information into a unified vector space is a difficult problem.

3. Existing representation-learning-based user identity association methods usually split the task into two separate steps, node embedding learning and identity recognition, so the label information of users cannot be effectively integrated.

Disclosure of Invention

The invention aims to overcome the defects of the prior art and provides a cross-social-network user identity association method based on a joint embedding model.

The technical scheme adopted by the invention for solving the technical problem comprises the following steps:

Step 1: for the users of social networks $G^A$ and $G^B$, select candidate user pairs from the two social networks using username similarity and network structure;

Step 2: take every candidate user pair in $P = \{p_i\}$ as a node; if the two users in $p_i$ are respectively neighbors of the two users in $p_j$, an edge exists between $p_i$ and $p_j$; construct the user-pair network graph (UPG) on this principle;

Step 3: on the basis of the constructed UPG and the labeled user-pair data, fuse the labels of the labeled matched user pairs, the structure information and the attribute information to build a joint embedded learning model, designed as a deep neural network with one input and two outputs;

Step 4: minimize the loss function of the joint embedded learning model using stochastic gradient descent; after training, use the model to predict each user pair to be predicted and output whether the two accounts belong to the same user.

Further, step 1 is implemented as follows:

1-1. $G^A = (U^A, E^A, X^A)$ denotes social network A, where $U^A$ is the set of users of social network A, $E^A$ is the set of user relationships of social network A, $X^A$ is the user attribute matrix of social network A, and $u_i^A \in U^A$ denotes user $i$ in social network A; $G^B = (U^B, E^B, X^B)$ denotes social network B, and its parameters have analogous meanings;

1-2. Acquire data from the different social network platforms using a crawler;

1-3. For each pair of users drawn from social networks $G^A$ and $G^B$ respectively, with usernames $n_k$ and $n_j$, compute the similarity according to formula (1), and add every user pair whose similarity is greater than 0.8 to the candidate user pair set $P$;

where $lev(n_k, n_j)$ denotes the Levenshtein distance and $l(n_k)$ denotes the character length of username $n_k$;

1-4. Taking each user pair in $P$ as a seed pair, expand to its neighbor nodes: from the neighbors of the seed pair, select the user pairs that have $r$ common neighbors (known pairs) and add them to $P$; the value of $r$ is set separately for each dataset.
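The neighbor expansion of step 1-4 can be sketched as follows. This is only one possible reading of the step, not the patented procedure: it assumes each network is given as a hypothetical adjacency dictionary (`adj_a`, `adj_b`), that "r common neighbors (known pairs)" means at least $r$ pairs already in $P$ whose two members are neighbors of the two candidate users, and that a single expansion pass is made.

```python
from itertools import product


def expand_candidates(adj_a, adj_b, seeds, r):
    """Return the seed pairs plus neighbour pairs supported by >= r known pairs.

    adj_a, adj_b: dict mapping a user to the set of that user's neighbours
    (hypothetical input format). seeds: iterable of (user_a, user_b) pairs.
    """
    known = set(seeds)
    added = set()
    for a, b in known:
        # Candidate pairs are formed from the neighbours of a seed pair.
        for a2, b2 in product(adj_a.get(a, ()), adj_b.get(b, ())):
            if (a2, b2) in known:
                continue
            # Count known pairs (x, y) with x adjacent to a2 and y adjacent to b2.
            shared = sum(
                1
                for x, y in known
                if x in adj_a.get(a2, ()) and y in adj_b.get(b2, ())
            )
            if shared >= r:
                added.add((a2, b2))
    return known | added


adj_a = {"a1": {"a3"}, "a2": {"a3"}, "a3": {"a1", "a2", "a4"}, "a4": {"a3"}}
adj_b = {"b1": {"b3"}, "b2": {"b3"}, "b3": {"b1", "b2", "b4"}, "b4": {"b3"}}
seeds = {("a1", "b1"), ("a2", "b2")}
expanded = expand_candidates(adj_a, adj_b, seeds, r=2)
```

In this toy input the pair `("a3", "b3")` is adjacent to both seed pairs, so it is added when `r=2` and rejected when `r=3`.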

Further, step 2 is implemented as follows:

2-1. $UPG = (U_{UPG}, E_{UPG})$ denotes the user-pair network graph, where $U_{UPG}$ is the set of nodes and $E_{UPG}$ is the set of relationships between nodes; each candidate user pair $p_i$ is taken as a node of the UPG and denoted $u'_i$, with $u'_i \in U_{UPG}$;

2-2. Suppose $u'_i$ and $u'_j$ are two nodes in the UPG; an edge exists between the two nodes if the two users of $u'_i$ are respectively neighbors of the two users of $u'_j$, that is, if each user of one pair belongs to the set of neighboring nodes of the corresponding user of the other pair.
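The construction of the UPG in step 2 can be illustrated with a short sketch. It assumes undirected friendship graphs stored as hypothetical adjacency dictionaries (`adj_a`, `adj_b`) and reads the edge rule as requiring adjacency in both networks at once; the function name and data layout are illustrative only.

```python
from itertools import combinations


def build_upg(pairs, adj_a, adj_b):
    """Build the user-pair graph: nodes are candidate pairs, edges link pairs
    whose members are neighbours in network A and in network B respectively."""
    nodes = list(pairs)
    edges = set()
    for i, j in combinations(range(len(nodes)), 2):
        (a1, b1), (a2, b2) = nodes[i], nodes[j]
        # Assumed edge rule: a2 is a neighbour of a1 AND b2 is a neighbour of b1.
        if a2 in adj_a.get(a1, ()) and b2 in adj_b.get(b1, ()):
            edges.add((i, j))
    return nodes, edges


pairs = [("a1", "b1"), ("a2", "b2"), ("a3", "b3")]
adj_a = {"a1": {"a2"}, "a2": {"a1", "a3"}, "a3": {"a2"}}
adj_b = {"b1": {"b2"}, "b2": {"b1"}, "b3": set()}
nodes, edges = build_upg(pairs, adj_a, adj_b)
```

Here only the first two pairs are linked: `a2`–`a3` are neighbors in network A, but `b2`–`b3` are not neighbors in network B, so no second edge is created.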

Further, step 3 is implemented as follows:

3-1. Using part of the user attribute information collected by the crawler, mark each user's exact corresponding account in the other network by text analysis and matching techniques combined with manual judgment; the marked matching user pairs serve as the supervision information for model training;

3-2. For the two users in each candidate user pair generated in step 2-1, convert the attributes of the two users into features by one-hot encoding and record the results as the two users' attribute vectors; the attributes comprise username, gender, graduation institution and geographic location;

3-3. Build the joint embedded learning model on the constructed user-pair network graph: concatenate the attribute vectors of the two users in a node, denote the result $d_i$, and use $d_i$ as the input of the joint embedded learning model; the output has a left branch and a right branch; the left branch uses a multilayer perceptron to output the probability that the node label $y_i$ is 0 or 1, where 1 indicates that the two users in the node are the same user and 0 indicates that they are different users; the right branch uses a skip-gram model to output the predicted probability of the context node;

the $m$-th layer of the skip-gram model is represented as:

where $\delta(\cdot)$ denotes the sigmoid function, and $W^m$ and $b^m$ are the weight and bias parameters of layer $m$; formula (4) and formula (5) represent the $(m+1)$-th layers of the left and right branches, respectively, each branch having its own weight and bias parameters at layer $m+1$, and so on for the subsequent layers;

the last layer of the left branch of the model is designed as a softmax layer, and the input of the layer is:

the last layer of the right branch of the model is designed as a softmax layer, and the input of the layer is:

where $k$ denotes the number of hidden layers of the left branch and $k'$ denotes the number of hidden layers of the right branch.
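The attribute encoding of step 3-2 and the concatenation into $d_i$ of step 3-3 can be sketched as below. The sketch borrows the username handling given in the detailed description (lowercasing, removal of special characters, substrings of length 3) and makes several assumptions of its own: the vocabularies `gram_vocab` and `gender_vocab` are hypothetical, pinyin conversion is omitted, the substrings are encoded as a multi-hot vector, and only two attributes are encoded for brevity.

```python
import re


def normalise(name):
    """Lowercase and keep only ASCII letters and digits (pinyin step omitted)."""
    return re.sub(r"[^a-z0-9]", "", name.lower())


def substrings(name, n=3):
    """All contiguous character substrings of length n of the cleaned name."""
    s = normalise(name)
    return [s[i:i + n] for i in range(len(s) - n + 1)]


def one_hot(value, vocabulary):
    return [1 if value == v else 0 for v in vocabulary]


def multi_hot(values, vocabulary):
    present = set(values)
    return [1 if v in present else 0 for v in vocabulary]


def encode_user(user, gram_vocab, gender_vocab):
    """Attribute vector of one user: username substrings, then gender."""
    return (multi_hot(substrings(user["name"]), gram_vocab)
            + one_hot(user["gender"], gender_vocab))


def encode_pair(user_a, user_b, gram_vocab, gender_vocab):
    """Concatenate the two users' attribute vectors into the model input d_i."""
    return (encode_user(user_a, gram_vocab, gender_vocab)
            + encode_user(user_b, gram_vocab, gender_vocab))


gram_vocab = ["vio", "iol", "ole", "let"]
gender_vocab = ["male", "female"]
d_i = encode_pair({"name": "Vio_let", "gender": "female"},
                  {"name": "vio", "gender": "female"},
                  gram_vocab, gender_vocab)
```

The remaining categorical attributes (graduation institution, geographic location) would be appended to each user's vector in the same way as gender.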

Further, step 4 is implemented as follows:

4-1. The left branch of the joint embedded learning model is a multilayer perceptron, and the loss function of this branch is defined as:

where the indexed node is a labeled node in the UPG and $p(y_i \mid d_i)$ denotes the probability of $y_i$ given $d_i$, calculated as follows:

the right branch adopts a negative sampling mechanism, and its loss function is defined as:

where $\delta(\cdot)$ denotes the sigmoid function, $n = |U_{UPG}|$, $u'$ denotes a context node of node $u'_i$, and the negative set consists of $t$ randomly selected negative samples;

4-2. Compute the parameters using mini-batch gradient descent: set the batch size of the left branch, $b_1$, to 200 and the batch size of the right branch, $b_2$, to 200; randomly sample $b_1$ labeled nodes from $U_{UPG}$, compute $\nabla L^{(L)}$, and update the parameters $W^m$ and $b^m$ and the left-branch weight and bias parameters according to the gradient values;

4-3. Randomly sample $b_2$ nodes from $U_{UPG}$, compute the gradient of the right-branch loss, and update the parameters $W^m$ and $b^m$ and the right-branch weight and bias parameters according to the gradient values;

4-4. Return to step 4-2 and iterate 100 times;

4-5. Input the node $u'_j$ to be predicted in the UPG: compute the attribute vectors of the two users in the node according to step 3-2, concatenate them into the vector $d_j$, feed $d_j$ into the joint embedded learning model, and compute the label of the node $u'_j$ to be predicted.
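The alternating schedule of steps 4-2 to 4-4 can be outlined as a small driver loop. The two callbacks `left_step` and `right_step` are hypothetical placeholders for "compute the branch gradient on this batch and update the parameters"; sampling without replacement and the fixed random seed are assumptions of this sketch rather than details given in the text.

```python
import random


def train(labeled_nodes, all_nodes, left_step, right_step,
          b1=200, b2=200, iterations=100, seed=0):
    """Alternate one left-branch and one right-branch mini-batch update."""
    rng = random.Random(seed)
    for _ in range(iterations):
        # Step 4-2: sample b1 labeled nodes and update on the left-branch loss.
        left_step(rng.sample(labeled_nodes, min(b1, len(labeled_nodes))))
        # Step 4-3: sample b2 nodes from the whole UPG and update on the
        # right-branch (context prediction) loss.
        right_step(rng.sample(all_nodes, min(b2, len(all_nodes))))


log = []
train(list(range(500)), list(range(1000)),
      left_step=lambda batch: log.append(("left", len(batch))),
      right_step=lambda batch: log.append(("right", len(batch))))
```

With the default settings the loop performs 100 left-branch and 100 right-branch updates, each on a batch of 200 nodes.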

The invention has the following beneficial effects:

The invention focuses on how to apply a network embedding method that effectively integrates the key factors of user identity recognition, and realizes user identity recognition across two social platforms. Cross-platform identity association plays a crucial role in commercial cross-social-network applications, such as user behavior analysis over multiple social networks, cross-network information service push, cross-platform friend recommendation, and network security governance for government agencies and enterprises. The method can effectively predict whether two users from different networks are the same user.

Drawings

FIG. 1 is a flow chart of the method of the present invention;

FIG. 2 is an example of candidate user pair generation;

FIG. 3 is an example of the generated user-pair network graph;

FIG. 4 is an example of the joint embedding model.

Detailed Description

The invention will be further explained with reference to the drawings.

As shown in FIG. 1, the method for identifying the user identity across the social network based on the joint embedded learning model comprises the following steps:

Step 1: for the users of social networks $G^A$ and $G^B$, select candidate user pairs from the two social networks using username similarity and network structure;

Step 2: take every candidate user pair in $P = \{p_i\}$ as a node; if the two users in $p_i$ are respectively neighbors of the two users in $p_j$, an edge exists between $p_i$ and $p_j$; construct the User Pair network Graph (UPG) on this principle;

Step 3: on the basis of the constructed UPG and the labeled user-pair data (labeled user pairs), fuse the labels of the labeled matched user pairs, the structure information and the attribute information to build a joint embedded learning model, designed as a deep neural network with one input and two outputs;

Step 4: minimize the loss function of the joint embedded model using stochastic gradient descent; after training, use the model to predict each user pair to be predicted and output whether the two accounts belong to the same user.

Step 1 is implemented as follows:

1-1. $G^A = (U^A, E^A, X^A)$ denotes social network A, where $U^A$ is the set of users of social network A, $E^A$ is the set of user relationships of social network A, $X^A$ is the user attribute matrix of social network A, and $u_i^A \in U^A$ denotes a user in social network A; $G^B = (U^B, E^B, X^B)$ denotes social network B, and its parameters have analogous meanings. The invention uses a web crawler to collect data from Sina Weibo ($G^A$) and Zhihu ($G^B$); the $G^A$ data contains about $1.23 \times 10^5$ user nodes and the $G^B$ data contains about $1.95 \times 10^5$ user nodes. The user information common to both networks comprises username, gender, college and location.

1-2. Data from the different social network platforms are obtained using a crawler.

1-3. For each pair of users drawn from social networks $G^A$ and $G^B$ respectively, with username strings $n_k$ and $n_j$, compute the similarity according to the following formula, and add the user pairs whose similarity is greater than 0.8 to the candidate user pair set $P$,

where $lev(n_k, n_j)$ denotes the Levenshtein distance and $l(n_k)$ denotes the character length of username $n_k$. For example, the usernames "vio" and "violet" have a similarity of 0.5.

1-4. Taking each user pair in $P$ as a seed pair, expand to its neighbor nodes: from the neighbors of the seed pair, select the user pairs that have $r$ common neighbors (known pairs) and add them to $P$; the value of $r$ is set separately for each dataset. For this step, the invention provides the example shown in FIG. 2. In FIG. 2, the seed pairs are assumed to be the user pairs whose username similarity is greater than 0.8, and $r$ is set to 2; according to this step, four further user pairs are added to $P$ as candidate user pairs, which yields the final set $P$.
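The username test of step 1-3 can be sketched in a few lines. The similarity formula itself is not reproduced in this text, so the normalisation used here, $1 - lev(n_k, n_j) / \max(l(n_k), l(n_j))$, is an assumption; it was chosen because it reproduces the stated example ("vio" versus "violet" gives 0.5). The function names are illustrative.

```python
def levenshtein(a, b):
    """Edit distance computed row by row with dynamic programming."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1,          # delete ca
                            curr[j - 1] + 1,      # insert cb
                            prev[j - 1] + cost))  # substitute or match
        prev = curr
    return prev[-1]


def username_similarity(n_k, n_j):
    """Assumed normalisation: 1 - lev / max(length of either name)."""
    longest = max(len(n_k), len(n_j))
    if longest == 0:
        return 1.0
    return 1.0 - levenshtein(n_k, n_j) / longest


def seed_pairs(names_a, names_b, threshold=0.8):
    """All cross-network username pairs whose similarity exceeds the threshold."""
    return [(k, j) for k in names_a for j in names_b
            if username_similarity(k, j) > threshold]


example = username_similarity("vio", "violet")
seeds = seed_pairs(["violet", "alice"], ["violett", "bob"])
```

Comparing every username in one network with every username in the other is quadratic in the number of users; at the scale quoted in the description some form of blocking or indexing would likely be needed, which this sketch does not attempt.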

Step 2 is implemented as follows:

2-1. $UPG = (U_{UPG}, E_{UPG})$ denotes the user-pair network graph, where $U_{UPG}$ is the set of nodes and $E_{UPG}$ is the set of relationships between nodes. Each candidate user pair $p_i$ is taken as a node of the UPG and denoted $u'_i$, with $u'_i \in U_{UPG}$;

2-2. Suppose $u'_i$ and $u'_j$ are two nodes in the UPG; an edge exists between the two nodes if the two users of $u'_i$ are respectively neighbors of the two users of $u'_j$, that is, if each user of one pair belongs to the set of neighboring nodes of the corresponding user of the other pair.

For step 2, the invention provides the user-pair network graph generated from the two social networks shown in FIG. 2; the result is shown in FIG. 3. According to steps 2-1 and 2-2, the generated user-pair network graph contains 6 nodes and 8 edges.

Step 3 is implemented as follows:

3-1. Using part of the user attribute information collected by the crawler (for example, the account information for other platforms, mobile phone numbers and mailboxes that users provide in their personal introductions), mark each user's exact corresponding account in the other network by text analysis and matching techniques combined with manual judgment. The marked matching user pairs serve as the supervision information for model training.

3-2. For the two users in each candidate user pair, the attributes (username, gender, college and geographic location) are converted into features by one-hot encoding and recorded as the two users' attribute vectors. Specifically, for the username attribute, Chinese characters are converted to pinyin, uppercase letters are converted to lowercase, and special characters such as underscores are removed; several character substrings are then extracted from the username, and the substrings are one-hot encoded. For example, for the username "violet", the substrings {"vio", "iol", "ole", "let"} of length 3 can be extracted. Categorical attributes such as gender, geographic location and graduation college are one-hot encoded directly. For example, gender has only two options, "male" and "female", so "male" can be encoded as {10} and "female" as {01}; the remaining attributes are handled similarly.

3-3. As shown in FIG. 4, a joint embedding model is built on the constructed user-pair network graph. The attribute vectors of the two users in a node are concatenated, and the result is used as the input of the joint embedding model. The output has a left branch and a right branch: the left branch uses a multilayer perceptron to output the probability that the predicted node label $y_i$ is 0 or 1 (1 indicates that the two users in the node are the same user, and 0 indicates that they are different users), and the right branch uses a skip-gram model to output the predicted probability of the context node. The $m$-th layer of the model is represented as:

where $\delta(\cdot)$ denotes the sigmoid function, and $W^m$ and $b^m$ are the weight and bias parameters of layer $m$. The latter two formulas represent the $(m+1)$-th layers of the left branch and the right branch respectively, each branch having its own weight and bias parameters at layer $m+1$, and so on for the subsequent layers.

The last layer of the left branch (node label prediction) of the model is designed as a softmax layer, and the input of the layer is:

The last layer of the right branch (context node prediction) of the model is designed as a softmax layer, and the input of the layer is:
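Because the layer formulas are not reproduced in this text, the following is only a sketch of one plausible reading of the architecture: shared sigmoid layers of the form $\delta(W^m h^{m-1} + b^m)$, followed by two branch-specific stacks that each end in a softmax layer. The hidden-state symbol $h$, the layer sizes, the random initialisation and all function names are assumptions made for illustration.

```python
import math
import random


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def softmax(zs):
    top = max(zs)  # subtract the maximum for numerical stability
    exps = [math.exp(z - top) for z in zs]
    total = sum(exps)
    return [e / total for e in exps]


def affine(x, W, b):
    """Compute W x + b, with W stored as a list of rows."""
    return [sum(w * xi for w, xi in zip(row, x)) + bi for row, bi in zip(W, b)]


def init_layer(n_in, n_out, rng):
    W = [[rng.gauss(0.0, 0.1) for _ in range(n_in)] for _ in range(n_out)]
    return W, [0.0] * n_out


def hidden(x, layers):
    """Apply a stack of sigmoid layers: h = sigmoid(W h_prev + b)."""
    for W, b in layers:
        x = [sigmoid(z) for z in affine(x, W, b)]
    return x


def forward(d, shared, left, left_out, right, right_out):
    h = hidden(d, shared)                                      # shared layers
    p_label = softmax(affine(hidden(h, left), *left_out))      # left branch
    p_context = softmax(affine(hidden(h, right), *right_out))  # right branch
    return p_label, p_context


rng = random.Random(0)
shared = [init_layer(6, 5, rng)]
left, left_out = [init_layer(5, 4, rng)], init_layer(4, 2, rng)
right, right_out = [init_layer(5, 4, rng)], init_layer(4, 3, rng)
p_label, p_context = forward([1, 0, 0, 1, 0, 1],
                             shared, left, left_out, right, right_out)
```

The left output has two entries (same user or different users); the right output has one entry per candidate context node, here three for a toy UPG.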

Step 4 is implemented as follows:

4-1. The left branch of the joint embedding model is a multilayer perceptron, and the loss function of this branch is defined as:

where $\delta(\cdot)$ denotes the sigmoid function, $n = |U_{UPG}|$, $u'$ denotes a context node of node $u'_i$, and the negative set consists of $t$ randomly selected negative samples. For the remaining parameters, see step 3-3.

4-2. Compute the parameters using mini-batch gradient descent. Set the batch size of the left branch, $b_1$, to 200 and the batch size of the right branch, $b_2$, to 200; randomly sample $b_1$ labeled nodes, compute $\nabla L^{(L)}$, and update the parameters $W^m$ and $b^m$ and the left-branch weight and bias parameters according to the gradient values;

4-3. Sample $b_2$ nodes from $U_{UPG}$, compute the gradient of the right-branch loss, and update the parameters $W^m$ and $b^m$ and the right-branch weight and bias parameters according to the gradient values;

4-4. Return to step 4-2 and iterate 100 times.

4-5. Input the node $u'_j$ to be predicted in the UPG: compute the attribute vectors of the two users in the node according to step 3-2, concatenate them into the vector $d_j$, feed $d_j$ into the joint embedding model, and compute the label of the node $u'_j$ to be predicted.
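The two branch losses of step 4-1 are not reproduced in this text. As a hedged illustration, the sketch below uses the standard forms such losses usually take: a mean negative log-likelihood over labeled nodes for the left branch, and a skip-gram negative-sampling loss for the right branch. The exact scaling, the vector names (`embedding`, `context_vec`, `negative_vecs`) and the use of dot products are assumptions, not the patent's formulas.

```python
import math


def log_sigmoid(z):
    """Numerically stable log(sigmoid(z))."""
    if z >= 0:
        return -math.log1p(math.exp(-z))
    return z - math.log1p(math.exp(z))


def dot(u, v):
    return sum(a * b for a, b in zip(u, v))


def left_loss(probs, labels):
    """Mean negative log-likelihood; probs[i][y] stands in for p(y_i | d_i)."""
    return -sum(math.log(p[y]) for p, y in zip(probs, labels)) / len(labels)


def right_loss(embedding, context_vec, negative_vecs):
    """Negative-sampling loss for one node, one true context and t negatives."""
    loss = -log_sigmoid(dot(context_vec, embedding))
    for neg in negative_vecs:
        # Negative samples are pushed towards a low score.
        loss -= log_sigmoid(-dot(neg, embedding))
    return loss


l_left = left_loss([[0.5, 0.5], [0.25, 0.75]], [1, 1])
l_right = right_loss([0.0, 0.0], [1.0, 1.0], [[1.0, 1.0]])
```

With a zero embedding every score is 0, so each sigmoid term equals 0.5 and the right-branch loss for one context and one negative is $2 \ln 2$.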

In step 4, taking the crawled Sina Weibo and Zhihu user data as an example, 7325 user pairs were extracted, of which 2213 are labeled; 30% of the labeled data were used as model training data and the rest as test data. For this pair of networks, the user-pair network graph was constructed, the joint embedding model was built according to FIG. 4, and the model parameters were learned. User identity association was then performed on the test pairs and the accuracy was calculated; the final accuracy reached 84.7%.

Claims

1. The cross-social network user identity recognition method based on the joint embedded learning model is characterized by comprising the following steps of:

2. The method for identifying the user identity across the social network based on the joint embedded learning model according to claim 1, wherein the step 1 is implemented as follows:

1-2, acquiring data of different social network platforms by using a crawler;

1-3. for each pair of users drawn from social networks $G^A$ and $G^B$ respectively

3. The method for identifying the user identity across the social network based on the joint embedded learning model according to claim 2, wherein the step 2 is implemented as follows:

2-2. suppose $u'_i$ and $u'_j$ are two nodes in the UPG, the edge condition between them being expressed in terms of each user's set of neighboring nodes.

4. The method for identifying the user identity across the social network based on the joint embedded learning model according to claim 3, wherein the step 3 is implemented as follows:


5. The method for identifying the user identity across the social network based on the joint embedded learning model according to claim 4, wherein the step 4 is implemented as follows: