CN117892019A - Cross-social network identity linking method and device - Google Patents


Info

Publication number
CN117892019A
Authority
CN
China
Legal status: Granted
Application number
CN202410289109.5A
Other languages
Chinese (zh)
Other versions
CN117892019B
Inventor
黄锐 (Huang Rui)
马延淮 (Ma Yanhuai)
彭可兴 (Peng Kexing)
Current Assignee
Nanjing University of Information Science and Technology
Original Assignee
Nanjing University of Information Science and Technology
Priority date
Filing date
Publication date
Application filed by Nanjing University of Information Science and Technology
Priority to CN202410289109.5A
Publication of CN117892019A
Application granted
Publication of CN117892019B

Classifications

    • G06F16/9558 — Information retrieval; retrieval from the web using information identifiers; details of hyperlinks; management of linked annotations
    • G06F40/166 — Handling natural language data; text processing; editing
    • G06F40/30 — Handling natural language data; semantic analysis
    • G06N3/0442 — Neural networks; recurrent networks characterised by memory or gating, e.g. long short-term memory [LSTM] or gated recurrent units [GRU]
    • G06N3/045 — Neural networks; combinations of networks
    • G06N3/08 — Neural networks; learning methods
    • G06Q50/01 — ICT specially adapted for specific business sectors; social networking

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • General Physics & Mathematics (AREA)
  • General Health & Medical Sciences (AREA)
  • General Engineering & Computer Science (AREA)
  • Health & Medical Sciences (AREA)
  • Artificial Intelligence (AREA)
  • Computational Linguistics (AREA)
  • Data Mining & Analysis (AREA)
  • Computing Systems (AREA)
  • Evolutionary Computation (AREA)
  • Biomedical Technology (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Molecular Biology (AREA)
  • Mathematical Physics (AREA)
  • Software Systems (AREA)
  • Biophysics (AREA)
  • Databases & Information Systems (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Business, Economics & Management (AREA)
  • Economics (AREA)
  • Marketing (AREA)
  • Primary Health Care (AREA)
  • Strategic Management (AREA)
  • Tourism & Hospitality (AREA)
  • General Business, Economics & Management (AREA)
  • Human Resources & Organizations (AREA)
  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)
  • Machine Translation (AREA)

Abstract

The invention discloses a cross-social-network identity linking method and device. The method comprises: acquiring two target text posts from different social networks; preprocessing the target text posts and inputting them into a trained network model to determine whether the two posts belong to the same user. Training of the network model comprises: acquiring text posts from different social networks; preprocessing each text post to generate a text data set; dividing the text data set into a training set, a verification set, and a test set; constructing a network model based on multi-angle text information and training it on the training set; after each round of training, performing model screening with the verification set, retaining the optimal network model, and performing model testing with the test set to obtain the performance of the network model. Compared with the prior art, the method and device improve the accuracy and stability of identity linking.

Description

Cross-social network identity linking method and device
Technical Field
The invention relates to a method and a device for linking identities across social networks, and belongs to the technical field of network information.
Background
The rise of social networking platforms has brought a diversity of services, and people tend to register multiple accounts on different social networks. However, some malicious users engage in illegal activities on networks without a real-name system, where their real identities remain hidden.
Cross-social-network identity linking can recover key information about malicious users, such as real names, but malicious users may forge their registration details. The prior art typically performs identity linking with user attribute information, such as users' social relationships and profile data. Chinese patent application CN202110607064.8, "A user identity association method integrating multi-modal information and weight tensors", and Chinese patent application CN202110148895.3, "A social network user identity association method integrating user characteristics and embedding learning", both use multiple kinds of user information to derive user features. However, increasingly strict privacy-protection policies on social networking platforms make user attributes difficult to obtain. Other techniques use user-generated content (UGC) for identity linking; UGC is easier to obtain than attribute information because it is published openly as part of users' personal behavior, so using it does not violate platform privacy policies. Due to the diversity and heterogeneity of UGC, however, modeling users' intrinsic features from it is limited, and existing research neglects identity linking that relies on homogeneous UGC (e.g., text alone). Chinese patent application CN202010376438.5, "Identity matching method and device", discloses an identity matching framework, but the framework is abstract: it does not specify how to extract user features accurately from available user information, nor does it design a method for a specific identity-matching environment, such as a social network with a high privacy-protection level.
Existing identity linking methods generally rely on user information that is difficult to obtain and easy to falsify, which increases the instability of identity linking technology.
Disclosure of Invention
The invention aims to overcome the above defects of the prior art by providing a cross-social-network identity linking method and device, solving the technical problems of instability and poor performance of existing identity linking methods.
In order to achieve the above purpose, the invention is realized by adopting the following technical scheme:
in a first aspect, the present invention provides a method for linking identities across social networks, including:
acquiring two target text posts from different social networks;
preprocessing the target text posts and inputting them into a trained network model to determine whether the two target text posts belong to the same user;
wherein the training of the network model comprises:
Acquiring each text post in different social networks;
Preprocessing each text post to generate a text data set;
dividing the text data set into a training set, a verification set and a test set;
Constructing a network model based on multi-angle text information, and training the network model through the training set; after each round of training is completed, the verification set is used for model screening, an optimal network model is reserved, the test set is used for model testing, and the effect of the network model is obtained.
Optionally, the preprocessing each text post to generate a text data set includes:
preprocessing the text posts to generate sample data, wherein the preprocessing comprises the steps of deleting links in the text posts and replacing emoji expressions with corresponding characters;
taking account pairs known to belong to the same natural person in two different social networks as positive-sample cross-network account pairs, and randomly generating negative-sample cross-network account pairs from the positive-sample cross-network account pairs;
Taking the sample data corresponding to the positive sample cross-network account pair as positive sample data, and taking the sample data corresponding to the negative sample cross-network account pair as negative sample data;
The same number of positive and negative sample data is combined into a text data set.
Optionally, the network model comprises a post-level vector characterization module, a user-level vector generation module, and a total similarity distribution generation module;
the post-level vector characterization module comprises a topic characterization module, a shallow semantic characterization module, and a post similarity distribution generation module; the topic characterization module comprises a RoBERTa language model, a variational self-encoder, and a decoder, and generates topic vector representations from text posts; the shallow semantic characterization module comprises a GloVe word embedding tool and a BiLSTM network model, and generates shallow semantic vector representations from text posts; the post similarity distribution generation module calculates the similarity containing time factors between topic vector representations and between shallow semantic vector representations, and calculates the post similarity distribution from these similarities;
the user-level vector generation module comprises a knowledge triplet extraction module, an encoder, and a multi-layer perceptron; the knowledge triplet extraction module comprises an SBERT model and matches knowledge triples from the text posts against an open-source knowledge graph library; the encoder generates a user portrait representation vector from the knowledge triples; the multi-layer perceptron generates the user-level similarity distribution from the user portrait representation vectors;
the total similarity distribution generation module generates the total similarity distribution from the post similarity distribution and the user-level similarity distribution.
Optionally, the generating a topic vector representation from the text post comprises:
converting the j-th text post t_i^j of the i-th user into a post vector representation v_i^j through the RoBERTa language model;
generating, through the variational self-encoder, the probability distribution of the topic vector representation from the post vector representations:
z_i^j ~ N(MLP(c_i^j)),
where N(·) is a Gaussian distribution function parameterized by a multi-layer perceptron MLP(·) inside the variational self-encoder, z_i^j is the topic vector representation of text post t_i^j, (v_i^1, …, v_i^j) are the post vector representations of the j-th text post of the i-th user and all text posts preceding it, and c_i^j is the feature vector of text post t_i^j:
c_i^j = Att(v_i^j, (v_i^1, …, v_i^j), (v_i^1, …, v_i^j)),
where Att(·) is the attention mechanism function, whose three arguments serve as the query vector, key vectors, and value vectors, respectively.
Optionally, the decoder is configured to reconstruct the post vector representation from the probability distribution of the topic vector representation;
the variational self-encoder and the decoder are trained and optimized with the goal of minimizing the gap between the original post vector representation v_i^j and the reconstructed post vector representation v̂_i^j.
Optionally, the generating a shallow semantic vector representation from the text post comprises:
converting the k-th word in the j-th text post t_i^j of the i-th user into a word vector representation w_i^{j,k} through the GloVe word embedding tool;
generating, through the BiLSTM network model, the shallow semantic vector representation s_i^j of text post t_i^j from the word vector representations,
where s_i^j aggregates the shallow semantic vectors h_i^{j,k} of the n words in text post t_i^j, and h_i^{j,k} concatenates the outputs of the forward and backward LSTM models for the k-th word:
h_i^{j,k} = [h→_i^{j,k} ; h←_i^{j,k}],
where h→_i^{j,k} is the vector obtained by the forward LSTM model for the first k words in text post t_i^j, computed through the update gate, memory cell state, and reset gate of the forward LSTM model, with W_u, b_u the weight matrix and bias value corresponding to the update gate, W_c, b_c the weight matrix and bias value corresponding to the memory cell, W_r, b_r the weight matrix and bias value corresponding to the reset gate, ⊙ the element-wise multiplication operation, and σ(·) the Sigmoid activation function;
h←_i^{j,k} is the vector obtained analogously by the backward LSTM model for the first k words in text post t_i^j, through its own update gate, memory cell state, and reset gate with the corresponding weight matrices and bias values.
Optionally, the similarity containing the time factor between topic vector representations is:
sim_z(t_i^j, t_{i'}^{j'}) = λ(t_i^j, t_{i'}^{j'}) · cos(z_i^j, z_{i'}^{j'}),
where cos(z_i^j, z_{i'}^{j'}) = (z_i^j · z_{i'}^{j'}) / (‖z_i^j‖ ‖z_{i'}^{j'}‖) is the cosine similarity between the topic vector representation z_i^j of the j-th text post of the i-th user and the topic vector representation z_{i'}^{j'} of the j'-th text post of the i'-th user, and λ(t_i^j, t_{i'}^{j'}) is the time-relevance weight between the two text posts;
the similarity containing the time factor between shallow semantic vector representations is, analogously:
sim_s(t_i^j, t_{i'}^{j'}) = λ(t_i^j, t_{i'}^{j'}) · cos(s_i^j, s_{i'}^{j'}),
where cos(s_i^j, s_{i'}^{j'}) is the cosine similarity between the shallow semantic vector representations s_i^j and s_{i'}^{j'} of the two text posts.
The calculating the post similarity distribution according to the similarities comprises:
calculating the post similarity from the similarity set containing time factors between topic vector representations and the similarity set containing time factors between shallow semantic vector representations, taken over the m text posts of the i-th user and the m' text posts of the i'-th user, combined according to a confidence weight α;
the confidence weight α is computed from the weight matrix and bias value W_z, b_z corresponding to the topic vector representations, the weight matrix and bias value W_s, b_s corresponding to the shallow semantic vector representations, the attention matrix parameter M, and the vector concatenation operation [·;·];
calculating the post similarity distribution D_p from the post similarities, where D_p(t_i^j) is the post similarity distribution of the j-th text post t_i^j of the i-th user and m is the number of text posts of the i-th user.
Optionally, the generating a user portrait representation vector from the knowledge triples comprises:
generating a knowledge vector representation from the knowledge triples through a variational self-encoder;
generating, through a position encoder, the user portrait vector representation by embedding timing information into the knowledge vector representation:
PE(pos, 2d) = sin(pos / 10000^{2d/D}), PE(pos, 2d+1) = cos(pos / 10000^{2d/D}),
where D is the dimension of the knowledge vector representation, pos is the position of the currently processed post, and d is the dimension index.
Optionally, the loss function of the network model is:
L = L_post + L_user,
where L_post is the post-level loss and L_user is the user-level loss;
L_post is a classification loss over the N sample data, the sample types comprising positive sample data and negative sample data, where y_k is the sample type of the k-th sample datum and ŷ_k is the predicted sample type for the k-th sample datum;
L_post is computed from the weight matrix and bias value of its activation function and the post similarity distribution D_p; L_user is computed from the user portrait representation vectors of the two users and the weight matrix of its activation function.
In a second aspect, the present invention provides an identity linking device across a social network, the device comprising:
the target acquisition module, configured to acquire two target text posts from different social networks;
the identity linking module, configured to preprocess the target text posts, input them into a trained network model, and determine whether the two target text posts belong to the same user;
wherein the training of the network model comprises:
Acquiring each text post in different social networks;
Preprocessing each text post to generate a text data set;
dividing the text data set into a training set, a verification set and a test set;
Constructing a network model based on multi-angle text information, and training the network model through the training set; after each round of training is completed, the verification set is used for model screening, an optimal network model is reserved, the test set is used for model testing, and the effect of the network model is obtained.
Compared with the prior art, the invention has the following beneficial effects:
the cross-social-network identity linking method and device use the multi-angle text information of a user to enrich the user's feature information for the cross-social-network identity matching task, representing user features simultaneously with topic information, shallow semantic information, and knowledge representation information to judge whether accounts on two different social networks belong to the same natural person; compared with the prior art, this achieves better accuracy and stability.
Drawings
FIG. 1 is a schematic flow chart of an identity linking method across social networks provided by an embodiment of the invention;
FIG. 2 is a schematic diagram of the structure and principle of a network model according to an embodiment of the present invention;
FIG. 3 is a comparative schematic diagram of experimental results provided in the examples of the present invention.
Detailed Description
The invention is further described below with reference to the accompanying drawings. The following examples are only for more clearly illustrating the technical aspects of the present invention, and are not intended to limit the scope of the present invention.
Embodiment one:
As shown in fig. 1, an embodiment of the present invention provides an identity linking method across social networks, including the following steps:
Step S1, acquiring two target text posts from different social networks.
Step S2, preprocessing the target text posts, inputting them into a trained network model, and determining whether the two target text posts belong to the same user.
Wherein, training of the network model comprises:
And S21, acquiring each text post in different social networks.
S22, preprocessing each text post to generate a text data set; the method specifically comprises the following steps:
Step S221, preprocessing the text posts to generate sample data, wherein the preprocessing comprises deleting links in the text posts and replacing emoji expressions with corresponding words;
Step S222, taking account pairs known to belong to the same natural person in two different social networks as positive-sample cross-network account pairs, and randomly generating negative-sample cross-network account pairs from the positive-sample cross-network account pairs;
Step S223, taking the sample data corresponding to the positive sample cross-network account pair as positive sample data, and taking the sample data corresponding to the negative sample cross-network account pair as negative sample data;
Step S224, merging the same number of positive sample data and negative sample data into a text data set.
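The preprocessing and pair-construction steps above (S221–S223) can be sketched as follows. The link-matching regex, the emoji-to-word table, and the `make_pairs` helper are illustrative assumptions, not the patent's exact implementation.

```python
import random
import re

# Hypothetical emoji-to-word table; the patent only says emoji are replaced
# with corresponding words, so this mapping is an assumption.
EMOJI_WORDS = {"\U0001F600": " smile ", "\U0001F44D": " thumbs up "}

def preprocess_post(text: str) -> str:
    """Delete links and replace emoji with corresponding words (step S221)."""
    text = re.sub(r"https?://\S+", "", text)   # delete links in the post
    for emoji, word in EMOJI_WORDS.items():
        text = text.replace(emoji, word)
    return " ".join(text.split())              # normalize whitespace

def make_pairs(positive_pairs, accounts_b, seed=0):
    """Positive pairs are known same-person account pairs across the two
    networks; one negative pair is randomly drawn per positive pair
    (steps S222-S223). Assumes accounts_b contains non-matching accounts."""
    rng = random.Random(seed)
    positives = set(positive_pairs)
    negatives = []
    for a, _ in positive_pairs:
        b = rng.choice(accounts_b)
        while (a, b) in positives:             # re-draw if it is a known match
            b = rng.choice(accounts_b)
        negatives.append((a, b))
    return list(positive_pairs), negatives
```

Equal numbers of positive and negative pairs then form the text data set (step S224).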
Step S23, dividing the text data set into a training set, a verification set and a test set; in this embodiment, the dividing ratio is 8:1:1.
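The 8:1:1 split of step S23 can be sketched as below; the shuffle seed is an assumption for reproducibility.

```python
import random

def split_dataset(samples, seed=0):
    """Shuffle and split the text data set into training,
    verification, and test sets at an 8:1:1 ratio."""
    data = list(samples)
    random.Random(seed).shuffle(data)
    n = len(data)
    n_train, n_val = int(n * 0.8), int(n * 0.1)
    train = data[:n_train]
    val = data[n_train:n_train + n_val]
    test = data[n_train + n_val:]
    return train, val, test
```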
S24, constructing a network model based on multi-angle text information, and training the network model through a training set; after each round of training is completed, a verification set is used for model screening, an optimal network model is reserved, a test set is used for model testing, and the effect of the network model is obtained.
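A minimal sketch of the per-round screening in step S24, keeping the model state that scores best on the verification set. `train_step` and `evaluate` are hypothetical placeholders for the real training and verification routines.

```python
def train_with_selection(train_step, evaluate, epochs):
    """After each epoch, screen on the verification set and retain the
    best-scoring model state (step S24)."""
    best_state, best_score = None, float("-inf")
    for epoch in range(epochs):
        state = train_step(epoch)      # one round of training
        score = evaluate(state)        # model screening on verification set
        if score > best_score:
            best_state, best_score = state, score
    return best_state, best_score
```

The retained `best_state` is then evaluated once on the test set to obtain the model's performance.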
As shown in fig. 2, the network model includes a post-level vector characterization module, a user-level vector generation module, and a total similarity distribution generation module;
the post-level vector characterization module comprises a topic characterization module, a shallow semantic characterization module, and a post similarity distribution generation module; the topic characterization module comprises a RoBERTa language model, a variational self-encoder, and a decoder, and generates topic vector representations from the text posts; the shallow semantic characterization module comprises a GloVe word embedding tool and a BiLSTM network model, and generates shallow semantic vector representations from the text posts; the post similarity distribution generation module calculates the similarity containing time factors between topic vector representations and between shallow semantic vector representations, and calculates the post similarity distribution from these similarities;
the user-level vector generation module comprises a knowledge triplet extraction module, an encoder, and a multi-layer perceptron; the knowledge triplet extraction module comprises an SBERT model and matches knowledge triples from the text posts against an open-source knowledge graph library; the encoder generates a user portrait representation vector from the knowledge triples; the multi-layer perceptron generates the user-level similarity distribution from the user portrait representation vectors;
the total similarity distribution generation module generates the total similarity distribution D from the post similarity distribution D_p and the user-level similarity distribution D_u; the total similarity distribution D is the final identity-link prediction result.
Specifically, (1) generating a topic vector representation from a text post comprises:
converting the j-th text post t_i^j of the i-th user into a post vector representation v_i^j through the RoBERTa language model;
generating, through the variational self-encoder, the probability distribution of the topic vector representation from the post vector representations:
z_i^j ~ N(MLP(c_i^j)),
where N(·) is a Gaussian distribution function parameterized by a multi-layer perceptron MLP(·) inside the variational self-encoder, z_i^j is the topic vector representation of text post t_i^j, (v_i^1, …, v_i^j) are the post vector representations of the j-th text post of the i-th user and all text posts preceding it, and c_i^j is the feature vector of text post t_i^j:
c_i^j = Att(v_i^j, (v_i^1, …, v_i^j), (v_i^1, …, v_i^j)),
where Att(·) is the attention mechanism function, whose three arguments serve as the query vector, key vectors, and value vectors, respectively.
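The feature-vector step above, attending over the current and preceding post vectors, can be sketched as plain dot-product attention; the 1/√d scaling is a conventional assumption, not stated in the patent text.

```python
import math

def softmax(xs):
    """Numerically stable softmax over a list of scores."""
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    s = sum(exps)
    return [e / s for e in exps]

def attention(query, keys, values):
    """Att(q, K, V): the feature vector of a post is a softmax-weighted
    sum of the value vectors, weighted by query-key dot products."""
    d = len(query)
    scores = [sum(q * k for q, k in zip(query, key)) / math.sqrt(d)
              for key in keys]
    weights = softmax(scores)
    return [sum(w * v[i] for w, v in zip(weights, values))
            for i in range(len(values[0]))]
```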
The decoder is configured to reconstruct the post vector representation from the probability distribution of the topic vector representation;
the variational self-encoder and the decoder are trained and optimized with the goal of minimizing the gap between the original post vector representation v_i^j and the reconstructed post vector representation v̂_i^j.
Specifically, (2) generating a shallow semantic vector representation from the text post comprises:
converting the k-th word in the j-th text post t_i^j of the i-th user into a word vector representation w_i^{j,k} through the GloVe word embedding tool;
generating, through the BiLSTM network model, the shallow semantic vector representation s_i^j of text post t_i^j from the word vector representations,
where s_i^j aggregates the shallow semantic vectors h_i^{j,k} of the n words in text post t_i^j, and h_i^{j,k} concatenates the outputs of the forward and backward LSTM models for the k-th word:
h_i^{j,k} = [h→_i^{j,k} ; h←_i^{j,k}],
where h→_i^{j,k} is the vector obtained by the forward LSTM model for the first k words in text post t_i^j, computed through the update gate, memory cell state, and reset gate of the forward LSTM model, with W_u, b_u the weight matrix and bias value corresponding to the update gate, W_c, b_c the weight matrix and bias value corresponding to the memory cell, W_r, b_r the weight matrix and bias value corresponding to the reset gate, ⊙ the element-wise multiplication operation, and σ(·) the Sigmoid activation function;
h←_i^{j,k} is the vector obtained analogously by the backward LSTM model for the first k words in text post t_i^j, through its own update gate, memory cell state, and reset gate with the corresponding weight matrices and bias values.
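One step of the forward LSTM can be sketched as below. The patent's "update gate / memory cell / reset gate" wording is rendered here with the standard input, forget, and output gates of an LSTM cell, and the scalar weights in `p` are illustrative assumptions.

```python
import math

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def lstm_step(x, h_prev, c_prev, p):
    """One forward-LSTM step on scalar input x with scalar parameters p."""
    i = sigmoid(p["W_i"] * x + p["U_i"] * h_prev + p["b_i"])    # input gate
    f = sigmoid(p["W_f"] * x + p["U_f"] * h_prev + p["b_f"])    # forget gate
    o = sigmoid(p["W_o"] * x + p["U_o"] * h_prev + p["b_o"])    # output gate
    g = math.tanh(p["W_c"] * x + p["U_c"] * h_prev + p["b_c"])  # candidate cell
    c = f * c_prev + i * g          # new memory cell state (element-wise mix)
    h = o * math.tanh(c)            # new hidden state
    return h, c
```

The backward LSTM applies the same step to the word sequence in reverse order with its own parameters.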
(3) The similarity containing the time factor between topic vector representations is:
sim_z(t_i^j, t_{i'}^{j'}) = λ(t_i^j, t_{i'}^{j'}) · cos(z_i^j, z_{i'}^{j'}),
where cos(z_i^j, z_{i'}^{j'}) = (z_i^j · z_{i'}^{j'}) / (‖z_i^j‖ ‖z_{i'}^{j'}‖) is the cosine similarity between the topic vector representation z_i^j of the j-th text post of the i-th user and the topic vector representation z_{i'}^{j'} of the j'-th text post of the i'-th user, and λ(t_i^j, t_{i'}^{j'}) is the time-relevance weight between the two text posts;
the similarity containing the time factor between shallow semantic vector representations is, analogously:
sim_s(t_i^j, t_{i'}^{j'}) = λ(t_i^j, t_{i'}^{j'}) · cos(s_i^j, s_{i'}^{j'}),
where cos(s_i^j, s_{i'}^{j'}) is the cosine similarity between the shallow semantic vector representations s_i^j and s_{i'}^{j'} of the two text posts.
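The time-factored similarity can be sketched as cosine similarity scaled by a temporal weight. The exponential decay used for λ below is an assumed stand-in, since the patent's exact time-relevance formula is not reproduced in the text.

```python
import math

def cosine(u, v):
    """Cosine similarity between two vectors of equal dimension."""
    dot = sum(a * b for a, b in zip(u, v))
    nu = math.sqrt(sum(a * a for a in u))
    nv = math.sqrt(sum(b * b for b in v))
    return dot / (nu * nv)

def timed_similarity(u, v, t_u, t_v, decay=0.1):
    """Cosine similarity between two post vectors, down-weighted as the
    posting times move apart (assumed exponential-decay time weight)."""
    w = math.exp(-decay * abs(t_u - t_v))
    return w * cosine(u, v)
```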
Calculating the post similarity distribution from the similarities comprises:
calculating the post similarity from the similarity set containing time factors between topic vector representations and the similarity set containing time factors between shallow semantic vector representations, taken over the m text posts of the i-th user and the m' text posts of the i'-th user, combined according to a confidence weight α;
the confidence weight α is computed from the weight matrix and bias value W_z, b_z corresponding to the topic vector representations, the weight matrix and bias value W_s, b_s corresponding to the shallow semantic vector representations, the attention matrix parameter M, and the vector concatenation operation [·;·];
the post similarity distribution D_p is then calculated from the post similarities, where D_p(t_i^j) is the post similarity distribution of the j-th text post t_i^j of the i-th user and m is the number of text posts of the i-th user.
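The fusion-and-normalization step can be sketched as below. The fixed weight `alpha` stands in for the learned confidence weight, and softmax normalization of the fused similarities into a distribution is an assumption.

```python
import math

def post_similarity_distribution(topic_sims, shallow_sims, alpha=0.5):
    """Fuse topic-level and shallow-semantic similarities per post pair
    (alpha plays the role of the learned confidence weight, here fixed)
    and normalize the result into a distribution with softmax."""
    fused = [alpha * t + (1.0 - alpha) * s
             for t, s in zip(topic_sims, shallow_sims)]
    m = max(fused)
    exps = [math.exp(f - m) for f in fused]   # stable softmax
    z = sum(exps)
    return [e / z for e in exps]
```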
Specifically, (4) generating the user portrait representation vector from the knowledge triples comprises:
generating a knowledge vector representation from the knowledge triples through a variational self-encoder;
generating, through a position encoder, the user portrait vector representation by embedding timing information into the knowledge vector representation:
PE(pos, 2d) = sin(pos / 10000^{2d/D}), PE(pos, 2d+1) = cos(pos / 10000^{2d/D}),
where D is the dimension of the knowledge vector representation, pos is the position of the currently processed post, and d is the dimension index.
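The position encoder described by these variables matches the standard sinusoidal formulation; a sketch:

```python
import math

def positional_encoding(pos, dim):
    """Sinusoidal position encoding for a post at position pos, producing
    a vector of dimension dim (sin on even indices, cos on odd indices)."""
    pe = []
    for i in range(dim):
        angle = pos / (10000 ** (2 * (i // 2) / dim))
        pe.append(math.sin(angle) if i % 2 == 0 else math.cos(angle))
    return pe

def embed_timing(vec, pos):
    """Embed timing information by adding the position encoding to the
    knowledge vector representation."""
    pe = positional_encoding(pos, len(vec))
    return [v + p for v, p in zip(vec, pe)]
```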
Specifically, (5) the loss function of the network model is:
L = L_post + L_user,
where L_post is the post-level loss and L_user is the user-level loss;
L_post is a classification loss over the N sample data, the sample types comprising positive sample data and negative sample data, where y_k is the sample type of the k-th sample datum and ŷ_k is the predicted sample type for the k-th sample datum;
L_post is computed from the weight matrix and bias value of its activation function and the post similarity distribution D_p; L_user is computed from the user portrait representation vectors of the two users and the weight matrix of its activation function.
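Assuming the post-level and user-level terms are binary cross-entropies over same-user labels (the patent's exact formulas are not reproduced in the text), the total loss can be sketched as:

```python
import math

def bce(labels, probs, eps=1e-12):
    """Binary cross-entropy over N samples (same-user = 1, different = 0)."""
    n = len(labels)
    return -sum(y * math.log(p + eps) + (1 - y) * math.log(1 - p + eps)
                for y, p in zip(labels, probs)) / n

def total_loss(labels, post_probs, user_probs, lam=1.0):
    """L = L_post + L_user; equal weighting (lam = 1) is an assumption."""
    return bce(labels, post_probs) + lam * bce(labels, user_probs)
```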
As shown in fig. 3, the method of this embodiment is compared experimentally with prior-art identity linking methods, whose models include the BPR-DAE, MV-URL, DLHD, MSUIL-V, MNA-V, UserNet, and UserNet-C models; the model of this embodiment is denoted TEAKM. As fig. 3 shows, the accuracy of the network model trained by the invention on the test set is higher than that of the other models, indicating that the invention outperforms existing models on the identity linking task.
Embodiment two:
the embodiment of the invention provides an identity linking device across a social network, which comprises:
The target acquisition module is used for acquiring two target text posts in different social networks;
The identity link module is used for preprocessing the target text posts, inputting the preprocessed target text posts into a trained network model, and obtaining whether the two target text posts belong to the same user;
Wherein, training of the network model comprises:
Acquiring each text post in different social networks;
Preprocessing each text post to generate a text data set;
dividing the text data set into a training set, a verification set and a test set;
constructing a network model based on multi-angle text information, and training the network model through the training set; after each round of training is completed, the verification set is used for model screening and the optimal network model is retained; the test set is then used for model testing to obtain the performance of the network model.
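The splitting, per-round training, and verification-set screening described above could be sketched as follows (the `fit`/`evaluate`/`state` model interface, the 8:1:1 split, and the accuracy metric are placeholder assumptions, not the patent's actual implementation):

```python
import random

def split_dataset(samples, train_frac=0.8, val_frac=0.1, seed=42):
    """Shuffle the text data set and divide it into training,
    verification and test sets (8:1:1 split is an assumption)."""
    rng = random.Random(seed)
    data = samples[:]
    rng.shuffle(data)
    n_train = int(len(data) * train_frac)
    n_val = int(len(data) * val_frac)
    return data[:n_train], data[n_train:n_train + n_val], data[n_train + n_val:]

def train_with_screening(model, train_set, val_set, epochs):
    """Train for several rounds; after each round, screen on the
    verification set and retain the best-performing model state."""
    best_acc, best_state = -1.0, None
    for _ in range(epochs):
        model.fit(train_set)           # one round of training
        acc = model.evaluate(val_set)  # model screening on the verification set
        if acc > best_acc:             # keep the optimal network model
            best_acc, best_state = acc, model.state()
    return best_state, best_acc
```

After screening, the retained state would be evaluated once on the held-out test set to report the final performance.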
In summary, the characteristics of the text information are represented from two angles: the post level and the user level. Post-level information comprises topic information and shallow semantic information, while the user level uses knowledge to represent semantic information. Using easily acquired post information, whether accounts in two different social networks belong to the same natural person can be rapidly judged, key information of illegal users can be further confirmed in a cross-network environment, and the problem of inaccurate matching caused by false user social information or user archive information is solved.
It will be appreciated by those skilled in the art that embodiments of the present application may be provided as a method, system, or computer program product. Accordingly, the present application may take the form of an entirely hardware embodiment, an entirely software embodiment or an embodiment combining software and hardware aspects. Furthermore, the present application may take the form of a computer program product embodied on one or more computer-usable storage media (including, but not limited to, disk storage, CD-ROM, optical storage, and the like) having computer-usable program code embodied therein.
The present application is described with reference to flowchart illustrations and/or block diagrams of methods, apparatus (systems) and computer program products according to embodiments of the application. It will be understood that each flow and/or block of the flowchart illustrations and/or block diagrams, and combinations of flows and/or blocks in the flowchart illustrations and/or block diagrams, can be implemented by computer program instructions. These computer program instructions may be provided to a processor of a general purpose computer, special purpose computer, embedded processor, or other programmable data processing apparatus to produce a machine, such that the instructions, which execute via the processor of the computer or other programmable data processing apparatus, create means for implementing the functions specified in the flowchart flow or flows and/or block diagram block or blocks.
These computer program instructions may also be stored in a computer-readable memory that can direct a computer or other programmable data processing apparatus to function in a particular manner, such that the instructions stored in the computer-readable memory produce an article of manufacture including instruction means which implement the function specified in the flowchart flow or flows and/or block diagram block or blocks.
These computer program instructions may also be loaded onto a computer or other programmable data processing apparatus to cause a series of operational steps to be performed on the computer or other programmable apparatus to produce a computer implemented process such that the instructions which execute on the computer or other programmable apparatus provide steps for implementing the functions specified in the flowchart flow or flows and/or block diagram block or blocks.
The foregoing is merely a preferred embodiment of the present invention, and it should be noted that modifications and variations could be made by those skilled in the art without departing from the technical principles of the present invention, and such modifications and variations should also be regarded as being within the scope of the invention.

Claims (10)

1. An identity linking method across a social network, comprising:
Acquiring two target text posts in different social networks;
after preprocessing the target text posts, inputting the target text posts into a trained network model, and acquiring whether the two target text posts belong to the same user;
wherein the training of the network model comprises:
Acquiring each text post in different social networks;
Preprocessing each text post to generate a text data set;
dividing the text data set into a training set, a verification set and a test set;
Constructing a network model based on multi-angle text information, and training the network model through the training set; after each round of training is completed, the verification set is used for model screening and the optimal network model is retained; the test set is then used for model testing to obtain the performance of the network model.
2. The method of claim 1, wherein preprocessing each text post to generate a text data set comprises:
preprocessing the text posts to generate sample data, wherein the preprocessing comprises the steps of deleting links in the text posts and replacing emoji expressions with corresponding characters;
Taking account pairs known to belong to the same natural person in two different social networks as positive-sample cross-network account pairs, and randomly generating negative-sample cross-network account pairs from the positive-sample cross-network account pairs;
Taking the sample data corresponding to the positive sample cross-network account pair as positive sample data, and taking the sample data corresponding to the negative sample cross-network account pair as negative sample data;
The same number of positive and negative sample data is combined into a text data set.
3. The method of claim 2, wherein the network model comprises a post-level vector characterization module, a user-level vector generation module, and a total similarity distribution generation module;
The post-level vector characterization module comprises a topic characterization module, a shallow semantic characterization module and a post similarity distribution generation module; the topic characterization module comprises a RoBERTa language model, a variational self-encoder and a decoder, and is used for generating topic vector representations according to text posts; the shallow semantic characterization module comprises a GloVe word embedding tool and a BiLSTM network model, and is used for generating shallow semantic vector representations according to text posts; the post similarity distribution generation module is used for calculating the similarity, incorporating time factors, between the topic vector representations and between the shallow semantic vector representations, and calculating the post similarity distribution according to these similarities;
The user-level vector generation module comprises a knowledge triplet extraction module, a user portrait vector generation module and a multi-layer perceptron; the knowledge triplet extraction module comprises an sbert model, and is used for matching knowledge triples according to the text posts and an open-source knowledge graph library; the user portrait vector generation module comprises an encoder, and is used for generating a user portrait representation vector according to the knowledge triples; the multi-layer perceptron is used for generating a user-level similarity distribution according to the user portrait representation vector;
the total similarity distribution generation module is used for generating a total similarity distribution according to the post similarity distribution and the user-level similarity distribution.
4. The method of identity linking across a social network of claim 3, wherein generating a post vector representation from text posts comprises:
Converting the j-th text post of the i-th user into a post vector representation through the RoBERTa language model;
Generating, by the variational self-encoder, a probability distribution of the topic vector representation from the post vector representation:
wherein the quantities are, respectively, the variational self-encoder, a multi-layer perceptron, a Gaussian distribution function, the topic vector representation of the text post, the post vector representations of the j-th text post of the i-th user and of all text posts preceding it, and the topic vector representation of the j-th text post of the i-th user; the remaining quantity is the feature vector of the text post:
where the post vector representation of the text post is supplied to the attention mechanism function as its query vector, key vector, and value vector, respectively.
5. The method of identity linking across a social network of claim 4, wherein the decoder is configured to reconstruct the post vector representation from the probability distribution of the topic vector representation;
training optimization of the variational self-encoder and the decoder is performed with the goal of minimizing the gap between the original post vector representation and the reconstructed post vector representation.
6. The method of identity linking across a social network of claim 3, wherein generating a shallow semantic vector representation from text posts comprises:
Converting the k-th word in the j-th text post of the i-th user into a word vector representation by the GloVe word embedding tool;
Generating, by the BiLSTM network model, a shallow semantic vector representation of the text post from the word vector representations:
where the quantities are, respectively, the shallow semantic vector representation of the text post, the number of words in the text post, and the shallow semantic vector representation of the k-th word in the text post:
where the vector is obtained by the forward LSTM model for the k-th word in the text post; the update gate, memory cell state, and reset gate of the forward LSTM model are, respectively:
wherein the quantities are, respectively, the weight matrix and bias value of the memory cell of the forward LSTM model, the element-wise multiplication operation, the weight matrix and bias value of the update gate of the forward LSTM model, the weight matrix and bias value of the reset gate of the forward LSTM model, and the Sigmoid activation function;
where the vector is obtained by the backward LSTM model for the k-th word in the text post; the update gate, memory cell state, and reset gate of the backward LSTM model are, respectively:
wherein the quantities are, respectively, the weight matrix and bias value of the memory cell of the backward LSTM model, the weight matrix and bias value of the update gate of the backward LSTM model, and the weight matrix and bias value of the reset gate of the backward LSTM model.
7. The method of claim 3, wherein the similarity, incorporating a time factor, between the topic vector representations is:
where the cosine similarity between the topic vector representations is:
wherein the quantities are the topic vector representation of the j-th text post of the i-th user and the topic vector representation of the l-th text post of the m-th user; the similarity between the topic vector representations of the two text posts is weighted by the time relevance between the text posts:
The similarity, incorporating the time factor, between the shallow semantic vector representations is:
where the cosine similarity between the shallow semantic vector representations is:
wherein the quantities are the shallow semantic vector representation of the j-th text post of the i-th user, the shallow semantic vector representation of the l-th text post of the m-th user, and the similarity between the shallow semantic vector representations of the two text posts;
The calculating the post similarity distribution according to the similarities comprises the following steps:
calculating a post similarity according to the similarities:
wherein the two sets are the set of time-factor similarities between the topic vector representations and the set of time-factor similarities between the shallow semantic vector representations, taken over the number of text posts of the i-th user; the confidence level is:
wherein the quantities are, respectively, the weight matrix and bias value corresponding to the topic vector representation, the weight matrix and bias value corresponding to the shallow semantic vector representation, the attention matrix parameter, and the vector concatenation operation;
Calculating a post similarity distribution according to the post similarities:
where the result is the post similarity distribution of the j-th text post of the i-th user, computed over the number of text posts of the i-th user.
8. The method for identity linking across a social network recited in claim 3, wherein
the generating a user portrait representation vector from the knowledge triples comprises:
generating a knowledge vector representation from the knowledge triples by a variational self-encoder;
generating, by a position encoder, a user portrait representation vector by embedding timing information into the knowledge vector representation:
where the quantities are, respectively, the dimension of the knowledge vector representation, the position of the currently processed post, and the dimension index.
9. A method of identity linking across a social network as recited in claim 3, wherein the loss function of the network model is:
where the two terms are the post-level loss and the user-level loss, respectively;
where the post-level loss sums over the number of sample data, the sample data types comprising positive sample data and negative sample data, and compares the true sample type of each sample data with the corresponding sample type prediction;
where the user-level loss uses the weight matrices and bias values of its activation functions, the post similarity distribution, and the user portrait representation vectors of the two users.
10. An identity linking apparatus across a social network, the apparatus comprising:
The target acquisition module is used for acquiring two target text posts in different social networks;
The identity link module is used for preprocessing the target text posts, inputting the preprocessed target text posts into a trained network model, and obtaining whether the two target text posts belong to the same user;
wherein the training of the network model comprises:
Acquiring each text post in different social networks;
Preprocessing each text post to generate a text data set;
dividing the text data set into a training set, a verification set and a test set;
Constructing a network model based on multi-angle text information, and training the network model through the training set; after each round of training is completed, the verification set is used for model screening and the optimal network model is retained; the test set is then used for model testing to obtain the performance of the network model.
CN202410289109.5A 2024-03-14 2024-03-14 Cross-social network identity linking method and device Active CN117892019B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202410289109.5A CN117892019B (en) 2024-03-14 2024-03-14 Cross-social network identity linking method and device

Publications (2)

Publication Number Publication Date
CN117892019A true CN117892019A (en) 2024-04-16
CN117892019B CN117892019B (en) 2024-05-14

Family

ID=90652048

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202410289109.5A Active CN117892019B (en) 2024-03-14 2024-03-14 Cross-social network identity linking method and device

Country Status (1)

Country Link
CN (1) CN117892019B (en)

Citations (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20210365444A1 (en) * 2020-05-20 2021-11-25 Beijing Baidu Netcom Science And Technology Co., Ltd. Method and apparatus for processing dataset
CN114169449A (en) * 2021-12-10 2022-03-11 同济大学 Cross-social network user identity matching method
CN114663245A (en) * 2022-03-16 2022-06-24 南京信息工程大学 Cross-social network identity matching method
CN114741515A (en) * 2022-04-25 2022-07-12 西安交通大学 Social network user attribute prediction method and system based on graph generation
CN115659966A (en) * 2022-10-29 2023-01-31 福州大学 Rumor detection method and system based on dynamic heteromorphic graph and multi-level attention
CN116776193A (en) * 2023-05-17 2023-09-19 广州大学 Method and device for associating virtual identities across social networks based on attention mechanism


Non-Patent Citations (2)

* Cited by examiner, † Cited by third party
Title
Yang Yizhuo; Yu Hongtao; Huang Ruiyang; Liu Zhengming: "Cross-social-network user identity matching based on fused representation learning", Computer Engineering, no. 09, 15 September 2018 (2018-09-15) *
Luo Liang; Wang Wenxian; Zhong Jie; Wang Haizhou: "Research on entity user association technology across social networks", Netinfo Security, no. 02, 10 February 2017 (2017-02-10) *

Also Published As

Publication number Publication date
CN117892019B (en) 2024-05-14

Similar Documents

Publication Publication Date Title
US11514247B2 (en) Method, apparatus, computer device and readable medium for knowledge hierarchical extraction of a text
CN112199608B (en) Social media rumor detection method based on network information propagation graph modeling
CN104573028B (en) Realize the method and system of intelligent answer
CN110428820B (en) Chinese and English mixed speech recognition method and device
CN113515634B (en) Social media rumor detection method and system based on hierarchical heterogeneous graph neural network
CN109766432A (en) A kind of Chinese abstraction generating method and device based on generation confrontation network
Yang et al. Rits: Real-time interactive text steganography based on automatic dialogue model
WO2022134834A1 (en) Potential event predicting method, apparatus and device, and storage medium
CN110909230A (en) Network hotspot analysis method and system
CN108763211A (en) The automaticabstracting and system of knowledge are contained in fusion
CN117272142A (en) Log abnormality detection method and system and electronic equipment
CN112882899B (en) Log abnormality detection method and device
CN114118058A (en) Emotion analysis system and method based on fusion of syntactic characteristics and attention mechanism
CN114330483A (en) Data processing method, model training method, device, equipment and storage medium
CN117633196A (en) Question-answering model construction method and project question-answering method
CN112966296A (en) Sensitive information filtering method and system based on rule configuration and machine learning
CN117892019B (en) Cross-social network identity linking method and device
CN117113973A (en) Information processing method and related device
CN111008329A (en) Page content recommendation method and device based on content classification
CN115455945A (en) Entity-relationship-based vulnerability data error correction method and system
CN115169293A (en) Text steganalysis method, system, device and storage medium
CN111401067B (en) Honeypot simulation data generation method and device
CN114357160A (en) Early rumor detection method and device based on generation propagation structure characteristics
He et al. Case Study: Quora Question Pairs
CN117216193B (en) Controllable text generation method and device based on large language model

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant