CN117892019A - Cross-social network identity linking method and device - Google Patents


Info

Publication number
CN117892019A
Authority
CN
China
Legal status: Granted
Application number
CN202410289109.5A
Other languages
Chinese (zh)
Other versions
CN117892019B
Inventor
黄锐 (Huang Rui)
马延淮 (Ma Yanhuai)
彭可兴 (Peng Kexing)
Current Assignee
Nanjing University of Information Science and Technology
Original Assignee
Nanjing University of Information Science and Technology
Priority date
Filing date
Publication date
Application filed by Nanjing University of Information Science and Technology
Priority to CN202410289109.5A
Publication of CN117892019A
Application granted
Publication of CN117892019B

Classifications

    • G06F16/9558 — Information retrieval; retrieval from the web using information identifiers; details of hyperlinks; management of linked annotations
    • G06F40/166 — Handling natural language data; text processing; editing
    • G06F40/30 — Handling natural language data; semantic analysis
    • G06N3/0442 — Neural networks; recurrent networks characterised by memory or gating, e.g. long short-term memory [LSTM] or gated recurrent units [GRU]
    • G06N3/045 — Neural networks; combinations of networks
    • G06N3/08 — Neural networks; learning methods
    • G06Q50/01 — ICT specially adapted for specific business sectors; social networking

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • General Physics & Mathematics (AREA)
  • General Health & Medical Sciences (AREA)
  • General Engineering & Computer Science (AREA)
  • Health & Medical Sciences (AREA)
  • Artificial Intelligence (AREA)
  • Computational Linguistics (AREA)
  • Data Mining & Analysis (AREA)
  • Computing Systems (AREA)
  • Evolutionary Computation (AREA)
  • Biomedical Technology (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Molecular Biology (AREA)
  • Mathematical Physics (AREA)
  • Software Systems (AREA)
  • Biophysics (AREA)
  • Databases & Information Systems (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Business, Economics & Management (AREA)
  • Economics (AREA)
  • Marketing (AREA)
  • Primary Health Care (AREA)
  • Strategic Management (AREA)
  • Tourism & Hospitality (AREA)
  • General Business, Economics & Management (AREA)
  • Human Resources & Organizations (AREA)
  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)
  • Machine Translation (AREA)

Abstract

The invention discloses a cross-social-network identity linking method and device. The method comprises: acquiring two target text posts from different social networks; preprocessing the target text posts and inputting them into a trained network model to determine whether the two posts belong to the same user. Training of the network model comprises: acquiring text posts from different social networks; preprocessing each text post to generate a text data set; dividing the text data set into a training set, a verification set, and a test set; constructing a network model based on multi-angle text information and training it on the training set; after each round of training, performing model screening with the verification set, retaining the optimal network model, and performing model testing with the test set to obtain the performance of the network model. Compared with the prior art, the method and device improve the accuracy and stability of identity linking.

Description

Cross-social network identity linking method and device
Technical Field
The invention relates to a method and a device for linking identities across social networks, and belongs to the technical field of network information.
Background
The rise of social networking platforms has brought a diversity of services, and people tend to register multiple accounts on different social networks. However, some malicious users engage in illegal activities on networks without a real-name system, where their real identities remain hidden.
Cross-social-network identity linking can recover key information about malicious users, such as real names, but malicious users may forge their registration details. The prior art typically performs identity linking with user attribute information, such as users' social relationships and profile data. Chinese patent application CN202110607064.8, "A user identity association method integrating multi-modal information and weight tensors", and Chinese patent application CN202110148895.3, "A social network user identity association method integrating user characteristics and embedding learning", both use multiple kinds of user information to derive user features. However, increasingly strict privacy-protection policies on social networking platforms make user attributes difficult to obtain. Other techniques use user-generated content (UGC) for identity linking; UGC is easier to obtain than attribute information because it is published openly as part of users' personal behavior, so using it does not violate platform privacy policies. Due to the diversity and heterogeneity of UGC, however, modeling users' intrinsic features from it is limited, and existing research neglects identity linking that relies on homogeneous UGC (e.g., text alone). Chinese patent application CN202010376438.5, "Identity matching method and device", discloses an identity matching framework, but the framework is abstract: it does not specify how to extract user features accurately from available user information, nor does it design a method for a specific identity-matching environment, such as a social network with a high privacy-protection level.
Existing identity linking methods generally rely on user information that is difficult to obtain and easy to falsify, which increases the instability of identity linking technology.
Disclosure of Invention
The invention aims to overcome the above defects of the prior art by providing a cross-social-network identity linking method and device, solving the technical problems of instability and poor performance of existing identity linking methods.
In order to achieve the above purpose, the invention is realized by adopting the following technical scheme:
in a first aspect, the present invention provides a method for linking identities across social networks, including:
acquiring two target text posts from different social networks;
preprocessing the target text posts and inputting them into a trained network model to determine whether the two target text posts belong to the same user;
wherein the training of the network model comprises:
Acquiring each text post in different social networks;
Preprocessing each text post to generate a text data set;
dividing the text data set into a training set, a verification set and a test set;
Constructing a network model based on multi-angle text information, and training the network model through the training set; after each round of training is completed, the verification set is used for model screening, an optimal network model is reserved, the test set is used for model testing, and the effect of the network model is obtained.
Optionally, the preprocessing each text post to generate a text data set includes:
preprocessing the text posts to generate sample data, wherein the preprocessing comprises the steps of deleting links in the text posts and replacing emoji expressions with corresponding characters;
taking account pairs known to belong to the same natural person in two different social networks as positive-sample cross-network account pairs, and randomly generating negative-sample cross-network account pairs from the positive-sample cross-network account pairs;
Taking the sample data corresponding to the positive sample cross-network account pair as positive sample data, and taking the sample data corresponding to the negative sample cross-network account pair as negative sample data;
The same number of positive and negative sample data is combined into a text data set.
Optionally, the network model comprises a post-level vector characterization module, a user-level vector generation module, and a total similarity distribution generation module;
the post-level vector characterization module comprises a topic characterization module, a shallow semantic characterization module, and a post similarity distribution generation module; the topic characterization module comprises a RoBERTa language model, a variational self-encoder, and a decoder, and generates topic vector representations from text posts; the shallow semantic characterization module comprises a GloVe word embedding tool and a BiLSTM network model, and generates shallow semantic vector representations from text posts; the post similarity distribution generation module calculates the similarity containing time factors between topic vector representations and between shallow semantic vector representations, and calculates the post similarity distribution from these similarities;
the user-level vector generation module comprises a knowledge triplet extraction module, an encoder, and a multi-layer perceptron; the knowledge triplet extraction module comprises an SBERT model and matches knowledge triples from the text posts against an open-source knowledge graph library; the encoder generates a user portrait representation vector from the knowledge triples; the multi-layer perceptron generates the user-level similarity distribution from the user portrait representation vectors;
the total similarity distribution generation module generates the total similarity distribution from the post similarity distribution and the user-level similarity distribution.
Optionally, the generating a topic vector representation from the text post comprises:
converting the j-th text post t_i^j of the i-th user into a post vector representation v_i^j through the RoBERTa language model;
generating, through the variational self-encoder, the probability distribution of the topic vector representation from the post vector representations:
z_i^j ~ N(MLP(c_i^j)),
where N(·) is a Gaussian distribution function parameterized by a multi-layer perceptron MLP(·) inside the variational self-encoder, z_i^j is the topic vector representation of text post t_i^j, (v_i^1, …, v_i^j) are the post vector representations of the j-th text post of the i-th user and all text posts preceding it, and c_i^j is the feature vector of text post t_i^j:
c_i^j = Att(v_i^j, (v_i^1, …, v_i^j), (v_i^1, …, v_i^j)),
where Att(·) is the attention mechanism function, whose three arguments serve as the query vector, key vectors, and value vectors, respectively.
Optionally, the decoder is configured to reconstruct the post vector representation from the probability distribution of the topic vector representation;
the variational self-encoder and the decoder are trained and optimized with the goal of minimizing the gap between the original post vector representation v_i^j and the reconstructed post vector representation v̂_i^j.
Optionally, the generating a shallow semantic vector representation from the text post comprises:
converting the k-th word in the j-th text post t_i^j of the i-th user into a word vector representation w_i^{j,k} through the GloVe word embedding tool;
generating, through the BiLSTM network model, the shallow semantic vector representation s_i^j of text post t_i^j from the word vector representations,
where s_i^j aggregates the shallow semantic vectors h_i^{j,k} of the n words in text post t_i^j, and h_i^{j,k} concatenates the outputs of the forward and backward LSTM models for the k-th word:
h_i^{j,k} = [h→_i^{j,k} ; h←_i^{j,k}],
where h→_i^{j,k} is the vector obtained by the forward LSTM model for the first k words in text post t_i^j, computed through the update gate, memory cell state, and reset gate of the forward LSTM model, with W_u, b_u the weight matrix and bias value corresponding to the update gate, W_c, b_c the weight matrix and bias value corresponding to the memory cell, W_r, b_r the weight matrix and bias value corresponding to the reset gate, ⊙ the element-wise multiplication operation, and σ(·) the Sigmoid activation function;
h←_i^{j,k} is the vector obtained analogously by the backward LSTM model for the first k words in text post t_i^j, through its own update gate, memory cell state, and reset gate with the corresponding weight matrices and bias values.
Optionally, the similarity containing the time factor between topic vector representations is:
sim_z(t_i^j, t_{i'}^{j'}) = λ(t_i^j, t_{i'}^{j'}) · cos(z_i^j, z_{i'}^{j'}),
where cos(z_i^j, z_{i'}^{j'}) = (z_i^j · z_{i'}^{j'}) / (‖z_i^j‖ ‖z_{i'}^{j'}‖) is the cosine similarity between the topic vector representation z_i^j of the j-th text post of the i-th user and the topic vector representation z_{i'}^{j'} of the j'-th text post of the i'-th user, and λ(t_i^j, t_{i'}^{j'}) is the time-relevance weight between the two text posts;
the similarity containing the time factor between shallow semantic vector representations is, analogously:
sim_s(t_i^j, t_{i'}^{j'}) = λ(t_i^j, t_{i'}^{j'}) · cos(s_i^j, s_{i'}^{j'}),
where cos(s_i^j, s_{i'}^{j'}) is the cosine similarity between the shallow semantic vector representations s_i^j and s_{i'}^{j'} of the two text posts.
The calculating the post similarity distribution according to the similarities comprises:
calculating the post similarity from the similarity set containing time factors between topic vector representations and the similarity set containing time factors between shallow semantic vector representations, taken over the m text posts of the i-th user and the m' text posts of the i'-th user, combined according to a confidence weight α;
the confidence weight α is computed from the weight matrix and bias value W_z, b_z corresponding to the topic vector representations, the weight matrix and bias value W_s, b_s corresponding to the shallow semantic vector representations, the attention matrix parameter M, and the vector concatenation operation [·;·];
calculating the post similarity distribution D_p from the post similarities, where D_p(t_i^j) is the post similarity distribution of the j-th text post t_i^j of the i-th user and m is the number of text posts of the i-th user.
Optionally, the generating a user portrait representation vector from the knowledge triples comprises:
generating a knowledge vector representation from the knowledge triples through a variational self-encoder;
generating, through a position encoder, the user portrait vector representation by embedding timing information into the knowledge vector representation:
PE(pos, 2d) = sin(pos / 10000^{2d/D}), PE(pos, 2d+1) = cos(pos / 10000^{2d/D}),
where D is the dimension of the knowledge vector representation, pos is the position of the currently processed post, and d is the dimension index.
Optionally, the loss function of the network model is:
L = L_post + L_user,
where L_post is the post-level loss and L_user is the user-level loss;
L_post is a classification loss over the N sample data, the sample types comprising positive sample data and negative sample data, where y_k is the sample type of the k-th sample datum and ŷ_k is the predicted sample type for the k-th sample datum;
L_post is computed from the weight matrix and bias value of its activation function and the post similarity distribution D_p; L_user is computed from the user portrait representation vectors of the two users and the weight matrix of its activation function.
In a second aspect, the present invention provides an identity linking device across a social network, the device comprising:
the target acquisition module, configured to acquire two target text posts from different social networks;
the identity linking module, configured to preprocess the target text posts, input them into a trained network model, and determine whether the two target text posts belong to the same user;
wherein the training of the network model comprises:
Acquiring each text post in different social networks;
Preprocessing each text post to generate a text data set;
dividing the text data set into a training set, a verification set and a test set;
Constructing a network model based on multi-angle text information, and training the network model through the training set; after each round of training is completed, the verification set is used for model screening, an optimal network model is reserved, the test set is used for model testing, and the effect of the network model is obtained.
Compared with the prior art, the invention has the following beneficial effects:
the cross-social-network identity linking method and device use the multi-angle text information of a user to enrich the user's feature information for the cross-social-network identity matching task, representing user features simultaneously with topic information, shallow semantic information, and knowledge representation information to judge whether accounts on two different social networks belong to the same natural person; compared with the prior art, this achieves better accuracy and stability.
Drawings
FIG. 1 is a schematic flow chart of an identity linking method across social networks provided by an embodiment of the invention;
FIG. 2 is a schematic diagram of the structure and principle of a network model according to an embodiment of the present invention;
FIG. 3 is a comparative schematic diagram of experimental results provided in the examples of the present invention.
Detailed Description
The invention is further described below with reference to the accompanying drawings. The following examples are only for more clearly illustrating the technical aspects of the present invention, and are not intended to limit the scope of the present invention.
Embodiment one:
As shown in fig. 1, an embodiment of the present invention provides an identity linking method across social networks, including the following steps:
Step S1, acquiring two target text posts from different social networks.
Step S2, preprocessing the target text posts, inputting them into a trained network model, and determining whether the two target text posts belong to the same user.
Wherein, training of the network model comprises:
And S21, acquiring each text post in different social networks.
S22, preprocessing each text post to generate a text data set; the method specifically comprises the following steps:
Step S221, preprocessing the text posts to generate sample data, wherein the preprocessing comprises deleting links in the text posts and replacing emoji expressions with corresponding words;
Step S222, taking account pairs known to belong to the same natural person in two different social networks as positive-sample cross-network account pairs, and randomly generating negative-sample cross-network account pairs from the positive-sample cross-network account pairs;
Step S223, taking the sample data corresponding to the positive sample cross-network account pair as positive sample data, and taking the sample data corresponding to the negative sample cross-network account pair as negative sample data;
Step S224, merging the same number of positive sample data and negative sample data into a text data set.
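The preprocessing and pair-construction steps above (S221–S223) can be sketched as follows. The link-matching regex, the emoji-to-word table, and the `make_pairs` helper are illustrative assumptions, not the patent's exact implementation.

```python
import random
import re

# Hypothetical emoji-to-word table; the patent only says emoji are replaced
# with corresponding words, so this mapping is an assumption.
EMOJI_WORDS = {"\U0001F600": " smile ", "\U0001F44D": " thumbs up "}

def preprocess_post(text: str) -> str:
    """Delete links and replace emoji with corresponding words (step S221)."""
    text = re.sub(r"https?://\S+", "", text)   # delete links in the post
    for emoji, word in EMOJI_WORDS.items():
        text = text.replace(emoji, word)
    return " ".join(text.split())              # normalize whitespace

def make_pairs(positive_pairs, accounts_b, seed=0):
    """Positive pairs are known same-person account pairs across the two
    networks; one negative pair is randomly drawn per positive pair
    (steps S222-S223). Assumes accounts_b contains non-matching accounts."""
    rng = random.Random(seed)
    positives = set(positive_pairs)
    negatives = []
    for a, _ in positive_pairs:
        b = rng.choice(accounts_b)
        while (a, b) in positives:             # re-draw if it is a known match
            b = rng.choice(accounts_b)
        negatives.append((a, b))
    return list(positive_pairs), negatives
```

Equal numbers of positive and negative pairs then form the text data set (step S224).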
Step S23, dividing the text data set into a training set, a verification set and a test set; in this embodiment, the dividing ratio is 8:1:1.
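The 8:1:1 split of step S23 can be sketched as below; the shuffle seed is an assumption for reproducibility.

```python
import random

def split_dataset(samples, seed=0):
    """Shuffle and split the text data set into training,
    verification, and test sets at an 8:1:1 ratio."""
    data = list(samples)
    random.Random(seed).shuffle(data)
    n = len(data)
    n_train, n_val = int(n * 0.8), int(n * 0.1)
    train = data[:n_train]
    val = data[n_train:n_train + n_val]
    test = data[n_train + n_val:]
    return train, val, test
```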
S24, constructing a network model based on multi-angle text information, and training the network model through a training set; after each round of training is completed, a verification set is used for model screening, an optimal network model is reserved, a test set is used for model testing, and the effect of the network model is obtained.
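A minimal sketch of the per-round screening in step S24, keeping the model state that scores best on the verification set. `train_step` and `evaluate` are hypothetical placeholders for the real training and verification routines.

```python
def train_with_selection(train_step, evaluate, epochs):
    """After each epoch, screen on the verification set and retain the
    best-scoring model state (step S24)."""
    best_state, best_score = None, float("-inf")
    for epoch in range(epochs):
        state = train_step(epoch)      # one round of training
        score = evaluate(state)        # model screening on verification set
        if score > best_score:
            best_state, best_score = state, score
    return best_state, best_score
```

The retained `best_state` is then evaluated once on the test set to obtain the model's performance.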
As shown in fig. 2, the network model includes a post-level vector characterization module, a user-level vector generation module, and a total similarity distribution generation module;
the post-level vector characterization module comprises a topic characterization module, a shallow semantic characterization module, and a post similarity distribution generation module; the topic characterization module comprises a RoBERTa language model, a variational self-encoder, and a decoder, and generates topic vector representations from the text posts; the shallow semantic characterization module comprises a GloVe word embedding tool and a BiLSTM network model, and generates shallow semantic vector representations from the text posts; the post similarity distribution generation module calculates the similarity containing time factors between topic vector representations and between shallow semantic vector representations, and calculates the post similarity distribution from these similarities;
the user-level vector generation module comprises a knowledge triplet extraction module, an encoder, and a multi-layer perceptron; the knowledge triplet extraction module comprises an SBERT model and matches knowledge triples from the text posts against an open-source knowledge graph library; the encoder generates a user portrait representation vector from the knowledge triples; the multi-layer perceptron generates the user-level similarity distribution from the user portrait representation vectors;
the total similarity distribution generation module generates the total similarity distribution D from the post similarity distribution D_p and the user-level similarity distribution D_u; the total similarity distribution D is the final identity-link prediction result.
Specifically, (1) generating a topic vector representation from a text post comprises:
converting the j-th text post t_i^j of the i-th user into a post vector representation v_i^j through the RoBERTa language model;
generating, through the variational self-encoder, the probability distribution of the topic vector representation from the post vector representations:
z_i^j ~ N(MLP(c_i^j)),
where N(·) is a Gaussian distribution function parameterized by a multi-layer perceptron MLP(·) inside the variational self-encoder, z_i^j is the topic vector representation of text post t_i^j, (v_i^1, …, v_i^j) are the post vector representations of the j-th text post of the i-th user and all text posts preceding it, and c_i^j is the feature vector of text post t_i^j:
c_i^j = Att(v_i^j, (v_i^1, …, v_i^j), (v_i^1, …, v_i^j)),
where Att(·) is the attention mechanism function, whose three arguments serve as the query vector, key vectors, and value vectors, respectively.
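The feature-vector step above, attending over the current and preceding post vectors, can be sketched as plain dot-product attention; the 1/√d scaling is a conventional assumption, not stated in the patent text.

```python
import math

def softmax(xs):
    """Numerically stable softmax over a list of scores."""
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    s = sum(exps)
    return [e / s for e in exps]

def attention(query, keys, values):
    """Att(q, K, V): the feature vector of a post is a softmax-weighted
    sum of the value vectors, weighted by query-key dot products."""
    d = len(query)
    scores = [sum(q * k for q, k in zip(query, key)) / math.sqrt(d)
              for key in keys]
    weights = softmax(scores)
    return [sum(w * v[i] for w, v in zip(weights, values))
            for i in range(len(values[0]))]
```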
The decoder is configured to reconstruct the post vector representation from the probability distribution of the topic vector representation;
the variational self-encoder and the decoder are trained and optimized with the goal of minimizing the gap between the original post vector representation v_i^j and the reconstructed post vector representation v̂_i^j.
Specifically, (2) generating a shallow semantic vector representation from the text post comprises:
converting the k-th word in the j-th text post t_i^j of the i-th user into a word vector representation w_i^{j,k} through the GloVe word embedding tool;
generating, through the BiLSTM network model, the shallow semantic vector representation s_i^j of text post t_i^j from the word vector representations,
where s_i^j aggregates the shallow semantic vectors h_i^{j,k} of the n words in text post t_i^j, and h_i^{j,k} concatenates the outputs of the forward and backward LSTM models for the k-th word:
h_i^{j,k} = [h→_i^{j,k} ; h←_i^{j,k}],
where h→_i^{j,k} is the vector obtained by the forward LSTM model for the first k words in text post t_i^j, computed through the update gate, memory cell state, and reset gate of the forward LSTM model, with W_u, b_u the weight matrix and bias value corresponding to the update gate, W_c, b_c the weight matrix and bias value corresponding to the memory cell, W_r, b_r the weight matrix and bias value corresponding to the reset gate, ⊙ the element-wise multiplication operation, and σ(·) the Sigmoid activation function;
h←_i^{j,k} is the vector obtained analogously by the backward LSTM model for the first k words in text post t_i^j, through its own update gate, memory cell state, and reset gate with the corresponding weight matrices and bias values.
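One step of the forward LSTM can be sketched as below. The patent's "update gate / memory cell / reset gate" wording is rendered here with the standard input, forget, and output gates of an LSTM cell, and the scalar weights in `p` are illustrative assumptions.

```python
import math

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def lstm_step(x, h_prev, c_prev, p):
    """One forward-LSTM step on scalar input x with scalar parameters p."""
    i = sigmoid(p["W_i"] * x + p["U_i"] * h_prev + p["b_i"])    # input gate
    f = sigmoid(p["W_f"] * x + p["U_f"] * h_prev + p["b_f"])    # forget gate
    o = sigmoid(p["W_o"] * x + p["U_o"] * h_prev + p["b_o"])    # output gate
    g = math.tanh(p["W_c"] * x + p["U_c"] * h_prev + p["b_c"])  # candidate cell
    c = f * c_prev + i * g          # new memory cell state (element-wise mix)
    h = o * math.tanh(c)            # new hidden state
    return h, c
```

The backward LSTM applies the same step to the word sequence in reverse order with its own parameters.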
(3) The similarity containing the time factor between topic vector representations is:
sim_z(t_i^j, t_{i'}^{j'}) = λ(t_i^j, t_{i'}^{j'}) · cos(z_i^j, z_{i'}^{j'}),
where cos(z_i^j, z_{i'}^{j'}) = (z_i^j · z_{i'}^{j'}) / (‖z_i^j‖ ‖z_{i'}^{j'}‖) is the cosine similarity between the topic vector representation z_i^j of the j-th text post of the i-th user and the topic vector representation z_{i'}^{j'} of the j'-th text post of the i'-th user, and λ(t_i^j, t_{i'}^{j'}) is the time-relevance weight between the two text posts;
the similarity containing the time factor between shallow semantic vector representations is, analogously:
sim_s(t_i^j, t_{i'}^{j'}) = λ(t_i^j, t_{i'}^{j'}) · cos(s_i^j, s_{i'}^{j'}),
where cos(s_i^j, s_{i'}^{j'}) is the cosine similarity between the shallow semantic vector representations s_i^j and s_{i'}^{j'} of the two text posts.
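The time-factored similarity can be sketched as cosine similarity scaled by a temporal weight. The exponential decay used for λ below is an assumed stand-in, since the patent's exact time-relevance formula is not reproduced in the text.

```python
import math

def cosine(u, v):
    """Cosine similarity between two vectors of equal dimension."""
    dot = sum(a * b for a, b in zip(u, v))
    nu = math.sqrt(sum(a * a for a in u))
    nv = math.sqrt(sum(b * b for b in v))
    return dot / (nu * nv)

def timed_similarity(u, v, t_u, t_v, decay=0.1):
    """Cosine similarity between two post vectors, down-weighted as the
    posting times move apart (assumed exponential-decay time weight)."""
    w = math.exp(-decay * abs(t_u - t_v))
    return w * cosine(u, v)
```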
Calculating the post similarity distribution from the similarities comprises:
calculating the post similarity from the similarity set containing time factors between topic vector representations and the similarity set containing time factors between shallow semantic vector representations, taken over the m text posts of the i-th user and the m' text posts of the i'-th user, combined according to a confidence weight α;
the confidence weight α is computed from the weight matrix and bias value W_z, b_z corresponding to the topic vector representations, the weight matrix and bias value W_s, b_s corresponding to the shallow semantic vector representations, the attention matrix parameter M, and the vector concatenation operation [·;·];
the post similarity distribution D_p is then calculated from the post similarities, where D_p(t_i^j) is the post similarity distribution of the j-th text post t_i^j of the i-th user and m is the number of text posts of the i-th user.
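The fusion-and-normalization step can be sketched as below. The fixed weight `alpha` stands in for the learned confidence weight, and softmax normalization of the fused similarities into a distribution is an assumption.

```python
import math

def post_similarity_distribution(topic_sims, shallow_sims, alpha=0.5):
    """Fuse topic-level and shallow-semantic similarities per post pair
    (alpha plays the role of the learned confidence weight, here fixed)
    and normalize the result into a distribution with softmax."""
    fused = [alpha * t + (1.0 - alpha) * s
             for t, s in zip(topic_sims, shallow_sims)]
    m = max(fused)
    exps = [math.exp(f - m) for f in fused]   # stable softmax
    z = sum(exps)
    return [e / z for e in exps]
```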
Specifically, (4) generating the user portrait representation vector from the knowledge triples comprises:
generating a knowledge vector representation from the knowledge triples through a variational self-encoder;
generating, through a position encoder, the user portrait vector representation by embedding timing information into the knowledge vector representation:
PE(pos, 2d) = sin(pos / 10000^{2d/D}), PE(pos, 2d+1) = cos(pos / 10000^{2d/D}),
where D is the dimension of the knowledge vector representation, pos is the position of the currently processed post, and d is the dimension index.
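The position encoder described by these variables matches the standard sinusoidal formulation; a sketch:

```python
import math

def positional_encoding(pos, dim):
    """Sinusoidal position encoding for a post at position pos, producing
    a vector of dimension dim (sin on even indices, cos on odd indices)."""
    pe = []
    for i in range(dim):
        angle = pos / (10000 ** (2 * (i // 2) / dim))
        pe.append(math.sin(angle) if i % 2 == 0 else math.cos(angle))
    return pe

def embed_timing(vec, pos):
    """Embed timing information by adding the position encoding to the
    knowledge vector representation."""
    pe = positional_encoding(pos, len(vec))
    return [v + p for v, p in zip(vec, pe)]
```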
Specifically, (5) the loss function of the network model is:
L = L_post + L_user,
where L_post is the post-level loss and L_user is the user-level loss;
L_post is a classification loss over the N sample data, the sample types comprising positive sample data and negative sample data, where y_k is the sample type of the k-th sample datum and ŷ_k is the predicted sample type for the k-th sample datum;
L_post is computed from the weight matrix and bias value of its activation function and the post similarity distribution D_p; L_user is computed from the user portrait representation vectors of the two users and the weight matrix of its activation function.
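Assuming the post-level and user-level terms are binary cross-entropies over same-user labels (the patent's exact formulas are not reproduced in the text), the total loss can be sketched as:

```python
import math

def bce(labels, probs, eps=1e-12):
    """Binary cross-entropy over N samples (same-user = 1, different = 0)."""
    n = len(labels)
    return -sum(y * math.log(p + eps) + (1 - y) * math.log(1 - p + eps)
                for y, p in zip(labels, probs)) / n

def total_loss(labels, post_probs, user_probs, lam=1.0):
    """L = L_post + L_user; equal weighting (lam = 1) is an assumption."""
    return bce(labels, post_probs) + lam * bce(labels, user_probs)
```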
As shown in fig. 3, the method of this embodiment is compared experimentally with prior-art identity linking methods, whose models include the BPR-DAE, MV-URL, DLHD, MSUIL-V, MNA-V, UserNet, and UserNet-C models; the model of this embodiment is denoted TEAKM. As fig. 3 shows, the accuracy of the network model trained by the invention on the test set is higher than that of the other models, indicating that the invention outperforms existing models on the identity linking task.
Embodiment two:
the embodiment of the invention provides an identity linking device across a social network, which comprises:
The target acquisition module is used for acquiring two target text posts in different social networks;
The identity link module is used for preprocessing the target text posts, inputting the preprocessed target text posts into a trained network model, and obtaining whether the two target text posts belong to the same user;
Wherein, training of the network model comprises:
Acquiring each text post in different social networks;
Preprocessing each text post to generate a text data set;
dividing the text data set into a training set, a verification set and a test set;
constructing a network model based on multi-angle text information, and training the network model through the training set; after each round of training is completed, the verification set is used for model screening and the optimal network model is retained; the test set is then used for model testing to obtain the performance of the network model.
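The splitting, per-round training, and verification-set screening described above could be sketched as follows (the `fit`/`evaluate`/`state` model interface, the 8:1:1 split, and the accuracy metric are placeholder assumptions, not the patent's actual implementation):

```python
import random

def split_dataset(samples, train_frac=0.8, val_frac=0.1, seed=42):
    """Shuffle the text data set and divide it into training,
    verification and test sets (8:1:1 split is an assumption)."""
    rng = random.Random(seed)
    data = samples[:]
    rng.shuffle(data)
    n_train = int(len(data) * train_frac)
    n_val = int(len(data) * val_frac)
    return data[:n_train], data[n_train:n_train + n_val], data[n_train + n_val:]

def train_with_screening(model, train_set, val_set, epochs):
    """Train for several rounds; after each round, screen on the
    verification set and retain the best-performing model state."""
    best_acc, best_state = -1.0, None
    for _ in range(epochs):
        model.fit(train_set)           # one round of training
        acc = model.evaluate(val_set)  # model screening on the verification set
        if acc > best_acc:             # keep the optimal network model
            best_acc, best_state = acc, model.state()
    return best_state, best_acc
```

After screening, the retained state would be evaluated once on the held-out test set to report the final performance.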
In summary, the characteristics of the text information are represented from two angles: the post level and the user level. Post-level information comprises topic information and shallow semantic information, while the user level uses knowledge to represent semantic information. Using easily acquired post information, whether accounts in two different social networks belong to the same natural person can be rapidly judged, key information of illegal users can be further confirmed in a cross-network environment, and the problem of inaccurate matching caused by false user social information or user archive information is solved.
It will be appreciated by those skilled in the art that embodiments of the present application may be provided as a method, system, or computer program product. Accordingly, the present application may take the form of an entirely hardware embodiment, an entirely software embodiment or an embodiment combining software and hardware aspects. Furthermore, the present application may take the form of a computer program product embodied on one or more computer-usable storage media (including, but not limited to, disk storage, CD-ROM, optical storage, and the like) having computer-usable program code embodied therein.
The present application is described with reference to flowchart illustrations and/or block diagrams of methods, apparatus (systems) and computer program products according to embodiments of the application. It will be understood that each flow and/or block of the flowchart illustrations and/or block diagrams, and combinations of flows and/or blocks in the flowchart illustrations and/or block diagrams, can be implemented by computer program instructions. These computer program instructions may be provided to a processor of a general purpose computer, special purpose computer, embedded processor, or other programmable data processing apparatus to produce a machine, such that the instructions, which execute via the processor of the computer or other programmable data processing apparatus, create means for implementing the functions specified in the flowchart flow or flows and/or block diagram block or blocks.
These computer program instructions may also be stored in a computer-readable memory that can direct a computer or other programmable data processing apparatus to function in a particular manner, such that the instructions stored in the computer-readable memory produce an article of manufacture including instruction means which implement the function specified in the flowchart flow or flows and/or block diagram block or blocks.
These computer program instructions may also be loaded onto a computer or other programmable data processing apparatus to cause a series of operational steps to be performed on the computer or other programmable apparatus to produce a computer implemented process such that the instructions which execute on the computer or other programmable apparatus provide steps for implementing the functions specified in the flowchart flow or flows and/or block diagram block or blocks.
The foregoing is merely a preferred embodiment of the present invention, and it should be noted that modifications and variations could be made by those skilled in the art without departing from the technical principles of the present invention, and such modifications and variations should also be regarded as being within the scope of the invention.

Claims (10)

1. An identity linking method across a social network, comprising:
Acquiring two target text posts in different social networks;
after preprocessing the target text posts, inputting the target text posts into a trained network model, and acquiring whether the two target text posts belong to the same user;
wherein the training of the network model comprises:
Acquiring each text post in different social networks;
Preprocessing each text post to generate a text data set;
dividing the text data set into a training set, a verification set and a test set;
Constructing a network model based on multi-angle text information, and training the network model through the training set; after each round of training is completed, the verification set is used for model screening and the optimal network model is retained; the test set is then used for model testing to obtain the performance of the network model.
2. The method of claim 1, wherein preprocessing each text post to generate a text data set comprises:
preprocessing the text posts to generate sample data, wherein the preprocessing comprises the steps of deleting links in the text posts and replacing emoji expressions with corresponding characters;
Taking account pairs known to belong to the same natural person in two different social networks as positive-sample cross-network account pairs, and randomly generating negative-sample cross-network account pairs from the positive-sample cross-network account pairs;
Taking the sample data corresponding to the positive sample cross-network account pair as positive sample data, and taking the sample data corresponding to the negative sample cross-network account pair as negative sample data;
The same number of positive and negative sample data is combined into a text data set.
3. The method of claim 2, wherein the network model comprises a post-level vector characterization module, a user-level vector generation module, and a total similarity distribution generation module;
The post-level vector characterization module comprises a topic characterization module, a shallow semantic characterization module and a post similarity distribution generation module; the topic characterization module comprises a RoBERTa language model, a variational self-encoder and a decoder, and is used for generating topic vector representations according to text posts; the shallow semantic characterization module comprises a GloVe word embedding tool and a BiLSTM network model, and is used for generating shallow semantic vector representations according to text posts; the post similarity distribution generation module is used for calculating the similarity, incorporating time factors, between the topic vector representations and between the shallow semantic vector representations, and calculating the post similarity distribution according to these similarities;
The user-level vector generation module comprises a knowledge triplet extraction module, a user portrait vector generation module and a multi-layer perceptron; the knowledge triplet extraction module comprises an sbert model, and is used for matching knowledge triples according to the text posts and an open-source knowledge graph library; the user portrait vector generation module comprises an encoder, and is used for generating a user portrait representation vector according to the knowledge triples; the multi-layer perceptron is used for generating a user-level similarity distribution according to the user portrait representation vector;
the total similarity distribution generation module is used for generating a total similarity distribution according to the post similarity distribution and the user-level similarity distribution.
4. The method of identity linking across a social network of claim 3, wherein generating a post vector representation from text posts comprises:
Converting the j-th text post of the i-th user into a post vector representation through the RoBERTa language model;
Generating, by the variational self-encoder, a probability distribution of the topic vector representation from the post vector representation:
wherein the quantities are, respectively, the variational self-encoder, a multi-layer perceptron, a Gaussian distribution function, the topic vector representation of the text post, the post vector representations of the j-th text post of the i-th user and of all text posts preceding it, and the topic vector representation of the j-th text post of the i-th user; the remaining quantity is the feature vector of the text post:
where the post vector representation of the text post is supplied to the attention mechanism function as its query vector, key vector, and value vector, respectively.
5. The method of identity linking across a social network of claim 4, wherein the decoder is configured to reconstruct the post vector representation from the probability distribution of the topic vector representation;
training optimization of the variational self-encoder and the decoder is performed with the goal of minimizing the gap between the original post vector representation and the reconstructed post vector representation.
6. The method of identity linking across a social network of claim 3, wherein generating a shallow semantic vector representation from text posts comprises:
Converting the k-th word in the j-th text post of the i-th user into a word vector representation by the GloVe word embedding tool;
Generating, by the BiLSTM network model, a shallow semantic vector representation of the text post from the word vector representations:
where the quantities are, respectively, the shallow semantic vector representation of the text post, the number of words in the text post, and the shallow semantic vector representation of the k-th word in the text post:
where the vector is obtained by the forward LSTM model for the k-th word in the text post; the update gate, memory cell state, and reset gate of the forward LSTM model are, respectively:
wherein the quantities are, respectively, the weight matrix and bias value of the memory cell of the forward LSTM model, the element-wise multiplication operation, the weight matrix and bias value of the update gate of the forward LSTM model, the weight matrix and bias value of the reset gate of the forward LSTM model, and the Sigmoid activation function;
where the vector is obtained by the backward LSTM model for the k-th word in the text post; the update gate, memory cell state, and reset gate of the backward LSTM model are, respectively:
wherein the quantities are, respectively, the weight matrix and bias value of the memory cell of the backward LSTM model, the weight matrix and bias value of the update gate of the backward LSTM model, and the weight matrix and bias value of the reset gate of the backward LSTM model.
7. The method of claim 3, wherein the similarity, incorporating a time factor, between the topic vector representations is:
where the cosine similarity between the topic vector representations is:
wherein the quantities are the topic vector representation of the j-th text post of the i-th user and the topic vector representation of the l-th text post of the m-th user; the similarity between the topic vector representations of the two text posts is weighted by the time relevance between the text posts:
The similarity, incorporating the time factor, between the shallow semantic vector representations is:
where the cosine similarity between the shallow semantic vector representations is:
wherein the quantities are the shallow semantic vector representation of the j-th text post of the i-th user, the shallow semantic vector representation of the l-th text post of the m-th user, and the similarity between the shallow semantic vector representations of the two text posts;
The calculating the post similarity distribution according to the similarities comprises the following steps:
calculating a post similarity according to the similarities:
wherein the two sets are the set of time-factor similarities between the topic vector representations and the set of time-factor similarities between the shallow semantic vector representations, taken over the number of text posts of the i-th user; the confidence level is:
wherein the quantities are, respectively, the weight matrix and bias value corresponding to the topic vector representation, the weight matrix and bias value corresponding to the shallow semantic vector representation, the attention matrix parameter, and the vector concatenation operation;
Calculating a post similarity distribution according to the post similarities:
where the result is the post similarity distribution of the j-th text post of the i-th user, computed over the number of text posts of the i-th user.
8. The method for identity linking across a social network recited in claim 3, wherein
the generating a user portrait representation vector from the knowledge triples comprises:
generating a knowledge vector representation from the knowledge triples by a variational self-encoder;
generating, by a position encoder, a user portrait representation vector by embedding timing information into the knowledge vector representation:
where the quantities are, respectively, the dimension of the knowledge vector representation, the position of the currently processed post, and the dimension index.
9. A method of identity linking across a social network as recited in claim 3, wherein the loss function of the network model is:
where the two terms are the post-level loss and the user-level loss, respectively;
where the post-level loss sums over the number of sample data, the sample data types comprising positive sample data and negative sample data, and compares the true sample type of each sample data with the corresponding sample type prediction;
where the user-level loss uses the weight matrices and bias values of its activation functions, the post similarity distribution, and the user portrait representation vectors of the two users.
10. An identity linking apparatus across a social network, the apparatus comprising:
The target acquisition module is used for acquiring two target text posts in different social networks;
The identity link module is used for preprocessing the target text posts, inputting the preprocessed target text posts into a trained network model, and obtaining whether the two target text posts belong to the same user;
wherein the training of the network model comprises:
Acquiring each text post in different social networks;
Preprocessing each text post to generate a text data set;
dividing the text data set into a training set, a verification set and a test set;
Constructing a network model based on multi-angle text information, and training the network model through the training set; after each round of training is completed, the verification set is used for model screening and the optimal network model is retained; the test set is then used for model testing to obtain the performance of the network model.
CN202410289109.5A 2024-03-14 2024-03-14 Cross-social network identity linking method and device Active CN117892019B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202410289109.5A CN117892019B (en) 2024-03-14 2024-03-14 Cross-social network identity linking method and device

Publications (2)

Publication Number Publication Date
CN117892019A true CN117892019A (en) 2024-04-16
CN117892019B CN117892019B (en) 2024-05-14

Family

ID=90652048

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202410289109.5A Active CN117892019B (en) 2024-03-14 2024-03-14 Cross-social network identity linking method and device

Country Status (1)

Country Link
CN (1) CN117892019B (en)

Citations (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20210365444A1 (en) * 2020-05-20 2021-11-25 Beijing Baidu Netcom Science And Technology Co., Ltd. Method and apparatus for processing dataset
CN114169449A (en) * 2021-12-10 2022-03-11 同济大学 Cross-social network user identity matching method
CN114663245A (en) * 2022-03-16 2022-06-24 南京信息工程大学 Cross-social network identity matching method
CN114741515A (en) * 2022-04-25 2022-07-12 西安交通大学 Social network user attribute prediction method and system based on graph generation
CN115659966A (en) * 2022-10-29 2023-01-31 福州大学 Rumor detection method and system based on dynamic heteromorphic graph and multi-level attention
CN116776193A (en) * 2023-05-17 2023-09-19 广州大学 Method and device for associating virtual identities across social networks based on attention mechanism


Non-Patent Citations (2)

* Cited by examiner, † Cited by third party
Title
Yang Yizhuo; Yu Hongtao; Huang Ruiyang; Liu Zhengming: "Cross-social-network user identity matching based on fused representation learning", Computer Engineering, no. 09, 15 September 2018 (2018-09-15) *
Luo Liang; Wang Wenxian; Zhong Jie; Wang Haizhou: "Research on entity user association technology across social networks", Netinfo Security, no. 02, 10 February 2017 (2017-02-10) *

Also Published As

Publication number Publication date
CN117892019B (en) 2024-05-14

Similar Documents

Publication Publication Date Title
US11514247B2 (en) Method, apparatus, computer device and readable medium for knowledge hierarchical extraction of a text
CN112199608B (en) Social media rumor detection method based on network information propagation graph modeling
CN104573028B (en) Realize the method and system of intelligent answer
CN110428820B (en) Chinese and English mixed speech recognition method and device
CN113515634B (en) Social media rumor detection method and system based on hierarchical heterogeneous graph neural network
CN109766432A (en) A kind of Chinese abstraction generating method and device based on generation confrontation network
Yang et al. Rits: Real-time interactive text steganography based on automatic dialogue model
WO2022134834A1 (en) Potential event predicting method, apparatus and device, and storage medium
CN110909230A (en) Network hotspot analysis method and system
CN108763211A (en) The automaticabstracting and system of knowledge are contained in fusion
CN117272142A (en) Log abnormality detection method and system and electronic equipment
CN112882899B (en) Log abnormality detection method and device
CN114118058A (en) Emotion analysis system and method based on fusion of syntactic characteristics and attention mechanism
CN114330483A (en) Data processing method, model training method, device, equipment and storage medium
CN117633196A (en) Question-answering model construction method and project question-answering method
CN112966296A (en) Sensitive information filtering method and system based on rule configuration and machine learning
CN117892019B (en) Cross-social network identity linking method and device
CN117113973A (en) Information processing method and related device
CN111008329A (en) Page content recommendation method and device based on content classification
CN115455945A (en) Entity-relationship-based vulnerability data error correction method and system
CN115169293A (en) Text steganalysis method, system, device and storage medium
CN111401067B (en) Honeypot simulation data generation method and device
CN114357160A (en) Early rumor detection method and device based on generation propagation structure characteristics
He et al. Case Study: Quora Question Pairs
CN117216193B (en) Controllable text generation method and device based on large language model

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant