CN117892019A - Cross-social network identity linking method and device - Google Patents
- Publication number
- Publication number: CN117892019A (application CN202410289109.5A)
- Authority
- CN
- China
- Prior art keywords
- text
- post
- vector
- user
- similarity
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Granted
Classifications
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F16/00—Information retrieval; Database structures therefor; File system structures therefor
- G06F16/90—Details of database functions independent of the retrieved data types
- G06F16/95—Retrieval from the web
- G06F16/955—Retrieval from the web using information identifiers, e.g. uniform resource locators [URL]
- G06F16/9558—Details of hyperlinks; Management of linked annotations
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F40/00—Handling natural language data
- G06F40/10—Text processing
- G06F40/166—Editing, e.g. inserting or deleting
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F40/00—Handling natural language data
- G06F40/30—Semantic analysis
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06N—COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
- G06N3/00—Computing arrangements based on biological models
- G06N3/02—Neural networks
- G06N3/04—Architecture, e.g. interconnection topology
- G06N3/044—Recurrent networks, e.g. Hopfield networks
- G06N3/0442—Recurrent networks, e.g. Hopfield networks characterised by memory or gating, e.g. long short-term memory [LSTM] or gated recurrent units [GRU]
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06N—COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
- G06N3/00—Computing arrangements based on biological models
- G06N3/02—Neural networks
- G06N3/04—Architecture, e.g. interconnection topology
- G06N3/045—Combinations of networks
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06N—COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
- G06N3/00—Computing arrangements based on biological models
- G06N3/02—Neural networks
- G06N3/08—Learning methods
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06Q—INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR ADMINISTRATIVE, COMMERCIAL, FINANCIAL, MANAGERIAL OR SUPERVISORY PURPOSES; SYSTEMS OR METHODS SPECIALLY ADAPTED FOR ADMINISTRATIVE, COMMERCIAL, FINANCIAL, MANAGERIAL OR SUPERVISORY PURPOSES, NOT OTHERWISE PROVIDED FOR
- G06Q50/00—Information and communication technology [ICT] specially adapted for implementation of business processes of specific business sectors, e.g. utilities or tourism
- G06Q50/01—Social networking
Abstract
The invention discloses a method and a device for linking identities across social networks. The method comprises: acquiring two target text posts from different social networks; preprocessing the target text posts and inputting them into a trained network model to determine whether the two target text posts belong to the same user. Training of the network model comprises: acquiring text posts from different social networks; preprocessing each text post to generate a text data set; dividing the text data set into a training set, a verification set and a test set; constructing a network model based on multi-angle text information and training it on the training set; after each round of training, screening models with the verification set, retaining the optimal network model, and testing the model with the test set to obtain its effect. Compared with the prior art, the method and the device improve the accuracy and stability of identity linking.
Description
Technical Field
The invention relates to a method and a device for linking identities across social networks, and belongs to the technical field of network information.
Background
The advent of social networking platforms has led to a diversity of services, and people tend to register multiple accounts on different social networks. However, some malicious users may engage in illegal activities on networks without a real-name system, where they do not expose their real identities.
Linking identities across social networks can recover key information about malicious users, such as their real names. However, malicious users may forge their registration details. Prior art related to this task uses attribute information of users for identity linking, such as their social relationships and personal profiles. The Chinese patent with application number CN202110607064.8, entitled "A user identity association method integrating multi-modal information and weight tensor", and the Chinese patent with application number CN202110148895.3, entitled "A social network user identity association method integrating user characteristics and embedded learning", both use multiple kinds of user information to obtain user features. However, increasingly strict privacy protection policies of social networking platforms make user attributes difficult to obtain. Some techniques instead use user-generated content (UGC) for identity linking; UGC is easier to obtain than attribute information because it is published publicly by users themselves, so using it does not violate the privacy policies of social networks. However, due to the diversity and heterogeneity of UGC, modeling a user's intrinsic features from it has limitations. Existing research overlooks solving the identity linking task with homogeneous UGC (e.g., text) alone. The Chinese patent with application number CN202010376438.5, entitled "Identity matching method and device", discloses an identity matching framework, but the framework is abstract: it does not explain how to accurately extract user features from the information a user already exposes, and it does not design a method for a specific identity matching environment, such as a social network with a high privacy protection level.
The identity linking processes of the prior art generally rely on user information that is difficult to acquire and easy to falsify, which makes the identity linking technology unstable.
Disclosure of Invention
The invention aims to overcome the defects of the prior art by providing a method and a device for linking identities across social networks, solving the technical problems of instability and poor performance of prior-art identity linking methods.
In order to achieve the above purpose, the invention is realized by adopting the following technical scheme:
in a first aspect, the present invention provides a method for linking identities across social networks, including:
Acquiring two target text posts in different social networks;
after preprocessing the target text posts, inputting the target text posts into a trained network model, and acquiring whether the two target text posts belong to the same user;
wherein the training of the network model comprises:
Acquiring each text post in different social networks;
Preprocessing each text post to generate a text data set;
dividing the text data set into a training set, a verification set and a test set;
Constructing a network model based on multi-angle text information, and training the network model through the training set; after each round of training is completed, the verification set is used for model screening, an optimal network model is reserved, the test set is used for model testing, and the effect of the network model is obtained.
Optionally, the preprocessing each text post to generate a text data set includes:
preprocessing the text posts to generate sample data, wherein the preprocessing comprises the steps of deleting links in the text posts and replacing emoji expressions with corresponding characters;
Taking account pairs known to belong to the same natural person in two different social networks as positive cross-network account pairs, and randomly generating negative cross-network account pairs from the positive pairs;
Taking the sample data corresponding to the positive sample cross-network account pair as positive sample data, and taking the sample data corresponding to the negative sample cross-network account pair as negative sample data;
The same number of positive and negative sample data is combined into a text data set.
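The preprocessing and pair-construction steps above can be sketched in plain Python (a minimal sketch: the function names, the emoji-to-word mapping, and the particular negative-sampling strategy are illustrative assumptions, not the patent's exact procedure):

```python
import random
import re

def preprocess_post(text, emoji_map=None):
    """Clean one text post: delete links, replace emoji with words.
    `emoji_map` is a hypothetical mapping supplied by the caller."""
    emoji_map = emoji_map or {}
    text = re.sub(r"https?://\S+", "", text)      # delete links in the post
    for emoji, word in emoji_map.items():         # replace emoji with words
        text = text.replace(emoji, word)
    return text.strip()

def make_account_pairs(positive_pairs, seed=0):
    """Given known same-person account pairs (a, b) across two networks,
    derive mismatched negative pairs by re-pairing each account a with a
    b taken from a different positive pair (one plausible strategy)."""
    rng = random.Random(seed)
    accounts_b = [b for _, b in positive_pairs]
    negatives = []
    for a, b in positive_pairs:
        wrong_b = rng.choice([x for x in accounts_b if x != b])
        negatives.append((a, wrong_b))
    return negatives
```

Equal numbers of positive and negative pairs would then be merged into the text data set.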
Optionally, the network model includes a post-level vector characterization module, a user-level vector generation module, and a total similarity distribution generation module;
The post-level vector characterization module comprises a theme characterization module, a shallow semantic characterization module and a post similarity distribution generation module; the topic characterization module comprises RoBERTa language models, a variation self-encoder and a decoder, and is used for generating topic vector representations according to text posts; the shallow semantic characterization module comprises GloVe word embedding tools and a BiLSTM network model, and is used for generating shallow semantic vector representations according to text posts; the post similarity distribution generation module is used for calculating the similarity of the time factors contained between the topic vector representations and the similarity of the time factors contained between the shallow semantic vector representations, and calculating post similarity distribution according to the similarity;
The user-level vector generation module comprises a knowledge triplet extraction module, an encoder and a multi-layer perceptron; the knowledge triplet extraction module comprises an SBERT model and is used for matching knowledge triples against the text posts and an open-source knowledge graph library; the encoder is used for generating a user portrait representation vector from the knowledge triples; the multi-layer perceptron is used for generating a user-level similarity distribution from the user portrait representation vector;
the total similarity distribution generation module is used for generating a total similarity distribution according to the post similarity distribution and the user-level similarity distribution.
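The final fusion step can be illustrated as follows. The patent text does not fix the exact fusion operator, so a convex combination of the two distributions is assumed here purely for illustration:

```python
def fuse_distributions(post_dist, user_dist, w=0.5):
    """Blend the post-level and user-level similarity distributions into a
    total distribution (assumption: convex combination with weight w)."""
    assert len(post_dist) == len(user_dist)
    total = [w * p + (1 - w) * u for p, u in zip(post_dist, user_dist)]
    s = sum(total)
    return [t / s for t in total]   # renormalise so it stays a distribution
```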
Optionally, the generating a topic vector representation from the text post includes:

Converting the $j$-th text post $p_j^i$ of the $i$-th user into a post vector representation $v_j^i$ through the RoBERTa language model;

Generating, by the variational self-encoder, the probability distribution of the topic vector representation from the post vector representations:

$$q\big(t_j^i \mid v_1^i,\dots,v_j^i\big) = \mathcal{N}\big(\mu_j^i,\, \sigma_j^{i\,2}\big), \qquad \big(\mu_j^i, \log \sigma_j^{i\,2}\big) = \mathrm{MLP}\big(c_j^i\big)$$

wherein $q(\cdot)$ is the distribution produced by the variational self-encoder, $\mathrm{MLP}$ is a multi-layer perceptron, $\mathcal{N}$ is a Gaussian distribution function, $t_j^i$ is the topic vector representation of the $j$-th text post of the $i$-th user, $v_1^i,\dots,v_j^i$ are the post vector representations of the $j$-th text post of the $i$-th user and all text posts preceding it, and $c_j^i$ is the feature vector of text post $p_j^i$:

$$c_j^i = \mathrm{Attention}\big(Q, K, V\big), \qquad Q = v_j^i,\quad K = V = \big[v_1^i, \dots, v_j^i\big]$$

where $\mathrm{Attention}$ is the attention mechanism function and $Q$, $K$, $V$ are, respectively, the query vector, key vectors and value vectors of the attention mechanism function.

Optionally, the decoder is configured to reconstruct the post vector representation from the probability distribution of the topic vector representation;

the variational self-encoder and decoder are trained and optimized with the goal of minimizing the gap between the post vector representation $v_j^i$ and the reconstructed post vector representation $\hat{v}_j^i$.
Optionally, the generating a shallow semantic vector representation from the text post includes:

Converting the $k$-th word in the $j$-th text post $p_j^i$ of the $i$-th user into a word vector representation $w_k$ through the GloVe word embedding tool;

Generating, by the BiLSTM network model, the shallow semantic vector representation of text post $p_j^i$ from the word vector representations:

$$s_j^i = \frac{1}{n}\sum_{k=1}^{n} h_k$$

where $s_j^i$ is the shallow semantic vector representation of text post $p_j^i$, $n$ is the number of words in text post $p_j^i$, and $h_k$ is the shallow semantic vector representation of the $k$-th word in text post $p_j^i$:

$$h_k = \big[\overrightarrow{h}_k \,;\, \overleftarrow{h}_k\big], \qquad \overrightarrow{h}_k = (1 - z_k)\odot \overrightarrow{h}_{k-1} + z_k \odot c_k$$

wherein $\overrightarrow{h}_k$ is the vector obtained by the forward LSTM model for the first $k$ words in text post $p_j^i$; $z_k$, $c_k$, $r_k$ are, respectively, the update gate, memory cell state and reset gate of the forward LSTM model:

$$c_k = \tanh\big(W_c\,[\,r_k \odot \overrightarrow{h}_{k-1},\, w_k\,] + b_c\big)$$

$$z_k = \sigma\big(W_z\,[\,\overrightarrow{h}_{k-1},\, w_k\,] + b_z\big)$$

$$r_k = \sigma\big(W_r\,[\,\overrightarrow{h}_{k-1},\, w_k\,] + b_r\big)$$

wherein $W_c$, $b_c$ are the weight matrix and bias value of the memory cell of the forward LSTM model, $\odot$ is the element-wise multiplication operation, $W_z$, $b_z$ are the weight matrix and bias value of the update gate of the forward LSTM model, $W_r$, $b_r$ are the weight matrix and bias value of the reset gate of the forward LSTM model, and $\sigma$ is the Sigmoid activation function;

$$\overleftarrow{h}_k = (1 - z'_k)\odot \overleftarrow{h}_{k+1} + z'_k \odot c'_k$$

wherein $\overleftarrow{h}_k$ is the vector obtained by the backward LSTM for the first $k$ words in text post $p_j^i$; $z'_k$, $c'_k$, $r'_k$ are, respectively, the update gate, memory cell state and reset gate of the backward LSTM model:

$$c'_k = \tanh\big(W'_c\,[\,r'_k \odot \overleftarrow{h}_{k+1},\, w_k\,] + b'_c\big)$$

$$z'_k = \sigma\big(W'_z\,[\,\overleftarrow{h}_{k+1},\, w_k\,] + b'_z\big)$$

$$r'_k = \sigma\big(W'_r\,[\,\overleftarrow{h}_{k+1},\, w_k\,] + b'_r\big)$$

wherein $W'_c$, $b'_c$ are the weight matrix and bias value of the memory cell of the backward LSTM model, $W'_z$, $b'_z$ are the weight matrix and bias value of the update gate of the backward LSTM model, and $W'_r$, $b'_r$ are the weight matrix and bias value of the reset gate of the backward LSTM model.
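The recurrence above can be illustrated with a scalar toy version in plain Python. This is an assumption-laden sketch: the text names an update gate, a reset gate and a memory cell state, which are read here in the GRU style those names suggest; the dictionary layout and all weight values are illustrative:

```python
import math

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def gate_step(h_prev, x, W):
    """One recurrence step with the gates named in the text.
    W maps gate name -> (weight on h, weight on x, bias)."""
    z = sigmoid(W["z"][0] * h_prev + W["z"][1] * x + W["z"][2])       # update gate
    r = sigmoid(W["r"][0] * h_prev + W["r"][1] * x + W["r"][2])       # reset gate
    c = math.tanh(W["c"][0] * (r * h_prev) + W["c"][1] * x + W["c"][2])  # memory cell state
    return (1.0 - z) * h_prev + z * c                                  # new hidden state

def bilstm_sentence_vector(xs, W):
    """Shallow semantic vector of a post: mean over positions of the
    (forward, backward) hidden-state pair."""
    hf, states_f = 0.0, []
    for x in xs:                       # forward pass over the words
        hf = gate_step(hf, x, W)
        states_f.append(hf)
    hb, states_b = 0.0, []
    for x in reversed(xs):             # backward pass over the words
        hb = gate_step(hb, x, W)
        states_b.append(hb)
    states_b.reverse()
    n = len(xs)
    return (sum(states_f) / n, sum(states_b) / n)
```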
Optionally, the similarity between the topic vector representations including the time factor is:

$$\mathrm{sim}^{t}_{ab} = \lambda_{ab}\cdot \cos\big(t_a^i,\, t_b^{i'}\big)$$

wherein $\cos(t_a^i, t_b^{i'})$ is the cosine similarity between the topic vector representations:

$$\cos\big(t_a^i, t_b^{i'}\big) = \frac{t_a^i \cdot t_b^{i'}}{\lVert t_a^i\rVert\,\lVert t_b^{i'}\rVert}$$

wherein $t_a^i$ is the topic vector representation of the $a$-th text post $p_a^i$ of the $i$-th user, $t_b^{i'}$ is the topic vector representation of the $b$-th text post $p_b^{i'}$ of the $i'$-th user, $\mathrm{sim}^{t}_{ab}$ is the similarity between the topic vector representations of text post $p_a^i$ and text post $p_b^{i'}$, and $\lambda_{ab}$ is the time-relevance weight between the text posts, which decreases as the interval between their posting times grows;

the similarity between the shallow semantic vector representations including the time factor is:

$$\mathrm{sim}^{s}_{ab} = \lambda_{ab}\cdot \cos\big(s_a^i,\, s_b^{i'}\big)$$

wherein $\cos(s_a^i, s_b^{i'})$ is the cosine similarity between the shallow semantic vector representations:

$$\cos\big(s_a^i, s_b^{i'}\big) = \frac{s_a^i \cdot s_b^{i'}}{\lVert s_a^i\rVert\,\lVert s_b^{i'}\rVert}$$

wherein $s_a^i$ is the shallow semantic vector representation of the $a$-th text post $p_a^i$ of the $i$-th user, $s_b^{i'}$ is the shallow semantic vector representation of the $b$-th text post $p_b^{i'}$ of the $i'$-th user, and $\mathrm{sim}^{s}_{ab}$ is the similarity between the shallow semantic vector representations of text post $p_a^i$ and text post $p_b^{i'}$;

the calculating the post similarity distribution according to the similarity comprises:

calculating the post similarity according to the similarities:

$$\mathrm{sim}_{ab} = \alpha^{t}\cdot \mathrm{sim}^{t}_{ab} + \alpha^{s}\cdot \mathrm{sim}^{s}_{ab}$$

wherein $\{\mathrm{sim}^{t}_{ab}\}$ is the set of similarities containing time factors between the topic vector representations and $\{\mathrm{sim}^{s}_{ab}\}$ is the set of similarities containing time factors between the shallow semantic vector representations, $b = 1,\dots,m$, where $m$ is the number of text posts of the $i'$-th user; $\alpha^{t}$ and $\alpha^{s}$ are the confidence levels of the two similarities:

$$\alpha^{t},\, \alpha^{s} = \mathrm{softmax}\big(M\,\big[\,W^{t} t + b^{t} \,;\, W^{s} s + b^{s}\,\big]\big)$$

wherein $W^{t}$, $b^{t}$ are the weight matrix and bias value corresponding to the topic vector representation, $W^{s}$, $b^{s}$ are the weight matrix and bias value corresponding to the shallow semantic vector representation, $M$ is the attention matrix parameter, and $[\,\cdot\,;\,\cdot\,]$ is the vector concatenation operation;

calculating the post similarity distribution $D_a^i$ according to the post similarities:

$$D_a^i = \mathrm{softmax}\big(\mathrm{sim}_{a1},\, \mathrm{sim}_{a2},\, \dots,\, \mathrm{sim}_{am}\big)$$

where $D_a^i$ is the post similarity distribution of the $a$-th text post of the $i$-th user and $m$ is the number of text posts of the $i'$-th user.
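The time-weighted similarities and the distribution over posts can be sketched in plain Python. The exact form of the time-relevance weight is not recoverable from the text, so an exponential decay in the posting-time gap is assumed here for illustration:

```python
import math

def cosine(a, b):
    """Cosine similarity between two vectors given as lists of floats."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return dot / (na * nb)

def time_weight(t_a, t_b, decay=1.0):
    """Assumed time-relevance weight: closer posting times weigh more."""
    return math.exp(-decay * abs(t_a - t_b))

def timed_similarity(vec_a, t_a, vec_b, t_b):
    """Similarity containing the time factor: weight times cosine."""
    return time_weight(t_a, t_b) * cosine(vec_a, vec_b)

def softmax(xs):
    """Turn raw similarities into a distribution over candidate posts."""
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    s = sum(exps)
    return [e / s for e in exps]
```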
Optionally, the generating a user portrait representation vector from the knowledge triples comprises:

Generating a knowledge vector representation from the knowledge triples by a variational self-encoder;

Generating, by a position encoder, the user portrait vector representation by embedding timing information into the knowledge vector representation:

$$PE(pos,\, 2l) = \sin\!\big(pos / 10000^{\,2l/d}\big)$$

$$PE(pos,\, 2l+1) = \cos\!\big(pos / 10000^{\,2l/d}\big)$$

where $d$ is the dimension of the knowledge vector representation, $pos$ is the position of the currently processed post, and $l$ is the dimension index.
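The position-encoding step can be sketched as follows, assuming the standard sinusoidal form that the two formulas appear to follow:

```python
import math

def positional_encoding(pos, d):
    """Sinusoidal encoding for the post at position `pos` in a sequence,
    producing a d-dimensional vector (even indices sin, odd indices cos)."""
    pe = []
    for l in range(d):
        angle = pos / (10000 ** (2 * (l // 2) / d))
        pe.append(math.sin(angle) if l % 2 == 0 else math.cos(angle))
    return pe
```

The encoding would be added to each knowledge vector so that the encoder sees the temporal order of the user's posts.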
Optionally, the loss function of the network model is:

$$L = L_{post} + L_{user}$$

wherein $L_{post}$ is the post-level loss and $L_{user}$ is the user-level loss:

$$L_{post} = -\frac{1}{N}\sum_{n=1}^{N}\Big[\,y_n \log \hat{y}^{post}_n + (1 - y_n)\log\big(1 - \hat{y}^{post}_n\big)\Big]$$

$$L_{user} = -\frac{1}{N}\sum_{n=1}^{N}\Big[\,y_n \log \hat{y}^{user}_n + (1 - y_n)\log\big(1 - \hat{y}^{user}_n\big)\Big]$$

wherein $N$ is the number of sample data, the sample data types comprising positive sample data and negative sample data, $y_n$ is the sample type (label) of the $n$-th sample data, and $\hat{y}_n$ is the sample type prediction corresponding to the $n$-th sample data:

$$\hat{y}^{post} = \sigma\big(W_1 D + b_1\big)$$

$$\hat{y}^{user} = \sigma\big(W_2\,[\,u^{i} \,;\, u^{i'}\,]\big)$$

wherein $W_1$, $b_1$ are the weight matrix and bias value of the first activation function, $D$ is the post similarity distribution, $u^{i}$ and $u^{i'}$ are the user portrait representation vectors of user $i$ and user $i'$ respectively, and $W_2$ is the weight matrix of the second activation function.
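The combined loss can be sketched as two binary cross-entropy terms over the same labels (a sketch under the assumption that both levels score the same labeled sample pairs):

```python
import math

def binary_cross_entropy(y_true, y_pred, eps=1e-12):
    """Mean binary cross-entropy; eps guards against log(0)."""
    n = len(y_true)
    return -sum(
        y * math.log(p + eps) + (1 - y) * math.log(1 - p + eps)
        for y, p in zip(y_true, y_pred)
    ) / n

def total_loss(y_true, post_preds, user_preds):
    """L = L_post + L_user: post-level plus user-level cross-entropy."""
    return binary_cross_entropy(y_true, post_preds) + binary_cross_entropy(y_true, user_preds)
```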
In a second aspect, the present invention provides an identity linking device across a social network, the device comprising:
The target acquisition module is used for acquiring two target text posts in different social networks;
The identity link module is used for preprocessing the target text posts, inputting the preprocessed target text posts into a trained network model, and acquiring whether the two target text posts belong to the same user or not;
wherein the training of the network model comprises:
Acquiring each text post in different social networks;
Preprocessing each text post to generate a text data set;
dividing the text data set into a training set, a verification set and a test set;
Constructing a network model based on multi-angle text information, and training the network model through the training set; after each round of training is completed, the verification set is used for model screening, an optimal network model is reserved, the test set is used for model testing, and the effect of the network model is obtained.
Compared with the prior art, the invention has the following beneficial effects:
The method and the device for linking identities across social networks use multi-angle text information of users to enrich user feature information for the cross-social-network identity matching task, characterizing users simultaneously with topic information, shallow semantic information and knowledge representation information in order to judge whether accounts in two different social networks belong to the same natural person. Compared with the prior art, the method achieves better accuracy and stability.
Drawings
FIG. 1 is a schematic flow chart of an identity linking method across social networks provided by an embodiment of the invention;
FIG. 2 is a schematic diagram of the structure and principle of a network model according to an embodiment of the present invention;
FIG. 3 is a comparative schematic diagram of experimental results provided in the examples of the present invention.
Detailed Description
The invention is further described below with reference to the accompanying drawings. The following examples are only for more clearly illustrating the technical aspects of the present invention, and are not intended to limit the scope of the present invention.
Embodiment one:
As shown in fig. 1, an embodiment of the present invention provides an identity linking method across social networks, including the following steps:
step S1, acquiring two target text posts in different social networks.
Step S2, after preprocessing the target text posts, inputting the target text posts into a trained network model, and obtaining whether the two target text posts belong to the same user or not.
Wherein, training of the network model comprises:
And S21, acquiring each text post in different social networks.
S22, preprocessing each text post to generate a text data set; the method specifically comprises the following steps:
Step S221, preprocessing the text posts to generate sample data, wherein the preprocessing comprises deleting links in the text posts and replacing emoji expressions with corresponding words;
step S222, taking account pairs known to belong to the same natural person in two different social networks as positive cross-network account pairs, and randomly generating negative cross-network account pairs from the positive pairs;
Step S223, taking the sample data corresponding to the positive sample cross-network account pair as positive sample data, and taking the sample data corresponding to the negative sample cross-network account pair as negative sample data;
Step S224, merging the same number of positive sample data and negative sample data into a text data set.
Step S23, dividing the text data set into a training set, a verification set and a test set; in this embodiment, the dividing ratio is 8:1:1.
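The 8:1:1 split used in this embodiment can be sketched as follows (the function name and seed are illustrative):

```python
import random

def split_dataset(samples, ratios=(8, 1, 1), seed=42):
    """Shuffle and divide the text data set into training, verification
    and test sets at the given ratio (8:1:1 by default)."""
    rng = random.Random(seed)
    data = list(samples)
    rng.shuffle(data)
    total = sum(ratios)
    n_train = len(data) * ratios[0] // total
    n_val = len(data) * ratios[1] // total
    train = data[:n_train]
    val = data[n_train:n_train + n_val]
    test = data[n_train + n_val:]
    return train, val, test
```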
S24, constructing a network model based on multi-angle text information, and training the network model through a training set; after each round of training is completed, a verification set is used for model screening, an optimal network model is reserved, a test set is used for model testing, and the effect of the network model is obtained.
As shown in fig. 2, the network model includes a post-level vector characterization module, a user-level vector generation module, and a total similarity distribution generation module;
The post-level vector characterization module comprises a theme characterization module, a shallow semantic characterization module and a post similarity distribution generation module; the topic characterization module comprises RoBERTa language models, a variation self-encoder and a decoder, and is used for generating topic vector representations according to the text posts; the shallow semantic characterization module comprises GloVe word embedding tools and a BiLSTM network model, and is used for generating shallow semantic vector representations according to text posts; the post similarity distribution generation module is used for calculating the similarity containing time factors between the topic vector representations and the similarity containing time factors between the shallow semantic vector representations, and calculating post similarity distribution according to the similarity;
the user-level vector generation module comprises a knowledge triplet extraction module, an encoder and a multi-layer perceptron; the knowledge triplet extraction module comprises an SBERT model and is used for matching knowledge triples against the text posts and the open-source knowledge graph library; the encoder is used for generating a user portrait representation vector from the knowledge triples; the multi-layer perceptron is used for generating a user-level similarity distribution from the user portrait representation vector;
the total similarity distribution generation module is used for generating the total similarity distribution from the post similarity distribution and the user-level similarity distribution; the total similarity distribution, which combines the post similarity distribution and the user-level similarity distribution, is the final identity link prediction result.
Specifically, (1) generating a topic vector representation from a text post includes:

Converting the $j$-th text post $p_j^i$ of the $i$-th user into a post vector representation $v_j^i$ through the RoBERTa language model;

Generating, by the variational self-encoder, the probability distribution of the topic vector representation from the post vector representations:

$$q\big(t_j^i \mid v_1^i,\dots,v_j^i\big) = \mathcal{N}\big(\mu_j^i,\, \sigma_j^{i\,2}\big), \qquad \big(\mu_j^i, \log \sigma_j^{i\,2}\big) = \mathrm{MLP}\big(c_j^i\big)$$

wherein $q(\cdot)$ is the distribution produced by the variational self-encoder, $\mathrm{MLP}$ is a multi-layer perceptron, $\mathcal{N}$ is a Gaussian distribution function, $t_j^i$ is the topic vector representation of the $j$-th text post of the $i$-th user, $v_1^i,\dots,v_j^i$ are the post vector representations of the $j$-th text post of the $i$-th user and all text posts preceding it, and $c_j^i$ is the feature vector of text post $p_j^i$:

$$c_j^i = \mathrm{Attention}\big(Q, K, V\big), \qquad Q = v_j^i,\quad K = V = \big[v_1^i, \dots, v_j^i\big]$$

where $\mathrm{Attention}$ is the attention mechanism function and $Q$, $K$, $V$ are, respectively, the query vector, key vectors and value vectors of the attention mechanism function.

Wherein the decoder is configured to reconstruct the post vector representation from the probability distribution of the topic vector representation;

the variational self-encoder and decoder are trained and optimized with the goal of minimizing the gap between the post vector representation $v_j^i$ and the reconstructed post vector representation $\hat{v}_j^i$.
Specifically, (2) generating a shallow semantic vector representation from the text post includes:

Converting the $k$-th word in the $j$-th text post $p_j^i$ of the $i$-th user into a word vector representation $w_k$ through the GloVe word embedding tool;

Generating, by the BiLSTM network model, the shallow semantic vector representation of text post $p_j^i$ from the word vector representations:

$$s_j^i = \frac{1}{n}\sum_{k=1}^{n} h_k$$

where $s_j^i$ is the shallow semantic vector representation of text post $p_j^i$, $n$ is the number of words in text post $p_j^i$, and $h_k$ is the shallow semantic vector representation of the $k$-th word in text post $p_j^i$:

$$h_k = \big[\overrightarrow{h}_k \,;\, \overleftarrow{h}_k\big], \qquad \overrightarrow{h}_k = (1 - z_k)\odot \overrightarrow{h}_{k-1} + z_k \odot c_k$$

wherein $\overrightarrow{h}_k$ is the vector obtained by the forward LSTM model for the first $k$ words in text post $p_j^i$; $z_k$, $c_k$, $r_k$ are, respectively, the update gate, memory cell state and reset gate of the forward LSTM model:

$$c_k = \tanh\big(W_c\,[\,r_k \odot \overrightarrow{h}_{k-1},\, w_k\,] + b_c\big)$$

$$z_k = \sigma\big(W_z\,[\,\overrightarrow{h}_{k-1},\, w_k\,] + b_z\big)$$

$$r_k = \sigma\big(W_r\,[\,\overrightarrow{h}_{k-1},\, w_k\,] + b_r\big)$$

wherein $W_c$, $b_c$ are the weight matrix and bias value of the memory cell of the forward LSTM model, $\odot$ is the element-wise multiplication operation, $W_z$, $b_z$ are the weight matrix and bias value of the update gate of the forward LSTM model, $W_r$, $b_r$ are the weight matrix and bias value of the reset gate of the forward LSTM model, and $\sigma$ is the Sigmoid activation function;

$$\overleftarrow{h}_k = (1 - z'_k)\odot \overleftarrow{h}_{k+1} + z'_k \odot c'_k$$

wherein $\overleftarrow{h}_k$ is the vector obtained by the backward LSTM for the first $k$ words in text post $p_j^i$; $z'_k$, $c'_k$, $r'_k$ are, respectively, the update gate, memory cell state and reset gate of the backward LSTM model:

$$c'_k = \tanh\big(W'_c\,[\,r'_k \odot \overleftarrow{h}_{k+1},\, w_k\,] + b'_c\big)$$

$$z'_k = \sigma\big(W'_z\,[\,\overleftarrow{h}_{k+1},\, w_k\,] + b'_z\big)$$

$$r'_k = \sigma\big(W'_r\,[\,\overleftarrow{h}_{k+1},\, w_k\,] + b'_r\big)$$

wherein $W'_c$, $b'_c$ are the weight matrix and bias value of the memory cell of the backward LSTM model, $W'_z$, $b'_z$ are the weight matrix and bias value of the update gate of the backward LSTM model, and $W'_r$, $b'_r$ are the weight matrix and bias value of the reset gate of the backward LSTM model.
Specifically, (3) the similarity between the topic vector representations including the time factor is:

$$\mathrm{sim}^{t}_{ab} = \lambda_{ab}\cdot \cos\big(t_a^i,\, t_b^{i'}\big)$$

wherein $\cos(t_a^i, t_b^{i'})$ is the cosine similarity between the topic vector representations:

$$\cos\big(t_a^i, t_b^{i'}\big) = \frac{t_a^i \cdot t_b^{i'}}{\lVert t_a^i\rVert\,\lVert t_b^{i'}\rVert}$$

wherein $t_a^i$ is the topic vector representation of the $a$-th text post $p_a^i$ of the $i$-th user, $t_b^{i'}$ is the topic vector representation of the $b$-th text post $p_b^{i'}$ of the $i'$-th user, $\mathrm{sim}^{t}_{ab}$ is the similarity between the topic vector representations of text post $p_a^i$ and text post $p_b^{i'}$, and $\lambda_{ab}$ is the time-relevance weight between the text posts, which decreases as the interval between their posting times grows;

the similarity between the shallow semantic vector representations including the time factor is:

$$\mathrm{sim}^{s}_{ab} = \lambda_{ab}\cdot \cos\big(s_a^i,\, s_b^{i'}\big)$$

wherein $\cos(s_a^i, s_b^{i'})$ is the cosine similarity between the shallow semantic vector representations:

$$\cos\big(s_a^i, s_b^{i'}\big) = \frac{s_a^i \cdot s_b^{i'}}{\lVert s_a^i\rVert\,\lVert s_b^{i'}\rVert}$$

wherein $s_a^i$ is the shallow semantic vector representation of the $a$-th text post $p_a^i$ of the $i$-th user, $s_b^{i'}$ is the shallow semantic vector representation of the $b$-th text post $p_b^{i'}$ of the $i'$-th user, and $\mathrm{sim}^{s}_{ab}$ is the similarity between the shallow semantic vector representations of text post $p_a^i$ and text post $p_b^{i'}$;

calculating the post similarity distribution according to the similarity comprises:

calculating the post similarity according to the similarities:

$$\mathrm{sim}_{ab} = \alpha^{t}\cdot \mathrm{sim}^{t}_{ab} + \alpha^{s}\cdot \mathrm{sim}^{s}_{ab}$$

wherein $\{\mathrm{sim}^{t}_{ab}\}$ is the set of similarities containing time factors between the topic vector representations and $\{\mathrm{sim}^{s}_{ab}\}$ is the set of similarities containing time factors between the shallow semantic vector representations, $b = 1,\dots,m$, where $m$ is the number of text posts of the $i'$-th user; $\alpha^{t}$ and $\alpha^{s}$ are the confidence levels of the two similarities:

$$\alpha^{t},\, \alpha^{s} = \mathrm{softmax}\big(M\,\big[\,W^{t} t + b^{t} \,;\, W^{s} s + b^{s}\,\big]\big)$$

wherein $W^{t}$, $b^{t}$ are the weight matrix and bias value corresponding to the topic vector representation, $W^{s}$, $b^{s}$ are the weight matrix and bias value corresponding to the shallow semantic vector representation, $M$ is the attention matrix parameter, and $[\,\cdot\,;\,\cdot\,]$ is the vector concatenation operation;

calculating the post similarity distribution $D_a^i$ according to the post similarities:

$$D_a^i = \mathrm{softmax}\big(\mathrm{sim}_{a1},\, \mathrm{sim}_{a2},\, \dots,\, \mathrm{sim}_{am}\big)$$

where $D_a^i$ is the post similarity distribution of the $a$-th text post of the $i$-th user and $m$ is the number of text posts of the $i'$-th user.
Specifically, generating the user portrait representation vector according to the knowledge triples comprises the following steps:
generating a knowledge vector representation from the knowledge triples by a variational self-encoder;
generating the user portrait vector representation by embedding timing information into the knowledge vector representation through a position encoder, wherein the position encoding is a function of the dimension of the knowledge vector representation, the position of the currently processed post, and the dimension index.
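A sinusoidal position encoder in the style of the Transformer fits the description above (a function of post position, encoding dimension, and dimension index). The base constant 10000 and the even/odd sine and cosine split are assumptions borrowed from that standard scheme, not values taken from the patent.

```python
import math

def positional_encoding(pos, d_model):
    """Sinusoidal encoding of post position `pos` into a
    d_model-dimensional vector: sine on even dimension indices,
    cosine on odd ones (Transformer-style; assumed form)."""
    encoding = []
    for k in range(d_model):
        angle = pos / (10000 ** ((2 * (k // 2)) / d_model))
        encoding.append(math.sin(angle) if k % 2 == 0 else math.cos(angle))
    return encoding
```

Adding such a vector to each post's knowledge vector injects the posting order without requiring any learned parameters.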
Specifically, the loss function of the network model combines a post-level loss and a user-level loss. Both losses are computed over $N$ samples, where the sample types include positive sample data and negative sample data, $y_k$ is the sample type label of the $k$-th sample, and $\hat{y}_k$ is the sample type prediction corresponding to the $k$-th sample. The post-level prediction is produced from the post similarity distribution through an activation function with its corresponding weight matrix and bias value, and the user-level prediction is produced from the user portrait representation vectors of the two users through an activation function with its corresponding weight matrix.
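The sketch below uses binary cross-entropy for both levels and an unweighted sum to combine them. Both the cross-entropy form and the equal weighting are assumptions: the text states only that the total loss combines a post-level and a user-level loss over positive and negative samples.

```python
import math

def binary_cross_entropy(labels, preds, eps=1e-12):
    """Mean binary cross-entropy over N samples; labels are 0/1 and
    preds are probabilities. `eps` guards against log(0)."""
    n = len(labels)
    return -sum(y * math.log(p + eps) + (1 - y) * math.log(1 - p + eps)
                for y, p in zip(labels, preds)) / n

def total_loss(labels, post_preds, user_preds):
    """Combine post-level and user-level losses as an unweighted
    sum (assumed combination; the patent's weighting is not given)."""
    return (binary_cross_entropy(labels, post_preds)
            + binary_cross_entropy(labels, user_preds))
```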
As shown in FIG. 3, the method provided in this embodiment is compared experimentally with prior-art identity linking methods, whose models include the BPR-DAE, MV-URL, DLHD, MSUIL-V, MNA-V, UserNet, and UserNet-C models; the model of the method provided in this embodiment is denoted TEAKM. As can be seen from FIG. 3, the accuracy of the network model trained by the present invention on the test set is higher than that of the other models, which indicates that the present invention outperforms the existing models on the identity linking task.
Embodiment two:
the embodiment of the invention provides a cross-social-network identity linking device, which comprises:
The target acquisition module is used for acquiring two target text posts in different social networks;
The identity link module is used for preprocessing the target text posts, inputting the preprocessed target text posts into a trained network model, and determining whether the two target text posts belong to the same user;
Wherein, training of the network model comprises:
Acquiring each text post in different social networks;
Preprocessing each text post to generate a text data set;
dividing the text data set into a training set, a verification set and a test set;
constructing a network model based on multi-angle text information, and training the network model through a training set; after each round of training is completed, a verification set is used for model screening, an optimal network model is reserved, a test set is used for model testing, and the effect of the network model is obtained.
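The training procedure described above (train per round, screen on the validation set, keep the optimal model, then test) can be sketched generically. All callables here are hypothetical stand-ins for the patent's network model and data sets, not actual APIs.

```python
def train_with_model_screening(init_model, train_one_epoch, evaluate,
                               train_set, val_set, test_set, epochs=10):
    """Epoch loop with validation-based model screening.

    `init_model`, `train_one_epoch`, and `evaluate` are
    caller-supplied placeholders; after every round the model is
    scored on the validation set, the best-scoring model is kept,
    and the kept model's test-set score is reported at the end.
    """
    model = init_model()
    best_model, best_score = model, float("-inf")
    for _ in range(epochs):
        model = train_one_epoch(model, train_set)
        score = evaluate(model, val_set)
        if score > best_score:  # model screening: keep the optimum
            best_score, best_model = score, model
    return best_model, evaluate(best_model, test_set)
```

With a toy "model" that is just a number nudged toward an optimum each epoch, the screening keeps the epoch whose state is closest to the validation peak.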
In summary, the characteristics of text information are represented from two angles: the post level and the user level. Post-level information comprises topic information and shallow semantic information, while the user level uses knowledge to characterize semantic information. With post information that is easy to acquire, whether accounts in two different social networks belong to the same natural person can be judged rapidly, key information about illegal users can be further confirmed in a cross-network environment, and the problem of inaccurate matching caused by false user social information or user profile information is solved.
It will be appreciated by those skilled in the art that embodiments of the present application may be provided as a method, system, or computer program product. Accordingly, the present application may take the form of an entirely hardware embodiment, an entirely software embodiment or an embodiment combining software and hardware aspects. Furthermore, the present application may take the form of a computer program product embodied on one or more computer-usable storage media (including, but not limited to, disk storage, CD-ROM, optical storage, and the like) having computer-usable program code embodied therein.
The present application is described with reference to flowchart illustrations and/or block diagrams of methods, apparatus (systems) and computer program products according to embodiments of the application. It will be understood that each flow and/or block of the flowchart illustrations and/or block diagrams, and combinations of flows and/or blocks in the flowchart illustrations and/or block diagrams, can be implemented by computer program instructions. These computer program instructions may be provided to a processor of a general purpose computer, special purpose computer, embedded processor, or other programmable data processing apparatus to produce a machine, such that the instructions, which execute via the processor of the computer or other programmable data processing apparatus, create means for implementing the functions specified in the flowchart flow or flows and/or block diagram block or blocks.
These computer program instructions may also be stored in a computer-readable memory that can direct a computer or other programmable data processing apparatus to function in a particular manner, such that the instructions stored in the computer-readable memory produce an article of manufacture including instruction means which implement the function specified in the flowchart flow or flows and/or block diagram block or blocks.
These computer program instructions may also be loaded onto a computer or other programmable data processing apparatus to cause a series of operational steps to be performed on the computer or other programmable apparatus to produce a computer implemented process such that the instructions which execute on the computer or other programmable apparatus provide steps for implementing the functions specified in the flowchart flow or flows and/or block diagram block or blocks.
The foregoing is merely a preferred embodiment of the present invention, and it should be noted that modifications and variations could be made by those skilled in the art without departing from the technical principles of the present invention, and such modifications and variations should also be regarded as being within the scope of the invention.
Claims (10)
1. An identity linking method across a social network, comprising:
Acquiring two target text posts in different social networks;
after preprocessing the target text posts, inputting the preprocessed target text posts into a trained network model, and determining whether the two target text posts belong to the same user;
wherein the training of the network model comprises:
Acquiring each text post in different social networks;
Preprocessing each text post to generate a text data set;
dividing the text data set into a training set, a verification set and a test set;
Constructing a network model based on multi-angle text information, and training the network model through the training set; after each round of training is completed, the verification set is used for model screening, an optimal network model is reserved, the test set is used for model testing, and the effect of the network model is obtained.
2. The method of claim 1, wherein preprocessing each text post to generate a text data set comprises:
preprocessing the text posts to generate sample data, wherein the preprocessing comprises deleting links in the text posts and replacing emoji expressions with corresponding characters;
Taking account pairs known to belong to the same natural person in two different social networks as positive sample cross-network account pairs, and randomly generating negative sample cross-network account pairs according to the positive sample cross-network account pairs;
Taking the sample data corresponding to the positive sample cross-network account pair as positive sample data, and taking the sample data corresponding to the negative sample cross-network account pair as negative sample data;
The same number of positive and negative sample data is combined into a text data set.
3. The method of claim 2, wherein the network model comprises a post-level vector characterization module, a user-level vector generation module, and a total similarity distribution generation module;
The post-level vector characterization module comprises a topic characterization module, a shallow semantic characterization module, and a post similarity distribution generation module; the topic characterization module comprises a RoBERTa language model, a variational self-encoder, and a decoder, and is used for generating topic vector representations according to text posts; the shallow semantic characterization module comprises a GloVe word embedding tool and a BiLSTM network model, and is used for generating shallow semantic vector representations according to text posts; the post similarity distribution generation module is used for calculating the time-factored similarity between topic vector representations and the time-factored similarity between shallow semantic vector representations, and for calculating the post similarity distribution according to these similarities;
The user-level vector generation module comprises a knowledge triplet extraction module, a user portrait vector generation module, and a multi-layer perceptron; the knowledge triplet extraction module comprises an SBERT model and is used for matching knowledge triples according to the text posts and an open-source knowledge graph library; the user portrait vector generation module comprises a variational self-encoder and a position encoder and is used for generating a user portrait representation vector according to the knowledge triples; the multi-layer perceptron is used for generating a user-level similarity distribution according to the user portrait representation vector;
the total similarity distribution generation module is used for generating a total similarity distribution according to the post similarity distribution and the user-level similarity distribution.
4. The method of identity linking across a social network of claim 3, wherein generating a post vector representation from text posts comprises:
converting the $i$-th text post $p_{u,i}$ of the $u$-th user into a post vector representation $e_{u,i}$ through the RoBERTa language model;
generating, by the variational self-encoder, a probability distribution of the topic vector representation $t_{u,i}$ from the post vector representation, wherein the variational self-encoder applies a multi-layer perceptron and a Gaussian distribution function to a feature vector $c_{u,i}$ of text post $p_{u,i}$;
the feature vector is obtained by an attention mechanism function whose query vector is the post vector representation $e_{u,i}$ of text post $p_{u,i}$, and whose key vector and value vector are the post vector representations $E_{u,\le i}$ of the $i$-th text post of the $u$-th user and all text posts preceding it:
$c_{u,i} = \mathrm{Att}(e_{u,i},\, E_{u,\le i},\, E_{u,\le i})$
wherein $\mathrm{Att}(\cdot)$ is the attention mechanism function.
5. The method of identity linking across a social network of claim 4, wherein the decoder is configured to reconstruct a post vector representation $\hat{e}_{u,i}$ from the probability distribution of the topic vector representation;
the variational self-encoder and the decoder are trained and optimized with the goal of minimizing the gap between the post vector representation $e_{u,i}$ and the reconstructed post vector representation $\hat{e}_{u,i}$.
6. The method of identity linking across a social network of claim 3, wherein generating a shallow semantic vector representation from text posts comprises:
converting the $k$-th word in the $i$-th text post $p_{u,i}$ of the $u$-th user into a word vector representation $w_{u,i,k}$ through the GloVe word embedding tool;
generating, by the BiLSTM network model, the shallow semantic vector representation $h_{u,i}$ of text post $p_{u,i}$ from the word vector representations, wherein $h_{u,i}$ is aggregated over the shallow semantic vector representations $h_{u,i,k}$ of the $m$ words in text post $p_{u,i}$:
$h_{u,i,k} = [\overrightarrow{h}_{u,i,k} ;\, \overleftarrow{h}_{u,i,k}]$
wherein $\overrightarrow{h}_{u,i,k}$ is the vector obtained by the forward LSTM model for the $k$-th word in text post $p_{u,i}$; it is computed from the update gate, memory cell state, and reset gate of the forward LSTM model, whose parameters are the weight matrix and bias value corresponding to the memory cell of the forward LSTM model, the weight matrix and bias value corresponding to the update gate of the forward LSTM model, and the weight matrix and bias value corresponding to the reset gate of the forward LSTM model, together with an element-wise multiplication operation and a Sigmoid activation function;
$\overleftarrow{h}_{u,i,k}$ is the vector obtained by the backward LSTM model for the $k$-th word in text post $p_{u,i}$; it is computed analogously from the update gate, memory cell state, and reset gate of the backward LSTM model, whose parameters are the weight matrix and bias value corresponding to the memory cell of the backward LSTM model, the weight matrix and bias value corresponding to the update gate of the backward LSTM model, and the weight matrix and bias value corresponding to the reset gate of the backward LSTM model.
7. The method of claim 3, wherein the similarity between topic vector representations that incorporates the time factor is:
$\tilde{s}^{t}_{ij} = w_{ij}\, s^{t}_{ij}$
wherein $s^{t}_{ij}$ is the cosine similarity between the topic vector representations:
$s^{t}_{ij} = \dfrac{t_{u_1,i} \cdot t_{u_2,j}}{\lVert t_{u_1,i} \rVert \, \lVert t_{u_2,j} \rVert}$
wherein $t_{u_1,i}$ is the topic vector representation of the $i$-th text post $p_{u_1,i}$ of user $u_1$, $t_{u_2,j}$ is the topic vector representation of the $j$-th text post $p_{u_2,j}$ of user $u_2$, $s^{t}_{ij}$ is the similarity between the topic vector representations of text posts $p_{u_1,i}$ and $p_{u_2,j}$, and $w_{ij}$ is the time-relevance weight between the text posts;
the similarity between the shallow semantic vector representations that incorporates the time factor is:
$\tilde{s}^{s}_{ij} = w_{ij}\, s^{s}_{ij}$
wherein $s^{s}_{ij}$ is the cosine similarity between the shallow semantic vector representations:
$s^{s}_{ij} = \dfrac{h_{u_1,i} \cdot h_{u_2,j}}{\lVert h_{u_1,i} \rVert \, \lVert h_{u_2,j} \rVert}$
wherein $h_{u_1,i}$ is the shallow semantic vector representation of the $i$-th text post $p_{u_1,i}$ of user $u_1$, $h_{u_2,j}$ is the shallow semantic vector representation of the $j$-th text post $p_{u_2,j}$ of user $u_2$, and $s^{s}_{ij}$ is the similarity between the shallow semantic vector representations of text posts $p_{u_1,i}$ and $p_{u_2,j}$;
Calculating the post similarity distribution according to the similarities comprises the following steps:
calculating the post similarity from the set of time-factored similarities between topic vector representations and the set of time-factored similarities between shallow semantic vector representations, the size of each set being determined by the number of text posts of each user; the two similarity channels for a post pair are fused according to confidence levels computed by an attention mechanism, whose parameters are the weight matrix and bias value corresponding to the topic vector representation, the weight matrix and bias value corresponding to the shallow semantic vector representation, an attention matrix parameter, and a vector concatenation operation;
Calculating the post similarity distribution according to the post similarity, wherein the post similarities of the $i$-th text post of user $u_1$ against all text posts of user $u_2$ are normalized into a probability distribution $d_i$;
wherein $d_i$ is the post similarity distribution of the $i$-th text post of user $u_1$, and its length equals the number of text posts of user $u_2$.
8. The method for identity linking across a social network recited in claim 3, wherein generating a user portrait representation vector from the knowledge triples comprises:
generating a knowledge vector representation from the knowledge triples by a variational self-encoder;
generating the user portrait vector representation by embedding timing information into the knowledge vector representation through a position encoder, wherein the position encoding is a function of the dimension of the knowledge vector representation, the position of the currently processed post, and the dimension index.
9. A method of identity linking across a social network as recited in claim 3, wherein the loss function of the network model combines a post-level loss and a user-level loss, both computed over $N$ samples, where the sample types include positive sample data and negative sample data, $y_k$ is the sample type label of the $k$-th sample, and $\hat{y}_k$ is the sample type prediction corresponding to the $k$-th sample; the post-level prediction is produced from the post similarity distribution through an activation function with its corresponding weight matrix and bias value, and the user-level prediction is produced from the user portrait representation vectors of the two users through an activation function with its corresponding weight matrix.
10. An identity linking apparatus across a social network, the apparatus comprising:
The target acquisition module is used for acquiring two target text posts in different social networks;
The identity link module is used for preprocessing the target text posts, inputting the preprocessed target text posts into a trained network model, and determining whether the two target text posts belong to the same user;
wherein the training of the network model comprises:
Acquiring each text post in different social networks;
Preprocessing each text post to generate a text data set;
dividing the text data set into a training set, a verification set and a test set;
Constructing a network model based on multi-angle text information, and training the network model through the training set; after each round of training is completed, the verification set is used for model screening, an optimal network model is reserved, the test set is used for model testing, and the effect of the network model is obtained.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202410289109.5A CN117892019B (en) | 2024-03-14 | 2024-03-14 | Cross-social network identity linking method and device |
Publications (2)
Publication Number | Publication Date |
---|---|
CN117892019A true CN117892019A (en) | 2024-04-16 |
CN117892019B CN117892019B (en) | 2024-05-14 |
Citations (6)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
US20210365444A1 (en) * | 2020-05-20 | 2021-11-25 | Beijing Baidu Netcom Science And Technology Co., Ltd. | Method and apparatus for processing dataset |
CN114169449A (en) * | 2021-12-10 | 2022-03-11 | 同济大学 | Cross-social network user identity matching method |
CN114663245A (en) * | 2022-03-16 | 2022-06-24 | 南京信息工程大学 | Cross-social network identity matching method |
CN114741515A (en) * | 2022-04-25 | 2022-07-12 | 西安交通大学 | Social network user attribute prediction method and system based on graph generation |
CN115659966A (en) * | 2022-10-29 | 2023-01-31 | 福州大学 | Rumor detection method and system based on dynamic heteromorphic graph and multi-level attention |
CN116776193A (en) * | 2023-05-17 | 2023-09-19 | 广州大学 | Method and device for associating virtual identities across social networks based on attention mechanism |
Non-Patent Citations (2)
Title |
---|
杨奕卓; 于洪涛; 黄瑞阳; 刘正铭: "Cross-social-network user identity matching based on fused representation learning", Computer Engineering, no. 09, 15 September 2018 (2018-09-15) *
罗梁; 王文贤; 钟杰; 王海舟: "Research on entity user association technology across social networks", Netinfo Security, no. 02, 10 February 2017 (2017-02-10) *
Legal Events
Date | Code | Title | Description |
---|---|---|---|
PB01 | Publication | ||
SE01 | Entry into force of request for substantive examination | ||
GR01 | Patent grant | ||