CN114513337A - Privacy protection link prediction method and system based on mail data

Info

Publication number: CN114513337A
Application number: CN202210066876.0A
Authority: CN
Inventors: 王勇; 王范川; 王晓虎; 秦瑞; 张应福; 石锟
Original assignee: University of Electronic Science and Technology of China
Current assignee: University of Electronic Science and Technology of China
Priority date: 2022-01-20
Filing date: 2022-01-20
Publication date: 2022-05-17
Anticipated expiration: 2042-01-20
Also published as: CN114513337B

Abstract

The invention discloses a privacy-preserving link prediction method and system based on mail data. The method comprises: constructing a person-relationship knowledge graph from the mail data; training a generative model with a generative adversarial network so that it learns the distribution of the training data; reconstructing the multi-relational data so as to obfuscate the sensitive and non-sensitive relationship information implied in the data; and completing the relationships between entities with the reconstructed multi-relational data, so that non-sensitive relationships between entities are completed while sensitive relationships are protected. The invention also provides a privacy-preserving link prediction system based on mail data that implements the method. The invention thereby solves the technical problem that existing link prediction techniques cannot protect the social relationships of people in a mail system.

Description

Privacy protection link prediction method and system based on mail data

Technical Field

The invention relates to the technical fields of adversarial learning, graph network representation learning, knowledge graphs and link prediction, and in particular to a privacy-preserving link prediction method and system based on mail data.

Background

As one of the applications of the Internet, mail is an important mode of communication in modern society. Mail data records the content of human communication, including important information such as communication relationships, communication time and communication frequency. Through simple entity-relation extraction and data mining, several knowledge graphs can be built from a single mail data set. Taking a campus student mail system as an example: a communication relationship graph can be built from the communication view, and an online login behavior graph can be built from the device login view. In such a graph, nodes correspond to entities and edges correspond to relationships; each triple denotes a pair of entities and the relationship that holds between them.

In recent years, research on knowledge graphs has advanced greatly. However, the incompleteness of knowledge graphs limits their application to some extent. To address this problem, a series of knowledge graph embedding models have been proposed. Such models generate embedded representations of entities and relationships that can be used for link prediction, i.e., predicting relationships between existing entities. This approach creates a problem: any attacker can use the generated embeddings to perform link prediction and obtain accurate relationships between entities, and some of these relationships may be sensitive information that should not be obtained by others. Therefore, the embeddings cannot be used directly; they must first be processed to achieve privacy protection, with these relationships treated as sensitive information.

Existing privacy protection technologies fall mainly into the following categories. The first is differential privacy, which is achieved mainly by adding noise to the original data, the parameters or the results. The common Laplace mechanism and exponential mechanism incur a high utility loss when realizing differential privacy. In view of this, Xu et al. proposed a differentially private network embedding method based on matrix factorization that introduces enough noise to guarantee privacy, but it is not suitable for link prediction. Kearns et al. proposed a model that protects some nodes, but it is not applicable to link prediction scenarios either. Abir De et al. introduced a ranking algorithm that applies a monotonic transformation to the base scores of a non-private link prediction system and then adds noise, achieving a better trade-off between privacy and prediction performance. Javier et al. proposed adding or deleting items to minimize privacy risk. Privacy protection can be achieved by deleting or adding specific edges, but this may affect the prediction of the remaining non-sensitive relationships. In addition, simply deleting sensitive information is vulnerable to inference attacks. The second category is encryption. Encryption-based schemes achieve privacy protection through advanced cryptographic techniques; classical ones include homomorphic encryption and secure multi-party computation. They can protect privacy effectively, but their computational cost is high. The last category is based on GANs, where embeddings are learned by training a generative adversarial network. Li Kaiyang et al. proposed a graph adversarial training framework that integrates privacy-stripping and purging mechanisms to resist inference attacks. Among such methods, the adversarial autoencoder (AAE) uses a generative adversarial network (GAN) to perform variational inference, forcing the posterior distribution of the latent code to match a specified prior distribution so that its supervised disentangling capability can protect privacy. However, GAN training still has problems, such as instability.

Disclosure of Invention

The invention aims to overcome the defects of the prior art by providing a privacy-preserving link prediction method and system based on mail data. It aims to solve the technical problem that existing link prediction techniques cannot protect the social relationships of people in a mail system, while ensuring the diversity of the generated samples, predicting non-sensitive relationships better than differential privacy, and requiring less computation than encryption techniques.

The purpose of the invention is achieved by the following technical solution:

A privacy-preserving link prediction method based on mail data, comprising the following steps:

step one: preprocessing the mail data, mining the implicit relationships in the mails, and constructing a person-relationship knowledge graph based on the mail data;

step two: encoding the entities and implicit relationships in the person-relationship knowledge graph with an energy-based model for learning low-dimensional entity embeddings, to obtain an embedding space and embedded data in one-to-one correspondence with the entities;

step three: training a generative adversarial network on the encoded embedded data to obtain a generative model, and using that model to simulate the embedding space;

step four: using a gradient-descent reconstruction method to obfuscate the sensitive and non-sensitive relationships implied in the original data, fine-tuning the distribution structure of the embedding space;

step five: performing inference and prediction of the person relationships in the mail system based on the data of the fine-tuned embedding space.

Specifically, step one comprises:

S101, for a college student mail system data set, selecting the student communication relationship, which is most closely related to the persons involved, and building a communication relationship knowledge graph;

S102, dividing the college student mail system network into an intra-domain communication network and an extra-domain communication network;

S103, defining the communication relationship knowledge graph as (h, l, t) triples, wherein the communication relationships l are divided into two groups: known relationships l_o and unknown relationships l_u that need to be predicted, with l_u ∈ l_o;

S104, further dividing the known relationships l_o into sensitive relationships [formula missing] in the intra-domain network and non-sensitive relationships [formula missing] in the extra-domain communication network, with [formula missing].

Specifically, step two comprises:

S201, generating a real-valued Gaussian distribution and randomly sampling from it to initialize the entities and relationships of the original mail data;

S202, normalizing the entity and relationship vectors in each iteration;

S203, selecting a fixed amount of data each time as a batch of positive samples S_batch, denoted (h, l_o, t), and, for each positive sample, replacing its head and tail entities to form a negative sample in S'_batch, denoted (h', l_o, t');

S204, updating the entity and relationship vectors with a stochastic gradient descent algorithm according to the following loss function:

[formula missing]

wherein [x]₊ denotes the maximum of 0 and x; γ > 0 is a margin hyperparameter that enforces a separation between positive and negative samples; and d(x, y) is a distance function, d(x, y) = (x - y)².
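The loss formula referred to in S204 does not survive in this text. As a hedged sketch only, assuming it follows the standard TransE margin-based ranking loss (an assumption that is consistent with the definitions of [x]₊, γ and d given above, not a quotation of the patent's own formula), it would read:

```latex
\mathcal{L} \;=\; \sum_{(h,\,l_o,\,t)\,\in\,S_{batch}} \;\; \sum_{(h',\,l_o,\,t')\,\in\,S'_{batch}}
\bigl[\, \gamma + d(h + l_o,\; t) - d(h' + l_o,\; t') \,\bigr]_{+},
\qquad [x]_{+} = \max(0,\, x), \quad d(x, y) = (x - y)^{2}
```

Under this reading, a positive triple contributes no loss once its distance is at least γ smaller than that of its corrupted counterpart.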

Specifically, training the generative model in step three comprises:

S301, sampling random noise Z from a Gaussian distribution;

S302, using a neural network comprising two fully connected layers and a normalization layer as the generator model G(·), and adopting the Wasserstein loss plus a link prediction loss, wherein the link prediction loss is a margin-based ranking loss, as follows:

[formula missing]

wherein [formula missing] is a non-sensitive relationship triplet and [formula missing] is a sensitive relationship triplet; γ > 0 is a margin hyperparameter, and d(x, y) denotes the Euclidean distance between two vectors;

the Wasserstein loss is calculated as follows:

[formula missing]

wherein y_n denotes the non-sensitive label and y_s the sensitive label; the loss of the entire generative model is as follows:

L_G = L₂ + λL_Dist

wherein λ is a hyperparameter that adjusts the weight of the individual loss terms;

S303, using a network of two fully connected layers with LeakyReLU activation layers as the discriminator model D(·), the second fully connected layer serving as a classifier that distinguishes real input data from fake, and using the Wasserstein loss; a gradient penalty L_GP is used to enforce the Lipschitz constraint, penalizing the discriminator model if the gradient norm deviates from its target value of 1, so the loss function of the discriminator model is as follows:

[formula missing]

S304, alternately training the generator model and the discriminator model.
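The discriminator loss of S303 is also absent from this text. A possible form, assuming the commonly used WGAN-GP formulation (an assumption; the patent's version also involves the labels y_n and y_s, which are not modeled here), is sketched below. The symbols x (real embedded data), x̂ (a random interpolation between real and generated data), ε and the penalty weight λ_GP are illustrative names that do not appear in the source.

```latex
L_D \;=\; \mathbb{E}_{z}\bigl[ D(G(z)) \bigr] \;-\; \mathbb{E}_{x}\bigl[ D(x) \bigr] \;+\; \lambda_{GP}\, L_{GP},
\qquad
L_{GP} \;=\; \mathbb{E}_{\hat{x}}\Bigl[ \bigl( \lVert \nabla_{\hat{x}} D(\hat{x}) \rVert_{2} - 1 \bigr)^{2} \Bigr],
\qquad
\hat{x} = \varepsilon\, x + (1 - \varepsilon)\, G(z), \;\; \varepsilon \sim U[0, 1]
```

The penalty term is zero exactly when the gradient norm equals the target value of 1, which matches the behavior described in S303.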

Specifically, step four comprises the following sub-steps:

S401, sampling R initial embeddings [formula missing] from a Gaussian distribution;

S402, dividing the original data set by relationship l_o into multiple groups of data X_l, where X_l denotes the TransE encoding corresponding to relationship l;

S403, for any data set X_l containing relationship l, using the trained generator model as the reconstruction neural network and reconstructing the encoding Z of the relational data with the following loss function:

[formula missing]

wherein G(z) is the output of the generator model for input z (z ∈ Z); α > 0 is a margin hyperparameter that enforces a separation between the reconstructed encoding of a sensitive relationship and its normal encoding;

S404, reconstructing the initial embeddings [formula missing] with L steps of gradient descent, the reconstruction being calculated according to the following formula:

[formula missing]

S405, randomly initializing R values of z so that different local minima are sampled, improving the robustness of the reconstruction model, wherein z^* is found by minimizing the following expression:

[formula missing]

finally, using the reconstructed data [formula missing] as the final relationship embedding for the subsequent prediction of personnel relationships.

A privacy-preserving link prediction system based on mail data, implementing the above privacy-preserving link prediction method based on mail data, comprises:

a data preprocessing module for constructing a knowledge graph from the original mail data, forming a rigorous mathematical definition and objective;

an entity-relationship low-dimensional embedding module for learning low-dimensional embeddings of the entities and relationships in the knowledge graph;

a generator training module comprising a generator G and a discriminator D, whose input data are the real embedded data and randomly sampled noise Z drawn from a Gaussian distribution;

a data reconstruction module for reconstructing the embedded data X produced by the entity-relationship low-dimensional embedding module to obtain new low-dimensional embeddings G(z^*) of the entities and relationships;

a link prediction module for predicting the relationships of people in the mail network from the low-dimensional embeddings G(z^*).

The invention has the following beneficial effects:

1. The invention completes the relationships between entities with the reconstructed multi-relational data, so that non-sensitive relationships between entities are completed while sensitive relationships are protected, solving the technical problem that existing link prediction techniques cannot protect the social relationships of people in a mail system.

Drawings

FIG. 1 is a flow chart of the method of the present invention;

FIG. 2 is a framework diagram of the model;

FIG. 3 is a schematic diagram of the neural network architecture used in the generative adversarial network;

FIG. 4 is a functional block diagram of the system of the present invention.

Detailed Description

Specific embodiments are described below so that the technical features, objects and advantages of the present invention can be understood more clearly. It should be understood that the embodiments described are some, but not all, embodiments of the invention, and are not to be construed as limiting its scope. All other embodiments that a person skilled in the art can obtain from the embodiments of the present invention without inventive effort fall within the scope of the present invention.

Embodiment one:

In this embodiment, as shown in FIG. 1, a privacy-preserving link prediction method based on mail data comprises: preprocessing the mail data, mining the implicit relationships in the mails, and constructing a person-relationship knowledge graph based on the mail data; encoding the entities and implicit relationships in the person-relationship knowledge graph with an energy-based model for learning low-dimensional entity embeddings, to obtain an embedding space and embedded data in one-to-one correspondence with the entities; training a generative adversarial network on the encoded embedded data to obtain a generative model, and using that model to simulate the embedding space; using a gradient-descent reconstruction method to obfuscate the sensitive and non-sensitive relationships implied in the original data, fine-tuning the distribution structure of the embedding space; and performing inference and prediction of the person relationships in the mail system based on the data of the fine-tuned embedding space.

In this embodiment, as shown in FIG. 2, a framework for a privacy-preserving link prediction model based on mail data is designed, and privacy-preserving link prediction is performed with the model through the following steps:

step (1), preprocessing the mail data and extracting relationships from the original data to construct a knowledge graph;

step (2), encoding the entities and relationships in the multi-relational data with TransE, so that the resulting representation performs well on the downstream link prediction tasks (for both sensitive and non-sensitive relationships);

step (3), training a generative model with good generative capacity using a generative adversarial network;

step (4), reconstructing the encoded representation of the data using the generative model in combination with the sensitive and non-sensitive relationship data;

step (5), predicting the personnel relationships in the mail network using the new encoded representation.

In the data set preprocessing, taking the data set of a college student mail system as an example, the specific implementation steps are as follows:

a) from different perspectives, different knowledge graphs can be built for the mail system, such as an online login behavior graph or a graph of commonly used mailbox time periods. The student communication relationship, which is most closely related to the persons involved, is selected, and a communication relationship knowledge graph is built;

b) the student mail system network is divided into two parts: communication among students, i.e. the intra-domain communication network, such as communication between kenou, fellow, classmates, etc.; and communication between students and people outside the student body, i.e. the extra-domain communication network, such as communication between students and instructors or between students and teachers;

c) the knowledge graph is defined as (h, l, t) triples, where the relationships l are divided into two groups: the currently known relationships l_o (Buddha, fellow, instructor, lovers, ...) and the unknown relationships l_u that need to be predicted (Buddha, fellow, mentor, ...), where l_u ∈ l_o;

d) the known relationships l_o are further divided into sensitive relationships [formula missing] in the intra-domain network (Buddha, fellow, lovers, etc.) and non-sensitive relationships [formula missing] in the extra-domain communication network (student-instructor relationship, student-professor relationship, etc.), where [formula missing]. Privacy-preserving link prediction predicts unknown relationship triplets (h, l_u, t) from known relationship triplets (h, l_o, t); if [formula missing], the probability of the prediction is made as small as possible, and vice versa;
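The preprocessing steps a) to d) can be illustrated with a minimal Python sketch. It assumes that relation extraction has already produced (sender, receiver, relation) records; the record layout, the names `Triple`, `build_triples` and `split_by_sensitivity`, and the two label sets are hypothetical choices made for illustration, not part of the patent.

```python
from collections import namedtuple

# A knowledge-graph triple (h, l, t): head entity, relationship, tail entity.
Triple = namedtuple("Triple", ["head", "relation", "tail"])

# Hypothetical label sets; the source's full label lists are not recoverable here.
SENSITIVE = {"lovers", "classmate"}          # intra-domain (student to student)
NON_SENSITIVE = {"instructor", "professor"}  # extra-domain (student to staff)


def build_triples(records):
    """Turn (sender, receiver, relation) records into de-duplicated triples."""
    seen = set()
    triples = []
    for sender, receiver, relation in records:
        triple = Triple(sender, relation, receiver)
        if triple not in seen:
            seen.add(triple)
            triples.append(triple)
    return triples


def split_by_sensitivity(triples):
    """Split known triples into sensitive and non-sensitive groups."""
    sensitive = [t for t in triples if t.relation in SENSITIVE]
    non_sensitive = [t for t in triples if t.relation in NON_SENSITIVE]
    return sensitive, non_sensitive


records = [
    ("alice", "bob", "lovers"),
    ("alice", "dr_li", "instructor"),
    ("alice", "bob", "lovers"),        # repeated mail between the same pair
    ("carol", "prof_wu", "professor"),
]
triples = build_triples(records)
sensitive, non_sensitive = split_by_sensitivity(triples)
print(len(triples), len(sensitive), len(non_sensitive))  # 3 1 2
```

Keeping the two groups separate at this stage is what later allows the reconstruction step to treat sensitive and non-sensitive relationships differently.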

In the process of encoding the entities and relationships in the original data with TransE, the specific implementation steps are as follows:

a) generating a real-valued Gaussian distribution and randomly sampling from it to initialize the entities and relationships of the original data;

b) normalizing the entity and relationship vectors in each iteration;

c) selecting a fixed amount of data each time as a batch of positive samples S_batch, denoted (h, l_o, t), and, for each positive sample, replacing its head and tail entities to form a negative sample in S'_batch, denoted (h', l_o, t');

d) updating the entity and relationship vectors with a stochastic gradient descent algorithm according to the following loss function:

[formula missing]

here, [x]₊ denotes the maximum of 0 and x; γ > 0 is a margin hyperparameter that enforces a separation between positive and negative samples: the larger γ is, the larger the enforced separation between the two samples and the stricter the constraint on the encoding vectors; d(x, y) is a distance function, usually chosen as the l₂ norm, i.e.:

d(x, y) = (x - y)² (2)
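To make steps b) to d) concrete, the following Python sketch computes a margin-based hinge term for one positive/negative pair. It assumes the standard TransE scoring d(h + l, t) and a negative sample built by swapping either the head or the tail; both are assumptions, since the loss formula itself is missing from this text. The helper names (`dist`, `normalize`, `corrupt`, `margin_loss`) and the toy embeddings are hypothetical.

```python
import random


def dist(x, y):
    """Squared Euclidean distance, d(x, y) = (x - y)^2 summed over components."""
    return sum((a - b) ** 2 for a, b in zip(x, y))


def normalize(v):
    """Scale a vector to unit length (the per-iteration normalization step)."""
    norm = sum(a * a for a in v) ** 0.5
    return [a / norm for a in v] if norm > 0 else list(v)


def corrupt(triple, entities, rng):
    """Build a negative sample by swapping the head or the tail for another entity."""
    h, l, t = triple
    if rng.random() < 0.5:
        h = rng.choice([e for e in entities if e != h])
    else:
        t = rng.choice([e for e in entities if e != t])
    return (h, l, t)


def margin_loss(emb, pos, neg, gamma):
    """Hinge term [gamma + d(h + l, t) - d(h' + l, t')]_+ for one sample pair."""
    def score(triple):
        h, l, t = triple
        translated = [a + b for a, b in zip(emb[h], emb[l])]
        return dist(translated, emb[t])
    return max(0.0, gamma + score(pos) - score(neg))


emb = {"a": [0.0, 0.0], "b": [1.0, 0.0], "c": [0.0, 3.0], "r": [1.0, 0.0]}
pos = ("a", "r", "b")  # a + r lands exactly on b, so its distance is 0
neg = ("a", "r", "c")  # a + r is far from c, so its distance is 10
loss_small_margin = margin_loss(emb, pos, neg, gamma=1.0)   # margin already met
loss_large_margin = margin_loss(emb, pos, neg, gamma=12.0)  # margin violated by 2
print(loss_small_margin, loss_large_margin)  # 0.0 2.0
```

The two calls show the role of γ described above: with a small margin the pair contributes nothing, while a larger margin demands more separation and produces a nonzero loss that gradient descent would then reduce.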

In the process of training a generative model with good generative capacity using a generative adversarial network, the invention adopts the following algorithm:

[algorithm listing missing]

the specific implementation steps of the process are as follows:

a) sampling random noise Z from a Gaussian distribution;

b) as shown in FIG. 3, a neural network comprising two fully connected layers and one normalization layer is used as the generator model G(·). To avoid mode collapse and increase diversity, the Wasserstein loss plus a link prediction loss is adopted, the link prediction loss being a margin-based ranking loss, as follows:

[formula missing]

here, [formula missing] is a non-sensitive relationship triplet and [formula missing] is a sensitive relationship triplet. γ > 0 is a margin hyperparameter, and d(x, y) denotes the Euclidean distance between two vectors;

the Wasserstein loss is calculated as follows:

[formula missing]

wherein y_n and y_s denote the non-sensitive label and the sensitive label, respectively, so the overall loss of the generative model is as follows:

L_G = L₂ + λL_Dist (5)

wherein λ is a hyperparameter that adjusts the weight of the individual loss terms;

c) two fully connected layers with LeakyReLU activation layers are used as the discriminator model D(·); the second fully connected layer serves as a classifier that distinguishes real input data from fake, and the Wasserstein loss is used. To stabilize training and eliminate mode collapse, a gradient penalty L_GP is also employed to enforce the Lipschitz constraint. The model is penalized if the gradient norm deviates from its target value of 1, so the loss function of the discriminator model is as follows:

[formula missing]

d) alternately training the generator model and the discriminator model;
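The layer structure named in steps b) and c) can be sketched as a forward pass in plain Python. This is only an illustration under stated assumptions: the layer sizes, the use of layer normalization, the LeakyReLU in the generator and its placement before the normalization are all guesses, because the text specifies only "two fully connected layers and one normalization layer" for the generator and "two fully connected layers with LeakyReLU" for the discriminator. Training and the loss terms are not shown, and all function names are hypothetical.

```python
import random


def linear(x, weights, bias):
    """Fully connected layer y = W x + b, with W stored as a list of rows."""
    return [sum(w * xi for w, xi in zip(row, x)) + b for row, b in zip(weights, bias)]


def leaky_relu(x, slope=0.2):
    """LeakyReLU activation applied element-wise."""
    return [v if v > 0 else slope * v for v in x]


def layer_norm(x, eps=1e-5):
    """Normalize a vector to zero mean and (approximately) unit variance."""
    mean = sum(x) / len(x)
    var = sum((v - mean) ** 2 for v in x) / len(x)
    return [(v - mean) / (var + eps) ** 0.5 for v in x]


def init_layer(n_in, n_out, rng):
    """Small Gaussian weights and zero bias for one fully connected layer."""
    weights = [[rng.gauss(0.0, 0.1) for _ in range(n_in)] for _ in range(n_out)]
    return weights, [0.0] * n_out


def generator(z, params):
    """Noise z -> fake embedding: FC, activation, normalization, FC."""
    (w1, b1), (w2, b2) = params
    hidden = layer_norm(leaky_relu(linear(z, w1, b1)))
    return linear(hidden, w2, b2)


def discriminator(x, params):
    """Embedding x -> unbounded critic score (no sigmoid, as in Wasserstein GANs)."""
    (w1, b1), (w2, b2) = params
    hidden = leaky_relu(linear(x, w1, b1))
    return linear(hidden, w2, b2)[0]


rng = random.Random(0)
noise_dim, hidden_dim, embed_dim = 4, 8, 6  # illustrative sizes only
g_params = [init_layer(noise_dim, hidden_dim, rng), init_layer(hidden_dim, embed_dim, rng)]
d_params = [init_layer(embed_dim, hidden_dim, rng), init_layer(hidden_dim, 1, rng)]

z = [rng.gauss(0.0, 1.0) for _ in range(noise_dim)]
fake = generator(z, g_params)
score = discriminator(fake, d_params)
print(len(fake), type(score).__name__)  # 6 float
```

In alternating training, the discriminator parameters would be updated for some steps with the generator fixed, then the generator updated with the discriminator fixed, and so on.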

In reconstructing the encoded representation of the data using the generative model in combination with the sensitive and non-sensitive relationship data, the invention adopts the following algorithm:

[algorithm listing missing]

the specific steps of the process comprise:

a) sampling R initial embeddings [formula missing] from a Gaussian distribution;

b) dividing the original data set by relationship l_o into multiple groups of data X_l, e.g. X_{lovers}, X_{teachers and students}, X_{Buddha's friend}, etc., where X_l denotes the TransE encoding corresponding to relationship l;

c) for any data set X_l containing relationship l, using the trained generator model as the reconstruction neural network and reconstructing the encoding Z of the relational data with the following loss function:

[formula missing]

here, G(z) is the output of the generator model for input z (z ∈ Z); α > 0 is a margin hyperparameter that enforces a separation between the reconstructed encoding of a sensitive relationship and its normal encoding: the larger α is, the larger the enforced separation between the two encodings and the stricter the constraint on the encoding vectors;

d) reconstructing the initial embeddings [formula missing] with L steps of gradient descent, the reconstruction being calculated according to the following formula:

[formula missing]

e) because the mean squared error is non-convex, randomly initializing R values of z so that different local minima are sampled, improving the robustness of the reconstruction model, where z^* is found by minimizing the following expression:

[formula missing]

finally, the reconstructed data [formula missing] is used as the final relationship embedding for the subsequent prediction of personnel relationships.
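The reconstruction loop of steps a), d) and e) (R random restarts, L gradient steps each, keep the best z) can be sketched as follows. The sketch makes two loud simplifications: the trained generator is replaced by a fixed linear map `W`, a toy stand-in so the gradient can be written by hand, and the loss is plain squared error without the α-margin term for sensitive relationships, whose exact form is not shown in this text. All names are hypothetical.

```python
import random

# Toy stand-in for a trained generator: a fixed linear map G(z) = W z.
W = [[1.0, 0.0], [0.0, 2.0], [1.0, 1.0]]


def generate(z):
    """Apply the stand-in generator to a latent vector z."""
    return [sum(w * zi for w, zi in zip(row, z)) for row in W]


def recon_loss(z, x):
    """Squared reconstruction error ||G(z) - x||^2."""
    return sum((g - xi) ** 2 for g, xi in zip(generate(z), x))


def recon_grad(z, x):
    """Gradient of ||W z - x||^2 with respect to z, i.e. 2 W^T (W z - x)."""
    residual = [g - xi for g, xi in zip(generate(z), x)]
    return [2.0 * sum(W[i][j] * residual[i] for i in range(len(W))) for j in range(len(z))]


def reconstruct(x, restarts, steps, lr, rng):
    """Run `steps` gradient-descent steps from `restarts` random z; keep the best."""
    best_z, best_loss = None, float("inf")
    for _ in range(restarts):
        z = [rng.gauss(0.0, 1.0) for _ in range(len(W[0]))]
        for _ in range(steps):
            grad = recon_grad(z, x)
            z = [zi - lr * gi for zi, gi in zip(z, grad)]
        loss = recon_loss(z, x)
        if loss < best_loss:
            best_z, best_loss = z, loss
    return best_z, best_loss


x = generate([0.5, -1.0])  # an embedding that the generator can reproduce exactly
z_star, loss = reconstruct(x, restarts=4, steps=200, lr=0.05, rng=random.Random(0))
reconstructed = generate(z_star)  # plays the role of G(z^*)
```

With a linear generator the problem is convex, so every restart reaches the same minimum; the restarts only matter for a real, non-convex generator, where different initial z can end in different local minima.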

The solution in this embodiment adopts WGAN to address the problems of conventional GAN training, such as instability, and largely resolves mode collapse, thereby ensuring the diversity of the generated samples. In predicting non-sensitive relationships, the solution of this embodiment outperforms differential privacy, and its computational cost is lower than that of encryption techniques.

Embodiment two:

In this embodiment, a privacy-preserving link prediction system based on mail data is built using the method of embodiment one. As shown in FIG. 4, the system comprises the following modules:

Data preprocessing module: constructs a knowledge graph from the original mail data, forming a rigorous mathematical definition and objective.

Entity-relationship low-dimensional embedding module: given a set S of triples of the form (h, l, t), each containing two entities h, t ∈ E (the set of entities) and a relationship l ∈ L (the set of relationships), this module learns low-dimensional embeddings of the entities and relationships that perform well on the downstream link prediction task. The TransE model is selected for this module because of its strong performance.

Generator training module: shown in FIG. 2, this module comprises a generator G and a discriminator D; its input data are the real embedded data and randomly sampled noise Z drawn from a Gaussian distribution. Through adversarial training, the generator learns to generate data with the same distribution as the real embedded data.

Data reconstruction module: shown in the left part of FIG. 2. The data produced by the entity-relationship low-dimensional embedding module is called embedded data and is denoted X. For any entity or relationship, the mapping to its embedded data can therefore be represented by a unique pair {e → X}, where e is one of h, l, t. Given a pre-trained generator G and the entity or relationship X to be predicted, z^* is first found by minimizing the reconstruction loss. Then G(z^*) is used as the reconstructed embedding for link prediction. Since equation 1 is a highly non-convex minimization problem, the process is approximated by L steps of gradient descent from R different random initializations of z (denoted [formula missing]). After adversarial training, [formula missing] is fed into the generator, and L steps of gradient descent are used to estimate the projection of the real data set onto the embedding space of the generator.

Link prediction module: the data reconstruction module yields new low-dimensional embeddings G(z^*) of the entities and relationships, which can be used to predict the relationships of people in the mail network.
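As an illustration of how the link prediction module might use the embeddings, the sketch below ranks candidate relationships for a pair of entities by the TransE-style score d(h + l, t), smallest first. This scoring rule is an assumption carried over from the TransE encoding step; the patent does not spell out the scoring used at prediction time. The function names and the toy embeddings are hypothetical, and in the described system the vectors would be the reconstructed embeddings G(z^*).

```python
def dist(x, y):
    """Squared Euclidean distance between two vectors."""
    return sum((a - b) ** 2 for a, b in zip(x, y))


def rank_relations(head, tail, entity_emb, relation_emb):
    """Return relationship names sorted by d(h + l, t), most plausible first."""
    h, t = entity_emb[head], entity_emb[tail]
    scores = {
        name: dist([a + b for a, b in zip(h, l)], t)
        for name, l in relation_emb.items()
    }
    return sorted(scores, key=scores.get)


# Toy embeddings chosen by hand for the example.
entity_emb = {"alice": [0.0, 0.0], "dr_li": [1.0, 1.0]}
relation_emb = {"instructor": [1.0, 0.9], "lovers": [-1.0, 0.5]}

ranking = rank_relations("alice", "dr_li", entity_emb, relation_emb)
print(ranking)  # ['instructor', 'lovers']
```

If the reconstruction step works as intended, sensitive relationships should receive large distances (low rank) in the reconstructed space, while non-sensitive relationships remain easy to predict.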

The foregoing shows and describes the general principles, essential features, and advantages of the invention. It will be understood by those skilled in the art that the present invention is not limited to the embodiments described above, which are given by way of illustration of the principles of the present invention, but that various changes and modifications may be made without departing from the spirit and scope of the invention, and such changes and modifications are within the scope of the invention as claimed. The scope of the invention is defined by the appended claims and equivalents thereof.

Claims

1. A privacy protection link prediction method based on mail data is characterized by comprising the following steps:

2. The method according to claim 1, wherein the first step specifically comprises:

and non-sensitive relationships [formula missing] in the extra-domain communication network, with [formula missing].

3. The method according to claim 1, wherein the second step specifically comprises:

S203, selecting a fixed amount of data each time as a batch of positive samples S_batch, denoted (h, l_o, t), and, for each positive sample, replacing its head and tail entities to form a negative sample in S'_batch, denoted (h', l_o, t');

wherein [x]₊ denotes the maximum of 0 and x; γ > 0 is a margin hyperparameter that enforces a separation between positive and negative samples; and d(x, y) is a distance function, d(x, y) = (x - y)².

4. The method for predicting privacy-preserving links based on mail data as claimed in claim 1, wherein the training in step three obtains the generative model specifically comprising:

S301, sampling random noise Z from a Gaussian distribution;

wherein [formula missing] is a non-sensitive relationship triplet and [formula missing] is a sensitive relationship triplet; γ > 0 is a margin hyperparameter, and d(x, y) denotes the Euclidean distance between two vectors;

the Wasserstein loss is calculated as follows:

[formula missing]

wherein y_n denotes the non-sensitive label and y_s the sensitive label; the loss of the entire generative model is as follows:

L_G = L₂ + λL_Dist

wherein λ is a hyperparameter that adjusts the weight of the individual loss terms;

S304, alternately training the generator model and the discriminator model.

5. The method for predicting privacy-preserving links based on mail data as claimed in claim 1, wherein the fourth step specifically comprises the following sub-steps:

S401, sampling R initial embeddings [formula missing] from a Gaussian distribution;

S402, dividing the original data set by relationship l_o into multiple groups of data X_l, where X_l denotes the TransE encoding corresponding to relationship l;

wherein G(z) is the output of the generator model for input z (z ∈ Z); α > 0 is a margin hyperparameter that enforces a separation between the reconstructed encoding of a sensitive relationship and its normal encoding;

the reconstruction is calculated according to the following formula:

[formula missing]

finally, using the reconstructed data [formula missing] as the final relationship embedding for the subsequent prediction of personnel relationships.

6. A privacy-preserving link prediction system based on mail data, which is realized by the privacy-preserving link prediction method based on mail data of any one of claims 1 to 5, and is characterized by comprising

a data reconstruction module for reconstructing the embedded data X produced by the entity-relationship low-dimensional embedding module to obtain new low-dimensional embeddings G(z^*) of the entities and relationships;