CN114513337A - Privacy protection link prediction method and system based on mail data - Google Patents

Privacy protection link prediction method and system based on mail data

Info

Publication number
CN114513337A
Authority
CN
China
Prior art keywords
data
relationship
relation
sensitive
entities
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Granted
Application number
CN202210066876.0A
Other languages
Chinese (zh)
Other versions
CN114513337B (en
Inventor
王勇
王范川
王晓虎
秦瑞
张应福
石锟
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
University of Electronic Science and Technology of China
Original Assignee
University of Electronic Science and Technology of China
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by University of Electronic Science and Technology of China filed Critical University of Electronic Science and Technology of China
Priority to CN202210066876.0A priority Critical patent/CN114513337B/en
Publication of CN114513337A publication Critical patent/CN114513337A/en
Application granted granted Critical
Publication of CN114513337B publication Critical patent/CN114513337B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical

Classifications

    • H: ELECTRICITY
    • H04: ELECTRIC COMMUNICATION TECHNIQUE
    • H04L: TRANSMISSION OF DIGITAL INFORMATION, e.g. TELEGRAPHIC COMMUNICATION
    • H04L63/00: Network architectures or network communication protocols for network security
    • H04L63/04: Network architectures or network communication protocols for network security for providing a confidential data exchange among entities communicating through data packet networks
    • H04L63/0407: Network architectures or network communication protocols for network security for providing a confidential data exchange among entities communicating through data packet networks wherein the identity of one or more communicating identities is hidden
    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06N: COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00: Computing arrangements based on biological models
    • G06N3/02: Neural networks
    • G06N3/08: Learning methods

Landscapes

  • Engineering & Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Computer Security & Cryptography (AREA)
  • Theoretical Computer Science (AREA)
  • General Engineering & Computer Science (AREA)
  • Computing Systems (AREA)
  • General Health & Medical Sciences (AREA)
  • Mathematical Physics (AREA)
  • Data Mining & Analysis (AREA)
  • Evolutionary Computation (AREA)
  • Biophysics (AREA)
  • Molecular Biology (AREA)
  • Biomedical Technology (AREA)
  • Artificial Intelligence (AREA)
  • General Physics & Mathematics (AREA)
  • Computational Linguistics (AREA)
  • Software Systems (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Health & Medical Sciences (AREA)
  • Computer Hardware Design (AREA)
  • Computer Networks & Wireless Communication (AREA)
  • Signal Processing (AREA)
  • Management, Administration, Business Operations System, And Electronic Commerce (AREA)

Abstract

The invention discloses a privacy-preserving link prediction method and system based on mail data. The method comprises: constructing a person-relationship knowledge graph from the mail data; training a generative model with a generative adversarial network so that it learns the distribution of the training data; reconstructing the multi-relational data so as to obfuscate the sensitive and non-sensitive relationship information implied in the data; and completing the relationships between entities with the reconstructed multi-relational data, so that non-sensitive relationships between entities are completed while sensitive relationships are protected. The invention also provides a privacy-preserving link prediction system based on mail data that implements the method. The invention thereby solves the technical problem that existing link prediction techniques cannot protect the social relationships of people in a mail system.

Description

Privacy protection link prediction method and system based on mail data
Technical Field
The invention relates to the technical fields of adversarial learning, graph network representation learning, knowledge graphs and link prediction, and in particular to a privacy-preserving link prediction method and system based on mail data.
Background
As one of the applications of the Internet, mail is an important means of communication in modern society. Mail data records human communication, including important information such as who communicates with whom, when, and how often. With simple entity-relation extraction and data mining, several knowledge graphs can be built from a single mail data set. Taking a campus student mail system as an example, a communication-relationship graph can be built from the communication view, and an online login-behavior graph can be built from the device-login view. In such a graph, nodes correspond to entities and edges correspond to relationships; each triple states that a given relationship holds between two entities.
In recent years, research on knowledge graphs has advanced greatly. However, the incompleteness of knowledge graphs limits their application to some extent. To address this problem, a series of knowledge graph embedding models have been proposed. These models generate embedded representations of entities and relationships that can be used for link prediction, i.e., predicting relationships between existing entities. This approach creates a problem: any attacker can use the generated embeddings to perform link prediction and obtain accurate relationships between entities, some of which may be sensitive information that should not be obtained by others. Therefore, the embeddings cannot be used directly; they must first be processed to protect privacy, with these relationships treated as sensitive information.
Existing privacy protection techniques fall mainly into the following categories. The first is differential privacy, which is achieved mainly by adding noise to the original data, the parameters or the results. The common Laplace and exponential mechanisms incur a high utility loss when realizing differential privacy. Against this background, Xu et al. proposed a differentially private network embedding method based on matrix factorization that introduces enough noise to guarantee privacy but is not suitable for link prediction. Kearns et al. proposed a model that protects some nodes, but it is not applicable to link prediction scenarios either. Abir De et al. introduced a ranking algorithm that monotonically transforms the base scores of a non-private link prediction system and then adds noise, which trades off privacy and prediction performance more effectively. Javier et al. proposed a method that adds or deletes items to minimize privacy risk. Privacy protection can be achieved by deleting or adding specific edges, but this may affect the prediction of the remaining non-sensitive relationships. In addition, simply deleting sensitive information is vulnerable to inference attacks.
The second category is encryption. Encryption-based schemes achieve privacy protection through advanced cryptographic techniques; classical examples include homomorphic encryption and secure multi-party computation. They protect privacy effectively, but their computational cost is high.
The last category is GAN-based methods, in which the embeddings are trained with a generative adversarial network. Li Kaiyang et al. proposed a graph adversarial training framework that integrates privacy stripping and clearing mechanisms to resist inference attacks. The adversarial autoencoder (AAE) uses a generative adversarial network (GAN) to perform variational inference, forcing the posterior distribution of the latent code to match a specified prior distribution, so that its supervised separation capability can protect privacy. However, GAN training still has some problems, such as unstable training.
Disclosure of Invention
The invention aims to overcome the shortcomings of the prior art by providing a privacy-preserving link prediction method and system based on mail data. It addresses the technical problem that existing link prediction techniques cannot protect the social relationships of people in a mail system, ensures the diversity of generated samples and, in predicting non-sensitive relationships, offers better privacy protection and a smaller computational cost than encryption techniques.
The purpose of the invention is achieved by the following technical solution:
A privacy-preserving link prediction method based on mail data comprises the following steps:
Step one: preprocessing the mail data, mining the implicit relationships in the mails, and constructing a person-relationship knowledge graph based on the mail data;
Step two: encoding the entities and implicit relationships in the person-relationship knowledge graph with an energy-based low-dimensional entity embedding model, to obtain an embedding space and embedded data that correspond one-to-one to the different entities;
Step three: training a generative adversarial network on the encoded embedded data to obtain a generative model, and using the model to simulate the embedding space;
Step four: using a gradient-descent reconstruction method to obfuscate the sensitive and non-sensitive relationships implied in the original data, thereby fine-tuning the distribution structure of the embedding space;
Step five: performing inference and prediction of person relationships in the mail system based on the data of the fine-tuned embedding space.
Specifically, step one comprises:
S101: for a college student mail system data set, selecting the student communication relationship, which is the relationship most closely tied to the people involved, and building a communication-relationship knowledge graph;
S102: dividing the college student mail system network into an intra-domain communication network and an extra-domain communication network;
S103: defining the communication-relationship knowledge graph as a set of $(h, l, t)$ triples, wherein the communication relationship $l$ is divided into two groups: the known relationships $l_o$ and the unknown relationships $l_u$ that need to be predicted, with $l_u \in l_o$;
S104: further dividing the known relationships $l_o$ into sensitive relationships in the intra-domain network
Figure BDA0003480482070000021
and non-sensitive relationships in the extra-domain communication network
Figure BDA0003480482070000022
with
Figure BDA0003480482070000023
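The partition described in S103 and S104 can be illustrated with a short Python sketch. It is only an illustration under stated assumptions: the `Triple` type, the `split_known_triples` helper and the relation labels are hypothetical and do not come from the patent.

```python
from typing import NamedTuple


class Triple(NamedTuple):
    """One (h, l, t) fact of the communication-relationship knowledge graph."""
    head: str
    relation: str
    tail: str


# Hypothetical label sets; the real ones would be derived from the mail data.
SENSITIVE = {"lovers", "friend"}
NON_SENSITIVE = {"student-instructor", "student-professor"}


def split_known_triples(triples):
    """Split known-relationship triples into (sensitive, non_sensitive) lists."""
    sensitive, non_sensitive = [], []
    for triple in triples:
        if triple.relation in SENSITIVE:
            sensitive.append(triple)
        elif triple.relation in NON_SENSITIVE:
            non_sensitive.append(triple)
        else:
            raise ValueError(f"unlabelled relation: {triple.relation}")
    return sensitive, non_sensitive


known = [
    Triple("alice", "lovers", "bob"),
    Triple("alice", "student-professor", "carol"),
]
sens, non_sens = split_known_triples(known)
```

Raising on an unlabelled relation is a design choice of this sketch: it keeps a relationship from silently ending up in neither group.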
Specifically, step two comprises:
S201: initializing the entity and relation vectors of the original mail data by random sampling from a real-valued Gaussian distribution;
S202: normalizing the entity and relation vectors in each iteration;
S203: in each iteration, selecting a fixed number of triples as positive samples $S_{batch}$, written $(h, l_o, t)$, and, for each positive sample, replacing its head and tail entities to form a negative sample in $S'_{batch}$, written $(h', l_o, t')$;
S204: updating the entity and relation vectors by stochastic gradient descent with the following loss function:
Figure BDA0003480482070000031
wherein $[x]_+$ denotes $\max(0, x)$; $\gamma > 0$ is a margin hyperparameter that sets the gap enforced between a positive sample and its negative sample; and $d(x, y)$ is a distance function, $d(x, y) = (x - y)^2$.
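As a rough illustration of S203 and S204, the following Python sketch evaluates a margin-based loss of the usual TransE form, `max(0, gamma + d(h + l, t) - d(h' + l, t'))`, on paired positive and negative samples. The formula in the source is an image, so this form is an assumption based on the definitions given in the text. The helper names, the toy vectors, and the choice to corrupt either the head or the tail (as in the standard TransE recipe) are likewise assumptions.

```python
import random


def dist(x, y):
    """Squared Euclidean distance, matching d(x, y) = (x - y)^2."""
    return sum((a - b) ** 2 for a, b in zip(x, y))


def translate(h, l):
    """Element-wise sum h + l of an entity vector and a relation vector."""
    return [a + b for a, b in zip(h, l)]


def margin_loss(pos, neg, ent, rel, gamma=1.0):
    """Sum of max(0, gamma + d(h + l, t) - d(h' + l, t')) over paired samples."""
    total = 0.0
    for (h, l, t), (h2, _, t2) in zip(pos, neg):
        d_pos = dist(translate(ent[h], rel[l]), ent[t])
        d_neg = dist(translate(ent[h2], rel[l]), ent[t2])
        total += max(0.0, gamma + d_pos - d_neg)
    return total


def corrupt(triple, entities, rng):
    """Negative sample: replace the head or the tail with a random entity."""
    h, l, t = triple
    if rng.random() < 0.5:
        return (rng.choice(entities), l, t)
    return (h, l, rng.choice(entities))


# Toy 2-d embeddings (hypothetical values).
ent = {"a": [0.0, 0.0], "b": [1.0, 0.0], "c": [5.0, 5.0]}
rel = {"r": [1.0, 0.0]}

# A badly embedded positive triple is penalised ...
loss_bad = margin_loss([("a", "r", "c")], [("a", "r", "b")], ent, rel)
# ... while a positive triple that already beats its negative by the margin is not.
loss_good = margin_loss([("a", "r", "b")], [("a", "r", "c")], ent, rel)
```

In a full training loop the gradient of this loss with respect to the vectors would drive the stochastic gradient descent update of S204; that part is omitted here.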
Specifically, training the generative model in step three comprises:
S301: sampling random noise $Z$ from a Gaussian distribution;
S302: using a neural network comprising two fully connected layers and one normalization layer as the generator model $G(\cdot)$, trained with a Wasserstein loss and a link prediction loss, wherein the link prediction loss is a margin-based ranking loss of the following form:
Figure BDA0003480482070000032
wherein
Figure BDA0003480482070000033
is a non-sensitive relationship triple,
Figure BDA0003480482070000034
is a sensitive relationship triple, $\gamma > 0$ is a margin hyperparameter, and $d(x, y)$ is the Euclidean distance between two vectors;
the Wasserstein loss is calculated as follows:
Figure BDA0003480482070000035
wherein $y_n$ denotes the non-sensitive label and $y_s$ denotes the sensitive label; the loss of the entire generator model is:
$L_G = L_2 + \lambda L_{Dist}$
wherein $\lambda$ is a hyperparameter that adjusts the weight of the individual loss terms;
S303: using a network of two fully connected layers with LeakyReLU activation layers as the discriminator model $D(\cdot)$, the second fully connected layer acting as a classifier that distinguishes real input data from generated data, trained with a Wasserstein loss; a gradient penalty $L_{GP}$ is used to enforce the Lipschitz constraint, penalizing the discriminator whenever the norm of its gradient deviates from the target value of 1, so the loss function of the discriminator model is:
Figure BDA0003480482070000036
S304: alternately training the generator model and the discriminator model.
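The discriminator loss of S303 is an image in the source. For orientation only, the commonly used WGAN gradient-penalty objective has the form below; the weight $\lambda_{GP}$, the interpolation variable $\epsilon$ and the interpolated sample $\hat{x}$ are assumptions and may differ from the patent's own formula.

```latex
L_D = \mathbb{E}_{Z}\big[D(G(Z))\big] - \mathbb{E}_{x}\big[D(x)\big] + \lambda_{GP}\, L_{GP},
\qquad
L_{GP} = \mathbb{E}_{\hat{x}}\Big[\big(\lVert \nabla_{\hat{x}} D(\hat{x}) \rVert_2 - 1\big)^2\Big],
\qquad
\hat{x} = \epsilon\, x + (1 - \epsilon)\, G(Z), \quad \epsilon \sim U[0, 1]
```

The penalty term vanishes exactly when the gradient norm equals 1, which matches the description of $L_{GP}$ in S303.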
Specifically, step four comprises the following sub-steps:
S401: sampling $R$ initial embeddings from a Gaussian distribution
Figure BDA0003480482070000037
S402: dividing the original data set by relationship $l_o$ into several data sets $X_l$, where $X_l$ denotes the TransE encoding corresponding to relationship $l$;
S403: for any data set $X_l$ containing relationship $l$, using the trained generator model as the reconstruction network and reconstructing the encoding $z$ of the relational data with the following loss function:
Figure BDA0003480482070000041
wherein $G(z)$ is the output of the generator model for input $z$ ($z \in Z$), and $\alpha > 0$ is a margin hyperparameter that sets the gap enforced between the reconstructed encoding of a sensitive relationship and its normal encoding;
S404: reconstructing, with $L$ steps of gradient descent, the initial embeddings
Figure BDA0003480482070000042
the reconstruction being calculated according to the following formulas:
Figure BDA0003480482070000043
Figure BDA0003480482070000044
S405: randomly initializing $R$ values of $z$ so that different local minima are sampled, which improves the robustness of the reconstruction model, wherein $z^*$ is found by minimizing the following expression:
Figure BDA0003480482070000045
and finally using the reconstructed data
Figure BDA0003480482070000046
as the final relationship embedding for the subsequent prediction of person relationships.
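The multi-start gradient-descent search of S401 to S405 can be sketched in Python as follows. This is a simplified illustration, not the patent's procedure: the trained generator is replaced by a toy linear map `G(z) = W z`, the loss is a plain squared error without the margin term $\alpha$ (whose exact form is an image in the source), and all names, the step size and the step count are assumptions.

```python
import random


def generate(W, z):
    """Toy linear 'generator' G(z) = W z; stands in for the trained network."""
    return [sum(w * zj for w, zj in zip(row, z)) for row in W]


def recon_loss(W, z, x):
    """Squared reconstruction error ||G(z) - x||^2."""
    return sum((g - xi) ** 2 for g, xi in zip(generate(W, z), x))


def recon_grad(W, z, x):
    """Analytic gradient of ||W z - x||^2 with respect to z: 2 W^T (W z - x)."""
    residual = [g - xi for g, xi in zip(generate(W, z), x)]
    return [2.0 * sum(W[i][j] * residual[i] for i in range(len(W)))
            for j in range(len(z))]


def reconstruct(W, x, restarts=4, steps=200, lr=0.05, seed=0):
    """Run `steps` gradient updates from `restarts` Gaussian starts; keep the best z."""
    rng = random.Random(seed)
    best_z, best_loss = None, float("inf")
    for _ in range(restarts):                      # R random initialisations
        z = [rng.gauss(0.0, 1.0) for _ in range(len(W[0]))]
        for _ in range(steps):                     # L gradient-descent steps
            grad = recon_grad(W, z, x)
            z = [zj - lr * gj for zj, gj in zip(z, grad)]
        loss = recon_loss(W, z, x)
        if loss < best_loss:                       # z* = best of the R candidates
            best_z, best_loss = z, loss
    return best_z, best_loss


W = [[1.0, 0.0], [0.0, 2.0], [1.0, 1.0]]
x = generate(W, [0.5, -1.0])   # an embedding the toy generator can reproduce exactly
z_star, final_loss = reconstruct(W, x)
```

With a real, non-linear generator the loss surface is non-convex, which is why several restarts are kept; in this linear toy case every restart converges to the same minimum.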
A privacy-preserving link prediction system based on mail data, which implements the above privacy-preserving link prediction method based on mail data, comprises:
a data preprocessing module for constructing a knowledge graph from the original mail data, yielding a rigorous mathematical definition and objective;
an entity-relationship low-dimensional embedding module for learning low-dimensional embeddings of the entities and relationships in the knowledge graph;
a generator training module comprising a generator $G$ and a discriminator $D$, whose input data are the real embedded data and random noise $Z$ sampled from a Gaussian distribution;
a data reconstruction module for reconstructing the embedded data $X$ produced by the entity-relationship low-dimensional embedding module to obtain new low-dimensional embeddings $G(z^*)$ of the entities and relationships; and
a link prediction module for predicting the relationships of people in the mail network from the low-dimensional embeddings $G(z^*)$.
The invention has the following beneficial effects:
1. The invention completes the relationships between entities using the reconstructed multi-relational data, so that non-sensitive relationships between entities are completed while sensitive relationships are protected, which solves the technical problem that existing link prediction techniques cannot protect the social relationships of people in a mail system.
Drawings
FIG. 1 is a flow chart of the method of the present invention;
FIG. 2 is a framework diagram of the model of the invention;
FIG. 3 is a schematic diagram of the neural network architecture used in the generative adversarial network;
FIG. 4 is a functional block diagram of the system of the invention.
Detailed Description
The following embodiments are described in detail so that the technical features, objects and advantages of the invention can be understood more clearly. The embodiments described are some, but not all, of the embodiments of the invention and are not to be construed as limiting its scope. All other embodiments obtained by a person skilled in the art on the basis of these embodiments without inventive effort fall within the scope of protection of the invention.
Embodiment one:
In this embodiment, as shown in FIG. 1, a privacy-preserving link prediction method based on mail data comprises: preprocessing the mail data, mining the implicit relationships in the mails, and constructing a person-relationship knowledge graph based on the mail data; encoding the entities and implicit relationships in the person-relationship knowledge graph with an energy-based low-dimensional entity embedding model, to obtain an embedding space and embedded data that correspond one-to-one to the different entities; training a generative adversarial network on the encoded embedded data to obtain a generative model, and using the model to simulate the embedding space; using a gradient-descent reconstruction method to obfuscate the sensitive and non-sensitive relationships implied in the original data, thereby fine-tuning the distribution structure of the embedding space; and performing inference and prediction of person relationships in the mail system based on the data of the fine-tuned embedding space.
In this embodiment, as shown in FIG. 2, a framework for a privacy-preserving link prediction model based on mail data is designed, and privacy-preserving link prediction is performed with the model in the following steps:
Step (1): preprocessing the mail data and extracting relationships from the original data to construct a knowledge graph;
Step (2): encoding the entities and relationships in the multi-relational data with TransE, so that the resulting representation performs well on the downstream link prediction tasks (for both sensitive and non-sensitive relationships);
Step (3): training a generative model with good generative capacity using a generative adversarial network;
Step (4): reconstructing the encoded representation of the data with the generative model, combining the sensitive and non-sensitive relationship data;
Step (5): predicting person relationships in the mail network using the new encoded representation.
In the data set preprocessing, taking the data set of a college student mail system as an example, the specific implementation steps are as follows:
a) different knowledge graphs can be built for the mail system from different perspectives, such as an online login-behavior graph or a graph of commonly used mailbox time periods. The student communication relationship, which is most closely tied to the people involved, is selected, and a communication-relationship knowledge graph is built;
b) the student mail system network is divided into two parts: the intra-domain communication network, i.e., communication among students, such as between friends, fellows and classmates; and the extra-domain communication network, i.e., communication between students and people outside the student body, such as between students and instructors or between students and teachers;
c) the knowledge graph is defined as $(h, l, t)$ triples, where the relationship $l$ is divided into two groups: the currently known relationships $l_o$ (friend, fellow, instructor, lovers, ...) and the unknown relationships $l_u$ that need to be predicted (friend, fellow, mentor, ...), where $l_u \in l_o$;
d) the known relationships $l_o$ are further divided into sensitive relationships in the intra-domain network
Figure BDA0003480482070000061
(friend, fellow, lovers, etc.) and non-sensitive relationships in the extra-domain communication network
Figure BDA0003480482070000062
(student-instructor relationship, student-professor relationship, etc.), where
Figure BDA0003480482070000063
Privacy-preserving link prediction uses the known relationship triples $(h, l_o, t)$ to predict the unknown relationship triples $(h, l_u, t)$, and if
Figure BDA0003480482070000064
the probability of the prediction is made as small as possible, and vice versa;
In encoding the entities and relationships of the original data with TransE, the specific implementation steps are as follows:
a) the entity and relation vectors of the original data are initialized by random sampling from a real-valued Gaussian distribution;
b) the entity and relation vectors are normalized in each iteration;
c) in each iteration, a fixed number of triples is selected as positive samples $S_{batch}$, written $(h, l_o, t)$, and for each positive sample its head and tail entities are replaced to form a negative sample in $S'_{batch}$, written $(h', l_o, t')$;
d) the entity and relation vectors are updated by stochastic gradient descent with the following loss function:
Figure BDA0003480482070000065
Here, $[x]_+$ denotes $\max(0, x)$, and $\gamma > 0$ is a margin hyperparameter that sets the gap enforced between a positive sample and its negative sample: the larger $\gamma$ is, the larger the enforced gap between the two samples and the stricter the correction applied to the encoding vectors. $d(x, y)$ is a distance function, usually chosen as the $\ell_2$ norm, i.e.:
$d(x, y) = (x - y)^2$ (2)
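The loss function referred to in step d) is an image in the source. Assuming it follows the usual TransE formulation, which is consistent with the definitions of $[x]_+$, $\gamma$ and $d$ given here, it would read as below. The equation number (1) is inferred from the numbering of equation (2) and is also an assumption.

```latex
\mathcal{L} = \sum_{(h, l_o, t) \in S_{batch}} \; \sum_{(h', l_o, t') \in S'_{batch}}
\big[\gamma + d(h + l_o, t) - d(h' + l_o, t')\big]_+ \qquad (1)
```

Each term is zero once the positive triple is closer than its negative counterpart by at least the margin $\gamma$, so only violating pairs contribute to the gradient.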
In training a generative model with good generative capacity using the generative adversarial network, the invention adopts the following algorithm:
Figure BDA0003480482070000066
Figure BDA0003480482070000071
The specific implementation steps of the process are as follows:
a) random noise $Z$ is sampled from a Gaussian distribution;
b) as shown in FIG. 3, a neural network comprising two fully connected layers and one normalization layer is used as the generator model $G(\cdot)$. To avoid mode collapse and increase diversity, a Wasserstein loss plus a link prediction loss is adopted, the latter expressed as a margin-based ranking loss:
Figure BDA0003480482070000072
Here,
Figure BDA0003480482070000073
is a non-sensitive relationship triple,
Figure BDA0003480482070000074
is a sensitive relationship triple, $\gamma > 0$ is a margin hyperparameter, and $d(x, y)$ is the Euclidean distance between two vectors.
The Wasserstein loss is calculated as follows:
Figure BDA0003480482070000075
where $y_n$ and $y_s$ denote the non-sensitive and sensitive labels, respectively, so the loss of the entire generator model is:
$L_G = L_2 + \lambda L_{Dist}$ (5)
where $\lambda$ is a hyperparameter that adjusts the weight of the individual loss terms;
c) a network of two fully connected layers with LeakyReLU activation layers is used as the discriminator model $D(\cdot)$; the second fully connected layer acts as a classifier that separates real input data from fake data, and a Wasserstein loss is used. To stabilize training and eliminate mode collapse, a gradient penalty $L_{GP}$ is also employed to enforce the Lipschitz constraint: the model is penalized whenever the norm of its gradient deviates from the target value of 1, so the loss function of the discriminator model is:
Figure BDA0003480482070000076
d) the generator model and the discriminator model are trained alternately;
In reconstructing the encoded representation of the data with the generative model, combining the sensitive and non-sensitive relationship data, the invention adopts the following algorithm:
Figure BDA0003480482070000077
Figure BDA0003480482070000081
The specific steps of the process are:
a) $R$ initial embeddings are sampled from a Gaussian distribution
Figure BDA0003480482070000082
b) the original data set is divided by relationship $l_o$ into several data sets $X_l$, e.g. $X_{\text{lovers}}$, $X_{\text{teacher-student}}$, $X_{\text{friend}}$, etc., where $X_l$ denotes the TransE encoding corresponding to relationship $l$;
c) for any data set $X_l$ containing relationship $l$, the trained generator model is used as the reconstruction network, and the encoding $z$ of the relational data is reconstructed with the following loss function:
Figure BDA0003480482070000083
Here, $G(z)$ is the output of the generator model for input $z$ ($z \in Z$), and $\alpha > 0$ is a margin hyperparameter that sets the gap enforced between the reconstructed encoding of a sensitive relationship and its normal encoding: the larger $\alpha$ is, the larger the enforced gap between the two encodings and the stricter the correction applied to the encoding vectors;
d) $L$ steps of gradient descent are used to reconstruct the initial embeddings
Figure BDA0003480482070000084
The reconstruction is calculated according to the following formulas:
Figure BDA0003480482070000085
Figure BDA0003480482070000086
e) because the mean squared error is non-convex, $R$ values of $z$ are initialized at random so that different local minima are sampled, which improves the robustness of the reconstruction model; $z^*$ is found by minimizing the following expression:
Figure BDA0003480482070000087
Finally, the reconstructed data
Figure BDA0003480482070000088
is used as the final relationship embedding for the subsequent prediction of person relationships.
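The update formulas of steps d) and e) are images in the source. A generic form of multi-start gradient descent that is consistent with the description is sketched below; the step size $\eta$, the indices $r$ and $k$, and the name $\mathcal{L}_{rec}$ for the reconstruction loss of step c) are assumptions, so the patent's own formulas may differ in detail.

```latex
z_r^{(k+1)} = z_r^{(k)} - \eta \, \nabla_z \mathcal{L}_{rec}(z) \Big|_{z = z_r^{(k)}},
\qquad k = 0, \dots, L - 1, \quad r = 1, \dots, R
\\[4pt]
z^* = \arg\min_{z \in \{ z_1^{(L)}, \dots, z_R^{(L)} \}} \mathcal{L}_{rec}(z)
```

Each of the $R$ starting points is refined independently for $L$ steps, and the candidate with the lowest loss is kept as $z^*$.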
The solution in this embodiment adopts WGAN to address the problems of conventional GAN training, such as unstable training, and largely resolves mode collapse, thereby ensuring the diversity of generated samples. In predicting non-sensitive relationships, the solution of this embodiment performs better than differential privacy, and its computational cost is lower than that of encryption techniques.
Embodiment two:
In this embodiment, a privacy-preserving link prediction system based on mail data is built using the method of embodiment one. As shown in FIG. 4, the system comprises the following modules:
Data preprocessing module: constructs a knowledge graph from the original mail data, yielding a rigorous mathematical definition and objective;
Entity-relationship low-dimensional embedding module: given a set $S$ of triples of the form $(h, l, t)$, each containing two entities $h, t \in E$ (the set of entities) and a relationship $l \in L$ (the set of relationships), this module learns low-dimensional embeddings of the entities and relationships that perform well on the downstream link prediction task. The TransE model is selected for this module because of its strong performance.
Generator training module: this module is shown in FIG. 2, part r, and comprises a generator $G$ and a discriminator $D$; its input data are the real embedded data and random noise $Z$ sampled from a Gaussian distribution. Through adversarial training, the generator learns to produce data with the same distribution as the real embedded data.
Data reconstruction module: this module is shown in the left part of FIG. 2. The data produced by the entity-relationship low-dimensional embedding module is called the embedded data and is denoted $X$. For any entity or relationship $e \in \{h, l, t\}$, the mapping between it and the embedded data can therefore be written as a unique pair $\{e \rightarrow X\}$. Given a pre-trained generator $G$ and the entity or relationship $X$ to be predicted, $z^*$ is first found so as to minimize the reconstruction loss; $G(z^*)$ is then used as the reconstructed embedding for link prediction. Since Equation 1 is a highly non-convex minimization problem, $R$ different random initializations of $z$ (denoted
Figure BDA0003480482070000091
) are each refined with $L$ steps of gradient descent to approximate this process. After adversarial training,
Figure BDA0003480482070000092
is input into the generator, and $L$ steps of gradient descent are used to estimate the projection of the real data set onto the embedding space of the generator.
Link prediction module: the data reconstruction module yields new low-dimensional embeddings $G(z^*)$ of the entities and relationships. These embeddings can be used to predict the relationships of people in the mail network.
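How the link prediction module might consume the reconstructed embeddings can be sketched in Python. The sketch assumes a TransE-style score $d(h + l, t)$, in line with the encoder used earlier; the function names and the toy two-dimensional vectors are hypothetical and only serve as an illustration.

```python
def score(h, l, t):
    """TransE-style energy d(h + l, t); a lower value means a more plausible triple."""
    return sum((hi + li - ti) ** 2 for hi, li, ti in zip(h, l, t))


def rank_relations(h, t, relations):
    """Return relation names sorted from most to least plausible for the pair (h, t)."""
    return sorted(relations, key=lambda name: score(h, relations[name], t))


# Hypothetical reconstructed embeddings.
entities = {"alice": [0.0, 0.0], "carol": [1.0, 0.0]}
relations = {
    "student-professor": [1.0, 0.1],   # non-sensitive: should stay predictable
    "lovers": [-1.0, 2.0],             # sensitive: should be pushed away
}
ranking = rank_relations(entities["alice"], entities["carol"], relations)
```

After a successful reconstruction, sensitive relationships would be expected to rank low for the pairs they actually connect, while non-sensitive ones keep a high rank.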
The foregoing shows and describes the general principles, essential features, and advantages of the invention. It will be understood by those skilled in the art that the present invention is not limited to the embodiments described above, which are given by way of illustration of the principles of the present invention, but that various changes and modifications may be made without departing from the spirit and scope of the invention, and such changes and modifications are within the scope of the invention as claimed. The scope of the invention is defined by the appended claims and equivalents thereof.

Claims (6)

1. A privacy-preserving link prediction method based on mail data, characterized by comprising the following steps:
step one: preprocessing the mail data, mining the implicit relationships in the mails, and constructing a person-relationship knowledge graph based on the mail data;
step two: encoding the entities and implicit relationships in the person-relationship knowledge graph with an energy-based low-dimensional entity embedding model, to obtain an embedding space and embedded data that correspond one-to-one to the different entities;
step three: training a generative adversarial network on the encoded embedded data to obtain a generative model, and using the model to simulate the embedding space;
step four: using a gradient-descent reconstruction method to obfuscate the sensitive and non-sensitive relationships implied in the original data, thereby fine-tuning the distribution structure of the embedding space;
step five: performing inference and prediction of person relationships in the mail system based on the data of the fine-tuned embedding space.
2. The method according to claim 1, wherein step one specifically comprises:
S101, for a university student mail system data set, selecting the student communication relations most closely related to the persons, and building a communication-relation knowledge graph;
S102, dividing the university student mail system network into an intra-domain communication network and an out-of-domain communication network;
S103, defining the communication-relation knowledge graph as (h, l, t) triples, wherein the communication relations l are divided into two groups: known relations l_o and unknown relations l_u that need to be predicted, with l_u ∈ l_o;
S104, further dividing the known relations l_o into sensitive relations in the intra-domain communication network
Figure FDA0003480482060000011
and non-sensitive relations in the out-of-domain communication network
Figure FDA0003480482060000012
such that
Figure FDA0003480482060000013
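The partition in S102 to S104 can be pictured with a small sketch. It assumes that a triple counts as intra-domain (sensitive) when both mail addresses belong to the campus domain, which is one possible reading and is not stated in the claim; `Triple`, `split_known_triples` and the domain `campus.example` are hypothetical names.

```python
from typing import List, NamedTuple, Tuple

class Triple(NamedTuple):
    head: str      # sender address (entity h)
    relation: str  # communication relation l
    tail: str      # recipient address (entity t)

def is_intra_domain(address: str, domain: str) -> bool:
    """True if the address belongs to the given mail domain."""
    return address.lower().endswith("@" + domain.lower())

def split_known_triples(triples: List[Triple], domain: str) -> Tuple[List[Triple], List[Triple]]:
    """Split known triples into (sensitive, non_sensitive) groups.

    Assumption: a triple is sensitive when both endpoints are inside the domain.
    """
    sensitive, non_sensitive = [], []
    for tr in triples:
        if is_intra_domain(tr.head, domain) and is_intra_domain(tr.tail, domain):
            sensitive.append(tr)
        else:
            non_sensitive.append(tr)
    return sensitive, non_sensitive

known = [
    Triple("a@campus.example", "mails", "b@campus.example"),
    Triple("a@campus.example", "mails", "c@other.example"),
]
sensitive, non_sensitive = split_known_triples(known, "campus.example")
```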
3. The method according to claim 1, wherein step two specifically comprises:
S201, generating a real Gaussian distribution, and initializing the entities and relations of the original mail data by random sampling;
S202, normalizing the entity and relation vectors in each iteration;
S203, each time selecting a fixed amount of data as positive samples S_batch, denoted (h, l_o, t), and for each positive sample replacing its head and tail entities to form a negative sample S'_batch, denoted (h', l_o, t');
S204, updating the entity and relation vectors by stochastic gradient descent according to the following loss function:
Figure FDA0003480482060000014
wherein, [ x ]]+Represents taking [0, x]The maximum value of (a) is a boundary hyperparameter whose function is equivalent to a gap correction between a positive and a negative sample; d (x, y) is a distance function, d (x, y) being (x-y)2
4. The privacy-preserving link prediction method based on mail data according to claim 1, wherein training to obtain the generative model in step three specifically comprises:
S301, sampling random noise Z from a Gaussian distribution;
S302, using a neural network comprising two fully connected layers and a normalization layer as the generator model G(·), and adopting a Wasserstein loss and a link prediction loss, wherein the link prediction loss is a margin-based ranking loss expressed as follows:
Figure FDA0003480482060000021
wherein
Figure FDA0003480482060000022
is a non-sensitive relation triple,
Figure FDA0003480482060000023
is a sensitive relation triple; γ > 0 is a margin hyperparameter, and d(x, y) denotes the Euclidean distance between two vectors;
the Wasserstein loss is calculated as follows:
Figure FDA0003480482060000024
wherein y_n denotes the non-sensitive label and y_s denotes the sensitive label; the loss of the entire generative model is as follows:
L_G = L_2 + λ·L_Dist
wherein λ is a hyperparameter that adjusts the weight of the individual loss function;
S303, using a network of two fully connected layers with LeakyReLU activation layers as the discriminator model D(·), with the second fully connected layer serving as a classifier that distinguishes whether the input data is real or generated, and using the Wasserstein loss; a gradient penalty L_GP is used to enforce the Lipschitz constraint, that is, the discriminator model is penalized if its gradient norm deviates from the target norm value of 1, so that the loss function of the discriminator model is as follows:
Figure FDA0003480482060000025
S304, alternately training the generator model and the discriminator model.
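The gradient penalty of S303 can be illustrated without an autodiff library, because the input gradient of a two-layer LeakyReLU network has a closed form. The sketch below is not the patent's discriminator loss, whose formula is an image that is not reproduced here. It assumes the usual WGAN-GP practice of evaluating the penalty at a point interpolated between a real and a generated sample, and every name, weight and slope value in it is hypothetical.

```python
import math

def leaky_relu(v, slope=0.2):
    return v if v > 0 else slope * v

def critic(x, W1, b1, w2, b2, slope=0.2):
    """D(x) = w2 . LeakyReLU(W1 x + b1) + b2, a two-layer critic."""
    hidden = [leaky_relu(sum(w * xi for w, xi in zip(row, x)) + b, slope)
              for row, b in zip(W1, b1)]
    return sum(w * h for w, h in zip(w2, hidden)) + b2

def critic_input_gradient(x, W1, b1, w2, slope=0.2):
    """Closed-form gradient of D with respect to its input x."""
    grad = [0.0] * len(x)
    for row, b, w in zip(W1, b1, w2):
        pre = sum(wi * xi for wi, xi in zip(row, x)) + b
        scale = w * (1.0 if pre > 0 else slope)  # derivative of LeakyReLU
        for j, wij in enumerate(row):
            grad[j] += scale * wij
    return grad

def gradient_penalty(real, fake, eps, W1, b1, w2, slope=0.2):
    """(||grad_x D(x_hat)|| - 1)^2 at x_hat = eps*real + (1 - eps)*fake."""
    x_hat = [eps * r + (1.0 - eps) * f for r, f in zip(real, fake)]
    grad = critic_input_gradient(x_hat, W1, b1, w2, slope)
    norm = math.sqrt(sum(g * g for g in grad))
    return (norm - 1.0) ** 2

# Toy critic weights (hypothetical values).
W1 = [[1.0, 0.0], [0.0, 1.0]]
b1 = [0.0, 0.0]
w2 = [1.0, 1.0]
gp = gradient_penalty([1.0, 1.0], [3.0, 3.0], 0.5, W1, b1, w2)
```

In a full training loop this penalty would be added, with a weight, to the critic's Wasserstein loss, and the critic and generator would be updated alternately as in S304.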
5. The privacy-preserving link prediction method based on mail data according to claim 1, wherein step four specifically comprises the following sub-steps:
S401, sampling R initial embeddings from a Gaussian distribution
Figure FDA0003480482060000026
S402, dividing the original data set according to the relations l_o into a plurality of data groups X_l, where X_l denotes the TransE encoding corresponding to relation l;
S403, for any data group X_l containing relation l, using the trained generator model as the reconstruction neural network, and reconstructing the encoding Z of the relation data with the following loss function:
Figure FDA0003480482060000027
wherein G(z) is the output of the generator model for input z (z ∈ Z); α > 0 is a margin hyperparameter that acts as a gap correction between the reconstructed encoding of a sensitive relation and the normal encoding;
S404, reconstructing the initial embeddings by running the gradient descent algorithm L times
Figure FDA0003480482060000031
the reconstruction process being calculated according to the following formulas:
Figure FDA0003480482060000032
Figure FDA0003480482060000033
S405, randomly initializing R values of z so that different local minima are sampled, which improves the robustness of the reconstruction model, wherein z* is found by minimizing the following expression:
Figure FDA0003480482060000034
and finally using the reconstructed data
Figure FDA0003480482060000035
as the final relation embedding for the subsequent prediction of person relationships.
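The restart-and-descend procedure of S401, S404 and S405 can be sketched as follows. The sketch makes two simplifying assumptions that are not from the claim: the generator is replaced by a hypothetical linear map G(z) = W z so that the gradient has a closed form, and the loss is the plain squared reconstruction error ||G(z) − x||^2, without the α margin term for sensitive relations whose formula is not reproduced here.

```python
import random

def generate(W, z):
    """Stand-in generator G(z) = W z (assumption: a linear map, not a trained network)."""
    return [sum(w * zi for w, zi in zip(row, z)) for row in W]

def recon_loss(W, z, x):
    """Squared reconstruction error ||G(z) - x||^2."""
    return sum((g - xi) ** 2 for g, xi in zip(generate(W, z), x))

def recon_grad(W, z, x):
    """Gradient of the loss with respect to z: 2 W^T (W z - x)."""
    residual = [g - xi for g, xi in zip(generate(W, z), x)]
    return [2.0 * sum(W[i][j] * residual[i] for i in range(len(W)))
            for j in range(len(z))]

def reconstruct(W, x, restarts, steps, lr, rng):
    """Run `steps` gradient descent steps from `restarts` Gaussian starts; keep the best z."""
    best_z, best_loss = None, float("inf")
    for _ in range(restarts):
        z = [rng.gauss(0.0, 1.0) for _ in range(len(W[0]))]
        for _ in range(steps):
            grad = recon_grad(W, z, x)
            z = [zi - lr * gi for zi, gi in zip(z, grad)]
        loss = recon_loss(W, z, x)
        if loss < best_loss:
            best_z, best_loss = z, loss
    return best_z, best_loss

rng = random.Random(0)
W = [[1.0, 0.0], [0.0, 2.0], [1.0, 1.0]]  # hypothetical generator weights
x = generate(W, [0.5, -1.0])              # target embedding to reconstruct
z_star, loss_star = reconstruct(W, x, restarts=4, steps=200, lr=0.05, rng=rng)
x_reconstructed = generate(W, z_star)     # plays the role of G(z*)
```

With a linear generator the problem is convex, so all restarts reach the same minimum; the restarts matter for a real, non-linear generator, where different starting points can end in different local minima.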
6. A privacy-preserving link prediction system based on mail data, which implements the privacy-preserving link prediction method based on mail data according to any one of claims 1 to 5, characterized by comprising:
a data preprocessing module for constructing a knowledge graph from the original mail data to form a strict mathematical definition and objective;
an entity-relation low-dimensional embedding module for learning low-dimensional embeddings of the entities and relations in the knowledge graph;
a generator training module comprising a generator G and a discriminator D, whose input data are the real embedded data and randomly sampled noise Z obeying a Gaussian distribution;
a data reconstruction module for reconstructing the embedded data X processed by the entity-relation low-dimensional embedding module to obtain a new low-dimensional embedding of the entities and relations, G(z*);
a link prediction module for predicting the relationships between people in the mail network according to the low-dimensional embedding G(z*).
CN202210066876.0A 2022-01-20 2022-01-20 Privacy protection link prediction method and system based on mail data Active CN114513337B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202210066876.0A CN114513337B (en) 2022-01-20 2022-01-20 Privacy protection link prediction method and system based on mail data

Publications (2)

Publication Number Publication Date
CN114513337A true CN114513337A (en) 2022-05-17
CN114513337B CN114513337B (en) 2023-04-07

Family

ID=81550105

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202210066876.0A Active CN114513337B (en) 2022-01-20 2022-01-20 Privacy protection link prediction method and system based on mail data

Country Status (1)

Country Link
CN (1) CN114513337B (en)

Citations (14)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN108647334A (en) * 2018-05-11 2018-10-12 电子科技大学 A kind of video social networks homology analysis method under spark platforms
CN110147450A (en) * 2019-05-06 2019-08-20 北京科技大学 A kind of the knowledge complementing method and device of knowledge mapping
EP3557505A1 (en) * 2018-04-20 2019-10-23 Facebook, Inc. Contextual auto-completion for assistant systems
WO2019231481A1 (en) * 2018-05-29 2019-12-05 Visa International Service Association Privacy-preserving machine learning in the three-server model
CN111046187A (en) * 2019-11-13 2020-04-21 山东财经大学 Sample knowledge graph relation learning method and system based on confrontation type attention mechanism
US10671752B1 (en) * 2019-11-20 2020-06-02 Capital One Services, Llc Computer-based methods and systems for managing private data of users
CN111639359A (en) * 2020-04-22 2020-09-08 中国科学院计算技术研究所 Method and system for detecting and early warning privacy risks of social network pictures
CN111859454A (en) * 2020-07-28 2020-10-30 桂林慧谷人工智能产业技术研究院 Privacy protection method for defending link prediction based on graph neural network
CN112182245A (en) * 2020-09-28 2021-01-05 中国科学院计算技术研究所 Knowledge graph embedded model training method and system and electronic equipment
CN113190688A (en) * 2021-05-08 2021-07-30 中国人民解放军国防科技大学 Complex network link prediction method and system based on logical reasoning and graph convolution
CN113220897A (en) * 2021-04-29 2021-08-06 天津大学 Knowledge graph embedding model based on entity-relation association graph
CN113282818A (en) * 2021-01-29 2021-08-20 中国人民解放军国防科技大学 Method, device and medium for mining network character relationship based on BilSTM
CN113360286A (en) * 2021-06-21 2021-09-07 中国人民解放军国防科技大学 Link prediction method based on knowledge graph embedding
CN113361658A (en) * 2021-07-15 2021-09-07 支付宝(杭州)信息技术有限公司 Method, device and equipment for training graph model based on privacy protection

Non-Patent Citations (3)

* Cited by examiner, † Cited by third party
Title
H. A. DEYLAMI AND M. ASADPOUR: ""Link prediction in social networks using hierarchical community detection"", 《2015 7TH CONFERENCE ON INFORMATION AND KNOWLEDGE TECHNOLOGY (IKT)》 *
Y. WANG等: """Efficient Privacy Preserving Matchmaking for Mobile Social Networking against Malicious Users"", 《2012 IEEE 11TH INTERNATIONAL CONFERENCE ON TRUST, SECURITY AND PRIVACY IN COMPUTING AND COMMUNICATIONS》 *
张钊等: ""用于知识表示学习的对抗式负样本生成"", 《计算机应用》 *

Cited By (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN115238827A (en) * 2022-09-16 2022-10-25 支付宝(杭州)信息技术有限公司 Privacy-protecting sample detection system training method and device
CN115238827B (en) * 2022-09-16 2022-11-25 支付宝(杭州)信息技术有限公司 Privacy-protecting sample detection system training method and device
CN117290888A (en) * 2023-11-23 2023-12-26 江苏风云科技服务有限公司 Information desensitization method for big data, storage medium and server
CN117290888B (en) * 2023-11-23 2024-02-09 江苏风云科技服务有限公司 Information desensitization method for big data, storage medium and server

Also Published As

Publication number Publication date
CN114513337B (en) 2023-04-07

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant