CN114662143A - Sensitive link privacy protection method based on graph embedding - Google Patents
- Publication number: CN114662143A
- Application number: CN202210191540.7A
- Authority: CN (China)
- Legal status: Granted (the legal status is an assumption and is not a legal conclusion; Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed)
Classifications
- G06F21/6245 — Protecting personal data, e.g. for financial or medical purposes
- G06F18/2415 — Classification techniques based on parametric or probabilistic models, e.g. based on likelihood ratio
- G06N3/045 — Neural network architectures; combinations of networks
- G06N3/084 — Learning methods; backpropagation, e.g. using gradient descent
Abstract
The invention provides a sensitive link privacy protection method based on graph embedding. The method comprises the following steps: representing the network topology of a network model to be privacy-protected as an undirected graph G, and compressing the privacy information related to sensitive links in the undirected graph G into a privacy-embedded vector matrix Zp; combining Zp with another graph-embedded vector matrix Zf to form an embedded vector matrix Z; and inputting the embedded vector matrix Z into a decoder, which reconstructs the graph structure of the undirected graph G. Aiming at the privacy protection problem of sensitive links in the Internet of Things, the invention provides a sensitive link protection method for hiding the sensitive links in a network and preventing link prediction attacks. By designing the loss function, a trade-off between privacy and utility is achieved. Compared with traditional link perturbation methods, the method abandons the idea of directly applying link perturbation to the original graph to remove private information, thereby reducing the utility loss.
Description
Technical Field
The invention relates to the technical field of sensitive link protection, in particular to a sensitive link privacy protection method based on graph embedding.
Background
The Internet of Things integrates a large number of communication, computing and sensing devices, and is an organic collection of intelligent terminal devices and users. The distributed terminal devices in the Internet of Things form a distributed multi-domain network by establishing communication links, and the data in the network usually contains privacy information, such as sensitive links, namely private communication relationships between entities. Some information platforms sell the collected network topology data to third parties for business purposes, resulting in privacy disclosure of the entities.
A graph is a data structure common in real life whose basic elements are nodes and edges; any complex system in the real world can be represented by a graph, where nodes represent real-world entities and links capture the various relationships between them. Similarly, the network topology in the Internet of Things can be represented as graph-structured data with terminal devices as nodes and communication links as edges, and the sensitive links are the privacy information of the graph. Therefore, the sensitive link privacy protection problem in an Internet of Things system can be modeled as a sensitive link hiding problem.
Generally, the most straightforward operation to implement link hiding is to remove the sensitive links from the graph. However, link prediction techniques in data mining can predict the missing links in a graph by mapping the graph into a continuous vector space. Furthermore, in addition to edges, the nodes in a graph usually contain attribute information (i.e., device information), which typically enhances the effectiveness of link prediction. Therefore, the technical problem to be solved by the invention is the sensitive link hiding problem for attribute graphs; aiming at this problem, a graph-embedding-based sensitive link privacy protection method (SLPGE) is designed, which is suitable for protecting sensitive link privacy in an Internet of Things system.
Link prediction can infer missing or unknown facts from known information and is widely used in the network analysis of social networks, biological fields, communication networks and recommendation systems. Meanwhile, link prediction can serve as a powerful inference attack for snooping on the privacy of entities in a network. In recent years, the privacy problems caused by link prediction have attracted the attention of researchers, and studies on methods for countering link prediction have been carried out. At present, most privacy protection methods that use graph structure data to resist link prediction change the internal rules of the network by perturbing the network structure, so that the ability of various link prediction methods to predict the sensitive links in the network is reduced, achieving the purpose of privacy protection for the sensitive links.
In one prior-art link perturbation scheme, the links between the common node of two sensitive links and their endpoints are taken as candidate links to be deleted, and the attack based on local similarity is expressed as an optimization problem to determine which links to delete. Another scheme proposes an Iterative Gradient Attack (IGA) based on the gradient information in a graph auto-encoder (GAE): the gradient of each link is obtained by maximizing the sensitive-link loss function, the gradient represents the degree of influence of other links on the sensitive link, the n links with the largest gradient are modified in each iteration, and the final network is obtained after k iterations. In addition, a scheme aimed at dynamic networks provides the first study on countering dynamic network link prediction (DNLP); the proposed method, Time-aware Gradient Attack (TGA), modifies some links using the gradient information generated by Deep Dynamic Network Embedding (DDNE) across different snapshots, so that DDNE cannot accurately predict the target link.
The above-mentioned link interference method in the prior art mainly has the following disadvantages:
(1) directly adding and deleting edges on the original graph sacrifices too much data utility, so privacy and utility need to be balanced when hiding sensitive links;
(2) they are only applicable to simple graphs containing nodes and edges, and do not consider the influence of the non-structural information of the graph, such as node attributes, on link prediction. In fact, experiments show that the attribute information of nodes (such as the performance, identity and type of a device) can strengthen the link strength and thus improve the prediction accuracy of a link prediction model. Therefore, the attribute information of nodes needs to be considered in privacy protection.
Disclosure of Invention
The embodiment of the invention provides a graph embedding-based sensitive link privacy protection method, which is used for effectively hiding a sensitive link in a network and preventing link prediction attack.
In order to achieve the purpose, the invention adopts the following technical scheme.
A graph embedding-based sensitive link privacy protection method comprises the following steps:
representing the network topology of a network model to be privacy-protected as an undirected graph G, and compressing the privacy information related to sensitive links in the undirected graph G into a privacy-embedded vector matrix Zp;
combining the privacy-embedded vector matrix Zp with another graph-embedded vector matrix Zf to form an embedded vector matrix Z, inputting the embedded vector matrix Z into a decoder, and reconstructing the graph structure of the undirected graph G by the decoder.
Preferably, the representing of the network topology of the network model to be privacy-protected as an undirected graph G comprises:
representing the network topology of the network model as an undirected graph G = (V, E, X), where V = {v1, v2, ..., vN} is the set of nodes, representing the terminal devices, and N = |V| is the number of nodes; E is the set of edges, representing communication links, and eij ∈ E is defined as the link between vi and vj, 1 ≤ i, j ≤ N; X ∈ R^(N×F) is the node attribute matrix, whose N rows represent the N nodes and whose F columns represent F attributes; the connection relationship between nodes is represented by an adjacency matrix A ∈ R^(N×N): when eij exists, Aij is 1, otherwise Aij is 0;
the privacy information related to a sensitive link refers to information that can help infer the sensitive link, including the attribute information of the two end nodes and the network structure information, where the network structure information includes the information of the first-order and higher-order neighbor nodes of a node.
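As a concrete illustration (our own sketch, not part of the claimed method), the undirected-graph representation above can be built in Python; the node count and edge list are hypothetical toy values:

```python
import numpy as np

def build_adjacency(num_nodes, edges):
    """Adjacency matrix A of an undirected graph: A[i, j] = 1 iff e_ij exists."""
    A = np.zeros((num_nodes, num_nodes), dtype=int)
    for i, j in edges:
        A[i, j] = 1
        A[j, i] = 1  # undirected graph: e_ij and e_ji denote the same link
    return A

# toy topology: 4 terminal devices, 3 communication links
A = build_adjacency(4, [(0, 1), (1, 2), (2, 3)])
```

Each of the three links appears twice in A (once per direction), so A has six nonzero entries.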
Preferably, the generative model of the privacy-embedded vector matrix Zp and the generative model of the graph-embedded vector matrix Zf both adopt the encoding-decoding structure of the variational graph auto-encoder (VGAE) in graph neural networks, and each comprises an encoder composed of two graph convolutional neural network (GCN) layers, an inner-product decoder and a Softmax classifier.
Preferably, the compressing of the privacy information related to the sensitive links in the undirected graph G into the privacy-embedded vector matrix Zp comprises the following steps:
the generative model of the privacy-embedded vector matrix Zp first preprocesses the graph through algorithm one or algorithm two to obtain an adjacency matrix Ap. Algorithm one is an edge-adding algorithm for generating the privacy-hiding graph adjacency matrix; it considers that the connection strength between two nodes with more common neighbors is stronger. Algorithm two is an edge-deleting algorithm for generating the privacy-hiding graph adjacency matrix; it considers that when only the sensitive links and their adjacent links are retained in the graph, the key information in the graph is concentrated on the sensitive links. The adjacency matrix Ap and the node attribute matrix X are used as the input of the encoder, and the encoder outputs Zp;
the generative model of the privacy-embedded vector matrix Zp has three optimization objectives: 1) make the reconstructed matrix Âp as similar to Ap as possible; 2) make the distribution of Zp as close to a Gaussian distribution as possible; 3) make the node classification of Zp as accurate as possible;
the loss function of the generative model of the privacy-embedded vector matrix Zp is designed as follows:
loss1 = losslink1 + lossdist + losslabel, (9)
where Aij and Âij are the element values of Ap and Âp respectively; losslink1, lossdist and losslabel are the link reconstruction loss, the distribution loss and the node classification loss respectively; loss1 is the sum of the losses; p is the ratio of the number of 0 elements to the number of 1 elements in Ap, which is used to solve the problem of imbalance between positive and negative samples; yil represents the true value of node vi belonging to class l (yil = 1 indicates that node vi belongs to class l, otherwise yil = 0); and ŷil is the element value of Ŷ, representing the probability that node vi is classified into class l.
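A minimal numeric sketch of the three loss terms summed in formula (9), assuming binary cross-entropy with a positive-class weight p for the link reconstruction loss, a KL term against a standard Gaussian for the distribution loss, and cross-entropy for the node classification loss (the per-term formulas are not reproduced in this text, so this is an illustrative reading, not the claimed definition):

```python
import numpy as np

EPS = 1e-9

def loss_link(A, A_hat, p):
    """Weighted BCE between A_p and its reconstruction; p re-weights the scarce 1s."""
    return -np.mean(p * A * np.log(A_hat + EPS) + (1 - A) * np.log(1 - A_hat + EPS))

def loss_dist(mu, log_sigma):
    """KL divergence between N(mu, sigma^2) and the standard Gaussian N(0, 1)."""
    return -0.5 * np.mean(1 + 2 * log_sigma - mu ** 2 - np.exp(2 * log_sigma))

def loss_label(Y, Y_hat):
    """Cross-entropy between true one-hot labels and predicted class probabilities."""
    return -np.mean(np.sum(Y * np.log(Y_hat + EPS), axis=1))

# toy values, 2 nodes / 2 classes
A = np.array([[0, 1], [1, 0]])
A_hat = np.array([[0.1, 0.9], [0.9, 0.1]])
mu, log_sigma = np.zeros((2, 4)), np.zeros((2, 4))
Y = np.array([[1, 0], [0, 1]])
Y_hat = np.array([[0.8, 0.2], [0.3, 0.7]])
loss1 = loss_link(A, A_hat, p=1.0) + loss_dist(mu, log_sigma) + loss_label(Y, Y_hat)
```

Note that with mu = 0 and log_sigma = 0 the distribution term vanishes, since the encoder output already matches the standard Gaussian.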
Preferably, the combining of the privacy-embedded vector matrix Zp with another graph-embedded vector matrix Zf to form the embedded vector matrix Z, inputting the embedded vector matrix Z into the decoder, and reconstructing the graph structure of the undirected graph G by the decoder comprises:
the generative model of the graph-embedded vector matrix Zf takes the encoder as the generator and a two-layer fully-connected network as the discriminator; all sensitive links in the undirected graph G are deleted to obtain an adjacency matrix At, and At and X are input into the generative model of the graph-embedded vector matrix Zf, which outputs the graph-embedded vector matrix Zf after encoding;
Zf is directly input into the classifier for node classification; Gaussian samples and Zf are taken as true samples and false samples respectively and input together into the discriminator, which outputs two estimated values dreal and dfake; the graph-embedded vector matrix Zf and the privacy-embedded vector matrix Zp are vector-combined to obtain the embedded vector matrix Z;
the embedded vector matrix Z is sent into the decoder for decoding and reconstruction, and the decoder directly uses the inner-product operation to obtain the reconstructed matrix Â; according to the reconstructed matrix Â and the estimated values dreal and dfake, the losses of the encoder and the discriminator are calculated, and the generative model of the graph-embedded vector matrix Zf is iteratively trained by minimizing the losses of the encoder and the discriminator until the model converges, obtaining the final embedded vector matrix Z capable of protecting the sensitive links.
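The two vector combination modes used to form Z from Zp and Zf (element-wise addition and horizontal concatenation) can be sketched as follows; the node count and embedding dimension are hypothetical:

```python
import numpy as np

rng = np.random.default_rng(0)
N, D = 5, 8                       # hypothetical node count and embedding dimension
Zp = rng.standard_normal((N, D))  # privacy-embedded vector matrix
Zf = rng.standard_normal((N, D))  # graph-embedded vector matrix

Z_add = Zp + Zf                           # "Add": element-wise addition, keeps dimension D
Z_cat = np.concatenate([Zp, Zf], axis=1)  # "Concat": horizontal splicing, dimension 2D
```

Addition keeps the published matrix small, while concatenation preserves both embeddings separately at twice the width.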
Preferably, the loss of the generative model of the graph-embedded vector matrix Zf comprises the link reconstruction loss, the distribution loss and the node classification loss, where the distribution loss is divided into the generator loss and the discriminator loss; the loss functions are designed as follows:
lossdist(G) = -log(dfake), (11)
lossdist(D) = -log(dreal) - log(1 - dfake), (12)
lossG = losslink2 + lossdist(G) + losslabel, (13)
where Aij and Âij are the element values of A and Â respectively; losslink2, lossdist(G), lossdist(D) and lossG are the link reconstruction loss, the generator distribution loss, the discriminator distribution loss and the total encoder loss respectively; losslabel is the same as in formula (8); lossdist(G) and lossdist(D) are both calculated by binary cross-entropy (BCE); and losslink2 is calculated by the mean-square-error function.
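Formulas (11) and (12) are the standard adversarial generator and discriminator losses over the discriminator's two estimated values; a direct numeric sketch:

```python
import math

def loss_dist_G(d_fake):
    """Generator distribution loss: -log(d_fake); small when the fake sample fools D."""
    return -math.log(d_fake)

def loss_dist_D(d_real, d_fake):
    """Discriminator distribution loss: -log(d_real) - log(1 - d_fake)."""
    return -math.log(d_real) - math.log(1 - d_fake)

# a near-perfect discriminator (d_real -> 1, d_fake -> 0) drives its own loss toward 0
# while making the generator loss large; training pushes Zf toward the Gaussian prior
```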
It can be seen from the technical scheme provided by the embodiments of the invention that, aiming at the privacy protection problem of sensitive links in the Internet of Things, the embodiments provide a sensitive link protection method for hiding the sensitive links in a network and preventing link prediction attacks. By designing the loss function, a trade-off between privacy and utility is achieved. Compared with traditional link perturbation methods, the method abandons the idea of directly applying link perturbation to the original graph to remove private information, thereby reducing the utility loss.
Additional aspects and advantages of the invention will be set forth in part in the description which follows, and in part will be obvious from the description, or may be learned by practice of the invention.
Drawings
In order to more clearly illustrate the technical solutions of the embodiments of the present invention, the drawings needed to be used in the description of the embodiments are briefly introduced below, and it is obvious that the drawings in the following description are only some embodiments of the present invention, and it is obvious for those skilled in the art to obtain other drawings based on these drawings without creative efforts.
FIG. 1 is a schematic diagram illustrating a graph embedding principle provided by an embodiment of the present invention;
FIG. 2 is a schematic diagram of the generative model of the privacy-embedded vector matrix Zp according to an embodiment of the present invention;
FIG. 3 is a schematic diagram of the generative model of the graph-embedded vector matrix Zf according to an embodiment of the present invention;
fig. 4 is a schematic diagram of an implementation principle of a first algorithm according to an embodiment of the present invention;
fig. 5 is a schematic diagram illustrating an implementation principle of a second algorithm according to an embodiment of the present invention;
fig. 6 is a schematic diagram of the node classification visualization on the Cora dataset according to an embodiment of the present invention;
fig. 7 is a schematic diagram of the node classification visualization on the Yale dataset according to an embodiment of the present invention.
Detailed Description
Reference will now be made in detail to embodiments of the present invention, examples of which are illustrated in the accompanying drawings, wherein like reference numerals refer to the same or similar elements or elements having the same or similar function throughout. The embodiments described below with reference to the accompanying drawings are illustrative only for the purpose of explaining the present invention, and are not to be construed as limiting the present invention.
As used herein, the singular forms "a", "an" and "the" are intended to include the plural forms as well, unless the context clearly indicates otherwise. It will be further understood that the terms "comprises" and/or "comprising," when used in this specification, specify the presence of stated features, integers, steps, operations, elements, and/or components, but do not preclude the presence or addition of one or more other features, integers, steps, operations, elements, components, and/or groups thereof. It will be understood that when an element is referred to as being "connected" or "coupled" to another element, it can be directly connected or coupled to the other element, or intervening elements may also be present. Further, "connected" or "coupled" as used herein may include wirelessly connected or coupled. As used herein, the term "and/or" includes any and all combinations of one or more of the associated listed items.
It will be understood by those skilled in the art that, unless otherwise defined, all terms (including technical and scientific terms) used herein have the same meaning as commonly understood by one of ordinary skill in the art to which this invention belongs. It will be further understood that terms, such as those defined in commonly used dictionaries, should be interpreted as having a meaning that is consistent with their meaning in the context of the prior art and will not be interpreted in an idealized or overly formal sense unless expressly so defined herein.
For the convenience of understanding the embodiments of the present invention, the following description will be further explained by taking several specific embodiments as examples in conjunction with the drawings, and the embodiments are not to be construed as limiting the embodiments of the present invention.
Graph embedding (also called network representation learning) aims to compress high-dimensional, sparse original graph data into a low-dimensional, dense embedded vector matrix that includes the graph's node, link and node attribute information.
In view of the shortcomings of the prior art, the embodiment of the invention provides a graph embedding-based sensitive link privacy protection method for defending against link prediction attacks, which can be used for simple graphs and attribute graphs and can preserve most of data utility while hiding sensitive links.
The method combines the graph neural network and the generative adversarial network from graph embedding technology, and compresses the network topology information in the Internet of Things into a graph-embedded vector matrix, so that the classification model used for link prediction cannot accurately predict the existence of the sensitive links, while a certain data utility is retained in the graph-embedded vector matrix. Fig. 1 is a schematic diagram of the graph embedding principle according to an embodiment of the present invention. As shown in fig. 1, the encoding-decoding idea of graph auto-encoders (GAE) is adopted: the original graph structure is first encoded into an embedded vector matrix by the graph embedding technique, where each row of the matrix is the embedded vector representation of one node. The embedded vector matrix can be decoded to reconstruct a graph structure that does not contain the sensitive links. Finally, we publish the embedded vector matrix; an attacker, given the known partial topology (node and edge information), can train a model with the published embedded vector matrix as input and attempt to predict the sensitive links.
The network model is the object to be privacy-protected. Its network topology is represented as an undirected graph G = (V, E, X), where V = {v1, v2, ..., vN} is the set of nodes representing the terminal devices (N = |V| is the number of nodes); E is the set of edges, representing communication links, with eij ∈ E defined as the link between vi and vj (1 ≤ i, j ≤ N); X ∈ R^(N×F) is the node attribute matrix, with N rows representing the N nodes and F columns representing F attributes; the connection relationship between nodes is represented by an adjacency matrix A ∈ R^(N×N): when eij exists, Aij is 1, otherwise Aij is 0. The privacy information related to a sensitive link refers to information that can help infer the sensitive link, including the attribute information of the two end nodes and the network structure information, where the network structure information includes the information of one or more orders of neighbor nodes.
The attack model is the link prediction method against which the method of the invention is directed. Classification models are an important method in link prediction, and two representative classification models, the Support Vector Machine (SVM) and the Multilayer Perceptron (MLP), are used as attack models to check the effectiveness of the invention. A classification model treats the link prediction problem as a binary classification problem: it takes existing links and non-existing links as positive and negative examples respectively, trains the model with the known partial structure, and predicts the unknown links.
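The binary-classification view of link prediction described above can be sketched as follows: existing links are positive examples, absent links are negative examples, and each node pair is represented (as an illustrative assumption, not the patent's definition) by the element-wise product of its two endpoint embeddings, ready to feed an SVM or MLP:

```python
import numpy as np

def edge_dataset(A, Z):
    """Split node pairs into positive (link present) and negative (link absent)
    examples, each represented by the Hadamard product of its endpoint embeddings."""
    feats, labels = [], []
    n = A.shape[0]
    for i in range(n):
        for j in range(i + 1, n):
            feats.append(Z[i] * Z[j])
            labels.append(int(A[i, j]))
    return np.array(feats), np.array(labels)

# toy 3-node path graph with trivial stand-in embeddings
A = np.array([[0, 1, 0], [1, 0, 1], [0, 1, 0]])
Z = np.eye(3)
X_edges, y_edges = edge_dataset(A, Z)
```

An attacker would fit any off-the-shelf binary classifier on (X_edges, y_edges) built from the known partial structure, then score unknown pairs.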
The overall idea of the graph-embedding-based sensitive link privacy protection scheme is to compress the privacy information related to the sensitive links in the undirected graph G into a privacy-embedded vector matrix Zp, and to vector-combine Zp with another graph-embedded vector matrix Zf to form a new embedded vector matrix Z. The embedded vector matrix Z can be used to reconstruct the graph structure through a decoder, and the privacy information contained in Zf is gradually eliminated through the back-propagation of the loss function in iterative training, finally obtaining a graph-embedded vector matrix Zf that resists link prediction. The model framework of the invention is divided into two parts, the generative model of the privacy-embedded vector matrix Zp and the generative model of the graph-embedded vector matrix Zf, as shown in fig. 2 and fig. 3 respectively. The Encoder, Decoder, Classifier and Discriminator modules represent the encoder, decoder, classifier and discriminator respectively, and Add and Concat represent the two vector combination modes: element-wise addition and horizontal concatenation. The generative models of Zp and Zf in fig. 2 and fig. 3 both adopt the encoding-decoding structure of the variational graph auto-encoder (VGAE) in graph neural networks, and each includes an encoder composed of two graph convolutional network (GCN) layers, an inner-product decoder and a Softmax classifier; the mechanisms of the encoder, decoder and classifier are as follows.
The encoder comprises two GCN layers; the forward propagation rule of each layer is as follows:
H(l+1) = σ(ÂH(l)W(l)), Â = D̃^(-1/2) Ã D̃^(-1/2), Ã = A + I, (1)
where I ∈ R^(N×N) is the identity matrix, Ã = A + I is the adjacency matrix with added self-loops, D̃ is the degree matrix of Ã, H(l) is the node feature matrix of layer l, W(l) is the trainable weight matrix of layer l, and σ(·) is the activation function. Â is a normalization of A, because multiplying the feature matrix by an unnormalized A would change the original distribution of the features. Since A is an input of every layer, the feature matrix H(l+1) of layer l+1 can be expressed as H(l+1) = f(H(l), A), and the encoding mechanism of the two-layer GCN is as follows:
GCN(X, A) = Â ReLU(ÂXW(0))W(1), (2)
where X = H(0) is the feature matrix of the input layer, Â is the normalized adjacency matrix, W(0) and W(1) are the weight matrices of the first and second layers respectively (D1 and D2 are the node feature dimensions of the first and second layers), ReLU(·) = max(0, ·) is the activation function σ(·) of the first layer, the second layer has no activation function, and H(1) and H(2) are the output feature matrices of the first and second layers respectively.
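A numpy sketch of the two-layer GCN encoding mechanism with the symmetric renormalization described above (random weights and hypothetical dimensions; a real implementation trains W0 and W1 by back-propagation):

```python
import numpy as np

def normalize_adj(A):
    """Symmetric renormalization: D^{-1/2} (A + I) D^{-1/2}."""
    A_tilde = A + np.eye(A.shape[0])
    d_inv_sqrt = np.diag(A_tilde.sum(axis=1) ** -0.5)
    return d_inv_sqrt @ A_tilde @ d_inv_sqrt

def gcn_two_layer(X, A, W0, W1):
    """GCN(X, A) = A_hat ReLU(A_hat X W0) W1 — ReLU on layer 1, linear layer 2."""
    A_hat = normalize_adj(A)
    H1 = np.maximum(0, A_hat @ X @ W0)
    return A_hat @ H1 @ W1

rng = np.random.default_rng(0)
A = np.array([[0, 1, 1], [1, 0, 0], [1, 0, 0]], dtype=float)
X = rng.standard_normal((3, 4))   # N=3 nodes, F=4 attributes
W0 = rng.standard_normal((4, 8))  # F -> D1
W1 = rng.standard_normal((8, 2))  # D1 -> D2
H2 = gcn_two_layer(X, A, W0, W1)  # N x D2 output feature matrix
```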
The encoder aims to learn the mean and standard deviation of a multidimensional Gaussian distribution, from which the embedded vector matrix is derived. The generation process of the mean matrix μ, the standard deviation matrix σ and the embedded vector matrix z is as follows:
μ = GCNμ(X, A)
σ = GCNσ(X, A) (3)
z = μ + ε × σ
where z ~ N(μ, σ²), GCNμ and GCNσ share the first-layer parameters W(0), and ε ~ N(0, 1) is random noise sampled from the standard Gaussian distribution.
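Formula (3) samples z through the reparameterization trick, so the sampling stays differentiable with respect to μ and σ; a minimal sketch with toy matrices:

```python
import numpy as np

rng = np.random.default_rng(42)
mu = np.array([[1.0, -2.0], [0.5, 0.0]])    # mean matrix from GCN_mu
sigma = np.array([[0.1, 0.1], [0.1, 0.1]])  # std-dev matrix from GCN_sigma
eps = rng.standard_normal(mu.shape)         # epsilon ~ N(0, 1)
z = mu + eps * sigma  # z ~ N(mu, sigma^2), yet gradients flow to mu and sigma
```

With σ = 0 the sample collapses to the mean, which is why the distribution loss is needed to keep σ from degenerating.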
The embedded vector matrix z output by the encoder is sent into the decoder for decoding and reconstruction, and the decoder directly uses the inner-product operation:
Â = σ(zz^T), (4)
where z^T is the transpose of z, the activation function σ(·) is the sigmoid function (f(x) = 1/(1 + exp(-x))), Â is the reconstructed adjacency matrix, Âij can be viewed as the product of the probabilities of the independent events of nodes vi and vj (i.e., the probability that the link exists), and an Âij close to 1 indicates that the link exists.
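The inner-product decoding described above can be sketched in a few lines; similar embedding rows yield a high link probability and dissimilar rows a low one:

```python
import numpy as np

def decode(z):
    """Reconstruct adjacency probabilities: sigmoid of the inner products z z^T."""
    logits = z @ z.T
    return 1.0 / (1.0 + np.exp(-logits))

# toy embeddings: nodes 0 and 1 are similar, node 2 points the opposite way
z = np.array([[2.0, 0.0], [2.0, 0.0], [-2.0, 0.0]])
A_hat = decode(z)  # A_hat[0, 1] is high, A_hat[0, 2] is low
```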
The Softmax classifier can be used to solve the multi-classification problem of node labels. Assuming that the nodes are classified into different types, i.e., the nodes possess labels, and the node labels relate to the node attributes, the node classification can be used as one of the indicators for checking the data utility of the graph embedding vector matrix z. Softmax is an index normalization function, also called a normalization index function, which can normalize each row of the input matrix, map the elements in the matrix into a (0,1) range, ensure that the sum of the elements in each row is 1, and because the normalized value is the probability of the corresponding node label class, the sum of the probability of multi-classification is also 1. The formula of the Softmax function is as follows:
ŷ_il = exp(z_il) / Σ_{l'=1}^{L} exp(z_il')
where ŷ_il represents the probability that node v_i belongs to class l, z is the input matrix, z_il is the element value in row i (node v_i) and column l (category l), and L is the number of categories of the node labels.
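The row-wise softmax described above can be sketched as below; the max-subtraction is a standard numerical-stability trick we add for the illustration, not part of the patent text:

```python
import numpy as np

def softmax_rows(z):
    """Row-wise softmax: each row maps into (0, 1) and sums to 1."""
    e = np.exp(z - z.max(axis=1, keepdims=True))   # stabilized exponentials
    return e / e.sum(axis=1, keepdims=True)

logits = np.array([[1.0, 2.0, 3.0],    # distinct scores: clear winner
                   [0.0, 0.0, 0.0]])   # equal scores: uniform probabilities
probs = softmax_rows(logits)
```

Each row of `probs` is a probability distribution over the L label classes, as required for the node-classification utility check.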
As shown in FIG. 2, the generative model of the privacy embedded vector matrix Z_p consists of an encoder, a decoder and a classifier. To increase the proportion of private information in Z_p, the undirected graph G is first preprocessed by Algorithm 1 or Algorithm 2 to enhance the connection strength of the end nodes of the sensitive link. FIG. 4 is a schematic diagram illustrating the implementation principle of Algorithm 1 according to an embodiment of the present invention. Algorithm 1 considers that the connection strength between two nodes having more common neighbors is stronger. As shown in FIG. 4, e_01 is a sensitive link, and {v_2, v_4} and {v_3, v_5} are the neighbors of v_0 and v_1, respectively; since e_23 exists, we add e_03 and e_12 to make the relationship between v_0 and v_1 closer. FIG. 5 is a schematic diagram illustrating the implementation principle of Algorithm 2 according to an embodiment of the present invention. Algorithm 2 considers that when only the sensitive link and its adjacent links are retained in the graph, the main information in the graph is concentrated on the sensitive link. As shown in FIG. 5, e_01 is a sensitive link and {e_02, e_04, e_13, e_15} are its adjacent links; we delete the other unrelated nodes and links and keep only {e_01, e_02, e_04, e_13, e_15}.
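The two preprocessing steps can be sketched roughly as follows, using the toy graph from FIGS. 4 and 5 (node and edge names match the figures; the function bodies and the m cap are our own illustrative simplifications of Algorithms 1 and 2, whose exact procedures are given elsewhere in the patent):

```python
def add_edges_common_neighbors(edges, sensitive, m):
    """Algorithm-1-style edge addition (sketch): for sensitive link (u, v),
    when a neighbor x of u and a neighbor y of v are themselves connected,
    add the shortcut edges (u, y) and (v, x), up to m new edges."""
    adj = {}
    for a, b in edges:
        adj.setdefault(a, set()).add(b)
        adj.setdefault(b, set()).add(a)
    u, v = sensitive
    existing = {tuple(sorted(e)) for e in edges}
    new = set()
    for x in adj.get(u, set()) - {v}:
        for y in adj.get(v, set()) - {u}:
            if y in adj.get(x, set()):            # x-y edge exists: strengthen u-v
                for cand in [(u, y), (v, x)]:
                    e = tuple(sorted(cand))
                    if e not in existing and len(new) < m:
                        new.add(e)
    return sorted(new)

def keep_sensitive_neighborhood(edges, sensitive):
    """Algorithm-2-style edge deletion (sketch): keep only the sensitive link
    and the edges incident to its two end nodes."""
    u, v = sensitive
    return [e for e in edges if u in e or v in e]

# Toy graph of FIGS. 4 and 5: e_01 sensitive, neighbors {v2, v4} and {v3, v5}.
edges = [(0, 1), (0, 2), (0, 4), (1, 3), (1, 5), (2, 3)]
added = add_edges_common_neighbors(edges, sensitive=(0, 1), m=2)  # e_03 and e_12
kept = keep_sensitive_neighborhood(edges, sensitive=(0, 1))       # drops e_23
```

On this example the sketch reproduces the figures: e_03 and e_12 are added, and only {e_01, e_02, e_04, e_13, e_15} survive the deletion step.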
The adjacency matrix A_p obtained by Algorithm 1 or Algorithm 2 and the node attribute matrix X are used as the input of the encoder, and the encoder outputs Z_p via the encoding scheme described above. Z_p is then input into the decoder and the classifier for decoding reconstruction and node classification respectively, yielding the reconstruction matrix Â_p and the normalized node-label probability matrix Ŷ.
The generative model of the privacy embedded vector matrix Z_p has three optimization objectives: 1) make Â_p and A_p as similar as possible; 2) make the distribution of Z_p as close to Gaussian as possible; 3) make the node classification based on Z_p as accurate as possible. The loss function of the generative model of Z_p is therefore designed as follows:
loss_1 = loss_link1 + loss_dist + loss_label,  (9)
where A_ij and Â_ij are the element values of A_p and Â_p respectively; loss_link1, loss_dist and loss_label are the link reconstruction loss, distribution loss and node classification loss respectively, and loss_1 is the total loss of the first part. p is the ratio of the number of 0 elements to the number of 1 elements in A_p, used to address the imbalance between positive and negative samples. y_il represents the true value of node v_i belonging to class l: y_il = 1 indicates that node v_i belongs to class l, otherwise y_il = 0. ŷ_il is the element value of Ŷ, representing the probability that node v_i is classified into class l. loss_link1 and loss_label are binary cross entropy (BCE) losses, which measure the closeness between the actual output and the expected output, while loss_dist is a simplified form of the Kullback-Leibler (KL) divergence, which measures the difference between two distributions. Through a number of iterative training rounds, the loss is continuously reduced, and the encoder finally outputs a better Z_p.
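A numerical sketch of loss_1 follows, with the positively-weighted BCE, a simplified KL term, and the label cross entropy written out. The log-σ form of the KL term and the toy matrices are our assumptions; the patent's exact formulas (6)-(8) are not reproduced here.

```python
import numpy as np

def bce(target, pred, pos_weight=1.0, eps=1e-7):
    """Binary cross entropy with a positive-class weight p (class imbalance)."""
    pred = np.clip(pred, eps, 1 - eps)
    return -np.mean(pos_weight * target * np.log(pred)
                    + (1 - target) * np.log(1 - pred))

def kl_loss(mu, log_sigma):
    """Simplified KL divergence between N(mu, sigma^2) and N(0, 1)."""
    return -0.5 * np.mean(1 + 2 * log_sigma - mu**2 - np.exp(2 * log_sigma))

def label_loss(y_true, y_prob, eps=1e-7):
    """Node-classification cross entropy over one-hot labels y_il."""
    return -np.mean(np.sum(y_true * np.log(np.clip(y_prob, eps, 1.0)), axis=1))

A_p = np.array([[0.0, 1.0], [1.0, 0.0]])       # toy target adjacency
A_hat = np.array([[0.1, 0.9], [0.9, 0.1]])     # toy reconstruction
mu = np.zeros((2, 2)); log_sigma = np.zeros((2, 2))
y = np.array([[1.0, 0.0], [0.0, 1.0]])         # one-hot labels
y_hat = np.array([[0.8, 0.2], [0.3, 0.7]])     # predicted probabilities
p = (A_p == 0).sum() / (A_p == 1).sum()        # ratio of 0s to 1s in A_p
loss1 = bce(A_p, A_hat, pos_weight=p) + kl_loss(mu, log_sigma) + label_loss(y, y_hat)
```

With μ = 0 and log σ = 0 the KL term vanishes, so the remaining loss is driven entirely by reconstruction and label errors.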
The objective of the generative model of the graph embedded vector matrix Z_f is to generate Z_f. This model consists of an Encoder, a Decoder, a Softmax classifier and a Discriminator, where the mechanisms of the encoder, decoder and classifier are the same as in the first part. The second part combines a variational graph autoencoder with a generative adversarial network, which usually comprises a Generator and a Discriminator: through adversarial training between the two, the discriminator's ability to distinguish real from fake samples is continuously strengthened, while the generator is forced to output samples that can pass for real. The invention takes the encoder as the generator and a two-layer fully-connected network as the discriminator; the discriminator compresses an input matrix into an estimated value to judge whether the input comes from a real Gaussian sample or is a fake sample from the encoder, forcing the encoder to generate a Z_f that conforms to the Gaussian distribution, which helps remove the private information from Z_f. Before training, in order to remove the most intuitive private information, all sensitive links in G are deleted to obtain an adjacency matrix A_t, which serves as the input of the generative model of the graph embedded vector matrix Z_f.
Training begins by inputting A_t and X into the encoder, which outputs Z_f after encoding. Z_f is directly input into the classifier for node classification. At the same time, Gaussian samples and Z_f are fed into the discriminator as true samples (label 1) and false samples (label 0) respectively, and the discriminator outputs two estimated values d_real and d_fake. In addition, Z_f is vector-combined with Z_p to form a new vector matrix Z, and Z is used as the input of the decoder for decoding reconstruction to obtain the reconstruction matrix Â.
There are two possible vector combination modes in the present invention: 1) direct addition: the corresponding elements of Z_f and Z_p are added, which requires Z_f and Z_p to have the same feature dimension; the feature dimension of the resulting Z is unchanged; 2) horizontal splicing: Z_f and Z_p are spliced in the horizontal direction, so the number of rows is unchanged and the number of columns increases; the feature dimensions of Z_f and Z_p may differ, and the feature dimension of the resulting Z is the sum of the feature dimensions of Z_f and Z_p.
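The two combination modes above reduce to a few lines; this is a direct transcription of the description, with illustrative shapes of our choosing:

```python
import numpy as np

def combine(Z_f, Z_p, mode="add"):
    """Two combination modes: element-wise addition or horizontal concatenation."""
    if mode == "add":
        assert Z_f.shape == Z_p.shape, "'add' requires equal feature dimensions"
        return Z_f + Z_p                   # shape unchanged
    if mode == "cat":
        return np.hstack([Z_f, Z_p])       # rows unchanged, columns summed
    raise ValueError(mode)

Z_f = np.ones((4, 3)); Z_p = 2 * np.ones((4, 3))
Z_add = combine(Z_f, Z_p, "add")   # shape (4, 3)
Z_cat = combine(Z_f, Z_p, "cat")   # shape (4, 6)
```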
The losses of the generative model of the graph embedded vector matrix Z_f also include link reconstruction loss, distribution loss and node classification loss, where the distribution loss is further divided into generator loss and discriminator loss. The loss function is designed as follows:
loss_dist(G) = -log(d_fake),  (11)
loss_dist(D) = -log(d_real) - log(1 - d_fake),  (12)
loss_G = loss_link2 + loss_dist(G) + loss_label,  (13)
where A_ij and Â_ij are the element values of A and Â respectively; loss_link2, loss_dist(G), loss_dist(D) and loss_G are the link reconstruction loss, generator (encoder) distribution loss, discriminator distribution loss and total encoder loss, respectively; loss_label is the same as in formula (8). loss_dist(G) and loss_dist(D) are both calculated by BCE, while loss_link2 is calculated by the mean squared error (MSE). The iterative training process of the generative model of Z_f is shown in Algorithm 3: the encoder and the discriminator perform alternating iterative training with the goals of minimizing loss_G and loss_dist(D) respectively, so that the private information in Z_f is continuously reduced while effective information is retained, finally obtaining the graph embedded vector matrix Z_f capable of protecting sensitive links.
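Equations (11) and (12) can be checked numerically. The sketch below (with hypothetical discriminator outputs of our choosing) shows that as the generator improves and d_fake rises, the generator loss falls while the discriminator loss grows toward the equilibrium value 2·log 2 at the maximally-confused point d_real = d_fake = 0.5:

```python
import numpy as np

def generator_dist_loss(d_fake):
    """Eq. (11): the encoder-as-generator pushes d_fake toward 1."""
    return -np.log(d_fake)

def discriminator_loss(d_real, d_fake):
    """Eq. (12): the discriminator pushes d_real toward 1 and d_fake toward 0."""
    return -np.log(d_real) - np.log(1.0 - d_fake)

# Three snapshots of alternating training, ending at the confused equilibrium.
losses = [(generator_dist_loss(df), discriminator_loss(dr, df))
          for dr, df in [(0.9, 0.1), (0.7, 0.3), (0.5, 0.5)]]
```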
Z_f contains the embedded vectors of N nodes; the embedded vector of an edge is obtained by horizontally concatenating the node vectors. For example, if z_i and z_j are the embedded vectors of nodes v_i and v_j, then the embedded vector of edge e_ij is expressed as (z_i, z_j). The attacker has learned partial structural information, in which an existing edge is a positive example (label 1) and a non-existing edge is a negative example (label 0). The edge embedded vectors whose structural information is unknown are then used as the test set of the trained classification model to predict edge labels, thereby obtaining the prediction accuracy of sensitive and non-sensitive links.
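The attacker's edge-embedding construction can be sketched as follows (the helper names and the toy embedding matrix are ours):

```python
import numpy as np

def edge_embedding(Z, i, j):
    """Edge e_ij is represented as the horizontal concatenation (z_i, z_j)."""
    return np.concatenate([Z[i], Z[j]])

def build_attack_set(Z, pos_edges, neg_edges):
    """Known edges -> positive examples (label 1); known non-edges -> label 0."""
    X = [edge_embedding(Z, i, j) for i, j in pos_edges + neg_edges]
    y = [1] * len(pos_edges) + [0] * len(neg_edges)
    return np.array(X), np.array(y)

Z = np.arange(12, dtype=float).reshape(4, 3)   # 4 nodes, 3-dim embeddings
X, y = build_attack_set(Z, pos_edges=[(0, 1)], neg_edges=[(2, 3)])
```

The resulting (X, y) pairs would train the attacker's link-prediction classifier; edges of unknown status are embedded the same way and scored by the trained model.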
Privacy is evaluated by the prediction accuracy Acc_sl of the sensitive links, where Acc_sl is the ratio of the number of sensitive links predicted to be present to the total number of sensitive links.
Utility includes the prediction accuracy of non-sensitive links Acc_nsl, the link reconstruction accuracy Acc_link, the link reconstruction recall Rec_recon and the node classification accuracy Acc_node, where Acc_nsl is the ratio of the number of non-sensitive links predicted to exist to the total number of non-sensitive links, Acc_link is the ratio of the number of correctly predicted links to the total number of links, Rec_recon is the ratio of the number of correctly reconstructed links to the total number of existing links, and Acc_node is the ratio of the number of correctly predicted nodes to the total number of nodes.
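A minimal sketch of the two headline metrics follows (the helper names and toy predictions are ours; note that a lower Acc_sl means better privacy, while a higher Acc_nsl means better utility):

```python
def ratio(num, den):
    return num / den if den else 0.0

def privacy_and_utility(pred_exists, true_exists, sensitive):
    """pred_exists / true_exists: dicts mapping edge -> bool; sensitive: set.
    Acc_sl  = sensitive links predicted present / all sensitive links
    Acc_nsl = non-sensitive links predicted present / all non-sensitive links"""
    sl = [e for e in true_exists if e in sensitive]
    nsl = [e for e in true_exists if e not in sensitive]
    acc_sl = ratio(sum(pred_exists[e] for e in sl), len(sl))
    acc_nsl = ratio(sum(pred_exists[e] for e in nsl), len(nsl))
    return acc_sl, acc_nsl

true_links = {(0, 1): True, (0, 2): True, (1, 3): True, (2, 3): True}
pred = {(0, 1): False, (0, 2): True, (1, 3): True, (2, 3): False}
acc_sl, acc_nsl = privacy_and_utility(pred, true_links, sensitive={(0, 1)})
```

Here the attacker misses the sensitive link (Acc_sl = 0, ideal for privacy) while two of the three non-sensitive links are still predicted correctly.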
Example two
Simulation verification on public data sets
TABLE 1 Experimental parameters
The datasets used for the experiments are the Cora and Yale datasets. Cora is a citation network composed of machine learning articles in 7 classes, comprising 2708 article nodes with 5278 citation relationships as edges between the articles; the node attributes consist of the unique words appearing across all articles, giving 1433 dimensions. Yale is a social network comprising 8578 individuals and 405450 edges; we use the 7 grades as the classification labels of the nodes, which share 188-dimensional attributes. The experimental parameter settings are shown in Table 1, where N, |E|, F, L, |E_sl|, |E_know| and m respectively indicate the total number of nodes, the total number of edges, the initial dimension of the node attributes, the number of classes of the node labels, the number of sensitive links, the total number of edges whose structural information is known to the attacker, and the maximum number of edges added in Algorithm 1; dim(Z_p) and dim(Z_f) are the dimensions of Z_p and Z_f, and dim(Z_add) and dim(Z_cat) are the dimensions of the Z formed by direct addition (add) and horizontal concatenation (cat) of Z_p and Z_f.
In addition, we take the graph embeddings obtained from the original graph and from the graph with the sensitive links deleted as a control group to compare with SLPGE. The experimental data are shown in Tables 2 and 3 (values are percentages). SVM and MLP are the attack models; VGAE denotes the input being the original graph, VGAE_t denotes the input being the graph with the sensitive links deleted, SLPGE+(cat) denotes SLPGE using Algorithm 1 and horizontal splicing, SLPGE-(cat) denotes SLPGE using Algorithm 2 and horizontal splicing, SLPGE+(add) denotes SLPGE using Algorithm 1 and direct addition, and SLPGE-(add) denotes SLPGE using Algorithm 2 and direct addition. At the same time, we use t-SNE to visualize Z_f in two-dimensional space to observe the node classification effect: FIG. 6 shows the Cora node classification visualization provided by an embodiment of the present invention, FIG. 7 shows the Yale node classification visualization, and nodes belonging to the same label are shown in the same color.
TABLE 2Cora Experimental data
TABLE 3Yale Experimental data
By comparing the data in Tables 2 and 3, it can be found that SLPGE+ and SLPGE- reduce Acc_sl more than VGAE_t on both datasets; in particular on the Cora dataset, the maximum reduction relative to VGAE is 30.05%, while on Yale the Acc_sl of SLPGE+ drops by at most 15.03% relative to VGAE. The privacy of SLPGE is thus significantly improved over VGAE_t. However, although the privacy of SLPGE on Yale is 1.3-3.6 times that of VGAE_t, its privacy protection is still less ideal than on Cora, because the node attributes and link relationships of Yale are more closely related, making Yale's private information harder to remove. This also shows that node attributes are one of the challenges for the privacy protection of sensitive links.
The loss of some utility is a necessary cost of privacy protection. As can be seen from the visualization results, the node classification effect of SLPGE is inferior to that of VGAE; meanwhile, Acc_nsl, Acc_link, Rec_recon and Acc_node of SLPGE and VGAE_t all decline to different degrees, but the declines are generally smaller than that of Acc_sl. On Cora and Yale, Acc_nsl of SLPGE drops by at most 6.94% and 11.56% relative to VGAE, but Acc_sl drops further, by 21.76% and 13.99%. Acc_link of SLPGE drops by at most 5.75% and 9.07%. The decline ranges of Rec_recon and Acc_node of SLPGE are 5.84% to 15.50% and 11.26% to 6.79%, respectively. The comparison of the privacy and utility indicators reflects SLPGE's balance between privacy and utility.
In addition, the data in the tables show that SLPGE+ and SLPGE- behave similarly in privacy and utility, which demonstrates that both Algorithm 1 and Algorithm 2 are feasible. For the two vector combination modes, the privacy and utility of the "add" mode are better than those of the "cat" mode: because the distributions of Z_p and Z_f are both close to Gaussian and the weight of the private information in Z_p is large and constant, when the Z obtained through the "add" mode is used to reconstruct the links of the original graph, the MSE loss function forces the weight of the private information in Z_f to decrease. We therefore conclude that MSE matches the "add" mode better.
Overall, the above comparisons confirm that SLPGE has better sensitive link protection performance. By analyzing the experimental data, we can conclude that some utility information, whether structural or attribute information, is sacrificed while protecting the privacy of sensitive links, but SLPGE still retains most of the utility information. In practical applications, parts of the model structure can be adjusted to meet different task requirements.
In summary, for the privacy protection problem of sensitive links in the Internet of Things, the embodiment of the present invention provides a sensitive link protection method designed to hide sensitive links in the network and resist link prediction attacks. By designing the loss function, a trade-off between privacy and utility is achieved. Compared with traditional link perturbation methods, the method abandons the idea of directly applying link perturbation to the original graph to remove private information, thereby reducing the utility loss.
According to the embodiment of the present invention, the balance between privacy and utility is achieved by designing the loss function. Compared with traditional link perturbation methods, the idea of directly applying link perturbation to the original graph to remove private information is abandoned, thereby reducing the utility loss. The node, edge and node attribute information in the network topology of the embodiment of the present invention is compressed into the embedded matrix, and the embedded matrix can be decoded to reconstruct the point-edge information of the adjacency matrix.
By comparing the graph embedding obtained from the network topology without node attributes with the graph embedding obtained from the network topology containing node attributes, it is found that a graph embedding containing attribute information makes it easier for a link prediction model to predict the existence of sensitive links.
Experiments are carried out on two public datasets with node attributes, and the results show that the graph embedding generated by the method provides better privacy protection than merely deleting the sensitive links; meanwhile, compared with the graph embedding obtained directly from the original network topology, the method reduces the attack model's sensitive link prediction accuracy by 30.05% and 15.03%, respectively. Analysis of the privacy and utility evaluation indicators shows that the invention can balance privacy and utility and hide sensitive links at the cost of only a small loss of utility.
Those of ordinary skill in the art will understand that: the figures are schematic representations of one embodiment, and the blocks or processes shown in the figures are not necessarily required to practice the present invention.
From the above description of the embodiments, it is clear to those skilled in the art that the present invention can be implemented by software plus necessary general hardware platform. Based on such understanding, the technical solutions of the present invention may be embodied in the form of a software product, which may be stored in a storage medium, such as ROM/RAM, magnetic disk, optical disk, etc., and includes instructions for causing a computer device (which may be a personal computer, a server, or a network device, etc.) to execute the method according to the embodiments or some parts of the embodiments.
The embodiments in this specification are described in a progressive manner; for the same or similar parts among the embodiments, reference may be made to one another, and each embodiment focuses on its differences from the others. In particular, since the apparatus and system embodiments are substantially similar to the method embodiments, they are described relatively briefly, and reference may be made to the partial descriptions of the method embodiments for relevant details. The above-described apparatus and system embodiments are merely illustrative: the units described as separate parts may or may not be physically separate, and the parts shown as units may or may not be physical units; they may be located in one place or distributed over a plurality of network units. Some or all of the modules may be selected according to actual needs to achieve the purpose of the solution of this embodiment. Those of ordinary skill in the art can understand and implement the embodiments without inventive effort.
The above description is only for the preferred embodiment of the present invention, but the scope of the present invention is not limited thereto, and any changes or substitutions that can be easily conceived by those skilled in the art within the technical scope of the present invention are included in the scope of the present invention. Therefore, the protection scope of the present invention shall be subject to the protection scope of the claims.
Claims (6)
1. A sensitive link privacy protection method based on graph embedding is characterized by comprising the following steps:
representing the network topology of a network model to be subjected to privacy protection as an undirected graph G, and compressing privacy information related to sensitive links in the undirected graph G into a privacy embedded vector matrix Zp;
combining the privacy embedded vector matrix Z_p with another graph embedded vector matrix Z_f to form an embedded vector matrix Z, inputting the embedded vector matrix Z into a decoder, and reconstructing the graph structure of the undirected graph G by the decoder.
2. The method of claim 1, wherein the network topology of the network model to be privacy protected is represented as an undirected graph G, comprising:
the network topology of the network model is represented as an undirected graph G = (V, E, X), where V = {v_1, v_2, ..., v_N} is the set of nodes representing terminal devices, and N = |V| is the number of nodes; E is the set of edges representing communication links, with e_ij ∈ E defined as the link between v_i and v_j, 1 ≤ i, j ≤ N; X ∈ R^{N×F} is the node attribute matrix, with N rows representing the N nodes and F columns representing the F attributes; the connection relationship between nodes is represented by an adjacency matrix A ∈ R^{N×N}: when e_ij exists, A_ij is 1, otherwise A_ij is 0;
the privacy information related to the sensitive link refers to information that can help infer the sensitive link, including the attribute information of the two end nodes and network structure information, the network structure information including information of the first-order and higher-order neighbor nodes of the nodes.
3. The method of claim 2, wherein the generative models of the privacy embedded vector matrix Z_p and the graph embedded vector matrix Z_f both adopt the encoding-decoding structure of a variational graph autoencoder in graph neural networks, comprising an Encoder composed of a two-layer graph convolutional neural network (GCN), an inner-product Decoder, and a Softmax classifier.
4. The method as recited in claim 3, wherein compressing the privacy information associated with the sensitive links in the undirected graph G into the privacy embedded vector matrix Z_p comprises the following steps:
the generative model of the privacy embedded vector matrix Z_p preprocesses the undirected graph G through Algorithm 1 or Algorithm 2 to obtain an adjacency matrix A_p, wherein Algorithm 1 is an edge-addition algorithm for generating the privacy-hiding graph adjacency matrix and considers that the connection strength between two nodes having more common neighbors is stronger, and Algorithm 2 is an edge-deletion algorithm for generating the privacy-hiding graph adjacency matrix and considers that when only the sensitive links and their adjacent links are retained in the graph, the key information in the graph is concentrated on the sensitive links; the adjacency matrix A_p and the node attribute matrix X are used as the input of the encoder, and the encoder outputs Z_p;
the generative model of the privacy embedded vector matrix Z_p has three optimization objectives: 1) make Â_p and A_p as similar as possible; 2) make the distribution of Z_p as close to Gaussian as possible; 3) make the node classification based on Z_p as accurate as possible;
the loss function of the generative model of the privacy embedded vector matrix Z_p is designed as follows:
loss_1 = loss_link1 + loss_dist + loss_label,  (9)
where A_ij and Â_ij are the element values of A_p and Â_p respectively, loss_link1, loss_dist and loss_label are the link reconstruction loss, distribution loss and node classification loss respectively, and loss_1 is the total loss; p is the ratio of the number of 0 elements to the number of 1 elements in A_p, used to address the imbalance between positive and negative samples; y_il represents the true value of node v_i belonging to class l, y_il = 1 indicating that node v_i belongs to class l, otherwise y_il = 0; and ŷ_il is the element value of Ŷ, representing the probability that node v_i is classified into class l.
5. The method of claim 4, wherein combining the privacy embedded vector matrix Z_p with another graph embedded vector matrix Z_f to form the embedded vector matrix Z, inputting the embedded vector matrix Z into the decoder, and reconstructing the graph structure of the undirected graph G by the decoder comprises the following steps:
the generative model of the graph embedded vector matrix Z_f takes the encoder as a generator and a two-layer fully-connected network as a discriminator; all sensitive links in the undirected graph G are deleted to obtain an adjacency matrix A_t, and A_t and X are input into the generative model of the graph embedded vector matrix Z_f, which outputs the graph embedded vector matrix Z_f after encoding;
Z_f is directly input into the classifier for node classification; meanwhile, Gaussian samples and Z_f are input into the discriminator as true samples and false samples respectively, and the discriminator outputs two estimated values d_real and d_fake; the graph embedded vector matrix Z_f and the privacy embedded vector matrix Z_p are vector-combined to obtain the embedded vector matrix Z;
the embedded vector matrix Z is sent to the decoder for decoding reconstruction, and the decoder directly uses the inner product operation to obtain the reconstruction matrix Â; the encoder and discriminator losses are computed from the reconstruction matrix Â and the estimated values d_real and d_fake, and the generative model of the graph embedded vector matrix Z_f is iteratively trained by minimizing the encoder and discriminator losses until the model converges, finally obtaining the embedded vector matrix Z capable of protecting the sensitive links.
6. The method of claim 5, wherein the losses of the generative model of the graph embedded vector matrix Z_f comprise link reconstruction loss, distribution loss and node classification loss, the distribution loss being divided into generator loss and discriminator loss, and the loss function is designed as follows:
loss_dist(G) = -log(d_fake),  (11)
loss_dist(D) = -log(d_real) - log(1 - d_fake),  (12)
loss_G = loss_link2 + loss_dist(G) + loss_label,  (13)
where A_ij and Â_ij are the element values of A and Â respectively; loss_link2, loss_dist(G), loss_dist(D) and loss_G are the link reconstruction loss, generator distribution loss, discriminator distribution loss and total encoder loss, respectively; loss_label is the same as in formula (8); loss_dist(G) and loss_dist(D) are both obtained by BCE calculation, and loss_link2 is calculated by the mean squared error function.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202210191540.7A CN114662143B (en) | 2022-02-28 | 2022-02-28 | Sensitive link privacy protection method based on graph embedding |
Publications (2)
Publication Number | Publication Date |
---|---|
CN114662143A true CN114662143A (en) | 2022-06-24 |
CN114662143B CN114662143B (en) | 2024-05-03 |
Family
ID=82026596
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN202210191540.7A Active CN114662143B (en) | 2022-02-28 | 2022-02-28 | Sensitive link privacy protection method based on graph embedding |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN114662143B (en) |
Citations (9)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN109800573A (en) * | 2019-01-17 | 2019-05-24 | 西安电子科技大学 | Based on the anonymous social networks guard method with link disturbance of degree |
CN109829337A (en) * | 2019-03-07 | 2019-05-31 | 广东工业大学 | A kind of method, system and the equipment of community network secret protection |
CN110210248A (en) * | 2019-06-13 | 2019-09-06 | 重庆邮电大学 | A kind of network structure towards secret protection goes anonymization systems and method |
CN111050021A (en) * | 2019-12-17 | 2020-04-21 | 中国科学技术大学 | Image privacy protection method based on two-dimensional code and reversible visual watermark |
US20200250139A1 (en) * | 2018-12-31 | 2020-08-06 | Dathena Science Pte Ltd | Methods, personal data analysis system for sensitive personal information detection, linking and purposes of personal data usage prediction |
US20200356858A1 (en) * | 2019-05-10 | 2020-11-12 | Royal Bank Of Canada | System and method for machine learning architecture with privacy-preserving node embeddings |
CN111931903A (en) * | 2020-07-09 | 2020-11-13 | 北京邮电大学 | Network alignment method based on double-layer graph attention neural network |
US11184162B1 (en) * | 2018-09-28 | 2021-11-23 | NortonLifeLock Inc. | Privacy preserving secure task automation |
CN113901500A (en) * | 2021-10-19 | 2022-01-07 | 平安科技(深圳)有限公司 | Graph topology embedding method, device, system, equipment and medium |
Non-Patent Citations (6)
Title |
---|
KAINAN ZHANG 等: "Link-Privacy Preserving Graph Embedding Data Publication with Adversarial Learning", TSINGHUA SCIENCE AND TECHNOLOGY, vol. 27, no. 2, 30 April 2022 (2022-04-30), pages 244 - 256 * |
KAIYANG LI 等: "Adversarial Privacy-Preserving Graph Embeddeding Against Inference Attack", IEEE INTERNET OF THINGS JOURNAL, vol. 8, no. 8, 6 November 2020 (2020-11-06), pages 6904 - 6915 * |
YUANFENG LUO 等: "A creative approach to understanding the hidden Information within the business data using deep learning", INFORMATION PROCESSING & MANAGEMENT, vol. 58, no. 5, 30 September 2021 (2021-09-30), pages 1 - 12 * |
张换香 等: "A graph perturbation algorithm preserving node reachability" (一种保持节点可达性的图扰动算法), Computer Applications and Software, vol. 35, no. 10, 12 October 2018 (2018-10-12), pages 299 - 304 * |
牛犇 等: "Research on scene-associated privacy protection mechanisms in mobile networks" (移动网络中场景关联的隐私保护机制研究), Chinese Journal of Network and Information Security, vol. 1, no. 01, 15 December 2015 (2015-12-15), pages 31 - 42 * |
王淼 等: "Sensitive link inference method for large-scale social networks" (大规模社会网络敏感链接推理方法), Journal of Frontiers of Computer Science and Technology, vol. 7, no. 04, 15 April 2013 (2013-04-15), pages 304 - 314 * |
Also Published As
Publication number | Publication date |
---|---|
CN114662143B (en) | 2024-05-03 |
Similar Documents
Publication | Publication Date | Title |
---|---|---|
Wang et al. | LightLog: A lightweight temporal convolutional network for log anomaly detection on the edge | |
CN113961759B (en) | Abnormality detection method based on attribute map representation learning | |
Wu et al. | Structural entropy guided graph hierarchical pooling | |
CN112508085A (en) | Social network link prediction method based on perceptual neural network | |
Guo et al. | Deep graph translation | |
Chen et al. | An efficient network behavior anomaly detection using a hybrid DBN-LSTM network | |
Du et al. | GAN-based anomaly detection for multivariate time series using polluted training set | |
CN113361606A (en) | Deep map attention confrontation variational automatic encoder training method and system | |
Irfan et al. | A novel lifelong learning model based on cross domain knowledge extraction and transfer to classify underwater images | |
Keriven et al. | On the universality of graph neural networks on large random graphs | |
CN113328755B (en) | Compressed data transmission method facing edge calculation | |
CN113378160A (en) | Graph neural network model defense method and device based on generative confrontation network | |
CN113269228B (en) | Method, device and system for training graph network classification model and electronic equipment | |
CN111861756A (en) | Group partner detection method based on financial transaction network and implementation device thereof | |
Zhang et al. | Link-privacy preserving graph embedding data publication with adversarial learning | |
CN111461348A (en) | Deep network embedded learning method based on graph core | |
Xu et al. | A Hierarchical Intrusion Detection Model Combining Multiple Deep Learning Models With Attention Mechanism | |
Liu et al. | Rgse: Robust graph structure embedding for anomalous link detection | |
Yuan et al. | Motif-level anomaly detection in dynamic graphs | |
Xiao et al. | ANE: Network embedding via adversarial autoencoders | |
CN116306780B (en) | Dynamic graph link generation method | |
CN117272195A (en) | Block chain abnormal node detection method and system based on graph convolution attention network | |
Pandey et al. | A metaheuristic autoencoder deep learning model for intrusion detector system | |
Wu et al. | Application of quantisation‐based deep‐learning model compression in JPEG image steganalysis | |
CN114662143B (en) | Sensitive link privacy protection method based on graph embedding |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
PB01 | Publication | ||
PB01 | Publication | ||
SE01 | Entry into force of request for substantive examination | ||
SE01 | Entry into force of request for substantive examination | ||
GR01 | Patent grant | ||
GR01 | Patent grant |