CN113361606A - Deep graph attention adversarial variational autoencoder training method and system - Google Patents

Deep graph attention adversarial variational autoencoder training method and system

Info

Publication number
CN113361606A
CN113361606A CN202110630525.3A
Authority
CN
China
Prior art keywords
attention
graph
encoder
adversarial
data
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202110630525.3A
Other languages
Chinese (zh)
Inventor
张维玉
翁自强
夏忠秀
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Qilu University of Technology
Original Assignee
Qilu University of Technology
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Qilu University of Technology filed Critical Qilu University of Technology
Priority to CN202110630525.3A priority Critical patent/CN113361606A/en
Publication of CN113361606A publication Critical patent/CN113361606A/en
Pending legal-status Critical Current

Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00Pattern recognition
    • G06F18/20Analysing
    • G06F18/24Classification techniques
    • G06F18/241Classification techniques relating to the classification model, e.g. parametric or non-parametric approaches
    • G06F18/2411Classification techniques relating to the classification model, e.g. parametric or non-parametric approaches based on the proximity to a decision surface, e.g. support vector machines
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/02Neural networks
    • G06N3/04Architecture, e.g. interconnection topology
    • G06N3/045Combinations of networks
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/02Neural networks
    • G06N3/08Learning methods

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Data Mining & Analysis (AREA)
  • Evolutionary Computation (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Artificial Intelligence (AREA)
  • General Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • General Health & Medical Sciences (AREA)
  • Software Systems (AREA)
  • Molecular Biology (AREA)
  • Computing Systems (AREA)
  • Biophysics (AREA)
  • Biomedical Technology (AREA)
  • Mathematical Physics (AREA)
  • Computational Linguistics (AREA)
  • Health & Medical Sciences (AREA)
  • Bioinformatics & Cheminformatics (AREA)
  • Bioinformatics & Computational Biology (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Evolutionary Biology (AREA)
  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)

Abstract

The disclosure provides a deep graph attention adversarial variational autoencoder training method and system, comprising the following steps: obtaining a first feature vector corresponding to each node in graph data; after the first feature vector undergoes an aggregation operation with the attention mechanism as its core, outputting a second feature vector of each node, used to form a plurality of groups of independent attention mechanisms; applying attention distributions to a plurality of relevant features between the central node and its neighbor nodes by aggregating the plurality of groups of independent attention mechanisms, so as to form a graph attention adversarial autoencoder; encoding the graph data using the encoder of the graph attention adversarial autoencoder to obtain graph representation vectors; and performing inner-product processing on the graph representation vectors with a decoder to reconstruct the graph data and predict whether a connecting edge exists between any two points in the graph. The problems of over-fitting and over-smoothing are effectively alleviated, and the graph embedding capability is further improved.

Description

Deep graph attention adversarial variational autoencoder training method and system
Technical Field
The disclosure belongs to the technical field of encoders, and particularly relates to a deep graph attention adversarial variational autoencoder training method and system.
Background
The statements in this section merely provide background information related to the present disclosure and may not necessarily constitute prior art.
Non-Euclidean data such as graph data are more difficult to process than Euclidean data such as images, text, and speech. Therefore, graph embedding algorithms have become a research hotspot. Graph research focuses on tasks such as node classification, link prediction, graph classification and graph generation, and graph embedding algorithms can be divided into three types: graph factorization, random walk, and graph neural networks.
Recently, graph embedding algorithms have entered the neural network era. Kipf et al. simplified the definition of frequency-domain convolution and proposed the GCN, which performs the convolution operation in the spatial domain and greatly improves the embedding capability of graph convolution models. Since then, researchers have proposed many variants of GCN. GraphSAGE does not limit sampling to the topology information of the nodes, but instead exploits the intrinsic features of the nodes and abandons the diffusion mechanism involving a large number of parameters, thereby enabling distributed training and inductive learning on large-scale graph data. The graph attention network (GAT) performs aggregation over neighbor nodes using an attention mechanism to adaptively assign neighbor weights.
The above methods are supervised graph embedding methods. In recent years, the application of graph data has become more and more widespread, and the structure of graphs has become more complicated. In practical application scenarios, many data labels have a high acquisition threshold. Therefore, it is of great value to study effective unsupervised learning on graph data. The graph autoencoder based on reconstruction loss is a typical unsupervised learning method. GAE and VGAE use an encoder to obtain latent vectors, while a decoder uses the latent variables to reconstruct the structure. Due to the high-dimensional and complex distribution characteristics of graph data, the distribution of the latent vectors produced by the encoder deviates from the actual distribution. To address the distribution of the encoded data, DVNE embeds the nodes directly as Gaussian distributions and uses the Wasserstein distance as a similarity measure between distributions, effectively modeling the uncertainty of nodes in the network. ARGA and ARVGA further introduce an adversarial mechanism that forces the encoder, through adversarial training, to generate latent vectors closer to the true distribution of the data. Although these autoencoders have achieved some success, they do not take into account differences in node importance. Inspired by the success of deep CNNs in image classification, researchers have explored how to construct deep GCNs, including GCN, GraphSAGE, ResGCN, and JKNet. However, none of them presents a detailed architecture.
In addition, current encoders use a relatively small number of graph attention layers (typically 1 or 2) and cannot fully unleash the embedding performance of the graph attention encoder. The number of attention layers of the model cannot be deepened because of the over-smoothing and over-fitting problems.
Thus, the technical problems in the prior art include:
1. Existing graph autoencoders ignore the differences between graph neighbor nodes and the latent data distribution of the graph.
2. Over-fitting and over-smoothing are two major problems that hinder the deepening of graph models. Over-fitting arises when an over-parameterized model is used to fit the limited distribution of training data: the learned model fits the training data very well but generalizes poorly to test data, and the over-fitting problem is especially prominent when deep GCNs are applied to small graphs. Over-smoothing, at the other extreme, makes training deep GCNs very difficult.
The nature of graph embedding is aggregation: if the number of layers is unlimited, the representations of all nodes converge to a fixed point, which isolates the result from the input features and causes the gradient to vanish, a phenomenon known as over-smoothing.
Disclosure of Invention
In order to overcome the defects of the prior art, a deep graph attention adversarial variational autoencoder training method is provided. The deep graph attention adversarial variational autoencoder deepens the number of graph attention layers in the encoder, effectively alleviates the problems of over-fitting and over-smoothing, and further improves the graph embedding capability.
In order to achieve the above object, one or more embodiments of the present disclosure provide the following technical solutions:
in a first aspect, a deep graph attention adversarial variational autoencoder training method is disclosed, comprising:
obtaining a first feature vector corresponding to each node in graph data;
after the first feature vector undergoes an aggregation operation with the attention mechanism as its core, outputting a second feature vector of each node, used to form a plurality of groups of independent attention mechanisms;
applying attention distributions to a plurality of relevant features between the central node and its neighbor nodes by aggregating the plurality of groups of independent attention mechanisms, so as to form a graph attention adversarial autoencoder;
encoding the graph data using the encoder of the graph attention adversarial autoencoder to obtain graph representation vectors;
and performing inner-product processing on the graph representation vectors with a decoder to reconstruct the graph data and predict whether a connecting edge exists between any two points in the graph.
In a further technical solution, the step of obtaining the second feature vector of each node includes:
setting weight coefficients of adjacent nodes;
selecting a single fully connected layer as the correlation function;
normalizing the correlation calculations of all the neighbors to obtain a weight coefficient for each neighbor node;
and after the weight coefficients are obtained, obtaining the second feature vector of the node according to the weighted-summation strategy of the attention mechanism.
According to a further technical scheme, graph attention aggregation is carried out with the obtained weight coefficients, realizing the optimization of the graph attention adversarial autoencoder by the random edge deletion technique.
In a further technical scheme, in the graph attention aggregation process, the neighbor nodes are identified by means of the adjacency matrix, and after the unnormalized attention coefficients are obtained in the graph attention layer, a masking operation is carried out.
In a further technical scheme, the graph attention adversarial autoencoder is combined with a random edge deletion technique, and the combination is called the deep graph attention adversarial variational autoencoder.
In a further technical scheme, the random edge deletion technique randomly deletes a certain proportion of the edges of the input graph, specifically: V·p non-zero elements of the adjacency matrix A are randomly reset to zero, where V is the total number of edges and p is the deletion rate.
In a further technical scheme, the encoder in the automatic encoder is taken as a generator of a countermeasure network, and the generator deceives the discriminator by generating fake data, wherein the fake data refers to potential variables obtained by the encoder through image data encoding;
the task of the discriminator is to distinguish whether the sample is from the true data or the generator, the discriminator will be from the prior distributionzThe output data is judged to be positive, and the data from the latent variable output is judged to be negative.
In a second aspect, a deep map attention confrontation variational automatic encoder training system is disclosed, comprising:
a feature vector formation module configured to: obtaining a first feature vector corresponding to each node in graph data;
after the first feature vector undergoes an aggregation operation with the attention mechanism as its core, outputting a second feature vector of each node, used to form a plurality of groups of independent attention mechanisms;
an auto-encoder training module configured to: applying attention distributions to a plurality of relevant features between the central node and its neighbor nodes by aggregating the plurality of groups of independent attention mechanisms, so as to form a graph attention adversarial autoencoder;
encoding the graph data using the encoder of the graph attention adversarial autoencoder to obtain graph representation vectors;
and performing inner-product processing on the graph representation vectors with a decoder to reconstruct the graph data and predict whether a connecting edge exists between any two points in the graph.
The above one or more technical solutions have the following beneficial effects:
the present invention focuses on differential representation of neighboring nodes, proposing an attention-directed anti-variation autoencoder (AAVGA). The purpose is to distinguish graph structure information and apply a counterregularization mechanism to improve the graph embedding capability of the model. The encoder generates potential feature vectors through a graph attention layer, differentiates node representations in an embedding process by utilizing adaptive distribution of weights, and adds a plurality of groups of independent attention mechanisms, so that attention aggregation is more stable. To normalize the distribution of encoded data, a countering mechanism is introduced into the attention-based graph variation autoencoder. The component can determine whether the input is from a low-dimensional representation of the graph network or from a true distribution of the sample. The discriminator encourages the encoder to generate low-dimensional variables with a more realistic distribution of data and learns an efficient representation of the graph.
In addition, the present invention introduces a random edge deletion technique (RDEdge), which helps the model randomly discard some edges of the input graph in each training epoch. RDEdge is regarded herein as a data augmentation technique: different randomly deformed copies of the original graph are generated by RDEdge, which enhances the randomness and diversity of the input data and thus better prevents over-fitting. RDEdge can also be viewed as a message-passing reducer: in the graph attention layer, messages are passed between neighboring nodes along edges, and deleting certain edges makes the node connections sparser, so excessive smoothing can be avoided to some extent as the attention layers get deeper. RDEdge greatly helps the training of the graph model; as shown in FIG. 1, after combining RDEdge, AAVGA can deal well with the problems of over-fitting and over-smoothing. This allows the coding layers of the model to be further deepened, improving the graph embedding capability of the model.
The invention combines the graph attention adversarial autoencoder (AAVGA) with the random edge deletion technique (RDEdge); the combination is called the deep graph attention adversarial variational autoencoder (AAVGA-d). AAVGA-d deepens the number of graph attention layers in the encoder, effectively alleviates the problems of over-fitting and over-smoothing, and further improves the graph embedding capability.
Advantages of additional aspects of the invention will be set forth in part in the description which follows, and in part will be obvious from the description, or may be learned by practice of the invention.
Drawings
The accompanying drawings, which are included to provide a further understanding of the disclosure, illustrate embodiments of the disclosure and together with the description serve to explain the disclosure and are not to limit the disclosure.
FIG. 1 is the overall architecture of AAVGA-d according to an embodiment of the disclosure;
FIG. 2 shows the results of the link prediction experiment for the 3 models;
FIG. 3 is a schematic diagram comparing the losses of layer-wise RDEdge and RDEdge;
FIG. 4(a) is a schematic diagram of the distances between attention layer outputs before training;
FIG. 4(b) is a schematic diagram of the distances between attention layer outputs after training;
FIG. 5 shows the cluster visualizations of the graphs: the Cora, Citeseer and Pubmed visualizations, respectively.
Detailed Description
It should be noted that the following detailed description is exemplary and is intended to provide further explanation of the disclosure. Unless defined otherwise, all technical and scientific terms used herein have the same meaning as commonly understood by one of ordinary skill in the art to which this disclosure belongs.
It is noted that the terminology used herein is for the purpose of describing particular embodiments only and is not intended to be limiting of example embodiments according to the present disclosure. As used herein, the singular forms "a", "an" and "the" are intended to include the plural forms as well, and it should be understood that when the terms "comprises" and/or "comprising" are used in this specification, they specify the presence of stated features, steps, operations, devices, components, and/or combinations thereof, unless the context clearly indicates otherwise.
The embodiments and features of the embodiments in the present disclosure may be combined with each other without conflict.
The importance of each node in the graph is different, so the difference in node importance must be considered in the graph embedding learning process, where how to measure this importance is the first problem to be considered.
The graph representation vectors embedded by the graph attention encoder deviate from the real distribution of the graph data; how to ensure that the graph representation vectors obtained by encoding obey the real data distribution is another problem to be solved by the invention.
Reviewing the various graph embedding methods, shallow graph neural networks (typically 2 layers) are used for both supervised and unsupervised graph embedding. When facing large-scale graph embedding learning, the number of graph neural network layers needs to be deepened. However, over-fitting and over-smoothing are two major problems that hinder the deepening of graph models. Over-fitting arises when an over-parameterized model is used to fit the limited distribution of training data: the learned model fits the training data very well but generalizes poorly to test data, and the over-fitting problem is especially prominent when deep GCNs are applied to small graphs.
Over-smoothing, at the other extreme, makes training deep GCNs very difficult. The nature of graph convolution is aggregation: if the number of layers is unlimited, the representations of all nodes converge to a fixed point, which isolates the result from the input features and causes the gradient to vanish, a phenomenon known as over-smoothing.
Example one
Referring to FIG. 1, this embodiment discloses a deep graph attention adversarial variational autoencoder training method. The proposed AAVGA-d incorporates the strategy of GAT and replaces the two-layer graph convolution network of the ordinary encoder with a multi-layer graph attention network to generate latent representations of the graph data.
The method comprises the following specific steps:
order and node v in L layeriThe corresponding feature vector is hi
Figure BDA0003103535200000071
Wherein d is(1)Representing the characteristic length of the node. Outputting a new feature vector h 'of each node after the aggregation operation taking attention mechanism as a core'iWherein
Figure BDA0003103535200000072
d(1+1)Representing the length of the output feature vector, where this is aggregatedThe resultant operation is referred to as the graph attention layer.
In order to obtain a new feature vector of each node, the specific process is as follows:
Assume the central node is v_i. The weight coefficient of a neighboring node v_j with respect to v_i is set as:

e_ij = a(Wh_i, Wh_j)    (1)

where W ∈ ℝ^{d^{(l+1)} × d^{(l)}} is the weight parameter of the node feature transformation of this layer, and a(·) is a function computing the correlation between two nodes. In principle, the weight of any node in the graph with respect to node v_i could be computed here; however, to simplify the computation, it is restricted to first-order neighbors only. a can be defined as a parameter-free correlation computation using the inner product of vectors ⟨Wh_i, Wh_j⟩; alternatively, it may be defined as a parameterized neural network layer, as long as a: ℝ^{d^{(l+1)}} × ℝ^{d^{(l+1)}} → ℝ outputs a scalar value representing the correlation between the two vectors. Here a single fully connected layer is chosen as the correlation function:

e_ij = LeakyReLU(a^T [Wh_i ‖ Wh_j])    (2)

where the weight parameter is a ∈ ℝ^{2d^{(l+1)}} and the activation function is LeakyReLU. To better assign the weights, the correlation calculations over all neighbors are normalized with softmax:

α_ij = softmax_j(e_ij) = exp(e_ij) / Σ_{k∈N_i} exp(e_ik)    (3)
After the weight coefficients are computed, the new feature vector of node v_i is obtained according to the weighted-summation strategy of the attention mechanism:

h'_i = σ( Σ_{j∈N_i} α_ij W h_j )    (4)

To further improve the expressive capacity of the attention layer, a multi-head attention mechanism is also introduced in AAVGA-d, in which formula (5) is used to form K groups of independent attention mechanisms. Compared with GAT, to reduce the dimensionality of the output latent feature vectors, the concatenation operation is replaced by an averaging operation:

h'_i = σ( (1/K) Σ_{k=1}^{K} Σ_{j∈N_i} α_ij^k W^k h_j )    (5)

By aggregating multiple groups of independent attention mechanisms, the multi-head attention mechanism can apply attention distributions to multiple correlated features between the central node and its neighbor nodes, thereby enhancing the representation capability of the encoder.
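For reference, a minimal PyTorch sketch of one such graph attention layer with averaged multi-head attention is given below; the class and parameter names (GraphAttentionLayer, n_heads, negative_slope) are illustrative assumptions, not the reference implementation of the patent.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class GraphAttentionLayer(nn.Module):
    """One graph attention layer with K heads averaged, cf. formulas (2), (3) and (5)."""
    def __init__(self, in_dim, out_dim, n_heads=4, negative_slope=0.2):
        super().__init__()
        self.n_heads = n_heads
        self.W = nn.Parameter(torch.empty(n_heads, in_dim, out_dim))  # per-head transform W^k
        self.a = nn.Parameter(torch.empty(n_heads, 2 * out_dim))      # per-head attention vector a^k
        nn.init.xavier_uniform_(self.W)
        nn.init.xavier_uniform_(self.a)
        self.leaky_relu = nn.LeakyReLU(negative_slope)

    def forward(self, H, adj):
        # H: (N, in_dim) node features; adj: (N, N) 0/1 adjacency, assumed to contain self-loops
        out_dim = self.W.size(-1)
        head_outputs = []
        for k in range(self.n_heads):
            Wh = H @ self.W[k]                                          # (N, out_dim)
            # e_ij = LeakyReLU(a^T [Wh_i || Wh_j]) for all pairs, formula (2)
            e = self.leaky_relu(Wh @ self.a[k, :out_dim, None] +
                                (Wh @ self.a[k, out_dim:, None]).T)     # (N, N)
            e = e.masked_fill(adj == 0, float('-inf'))                  # keep first-order neighbours only
            alpha = F.softmax(e, dim=1)                                 # formula (3)
            head_outputs.append(alpha @ Wh)                             # weighted sum, formula (4)
        # Average the K heads instead of concatenating them, formula (5)
        return F.elu(torch.stack(head_outputs).mean(dim=0))
```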
The present invention uses an attention-based encoder to fit μ and σ:

μ = GNN_μ(X, A)    (7)

log σ = GNN_σ(X, A)    (8)

where μ is the matrix of mean vectors μ_i, log σ shares the weights W with μ in the attention layers, and σ² is the variance.
The output of the attention-based encoder is the distribution N(μ, σ²) of the low-dimensional representation vectors of the graph; the prior distribution of the graph data is assumed to be Gaussian. The model uses a graph attention variational encoder in the encoding part, and a graph representation vector is obtained from the learned representation-vector distribution through a sampling operation using the reparameterization trick: Z = μ + ε·σ, where ε is sampled from the standard Gaussian distribution N(0, 1). In short, the low-dimensional representation vectors are constrained to a (Gaussian) distribution and randomly sampled from it; the sampled low-dimensional representation vectors can reproduce approximately real samples through the decoder, and can even yield new samples. This training pattern of the graph variational autoencoder can often obtain better results than the plain graph autoencoder.
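A sketch of the variational encoding and the reparameterized sampling Z = μ + ε·σ is shown below, reusing the GraphAttentionLayer sketched above; for simplicity it fits μ and log σ with separate attention heads rather than the shared weights described above, so it is an illustrative approximation rather than the patented configuration.

```python
class AttentionVariationalEncoder(nn.Module):
    """Graph attention encoder fitting mu and log(sigma), formulas (7)-(8),
    followed by reparameterized sampling Z = mu + eps * sigma."""
    def __init__(self, in_dim, hid_dim, z_dim, n_heads=4):
        super().__init__()
        self.hidden = GraphAttentionLayer(in_dim, hid_dim, n_heads)
        self.to_mu = GraphAttentionLayer(hid_dim, z_dim, n_heads)        # fits mu
        self.to_log_sigma = GraphAttentionLayer(hid_dim, z_dim, n_heads) # fits log(sigma)

    def forward(self, X, adj):
        H = self.hidden(X, adj)
        mu = self.to_mu(H, adj)                    # formula (7)
        log_sigma = self.to_log_sigma(H, adj)      # formula (8)
        eps = torch.randn_like(mu)                 # eps ~ N(0, 1)
        Z = mu + eps * torch.exp(log_sigma)        # reparameterization trick
        return Z, mu, log_sigma
```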
It should be noted that the model of the embodiment of the present disclosure uses the graph attention variational autoencoder in the encoding part, integrating the variational autoencoder and the graph attention autoencoder.
Specifically, the graph is embedded using a multi-layer graph attention network, and the distribution of the representation vectors of the graph is learned.
The variational graph autoencoder is defined as follows:

q(Z | X, A) = Π_{i=1}^{N} q(z_i | X, A)    (9)

q(z_i | X, A) = N(z_i | μ_i, diag(σ_i²))    (10)
Regarding the decoder: the invention performs inner-product processing on the graph representation vectors in the decoder to reconstruct the graph A:

Z = GNN(X, A)    (11)

Â = σ(Z Z^T)    (12)

The adjacency relations are reconstructed using the inner product of the low-dimensional representation vectors of the graph, so as to predict whether a connecting edge exists between any two points in the graph:

p(A | Z) = Π_{i=1}^{N} Π_{j=1}^{N} p(A_ij | z_i, z_j)    (13)

p(A_ij = 1 | z_i, z_j) = σ(z_i^T z_j)    (14)
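A minimal sketch of the inner-product decoder of formulas (12)-(14) follows, continuing the sketches above; the 0.5 decision threshold is an illustrative choice, not fixed by the patent.

```python
import torch

def inner_product_decoder(Z):
    """Reconstruct the adjacency matrix from the graph representation vectors:
    p(A_ij = 1 | z_i, z_j) = sigmoid(z_i^T z_j), formula (14)."""
    return torch.sigmoid(Z @ Z.T)

def has_edge(Z, i, j, threshold=0.5):
    """Predict whether a connecting edge exists between nodes i and j."""
    return inner_product_decoder(Z)[i, j].item() > threshold
```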
Empirically, the standard normal distribution is chosen as the prior distribution of the latent variables z of the graph:

p(Z) = Π_{i=1}^{N} p(z_i) = Π_{i=1}^{N} N(z_i | 0, I)

The complete loss function is defined as follows:

L = E_{q(Z|X,A)}[log p(A|Z)] − KL[q(Z|X,A) ‖ p(Z)]
If only the reconstruction term E_{q(Z|X,A)}[log p(A|Z)] were used as the loss function to optimize the encoder, the variance obtained by the model would shrink to zero, since suppressing the sampling randomness reduces the difference between the generated samples and the real samples. However, the main goal is to optimize the variational autoencoder. To achieve this goal, the KL divergence between the distribution of the latent vectors and the standard normal distribution is added to the loss function, forcing the distribution of each latent vector to approximate the standard normal distribution.
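The variational loss can be sketched as below, continuing the sketches above; the binary cross-entropy form of the reconstruction term and the per-node averaging of the KL term are illustrative normalization choices.

```python
import torch
import torch.nn.functional as F

def variational_loss(A_pred, A_true, mu, log_sigma):
    """Reconstruction term plus KL divergence to the standard normal prior.
    A_pred, A_true: dense float adjacency matrices with entries in [0, 1]."""
    # Reconstruction: binary cross-entropy between predicted and real adjacency
    recon = F.binary_cross_entropy(A_pred, A_true)
    # KL[ N(mu, sigma^2) || N(0, I) ] in closed form for a diagonal Gaussian
    kl = -0.5 * torch.mean(
        torch.sum(1 + 2 * log_sigma - mu.pow(2) - torch.exp(2 * log_sigma), dim=1))
    return recon + kl
```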
It should be further noted that the graph data, after being embedded by the encoder, is described by a distribution of graph representation vectors. The prior distribution of the graph data is assumed to be the Gaussian distribution N(0, 1), but the embedded distribution does not necessarily follow this Gaussian distribution; therefore, an adversarial regularization constraint is added during training, forcing the distribution of the representation vectors generated by the encoder to follow the prior distribution of the graph data. The decoder reconstructs the adjacency matrix of the graph from the representation vectors, and the encoder is trained through the loss function to generate more accurate representation vectors, so that the reconstructed adjacency matrix is as similar as possible to the real adjacency matrix of the graph; the discriminator bears no direct relation to the unsupervised encoder and decoder.
Adversarial mechanism combined with the encoder:
The adversarial model of the present invention consists of two parts. The encoder in the graph attention autoencoder acts as the generator of the adversarial network. The generator attempts to fool the discriminator by generating fake data, where the fake data refers to the latent variables obtained by the encoder from encoding the graph data; the loss of the generator is given by formula (16). The task of the discriminator is to distinguish whether a sample comes from the real data or from the generator. The discriminator judges the data output from the prior distribution p_z as positive and the data output from the latent variables z as negative, with the loss:

L_D = − E_{z′∼p_z}[log D(z′)] − E_{Z}[log(1 − D(Z))]
The present invention uses the Gaussian distribution as the prior distribution of the graph data and assumes that the latent vectors z generated by the encoder do not satisfy the prior distribution of the data in Euclidean space; therefore, a standard multilayer perceptron is used as the discriminator. During the embedding learning process, an adversarial regularization constraint is applied to reduce the deviation of the data distribution during training. The main goal of the adversarial model is to jointly train the encoder and the discriminator through a minimax game so that they optimize each other:

min_G max_D  E_{z∼p_z}[log D(z)] + E_{X}[log(1 − D(G(X)))]

D is trained to maximize the probability of correctly discriminating the samples from the training data and from G; at the same time, G is trained to minimize log(1 − D(G(Z))).
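An illustrative sketch of the discriminator and of the two losses of the minimax game follows; the hidden size and the non-saturating generator loss (−log D(G(Z)) in place of minimizing log(1 − D(G(Z)))) are common substitutions, not specifics of the patent.

```python
import torch
import torch.nn as nn

class Discriminator(nn.Module):
    """Standard multilayer perceptron judging whether a latent vector comes from
    the Gaussian prior (positive) or from the encoder (negative)."""
    def __init__(self, z_dim, hid_dim=64):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(z_dim, hid_dim), nn.ReLU(),
            nn.Linear(hid_dim, 1), nn.Sigmoid())

    def forward(self, z):
        return self.net(z)

def adversarial_losses(D, Z_fake, z_prior, eps=1e-8):
    """Discriminator loss and generator (encoder) loss of the minimax game."""
    d_loss = -(torch.log(D(z_prior) + eps).mean()
               + torch.log(1.0 - D(Z_fake.detach()) + eps).mean())
    g_loss = -torch.log(D(Z_fake) + eps).mean()   # non-saturating generator loss
    return d_loss, g_loss
```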
Random edge deletion technique:
In order to solve the over-fitting and over-smoothing problems caused by deepening the graph attention layers, the invention designs a random edge deletion technique (RDEdge) for the graph attention model. DropEdge has contributed to the resistance of graph models to over-smoothing, but it does not work well for deep graph attention models, because graph attention does not rely on the graph Laplacian matrix during aggregation.
The invention deepens the number of graph attention layers in the encoder (to 8 layers) to obtain a deep graph attention model and proposes RDEdge. In each training epoch, the RDEdge technique randomly deletes a certain proportion of the edges of the input graph. Specifically, it randomly resets V·p non-zero elements of the adjacency matrix A to zero, where V is the total number of edges and p is the deletion rate. If the adjacency matrix after randomly deleting edges is denoted A_rd, its relationship with the original adjacency matrix A is:

A_rd = A − A′    (17)

where A′ is a sparse matrix formed from a randomly selected subset of size V·p of the original adjacency matrix A. During graph attention aggregation, the adjacency matrix A does not directly participate in the computation, but the neighbor nodes need to be identified by means of A. In the graph attention layer, after the unnormalized attention coefficients e_ij are obtained by formula (2), a mask operation is required:

e_ij^rd = A_rd ⊙ e_ij    (18)

After the mask operation, e_ij^rd is normalized by formula (3) to obtain the weight coefficient α_ij^rd of each neighbor node. Finally, graph attention aggregation is carried out with the obtained weight coefficients, realizing the optimization of the deep graph attention model by RDEdge.
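A minimal sketch of RDEdge for a dense adjacency matrix is given below; passing the sparsified matrix A_rd to the graph attention layer then plays the role of the mask of formula (18), because the unnormalized coefficients e_ij of the deleted edges are suppressed before the softmax of formula (3). The drop rate of 0.2 is an illustrative default.

```python
import torch

def rdedge(adj, p=0.2):
    """Randomly delete a proportion p of the edges of the input graph before a
    training epoch, formula (17): A_rd = A - A'. Assumes a dense, symmetric
    0/1 adjacency matrix; self-loops, if present, are kept."""
    upper = torch.triu(adj, diagonal=1)                # count each undirected edge once
    edges = upper.nonzero(as_tuple=False)              # (V, 2) list of edge indices
    n_drop = int(p * edges.size(0))                    # V * p edges to delete
    drop = edges[torch.randperm(edges.size(0))[:n_drop]]
    adj_rd = adj.clone()
    adj_rd[drop[:, 0], drop[:, 1]] = 0                 # zero both directions
    adj_rd[drop[:, 1], drop[:, 0]] = 0
    return adj_rd
```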
Regarding the over-smoothing problem of GNNs, Oono and Suzuki consider that, as the number of layers increases, the node representations eventually converge to a subspace. The theoretical analysis of the anti-over-smoothing capability of RDEdge is carried out with this concept. The following definitions are first given.

Definition 1 (subspace). Let M := {E·C | C ∈ ℝ^{M×C}} be an M-dimensional subspace of ℝ^{N×C} (N is the number of nodes, C is the dimension of the node features), where E ∈ ℝ^{N×M} is an orthogonal matrix satisfying E^T E = I_M, with M ≤ N.
In the formula: projection matrix
Figure BDA0003103535200000114
dwCharacteristic dimension for a contextual user, each row P in the projection matrixiRepresenting a unique translation of the user context.
Definition 2 (over-smoothing). Given a subspace M that is independent of the input features, if the distance from the hidden matrix H^(l) of the l-th layer to the subspace does not exceed ε (ε > 0), the node features in the GNN are said to be over-smoothed:

d_M(H^(l)) ≤ ε
Definition 3. Let the original graph be G, and let the graph obtained after RDEdge randomly deletes edges be G_rd. Given an arbitrarily small ε, suppose that G and G_rd encounter the over-smoothing problem with respect to the subspaces M and M_rd, respectively. Then, after enough edges have been deleted, the following two inequalities are satisfied:

ε_rd ≥ ε

dim(M_rd) ≥ dim(M)
According to the conclusions of Oono and Suzuki [28], deep GNNs under certain conditions suffer from the over-smoothing problem for any small value of ε, but they do not propose a corresponding solution. It is shown herein that RDEdge helps to alleviate the over-smoothing problem from two perspectives:
(1) By reducing the connections between nodes, RDEdge can slow down the convergence toward over-smoothing and raise the upper bound on the number of layers before ε-smoothing occurs.
(2) The difference (N − M) between the original space and the convergence subspace measures the amount of information loss: the larger the difference, the more serious the information loss. RDEdge can increase the dimension of the subspace and thus has the capability of reducing the information loss.
The flow of AAVGA-d is shown in FIG. 1. Before each training epoch, the RDEdge technique is used to sparsify the adjacency matrix A of the graph, obtaining A_rd. The unnormalized attention coefficients e_ij are obtained by fitting the node features X of the graph through the multi-layer graph attention layers. A_rd and e_ij undergo the masking operation to obtain e_ij^rd, which is normalized to obtain the final attention weight coefficients α_ij^rd. The subsequent graph attention aggregation yields the low-dimensional representation matrix Z of the graph. The representation matrix Z and the prior distribution p_z of the data are sampled, and the samples are used to train the discriminator. The encoder attempts to fool the discriminator, which can also be understood as training the encoder with the discriminator so that the data distribution generated by the encoder becomes closer to the true distribution. Finally, the entire model is trained using the overall loss function of AAVGA-d.
Experimental evaluation examples:
AAVGA-d is evaluated on three benchmark citation network datasets, and the effectiveness of the framework is demonstrated through the link prediction task in unsupervised graph learning. The three most popular citation datasets in graph embedding learning (Cora, Citeseer and Pubmed) are used herein to evaluate the proposed model. The statistics of the datasets are given in Table 1:
Table 1. Statistics of the three citation datasets

            Cora    Citeseer    Pubmed
Nodes       2708    3327        19717
Edges       5429    4732        44338
Features    1433    3703        500
Labels      7       6           3
The link prediction algorithm can be trained to obtain a similarity value for each pair of nodes in the network (i.e., the similarity value of an edge). The area enclosed by the ROC curve and the x-axis is regarded as a comprehensive measure, called the AUC. The AUC can be understood as the probability that the similarity value of an edge in the test set is higher than the similarity value of an edge that does not actually exist. Specifically, each time one edge is randomly selected from the test set and compared with a randomly selected non-existent edge; if the similarity value of the edge from the test set is larger than that of the non-existent edge, 1 point is added; if the two similarity values are equal, 0.5 points are added. The comparison is carried out independently n times; if in n_1 of the comparisons the similarity value of the test-set edge is greater than that of the non-existent edge, and in n_2 of the comparisons the two similarity values are equal, then AUC is defined as:

AUC = (n_1 + 0.5·n_2) / n
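The sampled comparison described above can be sketched as follows; pos_scores and neg_scores are assumed to be lists of similarity values for held-out edges and for sampled non-existent edges, and the number of trials is an illustrative default.

```python
import random

def auc_by_sampling(pos_scores, neg_scores, n=10000):
    """Estimate AUC by comparing the similarity value of a random held-out edge
    with that of a random non-existent edge, n independent times."""
    n1 = n2 = 0
    for _ in range(n):
        s_pos = random.choice(pos_scores)   # similarity of an edge from the test set
        s_neg = random.choice(neg_scores)   # similarity of a sampled non-existent edge
        if s_pos > s_neg:
            n1 += 1
        elif s_pos == s_neg:
            n2 += 1
    return (n1 + 0.5 * n2) / n
```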
Another evaluation index is the AP, which represents the area enclosed by the Precision-Recall (PR) curve and the x-axis. Precision and Recall are specified below:

Precision = TP / (all detections)

Recall = TP / (all ground truths)

where TP denotes the number of node pairs predicted to be connected that are actually connected, all detections denotes the number of node pairs predicted to be connected, and all ground truths denotes the number of node pairs actually connected.
The two indexes are main evaluation indexes of the link prediction task. The data set is divided into a training set, a validation set and a test set. The validation set contains 5% of edges for hyper-parametric optimization and the test set contains 10% of edges for performance evaluation. To ensure accuracy, 10 experiments were performed on each data set and the average of the experimental results was calculated.
To verify that the AAVGA-d framework proposed herein has competitive graph embedding capability, it is compared with six popular graph embedding algorithms:
(1) Spectral Clustering: a clustering method based on graph theory, which divides a weighted undirected graph into two or more optimal subgraphs so that the interiors of the subgraphs are as similar as possible and the distance between the subgraphs is as large as possible, in order to achieve the common clustering goal.
(2) DeepWalk: learns social representations through truncated random walks; it yields better results when the vertices of the network are few, and the method is also scalable and can adapt to changes in the network.
(3) GAE: a representative unsupervised graph embedding method, which learns an effective representation of the input graph data through encoding and decoding based on the reconstruction loss.
(4) VGAE: the encoder no longer learns a low-dimensional vector representation of each sample but learns the distribution of the sample representations; assuming the representation follows a normal distribution, samples are then drawn from the learned distribution to obtain the low-dimensional vector representation.
(5) ARGA: an adversarial mechanism is added on the basis of the graph autoencoder to ensure the consistency of the data distribution during training.
(6) ARVGA: an adversarial mechanism is introduced on the basis of the graph variational autoencoder; the true data distribution is directly sampled, and the discriminator distinguishes the distribution difference from the latent vectors obtained by the encoder.
The results of the experiment are shown in table 2:
table 2 link prediction experimental results
The results of the link prediction experiments are shown in Table 2. The method proposed herein, AAVGA-d, gives excellent results on all three datasets. Compared with AAVGA, AAVGA-d using the RDEdge technique achieves better graph embedding performance, surpassing it in AP and AUC on all three datasets. This indicates that the strategy of deepening the graph attention layers and applying the RDEdge technique against over-fitting and over-smoothing is feasible. Except for the AUC on Cora, all other indicators on the datasets exceed 94%. Compared with the other baselines, the model performs best on the Citeseer dataset, with improvements of 3.7% and 3% in AUC and AP, respectively, over VGAE, and 2.1% and 2% over ARVGA. The performance of this method is also significantly improved compared with non-encoder graph embedding methods: on the Citeseer dataset, the AUC of AAVGA-d is 14% higher than those of Spectral Clustering and DeepWalk, and the AP increases by 10% and 11.4%, respectively. The experimental results show that combining the attention mechanism and the adversarial mechanism in the graph encoder can improve the graph embedding capability.
Analytical experiments were performed using the three datasets Cora, Citeseer and Pubmed. Taking the Cora dataset as an example, the dataset consists of 2708 machine-learning papers, the number of citations among the papers reaches 5429, and 1433 distinct words are involved. From the perspective of a graph, the dataset has 2708 nodes, 1433-dimensional features and 5429 edges. The deep graph attention adversarial variational autoencoder is used to perform embedding learning on the data. First, the data are one-hot encoded to obtain the adjacency matrix A and the features X, and the features are embedded by the deep attention encoder to obtain representation vectors; the attention weight assignment mechanism fully considers the importance differences between neighboring nodes, giving larger weights to similar nodes (sharing many common words) and smaller weights to dissimilar nodes (sharing few common words) during aggregation. Second, the discriminator performs adversarial supervision of the encoder during embedding learning, forcing the obtained representation vectors to obey the real distribution of the Cora dataset, so that a more accurate embedding result is obtained. Finally, the obtained graph representation vectors are used for reconstruction to obtain a prediction matrix, in which the entries indicate whether a citation relation exists between two articles. The graph embedding performance is finally improved through the joint training of the encoder, the decoder and the discriminator.
Next, the effect of RDEdge on the model is discussed further. It has been shown above that RDEdge has an anti-over-fitting and anti-over-smoothing effect on the deep graph attention model. Specifically, in order to illustrate the improvement in model accuracy brought by the RDEdge technique, further work is carried out herein: comparing the accuracies of the three models AAVGA, AAVGA-4/8 and AAVGA-d in the link prediction experiment.
(1) AAVGA: the single-layer graph attention adversarial variational autoencoder.
(2) AAVGA-4/8: the multi-attention-layer version of AAVGA.
(3) AAVGA-d: the graph attention layers are deepened and the random edge deletion technique is applied at the same time.
In this experiment, each hyper-parameter is set the same as in the experiments above, except that AAVGA uses only a single graph attention layer on all three datasets, while AAVGA-4/8 and AAVGA-d deepen the graph attention layers to 4 layers on the Cora and Citeseer datasets and to 8 layers on Pubmed. The experimental results are shown in FIG. 2. Note that simply deepening the graph attention layers of the encoder on the Cora and Citeseer datasets results in a decrease of the AUC and AP accuracies, since deep graph attention leads to over-fitting and over-smoothing problems. Secondly, no significant degradation of experimental accuracy is seen on the Pubmed dataset; this is considered herein to be because the Pubmed dataset itself is large enough in data volume, and the graph attention model here is limited to the first-order neighbors of the nodes during aggregation, so that no significant over-fitting or over-smoothing problems occur. It is worth noting that AAVGA-d combined with the RDEdge technique performs excellently on all three datasets. This indicates, on the one hand, that the RDEdge technique does have the capability of resisting over-fitting and over-smoothing for the graph attention model; on the other hand, it also shows that appropriately deepening the number of graph attention layers can improve the graph embedding capability of the model.
In the above experiments, when the random edge deletion technique is applied to the AAVGA model, all attention layers share the same A_rd. Notably, the model can also perform RDEdge separately for each attention layer, i.e., compute an independent A_rd^(l) for each layer. This gives each attention layer its own A_rd^(l), forming diversified expressions of attention and introducing more randomness. The layer-wise random edge deletion technique (RDEdge-L) is evaluated experimentally on the Cora dataset from the perspective of the loss function. As shown in FIG. 3, AAVGA-dl, which combines the layer-wise random edge deletion technique, has a smaller training loss than AAVGA-d, but the loss curves on the two validation sets are nearly coincident and there is little difference in performance, which indicates that the layer-wise random edge deletion technique is more favorable for training.
To further verify that AAVGA-d has the ability to mitigate over-smoothing, the degree of over-smoothing is quantified herein by calculating the difference between the output of the current attention layer and the output of the next layer; the Euclidean distance is used for the difference calculation, and a smaller distance means more severe over-smoothing. The experiments are performed on the Cora dataset, using 6 graph attention layers for both AAVGA and AAVGA-d. As shown in FIG. 4(a), before training, the over-smoothing phenomenon becomes severe as the number of layers increases; however, the distance between the AAVGA-d layers is relatively large and the rate of decrease is slow. As shown in FIG. 4(b), after 400 training epochs, for AAVGA, which does not use the RDEdge technique, the difference between the 5th and 6th layers is almost zero, indicating that the hidden features have converged to a fixed point. On the contrary, the distance between the AAVGA-d layers does not decrease but shows a slowly rising trend, showing that AAVGA-d can better cope with the over-smoothing problem caused by deepening the number of layers.
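The layer-wise distance used to quantify over-smoothing can be sketched as below, assuming the stacked attention layers are available as a list and share the same output dimension; these assumptions are for illustration only.

```python
import torch

def layerwise_distances(attention_layers, X, adj):
    """Euclidean distance between the outputs of consecutive attention layers;
    a smaller distance means more severe over-smoothing."""
    dists, H, prev = [], X, None
    for layer in attention_layers:
        H = layer(H, adj)
        if prev is not None:
            dists.append(torch.norm(H - prev).item())
        prev = H
    return dists
```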
To more intuitively present the graph embedding capabilities of the AAVGA-d model and demonstrate its versatility, we will visualize graph data in this section.
First, the three processed citation datasets (Cora, Citeseer and Pubmed) are embedded using AAVGA-d. After training is complete, the graph data are passed through the deep graph attention adversarial variational autoencoder to obtain a representation matrix of the graph composed of feature vectors. To achieve two-dimensional visualization of the graph data, the dimensionality of the feature vectors is compressed using PCA. Finally, the dimensionality-reduced data are clustered using k-means++. The visualization results are shown in FIG. 5, where each color represents a category. The citation categories in each dataset are well separated, which again demonstrates the excellent graph embedding capability of the AAVGA-d model.
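A sketch of this visualization procedure (PCA to two dimensions followed by k-means++ clustering) is given below; the scikit-learn and matplotlib calls, the output file name and the plot styling are illustrative choices.

```python
from sklearn.decomposition import PCA
from sklearn.cluster import KMeans
import matplotlib.pyplot as plt

def visualize_embedding(Z, n_clusters, out_path="clusters.png"):
    """Compress the representation matrix Z (NumPy array, one row per node) to 2-D
    with PCA, cluster the reduced data with k-means++, and plot the clusters."""
    Z2 = PCA(n_components=2).fit_transform(Z)
    labels = KMeans(n_clusters=n_clusters, init="k-means++", n_init=10).fit_predict(Z2)
    plt.scatter(Z2[:, 0], Z2[:, 1], c=labels, s=5, cmap="tab10")
    plt.savefig(out_path, dpi=200)
    plt.close()
```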
The invention provides a random edge deletion technique (RDEdge) for the graph attention network, which randomly deletes edges of the original input graph during the embedding process to achieve anti-over-smoothing and anti-over-fitting effects and to allow the number of attention layers to be deepened, improving the graph embedding capability of the encoder. When facing large graphs, it is particularly necessary to deepen the number of graph attention layers.
In order to improve the embedding capability of the graph autoencoder, the invention proposes a graph attention adversarial variational autoencoder (AAVGA-d), which introduces graph attention into the encoder and uses an adversarial mechanism in the embedding training. The graph attention encoder realizes adaptive assignment of the neighbor node weights, and the adversarial regularization makes the distribution of the embedding vectors generated by the encoder close to the real distribution of the data. In order to deepen the number of attention layers, a random edge deletion technique (RDEdge) for the attention network is designed, which reduces the over-smoothing information loss caused by an overly deep number of layers.
Example two
It is an object of this embodiment to provide a computing device comprising a memory, a processor and a computer program stored on the memory and executable on the processor, the processor implementing the steps of the above method when executing the program.
EXAMPLE III
An object of the present embodiment is to provide a computer-readable storage medium.
A computer-readable storage medium, on which a computer program is stored which, when being executed by a processor, carries out the steps of the above-mentioned method.
Example four
The present embodiment is directed to a deep graph attention adversarial variational autoencoder training system, including:
a feature vector formation module configured to: obtaining a first feature vector corresponding to each node in graph data;
after the first feature vector undergoes an aggregation operation with the attention mechanism as its core, outputting a second feature vector of each node, used to form a plurality of groups of independent attention mechanisms;
an auto-encoder training module configured to: applying attention distributions to a plurality of relevant features between the central node and its neighbor nodes by aggregating the plurality of groups of independent attention mechanisms, so as to form a graph attention adversarial autoencoder;
encoding the graph data using the encoder of the graph attention adversarial autoencoder to obtain graph representation vectors;
and performing inner-product processing on the graph representation vectors with a decoder to reconstruct the graph data and predict whether a connecting edge exists between any two points in the graph.
According to the invention, the graph attention and adversarial mechanisms are introduced into the graph autoencoder; by realizing adaptive assignment of the neighbor node weights and regularization constraints on the distribution of the representation vectors during embedding, the graph attention adversarial autoencoder can embed graph data well. Meanwhile, a random edge deletion technique (RDEdge) is designed for the graph attention model, which helps to deepen the number of attention layers of the graph attention adversarial variational autoencoder and well solves the over-fitting and over-smoothing problems caused by deep graph attention models. RDEdge randomly discards a certain proportion of the edges in the graph, so that the input data undergo random deformation, increasing the diversity of the data to counter over-fitting; message passing is reduced during attention aggregation to mitigate over-smoothing. After combining the RDEdge technique, a deep graph attention adversarial variational autoencoder (AAVGA-d) is developed, which is a deep graph attention encoder that considers the importance differences between nodes and incorporates an adversarial mechanism, ensuring the consistency between the distribution of the graph representation vectors obtained by the encoder and the prior distribution. Deepening the number of attention layers further improves the graph embedding capability of the encoder.
When calculating the attention weights of the neighbor nodes, the graph attention encoder restricts the computation to the first-order neighbors of each node, which greatly reduces the computational complexity and avoids more serious over-smoothing problems to a certain extent.
The steps involved in the apparatuses of the above second, third and fourth embodiments correspond to the first embodiment of the method, and the detailed description thereof can be found in the relevant description of the first embodiment. The term "computer-readable storage medium" should be taken to include a single medium or multiple media containing one or more sets of instructions; it should also be understood to include any medium that is capable of storing, encoding or carrying a set of instructions for execution by a processor and that cause the processor to perform any of the methods of the present disclosure.
Those skilled in the art will appreciate that the modules or steps of the present disclosure described above can be implemented using general purpose computer means, or alternatively, they can be implemented using program code executable by computing means, whereby the modules or steps may be stored in memory means for execution by the computing means, or separately fabricated into individual integrated circuit modules, or multiple modules or steps thereof may be fabricated into a single integrated circuit module. The present disclosure is not limited to any specific combination of hardware and software.
The above description is only a preferred embodiment of the present disclosure and is not intended to limit the present disclosure, and various modifications and changes may be made to the present disclosure by those skilled in the art. Any modification, equivalent replacement, improvement and the like made within the spirit and principle of the present disclosure should be included in the protection scope of the present disclosure.
Although the present disclosure has been described with reference to specific embodiments, it should be understood that the scope of the present disclosure is not limited thereto, and those skilled in the art will appreciate that various modifications and changes can be made without departing from the spirit and scope of the present disclosure.

Claims (10)

1. A deep graph attention adversarial variational autoencoder training method, characterized by comprising the following steps:
obtaining a first feature vector corresponding to each node in graph data;
after the first feature vector undergoes an aggregation operation with the attention mechanism as its core, outputting a second feature vector of each node, used to form a plurality of groups of independent attention mechanisms;
applying attention distributions to a plurality of relevant features between the central node and its neighbor nodes by aggregating the plurality of groups of independent attention mechanisms, so as to form a graph attention adversarial autoencoder;
encoding the graph data using the encoder of the graph attention adversarial autoencoder to obtain graph representation vectors;
and performing inner-product processing on the graph representation vectors with a decoder to reconstruct the graph data and predict whether a connecting edge exists between any two points in the graph.
2. The deep graph attention adversarial variational autoencoder training method of claim 1, wherein the step of obtaining the second feature vector of each node comprises:
setting weight coefficients of adjacent nodes;
selecting a single fully connected layer as the correlation function;
normalizing the correlation calculations of all the neighbors to obtain a weight coefficient for each neighbor node;
and after the weight coefficients are obtained, obtaining the second feature vector of the node according to the weighted-summation strategy of the attention mechanism.
3. The deep graph attention adversarial variational autoencoder training method of claim 1, wherein graph attention aggregation is carried out with the obtained weight coefficients, realizing the optimization of the graph attention adversarial autoencoder by the random edge deletion technique.
4. The deep graph attention adversarial variational autoencoder training method of claim 1, wherein in the graph attention aggregation process, the neighbor nodes are identified by means of the adjacency matrix, and after the graph attention layer obtains the unnormalized attention coefficients, a masking operation is performed.
5. The deep graph attention adversarial variational autoencoder training method of claim 1, wherein the graph attention adversarial autoencoder is combined with a random edge deletion technique, the combination being called the deep graph attention adversarial variational autoencoder, and the random edge deletion technique randomly deletes a certain proportion of the edges of the input graph in each training epoch.
6. The deep graph attention adversarial variational autoencoder training method of claim 1, wherein the random edge deletion technique randomly deletes a certain proportion of the edges of the input graph, specifically: V·p non-zero elements of the adjacency matrix A are randomly reset to zero, where V is the total number of edges and p is the deletion rate.
7. The deep graph attention adversarial variational autoencoder training method of claim 1, wherein the encoder in the graph attention adversarial autoencoder serves as the generator of the adversarial network, and the generator deceives the discriminator by generating fake data, wherein the fake data refers to the latent variables obtained by the encoder through encoding of the graph data;
the task of the discriminator is to distinguish whether a sample comes from the true data or from the generator; the discriminator judges the data output from the prior distribution p_z as positive and the data output from the latent variables as negative.
8. Deep map attention confrontation variational automatic encoder training system, characterized by, includes:
a feature vector formation module configured to: obtaining a first feature vector corresponding to each node in graph data;
performing, on the first feature vectors, an aggregation operation with the attention mechanism as its core, and outputting a second feature vector for each node, the second feature vectors being used to form a plurality of groups of independent attention mechanisms;
an auto-encoder training module configured to: summarizing the plurality of groups of independent attention mechanisms so that attention is distributed over a plurality of relevant features between a central node and its neighbor nodes, thereby forming a graph attention confrontation auto-encoder;
encoding the graph data with the encoder of the graph attention confrontation auto-encoder to obtain graph representation vectors;
and performing inner-product processing on the graph representation vectors with the decoder to reconstruct the graph data, thereby predicting whether a connecting edge exists between any two nodes in the graph.
9. A computing device comprising a memory, a processor and a computer program stored on the memory and executable on the processor, wherein the processor when executing the program performs the steps of the method of any one of claims 1 to 7.
10. A computer-readable storage medium, on which a computer program is stored which, when being executed by a processor, is adapted to carry out the steps of the method according to any one of the preceding claims 1 to 7.
CN202110630525.3A 2021-06-07 2021-06-07 Deep map attention confrontation variational automatic encoder training method and system Pending CN113361606A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202110630525.3A CN113361606A (en) 2021-06-07 2021-06-07 Deep map attention confrontation variational automatic encoder training method and system

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202110630525.3A CN113361606A (en) 2021-06-07 2021-06-07 Deep map attention confrontation variational automatic encoder training method and system

Publications (1)

Publication Number Publication Date
CN113361606A true CN113361606A (en) 2021-09-07

Family

ID=77532677

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202110630525.3A Pending CN113361606A (en) 2021-06-07 2021-06-07 Deep map attention confrontation variational automatic encoder training method and system

Country Status (1)

Country Link
CN (1) CN113361606A (en)

Cited By (12)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN114004859A (en) * 2021-11-26 2022-02-01 山东大学 Method and system for segmenting echocardiography left atrium map based on multi-view fusion network
CN114580252A (en) * 2022-05-09 2022-06-03 山东捷瑞数字科技股份有限公司 Graph neural network simulation method and system for fluid simulation
CN115271033A (en) * 2022-07-05 2022-11-01 西南财经大学 Medical image processing model construction and processing method based on federal knowledge distillation
CN115271033B (en) * 2022-07-05 2023-11-21 西南财经大学 Medical image processing model construction and processing method based on federal knowledge distillation
CN116108375A (en) * 2022-12-19 2023-05-12 南京理工大学 Graph classification method based on structure sensitive graph dictionary embedding
CN116108375B (en) * 2022-12-19 2023-08-01 南京理工大学 Graph classification method based on structure sensitive graph dictionary embedding
CN116738201A (en) * 2023-02-17 2023-09-12 云南大学 Illegal account identification method based on graph comparison learning
CN116738201B (en) * 2023-02-17 2024-01-16 云南大学 Illegal account identification method based on graph comparison learning
CN116126947A (en) * 2023-04-18 2023-05-16 西昌学院 Big data analysis method and system applied to enterprise management system
CN116126947B (en) * 2023-04-18 2023-06-30 西昌学院 Big data analysis method and system applied to enterprise management system
CN116584951A (en) * 2023-04-23 2023-08-15 山东省人工智能研究院 Electrocardiosignal detection and positioning method based on weak supervision learning
CN116584951B (en) * 2023-04-23 2023-12-12 山东省人工智能研究院 Electrocardiosignal detection and positioning method based on weak supervision learning

Similar Documents

Publication Publication Date Title
CN113361606A (en) Deep map attention confrontation variational automatic encoder training method and system
Kuo et al. Green learning: Introduction, examples and outlook
CN112529168B (en) GCN-based attribute multilayer network representation learning method
Natesan Ramamurthy et al. Model agnostic multilevel explanations
Chen et al. An efficient network behavior anomaly detection using a hybrid DBN-LSTM network
Zhang et al. Interpreting and boosting dropout from a game-theoretic view
CN113919441A (en) Classification method based on hypergraph transformation network
CN113065649B (en) Complex network topology graph representation learning method, prediction method and server
Li et al. Embedded stacked group sparse autoencoder ensemble with L1 regularization and manifold reduction
CN115168443A (en) Anomaly detection method and system based on GCN-LSTM and attention mechanism
CN115496144A (en) Power distribution network operation scene determining method and device, computer equipment and storage medium
CN111782804A (en) TextCNN-based same-distribution text data selection method, system and storage medium
Sun et al. Optimization of classifier chains via conditional likelihood maximization
CN111291810A (en) Information processing model generation method based on target attribute decoupling and related equipment
CN112329918A (en) Anti-regularization network embedding method based on attention mechanism
Wickramasinghe et al. Deep self-organizing maps for visual data mining
CN117272195A (en) Block chain abnormal node detection method and system based on graph convolution attention network
CN114596464A (en) Multi-feature interactive unsupervised target detection method and system, electronic device and readable storage medium
Mitchell et al. Deep learning using partitioned data vectors
Xiang et al. Efficient learning-based community-preserving graph generation
Tosun et al. Assessing diffusion of spatial features in deep belief networks
Choi et al. Deep learning price momentum in US equities
Fonseca et al. A genetic algorithm assisted by a locally weighted regression surrogate model
Ling Score prediction of sports events based on parallel self-organizing nonlinear neural network
CN117272119B (en) User portrait classification model training method, user portrait classification method and system

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination