CN117272017A - Training method for heterogeneous graph data node embedded feature extraction model, embedded feature extraction method, node classification method and device


Info

Publication number
CN117272017A
Authority
CN
China
Prior art keywords
client
node
local
neighbor
heterogeneous
Prior art date
Legal status
Pending
Application number
CN202311071070.1A
Other languages
Chinese (zh)
Inventor
杜军平
王嘉
邵蓥侠
李昂
余今
管泽礼
Current Assignee
Beijing University of Posts and Telecommunications
Original Assignee
Beijing University of Posts and Telecommunications
Priority date
Filing date
Publication date
Application filed by Beijing University of Posts and Telecommunications
Priority to CN202311071070.1A
Publication of CN117272017A


Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00Pattern recognition
    • G06F18/20Analysing
    • G06F18/21Design or setup of recognition systems or techniques; Extraction of features in feature space; Blind source separation
    • G06F18/213Feature extraction, e.g. by transforming the feature space; Summarisation; Mappings, e.g. subspace methods
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00Pattern recognition
    • G06F18/20Analysing
    • G06F18/24Classification techniques
    • G06F18/241Classification techniques relating to the classification model, e.g. parametric or non-parametric approaches
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00Pattern recognition
    • G06F18/20Analysing
    • G06F18/25Fusion techniques
    • G06F18/253Fusion techniques of extracted features
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/02Neural networks
    • G06N3/04Architecture, e.g. interconnection topology
    • G06N3/042Knowledge-based neural networks; Logical representations of neural networks
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/02Neural networks
    • G06N3/04Architecture, e.g. interconnection topology
    • G06N3/045Combinations of networks
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/02Neural networks
    • G06N3/04Architecture, e.g. interconnection topology
    • G06N3/0475Generative networks
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/02Neural networks
    • G06N3/08Learning methods
    • G06N3/094Adversarial learning
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/02Neural networks
    • G06N3/08Learning methods
    • G06N3/098Distributed learning, e.g. federated learning
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06VIMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V10/00Arrangements for image or video recognition or understanding
    • G06V10/40Extraction of image or video features
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06VIMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V10/00Arrangements for image or video recognition or understanding
    • G06V10/70Arrangements for image or video recognition or understanding using pattern recognition or machine learning
    • G06V10/764Arrangements for image or video recognition or understanding using pattern recognition or machine learning using classification, e.g. of video objects
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06VIMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V10/00Arrangements for image or video recognition or understanding
    • G06V10/70Arrangements for image or video recognition or understanding using pattern recognition or machine learning
    • G06V10/77Processing image or video features in feature spaces; using data integration or data reduction, e.g. principal component analysis [PCA] or independent component analysis [ICA] or self-organising maps [SOM]; Blind source separation
    • G06V10/774Generating sets of training patterns; Bootstrap methods, e.g. bagging or boosting
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06VIMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V10/00Arrangements for image or video recognition or understanding
    • G06V10/70Arrangements for image or video recognition or understanding using pattern recognition or machine learning
    • G06V10/77Processing image or video features in feature spaces; using data integration or data reduction, e.g. principal component analysis [PCA] or independent component analysis [ICA] or self-organising maps [SOM]; Blind source separation
    • G06V10/80Fusion, i.e. combining data from various sources at the sensor level, preprocessing level, feature extraction level or classification level
    • G06V10/806Fusion, i.e. combining data from various sources at the sensor level, preprocessing level, feature extraction level or classification level of extracted features
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06VIMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V10/00Arrangements for image or video recognition or understanding
    • G06V10/70Arrangements for image or video recognition or understanding using pattern recognition or machine learning
    • G06V10/82Arrangements for image or video recognition or understanding using pattern recognition or machine learning using neural networks

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Evolutionary Computation (AREA)
  • General Physics & Mathematics (AREA)
  • Artificial Intelligence (AREA)
  • Data Mining & Analysis (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Software Systems (AREA)
  • General Health & Medical Sciences (AREA)
  • Computing Systems (AREA)
  • Health & Medical Sciences (AREA)
  • General Engineering & Computer Science (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Multimedia (AREA)
  • Biomedical Technology (AREA)
  • Computational Linguistics (AREA)
  • Biophysics (AREA)
  • Molecular Biology (AREA)
  • Mathematical Physics (AREA)
  • Medical Informatics (AREA)
  • Databases & Information Systems (AREA)
  • Evolutionary Biology (AREA)
  • Bioinformatics & Computational Biology (AREA)
  • Bioinformatics & Cheminformatics (AREA)
  • Image Analysis (AREA)

Abstract

The invention provides a heterogeneous graph data node embedded feature extraction model training method, an embedded feature extraction method, a node classification method and a device. Following a federated learning paradigm, a generator and a discriminator are constructed locally at each client to train the local model through adversarial learning, and model parameters are aggregated at a central server, so that data privacy is effectively protected. Meanwhile, during local training, each client obtains through cross-client communication the original features of the cross-client neighbor nodes associated with the target node, and neighbors of all types, sampled fairly by top-k random walk, participate in embedding learning. This reduces the influence of inter-client data differences on training, improves the learning quality of the federated heterogeneous graph neural network, and improves the execution efficiency of downstream tasks.

Description

Training method for heterogeneous graph data node embedded feature extraction model, embedded feature extraction method, node classification method and device
Technical Field
The invention relates to the technical field of data processing, in particular to a heterogeneous graph data node embedded feature extraction model training method, an embedded feature extraction method, a node classification method and a device.
Background
With the development of information technology, graph data are becoming increasingly common in the real world. In particular, heterogeneous information networks, abbreviated as heterogeneous graphs, contain multiple types of nodes and edges, can conveniently model real-life data networks, and have attracted wide attention in fields such as citation networks, social networks, and recommendation systems. Because heterogeneous graph data carries very rich semantic and structural information, heterogeneous graph mining algorithms capable of fully extracting these features have likewise received wide attention, and heterogeneous graph embedding has become a powerful tool, with many research results successfully applied in real systems such as recommendation, text analysis, and network security.
A heterogeneous graph neural network can analyze and process heterogeneous graph data, but owing to privacy-protection requirements, the heterogeneous graph data held by different parties in a network is not disclosed, which causes the data island problem. For example, when social heterogeneous network data from multiple platforms is used to analyze public safety emergencies, a trusted third party would need to consume a large amount of storage space and transmission time to aggregate the massive data from all participants, and directly transmitting sensitive information related to safety events also raises high-risk privacy issues; therefore, neither the traditional centralized computing paradigm nor the traditional distributed computing paradigm can achieve the training objective while guaranteeing data privacy. Unlike other forms of private data, heterogeneous graph data also has association relations that cross clients, so a heterogeneous graph neural network trained locally by traditional federated learning methods loses this cross-client information.
Disclosure of Invention
In view of this, embodiments of the invention provide a heterogeneous graph data node embedded feature extraction model training method, an embedded feature extraction method, a node classification method and a device, so as to eliminate or mitigate one or more defects in the prior art and solve the problem that the prior art cannot account for cross-client information during multi-party local training.
In one aspect, the present invention provides a heterogeneous graph data node embedded feature extraction model training method, executed based on a plurality of interconnected clients and a central server, where each client holds a heterogeneous subgraph of part of the heterogeneous graph data as private data and each client further deploys a cross-client sampler, the method comprising the following steps:
initializing global model parameters of a global generator and a global discriminator by the central server;
in one training round, the steps performed include:
the central server sends the global model parameters to each client, and a local generator and a local discriminator are constructed; the local generator outputs a negative-sample embedded feature of a target node, with noise data added, based on the original features and weights of the target node's intra-client neighbor nodes in the locally held heterogeneous subgraph and on the original features and weights of the target node's cross-client neighbor nodes provided by the associated clients; the original features are feature vectors preliminarily extracted according to the data types of the intra-client neighbor nodes and the cross-client neighbor nodes;
the cross-client sampler local to each client samples the intra-client neighbor nodes and cross-client neighbor nodes of the target node based on a top-k random walk algorithm, and a positive-sample embedded feature is computed by aggregating the original features and weights of the sampled intra-client neighbor nodes and cross-client neighbor nodes;
the local discriminator performs a task of discriminating the neighbor node type of the target node and a task of identifying positive and negative samples, based on the negative-sample embedded feature and the positive-sample embedded feature;
based on adversarial learning, the local generator updates its parameters with a loss constructed to minimize the local discriminator's success rate in identifying the negative-sample embedded feature, and the local discriminator updates its parameters with a loss constructed to maximize its success rate on the neighbor-node-type discrimination task and the positive/negative sample identification task;
each client sends the updated parameters of the local generator and the local discriminator to the central server for parameter aggregation;
multiple training rounds are executed according to a set condition, updating the global generator, the global discriminator, the local generators and the local discriminators;
a target embedded feature extraction model is constructed on each client; the target embedded feature extraction model uses the local generator of the corresponding client to extract a first embedded feature of a designated target node from the held heterogeneous subgraph, and aggregates the first embedded feature with the original features of the designated target node's intra-client neighbor nodes and cross-client neighbor nodes to obtain the embedded feature of the designated target node.
In some embodiments, the global generator and the local generator employ structurally identical multi-layer perceptrons, and the global discriminator and the local discriminator employ structurally identical multi-layer perceptrons.
In some embodiments, the local generator performs parameter updating with a loss constructed to minimize the local discriminator's success rate in identifying the negative-sample embedded feature, where the constructed loss function expression is:

$$\mathcal{L}_{G_i} = \sum_{u \in V_i} \sum_{r \in R} \mathbb{E}_{h'_v \sim G_i\left(u, r \mid HG_i;\, \theta_{G_i}\right)} \left[ -\log D_i\!\left(u, r, h'_v\right) \right]$$

where u represents a given node, u ∈ V_i, and V_i represents the local node set of the i-th client; v represents the target node; r represents the neighbor node type, r ∈ R, and R represents the edge (relation) set; h'_v represents the output of the i-th client's generator after noise data is added; HG_i represents the heterogeneous subgraph held by the i-th client; G_i represents the local generator of the i-th client; θ_{G_i} represents the parameters of G_i; and D_i represents the local discriminator of the i-th client.
In some embodiments of the method, the local discriminator performs the task of discriminating the neighbor node type of the target node, and the probability that the neighbor relation between the target node v and a neighbor node u is r is computed as:

$$D_i\!\left(u, r, h_v;\, \theta_{D_i}\right) = \frac{1}{1 + \exp\!\left(-h_u^{\top} M_r^{D}\, h_v\right)}$$

where u represents a given node, u ∈ V_i, and V_i represents the local node set of the i-th client; v represents the target node; r represents the neighbor node type, r ∈ R, and R represents the edge (relation) set; h_u^T represents the transpose of the embedded feature of node u; M_r^D represents the weight corresponding to the neighbor relation r; h_v represents the embedded feature of the target node v; and θ_{D_i} represents the parameters of the i-th client's local discriminator;

the local discriminator outputs the discrimination result of the neighbor node type by computing the probability that the neighbor relation between the target node and the neighbor node is r.
In some embodiments, the local discriminator performs parameter updating with a loss constructed to maximize its success rate on the neighbor-node-type discrimination task and the positive/negative sample identification task, where the loss function of the positive/negative sample identification task is:

$$\mathcal{L}_{D_i}^{1} = \mathbb{E}_{\langle u, r, v\rangle \sim HG_i}\!\left[-\log D_i\!\left(u, r, h_v\right)\right] + \mathbb{E}_{h'_v \sim G_i\left(\cdot;\, \theta_{G_i}\right)}\!\left[-\log\!\left(1 - D_i\!\left(u, r, h'_v\right)\right)\right]$$

where u represents a given node, u ∈ V_i, and V_i represents the local node set of the i-th client; v represents the target node; r represents the neighbor node type, r ∈ R, and R represents the edge (relation) set; h'_v represents the output of the i-th client's generator after noise data is added; h_v represents the embedded feature of the target node v; HG_i represents the heterogeneous subgraph held by the i-th client; G_i represents the local generator of the i-th client; θ_{G_i} represents the parameters of G_i; and D_i represents the local discriminator of the i-th client;

the loss function of the neighbor-node-type discrimination task is:

$$\mathcal{L}_{D_i}^{2} = \mathbb{E}_{\langle u, r, v\rangle \sim HG_i,\; r' \in R \setminus \{r\}}\!\left[-\log\!\left(1 - D_i\!\left(u, r', h_v\right)\right)\right]$$

where r' represents an error relation between node u and the target node v, r' ∈ R\{r};

the joint loss adopted by the local discriminator for parameter updating is:

$$\mathcal{L}_{D_i} = \mathcal{L}_{D_i}^{1} + \mathcal{L}_{D_i}^{2}$$
In some embodiments of the method, the original features of the intra-client neighbor nodes and the cross-client neighbor nodes are extracted as follows:
if the data type is text, a bag-of-words vector is used as the original feature;
if the data type is an image, a pixel-value vector is used directly as the original feature, or the original feature is extracted by a pre-trained neural network.
In another aspect, the invention provides a sub-graph level federated heterogeneous node embedded feature extraction method, executed based on a plurality of interconnected clients and a central server, where each client holds a heterogeneous subgraph of part of the heterogeneous graph data as private data; for a designated target node of a designated client, the target embedded feature is extracted by the target embedded feature extraction model of the above heterogeneous graph data node embedded feature extraction model training method deployed on the designated client.
In another aspect, the invention further provides a sub-graph level federated heterogeneous node classification method, executed based on a plurality of interconnected clients and a central server, where each client holds a heterogeneous subgraph of part of the heterogeneous graph data as private data; for a designated target node of a designated client, the designated client obtains the target embedded feature of the designated target node using the above sub-graph level federated heterogeneous node embedded feature extraction method;
the target embedded feature is then input into a pre-trained logistic regression model to perform node classification.
In another aspect, the invention further provides a heterogeneous graph data management system comprising a plurality of interconnected clients and a central server, where each client holds a heterogeneous subgraph of part of the heterogeneous graph data as private data, and the clients and the central server execute the steps of the above method.
In another aspect, the present invention also provides a computer readable storage medium having stored thereon a computer program which when executed by a processor performs the steps of the above method.
The invention has the advantages that:
according to the heterogeneous graph data node embedded feature extraction model training method, embedded feature extraction method, node classification method and device, a generator and a discriminator are constructed locally at each client under a federated learning paradigm to train the local model through adversarial learning, and model parameters are aggregated at the central server, so that data privacy is effectively protected. Meanwhile, during local training, each client obtains through cross-client communication the original features of the cross-client neighbor nodes associated with the target node, and neighbors of all types, sampled fairly by top-k random walk, participate in embedding learning; this reduces the influence of inter-client data differences on training, improves the learning quality of the federated heterogeneous graph neural network, and improves the execution efficiency of downstream tasks.
Additional advantages, objects, and features of the invention will be set forth in part in the description which follows and in part will become apparent to those having ordinary skill in the art upon examination of the following or may be learned from practice of the invention. The objectives and other advantages of the invention may be realized and attained by the structure particularly pointed out in the written description and drawings.
It will be appreciated by those skilled in the art that the objects and advantages that can be achieved with the present invention are not limited to the above-described specific ones, and that the above and other objects that can be achieved with the present invention will be more clearly understood from the following detailed description.
Drawings
The accompanying drawings, which are included to provide a further understanding of the invention and are incorporated in and constitute a part of this application, illustrate and together with the description serve to explain the invention. In the drawings:
Fig. 1 is an overall structural diagram of the heterogeneous graph data node embedded feature extraction model training method according to an embodiment of the present invention.
Fig. 2 shows the heterogeneous graph adversarial network on a local client in the heterogeneous graph data node embedded feature extraction model training method according to an embodiment of the present invention.
Fig. 3 is a schematic diagram of aggregating neighbor node features to obtain target node features in the heterogeneous graph data node embedded feature extraction model training method according to an embodiment of the present invention.
Detailed Description
The present invention will be described in further detail with reference to the following embodiments and the accompanying drawings, in order to make the objects, technical solutions and advantages of the present invention more apparent. The exemplary embodiments of the present invention and the descriptions thereof are used herein to explain the present invention, but are not intended to limit the invention.
It should be noted here that, in order to avoid obscuring the present invention due to unnecessary details, only structures and/or processing steps closely related to the solution according to the present invention are shown in the drawings, while other details not greatly related to the present invention are omitted.
It should be emphasized that the term "comprises/comprising" when used herein is taken to specify the presence of stated features, elements, steps or components, but does not preclude the presence or addition of one or more other features, elements, steps or components.
It is also noted herein that the term "coupled" may refer to not only a direct connection, but also an indirect connection in which an intermediate is present, unless otherwise specified.
Hereinafter, embodiments of the present invention will be described with reference to the accompanying drawings. In the drawings, the same reference numerals represent the same or similar components, or the same or similar steps.
Heterogeneous graph data refers to a graph data structure that contains multiple types of nodes and multiple types of edges. Conventional graph data is typically homogeneous, i.e., there is only one type of node and one type of edge in the graph, whereas heterogeneous graph data is more complex and can represent richer relationships and information. In heterogeneous graph data, nodes and edges may be of different types: each node type may represent a different entity or concept, while different edge types may represent different kinds of relationships. For example, heterogeneous graph data in a social network may include user nodes, post nodes, and comment nodes, as well as various types of edges between users and posts, between users and comments, and so on.
A heterogeneous graph neural network (Heterogeneous Graph Neural Network, HGNN) is a graph neural network model capable of processing heterogeneous graph data. Compared with conventional graph neural networks, an HGNN can handle the different node types and relation types in a heterogeneous graph, thereby better expressing the complex structure and characteristics of the graph data. Specifically, an HGNN learns a feature representation of each node under different types of relations by integrating and passing information between different node types and relation types.
In practical application scenarios, when a heterogeneous graph neural network is used to represent heterogeneous graph data and execute downstream tasks, the global heterogeneous graph is partitioned across different clients. Some connection relations are difficult to include in the training of any single client because their two endpoints are located on different clients, so federated learning can be introduced to perform model training with privacy protection. However, these cross-client edges pose great challenges to sub-graph level heterogeneous graph neural networks in the federated scenario: when neighbor nodes and their corresponding edges are not local to a client, training local to that client cannot utilize this cross-client information, resulting in information loss.
The prior art mainly studies the application of homogeneous graphs in the federated scenario. Some approaches directly treat graph data as Euclidean data, do not consider the existence of cross-client information, and process the graph data with common federated algorithms; others consider exchanging this portion of information directly between clients via privacy-preserving communication, enabling the graph neural network to utilize more comprehensive topology information. When the federated graph information to be processed is a heterogeneous graph, a federated homogeneous graph processing approach cannot account for the influence of the various node and edge types on the federated graph neural network, so rich structural information is lost and the advantage of heterogeneous graph data is forfeited.
A distributed data storage and computing environment makes it impossible for federated learning to evaluate and guarantee the data quality of each data holder, and the structural heterogeneity of a heterogeneous graph makes it more vulnerable to threats to data robustness than Euclidean data or even a homogeneous graph. The prior art discusses either the robustness of Euclidean data in the federated scenario or the robustness of heterogeneous graphs under centralized training; ignoring the combined effect of the heterogeneity of the distributed environment and the structural heterogeneity of the heterogeneous graph can amplify the interference of noise on a model.
The present invention combines federated learning with the heterogeneous graph neural network, and considers, on the one hand, how to better protect the structural information of the heterogeneous graph, as non-Euclidean data, in federated learning, and on the other hand, the influence of the data heterogeneity problem, which is central to federated learning, on the robustness of the federated heterogeneous graph algorithm.
Specifically, the invention provides a heterogeneous graph data node embedded feature extraction model training method, executed based on a plurality of interconnected clients and a central server, where each client holds a heterogeneous subgraph of part of the heterogeneous graph data as private data and each client further deploys a cross-client sampler; referring to fig. 1, the method comprises the following steps:
Step S101: the global model parameters of the global generator and the global discriminator are initialized by the central server.
Step S102: in one training round, the executed steps include steps S1021 to S1024:
step S1021: the central server sends the global model parameters to each client, and a local generator and a local discriminator are constructed; the local generator outputs a negative-sample embedded feature of the target node, with noise data added, based on the original features and weights of the target node's intra-client neighbor nodes in the locally held heterogeneous subgraph and on the original features and weights of the target node's cross-client neighbor nodes provided by the associated clients; the original features are feature vectors preliminarily extracted according to the data types of the intra-client neighbor nodes and the cross-client neighbor nodes.
Step S1022: the cross-client sampler local to the client samples the intra-client neighbor nodes and cross-client neighbor nodes of the target node based on a top-k random walk algorithm, and the positive-sample embedded feature is computed by aggregating the original features and weights of the sampled intra-client neighbor nodes and cross-client neighbor nodes.
Step S1023: the local discriminator performs the task of discriminating the neighbor node type of the target node and the task of identifying positive and negative samples, based on the negative-sample embedded feature and the positive-sample embedded feature; based on adversarial learning, the local generator updates its parameters with a loss constructed to minimize the local discriminator's success rate in identifying the negative-sample embedded feature, and the local discriminator updates its parameters with a loss constructed to maximize its success rate on the neighbor-node-type discrimination task and the positive/negative sample identification task.
Step S1024: each client sends the updated parameters of the local generator and the local discriminator to the central server for parameter aggregation.
Step S103: multiple training rounds are executed according to a set condition, updating the global generator, the global discriminator, the local generators and the local discriminators.
Step S104: a target embedded feature extraction model is constructed on each client; the model uses the local generator of the corresponding client to extract a first embedded feature of the designated target node from the held heterogeneous subgraph, and aggregates the first embedded feature with the original features of the designated target node's intra-client neighbor nodes and cross-client neighbor nodes to obtain the embedded feature of the designated target node.
In steps S101 and S102, in order to obtain a model capable of effectively extracting node embedded features, model training is performed in the form of adversarial learning. The global generator and the local generators adopt structurally identical multi-layer perceptrons, and the global discriminator and the local discriminators adopt structurally identical multi-layer perceptrons. The model parameters of the global generator and the global discriminator are randomly initialized.
In steps S1021 to S1024, following the federated learning process, the initialized global generator and global discriminator are distributed by the central server to each client to form the local generator and local discriminator, and each client trains its local generator and local discriminator through adversarial learning using the locally held heterogeneous subgraph. After each training round, each client sends the parameters of the local generator and local discriminator to the central server for aggregation; specifically, the FedAvg algorithm can be used to perform the parameter aggregation operation.
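As an illustrative sketch of this aggregation step, the following Python code shows a data-size-weighted FedAvg average and one communication round; the local_train callback and the num_nodes attribute are hypothetical placeholders for the client-side training described above.

```python
import numpy as np

def fedavg(client_params, client_sizes):
    """Weighted average of per-client parameter dicts (the FedAvg rule)."""
    total = float(sum(client_sizes))
    return {
        name: sum((n / total) * p[name] for n, p in zip(client_sizes, client_params))
        for name in client_params[0]
    }

def training_round(global_params, clients, local_train):
    """One round: broadcast global parameters, train locally, aggregate."""
    updated = [local_train(c, dict(global_params)) for c in clients]
    sizes = [c.num_nodes for c in clients]  # assumed client attribute
    return fedavg(updated, sizes)
```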
In steps S1021 to S1022, the local generator performs the extraction of embedded features: given the heterogeneous subgraph, the local generator generates a fake embedded feature of the target node as the negative-sample embedded feature, with noise data added, in order to fool the discriminator as much as possible. The noise data may be Gaussian noise. In this process, the local generator synchronously introduces the original features and weights of the intra-client neighbor nodes local to the client and the original features and weights of the cross-client neighbor nodes. Based on the known node association relations in the heterogeneous graph data, the original features and weights of the target node's neighbor nodes residing on other clients are transferred through communication between the clients.
In some embodiments, the original features of the intra-client neighbor nodes and the cross-client neighbor nodes are extracted as follows:
if the data type is text, a bag-of-words vector is used as the original feature; if the data type is an image, a pixel-value vector is used directly as the original feature, or the original feature is extracted by a pre-trained neural network.
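For illustration, a minimal sketch of this preliminary feature extraction, assuming scikit-learn for the bag-of-words step; the function names are illustrative, not the patent's:

```python
import numpy as np
from sklearn.feature_extraction.text import CountVectorizer

def text_original_features(corpus):
    """Bag-of-words count vectors as the original features of text nodes."""
    return CountVectorizer().fit_transform(corpus).toarray().astype(np.float32)

def image_original_features(images):
    """Flattened pixel-value vectors as the original features of image nodes;
    a pre-trained CNN encoder could be substituted here instead."""
    return np.stack([img.reshape(-1) for img in images]).astype(np.float32)

features = text_original_features(["graph neural network", "federated learning"])
print(features.shape)  # (2, vocabulary size)
```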
Meanwhile, the cross-client sampler samples the intra-client neighbor nodes and cross-client neighbor nodes of the target node based on a top-k random walk algorithm, and the embedded feature of the target node is computed directly by aggregating the corresponding original features and weights. Specifically, referring to fig. 3, based on the category of each neighbor node, the feature vector of the neighbor node is mapped by the corresponding weight matrix into the space of the target node's features and then aggregated. No noise data is added in this part, so the result serves as the positive-sample embedding.
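The per-relation projection and aggregation of fig. 3 could look roughly as follows; the data layout and the mean aggregation are illustrative assumptions of this sketch:

```python
import numpy as np

def aggregate_positive_embedding(sampled_neighbors, relation_weights):
    """Map each sampled neighbor's original feature into the target node's
    feature space with the weight matrix of its relation type, then average.

    sampled_neighbors: list of (relation_type, feature_vector) pairs.
    relation_weights:  dict relation_type -> (d_out, d_in) weight matrix.
    """
    projected = [relation_weights[r] @ x for r, x in sampled_neighbors]
    return np.mean(projected, axis=0)  # no noise is added: positive sample
```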
In step S1023, the discriminator performs the task of distinguishing positive samples (without added noise data) from negative samples (with added noise data), and the task of discriminating the type of the neighbor node associated with the target node.
In the adversarial learning process, the local generator performs parameter updating with a loss constructed to minimize the local discriminator's success rate in identifying the negative-sample embedded feature, where the constructed loss function expression is:

$$\mathcal{L}_{G_i} = \sum_{u \in V_i} \sum_{r \in R} \mathbb{E}_{h'_v \sim G_i\left(u, r \mid HG_i;\, \theta_{G_i}\right)} \left[ -\log D_i\!\left(u, r, h'_v\right) \right]$$

where u represents a given node, u ∈ V_i, and V_i represents the local node set of the i-th client; v represents the target node; r represents the neighbor node type, r ∈ R, and R represents the edge (relation) set; h'_v represents the output of the i-th client's generator after noise data is added; HG_i represents the heterogeneous subgraph held by the i-th client; G_i represents the local generator of the i-th client; θ_{G_i} represents the parameters of G_i; and D_i represents the local discriminator of the i-th client.
The local discriminator executes the task of discriminating the neighbor node type of the target node, and the probability that the neighbor relation between the target node v and a neighbor node u is r is computed as:

$$D_i\!\left(u, r, h_v;\, \theta_{D_i}\right) = \frac{1}{1 + \exp\!\left(-h_u^{\top} M_r^{D}\, h_v\right)}$$

where u represents a given node, u ∈ V_i, and V_i represents the local node set of the i-th client; v represents the target node; r represents the neighbor node type, r ∈ R, and R represents the edge (relation) set; h_u^T represents the transpose of the embedded feature of node u; M_r^D represents the weight corresponding to the neighbor relation r; h_v represents the embedded feature of the target node v; and θ_{D_i} represents the parameters of the i-th client's local discriminator.

The local discriminator outputs the discrimination result of the neighbor node type by computing the probability that the neighbor relation between the target node and the neighbor node is r.
The local discriminator performs parameter updating with a loss constructed to maximize its success rate on the neighbor-node-type discrimination task and the positive/negative sample identification task; the loss function of the positive/negative sample identification task is:

$$\mathcal{L}_{D_i}^{1} = \mathbb{E}_{\langle u, r, v\rangle \sim HG_i}\!\left[-\log D_i\!\left(u, r, h_v\right)\right] + \mathbb{E}_{h'_v \sim G_i\left(\cdot;\, \theta_{G_i}\right)}\!\left[-\log\!\left(1 - D_i\!\left(u, r, h'_v\right)\right)\right]$$

where u represents a given node, u ∈ V_i, and V_i represents the local node set of the i-th client; v represents the target node; r represents the neighbor node type, r ∈ R, and R represents the edge (relation) set; h'_v represents the output of the i-th client's generator after noise data is added; h_v represents the embedded feature of the target node v; HG_i represents the heterogeneous subgraph held by the i-th client; G_i represents the local generator of the i-th client; θ_{G_i} represents the parameters of G_i; and D_i represents the local discriminator of the i-th client.
The loss function of the neighbor-node-type discrimination task is:

$$\mathcal{L}_{D_i}^{2} = \mathbb{E}_{\langle u, r, v\rangle \sim HG_i,\; r' \in R \setminus \{r\}}\!\left[-\log\!\left(1 - D_i\!\left(u, r', h_v\right)\right)\right]$$

where r' represents an error relation between node u and the target node v, r' ∈ R\{r}; the other symbols are as defined above.

The joint loss adopted by the local discriminator for parameter updating is:

$$\mathcal{L}_{D_i} = \mathcal{L}_{D_i}^{1} + \mathcal{L}_{D_i}^{2}$$
In step S1024, after the clients complete local training, they send the model parameters to the central server for aggregation, mainly using the FedAvg algorithm.
In step S103, after completing the aggregation, the central server distributes the aggregated parameters to each client for a new training round, and this process is repeated multiple times. The condition for completing training may be reaching a set number of iterations.
In step S104, to extract the embedded feature of a target node, each client performs feature extraction on the target node with the local heterogeneous subgraph, and then aggregates the original features of the intra-client neighbor nodes and cross-client neighbor nodes to obtain the embedded feature of the target node. The aggregation here may follow the description of step S1022.
In another aspect, the invention provides a sub-graph level federated heterogeneous node embedded feature extraction method, executed based on a plurality of interconnected clients and a central server, where each client holds a heterogeneous subgraph of part of the heterogeneous graph data as private data; for a designated target node of a designated client, the target embedded feature is extracted by the target embedded feature extraction model of the heterogeneous graph data node embedded feature extraction model training method of steps S101 to S104 deployed on the designated client.
In another aspect, the invention further provides a sub-graph level federated heterogeneous node classification method, executed based on a plurality of interconnected clients and a central server, where each client holds a heterogeneous subgraph of part of the heterogeneous graph data as private data; for a designated target node of a designated client, the designated client obtains the target embedded feature of the designated target node using the above sub-graph level federated heterogeneous node embedded feature extraction method, and the target embedded feature is input into a pre-trained logistic regression model to perform node classification.
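A minimal sketch of this downstream step, assuming scikit-learn's logistic regression and hypothetical variable names:

```python
from sklearn.linear_model import LogisticRegression

def classify_nodes(train_embeddings, train_labels, test_embeddings):
    """Train a logistic regression on extracted embeddings and classify nodes."""
    clf = LogisticRegression(max_iter=1000)
    clf.fit(train_embeddings, train_labels)
    return clf.predict(test_embeddings)
```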
In another aspect, the invention further provides a heterogeneous graph data management system comprising a plurality of interconnected clients and a central server, where each client holds a heterogeneous subgraph of part of the heterogeneous graph data as private data, and the clients and the central server execute the steps of the above method.
In another aspect, the present invention also provides a computer readable storage medium having stored thereon a computer program which when executed by a processor performs the steps of the above method.
The invention is described below in connection with a specific embodiment:
The embodiment provides a scheme for extracting the embedded features of the heterogeneous graph data, and performs downstream node classification tasks on the basis of extracting the embedded features.
First, each client independently trains a heterogeneous graph generative adversarial network on its local data to obtain a local model; the client then uploads the local adversarial network parameters to the central server, the central server obtains a global adversarial network through federated averaging, and the central server further delivers the global model to the clients to guide them in continuing to train their local models. The overall framework of this embodiment is shown in fig. 1. To address the data differences among clients in federated learning, a federation-adapted robust neighbor sampling strategy is introduced, which increases the model's robustness to the data heterogeneity of different clients while making better use of cross-client information. To enhance the model's robustness to noisy data, a noise relation discriminator is added; through federated training of the error-relation discriminators, each client can improve the performance of its local discriminator according to the global discriminator when local data is polluted, so that polluted data can be correctly identified and the accuracy of feature embedding is guaranteed.
The method locally constructs a minimax optimization between the generator and the discriminator according to the known data structure; the specific structure is shown in fig. 2. A generative adversarial network is a deep learning framework in which two neural networks, a discriminator D and a generator G, are set against each other. The discriminator D learns to distinguish real data (features, label distribution, data-volume distribution) drawn from the data distribution P_data from fake data containing noise perturbations, while the generator G learns to generate, from the noise distribution P_Z, fake data that approximates the real data as closely as possible so as to obtain a higher discriminator score. This can be expressed as:

$$\min_{G} \max_{D} V(D, G) = \mathbb{E}_{x \sim P_{data}}\!\left[\log D(x)\right] + \mathbb{E}_{z \sim P_Z}\!\left[\log\!\left(1 - D\!\left(G(z)\right)\right)\right]$$
Noise in the data refers to the noise that may affect some node features in real applications due to limited data quality, such as white noise on a picture or garbled characters in a text; unrelated nodes may also introduce error relationships during graph construction. In this embodiment, noisy data is simulated by adding Gaussian noise to node features, adding error relationships, and so on. The performance of both sides improves continuously through the adversarial training of the generator and the discriminator, yielding accurate feature embeddings, and the generated features are classified with a logistic regression model to realize the heterogeneous graph node classification task.
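One plausible way to simulate such noisy data (Gaussian feature noise plus injected error relations) is sketched below; the default sigma and the triple format are assumptions of this sketch, not values from the embodiment:

```python
import random
import numpy as np

def add_feature_noise(features, sigma=0.1, rng=None):
    """Simulate low-quality node features by adding Gaussian noise."""
    rng = rng or np.random.default_rng()
    return features + rng.normal(0.0, sigma, size=features.shape)

def add_error_relations(edges, nodes, relations, n_fake):
    """Inject error relationships: random <u, r, v> triples not in the graph."""
    existing, fake = set(edges), []
    while len(fake) < n_fake:
        t = (random.choice(nodes), random.choice(relations), random.choice(nodes))
        if t not in existing and t[0] != t[2]:
            fake.append(t)
    return fake
```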
1) Federated heterogeneous graph node embedding generator
Since the data to be processed is heterogeneous graph data, the generative adversarial network must also take the topological relations between the data into account. Taking client i as an example, its generator G_i should generate false samples that are as realistic as possible while perceiving the topology. For a given node u ∈ V_i and r ∈ R, the generator G_i outputs a hidden-layer embedding h'_v of the target node v such that <u, r, v> is as close as possible to a triple truly present in the subgraph HG_i. Generator G_i first obtains a neighbor embedding ĥ by aggregating the neighbor node embeddings with the neighbor weights; to enhance the generator's ability to fool the discriminator, Gaussian noise with standard deviation σ is added to ĥ, which can serve as an approximate embedding of the r-class neighbor nodes of node u, and the result is input to a multi-layer perceptron (MLP) to obtain the hidden-layer feature of the target node v:

$$h'_v = f\!\left(W_i\left(\hat{h} + \epsilon\right) + b_i\right), \qquad \epsilon \sim \mathcal{N}\!\left(0, \sigma^2 I\right)$$

where f is an activation function, and W_i and b_i are the parameters of the MLP on client i, which are also the target parameters to be aggregated into the global model. Optimizing the generator with -log D(G(·;θ_G);θ_D) achieves better results than constructing an ordinary loss function, so the local training loss function of the generator part is:

$$\mathcal{L}_{G_i} = \sum_{u \in V_i} \sum_{r \in R} \mathbb{E}_{h'_v \sim G_i\left(u, r \mid HG_i;\, \theta_{G_i}\right)} \left[ -\log D_i\!\left(u, r, h'_v\right) \right]$$

where u represents a given node, u ∈ V_i, and V_i represents the local node set of the i-th client; v represents the target node; r represents the neighbor node type, r ∈ R, and R represents the edge (relation) set; h'_v represents the output of the i-th client's generator after noise data is added; HG_i represents the heterogeneous subgraph held by the i-th client; G_i represents the local generator of the i-th client; θ_{G_i} represents the parameters of G_i; and D_i represents the local discriminator of the i-th client.
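A PyTorch sketch of a generator of this shape (Gaussian perturbation of the aggregated neighbor embedding followed by an MLP, trained with the -log D loss above); the hidden width and LeakyReLU activation are illustrative assumptions:

```python
import torch
import torch.nn as nn

class LocalGenerator(nn.Module):
    """G_i: perturb the aggregated neighbor embedding with Gaussian noise and
    map it through an MLP to a fake target-node embedding h'_v."""

    def __init__(self, dim_in, dim_hidden, dim_out, sigma=0.1):
        super().__init__()
        self.sigma = sigma
        self.mlp = nn.Sequential(
            nn.Linear(dim_in, dim_hidden),
            nn.LeakyReLU(),
            nn.Linear(dim_hidden, dim_out),
        )

    def forward(self, neighbor_embedding):
        noisy = neighbor_embedding + self.sigma * torch.randn_like(neighbor_embedding)
        return self.mlp(noisy)

def generator_loss(d_scores_fake):
    """-log D(G(.)): minimized when the discriminator accepts fake samples."""
    return -torch.log(d_scores_fake + 1e-8).mean()
```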
2) Federated heterogeneous graph noise relation discriminator
The main purpose of the discriminator D_i on client i is to identify the potential connectivity between a neighbor node u and the target node v, where u and v are both from the node set V_i, the relation r from the relation set R indicates the neighbor type of nodes u and v, h_v is the hidden representation of the target node v, θ_{D_i} denotes the learnable parameters of the discriminator, which are also the target parameters of the federally aggregated global discriminator in the heterogeneous graph adversarial network, and M_r^D is the adjacency weight corresponding to the relation r. Given the target node v, the sampled neighbor relation r, and the neighbor node u, the discriminator calculates the probability that the neighbor relation between node v and node u is r as:

$$D_i\!\left(u, r, h_v;\, \theta_{D_i}\right) = \frac{1}{1 + \exp\!\left(-h_u^{\top} M_r^{D}\, h_v\right)}$$

When <u, r, v> is sampled from the real data subset on client i, the discriminator should output a high probability; conversely, when this triple is negatively sampled, the discriminator should output a low probability. In local training, the discriminator therefore first computes the loss on positive and negative samples as follows:

$$\mathcal{L}_{D_i}^{1} = \mathbb{E}_{\langle u, r, v\rangle \sim HG_i}\!\left[-\log D_i\!\left(u, r, h_v\right)\right] + \mathbb{E}_{h'_v \sim G_i\left(\cdot;\, \theta_{G_i}\right)}\!\left[-\log\!\left(1 - D_i\!\left(u, r, h'_v\right)\right)\right]$$

where h'_v represents the output of the i-th client's generator after noise data is added, h_v represents the embedded feature of the target node v, and the other symbols are as defined above.
To ensure that the discriminator can correctly identify positively sampled samples even in federated scenarios where client data quality is low and noise relations exist, its ability to identify negative samples with error relations must be enhanced. Specifically, the current negative samples are mainly graph data containing erroneous node embedding information; negative samples with error relations need to be further added, and the discriminator trained accordingly to improve its ability to identify such samples. On top of the discrimination of positive and negative samples, the constructed loss is:

$$\mathcal{L}_{D_i}^{2} = \mathbb{E}_{\langle u, r, v\rangle \sim HG_i,\; r' \in R \setminus \{r\}}\!\left[-\log\!\left(1 - D_i\!\left(u, r', h_v\right)\right)\right]$$

where r' represents an error relation between node u and the target node v, r' ∈ R\{r}. The loss function of discriminator D_i is therefore:

$$\mathcal{L}_{D_i} = \mathcal{L}_{D_i}^{1} + \mathcal{L}_{D_i}^{2}$$
3) Cross-client federated heterogeneous graph neighbor sampling
In each client, the heterogeneous graph neural network aggregates the neighbor information of the target node, and this information can be obtained by communication in advance of training, satisfying the full use of cross-client information. For generator G_i on the i-th client, the embedding h'_v it generates for the target node v depends mainly on the features h_u of the neighbor nodes and G_i's local neighbor weight parameters; likewise, for discriminator D_i, whether a sample is positive or negative is determined only by h_u and h_v. Thus, when training the local model, client i only needs to receive from the relevant clients the features h_u and corresponding weights, where j = 1, 2, ..., K and j ≠ i, u ∈ N_v, and N_v is the neighborhood of the target node v. It should be noted that although client i receives neighbor node features from other clients, because each client trains locally, the model parameters θ = {W, b, M_D, M_G} differ between clients, and the information passed between clients is an aggregation of features and parameter weights, so client i cannot infer the node information of other clients from its local parameters. This satisfies the processing principles of federated learning when handling cross-client graph information.
To ensure the quality of the target node's embedded information, heterogeneous neighbors must be sampled fairly during training, preventing an uneven distribution of local heterogeneous graph node types from letting a few node types dominate the target node's embedding. To this end, FedHGAN (the federated heterogeneous graph generative adversarial network of this embodiment) adopts a Top-K random walk strategy to sample the neighbors of the target node. Specifically, the Top-K random walk strategy first samples the neighbors of node v with equal probability to obtain n_s neighbor nodes, and then, for each node type a in the node type set A, intercepts the first k_a sampled nodes of that type as the sampled neighbors of that class. With this sampling method, random sampling results of all classes can be obtained fairly, while preventing the sampled neighbors from being skewed by the data distribution.
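A sketch of the Top-K random walk sampler, under an assumed graph.neighbors(v) interface returning (node, node_type) pairs:

```python
import random
from collections import defaultdict

def top_k_random_walk_sample(graph, v, n_s, k_per_type):
    """Fairly sample heterogeneous neighbors of v: draw n_s neighbors with
    equal probability, then keep at most the first k_a of each node type a."""
    walked = [random.choice(graph.neighbors(v)) for _ in range(n_s)]
    kept, counts = [], defaultdict(int)
    for node, ntype in walked:
        if counts[ntype] < k_per_type.get(ntype, 0):
            kept.append((node, ntype))
            counts[ntype] += 1
    return kept
```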
4) Network training
The present embodiment builds a discriminator D_i and a generator G_i on each client. In the epochs of local training, the client first fixes the generator parameters θ_{G_i}, performs positive and negative sampling with the Top-K random walk algorithm, and updates θ_{D_i} to enhance the discriminator's performance; it then fixes θ_{D_i} and trains the generator, updating θ_{G_i} with a discriminator that can better distinguish positive and negative samples. Each training round first updates the discriminator n_D times and then updates the generator n_G times, and after the local training round is completed, the local parameters are sent to the server. The server then aggregates the model parameters of each client through the FedAvg algorithm to obtain a global model θ_Global, which in turn guides the local models to complete training.
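Putting the schedule together, a hypothetical local round on client i, reusing the loss helpers sketched above and an assumed client object that holds the models, optimizers, and sampling routines, might read:

```python
def local_round(client, n_D, n_G):
    """Fix G and update D n_D times, then fix D and update G n_G times."""
    for _ in range(n_D):
        pos, fake, wrong = client.sample_batches()  # Top-K walk (+ negatives)
        loss_d = discriminator_loss(
            client.D(*pos), client.D(*fake), client.D(*wrong))
        client.opt_D.zero_grad()
        loss_d.backward()
        client.opt_D.step()
    for _ in range(n_G):
        u_emb, rel, neighbor_emb = client.sample_generator_batch()
        h_fake = client.G(neighbor_emb)
        loss_g = generator_loss(client.D(u_emb, rel, h_fake))
        client.opt_G.zero_grad()
        loss_g.backward()
        client.opt_G.step()
    # local parameters are then sent to the server for FedAvg aggregation
    return client.G.state_dict(), client.D.state_dict()
```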
Accordingly, the present invention also provides an apparatus/system comprising a computer device including a processor and a memory, the memory having stored therein computer instructions for executing the computer instructions stored in the memory, the apparatus/system implementing the steps of the method as described above when the computer instructions are executed by the processor.
The embodiments of the present invention also provide a computer-readable storage medium having stored thereon a computer program which, when executed by a processor, implements the steps of the above method. The computer-readable storage medium may be a tangible storage medium such as random access memory (RAM), memory, read-only memory (ROM), electrically programmable ROM, electrically erasable programmable ROM, registers, a floppy disk, a hard disk, a removable memory disk, a CD-ROM, or any other form of storage medium known in the art.
Therefore, in this embodiment, a discriminator and a generator are built on each local client through adversarial training to better aggregate neighbor node information for heterogeneous node embedding, and the central server aggregates the local models to obtain a global model robust to data heterogeneity, which in turn guides the training of the local clients. This embodiment also provides a cross-client federated heterogeneous graph data sampling algorithm that helps the federated heterogeneous graph model make better use of cross-client heterogeneous graph information and fairly samples all kinds of neighbors to participate in embedding learning, reducing the influence of inter-client data differences on training and improving the learning quality of the federated heterogeneous graph neural network.
In summary, according to the heterogeneous graph data node embedded feature extraction model training method, embedded feature extraction method, node classification method and device, a generator and a discriminator are constructed locally at each client under a federated learning paradigm to train the local model through adversarial learning, and model parameters are aggregated at the central server, so that data privacy is effectively protected. Meanwhile, during local training, each client obtains through cross-client communication the original features of the cross-client neighbor nodes associated with the target node, and neighbors of all types, sampled fairly by top-k random walk, participate in embedding learning; this reduces the influence of inter-client data differences on training, improves the learning quality of the federated heterogeneous graph neural network, and improves the execution efficiency of downstream tasks.
Those of ordinary skill in the art will appreciate that the various illustrative components, systems, and methods described in connection with the embodiments disclosed herein can be implemented as hardware, software, or a combination of both. The particular implementation is hardware or software dependent on the specific application of the solution and the design constraints. Skilled artisans may implement the described functionality in varying ways for each particular application, but such implementation decisions should not be interpreted as causing a departure from the scope of the present invention. When implemented in hardware, it may be, for example, an electronic circuit, an Application Specific Integrated Circuit (ASIC), suitable firmware, a plug-in, a function card, or the like. When implemented in software, the elements of the invention are the programs or code segments used to perform the required tasks. The program or code segments may be stored in a machine readable medium or transmitted over transmission media or communication links by a data signal carried in a carrier wave.
It should be understood that the invention is not limited to the particular arrangements and instrumentality described above and shown in the drawings. For the sake of brevity, a detailed description of known methods is omitted here. In the above embodiments, several specific steps are described and shown as examples. However, the method processes of the present invention are not limited to the specific steps described and shown, and those skilled in the art can make various changes, modifications and additions, or change the order between steps, after appreciating the spirit of the present invention.
In this disclosure, features that are described and/or illustrated with respect to one embodiment may be used in the same way or in a similar way in one or more other embodiments and/or in combination with or instead of the features of the other embodiments.
The above description is only of the preferred embodiments of the present invention and is not intended to limit the present invention, and various modifications and variations can be made to the embodiments of the present invention by those skilled in the art. Any modification, equivalent replacement, improvement, etc. made within the spirit and principle of the present invention should be included in the protection scope of the present invention.

Claims (10)

1. A heterogeneous graph data node embedded feature extraction model training method, wherein the method is performed based on a plurality of clients and a central server which are connected with each other, each client holds a heterogeneous subgraph of a part of heterogeneous graph data as private data, each client is further provided with a cross-client sampler, and the method comprises the following steps:
initializing global model parameters of a global generator and a global discriminator by the central server;
in one training round, the performing steps include:
The central server sends the global model parameters to each client, and builds a local generator and a local discriminator, wherein the local generator outputs negative sample embedded features of a target node under the condition of adding noise data based on original features and weights of neighbor nodes in the client of the target node in the heterogeneous subgraph which are locally held and original features and weights of the target node, which are provided by related clients, crossing the neighbor nodes of the client; the original features are feature vectors preliminarily extracted aiming at the data types of the neighbor nodes in the client and the cross-client neighbor nodes;
sampling the target node in the client inner neighbor node and the cross-client neighbor node based on a top-k random walk algorithm by a cross-client sampler local to the client, and calculating positive sample embedding characteristics according to the original characteristics and weight aggregation of the client inner neighbor node and the cross-client neighbor node obtained by sampling;
the local discriminator performs a discrimination task of the neighbor node type of the target node and an identification task of positive and negative samples based on the negative sample embedded feature and the positive sample embedded feature;
Based on countermeasure learning, the local generator performs parameter updating by minimizing the recognition success rate construction loss of the negative sample embedded feature by the local discriminator, and the local discriminator performs parameter updating by maximizing the recognition task of the neighbor node type and the recognition task success rate construction loss of the positive and negative samples;
the updated parameters of the local generator and the local discriminator are sent to the central server by each client for parameter aggregation;
executing a plurality of training rounds according to a set condition, and updating the global generator, the global discriminator, the local generator and the local discriminator;
and constructing a target embedded feature extraction model on each client, extracting a first embedded feature of a designated target node aiming at a heterogeneous subgraph held by a local generator of a corresponding client by the target embedded feature extraction model, and aggregating the first embedded feature into original features of a neighbor node in the client of the designated target node and a cross-client neighbor node to obtain the embedded feature of the designated target node.
2. The heterogeneous graph data node embedded feature extraction model training method of claim 1, wherein the global generator and the local generator employ multi-layer perceptrons of the same structure, and the global discriminator and the local discriminator employ multi-layer perceptrons of the same structure.
3. The heterogeneous graph data node embedded feature extraction model training method of claim 1, wherein the local generator performs parameter updating on the recognition success rate construction loss of the negative-sample embedded features by minimizing the local discriminator, wherein the constructed loss function expression is:
where u represents a given node, u ε V i ,V i The method comprises the steps that a local node set of an ith client is represented, v represents a target node, R represents the type of a neighbor node, R epsilon R, and R represents an edge set; h's' v Representing the output of the ith client generator after adding noise data, HG i Representing a heterogeneous subgraph held by an ith client, G i A local generator representing the i-th client,represents G i Parameters D of (2) i Representing the local discriminator of the i-th client.
4. The heterogeneous graph data node embedded feature extraction model training method according to claim 1, wherein in the method, the local discriminator performs a discrimination task for a neighbor node type of the target node, and a probability calculation formula of a neighbor relation r between the target node v and the neighbor node v is:
where u represents a given node, u ε V i ,V i The method comprises the steps that a local node set of an ith client is represented, v represents a target node, R represents the type of a neighbor node, R epsilon R, and R represents an edge set; Transpose of the embedded feature representing node u, +.>Represents the weight, h, corresponding to the neighbor relation r v Embedded feature representing target node v, +.>Parameters representing the i-th client local discriminator;
and the discriminator outputs the discrimination result of the neighbor node type by calculating the probability that the neighbor relation between the target node and the neighbor node is r.
5. The heterogeneous graph data node embedded feature extraction model training method of claim 4, wherein in parameter updating by maximizing discrimination tasks of neighbor node types and recognition task success rate construction losses of positive and negative samples, a loss function of the recognition tasks of the positive and negative samples is:
where u represents a given node, u ε V i ,V i The method comprises the steps that a local node set of an ith client is represented, v represents a target node, R represents the type of a neighbor node, R epsilon R, and R represents an edge set; h's' v Represents the output of the ith client generator after adding noise data, h v Representing embedded features of the target node v, HG i Representing a heterogeneous subgraph held by an ith client, G i A local generator representing the i-th client,represents G i Parameters D of (2) i A local discriminator representing the i-th client;
The local discriminator builds loss by maximizing discrimination tasks of the neighbor node types and discrimination task success rates of positive and negative samples, and in parameter updating, the loss function of the discrimination tasks of the neighbor node types is as follows:
where u represents a given node, u ε V i ,V i The method comprises the steps that a local node set of an ith client is represented, v represents a target node, R represents the type of a neighbor node, R epsilon R, and R represents an edge set; r ' represents the error relationship of node u with target node v, R ' =r/{ R }, h ' v Representing the output of the ith client generator after adding noise data, HG i Representing a heterogeneous subgraph held by an ith client;
the local discriminator performs parameter updating by maximizing discrimination task of neighbor node type and discrimination task success rate construction loss of positive and negative samples, and the adopted joint loss is as follows:
6. the heterogeneous graph data node embedded feature extraction model training method according to claim 1, wherein in the method, the extraction modes of the original features of the client-side inner neighbor node and the cross-client-side neighbor node are as follows:
if the data type is text, using a bag-of-word vector as the original feature;
If the data type is an image, directly adopting a pixel numerical vector as the original characteristic, or extracting the original characteristic through a pre-trained neural network.
7. A sub-graph level federal heterogeneous node embedded feature extraction method, wherein the method is performed based on a plurality of clients and a central server which are connected with each other, each client holds a heterogeneous subgraph of a part of heterogeneous graph data as private data, and for a designated target node of a designated client, target embedded features are extracted by using a target embedded feature extraction model in the heterogeneous graph data node embedded feature extraction model training method according to any one of claims 1 to 7 deployed on the designated client.
8. A sub-graph level federal heterogeneous node classification method, wherein the method is performed based on a plurality of clients and a central server which are connected with each other, each client holds a heterogeneous sub-graph of a part of heterogeneous graph data as private data, and for a specified target node of a specified client, the specified client acquires target embedded features of the specified target node by adopting the sub-graph level federal heterogeneous node embedded feature extraction method according to claim 7;
And inputting the target embedded features into a pre-trained logistic regression model to perform node classification.
9. A heterogram data management system, characterized in that the system comprises a plurality of clients and a central server executing in connection with each other, each of the clients holding a heterogram of a part of the heterogram data as private data, the clients and the central server executing the steps of the method according to any one of claims 1 to 9.
10. A computer readable storage medium, on which a computer program is stored, characterized in that the program, when being executed by a processor, implements the steps of the method according to any one of claims 1 to 9.
CN202311071070.1A 2023-08-23 2023-08-23 Training method for heterogeneous graph data node embedded feature extraction model, embedded feature extraction method, node classification method and device Pending CN117272017A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202311071070.1A CN117272017A (en) 2023-08-23 2023-08-23 Training method for heterogeneous graph data node embedded feature extraction model, embedded feature extraction method, node classification method and device

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202311071070.1A CN117272017A (en) 2023-08-23 2023-08-23 Training method for heterogeneous graph data node embedded feature extraction model, embedded feature extraction method, node classification method and device

Publications (1)

Publication Number Publication Date
CN117272017A true CN117272017A (en) 2023-12-22

Family

ID=89218643

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202311071070.1A Pending CN117272017A (en) 2023-08-23 2023-08-23 Training method for heterogeneous graph data node embedded feature extraction model, embedded feature extraction method, node classification method and device

Country Status (1)

Country Link
CN (1) CN117272017A (en)

Cited By (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN117592006A (en) * 2024-01-19 2024-02-23 广东浪潮智慧计算技术有限公司 Smart city data processing method, device, equipment and readable storage medium

Cited By (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN117592006A (en) * 2024-01-19 2024-02-23 广东浪潮智慧计算技术有限公司 Smart city data processing method, device, equipment and readable storage medium
CN117592006B (en) * 2024-01-19 2024-04-26 广东浪潮智慧计算技术有限公司 Smart city data processing method, device, equipment and readable storage medium

Similar Documents

Publication Publication Date Title
de Araujo-Filho et al. Intrusion detection for cyber–physical systems using generative adversarial networks in fog environment
Zhang et al. Deep learning in mobile and wireless networking: A survey
Tuor et al. Overcoming noisy and irrelevant data in federated learning
Nayak et al. Deep learning-based reliable routing attack detection mechanism for industrial Internet of Things
US20210374617A1 (en) Methods and systems for horizontal federated learning using non-iid data
Idrissi et al. An unsupervised generative adversarial network based-host intrusion detection system for internet of things devices
CN112396106B (en) Content recognition method, content recognition model training method, and storage medium
Wang et al. Deep neural networks for CSI-based authentication
He et al. Inferring application type information from tor encrypted traffic
Qi et al. Model aggregation techniques in federated learning: A comprehensive survey
CN117272017A (en) Training method for heterogeneous graph data node embedded feature extraction model, embedded feature extraction method, node classification method and device
CN113949582B (en) Network asset identification method and device, electronic equipment and storage medium
Nouman et al. Malicious node detection using machine learning and distributed data storage using blockchain in WSNs
Adam et al. Toward smart traffic management with 3D placement optimization in UAV-assisted NOMA IIoT networks
Qu et al. Statistics-enhanced direct batch growth self-organizing mapping for efficient DoS attack detection
CN113033652A (en) Image recognition system and method based on block chain and federal learning
Gebremariam et al. Blockchain‐Based Secure Localization against Malicious Nodes in IoT‐Based Wireless Sensor Networks Using Federated Learning
Jayanagara et al. An Overview of Concepts, Applications, Difficulties, Unresolved Issues in Fog Computing and Machine Learning
Sun et al. Few-Shot network intrusion detection based on prototypical capsule network with attention mechanism
CN113886817A (en) Host intrusion detection method and device, electronic equipment and storage medium
Ben Atitallah et al. An effective detection and classification approach for dos attacks in wireless sensor networks using deep transfer learning models and majority voting
CN116032590A (en) DDOS attack detection model training method and related device
Hernandez-Ramos et al. Intrusion Detection based on Federated Learning: a systematic review
Zhang et al. STrans-GAN: Spatially-Transferable Generative Adversarial Networks for Urban Traffic Estimation
CN115496180A (en) Training method, generating method and device of network traffic characteristic sequence generating model

Legal Events

Date Code Title Description
PB01 Publication
PB01 Publication
SE01 Entry into force of request for substantive examination
SE01 Entry into force of request for substantive examination