CN113011282A - Graph data processing method and device, electronic equipment and computer storage medium

Info

Publication number
CN113011282A
Authority
CN
China
Prior art keywords
graph
subgraph
sample
sample graph
node
Legal status
Pending
Application number
CN202110220724.7A
Other languages
Chinese (zh)
Inventor
Tingyang Xu (徐挺洋)
Junchi Yu (余俊驰)
Yu Rong (荣钰)
Yatao Bian (卞亚涛)
Current Assignee
Tencent Technology Shenzhen Co Ltd
Original Assignee
Tencent Technology Shenzhen Co Ltd
Application filed by Tencent Technology Shenzhen Co Ltd filed Critical Tencent Technology Shenzhen Co Ltd
Priority to CN202110220724.7A priority Critical patent/CN113011282A/en
Publication of CN113011282A publication Critical patent/CN113011282A/en


Classifications

    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06V: IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V 20/00: Scenes; scene-specific elements
    • G06F: ELECTRIC DIGITAL DATA PROCESSING
    • G06F 18/00: Pattern recognition
    • G06F 18/20: Analysing
    • G06F 18/21: Design or setup of recognition systems or techniques; extraction of features in feature space; blind source separation
    • G06F 18/214: Generating training patterns; bootstrap methods, e.g. bagging or boosting
    • G06F 18/24: Classification techniques
    • G06N: COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N 3/00: Computing arrangements based on biological models
    • G06N 3/02: Neural networks
    • G06N 3/04: Architecture, e.g. interconnection topology
    • G06N 3/08: Learning methods

Abstract

The application discloses a graph data processing method and apparatus, an electronic device, and a computer storage medium, relating to the technical fields of artificial intelligence, blockchain, and cloud technology. The method includes: inputting a graph to be processed into a trained subgraph prediction model to obtain a target subgraph of the graph to be processed. Training of the subgraph prediction model includes: obtaining a training data set comprising a plurality of sample graphs; inputting each sample graph into an initial neural network model to obtain a classification result for each node in each sample graph; for each sample graph, determining a subgraph connectivity loss based on the classification results of the nodes of the sample graph and the association relations between those nodes; and training the initial neural network model based on the subgraph connectivity losses corresponding to the sample graphs and the training data set to obtain the subgraph prediction model. With this method, the trained model can accurately identify the target subgraph without any labeled subgraphs.

Description

Graph data processing method and device, electronic equipment and computer storage medium
Technical Field
The present application relates to the technical fields of artificial intelligence, blockchain, and cloud technology, and in particular, to a graph data processing method and apparatus, an electronic device, and a computer storage medium.
Background
For graph data, the target subgraph of the graph data embodies its main attributes, so identification of the target subgraph has wide application in practice; for example, compression, denoising, and similar processing of graph data can be realized through the target subgraph.
In the prior art, accurate recognition of a target subgraph of graph data is usually realized by a model trained on labeled subgraphs. How to accurately recognize the target subgraph without labeled subgraphs is therefore a problem to be solved.
Disclosure of Invention
The present application aims to solve at least one of the above technical drawbacks, and particularly proposes the following technical solution to solve the problem of accurately identifying a target subgraph without subgraph labeling.
According to an aspect of the present application, there is provided a graph data processing method, including:
acquiring a graph to be processed;
inputting the graph to be processed into the trained sub-graph prediction model to obtain a target sub-graph of the graph to be processed, wherein the sub-graph prediction model is obtained by training in the following way:
acquiring a training data set, wherein the training data set comprises a plurality of sample graphs;
acquiring an association relation between nodes of each sample graph;
inputting each sample graph into an initial neural network model to obtain a classification result of each node in each sample graph, wherein for any node in one sample graph, the classification result represents the probability that the node is the node of a target subgraph of the sample graph;
for each sample graph, determining a subgraph connectivity loss corresponding to the sample graph based on the classification result of each node of the sample graph and the association relations between the nodes of the sample graph;
and training the initial neural network model based on the subgraph connectivity loss and the training data set corresponding to each sample graph until a preset training end condition is met, and determining the neural network model at the end of training as a subgraph prediction model.
According to another aspect of the present application, there is provided a graph data processing apparatus including:
the graph data acquisition module is used for acquiring a graph to be processed;
the subgraph identification module is used for inputting the graph to be processed into the trained subgraph prediction model to obtain a target subgraph of the graph to be processed, the subgraph prediction model is obtained by training through the following model training module, and the model training module is used for:
acquiring a training data set, wherein the training data set comprises a plurality of sample graphs;
acquiring an association relation between nodes of each sample graph;
inputting each sample graph into an initial neural network model to obtain a classification result of each node in each sample graph, wherein for any node in one sample graph, the classification result represents the probability that the node is the node of a target subgraph of the sample graph;
for each sample graph, determining a subgraph connectivity loss corresponding to the sample graph based on the classification result of each node of the sample graph and the association relations between the nodes in the sample graph;
and training the initial neural network model based on the subgraph connectivity loss and the training data set corresponding to each sample graph until a preset training end condition is met, and determining the neural network model at the end of training as a subgraph prediction model.
In one possible implementation manner, each sample graph carries an attribute label, and for each sample graph, the apparatus further includes:
the attribute loss determining module is used for obtaining a prediction target subgraph of the sample graph based on the classification result of each node of the sample graph; determining attribute loss between the sample graph and the prediction target subgraph based on the attribute labels of the prediction target subgraph and the sample graph, wherein the attribute loss represents the difference between the attribute of the prediction target subgraph and the attribute of the sample graph;
when the model training module trains the initial neural network model based on the subgraph connectivity loss and the training data set of each sample graph, the model training module is specifically configured to:
and training the initial neural network model based on the subgraph connectivity loss, the attribute loss and the training data set corresponding to each sample graph.
In one possible implementation, for each sample graph, the apparatus further includes:
the relevance loss determining module is used for obtaining a prediction target subgraph of the sample graph based on the classification result of each node of the sample graph; determining the correlation loss between the prediction target subgraph and the sample graph, wherein the correlation loss characterizes the correlation between the prediction target subgraph and the sample graph;
when the model training module trains the initial neural network model based on the subgraph connectivity loss and the training data set of each sample graph, the model training module is specifically configured to:
and training the initial neural network model based on the subgraph connectivity loss, the correlation loss and the training data set corresponding to each sample graph.
In a possible implementation manner, for each sample graph, the association relationship includes an adjacency matrix corresponding to the sample graph, and when determining connectivity loss of a sub-graph corresponding to the sample graph based on the classification result of each node corresponding to the sample graph and the association relationship between each node in the sample graph, the model training module is specifically configured to:
determining a node classification matrix corresponding to the sample graph according to the classification result of each node of the sample graph, wherein the element of each row of the node classification matrix corresponds to the classification result of one node in the sample graph;
and determining subgraph connectivity loss corresponding to the sample graph according to the node classification matrix and the adjacency matrix corresponding to the sample graph.
In a possible implementation manner, when determining connectivity loss of a sub-graph corresponding to a sample graph according to a node classification matrix and an adjacency matrix corresponding to the sample graph, the model training module is specifically configured to:
determining a connectivity result of a subgraph of the sample graph according to the node classification matrix and the adjacency matrix;
and determining subgraph connectivity loss corresponding to the sample graph based on the connectivity result of the subgraph and the constraint condition of the connectivity result of the subgraph.
In one possible implementation, the expression of the subgraph connectivity loss is:
L_con(g(G; θ)) = ‖Norm(SᵀAS) - I₂‖_F
where G is a sample graph, θ is the model parameter of the neural network model, L_con(g(G; θ)) denotes the subgraph connectivity loss, Norm denotes row normalization of a matrix, S denotes the node classification matrix, A denotes the adjacency matrix, I₂ is the 2 × 2 identity matrix, Sᵀ is the transpose of S, and ‖·‖_F is the Frobenius norm.
In a possible implementation manner, when determining the attribute loss between the sample graph and the prediction target sub-graph based on the attribute labels of the prediction target sub-graph and the sample graph, the attribute loss determining module is specifically configured to:
acquiring node characteristics of each node in the predicted target subgraph;
obtaining sub-graph characteristics of the prediction target sub-graph by fusing the node characteristics of each node in the prediction target sub-graph;
determining attribute information of a prediction target subgraph according to the subgraph characteristics;
and determining the attribute loss between the sample graph and the prediction target subgraph according to the attribute information of the prediction target subgraph and the attribute label of the sample graph.
In a possible implementation manner, for any sample graph, when determining the correlation loss between the prediction target sub-graph and the sample graph, the correlation loss determining module is specifically configured to:
extracting sub-graph features of the predicted target sub-graph and sample graph features of the sample graph;
and determining the correlation loss between the prediction target subgraph and the sample graph based on the subgraph characteristics and the sample graph characteristics.
In a possible implementation manner, for any sample graph, when determining the correlation loss between the prediction target subgraph and the sample graph based on the subgraph features and the sample graph features, the correlation loss determining module is specifically configured to:
splicing sub-graph features corresponding to the sample graph and sample graph features, and determining first correlation loss between a prediction target sub-graph and the sample graph based on the spliced features;
acquiring sub-graph features corresponding to other sample graphs except the sample graph;
splicing the sample graph features of the sample graph with sub-graph features corresponding to each other sample graph, and determining second relevancy loss between the sample graph and each other sample graph based on the spliced features;
and determining the correlation loss between the prediction target subgraph and the sample graph according to the first correlation loss and each second correlation loss.
According to yet another aspect of the present application, there is provided an electronic device comprising a memory, a processor, and a computer program stored on the memory and executable on the processor, the processor implementing the graph data processing method of the present application when executing the computer program.
According to yet another aspect of the present application, there is provided a computer-readable storage medium having stored thereon a computer program which, when executed by a processor, implements the graph data processing method of the present application.
Embodiments of the present application also provide a computer program product or computer program comprising computer instructions stored in a computer-readable storage medium. The processor of a computer device reads the computer instructions from the computer-readable storage medium and executes them, causing the computer device to perform the method provided in the various alternative implementations of the graph data processing method described above.
The beneficial effect that technical scheme that this application provided brought is:
With the graph data processing method, apparatus, electronic device, and computer-readable storage medium of the present application, a graph to be processed whose target subgraph needs to be identified can be input into the trained subgraph prediction model, and the target subgraph of the graph to be processed is identified by the model. The training data set used for training contains only sample graphs, i.e., the subgraph prediction model can identify the target subgraph of the graph to be processed without any sample subgraphs (labeled subgraphs). Meanwhile, during model training, the subgraph connectivity loss is determined based on the classification results of the nodes and the association relations between the nodes; because the association relations between nodes are taken into account when determining the subgraph connectivity loss, the subgraph prediction model trained on the subgraph connectivity losses of the sample graphs and the training data set can identify the target subgraph more accurately.
Additional aspects and advantages of the present application will be set forth in part in the description which follows and, in part, will be obvious from the description, or may be learned by practice of the present application.
Drawings
In order to more clearly illustrate the technical solutions in the embodiments of the present application, the drawings used in the description of the embodiments of the present application will be briefly described below.
Fig. 1 is a schematic structural diagram of a data structure according to an embodiment of the present application;
FIG. 2 is a schematic diagram of a sub-graph prediction model according to an embodiment of the present application;
fig. 3 is a schematic flowchart of a training method of a subgraph prediction model in a graph data processing method according to an embodiment of the present application;
FIG. 4 is a schematic structural diagram of an attribute prediction model according to an embodiment of the present application;
fig. 5 is a schematic structural diagram of a correlation prediction model according to an embodiment of the present application;
FIG. 6 is a schematic diagram illustrating a connection relationship and data flow between various prediction networks according to an embodiment of the present application;
FIG. 7 is a schematic diagram illustrating a connection relationship and data flow between various prediction networks according to an embodiment of the present application;
FIG. 8 is a flowchart illustrating a graph data processing method according to an embodiment of the present application;
FIG. 9 is a diagram illustrating an environment for implementing a graph data processing method according to an embodiment of the present application;
FIG. 10 is a schematic diagram of an environment for implementing another graph data processing method according to an embodiment of the present application;
FIG. 11 is a block diagram of a graph data processing apparatus according to an embodiment of the present application;
fig. 12 is a schematic structural diagram of an electronic device according to an embodiment of the present application;
fig. 13 is a schematic structural diagram of a distributed system applied to a blockchain system according to an embodiment of the present disclosure.
Detailed Description
Reference will now be made in detail to embodiments of the present application, examples of which are illustrated in the accompanying drawings, wherein like or similar reference numerals refer to the same or similar elements or elements having the same or similar function throughout. The embodiments described below with reference to the drawings are exemplary only for the purpose of explaining the present application and are not to be construed as limiting the present application.
As used herein, the singular forms "a", "an", and "the" are intended to include the plural forms as well, unless the context clearly indicates otherwise. It will be further understood that the terms "comprises" and/or "comprising," when used in this specification, specify the presence of stated features, integers, steps, operations, elements, and/or components, but do not preclude the presence or addition of one or more other features, integers, steps, operations, elements, components, and/or groups thereof. It will be understood that when an element is referred to as being "connected" or "coupled" to another element, it can be directly connected or coupled to the other element, or intervening elements may also be present. Further, "connected" or "coupled" as used herein may include wirelessly connected or wirelessly coupled. As used herein, the term "and/or" includes all or any element and all combinations of one or more of the associated listed items.
In computer science, a Graph is a data structure composed of two parts: vertices and edges. A graph G can be described by its vertex set V and edge set E, and written as G(V, E).
As shown in fig. 1, fig. 1 shows a diagram of this data structure, in which vertices are represented by circles and edges are the connecting lines between the circles; the vertices are connected by edges. A vertex can also be called a node or an intersection, and an edge can also be called a connection. For example, a graph may represent a social network, where each person is a vertex and people who know each other are connected by edges. Graphs come in various shapes and sizes, and edges can carry weights, that is, each edge can be assigned a positive or negative value. For a graph representing flight routes, each city may be a vertex and each flight route an edge, and the weight of an edge may be the flight time, the ticket price, and so on. In addition, edges may be directed or undirected, depending on whether a directional dependency exists between vertices: a directed edge means there is only a one-way relationship between two vertices, while an undirected (or bidirectional) edge means there is a two-way relationship between them.
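To make this concrete, the following is a minimal Python sketch of a weighted undirected graph stored as an adjacency dict, following the flight-route example above; the city names and weights are illustrative assumptions, not data from the application.

    # A minimal sketch of G = (V, E): each vertex maps to its neighbors,
    # and each edge carries a weight (here, assumed flight hours).
    flight_graph = {
        "Beijing":  {"Shanghai": 2.0, "Shenzhen": 3.5},
        "Shanghai": {"Beijing": 2.0, "Shenzhen": 2.5},
        "Shenzhen": {"Beijing": 3.5, "Shanghai": 2.5},
    }

    def has_edge(g, u, v):
        # An undirected (bidirectional) edge is stored in both directions.
        return v in g.get(u, {})

    print(has_edge(flight_graph, "Beijing", "Shanghai"))  # True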
The graph data processing method of the present application is provided to accurately identify a target subgraph without subgraph labels, and the target subgraph of a graph to be processed can be accurately identified through this method. The subgraph prediction model in the embodiment of the present application can be realized based on artificial intelligence technology; optionally, the data computation involved in the embodiment of the present application can be realized using cloud computing, and the data storage involved can be realized using cloud storage.
Specifically, as shown in fig. 2, in the embodiment of the present application, a graph to be processed (G) is input into the trained subgraph prediction model for processing, so as to obtain the target subgraph (Gsub) corresponding to the graph to be processed. The target subgraph Gsub structurally retains part of the structure of the graph G to be processed, but can embody a specific attribute Y of G; in other words, the attribute label of Gsub is consistent with the attribute label of G, and the connections between nodes in the target subgraph are consistent with those in the graph to be processed. In terms of spatial structure, Gsub filters out the noise and redundant data in the graph G to be processed while retaining its effective structural information.
As shown in fig. 2, the subgraph prediction model is composed of a Graph Neural Network (GNN) and a multi-layer fully connected network (MLP). The graph neural network is a neural network that acts directly on the graph structure. A multi-layer fully connected network (or multi-layer perceptron), also called an Artificial Neural Network (ANN), may have one or more hidden layers between its input and output layers. A neural network mimics a biological neural network: by connecting many feature values through combinations of linear and nonlinear transformations, it finally achieves a target (in this example, the target subgraph).
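As an illustration only, this GNN + MLP composition can be sketched in PyTorch as follows (assuming the PyTorch Geometric library is available); the layer sizes and the choice of GCN convolutions are assumptions, not the architecture prescribed by the application.

    import torch
    import torch.nn as nn
    from torch_geometric.nn import GCNConv  # assumes PyTorch Geometric is installed

    class SubgraphPredictor(nn.Module):
        # Sketch of the GNN + MLP composition: the GNN encodes each node of
        # the input graph, and the MLP classifies every node into two classes
        # (belongs / does not belong to the target subgraph).
        def __init__(self, in_dim, hid_dim=64):
            super().__init__()
            self.gnn1 = GCNConv(in_dim, hid_dim)
            self.gnn2 = GCNConv(hid_dim, hid_dim)
            self.mlp = nn.Sequential(
                nn.Linear(hid_dim, hid_dim), nn.ReLU(), nn.Linear(hid_dim, 2)
            )

        def forward(self, x, edge_index):
            h = torch.relu(self.gnn1(x, edge_index))
            h = torch.relu(self.gnn2(h, edge_index))
            # S: n x 2 node classification matrix; row i holds the probabilities
            # that node i belongs / does not belong to the target subgraph.
            return torch.softmax(self.mlp(h), dim=-1)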
The subgraph prediction model in the scheme of the embodiment of the application adopts an artificial intelligence technology, namely, the embodiment of the application provides a target subgraph recognition scheme (namely, the graph data processing method provided by the embodiment of the application) based on artificial intelligence.
Artificial Intelligence (AI) is a theory, method, technique and application system that uses a digital computer or a machine controlled by a digital computer to simulate, extend and expand human Intelligence, perceive the environment, acquire knowledge and use the knowledge to obtain the best results. In other words, artificial intelligence is a comprehensive technique of computer science that attempts to understand the essence of intelligence and produce a new intelligent machine that can react in a manner similar to human intelligence. Artificial intelligence is the research of the design principle and the realization method of various intelligent machines, so that the machines have the functions of perception, reasoning and decision making.
Artificial intelligence technology is a comprehensive subject covering a wide range of fields, including both hardware-level and software-level technologies. Basic artificial intelligence technologies generally include sensors, dedicated artificial intelligence chips, cloud computing, distributed storage, big data processing, operation/interaction systems, and mechatronics. Artificial intelligence software technology mainly includes computer vision, speech processing, natural language processing, and machine learning/deep learning.
The target subgraph identification scheme specifically relates to Machine Learning (ML) technology within artificial intelligence. Machine learning is a multi-field interdisciplinary subject involving probability theory, statistics, approximation theory, convex analysis, algorithmic complexity theory, and other disciplines. It specially studies how a computer can simulate or realize human learning behavior to acquire new knowledge or skills and reorganize existing knowledge structures to continuously improve its own performance. Machine learning is the core of artificial intelligence and the fundamental way to make computers intelligent, and it is applied in all fields of artificial intelligence. Machine learning and deep learning generally include techniques such as artificial neural networks, belief networks, reinforcement learning, transfer learning, inductive learning, and adversarial learning.
With the research and progress of artificial intelligence technology, the artificial intelligence technology is developed and applied in a plurality of fields, such as common smart homes, smart wearable devices, virtual assistants, smart speakers, smart marketing, unmanned driving, automatic driving, unmanned aerial vehicles, robots, smart medical care, smart customer service, and the like.
In an embodiment of the present application, the scheme provided may be implemented based on cloud technology, and the data processing (including but not limited to data computing) involved in each optional embodiment may be implemented using cloud computing. Cloud technology refers to a hosting technology that unifies hardware, software, network, and other resources in a wide area network or a local area network to realize computation, storage, processing, and sharing of data. Cloud technology is a general term for the network technology, information technology, integration technology, management platform technology, application technology, and so on that are applied in the cloud computing business model; it can form a resource pool that is used on demand, flexibly and conveniently. Cloud computing technology will become an important support: background services of technical network systems, such as video websites, image websites, and web portals, require a large amount of computing and storage resources. With the development of the internet industry, each article may have its own identification mark that needs to be transmitted to a background system for logical processing; data of different levels are processed separately, and all kinds of industrial data need strong system background support, which can only be realized through cloud computing.
Cloud computing is a computing model that distributes computing tasks over a resource pool formed by a large number of computers, enabling various application systems to obtain computing power, storage space, and information services as needed. The network that provides the resources is referred to as the "cloud". To the user, the resources in the "cloud" appear to be infinitely expandable, available on demand and at any time, and paid for according to use.
As a basic capability provider of cloud computing, a cloud computing resource pool (generally called a cloud platform, an Infrastructure as a Service (IaaS) platform) is established, and multiple types of virtual resources are deployed in the resource pool for external clients to use. The cloud computing resource pool mainly includes computing devices (virtualized machines, including operating systems), storage devices, and network devices. Divided by logical function, a Platform as a Service (PaaS) layer may be deployed on the IaaS layer, and a Software as a Service (SaaS) layer may be deployed on the PaaS layer, or SaaS may be deployed directly on IaaS. PaaS is a platform on which software runs, such as databases and web containers; SaaS is business software for various websites, such as web portals. In general, SaaS and PaaS are upper layers relative to IaaS.
The following describes the technical solutions of the present application and how to solve the above technical problems in detail with specific embodiments. The following several specific embodiments may be combined with each other, and details of the same or similar concepts or processes may not be repeated in some embodiments. Embodiments of the present application will be described below with reference to the accompanying drawings.
The scheme provided by the embodiment of the present application can be applied to any scenario in which the target subgraph of a graph to be processed needs to be determined. The scheme can be executed by any electronic device: it can be executed by a user terminal device or by a server, where the server may be an independent physical server, a server cluster or distributed system composed of multiple physical servers, or a cloud server providing cloud computing services.
The terminal device may comprise at least one of: smart phones, tablet computers, notebook computers, desktop computers, smart speakers, smart watches, smart televisions, and smart car-mounted devices.
A possible implementation manner is provided in the embodiments of the present application, and as shown in fig. 3, a flowchart of a graph data processing method is provided, where the scheme may be executed by any electronic device, for example, the scheme may be executed by a terminal device, a server, or both the terminal device and the server. For convenience of description, the method provided by the embodiment of the present application will be described below by taking a server as an execution subject. The method may comprise the steps of:
step S110, a graph to be processed is acquired.
The graph to be processed refers to a graph-structured piece of data to be processed, also called the original graph, i.e., the graph whose target subgraph needs to be determined; it is a Graph in the computer-science sense. The graph to be processed includes M nodes and the connection conditions (i.e., edge distribution) among the M nodes, where M is a positive integer greater than 1.
Step S120, inputting the graph to be processed into the trained sub-graph prediction model to obtain a target sub-graph of the graph to be processed, where the sub-graph prediction model is obtained by training in the following manner, as shown in the flowchart in fig. 3, and the training method of the sub-graph prediction model may include the following steps:
step S1201, a training data set is obtained, where the training data set includes a plurality of sample maps.
Wherein the plurality of sample maps are maps having the same attribute.
Step S1202, obtaining an association relationship between nodes of each sample graph.
The graph has a plurality of nodes, the association relationship among the nodes can reflect the correlation among the nodes, and the higher the correlation is, the closer the classification results of the corresponding nodes are.
Optionally, obtaining the association relationship of each node in the graph to be processed may include:
acquiring an adjacency matrix of a graph to be processed;
and determining the association relation of each node in the graph to be processed based on the adjacency matrix.
A graph can be described by its vertex set V and edge set E, written as G(V, E). Accordingly, a one-dimensional array can be used to store all vertex data of the graph, and a two-dimensional array can be used to store the relationships (edges or arcs) between the vertices; this two-dimensional array is called the adjacency matrix. Specifically, each element in the adjacency matrix represents the association relationship between two adjacent nodes.
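A small sketch of this storage scheme, under the assumption of an unweighted, undirected graph:

    import numpy as np

    def adjacency_matrix(num_nodes, edges):
        # Two-dimensional array storing vertex relationships: A[i, j] = 1
        # records an edge between nodes i and j (undirected, so symmetric).
        A = np.zeros((num_nodes, num_nodes), dtype=np.float32)
        for i, j in edges:
            A[i, j] = 1.0
            A[j, i] = 1.0
        return A

    # Example: a 4-node path graph 0-1-2-3.
    A = adjacency_matrix(4, [(0, 1), (1, 2), (2, 3)])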
Step S1203, inputting each sample graph into the initial neural network model, and obtaining a classification result of each node in each sample graph, where, for any node in one sample graph, the classification result represents a probability that the node is a node of a target sub-graph of the sample graph.
For a node, the classification result represents the probability that the node is a node in the target subgraph, and for a sample graph, the node in each node for forming the target subgraph can be determined based on the classification result of each node in the sample graph.
Step S1204, for each sample graph, determining a subgraph connectivity loss corresponding to the sample graph based on the classification result of each node corresponding to the sample graph and the association relations between the nodes in the sample graph.
And the subgraph connectivity loss represents the accuracy of the classification result of each node.
Optionally, for each sample graph, the association relationship includes an adjacency matrix corresponding to the sample graph, and determining a subgraph connectivity loss corresponding to the sample graph based on the classification result of each node corresponding to the sample graph and the association relationship between each node in the sample graph, including:
determining a node classification matrix corresponding to the sample graph according to the classification result of each node of the sample graph, wherein the element of each row of the node classification matrix corresponds to the classification result of one node in the sample graph;
and determining subgraph connectivity loss corresponding to the sample graph according to the node classification matrix and the adjacency matrix corresponding to the sample graph.
The adjacency matrix of a graph is a matrix representing the adjacency between vertices. The association relation of the nodes can be embodied by the connection relations between them, so the association relations between the nodes of a graph can be represented by the adjacency matrix of the graph.
Optionally, determining a subgraph connectivity loss corresponding to the sample graph according to the node classification matrix and the adjacency matrix corresponding to the sample graph, including:
determining a connectivity result of a subgraph of the sample graph according to the node classification matrix and the adjacency matrix;
and determining subgraph connectivity loss corresponding to the sample graph based on the connectivity result of the subgraph and the constraint condition of the connectivity result of the subgraph.
The connectivity result of the subgraph represents the connectivity among the nodes classified into the target subgraph, i.e., whether the nodes belonging to the target subgraph are associated with one another. Considering the discreteness of the nodes, a target subgraph determined only from the classification results of the individual nodes is not accurate enough, because the association relations between the nodes are not taken into account when determining the target subgraph. Therefore, during model training, the connectivity result of the subgraph is constrained by a constraint condition on the subgraph connectivity result; on the one hand, this improves the accuracy of the subgraph classification result, and on the other hand, it constrains the classification results of adjacent nodes to be the same.
Optionally, the constraint condition on the subgraph connectivity result may be the adjacency matrix of the sample graph. The adjacency matrix can reflect the association relations between nodes, and can therefore constrain the connectivity result of the subgraph.
As an example, the node classification matrix is a matrix of n × 2, n is the number of nodes in the sample graph, and the ith row element represents the probability that the ith node belongs to the target subgraph, that is, the classification result of the ith node, where i is an integer greater than or equal to 1 and less than or equal to n.
For example, the node classification matrix can be expressed as:

    S = [ 1 0
          0 1
          ... ]

where the first row [1 0] indicates that the corresponding node belongs to the target subgraph, and the second row [0 1] indicates that the corresponding node does not belong to the target subgraph.
For a sample graph, a node has two possible classification results: the node belongs to the target subgraph, or the node does not belong to the target subgraph. The subgraph connectivity loss may include a target subgraph connectivity loss, characterizing the accuracy of the classification results of the nodes belonging to the target subgraph, and a non-target subgraph connectivity loss, characterizing the accuracy of the classification results of the nodes not belonging to the target subgraph.
Optionally, the expression of the subgraph connectivity loss is as follows:
L_con(g(G; θ)) = ‖Norm(SᵀAS) - I₂‖_F
where G is a sample graph, θ is the model parameter of the neural network model, g(G; θ) is the predicted target subgraph, L_con(g(G; θ)) denotes the subgraph connectivity loss, Norm denotes row normalization of a matrix, S denotes the node classification matrix, A denotes the adjacency matrix, I₂ is the 2 × 2 identity matrix, Sᵀ is the transpose of S, and ‖·‖_F is the Frobenius norm.
The Frobenius norm always takes a unique real value and has a minimum, so the subgraph connectivity loss is characterized by the Frobenius norm. Each row of the node classification matrix S corresponds to the classification result of one node in the sample graph. Sᵀ is a 2 × n matrix: one of its rows collects, for every node, the probability of belonging to the target subgraph, and the other row the probability of not belonging to it. Since S has 2 columns, SᵀAS is a 2 × 2 matrix; its first row corresponds to the nodes classified as belonging to the target subgraph and its second row to the nodes classified as not belonging to it, and after row normalization the two elements of each row sum to 1.
In SᵀAS, the first element of the first row represents, over the nodes belonging to the target subgraph, the sum of the probabilities that each node belongs to the target subgraph, and the second element of the first row represents, over those same nodes, the sum of the probabilities that each node does not belong to it. For example, if the graph to be processed has 10 nodes in total and 4 of them belong to the target subgraph, the first element of the first row is the sum, over those 4 nodes, of the probabilities of belonging to the target subgraph, and the second element of the first row is the sum, over those 4 nodes, of the probabilities of not belonging to it.
Similarly, the first element of the second row represents, over the nodes not belonging to the target subgraph, the sum of the probabilities that each node belongs to the target subgraph, and the second element of the second row represents the sum of the probabilities that each node does not belong to it. Continuing the example, these are the corresponding sums over the 6 nodes other than the 4 nodes above.
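For illustration, this loss can be sketched in PyTorch as follows; the small epsilon added for numerical stability is an assumption, and the placement of the row normalization follows the formula and explanation above.

    import torch

    def connectivity_loss(S, A):
        # S: n x 2 node classification matrix (per-node softmax outputs).
        # A: n x n adjacency matrix of the sample graph.
        M = S.t() @ A @ S                              # 2 x 2 connectivity result
        M = M / (M.sum(dim=1, keepdim=True) + 1e-8)    # Norm: row normalization
        I2 = torch.eye(2, device=S.device)
        return torch.norm(M - I2, p="fro")             # ||Norm(S^T A S) - I_2||_F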
Step S1205, training the initial neural network model based on the subgraph connectivity loss corresponding to each sample graph and the training data set until a preset training end condition is met, and determining the neural network model at the end of training as the subgraph prediction model.
Considering the discreteness among nodes, a target subgraph determined based only on the classification results of individual nodes is not accurate enough; that is, the target subgraph may include nodes with no association relation or with only weak association relations, making the trained subgraph prediction model unstable. Therefore, when training the model, the subgraph connectivity loss is determined based on the classification result of each node and the association relations between the nodes, and the stability of the model can be improved based on this loss, so that the target subgraph determined by the model is more accurate.
In practical application, if the training loss does not meet the training end condition, the model parameters of the initial neural network model are adjusted based on the training loss, and the adjusted neural network model is trained based on the training data set.
In an alternative of the present application, the training end condition may be configured based on the actual requirement, for example, the training loss is represented by a real number, and the training end condition may be that the training loss is smaller than the set threshold. And when the training loss is less than the set threshold, the training loss value meets the training ending condition, and the training is ended to obtain the sub-graph prediction model. And when the training loss is not less than the set threshold, the training loss value does not meet the training end condition, the model parameters of the initial neural network model need to be adjusted, the adjusted model continues to be trained on the basis of the training data set until the obtained training loss meets the training end condition, and the training is ended.
In an alternative of the present application, the training end condition may also be a loss function convergence, i.e. the training loss is a loss function.
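A hypothetical training loop matching these end conditions, reusing the SubgraphPredictor and connectivity_loss sketches above; the threshold, learning rate, input dimension, and the form of training_set are all assumptions.

    model = SubgraphPredictor(in_dim=16)                  # from the earlier sketch
    optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)
    LOSS_THRESHOLD = 0.05                                 # assumed set threshold

    for epoch in range(1000):
        total = 0.0
        for x, edge_index, A in training_set:             # assumed iterable of sample graphs
            S = model(x, edge_index)
            loss = connectivity_loss(S, A)
            optimizer.zero_grad()
            loss.backward()
            optimizer.step()
            total += loss.item()
        if total / len(training_set) < LOSS_THRESHOLD:    # training end condition met
            break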
As an alternative, the model architecture of the initial neural network model is not limited in this embodiment, and may be any initial neural network model that can be used to determine the predicted target subgraph of a sample graph, such as a GraphSAGE (Graph SAmple and aggreGatE) network, a GAT (Graph Attention Network), or a GIN (Graph Isomorphism Network).
According to the above scheme, for a graph to be processed whose target subgraph needs to be identified, the graph can be input into the trained subgraph prediction model, and the target subgraph is identified by the model; since training only requires the sample graphs in the training data set, the subgraph prediction model can identify the target subgraph of the graph to be processed without any sample subgraphs (labeled subgraphs). Meanwhile, during model training, the subgraph connectivity loss is determined based on the classification results of the nodes and the association relations between the nodes; because these association relations are taken into account when determining the subgraph connectivity loss, the subgraph prediction model trained on the subgraph connectivity losses of the sample graphs and the training data set can identify the target subgraph more accurately.
In order to make the attributes of the target subgraph consistent with those of the graph to be processed, the attributes of the predicted target subgraph and the sample graph need to be strongly correlated during model training, so that the target subgraph identified by the trained model is consistent in attributes with the graph to be processed. Therefore, during model training, a predicted target subgraph of a sample graph can be obtained based on the classification results of the nodes of the sample graph, the attribute loss between the sample graph and the predicted target subgraph is determined based on the predicted target subgraph and the attribute label of the sample graph, and the model is trained based on this attribute loss, so that when determining the target subgraph, the trained model constrains the correlation between the attributes of the target subgraph and those of the graph to be processed.
In an embodiment of the present application, each sample graph carries an attribute label, and for each sample graph, the method further includes:
obtaining a prediction target subgraph of the sample graph based on the classification result of each node of the sample graph;
and determining attribute loss between the sample graph and the prediction target subgraph based on the attribute labels of the prediction target subgraph and the sample graph, wherein the attribute loss characterizes the difference between the attribute of the prediction target subgraph and the attribute of the sample graph.
Training the initial neural network model based on the subgraph connectivity loss and the training data set of each sample graph, which may include:
and training the initial neural network model based on the subgraph connectivity loss, the attribute loss and the training data set corresponding to each sample graph.
The attribute label characterizes an attribute of the sample graph, and the attribute may reflect a characteristic of some aspect of an object. As an example, if the sample graph corresponds to a drug molecule, the corresponding attribute of the sample graph may be a drug property, such as fever reduction or inflammation reduction.
Because the target subgraph structurally retains part of the structure of the sample graph and can embody the attributes of the sample graph, in order to ensure that the target subgraph determined by the model is more accurate, the attribute difference between the predicted target subgraph and the sample graph can be used as the attribute loss during model training, and the performance of the model is improved through this attribute loss; that is, the attribute loss constrains the attributes of the target subgraph to be consistent with those of the corresponding sample graph. As an example, suppose the graph to be processed corresponds to a drug molecule and its attribute is fever reduction; since the target subgraph of the graph to be processed embodies the attribute of the graph to be processed, the attribute of the target subgraph should also be fever reduction. For this reason, during model training, the performance of the model may be improved based on the attribute loss, so that the attributes of the target subgraph identified by the model are consistent with the attributes of the graph to be processed.
For a drug molecule, its molecular functional groups determine its chemical properties. Therefore, when the graph to be processed corresponds to a drug molecule, the target subgraph can be the subgraph corresponding to a molecular functional group of the drug molecule. Based on the scheme of the embodiment of the present application, the molecular functional group of a drug molecule can then be accurately identified with the trained subgraph prediction model, providing data support for drug research based on molecular functional groups.
Mutual information can measure the information-containing relationship between different variables, so the information-containing relationship between the attributes of the prediction target subgraph and the sample graph can be measured through mutual information. That is, when determining the attribute loss, first mutual information between the prediction target subgraph and the attribute label of the sample graph can be determined based on the prediction target subgraph and the attribute label of the sample graph, and the relationship between the attributes of the prediction target subgraph and the attributes of the sample graph is constrained through this first mutual information.
In an optional embodiment of the present application, since the classification result of a node characterizes the probability that the node is a node in the target subgraph, for a sample graph, obtaining the predicted target subgraph of the sample graph based on the classification results of its nodes may include at least one of the following implementation manners:
in an embodiment, for example, a sample graph includes M nodes, and a node with a probability greater than or equal to a preset probability threshold in the M nodes may be determined as a node used for forming a predicted target sub-graph of the sample graph, that is, a node with a probability greater than or equal to the preset probability threshold is classified into one class, and a node with a probability less than the preset probability threshold is classified into another class, where the class is a node not belonging to the target sub-graph.
In another embodiment, the probabilities of the nodes may be sorted from large to small, that is, the M nodes are sorted, and the nodes in the top N bits are selected as the nodes for forming the prediction target sub-graph of the sample graph according to the sorting result, that is, the nodes in the top N bits are classified into one class, and the other nodes are classified into another class. At this time, N may be determined according to M, for example, a set ratio of the number of nodes in the target subgraph to the number of nodes in the original graph is obtained according to the experimental data, and a value of N is determined according to M and the set ratio.
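The two selection strategies can be sketched as follows; treating column 0 of the node classification matrix as the "belongs to the target subgraph" probability is an assumption about column order.

    import torch

    def select_subgraph_nodes(S, threshold=0.5, top_n=None):
        prob = S[:, 0]                    # assumed: probability of belonging
        if top_n is not None:
            # Embodiment 2: sort by probability and keep the top-N nodes.
            return torch.topk(prob, k=top_n).indices
        # Embodiment 1: keep nodes whose probability reaches a preset threshold.
        return torch.nonzero(prob >= threshold).squeeze(-1)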
Optionally, determining an attribute loss between the sample graph and the prediction target sub-graph based on the attribute labels of the prediction target sub-graph and the sample graph may include:
acquiring node characteristics of each node in the predicted target subgraph;
obtaining sub-graph characteristics of the prediction target sub-graph by fusing the node characteristics of each node in the prediction target sub-graph;
determining attribute information of a prediction target subgraph according to the subgraph characteristics;
and determining the attribute loss between the sample graph and the prediction target subgraph according to the attribute information of the prediction target subgraph and the attribute label of the sample graph.
The attribute information of the predicted target subgraph (the predicted attribute shown in fig. 4) may be obtained by an attribute prediction model, which is used to predict the attribute information of a graph. Specifically, referring to the model structure diagram of the attribute prediction model shown in fig. 4, for each sample graph, the predicted target subgraph of the sample graph (i.e., the predicted target subgraph Gsub in fig. 4) is input into the attribute prediction model, which includes a graph neural network (GCN) and a multilayer fully connected network (q in fig. 4). First, the subgraph feature x_sub of the predicted target subgraph Gsub is extracted based on the graph neural network; x_sub is obtained by fusing the node features of all nodes in the predicted target subgraph Gsub. Then, x_sub is input into the multilayer fully connected network for processing, obtaining the attribute information of the predicted target subgraph. Based on the attribute information of the predicted target subgraph and the attribute label of the sample graph, the attribute loss between the sample graph and the predicted target subgraph can be determined.
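The pipeline of fig. 4 can be sketched as below; mean pooling as the fusion operation and cross-entropy as the concrete attribute loss are assumptions, since the application only specifies fusing node features and comparing the predicted attribute with the attribute label.

    import torch
    import torch.nn as nn
    from torch_geometric.nn import GCNConv  # assumes PyTorch Geometric

    class AttributePredictor(nn.Module):
        def __init__(self, in_dim, hid_dim=64, num_attrs=2):
            super().__init__()
            self.gcn = GCNConv(in_dim, hid_dim)  # extracts node features of Gsub
            self.q = nn.Sequential(              # the multilayer fully connected network q
                nn.Linear(hid_dim, hid_dim), nn.ReLU(), nn.Linear(hid_dim, num_attrs)
            )

        def forward(self, node_feats, edge_index_sub):
            h = torch.relu(self.gcn(node_feats, edge_index_sub))
            x_sub = h.mean(dim=0)                # fuse node features into x_sub
            return self.q(x_sub)                 # predicted attribute information

    # Attribute loss against the sample graph's attribute label (assumed form):
    # attr_loss = nn.functional.cross_entropy(pred.unsqueeze(0), label)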
In an alternative embodiment, the attribute prediction model may be trained before the initial sub-graph prediction model is trained, or may be obtained by synchronous training with the sub-graph prediction model, that is, in the process of training the sub-graph prediction model, the attribute prediction model is trained simultaneously.
Specifically, a plurality of sample subgraphs carrying attribute labels can be used as training data to train an attribute prediction model, specifically, each sample subgraph is input into an initial network, and subgraph features of each sample subgraph are extracted; for each sample subgraph, obtaining the prediction attribute of the sample subgraph based on the subgraph feature of the sample subgraph; and determining attribute loss based on the prediction attributes of each sample subgraph and the corresponding attribute labels, obtaining an attribute prediction model when the attribute loss meets the training end condition, adjusting model parameters of the initial network based on the attribute loss when the attribute loss does not meet the training end condition, and training the adjusted model based on training data.
In the scheme of the embodiment of the present application, the information correlation between two graphs reflects the information-containment relationship between them: the more information two graphs share, the greater the correlation between them; conversely, the smaller the correlation, the less information is shared. Therefore, during model training, the information correlation between a sample graph and its corresponding predicted target subgraph can be used as the correlation loss, and the model is trained based on this loss, so that when the trained model identifies the target subgraph of the graph to be processed, the target subgraph contains as much of the useful information in the graph to be processed as possible.
Based on this, when the model is trained, the training loss includes the correlation loss between the prediction target subgraph and the sample graph. Specifically, the correlation loss can be determined in the following manner:
in one embodiment of the present application, for each sample graph, the method further comprises:
obtaining a prediction target subgraph of the sample graph based on the classification result of each node of the sample graph;
determining the correlation loss between the prediction target subgraph and the sample graph, wherein the correlation loss characterizes the correlation between the prediction target subgraph and the sample graph;
training an initial neural network model based on the subgraph connectivity loss and the training data set of each sample graph, wherein the training comprises the following steps:
and training the initial neural network model based on the subgraph connectivity loss, the correlation loss and the training data set corresponding to each sample graph.
It can be understood that the training loss may include the subgraph connectivity loss, the attribute loss and the correlation loss at the same time. In that case, when target subgraph recognition is performed on the graph to be processed based on the trained model, the attributes of the target subgraph are consistent with those of the graph to be processed, and the target subgraph contains as much of the useful information in the graph to be processed as possible.
To make the target subgraph contain as much of the useful information in the graph to be processed as possible, the information correlation between the prediction target subgraph and the sample graph needs to be made as large as possible during model training; the target subgraph identified by the trained model then retains as much of the useful information in the graph to be processed as possible.
As can be seen from the foregoing description, mutual information can measure the information-containing relationship between different variables. The information correlation between the prediction target subgraph and the sample graph can therefore be measured by mutual information: when determining the correlation loss, the information correlation between the prediction target subgraph and the sample graph can be constrained based on the second mutual information between them.
Optionally, for any sample graph, determining a correlation loss between the prediction target sub-graph and the sample graph may include:
extracting sub-graph features of the predicted target sub-graph and sample graph features of the sample graph;
and determining the correlation loss between the prediction target subgraph and the sample graph based on the subgraph characteristics and the sample graph characteristics.
The subgraph features of the predicted target subgraph and the sample graph features of the sample graph can be extracted through the graph neural network.
In an embodiment of the present application, for any sample graph, determining a loss of correlation between the prediction target subgraph and the sample graph based on the subgraph features and the sample graph features may include:
splicing sub-graph features corresponding to the sample graph and sample graph features, and determining first correlation loss between a prediction target sub-graph and the sample graph based on the spliced features;
acquiring sub-graph features corresponding to other sample graphs except the sample graph;
splicing the sample graph features of the sample graph with sub-graph features corresponding to each other sample graph, and determining second relevancy loss between the sample graph and each other sample graph based on the spliced features;
and determining the correlation loss between the prediction target subgraph and the sample graph according to the first correlation loss and each second correlation loss.
For a sample graph, the second correlation losses computed with respect to the other sample graphs are inter-class losses: the larger the difference between different classes, the more dissimilar the information. The first correlation loss is an intra-class loss: the smaller the loss within the same class, the more similar the information. A correlation loss determined from the first correlation loss and each second correlation loss can therefore describe the information correlation between the prediction target subgraph and the sample graph more accurately.
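The splicing of positive and negative feature pairs described above can be illustrated with the following sketch; mlp2 stands for any network scoring a spliced feature vector, and all names and shapes are assumptions:

```python
import torch

def correlation_scores(x_graph, x_sub_all, mlp2, i):
    """x_graph: (d,) feature of sample graph G_i; x_sub_all: (N, d) subgraph
    features of all prediction target subgraphs; i: index of this sample.
    Returns the intra-class (positive) score and inter-class (negative) scores."""
    pos = mlp2(torch.cat([x_graph, x_sub_all[i]]))      # first correlation term
    negs = torch.stack([mlp2(torch.cat([x_graph, x_sub_all[j]]))
                        for j in range(x_sub_all.shape[0]) if j != i])
    return pos, negs                                    # second correlation terms
```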
Optionally, the information correlation between the prediction target sub-graph and the sample graph may be described by the second mutual information.
In embodiments of the present application, the correlation loss may be determined directly based on the subgraph features and the sample graph features. Alternatively, the correlation loss between the prediction target subgraph and the sample graph may be approximated by a trained correlation prediction model, i.e., the information correlation between the prediction target subgraph and the sample graph is determined based on the correlation prediction model. Specifically, the subgraph features and the sample graph features may first be spliced (or aggregated), and the spliced feature vector is then processed to obtain the information correlation.
Referring to the model structure diagram of the correlation prediction model shown in fig. 5, for each sample graph, the sample graph (the sample graph G in fig. 5) and the prediction target subgraph of the sample graph (the prediction target subgraph Gsub in fig. 5) are input into the correlation prediction model, which includes a graph neural network GCN and a fully-connected network MLP2. First, the sample graph feature x of the sample graph G and the subgraph feature x_sub of the prediction target subgraph Gsub are extracted based on the graph neural network; the sample graph feature x is composed of the node features of the nodes in the sample graph G, and the subgraph feature x_sub is composed of the node features of the nodes in the prediction target subgraph Gsub. Then, the sample graph feature x and the subgraph feature x_sub are input into the fully-connected network MLP2 for processing, and the information correlation between the sample graph and the prediction target subgraph is obtained.
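Under the same assumptions as the earlier sketches (dense adjacency, mean pooling, a single convolution step standing in for the GCN), this structure might be written as:

```python
import torch
import torch.nn as nn

class CorrelationPredictor(nn.Module):
    """GCN feature extractor + MLP2 scoring the spliced (x, x_sub) features."""
    def __init__(self, in_dim, hidden_dim):
        super().__init__()
        self.gcn = nn.Linear(in_dim, hidden_dim)   # stands in for the GCN extractor
        self.mlp2 = nn.Sequential(
            nn.Linear(2 * hidden_dim, hidden_dim), nn.ReLU(),
            nn.Linear(hidden_dim, 1))

    def node_feats(self, a_hat, x):
        return torch.relu(self.gcn(a_hat @ x))

    def forward(self, a_hat, x, a_hat_sub, x_sub_nodes):
        x_g = self.node_feats(a_hat, x).mean(dim=0)                  # sample graph feature x
        x_sub = self.node_feats(a_hat_sub, x_sub_nodes).mean(dim=0)  # subgraph feature x_sub
        return self.mlp2(torch.cat([x_g, x_sub]))                    # information correlation score
```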
In an optional embodiment, the correlation prediction model may be trained before the initial sub-graph prediction model is trained, or may be trained synchronously with the sub-graph prediction model, that is, the correlation prediction model is trained simultaneously in the process of training the sub-graph prediction model.
In an embodiment, the correlation prediction model may be specifically obtained by training in the following manner:
acquiring a training data set, the training data set comprising a plurality of sample graphs, the target subgraph corresponding to each sample graph, and a correlation labeling result between each sample graph and its corresponding target subgraph; training an initial network model based on the training data set, specifically: for each sample graph, extracting a first feature of the sample graph and a second feature of its corresponding target subgraph, and determining a correlation prediction result between the sample graph and the corresponding target subgraph based on the first feature and the second feature, the correlation prediction result characterizing the information correlation between the sample graph and the corresponding target subgraph; and determining a correlation training loss based on the correlation prediction result and the correlation labeling result corresponding to each sample graph. When the correlation training loss meets the training end condition, the correlation prediction model is obtained; when it does not, the model parameters of the initial network model are adjusted based on the correlation training loss and the adjusted model is trained further on the training data set.
Optionally, the training end condition may be that the correlation prediction result is smaller than an information amount threshold. The information amount of the target subgraph is thus constrained by the threshold, so that the target subgraph contains as much of the useful information in the sample graph as possible while noise and redundant information are filtered out of it.
The scheme provided by the embodiment of the application is applicable to any application scenario in which the target subgraph of a graph to be processed needs to be identified. With this scheme, the graph to be processed can be input into the subgraph prediction model to obtain its target subgraph; the subgraph prediction model is not trained on labeled subgraphs, which improves data processing efficiency. In addition, the attributes of the target subgraph obtained through the model are consistent with those of the graph to be processed, the target subgraph contains as much of the useful information in the graph to be processed as possible, and the information amount of the target subgraph is constrained so that noise and redundant information are filtered out. To better understand the scheme provided by the embodiment of the present application, the training of the subgraph prediction model is further described below with reference to a specific implementation example.
Fig. 6 and fig. 7 show the model structure diagrams of the prediction models; the training method of the subgraph prediction model includes, but is not limited to, the following steps:
step S1, a training data set is obtained, the training data set including a plurality of sample maps.
Step S2: the association relationship between the nodes of each sample graph (the adjacency matrix A shown in fig. 6) is acquired.
Step S3: each sample graph is input into the initial neural network model to obtain the classification result of each node in each sample graph.
In this example, a sample graph (sample graph G shown in fig. 6 and 7) is taken as an example to specifically describe how to input the sample graph into the initial neural network model to obtain the classification result of each node in the sample graph.
The initial neural network model comprises a graph neural network GNN and a fully-connected network MLP1. The graph neural network comprises a graph convolution network GCN, through which the node features of each node in the graph are extracted; the node features of the nodes can be characterized by X_l:

X_l = GCN(A, X_{l-1}; θ_1)    (1)

where A is the adjacency matrix of the sample graph G, l is the number of layers of the graph neural network, and θ_1 is the parameter of the graph neural network GNN.
Then, the node characteristics of each node are input into the multi-layer fully-connected network MLP1, and the classification result of each node is obtained.
Step 4: a node classification matrix (S shown in fig. 6, Select shown in fig. 7) corresponding to the sample graph is determined according to the classification result of each node of the sample graph; the number of rows of the node classification matrix S equals the number of nodes in the sample graph, and the elements of each row of the node classification matrix represent the classification result of one node in the sample graph.
In this example, the classification is a 2-way classification, and the node classification matrix S determined based on the classification result of each node is expressed as:

S = MLP(X_l; θ_2)    (2)

where S is an n × 2 matrix, n is the number of nodes in the sample graph, MLP is the fully-connected network MLP1 shown in figs. 6 and 7, and θ_2 is the parameter of the fully-connected network MLP1. The i-th row of the matrix represents the probability that the i-th node belongs (or does not belong) to the prediction target subgraph.
Step 5: based on the classification result of each node, the prediction target subgraph G_sub of the sample graph G (the IB-subgraph shown in fig. 7) can be determined. When the elements of the matrix S are 0 or 1, the features of the prediction target subgraph are given by the first row of S^T X_l.
Alternatively, the prediction target subgraph G_sub of the sample graph G may be determined based on the graph information bottleneck model (Bottleneck shown in fig. 7); the graph information bottleneck model is described in detail below.
For convenience of the following description, the scheme for determining the prediction target subgraph is denoted as G_sub = g(G; θ).
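Equations (1)-(2) and the selection of the subgraph features can be sketched as follows; gcn and mlp1 are assumed callables matching the networks described above:

```python
import torch

def node_classification(a_hat, x_prev, gcn, mlp1):
    """Equations (1)-(2): X_l = GCN(A, X_{l-1}; theta_1), S = MLP(X_l; theta_2)."""
    x_l = gcn(a_hat, x_prev)                 # node features of each node, (n, d)
    s = torch.softmax(mlp1(x_l), dim=1)      # node classification matrix, (n, 2)
    return x_l, s

def predicted_subgraph_feature(s, x_l):
    """When the elements of S are 0 or 1, the first row of S^T X_l sums the
    features of the nodes classified into the prediction target subgraph."""
    return (s.t() @ x_l)[0]
```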
Step 6: for each sample graph, the subgraph connectivity loss corresponding to the sample graph G is determined according to the node classification matrix corresponding to the sample graph and the association relationship between the nodes in the sample graph (in this example, the association relationship between nodes is characterized by the adjacency matrix).
The subgraph connectivity loss characterizes the accuracy of the classification result of each node, namely whether the predicted target subgraph determined in the foregoing is accurate or not. Subgraph connectivity loss can be represented as:
L_con(g(G; θ)) = ||Norm(S^T A S) − I_2||_F    (3)

where G is the sample graph, θ is the model parameter of the neural network model, g(G; θ) is the prediction target subgraph, L_con(g(G; θ)) denotes the subgraph connectivity loss, Norm denotes row normalization of a matrix, S denotes the node classification matrix, A denotes the adjacency matrix, I_2 is the 2 × 2 identity matrix, S^T is the transpose of S, and ||·||_F is the Frobenius norm. θ is the model parameter of the initial neural network model and comprises the parameters θ_1 and θ_2.
The matrix S has 2 columns, so S^T A S is a 2 × 2 matrix. In S, the first column element of each row represents the probability that the corresponding node belongs to the target subgraph, and the second column element represents the probability that it does not; the two probabilities in each row sum to 1.
Through the identity matrix I_2, connections among the nodes belonging to the prediction target subgraph and among the nodes not belonging to it are encouraged, while connections between the two groups are suppressed.
Minimizing L_con(g(G; θ)) causes the elements of the matrix S to converge to 0 or 1 on the one hand, and constrains neighboring nodes to receive the same classification result on the other.
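As a concrete illustration of equation (3), assuming dense tensors and the implied row normalization:

```python
import torch

def connectivity_loss(s, a):
    """Equation (3): L_con = ||Norm(S^T A S) - I_2||_F, Norm = row normalization."""
    m = s.t() @ a @ s                                    # 2 x 2 matrix S^T A S
    m = m / m.sum(dim=1, keepdim=True).clamp(min=1e-12)  # row-normalize
    return torch.norm(m - torch.eye(2, device=s.device), p='fro')
```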
In the embodiment of the application, in the process of training the sub-graph prediction model, a prediction target sub-graph of a sample graph needs to embody a specific attribute Y of the sample graph, the prediction target sub-graph contains useful information in the sample graph as much as possible, and meanwhile, the prediction target sub-graph filters noise and redundant information in the sample graph. Therefore, during model training, a graph information bottleneck model is provided in the embodiment of the application to describe the requirements, and the subgraph prediction model is continuously optimized based on the graph information bottleneck model, so that the subgraph prediction model obtained through training meets the requirements.
The graph information bottleneck model is represented as:
max_{G_sub} I(G_sub, Y)  s.t.  I(G, G_sub) ≤ I_c    (4)

where G_sub is the prediction target subgraph, Y is the attribute of the sample graph G, s.t. denotes "subject to", I_c is the information amount threshold, and I(·,·) is the mutual information between two variables; the information-containing relationship between different variables is measured by the mutual information.
The first objective function, max_{G_sub} I(G_sub, Y), requires the prediction target subgraph to be strongly correlated with the attribute Y of the sample graph G, i.e., I(G_sub, Y) is maximized. The second objective function, s.t. I(G, G_sub) ≤ I_c, serves as the constraint of the first objective function: while the prediction target subgraph is made to contain as much of the useful information in the sample graph as possible, it also filters out noise and redundant information in the sample graph. That is, I(G, G_sub) constrains the information correlation between the prediction target subgraph and the sample graph so that the prediction target subgraph contains as much of the useful information in the sample graph as possible, and I(G, G_sub) ≤ I_c constrains the information amount contained in the prediction target subgraph to be no greater than the information amount threshold, so that the prediction target subgraph filters out noise and redundant information in the sample graph.
By means of the Lagrange multiplier method, the constrained optimization problem of the graph information bottleneck model can be converted into an unconstrained optimization problem, and the converted graph information bottleneck model is expressed as follows:
min_{G_sub} −I(G_sub, Y) + β·I(G, G_sub)    (5)

where β is the Lagrangian multiplier.
As can be seen from the above equation, the objective splits into two parts. The first part, I(G_sub, Y), constrains the attribute of the prediction target subgraph. Since the target subgraph of a sample graph needs to reflect the attribute of the sample graph, the attribute loss between the prediction target subgraph and the sample graph can be obtained during model training by optimizing this first mutual information; taking the attribute loss as a training loss, the subgraph prediction model is optimized so that the target subgraph predicted by the trained model is strongly associated with the attribute of the graph to be processed. In addition, since the target subgraph of a sample graph needs to contain as much of the useful information in the sample graph as possible while filtering out noise and redundant information, the second mutual information I(G, G_sub) can be optimized during model training to obtain the correlation loss between the prediction target subgraph and the sample graph; taking the correlation loss as a training loss, the subgraph prediction model is optimized so that the target subgraph predicted by the trained model contains as much of the useful information in the graph to be processed as possible while filtering out its noise and redundant information.
Step 7: each sample graph carries an attribute label, which characterizes the attribute of the sample graph; the attribute loss corresponding to the sample graph is determined based on the prediction target subgraph of the sample graph and the attribute label of the sample graph.
Mutual information can measure the information-containing relationship between different variables, so the information-containing relationship between the prediction target subgraph and the attribute of the sample graph can be measured through mutual information. That is, when determining the attribute loss, the first mutual information between the prediction target subgraph and the attribute label of the sample graph can be determined based on the prediction target subgraph and the attribute label, and the relationship between the attribute of the prediction target subgraph and the attribute of the sample graph is constrained through the first mutual information.
Optionally, the first mutual information I(G_sub, Y) is determined based on the prediction target subgraph of each sample graph and the attribute label of the corresponding sample graph; the first mutual information characterizes the correlation between the prediction target subgraph and the attribute label.
The first mutual information may be specifically expressed as:
I(G_sub, Y) = ∫ p(y, G_sub) log p(y|G_sub) dy dG_sub + H(Y)    (6)

where G_sub is the prediction target subgraph, Y denotes the attribute of the sample graphs in the training data set, y is the attribute of a particular sample graph in the training data set, p(y, G_sub) is the joint distribution between the attribute of a sample graph and the corresponding prediction target subgraph, characterizing the probability that the prediction target subgraph G_sub has the attribute y, and p(y|G_sub) is the posterior probability that the attribute of the prediction target subgraph is y. H(Y) is the entropy of Y; since H(Y) is unrelated to the solution of the prediction target subgraph, it can be ignored when computing the first mutual information.
Since mutual information is difficult to compute directly, the first mutual information needs to be optimized into a form convenient to compute. In one embodiment of the present application, the first mutual information is optimized in the following manner to obtain the attribute loss.
Specifically, each sample graph and its corresponding target subgraph are taken as a positive sample, and in a positive sample the attribute of the sample graph is the same as that of the target subgraph. The joint distribution p(y, G_sub) is approximated based on the positive samples, and a variational approximation q_φ(y|G_sub) is used to approximate p(y|G_sub); the network corresponding to q_φ predicts the attribute information of the prediction target subgraph, and the attribute loss is determined based on this attribute information and the attribute of the corresponding sample graph.
In this example, through variational estimation, a lower bound function corresponding to the first mutual information can be obtained:

I(G_sub, Y) ≥ (1/N) Σ_{i=1..N} log q_φ(y_i | G_sub,i)    (7)

where N is the number of sample graphs in the training data set, y_gt is the ground-truth attribute of a sample graph, y_i is the attribute of the i-th sample graph in the training data set, G_sub,i denotes the prediction target subgraph corresponding to the i-th sample graph, q_φ(y|G_sub) is the predicted attribute information of the prediction target subgraph, and φ is the parameter of the network q. The attribute loss (L_cls shown in figs. 6 and 7) is defined as

L_cls ≜ −(1/N) Σ_{i=1..N} log q_φ(y_i | G_sub,i)

where ≜ is the assignment (definition) operator.
Through the lower bound function, the optimization of the first mutual information can be converted into a conventional classification problem, i.e., the problem of optimizing the mutual information is converted into the problem of optimizing the lower bound function: I(G_sub, Y) is maximized by minimizing the attribute loss L_cls.
In an alternative of this embodiment, q_φ(y|G_sub) may be regarded as a multilayer fully-connected network (q shown in figs. 6 and 7), and the attribute loss is determined based on an attribute prediction model built from this network; the attribute prediction model is used for predicting the attribute (also called attribute information) of a graph and the attribute loss. As shown in figs. 6 and 7, the attribute prediction model includes an aggregation module (aggregation module 2 shown in fig. 7) and the fully-connected network q; it should be noted that the functions implemented by the aggregation module may also be integrated into the network q.
Specifically, the node features of the nodes in the prediction target subgraph are aggregated through the aggregation module shown in fig. 6 to obtain the subgraph feature of the prediction target subgraph (x_sub shown in figs. 6 and 7); based on the subgraph feature x_sub, the network q outputs the attribute of the prediction target subgraph, and the attribute loss (L_cls shown in figs. 6 and 7) is obtained based on the attribute of the prediction target subgraph and the attribute of the sample graph.
Optionally, for discrete graph attribute labels, a cross-entropy loss may be used to represent the attribute loss; for continuous graph attribute values, a squared-error loss may be used.
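Both choices can be expressed compactly; the tensor shapes assumed here follow the earlier sketches:

```python
import torch
import torch.nn.functional as F

def attribute_loss(pred, label, discrete=True):
    """L_cls: cross entropy for discrete attribute labels, squared error for
    continuous attribute values (the two choices described above)."""
    if discrete:
        return F.cross_entropy(pred.unsqueeze(0), label.view(1))
    return F.mse_loss(pred, label)
```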
As noted above, the target subgraph of a sample graph needs to contain as much of the useful information in the sample graph as possible while filtering out its noise and redundant information. Therefore, the second mutual information I(G, G_sub) can be optimized during model training to obtain the correlation loss between the prediction target subgraph and the sample graph, and this correlation loss is used as a training loss to optimize the subgraph prediction model.
Step 8: the correlation loss between the prediction target subgraph and the sample graph is determined.
Specifically, the correlation loss may be determined directly based on the subgraph features and the sample graph features. Alternatively, a graph mutual information estimation method in the Donsker-Varadhan form from large deviation theory is adopted: the trained correlation prediction model approximates the second mutual information to obtain the correlation loss, i.e., the correlation loss between the prediction target subgraph and the sample graph is determined based on the correlation prediction model, which determines the information correlation between two graphs and thereby the correlation loss. Specifically, the subgraph features (x_sub shown in figs. 6 and 7) and the sample graph features (x shown in figs. 6 and 7) may first be spliced (or aggregated, which can be realized by aggregation module 1 in the correlation prediction model in fig. 6), and the spliced feature vectors are then processed to obtain the correlation loss.
Since the second mutual information is likewise difficult to compute directly, it can be estimated by optimizing the following approximation:
and taking the sample graph and the respectively corresponding target subgraphs as positive samples, wherein the attributes of the sample graph in the positive samples are the same as those of the target subgraphs, taking the sample graph and the target subgraphs which do not correspond to the sample graph as negative samples, and the attributes of the sample graph in the negative samples are different from those of the target subgraphs. Specifically, the negative samples can be obtained by means of data random sampling.
And obtaining first correlation loss based on the positive samples, obtaining second correlation losses based on the negative samples, and determining the correlation loss between the prediction target subgraph and the sample graph according to the first correlation loss and the second correlation losses.
Specifically, the subgraph features corresponding to the sample graph may be spliced with the sample graph features, and the first correlation loss between the prediction target subgraph and the sample graph is determined based on the spliced features; the sample graph features of the sample graph are spliced with the subgraph features corresponding to each of the other sample graphs, and the second correlation losses between the sample graph and each of the other sample graphs are determined based on the spliced features.
And continuously adjusting parameters of the correlation prediction model based on the correlation loss, so that the trained correlation prediction model can accurately determine the information correlation (second mutual information) between the prediction target subgraph and the corresponding sample graph.
The correlation loss can be expressed as:

L_MI(ψ) = (1/N) Σ_{i=1..N} f_ψ(G_i, G_sub,i) − log( (1/N) Σ_{i≠j} e^{f_ψ(G_i, G_sub,j)} )    (8)

where L_MI(ψ) is the correlation loss, N is the number of sample graphs in the training data set, ψ is the model parameter of the correlation prediction model, and f denotes the correlation prediction model. G_i denotes the i-th sample graph in the training data set, G_sub,i denotes the prediction target subgraph corresponding to G_i, and G_sub,j denotes the prediction target subgraph corresponding to any other sample graph in the training data set. G_i and G_sub,j constitute a negative sample; G_i and G_sub,i constitute a positive sample. The term (1/N) Σ f_ψ(G_i, G_sub,i) corresponds to the first correlation loss, and each e^{f_ψ(G_i, G_sub,j)} term corresponds to a second correlation loss.
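Equation (8) translates into a few lines, assuming the positive and negative scores have already been produced by the correlation prediction model (for example by the correlation_scores sketch above):

```python
import torch

def mi_estimate(pos_scores, neg_scores):
    """Donsker-Varadhan-style estimate of eq. (8): the mean score of positive
    (G_i, G_sub,i) pairs minus the log of the mean exponentiated score of
    negative (G_i, G_sub,j) pairs."""
    return pos_scores.mean() - torch.log(torch.exp(neg_scores).mean())
```

During the inner optimization this estimate is maximized with respect to ψ so that it approaches the second mutual information; during the outer optimization it enters the training loss weighted by β.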
Step 9: after the three losses (subgraph connectivity loss, attribute loss and correlation loss) are obtained, the training loss is obtained based on the three losses and can be expressed as:

L(θ, φ, ψ) = L_cls(q_φ(y|G_sub), y_gt) + L_con(g(G; θ)) + β·L_MI(ψ)    (9)

where L(θ, φ, ψ) is the training loss, L_con(g(G; θ)) is the subgraph connectivity loss, L_cls is the attribute loss, and L_MI(ψ) is the correlation loss. G is a sample graph, G_sub is the prediction target subgraph, y is the attribute of a sample graph in the training data set, and θ is the model parameter of the initial neural network model, comprising the parameters θ_1 and θ_2. y_gt is the ground-truth attribute of the sample graph, φ is the parameter of the network q, and ψ is the model parameter of the correlation prediction model.
Step 10: the initial neural network model is trained based on the training loss corresponding to each sample graph and the training data set until a preset training end condition is met, and the neural network model at the end of training is determined as the subgraph prediction model.
In an optional embodiment, the attribute prediction model, the correlation prediction model, and the training process of the sub-graph prediction model may be performed synchronously, that is, in the process of training the sub-graph prediction model, the attribute prediction model and the correlation prediction model are trained, and a network parameter of the sub-graph prediction model is optimized based on an attribute loss obtained by the output of the attribute prediction model and a correlation loss obtained by the output of the correlation prediction model, so as to obtain the trained sub-graph prediction model. At this time, the connection relationship among the attribute prediction model, the correlation prediction model, and the sub-graph prediction model is as shown in fig. 7. It should be noted that the structures of the attribute prediction model, the correlation prediction model, and the sub-graph prediction model are as described above, and are not described herein again.
In fig. 7, the aggregation module 1(Aggregate) and the fully-connected network MLP2 constitute a correlation prediction model, and the aggregation module 2(Aggregate) and the q-network constitute an attribute prediction model.
The inner optimization (T-step inner optimization) in fig. 7 refers to optimizing the model parameters ψ of the correlation prediction model with the sample graph G and the target subgraph of the sample graph given, so that the trained correlation prediction model can accurately predict the information correlation between the two graphs. The outer optimization in fig. 7 refers to optimizing the other parameters of the subgraph prediction model with the model parameters ψ fixed, so that the trained subgraph prediction model can accurately identify the target subgraph of a graph.
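The bi-level procedure can be sketched as follows; every model interface here (scores, loss, connectivity_loss) and the hyperparameters beta and t_steps are assumptions chosen for illustration, not interfaces fixed by the application:

```python
import torch

def train_step(graphs, subgraph_model, attr_model, corr_model,
               inner_opt, outer_opt, beta=0.1, t_steps=5):
    """One training step: T-step inner optimization of the correlation
    predictor's parameters psi, then one outer step on theta and phi."""
    # Inner optimization: maximize the DV estimate of I(G, G_sub) w.r.t. psi
    for _ in range(t_steps):
        pos, neg = corr_model.scores(graphs, subgraph_model)
        inner_loss = -(pos.mean() - torch.log(torch.exp(neg).mean()))
        inner_opt.zero_grad()
        inner_loss.backward()
        inner_opt.step()
    # Outer optimization: psi fixed; minimize L_cls + L_con + beta * L_MI
    pos, neg = corr_model.scores(graphs, subgraph_model)
    l_mi = pos.mean() - torch.log(torch.exp(neg).mean())
    l_cls = attr_model.loss(graphs, subgraph_model)       # attribute loss
    l_con = subgraph_model.connectivity_loss(graphs)      # eq. (3)
    total = l_cls + l_con + beta * l_mi
    outer_opt.zero_grad()
    total.backward()
    outer_opt.step()
    return total.item()
```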
Based on the same principle as the method shown in fig. 3, an embodiment of the present application further provides a graph data processing method, which is described below with a server as an execution subject, and as shown in fig. 8, the method may include the following steps:
step S210, a graph to be processed is acquired.
This step is identical to the foregoing step S110, and is not described herein again.
And S220, inputting the graph to be processed into the trained subgraph prediction model to obtain a target subgraph of the graph to be processed. Wherein, the subgraph prediction model is trained based on the method described in the foregoing.
This step is identical to the foregoing step S120, and is not described herein again.
The scheme provided by the embodiment of the application can be executed by any electronic device. The executing device may be a user terminal device, through which a user can call the trained subgraph prediction model to identify the target subgraph of a graph to be processed. The executing device may also be a server: the user terminal sends a graph processing request, which contains the graph to be processed, to the server; based on the request, the server identifies the target subgraph in the graph to be processed through the subgraph prediction model and sends the target subgraph to the user terminal, so that it is displayed to the user through the user terminal.
In an optional embodiment of the present application, inputting the graph to be processed into the trained sub-graph prediction model to obtain a target sub-graph of the graph to be processed may include:
inputting the graph to be processed into the trained sub-graph prediction model, and executing the following operations through the sub-graph prediction model to obtain a target sub-graph of the graph to be processed:
extracting node characteristics of each node in a graph to be processed;
obtaining a classification result of each node based on the node characteristics of each node;
and obtaining a target subgraph of the graph to be processed based on the classification result of each node.
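These three operations correspond to a forward pass such as the following sketch; the 0.5 threshold and the convention that column 0 of the classification result marks membership are assumptions:

```python
import torch

def identify_target_subgraph(a_hat, x, gcn, mlp1, threshold=0.5):
    """Run the trained subgraph prediction model on a graph to be processed."""
    with torch.no_grad():
        x_l = gcn(a_hat, x)                  # node features of each node
        s = torch.softmax(mlp1(x_l), dim=1)  # classification result per node
    keep = s[:, 0] > threshold               # nodes of the target subgraph
    idx = keep.nonzero(as_tuple=True)[0]
    return idx, a_hat[keep][:, keep]         # induced adjacency of the subgraph
```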
Fig. 9 is a schematic diagram of an implementation environment of a graph data processing method according to an embodiment of the present application, where the implementation environment in this example may include, but is not limited to, sub-graph recognition server 101, network 102, and terminal device 103. Subgraph recognition server 101 can communicate with terminal device 103 through network 102, terminal device 103 sends a subgraph recognition request to subgraph recognition server 101, and subgraph recognition server 101 can send a recognized target subgraph to terminal device 103 through the network.
The terminal device 103 includes a human-computer interaction screen 1031, a processor 1032 and a memory 1033. The human-computer interaction screen 1031 is used for displaying the target subgraph. The memory 1033 is used for storing related data such as the graph to be processed and the target subgraph. Subgraph recognition server 101 includes a database 1011 and a processing engine 1012, where processing engine 1012 may be used to train a subgraph prediction model. Database 1011 is used to store the trained subgraph prediction models. The terminal device 103 may upload the subgraph recognition request to the subgraph recognition server 101 through the network, and the processing engine 1012 in the subgraph recognition server 101 may call the subgraph prediction model, recognize the target subgraph of the graph to be processed, and provide the target subgraph to the terminal device 103 for presentation.
The processing engine in the subgraph recognition server 101 has two main functions, the first function is used for training to obtain a subgraph prediction model, and the second function is used for processing a graph to be processed based on the subgraph prediction model to obtain a target subgraph of the graph to be processed. It can be understood that the above two functions can be implemented by two servers, see fig. 10, where the two servers are a training server 201 and a sub-graph recognition server 202, respectively, the training server 201 is used for training to obtain a sub-graph prediction model, and the sub-graph recognition server 202 is used for implementing recognition of a target sub-graph.
In practical application, the two servers can communicate with each other, and after the training server 201 trains the sub-graph prediction model, the sub-graph prediction model can be stored in the training server 201, or the sub-graph prediction model can be sent to the sub-graph recognition server 202. Alternatively, when sub-graph recognition server 202 needs to call the sub-graph prediction model, a model call request is sent to training server 201, and training server 201 sends the sub-graph prediction model to sub-graph recognition server 202 based on the request.
As an example, the terminal device 204 sends a sub-graph recognition request to the sub-graph recognition server 202 through the network 203, the sub-graph recognition server 202 calls a sub-graph prediction model in the training server 201, and based on the sub-graph prediction model, the sub-graph recognition server 202 sends a target sub-graph obtained through recognition to the terminal device 204 through the network 203 after the sub-graph recognition is completed, so that the terminal device 204 displays the target sub-graph.
The graph data processing method provided by the application can be applied to any scenario in which target subgraph identification is needed, such as drug discovery, molecular optimization and molecular generation in structural biology and medicine. To facilitate understanding, consider applying the scheme to the mining of molecular functional groups. For a drug molecule, the molecular functional groups determine the chemical properties of the molecule; if the graph to be processed is the graph corresponding to a drug molecule, the target subgraph can be the subgraph corresponding to a molecular functional group of the drug molecule. Therefore, based on the scheme of the embodiment of the application, when the graph to be processed corresponds to a drug molecule, the molecular functional groups of the drug molecule can be accurately identified based on the trained subgraph prediction model, which can provide data support for drug research.
Based on the same principle as the method shown in fig. 3, the embodiment of the present application further provides a graph data processing apparatus 30, as shown in fig. 11, the graph data processing apparatus 30 may include a graph data acquiring module 310 and a sub-graph identifying module 320, where:
a graph data obtaining module 310, configured to obtain a graph to be processed;
the subgraph identification module 320 is used for inputting the graph to be processed into the trained subgraph prediction model to obtain a target subgraph of the graph to be processed, and the subgraph prediction model is obtained by training through the following model training modules:
acquiring a training data set, wherein the training data set comprises a plurality of sample graphs;
acquiring an incidence relation between nodes of each sample graph;
inputting each sample graph into an initial neural network model to obtain a classification result of each node in each sample graph, wherein for any node in one sample graph, the classification result represents the probability that the node is the node of a target subgraph of the sample graph;
for each sample graph, determining subgraph connectivity loss corresponding to the sample graph based on the classification result of each node of the sample graph and the incidence relation between each node of the sample graph;
and training the initial neural network model based on the subgraph connectivity loss and the training data set corresponding to each sample graph until a preset training end condition is met, and determining the neural network model at the end of training as a subgraph prediction model.
In an embodiment of the present application, each sample graph carries an attribute label, and for each sample graph, the apparatus further includes:
the attribute loss determining module is used for obtaining a prediction target subgraph of the sample graph based on the classification result of each node of the sample graph; determining attribute loss between the sample graph and the prediction target subgraph based on the attribute labels of the prediction target subgraph and the sample graph, wherein the attribute loss represents the difference between the attribute of the prediction target subgraph and the attribute of the sample graph;
when the model training module trains the initial neural network model based on the subgraph connectivity loss and the training data set of each sample graph, the model training module is specifically configured to:
and training the initial neural network model based on the subgraph connectivity loss, the attribute loss and the training data set corresponding to each sample graph.
In one embodiment of the present application, for each sample graph, the apparatus further comprises:
the relevance loss determining module is used for obtaining a prediction target subgraph of the sample graph based on the classification result of each node of the sample graph; determining the correlation loss between the prediction target subgraph and the sample graph, wherein the correlation loss characterizes the correlation between the prediction target subgraph and the sample graph;
when the model training module trains the initial neural network model based on the subgraph connectivity loss and the training data set of each sample graph, the model training module is specifically configured to:
and training the initial neural network model based on the subgraph connectivity loss, the correlation loss and the training data set corresponding to each sample graph.
In an embodiment of the present application, for each sample graph, the association relationship includes an adjacency matrix corresponding to the sample graph, and when determining connectivity loss of a sub-graph corresponding to the sample graph based on the classification result of each node corresponding to the sample graph and the association relationship between each node in the sample graph, the model training module is specifically configured to:
determining a node classification matrix corresponding to the sample graph according to the classification result of each node of the sample graph, wherein the element of each row of the node classification matrix corresponds to the classification result of one node in the sample graph;
and determining subgraph connectivity loss corresponding to the sample graph according to the node classification matrix and the adjacency matrix corresponding to the sample graph.
In an embodiment of the present application, when determining, by the model training module, connectivity loss of a sub-graph corresponding to the sample graph according to the node classification matrix and the adjacency matrix corresponding to the sample graph, the model training module is specifically configured to:
determining a connectivity result of a subgraph of the sample graph according to the node classification matrix and the adjacency matrix;
and determining subgraph connectivity loss corresponding to the sample graph based on the connectivity result of the subgraph and the constraint condition of the connectivity result of the subgraph.
In an embodiment of the present application, the expression of the subgraph connectivity loss is:

L_con(g(G; θ)) = ||Norm(S^T A S) − I_2||_F

where G is the sample graph, θ is the model parameter of the neural network model, L_con(g(G; θ)) denotes the subgraph connectivity loss, Norm denotes row normalization of a matrix, S denotes the node classification matrix, A denotes the adjacency matrix, I_2 is the 2 × 2 identity matrix, S^T is the transpose of S, and ||·||_F is the Frobenius norm.
In an embodiment of the present application, when determining the attribute loss between the sample graph and the prediction target subgraph based on the prediction target subgraph and the attribute label of the sample graph, the attribute loss determining module is specifically configured to:
acquiring node characteristics of each node in the predicted target subgraph;
obtaining sub-graph characteristics of the prediction target sub-graph by fusing the node characteristics of each node in the prediction target sub-graph;
determining attribute information of a prediction target subgraph according to the subgraph characteristics;
and determining the attribute loss between the sample graph and the prediction target subgraph according to the attribute information of the prediction target subgraph and the attribute label of the sample graph.
In an embodiment of the present application, for any sample graph, when determining a correlation loss between a prediction target sub-graph and the sample graph, the correlation loss determining module is specifically configured to:
extracting sub-graph features of the predicted target sub-graph and sample graph features of the sample graph;
and determining the correlation loss between the prediction target subgraph and the sample graph based on the subgraph characteristics and the sample graph characteristics.
In an embodiment of the present application, for any sample graph, when determining the correlation loss between the prediction target subgraph and the sample graph based on the subgraph features and the sample graph features, the correlation loss determining module is specifically configured to:
splicing sub-graph features corresponding to the sample graph and sample graph features, and determining first correlation loss between a prediction target sub-graph and the sample graph based on the spliced features;
acquiring sub-graph features corresponding to other sample graphs except the sample graph;
splicing the sample graph features of the sample graph with sub-graph features corresponding to each other sample graph, and determining second relevancy loss between the sample graph and each other sample graph based on the spliced features;
and determining the correlation loss between the prediction target subgraph and the sample graph according to the first correlation loss and each second correlation loss.
The graph data processing apparatus of the embodiments of the present application can execute the graph data processing method provided by the embodiments of the present application, and its implementation principle is similar. The actions performed by the modules and units in the graph data processing apparatus correspond to the steps in the graph data processing method of the embodiments of the present application; for a detailed functional description of each module of the graph data processing apparatus, reference may be made to the description of the corresponding graph data processing method shown above, which is not repeated here.
Wherein the graph data processing apparatus may be a computer program (including program code) running in a computer device, for example, the graph data processing apparatus is an application software; the apparatus may be used to perform the corresponding steps in the methods provided by the embodiments of the present application.
In some embodiments, the graph data processing apparatus provided by the embodiments of the present invention may be implemented by a combination of hardware and software, and by way of example, the graph data processing apparatus provided by the embodiments of the present invention may be a processor in the form of a hardware decoding processor, which is programmed to execute the graph data processing method provided by the embodiments of the present invention, for example, the processor in the form of the hardware decoding processor may be implemented by one or more Application Specific Integrated Circuits (ASICs), DSPs, Programmable Logic Devices (PLDs), Complex Programmable Logic Devices (CPLDs), Field Programmable Gate Arrays (FPGAs), or other electronic components.
In other embodiments, the graph data processing apparatus provided in the embodiments of the present invention may be implemented in software, and fig. 11 illustrates the graph data processing apparatus stored in the memory, which may be software in the form of programs and plug-ins, and includes a series of modules, including a training data obtaining module 410 and a model training module 420, for implementing the graph data processing method provided in the embodiments of the present invention.
Based on the same principle as the method shown in the embodiments of the present application, there is also provided in the embodiments of the present application an electronic device, which may include but is not limited to: a processor and a memory; a memory for storing a computer program; and the processor is used for executing the graph data processing method shown in any embodiment of the application by calling the computer program.
According to the graph data processing method of the embodiments of the application, for a graph to be processed whose target subgraph needs to be recognized, the graph to be processed can be input into a trained subgraph prediction model, and the target subgraph of the graph to be processed is identified through the subgraph prediction model; the training data set of the subgraph prediction model only comprises sample graphs, i.e., the subgraph prediction model can identify the target subgraph of the graph to be processed without any sample subgraphs (labeled subgraphs). Meanwhile, during model training, the subgraph connectivity loss is determined based on the node features of each node and the association relationships between the nodes; since the association relationships reflect how the nodes are connected, the subgraph prediction model trained based on the subgraph connectivity loss of each sample graph and the training data set can identify the target subgraph more accurately by combining the association relationships between the nodes.
In an alternative embodiment, there is provided an electronic device, as shown in fig. 12, an electronic device 4000 shown in fig. 12 including: a processor 4001 and a memory 4003. Processor 4001 is coupled to memory 4003, such as via bus 4002. Optionally, the electronic device 4000 may further include a transceiver 4004, and the transceiver 4004 may be used for data interaction between the electronic device and other electronic devices, such as transmission of data and/or reception of data. In addition, the transceiver 4004 is not limited to one in practical applications, and the structure of the electronic device 4000 is not limited to the embodiment of the present application.
The Processor 4001 may be a CPU (Central Processing Unit), a general-purpose processor, a DSP (Digital Signal Processor), an ASIC (Application Specific Integrated Circuit), an FPGA (Field Programmable Gate Array) or other programmable logic device, a transistor logic device, a hardware component, or any combination thereof; it may implement or perform the various illustrative logical blocks, modules, and circuits described in connection with this disclosure. The processor 4001 may also be a combination that performs a computing function, for example a combination of one or more microprocessors or a combination of a DSP and a microprocessor.
Bus 4002 may include a path that carries information between the aforementioned components. The bus 4002 may be a PCI (Peripheral Component Interconnect) bus, an EISA (Extended Industry Standard Architecture) bus, or the like. The bus 4002 may be divided into an address bus, a data bus, a control bus, and the like. For ease of illustration, only one thick line is shown in FIG. 12, but this is not intended to represent only one bus or type of bus.
The Memory 4003 may be a ROM (Read Only Memory) or other types of static storage devices that can store static information and instructions, a RAM (Random Access Memory) or other types of dynamic storage devices that can store information and instructions, an EEPROM (Electrically Erasable Programmable Read Only Memory), a CD-ROM (Compact Disc Read Only Memory) or other optical Disc storage, optical Disc storage (including Compact Disc, laser Disc, optical Disc, digital versatile Disc, blu-ray Disc, etc.), a magnetic Disc storage medium or other magnetic storage devices, or any other medium that can be used to carry or store desired program code in the form of instructions or data structures and that can be accessed by a computer, but is not limited to these.
The memory 4003 is used for storing application program codes (computer programs) for executing the present scheme, and is controlled by the processor 4001 to execute. Processor 4001 is configured to execute application code stored in memory 4003 to implement what is shown in the foregoing method embodiments.
The electronic device may also be a terminal device, and the electronic device shown in fig. 12 is only an example, and should not bring any limitation to the functions and the application scope of the embodiments of the present application.
The graph data processing method provided by the application can also be implemented through cloud computing. In the narrow sense, cloud computing refers to a delivery and usage mode of IT infrastructure, namely obtaining the required resources on demand and in an easily extensible manner through a network; in the broad sense, it refers to a delivery and usage mode of services, namely obtaining the required services on demand and in an easily extensible manner through a network. Such services may be IT and software or internet related, or other services. Cloud computing is a product of the development and fusion of traditional computing and network technologies such as grid computing, distributed computing, parallel computing, utility computing, network storage, virtualization and load balancing.
With the diversification of the internet, real-time data streams and connected devices, and the growth of demands such as search services, social networks, mobile commerce and open collaboration, cloud computing has developed rapidly. Unlike earlier parallel and distributed computing, the emergence of cloud computing will conceptually drive revolutionary change in the whole internet model and enterprise management model.
The graph data processing method provided by the application can also be implemented through an artificial intelligence cloud service, generally referred to as AIaaS (AI as a Service). This is a service mode of an artificial intelligence platform: specifically, an AIaaS platform splits several types of common AI services and provides independent or packaged services in the cloud. This service model is similar to an AI-themed app store: all developers can access one or more artificial intelligence services provided by the platform through an API, and some experienced developers can also use the AI framework and AI infrastructure provided by the platform to deploy, operate and maintain their own dedicated cloud artificial intelligence services. In the present application, the graph data processing method may be implemented using the AI framework and AI infrastructure provided by such a platform.
The present application provides a computer-readable storage medium, on which a computer program is stored, which, when running on a computer, enables the computer to execute the corresponding content in the foregoing method embodiments.
The method according to the embodiments of the present application may be implemented by a distributed system formed by a client and a plurality of nodes (computing devices of any form in an access network, such as servers and user terminals) connected through network communication.
Specifically, the subgraph recognition server 101 and the terminal device 103 referred to in fig. 9, or the training server 201, the subgraph recognition server 202, and the terminal device 204 referred to in fig. 10, may serve as nodes in the distributed system. The data interaction involved in the graph data processing method of the present application is realized through the distributed system.
Taking a blockchain system as an example of the distributed system, refer to fig. 13, which is an optional structural schematic diagram of the distributed system 100 applied to the blockchain system. The system is formed by a plurality of nodes (computing devices of any form in an access network, such as servers and user terminals) and clients, with a Peer-to-Peer (P2P) network formed between the nodes; the P2P protocol is an application-layer protocol running on top of the Transmission Control Protocol (TCP). In a distributed system, any machine, such as a server or a terminal, can join and become a node; a node comprises a hardware layer, an intermediate layer, an operating system layer, and an application layer.
It should be understood that, although the steps in the flowcharts of the figures are shown in an order indicated by the arrows, they are not necessarily performed in that order. Unless explicitly stated herein, the order of execution of these steps is not strictly limited, and they may be performed in other orders. Moreover, at least some of the steps in the flowcharts may include multiple sub-steps or multiple stages, which are not necessarily completed at the same moment but may be executed at different moments, and whose order of execution is not necessarily sequential; they may be performed in turns or alternately with other steps, or with at least some of the sub-steps or stages of other steps.
The computer-readable storage medium provided by the embodiments of the present application may be, for example but not limited to, an electronic, magnetic, optical, electromagnetic, infrared, or semiconductor system, apparatus, or device, or any combination thereof. More specific examples of the computer-readable storage medium include, but are not limited to: an electrical connection having one or more wires, a portable computer diskette, a hard disk, a random access memory (RAM), a read-only memory (ROM), an erasable programmable read-only memory (EPROM or flash memory), an optical fiber, a portable compact disc read-only memory (CD-ROM), an optical storage device, a magnetic storage device, or any suitable combination of the foregoing. In the present application, a computer-readable storage medium may be any tangible medium that can contain or store a program for use by or in connection with an instruction execution system, apparatus, or device.
The computer-readable storage medium carries one or more programs which, when executed by the electronic device, cause the electronic device to perform the methods shown in the above embodiments.
According to another aspect of the present application, there is also provided a computer program product or computer program comprising computer instructions stored in a computer-readable storage medium. The processor of a computer device reads the computer instructions from the computer-readable storage medium and executes them, causing the computer device to perform the graph data processing method provided by the various optional implementations of the embodiments described above.
Computer program code for carrying out the operations of the present application may be written in any combination of one or more programming languages, including object-oriented programming languages such as Java, Smalltalk, and C++, as well as conventional procedural programming languages such as the "C" programming language or similar languages. The program code may execute entirely on the user's computer, partly on the user's computer, as a stand-alone software package, partly on the user's computer and partly on a remote computer, or entirely on the remote computer or server. In the remote-computer case, the remote computer may be connected to the user's computer through any type of network, including a local area network (LAN) or a wide area network (WAN), or the connection may be made to an external computer (for example, through the Internet using an Internet service provider).
The flowchart and block diagrams in the figures illustrate the architecture, functionality, and operation of possible implementations of methods and computer program products according to various embodiments of the present application. In this regard, each block in the flowchart or block diagrams may represent a module, segment, or portion of code, which comprises one or more executable instructions for implementing the specified logical function(s). It should also be noted that, in some alternative implementations, the functions noted in the block may occur out of the order noted in the figures. For example, two blocks shown in succession may, in fact, be executed substantially concurrently, or the blocks may sometimes be executed in the reverse order, depending upon the functionality involved. It will also be noted that each block of the block diagrams and/or flowchart illustration, and combinations of blocks in the block diagrams and/or flowchart illustration, can be implemented by special purpose hardware-based systems which perform the specified functions or acts, or combinations of special purpose hardware and computer instructions.
The modules described in the embodiments of the present application may be implemented by software or by hardware. The name of a module does not, in some cases, constitute a limitation on the module itself.
The above description is only a preferred embodiment of the application and is illustrative of the principles of the technology employed. It will be appreciated by those skilled in the art that the scope of the disclosure herein is not limited to the particular combination of features described above, but also encompasses other arrangements formed by any combination of the above features or their equivalents without departing from the spirit of the disclosure. For example, the above features may be replaced with (but not limited to) features having similar functions disclosed in the present application.

Claims (12)

1. A graph data processing method, comprising:
acquiring a graph to be processed;
inputting the graph to be processed into a trained sub-graph prediction model to obtain a target sub-graph of the graph to be processed, wherein the sub-graph prediction model is obtained by training in the following way:
obtaining a training data set, wherein the training data set comprises a plurality of sample graphs;
acquiring an incidence relation between nodes of each sample graph;
inputting each sample graph into an initial neural network model to obtain a classification result of each node in each sample graph, wherein, for any node in a sample graph, the classification result characterizes the probability that the node belongs to the target subgraph of the sample graph;
for each sample graph, determining subgraph connectivity loss corresponding to the sample graph based on the classification result of each node of the sample graph and the incidence relation between each node of the sample graph;
and training the initial neural network model based on the subgraph connectivity loss corresponding to each sample graph and the training data set until a preset training end condition is met, and determining the neural network model at the end of training as the subgraph prediction model.
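By way of illustration only, the following is a minimal PyTorch-style sketch of the training flow recited in claim 1 above. It assumes a model mapping node features and an adjacency matrix to per-node two-class probabilities; the function connectivity_loss is sketched after claim 6, and all names here are hypothetical rather than part of the claimed method.

```python
# Minimal illustrative sketch (not the claimed implementation) of claim 1's
# training flow. `model`, `train_subgraph_predictor`, and `connectivity_loss`
# are hypothetical names.
import torch

def train_subgraph_predictor(model, sample_graphs, epochs=100, lr=1e-3):
    """sample_graphs: list of (node_features, adjacency_matrix) tensor pairs."""
    optimizer = torch.optim.Adam(model.parameters(), lr=lr)
    for _ in range(epochs):
        optimizer.zero_grad()
        total_loss = 0.0
        for x, adj in sample_graphs:
            # Classification result for each node: probability that the node
            # belongs to the target subgraph vs. the remainder (2 classes).
            s = model(x, adj)  # shape [n_nodes, 2]
            # Subgraph connectivity loss from the classification results and
            # the incidence relation (adjacency matrix) between the nodes.
            total_loss = total_loss + connectivity_loss(s, adj)
        total_loss.backward()
        optimizer.step()
    return model  # the trained subgraph prediction model
```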
2. The method of claim 1, wherein each of the sample graphs carries an attribute label, and for each of the sample graphs, the method further comprises:
obtaining a prediction target subgraph of the sample graph based on the classification result of each node of the sample graph;
determining an attribute loss between the sample graph and the prediction target subgraph based on the attribute labels of the prediction target subgraph and the sample graph, wherein the attribute loss characterizes a difference between the attribute of the prediction target subgraph and the attribute of the sample graph;
training the initial neural network model based on the subgraph connectivity loss of each of the sample graphs and the training data set, including:
and training the initial neural network model based on the subgraph connectivity loss, the attribute loss and the training data set corresponding to each sample graph.
3. The method of claim 1, wherein for each of the sample graphs, the method further comprises:
obtaining a prediction target subgraph of the sample graph based on the classification result of each node of the sample graph;
determining a correlation loss between the prediction target sub-graph and the sample graph, wherein the correlation loss characterizes a correlation between the prediction target sub-graph and the sample graph;
training the initial neural network model based on the subgraph connectivity loss of each of the sample graphs and the training data set, including:
and training the initial neural network model based on the subgraph connectivity loss, the correlation loss and the training data set corresponding to each sample graph.
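Claims 2 and 3 each add a further loss term to the training objective. A minimal sketch of one possible combination, assuming the terms are simply weighted and summed (the claims specify neither a combination rule nor weights; alpha and beta are hypothetical hyperparameters):

```python
# Hypothetical combination of the connectivity, attribute, and correlation
# losses (claims 1-3) into one training objective; alpha and beta are
# illustrative weights only, not claim limitations.
def total_loss(con_loss, attr_loss, corr_loss, alpha=1.0, beta=1.0):
    # subgraph connectivity loss + weighted attribute and correlation losses
    return con_loss + alpha * attr_loss + beta * corr_loss
```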
4. The method according to any one of claims 1 to 3, wherein, for each of the sample graphs, the incidence relation comprises an adjacency matrix corresponding to the sample graph, and the determining the subgraph connectivity loss corresponding to the sample graph based on the classification result of each node corresponding to the sample graph and the incidence relation between each node in the sample graph comprises:
determining a node classification matrix corresponding to the sample graph according to the classification result of each node of the sample graph, wherein elements in each row of the node classification matrix correspond to the classification result of one node in the sample graph;
and determining the subgraph connectivity loss corresponding to the sample graph according to the node classification matrix corresponding to the sample graph and the adjacency matrix.
5. The method of claim 4, wherein determining the subgraph connectivity loss corresponding to the sample graph based on the node classification matrix and the adjacency matrix corresponding to the sample graph comprises:
determining a connectivity result of a subgraph of the sample graph according to the node classification matrix and the adjacency matrix;
and determining subgraph connectivity loss corresponding to the sample graph based on the connectivity result of the subgraph and the subgraph connectivity result constraint condition.
6. The method of claim 5, wherein the sub-graph connectivity loss is expressed by:
$$L_{con}(g(G;\theta)) = \left\| \mathrm{Norm}\left(S^{T} A S\right) - I_{2} \right\|_{F}$$
where $G$ is the sample graph, $\theta$ denotes the model parameters of the neural network model, $L_{con}(g(G;\theta))$ is the subgraph connectivity loss, $\mathrm{Norm}(\cdot)$ denotes row normalization of a matrix, $S$ is the node classification matrix, $A$ is the adjacency matrix, $I_{2}$ is the $2 \times 2$ identity matrix, $S^{T}$ is the transpose of $S$, and $\|\cdot\|_{F}$ is the Frobenius norm.
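A minimal sketch of this loss in PyTorch, assuming S stores per-node two-way classification probabilities and reading Norm as dividing each row by its sum (one plausible row normalization):

```python
import torch

def connectivity_loss(s: torch.Tensor, adj: torch.Tensor) -> torch.Tensor:
    """Sketch of L_con = ||Norm(S^T A S) - I_2||_F.

    s:   [n, 2] node classification matrix S
    adj: [n, n] adjacency matrix A
    """
    sas = s.t() @ adj @ s                                    # S^T A S -> [2, 2]
    row_norm = sas / (sas.sum(dim=1, keepdim=True) + 1e-8)   # row normalization
    eye = torch.eye(2, device=s.device)                      # I_2
    return torch.norm(row_norm - eye, p="fro")               # Frobenius norm
```

Intuitively, when the predicted subgraph and its complement are each internally well connected with few edges between them, the row-normalized matrix approaches the identity and the loss approaches zero.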
7. The method of claim 2, wherein the determining a loss of attributes between the sample graph and the prediction target subgraph based on the prediction target subgraph and the attribute labels of the sample graph comprises:
acquiring node features of each node in the prediction target subgraph;
obtaining subgraph features of the prediction target subgraph by fusing the node features of each node in the prediction target subgraph;
determining attribute information of the prediction target subgraph according to the subgraph features;
and determining attribute loss between the sample graph and the prediction target subgraph according to the attribute information of the prediction target subgraph and the attribute label of the sample graph.
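A minimal sketch of this attribute-loss computation, assuming mean pooling as one possible fusion of node features and a hypothetical classifier head producing the attribute information; neither choice is fixed by the claim:

```python
import torch
import torch.nn.functional as F

def attribute_loss(node_feats: torch.Tensor, head: torch.nn.Module,
                   attr_label: torch.Tensor) -> torch.Tensor:
    """node_feats: [k, d] features of the nodes in the prediction target subgraph;
    attr_label: scalar class index (long tensor), the sample graph's attribute label."""
    sub_feat = node_feats.mean(dim=0, keepdim=True)  # fuse node features -> [1, d]
    attr_logits = head(sub_feat)                     # attribute information of the subgraph
    return F.cross_entropy(attr_logits, attr_label.view(1))  # loss vs. attribute label
```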
8. The method of claim 3, wherein for any of the sample graphs, the determining a loss of correlation between the prediction target subgraph and the sample graph comprises:
extracting sub-graph features of the prediction target sub-graph and sample graph features of the sample graph;
determining a loss of correlation between the prediction target subgraph and the sample graph based on the subgraph features and the sample graph features.
9. The method of claim 8, wherein for any of the sample graphs, the determining a loss of correlation between the prediction target subgraph and the sample graph based on the subgraph features and the sample graph features comprises:
splicing the subgraph features corresponding to the sample graph with the sample graph features, and determining a first correlation loss between the prediction target subgraph and the sample graph based on the spliced features;
acquiring subgraph features corresponding to the other sample graphs other than the sample graph;
splicing the sample graph features of the sample graph with the subgraph features corresponding to each of the other sample graphs, and determining a second correlation loss between the sample graph and each of the other sample graphs based on the spliced features;
and determining the correlation loss between the prediction target subgraph and the sample graph according to the first correlation loss and each second correlation loss.
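A minimal sketch of one way to realize claims 8-9, in the style of a discriminator-based correlation estimate; the discriminator `disc`, the softplus scoring, and the feature shapes are assumptions, not claim limitations:

```python
import torch
import torch.nn.functional as F

def correlation_loss(sub_feat: torch.Tensor, graph_feat: torch.Tensor,
                     other_sub_feats: torch.Tensor,
                     disc: torch.nn.Module) -> torch.Tensor:
    """sub_feat, graph_feat: [1, d]; other_sub_feats: [m, d] (other sample graphs)."""
    # First correlation loss: splice the subgraph feature with its own
    # sample graph feature and score the (positive) pair.
    pos = disc(torch.cat([sub_feat, graph_feat], dim=1))  # [1, 1]
    first = F.softplus(-pos).mean()
    # Second correlation losses: splice the sample graph feature with the
    # subgraph feature of every other sample graph (negative pairs).
    m = other_sub_feats.size(0)
    neg = disc(torch.cat([other_sub_feats, graph_feat.expand(m, -1)], dim=1))
    second = F.softplus(neg).mean()
    # Overall correlation loss from the first and second correlation losses.
    return first + second
```

Minimizing an objective of this shape encourages the predicted target subgraph to be highly correlated with its own sample graph while remaining distinguishable from the subgraphs of other sample graphs.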
10. A graph data processing apparatus, comprising:
the graph data acquisition module is used for acquiring a graph to be processed;
the subgraph identification module is used for inputting the graph to be processed into a trained subgraph prediction model to obtain a target subgraph of the graph to be processed, and the subgraph prediction model is obtained by training in the following mode:
obtaining a training data set, wherein the training data set comprises a plurality of sample graphs;
acquiring an incidence relation between nodes of each sample graph;
inputting each sample graph into an initial neural network model to obtain a classification result of each node in each sample graph, wherein, for any node in a sample graph, the classification result characterizes the probability that the node belongs to the target subgraph of the sample graph;
for each sample graph, determining subgraph connectivity loss corresponding to the sample graph based on the classification result of each node of the sample graph and the incidence relation between each node in the sample graph;
and training the initial neural network model based on the subgraph connectivity loss corresponding to each sample graph and the training data set until a preset training end condition is met, and determining the neural network model at the end of training as the subgraph prediction model.
11. An electronic device comprising a memory, a processor and a computer program stored on the memory and executable on the processor, the processor implementing the method of any one of claims 1-9 when executing the program.
12. A computer-readable storage medium, characterized in that a computer program is stored on the computer-readable storage medium, which computer program, when being executed by a processor, carries out the method of any one of claims 1-9.
CN202110220724.7A 2021-02-26 2021-02-26 Graph data processing method and device, electronic equipment and computer storage medium Pending CN113011282A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202110220724.7A CN113011282A (en) 2021-02-26 2021-02-26 Graph data processing method and device, electronic equipment and computer storage medium

Publications (1)

Publication Number Publication Date
CN113011282A true CN113011282A (en) 2021-06-22

Family

ID=76386639

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202110220724.7A Pending CN113011282A (en) 2021-02-26 2021-02-26 Graph data processing method and device, electronic equipment and computer storage medium

Country Status (1)

Country Link
CN (1) CN113011282A (en)



Patent Citations (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20200285944A1 (en) * 2019-03-08 2020-09-10 Adobe Inc. Graph convolutional networks with motif-based attention
CN111816252A (en) * 2020-07-21 2020-10-23 腾讯科技(深圳)有限公司 Drug screening method and device and electronic equipment
CN111985622A (en) * 2020-08-25 2020-11-24 支付宝(杭州)信息技术有限公司 Graph neural network training method and system
CN112131261A (en) * 2020-10-09 2020-12-25 腾讯科技(深圳)有限公司 Community query method and device based on community network and computer equipment
CN112231592A (en) * 2020-11-09 2021-01-15 腾讯科技(深圳)有限公司 Network community discovery method, device, equipment and storage medium based on graph

Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
JUNCHI YU et al.: "Graph Information Bottleneck for Subgraph Recognition", arXiv, pages 1-13 *

Cited By (18)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN113297500B (en) * 2021-06-23 2023-07-25 哈尔滨工程大学 Social network isolated node link prediction method
CN113297500A (en) * 2021-06-23 2021-08-24 哈尔滨工程大学 Social network isolated node link prediction method
WO2023284511A1 (en) * 2021-07-15 2023-01-19 支付宝(杭州)信息技术有限公司 Privacy protection-based graphical model training method and apparatus, and device
WO2023029352A1 (en) * 2021-08-30 2023-03-09 平安科技(深圳)有限公司 Drug small molecule property prediction method and apparatus based on graph neural network, and device
CN113807370A (en) * 2021-09-29 2021-12-17 腾讯科技(深圳)有限公司 Data processing method, device, equipment, storage medium and computer program product
CN113807370B (en) * 2021-09-29 2024-01-02 腾讯科技(深圳)有限公司 Data processing method, apparatus, device, storage medium and computer program product
CN113989574A (en) * 2021-11-04 2022-01-28 中国科学技术大学 Image interpretation method, image interpretation apparatus, electronic device, and storage medium
CN113989574B (en) * 2021-11-04 2024-04-02 中国科学技术大学 Image interpretation method, image interpretation device, electronic device, and storage medium
CN114155417A (en) * 2021-12-13 2022-03-08 中国科学院空间应用工程与技术中心 Image target identification method and device, electronic equipment and computer storage medium
WO2023115343A1 (en) * 2021-12-21 2023-06-29 深圳晶泰科技有限公司 Data processing method and apparatus, model training method and free energy prediction method
WO2023134061A1 (en) * 2022-01-11 2023-07-20 平安科技(深圳)有限公司 Artificial intelligence-based method and apparatus for determining drug feature information
WO2023174189A1 (en) * 2022-03-15 2023-09-21 上海爱数信息技术股份有限公司 Method and apparatus for classifying nodes of graph network model, and device and storage medium
CN114814776B (en) * 2022-06-24 2022-10-14 中国空气动力研究与发展中心计算空气动力研究所 PD radar target detection method based on graph attention network and transfer learning
CN114814776A (en) * 2022-06-24 2022-07-29 中国空气动力研究与发展中心计算空气动力研究所 PD radar target detection method based on graph attention network and transfer learning
CN114859300A (en) * 2022-07-07 2022-08-05 中国人民解放军国防科技大学 Radar radiation source data stream processing method based on graph connectivity
CN115965058A (en) * 2022-12-28 2023-04-14 连连(杭州)信息技术有限公司 Neural network training method, entity information classification method, device and storage medium
CN115965058B (en) * 2022-12-28 2024-03-29 连连(杭州)信息技术有限公司 Neural network training method, entity information classification method, device and storage medium
CN115984633A (en) * 2023-03-20 2023-04-18 南昌大学 Gate-level circuit component identification method, system, storage medium and equipment

Similar Documents

Publication Publication Date Title
CN113011282A (en) Graph data processing method and device, electronic equipment and computer storage medium
Zhou et al. Edge intelligence: Paving the last mile of artificial intelligence with edge computing
WO2020088439A1 (en) Method for identifying isomerism graph and molecular spatial structural property, device, and computer apparatus
CN112990211B (en) Training method, image processing method and device for neural network
CN112862874B (en) Point cloud data matching method and device, electronic equipment and computer storage medium
CN112382099B (en) Traffic road condition prediction method and device, electronic equipment and storage medium
WO2022156561A1 (en) Method and device for natural language processing
CN111382868A (en) Neural network structure search method and neural network structure search device
CN113704531A (en) Image processing method, image processing device, electronic equipment and computer readable storage medium
WO2023011382A1 (en) Recommendation method, recommendation model training method, and related product
CN113807399A (en) Neural network training method, neural network detection method and neural network detection device
CN113704388A (en) Training method and device for multi-task pre-training model, electronic equipment and medium
WO2021169366A1 (en) Data enhancement method and apparatus
CN113191479A (en) Method, system, node and storage medium for joint learning
WO2022100607A1 (en) Method for determining neural network structure and apparatus thereof
CN113688814B (en) Image recognition method and device
CN112668675B (en) Image processing method and device, computer equipment and storage medium
CN112069412B (en) Information recommendation method, device, computer equipment and storage medium
WO2023143570A1 (en) Connection relationship prediction method and related device
WO2023122854A1 (en) Data processing method and apparatus
CN115168609A (en) Text matching method and device, computer equipment and storage medium
CN114298961A (en) Image processing method, device, equipment and storage medium
Wang et al. C3Meta: A Context-Aware Cloud-Edge-End Collaboration Framework Toward Green Metaverse
CN113537267A (en) Method and device for generating countermeasure sample, storage medium and electronic equipment
CN117235533B (en) Object variable analysis method, device, computer equipment and storage medium

Legal Events

Date Code Title Description
PB01 Publication
PB01 Publication
SE01 Entry into force of request for substantive examination
SE01 Entry into force of request for substantive examination
REG Reference to a national code: Ref country code: HK; Ref legal event code: DE; Ref document number: 40047297; Country of ref document: HK