CN116416478B - Bioinformatics classification model based on graph structure data characteristics

Info

Publication number: CN116416478B
Application number: CN202310659097.6A
Authority: CN
Inventors: 魏玉锌; 王翔
Original assignee: Fujian University of Technology
Current assignee: Fujian University of Technology
Priority date: 2023-06-06
Filing date: 2023-06-06
Publication date: 2023-09-26
Anticipated expiration: 2043-06-06
Also published as: CN116416478A

Abstract

The application provides a bioinformatics classification model based on graph structure data characteristics, which comprises a plurality of feature extraction layers stacked stage by stage. Each feature extraction layer comprises a graph convolution layer and a graph pooling layer. The graph pooling layer comprises a three-channel pooling module and a feature fusion module; the three-channel pooling module comprises a graph convolution pooling channel, a differential pooling channel and a Transformer pooling channel, which respectively learn the local topological structure information, the global topological structure information and the dependency information of features between nodes of the graph structure data, and the feature fusion module fuses these three kinds of information. The graph feature representation that the corresponding readout layer extracts from the pooled graph of each preceding feature extraction layer forms a residual connection with the graph feature representation that the corresponding readout layer extracts from the last feature extraction layer, and the fully connected layer then outputs the prediction result of the bioinformatics classification. By fusing multiple kinds of graph feature information, the application generates a better feature representation of the whole graph, so that the classification is more accurate.

Description

Bioinformatics classification model based on graph structure data characteristics

Technical Field

The application relates to the technical field of data processing, in particular to a bioinformatics classification model based on graph structure data characteristics.

Background

In real life there is a large amount of complex network data, such as social networks, knowledge graphs, proteins, viruses, shopping networks and molecular compounds, each of which can be abstracted as a graph. Graph structure data has a more complex structure and higher dimensionality than traditional data types, so analyzing and processing it is more challenging. Deep learning has shown a strong learning ability on graph structure data, and in recent years more and more researchers have applied deep learning to graph structure data analysis and processing, in areas such as recommendation systems, link prediction, graph classification and node classification.

The graph classification task is mainly applied to bioinformatics classification, including drug discovery, virus analysis, protein analysis and molecular compound analysis. Unlike image classification, these complex network data carry a large amount of topology information that strongly influences the graph-level representation of the entire graph. How to capture the feature information of graph data and generate a graph-level representation at the same time remains a core problem when modeling a graph classification task. Existing graph classification models concentrate either on modeling the topological structure information of the graph or on modeling the graph feature information, and largely ignore the fused modeling of the various kinds of information in graph structure data. As a result, a better graph feature representation cannot be obtained, and the accuracy of bioinformatics classification suffers.

Disclosure of Invention

The technical problem to be solved by the application is to provide a bioinformatics classification model based on graph structure data characteristics. The model is a graph neural network model based on feature fusion; it can simultaneously capture the local topological structure information, the global topological structure information and the long-distance node dependency information of a graph, fuse the various kinds of graph feature information, and generate a better feature representation of the whole graph.

In a first aspect, the application provides a bioinformatics classification model based on graph structure data characteristics, which comprises a plurality of feature extraction layers stacked stage by stage, a plurality of readout layers and a fully connected layer; each feature extraction layer comprises a graph convolution layer and a graph pooling layer, the graph convolution layer being connected through the graph pooling layer to a corresponding readout layer, and the readout layers being connected to the fully connected layer;

the graph pooling layer comprises a three-channel pooling module and a feature fusion module, wherein the three-channel pooling module comprises a graph convolution pooling channel, a differential pooling channel and a Transformer pooling channel, which respectively learn the local topological structure information, the global topological structure information and the dependency information of features between nodes of the graph structure data, and the feature fusion module fuses the local topological structure information, the global topological structure information and the dependency information of features between nodes to obtain a pooled graph;

the pooled graph obtained by the feature extraction layer of one stage is input into the graph convolution layer of the feature extraction layer of the next stage; the graph feature representation that the corresponding readout layer extracts from the pooled graph of each preceding feature extraction layer forms a residual connection with the graph feature representation that the corresponding readout layer extracts from the last feature extraction layer, and the fully connected layer then outputs the prediction result of the bioinformatics classification.

One or more technical solutions provided in the embodiments of the present application have at least the following technical effects or advantages. A bioinformatics classification model based on graph structure data characteristics is provided, comprising a plurality of feature extraction layers stacked stage by stage, a plurality of readout layers and a fully connected layer. Each feature extraction layer comprises a graph convolution layer and a graph pooling layer, and each graph pooling layer consists of a three-channel pooling module and a feature fusion module. The three-channel pooling module comprises a graph convolution pooling channel, a differential pooling channel and a Transformer pooling channel, which respectively learn the local topological structure information, the global topological structure information and the dependency information of features between nodes of the graph structure data. The resulting model therefore performs better on the graph classification task and classifies bioinformatics data with graph structure features more accurately.

The foregoing is only an overview of the technical solution of the present application. So that the technical means of the application can be understood more clearly and implemented in accordance with the description, and so that the above and other objects, features and advantages of the application become more readily apparent, specific embodiments are set forth below.

Drawings

The application is further described below with reference to the accompanying drawings and embodiments.

FIG. 1 is a schematic diagram of the structure of a bioinformatics classification model according to the present application;

FIG. 2 is a flow chart of the processing principle of the graph pooling layer of the present application.

Detailed Description

The embodiment of the application provides a bioinformatics classification model based on graph structure data characteristics. The model is a graph neural network model based on feature fusion; it can simultaneously capture the local topological structure information, the global topological structure information and the long-distance node dependency information of a graph, fuse the various kinds of graph feature information, and generate a better feature representation of the whole graph.

The overall idea of the technical solution in the embodiment of the application is as follows. A bioinformatics classification model based on graph structure data characteristics is provided, comprising a plurality of feature extraction layers stacked stage by stage, a plurality of readout layers and a fully connected layer. Each feature extraction layer comprises a graph convolution layer and a graph pooling layer, and each graph pooling layer consists of a three-channel pooling module and a feature fusion module. The three-channel pooling module comprises a graph convolution pooling channel, a differential pooling channel and a Transformer pooling channel, which respectively learn the local topological structure information, the global topological structure information and the dependency information of features between nodes of the graph structure data. The resulting model performs better on the graph classification task, models the various kinds of information in graph structure data jointly, and improves the accuracy of bioinformatics classification.

Regarding the graph structure data: in fields such as drug discovery, virus analysis, protein analysis and molecular compound analysis, the graph structure data corresponds to a molecular structure in bioinformatics, comprising atoms and the chemical bonds between them, where the atoms are the nodes and the chemical bonds are the edges. The feature information of the graph structure data therefore includes the feature information of the nodes and the dependency information of features between nodes.

As shown in FIG. 1, the present embodiment provides a bioinformatics classification model based on graph structure data characteristics, which comprises a plurality of feature extraction layers stacked stage by stage, a plurality of readout layers and a fully connected layer. Each feature extraction layer comprises a graph convolution layer and a graph pooling layer; the graph convolution layer is connected through the graph pooling layer to a corresponding readout layer, and the readout layers are connected to the fully connected layer.

The components of the bioinformatics classification model based on graph structure data characteristics are described in detail below.

Feature extraction layer: used to extract the feature information and topological structure information of the graph structure data and to perform feature fusion. The bioinformatics classification model has a plurality of feature extraction layers stacked stage by stage, each comprising a graph convolution layer and a graph pooling layer. The graph pooling layer of each feature extraction layer is connected to the graph convolution layer of the feature extraction layer of the next stage.

Graph convolution layer: used to aggregate the feature information of a node and of its surrounding neighbor nodes. Each node is considered to be influenced by all of its surrounding neighbor nodes and by itself. The graph convolutional neural network aggregates the feature information of the node and its surrounding neighbor nodes according to the following propagation formula:

[formula not reproduced in the source text]
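For illustration only, the sketch below shows one widely used graph convolution propagation step, the symmetric-normalized rule ReLU(D^{-1/2}(A + I)D^{-1/2} X W). The text above does not reproduce the propagation formula, so this particular rule, the ReLU activation and the helper names `matmul` and `gcn_layer` are assumptions rather than the formula of the application.

```python
import math


def matmul(a, b):
    """Multiply two matrices given as lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]


def gcn_layer(adj, feats, weight):
    """One assumed propagation step: ReLU(D^-1/2 (A + I) D^-1/2 X W)."""
    n = len(adj)
    # Add self-loops so that each node also aggregates its own features.
    a_hat = [[adj[i][j] + (1.0 if i == j else 0.0) for j in range(n)] for i in range(n)]
    deg = [sum(row) for row in a_hat]
    # Symmetric normalization by the degrees of both endpoints.
    norm = [[a_hat[i][j] / math.sqrt(deg[i] * deg[j]) for j in range(n)] for i in range(n)]
    out = matmul(matmul(norm, feats), weight)
    return [[max(0.0, v) for v in row] for row in out]


# Two connected nodes with one feature each and an identity weight.
out = gcn_layer([[0, 1], [1, 0]], [[1.0], [3.0]], [[1.0]])
```

With this toy input each node ends up with the average of its own feature and its neighbor's feature, which is the "node plus surrounding neighbors" aggregation the paragraph describes.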

Graph pooling layer: used to capture graph feature information, which is a key task for graph classification. The pooling layer can effectively capture the topological structure, deep node features and other information of the graph structure data. As shown in FIG. 2, the pooling layer of the present application is an end-to-end three-channel pooling graph neural network model, consisting mainly of a three-channel pooling module and a feature fusion module. The three-channel pooling module comprises a graph convolution pooling channel, a differential pooling channel and a Transformer pooling channel; the feature fusion module consists of a cross-channel convolution module and an aggregation module.

Three-channel pooling module:

First channel, the Transformer pooling channel: this channel captures the dependency information of features between nodes, that is, the long-distance node dependency information, based on a TOP-K pooling model. The scores of the TOP-K pooling model are computed from the graph after transformation by a Transformer, as follows:

$$X_T = \mathrm{Transformer}(X)$$

$$S_T = X_T W_1$$

$$id_1 = \mathrm{TOP}(S_T, k)$$

wherein:

$X_T$ is the feature matrix obtained by passing the node features $X$ through the Transformer module, and is used as the feature for calculation;

$W_1 \in \mathbb{R}^{d \times 1}$ is a learnable parameter matrix used to learn the influence of each feature dimension of a node on the overall feature of the node, where $\mathbb{R}$ denotes the real numbers and $d$ denotes the node feature dimension;

$S_T \in \mathbb{R}^{n \times 1}$ holds the scores of all nodes of the graph structure data, one score per node, where $n$ denotes the number of original nodes;

$\mathrm{TOP}(S_T, k)$ sorts the nodes according to the scores calculated from the long-distance node dependency information and takes the ids of the $k$ highest-scoring nodes as the retained node ids $id_1$; the retained nodes are regarded as the important nodes of the graph structure data, and the remaining nodes are discarded.

After the remaining nodes are discarded, the feature information of the discarded nodes is aggregated onto the retained nodes in a certain proportion. The specific formula is as follows:

[formula not reproduced in the source text]

wherein the symbols of the formula denote, in turn:

the ids of the discarded nodes;

an aggregation matrix of the feature information of the discarded nodes, by which that information is aggregated along the edges of the graph structure data onto the feature information of the retained nodes, where $k$ denotes the number of retained nodes and $n$ denotes the number of original nodes;

$X_{F1}$, the node feature matrix generated after the Transformer pooling channel.
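The score-and-select step of this channel, as stated in claim 2, can be sketched as follows. This is a minimal illustration, assuming the Transformer output is already available as a plain matrix; the names `node_scores` and `top_k_ids` and the toy values are hypothetical, and the step that aggregates discarded-node features onto retained nodes is left out because its formula is not reproduced in the text.

```python
def node_scores(x_t, w1):
    """S_T = X_T W_1, with X_T of shape n x d and W_1 given as a flat list of length d."""
    return [sum(f * w for f, w in zip(row, w1)) for row in x_t]


def top_k_ids(scores, k):
    """Return the ids of the k highest-scoring nodes, highest score first."""
    order = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    return order[:k]


# x_t stands in for Transformer(X); the numbers are made up for the example.
x_t = [[1.0, 0.0], [0.0, 2.0], [1.0, 2.0], [0.5, 0.0]]
w1 = [1.0, 1.0]

s_t = node_scores(x_t, w1)      # one score per node
id1 = top_k_ids(s_t, 2)         # ids of the retained nodes
discarded = [i for i in range(len(s_t)) if i not in id1]
```

Here nodes 2 and 1 have the highest scores and are retained; nodes 0 and 3 are the discarded nodes whose features would then be aggregated onto the retained ones.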

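Second channel, the differential pooling channel: this channel captures the global topological structure information and generates a coarsened subgraph. The application designs a graph clustering algorithm that uses a graph convolutional neural network to learn a soft assignment matrix for generating the coarsened graph $G_C$. The assignment matrix $C$ is generated by the following formula:

$$C = \mathrm{GCN}(A, X)$$

wherein $A$ denotes the adjacency matrix and $X$ denotes the node feature matrix.

After the assignment matrix $C$ is obtained, the feature matrix $X_C$ and the adjacency matrix $A_C$ of the coarsened graph $G_C$ are generated by the following formulas:

$$X_C = C^{T} X$$

$$A_C = C^{T} A C$$

wherein $T$ denotes the transpose.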
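The coarsening formulas of claim 3, X_C = C^T X and A_C = C^T A C, can be checked on a small example. The sketch below assumes the assignment matrix C has already been produced by the graph convolutional network; the hard 0/1 assignment and the function names are illustrative choices only.

```python
def transpose(m):
    """Transpose a matrix given as a list of rows."""
    return [list(col) for col in zip(*m)]


def matmul(a, b):
    """Multiply two matrices given as lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]


def coarsen(adj, feats, assign):
    """Return (X_C, A_C) = (C^T X, C^T A C) for an n x c assignment matrix C."""
    c_t = transpose(assign)
    x_c = matmul(c_t, feats)
    a_c = matmul(matmul(c_t, adj), assign)
    return x_c, a_c


# Path graph 0 - 1 - 2; nodes 0 and 1 go to cluster 0, node 2 goes to cluster 1.
adj = [[0, 1, 0], [1, 0, 1], [0, 1, 0]]
feats = [[1], [2], [3]]
assign = [[1, 0], [1, 0], [0, 1]]

x_c, a_c = coarsen(adj, feats, assign)
```

In the result, the diagonal entry of A_C for cluster 0 counts the edge inside that cluster in both directions, and the off-diagonal entries count the single edge between the two clusters.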

Third channel, the graph convolution pooling channel: this channel captures the local topological structure information in the graph. It is a node-voting graph pooling method based on a graph convolutional neural network, which uses the network to capture the local topological structure information among the nodes of the graph structure data. The node scores are calculated as follows:

$$S_G = \mathrm{GCN}(A, X, W_3)$$

$$id_3 = \mathrm{TOP}(S_G, k)$$

wherein:

$W_3 \in \mathbb{R}^{d \times 1}$ is a learnable parameter used to learn the influence of each feature of a node of the graph structure data on the overall feature of the node, where $\mathbb{R}$ denotes the real numbers and $d$ denotes the node feature dimension;

$S_G \in \mathbb{R}^{n \times 1}$ holds the scores of all nodes of the graph structure data in the graph convolution pooling channel, one score per node, where $n$ denotes the number of original nodes;

$\mathrm{TOP}(S_G, k)$ sorts the nodes according to the scores calculated based on the node long-distance dependency information and takes the ids of the $k$ highest-scoring nodes as the retained node ids $id_3$; the retained nodes are regarded as the important nodes of the graph structure data, and the remaining nodes are discarded.

After the nodes are discarded, the feature information of the discarded nodes is aggregated onto the retained nodes in a certain proportion. The specific formula is as follows:

[formula not reproduced in the source text]

wherein the symbols of the formula denote, in turn:

the ids of the discarded nodes;

an aggregation matrix of the feature information of the discarded nodes, where $n$ denotes the number of original nodes and $k$ denotes the number of retained nodes;

$X_{F3}$, the node feature matrix generated after the graph convolution pooling channel.
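As a rough illustration of how a graph convolution can act as a node scorer, the sketch below computes one score per node by propagating the projected features X W_3 over a symmetric-normalized adjacency matrix with self-loops. The exact form of GCN(A, X, W_3) is not spelled out in the text, so this normalization and the name `gcn_scores` are assumptions.

```python
import math


def gcn_scores(adj, feats, w3):
    """Assumed scorer: S_G = D^-1/2 (A + I) D^-1/2 X W_3, one score per node."""
    n = len(adj)
    # Self-loops let a node's own features contribute to its score.
    a_hat = [[adj[i][j] + (1.0 if i == j else 0.0) for j in range(n)] for i in range(n)]
    deg = [sum(row) for row in a_hat]
    # Project each node's d features to a single value with W_3 (flat list of length d).
    proj = [sum(f * w for f, w in zip(row, w3)) for row in feats]
    return [
        sum(a_hat[i][j] / math.sqrt(deg[i] * deg[j]) * proj[j] for j in range(n))
        for i in range(n)
    ]


# Star graph: node 0 is connected to nodes 1 and 2; all features are equal.
adj = [[0, 1, 1], [1, 0, 0], [1, 0, 0]]
feats = [[1.0], [1.0], [1.0]]

s_g = gcn_scores(adj, feats, [1.0])
id3 = sorted(range(len(s_g)), key=lambda i: s_g[i], reverse=True)[:1]
```

Even with identical node features, the hub node receives the highest score because it collects votes from both neighbors, which is the sense in which the score reflects local topology.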

Feature fusion module: this module comprises a cross-channel convolution module and an aggregation module.

The cross-channel convolution module uses a cross-channel convolution method to fuse the inter-node feature dependency information of the Transformer pooling channel with the global topological structure information of the differential pooling channel, and to fuse the local topological structure information of the graph convolution pooling channel with the global topological structure information of the differential pooling channel, thereby obtaining two cross-channel aggregated pooled graphs. The cross-channel convolution is given by the following formula:

$$X = \sigma(X_F + A_{cross} X_C)$$

wherein:

$X$ comprises $X_1$ and $X_2$: $X_1$ is the node feature matrix generated by cross-channel convolution for the nodes retained in the Transformer pooling channel, and $X_2$ is the node feature matrix generated by cross-channel convolution for the nodes retained in the graph convolution pooling channel;

$\sigma$ denotes an activation function;

$X_F$ comprises $X_{F1}$ and $X_{F3}$: $X_{F1}$ is the node feature matrix generated after the Transformer pooling channel, and $X_{F3}$ is the node feature matrix generated after the graph convolution pooling channel;

$X_C$ is the node feature matrix generated after the differential pooling channel;

$A_{cross} \in \mathbb{R}^{k \times c}$ is the conversion matrix from $X_C$ to $X_F$, where $\mathbb{R}$ denotes the real numbers, $k$ denotes the number of nodes of the graph structure data generated in the Transformer pooling channel or the graph convolution pooling channel, and $c$ denotes the number of nodes of the graph structure data generated in the differential pooling channel. $A_{cross}$ is obtained by the following formula:

$$A_{cross}[i] = C[i], \quad i \in id_1 \ \text{or} \ id_3$$

wherein $C$ is the soft assignment matrix learned by the graph neural network in the differential pooling channel.

After this operation there are two cross-channel aggregated pooled graphs. To aggregate the information of the two, the application designs an aggregation module.
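The cross-channel convolution of claim 5, X = σ(X_F + A_cross X_C) with A_cross built from the rows of C at the retained node ids, can be sketched as below. The activation σ is not named in the text, so ReLU is used here purely as an assumption, and the function name and toy values are hypothetical.

```python
def matmul(a, b):
    """Multiply two matrices given as lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]


def cross_channel_conv(x_f, x_c, assign, retained_ids):
    """Compute sigma(X_F + A_cross X_C), with ReLU standing in for sigma."""
    # A_cross[i] = C[i] for every retained node id: a k x c slice of the n x c matrix C.
    a_cross = [assign[i] for i in retained_ids]
    mixed = matmul(a_cross, x_c)  # k x d: cluster features routed to retained nodes
    return [
        [max(0.0, f + m) for f, m in zip(row_f, row_m)]
        for row_f, row_m in zip(x_f, mixed)
    ]


# Three original nodes, two clusters, and nodes 1 and 2 retained by a Top-K channel.
assign = [[1.0, 0.0], [1.0, 0.0], [0.0, 1.0]]   # soft assignment matrix C (n x c)
x_c = [[3.0], [5.0]]                             # differential pooling channel output (c x d)
x_f = [[1.0], [-10.0]]                           # Top-K channel output for nodes 1 and 2 (k x d)

x_fused = cross_channel_conv(x_f, x_c, assign, [1, 2])
```

Each retained node receives the features of the cluster or clusters it is assigned to, which is how the global information of the differential pooling channel reaches the nodes kept by the other two channels.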

The aggregation module denotes the indices of the nodes retained in the Transformer pooling channel as $id_1$ and the indices of the nodes retained in the graph convolution pooling channel as $id_2$. For a node present in both the Transformer pooling channel and the graph convolution pooling channel, the mean of its features in the two channels is taken as the feature of the new node; for a node present in only one of the two channels, its feature in that channel is taken as the feature of the new node. The new nodes are the nodes of the graph structure data after processing by the aggregation module. The specific formula is as follows:

$$id = id_1 \cup id_2$$

[the remaining formulas are not reproduced in the source text]

A subgraph consisting of the most representative nodes of the original graph structure data is extracted by index, and its adjacency matrix is expressed as follows:

[formula not reproduced in the source text]
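The aggregation rule described above (union of the retained indices, mean of the features for nodes kept by both channels, and the single available feature otherwise) can be sketched as follows. Extracting the subgraph adjacency by selecting the rows and columns of the original adjacency matrix at the retained indices is an assumption, since the formula itself is not reproduced; `aggregate` and `induced_adjacency` are hypothetical names.

```python
def aggregate(ids_a, x_a, ids_b, x_b):
    """Merge two retained-node sets; average the features of nodes present in both."""
    feats_a = dict(zip(ids_a, x_a))
    feats_b = dict(zip(ids_b, x_b))
    merged_ids = sorted(set(ids_a) | set(ids_b))  # id = id_1 U id_2
    merged = []
    for i in merged_ids:
        if i in feats_a and i in feats_b:
            merged.append([(p + q) / 2 for p, q in zip(feats_a[i], feats_b[i])])
        elif i in feats_a:
            merged.append(list(feats_a[i]))
        else:
            merged.append(list(feats_b[i]))
    return merged_ids, merged


def induced_adjacency(adj, ids):
    """Assumed extraction: keep only the rows and columns of the retained nodes."""
    return [[adj[i][j] for j in ids] for i in ids]


# Path graph 0 - 1 - 2 - 3; one channel keeps nodes 0 and 2, the other keeps 2 and 3.
adj = [[0, 1, 0, 0], [1, 0, 1, 0], [0, 1, 0, 1], [0, 0, 1, 0]]
ids, x_p = aggregate([0, 2], [[1.0], [3.0]], [2, 3], [[5.0], [7.0]])
a_p = induced_adjacency(adj, ids)
```

Node 2 is kept by both channels, so its new feature is the mean of 3.0 and 5.0; nodes 0 and 3 keep the feature from the one channel that retained them.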

Readout layer: the readout layer extracts a graph feature representation of each pooled graph using a readout function. The readout function is:

[formula not reproduced in the source text]

wherein $\mathrm{Read} \in \mathbb{R}^{d}$ denotes the feature representation of the pooled graph, $d$ denotes the node feature dimension, and $n$ denotes the number of nodes of the pooled graph.

The graph pooling layer of each feature extraction layer is connected to a readout layer. The graph feature representation that the corresponding readout layer extracts from the pooled graph of each preceding feature extraction layer forms a residual connection with the graph feature representation that the corresponding readout layer extracts from the last feature extraction layer, which can alleviate over-smoothing and over-fitting of the model.
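The readout formula is not reproduced in the text, so the sketch below only illustrates one readout that is consistent with the stated shapes: averaging the n node feature vectors of a pooled graph into a single vector of dimension d. Combining the per-stage readouts by element-wise addition is likewise an assumed reading of "residual connection", and both function names are hypothetical.

```python
def mean_readout(x):
    """Assumed readout: average the n node feature vectors into one d-dimensional vector."""
    n = len(x)
    return [sum(col) / n for col in zip(*x)]


def combine_readouts(readouts):
    """Assumed residual combination: element-wise sum of the per-stage readouts."""
    return [sum(vals) for vals in zip(*readouts)]


# Pooled graphs from two stacked feature extraction layers (3 nodes, then 2 nodes).
stage1 = [[1.0, 2.0], [3.0, 4.0], [5.0, 6.0]]
stage2 = [[2.0, 2.0], [4.0, 6.0]]

graph_repr = combine_readouts([mean_readout(stage1), mean_readout(stage2)])
```

Because every readout has dimension d regardless of how many nodes survive pooling, the representations of different stages can be combined directly before classification.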

Fully connected layer: this layer comprises fully connected layers and activation functions. It uses a multi-layer perceptron as the classifier to classify the input graph feature representation, according to the following formula:

$$P = \mathrm{SOFTMAX}(\mathrm{Linear}(\mathrm{RELU}(\mathrm{Linear}(\mathrm{Read}))))$$

wherein $P$ is the bioinformatics class predicted for the graph structure data.
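The classifier formula of claim 7 composes two linear layers with a ReLU in between and a softmax on top. A minimal sketch follows, with made-up weights and the assumption that each weight matrix is stored as input-dimension rows by output-dimension columns.

```python
import math


def linear(x, weight, bias):
    """y = x W + b for a vector x, a weight matrix W (in x out) and a bias vector b."""
    return [sum(xi * wi for xi, wi in zip(x, col)) + b for col, b in zip(zip(*weight), bias)]


def relu(v):
    return [max(0.0, a) for a in v]


def softmax(v):
    # Subtract the maximum before exponentiating for numerical stability.
    m = max(v)
    exps = [math.exp(a - m) for a in v]
    total = sum(exps)
    return [e / total for e in exps]


def classify(read, w1, b1, w2, b2):
    """P = SOFTMAX(Linear(RELU(Linear(Read))))."""
    return softmax(linear(relu(linear(read, w1, b1)), w2, b2))


# Two-dimensional graph representation, identity weights, two classes (toy values).
identity = [[1.0, 0.0], [0.0, 1.0]]
p = classify([1.0, 2.0], identity, [0.0, 0.0], identity, [0.0, 0.0])
predicted_class = p.index(max(p))
```

The output P is a probability vector over the classes, and the predicted class is the index with the largest probability.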

Finally, the bioinformatics classification model based on graph structure data characteristics obtains the final output prediction result through the fully connected layer.

The following illustrates the implementation of a bioinformatics classification model based on the characteristics of the graph structure data, which includes the following steps:

S1, preprocessing the graph data: the public protein dataset DD and the biomedical dataset NCI1 are selected for model validation. Each dataset is divided into a training set, a validation set and a test set in the ratio 7:1.5:1.5, and standardization is applied uniformly.
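The 7:1.5:1.5 split of step S1 can be sketched as below. The shuffling, the fixed seed and the function name `split_dataset` are illustrative assumptions; the text does not say how the graphs are assigned to the three parts.

```python
import random


def split_dataset(items, ratios=(7, 1.5, 1.5), seed=0):
    """Shuffle the items and split them into training, validation and test parts."""
    items = list(items)
    random.Random(seed).shuffle(items)  # fixed seed so the split is reproducible
    total = sum(ratios)
    n = len(items)
    n_train = int(n * ratios[0] / total)
    n_val = int(n * ratios[1] / total)
    train = items[:n_train]
    val = items[n_train:n_train + n_val]
    test = items[n_train + n_val:]      # the remainder goes to the test set
    return train, val, test


# 100 placeholder graph ids standing in for the graphs of a dataset.
train, val, test = split_dataset(range(100))
```

For 100 graphs this gives 70 training, 15 validation and 15 test graphs, with every graph appearing in exactly one part.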

S2, establishing the graph neural network model: in order to extract the feature and topological structure information of the graph data and carry out the protein property prediction task, the model adopts a hierarchical pooling structure. That is, the graph structure data is sampled hierarchically, the number of nodes is reduced at each layer, the node features of each layer are aggregated, and an overall feature vector representation is finally obtained.

The FIPool model is formed by stacking a plurality of feature extraction layers, each comprising a graph convolution layer and a graph pooling layer, so as to extract and fuse the features of the graph data. The three-channel graph pooling layer is designed to extract the feature information and topological structure information of the graph structure data, so that the local topological structure information, the global topological structure information and the dependency information of features between nodes can be learned effectively. The pooling layer also comprises a feature fusion module, which aggregates the feature information of the different channels by means of convolution between channels.

Every feature extraction layer except the last is connected to the final output through its readout layer, forming residual connections, and the fully connected layer outputs the classification prediction result.

S3, model training and parameter tuning: the sequence of input graph structure data, after the preprocessing of S1, is input into the model constructed in S2, and the classification prediction result is finally output through the last fully connected layer of the model. Throughout model training, the loss function, the optimizer and the tunable hyperparameters are adjusted to find the hyperparameter combination with which the model performs best on the test data set, and the model is thereby established.

S4, analyzing and evaluating the performance of the graph neural network model: the established model is compared with a number of baseline models on the chosen evaluation metrics in order to verify and evaluate its performance.

The method, device, system, equipment and medium provided by the embodiment of the application have at least the following technical effects or advantages. A bioinformatics classification model based on graph structure data characteristics is provided, comprising a plurality of feature extraction layers stacked stage by stage, a plurality of readout layers and a fully connected layer. Each feature extraction layer comprises a graph convolution layer and a graph pooling layer, and each graph pooling layer consists of a three-channel pooling module and a feature fusion module. The three-channel pooling module comprises a graph convolution pooling channel, a differential pooling channel and a Transformer pooling channel, which respectively learn the local topological structure information, the global topological structure information and the dependency information of features between nodes of the graph structure data. The resulting model therefore performs better on the graph classification task and classifies bioinformatics data with graph structure features more accurately.

While specific embodiments of the application have been described above, it will be appreciated by those skilled in the art that the specific embodiments described are illustrative only and not intended to limit the scope of the application, and that equivalent modifications and variations of the application in light of the spirit of the application will be covered by the claims of the present application.

Claims

1. A bioinformatics classification model based on graph structure data features, characterized in that: the model comprises a plurality of feature extraction layers stacked stage by stage, a plurality of readout layers and a fully connected layer; each feature extraction layer comprises a graph convolution layer and a graph pooling layer, the graph convolution layer being connected through the graph pooling layer to a corresponding readout layer, and the readout layers being connected to the fully connected layer;

2. The bioinformatics classification model based on graph structure data features of claim 1, wherein: the Transformer pooling channel captures the dependency information of features between nodes based on a TOP-K pooling model, the scores of the TOP-K pooling model being computed from the graph after transformation by a Transformer, according to the following formulas:

$$X_T = \mathrm{Transformer}(X)$$

$$S_T = X_T W_1$$

$$id_1 = \mathrm{TOP}(S_T, k)$$

wherein:

$X_T$ is the feature matrix obtained by passing the node features $X$ through the Transformer module, and is used as the feature for calculation;

$W_1 \in \mathbb{R}^{d \times 1}$ is a learnable parameter matrix used to learn the influence of each feature dimension of a node on the overall feature of the node, where $\mathbb{R}$ denotes the real numbers and $d$ denotes the node feature dimension;

$S_T \in \mathbb{R}^{n \times 1}$ holds the scores of all nodes of the graph structure data, one score per node, where $n$ denotes the number of original nodes;

$\mathrm{TOP}(S_T, k)$ sorts the nodes according to the scores calculated from the long-distance node dependency information and takes the ids of the $k$ highest-scoring nodes as the retained node ids $id_1$; the retained nodes are regarded as the important nodes of the graph structure data, and the remaining nodes are discarded;

after the remaining nodes are discarded, the feature information of the discarded nodes is aggregated onto the retained nodes in a certain proportion, and the specific formula is as follows:

[formula not reproduced in the source text]

wherein the symbols of the formula denote, in turn:

the ids of the discarded nodes;

an aggregation matrix by which the feature information of the discarded nodes is aggregated along the edges of the graph structure data onto the feature information of the retained nodes, where $\mathbb{R}$ denotes the real numbers, $k$ denotes the number of retained nodes and $n$ denotes the number of original nodes;

the node feature matrix generated after the Transformer pooling channel.

3. The bioinformatics classification model based on graph structure data features of claim 1, wherein: the differential pooling channel uses a graph convolutional neural network to learn a soft assignment matrix for generating a coarsened graph $G_C$, the assignment matrix $C$ being generated by the following formula:

$$C = \mathrm{GCN}(A, X)$$

wherein: $A$ denotes the adjacency matrix; $X$ denotes the node feature matrix;

after the assignment matrix $C$ is obtained, the feature matrix $X_C$ and the adjacency matrix $A_C$ of the coarsened graph $G_C$ are generated by the following formulas:

$$X_C = C^{T} X$$

$$A_C = C^{T} A C$$

wherein: $T$ denotes the transpose.

4. The bioinformatics classification model based on graph structure data features of claim 1, wherein: the graph convolution pooling channel is a node-voting graph pooling method based on a graph convolutional neural network, which uses the graph convolutional neural network to capture the local topological structure information among the nodes of the graph structure data, the node scores being calculated as follows:

$$S_G = \mathrm{GCN}(A, X, W_3)$$

$$id_3 = \mathrm{TOP}(S_G, k)$$

wherein:

$W_3 \in \mathbb{R}^{d \times 1}$ is a learnable parameter used to learn the influence of each feature of a node of the graph structure data on the overall feature of the node, where $\mathbb{R}$ denotes the real numbers and $d$ denotes the node feature dimension;

$S_G \in \mathbb{R}^{n \times 1}$ holds the scores of all nodes of the graph structure data in the graph convolution pooling channel, one score per node, where $n$ denotes the number of original nodes;

$\mathrm{TOP}(S_G, k)$ sorts the nodes according to the scores calculated based on the node long-distance dependency information and takes the ids of the $k$ highest-scoring nodes as the retained node ids $id_3$; the retained nodes are regarded as the important nodes of the graph structure data, and the remaining nodes are discarded;

after the nodes are discarded, the feature information of the discarded nodes is aggregated onto the retained nodes in a certain proportion, and the specific formula is as follows:

[formula not reproduced in the source text]

wherein the symbols of the formula denote, in turn:

the ids of the discarded nodes;

an aggregation matrix of the feature information of the discarded nodes, where $\mathbb{R}$ denotes the real numbers, $n$ denotes the number of original nodes and $k$ denotes the number of retained nodes;

5. The bioinformatics classification model based on graph structure data features of claim 1, wherein: the feature fusion module comprises a cross-channel convolution module and an aggregation module;

the cross-channel convolution is given by the following formula:

$$X = \sigma(X_F + A_{cross} X_C)$$

wherein:

$X$ comprises $X_1$ and $X_2$: $X_1$ is the node feature matrix generated by cross-channel convolution for the nodes retained in the Transformer pooling channel, and $X_2$ is the node feature matrix generated by cross-channel convolution for the nodes retained in the graph convolution pooling channel;

$\sigma$ denotes an activation function;

$X_F$ comprises $X_{F1}$ and $X_{F3}$: $X_{F1}$ is the node feature matrix generated after the Transformer pooling channel, and $X_{F3}$ is the node feature matrix generated after the graph convolution pooling channel;

$X_C$ is the node feature matrix generated after the differential pooling channel;

$A_{cross} \in \mathbb{R}^{k \times c}$ is the conversion matrix from $X_C$ to $X_F$, where $\mathbb{R}$ denotes the real numbers, $k$ denotes the number of nodes of the graph structure data generated in the Transformer pooling channel or the graph convolution pooling channel, and $c$ denotes the number of nodes of the graph structure data generated in the differential pooling channel; $A_{cross}$ is obtained by the following formula:

$$A_{cross}[i] = C[i], \quad i \in id_1 \ \text{or} \ id_3$$

wherein $C$ is the soft assignment matrix learned by the graph neural network in the differential pooling channel;

the aggregation module denotes the indices of the nodes retained in the Transformer pooling channel as $id_1$ and the indices of the nodes retained in the graph convolution pooling channel as $id_2$; for a node present in both the Transformer pooling channel and the graph convolution pooling channel, the mean of its features is taken as the feature of the new node, and for a node present in only the Transformer pooling channel or only the graph convolution pooling channel, its feature is taken as the feature of the new node; the specific formula is as follows:

$$id = id_1 \cup id_2$$

[the remaining formulas are not reproduced in the source text]

wherein:

the first symbol denotes an adjacency matrix used to extract, by index, the subgraph consisting of the most representative nodes of the original graph structure data;

$K = |id|$ denotes the number of nodes to be retained;

$n$ is the total number of nodes in the original graph structure data;

the pooled graph is then generated using the following two formulas:

[formulas not reproduced in the source text]

wherein $X_P \in \mathbb{R}^{K \times d}$ is the node feature matrix of the aggregated graph structure data, and $A^{P} \in \{0,1\}^{K \times K}$ is the adjacency matrix of the aggregated graph structure data.

6. The bioinformatics classification model based on graph structure data features of claim 1, wherein: the readout layer extracts a graph feature representation of each pooled graph using a readout function, the readout function being:

[formula not reproduced in the source text]

wherein $\mathrm{Read} \in \mathbb{R}^{d}$ denotes the feature representation of the pooled graph, $d$ denotes the node feature dimension, and $n$ denotes the number of nodes of the pooled graph.

7. The bioinformatics classification model based on graph structure data features of claim 1, wherein: the fully connected layer uses a multi-layer perceptron as the classifier to classify the input graph feature representation, according to the following formula:

$$P = \mathrm{SOFTMAX}(\mathrm{Linear}(\mathrm{RELU}(\mathrm{Linear}(\mathrm{Read}))))$$

wherein $P$ is the bioinformatics class predicted for the graph structure data.