CN110956199A

CN110956199A - Node classification method based on sampling subgraph network

Info

Publication number: CN110956199A
Application number: CN201911068473.4A
Authority: CN
Inventors: 宣琦; 王金焕; 裘坤锋; 单雅璐
Original assignee: Zhejiang University of Technology ZJUT
Current assignee: Zhejiang University of Technology ZJUT
Priority date: 2019-11-05
Filing date: 2019-11-05
Publication date: 2020-04-03

Abstract

A node classification method based on a sampling subgraph network comprises the following steps: S1, walk sampling; S2, construction of the simple subgraph SGN⁰; S3, construction of the first-order subgraph network SGN¹; S4, construction of the second-order subgraph network SGN²; S5, feature extraction from the network graphs; S6, averaging of the feature vectors; S7, formation of the characterization matrices; S8, expansion of the feature vector space; S9, classification of the representations of all nodes of the original network graph with an extremely randomized trees classifier from machine learning, evaluated by ten-fold cross-validation to obtain the classification accuracy. The method converts the node classification problem into a graph classification problem by using the sequences generated by walks, makes full use of latent structural information, enhances the classification performance of conventional walk-based methods, and further introduces the subgraph network (SGN) to expand the network feature space and improve the classification accuracy.

Description

Node classification method based on sampling subgraph network

Technical Field

The invention relates to network science, data mining and data analysis technologies, in particular to a node classification method based on a sampling subgraph network.

Background

Real-world systems such as social networks, citation networks and biological networks can be analyzed by constructing complex network models, and many of the resulting analysis tasks concern the network nodes themselves, node classification among them. In a typical node classification task, the goal is to determine the most likely label of a node. For example, in a social network, different nodes represent different users, and by predicting node labels an advertiser can infer users' interests and hobbies and target promotions accordingly. Studying the node classification problem on network graphs is therefore of practical importance.

Current algorithms for node classification include walk-based network representation learning algorithms such as DeepWalk and Node2vec. Building on the theory behind Word2vec, they treat the nodes of a network as analogous to words in natural language processing and use a word-vector generation method to represent the nodes as low-dimensional feature vectors. This class of algorithms maps network nodes to low-dimensional vectors in Euclidean space, which makes it convenient to carry out subsequent network analysis tasks, such as node classification, with machine learning methods.

At present, node classification methods for social networks do not make full use of the local structural information around a node, neglect the multi-level nature of node features in the network, and cannot capture node information in deeper network structures, so their classification performance remains limited.

Therefore, the invention uses a walk strategy to construct a local ego network for each node and thereby converts the node classification problem into a graph classification problem. In addition, based on the subgraph network (SGN), the invention applies higher-order mappings to the local network obtained from the walks in order to capture deeper topological structure, expands the structural feature space of the nodes by combining several feature extraction methods, and effectively supplements the node features of the original network so as to improve the node classification accuracy.

Disclosure of Invention

In order to overcome the defect that existing walk-based node classification methods cannot make full use of the feature information of the nodes in a network, the invention provides a node classification method based on a sampling subgraph network, which converts the node classification problem into a network graph classification problem, makes full use of the existing structural information of the network graph, and enriches research on node classification.

The technical scheme adopted by the invention for realizing the aim is as follows:

A node classification method based on a sampling subgraph network comprises the following steps:

S1: Walk sampling. For each node in the original network graph G = (V, E), walk sampling is performed using either the random walk of DeepWalk or the second-order biased walk strategy of Node2vec, yielding a sampling sequence of length L; each node is sampled repeatedly, M times;

S2: Construction of the simple subgraph SGN⁰. Each sampling sequence is considered in turn; the nodes it contains, together with the edges between those nodes, are extracted from the original network graph to form a network graph, namely a simple subgraph. Each sampling sequence yields one simple subgraph SGN⁰;

S3: Construction of the first-order subgraph network SGN¹. Each simple subgraph SGN⁰ from step S2 is considered in turn and subjected to a first-order mapping: every edge of SGN⁰ is mapped to a distinct node of SGN¹, and if two edges of SGN⁰ share a node, the two corresponding nodes of SGN¹ are connected by an edge. The result is a complete first-order subgraph network SGN¹; each SGN⁰ yields one SGN¹;

S4: Construction of the second-order subgraph network SGN². Each first-order subgraph network SGN¹ from step S3 is considered in turn and subjected to a second-order mapping: every edge of SGN¹ is regarded as a distinct node, and if two edges of SGN¹ share a node, an edge is added between the two nodes converted from those edges. These nodes and edges form the second-order subgraph network SGN²; each SGN¹ yields one SGN²;

S5: Feature extraction from the network graphs. A feature extraction method is applied separately to all simple subgraphs SGN⁰, all first-order subgraph networks SGN¹ and all second-order subgraph networks SGN², yielding in each case |V| × M feature vectors of dimension K, where |V| is the number of nodes of the original network graph;

S6: Averaging of the feature vectors. The feature vectors extracted in step S5 from the simple subgraphs SGN⁰, the first-order subgraph networks SGN¹ and the second-order subgraph networks SGN² are processed separately: the M K-dimensional feature vectors belonging to the same node of the original network graph G = (V, E) are averaged, so that each node of the original network graph finally obtains one K-dimensional feature vector under each of the three kinds of subgraph;

S7: Formation of the characterization matrices. For each of the three cases (simple subgraph SGN⁰, first-order subgraph network SGN¹ and second-order subgraph network SGN²), the characterization vectors of all nodes of the original network graph G = (V, E) are assembled into a characterization matrix, giving Φ₀ ∈ R^{|V|×K}, Φ₁ ∈ R^{|V|×K} and Φ₂ ∈ R^{|V|×K};

S8: Expansion of the feature vector space. The characterization matrices learned from the different kinds of subgraph network are concatenated horizontally to expand the feature space, giving the network characterization matrix of all nodes, Φ = merge(Φ₀, Φ₁, Φ₂) ∈ R^{|V|×3K};

S9: Classification. An extremely randomized trees classifier from machine learning is applied to the representations of all nodes of the original network graph, and ten-fold cross-validation is used to obtain the classification accuracy.

Further, in step S1, the elements of each sampling sequence are node labels of the original network graph, and the same node label may appear multiple times; when the sequence length is calculated, repeated occurrences of a node also count toward the effective length.

The invention provides a node classification method based on a sampling subgraph network. The method introduces the idea of constructing a network graph from a walk sampling sequence and thereby converts the node classification task into a graph classification task. The feature space of the original network is further expanded through the subgraph network (SGN), which improves the classification accuracy.

The beneficial effects of the invention are as follows: through walk sampling, sampling sequences are obtained for each node of the original network graph and converted into network graphs, which provides a new way of converting the node classification problem into a network graph classification problem; by then constructing SGNs, the network feature space is expanded, latent structural information is fully used, and the performance of conventional walk-based methods is enhanced. In addition, the invention can fuse several feature extraction methods and classifies the nodes with the extremely randomized trees algorithm from machine learning, thereby improving the node classification accuracy compared with the prior art.

Drawings

Fig. 1 is a design flow chart of a node classification method based on a sampling subgraph network in the invention.

Fig. 2 is a diagram of the extraction of the simple subgraph SGN⁰, the first-order subgraph network SGN¹ and the second-order subgraph network SGN² in the invention, wherein (a) shows the original network, (b) shows the walk sampling of node 1, (c) shows a sequence obtained from the walk, (d) shows the simple subgraph SGN⁰ extracted from the original network graph, (e) shows the first-order subgraph network SGN¹ extracted from the simple subgraph SGN⁰, and (f) shows the second-order subgraph network SGN² extracted from the first-order subgraph network SGN¹.

Detailed Description

The following detailed description of embodiments of the invention is provided in connection with the accompanying drawings.

Referring to Fig. 1 and Fig. 2, the node classification method based on a sampling subgraph network is described by taking a social network as an example, in which a node represents a member of the social network and an edge represents a friendship between members. All members can be divided into two categories according to the community to which they belong, so the node classification algorithm can identify community membership. The invention models the Karate data set as a social network G = (V, E), where V is the node set, each node representing a member, and E is the edge set, each edge representing an interaction between two members; the nodes are then converted into network graphs and the subgraph networks are extracted for analysis.
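As an illustration of the data set used in this example, the following sketch loads the Karate network and its two community labels. It assumes the copy of Zachary's karate club network shipped with the networkx library, which may differ in detail from the data file used by the inventors; the variable names are illustrative only.

```python
import networkx as nx

# Zachary's karate club network as bundled with networkx (an assumption:
# the patent does not say which copy of the data set was used).
G = nx.karate_club_graph()

# Each node carries a "club" attribute naming the community it belongs to.
labels = {node: G.nodes[node]["club"] for node in G.nodes}

# Adjacency lists in a plain-dict form that is convenient for walk sampling.
adj = {node: list(G.neighbors(node)) for node in G.nodes}

print(G.number_of_nodes(), G.number_of_edges(), sorted(set(labels.values())))
```

The plain-dict adjacency form is only a convenience for the later steps; any graph representation that exposes node neighbours would serve equally well.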

The invention comprises the following nine steps:

Step 1: walk sampling;

Step 2: construction of the simple subgraph SGN⁰;

Step 3: construction of the first-order subgraph network SGN¹;

Step 4: construction of the second-order subgraph network SGN²;

Step 5: feature extraction from the network graphs;

Step 6: averaging of the feature vectors;

Step 7: formation of the characterization matrices;

Step 8: expansion of the feature vector space;

Step 9: classification of the representations of all nodes of the original network graph with an extremely randomized trees classifier from machine learning, evaluated by ten-fold cross-validation to obtain the classification accuracy.

In step S1, the walk sampling process is as follows: using either the random walk of DeepWalk or the second-order biased walk strategy of Node2vec, walk sampling is performed for each node in the original social network graph G = (V, E) to obtain a corresponding sampling sequence of length L. Because several candidate next nodes may be reachable from the current node with equal probability, each walk can produce a different sequence carrying different information; each node is therefore sampled repeatedly, M times.
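A minimal sketch of the walk sampling in step S1 is given below. It implements only the uniform random walk in the style of DeepWalk, not the second-order biased walk of Node2vec, and it assumes the graph is given as a dictionary of adjacency lists; the function name, the toy graph and the parameter values are hypothetical.

```python
import random

def sample_walks(adj, length, repeats, seed=0):
    """Return {node: [walk, ...]} with `repeats` uniform random walks per node.

    adj     : dict mapping each node to a list of its neighbours (assumed format)
    length  : walk length L, counting repeated visits to the same node
    repeats : number of walks M started from each node
    """
    rng = random.Random(seed)
    walks = {}
    for start in adj:
        walks[start] = []
        for _ in range(repeats):
            walk = [start]
            while len(walk) < length:
                neighbours = adj[walk[-1]]
                if not neighbours:      # isolated node: the walk cannot continue
                    break
                # every neighbour is chosen with the same probability
                walk.append(rng.choice(neighbours))
            walks[start].append(walk)
    return walks

# Toy undirected graph (hypothetical, not taken from the patent figures).
adj = {1: [2, 3], 2: [1, 3], 3: [1, 2, 4], 4: [3]}
walks = sample_walks(adj, length=5, repeats=3)
```

Because a walk may revisit nodes, a sequence of length L can contain fewer than L distinct nodes, which is why repeated occurrences count toward the sequence length.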

In step S2, the simple subgraph SGN⁰ is constructed, with reference to Fig. 2(a), (b), (c) and (d), as follows: after a sampling sequence has been obtained by walking, the nodes in the sequence are extracted from the original network graph, and the edges that connect those nodes in the original network graph are then extracted to form a simple subgraph. Each sampling sequence yields one simple subgraph SGN⁰.
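The extraction in step S2 amounts to taking the subgraph induced by the nodes of a walk. The sketch below shows this under the assumption that the original graph is an undirected edge list; the function name and the toy data are hypothetical.

```python
def simple_subgraph(edges, walk):
    """Return the edges of the original graph whose endpoints both occur in `walk`.

    Edges are normalised to sorted tuples so that (u, v) and (v, u) coincide.
    """
    nodes = set(walk)
    kept = {tuple(sorted((u, v))) for u, v in edges if u in nodes and v in nodes}
    return sorted(kept)

# Toy original graph and one sampled walk (hypothetical values).
edges = [(1, 2), (1, 3), (2, 3), (3, 4), (4, 5)]
walk = [1, 2, 3, 2, 1]

sgn0 = simple_subgraph(edges, walk)   # edges among the distinct nodes {1, 2, 3}
```

Note that the subgraph keeps the edge (1, 3) even though the walk never traversed it, since both endpoints appear in the sequence.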

In step S3, the first-order subgraph network SGN¹ is constructed, with reference to Fig. 2(d) and (e), as follows: each simple subgraph SGN⁰ from step S2 is considered in turn and subjected to a first-order mapping. Every edge of SGN⁰ is mapped to a distinct node of SGN¹, and if two edges of SGN⁰ share a node, the two corresponding nodes of SGN¹ are connected by an edge. The result is a complete first-order subgraph network SGN¹; each SGN⁰ yields one SGN¹.

In step S4, the second-order subgraph network SGN² is constructed, with reference to Fig. 2(e) and (f), as follows: each first-order subgraph network SGN¹ from step S3 is considered in turn and subjected to a second-order mapping. Every edge of SGN¹ is regarded as a distinct node, and if two edges of SGN¹ share a node, an edge is added between the two nodes converted from those edges. These nodes and edges form the second-order subgraph network SGN²; each SGN¹ yields one SGN².
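The mappings of steps S3 and S4 correspond to what graph theory calls the line graph: edges become nodes, and two such nodes are linked when the original edges share an endpoint. The following sketch applies the same mapping twice to obtain SGN¹ and SGN²; the function name and the toy subgraph are hypothetical, and the edge-list representation is an assumption.

```python
from itertools import combinations

def line_graph(edges):
    """Map each edge to a node; link two such nodes when the edges share an endpoint."""
    nodes = sorted({tuple(sorted(e)) for e in edges})
    links = [(a, b) for a, b in combinations(nodes, 2) if set(a) & set(b)]
    return nodes, links

# Toy simple subgraph SGN0: a triangle 1-2-3 with a pendant node 4 (hypothetical).
sgn0_edges = [(1, 2), (1, 3), (2, 3), (3, 4)]

# First-order mapping: each SGN0 edge becomes an SGN1 node.
sgn1_nodes, sgn1_edges = line_graph(sgn0_edges)

# Second-order mapping: the same construction applied to the SGN1 edges.
sgn2_nodes, sgn2_edges = line_graph(sgn1_edges)

print(len(sgn1_nodes), len(sgn1_edges))   # 4 nodes, 5 edges
print(len(sgn2_nodes), len(sgn2_edges))   # 5 nodes, 8 edges
```

In this toy case the edges (1, 2) and (3, 4) have no common endpoint, so their images in SGN¹ are not connected, which is how the mapping encodes edge adjacency rather than node adjacency.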

In step S5, features are extracted from the network graphs as follows: the graph2vec model is applied separately to all simple subgraphs SGN⁰, all first-order subgraph networks SGN¹ and all second-order subgraph networks SGN², yielding in each case |V| × M feature vectors of dimension K, where |V| is the number of nodes of the original network graph.

In step S6, the feature vectors are averaged as follows: the feature vectors extracted in step S5 from the simple subgraphs SGN⁰, the first-order subgraph networks SGN¹ and the second-order subgraph networks SGN² are processed separately, and the M K-dimensional feature vectors belonging to the same node of the original network graph G = (V, E) are averaged, so that each node of the original network graph finally obtains one K-dimensional feature vector under each of the three kinds of subgraph.

In step S7, the characterization matrices are formed as follows: for each of the three cases (simple subgraph SGN⁰, first-order subgraph network SGN¹ and second-order subgraph network SGN²), the characterization vectors of all nodes of the original network graph G = (V, E) are assembled into a characterization matrix, giving Φ₀ ∈ R^{|V|×K}, Φ₁ ∈ R^{|V|×K} and Φ₂ ∈ R^{|V|×K}.

In step S8, the feature vector space is expanded as follows: the characterization matrices learned from the different kinds of subgraph network are concatenated horizontally to expand the feature space, giving the network characterization matrix of all nodes, Φ = merge(Φ₀, Φ₁, Φ₂) ∈ R^{|V|×3K}.
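Steps S6 to S8 can be sketched with NumPy as follows. Random arrays stand in for the graph2vec embeddings, and the sketch assumes that the |V| × M embedding rows are ordered node by node; that row layout, the function name and the sizes are assumptions made for illustration.

```python
import numpy as np

def average_per_node(embeddings, num_nodes, repeats):
    """Average the M subgraph embeddings that belong to the same original node.

    embeddings : array of shape (num_nodes * repeats, K); rows are assumed to be
                 ordered node by node, i.e. rows [i*M, (i+1)*M) belong to node i.
    """
    k = embeddings.shape[1]
    return embeddings.reshape(num_nodes, repeats, k).mean(axis=1)

rng = np.random.default_rng(0)
V, M, K = 4, 3, 8                       # illustrative sizes only

# Stand-ins for the embeddings of all SGN0, SGN1 and SGN2 graphs (step S5).
raw = [rng.normal(size=(V * M, K)) for _ in range(3)]

# Steps S6 and S7: one (V, K) characterization matrix per kind of subgraph.
phi0, phi1, phi2 = (average_per_node(e, V, M) for e in raw)

# Step S8: horizontal concatenation gives a (V, 3K) matrix.
phi = np.hstack([phi0, phi1, phi2])
print(phi.shape)
```

Each row of the concatenated matrix then describes one node of the original graph through all three views at once.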

In step S9, an extremely randomized trees classifier from machine learning is applied to the representations of all nodes of the original network graph, and ten-fold cross-validation is performed: the data are randomly divided into 10 parts, each part is taken in turn as the test sample while the remaining 9 parts serve as training samples, and the classification accuracy is obtained.
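The evaluation in step S9 could be sketched with scikit-learn as shown below. The feature matrix is synthetic and merely stands in for Φ, the hyperparameters are library defaults rather than values from the patent, and the use of stratified folds is an assumption; the patent only specifies a random division into 10 parts.

```python
import numpy as np
from sklearn.ensemble import ExtraTreesClassifier
from sklearn.model_selection import StratifiedKFold, cross_val_score

rng = np.random.default_rng(0)
n_per_class, dim = 20, 24

# Synthetic stand-in for the characterization matrix: two classes whose
# feature means differ, so that the example is learnable.
X = np.vstack([
    rng.normal(0.0, 1.0, size=(n_per_class, dim)),
    rng.normal(1.5, 1.0, size=(n_per_class, dim)),
])
y = np.array([0] * n_per_class + [1] * n_per_class)

clf = ExtraTreesClassifier(n_estimators=100, random_state=0)

# Ten folds with shuffling; each fold serves once as the test set.
cv = StratifiedKFold(n_splits=10, shuffle=True, random_state=0)
scores = cross_val_score(clf, X, y, cv=cv)   # accuracy per fold
mean_accuracy = scores.mean()
print(len(scores), mean_accuracy)
```

Averaging the per-fold accuracies gives a single figure that can be compared across feature sets, for example Φ₀ alone against the concatenated Φ.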

The above describes an example of the node classification method based on a sampling subgraph network applied to the Karate social network. By converting sampling sequences into network graphs, the method converts the node classification problem into a graph classification problem, and it further introduces the subgraph network (SGN) to expand the network feature space, thereby improving the classification accuracy. In addition, once the node classification task has been converted into a graph classification task, latent structural information is fully used and the classification performance of conventional walk-based methods is enhanced. The node classification method based on a sampling subgraph network thus provides a new scheme for the node classification task. The embodiment is to be considered illustrative and not restrictive. It will be understood by those skilled in the art that various changes, modifications and equivalents may be made without departing from the spirit and scope of the invention as defined in the appended claims.

Claims

1. A node classification method based on a sampling subgraph network is characterized by comprising the following steps:

S2: Construction of the simple subgraph SGN⁰. Each sampling sequence is considered in turn; the nodes it contains, together with the edges between those nodes, are extracted from the original network graph to form a network graph, namely a simple subgraph. Each sampling sequence yields one simple subgraph SGN⁰;

S5: Feature extraction from the network graphs. A feature extraction method is applied separately to all simple subgraphs SGN⁰, all first-order subgraph networks SGN¹ and all second-order subgraph networks SGN², yielding in each case |V| × M feature vectors of dimension K, where |V| is the number of nodes of the original network graph;

S7: Formation of the characterization matrices. For each of the three cases (simple subgraph SGN⁰, first-order subgraph network SGN¹ and second-order subgraph network SGN²), the characterization vectors of all nodes of the original network graph G = (V, E) are assembled into a characterization matrix, giving Φ₀ ∈ R^{|V|×K}, Φ₁ ∈ R^{|V|×K} and Φ₂ ∈ R^{|V|×K};

2. The method of claim 1, wherein in step S1 the elements of each sampling sequence are node labels of the original network graph, the same node label may appear multiple times, and when the sequence length is calculated, repeated occurrences of a node also count toward the effective length.