CN110929029A - Text classification method and system based on graph convolution neural network - Google Patents

Text classification method and system based on graph convolution neural network

Info

Publication number
CN110929029A
CN110929029A (application CN201911064089.7A)
Authority
CN
China
Prior art keywords
text
graph
node
layer
neural network
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN201911064089.7A
Other languages
Chinese (zh)
Inventor
唐钰葆
于静
曹聪
刘燕兵
谭建龙
郭莉
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Institute of Information Engineering of CAS
Original Assignee
Institute of Information Engineering of CAS
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Institute of Information Engineering of CAS filed Critical Institute of Information Engineering of CAS
Priority to CN201911064089.7A
Publication of CN110929029A
Legal status: pending

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/30 Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
    • G06F16/35 Clustering; Classification
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 Computing arrangements based on biological models
    • G06N3/02 Neural networks
    • G06N3/04 Architecture, e.g. interconnection topology
    • G06N3/045 Combinations of networks

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Data Mining & Analysis (AREA)
  • General Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • General Health & Medical Sciences (AREA)
  • Artificial Intelligence (AREA)
  • Biophysics (AREA)
  • Evolutionary Computation (AREA)
  • Biomedical Technology (AREA)
  • Molecular Biology (AREA)
  • Computing Systems (AREA)
  • Computational Linguistics (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Mathematical Physics (AREA)
  • Software Systems (AREA)
  • Health & Medical Sciences (AREA)
  • Databases & Information Systems (AREA)
  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)

Abstract

The invention discloses a text classification method and system based on a graph convolution neural network. The method comprises the following steps: 1) for each category-labeled text in a text training set of a target field, generating a text feature vector for the text according to the term frequency and inverse document frequency of the words in the text; combining all text feature vectors into a text feature matrix, namely a TF-IDF matrix, and constructing a graph structure of the text training set according to the word-vector similarity of the words; 2) training a graph convolution neural network using the graph structure and the text feature matrix; 3) for a text a to be classified in the target field, inputting the text feature vector of text a into the trained graph convolution neural network to obtain the category of text a. The invention not only considers the semantic structure information of the text but also captures the hidden features of the text from another angle, and achieves high classification accuracy.

Description

Text classification method and system based on graph convolution neural network
Technical Field
The invention belongs to the field of graph data mining and graph classification, and particularly relates to a text classification method and system based on a graph convolution neural network.
Background
With the arrival of big data, data volumes are growing explosively and the relationships among massive heterogeneous data are becoming ever tighter. A graph is a common abstract data structure that represents relationships between things. Closely related data elements in real life, such as social networks and academic networks, can be represented as graph data, and many practical problems can thus be converted into graph data-mining problems. For example, the social software WeChat takes WeChat accounts as nodes and the relationships between them, such as friend relationships, comments, and likes, as the edges of a graph, thereby constructing graph-structured data. Graph data classification is accordingly a research focus in large-scale data processing. Graph classification, i.e., automatically distinguishing and categorizing graphs of different types, is mainly applied to the identification of emergent and terrorist behavior, social network relationship classification, chemical molecule classification, and the like.
Graph classification can provide important technical means for data analysis and understanding in different fields, and related research and applications have attracted much attention. Although graph classification plays an important role in many areas of society, it still faces many technical challenges.
Graph data exhibits strong local coupling: nodes are interrelated, so the representation of a graph must contain both its structural information and its attributes. Existing data representations mainly target serialized documents, structured images, and the like, and are difficult to extend to graphs, so graph classification faces serious challenges.
Moreover, the feature representation of a graph, namely computing node features from the connection relationships among nodes, and the training of a classifier on that feature set are two independent processes. Each requires separate design and optimization, and even if each step is individually optimal, a classifier with the best overall effect is hard to guarantee.
As discussed above, graph classification plays an important role in many fields but faces the challenges of strong local coupling and difficult feature representation. Applications of graph classification include the chemical molecule classification and relationship-network entity classification mentioned above; this application targets the text classification task. Text classification means preprocessing given labeled text content and classifying the text with some algorithm or model. There are two main categories of text classification methods. The first is traditional text classification, which consists of feature extraction followed by classification with a classifier. The second uses deep learning: instead of manually extracting features, a deep learning model learns the features and specific pattern rules in the text, a classification model is obtained by training, and the text is then classified with that model. Common models include LSTM, CNN, RNN, GRU, etc. Despite their advantages, these methods have difficulty guaranteeing an overall-optimal classification model.
Disclosure of Invention
The application provides a text classification method and system based on a graph convolution neural network. The texts in the invention are natural-language texts, for example news texts in categories such as entertainment, finance, and military news. The basic idea of the method is to represent a text as a graph structure, taking into account both the semantic structure relationships of the text and its features, and to construct a graph convolution neural network that classifies the graph data end to end: the graph-structured text information and the text features are used directly as input, and the output is the category, i.e., the label, of each text. By representing text as a graph structure, the semantic structure information of the text is considered and hidden features of the text are captured from another angle; after processing by the graph convolution neural network, the results are competitive with mainstream text classification methods. The algorithm flow chart of the invention is shown in fig. 1.
A text classification method based on a graph convolution neural network comprises the following steps:
1) preprocessing the text: word segmentation, stop-word removal, punctuation removal, TF-IDF matrix computation, and the like;
2) constructing a graph structure from the preprocessed text obtained in step 1): words serve as the nodes of the graph, and for each node the several most similar words (measured by the cosine similarity of the two word vectors; 8 words are selected in this application) serve as its neighbor nodes;
3) preprocessing the graph structure: computing the Laplacian matrix of the graph, and the like;
4) constructing and training a graph convolution neural network comprising an input layer, two hidden layers, and an output layer, where each hidden layer comprises a graph convolution layer, an activation layer, and a pooling layer;
5) preprocessing the text to be classified, constructing its text feature matrix and graph structure as the input of the graph convolution neural network, and classifying it with the network trained in step 4) to obtain the category of the text.
Furthermore, the graph convolution neural network comprises an input layer and two hidden layers connected in sequence. Each hidden layer comprises a graph convolution layer, a pooling layer, and an activation layer that operate in the same way; the input of the second hidden layer is the output of the first, on which it performs further feature capture, and the second hidden layer is finally connected to the fully connected layer and the softmax output layer. The input layer imports the constructed graph structure and the TF-IDF matrix of the texts into the network for subsequent training. The graph convolution layer performs the convolution operation on the input graph structure and text features, capturing the feature information of the text. The activation layer performs nonlinear activation, using the ReLU activation function, on the features obtained by the graph convolution layer. The pooling layer hierarchically samples the features obtained by the activation layer. The fully connected layer processes the output of the activation layer, integrating the previous layer's output into a richer representation. The softmax layer takes the output of the fully connected layer as input and predicts the category of the corresponding article; its formula is given in the detailed description below. Cross entropy is used as the loss function of the graph convolution neural network.
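To make the layer stack above concrete, the following is a minimal, runnable sketch of the forward pass, not the patent's own implementation: the graph convolution is reduced to a single propagation L_norm @ X @ W instead of the Chebyshev filtering detailed later, the coarsened Laplacian is obtained by simple subsampling purely for illustration, and all sizes, weights, and the toy graph are assumed values.

```python
import numpy as np

rng = np.random.default_rng(0)
n, f_in, f_hidden, n_classes = 8, 16, 32, 4

# Toy undirected graph and normalized Laplacian (stand-ins for the word graph).
A = (rng.random((n, n)) < 0.4).astype(float)
A = np.maximum(A, A.T)
np.fill_diagonal(A, 0.0)
d = np.maximum(A.sum(axis=1), 1e-12)
L_norm = np.eye(n) - A / np.sqrt(d[:, None] * d[None, :])

X = rng.normal(size=(n, f_in))                 # stand-in for TF-IDF node features
W1 = 0.1 * rng.normal(size=(f_in, f_hidden))
W2 = 0.1 * rng.normal(size=(f_hidden, f_hidden))
W_fc = 0.1 * rng.normal(size=((n // 4) * f_hidden, n_classes))

def hidden_layer(L, H, W):
    """One hidden layer: simplified graph convolution, ReLU, pair-wise max pooling."""
    H = np.maximum(L @ H @ W, 0.0)             # one-hop propagation + ReLU
    H_pooled = np.maximum(H[0::2], H[1::2])    # max-pool consecutive node pairs
    L_coarse = L[0::2, 0::2]                   # crude subsampled "coarse" Laplacian
    return H_pooled, L_coarse

H, Lc = hidden_layer(L_norm, X, W1)            # hidden layer 1: n -> n/2 nodes
H, _ = hidden_layer(Lc, H, W2)                 # hidden layer 2: n/2 -> n/4 nodes
logits = H.reshape(-1) @ W_fc                  # fully connected layer
probs = np.exp(logits - logits.max())
probs /= probs.sum()                           # softmax -> category probabilities
print(probs, probs.argmax())
```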
Further, the TF-IDF matrix of the text is used as the feature matrix of the text. TF-IDF (term frequency-inverse document frequency) is a common statistical weighting technique used to evaluate the importance of a word to one of the documents in a corpus. The importance of a word increases in proportion to the number of times it appears in a document, but at the same time decreases in inverse proportion to the frequency with which it appears in the corpus.
Furthermore, the graph convolution layer operation performs a Fourier transform of the graph structure into the spectral domain, carries out the convolution there, and then applies the inverse Fourier transform, completing the convolution on the graph. Its theoretical basis is spectral graph theory. An undirected connected graph is defined as $G = (V, E, W)$, where $V$ is a finite set of $|V| = n$ nodes, $E$ is a set of edges, and $W \in \mathbb{R}^{n \times n}$ is a weighted adjacency matrix encoding the connection weight between two nodes; the weights are defined according to the specific problem, and in this application $W$ is an unweighted adjacency matrix. A signal $x: V \to \mathbb{R}$ defined on the nodes (i.e., vertices) of the graph can be regarded as a vector $x \in \mathbb{R}^n$, where $x_i$ is the value of $x$ at the $i$-th node. The signal $x$ can be understood as the attribute information contained in a node; in this application, for example, a node is represented by a word vector, which contains the semantic information of the word, i.e., the node's signal. An important operator in spectral graph analysis is the graph Laplacian, whose combinatorial definition is $L = D - W \in \mathbb{R}^{n \times n}$, where $D \in \mathbb{R}^{n \times n}$ is the diagonal degree matrix with $D_{ii} = \sum_j W_{ij}$. Its normalized form is defined as
$$L = I_n - D^{-1/2} W D^{-1/2}$$
Here $W_{ij}$ is the entry of the adjacency matrix for the $i$-th and $j$-th nodes: 1 if the two nodes are connected by an edge, and 0 otherwise. $I_n$ is the identity matrix, $\mathbb{R}$ denotes the real numbers, $\mathbb{R}^n$ a vector of length $n$, and $\mathbb{R}^{n \times n}$ a matrix of size $n \times n$. Extending the Fourier transform to graph structures, for a function $f \in \mathbb{R}^n$ defined on the nodes of any graph $G$, the graph Fourier transform with respect to the eigenvectors of the graph Laplacian is given by the expansion
$$\hat{f}(\lambda_l) = \sum_{i=1}^{n} f(i)\, u_l^{*}(i)$$
where $n$ is the number of nodes, $u_l$ is the $l$-th Laplacian eigenvector, and $u_l^{*}(i)$ is its coefficient for node $i$; the function $f$ is the general abstract object of the graph Fourier transform formula and represents node information in this invention. The corresponding inverse graph Fourier transform is defined as
$$f(i) = \sum_{l=0}^{n-1} \hat{f}(\lambda_l)\, u_l(i)$$
where $u_l(i)$ is the coefficient for node $i$ in the inverse Fourier transform. In classical Fourier analysis, eigenvalues carry the notion of frequency: when an eigenvalue is close to 0, i.e., at low frequency, the associated complex exponential eigenfunction is smooth and slowly varying; conversely, when the eigenvalue is far from 0, i.e., at high frequency, the corresponding eigenfunction fluctuates rapidly. For graph structures, the graph Laplacian eigenvalues and eigenvectors play an analogous role: the frequency of the conventional Fourier transform is analogous to the Laplacian eigenvalues/eigenvectors of the graph Fourier transform.
The graph Laplacian matrix $L$ obtained above is a real symmetric positive semi-definite matrix. Its eigenvalue decomposition yields an orthonormal set of eigenvectors $\{u_l\}_{l=0}^{n-1}$ (called the graph Fourier modes) and the associated nonnegative eigenvalues $\{\lambda_l\}_{l=0}^{n-1}$, which are regarded as the frequencies of the graph. The Laplacian is diagonalized by the Fourier basis $U = [u_0, \dots, u_{n-1}] \in \mathbb{R}^{n \times n}$, so that $L = U \Lambda U^T$, where $\Lambda = \mathrm{diag}([\lambda_0, \dots, \lambda_{n-1}]) \in \mathbb{R}^{n \times n}$. The graph Fourier transform of a signal $x \in \mathbb{R}^n$ is then defined as
$$\hat{x} = U^T x$$
and its inverse is
$$x = U \hat{x}$$
After the graph Fourier transform, the graph behaves like a Euclidean space, so that basic operations of graph signal processing such as filtering and down-sampling can be implemented.
Furthermore, the pooling layer coarsens the graph structure (i.e., hierarchically samples the features obtained by the activation layer), finding representative nodes of the graph and completing the sampling; a balanced binary tree is then constructed to pool the graph-structured features produced by the activation layer.
Further, the pooling layer computes, for each node and its neighbors, the normalized cut value $W_{i,j}(1/d_i + 1/d_j)$, where $d_i$ and $d_j$ are the degrees of nodes $i$ and $j$ (the degree of a node is the number of nodes connected to it) and $W_{i,j}$ is the weight of the edge between nodes $i$ and $j$. The neighbor with the largest normalized cut value with respect to the current node is merged with it, completing one coarsening step. Coarsening can be repeated several times; after coarsening to a suitable level, the nodes of each level are numbered randomly and a balanced binary tree is constructed following the coarsening mapping. Max pooling is performed at the top level of the binary tree and mapped back level by level to the original graph structure, completing the pooling.
Furthermore, during training of the graph convolution neural network, the fully connected layer adopts a dropout strategy: in each iteration a number of nodes are randomly selected with probability p and excluded from the actual computation. After the fully connected layer output y is obtained, the softmax function is applied to it, and its maximum value is selected as the category of the corresponding article.
Further, in step 1), punctuation marks and invisible characters are removed, stop words and low-frequency words are removed from each article, and the TF-IDF (term frequency-inverse document frequency) matrix of the articles is computed as the article feature matrix.
Further, for the words in the text processed in step 1), the word-vector similarity between each word and all other words is computed in turn, and the several words most similar to each word (8 words are selected in the invention) are chosen as its neighbor nodes, thereby constructing the graph structure.
Further, a Mini-batch gradient descent method or a momentum optimization method is adopted to train the graph convolution neural network.
Since real datasets often contain a large amount of "noisy" data that interferes with subsequent feature capture, the proposal of this application preprocesses the original data to remove the noise from the original dataset, making it easier to extract refined, non-redundant features.
Because corpus data is stored as text, it must be converted to numerical form before it can serve as input for training the graph convolution neural network. Therefore, after the preprocessing of the original article dataset is completed, the articles are represented using their TF-IDF matrix together with word vectors to improve the effect. Once the word vectors corresponding to the article information are obtained, the word-vector similarity between words is computed, from which the graph is constructed. The proposal of this application constructs a graph convolution neural network and trains the model on a dataset to classify articles; after training is completed, the model is scored on the test set to check its effect.
Compared with the prior text classification technical scheme, the text classification method has the following technical advantages:
1. the text classification is realized with a graph convolution neural network: texts are represented as graph structures, so the semantic structure correlation among texts can be captured and text features are captured better. Meanwhile, parameter sharing is achieved through the graph convolution operation, the number of parameters is reduced through the pooling operation, and dropout avoids model overfitting. This overcomes disadvantages such as low efficiency and low text classification accuracy, requires no manual feature extraction, places loose requirements on the data (only plain text is needed), and has high generality;
2. the data preprocessing operation adopted by the application proposal, the method for constructing the graph and text feature matrix, the graph convolution neural network structure and the like are easy to use;
3. the text classification method and system overcome the defects of existing text classification schemes such as low efficiency, low classification accuracy, and lack of persuasiveness; articles are classified via a quantitative representation, with high accuracy and a solid theoretical foundation.
Drawings
FIG. 1 is a flow chart of the method of the present invention;
FIG. 2 is a schematic diagram of data preprocessing;
FIG. 3 is a schematic diagram of a construction diagram;
FIG. 4 is a diagram of a graph convolution neural network structure;
FIG. 5 is a diagram illustrating a graph convolution operation;
FIG. 6 is a schematic view of pooling;
FIG. 7 is a schematic view of a fully connected layer;
FIG. 8 is a schematic drawing of dropout;
fig. 9 is a schematic diagram of gradient descent.
Detailed Description
In order to make the objects, technical solutions and advantages of the present invention more apparent, the present invention is described in further detail below with reference to the accompanying drawings and examples.
As shown in fig. 1, the text classification algorithm mainly includes five key processes: preprocessing data, constructing a graph structure, preprocessing the graph structure, constructing and training a graph convolution neural network model and predicting text categories by using the graph convolution neural network model. In the following, a specific embodiment of this algorithm will be described by way of elaborating the above five key processes, respectively.
Process one: data preprocessing
Real data often contains much redundant information, default values, and noise, and there may be outliers due to human error. In addition, owing to the characteristics of text information, the dataset adopted in this proposal is unstructured and lacks separators between words, which is unfavorable for feature extraction. Therefore, data preprocessing is an essential part of the text classification algorithm proposed in this application.
Common data preprocessing operations include numerical normalization, data structuring, data de-redundancy, and the like. For this application, the original dataset (text information) must be represented in numerical form, and preprocessing operations such as removing stop words, punctuation marks, invisible characters, and low-frequency words are performed on it. There are many ways to represent text as numbers, such as word-frequency statistics, TF-IDF, and word vectors (see fig. 2 for the flow). The model needs two inputs: the feature matrix of the texts and the graph structure.
For the text feature matrix, the TF-IDF matrix of the texts is employed. TF-IDF is a statistical method for evaluating the importance of a word to one document in a document set or corpus. The importance of a word increases in proportion to the number of times it appears in the document but decreases in inverse proportion to its frequency across the corpus. The term frequency (TF) is the frequency with which a term (keyword) occurs in a document:
$$\mathrm{TF}_{ij} = \frac{n_{ij}}{\sum_k n_{kj}}$$
where $n_{ij}$ is the number of occurrences of term $t_i$ in document $d_j$. The inverse document frequency (IDF) of a particular term is obtained by dividing the total number of documents by the number of documents containing the term and taking the logarithm of the quotient:
$$\mathrm{IDF}_i = \log \frac{|D|}{|\{j : t_i \in d_j\}|}$$
where $|D|$ is the total number of documents in the corpus and $|\{j : t_i \in d_j\}|$ is the number of documents containing term $t_i$; if the term does not appear in the corpus the denominator would be zero, so $1 + |\{j : t_i \in d_j\}|$ is typically used. In summary, TF-IDF is computed as $\mathrm{TF\text{-}IDF} = \mathrm{TF} \times \mathrm{IDF}$. Thus, the column length of the TF-IDF matrix (the number of rows) is the total number of documents, the row length is the number of words per document, and each value in the matrix is the TF-IDF value of the corresponding word.
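As an illustration of building the TF-IDF feature matrix described above, the following sketch uses scikit-learn's TfidfVectorizer; the patent does not specify an implementation, so the library choice, the toy corpus, and the parameters are assumptions.

```python
# A minimal sketch (not the patent's own code) of the TF-IDF feature matrix.
from sklearn.feature_extraction.text import TfidfVectorizer

corpus = [
    "stock market rises on strong earnings",      # finance
    "team wins championship after overtime",      # sports
    "central bank adjusts interest rate policy",  # finance
]

# smooth_idf=True applies the 1 + |{j : t_i in d_j}| style smoothing that
# avoids a zero denominator for out-of-corpus terms.
vectorizer = TfidfVectorizer(smooth_idf=True, stop_words="english")
X = vectorizer.fit_transform(corpus)   # shape: (num_documents, vocabulary_size)

print(X.shape)
print(vectorizer.get_feature_names_out()[:5])
```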
Process two: building the graph structure
For the graph structure, the nodes of the graph use the word vector of each word, and the neighbor nodes are the words with the highest similarity. In this application the best effect was obtained by selecting the 8 most similar words, so the number of neighbor nodes is set to 8 (see fig. 3 for the schematic). Finally, the graph structure is represented as a matrix $G \in \{0,1\}^{N \times N}$, where $N$ is the number of all words and $G_{ij}$ indicates whether there is an edge between the $i$-th and $j$-th words: a value of 1 means there is an edge, and 0 means there is none.
Word vectors, also known as word embedding, represent words in a corpus or vocabulary in the form of vectors.
In this way, the words in the original corpus or vocabulary are mapped to points in a vector space, which can be used as input for training the graph convolution neural network model. In practice there are many techniques for obtaining word vectors, such as Skip-gram, CBOW, or randomly generating word vectors and adjusting them continually. Since ample corpus data is available, Skip-gram is adopted to obtain the word vectors.
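The graph construction of this process can be sketched as follows, here with randomly generated stand-ins for the word vectors; k = 8 matches the application's choice of neighbor count, everything else is an illustrative assumption.

```python
# A sketch of the k-nearest-neighbor word graph built from cosine similarity.
import numpy as np

rng = np.random.default_rng(0)
num_words, dim, k = 100, 50, 8
word_vecs = rng.normal(size=(num_words, dim))  # stand-in for word2vec vectors

# Cosine similarity between all pairs of word vectors.
unit = word_vecs / np.linalg.norm(word_vecs, axis=1, keepdims=True)
sim = unit @ unit.T
np.fill_diagonal(sim, -np.inf)                 # exclude self-similarity

# Unweighted adjacency matrix G: G[i, j] = 1 iff j is among i's top-k neighbors.
G = np.zeros((num_words, num_words), dtype=np.int8)
topk = np.argpartition(-sim, k, axis=1)[:, :k]
rows = np.repeat(np.arange(num_words), k)
G[rows, topk.ravel()] = 1
G = np.maximum(G, G.T)                         # symmetrize for an undirected graph

print(G.sum(axis=1)[:10])                      # node degrees (>= k after symmetrizing)
```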
Process three: graph structure preprocessing
Since the subsequent computation involves the convolution on the graph, spectral graph theory requires the Laplacian matrix of the graph, which is therefore computed in advance. The combinatorial definition of the graph Laplacian is $L = D - W \in \mathbb{R}^{n \times n}$, where $D \in \mathbb{R}^{n \times n}$ is the diagonal degree matrix with $D_{ii} = \sum_j W_{ij}$; the normalized form is
$$L = I_n - D^{-1/2} W D^{-1/2}$$
where $I_n$ is the identity matrix. First the graph matrix constructed from the word vectors is taken and its degree matrix computed; then it is decided whether regularization is needed. If no regularization is needed, the graph Laplacian is obtained from $L = D - W$; otherwise the normalized formula above is used.
To implement the subsequent graph convolution (filtering) operation, the Fourier transform of the graph must be realized. The graph Laplacian $L$ obtained above is a real symmetric positive semi-definite matrix with an orthonormal set of eigenvectors $\{u_l\}_{l=0}^{n-1}$, called the graph Fourier modes, and associated eigenvalues $\{\lambda_l\}_{l=0}^{n-1}$, regarded as the frequencies of the graph. The Laplacian is diagonalized by the Fourier basis $U = [u_0, \dots, u_{n-1}] \in \mathbb{R}^{n \times n}$, so that $L = U \Lambda U^T$ with $\Lambda = \mathrm{diag}([\lambda_0, \dots, \lambda_{n-1}]) \in \mathbb{R}^{n \times n}$. The graph Fourier transform of a signal $x \in \mathbb{R}^n$ is then defined as
$$\hat{x} = U^T x$$
and its inverse is
$$x = U \hat{x}$$
where $x$ is the text feature matrix and $U$ is the Fourier basis obtained from the decomposition of the graph Laplacian.
The graph-structure preprocessing step thus computes the Laplacian matrix of the graph and performs the graph Fourier transform.
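A small numpy sketch of this preprocessing step, computing the (optionally normalized) graph Laplacian, its eigendecomposition into the Fourier basis U, and the graph Fourier transform; the toy graph and signal are assumptions.

```python
import numpy as np

def graph_laplacian(W, normalized=True):
    d = W.sum(axis=1)                       # node degrees D_ii
    if not normalized:
        return np.diag(d) - W               # combinatorial L = D - W
    d_inv_sqrt = 1.0 / np.sqrt(np.maximum(d, 1e-12))
    # Normalized L = I_n - D^{-1/2} W D^{-1/2}
    return np.eye(W.shape[0]) - (d_inv_sqrt[:, None] * W * d_inv_sqrt[None, :])

W = np.array([[0, 1, 1, 0],
              [1, 0, 1, 0],
              [1, 1, 0, 1],
              [0, 0, 1, 0]], dtype=float)   # toy 4-node graph

L = graph_laplacian(W)
lam, U = np.linalg.eigh(L)                  # eigenvalues = graph frequencies,
                                            # eigenvectors = Fourier basis U

x = np.array([0.5, 1.0, -0.2, 0.3])         # a signal on the nodes
x_hat = U.T @ x                             # graph Fourier transform
x_back = U @ x_hat                          # inverse transform recovers x
assert np.allclose(x, x_back)
```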
Process four: construction and training of the graph convolution neural network model
The convolutional neural network (CNN) is one of the most representative network structures in deep learning. Through local connectivity, weight sharing, and pooling it overcomes shortcomings of traditional neural networks such as excessive parameters, achieving excellent performance in fields such as visual processing and natural language processing. A model applying CNNs to graph data is called a graph convolutional network (GCN). Generalizing CNNs to graph data requires three main steps: (1) to realize the filtering operation, the graph must be transformed from the node domain to the spectral domain, and a localized convolution filter for use on the graph must be designed; (2) the graph must be coarsened so that similar nodes are gathered together. The reason is that max or average pooling of an image takes the maximum or average over every few data points; similarly, pooling graph data requires marking and distinguishing similar nodes and coarsening the graph so that they are grouped; (3) after coarsening, graphs of different coarsened versions are obtained and the aggregation of similar nodes is realized. The pooling operation of the graph is then performed, trading spatial resolution for higher filter resolution.
The graph convolution neural network structure adopted in the present application is shown in fig. 4, and includes network structures such as a graph convolution layer, an activation function layer, a pooling layer, and a full connection layer. To facilitate understanding of the structure of the convolutional neural network used in the present application, the structure thereof will be described in detail.
Structure one: graph convolution layer
The filtering operation is implemented in the spectral domain of the graph, to which the graph data has been transformed via the graph Fourier transform (see fig. 5 for the flow). The convolution of two signals x and y on the graph in the Fourier, i.e., spectral, domain is defined as $x *_G y = U((U^T x) \odot (U^T y))$, where $\odot$ is the element-wise Hadamard product. Filtering a signal $x$ with a filter $g_\theta$ gives $y = g_\theta(L)x = g_\theta(U \Lambda U^T)x = U g_\theta(\Lambda) U^T x$. A non-parametric filter, i.e., one whose parameters are all free, is defined as $g_\theta(\Lambda) = \mathrm{diag}(\theta)$, where the parameter $\theta \in \mathbb{R}^n$ is a vector of Fourier coefficients.
Although the filtering operation can be performed after the graph has been Fourier-transformed into the spectral domain, such a non-parametric filter (i.e., convolution kernel) has drawbacks: it cannot capture local features, and its learning complexity remains proportional to the size of the graph, so when the graph data is too large the learning cost becomes prohibitive and efficiency suffers. This problem can be solved with a polynomial filter:
$$g_\theta(\Lambda) = \sum_{k=0}^{K-1} \theta_k \Lambda^k$$
where the parameter $\theta \in \mathbb{R}^K$ is a vector of polynomial coefficients. The value at a neighbor node $j$ of the filter $g_\theta$ centered at node $i$ is $(g_\theta(L)\delta_i)_j = (g_\theta(L))_{i,j} = \sum_k \theta_k (L^k)_{i,j}$, where $\delta_i \in \mathbb{R}^n$ is the Kronecker delta; the initial representation of a node is its word vector, and during training the node information is updated under the influence of its neighbors, computed and refreshed continually through this formula, so that the convolution performed by the kernel captures local features. $d_G(i,j) > K$ implies $(L^K)_{i,j} = 0$, where $d_G$ is the shortest-path distance, i.e., the minimum number of edges connecting two nodes on the graph. Hence spectral filters represented by K-th order polynomials of the Laplacian are exactly K-localized. Moreover, their learning complexity is O(K), the support size of the filter, the same complexity as classical CNNs.
Even when learning a localized filter with the above K parameters, filtering the signal x as $y = U g_\theta(\Lambda) U^T x$ still costs $O(n^2)$ because of the multiplication with the Fourier basis U. The solution to this problem is to parameterize $g_\theta(L)$ as a polynomial function that can be computed recursively from L, since K multiplications by the sparse matrix L cost $O(K|E|)$, much smaller than $O(n^2)$. One such polynomial, traditionally used in image signal processing to approximate kernels (e.g., wavelets), is the Chebyshev expansion.
The k-th order Chebyshev polynomial $T_k(x)$ is computed by the recurrence $T_k(x) = 2x\,T_{k-1}(x) - T_{k-2}(x)$, with $T_0 = 1$ and $T_1 = x$. These polynomials form an orthogonal basis of $L^2([-1,1],\, dx/\sqrt{1-x^2})$, the Hilbert space of functions square-integrable with respect to the measure $dx/\sqrt{1-x^2}$. The filter can thus be parameterized as the truncated expansion
$$g_\theta(\Lambda) = \sum_{k=0}^{K-1} \theta_k T_k(\tilde{\Lambda})$$
of order K-1, where the parameter $\theta \in \mathbb{R}^K$ is a vector of Chebyshev coefficients and $T_k(\tilde{\Lambda}) \in \mathbb{R}^{n \times n}$ is the Chebyshev polynomial of order k evaluated at the rescaled diagonal matrix $\tilde{\Lambda} = 2\Lambda/\lambda_{max} - I_n$, whose eigenvalues lie in [-1, 1]. The filtering operation can then be written as
$$y = g_\theta(L)\,x = \sum_{k=0}^{K-1} \theta_k T_k(\tilde{L})\,x$$
where $T_k(\tilde{L})$ is the Chebyshev polynomial of order k evaluated at the rescaled Laplacian $\tilde{L} = 2L/\lambda_{max} - I_n$. Denoting $\bar{x}_k = T_k(\tilde{L})\,x$, the iterative relationship can be used to compute $\bar{x}_k = 2\tilde{L}\,\bar{x}_{k-1} - \bar{x}_{k-2}$, with $\bar{x}_0 = x$ and $\bar{x}_1 = \tilde{L}\,x$. The entire filtering operation $y = \sum_{k=0}^{K-1} \theta_k \bar{x}_k$ then costs $O(K|E|)$.
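The Chebyshev recurrence above translates directly into code. The following sketch applies a K-localized spectral filter to a node signal; the coefficients theta are assumed to be given (in the network they would be the learned filter weights), and the path graph is a toy example.

```python
import numpy as np

def chebyshev_filter(L, x, theta, lam_max):
    """Apply y = sum_k theta_k T_k(L_tilde) x via the Chebyshev recurrence."""
    n = L.shape[0]
    L_tilde = (2.0 / lam_max) * L - np.eye(n)   # rescale eigenvalues into [-1, 1]
    x_bars = [x, L_tilde @ x]                   # T_0(L~)x = x, T_1(L~)x = L~ x
    for _ in range(2, len(theta)):
        x_bars.append(2 * L_tilde @ x_bars[-1] - x_bars[-2])
    return sum(t * xb for t, xb in zip(theta, x_bars))

# Toy example on a 4-node path graph.
W = np.diag(np.ones(3), 1)
W = W + W.T
L = np.diag(W.sum(axis=1)) - W
lam_max = np.linalg.eigvalsh(L).max()

x = np.array([1.0, 0.0, 0.0, 0.0])
theta = np.array([0.5, 0.3, 0.2])               # K = 3 coefficients: 2-hop localized
y = chebyshev_filter(L, x, theta, lam_max)
print(y)
```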
Structure two: nonlinear activation layer
To introduce nonlinearity, an activation layer is added. This application adopts the ReLU (rectified linear unit), defined as:
$$\mathrm{ReLU}(x) = \max(0, x)$$
Other activation functions exist, such as the sigmoid and tanh functions, but ReLU has advantages they lack. If stochastic gradient descent is used for model optimization, ReLU converges faster. Moreover, the sigmoid and tanh activation functions involve exponentials, which are computationally expensive; this drawback is especially evident when the data volume is large, whereas ReLU, by its definition, is intuitively cheap to compute. In addition, sigmoid and tanh suffer from the vanishing-gradient problem, which ReLU effectively alleviates. ReLU has certain shortcomings of its own, but in this experiment its advantages weigh more, so the ReLU activation function is chosen.
Structure three: pooling layer
After the graph convolution layer completes the convolution on the graph structure and the features used for classification are extracted, the next step is to classify with these features. However, the features and associated parameters produced by the graph convolution are still too numerous, leading to excessive computation and even overfitting. This application therefore places a pooling layer after the graph convolution layer to avoid these adverse effects.
Pooling can be simply understood as sampling the features obtained by the graph convolution layer. When pooling conventional regular data, a value is kept every few data points. For downsampling the nodes of a weighted graph, however, there is no notion of "every other node". Therefore, analogously to regular data, similar nodes of the graph must be clustered together, i.e., graph clustering. In practice, a single round of clustering on a graph with many nodes cannot group most similar nodes together, so the operation must be repeated, which is in fact multi-scale clustering of the graph. Clustering of graphs is, however, an NP-hard problem, so a method yielding an approximate result must be adopted.
The clustering algorithms for graphs mainly include partitional clustering, hierarchical clustering, density-based clustering, grid-based clustering, and the like. The multi-scale clustering algorithm comprises three steps: graph coarsening, graph partitioning, and graph refinement.
Graph coarsening: nodes and edges of the graph are merged according to a set rule to obtain a coarsened version. On this basis the merging rule is applied repeatedly to obtain ever higher-level coarsened versions; the degree and number of coarsenings are determined by the specific requirements. In this proposal, the merging rule adopts the Graclus greedy algorithm. The greedy rule of Graclus picks an unmarked node i at each coarsening level and matches it with one of its unmarked neighbors j so as to maximize the local normalized cut value $W_{i,j}(1/d_i + 1/d_j)$. The two matched nodes are then marked, and the coarsened weight is set to the sum of their weights. Matching is repeated until all nodes are marked. From one level to the next coarser level, this roughly halves the number of nodes, though a few individual nodes may remain unmatched.
In this application, graph clustering is applied as follows: after the graph structure is coarsened, the nodes of the graph are numbered randomly and a balanced binary tree is constructed. Each coarsened version of the nodes corresponds to one level of the balanced binary tree: the most coarsened nodes are the parent nodes of the tree, the next most coarsened version corresponds to the second level, and so on, with the original nodes as the leaf nodes.
After convolution and activation, the graph structure yields a new feature graph; the pooling layer coarsens this feature graph to a certain degree and constructs the corresponding balanced binary tree. A downsampling operation is then performed on the binary tree, mapping from the parent nodes down through the second and third levels in turn, so that pooling the graph becomes equivalent to pooling one-dimensional data.
As an example (see fig. 6), G0 is the original, finest graph, with each node numbered randomly as shown. Nodes and edges are merged with the Graclus algorithm: assuming nodes 0 and 1 attain the maximum normalized cut value, they are merged into one node; likewise nodes 4 and 5 are merged, and nodes 8 and 9 are merged. Nodes 6 and 10 find no match and remain singletons, so to satisfy the balanced-binary-tree requirement fake nodes 7 and 11 are added with initial value 0, yielding G1. Similarly, the nodes of G1 are numbered randomly, nodes 2 and 3 and nodes 4 and 5 are merged by the Graclus algorithm, and node 0 has no matching node, so a fake node 1 is added to satisfy the balanced-binary-tree rule, yielding G2. G2 is then the most coarsened graph.
A balanced binary tree is constructed from the three coarsened versions, and pooling starts from the parent nodes of the tree, taking max pooling as the example here. Parent node 0 maps in turn to its children in the second level, nodes 0 and 1; node 0 of the second level is a singleton corresponding to leaf nodes 0 and 1, while node 1 of the second level is a fake node whose children are all fake nodes with value 0, so it does not affect the pooling result. Max pooling of parent node 0 is therefore equivalent to max pooling of nodes 0 and 1 of the original graph structure. By analogy, max pooling of parent node 1 is equivalent to max pooling of nodes 4, 5, and 6 of the original graph, and max pooling of parent node 2 to that of nodes 8, 9, and 10. The pooling result of the whole graph is thus z = {max{0,1}, max{4,5,6}, max{8,9,10}}.
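The coarsening-and-pooling scheme of this section can be sketched as below: one level of greedy matching maximizing $W_{i,j}(1/d_i + 1/d_j)$, with unmatched nodes paired to fake nodes of value 0 so that pooling reduces to a max over pairs. This is a simplified single-level illustration of the multi-level procedure, not the patent's code.

```python
import numpy as np

def coarsen_once(W):
    """Greedy matching maximizing the normalized cut W_ij * (1/d_i + 1/d_j)."""
    n = W.shape[0]
    d = W.sum(axis=1)
    matched, pairs = np.zeros(n, dtype=bool), []
    for i in range(n):
        if matched[i]:
            continue
        nbrs = [j for j in range(n) if W[i, j] > 0 and not matched[j] and j != i]
        if nbrs:
            cuts = [W[i, j] * (1.0 / d[i] + 1.0 / d[j]) for j in nbrs]
            j = nbrs[int(np.argmax(cuts))]
            matched[i] = matched[j] = True
            pairs.append((i, j))
        else:
            matched[i] = True
            pairs.append((i, None))          # singleton: pair with a fake node
    return pairs

def pool_max(signal, pairs):
    """Max-pool node values over each matched pair (fake nodes contribute 0)."""
    return np.array([max(signal[i], signal[j]) if j is not None else
                     max(signal[i], 0.0) for i, j in pairs])

W = np.array([[0, 1, 0, 0],
              [1, 0, 1, 0],
              [0, 1, 0, 1],
              [0, 0, 1, 0]], dtype=float)    # toy 4-node path graph
x = np.array([0.7, -0.1, 0.4, 0.9])
pairs = coarsen_once(W)
print(pairs, pool_max(x, pairs))
```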
Structure four: fully connected layer
In a fully connected layer, as the name implies, every node is connected to every node of the previous layer, as shown in fig. 7. In this proposal the previous layer is the pooling layer, and the output layer follows the fully connected layer, performing category prediction with softmax. In addition, to avoid the drawbacks of the fully connected layer's many weight parameters, namely difficult computation and a tendency to overfit, this proposal adopts a dropout strategy: during training, each iteration randomly selects some nodes with probability p to sit out the actual computation, as shown in fig. 8, where the second node of the input layer temporarily does not participate.
Structure five: output layer
The output layer outputs the category of the article. After the fully connected layer output y is obtained, applying the softmax function to it yields the corresponding category, i.e., the category of the article. The softmax function is:
$$\mathrm{softmax}(y_i) = \frac{e^{y_i}}{\sum_{j=1}^{l} e^{y_j}}$$
where $l$ is the number of categories and $y_i$ is the $i$-th value of the fully connected layer output. The result of the formula is a probability value. The softmax value is computed for all values output by the fully connected layer, and the maximum is selected as the category of the article.
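A minimal sketch of this output-layer computation, with a numerically stable softmax applied to an assumed fully-connected-layer output y:

```python
import numpy as np

def softmax(y):
    e = np.exp(y - y.max())       # subtract the max for numerical stability
    return e / e.sum()

y = np.array([1.2, 0.3, 2.5, -0.7])   # assumed FC-layer output, l = 4 categories
probs = softmax(y)
predicted_category = int(np.argmax(probs))
print(probs, predicted_category)       # category 2 has the highest probability
```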
Structure six: loss function and training method
After the model is determined, the next and final step is to determine the loss function and the training method.
The loss function measures the predicted values of the model. It is a non-negative real-valued function, usually written $L(y, f(x))$. The smaller the loss function, the more robust the model; that is, during training the parameters are adjusted by the training method so that the value of the loss function decreases. Commonly used loss functions include the mean absolute error, the mean squared error, and the cross-entropy loss. In most networks the cross-entropy loss is experimentally superior to the others and reflects well the difference between the expected output and the current actual output. This application therefore uses the common cross entropy as the loss function:
$$L = -\frac{1}{N} \sum_{i=1}^{N} \sum_{c=1}^{l} y_{ic} \log \hat{y}_{ic}$$
where $N$ is the number of samples, $l$ the number of categories, $y_{ic}$ the true (one-hot) label, and $\hat{y}_{ic}$ the predicted probability. After the loss function is determined, the next step is to determine the training method. In neural networks, the adjustment and optimization of the parameters is accomplished by gradient descent.
The gradient descent method is a first-order optimization algorithm, also commonly called the method of steepest descent. To find a local minimum of a function with gradient descent, one must iteratively step from the current point in the direction opposite to the gradient (or approximate gradient) by a specified step size:
$$x_2 = x_1 - \gamma \nabla f(x_1)$$
where the function $f(x)$ is differentiable and defined at the point $x_1$, and $\gamma$ is the step size. It is easy to see that when $\gamma > 0$ is sufficiently small, $f(x_1) \ge f(x_2)$. The gradient descent process is illustrated in fig. 9.
However, because the model is complex, computing the gradient over all training samples at once is too expensive, so academia and industry commonly adopt improved gradient descent methods as the scheme for finding the optimal or locally optimal model. Commonly used variants include the stochastic gradient descent method, the batch gradient descent method, the Adam method, and the like. This proposal adopts the mini-batch gradient descent method together with momentum optimization as the model optimization scheme, since the latter can adapt the effective learning rate of each parameter.
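For illustration, the following sketch runs mini-batch gradient descent with momentum on a toy least-squares problem. The learning rate 0.0001 and batch size 128 mirror the hyper-parameter table below; the momentum coefficient 0.9 and the toy problem itself are assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(512, 10))
true_w = rng.normal(size=10)
y = X @ true_w + 0.01 * rng.normal(size=512)

w = np.zeros(10)
velocity = np.zeros(10)
lr, momentum, batch_size = 1e-4, 0.9, 128

for epoch in range(50):
    perm = rng.permutation(len(X))
    for start in range(0, len(X), batch_size):
        idx = perm[start:start + batch_size]
        grad = 2 * X[idx].T @ (X[idx] @ w - y[idx]) / len(idx)  # MSE gradient
        velocity = momentum * velocity - lr * grad               # momentum update
        w += velocity

print(np.linalg.norm(w - true_w))   # distance to the true weights shrinks
```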
Process five: prediction
Finally, after model training is completed, the graph convolution neural network model is used to classify the text information on the dataset, and the classification effect is compared against other text classification schemes as a check.
To verify the performance of the graph convolution neural network on the text classification problem, this section compares its classification effect with other text classification schemes on the same article dataset.
The hardware environment of the experiments in this section is a server with a 2.8 GHz CPU, 506.3 GB of memory, and 88 cores, running a 64-bit Linux operating system.
The data set used in this experiment is shown in table 1:
TABLE 1 data set
(Table 1 appears as an image in the original; per the experimental analysis below, the dataset contains 4 article categories.)
Specifically, based on the characteristics of the dataset and conventional hyper-parameter settings for convolutional neural networks, the model hyper-parameters of this application are shown in table 2.
TABLE 2 model hyper-parameter table

Hyper-parameter     Meaning                                Value
num_GCN             Number of graph convolution layers     2
learning_rate       Initial learning rate                  0.0001
dropout_keep_prob   Dropout keep probability               0.5
batch_size          Batch size                             128
num_epochs          Number of training epochs              50
output_dim          Output dimension of the output layer   512
In the experiments, word vectors are generated with the skip-gram method of the word2vec tool, the ReLU function is selected as the activation function, the cross-entropy loss is selected as the model's loss function, the mini-batch gradient descent method with momentum optimization is adopted as the training method, and the initial learning rate is set to 0.0001. The experimental results are shown in table 3.
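Before turning to the results, here is a sketch of the word-vector generation step using gensim's word2vec with the Skip-gram architecture (sg=1); the toy sentences and the 100-dimensional vector size are assumptions, as the patent does not state the embedding dimensionality.

```python
from gensim.models import Word2Vec

sentences = [
    ["stock", "market", "rises", "on", "earnings"],
    ["team", "wins", "championship", "final"],
    ["bank", "adjusts", "interest", "rate"],
]

# sg=1 selects Skip-gram; sg=0 would select CBOW.
model = Word2Vec(sentences, vector_size=100, window=5, min_count=1, sg=1)
vec = model.wv["market"]                          # 100-dimensional word vector
similar = model.wv.most_similar("market", topn=3)  # nearest words by cosine
print(vec.shape, similar)
```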
TABLE 3 experimental results

Model            Accuracy
CBOW             0.92
GCN + CBOW       0.95
Fast Text        0.91
GCN + Fast Text  0.95
LSTM             0.93
Text-CNN         0.94
The experimental analysis in this section is as follows:
as can be seen from table 1, there are 4 article categories in this experiment, and each sample belongs to one category. Randomly guessing the category of a document would therefore be correct about 1/4 of the time. As table 3 shows, the accuracy of the graph convolution neural network is far higher than random selection, and its final accuracy exceeds the other text classification schemes, which is a satisfactory result. For the above experimental results, the following specific analysis can be given:
1) this application represents text information as a graph structure and constructs the graph via word similarity, capturing the semantic structure correlation among texts well and thereby describing the implicit relationships in the text information.
2) The graph convolution neural network captures structural information between texts through the graph convolution operation while taking the statistical attributes of the texts into account with the TF-IDF matrix; together, these two aspects comprehensively cover the explicit and implicit features of the texts. Meanwhile, the number of parameters is reduced by the multi-level clustering pooling operation, dropout avoids model overfitting, the defects of low efficiency and low text classification accuracy are overcome, no manual feature extraction is needed, and the final experimental results are clearly superior to the other schemes.
The above description is only for the purpose of illustrating the preferred embodiments of the present invention and is not to be construed as limiting the invention, and any modifications, equivalents, improvements and the like made within the spirit and principle of the present invention should be included in the scope of the present invention.

Claims (10)

1. A text classification method based on a graph convolution neural network comprises the following steps:
1) for each category-labeled text in a text training set of a target field, generating a text feature vector for the text according to the term frequency and inverse document frequency of the words in the text; combining all text feature vectors to generate a text feature matrix, namely a TF-IDF matrix, and constructing a graph structure of the text training set according to the word-vector similarity of the words;
2) training a graph convolution neural network by using the graph structure and the text feature matrix;
3) and for a text a to be classified in the target field, inputting the text feature vector of the text a into the trained graph convolution neural network to obtain the category of the text a.
2. The method of claim 1, wherein the graph structure is generated by: and taking the words in the text as nodes of the graph, and taking a plurality of words most similar to one node as neighbor nodes of the node to generate the graph structure.
3. The method according to claim 1 or 2, wherein in step 2), the graph structure is preprocessed first, and a laplacian matrix of the graph is calculated; and then training the graph convolutional neural network by using the Laplace matrix and the text characteristic matrix of the graph.
4. The method of claim 3, wherein the graph Laplacian matrix is $L = D - W \in \mathbb{R}^{n \times n}$, where $D \in \mathbb{R}^{n \times n}$ is a diagonal matrix with $D_{ii} = \sum_j W_{ij}$, and $W \in \mathbb{R}^{n \times n}$ is an adjacency matrix encoding the connection weights between two nodes, $W_{ij}$ being the value corresponding to the $i$-th and $j$-th nodes: $W_{ij} = 1$ if the $i$-th and $j$-th nodes are connected by an edge, and 0 otherwise.
5. The method of claim 1, wherein the graph convolution neural network comprises an input layer, a number of hidden layers, a fully connected layer, and an output layer connected in sequence; each hidden layer comprises a graph convolution layer, a pooling layer, and an activation layer; the input layer is used for receiving the graph structure and the text features and inputting them into the hidden layers; the graph convolution layer is used for performing the convolution operation on the input graph structure and text features to obtain the feature information of the text and inputting it into the activation layer; the activation layer is used for performing nonlinear activation on the features captured by the graph convolution layer; the pooling layer is used for hierarchically sampling the information obtained by the activation layer; and the hierarchically sampled information passes through the fully connected layer into the output layer, which predicts the category of the corresponding text.
6. The method of claim 4, wherein the graph convolution layer performs a graph Fourier transform of the graph structure into the spectral domain, performs the convolution operation in the spectral domain, and performs the inverse graph Fourier transform back to the vertex domain to obtain the convolution result; the pooling layer computes the normalized cut value of each node and its neighbors by the formula $W_{i,j}(1/d_i + 1/d_j)$, then selects the neighbor with the maximum normalized cut value with respect to the current node and merges it with the current node, and then completes pooling through one-dimensional pooling; where $d_i$ is the degree of node $i$, $d_j$ is the degree of node $j$, and $W_{i,j}$ is the weight of the edge between node $i$ and node $j$.
7. The method of claim 6, wherein for a function $f \in \mathbb{R}^n$ defined on the nodes of any graph $G$, the graph Fourier transform based on the eigenvectors of the graph Laplacian is given by the expansion
$$\hat{f}(\lambda_l) = \sum_{i=1}^{n} f(i)\, u_l^{*}(i)$$
where $n$ is the number of nodes in the graph structure, $u_l$ is the $l$-th Laplacian eigenvector, and $u_l^{*}(i)$ is its coefficient for node $i$; the corresponding inverse graph Fourier transform is defined as
$$f(i) = \sum_{l=0}^{n-1} \hat{f}(\lambda_l)\, u_l(i)$$
where $u_l(i)$ is the coefficient for node $i$ in the inverse Fourier transform; and the graph is $G = (V, E, W)$, where $V$ is a finite set of $|V| = n$ nodes, $E$ is a set of edges, and $W \in \mathbb{R}^{n \times n}$ is an adjacency matrix encoding the connection weights between two nodes.
8. The method of claim 6 or 7, wherein the graph convolution layer filters the node signal $x$ in the graph structure with a filter, the filtering operation being
$$y = g_\theta(L)\,x = \sum_{k=0}^{K-1} \theta_k T_k(\tilde{L})\,x$$
where $y$ is the filtered signal, $\theta \in \mathbb{R}^K$ is a Chebyshev coefficient vector, and $T_k(\tilde{L})$ is the Chebyshev polynomial of order $k$ evaluated at the rescaled Laplacian $\tilde{L} = 2L/\lambda_{max} - I_n$; the signal $x \in \mathbb{R}^n$ is the semantic information of the words corresponding to the nodes, and $x_i$ is the value of $x$ at the $i$-th node.
9. The method of claim 1, wherein the graph convolution neural network is trained using a mini-batch gradient descent method or a momentum optimization method.
10. A text classification system based on a graph convolution neural network is characterized by comprising a text preprocessing module, a graph convolution neural network training module and a text classification module; wherein,
the text preprocessing module is used for generating a text feature vector for each text according to the term frequency and inverse document frequency of the words in the text, then combining the text feature vectors to generate a text feature matrix, namely a TF-IDF matrix, and constructing a graph structure of the text training set according to the word-vector similarity of the words;
the graph convolution neural network training module is used for training a graph convolution neural network according to the text feature matrix and the graph structure;
and the text classification module is used for inputting the text feature vector of a text a to be classified into the trained graph convolution neural network to obtain the category of text a.
Priority application: CN201911064089.7A, filed 2019-11-04, priority date 2019-11-04. Publication: CN110929029A, published 2020-03-27. Family ID: 69850245. Country: CN (China). Legal status: pending.

Patent Citations (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN110134934A (en) * 2018-02-02 2019-08-16 普天信息技术有限公司 Text emotion analysis method and device
US20190304156A1 (en) * 2018-04-03 2019-10-03 Sri International Artificial intelligence for generating structured descriptions of scenes
CN109543084A (en) * 2018-11-09 2019-03-29 西安交通大学 A method of establishing the detection model of the hidden sensitive text of network-oriented social media
CN109783696A (en) * 2018-12-03 2019-05-21 中国科学院信息工程研究所 A kind of multi-mode index of the picture construction method and system towards weak structure correlation
CN109902288A (en) * 2019-01-17 2019-06-18 深圳壹账通智能科技有限公司 Intelligent clause analysis method, device, computer equipment and storage medium

Non-Patent Citations (5)

* Cited by examiner, † Cited by third party
Title
David I. Shuman, Sunil K. Narang, Pascal Frossard: "The Emerging Field of Signal Processing on Graphs", arXiv:1211.0053v2 *
Jing Yu: "Modeling Text with Graph Convolutional Network for Cross-Modal Information Retrieval", arXiv:1802.00985 *
Jing Yu: "Semantic Modeling of Textual Relationships in Cross-modal Retrieval", International Conference on Knowledge Science, Engineering and Management 2019 *
Liang Yao, Chengsheng Mao, Yuan Luo: "Graph Convolutional Networks for Text Classification", arXiv:1809.05679 *
Michaël Defferrard, Xavier Bresson, Pierre Vandergheynst: "Convolutional Neural Networks on Graphs with Fast Localized Spectral Filtering", arXiv:1606.09375v3 *

Cited By (46)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN111598214B (en) * 2020-04-02 2023-04-18 浙江工业大学 Cross-modal retrieval method based on graph convolution neural network
CN111598214A (en) * 2020-04-02 2020-08-28 浙江工业大学 Cross-modal retrieval method based on graph convolution neural network
CN111552803A (en) * 2020-04-08 2020-08-18 西安工程大学 Text classification method based on graph wavelet network model
CN111552803B (en) * 2020-04-08 2023-03-24 西安工程大学 Text classification method based on graph wavelet network model
CN111694957A (en) * 2020-05-29 2020-09-22 新华三大数据技术有限公司 Question list classification method and device based on graph neural network and storage medium
CN111694957B (en) * 2020-05-29 2024-03-12 新华三大数据技术有限公司 Method, equipment and storage medium for classifying problem sheets based on graph neural network
CN111965476A (en) * 2020-06-24 2020-11-20 国网江苏省电力有限公司淮安供电分公司 Low-voltage diagnosis method based on graph convolution neural network
CN111538870B (en) * 2020-07-07 2020-12-18 北京百度网讯科技有限公司 Text expression method and device, electronic equipment and readable storage medium
CN111538870A (en) * 2020-07-07 2020-08-14 北京百度网讯科技有限公司 Text expression method and device, electronic equipment and readable storage medium
CN111984762A (en) * 2020-08-05 2020-11-24 中国科学院重庆绿色智能技术研究院 Text classification method sensitive to attack resistance
CN111984762B (en) * 2020-08-05 2022-12-13 中国科学院重庆绿色智能技术研究院 Text classification method sensitive to attack resistance
CN112131506A (en) * 2020-09-24 2020-12-25 厦门市美亚柏科信息股份有限公司 Webpage classification method, terminal equipment and storage medium
CN112347246A (en) * 2020-10-15 2021-02-09 中科曙光南京研究院有限公司 Self-adaptive document clustering method and system based on spectral decomposition
CN112347246B (en) * 2020-10-15 2024-04-02 中科曙光南京研究院有限公司 Self-adaptive document clustering method and system based on spectrum decomposition
WO2022105108A1 (en) * 2020-11-18 2022-05-27 苏州浪潮智能科技有限公司 Network data classification method, apparatus, and device, and readable storage medium
CN112487305A (en) * 2020-12-01 2021-03-12 重庆邮电大学 GCN-based dynamic social user alignment method
CN112529071A (en) * 2020-12-08 2021-03-19 广州大学华软软件学院 Text classification method, system, computer equipment and storage medium
CN112529068A (en) * 2020-12-08 2021-03-19 广州大学华软软件学院 Multi-view image classification method, system, computer equipment and storage medium
CN112529071B (en) * 2020-12-08 2023-10-17 广州大学华软软件学院 Text classification method, system, computer equipment and storage medium
CN112529068B (en) * 2020-12-08 2023-11-28 广州大学华软软件学院 Multi-view image classification method, system, computer equipment and storage medium
CN112270322A (en) * 2020-12-17 2021-01-26 恒银金融科技股份有限公司 Method for recognizing crown word number of bank note by utilizing neural network model
CN112651487A (en) * 2020-12-21 2021-04-13 广东交通职业技术学院 Data recommendation method, system and medium based on graph collapse convolution neural network
CN112287664B (en) * 2020-12-28 2021-04-06 望海康信(北京)科技股份公司 Text index data analysis method and system, corresponding equipment and storage medium
CN112287664A (en) * 2020-12-28 2021-01-29 望海康信(北京)科技股份公司 Text index data analysis method and system, corresponding equipment and storage medium
CN112685504B (en) * 2021-01-06 2021-10-08 广东工业大学 Production process-oriented distributed migration chart learning method
CN112685504A (en) * 2021-01-06 2021-04-20 广东工业大学 Production process-oriented distributed migration chart learning method
US11367002B1 (en) 2021-01-06 2022-06-21 Guangdong University Of Technology Method for constructing and training decentralized migration diagram neural network model for production process
CN112733933B (en) * 2021-01-08 2024-01-05 北京邮电大学 Data classification method and device based on unified optimization target frame graph neural network
CN112733933A (en) * 2021-01-08 2021-04-30 北京邮电大学 Data classification method and device based on unified optimization target frame graph neural network
CN112765352A (en) * 2021-01-21 2021-05-07 东北大学秦皇岛分校 Graph convolution neural network text classification method based on self-attention mechanism
CN112800239B (en) * 2021-01-22 2024-04-12 中信银行股份有限公司 Training method of intention recognition model, and intention recognition method and device
CN112800239A (en) * 2021-01-22 2021-05-14 中信银行股份有限公司 Intention recognition model training method, intention recognition method and device
CN112925907A (en) * 2021-02-05 2021-06-08 昆明理工大学 Microblog comment viewpoint object classification method based on event graph convolutional neural network
CN113435478A (en) * 2021-06-03 2021-09-24 华东师范大学 Method and system for classifying clothing template pictures by using graph convolution neural network
CN113435478B (en) * 2021-06-03 2022-07-08 华东师范大学 Method and system for classifying clothing template pictures by using graph convolution neural network
CN113360648A (en) * 2021-06-03 2021-09-07 山东大学 Case classification method and system based on correlation graph learning
CN113642674A (en) * 2021-09-03 2021-11-12 贵州电网有限责任公司 Multi-round dialogue classification method based on graph convolution neural network
CN113946683A (en) * 2021-09-07 2022-01-18 中国科学院信息工程研究所 Knowledge fusion multi-mode false news identification method and device
CN113792144A (en) * 2021-09-16 2021-12-14 南京理工大学 Text classification method based on semi-supervised graph convolution neural network
CN113792144B (en) * 2021-09-16 2024-03-12 南京理工大学 Text classification method of graph convolution neural network based on semi-supervision
CN113987152A (en) * 2021-11-01 2022-01-28 北京欧拉认知智能科技有限公司 Knowledge graph extraction method, system, electronic equipment and medium
CN114021550A (en) * 2021-11-04 2022-02-08 成都中科信息技术有限公司 News trend prediction system and method based on graph convolution neural network
CN114817538A (en) * 2022-04-26 2022-07-29 马上消费金融股份有限公司 Training method of text classification model, text classification method and related equipment
CN114817538B (en) * 2022-04-26 2023-08-08 马上消费金融股份有限公司 Training method of text classification model, text classification method and related equipment
CN114943324A (en) * 2022-05-26 2022-08-26 中国科学院深圳先进技术研究院 Neural network training method, human motion recognition method and device, and storage medium
CN114943324B (en) * 2022-05-26 2023-10-13 中国科学院深圳先进技术研究院 Neural network training method, human motion recognition method and device, and storage medium

Similar Documents

Publication Publication Date Title
CN110929029A (en) Text classification method and system based on graph convolution neural network
Huixian The analysis of plants image recognition based on deep learning and artificial neural network
Van Der Maaten Accelerating t-SNE using tree-based algorithms
Xia et al. Complete random forest based class noise filtering learning for improving the generalizability of classifiers
Adams et al. A survey of feature selection methods for Gaussian mixture models and hidden Markov models
Friedman et al. Introduction to pattern recognition: statistical, structural, neural, and fuzzy logic approaches
US7362892B2 (en) Self-optimizing classifier
Van Hulle Self-organizing Maps.
KR20180120061A (en) Artificial neural network model learning method and deep learning system
WO2019102005A1 (en) Object recognition using a convolutional neural network trained by principal component analysis and repeated spectral clustering
Widiyanto et al. Implementation of convolutional neural network method for classification of diseases in tomato leaves
Shadrach et al. RETRACTED ARTICLE: Neutrosophic Cognitive Maps (NCM) based feature selection approach for early leaf disease diagnosis
EP2614470A2 (en) Method for providing with a score an object, and decision-support system
CN113157957A (en) Attribute graph document clustering method based on graph convolution neural network
CN109344898A (en) Convolutional neural networks image classification method based on sparse coding pre-training
CN113642674A (en) Multi-round dialogue classification method based on graph convolution neural network
CN112348090A (en) Neighbor anomaly detection system based on neighbor self-encoder
Yang et al. Classification of medical images with synergic graph convolutional networks
Henriques et al. Spatial clustering using hierarchical SOM
Maddumala A Weight Based Feature Extraction Model on Multifaceted Multimedia Bigdata Using Convolutional Neural Network.
Chander et al. Data clustering using unsupervised machine learning
Parsa et al. Coarse-grained correspondence-based ancient Sasanian coin classification by fusion of local features and sparse representation-based classifier
CN113569920A (en) Second neighbor anomaly detection method based on automatic coding
CN117349494A (en) Graph classification method, system, medium and equipment for space graph convolution neural network
Balaganesh et al. Movie success rate prediction using robust classifier

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
WD01 Invention patent application deemed withdrawn after publication (Application publication date: 20200327)