CN113516198B - Cultural resource text classification method based on memory network and graph neural network - Google Patents

Cultural resource text classification method based on memory network and graph neural network

Info

Publication number
CN113516198B
CN113516198B
Authority
CN
China
Prior art keywords
text
graph
word
words
model
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN202110864647.9A
Other languages
Chinese (zh)
Other versions
CN113516198A (en)
Inventor
王海
王妍
黄帝淞
周腾
吴旭东
曹瑞
郑杰
马于惠
高岭
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
NORTHWEST UNIVERSITY
Original Assignee
NORTHWEST UNIVERSITY
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by NORTHWEST UNIVERSITY filed Critical NORTHWEST UNIVERSITY
Priority to CN202110864647.9A priority Critical patent/CN113516198B/en
Publication of CN113516198A publication Critical patent/CN113516198A/en
Application granted granted Critical
Publication of CN113516198B publication Critical patent/CN113516198B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00 Pattern recognition
    • G06F18/20 Analysing
    • G06F18/24 Classification techniques
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00 Pattern recognition
    • G06F18/20 Analysing
    • G06F18/21 Design or setup of recognition systems or techniques; Extraction of features in feature space; Blind source separation
    • G06F18/214 Generating training patterns; Bootstrap methods, e.g. bagging or boosting
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00 Handling natural language data
    • G06F40/20 Natural language analysis
    • G06F40/205 Parsing
    • G06F40/211 Syntactic parsing, e.g. based on context-free grammar [CFG] or unification grammars
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00 Handling natural language data
    • G06F40/20 Natural language analysis
    • G06F40/279 Recognition of textual entities
    • G06F40/284 Lexical analysis, e.g. tokenisation or collocates
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 Computing arrangements based on biological models
    • G06N3/02 Neural networks
    • G06N3/04 Architecture, e.g. interconnection topology
    • G06N3/044 Recurrent networks, e.g. Hopfield networks
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 Computing arrangements based on biological models
    • G06N3/02 Neural networks
    • G06N3/04 Architecture, e.g. interconnection topology
    • G06N3/045 Combinations of networks
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 Computing arrangements based on biological models
    • G06N3/02 Neural networks
    • G06N3/08 Learning methods
    • G06N3/084 Backpropagation, e.g. using gradient descent

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Data Mining & Analysis (AREA)
  • Artificial Intelligence (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • General Health & Medical Sciences (AREA)
  • Computational Linguistics (AREA)
  • Health & Medical Sciences (AREA)
  • Evolutionary Computation (AREA)
  • Molecular Biology (AREA)
  • Biomedical Technology (AREA)
  • Biophysics (AREA)
  • Computing Systems (AREA)
  • Mathematical Physics (AREA)
  • Software Systems (AREA)
  • Evolutionary Biology (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Bioinformatics & Computational Biology (AREA)
  • Bioinformatics & Cheminformatics (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)
  • Machine Translation (AREA)

Abstract

A cultural resource text classification method based on a memory network and a graph neural network mainly comprises three modules: a pre-trained bidirectional long short-term memory network (BiLSTM) module, a text graph construction module, and a graph convolutional network (GCN) module. First, the text data set is pre-trained with a BiLSTM model to obtain text and word features containing time-series information. Second, a global text graph whose nodes consist of the texts and the words is constructed according to the co-occurrence relations among words and the importance of words in each text, and the node features of the text graph are initialized with the features extracted by the pre-trained BiLSTM module. Finally, the node features of the global text graph undergo further representation learning through a two-layer graph convolutional network, yielding the final text classification result. The method can thus be used to improve the classification accuracy of cultural resource texts.

Description

Cultural resource text classification method based on memory network and graph neural network
Technical Field
The invention belongs to the technical field of natural language processing, and particularly relates to a cultural resource text classification method based on a bidirectional long short-term memory network and a graph neural network (BiLSTM-GCN).
Background
Cultural resources are natural or social resources that can satisfy human cultural needs and provide a basis for the cultural industry, and they can be divided into many types. Classifying cultural resources effectively promotes their storage, mining, and reuse.
Text classification is a technology that automatically classifies and labels texts according to certain classification rules and standards. It is also a basic task of natural language processing, used in many applications such as sentiment analysis, data mining, and news filtering. Early text classification relied on manual labeling and fixed rules to classify text data and was extremely inefficient; it could not possibly process the hundreds of millions of texts of the current big-data age. With the progress of the times and the development of computer science and technology, text classification now achieves automatic classification by training computers on text data. This greatly reduces the workload of manual classification, improves working efficiency, and can even outperform manual classification.
Text classification methods mainly comprise traditional machine learning methods and deep learning methods. Traditional machine-learning-based text classification extracts features manually and uses a shallow classifier, and this approach is very mature. However, as expectations for text classification have risen, current methods are mainly based on deep learning: most deep learning models represent unstructured text data in a form a computer can process and then train on a large number of labeled data sets, extracting the important features of the text and producing the final classification result.
The graph neural network is a deep learning network based on graph data. Compared with traditional networks, it performs representation learning on graph-structured data better, and it is widely applied to tasks such as social networks, recommendation systems, and molecular activity prediction. Text in natural language processing tasks also contains rich graph structure information, including word co-occurrence, syntax and semantics, and text context, and a graph neural network can fully exploit the graph structure of text data. However, current graph-neural-network text classification methods rarely consider the time-series information in the text, which greatly limits their classification effectiveness.
Disclosure of Invention
In order to overcome the problem that text classification methods based on graph convolutional networks do not fully consider the time-series information contained in text, the invention aims to provide a cultural resource text classification method based on a memory network and a graph neural network; the method introduces the time-series information of the text in advance so as to achieve a better text classification effect.
In order to achieve the above purpose, the technical scheme adopted by the invention is as follows:
a text classification method based on BiLSTM-GCN comprises a BiLSTM pre-training module, a text diagram construction module and a GCN module, and is characterized by comprising the following steps:
1) BiLSTM pre-training module: pre-training the text by using a bidirectional LSTM and acquiring the text and word characteristics after pre-training;
LSTM is a deep learning network that models a sequence of time steps; its structure determines that the hidden layer at each moment is related to the previous inputs. Compared with an RNN, an LSTM can well avoid the gradient-vanishing problem caused by long-term dependence. Assume a text consists of n words {w_1, w_2, ..., w_n}, each word represented by a d-dimensional vector initialized with the Stanford pre-trained GloVe word vectors; the initialized word vectors are {v_1, v_2, ..., v_n}. In order for the model to obtain better word vectors, the word vectors are continuously updated with each iteration during training. Each text is modeled with a bidirectional LSTM: one LSTM gives the semantic representation of the text from the beginning of the sentence to the end of the sentence, and the other from the end of the sentence to the beginning, so the input at the current moment is related not only to the preceding state but also to the following state.
Each word v_t has both a forward hidden state h_t→ and a backward hidden state h_t←; h_t splices the two hidden states together, fusing the context information of the word. The specific formula is as follows:
h_t = [h_t→ ; h_t←];
Through the previous operations, text and word features containing time-series information can be extracted. A fully connected layer is then needed to classify the samples: the model takes the first hidden state h_1 and the last hidden state h_n as the features of the text, inputs them into the fully connected layer, and connects an activation function to realize the text classification. The specific formula is as follows:
y = softmax(MLP(h_1, h_n));
2) The text graph construction module: constructing one large global text graph for the whole text data set, wherein the nodes of the text graph consist of the texts and the words;
A global text graph is constructed for the whole text data set, with nodes consisting of texts and words. The edge weight between two words is based on their co-occurrence relation and is given by the PMI value or the cosine similarity between the words; the edge weight between a text and a word is based on the importance of the word in the text and is obtained with the TF-IDF value or a keyword extraction algorithm;
3) GCN module: realizing representation learning of the text graph nodes through a two-layer graph convolutional neural network, thereby realizing the final text classification;
The word features v and the text features h containing time-series information are obtained through step 1); these features are the initialization features of the global text graph nodes, and putting them into the GCN further extracts the feature information of the text graph. Assume the constructed undirected text graph is g with n nodes, of which n_word are word nodes and n_text are text nodes; A denotes the adjacency matrix of the text graph g, I_N is an identity matrix, D is the degree matrix of A + I_N, W_0 and W_1 are weight matrices, and X is the node feature matrix of the text graph g.
The GCN model adopts a semi-supervised graph convolutional neural network: the whole model is trained on the small portion of nodes that carry labels and then classifies the remaining unlabeled nodes. The specific implementation formula is as follows:
Z = softmax(Â · ReLU(Â · X · W_0) · W_1), where Â = D^(-1/2)(A + I_N)D^(-1/2);
The loss function is the cross-entropy loss over all labeled texts, defined as follows:
L = -Σ_{d∈Y_D} Σ_{f=1}^{F} Y_{df} · ln y_pred,df;
Y_D is the set of labeled texts, i.e. the training set and validation set of the model; F is the dimension of the final output features of the model, equal to the number of classes; and y_pred is the label predicted by the model.
The beneficial effects of the invention are as follows:
the text and word characteristics are initialized through the pre-trained BiLSTM module, time sequence information contained in the text can be introduced in advance before graph convolution is carried out, the defect that the global text graph cannot contain the time sequence information is overcome, and then the text graph is characterized and learned through a two-layer graph convolution neural network, so that the effect of text classification can be greatly improved.
Drawings
FIG. 1 is a schematic diagram of a method of the present invention.
Detailed Description
The present invention will be further described with reference to the accompanying drawings and examples, but the present invention is not limited to the following examples.
As shown in FIG. 1, the text classification method based on BiLSTM-GCN comprises a BiLSTM pre-training module, a text graph construction module and a GCN module, and is characterized by comprising the following steps:
1) BiLSTM pre-training module: pre-training the text by using a bidirectional LSTM and acquiring the text and word characteristics after pre-training;
LSTM is a deep learning network that models a sequence of time steps; its structure determines that the hidden layer at each moment is related to the previous inputs. Compared with an RNN, an LSTM can well avoid the gradient-vanishing problem caused by long-term dependence. Assume a text consists of n words {w_1, w_2, ..., w_n}, each word represented by a d-dimensional vector initialized with the Stanford pre-trained GloVe word vectors; the initialized word vectors are {v_1, v_2, ..., v_n}. In order for the model to obtain better word vectors, the word vectors are continuously updated with each iteration during training. Each text is modeled with a bidirectional LSTM: one LSTM gives the semantic representation of the text from the beginning of the sentence to the end of the sentence, and the other from the end of the sentence to the beginning, so the input at the current moment is related not only to the preceding state but also to the following state.
Each word v_t has both a forward hidden state h_t→ and a backward hidden state h_t←; h_t splices the two hidden states together, fusing the context information of the word. The specific formula is as follows:
h_t = [h_t→ ; h_t←];
Through the previous operations, text and word features containing time-series information can be extracted. A fully connected layer is then needed to classify the samples: the model takes the first hidden state h_1 and the last hidden state h_n as the features of the text, inputs them into the fully connected layer, and connects an activation function to realize the text classification. The specific formula is as follows:
y = softmax(MLP(h_1, h_n));
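As an illustration of step 1), the following PyTorch sketch shows one way the BiLSTM pre-training module could be implemented; the class name BiLSTMPretrain, the layer sizes, and the number of classes are assumptions for illustration, not the patent's exact implementation.

import torch
import torch.nn as nn

class BiLSTMPretrain(nn.Module):
    # Minimal sketch of the BiLSTM pre-training module; sizes are hypothetical.
    def __init__(self, vocab_size, embed_dim=300, hidden_dim=128, num_classes=8,
                 glove_vectors=None):
        super().__init__()
        self.embedding = nn.Embedding(vocab_size, embed_dim)
        if glove_vectors is not None:
            # Initialize with pre-trained GloVe vectors; the weights stay
            # trainable so the word vectors keep updating during training.
            self.embedding.weight.data.copy_(glove_vectors)
        self.bilstm = nn.LSTM(embed_dim, hidden_dim,
                              batch_first=True, bidirectional=True)
        # h_1 and h_n are concatenated, so the classifier sees 4 * hidden_dim.
        self.classifier = nn.Linear(4 * hidden_dim, num_classes)

    def forward(self, token_ids):
        v = self.embedding(token_ids)              # (batch, n, d) word vectors
        h, _ = self.bilstm(v)                      # each h_t fuses forward and backward states
        h_first, h_last = h[:, 0, :], h[:, -1, :]  # h_1 and h_n as the text features
        logits = self.classifier(torch.cat([h_first, h_last], dim=-1))
        return logits, h   # logits for pre-training; h later initializes graph nodes

After pre-training on the labeled texts with a softmax cross-entropy loss, the per-word states h and the text features serve to initialize the global text graph nodes in step 2).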
2) The text graph construction module: constructing one large global text graph for the whole text data set, wherein the nodes of the text graph consist of the texts and the words;
A global text graph is constructed for the whole text data set, with nodes consisting of texts and words. The edge weight between two words is based on their co-occurrence relation and is given by the PMI value or the cosine similarity between the words; the edge weight between a text and a word is based on the importance of the word in the text and is obtained with the TF-IDF value or a keyword extraction algorithm;
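As an illustration of step 2), the following sketch computes the two kinds of edge weights, assuming the TF-IDF option for text-word edges and the PMI option for word-word edges; the window size, the positive-PMI threshold, and the helper name build_text_graph_edges are hypothetical choices.

import math
from collections import Counter
from sklearn.feature_extraction.text import TfidfVectorizer

def build_text_graph_edges(docs, window_size=20):
    # Text-word edges: TF-IDF weight of each word in each text.
    vectorizer = TfidfVectorizer(token_pattern=r"\S+")
    tfidf = vectorizer.fit_transform(docs)   # (n_text, n_word) sparse matrix

    # Word-word edges: PMI computed over sliding windows across the corpus.
    windows = []
    for doc in docs:
        tokens = doc.split()
        for i in range(max(1, len(tokens) - window_size + 1)):
            windows.append(set(tokens[i:i + window_size]))
    n_win = len(windows)
    word_count = Counter(w for win in windows for w in win)
    pair_count = Counter()
    for win in windows:
        ws = sorted(win)
        for i in range(len(ws)):
            for j in range(i + 1, len(ws)):
                pair_count[(ws[i], ws[j])] += 1
    pmi_edges = {}
    for (wi, wj), c in pair_count.items():
        pmi = math.log((c / n_win) /
                       ((word_count[wi] / n_win) * (word_count[wj] / n_win)))
        if pmi > 0:   # keep only positively associated word pairs
            pmi_edges[(wi, wj)] = pmi
    return tfidf, pmi_edges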
3) GCN module: realizing representation learning of the text graph nodes through a two-layer graph convolutional neural network, thereby realizing the final text classification;
The word features v and the text features h containing time-series information are obtained through step 1); these features are the initialization features of the global text graph nodes, and putting them into the GCN further extracts the feature information of the text graph. Assume the constructed undirected text graph is g with n nodes, of which n_word are word nodes and n_text are text nodes; A denotes the adjacency matrix of the text graph g, I_N is an identity matrix, D is the degree matrix of A + I_N, W_0 and W_1 are weight matrices, and X is the node feature matrix of the text graph g.
The GCN model adopts a semi-supervised graph convolutional neural network: the whole model is trained on the small portion of nodes that carry labels and then classifies the remaining unlabeled nodes. The specific implementation formula is as follows:
Z = softmax(Â · ReLU(Â · X · W_0) · W_1), where Â = D^(-1/2)(A + I_N)D^(-1/2);
The loss function is the cross-entropy loss over all labeled texts, defined as follows:
L = -Σ_{d∈Y_D} Σ_{f=1}^{F} Y_{df} · ln y_pred,df;
Y_D is the set of labeled texts, i.e. the training set and validation set of the model; F is the dimension of the final output features of the model, equal to the number of classes; and y_pred is the label predicted by the model.
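As an illustration of step 3), a sketch of the two-layer semi-supervised GCN under the normalization Â = D^(-1/2)(A + I_N)D^(-1/2) defined above; the dense-matrix implementation, the hidden width, and the class name TwoLayerGCN are simplifying assumptions.

import torch
import torch.nn as nn
import torch.nn.functional as F

def normalize_adjacency(A):
    # Compute A_hat = D^(-1/2) (A + I) D^(-1/2) for a dense adjacency matrix A.
    A_tilde = A + torch.eye(A.size(0))
    d_inv_sqrt = torch.diag(A_tilde.sum(dim=1).pow(-0.5))
    return d_inv_sqrt @ A_tilde @ d_inv_sqrt

class TwoLayerGCN(nn.Module):
    def __init__(self, in_dim, hidden_dim, num_classes):
        super().__init__()
        self.W0 = nn.Linear(in_dim, hidden_dim, bias=False)
        self.W1 = nn.Linear(hidden_dim, num_classes, bias=False)

    def forward(self, A_hat, X):
        H = F.relu(A_hat @ self.W0(X))   # first graph convolution
        return A_hat @ self.W1(H)        # second layer; softmax is folded into the loss

# Semi-supervised training: cross-entropy is taken only over the labeled text nodes.
# logits = model(A_hat, X)
# loss = F.cross_entropy(logits[labeled_idx], labels[labeled_idx])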
Experimental comparative analysis process
The experiments of the invention are based on the Python 3.7 programming language and the PyTorch framework. The word vectors use Stanford's pre-trained 100-, 200- and 300-dimensional GloVe vectors. The initial learning rate of the whole model is 10^-3, the L2 weight decay is 10^-4, the dropout after the fully connected layer is set to 0.5, and the batch size is set to 64. If the validation loss does not drop for 5 consecutive training epochs, training is stopped.
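The hyperparameters above might be wired up as in the following sketch; the optimizer choice (Adam) and the early-stopping bookkeeping are assumptions, and train_one_epoch and evaluate are hypothetical helpers standing in for the usual loops.

import torch

# model as constructed in step 3), e.g. TwoLayerGCN(...)
optimizer = torch.optim.Adam(model.parameters(), lr=1e-3, weight_decay=1e-4)
dropout = torch.nn.Dropout(p=0.5)   # applied after the fully connected layer
batch_size = 64

best_val_loss, patience, bad_epochs = float("inf"), 5, 0
for epoch in range(200):
    train_one_epoch(model, optimizer, batch_size)
    val_loss = evaluate(model)
    if val_loss < best_val_loss:
        best_val_loss, bad_epochs = val_loss, 0
    else:
        bad_epochs += 1
        if bad_epochs >= patience:   # stop after 5 epochs with no improvement
            break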
The method is tested on the R8 and R52 data sets to verify the validity and performance of the model.
R8 and R52 are both news classification data sets in which each sample contains a piece of news text and the corresponding topic label. R8 contains 8 news categories, with 5485 training samples and 2189 test samples; R52 contains 52 news categories, with 6532 training samples and 2568 test samples. In each of the R8 and R52 data sets, 10% of the training data is randomly selected as the validation set. The comparison experiments are divided into traditional deep learning algorithms, comprising CNN, LSTM, BiLSTM and FastText, and graph neural network algorithms, comprising Text-GCN.
Accuracy and the F1 value serve as evaluation indexes of classification performance. As shown in Table 1, the GCN-based text classification models outperform the traditional deep learning text classification models on several classification data sets; because they learn text features through graph convolution, GCN-based models can learn the rich graph information in the text and thus achieve a better classification effect. The BiLSTM-GCN model introduces the time-series information of the text in advance on the basis of the GCN, enriching the feature information of the text. Experiments show that the accuracy of the BiLSTM-GCN model improves by 0.51% on the R8 data set and by 0.7% on the R52 data set, indicating that introducing the word and text feature information produced by the BiLSTM model achieves a better classification effect than directly using one-hot features.
TABLE 1 Accuracy comparison between BiLSTM-GCN and different models
TABLE 2 Comparison of F1 values between BiLSTM-GCN and different models
Accuracy alone does not sufficiently evaluate the model; the F1 value considers both recall and precision and can judge the effect of the proposed model more fully. As can be seen from Table 2, the proposed model improves on the F1 value of the baseline GCN model by 1.32% on the R8 data set and by 1.84% on the R52 data set, which further indicates that the proposed model is more effective on text classification tasks.

Claims (1)

1. A cultural resource text classification method based on a memory network and a graph neural network, characterized by comprising the following steps:
1) BiLSTM pre-training module: pre-training the text by using a bidirectional LSTM and acquiring the pre-trained text and word features;
assume that a text consists of n words {w_1, w_2, ..., w_n}, each word represented by a d-dimensional vector initialized with the Stanford pre-trained GloVe word vectors, the initialized word vectors being {v_1, v_2, ..., v_n}; in order for the model to obtain better word vectors, the word vectors are continuously updated with each iteration during training; each text is modeled with a bidirectional LSTM, wherein one LSTM carries out the semantic representation of the text from the beginning of the sentence to the end of the sentence, and the other LSTM carries out the semantic representation of the text from the end of the sentence to the beginning of the sentence, so that the input at the current moment is related not only to the preceding state but also to the following state;
each word v_t has both a forward hidden state h_t→ and a backward hidden state h_t←; h_t splices the two hidden states together, fusing the context information of the word, with the specific formula as follows:
h_t = [h_t→ ; h_t←];
through the previous operations, text and word features containing time-series information can be extracted; a fully connected layer is then needed to classify the samples: the model takes the first hidden state h_1 and the last hidden state h_n as the features of the text, inputs them into the fully connected layer, and connects an activation function to realize the text classification, with the specific formula as follows:
y = softmax(MLP(h_1, h_n));
2) The text graph construction module: constructing one large global text graph for the whole text data set, wherein the nodes of the text graph consist of the texts and the words;
a global text graph is constructed for the whole text data set, with nodes consisting of texts and words; the edge weight between two words is based on their co-occurrence relation and is given by the PMI value or the cosine similarity between the words; the edge weight between a text and a word is based on the importance of the word in the text and is obtained with the TF-IDF value or a keyword extraction algorithm;
3) GCN module: realizing representation learning of the text graph nodes through a two-layer graph convolutional neural network, thereby realizing the final text classification;
the word features v and text features h containing time-series information are obtained through step 1); these features are the initialization features of the global text graph nodes, and putting them into the GCN further extracts the feature information of the text graph; assume the constructed undirected text graph is g with n nodes, of which n_word are word nodes and n_text are text nodes; A denotes the adjacency matrix of the text graph g, I_N is an identity matrix, D is the degree matrix of A + I_N, W_0 and W_1 are weight matrices, and X is the node feature matrix of the text graph g;
the GCN model adopts a semi-supervised graph convolutional neural network: the whole model is trained on the small portion of nodes that carry labels and then classifies the remaining unlabeled nodes, with the specific implementation formula as follows:
Z = softmax(Â · ReLU(Â · X · W_0) · W_1), where Â = D^(-1/2)(A + I_N)D^(-1/2);
the loss function is the cross-entropy loss over all labeled texts, defined as follows:
L = -Σ_{d∈Y_D} Σ_{f=1}^{F} Y_{df} · ln y_pred,df;
Y_D is the set of labeled texts, i.e. the training set and the validation set of the model; F is the dimension of the final output features of the model, equal to the number of classes; and y_pred is the label predicted by the model.
CN202110864647.9A 2021-07-29 2021-07-29 Cultural resource text classification method based on memory network and graph neural network Active CN113516198B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202110864647.9A CN113516198B (en) Cultural resource text classification method based on memory network and graph neural network

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202110864647.9A CN113516198B (en) Cultural resource text classification method based on memory network and graph neural network

Publications (2)

Publication Number Publication Date
CN113516198A CN113516198A (en) 2021-10-19
CN113516198B (en) 2023-09-22

Family

ID=78068765

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202110864647.9A Active CN113516198B (en) Cultural resource text classification method based on memory network and graph neural network

Country Status (1)

Country Link
CN (1) CN113516198B (en)

Families Citing this family (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN117455421B (en) * 2023-12-25 2024-04-16 杭州青塔科技有限公司 Subject classification method and device for scientific research projects, computer equipment and storage medium

Citations (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN106021461A (en) * 2016-05-17 2016-10-12 深圳市中润四方信息技术有限公司 Text classification method and text classification system
CN109597891A (en) * 2018-11-26 2019-04-09 重庆邮电大学 Text emotion analysis method based on two-way length Memory Neural Networks in short-term
US10339423B1 (en) * 2017-06-13 2019-07-02 Symantec Corporation Systems and methods for generating training documents used by classification algorithms
WO2021008037A1 (en) * 2019-07-15 2021-01-21 平安科技(深圳)有限公司 A-bilstm neural network-based text classification method, storage medium, and computer device
CN112966503A (en) * 2021-03-22 2021-06-15 山东建筑大学 Aspect level emotion analysis method

Family Cites Families (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US11481418B2 (en) * 2020-01-02 2022-10-25 International Business Machines Corporation Natural question generation via reinforcement learning based graph-to-sequence model

Patent Citations (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN106021461A (en) * 2016-05-17 2016-10-12 深圳市中润四方信息技术有限公司 Text classification method and text classification system
US10339423B1 (en) * 2017-06-13 2019-07-02 Symantec Corporation Systems and methods for generating training documents used by classification algorithms
CN109597891A (en) * 2018-11-26 2019-04-09 重庆邮电大学 Text emotion analysis method based on two-way length Memory Neural Networks in short-term
WO2021008037A1 (en) * 2019-07-15 2021-01-21 平安科技(深圳)有限公司 A-bilstm neural network-based text classification method, storage medium, and computer device
CN112966503A (en) * 2021-03-22 2021-06-15 山东建筑大学 Aspect level emotion analysis method

Non-Patent Citations (3)

* Cited by examiner, † Cited by third party
Title
Graph Convolutional Networks for Text Classification; Liang Yao et al.; arXiv; 2018-11-13; pp. 1-9 *
Text sentiment analysis based on BERT; Liu Siqin et al.; Information Security Research; 2020-03-05 (No. 03); pp. 30-37 *
TextCGA: a text classification network based on pre-trained models; Yang Weiqi et al.; Modern Computer; 2020-04-25 (No. 12); pp. 53-38 *

Also Published As

Publication number Publication date
CN113516198A (en) 2021-10-19

Similar Documents

Publication Publication Date Title
CN110866117B (en) Short text classification method based on semantic enhancement and multi-level label embedding
CN110413986B (en) Text clustering multi-document automatic summarization method and system for improving word vector model
CN111274394B (en) Method, device and equipment for extracting entity relationship and storage medium
CN106599022B (en) User portrait forming method based on user access data
CN112699247B (en) Knowledge representation learning method based on multi-class cross entropy contrast complement coding
CN110321563B (en) Text emotion analysis method based on hybrid supervision model
CN111078833B (en) Text classification method based on neural network
CN114896388A (en) Hierarchical multi-label text classification method based on mixed attention
CN107688576B (en) Construction and tendency classification method of CNN-SVM model
CN111274790B (en) Chapter-level event embedding method and device based on syntactic dependency graph
CN114461804B (en) Text classification method, classifier and system based on key information and dynamic routing
CN111506732A (en) Text multi-level label classification method
CN113806547B (en) Deep learning multi-label text classification method based on graph model
CN112925904B (en) Lightweight text classification method based on Tucker decomposition
CN113254655B (en) Text classification method, electronic device and computer storage medium
CN114925205B (en) GCN-GRU text classification method based on contrast learning
CN112749274A (en) Chinese text classification method based on attention mechanism and interference word deletion
CN113051887A (en) Method, system and device for extracting announcement information elements
CN116049387A (en) Short text classification method, device and medium based on graph convolution
CN113343690A (en) Text readability automatic evaluation method and device
CN115659947A (en) Multi-item selection answering method and system based on machine reading understanding and text summarization
CN111709225A (en) Event cause and effect relationship judging method and device and computer readable storage medium
CN113516198B (en) Cultural resource text classification method based on memory network and graph neural network
CN112463982B (en) Relationship extraction method based on explicit and implicit entity constraint
CN113486143A (en) User portrait generation method based on multi-level text representation and model fusion

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant