CN112347255B - Text classification method based on title and text combination of graph network - Google Patents

Text classification method based on title and text combination of graph network

Info

Publication number
CN112347255B
CN112347255B (application CN202011233244.6A)
Authority
CN
China
Prior art keywords
text
word
title
document
model
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Expired - Fee Related
Application number
CN202011233244.6A
Other languages
Chinese (zh)
Other versions
CN112347255A (en)
Inventor
谢宗霞
袁春宇
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Tianjin University
Original Assignee
Tianjin University
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Tianjin University filed Critical Tianjin University
Priority to CN202011233244.6A priority Critical patent/CN112347255B/en
Publication of CN112347255A publication Critical patent/CN112347255A/en
Application granted granted Critical
Publication of CN112347255B publication Critical patent/CN112347255B/en
Expired - Fee Related legal-status Critical Current
Anticipated expiration legal-status Critical


Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/30Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
    • G06F16/35Clustering; Classification
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00Pattern recognition
    • G06F18/20Analysing
    • G06F18/24Classification techniques
    • G06F18/241Classification techniques relating to the classification model, e.g. parametric or non-parametric approaches
    • G06F18/2411Classification techniques relating to the classification model, e.g. parametric or non-parametric approaches based on the proximity to a decision surface, e.g. support vector machines
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/02Neural networks
    • G06N3/04Architecture, e.g. interconnection topology
    • G06N3/045Combinations of networks
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/02Neural networks
    • G06N3/08Learning methods


Abstract

The invention discloses a text classification method based on the combination of a title and a text via a graph network, which mainly comprises the following steps: dividing each document into a title document and a text document; preprocessing each to obtain a title word set and a text word set; obtaining word vector representations with a word vector model, topic vectors with an LDA model, and text-document feature representations with an HAN model; constructing a heterogeneous graph from three node types (titles, title words, and topics); inputting the heterogeneous graph into a GAT model to fuse title and text features and obtain a feature representation for each document; and predicting the text category with a Softmax function. The method not only uses additional information to alleviate the semantic sparsity of titles, but also fuses title and text features more effectively, reflecting the importance of the title in the text classification task, improving classification accuracy, and addressing the poor performance caused by neglecting the importance of the title in current news text classification.

Description

Text classification method based on title and text combination of graph network
Technical Field
The invention relates to a text classification method based on the combination of a title and a text of a graph network, belonging to the field of natural language processing.
Background
Text classification is a fundamental problem in natural language processing, and statistical learning methods have become the mainstream of the field. Text classification based on traditional machine learning preprocesses the text, extracts features, vectorizes the processed text, and then models the training data with a standard classification algorithm such as naive Bayes, k-nearest neighbors, expectation maximization, or a Support Vector Machine (SVM). However, the difficulty of feature engineering remains a challenge for traditional text classification.
Today, the continued development of deep learning and artificial intelligence has produced many promising results in text classification. Unlike traditional text classification methods, deep learning methods train word embeddings with neural network models such as Convolutional Neural Networks (CNN), Recurrent Neural Networks (RNN), and Long Short-Term Memory networks (LSTM). These models automatically learn text features well and improve classification performance, which has made them popular with researchers.
In recent years, graph neural networks, a new line of research, have attracted much attention: they are effective for tasks with rich relational structure and can preserve the global structural information of a graph in its embedding. The invention uses a graph network to address the neglect of title importance in text classification and to improve classification performance.
Disclosure of Invention
The invention provides a text classification method based on the combination of a title and a text via a graph network. It fuses title and text features through the graph network, solving the low classification accuracy caused by neglecting the importance of the title in current text classification tasks.
The invention provides a text classification method based on the combination of a title and a text of a graph network, which comprises the following steps:
1) collecting a Chinese news text data set, wherein the data set comprises documents and their categories, and establishing a stop word list;
2) processing the data set, and dividing all documents in the data set into a title document and a text document;
3) carrying out data preprocessing on the text document divided in the step 2), including sentence segmentation, word segmentation and stop word removal, and constructing a text word set;
4) training the text word set constructed in the step 3) by using a word vector training model to obtain distributed representation of each word in the text word set;
5) dividing the text document divided in the step 2) into a training set, a verification set and a test set;
6) inputting the training set divided in step 5) into an HAN (Hierarchical Attention Networks) model for training, evaluating the HAN model with the test set divided in step 5), optimizing the HAN model, and obtaining each text document vector;
7) segmenting the title documents divided in step 2), constructing a title word set, and training the title word set with a word vector training model to obtain a distributed representation of each word in the title word set;
8) training the document in the data set by using an LDA topic model to obtain N topics and topic word distribution of each topic, and obtaining each topic vector according to the topic word distribution;
9) taking the title document divided in the step 2), the title word set constructed in the step 7) and the theme obtained in the step 8) as nodes, and constructing a heterogeneous graph according to the relationship among the nodes;
10) dividing the title documents divided in the step 2) into a training set, a verification set and a test set;
11) representing each title document vector in the training set of step 10) by the corresponding text document vector obtained in step 6);
12) training a GAT (Graph Attention Networks) model with the heterogeneous graph constructed in step 9), the title document vectors from step 11), the word vectors from step 7), and the topic vectors from step 8); evaluating the GAT model with the test set divided in step 10) to fuse title and text features and obtain a feature representation of the whole document; and inputting the document feature representation into a softmax function, whose output is the document category.
Further, the text classification method based on the combination of the title and the text of the graph network, provided by the invention, comprises the following steps:
in step 1), the stop word list includes punctuation marks, mathematical symbols, conjunctions, interjections and modal particles.
The specific steps of step 3) are as follows: 3-1) truncating each text document to 500 words; 3-2) splitting the text document into sentences of 20 words each, preserving the original order of the text; 3-3) segmenting each sentence with the jieba word segmentation tool and removing stop words according to the stop word list; 3-4) establishing the text word set.
In step 4), the text word set is trained with the skip-gram model in Word2vec, with the dimension set to 300.
In step 5) the text documents, and in step 10) the title documents, are divided into a training set, a verification set and a test set at a ratio of 8:1:1.
In step 7), the jieba word segmentation tool is used for word segmentation, and the word vector model is the skip-gram model in Word2vec.
In step 8), the value of N is set according to the perplexity of the LDA topic model.
In step 9), the relationship among the three types of nodes is shown in formula (1):
[Formula (1) — node-relationship definition; provided only as an image in the original patent and not reproduced here]
in step 12), each document feature representation is passed through the softmax function shown in formula (2) to output the document category:
Z = softmax(H^(L))    (2)
where Z is the document category and H^(L) is the document feature representation.
Compared with the prior art, the invention has the beneficial effects that:
(1) The method extracts the text feature representation with an HAN network. When classifying long texts, attention at the word granularity alone is not enough; attention must also be learned over each sentence, so that long-text feature representations can be learned well.
(2) In the method of fusing title and text features with GAT, the GAT model not only uses additional information to alleviate the semantic sparsity of titles, but also fuses title and text features more effectively.
(3) The invention highlights the importance of the title in the text classification task and provides a text classification method based on the combination of a title and a text via a graph network, improving classification accuracy.
Drawings
FIG. 1 is a flow chart of the text classification based on the heading and body combination of a graph network in accordance with the present invention;
FIG. 2 is a diagram of the heterogeneous graph structure.
Detailed Description
To solve the poor classification performance caused by neglecting the importance of titles in current news text classification, the design concept of the text classification method based on the combination of a title and a text via a graph network is: divide each document into a title document and a text document; preprocess each to obtain a title word set and a text word set; obtain word vector representations with a word vector model, topic vectors with an LDA model, and text-document feature representations with an HAN model; construct a heterogeneous graph from three node types (titles, title words, and topics); input the heterogeneous graph into a GAT model to fuse title and text features and obtain a feature representation for each document; and predict the text category with a Softmax function.
The text classification method based on the combination of a title and a text via a graph network is further described below with the THUCNews (Tsinghua news) data set as an example, in combination with the drawings. The following examples only illustrate the technical solution more clearly and are only a part of the embodiments of the invention, so the protection scope of the invention is not limited thereby. All other embodiments obtained by a person skilled in the art without inventive effort fall within the scope of protection of the invention.
As shown in fig. 1, the text classification method of the present invention includes the following steps:
step 1) preparing a Chinese news text data set required by training, selecting a Qinghua news data set (THUCNews) as an example, wherein the data set comprises ten categories of finance, real estate, home furnishing, education, science and technology, fashion, politeness, sports, games and entertainment, and each category comprises ten thousand pieces of data; and establishing a deactivation vocabulary, wherein the deactivation vocabulary comprises punctuation marks, mathematical marks, connection words, exclamation words and word-of-speech words, but is not limited to the punctuation marks, the mathematical marks, the connection words, the exclamation words and the word-of-speech words.
Step 2) Process the data set and divide every document into a title document and a text document. In the experimental data (for example, a sports news item whose headline question precedes a report on the closing stage of the NCAA season), each item is split into two parts at the whitespace between the title and the text, and each part is labeled separately.
Step 3) carrying out data preprocessing on the text document divided in the step 2), including sentence segmentation, word segmentation and stop word removal, and constructing a text word set; the method comprises the following specific steps:
3-1) intercepting each text document by 500 words;
3-2) carrying out sentence division on the text document by 20 words per sentence, wherein the sequence after the sentence division is consistent with the sequence in the text;
3-3) Segment each sentence with the jieba word segmentation tool, which splits a sentence such as a headline question into its component words, and remove stop words according to the stop word list. For instance, in the sentence "The Great Wall is the crystallization of the blood and sweat of the working people of ancient China, and a symbol and pride of ancient Chinese culture", removing the stop words leaves only the content words, which reduces the amount of computation.
3-4) establishing a text word set.
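The preprocessing in steps 3-1) to 3-4) can be sketched as follows. The patent uses the jieba tool for segmentation; to keep the sketch dependency-free, the input here is already tokenized, and the stop word list and sample tokens are illustrative assumptions, not taken from the patent.

```python
# Sketch of steps 3-1..3-4: truncate, split into fixed-length sentences,
# remove stop words, and build the text word set.

STOP_WORDS = {"的", "是", "，", "。", "?", "!"}  # hypothetical stop word list

def preprocess(doc_tokens, max_words=500, sent_len=20):
    """Truncate to max_words, cut into sentences of sent_len words, drop stop words."""
    tokens = doc_tokens[:max_words]                      # step 3-1: truncate to 500
    sentences = [tokens[i:i + sent_len]                  # step 3-2: 20 words/sentence
                 for i in range(0, len(tokens), sent_len)]
    cleaned = [[w for w in sent if w not in STOP_WORDS]  # step 3-3: stop word removal
               for sent in sentences]
    vocab = sorted({w for sent in cleaned for w in sent})  # step 3-4: text word set
    return cleaned, vocab

# In practice these tokens would come from jieba segmentation of a document.
tokens = ["长城", "是", "中国", "古代", "劳动", "人民", "的", "结晶", "。"]
sents, vocab = preprocess(tokens)
```

The function keeps the sentence order consistent with the text, as step 3-2) requires.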
Step 4) Train the text word set constructed in step 3) with a word vector training model to obtain a distributed representation of each word in the text word set. In this embodiment, the skip-gram model in Word2vec is used, with the dimension set to 300. Word2vec yields a distributed representation of each word of the text and title, e.g. "great wall" → [0.33, 0.32, 0.25, 0.35, 0.23, ...], "china" → [0.52, 0.39, 0.56, ...]; the specific dimension can be set freely during model training, e.g. 200 or 100.
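A minimal sketch of what skip-gram training computes. The patent uses Word2vec's skip-gram implementation; this toy version uses a full softmax, a context window of 1, and an 8-dimensional embedding instead of the 300 dimensions of the embodiment — the corpus and all hyperparameters are assumptions for illustration.

```python
import numpy as np

# Toy skip-gram: each center word predicts its context words within a window.
corpus = [["长城", "中国", "文化"], ["中国", "新闻", "分类"]]
vocab = sorted({w for s in corpus for w in s})
idx = {w: i for i, w in enumerate(vocab)}
V, D = len(vocab), 8                       # patent uses D = 300

rng = np.random.default_rng(0)
W_in = rng.normal(0, 0.1, (V, D))          # center-word embeddings
W_out = rng.normal(0, 0.1, (V, D))         # context-word embeddings

def train_pair(center, context, lr=0.1):
    """One SGD step on a (center, context) pair with a full softmax."""
    global W_in, W_out
    h = W_in[idx[center]]
    scores = W_out @ h
    p = np.exp(scores - scores.max()); p /= p.sum()
    p[idx[context]] -= 1.0                 # gradient of cross-entropy w.r.t. scores
    W_in[idx[center]] -= lr * (W_out.T @ p)
    W_out -= lr * np.outer(p, h)

for sent in corpus:                        # window = 1 for brevity
    for i, w in enumerate(sent):
        for j in (i - 1, i + 1):
            if 0 <= j < len(sent):
                train_pair(w, sent[j])

vec = W_in[idx["长城"]]                    # distributed representation of a word
```

After training, each row of `W_in` is the distributed representation of one word in the text word set.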
Step 5) dividing the text document divided in the step 2) into a training set, a verification set and a test set, wherein the division ratio is 8:1: 1;
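The 8:1:1 split used here (and again in step 10) might look like the following; the integer document IDs stand in for the actual (document, label) pairs.

```python
import random

def split_811(docs, seed=42):
    """Shuffle and split into train/verification/test at a ratio of 8:1:1."""
    docs = docs[:]                          # avoid mutating the caller's list
    random.Random(seed).shuffle(docs)
    n = len(docs)
    n_train, n_val = int(n * 0.8), int(n * 0.1)
    return (docs[:n_train],
            docs[n_train:n_train + n_val],
            docs[n_train + n_val:])

train, val, test = split_811(list(range(100)))
```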
Step 6) Input the training set divided in step 5) into an HAN (Hierarchical Attention Networks) model for training, evaluate the HAN model with the test set divided in step 5), optimize the HAN model, and obtain each text document vector, e.g. document 1 → [0.36, 0.56, 0.35, ...].
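A rough sketch of HAN's two-level attention pooling — the part of the model that turns word vectors into sentence vectors and sentence vectors into a document vector. The bidirectional GRU encoders that the real HAN applies before each attention layer are omitted here, and all weights and shapes are random stand-ins.

```python
import numpy as np

rng = np.random.default_rng(0)
D = 16                                   # hidden size (300-d word vectors in the patent)

def attention_pool(H, W, b, u):
    """Additive attention: score each row of H, softmax, weighted sum."""
    scores = np.tanh(H @ W + b) @ u      # one score per row, shape (n,)
    a = np.exp(scores - scores.max()); a /= a.sum()
    return a @ H                         # weighted combination, shape (D,)

W, b, u = rng.normal(size=(D, D)), rng.normal(size=D), rng.normal(size=D)

# A toy document: 3 sentences of 5 word vectors each.
doc = rng.normal(size=(3, 5, D))
sent_vecs = np.stack([attention_pool(s, W, b, u) for s in doc])  # word-level attention
doc_vec = attention_pool(sent_vecs, W, b, u)                     # sentence-level attention
```

This mirrors the beneficial effect (1) noted below: attention is applied both at word granularity and over each sentence.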
Step 7) Segment the title documents divided in step 2) with the jieba word segmentation tool, construct the title word set, and train the title word set with a word vector training model — the skip-gram model in Word2vec — to obtain the distributed representation of each word in the title word set.
Step 8) Train the documents in the data set with an LDA topic model to obtain N topics and the topic-word distribution of each topic, and obtain each topic vector according to the topic-word distribution, where the value of N is set according to the perplexity of the LDA topic model.
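One plausible reading of "obtain each topic vector according to the topic-word distribution" is a probability-weighted average of the topic's word vectors. The two-topic, four-word distribution below is a made-up example, not the patent's data, and the word vectors are random stand-ins for the Word2vec output of step 4).

```python
import numpy as np

D = 8
vocab = ["体育", "比赛", "股票", "市场"]
rng = np.random.default_rng(1)
word_vecs = {w: rng.normal(size=D) for w in vocab}   # stand-ins for Word2vec vectors

# topic_word[k][i] = P(word i | topic k); each row sums to 1.
topic_word = np.array([[0.6, 0.4, 0.0, 0.0],         # a "sports"-like topic
                       [0.0, 0.0, 0.5, 0.5]])        # a "finance"-like topic

def topic_vector(dist):
    """Probability-weighted average of the topic's word vectors."""
    return sum(p * word_vecs[w] for p, w in zip(dist, vocab))

topic_vecs = np.stack([topic_vector(t) for t in topic_word])
```

The number of topics N (here 2) would in practice be chosen by the LDA model's perplexity, as the step states.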
Step 9) Take the title documents divided in step 2), the title word set constructed in step 7) and the topics obtained in step 8) as nodes, and construct a heterogeneous graph according to the relationships among the nodes, as shown in FIG. 2. The relationship among the three node types (title document, title word set and topic) is given by formula (1):
[Formula (1) — node-relationship definition; provided only as an image in the original patent and not reproduced here]
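A sketch of one way to build the heterogeneous adjacency matrix over the three node types. The edge rules here — title–word if the word occurs in the title, word–topic if the word is among the topic's top words — are an assumption; the patent's exact rule is formula (1), which survives only as an image.

```python
import numpy as np

titles = [["体育", "比赛"], ["股票", "市场"]]         # segmented title documents
words = ["体育", "比赛", "股票", "市场"]              # title word set
topic_top_words = [{"体育", "比赛"}, {"股票", "市场"}]  # top words per LDA topic

nodes = ([("title", i) for i in range(len(titles))] +
         [("word", w) for w in words] +
         [("topic", k) for k in range(len(topic_top_words))])
pos = {n: i for i, n in enumerate(nodes)}

A = np.eye(len(nodes))                                # self-loops
for i, t in enumerate(titles):
    for w in t:                                       # title -- word edges
        a, b = pos[("title", i)], pos[("word", w)]
        A[a, b] = A[b, a] = 1
for k, top in enumerate(topic_top_words):
    for w in top:                                     # word -- topic edges
        a, b = pos[("word", w)], pos[("topic", k)]
        A[a, b] = A[b, a] = 1
```

Titles reach each other only through shared words or topics, which is how the extra information alleviates the titles' semantic sparsity.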
Step 10) Divide the title documents divided in step 2) into a training set, a verification set and a test set at a ratio of 8:1:1;
Step 11) Represent each title document vector in the training set of step 10) by the corresponding text document vector obtained in step 6);
Step 12) Train a GAT (Graph Attention Networks) model with the heterogeneous graph constructed in step 9), the title document vectors from step 11), the word vectors from step 7) and the topic vectors from step 8). The title document vectors, word vectors and topic vectors are placed in three files and labeled separately; the relationships between nodes, i.e. the adjacency matrix, are stored in a file as node-index pairs. Evaluate the GAT model with the test set divided in step 10) to fuse title and text features and obtain the feature representation of the whole document; input the document feature representation into the softmax function shown in formula (2), whose output is the document category:
Z = softmax(H^(L))    (2)
where Z is the document category and H^(L) is the document feature representation.
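A single-head graph attention layer followed by the softmax of formula (2), sketched with toy shapes and random weights; a real run would use the trained title, word and topic vectors from the earlier steps and possibly multiple GAT layers and heads.

```python
import numpy as np

rng = np.random.default_rng(0)
N, D_in, C = 8, 16, 10                  # nodes, feature dim, 10 news categories

X = rng.normal(size=(N, D_in))          # node features (titles, words, topics)
A = np.eye(N); A[0, 1] = A[1, 0] = 1    # toy adjacency with self-loops

W = rng.normal(0, 0.1, (D_in, C))       # shared linear transform
a = rng.normal(0, 0.1, 2 * C)           # attention vector over [h_i || h_j]

def gat_layer(X, A):
    """One attention head: attend only over graph neighbors, then aggregate."""
    H = X @ W                                          # (N, C)
    e = np.full((N, N), -np.inf)                       # -inf masks non-edges
    for i in range(N):
        for j in range(N):
            if A[i, j]:
                z = np.concatenate([H[i], H[j]]) @ a
                e[i, j] = z if z > 0 else 0.2 * z      # LeakyReLU
    alpha = np.exp(e - e.max(axis=1, keepdims=True))
    alpha /= alpha.sum(axis=1, keepdims=True)          # attention coefficients
    return alpha @ H                                   # aggregated features H^(L)

H_L = gat_layer(X, A)                   # document feature representation H^(L)
Z = np.exp(H_L - H_L.max(axis=1, keepdims=True))
Z /= Z.sum(axis=1, keepdims=True)       # Z = softmax(H^(L)), formula (2)
pred = Z.argmax(axis=1)                 # predicted category per node
```

Each row of `Z` is a probability distribution over the categories, so the predicted document category is the row-wise argmax.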
The classification accuracy obtained in this example is 96.04%. Two comparative experiments were run on the same THUCNews data set: the TextCNN model in comparative example 1 achieves 92.36%, and the BiLSTM model in comparative example 2 achieves 94.36%, showing that the proposed method improves text classification accuracy. This supports the premise of the invention: the importance of the headline text should not be ignored in the text classification task.

Claims (9)

1. A text classification method based on title and text combination of a graph network is characterized by comprising the following steps:
step 1) collecting a Chinese news text data set, wherein the data set comprises documents and their categories, and establishing a stop word list;
step 2) processing the data set, and dividing all documents into header documents and text documents;
step 3) carrying out data preprocessing on the text document divided in the step 2), including sentence segmentation, word segmentation and stop word removal, and constructing a text word set;
step 4) training the text word set constructed in the step 3) by using a word vector training model to obtain distributed representation of each word in the text word set;
step 5) dividing the text documents divided in the step 2) into a training set, a verification set and a test set;
step 6) inputting the training set divided in step 5) into an HAN (Hierarchical Attention Networks) model for training, evaluating the HAN model with the test set divided in step 5), optimizing the HAN model, and obtaining each text document vector;
step 7) segmenting the title documents divided in step 2), constructing a title word set, and training the title word set with a word vector training model to obtain a distributed representation of each word in the title word set;
step 8) training the document in the data set by using an LDA topic model to obtain N topics and topic word distribution of each topic, and obtaining each topic vector according to the topic word distribution;
step 9) taking the title document divided in the step 2), the title word set constructed in the step 7) and the theme obtained in the step 8) as nodes, and constructing a heterogeneous graph according to the relationship among the nodes;
step 10) dividing the title documents divided in the step 2) into a training set, a verification set and a test set;
step 11) representing each title document vector in the training set in the step 10) by each text document vector obtained in the step 6);
and step 12) training a GAT (Graph Attention Networks) model with the heterogeneous graph constructed in step 9), the title document vectors constructed in step 11), the word vectors constructed in step 7) and the topic vectors constructed in step 8); evaluating the GAT model with the test set divided in step 10) to fuse title and text features and obtain a feature representation of the whole document; and inputting the document feature representation into a softmax function, whose output is the document category.
2. The method for classifying texts combining titles and texts based on a graph network according to claim 1, wherein in step 1), the stop word list includes punctuation marks, mathematical symbols, conjunctions, interjections and modal particles.
3. The text classification method based on the combination of the title and the text of the graph network as claimed in claim 1, wherein the specific steps of step 3) are as follows:
3-1) truncating each text document to 500 words;
3-2) splitting the text document into sentences of 20 words each, preserving the original order of the text;
3-3) performing word segmentation on each clause by using a jieba word segmentation tool, and removing stop words in each clause according to a stop word list;
3-4) establishing a text word set.
4. The text classification method based on the combination of the title and the text of the graph network according to claim 1, wherein in step 4), the skip-gram model in Word2vec is used for training the text word set, with the dimension set to 300.
5. The method for classifying texts combining titles and texts based on graph networks according to claim 1, wherein in step 5) the text documents, and in step 10) the title documents, are divided into a training set, a verification set and a test set at a ratio of 8:1:1.
6. The method for text classification based on the combination of title and text of a graph network according to claim 1, wherein in step 7), the jieba word segmentation tool is used for word segmentation, and the word vector model is the skip-gram model in Word2vec.
7. The method for classifying texts combining titles and texts based on graph network as claimed in claim 1, wherein in step 8), the value of N is set according to the perplexity of LDA topic model.
8. The text classification method based on the combination of the title and the text of the graph network according to claim 1, wherein in step 9), the relationship between the three types of nodes is shown in formula (1):
[Formula (1) — node-relationship definition; provided only as an image in the original patent and not reproduced here]
9. The text classification method based on the combination of the title and the text of the graph network according to claim 1, wherein in step 12), each document feature representation is passed through the softmax function shown in formula (2) to output the document category:
Z = softmax(H^(L))    (2)
where Z is the document category and H^(L) is the document feature representation.
CN202011233244.6A 2020-11-06 2020-11-06 Text classification method based on title and text combination of graph network Expired - Fee Related CN112347255B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202011233244.6A CN112347255B (en) 2020-11-06 2020-11-06 Text classification method based on title and text combination of graph network

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202011233244.6A CN112347255B (en) 2020-11-06 2020-11-06 Text classification method based on title and text combination of graph network

Publications (2)

Publication Number Publication Date
CN112347255A CN112347255A (en) 2021-02-09
CN112347255B true CN112347255B (en) 2021-11-23

Family

ID=74428724

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202011233244.6A Expired - Fee Related CN112347255B (en) 2020-11-06 2020-11-06 Text classification method based on title and text combination of graph network

Country Status (1)

Country Link
CN (1) CN112347255B (en)

Families Citing this family (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN113239200B (en) * 2021-05-20 2022-07-12 东北农业大学 Content identification and classification method, device and system and storage medium
CN113378950A (en) * 2021-06-22 2021-09-10 深圳市查策网络信息技术有限公司 Unsupervised classification method for long texts
CN116701812B (en) * 2023-08-03 2023-11-28 中国测绘科学研究院 Geographic information webpage text topic classification method based on block units

Family Cites Families (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
JP6217468B2 (en) * 2014-03-10 2017-10-25 富士ゼロックス株式会社 Multilingual document classification program and information processing apparatus
CN109753567A (en) * 2019-01-31 2019-05-14 安徽大学 A kind of file classification method of combination title and text attention mechanism
CN110704626B (en) * 2019-09-30 2022-07-22 北京邮电大学 Short text classification method and device
CN111581967B (en) * 2020-05-06 2023-08-11 西安交通大学 News theme event detection method combining LW2V with triple network

Also Published As

Publication number Publication date
CN112347255A (en) 2021-02-09

Similar Documents

Publication Publication Date Title
CN112347255B (en) Text classification method based on title and text combination of graph network
CN107133213B (en) Method and system for automatically extracting text abstract based on algorithm
CN110413986B (en) Text clustering multi-document automatic summarization method and system for improving word vector model
CN110866117B (en) Short text classification method based on semantic enhancement and multi-level label embedding
CN108052593B (en) Topic keyword extraction method based on topic word vector and network structure
CN108280206B (en) Short text classification method based on semantic enhancement
CN107392147A (en) A kind of image sentence conversion method based on improved production confrontation network
CN101599071A (en) The extraction method of conversation text topic
CN106202256A (en) Propagate based on semanteme and mix the Web graph of multi-instance learning as search method
CN102207945A (en) Knowledge network-based text indexing system and method
CN111027595A (en) Double-stage semantic word vector generation method
CN112883171B (en) Document keyword extraction method and device based on BERT model
CN111274804A (en) Case information extraction method based on named entity recognition
CN112163089B (en) High-technology text classification method and system integrating named entity recognition
CN110956041A (en) Depth learning-based co-purchase recombination bulletin summarization method
CN108920586A (en) A kind of short text classification method based on depth nerve mapping support vector machines
CN110457711A (en) A kind of social media event topic recognition methods based on descriptor
CN117171333A (en) Electric power file question-answering type intelligent retrieval method and system
CN103853792A (en) Automatic image semantic annotation method and system
CN107832307B (en) Chinese word segmentation method based on undirected graph and single-layer neural network
CN101271448A (en) Chinese language fundamental noun phrase recognition, its regulation generating method and apparatus
CN104123336A (en) Deep Boltzmann machine model and short text subject classification system and method
CN116304064A (en) Text classification method based on extraction
CN113486143A (en) User portrait generation method based on multi-level text representation and model fusion
CN113434668B (en) Deep learning text classification method and system based on model fusion

Legal Events

Date Code Title Description
PB01 Publication
PB01 Publication
SE01 Entry into force of request for substantive examination
SE01 Entry into force of request for substantive examination
GR01 Patent grant
GR01 Patent grant
CF01 Termination of patent right due to non-payment of annual fee

Granted publication date: 20211123