CN112347255B - Text classification method based on title and text combination of graph network - Google Patents

Text classification method based on title and text combination of graph network

Info

Publication number
CN112347255B
CN112347255B (application CN202011233244.6A)
Authority
CN
China
Prior art keywords
text
word
title
document
model
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Expired - Fee Related
Application number
CN202011233244.6A
Other languages
Chinese (zh)
Other versions
CN112347255A (en)
Inventor
谢宗霞
袁春宇
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Tianjin University
Original Assignee
Tianjin University
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Tianjin University filed Critical Tianjin University
Priority to CN202011233244.6A priority Critical patent/CN112347255B/en
Publication of CN112347255A publication Critical patent/CN112347255A/en
Application granted granted Critical
Publication of CN112347255B publication Critical patent/CN112347255B/en
Expired - Fee Related legal-status Critical Current
Anticipated expiration legal-status Critical


Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/30Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
    • G06F16/35Clustering; Classification
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00Pattern recognition
    • G06F18/20Analysing
    • G06F18/24Classification techniques
    • G06F18/241Classification techniques relating to the classification model, e.g. parametric or non-parametric approaches
    • G06F18/2411Classification techniques relating to the classification model, e.g. parametric or non-parametric approaches based on the proximity to a decision surface, e.g. support vector machines
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/02Neural networks
    • G06N3/04Architecture, e.g. interconnection topology
    • G06N3/045Combinations of networks
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/02Neural networks
    • G06N3/08Learning methods


Abstract

The invention discloses a text classification method based on the combination of a title and a text via a graph network, which mainly comprises the following steps: dividing each document into a title document and a text document; preprocessing each to obtain a title word set and a text word set; obtaining word vector representations with a word vector model, topic vectors with an LDA model, and text-document feature representations with an HAN model; constructing a heterogeneous graph from three node types (titles, title words, and topics); inputting the heterogeneous graph into a GAT model to fuse title and text features and obtain a feature representation for each document; and predicting the text category with a Softmax function. The method not only uses additional information to alleviate the semantic sparsity of titles, but also fuses title and text features more effectively, reflecting the importance of the title in the text classification task, improving classification accuracy, and addressing the poor performance caused by neglecting the importance of the title in current news text classification.

Description

Text classification method based on title and text combination of graph network
Technical Field
The invention relates to a text classification method based on the combination of a title and a text of a graph network, belonging to the field of natural language processing.
Background
Text classification is a fundamental problem in natural language processing, and statistical learning methods have become the mainstream of the field. Text classification based on traditional machine learning preprocesses the text, extracts features, vectorizes the processed text, and then models the training data with a standard classification algorithm such as naive Bayes, k-nearest neighbors, expectation maximization, or a Support Vector Machine (SVM). However, the difficulty of feature engineering remains a challenge for traditional text classification.
Today, the continued development of deep learning and artificial intelligence has produced many promising results in text classification. Unlike traditional text classification methods, deep learning methods train word embeddings with neural network models such as Convolutional Neural Networks (CNN), Recurrent Neural Networks (RNN), and Long Short-Term Memory networks (LSTM). These models automatically learn text features well and improve classification performance, which has made them popular with researchers.
In recent years, graph neural networks, a new line of research, have attracted much attention: they are effective for tasks with rich relational structure and can preserve the global structural information of a graph in its embedding. The invention uses a graph network to address the neglect of title importance in text classification and to improve classification performance.
Disclosure of Invention
The invention provides a text classification method based on the combination of a title and a text via a graph network. It fuses title and text features through the graph network, solving the low classification accuracy caused by neglecting the importance of the title in current text classification tasks.
The invention provides a text classification method based on the combination of a title and a text of a graph network, which comprises the following steps:
1) collecting a Chinese news text data set, wherein the data set comprises documents and their categories, and establishing a stop word list;
2) processing the data set, and dividing all documents in the data set into a title document and a text document;
3) carrying out data preprocessing on the text document divided in the step 2), including sentence segmentation, word segmentation and stop word removal, and constructing a text word set;
4) training the text word set constructed in the step 3) by using a word vector training model to obtain distributed representation of each word in the text word set;
5) dividing the text document divided in the step 2) into a training set, a verification set and a test set;
6) inputting the training set divided in step 5) into an HAN (Hierarchical Attention Networks) model for training, evaluating the HAN model with the test set divided in step 5), optimizing the HAN model, and obtaining each text document vector;
7) segmenting the title documents divided in step 2), constructing a title word set, and training the title word set with a word vector training model to obtain a distributed representation of each word in the title word set;
8) training the document in the data set by using an LDA topic model to obtain N topics and topic word distribution of each topic, and obtaining each topic vector according to the topic word distribution;
9) taking the title document divided in the step 2), the title word set constructed in the step 7) and the theme obtained in the step 8) as nodes, and constructing a heterogeneous graph according to the relationship among the nodes;
10) dividing the title documents divided in the step 2) into a training set, a verification set and a test set;
11) representing each title document vector in the training set of step 10) by the corresponding text document vector obtained in step 6);
12) training a GAT (Graph Attention Networks) model with the heterogeneous graph constructed in step 9), the title document vectors from step 11), the word vectors from step 7), and the topic vectors from step 8); evaluating the GAT model with the test set divided in step 10) to fuse title and text features and obtain a feature representation of the whole document; and inputting the document feature representation into a softmax function, whose output is the document category.
Further, the text classification method based on the combination of the title and the text of the graph network, provided by the invention, comprises the following steps:
in step 1), the stop word list includes punctuation marks, mathematical symbols, conjunctions, interjections and modal particles.
The specific steps of step 3) are as follows: 3-1) truncating each text document to 500 words; 3-2) splitting the text document into sentences of 20 words each, preserving the original order of the text; 3-3) segmenting each sentence with the jieba word segmentation tool and removing stop words according to the stop word list; 3-4) establishing the text word set.
In step 4), the text word set is trained with the skip-gram model in Word2vec, with the dimension set to 300.
In step 5) the text documents, and in step 10) the title documents, are divided into a training set, a verification set and a test set at a ratio of 8:1:1.
In step 7), the jieba word segmentation tool is used for word segmentation, and the word vector model is the skip-gram model in Word2vec.
In step 8), the value of N is set according to the perplexity of the LDA topic model.
In step 9), the relationship among the three types of nodes is shown in formula (1):
[Formula (1) — node-relationship definition; provided only as an image in the original patent and not reproduced here]
in step 12), each document feature representation is passed through the softmax function shown in formula (2) to output the document category:
Z = softmax(H^(L))    (2)
where Z is the document category and H^(L) is the document feature representation.
Compared with the prior art, the invention has the beneficial effects that:
(1) The method extracts the text feature representation with an HAN network. When classifying long texts, attention at the word granularity alone is not enough; attention must also be learned over each sentence, so that long-text feature representations can be learned well.
(2) In the method of fusing title and text features with GAT, the GAT model not only uses additional information to alleviate the semantic sparsity of titles, but also fuses title and text features more effectively.
(3) The invention highlights the importance of the title in the text classification task and provides a text classification method based on the combination of a title and a text via a graph network, improving classification accuracy.
Drawings
FIG. 1 is a flow chart of the text classification based on the heading and body combination of a graph network in accordance with the present invention;
FIG. 2 is a diagram of the heterogeneous graph structure.
Detailed Description
To solve the poor classification performance caused by neglecting the importance of titles in current news text classification, the design concept of the text classification method based on the combination of a title and a text via a graph network is: divide each document into a title document and a text document; preprocess each to obtain a title word set and a text word set; obtain word vector representations with a word vector model, topic vectors with an LDA model, and text-document feature representations with an HAN model; construct a heterogeneous graph from three node types (titles, title words, and topics); input the heterogeneous graph into a GAT model to fuse title and text features and obtain a feature representation for each document; and predict the text category with a Softmax function.
The text classification method based on the combination of a title and a text via a graph network is further described below with the THUCNews (Tsinghua news) data set as an example, in combination with the drawings. The following examples only illustrate the technical solution more clearly and are only a part of the embodiments of the invention, so the protection scope of the invention is not limited thereby. All other embodiments obtained by a person skilled in the art without inventive effort fall within the scope of protection of the invention.
As shown in fig. 1, the text classification method of the present invention includes the following steps:
step 1) preparing a Chinese news text data set required by training, selecting a Qinghua news data set (THUCNews) as an example, wherein the data set comprises ten categories of finance, real estate, home furnishing, education, science and technology, fashion, politeness, sports, games and entertainment, and each category comprises ten thousand pieces of data; and establishing a deactivation vocabulary, wherein the deactivation vocabulary comprises punctuation marks, mathematical marks, connection words, exclamation words and word-of-speech words, but is not limited to the punctuation marks, the mathematical marks, the connection words, the exclamation words and the word-of-speech words.
Step 2) Process the data set and divide every document into a title document and a text document. In the experimental data (for example, a sports news item whose headline question precedes a report on the closing stage of the NCAA season), each item is split into two parts at the whitespace between the title and the text, and each part is labeled separately.
Step 3) carrying out data preprocessing on the text document divided in the step 2), including sentence segmentation, word segmentation and stop word removal, and constructing a text word set; the method comprises the following specific steps:
3-1) intercepting each text document by 500 words;
3-2) carrying out sentence division on the text document by 20 words per sentence, wherein the sequence after the sentence division is consistent with the sequence in the text;
3-3) Segment each sentence with the jieba word segmentation tool, which splits a sentence such as a headline question into its component words, and remove stop words according to the stop word list. For instance, in the sentence "The Great Wall is the crystallization of the blood and sweat of the working people of ancient China, and a symbol and pride of ancient Chinese culture", removing the stop words leaves only the content words, which reduces the amount of computation.
3-4) establishing a text word set.
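The preprocessing in steps 3-1) to 3-4) can be sketched as follows. The patent uses the jieba tool for segmentation; to keep the sketch dependency-free, the input here is already tokenized, and the stop word list and sample tokens are illustrative assumptions, not taken from the patent.

```python
# Sketch of steps 3-1..3-4: truncate, split into fixed-length sentences,
# remove stop words, and build the text word set.

STOP_WORDS = {"的", "是", "，", "。", "?", "!"}  # hypothetical stop word list

def preprocess(doc_tokens, max_words=500, sent_len=20):
    """Truncate to max_words, cut into sentences of sent_len words, drop stop words."""
    tokens = doc_tokens[:max_words]                      # step 3-1: truncate to 500
    sentences = [tokens[i:i + sent_len]                  # step 3-2: 20 words/sentence
                 for i in range(0, len(tokens), sent_len)]
    cleaned = [[w for w in sent if w not in STOP_WORDS]  # step 3-3: stop word removal
               for sent in sentences]
    vocab = sorted({w for sent in cleaned for w in sent})  # step 3-4: text word set
    return cleaned, vocab

# In practice these tokens would come from jieba segmentation of a document.
tokens = ["长城", "是", "中国", "古代", "劳动", "人民", "的", "结晶", "。"]
sents, vocab = preprocess(tokens)
```

The function keeps the sentence order consistent with the text, as step 3-2) requires.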
Step 4) Train the text word set constructed in step 3) with a word vector training model to obtain a distributed representation of each word in the text word set. In this embodiment, the skip-gram model in Word2vec is used, with the dimension set to 300. Word2vec yields a distributed representation of each word of the text and title, e.g. "great wall" → [0.33, 0.32, 0.25, 0.35, 0.23, ...], "china" → [0.52, 0.39, 0.56, ...]; the specific dimension can be set freely during model training, e.g. 200 or 100.
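A minimal sketch of what skip-gram training computes. The patent uses Word2vec's skip-gram implementation; this toy version uses a full softmax, a context window of 1, and an 8-dimensional embedding instead of the 300 dimensions of the embodiment — the corpus and all hyperparameters are assumptions for illustration.

```python
import numpy as np

# Toy skip-gram: each center word predicts its context words within a window.
corpus = [["长城", "中国", "文化"], ["中国", "新闻", "分类"]]
vocab = sorted({w for s in corpus for w in s})
idx = {w: i for i, w in enumerate(vocab)}
V, D = len(vocab), 8                       # patent uses D = 300

rng = np.random.default_rng(0)
W_in = rng.normal(0, 0.1, (V, D))          # center-word embeddings
W_out = rng.normal(0, 0.1, (V, D))         # context-word embeddings

def train_pair(center, context, lr=0.1):
    """One SGD step on a (center, context) pair with a full softmax."""
    global W_in, W_out
    h = W_in[idx[center]]
    scores = W_out @ h
    p = np.exp(scores - scores.max()); p /= p.sum()
    p[idx[context]] -= 1.0                 # gradient of cross-entropy w.r.t. scores
    W_in[idx[center]] -= lr * (W_out.T @ p)
    W_out -= lr * np.outer(p, h)

for sent in corpus:                        # window = 1 for brevity
    for i, w in enumerate(sent):
        for j in (i - 1, i + 1):
            if 0 <= j < len(sent):
                train_pair(w, sent[j])

vec = W_in[idx["长城"]]                    # distributed representation of a word
```

After training, each row of `W_in` is the distributed representation of one word in the text word set.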
Step 5) dividing the text document divided in the step 2) into a training set, a verification set and a test set, wherein the division ratio is 8:1: 1;
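The 8:1:1 split used here (and again in step 10) might look like the following; the integer document IDs stand in for the actual (document, label) pairs.

```python
import random

def split_811(docs, seed=42):
    """Shuffle and split into train/verification/test at a ratio of 8:1:1."""
    docs = docs[:]                          # avoid mutating the caller's list
    random.Random(seed).shuffle(docs)
    n = len(docs)
    n_train, n_val = int(n * 0.8), int(n * 0.1)
    return (docs[:n_train],
            docs[n_train:n_train + n_val],
            docs[n_train + n_val:])

train, val, test = split_811(list(range(100)))
```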
Step 6) Input the training set divided in step 5) into an HAN (Hierarchical Attention Networks) model for training, evaluate the HAN model with the test set divided in step 5), optimize the HAN model, and obtain each text document vector, e.g. document 1 → [0.36, 0.56, 0.35, ...].
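A rough sketch of HAN's two-level attention pooling — the part of the model that turns word vectors into sentence vectors and sentence vectors into a document vector. The bidirectional GRU encoders that the real HAN applies before each attention layer are omitted here, and all weights and shapes are random stand-ins.

```python
import numpy as np

rng = np.random.default_rng(0)
D = 16                                   # hidden size (300-d word vectors in the patent)

def attention_pool(H, W, b, u):
    """Additive attention: score each row of H, softmax, weighted sum."""
    scores = np.tanh(H @ W + b) @ u      # one score per row, shape (n,)
    a = np.exp(scores - scores.max()); a /= a.sum()
    return a @ H                         # weighted combination, shape (D,)

W, b, u = rng.normal(size=(D, D)), rng.normal(size=D), rng.normal(size=D)

# A toy document: 3 sentences of 5 word vectors each.
doc = rng.normal(size=(3, 5, D))
sent_vecs = np.stack([attention_pool(s, W, b, u) for s in doc])  # word-level attention
doc_vec = attention_pool(sent_vecs, W, b, u)                     # sentence-level attention
```

This mirrors the beneficial effect (1) noted below: attention is applied both at word granularity and over each sentence.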
Step 7) Segment the title documents divided in step 2) with the jieba word segmentation tool, construct the title word set, and train the title word set with a word vector training model — the skip-gram model in Word2vec — to obtain the distributed representation of each word in the title word set.
Step 8) Train the documents in the data set with an LDA topic model to obtain N topics and the topic-word distribution of each topic, and obtain each topic vector according to the topic-word distribution, where the value of N is set according to the perplexity of the LDA topic model.
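One plausible reading of "obtain each topic vector according to the topic-word distribution" is a probability-weighted average of the topic's word vectors. The two-topic, four-word distribution below is a made-up example, not the patent's data, and the word vectors are random stand-ins for the Word2vec output of step 4).

```python
import numpy as np

D = 8
vocab = ["体育", "比赛", "股票", "市场"]
rng = np.random.default_rng(1)
word_vecs = {w: rng.normal(size=D) for w in vocab}   # stand-ins for Word2vec vectors

# topic_word[k][i] = P(word i | topic k); each row sums to 1.
topic_word = np.array([[0.6, 0.4, 0.0, 0.0],         # a "sports"-like topic
                       [0.0, 0.0, 0.5, 0.5]])        # a "finance"-like topic

def topic_vector(dist):
    """Probability-weighted average of the topic's word vectors."""
    return sum(p * word_vecs[w] for p, w in zip(dist, vocab))

topic_vecs = np.stack([topic_vector(t) for t in topic_word])
```

The number of topics N (here 2) would in practice be chosen by the LDA model's perplexity, as the step states.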
Step 9) Take the title documents divided in step 2), the title word set constructed in step 7) and the topics obtained in step 8) as nodes, and construct a heterogeneous graph according to the relationships among the nodes, as shown in FIG. 2. The relationship among the three node types (title document, title word set and topic) is given by formula (1):
[Formula (1) — node-relationship definition; provided only as an image in the original patent and not reproduced here]
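A sketch of one way to build the heterogeneous adjacency matrix over the three node types. The edge rules here — title–word if the word occurs in the title, word–topic if the word is among the topic's top words — are an assumption; the patent's exact rule is formula (1), which survives only as an image.

```python
import numpy as np

titles = [["体育", "比赛"], ["股票", "市场"]]         # segmented title documents
words = ["体育", "比赛", "股票", "市场"]              # title word set
topic_top_words = [{"体育", "比赛"}, {"股票", "市场"}]  # top words per LDA topic

nodes = ([("title", i) for i in range(len(titles))] +
         [("word", w) for w in words] +
         [("topic", k) for k in range(len(topic_top_words))])
pos = {n: i for i, n in enumerate(nodes)}

A = np.eye(len(nodes))                                # self-loops
for i, t in enumerate(titles):
    for w in t:                                       # title -- word edges
        a, b = pos[("title", i)], pos[("word", w)]
        A[a, b] = A[b, a] = 1
for k, top in enumerate(topic_top_words):
    for w in top:                                     # word -- topic edges
        a, b = pos[("word", w)], pos[("topic", k)]
        A[a, b] = A[b, a] = 1
```

Titles reach each other only through shared words or topics, which is how the extra information alleviates the titles' semantic sparsity.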
Step 10) Divide the title documents divided in step 2) into a training set, a verification set and a test set at a ratio of 8:1:1;
Step 11) Represent each title document vector in the training set of step 10) by the corresponding text document vector obtained in step 6);
Step 12) Train a GAT (Graph Attention Networks) model with the heterogeneous graph constructed in step 9), the title document vectors from step 11), the word vectors from step 7) and the topic vectors from step 8). The title document vectors, word vectors and topic vectors are placed in three files and labeled separately; the relationships between nodes, i.e. the adjacency matrix, are stored in a file as node-index pairs. Evaluate the GAT model with the test set divided in step 10) to fuse title and text features and obtain the feature representation of the whole document; input the document feature representation into the softmax function shown in formula (2), whose output is the document category:
Z = softmax(H^(L))    (2)
where Z is the document category and H^(L) is the document feature representation.
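A single-head graph attention layer followed by the softmax of formula (2), sketched with toy shapes and random weights; a real run would use the trained title, word and topic vectors from the earlier steps and possibly multiple GAT layers and heads.

```python
import numpy as np

rng = np.random.default_rng(0)
N, D_in, C = 8, 16, 10                  # nodes, feature dim, 10 news categories

X = rng.normal(size=(N, D_in))          # node features (titles, words, topics)
A = np.eye(N); A[0, 1] = A[1, 0] = 1    # toy adjacency with self-loops

W = rng.normal(0, 0.1, (D_in, C))       # shared linear transform
a = rng.normal(0, 0.1, 2 * C)           # attention vector over [h_i || h_j]

def gat_layer(X, A):
    """One attention head: attend only over graph neighbors, then aggregate."""
    H = X @ W                                          # (N, C)
    e = np.full((N, N), -np.inf)                       # -inf masks non-edges
    for i in range(N):
        for j in range(N):
            if A[i, j]:
                z = np.concatenate([H[i], H[j]]) @ a
                e[i, j] = z if z > 0 else 0.2 * z      # LeakyReLU
    alpha = np.exp(e - e.max(axis=1, keepdims=True))
    alpha /= alpha.sum(axis=1, keepdims=True)          # attention coefficients
    return alpha @ H                                   # aggregated features H^(L)

H_L = gat_layer(X, A)                   # document feature representation H^(L)
Z = np.exp(H_L - H_L.max(axis=1, keepdims=True))
Z /= Z.sum(axis=1, keepdims=True)       # Z = softmax(H^(L)), formula (2)
pred = Z.argmax(axis=1)                 # predicted category per node
```

Each row of `Z` is a probability distribution over the categories, so the predicted document category is the row-wise argmax.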
The classification accuracy obtained in this example is 96.04%. Two comparative experiments were run on the same THUCNews data set: the TextCNN model in comparative example 1 achieves 92.36%, and the BiLSTM model in comparative example 2 achieves 94.36%, showing that the proposed method improves text classification accuracy. This supports the premise of the invention: the importance of the headline text should not be ignored in the text classification task.

Claims (9)

1. A text classification method based on title and text combination of a graph network is characterized by comprising the following steps:
step 1) collecting a Chinese news text data set, wherein the data set comprises documents and their categories, and establishing a stop word list;
step 2) processing the data set, and dividing all documents into header documents and text documents;
step 3) carrying out data preprocessing on the text document divided in the step 2), including sentence segmentation, word segmentation and stop word removal, and constructing a text word set;
step 4) training the text word set constructed in the step 3) by using a word vector training model to obtain distributed representation of each word in the text word set;
step 5) dividing the text documents divided in the step 2) into a training set, a verification set and a test set;
step 6) inputting the training set divided in step 5) into an HAN (Hierarchical Attention Networks) model for training, evaluating the HAN model with the test set divided in step 5), optimizing the HAN model, and obtaining each text document vector;
step 7) segmenting the title documents divided in step 2), constructing a title word set, and training the title word set with a word vector training model to obtain a distributed representation of each word in the title word set;
step 8) training the document in the data set by using an LDA topic model to obtain N topics and topic word distribution of each topic, and obtaining each topic vector according to the topic word distribution;
step 9) taking the title document divided in the step 2), the title word set constructed in the step 7) and the theme obtained in the step 8) as nodes, and constructing a heterogeneous graph according to the relationship among the nodes;
step 10) dividing the title documents divided in the step 2) into a training set, a verification set and a test set;
step 11) representing each title document vector in the training set in the step 10) by each text document vector obtained in the step 6);
and step 12) training a GAT (Graph Attention Networks) model with the heterogeneous graph constructed in step 9), the title document vectors constructed in step 11), the word vectors constructed in step 7) and the topic vectors constructed in step 8); evaluating the GAT model with the test set divided in step 10) to fuse title and text features and obtain a feature representation of the whole document; and inputting the document feature representation into a softmax function, whose output is the document category.
2. The method for classifying texts combining titles and texts based on a graph network according to claim 1, wherein in step 1), the stop word list includes punctuation marks, mathematical symbols, conjunctions, interjections and modal particles.
3. The text classification method based on the combination of the title and the text of the graph network as claimed in claim 1, wherein the specific steps of step 3) are as follows:
3-1) truncating each text document to 500 words;
3-2) splitting the text document into sentences of 20 words each, preserving the original order of the text;
3-3) performing word segmentation on each clause by using a jieba word segmentation tool, and removing stop words in each clause according to a stop word list;
3-4) establishing a text word set.
4. The text classification method based on the combination of the title and the text of the graph network according to claim 1, wherein in step 4), the skip-gram model in Word2vec is used for training the text word set, with the dimension set to 300.
5. The method for classifying texts combining titles and texts based on graph networks according to claim 1, wherein in step 5) the text documents, and in step 10) the title documents, are divided into a training set, a verification set and a test set at a ratio of 8:1:1.
6. The method for text classification based on the combination of title and text of a graph network according to claim 1, wherein in step 7), the jieba word segmentation tool is used for word segmentation, and the word vector model is the skip-gram model in Word2vec.
7. The method for classifying texts combining titles and texts based on graph network as claimed in claim 1, wherein in step 8), the value of N is set according to the perplexity of LDA topic model.
8. The text classification method based on the combination of the title and the text of the graph network according to claim 1, wherein in step 9), the relationship between the three types of nodes is shown in formula (1):
[Formula (1) — node-relationship definition; provided only as an image in the original patent and not reproduced here]
9. The text classification method based on the combination of the title and the text of the graph network according to claim 1, wherein in step 12), each document feature representation is passed through the softmax function shown in formula (2) to output the document category:
Z = softmax(H^(L))    (2)
where Z is the document category and H^(L) is the document feature representation.
CN202011233244.6A 2020-11-06 2020-11-06 Text classification method based on title and text combination of graph network Expired - Fee Related CN112347255B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202011233244.6A CN112347255B (en) 2020-11-06 2020-11-06 Text classification method based on title and text combination of graph network

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202011233244.6A CN112347255B (en) 2020-11-06 2020-11-06 Text classification method based on title and text combination of graph network

Publications (2)

Publication Number Publication Date
CN112347255A CN112347255A (en) 2021-02-09
CN112347255B true CN112347255B (en) 2021-11-23

Family

ID=74428724

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202011233244.6A Expired - Fee Related CN112347255B (en) 2020-11-06 2020-11-06 Text classification method based on title and text combination of graph network

Country Status (1)

Country Link
CN (1) CN112347255B (en)

Families Citing this family (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN113239200B (en) * 2021-05-20 2022-07-12 东北农业大学 Content identification and classification method, device and system and storage medium
CN113378950A (en) * 2021-06-22 2021-09-10 深圳市查策网络信息技术有限公司 Unsupervised classification method for long texts
CN116701812B (en) * 2023-08-03 2023-11-28 中国测绘科学研究院 Geographic information webpage text topic classification method based on block units

Family Cites Families (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
JP6217468B2 (en) * 2014-03-10 2017-10-25 富士ゼロックス株式会社 Multilingual document classification program and information processing apparatus
CN109753567A (en) * 2019-01-31 2019-05-14 安徽大学 A kind of file classification method of combination title and text attention mechanism
CN110704626B (en) * 2019-09-30 2022-07-22 北京邮电大学 Short text classification method and device
CN111581967B (en) * 2020-05-06 2023-08-11 西安交通大学 News theme event detection method combining LW2V with triple network

Also Published As

Publication number Publication date
CN112347255A (en) 2021-02-09

Similar Documents

Publication Publication Date Title
CN112347255B (en) Text classification method based on title and text combination of graph network
CN107133213B (en) Method and system for automatically extracting text abstract based on algorithm
CN110413986B (en) Text clustering multi-document automatic summarization method and system for improving word vector model
CN110866117B (en) Short text classification method based on semantic enhancement and multi-level label embedding
CN108052593B (en) Topic keyword extraction method based on topic word vector and network structure
CN108280206B (en) Short text classification method based on semantic enhancement
CN107392147A (en) A kind of image sentence conversion method based on improved production confrontation network
CN101599071A (en) The extraction method of conversation text topic
CN106202256A (en) Propagate based on semanteme and mix the Web graph of multi-instance learning as search method
CN102207945A (en) Knowledge network-based text indexing system and method
CN111027595A (en) Double-stage semantic word vector generation method
CN112883171B (en) Document keyword extraction method and device based on BERT model
CN111274804A (en) Case information extraction method based on named entity recognition
CN112163089B (en) High-technology text classification method and system integrating named entity recognition
CN110956041A (en) Depth learning-based co-purchase recombination bulletin summarization method
CN108920586A (en) A kind of short text classification method based on depth nerve mapping support vector machines
CN110457711A (en) A kind of social media event topic recognition methods based on descriptor
CN117171333A (en) Electric power file question-answering type intelligent retrieval method and system
CN103853792A (en) Automatic image semantic annotation method and system
CN107832307B (en) Chinese word segmentation method based on undirected graph and single-layer neural network
CN101271448A (en) Chinese language fundamental noun phrase recognition, its regulation generating method and apparatus
CN104123336A (en) Deep Boltzmann machine model and short text subject classification system and method
CN116304064A (en) Text classification method based on extraction
CN113486143A (en) User portrait generation method based on multi-level text representation and model fusion
CN113434668B (en) Deep learning text classification method and system based on model fusion

Legal Events

Date Code Title Description
PB01 Publication
PB01 Publication
SE01 Entry into force of request for substantive examination
SE01 Entry into force of request for substantive examination
GR01 Patent grant
GR01 Patent grant
CF01 Termination of patent right due to non-payment of annual fee

Granted publication date: 20211123