CN112529071A - Text classification method, system, computer equipment and storage medium - Google Patents

Text classification method, system, computer equipment and storage medium

Info

Publication number
CN112529071A
CN112529071A (application number CN202011425848.0A)
Authority
CN
China
Prior art keywords
text
graph
classification
corpus
model
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Granted
Application number
CN202011425848.0A
Other languages
Chinese (zh)
Other versions
CN112529071B (en)
Inventor
刘勋
宗建华
夏国清
叶和忠
刘强
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
South China Institute Of Software Engineering Gu
Original Assignee
South China Institute Of Software Engineering Gu
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by South China Institute Of Software Engineering Gu filed Critical South China Institute Of Software Engineering Gu
Priority to CN202011425848.0A priority Critical patent/CN112529071B/en
Publication of CN112529071A publication Critical patent/CN112529071A/en
Application granted granted Critical
Publication of CN112529071B publication Critical patent/CN112529071B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00 Pattern recognition
    • G06F18/20 Analysing
    • G06F18/24 Classification techniques
    • G06F18/241 Classification techniques relating to the classification model, e.g. parametric or non-parametric approaches
    • G06F18/2413 Classification techniques relating to the classification model, e.g. parametric or non-parametric approaches based on distances to training or reference patterns
    • G06F18/24133 Distances to prototypes
    • G06F18/24137 Distances to cluster centroïds
    • G06F18/2414 Smoothing the distance, e.g. radial basis function networks [RBFN]
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/30 Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
    • G06F16/35 Clustering; Classification
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00 Handling natural language data
    • G06F40/20 Natural language analysis
    • G06F40/279 Recognition of textual entities
    • G06F40/284 Lexical analysis, e.g. tokenisation or collocates
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 Computing arrangements based on biological models
    • G06N3/02 Neural networks
    • G06N3/04 Architecture, e.g. interconnection topology
    • G06N3/045 Combinations of networks
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 Computing arrangements based on biological models
    • G06N3/02 Neural networks
    • G06N3/08 Learning methods
    • Y GENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
    • Y02 TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
    • Y02D CLIMATE CHANGE MITIGATION TECHNOLOGIES IN INFORMATION AND COMMUNICATION TECHNOLOGIES [ICT], I.E. INFORMATION AND COMMUNICATION TECHNOLOGIES AIMING AT THE REDUCTION OF THEIR OWN ENERGY USE
    • Y02D10/00 Energy efficient computing, e.g. low power processors, power management or thermal management

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Data Mining & Analysis (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Artificial Intelligence (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • General Health & Medical Sciences (AREA)
  • Computational Linguistics (AREA)
  • Health & Medical Sciences (AREA)
  • Evolutionary Computation (AREA)
  • Molecular Biology (AREA)
  • Biomedical Technology (AREA)
  • Biophysics (AREA)
  • Computing Systems (AREA)
  • Mathematical Physics (AREA)
  • Software Systems (AREA)
  • Evolutionary Biology (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Bioinformatics & Computational Biology (AREA)
  • Bioinformatics & Cheminformatics (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Databases & Information Systems (AREA)
  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)

Abstract

The invention provides a text classification method, system, computer equipment and storage medium. The method establishes a new high-low order graph convolutional neural network model comprising a high-low order graph convolution layer that simultaneously captures the multi-order neighborhood information of nodes, an information fusion layer that mixes the first-order to high-order features of different neighborhoods, a first-order graph convolution layer and a softmax classification output layer; a training-set text graph network is input to train a text classification model, and a test-set text graph network is then input into the classification model to obtain the classification result. When the embodiment of the invention is used for text classification, it preserves classification efficiency and classification effect while, by simultaneously capturing the multi-order neighborhood information of nodes, solving the problems of complex computation, large parameter quantity, over-smoothing and limited receptive field that arise when existing graph convolutions are applied to text classification, thereby further improving the expression capability of the text classification model, the stability of the model and the precision of the text classification task.

Description

Text classification method, system, computer equipment and storage medium
Technical Field
The invention relates to the technical field of text classification, in particular to a text classification method, a text classification system, computer equipment and a storage medium based on a high-low order graph convolution network.
Background
With the rapid development of Internet technology, social platforms, technical communication platforms, shopping platforms and the like have grown rapidly and continuously generate massive amounts of text data. Because this text contains data of very high value, it has become a favoured object of big-data mining research, and the role of text classification in information processing has become increasingly important. Researchers therefore hope to adopt effective text classification methods to efficiently manage, extract and analyse the useful information in text data, providing strong support for enterprise and social development.
At present, text classification technology has developed from early manual classification, which relied on the prior knowledge of linguistic experts, to deep machine learning. Deep learning models represented by the convolutional neural network (CNN) and the recurrent neural network (RNN) are widely applied to text classification tasks, but these models may ignore global word co-occurrence information in the corpus, and the discontinuous, long-distance semantic information carried by that co-occurrence has an important influence on document classification results. The existing graph convolutional neural network can process data of arbitrary structure and capture global word co-occurrence information, and it can effectively learn a text graph network with rich relationships while preserving the global structural information of the graph during embedding. However, it generally has only two layers: this shallow architecture limits the scale of the receptive field and the expression capability of the model, while a deeper network (more than two layers) drives the values of text nodes of different classes toward a fixed value and thus causes the over-smoothing problem. What is needed is a method that retains the advantages of the conventional graph convolution network for text classification while solving the over-smoothing problem and enlarging the receptive field of the classification model, so as to improve the expression capability of the model and the precision of the text classification task.
Disclosure of Invention
The invention aims to solve the problems of over-smoothing and limited model receptive field that arise when the existing graph convolution network is applied to text classification, and to further improve the expression capability of the text classification model and the precision of the text classification task.
In order to achieve the above objects, it is necessary to provide a text classification method, a system, a computer device and a storage medium for solving the above technical problems.
In a first aspect, an embodiment of the present invention provides a text classification method, where the method includes the following steps:
establishing a high-low order graph convolution neural network model; the high-low order graph convolution neural network model sequentially comprises an input layer, a high-low order graph convolution layer, an information fusion layer, a first order graph convolution layer and an output layer;
obtaining a corpus for text classification by the high-low order graph convolutional neural network model; the corpus comprises a plurality of samples, each sample containing a document and a title;
preprocessing the corpus set to obtain a training set and a test set;
respectively constructing a training set text graph network and a test set text graph network according to the training set and the test set;
inputting the training set text graph network into a high-low order graph convolutional neural network model, and training by combining a loss function to obtain a text classification model;
and inputting the test set text graph network into the text classification model for testing to obtain a classification result.
Further, if the output of the high-low order graph convolutional neural network model is Z, then:

$Z = \mathrm{softmax}\big(\hat{A}\,\mathrm{ReLU}\big(\mathrm{NMPooling}(\hat{A}XW_1,\ \hat{A}^2XW_1,\ \ldots,\ \hat{A}^kXW_1)\big)W_2\big)$

where X is the input matrix of the graph, $W_1$ and $W_2$ are respectively the parameter matrix from the input layer to the hidden layer and the parameter matrix from the hidden layer to the output layer, $\hat{A}$ is the regularized adjacency matrix of the graph with self-connections, k is the highest order of graph convolution, $\hat{A}^i X W_1$ ($i = 1, \ldots, k$) denotes the i-order graph convolution, ReLU(·) is the activation function, NMPooling(·) is the information fusion layer, and softmax(·) is the multi-class output function.
Further, the high-low order graph convolution layer includes first-order to k-order graph convolutions based on weight sharing; the order k of the high-low order graph convolution layer is any single order of two or above, or a combination of any plurality of such orders.
Further, the information fusion layer adopts minimum-value-negation information fusion pooling, implemented by the following steps:

calculating the minimum value matrix over the different graph convolutions according to the input matrix X, the parameter matrix $W_1$ and the regularized adjacency matrix $\hat{A}$;

and negating each element value of the minimum value matrix to obtain the pooled graph feature matrix.
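As an illustration only, a minimal NumPy sketch of this minimum-then-negate fusion might look as follows, assuming the k per-order convolution outputs are already available as arrays of identical shape (all names and the toy values are illustrative and not taken from the patent):

```python
import numpy as np

def nm_pooling(order_features):
    """Minimum-negation fusion: element-wise minimum over the k order
    feature matrices, then negation of every element."""
    h_min = np.minimum.reduce(order_features)  # element-wise min across orders
    return -h_min                              # negate to obtain the pooled features

# toy example with k = 2 orders of (nodes x hidden) features
h1 = np.array([[0.2, -0.5], [1.0, 0.3]])
h2 = np.array([[0.1,  0.4], [0.7, -0.2]])
pooled = nm_pooling([h1, h2])   # [[-0.1, 0.5], [-0.7, 0.2]]
```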
Further, the step of preprocessing the corpus to obtain a training set and a test set includes:
performing preprocessing on the titles and documents of the samples in the corpus, including de-duplication, word segmentation, and removal of stop words and special symbols, to obtain the words in the corpus, and forming the words and documents of the corpus into a corpus text group;
and dividing the corpus text group into a training set and a test set according to the quantity proportion.
Further, the step of respectively constructing a training set text graph network and a test set text graph network according to the training set and the test set comprises:
respectively establishing, according to the training set and the test set, a training set text graph and a test set text graph whose feature matrices are identity matrices of the corresponding dimensions;
and determining the adjacency matrixes of the training set text graph and the test set text graph according to the TF-IDF algorithm and the PMI algorithm.
Further, the step of determining the adjacency matrix of the training set text graph and the test set text graph according to the TF-IDF algorithm and the PMI algorithm comprises:
calculating the weights of the edges connecting document nodes and word nodes in the adjacency matrix of the training set text graph according to the TF-IDF algorithm, and calculating the weights of the edges connecting word nodes in the adjacency matrix of the training set text graph according to the PMI algorithm;
and calculating the weights of the edges connecting document nodes and word nodes in the adjacency matrix of the test set text graph according to the TF-IDF algorithm, and calculating the weights of the edges connecting word nodes of the test set text graph according to the PMI algorithm.
In a second aspect, an embodiment of the present invention provides a text classification system, where the system includes:
the classification model establishing module is used for establishing a high-low order graph convolution neural network model; the high-low order graph convolution neural network model sequentially comprises an input layer, a high-low order graph convolution layer, an information fusion layer, a first order graph convolution layer and an output layer;
the corpus classification module is used for acquiring a corpus set for text classification by adopting the high-low order graph convolutional neural network model; the corpus comprises a plurality of samples, each sample containing a document and a title;
the corpus preprocessing module is used for preprocessing the corpus to obtain a training set and a test set;
the text graph network building module is used for respectively building a training set text graph network and a test set text graph network according to the training set and the test set;
the text classification model training module is used for inputting the training set text graph network into a high-low order graph convolutional neural network model, and training the high-low order graph convolutional neural network model by combining a loss function to obtain a text classification model;
and the text classification test module is used for inputting the test set text graph network into the text classification model for testing to obtain a classification result.
In a third aspect, an embodiment of the present invention further provides a computer device, which includes a memory, a processor, and a computer program stored in the memory and executable on the processor, where the processor implements the steps of the method when executing the computer program.
In a fourth aspect, the present invention further provides a computer-readable storage medium, on which a computer program is stored, where the computer program is executed by a processor to implement the steps of the above method.
With the method, after the TF-IDF and PMI algorithms are used to construct a training-set text graph network and a test-set text graph network from the preprocessed corpus, the training-set text graph network is input into a high-low order graph convolutional neural network model consisting of an input layer, one high-low order graph convolution layer, one information fusion layer, one first-order graph convolution layer and a softmax output layer; training with the defined loss function determines the parameter matrices of the classification model, and the test-set text is then classified accurately. Compared with the prior art, in text classification applications the method guarantees classification efficiency and classification effect while, by building a two-layer network structure and using high-low order graph convolution to capture the multi-order neighborhood information of nodes, solving the problems of complex computation, large parameter quantity, over-smoothing and limited model receptive field of the existing graph convolution network, thereby further improving the expression capability of the text classification model, the stability of the model and the precision of the text classification task.
Drawings
FIG. 1 is a flow chart illustrating a text classification method according to an embodiment of the present invention;
FIG. 2 is a schematic diagram of the high-low order graph convolutional neural network model structure of FIG. 1;
FIG. 3 is a schematic flow chart of corpus preprocessing of step S13 in FIG. 1;
FIG. 4 is a schematic flow chart illustrating the step S14 in FIG. 1 of constructing a corresponding text graph network according to the training set and the test set;
FIG. 5 is a schematic diagram of the creation of a network of text graphs based on a portion of the data of OH using the method of FIG. 4;
FIG. 6 is a schematic flowchart of the step S142 in FIG. 4 for constructing the adjacency matrix of the text graph according to the TF-IDF algorithm and the PMI algorithm;
FIG. 7 is a schematic structural diagram of a text classification system in an embodiment of the invention;
fig. 8 is an internal structural diagram of a computer device in the embodiment of the present invention.
Detailed Description
In order to make the purpose, technical solution and advantages of the present invention more clearly apparent, the present invention is further described in detail below with reference to the accompanying drawings and embodiments, and it is obvious that the embodiments described below are part of the embodiments of the present invention, and are used for illustrating the present invention only, but not for limiting the scope of the present invention. All other embodiments, which can be derived by a person skilled in the art from the embodiments given herein without making any creative effort, shall fall within the protection scope of the present invention.
The text classification method provided by the invention can be applied to a terminal or a server. The adopted high-low order graph convolutional neural network model (NMGC) is an improvement on the existing graph convolutional network model; it can also complete other similar fully-supervised classification tasks, and a text corpus is preferably adopted for training and testing.
In one embodiment, as shown in fig. 1, there is provided a text classification method, including the steps of:
s11, establishing a high-low order graph convolutional neural network model; the high-low order graph convolutional neural network model comprises an input layer, a high-low order graph convolution layer, an information fusion layer, a first order graph convolution layer and an output layer;
wherein the high-low order graph convolutional neural network model contains exactly one high-low order graph convolution layer and one first-order graph convolution layer. If the output of the high-low order graph convolutional neural network model is Z, then:

$Z = \mathrm{softmax}\big(\hat{A}\,\mathrm{ReLU}\big(\mathrm{NMPooling}(\hat{A}XW_1,\ \hat{A}^2XW_1,\ \ldots,\ \hat{A}^kXW_1)\big)W_2\big)$    (1)

where X is the input matrix of the graph, $W_1$ and $W_2$ are respectively the parameter matrix from the input layer to the hidden layer and the parameter matrix from the hidden layer to the output layer, $\hat{A}$ is the regularized adjacency matrix of the graph with self-connections, k is the highest order of graph convolution, $\hat{A}^i X W_1$ ($i = 1, \ldots, k$) denotes the i-order graph convolution, ReLU(·) is the activation function, NMPooling(·) is the information fusion layer, and softmax(·) is the multi-class output function. The specific model structure is shown in FIG. 2.
The high-low order graph convolution layer in the present embodiment includes first-order to k-order graph convolutions based on weight sharing, i.e. $\hat{A}XW_1,\ \hat{A}^2XW_1,\ \ldots,\ \hat{A}^kXW_1$. The high-low order graph convolution captures the first-order neighborhood information of the text nodes through the first-order graph convolution $\hat{A}XW_1$, and captures the higher-order neighborhood information of the text nodes through the second- to k-order graph convolutions $\hat{A}^2XW_1,\ \ldots,\ \hat{A}^kXW_1$, which enlarges the receptive field of the model and further strengthens its learning ability. The order k of the high-low order graph convolution layer can be any single order of two or above, or a combination of any plurality of such orders. When k = 2, i.e. the adopted model is the NMGC-2 model with a mixed neighborhood of the 1st and 2nd orders, the formula is:

$Z = \mathrm{softmax}\big(\hat{A}\,\mathrm{ReLU}\big(\mathrm{NMPooling}(\hat{A}XW_1,\ \hat{A}^2XW_1)\big)W_2\big)$    (2)

When k = 3, i.e. the adopted model is the NMGC-3 model with a mixed neighborhood of the 1st, 2nd and 3rd orders, the formula is:

$Z = \mathrm{softmax}\big(\hat{A}\,\mathrm{ReLU}\big(\mathrm{NMPooling}(\hat{A}XW_1,\ \hat{A}^2XW_1,\ \hat{A}^3XW_1)\big)W_2\big)$    (3)

When k = n, the adopted model is the NMGC-n model with a mixed neighborhood from the 1st to the n-th order, and the formula is:

$Z = \mathrm{softmax}\big(\hat{A}\,\mathrm{ReLU}\big(\mathrm{NMPooling}(\hat{A}XW_1,\ \hat{A}^2XW_1,\ \ldots,\ \hat{A}^nXW_1)\big)W_2\big)$    (4)
In the model, each order neighborhood within the same graph convolution layer uses the same weight parameters, which realizes weight sharing and reduces the parameter quantity; this is specifically reflected in the choice of the parameters W1 and W2 in formulas (1) to (4).
When the method is actually applied to large-scale text graph network training, $\hat{A}^k X W_1$ needs to be calculated first. Since $\hat{A}$ is usually a sparse matrix with m non-zero elements, and the high-low order graph convolution adopts a weight sharing mechanism, $\hat{A}^k X W_1$ is calculated by multiplication from right to left. For example, when k = 2, $\hat{A}^2 X W_1$ is obtained as $\hat{A}(\hat{A} X W_1)$; in the same way, the k-order graph convolution is calculated by left-multiplying the (k-1)-order graph convolution by $\hat{A}$, i.e. $\hat{A}^k X W_1 = \hat{A}(\hat{A}^{k-1} X W_1)$. This calculation method effectively reduces the computational complexity. In addition, since the graph convolutions of different orders adopt a weight sharing mechanism, the parameters of the high-order graph convolutions are the same as those of the first-order graph convolution. Assuming $\hat{A} \in \mathbb{R}^{n \times n}$ (n nodes), $X \in \mathbb{R}^{n \times r_0}$ ($r_0$ attribute feature dimensions), $W_1 \in \mathbb{R}^{r_0 \times r_1}$ ($r_1$ filters) and $W_2 \in \mathbb{R}^{r_1 \times r_2}$ ($r_2$ filters), the time complexity and the parameter quantity of the high-low order graph convolution model are $O(k \times m \times r_0 \times r_1)$ and $O(r_0 \times r_1)$ respectively, which guarantees the computational efficiency of the high-order graph convolution to a certain extent.
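As an illustration of the right-to-left computation described above, the following SciPy sketch computes the k order terms without ever forming $\hat{A}^i$ explicitly; the function name and the toy dimensions are assumptions, and A_hat stands in for a pre-computed regularized adjacency matrix:

```python
import numpy as np
import scipy.sparse as sp

def high_low_order_conv(A_hat, X, W1, k):
    """Compute [A_hat X W1, A_hat^2 X W1, ..., A_hat^k X W1] by repeated
    left-multiplication with the sparse A_hat, so A_hat^i is never formed."""
    H = A_hat @ (X @ W1)          # first-order term: A_hat X W1
    outputs = [H]
    for _ in range(k - 1):        # each step raises the order by one
        H = A_hat @ H             # A_hat^i X W1 = A_hat (A_hat^(i-1) X W1)
        outputs.append(H)
    return outputs

# usage on a toy graph with 4 nodes, 3 input features, 2 filters, k = 3
A_hat = sp.identity(4, format="csr")   # stands in for the real regularized adjacency
X = np.random.rand(4, 3)
W1 = np.random.rand(3, 2)
H1, H2, H3 = high_low_order_conv(A_hat, X, W1, k=3)
```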
S12, acquiring a corpus of text classification by adopting the high-low order graph convolutional neural network model; the corpus comprises a plurality of samples, each sample containing a document and a title;
the text classification corpus can be selected according to actual needs, in the application, supervised text data sets of R52 and R8, 20-Newsgroups (20NG), Ohsated (OH) and Movie Review (MR) of Reuters 21578 are adopted, and specific information of the data sets is as follows: the 20NG data set includes 18846 newsgroup documents, without duplicate documents, divided into 20 different classes, of all the newsgroup documents, 11314 documents were used for training, and the remaining 7532 documents were used as test sets; the OH dataset is a medical dataset from the MEDLINE database, 7400 medical documents were selected, of which 3357 documents were used for training, while the remaining 4043 documents were taken as a test set, dividing into 23 different classes. R52 and R8 are two subsets of Reuters 21578, and are divided into 52, 8 different classes, respectively, where the number of training and test documents of the R52 dataset is 6532 and 2568, respectively, and the number of training and test documents of the R8 dataset is 5485 and 2189, respectively. MR is a movie review data set with 10662 review documents, half as many positive and negative review documents, each containing only one sentence, using 7108 review documents as training and 3554 review documents as testing. 10% of the data in the training set of data above will be used as validation, and the specific information of the supervised text data set is shown in Table 1.
TABLE 1 supervision text data set
Data set Number of documents Number of words Training Testing Categories Number of nodes Average length
R52 9,100 8,892 6,532 2,568 52 17,992 69.82
R8 7,674 7,688 5,485 2,189 8 15,362 65.72
20NG 18,846 42,757 11,314 7,532 20 61,603 221.26
OH 7,400 14,157 3,357 4,043 23 21,557 135.82
MR 10,662 18,764 7,108 3,554 2 29,426 20.39
S13, preprocessing the corpus to obtain a training set and a test set;
wherein, the step S13 of preprocessing the corpus to obtain a training set and a test set includes, as shown in fig. 3:
s131, preprocessing the titles and documents of the samples in the corpus, including de-duplication, word segmentation, and removal of stop words and special symbols, to obtain the words in the corpus, and forming the words and documents of the corpus into a corpus text group;
A corpus collected from the web usually comprises only documents and titles, while the processing of actual text data relies more on the words inside those documents and titles. Therefore, certain preprocessing is carried out before model training: with tools such as python nltk, word segmentation logic can be customized to the user's requirements, word segmentation is performed on the titles and body text of all samples in the corpus, and stop words and special symbols are removed. The words of the corpus are thereby obtained, and the resulting corpus words and documents form a corpus text group for subsequent analysis.
And S132, dividing the corpus text group into a training set and a test set according to the quantity proportion.
When the corpus text group is used to train the graph convolution model, the collected data are divided into a training set and a test set in a certain quantity proportion as required; the training set is used to train and optimize the parameters of the model to determine the final model, and the test set is classified directly with the determined model.
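A possible preprocessing sketch along these lines, using the python nltk tooling mentioned above, is given below; the tokenizer, the English stop-word list and the duplicate-detection rule are assumptions rather than choices prescribed by the patent (nltk's 'punkt' and 'stopwords' resources must be downloaded beforehand):

```python
import re
from nltk.corpus import stopwords
from nltk.tokenize import word_tokenize

STOP_WORDS = set(stopwords.words("english"))

def preprocess(text):
    """Tokenise, lower-case, and drop stop words and special symbols."""
    tokens = word_tokenize(text.lower())
    # keep only purely alphabetic tokens that are not stop words
    return [t for t in tokens if t not in STOP_WORDS and re.fullmatch(r"[a-z]+", t)]

def build_corpus_text_group(samples):
    """samples: list of (title, document) pairs; returns de-duplicated
    documents together with the vocabulary extracted from them."""
    docs, vocab, seen = [], set(), set()
    for title, document in samples:
        words = preprocess(title + " " + document)
        key = " ".join(words)
        if key in seen:           # drop duplicate samples
            continue
        seen.add(key)
        docs.append(words)
        vocab.update(words)
    return docs, sorted(vocab)
```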
S14, respectively constructing a training set text graph network and a test set text graph network according to the training set and the test set;
s15, inputting the training set text graph network into a high-low order graph convolutional neural network model, and training by combining a loss function to obtain a text classification model;
wherein the loss function used in the model training is the cross-entropy over the labeled nodes:

$L = -\sum_{l \in \mathcal{Y}_L} \sum_{m=1}^{M} Y_{lm} \ln Z_{lm}$

where $\mathcal{Y}_L$ is the set of labeled vertices (nodes), M is the number of classes, $Y_{lm}$ denotes the real label of labeled node l, and $Z_{lm}$ denotes the probability value between 0 and 1 predicted by softmax for the input labeled node.
When the feature matrix and the regularized adjacency matrix of the training set text graph are input into the high-low order graph convolutional neural network model as input matrices for training and learning, the parameters of the graph convolutions are updated by gradient descent to preliminarily determine the text classification model; the 10% validation set reserved in the training set is then fed into the model and, in combination with the defined loss function, the parameters are adjusted to finally obtain a stable text classification model, after which the test set is used to obtain the classification result, which ensures the classification precision of the classification model.
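A hedged sketch of such a training loop is shown below in PyTorch, assuming an NMGC module of the kind sketched later in this description that outputs log-probabilities; the optimizer, learning rate and epoch count are illustrative and not prescribed by the patent:

```python
import copy
import torch
import torch.nn.functional as F

def train(model, A_hat, X, labels, train_mask, val_mask, epochs=200, lr=0.02):
    """Full-batch training with cross-entropy over the labelled document nodes;
    the parameters that perform best on the validation split are kept."""
    optimizer = torch.optim.Adam(model.parameters(), lr=lr)
    best_val, best_state = 0.0, None
    for _ in range(epochs):
        model.train()
        optimizer.zero_grad()
        log_probs = model(A_hat, X)                         # log-softmax over classes
        loss = F.nll_loss(log_probs[train_mask], labels[train_mask])
        loss.backward()
        optimizer.step()

        model.eval()
        with torch.no_grad():
            pred = model(A_hat, X).argmax(dim=1)
            val_acc = (pred[val_mask] == labels[val_mask]).float().mean().item()
        if val_acc > best_val:                              # keep the best validation model
            best_val = val_acc
            best_state = copy.deepcopy(model.state_dict())
    if best_state is not None:
        model.load_state_dict(best_state)
    return model
```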
And S16, inputting the test set text graph network into the text classification model for testing to obtain a classification result.
In this embodiment of the application, out of consideration for the generalization ability of the model, important benchmark data sets for text classification are adopted for parameter training of the classification model; since these data sets contain no duplicate data, the workload of model training is reduced to a certain extent and training efficiency is improved. Secondly, a high-low order graph convolution network model with only two graph convolution layers is established, which reduces the training parameters while alleviating the over-smoothing phenomenon of the trained model, and thereby improves the generality of the classification model obtained by training.
In one embodiment, the information fusion layer in formula (1) of the invention adopts minimum-value-negation information fusion pooling, which is calculated as follows:

according to the input matrix X, the parameter matrix $W_1$ and the regularized adjacency matrix $\hat{A}$, the minimum value matrix over the different graph convolutions is calculated element-wise:

$H_{\min} = \min\big(\hat{A}XW_1,\ \hat{A}^2XW_1,\ \ldots,\ \hat{A}^kXW_1\big)$

each element value of the minimum value matrix $H_{\min}$ is then negated to obtain the pooled graph feature matrix, i.e. $H_{nm} = -H_{\min}$.
The information fusion in the above embodiment is illustrated with a specific third-order example; the higher-order case is similar. Suppose the order k of the neighborhood is 3, the first-order neighborhood is $H_1$, the second-order neighborhood is $H_2$ and the third-order neighborhood is $H_3$. The information fusion process is then:

(1) compute the element-wise minimum: $H_{\min}(i,j) = \min\big(H_1(i,j),\ H_2(i,j),\ H_3(i,j)\big)$;

(2) negate each element value of $H_{\min}$, giving $H_{nm} = -H_{\min}$.
The implementation process of the NMPooling-based high-low order graph convolution algorithm in this embodiment is as follows:

Input: the regularized adjacency matrix $\hat{A}$, the feature matrix $H^{(0)} = X$ and the weight matrix W

Convolution operation: $H_i = \hat{A}^i X W$ for $i = 1, \ldots, k$

Information fusion: $H_{nm} = \mathrm{NMPooling}(H_1, H_2, \ldots, H_k)$

Nonlinear activation: $H = \mathrm{ReLU}(H_{nm})$
In this embodiment, the text graph network is first fed into the high-low order graph convolution and processed by the above algorithm; NMPooling information fusion then mixes the first-order to high-order features of the different neighborhoods, and after nonlinear activation the result is fed into the classical first-order graph convolution to further learn the representation of the text graph, finally yielding the classification probabilities. This way of learning the global graph topology retains more and richer feature information during learning and thereby further improves the learning effect.
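Putting the pieces together, a minimal PyTorch sketch of the whole forward pass, read from equations (1)-(4) under the assumption of dense matrices and a log-softmax output, might look like this (class and variable names are illustrative):

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class NMGC(nn.Module):
    """High-low order graph convolution: weight-shared orders 1..k,
    min-negation fusion, ReLU, then a first-order graph convolution."""
    def __init__(self, in_dim, hidden_dim, num_classes, k=2):
        super().__init__()
        self.k = k
        self.W1 = nn.Parameter(torch.empty(in_dim, hidden_dim))
        self.W2 = nn.Parameter(torch.empty(hidden_dim, num_classes))
        nn.init.xavier_uniform_(self.W1)
        nn.init.xavier_uniform_(self.W2)

    def forward(self, A_hat, X):
        H = A_hat @ (X @ self.W1)                 # first-order term, shared weights W1
        orders = [H]
        for _ in range(self.k - 1):
            H = A_hat @ H                         # next order by left-multiplying A_hat
            orders.append(H)
        h_min = torch.stack(orders, dim=0).min(dim=0).values
        h_nm = -h_min                             # NMPooling: element-wise min, then negate
        h = F.relu(h_nm)
        logits = A_hat @ (h @ self.W2)            # first-order graph convolution output layer
        return F.log_softmax(logits, dim=1)       # log of the softmax in eq. (1)
```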
In one embodiment, as shown in fig. 4, the step S14 of constructing a training set text graph network and a test set text graph network respectively according to the training set and the test set includes:
s141, respectively establishing a training set text chart and a test set text chart of which feature matrixes are corresponding dimension unit matrixes according to the training set and the test set;
In text classification training, converting the text corpus into the corresponding text graphs is a necessary step for machine training. In this embodiment, the training set text graph and the test set text graph are both necessary inputs of the high-low order graph convolutional neural network model, so corresponding text graphs need to be established from the text data of the training set and of the test set respectively. For example, the training set text graph is G = (V, E), where V is the vertex set composed of all words and all documents of the training set text (that is, the number of nodes in the text graph network is the number of documents plus the number of words, i.e. the sum of the corpus size and the vocabulary size), and E is the edge set comprising all dependencies between pairs of words in the training set and between words and documents. The text graph of the test set is obtained in the same way.
As shown in FIG. 5, a text graph network is established for a part of the Ohsumed corpus. The nodes beginning with "O" are document nodes and the other nodes are word nodes; the gray lines represent edges between words, and the black lines represent edges between documents and words. Document nodes of the same color belong to the same class, and document nodes of different colors belong to different classes. In this embodiment, the feature matrices corresponding to the training set text graph and the test set text graph are set to identity matrices of the corresponding dimensions, i.e. a one-hot code is used as the model input for every word and document.
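As a small illustration of this construction, and assuming the document and vocabulary lists produced by a preprocessing step like the one sketched earlier, the node index and the one-hot (identity) feature matrix could be assembled as follows (names are illustrative):

```python
import scipy.sparse as sp

def build_nodes_and_features(docs, vocab):
    """Node set = all documents followed by all words; features are one-hot,
    i.e. the identity matrix of dimension (num_docs + num_words)."""
    doc_ids = {f"doc_{i}": i for i in range(len(docs))}
    word_ids = {w: len(docs) + j for j, w in enumerate(vocab)}
    n_nodes = len(docs) + len(vocab)
    X = sp.identity(n_nodes, format="csr")   # one-hot code for every node
    return doc_ids, word_ids, X
```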
And S142, determining the adjacency matrixes of the training set text graph and the test set text graph according to the TF-IDF algorithm and the PMI algorithm.
The adjacency matrix of the text graph comprises the weights of word-document edges, word-word edges and document-document edges. In this embodiment, the edges between words and documents are established according to the number of times a word appears in a document, and the edges between words are established according to word co-occurrence.
As shown in fig. 6, the step S142 of determining the adjacency matrices of the training set text graph and the test set text graph according to the TF-IDF algorithm and the PMI algorithm includes:
s1421, calculating the weights of the edges connecting document nodes and word nodes in the adjacency matrix of the training set text graph according to the TF-IDF algorithm, and calculating the weights of the edges connecting word nodes in the adjacency matrix of the training set text graph according to the PMI algorithm;
s1422, calculating the weights of the edges connecting document nodes and word nodes in the adjacency matrix of the test set text graph according to the TF-IDF algorithm, and calculating the weights of the edges connecting word nodes of the test set text graph according to the PMI algorithm.
The weights of the edges connecting document nodes and word nodes are calculated according to term frequency-inverse document frequency (TF-IDF). The term frequency (TF) represents the number of times a given word appears in a document: the larger its value, the greater the contribution of the given word to the document, while a small value indicates a low, or even negligible, contribution. The term frequency is expressed as:

$tf_{j,k} = n_{j,k} / \sum_i n_{i,k}$

where $n_{j,k}$ is the number of times word j appears in document k and $\sum_i n_{i,k}$ is the total number of occurrences of all words in document k. The inverse document frequency (IDF) reflects the ability of a given word to discriminate between documents: the larger the inverse document frequency, the fewer documents contain the given word and the stronger its discriminating ability. The inverse document frequency is calculated as:

$idf_j = \log\big(D / |\{k : t_j \in d_k\}|\big)$

where D is the number of all documents in the corpus and $|\{k : t_j \in d_k\}|$ is the number of documents containing the given word $t_j$. TF-IDF accounts for the influence of a given word on a particular document through TF and also describes the importance of the given word over the whole document collection through IDF; it is defined as the product of term frequency and inverse document frequency:

$\text{TF-IDF} = TF \times IDF$
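A hedged sketch of the TF-IDF weight between a document node and a word node, following the formulas above, is given below; it applies no smoothing, which is an assumption rather than something the patent specifies:

```python
import math
from collections import Counter

def tfidf_weights(docs):
    """docs: list of token lists. Returns {(doc_index, word): TF-IDF weight}."""
    D = len(docs)
    df = Counter()                              # number of documents containing each word
    for tokens in docs:
        df.update(set(tokens))
    weights = {}
    for k, tokens in enumerate(docs):
        counts = Counter(tokens)
        total = len(tokens)
        for word, n_jk in counts.items():
            tf = n_jk / total                   # term frequency in document k
            idf = math.log(D / df[word])        # inverse document frequency
            weights[(k, word)] = tf * idf
    return weights
```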
to exploit global word co-occurrence information, a sliding window of fixed size is set for all documents in the corpus to aggregate co-occurrence features. The present embodiment utilizes the PMI algorithm to measure the correlation between words, calculates the weight between two word nodes, and the PMI value of the word j and the word k is defined as:
PMI(j,k)=log p(j,k)/p(j)p(k),
where p (j, k) is W (j, k)/W, p (j) is W (j)/W, W (j, k) represents the number of sliding windows including word j and word k, W represents the total number of sliding windows, and W (j) represents the number of sliding windows including word j. The larger the value of PMI, the stronger the semantic relevance of word j to word k, the smaller the value of PMI, the weaker the semantic relevance of word j to word k, when the PMI value is negative, the very weak or no relevance between word j and word k. Establishing edges between words only takes into account the fact that the PMI value is positive.
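A corresponding sketch of the PMI weights between word nodes over fixed-size sliding windows follows; the window size is an assumed hyper-parameter, and as stated above only positive PMI values are kept as edges:

```python
import math
from collections import Counter
from itertools import combinations

def pmi_weights(docs, window_size=20):
    """docs: list of token lists. Returns {(word_j, word_k): PMI} for positive PMI only."""
    single = Counter()     # W(j): number of windows containing word j
    pair = Counter()       # W(j, k): number of windows containing both j and k
    total_windows = 0
    for tokens in docs:
        n = max(len(tokens) - window_size + 1, 1)
        for start in range(n):
            window = set(tokens[start:start + window_size])
            total_windows += 1
            single.update(window)
            pair.update(frozenset(p) for p in combinations(sorted(window), 2))
    weights = {}
    for key, w_jk in pair.items():
        j, k = tuple(key)
        pmi = math.log((w_jk / total_windows) /
                       ((single[j] / total_windows) * (single[k] / total_windows)))
        if pmi > 0:                              # keep only positive correlations as edges
            weights[(j, k)] = pmi
    return weights
```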
To sum up, in this embodiment, the element values in the adjacency matrix corresponding to the text graph, that is, the weights for constructing the network edge of the text graph, are defined as follows:
$A_{ij} = \begin{cases} \mathrm{PMI}(i,j), & i,\ j \text{ are words and } \mathrm{PMI}(i,j) > 0 \\ \text{TF-IDF}_{ij}, & i \text{ is a document and } j \text{ is a word} \\ 1, & i = j \\ 0, & \text{otherwise} \end{cases}$
After the training set text graph network is constructed, the feature matrix and the adjacency matrix of the graph are passed into the high-low order graph convolutional neural network model for training.
In this embodiment, after the training set and test set text corpora are converted into the corresponding text graphs, determining the adjacency matrices of the text graphs with the TF-IDF and PMI algorithms captures global word co-occurrence information while also accounting for discriminative power over documents, thereby providing an accurate weighting of the text graph information; this further improves the model training effect and the precision of model classification.
In this embodiment of the application, classification tests are performed on the supervised text data sets R52, R8, 20NG, OH and MR. It is found that in the text classification task the high-low order graph convolutional neural network models with k = 2 and k = 3 already perform very well in terms of classification accuracy and computational complexity, whereas a k value of 4 or higher reduces the text classification accuracy. Therefore only the comparisons of classification effect, model parameters and computational complexity between the NMGC-2 and NMGC-3 models (i.e. only the cases k = 2 and k = 3) and other existing classification models are given, as shown in Tables 2-4 below:
table 2 NMGC-2 and NMGC-3 comparison of tests based on the same text data set with the existing model
(The data of Table 2 are reproduced as an image in the original publication.)
Table 2 illustrates: the accuracy in the table is expressed as a percentage and the number is the average of 10 runs.
TABLE 3 NMGC-2 and NMGC-3 model classification result comparison table for different hidden neuron numbers
(The data of Table 3 are reproduced as an image in the original publication.)
TABLE 4 comparison of computational complexity and parameter values for NMGC-2, NMGC-3 and GCN models
(The data of Table 4 are reproduced as an image in the original publication.)
Table 4 illustrates: 1. 2 and 3 represent the graph convolution order, and 200, 64, 128 and 32 represent the number of hidden neurons.
Based on the above experimental results, this embodiment provides a high-low order graph convolutional neural network model (NMGC) comprising a high-low order graph convolution that simultaneously captures the correlations between low-order and high-order neighborhood text nodes and an NMPooling information fusion layer that mixes the first-order to high-order features of different neighborhoods. In text classification it retains more and richer feature information and learns the global graph topology, which not only broadens the receptive field but also improves the expression capability of the model. In addition, weight sharing across the different-order convolutions and a small number of hidden neurons reduce the computational complexity and the parameter quantity and avoid overfitting of the model. The experimental results on the five benchmark text data sets show that text classification with the high-low order graph convolutional neural network model, which adopts a classical first-order graph convolution as its output layer, has clear advantages in classification precision, classification performance and parameter quantity, and that the method is the most stable while achieving the highest precision.
It should be noted that, although the steps in the above flowcharts are shown in the sequence indicated by the arrows, they are not necessarily executed in that sequence; unless explicitly stated otherwise, the steps are not bound to the exact order shown and described and may be performed in other orders. Moreover, at least some of the steps in the above flowcharts may comprise multiple sub-steps or stages, which are not necessarily performed at the same moment but may be executed at different times, and their order of execution is not necessarily sequential; they may be performed in turn or in alternation with other steps or with at least part of the sub-steps or stages of other steps.
In one embodiment, as shown in fig. 7, there is provided a text classification system, the system comprising:
a classification model establishing module 71, configured to establish a high-low level graph convolution neural network model; the high-low order graph convolution neural network model sequentially comprises an input layer, a high-low order graph convolution layer, an information fusion layer, a first order graph convolution layer and an output layer;
a corpus classifying module 72, configured to obtain a corpus set for text classification using the high-low order graph convolutional neural network model; the corpus comprises a plurality of samples, each sample containing a document and a title;
a corpus preprocessing module 73, configured to preprocess the corpus to obtain a training set and a test set;
a text graph network building module 74, configured to build a training set text graph network and a test set text graph network according to the training set and the test set, respectively;
the text classification model training module 75 is configured to input the training set text graph network into a high-low order graph convolutional neural network model, and perform training in combination with a loss function to obtain a text classification model;
and a text classification test module 76, configured to input the test set text graph network into the text classification model for testing, so as to obtain a classification result.
For the specific definition of the text classification system, reference may be made to the above definition of the text classification method, which is not described herein again. The various modules in the text classification system described above may be implemented in whole or in part by software, hardware, and combinations thereof. The modules can be embedded in a hardware form or independent from a processor in the computer device, and can also be stored in a memory in the computer device in a software form, so that the processor can call and execute operations corresponding to the modules.
Fig. 8 shows an internal structure diagram of a computer device in one embodiment; the computer device may specifically be a terminal or a server. As shown in fig. 8, the computer device includes a processor, a memory, a network interface, a display and an input device, which are connected through a system bus. The processor of the computer device is configured to provide computing and control capabilities. The memory of the computer device comprises a non-volatile storage medium and an internal memory. The non-volatile storage medium stores an operating system and a computer program. The internal memory provides an environment for the operation of the operating system and the computer program in the non-volatile storage medium. The network interface of the computer device is used for communicating with an external terminal through a network connection. The computer program is executed by the processor to implement a text classification method. The display screen of the computer device can be a liquid crystal display or an electronic ink display, and the input device of the computer device can be a touch layer covering the display screen, a key, a track ball or a touch pad arranged on the housing of the computer device, or an external keyboard, touch pad or mouse.
It will be appreciated by those of ordinary skill in the art that the architecture shown in FIG. 8 is merely a block diagram of some of the structures associated with the present solution and is not intended to limit the computing devices to which the present solution may be applied; a particular computing device may include more or fewer components than those shown in the drawing, or combine certain components, or have a different arrangement of components.
In one embodiment, a computer device is provided, comprising a memory, a processor and a computer program stored on the memory and executable on the processor, the steps of the above method being performed when the computer program is executed by the processor.
In an embodiment, a computer-readable storage medium is provided, on which a computer program is stored, which computer program, when being executed by a processor, carries out the steps of the above-mentioned method.
To sum up, the embodiments of the present invention provide a text classification method, system, computer device and storage medium. On the basis of fully considering the problems of existing approaches to text classification, such as easily ignored global word co-occurrence information, a narrow receptive field, an over-smoothed model and insufficient expression capability, the text classification method based on the high-low order graph convolutional network provides a way of classifying text with a new high-low order graph convolutional neural network model comprising a high-low order graph convolution layer that captures the multi-order neighborhood information of nodes, an NMPooling information fusion layer that mixes the first-order to high-order features of different neighborhoods, a first-order graph convolution layer and a softmax classification output layer. When applied to actual text classification, the method captures both low-order and high-order neighborhood information of the text nodes through the high-low order graph convolution layer, obtaining more and richer text node information so as to broaden the receptive field and improve the expression capability of the model; it also reduces the computational complexity and parameter quantity of the model by sharing weights across the different-order convolutions and setting a small number of hidden neurons, thereby avoiding overfitting and improving the stability of the model and the precision of text classification.
It will be understood by those skilled in the art that all or part of the processes of the methods of the embodiments described above may be implemented by a computer program instructing the relevant hardware; the computer program may be stored in a non-volatile computer-readable storage medium and, when executed, may include the processes of the embodiments of the methods described above.
The embodiments in this specification are described in a progressive manner, and all the same or similar parts of the embodiments are directly referred to each other, and each embodiment is described with emphasis on differences from other embodiments. In particular, for embodiments of the system, the computer device, and the storage medium, since they are substantially similar to the method embodiments, the description is relatively simple, and in relation to the description, reference may be made to some portions of the description of the method embodiments. It should be noted that, the technical features of the embodiments may be arbitrarily combined, and for the sake of brevity, all possible combinations of the technical features in the embodiments are not described, but should be considered as the scope of the present specification as long as there is no contradiction between the combinations of the technical features.
The above-mentioned embodiments only express some preferred embodiments of the present application, and the description thereof is more specific and detailed, but not construed as limiting the scope of the invention. It should be noted that, for those skilled in the art, various modifications and substitutions can be made without departing from the technical principle of the present invention, and these should be construed as the protection scope of the present application. Therefore, the protection scope of the present patent shall be subject to the protection scope of the claims.

Claims (10)

1. A method of text classification, the method comprising the steps of:
establishing a high-low order graph convolution neural network model; the high-low order graph convolution neural network model sequentially comprises an input layer, a high-low order graph convolution layer, an information fusion layer, a first order graph convolution layer and an output layer;
obtaining a corpus for text classification by the high-low order graph convolutional neural network model; the corpus comprises a plurality of samples, each sample containing a document and a title;
preprocessing the corpus set to obtain a training set and a test set;
respectively constructing a training set text graph network and a test set text graph network according to the training set and the test set;
inputting the training set text graph network into a high-low order graph convolutional neural network model, and training by combining a loss function to obtain a text classification model;
and inputting the test set text graph network into the text classification model for testing to obtain a classification result.
2. The text classification method of claim 1, wherein, if the output of the high-low order graph convolutional neural network model is Z, then:

$Z = \mathrm{softmax}\big(\hat{A}\,\mathrm{ReLU}\big(\mathrm{NMPooling}(\hat{A}XW_1,\ \hat{A}^2XW_1,\ \ldots,\ \hat{A}^kXW_1)\big)W_2\big)$

where X is the input matrix of the graph, $W_1$ and $W_2$ are respectively the parameter matrix from the input layer to the hidden layer and the parameter matrix from the hidden layer to the output layer, $\hat{A}$ is the regularized adjacency matrix of the graph with self-connections, k is the highest order of graph convolution, $\hat{A}^i X W_1$ ($i = 1, \ldots, k$) denotes the i-order graph convolution, ReLU(·) is a nonlinear activation function, NMPooling(·) is an information fusion layer, and softmax(·) is a multi-class output function.
3. The text classification method of claim 2, wherein the high-low order graph convolution layer includes first-order to k-order graph convolutions based on weight sharing; the order k of the high-low order graph convolution layer is any single order of two or above, or a combination of any plurality of such orders.
4. The text classification method according to claim 2, characterized in that the information fusion layer employs minimum-value-negation information fusion pooling, which is implemented by the steps of:

calculating the minimum value matrix over the different graph convolutions according to the input matrix X, the parameter matrix $W_1$ and the regularized adjacency matrix $\hat{A}$;

and negating each element value of the minimum value matrix to obtain the pooled graph feature matrix.
5. The method for classifying texts according to claim 1, wherein the step of preprocessing the corpus to obtain a training set and a test set comprises:
performing preprocessing on the titles and documents of the samples in the corpus, including de-duplication, word segmentation, and removal of stop words and special symbols, to obtain the words in the corpus, and forming the words and documents of the corpus into a corpus text group;
and dividing the corpus text group into a training set and a test set according to the quantity proportion.
6. The text classification method according to claim 1, wherein the step of constructing a training set text graph network and a test set text graph network from the training set and the test set, respectively, comprises:
respectively establishing, according to the training set and the test set, a training set text graph and a test set text graph whose feature matrices are identity matrices of the corresponding dimensions;
and determining the adjacency matrixes of the training set text graph and the test set text graph according to the TF-IDF algorithm and the PMI algorithm.
7. The text classification method of claim 6, wherein the step of determining the adjacency matrices of the training set text graph and the test set text graph according to the TF-IDF algorithm and the PMI algorithm comprises:
calculating the weights of the edges connecting document nodes and word nodes in the adjacency matrix of the training set text graph according to the TF-IDF algorithm, and calculating the weights of the edges connecting word nodes in the adjacency matrix of the training set text graph according to the PMI algorithm;
and calculating the weights of the edges connecting document nodes and word nodes in the adjacency matrix of the test set text graph according to the TF-IDF algorithm, and calculating the weights of the edges connecting word nodes of the test set text graph according to the PMI algorithm.
8. A text classification system, the system comprising:
the classification model establishing module is used for establishing a high-low order graph convolution neural network model; the high-low order graph convolution neural network model sequentially comprises an input layer, a high-low order graph convolution layer, an information fusion layer, a first order graph convolution layer and an output layer;
the corpus classification module is used for acquiring a corpus set for text classification by adopting the high-low order graph convolutional neural network model; the corpus comprises a plurality of samples, each sample containing a document and a title;
the corpus preprocessing module is used for preprocessing the corpus to obtain a training set and a test set;
the text graph network building module is used for respectively building a training set text graph network and a test set text graph network according to the training set and the test set;
the text classification model training module is used for inputting the training set text graph network into a high-low order graph convolutional neural network model, and training the high-low order graph convolutional neural network model by combining a loss function to obtain a text classification model;
and the text classification test module is used for inputting the test set text graph network into the text classification model for testing to obtain a classification result.
9. A computer device comprising a memory, a processor and a computer program stored on the memory and executable on the processor, characterized in that the steps of the method of any of claims 1 to 7 are implemented when the computer program is executed by the processor.
10. A computer-readable storage medium, on which a computer program is stored, which, when being executed by a processor, carries out the steps of the method of any one of claims 1 to 7.
CN202011425848.0A 2020-12-08 2020-12-08 Text classification method, system, computer equipment and storage medium Active CN112529071B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202011425848.0A CN112529071B (en) 2020-12-08 2020-12-08 Text classification method, system, computer equipment and storage medium

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202011425848.0A CN112529071B (en) 2020-12-08 2020-12-08 Text classification method, system, computer equipment and storage medium

Publications (2)

Publication Number Publication Date
CN112529071A true CN112529071A (en) 2021-03-19
CN112529071B CN112529071B (en) 2023-10-17

Family

ID=74996781

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202011425848.0A Active CN112529071B (en) 2020-12-08 2020-12-08 Text classification method, system, computer equipment and storage medium

Country Status (1)

Country Link
CN (1) CN112529071B (en)

Patent Citations (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20200151289A1 (en) * 2018-11-09 2020-05-14 Nvidia Corp. Deep learning based identification of difficult to test nodes
CN110929029A (en) * 2019-11-04 2020-03-27 中国科学院信息工程研究所 Text classification method and system based on graph convolution neural network
CN111159425A (en) * 2019-12-30 2020-05-15 浙江大学 Temporal knowledge graph representation method based on historical relationship and double-graph convolution network

Non-Patent Citations (2)

* Cited by examiner, † Cited by third party
Title
MICHAEL EDWARDS et al.: "Graph convolutional neural network for multi-scale feature learning", Elsevier Science, pages 1-12 *
ZHOU Ajian: "Research on visual tracking based on deep structural feature representation learning", Wanfang Data Knowledge Service Platform thesis database, pages 1-65 *

Cited By (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN113792144A (en) * 2021-09-16 2021-12-14 南京理工大学 Text classification method based on semi-supervised graph convolution neural network
CN113792144B (en) * 2021-09-16 2024-03-12 南京理工大学 Text classification method of graph convolution neural network based on semi-supervision
CN113961708A (en) * 2021-11-10 2022-01-21 北京邮电大学 Power equipment fault tracing method based on multilevel graph convolutional network
CN113961708B (en) * 2021-11-10 2024-04-23 北京邮电大学 Power equipment fault tracing method based on multi-level graph convolutional network
CN114021574A (en) * 2022-01-05 2022-02-08 杭州实在智能科技有限公司 Intelligent analysis and structuring method and system for policy file

Also Published As

Publication number Publication date
CN112529071B (en) 2023-10-17

Similar Documents

Publication Publication Date Title
CN111797321B (en) Personalized knowledge recommendation method and system for different scenes
CN112529071B (en) Text classification method, system, computer equipment and storage medium
CN111553759A (en) Product information pushing method, device, equipment and storage medium
Khan et al. An unsupervised deep learning ensemble model for anomaly detection in static attributed social networks
CN112529069A (en) Semi-supervised node classification method, system, computer equipment and storage medium
Wu et al. Optimized deep learning framework for water distribution data-driven modeling
CN116821776A (en) Heterogeneous graph network node classification method based on graph self-attention mechanism
Wu et al. EvoNet: A neural network for predicting the evolution of dynamic graphs
CN115577678A (en) Document level event cause and effect relationship identification method, system, medium, equipment and terminal
Rai Advanced deep learning with R: Become an expert at designing, building, and improving advanced neural network models using R
CN113592593A (en) Training and application method, device, equipment and storage medium of sequence recommendation model
CN112905906A (en) Recommendation method and system fusing local collaboration and feature intersection
Xu et al. Collective vertex classification using recursive neural network
EP4064038B1 (en) Automated generation and integration of an optimized regular expression
CN114842247B (en) Characteristic accumulation-based graph convolution network semi-supervised node classification method
CN115689639A (en) Commercial advertisement click rate prediction method based on deep learning
CN114692012A (en) Electronic government affair recommendation method based on Bert neural collaborative filtering
Denli et al. Geoscience language processing for exploration
Zhang An English teaching resource recommendation system based on network behavior analysis
CN117252665B (en) Service recommendation method and device, electronic equipment and storage medium
Zhao et al. Test case classification via few-shot learning
Xiong et al. Bayesian nonparametric regression modeling of panel data for sequential classification
Patil et al. COMPARISON OF DIFFERENT MUSIC RECOMMENDATION SYSTEM ALGORITHMS
CN112509640B (en) Gene ontology item name generation method and device and storage medium
CN117235533B (en) Object variable analysis method, device, computer equipment and storage medium

Legal Events

Date Code Title Description
PB01 Publication
PB01 Publication
SE01 Entry into force of request for substantive examination
SE01 Entry into force of request for substantive examination
GR01 Patent grant
GR01 Patent grant