CN112529071B - Text classification method, system, computer equipment and storage medium - Google Patents

Text classification method, system, computer equipment and storage medium

Info

Publication number
CN112529071B
Authority
CN
China
Prior art keywords
graph
text
layer
classification
order
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN202011425848.0A
Other languages
Chinese (zh)
Other versions
CN112529071A (en)
Inventor
刘勋
宗建华
夏国清
叶和忠
刘强
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
South China Institute Of Software Engineering Gu
Original Assignee
South China Institute Of Software Engineering Gu
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by South China Institute Of Software Engineering Gu filed Critical South China Institute Of Software Engineering Gu
Priority to CN202011425848.0A priority Critical patent/CN112529071B/en
Publication of CN112529071A publication Critical patent/CN112529071A/en
Application granted granted Critical
Publication of CN112529071B publication Critical patent/CN112529071B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical

Links

Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00Pattern recognition
    • G06F18/20Analysing
    • G06F18/24Classification techniques
    • G06F18/241Classification techniques relating to the classification model, e.g. parametric or non-parametric approaches
    • G06F18/2413Classification techniques relating to the classification model, e.g. parametric or non-parametric approaches based on distances to training or reference patterns
    • G06F18/24133Distances to prototypes
    • G06F18/24137Distances to cluster centroïds
    • G06F18/2414Smoothing the distance, e.g. radial basis function networks [RBFN]
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/30Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
    • G06F16/35Clustering; Classification
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00Handling natural language data
    • G06F40/20Natural language analysis
    • G06F40/279Recognition of textual entities
    • G06F40/284Lexical analysis, e.g. tokenisation or collocates
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/02Neural networks
    • G06N3/04Architecture, e.g. interconnection topology
    • G06N3/045Combinations of networks
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/02Neural networks
    • G06N3/08Learning methods
    • YGENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
    • Y02TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
    • Y02DCLIMATE CHANGE MITIGATION TECHNOLOGIES IN INFORMATION AND COMMUNICATION TECHNOLOGIES [ICT], I.E. INFORMATION AND COMMUNICATION TECHNOLOGIES AIMING AT THE REDUCTION OF THEIR OWN ENERGY USE
    • Y02D10/00Energy efficient computing, e.g. low power processors, power management or thermal management

Abstract

The application provides a text classification method, system, computer device and storage medium. The method comprises: establishing a new high-low order graph convolutional neural network model comprising a high-low order graph convolution layer that simultaneously captures multi-order neighborhood information of nodes, an information fusion layer that mixes the first-order to higher-order features of different neighborhoods, a first-order graph convolution layer and a softmax classification output layer; inputting a training set text graph network for training to obtain a text classification model; and inputting a test set text graph network into the classification model to obtain the classification result. In the embodiments of the application, while text classification efficiency and effect are guaranteed, the method of simultaneously capturing multi-order neighborhood information of nodes overcomes the problems of complex computation, large parameter count, over-smoothing and limited receptive field that arise when conventional graph convolution is applied to text classification, thereby further improving the expressive power of the text classification model, the stability of the model and the accuracy of the text classification task.

Description

Text classification method, system, computer equipment and storage medium
Technical Field
The present application relates to the field of text classification technology, and in particular, to a text classification method, system, computer device and storage medium based on a high-low order graph convolutional network.
Background
With the rapid development of internet technology, social platforms, technical communication platforms, shopping platforms and the like have grown quickly and continuously generate massive amounts of text data. Because of the high value of the information it carries, this text data has become a focus of big data mining research, and text classification plays an increasingly important role in information processing. Researchers want to use effective text classification methods to efficiently manage, extract and analyze the useful information in text data, so as to provide strong support for enterprise or social development.
Currently, text classification techniques have evolved from early manual classification relying on the prior knowledge of linguistic experts to deep machine learning. Deep learning models represented by convolutional neural networks (CNNs) and recurrent neural networks (RNNs) are widely used for text classification tasks, but these models may ignore global word co-occurrence information in a corpus, in which discontinuous and long-range semantic information has an important influence on the text classification results. Graph convolutional neural networks can process data of arbitrary structure and capture global word co-occurrence information; they can effectively learn a text graph network with rich relations and preserve the global structural information of the graph during graph embedding. However, the existing graph convolutional neural network generally has only two layers; this shallow mechanism limits the size of the receptive field and the expressive power of the model, while a deeper (more than two-layer) network drives the values of text nodes of different classes toward a fixed value, causing the over-smoothing problem. Therefore, it is of great significance to retain the text classification advantages of the conventional graph convolutional network while solving the over-smoothing problem and enlarging the receptive field of the classification model, so as to improve the expressive power of the model and the accuracy of text classification tasks.
Disclosure of Invention
The application aims to solve the problems of over-smoothing and limited model receptive field when current graph convolutional networks are applied to text classification, and to further improve the expressive power of the text classification model and the accuracy of the text classification task.
In order to achieve the above object, it is necessary to provide a text classification method, system, computer device and storage medium in view of the above technical problems.
In a first aspect, an embodiment of the present application provides a text classification method, including the steps of:
establishing a high-low order graph convolution neural network model; the high-low order graph convolution neural network model sequentially comprises an input layer, a high-low order graph convolution layer, an information fusion layer, a first order graph convolution layer and an output layer;
acquiring a corpus set for text classification by adopting the high-low order graph convolution neural network model; the corpus comprises a plurality of samples, each sample containing a document and a title;
preprocessing the corpus to obtain a training set and a testing set;
respectively constructing a training set text graph network and a testing set text graph network according to the training set and the testing set;
inputting the training set text graph network into a high-low order graph convolution neural network model, and training by combining a loss function to obtain a text classification model;
and inputting the test set text graph network into the text classification model for testing to obtain a classification result.
Further, if the output of the high-low order graph convolution neural network model is Z, then:
Z = softmax( Â · ReLU( NMPooling(ÂXW_1, Â²XW_1, …, Â^k XW_1) ) · W_2 )    (1)
where X is the input matrix of the graph, W_1 and W_2 are respectively the parameter matrix between the input layer and the hidden layer and the parameter matrix between the hidden layer and the output layer, Â is the regularized adjacency matrix of the graph containing self-connections, k is the highest order of the graph convolution, Â^i XW_1 (i = 1, …, k) is the i-th-order graph convolution, ReLU(·) is the activation function, NMPooling(·) is the information fusion layer, and softmax(·) is the multi-class output function.
Further, the high-low order graph convolution layer comprises first-order to k-th-order graph convolutions based on weight sharing; the order k of the high-low order graph convolution layer is any single order of two or above, or a combination of any several orders.
Further, the information fusion layer adopts minimum-value negation information fusion pooling, implemented by the following steps:
calculating the minimum-value matrix of the different-order graph convolutions according to the input matrix X, the parameter matrix W_1 and the regularized adjacency matrix Â;
and negating each element value of the minimum-value matrix to obtain the pooled graph feature matrix.
Further, the step of preprocessing the corpus to obtain a training set and a testing set includes:
performing de-duplication and word segmentation on the title and the document of each sample in the corpus, removing the stop words and the special symbols, obtaining corpus words, and forming the corpus words and the document into a corpus text group;
dividing the corpus text group into a training set and a testing set according to the quantity proportion.
Further, the step of constructing a training set text graph network and a testing set text graph network according to the training set and the testing set respectively includes:
respectively establishing a training set text graph and a testing set text graph of which feature matrixes are corresponding dimension unit matrixes according to the training set and the testing set;
and determining adjacency matrixes of the training set text graph and the testing set text graph according to the TF-IDF algorithm and the PMI algorithm.
Further, the step of determining the adjacency matrix of the training set text graph and the test set text graph according to the TF-IDF algorithm and the PMI algorithm comprises:
calculating weights of the connecting edges between document nodes and word nodes in the adjacency matrix of the training set text graph according to the TF-IDF algorithm, and calculating weights of the connecting edges between word nodes in the adjacency matrix of the training set text graph according to the PMI algorithm;
and calculating weights of the connecting edges of the document nodes and the word nodes in the adjacency matrix of the test set text graph according to the TF-IDF algorithm, and calculating weights of the connecting edges of the word nodes and the word nodes of the test set text graph according to the PMI algorithm.
In a second aspect, an embodiment of the present application provides a text classification system, including:
the classification model building module is used for building a high-low order graph convolution neural network model; the high-low order graph convolution neural network model sequentially comprises an input layer, a high-low order graph convolution layer, an information fusion layer, a first order graph convolution layer and an output layer;
the classifying corpus obtaining module is used for obtaining a corpus set for text classification by adopting the high-low order graph convolution neural network model; the corpus comprises a plurality of samples, each sample containing a document and a title;
the corpus preprocessing module is used for preprocessing the corpus to obtain a training set and a testing set;
a text graph network building module is used for building a training set text graph network and a testing set text graph network according to the training set and the testing set respectively;
the text classification model training module is used for inputting the training set text graph network into a high-low order graph convolution neural network model, and training the training set text graph network by combining a loss function to obtain a text classification model;
and the text classification test module is used for inputting the test set text graph network into the text classification model for testing to obtain a classification result.
In a third aspect, embodiments of the present application further provide a computer device, including a memory, a processor, and a computer program stored on the memory and executable on the processor, the processor implementing the steps of the above method when executing the computer program.
In a fourth aspect, embodiments of the present application also provide a computer-readable storage medium having stored thereon a computer program which, when executed by a processor, implements the steps of the above-described method.
The application provides a text classification method, system, computer device and storage medium. With this method, a TF-IDF algorithm and a PMI algorithm are used to construct a training set text graph network and a test set text graph network from the preprocessed corpus; the training set text graph network is then input into a high-low order graph convolutional neural network model consisting of an input layer, one high-low order graph convolution layer, one information fusion layer, one first-order graph convolution layer and a softmax output layer, and trained with a defined loss function to determine the parameter matrices of the classification model, so that the test set text can be classified accurately. Compared with the prior art, while guaranteeing text classification efficiency and effect, the two-layer network structure and the method of capturing multi-order neighborhood information of nodes with high-low order graph convolution respectively overcome the problems of complex computation, large parameter count, over-smoothing and limited model receptive field that arise when current graph convolutional networks are applied to text classification, and further improve the expressive power of the text classification model, the stability of the model and the accuracy of the text classification task.
Drawings
FIG. 1 is a flow chart of a text classification method according to an embodiment of the application;
FIG. 2 is a schematic diagram of a structure of the high-low order graph convolutional neural network model in FIG. 1;
FIG. 3 is a schematic flow chart of corpus preprocessing in step S13 in FIG. 1;
FIG. 4 is a flowchart of step S14 in FIG. 1 for constructing a corresponding text graph network based on the training set and the test set;
FIG. 5 is a schematic diagram of creating a text graph network based on partial data of OH using the method of FIG. 4;
FIG. 6 is a flowchart illustrating the construction of the text-map adjacency matrix according to the TF-IDF algorithm and the PMI algorithm in step S142 of FIG. 4;
FIG. 7 is a schematic diagram of a text classification system according to an embodiment of the application;
FIG. 8 is an internal structural view of a computer device in an embodiment of the present application.
Detailed Description
In order to make the objects, technical solutions and advantageous effects of the present application more apparent, the present application will be further described in detail with reference to the accompanying drawings and examples, and it is apparent that the examples described below are part of the examples of the present application, which are provided for illustration only and are not intended to limit the scope of the present application. All other embodiments, which can be made by those skilled in the art based on the embodiments of the application without making any inventive effort, are intended to be within the scope of the application.
The text classification method provided by the application can be applied to a terminal or a server. The adopted high-low order graph convolutional neural network model (NMGC) is an improvement on existing graph convolutional network models and can also complete other similar fully supervised classification tasks; text corpora are preferably used for training and testing.
In one embodiment, as shown in fig. 1, there is provided a text classification method, including the steps of:
S11, establishing a high-low order graph convolution neural network model; the high-low order graph convolution neural network model comprises an input layer, a high-low order graph convolution layer, an information fusion layer, a first order graph convolution layer and an output layer;
the number of the high-low order graph convolution layers and the number of the first order graph convolution layers in the high-low order graph convolution neural network model are 1. Assuming that the output of the high-low order graph convolution neural network model is Z, then:
Z = softmax( Â · ReLU( NMPooling(ÂXW_1, Â²XW_1, …, Â^k XW_1) ) · W_2 )    (1)
where X is the input matrix of the graph, W_1 and W_2 are respectively the parameter matrix between the input layer and the hidden layer and the parameter matrix between the hidden layer and the output layer, Â is the regularized adjacency matrix of the graph containing self-connections, k is the highest order of the graph convolution, Â^i XW_1 (i = 1, …, k) is the i-th-order graph convolution, ReLU(·) is the activation function, NMPooling(·) is the information fusion layer, and softmax(·) is the multi-class output function. The specific model structure is shown in FIG. 2.
The high-low order graph convolution layer in this embodiment comprises first-order to k-th-order graph convolutions based on weight sharing, i.e. ÂXW_1, Â²XW_1, …, Â^k XW_1. The first-order graph convolution ÂXW_1 captures the first-order neighborhood information of the text nodes, while the second-order to k-th-order graph convolutions Â²XW_1, …, Â^k XW_1 capture the higher-order neighborhood information of the text nodes, which enlarges the receptive field of the model and further enhances its learning capability. The order k of the high-low order graph convolution layer can be any single order of two or above, or a combination of any several orders. When k = 2, the model used is the NMGC-2 model mixing first- and second-order neighborhoods, with the following formula:
Z = softmax( Â · ReLU( NMPooling(ÂXW_1, Â²XW_1) ) · W_2 )    (2)
When k = 3, the model used is the NMGC-3 model mixing first-, second- and third-order neighborhoods, with the following formula:
Z = softmax( Â · ReLU( NMPooling(ÂXW_1, Â²XW_1, Â³XW_1) ) · W_2 )    (3)
When k = n, the model used is the NMGC-n model mixing first- to n-th-order neighborhoods, with the following formula:
Z = softmax( Â · ReLU( NMPooling(ÂXW_1, Â²XW_1, …, Â^n XW_1) ) · W_2 )    (4)
In the model, the same weight parameters are used for every order of neighborhood within the same graph convolution layer to realize weight sharing and reduce the number of parameters, which is reflected in the choice of the parameters W_1 and W_2 in formulas (1)-(4).
When the method is applied in practice to large-scale text graph network training, it is necessary to compute Â^k XW_1. Since Â is usually a sparse matrix with m non-zero elements, and all orders of the high-low order graph convolution share the same weights, Â^k XW_1 is computed by multiplying from right to left. For example, when k = 2, Â²XW_1 is obtained as Â(ÂXW_1); in general, the k-th-order graph convolution is obtained by left-multiplying the (k−1)-th-order result by Â, i.e. Â^k XW_1 = Â(Â^{k−1} XW_1). This calculation method effectively reduces the computational complexity. In addition, because the different orders of graph convolution share the same weights, the parameter count of the high-low order graph convolution is the same as that of a first-order graph convolution. Assuming X ∈ R^{n×r_0} (n nodes with r_0 attribute feature dimensions), W_1 ∈ R^{r_0×r_1} (r_1 filters) and W_2 ∈ R^{r_1×r_2} (r_2 filters), the time complexity and parameter count of the high-low order graph convolution model are O(k×m×r_0×r_1) and O(r_0×r_1) respectively, which guarantees to a certain extent the computational efficiency of the high-low order graph convolution.
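As an illustration of this right-to-left computation, the following Python sketch (an assumption of this description, not the patented implementation; Â is assumed to be stored as a SciPy sparse matrix) computes the k shared-weight graph convolutions ÂXW_1, Â²XW_1, …, Â^k XW_1 without ever forming dense powers of Â:

```python
import numpy as np
import scipy.sparse as sp

def high_low_order_convolutions(A_hat: sp.csr_matrix, X: np.ndarray, W1: np.ndarray, k: int):
    """Return [A_hat X W1, A_hat^2 X W1, ..., A_hat^k X W1].

    The projection X @ W1 is shared by all orders (weight sharing); each higher
    order is obtained by one more sparse-dense product from the right, costing
    O(m * r1) per order for a sparse A_hat with m non-zero entries.
    """
    H = X @ W1                    # shared projection, shape (n, r1)
    outputs = []
    for _ in range(k):
        H = A_hat @ H             # left-multiply the previous order by A_hat
        outputs.append(H)
    return outputs
```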
S12, acquiring a corpus set for text classification by adopting the high-low order graph convolution neural network model; the corpus comprises a plurality of samples, each sample containing a document and a title;
the text classification corpus can be selected according to actual needs, R52 and R8 of Reuters 21578, a supervision text data set of 20-News groups (20 NG), ohsumed (OH) and Movie Review (MR) is adopted in the application, and specific information of the data set is as follows: the 20NG dataset included 18846 newsgroup documents, without duplicate documents, divided into 20 different categories, of which 11314 documents were used for training and the remaining 7532 documents were used as test sets; the OH dataset is one medical dataset from the MEDLINE database, 7400 medical documents are selected, of which 3357 are used for training, and the remaining 4043 are used as test sets, for a total of 23 different classes. R52 and R8 are two subsets of Reuters 21578, divided into 52, 8 different classes, respectively, wherein the number of training documents and test documents for the R52 dataset is 6532 and 2568, respectively, and the number of training documents and test documents for the R8 dataset is 5485 and 2189, respectively. MR is a movie review data set, and 10662 review documents are used, wherein the number of positive review documents and negative review documents is half of that of each review document, each document contains only one sentence, and 7108 review documents are used as training and 3554 review documents are used as testing. 10% of the data in the training set of data described above was used for verification and specific information for the supervised text data set is shown in table 1.
Table 1 supervision text dataset
Data set Document number Word number Training Testing Categories Node count Average length
R52 9,100 8,892 6,532 2,568 52 17,992 69.82
R8 7,674 7,688 5,485 2,189 8 15,362 65.72
20NG 18,846 42,757 11,314 7,532 20 61,603 221.26
OH 7,400 14,157 3,357 4,043 23 21,557 135.82
MR 10,662 18,764 7,108 3,554 2 29,426 20.39
S13, preprocessing the corpus to obtain a training set and a testing set;
the step S13 of preprocessing the corpus to obtain a training set and a testing set includes, as shown in fig. 3:
S131, performing de-duplication and word segmentation on the titles and the documents of all samples in the corpus, removing stop words and special symbols, preprocessing to obtain corpus words, and forming the corpus words and the documents into corpus text groups;
the corpus collected by the network only comprises documents and titles, more words in the documents and the titles are depended on during actual text data processing, so that certain preprocessing is needed before model training, word segmentation logic can be customized by combining with the needs of a user through existing python ntk and other tools, word segmentation operation is carried out on the titles and text chapters of all samples in the corpus, preprocessing such as stop words and special symbols is removed, words in the corpus are further obtained, and the obtained corpus words and documents form a corpus text group for subsequent analysis and use.
S132, dividing the corpus text group into a training set and a testing set according to the quantity proportion.
When the corpus text groups are used to train the graph convolution model, the collected data is divided into a training set and a test set according to a certain quantity ratio; the training set is used to train and optimize the model parameters to determine the final model, and the test set is classified directly with the determined model.
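A minimal preprocessing sketch in Python is given below. It assumes an English corpus and the NLTK tokenizer/stopword resources; the cleaning rules and split ratio are illustrative assumptions, since the patent leaves them to the user (and the benchmark datasets above come with predefined splits):

```python
import re
from nltk.corpus import stopwords
from nltk.tokenize import word_tokenize

STOP_WORDS = set(stopwords.words("english"))

def preprocess(samples):
    """samples: list of (title, document) pairs -> list of (corpus_words, document) text groups."""
    groups = []
    for title, document in samples:
        text = f"{title} {document}".lower()
        text = re.sub(r"[^a-z0-9\s]", " ", text)          # strip special symbols
        tokens = [t for t in word_tokenize(text)
                  if t not in STOP_WORDS and len(t) > 1]  # drop stop words / 1-char noise
        tokens = list(dict.fromkeys(tokens))              # de-duplicate, keep first occurrence
        groups.append((tokens, document))
    return groups

def split(groups, train_ratio=0.7):
    """Divide the corpus text groups into training and test sets by a count ratio."""
    cut = int(len(groups) * train_ratio)
    return groups[:cut], groups[cut:]
```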
S14, respectively constructing a training set text graph network and a testing set text graph network according to the training set and the testing set;
s15, inputting the training set text graph network into a high-low order graph convolution neural network model, and training by combining a loss function to obtain a text classification model;
the loss function used in model training is as follows:
L = − Σ_{l∈y_L} Σ_{m=1}^{M} Y_{lm} · ln Z_{lm}
where y_L is the set of labeled vertices (nodes), M is the number of classes, Y_{lm} is the real label of labeled node l, and Z_{lm} is the predicted probability value between 0 and 1 output by softmax for labeled node l and class m.
When the feature matrix and the regularized adjacency matrix of the training set text graph are fed as inputs into the high-low order graph convolution neural network model for training and learning, the graph convolution parameters are updated by gradient descent to preliminarily determine a text classification model; the 10% validation set reserved from the training set is then fed into the model and, combined with the defined loss function, used to tune the parameters until a stable text classification model is obtained. Finally, the test set is used to obtain the classification results, which well guarantees the classification accuracy of the classification model.
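The following PyTorch sketch illustrates this training procedure. It is an assumption of this description: `model` stands for any implementation of the high-low order graph convolutional network returning softmax probabilities Z, and the optimizer, learning rate and epoch count are illustrative choices rather than values from the patent; only the masked cross-entropy follows the loss defined above.

```python
import torch

def masked_cross_entropy(Z, Y, labeled_idx):
    """L = -sum over labeled nodes l and classes m of Y[l, m] * ln Z[l, m]; Y is one-hot."""
    eps = 1e-10                                   # numerical safety for log
    return -(Y[labeled_idx] * torch.log(Z[labeled_idx] + eps)).sum()

def train(model, A_hat, X, Y, train_idx, val_idx, epochs=200, lr=0.02):
    optimizer = torch.optim.Adam(model.parameters(), lr=lr)   # gradient-descent variant
    for epoch in range(epochs):
        model.train()
        optimizer.zero_grad()
        Z = model(A_hat, X)                       # forward pass over the whole text graph
        loss = masked_cross_entropy(Z, Y, train_idx)
        loss.backward()                           # update graph convolution parameters
        optimizer.step()

        model.eval()
        with torch.no_grad():                     # 10% validation split guides tuning / stopping
            val_loss = masked_cross_entropy(model(A_hat, X), Y, val_idx)
    return model
```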
S16, inputting the test set text graph network into the text classification model for testing, and obtaining a classification result.
In this embodiment of the application, first, to ensure the generalization ability of the model, important benchmark text classification datasets are used for parameter training; since these datasets contain no duplicate data, the workload of model training is reduced to a certain extent and training efficiency is improved. Second, a high-low order graph convolutional network model with only two graph convolution layers is built, which reduces the number of training parameters and at the same time alleviates the over-smoothing phenomenon, thereby improving the generality of the trained classification model.
In one embodiment, the information fusion layer in formula (1) of the application adopts minimum-value negation information fusion pooling, and the specific calculation method is as follows:
according to the input matrix X and the parameter matrix w 1 And regularizing the adjacency matrixCalculating minimum value matrixes of different-order graph convolution; wherein the calculation formula of the minimum value is as follows:
for the minimum value matrix H min The values of each element are inverted to obtain a pooled graph feature matrix, namely H nm =-H min
The information fusion method in the above embodiment is illustrated with a specific third-order example; higher orders are handled similarly. Let k = 3, and let the first-order neighborhood feature be H_1 = ÂXW_1, the second-order feature be H_2 = Â²XW_1 and the third-order feature be H_3 = Â³XW_1. The information fusion process is as follows:
(1) compute the element-wise minimum matrix H_min = min(H_1, H_2, H_3);
(2) negate each element value of H_min, giving the fused feature matrix H_nm = −H_min.
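For instance, at a given node-feature position with hypothetical values H_1 = 0.3, H_2 = 0.5 and H_3 = −0.2, the fused value is −min(0.3, 0.5, −0.2) = −(−0.2) = 0.2; the negation thus keeps, with positive sign, the magnitude selected by the element-wise minimum.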
The implementation process of the NMPooling-based high-low order graph convolution algorithm in this embodiment is as follows:
Input: the feature matrix X, the regularized adjacency matrix Â and the weight matrix W_1
Convolution operation: H_i = Â^i XW_1, i = 1, 2, …, k
Information fusion: H_nm = NMPooling(H_1, H_2, …, H_k)
Nonlinear activation: H = ReLU(H_nm)
In this embodiment, the text graph network is first fed into the high-low order graph convolution for the above processing; NMPooling information fusion then mixes the first-order to higher-order features of the different neighborhoods; after nonlinear activation, the result is fed into a classical first-order graph convolution to further learn the representation of the text graph, and finally the classification probability result is obtained.
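A compact PyTorch sketch of this layer flow is shown below. It is a hedged reading of the algorithm above (shared-weight convolutions, element-wise minimum, negation, ReLU, a final first-order graph convolution and softmax); hidden size, bias handling and initialisation are assumptions, not details taken from the patent, and `A_hat` is assumed to be a sparse torch tensor.

```python
import torch
import torch.nn as nn

class NMGC(nn.Module):
    """High-low order graph convolution with NMPooling (min + negation) fusion."""

    def __init__(self, in_dim, hidden_dim, num_classes, k=3):
        super().__init__()
        self.k = k
        self.W1 = nn.Linear(in_dim, hidden_dim, bias=False)       # shared by all k orders
        self.W2 = nn.Linear(hidden_dim, num_classes, bias=False)  # final first-order conv

    def forward(self, A_hat, X):
        H = self.W1(X)                                   # X W1
        orders = []
        for _ in range(self.k):                          # right-to-left: A_hat^i X W1
            H = torch.sparse.mm(A_hat, H)
            orders.append(H)
        H_min = torch.stack(orders).min(dim=0).values    # element-wise minimum matrix
        H_nm = -H_min                                    # negation = NMPooling output
        H = torch.relu(H_nm)                             # nonlinear activation
        Z = torch.sparse.mm(A_hat, self.W2(H))           # classical first-order graph convolution
        return torch.softmax(Z, dim=1)                   # classification probabilities
```

With k = 2 or k = 3 this corresponds to the NMGC-2 and NMGC-3 variants discussed above.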
In one embodiment, as shown in fig. 4, the step S14 of constructing a training set text graph network and a testing set text graph network according to the training set and the testing set includes:
S141, respectively establishing a training set text graph and a testing set text graph of which feature matrixes are corresponding dimension unit matrixes according to the training set and the testing set;
In text classification training, converting the text corpus into a corresponding text graph is an essential step for machine training. In this embodiment, the training set text graph and the test set text graph are both necessary inputs of the high-low order graph convolution neural network model, and corresponding text graphs need to be built from the text data of the training set and the test set respectively. For example, the training set text graph is G = (V, E), where V is the vertex set formed by all words and all documents of the training set text, i.e. the number of nodes in the text graph network equals the number of documents plus the number of words (the sum of the corpus size and the vocabulary size), and E is the edge set containing all dependency relationships between pairs of words and between words and documents in the training set. The text graph of the test set is obtained in the same way.
As shown in FIG. 5, a text graph network is built for part of the Ohsumed corpus: nodes beginning with "O" are document nodes, the other nodes are word nodes, gray lines represent word-word edges, black lines represent document-word edges, document nodes of the same color belong to the same class, and document nodes of different colors belong to different classes. In this embodiment, the feature matrices of the training set text graph and the test set text graph are set to identity matrices of the corresponding dimensions, i.e. every word and document is represented by a one-hot code as model input.
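As a small illustration (the node ordering and helper names are assumptions of this description, not from the patent), the node set and the identity feature matrix can be assembled as follows, with document nodes first and word nodes after:

```python
import scipy.sparse as sp

def build_nodes(documents, vocabulary):
    """Node count = number of documents + number of words; features are one-hot (identity)."""
    doc_ids  = {doc: i for i, doc in enumerate(documents)}
    word_ids = {w: len(documents) + j for j, w in enumerate(vocabulary)}
    n = len(documents) + len(vocabulary)
    X = sp.identity(n, format="csr")          # identity feature matrix: one-hot code per node
    return doc_ids, word_ids, X
```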
S142, determining adjacency matrixes of the training set text graph and the testing set text graph according to the TF-IDF algorithm and the PMI algorithm.
The adjacency matrix of the text graph contains the weights of document-word edges, word-word edges and document-document edges. In this embodiment, edges between a word and a document are established according to the number of occurrences of the word in the document, and edges between words are established according to word co-occurrence.
As shown in fig. 6, the step S142 of determining the adjacency matrix of the training set text graph and the test set text graph according to the TF-IDF algorithm and the PMI algorithm includes:
S1421, calculating weights of the connecting edges between document nodes and word nodes in the adjacency matrix of the training set text graph according to the TF-IDF algorithm, and calculating weights of the connecting edges between word nodes in the adjacency matrix of the training set text graph according to the PMI algorithm;
S1422, calculating weights of the connecting edges between document nodes and word nodes in the adjacency matrix of the test set text graph according to the TF-IDF algorithm, and calculating weights of the connecting edges between word nodes of the test set text graph according to the PMI algorithm.
The weights of the edges connecting document nodes and word nodes are calculated according to term frequency-inverse document frequency (TF-IDF). Term frequency (TF) indicates the number of times a given word appears in a document: a larger term frequency indicates a larger contribution of the given word to the document, while a smaller value indicates a lower contribution or even none. Term frequency is expressed as follows:
tf_{j,k} = n_{j,k} / Σ_i n_{i,k},
where n_{j,k} is the number of times word j appears in document k, and Σ_i n_{i,k} is the total number of occurrences of all words in document k. The inverse document frequency (IDF) reflects the discriminative ability of a given word over the whole document collection: the larger the inverse document frequency, the fewer documents contain the given word and the stronger its discriminative ability. The inverse document frequency is calculated as follows:
idf_j = log( D / |{k : t_j ∈ d_k}| ),
where D is the number of documents in the corpus, and |{k : t_j ∈ d_k}| is the number of documents containing the given word t_j.
TF-IDF accounts for the influence of a given word on a particular document through TF, and characterizes the importance of the word over the whole document collection through IDF; it is defined as the product of term frequency and inverse document frequency, calculated as follows:
TF-IDF=TF*IDF。
to utilize global word co-occurrence information, a fixed size sliding window is set for all documents in the corpus to aggregate co-occurrence features. The present embodiment uses PMI algorithm to measure the word-to-word correlation, calculates the weight between two word nodes, and the PMI values of word j and word k are defined as:
PMI(j, k) = log( p(j, k) / (p(j) · p(k)) ),
where p(j, k) = W(j, k)/W and p(j) = W(j)/W; W(j, k) is the number of sliding windows containing both word j and word k, W(j) is the number of sliding windows containing word j, and W is the total number of sliding windows. The larger the PMI value, the stronger the semantic correlation between word j and word k; the smaller the value, the weaker the correlation; and when the PMI value is negative, the correlation between word j and word k is very weak or absent. Therefore, edges between words are established only when the PMI value is positive.
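As a hypothetical numerical illustration with invented counts: if words j and k co-occur in 30 of W = 1000 sliding windows, with W(j) = 100 and W(k) = 60, then p(j, k) = 0.03, p(j) = 0.1 and p(k) = 0.06, so PMI(j, k) = log( 0.03 / (0.1 × 0.06) ) = log 5 ≈ 1.61 > 0, and an edge with weight log 5 is created between the two word nodes; had the ratio been below 1, the PMI would be negative and no edge would be added.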
To sum up, in this embodiment the element values of the adjacency matrix corresponding to a text graph, i.e. the weights of the edges of the text graph network, are defined as follows:
A_ij = PMI(i, j),    if i and j are words, i ≠ j and PMI(i, j) > 0;
A_ij = TF-IDF(i, j), if i is a document and j is a word occurring in it;
A_ij = 1,            if i = j;
A_ij = 0,            otherwise.
after the training set text graph network is built, the feature matrix and the neighbor matrix of the graph are transmitted into a high-low order graph convolution neural network model for training.
In this embodiment, after the training set and test set text corpora are converted into the corresponding text graphs, the TF-IDF algorithm and PMI algorithm are used to determine the adjacency matrices of the text graphs, so that global word co-occurrence information is captured while the discriminative ability of documents is taken into account. This provides edge weights that describe the text graph information more accurately, improving the model training effect and also the classification accuracy of the model.
In the embodiments of the application, classification tests based on the supervised text datasets R52, R8, 20NG, OH and MR show that the high-low order graph convolutional neural network models with k = 2 and k = 3 perform very well in terms of classification accuracy and computational complexity on the text classification task, while k = 4 or higher reduces the text classification accuracy. Therefore, only the comparison of classification effect, model parameters and computational complexity between the NMGC-2 and NMGC-3 models (i.e. only the cases k = 2 and k = 3) and other existing classification models is given, as shown in Tables 2 to 4 below:
table 2 comparison of NMGC-2, NMGC-3 with existing models based on the same text dataset
Table 2 illustrates: the accuracy in the table is expressed as a percentage and the number is the average of 10 runs.
TABLE 3 comparison of NMGC-2 and NMGC-3 model classification results for different numbers of hidden neurons
TABLE 4 comparison of computational complexity and parameter values for NMGC-2, NMGC-3 and GCN models
Table 4 illustrates: 1, 2 and 3 represent the order of the graph convolution, and 200, 64, 128 and 32 represent the number of hidden neurons.
Based on the above experimental results, this embodiment provides a high-low order graph convolutional neural network model (NMGC) comprising a high-low order graph convolution layer that simultaneously captures correlations between low-order and high-order neighborhood text nodes and an NMPooling information fusion layer that mixes the first-order to higher-order features of different neighborhoods. The model can therefore retain more and richer feature information for text classification and learn the global graph topology, which not only widens the receptive field but also improves the expressive power of the model. In addition, weight sharing across the different orders of graph convolution and a small number of hidden neurons are used to reduce the computational complexity and parameter count and to avoid over-fitting of the model. Experimental results on five benchmark text datasets show that applying the high-low order graph convolutional neural network model together with a classical first-order graph convolutional network for text classification has clear advantages in classification accuracy, classification performance and parameter count, achieving the highest accuracy while remaining stable.
Although the steps in the flowcharts described above are shown in an order indicated by the arrows, these steps are not necessarily executed in that order. Unless explicitly stated herein, the order of execution is not strictly limited, and the steps may be executed in other orders. Moreover, at least some of the steps in the above flowcharts may include multiple sub-steps or stages that are not necessarily performed at the same time but may be performed at different times, and their order of execution is not necessarily sequential: they may be performed in turn or alternately with at least part of the sub-steps or stages of other steps.
In one embodiment, as shown in FIG. 7, a text classification system is provided, the system comprising:
a classification model building module 71, configured to build a high-low order graph convolution neural network model; the high-low order graph convolution neural network model sequentially comprises an input layer, a high-low order graph convolution layer, an information fusion layer, a first order graph convolution layer and an output layer;
the classification corpus obtaining module 72 is configured to obtain a corpus set for text classification by using the high-low order graph convolution neural network model; the corpus comprises a plurality of samples, each sample containing a document and a title;
a corpus preprocessing module 73, configured to preprocess the corpus to obtain a training set and a testing set;
a build text graph network module 74 for building a training set text graph network and a testing set text graph network from the training set and the testing set, respectively;
a text classification model training module 75, configured to input the training set text graph network into a high-low order graph convolution neural network model, and perform training in combination with a loss function to obtain a text classification model;
the text classification test module 76 is configured to input the test set text graph network into the text classification model for testing, so as to obtain a classification result.
For specific limitations of the text classification system, reference may be made to the above limitations of the text classification method, which are not repeated here. The various modules in the text classification system described above may be implemented in whole or in part by software, hardware, or combinations thereof. The above modules may be embedded in hardware form in, or independent of, a processor in the computer device, or may be stored in software form in a memory of the computer device, so that the processor can call and execute the operations corresponding to the above modules.
Fig. 8 shows an internal structure diagram of a computer device, which may specifically be a terminal or a server, in one embodiment. As shown in fig. 8, the computer device includes a processor, a memory, a network interface, a display screen and an input device connected by a system bus. The processor of the computer device provides computing and control capabilities. The memory of the computer device includes a non-volatile storage medium and an internal memory; the non-volatile storage medium stores an operating system and a computer program, and the internal memory provides an environment for running the operating system and the computer program in the non-volatile storage medium. The network interface of the computer device is used for communicating with an external terminal through a network connection. The computer program, when executed by the processor, implements the text classification method. The display screen of the computer device may be a liquid crystal display or an electronic ink display, and the input device may be a touch layer covering the display screen, a key, a trackball or a touchpad arranged on the housing of the computer device, or an external keyboard, touchpad or mouse.
It will be appreciated by those of ordinary skill in the art that the structure shown in fig. 8 is merely a block diagram of part of the structure related to the solution of the present application and does not limit the computer device to which the solution is applied; a particular computer device may include more or fewer components than shown in the figure, combine certain components, or have a different arrangement of components.
In one embodiment, a computer device is provided comprising a memory, a processor, and a computer program stored on the memory and executable on the processor, the processor implementing the steps of the above method when the computer program is executed.
In one embodiment, a computer readable storage medium is provided having a computer program stored thereon, which when executed by a processor, implements the steps of the above method.
In summary, the embodiments of the application provide a text classification method, system, computer device and storage medium. Based on a high-low order graph convolutional network, and fully considering the problems existing in text classification such as easily ignored global word co-occurrence information, a narrow receptive field, over-smoothing of the model and insufficient expressive power, a text classification method is proposed that adopts a new high-low order graph convolutional neural network model comprising a high-low order graph convolution layer capturing multi-order neighborhood information of nodes, an NMPooling information fusion layer mixing the first-order to higher-order features of different neighborhoods, a first-order graph convolution layer and a softmax classification output layer. When applied to actual text classification, the high-low order graph convolution layer can simultaneously capture the low-order and high-order neighborhood information of text nodes, obtaining more text node information, widening the receptive field and improving the expressive power of the model; weight sharing across the different orders of graph convolution and a small number of hidden neurons reduce the computational complexity and parameter count of the model, further avoid over-fitting, and improve the stability of the model and the accuracy of text classification.
Those skilled in the art will appreciate that implementing all or part of the above described methods may be accomplished by way of a computer program stored on a non-transitory computer readable storage medium, which when executed, may comprise the steps of the embodiments of the methods described above.
In this specification, the embodiments are described in a progressive manner; identical or similar parts of the embodiments may refer to each other, and each embodiment mainly describes its differences from the others. In particular, the embodiments of the system, computer device and storage medium are described briefly because they are substantially similar to the method embodiments; for relevant details, refer to the description of the method embodiments. It should be noted that the technical features of the above embodiments may be combined arbitrarily; for brevity, not all possible combinations are described, but as long as there is no contradiction between the combined features, they should be considered within the scope of this description.
The foregoing examples represent only a few preferred embodiments of the present application, which are described in more detail and are not to be construed as limiting the scope of the application. It should be noted that modifications and substitutions can be made by those skilled in the art without departing from the technical principles of the present application, and such modifications and substitutions should also be considered to be within the scope of the present application. Therefore, the protection scope of the patent of the application is subject to the protection scope of the claims.

Claims (8)

1. A method of text classification, the method comprising the steps of:
establishing a high-low order graph convolution neural network model; the high-low order graph convolution neural network model sequentially comprises an input layer, a high-low order graph convolution layer, an information fusion layer, a first order graph convolution layer and an output layer;
acquiring a corpus set for text classification by adopting the high-low order graph convolution neural network model; the corpus comprises a plurality of samples, each sample containing a document and a title;
preprocessing the corpus to obtain a training set and a testing set;
respectively constructing a training set text graph network and a testing set text graph network according to the training set and the testing set;
inputting the training set text graph network into a high-low order graph convolution neural network model, and training by combining a loss function to obtain a text classification model;
inputting the test set text graph network into the text classification model for testing to obtain a classification result;
wherein, the output of the high-low order graph convolution neural network model is Z, then:
Z = softmax( Â · ReLU( NMPooling(ÂXW_1, Â²XW_1, …, Â^k XW_1) ) · W_2 )
where X is the input matrix of the graph, W_1 and W_2 are respectively the parameter matrix between the input layer and the hidden layer and the parameter matrix between the hidden layer and the output layer, Â is the regularized adjacency matrix of the graph containing self-connections, k is the highest order of the graph convolution, Â^i XW_1 (i = 1, …, k) is the i-th-order graph convolution, ReLU(·) is the nonlinear activation function, NMPooling(·) is the information fusion layer, and softmax(·) is the multi-class output function;
the information fusion layer adopts minimum value negation information fusion pooling, and the implementation steps comprise:
according to the input matrix X and the parameter matrixAnd regularized adjacency matrix->Calculating minimum value matrixes of different-order graph convolution;
and inverting each element value of the minimum value matrix to obtain a pooled graph feature matrix.
2. The text classification method of claim 1, wherein the high-low order graph convolution layer comprises first-order to k-th-order graph convolutions based on weight sharing; the order k of the high-low order graph convolution layer is any single order of two or above, or a combination of any several orders.
3. The text classification method of claim 1, wherein the step of preprocessing the corpus to obtain a training set and a testing set comprises:
performing de-duplication and word segmentation on the title and the document of each sample in the corpus, removing the stop words and the special symbols, obtaining corpus words, and forming the corpus words and the document into a corpus text group;
dividing the corpus text group into a training set and a testing set according to the quantity proportion.
4. The text classification method of claim 1, wherein the steps of constructing a training set text graph network and a testing set text graph network from the training set and the testing set, respectively, comprise:
respectively establishing a training set text graph and a testing set text graph of which feature matrixes are corresponding dimension unit matrixes according to the training set and the testing set;
and determining adjacency matrixes of the training set text graph and the testing set text graph according to the TF-IDF algorithm and the PMI algorithm.
5. The text classification method of claim 4, wherein said step of determining adjacency matrices for said training set text graph and test set text graph based on TF-IDF algorithm and PMI algorithm comprises:
calculating weights of the connecting edges between document nodes and word nodes in the adjacency matrix of the training set text graph according to the TF-IDF algorithm, and calculating weights of the connecting edges between word nodes in the adjacency matrix of the training set text graph according to the PMI algorithm;
and calculating weights of the connecting edges of the document nodes and the word nodes in the adjacency matrix of the test set text graph according to the TF-IDF algorithm, and calculating weights of the connecting edges of the word nodes and the word nodes of the test set text graph according to the PMI algorithm.
6. A text classification system, the system comprising:
the classification model building module is used for building a high-low order graph convolution neural network model; the high-low order graph convolution neural network model sequentially comprises an input layer, a high-low order graph convolution layer, an information fusion layer, a first order graph convolution layer and an output layer;
the classifying corpus obtaining module is used for obtaining a corpus set for text classification by adopting the high-low order graph convolution neural network model; the corpus comprises a plurality of samples, each sample containing a document and a title;
the corpus preprocessing module is used for preprocessing the corpus to obtain a training set and a testing set;
a text graph network building module is used for building a training set text graph network and a testing set text graph network according to the training set and the testing set respectively;
the text classification model training module is used for inputting the training set text graph network into a high-low order graph convolution neural network model, and training the training set text graph network by combining a loss function to obtain a text classification model;
the text classification test module is used for inputting the test set text graph network into the text classification model for testing to obtain a classification result;
wherein, the output of the high-low order graph convolution neural network model is Z, then:
Z = softmax( Â · ReLU( NMPooling(ÂXW_1, Â²XW_1, …, Â^k XW_1) ) · W_2 )
where X is the input matrix of the graph, W_1 and W_2 are respectively the parameter matrix between the input layer and the hidden layer and the parameter matrix between the hidden layer and the output layer, Â is the regularized adjacency matrix of the graph containing self-connections, k is the highest order of the graph convolution, Â^i XW_1 (i = 1, …, k) is the i-th-order graph convolution, ReLU(·) is the nonlinear activation function, NMPooling(·) is the information fusion layer, and softmax(·) is the multi-class output function;
the information fusion layer adopts minimum value negation information fusion pooling, and the implementation steps comprise:
according to the input matrix X and the parameter matrixAnd regularized adjacency matrix->Calculating minimum value matrixes of different-order graph convolution;
and inverting each element value of the minimum value matrix to obtain a pooled graph feature matrix.
7. A computer device comprising a memory, a processor and a computer program stored on the memory and executable on the processor, characterized in that the processor implements the steps of the method according to any one of claims 1 to 5 when the computer program is executed by the processor.
8. A computer readable storage medium, on which a computer program is stored, characterized in that the computer program, when being executed by a processor, implements the steps of the method of any of claims 1 to 5.
CN202011425848.0A 2020-12-08 2020-12-08 Text classification method, system, computer equipment and storage medium Active CN112529071B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202011425848.0A CN112529071B (en) 2020-12-08 2020-12-08 Text classification method, system, computer equipment and storage medium

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202011425848.0A CN112529071B (en) 2020-12-08 2020-12-08 Text classification method, system, computer equipment and storage medium

Publications (2)

Publication Number Publication Date
CN112529071A CN112529071A (en) 2021-03-19
CN112529071B true CN112529071B (en) 2023-10-17

Family

ID=74996781

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202011425848.0A Active CN112529071B (en) 2020-12-08 2020-12-08 Text classification method, system, computer equipment and storage medium

Country Status (1)

Country Link
CN (1) CN112529071B (en)

Families Citing this family (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN113792144B (en) * 2021-09-16 2024-03-12 南京理工大学 Text classification method of graph convolution neural network based on semi-supervision
CN113961708B (en) * 2021-11-10 2024-04-23 北京邮电大学 Power equipment fault tracing method based on multi-level graph convolutional network
CN114021574B (en) * 2022-01-05 2022-05-17 杭州实在智能科技有限公司 Intelligent analysis and structuring method and system for policy file

Citations (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN110929029A (en) * 2019-11-04 2020-03-27 中国科学院信息工程研究所 Text classification method and system based on graph convolution neural network
CN111159425A (en) * 2019-12-30 2020-05-15 浙江大学 Temporal knowledge graph representation method based on historical relationship and double-graph convolution network

Family Cites Families (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US11010516B2 (en) * 2018-11-09 2021-05-18 Nvidia Corp. Deep learning based identification of difficult to test nodes

Patent Citations (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN110929029A (en) * 2019-11-04 2020-03-27 中国科学院信息工程研究所 Text classification method and system based on graph convolution neural network
CN111159425A (en) * 2019-12-30 2020-05-15 浙江大学 Temporal knowledge graph representation method based on historical relationship and double-graph convolution network

Non-Patent Citations (2)

* Cited by examiner, † Cited by third party
Title
Graph convolutional neural network for multi-scale feature learning; Michael Edwards et al.; Elsevier Science; pp. 1-12 *
Research on visual tracking based on deep structural feature representation learning; Zhou Ajian; Wanfang Data Knowledge Service Platform dissertation library; pp. 1-65 *

Also Published As

Publication number Publication date
CN112529071A (en) 2021-03-19

Similar Documents

Publication Publication Date Title
Perozzi et al. Don't walk, skip! online learning of multi-scale network embeddings
Sun et al. What and how: generalized lifelong spectral clustering via dual memory
CN112529071B (en) Text classification method, system, computer equipment and storage medium
CN111797321B (en) Personalized knowledge recommendation method and system for different scenes
CN112529168B (en) GCN-based attribute multilayer network representation learning method
WO2017206936A1 (en) Machine learning based network model construction method and apparatus
CN110674850A (en) Image description generation method based on attention mechanism
CN111523051A (en) Social interest recommendation method and system based on graph volume matrix decomposition
CN114372573B (en) User portrait information recognition method and device, computer equipment and storage medium
CN112598080A (en) Attention-based width map convolutional neural network model and training method thereof
CN112529069B (en) Semi-supervised node classification method, system, computer equipment and storage medium
Yang et al. Temporal-spatial three-way granular computing for dynamic text sentiment classification
Dhamdhere et al. The shapley taylor interaction index
Jia et al. Weakly supervised label distribution learning based on transductive matrix completion with sample correlations
CN110781271A (en) Semi-supervised network representation learning model based on hierarchical attention mechanism
CN116186390A (en) Hypergraph-fused contrast learning session recommendation method
Zhu et al. Generalization properties of nas under activation and skip connection search
CN112905906B (en) Recommendation method and system fusing local collaboration and feature intersection
CN112015890B (en) Method and device for generating movie script abstract
CN113592593A (en) Training and application method, device, equipment and storage medium of sequence recommendation model
CN113609337A (en) Pre-training method, device, equipment and medium of graph neural network
CN112668700A (en) Width map convolutional network model based on grouping attention and training method thereof
Xu et al. Collective vertex classification using recursive neural network
Cao et al. Implicit user relationships across sessions enhanced graph for session-based recommendation
CN114842247B (en) Characteristic accumulation-based graph convolution network semi-supervised node classification method

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant