CN111475622A - Text classification method, device, terminal and storage medium - Google Patents


Info

Publication number
CN111475622A
Authority
CN
China
Prior art keywords
text
classification
model
data
preset
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202010268806.4A
Other languages
Chinese (zh)
Inventor
王涛
周佳乐
邓健峰
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Guangdong University of Technology
Original Assignee
Guangdong University of Technology
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Guangdong University of Technology filed Critical Guangdong University of Technology
Priority to CN202010268806.4A priority Critical patent/CN111475622A/en
Publication of CN111475622A publication Critical patent/CN111475622A/en
Pending legal-status Critical Current

Classifications

    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06F - ELECTRIC DIGITAL DATA PROCESSING
    • G06F 16/00 - Information retrieval; Database structures therefor; File system structures therefor
    • G06F 16/30 - Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
    • G06F 16/33 - Querying
    • G06F 16/3331 - Query processing
    • G06F 16/334 - Query execution
    • G06F 16/3344 - Query execution using natural language analysis
    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06F - ELECTRIC DIGITAL DATA PROCESSING
    • G06F 16/00 - Information retrieval; Database structures therefor; File system structures therefor
    • G06F 16/30 - Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
    • G06F 16/35 - Clustering; Classification

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Data Mining & Analysis (AREA)
  • Databases & Information Systems (AREA)
  • Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Artificial Intelligence (AREA)
  • Computational Linguistics (AREA)
  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)

Abstract

The application provides a text classification method, a text classification device, a terminal and a storage medium. The method comprises the following steps: acquiring text data; performing word segmentation processing on the text data through a preset Bert text processing model to obtain word vectors; extracting context feature data corresponding to each word vector from the text data through a preset context feature extraction model; and inputting the context feature data into a preset capsule network classification model for topic classification and outputting a classification result. The text classification method combines Bert, GAT and a capsule network: text context features are learned through Bert preprocessing and GAT, and the features are then classified by topic through the capsule network. The dynamic routing process in the capsule network reduces the defects that back propagation through a convolutional neural network may introduce into a text classifier model and improves the accuracy of text classification, thereby solving the technical problem of the low text classification accuracy of existing deep learning methods.

Description

Text classification method, device, terminal and storage medium
Technical Field
The present application relates to the field of text classification, and in particular, to a text classification method, apparatus, terminal, and storage medium.
Background
With the development of the big data era, a large amount of text data has accumulated, covering a wide variety of category topics. In the general field, the relationships among text data can be further understood by analyzing and processing them. However, the collected texts are diverse, difficult for computers to understand, and relate to different topics; classifying and recognizing them effectively therefore has a positive effect on big data processing work.
Currently, the mainstream methods for text classification are the recurrent neural network (RNN) and the convolutional neural network (CNN). Although an RNN can capture the context characteristics of the text, it suffers from gradient explosion; a CNN loses information in its pooling layers and cannot capture the global information of the text. As a result, it is difficult for existing deep learning methods to achieve high text classification accuracy.
Disclosure of Invention
The application provides a text classification method, a text classification device, a terminal and a storage medium, which are used for solving the technical problem of low text classification accuracy of the existing deep learning method.
In view of the above, a first aspect of the present application provides a text classification method, including:
acquiring text data;
performing word segmentation processing on the text data through a preset Bert text processing model to obtain word vectors;
extracting context feature data corresponding to each word vector from the text data through a preset context feature extraction model;
and inputting the context characteristic data into a preset capsule network classification model for topic classification operation, and outputting a classification result.
Optionally, after the obtaining of the text data, the method further includes:
preprocessing the text data, wherein the preprocessing comprises removing stop words and removing punctuation marks.
Optionally, the context feature extraction model is specifically a GAT graph attention network model.
Optionally, the method further comprises:
inputting preset training sample data into an initial capsule network model, and training the initial capsule network model to obtain the capsule network classification model, wherein the training sample data is context feature sample data obtained by processing preset sample text data through the Bert text processing model and the context feature extraction model.
A second aspect of the present application provides a text classification apparatus, including:
a text acquisition unit for acquiring text data;
the word segmentation processing unit is used for carrying out word segmentation processing on the text data through a preset Bert text processing model to obtain word vectors;
the context feature extraction unit is used for extracting context feature data corresponding to each word vector from the text data through a preset context feature extraction model;
and the text classification unit is used for inputting the context characteristic data into a preset capsule network classification model to perform topic classification operation and outputting a classification result.
Optionally, the apparatus further includes:
a preprocessing unit for preprocessing the text data, wherein the preprocessing comprises removing stop words and removing punctuation marks.
Optionally, the context feature extraction model is specifically a GAT graph attention network model.
Optionally, the apparatus further comprises:
and the capsule network classification model training unit is used for inputting preset training sample data to an initial capsule network model and training the initial capsule network model to obtain the capsule network classification model, wherein the training sample data is context feature sample data obtained by processing preset sample text data through the Bert text processing model and the context feature extraction model.
A third aspect of the present application provides a terminal, comprising: a memory and a processor;
the memory is used for storing program codes corresponding to the text classification method of the first aspect of the application;
the processor is configured to execute the program code.
A fourth aspect of the present application provides a storage medium having stored therein program code corresponding to the text classification method according to the first aspect of the present application.
According to the technical scheme, the embodiment of the application has the following advantages:
the application provides a text classification method, which comprises the following steps: acquiring text data; performing word segmentation processing on the text data through a preset Bert text processing model to obtain word vectors; extracting context feature data corresponding to each word vector from the text data through a preset context feature extraction model; and inputting the context characteristic data into a preset capsule network classification model for topic classification operation, and outputting a classification result.
According to the above text classification method based on the capsule network, a method combining Bert, GAT and the capsule network is adopted: text context features are learned through Bert preprocessing and GAT, and the features are then classified by topic through the capsule network. The dynamic routing process in the capsule network reduces the defects that back propagation through a convolutional neural network may introduce into a text classifier model and improves the accuracy of text classification, thereby solving the technical problem of the low text classification accuracy of existing deep learning methods.
Drawings
In order to more clearly illustrate the embodiments of the present application or the technical solutions in the prior art, the drawings needed to be used in the description of the embodiments or the prior art will be briefly introduced below, and it is obvious that the drawings in the following description are only some embodiments of the present application, and it is obvious for those skilled in the art that other drawings can be obtained according to the drawings without inventive exercise.
Fig. 1 is a schematic flowchart of a text classification method according to a first embodiment of the present application;
fig. 2 is a schematic flowchart of a text classification method according to a second embodiment of the present application;
fig. 3 is a schematic structural diagram of a first embodiment of a text classification apparatus provided in the present application;
FIG. 4 is a flowchart of GAT extraction of text context features;
fig. 5 is a diagram illustrating topic classification by the capsule network.
Detailed Description
The embodiment of the application provides a text classification method, a text classification device, a terminal and a storage medium, which are used for solving the technical problem of low text classification accuracy of the existing deep learning method.
In order to make the objects, features and advantages of the present invention more apparent and understandable, the technical solutions in the embodiments of the present application will be clearly and completely described below with reference to the accompanying drawings in the embodiments of the present application, and it is apparent that the embodiments described below are only a part of the embodiments of the present application, and not all of the embodiments. All other embodiments, which can be derived by a person skilled in the art from the embodiments given herein without making any creative effort, shall fall within the protection scope of the present application.
Referring to fig. 1, a first aspect of the present application provides a text classification method, including:
step 101, acquiring text data.
Step 102, performing word segmentation processing on the text data through a preset Bert text processing model to obtain word vectors.
It should be noted that, in this embodiment, after the text data to be classified is obtained, the Bert text processing model first performs word segmentation on the text data, including word embedding, and converts the input text into vectors for output, so as to obtain word vectors.
The Bert model first encodes the input, converting it into the encoding format required by the model, and uses the auxiliary markers [CLS] and [SEP] to indicate the beginning and the separation of sentences.
The Bert text processing model adopted in this embodiment is a sentence-level language model. Adopting it further increases the generalization capability of the word vector model, fully describes character-level, word-level, sentence-level and even inter-sentence relationship features, and greatly improves the efficiency of the model representation.
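As an illustrative sketch of this step only, the following Python fragment shows how input text could be converted into [CLS]/[SEP]-delimited token IDs and contextual word vectors with a publicly available Bert implementation. The HuggingFace transformers library and the "bert-base-chinese" checkpoint are assumptions made for the example and are not prescribed by this embodiment.

```python
# Illustrative sketch only: obtaining word vectors from a pre-trained Bert model.
# The library (HuggingFace transformers) and the checkpoint name are assumptions,
# not part of the claimed method.
import torch
from transformers import BertTokenizer, BertModel

tokenizer = BertTokenizer.from_pretrained("bert-base-chinese")
model = BertModel.from_pretrained("bert-base-chinese")

text = "大数据时代积累了大量文本数据"
# The tokenizer adds the auxiliary markers [CLS] and [SEP] that mark the
# beginning and the separation of sentences.
inputs = tokenizer(text, return_tensors="pt")

with torch.no_grad():
    outputs = model(**inputs)

# One contextual word vector per token, shape: (1, sequence_length, hidden_size)
word_vectors = outputs.last_hidden_state
print(word_vectors.shape)
```

These word vectors are what the subsequent context feature extraction model receives as input.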
Step 103, extracting context feature data corresponding to each word vector from the text data through a preset context feature extraction model.
It should be noted that, as shown in fig. 4, the context feature extraction model adopted in this embodiment is specifically a GAT graph attention network model. GAT aggregates neighbor nodes through an attention mechanism, so that different neighbors receive different weights and more important content receives a larger attention weight, which greatly improves the expressive power of the graph neural network model. The effect is as follows: the knowledge is represented as a graph, an attention mechanism performs attention calculation on each input word vector, and each node is assigned a different weight. The weight parameters reflect the influence of neighborhood nodes on the central node, the dependency relationships between context sentences are learned, and the context features in the text are extracted, thereby improving the semantic expression capability of the text and moving the understanding of the text from shallow analysis toward deep fusion. The attention paid to different sentences in the text is revealed, information receiving little attention can be filtered out quickly, and the complexity of the computation task is reduced: nodes with a larger effect are attended to, while nodes with a smaller effect are ignored. When local information is processed, the global information can still be attended to, so that node features can be extracted more accurately.
More specifically, this embodiment addresses the shortcomings of existing graph-convolution and similar approaches by using hidden self-attention layers. By stacking such layers, the nodes in a layer can attend to the features of their neighboring nodes, assign different weights to different neighbors, and convert the vectors obtained from the previous layer into the nodes of the graph. The structure mainly comprises an encoder and a decoder. The encoder contains several stacked self-attention layers; the decoder has the same structure as the encoder, the difference being that an additional attention layer is inserted between the two, which focuses on the input portion corresponding to the word currently being decoded. The context features of a sentence are mainly represented through the self-attention mechanism, and the resulting vectors are converted into node representations. Using the attention layer of GAT, the GAT computation is divided into two steps: calculating the attention coefficients, and weighted summation.
Calculating an attention coefficient:
the input being a set of node feature vectors
Figure BDA0002442270360000051
N represents the number of nodes, and F is the number of features of each node. This layer generates a new set of node characteristics
Figure BDA0002442270360000052
As an output, at least one learnable linear transformation is required to obtain sufficient expressiveness to transform the input features into higher-level features, for this purpose, as an initialization step, a shared parameter is a weight matrix W ∈ RF'×FIs applied at each node. The self-attention mechanism, a shared attention mechanism a, is then used to calculate the attention factor at the node
Figure BDA0002442270360000053
The above formula represents the importance of the characteristics of node j for node i. In its most general expression, j will be the first order neighbor of i. To make the factors of different nodes easy to compare, we use flexible maximum function normalization
Figure BDA0002442270360000054
Once the normalized attention factor is obtained, it is used to compute a linear combination of features associated with it, with the result being the final output for each node.
Figure BDA0002442270360000055
Weighted summation:
after obtaining the normalized attention coefficient, the normalized values are used to calculate the linear combination of the corresponding features as the final output feature of each vertex, and the output quantity is input into the next layer of network. The method ingeniously depends on the context information, so that the task and the text are combined more closely, the accuracy is improved, and a better feature extraction effect is shown.
For example, when a sentence is input, attention is calculated between each word in the sentence and all the words in the sentence. The aim is to learn the word dependency relationships within the sentence and capture its internal structure. For long-range dependencies, the maximum path length is only 1 no matter how far apart two words are, because self-attention computes attention between every word and all other words; long-range dependencies can therefore be captured.
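To make the "every word attends to every word" property concrete, the short sketch below computes a plain scaled dot-product self-attention over the word vectors of one sentence. Using the word vectors directly as queries, keys and values (no learned projections) is a simplification made only for illustration.

```python
import math
import torch
import torch.nn.functional as F

def self_attention(word_vectors):
    """word_vectors: (seq_len, d). Every word attends to every word, so the
    maximum path length between any two positions is 1."""
    d = word_vectors.size(-1)
    q = k = v = word_vectors                        # illustrative simplification: Q = K = V
    scores = q @ k.transpose(0, 1) / math.sqrt(d)   # (seq_len, seq_len) score for each word pair
    weights = F.softmax(scores, dim=-1)             # attention of each word over all words
    return weights @ v                              # context-aware representation of each word
```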
Step 104, inputting the context feature data into a preset capsule network classification model to perform topic classification and outputting a classification result.
It should be noted that, as shown in fig. 5, the capsule network classification model of this embodiment uses the compression (squashing) function of the capsule network, shown in equation (1), so that the module length (norm) of each capsule is normalized to lie between 0 and 1 and represents a classification probability; the number of classification capsules is n.

v_j = (‖s_j‖² / (1 + ‖s_j‖²)) · (s_j / ‖s_j‖)    (1)
where v_j is the final output vector of the capsule and s_j (j ∈ [0, m]) is the output vector of the low-level capsules. The text vector representation obtained from the previous layer is used as the input of the capsule network, and training the capsule network text classification model specifically comprises the following steps:
extracting local features of the text vector;
performing feature reconstruction on the text features, and mapping the text features from a low-dimensional space to a higher-dimensional space;
classifying by using a capsule layer, wherein the input neuron vectors of the capsule layer flow from the input capsules to the output capsules through weighting, coupling, squashing and dynamic routing;
the calculation formulas of the capsule layers are shown in (2) - (3).
s_j = Σ_i c_ij · u_(j|i)    (2)
u_(j|i) = w_ij · l_i    (3)
where w_ij is a weight matrix, u_(j|i) is the output vector of the low-level capsule, l_i (i ∈ [0, N]) is an input to the capsule network, and c_ij is a coupling weight parameter computed by the dynamic routing algorithm included in the capsule network. More specifically, the inputs of the dynamic routing algorithm of this embodiment are the vectors u_(j|i) and the number of iterations r, and the output is the classification capsule vector v_j.
First, initialization is performed: for all low-level capsules i and high-level capsules j, the parameters b_ij are set to 0. The input neuron vectors of the capsule layer flow from the input capsules to the output capsules through weighting, coupling, squashing and dynamic routing. Then, in each iteration, for each low-level capsule i the coupling coefficients are computed as c_i = softmax(b_i); for each high-level capsule j the total input is s_j = Σ_i c_ij · u_(j|i), and the high-level capsule output is squashed as v_j = squash(s_j); finally the parameters are updated as b_ij = b_ij + u_(j|i) · v_j.
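The squashing function of equation (1) and the routing procedure just described can be sketched as follows; PyTorch, the tensor shapes and the default of three routing iterations are assumptions of the example.

```python
import torch
import torch.nn.functional as F

def squash(s, dim=-1, eps=1e-8):
    """Equation (1): compress the norm of each capsule vector into (0, 1)."""
    norm_sq = (s * s).sum(dim=dim, keepdim=True)
    return (norm_sq / (1.0 + norm_sq)) * s / torch.sqrt(norm_sq + eps)

def dynamic_routing(u_hat, num_iterations=3):
    """u_hat: prediction vectors u_(j|i), shape (num_in, num_out, dim_out).
    Returns the classification capsule vectors v_j, shape (num_out, dim_out)."""
    num_in, num_out, _ = u_hat.shape
    b = torch.zeros(num_in, num_out)                  # b_ij initialized to 0
    for _ in range(num_iterations):
        c = F.softmax(b, dim=-1)                      # c_i = softmax(b_i), coupling coefficients
        s = (c.unsqueeze(-1) * u_hat).sum(dim=0)      # s_j = sum_i c_ij * u_(j|i)
        v = squash(s)                                 # v_j = squash(s_j)
        b = b + (u_hat * v.unsqueeze(0)).sum(dim=-1)  # b_ij = b_ij + u_(j|i) . v_j
    return v
```

The norm of each v_j can then be taken as the classification probability of the corresponding topic capsule, as stated above.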
To address the problems that traditional networks suffer from gradient explosion and cannot obtain the global information of the text in text classification, the invention adopts a method combining Bert, GAT and the capsule network: text context features are learned through Bert preprocessing and GAT, and the features are then classified by topic through the capsule network. Through the dynamic routing process in the capsule network, the invention reduces some of the defects that back propagation through a convolutional neural network may introduce into a text classifier model, obtains the classification probabilities from the normalization performed by the capsule network, and finally achieves a better effect.
The above is a detailed description of a first embodiment of a text classification method provided in the present application, and the following is a detailed description of a second embodiment of a text classification method provided in the present application.
Referring to fig. 2, on the basis of the text classification method provided by the first embodiment, this embodiment of the present application further includes, after step 101:
step 201, preprocessing the text data, wherein the preprocessing includes removing stop words and removing punctuation marks.
Further, the method can also comprise the following steps:
step 200, inputting preset training sample data into the initial capsule network model, and training the initial capsule network model to obtain a capsule network classification model, wherein the training sample data is context feature sample data obtained by processing preset sample text data through a Bert text processing model and a context feature extraction model.
The above is a detailed description of the second embodiment of the text classification method provided in the present application, and the following is a detailed description of the first embodiment of the text classification device provided in the present application.
Referring to fig. 3, a second aspect of the present application provides a text classification apparatus, including:
a text acquisition unit 301 for acquiring text data;
the word segmentation processing unit 302 is configured to perform word segmentation processing on the text data through a preset Bert text processing model to obtain a word vector;
a context feature extraction unit 303, configured to extract context feature data corresponding to each word vector from the text data through a preset context feature extraction model;
and the text classification unit 304 is used for inputting the context feature data into a preset capsule network classification model to perform topic classification operation and outputting a classification result.
Further, after the text data is acquired, the apparatus further comprises:
a preprocessing unit 305, configured to perform preprocessing on the text data, where the preprocessing includes removing stop words and removing punctuation marks.
Further, the apparatus further comprises:
and the capsule network classification model training unit 306 is configured to input preset training sample data to the initial capsule network model, train the initial capsule network model, and obtain the capsule network classification model, where the training sample data is context feature sample data obtained by processing preset sample text data through a Bert text processing model and a context feature extraction model.
The above is a detailed description of a first embodiment of a text classification apparatus provided in the present application, and the following is a detailed description of a terminal and a storage medium corresponding to a text classification method provided in the present application.
A fourth embodiment of the present application provides a terminal, including: a memory and a processor;
the memory is used for storing program codes corresponding to the text classification method of the first aspect of the application;
the processor is used for executing the program codes.
A fifth embodiment of the present application provides a storage medium having stored therein program code corresponding to the text classification method of the first aspect of the present application.
It is clear to those skilled in the art that, for convenience and brevity of description, the specific working processes of the above-described systems, apparatuses and units may refer to the corresponding processes in the foregoing method embodiments, and are not described herein again.
In the several embodiments provided in the present application, it should be understood that the disclosed system, apparatus and method may be implemented in other manners. For example, the above-described apparatus embodiments are merely illustrative, and for example, the division of the units is only one logical division, and other divisions may be realized in practice, for example, a plurality of units or components may be combined or integrated into another system, or some features may be omitted, or not executed. In addition, the shown or discussed mutual coupling or direct coupling or communication connection may be an indirect coupling or communication connection through some interfaces, devices or units, and may be in an electrical, mechanical or other form.
The terms "first," "second," "third," "fourth," and the like in the description of the application and the above-described figures, if any, are used for distinguishing between similar elements and not necessarily for describing a particular sequential or chronological order. It is to be understood that the data so used is interchangeable under appropriate circumstances such that the embodiments of the application described herein are, for example, capable of operation in sequences other than those illustrated or otherwise described herein. Furthermore, the terms "comprises," "comprising," and "having," and any variations thereof, are intended to cover a non-exclusive inclusion, such that a process, method, system, article, or apparatus that comprises a list of steps or elements is not necessarily limited to those steps or elements expressly listed, but may include other steps or elements not expressly listed or inherent to such process, method, article, or apparatus.
The units described as separate parts may or may not be physically separate, and parts displayed as units may or may not be physical units, may be located in one place, or may be distributed on a plurality of network units. Some or all of the units can be selected according to actual needs to achieve the purpose of the solution of the embodiment.
In addition, functional units in the embodiments of the present invention may be integrated into one processing unit, or each unit may exist alone physically, or two or more units are integrated into one unit. The integrated unit can be realized in a form of hardware, and can also be realized in a form of a software functional unit.
The integrated unit, if implemented in the form of a software functional unit and sold or used as a stand-alone product, may be stored in a computer readable storage medium. Based on such understanding, the technical solution of the present invention may be embodied in the form of a software product, which is stored in a storage medium and includes instructions for causing a computer device (which may be a personal computer, a server, or a network device) to execute all or part of the steps of the method according to the embodiments of the present invention. And the aforementioned storage medium includes: a U-disk, a removable hard disk, a Read-only Memory (ROM), a Random Access Memory (RAM), a magnetic disk or an optical disk, and other various media capable of storing program codes.
The above embodiments are only used for illustrating the technical solutions of the present application, and not for limiting the same; although the present application has been described in detail with reference to the foregoing embodiments, it should be understood by those of ordinary skill in the art that: the technical solutions described in the foregoing embodiments may still be modified, or some technical features may be equivalently replaced; and such modifications or substitutions do not depart from the spirit and scope of the corresponding technical solutions in the embodiments of the present application.

Claims (10)

1. A method of text classification, comprising:
acquiring text data;
performing word segmentation processing on the text data through a preset Bert text processing model to obtain word vectors;
extracting context feature data corresponding to each word vector from the text data through a preset context feature extraction model;
and inputting the context characteristic data into a preset capsule network classification model for topic classification operation, and outputting a classification result.
2. The method of claim 1, wherein the obtaining text data further comprises:
and preprocessing the text data, wherein the preprocessing comprises removing stop words and removing punctuation marks.
3. The method according to claim 1, wherein the context feature extraction model is a GAT graph attention network model.
4. The method of claim 1, further comprising:
inputting preset training sample data into an initial capsule network model, and training the initial capsule network model to obtain the capsule network classification model, wherein the training sample data is context feature sample data obtained by processing preset sample text data through the Bert text processing model and the context feature extraction model.
5. A text classification apparatus, comprising:
a text acquisition unit for acquiring text data;
the word segmentation processing unit is used for carrying out word segmentation processing on the text data through a preset Bert text processing model to obtain word vectors;
the context feature extraction unit is used for extracting context feature data corresponding to each word vector from the text data through a preset context feature extraction model;
and the text classification unit is used for inputting the context characteristic data into a preset capsule network classification model to perform topic classification operation and outputting a classification result.
6. The apparatus of claim 5, wherein after the acquiring the text data, the apparatus further comprises:
and the preprocessing unit is used for preprocessing the text data, wherein the preprocessing comprises the removal of stop words and the removal of punctuation marks.
7. The apparatus according to claim 5, wherein the context feature extraction model is a GAT attention network model.
8. The apparatus for classifying a text according to claim 5, further comprising:
and the capsule network classification model training unit is used for inputting preset training sample data to an initial capsule network model and training the initial capsule network model to obtain the capsule network classification model, wherein the training sample data is context feature sample data obtained by processing preset sample text data through the Bert text processing model and the context feature extraction model.
9. A terminal, comprising: a memory and a processor;
the memory is used for storing program codes corresponding to the text classification method of any one of claims 1 to 4;
the processor is configured to execute the program code.
10. A storage medium storing a program code corresponding to the text classification method according to any one of claims 1 to 4.
CN202010268806.4A 2020-04-08 2020-04-08 Text classification method, device, terminal and storage medium Pending CN111475622A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202010268806.4A CN111475622A (en) 2020-04-08 2020-04-08 Text classification method, device, terminal and storage medium

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202010268806.4A CN111475622A (en) 2020-04-08 2020-04-08 Text classification method, device, terminal and storage medium

Publications (1)

Publication Number Publication Date
CN111475622A true CN111475622A (en) 2020-07-31

Family

ID=71750189

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202010268806.4A Pending CN111475622A (en) 2020-04-08 2020-04-08 Text classification method, device, terminal and storage medium

Country Status (1)

Country Link
CN (1) CN111475622A (en)

Patent Citations (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN110457710A (en) * 2019-08-19 2019-11-15 电子科技大学 It is a kind of based on Dynamic routing mechanisms machine reading understand network model, method, storage medium and terminal
CN110544470A (en) * 2019-09-11 2019-12-06 拉扎斯网络科技(上海)有限公司 voice recognition method and device, readable storage medium and electronic equipment
CN110943981A (en) * 2019-11-20 2020-03-31 中国人民解放军战略支援部队信息工程大学 Cross-architecture vulnerability mining method based on hierarchical learning

Non-Patent Citations (3)

* Cited by examiner, † Cited by third party
Title
JIANFENG DENG et al.: "Self-attention-based BiGRU and capsule network for named entity recognition", arXiv:2002.00735, page 1 *
许晶航 (XU Jinghang) et al.: "基于图注意力网络的因果关系抽取" (Causal relation extraction based on graph attention networks), 计算机研究与发展 (Journal of Computer Research and Development), pages 159-174 *

Cited By (15)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN112163069A (en) * 2020-09-27 2021-01-01 广东工业大学 Text classification method based on graph neural network node feature propagation optimization
CN112163069B (en) * 2020-09-27 2024-04-12 广东工业大学 Text classification method based on graph neural network node characteristic propagation optimization
CN112232058B (en) * 2020-10-15 2022-11-04 济南大学 False news identification method and system based on deep learning three-layer semantic extraction framework
CN112232058A (en) * 2020-10-15 2021-01-15 济南大学 False news identification method and system based on deep learning three-layer semantic extraction framework
CN112380861A (en) * 2020-11-13 2021-02-19 北京京东尚科信息技术有限公司 Model training method and device and intention identification method and device
CN112131391A (en) * 2020-11-25 2020-12-25 江苏电力信息技术有限公司 Power supply service client appeal text classification method based on capsule network
CN112131391B (en) * 2020-11-25 2021-09-17 江苏电力信息技术有限公司 Power supply service client appeal text classification method based on capsule network
CN112463972A (en) * 2021-01-28 2021-03-09 成都数联铭品科技有限公司 Sample classification method based on class imbalance
CN112463972B (en) * 2021-01-28 2021-05-18 成都数联铭品科技有限公司 Text sample classification method based on class imbalance
CN113158679A (en) * 2021-05-20 2021-07-23 广东工业大学 Marine industry entity identification method and device based on multi-feature superposition capsule network
CN113158679B (en) * 2021-05-20 2023-07-04 广东工业大学 Marine industry entity identification method and device based on multi-feature superposition capsule network
CN113553848B (en) * 2021-07-19 2024-02-02 北京奇艺世纪科技有限公司 Long text classification method, system, electronic device, and computer-readable storage medium
CN113553848A (en) * 2021-07-19 2021-10-26 北京奇艺世纪科技有限公司 Long text classification method, system, electronic equipment and computer readable storage medium
CN114461804A (en) * 2022-02-10 2022-05-10 电子科技大学 Text classification method, classifier and system based on key information and dynamic routing
CN114461804B (en) * 2022-02-10 2023-04-07 电子科技大学 Text classification method, classifier and system based on key information and dynamic routing

Similar Documents

Publication Publication Date Title
CN111475622A (en) Text classification method, device, terminal and storage medium
CN107085581B (en) Short text classification method and device
CN109948149B (en) Text classification method and device
CN111753189A (en) Common characterization learning method for few-sample cross-modal Hash retrieval
CN111159485B (en) Tail entity linking method, device, server and storage medium
CN110969020A (en) CNN and attention mechanism-based Chinese named entity identification method, system and medium
CN112052684A (en) Named entity identification method, device, equipment and storage medium for power metering
CN106886580A (en) A kind of picture feeling polarities analysis method based on deep learning
CN112308115B (en) Multi-label image deep learning classification method and equipment
CN109471944A (en) Training method, device and the readable storage medium storing program for executing of textual classification model
CN110321805B (en) Dynamic expression recognition method based on time sequence relation reasoning
CN110968725B (en) Image content description information generation method, electronic device and storage medium
CN112231477A (en) Text classification method based on improved capsule network
CN113051914A (en) Enterprise hidden label extraction method and device based on multi-feature dynamic portrait
CN112101031B (en) Entity identification method, terminal equipment and storage medium
CN110969023B (en) Text similarity determination method and device
CN113836992A (en) Method for identifying label, method, device and equipment for training label identification model
CN107392229B (en) Network representation method based on most social relationship extraction
CN114691864A (en) Text classification model training method and device and text classification method and device
CN110569355B (en) Viewpoint target extraction and target emotion classification combined method and system based on word blocks
CN113177118A (en) Text classification model, text classification method and device
CN113536015A (en) Cross-modal retrieval method based on depth identification migration
CN113486143A (en) User portrait generation method based on multi-level text representation and model fusion
CN107122378B (en) Object processing method and device and mobile terminal
CN113076744A (en) Cultural relic knowledge relation extraction method based on convolutional neural network

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination