CN114357166A - Text classification method based on deep learning - Google Patents

Text classification method based on deep learning

Info

Publication number
CN114357166A
Authority
CN
China
Prior art keywords
training
input
layer
lstm
loss
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Granted
Application number
CN202111662807.8A
Other languages
Chinese (zh)
Other versions
CN114357166B (en)
Inventor
张丽
王月怡
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Beijing University of Technology
Original Assignee
Beijing University of Technology
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Beijing University of Technology filed Critical Beijing University of Technology
Priority to CN202111662807.8A priority Critical patent/CN114357166B/en
Publication of CN114357166A publication Critical patent/CN114357166A/en
Application granted granted Critical
Publication of CN114357166B publication Critical patent/CN114357166B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical

Landscapes

  • Machine Translation (AREA)

Abstract

The invention discloses a text classification method based on deep learning. The method first removes noise, including punctuation marks and special characters; builds a dictionary and constructs a data set from the dictionary; performs word embedding and adversarial training; trains a bidirectional long short-term memory network layer; trains an attention mechanism layer; and computes the output result. The method brings adversarial training, which is widely used in the image field, into natural language processing: adversarial perturbations added inside the deep neural network steer training in the direction that increases the network's loss, and the derivative of the loss with respect to the input is used to update the parameters. This reduces the model's sensitivity to adversarial perturbations, effectively alleviates overfitting, and improves the text classification effect.

Description

Text classification method based on deep learning
Technical Field
The invention belongs to the field of natural language processing. Text classification is one of the most basic and essential technologies in natural language processing, and accurate and efficient text classification is of great significance for natural language processing tasks. The invention performs accurate text classification using a deep learning algorithm.
Background
Among the many fields in the development of artificial intelligence, natural language processing is one of the fastest growing and most widely applied. Natural language processing is the machine processing of human language; it aims to teach machines how to process and understand human language and thereby establish a simple communication channel between humans and machines. Text classification is one of the most basic and essential technologies in natural language processing: it converts text and then automatically assigns the converted text to one or more specified categories. In the era of big data, text classification technology based on deep learning algorithms can carry out classification tasks automatically and efficiently and greatly reduces cost. The text classification task plays an important role in many fields such as sentiment analysis, public opinion analysis, domain recognition and intent recognition.
The text classification task comprises two parts: text representation and text classification. Text representation has evolved from symbolic representation to implicit semantic representation and includes text preprocessing and text representation techniques. Text preprocessing refers to the fact that, in most cases, text contains a certain amount of noise and useless parts, so it must be preprocessed before classification; preprocessing usually includes noise removal, stop word removal, Chinese word segmentation, English case normalization and similar steps. Text representation techniques address the fact that raw natural language consists of characters that only humans can read and that a computer cannot directly understand or process, so text written in natural language must be converted into a numerical representation the computer can handle. Such techniques include representations based on one-hot encoding, vector space models, distributed word vectors, and so on.
Current deep-learning text classification models include, first, models based on convolutional neural networks. Second, there are classification models based on recurrent neural networks, which are designed to handle sequence information better: they take sequence data as input, recurse along the evolution direction of the sequence and chain all nodes together, so they can effectively identify sequential features and use earlier patterns to predict what may come next, solving the problem that traditional neural networks cannot capture the correlation between inputs. However, because of the RNN feedback loop, the gradient can quickly diverge to infinity or quickly shrink to zero, i.e. the problems of exploding and vanishing gradients; in both cases the network stops learning anything useful. Exploding gradients can be handled by gradient clipping, whereas vanishing gradients require more complex RNN basic units. Using such units leads to the long short-term memory network and the gated recurrent unit model, both of which use gate mechanisms to pass information selectively and to update or keep historical information, alleviating the gradient problem to some extent. There is also the attention mechanism, which can give different degrees of attention to important and secondary content; as an auxiliary technique commonly used in deep learning, it makes the neural network focus more on learning from certain specific neurons.
Disclosure of Invention
The invention addresses the problem that existing deep-learning text classification models introduce no noise during training, so their robustness needs to be strengthened.
The technical scheme adopted by the invention provides a text classification model based on deep learning that introduces noise data during model training. To achieve this purpose, the technical scheme comprises the following steps:
step 1, preprocessing the text.
Remove noise from the text, including punctuation marks and special characters. Build a dictionary and construct a data set from the dictionary.
Step 2, word embedding and adversarial training.
Step 2.1: use a word embedding scheme based on pre-trained word vectors, taking word + word context features as the pre-trained word vectors, and adapt to the current context through fine-tuning.
Step 2.2: represent the new sample input as X + δ, where X is the original input representation and δ is the perturbation superimposed on the input. The perturbation is calculated as δ = α · sign(g), where g denotes the gradient of the loss function Loss with respect to the input X. The perturbation δ superimposed on the sample X is obtained by passing the input through the neural network function f_θ(·), comparing the resulting loss with the label y, and finding the δ that maximizes the loss.
Step 2.3: with respect to the loss value obtained in the previous step, optimize the neural network by minimizing that loss.
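A minimal sketch of the perturbation in steps 2.2 and 2.3, assuming a PyTorch implementation; the function name fgsm_delta and the scale parameter alpha are illustrative and not specified by the patent:

```python
import torch

def fgsm_delta(X, loss, alpha=1.0):
    # X: embedded input with requires_grad=True; loss: scalar Loss of f_theta(X) compared with label y
    # g: gradient of the loss function Loss with respect to the input X
    g, = torch.autograd.grad(loss, X, retain_graph=True)
    # delta = alpha * sign(g): the perturbation that locally maximizes the loss
    return alpha * torch.sign(g)

# Step 2.3 then updates the network parameters by minimizing the loss on the perturbed input X + delta.
```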
Step 3, training the bidirectional long short-term memory network layer.
The word embedding result is fed into the bidirectional long short-term memory (Bi-LSTM) layer, which combines a forward LSTM and a backward LSTM so that bidirectional semantic dependencies are captured better. The i-th hidden state h_i of the Bi-LSTM is formed by concatenating h_i→ and h_i←, which carry the information of the forward and backward directions respectively. Each LSTM layer consists of a number of cells, and the output H_t at any time t is computed from H_{t-1}, C_{t-1} and X_t, where C_{t-1} is the candidate cell state at time t-1 and X_t is the input at time step t.
Step 4, training the attention mechanism layer.
The input of the attention layer is H = [h_1, h_2, ..., h_T], where T represents the length of the input sequence. The attention score M is computed as M = tanh(H), and the probability distribution α of the attention scores is computed as α = softmax(ω^T M), where ω^T is a trainable parameter.
The output r of the attention layer is obtained by matrix multiplication of H and α^T, i.e. r = H α^T.
Step 5, calculating the output result.
A fully connected layer maps the extracted features to the specific categories. The features extracted by the two LSTM layers are concatenated, the feature information is mapped to each category by multiplying it by a weight matrix and adding a bias term, and the probabilities are finally obtained through a Softmax function, calculated as Label = softmax(FC(A)), where A = [A_0, A_1, ..., A_i] is the input feature and i is the dimension of the input feature. C = [C_0, C_1, ..., C_n] is the score of each category obtained after the features pass through the fully connected layer, where n denotes the number of categories. The category scores C_0 to C_n are then passed through the Softmax function to obtain the probability distribution L over the categories.
The method brings adversarial training, which is widely used in the image field, into natural language processing: adversarial perturbations added inside the deep neural network steer training in the direction that increases the network's loss, and the derivative of the loss with respect to the input is used to update the parameters. This reduces the model's sensitivity to adversarial perturbations, effectively alleviates overfitting, and improves the text classification effect.
Drawings
FIG. 1 is a flow chart of an embodiment of the present invention.
Detailed Description
The flow chart of an embodiment is shown in fig. 1, and comprises the following steps:
(1) Text preprocessing
This includes cleaning up noise, i.e., removing punctuation marks, special characters and similar noise, and then building a dictionary and constructing a data set from the dictionary.
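A hedged sketch of this preprocessing step in Python; the cleaning regular expression, whitespace tokenization and the special tokens <pad>/<unk> are assumptions made for illustration, not details given in the patent:

```python
import re
from collections import Counter

def clean(text):
    # remove punctuation marks and special characters, keeping word characters and spaces
    return re.sub(r"[^\w\s]", "", text)

def build_vocab(corpus, min_freq=1):
    # dictionary construction: map every sufficiently frequent token to an integer id
    counter = Counter(tok for line in corpus for tok in clean(line).split())
    vocab = {"<pad>": 0, "<unk>": 1}
    for tok, freq in counter.items():
        if freq >= min_freq:
            vocab[tok] = len(vocab)
    return vocab

def encode(text, vocab):
    # data set construction: turn cleaned text into id sequences according to the dictionary
    return [vocab.get(tok, vocab["<unk>"]) for tok in clean(text).split()]
```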
(2) Word embedding and FGSM attack layer
Word embedding maps simple word IDs to dense vectors. The word is the basic unit a deep learning model processes, so text written in natural language must first be tokenized and converted into a numerical vector representation. For a given text of T words, the purpose of the word embedding layer is to represent each word as a vector of appropriate dimension.
On top of the word embedding, the FGSM method adds a perturbation along the gradient to generate adversarial samples. The adversarial samples are fed into the subsequent processing layers in the same form as the original samples, and the model is trained by optimizing the sum of the loss functions of the two kinds of samples. FGSM aligns the direction of the perturbation with the direction of gradient ascent; moving along the gradient means the increase in the loss is maximized.
After the perturbation is computed, it is added to the Embedding output to complete the adversarial training of the word embedding part.
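An illustrative training step that adds the FGSM perturbation to the embedding output and optimizes the sum of the clean and adversarial losses, assuming PyTorch; model.embedding, model.classify and the optimizer are placeholder names, not the patent's own interfaces:

```python
import torch
import torch.nn.functional as F

def adversarial_train_step(model, optimizer, token_ids, labels, alpha=1.0):
    optimizer.zero_grad()
    emb = model.embedding(token_ids)           # word embedding output X
    emb.retain_grad()                          # keep the gradient on this non-leaf tensor
    loss_clean = F.cross_entropy(model.classify(emb), labels)
    loss_clean.backward(retain_graph=True)     # populates emb.grad

    delta = alpha * torch.sign(emb.grad)       # FGSM perturbation added to the embedding
    loss_adv = F.cross_entropy(model.classify(emb + delta), labels)
    loss_adv.backward()                        # gradients of both losses accumulate

    optimizer.step()                           # optimize the sum of the two losses
    return loss_clean.item() + loss_adv.item()
```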
(3) Bidirectional LSTM layer
Since the semantic information carried by a word in a text is related not only to the preceding text but also to the following text, a unidirectional LSTM ignores important information from one of the two sides. Learning the text from front to back and from back to front at the same time extracts its semantic information better and takes the specific context into account. The bidirectional long short-term memory network combines a forward LSTM and a backward LSTM: after the word vectors are obtained, the bidirectional LSTM layer concatenates the forward and backward hidden layers, and the output matrix H = [h_1, h_2, ..., h_T] is obtained by multiplying the current cell state by the weight matrix of the output gate.
To prevent overfitting, i.e. high prediction accuracy on the training set but low accuracy on the test set, the bidirectional LSTM layer is trained in combination with Dropout and the parameter optimization algorithm: in each iteration, hidden-layer neurons are temporarily dropped with a certain probability, the resulting thinned network is trained, and the parameters of the retained neurons are updated.
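A sketch of the bidirectional LSTM layer with Dropout, assuming PyTorch; the embedding and hidden dimensions and the dropout rate are illustrative assumptions:

```python
import torch.nn as nn

class BiLSTMEncoder(nn.Module):
    def __init__(self, vocab_size, embed_dim=300, hidden_dim=128, dropout=0.5):
        super().__init__()
        self.embedding = nn.Embedding(vocab_size, embed_dim, padding_idx=0)
        # forward LSTM and backward LSTM combined into one bidirectional layer
        self.bilstm = nn.LSTM(embed_dim, hidden_dim, batch_first=True, bidirectional=True)
        # Dropout temporarily discards hidden units with a certain probability during training
        self.dropout = nn.Dropout(dropout)

    def forward(self, token_ids):
        emb = self.embedding(token_ids)   # (batch, T, embed_dim)
        H, _ = self.bilstm(emb)           # (batch, T, 2*hidden_dim): h_i = [h_i_forward ; h_i_backward]
        return self.dropout(H)
```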
(4) Attention layer
The main idea of the attention mechanism is to mimic the way humans observe things, i.e. a mechanism that aligns internal experience with external sensation to increase the precision of observation in particular regions. In text classification, certain sentences contain key words related to the category information, while the other words in those sentences are context words whose contribution is far smaller than that of the key words. The attention mechanism determines which words in the whole sentence deserve the most attention, allowing the model to extract more discriminative features from the key words.
After the output matrix H = [h_1, h_2, ..., h_T] of the bidirectional LSTM layer is obtained, the attention layer learns a weight distribution over the vector representations of the individual time steps, and then performs a weighted summation according to this distribution to obtain a vector representation h_i of the current time step i that carries richer key information.
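A sketch of the attention layer following the formulas above (M = tanh(H), α = softmax(ω^T M), r = weighted sum of the hidden states), assuming PyTorch and batch-first tensors; the dimension handling is an assumption:

```python
import torch
import torch.nn as nn

class AttentionPool(nn.Module):
    def __init__(self, hidden_dim):
        super().__init__()
        self.omega = nn.Parameter(torch.randn(hidden_dim))  # trainable parameter omega

    def forward(self, H):                             # H: (batch, T, hidden_dim)
        M = torch.tanh(H)                             # attention scores M
        alpha = torch.softmax(M @ self.omega, dim=1)  # (batch, T) probability distribution alpha
        r = torch.einsum("bth,bt->bh", H, alpha)      # weighted sum of the hidden states
        return r
```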
(5) Classification
The main role of the preceding bidirectional LSTM and attention layers is to extract features from the problem text data; the fully connected layer maps the extracted features to the specific categories. Its input is formed by concatenating the features extracted by two bidirectional LSTM layers of different depths. The feature information is mapped to each category by multiplying it by a weight matrix and adding a bias term, and the probability p of the data for each category is finally obtained through a Softmax function, giving the final classification result.
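A minimal sketch of the classification head (Label = softmax(FC(A))), assuming PyTorch; the feature dimension and class count are placeholders:

```python
import torch.nn as nn

class Classifier(nn.Module):
    def __init__(self, feature_dim, num_classes):
        super().__init__()
        # weight matrix plus bias term mapping the concatenated features to each category
        self.fc = nn.Linear(feature_dim, num_classes)

    def forward(self, A):
        C = self.fc(A)             # per-category scores C_0 ... C_n
        return C.softmax(dim=-1)   # probability distribution L over the categories
```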
The results of experiments using the present invention are given below.
Table 1 shows the test results of the method of the present invention on a data set of twenty thousand news headlines extracted from THUCNews; the evaluation metrics are accuracy, precision, recall and F1 value. As can be seen from the table, all four metrics of the method are higher than those of the Bi-LSTM-Attention method without adversarial training, which shows that the proposed method performs better than Bi-LSTM-Attention trained without adversarial training.
TABLE 1. Performance comparison between the method of the invention and the baseline model

Metric       Bi-LSTM-Attention   Method of the invention
Accuracy     90.47%              91.93%
Precision    90.6%               92.02%
Recall       90.4%               91.93%
F1 value     90.4%               91.95%

Claims (1)

1. A text classification method based on deep learning, characterized in that the method comprises the following steps:
step 1, preprocessing a text;
removing noise from the text, including punctuation marks and special characters; building a dictionary and constructing a data set from the dictionary;
step 2, word embedding and adversarial training;
step 2.1, using a word embedding scheme based on pre-trained word vectors, taking word + word context features as the pre-trained word vectors, and adapting to the current context by fine-tuning;
step 2.2, representing the new sample input as X + δ, where X is the original input representation and δ is the perturbation superimposed on the input, δ being calculated as δ = α · sign(g), where g denotes the gradient of the loss function Loss with respect to the input X; the perturbation δ superimposed on the sample X is obtained by passing the input through the neural network function f_θ(·), comparing the resulting loss with the label y, and finding the perturbation that maximizes the loss;
step 2.3, with respect to the loss value obtained in the previous step, optimizing the neural network by minimizing that loss;
step 3, training a bidirectional long short-term memory network layer;
the word embedding result is input into the bidirectional long short-term memory (Bi-LSTM) neural network layer, which is formed by combining a forward LSTM and a backward LSTM so that bidirectional semantic dependencies are captured better; the i-th hidden state h_i of the Bi-LSTM is formed by concatenating h_i→ and h_i←, which carry the information of the forward and backward directions respectively; each LSTM layer consists of several cells, and the output H_t at any time t is calculated from H_{t-1}, C_{t-1} and X_t, where C_{t-1} is the candidate cell state at time t-1 and X_t is the input at time step t;
step 4, training an attention mechanism layer;
the input of the attention mechanism layer is H = [h_1, h_2, ..., h_T], where T represents the length of the input sequence; the attention score M is calculated as M = tanh(H), and the probability distribution α of the attention scores is calculated as α = softmax(ω^T M), where ω^T is a trainable parameter;
the output r of the attention mechanism layer is obtained by matrix multiplication of H and α^T, i.e. r = H α^T;
step 5, calculating an output result;
mapping the extracted features to the specific categories using a fully connected layer: the features extracted by the two LSTM layers are concatenated, the feature information is mapped to each category by multiplying it by a weight matrix and adding a bias term, and the probabilities are finally obtained through a Softmax function, calculated as Label = softmax(FC(A)), where A = [A_0, A_1, ..., A_i] is the input feature and i is the dimension of the input feature;
C = [C_0, C_1, ..., C_n] is the score of each category obtained after the features pass through the fully connected layer, where n represents the number of categories; the category scores C_0 to C_n are then passed through the Softmax function to obtain the probability distribution L over the categories.
CN202111662807.8A 2021-12-31 2021-12-31 Text classification method based on deep learning Active CN114357166B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202111662807.8A CN114357166B (en) 2021-12-31 2021-12-31 Text classification method based on deep learning

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202111662807.8A CN114357166B (en) 2021-12-31 2021-12-31 Text classification method based on deep learning

Publications (2)

Publication Number Publication Date
CN114357166A (en) 2022-04-15
CN114357166B CN114357166B (en) 2024-05-28

Family

ID=81104826

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202111662807.8A Active CN114357166B (en) 2021-12-31 2021-12-31 Text classification method based on deep learning

Country Status (1)

Country Link
CN (1) CN114357166B (en)

Patent Citations (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN109740148A (en) * 2018-12-16 2019-05-10 北京工业大学 A kind of text emotion analysis method of BiLSTM combination Attention mechanism
CN109992780A (en) * 2019-03-29 2019-07-09 哈尔滨理工大学 One kind being based on deep neural network specific objective sensibility classification method
CN111274405A (en) * 2020-02-26 2020-06-12 北京工业大学 Text classification method based on GCN
CN111444346A (en) * 2020-03-31 2020-07-24 广州大学 Word vector confrontation sample generation method and device for text classification
CN113822328A (en) * 2021-08-05 2021-12-21 厦门市美亚柏科信息股份有限公司 Image classification method for defending against sample attack, terminal device and storage medium

Cited By (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN114743081A (en) * 2022-05-10 2022-07-12 北京瑞莱智慧科技有限公司 Model training method, related device and storage medium
CN114743081B (en) * 2022-05-10 2023-06-20 北京瑞莱智慧科技有限公司 Model training method, related device and storage medium

Also Published As

Publication number Publication date
CN114357166B (en) 2024-05-28


Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant