LU504829B1 - Text classification method, computer readable storage medium and system - Google Patents
Text classification method, computer readable storage medium and system
- Publication number
- LU504829B1
- Authority
- LU
- Luxembourg
- Prior art keywords
- text
- training
- neural network
- character
- recurrent neural
- Prior art date
Classifications
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F16/00—Information retrieval; Database structures therefor; File system structures therefor
- G06F16/30—Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
- G06F16/35—Clustering; Classification
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F16/00—Information retrieval; Database structures therefor; File system structures therefor
- G06F16/30—Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
- G06F16/33—Querying
- G06F16/3331—Query processing
- G06F16/334—Query execution
- G06F16/3347—Query execution using vector based model
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F40/00—Handling natural language data
- G06F40/20—Natural language analysis
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F40/00—Handling natural language data
- G06F40/20—Natural language analysis
- G06F40/279—Recognition of textual entities
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F40/00—Handling natural language data
- G06F40/30—Semantic analysis
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06N—COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
- G06N3/00—Computing arrangements based on biological models
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06N—COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
- G06N3/00—Computing arrangements based on biological models
- G06N3/02—Neural networks
- G06N3/04—Architecture, e.g. interconnection topology
- G06N3/044—Recurrent networks, e.g. Hopfield networks
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06N—COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
- G06N3/00—Computing arrangements based on biological models
- G06N3/02—Neural networks
- G06N3/08—Learning methods
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06V—IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
- G06V30/00—Character recognition; Recognising digital ink; Document-oriented image-based pattern recognition
- G06V30/10—Character recognition
- G06V30/18—Extraction of features or characteristics of the image
- G06V30/1801—Detecting partial patterns, e.g. edges or contours, or configurations, e.g. loops, corners, strokes or intersections
- G06V30/18019—Detecting partial patterns, e.g. edges or contours, or configurations, e.g. loops, corners, strokes or intersections by matching or filtering
- G06V30/18038—Biologically-inspired filters, e.g. difference of Gaussians [DoG], Gabor filters
- G06V30/18048—Biologically-inspired filters, e.g. difference of Gaussians [DoG], Gabor filters with interaction between the responses of different filters, e.g. cortical complex cells
- G06V30/18057—Integrating the filters into a hierarchical structure, e.g. convolutional neural networks [CNN]
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06V—IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
- G06V30/00—Character recognition; Recognising digital ink; Document-oriented image-based pattern recognition
- G06V30/10—Character recognition
- G06V30/19—Recognition using electronic means
- G06V30/191—Design or setup of recognition systems or techniques; Extraction of features in feature space; Clustering techniques; Blind source separation
- G06V30/19147—Obtaining sets of training patterns; Bootstrap methods, e.g. bagging or boosting
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06V—IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
- G06V30/00—Character recognition; Recognising digital ink; Document-oriented image-based pattern recognition
- G06V30/10—Character recognition
- G06V30/19—Recognition using electronic means
- G06V30/191—Design or setup of recognition systems or techniques; Extraction of features in feature space; Clustering techniques; Blind source separation
- G06V30/19173—Classification techniques
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06V—IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
- G06V30/00—Character recognition; Recognising digital ink; Document-oriented image-based pattern recognition
- G06V30/10—Character recognition
- G06V30/26—Techniques for post-processing, e.g. correcting the recognition result
- G06V30/262—Techniques for post-processing, e.g. correcting the recognition result using context analysis, e.g. lexical, syntactic or semantic context
- G06V30/274—Syntactic or semantic context, e.g. balancing
Landscapes
- Engineering & Computer Science (AREA)
- Theoretical Computer Science (AREA)
- Physics & Mathematics (AREA)
- General Physics & Mathematics (AREA)
- Computational Linguistics (AREA)
- General Engineering & Computer Science (AREA)
- Artificial Intelligence (AREA)
- General Health & Medical Sciences (AREA)
- Health & Medical Sciences (AREA)
- Data Mining & Analysis (AREA)
- Multimedia (AREA)
- Evolutionary Computation (AREA)
- Molecular Biology (AREA)
- Biomedical Technology (AREA)
- Life Sciences & Earth Sciences (AREA)
- Computer Vision & Pattern Recognition (AREA)
- Computing Systems (AREA)
- Software Systems (AREA)
- Audiology, Speech & Language Pathology (AREA)
- Mathematical Physics (AREA)
- Biophysics (AREA)
- Databases & Information Systems (AREA)
- Biodiversity & Conservation Biology (AREA)
- Information Retrieval, Db Structures And Fs Structures Therefor (AREA)
Abstract
The invention relates to a text classification method, a computer readable storage medium and a system. The method includes: obtaining the text to be classified; obtaining multiple characters and multiple words representing the text to be classified; obtaining multiple character vectors and multiple word vectors; inputting the multiple character vectors into a stacked bidirectional recurrent neural network based on character vector to obtain a classification result based on character vector, and inputting the multiple word vectors into a stacked bidirectional recurrent neural network based on word vector to obtain a classification result based on word vector; and counting the number of characters and the number of words representing the text to be classified: if the relationship between the number of characters and the number of words satisfies the set threshold, the classification result based on character vector is selected; otherwise, the classification result based on word vector is selected.
Description
DESCRIPTION
TEXT CLASSIFICATION METHOD, COMPUTER READABLE
STORAGE MEDIUM AND SYSTEM
The invention relates to the field of natural language processing, in particular to a text classification method, a computer readable storage medium and a system.
With the development of Internet technology, people can use the Internet to publish a wide variety of comments, which produces a large amount of text information. This text information expresses people's selection tendencies and provides a platform for information display and communication. Obtaining selection-tendency information from text has therefore become a research topic. In the process of making the invention, however, the inventor found that existing ways of obtaining such selection information are inefficient and their analysis accuracy is low.
Based on the above problems, the purpose of the invention is to provide a text classification method, which has the advantages of improving accuracy and efficiency.
A text classification method includes the following steps: obtaining the text to be classified; subjecting the text to be classified to character dividing and word dividing to obtain multiple characters and multiple words representing the text to be classified; obtaining multiple character vectors and multiple word vectors by vectorizing the multiple characters and the multiple words respectively; constructing a stacked bidirectional recurrent neural network based on character vector and a stacked bidirectional recurrent neural network based on word vector, inputting the multiple character vectors into the stacked bidirectional recurrent neural network based on character vector to obtain the classification result based on character vector, and inputting the multiple word vectors into the stacked bidirectional recurrent neural network based on word vector to obtain the classification result based on word vector, wherein each stacked bidirectional recurrent neural network includes three BLSTM layers and one Sigmoid layer; each BLSTM layer is stacked from multiple LSTM units, the multiple LSTM units in each layer are distributed hierarchically and are set with corresponding weight parameters, and each LSTM unit takes as input the output of the LSTM unit in the layer above and/or the preceding LSTM unit in the same layer, the output result finally being obtained in the Sigmoid layer; and counting the number of characters and the number of words representing the text to be classified: if the number of characters is less than or equal to half of the number of words, the classification result based on character vector is selected; otherwise, the classification result based on word vector is selected.
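The architecture described above can be sketched as a forward pass. The following NumPy sketch is illustrative only: the layer sizes, random initialization, and helper names (`lstm_step`, `bilstm_layer`, `init`) are our assumptions, not the patented implementation. It shows three stacked bidirectional LSTM layers whose per-position outputs feed a single sigmoid output unit.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def lstm_step(x, h, c, W, U, b):
    # One LSTM unit: gates computed from the current input x and previous state h.
    z = W @ x + U @ h + b
    i, f, o, g = np.split(z, 4)
    i, f, o, g = sigmoid(i), sigmoid(f), sigmoid(o), np.tanh(g)
    c = f * c + i * g
    h = o * np.tanh(c)
    return h, c

def lstm_pass(xs, W, U, b, hidden):
    h, c, out = np.zeros(hidden), np.zeros(hidden), []
    for x in xs:
        h, c = lstm_step(x, h, c, W, U, b)
        out.append(h)
    return out

def bilstm_layer(xs, params_f, params_b, hidden):
    # Bidirectional layer: one LSTM runs forward, one backward; states are concatenated.
    fwd = lstm_pass(xs, *params_f, hidden)
    bwd = lstm_pass(xs[::-1], *params_b, hidden)[::-1]
    return [np.concatenate([f, b]) for f, b in zip(fwd, bwd)]

def init(in_dim, hidden, rng):
    return (rng.standard_normal((4 * hidden, in_dim)) * 0.1,
            rng.standard_normal((4 * hidden, hidden)) * 0.1,
            np.zeros(4 * hidden))

rng = np.random.default_rng(0)
emb, hidden = 8, 16                                  # toy embedding and hidden sizes
xs = [rng.standard_normal(emb) for _ in range(5)]    # 5 character (or word) vectors

# Three stacked BLSTM layers, then one sigmoid output unit.
for layer in range(3):
    in_dim = emb if layer == 0 else 2 * hidden
    xs = bilstm_layer(xs, init(in_dim, hidden, rng), init(in_dim, hidden, rng), hidden)

w_out = rng.standard_normal(2 * hidden) * 0.1
score = sigmoid(w_out @ xs[-1])    # probability-like output for "selected"
print(round(float(score), 3))
```

The score lies in (0, 1) because of the sigmoid output, matching the 1 = selected / 0 = not selected convention used later in the description.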
By using the stack bidirectional recurrent neural network, the high-level features representing the semantics of the text can be obtained by analyzing the content of the context in the text to be classified, the accuracy and efficiency are improved by fusing the character information and word information of the text to be classified.
Furthermore, the steps of constructing the stacked bidirectional recurrent neural network based on character vector include: obtaining multiple training texts and the corresponding selection label for each training text; dividing each training text separately to obtain multiple characters representing each training text; vectorizing the multiple characters representing each training text to obtain multiple character vectors; and inputting the multiple character vectors corresponding to each training text, together with the corresponding selection label, into the stacked bidirectional recurrent neural network based on character vector for training, the parameters of the network being optimized to obtain the stacked bidirectional recurrent neural network based on character vector.
Furthermore, the steps of constructing the stacked bidirectional recurrent neural network based on word vector include: obtaining multiple training texts and the corresponding selection label for each training text; dividing each training text separately to obtain multiple words representing each training text; vectorizing the multiple words representing each training text to obtain multiple word vectors; and inputting the multiple word vectors corresponding to each training text, together with the corresponding selection label, into the stacked bidirectional recurrent neural network based on word vector for training, the parameters of the network being optimized to obtain the stacked bidirectional recurrent neural network based on word vector.
Furthermore, character segmentation and word segmentation are performed on the text to be classified and/or the training text by the hidden Markov model to obtain multiple characters and multiple words, so that fast and accurate character segmentation and word segmentation of the text are achieved through prediction and evaluation of the text.
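A hidden-Markov-model segmenter of the kind referred to above is commonly implemented as Viterbi decoding over B/M/E/S character tags (begin, middle, or end of a multi-character word, or a single-character word). The toy sketch below uses hand-set, purely illustrative probabilities and an uninformative emission model; a real segmenter estimates all of these quantities from an annotated corpus, so this is a sketch of the decoding idea only.

```python
import math

# Toy HMM word segmentation: tag each character B(egin), M(iddle), E(nd), or S(ingle).
STATES = "BMES"
# Hand-set log-probabilities, for illustration only.
START = {"B": math.log(0.6), "S": math.log(0.4), "M": -1e9, "E": -1e9}
TRANS = {
    "B": {"M": math.log(0.3), "E": math.log(0.7)},
    "M": {"M": math.log(0.3), "E": math.log(0.7)},
    "E": {"B": math.log(0.5), "S": math.log(0.5)},
    "S": {"B": math.log(0.5), "S": math.log(0.5)},
}

def viterbi(chars, emit):
    # emit(state, char) -> log P(char | state)
    V = [{s: START[s] + emit(s, chars[0]) for s in STATES}]
    back = []
    for ch in chars[1:]:
        row, ptr = {}, {}
        for s in STATES:
            score, prev = max((V[-1][p] + TRANS[p].get(s, -1e9), p) for p in STATES)
            row[s] = score + emit(s, ch)
            ptr[s] = prev
        V.append(row)
        back.append(ptr)
    tag = max(V[-1], key=V[-1].get)      # best final tag, then trace back
    tags = [tag]
    for ptr in reversed(back):
        tag = ptr[tag]
        tags.append(tag)
    return tags[::-1]

def segment(text, emit):
    # A word ends at every E or S tag.
    tags, words, cur = viterbi(list(text), emit), [], ""
    for ch, t in zip(text, tags):
        cur += ch
        if t in "ES":
            words.append(cur)
            cur = ""
    if cur:
        words.append(cur)
    return words

def uniform(state, ch):
    return 0.0   # uninformative emissions, so only START/TRANS shape the result

print(segment("abcd", uniform))   # → ['ab', 'cd']
```

With these toy transition probabilities the decoder prefers two-character words, which is why the four characters split into two pairs.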
Furthermore, word2vec is used to vectorize the multiple characters and multiple words that represent the text to be classified and/or the training text, obtaining multiple character vectors and multiple word vectors and thereby realizing fast vectorization.
Furthermore, the relationship between the number of characters and the number of words satisfies the set threshold when the number of characters is less than or equal to half of the number of words. The number of characters and the number of words segmented from the text have a great influence on the classification result; therefore, by analyzing these two counts for the text to be classified, the better classification result can be selected and the text classified more accurately.
The invention also provides a computer readable storage medium on which a computer program is stored, wherein the computer program, when executed by a processor, implements the steps of the text classification method as described in any of the above content.
The invention also provides a text classification system, including a memory, a processor, and a computer program stored in the memory and executable by the processor. The processor implements the steps of the text classification method as described above when executing the computer program.
The invention is described in detail in the following for a better understanding and implementation.
Fig. 1 is the flow chart of the text classification method in the embodiment of the invention;
Fig. 2 is the flow chart of the stack bidirectional recurrent neural network based on character vector in the embodiment of the invention;
Fig. 3 is the flow chart of the stack bidirectional recurrent neural network based on word vector in the embodiment of the invention;
Fig. 4 is the schematic diagram of the stack bidirectional recurrent neural network based on character vector and word vector in the embodiment of the invention.
Referring to Fig. 1, the flow chart of the text classification method in the embodiment of the invention, the text classification method includes the following steps:
Step S1: obtaining the text to be classified.
In an embodiment, the text to be classified is a text with a selection tendency: for example, a text expressing positive emotions such as preference and approval for a person, event, or product, indicating that the person, event, or product is chosen; or a text expressing negative emotions such as disgust and opposition toward a person, event, or product, indicating that it is not chosen.
Step S2: subjecting the text to be classified to character dividing and word dividing to obtain multiple characters and multiple words representing the text to be classified;
Step S3: obtaining multiple character vectors and multiple word vectors by vectorizing the multiple characters and the multiple words respectively.
In one embodiment, vectorization transforms symbolic information in the form of natural language into digital information in the form of a vector, so that machine learning and processing can be carried out; for example, 'good' may be expressed as [0 0 0 0 0 0 0 1 0 0 ...].
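The bracketed example above corresponds to a one-hot encoding, the simplest way to turn symbols into numeric vectors. The sketch below is illustrative only: the patent itself uses word2vec, which produces dense learned vectors rather than one-hot vectors, and the token list here is hypothetical.

```python
def one_hot_vocab(tokens):
    """Map each distinct token to a one-hot vector over the vocabulary."""
    vocab = sorted(set(tokens))
    index = {tok: i for i, tok in enumerate(vocab)}

    def vectorize(tok):
        v = [0] * len(vocab)
        v[index[tok]] = 1   # a single 1 at the token's vocabulary position
        return v

    return vocab, vectorize

tokens = ["this", "product", "is", "good", "really", "good"]
vocab, vec = one_hot_vocab(tokens)
print(vocab)         # sorted distinct tokens
print(vec("good"))   # → [1, 0, 0, 0, 0]
```

Each vector has exactly one nonzero entry, which is the property the '[0 0 0 ... 1 ...]' example in the text illustrates.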
Step S4: constructing a stack bidirectional recurrent neural network based on character vector and a stack bidirectional recurrent neural network based on word vector, and multiple described character vectors are input into the stack bidirectional recurrent neural network based on character vector to obtain the classification results based on character vector, and multiple described word vectors are input into the stack bidirectional recurrent neural network based on word vector to obtain the classification results based on word vector;
In an embodiment, the classification result can indicate positive emotions such as preference and approval for a person, event, or product, meaning the text chooses that person, event, or product; or negative emotions such as disgust and opposition toward a person, event, or product, meaning the text does not choose it. In machine learning and processing, optionally, '1' denotes the selected text result and '0' denotes the unselected text result.
Step S5: counting the number of characters and the number of words representing the text to be classified; if the number of characters is less than or equal to half of the number of words, the classification result based on character vector is selected; otherwise, the classification result based on word vector is selected.
In one embodiment, the inventor found during the creation process that the number of characters and the number of words segmented from the text have a great influence on the classification result; by analyzing these counts for the text to be classified, the better classification result can be selected. Specifically, the relationship between the counts satisfies the set threshold when the number of characters is less than or equal to half of the number of words: in that case the classification result based on character vector is more accurate, while if the number of characters is greater than half of the number of words, the classification result based on word vector is more accurate.
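The threshold rule of step S5 can be written down directly. The function below is a straightforward sketch of that rule; the counts and the 0/1 results in the usage lines are hypothetical values, not data from the patent.

```python
def choose_result(num_chars, num_words, char_result, word_result):
    """Apply the step-S5 threshold: prefer the character-vector result
    when the character count is at most half the word count."""
    if num_chars <= num_words / 2:
        return char_result
    return word_result

# Hypothetical counts and per-network outputs (1 = selected, 0 = not selected):
print(choose_result(4, 10, 1, 0))   # 4 <= 5, so the character-vector result: 1
print(choose_result(8, 10, 1, 0))   # 8 > 5, so the word-vector result: 0
```

Both classification results are still computed in step S4; the rule only decides which one is reported.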
By using the stacked bidirectional recurrent neural network, the content of the context in the text to be classified can be analyzed and high-level features representing the semantics of the text can be obtained; accuracy and efficiency are improved by fusing the character information and word information of the text to be classified.
In one embodiment, the text to be classified is subjected to character dividing and word dividing through the hidden Markov model to obtain multiple characters and multiple words representing the text to be classified, so as to perform fast and accurate character segmentation and word segmentation on the text through the prediction and evaluation of the text.
In one embodiment, word2vec is used to vectorize the multiple characters and multiple words that represent the text to be classified and/or the training text, obtaining multiple character vectors and multiple word vectors and thereby realizing fast vectorization.
Please refer to Fig. 2, which is the flow chart of constructing the stacked bidirectional recurrent neural network based on character vector in the embodiment of the invention.
In one embodiment, the steps for constructing a stacked bidirectional recurrent neural network based on character vector include:
Step S411: obtaining multiple training texts and the corresponding selection labels for each training text;
In one embodiment, the multiple training texts are training texts with selection labels from the ChnSentiCorp Chinese sentiment analysis corpus, and/or texts with selection labels from a network data set. The selection label can be a label indicating positive emotions such as preference and approval for a person, event, or product, meaning the text chooses that person, event, or product; or a label indicating negative emotions such as disgust and opposition toward a person, event, or product, meaning the text does not choose it. In machine learning and processing, optionally, '1' denotes the selected text and '0' denotes the unselected text.
Step S412: dividing each training text separately to obtain multiple characters representing each training text;
In one embodiment, each training text is character-divided by the hidden Markov model to obtain the multiple characters representing that training text.
Step S413: vectorizing the multiple characters representing each training text to obtain multiple character vectors;
Step S414: inputting the multiple character vectors corresponding to each training text and the corresponding selection labels into the stacked bidirectional recurrent neural network based on character vector for training, and optimizing the parameters of the network to obtain the stacked bidirectional recurrent neural network based on character vector.
In one embodiment, the stacked bidirectional recurrent neural network based on character vector includes three BLSTM layers and one Sigmoid layer; each BLSTM layer is stacked from multiple LSTM units, and the multiple LSTM units in each layer are distributed hierarchically.
The multiple LSTM units in each layer are set with corresponding weight parameters. Each LSTM unit takes as input the output of the LSTM unit in the layer above and/or the preceding LSTM unit in the same layer, and the output result is finally obtained in the Sigmoid layer. For example, the multiple character vectors corresponding to each training text are input into the stacked bidirectional recurrent neural network based on character vector; after the three BLSTM layers, the output result is obtained in the Sigmoid layer. If the output result does not match the corresponding selection label, the stochastic gradient descent algorithm is used to update and iterate each weight parameter, and the multiple character vectors are used as input to recalculate, until the output result matches the corresponding selection label. By repeating this training a large number of times, the stacked bidirectional recurrent neural network based on character vector is obtained. To prevent over-fitting, the dropout strategy is adopted during training: in each training cycle, some units in a neural layer are first randomly selected and temporarily hidden, and the training and optimization of the neural network for that cycle is then carried out; in the next cycle, other neurons are hidden, and so on until training ends. In one embodiment, dropout is set to 0.5.
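The compare-with-label-then-update loop described above can be illustrated at toy scale. The sketch below trains only a single sigmoid output unit with stochastic gradient descent on one hypothetical example; in the patented method the same kind of update is applied to all weight parameters of the stacked network. The input vector, learning rate, and step count are illustrative assumptions.

```python
import numpy as np

# Toy stand-in for the described training loop: one sigmoid output unit is
# trained with stochastic gradient descent until its output matches the label.
rng = np.random.default_rng(1)
x = rng.standard_normal(16)   # stand-in for the network's final hidden state
y = 1.0                       # selection label: 1 = selected
w = np.zeros(16)              # output weights to be optimized
lr = 0.5

sigmoid = lambda z: 1.0 / (1.0 + np.exp(-z))
losses = []
for step in range(200):
    p = sigmoid(w @ x)                                          # current output
    losses.append(-(y * np.log(p) + (1 - y) * np.log(1 - p)))   # cross-entropy
    w -= lr * (p - y) * x                                       # SGD weight update

print(bool(losses[-1] < losses[0]))   # loss shrinks as the update repeats
```

Each update moves the weights against the gradient of the loss, so the output drifts toward the label, which is the "recalculate until the output matches the label" behavior the text describes.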
Please refer to Fig. 3 and Fig. 4 at the same time. Fig. 3 is the flow chart of the stack bidirectional recurrent neural network based on word vector in the embodiment of the invention.
Fig. 4 is the schematic diagram of the stack bidirectional recurrent neural network based on character vector and word vector in the embodiment of the invention.
In one embodiment, the steps of constructing a stacked bidirectional recurrent neural network based on word vector include:
Step S421: obtaining multiple training texts and the corresponding selection labels for each training text;
In one embodiment, the multiple training texts are training texts with selection labels from the ChnSentiCorp Chinese sentiment analysis corpus, and/or texts with selection labels from a network data set; the selection label can be a label indicating positive emotions such as preference and approval for a person, event, or product, meaning the text chooses that person, event, or product, or a label indicating negative emotions such as disgust and opposition toward a person, event, or product, meaning the text does not choose it. In machine learning and processing, optionally, '1' denotes the selected text and '0' denotes the unselected text.
Step S422: dividing each training text separately to obtain multiple words representing each training text;
In one embodiment, each training text is word-divided by the hidden Markov model to obtain the multiple words representing that training text.
Step S423: vectorizing the multiple words representing each training text to obtain multiple word vectors;
Step S424: inputting the multiple word vectors corresponding to each training text and the corresponding selection labels into the stacked bidirectional recurrent neural network based on word vector for training, and optimizing the parameters of the network to obtain the stacked bidirectional recurrent neural network based on word vector.
In one embodiment, the stacked bidirectional recurrent neural network based on word vector includes three BLSTM layers and one Sigmoid layer; each BLSTM layer is stacked from multiple LSTM units, and the multiple LSTM units in each layer are distributed hierarchically.
The multiple LSTM units in each layer are set with corresponding weight parameters. Each LSTM unit takes as input the output of the LSTM unit in the layer above and/or the preceding LSTM unit in the same layer, and the output result is finally obtained in the Sigmoid layer. For example, the multiple word vectors corresponding to each training text are input into the stacked bidirectional recurrent neural network based on word vector; after the three BLSTM layers, the output result is obtained in the Sigmoid layer. If the output result does not conform to the corresponding selection label, the stochastic gradient descent algorithm is used to update and iterate the weight parameters, and the multiple word vectors are used as input to recalculate, until the output result conforms to the corresponding selection label. By repeating this training a large number of times, the stacked bidirectional recurrent neural network based on word vector is obtained.
To prevent over-fitting, the dropout strategy is likewise adopted during training: in each training cycle, some units in a neural layer are first randomly selected and temporarily hidden, and the training and optimization of the neural network for that cycle is then carried out; in the next cycle, other neurons are hidden, and so on until training ends.
In one embodiment, dropout is set to 0.5.
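Temporarily hiding units with rate 0.5, as described, is commonly implemented as an "inverted" dropout mask, sketched below. The rescaling of the surviving activations is standard practice (so the expected activation is unchanged) rather than something the text specifies, and the activations here are random placeholders.

```python
import numpy as np

# Inverted dropout with rate 0.5: each unit is zeroed with probability 0.5
# during a training cycle, and survivors are rescaled by 1 / (1 - rate).
rng = np.random.default_rng(42)
rate = 0.5
h = rng.standard_normal(1000)        # hidden-layer activations (placeholder)
mask = rng.random(1000) >= rate      # units kept this cycle
h_drop = np.where(mask, h / (1 - rate), 0.0)

print(round(float(mask.mean()), 2))  # roughly half the units survive
```

In the next cycle a fresh mask is drawn, so a different subset of neurons is hidden each time, matching the cycle-by-cycle description above.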
In one embodiment, the training text is character-divided and word-divided by the hidden Markov model to obtain the multiple characters and multiple words of the training text; through prediction and evaluation of the text, the text is segmented quickly and accurately.
In one embodiment, word2vec is used to vectorize the multiple characters and the multiple words of the training text respectively to obtain multiple character vectors and multiple word vectors, realizing fast vectorization.
The invention also provides a computer readable storage medium on which a computer program is stored, wherein the computer program, when executed by a processor, implements the steps of the text classification method as described in any of the above content.
The invention also provides a text classification system, including a memory, a processor, and a computer program stored in the memory and executable by the processor. The processor implements the steps of the text classification method as described above when executing the computer program.
By using the stacked bidirectional recurrent neural network, the high-level features representing the semantics of the text can be obtained by analyzing the content of the context in the text to be classified. By fusing the character information and word information of the text to be classified, the accuracy and efficiency are improved.
The above embodiments express only several implementation modes of the invention, and their description is relatively specific and detailed, but they cannot be understood as limiting the scope of the invention. It should be pointed out that ordinary technical personnel in this field can make several variations and improvements without departing from the concept of the invention, and those variations and improvements all fall within the protection scope of the invention.
Claims (8)
1. A text classification method, including the following steps: obtaining the text to be classified; subjecting the text to be classified to character dividing and word dividing to obtain multiple characters and multiple words representing the text to be classified; obtaining multiple character vectors and multiple word vectors by vectorizing the multiple characters and the multiple words respectively; constructing a stacked bidirectional recurrent neural network based on character vector and a stacked bidirectional recurrent neural network based on word vector, inputting the multiple character vectors into the stacked bidirectional recurrent neural network based on character vector to obtain the classification result based on character vector, and inputting the multiple word vectors into the stacked bidirectional recurrent neural network based on word vector to obtain the classification result based on word vector, wherein the stacked bidirectional recurrent neural network includes three BLSTM layers and one Sigmoid layer; each BLSTM layer is stacked from multiple LSTM units, the multiple LSTM units in each layer are distributed hierarchically and are set with corresponding weight parameters, and each LSTM unit takes as input the output of the LSTM unit in the layer above and/or the preceding LSTM unit in the same layer, the output result finally being obtained in the Sigmoid layer; and counting the number of characters and the number of words representing the text to be classified, wherein if the number of characters is less than or equal to half of the number of words, the classification result based on character vector is selected; otherwise, the classification result based on word vector is selected.
2. The text classification method according to claim 1, wherein the steps of constructing a stack bidirectional recurrent neural network based on character vector include: obtaining multiple training texts and the corresponding selection labels for each training text; dividing each training text separately to obtain multiple characters representing each training text;
vectorizing the multiple characters representing each training text to obtain multiple character vectors; and inputting the multiple character vectors corresponding to each training text and the corresponding selection labels of each training text into the stacked bidirectional recurrent neural network based on character vector for training, and optimizing the parameters of the stacked bidirectional recurrent neural network to obtain the stacked bidirectional recurrent neural network based on character vector.
3. The text classification method according to claim 2, wherein the steps of constructing a stacked bidirectional recurrent neural network based on word vector include: obtaining multiple training texts and the corresponding selection labels for each training text; dividing each training text separately to obtain multiple words representing each training text; vectorizing multiple described words representing each training text to obtain multiple word vectors; inputting the multiple word vectors corresponding to each training text and the corresponding selection labels of each training text into the stack bidirectional recurrent neural network based on word vector for training, and the parameters of the stack bidirectional recurrent neural network are optimized to obtain the stack bidirectional recurrent neural network based on word vector.
4. The text classification method according to claim 3, wherein character segmentation and word segmentation are performed on the text to be classified and/or the training texts by a hidden Markov model to obtain the multiple characters and the multiple words.
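Hidden-Markov-model segmentation of the kind named in claim 4 typically tags each character as B/M/E/S (begin, middle, end of a word, or single-character word) and decodes the best tag path with the Viterbi algorithm. The sketch below shows that decoding step with illustrative, untrained probabilities and a uniform placeholder emission model; none of the numbers come from the patent.

```python
import math

# Viterbi decoding over a BMES-style hidden Markov model, as a hypothetical
# stand-in for the HMM segmenter of claim 4. All probabilities below are
# illustrative, not trained values.

STATES = ["B", "M", "E", "S"]  # Begin / Middle / End of word, Single char

start_p = {"B": 0.6, "M": 0.0, "E": 0.0, "S": 0.4}
trans_p = {
    "B": {"B": 0.0, "M": 0.3, "E": 0.7, "S": 0.0},
    "M": {"B": 0.0, "M": 0.3, "E": 0.7, "S": 0.0},
    "E": {"B": 0.5, "M": 0.0, "E": 0.0, "S": 0.5},
    "S": {"B": 0.5, "M": 0.0, "E": 0.0, "S": 0.5},
}

def log(p):
    return math.log(p) if p > 0 else float("-inf")

def viterbi(text, emit_p):
    # V[t][s]: best log-probability of any tag path ending in state s at t
    V = [{s: log(start_p[s]) + log(emit_p(s, text[0])) for s in STATES}]
    back = []
    for ch in text[1:]:
        row, ptr = {}, {}
        for s in STATES:
            prev = max(STATES, key=lambda p: V[-1][p] + log(trans_p[p][s]))
            row[s] = V[-1][prev] + log(trans_p[prev][s]) + log(emit_p(s, ch))
            ptr[s] = prev
        V.append(row)
        back.append(ptr)
    # Trace the best final state back to the first character
    state = max(STATES, key=lambda s: V[-1][s])
    path = [state]
    for ptr in reversed(back):
        state = ptr[state]
        path.append(state)
    return list(reversed(path))

def segment(text, tags):
    """Turn a BMES tag sequence into a list of words."""
    words, cur = [], ""
    for ch, tag in zip(text, tags):
        cur += ch
        if tag in ("E", "S"):
            words.append(cur)
            cur = ""
    if cur:
        words.append(cur)
    return words

uniform = lambda state, ch: 0.25  # placeholder emission model
tags = viterbi("abcd", uniform)
print(tags)                  # -> ['B', 'E', 'B', 'E']
print(segment("abcd", tags))  # -> ['ab', 'cd']
```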
5. The text classification method according to claim 3, wherein word2vec is used to vectorize the multiple characters and multiple words representing the text to be classified and/or the training texts to obtain the multiple character vectors and multiple word vectors.
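The vectorization step of claim 5 maps each character and each word to a fixed-length vector. To keep the example runnable without a trained word2vec model, the sketch below substitutes a deterministic hash-based embedding; `embed` and its dimension are hypothetical, and a real system would look tokens up in trained word2vec vectors instead.

```python
import hashlib

# Hypothetical stand-in for a word2vec lookup: a deterministic hash-based
# embedding, so the example runs without a trained model. Real word2vec
# vectors carry semantic similarity; these do not.

def embed(token, dim=8):
    """Map a token to a fixed-length vector of floats in [-1, 1)."""
    digest = hashlib.sha256(token.encode("utf-8")).digest()
    return [(b - 128) / 128 for b in digest[:dim]]

# One vector per character and per word, as in the claimed method.
chars = ["好", "看"]
words = ["好看"]
char_vectors = [embed(c) for c in chars]
word_vectors = [embed(w) for w in words]
print(len(char_vectors), len(word_vectors), len(char_vectors[0]))  # -> 2 1 8
```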
6. The text classification method according to claim 2, wherein the multiple training texts are training texts with selection labels from the ChnSentiCorp Chinese sentiment analysis corpus, and/or texts with selection labels from a network data set.
7. A computer readable storage medium on which a computer program is stored, wherein the computer program, when executed by a processor, implements the steps of the text classification method according to any one of claims 1-6.
8. A text classification system, comprising a memory, a processor, and a computer program stored in the memory and executable by the processor; the processor implements the steps of the text classification method according to any one of claims 1-6 when executing the computer program.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
LU504829A LU504829B1 (en) | 2023-07-28 | 2023-07-28 | Text classification method, computer readable storage medium and system |
Publications (1)
Publication Number | Publication Date |
---|---|
LU504829B1 (en) | 2024-01-29 |
Family
ID=89808356
Country Status (1)
Country | Link |
---|---|
LU (1) | LU504829B1 (en) |
- 2023-07-28: Application LU504829A filed in Luxembourg; patent LU504829B1 is active
Similar Documents

Publication | Title |
---|---|
CN110245229B (en) | Deep learning theme emotion classification method based on data enhancement |
CN108391446B (en) | Automatic extraction of training corpus for data classifier based on machine learning algorithm |
CN107729309A (en) | Method and device for Chinese semantic analysis based on deep learning |
CN110619044B (en) | Emotion analysis method, system, storage medium and equipment |
CN110263822B (en) | Image emotion analysis method based on multi-task learning mode |
CN109271513B (en) | Text classification method, computer readable storage medium and system |
CN109492105B (en) | Text emotion classification method based on multi-feature ensemble learning |
CN114722805B (en) | Few-shot emotion classification method based on teacher-student knowledge distillation |
CN111506732A (en) | Text multi-level label classification method |
CN111339260A (en) | Fine-grained emotion analysis method based on BERT and question-answering concepts |
CN113987187A (en) | Public opinion text classification method, system, terminal and medium based on multi-label embedding |
KR102403330B1 (en) | Technique for generating and utilizing virtual fingerprint representing text data |
CN111859909B (en) | Semantic scene consistency recognition reading robot |
CN112000778A (en) | Natural language processing method, device and system based on semantic recognition |
CN113515632A (en) | Text classification method based on graph path knowledge extraction |
CN113849653A (en) | Text classification method and device |
CN112364743A (en) | Video classification method based on semi-supervised learning and bullet-screen analysis |
Jishan et al. | Natural language description of images using hybrid recurrent neural network |
CN115391520A (en) | Text emotion classification method, system, device and computer medium |
CN115017879A (en) | Text comparison method, computer device and computer storage medium |
CN113051887A (en) | Method, system and device for extracting announcement information elements |
CN110827797A (en) | Voice response event classification processing method and device |
CN117150436B (en) | Multi-modal adaptive fusion topic identification method and system |
CN114443846A (en) | Classification method and device based on multi-level text abnormal composition and electronic equipment |
CN110263148A (en) | Intelligent resume selection method and device |