CN113157921A - Chinese text classification method integrating radical semantics - Google Patents
- Publication number: CN113157921A
- Application number: CN202110388441.3A
- Authority
- CN
- China
- Prior art keywords
- chinese text
- vector
- radical
- chinese
- information
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Granted
Classifications
- G06F16/35 — Information retrieval of unstructured textual data; clustering; classification
- G06F40/126 — Handling natural language data; character encoding
- G06F40/289 — Phrasal analysis, e.g. finite state techniques or chunking
- G06F40/30 — Semantic analysis
- G06N3/044 — Neural networks; recurrent networks, e.g. Hopfield networks
- G06N3/045 — Neural networks; combinations of networks
- G06N3/08 — Neural networks; learning methods
Abstract
The invention provides a Chinese text classification method integrating radical semantics. The method first uses a BERT model to produce vectorized representations of a Chinese text and its radicals, then uses different deep learning models to extract features from the Chinese text and the radicals respectively, and finally fuses the two feature vectors and classifies the Chinese text with a softmax classifier. The scheme takes full account of the fact that Chinese is a unique pictographic language: the radicals of its characters carry rich semantic information and play an important role in the semantic understanding of Chinese texts, so they can further improve classifier performance. Because radicals are themselves so informative, training complexity is greatly reduced, achieving the goals of shortening learning time, improving classification efficiency, organically combining the advantages of different Chinese text classification methods, and realizing efficient and accurate Chinese text classification.
Description
Technical Field
The invention relates to a Chinese text classification method integrating radical semantics, and belongs to the technical field of computers.
Background
Chinese text classification is the process of automatically classifying Chinese texts according to a given classification system or set of rules. It is widely applied in fields such as information indexing, digital library management, and information filtering.
Methods for Chinese text classification are generally classified into three categories: a classification method based on Knowledge Engineering (KE), a classification method based on Machine Learning (ML), and a classification method based on Deep Learning (DL).
Knowledge-engineering-based classification refers to manual text classification according to rules written by domain experts. The approach is plainly inefficient and limited and, despite some early successes, was quickly abandoned.
Machine-learning-based classification lets a computer learn and extract text classification rules on its own, achieving automatic classification. These methods are efficient and highly portable and are widely used for Chinese text classification, but they still have shortcomings. For example: the classification effect of the naive Bayes algorithm depends on prior probabilities, so the representation of the input data strongly influences the Chinese text classification result; support vector machines are sensitive to missing data and have no general solution for nonlinear problems; decision trees tend to ignore correlations between attributes of the data set and are prone to overfitting; and neural networks have a large number of parameters to determine during training, their learning process cannot be observed, training takes a long time, and their outputs are hard to interpret.
Deep-learning-based classification extracts features of the Chinese text while constructing a deep learning model, obtaining higher-level, more abstract semantic representations with which the text is classified. A typical approach first uses the BERT pre-trained language model to obtain feature vectors for the sentences of a text, then feeds those feature vectors into a softmax regression model, achieving good Chinese text classification results. However, this approach is transplanted directly from English text classification and ignores the characteristics of Chinese characters; moreover, the pre-trained model is enormous, and completing the training process requires large amounts of data and equipment resources.
Although these three kinds of Chinese text classification methods can meet the basic goal of classifying Chinese texts, problems such as low algorithmic efficiency, poor domain specificity, and overfitting during learning remain. How to reduce learning time, improve classification efficiency, organically combine the advantages of the different methods, and realize efficient and accurate Chinese text classification is a hot research problem in natural language processing.
Disclosure of Invention
To overcome the defects of the prior art, the invention provides a Chinese text classification method integrating radical semantics. The method first uses a BERT model to produce vectorized representations of a Chinese text and its radicals, then uses different deep learning models to extract features from the Chinese text and the radicals respectively, and finally fuses the two feature vectors and classifies the Chinese text with a softmax classifier, achieving efficient and accurate Chinese text classification.
The technical scheme adopted by the invention for solving the technical problem is as follows: the Chinese text classification method integrated with the radical semantics comprises the following steps:
s1, forming the components of each character in the Chinese text set into a component set, and forming the Chinese text set and the component set into a training set;
s2, vectorizing and representing the Chinese text set and the component set in the training set to obtain a Chinese text vector and a component vector;
s3, extracting the characteristics of the Chinese text vector and the radical vector by using a deep learning model to obtain a Chinese text characteristic vector and a radical characteristic vector;
and S4, fusing the Chinese text feature vectors and the radical feature vectors, and classifying the Chinese text by using a classifier.
In step S1, the data set is preprocessed to obtain a chinese text set, including removing noise data in the data set that is not related to text classification.
In step S1, the component set is obtained by mapping chinese characters and radicals in the xinhua dictionary data set.
Step S2 adopts the BERT model to vectorize the Chinese text set and the radical set in the training set, specifically as follows:
S2.1, represent the Chinese text set and the radical set by word vectors, segment vectors and position vectors, denoting the Chinese text set X_t and the radical set X_r;
S2.2, input the Chinese text set X_t and the radical set X_r into the encoder of the BERT model, and train to obtain the Chinese text vector V_t and the radical vector V_r.
In step S3, a deep learning model is selected according to the characteristics of the Chinese text set corresponding to the Chinese text vector, and is used to extract features from the Chinese text vector and the radical vector.
The deep learning model comprises one of, or a combination of two or more of, a bidirectional recurrent neural network (Bi-RNN), a bidirectional long short-term memory network (Bi-LSTM), a Bi-RNN with an attention mechanism (ATT-Bi-RNN), and a Bi-LSTM with an attention mechanism (ATT-Bi-LSTM); feature extraction is carried out according to the following four cases:
a. if the Chinese text set corresponding to the Chinese text vector consists of simple short texts, a bidirectional recurrent neural network Bi-RNN is used to extract features from the Chinese text vector and the radical vector;
b. if it consists of complex short texts, an attention-based bidirectional recurrent neural network ATT-Bi-RNN is used;
c. if it consists of long texts with simple semantic expression, a bidirectional long short-term memory network Bi-LSTM is used;
d. if it consists of long texts with complex semantic expression, an attention-based bidirectional long short-term memory network ATT-Bi-LSTM is used;
The update of RNN neuron information at time t is expressed by the following equations:

h_t = tanh(W_x x_t + W_h h_{t-1} + b_h)
y_t = softmax(W_y h_t + b_y)

where h_t denotes the hidden-layer information at time t, h_{t-1} the hidden-layer information at time t-1, W_x the weight matrix of the input information, W_h the weight matrix that updates the time t-1 information, x_t the input-layer information at time t (when t = 1, x_1 is the Chinese text vector or the radical vector), b_h the bias matrix that updates the time t-1 information, y_t the information output by the hidden layer at time t, W_y the weight matrix of the hidden-layer output information, b_y the bias matrix of the hidden-layer output information, tanh the hyperbolic tangent function, and softmax the normalized exponential function;
The update of LSTM neuron information at time t is expressed by the following equations:

f_t = σ(W_f · [h_{t-1}, x_t] + b_f)
i_t = σ(W_i · [h_{t-1}, x_t] + b_i)
o_t = σ(W_o · [h_{t-1}, x_t] + b_o)
c̃_t = tanh(W_c · [h_{t-1}, x_t] + b_c)
c_t = f_t ⊙ c_{t-1} + i_t ⊙ c̃_t
h_t = o_t ⊙ tanh(c_t)

where σ denotes the sigmoid activation function, W_f the forget-gate weight matrix, W_o the output-gate weight matrix, W_i the input-gate weight matrix, W_c the weight matrix of the current information, b_f the forget-gate bias matrix, b_i the input-gate bias matrix, b_o the output-gate bias matrix, b_c the bias matrix of the current information, c̃_t the temporary variable of the time-t information, c_t the cell information at time t, c_{t-1} the cell information at time t-1, x_t the input at time t (when t = 1, x_1 is the Chinese text vector or the radical vector), h_{t-1} the hidden-layer information at time t-1, and h_t the hidden-layer information at time t;
The ATT attention mechanism is expressed by the following formulas:

M = tanh(H)
α = softmax(w^T M)
Y = H α^T

where H denotes the sum of the output-layer vectors of the ATT-Bi-RNN or the ATT-Bi-LSTM, M the vector matrix obtained from H by the tanh function, w^T the transpose of the keyword weight vector, α the vector matrix obtained after the softmax function, α^T its transpose, and Y the output of the ATT attention mechanism.
In step S4, the text feature vector and the radical feature vector are fused using the following formulas:
wherein the content of the first and second substances,the feature vector after the fusion is represented,the feature vector of the Chinese text is represented,representing radical feature vectors.
In step S4, the softmax classifier is used to classify the chinese text, which is expressed by the following formula:
wherein the content of the first and second substances,Rthe result of the classification of the chinese text is represented,a matrix of weights is represented by a matrix of weights,representing a matrix of bias values.
The invention has the beneficial effects based on the technical scheme that:
the Chinese text classification method integrated with the radical semantics, provided by the invention, takes Chinese texts and radicals as research objects, obtains abundant semantic information from the Chinese texts and the radicals, firstly uses a BERT model to carry out vectorization representation on the Chinese texts and the radicals, then respectively uses different deep learning models to carry out feature extraction on the Chinese texts and the radicals, finally fuses feature vectors of the Chinese texts and the radicals, and utilizes a softmax classifier to realize Chinese text classification. The experimental result not only shows the superiority of the Chinese text classification model provided by the invention, but also verifies the effectiveness of the components in the Chinese text classification task, solves the problems of low efficiency, poor field pertinence, easy occurrence of overfitting in the learning process and the like of the traditional Chinese text classification algorithm, achieves the purposes of reducing the learning time, improving the classification efficiency and organically combining the advantages of different Chinese text classification methods, and realizes the efficient and accurate Chinese text classification.
Drawings
FIG. 1 is a model diagram of a Chinese text classification method with incorporated radical semantics according to the present invention.
FIG. 2 is a schematic diagram of a BERT model training process.
FIG. 3 is a schematic diagram of the Bi-RNN model.
FIG. 4 is a schematic diagram of RNN neuron structure.
FIG. 5 is a diagram of the Bi-LSTM model.
FIG. 6 is a schematic diagram of the structure of an LSTM neuron.
Detailed Description
The invention is further illustrated by the following figures and examples.
The research idea of the invention is as follows:
Chinese is a language derived from pictographs: not only do its characters express specific semantic information, but the radicals of the characters also contain rich semantic information, as shown in Table 1:
component side | Name (R) | Examples of the present invention |
Bean curd | Beside the handle | Picking, picking and carrying |
| Side of foot character | Kicking, running and jumping |
Chinese medicine | Three-point water | River and sea |
Side of disease character | Pain, ache and scar | |
Rice and its production process | Chinese character mi side | Powder, material and grain |
Soil for soil | Side for lifting soil | Ground, city |
TABLE 1 radical introduction
The "hand" radical appears in "beat" and "pluck"; the "foot" radical in "kick" and "run"; "river" and "sea" are related to the meaning of "water"; "ground" and "city" are related to the meaning of "earth"; and so on. These examples fully reveal the importance of radicals for semantic understanding, yet existing research rarely uses radicals for Chinese text classification. The invention therefore takes the Chinese text and its radicals as research objects, obtains richer semantic information from them, and improves the Chinese text classification effect.
Example (b):
the invention provides a Chinese text classification method integrated with radical semantics, which comprises the following steps with reference to fig. 1:
S1, training set preprocessing: the radicals of each character in the Chinese text set form the radical set, and the Chinese text set and the radical set together form the training set. First, noise data unrelated to text classification is removed from the Chinese data set, for example stop words, web links, and English letters. Then, to obtain the radical of each Chinese character in the data set, the Xinhua dictionary data set is used to map every Chinese character to its radical; the Xinhua dictionary data set covers all Chinese characters and radicals appearing in the data set, comprising 20,849 Chinese characters and 270 radicals.
For example, for the text "the player comes from a football family" in the Chinese text set, the corresponding radical sequence consists of the radical of each of its characters.
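The character-to-radical mapping of step S1 can be sketched as a simple dictionary lookup. The tiny `RADICAL_MAP` below is a hypothetical stand-in for the Xinhua dictionary data set (20,849 characters, 270 radicals) and is not part of the invention:

```python
# Sketch of step S1: map each character of a Chinese text to its radical.
# RADICAL_MAP is a tiny illustrative stand-in for the Xinhua dictionary data set.
RADICAL_MAP = {
    "江": "氵", "海": "氵",   # water radical
    "地": "土", "城": "土",   # earth radical
    "踢": "⻊", "跑": "⻊",   # foot radical
}

def text_to_radicals(text, radical_map=RADICAL_MAP, unknown="□"):
    """Return the radical sequence for a Chinese text (step S1)."""
    return "".join(radical_map.get(ch, unknown) for ch in text)

print(text_to_radicals("江海地城"))  # -> 氵氵土土
```

A full implementation would load the complete character-to-radical mapping before processing the corpus.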
S2, vectorize the Chinese text set and the radical set in the training set to obtain the Chinese text vector and the radical vector. Referring to fig. 2, the BERT model may specifically be used to vectorize the Chinese text set and the radical set respectively. The BERT model is built on the Transformer encoder and is a bidirectional language model obtained by improving on the main structure of the GPT (Generative Pre-Training) language model. As shown in fig. 1, when the BERT model vectorizes a Chinese text ("the player comes from a football family") or its radical sequence, some words in the Chinese text or radical sequence are randomly masked and the unmasked words are used for prediction, giving the trained word vector model better generalization capability. The vectorization process of the BERT model for a Chinese text and its radicals is shown in fig. 2, in which E_i denotes the sum of the word vector, segment vector and position vector of the Chinese text or radicals, Trm denotes the Transformer encoder, and T_i denotes the vector of the Chinese text or radicals obtained after training.
The method specifically comprises the following steps:
S2.1, represent the Chinese text set and the radical set by word vectors, segment vectors and position vectors, denoting the Chinese text set X_t and the radical set X_r;
S2.2, input the Chinese text set X_t and the radical set X_r into the encoder of the BERT model, and train to obtain the Chinese text vector V_t and the radical vector V_r.
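The input representation of step S2.1 (each token is the sum of a word vector, a segment vector and a position vector) can be illustrated with a toy numpy sketch; the dimensions and random embedding tables below are illustrative assumptions, not trained BERT values:

```python
import numpy as np

# Toy sketch of the BERT input representation of S2.1: E_i is the sum of the
# token's word, segment and position vectors. Random values stand in for
# trained embeddings; sizes are illustrative only.
rng = np.random.default_rng(0)
vocab_size, seq_len, dim = 100, 8, 16

word_emb = rng.normal(size=(vocab_size, dim))
seg_emb = rng.normal(size=(2, dim))        # segment A / segment B
pos_emb = rng.normal(size=(seq_len, dim))  # one vector per position

def bert_input(token_ids, segment_ids):
    """E_i = word vector + segment vector + position vector."""
    positions = np.arange(len(token_ids))
    return word_emb[token_ids] + seg_emb[segment_ids] + pos_emb[positions]

E = bert_input(np.array([5, 17, 42, 3]), np.array([0, 0, 0, 0]))
print(E.shape)  # (4, 16)
```

In the actual method this summed representation is what the Transformer encoder (Trm in fig. 2) consumes to produce V_t and V_r.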
S3, extracting the characteristics of the Chinese text vector and the radical vector by using a deep learning model to obtain a Chinese text characteristic vector and a radical characteristic vector; and selecting a deep learning model according to the characteristics of the Chinese text set corresponding to the Chinese text vector to extract the characteristics of the Chinese text vector and the radical vector.
The deep learning model comprises one of, or a combination of two or more of, a bidirectional recurrent neural network (Bi-RNN), a bidirectional long short-term memory network (Bi-LSTM), an attention-based bidirectional recurrent neural network (ATT-Bi-RNN), and an attention-based bidirectional long short-term memory network (ATT-Bi-LSTM); feature extraction is carried out according to the following four cases:
a. if the Chinese text set corresponding to the Chinese text vector consists of simple short texts, for example "Zhao Liying really looks comfortable in those thick-soled shoes", a bidirectional recurrent neural network Bi-RNN is used to extract features from the Chinese text vector and the radical vector; the Bi-RNN model is shown in FIG. 3, in which RNN denotes a neuron of the recurrent neural network, shown in FIG. 4;
b. if the Chinese text set corresponding to the Chinese text vector consists of long texts with simple semantic expression, a bidirectional long short-term memory network Bi-LSTM is used to extract features from the Chinese text vector and the radical vector, as shown in FIG. 5, in which LSTM denotes a neuron of the long short-term memory network, shown in FIG. 6;
In FIGS. 3 and 5, x_t denotes the information input to the deep model at each time instant and y_t the features learned by the deep model at each time instant; the remaining weight matrices denote, respectively, the update weights between the input layer and the forward layer, between successive forward hidden layers, between the input layer and the backward layer, between the forward layer and the output layer, between successive backward hidden layers, and between the backward layer and the output layer.
c. if the Chinese text set corresponding to the Chinese text vector consists of complex short texts, an attention-based bidirectional recurrent neural network ATT-Bi-RNN is used to extract features from the Chinese text vector and the radical vector, as shown in FIG. 7, in which ATT denotes the attention mechanism, shown in FIG. 9;
The ATT-Bi-RNN can give different weights to different words in a complex short text and highlight the role of keywords, thereby improving the text classification effect. For example, for "What positive significance did the Revolution of 1911 have?", the ATT-Bi-RNN model gives greater weight to "Revolution of 1911" and therefore correctly classifies the text as a history text.
d. If the Chinese text set corresponding to the Chinese text vector is a long text with complex semantic expression, feature extraction is performed on the Chinese text vector and the radical vector by using a bidirectional long-short memory network ATT-Bi-LSTM with attention mechanism introduced, as shown in FIG. 9.
In the deep learning models described above, the update of RNN neuron information at time t is expressed by the following equations:

h_t = tanh(W_x x_t + W_h h_{t-1} + b_h)
y_t = softmax(W_y h_t + b_y)

where h_t denotes the hidden-layer information at time t, h_{t-1} the hidden-layer information at time t-1, W_x the weight matrix of the input information, W_h the weight matrix that updates the time t-1 information, x_t the input-layer information at time t (when t = 1, x_1 is the Chinese text vector or the radical vector), b_h the bias matrix that updates the time t-1 information, y_t the information output by the hidden layer at time t, W_y the weight matrix of the hidden-layer output information, b_y the bias matrix of the hidden-layer output information, tanh the hyperbolic tangent function, and softmax the normalized exponential function;
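The RNN neuron update described above can be sketched in numpy for a single time step; sizes and random weights are illustrative only:

```python
import numpy as np

# Numpy sketch of one RNN time step:
#   h_t = tanh(W_x x_t + W_h h_{t-1} + b_h)
#   y_t = softmax(W_y h_t + b_y)
rng = np.random.default_rng(1)
in_dim, hid_dim, out_dim = 4, 6, 3
W_x = rng.normal(size=(hid_dim, in_dim))
W_h = rng.normal(size=(hid_dim, hid_dim))
b_h = np.zeros(hid_dim)
W_y = rng.normal(size=(out_dim, hid_dim))
b_y = np.zeros(out_dim)

def softmax(z):
    e = np.exp(z - z.max())
    return e / e.sum()

def rnn_step(x_t, h_prev):
    h_t = np.tanh(W_x @ x_t + W_h @ h_prev + b_h)  # hidden-layer update
    y_t = softmax(W_y @ h_t + b_y)                 # hidden-layer output
    return h_t, y_t

h, y = rnn_step(rng.normal(size=in_dim), np.zeros(hid_dim))
print(y.sum())  # probabilities sum to 1
```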
The deep model automatically updates the parameters of these matrices during training to obtain the optimal solution; for example, the Adagrad method may be used to update the parameters during training of the ATT-Bi-LSTM model.
The update of LSTM neuron information at time t is expressed by the following equations:

f_t = σ(W_f · [h_{t-1}, x_t] + b_f)
i_t = σ(W_i · [h_{t-1}, x_t] + b_i)
o_t = σ(W_o · [h_{t-1}, x_t] + b_o)
c̃_t = tanh(W_c · [h_{t-1}, x_t] + b_c)
c_t = f_t ⊙ c_{t-1} + i_t ⊙ c̃_t
h_t = o_t ⊙ tanh(c_t)

where σ denotes the sigmoid activation function, W_f the forget-gate weight matrix, W_o the output-gate weight matrix, W_i the input-gate weight matrix, W_c the weight matrix of the current information, b_f the forget-gate bias matrix, b_i the input-gate bias matrix, b_o the output-gate bias matrix, b_c the bias matrix of the current information, c̃_t the temporary variable of the time-t information, c_t the cell information at time t, c_{t-1} the cell information at time t-1, x_t the input at time t (when t = 1, x_1 is the Chinese text vector or the radical vector), h_{t-1} the hidden-layer information at time t-1, and h_t the hidden-layer information at time t;
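The LSTM neuron update described above can likewise be sketched in numpy for a single time step, where [h, x] is the concatenation of the previous hidden state and the input; sizes and random weights are illustrative only:

```python
import numpy as np

# Numpy sketch of one LSTM time step (forget/input/output gates and
# candidate cell state), matching the standard gate equations.
rng = np.random.default_rng(2)
in_dim, hid_dim = 4, 5
cat = in_dim + hid_dim
W_f, W_i, W_o, W_c = (rng.normal(size=(hid_dim, cat)) for _ in range(4))
b_f = b_i = b_o = b_c = np.zeros(hid_dim)

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def lstm_step(x_t, h_prev, c_prev):
    z = np.concatenate([h_prev, x_t])
    f_t = sigmoid(W_f @ z + b_f)        # forget gate
    i_t = sigmoid(W_i @ z + b_i)        # input gate
    o_t = sigmoid(W_o @ z + b_o)        # output gate
    c_tilde = np.tanh(W_c @ z + b_c)    # candidate (temporary) information
    c_t = f_t * c_prev + i_t * c_tilde  # cell-state update
    h_t = o_t * np.tanh(c_t)            # hidden-layer information
    return h_t, c_t

h, c = lstm_step(rng.normal(size=in_dim), np.zeros(hid_dim), np.zeros(hid_dim))
print(h.shape, c.shape)  # (5,) (5,)
```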
The ATT attention mechanism is expressed by the following formulas:

M = tanh(H)
α = softmax(w^T M)
Y = H α^T

where H denotes the sum of the output-layer vectors of the ATT-Bi-RNN or the ATT-Bi-LSTM, M the vector matrix obtained from H by the tanh function, w^T the transpose of the keyword weight vector, α the vector matrix obtained after the softmax function, α^T its transpose, and Y the output of the ATT attention mechanism.
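The ATT attention mechanism described above can be sketched in a few lines of numpy, with H stacking the hidden states of every time step column-wise; dimensions and random values are illustrative only:

```python
import numpy as np

# Numpy sketch of the ATT attention mechanism:
#   M = tanh(H), alpha = softmax(w^T M), Y = H alpha^T
rng = np.random.default_rng(3)
hid_dim, steps = 6, 4
H = rng.normal(size=(hid_dim, steps))  # Bi-RNN / Bi-LSTM outputs per step
w = rng.normal(size=hid_dim)           # keyword weight vector

M = np.tanh(H)
scores = w @ M                         # one score per time step
alpha = np.exp(scores - scores.max())
alpha /= alpha.sum()                   # softmax over time steps
Y = H @ alpha                          # attention-weighted sum of states
print(Y.shape)  # (6,)
```

The weights `alpha` are what let the model emphasize keywords such as "Revolution of 1911" in the earlier example.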
S4, fuse the Chinese text feature vector and the radical feature vector, and classify the Chinese text with a classifier. Specifically, the Chinese text feature vector and the radical feature vector are fused using the following formula:

V = F_t ⊕ F_r

where V denotes the fused feature vector, F_t the Chinese text feature vector, F_r the radical feature vector, and ⊕ the fusion operation;
The Chinese text is then classified with a softmax classifier, expressed by the following formula:

R = softmax(W V + b)

where R denotes the Chinese text classification result, W the weight matrix, and b the bias matrix.
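Step S4 can be sketched as follows, assuming the fusion operation is vector concatenation (one plausible reading; the description does not pin down the operator) and using random placeholder weights in place of trained parameters:

```python
import numpy as np

# Sketch of step S4: fuse the Chinese text feature vector F_t and the
# radical feature vector F_r, then classify with softmax:
#   V = F_t (+) F_r   (concatenation assumed here)
#   R = softmax(W V + b)
rng = np.random.default_rng(4)
num_classes, feat_dim = 14, 8            # 14 THUCNews categories

F_t = rng.normal(size=feat_dim)          # Chinese text feature vector
F_r = rng.normal(size=feat_dim)          # radical feature vector
V = np.concatenate([F_t, F_r])           # fused feature vector

W = rng.normal(size=(num_classes, 2 * feat_dim))  # placeholder weights
b = np.zeros(num_classes)

logits = W @ V + b
R = np.exp(logits - logits.max())
R /= R.sum()                             # class probability distribution
print(int(R.argmax()))                   # predicted category index
```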
Experimental data:
In this embodiment, the THUCNews data set is selected for the experiments. The data set contains 740,000 Sina News texts. On the basis of the original Sina News classification system, English words and symbols, numbers, and stop words were removed from the texts, and 70,241 news texts were manually labeled into 14 categories: finance, lottery, real estate, stock, home, education, science and technology, society, fashion, current affairs, sports, constellation, games, and entertainment. The training set contains 57,981 news texts and the test set 12,260.
Setting model parameters:
the default hyper-parameters of the model are used for training the hyper-parameters of each model, and the specific settings are shown in tables 2 and 3:
parameter(s) | BERT |
max_seq_length | 50 |
dimsh | 768 |
TABLE 2 BERT model parameter settings
Parameter(s) | Bi-RNN | Bi-LSTM | ATT- Bi-RNN | ATT- Bi-LSTM |
batch_size | 128 | 128 | 128 | 128 |
epoch | 40 | 40 | 40 | 40 |
dropout | 0.5 | 0.5 | 0.5 | 0.5 |
learning_rate | 0.0001 | 0.0001 | 0.0001 | 0.0001 |
num_nodes | 128 | 128 | 128 | 128 |
max_length | 500 | 500 | 500 | 500 |
TABLE 3 depth model parameter settings
Here max_seq_length denotes the maximum length of an input text, dimsh the vector dimension of each word, batch_size the number of texts input per training step, epoch the number of training passes over all texts, dropout the dropout rate used to mitigate neural network overfitting, learning_rate the learning rate, num_nodes the number of neurons in the hidden layer, and max_length the number of training time steps.
Evaluation indexes are as follows:
The text classification experiment results are evaluated with the precision P, recall R, and F-value, computed as follows:

P = A / (A + B)
R = A / (A + C)
F = 2PR / (P + R)

where A, B, and C denote the numbers of correctly recognized, incorrectly recognized, and unrecognized Chinese texts, respectively.
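The evaluation indexes follow directly from the counts A, B, C; the sketch below uses hypothetical counts, for illustration only:

```python
# Evaluation indexes: with A correctly recognized, B incorrectly recognized,
# and C unrecognized texts:
#   P = A / (A + B),  R = A / (A + C),  F = 2PR / (P + R)
def precision(a, b):
    return a / (a + b)

def recall(a, c):
    return a / (a + c)

def f_value(p, r):
    return 2 * p * r / (p + r)

# Hypothetical counts for illustration only.
A, B, C = 80, 10, 20
P, R = precision(A, B), recall(A, C)
print(round(P, 3), round(R, 3), round(f_value(P, R), 3))  # -> 0.889 0.8 0.842
```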
The experimental results are as follows:
To verify the superiority of the proposed Chinese text classification method integrating radical semantics, and the effectiveness of radicals in the Chinese text classification task, the classification task is completed with two different methods. Method one: training without radicals. Method two: the method of the invention, i.e., training with radicals integrated. The results are shown in Table 4.
TABLE 4 results of the first and second methods
In Table 4, the F value of the Bi-LSTM model is about 0.03 higher than that of the Bi-RNN model: the RNN suffers from vanishing gradients while propagating parameter updates backwards, losing semantic information and making long-range semantic dependencies hard to learn, whereas the gate structure introduced by the LSTM learns these long-range dependencies well, so the F value of the Bi-LSTM model improves. The F value of the ATT-Bi-LSTM model is about 0.2 higher than that of the Bi-LSTM model, and that of the ATT-Bi-RNN model about 0.2 higher than that of the Bi-RNN model: the attention mechanism automatically finds the words that play a key role in classification and gives them different weights, capturing the more important semantic information in the text. Comparing methods one and two for each deep learning model, the F value of method two is higher, showing that radicals yield more accurate semantic information and improve the Chinese text classification effect. The average P, R, and F of all the deep learning models is about 0.82, showing that the proposed Chinese text classification model achieves a good text classification effect.
Claims (8)
1. A Chinese text classification method integrating radical semantics, characterized by comprising the following steps:
s1, forming the components of each character in the Chinese text set into a component set, and forming the Chinese text set and the component set into a training set;
s2, vectorizing and representing the Chinese text set and the component set in the training set to obtain a Chinese text vector and a component vector;
s3, extracting the characteristics of the Chinese text vector and the radical vector by using a deep learning model to obtain a Chinese text characteristic vector and a radical characteristic vector;
and S4, fusing the Chinese text feature vectors and the radical feature vectors, and classifying the Chinese text by using a classifier.
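Steps S1 to S4 above can be sketched end to end as follows. This is an illustrative toy, not the claimed implementation: the radical table, helper names, and the stand-ins for BERT vectorization, deep-learning feature extraction, and the classifier are all assumptions.

```python
# Toy sketch of the S1-S4 pipeline. RADICALS, the helpers, and the
# threshold classifier are illustrative assumptions only.

RADICALS = {"河": "氵", "湖": "氵", "树": "木", "林": "木"}  # S1: char -> radical

def build_training_set(texts):
    """S1: pair each Chinese text with its radical sequence."""
    return [(t, "".join(RADICALS.get(ch, ch) for ch in t)) for t in texts]

def vectorize(s):
    """S2: stand-in for BERT vectorization (here a toy ordinal encoding)."""
    return [ord(ch) % 97 for ch in s]

def extract_features(vec):
    """S3: stand-in for the deep-learning feature extractor (mean pooling)."""
    return sum(vec) / len(vec)

def classify(text_feat, radical_feat):
    """S4: fuse both feature streams and apply a trivial threshold classifier."""
    fused = text_feat + radical_feat
    return "water-related" if fused > 50 else "other"

pairs = build_training_set(["河湖"])
text, rads = pairs[0]
label = classify(extract_features(vectorize(text)), extract_features(vectorize(rads)))
print(rads, label)
```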
2. The Chinese text classification method integrating radical semantics of claim 1, characterized in that: in step S1, the Chinese text set is obtained by preprocessing a data set, including removing noise data irrelevant to text classification from the data set.
3. The Chinese text classification method integrating radical semantics of claim 1, characterized in that: in step S1, the radical set is obtained by mapping Chinese characters to radicals in the Xinhua Dictionary data set.
4. The Chinese text classification method integrating radical semantics of claim 1, characterized in that: step S2 uses the BERT model to vectorize the Chinese text set and the radical set in the training set, specifically comprising the following procedure:
S2.1, representing the Chinese text set and the radical set each by a word vector, a segment vector and a position vector;
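BERT's input representation in S2.1 is the element-wise sum of a token (word) embedding, a segment embedding, and a position embedding. A toy sketch of that sum (not part of the claims; the random tables, vocabulary size, and dimension are illustrative assumptions, not the actual BERT weights):

```python
# BERT-style input representation: token + segment + position embeddings,
# summed element-wise. Tables are random placeholders, not trained weights.

import random

DIM = 4
random.seed(0)

def table(n):
    """A toy embedding table: n rows of DIM random values."""
    return [[random.uniform(-1, 1) for _ in range(DIM)] for _ in range(n)]

token_emb, segment_emb, position_emb = table(100), table(2), table(16)

def bert_input(token_ids, segment_id=0):
    """Element-wise sum of the three embeddings for each input position."""
    return [
        [token_emb[t][d] + segment_emb[segment_id][d] + position_emb[pos][d]
         for d in range(DIM)]
        for pos, t in enumerate(token_ids)
    ]

vecs = bert_input([5, 9, 3])
print(len(vecs), len(vecs[0]))  # 3 4
```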
5. The Chinese text classification method integrating radical semantics of claim 1, characterized in that: in step S3, a deep learning model is selected according to the characteristics of the Chinese text set corresponding to the Chinese text vector, and feature extraction is performed on the Chinese text vector and the radical vector.
6. The Chinese text classification method integrating radical semantics of claim 5, characterized in that: the deep learning model comprises one of, or a combination of two or more of, the bidirectional recurrent neural network Bi-RNN, the bidirectional long short-term memory network Bi-LSTM, the attention-based bidirectional recurrent neural network ATT-Bi-RNN, and the attention-based bidirectional long short-term memory network ATT-Bi-LSTM, and feature extraction is carried out according to the following four cases:
a. if the Chinese text set corresponding to the Chinese text vector consists of simple short texts, the bidirectional recurrent neural network Bi-RNN is used for feature extraction on the Chinese text vector and the radical vector;
b. if the Chinese text set corresponding to the Chinese text vector consists of long texts with simple semantic expression, the bidirectional long short-term memory network Bi-LSTM is used for feature extraction on the Chinese text vector and the radical vector;
c. if the Chinese text set corresponding to the Chinese text vector consists of complex short texts, the attention-based bidirectional recurrent neural network ATT-Bi-RNN is used for feature extraction on the Chinese text vector and the radical vector;
d. if the Chinese text set corresponding to the Chinese text vector consists of long texts with complex semantic expression, the attention-based bidirectional long short-term memory network ATT-Bi-LSTM is used for feature extraction on the Chinese text vector and the radical vector;
wherein the update of the RNN neuron information at time t is expressed by the following equations:

h_t = tanh(W_x · x_t + W_h · h_{t-1} + b_h)
y_t = softmax(W_y · h_t + b_y)

wherein h_t denotes the information of the hidden layer at time t, h_{t-1} denotes the information of the hidden layer at time t-1, W_x denotes the weight matrix of the input information, W_h denotes the weight matrix for updating the information of time t-1, x_t denotes the information of the input layer at time t (when t = 1, x_1 is the Chinese text vector or the radical vector), b_h denotes the bias matrix for updating the information of time t-1, y_t denotes the information output by the hidden layer at time t, W_y denotes the weight matrix of the hidden-layer output information at time t, b_y denotes the bias matrix of the hidden-layer output information at time t, tanh is the hyperbolic tangent function, and softmax is the normalized exponential function;
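The RNN update described above can be sketched for tiny dense vectors as follows. The weight values are illustrative constants, not trained parameters:

```python
# One RNN step: h_t = tanh(Wx·x_t + Wh·h_{t-1} + b_h), y_t = softmax(Wy·h_t + b_y).
# Pure-Python matrix helpers; all weights below are illustrative.

import math

def matvec(W, x):
    return [sum(w * xi for w, xi in zip(row, x)) for row in W]

def vadd(*vs):
    return [sum(t) for t in zip(*vs)]

def softmax(v):
    m = max(v)                             # shift for numerical stability
    e = [math.exp(x - m) for x in v]
    s = sum(e)
    return [x / s for x in e]

def rnn_step(x_t, h_prev, Wx, Wh, bh, Wy, by):
    # hidden state: h_t = tanh(Wx·x_t + Wh·h_{t-1} + b_h)
    h_t = [math.tanh(v) for v in vadd(matvec(Wx, x_t), matvec(Wh, h_prev), bh)]
    # output: y_t = softmax(Wy·h_t + b_y)
    y_t = softmax(vadd(matvec(Wy, h_t), by))
    return h_t, y_t

Wx = [[0.5, 0.0], [0.0, 0.5]]; Wh = [[0.1, 0.0], [0.0, 0.1]]
Wy = [[1.0, 0.0], [0.0, 1.0]]; bh = [0.0, 0.0]; by = [0.0, 0.0]
h, y = rnn_step([1.0, -1.0], [0.0, 0.0], Wx, Wh, bh, Wy, by)
print([round(v, 3) for v in y])
```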
The update of the LSTM neuron information at time t is expressed by the following equations:

f_t = σ(W_f · [h_{t-1}, x_t] + b_f)
i_t = σ(W_i · [h_{t-1}, x_t] + b_i)
o_t = σ(W_o · [h_{t-1}, x_t] + b_o)
c̃_t = tanh(W_c · [h_{t-1}, x_t] + b_c)
c_t = f_t ⊙ c_{t-1} + i_t ⊙ c̃_t
h_t = o_t ⊙ tanh(c_t)

wherein σ denotes the sigmoid activation function, W_f denotes the forget-gate weight matrix, W_o denotes the output-gate weight matrix, W_i denotes the input-gate weight matrix, W_c denotes the weight matrix of the current information, b_f denotes the forget-gate bias matrix, b_i denotes the input-gate bias matrix, b_o denotes the output-gate bias matrix, b_c denotes the bias matrix of the current information, c̃_t denotes the temporary cell information at time t, c_t denotes the cell information at time t, c_{t-1} denotes the cell information at time t-1, x_t denotes the input at time t (when t = 1, x_1 is the Chinese text vector or the radical vector), h_{t-1} denotes the hidden-layer information at time t-1, and h_t denotes the hidden-layer information at time t;
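For intuition, the LSTM update above can be run for scalar states. This is a sketch only: the per-gate weight pairs are illustrative constants, not trained parameters:

```python
# One LSTM step for scalar states. W and b hold per-gate
# (input-weight, hidden-weight) pairs and biases keyed 'f', 'i', 'o', 'c'.

import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def lstm_step_scalar(x_t, h_prev, c_prev, W, b):
    gate = lambda k, act: act(W[k][0] * x_t + W[k][1] * h_prev + b[k])
    f_t = gate("f", sigmoid)            # forget gate
    i_t = gate("i", sigmoid)            # input gate
    o_t = gate("o", sigmoid)            # output gate
    c_tilde = gate("c", math.tanh)      # candidate (temporary) cell information
    c_t = f_t * c_prev + i_t * c_tilde  # c_t = f_t ⊙ c_{t-1} + i_t ⊙ c̃_t
    h_t = o_t * math.tanh(c_t)          # h_t = o_t ⊙ tanh(c_t)
    return h_t, c_t

W = {k: (0.5, 0.5) for k in "fioc"}
b = {k: 0.0 for k in "fioc"}
h, c = lstm_step_scalar(x_t=1.0, h_prev=0.0, c_prev=0.0, W=W, b=b)
print(round(h, 4), round(c, 4))
```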
the ATT attention handling mechanism is expressed by the following formula:
whereinHRepresents the vector sum of the output layers of the bidirectional recurrent neural network ATT-Bi-RNN or the bidirectional long and short memory network ATT-Bi-LSTM introducing attention mechanism,Mto representHThe vector matrix after the calculation of the tanh function,a transposed matrix representing the weights of the keywords,represents passing throughThe vector matrix after the function is calculated,to representThe transpose matrix of (a) is,Yrepresenting the output of the ATT attention handling mechanism.
7. The Chinese text classification method integrating radical semantics of claim 1, characterized in that: in step S4, the Chinese text feature vector and the radical feature vector are fused using the following formula:
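The patent's exact fusion formula appears as an image in the source and is not reproduced here. A common way to fuse two feature vectors is concatenation; the sketch below uses that as an assumption, not as the claimed formula:

```python
# Fusing two feature streams by concatenation (an assumption; the
# patent's actual fusion formula is not reproduced in this text).

def fuse(text_feat, radical_feat):
    """Concatenate the Chinese-text and radical feature vectors."""
    return list(text_feat) + list(radical_feat)

fused = fuse([0.2, 0.7], [0.5])
print(fused)  # [0.2, 0.7, 0.5]
```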
8. The Chinese text classification method integrating radical semantics of claim 7, characterized in that: in step S4, the softmax classifier is used to classify the Chinese text, expressed by the following formula:
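A softmax classifier maps the fused feature vector to a probability over classes and predicts the class with the highest probability. A minimal sketch (the weights, labels, and feature values are illustrative, and the claimed formula itself is not reproduced in this text):

```python
# Softmax classification of a fused feature vector.

import math

def softmax_classify(feat, W, b, labels):
    # logits: one linear score per class
    logits = [sum(w * f for w, f in zip(row, feat)) + bi for row, bi in zip(W, b)]
    mx = max(logits)                       # shift for numerical stability
    e = [math.exp(z - mx) for z in logits]
    probs = [x / sum(e) for x in e]        # normalized exponential (softmax)
    return labels[probs.index(max(probs))], probs

W = [[1.0, -1.0], [-1.0, 1.0]]  # 2 classes x 2 features, illustrative
b = [0.0, 0.0]
label, probs = softmax_classify([0.9, 0.1], W, b, ["sports", "finance"])
print(label)  # sports
```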
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202110388441.3A CN113157921B (en) | 2021-04-12 | 2021-04-12 | Chinese text classification method integrating radical semantics |
Publications (2)
Publication Number | Publication Date |
---|---|
CN113157921A true CN113157921A (en) | 2021-07-23 |
CN113157921B CN113157921B (en) | 2021-11-23 |
Family
ID=76889935
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN202110388441.3A Active CN113157921B (en) | 2021-04-12 | 2021-04-12 | Chinese text classification method integrating radical semantics |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN113157921B (en) |
Citations (8)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN107301225A (en) * | 2017-06-20 | 2017-10-27 | 挖财网络技术有限公司 | Short text classification method and device |
CN108304376A (en) * | 2017-12-15 | 2018-07-20 | 腾讯科技(深圳)有限公司 | Determination method, apparatus, storage medium and the electronic device of text vector |
US20180240012A1 (en) * | 2017-02-17 | 2018-08-23 | Wipro Limited | Method and system for determining classification of text |
CN109471946A (en) * | 2018-11-16 | 2019-03-15 | 中国科学技术大学 | A kind of classification method and system of Chinese text |
CN111476031A (en) * | 2020-03-11 | 2020-07-31 | 重庆邮电大学 | Improved Chinese named entity recognition method based on L attice-L STM |
US20210034707A1 (en) * | 2019-07-30 | 2021-02-04 | Intuit Inc. | Neural network system for text classification |
CN112464663A (en) * | 2020-12-01 | 2021-03-09 | 小牛思拓(北京)科技有限公司 | Multi-feature fusion Chinese word segmentation method |
CN112559744A (en) * | 2020-12-07 | 2021-03-26 | 中国科学技术大学 | Chinese text classification method and device based on radical association mechanism |
Non-Patent Citations (4)
Title |
---|
HANQING TAO: "A Radical-Aware Attention-Based Model for Chinese Text Classification", The Thirty-Third AAAI Conference on Artificial Intelligence *
PENG ZHOU: "Attention-Based Bidirectional Long Short-Term Memory Networks for Relation Classification", Proceedings of the 54th Annual Meeting of the Association for Computational Linguistics *
WENTING LI: "The Automatic Text Classification Method Based on BERT and Feature Union", 2019 IEEE 25th International Conference on Parallel and Distributed Systems *
LIU ZHEYUAN: "Research on a Deep Learning Sentiment Classification Architecture Based on Character-Granularity Multi-Dimensional Features", Scientific Consulting (Science & Technology / Management) *
Legal Events
Date | Code | Title | Description |
---|---|---|---|
PB01 | Publication | ||
SE01 | Entry into force of request for substantive examination | ||
GR01 | Patent grant | ||