CN113157921A - Chinese text classification method integrating radical semantics - Google Patents

Chinese text classification method integrating radical semantics

Info

Publication number
CN113157921A
CN113157921A (application CN202110388441.3A)
Authority
CN
China
Prior art keywords: chinese text, vector, radical, chinese, information
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Granted
Application number
CN202110388441.3A
Other languages
Chinese (zh)
Other versions
CN113157921B (en)
Inventor
刘忠宝 (Liu Zhongbao)
荀恩东 (Xun Endong)
赵文娟 (Zhao Wenjuan)
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
BEIJING LANGUAGE AND CULTURE UNIVERSITY
Original Assignee
BEIJING LANGUAGE AND CULTURE UNIVERSITY
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by BEIJING LANGUAGE AND CULTURE UNIVERSITY filed Critical BEIJING LANGUAGE AND CULTURE UNIVERSITY
Priority to CN202110388441.3A priority Critical patent/CN113157921B/en
Publication of CN113157921A publication Critical patent/CN113157921A/en
Application granted granted Critical
Publication of CN113157921B publication Critical patent/CN113157921B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical

Classifications

    • G — PHYSICS
    • G06 — COMPUTING; CALCULATING OR COUNTING
    • G06F — ELECTRIC DIGITAL DATA PROCESSING
    • G06F 16/35 — Information retrieval of unstructured textual data: clustering; classification
    • G06F 40/126 — Handling natural language data: character encoding
    • G06F 40/289 — Natural language analysis: phrasal analysis, e.g. finite state techniques or chunking
    • G06F 40/30 — Natural language analysis: semantic analysis
    • G06N — COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N 3/044 — Neural networks: recurrent networks, e.g. Hopfield networks
    • G06N 3/045 — Neural networks: combinations of networks
    • G06N 3/08 — Neural networks: learning methods


Abstract

The invention provides a Chinese text classification method that integrates radical semantics. First, a BERT model produces vectorized representations of the Chinese text and of its radicals; then, different deep learning models extract features from the text and from the radicals respectively; finally, the two feature vectors are fused and a softmax classifier performs the classification. The proposal exploits the fact that Chinese, as a unique pictographic language, carries rich semantic information in the radicals of its characters, which plays an important role in the semantic understanding of Chinese text and can further improve classifier performance. Because the radicals carry this information compactly, training complexity is greatly reduced, achieving the goals of shorter learning time, higher classification efficiency, an organic combination of the advantages of different Chinese text classification methods, and efficient, accurate Chinese text classification.

Description

Chinese text classification method integrating radical semantics
Technical Field
The invention relates to a Chinese text classification method that integrates radical semantics, belonging to the technical field of computers.
Background
Chinese text classification refers to the process of automatically classifying Chinese texts according to a given classification system or set of rules; it is widely applied in information indexing, digital library management, information filtering, and other fields.
Methods for Chinese text classification are generally classified into three categories: a classification method based on Knowledge Engineering (KE), a classification method based on Machine Learning (ML), and a classification method based on Deep Learning (DL).
The classification method based on knowledge engineering classifies texts manually according to rules written by domain experts for the classification task. It is plainly inefficient and limited, and although it achieved some results, it was quickly abandoned.
The classification method based on machine learning lets a computer independently learn and extract text classification rules to classify texts automatically. It is efficient and highly portable and is widely applied in Chinese text classification, but shortcomings remain. For example: the classification effect of the naive Bayes algorithm depends on the prior probability, and the representation of the input data strongly affects the classification result; support vector machine algorithms are sensitive to missing data and have no general solution to nonlinear problems; decision tree algorithms tend to ignore correlations among data set attributes and are prone to overfitting; neural network algorithms have a large number of parameters to determine during training, their internal learning process cannot be observed, training takes a long time, and the output is difficult to interpret.
The deep learning-based classification method extracts features of the Chinese text while constructing a deep learning model, obtaining higher-level, more abstract semantic representations with which the text is classified. A typical approach first uses the BERT pre-trained language model to represent the sentences of a text as feature vectors, then feeds these vectors into a softmax regression model, and achieves good Chinese text classification results. However, such methods are transplanted directly from English text classification and ignore the characteristics of Chinese characters; moreover, the pre-trained model is huge and requires large amounts of data and equipment resources to complete the training process.
Although these three families of Chinese text classification methods can meet the basic requirement of classifying Chinese texts, problems remain, such as low algorithmic efficiency, poor domain specificity, and a tendency to overfit during learning. How to reduce learning time, improve classification efficiency, organically combine the advantages of the different methods, and achieve efficient and accurate Chinese text classification is a hot research topic in natural language processing.
Disclosure of Invention
To overcome the defects of the prior art, the invention provides a Chinese text classification method that integrates radical semantics: a BERT model first produces vectorized representations of the Chinese text and of its radicals; different deep learning models then extract features from each; finally, the two feature vectors are fused and a softmax classifier performs the classification, achieving efficient and accurate Chinese text classification.
The technical scheme adopted by the invention for solving the technical problem is as follows: the Chinese text classification method integrated with the radical semantics comprises the following steps:
S1, forming the radicals of every character in the Chinese text set into a radical set, and forming the Chinese text set and the radical set into a training set;
S2, vectorizing the Chinese text set and the radical set in the training set to obtain a Chinese text vector and a radical vector;
S3, extracting features from the Chinese text vector and the radical vector with a deep learning model to obtain a Chinese text feature vector and a radical feature vector;
S4, fusing the Chinese text feature vector and the radical feature vector, and classifying the Chinese text with a classifier (a minimal sketch of the four steps follows).
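For orientation, the following minimal Python sketch strings steps S1-S4 together. The helper names, the toy radical table, and the stand-in encoders are all hypothetical: the BERT encoder and the Bi-RNN/Bi-LSTM feature extractors described later are replaced here by simple placeholders.

    import numpy as np

    RADICALS = {"江": "氵", "海": "氵", "踢": "⻊", "跑": "⻊"}  # toy radical table

    def to_radicals(text):                    # S1: character -> radical sequence
        return "".join(RADICALS.get(ch, ch) for ch in text)

    def vectorize(seq, dim=8):                # S2: stand-in for the BERT encoder
        rng = np.random.default_rng(abs(hash(seq)) % (2 ** 32))
        return rng.standard_normal(dim)

    def extract_features(vec):                # S3: stand-in for Bi-RNN / Bi-LSTM
        return np.tanh(vec)

    def classify(text_feat, rad_feat, W, b):  # S4: fuse, then softmax
        v = np.concatenate([text_feat, rad_feat])
        logits = W @ v + b
        e = np.exp(logits - logits.max())
        return e / e.sum()

    text = "踢球跑到江海"
    probs = classify(extract_features(vectorize(text)),
                     extract_features(vectorize(to_radicals(text))),
                     W=np.zeros((3, 16)), b=np.zeros(3))
    print(probs)  # uniform over 3 classes, since W and b are zero here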
In step S1, the data set is preprocessed to obtain the Chinese text set, including removing noise data in the data set that is unrelated to text classification.
In step S1, the radical set is obtained by mapping Chinese characters to radicals using the Xinhua Dictionary data set.
Step S2 adopts the BERT model to vectorize the Chinese text set and the radical set in the training set, specifically as follows:
S2.1, representing the Chinese text set and the radical set each with word vectors, segment vectors and position vectors, denoting the Chinese text set as E_text and the radical set as E_rad;
S2.2, inputting E_text and E_rad into the encoder of the BERT model and training to obtain the Chinese text vector T_text and the radical vector T_rad.
In step S3, a deep learning model is selected according to the characteristics of the Chinese text set corresponding to the Chinese text vector to perform feature extraction on the Chinese text vector and the radical vector.
The deep learning model comprises one of, or a combination of two or more of, the bidirectional recurrent neural network Bi-RNN, the bidirectional long short-term memory network Bi-LSTM, the attention-based bidirectional recurrent neural network ATT-Bi-RNN, and the attention-based bidirectional long short-term memory network ATT-Bi-LSTM, and feature extraction is carried out according to the following four cases:
a. if the Chinese text set corresponding to the Chinese text vector consists of simple short texts, the bidirectional recurrent neural network Bi-RNN is adopted to extract features from the Chinese text vector and the radical vector;
b. if the Chinese text set corresponding to the Chinese text vector consists of complex short texts, the attention-based bidirectional recurrent neural network ATT-Bi-RNN is adopted to extract features from the Chinese text vector and the radical vector;
c. if the Chinese text set corresponding to the Chinese text vector consists of long texts with simple semantic expression, the bidirectional long short-term memory network Bi-LSTM is adopted to extract features from the Chinese text vector and the radical vector;
d. if the Chinese text set corresponding to the Chinese text vector consists of long texts with complex semantic expression, the attention-based bidirectional long short-term memory network ATT-Bi-LSTM is adopted to extract features from the Chinese text vector and the radical vector;
wherein the update of the RNN neuron information at time t is expressed by the following equations:

h_t = tanh(W_x · x_t + W_h · h_{t-1} + b_h)   (1)

o_t = softmax(W_o · h_t + b_o)   (2)

where h_t denotes the hidden-layer information at time t, h_{t-1} the hidden-layer information at time t-1, W_x the weight matrix of the input information, W_h the weight matrix for updating the time t-1 information, x_t the input-layer information at time t (when t = 1, x_t is the Chinese text vector or the radical vector), b_h the bias-value matrix for updating the time t-1 information, o_t the information output by the hidden layer at time t, W_o the weight matrix for updating the hidden-layer output information at time t, b_o the bias-value matrix for updating the hidden-layer output information at time t, tanh the hyperbolic tangent function, and softmax the normalized exponential function;
the update of the LSTM neuron information at time t is expressed by the following equations:

f_t = σ(W_f · [h_{t-1}, x_t] + b_f)   (3)

i_t = σ(W_i · [h_{t-1}, x_t] + b_i)   (4)

o_t = σ(W_o · [h_{t-1}, x_t] + b_o)   (5)

c̃_t = tanh(W_c · [h_{t-1}, x_t] + b_c)   (6)

c_t = f_t ⊙ c_{t-1} + i_t ⊙ c̃_t   (7)

h_t = o_t ⊙ tanh(c_t)   (8)

where σ denotes the sigmoid activation function, W_f the forget-gate weight matrix, W_o the output-gate weight matrix, W_i the input-gate weight matrix, W_c the weight matrix of the current information, b_f the forget-gate bias-value matrix, b_i the input-gate bias-value matrix, b_o the output-gate bias-value matrix, b_c the bias-value matrix of the current information, c̃_t the temporary variable of the time t information, c_t the cell information at time t, c_{t-1} the cell information at time t-1, x_t the input at time t (when t = 1, x_t is the Chinese text vector or the radical vector), h_{t-1} the hidden-layer information at time t-1, and h_t the hidden-layer information at time t;
the ATT attention handling mechanism is expressed by the following formula:
Figure 849052DEST_PATH_IMAGE040
(9)
Figure 100002_DEST_PATH_IMAGE041
(10)
Figure 357656DEST_PATH_IMAGE042
(11)
whereinHRepresents the vector sum of the output layers of the bidirectional recurrent neural network ATT-Bi-RNN or the bidirectional long and short memory network ATT-Bi-LSTM introducing attention mechanism,Mto representHThe vector matrix after the calculation of the tanh function,
Figure 100002_DEST_PATH_IMAGE043
a transposed matrix representing the weights of the keywords,
Figure 303615DEST_PATH_IMAGE044
represents passing through
Figure 100002_DEST_PATH_IMAGE045
The vector matrix after the function is calculated,
Figure 134912DEST_PATH_IMAGE046
to represent
Figure 838425DEST_PATH_IMAGE044
The transpose matrix of (a) is,Yrepresenting the output of the ATT attention handling mechanism.
In step S4, the Chinese text feature vector and the radical feature vector are fused using the following formula:

V = V_text ⊕ V_rad   (12)

where V denotes the fused feature vector, V_text the Chinese text feature vector, and V_rad the radical feature vector.
In step S4, the softmax classifier is used to classify the Chinese text, expressed by the following formula:

R = softmax(W · V + b)   (13)

where R denotes the Chinese text classification result, W the weight matrix, and b the bias-value matrix.
Based on the above technical scheme, the invention has the following beneficial effects:
The proposed Chinese text classification method integrating radical semantics takes the Chinese text and its radicals as research objects and obtains rich semantic information from both. A BERT model first produces vectorized representations of the Chinese text and the radicals; different deep learning models then extract features from each; finally, the two feature vectors are fused and a softmax classifier performs the classification. The experimental results not only show the superiority of the proposed Chinese text classification model but also verify the effectiveness of radicals in the Chinese text classification task. The method overcomes the low efficiency, poor domain specificity and tendency to overfit of traditional Chinese text classification algorithms, reduces learning time, improves classification efficiency, organically combines the advantages of different Chinese text classification methods, and achieves efficient and accurate Chinese text classification.
Drawings
FIG. 1 is a model diagram of the Chinese text classification method integrating radical semantics according to the present invention.
FIG. 2 is a schematic diagram of a BERT model training process.
FIG. 3 is a schematic diagram of the Bi-RNN model.
FIG. 4 is a schematic diagram of RNN neuron structure.
FIG. 5 is a diagram of the Bi-LSTM model.
FIG. 6 is a schematic diagram of the structure of an LSTM neuron.
Detailed Description
The invention is further illustrated by the following figures and examples.
The research idea of the invention is as follows:
Chinese is a language derived from pictographs: not only do the characters themselves express specific semantic information, but their radicals also contain rich semantic information, as shown in Table 1:
Radical   Name                        Examples
扌        hand radical (提手旁)        pick, pluck, carry
⻊        foot radical (足字旁)        kick, run, jump
氵        three-dot water (三点水)     river, sea
疒        sickness radical (病字旁)    pain, ache, scar
米        rice radical (米字旁)        powder, material, grain
土        earth radical (提土旁)       ground, city

TABLE 1  Introduction to the radicals
Use "hand" to "beat" or "pluck"; use the "foot" to "kick" or "run"; "river" and "sea" are related to the meaning of "water"; "ground" and "city" are related to the meaning of "soil", etc., and these examples fully reveal the importance of the radical for semantic understanding, but existing research rarely uses the radical for Chinese text classification. Therefore, the method takes the Chinese text and the components as research objects, obtains richer semantic information from the research objects, and improves the classification effect of the Chinese text.
Example:
With reference to FIG. 1, the Chinese text classification method integrating radical semantics provided by the invention comprises the following steps:
S1, training set preprocessing: form the radicals of every character in the Chinese text set into a radical set, and form the Chinese text set and the radical set into a training set. First, noise data of the Chinese data set that is unrelated to text classification is removed, for example stop words, web links, and English letters. Then, to obtain the radical of each Chinese character in the data set, the Xinhua Dictionary data set [1] is used to map each Chinese character to its radical; this data set covers all Chinese characters and radicals appearing in the corpus, comprising 20849 Chinese characters and 270 radicals;
for example, for the text "the player comes from a football family" in the Chinese text set, the corresponding radical sequence is formed from the radical of each of its characters.
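A hedged sketch of this S1 mapping, assuming the Xinhua Dictionary mapping has been exported to a two-column file; the file name xinhua_radicals.tsv and its char<TAB>radical layout are assumptions, not part of the patent:

    def load_radical_table(path="xinhua_radicals.tsv"):
        table = {}
        with open(path, encoding="utf-8") as f:
            for line in f:
                char, radical = line.rstrip("\n").split("\t")
                table[char] = radical
        return table

    def text_to_radicals(text, table):
        # characters absent from the table (digits, punctuation) pass through
        return "".join(table.get(ch, ch) for ch in text)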
S2, vectorize the Chinese text set and the radical set in the training set to obtain the Chinese text vector and the radical vector. Referring to FIG. 2, the BERT model can be used to vectorize the Chinese text set and the radical set respectively. The BERT model is built on the Transformer encoder [8] and is a bidirectional language model obtained by improving the main structure of the GPT (Generative Pre-Training) language model. As shown in FIG. 1, when the BERT model vectorizes a Chinese text ("the player comes from a football family") or its radicals, some words in the text or the radical sequence are randomly masked and the unmasked ones are used for prediction, which gives the trained word vector model better generalization ability. The vectorization process of the BERT model for a Chinese text and its radicals is shown in FIG. 2, in which E denotes the sum of the word vectors, segment vectors and position vectors of the Chinese text or radicals, Trm denotes the Transformer encoder, and T denotes the vector of the Chinese text or radicals obtained after training.
The method specifically comprises the following steps:
S2.1, represent the Chinese text set and the radical set each with word vectors, segment vectors and position vectors, denoting the Chinese text set as E_text and the radical set as E_rad;
S2.2, input E_text and E_rad into the encoder of the BERT model and train to obtain the Chinese text vector T_text and the radical vector T_rad.
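As a sketch of S2, under the assumption that a pretrained Chinese BERT is acceptable in place of training from scratch, the Hugging Face transformers package can produce the token vectors; the model name bert-base-chinese is an assumption, and the patent's own masking-based training is not reproduced here:

    import torch
    from transformers import BertModel, BertTokenizer

    tokenizer = BertTokenizer.from_pretrained("bert-base-chinese")
    encoder = BertModel.from_pretrained("bert-base-chinese")

    def encode(sentence):
        inputs = tokenizer(sentence, return_tensors="pt",
                           truncation=True, max_length=50)
        with torch.no_grad():
            out = encoder(**inputs)
        return out.last_hidden_state.squeeze(0)  # one 768-d vector per token

    vecs = encode("该球员来自足球世家")  # Chinese text vectors; radicals analogous
    print(vecs.shape)

The radical sequence of a text would be encoded the same way to obtain T_rad.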
S3, extracting the characteristics of the Chinese text vector and the radical vector by using a deep learning model to obtain a Chinese text characteristic vector and a radical characteristic vector; and selecting a deep learning model according to the characteristics of the Chinese text set corresponding to the Chinese text vector to extract the characteristics of the Chinese text vector and the radical vector.
The deep learning model comprises one of, or a combination of two or more of, the Bidirectional Recurrent Neural Network (Bi-RNN), the Bidirectional Long Short-Term Memory network (Bi-LSTM), the Attention-Based Bidirectional Recurrent Neural Network (ATT-Bi-RNN), and the Attention-Based Bidirectional Long Short-Term Memory network (ATT-Bi-LSTM), and feature extraction is carried out according to the following four cases:
a. If the Chinese text set corresponding to the Chinese text vector consists of simple short texts, for example a casual remark such as "Zhao Liying really does look comfortable in those platform shoes", the bidirectional recurrent neural network Bi-RNN is adopted to extract features from the Chinese text vector and the radical vector; the Bi-RNN model is shown in FIG. 3, where RNN denotes a neuron of the recurrent neural network, as shown in FIG. 4;
b. If the Chinese text set corresponding to the Chinese text vector consists of long texts with simple semantic expression, the bidirectional long short-term memory network Bi-LSTM is adopted to extract features from the Chinese text vector and the radical vector, as shown in FIG. 5, where LSTM denotes a neuron of the long short-term memory network, as shown in FIG. 6;
in FIGS. 3 and 5, x_t denotes the information input to the depth model at each time instant, y_t the features learned by the depth model at each time instant, W_1 the weight matrix of updated information between the input layer and the forward layer, W_2 the weight matrix of updated information between the forward hidden layer at the previous instant and at the current instant, W_3 the weight matrix of updated information between the input layer and the backward layer, W_4 the weight matrix of updated information between the forward layer and the output layer, W_5 the weight matrix of updated information between the backward hidden layer at the later instant and at the current instant, and W_6 the weight matrix of updated information between the backward layer and the output layer.
c. If the Chinese text set corresponding to the Chinese text vector consists of complex short texts, the attention-based bidirectional recurrent neural network ATT-Bi-RNN is adopted to extract features from the Chinese text vector and the radical vector, as shown in FIG. 7, where ATT denotes the attention processing mechanism, as shown in FIG. 9;
the ATT-Bi-RNN can give different weights to different words in complex short-text expressions and highlight the role of keywords, thereby improving the classification of such texts. For example, given "Did the Revolution of 1911 have progressive significance?", the ATT-Bi-RNN model gives greater weight to "the Revolution of 1911" and therefore correctly classifies the text as a historical text.
d. If the Chinese text set corresponding to the Chinese text vector consists of long texts with complex semantic expression, the attention-based bidirectional long short-term memory network ATT-Bi-LSTM is adopted to extract features from the Chinese text vector and the radical vector, as shown in FIG. 9 (a sketch of this four-way model choice follows).
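A hedged sketch of the four-way choice among cases a-d; the length threshold and the complexity flag are assumptions, since the patent states the cases qualitatively rather than numerically:

    def choose_model(n_chars, is_complex, short_limit=50):
        short = n_chars <= short_limit      # assumed notion of "short text"
        if short and not is_complex:
            return "Bi-RNN"                 # case a: simple short text
        if not short and not is_complex:
            return "Bi-LSTM"                # case b: long text, simple semantics
        if short and is_complex:
            return "ATT-Bi-RNN"             # case c: complex short text
        return "ATT-Bi-LSTM"                # case d: long text, complex semantics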
In the deep learning models described above, the update of the RNN neuron information at time t is expressed by the following equations:

h_t = tanh(W_x · x_t + W_h · h_{t-1} + b_h)   (1)

o_t = softmax(W_o · h_t + b_o)   (2)

where h_t denotes the hidden-layer information at time t, h_{t-1} the hidden-layer information at time t-1, W_x the weight matrix of the input information, W_h the weight matrix for updating the time t-1 information, x_t the input-layer information at time t (when t = 1, x_t is the Chinese text vector or the radical vector), b_h the bias-value matrix for updating the time t-1 information, o_t the information output by the hidden layer at time t, W_o the weight matrix for updating the hidden-layer output information at time t, b_o the bias-value matrix for updating the hidden-layer output information at time t, tanh the hyperbolic tangent function, and softmax the normalized exponential function.
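Read literally, equations (1)-(2) amount to the following NumPy step (a sketch; the symbol names follow the reconstruction above):

    import numpy as np

    def softmax(z):
        e = np.exp(z - z.max())
        return e / e.sum()

    def rnn_step(x_t, h_prev, Wx, Wh, bh, Wo, bo):
        h_t = np.tanh(Wx @ x_t + Wh @ h_prev + bh)  # eq. (1)
        o_t = softmax(Wo @ h_t + bo)                # eq. (2)
        return h_t, o_t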
The depth model automatically updates the parameters of these matrices during training to obtain the optimal solution; in the training of the ATT-Bi-LSTM model, the parameters can be updated with the Adagrad method.
The update of the LSTM neuron information at time t is expressed by the following equations:

f_t = σ(W_f · [h_{t-1}, x_t] + b_f)   (3)

i_t = σ(W_i · [h_{t-1}, x_t] + b_i)   (4)

o_t = σ(W_o · [h_{t-1}, x_t] + b_o)   (5)

c̃_t = tanh(W_c · [h_{t-1}, x_t] + b_c)   (6)

c_t = f_t ⊙ c_{t-1} + i_t ⊙ c̃_t   (7)

h_t = o_t ⊙ tanh(c_t)   (8)

where σ denotes the sigmoid activation function, W_f the forget-gate weight matrix, W_o the output-gate weight matrix, W_i the input-gate weight matrix, W_c the weight matrix of the current information, b_f the forget-gate bias-value matrix, b_i the input-gate bias-value matrix, b_o the output-gate bias-value matrix, b_c the bias-value matrix of the current information, c̃_t the temporary variable of the time t information, c_t the cell information at time t, c_{t-1} the cell information at time t-1, x_t the input at time t (when t = 1, x_t is the Chinese text vector or the radical vector), h_{t-1} the hidden-layer information at time t-1, and h_t the hidden-layer information at time t.
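A sketch of one LSTM step implementing equations (3)-(8), with the gate parameters kept in dicts keyed by gate name; the dict layout and the concatenated [h_{t-1}, x_t] input are assumptions of the reconstruction:

    import numpy as np

    def sigmoid(z):
        return 1.0 / (1.0 + np.exp(-z))

    def lstm_step(x_t, h_prev, c_prev, W, b):
        z = np.concatenate([h_prev, x_t])
        f_t = sigmoid(W["f"] @ z + b["f"])      # forget gate, eq. (3)
        i_t = sigmoid(W["i"] @ z + b["i"])      # input gate, eq. (4)
        o_t = sigmoid(W["o"] @ z + b["o"])      # output gate, eq. (5)
        c_tilde = np.tanh(W["c"] @ z + b["c"])  # candidate cell, eq. (6)
        c_t = f_t * c_prev + i_t * c_tilde      # cell update, eq. (7)
        h_t = o_t * np.tanh(c_t)                # hidden state, eq. (8)
        return h_t, c_t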
The ATT attention processing mechanism is expressed by the following formulas:

M = tanh(H)   (9)

α = softmax(w^T · M)   (10)

Y = H · α^T   (11)

where H denotes the sum of the output-layer vectors of the attention-based bidirectional recurrent neural network ATT-Bi-RNN or the attention-based bidirectional long short-term memory network ATT-Bi-LSTM, M the vector matrix obtained from H through the tanh function, w^T the transposed keyword-weight matrix, α the vector matrix obtained after the softmax function, α^T the transpose of α, and Y the output of the ATT attention processing mechanism.
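A sketch of equations (9)-(11), taking H as a d × T matrix of Bi-RNN/Bi-LSTM outputs and w as the learned keyword-weight vector; both shapes are assumptions consistent with the reconstruction:

    import numpy as np

    def softmax(z):
        e = np.exp(z - z.max())
        return e / e.sum()

    def attention(H, w):
        M = np.tanh(H)          # eq. (9)
        alpha = softmax(w @ M)  # eq. (10): one weight per time step
        Y = H @ alpha           # eq. (11): attention-weighted summary
        return Y, alpha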
S4, fuse the Chinese text feature vector and the radical feature vector, and classify the Chinese text with a classifier. Specifically, the two feature vectors are fused using the following formula:

V = V_text ⊕ V_rad   (12)

where V denotes the fused feature vector, V_text the Chinese text feature vector, and V_rad the radical feature vector;

the Chinese text is then classified with the softmax classifier, expressed by the following formula:

R = softmax(W · V + b)   (13)

where R denotes the Chinese text classification result, W the weight matrix, and b the bias-value matrix.
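A sketch of S4 under the assumption that the fusion operator ⊕ in equation (12) is vector concatenation:

    import numpy as np

    def softmax(z):
        e = np.exp(z - z.max())
        return e / e.sum()

    def fuse_and_classify(v_text, v_rad, W, b):
        v = np.concatenate([v_text, v_rad])  # eq. (12): V = V_text ⊕ V_rad
        return softmax(W @ v + b)            # eq. (13): R = softmax(W·V + b)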
Experimental data:
In this embodiment, the THUCNews data set is selected for the experiments. The data set contains 740,000 Sina News texts. On the basis of the original Sina News classification system, English letters, English symbols, numbers and stop words were removed from the texts, and 70241 news texts were manually labeled and divided into 14 categories: finance, lottery, real estate, stock, home, education, science and technology, society, fashion, current affairs, sports, constellation, games and entertainment. The training set comprises 57981 news texts and the test set comprises 12260 news texts.
Setting model parameters:
the default hyper-parameters of the model are used for training the hyper-parameters of each model, and the specific settings are shown in tables 2 and 3:
parameter(s) BERT
max_seq_length 50
dimsh 768
TABLE 2 BERT model parameter settings
Parameter(s) Bi-RNN Bi-LSTM ATT- Bi-RNN ATT- Bi-LSTM
batch_size 128 128 128 128
epoch 40 40 40 40
dropout 0.5 0.5 0.5 0.5
learning_rate 0.0001 0.0001 0.0001 0.0001
num_nodes 128 128 128 128
max _length 500 500 500 500
TABLE 3 depth model parameter settings
where max_seq_length denotes the maximum length of an input text, dimsh the vector dimension of each word, batch_size the number of texts input per training step, epoch the number of training passes over all texts, dropout the parameter used against the neural network overfitting problem, learning_rate the learning rate, num_nodes the number of neurons in the hidden layer, and max_length the training time step.
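For reference, the same settings as a plain configuration mapping (illustrative only; the keys mirror the table rows):

    BERT_CONFIG = {"max_seq_length": 50, "dimsh": 768}
    DEPTH_CONFIG = {  # identical for Bi-RNN, Bi-LSTM, ATT-Bi-RNN, ATT-Bi-LSTM
        "batch_size": 128,
        "epoch": 40,
        "dropout": 0.5,
        "learning_rate": 1e-4,
        "num_nodes": 128,
        "max_length": 500,
    }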
Evaluation indexes are as follows:
the text classification experiment result is evaluated by using evaluation indexes such as accuracy P (precision), recall rate R (recall), F value (F-value) and the like, and the evaluation indexes are calculated as shown in the following formulas:
Figure 482562DEST_PATH_IMAGE064
(14)
Figure DEST_PATH_IMAGE065
(15)
Figure 658328DEST_PATH_IMAGE066
(16)
wherein A, B, C represents the number of correctly recognized, incorrectly recognized, and unrecognized chinese texts, respectively.
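Equations (14)-(16) computed directly (a sketch; A, B, C as defined above):

    def precision_recall_f(A, B, C):
        P = A / (A + B)          # eq. (14): precision
        R = A / (A + C)          # eq. (15): recall
        F = 2 * P * R / (P + R)  # eq. (16): F-value
        return P, R, F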
The experimental results are as follows:
To verify the superiority of the proposed Chinese text classification method integrating radical semantics and the effectiveness of radicals in the Chinese text classification task, the task is completed with two different methods. Method 1: training without radicals. Method 2: the method proposed by the invention, i.e., training with radicals. The experimental results are shown in Table 4.
[TABLE 4  Results of the first and second methods]
In Table 4, the F value of the Bi-LSTM model is about 0.03 higher than that of the Bi-RNN model: as the RNN propagates update parameters backwards, the gradient can vanish, semantic information is lost, and long-range dependencies are difficult to learn, whereas the gate structure introduced by the LSTM learns those long-range dependencies well, which raises the F value of Bi-LSTM. The F value of the ATT-Bi-LSTM model is about 0.2 higher than that of the LSTM model, and the F value of ATT-Bi-RNN is likewise about 0.2 higher than that of the RNN model: the attention mechanism automatically finds the words that are decisive for classification and gives them different weights, capturing the more important semantic information in the text, so the F value rises once attention is added. For every depth model, the F value of method 2 is higher than that of method 1, showing that more accurate semantic information can be obtained through the radicals and that they improve the Chinese text classification effect. The average P, R and F of all deep learning models is about 0.82, showing that the proposed Chinese text classification model achieves a good classification effect.

Claims (8)

1. A Chinese text classification method integrating radical semantics, characterized by comprising the following steps:
S1, forming the radicals of every character in the Chinese text set into a radical set, and forming the Chinese text set and the radical set into a training set;
S2, vectorizing the Chinese text set and the radical set in the training set to obtain a Chinese text vector and a radical vector;
S3, extracting features from the Chinese text vector and the radical vector with a deep learning model to obtain a Chinese text feature vector and a radical feature vector;
S4, fusing the Chinese text feature vector and the radical feature vector, and classifying the Chinese text with a classifier.
2. The Chinese text classification method integrating radical semantics of claim 1, characterized in that: in step S1, the data set is preprocessed to obtain the Chinese text set, including removing noise data in the data set that is unrelated to text classification.
3. The Chinese text classification method integrating radical semantics of claim 1, characterized in that: in step S1, the radical set is obtained by mapping Chinese characters to radicals using the Xinhua Dictionary data set.
4. The Chinese text classification method integrating radical semantics of claim 1, characterized in that: step S2 adopts the BERT model to vectorize the Chinese text set and the radical set in the training set, specifically as follows:
S2.1, representing the Chinese text set and the radical set each with word vectors, segment vectors and position vectors, denoting the Chinese text set as E_text and the radical set as E_rad;
S2.2, inputting E_text and E_rad into the encoder of the BERT model and training to obtain the Chinese text vector T_text and the radical vector T_rad.
5. The Chinese text classification method integrating radical semantics of claim 1, characterized in that: in step S3, a deep learning model is selected according to the characteristics of the Chinese text set corresponding to the Chinese text vector to perform feature extraction on the Chinese text vector and the radical vector.
6. The Chinese text classification method integrating radical semantics of claim 5, characterized in that: the deep learning model comprises one of, or a combination of two or more of, the bidirectional recurrent neural network Bi-RNN, the bidirectional long short-term memory network Bi-LSTM, the attention-based bidirectional recurrent neural network ATT-Bi-RNN and the attention-based bidirectional long short-term memory network ATT-Bi-LSTM, and feature extraction is carried out according to the following four cases:
a. if the Chinese text set corresponding to the Chinese text vector consists of simple short texts, the bidirectional recurrent neural network Bi-RNN is adopted to extract features from the Chinese text vector and the radical vector;
b. if the Chinese text set corresponding to the Chinese text vector consists of long texts with simple semantic expression, the bidirectional long short-term memory network Bi-LSTM is adopted to extract features from the Chinese text vector and the radical vector;
c. if the Chinese text set corresponding to the Chinese text vector consists of complex short texts, the attention-based bidirectional recurrent neural network ATT-Bi-RNN is adopted to extract features from the Chinese text vector and the radical vector;
d. if the Chinese text set corresponding to the Chinese text vector consists of long texts with complex semantic expression, the attention-based bidirectional long short-term memory network ATT-Bi-LSTM is adopted to extract features from the Chinese text vector and the radical vector;
wherein the update of the RNN neuron information at time t is expressed by the following equations:

h_t = tanh(W_x · x_t + W_h · h_{t-1} + b_h)   (1)

o_t = softmax(W_o · h_t + b_o)   (2)

where h_t denotes the hidden-layer information at time t, h_{t-1} the hidden-layer information at time t-1, W_x the weight matrix of the input information, W_h the weight matrix for updating the time t-1 information, x_t the input-layer information at time t (when t = 1, x_t is the Chinese text vector or the radical vector), b_h the bias-value matrix for updating the time t-1 information, o_t the information output by the hidden layer at time t, W_o the weight matrix for updating the hidden-layer output information at time t, b_o the bias-value matrix for updating the hidden-layer output information at time t, tanh the hyperbolic tangent function, and softmax the normalized exponential function;
the update of the LSTM neuron information at time t is expressed by the following equations:

f_t = σ(W_f · [h_{t-1}, x_t] + b_f)   (3)

i_t = σ(W_i · [h_{t-1}, x_t] + b_i)   (4)

o_t = σ(W_o · [h_{t-1}, x_t] + b_o)   (5)

c̃_t = tanh(W_c · [h_{t-1}, x_t] + b_c)   (6)

c_t = f_t ⊙ c_{t-1} + i_t ⊙ c̃_t   (7)

h_t = o_t ⊙ tanh(c_t)   (8)

where σ denotes the sigmoid activation function, W_f the forget-gate weight matrix, W_o the output-gate weight matrix, W_i the input-gate weight matrix, W_c the weight matrix of the current information, b_f the forget-gate bias-value matrix, b_i the input-gate bias-value matrix, b_o the output-gate bias-value matrix, b_c the bias-value matrix of the current information, c̃_t the temporary variable of the time t information, c_t the cell information at time t, c_{t-1} the cell information at time t-1, x_t the input at time t (when t = 1, x_t is the Chinese text vector or the radical vector), h_{t-1} the hidden-layer information at time t-1, and h_t the hidden-layer information at time t;
the ATT attention processing mechanism is expressed by the following formulas:

M = tanh(H)   (9)

α = softmax(w^T · M)   (10)

Y = H · α^T   (11)

where H denotes the sum of the output-layer vectors of the attention-based bidirectional recurrent neural network ATT-Bi-RNN or the attention-based bidirectional long short-term memory network ATT-Bi-LSTM, M the vector matrix obtained from H through the tanh function, w^T the transposed keyword-weight matrix, α the vector matrix obtained after the softmax function, α^T the transpose of α, and Y the output of the ATT attention processing mechanism.
7. The Chinese text classification method integrating radical semantics of claim 1, characterized in that: in step S4, the Chinese text feature vector and the radical feature vector are fused using the following formula:

V = V_text ⊕ V_rad   (12)

where V denotes the fused feature vector, V_text the Chinese text feature vector, and V_rad the radical feature vector.
8. The Chinese text classification method integrating radical semantics of claim 7, characterized in that: in step S4, the softmax classifier is used to classify the Chinese text, expressed by the following formula:

R = softmax(W · V + b)   (13)

where R denotes the Chinese text classification result, W the weight matrix, and b the bias-value matrix.
CN202110388441.3A 2021-04-12 2021-04-12 Chinese text classification method integrating radical semantics Active CN113157921B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202110388441.3A CN113157921B (en) 2021-04-12 2021-04-12 Chinese text classification method integrating radical semantics

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202110388441.3A CN113157921B (en) 2021-04-12 2021-04-12 Chinese text classification method integrating radical semantics

Publications (2)

Publication Number Publication Date
CN113157921A true CN113157921A (en) 2021-07-23
CN113157921B CN113157921B (en) 2021-11-23

Family

ID=76889935

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202110388441.3A Active CN113157921B (en) 2021-04-12 2021-04-12 Chinese text classification method integrating radical semantics

Country Status (1)

Country Link
CN (1) CN113157921B (en)

Patent Citations (8)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20180240012A1 (en) * 2017-02-17 2018-08-23 Wipro Limited Method and system for determining classification of text
CN107301225A (en) * 2017-06-20 2017-10-27 挖财网络技术有限公司 Short text classification method and device
CN108304376A (en) * 2017-12-15 2018-07-20 腾讯科技(深圳)有限公司 Determination method, apparatus, storage medium and the electronic device of text vector
CN109471946A (en) * 2018-11-16 2019-03-15 中国科学技术大学 A kind of classification method and system of Chinese text
US20210034707A1 (en) * 2019-07-30 2021-02-04 Intuit Inc. Neural network system for text classification
CN111476031A (en) * 2020-03-11 2020-07-31 重庆邮电大学 Improved Chinese named entity recognition method based on L attice-L STM
CN112464663A (en) * 2020-12-01 2021-03-09 小牛思拓(北京)科技有限公司 Multi-feature fusion Chinese word segmentation method
CN112559744A (en) * 2020-12-07 2021-03-26 中国科学技术大学 Chinese text classification method and device based on radical association mechanism

Non-Patent Citations (4)

* Cited by examiner, † Cited by third party
Title
HANQING TAO: "A Radical-Aware Attention-Based Model for Chinese Text Classification", The Thirty-Third AAAI Conference on Artificial Intelligence *
PENG ZHOU: "Attention-Based Bidirectional Long Short-Term Memory Networks for Relation Classification", Proceedings of the 54th Annual Meeting of the Association for Computational Linguistics *
WENTING LI: "The Automatic Text Classification Method Based on BERT and Feature Union", 2019 IEEE 25th International Conference on Parallel and Distributed Systems *
LIU ZHEYUAN (刘哲源): "Research on a Deep Learning Sentiment Classification Architecture Based on Character-Granularity Multi-Dimensional Features", Science Consulting (Science & Technology / Management) *

Also Published As

Publication number Publication date
CN113157921B (en) 2021-11-23


Legal Events

Code  Event
PB01  Publication
SE01  Entry into force of request for substantive examination
GR01  Patent grant