CN111177376B - Chinese text classification method based on BERT and CNN hierarchical connection - Google Patents

Chinese text classification method based on BERT and CNN hierarchical connection

Info

Publication number
CN111177376B
CN111177376B · Application CN201911302047.2A
Authority
CN
China
Prior art keywords
bert
model
sentence
cnn
sentences
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN201911302047.2A
Other languages
Chinese (zh)
Other versions
CN111177376A (en)
Inventor
马强
赵鸣博
孔维健
王晓峰
孙嘉瞳
邓开连
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Donghua University
Original Assignee
Donghua University
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Donghua University
Priority to CN201911302047.2A
Publication of CN111177376A
Application granted
Publication of CN111177376B
Legal status: Active
Anticipated expiration

Links

Classifications

    • G — PHYSICS
    • G06 — COMPUTING; CALCULATING OR COUNTING
    • G06F — ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 — Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/30 — Information retrieval of unstructured textual data
    • G06F16/35 — Clustering; Classification
    • G — PHYSICS
    • G06 — COMPUTING; CALCULATING OR COUNTING
    • G06N — COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 — Computing arrangements based on biological models
    • G06N3/02 — Neural networks
    • G06N3/04 — Architecture, e.g. interconnection topology
    • G06N3/045 — Combinations of networks

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Data Mining & Analysis (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Biomedical Technology (AREA)
  • Evolutionary Computation (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Artificial Intelligence (AREA)
  • Databases & Information Systems (AREA)
  • Biophysics (AREA)
  • Computational Linguistics (AREA)
  • Health & Medical Sciences (AREA)
  • General Health & Medical Sciences (AREA)
  • Molecular Biology (AREA)
  • Computing Systems (AREA)
  • Mathematical Physics (AREA)
  • Software Systems (AREA)
  • Machine Translation (AREA)

Abstract

The application relates to a Chinese text classification method based on the hierarchical connection of BERT and CNN, which is mainly used for solving Chinese text classification problems such as emotion analysis, core sentence recognition and relationship recognition. In the application, a CNN model and a BERT model are connected hierarchically to obtain a new model, BERT-CNN. By adding the CNN model, the BERT-CNN model can further process the sentence features extracted by the BERT model and obtain a more effective sentence semantic representation. Therefore, a better classification effect can be obtained in text classification tasks.

Description

Chinese text classification method based on BERT and CNN hierarchical connection
Technical Field
The application belongs to the technical field of natural language processing, and particularly relates to a Chinese text classification method based on the hierarchical connection of the deep learning models BERT and CNN.
Background
With the rapid development of the economy and the internet, more and more people choose to post various utterances on the internet. Faced with the large amount of text data on the network, how to efficiently obtain data of practical value from it has become a research hotspot. Question-answering robots, search, machine translation and emotion analysis are key application fields of natural language processing, and none of them can do without text classification technology, which forms their basis. Precisely because text classification technology is this basis, its accuracy requirements are high. Thus, text classification has been both a research hotspot and a difficulty over the years.
With the rapid development of machine learning, deep learning and related fields, text classification no longer depends on time-consuming and labor-intensive manual work, and has instead turned to automatic text classification technology. As accuracy has continued to improve, this technology has been widely applied to emotion analysis and junk text recognition. However, there are still areas where the effect is poor, such as illegal advertisement recognition, and the areas of emotion analysis and spam text recognition are in urgent need of higher accuracy.
At present, deep learning techniques achieve good results in text classification, but their effect depends on the extraction of sentence semantic features. Conventional deep learning models rely on vectorizing the words or characters in a sentence as model input, but this approach is sometimes limited by the quality of the vectorization, so that texts in different fields must be vectorized separately, which is relatively time-consuming and laborious. The model introduced herein not only achieves a better effect, but also does not need to vectorize words or characters separately for each field.
Disclosure of Invention
The purpose of the application is to further improve the classification effect on Chinese text.
In order to achieve the above purpose, the technical scheme of the application is to provide a Chinese text classification method based on the hierarchical connection of BERT and CNN, which is characterized by comprising the following steps:
step 1, pre-training a BERT model on a large number of public Chinese text data sets to obtain and store all parameters of the BERT model, wherein the BERT model is composed of 12 Transformer encoder layers;
step 2, connecting a CNN model and the BERT model hierarchically: the output at the first position of each layer in the 12-layer structure of the BERT model is used as the input of the CNN model, giving an input width of 12, so as to obtain a BERT-CNN model; in the BERT-CNN model, the input matrix of width 12 is subjected to convolution and max pooling operations by the CNN model to obtain a new, more effective sentence semantic feature vector, which is then fed into a fully connected layer and finally passed through a classifier;
step 3, initializing the parameters of the BERT model part, wherein the initial parameter values are the parameters obtained by the previous pre-training, and the parameters of the CNN model part are randomly generated from a normal distribution;
step 4, data preprocessing is carried out on the classification training set;
and step 5, retraining the BERT-CNN model on the preprocessed data set.
Preferably, in step 1, the Chinese text data sets used to pre-train the BERT model include an intra-sentence prediction training set and a sentence-pair continuity training set, wherein:
the construction process of the sentence internal prediction training set comprises the following steps:
after the data is segmented into sentences, 15% of the words in each sentence are randomly masked; 80% of these 15% of words are replaced with [MASK], 10% are kept as the original words, and the remaining 10% are replaced with a random word; a [CLS] character is spliced at the start position of the sentence, and the new sentence formed in this way is used as the input of the BERT model to predict the 15% of words that were masked;
the process of whether sentence pairs are continuous in training sets comprises the following steps:
after the data is segmented into sentences, any two sentences are joined into one sentence by [SEP], a [CLS] character is spliced at the start position of the sentence, and the new sentence formed in this way is used as the input of the BERT model to predict whether the two sentences are consecutive in the article; the output of the BERT model is a probability value that represents the probability that the two sentences are consecutive.
Preferably, in step 2, the core component of the Transformer encoder is a multi-head attention mechanism, the multi-head attention mechanism is composed of 8 self-attention mechanisms, and the output of the Transformer encoder is formed by splicing the outputs of the 8 self-attention mechanisms.
Preferably, in step 4, the data preprocessing includes removing part of invalid character strings in the sentence, and then segmenting the sentence by character.
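As an illustration of this preprocessing step, the following is a minimal Python sketch. The application does not enumerate which character strings count as invalid, so the URL-and-whitespace filter used here is only an assumption; the character-level segmentation follows the description above.

```python
import re

def preprocess(sentence):
    # Remove some invalid character strings; the exact filter (URLs, HTML
    # entities, whitespace) is an assumption for illustration only.
    sentence = re.sub(r"https?://\S+|&[a-z]+;|\s+", "", sentence)
    # Segment the cleaned sentence character by character.
    return list(sentence)

print(preprocess("这个产品的质量不好 http://example.com"))
# ['这', '个', '产', '品', '的', '质', '量', '不', '好']
```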
The application provides a Chinese text classification method based on Bidirectional Encoder Representations from Transformers (BERT for short) and Convolutional Neural Networks (CNN for short) hierarchical connection, which uses a BERT model and a CNN model to conduct hierarchical connection so as to further improve the capability of the model to extract sentence semantic features.
The application provides a Chinese text classification method based on BERT-CNN which, by adding a CNN model at the stage where the sentence semantic features are obtained, produces more effective sentence semantic features and therefore achieves a better effect in text classification than some current Chinese text classification models.
Drawings
FIG. 1 is a flow diagram of a method for classifying Chinese text based on BERT and CNN hierarchical connection according to the present application;
FIG. 2 is an internal structural diagram of the BERT-CNN model of the present application.
Detailed Description
The application will be further illustrated with reference to specific examples. It is to be understood that these examples are illustrative of the present application and are not intended to limit the scope of the present application. Furthermore, it should be understood that various changes and modifications can be made by one skilled in the art after reading the teachings of the present application, and such equivalents are intended to fall within the scope of the application as defined in the appended claims.
The specific embodiment of the application relates to a Chinese text classification method based on the hierarchical connection of BERT and CNN, which comprises: pre-training a BERT model on a Chinese Wikipedia text data set to obtain and store all parameters of the BERT model; connecting a CNN model and the BERT model hierarchically to obtain a new model, BERT-CNN; initializing the parameters of the BERT model part, wherein the initial parameter values are the parameters obtained by the previous pre-training, and the parameters of the CNN model part are randomly generated from a normal distribution; performing data preprocessing on the classification training set; and finally, retraining the BERT-CNN model on the preprocessed data set.
fig. 1 shows a flow diagram of a chinese text classification method based on BERT and CNN hierarchical connection according to the present application.
As shown in fig. 1, after the flow begins, the first step is to pre-train the BERT model. Pre-training the BERT model mainly comprises two parts: first, constructing the training sets; second, training the BERT model with the constructed training sets.
Two training sets are constructed, namely an intra-sentence prediction training set and a sentence-pair continuity training set. The specific implementation steps are as follows:
the construction embodiment of the intra-sentence prediction training set is to mask 15% of the words in the sentence randomly. Of these 15% of words, 80% are replaced with mask, 10% of words are still replaced with the original words, and the remaining 10% are replaced with one word at random. And concatenating the [ CLS ] characters at the start position of the sentence, the new sentence constructed in this way is input as a model to predict 15% of the words that are masked.
The sentence-pair continuity training set is constructed by joining any two sentences in the article into one sentence with [SEP], splicing a [CLS] character at the start position of the sentence, and using the new sentence formed in this way as the model input to predict whether the two sentences are consecutive in the article. The output of the model is a probability value that represents the probability that the two sentences are consecutive.
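A corresponding sketch for the sentence-pair continuity examples; the 50/50 split between consecutive pairs and random pairs is an assumption made for illustration.

```python
import random

def build_nsp_example(sentences, i):
    """Pair sentence i with its true successor (label 1) or with a random
    sentence from the article (label 0), joined by [SEP]."""
    if random.random() < 0.5 and i + 1 < len(sentences):
        second, label = sentences[i + 1], 1              # consecutive pair
    else:
        second, label = random.choice(sentences), 0      # random pair
    tokens = ["[CLS]"] + list(sentences[i]) + ["[SEP]"] + list(second)
    return tokens, label
```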
The BERT model is pre-trained on these two training sets, and the trained model weight parameters are saved to serve as the initial weight values of the BERT part of the BERT-CNN model.
The second step, as shown in FIG. 1, is to construct the BERT-CNN model. The internal structure of the BERT-CNN model is shown in FIG. 2. The output at the first position of each layer in the 12-layer structure of the BERT model is used as the input of the CNN model, giving an input width of 12, and the input matrix of width 12 is subjected to convolution and max pooling operations by the CNN model to obtain a new, more effective sentence semantic feature vector.
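For illustration, the width-12 input matrix can be collected as follows. This sketch assumes the HuggingFace transformers library and the bert-base-chinese checkpoint, neither of which is named in the application.

```python
import torch
from transformers import BertModel, BertTokenizer

tokenizer = BertTokenizer.from_pretrained("bert-base-chinese")
bert = BertModel.from_pretrained("bert-base-chinese")

inputs = tokenizer("这个产品的质量不好", return_tensors="pt")
outputs = bert(**inputs, output_hidden_states=True)

# hidden_states holds the embedding layer plus the 12 Transformer layers;
# take the vector at the first position ([CLS]) of each of the 12 layers.
cls_per_layer = torch.stack(
    [h[:, 0, :] for h in outputs.hidden_states[1:]], dim=1)
print(cls_per_layer.shape)   # torch.Size([1, 12, 768]) -- the width-12 input matrix
```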
The BERT model employed above consists of a 12-layer Transformer encoder stack. The core component of the Transformer encoder is a multi-head attention mechanism, which is composed of 8 self-attention mechanisms, and its output is formed by splicing the outputs of the 8 self-attention mechanisms. The purpose of this is to enable the model to learn relevant information in different representation subspaces.
Wherein, the self-attention is calculated as follows:

Attention(Q, K, V) = softmax((QW_Q)(KW_K)^T / √d_k)(VW_V)

In self-attention, Q = K = V: they are all the input matrix of the attention mechanism. W_Q, W_K and W_V are the three weight matrices corresponding to Q, K and V, and are weight parameters that the model must learn. d_k is the dimension of the row vectors of the input matrix; dividing by √d_k in the denominator keeps the inner product from becoming too large.
Wherein, the calculation formula of the multi-head attention is as follows:
multihead(Q, K, V) = concat(head_1, head_2, ..., head_h)W_O

where concat() splices the matrices together along their row vectors; head_i denotes the result of the i-th self-attention in the multi-head attention; and W_O is the weight parameter connecting the output of the multi-head attention to the next layer.
The CNN model employed above is a one-dimensional convolutional neural network. In a one-dimensional convolutional neural network, the convolution operation only moves downward during convolution, with no left-right movement, so a one-dimensional vector is obtained each time a convolution over the input matrix is completed. The CNN model is divided into a convolutional layer and a pooling layer; the convolutional layer is composed of convolution kernels, and in this embodiment three types of convolution kernels with window sizes of 2, 3 and 4 are adopted, while the pooling layer adopts max pooling.
The sentence semantic feature vector obtained in this way can then be used as the input of the fully connected layer, and the class probability of the sentence is finally obtained through the softmax layer.
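A sketch of the CNN and fully connected head described in the two preceding paragraphs; the number of convolution filters and the number of classes are assumptions. The forward pass returns logits, and a softmax over them yields the class probabilities (during training the softmax is folded into the cross-entropy loss used below).

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class CnnClassifierHead(nn.Module):
    """One-dimensional convolutions with window sizes 2, 3 and 4 over the
    width-12 matrix of per-layer [CLS] vectors, max pooling, and a fully
    connected classification layer."""
    def __init__(self, hidden=768, n_filters=128, n_classes=2):
        super().__init__()
        self.convs = nn.ModuleList(
            [nn.Conv1d(hidden, n_filters, kernel_size=k) for k in (2, 3, 4)])
        self.fc = nn.Linear(3 * n_filters, n_classes)

    def forward(self, x):                      # x: (batch, 12, hidden)
        x = x.transpose(1, 2)                  # Conv1d expects (batch, channels, length)
        pooled = []
        for conv in self.convs:
            h = F.relu(conv(x))                # convolution slides only along the 12 layers
            pooled.append(F.max_pool1d(h, h.size(-1)).squeeze(-1))   # max pooling
        features = torch.cat(pooled, dim=-1)   # fused sentence semantic feature vector
        return self.fc(features)               # logits; softmax gives class probabilities
```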
The third step is the initialization of the BERT-CNN model weight parameters, as shown in FIG. 1. The specific initialization steps are as follows: first, the weight parameters of the BERT model are initialized, with the initial values being the weights pre-trained and saved in the first step; then, the weight parameters of the CNN model are initialized, this time with randomly generated values drawn from a normal distribution.
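The initialization just described can be sketched as follows, reusing the CnnClassifierHead above; the checkpoint name and the standard deviation of the normal distribution are assumptions.

```python
import torch
import torch.nn as nn
from transformers import BertModel

class BertCnn(nn.Module):
    """BERT-CNN assembly with the weight initialization of the third step."""
    def __init__(self, pretrained_path="bert-base-chinese", n_classes=2):
        super().__init__()
        # BERT part: weights loaded from the checkpoint saved after pre-training.
        self.bert = BertModel.from_pretrained(pretrained_path)
        # CNN part: weights randomly generated from a normal distribution.
        self.head = CnnClassifierHead(n_classes=n_classes)
        for p in self.head.parameters():
            nn.init.normal_(p, mean=0.0, std=0.02)

    def forward(self, input_ids, attention_mask=None):
        out = self.bert(input_ids, attention_mask=attention_mask,
                        output_hidden_states=True)
        # width-12 matrix of the first-position ([CLS]) vector of each of the 12 layers
        cls_per_layer = torch.stack(
            [h[:, 0, :] for h in out.hidden_states[1:]], dim=1)
        return self.head(cls_per_layer)
```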
The fourth step, as shown in FIG. 1, is to train the BERT-CNN model classifier. The specific training procedure is as follows: for an input sentence such as "the product quality is not good", which is obviously a negative comment, the present application expects the model to output a probability value of less than 0.5; the closer this value is to 0, the more accurate the prediction. For a positive example, the present application expects the predicted probability to be greater than 0.5, and the closer the probability value is to 1, the better. Therefore, the application adopts cross entropy as the loss function and Adam as the optimizer, and continuously updates the weights so that the model obtains an optimal set of weight parameters under the training data. Furthermore, the parameter updating is not limited to the CNN model: the weight parameters of the BERT model are also updated continuously, i.e., the parameters are fine-tuned for the task.
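A minimal sketch of this training loop, assuming a hypothetical train_loader that yields token ids, attention masks and labels; the learning rate and epoch count are likewise assumptions. Because a single Adam optimizer is built over all parameters, the BERT part and the CNN part are fine-tuned jointly.

```python
import torch
import torch.nn as nn

model = BertCnn(n_classes=2)                  # the sketch defined above
criterion = nn.CrossEntropyLoss()             # cross entropy loss
# one optimizer over all parameters: BERT weights are fine-tuned together with the CNN
optimizer = torch.optim.Adam(model.parameters(), lr=2e-5)

model.train()
for epoch in range(3):                        # number of epochs is an assumption
    for input_ids, attention_mask, labels in train_loader:   # hypothetical DataLoader
        logits = model(input_ids, attention_mask)
        loss = criterion(logits, labels)      # compare predictions with the 0/1 labels
        optimizer.zero_grad()
        loss.backward()
        optimizer.step()
```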
The application adopts a combined BERT and CNN model to extract effective sentence features. The pre-trained BERT model not only provides powerful semantic representations of words and sentences, but can also be applied directly to tasks in any field without re-training on field-specific data, which gives it certain advantages over the word2vec model. In addition, the BERT model uses an attention mechanism to solve the long-distance dependence problem, and thereby also avoids the problem that conventional RNN models cannot perform parallel computation. On this basis, the application introduces a CNN model to further fuse the features produced by the BERT model, so that more effective sentence semantic features can be obtained.

Claims (3)

1. A Chinese text classification method based on BERT and CNN hierarchical connection is characterized by comprising the following steps:
step 1, pre-training a BERT model on a large number of public Chinese text data sets to obtain and store all parameters of the BERT model, wherein the BERT model is composed of 12 Transformer encoder layers, and the Chinese text data sets used to pre-train the BERT model include an intra-sentence prediction training set and a sentence-pair continuity training set, wherein:
the construction process of the intra-sentence prediction training set comprises the following steps:
after the data is segmented into sentences, 15% of the words in each sentence are randomly masked; 80% of these 15% of words are replaced with [MASK], 10% are kept as the original words, and the remaining 10% are replaced with a random word; a [CLS] character is spliced at the start position of the sentence, and the new sentence formed in this way is used as the input of the BERT model to predict the 15% of words that were masked;
the construction process of the sentence-pair continuity training set comprises the following steps:
after the data is segmented into sentences, any two sentences are joined into one sentence by [SEP], a [CLS] character is spliced at the start position of the sentence, and the new sentence formed in this way is used as the input of the BERT model to predict whether the two sentences are consecutive in the article, wherein the output of the BERT model is a probability value that represents the probability that the two sentences are consecutive;
step 2, connecting a CNN model and the BERT model hierarchically: the output at the first position of each layer in the 12-layer structure of the BERT model is used as the input of the CNN model, giving an input width of 12, so as to obtain a BERT-CNN model; in the BERT-CNN model, the input matrix of width 12 is subjected to convolution and max pooling operations by the CNN model to obtain a new, more effective sentence semantic feature vector, which is then fed into a fully connected layer and finally passed through a classifier;
step 3, initializing the parameters of the BERT model part, wherein the initial parameter values are the parameters obtained by the previous pre-training, and the parameters of the CNN model part are randomly generated from a normal distribution;
step 4, data preprocessing is carried out on the classification training set;
and step 5, retraining the BERT-CNN model on the preprocessed data set.
2. The method of claim 1, wherein in step 2, the core component of the Transformer encoder is a multi-head attention mechanism, the multi-head attention mechanism is composed of 8 self-attention mechanisms, and the output of the Transformer encoder is formed by splicing the outputs of the 8 self-attention mechanisms.
3. The Chinese text classification method based on BERT and CNN hierarchical connection as claimed in claim 1, wherein in step 4, the data preprocessing includes removing part of the invalid character strings in the sentences and then segmenting the sentences by character.
CN201911302047.2A 2019-12-17 2019-12-17 Chinese text classification method based on BERT and CNN hierarchical connection Active CN111177376B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201911302047.2A CN111177376B (en) 2019-12-17 2019-12-17 Chinese text classification method based on BERT and CNN hierarchical connection

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201911302047.2A CN111177376B (en) 2019-12-17 2019-12-17 Chinese text classification method based on BERT and CNN hierarchical connection

Publications (2)

Publication Number Publication Date
CN111177376A CN111177376A (en) 2020-05-19
CN111177376B true CN111177376B (en) 2023-08-15

Family

ID=70657375

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201911302047.2A Active CN111177376B (en) 2019-12-17 2019-12-17 Chinese text classification method based on BERT and CNN hierarchical connection

Country Status (1)

Country Link
CN (1) CN111177376B (en)

Families Citing this family (13)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN111858848B (en) * 2020-05-22 2024-03-15 青岛创新奇智科技集团股份有限公司 Semantic classification method and device, electronic equipment and storage medium
CN111737512B (en) * 2020-06-04 2021-11-12 东华大学 Silk cultural relic image retrieval method based on depth feature region fusion
CN111737475B (en) * 2020-07-21 2021-06-22 南京擎盾信息科技有限公司 Unsupervised network public opinion spam long text recognition method
CN112101027A (en) * 2020-07-24 2020-12-18 昆明理工大学 Chinese named entity recognition method based on reading understanding
CN111930952A (en) * 2020-09-21 2020-11-13 杭州识度科技有限公司 Method, system, equipment and storage medium for long text cascade classification
CN113342970B (en) * 2020-11-24 2023-01-03 中电万维信息技术有限责任公司 Multi-label complex text classification method
CN112463965A (en) * 2020-12-03 2021-03-09 上海欣方智能系统有限公司 Method and system for semantic understanding of text
CN112559730B (en) * 2020-12-08 2021-08-24 北京京航计算通讯研究所 Text abstract automatic generation method and system based on global feature extraction
CN112732916B (en) * 2021-01-11 2022-09-20 河北工业大学 BERT-based multi-feature fusion fuzzy text classification system
CN113032539A (en) * 2021-03-15 2021-06-25 浙江大学 Causal question-answer pair matching method based on pre-training neural network
CN113312568B (en) * 2021-03-25 2022-06-17 罗普特科技集团股份有限公司 Web information extraction method and system based on HTML source code and webpage snapshot
CN113468324A (en) * 2021-06-03 2021-10-01 上海交通大学 Text classification method and system based on BERT pre-training model and convolutional network
CN114242159B (en) * 2022-02-24 2022-06-07 北京晶泰科技有限公司 Method for constructing antigen peptide presentation prediction model, and antigen peptide prediction method and device

Citations (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN110209824A (en) * 2019-06-13 2019-09-06 中国科学院自动化研究所 Text emotion analysis method based on built-up pattern, system, device

Patent Citations (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN110209824A (en) * 2019-06-13 2019-09-06 中国科学院自动化研究所 Text emotion analysis method based on built-up pattern, system, device

Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
基于混合注意力机制的中文文本蕴含识别方法 [A Chinese text entailment recognition method based on a hybrid attention mechanism]; 黄生斌 (Huang Shengbin) et al.; 《北京信息科技大学学报(自然科学版)》 [Journal of Beijing Information Science and Technology University (Natural Science Edition)]; 2020-06-15 (Issue 03); full text *

Also Published As

Publication number Publication date
CN111177376A (en) 2020-05-19

Similar Documents

Publication Publication Date Title
CN111177376B (en) Chinese text classification method based on BERT and CNN hierarchical connection
CN112214599B (en) Multi-label text classification method based on statistics and pre-training language model
CN109918671B (en) Electronic medical record entity relation extraction method based on convolution cyclic neural network
Chen et al. Research on text sentiment analysis based on CNNs and SVM
CN105938485B (en) A kind of Image Description Methods based on convolution loop mixed model
CN110609891A (en) Visual dialog generation method based on context awareness graph neural network
CN110866117A (en) Short text classification method based on semantic enhancement and multi-level label embedding
CN110619034A (en) Text keyword generation method based on Transformer model
CN109558576B (en) Punctuation mark prediction method based on self-attention mechanism
CN110287323B (en) Target-oriented emotion classification method
CN108388560A (en) GRU-CRF meeting title recognition methods based on language model
CN111143563A (en) Text classification method based on integration of BERT, LSTM and CNN
CN108170848B (en) Chinese mobile intelligent customer service-oriented conversation scene classification method
CN113297364B (en) Natural language understanding method and device in dialogue-oriented system
CN113673254B (en) Knowledge distillation position detection method based on similarity maintenance
CN110555084A (en) remote supervision relation classification method based on PCNN and multi-layer attention
CN111027292B (en) Method and system for generating limited sampling text sequence
CN112861524A (en) Deep learning-based multilevel Chinese fine-grained emotion analysis method
CN114153973A (en) Mongolian multi-mode emotion analysis method based on T-M BERT pre-training model
CN112070139A (en) Text classification method based on BERT and improved LSTM
CN113239690A (en) Chinese text intention identification method based on integration of Bert and fully-connected neural network
CN111581392B (en) Automatic composition scoring calculation method based on statement communication degree
CN112287106A (en) Online comment emotion classification method based on dual-channel hybrid neural network
CN114417851A (en) Emotion analysis method based on keyword weighted information
CN111949762A (en) Method and system for context-based emotion dialogue, and storage medium

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant