CN114595325A - Chinese short text classification method based on distillation BERT - Google Patents

Chinese short text classification method based on distillation BERT

Info

Publication number
CN114595325A
Authority
CN
China
Prior art keywords
bert
distillation
text
cnn
student
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202111479291.3A
Other languages
Chinese (zh)
Inventor
许文波
李昶霖
白闰冰
高鹏
袁帅
贾海涛
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Yangtze River Delta Research Institute of UESTC Huzhou
Original Assignee
Yangtze River Delta Research Institute of UESTC Huzhou
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Yangtze River Delta Research Institute of UESTC Huzhou
Priority to CN202111479291.3A
Publication of CN114595325A
Legal status: Pending (current)

Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/30Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
    • G06F16/35Clustering; Classification
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00Pattern recognition
    • G06F18/20Analysing
    • G06F18/24Classification techniques
    • G06F18/241Classification techniques relating to the classification model, e.g. parametric or non-parametric approaches
    • G06F18/2415Classification techniques relating to the classification model, e.g. parametric or non-parametric approaches based on parametric or probabilistic models, e.g. based on likelihood ratio or false acceptance rate versus a false rejection rate
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/02Neural networks
    • G06N3/04Architecture, e.g. interconnection topology
    • G06N3/045Combinations of networks
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/02Neural networks
    • G06N3/08Learning methods

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Data Mining & Analysis (AREA)
  • General Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • Evolutionary Computation (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Artificial Intelligence (AREA)
  • General Health & Medical Sciences (AREA)
  • Molecular Biology (AREA)
  • Software Systems (AREA)
  • Mathematical Physics (AREA)
  • Computing Systems (AREA)
  • Health & Medical Sciences (AREA)
  • Biomedical Technology (AREA)
  • Biophysics (AREA)
  • Computational Linguistics (AREA)
  • Probability & Statistics with Applications (AREA)
  • Evolutionary Biology (AREA)
  • Bioinformatics & Cheminformatics (AREA)
  • Bioinformatics & Computational Biology (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Databases & Information Systems (AREA)
  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)

Abstract

The invention discloses a Chinese short text classification method based on distillation BERT, belonging to the field of natural language processing. The input text is first preprocessed so that the BERT output carries more textual information. The original BERT is then fine-tuned to obtain a BERT model suited to text classification. The BERT parameters are then compressed by distillation, which increases the running speed of the model. Finally, the compressed BERT is connected to a CNN network and fine-tuned again, improving classification of the target texts.

Description

Chinese short text classification method based on distillation BERT
Technical Field
The invention relates to pre-trained language models and Chinese text classification, and in particular to Chinese text classification with a compressed BERT model.
Background
Since the release of the BERT model, NLP has entered the era of pre-training plus fine-tuning, which has greatly accelerated task research in the NLP field and continuously improved results on related tasks. However, training BERT requires tens of gigabytes of text and an enormous amount of computation; such costs are unaffordable for ordinary users. Stronger models have also appeared in this period, such as RoBERTa from Facebook. As model performance has increased, parameter counts have grown explosively, from the hundreds of millions in BERT to hundreds of billions in later models. Pre-training seems to have become an exclusive game for large organizations, so obtaining a usable pre-trained model is urgent for small companies with application needs and for scenarios with strict real-time requirements.
As pre-trained models keep growing, research on compressing them is advancing just as quickly. Distillation-based compressed BERT variants and the parameter-sharing ALBERT have both been proposed; they trade a small loss of accuracy for compressed model parameters and faster inference. The invention combines research results on the pre-trained BERT model with knowledge distillation to design a compact BERT text classification model.
Disclosure of Invention
To address the impracticality of the large BERT model, the invention provides a text classification technique based on distillation BERT. The technique improves Chinese text classification by adjusting the input construction of the BERT model and its distillation method.
The technical scheme adopted by the invention is as follows:
step 1: construct the BERT input text.
Step 2: and (3) using the text in the step (1) as input to enable a BERT-CNN model to be subjected to fine tuning training to obtain a Chinese text classification Teacher model, wherein the BERT selects a BERT-base model.
And step 3: distillation was performed using 6-layer transformer as the Student encoder model and the fine-tuned BERT obtained in step 2 as the Teacher encoder.
And 4, step 4: and (4) carrying out fine tuning training by using the Student BERT obtained in the step (3) and the CNN to obtain a compression model of the text classification.
Compared with the prior art, the beneficial effects of the invention are as follows:
(1) Two rounds of task-specific fine-tuning are performed, improving text classification capability.
(2) The pre-trained model is compressed, making it more convenient to use.
Drawings
FIG. 1: schematic diagram of the Teacher BERT fine-tuning model.
FIG. 2: schematic diagram of the distillation process.
FIG. 3: schematic diagram of the overall model compression method.
Detailed Description
Step 1: construct the input text.
Remove the inter-sentence separator SEP and its embedding, and splice the sentences together directly to form the input text.
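By way of illustration only, the following Python sketch shows this input construction with a HuggingFace BERT tokenizer; the tokenizer checkpoint, maximum length, and example sentences are assumptions for illustration and are not specified by the invention.

```python
# Sketch of step 1: splice sentences directly instead of joining them with [SEP].
# Assumed (not from the patent): the bert-base-chinese checkpoint and a max length of 128.
from transformers import BertTokenizer

tokenizer = BertTokenizer.from_pretrained("bert-base-chinese")

def build_input(sentences, max_length=128):
    # Direct concatenation: no [SEP] token or segment boundary between sentences.
    text = "".join(sentences)
    return tokenizer(
        text,
        truncation=True,
        max_length=max_length,
        padding="max_length",
        return_tensors="pt",
    )

# Example: a two-sentence short text becomes one [CLS]-prefixed sequence
# with no internal [SEP] between the sentences.
encoded = build_input(["今天天气很好", "适合出门散步"])
```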
Step 2: and (3) constructing a text training BERT-CNN by using the step 1 to obtain a Teacher BERT model.
And (4) using the sentence obtained in the step (1) as an input of a BERT-CNN text classifier. The output of the BERT is accessed to the CNN at positions except the CLS, and then the output of the CNN network and the CLS vector splicing connection full-connection layer are classified by using softmax. And updating the model parameters by using the cross entropy of the softmax function as a loss function, wherein the BERT parameters are not fixed in the step. The network structure is shown in fig. 1.
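By way of illustration only, the following PyTorch sketch shows one possible realization of this BERT-CNN teacher structure; the convolution kernel sizes, filter count, and class count are illustrative assumptions rather than values fixed by the invention.

```python
import torch
import torch.nn as nn
from transformers import BertModel

class BertCNNClassifier(nn.Module):
    """Teacher classifier: non-CLS BERT outputs go through a CNN, then are
    spliced with the CLS vector and fed to a fully connected softmax layer."""

    def __init__(self, num_classes, bert_name="bert-base-chinese",
                 num_filters=128, kernel_sizes=(2, 3, 4)):
        super().__init__()
        self.bert = BertModel.from_pretrained(bert_name)  # not frozen in step 2
        hidden = self.bert.config.hidden_size
        self.convs = nn.ModuleList(
            nn.Conv1d(hidden, num_filters, k) for k in kernel_sizes
        )
        self.fc = nn.Linear(num_filters * len(kernel_sizes) + hidden, num_classes)

    def forward(self, input_ids, attention_mask):
        seq = self.bert(input_ids=input_ids,
                        attention_mask=attention_mask).last_hidden_state  # (B, L, H)
        cls = seq[:, 0]                          # CLS embedding
        tokens = seq[:, 1:].transpose(1, 2)      # non-CLS positions, shaped for Conv1d
        pooled = [torch.relu(conv(tokens)).max(dim=2).values for conv in self.convs]
        features = torch.cat(pooled + [cls], dim=1)
        return self.fc(features)                 # logits; trained with nn.CrossEntropyLoss
```

Training with nn.CrossEntropyLoss on these logits corresponds to the softmax cross-entropy loss described above; passing the non-CLS token matrix through convolutions of several widths is one common way to realize the CNN over the "2-dimensional data graph" of BERT outputs, though other CNN layouts would also fit the description.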
And step 3: student BERT encoder was trained.
Distillation was performed using the Teacher BERT model obtained in step 2 using 6-layer transformers as the insert encoder of the Student model. The 1, 2, 3, 4, 5 layers of the Student transform correspond to the 2, 4, 6, 8, 10 layers of the teacher layer. The initialization value is the corresponding layer parameter of the teacher model, and the rest is initialized to 0. The BERT Teacher parameters were fixed for the BERT Student to learn. Loss functions MSE and DS are respectively defined for the intermediate layer and the output layer, and a distillation target loss function is obtained. Learning the hidden states of all teacher models is contrary to the goal of compression for the middle layer, so the loss function calculation for the middle layer considers only CLS, which is acceptable for text classification tasks, with the MSE loss function as shown in equation (1).
L_{MSE} = \frac{1}{N} \sum_{i=1}^{N} \sum_{j} \left\| h_{i,j}^{s} - h_{i,\,mset(j)}^{t} \right\|_2^2   (1)
where N is the number of training samples, mset(j) denotes the Teacher encoder layer corresponding to the j-th Student encoder layer, h is the CLS embedding, superscript s refers to the Student model and superscript t to the Teacher model.
The output layer is classified with a softmax layer, and the loss function DS is shown in equation (2).
L_{DS} = -\sum_{i=1}^{N} \sum_{c} P^{t}(y_i = c \mid x_i) \log P^{s}(y_i = c \mid x_i)   (2)
P(y_i = c \mid x_i) = \mathrm{softmax}(W h_i)   (3)
where superscript s denotes the Student model parameters, superscript t denotes the Teacher model parameters, and h is the embedded output for network input x. The total distillation loss is shown in equation (4).
L = \alpha L_{MSE} + (1 - \alpha) L_{DS}   (4)
The distillation process is shown in figure 2.
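By way of illustration only, the following PyTorch sketch computes a distillation objective of this form, i.e. a CLS-only intermediate MSE term plus a soft-label DS term combined as a weighted sum; the layer mapping follows the correspondence given in step 3, while the weight alpha and the use of an untempered softmax are assumptions.

```python
import torch.nn.functional as F

# Student layers 1..5 are matched to Teacher layers 2, 4, 6, 8, 10.
MSET = {1: 2, 2: 4, 3: 6, 4: 8, 5: 10}

def distillation_loss(student_hidden, teacher_hidden,
                      student_logits, teacher_logits, alpha=0.5):
    # student_hidden / teacher_hidden: tuples of per-layer hidden states as returned
    # with output_hidden_states=True (index 0 is the embedding output),
    # each of shape (batch, seq_len, hidden).
    # Intermediate-layer term: MSE over the CLS position of the mapped layers only.
    mse = sum(
        F.mse_loss(student_hidden[j][:, 0], teacher_hidden[MSET[j]][:, 0])
        for j in MSET
    )
    # DS term: cross-entropy between the Teacher's soft labels and the Student's predictions.
    teacher_probs = F.softmax(teacher_logits, dim=-1)
    ds = -(teacher_probs * F.log_softmax(student_logits, dim=-1)).sum(dim=-1).mean()
    # Weighted sum as in equation (4); the Teacher parameters are kept frozen elsewhere.
    return alpha * mse + (1 - alpha) * ds
```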
And 4, step 4: a text classification network based on compressed BERT-CNN is trained.
The same structure as in step 2 was constructed using the Student Transformer encoder obtained in step 3 to connect the CNN and FC + softmax layers. The initialization value of the Student CNN layer is the same as that of the Teacher CNN layer after fine adjustment, and softmax classification is carried out after the Student CNN output and the CLS output of the Student BERT are spliced and connected with a full connection layer. And then updating the parameters of the student model again by using the softmax cross entropy as a loss function to obtain a compressed BERT-CNN text classification model.
The overall model compression method is shown in fig. 3.
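By way of illustration only, the following sketch shows the step-4 re-fine-tuning loop; it assumes the student classifier reuses the BERT-CNN structure sketched under step 2 with the 6-layer Student encoder substituted in, and the optimizer choice, learning rate and epoch count are illustrative assumptions.

```python
import torch
import torch.nn as nn

def finetune_student(student_model, dataloader, epochs=3, lr=2e-5, device="cuda"):
    """Step 4: update the compressed student BERT-CNN with softmax cross-entropy.
    student_model is assumed to have the BERT-CNN structure from step 2, with its
    CNN initialized from the fine-tuned Teacher CNN weights."""
    student_model.to(device).train()
    optimizer = torch.optim.AdamW(student_model.parameters(), lr=lr)
    criterion = nn.CrossEntropyLoss()
    for _ in range(epochs):
        for batch in dataloader:
            logits = student_model(batch["input_ids"].to(device),
                                   batch["attention_mask"].to(device))
            loss = criterion(logits, batch["labels"].to(device))
            optimizer.zero_grad()
            loss.backward()
            optimizer.step()
    return student_model
```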
While the invention has been described with reference to specific embodiments, any feature disclosed in this specification may be replaced by alternative features serving the same, equivalent or similar purpose, unless expressly stated otherwise; all of the disclosed features, or all of the method or process steps, may be combined in any combination, except combinations where mutually exclusive features or/and steps are present.

Claims (5)

1. A Chinese short text classification method based on distillation BERT, characterized by comprising the following steps:
Step 1: construct an input text.
Step 2: fine-tune a BERT-CNN with the text constructed in step 1 to obtain a Teacher BERT model.
Step 3: distill the Teacher BERT from step 2 to obtain a Student BERT.
Step 4: form a text classification network from the Student BERT obtained in step 3 and the CNN fine-tuned in step 2, and fine-tune it again.
2. The Chinese short text classification method based on distillation BERT of claim 1, wherein the text construction method in step 1 specifically comprises:
Step 101: remove the inter-sentence separator, i.e., splice the sentences together directly without using SEP embeddings.
3. The Chinese short text classification method based on distillation BERT of claim 2, wherein the BERT-CNN model in step 2 is specifically composed as follows:
Step 201: splice the Teacher BERT outputs other than the CLS embedding into a 2-dimensional feature map and feed it into the CNN network; splice the CNN network output with the BERT CLS embedding, connect a fully connected layer, and finally classify with a softmax function.
4. The Chinese short text classification method based on distillation BERT of claim 3, wherein the distillation method in step 3 specifically comprises:
Step 301: use 6 Transformer layers as the Student BERT, 5 of which perform intermediate-layer distillation against the Teacher BERT.
Step 302: the distillation loss is a weighted sum of the intermediate-layer MSE distillation loss and the DS loss between the outputs of the Teacher BERT and the Student BERT.
5. The Chinese short text classification method based on distillation BERT of claim 4, wherein the fine-tuning method in step 4 specifically comprises:
Step 401: form a text classifier with the same structure as in step 2 from the Student BERT obtained in step 3 and the CNN layer fine-tuned in step 2, and perform fine-tuning training again.
CN202111479291.3A 2021-12-04 2021-12-04 Chinese short text classification method based on distillation BERT Pending CN114595325A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202111479291.3A CN114595325A (en) 2021-12-04 2021-12-04 Chinese short text classification method based on distillation BERT

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202111479291.3A CN114595325A (en) 2021-12-04 2021-12-04 Chinese short text classification method based on distillation BERT

Publications (1)

Publication Number Publication Date
CN114595325A true CN114595325A (en) 2022-06-07

Family

ID=81814547

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202111479291.3A Pending CN114595325A (en) 2021-12-04 2021-12-04 Chinese short text classification method based on distillation BERT

Country Status (1)

Country Link
CN (1) CN114595325A (en)

Cited By (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN116719945A (en) * 2023-08-08 2023-09-08 北京惠每云科技有限公司 Medical short text classification method and device, electronic equipment and storage medium
CN116719945B (en) * 2023-08-08 2023-10-24 北京惠每云科技有限公司 Medical short text classification method and device, electronic equipment and storage medium


Legal Events

Date Code Title Description
PB01 Publication