CN114595325A - Chinese short text classification method based on distillation BERT - Google Patents

Chinese short text classification method based on distillation BERT

Info

Publication number
CN114595325A
Authority
CN
China
Prior art keywords
bert
distillation
text
cnn
student
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202111479291.3A
Other languages
Chinese (zh)
Inventor
许文波
李昶霖
白闰冰
高鹏
袁帅
贾海涛
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Yangtze River Delta Research Institute of UESTC Huzhou
Original Assignee
Yangtze River Delta Research Institute of UESTC Huzhou
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Yangtze River Delta Research Institute of UESTC Huzhou
Priority to CN202111479291.3A
Publication of CN114595325A
Legal status: Pending (current)

Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/30Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
    • G06F16/35Clustering; Classification
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00Pattern recognition
    • G06F18/20Analysing
    • G06F18/24Classification techniques
    • G06F18/241Classification techniques relating to the classification model, e.g. parametric or non-parametric approaches
    • G06F18/2415Classification techniques relating to the classification model, e.g. parametric or non-parametric approaches based on parametric or probabilistic models, e.g. based on likelihood ratio or false acceptance rate versus a false rejection rate
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/02Neural networks
    • G06N3/04Architecture, e.g. interconnection topology
    • G06N3/045Combinations of networks
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/02Neural networks
    • G06N3/08Learning methods

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Data Mining & Analysis (AREA)
  • General Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • Evolutionary Computation (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Artificial Intelligence (AREA)
  • General Health & Medical Sciences (AREA)
  • Molecular Biology (AREA)
  • Software Systems (AREA)
  • Mathematical Physics (AREA)
  • Computing Systems (AREA)
  • Health & Medical Sciences (AREA)
  • Biomedical Technology (AREA)
  • Biophysics (AREA)
  • Computational Linguistics (AREA)
  • Probability & Statistics with Applications (AREA)
  • Evolutionary Biology (AREA)
  • Bioinformatics & Cheminformatics (AREA)
  • Bioinformatics & Computational Biology (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Databases & Information Systems (AREA)
  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)

Abstract

The invention discloses a Chinese short text classification method based on distillation BERT, belonging to the field of natural language processing. The input text is first preprocessed so that the BERT output carries more textual information. The original BERT is then fine-tuned to obtain a BERT model suited to text classification. The BERT parameters are then compressed by distillation, which increases the running speed of the model. Finally, the compressed BERT is connected to a CNN network and fine-tuned again, improving classification of the target texts.

Description

Chinese short text classification method based on distillation BERT
Technical Field
The invention relates to pre-trained language models and Chinese text classification, and in particular to Chinese text classification with a compressed BERT model.
Background
Since the release of the BERT model, NLP has entered the era of pre-training plus fine-tuning, which has greatly accelerated task research in the NLP field and continuously improved results on related tasks. However, training BERT requires tens of gigabytes of text and an enormous amount of computation; such costs are unaffordable for ordinary users. Stronger models have also appeared in this period, such as RoBERTa from Facebook. As model performance has increased, parameter counts have grown explosively, from the hundreds of millions in BERT to hundreds of billions in later models. Pre-training seems to have become an exclusive game for large organizations, so obtaining a usable pre-trained model is urgent for small companies with application needs and for scenarios with strict real-time requirements.
As pre-trained models keep growing, research on compressing them is advancing just as quickly. Distillation-based compressed BERT variants and the parameter-sharing ALBERT have both been proposed; they trade a small loss of accuracy for compressed model parameters and faster inference. The invention combines research results on the pre-trained BERT model with knowledge distillation to design a compact BERT text classification model.
Disclosure of Invention
To address the impracticality of the large BERT model, the invention provides a text classification technique based on distillation BERT. The technique improves Chinese text classification by adjusting the input construction of the BERT model and its distillation method.
The technical scheme adopted by the invention is as follows:
step 1: construct the BERT input text.
Step 2: and (3) using the text in the step (1) as input to enable a BERT-CNN model to be subjected to fine tuning training to obtain a Chinese text classification Teacher model, wherein the BERT selects a BERT-base model.
And step 3: distillation was performed using 6-layer transformer as the Student encoder model and the fine-tuned BERT obtained in step 2 as the Teacher encoder.
And 4, step 4: and (4) carrying out fine tuning training by using the Student BERT obtained in the step (3) and the CNN to obtain a compression model of the text classification.
Compared with the prior art, the beneficial effects of the invention are as follows:
(1) Two rounds of task-specific fine-tuning are performed, improving text classification capability.
(2) The pre-trained model is compressed, making it more convenient to use.
Drawings
FIG. 1: schematic diagram of the Teacher BERT fine-tuning model.
FIG. 2: schematic diagram of the distillation process.
FIG. 3: schematic diagram of the overall model compression method.
Detailed Description
Step 1: construct the input text.
Remove the inter-sentence separator SEP and its embedding, and splice the sentences together directly to form the input text.
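By way of illustration only, the following Python sketch shows this input construction with a HuggingFace BERT tokenizer; the tokenizer checkpoint, maximum length, and example sentences are assumptions for illustration and are not specified by the invention.

```python
# Sketch of step 1: splice sentences directly instead of joining them with [SEP].
# Assumed (not from the patent): the bert-base-chinese checkpoint and a max length of 128.
from transformers import BertTokenizer

tokenizer = BertTokenizer.from_pretrained("bert-base-chinese")

def build_input(sentences, max_length=128):
    # Direct concatenation: no [SEP] token or segment boundary between sentences.
    text = "".join(sentences)
    return tokenizer(
        text,
        truncation=True,
        max_length=max_length,
        padding="max_length",
        return_tensors="pt",
    )

# Example: a two-sentence short text becomes one [CLS]-prefixed sequence
# with no internal [SEP] between the sentences.
encoded = build_input(["今天天气很好", "适合出门散步"])
```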
Step 2: and (3) constructing a text training BERT-CNN by using the step 1 to obtain a Teacher BERT model.
And (4) using the sentence obtained in the step (1) as an input of a BERT-CNN text classifier. The output of the BERT is accessed to the CNN at positions except the CLS, and then the output of the CNN network and the CLS vector splicing connection full-connection layer are classified by using softmax. And updating the model parameters by using the cross entropy of the softmax function as a loss function, wherein the BERT parameters are not fixed in the step. The network structure is shown in fig. 1.
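By way of illustration only, the following PyTorch sketch shows one possible realization of this BERT-CNN teacher structure; the convolution kernel sizes, filter count, and class count are illustrative assumptions rather than values fixed by the invention.

```python
import torch
import torch.nn as nn
from transformers import BertModel

class BertCNNClassifier(nn.Module):
    """Teacher classifier: non-CLS BERT outputs go through a CNN, then are
    spliced with the CLS vector and fed to a fully connected softmax layer."""

    def __init__(self, num_classes, bert_name="bert-base-chinese",
                 num_filters=128, kernel_sizes=(2, 3, 4)):
        super().__init__()
        self.bert = BertModel.from_pretrained(bert_name)  # not frozen in step 2
        hidden = self.bert.config.hidden_size
        self.convs = nn.ModuleList(
            nn.Conv1d(hidden, num_filters, k) for k in kernel_sizes
        )
        self.fc = nn.Linear(num_filters * len(kernel_sizes) + hidden, num_classes)

    def forward(self, input_ids, attention_mask):
        seq = self.bert(input_ids=input_ids,
                        attention_mask=attention_mask).last_hidden_state  # (B, L, H)
        cls = seq[:, 0]                          # CLS embedding
        tokens = seq[:, 1:].transpose(1, 2)      # non-CLS positions, shaped for Conv1d
        pooled = [torch.relu(conv(tokens)).max(dim=2).values for conv in self.convs]
        features = torch.cat(pooled + [cls], dim=1)
        return self.fc(features)                 # logits; trained with nn.CrossEntropyLoss
```

Training with nn.CrossEntropyLoss on these logits corresponds to the softmax cross-entropy loss described above; passing the non-CLS token matrix through convolutions of several widths is one common way to realize the CNN over the "2-dimensional data graph" of BERT outputs, though other CNN layouts would also fit the description.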
And step 3: student BERT encoder was trained.
Distillation was performed using the Teacher BERT model obtained in step 2 using 6-layer transformers as the insert encoder of the Student model. The 1, 2, 3, 4, 5 layers of the Student transform correspond to the 2, 4, 6, 8, 10 layers of the teacher layer. The initialization value is the corresponding layer parameter of the teacher model, and the rest is initialized to 0. The BERT Teacher parameters were fixed for the BERT Student to learn. Loss functions MSE and DS are respectively defined for the intermediate layer and the output layer, and a distillation target loss function is obtained. Learning the hidden states of all teacher models is contrary to the goal of compression for the middle layer, so the loss function calculation for the middle layer considers only CLS, which is acceptable for text classification tasks, with the MSE loss function as shown in equation (1).
L_{MSE} = \frac{1}{N} \sum_{i=1}^{N} \sum_{j} \left\| h_{i,j}^{s} - h_{i,\,mset(j)}^{t} \right\|_2^2   (1)
where N is the number of training samples, mset(j) denotes the Teacher encoder layer corresponding to the j-th Student encoder layer, h is the CLS embedding, superscript s refers to the Student model and superscript t to the Teacher model.
The output layer is classified with a softmax layer, and the loss function DS is shown in equation (2).
L_{DS} = -\sum_{i=1}^{N} \sum_{c} P^{t}(y_i = c \mid x_i) \log P^{s}(y_i = c \mid x_i)   (2)
P(y_i = c \mid x_i) = \mathrm{softmax}(W h_i)   (3)
where superscript s denotes the Student model parameters, superscript t denotes the Teacher model parameters, and h is the embedded output for network input x. The total distillation loss is shown in equation (4).
L = \alpha L_{MSE} + (1 - \alpha) L_{DS}   (4)
The distillation process is shown in figure 2.
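By way of illustration only, the following PyTorch sketch computes a distillation objective of this form, i.e. a CLS-only intermediate MSE term plus a soft-label DS term combined as a weighted sum; the layer mapping follows the correspondence given in step 3, while the weight alpha and the use of an untempered softmax are assumptions.

```python
import torch.nn.functional as F

# Student layers 1..5 are matched to Teacher layers 2, 4, 6, 8, 10.
MSET = {1: 2, 2: 4, 3: 6, 4: 8, 5: 10}

def distillation_loss(student_hidden, teacher_hidden,
                      student_logits, teacher_logits, alpha=0.5):
    # student_hidden / teacher_hidden: tuples of per-layer hidden states as returned
    # with output_hidden_states=True (index 0 is the embedding output),
    # each of shape (batch, seq_len, hidden).
    # Intermediate-layer term: MSE over the CLS position of the mapped layers only.
    mse = sum(
        F.mse_loss(student_hidden[j][:, 0], teacher_hidden[MSET[j]][:, 0])
        for j in MSET
    )
    # DS term: cross-entropy between the Teacher's soft labels and the Student's predictions.
    teacher_probs = F.softmax(teacher_logits, dim=-1)
    ds = -(teacher_probs * F.log_softmax(student_logits, dim=-1)).sum(dim=-1).mean()
    # Weighted sum as in equation (4); the Teacher parameters are kept frozen elsewhere.
    return alpha * mse + (1 - alpha) * ds
```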
And 4, step 4: a text classification network based on compressed BERT-CNN is trained.
The same structure as in step 2 was constructed using the Student Transformer encoder obtained in step 3 to connect the CNN and FC + softmax layers. The initialization value of the Student CNN layer is the same as that of the Teacher CNN layer after fine adjustment, and softmax classification is carried out after the Student CNN output and the CLS output of the Student BERT are spliced and connected with a full connection layer. And then updating the parameters of the student model again by using the softmax cross entropy as a loss function to obtain a compressed BERT-CNN text classification model.
The overall model compression method is shown in fig. 3.
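By way of illustration only, the following sketch shows the step-4 re-fine-tuning loop; it assumes the student classifier reuses the BERT-CNN structure sketched under step 2 with the 6-layer Student encoder substituted in, and the optimizer choice, learning rate and epoch count are illustrative assumptions.

```python
import torch
import torch.nn as nn

def finetune_student(student_model, dataloader, epochs=3, lr=2e-5, device="cuda"):
    """Step 4: update the compressed student BERT-CNN with softmax cross-entropy.
    student_model is assumed to have the BERT-CNN structure from step 2, with its
    CNN initialized from the fine-tuned Teacher CNN weights."""
    student_model.to(device).train()
    optimizer = torch.optim.AdamW(student_model.parameters(), lr=lr)
    criterion = nn.CrossEntropyLoss()
    for _ in range(epochs):
        for batch in dataloader:
            logits = student_model(batch["input_ids"].to(device),
                                   batch["attention_mask"].to(device))
            loss = criterion(logits, batch["labels"].to(device))
            optimizer.zero_grad()
            loss.backward()
            optimizer.step()
    return student_model
```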
While the invention has been described with reference to specific embodiments, any feature disclosed in this specification may be replaced by alternative features serving the same, equivalent or similar purpose, unless expressly stated otherwise; all of the disclosed features, or all of the method or process steps, may be combined in any combination, except combinations where mutually exclusive features or/and steps are present.

Claims (5)

1. A Chinese short text classification method based on distillation BERT, characterized by comprising the following steps:
Step 1: construct an input text.
Step 2: fine-tune a BERT-CNN with the text constructed in step 1 to obtain a Teacher BERT model.
Step 3: distill the Teacher BERT from step 2 to obtain a Student BERT.
Step 4: form a text classification network from the Student BERT obtained in step 3 and the CNN fine-tuned in step 2, and fine-tune it again.
2. The Chinese short text classification method based on distillation BERT of claim 1, wherein the text construction method in step 1 specifically comprises:
Step 101: remove the inter-sentence separator, i.e., splice the sentences together directly without using SEP embeddings.
3. The Chinese short text classification method based on distillation BERT of claim 2, wherein the BERT-CNN model in step 2 is specifically composed as follows:
Step 201: splice the Teacher BERT outputs other than the CLS embedding into a 2-dimensional feature map and feed it into the CNN network; splice the CNN network output with the BERT CLS embedding, connect a fully connected layer, and finally classify with a softmax function.
4. The Chinese short text classification method based on distillation BERT of claim 3, wherein the distillation method in step 3 specifically comprises:
Step 301: use 6 Transformer layers as the Student BERT, 5 of which perform intermediate-layer distillation against the Teacher BERT.
Step 302: the distillation loss is a weighted sum of the intermediate-layer MSE distillation loss and the DS loss between the outputs of the Teacher BERT and the Student BERT.
5. The Chinese short text classification method based on distillation BERT of claim 4, wherein the fine-tuning method in step 4 specifically comprises:
Step 401: form a text classifier with the same structure as in step 2 from the Student BERT obtained in step 3 and the CNN layer fine-tuned in step 2, and perform fine-tuning training again.
CN202111479291.3A 2021-12-04 2021-12-04 Chinese short text classification method based on distillation BERT Pending CN114595325A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202111479291.3A CN114595325A (en) 2021-12-04 2021-12-04 Chinese short text classification method based on distillation BERT

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202111479291.3A CN114595325A (en) 2021-12-04 2021-12-04 Chinese short text classification method based on distillation BERT

Publications (1)

Publication Number Publication Date
CN114595325A true CN114595325A (en) 2022-06-07

Family

ID=81814547

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202111479291.3A Pending CN114595325A (en) 2021-12-04 2021-12-04 Chinese short text classification method based on distillation BERT

Country Status (1)

Country Link
CN (1) CN114595325A (en)

Cited By (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN116719945A (en) * 2023-08-08 2023-09-08 北京惠每云科技有限公司 Medical short text classification method and device, electronic equipment and storage medium
CN116719945B (en) * 2023-08-08 2023-10-24 北京惠每云科技有限公司 Medical short text classification method and device, electronic equipment and storage medium


Legal Events

Date Code Title Description
PB01 Publication