CN116737927A - Gravitational field constraint model distillation method, system, electronic equipment and storage medium for sequence annotation - Google Patents

Gravitational field constraint model distillation method, system, electronic equipment and storage medium for sequence annotation

Info

Publication number
CN116737927A
Authority
CN
China
Prior art keywords
model
loss
constraint
distillation
class
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202310681569.8A
Other languages
Chinese (zh)
Inventor
文怡
钱卫东
朱珈乐
张小丫
闫岱峻
李晓瑜
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
University of Electronic Science and Technology of China
702th Research Institute of CSIC
Original Assignee
University of Electronic Science and Technology of China
702th Research Institute of CSIC
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by University of Electronic Science and Technology of China, 702th Research Institute of CSIC filed Critical University of Electronic Science and Technology of China
Priority to CN202310681569.8A priority Critical patent/CN116737927A/en
Publication of CN116737927A publication Critical patent/CN116737927A/en
Pending legal-status Critical Current

Links

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/30 Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
    • G06F16/35 Clustering; Classification
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00 Pattern recognition
    • G06F18/20 Analysing
    • G06F18/24 Classification techniques
    • G06F18/241 Classification techniques relating to the classification model, e.g. parametric or non-parametric approaches
    • G06F18/2415 Classification techniques relating to the classification model, e.g. parametric or non-parametric approaches based on parametric or probabilistic models, e.g. based on likelihood ratio or false acceptance rate versus a false rejection rate
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 Computing arrangements based on biological models
    • G06N3/02 Neural networks
    • G06N3/08 Learning methods
    • G06N3/084 Backpropagation, e.g. using gradient descent
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 Computing arrangements based on biological models
    • G06N3/02 Neural networks
    • G06N3/08 Learning methods
    • G06N3/09 Supervised learning
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 Computing arrangements based on biological models
    • G06N3/02 Neural networks
    • G06N3/08 Learning methods
    • G06N3/096 Transfer learning

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Data Mining & Analysis (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Artificial Intelligence (AREA)
  • Evolutionary Computation (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Computational Linguistics (AREA)
  • General Health & Medical Sciences (AREA)
  • Software Systems (AREA)
  • Mathematical Physics (AREA)
  • Computing Systems (AREA)
  • Molecular Biology (AREA)
  • Health & Medical Sciences (AREA)
  • Biomedical Technology (AREA)
  • Biophysics (AREA)
  • Databases & Information Systems (AREA)
  • Bioinformatics & Cheminformatics (AREA)
  • Probability & Statistics with Applications (AREA)
  • Evolutionary Biology (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Bioinformatics & Computational Biology (AREA)
  • Feedback Control In General (AREA)

Abstract

The invention discloses a gravity field constraint model distillation method, a system, electronic equipment and a storage medium for sequence annotation, wherein the method comprises the following steps: acquiring a trained teacher model and an untrained student model; training the student model for a plurality of times, each training comprising the following sub-steps: obtaining distillation loss; obtaining real loss; calculating the update amplitude of the fine-tuned parameters in the student model to obtain constraint loss; and carrying out back propagation updating by using distillation loss, real loss and constraint loss, and carrying out fine tuning training on parameters of the student model. In the invention, by adding a constraint mechanism for parameter updating in the fine tuning process, the model keeps the memory of language knowledge learned in the pre-training as much as possible in the fine tuning process, and can adapt to the downstream task through the fine tuning training.

Description

Gravitational field constraint model distillation method, system, electronic equipment and storage medium for sequence annotation
Technical Field
The invention relates to a gravity field constraint model distillation method, a system, electronic equipment and a storage medium for sequence annotation.
Background
Sequence Tagging is one of the most fundamental tasks in NLP and has very wide application: word segmentation, part-of-speech tagging (POS Tagging), named entity recognition (Named Entity Recognition, NER), keyword extraction, semantic role labeling (Semantic Role Labeling), slot filling (Slot Filling), and the like essentially all belong to the category of sequence tagging. Common approaches include HMM, MEMM, and CRF.
However, in prior-art sequence tagging, the problem of "catastrophic forgetting" arises when a large language model is fine-tuned (Fine-Tune) for a downstream task application: during training to adapt to the downstream application task, the previously learned language knowledge is catastrophically forgotten, which degrades the model's performance.
Disclosure of Invention
The invention aims to overcome the defects of the prior art and provides a gravity field constraint model distillation method, a system, electronic equipment and a storage medium for sequence labeling.
The aim of the invention is realized by the following technical scheme:
in a first aspect of the invention, there is provided a gravity field constraint model distillation method for sequence annotation, comprising the steps of:
acquiring a trained teacher model and an untrained student model;
training the student model for a plurality of times, each training comprising the following sub-steps:
inputting the text sequence to be marked into a teacher model and a student model respectively, and calculating KL divergence of the first class prediction probability output by the teacher model and the second class prediction probability output by the student model to obtain distillation loss;
inputting the text sequence with the classification labels into a student model, and performing cross information entropy calculation on the third class prediction probability output by the student model and the classification labels to obtain real loss;
calculating the update amplitude of the fine-tuned parameters in the student model to obtain constraint loss;
and carrying out back propagation updating by using distillation loss, real loss and constraint loss, and carrying out fine tuning training on parameters of the student model.
Further, the distillation loss is calculated as follows:

Distill_Loss = Σ_i t_i(x) · log( t_i(x) / s_i(x) )

where Distill_Loss denotes the distillation loss, x denotes the text sequence to be labeled, t_i(x) denotes the probability predicted for class i in the first class prediction probability output by the teacher model, s_i(x) denotes the probability predicted for class i in the second class prediction probability output by the student model, and Σ denotes summation over all classes;
the real loss is calculated as follows:

CE_Loss = − Σ_{i=1}^{n} p(i) · log q(i)

where CE_Loss denotes the real loss, n denotes the total number of prediction classes, p(i) denotes the classification label, and q(i) denotes the third class prediction probability output by the student model.
Further, the constraint loss is calculated as follows:

Gravitation_Loss = α · ( Σ_{j=1}^{N} (Δw_j)² ) / N

where Gravitation_Loss denotes the constraint loss, Δw_j is the change amount, i.e., the update amplitude, of each fine-tuned parameter in the student model, N denotes the number of fine-tuned parameters, and α is the weight factor of the constraint term, set as a hyper-parameter.
Further, the teacher model is a pre-trained deep neural network model, specifically a large language model pre-trained with a task-independent language-modeling objective, or a trained large model dedicated to the sequence labeling task.
In a second aspect of the invention, there is provided a gravity field constrained model distillation system for sequence annotation, comprising:
model acquisition module: for obtaining a trained teacher model and an untrained student model;
student model training module: for training a student model several times, each training comprising:
distillation loss calculation unit: the method comprises the steps of respectively inputting a text sequence to be marked into a teacher model and a student model, and calculating KL divergence of a first class prediction probability output by the teacher model and a second class prediction probability output by the student model to obtain distillation loss;
a real loss calculation unit: the method comprises the steps of inputting a text sequence with a classification label into a student model, and performing cross information entropy calculation on a third class prediction probability output by the student model and the classification label to obtain real loss;
constraint loss calculation unit: the method comprises the steps of calculating the updating amplitude of parameters finely tuned in a student model to obtain constraint loss;
fine tuning training unit: the method is used for carrying out back propagation updating by utilizing distillation loss, true loss and constraint loss and carrying out fine tuning training on parameters of the student model.
Further, the distillation loss is calculated as follows:

Distill_Loss = Σ_i t_i(x) · log( t_i(x) / s_i(x) )

where Distill_Loss denotes the distillation loss, x denotes the text sequence to be labeled, t_i(x) denotes the probability predicted for class i in the first class prediction probability output by the teacher model, s_i(x) denotes the probability predicted for class i in the second class prediction probability output by the student model, and Σ denotes summation over all classes;
the real loss is calculated as follows:

CE_Loss = − Σ_{i=1}^{n} p(i) · log q(i)

where CE_Loss denotes the real loss, n denotes the total number of prediction classes, p(i) denotes the classification label, and q(i) denotes the third class prediction probability output by the student model.
Further, the constraint loss is calculated as follows:

Gravitation_Loss = α · ( Σ_{j=1}^{N} (Δw_j)² ) / N

where Gravitation_Loss denotes the constraint loss, Δw_j is the change amount, i.e., the update amplitude, of each fine-tuned parameter in the student model, N denotes the number of fine-tuned parameters, and α is the weight factor of the constraint term, set as a hyper-parameter.
Further, the teacher model is a pre-trained deep neural network model, specifically a large language model pre-trained with a task-independent language-modeling objective, or a trained large model dedicated to the sequence labeling task.
In a third aspect of the present invention, an electronic device is provided, including a storage unit and a processing unit, where the storage unit stores computer instructions capable of running on the processing unit, and the processing unit executes the steps of the gravity field constraint model distillation method for sequence labeling when the processing unit runs the computer instructions.
In a fourth aspect of the invention, there is provided a storage medium having stored thereon computer instructions which, when executed, perform the steps of the gravitational field constraint model distillation method for sequence annotation.
The beneficial effects of the invention are as follows:
in an exemplary embodiment of the invention, by adding a constraint mechanism for parameter updating in the fine tuning process, the model keeps the memory of language knowledge learned in the pre-training as much as possible in the fine tuning process, and can adapt to the downstream task through the fine tuning training.
Drawings
FIG. 1 is a flow chart of a gravity field constraint model distillation method for sequence annotation according to an exemplary embodiment of the present invention.
Detailed Description
The technical solutions in the embodiments of the present invention will be described clearly and completely below with reference to the accompanying drawings, in which some, but not all, embodiments of the invention are shown. All other embodiments obtained by those skilled in the art based on the embodiments of the invention without making any inventive effort fall within the scope of the invention.
In the description of the present invention, it should be noted that orientations or positional relationships indicated by terms such as "center", "upper", "lower", "left", "right", "vertical", "horizontal", "inner", "outer", etc. are based on the orientations or positional relationships shown in the drawings, are merely for convenience of describing the present invention and simplifying the description, and do not indicate or imply that the apparatus or elements referred to must have a specific orientation or be constructed and operated in a specific orientation; they should therefore not be construed as limiting the present invention. Furthermore, the terms "first", "second", and the like are used for descriptive purposes only and are not to be construed as indicating or implying relative importance.
In the description of the present invention, it should also be noted that, unless explicitly specified and limited otherwise, the terms "mounted", "connected", and "coupled" are to be construed broadly; for example, a connection may be fixed, detachable, or integral; it may be mechanical or electrical; it may be direct, indirect through an intermediate medium, or a communication between two elements. The specific meaning of the above terms in the present invention will be understood by those of ordinary skill in the art on a case-by-case basis.
In addition, the technical features of the different embodiments of the present invention described below may be combined with each other as long as they do not collide with each other.
Referring to fig. 1, fig. 1 shows a flow chart of a gravity field constraint model distillation method for sequence labeling according to an exemplary embodiment of the present invention, including the following steps:
acquiring a trained teacher model and an untrained student model;
training the student model for a plurality of times, each training comprising the following sub-steps:
inputting the text sequence to be marked into a teacher model and a student model respectively, and calculating KL divergence of the first class prediction probability output by the teacher model and the second class prediction probability output by the student model to obtain distillation loss;
inputting the text sequence with the classification labels into a student model, and performing cross information entropy calculation on the third class prediction probability output by the student model and the classification labels to obtain real loss;
calculating the update amplitude of the fine-tuned parameters in the student model to obtain constraint loss;
and carrying out back propagation updating by using distillation loss, real loss and constraint loss, and carrying out fine tuning training on parameters of the student model.
Specifically, in the present exemplary embodiment, training of the student model includes:
(1) Language knowledge distillation
The first step is knowledge distillation. As shown on the left side of FIG. 1, the language knowledge in the pre-trained large model (the teacher model) is distilled into a smaller model (the student model). The two models receive the same input, the KL divergence between the prediction probability distribution Prob_T output by the teacher model (the first class prediction probability) and the prediction probability distribution Prob_S output by the student model (the second class prediction probability) is calculated, and the KL divergence value between the two distributions is used as the loss function of distillation learning, namely Distill_Loss.
In a preferred exemplary embodiment, the distillation loss is calculated as follows:

Distill_Loss = Σ_i t_i(x) · log( t_i(x) / s_i(x) )

where Distill_Loss denotes the distillation loss, x denotes the text sequence to be labeled, t_i(x) denotes the probability predicted for class i in the first class prediction probability output by the teacher model, s_i(x) denotes the probability predicted for class i in the second class prediction probability output by the student model, and Σ denotes summation over all classes.
The above computes the distillation loss for a single sequence element; the distillation loss of the labeling-classification prediction over all sequence elements (tokens) is obtained by summing the per-element distillation losses and taking the average. A minimal sketch of this computation follows.
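For concreteness, the following is a minimal PyTorch sketch of this per-token distillation loss; it is an illustrative reading of the description above, not the patent's prescribed implementation. The (batch, seq_len, num_classes) logit shapes and the optional temperature argument are assumptions of the sketch.

```python
import torch
import torch.nn.functional as F

def distill_loss(teacher_logits: torch.Tensor,
                 student_logits: torch.Tensor,
                 temperature: float = 1.0) -> torch.Tensor:
    """KL-divergence distillation loss for sequence labeling.

    Both logit tensors are assumed to have shape (batch, seq_len, num_classes).
    KL(t || s) = sum_i t_i(x) * log(t_i(x) / s_i(x)) is computed per token and
    then averaged over all sequence elements, matching the
    per-element-then-average description in the text.
    """
    t_prob = F.softmax(teacher_logits / temperature, dim=-1)          # t_i(x)
    s_log_prob = F.log_softmax(student_logits / temperature, dim=-1)  # log s_i(x)
    kl_per_token = torch.sum(
        t_prob * (torch.log(t_prob.clamp_min(1e-12)) - s_log_prob), dim=-1
    )
    return kl_per_token.mean()
```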
(2) Fine tuning under gravitational constraints
The second step is the fine-tuning training loss. As shown on the right side of fig. 1, supervised training is performed on a manually labeled dataset (text sequences with classification labels).
(2-1) Calculating, with cross information entropy, the loss CE_Loss between the model's classification prediction for the sequence annotation and the real tag.
In a preferred exemplary embodiment, the real loss is calculated as follows:

CE_Loss = − Σ_{i=1}^{n} p(i) · log q(i)

where CE_Loss denotes the real loss, n denotes the total number of prediction classes, p(i) denotes the classification label, and q(i) denotes the third class prediction probability output by the student model. CE_Loss measures the gap between the probability distribution predicted by the model and the true labels and is a loss function commonly used in classification tasks.
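A minimal PyTorch sketch of this true-label loss follows; the flattening of the batch and sequence dimensions and the ignore_index convention for padded or unlabeled positions are assumptions of the sketch rather than details given in the patent.

```python
import torch
import torch.nn.functional as F

def true_loss(student_logits: torch.Tensor,
              labels: torch.Tensor,
              ignore_index: int = -100) -> torch.Tensor:
    """Cross information entropy (CE_Loss) between the student's per-token
    predictions q(i) and the gold classification labels p(i).

    student_logits: (batch, seq_len, num_classes); labels: (batch, seq_len)
    holding class indices, with ignore_index marking positions to skip.
    """
    num_classes = student_logits.size(-1)
    return F.cross_entropy(
        student_logits.reshape(-1, num_classes),  # flatten all tokens
        labels.reshape(-1),
        ignore_index=ignore_index,
    )
```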
(2-2) calculating the update amplitude of the fine-tuned parameters in the student model to obtain constraint loss.
In a preferred exemplary embodiment, the constraint loss is calculated as follows:

Gravitation_Loss = α · ( Σ_{j=1}^{N} (Δw_j)² ) / N

where Gravitation_Loss denotes the constraint loss, Δw_j is the change amount, i.e., the update amplitude, of each fine-tuned parameter in the student model, N denotes the number of fine-tuned parameters, and α is the weight factor of the constraint term, set as a hyper-parameter.
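The following PyTorch sketch illustrates one way to realize such a constraint loss, assuming the update amplitude Δw of each fine-tuned parameter is measured against a snapshot of the weights taken before fine tuning; the squared-deviation form and the default value of alpha are assumptions of the sketch.

```python
import torch
import torch.nn as nn

def gravitation_loss(student: nn.Module,
                     pretrained_state: dict,
                     alpha: float = 0.01) -> torch.Tensor:
    """Gravitational-field constraint on the fine-tuned parameters.

    Penalizes the update amplitude delta_w of every trainable parameter,
    measured against a pre-fine-tuning snapshot, averaged over the N
    fine-tuned parameters and scaled by the constraint weight alpha.
    """
    total = None
    count = 0
    for name, param in student.named_parameters():
        if not param.requires_grad:
            continue
        delta_w = param - pretrained_state[name].to(param.device)  # update amplitude
        sq = delta_w.pow(2).sum()
        total = sq if total is None else total + sq
        count += param.numel()  # N: number of fine-tuned parameters
    return alpha * total / count

# Snapshot taken once, before any fine-tuning step:
# pretrained_state = {k: v.detach().clone() for k, v in student.state_dict().items()}
```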
The method of the present exemplary embodiment effectively alleviates the "catastrophic forgetting" problem that arises when a large language model is fine-tuned (Fine-Tune) for a downstream task application (during training to adapt to the downstream application task, the previously learned language knowledge is catastrophically forgotten, which degrades the model's performance). By adding a constraint mechanism on parameter updates during fine tuning, the model retains as much as possible the memory of the language knowledge learned in pre-training, while still adapting to the downstream task through fine-tuning training.
It should be noted that, after the gravitational constraint term on the fine-tuned weights is added to the total Loss (i.e., the total loss function comprising the weighted components of all loss functions), a parameter is updated during training only when the decrease in CE_Loss brought by its update amount Δw exceeds the increase in loss brought by the gravitational constraint term. Conversely, when the decrease in CE_Loss brought by Δw is smaller than the increase in loss brought by the gravitational constraint term, the back propagation driven by the total Loss does not update that parameter. In this way, during fine tuning the model both preserves the memory of the teacher model's knowledge and still makes the parameter adjustments needed to achieve the fine-tuning effect.
Meanwhile, the fine-tuned parameters of the student model mainly comprise the parameters in the parameter matrices constituting each layer (e.g., the embedding layer, the hidden layers, the output layer, etc.). In addition, the fine-tuned parameter values generally lie between -1 and 1, so the update amplitude penalized by the constraint loss is not large, since the magnitudes involved are small. The relative proportion of the three losses can be chosen according to actual needs, as in the sketch below.
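Putting the pieces together, a single fine-tuning step might look like the following sketch, which sums the three losses into a total loss and back-propagates it. The loss weights, the Hugging-Face-style teacher output (exposing .logits), and the helper functions from the earlier sketches are assumptions of this sketch, not the patent's mandated implementation.

```python
import torch

def training_step(teacher, student, optimizer, batch, pretrained_state,
                  w_distill: float = 1.0, w_ce: float = 1.0, alpha: float = 0.01):
    """One fine-tuning step: compute the three losses, sum them into the total
    loss, and back-propagate to update the student's parameters.

    The teacher is assumed to be a Hugging Face token-classification model;
    the student is assumed to return raw per-token logits. distill_loss,
    true_loss and gravitation_loss are the helpers sketched above.
    """
    input_ids, attention_mask, labels = batch

    with torch.no_grad():  # the teacher is frozen during distillation
        teacher_logits = teacher(input_ids, attention_mask=attention_mask).logits
    student_logits = student(input_ids, attention_mask=attention_mask)

    total_loss = (w_distill * distill_loss(teacher_logits, student_logits)
                  + w_ce * true_loss(student_logits, labels)
                  + gravitation_loss(student, pretrained_state, alpha=alpha))

    optimizer.zero_grad()
    total_loss.backward()  # back propagation driven by the total loss
    optimizer.step()
    return total_loss.item()
```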
More preferably, in an exemplary embodiment, the teacher model is a pre-trained deep neural network model, specifically a large pre-trained language model that is task independent, or a large trained model that is specifically applied to sequence labeling tasks.
Specifically, in an exemplary embodiment, the teacher model may be either a task-independent pre-trained large language model (e.g., hfl/chinese-macbert-large, bigscience/bloom-560m, etc.; model parameters above 500M, model files typically above 1G) or a large model trained specifically for sequence labeling tasks (e.g., dslim/bert-large-NER, Jean-Baptiste/roberta-large-ner-english, etc.; model parameters above 500M, model files typically above 1G).
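As an illustration of how such a teacher model could be obtained, the following sketch loads one of the sequence-labeling checkpoints mentioned above with the Hugging Face transformers library; the specific checkpoint choice is illustrative only.

```python
from transformers import AutoModelForTokenClassification, AutoTokenizer

# Illustrative: load a ready-made sequence-labeling teacher from the Hugging
# Face hub. A task-independent checkpoint such as hfl/chinese-macbert-large
# would instead need a token-classification head and task training first.
teacher_name = "dslim/bert-large-NER"
tokenizer = AutoTokenizer.from_pretrained(teacher_name)
teacher = AutoModelForTokenClassification.from_pretrained(teacher_name)
teacher.eval()  # the teacher's weights are never updated during distillation
```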
Correspondingly, the student model may be a small model consisting of shallow Transformer encoder layers, with its parameters randomly initialized before training (e.g., 5 Transformer encoder layers, about 50M parameters, a model file of about 100M).
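A possible student of roughly this size is sketched below; the hidden size, number of attention heads, vocabulary size, and label count are illustrative values assumed for the sketch, not figures taken from the patent.

```python
import torch
import torch.nn as nn

class StudentTagger(nn.Module):
    """Small randomly initialized student: an embedding layer, a shallow stack
    of Transformer encoder layers, and a per-token classification head."""
    def __init__(self, vocab_size=21128, hidden=768, heads=12,
                 num_layers=5, num_labels=9, max_len=512):
        super().__init__()
        self.tok_emb = nn.Embedding(vocab_size, hidden)
        self.pos_emb = nn.Embedding(max_len, hidden)
        layer = nn.TransformerEncoderLayer(d_model=hidden, nhead=heads,
                                           batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, num_layers=num_layers)
        self.classifier = nn.Linear(hidden, num_labels)

    def forward(self, input_ids, attention_mask=None):
        pos = torch.arange(input_ids.size(1), device=input_ids.device)
        x = self.tok_emb(input_ids) + self.pos_emb(pos)[None, :, :]
        pad_mask = (attention_mask == 0) if attention_mask is not None else None
        x = self.encoder(x, src_key_padding_mask=pad_mask)
        return self.classifier(x)  # per-token class logits
```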
On our manually annotated test set, fine tuning based on the "gravitational constraint" method improves the accuracy of sequence annotation prediction from 79.3% to 94.7% compared with ordinary unconstrained fine tuning, an absolute improvement of 15.4 percentage points. The experimental procedure was as follows:
in the construction of the knowledge graph in the ship field, a plurality of 5,500 entity identification samples are marked. We randomly split it into a training set of 3,500 sample size and a test set of 2,000 sample size. Initially we did not use gravitational constraints and sequence annotation accuracy for entity recognition on the test set was 79.3% by distillation learning on the large model + fine tuning on the training set. Then we propose a method using gravity constraint to mitigate the "knowledge forgetting phenomenon" of the model when fine tuning downstream, by analysis of the model prediction error samples. Tests on the test set show that the novel method is more effective, and the accuracy of sequence labeling of entity identification on the test set is 94.7%.
The same inventive concept as the above exemplary embodiment provides in a further exemplary embodiment of the present invention a gravity field constraint model distillation system for sequence annotation, comprising:
model acquisition module: for obtaining a trained teacher model and an untrained student model;
student model training module: for training a student model several times, each training comprising:
distillation loss calculation unit: the method comprises the steps of respectively inputting a text sequence to be marked into a teacher model and a student model, and calculating KL divergence of a first class prediction probability output by the teacher model and a second class prediction probability output by the student model to obtain distillation loss;
a real loss calculation unit: the method comprises the steps of inputting a text sequence with a classification label into a student model, and performing cross information entropy calculation on a third class prediction probability output by the student model and the classification label to obtain real loss;
constraint loss calculation unit: the method comprises the steps of calculating the updating amplitude of parameters finely tuned in a student model to obtain constraint loss;
fine tuning training unit: the method is used for carrying out back propagation updating by utilizing distillation loss, true loss and constraint loss and carrying out fine tuning training on parameters of the student model.
Correspondingly, the distillation loss is calculated as follows:

Distill_Loss = Σ_i t_i(x) · log( t_i(x) / s_i(x) )

where Distill_Loss denotes the distillation loss, x denotes the text sequence to be labeled, t_i(x) denotes the probability predicted for class i in the first class prediction probability output by the teacher model, s_i(x) denotes the probability predicted for class i in the second class prediction probability output by the student model, and Σ denotes summation over all classes;
the real loss is calculated as follows:

CE_Loss = − Σ_{i=1}^{n} p(i) · log q(i)

where CE_Loss denotes the real loss, n denotes the total number of prediction classes, p(i) denotes the classification label, and q(i) denotes the third class prediction probability output by the student model.
Correspondingly, the constraint loss is calculated as follows:

Gravitation_Loss = α · ( Σ_{j=1}^{N} (Δw_j)² ) / N

where Gravitation_Loss denotes the constraint loss, Δw_j is the change amount, i.e., the update amplitude, of each fine-tuned parameter in the student model, N denotes the number of fine-tuned parameters, and α is the weight factor of the constraint term, set as a hyper-parameter.
Correspondingly, the teacher model is a pre-trained deep neural network model, specifically a large language model pre-trained with a task-independent language-modeling objective, or a trained large model dedicated to the sequence labeling task.
In another exemplary embodiment of the present invention, there is provided an electronic device including a storage unit and a processing unit, where the storage unit stores computer instructions executable on the processing unit, and the processing unit executes the steps of the gravity field constraint model distillation method for sequence labeling when the processing unit executes the computer instructions.
The electronic device is in the form of a general purpose computing device. Components of an electronic device may include, but are not limited to: the at least one processing unit, the at least one memory unit, and a bus connecting the different system components (including the memory unit and the processing unit).
Wherein the storage unit stores program code executable by the processing unit such that the processing unit performs steps according to various exemplary embodiments of the present invention described in the above section of the exemplary method of the present specification. For example, the processing unit may perform the method as shown in fig. 1.
The memory unit may include readable media in the form of volatile memory units, such as a random access memory (RAM) unit and/or a cache memory unit, and may further include a read-only memory (ROM) unit.
The storage unit may also include a program/utility having a set (at least one) of program modules including, but not limited to: an operating system, one or more application programs, other program modules, and program data, each or some combination of which may include an implementation of a network environment.
The bus may be one or more of several types of bus structures including a memory unit bus or memory unit controller, a peripheral bus, an accelerated graphics port, a processing unit, or a local bus using any of a variety of bus architectures.
The electronic device may also communicate with one or more external devices (e.g., keyboard, pointing device, bluetooth device, etc.), with one or more devices that enable a user to interact with the electronic device, and/or with any device (e.g., router, modem, etc.) that enables the electronic device to communicate with one or more other computing devices. Such communication may be through an input/output (I/O) interface. And, the electronic device may also communicate with one or more networks such as a Local Area Network (LAN), a Wide Area Network (WAN), and/or a public network, such as the Internet, through a network adapter. The network adapter communicates with other modules of the electronic device via a bus. It should be appreciated that other hardware and/or software modules may be used in connection with an electronic device, including but not limited to: microcode, device drivers, redundant processing units, external disk drive arrays, RAID systems, tape drives, data backup storage systems, and the like.
As will be readily appreciated by those skilled in the art from the foregoing description, the example embodiments described herein may be implemented in software, or may be implemented in software in combination with the necessary hardware. Accordingly, the technical solution according to the present exemplary embodiment may be embodied in the form of a software product, which may be stored in a non-volatile storage medium (may be a CD-ROM, a U-disk, a mobile hard disk, etc.) or on a network, and includes several instructions to cause a computing device (may be a personal computer, a server, a terminal device, or a network device, etc.) to perform the method according to the present exemplary embodiment.
In yet another exemplary embodiment of the present invention, there is provided a storage medium having stored thereon computer instructions which, when executed, perform the steps of the method for gravity field constraint model distillation for sequence annotation.
Based on this understanding, the technical solution of the present embodiment, in essence, or the part thereof that contributes to the prior art, or a part of the technical solution, may be embodied in the form of a software product (program product) stored in a storage medium, comprising several instructions for causing a computer device (which may be a personal computer, a server, a network device, etc.) to execute all or part of the steps of the method described in the embodiments of the present invention.
The program product may employ any combination of one or more readable media. The readable medium may be a readable signal medium or a readable storage medium. The readable storage medium can be, for example, but is not limited to, an electronic, magnetic, optical, electromagnetic, infrared, or semiconductor system, apparatus, or device, or a combination of any of the foregoing. More specific examples (a non-exhaustive list) of the readable storage medium would include the following: an electrical connection having one or more wires, a portable disk, a hard disk, random Access Memory (RAM), read-only memory (ROM), erasable programmable read-only memory (EPROM or flash memory), optical fiber, portable compact disk read-only memory (CD-ROM), an optical storage device, a magnetic storage device, or any suitable combination of the foregoing.
The computer readable signal medium may include a data signal propagated in baseband or as part of a carrier wave with readable program code embodied therein. Such a propagated data signal may take any of a variety of forms, including, but not limited to, electro-magnetic, optical, or any suitable combination of the foregoing. A readable signal medium may also be any readable medium that is not a readable storage medium and that can communicate, propagate, or transport a program for use by or in connection with an instruction execution system, apparatus, or device.
Program code embodied on a readable medium may be transmitted using any appropriate medium, including but not limited to wireless, wireline, optical fiber cable, RF, etc., or any suitable combination of the foregoing.
Program code for carrying out operations of the present invention may be written in any combination of one or more programming languages, including an object oriented programming language such as Java, C++ or the like and conventional procedural programming languages, such as the "C" programming language or similar programming languages. The program code may execute entirely on the user's computing device, partly on the user's device, as a stand-alone software package, partly on the user's computing device, partly on a remote computing device, or entirely on the remote computing device or server. In the case of remote computing devices, the remote computing device may be connected to the user computing device through any kind of network, including a Local Area Network (LAN) or a Wide Area Network (WAN), or may be connected to an external computing device (e.g., connected via the Internet using an Internet service provider).
It is apparent that the above examples are given by way of illustration only and not by way of limitation, and other variations or modifications in different forms may be made by those of ordinary skill in the art based on the above description. It is neither necessary nor possible to exhaustively list all embodiments here; obvious variations or modifications derived therefrom are still within the scope of the present invention.

Claims (10)

1. A gravitational field constraint model distillation method for sequence labeling is characterized in that: the method comprises the following steps:
acquiring a trained teacher model and an untrained student model;
training the student model for a plurality of times, each training comprising the following sub-steps:
inputting the text sequence to be marked into a teacher model and a student model respectively, and calculating KL divergence of the first class prediction probability output by the teacher model and the second class prediction probability output by the student model to obtain distillation loss;
inputting the text sequence with the classification labels into a student model, and performing cross information entropy calculation on the third class prediction probability output by the student model and the classification labels to obtain real loss;
calculating the update amplitude of the fine-tuned parameters in the student model to obtain constraint loss;
and carrying out back propagation updating by using distillation loss, real loss and constraint loss, and carrying out fine tuning training on parameters of the student model.
2. A gravity field constraint model distillation method for sequence annotation according to claim 1, wherein: the distillation loss is calculated as follows:

Distill_Loss = Σ_i t_i(x) · log( t_i(x) / s_i(x) )

where Distill_Loss denotes the distillation loss, x denotes the text sequence to be labeled, t_i(x) denotes the probability predicted for class i in the first class prediction probability output by the teacher model, s_i(x) denotes the probability predicted for class i in the second class prediction probability output by the student model, and Σ denotes summation over all classes;

the real loss is calculated as follows:

CE_Loss = − Σ_{i=1}^{n} p(i) · log q(i)

where CE_Loss denotes the real loss, n denotes the total number of prediction classes, p(i) denotes the classification label, and q(i) denotes the third class prediction probability output by the student model.
3. A gravity field constraint model distillation method for sequence annotation according to claim 1, wherein: the constraint loss is calculated as follows:

Gravitation_Loss = α · ( Σ_{j=1}^{N} (Δw_j)² ) / N

where Gravitation_Loss denotes the constraint loss, Δw_j is the change amount, i.e., the update amplitude, of each fine-tuned parameter in the student model, N denotes the number of fine-tuned parameters, and α is the weight factor of the constraint term, set as a hyper-parameter.
4. A gravity field constraint model distillation method for sequence annotation according to claim 1, wherein: the teacher model is a pre-trained deep neural network model, specifically a large language model pre-trained with a task-independent language-modeling objective, or a trained large model dedicated to the sequence labeling task.
5. A gravitational field constraint model distillation system for sequence annotation, characterized in that: comprising the following steps:
model acquisition module: for obtaining a trained teacher model and an untrained student model;
student model training module: for training a student model several times, each training comprising:
distillation loss calculation unit: the method comprises the steps of respectively inputting a text sequence to be marked into a teacher model and a student model, and calculating KL divergence of a first class prediction probability output by the teacher model and a second class prediction probability output by the student model to obtain distillation loss;
a real loss calculation unit: the method comprises the steps of inputting a text sequence with a classification label into a student model, and performing cross information entropy calculation on a third class prediction probability output by the student model and the classification label to obtain real loss;
constraint loss calculation unit: the method comprises the steps of calculating the updating amplitude of parameters finely tuned in a student model to obtain constraint loss;
fine tuning training unit: the method is used for carrying out back propagation updating by utilizing distillation loss, true loss and constraint loss and carrying out fine tuning training on parameters of the student model.
6. A gravity field constrained model distillation system for sequence annotation according to claim 5, wherein: the distillation loss is calculated as follows:

Distill_Loss = Σ_i t_i(x) · log( t_i(x) / s_i(x) )

where Distill_Loss denotes the distillation loss, x denotes the text sequence to be labeled, t_i(x) denotes the probability predicted for class i in the first class prediction probability output by the teacher model, s_i(x) denotes the probability predicted for class i in the second class prediction probability output by the student model, and Σ denotes summation over all classes;

the real loss is calculated as follows:

CE_Loss = − Σ_{i=1}^{n} p(i) · log q(i)

where CE_Loss denotes the real loss, n denotes the total number of prediction classes, p(i) denotes the classification label, and q(i) denotes the third class prediction probability output by the student model.
7. A gravity field constrained model distillation system for sequence annotation according to claim 5, wherein: the constraint loss is calculated as follows:

Gravitation_Loss = α · ( Σ_{j=1}^{N} (Δw_j)² ) / N

where Gravitation_Loss denotes the constraint loss, Δw_j is the change amount, i.e., the update amplitude, of each fine-tuned parameter in the student model, N denotes the number of fine-tuned parameters, and α is the weight factor of the constraint term, set as a hyper-parameter.
8. A gravity field constrained model distillation system for sequence annotation according to claim 5, wherein: the teacher model is a pre-trained deep neural network model, specifically a large language model pre-trained with a task-independent language-modeling objective, or a trained large model dedicated to the sequence labeling task.
9. An electronic device comprising a memory unit and a processing unit, the memory unit having stored thereon computer instructions executable on the processing unit, characterized by: the processing unit, when executing the computer instructions, performs the steps of a gravity field constraint model distillation method for sequence annotation according to any one of claims 1 to 4.
10. A storage medium having stored thereon computer instructions, characterized by: the computer instructions, when executed, perform the steps of a gravity field constrained model distillation method for sequence annotation according to any one of claims 1 to 4.
CN202310681569.8A 2023-06-09 2023-06-09 Gravitational field constraint model distillation method, system, electronic equipment and storage medium for sequence annotation Pending CN116737927A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202310681569.8A CN116737927A (en) 2023-06-09 2023-06-09 Gravitational field constraint model distillation method, system, electronic equipment and storage medium for sequence annotation

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202310681569.8A CN116737927A (en) 2023-06-09 2023-06-09 Gravitational field constraint model distillation method, system, electronic equipment and storage medium for sequence annotation

Publications (1)

Publication Number Publication Date
CN116737927A true CN116737927A (en) 2023-09-12

Family

ID=87914452

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202310681569.8A Pending CN116737927A (en) 2023-06-09 2023-06-09 Gravitational field constraint model distillation method, system, electronic equipment and storage medium for sequence annotation

Country Status (1)

Country Link
CN (1) CN116737927A (en)

Cited By (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN117057413A (en) * 2023-09-27 2023-11-14 珠高智能科技(深圳)有限公司 Reinforcement learning model fine tuning method, apparatus, computer device and storage medium
CN117057413B (en) * 2023-09-27 2024-03-15 传申弘安智能(深圳)有限公司 Reinforcement learning model fine tuning method, apparatus, computer device and storage medium

Similar Documents

Publication Publication Date Title
US10824815B2 (en) Document classification using attention networks
CN110717039A (en) Text classification method and device, electronic equipment and computer-readable storage medium
US20180053107A1 (en) Aspect-based sentiment analysis
US20200364409A1 (en) Implicit discourse relation classification with contextualized word representation
CN111522958A (en) Text classification method and device
CN107679234A (en) Customer service information providing method, device, electronic equipment, storage medium
CN111309915A (en) Method, system, device and storage medium for training natural language of joint learning
CN110245232B (en) Text classification method, device, medium and computing equipment
CN113420822B (en) Model training method and device and text prediction method and device
CN112188312B (en) Method and device for determining video material of news
CN113128227A (en) Entity extraction method and device
CN114298050A (en) Model training method, entity relation extraction method, device, medium and equipment
CN112560504B (en) Method, electronic equipment and computer readable medium for extracting information in form document
CN116737927A (en) Gravitational field constraint model distillation method, system, electronic equipment and storage medium for sequence annotation
CN113434683A (en) Text classification method, device, medium and electronic equipment
CN113826113A (en) Counting rare training data for artificial intelligence
CN115062617A (en) Task processing method, device, equipment and medium based on prompt learning
CN115269827A (en) Intent determination in an improved messaging conversation management system
CN113591998A (en) Method, device, equipment and storage medium for training and using classification model
CN111445271A (en) Model generation method, and prediction method, system, device and medium for cheating hotel
US20230139642A1 (en) Method and apparatus for extracting skill label
CN116048463A (en) Intelligent recommendation method and device for content of demand item based on label management
CN113407719B (en) Text data detection method and device, electronic equipment and storage medium
CN114970540A (en) Method and device for training text audit model
CN114580399A (en) Text error correction method and device, electronic equipment and storage medium

Legal Events

Date Code Title Description
PB01 Publication
PB01 Publication
SE01 Entry into force of request for substantive examination
SE01 Entry into force of request for substantive examination