CN116737927A - Gravitational field constraint model distillation method, system, electronic equipment and storage medium for sequence annotation - Google Patents

Gravitational field constraint model distillation method, system, electronic equipment and storage medium for sequence annotation

Info

Publication number
CN116737927A
Authority
CN
China
Prior art keywords
model
loss
constraint
distillation
class
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202310681569.8A
Other languages
Chinese (zh)
Inventor
文怡
钱卫东
朱珈乐
张小丫
闫岱峻
李晓瑜
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
University of Electronic Science and Technology of China
702th Research Institute of CSIC
Original Assignee
University of Electronic Science and Technology of China
702th Research Institute of CSIC
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by University of Electronic Science and Technology of China, 702th Research Institute of CSIC filed Critical University of Electronic Science and Technology of China
Priority to CN202310681569.8A priority Critical patent/CN116737927A/en
Publication of CN116737927A publication Critical patent/CN116737927A/en
Pending legal-status Critical Current

Links

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/30 Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
    • G06F16/35 Clustering; Classification
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00 Pattern recognition
    • G06F18/20 Analysing
    • G06F18/24 Classification techniques
    • G06F18/241 Classification techniques relating to the classification model, e.g. parametric or non-parametric approaches
    • G06F18/2415 Classification techniques relating to the classification model, e.g. parametric or non-parametric approaches based on parametric or probabilistic models, e.g. based on likelihood ratio or false acceptance rate versus a false rejection rate
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 Computing arrangements based on biological models
    • G06N3/02 Neural networks
    • G06N3/08 Learning methods
    • G06N3/084 Backpropagation, e.g. using gradient descent
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 Computing arrangements based on biological models
    • G06N3/02 Neural networks
    • G06N3/08 Learning methods
    • G06N3/09 Supervised learning
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 Computing arrangements based on biological models
    • G06N3/02 Neural networks
    • G06N3/08 Learning methods
    • G06N3/096 Transfer learning

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Data Mining & Analysis (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Artificial Intelligence (AREA)
  • Evolutionary Computation (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Computational Linguistics (AREA)
  • General Health & Medical Sciences (AREA)
  • Software Systems (AREA)
  • Mathematical Physics (AREA)
  • Computing Systems (AREA)
  • Molecular Biology (AREA)
  • Health & Medical Sciences (AREA)
  • Biomedical Technology (AREA)
  • Biophysics (AREA)
  • Databases & Information Systems (AREA)
  • Bioinformatics & Cheminformatics (AREA)
  • Probability & Statistics with Applications (AREA)
  • Evolutionary Biology (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Bioinformatics & Computational Biology (AREA)
  • Feedback Control In General (AREA)

Abstract

The invention discloses a gravity field constraint model distillation method, a system, electronic equipment and a storage medium for sequence annotation, wherein the method comprises the following steps: acquiring a trained teacher model and an untrained student model; training the student model for a plurality of times, each training comprising the following sub-steps: obtaining distillation loss; obtaining real loss; calculating the update amplitude of the fine-tuned parameters in the student model to obtain constraint loss; and carrying out back propagation updating by using distillation loss, real loss and constraint loss, and carrying out fine tuning training on parameters of the student model. In the invention, by adding a constraint mechanism for parameter updating in the fine tuning process, the model keeps the memory of language knowledge learned in the pre-training as much as possible in the fine tuning process, and can adapt to the downstream task through the fine tuning training.

Description

Gravitational field constraint model distillation method, system, electronic equipment and storage medium for sequence annotation
Technical Field
The invention relates to a gravity field constraint model distillation method, a system, electronic equipment and a storage medium for sequence annotation.
Background
Sequence Tagging is one of the most fundamental tasks in NLP and has very wide application: word segmentation, part-of-speech tagging (POS Tagging), named entity recognition (Named Entity Recognition, NER), keyword extraction, semantic role labeling (Semantic Role Labeling), slot filling (Slot Filling), and the like essentially all belong to the category of sequence tagging. Common approaches include HMM, MEMM, and CRF.
However, in prior-art sequence tagging, the problem of "catastrophic forgetting" arises when a large language model is fine-tuned (Fine-Tune) for a downstream task application: during training to adapt to the downstream application task, the previously learned language knowledge is catastrophically forgotten, which degrades the model's performance.
Disclosure of Invention
The invention aims to overcome the defects of the prior art and provides a gravity field constraint model distillation method, a system, electronic equipment and a storage medium for sequence labeling.
The aim of the invention is realized by the following technical scheme:
in a first aspect of the invention, there is provided a gravity field constraint model distillation method for sequence annotation, comprising the steps of:
acquiring a trained teacher model and an untrained student model;
training the student model for a plurality of times, each training comprising the following sub-steps:
inputting the text sequence to be marked into a teacher model and a student model respectively, and calculating KL divergence of the first class prediction probability output by the teacher model and the second class prediction probability output by the student model to obtain distillation loss;
inputting the text sequence with the classification labels into a student model, and performing cross information entropy calculation on the third class prediction probability output by the student model and the classification labels to obtain real loss;
calculating the update amplitude of the fine-tuned parameters in the student model to obtain constraint loss;
and carrying out back propagation updating by using distillation loss, real loss and constraint loss, and carrying out fine tuning training on parameters of the student model.
Further, the distillation loss is calculated as follows:

Distill_Loss = Σ_i t_i(x) · log( t_i(x) / s_i(x) )

where Distill_Loss denotes the distillation loss, x denotes the text sequence to be labeled, t_i(x) denotes the probability predicted for class i in the first class prediction probability output by the teacher model, s_i(x) denotes the probability predicted for class i in the second class prediction probability output by the student model, and Σ denotes summation over all classes;
the real loss is calculated as follows:

CE_Loss = − Σ_{i=1}^{n} p(i) · log q(i)

where CE_Loss denotes the real loss, n denotes the total number of prediction classes, p(i) denotes the classification label, and q(i) denotes the third class prediction probability output by the student model.
Further, the constraint loss is calculated as follows:

Gravitation_Loss = α · ( Σ_{j=1}^{N} (Δw_j)² ) / N

where Gravitation_Loss denotes the constraint loss, Δw_j is the change amount, i.e., the update amplitude, of each fine-tuned parameter in the student model, N denotes the number of fine-tuned parameters, and α is the weight factor of the constraint term, set as a hyper-parameter.
Further, the teacher model is a pre-trained deep neural network model, specifically a large language model pre-trained with a task-independent language-modeling objective, or a trained large model dedicated to the sequence labeling task.
In a second aspect of the invention, there is provided a gravity field constrained model distillation system for sequence annotation, comprising:
model acquisition module: for obtaining a trained teacher model and an untrained student model;
student model training module: for training a student model several times, each training comprising:
distillation loss calculation unit: the method comprises the steps of respectively inputting a text sequence to be marked into a teacher model and a student model, and calculating KL divergence of a first class prediction probability output by the teacher model and a second class prediction probability output by the student model to obtain distillation loss;
a real loss calculation unit: the method comprises the steps of inputting a text sequence with a classification label into a student model, and performing cross information entropy calculation on a third class prediction probability output by the student model and the classification label to obtain real loss;
constraint loss calculation unit: the method comprises the steps of calculating the updating amplitude of parameters finely tuned in a student model to obtain constraint loss;
fine tuning training unit: the method is used for carrying out back propagation updating by utilizing distillation loss, true loss and constraint loss and carrying out fine tuning training on parameters of the student model.
Further, the distillation loss is calculated as follows:

Distill_Loss = Σ_i t_i(x) · log( t_i(x) / s_i(x) )

where Distill_Loss denotes the distillation loss, x denotes the text sequence to be labeled, t_i(x) denotes the probability predicted for class i in the first class prediction probability output by the teacher model, s_i(x) denotes the probability predicted for class i in the second class prediction probability output by the student model, and Σ denotes summation over all classes;
the real loss is calculated as follows:

CE_Loss = − Σ_{i=1}^{n} p(i) · log q(i)

where CE_Loss denotes the real loss, n denotes the total number of prediction classes, p(i) denotes the classification label, and q(i) denotes the third class prediction probability output by the student model.
Further, the constraint loss is calculated as follows:

Gravitation_Loss = α · ( Σ_{j=1}^{N} (Δw_j)² ) / N

where Gravitation_Loss denotes the constraint loss, Δw_j is the change amount, i.e., the update amplitude, of each fine-tuned parameter in the student model, N denotes the number of fine-tuned parameters, and α is the weight factor of the constraint term, set as a hyper-parameter.
Further, the teacher model is a pre-trained deep neural network model, specifically a large language model pre-trained with a task-independent language-modeling objective, or a trained large model dedicated to the sequence labeling task.
In a third aspect of the present invention, an electronic device is provided, including a storage unit and a processing unit, where the storage unit stores computer instructions capable of running on the processing unit, and the processing unit executes the steps of the gravity field constraint model distillation method for sequence labeling when the processing unit runs the computer instructions.
In a fourth aspect of the invention, there is provided a storage medium having stored thereon computer instructions which, when executed, perform the steps of the gravitational field constraint model distillation method for sequence annotation.
The beneficial effects of the invention are as follows:
in an exemplary embodiment of the invention, by adding a constraint mechanism for parameter updating in the fine tuning process, the model keeps the memory of language knowledge learned in the pre-training as much as possible in the fine tuning process, and can adapt to the downstream task through the fine tuning training.
Drawings
FIG. 1 is a flow chart of a gravity field constraint model distillation method for sequence annotation according to an exemplary embodiment of the present invention.
Detailed Description
The technical solutions in the embodiments of the present invention will be described clearly and completely below with reference to the accompanying drawings, in which some, but not all, embodiments of the invention are shown. All other embodiments obtained by those skilled in the art based on the embodiments of the invention without making any inventive effort fall within the scope of the invention.
In the description of the present invention, it should be noted that orientations or positional relationships indicated by terms such as "center", "upper", "lower", "left", "right", "vertical", "horizontal", "inner", "outer", etc. are based on the orientations or positional relationships shown in the drawings, are merely for convenience of describing the present invention and simplifying the description, and do not indicate or imply that the apparatus or elements referred to must have a specific orientation or be constructed and operated in a specific orientation; they should therefore not be construed as limiting the present invention. Furthermore, the terms "first", "second", and the like are used for descriptive purposes only and are not to be construed as indicating or implying relative importance.
In the description of the present invention, it should also be noted that, unless explicitly specified and limited otherwise, the terms "mounted", "connected", and "coupled" are to be construed broadly; for example, a connection may be fixed, detachable, or integral; it may be mechanical or electrical; it may be direct, indirect through an intermediate medium, or a communication between two elements. The specific meaning of the above terms in the present invention will be understood by those of ordinary skill in the art on a case-by-case basis.
In addition, the technical features of the different embodiments of the present invention described below may be combined with each other as long as they do not collide with each other.
Referring to fig. 1, fig. 1 shows a flow chart of a gravity field constraint model distillation method for sequence labeling according to an exemplary embodiment of the present invention, including the following steps:
acquiring a trained teacher model and an untrained student model;
training the student model for a plurality of times, each training comprising the following sub-steps:
inputting the text sequence to be marked into a teacher model and a student model respectively, and calculating KL divergence of the first class prediction probability output by the teacher model and the second class prediction probability output by the student model to obtain distillation loss;
inputting the text sequence with the classification labels into a student model, and performing cross information entropy calculation on the third class prediction probability output by the student model and the classification labels to obtain real loss;
calculating the update amplitude of the fine-tuned parameters in the student model to obtain constraint loss;
and carrying out back propagation updating by using distillation loss, real loss and constraint loss, and carrying out fine tuning training on parameters of the student model.
Specifically, in the present exemplary embodiment, training of the student model includes:
(1) Language knowledge distillation
The first step is knowledge distillation. As shown on the left side of FIG. 1, the language knowledge in the pre-trained large model (the teacher model) is distilled into a smaller model (the student model). The two models receive the same input, the KL divergence between the prediction probability distribution Prob_T output by the teacher model (the first class prediction probability) and the prediction probability distribution Prob_S output by the student model (the second class prediction probability) is calculated, and the KL divergence value between the two distributions is used as the loss function of distillation learning, namely Distill_Loss.
In a preferred exemplary embodiment, the distillation loss is calculated as follows:

Distill_Loss = Σ_i t_i(x) · log( t_i(x) / s_i(x) )

where Distill_Loss denotes the distillation loss, x denotes the text sequence to be labeled, t_i(x) denotes the probability predicted for class i in the first class prediction probability output by the teacher model, s_i(x) denotes the probability predicted for class i in the second class prediction probability output by the student model, and Σ denotes summation over all classes.
The above computes the distillation loss for a single sequence element; the distillation loss of the labeling-classification prediction over all sequence elements (tokens) is obtained by summing the per-element distillation losses and taking the average. A minimal sketch of this computation follows.
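For concreteness, the following is a minimal PyTorch sketch of this per-token distillation loss; it is an illustrative reading of the description above, not the patent's prescribed implementation. The (batch, seq_len, num_classes) logit shapes and the optional temperature argument are assumptions of the sketch.

```python
import torch
import torch.nn.functional as F

def distill_loss(teacher_logits: torch.Tensor,
                 student_logits: torch.Tensor,
                 temperature: float = 1.0) -> torch.Tensor:
    """KL-divergence distillation loss for sequence labeling.

    Both logit tensors are assumed to have shape (batch, seq_len, num_classes).
    KL(t || s) = sum_i t_i(x) * log(t_i(x) / s_i(x)) is computed per token and
    then averaged over all sequence elements, matching the
    per-element-then-average description in the text.
    """
    t_prob = F.softmax(teacher_logits / temperature, dim=-1)          # t_i(x)
    s_log_prob = F.log_softmax(student_logits / temperature, dim=-1)  # log s_i(x)
    kl_per_token = torch.sum(
        t_prob * (torch.log(t_prob.clamp_min(1e-12)) - s_log_prob), dim=-1
    )
    return kl_per_token.mean()
```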
(2) Fine tuning under gravitational constraints
The second step is the fine-tuning training loss. As shown on the right side of fig. 1, supervised training is performed on a manually labeled dataset (text sequences with classification labels).
(2-1) Calculating, with cross information entropy, the loss CE_Loss between the model's classification prediction for the sequence annotation and the real tag.
In a preferred exemplary embodiment, the real loss is calculated as follows:

CE_Loss = − Σ_{i=1}^{n} p(i) · log q(i)

where CE_Loss denotes the real loss, n denotes the total number of prediction classes, p(i) denotes the classification label, and q(i) denotes the third class prediction probability output by the student model. CE_Loss measures the gap between the probability distribution predicted by the model and the true labels and is a loss function commonly used in classification tasks.
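A minimal PyTorch sketch of this true-label loss follows; the flattening of the batch and sequence dimensions and the ignore_index convention for padded or unlabeled positions are assumptions of the sketch rather than details given in the patent.

```python
import torch
import torch.nn.functional as F

def true_loss(student_logits: torch.Tensor,
              labels: torch.Tensor,
              ignore_index: int = -100) -> torch.Tensor:
    """Cross information entropy (CE_Loss) between the student's per-token
    predictions q(i) and the gold classification labels p(i).

    student_logits: (batch, seq_len, num_classes); labels: (batch, seq_len)
    holding class indices, with ignore_index marking positions to skip.
    """
    num_classes = student_logits.size(-1)
    return F.cross_entropy(
        student_logits.reshape(-1, num_classes),  # flatten all tokens
        labels.reshape(-1),
        ignore_index=ignore_index,
    )
```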
(2-2) calculating the update amplitude of the fine-tuned parameters in the student model to obtain constraint loss.
In a preferred exemplary embodiment, the constraint loss is calculated as follows:

Gravitation_Loss = α · ( Σ_{j=1}^{N} (Δw_j)² ) / N

where Gravitation_Loss denotes the constraint loss, Δw_j is the change amount, i.e., the update amplitude, of each fine-tuned parameter in the student model, N denotes the number of fine-tuned parameters, and α is the weight factor of the constraint term, set as a hyper-parameter.
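The following PyTorch sketch illustrates one way to realize such a constraint loss, assuming the update amplitude Δw of each fine-tuned parameter is measured against a snapshot of the weights taken before fine tuning; the squared-deviation form and the default value of alpha are assumptions of the sketch.

```python
import torch
import torch.nn as nn

def gravitation_loss(student: nn.Module,
                     pretrained_state: dict,
                     alpha: float = 0.01) -> torch.Tensor:
    """Gravitational-field constraint on the fine-tuned parameters.

    Penalizes the update amplitude delta_w of every trainable parameter,
    measured against a pre-fine-tuning snapshot, averaged over the N
    fine-tuned parameters and scaled by the constraint weight alpha.
    """
    total = None
    count = 0
    for name, param in student.named_parameters():
        if not param.requires_grad:
            continue
        delta_w = param - pretrained_state[name].to(param.device)  # update amplitude
        sq = delta_w.pow(2).sum()
        total = sq if total is None else total + sq
        count += param.numel()  # N: number of fine-tuned parameters
    return alpha * total / count

# Snapshot taken once, before any fine-tuning step:
# pretrained_state = {k: v.detach().clone() for k, v in student.state_dict().items()}
```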
The method of the present exemplary embodiment effectively alleviates the "catastrophic forgetting" problem that arises when a large language model is fine-tuned (Fine-Tune) for a downstream task application (during training to adapt to the downstream application task, the previously learned language knowledge is catastrophically forgotten, which degrades the model's performance). By adding a constraint mechanism on parameter updates during fine tuning, the model retains as much as possible the memory of the language knowledge learned in pre-training, while still adapting to the downstream task through fine-tuning training.
It should be noted that, after the gravitational constraint term on the fine-tuned weights is added to the total Loss (i.e., the total loss function comprising the weighted components of all loss functions), a parameter is updated during training only when the decrease in CE_Loss brought by its update amount Δw exceeds the increase in loss brought by the gravitational constraint term. Conversely, when the decrease in CE_Loss brought by Δw is smaller than the increase in loss brought by the gravitational constraint term, the back propagation driven by the total Loss does not update that parameter. In this way, during fine tuning the model both preserves the memory of the teacher model's knowledge and still makes the parameter adjustments needed to achieve the fine-tuning effect.
Meanwhile, the fine-tuned parameters of the student model mainly comprise the parameters in the parameter matrices constituting each layer (e.g., the embedding layer, the hidden layers, the output layer, etc.). In addition, the fine-tuned parameter values generally lie between -1 and 1, so the update amplitude penalized by the constraint loss is not large, since the magnitudes involved are small. The relative proportion of the three losses can be chosen according to actual needs, as in the sketch below.
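Putting the pieces together, a single fine-tuning step might look like the following sketch, which sums the three losses into a total loss and back-propagates it. The loss weights, the Hugging-Face-style teacher output (exposing .logits), and the helper functions from the earlier sketches are assumptions of this sketch, not the patent's mandated implementation.

```python
import torch

def training_step(teacher, student, optimizer, batch, pretrained_state,
                  w_distill: float = 1.0, w_ce: float = 1.0, alpha: float = 0.01):
    """One fine-tuning step: compute the three losses, sum them into the total
    loss, and back-propagate to update the student's parameters.

    The teacher is assumed to be a Hugging Face token-classification model;
    the student is assumed to return raw per-token logits. distill_loss,
    true_loss and gravitation_loss are the helpers sketched above.
    """
    input_ids, attention_mask, labels = batch

    with torch.no_grad():  # the teacher is frozen during distillation
        teacher_logits = teacher(input_ids, attention_mask=attention_mask).logits
    student_logits = student(input_ids, attention_mask=attention_mask)

    total_loss = (w_distill * distill_loss(teacher_logits, student_logits)
                  + w_ce * true_loss(student_logits, labels)
                  + gravitation_loss(student, pretrained_state, alpha=alpha))

    optimizer.zero_grad()
    total_loss.backward()  # back propagation driven by the total loss
    optimizer.step()
    return total_loss.item()
```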
More preferably, in an exemplary embodiment, the teacher model is a pre-trained deep neural network model, specifically a large pre-trained language model that is task independent, or a large trained model that is specifically applied to sequence labeling tasks.
Specifically, in an exemplary embodiment, the teacher model may be either a task-independent pre-trained large language model (e.g., hfl/chinese-macbert-large, bigscience/bloom-560m, etc.; model parameters above 500M, model files typically above 1G) or a large model trained specifically for sequence labeling tasks (e.g., dslim/bert-large-NER, Jean-Baptiste/roberta-large-ner-english, etc.; model parameters above 500M, model files typically above 1G).
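As an illustration of how such a teacher model could be obtained, the following sketch loads one of the sequence-labeling checkpoints mentioned above with the Hugging Face transformers library; the specific checkpoint choice is illustrative only.

```python
from transformers import AutoModelForTokenClassification, AutoTokenizer

# Illustrative: load a ready-made sequence-labeling teacher from the Hugging
# Face hub. A task-independent checkpoint such as hfl/chinese-macbert-large
# would instead need a token-classification head and task training first.
teacher_name = "dslim/bert-large-NER"
tokenizer = AutoTokenizer.from_pretrained(teacher_name)
teacher = AutoModelForTokenClassification.from_pretrained(teacher_name)
teacher.eval()  # the teacher's weights are never updated during distillation
```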
Correspondingly, the student model may be a small model consisting of shallow Transformer encoder layers, with its parameters randomly initialized before training (e.g., 5 Transformer encoder layers, about 50M parameters, a model file of about 100M).
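A possible student of roughly this size is sketched below; the hidden size, number of attention heads, vocabulary size, and label count are illustrative values assumed for the sketch, not figures taken from the patent.

```python
import torch
import torch.nn as nn

class StudentTagger(nn.Module):
    """Small randomly initialized student: an embedding layer, a shallow stack
    of Transformer encoder layers, and a per-token classification head."""
    def __init__(self, vocab_size=21128, hidden=768, heads=12,
                 num_layers=5, num_labels=9, max_len=512):
        super().__init__()
        self.tok_emb = nn.Embedding(vocab_size, hidden)
        self.pos_emb = nn.Embedding(max_len, hidden)
        layer = nn.TransformerEncoderLayer(d_model=hidden, nhead=heads,
                                           batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, num_layers=num_layers)
        self.classifier = nn.Linear(hidden, num_labels)

    def forward(self, input_ids, attention_mask=None):
        pos = torch.arange(input_ids.size(1), device=input_ids.device)
        x = self.tok_emb(input_ids) + self.pos_emb(pos)[None, :, :]
        pad_mask = (attention_mask == 0) if attention_mask is not None else None
        x = self.encoder(x, src_key_padding_mask=pad_mask)
        return self.classifier(x)  # per-token class logits
```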
On our manually annotated test set, fine tuning based on the "gravitational constraint" method improves the accuracy of sequence annotation prediction from 79.3% to 94.7% compared with ordinary unconstrained fine tuning, an absolute improvement of 15.4 percentage points. The experimental procedure was as follows:
in the construction of the knowledge graph in the ship field, a plurality of 5,500 entity identification samples are marked. We randomly split it into a training set of 3,500 sample size and a test set of 2,000 sample size. Initially we did not use gravitational constraints and sequence annotation accuracy for entity recognition on the test set was 79.3% by distillation learning on the large model + fine tuning on the training set. Then we propose a method using gravity constraint to mitigate the "knowledge forgetting phenomenon" of the model when fine tuning downstream, by analysis of the model prediction error samples. Tests on the test set show that the novel method is more effective, and the accuracy of sequence labeling of entity identification on the test set is 94.7%.
The same inventive concept as the above exemplary embodiment provides in a further exemplary embodiment of the present invention a gravity field constraint model distillation system for sequence annotation, comprising:
model acquisition module: for obtaining a trained teacher model and an untrained student model;
student model training module: for training a student model several times, each training comprising:
distillation loss calculation unit: the method comprises the steps of respectively inputting a text sequence to be marked into a teacher model and a student model, and calculating KL divergence of a first class prediction probability output by the teacher model and a second class prediction probability output by the student model to obtain distillation loss;
a real loss calculation unit: the method comprises the steps of inputting a text sequence with a classification label into a student model, and performing cross information entropy calculation on a third class prediction probability output by the student model and the classification label to obtain real loss;
constraint loss calculation unit: the method comprises the steps of calculating the updating amplitude of parameters finely tuned in a student model to obtain constraint loss;
fine tuning training unit: the method is used for carrying out back propagation updating by utilizing distillation loss, true loss and constraint loss and carrying out fine tuning training on parameters of the student model.
Correspondingly, the distillation loss is calculated as follows:

Distill_Loss = Σ_i t_i(x) · log( t_i(x) / s_i(x) )

where Distill_Loss denotes the distillation loss, x denotes the text sequence to be labeled, t_i(x) denotes the probability predicted for class i in the first class prediction probability output by the teacher model, s_i(x) denotes the probability predicted for class i in the second class prediction probability output by the student model, and Σ denotes summation over all classes;
the real loss is calculated as follows:

CE_Loss = − Σ_{i=1}^{n} p(i) · log q(i)

where CE_Loss denotes the real loss, n denotes the total number of prediction classes, p(i) denotes the classification label, and q(i) denotes the third class prediction probability output by the student model.
Correspondingly, the constraint loss is calculated as follows:

Gravitation_Loss = α · ( Σ_{j=1}^{N} (Δw_j)² ) / N

where Gravitation_Loss denotes the constraint loss, Δw_j is the change amount, i.e., the update amplitude, of each fine-tuned parameter in the student model, N denotes the number of fine-tuned parameters, and α is the weight factor of the constraint term, set as a hyper-parameter.
Correspondingly, the teacher model is a pre-trained deep neural network model, specifically a large language model pre-trained with a task-independent language-modeling objective, or a trained large model dedicated to the sequence labeling task.
In another exemplary embodiment of the present invention, there is provided an electronic device including a storage unit and a processing unit, where the storage unit stores computer instructions executable on the processing unit, and the processing unit executes the steps of the gravity field constraint model distillation method for sequence labeling when the processing unit executes the computer instructions.
The electronic device is in the form of a general purpose computing device. Components of an electronic device may include, but are not limited to: the at least one processing unit, the at least one memory unit, and a bus connecting the different system components (including the memory unit and the processing unit).
Wherein the storage unit stores program code executable by the processing unit such that the processing unit performs steps according to various exemplary embodiments of the present invention described in the above section of the exemplary method of the present specification. For example, the processing unit may perform the method as shown in fig. 1.
The memory unit may include readable media in the form of volatile memory units, such as a random access memory (RAM) unit and/or a cache memory unit, and may further include a read-only memory (ROM) unit.
The storage unit may also include a program/utility having a set (at least one) of program modules including, but not limited to: an operating system, one or more application programs, other program modules, and program data, each or some combination of which may include an implementation of a network environment.
The bus may be one or more of several types of bus structures including a memory unit bus or memory unit controller, a peripheral bus, an accelerated graphics port, a processing unit, or a local bus using any of a variety of bus architectures.
The electronic device may also communicate with one or more external devices (e.g., keyboard, pointing device, bluetooth device, etc.), with one or more devices that enable a user to interact with the electronic device, and/or with any device (e.g., router, modem, etc.) that enables the electronic device to communicate with one or more other computing devices. Such communication may be through an input/output (I/O) interface. And, the electronic device may also communicate with one or more networks such as a Local Area Network (LAN), a Wide Area Network (WAN), and/or a public network, such as the Internet, through a network adapter. The network adapter communicates with other modules of the electronic device via a bus. It should be appreciated that other hardware and/or software modules may be used in connection with an electronic device, including but not limited to: microcode, device drivers, redundant processing units, external disk drive arrays, RAID systems, tape drives, data backup storage systems, and the like.
As will be readily appreciated by those skilled in the art from the foregoing description, the example embodiments described herein may be implemented in software, or may be implemented in software in combination with the necessary hardware. Accordingly, the technical solution according to the present exemplary embodiment may be embodied in the form of a software product, which may be stored in a non-volatile storage medium (may be a CD-ROM, a U-disk, a mobile hard disk, etc.) or on a network, and includes several instructions to cause a computing device (may be a personal computer, a server, a terminal device, or a network device, etc.) to perform the method according to the present exemplary embodiment.
In yet another exemplary embodiment of the present invention, there is provided a storage medium having stored thereon computer instructions which, when executed, perform the steps of the method for gravity field constraint model distillation for sequence annotation.
Based on this understanding, the technical solution of the present embodiment, in essence, or the part thereof that contributes to the prior art, or a part of the technical solution, may be embodied in the form of a software product (program product) stored in a storage medium, comprising several instructions for causing a computer device (which may be a personal computer, a server, a network device, etc.) to execute all or part of the steps of the method described in the embodiments of the present invention.
The program product may employ any combination of one or more readable media. The readable medium may be a readable signal medium or a readable storage medium. The readable storage medium can be, for example, but is not limited to, an electronic, magnetic, optical, electromagnetic, infrared, or semiconductor system, apparatus, or device, or a combination of any of the foregoing. More specific examples (a non-exhaustive list) of the readable storage medium would include the following: an electrical connection having one or more wires, a portable disk, a hard disk, random Access Memory (RAM), read-only memory (ROM), erasable programmable read-only memory (EPROM or flash memory), optical fiber, portable compact disk read-only memory (CD-ROM), an optical storage device, a magnetic storage device, or any suitable combination of the foregoing.
The computer readable signal medium may include a data signal propagated in baseband or as part of a carrier wave with readable program code embodied therein. Such a propagated data signal may take any of a variety of forms, including, but not limited to, electro-magnetic, optical, or any suitable combination of the foregoing. A readable signal medium may also be any readable medium that is not a readable storage medium and that can communicate, propagate, or transport a program for use by or in connection with an instruction execution system, apparatus, or device.
Program code embodied on a readable medium may be transmitted using any appropriate medium, including but not limited to wireless, wireline, optical fiber cable, RF, etc., or any suitable combination of the foregoing.
Program code for carrying out operations of the present invention may be written in any combination of one or more programming languages, including an object oriented programming language such as Java, C++ or the like and conventional procedural programming languages, such as the "C" programming language or similar programming languages. The program code may execute entirely on the user's computing device, partly on the user's device, as a stand-alone software package, partly on the user's computing device, partly on a remote computing device, or entirely on the remote computing device or server. In the case of remote computing devices, the remote computing device may be connected to the user computing device through any kind of network, including a Local Area Network (LAN) or a Wide Area Network (WAN), or may be connected to an external computing device (e.g., connected via the Internet using an Internet service provider).
It is apparent that the above examples are given by way of illustration only and not by way of limitation, and other variations or modifications in different forms may be made by those of ordinary skill in the art based on the above description. It is neither necessary nor possible to exhaustively list all embodiments here; obvious variations or modifications derived therefrom are still within the scope of the present invention.

Claims (10)

1. A gravitational field constraint model distillation method for sequence labeling is characterized in that: the method comprises the following steps:
acquiring a trained teacher model and an untrained student model;
training the student model for a plurality of times, each training comprising the following sub-steps:
inputting the text sequence to be marked into a teacher model and a student model respectively, and calculating KL divergence of the first class prediction probability output by the teacher model and the second class prediction probability output by the student model to obtain distillation loss;
inputting the text sequence with the classification labels into a student model, and performing cross information entropy calculation on the third class prediction probability output by the student model and the classification labels to obtain real loss;
calculating the update amplitude of the fine-tuned parameters in the student model to obtain constraint loss;
and carrying out back propagation updating by using distillation loss, real loss and constraint loss, and carrying out fine tuning training on parameters of the student model.
2. A gravity field constraint model distillation method for sequence annotation according to claim 1, wherein: the distillation loss is calculated as follows:

Distill_Loss = Σ_i t_i(x) · log( t_i(x) / s_i(x) )

where Distill_Loss denotes the distillation loss, x denotes the text sequence to be labeled, t_i(x) denotes the probability predicted for class i in the first class prediction probability output by the teacher model, s_i(x) denotes the probability predicted for class i in the second class prediction probability output by the student model, and Σ denotes summation over all classes;

the real loss is calculated as follows:

CE_Loss = − Σ_{i=1}^{n} p(i) · log q(i)

where CE_Loss denotes the real loss, n denotes the total number of prediction classes, p(i) denotes the classification label, and q(i) denotes the third class prediction probability output by the student model.
3. A gravity field constraint model distillation method for sequence annotation according to claim 1, wherein: the constraint loss is calculated as follows:

Gravitation_Loss = α · ( Σ_{j=1}^{N} (Δw_j)² ) / N

where Gravitation_Loss denotes the constraint loss, Δw_j is the change amount, i.e., the update amplitude, of each fine-tuned parameter in the student model, N denotes the number of fine-tuned parameters, and α is the weight factor of the constraint term, set as a hyper-parameter.
4. A gravity field constraint model distillation method for sequence annotation according to claim 1, wherein: the teacher model is a pre-trained deep neural network model, specifically a large language model pre-trained with a task-independent language-modeling objective, or a trained large model dedicated to the sequence labeling task.
5. A gravitational field constraint model distillation system for sequence annotation, characterized in that: comprising the following steps:
model acquisition module: for obtaining a trained teacher model and an untrained student model;
student model training module: for training a student model several times, each training comprising:
distillation loss calculation unit: the method comprises the steps of respectively inputting a text sequence to be marked into a teacher model and a student model, and calculating KL divergence of a first class prediction probability output by the teacher model and a second class prediction probability output by the student model to obtain distillation loss;
a real loss calculation unit: the method comprises the steps of inputting a text sequence with a classification label into a student model, and performing cross information entropy calculation on a third class prediction probability output by the student model and the classification label to obtain real loss;
constraint loss calculation unit: the method comprises the steps of calculating the updating amplitude of parameters finely tuned in a student model to obtain constraint loss;
fine tuning training unit: the method is used for carrying out back propagation updating by utilizing distillation loss, true loss and constraint loss and carrying out fine tuning training on parameters of the student model.
6. A gravity field constrained model distillation system for sequence annotation according to claim 5, wherein: the distillation loss is calculated as follows:

Distill_Loss = Σ_i t_i(x) · log( t_i(x) / s_i(x) )

where Distill_Loss denotes the distillation loss, x denotes the text sequence to be labeled, t_i(x) denotes the probability predicted for class i in the first class prediction probability output by the teacher model, s_i(x) denotes the probability predicted for class i in the second class prediction probability output by the student model, and Σ denotes summation over all classes;

the real loss is calculated as follows:

CE_Loss = − Σ_{i=1}^{n} p(i) · log q(i)

where CE_Loss denotes the real loss, n denotes the total number of prediction classes, p(i) denotes the classification label, and q(i) denotes the third class prediction probability output by the student model.
7. A gravity field constrained model distillation system for sequence annotation according to claim 5, wherein: the constraint loss is calculated as follows:

Gravitation_Loss = α · ( Σ_{j=1}^{N} (Δw_j)² ) / N

where Gravitation_Loss denotes the constraint loss, Δw_j is the change amount, i.e., the update amplitude, of each fine-tuned parameter in the student model, N denotes the number of fine-tuned parameters, and α is the weight factor of the constraint term, set as a hyper-parameter.
8. A gravity field constrained model distillation system for sequence annotation according to claim 5, wherein: the teacher model is a pre-trained deep neural network model, specifically a large language model pre-trained with a task-independent language-modeling objective, or a trained large model dedicated to the sequence labeling task.
9. An electronic device comprising a memory unit and a processing unit, the memory unit having stored thereon computer instructions executable on the processing unit, characterized by: the processing unit, when executing the computer instructions, performs the steps of a gravity field constraint model distillation method for sequence annotation according to any one of claims 1 to 4.
10. A storage medium having stored thereon computer instructions, characterized by: the computer instructions, when executed, perform the steps of a gravity field constrained model distillation method for sequence annotation according to any one of claims 1 to 4.
CN202310681569.8A 2023-06-09 2023-06-09 Gravitational field constraint model distillation method, system, electronic equipment and storage medium for sequence annotation Pending CN116737927A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202310681569.8A CN116737927A (en) 2023-06-09 2023-06-09 Gravitational field constraint model distillation method, system, electronic equipment and storage medium for sequence annotation

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202310681569.8A CN116737927A (en) 2023-06-09 2023-06-09 Gravitational field constraint model distillation method, system, electronic equipment and storage medium for sequence annotation

Publications (1)

Publication Number Publication Date
CN116737927A true CN116737927A (en) 2023-09-12

Family

ID=87914452

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202310681569.8A Pending CN116737927A (en) 2023-06-09 2023-06-09 Gravitational field constraint model distillation method, system, electronic equipment and storage medium for sequence annotation

Country Status (1)

Country Link
CN (1) CN116737927A (en)

Cited By (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN117057413A (en) * 2023-09-27 2023-11-14 珠高智能科技(深圳)有限公司 Reinforcement learning model fine tuning method, apparatus, computer device and storage medium
CN117057413B (en) * 2023-09-27 2024-03-15 传申弘安智能(深圳)有限公司 Reinforcement learning model fine tuning method, apparatus, computer device and storage medium

Similar Documents

Publication Publication Date Title
US10824815B2 (en) Document classification using attention networks
CN110717039A (en) Text classification method and device, electronic equipment and computer-readable storage medium
US20180053107A1 (en) Aspect-based sentiment analysis
US20200364409A1 (en) Implicit discourse relation classification with contextualized word representation
CN111522958A (en) Text classification method and device
CN107679234A (en) Customer service information providing method, device, electronic equipment, storage medium
CN111309915A (en) Method, system, device and storage medium for training natural language of joint learning
CN110245232B (en) Text classification method, device, medium and computing equipment
CN113420822B (en) Model training method and device and text prediction method and device
CN112188312B (en) Method and device for determining video material of news
CN113128227A (en) Entity extraction method and device
CN114298050A (en) Model training method, entity relation extraction method, device, medium and equipment
CN112560504B (en) Method, electronic equipment and computer readable medium for extracting information in form document
CN116737927A (en) Gravitational field constraint model distillation method, system, electronic equipment and storage medium for sequence annotation
CN113434683A (en) Text classification method, device, medium and electronic equipment
CN113826113A (en) Counting rare training data for artificial intelligence
CN115062617A (en) Task processing method, device, equipment and medium based on prompt learning
CN115269827A (en) Intent determination in an improved messaging conversation management system
CN113591998A (en) Method, device, equipment and storage medium for training and using classification model
CN111445271A (en) Model generation method, and prediction method, system, device and medium for cheating hotel
US20230139642A1 (en) Method and apparatus for extracting skill label
CN116048463A (en) Intelligent recommendation method and device for content of demand item based on label management
CN113407719B (en) Text data detection method and device, electronic equipment and storage medium
CN114970540A (en) Method and device for training text audit model
CN114580399A (en) Text error correction method and device, electronic equipment and storage medium

Legal Events

Date Code Title Description
PB01 Publication
PB01 Publication
SE01 Entry into force of request for substantive examination
SE01 Entry into force of request for substantive examination