WO2022126797A1 - 基于多层级知识蒸馏预训练语言模型自动压缩方法及平台 - Google Patents

基于多层级知识蒸馏预训练语言模型自动压缩方法及平台

Info

Publication number
WO2022126797A1
WO2022126797A1 PCT/CN2020/142577 CN2020142577W WO2022126797A1 WO 2022126797 A1 WO2022126797 A1 WO 2022126797A1 CN 2020142577 W CN2020142577 W CN 2020142577W WO 2022126797 A1 WO2022126797 A1 WO 2022126797A1
Authority
WO
WIPO (PCT)
Prior art keywords
distillation
network
model
vector
knowledge
Prior art date
Application number
PCT/CN2020/142577
Other languages
English (en)
French (fr)
Inventor
王宏升
王恩平
俞再亮
Original Assignee
之江实验室
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by 之江实验室 filed Critical 之江实验室
Priority to GB2214215.2A priority Critical patent/GB2610319A/en
Priority to JP2022566730A priority patent/JP7283835B2/ja
Priority to US17/555,535 priority patent/US11501171B2/en
Publication of WO2022126797A1 publication Critical patent/WO2022126797A1/zh

Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/30Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
    • G06F16/35Clustering; Classification
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/02Neural networks
    • G06N3/04Architecture, e.g. interconnection topology
    • G06N3/045Combinations of networks
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/02Neural networks
    • G06N3/08Learning methods
    • G06N3/082Learning methods modifying the architecture, e.g. adding, deleting or silencing nodes or connections
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/02Neural networks
    • G06N3/08Learning methods
    • G06N3/084Backpropagation, e.g. using gradient descent

Definitions

  • the invention belongs to the field of language model compression, and in particular relates to a pre-trained language model automatic compression method and platform based on multi-level knowledge distillation.
  • the present invention generates a general compression architecture for multi-task oriented pre-trained language models based on multi-level knowledge distillation.
  • the purpose of the present invention is to provide a pre-trained language model automatic compression method and platform based on multi-level knowledge distillation in view of the deficiencies of the prior art.
  • the present invention first constructs a multi-level knowledge distillation, and distills the knowledge structure of the large model at different levels. Furthermore, meta-learning is introduced to generate a general compression architecture for multiple pre-trained language models. Specifically, a meta-network of a structure generator is designed, a knowledge distillation encoding vector is constructed based on a multi-level knowledge distillation method, and the structure generator is used to generate a distillation structure model corresponding to the current input encoding vector. At the same time, a method of Bernoulli distribution sampling is proposed to train the structure generator.
  • the self-attention units migrated by each encoder are sampled using Bernoulli distribution to form the corresponding encoding vector.
  • By changing the encoding vector fed into the structure generator and the mini-batch training data, and jointly training the structure generator and the corresponding distillation structure, a structure generator that can generate weights for different distillation structures is learned.
  • an evolutionary algorithm is used to search for the optimal compression structure, thereby obtaining the optimal general compression structure of the task-independent pre-trained language model.
  • An automatic compression method for a pre-trained language model based on multi-level knowledge distillation comprising the following steps:
  • Step 1 Build multi-level knowledge distillation, and distill the knowledge structure of the large model at three different levels: self-attention unit, hidden layer state, and embedding layer;
  • Step 2 Train a knowledge distillation network for meta-learning to generate a general compression architecture for multiple pre-trained language models
  • Step 3 Search for an optimal compression structure based on an evolutionary algorithm.
  • a meta-network of a structure generator is designed in step 2, a knowledge distillation coding vector is constructed based on the multi-level knowledge distillation in step 1, and a distillation structure model corresponding to the currently input coding vector is generated by using the structure generator;
  • the Bernoulli distribution sampling method trains the structure generator.
  • the self-attention units migrated by each encoder are sampled by the Bernoulli distribution to form the corresponding encoding vector; by changing the encoding vector fed into the structure generator and the mini-batches of training data, the structure generator and the corresponding distillation structure are jointly trained, to obtain a structure generator that generates weights for different distillation structures.
  • in step 3, on the basis of the trained meta-learning network, an optimal compression architecture is searched for through an evolutionary algorithm, to obtain the optimal general compression architecture of a task-independent pre-trained language model.
  • in step 1, self-attention distribution knowledge, hidden state knowledge and embedding layer knowledge are encoded into a distillation network, and knowledge distillation is used to compress the large model into the small model.
  • the first step includes self-attention knowledge distillation, hidden layer state knowledge distillation and embedding layer knowledge distillation.
  • the meta-network of the structure generator described in step 2 is composed of two fully connected layers, a self-attention knowledge distillation encoding vector is input, and the weight matrix of the structure generator is output;
  • the training process of the structure generator is as follows:
  • Step 1 Construct knowledge distillation encoding vector, including layer sampling vector, multi-head pruning vector, hidden layer dimension reduction vector and embedding layer dimension reduction vector;
  • Step 2 Build a distillation network architecture based on the structure generator: use the structure generator to build a distillation structure model corresponding to the currently input encoding vector, and adjust the shape of the weight matrix output by the structure generator so that it matches the number of self-attention units at the input and output of the distillation structure corresponding to the self-attention encoding vector;
  • Step 3 Joint training of the structure generator and distillation structure model:
  • the structure generator is trained by the method of Bernoulli distribution sampling; by changing the self-attention encoding vector fed into the structure generator and a mini-batch of training data, the structure generator and the corresponding distillation structure are jointly trained, so that a structure generator able to generate weights for different distillation structures is learned.
  • in step 3, the network encoding vector is input into the trained structure generator to generate the weights of the corresponding distillation network, and the distillation network is evaluated on the validation set to obtain its accuracy; the details are as follows:
  • Step 1 Define the knowledge distillation encoding vector as the gene G of the distillation structure model, and randomly select a series of genes that satisfy the constraint C as the initial population;
  • Step 2 Evaluate the inference accuracy of the distillation structure model corresponding to each gene G i in the existing population on the validation set, and select the top k genes with the highest accuracy;
  • Step 3 Use the top k genes with the highest accuracy selected in Step 2 to perform gene recombination and gene mutation to generate new genes, and add the new genes to the existing population;
  • Step 4 Repeat steps 2 and 3 for N rounds of iterations, select the top k genes with the highest accuracy in the existing population and generate new genes, until the gene that satisfies the constraint C and has the highest accuracy is obtained.
  • a pre-trained language model automatic compression platform based on multi-level knowledge distillation including the following components:
  • Data loading component: used to obtain the BERT model containing a specific natural language processing downstream task and the training samples of the multi-task-oriented pre-trained language model to be compressed, uploaded by the logged-in user; the training samples are labeled text samples that satisfy the supervised learning task;
  • Automatic compression component: used to automatically compress multi-task-oriented pre-trained language models, including a knowledge distillation vector encoding module, a distillation network generation module, a structure generator and distillation network joint training module, a distillation network search module, and a task-specific fine-tuning module;
  • the knowledge distillation vector encoding module includes the layer sampling vector of the Transformer, the multi-head pruning vector of the self-attention unit, the hidden layer dimensionality reduction vector, and the embedding layer dimensionality reduction vector; during forward propagation, the distillation network encoding vector is input into the structure generator to generate the distillation network of the corresponding structure and the weight matrix of the structure generator;
  • the distillation network generation module constructs, based on the structure generator, a distillation network corresponding to the currently input encoding vector, and adjusts the shape of the weight matrix output by the structure generator so that the number of self-attention units at the input and output of the distillation structure matches the self-attention encoding vector;
  • the structure generator and distillation network joint training module trains the structure generator end-to-end: the multi-level knowledge distillation encoding vector and a mini-batch of training data are input into the distillation network, and the weights of the distillation structure and the weight matrix of the structure generator are updated;
  • the distillation network search module uses an evolutionary algorithm to search for a distillation network with the highest accuracy that satisfies a specific constraint condition
  • the task-specific fine-tuning module builds a downstream task network on the pre-trained model distillation network generated by the automatic compression component, fine-tunes it for the downstream task scenario using the feature layer and output layer of the distillation network, and outputs the final fine-tuned student model, i.e., the compressed pre-trained language model containing the downstream task required by the logged-in user; the compressed model is output to a designated container for the logged-in user to download, and comparison information on the model size before and after compression is displayed on the platform's compressed-model output page;
  • Inference component: the logged-in user obtains the pre-trained compressed model from the platform and uses the compressed model output by the automatic compression component to perform inference on new data of the natural language processing downstream task uploaded by the logged-in user on a dataset from the actual scenario; comparison information on the inference speed before and after compression is presented on the platform's compressed-model inference page.
  • the logged-in user can directly download the trained pre-trained language model and, according to the user's demand for a specific natural language processing downstream task, build a downstream task network on the basis of the compressed pre-trained model architecture generated by the platform, fine-tune it, and finally deploy it on terminal devices, or perform inference on natural language processing downstream tasks directly on the platform.
  • the beneficial effects of the present invention are as follows: in the multi-level knowledge distillation-based automatic compression method and platform for pre-trained language models of the present invention, first, knowledge distillation based on meta-learning is studied to generate a general compression architecture for multiple pre-trained language models; second, on the basis of the trained meta-learning network, an evolutionary algorithm is used to search for the optimal compression structure, thereby obtaining the optimal general compression architecture of the task-independent pre-trained language model.
  • the multi-level knowledge distillation-based automatic compression platform for pre-trained language models of the present invention compresses and generates a general architecture for multi-task-oriented pre-trained language models, makes full use of the already compressed model architecture to improve the compression efficiency of downstream tasks, and enables large-scale natural language processing models to be deployed on end-side devices with small memory and limited resources, promoting the deployment of general-purpose deep language models in industry.
  • Figure 1 is the overall architecture diagram of the pre-trained language model automatic compression platform based on multi-level knowledge distillation;
  • Figure 2 is a schematic diagram of multi-level knowledge distillation;
  • Figure 3 is the training flow chart of the structure generator;
  • Figure 4 is a schematic diagram of the dimensionality reduction structure of the hidden layer of the encoder module and the input embedding layer;
  • Figure 5 is an architecture diagram of building a distillation network based on the structure generator;
  • Figure 6 is a flow chart of the joint training process of the structure generator and the distillation network;
  • Figure 7 is a schematic diagram of the distillation network search architecture based on an evolutionary algorithm.
  • As shown in Figure 1, the present invention includes meta-learning-based knowledge distillation and evolutionary-algorithm-based automatic search of the distillation network: large-scale multi-task-oriented pre-trained language models are automatically compressed to generate task-independent general architectures that satisfy different hard constraints (such as the number of floating-point operations).
  • the specific scheme is as follows: the whole process of the multi-level knowledge distillation-based automatic compression method for pre-trained language models of the present invention is divided into three stages. The first stage builds multi-level knowledge distillation, in which the knowledge structure of the large model is distilled at three different levels: the self-attention unit, the hidden layer state, and the embedding layer. The second stage trains the meta-learning knowledge distillation network to generate a general compression architecture for multiple pre-trained language models.
  • a meta-network of structure generator is designed, a knowledge distillation encoding vector is constructed based on the multi-level knowledge distillation method proposed in the first stage, and a distillation structure model corresponding to the currently input encoding vector is generated by using the structure generator.
  • a method of Bernoulli distribution sampling is proposed to train the structure generator.
  • the self-attention units migrated by each encoder are sampled using Bernoulli distribution to form the corresponding encoding vector.
  • a structure generator that can generate weights for different distillation structures can be learned; the third stage searches for the optimal compression structure based on an evolutionary algorithm: on the basis of the trained meta-learning network, the evolutionary algorithm searches for the optimal compression structure, thereby obtaining the optimal general compression architecture of the task-independent pre-trained language model.
  • the present invention encodes self-attention distribution knowledge, hidden state knowledge and embedding layer knowledge into a distillation network, as shown in Figure 2.
  • Knowledge distillation is used to compress the large model to the small model, and the self-attention knowledge of the large model is transferred to the small model to the greatest extent.
  • a. Self-attention knowledge distillation. Transformer layer distillation includes self-attention-based knowledge distillation and hidden-layer-state-based knowledge distillation, as shown in Figure 2.
  • Self-attention based distillation is able to focus on rich linguistic knowledge. This vast amount of linguistic knowledge includes the semantics and associated information necessary for natural language understanding. Therefore, self-attention based knowledge distillation is proposed to encourage the transfer of knowledge from the rich teacher model to the student model.
  • the attention function is calculated from three matrices of queries, keys and values, which are denoted as matrix Q, matrix K and matrix V, respectively.
  • the self-attention function is defined as follows:
  • A = QK^T / √d_k,  Attention(Q, K, V) = softmax(A)V,   (1)
  • where d_k is the dimension of matrix K and acts as a scaling factor; A denotes the self-attention matrix, computed by the dot product of matrix Q and matrix K; the output of the self-attention function Attention(Q, K, V) is a weighted sum of matrix V, with the weights computed by the softmax operation.
  • The self-attention matrix A attends to a large amount of linguistic knowledge, so self-attention distillation plays an important role in knowledge distillation. Multi-head attention is obtained by concatenating the attention heads from different feature subspaces as follows:
  • MultiHead(Q, K, V) = Concat(head_1, ..., head_h)W,   (2)
  • where h is the number of attention heads, head_i denotes the i-th attention head, computed by the Attention() function in a different feature subspace, Concat denotes concatenation, and W is a linear transformation matrix.
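  • As an illustration of formulas (1) and (2) above, the following is a minimal sketch in PyTorch (not part of the original disclosure); the function names and the equal per-head split of the feature dimension are assumptions made for clarity.

```python
# Minimal sketch of scaled dot-product self-attention and multi-head concatenation.
import math
import torch

def self_attention(Q, K, V):
    # A = Q K^T / sqrt(d_k); Attention(Q, K, V) = softmax(A) V
    d_k = K.size(-1)
    A = Q @ K.transpose(-2, -1) / math.sqrt(d_k)
    return torch.softmax(A, dim=-1) @ V, A

def multi_head(Q, K, V, W, h):
    # Split the feature dimension into h subspaces, run attention per head,
    # concatenate the heads, and apply the output projection W (shape d x d).
    L, d = Q.shape
    heads = []
    for i in range(h):
        s = slice(i * d // h, (i + 1) * d // h)
        out, _ = self_attention(Q[:, s], K[:, s], V[:, s])
        heads.append(out)
    return torch.cat(heads, dim=-1) @ W
```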
  • the student model learns to mimic the multi-head attention knowledge of the teacher network, where the loss function is defined as follows:
  • L_attn = (1/h) ∑_{i=1}^{h} MSE(A_i^S, A_i^T),   (3)
  • where h is the number of self-attention heads, A_i ∈ R^{l×l} denotes the self-attention matrix corresponding to the i-th attention head of the teacher or student model, R denotes the real numbers, l is the size of the current layer's input, L is the length of the input text, and MSE() is the mean squared error loss function. It is worth noting that the attention matrix A_i is used here rather than its softmax output softmax(A_i).
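  • A hedged sketch of the loss in formula (3): the mean squared error between the unnormalized attention matrices of student and teacher, averaged over the h heads. The tensor layout (h, l, l) is an assumption made for illustration.

```python
# Sketch of the self-attention distillation loss L_attn.
import torch
import torch.nn.functional as F

def attn_distill_loss(A_student, A_teacher):
    # A_student, A_teacher: tensors of shape (h, l, l) holding the
    # unnormalized attention matrices A_i (softmax is NOT applied here,
    # as noted in the text above).
    h = A_student.size(0)
    return sum(F.mse_loss(A_student[i], A_teacher[i]) for i in range(h)) / h
```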
  • b. Hidden layer state knowledge distillation. In addition to self-attention-based knowledge distillation, this application also performs knowledge distillation based on the hidden layer state, i.e., the knowledge output by the Transformer layer is transferred.
  • the loss function of the hidden layer state knowledge distillation is as follows:
  • L_hidn = MSE(H^S W_h, H^T),   (4)
  • where the matrices H^S ∈ R^{L×d'} and H^T ∈ R^{L×d} refer to the hidden states of the student and teacher networks, respectively, the scalar values d and d' represent the sizes of the hidden layers of the teacher and student models, respectively, and the matrix W_h ∈ R^{d'×d} is a learnable linear transformation matrix that transforms the hidden layer state of the student network into the same feature space as the hidden layer state of the teacher network.
  • c. Embedding layer knowledge distillation. This application also adopts embedding-layer-based knowledge distillation, which is defined similarly to hidden-state-based knowledge distillation:
  • L_embd = MSE(E^S W_e, E^T),   (5)
  • where the matrices E^S and E^T refer to the embedding layers of the student and teacher networks, respectively; the shape of the embedding layer matrix is the same as that of the hidden layer state matrix, and the matrix W_e is a learnable linear transformation matrix.
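  • A minimal sketch of the losses in formulas (4) and (5), in which learnable linear maps play the roles of W_h and W_e; the dimensions (128 for the student, 768 for the teacher) and sequence length are example values, not values fixed by the disclosure.

```python
# Sketch of the hidden-state loss L_hidn and embedding-layer loss L_embd.
import torch
import torch.nn as nn
import torch.nn.functional as F

d_student, d_teacher, L_seq = 128, 768, 32          # example sizes
W_h = nn.Linear(d_student, d_teacher, bias=False)   # plays the role of W_h
W_e = nn.Linear(d_student, d_teacher, bias=False)   # plays the role of W_e

H_S, H_T = torch.randn(L_seq, d_student), torch.randn(L_seq, d_teacher)
E_S, E_T = torch.randn(L_seq, d_student), torch.randn(L_seq, d_teacher)

L_hidn = F.mse_loss(W_h(H_S), H_T)   # L_hidn = MSE(H^S W_h, H^T)
L_embd = F.mse_loss(W_e(E_S), E_T)   # L_embd = MSE(E^S W_e, E^T)
```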
  • A structure generator is designed, which is a meta-network consisting of two fully connected layers.
  • Figure 3 shows the training process of the structure generator.
  • Input a self-attention knowledge distillation encoding vector and output the weight matrix of the structure generator.
  • the training process of the structure generator is as follows:
  • Step 1 Construct knowledge distillation encoding vector.
  • the distillation network encoding vector is input into the structure generator, and the weight matrix of the structure generator is output.
  • the distillation network encoding vector consists of the Transformer layer sampling vector, the multi-head attention pruning vector, the hidden layer dimension reduction vector and the embedding layer dimension reduction vector.
  • the specific scheme of the distillation network encoding vector is as follows:
  • a. Layer sampling vector. In the layer sampling stage, Bernoulli distribution sampling is first applied to the Transformer layers of BERT to generate a layer sampling vector. Specifically, assuming the i-th Transformer module is currently considered for migration, X_i is an independent Bernoulli random variable, and the probability of X_i being 1 (or 0) is p (or 1-p):
  • X_i ∼ Bernoulli(p)   (6)
  • the random variable X is used to sequentially perform Bernoulli sampling on the 12 Transformer units of BERT to generate a vector consisting of 12 elements of 0 or 1.
  • When the probability of the random variable X_i being 1 is greater than or equal to 0.5, the corresponding element of the layer sampling vector is 1, meaning that the current Transformer module performs transfer learning; when the probability of X_i being 1 is less than 0.5, the corresponding element of the layer sampling vector is 0, meaning that the current Transformer module does not perform transfer learning.
  • BERT_base has a total of 12 Transformer modules. To prevent too few Transformer modules from being migrated (i.e., too few elements equal to 1 in the layer sampling vector), a layer sampling constraint is added: each time a distillation network structure is generated, layer sampling is performed on all Transformer layers of BERT, and the number of elements equal to 1 in the resulting layer sampling vector must not be less than 6, otherwise layer sampling is performed again:
  • layer_sample = [0 1 1 .. i .. 1 0]  s.t.  sum(layer_sample[i] == 1) ≥ 6   (7)
  • To speed up the whole training process, the Transformer modules migrated by the student model are initialized with the weights of those teacher-model Transformers whose corresponding elements in the layer sampling vector are 1.
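  • A sketch of the layer sampling described in formulas (6) and (7): each of the 12 Transformer modules of BERT_base is kept or dropped by a Bernoulli draw, and the vector is resampled until at least 6 elements equal 1. The sampling probability p = 0.5 is an assumed value.

```python
# Sketch of Bernoulli layer sampling with the >= 6 constraint.
import torch

def sample_layer_vector(num_layers=12, p=0.5, min_kept=6):
    while True:
        layer_sample = torch.bernoulli(torch.full((num_layers,), p)).int()
        if int(layer_sample.sum()) >= min_kept:   # s.t. sum(layer_sample == 1) >= 6
            return layer_sample.tolist()

print(sample_layer_vector())   # e.g. [0, 1, 1, 0, 1, 1, 1, 0, 1, 1, 0, 1]
```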
  • b. Multi-head pruning vector. Each Transformer module consists of multi-head attention units. Inspired by channel pruning, this application proposes multi-head pruning based on multi-head attention units. Each time a distillation network structure is generated, a multi-head pruning encoding vector is generated, representing the number of attention heads performing self-attention knowledge transfer in all Transformer layers currently being migrated. The formula is defined as follows:
  • M_i = SRS_Sample([0, 1, 2, ..., 30]),  Head_i = head_max × head_scale   (8)
  • Head i represents the number of attention heads included in each layer of Transformer when generating the i-th distillation network structure.
  • the head scale represents a decay factor for the number of self-attention heads included in each Transformer module.
  • SRS_Sample denotes simple random sampling; since each Transformer module of BERT_base has 12 self-attention units, head_max is 12.
  • In generating the i-th distillation network structure, the random number M_i is first obtained by simple random sampling of the list [0, 1, 2, ..., 30], yielding the decay factor head_scale of the current distillation structure; after multiplying head_scale by the standard head_max, the number of multi-head attention units of the current distillation network structure is obtained.
  • c. Hidden layer dimensionality reduction vector. The knowledge distillation of the hidden layer state performs knowledge distillation on the final output of each Transformer layer, i.e., it reduces the dimension of the BERT hidden layer.
  • the specific process is as follows.
  • the dimensions of the hidden layers of all Transformer layers of the distillation network generated each time are the same.
  • the dimension hidn_i of the hidden layer of the i-th generated distillation network is defined as follows:
  • hidn_scale = SRS_Sample([1, 2, 3, 4, 5, 6]),  hidn_i = hidn_base × hidn_scale   (9)
  • where hidn_base is a hyperparameter; since the hidden layer dimension of BERT_base is 768, hidn_base is initialized to a common divisor of 768, here 128. hidn_scale denotes the dimensionality reduction factor of the hidden layer dimension and is an element obtained by simple random sampling from the list [1, 2, 3, 4, 5, 6].
  • d. Embedding layer dimensionality reduction vector. Figure 4 shows the dimensionality reduction structure of the hidden layer of the encoder module and the input embedding layer. As can be seen from the figure, since both the hidden layer part and the embedding layer part have a residual connection, the output dimension of the embedding layer is equal to the output dimension of the hidden layer. Therefore, when generating a distillation network, the dimension of the embedding layer is simply kept consistent with the dimension of the current hidden layer.
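  • The following sketch assembles the four parts of the knowledge distillation encoding vector described above (layer sampling, multi-head pruning, hidden layer dimension reduction, embedding layer dimension reduction). The mapping from the random number M_i to the decay factor head_scale is not spelled out in the text, so the normalization M_i/30 and the rounding used here are assumptions for illustration only.

```python
# Sketch of constructing one knowledge-distillation encoding vector.
import random

def sample_encoding_vector(layer_sample):
    head_max, hidn_base = 12, 128
    M_i = random.choice(range(31))                 # SRS_Sample([0, 1, ..., 30])
    head_scale = M_i / 30.0                        # assumed normalization of M_i
    heads = max(1, round(head_max * head_scale))   # Head_i = head_max * head_scale
    hidn_scale = random.choice([1, 2, 3, 4, 5, 6]) # SRS_Sample([1, ..., 6])
    hidn_dim = hidn_base * hidn_scale              # hidn_i = hidn_base * hidn_scale
    embd_dim = hidn_dim                            # embedding dim follows hidden dim
    return {"layer_sample": layer_sample, "heads": heads,
            "hidn_dim": hidn_dim, "embd_dim": embd_dim}
```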
  • Step 2 Build the distillation network architecture based on the structure generator.
  • the structure generator is used to build a distillation structure model corresponding to the currently input encoding vector, and the shape of the weight matrix output by the structure generator needs to be adjusted so that it matches the number of self-attention units at the input and output of the distillation structure corresponding to the self-attention encoding vector.
  • Figure 5 shows the network architecture of the distillation structure constructed by the structure generator.
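  • A hedged sketch of the structure generator as a two-layer fully connected meta-network: it maps an encoding vector to a flat weight vector, which is then reshaped to the sizes required by the current distillation structure. The hidden width, the maximum output size, and the truncation rule used to adjust the shape are illustrative assumptions, not details given in the disclosure.

```python
# Sketch of a two-fully-connected-layer structure generator (meta-network).
import torch
import torch.nn as nn

class StructureGenerator(nn.Module):
    def __init__(self, enc_dim, max_out=768 * 768, hidden=256):
        super().__init__()
        self.net = nn.Sequential(nn.Linear(enc_dim, hidden), nn.ReLU(),
                                 nn.Linear(hidden, max_out))

    def forward(self, enc_vec, out_dim, in_dim):
        flat = self.net(enc_vec)                   # flattened weight matrix
        # Adjust the shape so it matches the self-attention unit counts /
        # dimensions of the current distillation structure (out_dim*in_dim <= max_out).
        return flat[: out_dim * in_dim].view(out_dim, in_dim)
```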
  • Step 3 Jointly train the structure generator and the distillation structure model. The self-attention knowledge distillation encoding vector and a mini-batch of training data are fed into the distillation structure model. It is worth noting that during backpropagation, the weights of the distillation structure and the weight matrix of the structure generator are updated together; the weights of the structure generator can be updated using gradients computed by the chain rule, so the structure generator can be trained end-to-end.
  • a method of Bernoulli distribution sampling is proposed to train the structure generator.
  • the self-attention units migrated by the Transformer of each layer are sampled by Bernoulli distribution to form the corresponding encoding vector.
  • a structure generator that can generate weights for different distillation structures can be learned.
  • Figure 6 shows the training process of the structure generator.
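  • A sketch of one iteration of the joint training described in Step 3, reusing the sampling sketches above; build_distill_net, distill_loss and the optimizer setup are hypothetical placeholders introduced for illustration, not functions defined by the patent.

```python
# Sketch of one joint-training iteration of the structure generator and distillation structure.
import torch

def train_step(generator, teacher, batch, optimizer):
    enc = sample_encoding_vector(sample_layer_vector())   # new encoding vector each iteration
    student = build_distill_net(generator, enc)            # placeholder: builds the distillation structure
    loss = distill_loss(student, teacher, batch)            # placeholder: L_attn + L_hidn + L_embd
    optimizer.zero_grad()
    loss.backward()    # gradients reach the generator via the chain rule
    optimizer.step()   # updates both distillation weights and generator weights
    return loss.item()
```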
  • In the evolutionary search algorithm used for the meta-learning distillation network, each distillation network is generated from a network encoding vector comprising three distillation modules (embedding layer distillation, hidden layer distillation, and self-attention knowledge distillation), so the distillation network encoding vector is defined as the gene of the distillation network.
  • a series of distillation network encoding vectors are first selected as the genes of the distillation network, and the accuracy of the corresponding distillation network is obtained by evaluating on the validation set. Then, the top k genes with the highest precision are selected, and gene recombination and mutation are used to generate new genes.
  • Gene mutation refers to mutation by randomly changing the value of some elements in the gene.
  • Step 1 Each distillation structure model is generated from the knowledge distillation encoding vector based on Transformer layer sampling, so the knowledge distillation encoding vector is defined as the gene G of the distillation structure model, and a series of genes that satisfy the constraint C are randomly selected as the initial population.
  • Step 2 Evaluate the inference accuracy of the distillation structure model corresponding to each gene G i in the existing population on the validation set, and select the top k genes with the highest accuracy.
  • Step 3 Use the top k genes with the highest accuracy selected in step 2 to perform gene recombination and gene mutation to generate new genes, and add the new genes to the existing population.
  • Gene mutation refers to mutation by randomly changing the value of some elements in the gene;
  • Gene recombination refers to randomly recombining the genes of two parents to produce offspring; moreover, the constraint C can easily be enforced by eliminating unqualified genes.
  • Step 4 Repeat steps 2 and 3 for N rounds of iterations, select the top k genes with the highest accuracy in the existing population and generate new genes, until the gene that satisfies the constraint C and has the highest accuracy is obtained.
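  • A sketch of the evolutionary search in Steps 1-4 above, treating each gene as a flat list of encoding values; evaluate and satisfies_constraint stand for validation-set evaluation and the hard constraint C (e.g., the number of floating-point operations) and are hypothetical placeholders, as are the recombination and mutation rules used here.

```python
# Sketch of the evolutionary search over distillation-network genes.
import random

def evolutionary_search(init_population, evaluate, satisfies_constraint,
                        rounds=10, k=5, mutate_prob=0.1):
    population = [g for g in init_population if satisfies_constraint(g)]
    for _ in range(rounds):
        topk = sorted(population, key=evaluate, reverse=True)[:k]   # top-k by accuracy
        children = []
        for _ in range(len(population) - len(topk)):
            a, b = random.sample(topk, 2)
            child = [x if random.random() < 0.5 else y
                     for x, y in zip(a, b)]                          # recombination
            child = [random.randint(0, 1) if random.random() < mutate_prob else x
                     for x in child]                                 # mutation (binary genes assumed)
            if satisfies_constraint(child):                          # enforce constraint C
                children.append(child)
        population = topk + children
    return max(population, key=evaluate)
```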
  • the pre-trained language model automatic compression platform based on multi-level knowledge distillation of the present invention includes the following components:
  • Data loading component: used to obtain the BERT model containing a specific natural language processing downstream task and the training samples of the multi-task-oriented pre-trained language model to be compressed, uploaded by the logged-in user; the training samples are labeled text samples that satisfy the supervised learning task.
  • Automatic compression component: used to automatically compress multi-task-oriented pre-trained language models, including a knowledge distillation vector encoding module, a distillation network generation module, a structure generator and distillation network joint training module, a distillation network search module, and a task-specific fine-tuning module.
  • the knowledge distillation vector encoding module includes the layer sampling vector of the Transformer, the multi-head pruning vector of the self-attention unit, the hidden layer dimension reduction vector, and the embedding layer dimension reduction vector.
  • during forward propagation, the distillation network encoding vector is input into the structure generator, generating the distillation network of the corresponding structure and the weight matrix of the structure generator.
  • the distillation network generation module constructs, based on the structure generator, a distillation network corresponding to the currently input encoding vector, and adjusts the shape of the weight matrix output by the structure generator so that the number of self-attention units at the input and output of the distillation structure matches the self-attention encoding vector.
  • the structure generator and distillation network joint training module trains the structure generator end-to-end: specifically, the multi-level knowledge distillation encoding vector and a mini-batch of training data are fed into the distillation network, and the weights of the distillation structure and the weight matrix of the structure generator are updated.
  • the distillation network search module searches for the distillation network with the highest accuracy that satisfies the specific constraints, for which an evolutionary algorithm is proposed; the network encoding vector is input into the trained structure generator to generate the weights of the corresponding distillation network, and the distillation network is evaluated on the validation set to obtain its accuracy.
  • each distillation network is encoded and generated by the network coding vector including the three distillation modules of embedding layer knowledge distillation, hidden layer state knowledge distillation and self-attention knowledge distillation.
  • the distillation network encoding vector is defined as the genes of the distillation network.
  • a series of distillation network encoding vectors are first selected as the genes of the distillation network, and the accuracy of the corresponding distillation network is obtained by evaluating on the validation set. Then, the top k genes with the highest precision are selected, and gene recombination and mutation are used to generate new genes. By further repeating the process of selecting the top k optimal genes and the process of generating new genes, the genes that satisfy the constraints and have the highest accuracy are obtained.
  • the task-specific fine-tuning module builds a downstream task network on the pre-trained model distillation network generated by the automatic compression component, fine-tunes it for the downstream task scenario using the feature layer and output layer of the distillation network, and outputs the final fine-tuned student model, i.e., the compressed pre-trained language model containing the downstream task required by the logged-in user.
  • the compressed model is output to a specified container, which can be downloaded by the logged-in user, and the comparison information of the size of the model before and after compression is presented on the page of outputting the compressed model of the platform.
  • Inference component: the logged-in user obtains the pre-trained compressed model from the platform and uses the compressed model output by the automatic compression component to perform inference on new data of the natural language processing downstream task uploaded by the logged-in user on a dataset from the actual scenario; comparison information on the inference speed before and after compression is presented on the platform's compressed-model inference page.
  • the logged-in user can directly download the trained pre-trained language model provided by the platform of the present invention and, according to the user's demand for a specific natural language processing downstream task, build a downstream task network on the basis of the compressed pre-trained model architecture generated by the platform, fine-tune it, and finally deploy it on terminal devices; inference on natural language processing downstream tasks can also be performed directly on the platform.
  • In the following, the technical solution is further described using a sentiment classification task on movie reviews: the BERT model for a single-sentence text classification task and the SST-2 sentiment analysis dataset uploaded by the logged-in user are obtained through the data loading component of the platform; a multi-task-oriented BERT pre-trained language model is generated through the automatic compression component; the BERT pre-trained model generated by the automatic compression component is loaded by the platform, and a model for the text classification task is constructed on the generated pre-trained model, which is then fine-tuned for the downstream text classification task by the task-specific fine-tuning module using the feature layer and output layer of the generated pre-trained model; finally, the platform outputs the compressed BERT model containing the text classification task required by the logged-in user;
  • the compressed model is output to the designated container, which can be downloaded by the logged-in user, and the comparison information of the model size before and after compression is presented on the output compressed model page of the platform.
  • the size of the model before compression is 110M and the size after compression is 53M, a compression of 51.8%, as shown in Table 1 (text classification task SST-2, 67K samples; inference accuracy 91.5% before compression vs. 92.3% after, an improvement of 0.8%).
  • the compressed model output by the platform is used to perform inference on the SST-2 test set data uploaded by the logged-in user, and the compressed-model inference page of the platform shows that the inference speed after compression is 1.95 times faster than before compression, while the inference accuracy improves from 91.5% before compression to 92.3%.

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Data Mining & Analysis (AREA)
  • General Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • General Health & Medical Sciences (AREA)
  • Artificial Intelligence (AREA)
  • Biophysics (AREA)
  • Evolutionary Computation (AREA)
  • Biomedical Technology (AREA)
  • Molecular Biology (AREA)
  • Computing Systems (AREA)
  • Computational Linguistics (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Mathematical Physics (AREA)
  • Software Systems (AREA)
  • Health & Medical Sciences (AREA)
  • Databases & Information Systems (AREA)
  • Feedback Control In General (AREA)
  • Machine Translation (AREA)

Abstract

一种基于多层级知识蒸馏的预训练语言模型自动压缩方法及平台,所述方法包括如下步骤:步骤一、构建多层级知识蒸馏,在自注意力单元、隐藏层状态、嵌入层三个不同层级上蒸馏大模型的知识结构;步骤二、训练元学习的知识蒸馏网络,生成多种预训练语言模型的通用压缩架构;步骤三、基于进化算法搜索最佳压缩结构。首先,研究基于元学习的知识蒸馏生成多种预训练语言模型的通用压缩架构;其次,在已训练好的元学习网络基础上,通过进化算法搜索最佳压缩结构,由此得到与任务无关的预训练语言模型的最优通用压缩架构。

Description

基于多层级知识蒸馏预训练语言模型自动压缩方法及平台 技术领域
本发明属于语言模型压缩领域,尤其涉及一种基于多层级知识蒸馏的预训练语言模型自动压缩方法及平台。
背景技术
大规模预训练语言模型在自然语言理解和生成任务上都取得了优异的性能,然而,将具有海量参数的预训练语言模型部署到内存有限的设备中仍然面临巨大挑战。在模型压缩领域,已有的语言模型压缩方法都是针对特定任务的语言模型压缩。面向下游其它任务时,使用特定任务知识蒸馏生成的预训练模型仍需要重新微调大模型以及生成相关的大模型知识。大模型微调费时费力,计算成本也很高。为了提高压缩模型面向多种下游任务使用过程中的灵活性和有效性,研究与任务无关的预训练语言模型的通用压缩架构。而且,已有的知识蒸馏方法主要是人工设计的知识蒸馏策略。由于受计算资源等限制,人工设计所有可能的蒸馏结构并且寻找最优结构几乎不可能。受神经网络架构搜索的启发,尤其是在少样本的情况下,本发明基于多层级知识蒸馏生成面向多任务的预训练语言模型的通用压缩架构。
发明内容
本发明的目的在于针对现有技术的不足,提供一种基于多层级知识蒸馏的预训练语言模型自动压缩方法及平台。本发明首先构建一种多层级的知识蒸馏,在不同层级上蒸馏大模型的知识结构。而且,引入元学习,生成多种预训练语言模型的通用压缩架构。具体地,设计一种结构生成器的元网络,基于多层级知识蒸馏方法构建知识蒸馏编码向量,利用该结构生成器生成与当前输入的编码向量对应的蒸馏结构模型。同时,提出伯努利分布采样的方法训练结构生成器。每轮迭代时,利用伯努利分布采样各个编码器迁移的自注意力单元,组成对应的编码向量。通过改变输入结构生成器的编码向量和小批次的训练数据,联合训练结构生成器和对应的蒸馏结构,可以学得能够为不同蒸馏结构生成权重的结构生成器。同时,在已训练好的元学习网络基础上,通过进化算法搜索最佳压缩结构,由此得到与任务无关的预训练语言模型的最优通用压缩架构。
本发明的目的是通过以下技术方案实现的:
一种基于多层级知识蒸馏的预训练语言模型自动压缩方法,包括如下步骤:
步骤一、构建多层级知识蒸馏,在自注意力单元、隐藏层状态、嵌入层三个不同层级上蒸馏 大模型的知识结构;
步骤二、训练元学习的知识蒸馏网络,生成多种预训练语言模型的通用压缩架构;
步骤三、基于进化算法搜索最优压缩结构。
进一步的,步骤二中设计一种结构生成器的元网络,基于步骤一的多层级知识蒸馏构建知识蒸馏编码向量,利用结构生成器生成与当前输入的编码向量对应的蒸馏结构模型;同时,采用伯努利分布采样的方法训练结构生成器,每轮迭代时,利用伯努利分布采样各个编码器迁移的自注意力单元,组成对应的编码向量;通过改变输入结构生成器的编码向量和小批次的训练数据,联合训练结构生成器和对应的蒸馏结构,得到为不同蒸馏结构生成权重的结构生成器。
进一步的,步骤三中在已训练好的元学习网络基础上,通过进化算法搜索最优压缩架构,得到与任务无关的预训练语言模型的最优通用压缩架构。
进一步的,步骤一中将自注意力分布知识、隐藏状态知识和嵌入层知识编码为一个蒸馏网络,采用知识蒸馏实现大模型向小模型的压缩。
进一步的,步骤一中包括自注意力知识蒸馏、隐藏层状态知识蒸馏和嵌入层知识蒸馏。
进一步的,步骤二中所述结构生成器的元网络,由两个全连接层组成,输入一个自注意力知识蒸馏编码向量,输出结构生成器的权重矩阵;
结构生成器的训练过程如下:
步骤1:构造知识蒸馏编码向量,包括层采样向量、多头剪枝向量、隐藏层降维向量和嵌入层降维向量;
步骤2:基于结构生成器构建蒸馏网络架构,利用该结构生成器构建与当前输入的编码向量对应的蒸馏结构模型,调整结构生成器输出的权重矩阵的形状,与自注意力编码向量对应的蒸馏结构的输入输出的自注意力单元数目一致;
步骤3:联合训练结构生成器和蒸馏结构模型:通过伯努利分布采样的方法训练结构生成器,通过改变输入结构生成器的自注意力编码向量和一个小批次的训练数据,联合训练结构生成器和对应的蒸馏结构,可以学得能够为不同蒸馏结构生成权重的结构生成器。
进一步的,步骤三中,将网络编码向量输入训练好的结构生成器,生成对应蒸馏网络的权重,在验证集上对蒸馏网络进行评估,获得对应蒸馏网络的精度;具体如下:
满足特定约束条件下,首先选取一系列蒸馏网络编码向量作为蒸馏网络的基因,通过在验证集上评估获得对应蒸馏网络的精度;然后,选取精度最高的前k个基因,采用基因重组 和变异生成新的基因,通过进一步重复前k个最优基因选择的过程和新基因生成的过程来迭代获得满足约束条件并且精度最高的基因。
进一步的,所述进化算法的具体步骤如下:
步骤一、将知识蒸馏编码向量定义为蒸馏结构模型的基因G,随机选取满足约束条件C的一系列基因作为初始种群;
步骤二、评估现有种群中各个基因G i对应的蒸馏结构模型在验证集上的推理精度accuracy,选取精度最高的前k个基因;
步骤三、利用步骤二选取的精度最高的前k个基因进行基因重组和基因变异生成新的基因,将新基因加入现有种群中;
步骤四、重复迭代N轮步骤二和步骤三,选择现有种群中前k个精度最高的基因并生成新基因,直到获得满足约束条件C并且精度最高的基因。
一种基于多层级知识蒸馏的预训练语言模型自动压缩平台,包括如下组件:
数据加载组件:用于获取登陆用户上传的待压缩的包含具体自然语言处理下游任务的BERT模型和面向多任务的预训练语言模型的训练样本,所述训练样本是满足监督学习任务的带标签的文本样本;
自动压缩组件:用于将面向多任务的预训练语言模型自动压缩,包括知识蒸馏向量编码模块、蒸馏网络生成模块、结构生成器和蒸馏网络联合训练模块、蒸馏网络搜索模块、特定任务微调模块;
所述知识蒸馏向量编码模块包括Transformer的层采样向量、自注意力单元的多头剪枝向量、隐藏层降维向量、嵌入层降维向量,前向传播过程中,将蒸馏网络编码向量输入结构生成器,生成对应结构的蒸馏网络和结构生成器的权重矩阵;
所述蒸馏网络生成模块基于结构生成器构建与当前输入的编码向量对应的蒸馏网络,调整结构生成器输出的权重矩阵的形状,使其与自注意力编码向量对应的蒸馏结构的输入输出的自注意力单元数目一致;
结构生成器和蒸馏网络联合训练模块端到端地训练结构生成器,将多层级知识蒸馏编码向量和小批次的训练数据输入蒸馏网络,更新蒸馏结构的权重和结构生成器的权重矩阵;
所述蒸馏网络搜索模块采用进化算法搜索满足特定约束条件的最高精度的蒸馏网络;
所述特定任务微调模块是在所述自动压缩组件生成的预训练模型蒸馏网络上构建下游任务网络,利用蒸馏网络的特征层和输出层对下游任务场景进行微调,输出最终微调好的学生模型,即登陆用户需求的包含下游任务的预训练语言模型压缩模型,将所述压缩模型输出到 指定的容器,供所述登陆用户下载,并在所述平台的输出压缩模型的页面呈现压缩前后模型大小的对比信息;
推理组件:登陆用户从所述平台获取预训练压缩模型,利用所述自动压缩组件输出的压缩模型在实际场景的数据集上对登陆用户上传的自然语言处理下游任务的新数据进行推理。并在所述平台的压缩模型推理页面呈现压缩前后推理速度的对比信息。
进一步的,登陆用户可直接下载训练好的预训练语言模型,根据用户对具体某个自然语言处理下游任务的需求,在所述平台生成的已压缩的预训练模型架构基础上构建下游任务网络并进行微调,最后部署在终端设备,或直接在所述平台上对自然语言处理下游任务进行推理。
本发明的有益效果是:本发明的基于多层级知识蒸馏的预训练语言模型自动压缩方法及平台法及平台,首先,研究基于元学习的知识蒸馏生成多种预训练语言模型的通用压缩架构;其次,在已训练好的元学习网络基础上,通过进化算法搜索最佳压缩结构,由此得到与任务无关的预训练语言模型的最优通用压缩架构。
本发明的基于多层级知识蒸馏的预训练语言模型自动压缩平台,压缩生成面向多任务的预训练语言模型的通用架构,充分利用已压缩好的模型架构提高下游任务的压缩效率,并且可将大规模自然语言处理模型部署在内存小、资源受限等端侧设备,推动了通用深度语言模型在工业界的落地进程。
附图说明
图1是基于多层级知识蒸馏的预训练语言模型自动压缩平台的整体架构图;
图2是多层级知识蒸馏示意图;
图3是结构生成器的训练流程图;
图4是编码器模块的隐藏层和输入的嵌入层的降维结构示意图;
图5是基于结构生成器构建蒸馏网络的架构图;
图6是结构生成器和蒸馏网络联合训练过程流程图;
图7是基于进化算法的蒸馏网络搜索架构示意图。
具体实施方式
下面结合附图对本发明作进一步说明。
如图1所示,本发明包括基于元学习的知识蒸馏以及基于进化算法的蒸馏网络自动搜索。将面向多任务的大规模预训练语言模型自动压缩生成满足不同硬约束条件(如,浮点数运算次数),与任务无关的通用架构。
具体方案如下:本发明的基于多层级知识蒸馏的预训练语言模型自动压缩方法整个过程分为三个阶段,第一个阶段是构建多层级的知识蒸馏,在自注意力单元、隐藏层状态、嵌入层三个不同层级上蒸馏大模型的知识结构;第二个阶段是训练元学习的知识蒸馏网络,生成多种预训练语言模型的通用压缩架构。具体地,设计一种结构生成器的元网络,基于第一阶段提出的多层级知识蒸馏方法构建知识蒸馏编码向量,利用结构生成器生成与当前输入的编码向量对应的蒸馏结构模型。同时,提出伯努利分布采样的方法训练结构生成器。每轮迭代时,利用伯努利分布采样各个编码器迁移的自注意力单元,组成对应的编码向量。通过改变输入结构生成器的编码向量和小批次的训练数据,联合训练结构生成器和对应的蒸馏结构,可以学得能够为不同蒸馏结构生成权重的结构生成器;第三个阶段是基于进化算法搜索最佳压缩结构,在已训练好的元学习网络基础上,通过进化算法搜索最优压缩架构,由此得到与任务无关的预训练语言模型的最优通用压缩架构。具体过程如下:
第一阶段:多层级知识蒸馏
本发明将自注意力分布知识、隐藏状态知识和嵌入层知识编码为一个蒸馏网络,如图2所示。采用知识蒸馏实现大模型向小模型的压缩,并最大程度地将大模型的自注意力知识迁移给小模型。
a.自注意力知识蒸馏。Transformer层蒸馏包括基于自注意力知识蒸馏和基于隐藏层状态知识蒸馏,其中如图2所示。基于自注意力的蒸馏能够关注丰富的语言知识。这些大量的语言知识包括自然语言理解必要的语义和关联的信息。因此,提出基于自注意力知识蒸馏,鼓励将丰富的教师模型的知识迁移到学生模型。
注意力函数是由queries、keys和values三个矩阵计算所得,分别表示为矩阵Q、矩阵K和矩阵V。自注意力函数定义如下:
A = QK^T / √d_k,   (1)
Attention(Q,K,V)=softmax(A)V,
其中,d k是矩阵K的维度,是一个缩放因子。A表示自注意力矩阵,由矩阵Q和矩阵K通过点积操作计算而得。最终自注意力函数Attention(Q,K,V)的输出作为矩阵V的一个权重和,权重是由对矩阵V的每列进行softmax操作计算而得。自注意力矩阵A可以关注大量的语言知识,因此基于自注意力蒸馏在知识蒸馏中起到很重要的作用。多头注意力是将来自不同特征子空间的注意力头按如下方式拼接所得:
MultiHead(Q,K,V)=Concat(head 1,...,head h)W,   (2)
其中,h是注意力头的数量,head i表示第i个注意力头,是由不同特征子空间的Attention()函数计算而得,Concat表示拼接,其中W是一个线性变换矩阵。
学生模型学习模仿教师网络的多头注意力知识,其中损失函数定义如下:
L_attn = (1/h) ∑_{i=1}^{h} MSE(A_i^S, A_i^T),   (3)
其中,h是自注意力头的数量,A i∈R l×l表示教师或学生模型第i个注意力头对应的自注意力矩阵,R表示实数,l表示当前层输入的大小,L是输入文本的长度,MSE()是均方差损失函数。值得注意的是,这里使用注意力矩阵A i,而不是softmax的输出softmax(A i)。
b.隐藏层状态知识蒸馏。除了基于自注意力知识蒸馏,本申请还基于隐藏层状态知识蒸馏,即对Transformer层输出的知识进行迁移。隐藏层状态知识蒸馏的损失函数如下:
L hidn=MSE(H SW h,H T),   (4)
其中,矩阵H S∈R L×d’和H T∈R L×d分别指学生和教师网络的隐藏状态,标量值d和d’分别表示教师和学生模型隐藏层的大小,矩阵W h∈R d’×d是一个可学习的线性变换矩阵,将学生网络的隐藏层状态转换为与教师网络的隐藏层状态相同的特征空间。
c.嵌入层知识蒸馏。本申请同时采用了基于嵌入层知识蒸馏,与基于隐藏状态知识蒸馏类似,定义为:
L embd=MSE(E SW e,E T),   (5)
其中,矩阵E S和E T分别指学生和教师网络的嵌入层。在本申请中,嵌入层矩阵的形状与隐藏层状态矩阵相同。矩阵W e是一个线性变换矩阵。
第二阶段:元学习的知识蒸馏
设计一个结构生成器,结构生成器是一个元网络,由两个全连接层组成。图3显示了结构生成器的训练过程。输入一个自注意力知识蒸馏编码向量,输出结构生成器的权重矩阵。
结构生成器的训练过程如下:
步骤一:构造知识蒸馏编码向量。前向传播过程中,将蒸馏网络编码向量输入结构生成器,输出结构生成器的权重矩阵。蒸馏网络编码向量由Transformer层采样向量、多头注意力剪枝向量、隐藏层降维向量和嵌入层降维向量组成。蒸馏网络编码向量的具体方案如下:
a.层采样向量。层采样阶段,首先采用伯努利分布对BERT的Transformer层数进行层采样,生成一个层采样向量。具体地,假设当前考虑迁移第i个Transformer模块,Xi是一个独立的伯努利随机变量,X i为1(或0)的概率为p(或1-p)。
X i∽Bernoulli(p)    (6)
利用随机变量X依次对BERT的12个Transformer单元进行伯努利采样,生成一个由12个0或1元素组成的向量。当随机变量X i为1的概率值大于等于0.5时,层采样向量对应的元素为1,代表当前Transformer模块进行迁移学习;当随机变量X i为1的概率值小于0.5时时,层采样向量对应的元素为0,则代表当前Transformer模块不进行迁移学习。
利用以上伯努利采样方式对BERT包含的所有Transformer层依次进行层采样,组成网络编码向量中的层采样向量:
layer_sample=[0 1 1..i..1 0]s.t.sum(layer_sample[i]==1)≥6  (7)
值得注意的是,BERT base一共有12个Transformer模块,为了防止层采样Transformer模块迁移的数量(指层采样向量中元素为1的数量)过少,提出增加层采样约束条件,即每生成一个蒸馏网络结构时,对BERT的所有Transformer层进行层采样阶段,构建约束条件,使得最终层采样所得向量中元素为1的数量不小于6,否则重新进行层采样。
此时,对Transformer进行知识蒸馏时,通过网络编码向量中的层采样向量,将教师模型和学生模型建立一对一的映射关系,之后根据网络编码向量中的层采样向量生成对应的蒸馏网络结构。为了加速整个训练过程,使用教师模型中层采样向量对应元素为1的Transformer的权重来初始化学生模型迁移的Transformer模块。
b.多头剪枝向量。每个Transformer模块由多头注意力单元组成。受通道剪枝的启发,本申请提出基于多头注意力单元的多头剪枝。每次生成一个蒸馏网络结构时,生成一个多头剪枝编码向量,表示当前迁移的所有Transformer层中进行自注意力知识迁移的注意力头的数量。公式定义如下:
M_i = SRS_Sample([0,1,2,…,30]),  Head_i = head_max × head_scale   (8)
其中,Head i表示生成第i个蒸馏网络结构时,每层Transformer包含的注意力头的数量,这里每次生成的不同蒸馏网络结构中每层Transformer包含的注意力头的数量相同。head scale表示每个Transformer模块包含的自注意力头数量的衰减因子。SRS_Sample表示简单随机抽样;由于BERT base的每个Transformer模块有12个自注意力单元,所以head max为12。在生成第i个蒸馏网络结构过程中,首先通过对列表[0,1,2,…,30]进行简单随机抽 样得到随机数M i,获得当前蒸馏结构的衰减因子head scale,与标准head max进行相乘过后,得到当前蒸馏网络结构的多头注意力单元的数量。
因此,基于注意力多头剪枝进行知识蒸馏,每次生成一个蒸馏网络结构时,对于进行迁移的Transformer模块,即Transformer层采样编码向量中值为1的元素,生成一个多头剪枝编码向量,表示当前迁移的所有Transformer层中进行自注意力知识迁移的注意力头的数量。
c.隐藏层降维向量。隐藏层状态的知识蒸馏是针对每个Transformer层的最终输出进行知识蒸馏,即减小BERT隐藏层的维度。具体过程如下,每次生成的蒸馏网络的所有Transformer层的隐藏层的维度相同。生成第i个蒸馏网络的隐藏层的维度hidn i定义如下:
hidn scale=SRS_Sample([1,2,3,4,5,6])   (9)
hidn i=hidn base×hidn scale
其中,hidn base是一个超参数,由于BERT base的隐藏层维度是768,初始化hidn base为768的一个公约数,这里初始化hidn base为128。hidn scale表示隐藏层维度的降维因子,是通过对列表[1,2,3,4,5,6]进行简单随机抽样得到的一个元素。
因此,基于隐藏层进行知识蒸馏生成蒸馏网络时,每次采用简单随机抽样从上述列表采样一个降维因子,生成当前蒸馏网络对应的隐藏层的维度大小。
d.嵌入层降维向量。图4所示了编码器模块的隐藏层和输入的嵌入层的降维结构。从图中可以看出,由于隐藏层部分和嵌入层部分都有一个残差连接,所以嵌入层的输出维度大小与隐藏层输出维度大小相等。
因此,基于嵌入层进行知识蒸馏生成蒸馏网络时,每次嵌入层的维度与当前隐藏层的维度的大小保持一致即可。
步骤二:基于结构生成器构建蒸馏网络架构。利用该结构生成器构建与当前输入的编码向量对应的蒸馏结构模型,而且需要调整结构生成器输出的权重矩阵的形状,使其与自注意力编码向量对应的蒸馏结构的输入输出的自注意力单元数目一致。图5所示由结构生成器构建的蒸馏结构的网络架构。
步骤三:联合训练结构生成器和蒸馏结构模型。将自注意力知识蒸馏编码向量和一个小批次的训练数据输入蒸馏结构模型。值得注意的是,反向传播的过程中,蒸馏结构的权重和结构生成器的权重矩阵都会一起更新。构生成器的权重可以使用链式法则计算,因此,可以端到端的训练结构生成器。
同时,提出伯努利分布采样的方法训练结构生成器。每轮迭代时,利用伯努利分布采样各层Transformer迁移的自注意力单元,组成对应的编码向量。通过改变输入结构生成器的自注意力编码向量和一个小批次的训练数据,联合训练结构生成器和对应的蒸馏结构,可以学得能够为不同蒸馏结构生成权重的结构生成器。图6所示结构生成器的训练过程。
第三阶段:基于进化算法的蒸馏网络搜索
将网络编码向量输入训练好的结构生成器,生成对应蒸馏网络的权重,在验证集上对蒸馏网络进行评估,获得对应蒸馏网络的精度。然后,为了搜索出满足特定约束条件的最高精度的蒸馏网络,提出采用进化算法搜索满足特定约束条件的最高精度的蒸馏网络,图7所示了基于进化算法的蒸馏网络搜索架构。
在元学习蒸馏网络中采用的进化搜索算法中,每个蒸馏网络是由包含嵌入层蒸馏、隐藏层蒸馏和自注意力知识蒸馏三个蒸馏模块的网络编码向量编码生成,所以将蒸馏网络编码向量定义为蒸馏网络的基因。在满足特定约束条件下,首先选取一系列蒸馏网络编码向量作为蒸馏网络的基因,通过在验证集上评估获得对应蒸馏网络的精度。然后,选取精度最高的前k个基因,采用基因重组和变异生成新的基因。基因变异是指,通过随机改变基因里一部分元素值来进行变异。基因重组是指,随机地将两个父辈的基因进行重组,产生后代。而且,可以很容易地通过消除不合格的基因来加强约束。通过进一步重复前k个最优基因选择的过程和新基因生成的过程来迭代几轮,就可以获得满足约束条件并且精度最高的基因。第三阶段:如图7所示为基于进化算法的蒸馏网络搜索的过程:
在第二阶段训练好的元学习的知识蒸馏网络基础上,将多个满足特定约束条件的知识蒸馏编码向量输入结构生成器生成对应的权重矩阵,得到多个蒸馏结构模型;在验证集上对每个蒸馏结构模型进行评估,获得对应的精度;采用进化算法搜索其中满足特定约束条件(如浮点数运算次数)的精度最高的蒸馏结构模型,由此得到与任务无关的预训练语言模型的通用压缩架构。进化搜索算法的具体步骤如下:
步骤一、每个蒸馏结构模型是由基于Transformer层采样的知识蒸馏编码向量生成的,所以将知识蒸馏编码向量定义为蒸馏结构模型的基因G,随机选取满足约束条件C的一系列基因作为初始种群。
步骤二、评估现有种群中各个基因G i对应的蒸馏结构模型在验证集上的推理精度accuracy,选取精度最高的前k个基因。
步骤三、利用步骤二选取的精度最高的前k个基因进行基因重组和基因变异生成新的基因,将新基因加入现有种群中。基因变异是指通过随机改变基因里一部分元素值来进行变异; 基因重组是指随机地将两个父辈的基因进行重组产生后代;而且可以很容易地通过消除不合格的基因来加强约束C。
步骤四、重复迭代N轮步骤二和步骤三,选择现有种群中前k个精度最高的基因并生成新基因,直到获得满足约束条件C并且精度最高的基因。
本发明的基于多层级知识蒸馏的预训练语言模型自动压缩平台,包括以下组件:
数据加载组件:用于获取登陆用户上传的待压缩的包含具体自然语言处理下游任务的BERT模型和面向多任务的预训练语言模型的训练样本,所述训练样本是满足监督学习任务的带标签的文本样本。
自动压缩组件:用于将面向多任务的预训练语言模型自动压缩,包括知识蒸馏向量编码模块、蒸馏网络生成模块、结构生成器和蒸馏网络联合训练模块、蒸馏网络搜索模块、特定任务微调模块。
知识蒸馏向量编码模块包括Transformer的层采样向量、自注意力单元的多头剪枝向量、隐藏层降维向量、嵌入层降维向量。前向传播过程中,将蒸馏网络编码向量输入结构生成器,生成对应结构的蒸馏网络和结构生成器的权重矩阵。
蒸馏网络生成模块是基于结构生成器构建与当前输入的编码向量对应的蒸馏网络,调整结构生成器输出的权重矩阵的形状,使其与自注意力编码向量对应的蒸馏结构的输入输出的自注意力单元数目一致。
结构生成器和蒸馏网络联合训练模块是端到端的训练结构生成器,具体地,将多层级知识蒸馏编码向量和一个小批次的训练数据输入蒸馏网络。更新蒸馏结构的权重和结构生成器的权重矩阵。
蒸馏网络搜索模块是为了搜索出满足特定约束条件的最高精度的蒸馏网络,提出进化算法搜索满足特定约束条件的最高精度的蒸馏网络。将网络编码向量输入训练好的结构生成器,生成对应蒸馏网络的权重,在验证集上对蒸馏网络进行评估,获得对应蒸馏网络的精度。在元学习蒸馏网络中采用的进化搜索算法中,每个蒸馏网络是由包含嵌入层知识蒸馏、隐藏层状态的知识蒸馏和自注意力知识蒸馏三个蒸馏模块的网络编码向量编码生成,所以将蒸馏网络编码向量定义为蒸馏网络的基因。在满足特定约束条件下,首先选取一系列蒸馏网络编码向量作为蒸馏网络的基因,通过在验证集上评估获得对应蒸馏网络的精度。然后,选取精度最高的前k个基因,采用基因重组和变异生成新的基因。通过进一步重复前k个最优基因选择的过程和新基因生成的过程进行迭代,获得满足约束条件并且精度最高的基因。
特定任务微调模块是在所述自动压缩组件生成的预训练模型蒸馏网络上构建下游任务网络,利用蒸馏网络的特征层和输出层对下游任务场景进行微调,输出最终微调好的学生模型,即登陆用户需求的包含下游任务的预训练语言模型压缩模型。将所述压缩模型输出到指定的容器,可供所述登陆用户下载,并在所述平台的输出压缩模型的页面呈现压缩前后模型大小的对比信息。
推理组件:登陆用户从所述平台获取预训练压缩模型,用户利用所述自动压缩组件输出的压缩模型在实际场景的数据集上对登陆用户上传的自然语言处理下游任务的新数据进行推理。并在所述平台的压缩模型推理页面呈现压缩前后推理速度的对比信息。
登陆用户可直接下载本发明平台提供的训练好的预训练语言模型,根据用户对具体某个自然语言处理下游任务的需求,在所述平台生成的已压缩的预训练模型架构基础上构建下游任务网络并进行微调,最后部署在终端设备。也可以直接在所述平台上对自然语言处理下游任务进行推理。
下面将以电影评论进行情感分类任务对本发明的技术方案做进一步的详细描述。
通过所述平台的数据加载组件获取登陆用户上传的单个句子的文本分类任务的BERT模型和情感分析数据集SST-2;
通过所述平台的自动压缩组件,生成面向多任务的BERT预训练语言模型;
通过所述平台加载自动压缩组件生成的BERT预训练模型,在所述生成的预训练模型上构建文本分类任务的模型;
基于所述自动压缩组件的特定任务微调模块所得的学生模型进行微调,利用自动压缩组件生成的BERT预训练模型的特征层和输出层对下游文本分类任务场景进行微调,最终,平台输出登陆用户需求的包含文本分类任务的BERT模型的压缩模型。
将所述压缩模型输出到指定的容器,可供所述登陆用户下载,并在所述平台的输出压缩模型的页面呈现压缩前后模型大小的对比信息,压缩前模型大小为110M,压缩后为53M,压缩了51.8%。如下表1所示。
表1:文本分类任务BERT模型压缩前后对比信息
文本分类任务(SST-2)(包含67K个样本) 压缩前 压缩后 对比
模型大小 110M 53M 压缩51.8%
推理精度 91.5% 92.3% 提升0.8%
通过所述平台的推理组件,利用所述平台输出的压缩模型对登陆用户上传的SST-2测试集数据进行推理,并在所述平台的压缩模型推理页面呈现压缩后比压缩前推理速度加快1.95倍,并且推理精度从压缩前的91.5%提升为92.3%。

Claims (10)

  1. 一种基于多层级知识蒸馏的预训练语言模型自动压缩方法,其特征在于,包括如下步骤:
    步骤一、构建多层级知识蒸馏,在自注意力单元、隐藏层状态、嵌入层三个不同层级上蒸馏大模型的知识结构;
    步骤二、训练元学习的知识蒸馏网络,生成多种预训练语言模型的通用压缩架构;
    步骤三、基于进化算法搜索最优压缩结构。
  2. 如权利要求1所述的基于多层级知识蒸馏的预训练语言模型自动压缩方法,其特征在于:步骤二中设计一种结构生成器的元网络,基于步骤一的多层级知识蒸馏构建知识蒸馏编码向量,利用结构生成器生成与当前输入的编码向量对应的蒸馏结构模型;同时,采用伯努利分布采样的方法训练结构生成器,每轮迭代时,利用伯努利分布采样各个编码器迁移的自注意力单元,组成对应的编码向量;通过改变输入结构生成器的编码向量和小批次的训练数据,联合训练结构生成器和对应的蒸馏结构,得到为不同蒸馏结构生成权重的结构生成器。
  3. 如权利要求2所述的基于多层级知识蒸馏的预训练语言模型自动压缩方法,其特征在于:步骤三中在已训练好的元学习网络基础上,通过进化算法搜索最优压缩架构,得到与任务无关的预训练语言模型的最优通用压缩架构。
  4. 如权利要求1所述的基于多层级知识蒸馏的预训练语言模型自动压缩方法,其特征在于:
    步骤一中将自注意力分布知识、隐藏状态知识和嵌入层知识编码为一个蒸馏网络,采用知识蒸馏实现大模型向小模型的压缩。
  5. 如权利要求4所述的基于多层级知识蒸馏的预训练语言模型自动压缩方法,其特征在于:
    步骤一中包括自注意力知识蒸馏、隐藏层状态知识蒸馏和嵌入层知识蒸馏。
  6. 如权利要求2所述的基于多层级知识蒸馏的预训练语言模型自动压缩方法,其特征在于:
    步骤二中所述结构生成器的元网络,由两个全连接层组成,输入一个自注意力知识蒸馏编码向量,输出结构生成器的权重矩阵;
    结构生成器的训练过程如下:
    步骤2.1:构造知识蒸馏编码向量,包括层采样向量、多头剪枝向量、隐藏层降维向量和嵌入层降维向量;
    步骤2.2:基于结构生成器构建蒸馏网络架构,利用该结构生成器构建与当前输入的编码向量对应的蒸馏结构模型,调整结构生成器输出的权重矩阵的形状,与自注意力编码向量对应的蒸馏结构的输入输出的自注意力单元数目一致;
    步骤2.3:联合训练结构生成器和蒸馏结构模型:通过伯努利分布采样的方法训练结构生成器,通过改变输入结构生成器的自注意力编码向量和一个小批次的训练数据,联合训练结构生成器和对应的蒸馏结构,学得能够为不同蒸馏结构生成权重的结构生成器。
  7. 如权利要求6所述的基于多层级知识蒸馏的预训练语言模型自动压缩方法,其特征在于:
    步骤三中,将网络编码向量输入训练好的结构生成器,生成对应蒸馏网络的权重,在验证集上对蒸馏网络进行评估,获得对应蒸馏网络的精度;具体如下:
    满足特定约束条件下,首先选取一系列蒸馏网络编码向量作为蒸馏网络的基因,通过在验证集上评估获得对应蒸馏网络的精度;然后,选取精度最高的前k个基因,采用基因重组和变异生成新的基因,通过进一步重复前k个最优基因选择的过程和新基因生成的过程来迭代获得满足约束条件并且精度最高的基因。
  8. 如权利要求7所述的基于多层级知识蒸馏的预训练语言模型自动压缩方法,其特征在于:
    所述进化算法的具体步骤如下:
    步骤3.1、将知识蒸馏编码向量定义为蒸馏结构模型的基因G,随机选取满足约束条件C的一系列基因作为初始种群;
    步骤3.2、评估现有种群中各个基因G i对应的蒸馏结构模型在验证集上的推理精度accuracy,选取精度最高的前k个基因;
    步骤3.3、利用步骤3.2选取的精度最高的前k个基因进行基因重组和基因变异生成新的基因,将新基因加入现有种群中;
    步骤3.4、重复迭代N轮步骤3.2和步骤3.3,选择现有种群中前k个精度最高的基因并生成新基因,直到获得满足约束条件C并且精度最高的基因。
  9. 一种基于多层级知识蒸馏的预训练语言模型自动压缩平台,其特征在于包括如下组件:
    数据加载组件:用于获取登陆用户上传的待压缩的包含具体自然语言处理下游任务的BERT模型和面向多任务的预训练语言模型的训练样本,所述训练样本是满足监督学习任务的带标签的文本样本;
    自动压缩组件:用于将面向多任务的预训练语言模型自动压缩,包括知识蒸馏向量编码模块、蒸馏网络生成模块、结构生成器和蒸馏网络联合训练模块、蒸馏网络搜索模块、特定任务微调模块;
    所述知识蒸馏向量编码模块包括Transformer的层采样向量、自注意力单元的多头剪枝向量、隐藏层降维向量、嵌入层降维向量,前向传播过程中,将蒸馏网络编码向量输入结构生成器,生成对应结构的蒸馏网络和结构生成器的权重矩阵;
    所述蒸馏网络生成模块基于结构生成器构建与当前输入的编码向量对应的蒸馏网络,调整结构生成器输出的权重矩阵的形状,使其与自注意力编码向量对应的蒸馏结构的输入输出的自注意力单元数目一致;
    结构生成器和蒸馏网络联合训练模块端到端地训练结构生成器,将多层级知识蒸馏编码向量和小批次的训练数据输入蒸馏网络,更新蒸馏结构的权重和结构生成器的权重矩阵;
    所述蒸馏网络搜索模块采用进化算法搜索满足特定约束条件的最高精度的蒸馏网络;
    所述特定任务微调模块是在所述自动压缩组件生成的预训练模型蒸馏网络上构建下游任务网络,利用蒸馏网络的特征层和输出层对下游任务场景进行微调,输出最终微调好的学生模型,即登陆用户需求的包含下游任务的预训练语言模型压缩模型,将所述压缩模型输出到指定的容器,供所述登陆用户下载,并在所述平台的输出压缩模型的页面呈现压缩前后模型大小的对比信息;
    推理组件:登陆用户从所述平台获取预训练压缩模型,利用所述自动压缩组件输出的压缩模型在实际场景的数据集上对登陆用户上传的自然语言处理下游任务的新数据进行推理;并在所述平台的压缩模型推理页面呈现压缩前后推理速度的对比信息。
  10. 如权利要求9所述的基于多层级知识蒸馏的预训练语言模型自动压缩平台,其特征在于:
    登陆用户直接下载训练好的预训练语言模型,根据用户对具体某个自然语言处理下游任务的需求,在所述平台生成的已压缩的预训练模型架构基础上构建下游任务网络并进行微调,最后部署在终端设备,或直接在所述平台上对自然语言处理下游任务进行推理。
PCT/CN2020/142577 2020-12-17 2020-12-31 基于多层级知识蒸馏预训练语言模型自动压缩方法及平台 WO2022126797A1 (zh)

Priority Applications (3)

Application Number Priority Date Filing Date Title
GB2214215.2A GB2610319A (en) 2020-12-17 2020-12-31 Automatic compression method and platform for multilevel knowledge distillation-based pre-trained language model
JP2022566730A JP7283835B2 (ja) 2020-12-17 2020-12-31 マルチレベル知識蒸留に基づく事前訓練言語モデルの自動圧縮方法およびプラットフォーム
US17/555,535 US11501171B2 (en) 2020-12-17 2021-12-20 Method and platform for pre-trained language model automatic compression based on multilevel knowledge distillation

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
CN202011498328.2 2020-12-17
CN202011498328.2A CN112241455B (zh) 2020-12-17 2020-12-17 基于多层级知识蒸馏预训练语言模型自动压缩方法及平台

Related Child Applications (1)

Application Number Title Priority Date Filing Date
US17/555,535 Continuation US11501171B2 (en) 2020-12-17 2021-12-20 Method and platform for pre-trained language model automatic compression based on multilevel knowledge distillation

Publications (1)

Publication Number Publication Date
WO2022126797A1 true WO2022126797A1 (zh) 2022-06-23

Family

ID=74175234

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/CN2020/142577 WO2022126797A1 (zh) 2020-12-17 2020-12-31 基于多层级知识蒸馏预训练语言模型自动压缩方法及平台

Country Status (2)

Country Link
CN (1) CN112241455B (zh)
WO (1) WO2022126797A1 (zh)

Cited By (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN115774851A (zh) * 2023-02-10 2023-03-10 四川大学 基于分级知识蒸馏的曲轴内部缺陷检测方法及其检测系统
CN115979973A (zh) * 2023-03-20 2023-04-18 湖南大学 一种基于双通道压缩注意力网络的高光谱中药材鉴别方法
CN116821699A (zh) * 2023-08-31 2023-09-29 山东海量信息技术研究院 一种感知模型训练方法、装置及电子设备和存储介质
CN117574961A (zh) * 2024-01-15 2024-02-20 成都信息工程大学 一种将适配器注入预训练模型的参数高效化方法和装置
CN117725844A (zh) * 2024-02-08 2024-03-19 厦门蝉羽网络科技有限公司 基于学习权重向量的大模型微调方法、装置、设备及介质

Families Citing this family (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
JP7381814B2 (ja) * 2020-12-15 2023-11-16 之江実験室 マルチタスク向けの予めトレーニング言語モデルの自動圧縮方法及びプラットフォーム
CN113099175B (zh) * 2021-03-29 2022-11-04 苏州华云视创智能科技有限公司 一种基于5g的多模型手持云端检测传输系统及检测方法
CN114037653A (zh) * 2021-09-23 2022-02-11 上海仪电人工智能创新院有限公司 基于二阶段知识蒸馏的工业机器视觉缺陷检测方法和系统
CN113849641B (zh) * 2021-09-26 2023-10-24 中山大学 一种跨领域层次关系的知识蒸馏方法和系统
CN113986958B (zh) * 2021-11-10 2024-02-09 北京有竹居网络技术有限公司 文本信息的转换方法、装置、可读介质和电子设备

Citations (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN111062489A (zh) * 2019-12-11 2020-04-24 北京知道智慧信息技术有限公司 一种基于知识蒸馏的多语言模型压缩方法、装置
CN111506702A (zh) * 2020-03-25 2020-08-07 北京万里红科技股份有限公司 基于知识蒸馏的语言模型训练方法、文本分类方法及装置
US20200302295A1 (en) * 2019-03-22 2020-09-24 Royal Bank Of Canada System and method for knowledge distillation between neural networks
CN111767711A (zh) * 2020-09-02 2020-10-13 之江实验室 基于知识蒸馏的预训练语言模型的压缩方法及平台

Family Cites Families (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN112016674B (zh) * 2020-07-29 2024-06-18 魔门塔(苏州)科技有限公司 一种基于知识蒸馏的卷积神经网络的量化方法

Patent Citations (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20200302295A1 (en) * 2019-03-22 2020-09-24 Royal Bank Of Canada System and method for knowledge distillation between neural networks
CN111062489A (zh) * 2019-12-11 2020-04-24 北京知道智慧信息技术有限公司 一种基于知识蒸馏的多语言模型压缩方法、装置
CN111506702A (zh) * 2020-03-25 2020-08-07 北京万里红科技股份有限公司 基于知识蒸馏的语言模型训练方法、文本分类方法及装置
CN111767711A (zh) * 2020-09-02 2020-10-13 之江实验室 基于知识蒸馏的预训练语言模型的压缩方法及平台

Non-Patent Citations (2)

* Cited by examiner, † Cited by third party
Title
NATURAL LANGUAGE PROCESSING GROUP: "Smaller, Faster and Better Models! Microsoft General-Purpose Compression Method for Pre-Trained Language Models, MiniLM, Helps You Get Twice as Much Done with Half as Much Work", MICROSOFT RESEARCH ASIA - NEWS- FEATURES, 12 May 2020 (2020-05-12), CN, XP009538033, Retrieved from the Internet <URL:http://www.msra.cn/zh-cn/news/features/miniIm> *
WENHUI WANG; FURU WEI; LI DONG; HANGBO BAO; NAN YANG; MING ZHOU: "MiniLM: Deep Self-Attention Distillation for Task-Agnostic Compression of Pre-Trained Transformers", ARXIV.ORG, CORNELL UNIVERSITY LIBRARY, 201 OLIN LIBRARY CORNELL UNIVERSITY ITHACA, NY 14853, 25 February 2020 (2020-02-25), 201 Olin Library Cornell University Ithaca, NY 14853 , XP081607651 *

Cited By (8)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN115774851A (zh) * 2023-02-10 2023-03-10 四川大学 基于分级知识蒸馏的曲轴内部缺陷检测方法及其检测系统
CN115979973A (zh) * 2023-03-20 2023-04-18 湖南大学 一种基于双通道压缩注意力网络的高光谱中药材鉴别方法
CN116821699A (zh) * 2023-08-31 2023-09-29 山东海量信息技术研究院 一种感知模型训练方法、装置及电子设备和存储介质
CN116821699B (zh) * 2023-08-31 2024-01-19 山东海量信息技术研究院 一种感知模型训练方法、装置及电子设备和存储介质
CN117574961A (zh) * 2024-01-15 2024-02-20 成都信息工程大学 一种将适配器注入预训练模型的参数高效化方法和装置
CN117574961B (zh) * 2024-01-15 2024-03-22 成都信息工程大学 一种将适配器注入预训练模型的参数高效化方法和装置
CN117725844A (zh) * 2024-02-08 2024-03-19 厦门蝉羽网络科技有限公司 基于学习权重向量的大模型微调方法、装置、设备及介质
CN117725844B (zh) * 2024-02-08 2024-04-16 厦门蝉羽网络科技有限公司 基于学习权重向量的大模型微调方法、装置、设备及介质

Also Published As

Publication number Publication date
CN112241455B (zh) 2021-05-04
CN112241455A (zh) 2021-01-19

Similar Documents

Publication Publication Date Title
WO2022126797A1 (zh) 基于多层级知识蒸馏预训练语言模型自动压缩方法及平台
WO2022141754A1 (zh) 一种卷积神经网络通用压缩架构的自动剪枝方法及平台
US11501171B2 (en) Method and platform for pre-trained language model automatic compression based on multilevel knowledge distillation
US11941522B2 (en) Address information feature extraction method based on deep neural network model
CN111291836B (zh) 一种生成学生网络模型的方法
Chen et al. Adabert: Task-adaptive bert compression with differentiable neural architecture search
WO2022126683A1 (zh) 面向多任务的预训练语言模型自动压缩方法及平台
CN107844469B (zh) 基于词向量查询模型的文本简化方法
US20240177047A1 (en) Knowledge grap pre-training method based on structural context infor
CN109885756B (zh) 基于cnn和rnn的序列化推荐方法
CN112000772B (zh) 面向智能问答基于语义特征立方体的句子对语义匹配方法
US20220188658A1 (en) Method for automatically compressing multitask-oriented pre-trained language model and platform thereof
CN111274375A (zh) 一种基于双向gru网络的多轮对话方法及系统
KR102592585B1 (ko) 번역 모델 구축 방법 및 장치
CN111353313A (zh) 基于进化神经网络架构搜索的情感分析模型构建方法
CN112347756A (zh) 一种基于序列化证据抽取的推理阅读理解方法及系统
CN113157919A (zh) 语句文本方面级情感分类方法及系统
CN111882042A (zh) 用于液体状态机的神经网络架构自动搜索方法、系统及介质
CN116543289B (zh) 一种基于编码器-解码器及Bi-LSTM注意力模型的图像描述方法
CN117539977A (zh) 一种语言模型的训练方法及装置
CN116822593A (zh) 一种基于硬件感知的大规模预训练语言模型压缩方法
CN115424663B (zh) 一种基于attention的双向表示模型的RNA修饰位点预测方法
CN115455162A (zh) 层次胶囊与多视图信息融合的答案句子选择方法与装置
CN113849641A (zh) 一种跨领域层次关系的知识蒸馏方法和系统
Wang et al. AutoDistiller: An Automatic Compression Method for Multi-task Language Models

Legal Events

Date Code Title Description
121 Ep: the epo has been informed by wipo that ep was designated in this application

Ref document number: 20965809

Country of ref document: EP

Kind code of ref document: A1

ENP Entry into the national phase

Ref document number: 202214215

Country of ref document: GB

Kind code of ref document: A

Free format text: PCT FILING DATE = 20201231

ENP Entry into the national phase

Ref document number: 2022566730

Country of ref document: JP

Kind code of ref document: A

NENP Non-entry into the national phase

Ref country code: DE

122 Ep: pct application non-entry in european phase

Ref document number: 20965809

Country of ref document: EP

Kind code of ref document: A1

122 Ep: pct application non-entry in european phase

Ref document number: 20965809

Country of ref document: EP

Kind code of ref document: A1

32PN Ep: public notification in the ep bulletin as address of the adressee cannot be established

Free format text: NOTING OF LOSS OF RIGHTS PURSUANT TO RULE 112(1) EPC (EPO FORM 1205A DATED 061223)