WO2022088444A1 - Meta-knowledge fine-tuning method and platform for multi-task language models - Google Patents

Meta-knowledge fine-tuning method and platform for multi-task language models (Download PDF)

Info

Publication number
WO2022088444A1
WO2022088444A1 (PCT/CN2020/138014, CN2020138014W)
Authority
WO
WIPO (PCT)
Prior art keywords
task
meta
model
knowledge
fine
Prior art date
Application number
PCT/CN2020/138014
Other languages
English (en)
French (fr)
Inventor
王宏升
单海军
胡胜健
Original Assignee
之江实验室
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by 之江实验室 filed Critical 之江实验室
Priority to GB2214177.4A priority Critical patent/GB2609768A/en
Priority to JP2022567027A priority patent/JP7283836B2/ja
Priority to US17/531,813 priority patent/US11354499B2/en
Publication of WO2022088444A1 publication Critical patent/WO2022088444A1/zh

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/30 Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
    • G06F16/35 Clustering; Classification
    • G06F16/355 Class or cluster creation or modification
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00 Handling natural language data
    • G06F40/30 Semantic analysis
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N5/00 Computing arrangements using knowledge-based models
    • G06N5/04 Inference or reasoning models

Definitions

  • The invention belongs to the field of language model compression, and in particular relates to a meta-knowledge fine-tuning method and platform for multi-task language models.
  • The purpose of the present invention is to provide a meta-knowledge fine-tuning method and platform for multi-task language models that addresses the deficiencies of the prior art.
  • The present invention proposes cross-domain typicality score learning, uses this method to obtain highly transferable shared knowledge across different datasets of the same kind of task, and introduces "meta-knowledge" to interrelate and mutually reinforce the learning processes of the same kind of task on the different domains corresponding to different datasets, improving the fine-tuning effect of similar downstream tasks on datasets of different domains in smart-city applications of language models and improving the parameter initialization and generalization ability of a general language model for tasks of the same kind.
  • A meta-knowledge fine-tuning method for multi-task language models includes the following stages:
  • In the first stage, the class prototypes of the cross-domain datasets of the same task are computed: from the datasets of the same kind of task in different domains, the embedding features of the prototypes of the corresponding domains are learned, and the average embedding feature of all input texts of the same kind of task in each domain is taken as the multi-domain class prototype of the corresponding task.
  • In the second stage, the typicality score of each instance is computed: d_self denotes the distance between the embedding feature of each instance and the prototype of its own domain, and d_others denotes the distance between the embedding feature of each instance and the prototypes of the other domains; the typicality score of each instance is defined as a linear combination of d_self and d_others.
  • In the third stage, a meta-knowledge fine-tuning network based on the typicality score is built: the typicality score obtained in the second stage is used as the weight coefficient of the meta-knowledge fine-tuning network, and a multi-task typicality-sensitive label classification loss function is designed as the learning objective of meta-knowledge fine-tuning; this loss function penalizes the labels of instances, from all domains, that the text classifier mispredicts.
  • M is the set of all class labels in the dataset; x_{k,i}^m is the i-th instance with label m in the k-th domain.
  • ε(·) denotes the embedding representation output by the BERT model.
  • α is a predefined balance factor with 0 < α < 1; cos(·,·) is the cosine similarity measure function; K is the number of domains; 1(·) is the indicator function, which returns 1 if its condition holds and 0 otherwise, with the index j used for the summation; β_m > 0 is a per-class weight, and instances of the same class share the same weight.
  • D denotes the set of all domains; 1(·) is the indicator function defined as above; p(m | x_{k,i}) denotes the predicted probability that the class label of x_{k,i} is m, computed from the embedding-layer feature of the "[CLS]" token output by the last layer of the BERT model.
  • a meta-knowledge fine-tuning platform for multi-task language models including the following components:
  • Data loading component used to obtain training samples of a multi-task-oriented pre-trained language model, where the training samples are labeled text samples that satisfy supervised learning tasks;
  • Automatic compression component used to automatically compress a multi-task-oriented pre-trained language model, including a pre-trained language model and a meta-knowledge fine-tuning module; wherein the meta-knowledge fine-tuning module is used for the pre-trained language generated by the automatic compression component
  • the downstream task network is constructed on the model, and the meta-knowledge of the typical score is used to fine-tune the downstream task scene, and the final fine-tuned student model is output, that is, the pre-trained language model compression model containing the downstream tasks required by the landing user; the compressed model is output Go to the specified container for the login user to download, and present the comparison information of the model size before and after compression;
  • the login user obtains the pre-trained language model compression model from the platform, and the user uses the compression model output by the automatic compression component to infer the new data of the natural language processing downstream task uploaded by the login user on the data set of the actual scene, and Presents comparison information of inference speed before and after compression.
  • the present invention studies a multi-task language model-oriented meta-knowledge fine-tuning method based on cross-domain typicality score learning, the fine-tuning method of the downstream task-oriented pre-training language model is to fine-tune on the downstream task cross-domain data set,
  • the effect of the compressed model obtained by fine-tuning is not limited to the specific dataset of this type of task.
  • the downstream task is fine-tuned through the meta-knowledge fine-tuning network, thereby obtaining similar downstream tasks that are independent of the dataset. language model;
  • the present invention proposes to learn highly transferable shared knowledge on different datasets of the same task, namely meta-knowledge; introducing meta-knowledge, and the meta-knowledge fine-tuning network combines the learning process on different domains corresponding to different datasets of the same task Interrelated and mutually reinforcing, improve the fine-tuning effect of similar downstream tasks on different domain datasets in the application of language models in the smart city domain, improve the parameter initialization and generalization capabilities of general language models for similar tasks, and finally obtain similar downstream task languages.
  • Model
  • the multi-task language model-oriented meta-knowledge fine-tuning platform of the present invention generates a general architecture for similar task language models, makes full use of the fine-tuned model architecture to improve the compression efficiency of downstream similar tasks, and can convert large-scale natural
  • the language processing model is deployed on end-side devices such as small memory and limited resources, which promotes the implementation of general-purpose deep language models in the industry.
  • FIG. 1 is an overall architecture diagram of the meta-knowledge fine-tuning method of the present invention.
  • The meta-knowledge fine-tuning method and platform for multi-task language models of the present invention performs cross-domain typicality score learning on the multi-domain downstream-task datasets of the pre-trained language model and uses the meta-knowledge of typicality scores to fine-tune downstream-task scenarios, so that the meta-learner can be fine-tuned to any domain relatively easily; the learned knowledge is highly generalizable and transferable rather than limited to a specific domain.
  • The effect of the compressed model is therefore suited to data scenarios of the same kind of task in different domains.
  • The meta-knowledge fine-tuning method for multi-task language models of the present invention specifically includes the following steps:
  • Step 1, compute the class prototypes of the cross-domain datasets of the same task: considering that multi-domain class prototypes can summarize the key semantic features of the corresponding training datasets, the embedding features of the prototypes of the corresponding domains of this kind of task are learned from the datasets of the different domains, generating multi-domain class prototypes for the same kind of task. Specifically, for the BERT language model, the average embedding feature of all input texts of the same task in a given domain is taken as the class prototype corresponding to that task, where the average embedding feature is the output of the average pooling over the last-layer Transformer encoder corresponding to the current input instance.
  • The average embedding feature of all input texts in the k-th domain D_k is taken as the class prototype corresponding to that domain.
  • The class prototype is the average pooling of the last-layer Transformer encoder output of the BERT model over the corresponding inputs, computed with ε(·), which maps an input to its d-dimensional embedding feature.
  • Step 2, compute the typicality score of each training instance: if a training instance is semantically close to the class prototype of its own domain and is not too far from the class prototypes generated by the other domains, the instance is considered typical and highly transferable.
  • The semantics of a training instance should include not only its association features with its own domain but also its association features with the other domains.
  • A typical training instance is characterized by a linear combination of the above two association features. Specifically, d_self denotes the distance between the embedding feature of each training instance and the prototype of its own domain, d_others denotes the distance between the embedding feature of each training instance and the prototypes of the other domains, and the typicality score of each training instance is defined as a linear combination of d_self and d_others.
  • The single class prototype above is further extended to a class prototype of a category generated by clustering multiple prototypes.
  • For example, in sentiment polarity classification the possible polarities include positive, negative, neutral and conflict.
  • The generic class prototype corresponding to a category can then be generated by clustering over multiple different datasets.
  • The association feature of each training instance with its own domain is the cosine similarity distance between the training instance and the prototype of its own domain.
  • The association feature of each training instance with the other domains is the cosine similarity distance between the training instance and the class prototypes generated by the other domains.
  • The typicality score of a training instance is computed from these two association features, where α is a predefined balance factor with 0 < α < 1, cos(·,·) is the cosine similarity measure function, and 1(·) is the indicator function, which returns 1 if the input Boolean expression is true and 0 otherwise.
  • Step 3, meta-knowledge fine-tuning network based on the typicality score:
  • The present invention designs a multi-task typicality-sensitive label classification loss function based on cross-domain typical instance features. This loss function penalizes the labels of typical instances, from all K domains, that the text classifier mispredicts.
  • The typicality score obtained in the second stage is used as the weight coefficient of the meta-knowledge fine-tuning network.
  • The learning objective function of the meta-knowledge fine-tuning network is defined accordingly:
  • L_T is the multi-task typicality-sensitive label classification loss function that penalizes the labels of typical instances of all K domains that the text classifier mispredicts; the typicality score is the weight of each training instance, and p(m | x) is the probability that the predicted class label of an instance x is m ∈ M, computed using the embedding of the d-dimensional "[CLS]" token of the last BERT layer as the feature.
  • The present invention also provides a meta-knowledge fine-tuning platform for multi-task language models, comprising the following components:
  • Data loading component: used to obtain training samples of the multi-task-oriented pre-trained language model, where the training samples are labeled text samples that satisfy supervised learning tasks.
  • Automatic compression component: used to automatically compress multi-task-oriented pre-trained language models, comprising the pre-trained language model and a meta-knowledge fine-tuning module.
  • The meta-knowledge fine-tuning module builds a downstream-task network on the pre-trained language model generated by the automatic compression component, uses the meta-knowledge of typicality scores to fine-tune the downstream-task scenario, and outputs the final fine-tuned student model, that is, the compressed pre-trained language model containing the downstream task required by the logged-in user; the compressed model is output to a designated container for the logged-in user to download, and comparison information on the model size before and after compression is presented on the platform's compressed-model output page.
  • Inference component: the logged-in user obtains the pre-trained compressed model from the platform and uses the compressed model output by the automatic compression component to run inference, on datasets of the actual scenario, over new data of the natural language processing downstream task uploaded by the logged-in user; the comparison of inference speed before and after compression is presented on the platform's compressed-model inference page.
  • The natural language inference task is to determine, given a pair of sentences, whether the two sentences are semantically similar, contradictory, or neutral. Since it is also a classification problem, it is also called a sentence-pair classification problem.
  • The MNLI dataset provides training examples from multiple domains, the goal being to infer whether two sentences are similar in meaning, contradictory, or unrelated.
  • The BERT pre-trained model generated by the automatic compression component is loaded, and a model for the natural language inference task is built on the generated pre-trained model; fine-tuning is performed based on the student model obtained from the meta-knowledge fine-tuning module of the automatic compression component: a downstream-task network is built on the pre-trained language model, the meta-knowledge of typicality scores is used to fine-tune the downstream-task scenario, and the final fine-tuned student model is output, that is, the compressed pre-trained language model containing the natural language inference task required by the logged-in user.
  • The compressed model is output to a designated container for the logged-in user to download, and 5%, 10%, and 20% of the data in each domain are randomly sampled from the training data for meta-knowledge fine-tuning.
  • Table 1: Comparison of the BERT model on the natural language inference task before and after meta-knowledge fine-tuning.

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • Computational Linguistics (AREA)
  • Artificial Intelligence (AREA)
  • Data Mining & Analysis (AREA)
  • Databases & Information Systems (AREA)
  • Evolutionary Computation (AREA)
  • Computing Systems (AREA)
  • Mathematical Physics (AREA)
  • Software Systems (AREA)
  • Health & Medical Sciences (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • General Health & Medical Sciences (AREA)
  • Machine Translation (AREA)

Abstract

A meta-knowledge fine-tuning method and platform for multi-task language models. Based on cross-domain typicality score learning, the method obtains highly transferable shared knowledge, namely meta-knowledge, on different datasets of the same kind of task, and interrelates and mutually reinforces the learning processes of the same kind of task on the different domains corresponding to the different datasets, improving the fine-tuning effect of similar downstream tasks on datasets of different domains in language model applications and improving the parameter initialization and generalization ability of a general language model for tasks of the same kind. The method fine-tunes on cross-domain datasets of the downstream task, so the effect of the resulting compressed model is not limited to a specific dataset of that kind of task; on the basis of the pre-trained language model, the downstream task is fine-tuned through a meta-knowledge fine-tuning network, thereby obtaining a dataset-independent language model for downstream tasks of the same kind.

Description

Meta-knowledge fine-tuning method and platform for multi-task language models
Technical Field
The present invention belongs to the field of language model compression, and in particular relates to a meta-knowledge fine-tuning method and platform for multi-task language models.
Background Art
Automatic compression techniques for large-scale pre-trained language models have achieved remarkable results in natural language understanding and generation tasks. However, for downstream tasks in the smart-city domain, re-fine-tuning the large model on a task-specific dataset is still a key step for improving the compression effect. Existing fine-tuning methods for downstream-task-oriented language models fine-tune on a dataset specific to the downstream task, so the effect of the resulting compressed model is limited to that specific dataset of the task.
Summary of the Invention
The purpose of the present invention is to provide a meta-knowledge fine-tuning method and platform for multi-task language models that addresses the deficiencies of the prior art. The present invention proposes cross-domain typicality score learning, uses this method to obtain highly transferable shared knowledge on different datasets of the same kind of task, and introduces "meta-knowledge" to interrelate and mutually reinforce the learning processes of the same kind of task on the different domains corresponding to different datasets, improving the fine-tuning effect of similar downstream tasks on datasets of different domains in smart-city applications of language models and improving the parameter initialization and generalization ability of a general language model for tasks of the same kind.
The purpose of the present invention is achieved by the following technical solution: a meta-knowledge fine-tuning method for multi-task language models, comprising the following stages:
In the first stage, the class prototypes of the cross-domain datasets of the same task are computed: from the datasets of the same kind of task in different domains, the embedding features of the prototypes of the corresponding domains are learned, and the average embedding feature of all input texts of the same kind of task in each domain is taken as the multi-domain class prototype of the corresponding task;
In the second stage, the typicality score of each instance is computed: d_self denotes the distance between the embedding feature of each instance and the prototype of its own domain, and d_others denotes the distance between the embedding feature of each instance and the prototypes of the other domains; the typicality score of each instance is defined as a linear combination of d_self and d_others;
In the third stage, a meta-knowledge fine-tuning network based on the typicality score is built: the typicality score obtained in the second stage is used as the weight coefficient of the meta-knowledge fine-tuning network, and a multi-task typicality-sensitive label classification loss function is designed as the learning objective of meta-knowledge fine-tuning; this loss function penalizes the labels of instances, from all domains, that the text classifier mispredicts.
Further, in the first stage, X_k^m is used to denote the set of input texts x_{k,i}^m whose class label is m in the k-th domain D_k of the dataset:

$$X_k^m = \{x_{k,i}^m\}, \quad m \in M,$$

where M is the set of all class labels in the dataset and x_{k,i}^m is the i-th instance in the k-th domain. The class prototype c_k^m is the average embedding feature of all input texts with class label m in the k-th domain:

$$c_k^m = \frac{1}{|X_k^m|}\sum_{x_{k,i}^m \in X_k^m} \varepsilon\big(x_{k,i}^m\big),$$

where ε(·) denotes the embedding representation of x_{k,i}^m output by the BERT model; for the BERT model, the average embedding feature is the average pooling of the last-layer Transformer encoder output corresponding to the input x_{k,i}^m.
Further, in the second stage, the typicality score t_{k,i}^m of an instance x_{k,i}^m is:

$$t_{k,i}^m = \alpha\cdot\cos\big(\varepsilon(x_{k,i}^m),\,c_k^m\big) + (1-\alpha)\cdot\frac{1}{K-1}\sum_{j=1}^{K}\mathbb{1}(j\neq k)\,\beta_m\,\cos\big(\varepsilon(x_{k,i}^m),\,c_j^m\big),$$

where α is a predefined balance factor with 0 < α < 1; cos(·,·) is the cosine similarity measure function; K is the number of domains; 1(j ≠ k) is the indicator function, which returns 1 if j ≠ k and returns 0 if j = k, with the index j used for the summation; β_m > 0 is the weight associated with class label m, and prototypes of the same class share the same weight.
Further, in the third stage, the multi-task typicality-sensitive label classification loss function L_T is:

$$L_T = -\sum_{k\in D}\;\sum_{x_{k,i}\in D_k}\;\sum_{m\in M} t_{k,i}\;\mathbb{1}\big(y_{k,i}=m\big)\,\log p\big(m \mid h_{k,i}\big),$$

where D denotes the set of all domains; 1(y_{k,i} = m) is the indicator function, which returns 1 if the true label y_{k,i} of instance x_{k,i} is m and returns 0 otherwise; p(m | h_{k,i}) denotes the predicted probability that the class label of x_{k,i} is m; and h_{k,i} denotes the embedding-layer feature of the "[CLS]" token output by the last layer of the BERT model.
A meta-knowledge fine-tuning platform for multi-task language models comprises the following components:
Data loading component: used to obtain training samples of the multi-task-oriented pre-trained language model, where the training samples are labeled text samples that satisfy supervised learning tasks;
Automatic compression component: used to automatically compress the multi-task-oriented pre-trained language model, comprising the pre-trained language model and a meta-knowledge fine-tuning module; wherein the meta-knowledge fine-tuning module is used to build a downstream-task network on the pre-trained language model generated by the automatic compression component, use the meta-knowledge of typicality scores to fine-tune the downstream-task scenario, and output the final fine-tuned student model, that is, the compressed pre-trained language model containing the downstream task required by the logged-in user; the compressed model is output to a designated container for the logged-in user to download, and comparison information on the model size before and after compression is presented;
Inference component: the logged-in user obtains the compressed pre-trained language model from the platform, and the user uses the compressed model output by the automatic compression component to run inference, on datasets of the actual scenario, over new data of the natural language processing downstream task uploaded by the logged-in user, and comparison information on the inference speed before and after compression is presented.
The beneficial effects of the present invention are as follows:
(1) The present invention studies a meta-knowledge fine-tuning method for multi-task language models based on cross-domain typicality score learning. This fine-tuning method for the downstream-task-oriented pre-trained language model fine-tunes on cross-domain datasets of the downstream task, so the effect of the compressed model obtained by fine-tuning is not limited to a specific dataset of that kind of task; on the basis of the pre-trained language model, the downstream task is fine-tuned through the meta-knowledge fine-tuning network, thereby obtaining a dataset-independent language model for downstream tasks of the same kind;
(2) The present invention proposes to learn highly transferable shared knowledge, namely meta-knowledge, on different datasets of the same kind of task. By introducing meta-knowledge, the meta-knowledge fine-tuning network interrelates and mutually reinforces the learning processes on the different domains corresponding to different datasets of the same task, improves the fine-tuning effect of similar downstream tasks on datasets of different domains in smart-city applications of language models, and improves the parameter initialization and generalization ability of general language models for tasks of the same kind, finally obtaining a language model for downstream tasks of the same kind;
(3) The meta-knowledge fine-tuning platform for multi-task language models of the present invention generates a general architecture for language models of the same kind of task, makes full use of the already fine-tuned model architecture to improve the compression efficiency of similar downstream tasks, and allows large-scale natural language processing models to be deployed on end-side devices with small memory and limited resources, promoting the industrial deployment of general-purpose deep language models.
Brief Description of the Drawings
FIG. 1 is an overall architecture diagram of the meta-knowledge fine-tuning method of the present invention.
Detailed Description of the Embodiments
As shown in FIG. 1, the meta-knowledge fine-tuning method and platform for multi-task language models of the present invention performs cross-domain typicality score learning on the multi-domain downstream-task datasets of the pre-trained language model and uses the meta-knowledge of typicality scores to fine-tune downstream-task scenarios, so that the meta-learner can be fine-tuned to any domain relatively easily; the learned knowledge is highly generalizable and transferable rather than limited to a specific domain, and the effect of the resulting compressed model is suited to data scenarios of the same kind of task in different domains.
The meta-knowledge fine-tuning method for multi-task language models of the present invention specifically comprises the following steps:
Step 1: compute the class prototypes of the cross-domain datasets of the same task. Considering that multi-domain class prototypes can summarize the key semantic features of the corresponding training datasets, the embedding features of the prototypes of the corresponding domains of this kind of task are learned from the datasets of the different domains, generating multi-domain class prototypes for the same kind of task. Specifically, for the BERT language model, the average embedding feature of all input texts of the same task in a given domain is taken as the class prototype corresponding to that task, where the average embedding feature is the output of the average pooling over the last-layer Transformer encoder corresponding to the current input instance.
Step (1.1): define the cross-domain datasets. Define the set of classes of the input instances as M, and define the set of all input-text instances x_{k,i}^m with the m-th class label in the k-th domain as X_k^m = {x_{k,i}^m}, where m ∈ M.
Step (1.2): define the class prototype. The average embedding feature of all input texts of the k-th domain D_k is taken as the class prototype corresponding to that domain.
Step (1.3): compute the class prototype. The class prototype c_k^m is the average pooling of the last-layer Transformer encoder of the BERT model over the inputs x_{k,i}^m, computed as:

$$c_k^m = \frac{1}{|X_k^m|}\sum_{x_{k,i}^m \in X_k^m} \varepsilon\big(x_{k,i}^m\big),$$

where ε(·) denotes the mapping of x_{k,i}^m to its d-dimensional embedding feature.
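As an illustration of Step (1.3), the following minimal sketch computes such class prototypes with a Hugging Face BERT encoder, using mean pooling over the last hidden layer as ε(·); the helper names, the nested dictionary layout, and the choice of checkpoint are illustrative assumptions rather than details taken from the patent.

```python
import torch
from transformers import BertTokenizer, BertModel

tokenizer = BertTokenizer.from_pretrained("bert-base-chinese")  # checkpoint is an assumption
encoder = BertModel.from_pretrained("bert-base-chinese")

def embed(texts):
    """epsilon(x): mean-pool the last Transformer layer over non-padding tokens."""
    batch = tokenizer(texts, padding=True, truncation=True, return_tensors="pt")
    with torch.no_grad():
        hidden = encoder(**batch).last_hidden_state          # (B, T, d)
    mask = batch["attention_mask"].unsqueeze(-1).float()      # (B, T, 1)
    return (hidden * mask).sum(dim=1) / mask.sum(dim=1)       # (B, d)

def class_prototypes(domains):
    """domains: {k: {m: [texts]}} -> {k: {m: prototype c_k^m of shape (d,)}}"""
    return {
        k: {m: embed(texts).mean(dim=0) for m, texts in labels.items()}
        for k, labels in domains.items()
    }
```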
Step 2: compute the typicality scores of the training instances. If a training instance is semantically close to the class prototype of its own domain and is not too far from the class prototypes generated by the other domains, the instance is considered typical and highly transferable. The semantics of a training instance should include not only its association features with its own domain but also its association features with the other domains, and a typical training instance is characterized by a linear combination of these two association features. Specifically, d_self denotes the distance between the embedding feature of each training instance and the prototype of its own domain, d_others denotes the distance between the embedding feature of each training instance and the prototypes of the other domains, and the typicality score of each training instance is defined as a linear combination of d_self and d_others.
Since one prototype may be insufficient to represent the complex semantic information of a category, the single class prototype above is further extended to a class prototype of a category generated by clustering multiple prototypes. For example, in the sentiment polarity classification problem in natural language, i.e. determining the sentiment polarity of a sentence, the possible polarities include positive, negative, neutral and conflict; for the polarity classification task over all sentiments, the class prototype of the positive category can be computed by clustering over multiple different datasets to generate the generic class prototype corresponding to that category.
Step (2.1): compute the association feature of a training instance with its own domain. The association feature of each training instance with its own domain is the cosine similarity distance between the training instance x_{k,i}^m and the prototype c_k^m of its own domain, i.e. cos(ε(x_{k,i}^m), c_k^m).
Step (2.2): compute the association feature of a training instance with the other domains. The association feature of each training instance with the other domains is the cosine similarity distance between the training instance x_{k,i}^m and the class prototypes c_j^m (j ≠ k) generated by the other domains, i.e. cos(ε(x_{k,i}^m), c_j^m).
Step (2.3): compute the feature score of a typical training instance. The feature score of a typical training instance x_{k,i}^m is:

$$t_{k,i}^m = \alpha\cdot\cos\big(\varepsilon(x_{k,i}^m),\,c_k^m\big) + (1-\alpha)\cdot\frac{1}{K-1}\sum_{j=1}^{K}\mathbb{1}(j\neq k)\,\cos\big(\varepsilon(x_{k,i}^m),\,c_j^m\big),$$

where α is a predefined balance factor with 0 < α < 1, cos(·,·) is the cosine similarity measure function, and 1(·) is the indicator function, which returns 1 if the input Boolean expression is true and returns 0 otherwise.
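As a concrete reading of Step (2.3), the sketch below computes the single-prototype typicality score as the α-weighted combination of the similarity to the instance's own domain prototype and the average similarity to the prototypes of the other domains; the prototype dictionary follows the Step 1 sketch, and the default α is an illustrative choice.

```python
import torch
import torch.nn.functional as F

def typicality_score(x_emb, m, k, prototypes, alpha=0.5):
    """x_emb: (d,) embedding of instance x_{k,i}^m; prototypes: {domain: {label: (d,)}}."""
    d_self = F.cosine_similarity(x_emb, prototypes[k][m], dim=0)
    d_others = [
        F.cosine_similarity(x_emb, protos[m], dim=0)
        for j, protos in prototypes.items() if j != k
    ]
    d_others = torch.stack(d_others).mean() if d_others else torch.zeros(())
    return alpha * d_self + (1.0 - alpha) * d_others
```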
Step (2.4): compute the feature score of a typical training instance based on multiple prototypes. Considering that one prototype may be insufficient to represent the complex semantic information of a category, multiple prototypes are generated by clustering, and the class prototype of a category is computed from the multiple prototypes of the same class. Accordingly, the feature score t_{k,i} of instance x_{k,i} is extended to be computed over the multiple cluster-generated prototypes of each class, where β_m > 0 is the weight of the cluster members of instance x_{k,i} for each class label m ∈ M.
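Where a single prototype is too coarse, the multi-prototype extension of Step (2.4) can be approximated by clustering the embeddings of one class pooled over several datasets and keeping the cluster centroids as that class's prototypes; the sketch below uses scikit-learn's KMeans, and the number of clusters is an illustrative assumption.

```python
import torch
from sklearn.cluster import KMeans

def multi_prototypes(class_embeddings, n_clusters=3):
    """class_embeddings: (N, d) embeddings of one class, pooled over datasets/domains.
    Returns the (n_clusters, d) centroids used as that class's multiple prototypes."""
    km = KMeans(n_clusters=n_clusters, n_init=10, random_state=0)
    km.fit(class_embeddings.cpu().numpy())
    return torch.from_numpy(km.cluster_centers_).float()
```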
Step 3: meta-knowledge fine-tuning network based on the typicality score. Next, based on the typicality feature scores computed above, the learning objective function of meta-knowledge fine-tuning is designed. The present invention designs a multi-task typicality-sensitive label classification loss function based on cross-domain typical instance features. This loss function penalizes the labels of typical instances, from all K domains, that the text classifier mispredicts. Specifically, the typicality score obtained in the second stage is used as the weight coefficient of the meta-knowledge fine-tuning network. The learning objective function of the meta-knowledge fine-tuning network is defined as:

$$L_T = -\sum_{k\in D}\;\sum_{x_{k,i}\in D_k}\;\sum_{m\in M} t_{k,i}\;\mathbb{1}\big(y_{k,i}=m\big)\,\log p\big(m \mid h_{k,i}\big),$$

where L_T is the multi-task typicality-sensitive label classification loss function, which penalizes the labels of typical instances of all K domains that the text classifier mispredicts; t_{k,i} is the weight of each training instance; and p(m | h_{k,i}) is the probability that the predicted class label of instance x_{k,i} is m ∈ M, where the embedding-layer feature h_{k,i} of the d-dimensional "[CLS]" token of the last BERT layer is used as the feature.
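As one concrete reading of this objective, the sketch below implements a typicality-weighted cross-entropy over instances pooled from all K domains, with the typicality scores serving as per-instance weights; it is an interpretation of the loss described above, not code reproduced from the patent.

```python
import torch
import torch.nn.functional as F

def typicality_sensitive_loss(logits, labels, typicality):
    """logits: (N, |M|) class scores computed from the "[CLS]" feature;
    labels: (N,) gold labels y_{k,i}; typicality: (N,) scores t_{k,i} used as weights."""
    nll = F.cross_entropy(logits, labels, reduction="none")   # -log p(y_{k,i} | x_{k,i})
    return (typicality * nll).mean()
```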
The meta-knowledge fine-tuning platform for multi-task language models of the present invention comprises the following components:
Data loading component: used to obtain training samples of the multi-task-oriented pre-trained language model, where the training samples are labeled text samples that satisfy supervised learning tasks.
Automatic compression component: used to automatically compress the multi-task-oriented pre-trained language model, comprising the pre-trained language model and a meta-knowledge fine-tuning module.
The meta-knowledge fine-tuning module builds a downstream-task network on the pre-trained language model generated by the automatic compression component, uses the meta-knowledge of typicality scores to fine-tune the downstream-task scenario, and outputs the final fine-tuned student model, that is, the compressed pre-trained language model containing the downstream task required by the logged-in user; the compressed model is output to a designated container for the logged-in user to download, and comparison information on the model size before and after compression is presented on the platform's compressed-model output page.
Inference component: the logged-in user obtains the pre-trained compressed model from the platform and uses the compressed model output by the automatic compression component to run inference, on datasets of the actual scenario, over new data of the natural language processing downstream task uploaded by the logged-in user; the comparison of inference speed before and after compression is presented on the platform's compressed-model inference page.
The technical solution of the present invention is further described in detail below with the natural language inference task in application scenarios such as intelligent question answering, intelligent customer service and multi-turn dialogue.
The natural language inference task is to determine, given a pair of sentences, whether the semantics of the two sentences are similar, contradictory, or neutral. Since it is also a classification problem, it is also called a sentence-pair classification problem. The MNLI dataset provides training examples from multiple domains, the goal being to infer whether two sentences are similar in meaning, contradictory, or unrelated. The BERT model and the MNLI dataset of the natural language inference task uploaded by the logged-in user are obtained through the data loading component of the platform; the multi-task-oriented BERT pre-trained language model is generated through the automatic compression component of the platform; the BERT pre-trained model generated by the automatic compression component is loaded through the platform, and a model for the natural language inference task is built on the generated pre-trained model; fine-tuning is performed based on the student model obtained from the meta-knowledge fine-tuning module of the automatic compression component, a downstream-task network is built on the pre-trained language model, the meta-knowledge of typicality scores is used to fine-tune the downstream-task scenario, and the final fine-tuned student model is output, that is, the compressed pre-trained language model containing the natural language inference task required by the logged-in user; the compressed model is output to a designated container for the logged-in user to download, 5%, 10%, and 20% of the data of each domain are randomly sampled from the training data for meta-knowledge fine-tuning, and comparison information on model accuracy before and after fine-tuning is presented on the platform's compressed-model output page, as shown in Table 1 below.
Table 1: Comparison of the BERT model on the natural language inference task before and after meta-knowledge fine-tuning

Method                               Animals   Plants   Vehicles   Average
Before meta-knowledge fine-tuning    93.6%     91.8%    84.2%      89.3%
After meta-knowledge fine-tuning     94.5%     92.3%    90.2%      92.3%
As can be further seen from Table 1, through the inference component of the platform, the compressed model output by the platform is used to run inference on the MNLI test-set data uploaded by the logged-in user, and the platform's compressed-model inference page shows that, compared with before meta-knowledge fine-tuning, the inference accuracy after meta-knowledge fine-tuning is improved by 0.9%, 0.5% and 6.0% in the animal, plant and vehicle domains, respectively.
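Putting the pieces together, a compact sketch of the fine-tuning loop for such a multi-domain classification task is given below; it reuses the tokenizer, encoder and typicality_sensitive_loss helpers sketched above, and the three-class head, optimizer, learning rate and batch iterator are illustrative assumptions rather than settings disclosed in the patent.

```python
import torch
from torch.optim import AdamW

num_labels = 3  # e.g. entailment / contradiction / neutral for an MNLI-style task
classifier = torch.nn.Linear(encoder.config.hidden_size, num_labels)
optimizer = AdamW(list(encoder.parameters()) + list(classifier.parameters()), lr=2e-5)

encoder.train()
for texts, labels, typicality in cross_domain_batches:  # batches pooled over all K domains
    batch = tokenizer(texts, padding=True, truncation=True, return_tensors="pt")
    cls_feature = encoder(**batch).last_hidden_state[:, 0]   # "[CLS]" token of the last layer
    loss = typicality_sensitive_loss(classifier(cls_feature), labels, typicality)
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
```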

Claims (5)

  1. A meta-knowledge fine-tuning method for multi-task language models, characterized in that it comprises the following stages:
    In the first stage, the class prototypes of the cross-domain datasets of the same task are computed: from the datasets of the same kind of task in different domains, the embedding features of the prototypes of the corresponding domains are learned, and the average embedding feature of all input texts of the same kind of task in each domain is taken as the multi-domain class prototype of the corresponding task;
    In the second stage, the typicality score of each instance is computed: d_self denotes the distance between the embedding feature of each instance and the prototype of its own domain, and d_others denotes the distance between the embedding feature of each instance and the prototypes of the other domains; the typicality score of each instance is defined as a linear combination of d_self and d_others;
    In the third stage, a meta-knowledge fine-tuning network based on the typicality score is built: the typicality score obtained in the second stage is used as the weight coefficient of the meta-knowledge fine-tuning network, and a multi-task typicality-sensitive label classification loss function is designed as the learning objective of meta-knowledge fine-tuning; this loss function penalizes the labels of instances, from all domains, that the language model mispredicts.
  2. The meta-knowledge fine-tuning method for multi-task language models according to claim 1, characterized in that, in the first stage, X_k^m is used to denote the set of input texts x_{k,i}^m whose class label is m in the k-th domain D_k of the dataset:

    $$X_k^m = \{x_{k,i}^m\}, \quad m \in M,$$

    where M is the set of all class labels in the dataset and x_{k,i}^m is the i-th instance in the k-th domain; the class prototype c_k^m is the average embedding feature of all input texts with class label m in the k-th domain:

    $$c_k^m = \frac{1}{|X_k^m|}\sum_{x_{k,i}^m \in X_k^m} \varepsilon\big(x_{k,i}^m\big),$$

    where ε(·) denotes the embedding representation of x_{k,i}^m output by the BERT model; for the BERT model, the average embedding feature is the average pooling of the last-layer Transformer encoder output corresponding to the input x_{k,i}^m.
  3. The meta-knowledge fine-tuning method for multi-task language models according to claim 2, characterized in that, in the second stage, the typicality score t_{k,i}^m of an instance x_{k,i}^m is:

    $$t_{k,i}^m = \alpha\cdot\cos\big(\varepsilon(x_{k,i}^m),\,c_k^m\big) + (1-\alpha)\cdot\frac{1}{K-1}\sum_{j=1}^{K}\mathbb{1}(j\neq k)\,\beta_m\,\cos\big(\varepsilon(x_{k,i}^m),\,c_j^m\big),$$

    where α is a predefined balance factor with 0 < α < 1; cos(·,·) is the cosine similarity measure function; K is the number of domains; 1(j ≠ k) is the indicator function, which returns 1 if j ≠ k and returns 0 if j = k, with the index j used for the summation; β_m > 0 is the weight associated with class label m, and prototypes of the same class share the same weight.
  4. The meta-knowledge fine-tuning method for multi-task language models according to claim 3, characterized in that, in the third stage, the multi-task typicality-sensitive label classification loss function L_T is:

    $$L_T = -\sum_{k\in D}\;\sum_{x_{k,i}\in D_k}\;\sum_{m\in M} t_{k,i}\;\mathbb{1}\big(y_{k,i}=m\big)\,\log p\big(m \mid h_{k,i}\big),$$

    where D denotes the set of all domains; 1(y_{k,i} = m) is the indicator function, which returns 1 if the true label y_{k,i} of instance x_{k,i} is m and returns 0 otherwise; p(m | h_{k,i}) denotes the predicted probability that the class label of x_{k,i} is m; and h_{k,i} denotes the embedding-layer feature of the "[CLS]" token output by the last layer of the BERT model.
  5. A platform based on the meta-knowledge fine-tuning method for multi-task language models according to any one of claims 1-4, characterized in that it comprises the following components:
    A data loading component: used to obtain training samples of the multi-task-oriented pre-trained language model, where the training samples are labeled text samples that satisfy supervised learning tasks;
    An automatic compression component: used to automatically compress the multi-task-oriented pre-trained language model, comprising the pre-trained language model and a meta-knowledge fine-tuning module; wherein the meta-knowledge fine-tuning module is used to build a downstream-task network on the pre-trained language model generated by the automatic compression component, use the meta-knowledge of typicality scores to fine-tune the downstream-task scenario, and output the final fine-tuned student model, that is, the compressed pre-trained language model containing the downstream task required by the logged-in user; the compressed model is output to a designated container for the logged-in user to download, and comparison information on the model size before and after compression is presented;
    An inference component: the logged-in user obtains the compressed pre-trained language model from the platform, and the user uses the compressed model output by the automatic compression component to run inference, on datasets of the actual scenario, over new data of the natural language processing downstream task uploaded by the logged-in user, and comparison information on the inference speed before and after compression is presented.
PCT/CN2020/138014 2020-11-02 2020-12-21 Meta-knowledge fine-tuning method and platform for multi-task language models WO2022088444A1 (zh)

Priority Applications (3)

Application Number Priority Date Filing Date Title
GB2214177.4A GB2609768A (en) 2020-11-02 2020-12-21 Multi-task language model-oriented meta-knowledge fine tuning method and platform
JP2022567027A JP7283836B2 (ja) 2020-11-02 2020-12-21 マルチタスク言語モデル向けのメタ知識微調整方法及びプラットフォーム
US17/531,813 US11354499B2 (en) 2020-11-02 2021-11-22 Meta-knowledge fine tuning method and platform for multi-task language model

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
CN202011202867.7 2020-11-02
CN202011202867.7A CN112100383B (zh) 2020-11-02 2020-11-02 一种面向多任务语言模型的元-知识微调方法及平台

Related Child Applications (1)

Application Number Title Priority Date Filing Date
US17/531,813 Continuation US11354499B2 (en) 2020-11-02 2021-11-22 Meta-knowledge fine tuning method and platform for multi-task language model

Publications (1)

Publication Number Publication Date
WO2022088444A1 true WO2022088444A1 (zh) 2022-05-05

Family

ID=73784520

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/CN2020/138014 WO2022088444A1 (zh) 2020-11-02 2020-12-21 一种面向多任务语言模型的元-知识微调方法及平台

Country Status (2)

Country Link
CN (1) CN112100383B (zh)
WO (1) WO2022088444A1 (zh)

Families Citing this family (7)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN112100383B (zh) * 2020-11-02 2021-02-19 之江实验室 一种面向多任务语言模型的元-知识微调方法及平台
JP7283836B2 (ja) 2020-11-02 2023-05-30 之江実験室 マルチタスク言語モデル向けのメタ知識微調整方法及びプラットフォーム
CN112364945B (zh) * 2021-01-12 2021-04-16 之江实验室 一种基于域-不变特征的元-知识微调方法及平台
GB2608344A (en) 2021-01-12 2022-12-28 Zhejiang Lab Domain-invariant feature-based meta-knowledge fine-tuning method and platform
CN113032559B (zh) * 2021-03-15 2023-04-28 新疆大学 一种用于低资源黏着性语言文本分类的语言模型微调方法
CN113987209B (zh) * 2021-11-04 2024-05-24 浙江大学 基于知识指导前缀微调的自然语言处理方法、装置、计算设备和存储介质
CN114780722B (zh) * 2022-03-31 2024-05-14 北京理工大学 一种结合领域通用型语言模型的领域泛化方法

Citations (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US10607598B1 (en) * 2019-04-05 2020-03-31 Capital One Services, Llc Determining input data for speech processing
CN111814448A (zh) * 2020-07-03 2020-10-23 苏州思必驰信息科技有限公司 预训练语言模型量化方法和装置
CN111832282A (zh) * 2020-07-16 2020-10-27 平安科技(深圳)有限公司 融合外部知识的bert模型的微调方法、装置及计算机设备
CN112100383A (zh) * 2020-11-02 2020-12-18 之江实验室 一种面向多任务语言模型的元-知识微调方法及平台

Family Cites Families (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN107767954A (zh) * 2017-10-16 2018-03-06 中国科学院地理科学与资源研究所 一种基于空间贝叶斯网络的环境健康风险监测预警系统及方法
CN108830287A (zh) * 2018-04-18 2018-11-16 哈尔滨理工大学 基于残差连接的Inception网络结合多层GRU的中文图像语义描述方法
CN110909145B (zh) * 2019-11-29 2022-08-09 支付宝(杭州)信息技术有限公司 针对多任务模型的训练方法及装置
CN111310848B (zh) * 2020-02-28 2022-06-28 支付宝(杭州)信息技术有限公司 多任务模型的训练方法及装置
CN111291166B (zh) * 2020-05-09 2020-11-03 支付宝(杭州)信息技术有限公司 基于Bert的语言模型的训练方法及装置

Cited By (9)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN114647732A (zh) * 2022-05-23 2022-06-21 之江实验室 一种面向弱监督文本分类系统、方法和装置
CN114647732B (zh) * 2022-05-23 2022-09-06 之江实验室 一种面向弱监督文本分类系统、方法和装置
CN115859175A (zh) * 2023-02-16 2023-03-28 合肥综合性国家科学中心人工智能研究院(安徽省人工智能实验室) 基于跨模态生成式学习的液压减震器设备异常检测方法
CN117436236A (zh) * 2023-09-27 2024-01-23 四川大学 一种基于大模型的工艺流程智能规划方法
CN117436236B (zh) * 2023-09-27 2024-05-17 四川大学 一种基于大模型的工艺流程智能规划方法
CN117669737A (zh) * 2023-12-20 2024-03-08 中科星图数字地球合肥有限公司 一种端到端地理行业大语言模型构建及使用方法
CN117669737B (zh) * 2023-12-20 2024-04-26 中科星图数字地球合肥有限公司 一种端到端地理行业大语言模型构建及使用方法
CN117708337A (zh) * 2024-02-05 2024-03-15 杭州杰竞科技有限公司 一种面向复杂定域的人机交互方法和系统
CN117708337B (zh) * 2024-02-05 2024-04-26 杭州杰竞科技有限公司 一种面向复杂定域的人机交互方法和系统

Also Published As

Publication number Publication date
CN112100383B (zh) 2021-02-19
CN112100383A (zh) 2020-12-18

Legal Events

Date Code Title Description
121 Ep: the epo has been informed by wipo that ep was designated in this application

Ref document number: 20959601

Country of ref document: EP

Kind code of ref document: A1

ENP Entry into the national phase

Ref document number: 202214177

Country of ref document: GB

Kind code of ref document: A

Free format text: PCT FILING DATE = 20201221

ENP Entry into the national phase

Ref document number: 2022567027

Country of ref document: JP

Kind code of ref document: A

NENP Non-entry into the national phase

Ref country code: DE

122 Ep: pct application non-entry in european phase

Ref document number: 20959601

Country of ref document: EP

Kind code of ref document: A1

32PN Ep: public notification in the ep bulletin as address of the adressee cannot be established

Free format text: NOTING OF LOSS OF RIGHTS PURSUANT TO RULE 112(1) EPC (EPO FORM 1205A DATED 28.09.2023)