WO2022221045A1

WO2022221045A1 - Performing multiple tasks with continual adaptation

Info

Publication number: WO2022221045A1
Application number: PCT/US2022/022234
Authority: WO
Inventors: An Wang; Yongliang MA; Duyu TANG; Daxin Jiang; Nan Duan
Original assignee: Microsoft Technology Licensing, LLC
Priority date: 2021-04-15
Filing date: 2022-03-29
Publication date: 2022-10-20
Also published as: CN115220875A

Abstract

The present disclosure proposes a method and apparatus for performing multiple tasks. A text input may be obtained. A set of shared representations of the text input in multiple layers may be generated. Multiple task-specific representations of the text input may be generated based on the set of shared representations. The multiple tasks may be performed with the multiple task-specific representations, respectively.

Description

PERFORMING MULTIPLE TASKS WITH CONTINUAL ADAPTATION

BACKGROUND

[0001] Natural Language Processing (NLP) is a technology that uses a natural language to communicate with computers. It aims to enable computers to understand and use the natural language to achieve communication between humans and computers, so that computers may perform, in place of humans, various tasks related to the natural language, e.g., a Query Understanding task, a Machine Reading Comprehension task, a Question Answering task, etc. An NLP task may be performed through a neural network model. For example, various NLP tasks may be performed through a Bidirectional Encoder Representations from Transformers (BERT) model, a Generative Pre-trained Transformer (GPT) model, a Robustly optimized BERT approach (RoBERTa) model, etc. These models are usually complex models that rely on deep networks with a huge number of parameters, and thus have excellent performance when performing the NLP tasks.

SUMMARY

[0002] This Summary is provided to introduce a selection of concepts that are further described below in the Detailed Description. It is not intended to identify key features or essential features of the claimed subject matter, nor is it intended to be used to limit the scope of the claimed subject matter.

[0003] Embodiments of the present disclosure propose a method and apparatus for performing multiple tasks. A text input may be obtained. A set of shared representations of the text input in multiple layers may be generated. Multiple task-specific representations of the text input may be generated based on the set of shared representations. The multiple tasks may be performed with the multiple task-specific representations, respectively.

[0004] It should be noted that the above one or more aspects comprise the features hereinafter fully described and particularly pointed out in the claims. The following description and the drawings set forth in detail certain illustrative features of the one or more aspects. These features are only indicative of the various ways in which the principles of various aspects may be employed, and this disclosure is intended to include all such aspects and their equivalents.

BRIEF DESCRIPTION OF THE DRAWINGS

[0005] The disclosed aspects will hereinafter be described in conjunction with the appended drawings that are provided to illustrate and not to limit the disclosed aspects.

[0006] FIG.1 illustrates an exemplary process for performing multiple tasks with continuous adaptation according to an embodiment of the present disclosure.

[0007] FIG.2 illustrates an exemplary process for generating a task-specific representation according to an embodiment of the present disclosure.

[0008] FIG.3 illustrates an exemplary process for training a multi-task model through multiple single-task reference models according to an embodiment of the present disclosure.

[0009] FIG.4 illustrates an exemplary process for training a multi-task model through a multi-task reference model according to an embodiment of the present disclosure.

[0010] FIG.5 is a flowchart of an exemplary method for performing multiple tasks according to an embodiment of the present disclosure.

[0011] FIG.6 illustrates an exemplary apparatus for performing multiple tasks according to an embodiment of the present disclosure.

[0012] FIG.7 illustrates an exemplary apparatus for performing multiple tasks according to an embodiment of the present disclosure.

DETAILED DESCRIPTION

[0013] The present disclosure will now be discussed with reference to several exemplary implementations. It is to be understood that these implementations are discussed only for enabling those skilled in the art to better understand and thus implement the embodiments of the present disclosure, rather than suggesting any limitations on the scope of the present disclosure.

[0014] It is desirable to use a neural network model such as a BERT model, a GPT model, or a RoBERTa model to perform multiple tasks. There are some existing ways of using a neural network model to perform multiple tasks. In one way, taking the BERT model as an example, given multiple tasks, the BERT model may be trained separately for each task, thereby obtaining multiple BERT models for the multiple tasks. This way requires hosting the multiple BERT models, which requires a large amount of storage resources and computing resources. In addition, when there is an additional task, the BERT model needs to be retrained, which may affect the performance of the model when performing the existing tasks. In another way, a multi-task model may be constructed through adding multiple task-specific output layers for fixed multiple tasks on a shared feature extractor. Herein, a model capable of performing multiple tasks simultaneously may be referred to as a multi-task model. The multi-task model may be used to perform the fixed multiple tasks. However, when the multi-task model is used to perform an additional task, all parameters in the multi-task model need to be updated for the additional task, which will affect its performance when performing the existing tasks.

[0015] Embodiments of the present disclosure propose an improved method for performing multiple tasks. These multiple tasks may be based on the same text input. A set of representations of the text input in multiple layers may be generated. The set of representations may be referred to as a set of shared representations, which may be used to generate multiple representations of the text input for multiple tasks. A representation for a specific task may be referred to as a task-specific representation. Multiple task-specific representations for the multiple tasks may be further used to perform the multiple tasks respectively.

[0016] In an aspect, an embodiment of the present disclosure proposes a multi-task model with a novel structure for performing multiple tasks. The multi-task model may comprise, e.g., a shared encoder, multiple task-specific encoders and multiple task-specific linear layers for the multiple tasks, etc. The shared encoder may comprise a set of shared encoder layers, which may generate a set of shared representations of a text input in multiple layers. During training of the multi-task model, or deployment of the multi-task model to perform the multiple tasks, parameters of the shared encoder may be fixed. The task-specific encoder may also be referred to as an adapter, which may be adapted to a target task. Herein, a task targeted by a task-specific encoder may be referred to as a target task of the task-specific encoder. The task-specific encoder may capture task-specific semantics of its target task from a set of shared representations provided by the shared encoder, and generate a task-specific representation of the target task. A task-specific linear layer may perform the target task with the task-specific representation provided by the corresponding task-specific encoder. The task-specific encoder may be connected to the shared encoder like a plug-in. This way of connection will not affect the parameters of the shared encoder. In addition, when there is an additional task, the multi-task model may perform the additional task through adding an additional task-specific encoder and an additional task-specific linear layer for the additional task to the multi-task model. This way will not affect the structure and parameters related to performing existing tasks in the multi-task model, e.g., it will not change the parameters of the shared encoder, the existing task-specific encoders and task-specific linear layers, and thus will not affect the performance of the multi-task model when performing the existing tasks. Therefore, the multi-task model according to an embodiment of the present disclosure may perform multiple tasks with continuous adaptation.

[0017] In another aspect, an embodiment of the present disclosure proposes to train a multi-task model through a teacher-student architecture. A multi-task model may be trained with multiple training datasets for multiple tasks. Herein, a dataset used to train the multi-task model may be referred to as a training dataset. An embodiment of the present disclosure proposes to obtain multiple training datasets for training a multi-task model through a teacher-student architecture. For example, one or more reference models may be trained with a small amount of supervised datasets, and then multiple training datasets may be generated through the trained one or more reference models. Herein, a supervised dataset may refer to a dataset used to train a reference model, and a reference model refers to a model capable of being used to assist in training a multi-task model, which may also be referred to as a teacher model. In this mode, the multi-task model, which may also be referred to as a student model, learns knowledge from the teacher model. In an implementation, the multiple training datasets may be generated by multiple single-task reference models, and the multiple single-task reference models may be previously trained with multiple supervised datasets, respectively. Herein, a single-task reference model may refer to a reference model capable of performing a single task. In another implementation, the multiple training datasets may be generated by a multi-task reference model, and the multi-task reference model may be previously trained with multiple supervised datasets. Herein, a multi-task reference model may refer to a reference model capable of performing multiple tasks simultaneously.

[0018] In yet another aspect, a multi-task model according to an embodiment of the present disclosure may be used in a multilingual scenario or a cross-lingual scenario. A neural network model with a multilingual capability or a cross-lingual capability may be used as a shared encoder. A task-specific encoder may be adapted to generate a task-specific representation of a text input in an arbitrary language, and the arbitrary language may be, e.g., any one of multiple languages supported by the shared encoder. For example, a task-specific encoder may be trained for a target task based on an arbitrary language. The trained task-specific encoder is able to generate a task-specific representation of a text input in an arbitrary language for the target task. Preferably, a task-specific encoder may be dedicated to generating a task-specific representation of a text input in a target language. Herein, a language specifically targeted by a task-specific encoder may be referred to as a target language of the task-specific encoder. For example, a task-specific encoder may be trained for a target task based on a target language. The trained task-specific encoder is able to generate a more accurate task-specific representation of the text input in the target language for the target task. In this way, task-specific representations of a text input in different languages for the same target task may be generated independently of each other through different task-specific encoders. Further, a multi-task model including such task-specific encoders may flexibly perform the same task based on different languages.

[0019] FIG.1 illustrates an exemplary process 100 for performing multiple tasks with continuous adaptation according to an embodiment of the present disclosure. The process 100 may be performed by a multi-task model 110. A text input 102 may be obtained. The text input 102 may be used to perform multiple tasks, e.g., task 1 to task M, wherein M ≥ 1 is the number of tasks. Multiple task results of the text input 102 corresponding to the multiple tasks may be obtained through the multi-task model 110, e.g., task results 152-1 to 152-M.

[0020] The text input 102 may be denoted as $x = (x_1, \ldots, x_n)$, wherein $n$ is the number of words included in the text input x, and $x_i$ is the i-th word in the text input x. An embedding layer 120 may generate an initial representation of the text input x.

[0021] A shared encoder 130 may be a known neural network model, e.g., a neural network model based on a fully connected layer structure, a neural network model based on a transformer layer structure, etc. Taking the neural network model based on the transformer layer structure as an example, it may include, e.g., a BERT model, a GPT model, a RoBERTa model, etc. During training of the multi-task model 110, or deployment of the multi-task model 110 to perform multiple tasks, parameters of the shared encoder 130 may be fixed. For example, the shared encoder 130, with its parameters fixed during the performing of multiple tasks, may generate a set of shared representations of the text input x in multiple layers based on an initial representation of the text input x provided by the embedding layer 120. The shared encoder 130 may comprise a set of shared encoder layers, e.g., shared encoder layers 130-1 to 130-L, wherein L ≥ 1 is the number of shared encoder layers. A set of shared representations of the text input x in L layers may be generated through shared encoder layers 130-1 to 130-L, e.g., shared representations 132-1 to 132-L. Taking the shared encoder 130 being a BERT model as an example, it may include a set of transformer layers. The set of transformer layers may generate a set of contextualized shared representations $h = \{h_1, \ldots, h_L\}$ with $h_l \in \mathbb{R}^{n \times D}$, wherein $D$ is the dimension of a shared representation. The shared representation 132-l of the text input x output by the shared encoder layer 130-l in the shared encoder 130 may be denoted as $h_l$ ($l \in [1, L]$), wherein $h_l^i \in \mathbb{R}^{D}$ represents a shared representation of the i-th word $x_i$ in the text input x output by the shared encoder layer 130-l.

[0022] The shared representations 132-1 to 132-L may be provided to multiple task-specific encoders, e.g., task-specific encoders 140-1 to 140-M. The task-specific encoders 140-1 to 140-M may target multiple tasks, e.g., task 1 to task M, respectively. Each task-specific encoder 140-m (m ∈ [1, M]) may capture task-specific semantics of its target task from the shared representations 132-1 to 132-L provided by the shared encoder 130, and generate a task-specific representation of the target task. The task-specific encoders 140-1 to 140-M may generate multiple task-specific representations of the text input x based on the shared representations 132-1 to 132-L, respectively, e.g., task-specific representations 142-1 to 142-M. Each task-specific encoder 140-m may be connected to the shared encoder 130 like a plug-in. This connection approach does not affect the parameters of the shared encoder 130. An exemplary structure of a task-specific encoder and an exemplary process for generating a task-specific representation will be described later in conjunction with FIG. 2.

[0023] The task-specific representations 142-1 to 142-M may be used to perform multiple tasks, respectively. For example, the multiple tasks may be performed with the task-specific representations 142-1 to 142-M through task-specific linear layers 150-1 to 150-M, respectively. The multiple task-specific linear layers 150-1 to 150-M may output multiple task results for the multiple tasks, respectively, e.g., task results 152-1 to 152-M. Taking a task m being a domain classification task in a Query Understanding task as an example, a task result 152-m for this task m may be a binary classification result used to indicate whether the text input 102 belongs to a specific domain.

[0024] According to an embodiment of the present disclosure, in the case where multiple tasks are performed based on the same text input, only one set of shared representations of the text input needs to be generated, e.g., shared representations 132-1 to 132-L in FIG. 1. The set of shared representations may be provided to multiple task-specific encoders for multiple tasks, and further used to generate multiple task-specific representations to perform the multiple tasks. That is, the shared encoder 130 only needs to be executed once regardless of the number of tasks, which significantly saves computing resources and improves the efficiency of the model. However, it should be understood that the multi-task model 110 may also be used to perform multiple tasks based on different text inputs. A text input for a current task may be provided to the multi-task model. The multi-task model may perform the current task at least through a task-specific encoder and a task-specific linear layer for the current task.
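As a non-limiting illustration of the single execution of the shared encoder described above, the following minimal sketch runs several task heads on one set of shared representations. The class `CountingSharedEncoder`, the function `run_all_tasks`, and the toy heads are hypothetical names introduced only for this illustration. The "representations" are simple numbers rather than outputs of a real BERT model.

```python
from typing import Callable, Dict, List

Vector = List[float]


class CountingSharedEncoder:
    """Hypothetical stand-in for a frozen shared encoder with several layers."""

    def __init__(self, num_layers: int) -> None:
        self.num_layers = num_layers
        self.calls = 0  # counts how many times the encoder is executed

    def __call__(self, text: str) -> List[Vector]:
        self.calls += 1
        tokens = text.split()
        # One toy "shared representation" per layer: token length scaled by layer index.
        return [[float(len(tok)) * (layer + 1) for tok in tokens]
                for layer in range(self.num_layers)]


def run_all_tasks(encoder: CountingSharedEncoder,
                  heads: Dict[str, Callable[[List[Vector]], object]],
                  text: str) -> Dict[str, object]:
    shared = encoder(text)  # executed once, whatever the number of tasks
    return {name: head(shared) for name, head in heads.items()}


encoder = CountingSharedEncoder(num_layers=3)
heads = {
    # Each head stands in for a task-specific encoder plus linear layer.
    "domain": lambda shared: sum(shared[-1]) > 10.0,
    "length": lambda shared: len(shared[0]),
}
results = run_all_tasks(encoder, heads, "book a flight to Paris")
```

In this sketch the cost of the shared encoder is paid once per text input, and each additional head only adds its own, much smaller, computation.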

[0025] The multi-task model 110 according to an embodiment of the present disclosure may efficiently and flexibly support various NLP tasks. For example, the multi-task model 110 having the task-specific encoders 140-1 to 140-M and the corresponding task-specific linear layers 150-1 to 150-M may perform task 1 to task M. There may be a need to use the multi-task model 110 to perform an additional task, e.g., additional task M+1. According to an embodiment of the present disclosure, adding an additional task-specific encoder (not shown) and an additional task-specific linear layer (not shown) for the additional task to the multi-task model 110 may enable the multi-task model 110 to perform the additional task.

[0026] It is assumed that the additional task M+1 may be performed based on the text input 102. In an implementation, an additional task-specific representation of the text input 102 for the additional task M+1 may be generated based on the shared representations 132-1 to 132-L of the text input 102, and the additional task M+1 may be performed with the additional task-specific representation. For example, the additional task-specific representation for the additional task M+1 may be generated based on the shared representations 132-1 to 132-L through the additional task-specific encoder. The additional task M+1 may be performed with the additional task-specific representation through the additional task-specific linear layer.

[0027] Adding an additional task-specific encoder and an additional task-specific linear layer for an additional task to the multi-task model 110 may enable the multi-task model to flexibly support various NLP tasks. Moreover, because the task-specific encoder may be connected to the shared encoder like a plug-in, adding the task-specific encoder and the task-specific linear layer will not affect the structure and parameters related to performing the existing tasks in the multi-task model, and thus will not affect the performance of the multi-task model when performing the existing tasks. For example, the parameters of the shared encoder 130, the task-specific encoders 140-1 to 140-M, and the task-specific linear layers 150-1 to 150-M will not be changed due to the addition of the additional task-specific encoder and the additional task-specific linear layer. Therefore, the multi-task model 110 according to an embodiment of the present disclosure may perform multiple tasks with continuous adaptation. In addition, one or more task-specific encoders and corresponding task-specific linear layers may be flexibly removed from the multi-task model 110 as required.
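The plug-in behaviour described above may be sketched, under simplifying assumptions, as a registry of per-task parameter sets kept separate from the shared parameters. The class `PluggableMultiTaskModel` and its methods are hypothetical and hold plain lists of numbers instead of real network weights. The sketch only shows that adding or removing a task does not touch the parameters of the shared encoder or of the existing tasks.

```python
import copy
from typing import Dict, List, Sequence


class PluggableMultiTaskModel:
    """Hypothetical container: frozen shared parameters plus per-task parameter sets."""

    def __init__(self, shared_params: Sequence[float]) -> None:
        # A tuple cannot be modified in place, mirroring the fixed shared encoder.
        self.shared_params = tuple(shared_params)
        self.task_params: Dict[str, Dict[str, List[float]]] = {}

    def add_task(self, name: str, encoder_params: Sequence[float],
                 linear_params: Sequence[float]) -> None:
        if name in self.task_params:
            raise ValueError(f"task {name!r} already exists")
        self.task_params[name] = {"encoder": list(encoder_params),
                                  "linear": list(linear_params)}

    def remove_task(self, name: str) -> None:
        del self.task_params[name]


model = PluggableMultiTaskModel([0.1, 0.2, 0.3])
model.add_task("task_1", [1.0, 2.0], [0.5])
model.add_task("task_2", [3.0, 4.0], [0.7])

# Snapshot the state before plugging in an additional task.
snapshot = copy.deepcopy((model.shared_params, model.task_params))
model.add_task("task_3", [5.0, 6.0], [0.9])

existing_unchanged = (
    model.shared_params == snapshot[0]
    and all(model.task_params[name] == params for name, params in snapshot[1].items())
)

model.remove_task("task_2")  # removal is equally local to the removed task
```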

[0028] It should be understood that the process 100 in FIG. 1 is only an example of the process for performing multiple tasks with continuous adaptation. According to actual application requirements, the process for performing multiple tasks may comprise any other steps, and may comprise more or fewer steps. In addition, the multi-task model 110 in FIG. 1 is only an example of the multi-task model. According to actual application requirements, the multi-task model may have any other structure and may comprise more or fewer layers. In addition, it should be understood that although the foregoing discussion and the following discussion may involve an example of adopting the neural network model based on the transformer layer structure as a shared encoder, the embodiments of the present disclosure are not limited to this, but may adopt a neural network model based on other structures, e.g., based on a fully connected layer structure, as a shared encoder in a similar way.

[0029] According to an embodiment of the present disclosure, the multi-task model may be used in a multilingual scenario or a cross-lingual scenario. A neural network model with a multilingual capability or a cross-lingual capability may be used as a shared encoder. The neural network model with the multilingual capability or the cross-lingual capability may be, e.g., a Cross-lingual Language Model (XLM). A task-specific encoder may be adapted to generate a task-specific representation of a text input in an arbitrary language, and the arbitrary language may be, e.g., any one of multiple languages supported by the shared encoder. For example, a task-specific encoder may be trained for a target task based on an arbitrary language. The trained task-specific encoder is able to generate a task-specific representation of a text input in an arbitrary language for the target task. Preferably, the task-specific encoder may be dedicated to generating a task-specific representation of a text input in a target language. For example, a task-specific encoder may be trained for a target task based on a target language, e.g., with the training dataset in the target language for the target task. The trained task-specific encoder is able to generate a more accurate task-specific representation of the text input in the target language for the target task. In this way, task-specific representations of a text input in different languages for the same target task may be generated independently of each other through different task-specific encoders. For example, there may be a text input in English for a classification task and a text input in French for the classification task. A task-specific representation of the text input in English for the classification task and a task-specific representation of the text input in French for the classification task may be generated through two task-specific encoders, respectively. Further, a multi-task model including such task-specific encoders may flexibly perform the same task based on different languages.
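One possible way to organize language-dedicated task-specific encoders is sketched below. The function `select_adapter` and the lookup key (task, language) are hypothetical. The fallback to an encoder trained for an arbitrary language is an assumption made for this illustration and is not prescribed by the present disclosure. Strings stand in for the actual encoders.

```python
from typing import Dict, Optional, Tuple

# (task, language); a language of None marks an encoder for an arbitrary language.
AdapterKey = Tuple[str, Optional[str]]


def select_adapter(adapters: Dict[AdapterKey, str], task: str, language: str) -> str:
    """Prefer an encoder dedicated to the target language, else fall back."""
    dedicated = (task, language)
    if dedicated in adapters:
        return adapters[dedicated]
    return adapters[(task, None)]  # assumed fallback: arbitrary-language encoder


adapters: Dict[AdapterKey, str] = {
    ("classification", "en"): "classification encoder (English)",
    ("classification", "fr"): "classification encoder (French)",
    ("classification", None): "classification encoder (arbitrary language)",
}

french_adapter = select_adapter(adapters, "classification", "fr")
german_adapter = select_adapter(adapters, "classification", "de")
```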

[0030] FIG.2 illustrates an exemplary process 200 for generating a task-specific representation according to an embodiment of the present disclosure. The process 200 may be performed by a task-specific encoder 210. The task-specific encoder 210 may correspond to any one of the task-specific encoders 140-1 to 140-M in FIG. 1. The task-specific encoder 210 may be, e.g., based on a transformer structure, and it may comprise, e.g., a set of task-specific feature extracting units 220-1 to 220-L, a set of scaled self-attention units 230-1 to 230-L, a concatenating unit 240, a layer normalization 250, a feed-forward layer 260, a layer normalization 270, a concatenating unit 280, etc.

[0031] The shared representations 202-1 to 202-L may correspond to the shared representations 132-1 to 132-L in FIG. 1. The task-specific encoder 210 may capture task-specific semantics of its target task from the shared representations 202-1 to 202-L, and generate a task-specific representation of the target task. For example, the task-specific encoder 210 may first extract a task-specific feature set for the target task from each shared representation of the shared representations 202-1 to 202-L, and encode the task-specific feature set into a task-specific sub-representation. Taking a shared representation $h_l$ 202-l as an example, a task-specific feature extracting unit 220-l may extract a task-specific feature set for the target task from the shared representation $h_l$. For example, the task-specific feature extracting unit 220-l may extract the task-specific feature set for the target task from the shared representation $h_l$ through applying a linear transformation $(W_l^k, W_l^q, W_l^v)$ to the shared representation $h_l$, wherein $W_l^k, W_l^q, W_l^v \in \mathbb{R}^{D \times d}$ are trainable model parameters, and $d$ is the dimension of a word embedding inside the task-specific encoder 210. The dimension $d$ may be much smaller than the dimension $D$ of the shared representation $h_l$. The extracted task-specific feature set may be, e.g., a triplet $(k_l, q_l, v_l)$, wherein $k_l$ is a key, $q_l$ is a query, and $v_l$ is a value.

[0032] Subsequently, the extracted task-specific feature set $(k_l, q_l, v_l)$ may be encoded into a task-specific sub-representation. For example, a scaled attention operation may be performed on the task-specific feature set $(k_l, q_l, v_l)$ through the scaled self-attention unit 230-l to obtain a task-specific sub-representation $\mathrm{layer}_l$, as shown by the following formula:

$\mathrm{layer}_l = \mathrm{SA}(k_l, q_l, v_l)$ (1)

[0033] A set of task-specific sub-representations corresponding to the shared representations 202-1 to 202-L, e.g., task-specific sub-representations $\mathrm{layer}_1$ to $\mathrm{layer}_L$, may be combined into a task-specific intermediate representation. For example, the task-specific sub-representations $\mathrm{layer}_1$ to $\mathrm{layer}_L$ may be concatenated into a task-specific intermediate representation attention through the concatenating unit 240, as shown in the following formula:

$\mathrm{attention} = \mathrm{concat}(\mathrm{layer}_1, \ldots, \mathrm{layer}_L)$ (2)

[0034] Next, a task-specific representation may be generated based at least on the task-specific intermediate representation attention. For example, a normalization operation may be performed on the task-specific intermediate representation attention through the layer normalization 250 containing residual connections, to obtain an output att_output, as shown by the following formula:

$\mathrm{att\_output} = \mathrm{LN}(v + W \cdot \mathrm{attention})$ (3)

wherein $v = \{v_l\}_{l=1}^{L}$, and $W$ is a trainable model parameter with dimension $(L \times d, L \times d)$.

[0035] Preferably, the feed-forward layer 260 and another layer normalization 270 containing residual connections may further process the output att_output, to obtain the output ada_output. For the i-th word $x_i$ in the text input x, the corresponding word embedding in the output ada_output may have dimension $L \times d$.

[0036] The uppermost shared representation, i.e., the shared representation 202-L, and the output ada_output may be concatenated into a task-specific representation 282 through the concatenating unit 280.
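The process 200 may be sketched in plain Python as follows, as a minimal and non-authoritative illustration. Several points are assumptions rather than statements of the present disclosure: SA is taken to be standard scaled dot-product attention, the layer normalization has no learned gain or bias, W is applied by right-multiplication, and the optional feed-forward layer 260 and layer normalization 270 are omitted. All function names and the random toy weights are hypothetical.

```python
import math
import random
from typing import List, Sequence, Tuple

Matrix = List[List[float]]


def matmul(a: Matrix, b: Matrix) -> Matrix:
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]


def softmax(row: List[float]) -> List[float]:
    peak = max(row)
    exps = [math.exp(value - peak) for value in row]
    total = sum(exps)
    return [e / total for e in exps]


def scaled_self_attention(k: Matrix, q: Matrix, v: Matrix) -> Matrix:
    """Assumed form of SA: softmax(q k^T / sqrt(d)) v."""
    d = len(k[0])
    k_t = [list(col) for col in zip(*k)]
    weights = [softmax([s / math.sqrt(d) for s in row]) for row in matmul(q, k_t)]
    return matmul(weights, v)


def layer_norm(row: List[float], eps: float = 1e-5) -> List[float]:
    mean = sum(row) / len(row)
    var = sum((x - mean) ** 2 for x in row) / len(row)
    return [(x - mean) / math.sqrt(var + eps) for x in row]


def concat(mats: Sequence[Matrix]) -> Matrix:
    """Concatenate matrices word by word along the feature axis."""
    return [sum((m[i] for m in mats), []) for i in range(len(mats[0]))]


def task_specific_encoder(shared: List[Matrix],
                          proj: List[Tuple[Matrix, Matrix, Matrix]],
                          w_out: Matrix) -> Matrix:
    layers, values = [], []
    for h_l, (w_k, w_q, w_v) in zip(shared, proj):
        k_l, q_l, v_l = matmul(h_l, w_k), matmul(h_l, w_q), matmul(h_l, w_v)
        layers.append(scaled_self_attention(k_l, q_l, v_l))  # formula (1)
        values.append(v_l)
    attention = concat(layers)                                # formula (2)
    v = concat(values)
    mixed = matmul(attention, w_out)
    att_output = [layer_norm([a + b for a, b in zip(row_v, row_m)])
                  for row_v, row_m in zip(v, mixed)]          # formula (3)
    return concat([shared[-1], att_output])  # uppermost shared representation + adapter output


def rand_matrix(rng: random.Random, rows: int, cols: int) -> Matrix:
    return [[rng.uniform(-0.1, 0.1) for _ in range(cols)] for _ in range(rows)]


rng = random.Random(0)
n, D, d, L = 4, 8, 2, 3  # toy sizes: words, shared dim, adapter dim, layers
shared = [rand_matrix(rng, n, D) for _ in range(L)]
proj = [(rand_matrix(rng, D, d), rand_matrix(rng, D, d), rand_matrix(rng, D, d))
        for _ in range(L)]
w_out = rand_matrix(rng, L * d, L * d)
representation = task_specific_encoder(shared, proj, w_out)
```

Under these assumptions each word ends up with a vector of dimension D + L × d: the first D entries come unchanged from the uppermost shared representation, and the remaining L × d entries come from the task-specific encoder.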

[0037] It should be understood that the process 200 in FIG. 2 is only an example of the process for generating the task-specific representation. According to actual application requirements, the process for generating the task-specific representation may comprise any other steps, and may comprise more or fewer steps. In addition, the task-specific encoder 210 in FIG. 2 is only an example of the task-specific encoder. According to actual application requirements, the task-specific encoder may have any other structure and may comprise more or fewer layers. For example, the feed-forward layer 260 and the layer normalization 270 may be removed from the task-specific encoder 210, so that the task-specific representation 282 may be obtained through concatenating the output att_output provided by the layer normalization 250 directly with the shared representation 202-L. In addition, it should be understood that although the foregoing discussion and the following discussion may involve an example of adopting the neural network model based on the transformer layer structure as the task-specific encoder, the embodiments of the present disclosure are not limited to this, but may adopt a neural network model based on other structures, e.g., based on a fully connected layer structure, as the task-specific encoder in a similar way.

[0038] As described above, multiple tasks may be performed through a multi-task model according to an embodiment of the present disclosure, e.g., the multi-task model 110 in FIG. 1. The multi-task model may be trained for these multiple tasks. When the trained multi-task model is actually deployed, multiple task-specific representations for the multiple tasks may be generated, and the generated multiple task-specific representations may be used to perform the multiple tasks respectively. The multi-task model may comprise, e.g., a shared encoder, multiple task-specific encoders and multiple task-specific linear layers for multiple tasks, etc. The multi-task model may be trained through pre-training the shared encoder, and in the case of fixing parameters of the pre-trained shared encoder, optimizing multiple task-specific encoder parameter sets of the multiple task-specific encoders and/or multiple linear layer parameter sets of the multiple task-specific linear layers.

[0039] The shared encoder may be pre-trained through known approaches. Taking the shared encoder being a BERT model as an example, the BERT model may be pre-trained through approaches such as Masked Language Model (MLM), Next Sentence Prediction (NSP), etc.

[0040] After the shared encoder is pre-trained, the multiple task-specific encoder parameter sets and/or multiple linear layer parameter sets may be optimized in the case of fixing parameters of the pre-trained shared encoder. Compared with the number of parameters in the shared encoder, the number of parameters in each task-specific encoder parameter set and each linear layer parameter set is much smaller, so the optimization of the task-specific encoder parameter set and/or the linear layer parameter set will not occupy too many computing resources and storage resources.
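As a rough, hedged illustration of this size difference, the following sketch counts only the parameters named in connection with FIG. 2, i.e., the three D × d projection matrices per shared layer and the (L × d) × (L × d) matrix W of formula (3). The values D = 768 and L = 12 match a base-size BERT model, while d = 64 is a hypothetical choice made for this example. The feed-forward layer, layer normalizations, bias terms and the task-specific linear layer are not counted.

```python
def adapter_parameter_count(D: int, d: int, L: int) -> int:
    """Count the projection and mixing weights of one task-specific encoder."""
    projections = L * 3 * D * d  # key, query and value projections for each shared layer
    output_mix = (L * d) ** 2    # matrix W in formula (3)
    return projections + output_mix


count = adapter_parameter_count(D=768, d=64, L=12)
```

With these values the count is 2,359,296, which is small next to the roughly 110 million parameters of a base-size BERT model used as the shared encoder.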

[0041] The multiple task-specific encoder parameter sets and/or multiple linear layer parameter sets may be optimized in a distributed manner. For example, the multiple task-specific encoder parameter sets may be optimized independently of each other. The optimization of each task-specific encoder parameter set will not affect other task-specific encoder parameter sets. Similarly, the multiple linear layer parameter sets may also be optimized independently of each other. The optimization of each linear layer parameter set will not affect other linear layer parameter sets. In addition, optimizing various parameter sets independently of each other may also ensure that training for additional tasks will not affect the parameter sets for existing tasks in the multi-task model, thereby ensuring the performance of the multi-task model when performing the existing tasks.

[0042] The multiple task-specific encoder parameter sets and/or multiple linear layer parameter sets may be optimized with multiple training datasets for multiple tasks, respectively. The optimization may be performed based on a standard supervised loss function. The multiple training datasets for optimizing the multiple task-specific encoder parameter sets and/or the multiple linear layer parameter sets may be obtained in multiple ways. For example, the training dataset may be a supervised dataset, e.g., a human-labeled dataset.

[0043] In addition, an embodiment of the present disclosure proposes to obtain multiple training datasets for optimizing multiple task-specific encoder parameter sets and/or multiple linear layer parameter sets through a way based on a teacher-student architecture. For example, one or more reference models may be trained with a small amount of supervised datasets, and then multiple training datasets may be generated through the trained one or more reference models. In an implementation, the multiple training datasets may be generated by multiple single-task reference models, and the multiple single-task reference models may be previously trained with multiple supervised datasets, respectively. An exemplary process for training a multi-task model through multiple single-task reference models will be described later in conjunction with FIG. 3. In another implementation, the multiple training datasets may be generated by a multi-task reference model, and the multi-task reference model may be previously trained with multiple supervised datasets. An exemplary process for training a multi-task model through a multi-task reference model will be described later in conjunction with FIG. 4.

[0044] FIG. 3 illustrates an exemplary process 300 for training a multi-task model through multiple single-task reference models according to an embodiment of the present disclosure. In the process 300, a multi-task model 320 may be trained through multiple single-task reference models, e.g., single-task reference models 310-1 to 310-M. Each single-task reference model may be a model with higher complexity than the multi-task model 320. For example, when a shared encoder adopted by the multi-task model 320 is a 12-layer BERT model, the single-task reference model may be a 24-layer BERT model. The multi-task model 320 may correspond to the multi-task model 110 in FIG. 1.

[0045] The single-task reference models 310-1 to 310-M may be previously trained with multiple supervised datasets, e.g., supervised datasets 302-1 to 302-M, to obtain trained single-task reference models 312-1 to 312-M. The trained single-task reference models 312-1 to 312-M may generate training datasets 316-1 to 316-M based on unsupervised datasets 314-1 to 314-M, respectively. For example, an unsupervised dataset 314-m for a task m may be provided to a trained single-task reference model 312-m. The unsupervised dataset 314-m may include, e.g., a set of unlabeled texts. For each text in the set of unlabeled texts, the trained single-task reference model 312-m may generate a soft label or a pseudo label for the text. Herein, the soft label or the pseudo label may refer to a label generated by a reference model. The text and its corresponding soft label or pseudo label may be combined into a training sample. Then, a set of training samples corresponding to the unsupervised dataset 314-m may be combined into a training dataset 316-m.
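The labeling flow described above may be sketched as follows. This is a minimal illustration only: the two teacher functions are hypothetical rule-based stand-ins for trained single-task reference models 312-1 to 312-M, and the task names and texts are invented for the example. One teacher returns a soft label (a probability distribution) and the other a pseudo label (a single class).

```python
def build_training_dataset(reference_model, unlabeled_texts):
    """Pair each unlabeled text with the label the reference model assigns."""
    return [(text, reference_model(text)) for text in unlabeled_texts]


def sentiment_teacher(text):
    """Hypothetical teacher returning a soft label over (negative, positive)."""
    p_pos = 0.9 if "good" in text else 0.2
    return [1.0 - p_pos, p_pos]


def length_teacher(text):
    """Hypothetical teacher returning a pseudo label (a hard class)."""
    return "long" if len(text.split()) > 3 else "short"


teachers = {"sentiment": sentiment_teacher, "length": length_teacher}
unsupervised = {
    "sentiment": ["good movie", "bad plot"],
    "length": ["hi", "this is a longer text"],
}

# One training dataset per task, each produced by that task's own teacher.
training_datasets = {
    task: build_training_dataset(teachers[task], unsupervised[task])
    for task in teachers
}
```

Each entry of `training_datasets` plays the role of one training dataset 316-m: a list of (text, label) training samples for task m.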

[0046] The training datasets 316-1 to 316-M may be provided to the multi-task model 320. The multi-task model 320 may comprise, e.g., an embedding layer 330, a shared encoder 340, task-specific encoders 350-1 to 350-M, task-specific linear layers 360-1 to 360-M, etc. These modules may correspond to the embedding layer 120, the shared encoder 130, the task-specific encoders 140-1 to 140-M, and the task-specific linear layers 150-1 to 150-M in FIG. 1, respectively. With at least the parameters of the shared encoder 340 fixed, multiple task-specific encoder parameter sets corresponding to the task-specific encoders 350-1 to 350-M and/or multiple linear layer parameter sets corresponding to the task-specific linear layers 360-1 to 360-M may be optimized with the training datasets 316-1 to 316-M, thereby implementing the training of the multi-task model 320.

[0047] It should be understood that the process 300 in FIG. 3 is only an example of the process for training the multi-task model through the multiple single-task reference models. According to actual application requirements, the process for training the multi-task model through the multiple single-task reference models may comprise any other steps, and may comprise more or fewer steps. For example, in addition to the training datasets 316-1 to 316-M, the multi-task model 320 may also be trained with the supervised datasets 302-1 to 302-M used for training the single-task reference models 310-1 to 310-M.

[0048] FIG. 4 illustrates an exemplary process 400 for training a multi-task model through a multi-task reference model according to an embodiment of the present disclosure. In the process 400, a multi-task model 420 may be trained through a multi-task reference model 410. The multi-task reference model 410 may be a known multi-task model, e.g., a Multi-Task Deep Neural Network (MT-DNN) model. The multi-task model 420 may correspond to the multi-task model 110 in FIG. 1.

[0049] The multi-task reference model 410 may be previously trained with multiple supervised datasets, e.g., supervised datasets 402-1 to 402-M, to obtain a trained multi-task reference model 412. The trained multi-task reference model 412 may generate training datasets 416-1 to 416-M based on unsupervised datasets 414-1 to 414-M, respectively. For example, an unsupervised dataset 414-m for a task m may be provided to the trained multi-task reference model 412. The unsupervised dataset 414-m may include, e.g., a set of unlabeled texts. For each text in the set of unlabeled texts, the trained multi-task reference model 412 may generate a soft label or a pseudo label for the text. The text and its corresponding soft label or pseudo label may be combined into a training sample. Then, a set of training samples corresponding to the unsupervised dataset 414-m may be combined into a training dataset 416-m.
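In contrast to the process 300, a single reference model labels the unsupervised datasets of all tasks here. A minimal sketch of that arrangement follows. The class `MultiTaskReference`, its per-task heads, the task names, and the texts are hypothetical stand-ins; they are not an MT-DNN implementation and are not taken from the present disclosure.

```python
class MultiTaskReference:
    """Hypothetical stand-in for the trained multi-task reference model 412."""

    def __init__(self, heads):
        # Maps a task name to a callable that produces a label for a text.
        self.heads = heads

    def label(self, task, text):
        """Return the label this single model assigns to `text` for `task`."""
        return self.heads[task](text)


def generate_datasets(reference, unsupervised):
    """Build one training dataset per task from that task's unlabeled texts."""
    return {
        task: [(text, reference.label(task, text)) for text in texts]
        for task, texts in unsupervised.items()
    }


reference_412 = MultiTaskReference({
    "is_question": lambda text: text.endswith("?"),
    "length": lambda text: "long" if len(text.split()) > 3 else "short",
})
unsupervised_414 = {
    "is_question": ["where is it?", "it is here"],
    "length": ["hi", "a much longer unlabeled text"],
}
training_416 = generate_datasets(reference_412, unsupervised_414)
```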

[0050] The training datasets 416-1 to 416-M may be provided to the multi-task model 420. The multi-task model 420 may comprise, e.g., an embedding layer 430, a shared encoder 440, task-specific encoders 450-1 to 450-M, task-specific linear layers 460-1 to 460-M, etc. These modules may correspond to the embedding layer 120, the shared encoder 130, the task-specific encoders 140-1 to 140-M, and the task-specific linear layers 150-1 to 150-M in FIG. 1, respectively. With at least the parameters of the shared encoder 440 fixed, multiple task-specific encoder parameter sets corresponding to the task-specific encoders 450-1 to 450-M and/or multiple linear layer parameter sets corresponding to the task-specific linear layers 460-1 to 460-M may be optimized with the training datasets 416-1 to 416-M, thereby implementing the training of the multi-task model 420.

[0051] It should be understood that the process 400 in FIG. 4 is only an example of the process for training the multi-task model through the multi-task reference model. According to actual application requirements, the process for training the multi-task model through the multi-task reference model may comprise any other steps, and may comprise more or fewer steps. For example, in addition to the training datasets 416-1 to 416-M, the multi-task model 420 may also be trained with the supervised datasets 402-1 to 402-M used for training the multi-task reference model 410.

[0052] FIG. 5 is a flowchart of an exemplary method 500 for performing multiple tasks according to an embodiment of the present disclosure.

[0053] At 510, a text input may be obtained.

[0054] At 520, a set of shared representations of the text input in multiple layers may be generated.

[0055] At 530, multiple task-specific representations of the text input may be generated based on the set of shared representations.

[0056] At 540, the multiple tasks may be performed with the multiple task-specific representations, respectively.
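Steps 510 to 540 may be sketched end to end as follows. The sketch is a toy under stated assumptions: the embedding, the three-layer shared encoder, the layer-weighted task-specific encoder, and the task names "task_a" and "task_b" are hypothetical numeric stand-ins chosen so the data flow is easy to follow, and they are not the models of the present disclosure.

```python
def embed(text):
    """Toy embedding: [number of characters, number of words]."""
    return [float(len(text)), float(len(text.split()))]


def shared_encoder(vec, num_layers=3):
    """Step 520: return one shared representation per layer."""
    reps, hidden = [], vec
    for layer in range(1, num_layers + 1):
        hidden = [x * 0.5 + layer for x in hidden]  # fixed toy layer
        reps.append(hidden)
    return reps


def task_encoder(shared_reps, layer_weights):
    """Step 530: weight the layer outputs into one task-specific representation."""
    dim = len(shared_reps[0])
    return [
        sum(w * rep[i] for w, rep in zip(layer_weights, shared_reps))
        for i in range(dim)
    ]


def linear_layer(rep, weights, bias):
    """Step 540: map a task-specific representation to a task output."""
    return sum(w * x for w, x in zip(weights, rep)) + bias


def perform_tasks(text, tasks):
    shared_reps = shared_encoder(embed(text))  # computed once for all tasks
    outputs = {}
    for name, cfg in tasks.items():
        rep = task_encoder(shared_reps, cfg["layer_weights"])
        outputs[name] = linear_layer(rep, cfg["weights"], cfg["bias"])
    return outputs


tasks = {
    "task_a": {"layer_weights": [1, 0, 0], "weights": [1, 0], "bias": 0},
    "task_b": {"layer_weights": [0, 0, 1], "weights": [0, 1], "bias": 1},
}
outputs = perform_tasks("ab cd", tasks)
```

The shared representations are computed a single time and reused by every task, so the cost of adding a task is limited to its own encoder and linear layer.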

[0057] In an implementation, the generating a set of shared representations may comprise: generating the set of shared representations through a set of shared encoder layers in a shared encoder.

[0058] Parameters of the shared encoder may be fixed during performing of the multiple tasks.

[0059] In an implementation, the generating multiple task-specific representations may comprise: generating the multiple task-specific representations based on the set of shared representations through multiple task-specific encoders, respectively.

[0060] Each task-specific encoder in the multiple task-specific encoders may generate a task-specific representation for a target task through: extracting a task-specific feature set for the target task from each shared representation in the set of shared representations, and encoding the task-specific feature set into a task-specific sub-representation; combining a set of task-specific sub-representations corresponding to the set of shared representations into a task-specific intermediate representation; and generating the task-specific representation based at least on the task-specific intermediate representation.

[0061] Each task-specific encoder in the multiple task-specific encoders may be adapted to generate a task-specific representation of a text input in an arbitrary language or a task-specific representation of a text input in a target language.
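The operations that paragraph [0060] attributes to a task-specific encoder (extract, encode, combine, generate) may be sketched as below. Every concrete choice in the sketch is an assumption made for illustration: feature extraction is modeled as selecting dimensions, encoding as scaling, combination as an element-wise sum, and the final step as averaging over layers. The function names are hypothetical.

```python
def extract_features(shared_rep, selected_dims):
    """Pick the dimensions of one shared representation used by the task."""
    return [shared_rep[i] for i in selected_dims]


def encode(features, scale):
    """Toy encoding of a feature set into a task-specific sub-representation."""
    return [scale * f for f in features]


def task_specific_representation(shared_reps, selected_dims, scales):
    # One sub-representation per shared representation (i.e., per layer).
    subs = [
        encode(extract_features(rep, selected_dims), scale)
        for rep, scale in zip(shared_reps, scales)
    ]
    # Combine the sub-representations (assumed here: element-wise sum).
    intermediate = [sum(column) for column in zip(*subs)]
    # Generate the final representation (assumed here: average over layers).
    return [x / len(subs) for x in intermediate]


shared_reps = [[1.0, 2.0, 3.0], [4.0, 5.0, 6.0]]
rep = task_specific_representation(shared_reps, selected_dims=[0, 2], scales=[1.0, 2.0])
```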

[0062] In an implementation, the performing the multiple tasks may comprise: performing the multiple tasks with the multiple task-specific representations through multiple task-specific linear layers, respectively.

[0063] In an implementation, the method 500 may be implemented through a multi-task model. The multi-task model may include at least a shared encoder as well as multiple task-specific encoders and multiple task-specific linear layers for the multiple tasks.

[0064] Training of the multi-task model may comprise: pre-training the shared encoder; and in the case of fixing parameters of the pre-trained shared encoder, optimizing multiple task-specific encoder parameter sets of the multiple task-specific encoders and/or multiple linear layer parameter sets of the multiple task-specific linear layers.
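The second training stage, in which the shared encoder is frozen and only task-specific parameters are optimized, may be sketched as follows. The sketch assumes a one-dimensional toy problem, a squared-error loss, and per-sample gradient descent; the frozen parameter values, the dataset, and the function names are hypothetical and stand in for a pre-trained encoder and a real training dataset.

```python
SHARED_PARAMS = (2.0, 1.0)  # hypothetical pre-trained values, kept frozen


def shared_encoder(x, params=SHARED_PARAMS):
    """Frozen shared encoder: a fixed affine map in this toy example."""
    a, b = params
    return a * x + b


def train_head(dataset, epochs=200, lr=0.01):
    """Fit a task-specific linear head y = w * h + c on the frozen encoder."""
    w, c = 0.0, 0.0
    for _ in range(epochs):
        for x, y in dataset:
            h = shared_encoder(x)  # the encoder is only read, never updated
            err = (w * h + c) - y
            w -= lr * err * h      # gradient of 0.5 * err**2 w.r.t. w
            c -= lr * err          # gradient of 0.5 * err**2 w.r.t. c
    return w, c


def mse(dataset, head):
    """Mean squared error of a head (w, c) on a dataset."""
    w, c = head
    return sum((w * shared_encoder(x) + c - y) ** 2 for x, y in dataset) / len(dataset)


dataset = [(0.0, 2.0), (1.0, 8.0), (2.0, 14.0)]
initial_loss = mse(dataset, (0.0, 0.0))
head = train_head(dataset)
final_loss = mse(dataset, head)
```

Only `w` and `c` change during training, so the same frozen encoder can serve any number of heads trained at different times.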

[0065] The multiple task-specific encoder parameter sets may be optimized independently of each other. The multiple linear layer parameter sets may be optimized independently of each other.

[0066] The optimizing multiple task-specific encoder parameter sets and/or multiple linear layer parameter sets may comprise: optimizing the multiple task-specific encoder parameter sets and/or the multiple linear layer parameter sets with multiple training datasets for the multiple tasks, respectively.

[0067] The multiple training datasets may be generated through multiple single-task reference models. The multiple single-task reference models may be previously trained with multiple supervised datasets, respectively.

[0068] The multiple training datasets may be generated through a multi-task reference model. The multi-task reference model may be previously trained with multiple supervised datasets.

[0069] In an implementation, the method 500 may further comprise: generating an additional task-specific representation of the text input for an additional task based on the set of shared representations; and performing the additional task with the additional task-specific representation.

[0070] The generating an additional task-specific representation may comprise: generating the additional task-specific representation based on the set of shared representations through an additional task-specific encoder. The performing the additional task may comprise: performing the additional task with the additional task-specific representation through an additional task-specific linear layer.
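Adding an additional task may be sketched as registering one more encoder and linear layer pair next to the existing ones. The container class `MultiTaskModel`, its methods, the toy shared encoder, and the task names below are hypothetical and serve only to show that existing tasks produce the same outputs before and after the addition.

```python
class MultiTaskModel:
    """Toy container: one frozen shared encoder plus per-task modules."""

    def __init__(self, shared_encoder):
        self.shared_encoder = shared_encoder
        self.task_encoders = {}
        self.linear_layers = {}

    def add_task(self, name, encoder, linear):
        """Register an additional task without touching existing tasks."""
        if name in self.task_encoders:
            raise ValueError(f"task {name!r} already registered")
        self.task_encoders[name] = encoder
        self.linear_layers[name] = linear

    def run(self, text):
        shared_reps = self.shared_encoder(text)  # computed once per input
        return {
            name: self.linear_layers[name](self.task_encoders[name](shared_reps))
            for name in self.task_encoders
        }


def toy_shared_encoder(text):
    """Two toy 'layers': word count and character count."""
    return [[float(len(text.split()))], [float(len(text))]]


model = MultiTaskModel(toy_shared_encoder)
model.add_task("word_count", encoder=lambda reps: reps[0], linear=lambda rep: rep[0])
before = model.run("continual adaptation example")

# The additional task reuses the same shared representations.
model.add_task("char_count", encoder=lambda reps: reps[1], linear=lambda rep: rep[0])
after = model.run("continual adaptation example")
```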

[0071] It should be understood that the method 500 may further comprise any step/process for performing multiple tasks according to embodiments of the present disclosure described above.

[0072] FIG. 6 illustrates an exemplary apparatus 600 for performing multiple tasks according to an embodiment of the present disclosure.

[0073] The apparatus 600 may comprise: a text input obtaining module 610, for obtaining a text input; a shared representation generating module 620, for generating a set of shared representations of the text input in multiple layers; a task-specific representation generating module 630, for generating multiple task-specific representations of the text input based on the set of shared representations; and a task performing module 640, for performing the multiple tasks with the multiple task-specific representations, respectively. In addition, the apparatus 600 may further comprise any other modules configured for performing multiple tasks according to embodiments of the present disclosure described above.

[0074] FIG. 7 illustrates an exemplary apparatus 700 for performing multiple tasks according to an embodiment of the present disclosure.

[0075] The apparatus 700 may comprise at least one processor 710 and a memory 720 storing computer-executable instructions. The computer-executable instructions, when executed, may cause the at least one processor 710 to: obtain a text input, generate a set of shared representations of the text input in multiple layers, generate multiple task-specific representations of the text input based on the set of shared representations, and perform the multiple tasks with the multiple task-specific representations, respectively.

[0076] In an implementation, the generating a set of shared representations may comprise: generating the set of shared representations through a set of shared encoder layers in a shared encoder.

[0077] In an implementation, the generating multiple task-specific representations may comprise: generating the multiple task-specific representations based on the set of shared representations through multiple task-specific encoders, respectively.

[0078] In an implementation, the computer-executable instructions, when executed, may cause the at least one processor 710 to: generate an additional task-specific representation of the text input for an additional task based on the set of shared representations; and perform the additional task with the additional task-specific representation.

[0079] It should be understood that the processor 710 may further perform any other step/process of the method for performing multiple tasks according to embodiments of the present disclosure described above.

[0080] An embodiment of the present disclosure proposes a computer program product for performing multiple tasks, comprising a computer program that is executed by at least one processor for: obtaining a text input; generating a set of shared representations of the text input in multiple layers; generating multiple task-specific representations of the text input based on the set of shared representations; and performing the multiple tasks with the multiple task-specific representations, respectively. In addition, the computer program may further be executed for implementing any other step/process of a method for performing multiple tasks according to an embodiment of the present disclosure described above.

[0081] The embodiments of the present disclosure may be embodied in a non-transitory computer-readable medium. The non-transitory computer-readable medium may comprise instructions that, when executed, cause one or more processors to perform any operation of a method for performing multiple tasks according to an embodiment of the present disclosure as described above.

[0082] It should be appreciated that all the operations in the methods described above are merely exemplary, and the present disclosure is not limited to any operations in the methods or sequence orders of these operations, and should cover all other equivalents under the same or similar concepts. In addition, the articles "a" and "an" as used in this description and appended claims, unless otherwise specified or clear from the context that they are for the singular form, should generally be interpreted as meaning "one" or "one or more."

[0083] It should also be appreciated that all the modules in the apparatuses described above may be implemented in various approaches. These modules may be implemented as hardware, software, or a combination thereof. Moreover, any of these modules may be further functionally divided into sub-modules or combined together.

[0084] Processors have been described in connection with various apparatuses and methods. These processors may be implemented using electronic hardware, computer software, or any combination thereof. Whether such processors are implemented as hardware or software will depend upon the particular application and overall design constraints imposed on the system. By way of example, a processor, any portion of a processor, or any combination of processors presented in the present disclosure may be implemented with a microprocessor, microcontroller, digital signal processor (DSP), a field-programmable gate array (FPGA), a programmable logic device (PLD), a state machine, gated logic, discrete hardware circuits, and other suitable processing components configured to perform the various functions described throughout the present disclosure. The functions of a processor, any portion of a processor, or any combination of processors presented in this disclosure may be implemented with software executed by a microprocessor, a microcontroller, a DSP, or other suitable platforms.

[0085] Software shall be construed broadly to mean instructions, instruction sets, code, code segments, program code, programs, subprograms, software modules, applications, software applications, software packages, routines, subroutines, objects, threads of execution, procedures, functions, etc. The software may reside on a computer-readable medium. A computer-readable medium may include, e.g., a memory. The memory may be, e.g., a magnetic storage device (e.g., hard disk, floppy disk, magnetic strip), an optical disk, a smart card, a flash memory device, random access memory (RAM), read only memory (ROM), programmable ROM (PROM), erasable PROM (EPROM), electrically erasable PROM (EEPROM), a register, or a removable disk. Although a memory is shown separate from a processor in the various aspects presented throughout the present disclosure, the memory may be internal to the processor, e.g., a cache or register.

[0086] The previous description is provided to enable any person skilled in the art to practice the various aspects described herein. Various modifications to these aspects will be readily apparent to those skilled in the art, and the generic principles defined herein may be applied to other aspects. Thus, the claims are not intended to be limited to the aspects shown herein. All structural and functional equivalents to the elements of the various aspects described throughout the present disclosure that are known or later come to be known to those of ordinary skill in the art are expressly incorporated herein and encompassed by the claims.

Claims

1. A method for performing multiple tasks, comprising: obtaining a text input; generating a set of shared representations of the text input in multiple layers; generating multiple task-specific representations of the text input based on the set of shared representations; and performing the multiple tasks with the multiple task-specific representations, respectively.

2. The method of claim 1, wherein the generating a set of shared representations comprises: generating the set of shared representations through a set of shared encoder layers in a shared encoder.

3. The method of claim 2, wherein parameters of the shared encoder are fixed during performing of the multiple tasks.

4. The method of claim 1, wherein the generating multiple task-specific representations comprises: generating the multiple task-specific representations based on the set of shared representations through multiple task-specific encoders, respectively.

5. The method of claim 4, wherein each task-specific encoder in the multiple task-specific encoders is adapted to generate a task-specific representation of a text input in an arbitrary language or a task-specific representation of a text input in a target language.

6. The method of claim 1, wherein the performing the multiple tasks comprises: performing the multiple tasks with the multiple task-specific representations through multiple task-specific linear layers, respectively.

7. The method of claim 1, wherein the method is implemented through a multi-task model, and the multi-task model includes at least a shared encoder as well as multiple task-specific encoders and multiple task-specific linear layers for the multiple tasks.

8. The method of claim 7, wherein training of the multi-task model comprises: pre-training the shared encoder; and in the case of fixing parameters of the pre-trained shared encoder, optimizing multiple task-specific encoder parameter sets of the multiple task-specific encoders and/or multiple linear layer parameter sets of the multiple task-specific linear layers.

9. The method of claim 8, wherein the multiple task-specific encoder parameter sets are optimized independently of each other, and/or the multiple linear layer parameter sets are optimized independently of each other.

10. The method of claim 8, wherein the optimizing multiple task-specific encoder parameter sets and/or multiple linear layer parameter sets comprises: optimizing the multiple task-specific encoder parameter sets and/or the multiple linear layer parameter sets with multiple training datasets for the multiple tasks, respectively.

11. The method of claim 10, wherein the multiple training datasets are generated through multiple single-task reference models, and the multiple single-task reference models are previously trained with multiple supervised datasets, respectively.

12. The method of claim 10, wherein the multiple training datasets are generated through a multi-task reference model, and the multi-task reference model is previously trained with multiple supervised datasets.

13. The method of claim 1, further comprising: generating an additional task-specific representation of the text input for an additional task based on the set of shared representations; and performing the additional task with the additional task-specific representation.

14. An apparatus for performing multiple tasks, comprising: at least one processor; and a memory storing computer-executable instructions that, when executed, cause the at least one processor to: obtain a text input, generate a set of shared representations of the text input in multiple layers, generate multiple task-specific representations of the text input based on the set of shared representations, and perform the multiple tasks with the multiple task-specific representations, respectively.

15. A computer program product for performing multiple tasks, comprising a computer program that is executed by at least one processor for: obtaining a text input; generating a set of shared representations of the text input in multiple layers; generating multiple task-specific representations of the text input based on the set of shared representations; and performing the multiple tasks with the multiple task-specific representations, respectively.