CN111598253A - Training machine learning models using teacher annealing - Google Patents

Training machine learning models using teacher annealing

Info

Publication number
CN111598253A
Authority
CN
China
Prior art keywords
machine learning
student
output
training
model
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202010404322.8A
Other languages
Chinese (zh)
Inventor
Thang Minh Luong
Quoc V. Le
Kevin Stefan Clark
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Google LLC
Original Assignee
Google LLC
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Google LLC filed Critical Google LLC
Publication of CN111598253A publication Critical patent/CN111598253A/en
Pending legal-status Critical Current

Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N20/00Machine learning
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/02Neural networks
    • G06N3/08Learning methods
    • G06N3/088Non-supervised learning, e.g. competitive learning
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/02Neural networks
    • G06N3/04Architecture, e.g. interconnection topology
    • G06N3/044Recurrent networks, e.g. Hopfield networks
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/02Neural networks
    • G06N3/04Architecture, e.g. interconnection topology
    • G06N3/045Combinations of networks
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/02Neural networks
    • G06N3/08Learning methods
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/02Neural networks
    • G06N3/08Learning methods
    • G06N3/084Backpropagation, e.g. using gradient descent
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N7/00Computing arrangements based on specific mathematical models
    • G06N7/01Probabilistic graphical models, e.g. probabilistic networks

Abstract

Methods, systems, and apparatus, including computer programs encoded on a computer storage medium, for training a machine learning model using teacher annealing.

Description

Training machine learning models using teacher annealing
Technical Field
This specification relates to training machine learning models.
Background
A machine learning model receives an input and generates an output, e.g., a predicted output, based on the received input and on the values of the parameters of the model.
Neural networks are machine learning models that employ one or more layers of nonlinear units to predict an output for a received input. Some neural networks include one or more hidden layers in addition to an output layer. The output of each hidden layer is used as input to the next layer in the network, i.e., the next hidden layer or the output layer. Each layer of the network generates an output from a received input in accordance with current values of a respective set of parameters.
Disclosure of Invention
This specification generally describes a system, implemented as one or more computer programs on one or more computers in one or more locations, that uses teacher annealing to train machine learning models.
The machine learning model being trained will be referred to in this specification as the "student machine learning model," and the parameters of the student machine learning model, i.e., the parameters updated by the training, will be referred to as the "student parameters."
In particular, during training, the system uses both true value outputs and teacher outputs generated by one or more trained teacher machine learning models. By performing teacher annealing, the system repeatedly adjusts, during the training, a weight value that defines a weighting between the teacher outputs and the true value outputs used in computing the objective function for the training.
Particular embodiments of the subject matter described in this specification can be implemented to realize one or more of the following advantages.
This specification describes using teacher annealing to improve the training of student machine learning models when outputs generated by already-trained teacher models are available. By using teacher annealing, early in training the student model is mostly distilled, i.e., learns mainly from the outputs of the teacher model, to get as useful a training signal as possible. Near the end of training, the student model relies mostly on the true value outputs. This progression allows the student machine learning model to achieve performance on any given task that exceeds that of the teacher machine learning model for the task, even when the student machine learning model is a multi-task machine learning model and the teacher model is a single-task machine learning model specific to the task. In particular, using these techniques, a student model can exceed the performance of the teacher without training on any more training data than was used to train the teacher. Additionally, in a multi-task setting, the described techniques allow the student machine learning model to achieve robust multi-task gains across multiple tasks at once, i.e., relative to conventional training techniques for those tasks.
In addition, by utilizing the teacher model during training as described in this specification, the student model can be trained to perform as well as or even better than the teacher model even when the student model consumes fewer computing resources than the teacher model to generate outputs. For example, where the teacher model and the student model are both single-task models, the student model may have fewer parameters than the teacher model, or may generate student outputs with fewer iterations than are required by the teacher model. As a particular example, the student model and the teacher model may have similar architectures, but the student model may have fewer neural network layers, and therefore fewer parameters. As another particular example, the teacher model may be an autoregressive model that generates an output over multiple iterations, while the student model is a feedforward model that generates the student output in a single forward pass of the student model. As another example, when the teacher models are single-task models and the student model is a multi-task model, the student model may have many fewer parameters than the combined total of the parameters of the teacher models, because certain parameters are shared among all of the multiple tasks.
The details of one or more embodiments of the subject matter described in this specification are set forth in the accompanying drawings and the description below. Other features, aspects, and advantages of the subject matter will become apparent from the description, the drawings, and the claims.
Drawings
FIG. 1 illustrates an example machine learning model training system.
FIG. 2 is a flow diagram of an example process for training a student machine learning model.
FIG. 3 is a flow diagram of another example process for determining updates to student parameters using current weight values.
Like reference numbers and designations in the various drawings indicate like elements.
Detailed Description
FIG. 1 illustrates an example machine learning model training system 100. The machine learning model training system 100 is an example of a system implemented as a computer program on one or more computers in one or more locations in which the systems, components, and techniques described below may be implemented.
This training system 100 trains a machine learning model 110. The machine learning model 110 being trained will be referred to in this specification as a "student machine learning model" and the parameters 118 of the student machine learning model, i.e., the parameters updated by the training, will be referred to as "student parameters".
In some implementations, the student machine learning model 110 is a single-task machine learning model that is only configured to perform a single machine learning task, and the system 100 trains the student machine learning model 110 to perform the single machine learning task.
For example, the single task may be a natural language processing task, such as an entailment task, a paraphrase task, a text similarity task, a sentiment task, a grammaticality task, and so on.
As another example, a single task may be an image processing task, such as image classification, object detection, semantic segmentation, image enhancement, domain transfer, and the like.
As another example, a single task may be a health prediction task, where the input to the machine learning model is the electronic health record data of the patient, and the model output for a given patient may be a probability distribution over patient health related categories (e.g., possible diagnoses of the patient, possible future health events associated with the patient, etc.).
As another example, the single task may be an audio processing task, such as speech recognition, language recognition, hotword detection, and the like.
Thus, in these cases, the student machine learning model 110 receives input specific to a task and generates student output for that task.
In some other implementations, the student machine learning model 110 is a multi-tasking machine learning model configured to perform a plurality of machine learning tasks, and the system 100 trains the student machine learning model 110 to perform all of the plurality of machine learning tasks.
In other words, in these embodiments, the student machine learning model 110 receives inputs of a type common to all machine learning tasks and generates a model output comprising a plurality of student outputs, one for each of the machine learning tasks.
For example, the multiple tasks may be a number of different natural language processing tasks that can all be performed on the same input text sequence.
As another example, the multiple tasks may be a plurality of different image processing tasks that can all be performed on the same input image.
The machine learning model 110 may have any architecture that is appropriate for the type of model inputs processed by the machine learning model 110. For example, when the model inputs are images or audio data, the machine learning model 110 may be a convolutional neural network. When the model inputs are text sequences or sequences of other features, e.g., electronic health record features or audio features, the machine learning model 110 may be a self-attention-based neural network (e.g., a Transformer) or a recurrent neural network (e.g., a long short-term memory (LSTM) neural network). When the model inputs include multi-modal inputs, e.g., both images and text, the model 110 may include different types of neural network layers, e.g., both convolutional layers and self-attention or recurrent layers.
When the model 110 is a multi-task model, the model 110 may include initial layers that are shared among all of the tasks and a respective set of task-specific output layers for each of the multiple tasks. As another example, all layers of the model 110 may be shared between the tasks, and the model inputs processed by the model 110 may each include an identifier or other data that identifies the task to be performed by the model 110 for that model input.
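For illustration only, the following is a minimal sketch of the first arrangement above, i.e., shared initial layers with a respective task-specific output layer for each task, assuming a PyTorch-style API. The class name, layer sizes, and task names are hypothetical and chosen only to show the structure.

```python
import torch.nn as nn

class MultiTaskStudent(nn.Module):
    """Hypothetical multi-task student: shared initial layers plus per-task output layers."""

    def __init__(self, input_dim, hidden_dim, task_output_dims):
        super().__init__()
        # Initial layers shared among all of the tasks.
        self.shared_encoder = nn.Sequential(
            nn.Linear(input_dim, hidden_dim),
            nn.ReLU(),
            nn.Linear(hidden_dim, hidden_dim),
            nn.ReLU(),
        )
        # A respective task-specific output layer for each task.
        self.task_heads = nn.ModuleDict(
            {task: nn.Linear(hidden_dim, out_dim)
             for task, out_dim in task_output_dims.items()}
        )

    def forward(self, model_input, task):
        features = self.shared_encoder(model_input)
        return self.task_heads[task](features)

# Example with two hypothetical tasks over the same kind of input.
student = MultiTaskStudent(input_dim=128, hidden_dim=256,
                           task_output_dims={"entailment": 3, "sentiment": 2})
```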
The system 100 receives training data 140 for training the student machine learning model 110. More specifically, for a given task on which the system 100 is training the student machine learning model 110, the training data 140 includes training inputs 142 and, for each training input 142, a true value output 144 for the given task. The true value output 144 is the output that should be generated by the student machine learning model 110 for the given task by processing the training input 142. Stated differently, the true value output 144 is the output that is known or presumed to be accurate for the given task.
System 100 then trains student machine learning model 110 using truth outputs 144 and one or more teacher machine learning models 120.
In general, each teacher machine learning model 120 is a machine learning model that has been trained to perform one or more tasks that student machine learning model 110 is being trained to perform.
In particular, when the student machine learning model 110 is a single-task model that is being trained on a single task, the system 100 uses a single teacher model 120 that has been trained to perform the single task (or, equivalently, an ensemble of single-task teacher models whose outputs are combined to generate the teacher output).
In some of these cases, student machine learning model 110 may be a smaller model than teacher model 120, i.e., may have fewer parameters and a lower computational load (computational complexity) than teacher model 120, and system 100 may use a training process to generate a trained model that is more computationally efficient than a single teacher model 120, while having comparable or higher accuracy than teacher model 120.
For example, teacher model 120 may have a similar architecture as student model 110, but student model 110 may have fewer neural network layers, and thus fewer parameters.
In other of these cases, the teacher machine learning model 120 may be an autoregressive model that generates output over many time steps, while the student machine learning model 110 is a feedforward model that only requires a single time step to generate output. For example, the teacher model 120 may be an autoregressive convolutional, self-attentive, or recurrent neural network, while the student model 110 is a feedforward convolutional neural network.
In other of these cases, teacher model 120 and student model 110 may have the same architecture, and the system may use the training process to generate a trained model that performs better than teacher model 120.
In some cases, when student model 110 is a multitasking model, system 100 uses a single teacher model 120 that is also a multitasking model, and system 100 uses the same teacher model 120 for all tasks. That is, in these cases, the system may be training a model that is more computationally efficient than teacher model 120, or training a model that has the same architecture but improves the performance of teacher model 120, as described above.
In other cases, when the student model 110 is a multi-task model, the system 100 uses a plurality of different teacher models 120 for the different tasks on which the multi-task student model 110 is being trained. For example, the system 100 may use a respective single-task teacher model 120 for each of the multiple tasks that the student model 110 is being trained to perform. Thus, in these cases, the system 100 is using the training process to train a single model 110 that has many fewer parameters and requires much less computation than the combined parameters and computation of all of the teacher models 120.
In general, during training, training engine 150 in system 100 iteratively uses truth output 144 and teacher output 124 generated by one or more teacher models 120 to determine errors in student output 114 generated by student machine learning model 110. The training engine 150 then uses the error to update the values of the model parameters 118.
More specifically, the training engine 150 repeatedly adjusts a weight value, which defines a weighting between the teacher output 124 (i.e., the output generated by one of the teacher models 120) and the true value output 144 used in calculating the objective function for training, using a technique that will be referred to as teacher annealing. Training model 110 using teacher annealing is described in more detail below with reference to fig. 2-3.
In some embodiments, the training engine 150 pre-trains the student machine learning model 110 on an unsupervised task before training the student model 110 using teacher annealing. In some cases, pre-training the student model 110 can improve the performance of the final trained model without requiring additional labeled data. For example, when the task is a natural language processing task, the unsupervised task may be an unsupervised language modeling task, e.g., as described in Devlin, Jacob; Chang, Ming-Wei; Lee, Kenton; Toutanova, Kristina (11 October 2018), "BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding," arXiv:1810.04805v2.
Once the model 110 is trained, the system 100 may provide data specifying the trained model for processing new network inputs. That is, the system 100 may output trained values of the model parameters for later use in processing input using the trained model, such as by output to a user device or by storage in memory accessible to the system 100.
Alternatively or in addition to outputting the trained model data, the system 100 can instantiate an instance of the machine learning model having the trained values of the model parameters, e.g., receive inputs to be processed through an application programming interface (API) provided by the system, process the received inputs using the trained model to generate model outputs, and then provide the generated model outputs in response to the received inputs.
Fig. 2 is a flow diagram of an example process 200 for training a student machine learning model. For convenience, process 200 is described as being performed by a system of one or more computers located at one or more locations. For example, a machine learning model training system, such as the machine learning model training system 100 of FIG. 1, suitably programmed, may perform the process 200.
The system initializes a weight value that defines a weighting between the teacher output and the true output (step 202). In other words, the system sets the weight value to a fixed initial value before starting training.
When computing a weighted combination, the weight value determines the weights assigned to the teacher output for a given training input and to the true value output for the given training input.
In some cases, the weight value λ may be the weight assigned to the true value output in the weighted combination, with the weight assigned to the teacher output then being equal to 1 − λ. In these cases, the system initializes the weight value to a value equal to zero or within a threshold of zero, i.e., so that the teacher output is initially strongly favored over the true value output in the weighted combination.
In other cases, the weight value λ may be the weight assigned to the teacher output in the weighted combination, with the weight assigned to the true value output then being equal to 1 − λ. In these cases, the system initializes the weight value to a value equal to one or within a threshold of one, i.e., again so that the teacher output is initially strongly favored over the true value output in the weighted combination.
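For illustration, a minimal sketch of the two weighting conventions just described; the function name and the lam_weights_truth flag are hypothetical and are used only to make the two conventions explicit.

```python
def weighted_target(true_output, teacher_output, lam, lam_weights_truth=True):
    """Weighted combination of the true value output and the teacher output.

    If lam is the weight on the true value output, it is initialized at or near 0;
    if lam is the weight on the teacher output, it is initialized at or near 1.
    Either way, the teacher output is strongly favored at the start of training.
    """
    if lam_weights_truth:
        return lam * true_output + (1.0 - lam) * teacher_output
    return lam * teacher_output + (1.0 - lam) * true_output
```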
The system trains the student machine learning model on the training data until the criteria for updating the weight values are met (step 204).
In particular, the system repeatedly performs training iterations on mini-batches of the training data to optimize an objective function that measures, for any given training input, the error between (i) a weighted combination of the teacher output and the true value output for the given training input and (ii) the student output generated by the student machine learning model for the training input.
For example, when λ is the weight assigned to the true value output, the objective function ℓ for a task T may be expressed as:

ℓ( λ·y_T^i + (1 − λ)·f_T(x_T^i, θ_T), f_T(x_T^i, θ) )

where y_T^i is the true value output for a training input x_T^i for the task T, f_T(x_T^i, θ_T) is the teacher output generated for the training input x_T^i by a teacher model trained for the task T in accordance with the parameters θ_T of the teacher model, and f_T(x_T^i, θ) is the student output generated for the training input x_T^i by the student model in accordance with the current values of the student parameters θ.
The objective function measures the error in a manner that is appropriate for a given task T.
For example, when the task T is a classification task and the student output, the teacher output, and the true value output are all probability distributions, the objective function may be a cross-entropy loss that measures the cross-entropy between the weighted combination and the student output.
As another example, when the task T is a regression task and the student output, the teacher output, and the true value output are each an ordered collection of one or more regression values, the objective function may be a distance loss that measures a distance between the weighted combination and the student output, e.g., an L2 distance loss.
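For illustration, a minimal sketch of the objective for both of these cases, assuming a PyTorch-style API and the convention above in which λ is the weight assigned to the true value output; the function and argument names are hypothetical.

```python
import torch.nn.functional as F

def teacher_annealed_loss(student_output, teacher_output, true_output, lam,
                          task_type="classification"):
    # Weighted combination of the true value output and the teacher output,
    # with lam as the weight assigned to the true value output.
    target = lam * true_output + (1.0 - lam) * teacher_output
    if task_type == "classification":
        # Cross-entropy between the weighted combination (a soft target distribution)
        # and the student distribution; student_output here is a tensor of logits.
        log_student = F.log_softmax(student_output, dim=-1)
        return -(target * log_student).sum(dim=-1).mean()
    # Regression: L2 distance between the weighted combination and the student output.
    return F.mse_loss(student_output, target)
```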
In each training iteration, the system uses a mini-batch of training data to compute an update to the current values of the student parameters as of the iteration, and then applies the update to the current values of the student parameters. Updating the current values is described below with reference to fig. 3.
The system may determine that the criteria for updating the weight values are met at any suitable point during training. For example, the system may determine that the criteria are met every N training iterations, i.e., after every N updates applied to the student model. As another example, the system may determine that the criteria are met each time a certain amount of time elapses during training. As yet another example, the system may maintain a set of performance benchmarks and may determine that a criterion is met whenever, for example, performance of a student model as measured on a validation dataset reaches one of the performance benchmarks.
Once the criteria are met during training, the system updates the weight value to gradually favor the true value output in the weighting (step 206).
For example, when the true value output is assigned a weight equal to the weight value, the system may linearly increase the weight value at points during training, i.e., each time the criteria are met, so that the weight value moves toward 1 from its zero or near-zero starting point.
That is, the system updates the weight value according to a linear schedule to gradually favor the true value output, e.g., linearly increasing the weight after every N updates have been applied to the student parameters (when the true value output is assigned a weight equal to the weight value) or linearly decreasing the weight after every N updates have been applied (when the teacher output is assigned a weight equal to the weight value).
As another example, the system may update the weight value according to an exponential schedule to gradually favor the true value output, e.g., increasing the weight exponentially after every N updates have been applied to the student parameters (when the true value output is assigned a weight equal to the weight value), or decreasing the weight exponentially after every N updates have been applied (when the teacher output is assigned a weight equal to the weight value).
The process 200 then returns to step 204, i.e., the system continues to train the model until the criteria for updating the weight values are again met. The system may continue to repeat steps 204 and 206 until termination criteria for training are met, e.g., a specified number of iterations of step 204 have been performed, a specified amount of time has elapsed or the student model parameters have converged.
Thus, by repeatedly performing steps 204 and 206 during training, the system repeatedly updates the weight value to gradually favor the true value output in the weighting. That is, as training progresses, the system continues to adjust the weight value so that the true value outputs are given more and more weight in the weighted combination relative to the teacher outputs.
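For illustration, a minimal sketch of a linear teacher-annealing schedule consistent with the description above, where the weight value is the weight on the true value output and is advanced after every N updates applied to the student parameters; the function name and the particular value of N are hypothetical.

```python
def annealed_weight(num_updates_applied, total_updates, update_every_n=100):
    """Linear schedule for the weight assigned to the true value output."""
    # Advance the weight only after every N updates applied to the student parameters.
    completed_stages = num_updates_applied // update_every_n
    total_stages = max(total_updates // update_every_n, 1)
    lam = completed_stages / total_stages
    # The weight starts at (or near) 0 and moves toward 1 as training progresses.
    return min(max(lam, 0.0), 1.0)
```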
Fig. 3 is a flow diagram of an example process 300 for determining updates to current values of parameters of a student machine learning model. For convenience, the process 300 is described as being performed by a system of one or more computers located in one or more locations. For example, a machine learning model training system, e.g., the machine learning model training system 100 of fig. 1, appropriately programmed, can perform the process 300.
The system may perform process 300 on each of a small batch of training inputs to determine a respective update for each training input. The system may then combine the updates, for example by summing or averaging the updates, and apply the combined updates to the current values of the student parameters, for example by adding or subtracting the combined updates to or from the current values.
When the student model is a single-task model, the training inputs in the small batch will all be training inputs of a single task.
When the student model is a multi-task model, the mini-batch may include training inputs for different ones of the multiple tasks. For example, the system may sample each training input in the mini-batch from an overall training data set that includes training data for all of the tasks, or may sample a specified number of training inputs for each of the multiple tasks to generate the mini-batch.
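For illustration, a minimal sketch of the two mini-batch construction strategies just described; `datasets` is assumed to map each task to a list of (training input, true value output) pairs, and the function names are hypothetical.

```python
import random

def sample_mixed_minibatch(datasets, batch_size):
    """Sample each training input from an overall pool that includes all tasks."""
    pool = [(task, example) for task, examples in datasets.items() for example in examples]
    return random.sample(pool, batch_size)

def sample_per_task_minibatch(datasets, per_task):
    """Sample a specified number of training inputs from each of the multiple tasks."""
    batch = []
    for task, examples in datasets.items():
        batch.extend((task, example) for example in random.sample(examples, per_task))
    return batch
```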
The system obtains a training input for the first machine learning task and a true value output for the training input (step 302).
The system processes the training input using the trained first teacher machine learning model to generate a teacher output for the first machine learning task (step 304). As described above, the trained first teacher machine learning model has already been trained to perform the first machine learning task. In some cases, before starting training, the system pre-processes all of the training data for the first task using the trained first teacher machine learning model to generate teacher outputs for the training inputs in the training data. In other cases, the system uses the trained first teacher machine learning model to process training inputs online, i.e., as teacher outputs are needed during training.
The system determines a weighted combination of the teacher output for the first machine learning task and the true value output for the first training input according to the weight value (step 306). In other words, the system computes a weighted combination of the teacher output and the true value output by weighting the true value output and the teacher output according to the weight value as of the current iteration of the process 300.
The system processes the training inputs using the student machine learning model and according to the student parameters to generate student outputs for the first machine learning task (step 308).
The system then determines, by backpropagation, a gradient with respect to the student parameters of an objective function that measures the error between the weighted combination and the student output (step 310), and determines an update to the student parameters based on the gradient (step 312). For example, the system may determine the update according to the update rule of the optimizer being used to train the machine learning model, e.g., stochastic gradient descent, RMSProp, or Adam, i.e., by applying the update rule of the optimizer to the gradient.
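Putting the steps of the process 300 together, the following is a hedged sketch of a single update for the single-task case, assuming PyTorch-style student and teacher models, the `teacher_annealed_loss` sketch above, and teacher outputs computed online; all names are illustrative rather than prescriptive.

```python
import torch

def training_step(student, teacher, optimizer, minibatch, lam):
    """One update to the student parameters for a mini-batch of (input, true output) pairs."""
    optimizer.zero_grad()
    total_loss = 0.0
    for model_input, true_output in minibatch:
        with torch.no_grad():
            # Step 304: teacher output for the training input (the teacher is already trained).
            teacher_output = torch.softmax(teacher(model_input), dim=-1)
        # Step 308: student output for the training input, per the current student parameters.
        student_logits = student(model_input)
        # Steps 306 and 310: weighted combination and the error between it and the student output.
        total_loss = total_loss + teacher_annealed_loss(
            student_logits, teacher_output, true_output, lam)
    # Steps 310-312: gradient by backpropagation, then the optimizer's update rule (e.g., Adam).
    (total_loss / len(minibatch)).backward()
    optimizer.step()
```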
This specification uses the term "configured" in connection with system and computer program components. A system for one or more computers to be configured to perform particular operations or actions means that the system has installed thereon software, firmware, hardware or a combination of software, firmware, hardware that in operation causes the system to perform the operations or actions. By one or more computer programs to be configured to perform particular operations or actions is meant that the one or more programs include instructions which, when executed by a data processing apparatus, cause the apparatus to perform the operations or actions.
Embodiments of the subject matter and the functional operations described in this specification can be implemented in digital electronic circuitry, in tangibly embodied computer software or firmware, in computer hardware comprising the structures disclosed in this specification and their structural equivalents, or in combinations of one or more of them. Embodiments of the subject matter described in this specification can be implemented as one or more computer programs, i.e., one or more modules of computer program instructions, encoded on a tangible, non-transitory storage medium for execution by, or to control the operation of, data processing apparatus. The computer storage medium may be a machine-readable storage device, a machine-readable storage substrate, a random or serial access storage device, or a combination of one or more of them. Alternatively or in addition, the program instructions may be encoded on an artificially generated propagated signal, e.g., a machine-generated electrical, optical, or electromagnetic signal, that is generated to encode information for transmission to suitable receiver apparatus for execution by the data processing apparatus.
The term "data processing apparatus" refers to data processing hardware and encompasses all kinds of apparatus, devices, and machines for processing data, including by way of example a programmable processor, a computer, or multiple processors or computers. An apparatus may also be, or further comprise, special purpose logic circuitry, e.g., an FPGA (field programmable gate array) or an ASIC (application-specific integrated circuit). The apparatus can optionally include, in addition to hardware, code that creates an execution environment for the computer program, e.g., code that constitutes processor firmware, a protocol stack, a database management system, an operating system, or a combination of one or more of them.
A computer program, which may also be referred to or described as a program, software application, app, module, software module, script, or code, may be written in any form of programming language, including compiled or interpreted languages, or declarative or procedural languages; and it can be deployed in any form, including as a stand-alone program or as a module, component, subroutine, or other unit suitable for use in a computing environment. A program may, but need not, correspond to a file in a file system. A program can be stored in a portion of a file that holds other programs or data, such as one or more scripts stored in a markup language document; in a single file dedicated to the program or in multiple coordinated files, such as files storing one or more modules, sub programs, or portions of code. A computer program can be deployed to be executed on one computer or on multiple computers that are located at one site or distributed across multiple sites and interconnected by a data communication network.
In this specification, the term "database" is used broadly to refer to any set of data: the data need not be structured in any particular way, or at all, and it may be stored on a storage device in one or more locations. Thus, for example, an index database may include multiple data sets, each of which may be organized and accessed differently.
Similarly, the term "engine" is used broadly in this specification to refer to a software-based system, subsystem, or process that is programmed to perform one or more particular functions. Typically, the engine will be implemented as one or more software modules or components installed on one or more computers in one or more locations. In some cases, one or more computers will be dedicated to a particular engine; in other cases, multiple engines may be installed and run on the same computer or multiple computers.
The processes and logic flows described in this specification can be performed by one or more programmable computers executing one or more computer programs to perform functions by operating on input data and generating output. The processes and logic flows can also be performed by special purpose logic circuitry, e.g., an FPGA or an ASIC, or by a combination of special purpose logic circuitry and one or more programmed computers.
A computer suitable for executing a computer program may be based on a general purpose microprocessor, or a special purpose microprocessor, or both, or any other kind of central processing unit. Generally, a central processing unit will receive instructions and data from a read-only memory or a random access memory or both. The essential elements of a computer are a central processing unit for executing or carrying out instructions and one or more memory devices for storing instructions and data. The central processing unit and the memory can be supplemented by, or incorporated in, special purpose logic circuitry. Generally, a computer will also include, or be operatively coupled to receive data from or transfer data to, or both, one or more mass storage devices for storing data, e.g., magnetic, magneto-optical disks, or optical disks. However, a computer need not have such devices. Further, the computer may be embedded in another device, e.g., a mobile telephone, a Personal Digital Assistant (PDA), a mobile audio or video player, a game controller, a Global Positioning System (GPS) receiver, or a portable storage device, e.g., a Universal Serial Bus (USB) flash drive, etc.
Computer readable media suitable for storing computer program instructions and data include all forms of non-volatile memory, media and storage devices, including by way of example semiconductor memory devices, e.g., EPROM, EEPROM, and flash memory devices; magnetic disks, such as internal hard disks or removable disks; magneto-optical disks; and CD ROM and DVD-ROM disks.
To provide for interaction with a user, embodiments of the subject matter described in this specification can be implemented on a computer having a display device, e.g., a CRT (cathode ray tube) or LCD (liquid crystal display) monitor, for displaying information to the user and a keyboard and a pointing device, e.g., a mouse or a trackball, by which the user can provide input to the computer. Other kinds of devices may also be used to provide for interaction with a user; for example, feedback provided to the user can be any form of sensory feedback, such as visual feedback, auditory feedback, or tactile feedback; and input from the user can be received in any form, including acoustic, speech, or tactile input. Further, the computer may interact with the user by sending documents to and receiving documents from the device used by the user; for example, by sending a web page to a web browser on the user's device in response to receiving a request from the web browser. In addition, the computer may interact with the user by sending a text message or other form of message to a personal device, such as a smartphone that is running a messaging application, and then receiving a response message from the user.
The data processing apparatus for implementing machine learning models can also include dedicated hardware accelerator units, e.g., for processing common and compute-intensive parts of machine learning training or production, i.e., inference, workloads.
The machine learning model may be implemented and deployed using a machine learning framework. The machine learning framework is, for example, a TensorFlow framework, a Microsoft Cognitive Toolkit framework, an Apache Singa framework, or an Apache MXNet framework.
Embodiments of the subject matter described in this specification can be implemented in a computing system that includes a back-end component, e.g., as a data server; or includes a middleware component, such as an application server; or includes a front-end component, such as a client computer having a graphical user interface, a web browser, or an app with which a user can interact with an implementation of the subject matter described in this specification; or any combination of one or more such back-end, middleware, or front-end components. The components of the system can be interconnected by any form or medium of digital data communication, e.g., a communication network. Examples of communication networks include a Local Area Network (LAN) and a Wide Area Network (WAN), such as the Internet.
The computing system may include clients and servers. A client and server are generally remote from each other and typically interact through a communication network. The relationship of client and server arises by virtue of computer programs running on the respective computers and having a client-server relationship to each other. In some embodiments, the server transmits data, e.g., HTML pages, to the user device, e.g., for the purpose of displaying data to a user interacting with the device as a client and receiving user input from the user. Data generated at the user device, e.g., results of user interactions, may be received at the server from the device.
While this specification contains many specific implementation details, these should not be construed as limitations on the scope of any invention or of what may be claimed, but rather as descriptions of features that may be specific to particular embodiments of particular inventions. Certain features that are described in this specification in the context of separate embodiments can also be implemented in combination in a single embodiment. Conversely, various features that are described in the context of a single embodiment can also be implemented in multiple embodiments separately or in any suitable subcombination. Moreover, although features may be described above as acting in certain combinations and even initially claimed as such, one or more features from a claimed combination can in some cases be excised from the combination, and the claimed combination may be directed to a subcombination or variation of a subcombination.
Similarly, while operations are depicted in the drawings and described in the claims in a particular order, this should not be understood as requiring that such operations be performed in the particular order shown or in sequential order, or that all illustrated operations be performed, to achieve desirable results. In some cases, multitasking and parallel processing may be advantageous. Moreover, the separation of various system modules and components in the embodiments described above should not be understood as requiring such separation in all embodiments, and it should be understood that the described program components and systems can generally be integrated together in a single software product or packaged into multiple software products.
Specific embodiments of the subject matter have been described. Other embodiments are within the scope of the following claims. For example, the actions recited in the claims can be performed in a different order and still achieve desirable results. As one example, the processes depicted in the accompanying figures do not necessarily require the particular order shown, or sequential order, to achieve desirable results. In some cases, multitasking and parallel processing may be advantageous.

Claims (15)

1. A method of training a student machine learning model having a plurality of student parameters to perform at least a first machine learning task, wherein the student machine learning model is configured to receive model inputs and process the model inputs in accordance with the student parameters to generate an output comprising student outputs for the first machine learning task, the method comprising:
initializing a weight value defining a weighting between the teacher output and the true value output;
training the student machine learning model on training data, the training data comprising a plurality of first training inputs and a respective true value output for each first training input of the first machine learning task, the training comprising, for each first training input:
processing the first training input using a trained first teacher machine learning model to generate a teacher output for the first machine learning task, wherein the trained first teacher machine learning model has been trained to perform the first machine learning task;
determining a weighted combination of the teacher output for the first machine learning task and the true value output for the first training input according to the weight values;
processing the first training input using the student machine learning model and in accordance with the student parameters to generate student output for the first machine learning task;
determining a gradient of a student parameter with respect to an objective function that measures an error between the weighted combination and the student output; and
determining an update to the student parameter based on the gradient; and
during the training, the weight value is repeatedly updated to gradually favor the true value output in the weighting.
2. The method of claim 1, wherein repeatedly updating the weight values comprises:
repeatedly linearly increasing the weight value during the training.
3. The method of claim 1, wherein the student output, the teacher output, and the truth output are probability distributions, and wherein the objective function is a cross-entropy loss that measures cross-entropy between the weighted combination and the student output.
4. The method of claim 1, wherein the first machine learning task is a regression task, wherein the student output, the teacher output, and the truth output are each an ordered set of one or more regression values, and wherein the objective function is a distance loss measuring a distance between the weighted combination and the student output.
5. The method of claim 1, wherein the student machine learning model is a single-task model trained only to perform the first machine learning task, and wherein the model output includes only student output for the first machine learning task.
6. The method of claim 1,
wherein the student machine learning model is a multi-task model trained to perform a plurality of machine learning tasks including the first machine learning task,
wherein the model output comprises a respective student output for each of the plurality of machine learning tasks,
wherein the training data comprises a respective plurality of training inputs for each of the plurality of machine learning tasks and, for each of the plurality of training inputs for that machine learning task, a respective true value output for that machine learning task, and
wherein the training comprises training the student machine learning model on the training data to perform all of the plurality of machine learning tasks.
7. The method of claim 6, wherein the trained first teacher machine learning model is a multi-tasking model that has been trained on the plurality of machine learning tasks, and
wherein, for each of the plurality of machine learning tasks and for each of the plurality of training inputs for that machine learning task, the training further comprises:
processing the training input using the trained first teacher machine learning model to generate a teacher output for the machine learning task;
determining a weighted combination of the teacher output for the machine learning task and the true value output for the training input according to the weight values;
processing the training input using the student machine learning model and in accordance with the student parameters to generate student output for the machine learning task;
determining a gradient of a student parameter with respect to an objective function that measures an error between the weighted combination and the student output; and
determining an update to the student parameter based on the gradient.
8. The method of claim 6, wherein the trained first teacher machine learning model is a single task model, and
wherein, for each of the plurality of tasks and for each training input for that task, the training further comprises:
processing the training input using a different trained teacher machine learning model specific to the machine learning task to generate a teacher output for the machine learning task;
determining a weighted combination of the teacher output for the machine learning task and the true value output for the training input according to the weight values;
processing the training input using the student machine learning model and in accordance with the student parameters to generate student output for the machine learning task;
determining a gradient of a student parameter with respect to an objective function that measures an error between the weighted combination and the student output; and
determining an update to the student parameter based on the gradient.
9. The method of claim 6, wherein the student machine learning model is a neural network comprising an encoder neural network shared among all of the plurality of machine learning tasks and a respective output neural network for each of the plurality of machine learning tasks.
10. The method of claim 6, wherein the model input of the student machine learning model is a text sequence of a natural language, and wherein the plurality of machine learning tasks are different natural language processing tasks executable on the text sequence.
11. The method according to any one of claims 1-10, further comprising:
pre-training the student machine learning model on an unsupervised task prior to the training.
12. The method of claim 11, wherein the first machine learning task is a natural language processing task and the unsupervised task is an unsupervised language modeling task.
13. The method according to any one of claims 1-10, further comprising:
after training the student machine learning model to perform at least the first machine learning task:
receiving a new network input for the first machine learning task; and
processing the new network inputs using the trained student machine learning model to generate new network outputs for the new network inputs of the first machine learning task.
14. A system comprising one or more computers and one or more storage devices storing instructions that, when executed by the one or more computers, cause the one or more computers to perform operations of the respective methods of any of claims 1-13.
15. One or more computer storage media storing instructions that, when executed by one or more computers, cause the one or more computers to perform operations of the respective methods of any of claims 1-13.
CN202010404322.8A 2019-05-13 2020-05-13 Training machine learning models using teacher annealing Pending CN111598253A (en)

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
US201962847220P 2019-05-13 2019-05-13
US62/847,220 2019-05-13

Publications (1)

Publication Number Publication Date
CN111598253A true CN111598253A (en) 2020-08-28

Family

ID=72185419

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202010404322.8A Pending CN111598253A (en) 2019-05-13 2020-05-13 Training machine learning models using teacher annealing

Country Status (2)

Country Link
US (2) US11488067B2 (en)
CN (1) CN111598253A (en)

Cited By (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN116542310A (en) * 2023-07-01 2023-08-04 帕西尼感知科技(张家港)有限公司 Model training and motion instruction prediction method, device and system for robot

Families Citing this family (7)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US11488067B2 (en) * 2019-05-13 2022-11-01 Google Llc Training machine learning models using teacher annealing
US11501081B1 (en) 2019-12-31 2022-11-15 Meta Platforms, Inc. Methods, mediums, and systems for providing a model for an end-user device
CN112529153B (en) * 2020-12-03 2023-12-22 平安科技(深圳)有限公司 BERT model fine tuning method and device based on convolutional neural network
US20220293247A1 (en) * 2021-03-12 2022-09-15 Siemens Healthcare Gmbh Machine learning for automatic detection of intracranial hemorrhages with uncertainty measures from ct images
WO2023158881A1 (en) * 2022-02-18 2023-08-24 Google Llc Computationally efficient distillation using generative neural networks
KR20240014374A (en) 2022-07-25 2024-02-01 삼성전자주식회사 Computer system for compressing transformer model and quantization training method thereof
US20240070232A1 (en) * 2022-08-22 2024-02-29 Nec Laboratories America, Inc. Snr detection with few-shot trained models

Citations (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
WO2017201511A1 (en) * 2016-05-20 2017-11-23 Google Llc Training machine learning models
WO2018126213A1 (en) * 2016-12-30 2018-07-05 Google Llc Multi-task learning using knowledge distillation
WO2018217635A1 (en) * 2017-05-20 2018-11-29 Google Llc Application development platform and software development kits that provide comprehensive machine learning services
CN109598342A (en) * 2018-11-23 2019-04-09 中国运载火箭技术研究院 A kind of decision networks model is from game training method and system

Family Cites Families (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20170132528A1 (en) * 2015-11-06 2017-05-11 Microsoft Technology Licensing, Llc Joint model training
US11488067B2 (en) * 2019-05-13 2022-11-01 Google Llc Training machine learning models using teacher annealing

Patent Citations (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
WO2017201511A1 (en) * 2016-05-20 2017-11-23 Google Llc Training machine learning models
WO2018126213A1 (en) * 2016-12-30 2018-07-05 Google Llc Multi-task learning using knowledge distillation
WO2018217635A1 (en) * 2017-05-20 2018-11-29 Google Llc Application development platform and software development kits that provide comprehensive machine learning services
CN109598342A (en) * 2018-11-23 2019-04-09 中国运载火箭技术研究院 A kind of decision networks model is from game training method and system

Non-Patent Citations (2)

* Cited by examiner, † Cited by third party
Title
Li Lina; Ma Jun; Liang De; Li Wenhao: "Loss model training based on relevance vector machine and simulated annealing" *
Li Lina; Ma Jun; Liang De; Li Wenhao: "Loss model training based on relevance vector machine and simulated annealing", Computer Engineering and Design, no. 01, 16 January 2016 (2016-01-16), pages 139 - 145 *

Cited By (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN116542310A (en) * 2023-07-01 2023-08-04 帕西尼感知科技(张家港)有限公司 Model training and motion instruction prediction method, device and system for robot
CN116542310B (en) * 2023-07-01 2023-09-22 帕西尼感知科技(张家港)有限公司 Model training and motion instruction prediction method, device and system for robot

Also Published As

Publication number Publication date
US20200364617A1 (en) 2020-11-19
US20230049747A1 (en) 2023-02-16
US11488067B2 (en) 2022-11-01
US11922281B2 (en) 2024-03-05

Similar Documents

Publication Publication Date Title
US10748065B2 (en) Multi-task neural networks with task-specific paths
EP3602409B1 (en) Selecting actions using multi-modal inputs
CN111598253A (en) Training machine learning models using teacher annealing
CN111602148B (en) Regularized neural network architecture search
EP3696737B1 (en) Training action selection neural networks
US11790214B2 (en) Mixture of experts neural networks
US11875262B2 (en) Learning neural network structure
CN111406267A (en) Neural architecture search using performance-predictive neural networks
JP2020521205A (en) Multi-task neural network system with task-specific and shared policies
KR20200110400A (en) Learning data augmentation policy
KR20190028531A (en) Training machine learning models for multiple machine learning tasks
US20210271970A1 (en) Neural network optimizer search
US20220092416A1 (en) Neural architecture search through a graph search space
CN117709426A (en) Method, system and computer storage medium for training machine learning model
US20240127058A1 (en) Training neural networks using priority queues
US20200410365A1 (en) Unsupervised neural network training using learned optimizers
US11755879B2 (en) Low-pass recurrent neural network systems with memory
JP2020506488A (en) Batch renormalization layer
CN113826125A (en) Training machine learning models using unsupervised data enhancement
US20220230065A1 (en) Semi-supervised training of machine learning models using label guessing
CN113348472A (en) Convolutional neural network with soft kernel selection
CN114492758A (en) Training neural networks using layer-by-layer losses
CN114730380A (en) Deep parallel training of neural networks
US11886976B1 (en) Efficient decoding of output sequences using adaptive early exiting
US20230325658A1 (en) Conditional output generation through data density gradient estimation

Legal Events

Date Code Title Description
PB01 Publication
PB01 Publication
SE01 Entry into force of request for substantive examination
SE01 Entry into force of request for substantive examination