WO2023059439A1 - Hardware-aware progressive training of machine learning models - Google Patents

Info

Publication number
WO2023059439A1
Authority
WO
WIPO (PCT)
Prior art keywords
training
machine learning
model
learning model
hardware
Prior art date
Application number
PCT/US2022/044201
Other languages
English (en)
Inventor
Sheng Li
Mingxing TAN
Norman Paul Jouppi
Quoc V. LE
Liqun Cheng
Ruoming Pang
Parthasarathy Ranganathan
Original Assignee
Google Llc
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Priority claimed from US17/899,728 (US20230108177A1)
Application filed by Google Llc
Priority to KR1020237039206A (KR20230170752A)
Priority to CN202280036704.7A (CN117999560A)
Priority to EP22787100.1A (EP4323928A1)
Publication of WO2023059439A1

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 Computing arrangements based on biological models
    • G06N3/02 Neural networks
    • G06N3/04 Architecture, e.g. interconnection topology
    • G06N3/044 Recurrent networks, e.g. Hopfield networks
    • G06N3/045 Combinations of networks
    • G06N3/06 Physical realisation, i.e. hardware implementation of neural networks, neurons or parts of neurons
    • G06N3/063 Physical realisation, i.e. hardware implementation of neural networks, neurons or parts of neurons using electronic means
    • G06N3/08 Learning methods
    • G06N20/00 Machine learning
    • G06N20/10 Machine learning using kernel methods, e.g. support vector machines [SVM]

Definitions

  • Neural networks are machine learning models that include one or more layers of nonlinear operations to predict an output for a received input.
  • Some neural networks include one or more hidden layers.
  • The output of each hidden layer can be input to another hidden layer or the output layer of the neural network.
  • Each layer of the neural network can generate a respective output from a received input according to values for one or more model parameters for the layer.
  • The model parameters can be weights and/or bias values that are determined through a training process to cause the neural network to generate accurate output when evaluated using a performance or loss function.
  • Increasing the speed of the training process is critical to improving machine learning models. There exist a number of platform/hardware optimizations that can provide trade-offs between training speed and quality. However, because the quality of machine learning models is so important, hardware techniques are not applied to speed up the training process unless there is no loss in quality, leading to many performance optimization opportunities becoming unavailable.
  • Progressive learning or training is a technique for training machine learning models by adjusting the model or a training process for training the model, while training the model.
  • A progressive training system can generate and apply different values of both model-level and hardware-level performance settings at different stages of a training process to maintain model quality according to predetermined minimum thresholds while improving the speed at which the progressive training system trains the model.
  • Model-level performance settings correspond to characteristics of the machine learning model being trained or parameters of the training process applied.
  • The training system can adjust to different values of model-level performance settings during training, which do not depend on the computing resources used to train the model.
  • Hardware-level performance settings correspond to hardware features of computing resources used to train the machine learning model.
  • Hardware-level performance settings can take on different values to enable, disable, or modify different hardware features during training applied by the training system.
  • The training system leverages existing hardware features to adjust both hardware- and model-level performance settings during training of a machine learning model at different stages of the training process.
  • The training system can identify and apply complementary values of hardware- and model-level performance settings to generate training schedules that improve model training speed at earlier stages of training, while maintaining or improving model quality at later stages of training.
  • Aspects of the disclosure provide for using available computing resources and their respective available hardware features, such as hardware parallelism, operand numerical precision, and varying levels of intra- and inter-device communication, to train a model faster than with progressive training alone.
  • The training system can be scaled as needed to leverage hardware features for computing resources of a computing platform of connected devices, to further improve the speed at which a training process is performed.
  • The training system can generate and store training schedules to be queried later for reuse in training other machine learning models or a previously trained model.
  • The training system can use portions of previously generated training schedules for retraining models on new training data, for example training schedules focusing on model quality improvements before increasing training speed.
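The storing and later querying of training schedules described above can be sketched as a simple lookup. This is a minimal illustration, not the disclosed implementation: the function `query_schedules`, the `descriptor` keys, and the schedule identifiers are all hypothetical names chosen for the example, and exact-match filtering is an assumed matching rule.

```python
def query_schedules(library, query):
    """Return ids of stored schedules whose descriptor matches every query key.

    `library` is a list of dicts with hypothetical keys "schedule_id" and
    "descriptor"; exact matching is an assumption made for this sketch.
    """
    return [
        entry["schedule_id"]
        for entry in library
        if all(entry["descriptor"].get(key) == value for key, value in query.items())
    ]


# Hypothetical stored schedules, each described by model, task, and resources.
library = [
    {"schedule_id": "sched-1",
     "descriptor": {"model_type": "neural_network", "task": "classification", "accelerator": "gpu"}},
    {"schedule_id": "sched-2",
     "descriptor": {"model_type": "neural_network", "task": "regression", "accelerator": "gpu"}},
]

matches = query_schedules(library, {"task": "classification", "accelerator": "gpu"})
```

A real library would likely rank partial matches rather than require equality on every key; the sketch only shows the query-by-description idea.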
  • Aspects of the disclosure also provide for searching for neural architectures that can be modified during training according to a training schedule, for example with less computational overhead than modifying other candidate architectures, and/or to take more advantage of hardware-aware progressive training to realize increased training speeds over other architectures.
  • An aspect of the disclosure is directed to a system, including one or more processors configured to receive a request to train a machine learning model; receive, by the one or more processors, a training schedule specifying a plurality of values for one or more hardware-level performance settings and one or more model-level performance settings; train the machine learning model in accordance with a training process, one or more hardware-level performance settings, and one or more model-level performance settings set to different values of the plurality of values of the training schedule at different points in time during training; and in response to receipt of the request, send the trained machine learning model to one or more computing devices.
  • An aspect of the disclosure is directed to a method, including: receiving, by one or more processors, a request to train a machine learning model, the one or more processors configured to train the machine learning model in accordance with one or more hardware-level performance settings and one or more model-level performance settings; receiving, by the one or more processors, a training schedule specifying a plurality of values for the one or more hardware-level performance settings and the one or more model-level performance settings; training, by the one or more processors, the machine learning model in accordance with a training process and the one or more hardware-level performance settings and one or more model-level performance settings set to different values of the plurality of values of the training schedule at different points in time during training; and in response to receiving the request, sending, by the one or more processors, the trained machine learning model to one or more computing devices.
  • An aspect of the disclosure is directed to one or more non-transitory computer-readable storage media encoded with instructions that when executed by one or more processors configured to train a machine learning model in accordance with one or more hardware-level performance settings and one or more model-level performance settings, cause the one or more processors to perform operations including: receiving a request to train a first machine learning model; receiving a training schedule specifying a plurality of values for the one or more hardware-level performance settings and the one or more model-level performance settings; training the first machine learning model in accordance with a training process and the one or more hardware-level performance settings and one or more model-level performance settings set to different values of the plurality of values of the training schedule at different points in time during training; and in response to receiving the request, sending the trained first machine learning model to one or more computing devices.
  • Aspects of the disclosure can include one or more of the following features.
  • An aspect of the disclosure includes all of the following features, in combination.
  • The one or more model-level performance settings can include one or more of: an input data size for input data to the machine learning model, one or more model hyperparameters specifying the size or shape of the machine learning model, and one or more training process hyperparameters modifying the training process implemented by the one or more processors for training the machine learning model.
  • The one or more hardware-level performance settings can include settings for adjusting intra- or inter-data communication between the one or more processors.
  • The one or more processors can include a plurality of processors logically or physically grouped into a plurality of groups, and the one or more hardware-level performance settings can include settings for the rate of inter-data communication between processors in different groups.
  • The one or more hardware-level performance settings can include settings for adjusting numerical precision of operations performed by the one or more processors while training the machine learning model in accordance with the training process.
  • The one or more processors can be further configured to: set the one or more hardware-level and model-level performance settings to first values of the plurality of values of the training schedule; and at a first point in time after initiation of the training of the machine learning model, adjust the one or more hardware-level and one or more model-level performance settings to second values of the plurality of values different from the first values.
  • The one or more processors can be further configured to generate a training schedule using a training schedule machine learning model, the training schedule machine learning model: trained to generate training schedules from one or more input parameters at least partially describing one or more of the machine learning model, the machine learning task, and computing resources available for training the machine learning model, and trained using one or more training examples of training schedules, each example training schedule labeled with respective data at least partially describing one or more respective input parameters used to generate the example training schedule, the training speed, and the model quality of a respective machine learning model trained in accordance with the training process and the example training schedule.
  • The machine learning model can be a neural network having a neural architecture selected from a plurality of candidate neural architectures, the selection of the neural architecture based at least partially on comparison of estimated respective training speeds and respective model qualities of neural networks: trained in accordance with the training process and a respective training schedule, and having a respective candidate neural architecture of the plurality of candidate neural architectures.
  • The one or more processors can be further configured to: send a query to one or more memory devices storing a plurality of candidate training schedules, the query comprising data at least partially describing one or more of the machine learning model, the machine learning task, and computing resources available for training the machine learning model; and receive the training schedule from the plurality of candidate training schedules in response to the query.
  • An aspect of the disclosure is directed to a method, including performing, by one or more processors, a neural architecture search over a plurality of candidate neural architectures to identify a target neural architecture, including: estimating at least the training speed and model quality of a first neural network having a first candidate neural architecture of the plurality of candidate neural architectures and trained in accordance with a training process and one or more hardware-level performance settings and one or more model-level performance settings set to different values of a first plurality of values during training, and selecting the first candidate neural architecture as the target neural architecture based at least on a comparison of the estimated training speed and estimated model quality of the first neural network to respective estimated training speeds and respective estimated model qualities of one or more second neural networks: each having a respective second candidate neural architecture, and trained in accordance with the training process and the one or more hardware-level performance settings and the one or more model-level performance settings set to different values of a respective second plurality of values during training.
  • The method can further include training, by the one or more processors, the first neural network in accordance with a third plurality of values of a training schedule; and sending, by the one or more processors, the trained first neural network to one or more computing devices.
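The selection step of the neural architecture search aspect compares estimated training speed and estimated model quality across candidates. The sketch below shows one possible comparison rule, which is an assumption and not taken from the disclosure: keep candidates that meet a minimum estimated quality, then pick the one with the lowest estimated time per training step. The names `select_architecture`, `estimated_quality`, and `estimated_step_seconds` are hypothetical.

```python
def select_architecture(candidates, min_quality):
    """Pick the fastest candidate whose estimated quality meets the threshold.

    The threshold-then-fastest rule is an assumed comparison for illustration.
    """
    eligible = [c for c in candidates if c["estimated_quality"] >= min_quality]
    if not eligible:
        raise ValueError("no candidate meets the quality threshold")
    return min(eligible, key=lambda c: c["estimated_step_seconds"])


# Hypothetical estimates for three candidate architectures, each assumed to be
# trained under its own hardware-aware progressive training schedule.
candidates = [
    {"name": "arch_a", "estimated_quality": 0.91, "estimated_step_seconds": 0.40},
    {"name": "arch_b", "estimated_quality": 0.93, "estimated_step_seconds": 0.55},
    {"name": "arch_c", "estimated_quality": 0.85, "estimated_step_seconds": 0.20},
]

target = select_architecture(candidates, min_quality=0.90)
```

Here `arch_c` is fastest but falls below the quality threshold, so the faster of the two remaining candidates is chosen.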
  • FIG. 1 is a block diagram of an example training system, according to aspects of the disclosure.
  • FIG. 2 is a flowchart of an example process for hardware-aware progressive training of a machine learning model according to aspects of the disclosure.
  • FIG. 3A is a flowchart of an example process for training a machine learning model to generate training schedules for hardware-aware progressive training according to aspects of the disclosure.
  • FIG. 3B is a flowchart of an example process for querying and applying a pre-generated training schedule from one or more memory devices storing multiple training schedules according to aspects of the disclosure.
  • FIG. 4 is a flowchart of an example process for searching for neural architectures, according to aspects of the disclosure.
  • FIG. 5 is a block diagram of an example computing environment implementing the example training system according to aspects of the disclosure.
  • Hardware-aware progressive training refers to the application of a variety of different values to both model-level and hardware-level performance settings during the training of a machine learning model, which are adjusted to different values over the course of training.
  • A training system can generate and apply a training schedule specifying multiple values of model-level and hardware-level performance settings applied at different points during training.
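A training schedule of this kind can be pictured as an ordered list of points, each carrying the hardware-level and model-level values that take effect from a given training step. The sketch below is illustrative only: the class names, the setting keys `precision_bits` and `input_size`, and the specific values are assumptions, not settings named in the disclosure.

```python
from bisect import bisect_right
from dataclasses import dataclass


@dataclass(frozen=True)
class SchedulePoint:
    """Values that take effect at `start_step` (all names are hypothetical)."""
    start_step: int
    hardware: dict  # hardware-level performance settings
    model: dict     # model-level performance settings


@dataclass
class TrainingSchedule:
    points: list  # SchedulePoint entries sorted by start_step

    def values_at(self, step: int) -> SchedulePoint:
        """Return the most recent point whose start_step is <= step."""
        starts = [p.start_step for p in self.points]
        index = bisect_right(starts, step) - 1
        if index < 0:
            raise ValueError("step precedes the first schedule point")
        return self.points[index]


# Two points: faster, lower-precision settings first, then quality-oriented ones.
schedule = TrainingSchedule(points=[
    SchedulePoint(0, {"precision_bits": 16}, {"input_size": 128}),
    SchedulePoint(10_000, {"precision_bits": 32}, {"input_size": 224}),
])
```

A training loop would call `schedule.values_at(step)` and reconfigure the hardware and model only when the returned point changes.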
  • A training system configured for hardware-aware progressive training as described herein can improve the speed at which the training system trains the model during earlier points of the training process, as well as improve the model quality of the model being trained during later points of the training process, over other approaches in which hardware-aware progressive training is not applied.
  • Hardware-level performance settings can include settings for adjusting the performance of computing resources used to train the machine learning model. Values for hardware-level performance settings can be adjusted for enabling, disabling, or modifying certain hardware features available on computing resources.
  • Computing resources can be any of a variety of combinations of computing devices and memory devices, which for example can be part of a computing platform.
  • The computing platform can logically organize how devices communicate among one another, the organization of which can also be modified through different values of corresponding hardware-level performance settings.
  • These hardware features can be selectively applied by the training system to adjust the performance of the computing resources in executing operations as part of a training process. For example, hardware features applied in accordance with different values of corresponding hardware-level performance settings can cause the computing resources to execute the operations faster, measured in processing cycles, clock time, etc., at the cost of accuracy in performing those operations. Other values for hardware-level performance settings cause the computing resources to execute operations such as different numerical calculations accurately, at the cost of additional processing cycles, processing/memory utilization, and/or time, etc. As a result, the model trained will have improved model quality, for example measured in model accuracy or recall rate.
  • Model-level performance settings applied at different values by the training system modify the machine learning model or the training process itself.
  • Model-level performance settings do not affect the hardware or hardware features used by the training system during training, but depending on values taken for these settings, can affect the quality of the resulting trained model and the speed at which the model is trained.
  • Hardware-aware progressive training provides for more effective use of available configurations of both model- and hardware-level features available on a platform training a model, to reach higher training speeds and sustained or improved model quality at different stages of training that may otherwise not be reached through progressive training alone.
  • The training system can train a machine learning model over multiple stages.
  • A training stage can be defined as a number of training steps, with each training step representing a full forward and backward pass to update model parameter values based on calculated error.
  • The number of training steps in a training stage can vary, for example from thousands to millions.
  • The number of training steps can vary based on, for example, the total number of training steps for all of the stages of training and/or the size of the training dataset.
  • Stages can be defined as periods of time shorter than the total training time for training the model, a number of epochs or number of times an entire training set is processed by the model, and/or by certain model performance milestones achieved, such as a threshold recall rate or any threshold based on a metric for measuring model accuracy.
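For the case where stages are defined by a number of training steps, the mapping from a step to its stage can be sketched as follows. The function name `stage_of` and the example stage lengths are assumptions for illustration; stages defined by time, epochs, or quality milestones would need a different rule.

```python
from bisect import bisect_right
from itertools import accumulate


def stage_of(step: int, steps_per_stage: list[int]) -> int:
    """Return the zero-based stage index containing a zero-based training step."""
    # Exclusive end step of each stage, e.g. [1000, 5000, 10000].
    boundaries = list(accumulate(steps_per_stage))
    if step < 0 or step >= boundaries[-1]:
        raise ValueError("step is outside the training run")
    return bisect_right(boundaries, step)


# Hypothetical run with three stages of 1,000, 4,000, and 5,000 steps.
steps_per_stage = [1_000, 4_000, 5_000]
```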
  • At earlier stages of training, the training system can apply values for model-level performance settings corresponding to smaller network sizes, smaller input sizes, less regularization and/or less normalization, etc., which can result in faster training at the cost of model quality.
  • At later stages of training, the training system can apply model-level performance settings with different values corresponding to larger network sizes, larger input sizes, more regularization and/or more normalization, which can result in slower training due to performance overhead, but higher model quality.
  • Training speed can be measured, for example, in the number of processing cycles required to train a machine learning model through an entire epoch of training data, by how long it takes to process an individual training example or mini-batch of training examples, and/or by the number of processing cycles required to complete one or more stages of training.
  • Model quality can be measured, for example, according to how well a machine learning model performs the task it is being trained to perform.
  • Example metrics for measuring model quality can include recall rate, a loss between a model prediction and a corresponding ground-truth label, model accuracy, and/or model precision in performing a machine learning task.
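Three of the quality metrics named above (accuracy, precision, and recall) can be computed for a binary classification task as in the sketch below. The function name and the 0/1 label encoding are assumptions made for the example; returning 0.0 when a denominator is zero is also a convention chosen here.

```python
def classification_metrics(predictions, labels):
    """Compute accuracy, precision, and recall for 0/1 predictions and labels."""
    pairs = list(zip(predictions, labels))
    true_pos = sum(1 for p, y in pairs if p == 1 and y == 1)
    false_pos = sum(1 for p, y in pairs if p == 1 and y == 0)
    false_neg = sum(1 for p, y in pairs if p == 0 and y == 1)
    correct = sum(1 for p, y in pairs if p == y)
    return {
        "accuracy": correct / len(pairs),
        # Guard against empty denominators (assumed convention: report 0.0).
        "precision": true_pos / (true_pos + false_pos) if true_pos + false_pos else 0.0,
        "recall": true_pos / (true_pos + false_neg) if true_pos + false_neg else 0.0,
    }


metrics = classification_metrics(predictions=[1, 0, 1, 1], labels=[1, 0, 0, 1])
```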
  • The training system applies different values for both hardware- and model-level performance settings, and adjusts those values at different points during training to achieve different tradeoffs between training speed and model quality.
  • Example points at which the training system applies different values include the beginning of different stages of training defined, for example, according to time, number of training iterations, or meeting minimum milestones for model quality, etc. Other examples include time-based intervals, such as minute-by-minute or hour-by-hour intervals passing during training.
  • The training system can initially apply values to the performance settings to adjust training of the model to favor training speed over model quality to learn high-level patterns and relationships between training examples and their labels at higher training speeds.
  • The training system gradually adjusts the values of the performance settings to prefer model quality improvements with speed overhead, according to a rate of change that can be specified in the training schedule.
  • At later points in training, the training system applies values of the hardware- and model-level performance settings to emphasize model quality with little to no priority given to reducing performance overhead, resulting in reduced training speed.
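The gradual shift from speed-favoring to quality-favoring values can be sketched with a simple interpolation. Everything specific here is an assumption for illustration: a linear rate of change, the setting names `input_size` and `precision_bits`, the value ranges, and the choice to switch numerical precision at the midpoint of training.

```python
def interpolate_setting(step, total_steps, start_value, end_value):
    """Linearly move a setting from start_value to end_value over training."""
    fraction = min(max(step / total_steps, 0.0), 1.0)
    return start_value + fraction * (end_value - start_value)


def settings_for_step(step, total_steps):
    """Return hypothetical model- and hardware-level values for a training step."""
    # Model-level: the input size grows gradually (assumed linear rate of change).
    input_size = int(round(interpolate_setting(step, total_steps, 128, 224)))
    # Hardware-level: lower precision early, higher precision in the second half.
    precision_bits = 16 if step < total_steps // 2 else 32
    return {"input_size": input_size, "precision_bits": precision_bits}
```

With `total_steps=1000`, step 0 yields the speed-favoring values, step 500 an intermediate input size with higher precision, and step 1000 the quality-favoring values.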
  • The training system can generate training schedules with complementary values for various hardware-level and model-level performance settings.
  • Complementary values for model-level performance settings allow certain hardware features to be applied more efficiently, for example resulting in fewer processing cycles to execute operations as part of implementing a training process, or allowing for optimization processes to improve model quality.
  • For example, values of model-level performance settings for enabling second-order optimization methods during training complement values for hardware-level performance settings corresponding to performing operations with lower numerical precision, for example using less than 64-bit floating-point or integer precision.
  • The training system can identify complementary values of performance settings as part of generating training schedules.
  • The training system can implement a training schedule machine learning model trained to generate training schedules from one or more input parameters at least partially describing one or more of the machine learning model to be trained on a set of computing resources, the machine learning task, and the set of computing resources available for training the model.
  • The training system can search a space of candidate training schedules according to different optimization parameters or search criteria, as described herein.
  • Examples of complementary values include values for lower resolution, weaker regularization, and smaller models, paired with hardware-level performance settings for local node communication and gradient accumulation and lower precision computation. At later stages of training, higher resolution, stronger regularization, and larger models may be paired with hardware-level performance values for global communication and gradient accumulation and higher precision computation.
  • As better performing training schedules are identified, for example by observing faster training speeds and/or higher model qualities at different points during training, these training schedules can be provided as additional examples for retraining the training schedule machine learning model or updating search criteria for searching for training schedules given a set of input parameters.
  • Higher performing training schedules will include complementary values of hardware- and model-level performance settings over lower performing training schedules.
  • Machine learning models can be trained faster, for example in less clock time and/or using fewer processing cycles, versus other models not trained using hardware-aware progressive training.
  • Model quality can be improved by gradually adjusting performance settings to favor model quality at the cost of performance overhead.
  • Improved model quality of a trained machine learning model can improve the function of computing devices deploying the model at inference, for example because responses to queries or requests to process data on the model can be generated more accurately.
  • Training can be performed more efficiently, for example using more of available features to accelerate operations as part of implementing a training process, versus not using a training schedule as described herein.
  • The training system is configured to generate training schedules with complementary values to reduce or avoid conflicting values of hardware- and model-level performance settings, which may inhibit training.
  • Training schedules applied and generated by the training system are tailored according to available hardware features for computing resources designated for training a model using a training process and a given training schedule.
  • A computing platform may include a variety of different computing devices available for training a machine learning model, with different devices varying in terms of hardware features available and/or data processing capability.
  • The training system can make more efficient use of computing resources allocated for training a particular machine learning model, because the training system can apply a training schedule with hardware-level performance settings values based on the particular hardware features and processing capability available on the allocated computing resources.
  • The training system can apply the same training schedule to the same set of computing resources at different scales, so as to not add additional processing overhead to platform operations for scaling computing resources up or down during or in-between training sessions.
  • FIG. 1 is a block diagram of an example training system 100, according to aspects of the disclosure.
  • The training system 100 can be implemented on one or more computing devices in one or more physical locations.
  • The training system 100 is shown in FIG. 1 as part of a computing platform 101.
  • The computing platform 101 can be a collection of computing devices communicating with one or more other computing devices over a network, for example computing device 105.
  • The training system 100 includes a training engine 110, and can also include a training schedule engine 115 and a training schedule library 120.
  • The training system 100 can also include a neural architecture search engine 125.
  • The training system 100 is configured to receive requests for training a machine learning model, for example from the computing device 105.
  • The computing device 105 can send a request, for example over some interface, such as an API or web interface on a browser or mobile application presented on a display of the computing device 105, to the training system 100.
  • The computing device 105 can be a user computing device operated by a user, and/or a device configured to automatically communicate with the training system 100.
  • The computing device 105 can be configured to receive and deploy a trained machine learning model.
  • The computing device 105 can be further configured to receive requests from other computing devices (not shown) for processing input by the deployed model to generate respective output data.
  • The other computing devices may be connected to the computing device 105, separately or as a part of a network connecting the platform 101 with the computing device 105.
  • The request from the computing device 105 can specify input parameters at least partially describing the machine learning model, the machine learning task, and/or the computing resources available for training the model.
  • Input parameters for describing the machine learning model can include a model type, such as a neural network, a support vector machine, a regression model, etc.
  • Input parameters can also include specific characteristics of the desired machine learning model, such as a neural network having a particular width or depth.
  • Input parameters can also specify the type of machine learning task the machine learning model will be trained to perform, such as a regression or a classification task.
  • Example machine learning tasks are provided herein, and in general a machine learning task can be defined for approximating a function between a set of input and corresponding output, which is learned by the machine learning model trained to perform the task.
  • The input parameters can also further specify a sub-type of a machine learning task for the machine learning model to be trained to perform, such as binary classification, multi-class classification, linear regression, logistic regression, etc.
  • The training system 100 can be configured to automatically select a type of machine learning model if a task is specified in the input parameters, but not a model type.
  • The training system 100 may be part of an automatic machine learning (AutoML) system (not shown in FIG. 1).
  • The AutoML system can be configured to automatically select a machine learning model to implement based on input parameters specifying a task to be performed, optionally among other input parameters. Even if the input parameters specify a model type, in some examples the AutoML system implementing the training system 100 can be configured to suggest one or more model types based on the other received parameters. As described in more detail with respect to FIG.
  • A neural architecture refers to a set of values describing the shape or topology of a neural network.
  • Example values that may be part of a neural architecture include, for example, the number of layers of the architecture, the width of each layer, the number of nodes or neurons at each layer, the types of operations performed at each layer given a set of input, and the types of activation functions applied for one or more of the network layers.
  • Each neural network is said to have a respective neural architecture.
  • Input parameters can also specify the computing resources on which the training system 100 is to train the machine learning model.
  • Computing resources 130 of the computing platform 101 can include a variety of different computing devices, including processors and memory devices of a variety of different types and configurations, as described herein with reference to FIG. 5.
  • the computing resources 130 can include a number of computing devices with various hardware features for improving data processing or storage on the computing devices. These hardware features can be enabled, disabled, or modified, according to different values of hardware-level performance settings adjusted by the training system 100.
  • the input parameters can specify how much, what kind, and/or which specific computing resources should be used by the training system 100 in training the machine learning model.
  • the computing device 105 may be associated with a user who has been allocated a portion of the computing resources 130.
  • the platform 101 may provide more or fewer computing resources, for example measured in a length of time of availability, a number of processing cycles, or more or fewer devices of different processing speeds or processing capabilities. Processing capability can be measured, for example, in clock speed, data bandwidth, cache memory size, etc.
  • a request may specify the use of graphics processing units (GPUs) for accelerating the training of a machine learning model, versus the use of other, less-specialized devices, such as central processing units (CPUs).
  • the request can also specify training data or the location of training data to be used for training the machine learning model.
  • the training data can be stored on one or more computing devices of the platform 101, which may be the same as or different from the devices implementing the training system 100.
  • the training data can include, for example, one or more training examples of input the model is being trained to process to generate a respective output. Some or all of the training examples may include labels of ground-truth output corresponding to the labeled examples.
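  • The input parameters described above can be gathered into a single request record. The following is a minimal sketch, assuming a simple in-memory representation; every field name and the helper `needs_model_selection` are hypothetical and are not defined by the disclosure.

```python
from dataclasses import dataclass, field
from typing import Optional

# All field names are hypothetical. The description lists the kinds of input
# parameters a request can carry but does not define a schema.
@dataclass
class TrainingRequest:
    task: str                               # e.g. "classification" or "regression"
    task_subtype: Optional[str] = None      # e.g. "binary", "multi-class"
    model_type: Optional[str] = None        # e.g. "neural_network"; may be omitted
    model_traits: dict = field(default_factory=dict)  # e.g. {"depth": 50}
    accelerator: Optional[str] = None       # e.g. "gpu" or "cpu"
    num_devices: Optional[int] = None       # how many devices to train on
    training_data_location: Optional[str] = None      # where the training data lives


def needs_model_selection(request: TrainingRequest) -> bool:
    """Return True when a task is given without a model type."""
    return request.model_type is None


request = TrainingRequest(task="classification", task_subtype="binary",
                          accelerator="gpu", num_devices=8)
```

  • In this sketch, a request that names a task but no model type is the case in which the system would select a model type automatically.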
  • the training engine 110 receives the request from the computing device 105, and receives a training schedule specifying values for hardware-level and model-level performance settings for training a machine learning model according to the request. As described in more detail with reference to FIGs. 3A-B, the training engine 110 can receive the training schedule, for example from the training schedule engine 115 configured to generate a training schedule according to aspects of the disclosure. In other examples, the training engine 110 receives a training schedule by querying the training schedule library 120 storing a collection of pre-generated training schedules.
  • the training engine 110 implements a training process for training the machine learning model over a period of training time.
  • a training process can include any set of operations for training a machine learning model, which can be repeated one or more times over the period of training time.
  • the training process can vary, for example depending on the nature of the type of model to be trained and/or the machine learning task the model is being trained to perform.
  • Example processes can be based on supervised, unsupervised, or semi-supervised learning approaches.
  • the training engine 110 can be configured to train the machine learning model as a neural network, using backpropagation with gradient descent plus updating one or more weights or model parameter values for the machine learning model in accordance with the computed gradients and optionally one or more other parameters.
  • some model-level performance settings set to different values can cause the training engine 110 to modify the training process for training the model.
  • the training engine 110 can also be configured, as part of training, to perform various optimization processes, for example including adaptive moment estimation (Adam) optimization, stochastic or mini-batch gradient descent, gradient descent with momentum, as well as processes for reducing overfitting in a trained model, for example using dropout.
  • training processes for example based on different model architectures such as models based on clustering or support vector machines, can also be applied by the training engine 110.
  • other types of training processes for example processes based on unsupervised or semi-supervised approaches, can also be executed by the training engine 110 to train a machine learning model according to aspects of the disclosure.
  • the period of training time can be defined according to one or more termination criteria, which can be provided, for example, as additional input parameters as part of a received request, or predetermined.
  • the training engine 110 stops training when termination criteria are met.
  • the criteria can be, for example, a maximum number of iterations of a training process implemented by the training engine 110, a maximum amount of time passing since the beginning of training, meeting minimum model quality performance thresholds by the trained model, and/or not meeting minimum predetermined improvements to model quality after a certain number of iterations or time has passed.
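  • The termination criteria listed above can be checked once per iteration. The sketch below is one possible arrangement, assuming model quality is a single number where higher is better; the class `TerminationCriteria`, its fields, and the `patience` window are hypothetical.

```python
from dataclasses import dataclass
from typing import Optional, Sequence


@dataclass
class TerminationCriteria:
    max_steps: Optional[int] = None           # maximum number of iterations
    max_seconds: Optional[float] = None       # maximum elapsed training time
    target_quality: Optional[float] = None    # stop once quality reaches this
    min_improvement: Optional[float] = None   # required gain over `patience` evaluations
    patience: int = 5                         # hypothetical look-back window


def should_stop(criteria: TerminationCriteria, step: int,
                elapsed_seconds: float, quality_history: Sequence[float]) -> bool:
    """Return True as soon as any configured criterion is met."""
    if criteria.max_steps is not None and step >= criteria.max_steps:
        return True
    if criteria.max_seconds is not None and elapsed_seconds >= criteria.max_seconds:
        return True
    if (criteria.target_quality is not None and quality_history
            and quality_history[-1] >= criteria.target_quality):
        return True
    if (criteria.min_improvement is not None
            and len(quality_history) > criteria.patience):
        gain = quality_history[-1] - quality_history[-1 - criteria.patience]
        if gain < criteria.min_improvement:
            return True
    return False
```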
  • the training system 100 can train a machine learning model over multiple stages.
  • a training stage can correspond to a number of training steps, with each training step representing a full forward and backward pass to update the model parameters values based on calculated error.
  • the number of training steps in a training stage can vary, for example from thousands to millions.
  • the number of training steps can vary based on, for example, the total number of training steps for all of the stages of training and/or the size of the training dataset.
  • stages can be defined as periods of time shorter than the total training time for training the model, a number of epochs or number of times an entire training set is processed by the model, and/or by certain model performance milestones achieved, such as a threshold recall rate or any threshold based on a metric for measuring model accuracy.
  • the training engine 110 can apply different values for hardware- and model-level performance settings for adjusting the training process during that stage.
  • Hardware-level and model-level performance settings can take on a range of values with varying trade-offs between training speed and model quality of the trained machine learning model.
  • the training engine 110 can be configured to perform a combination of hardware- and model-level training optimizations together, and to adjust values for both hardware- and model-level performance parameters to achieve different balances between training speed and model quality of the resulting trained model.
  • the training schedule can specify a rate at which values are adjusted for various hardware- and model-level performance settings.
  • the training schedule can specify a rate at which values for a particular performance setting are adjusted to transition to values favoring model quality over training speed, or vice versa.
  • the training schedule can specify hardware- and model-level performance settings favoring higher training speed at the cost of model quality.
  • the training schedule can include a number of intermediate values for both hardware- and model-level performance settings to transition the training process performed by the system to favor model quality over training speed.
  • the training schedule specifies points at which intermediate values should be applied to the performance settings, and the training system is configured to apply values for those settings at the specified points. These points can be the beginning of subsequent stages of training, and/or intervals according to other conditions, such as time.
  • the training schedule may specify different values for performance settings on a minute-by-minute interval.
  • the training schedule can specify values or schemes for hardware- and model-level performance settings that favor higher model quality at the cost of lower training speed.
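  • A training schedule of this kind can be pictured as an ordered list of points, each carrying the hardware- and model-level values that take effect there. The sketch below assumes step-indexed points; the setting names and all numeric values are illustrative assumptions, not values from the disclosure.

```python
import bisect

# Hypothetical schedule: (start_step, settings) pairs sorted by start_step.
# Early entries favor training speed; later entries favor model quality.
SCHEDULE = [
    (0,      {"precision": "bfloat16", "comm_radius": 2,  "input_size": 128, "dropout": 0.1}),
    (10_000, {"precision": "bfloat16", "comm_radius": 4,  "input_size": 192, "dropout": 0.2}),
    (20_000, {"precision": "float32",  "comm_radius": 16, "input_size": 256, "dropout": 0.3}),
]


def settings_at(step, schedule=SCHEDULE):
    """Return the settings of the latest schedule point at or before `step`."""
    starts = [start for start, _ in schedule]
    index = bisect.bisect_right(starts, step) - 1
    return schedule[max(index, 0)][1]
```

  • Looking up `settings_at(step)` at the start of each training step is one way the training system could apply intermediate values at the specified points.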
  • the range of values for the various hardware-level and model-level performance settings varies at least in accordance with the types of performance settings available during training.
  • one example of a model-level performance setting is the learning rate for training a machine learning model. The learning rate can initially be quite small, for example between 0.01 and 0.1. After a certain number of stages or training steps, the learning rate can be stepped down by some amount, for example by a factor of 10 from its current value.
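  • The stepped learning rate can be written as a short function. This is a sketch only: the initial rate of 0.1 and the factor of 10 follow the example above, while the interval `steps_per_drop` is an assumed value.

```python
def stepped_learning_rate(step, initial_lr=0.1, decay_factor=10.0, steps_per_drop=30_000):
    """Divide the learning rate by `decay_factor` once every `steps_per_drop` steps."""
    drops = step // steps_per_drop          # number of step-downs so far
    return initial_lr / (decay_factor ** drops)
```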
  • Another example model-level performance setting is regularization.
  • for performance settings such as regularization, in which the performance setting involves different types or categories of optimization as opposed to adjusting numerical values, a value for a performance setting can correspond to a type of scheme covered by the performance setting.
  • for model regularization such as data augmentation, the method for augmentation can change from simple distortion to more advanced blurring and distortion, depending on different model-level performance setting values.
  • a hardware-level performance setting can be a communication radius for communicating data, such as gradients, between chips, nodes, or other devices training a machine learning model. Initially, the communication radius may be small, for example two by two, for communicating among local devices adjacent to one another. The communication radius can be adjusted to increase, for example sixteen by sixteen or larger, to communicate with hundreds or thousands of chips across different hardware interconnects, within a datacenter, and/or across datacenters.
  • the training engine 110 is configured to cause the computing resources 130 to perform operations for training the machine learning model in accordance with current values of hardware- and model-level performance settings.
  • the training engine 110 can generate a program or sequence of instructions, which when executed by the computing resources 130, causes the computing resources 130 to execute operations in accordance with values for performance settings specified in the program or sequence of instructions.
  • the training engine 110 is configured to enable, disable, or modify the execution of hardware features through one or more control signals to the devices of the computing resources.
  • the training engine 110 may cause different hardware features to be enabled through an operating system or other software or firmware in control of the computing resources 130.
  • the training engine 110 may send a direct signal through a bus or communication channel from which a device is configured to receive control signals for enabling or disabling hardware features.
  • Some examples of hardware features that can be adjusted by different values of hardware-level performance settings include: enabling/disabling inter- or intra-communication of data among and between computing devices; levels of numerical precision the computing devices apply to perform respective operations as part of the training process; and/or enabling/disabling hardware parallelism on the computing devices.
  • inter- or intra-communication of data can be further adjusted, such as by rate, volume, or type of data transmitted between devices.
  • Hardware-level performance settings can include settings for adjusting software- or virtually-defined clusters of computing devices, with logical pathways between those computing devices.
  • Example operations performed by the computing resources 130 during training can include calculating a dot product between a vector of input values and a matrix or tensor of weights of a neural network layer, matrix multiplication, calculating an activation function, performing convolutional operations, pooling multiple values of a feature map, etc.
  • Model-level performance settings can include model hyperparameters, such as the size of the machine learning model or a topology or shape of a neural network, including the size of the input the model receives.
  • Model-level performance settings can also include training process hyperparameters for modifying the training process used by the training engine in training the machine learning model, such as a learning rate or batch size.
  • Training process hyperparameters can also include parameters whose values control the application of various optimization processes that can be performed as part of the training process to further improve the model, such as second-order optimization methods, or processes controlling how much functions that are part of the model are regularized or how much data is normalized.
  • Examples of training process hyperparameters can also include a learning rate or a mini-batch size, for example when the training process is mini-batch gradient descent.
  • the training engine 110 can send signals interpretable by the computing resources 130 for adjusting model-level performance settings in accordance with a training schedule throughout a training period.
  • the training engine 110 may generate a program or sequence of instructions specifying adjustments to the model and/or the training process during training, and at which points or stages the adjustments should be made in accordance with model-level performance setting values of a training schedule.
  • the training engine 110 can generate the training schedule by searching for arrangements of values for hardware- and model-level performance settings for hardware-level or model-level features available on a platform implementing the system. As part of the generation, the training engine 110 can identify model-level and hardware-level performance settings that are complementary in achieving higher training speed or model quality, depending on the point in training at which the settings are applied.
  • different values of hardware-level performance settings for local-only communications of neighboring computing devices in a cluster may be paired with different values of model-level performance settings in which the training engine 110 applies batch normalization or cross-replica gradient summation, to speed up training at the cost of model quality during earlier stages of training.
  • Devices of the computing resources 130 can be logically and/or physically organized as a cluster or group of computing resources, with interconnections between at least some of the devices within a cluster to facilitate inter-device communication.
  • Hardware-level performance settings that the training engine 110 can adjust during training can include settings for adjusting communication overhead between devices in a cluster.
  • values for hardware-level performance settings for higher numerical precision during training can be paired with values for model-level performance settings which cause the training engine 110 to apply any of a variety of second order optimization methods for better model quality, at the cost of training speed.
  • hardware-level performance settings for enabling parallel computation on certain types of accelerators can be paired with certain model-level performance settings for selecting the activation function used in training certain neural networks.
  • ReLU may be selected as an activation function when parallel computation is selected for faster training at reduced model quality, but swish may be selected as an activation function later during training for increased model quality at the cost of reduced training speed due to reduced hardware execution parallelism.
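  • The pairing of a hardware-level parallelism setting with an activation function can be sketched as follows. The ReLU and swish formulas are standard; the function `paired_settings` and its `switch_stage` argument are hypothetical names for illustration.

```python
import math


def relu(x):
    """Rectified linear unit: max(0, x)."""
    return max(0.0, x)


def swish(x):
    """Swish activation: x * sigmoid(x)."""
    return x / (1.0 + math.exp(-x))


def paired_settings(stage, switch_stage):
    """Pair a hardware-level setting with a model-level one for a given stage.

    Before `switch_stage`: parallel execution with ReLU (favoring speed).
    From `switch_stage` on: reduced parallelism with swish (favoring quality).
    """
    if stage < switch_stage:
        return {"hardware_parallelism": True, "activation": relu}
    return {"hardware_parallelism": False, "activation": swish}
```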
  • a system such as the training system 100 described herein can allow for combining hardware settings with progressive training. For example, combining hardware- and model-level progressive training naively can cause a catastrophic quality loss that makes the model quality too low to be useful. As another example, applying lower regularization at the model level and low precision at the hardware level at the beginning of training can cause the initial quality loss to be too large to be recovered, even if regularization and numeric precision are increased significantly later in the training.
  • a model may be retrained according to training schedules or portions of training schedules previously used by the training engine in training the model.
  • Retraining can include performing a number of iterations of a training process, using new training data.
  • Example retraining can include backpropagation with gradient descent plus updating model weights for a neural network previously set from earlier training.
  • the training engine 110 can apply values of hardware- and model-level performance settings of a previously-used training schedule for a later stage or point in training. In this way, values for performance settings corresponding to the current performance of the model (having already been trained) can be used by the training engine 110 to favor model quality improvement over training speed.
  • One example case in which a portion of a training schedule may be used as part of retraining is in retraining production machine learning models, such as models for an online search engine.
  • the models may occasionally be retrained in view of new training data and/or model-level optimizations that may have been developed after the deployment of the production machine learning model.
  • the training system can re-use a training schedule previously used to initially train a production machine learning model, but start retraining according to a point or stage at which model quality is emphasized over training speed.
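  • Re-using the later portion of a schedule for retraining amounts to keeping its tail. The sketch below assumes a schedule stored as step-indexed `(start_step, settings)` pairs, which is an illustrative representation rather than one given in the disclosure.

```python
def retraining_schedule(schedule, resume_stage):
    """Keep the stages from `resume_stage` on, re-based so the first starts at step 0."""
    tail = schedule[resume_stage:]
    if not tail:
        raise ValueError("resume_stage is beyond the end of the schedule")
    offset = tail[0][0]
    return [(start - offset, settings) for start, settings in tail]


# Hypothetical three-stage schedule moving from speed to quality.
original = [
    (0,     {"favor": "speed"}),
    (1_000, {"favor": "balanced"}),
    (2_000, {"favor": "quality"}),
]
retrain = retraining_schedule(original, resume_stage=2)
```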
  • the training schedule library 120 is a collection of pre-generated training schedules stored on one or more memory devices, for example as part of a queryable database.
  • the training schedule library 120 can be populated by training schedules generated by the training system, as described in more detail with reference to FIG. 2.
  • the training schedule engine 115 adds a generated training schedule to the library 120, tagging it with metadata at least partially describing the input parameters received as part of a request for training a model using the generated training schedule.
  • the training schedule engine 115 can populate the training schedule library 120 with one or more training schedules for commonly received machine learning models requested to be trained by the system 100.
  • the training engine 110 can query the training schedule library 120 to identify a stored training schedule previously generated for a machine learning model that is the same as or similar to a model that the engine 110 is currently requested to train.
  • the training system 100 can also include the neural architecture search (NAS) engine 125.
  • the NAS engine 125 is configured to search for neural architectures for neural networks that benefit from training according to a training schedule as described herein.
  • the training system 100 can receive input parameters for training a machine learning model specifying a machine learning task to perform, without specifying a particular model type.
  • the training system 100 can receive a request for generating a neural network based on a neural network architecture identified by the NAS engine 125.
  • FIG. 2 is a flowchart of an example process 200 for hardware-aware progressive training of a machine learning model.
  • a training system such as the training system 100 of FIG. 1, can be configured to perform the process 200.
  • a training system receives a request to train a machine learning model, according to block 210.
  • the request can include various types of data or metadata, including one or more input parameters.
  • the input parameters can include the input parameters described herein with reference to FIG. 1, at least partially describing one or more of the machine learning model, the machine learning task, and computing resources available for training the machine learning model.
  • the training system receives a training schedule specifying a plurality of values for one or more hardware-level performance settings and one or more model-level performance settings, according to block 220.
  • the training system can generate the training schedule, as described herein with reference to FIGs. 1 and 3A.
  • the training system can query one or more memory devices storing multiple pre-generated training schedules, as described herein with reference to FIGs. 1 and 3B.
  • the training system trains the machine learning model in accordance with a training process, one or more hardware-level performance settings, and one or more model-level performance settings set to different values of the plurality of values of the training schedule at different points in time during training, according to block 230.
  • the training system is configured to apply different values of both hardware- and model-level performance settings at various points during training.
  • the training schedule can specify those points, for example as stages or other defined intervals, and the training schedule can further specify the rate at which values are changed from one end of a range, to another.
  • the training system sends the trained machine learning model to one or more computing devices, according to block 240.
  • the one or more computing devices can be devices that originally requested that the machine learning model be trained, as an example.
  • the one or more computing devices can be predetermined for receiving the trained machine learning model, for example as part of model deployment on a device on the edge of a network or another device of the computing platform.
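  • The flow of process 200 can be illustrated end to end with a toy model. The sketch below fits a single weight by gradient descent while a two-stage schedule changes the learning rate (a model-level setting) and the number of decimals kept in the gradient, which is only a stand-in for hardware numerical precision. The data, the schedule values, and all names are assumptions for illustration.

```python
import bisect


def settings_at(step, schedule):
    """Return the settings of the latest schedule point at or before `step`."""
    starts = [start for start, _ in schedule]
    return schedule[max(bisect.bisect_right(starts, step) - 1, 0)][1]


def train(examples, schedule, total_steps):
    """Fit y = weight * x with per-step settings taken from the schedule."""
    weight = 0.0
    for step in range(total_steps):
        settings = settings_at(step, schedule)
        x, y = examples[step % len(examples)]
        gradient = 2.0 * (weight * x - y) * x            # d/dw of (w*x - y)^2
        # Rounding stands in for lower numerical precision in early stages.
        gradient = round(gradient, settings["gradient_decimals"])
        weight -= settings["learning_rate"] * gradient
    return weight


examples = [(1.0, 3.0), (2.0, 6.0), (0.5, 1.5)]          # samples of y = 3x
schedule = [
    (0,  {"learning_rate": 0.1,  "gradient_decimals": 1}),   # fast, coarse
    (50, {"learning_rate": 0.05, "gradient_decimals": 6}),   # slower, precise
]
trained_weight = train(examples, schedule, total_steps=200)
```

  • In this toy setting the coarse early stage moves the weight quickly toward the target, and the later stage refines it; the returned weight plays the role of the trained model sent back at block 240.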
  • FIG. 3A is a flowchart of an example process 300A for training a machine learning model to generate training schedules for hardware-aware progressive training.
  • the machine learning model trained is referred to as a training schedule machine learning model.
  • the training system receives one or more training examples of training schedules, according to block 310.
  • Each example training schedule can be labeled with respective data at least partially describing one or more respective input parameters used to generate the example training schedule, a respective training speed, and respective model quality of a respective model trained using the example training schedule.
  • the training data can be generated by hand, automatically, or a combination of both approaches.
  • the training system can store metadata for a training schedule generated according to received input parameters, and after training the model, record its training speed and model quality. Because the training speed and model quality vary throughout training, the training system can store individual values representing the speed and quality, respectively, at different intervals in which values from the training schedule are applied to the performance settings. In addition or alternatively, the training system can compute a function of the individual training speed and model quality values, for example as an average or sum.
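  • Recording per-interval measurements and reducing them to a single label can be sketched as below. The record keys `examples_per_second` and `accuracy` are assumed units chosen for illustration; the disclosure does not fix how speed or quality is measured.

```python
from statistics import mean


def summarize(records):
    """Aggregate per-interval speed and quality into averages and sums."""
    speeds = [r["examples_per_second"] for r in records]
    qualities = [r["accuracy"] for r in records]
    return {
        "mean_speed": mean(speeds),
        "sum_speed": sum(speeds),
        "mean_quality": mean(qualities),
        "sum_quality": sum(qualities),
    }


# One record per interval at which schedule values were applied.
records = [
    {"examples_per_second": 1000.0, "accuracy": 0.5},
    {"examples_per_second": 500.0,  "accuracy": 0.75},
]
summary = summarize(records)
```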
  • the training system trains a machine learning model, i.e., the training schedule machine learning model, to generate training schedules from one or more input parameters, according to block 320.
  • the input parameters are the input parameters that can be received as part of a request for training a model, as described herein with reference to FIGs. 1-2.
  • the training system can train the training schedule machine learning model in a variety of different ways, for example using some form of backpropagation with gradient descent plus model weight updates.
  • the loss or performance function for training the training schedule machine learning model can be a function of how close the training speed or model quality at various points in the training period are to ground-truth training speeds or model qualities at those same points during training.
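  • One possible form of such a loss is a weighted mean squared error between predicted and ground-truth speed and quality at matching points in training. The choice of squared error and the weights are assumptions; the description only requires that the loss reflect how close the values are.

```python
def schedule_loss(predicted, target, speed_weight=1.0, quality_weight=1.0):
    """Weighted mean squared error over (speed, quality) pairs at matching points."""
    if not predicted or len(predicted) != len(target):
        raise ValueError("predicted and target must be non-empty and equal in length")
    total = 0.0
    for (pred_speed, pred_quality), (true_speed, true_quality) in zip(predicted, target):
        total += speed_weight * (pred_speed - true_speed) ** 2
        total += quality_weight * (pred_quality - true_quality) ** 2
    return total / len(predicted)
```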
  • the training system can be configured to search for training schedules, according to an optimization approach over a set of candidate training schedules.
  • the search can be defined to identify a training schedule with the highest model quality and training speed through the course of training, subject to various restrictions which can be set in accordance with input parameters.
  • the restrictions can be over a certain subset of hardware-level and model-level performance settings that are available for a given training process and set of computing resources to be used in training the model using an identified training schedule.
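  • A restricted search over candidate schedules can be sketched as a filter followed by a maximization. The sketch assumes each candidate carries estimated speed and quality normalized to the range 0 to 1 and a list of the settings it relies on; the scoring by weighted sum and all names are hypothetical.

```python
def best_schedule(candidates, available_settings, quality_weight=0.5):
    """Pick the highest-scoring candidate that uses only available settings."""
    available = set(available_settings)
    allowed = [c for c in candidates if set(c["settings_used"]) <= available]
    if not allowed:
        return None

    def score(candidate):
        # Weighted sum of normalized quality and speed (an assumed objective).
        return (quality_weight * candidate["quality"]
                + (1.0 - quality_weight) * candidate["speed"])

    return max(allowed, key=score)


candidates = [
    {"name": "a", "settings_used": ["precision", "comm_radius"], "speed": 0.9, "quality": 0.6},
    {"name": "b", "settings_used": ["precision"],                "speed": 0.5, "quality": 0.8},
    {"name": "c", "settings_used": ["precision", "sparsity"],    "speed": 1.0, "quality": 1.0},
]
chosen = best_schedule(candidates, ["precision", "comm_radius"])
```

  • Candidate "c" scores highest overall but is excluded because it depends on a setting that is not available on the given computing resources.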
  • FIG. 3B is a flowchart of an example process for querying and applying a pre-generated training schedule from one or more memory devices storing multiple training schedules, according to aspects of the disclosure.
  • the training system sends a query to one or more memory devices storing a plurality of candidate training schedules, the query including data at least partially describing one or more of a machine learning model, the machine learning task, and computing resources available for training the machine learning model, according to block 330.
  • the training system can include a training engine configured to receive input parameters as part of a request to train a model, and query a training schedule library of memory devices for a previously-generated training schedule tagged with at least some of those input parameters.
  • the training system receives a training schedule from the plurality of candidate training schedules, in response to the query, according to block 340.
  • the received training schedule can be the training schedule whose metadata is the same as, or most similar to, the input parameters in the query.
  • Input parameters can be compared to predetermined similarity measures corresponding to one or more input parameters.
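  • Querying the library by similarity can be sketched as follows. The similarity measure used here, the fraction of metadata keys whose values match exactly, is an assumption standing in for the predetermined similarity measures mentioned above; the library layout and names are also hypothetical.

```python
def similarity(query, metadata):
    """Fraction of keys (across both dicts) present in both with equal values."""
    keys = set(query) | set(metadata)
    if not keys:
        return 0.0
    matches = sum(1 for key in keys
                  if key in query and key in metadata and query[key] == metadata[key])
    return matches / len(keys)


def query_library(library, query, min_similarity=0.5):
    """Return the schedule whose metadata best matches the query, or None."""
    best = max(library, key=lambda entry: similarity(query, entry["metadata"]),
               default=None)
    if best is None or similarity(query, best["metadata"]) < min_similarity:
        return None
    return best["schedule"]


library = [
    {"metadata": {"task": "classification", "model_type": "cnn", "accelerator": "tpu"},
     "schedule": "schedule_a"},
    {"metadata": {"task": "regression", "model_type": "mlp", "accelerator": "gpu"},
     "schedule": "schedule_b"},
]
```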
  • FIG. 4 is a flowchart of an example process for searching for neural architectures, according to aspects of the disclosure.
  • aspects of the disclosure also provide for a training system configured to search a set of candidate neural network architectures for a target architecture in which hardware-aware progressive training can be applied.
  • the training system can identify a target architecture in which all or most of hardware features for a specified set of computing resources can be applied during training at different values for training speed-model quality trade-offs.
  • the training system as part of adjusting performance settings during training, may incur performance overhead through operations executed to cause the computing resources to train the model according to adjusted values.
  • the training system can identify target architectures in which model-level performance settings can be adjusted with minimal performance overhead over other candidate architectures.
  • the training system searches for neural architectures that can benefit from continuous adjustment of hardware- and model-level performance settings during training.
  • a neural architecture that can be expanded in model size (for example measured by a number of neural network layers and/or a number of nodes in each layer) or in input size, and that is trained on corresponding computing resources that can be scaled to accommodate the increased model or input size, would benefit more during training, for example measured in higher training speeds and model quality, when using a training schedule of varying performance setting values, as described herein.
  • the training system estimates at least the training speed and model quality of a first neural network having a first candidate neural architecture of a plurality of candidate neural architectures and trained using hardware-aware progressive learning.
  • the estimation can be part of measuring the performance of candidate neural architectures within a search space of neural architectures.
  • the search space can include a variety of different candidate architectures, which can be filtered or adjusted based on different provided input parameters. For example, if the training system receives input parameters specifying the model type to be a convolutional neural network, then the training system can search a search space of neural architectures including at least one convolutional layer.
  • the training system selects the first candidate neural architecture based at least on a comparison of the estimated training speed and estimated model quality of the first neural network to respective estimated training speeds and respective estimated model qualities of one or more second neural networks.
  • Each second neural network has a respective candidate neural architecture, according to block 420.
  • the second neural networks can be trained according to hardware-aware progressive learning, as described herein, to identify respective training speeds and model qualities.
  • the training system can estimate the training speeds and model qualities.
  • the selection by the training system can be part of multiple iterations of selecting a candidate neural architecture, and comparing that neural architecture to a current best-known architecture.
  • the searching can be augmented at least by using training speed and model quality from hardware-aware progressive training as indicators of the performance of different candidate models. Any of a variety of neural architecture search processes can be applied, such as a random search over a number of iterations or until finding a candidate neural architecture meeting a threshold performance value, based at least on its training speed and model quality.
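  • A random search of this kind can be sketched as below. The `estimate` function is a toy stand-in for estimating training speed and model quality under hardware-aware progressive training; its formulas, the `(depth, width)` encoding of an architecture, and the weighted score are all assumptions.

```python
import random


def estimate(architecture):
    """Toy estimator (not from the source): returns (speed, quality) in (0, 1)."""
    depth, width = architecture
    size = depth * width
    quality = 1.0 - 1.0 / size ** 0.5       # larger models score higher
    speed = 1.0 / (1.0 + size / 1000.0)     # larger models train more slowly
    return speed, quality


def score(architecture, quality_weight=0.7):
    speed, quality = estimate(architecture)
    return quality_weight * quality + (1.0 - quality_weight) * speed


def random_search(search_space, iterations, threshold, seed=0):
    """Sample candidates, keeping the best so far, until the threshold is met."""
    rng = random.Random(seed)
    best = rng.choice(search_space)
    for _ in range(iterations):
        if score(best) >= threshold:
            break
        candidate = rng.choice(search_space)
        if score(candidate) > score(best):   # compare to current best-known
            best = candidate
    return best


search_space = [(depth, width) for depth in (2, 4, 8) for width in (16, 64, 256)]
```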
  • the training system can proceed to train a neural network having the target neural architecture, for example as described herein with reference to FIGs. 1-2.
  • aspects of the disclosure can provide for at least the following technical advantages.
  • Generating a neural network having a neural architecture selected from NAS as described herein allows for improved utilization of hardware-aware progressive training as described herein.
  • Neural architectures can be tailored to the computing resource environment in which they are trained, allowing for increased access to hardware features for accelerating operations of an implemented training process, as opposed to neural architectures not identified as described herein, which may be incompatible with those hardware features.
  • FIG. 5 is a block diagram of an example environment 500 for implementing the training system 100.
  • The system 100 can be implemented on one or more devices having one or more processors in one or more locations, such as the computing platform 101 having one or more server computing devices 515 and one or more memory devices 530.
  • A user computing device 512 and the server computing device(s) 515 can be communicatively coupled to the memory devices 530 over a network 560.
  • The memory device(s) 530 can be a combination of volatile and non-volatile memory, and can be at the same physical locations as the computing devices 512, 515 or at different ones.
  • The memory device(s) 530 can include any type of non-transitory computer readable medium capable of storing information, such as a hard-drive, solid state drive, tape drive, optical storage, memory card, ROM, RAM, DVD, CD-ROM, write-capable, and read-only memories.
  • The server computing device(s) 515 can include one or more processors 513 and memory 514.
  • The memory 514 can store information accessible by the processor(s) 513, including instructions 521 that can be executed by the processor(s) 513.
  • The memory 514 can also include data 523 that can be retrieved, manipulated, or stored by the processor(s) 513.
  • The memory 514 can be a type of non-transitory computer readable medium capable of storing information accessible by the processor(s) 513, such as volatile or nonvolatile memory.
  • The processor(s) 513 can include one or more central processing units (CPUs), graphic processing units (GPUs), field-programmable gate arrays (FPGAs), and/or application-specific integrated circuits (ASICs), such as tensor processing units (TPUs).
  • Available computing resources for the platform 101 can include one or more of the processors 513, and/or the memory 514 or memory devices 530. As described herein, computing resources for the platform 101 can be configured to implement one or more hardware features during data processing that can be enabled or modified in accordance with one or more hardware-level performance settings.
  • The training system 100 is configured to train a machine learning model according to aspects of the disclosure, on computing resources of the platform 101.
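One way to picture hardware-level settings being enabled or modified during training is as a staged schedule that pairs them with model-level settings. The sketch below is only illustrative: the `Stage` type, the setting names (`precision_bits`, `image_size`, `dropout`), and their values are assumptions made for this example and are not taken from the disclosure.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Stage:
    """One stage of a training schedule (all field names are illustrative)."""
    steps: int
    hardware: dict  # hardware-level performance settings for this stage
    model: dict     # model-level performance settings for this stage


def build_schedule(total_steps, num_stages):
    """Move from speed-oriented settings (early) to quality-oriented ones (late)."""
    stages = []
    for i in range(num_stages):
        frac = i / max(num_stages - 1, 1)  # 0.0 at the first stage, 1.0 at the last
        stages.append(Stage(
            steps=total_steps // num_stages,
            # Assumed hardware feature: lower numeric precision early in training.
            hardware={"precision_bits": 16 if frac < 0.5 else 32},
            # Assumed model-level settings: input size and regularization grow.
            model={
                "image_size": int(128 + frac * (224 - 128)),
                "dropout": round(0.1 + frac * 0.2, 3),
            },
        ))
    return stages


def train(schedule, apply_hardware_settings, train_steps):
    """Reconfigure the hardware at each stage boundary, then train that stage."""
    for stage in schedule:
        apply_hardware_settings(stage.hardware)
        train_steps(stage.steps, stage.model)
```

Here `apply_hardware_settings` and `train_steps` are caller-supplied callbacks, so the sketch stays independent of any particular accelerator or framework.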
  • The instructions 521 can include one or more instructions that, when executed by the processor(s) 513, cause the processor(s) 513 to perform actions defined by the instructions.
  • The instructions 521 can be stored in object code format for direct processing by the processor(s) 513, or in other formats, including interpretable scripts or collections of independent source code modules that are interpreted on demand or compiled in advance.
  • The instructions 521 can include instructions for implementing the training system 100 consistent with aspects of this disclosure.
  • The training system 100 can be executed using the processor(s) 513, and/or using other processors remotely located from the server computing device(s) 515.
  • The data 523 can be retrieved, stored, or modified by the processor(s) 513 in accordance with the instructions 521.
  • The data 523 can be stored in computer registers, in a relational or non-relational database as a table having a plurality of different fields and records, or as JSON, YAML, proto, or XML documents.
  • The data 523 can also be formatted in a computer-readable format such as, but not limited to, binary values, ASCII, or Unicode.
  • The data 523 can include information sufficient to identify relevant information, such as numbers, descriptive text, proprietary codes, pointers, references to data stored in other memories, including other network locations, or information that is used by a function to calculate relevant data.
  • The user computing device 512 can also be configured similarly to the server computing device(s) 515, with one or more processors 516, memory 517, instructions 518, and data 519.
  • The user computing device 512 can also include a user output 526 and a user input 524.
  • The user input 524 can include any appropriate mechanism or technique for receiving input from a user, such as a keyboard, mouse, mechanical actuators, soft actuators, touchscreens, microphones, and sensors.
  • The server computing device(s) 515 can be configured to transmit data to the user computing device 512, and the user computing device 512 can be configured to display at least a portion of the received data on a display implemented as part of the user output 526.
  • The user output 526 can also be used for displaying an interface between the user computing device 512 and the server computing device(s) 515.
  • The user output 526 can alternatively or additionally include one or more speakers, transducers, or other audio outputs, or a haptic interface or other tactile feedback that provides non-visual and non-audible information to the user of the computing device 512.
  • Although FIG. 5 illustrates the processors 513, 516 and the memories 514, 517 as being within the computing devices 515, 512, components described in this specification, including the processors 513, 516 and the memories 514, 517, can include multiple processors and memories that operate in different physical locations and not within the same computing device.
  • For example, some of the instructions 521, 518 and the data 523, 519 can be stored on a removable SD card and others within a read-only computer chip. Some or all of the instructions and data can be stored in a location physically remote from, yet still accessible by, the processors 513, 516.
  • The processors 513, 516 can include a collection of processors that can perform concurrent and/or sequential operation.
  • The computing devices 515, 512 can each include one or more internal clocks providing timing information, which can be used for time measurement for operations and programs run by the computing devices 515, 512.
  • The server computing device(s) 515 can be configured to receive requests to process data from the user computing device 512.
  • The platform 101 can be configured to provide a variety of services to users, through various user interfaces and/or APIs exposing the platform services.
  • One or more services can be a machine learning framework or a set of tools for generating neural networks or other machine learning models according to a specified task and training data.
  • The user computing device 512 may receive and transmit data specifying target computing resources to be allocated for training and deploying a neural network to perform a particular machine learning task.
  • The server computing device(s) 515 can be configured to receive a request specifying, for example, a set of training data; the type of model to train, such as a deep neural network, a recurrent neural network, or a convolutional neural network; and the type of machine learning task the model will be trained to perform.
  • The request can optionally specify more or fewer parameters, as described herein.
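A request of the kind described above could be represented as a small record that the server validates before training begins. The following is a hedged sketch only: the `TrainingRequest` type, its field names, and the set of accepted model types are hypothetical and do not describe the platform's actual API.

```python
from dataclasses import dataclass, field
from typing import Optional

# Illustrative labels for the model types named in the text.
SUPPORTED_MODEL_TYPES = {"deep", "recurrent", "convolutional"}


@dataclass
class TrainingRequest:
    """Hypothetical shape of a request sent from a user device to the server."""
    training_data: str                        # location of the training data set
    model_type: str                           # e.g. "convolutional"
    task: str                                 # e.g. "image_classification"
    target_resources: Optional[dict] = None   # optional target computing resources
    extra: dict = field(default_factory=dict)  # any further optional parameters

    def validate(self):
        """Raise ValueError if a required field is missing or unsupported."""
        if not self.training_data:
            raise ValueError("training_data must be specified")
        if self.model_type not in SUPPORTED_MODEL_TYPES:
            raise ValueError(f"unsupported model type: {self.model_type!r}")
        if not self.task:
            raise ValueError("task must be specified")
        return self
```

Keeping `target_resources` and `extra` optional mirrors the point that a request may carry more or fewer parameters.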
  • The devices 512, 515 can be capable of direct and indirect communication over the network 560.
  • The devices 515, 512 can set up listening sockets that may accept an initiating connection for sending and receiving information.
  • The network 560 itself can include various configurations and protocols including the Internet, World Wide Web, intranets, virtual private networks, wide area networks, local networks, and private networks using communication protocols proprietary to one or more companies.
  • The network 560 can support a variety of short- and long-range connections. The short- and long-range connections may be made over different frequency bands, such as 2.402 GHz to 2.480 GHz (commonly associated with Bluetooth) or 2.4 GHz and 5 GHz (commonly associated with Wi-Fi); or with a variety of communication standards, such as standards for wireless broadband communication.
  • The network 560 can additionally or alternatively support wired connections between the devices 512, 515, including over various types of Ethernet connection.
  • Aspects of the disclosure can be implemented according to a variety of different configurations and quantities of computing devices, including in paradigms for sequential or parallel processing, or over a distributed network of multiple devices. In some implementations, aspects of the disclosure can be performed on a single device, on multiple devices, or on any combination thereof.
  • Aspects of the disclosure provide for hardware-aware progressive training of a machine learning model to perform a respective machine learning task. Examples of machine learning tasks follow.
  • The input to the machine learning model to be trained can be in the form of images or videos.
  • A machine learning model can be trained to extract, identify, and generate features as part of processing a given input, for example as part of a computer vision task.
  • A machine learning model trained to perform this type of machine learning task can be trained to generate an output classification from a set of different potential classifications.
  • The machine learning model can be trained to output a score corresponding to an estimated probability that an identified subject in the image or video belongs to a certain class.
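As an illustration of a score that corresponds to an estimated class probability, the sketch below converts raw model outputs into probabilities with a softmax. Softmax is an assumption made for this example; the text does not say how scores are computed, and the class names are placeholders.

```python
import math


def softmax(logits):
    """Convert raw scores into probabilities that sum to 1."""
    peak = max(logits)  # subtract the maximum for numerical stability
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def classify(logits, class_names):
    """Return the most probable class and its estimated probability."""
    probs = softmax(logits)
    best = max(range(len(probs)), key=probs.__getitem__)
    return class_names[best], probs[best]


# Placeholder raw outputs for three candidate classes.
label, probability = classify([2.0, 1.0, 0.1], ["person", "place", "thing"])
```

The returned `probability` is the score for the selected class; the full list from `softmax` gives one score per potential classification.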
  • The input to the machine learning model can be data files corresponding to a particular format, such as HTML or XML files, word processing documents, or formatted metadata obtained from other types of data, such as metadata for image files.
  • A machine learning task in this context can be to classify, score, or otherwise predict some characteristic about the received input.
  • A machine learning model can be trained to predict the probability that received input includes text relating to a particular subject.
  • The machine learning model can be trained to generate text predictions, for example as part of a tool for auto-completion of text in a document as the document is being composed.
  • A machine learning model can also be trained for predicting a translation of text in an input document to a target language, for example as a message is being composed.
  • Other types of input documents can be data relating to characteristics of a network of interconnected devices. These input documents can include activity logs, as well as records concerning access privileges for different computing devices to access different sources of potentially sensitive data.
  • A machine learning model can be trained for processing these and other types of documents for predicting on-going and future security breaches to the network. For example, the machine learning model can be trained to predict intrusion into the network by a malicious actor.
  • The input to a machine learning model can be audio input, including streamed audio, pre-recorded audio, and audio as part of a video or other source or media.
  • A machine learning task in the audio context can include speech recognition, including isolating speech from other identified sources of audio and/or enhancing characteristics of identified speech to be easier to hear.
  • A machine learning model can be trained to predict an accurate translation of input speech to a target language, for example in real-time as part of a translation tool.
  • A machine learning model can also be trained to process features corresponding to given input.
  • Features are values, such as numerical values or categorical values, which relate to some characteristic of the input. For example, in the context of an image, a feature of the image can relate to the RGB value for each pixel in the image.
  • A machine learning task in the image/video context can be to classify contents of an image or video, for example for the presence of different people, places, or things.
  • Machine learning models can be trained to extract and select relevant features for processing to generate an output for a given input, and can also be trained to generate new features based on learned relationships between various characteristics of input data.
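To make the notion of features concrete, the sketch below treats the RGB value of each pixel as raw numerical features and then derives new features from them. The image representation (rows of `(R, G, B)` tuples) and both helper functions are illustrative assumptions, not part of the disclosure.

```python
def pixel_features(image):
    """Flatten an image (rows of (R, G, B) tuples) into values scaled to [0, 1]."""
    return [channel / 255.0 for row in image for pixel in row for channel in pixel]


def channel_means(image):
    """Derive three new features: the mean of each color channel."""
    num_pixels = sum(len(row) for row in image)
    totals = [0, 0, 0]
    for row in image:
        for pixel in row:
            for c in range(3):
                totals[c] += pixel[c]
    return [total / (255.0 * num_pixels) for total in totals]


# A tiny 2x2 example image: red, green, blue, and white pixels.
example_image = [
    [(255, 0, 0), (0, 255, 0)],
    [(0, 0, 255), (255, 255, 255)],
]
raw = pixel_features(example_image)       # 12 per-pixel channel values
derived = channel_means(example_image)    # 3 per-channel averages
```

In a trained model the derived features would be learned rather than hand-written; the channel averages here only show the difference between raw input values and features computed from them.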
  • Aspects of this disclosure can be implemented in digital circuits, computer-readable storage media, as one or more computer programs, or a combination of one or more of the foregoing.
  • The computer-readable storage media can be non-transitory, for example, as one or more instructions executable by one or more computing devices and stored on one or more tangible memory devices.
  • The phrase “configured to” is used in different contexts related to computer systems, hardware, or part of a computer program, engine, or module.
  • When a system is said to be configured to perform one or more operations, this means that the system has appropriate software, firmware, and/or hardware installed on the system that, when in operation, causes the system to perform the one or more operations.
  • When some hardware is said to be configured to perform one or more operations, this means that the hardware includes one or more circuits that, when in operation, receive input and generate output according to the input and corresponding to the one or more operations.
  • When a computer program, engine, or module is said to be configured to perform one or more operations, this means that the computer program, engine, or module includes one or more program instructions that, when executed by one or more computing devices, such as one or more processors, cause the one or more computing devices to perform the one or more operations.

Landscapes

  • Engineering & Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Theoretical Computer Science (AREA)
  • Health & Medical Sciences (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Biomedical Technology (AREA)
  • Biophysics (AREA)
  • Evolutionary Computation (AREA)
  • Computational Linguistics (AREA)
  • Data Mining & Analysis (AREA)
  • Artificial Intelligence (AREA)
  • General Health & Medical Sciences (AREA)
  • Molecular Biology (AREA)
  • Computing Systems (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Mathematical Physics (AREA)
  • Software Systems (AREA)
  • Neurology (AREA)
  • Stored Programmes (AREA)

Abstract

Aspects of the disclosure provide for hardware-aware progressive training of machine learning models. A training system trains a model in accordance with a training process and different values specified in a training schedule for both hardware-level and model-level performance settings. Hardware-level performance settings can cause hardware features of the computing resources used to train the model to be enabled, disabled, or modified at various points during training. Model-level performance settings can take on a variety of values to adjust characteristics of the machine learning model being trained, or of the training process, during different stages of training. The training system can identify and apply complementary values of hardware-level and model-level performance settings to generate training schedules that improve model training speed at earlier stages of training, while improving model quality at later stages of training.
PCT/US2022/044201 2021-10-06 2022-09-21 Hardware-aware progressive training of machine learning models WO2023059439A1 (fr)

Priority Applications (3)

Application Number Priority Date Filing Date Title
KR1020237039206A KR20230170752A (ko) 2021-10-06 2022-09-21 Hardware-aware progressive training of machine learning models
CN202280036704.7A CN117999560A (zh) 2021-10-06 2022-09-21 Hardware-aware progressive training of machine learning models
EP22787100.1A EP4323928A1 (fr) 2021-10-06 2022-09-21 Hardware-aware progressive training of machine learning models

Applications Claiming Priority (4)

Application Number Priority Date Filing Date Title
US202163252743P 2021-10-06 2021-10-06
US63/252,743 2021-10-06
US17/899,728 2022-08-31
US17/899,728 US20230108177A1 (en) 2021-10-06 2022-08-31 Hardware-Aware Progressive Training Of Machine Learning Models

Publications (1)

Publication Number Publication Date
WO2023059439A1 (fr) 2023-04-13

Family

ID=83689824

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/US2022/044201 WO2023059439A1 (fr) 2021-10-06 2022-09-21 Hardware-aware progressive training of machine learning models

Country Status (1)

Country Link
WO (1) WO2023059439A1 (fr)

Citations (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20210097443A1 (en) * 2019-09-27 2021-04-01 Deepmind Technologies Limited Population-based training of machine learning models

Non-Patent Citations (2)

* Cited by examiner, † Cited by third party
Title
LUO MAI ET AL: "KungFu: Making Training in Distributed Machine Learning Adaptive", 6 November 2020 (2020-11-06), pages 948 - 965, XP061054690, Retrieved from the Internet <URL:http://www.usenix.org/sites/default/files/osdi20-full_proceedings_interior.pdf> [retrieved on 20201106] *
YONGGAN FU ET AL: "CPT: Efficient Deep Neural Network Training via Cyclic Precision", ARXIV.ORG, CORNELL UNIVERSITY LIBRARY, 201 OLIN LIBRARY CORNELL UNIVERSITY ITHACA, NY 14853, 7 May 2021 (2021-05-07), XP081953355 *

Similar Documents

Publication Publication Date Title
US20200265301A1 (en) Incremental training of machine learning tools
US11741361B2 (en) Machine learning-based network model building method and apparatus
US11620568B2 (en) Using hyperparameter predictors to improve accuracy of automatic machine learning model selection
CN106897268B (zh) Text semantic understanding method, device, and system
US20220108157A1 (en) Hardware architecture for introducing activation sparsity in neural network
WO2020081229A1 (fr) Automatic feature subset selection using feature ranking and scalable automatic search
US10635947B2 (en) Distributable classification system
EP3685316A1 (fr) Capsule neural networks
US20200410365A1 (en) Unsupervised neural network training using learned optimizers
US11954202B2 (en) Deep learning based detection of malicious shell scripts
CN112149809A (zh) Method and device for determining model hyperparameters, computing device, and medium
US20230108177A1 (en) Hardware-Aware Progressive Training Of Machine Learning Models
US20220138425A1 (en) Acronym definition network
JP2023552048A (ja) Neural architecture scaling for hardware accelerators
JP7353435B2 (ja) Method, computer device, and computer program for classifying data by unsupervised contrastive learning
EP3971782A2 (fr) Neural network selection
WO2023059439A1 (fr) Hardware-aware progressive training of machine learning models
US20220405623A1 (en) Explainable artificial intelligence in computing environment
JP2024521136A (ja) Hardware-aware progressive training of machine learning models
KR102641629B1 (ko) Method and system for processing data using an explainable-AI-based transformer
US20240037373A1 (en) OneShot Neural Architecture and Hardware Architecture Search
US20230297580A1 (en) Hybrid and Hierarchical Multi-Trial and OneShot Neural Architecture Search on Datacenter Machine Learning Accelerators
US11996116B2 (en) Methods and systems for implementing on-device non-semantic representation fine-tuning for speech classification
US20220059117A1 (en) Methods and Systems for Implementing On-Device Non-Semantic Representation Fine-Tuning for Speech Classification
US20230230358A1 (en) System and methods for active domain adaptation

Legal Events

Date Code Title Description
121 EP: the EPO has been informed by WIPO that EP was designated in this application (Ref document number: 22787100; Country of ref document: EP; Kind code of ref document: A1)
ENP Entry into the national phase (Ref document number: 20237039206; Country of ref document: KR; Kind code of ref document: A)
WWE WIPO information: entry into national phase (Ref document number: 1020237039206, Country of ref document: KR; Ref document number: 2022787100, Country of ref document: EP)
ENP Entry into the national phase (Ref document number: 2022787100; Country of ref document: EP; Effective date: 20231114)
WWE WIPO information: entry into national phase (Ref document number: 2023572179; Country of ref document: JP)