CN116432743B - Method for improving throughput of reinforcement learning system - Google Patents

Method for improving throughput of reinforcement learning system

Info

Publication number
CN116432743B
CN116432743B
Authority
CN
China
Prior art keywords
training
model
throughput
sampling
sampler
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN202310419113.4A
Other languages
Chinese (zh)
Other versions
CN116432743A (en)
Inventor
赵来平
辛宇嵩
赵志新
代心安
胡一涛
李克秋
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Tianjin University
Original Assignee
Tianjin University
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Tianjin University
Priority to CN202310419113.4A
Publication of CN116432743A
Application granted
Publication of CN116432743B
Active
Anticipated expiration

Classifications

    • G  PHYSICS
    • G06  COMPUTING; CALCULATING OR COUNTING
    • G06N  COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00  Computing arrangements based on biological models
    • G06N3/02  Neural networks
    • G06N3/08  Learning methods
    • G06N3/092  Reinforcement learning
    • G  PHYSICS
    • G06  COMPUTING; CALCULATING OR COUNTING
    • G06N  COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00  Computing arrangements based on biological models
    • G06N3/02  Neural networks
    • G06N3/08  Learning methods
    • G06N3/084  Backpropagation, e.g. using gradient descent
    • Y  GENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
    • Y02  TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
    • Y02D  CLIMATE CHANGE MITIGATION TECHNOLOGIES IN INFORMATION AND COMMUNICATION TECHNOLOGIES [ICT], I.E. INFORMATION AND COMMUNICATION TECHNOLOGIES AIMING AT THE REDUCTION OF THEIR OWN ENERGY USE
    • Y02D10/00  Energy efficient computing, e.g. low power processors, power management or thermal management

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Data Mining & Analysis (AREA)
  • General Health & Medical Sciences (AREA)
  • Biomedical Technology (AREA)
  • Biophysics (AREA)
  • Computational Linguistics (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Evolutionary Computation (AREA)
  • Artificial Intelligence (AREA)
  • Molecular Biology (AREA)
  • Computing Systems (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Mathematical Physics (AREA)
  • Software Systems (AREA)
  • Health & Medical Sciences (AREA)
  • Management, Administration, Business Operations System, And Electronic Commerce (AREA)

Abstract

The invention discloses a method for improving the throughput of a reinforcement learning system, comprising the following steps: an RL training task is started; a coordinator first derives an optimal global configuration from the task configuration and hardware information, and then launches the pipeline samplers, the quantizer and the trainers; the samplers perform group-based parallel pipeline sampling and collect a number of trajectories; a message broker collects the trajectories and distributes them to several trainers, taking charge of message serialization and transmission; the trainers and the predictor train and evaluate the model with the received trajectories and pass the updated model weights to the quantizer for weight quantization; the quantizer sends the quantized model weights through the message broker to the agents in each sampler for the next round of sampling and training. The invention comprehensively improves the throughput of a reinforcement learning system: it promptly discovers and identifies bottlenecks in the sampling, training and communication stages and applies collaborative optimization techniques to raise throughput by up to 90.6%.

Description

Method for improving throughput of reinforcement learning system
Technical Field
The invention belongs to the field of AI systems, and particularly relates to a method for improving throughput of a reinforcement learning system.
Background
Reinforcement Learning (RL) algorithms have made tremendous leaps in a variety of areas, including Go, dialogue systems, autonomous driving and robotic manipulation. However, these gains come at a high cost. Unlike typical deep learning models, whose training data are prepared in advance, RL models are trained on trajectory data generated through real-time interaction with a simulation environment, and it may take days or weeks to gather enough samples for the trained model to converge, even when running on hundreds of CPUs and GPUs. For example, the virtual agent AlphaStar in the game StarCraft II is based on the A3C model and required 44 days of training on 3072 TPU cores and 50400 CPU cores to surpass human capability. Likewise, in autonomous driving, training a high-quality driving policy with the PPO algorithm takes about 10 days on two Tesla K40 GPUs before it can output optimal steering commands for a driving scenario.
It is therefore important to increase the throughput of the reinforcement learning system (i.e., the number of samples processed per unit time) and the convergence speed of the RL model. Existing research has focused on improving the throughput of reinforcement learning systems from two directions: (1) algorithmic innovation, for example replacing synchronous communication with asynchronous communication, which improves throughput by reducing the waiting time between training nodes and sampling nodes; and (2) resource-efficiency optimization, i.e., deploying more sampling, training or data-transmission tasks on limited resources without degrading performance, where the resources include the network, CPUs, GPUs and so on.
While existing work has significantly improved the throughput of RL systems, its applicable scenarios are very limited. For example, low-precision RL improves throughput only when network resources become the bottleneck; Sample Factory and GA3C are effective only when the sampling end and the training end, respectively, are the bottleneck. Experience with RL training shows that the bottleneck can appear in different places under different circumstances: the throughput bottleneck of an IMPALA task interacting with the CartPole environment lies on the training side, but on the sampling side when interacting with Atari environments. High throughput is therefore difficult to achieve with local optimization techniques alone.
Disclosure of Invention
In view of the above problems with the prior art, it is an object of the present invention to provide a method for improving the throughput of a reinforcement learning system.
In order to achieve the above purpose, the present invention adopts the following technical scheme:
a method of improving throughput of a reinforcement learning system, comprising the steps of:
s1, starting a RL training task, firstly deducing an optimal global configuration by a coordinator according to task configuration and hardware information, and then starting a pipeline sampler, a quantizer and a trainer;
s2, sampling the parallel pipeline based on the group by a sampler and collecting a certain number of tracks;
s3, collecting and distributing the tracks with a certain number by the message agency into a plurality of trainers to be responsible for message serialization and transmission;
s4, training and evaluating models by using the received tracks by a plurality of trainers and predictors, and transmitting updated model weights to a quantizer for weight quantization;
s5, the quantizer transmits quantized model weights to agents in each sampler through the message agents so as to perform the next round of sampling and training.
Preferably, in step S2, the sampler performs group-based parallel pipeline sampling as follows: the sampler divides the inference and the environments into m groups; each group contains one inference and n environments, the inference processes its n environments as a batch, and the m groups are deployed on several CPU cores and run in a single-pipeline manner.
Preferably, in step S3, the single-pipeline operation proceeds as follows: let E_A = {A_0, A_1, A_2}, E_B = {B_0, B_1, B_2} and E_C = {C_0, C_1, C_2} be three groups. According to the decision of the previous inference, the environment processes in E_A are executed first; on completion, their data are handed to the inference process of E_A, at which point the CPU cores occupied by the E_A environment processes are released and the environment processes in E_B start executing. This cycle continues until the environment processes in E_C have finished, by which time the inference of E_A has also completed and all environment processes in E_A start executing again. The throughput of pipeline sampling is denoted S_pip and is determined by the sampling time of the n parallel environments in a group.
Preferably, in step S4, the several trainers use the received trajectories to train the model as follows: the received trajectory data are distributed to several GPUs using an all-reduce architecture. Let t_trn denote the total time of one training step on a single GPU under the all-reduce architecture; for a training batch size b, t_trn is expressed as:
t_trn = b·t_f + t_b + 2d(g-1)/(gw)
where t_f is the forward-propagation time for processing one mini-batch, t_b is the backward-propagation time, d is the size of the training model, g is the number of GPUs and w is the bandwidth between GPUs. Under the all-reduce communication training architecture, the communication in the training process consists of parameter aggregation and parameter distribution, and the total communication time during model training is 2d(g-1)/(gw), so the training throughput S_l of the training end is:
S_l = g·b / (b·t_f + t_b + 2d(g-1)/(gw))
from which the variable b is further derived:
b = (t_b + 2d(g-1)/(gw)) / (g/s - t_f)
During training, the GPU nodes periodically perform weight synchronization, and after all training tasks are completed, the final weights are sent to the sampling end for the next round of iterative sampling.
Preferably, during the transmission between the sampler and the trainer, a quantizer performs model quantization as follows: the parameter precision of the machine-learning model is reduced from high precision to low precision without significantly affecting model accuracy, enabling faster computation, lower memory usage and lower network bandwidth requirements; the precision used by the quantizer is denoted q, specifically:
q ∈ {fp32, fp16, int8}
The throughput of the network is denoted S_n, where d_q is the quantized model size and e is the network bandwidth between the trainer and the sampler. Model quantization is implemented by adding an interval module responsible for compressing the model, and a parallel quantization scheme is used to accelerate the compression. When the training end produces a model, the model is put into the input queue and quantized by an idle quantization process; during compression, if another model is received, the new model does not wait for the previous process to finish but directly activates another available idle process; finally, the sampler reads the quantized model from the output queue.
The invention has the following beneficial effects:
the invention builds an HRL system and provides a method for improving throughput of the reinforcement learning system based on the HRL system, which can comprehensively improve the throughput of the reinforcement learning system, can timely discover and identify bottlenecks in sampling, training and communication stages, and adopts an effective collaborative optimization technology to improve the throughput. Compared with the existing work, the constructed HRL system has three advantages: firstly, the method is comprehensive, and can solve various bottleneck problems; secondly, the efficiency is high, and the training time can be shortened by 90%; thirdly, the system is expandable and can be integrated with various reinforcement learning frameworks. The HRL system optimizes the training process mainly by the following three aspects: the training end, the sampling end and the network end solve the throughput limit of the traditional RL system. HRL implements a comprehensive optimization system that can increase the throughput of the overall system by up to 90.6% (see fig. 3). Improvements come from efficient sampling methods and solving bottlenecks in network communications and trainers. Experimental results show that the HRL system can significantly improve the performance of the DQN, PPO and IMPALA (as shown in figure 3) algorithms. The system has been integrated with criminal days and has been shown to be effective in improving convergence speed of RL models. Experimental results show that the throughput of the HRL system is improved by 18.6% -90.6%.
Drawings
In order to more clearly illustrate the technical solutions of the embodiments of the present invention, the drawings that are needed in the embodiments will be briefly described below, and it is obvious that the drawings in the following description are only some embodiments of the present invention, and other drawings may be obtained according to these drawings without inventive effort for a person skilled in the art.
FIG. 1 is a schematic diagram of an HRL system architecture;
FIG. 2 is a schematic diagram of the group-based pipeline sampling process;
FIG. 3 is a graph of results of improving reinforcement learning system throughput based on the HRL system.
Detailed Description
In the following description, for purposes of explanation and not limitation, specific details are set forth such as the particular system architecture, techniques, etc., in order to provide a thorough understanding of the embodiments of the present invention. It will be apparent, however, to one skilled in the art that the present invention may be practiced in other embodiments that depart from these specific details.
Example 1
Referring to fig. 1-3, in order to solve the problem of low throughput of the existing reinforcement learning system, the first aspect of the embodiment of the present invention constructs an HRL system, which includes a sampler, a trainer, a quantizer, and a coordinator.
A second aspect of the embodiment of the invention provides a method for improving the throughput of a reinforcement learning system based on the HRL system, comprising the following steps:
S1, an HRL training task is started; the coordinator first derives an optimal global configuration from the task configuration and hardware information, and then launches the pipeline samplers, the quantizer and the trainers (see (1) in FIG. 1);
S2, the samplers in the pipeline sampler begin parallel sampling and collect a number of trajectories ((2) in FIG. 1);
S3, the message broker collects the trajectories and distributes them to several trainers, taking charge of message serialization and transmission ((3) in FIG. 1);
S4, the trainers and the predictor train and evaluate the model with the received trajectories, and pass the updated model weights to the quantizer for weight quantization ((4) in FIG. 1);
S5, the quantizer sends the quantized model weights through the message broker to the agents in each sampler for the next round of sampling and training ((5) in FIG. 1).
During the pre-training phase, the coordinator continuously monitors system throughput and resource utilization and makes fine-grained adjustments to the configuration of the pipeline samplers and trainers, until one of the components of the system (referred to as the sampling end, the training end or the communication end) reaches its bottleneck.
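The coordinator's monitoring logic is not given in code in this description; the following is a minimal Python sketch of how such a bottleneck check might be organized, where the probe interface, the function name monitor_bottleneck and the relative-gap heuristic are illustrative assumptions rather than part of the disclosed method.

```python
import time

def monitor_bottleneck(probes, poll_interval_s=5.0, rel_gap=0.15):
    """Poll per-stage throughput until one stage clearly lags the others.

    probes: dict mapping a stage name ("sampling end", "communication end",
    "training end") to a callable returning that stage's current throughput
    in samples per second, e.g. counters exported by the pipeline samplers,
    the message broker and the trainers.
    """
    while True:
        rates = {stage: probe() for stage, probe in probes.items()}
        slowest = min(rates, key=rates.get)
        if rates[slowest] < (1.0 - rel_gap) * max(rates.values()):
            return slowest, rates          # bottleneck identified
        time.sleep(poll_interval_s)        # otherwise keep fine-grained tuning

# Example with stubbed probes (made-up numbers): the training end is reported
# as the bottleneck because it trails the fastest stage by more than 15%.
print(monitor_bottleneck({
    "sampling end": lambda: 1200.0,
    "communication end": lambda: 1500.0,
    "training end": lambda: 900.0,
}))
```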
Sampling is an important task in a reinforcement learning system. It relies on the interaction between inference and the environment: the inference makes decisions, while the environment actually executes them. To improve sampling throughput, a large number of samplers are usually deployed. However, existing RL systems adopt a one-to-one mapping between inference and environment, i.e., each agent activates and interacts with a separate inference end. Because each pair of inference and environment waits on the other, resource efficiency is actually very low. In view of this, to improve sampling throughput, a group-based parallel pipeline sampling method is proposed to optimize the sampling end in the sampling step, as shown in FIG. 2; the specific steps are as follows:
the sampler divides the inferences and environments into m groups, in each of which there is one inference and n environments for batch processing of each inference, the m groups being disposed on several CPU cores and operating in a single pipeline manner to improve CPU efficiency. The single streamThe specific process of the waterline mode operation is as follows: set E A ={A 0 ,A 1 ,A 2 },E B ={B 0 ,B 1 ,B 2} and EC ={C 0 ,C 1 ,C 2 The execution of the single pipeline is then described as follows: first, according to the decision of the previous reasoning, E is executed A An environmental process in (a); after completion, give the data to E A In the reasoning process processing, at this time E A The CPU occupied by the environment process will be released and start executing E B The process is cycled through until E C The execution of the environmental process is completed. At the same time E A Is completed by reasoning, E A All the environment processes in the system are started to execute again by S pip The throughput representing pipeline sampling is expressed as follows:
wherein ,representing the sampling times of n parallel environments. Since reasoning requires waiting for execution of the last context in its corresponding group to complete, a larger n tends to result in a longer time required for the n contexts to complete. Therefore, the value of n should not be too great here. If there are a large number of CPU cores in the cluster, the invention can run several pipeline groups at the same time to improve the sampling throughput.
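To make the scheduling order concrete, the following is a minimal single-process Python sketch of group-based pipeline sampling. ToyEnv, batched_inference and the round-robin loop are illustrative stand-ins; a real implementation would place each group's environments and inference on separate CPU cores (for example with multiprocessing) so that the phases genuinely overlap.

```python
from collections import deque

class ToyEnv:
    """Stand-in for a real simulator (e.g. an Atari or CartPole environment)."""
    def __init__(self):
        self.obs = 0
    def step(self, action):
        self.obs += action
        return self.obs, 1.0, False           # observation, reward, done

def batched_inference(observations):
    """Stand-in for a group's single inference process: one batched
    forward pass over the n observations of the group."""
    return [1 for _ in observations]          # dummy actions

def pipeline_sample(m=3, n=4, iterations=100):
    """Group-based pipeline sampling: m groups, each with one inference
    and n environments.  While one group waits for its inference result,
    execution moves on to the next group's environment steps."""
    groups = [[ToyEnv() for _ in range(n)] for _ in range(m)]
    actions = [[0] * n for _ in range(m)]     # last decision of each group
    trajectories = []
    order = deque(range(m))
    for _ in range(iterations):
        g = order[0]
        order.rotate(-1)
        # 1) environment phase of group g (runs on the environment cores)
        transitions = [env.step(a) for env, a in zip(groups[g], actions[g])]
        trajectories.extend(transitions)
        # 2) inference phase of group g (in a real pipeline this overlaps
        #    with the next group's environment phase on a separate core)
        actions[g] = batched_inference([t[0] for t in transitions])
    return trajectories

if __name__ == "__main__":
    print(len(pipeline_sample()))             # groups are visited round-robin
```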
Since both the inference and the environments in group-based pipeline sampling consume CPU cores, the limited CPU resources should be used as fully as possible to maximize overall sampling throughput. Let u_avg denote the average throughput per CPU core; the objective of the invention is to maximize u_avg, which depends on the inference time for the n parallel environments (i.e., batch size n), the number of CPU cores c_cpu allocated to the pipeline, and the number of cores assigned to the inference process within a pipeline group. Since the environment sampling time and the inference time can be obtained by collecting data in advance, the variable n can easily be obtained by derivative calculation.
Given C CPU cores in the sampler cluster, the upper bound S_e of the sampler throughput can be derived, where r denotes the throughput achievable by the remaining CPUs not assigned to the pipeline scheme; this in turn yields another expression for the pipeline throughput.
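As an illustration of how n could be chosen from the pre-collected timings, the sketch below sweeps candidate group sizes under an assumed per-core throughput model (n samples per maximum of the environment and inference times, divided over one group's cores); this model and all numbers are assumptions made for the example, not the patent's formula.

```python
def best_group_size(t_env, t_inf, cores_per_group, n_candidates):
    """Pick n maximising the average throughput per CPU core.

    t_env(n) and t_inf(n) are the measured sampling and inference times
    for n parallel environments (collected offline, as described above).
    The throughput model used here is an illustrative assumption rather
    than the patent's exact formula.
    """
    def u_avg(n):
        s_pip = n / max(t_env(n), t_inf(n))   # samples per second of one group
        return s_pip / cores_per_group(n)     # per-core throughput
    return max(n_candidates, key=u_avg)

# Example with made-up measurements: each environment step takes 10 ms
# (perfectly parallel), inference takes 2 ms plus 0.5 ms per sample, and
# a group uses one inference core plus one core per environment.
n_star = best_group_size(
    t_env=lambda n: 0.010,
    t_inf=lambda n: 0.002 + 0.0005 * n,
    cores_per_group=lambda n: 1 + n,
    n_candidates=range(1, 33),
)
print(n_star)
```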
in order to improve the training efficiency of the model, the invention aims at optimizing the training end in the training step, and provides a method for training the model by adopting a plurality of training machines. The training task in RL has the same features as supervised learning. The RL algorithm defines the way in which the RL target is updated, during which the RL model performs forward and backward propagation tasks. With the increase of the sample at the sampling end, the requirement on the processing capacity of the training end is also increased. To address this problem, the present invention employs several trainers to accelerate the training process using distributed training.
The work load flow in the training task begins with the arrival of trace data collected from the sampling end proxy. The data is partitioned to different trainers for further processing. Once ready, forward and backward propagation of training will be performed.
The several trainers train the model with the received trajectories as follows: the training tasks are distributed across several GPUs using an all-reduce architecture. Although this incurs a large number of synchronization operations between the GPU nodes, it also enables the training end to handle large data sets.
Let t_trn denote the total time of one training step on a single GPU under the all-reduce architecture. Assuming the training batch size is b, t_trn can be expressed as:
t_trn = b·t_f + t_b + 2d(g-1)/(gw)
where t_f is the forward-propagation time for processing one mini-batch, t_b is the backward-propagation time, d is the size of the training model, g is the number of GPUs and w is the bandwidth between GPUs. Under the all-reduce communication training architecture, the communication in the training process consists mainly of parameter aggregation and parameter distribution, and the total communication time during model training is 2d(g-1)/(gw), so the training throughput S_l of the training end is:
S_l = g·b / (b·t_f + t_b + 2d(g-1)/(gw))
During training, the GPU nodes periodically perform weight synchronization. After all training tasks are completed, the final weights are sent to the sampling end for the next round of iterative sampling.
Data transmission between the training end and the sampling end requires a communication end, which is responsible for transmitting data samples to the trainers and for synchronizing updated model parameters back to the samplers. When the trained model is complex, the network can become a bottleneck. To improve the transmission efficiency of the network model, the invention uses buffering and model quantization to relieve the network bottleneck (i.e., to reduce the time data spend on the network).
1) Buffers: the samples generated by the samplers must be transmitted promptly, otherwise the pipeline is blocked and sampling efficiency drops. To solve this, the invention places two buffer queues at the communication layer, used to store updated model parameters and data samples respectively. As shown in FIG. 1, after generating samples, the pipeline sampler sends them directly to the agent end; the agent then stores the data in its buffer queue; meanwhile, the pipeline sampler fetches model parameters from the buffer of the other agent and executes the next sampling task.
2) Quantization: model quantization means that the quantizer reduces the parameter precision of the machine-learning model from high precision (e.g., 32-bit floating point, fp32) to low precision (e.g., 8-bit integer, int8) without significantly affecting model accuracy, enabling faster computation, lower memory usage and lower network bandwidth requirements. The precision used by the quantizer is denoted q:
q ∈ {fp32, fp16, int8}
The throughput of the network is denoted S_n, where d_q is the model size after quantization to precision q and e is the network bandwidth between the trainer and the sampler.
To implement quantization, a simple approach is to add an interval module responsible for compressing the model. However, directly adding this new module introduces extra per-iteration overhead and degrades the performance of the whole system; in some cases the time spent on model quantization is unacceptable for asynchronous RL algorithms. For example, quantizing IMPALA from 32-bit floating point to 8-bit integer takes more than 200 milliseconds. The invention therefore accelerates the compression process with parallel quantization. The quantizer consists of a series of parallel quantization processes and two queues, which store the models to be quantized and the quantized models respectively. When the training end produces a model, the model is put into the input queue and quantized by an idle quantization process. During compression, if another model is received, the new model does not wait for the previous process to finish but directly activates another available idle process; finally, the sampler reads the quantized model from the output queue.
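The parallel quantizer can be sketched in a few lines of Python: worker processes pull models from the input queue, compress them and push them to the output queue, so a newly produced model never waits for an earlier compression to finish. The symmetric per-tensor int8 scheme and all names below are illustrative assumptions; the description only fixes q ∈ {fp32, fp16, int8}.

```python
import multiprocessing as mp
import numpy as np

def quantize_int8(weights):
    """Symmetric per-tensor int8 quantisation of an fp32 weight array
    (one possible scheme; the text only fixes q in {fp32, fp16, int8})."""
    scale = max(float(np.abs(weights).max()) / 127.0, 1e-12)
    return (weights / scale).round().astype(np.int8), np.float32(scale)

def quant_worker(in_q, out_q):
    """One idle quantisation process: pull a model, compress it, push it on."""
    while True:
        item = in_q.get()
        if item is None:                   # shutdown sentinel
            break
        version, weights = item
        out_q.put((version, quantize_int8(weights)))

if __name__ == "__main__":
    in_q, out_q = mp.Queue(), mp.Queue()   # models to quantise / quantised models
    workers = [mp.Process(target=quant_worker, args=(in_q, out_q)) for _ in range(4)]
    for p in workers:
        p.start()
    # Trainer side: every newly produced model goes straight into the input
    # queue; a new model never waits for an earlier compression, because any
    # idle worker picks it up.
    for version in range(8):
        in_q.put((version, np.random.randn(1_000_000).astype(np.float32)))
    # Sampler side: read quantised models from the output queue.
    for _ in range(8):
        version, (q_weights, scale) = out_q.get()
        print(version, q_weights.dtype, float(scale))
    for _ in workers:
        in_q.put(None)
    for p in workers:
        p.join()
```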
High-throughput coordinator scheme: since the overall throughput s of the RL system is limited by a bottleneck that may occur in the trainers, the samplers or the network, we have:
max s = max min{S_l, S_e, S_n}
Given S_l, S_e and S_n, the coordinator needs to derive the configuration <n, q, b> that achieves optimal throughput, where n is the number of environments in a pipeline group and determines the sampling speed of the sampling end; q is the precision of model quantization and determines the size, and hence the transmission speed, of the data sent by the communication end; and b is the batch size used by the training end and determines the training speed of the training end.
Since the data transmitted in the two directions differ, the coordination method is discussed for each direction separately, as follows:
1) Sampler → network → trainer: since the overall throughput is limited by the bottleneck (i.e., the short-board effect), there is no need to configure the other components for maximum throughput. Instead, the configurations of the sampling end, the communication end and the training end are coordinated so that their throughputs stay consistent. The invention provides group-based pipeline sampling and, taking the overall throughput into account, derives the n for which the throughput formula holds.
The invention also provides multi-trainer training and derives the variable b: b = (t_b + 2d(g-1)/(gw)) / (g/s - t_f). The quantization model further optimizes the transmission rate of the network: from the sampling end to the training end, the network only needs to transmit the sampled trajectory data, so no quantizer needs to be configured in this direction. Assuming the trajectory data have size d_tra, the network transmission can be completed in d_tra/e, where e is the network bandwidth.
2) Trainer → network → sampler: from the trainer to the sampler, the network only needs to transmit the model parameters used to update the inference model. Since the model size is fixed, the trainer and the sampler are not optimized separately here. To improve network transmission efficiency, however, the invention configures the quantizer to low precision, as long as the low precision does not affect the accuracy of the model.
As can be seen from the results in FIG. 3, the invention constructs a comprehensive optimization system, the HRL system, which improves the throughput of the whole system by up to 90.6% and significantly improves the performance of the DQN, PPO and IMPALA algorithms. In addition, the system has been integrated with the Xingtian reinforcement learning framework and shown to effectively improve the convergence speed of RL models; experimental results show that the HRL system improves throughput by 18.6%-90.6%.
The present invention is not limited to the above-described specific embodiments; various modifications made by those skilled in the art, without inventive effort, based on the above concepts fall within the scope of the present invention.

Claims (3)

1. A method of improving throughput of a reinforcement learning system, comprising the steps of:
S1, an RL training task is started; a coordinator first derives an optimal global configuration from the task configuration and hardware information, and then launches pipeline samplers, a quantizer and trainers;
S2, the samplers perform group-based parallel pipeline sampling and collect a number of trajectories;
S3, a message broker collects the trajectories and transmits them to several trainers, taking charge of message serialization and transmission;
S4, the several trainers and a predictor train and evaluate a model with the received trajectories, and pass the updated model weights to the quantizer for weight quantization;
S5, the quantizer sends the quantized model weights through the message broker to the agents in each sampler for the next round of sampling and training;
in the step S4, the specific process by which the several trainers train the model with the received trajectories is: the received trajectory data are distributed to several GPUs using an all-reduce architecture; t_trn denotes the total time of one training step on a single GPU under the all-reduce architecture, and, assuming a training batch size of b, t_trn is expressed as:
t_trn = b·t_f + t_b + 2d(g-1)/(gw)
where t_f is the forward-propagation time for processing one mini-batch, t_b is the backward-propagation time, d is the size of the training model, g is the number of GPUs and w is the bandwidth between GPUs; under the all-reduce communication training architecture, the communication in the training process consists of parameter aggregation and parameter distribution, and the total communication time during model training is 2d(g-1)/(gw), so that the training throughput S_l of the training end is:
S_l = g·b / (b·t_f + t_b + 2d(g-1)/(gw))
from which the variable b is further derived:
b = (t_b + 2d(g-1)/(gw)) / (g/s - t_f)
in the training process, the GPU nodes periodically perform weight synchronization, and after all training tasks are completed, the final weights are sent to the sampling end for the next round of iterative sampling;
in the step S4, a quantizer is used for model quantization during transmission, as follows: the parameter precision of the machine-learning model is reduced from high precision to low precision without significantly affecting model accuracy; q denotes the precision used by the quantizer, specifically:
q ∈ {fp32, fp16, int8}
S_n denotes the throughput of the network, where d_q is the quantized model size and e is the network bandwidth between the trainer and the sampler; model quantization is implemented by adding an interval module responsible for compressing the model, and a parallel quantization scheme is used to accelerate the compression; when the training end produces a model, the model is put into an input queue and further quantized by an idle quantization process; during compression, if another model is received, the new model does not wait for the previous process to finish but directly activates another available idle process; finally, the sampler reads the quantized model from an output queue.
2. The method for improving throughput of a reinforcement learning system according to claim 1, wherein in step S2, the specific steps of the group-based parallel pipeline sampling performed by the sampler are: the sampler divides the inference and the environments into m groups; each group contains one inference and n parallel environments, the inference performs batch processing over its environments, and the m groups are deployed on several CPU cores and run in a single-pipeline manner.
3. The method for improving throughput of a reinforcement learning system according to claim 2, wherein in step S3, the specific process of the single-pipeline operation is: let E_A = {A_0, A_1, A_2}, E_B = {B_0, B_1, B_2} and E_C = {C_0, C_1, C_2} be three groups; first, according to the decision of the previous inference, the environment processes in E_A are executed; on completion, their data are handed to the inference process of E_A, at which point the CPU cores occupied by the E_A environment processes are released and the environment processes in E_B start executing; this cycle continues until the environment processes in E_C have finished, by which time the inference of E_A has also completed and all environment processes in E_A start executing again; S_pip denotes the throughput of pipeline sampling, which is determined by the sampling time of the n parallel environments.
CN202310419113.4A 2023-04-19 2023-04-19 Method for improving throughput of reinforcement learning system Active CN116432743B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202310419113.4A CN116432743B (en) 2023-04-19 2023-04-19 Method for improving throughput of reinforcement learning system

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202310419113.4A CN116432743B (en) 2023-04-19 2023-04-19 Method for improving throughput of reinforcement learning system

Publications (2)

Publication Number Publication Date
CN116432743A CN116432743A (en) 2023-07-14
CN116432743B true CN116432743B (en) 2023-10-10

Family

ID=87092385

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202310419113.4A Active CN116432743B (en) 2023-04-19 2023-04-19 Method for improving throughput of reinforcement learning system

Country Status (1)

Country Link
CN (1) CN116432743B (en)


Patent Citations (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
WO2022042840A1 (en) * 2020-08-27 2022-03-03 Siemens Aktiengesellschaft Method for a state engineering for a reinforcement learning (rl) system, computer program product and rl system
CN112437020A (en) * 2020-10-30 2021-03-02 天津大学 Data center network load balancing method based on deep reinforcement learning
KR20220071058A (en) * 2020-11-23 2022-05-31 서울대학교산학협력단 Network throughput estimating apparatus and method
CN113312178A (en) * 2021-05-24 2021-08-27 河海大学 Assembly line parallel training task allocation method based on deep reinforcement learning
CN114429210A (en) * 2022-01-27 2022-05-03 西安交通大学 Cloud-protogenesis-based reinforcement learning pipeline method, system, equipment and storage medium
CN115904666A (en) * 2022-12-16 2023-04-04 上海交通大学 Deep learning training task scheduling system facing GPU cluster

Non-Patent Citations (2)

* Cited by examiner, † Cited by third party
Title
Named entity recognition method based on reinforcement learning co-training; 程钟慧; 陈珂; 陈刚; 徐世泽; 傅丁莉; Software Engineering (软件工程), No. 01; full text *
DSDV-based cooperative routing algorithm for maximizing throughput in wireless networks; 赵方圆; 韩昌彩; 李媛; Signal Processing (信号处理), No. 04; full text *

Also Published As

Publication number Publication date
CN116432743A (en) 2023-07-14

Similar Documents

Publication Publication Date Title
CN113950066B (en) Single server part calculation unloading method, system and equipment under mobile edge environment
CN109753751B (en) MEC random task migration method based on machine learning
Zhang et al. A multi-agent reinforcement learning approach for efficient client selection in federated learning
CN112367353A (en) Mobile edge computing unloading method based on multi-agent reinforcement learning
CN108090560A (en) The design method of LSTM recurrent neural network hardware accelerators based on FPGA
CN114340016B (en) Power grid edge calculation unloading distribution method and system
CN111367657A (en) Computing resource collaborative cooperation method based on deep reinforcement learning
CN108111335A (en) A kind of method and system dispatched and link virtual network function
CN112732436B (en) Deep reinforcement learning acceleration method of multi-core processor-single graphics processor
CN113472597B (en) Distributed convolutional neural network fine-grained parameter transmission scheduling method and device
CN111740925B (en) Deep reinforcement learning-based flow scheduling method
CN114938381B (en) D2D-MEC unloading method based on deep reinforcement learning
CN116156563A (en) Heterogeneous task and resource end edge collaborative scheduling method based on digital twin
Fang et al. Smart collaborative optimizations strategy for mobile edge computing based on deep reinforcement learning
CN116432743B (en) Method for improving throughput of reinforcement learning system
CN117579701A (en) Mobile edge network computing and unloading method and system
CN117436485A (en) Multi-exit point end-edge-cloud cooperative system and method based on trade-off time delay and precision
CN116489708B (en) Meta universe oriented cloud edge end collaborative mobile edge computing task unloading method
CN112199154A (en) Distributed collaborative sampling central optimization-based reinforcement learning training system and method
CN116909738A (en) Neural network call simplification method based on distributed resource pool
Liu et al. Dependency-aware task offloading for vehicular edge computing with end-edge-cloud collaborative computing
CN115114030B (en) On-line multi-workflow scheduling method based on reinforcement learning
CN115695424A (en) Dependent task online unloading method based on cooperative edge computing
Furukawa et al. Accelerating Distributed Deep Reinforcement Learning by In-Network Experience Sampling
CN116938323B (en) Satellite transponder resource allocation method based on reinforcement learning

Legal Events

Date Code Title Description
PB01 Publication
PB01 Publication
SE01 Entry into force of request for substantive examination
SE01 Entry into force of request for substantive examination
GR01 Patent grant
GR01 Patent grant