CN116432743B - Method for improving throughput of reinforcement learning system - Google Patents

Method for improving throughput of reinforcement learning system

Info

Publication number
CN116432743B
CN116432743B
Authority
CN
China
Prior art keywords
training
model
throughput
sampling
sampler
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN202310419113.4A
Other languages
Chinese (zh)
Other versions
CN116432743A (en)
Inventor
赵来平
辛宇嵩
赵志新
代心安
胡一涛
李克秋
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Tianjin University
Original Assignee
Tianjin University
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Tianjin University
Priority to CN202310419113.4A
Publication of CN116432743A
Application granted
Publication of CN116432743B
Active
Anticipated expiration

Classifications

    • G  PHYSICS
    • G06  COMPUTING; CALCULATING OR COUNTING
    • G06N  COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00  Computing arrangements based on biological models
    • G06N3/02  Neural networks
    • G06N3/08  Learning methods
    • G06N3/092  Reinforcement learning
    • G  PHYSICS
    • G06  COMPUTING; CALCULATING OR COUNTING
    • G06N  COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00  Computing arrangements based on biological models
    • G06N3/02  Neural networks
    • G06N3/08  Learning methods
    • G06N3/084  Backpropagation, e.g. using gradient descent
    • Y  GENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
    • Y02  TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
    • Y02D  CLIMATE CHANGE MITIGATION TECHNOLOGIES IN INFORMATION AND COMMUNICATION TECHNOLOGIES [ICT], I.E. INFORMATION AND COMMUNICATION TECHNOLOGIES AIMING AT THE REDUCTION OF THEIR OWN ENERGY USE
    • Y02D10/00  Energy efficient computing, e.g. low power processors, power management or thermal management

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Data Mining & Analysis (AREA)
  • General Health & Medical Sciences (AREA)
  • Biomedical Technology (AREA)
  • Biophysics (AREA)
  • Computational Linguistics (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Evolutionary Computation (AREA)
  • Artificial Intelligence (AREA)
  • Molecular Biology (AREA)
  • Computing Systems (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Mathematical Physics (AREA)
  • Software Systems (AREA)
  • Health & Medical Sciences (AREA)
  • Management, Administration, Business Operations System, And Electronic Commerce (AREA)

Abstract

The invention discloses a method for improving the throughput of a reinforcement learning system, comprising the following steps: an RL training task is started; a coordinator first derives an optimal global configuration from the task configuration and hardware information, and then launches the pipeline samplers, the quantizer and the trainers; the samplers perform group-based parallel pipeline sampling and collect a number of trajectories; a message broker collects the trajectories and distributes them to several trainers, taking charge of message serialization and transmission; the trainers and the predictor train and evaluate the model with the received trajectories and pass the updated model weights to the quantizer for weight quantization; the quantizer sends the quantized model weights through the message broker to the agents in each sampler for the next round of sampling and training. The invention comprehensively improves the throughput of a reinforcement learning system: it promptly discovers and identifies bottlenecks in the sampling, training and communication stages and applies collaborative optimization techniques to raise throughput by up to 90.6%.

Description

Method for improving throughput of reinforcement learning system
Technical Field
The invention belongs to the field of AI systems, and particularly relates to a method for improving throughput of a reinforcement learning system.
Background
Reinforcement Learning (RL) algorithms have made tremendous leaps in a variety of areas, including Go, dialogue systems, autonomous driving and robotic manipulation. However, these gains come at a high cost. Unlike typical deep learning models, whose training data are prepared in advance, RL models are trained on trajectory data generated through real-time interaction with a simulation environment, and it may take days or weeks to gather enough samples for the trained model to converge, even when running on hundreds of CPUs and GPUs. For example, the virtual agent AlphaStar in the game StarCraft II is based on the A3C model and required 44 days of training on 3072 TPU cores and 50400 CPU cores to surpass human capability. Likewise, in autonomous driving, training a high-quality driving policy with the PPO algorithm takes about 10 days on two Tesla K40 GPUs before it can output optimal steering commands for a driving scenario.
It is therefore important to increase the throughput of the reinforcement learning system (i.e., the number of samples processed per unit time) and the convergence speed of the RL model. Existing research has focused on improving the throughput of reinforcement learning systems from two directions: (1) algorithmic innovation, for example replacing synchronous communication with asynchronous communication, which improves throughput by reducing the waiting time between training nodes and sampling nodes; and (2) resource-efficiency optimization, i.e., deploying more sampling, training or data-transmission tasks on limited resources without degrading performance, where the resources include the network, CPUs, GPUs and so on.
While existing work has significantly improved the throughput of RL systems, its applicable scenarios are very limited. For example, low-precision RL improves throughput only when network resources become the bottleneck; Sample Factory and GA3C are effective only when the sampling end and the training end, respectively, are the bottleneck. Experience with RL training shows that the bottleneck can appear in different places under different circumstances: the throughput bottleneck of an IMPALA task interacting with the CartPole environment lies on the training side, but on the sampling side when interacting with Atari environments. High throughput is therefore difficult to achieve with local optimization techniques alone.
Disclosure of Invention
In view of the above problems with the prior art, it is an object of the present invention to provide a method for improving the throughput of a reinforcement learning system.
In order to achieve the above purpose, the present invention adopts the following technical scheme:
a method of improving throughput of a reinforcement learning system, comprising the steps of:
s1, starting a RL training task, firstly deducing an optimal global configuration by a coordinator according to task configuration and hardware information, and then starting a pipeline sampler, a quantizer and a trainer;
s2, sampling the parallel pipeline based on the group by a sampler and collecting a certain number of tracks;
s3, collecting and distributing the tracks with a certain number by the message agency into a plurality of trainers to be responsible for message serialization and transmission;
s4, training and evaluating models by using the received tracks by a plurality of trainers and predictors, and transmitting updated model weights to a quantizer for weight quantization;
s5, the quantizer transmits quantized model weights to agents in each sampler through the message agents so as to perform the next round of sampling and training.
Preferably, in step S2, the sampler performs group-based parallel pipeline sampling as follows: the sampler divides the inference and the environments into m groups; each group contains one inference and n environments, the inference processes its n environments as a batch, and the m groups are deployed on several CPU cores and run in a single-pipeline manner.
Preferably, in step S3, the single-pipeline operation proceeds as follows: let E_A = {A_0, A_1, A_2}, E_B = {B_0, B_1, B_2} and E_C = {C_0, C_1, C_2} be three groups. According to the decision of the previous inference, the environment processes in E_A are executed first; on completion, their data are handed to the inference process of E_A, at which point the CPU cores occupied by the E_A environment processes are released and the environment processes in E_B start executing. This cycle continues until the environment processes in E_C have finished, by which time the inference of E_A has also completed and all environment processes in E_A start executing again. The throughput of pipeline sampling is denoted S_pip and is determined by the sampling time of the n parallel environments in a group.
Preferably, in step S4, the several trainers use the received trajectories to train the model as follows: the received trajectory data are distributed to several GPUs using an all-reduce architecture. Let t_trn denote the total time of one training step on a single GPU under the all-reduce architecture; for a training batch size b, t_trn is expressed as:
t_trn = b·t_f + t_b + 2d(g-1)/(gw)
where t_f is the forward-propagation time for processing one mini-batch, t_b is the backward-propagation time, d is the size of the training model, g is the number of GPUs and w is the bandwidth between GPUs. Under the all-reduce communication training architecture, the communication in the training process consists of parameter aggregation and parameter distribution, and the total communication time during model training is 2d(g-1)/(gw), so the training throughput S_l of the training end is:
S_l = g·b / (b·t_f + t_b + 2d(g-1)/(gw))
from which the variable b is further derived:
b = (t_b + 2d(g-1)/(gw)) / (g/s - t_f)
During training, the GPU nodes periodically perform weight synchronization, and after all training tasks are completed, the final weights are sent to the sampling end for the next round of iterative sampling.
Preferably, during the transmission between the sampler and the trainer, a quantizer performs model quantization as follows: the parameter precision of the machine-learning model is reduced from high precision to low precision without significantly affecting model accuracy, enabling faster computation, lower memory usage and lower network bandwidth requirements; the precision used by the quantizer is denoted q, specifically:
q ∈ {fp32, fp16, int8}
The throughput of the network is denoted S_n, where d_q is the quantized model size and e is the network bandwidth between the trainer and the sampler. Model quantization is implemented by adding an interval module responsible for compressing the model, and a parallel quantization scheme is used to accelerate the compression. When the training end produces a model, the model is put into the input queue and quantized by an idle quantization process; during compression, if another model is received, the new model does not wait for the previous process to finish but directly activates another available idle process; finally, the sampler reads the quantized model from the output queue.
The invention has the following beneficial effects:
the invention builds an HRL system and provides a method for improving throughput of the reinforcement learning system based on the HRL system, which can comprehensively improve the throughput of the reinforcement learning system, can timely discover and identify bottlenecks in sampling, training and communication stages, and adopts an effective collaborative optimization technology to improve the throughput. Compared with the existing work, the constructed HRL system has three advantages: firstly, the method is comprehensive, and can solve various bottleneck problems; secondly, the efficiency is high, and the training time can be shortened by 90%; thirdly, the system is expandable and can be integrated with various reinforcement learning frameworks. The HRL system optimizes the training process mainly by the following three aspects: the training end, the sampling end and the network end solve the throughput limit of the traditional RL system. HRL implements a comprehensive optimization system that can increase the throughput of the overall system by up to 90.6% (see fig. 3). Improvements come from efficient sampling methods and solving bottlenecks in network communications and trainers. Experimental results show that the HRL system can significantly improve the performance of the DQN, PPO and IMPALA (as shown in figure 3) algorithms. The system has been integrated with criminal days and has been shown to be effective in improving convergence speed of RL models. Experimental results show that the throughput of the HRL system is improved by 18.6% -90.6%.
Drawings
In order to more clearly illustrate the technical solutions of the embodiments of the present invention, the drawings that are needed in the embodiments will be briefly described below, and it is obvious that the drawings in the following description are only some embodiments of the present invention, and other drawings may be obtained according to these drawings without inventive effort for a person skilled in the art.
FIG. 1 is a schematic diagram of an HRL system architecture;
FIG. 2 is a schematic diagram of the group-based pipeline sampling process;
FIG. 3 is a graph of results of improving reinforcement learning system throughput based on the HRL system.
Detailed Description
In the following description, for purposes of explanation and not limitation, specific details are set forth such as the particular system architecture, techniques, etc., in order to provide a thorough understanding of the embodiments of the present invention. It will be apparent, however, to one skilled in the art that the present invention may be practiced in other embodiments that depart from these specific details.
Example 1
Referring to fig. 1-3, in order to solve the problem of low throughput of the existing reinforcement learning system, the first aspect of the embodiment of the present invention constructs an HRL system, which includes a sampler, a trainer, a quantizer, and a coordinator.
A second aspect of the embodiment of the invention provides a method for improving the throughput of a reinforcement learning system based on the HRL system, comprising the following steps:
S1, an HRL training task is started; the coordinator first derives an optimal global configuration from the task configuration and hardware information, and then launches the pipeline samplers, the quantizer and the trainers (see (1) in FIG. 1);
S2, the samplers in the pipeline sampler begin parallel sampling and collect a number of trajectories ((2) in FIG. 1);
S3, the message broker collects the trajectories and distributes them to several trainers, taking charge of message serialization and transmission ((3) in FIG. 1);
S4, the trainers and the predictor train and evaluate the model with the received trajectories, and pass the updated model weights to the quantizer for weight quantization ((4) in FIG. 1);
S5, the quantizer sends the quantized model weights through the message broker to the agents in each sampler for the next round of sampling and training ((5) in FIG. 1).
During the pre-training phase, the coordinator continuously monitors system throughput and resource utilization and makes fine-grained adjustments to the configuration of the pipeline samplers and trainers, until one of the components of the system (referred to as the sampling end, the training end or the communication end) reaches its bottleneck.
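The coordinator's monitoring logic is not given in code in this description; the following is a minimal Python sketch of how such a bottleneck check might be organized, where the probe interface, the function name monitor_bottleneck and the relative-gap heuristic are illustrative assumptions rather than part of the disclosed method.

```python
import time

def monitor_bottleneck(probes, poll_interval_s=5.0, rel_gap=0.15):
    """Poll per-stage throughput until one stage clearly lags the others.

    probes: dict mapping a stage name ("sampling end", "communication end",
    "training end") to a callable returning that stage's current throughput
    in samples per second, e.g. counters exported by the pipeline samplers,
    the message broker and the trainers.
    """
    while True:
        rates = {stage: probe() for stage, probe in probes.items()}
        slowest = min(rates, key=rates.get)
        if rates[slowest] < (1.0 - rel_gap) * max(rates.values()):
            return slowest, rates          # bottleneck identified
        time.sleep(poll_interval_s)        # otherwise keep fine-grained tuning

# Example with stubbed probes (made-up numbers): the training end is reported
# as the bottleneck because it trails the fastest stage by more than 15%.
print(monitor_bottleneck({
    "sampling end": lambda: 1200.0,
    "communication end": lambda: 1500.0,
    "training end": lambda: 900.0,
}))
```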
Sampling is an important task in a reinforcement learning system. It relies on the interaction between inference and the environment: the inference makes decisions, while the environment actually executes them. To improve sampling throughput, a large number of samplers are usually deployed. However, existing RL systems adopt a one-to-one mapping between inference and environment, i.e., each agent activates and interacts with a separate inference end. Because each pair of inference and environment waits on the other, resource efficiency is actually very low. In view of this, to improve sampling throughput, a group-based parallel pipeline sampling method is proposed to optimize the sampling end in the sampling step, as shown in FIG. 2; the specific steps are as follows:
the sampler divides the inferences and environments into m groups, in each of which there is one inference and n environments for batch processing of each inference, the m groups being disposed on several CPU cores and operating in a single pipeline manner to improve CPU efficiency. The single streamThe specific process of the waterline mode operation is as follows: set E A ={A 0 ,A 1 ,A 2 },E B ={B 0 ,B 1 ,B 2} and EC ={C 0 ,C 1 ,C 2 The execution of the single pipeline is then described as follows: first, according to the decision of the previous reasoning, E is executed A An environmental process in (a); after completion, give the data to E A In the reasoning process processing, at this time E A The CPU occupied by the environment process will be released and start executing E B The process is cycled through until E C The execution of the environmental process is completed. At the same time E A Is completed by reasoning, E A All the environment processes in the system are started to execute again by S pip The throughput representing pipeline sampling is expressed as follows:
wherein ,representing the sampling times of n parallel environments. Since reasoning requires waiting for execution of the last context in its corresponding group to complete, a larger n tends to result in a longer time required for the n contexts to complete. Therefore, the value of n should not be too great here. If there are a large number of CPU cores in the cluster, the invention can run several pipeline groups at the same time to improve the sampling throughput.
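To make the scheduling order concrete, the following is a minimal single-process Python sketch of group-based pipeline sampling. ToyEnv, batched_inference and the round-robin loop are illustrative stand-ins; a real implementation would place each group's environments and inference on separate CPU cores (for example with multiprocessing) so that the phases genuinely overlap.

```python
from collections import deque

class ToyEnv:
    """Stand-in for a real simulator (e.g. an Atari or CartPole environment)."""
    def __init__(self):
        self.obs = 0
    def step(self, action):
        self.obs += action
        return self.obs, 1.0, False           # observation, reward, done

def batched_inference(observations):
    """Stand-in for a group's single inference process: one batched
    forward pass over the n observations of the group."""
    return [1 for _ in observations]          # dummy actions

def pipeline_sample(m=3, n=4, iterations=100):
    """Group-based pipeline sampling: m groups, each with one inference
    and n environments.  While one group waits for its inference result,
    execution moves on to the next group's environment steps."""
    groups = [[ToyEnv() for _ in range(n)] for _ in range(m)]
    actions = [[0] * n for _ in range(m)]     # last decision of each group
    trajectories = []
    order = deque(range(m))
    for _ in range(iterations):
        g = order[0]
        order.rotate(-1)
        # 1) environment phase of group g (runs on the environment cores)
        transitions = [env.step(a) for env, a in zip(groups[g], actions[g])]
        trajectories.extend(transitions)
        # 2) inference phase of group g (in a real pipeline this overlaps
        #    with the next group's environment phase on a separate core)
        actions[g] = batched_inference([t[0] for t in transitions])
    return trajectories

if __name__ == "__main__":
    print(len(pipeline_sample()))             # groups are visited round-robin
```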
Since both the inference and the environments in group-based pipeline sampling consume CPU cores, the limited CPU resources should be used as fully as possible to maximize overall sampling throughput. Let u_avg denote the average throughput per CPU core; the objective of the invention is to maximize u_avg, which depends on the inference time for the n parallel environments (i.e., batch size n), the number of CPU cores c_cpu allocated to the pipeline, and the number of cores assigned to the inference process within a pipeline group. Since the environment sampling time and the inference time can be obtained by collecting data in advance, the variable n can easily be obtained by derivative calculation.
Given C CPU cores in the sampler cluster, the upper bound S_e of the sampler throughput can be derived, where r denotes the throughput achievable by the remaining CPUs not assigned to the pipeline scheme; this in turn yields another expression for the pipeline throughput.
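As an illustration of how n could be chosen from the pre-collected timings, the sketch below sweeps candidate group sizes under an assumed per-core throughput model (n samples per maximum of the environment and inference times, divided over one group's cores); this model and all numbers are assumptions made for the example, not the patent's formula.

```python
def best_group_size(t_env, t_inf, cores_per_group, n_candidates):
    """Pick n maximising the average throughput per CPU core.

    t_env(n) and t_inf(n) are the measured sampling and inference times
    for n parallel environments (collected offline, as described above).
    The throughput model used here is an illustrative assumption rather
    than the patent's exact formula.
    """
    def u_avg(n):
        s_pip = n / max(t_env(n), t_inf(n))   # samples per second of one group
        return s_pip / cores_per_group(n)     # per-core throughput
    return max(n_candidates, key=u_avg)

# Example with made-up measurements: each environment step takes 10 ms
# (perfectly parallel), inference takes 2 ms plus 0.5 ms per sample, and
# a group uses one inference core plus one core per environment.
n_star = best_group_size(
    t_env=lambda n: 0.010,
    t_inf=lambda n: 0.002 + 0.0005 * n,
    cores_per_group=lambda n: 1 + n,
    n_candidates=range(1, 33),
)
print(n_star)
```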
in order to improve the training efficiency of the model, the invention aims at optimizing the training end in the training step, and provides a method for training the model by adopting a plurality of training machines. The training task in RL has the same features as supervised learning. The RL algorithm defines the way in which the RL target is updated, during which the RL model performs forward and backward propagation tasks. With the increase of the sample at the sampling end, the requirement on the processing capacity of the training end is also increased. To address this problem, the present invention employs several trainers to accelerate the training process using distributed training.
The work load flow in the training task begins with the arrival of trace data collected from the sampling end proxy. The data is partitioned to different trainers for further processing. Once ready, forward and backward propagation of training will be performed.
The several trainers train the model with the received trajectories as follows: the training tasks are distributed across several GPUs using an all-reduce architecture. Although this incurs a large number of synchronization operations between the GPU nodes, it also enables the training end to handle large data sets.
Let t_trn denote the total time of one training step on a single GPU under the all-reduce architecture. Assuming the training batch size is b, t_trn can be expressed as:
t_trn = b·t_f + t_b + 2d(g-1)/(gw)
where t_f is the forward-propagation time for processing one mini-batch, t_b is the backward-propagation time, d is the size of the training model, g is the number of GPUs and w is the bandwidth between GPUs. Under the all-reduce communication training architecture, the communication in the training process consists mainly of parameter aggregation and parameter distribution, and the total communication time during model training is 2d(g-1)/(gw), so the training throughput S_l of the training end is:
S_l = g·b / (b·t_f + t_b + 2d(g-1)/(gw))
During training, the GPU nodes periodically perform weight synchronization. After all training tasks are completed, the final weights are sent to the sampling end for the next round of iterative sampling.
Data transmission between the training end and the sampling end requires a communication end, which is responsible for transmitting data samples to the trainers and for synchronizing updated model parameters back to the samplers. When the trained model is complex, the network can become a bottleneck. To improve the transmission efficiency of the network model, the invention uses buffering and model quantization to relieve the network bottleneck (i.e., to reduce the time data spend on the network).
1) Buffers: the samples generated by the samplers must be transmitted promptly, otherwise the pipeline is blocked and sampling efficiency drops. To solve this, the invention places two buffer queues at the communication layer, used to store updated model parameters and data samples respectively. As shown in FIG. 1, after generating samples, the pipeline sampler sends them directly to the agent end; the agent then stores the data in its buffer queue; meanwhile, the pipeline sampler fetches model parameters from the buffer of the other agent and executes the next sampling task.
2) Quantization: model quantization means that the quantizer reduces the parameter precision of the machine-learning model from high precision (e.g., 32-bit floating point, fp32) to low precision (e.g., 8-bit integer, int8) without significantly affecting model accuracy, enabling faster computation, lower memory usage and lower network bandwidth requirements. The precision used by the quantizer is denoted q:
q ∈ {fp32, fp16, int8}
The throughput of the network is denoted S_n, where d_q is the model size after quantization to precision q and e is the network bandwidth between the trainer and the sampler.
To implement quantization, a simple approach is to add an interval module responsible for compressing the model. However, directly adding this new module introduces extra per-iteration overhead and degrades the performance of the whole system; in some cases the time spent on model quantization is unacceptable for asynchronous RL algorithms. For example, quantizing IMPALA from 32-bit floating point to 8-bit integer takes more than 200 milliseconds. The invention therefore accelerates the compression process with parallel quantization. The quantizer consists of a series of parallel quantization processes and two queues, which store the models to be quantized and the quantized models respectively. When the training end produces a model, the model is put into the input queue and quantized by an idle quantization process. During compression, if another model is received, the new model does not wait for the previous process to finish but directly activates another available idle process; finally, the sampler reads the quantized model from the output queue.
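The parallel quantizer can be sketched in a few lines of Python: worker processes pull models from the input queue, compress them and push them to the output queue, so a newly produced model never waits for an earlier compression to finish. The symmetric per-tensor int8 scheme and all names below are illustrative assumptions; the description only fixes q ∈ {fp32, fp16, int8}.

```python
import multiprocessing as mp
import numpy as np

def quantize_int8(weights):
    """Symmetric per-tensor int8 quantisation of an fp32 weight array
    (one possible scheme; the text only fixes q in {fp32, fp16, int8})."""
    scale = max(float(np.abs(weights).max()) / 127.0, 1e-12)
    return (weights / scale).round().astype(np.int8), np.float32(scale)

def quant_worker(in_q, out_q):
    """One idle quantisation process: pull a model, compress it, push it on."""
    while True:
        item = in_q.get()
        if item is None:                   # shutdown sentinel
            break
        version, weights = item
        out_q.put((version, quantize_int8(weights)))

if __name__ == "__main__":
    in_q, out_q = mp.Queue(), mp.Queue()   # models to quantise / quantised models
    workers = [mp.Process(target=quant_worker, args=(in_q, out_q)) for _ in range(4)]
    for p in workers:
        p.start()
    # Trainer side: every newly produced model goes straight into the input
    # queue; a new model never waits for an earlier compression, because any
    # idle worker picks it up.
    for version in range(8):
        in_q.put((version, np.random.randn(1_000_000).astype(np.float32)))
    # Sampler side: read quantised models from the output queue.
    for _ in range(8):
        version, (q_weights, scale) = out_q.get()
        print(version, q_weights.dtype, float(scale))
    for _ in workers:
        in_q.put(None)
    for p in workers:
        p.join()
```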
High-throughput coordinator scheme: since the overall throughput s of the RL system is limited by a bottleneck that may occur in the trainers, the samplers or the network, we have:
max s = max min{S_l, S_e, S_n}
Given S_l, S_e and S_n, the coordinator needs to derive the configuration <n, q, b> that achieves optimal throughput, where n is the number of environments in a pipeline group and determines the sampling speed of the sampling end; q is the precision of model quantization and determines the size, and hence the transmission speed, of the data sent by the communication end; and b is the batch size used by the training end and determines the training speed of the training end.
Since the data transmitted in the two directions differ, the coordination method is discussed for each direction separately, as follows:
1) Sampler → network → trainer: since the overall throughput is limited by the bottleneck (i.e., the short-board effect), there is no need to configure the other components for maximum throughput. Instead, the configurations of the sampling end, the communication end and the training end are coordinated so that their throughputs stay consistent. The invention provides group-based pipeline sampling and, taking the overall throughput into account, derives the n for which the throughput formula holds.
The invention also provides multi-trainer training and derives the variable b: b = (t_b + 2d(g-1)/(gw)) / (g/s - t_f). The quantization model further optimizes the transmission rate of the network: from the sampling end to the training end, the network only needs to transmit the sampled trajectory data, so no quantizer needs to be configured in this direction. Assuming the trajectory data have size d_tra, the network transmission can be completed in d_tra/e, where e is the network bandwidth.
2) Trainer → network → sampler: from the trainer to the sampler, the network only needs to transmit the model parameters used to update the inference model. Since the model size is fixed, the trainer and the sampler are not optimized separately here. To improve network transmission efficiency, however, the invention configures the quantizer to low precision, as long as the low precision does not affect the accuracy of the model.
As can be seen from the results in FIG. 3, the invention constructs a comprehensive optimization system, the HRL system, which improves the throughput of the whole system by up to 90.6% and significantly improves the performance of the DQN, PPO and IMPALA algorithms. In addition, the system has been integrated with the Xingtian reinforcement learning framework and shown to effectively improve the convergence speed of RL models; experimental results show that the HRL system improves throughput by 18.6%-90.6%.
The present invention is not limited to the above-described specific embodiments; various modifications made by those skilled in the art, without inventive effort, based on the above concepts fall within the scope of the present invention.

Claims (3)

1. A method of improving throughput of a reinforcement learning system, comprising the steps of:
S1, an RL training task is started; a coordinator first derives an optimal global configuration from the task configuration and hardware information, and then launches pipeline samplers, a quantizer and trainers;
S2, the samplers perform group-based parallel pipeline sampling and collect a number of trajectories;
S3, a message broker collects the trajectories and transmits them to several trainers, taking charge of message serialization and transmission;
S4, the several trainers and a predictor train and evaluate a model with the received trajectories, and pass the updated model weights to the quantizer for weight quantization;
S5, the quantizer sends the quantized model weights through the message broker to the agents in each sampler for the next round of sampling and training;
in the step S4, the specific process by which the several trainers train the model with the received trajectories is: the received trajectory data are distributed to several GPUs using an all-reduce architecture; t_trn denotes the total time of one training step on a single GPU under the all-reduce architecture, and, assuming a training batch size of b, t_trn is expressed as:
t_trn = b·t_f + t_b + 2d(g-1)/(gw)
where t_f is the forward-propagation time for processing one mini-batch, t_b is the backward-propagation time, d is the size of the training model, g is the number of GPUs and w is the bandwidth between GPUs; under the all-reduce communication training architecture, the communication in the training process consists of parameter aggregation and parameter distribution, and the total communication time during model training is 2d(g-1)/(gw), so that the training throughput S_l of the training end is:
S_l = g·b / (b·t_f + t_b + 2d(g-1)/(gw))
from which the variable b is further derived:
b = (t_b + 2d(g-1)/(gw)) / (g/s - t_f)
in the training process, the GPU nodes periodically perform weight synchronization, and after all training tasks are completed, the final weights are sent to the sampling end for the next round of iterative sampling;
in the step S4, a quantizer is used for model quantization during transmission, as follows: the parameter precision of the machine-learning model is reduced from high precision to low precision without significantly affecting model accuracy; q denotes the precision used by the quantizer, specifically:
q ∈ {fp32, fp16, int8}
S_n denotes the throughput of the network, where d_q is the quantized model size and e is the network bandwidth between the trainer and the sampler; model quantization is implemented by adding an interval module responsible for compressing the model, and a parallel quantization scheme is used to accelerate the compression; when the training end produces a model, the model is put into an input queue and further quantized by an idle quantization process; during compression, if another model is received, the new model does not wait for the previous process to finish but directly activates another available idle process; finally, the sampler reads the quantized model from an output queue.
2. The method for improving throughput of a reinforcement learning system according to claim 1, wherein in step S2, the specific steps of the group-based parallel pipeline sampling performed by the sampler are: the sampler divides the inference and the environments into m groups; each group contains one inference and n parallel environments, the inference performs batch processing over its environments, and the m groups are deployed on several CPU cores and run in a single-pipeline manner.
3. The method for improving throughput of a reinforcement learning system according to claim 2, wherein in step S3, the specific process of the single-pipeline operation is: let E_A = {A_0, A_1, A_2}, E_B = {B_0, B_1, B_2} and E_C = {C_0, C_1, C_2} be three groups; first, according to the decision of the previous inference, the environment processes in E_A are executed; on completion, their data are handed to the inference process of E_A, at which point the CPU cores occupied by the E_A environment processes are released and the environment processes in E_B start executing; this cycle continues until the environment processes in E_C have finished, by which time the inference of E_A has also completed and all environment processes in E_A start executing again; S_pip denotes the throughput of pipeline sampling, which is determined by the sampling time of the n parallel environments.
CN202310419113.4A 2023-04-19 2023-04-19 Method for improving throughput of reinforcement learning system Active CN116432743B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202310419113.4A CN116432743B (en) 2023-04-19 2023-04-19 Method for improving throughput of reinforcement learning system

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202310419113.4A CN116432743B (en) 2023-04-19 2023-04-19 Method for improving throughput of reinforcement learning system

Publications (2)

Publication Number Publication Date
CN116432743A CN116432743A (en) 2023-07-14
CN116432743B true CN116432743B (en) 2023-10-10

Family

ID=87092385

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202310419113.4A Active CN116432743B (en) 2023-04-19 2023-04-19 Method for improving throughput of reinforcement learning system

Country Status (1)

Country Link
CN (1) CN116432743B (en)


Patent Citations (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
WO2022042840A1 (en) * 2020-08-27 2022-03-03 Siemens Aktiengesellschaft Method for a state engineering for a reinforcement learning (rl) system, computer program product and rl system
CN112437020A (en) * 2020-10-30 2021-03-02 天津大学 Data center network load balancing method based on deep reinforcement learning
KR20220071058A (en) * 2020-11-23 2022-05-31 서울대학교산학협력단 Network throughput estimating apparatus and method
CN113312178A (en) * 2021-05-24 2021-08-27 河海大学 Assembly line parallel training task allocation method based on deep reinforcement learning
CN114429210A (en) * 2022-01-27 2022-05-03 西安交通大学 Cloud-protogenesis-based reinforcement learning pipeline method, system, equipment and storage medium
CN115904666A (en) * 2022-12-16 2023-04-04 上海交通大学 Deep learning training task scheduling system facing GPU cluster

Non-Patent Citations (2)

* Cited by examiner, † Cited by third party
Title
Named entity recognition method based on reinforcement learning co-training; 程钟慧; 陈珂; 陈刚; 徐世泽; 傅丁莉; Software Engineering (软件工程), No. 01; full text *
DSDV-based cooperative routing algorithm for maximizing throughput in wireless networks; 赵方圆; 韩昌彩; 李媛; Signal Processing (信号处理), No. 04; full text *

Also Published As

Publication number Publication date
CN116432743A (en) 2023-07-14

Similar Documents

Publication Publication Date Title
CN113950066B (en) Single server part calculation unloading method, system and equipment under mobile edge environment
CN109753751B (en) MEC random task migration method based on machine learning
Zhang et al. A multi-agent reinforcement learning approach for efficient client selection in federated learning
CN112367353A (en) Mobile edge computing unloading method based on multi-agent reinforcement learning
CN108090560A (en) The design method of LSTM recurrent neural network hardware accelerators based on FPGA
CN114340016B (en) Power grid edge calculation unloading distribution method and system
CN111367657A (en) Computing resource collaborative cooperation method based on deep reinforcement learning
CN108111335A (en) A kind of method and system dispatched and link virtual network function
CN112732436B (en) Deep reinforcement learning acceleration method of multi-core processor-single graphics processor
CN113472597B (en) Distributed convolutional neural network fine-grained parameter transmission scheduling method and device
CN111740925B (en) Deep reinforcement learning-based flow scheduling method
CN114938381B (en) D2D-MEC unloading method based on deep reinforcement learning
CN116156563A (en) Heterogeneous task and resource end edge collaborative scheduling method based on digital twin
Fang et al. Smart collaborative optimizations strategy for mobile edge computing based on deep reinforcement learning
CN116432743B (en) Method for improving throughput of reinforcement learning system
CN117579701A (en) Mobile edge network computing and unloading method and system
CN117436485A (en) Multi-exit point end-edge-cloud cooperative system and method based on trade-off time delay and precision
CN116489708B (en) Meta universe oriented cloud edge end collaborative mobile edge computing task unloading method
CN112199154A (en) Distributed collaborative sampling central optimization-based reinforcement learning training system and method
CN116909738A (en) Neural network call simplification method based on distributed resource pool
Liu et al. Dependency-aware task offloading for vehicular edge computing with end-edge-cloud collaborative computing
CN115114030B (en) On-line multi-workflow scheduling method based on reinforcement learning
CN115695424A (en) Dependent task online unloading method based on cooperative edge computing
Furukawa et al. Accelerating Distributed Deep Reinforcement Learning by In-Network Experience Sampling
CN116938323B (en) Satellite transponder resource allocation method based on reinforcement learning

Legal Events

Date Code Title Description
PB01 Publication
PB01 Publication
SE01 Entry into force of request for substantive examination
SE01 Entry into force of request for substantive examination
GR01 Patent grant
GR01 Patent grant