CN112732436A - Deep reinforcement learning acceleration method of multi-core processor-single graphics processor - Google Patents

Deep reinforcement learning acceleration method of multi-core processor-single graphics processor

Info

Publication number
CN112732436A
CN112732436A (application CN202011476497.6A)
Authority
CN
China
Prior art keywords
cpu
gpu
reinforcement learning
environment
core
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Granted
Application number
CN202011476497.6A
Other languages
Chinese (zh)
Other versions
CN112732436B (en)
Inventor
阮爱武
朱重阳
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
University of Electronic Science and Technology of China
Original Assignee
University of Electronic Science and Technology of China
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by University of Electronic Science and Technology of China filed Critical University of Electronic Science and Technology of China
Priority to CN202011476497.6A priority Critical patent/CN112732436B/en
Publication of CN112732436A publication Critical patent/CN112732436A/en
Application granted granted Critical
Publication of CN112732436B publication Critical patent/CN112732436B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical

Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F9/00Arrangements for program control, e.g. control units
    • G06F9/06Arrangements for program control, e.g. control units using stored programs, i.e. using an internal store of processing equipment to receive or retain programs
    • G06F9/46Multiprogramming arrangements
    • G06F9/50Allocation of resources, e.g. of the central processing unit [CPU]
    • G06F9/5005Allocation of resources, e.g. of the central processing unit [CPU] to service a request
    • G06F9/5027Allocation of resources, e.g. of the central processing unit [CPU] to service a request the resource being a machine, e.g. CPUs, Servers, Terminals
    • G06F9/505Allocation of resources, e.g. of the central processing unit [CPU] to service a request the resource being a machine, e.g. CPUs, Servers, Terminals considering the load
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/02Neural networks
    • G06N3/04Architecture, e.g. interconnection topology
    • G06N3/045Combinations of networks
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/02Neural networks
    • G06N3/06Physical realisation, i.e. hardware implementation of neural networks, neurons or parts of neurons
    • G06N3/063Physical realisation, i.e. hardware implementation of neural networks, neurons or parts of neurons using electronic means
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/02Neural networks
    • G06N3/08Learning methods
    • G06N3/084Backpropagation, e.g. using gradient descent
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06TIMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T1/00General purpose image data processing
    • G06T1/20Processor architectures; Processor configuration, e.g. pipelining

Landscapes

  • Engineering & Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Theoretical Computer Science (AREA)
  • Software Systems (AREA)
  • General Physics & Mathematics (AREA)
  • Health & Medical Sciences (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Biomedical Technology (AREA)
  • Biophysics (AREA)
  • General Engineering & Computer Science (AREA)
  • General Health & Medical Sciences (AREA)
  • Molecular Biology (AREA)
  • Computing Systems (AREA)
  • Evolutionary Computation (AREA)
  • Data Mining & Analysis (AREA)
  • Mathematical Physics (AREA)
  • Computational Linguistics (AREA)
  • Artificial Intelligence (AREA)
  • Neurology (AREA)
  • Management, Administration, Business Operations System, And Electronic Commerce (AREA)

Abstract

The invention provides a deep reinforcement learning acceleration method for a multi-core processor and a single graphics processor. It establishes a deep reinforcement learning framework, in particular one based on the PPO algorithm, on a CPU + GPU platform, and also introduces a pipeline method that accelerates environment sampling and inference. A multi-environment simulation process runs on the multi-core CPU, with several environment processes stacked on each core, and the CPU also controls the data flow. The neural network inference process runs on the GPU. The CPU and the GPU each hold memory spaces for the action-network and evaluation-network parameters. During the interaction between the environments and the agent, the information from every step is stored in an experience pool in CPU memory, and at fixed intervals the CPU extracts a batch of data from this experience pool for training according to the truncation parameters and screening conditions of the algorithm. Under load balance, i.e. when the stacked environment simulation time of a single CPU core equals the neural network inference time, the invention realizes a pipeline structure in which environment sampling and action inference run in parallel: half of the CPU simulators run in parallel with GPU inference for the other half, and the next half-CPU-simulation/half-GPU-inference stage is prepared during data transfer. This further accelerates reinforcement learning, bringing the overall training speed to nearly twice that of the conventional method.

Description

Deep reinforcement learning acceleration method of multi-core processor-single graphics processor
Technical Field
The invention belongs to the field of computers, and particularly relates to a deep reinforcement learning acceleration method based on a multi-core Central Processing Unit (CPU)-single Graphics Processing Unit (GPU) platform.
Background
Deep Reinforcement Learning (DRL) is the product of combining deep learning with reinforcement learning: it unites the strong perceptual understanding of deep learning on problems such as vision with the decision-making ability of reinforcement learning, thereby realizing end-to-end learning. The emergence of deep reinforcement learning has brought reinforcement learning into practical use, enabling it to solve complex problems in real-world scenarios, and it is now widely applied in engineering fields such as industrial manufacturing, robot localization and recognition, and games.
Since the publication of (Mnih V, Kavukcuoglu K, Silver D, et al. Playing Atari with Deep Reinforcement Learning // Proceedings of the Workshops at the 26th Neural Information Processing Systems 2013, Lake Tahoe, USA, 2013: 201-220), a large number of algorithms have appeared in the field of deep reinforcement learning, including the DQN algorithm, which optimizes a Q-value target, and A3C and TRPO, which optimize a policy-gradient objective. In 2017, OpenAI published (Schulman, John & Wolski, Filip & Dhariwal, Prafulla & Radford, Alec & Klimov, Oleg. (2017). Proximal Policy Optimization Algorithms); the PPO algorithm optimizes the policy π in a proximal manner by simply imposing constraints on a surrogate objective function, which simplifies implementation and parameter tuning. Its performance is superior to most policy-gradient algorithms, making it one of the preferred algorithms in many DRL studies. As these algorithms have been proposed in succession, their complexity has gradually increased, requiring efficient computational support and framework support.
The PPO algorithm is mainly divided into an action network (Actor network) and an evaluation network (Critic network). The current state s of an environment is fed in; after the Critic network predicts the value function V, the advantage function A is obtained from the n-step discounted return values of partial Markov chains. The Actor network predicts a policy π, from which the action a for the current state s is selected and returned to the environment to obtain the next state s′, and this process is repeated. After T steps, the advantage function A generated under the old policy π is processed according to the clipping width or the KL divergence and used as the objective function to train the networks, which yields stable convergence.
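As an illustration of the n-step advantage computation described above, the following is a minimal Python sketch; the function and variable names (rewards, values, bootstrap_value, gamma) are assumptions made for illustration and do not come from the patent.

```python
# Minimal sketch (assumed names) of the n-step discounted-return advantage:
# A_t = r_t + g*r_{t+1} + ... + g^{n-1-t}*r_{n-1} + g^{n-t}*V(s_n) - V(s_t)
def n_step_advantages(rewards, values, bootstrap_value, gamma=0.99):
    advantages = []
    for t in range(len(rewards)):
        ret = bootstrap_value                 # V(s_n): closes the partial Markov chain
        for k in reversed(range(t, len(rewards))):
            ret = rewards[k] + gamma * ret    # fold the discounted return backwards
        advantages.append(ret - values[t])    # A_t = n-step return - V(s_t)
    return advantages

# Example: a 3-step segment of one environment
print(n_step_advantages([1.0, 0.0, 1.0], [0.5, 0.4, 0.6], bootstrap_value=0.3))
```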
Much of the deep reinforcement learning training in the prior art is built on a single-environment configuration of the CPU-GPU framework, in which the CPU is responsible for building the environment and controlling the data, while the GPU performs high-throughput parallel computation. Because only a single environment is configured, only one state can be predicted at a time, and the overhead of scheduling the GPU is sometimes longer than the parallel computation itself, so the acceleration effect on DRL is not obvious and may even result in a slowdown. How to improve training efficiency has therefore become the focus of many DRL studies.
Deep study is necessary to achieve efficient deep reinforcement learning: an effective hardware framework for deep reinforcement learning algorithms must be provided, and the data-flow process under that framework must be analyzed and optimized so as to increase the speed of deep reinforcement learning. The document (Liang et al., A Survey of Multi-Agent Deep Reinforcement Learning [J/OL], Acta Automatica Sinica, 2019) proposes a communication method and influence parameters for DRL multi-agent (Agent) settings, but it does not address how to realize them on hardware, and the acceleration of the same algorithm differs across hardware environments.
Disclosure of Invention
The aim of the invention is to provide, in view of the existing problems, a multi-environment training method for deep reinforcement learning under a specific hardware framework. It establishes a deep reinforcement learning framework, in particular one based on the PPO algorithm, on a CPU + GPU platform, and also introduces a pipeline method that accelerates environment sampling and inference.
The invention provides a training method of a deep reinforcement learning algorithm under a CPU + GPU platform, which comprises the following steps:
A multi-environment simulation process is realized on the multi-core CPU, with several environment processes stacked on each core, and the CPU controls the data. The neural network inference process is realized on the GPU. Under load balance, i.e. when the stacked environment simulation time of a single CPU core equals the neural network inference time, half of the CPU simulators run in parallel with GPU inference for the other half, and the next half-CPU-simulation/half-GPU-inference stage is prepared during data transfer. This further accelerates reinforcement learning, bringing the overall training speed to nearly twice that of the conventional method.
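As a sketch of the load-balancing condition, the number of environment processes stacked on each CPU core can be chosen so that the single-core simulation time matches the GPU inference time. The timing values below are purely illustrative assumptions, not measurements from the patent.

```python
# Hypothetical sketch: pick N, the number of environment processes stacked per
# CPU core, so that N * t_env_step (single-core simulation time) roughly equals
# t_gpu_infer (one batched inference pass on the GPU).
def envs_per_core(t_env_step: float, t_gpu_infer: float) -> int:
    return max(1, round(t_gpu_infer / t_env_step))

# Example with assumed timings: 2 ms per environment step, 14 ms per GPU pass
print(envs_per_core(0.002, 0.014))  # -> 7 environments per core
```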
The CPU and the GPU each hold memory spaces for the action network and the evaluation network parameters, and the information from every interaction step between the environments and the agent is stored in the experience pool in CPU memory.
At fixed intervals, the CPU extracts a batch of data from its experience pool according to the truncation parameters and screening conditions of the algorithm, computes the target TD (temporal-difference) value through the Bellman equation, and trains the network using the target TD value as the label, i.e. performs the back-propagation process; the experience pool is then cleared partially or completely according to the algorithm settings. The network memory space on the CPU and the network memory space on the GPU are then updated.
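A minimal sketch of the TD-target computation mentioned above, assuming the one-step Bellman form y = r + γ·V(s′); the names and the terminal-state handling are illustrative assumptions.

```python
# Sketch of the temporal-difference target used as the training label:
# y = r + gamma * V(s'), with V(s') dropped on terminal transitions.
def td_targets(rewards, next_values, dones, gamma=0.99):
    return [r + gamma * v * (0.0 if done else 1.0)
            for r, v, done in zip(rewards, next_values, dones)]

print(td_targets([1.0, 0.0], [0.5, 0.7], [False, True]))  # [1.495, 0.0]
```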
The neural network running on the GPU may take various forms, such as a Q-value-based DQN network, or a policy-based network such as A3C or PPO, and may be described in a high-level language such as C++ or Python. The network description method is not within the scope of the present invention.
The invention has the following characteristics:
1. The invention is based on a multi-core CPU-GPU framework, which is widely used and therefore universal.
2. Through load balance, when the stacked environment simulation time of a single CPU core equals the neural network inference time, the invention realizes a pipeline structure in which environment sampling and action inference run in parallel: half of the CPU simulators run in parallel with GPU inference for the other half, and the next half-CPU-simulation/half-GPU-inference stage is prepared during data transfer. This further accelerates reinforcement learning, bringing the overall training speed to nearly twice that of the conventional method.
3. The invention is universal. The framework of the invention can accelerate most DRL algorithms, because most DRL algorithms share the principles of the PPO algorithm used here in particular: all have an inference-training process and can reach load balance. The main differences between algorithms lie in the neural network construction, the computation of the objective function, and the maintenance and use of the experience pool; changing these factors does not affect the acceleration provided by the invention.
Drawings
FIG. 1 is a flow chart of an embodiment of the present invention;
FIG. 2 is a schematic diagram of the neural network running on the GPU;
FIG. 3 is a schematic diagram of the overall physical structure of the CPU-GPU framework;
FIG. 4 is a schematic diagram of the inference-process data flow under the CPU-GPU framework;
FIG. 5 is a schematic diagram of the training-process data flow under the CPU-GPU framework;
FIG. 6 is a schematic diagram of the pipeline capable of further acceleration under load balancing.
Detailed Description
The technical scheme in the embodiments of the invention is described clearly and completely below with reference to the accompanying drawings:
FIG. 1 shows an implementation process of the deep reinforcement learning acceleration method based on a multi-core CPU-GPU platform, which includes the following steps:
1. Memory spaces are allocated for the CPU and the GPU. Three memory spaces are arranged on the CPU: one stores the experience information pool and is used for network training; the other two store the action network parameters θ and the evaluation network parameters ω, respectively. Two memory spaces are allocated on the GPU, storing the local action network parameters θ⁻ and the local evaluation network parameters ω⁻, respectively. Besides each device controlling its own memory, the CPU and GPU memories communicate over the PCIE bus, including read and write operations. Random numbers generated by the CPU initialize θ and ω in memory, and the initialized parameters are then written to the network parameters on the GPU over the PCIE (peripheral component interconnect express) bus.
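A hypothetical PyTorch sketch of this step is given below; the patent does not name a software framework, and the layer sizes, the use of copy.deepcopy, and the .to(device) call standing in for the PCIE transfer are all assumptions made for illustration.

```python
# Hypothetical sketch of step 1 (PyTorch assumed): theta and omega live in CPU
# memory; the local copies theta- and omega- live in GPU memory. The .to() call
# stands in for the PCIE transfer; network shapes are illustrative only.
import copy
import torch
import torch.nn as nn

obs_dim, act_dim = 8, 4
actor = nn.Sequential(nn.Linear(obs_dim, 64), nn.Tanh(), nn.Linear(64, act_dim))  # theta (CPU)
critic = nn.Sequential(nn.Linear(obs_dim, 64), nn.Tanh(), nn.Linear(64, 1))       # omega (CPU)
experience_pool = []                                                               # CPU-side pool

device = "cuda" if torch.cuda.is_available() else "cpu"
actor_local = copy.deepcopy(actor).to(device)    # theta- (GPU copy)
critic_local = copy.deepcopy(critic).to(device)  # omega- (GPU copy)

# After each training round the updated CPU parameters are pushed back to the GPU copies
actor_local.load_state_dict(actor.state_dict())
critic_local.load_state_dict(critic.state_dict())
```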
2. For an M-core CPU, each core generates N reinforcement-learning interactive environments env, giving M×N environment simulators in total. The CPU serves as the control device and as the device for dense small-batch floating-point computation, while the GPU serves as the main computing device for DRL inference and runs the Actor network and the Critic network of the PPO algorithm. The GPU communicates with the CPU over the PCIE high-speed bus for data transfer and network-parameter updates.
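The sketch below illustrates, under assumptions, how the M×N environment simulators could be spawned with Python's multiprocessing module; DummyEnv is a placeholder for a real simulator, and the main process stands in for the control and inference side.

```python
# Hypothetical sketch of step 2: M worker processes (one per CPU core), each
# stacking N environment instances; states flow to the control process and
# actions flow back.
import multiprocessing as mp

class DummyEnv:
    def reset(self): return [0.0] * 4
    def step(self, action): return [float(action)] * 4, 1.0, False

def worker(core_id, n_envs, state_q, action_q, rounds):
    envs = [DummyEnv() for _ in range(n_envs)]
    states = [env.reset() for env in envs]
    for _ in range(rounds):
        state_q.put((core_id, states))           # states sent for inference
        actions = action_q.get()                 # actions returned by the GPU side
        results = [env.step(a) for env, a in zip(envs, actions)]
        states = [s for s, _, _ in results]

if __name__ == "__main__":
    M, N, rounds = 4, 2, 3                       # 4 cores x 2 envs, 3 rounds
    state_q = mp.Queue()
    action_qs = [mp.Queue() for _ in range(M)]
    procs = [mp.Process(target=worker, args=(i, N, state_q, action_qs[i], rounds))
             for i in range(M)]
    for p in procs: p.start()
    for _ in range(rounds):
        batch = [state_q.get() for _ in range(M)]        # gather the M*N states
        for core_id, states in batch:                    # stand-in for GPU inference
            action_qs[core_id].put([0] * len(states))
    for p in procs: p.join()
```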
3. The amount of data in the experience pool is checked: if it exceeds a threshold batch_max, the CPU-GPU system enters the training mode; otherwise the inference process is performed. When step 3 is reached for the first time the experience pool obviously contains no data, so the inference process, i.e. step 4, naturally follows.
4. The inference step is performed; the specific data flow is shown in FIG. 4. The CPU samples the states of the M×N environments running in parallel, maintaining a queue that temporarily stores each state s. Once the last environment state has been extracted, a state set S of size M×N is obtained and sent to the GPU over the PCIE bus for inference-based action selection.
5. For the PPO algorithm implemented by the present invention, the GPU contains two networks, the local action network and the local evaluation network, as shown in FIG. 3. After they receive the current states sent from the CPU, they produce M×N actions a and value functions V, forming an action set A and a value-function set V of size M×N, which are sent back to the CPU via PCIE.
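Continuing the earlier PyTorch assumption, a minimal sketch of the batched inference in steps 4-5 might look as follows; drawing discrete actions from a Categorical distribution is an assumption, since the patent does not fix the action space.

```python
# Hypothetical sketch of steps 4-5: the M*N states are stacked into one batch,
# moved to the GPU in a single transfer, and the local networks return the
# action set A and value set V. Discrete actions are assumed for illustration.
import torch

@torch.no_grad()
def infer_batch(actor_local, critic_local, states, device):
    s = torch.as_tensor(states, dtype=torch.float32).to(device)  # one PCIE transfer
    logits = actor_local(s)                                      # policy pi(a|s)
    values = critic_local(s).squeeze(-1)                         # value function V(s)
    actions = torch.distributions.Categorical(logits=logits).sample()
    return actions.cpu().tolist(), values.cpu().tolist()         # sets A and V back to the CPU
```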
6. The action set A is returned to the corresponding M×N CPU environment simulators, which produce the return value R and the next state S′.
7. The tuple set <S, A, R> is stored in the experience pool, i.e. the <s, a, r> tuples corresponding to the M×N individual environments are stored.
8. If the number of training iterations exceeds the threshold t_max, training ends. Otherwise, if the amount of data is smaller than the threshold batch_max, the process returns to step 4 and sampling continues; if it exceeds batch_max, the process proceeds to step 9.
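A small sketch of steps 7-8 under the same assumptions; batch_max and t_max are the thresholds named in the text, while the container, the example values, and the return labels are illustrative.

```python
# Sketch of steps 7-8: append the <s, a, r> tuples of all M*N environments to
# the CPU-side experience pool, then decide whether to stop, train, or sample.
from collections import deque

experience_pool = deque()
batch_max, t_max = 1024, 10000          # example threshold values

def store_and_decide(states, actions, rewards, train_count):
    for s, a, r in zip(states, actions, rewards):
        experience_pool.append((s, a, r))
    if train_count > t_max:
        return "stop"                   # training finished
    return "train" if len(experience_pool) > batch_max else "sample"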
9. The training step is performed. As shown in FIG. 1 and FIG. 5, a batch of data is extracted from the experience information pool, and the corresponding loss function is computed on the CPU according to the Bellman equation. For the on-policy PPO algorithm used in particular by the present invention, the advantage function of each state in the batch is A(s) = r(s) + ρ·V(s′) − V(s), where r(s) is the return obtained by taking action a in the input state s, V(s) is the value output by the evaluation network for the input state s, and A(s) also serves as the loss of the evaluation network. Then, according to the settings in the CPU, after the importance sampling (IS) and Clip (or KL-divergence) method of the PPO algorithm, the expected action-network loss is obtained, i.e. the clipped surrogate objective L(θ) = E[min(r_t(θ)·A_t, clip(r_t(θ), 1−ε, 1+ε)·A_t)], where r_t(θ) is the importance-sampling ratio between the new and old policies.
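Under the same PyTorch assumption, a minimal sketch of the loss computation in step 9 follows; the tensor names and the choice of the Clip variant (rather than the KL-divergence variant) are assumptions for illustration, not the patent's reference implementation.

```python
# Hypothetical sketch of step 9: advantage A(s) = r(s) + rho*V(s') - V(s),
# importance-sampling ratio, and the clipped surrogate loss of the action
# network; the squared TD error serves as the evaluation-network loss.
import torch

def ppo_losses(new_logp, old_logp, values, rewards, next_values, rho=0.99, eps=0.2):
    td_target = (rewards + rho * next_values).detach()     # Bellman target
    advantages = (td_target - values).detach()             # A(s)
    ratio = torch.exp(new_logp - old_logp)                 # importance sampling (IS)
    clipped = torch.clamp(ratio, 1.0 - eps, 1.0 + eps)     # Clip with width eps
    actor_loss = -torch.min(ratio * advantages, clipped * advantages).mean()
    critic_loss = ((td_target - values) ** 2).mean()       # evaluation-network loss
    return actor_loss, critic_loss
```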
10. The parameters of the action network and the evaluation network are trained according to the obtained loss functions. This process consists of dense floating-point arithmetic; since the GPU is suited to large-scale parallel data processing while the CPU is suited to small-batch floating-point arithmetic, the training process executes faster on the CPU.
11. After the network parameters are trained, the new parameters are stored in the action network parameters θ and the evaluation network parameters ω on the CPU, and two operations are carried out: first, the updated parameters are sent to the GPU to update its copies; second, the experience pool is cleared completely or in part according to the settings. The training counter t is then incremented by 1, and the process proceeds to step 12 for judgment.
12. If the maximum training threshold t_max has been reached, training is complete; otherwise the process returns to step 3.
FIG. 6 shows the optimized implementation that further accelerates the DRL algorithm within the CPU-GPU framework. The upper half of the figure shows the forward inference process of a general DRL algorithm in this framework: assuming the initialization time is 3t, and that the time for the CPU environment simulators to collect M×N states and the time for the GPU to infer M×N actions are both 14t, two rounds of the state-action loop take 59t.
The existing M×N environment simulators are split into two halves of equal size, i.e. two groups of M×N/2 simulators. The first group collects states first and sends the resulting state set S1 to the GPU; while the local action network on the GPU performs action inference to produce A1, the second group simultaneously starts collecting states. When the GPU has finished, the action set A1 corresponding to the first group is sent back to the first group of simulators on the CPU, which apply the actions and collect the next round of states S1; at the same time the second group sends its collected state set S2 to the GPU, which infers A2 in parallel. The cycle then repeats.
Under load balance, i.e. when the total stacked running time of the CPU environment simulators equals the GPU inference time, the pipeline structure shown in the lower timing diagram of FIG. 6 can be adopted within the CPU-GPU framework, which requires only 31t. In steady operation, excluding initialization, this reaches twice the running speed of the conventional DRL procedure.
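The following sketch illustrates the interleaving only; simulate and infer are placeholder stand-ins for the CPU environment step and the GPU forward pass, and the 14t timing is taken from the example in the text, with t assumed to be 1 ms.

```python
# Sketch of the Fig. 6 pipeline under load balance: the M*N simulators are
# split into two halves, and while the GPU infers actions for one half the
# CPU simulates the other half. simulate()/infer() are placeholders.
from concurrent.futures import ThreadPoolExecutor
import time

def simulate(half_id, actions):
    time.sleep(0.014)                        # "14t": stacked simulation of M*N/2 envs
    return f"S{half_id}"

def infer(states):
    time.sleep(0.014)                        # "14t": batched GPU inference
    return f"A_for_{states}"

def pipelined(rounds=4):
    with ThreadPoolExecutor(max_workers=2) as pool:
        s = [simulate(0, None), None]        # half 0 collects its first states
        a = [None, None]
        for step in range(rounds):
            h, o = step % 2, 1 - step % 2    # GPU works on half h, CPU on half o
            infer_future = pool.submit(infer, s[h])
            sim_future = pool.submit(simulate, o, a[o])
            a[h], s[o] = infer_future.result(), sim_future.result()
    return a

print(pipelined())
```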
The present invention has been described specifically with reference to the PPO algorithm, which is currently the most widely used and highly effective; however, the above data-flow process and scheme can be implemented for any DRL algorithm within the CPU-GPU framework of the present invention. The above description is only a preferred embodiment of the invention and is not intended to limit it in any way. Those skilled in the art can, using the methods and techniques disclosed above, make numerous possible variations and modifications, or modify them into equivalent embodiments, without departing from the scope of the present invention. Therefore, any simple modification, equivalent change or modification made to the above embodiments according to the technical essence of the present invention remains within the scope of protection of the technical solution of the invention, unless it departs from the content of the technical solution.

Claims (4)

1. A deep reinforcement learning acceleration method for a multi-core processor and a single graphics processor, being a deep reinforcement learning accelerated training method, characterized by comprising the following steps:
1) the method runs on a multi-core CPU-GPU platform; using multiple CPU processes, for an M-core CPU each core generates N reinforcement-learning interactive environments env, producing M×N environment simulators;
2) the parameters ω, ω⁻, θ⁻, θ and the experience pool are assigned to fixed locations in memory; the CPU completes initialization, and the GPU can access them through the PCIE data-transfer bus;
3) in the inference stage, the advantage of massive parallel computation on the GPU is exploited and the inference-based selection of a is performed on the GPU; when the stacked environment simulation time of a single CPU core equals the neural network inference time, a pipeline structure in which environment sampling and action inference run in parallel can be realized: half of the CPU simulators run in parallel with GPU inference for the other half, and the next half-CPU-simulation/half-GPU-inference stage is prepared during data transfer, further accelerating reinforcement learning and bringing the overall training speed to nearly twice that of the conventional method;
4) in the training process, the strong small-batch floating-point capability of the CPU is exploited: a batch of data is extracted from the experience information pool, the advantage function and the back-propagation process are computed on the CPU, the network parameters are adjusted accordingly, and the local network parameters on the GPU are updated over the PCIE bus; the experience pool is then emptied completely or in part according to the settings;
5) the parameters θ and ω are updated according to the number of training iterations, and training ends once the system meets the requirements.
2. The method of claim 1, wherein a plurality of environment simulators are executed on the CPU and communicate with the GPU using PCIE.
3. The method of claim 1, wherein load balancing is achieved by stacking environment simulators within the CPU-GPU framework, yielding a twofold acceleration of deep reinforcement learning.
4. The method of claim 1, wherein parameters are allocated to fixed positions of a CPU memory and a GPU memory, initialization and reading and writing of experience pool data are performed in the CPU, and network parameter reading and writing are performed between the CPU and the GPU through PCIE.
CN202011476497.6A 2020-12-15 2020-12-15 Deep reinforcement learning acceleration method of multi-core processor-single graphics processor Active CN112732436B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202011476497.6A CN112732436B (en) 2020-12-15 2020-12-15 Deep reinforcement learning acceleration method of multi-core processor-single graphics processor

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202011476497.6A CN112732436B (en) 2020-12-15 2020-12-15 Deep reinforcement learning acceleration method of multi-core processor-single graphics processor

Publications (2)

Publication Number Publication Date
CN112732436A true CN112732436A (en) 2021-04-30
CN112732436B CN112732436B (en) 2022-04-22

Family

ID=75602111

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202011476497.6A Active CN112732436B (en) 2020-12-15 2020-12-15 Deep reinforcement learning acceleration method of multi-core processor-single graphics processor

Country Status (1)

Country Link
CN (1) CN112732436B (en)

Cited By (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN113448425A (en) * 2021-07-19 2021-09-28 哈尔滨工业大学 Dynamic parallel application program energy consumption runtime optimization method and system based on reinforcement learning
CN114707646A (en) * 2022-01-26 2022-07-05 电子科技大学 Distributed artificial intelligence practice platform based on remote reasoning
CN114862655A (en) * 2022-05-18 2022-08-05 北京百度网讯科技有限公司 Operation control method and device for model training and electronic equipment

Citations (9)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN104145281A (en) * 2012-02-03 2014-11-12 安秉益 Neural network computing apparatus and system, and method therefor
CN108537331A (en) * 2018-04-04 2018-09-14 清华大学 A kind of restructural convolutional neural networks accelerating circuit based on asynchronous logic
CN109559801A (en) * 2017-09-26 2019-04-02 西门子保健有限责任公司 The intelligent editing of processing result image
CN109783412A (en) * 2019-01-18 2019-05-21 电子科技大学 A kind of method that deeply study accelerates training
US20190392020A1 (en) * 2018-06-26 2019-12-26 Hcl Technologies Limited Reconfigurable convolution accelerator
US20200143206A1 (en) * 2018-11-05 2020-05-07 Royal Bank Of Canada System and method for deep reinforcement learning
CN111191728A (en) * 2019-12-31 2020-05-22 中国电子科技集团公司信息科学研究院 Deep reinforcement learning distributed training method and system based on asynchronization or synchronization
CN112070210A (en) * 2020-08-20 2020-12-11 成都恒创新星科技有限公司 Multi-parallel strategy convolution network accelerator based on FPGA
US20210278825A1 * 2018-08-23 2021-09-09 Siemens Aktiengesellschaft Real-Time Production Scheduling with Deep Reinforcement Learning and Monte Carlo Tree Search

Patent Citations (9)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN104145281A (en) * 2012-02-03 2014-11-12 安秉益 Neural network computing apparatus and system, and method therefor
CN109559801A (en) * 2017-09-26 2019-04-02 西门子保健有限责任公司 The intelligent editing of processing result image
CN108537331A (en) * 2018-04-04 2018-09-14 清华大学 A kind of restructural convolutional neural networks accelerating circuit based on asynchronous logic
US20190392020A1 (en) * 2018-06-26 2019-12-26 Hcl Technologies Limited Reconfigurable convolution accelerator
US20210278825A1 * 2018-08-23 2021-09-09 Siemens Aktiengesellschaft Real-Time Production Scheduling with Deep Reinforcement Learning and Monte Carlo Tree Search
US20200143206A1 (en) * 2018-11-05 2020-05-07 Royal Bank Of Canada System and method for deep reinforcement learning
CN109783412A (en) * 2019-01-18 2019-05-21 电子科技大学 A kind of method that deeply study accelerates training
CN111191728A (en) * 2019-12-31 2020-05-22 中国电子科技集团公司信息科学研究院 Deep reinforcement learning distributed training method and system based on asynchronization or synchronization
CN112070210A (en) * 2020-08-20 2020-12-11 成都恒创新星科技有限公司 Multi-parallel strategy convolution network accelerator based on FPGA

Non-Patent Citations (5)

* Cited by examiner, † Cited by third party
Title
F. Morin et al.: "A high throughput architecture for channel equalization based on a neural network using a wave pipeline method", Engineering Solutions for the Next Millennium. 1999 IEEE Canadian Conference on Electrical and Computer Engineering (Cat. No. 99TH8411) *
胡延步 et al.: "A pipeline technology optimization method based on a multi-FPGA heterogeneous platform", Journal of Integration Technology *
蹇强 et al.: "An FPGA implementation method for a configurable CNN co-accelerator", Acta Electronica Sinica *
闵秋应 et al.: "Design of an adaptive equalizer based on an improved BP neural network", Journal of Jiangxi Normal University (Natural Science Edition) *
陈朋 et al.: "An optimization method for an FPGA convolutional neural network accelerator based on improved dynamic configuration", Chinese High Technology Letters *

Cited By (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN113448425A (en) * 2021-07-19 2021-09-28 哈尔滨工业大学 Dynamic parallel application program energy consumption runtime optimization method and system based on reinforcement learning
CN113448425B (en) * 2021-07-19 2022-09-09 哈尔滨工业大学 Dynamic parallel application program energy consumption runtime optimization method and system based on reinforcement learning
CN114707646A (en) * 2022-01-26 2022-07-05 电子科技大学 Distributed artificial intelligence practice platform based on remote reasoning
CN114862655A (en) * 2022-05-18 2022-08-05 北京百度网讯科技有限公司 Operation control method and device for model training and electronic equipment
CN114862655B (en) * 2022-05-18 2023-03-10 北京百度网讯科技有限公司 Operation control method and device for model training and electronic equipment

Also Published As

Publication number Publication date
CN112732436B (en) 2022-04-22

Similar Documents

Publication Publication Date Title
CN112732436B (en) Deep reinforcement learning acceleration method of multi-core processor-single graphics processor
Pholdee et al. Hybridisation of real-code population-based incremental learning and differential evolution for multiobjective design of trusses
CN109325591A (en) Neural network processor towards Winograd convolution
CN111191728B (en) Deep reinforcement learning distributed training method and system based on asynchronization or synchronization
CN109840154A (en) A kind of computation migration method that task based access control relies under mobile cloud environment
CN110135584A (en) Extensive Symbolic Regression method and system based on self-adaptive parallel genetic algorithm
CN112214301B (en) Smart city-oriented dynamic calculation migration method and device based on user preference
Zhang et al. An effective use of hybrid metaheuristics algorithm for job shop scheduling problem
CN109657794B (en) Instruction queue-based distributed deep neural network performance modeling method
Ye et al. A new approach for resource scheduling with deep reinforcement learning
CN114647515A (en) GPU cluster-oriented dynamic resource scheduling method
CN118095103B (en) Water plant digital twin application enhancement method and device, storage medium and electronic equipment
CN111831354A (en) Data precision configuration method, device, chip array, equipment and medium
CN117436485A (en) Multi-exit point end-edge-cloud cooperative system and method based on trade-off time delay and precision
CN113534660A (en) Multi-agent system cooperative control method and system based on reinforcement learning algorithm
CN116500896B (en) Intelligent real-time scheduling model and method for intelligent network-connected automobile domain controller multi-virtual CPU tasks
CN113110101A (en) Production line mobile robot gathering type recovery and warehousing simulation method and system
CN116720703A (en) AGV multi-target task scheduling method and system based on deep reinforcement learning
Niu et al. Cloud resource scheduling method based on estimation of distribution shuffled frog leaping algorithm
Tan et al. A fast and stable forecasting model to forecast power load
CN114723058A (en) Neural network end cloud collaborative reasoning method and device for high-sampling-rate video stream analysis
Zhou et al. Decentralized adaptive optimal control for massive multi-agent systems using mean field game with self-organizing neural networks
CN111950691A (en) Reinforced learning strategy learning method based on potential action representation space
CN111612124A (en) Network structure adaptive optimization method for task-oriented intelligent scheduling
Teng et al. A New Frog Leaping Algorithm Based on Simulated Annealing and Immunization Algorithm for Low-power Mapping in Network-on-chip.

Legal Events

Date Code Title Description
PB01 Publication
PB01 Publication
SE01 Entry into force of request for substantive examination
SE01 Entry into force of request for substantive examination
GR01 Patent grant
GR01 Patent grant