CN112732436A - Deep reinforcement learning acceleration method of multi-core processor-single graphics processor - Google Patents

Deep reinforcement learning acceleration method of multi-core processor-single graphics processor

Info

Publication number
CN112732436A
CN112732436A (application CN202011476497.6A)
Authority
CN
China
Prior art keywords
cpu
gpu
reinforcement learning
environment
core
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Granted
Application number
CN202011476497.6A
Other languages
Chinese (zh)
Other versions
CN112732436B (en)
Inventor
阮爱武
朱重阳
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
University of Electronic Science and Technology of China
Original Assignee
University of Electronic Science and Technology of China
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by University of Electronic Science and Technology of China filed Critical University of Electronic Science and Technology of China
Priority to CN202011476497.6A priority Critical patent/CN112732436B/en
Publication of CN112732436A publication Critical patent/CN112732436A/en
Application granted granted Critical
Publication of CN112732436B publication Critical patent/CN112732436B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical

Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F9/00Arrangements for program control, e.g. control units
    • G06F9/06Arrangements for program control, e.g. control units using stored programs, i.e. using an internal store of processing equipment to receive or retain programs
    • G06F9/46Multiprogramming arrangements
    • G06F9/50Allocation of resources, e.g. of the central processing unit [CPU]
    • G06F9/5005Allocation of resources, e.g. of the central processing unit [CPU] to service a request
    • G06F9/5027Allocation of resources, e.g. of the central processing unit [CPU] to service a request the resource being a machine, e.g. CPUs, Servers, Terminals
    • G06F9/505Allocation of resources, e.g. of the central processing unit [CPU] to service a request the resource being a machine, e.g. CPUs, Servers, Terminals considering the load
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/02Neural networks
    • G06N3/04Architecture, e.g. interconnection topology
    • G06N3/045Combinations of networks
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/02Neural networks
    • G06N3/06Physical realisation, i.e. hardware implementation of neural networks, neurons or parts of neurons
    • G06N3/063Physical realisation, i.e. hardware implementation of neural networks, neurons or parts of neurons using electronic means
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/02Neural networks
    • G06N3/08Learning methods
    • G06N3/084Backpropagation, e.g. using gradient descent
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06TIMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T1/00General purpose image data processing
    • G06T1/20Processor architectures; Processor configuration, e.g. pipelining

Landscapes

  • Engineering & Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Theoretical Computer Science (AREA)
  • Software Systems (AREA)
  • General Physics & Mathematics (AREA)
  • Health & Medical Sciences (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Biomedical Technology (AREA)
  • Biophysics (AREA)
  • General Engineering & Computer Science (AREA)
  • General Health & Medical Sciences (AREA)
  • Molecular Biology (AREA)
  • Computing Systems (AREA)
  • Evolutionary Computation (AREA)
  • Data Mining & Analysis (AREA)
  • Mathematical Physics (AREA)
  • Computational Linguistics (AREA)
  • Artificial Intelligence (AREA)
  • Neurology (AREA)
  • Management, Administration, Business Operations System, And Electronic Commerce (AREA)

Abstract

The invention provides a deep reinforcement learning acceleration method for a multi-core processor and a single graphics processor. It establishes a deep reinforcement learning framework, in particular one based on the PPO algorithm, on a CPU + GPU platform, and also introduces a pipeline method that accelerates environment sampling and inference. A multi-environment simulation process runs on the multi-core CPU, with several environment processes stacked on each core, and the CPU also controls the data flow. The neural network inference process runs on the GPU. The CPU and the GPU each hold memory spaces for the action-network and evaluation-network parameters. During the interaction between the environments and the agent, the information from every step is stored in an experience pool in CPU memory, and at fixed intervals the CPU extracts a batch of data from this experience pool for training according to the truncation parameters and screening conditions of the algorithm. Under load balance, i.e. when the stacked environment simulation time of a single CPU core equals the neural network inference time, the invention realizes a pipeline structure in which environment sampling and action inference run in parallel: half of the CPU simulators run in parallel with GPU inference for the other half, and the next half-CPU-simulation/half-GPU-inference stage is prepared during data transfer. This further accelerates reinforcement learning, bringing the overall training speed to nearly twice that of the conventional method.

Description

Deep reinforcement learning acceleration method of multi-core processor-single graphics processor
Technical Field
The invention belongs to the field of computers, and particularly relates to a deep reinforcement learning acceleration method based on a multi-core Central Processing Unit (CPU)-single Graphics Processing Unit (GPU) platform.
Background
Deep Reinforcement Learning (DRL) is the product of combining deep learning with reinforcement learning: it unites the strong perceptual understanding of deep learning on problems such as vision with the decision-making ability of reinforcement learning, thereby realizing end-to-end learning. The emergence of deep reinforcement learning has brought reinforcement learning into practical use, enabling it to solve complex problems in real-world scenarios, and it is now widely applied in engineering fields such as industrial manufacturing, robot localization and recognition, and games.
Since the publication of (Mnih V, Kavukcuoglu K, Silver D, et al. Playing Atari with Deep Reinforcement Learning // Proceedings of the Workshops at the 26th Neural Information Processing Systems 2013, Lake Tahoe, USA, 2013: 201-220), a large number of algorithms have appeared in the field of deep reinforcement learning, including the DQN algorithm, which optimizes a Q-value target, and A3C and TRPO, which optimize a policy-gradient objective. In 2017, OpenAI published (Schulman, John & Wolski, Filip & Dhariwal, Prafulla & Radford, Alec & Klimov, Oleg. (2017). Proximal Policy Optimization Algorithms); the PPO algorithm optimizes the policy π in a proximal manner by simply imposing constraints on a surrogate objective function, which simplifies implementation and parameter tuning. Its performance is superior to most policy-gradient algorithms, making it one of the preferred algorithms in many DRL studies. As these algorithms have been proposed in succession, their complexity has gradually increased, requiring efficient computational support and framework support.
The PPO algorithm is mainly divided into an action network (Actor network) and an evaluation network (Critic network). The current state s of an environment is fed in; after the Critic network predicts the value function V, the advantage function A is obtained from the n-step discounted return values of partial Markov chains. The Actor network predicts a policy π, from which the action a for the current state s is selected and returned to the environment to obtain the next state s′, and this process is repeated. After T steps, the advantage function A generated under the old policy π is processed according to the clipping width or the KL divergence and used as the objective function to train the networks, which yields stable convergence.
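As an illustration of the n-step advantage computation described above, the following is a minimal Python sketch; the function and variable names (rewards, values, bootstrap_value, gamma) are assumptions made for illustration and do not come from the patent.

```python
# Minimal sketch (assumed names) of the n-step discounted-return advantage:
# A_t = r_t + g*r_{t+1} + ... + g^{n-1-t}*r_{n-1} + g^{n-t}*V(s_n) - V(s_t)
def n_step_advantages(rewards, values, bootstrap_value, gamma=0.99):
    advantages = []
    for t in range(len(rewards)):
        ret = bootstrap_value                 # V(s_n): closes the partial Markov chain
        for k in reversed(range(t, len(rewards))):
            ret = rewards[k] + gamma * ret    # fold the discounted return backwards
        advantages.append(ret - values[t])    # A_t = n-step return - V(s_t)
    return advantages

# Example: a 3-step segment of one environment
print(n_step_advantages([1.0, 0.0, 1.0], [0.5, 0.4, 0.6], bootstrap_value=0.3))
```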
Much of the deep reinforcement learning training in the prior art is built on a single-environment configuration of the CPU-GPU framework, in which the CPU is responsible for building the environment and controlling the data, while the GPU performs high-throughput parallel computation. Because only a single environment is configured, only one state can be predicted at a time, and the overhead of scheduling the GPU is sometimes longer than the parallel computation itself, so the acceleration effect on DRL is not obvious and may even result in a slowdown. How to improve training efficiency has therefore become the focus of many DRL studies.
Deep study is necessary to achieve efficient deep reinforcement learning: an effective hardware framework for deep reinforcement learning algorithms must be provided, and the data-flow process under that framework must be analyzed and optimized so as to increase the speed of deep reinforcement learning. The document (Liang et al., A Survey of Multi-Agent Deep Reinforcement Learning [J/OL], Acta Automatica Sinica, 2019) proposes a communication method and influence parameters for DRL multi-agent (Agent) settings, but it does not address how to realize them on hardware, and the acceleration of the same algorithm differs across hardware environments.
Disclosure of Invention
The aim of the invention is to provide, in view of the existing problems, a multi-environment training method for deep reinforcement learning under a specific hardware framework. It establishes a deep reinforcement learning framework, in particular one based on the PPO algorithm, on a CPU + GPU platform, and also introduces a pipeline method that accelerates environment sampling and inference.
The invention provides a training method of a deep reinforcement learning algorithm under a CPU + GPU platform, which comprises the following steps:
A multi-environment simulation process is realized on the multi-core CPU, with several environment processes stacked on each core, and the CPU controls the data. The neural network inference process is realized on the GPU. Under load balance, i.e. when the stacked environment simulation time of a single CPU core equals the neural network inference time, half of the CPU simulators run in parallel with GPU inference for the other half, and the next half-CPU-simulation/half-GPU-inference stage is prepared during data transfer. This further accelerates reinforcement learning, bringing the overall training speed to nearly twice that of the conventional method.
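As a sketch of the load-balancing condition, the number of environment processes stacked on each CPU core can be chosen so that the single-core simulation time matches the GPU inference time. The timing values below are purely illustrative assumptions, not measurements from the patent.

```python
# Hypothetical sketch: pick N, the number of environment processes stacked per
# CPU core, so that N * t_env_step (single-core simulation time) roughly equals
# t_gpu_infer (one batched inference pass on the GPU).
def envs_per_core(t_env_step: float, t_gpu_infer: float) -> int:
    return max(1, round(t_gpu_infer / t_env_step))

# Example with assumed timings: 2 ms per environment step, 14 ms per GPU pass
print(envs_per_core(0.002, 0.014))  # -> 7 environments per core
```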
The CPU and the GPU each hold memory spaces for the action network and the evaluation network parameters, and the information from every interaction step between the environments and the agent is stored in the experience pool in CPU memory.
At fixed intervals, the CPU extracts a batch of data from its experience pool according to the truncation parameters and screening conditions of the algorithm, computes the target TD (temporal-difference) value through the Bellman equation, and trains the network using the target TD value as the label, i.e. performs the back-propagation process; the experience pool is then cleared partially or completely according to the algorithm settings. The network memory space on the CPU and the network memory space on the GPU are then updated.
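A minimal sketch of the TD-target computation mentioned above, assuming the one-step Bellman form y = r + γ·V(s′); the names and the terminal-state handling are illustrative assumptions.

```python
# Sketch of the temporal-difference target used as the training label:
# y = r + gamma * V(s'), with V(s') dropped on terminal transitions.
def td_targets(rewards, next_values, dones, gamma=0.99):
    return [r + gamma * v * (0.0 if done else 1.0)
            for r, v, done in zip(rewards, next_values, dones)]

print(td_targets([1.0, 0.0], [0.5, 0.7], [False, True]))  # [1.495, 0.0]
```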
The neural network running on the GPU may take various forms, such as a Q-value-based DQN network, or a policy-based network such as A3C or PPO, and may be described in a high-level language such as C++ or Python. The network description method is not within the scope of the present invention.
The invention has the following characteristics:
1. The invention is based on a multi-core CPU-GPU framework, which is widely used and therefore universal.
2. Through load balance, when the stacked environment simulation time of a single CPU core equals the neural network inference time, the invention realizes a pipeline structure in which environment sampling and action inference run in parallel: half of the CPU simulators run in parallel with GPU inference for the other half, and the next half-CPU-simulation/half-GPU-inference stage is prepared during data transfer. This further accelerates reinforcement learning, bringing the overall training speed to nearly twice that of the conventional method.
3. The invention is universal. The framework of the invention can accelerate most DRL algorithms, because most DRL algorithms share the principles of the PPO algorithm used here in particular: all have an inference-training process and can reach load balance. The main differences between algorithms lie in the neural network construction, the computation of the objective function, and the maintenance and use of the experience pool; changing these factors does not affect the acceleration provided by the invention.
Drawings
FIG. 1 is a flow chart of an embodiment of the present invention;
FIG. 2 is a schematic diagram of the neural network running on the GPU;
FIG. 3 is a schematic diagram of the overall physical structure of the CPU-GPU framework;
FIG. 4 is a schematic diagram of the inference-process data flow under the CPU-GPU framework;
FIG. 5 is a schematic diagram of the training-process data flow under the CPU-GPU framework;
FIG. 6 is a schematic diagram of the pipeline capable of further acceleration under load balancing.
Detailed Description
The technical scheme in the embodiments of the invention is described clearly and completely below with reference to the accompanying drawings:
FIG. 1 shows an implementation process of the deep reinforcement learning acceleration method based on a multi-core CPU-GPU platform, which includes the following steps:
1. Memory spaces are allocated for the CPU and the GPU. Three memory spaces are arranged on the CPU: one stores the experience information pool and is used for network training; the other two store the action network parameters θ and the evaluation network parameters ω, respectively. Two memory spaces are allocated on the GPU, storing the local action network parameters θ⁻ and the local evaluation network parameters ω⁻, respectively. Besides each device controlling its own memory, the CPU and GPU memories communicate over the PCIE bus, including read and write operations. Random numbers generated by the CPU initialize θ and ω in memory, and the initialized parameters are then written to the network parameters on the GPU over the PCIE (peripheral component interconnect express) bus.
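A hypothetical PyTorch sketch of this step is given below; the patent does not name a software framework, and the layer sizes, the use of copy.deepcopy, and the .to(device) call standing in for the PCIE transfer are all assumptions made for illustration.

```python
# Hypothetical sketch of step 1 (PyTorch assumed): theta and omega live in CPU
# memory; the local copies theta- and omega- live in GPU memory. The .to() call
# stands in for the PCIE transfer; network shapes are illustrative only.
import copy
import torch
import torch.nn as nn

obs_dim, act_dim = 8, 4
actor = nn.Sequential(nn.Linear(obs_dim, 64), nn.Tanh(), nn.Linear(64, act_dim))  # theta (CPU)
critic = nn.Sequential(nn.Linear(obs_dim, 64), nn.Tanh(), nn.Linear(64, 1))       # omega (CPU)
experience_pool = []                                                               # CPU-side pool

device = "cuda" if torch.cuda.is_available() else "cpu"
actor_local = copy.deepcopy(actor).to(device)    # theta- (GPU copy)
critic_local = copy.deepcopy(critic).to(device)  # omega- (GPU copy)

# After each training round the updated CPU parameters are pushed back to the GPU copies
actor_local.load_state_dict(actor.state_dict())
critic_local.load_state_dict(critic.state_dict())
```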
2. For an M-core CPU, each core generates N reinforcement-learning interactive environments env, giving M×N environment simulators in total. The CPU serves as the control device and as the device for dense small-batch floating-point computation, while the GPU serves as the main computing device for DRL inference and runs the Actor network and the Critic network of the PPO algorithm. The GPU communicates with the CPU over the PCIE high-speed bus for data transfer and network-parameter updates.
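The sketch below illustrates, under assumptions, how the M×N environment simulators could be spawned with Python's multiprocessing module; DummyEnv is a placeholder for a real simulator, and the main process stands in for the control and inference side.

```python
# Hypothetical sketch of step 2: M worker processes (one per CPU core), each
# stacking N environment instances; states flow to the control process and
# actions flow back.
import multiprocessing as mp

class DummyEnv:
    def reset(self): return [0.0] * 4
    def step(self, action): return [float(action)] * 4, 1.0, False

def worker(core_id, n_envs, state_q, action_q, rounds):
    envs = [DummyEnv() for _ in range(n_envs)]
    states = [env.reset() for env in envs]
    for _ in range(rounds):
        state_q.put((core_id, states))           # states sent for inference
        actions = action_q.get()                 # actions returned by the GPU side
        results = [env.step(a) for env, a in zip(envs, actions)]
        states = [s for s, _, _ in results]

if __name__ == "__main__":
    M, N, rounds = 4, 2, 3                       # 4 cores x 2 envs, 3 rounds
    state_q = mp.Queue()
    action_qs = [mp.Queue() for _ in range(M)]
    procs = [mp.Process(target=worker, args=(i, N, state_q, action_qs[i], rounds))
             for i in range(M)]
    for p in procs: p.start()
    for _ in range(rounds):
        batch = [state_q.get() for _ in range(M)]        # gather the M*N states
        for core_id, states in batch:                    # stand-in for GPU inference
            action_qs[core_id].put([0] * len(states))
    for p in procs: p.join()
```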
3. The amount of data in the experience pool is checked: if it exceeds a threshold batch_max, the CPU-GPU system enters the training mode; otherwise the inference process is performed. When step 3 is reached for the first time the experience pool obviously contains no data, so the inference process, i.e. step 4, naturally follows.
4. The inference step is performed; the specific data flow is shown in FIG. 4. The CPU samples the states of the M×N environments running in parallel, maintaining a queue that temporarily stores each state s. Once the last environment state has been extracted, a state set S of size M×N is obtained and sent to the GPU over the PCIE bus for inference-based action selection.
5. For the PPO algorithm implemented by the present invention, the GPU contains two networks, the local action network and the local evaluation network, as shown in FIG. 3. After they receive the current states sent from the CPU, they produce M×N actions a and value functions V, forming an action set A and a value-function set V of size M×N, which are sent back to the CPU via PCIE.
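Continuing the earlier PyTorch assumption, a minimal sketch of the batched inference in steps 4-5 might look as follows; drawing discrete actions from a Categorical distribution is an assumption, since the patent does not fix the action space.

```python
# Hypothetical sketch of steps 4-5: the M*N states are stacked into one batch,
# moved to the GPU in a single transfer, and the local networks return the
# action set A and value set V. Discrete actions are assumed for illustration.
import torch

@torch.no_grad()
def infer_batch(actor_local, critic_local, states, device):
    s = torch.as_tensor(states, dtype=torch.float32).to(device)  # one PCIE transfer
    logits = actor_local(s)                                      # policy pi(a|s)
    values = critic_local(s).squeeze(-1)                         # value function V(s)
    actions = torch.distributions.Categorical(logits=logits).sample()
    return actions.cpu().tolist(), values.cpu().tolist()         # sets A and V back to the CPU
```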
6. The action set A is returned to the corresponding M×N CPU environment simulators, which produce the return value R and the next state S′.
7. The tuple set <S, A, R> is stored in the experience pool, i.e. the <s, a, r> tuples corresponding to the M×N individual environments are stored.
8. If the number of training iterations exceeds the threshold t_max, training ends. Otherwise, if the amount of data is smaller than the threshold batch_max, the process returns to step 4 and sampling continues; if it exceeds batch_max, the process proceeds to step 9.
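A small sketch of steps 7-8 under the same assumptions; batch_max and t_max are the thresholds named in the text, while the container, the example values, and the return labels are illustrative.

```python
# Sketch of steps 7-8: append the <s, a, r> tuples of all M*N environments to
# the CPU-side experience pool, then decide whether to stop, train, or sample.
from collections import deque

experience_pool = deque()
batch_max, t_max = 1024, 10000          # example threshold values

def store_and_decide(states, actions, rewards, train_count):
    for s, a, r in zip(states, actions, rewards):
        experience_pool.append((s, a, r))
    if train_count > t_max:
        return "stop"                   # training finished
    return "train" if len(experience_pool) > batch_max else "sample"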
9. The training step is performed. As shown in FIG. 1 and FIG. 5, a batch of data is extracted from the experience information pool, and the corresponding loss function is computed on the CPU according to the Bellman equation. For the on-policy PPO algorithm used in particular by the present invention, the advantage function of each state in the batch is A(s) = r(s) + ρ·V(s′) − V(s), where r(s) is the return obtained by taking action a in the input state s, V(s) is the value output by the evaluation network for the input state s, and A(s) also serves as the loss of the evaluation network. Then, according to the settings in the CPU, after the importance sampling (IS) and Clip (or KL-divergence) method of the PPO algorithm, the expected action-network loss is obtained, i.e. the clipped surrogate objective L(θ) = E[min(r_t(θ)·A_t, clip(r_t(θ), 1−ε, 1+ε)·A_t)], where r_t(θ) is the importance-sampling ratio between the new and old policies.
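Under the same PyTorch assumption, a minimal sketch of the loss computation in step 9 follows; the tensor names and the choice of the Clip variant (rather than the KL-divergence variant) are assumptions for illustration, not the patent's reference implementation.

```python
# Hypothetical sketch of step 9: advantage A(s) = r(s) + rho*V(s') - V(s),
# importance-sampling ratio, and the clipped surrogate loss of the action
# network; the squared TD error serves as the evaluation-network loss.
import torch

def ppo_losses(new_logp, old_logp, values, rewards, next_values, rho=0.99, eps=0.2):
    td_target = (rewards + rho * next_values).detach()     # Bellman target
    advantages = (td_target - values).detach()             # A(s)
    ratio = torch.exp(new_logp - old_logp)                 # importance sampling (IS)
    clipped = torch.clamp(ratio, 1.0 - eps, 1.0 + eps)     # Clip with width eps
    actor_loss = -torch.min(ratio * advantages, clipped * advantages).mean()
    critic_loss = ((td_target - values) ** 2).mean()       # evaluation-network loss
    return actor_loss, critic_loss
```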
10. The parameters of the action network and the evaluation network are trained according to the obtained loss functions. This process consists of dense floating-point arithmetic; since the GPU is suited to large-scale parallel data processing while the CPU is suited to small-batch floating-point arithmetic, the training process executes faster on the CPU.
11. After the network parameters are trained, the new parameters are stored in the action network parameters θ and the evaluation network parameters ω on the CPU, and two operations are carried out: first, the updated parameters are sent to the GPU to update its copies; second, the experience pool is cleared completely or in part according to the settings. The training counter t is then incremented by 1, and the process proceeds to step 12 for judgment.
12. If the maximum training threshold t_max has been reached, training is complete; otherwise the process returns to step 3.
FIG. 6 shows the optimized implementation that further accelerates the DRL algorithm within the CPU-GPU framework. The upper half of the figure shows the forward inference process of a general DRL algorithm in this framework: assuming the initialization time is 3t, and that the time for the CPU environment simulators to collect M×N states and the time for the GPU to infer M×N actions are both 14t, two rounds of the state-action loop take 59t.
The existing M×N environment simulators are split into two halves of equal size, i.e. two groups of M×N/2 simulators. The first group collects states first and sends the resulting state set S1 to the GPU; while the local action network on the GPU performs action inference to produce A1, the second group simultaneously starts collecting states. When the GPU has finished, the action set A1 corresponding to the first group is sent back to the first group of simulators on the CPU, which apply the actions and collect the next round of states S1; at the same time the second group sends its collected state set S2 to the GPU, which infers A2 in parallel. The cycle then repeats.
Under load balance, i.e. when the total stacked running time of the CPU environment simulators equals the GPU inference time, the pipeline structure shown in the lower timing diagram of FIG. 6 can be adopted within the CPU-GPU framework, which requires only 31t. In steady operation, excluding initialization, this reaches twice the running speed of the conventional DRL procedure.
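The following sketch illustrates the interleaving only; simulate and infer are placeholder stand-ins for the CPU environment step and the GPU forward pass, and the 14t timing is taken from the example in the text, with t assumed to be 1 ms.

```python
# Sketch of the Fig. 6 pipeline under load balance: the M*N simulators are
# split into two halves, and while the GPU infers actions for one half the
# CPU simulates the other half. simulate()/infer() are placeholders.
from concurrent.futures import ThreadPoolExecutor
import time

def simulate(half_id, actions):
    time.sleep(0.014)                        # "14t": stacked simulation of M*N/2 envs
    return f"S{half_id}"

def infer(states):
    time.sleep(0.014)                        # "14t": batched GPU inference
    return f"A_for_{states}"

def pipelined(rounds=4):
    with ThreadPoolExecutor(max_workers=2) as pool:
        s = [simulate(0, None), None]        # half 0 collects its first states
        a = [None, None]
        for step in range(rounds):
            h, o = step % 2, 1 - step % 2    # GPU works on half h, CPU on half o
            infer_future = pool.submit(infer, s[h])
            sim_future = pool.submit(simulate, o, a[o])
            a[h], s[o] = infer_future.result(), sim_future.result()
    return a

print(pipelined())
```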
The present invention has been described specifically with reference to the PPO algorithm, which is currently the most widely used and highly effective; however, the above data-flow process and scheme can be implemented for any DRL algorithm within the CPU-GPU framework of the present invention. The above description is only a preferred embodiment of the invention and is not intended to limit it in any way. Those skilled in the art can, using the methods and techniques disclosed above, make numerous possible variations and modifications, or modify them into equivalent embodiments, without departing from the scope of the present invention. Therefore, any simple modification, equivalent change or modification made to the above embodiments according to the technical essence of the present invention remains within the scope of protection of the technical solution of the invention, unless it departs from the content of the technical solution.

Claims (4)

1. A deep reinforcement learning acceleration method for a multi-core processor and a single graphics processor, being a deep reinforcement learning accelerated training method, characterized by comprising the following steps:
1) the method runs on a multi-core CPU-GPU platform; using multiple CPU processes, for an M-core CPU each core generates N reinforcement-learning interactive environments env, producing M×N environment simulators;
2) the parameters ω, ω⁻, θ⁻, θ and the experience pool are assigned to fixed locations in memory; the CPU completes initialization, and the GPU can access them through the PCIE data-transfer bus;
3) in the inference stage, the advantage of massive parallel computation on the GPU is exploited and the inference-based selection of a is performed on the GPU; when the stacked environment simulation time of a single CPU core equals the neural network inference time, a pipeline structure in which environment sampling and action inference run in parallel can be realized: half of the CPU simulators run in parallel with GPU inference for the other half, and the next half-CPU-simulation/half-GPU-inference stage is prepared during data transfer, further accelerating reinforcement learning and bringing the overall training speed to nearly twice that of the conventional method;
4) in the training process, the strong small-batch floating-point capability of the CPU is exploited: a batch of data is extracted from the experience information pool, the advantage function and the back-propagation process are computed on the CPU, the network parameters are adjusted accordingly, and the local network parameters on the GPU are updated over the PCIE bus; the experience pool is then emptied completely or in part according to the settings;
5) the parameters θ and ω are updated according to the number of training iterations, and training ends once the system meets the requirements.
2. The method of claim 1, wherein a plurality of environment simulators are executed on the CPU and communicate with the GPU using PCIE.
3. The method of claim 1, wherein load balancing is achieved by stacking environment simulators within the CPU-GPU framework, yielding a twofold acceleration of deep reinforcement learning.
4. The method of claim 1, wherein parameters are allocated to fixed positions of a CPU memory and a GPU memory, initialization and reading and writing of experience pool data are performed in the CPU, and network parameter reading and writing are performed between the CPU and the GPU through PCIE.
CN202011476497.6A 2020-12-15 2020-12-15 Deep reinforcement learning acceleration method of multi-core processor-single graphics processor Active CN112732436B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202011476497.6A CN112732436B (en) 2020-12-15 2020-12-15 Deep reinforcement learning acceleration method of multi-core processor-single graphics processor

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202011476497.6A CN112732436B (en) 2020-12-15 2020-12-15 Deep reinforcement learning acceleration method of multi-core processor-single graphics processor

Publications (2)

Publication Number Publication Date
CN112732436A true CN112732436A (en) 2021-04-30
CN112732436B CN112732436B (en) 2022-04-22

Family

ID=75602111

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202011476497.6A Active CN112732436B (en) 2020-12-15 2020-12-15 Deep reinforcement learning acceleration method of multi-core processor-single graphics processor

Country Status (1)

Country Link
CN (1) CN112732436B (en)

Cited By (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN113448425A (en) * 2021-07-19 2021-09-28 哈尔滨工业大学 Dynamic parallel application program energy consumption runtime optimization method and system based on reinforcement learning
CN114707646A (en) * 2022-01-26 2022-07-05 电子科技大学 Distributed artificial intelligence practice platform based on remote reasoning
CN114862655A (en) * 2022-05-18 2022-08-05 北京百度网讯科技有限公司 Operation control method and device for model training and electronic equipment

Citations (9)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN104145281A (en) * 2012-02-03 2014-11-12 安秉益 Neural network computing apparatus and system, and method therefor
CN108537331A (en) * 2018-04-04 2018-09-14 清华大学 A kind of restructural convolutional neural networks accelerating circuit based on asynchronous logic
CN109559801A (en) * 2017-09-26 2019-04-02 西门子保健有限责任公司 The intelligent editing of processing result image
CN109783412A (en) * 2019-01-18 2019-05-21 电子科技大学 A kind of method that deeply study accelerates training
US20190392020A1 (en) * 2018-06-26 2019-12-26 Hcl Technologies Limited Reconfigurable convolution accelerator
US20200143206A1 (en) * 2018-11-05 2020-05-07 Royal Bank Of Canada System and method for deep reinforcement learning
CN111191728A (en) * 2019-12-31 2020-05-22 中国电子科技集团公司信息科学研究院 Deep reinforcement learning distributed training method and system based on asynchronization or synchronization
CN112070210A (en) * 2020-08-20 2020-12-11 成都恒创新星科技有限公司 Multi-parallel strategy convolution network accelerator based on FPGA
US20210278825A1 * 2018-08-23 2021-09-09 Siemens Aktiengesellschaft Real-Time Production Scheduling with Deep Reinforcement Learning and Monte Carlo Tree Search

Patent Citations (9)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN104145281A (en) * 2012-02-03 2014-11-12 安秉益 Neural network computing apparatus and system, and method therefor
CN109559801A (en) * 2017-09-26 2019-04-02 西门子保健有限责任公司 The intelligent editing of processing result image
CN108537331A (en) * 2018-04-04 2018-09-14 清华大学 A kind of restructural convolutional neural networks accelerating circuit based on asynchronous logic
US20190392020A1 (en) * 2018-06-26 2019-12-26 Hcl Technologies Limited Reconfigurable convolution accelerator
US20210278825A1 * 2018-08-23 2021-09-09 Siemens Aktiengesellschaft Real-Time Production Scheduling with Deep Reinforcement Learning and Monte Carlo Tree Search
US20200143206A1 (en) * 2018-11-05 2020-05-07 Royal Bank Of Canada System and method for deep reinforcement learning
CN109783412A (en) * 2019-01-18 2019-05-21 电子科技大学 A kind of method that deeply study accelerates training
CN111191728A (en) * 2019-12-31 2020-05-22 中国电子科技集团公司信息科学研究院 Deep reinforcement learning distributed training method and system based on asynchronization or synchronization
CN112070210A (en) * 2020-08-20 2020-12-11 成都恒创新星科技有限公司 Multi-parallel strategy convolution network accelerator based on FPGA

Non-Patent Citations (5)

* Cited by examiner, † Cited by third party
Title
F. Morin et al.: "A high throughput architecture for channel equalization based on a neural network using a wave pipeline method", Engineering Solutions for the Next Millennium. 1999 IEEE Canadian Conference on Electrical and Computer Engineering (Cat. No. 99TH8411) *
胡延步 et al.: "A pipeline technology optimization method based on a multi-FPGA heterogeneous platform", Journal of Integration Technology *
蹇强 et al.: "An FPGA implementation method for a configurable CNN co-accelerator", Acta Electronica Sinica *
闵秋应 et al.: "Design of an adaptive equalizer based on an improved BP neural network", Journal of Jiangxi Normal University (Natural Science Edition) *
陈朋 et al.: "An optimization method for an FPGA convolutional neural network accelerator based on improved dynamic configuration", Chinese High Technology Letters *

Cited By (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN113448425A (en) * 2021-07-19 2021-09-28 哈尔滨工业大学 Dynamic parallel application program energy consumption runtime optimization method and system based on reinforcement learning
CN113448425B (en) * 2021-07-19 2022-09-09 哈尔滨工业大学 Dynamic parallel application program energy consumption runtime optimization method and system based on reinforcement learning
CN114707646A (en) * 2022-01-26 2022-07-05 电子科技大学 Distributed artificial intelligence practice platform based on remote reasoning
CN114862655A (en) * 2022-05-18 2022-08-05 北京百度网讯科技有限公司 Operation control method and device for model training and electronic equipment
CN114862655B (en) * 2022-05-18 2023-03-10 北京百度网讯科技有限公司 Operation control method and device for model training and electronic equipment

Also Published As

Publication number Publication date
CN112732436B (en) 2022-04-22

Similar Documents

Publication Publication Date Title
CN112732436B (en) Deep reinforcement learning acceleration method of multi-core processor-single graphics processor
Pholdee et al. Hybridisation of real-code population-based incremental learning and differential evolution for multiobjective design of trusses
CN109325591A (en) Neural network processor towards Winograd convolution
CN111191728B (en) Deep reinforcement learning distributed training method and system based on asynchronization or synchronization
CN109840154A (en) A kind of computation migration method that task based access control relies under mobile cloud environment
CN110135584A (en) Extensive Symbolic Regression method and system based on self-adaptive parallel genetic algorithm
CN112214301B (en) Smart city-oriented dynamic calculation migration method and device based on user preference
Zhang et al. An effective use of hybrid metaheuristics algorithm for job shop scheduling problem
CN109657794B (en) Instruction queue-based distributed deep neural network performance modeling method
Ye et al. A new approach for resource scheduling with deep reinforcement learning
CN114647515A (en) GPU cluster-oriented dynamic resource scheduling method
CN118095103B (en) Water plant digital twin application enhancement method and device, storage medium and electronic equipment
CN111831354A (en) Data precision configuration method, device, chip array, equipment and medium
CN117436485A (en) Multi-exit point end-edge-cloud cooperative system and method based on trade-off time delay and precision
CN113534660A (en) Multi-agent system cooperative control method and system based on reinforcement learning algorithm
CN116500896B (en) Intelligent real-time scheduling model and method for intelligent network-connected automobile domain controller multi-virtual CPU tasks
CN113110101A (en) Production line mobile robot gathering type recovery and warehousing simulation method and system
CN116720703A (en) AGV multi-target task scheduling method and system based on deep reinforcement learning
Niu et al. Cloud resource scheduling method based on estimation of distribution shuffled frog leaping algorithm
Tan et al. A fast and stable forecasting model to forecast power load
CN114723058A (en) Neural network end cloud collaborative reasoning method and device for high-sampling-rate video stream analysis
Zhou et al. Decentralized adaptive optimal control for massive multi-agent systems using mean field game with self-organizing neural networks
CN111950691A (en) Reinforced learning strategy learning method based on potential action representation space
CN111612124A (en) Network structure adaptive optimization method for task-oriented intelligent scheduling
Teng et al. A New Frog Leaping Algorithm Based on Simulated Annealing and Immunization Algorithm for Low-power Mapping in Network-on-chip.

Legal Events

Date Code Title Description
PB01 Publication
PB01 Publication
SE01 Entry into force of request for substantive examination
SE01 Entry into force of request for substantive examination
GR01 Patent grant
GR01 Patent grant