CN108108822B - Different strategy deep reinforcement learning method for parallel training - Google Patents

Different strategy deep reinforcement learning method for parallel training

Info

Publication number
CN108108822B
Authority
CN
China
Prior art keywords
experience information
experience
storage unit
environment
interaction
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN201810040895.XA
Other languages
Chinese (zh)
Other versions
CN108108822A (en)
Inventor
陈志波
张直政
陈嘉乐
石隽
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
University of Science and Technology of China USTC
Original Assignee
University of Science and Technology of China USTC
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by University of Science and Technology of China USTC filed Critical University of Science and Technology of China USTC
Priority to CN201810040895.XA priority Critical patent/CN108108822B/en
Publication of CN108108822A publication Critical patent/CN108108822A/en
Application granted granted Critical
Publication of CN108108822B publication Critical patent/CN108108822B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical

Classifications

    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06N: COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N 20/00: Machine learning

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Software Systems (AREA)
  • Data Mining & Analysis (AREA)
  • Evolutionary Computation (AREA)
  • Medical Informatics (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Physics & Mathematics (AREA)
  • Computing Systems (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Mathematical Physics (AREA)
  • Artificial Intelligence (AREA)
  • Management, Administration, Business Operations System, And Electronic Commerce (AREA)

Abstract

The invention discloses a parallel-training off-policy (different-strategy) deep reinforcement learning method, which comprises the following steps: creating an environment process pool and, with the agent to be trained by deep reinforcement learning acting as the main process, selecting two or more environment processes to start interacting simultaneously; during the interaction, each environment process stores the experience information generated at every interaction step in the experience information cache unit corresponding to that environment process, and every time n pieces of single-step interaction information have been added to an environment process's cache unit, the main process samples from the experience storage units according to an experience sample sampling algorithm and updates the corresponding parameters of the agent; when an episode ends, the environment process screens the experience information generated during that episode according to an experience sample screening-and-storing algorithm and either stores it in the corresponding experience storage unit or deletes it directly. The method improves the sample efficiency of the reinforcement learning algorithm, shortens training time, and thereby improves the learning efficiency and stability of the reinforcement learning algorithm.

Description

Different strategy deep reinforcement learning method for parallel training
Technical Field
The invention relates to the technical field of artificial intelligence and machine learning, and in particular to a parallel-training off-policy (different-strategy) deep reinforcement learning method.
Background
Reinforcement learning is an important machine learning method with many applications in intelligent control, robotics, human-machine gaming, clinical medicine, analysis and prediction, and other fields. Reinforcement learning stands apart from the supervised and unsupervised learning of traditional machine learning: the agent obtains experience from its interaction with the environment and thereby learns a policy mapping environment states to actions. In reinforcement learning, the agent receives state information from the environment and, based on the learned policy, produces an action that acts on the environment; the environment state changes after receiving the action and at the same time produces a return value (reward or punishment); the changed state and the reward/punishment signal are sent back to the agent, which updates its policy according to the received information and selects the next decision (i.e., the next action) according to that policy. The learning goal of a reinforcement learning system is to dynamically adjust the agent's parameters during interaction with the environment so as to update the policy being learned, such that the positive signal fed back by the environment is maximized.
Deep reinforcement learning is a newer form of reinforcement learning in which the agent is built from a deep learning model. In deep reinforcement learning the agent needs more training samples, and therefore a large amount of interaction between the agent and the environment is required to generate training data. In real scientific research and industrial production, however, the design of the environment and of the feedback signal (the return value) is very complicated; computing the state and return values after the environment changes in response to the agent's action involves a large amount of computation and takes a long time, which places extremely high demands on the training procedure and the sample efficiency of agent learning.
Therefore, it is necessary to study how to improve the agent's sample efficiency in deep reinforcement learning and shorten its training time, thereby improving the agent's learning ability and allowing it to deliver practical value in application scenarios more quickly and effectively.
Disclosure of Invention
The invention aims to provide a parallel-training off-policy deep reinforcement learning method, which improves the sample efficiency of the reinforcement learning algorithm, shortens training time, and thereby improves learning efficiency and stability.
The purpose of the invention is realized by the following technical scheme:
a parallel-training different-strategy deep reinforcement learning method comprises the following steps:
creating an environment process pool, wherein the environment process pool comprises a plurality of environment processes of the same type;
an agent to be trained by deep reinforcement learning serves as the main process, sends an interaction request to the environment process pool, and selects two or more environment processes to start interacting simultaneously according to the information returned by the environment processes;
during the interaction between the main process and the current environment process, the environment process stores the experience information generated at each interaction step in the experience information cache unit corresponding to the current environment process; each time n pieces of single-step interaction information have been added to the cache unit of an environment process, the main process samples from the experience storage units based on an experience sample sampling algorithm and updates the corresponding parameters of the agent;
and when an episode of the interaction between the main process and the current environment process ends, the environment process screens the experience information generated during that episode based on an experience sample screening-and-storing algorithm and either stores it in the corresponding experience storage unit or deletes it directly.
According to the technical solution provided by the invention, the experience sample screening-and-storing algorithm and the experience sample sampling algorithm are optimization algorithms built on top of an asynchronous parallel training framework; they can be selected and adjusted according to the specific application requirements and scenarios, so that the sample efficiency of the off-policy reinforcement learning algorithm is improved, training time is shortened, and learning efficiency and stability are further improved.
Drawings
In order to more clearly illustrate the technical solutions of the embodiments of the present invention, the drawings needed to be used in the description of the embodiments are briefly introduced below, and it is obvious that the drawings in the following description are only some embodiments of the present invention, and it is obvious for those skilled in the art to obtain other drawings based on the drawings without creative efforts.
Fig. 1 is a flowchart of a conventional off-policy deep reinforcement learning method according to an embodiment of the present invention;
Fig. 2 is a flowchart of the parallel-training off-policy deep reinforcement learning method according to an embodiment of the present invention;
Fig. 3 is a schematic structural diagram of a neural network according to an embodiment of the present invention.
Detailed Description
The technical solutions in the embodiments of the present invention are clearly and completely described below with reference to the drawings in the embodiments of the present invention, and it is obvious that the described embodiments are only a part of the embodiments of the present invention, and not all embodiments. All other embodiments, which can be derived by a person skilled in the art from the embodiments of the present invention without making any creative effort, shall fall within the protection scope of the present invention.
The embodiment of the invention provides a parallel-training off-policy deep reinforcement learning method. For off-policy deep reinforcement learning algorithms, it uses multiple processes so that the agent interacts with several environments at the same time, and the agent is trained asynchronously on the experience generated by interacting with these environments. The method mainly comprises the following steps:
1. Create an environment process pool, wherein the environment process pool comprises a plurality of environment processes of the same type.
In the embodiment of the invention, the same or different parameter settings can be adopted among the environment processes.
2. The agent to be trained by deep reinforcement learning acts as the main process, sends an interaction request to the environment process pool, and selects two or more environment processes to start interacting simultaneously according to the information returned by each environment process.
In this step, the agent, acting as the main process, sends an interaction request to the environment process pool. If the pool contains an environment process that is not currently interacting, the identifier of that environment process is returned to the main process; the main process sets the environment identifier of that process from the idle state to the occupied state and interacts with it. If no environment process in the pool is free, an environment identifier in the occupied state is returned to the main process as a wait signal, and the main process decides, based on the wait signal, either to wait or to keep sending interaction requests to the pool.
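A minimal Python sketch of this request/handshake logic is given below. It assumes a simple in-memory pool object standing in for the real multi-process pool, and all class, function, and variable names (EnvProcessPool, acquire_environment, IDLE, OCCUPIED) are illustrative rather than taken from the patent.

```python
import random
import time

IDLE, OCCUPIED = 0, 1

class EnvProcessPool:
    """Tracks which environment processes are free; a stand-in for the real pool."""
    def __init__(self, num_envs):
        # one status flag per environment process, keyed by its identifier
        self.status = {env_id: IDLE for env_id in range(num_envs)}

    def request(self):
        # return the identifier of an idle environment process,
        # or None as the "wait" signal when every process is occupied
        idle = [env_id for env_id, s in self.status.items() if s == IDLE]
        return random.choice(idle) if idle else None

def acquire_environment(pool, retry_interval=0.1):
    # the main process (agent) keeps requesting until an idle environment is granted
    while True:
        env_id = pool.request()
        if env_id is not None:
            pool.status[env_id] = OCCUPIED   # mark idle -> occupied before interacting
            return env_id
        time.sleep(retry_interval)           # wait signal received: retry later
```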
3. During the interaction between the main process and the current environment process, the environment process stores the experience information generated at each interaction step in the experience information cache unit corresponding to the current environment process. Each time n pieces of single-step interaction information have been added to the cache unit of an environment process, the main process samples from the experience storage units based on an experience sample sampling algorithm and updates the corresponding parameters of the agent.
In the embodiment of the present invention, the experience information mainly includes: state values, action values, return values, an interaction-termination flag, and the cumulative return value (optional).
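For illustration only, a single-step experience record could be represented as follows. The field names are assumptions, and the next_state field is added here because off-policy updates commonly need it, even though it is not listed explicitly above.

```python
from dataclasses import dataclass
from typing import Optional

import numpy as np

@dataclass
class Experience:
    state: np.ndarray         # state value received from the environment
    action: np.ndarray        # action value produced by the agent
    reward: float             # single-step return value
    done: bool                # interaction-termination flag
    next_state: Optional[np.ndarray] = None  # assumed field, not listed in the patent text
    episode_return: Optional[float] = None   # cumulative return of the episode (optional)
```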
4. When an episode of the interaction between the main process and the current environment process ends, the environment process screens the experience information generated during that episode based on an experience sample screening-and-storing algorithm and either stores it in the corresponding experience storage unit or deletes it directly.
In the embodiment of the invention, when an episode of the interaction between the main process and the current environment process ends, the main process releases the current environment process and resets its state to idle; the reset environment process resets its environment and waits for the next interaction to start.
In the embodiment of the invention, the experience storage units mainly comprise a common experience information storage unit and a high-return experience information storage unit. Both units have a fixed length and store experience information in a FIFO (first-in, first-out) manner; the length is the maximum number of single-step experience records a unit can hold. The length of the common experience information storage unit is denoted L_O, and the number of single-step experience records currently stored in it is denoted N_O; the length of the high-return experience information storage unit is denoted L_H, and the number of single-step experience records currently stored in it is denoted N_H.
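A minimal sketch of such a fixed-length FIFO storage unit is shown below, assuming a simple deque-backed implementation; the class name is illustrative, and the example lengths are taken from the worked example later in the description.

```python
from collections import deque

class ExperienceStore:
    """Fixed-length FIFO storage unit; the oldest single-step record is dropped first."""
    def __init__(self, length):
        self.buffer = deque(maxlen=length)

    def extend(self, experiences):
        # store the single-step experience records of one episode
        self.buffer.extend(experiences)

    def __len__(self):
        # N_O / N_H: number of single-step records currently stored
        return len(self.buffer)

ordinary_store = ExperienceStore(length=10**6)      # L_O, value from the worked example below
high_return_store = ExperienceStore(length=10**4)   # L_H, value from the worked example below
```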
In the embodiment of the present invention, the experience sample screening-and-storing algorithm screens and stores the experience information held in an environment process's experience information cache unit after a complete episode of interaction between the main process corresponding to the agent and that environment process has finished. The main procedure is as follows:
The experience information generated by the current episode and held in the experience information cache unit is stored into the common experience information storage unit, and N_O is updated; the updated N_O is then compared with a first threshold N_limit.
In the embodiment of the invention, interaction proceeds step by step, and one episode comprises multiple interaction steps. The experience information cache unit and the experience information storage units serve different purposes: while an episode is still running, the experience information of each completed step is kept in the cache unit; after the episode has finished, the experience information of all its steps is stored into the experience storage units.
If N_O is less than the first threshold N_limit, the storing operation ends;
if N is presentOGreater than a first threshold value NlimitThen respectively calculate the latest stored NnewP-th of round accumulated return value corresponding to experience information of each round1Value R of percentilehighAnd p is2Value R of percentilelowWherein p is1>p2(ii) a Recording the round accumulated return value of the experience information currently stored in the common experience information storage unit as R, wherein R is more than RhighCopying the experience information corresponding to the round to a high-return experience information storage unit for storage; when R islow≤R≤RhighThen, the experience information corresponding to the turn is changed into (R-R) with probability phigh)/(Rhigh-Rlow) Copying the experience information to a high-return experience information storage unit for storage; when R is less than or equal to RlowWhen the operation is finished, the storage operation is finished.
In the embodiment of the invention, each interaction step within an episode produces a return value; the episode-accumulated return value is the sum of the return values of all the steps of that episode, i.e., the accumulation is performed step by step.
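The screening-and-storing procedure described above could be sketched as follows. This is an illustrative reading of the patent text with hypothetical names: it keeps a separate deque of the episode-accumulated returns of the N_new most recent episodes, assumes the store objects follow the ExperienceStore sketch above, and writes the copy probability as the reconstructed p = (R - R_low)/(R_high - R_low).

```python
import random
from collections import deque

import numpy as np

# episode-accumulated returns of the N_new most recently stored episodes (N_new = 100 assumed)
recent_episode_returns = deque(maxlen=100)

def screen_and_store(episode_buffer, ordinary_store, high_return_store,
                     n_limit, p1=90, p2=10):
    # R: episode-accumulated return, i.e. the sum of the single-step returns of this episode
    R = sum(e.reward for e in episode_buffer)
    ordinary_store.extend(episode_buffer)        # always store into the common unit
    recent_episode_returns.append(R)
    if len(ordinary_store) < n_limit:            # N_O below the first threshold: done
        return
    r_high = float(np.percentile(recent_episode_returns, p1))  # p_1-th percentile R_high
    r_low = float(np.percentile(recent_episode_returns, p2))   # p_2-th percentile R_low
    if R > r_high:
        high_return_store.extend(episode_buffer)               # always keep high-return episodes
    elif R >= r_low:
        # reconstructed copy probability p = (R - R_low) / (R_high - R_low)
        p = (R - r_low) / max(r_high - r_low, 1e-8)
        if random.random() < p:
            high_return_store.extend(episode_buffer)
    # R < R_low: the episode is not copied to the high-return unit
```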
The experience sample sampling algorithm is used in the main process corresponding to the agent to sample from the storage units and update the agent's parameters. The main procedure is as follows:
when the main process prepares to update the parameters of the intelligent agent, the number N of the experience information already stored in the common experience information storage unit is detectedOAnd compares it with a second threshold value Nl(typically set to an integer multiple of Batch Size): if N is presentO<NlIf yes, the sampling and parameter updating are abandoned; if N is presentO≥NlDetecting the number of experience information already stored in the high-return experience information storage unit, and if the number of experience samples already stored is lower than a third threshold value Nl1Sampling from a common experience sample storage unit, and if the sampling is not lower than a fourth threshold value Nl2Then, one of the following two ways is selected to complete the sampling:
sampling from the high-return experience information storage unit with the probability P and performing parameter updating by using the obtained samples, and sampling from the common experience information storage unit with the probability (1-P) and performing parameter updating by using the obtained samples;
and sampling P samples in the sample set sampled every time from the high-return experience information storage unit, and sampling (1-P) samples from the common experience information storage unit.
For convenience of explanation, the following description will be given with reference to specific examples.
The conventional off-policy deep reinforcement learning algorithm shown in Fig. 1 uses a serial, synchronous training mode with a single agent and a single environment. In contrast, the scheme provided by the embodiment of the invention, shown in Fig. 2, has a single agent interact with multiple environments of the same type in parallel and optimizes the storage and sampling of experience information accordingly. The main points are as follows:
the intelligent agent in the embodiment of the invention is based on a classic strategy gradient algorithm Deep Deterministic policy gradient, and mainly comprises a strategy network and a value network shown in figure 3. Setting general experience informationLength L of memory cellO=106(ii) a Length L of high return experience information storage unitH=104(ii) a The Baych Size adopted when updating the neural network in the intelligent agent is 128; threshold value N of number of stored experience information required for sampling from common experience information storage unitlimit64 × 128; threshold N of the number of stored experience information required for sampling from the high-reward experience information storage unitl=32*128。
First, a process pool containing a plurality of environment processes of the same type is created according to the method described above, where the environment processes may use the same or different parameter settings (for example, different difficulty levels in a game environment). The main process corresponding to the agent then interacts with the different environment processes at the same time; each time a step of interaction is completed, the agent checks, according to the experience sample sampling algorithm described above, whether the number of experience records stored in each storage unit meets the requirements and decides how to update its neural networks. More specifically, in each back-propagation of the neural networks, sampling is done from the high-return experience information storage unit with probability 0.1 and from the common experience information storage unit with probability 0.9. Meanwhile, the experience information obtained from each single-step interaction of an environment process is stored in the experience information cache unit corresponding to that process. After a complete episode of interaction is finished, the environment process stores the experience information generated by that episode according to the experience information screening-and-storing algorithm of the technical solution: the episode-accumulated return values of the 100 most recently stored episodes in the common experience information storage unit (i.e., N_new = 100) are taken, their 90th percentile (i.e., p_1 = 90) is used as R_high and their 10th percentile (i.e., p_2 = 10) as R_low; these are compared with the episode-accumulated return value R of the experience information just stored in the common experience information storage unit, and the corresponding storage strategy is selected.
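Collected for reference, the concrete settings of this worked example could be written as a single configuration; the key names are illustrative.

```python
# Worked-example settings gathered in one place; key names are illustrative.
example_config = {
    "ordinary_store_length": 10**6,     # L_O
    "high_return_store_length": 10**4,  # L_H
    "batch_size": 128,
    "n_limit_common": 64 * 128,   # records required before sampling from the common unit
    "n_l_high": 32 * 128,         # records required before sampling from the high-return unit
    "p_high": 0.1,                # probability of sampling a batch from the high-return unit
    "n_new": 100,                 # episodes used for the percentile estimate
    "p1": 90,                     # percentile defining R_high
    "p2": 10,                     # percentile defining R_low
}
```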
It should be noted that the specific parameter values mentioned in the above example are only illustrative and not limiting; in practical applications, users can set the parameter values according to actual needs or experience.
Use of the invention is detectable to a certain extent; a specific detection scheme is as follows:
firstly, detecting the process number of the related programs. If a main process and a plurality of environment processes interacting with the main process exist in the related program, the technical scheme related to the invention patent is probably used.
Second, inspect the read/write behaviour of the program's processes and storage units. If the program contains several storage units with a high read/write frequency, among which two have a read frequency clearly higher than their write frequency, the technical solution of this patent is probably in use.
Through the above description of the embodiments, it is clear to those skilled in the art that the above embodiments can be implemented by software, and can also be implemented by software plus a necessary general hardware platform. With this understanding, the technical solutions of the embodiments can be embodied in the form of a software product, which can be stored in a non-volatile storage medium (which can be a CD-ROM, a usb disk, a removable hard disk, etc.), and includes several instructions for enabling a computer device (which can be a personal computer, a server, or a network device, etc.) to execute the methods according to the embodiments of the present invention.
The above description is only for the preferred embodiment of the present invention, but the scope of the present invention is not limited thereto, and any changes or substitutions that can be easily conceived by those skilled in the art within the technical scope of the present invention are included in the scope of the present invention. Therefore, the protection scope of the present invention shall be subject to the protection scope of the claims.

Claims (6)

1. A parallel-training different-strategy deep reinforcement learning method is characterized by comprising the following steps:
creating an environment process pool, wherein the environment process pool comprises a plurality of environment processes of the same type;
an agent to be trained by deep reinforcement learning serves as the main process, sends an interaction request to the environment process pool, and selects two or more environment processes to start interacting simultaneously according to the information returned by the environment processes;
during the interaction between the main process and the current environment process, the environment process stores the experience information generated at each interaction step in the experience information cache unit corresponding to the current environment process; each time n pieces of single-step interaction information have been added to the cache unit of an environment process, the main process samples from the experience storage units based on an experience sample sampling algorithm and updates the corresponding parameters of the agent;
when an episode of the interaction between the main process and the current environment process ends, the environment process screens the experience information generated during that episode based on an experience sample screening-and-storing algorithm and either stores it in the corresponding experience storage unit or deletes it directly;
the method for filtering and storing the experience information generated by the interaction of the round based on the experience sample filtering and storing algorithm by the environment process to the corresponding experience storage unit comprises the following steps:
the experience storage unit comprises a common experience information storage unit and a high return experience information storage unit; the length of the common experience information storage unit is recorded as LOThe number of experience information corresponding to the single-step interaction stored in the common experience information storage unit is recorded as NO(ii) a The length of the high-return experience information storage unit is recorded as LHThe number of experience information corresponding to the single-step interaction stored in the high-reward experience information storage unit is recorded as NH
storing the information currently held in the experience information cache unit into the common experience information storage unit, and updating N_O; comparing the updated N_O with a first threshold N_limit;
if N_O is less than the first threshold N_limit, ending the storing operation;
if N_O is greater than the first threshold N_limit, computing the p_1-th percentile value R_high and the p_2-th percentile value R_low (with p_1 > p_2) of the episode-accumulated return values of the N_new most recently stored episodes of experience information; denoting the episode-accumulated return value of the experience information currently stored in the common experience information storage unit by R: when R > R_high, copying the experience information of that episode into the high-return experience information storage unit; when R_low ≤ R ≤ R_high, copying the experience information of that episode into the high-return experience information storage unit with probability p = (R - R_low)/(R_high - R_low); when R ≤ R_low, ending the storing operation.
2. The parallel-training different-strategy deep reinforcement learning method according to claim 1, wherein the agent, acting as the main process, sends an interaction request to the environment process pool; if the pool contains an environment process that is not in an interacting state, the identifier of the corresponding environment process is returned to the main process, and the main process sets the environment identifier of that process from the idle state to the occupied state and interacts with it;
and if no environment process in the pool is free of interaction, an environment identifier in the occupied state is returned to the main process as a wait signal, and the main process chooses, according to the wait signal, either to wait or to keep sending interaction requests to the environment process pool.
3. The method according to claim 1, wherein the experience information comprises: state values, action values, return values, an interaction-termination flag, and a cumulative return value.
4. The parallel-training different-strategy deep reinforcement learning method according to claim 1, wherein the main process sampling from the experience storage units based on the experience sample sampling algorithm and updating the corresponding parameters of the agent comprises:
when the main process prepares to update the agent's parameters, checking the number N_O of experience records already stored in the common experience information storage unit and comparing it with a second threshold N_l: if N_O < N_l, skipping the sampling and the parameter update; if N_O ≥ N_l, checking the number of experience records already stored in the high-return experience information storage unit; if the number of stored experience samples in the high-return experience information storage unit is below a third threshold N_l1, sampling from the common experience information storage unit, and if it is not below a fourth threshold N_l2, choosing one of the following two ways to complete the sampling:
sampling from the high-return experience information storage unit with the probability P and performing parameter updating by using the obtained samples, and sampling from the common experience information storage unit with the probability (1-P) and performing parameter updating by using the obtained samples;
or, within each sampled batch, drawing a fraction P of the samples from the high-return experience information storage unit and a fraction (1 - P) from the common experience information storage unit.
5. The parallel-training different-strategy deep reinforcement learning method according to claim 1, wherein the common experience information storage unit and the high-return experience information storage unit are both of fixed length and store experience information in a FIFO manner; the length refers to the maximum number of single-step experience records that can be stored.
6. The method of claim 1, wherein when an episode of the interaction between the main process and the current environment process ends, the main process releases the current environment process and resets its state to idle; the reset environment process resets its environment and waits for the next interaction to start.
CN201810040895.XA 2018-01-16 2018-01-16 Different strategy deep reinforcement learning method for parallel training Active CN108108822B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201810040895.XA CN108108822B (en) 2018-01-16 2018-01-16 Different strategy deep reinforcement learning method for parallel training

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201810040895.XA CN108108822B (en) 2018-01-16 2018-01-16 Different strategy deep reinforcement learning method for parallel training

Publications (2)

Publication Number Publication Date
CN108108822A CN108108822A (en) 2018-06-01
CN108108822B true CN108108822B (en) 2020-06-26

Family

ID=62220060

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201810040895.XA Active CN108108822B (en) 2018-01-16 2018-01-16 Different strategy deep reinforcement learning method for parallel training

Country Status (1)

Country Link
CN (1) CN108108822B (en)

Families Citing this family (10)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
JP7016295B2 (en) * 2018-06-28 2022-02-04 三菱重工業株式会社 Decision-making devices, unmanned systems, decision-making methods, and programs
CN110888401B (en) * 2018-09-11 2022-09-06 京东科技控股股份有限公司 Combustion control optimization method and device for thermal generator set and readable storage medium
CN109523029B (en) * 2018-09-28 2020-11-03 清华大学深圳研究生院 Self-adaptive double-self-driven depth certainty strategy gradient reinforcement learning method
CN110428057A (en) * 2019-05-06 2019-11-08 南京大学 A kind of intelligent game playing system based on multiple agent deeply learning algorithm
CN110531617B (en) * 2019-07-30 2021-01-08 北京邮电大学 Multi-unmanned aerial vehicle 3D hovering position joint optimization method and device and unmanned aerial vehicle base station
WO2020098823A2 (en) * 2019-12-12 2020-05-22 Alipay (Hangzhou) Information Technology Co., Ltd. Determining action selection policies of an execution device
WO2020098821A2 (en) 2019-12-12 2020-05-22 Alipay (Hangzhou) Information Technology Co., Ltd. Determining action selection policies of an execution device
SG11202010172WA (en) 2019-12-12 2020-11-27 Alipay Hangzhou Inf Tech Co Ltd Determining action selection policies of execution device
CN112926735B (en) * 2021-01-29 2024-08-02 北京字节跳动网络技术有限公司 Method, device, framework, medium and equipment for updating deep reinforcement learning model
CN114117752A (en) * 2021-11-10 2022-03-01 杭州海康威视数字技术股份有限公司 Method and system for training reinforcement learning model of intelligent agent

Citations (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN105637540A (en) * 2013-10-08 2016-06-01 谷歌公司 Methods and apparatus for reinforcement learning
CN106779072A (en) * 2016-12-23 2017-05-31 深圳市唯特视科技有限公司 A kind of enhancing based on bootstrapping DQN learns deep search method
CN107209872A (en) * 2015-02-06 2017-09-26 谷歌公司 The distributed training of reinforcement learning system

Family Cites Families (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
JP2018005739A (en) * 2016-07-06 2018-01-11 株式会社デンソー Method for learning reinforcement of neural network and reinforcement learning device

Patent Citations (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN105637540A (en) * 2013-10-08 2016-06-01 谷歌公司 Methods and apparatus for reinforcement learning
CN107209872A (en) * 2015-02-06 2017-09-26 谷歌公司 The distributed training of reinforcement learning system
CN106779072A (en) * 2016-12-23 2017-05-31 深圳市唯特视科技有限公司 A kind of enhancing based on bootstrapping DQN learns deep search method

Non-Patent Citations (2)

* Cited by examiner, † Cited by third party
Title
Multi-agent Reinforcement Learning Based on Bidding; MENG Wei et al.; The 1st International Conference on Information Science and Engineering; 2009-12-28; pp. 4949-4952 *
Research on Parallel Reinforcement Learning Algorithms and Their Applications (并行强化学习算法及其应用研究); MENG Wei et al.; Computer Engineering and Applications (计算机工程与应用); 2009-12-01; vol. 45, no. 34; pp. 25-28, 52 *

Also Published As

Publication number Publication date
CN108108822A (en) 2018-06-01

Similar Documents

Publication Publication Date Title
CN108108822B (en) Different strategy deep reinforcement learning method for parallel training
CN110168578B (en) Multi-tasking neural network with task-specific paths
CN105184367B (en) The model parameter training method and system of deep neural network
CN110647294B (en) Storage block recovery method and device, storage medium and electronic equipment
CN106802772B (en) Data recovery method and device and solid state disk
CN109284233B (en) Garbage recovery method of storage system and related device
CN112819159A (en) Deep reinforcement learning training method and computer readable storage medium
CN111125519A (en) User behavior prediction method and device, electronic equipment and storage medium
CN113850364A (en) Non-transitory computer-readable recording medium, learning method, and information processing apparatus
CN112416255A (en) User writing speed control method, device, equipment and medium
CN116112563A (en) Dual-strategy self-adaptive cache replacement method based on popularity prediction
CN113268457B (en) Self-adaptive learning index method and system supporting efficient writing
CN117235088A (en) Cache updating method, device, equipment, medium and platform of storage system
CN117453123A (en) Data classification storage method and equipment based on reinforcement learning
EP3926547A1 (en) Program, learning method, and information processing apparatus
CN113268143A (en) Multimodal man-machine interaction method based on reinforcement learning
CN110533192B (en) Reinforced learning method and device, computer readable medium and electronic equipment
CN112801130A (en) Image clustering quality evaluation method, system, medium, and apparatus
CN110765360A (en) Text topic processing method and device, electronic equipment and computer storage medium
CN116893854A (en) Method, device, equipment and storage medium for detecting conflict of instruction resources
JP5931276B2 (en) Programmable display, its program
CN113741402A (en) Equipment control method and device, computer equipment and storage medium
CN110297977B (en) Personalized recommendation single-target evolution method for crowd funding platform
CN113439253B (en) Application cleaning method and device, storage medium and electronic equipment
CN117806837B (en) Method, device, storage medium and system for managing hard disk tasks

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant