WO2020029095A1 - Reinforcement learning network training method, apparatus and device, and storage medium - Google Patents

Reinforcement learning network training method, apparatus and device, and storage medium Download PDF

Info

Publication number
WO2020029095A1
WO2020029095A1 (PCT/CN2018/099256)
Authority
WO
WIPO (PCT)
Prior art keywords
value
action
reinforcement learning
preset
learning network
Prior art date
Application number
PCT/CN2018/099256
Other languages
French (fr)
Chinese (zh)
Inventor
王峥 (Wang Zheng)
梁明兰 (Liang Minglan)
Original Assignee
中国科学院深圳先进技术研究院 (Shenzhen Institutes of Advanced Technology, Chinese Academy of Sciences)
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Shenzhen Institutes of Advanced Technology, Chinese Academy of Sciences (中国科学院深圳先进技术研究院)
Priority to PCT/CN2018/099256
Publication of WO2020029095A1


Classifications

    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06N - COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 - Computing arrangements based on biological models
    • G06N3/02 - Neural networks
    • G06N3/08 - Learning methods

Definitions

  • the invention belongs to the field of machine learning, and particularly relates to a training method, a device, a training device, and a storage medium for a reinforcement learning network.
  • Reinforcement learning, also known as re-incentive learning or evaluative learning, is an important machine learning method in which an agent learns a mapping from environment states to actions so as to maximize the value of the reward (reinforcement) signal function. Reinforcement learning differs from supervised learning in connectionist learning mainly in the teacher signal: the reinforcement signal provided by the environment evaluates how good the produced action is (usually as a scalar signal) rather than telling the reinforcement learning system (RLS) how to produce the correct action. Because the external environment provides little information, the RLS must learn from its own experience. In this way, the RLS gains knowledge in an action-evaluation environment and improves its action plan to suit the environment. Reinforcement learning has many applications in areas such as intelligent robot control and analysis and prediction.
  • reinforcement learning has been widely used in robot control, computer vision, natural language processing, game theory, and autonomous driving.
  • the process of training a reinforcement learning network is usually implemented on CPU and GPU devices and involves a very large amount of computation. In practical applications this occupies substantial resources, runs slowly, and is inefficient, and the limitation of memory access bandwidth prevents the computing capability from being improved further.
  • the purpose of the present invention is to provide a training method, apparatus, training device, and storage medium for a reinforcement learning network, aiming to solve the problem that the prior art cannot provide an effective training method for reinforcement learning networks, which results in a large amount of training computation and low efficiency.
  • the present invention provides a method for training a reinforcement learning network, which includes the following steps:
  • the present invention provides a training device for a reinforcement learning network, the device includes:
  • a parameter setting unit configured to set network parameters of the reinforcement learning network when receiving a request for training the reinforcement learning network to perform weight configuration on the reinforcement learning network
  • a matching obtaining unit configured to obtain the current state of the reinforcement learning network, match the current state in a pre-built state reward library, and obtain a reward value and a contribution value of the current state;
  • a traversal obtaining unit, configured to traverse the action combinations of a pre-built action library, obtain the contribution value of each action combination, and obtain the maximum Q value of the current state of the reinforcement learning network according to the contribution value of the current state and the contribution value of the action combination;
  • an execution obtaining unit, configured to obtain and execute the current action of the reinforcement learning network according to the maximum Q value of the current state so that the reinforcement learning network enters the next state, obtain the maximum Q value of the next state, and obtain the target Q value of the current state through the maximum Q value of the next state, the reward value of the current state, and a preset target value formula; and
  • a generation and adjustment unit, configured to generate a loss function of the reinforcement learning network according to the target Q value of the reinforcement learning network, and to adjust the network parameters of the reinforcement learning network through a preset adjustment algorithm to continue training the reinforcement learning network until the loss function converges.
  • the present invention also provides a reinforcement learning network training device, including a memory, a processor, and a computer program stored in the memory and executable on the processor, where the processor, when executing the computer program, implements the steps of the training method for a reinforcement learning network described above.
  • the present invention also provides a computer-readable storage medium that stores a computer program that, when executed by a processor, implements the steps of the training method for a reinforcement learning network as described above.
  • when a request for training a reinforcement learning network is received, the present invention sets the network parameters of the reinforcement learning network to perform weight configuration, obtains the current state of the reinforcement learning network together with the reward value and contribution value of the current state, traverses the action combinations of the action library to obtain the maximum Q value of the action combinations in the current state, obtains and executes the current action according to the maximum Q value of the current state, obtains the target Q value of the current state by obtaining the maximum Q value of the next state, generates the loss function of the reinforcement learning network, and adjusts the network parameters through a preset adjustment algorithm to continue training the reinforcement learning network until the loss function converges, thereby reducing the amount of computation needed to train the reinforcement learning network, speeding up training, and improving training efficiency.
  • FIG. 1 is an implementation flowchart of a training method for a reinforcement learning network according to Embodiment 1 of the present invention.
  • FIG. 2 is a schematic diagram of a preferred storage structure of a state reward library provided in Embodiment 1 of the present invention.
  • FIG. 3 is a schematic diagram of a preferred storage structure of an action library provided by Embodiment 1 of the present invention.
  • FIG. 4 is a schematic structural diagram of a training device for a reinforcement learning network according to a second embodiment of the present invention.
  • FIG. 5 is a schematic structural diagram of a training device for a reinforcement learning network according to a third embodiment of the present invention.
  • FIG. 6 is a schematic structural diagram of a reinforcement learning network training device according to a fourth embodiment of the present invention.
  • FIG. 7 is a schematic diagram of a preferred structure of a reinforcement learning network training device provided in Embodiment 4 of the present invention.
  • FIG. 1 shows an implementation process of a training method for a reinforcement learning network provided in Embodiment 1 of the present invention. For ease of description, only parts related to the embodiment of the present invention are shown, and the details are as follows:
  • In step S101, when a request for training a reinforcement learning network is received, network parameters of the reinforcement learning network are set to perform weight configuration on the reinforcement learning network.
  • the embodiments of the present invention are applicable to reinforcement learning network training equipment, for example, training equipment such as MATLAB (Matrix Laboratory).
  • the network parameters of the reinforcement learning network are set to configure its weights. Specifically, the network parameters are written first; when network operations are performed, the written parameters activate the computation mode of the corresponding neurons of the reinforcement learning network. In this way the parameters of every neuron in every layer of the network are configured, so that data is processed in parallel and data processing efficiency is improved.
  • In step S102, the current state of the reinforcement learning network is acquired and matched in a pre-built state reward library to obtain the reward value and contribution value of the current state.
  • the state reward library is a pre-built set storing state nodes and their corresponding reward values. After the training request is received, the current state of the reinforcement learning network is acquired and its feature data is extracted; the contribution value of the current state is computed from this feature data, and the current state is then matched in the state reward library to obtain its reward value.
  • FIG. 2 shows a preferred storage structure of the state reward library.
  • the state reward library is divided into n reward groups, which correspond to the reward values of n special states. The number of reward groups, n, is stored at the beginning of the data, and the end of the library stores the general-state reward value, i.e. the (n + 1)-th reward value.
  • each reward group contains different state nodes, i.e. different state values, and different state nodes correspond to different ranges of state values.
  • when the current state is matched in the pre-built state reward library, it is compared against all state nodes of the preset number of reward groups in the state reward library. When the current state falls within a preset state node of one of these reward groups, the reward value of that reward group is set as the reward value of the current state; otherwise the reward value of the current state is set to the preset general-state reward value, so that the immediate reward of the current state is obtained quickly.
  • since the current state can lie in at most one state node, or outside all state nodes, the state nodes may be matched one by one: once the current state matches a preset state node, matching of the other state nodes stops and the reward value of that node is taken as the reward of the current state; if no node matches after all state nodes have been tried, the general-state reward value is used.
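  • As an illustrative aid (not part of the original patent text), the following minimal Python sketch shows one way the one-by-one state-node matching described above could be realized; the tuple layout of the reward groups and the names match_state_reward, reward_groups, and general_reward are assumptions made for the example, not the patent's actual memory format.
```python
# Hypothetical sketch of matching a state against a state reward library.
# Each reward group holds a reward value and a list of (low, high) state-node
# ranges; a general-state reward is used when no node matches.

def match_state_reward(state_value, reward_groups, general_reward):
    """Return the immediate reward for state_value.

    reward_groups: list of (reward, [(low, high), ...]) tuples, one per group.
    general_reward: the (n + 1)-th reward returned when no node matches.
    """
    for reward, nodes in reward_groups:          # match groups one by one
        for low, high in nodes:                  # match state nodes one by one
            if low <= state_value <= high:       # state lies inside this node
                return reward                    # stop at the first match
    return general_reward                        # outside all nodes


# Example: 2 special reward groups plus a general reward of -0.1
groups = [
    (1.0, [(0.0, 0.2), (0.8, 1.0)]),   # group 1: goal-like states
    (-1.0, [(0.45, 0.55)]),            # group 2: penalized states
]
print(match_state_reward(0.9, groups, general_reward=-0.1))   # -> 1.0
print(match_state_reward(0.3, groups, general_reward=-0.1))   # -> -0.1
```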
  • In step S103, the action combinations of a pre-built action library are traversed to obtain the contribution value of each action combination, and the maximum Q value of the current state of the reinforcement learning network is obtained according to the contribution value of the current state and the contribution value of the action combination.
  • the action library is a pre-built set storing all actions that the reinforcement learning network can output.
  • the Q value represents the mapping from states to action values in the reinforcement learning network.
  • all action combinations of the action library are traversed, and the contribution value of each action combination (real-time action) is obtained.
  • each time an action combination is obtained, its Q value is computed from the contribution value of the current state and the contribution value of the action combination, so that the Q value of every action combination can be calculated and the maximum Q value of the current state of the reinforcement learning network can be obtained.
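  • The patent does not spell out how the state contribution and the action contribution combine into a Q value. As a hedged illustration only, the sketch below assumes the Q value of an action combination is simply the sum of the two contributions and keeps a running maximum while traversing; the function names and the additive combination are assumptions.
```python
# Hypothetical sketch of obtaining the maximum Q value of the current state by
# traversing action combinations; combining contributions by addition is an
# assumption made only for illustration.

def max_q_of_state(state_contribution, action_combinations, action_contribution):
    """Return (max Q value, best action) over all action combinations."""
    best_q, best_action = float("-inf"), None
    for action in action_combinations:
        q = state_contribution + action_contribution(action)   # per-combination Q value
        if q > best_q:
            best_q, best_action = q, action
    return best_q, best_action
```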
  • FIG. 3 shows a preferred storage structure of the action library.
  • the action library is divided into an action memory module and a real-time action memory module.
  • the action memory module stores the information of all actions, specifically the number of action dimensions n and, for each action dimension, its step value, maximum value, and starting value.
  • the real-time action memory module stores the action information to be output, specifically the action value of each of the n action dimensions.
  • for example, in a reinforcement learning network for autonomous driving, the actions may include turning left (first dimension), turning right (second dimension), and braking (third dimension), with corresponding action values (1, a), (2, b), and (3, c), where 1, 2, and 3 denote the action dimensions (the first, second, and third dimensions) and a, b, and c are the measurement values of the first-, second-, and third-dimensional actions, respectively.
  • when traversing the action combinations of the pre-built action library, the starting values of the preset number of dimensional actions on the preset action list in the action library are first set, in order, as the preset number of real-time action values on the preset real-time action table in the action library.
  • the step value of the preset first-dimensional action is then obtained from the preset action list and successively accumulated onto the real-time action value corresponding to the preset first-dimensional action; when the accumulated real-time action value goes outside the range corresponding to the preset first-dimensional action, the step value of the preset second-dimensional action is obtained from the preset action list and successively accumulated onto the real-time action value corresponding to the preset second-dimensional action, so that the contribution value of every real-time action to the learning network is computed quickly and accurately.
  • the preset first-dimensional action and the preset second-dimensional action are both single dimensions among the preset number of dimensional actions.
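  • The accumulation described above behaves like an odometer over the action dimensions: the first dimension is stepped until it leaves its range, at which point the next dimension is stepped. The sketch below enumerates every action combination from per-dimension (start, step, maximum) triples; resetting a dimension to its starting value when it overflows is an assumption made to keep the enumeration complete, and the data layout is illustrative rather than the patent's memory modules.
```python
# Hypothetical sketch of traversing every action combination stored as
# per-dimension (start, step, maximum) triples, mimicking the "accumulate the
# first dimension, carry into the second when it overflows" procedure.

def traverse_actions(action_list):
    """Yield every action combination.

    action_list: per-dimension (start, step, maximum) triples.
    """
    real_time = [start for start, _, _ in action_list]   # real-time action table
    while True:
        yield tuple(real_time)                            # one action combination
        dim = 0
        while dim < len(action_list):
            start, step, maximum = action_list[dim]
            real_time[dim] += step                        # accumulate this dimension
            if real_time[dim] <= maximum:
                break                                     # still inside its range
            real_time[dim] = start                        # reset and carry to next dim
            dim += 1
        else:
            return                                        # every dimension overflowed


# Example: steering in {-1, 0, 1} and braking in {0, 0.5, 1}
for action in traverse_actions([(-1, 1, 1), (0, 0.5, 1)]):
    print(action)
```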
  • In step S104, the current action of the reinforcement learning network is obtained and executed according to the maximum Q value of the current state so as to obtain the next state of the reinforcement learning network; the maximum Q value of the next state is obtained, and the target Q value of the current state is obtained through the maximum Q value of the next state, the reward value of the current state, and a preset target value formula.
  • the current action is the action that the reinforcement learning network needs to perform in the current state. The preset target value formula is Target_Q(s, a; θ) = r(s) + γ·maxQ(s', a'; θ), where Target_Q(s, a; θ) is the target Q value of the current state, s is the current state, a is the current action, r(s) is the reward value of the current state, γ is the discount factor, θ denotes the network parameters, and maxQ(s', a'; θ) is the maximum Q value of the next state.
  • Specifically, following a greedy policy, the current action of the reinforcement learning network is obtained from the maximum Q value of the current state and executed, and the network enters the next state. The methods of steps S102 and S103 are then repeated to obtain the maximum Q value of the next state, and the target Q value of the current state is obtained through the preset target value formula.
  • preferably, after the target Q value of the current state is obtained, the current state, the current action, the reward value of the current state, and the next state are stored as a training sample, thereby speeding up the subsequent convergence process.
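  • As a rough, non-authoritative illustration of steps S102 to S104, the sketch below combines the greedy action choice, the target value formula Target_Q(s, a; θ) = r(s) + γ·maxQ(s', a'; θ), and the storage of (state, action, reward, next state) training samples; the q_value, env_step, and reward_of callables, the buffer size, and the discount value are placeholders assumed for the example.
```python
# Hypothetical sketch of one training transition: choose the action with the
# largest Q value, execute it, and form the target value
#   Target_Q(s, a; θ) = r(s) + γ · max_a' Q(s', a'; θ),
# storing the transition (s, a, r, s') as a training sample.
from collections import deque

replay_buffer = deque(maxlen=10000)   # stored (state, action, reward, next_state) samples
GAMMA = 0.9                           # discount factor γ (assumed value)

def training_step(q_value, env_step, reward_of, state, actions):
    """q_value(s, a): real-time Q estimate; env_step(s, a): returns the next state;
    reward_of(s): immediate reward looked up in the state reward library."""
    # Step S103: maximum Q value of the current state over all action combinations
    current_action = max(actions, key=lambda a: q_value(state, a))
    reward = reward_of(state)

    # Step S104: execute the action, enter the next state, compute the target Q value
    next_state = env_step(state, current_action)
    max_next_q = max(q_value(next_state, a) for a in actions)
    target_q = reward + GAMMA * max_next_q

    # Store the experience to speed up subsequent convergence
    replay_buffer.append((state, current_action, reward, next_state))
    return next_state, target_q
```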
  • preferably, the reinforcement learning network training device contains two processors, one of which is an AI chip whose architecture lies between an ASIC (Application Specific Integrated Circuit) and an FPGA (Field-Programmable Gate Array). This AI chip handles the part of the training process that makes decisions based on the current state and responds with the current action, thereby improving the training speed of the reinforcement learning network by increasing the memory access bandwidth.
  • In step S105, a loss function of the reinforcement learning network is generated according to the target Q value of the current state, and the network parameters are adjusted by a preset adjustment algorithm to continue training the reinforcement learning network until the loss function converges.
  • after the target Q value of the current state is obtained, the loss function of the reinforcement learning network is generated. Specifically, the loss function is L(θ) = E[(Target_Q(s, a; θ) - Q(s, a; θ))²], where Target_Q(s, a; θ) is the target Q value of the current state, E denotes the expectation (mean squared error), Q(s, a; θ) is the real-time Q value, s is the current state, a is the current action, and θ denotes the network parameters. The network parameters are then adjusted through the preset adjustment algorithm to continue training the network until the loss function converges, which completes the training of the reinforcement learning network. Specifically, the preset adjustment algorithm is the SGD (stochastic gradient descent) algorithm.
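  • A minimal sketch of the loss L(θ) = E[(Target_Q - Q(s, a; θ))²] minimized by stochastic gradient descent is given below; the tiny linear Q model, the feature representation, and the learning rate are assumptions chosen only to keep the example self-contained and runnable, not the patent's network.
```python
# Hypothetical sketch: mean-squared loss between target and real-time Q values,
# minimized with plain stochastic gradient descent on a toy linear Q model.
import numpy as np

rng = np.random.default_rng(0)
theta = rng.normal(size=4)                    # network parameters θ (toy linear model)

def q_value(features, theta):
    """Real-time Q(s, a; θ) for a concatenated state-action feature vector."""
    return float(features @ theta)

def sgd_update(batch, theta, lr=0.01):
    """One SGD step on L(θ) = mean over the batch of (Target_Q - Q(s, a; θ))^2."""
    grad = np.zeros_like(theta)
    for features, target_q in batch:
        error = q_value(features, theta) - target_q
        grad += 2.0 * error * features        # gradient of the squared error w.r.t. θ
    grad /= len(batch)
    return theta - lr * grad                  # gradient descent step

# Example: random (feature, target) pairs standing in for replayed transitions
batch = [(rng.normal(size=4), float(rng.normal())) for _ in range(8)]
for _ in range(100):
    theta = sgd_update(batch, theta)
```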
  • In this way, when a request for training a reinforcement learning network is received, the network parameters of the reinforcement learning network are set to perform weight configuration, and the current state of the reinforcement learning network is obtained together with its reward value and contribution value. The action combinations of the action library are traversed to obtain the maximum Q value of the action combinations in the current state, the current action is obtained from the maximum Q value of the current state and executed, and the target Q value of the current state is obtained by obtaining the maximum Q value of the next state. The loss function of the reinforcement learning network is then generated and the network parameters are adjusted through the preset adjustment algorithm to continue training the reinforcement learning network until the loss function converges, which reduces the amount of computation needed to train the reinforcement learning network, speeds up training, and improves training efficiency.
  • FIG. 4 shows the structure of a training device for a reinforcement learning network provided in Embodiment 2 of the present invention. For ease of description, only parts related to the embodiment of the present invention are shown, including:
  • a parameter setting unit 41 is configured to set network parameters of the reinforcement learning network when receiving a request for training the reinforcement learning network to perform weight configuration on the reinforcement learning network;
  • the matching obtaining unit 42 is configured to obtain the current state of the reinforcement learning network, match the current state in a pre-built state reward library, and obtain the reward value and contribution value of the current state;
  • the traversal acquisition unit 43 is configured to traverse the action combinations of a pre-built action library, obtain the contribution value of each action combination, and obtain the maximum Q value of the current state of the reinforcement learning network according to the contribution value of the current state and the contribution value of the action combination;
  • the execution obtaining unit 44 is configured to obtain and execute the current action of the reinforcement learning network according to the maximum Q value of the current state so as to obtain the next state of the reinforcement learning network, obtain the maximum Q value of the next state, and obtain the target Q value of the current state through the maximum Q value of the next state, the reward value of the current state, and a preset target value formula; and
  • the generating and adjusting unit 45 is configured to generate a loss function of the reinforcement learning network according to the target Q value of the current state, and to adjust the network parameters through a preset adjustment algorithm to continue training the reinforcement learning network until the loss function converges.
  • In the embodiment of the present invention, when a request for training a reinforcement learning network is received, the network parameters of the reinforcement learning network are set to perform weight configuration, and the current state of the reinforcement learning network is obtained together with its reward value and contribution value. The action combinations of the action library are traversed to obtain the maximum Q value of the action combinations in the current state, the current action is obtained from the maximum Q value of the current state and executed, and the target Q value of the current state is obtained by obtaining the maximum Q value of the next state. The loss function of the reinforcement learning network is then generated and the network parameters are adjusted through the preset adjustment algorithm to continue training the reinforcement learning network until the loss function converges, which reduces the amount of computation needed to train the reinforcement learning network, speeds up training, and improves training efficiency.
  • each unit of the training device of the reinforcement learning network may be implemented by a corresponding hardware or software unit.
  • Each unit may be an independent software or hardware unit, or the units may be integrated into a single software or hardware unit; this is not intended to limit the invention. For the specific implementation of each unit, reference may be made to the description in Embodiment 1, and details are not repeated here.
  • FIG. 5 shows the structure of a training device for a reinforcement learning network provided in Embodiment 3 of the present invention. For ease of description, only parts related to the embodiment of the present invention are shown, including:
  • a parameter setting unit 51 is configured to set network parameters of the reinforcement learning network when receiving a request for training the reinforcement learning network to perform weight configuration on the reinforcement learning network;
  • the matching acquisition unit 52 is configured to acquire the current state of the reinforcement learning network, match the current state in a pre-built state reward library, and obtain the reward value and contribution value of the current state;
  • the traversal acquisition unit 53 is configured to traverse the action combinations of a pre-built action library, obtain the contribution value of each action combination, and obtain the maximum Q value of the current state of the reinforcement learning network according to the contribution value of the current state and the contribution value of the action combination;
  • the execution obtaining unit 54 is configured to obtain and execute the current action of the reinforcement learning network according to the maximum Q value of the current state so as to obtain the next state of the reinforcement learning network, obtain the maximum Q value of the next state, and obtain the target Q value of the current state through the maximum Q value of the next state, the reward value of the current state, and a preset target value formula;
  • the experience storage unit 55 is configured to store the current state, the current action, the reward value of the current state, and the next state as training samples;
  • the generating and adjusting unit 56 is configured to generate a loss function of the reinforcement learning network according to the target Q value of the current state, and adjust the network parameters through a preset adjustment algorithm to continue training the learning network until the loss function converges.
  • the matching obtaining unit 52 includes:
  • a matching subunit 521 configured to match the current state with all state nodes corresponding to a preset number of reward groups in the state reward library
  • the state value setting unit 522 is configured to set the reward value of the preset state reward group as the reward value of the current state when the current state is located in a preset state node of the preset number of reward groups, and otherwise to set the reward value of the current state to the preset general-state reward value.
  • the traversal obtaining unit 53 includes:
  • a start value setting unit 531 configured to sequentially set a start value of a preset number of dimensional actions on a preset action list in the action library to a preset number of real-time action values on a preset real-time action table in the action library;
  • the first accumulation unit 532 is configured to obtain the step value of the preset first-dimensional action on the preset action list, and to successively accumulate the step value of the preset first-dimensional action onto the real-time action value corresponding to the preset first-dimensional action;
  • a second accumulation unit 533 configured to obtain the step value of the preset second-dimensional action on the preset action list when the corresponding real-time action value is sequentially accumulated outside the range corresponding to the preset first-dimensional action, and The step value of the preset second-dimensional action is successively accumulated to the real-time action value corresponding to the preset second-dimensional action.
  • In the embodiment of the present invention, when a request for training a reinforcement learning network is received, the network parameters of the reinforcement learning network are set to perform weight configuration, and the current state of the reinforcement learning network is obtained together with its reward value and contribution value. The action combinations of the action library are traversed to obtain the maximum Q value of the action combinations in the current state, the current action is obtained from the maximum Q value of the current state and executed, and the target Q value of the current state is obtained by obtaining the maximum Q value of the next state. The loss function of the reinforcement learning network is then generated and the network parameters are adjusted through the preset adjustment algorithm to continue training the reinforcement learning network until the loss function converges, which reduces the amount of computation needed to train the reinforcement learning network, speeds up training, and improves training efficiency.
  • each unit of the training device of the reinforcement learning network may be implemented by a corresponding hardware or software unit.
  • Each unit may be an independent software or hardware unit, or the units may be integrated into a single software or hardware unit; this is not intended to limit the invention. For the specific implementation of each unit, reference may be made to the description in Embodiment 1, and details are not repeated here.
  • Embodiment 4:
  • FIG. 6 shows the structure of a reinforcement learning network training device provided in Embodiment 4 of the present invention. For convenience of explanation, only parts related to the embodiment of the present invention are shown, including:
  • the reinforcement learning network training device 6 includes a processor 61, a memory 62, and a computer program 63 stored in the memory 62 and executable on the processor 61.
  • when the processor 61 executes the computer program 63, the steps in the above embodiment of the training method for a reinforcement learning network are implemented, for example steps S101 to S105 shown in FIG. 1.
  • Alternatively, when the processor 61 executes the computer program 63, the functions of the units in the above embodiments of the training device for a reinforcement learning network are realized, for example the functions of units 41 to 45 shown in FIG. 4 or units 51 to 56 shown in FIG. 5.
  • the reinforcement learning network training device 7 includes a first processor 711, a second processor 712, a first memory 721, a second memory 722, and a computer program 73 stored in the first memory 721 and the second memory 722; the computer program 73 can run on the first processor 711 and the second processor 712.
  • the first processor 711 is an ASIC (Application Specific Integrated Circuit) chip, thereby improving the efficiency of the learning network and reducing power consumption.
  • the first processor 711, when executing the computer program 73, implements part of the steps in the above embodiment of the training method for a reinforcement learning network, for example steps S101 to S103 shown in FIG. 1, and the second processor 712, when executing the computer program 73, implements the remaining steps of the training method embodiment, for example steps S104 to S105 shown in FIG. 1.
  • Alternatively, when the first processor 711 executes the computer program 73, the functions of some of the units in the above embodiments of the training device for a reinforcement learning network are realized, for example the functions of units 41 to 43 shown in FIG. 4 or units 51 to 53 shown in FIG. 5, and when the second processor 712 executes the computer program 73, the functions of the remaining units are realized, for example units 44 to 45 shown in FIG. 4 or units 54 to 56 shown in FIG. 5.
  • In the embodiment of the present invention, the network parameters of the reinforcement learning network are set to perform weight configuration, and the current state of the reinforcement learning network is obtained together with the reward value and contribution value of the current state. The maximum Q value of the action combinations in the current state is obtained by traversing the action combinations of the action library, the current action is obtained from the maximum Q value of the current state and executed, and the target Q value of the current state is obtained by obtaining the maximum Q value of the next state. The loss function of the reinforcement learning network is generated, and the network parameters are adjusted through the preset adjustment algorithm to continue training the reinforcement learning network until the loss function converges, which reduces the amount of computation needed to train the reinforcement learning network, speeds up training, and improves training efficiency.
  • Embodiment 5:
  • a computer-readable storage medium stores a computer program, and when the computer program is executed by a processor, the steps in the embodiment of the training method of the reinforcement learning network are implemented. For example, steps S101 to S105 shown in FIG. 1.
  • Alternatively, when the computer program is executed by a processor, the functions of the units in the above embodiments of the training device for a reinforcement learning network are implemented, for example the functions of units 41 to 45 shown in FIG. 4 or units 51 to 56 shown in FIG. 5.
  • In the embodiment of the present invention, the network parameters of the reinforcement learning network are set to perform weight configuration, and the current state of the reinforcement learning network is obtained together with the reward value and contribution value of the current state. The maximum Q value of the action combinations in the current state is obtained by traversing the action combinations of the action library, the current action is obtained from the maximum Q value of the current state and executed, and the target Q value of the current state is obtained by obtaining the maximum Q value of the next state. A loss function for the reinforcement learning network is generated, and the network parameters are adjusted through the preset adjustment algorithm to continue training the reinforcement learning network until the loss function converges, thereby reducing the amount of computation for training the reinforcement learning network, accelerating its training, and improving training efficiency.
  • the computer-readable storage medium of the embodiments of the present invention may include any entity or device capable of carrying computer program code, or a storage medium, for example a memory such as a ROM/RAM, a magnetic disk, an optical disk, or a flash memory.

Abstract

The present invention is applicable to the technical field of machine learning, and provides a reinforcement learning network training method, apparatus and device, and a storage medium. Said method comprises: upon receipt of a request for training of a reinforcement learning network, setting network parameters of the reinforcement learning network, so as to perform weight configuration; acquiring the current state of the reinforcement learning network, and the reward value and the contribution value of the current state; acquiring the maximum Q value of the action combination in the current state by traversing action combinations in an action library; acquiring the current action according to the maximum Q value of the current state and executing same, and acquiring a target Q value of the current state by obtaining the maximum Q value of a next state; and generating a loss function of the reinforcement learning network, and adjusting the network parameters by means of a preset adjustment algorithm, so as to continue to train the reinforcement learning network until the loss function converges. The present invention reduces the calculation amount for reinforcement learning network training, thereby increasing the training speed of a reinforcement learning network and improving the training efficiency.

Description

Training method, apparatus, training device, and storage medium for a reinforcement learning network
Technical Field
The invention belongs to the field of machine learning, and particularly relates to a training method, an apparatus, a training device, and a storage medium for a reinforcement learning network.
Background
Reinforcement learning, also known as re-incentive learning or evaluative learning, is an important machine learning method in which an agent learns a mapping from environment states to actions so as to maximize the value of the reward (reinforcement) signal function. Reinforcement learning differs from supervised learning in connectionist learning mainly in the teacher signal: the reinforcement signal provided by the environment evaluates how good the produced action is (usually as a scalar signal) rather than telling the reinforcement learning system (RLS) how to produce the correct action. Because the external environment provides little information, the RLS must learn from its own experience. In this way, the RLS gains knowledge in an action-evaluation environment and improves its action plan to suit the environment. Reinforcement learning has many applications in areas such as intelligent robot control and analysis and prediction.
In recent years, reinforcement learning has been widely used in robot control, computer vision, natural language processing, game theory, and autonomous driving. Training a reinforcement learning network is usually carried out on CPU and GPU devices and involves a very large amount of computation; in practical applications it occupies substantial resources, runs slowly, and is inefficient, and the limitation of memory access bandwidth prevents the computing capability from being improved further.
Summary of the Invention
The purpose of the present invention is to provide a training method, apparatus, training device, and storage medium for a reinforcement learning network, aiming to solve the problem that the prior art cannot provide an effective training method for reinforcement learning networks, which results in a large amount of training computation and low efficiency.
In one aspect, the present invention provides a method for training a reinforcement learning network, the method including the following steps:
when a request for training the reinforcement learning network is received, setting network parameters of the reinforcement learning network to perform weight configuration on the reinforcement learning network;
acquiring the current state of the reinforcement learning network, matching the current state in a pre-built state reward library, and obtaining the reward value and the contribution value of the current state;
traversing the action combinations of a pre-built action library, obtaining the contribution value of each action combination, and obtaining the maximum Q value of the current state of the reinforcement learning network according to the contribution value of the current state and the contribution value of the action combination;
obtaining and executing the current action of the reinforcement learning network according to the maximum Q value of the current state so that the reinforcement learning network enters the next state, obtaining the maximum Q value of the next state, and obtaining the target Q value of the current state through the maximum Q value of the next state, the reward value of the current state, and a preset target value formula; and
generating a loss function of the reinforcement learning network according to the target Q value of the current state, and adjusting the network parameters of the reinforcement learning network through a preset adjustment algorithm to continue training the reinforcement learning network until the loss function converges.
In another aspect, the present invention provides a training apparatus for a reinforcement learning network, the apparatus including:
a parameter setting unit, configured to set network parameters of the reinforcement learning network when a request for training the reinforcement learning network is received, so as to perform weight configuration on the reinforcement learning network;
a matching obtaining unit, configured to obtain the current state of the reinforcement learning network, match the current state in a pre-built state reward library, and obtain the reward value and the contribution value of the current state;
a traversal obtaining unit, configured to traverse the action combinations of a pre-built action library, obtain the contribution value of each action combination, and obtain the maximum Q value of the current state of the reinforcement learning network according to the contribution value of the current state and the contribution value of the action combination;
an execution obtaining unit, configured to obtain and execute the current action of the reinforcement learning network according to the maximum Q value of the current state so that the reinforcement learning network enters the next state, obtain the maximum Q value of the next state, and obtain the target Q value of the current state through the maximum Q value of the next state, the reward value of the current state, and a preset target value formula; and
a generation and adjustment unit, configured to generate a loss function of the reinforcement learning network according to the target Q value of the reinforcement learning network, and adjust the network parameters of the reinforcement learning network through a preset adjustment algorithm to continue training the reinforcement learning network until the loss function converges.
In another aspect, the present invention also provides a reinforcement learning network training device, including a memory, a processor, and a computer program stored in the memory and executable on the processor, where the processor, when executing the computer program, implements the steps of the training method for a reinforcement learning network described above.
In another aspect, the present invention also provides a computer-readable storage medium storing a computer program which, when executed by a processor, implements the steps of the training method for a reinforcement learning network described above.
When a request for training a reinforcement learning network is received, the present invention sets the network parameters of the reinforcement learning network to perform weight configuration, obtains the current state of the reinforcement learning network together with the reward value and contribution value of the current state, traverses the action combinations of the action library to obtain the maximum Q value of the action combinations in the current state, obtains and executes the current action according to the maximum Q value of the current state, obtains the target Q value of the current state by obtaining the maximum Q value of the next state, generates the loss function of the reinforcement learning network, and adjusts the network parameters through a preset adjustment algorithm to continue training the reinforcement learning network until the loss function converges. This reduces the amount of computation needed to train the reinforcement learning network, thereby speeding up training and improving training efficiency.
BRIEF DESCRIPTION OF THE DRAWINGS
FIG. 1 is an implementation flowchart of a training method for a reinforcement learning network according to Embodiment 1 of the present invention;
FIG. 2 is a schematic diagram of a preferred storage structure of the state reward library provided in Embodiment 1 of the present invention;
FIG. 3 is a schematic diagram of a preferred storage structure of the action library provided in Embodiment 1 of the present invention;
FIG. 4 is a schematic structural diagram of a training apparatus for a reinforcement learning network according to Embodiment 2 of the present invention;
FIG. 5 is a schematic structural diagram of a training apparatus for a reinforcement learning network according to Embodiment 3 of the present invention;
FIG. 6 is a schematic structural diagram of a reinforcement learning network training device according to Embodiment 4 of the present invention; and
FIG. 7 is a schematic diagram of a preferred structure of a reinforcement learning network training device according to Embodiment 4 of the present invention.
DETAILED DESCRIPTION
In order to make the objectives, technical solutions, and advantages of the present invention clearer, the present invention is further described in detail below with reference to the accompanying drawings and embodiments. It should be understood that the specific embodiments described here are only intended to explain the present invention and are not intended to limit it.
The specific implementation of the present invention is described in detail below with reference to specific embodiments.
Embodiment 1:
FIG. 1 shows the implementation flow of the training method for a reinforcement learning network provided in Embodiment 1 of the present invention. For ease of description, only the parts related to the embodiment of the present invention are shown, and the details are as follows:
In step S101, when a request for training a reinforcement learning network is received, network parameters of the reinforcement learning network are set to perform weight configuration on the reinforcement learning network.
The embodiment of the present invention is applicable to reinforcement learning network training equipment, for example training equipment such as MATLAB (Matrix Laboratory). In the embodiment of the present invention, when a request for training the reinforcement learning network is received, the network parameters of the network are set to configure its weights. Specifically, the network parameters are written first; when network operations are performed, the written parameters activate the computation mode of the corresponding neurons of the reinforcement learning network. In this way the parameters of every neuron in every layer of the network are configured, so that data is processed in parallel and data processing efficiency is improved.
In step S102, the current state of the reinforcement learning network is acquired and matched in a pre-built state reward library to obtain the reward value and contribution value of the current state.
In the embodiment of the present invention, the state reward library is a pre-built set storing state nodes and their corresponding reward values. After the training request is received, the current state of the reinforcement learning network is acquired and its feature data is extracted; the contribution value of the current state is computed from this feature data, and the current state is then matched in the state reward library to obtain its reward value.
As an example, FIG. 2 shows a preferred storage structure of the state reward library. The state reward library is divided into n reward groups, which correspond to the reward values of n special states. The number of reward groups, n, is stored at the beginning of the data, and the end of the library stores the general-state reward value, i.e. the (n + 1)-th reward value. Each reward group contains different state nodes, i.e. different state values, and different state nodes correspond to different ranges of state values.
Preferably, when the current state is matched in the pre-built state reward library, it is compared against all state nodes of the preset number of reward groups. When the current state falls within a preset state node of one of these reward groups, the reward value of that reward group is set as the reward value of the current state; otherwise the reward value of the current state is set to the preset general-state reward value, so that the immediate reward of the current state is obtained quickly. Specifically, since the current state can lie in at most one state node or outside all state nodes, the state nodes may be matched one by one: once the current state matches a preset state node, matching of the other state nodes stops and the reward value of that node is set as the reward value of the current state; if no node matches after all state nodes have been tried, the general-state reward value is set as the reward value of the current state.
In step S103, the action combinations of a pre-built action library are traversed to obtain the contribution value of each action combination, and the maximum Q value of the current state of the reinforcement learning network is obtained according to the contribution value of the current state and the contribution value of the action combination.
In the embodiment of the present invention, the action library is a pre-built set storing all actions that the reinforcement learning network can output, and the Q value represents the mapping from states to action values in the reinforcement learning network. All action combinations of the action library are traversed and the contribution value of each action combination (real-time action) is obtained; each time an action combination is obtained, its Q value is computed from the contribution value of the current state and the contribution value of the action combination, so that the Q value of every action combination can be calculated and the maximum Q value of the current state of the reinforcement learning network can be obtained.
As an example, FIG. 3 shows a preferred storage structure of the action library. The action library is divided into an action memory module and a real-time action memory module. The action memory module stores the information of all actions, specifically the number of action dimensions n and, for each action dimension, its step value, maximum value, and starting value. The real-time action memory module stores the action information to be output, specifically the action value of each of the n action dimensions. For example, in a reinforcement learning network for autonomous driving, the actions may include turning left (first dimension), turning right (second dimension), and braking (third dimension), with corresponding action values (1, a), (2, b), and (3, c), where 1, 2, and 3 denote the action dimensions (the first, second, and third dimensions) and a, b, and c are the measurement values of the first-, second-, and third-dimensional actions, respectively.
Preferably, when traversing the action combinations of the pre-built action library, the starting values of the preset number of dimensional actions on the preset action list in the action library are first set, in order, as the preset number of real-time action values on the preset real-time action table in the action library. The step value of the preset first-dimensional action is then obtained from the preset action list and successively accumulated onto the real-time action value corresponding to the preset first-dimensional action; when the accumulated real-time action value goes outside the range corresponding to the preset first-dimensional action, the step value of the preset second-dimensional action is obtained from the preset action list and successively accumulated onto the real-time action value corresponding to the preset second-dimensional action, so that the contribution value of every real-time action to the learning network is computed quickly and accurately. The preset first-dimensional action and the preset second-dimensional action are both single dimensions among the preset number of dimensional actions.
In step S104, the current action of the reinforcement learning network is obtained and executed according to the maximum Q value of the current state so as to obtain the next state of the reinforcement learning network; the maximum Q value of the next state is obtained, and the target Q value of the current state is obtained through the maximum Q value of the next state, the reward value of the current state, and a preset target value formula.
In the embodiment of the present invention, the current action is the action that the reinforcement learning network needs to perform in the current state. The preset target value formula is Target_Q(s, a; θ) = r(s) + γ·maxQ(s', a'; θ), where Target_Q(s, a; θ) is the target Q value of the current state, s is the current state, a is the current action, r(s) is the reward value of the current state, γ is the discount factor, θ denotes the network parameters, and maxQ(s', a'; θ) is the maximum Q value of the next state. Specifically, following a greedy policy, the current action of the reinforcement learning network is obtained from the maximum Q value of the current state and executed, and the network enters the next state; the methods of steps S102 and S103 are then repeated to obtain the maximum Q value of the next state, and the target Q value of the current state is obtained through the preset target value formula.
Preferably, after the target Q value of the current state is obtained, the current state, the current action, the reward value of the current state, and the next state are stored as a training sample, thereby speeding up the subsequent convergence process.
Preferably, the reinforcement learning network training device contains two processors, one of which is an AI chip whose architecture lies between an ASIC (Application Specific Integrated Circuit) and an FPGA (Field-Programmable Gate Array). This AI chip handles the part of the training process that makes decisions based on the current state and responds with the current action, thereby improving the training speed of the reinforcement learning network by increasing the memory access bandwidth.
In step S105, a loss function of the reinforcement learning network is generated according to the target Q value of the current state, and the network parameters are adjusted by a preset adjustment algorithm to continue training the reinforcement learning network until the loss function converges.
In the embodiment of the present invention, after the target Q value of the current state is obtained, the loss function of the reinforcement learning network is generated. Specifically, the loss function is L(θ) = E[(Target_Q(s, a; θ) - Q(s, a; θ))²], where Target_Q(s, a; θ) is the target Q value of the current state, E denotes the expectation (mean squared error), Q(s, a; θ) is the real-time Q value, s is the current state, a is the current action, and θ denotes the network parameters. The network parameters are then adjusted through the preset adjustment algorithm to continue training the network until the loss function converges, which finally completes the training of the reinforcement learning network. Specifically, the preset adjustment algorithm is the SGD (stochastic gradient descent) algorithm.
In the embodiment of the present invention, when a request for training a reinforcement learning network is received, the network parameters of the reinforcement learning network are set to perform weight configuration, and the current state of the reinforcement learning network is obtained together with the reward value and contribution value of the current state. The action combinations of the action library are traversed to obtain the maximum Q value of the action combinations in the current state, the current action is obtained from the maximum Q value of the current state and executed, and the target Q value of the current state is obtained by obtaining the maximum Q value of the next state. The loss function of the reinforcement learning network is then generated and the network parameters are adjusted through the preset adjustment algorithm to continue training the reinforcement learning network until the loss function converges, which reduces the amount of computation needed to train the reinforcement learning network, speeds up training, and improves training efficiency.
Embodiment 2:
FIG. 4 shows the structure of a training apparatus for a reinforcement learning network provided in Embodiment 2 of the present invention. For ease of description, only the parts related to the embodiment of the present invention are shown, including:
a parameter setting unit 41, configured to set the network parameters of the reinforcement learning network when a request to train the reinforcement learning network is received, so as to configure the weights of the reinforcement learning network;
a matching acquisition unit 42, configured to acquire the current state of the reinforcement learning network, match the current state in a pre-built state reward library, and obtain the reward value and contribution value of the current state;
a traversal acquisition unit 43, configured to traverse the action combinations of a pre-built action library, obtain the contribution values of the action combinations, and obtain the maximum Q value of the current state of the reinforcement learning network from the contribution value of the current state and the contribution values of the action combinations;
an execution acquisition unit 44, configured to obtain the current action of the reinforcement learning network from the maximum Q value of the current state and execute it so as to obtain the next state of the reinforcement learning network, obtain the maximum Q value of the next state, and obtain the target Q value of the current state from the maximum Q value of the next state, the reward value of the current state and a preset target value formula; and
a generation and adjustment unit 45, configured to generate the loss function of the reinforcement learning network according to the target Q value of the current state, and adjust the network parameters by a preset adjustment algorithm so as to continue training the network until the loss function converges.
In the embodiment of the present invention, when a request to train the reinforcement learning network is received, the network parameters of the reinforcement learning network are set to configure its weights; the current state of the reinforcement learning network is obtained, together with the reward value and contribution value of the current state; the action combinations of the action library are traversed to obtain the maximum Q value of the action combinations in the current state; the current action is obtained from the maximum Q value of the current state and executed; the target Q value of the current state is obtained from the maximum Q value of the next state; the loss function of the reinforcement learning network is generated, and the network parameters are adjusted by a preset adjustment algorithm so as to continue training the reinforcement learning network until the loss function converges. This reduces the amount of computation needed to train the reinforcement learning network, which in turn speeds up training and improves training efficiency.
In the embodiment of the present invention, each unit of the training apparatus for the reinforcement learning network may be implemented by a corresponding hardware or software unit. Each unit may be an independent software or hardware unit, or the units may be integrated into a single software or hardware unit, which is not intended to limit the present invention. For the specific implementation of each unit, reference may be made to the description of Embodiment 1, and details are not repeated here.
Embodiment 3:
FIG. 5 shows the structure of a training apparatus for a reinforcement learning network provided in Embodiment 3 of the present invention. For ease of description, only the parts related to the embodiment of the present invention are shown, including:
a parameter setting unit 51, configured to set the network parameters of the reinforcement learning network when a request to train the reinforcement learning network is received, so as to configure the weights of the reinforcement learning network;
a matching acquisition unit 52, configured to acquire the current state of the reinforcement learning network, match the current state in a pre-built state reward library, and obtain the reward value and contribution value of the current state;
a traversal acquisition unit 53, configured to traverse the action combinations of a pre-built action library, obtain the contribution values of the action combinations, and obtain the maximum Q value of the current state of the reinforcement learning network from the contribution value of the current state and the contribution values of the action combinations;
an execution acquisition unit 54, configured to obtain the current action of the reinforcement learning network from the maximum Q value of the current state and execute it so as to obtain the next state of the reinforcement learning network, obtain the maximum Q value of the next state, and obtain the target Q value of the current state from the maximum Q value of the next state, the reward value of the current state and a preset target value formula;
an experience storage unit 55, configured to store the current state, the current action, the reward value of the current state and the next state as a training sample; and
a generation and adjustment unit 56, configured to generate the loss function of the reinforcement learning network according to the target Q value of the current state, and adjust the network parameters by a preset adjustment algorithm so as to continue training the network until the loss function converges.
The matching acquisition unit 52 includes:
a matching subunit 521, configured to match the current state against all state nodes corresponding to a preset number of reward groups in the state reward library; and
a state value setting unit 522, configured to set the reward value of a preset state reward group as the reward value of the current state when the current state lies in a preset state node of the preset number of reward groups, and otherwise set the reward value of the current state to a preset general state reward value.
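As a rough illustration of how such a matching subunit could behave, the following sketch assumes the reward groups are given as pairs of state-node sets and reward values; this layout and the default value are assumptions made only for the example, not the patent's data structure.

DEFAULT_REWARD = 0.0  # stands in for the "preset general state reward value"

def match_state_reward(state, reward_groups):
    # reward_groups: iterable of (state_nodes, reward_value) pairs, one per reward group.
    for state_nodes, reward_value in reward_groups:
        if state in state_nodes:      # the current state lies in a preset state node
            return reward_value       # use that reward group's reward value
    return DEFAULT_REWARD             # otherwise fall back to the general state reward value

# Example: match_state_reward(3, [({1, 2, 3}, 10.0), ({7, 8}, -5.0)]) returns 10.0.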
The traversal acquisition unit 53 includes:
a start value setting unit 531, configured to set the start values of a preset number of dimensional actions in a preset action list of the action library, in turn, as the preset number of real-time action values in a preset real-time action table of the action library;
a first accumulation unit 532, configured to obtain the step value of a preset first-dimensional action in the preset action list, and successively accumulate the step value of the preset first-dimensional action onto the real-time action value corresponding to the preset first-dimensional action; and
a second accumulation unit 533, configured to, when the corresponding real-time action value has been accumulated beyond the range corresponding to the preset first-dimensional action, obtain the step value of a preset second-dimensional action in the preset action list, and successively accumulate the step value of the preset second-dimensional action onto the real-time action value corresponding to the preset second-dimensional action.
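A minimal sketch of this two-dimensional traversal is given below, assuming each dimension is described by a start value, a step value and an upper bound; these parameter names and the example numbers are illustrative only.

def traverse_actions(start, step, upper):
    # start, step, upper: (dim1, dim2) tuples taken from the preset action list.
    a1, a2 = start                # real-time action values initialised to the start values
    while a2 <= upper[1]:
        yield (a1, a2)
        a1 += step[0]             # accumulate the first-dimensional step value
        if a1 > upper[0]:         # accumulated beyond the first dimension's range
            a1 = start[0]
            a2 += step[1]         # accumulate the second-dimensional step value

# Example: list(traverse_actions((0, 0), (1, 1), (2, 1)))
# yields (0,0), (1,0), (2,0), (0,1), (1,1), (2,1).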
In the embodiment of the present invention, when a request to train the reinforcement learning network is received, the network parameters of the reinforcement learning network are set to configure its weights; the current state of the reinforcement learning network is obtained, together with the reward value and contribution value of the current state; the action combinations of the action library are traversed to obtain the maximum Q value of the action combinations in the current state; the current action is obtained from the maximum Q value of the current state and executed; the target Q value of the current state is obtained from the maximum Q value of the next state; the loss function of the reinforcement learning network is generated, and the network parameters are adjusted by a preset adjustment algorithm so as to continue training the reinforcement learning network until the loss function converges. This reduces the amount of computation needed to train the reinforcement learning network, which in turn speeds up training and improves training efficiency.
In the embodiment of the present invention, each unit of the training apparatus for the reinforcement learning network may be implemented by a corresponding hardware or software unit. Each unit may be an independent software or hardware unit, or the units may be integrated into a single software or hardware unit, which is not intended to limit the present invention. For the specific implementation of each unit, reference may be made to the description of Embodiment 1, and details are not repeated here.
Embodiment 4:
FIG. 6 shows the structure of a reinforcement learning network training device provided in Embodiment 4 of the present invention. For ease of description, only the parts related to the embodiment of the present invention are shown, including:
The reinforcement learning network training device 6 of the embodiment of the present invention includes a processor 61, a memory 62, and a computer program 63 stored in the memory 62 and executable on the processor 61. When the processor 61 executes the computer program 63, the steps in the above embodiments of the training method of the reinforcement learning network are implemented, for example, steps S101 to S105 shown in FIG. 1. Alternatively, when the processor 61 executes the computer program 63, the functions of the units in the above embodiments of the training apparatus of the reinforcement learning network are implemented, for example, the functions of units 41 to 45 shown in FIG. 4 and of units 51 to 56 shown in FIG. 5.
FIG. 7 is a schematic diagram of a preferred structure of the reinforcement learning network training device. Preferably, the reinforcement learning network training device 7 includes a first processor 711, a second processor 712, a first memory 721, a second memory 722, and a computer program 73 stored in the first memory 721 and the second memory 722; the computer program 73 can run on the first processor 711 and the second processor 712. Specifically, the first processor 711 is an ASIC (Application-Specific Integrated Circuit) chip, which improves the efficiency of the learning network and reduces power consumption. When the first processor 711 executes the computer program 73, it implements steps of the above embodiment of the training method of the reinforcement learning network, for example steps S101 to S103 shown in FIG. 1, and when the second processor 712 executes the computer program 73, it implements steps of the above embodiment of the training method, for example steps S104 to S105 shown in FIG. 1. Alternatively, when the first processor 711 executes the computer program 73, it implements the functions of units in the above embodiments of the training apparatus of the reinforcement learning network, for example units 41 to 43 shown in FIG. 4 and units 51 to 53 shown in FIG. 5, and when the second processor 712 executes the computer program 73, it implements the functions of units 44 to 45 shown in FIG. 4 and units 54 to 56 shown in FIG. 5.
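One possible way to picture this division of work between the two processors is the following sketch; the queue-based hand-off and the function names are assumptions made for illustration only and do not describe the device's actual bus or interface.

import queue

hand_off = queue.Queue()

def first_processor(decide_action, states):      # corresponds to steps S101 to S103
    for state in states:
        action, reward, max_q = decide_action(state)   # reward matching, action traversal, max Q
        hand_off.put((state, action, reward, max_q))
    hand_off.put(None)                                  # signal the end of the run

def second_processor(update_parameters):          # corresponds to steps S104 and S105
    while True:
        item = hand_off.get()
        if item is None:
            break
        update_parameters(*item)                        # target Q, loss function, SGD update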
In the embodiment of the present invention, when the processor executes the computer program and a request to train the reinforcement learning network is received, the network parameters of the reinforcement learning network are set to configure its weights; the current state of the reinforcement learning network is obtained, together with the reward value and contribution value of the current state; the action combinations of the action library are traversed to obtain the maximum Q value of the action combinations in the current state; the current action is obtained from the maximum Q value of the current state and executed; the target Q value of the current state is obtained from the maximum Q value of the next state; the loss function of the reinforcement learning network is generated, and the network parameters are adjusted by a preset adjustment algorithm so as to continue training the reinforcement learning network until the loss function converges. This reduces the amount of computation needed to train the reinforcement learning network, which in turn speeds up training and improves training efficiency.
For the steps of the above embodiments of the training method of the reinforcement learning network implemented when the processor executes the computer program, reference may be made to the description of Embodiment 1, and details are not repeated here.
Embodiment 5:
In the embodiment of the present invention, a computer-readable storage medium is provided. The computer-readable storage medium stores a computer program, and when the computer program is executed by a processor, the steps in the above embodiments of the training method of the reinforcement learning network are implemented, for example, steps S101 to S105 shown in FIG. 1. Alternatively, when the computer program is executed by a processor, the functions of the units in the above embodiments of the training apparatus of the reinforcement learning network are implemented, for example, the functions of units 41 to 45 shown in FIG. 4 and of units 51 to 56 shown in FIG. 5.
In the embodiment of the present invention, after the computer program is executed by the processor and a request to train the reinforcement learning network is received, the network parameters of the reinforcement learning network are set to configure its weights; the current state of the reinforcement learning network is obtained, together with the reward value and contribution value of the current state; the action combinations of the action library are traversed to obtain the maximum Q value of the action combinations in the current state; the current action is obtained from the maximum Q value of the current state and executed; the target Q value of the current state is obtained from the maximum Q value of the next state; the loss function of the reinforcement learning network is generated, and the network parameters are adjusted by a preset adjustment algorithm so as to continue training the reinforcement learning network until the loss function converges. This reduces the amount of computation needed to train the reinforcement learning network, which in turn speeds up training and improves training efficiency.
The computer-readable storage medium of the embodiment of the present invention may include any entity or apparatus capable of carrying computer program code, or a storage medium, for example, a memory such as a ROM/RAM, a magnetic disk, an optical disc or a flash memory.
The above description is only the preferred embodiments of the present invention and is not intended to limit the present invention. Any modification, equivalent replacement or improvement made within the spirit and principles of the present invention shall be included within the scope of protection of the present invention.

Claims (10)

  1. A training method for a reinforcement learning network, characterized in that the method comprises the following steps:
    when a request to train the reinforcement learning network is received, setting network parameters of the reinforcement learning network so as to configure weights of the reinforcement learning network;
    acquiring a current state of the reinforcement learning network, matching the current state in a pre-built state reward library, and obtaining a reward value and a contribution value of the current state;
    traversing action combinations of a pre-built action library, obtaining contribution values of the action combinations, and obtaining a maximum Q value of the current state of the reinforcement learning network according to the contribution value of the current state and the contribution values of the action combinations;
    obtaining a current action of the reinforcement learning network according to the maximum Q value of the current state and executing the current action so that the reinforcement learning network enters a next state, obtaining a maximum Q value of the next state, and obtaining a target Q value of the current state by means of the maximum Q value of the next state, the reward value of the current state and a preset target value formula;
    generating a loss function of the reinforcement learning network according to the target Q value of the current state, and adjusting the network parameters by a preset adjustment algorithm so as to continue training the reinforcement learning network until the loss function converges.
  2. The method according to claim 1, characterized in that the step of matching the current state of the reinforcement learning network in the pre-built state reward library comprises:
    matching the current state against all state nodes corresponding to a preset number of reward groups in the state reward library;
    when the current state lies in a preset state node of the preset number of reward groups, setting a reward value of the preset state reward group as the reward value of the current state, and otherwise setting the reward value of the current state to a preset general state reward value.
  3. The method according to claim 1, characterized in that the step of traversing the action combinations of the pre-built action library comprises:
    setting start values of a preset number of dimensional actions in a preset action list of the action library, in turn, as a preset number of real-time action values in a preset real-time action table of the action library;
    obtaining a step value of a preset first-dimensional action in the preset action list, and successively accumulating the step value of the preset first-dimensional action onto the real-time action value corresponding to the preset first-dimensional action;
    when the corresponding real-time action value has been accumulated beyond a range corresponding to the preset first-dimensional action, obtaining a step value of a preset second-dimensional action in the preset action list, and successively accumulating the step value of the preset second-dimensional action onto the real-time action value corresponding to the preset second-dimensional action.
  4. The method according to claim 1, characterized in that, after the step of obtaining the target Q value of the current state, the method further comprises:
    storing the current state, the current action, the reward value of the current state and the next state as a training sample.
  5. A training apparatus for a reinforcement learning network, characterized in that the apparatus comprises:
    a parameter setting unit, configured to set network parameters of the reinforcement learning network when a request to train the reinforcement learning network is received, so as to configure weights of the reinforcement learning network;
    a matching acquisition unit, configured to acquire a current state of the reinforcement learning network, match the current state in a pre-built state reward library, and obtain a reward value and a contribution value of the current state;
    a traversal acquisition unit, configured to traverse action combinations of a pre-built action library, obtain contribution values of the action combinations, and obtain a maximum Q value of the current state of the reinforcement learning network according to the contribution value of the current state and the contribution values of the action combinations;
    an execution acquisition unit, configured to obtain a current action of the reinforcement learning network according to the maximum Q value of the current state and execute the current action so that the reinforcement learning network enters a next state, obtain a maximum Q value of the next state, and obtain a target Q value of the current state by means of the maximum Q value of the next state, the reward value of the current state and a preset target value formula; and
    a generation and adjustment unit, configured to generate a loss function of the reinforcement learning network according to the target Q value of the reinforcement learning network, and adjust the network parameters of the reinforcement learning network by a preset adjustment algorithm so as to continue training the reinforcement learning network until the loss function converges.
  6. The apparatus according to claim 5, characterized in that the matching acquisition unit comprises:
    a matching subunit, configured to match the current state against all state nodes corresponding to a preset number of reward groups in the state reward library; and
    a state value setting unit, configured to set a reward value of a preset state reward group as the reward value of the current state when the current state lies in a preset state node of the preset number of reward groups, and otherwise set the reward value of the current state to a preset general state reward value.
  7. The apparatus according to claim 5, characterized in that the traversal acquisition unit comprises:
    a start value setting unit, configured to set start values of a preset number of dimensional actions in a preset action list of the action library, in turn, as a preset number of real-time action values in a preset real-time action table of the action library;
    a first accumulation unit, configured to obtain a step value of a preset first-dimensional action in the preset action list, and successively accumulate the step value of the preset first-dimensional action onto the real-time action value corresponding to the preset first-dimensional action; and
    a second accumulation unit, configured to, when the corresponding real-time action value has been accumulated beyond a range corresponding to the preset first-dimensional action, obtain a step value of a preset second-dimensional action in the preset action list, and successively accumulate the step value of the preset second-dimensional action onto the real-time action value corresponding to the preset second-dimensional action.
  8. The apparatus according to claim 5, characterized in that the apparatus further comprises:
    an experience storage unit, configured to store the current state, the current action, the reward value of the current state and the next state as a training sample.
  9. A reinforcement learning network training device, comprising a memory, a processor, and a computer program stored in the memory and executable on the processor, characterized in that, when the processor executes the computer program, the steps of the method according to any one of claims 1 to 4 are implemented.
  10. A computer-readable storage medium storing a computer program, characterized in that, when the computer program is executed by a processor, the steps of the method according to any one of claims 1 to 4 are implemented.
PCT/CN2018/099256 2018-08-07 2018-08-07 Reinforcement learning network training method, apparatus and device, and storage medium WO2020029095A1 (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
PCT/CN2018/099256 WO2020029095A1 (en) 2018-08-07 2018-08-07 Reinforcement learning network training method, apparatus and device, and storage medium

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
PCT/CN2018/099256 WO2020029095A1 (en) 2018-08-07 2018-08-07 Reinforcement learning network training method, apparatus and device, and storage medium

Publications (1)

Publication Number Publication Date
WO2020029095A1 true WO2020029095A1 (en) 2020-02-13

Family

ID=69415170

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/CN2018/099256 WO2020029095A1 (en) 2018-08-07 2018-08-07 Reinforcement learning network training method, apparatus and device, and storage medium

Country Status (1)

Country Link
WO (1) WO2020029095A1 (en)

Patent Citations (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN105637540A (en) * 2013-10-08 2016-06-01 谷歌公司 Methods and apparatus for reinforcement learning
CN105955930A (en) * 2016-05-06 2016-09-21 天津科技大学 Guidance-type policy search reinforcement learning algorithm
WO2018053187A1 (en) * 2016-09-15 2018-03-22 Google Inc. Deep reinforcement learning for robotic manipulation
CN108288094A (en) * 2018-01-31 2018-07-17 清华大学 Deeply learning method and device based on ambient condition prediction

Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
LIN, HONG ET AL.: "Local Semantic Concept-based Human Action Recognition", INFORMATION TECHNOLOGY, no. 12, 31 December 2015 (2015-12-31), XP055683164 *


Legal Events

Date Code Title Description
121 Ep: the epo has been informed by wipo that ep was designated in this application

Ref document number: 18929060

Country of ref document: EP

Kind code of ref document: A1

NENP Non-entry into the national phase

Ref country code: DE

32PN Ep: public notification in the ep bulletin as address of the adressee cannot be established

Free format text: NOTING OF LOSS OF RIGHTS PURSUANT TO RULE 112(1) EPC (EPO FORM 1205A DATED 12.07.2021)

122 Ep: pct application non-entry in european phase

Ref document number: 18929060

Country of ref document: EP

Kind code of ref document: A1