WO2020029095A1 - Reinforcement learning network training method, apparatus and device, and storage medium - Google Patents

Reinforcement learning network training method, apparatus and device, and storage medium Download PDF

Info

Publication number
WO2020029095A1
WO2020029095A1 (PCT/CN2018/099256)
Authority
WO
WIPO (PCT)
Prior art keywords
value
action
reinforcement learning
preset
learning network
Prior art date
Application number
PCT/CN2018/099256
Other languages
French (fr)
Chinese (zh)
Inventor
王峥 (Wang Zheng)
梁明兰 (Liang Minglan)
Original Assignee
中国科学院深圳先进技术研究院 (Shenzhen Institutes of Advanced Technology, Chinese Academy of Sciences)
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Shenzhen Institutes of Advanced Technology, Chinese Academy of Sciences (中国科学院深圳先进技术研究院)
Priority to PCT/CN2018/099256
Publication of WO2020029095A1


Classifications

    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06N - COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 - Computing arrangements based on biological models
    • G06N3/02 - Neural networks
    • G06N3/08 - Learning methods

Definitions

  • the invention belongs to the field of machine learning, and particularly relates to a training method, a device, a training device, and a storage medium for a reinforcement learning network.
  • Reinforcement learning, also known as re-incentive learning or evaluative learning, is an important machine learning method in which an agent learns a mapping from environment states to actions so as to maximize the value of the reward (reinforcement) signal function. Reinforcement learning differs from supervised learning in connectionist learning mainly in the teacher signal: the reinforcement signal provided by the environment evaluates how good the produced action is (usually as a scalar signal) rather than telling the reinforcement learning system (RLS) how to produce the correct action. Because the external environment provides little information, the RLS must learn from its own experience. In this way, the RLS gains knowledge in an action-evaluation environment and improves its action plan to suit the environment. Reinforcement learning has many applications in areas such as intelligent robot control and analysis and prediction.
  • reinforcement learning has been widely used in robot control, computer vision, natural language processing, game theory, and autonomous driving.
  • the process of training a reinforcement learning network is usually implemented on CPU and GPU devices and involves a very large amount of computation. In practical applications this occupies substantial resources, runs slowly, and is inefficient, and the limitation of memory access bandwidth prevents the computing capability from being improved further.
  • the purpose of the present invention is to provide a training method, apparatus, training device, and storage medium for a reinforcement learning network, aiming to solve the problem that the prior art cannot provide an effective training method for reinforcement learning networks, which results in a large amount of training computation and low efficiency.
  • the present invention provides a method for training a reinforcement learning network, which includes the following steps:
  • the present invention provides a training device for a reinforcement learning network, the device includes:
  • a parameter setting unit configured to set network parameters of the reinforcement learning network when receiving a request for training the reinforcement learning network to perform weight configuration on the reinforcement learning network
  • a matching obtaining unit configured to obtain the current state of the reinforcement learning network, match the current state in a pre-built state reward library, and obtain a reward value and a contribution value of the current state;
  • a traversal obtaining unit, configured to traverse the action combinations of a pre-built action library, obtain the contribution value of each action combination, and obtain the maximum Q value of the current state of the reinforcement learning network according to the contribution value of the current state and the contribution value of the action combination;
  • an execution obtaining unit, configured to obtain and execute the current action of the reinforcement learning network according to the maximum Q value of the current state so that the reinforcement learning network enters the next state, obtain the maximum Q value of the next state, and obtain the target Q value of the current state through the maximum Q value of the next state, the reward value of the current state, and a preset target value formula; and
  • a generation and adjustment unit, configured to generate a loss function of the reinforcement learning network according to the target Q value of the reinforcement learning network, and to adjust the network parameters of the reinforcement learning network through a preset adjustment algorithm to continue training the reinforcement learning network until the loss function converges.
  • the present invention also provides a reinforcement learning network training device, including a memory, a processor, and a computer program stored in the memory and executable on the processor, where the processor, when executing the computer program, implements the steps of the training method for a reinforcement learning network described above.
  • the present invention also provides a computer-readable storage medium that stores a computer program that, when executed by a processor, implements the steps of the training method for a reinforcement learning network as described above.
  • when a request for training a reinforcement learning network is received, the present invention sets the network parameters of the reinforcement learning network to perform weight configuration, obtains the current state of the reinforcement learning network together with the reward value and contribution value of the current state, traverses the action combinations of the action library to obtain the maximum Q value of the action combinations in the current state, obtains and executes the current action according to the maximum Q value of the current state, obtains the target Q value of the current state by obtaining the maximum Q value of the next state, generates the loss function of the reinforcement learning network, and adjusts the network parameters through a preset adjustment algorithm to continue training the reinforcement learning network until the loss function converges, thereby reducing the amount of computation needed to train the reinforcement learning network, speeding up training, and improving training efficiency.
  • FIG. 1 is an implementation flowchart of a training method for a reinforcement learning network according to Embodiment 1 of the present invention.
  • FIG. 2 is a schematic diagram of a preferred storage structure of a state reward library provided in Embodiment 1 of the present invention.
  • FIG. 3 is a schematic diagram of a preferred storage structure of an action library provided by Embodiment 1 of the present invention.
  • FIG. 4 is a schematic structural diagram of a training device for a reinforcement learning network according to a second embodiment of the present invention.
  • FIG. 5 is a schematic structural diagram of a training device for a reinforcement learning network according to a third embodiment of the present invention.
  • FIG. 6 is a schematic structural diagram of a reinforcement learning network training device according to a fourth embodiment of the present invention.
  • FIG. 7 is a schematic diagram of a preferred structure of a reinforcement learning network training device provided in Embodiment 4 of the present invention.
  • FIG. 1 shows an implementation process of a training method for a reinforcement learning network provided in Embodiment 1 of the present invention. For ease of description, only parts related to the embodiment of the present invention are shown, and the details are as follows:
  • In step S101, when a request for training a reinforcement learning network is received, network parameters of the reinforcement learning network are set to perform weight configuration on the reinforcement learning network.
  • the embodiments of the present invention are applicable to reinforcement learning network training equipment, for example, training equipment such as MATLAB (Matrix Laboratory).
  • the network parameters of the reinforcement learning network are set to configure its weights. Specifically, the network parameters are written first; when network operations are performed, the written parameters activate the computation mode of the corresponding neurons of the reinforcement learning network. In this way the parameters of every neuron in every layer of the network are configured, so that data is processed in parallel and data processing efficiency is improved.
  • In step S102, the current state of the reinforcement learning network is acquired and matched in a pre-built state reward library to obtain the reward value and contribution value of the current state.
  • the state reward library is a pre-built set storing state nodes and their corresponding reward values. After the training request is received, the current state of the reinforcement learning network is acquired and its feature data is extracted; the contribution value of the current state is computed from this feature data, and the current state is then matched in the state reward library to obtain its reward value.
  • FIG. 2 shows a preferred storage structure of the state reward library.
  • the state reward library is divided into n reward groups, which correspond to the reward values of n special states. The number of reward groups, n, is stored at the beginning of the data, and the end of the library stores the general-state reward value, i.e. the (n + 1)-th reward value.
  • each reward group contains different state nodes, i.e. different state values, and different state nodes correspond to different ranges of state values.
  • when the current state is matched in the pre-built state reward library, it is compared against all state nodes of the preset number of reward groups in the state reward library. When the current state falls within a preset state node of one of these reward groups, the reward value of that reward group is set as the reward value of the current state; otherwise the reward value of the current state is set to the preset general-state reward value, so that the immediate reward of the current state is obtained quickly.
  • since the current state can lie in at most one state node, or outside all state nodes, the state nodes may be matched one by one: once the current state matches a preset state node, matching of the other state nodes stops and the reward value of that node is taken as the reward of the current state; if no node matches after all state nodes have been tried, the general-state reward value is used.
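  • As an illustrative aid (not part of the original patent text), the following minimal Python sketch shows one way the one-by-one state-node matching described above could be realized; the tuple layout of the reward groups and the names match_state_reward, reward_groups, and general_reward are assumptions made for the example, not the patent's actual memory format.
```python
# Hypothetical sketch of matching a state against a state reward library.
# Each reward group holds a reward value and a list of (low, high) state-node
# ranges; a general-state reward is used when no node matches.

def match_state_reward(state_value, reward_groups, general_reward):
    """Return the immediate reward for state_value.

    reward_groups: list of (reward, [(low, high), ...]) tuples, one per group.
    general_reward: the (n + 1)-th reward returned when no node matches.
    """
    for reward, nodes in reward_groups:          # match groups one by one
        for low, high in nodes:                  # match state nodes one by one
            if low <= state_value <= high:       # state lies inside this node
                return reward                    # stop at the first match
    return general_reward                        # outside all nodes


# Example: 2 special reward groups plus a general reward of -0.1
groups = [
    (1.0, [(0.0, 0.2), (0.8, 1.0)]),   # group 1: goal-like states
    (-1.0, [(0.45, 0.55)]),            # group 2: penalized states
]
print(match_state_reward(0.9, groups, general_reward=-0.1))   # -> 1.0
print(match_state_reward(0.3, groups, general_reward=-0.1))   # -> -0.1
```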
  • In step S103, the action combinations of a pre-built action library are traversed to obtain the contribution value of each action combination, and the maximum Q value of the current state of the reinforcement learning network is obtained according to the contribution value of the current state and the contribution value of the action combination.
  • the action library is a pre-built set storing all actions that the reinforcement learning network can output.
  • the Q value represents the mapping from states to action values in the reinforcement learning network.
  • all action combinations of the action library are traversed, and the contribution value of each action combination (real-time action) is obtained.
  • each time an action combination is obtained, its Q value is computed from the contribution value of the current state and the contribution value of the action combination, so that the Q value of every action combination can be calculated and the maximum Q value of the current state of the reinforcement learning network can be obtained.
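  • The patent does not spell out how the state contribution and the action contribution combine into a Q value. As a hedged illustration only, the sketch below assumes the Q value of an action combination is simply the sum of the two contributions and keeps a running maximum while traversing; the function names and the additive combination are assumptions.
```python
# Hypothetical sketch of obtaining the maximum Q value of the current state by
# traversing action combinations; combining contributions by addition is an
# assumption made only for illustration.

def max_q_of_state(state_contribution, action_combinations, action_contribution):
    """Return (max Q value, best action) over all action combinations."""
    best_q, best_action = float("-inf"), None
    for action in action_combinations:
        q = state_contribution + action_contribution(action)   # per-combination Q value
        if q > best_q:
            best_q, best_action = q, action
    return best_q, best_action
```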
  • FIG. 3 shows a preferred storage structure of the action library.
  • the action library is divided into an action memory module and a real-time action memory module.
  • the action memory module stores the information of all actions, specifically the number of action dimensions n and, for each action dimension, its step value, maximum value, and starting value.
  • the real-time action memory module stores the action information to be output, specifically the action value of each of the n action dimensions.
  • for example, in a reinforcement learning network for autonomous driving, the actions may include turning left (first dimension), turning right (second dimension), and braking (third dimension), with corresponding action values (1, a), (2, b), and (3, c), where 1, 2, and 3 denote the action dimensions (the first, second, and third dimensions) and a, b, and c are the measurement values of the first-, second-, and third-dimensional actions, respectively.
  • when traversing the action combinations of the pre-built action library, the starting values of the preset number of dimensional actions on the preset action list in the action library are first set, in order, as the preset number of real-time action values on the preset real-time action table in the action library.
  • the step value of the preset first-dimensional action is then obtained from the preset action list and successively accumulated onto the real-time action value corresponding to the preset first-dimensional action; when the accumulated real-time action value goes outside the range corresponding to the preset first-dimensional action, the step value of the preset second-dimensional action is obtained from the preset action list and successively accumulated onto the real-time action value corresponding to the preset second-dimensional action, so that the contribution value of every real-time action to the learning network is computed quickly and accurately.
  • the preset first-dimensional action and the preset second-dimensional action are both single dimensions among the preset number of dimensional actions.
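  • The accumulation described above behaves like an odometer over the action dimensions: the first dimension is stepped until it leaves its range, at which point the next dimension is stepped. The sketch below enumerates every action combination from per-dimension (start, step, maximum) triples; resetting a dimension to its starting value when it overflows is an assumption made to keep the enumeration complete, and the data layout is illustrative rather than the patent's memory modules.
```python
# Hypothetical sketch of traversing every action combination stored as
# per-dimension (start, step, maximum) triples, mimicking the "accumulate the
# first dimension, carry into the second when it overflows" procedure.

def traverse_actions(action_list):
    """Yield every action combination.

    action_list: per-dimension (start, step, maximum) triples.
    """
    real_time = [start for start, _, _ in action_list]   # real-time action table
    while True:
        yield tuple(real_time)                            # one action combination
        dim = 0
        while dim < len(action_list):
            start, step, maximum = action_list[dim]
            real_time[dim] += step                        # accumulate this dimension
            if real_time[dim] <= maximum:
                break                                     # still inside its range
            real_time[dim] = start                        # reset and carry to next dim
            dim += 1
        else:
            return                                        # every dimension overflowed


# Example: steering in {-1, 0, 1} and braking in {0, 0.5, 1}
for action in traverse_actions([(-1, 1, 1), (0, 0.5, 1)]):
    print(action)
```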
  • In step S104, the current action of the reinforcement learning network is obtained and executed according to the maximum Q value of the current state so as to obtain the next state of the reinforcement learning network; the maximum Q value of the next state is obtained, and the target Q value of the current state is obtained through the maximum Q value of the next state, the reward value of the current state, and a preset target value formula.
  • the current action is the action that the reinforcement learning network needs to perform in the current state. The preset target value formula is Target_Q(s, a; θ) = r(s) + γ·maxQ(s', a'; θ), where Target_Q(s, a; θ) is the target Q value of the current state, s is the current state, a is the current action, r(s) is the reward value of the current state, γ is the discount factor, θ denotes the network parameters, and maxQ(s', a'; θ) is the maximum Q value of the next state.
  • Specifically, following a greedy policy, the current action of the reinforcement learning network is obtained from the maximum Q value of the current state and executed, and the network enters the next state. The methods of steps S102 and S103 are then repeated to obtain the maximum Q value of the next state, and the target Q value of the current state is obtained through the preset target value formula.
  • preferably, after the target Q value of the current state is obtained, the current state, the current action, the reward value of the current state, and the next state are stored as a training sample, thereby speeding up the subsequent convergence process.
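  • As a rough, non-authoritative illustration of steps S102 to S104, the sketch below combines the greedy action choice, the target value formula Target_Q(s, a; θ) = r(s) + γ·maxQ(s', a'; θ), and the storage of (state, action, reward, next state) training samples; the q_value, env_step, and reward_of callables, the buffer size, and the discount value are placeholders assumed for the example.
```python
# Hypothetical sketch of one training transition: choose the action with the
# largest Q value, execute it, and form the target value
#   Target_Q(s, a; θ) = r(s) + γ · max_a' Q(s', a'; θ),
# storing the transition (s, a, r, s') as a training sample.
from collections import deque

replay_buffer = deque(maxlen=10000)   # stored (state, action, reward, next_state) samples
GAMMA = 0.9                           # discount factor γ (assumed value)

def training_step(q_value, env_step, reward_of, state, actions):
    """q_value(s, a): real-time Q estimate; env_step(s, a): returns the next state;
    reward_of(s): immediate reward looked up in the state reward library."""
    # Step S103: maximum Q value of the current state over all action combinations
    current_action = max(actions, key=lambda a: q_value(state, a))
    reward = reward_of(state)

    # Step S104: execute the action, enter the next state, compute the target Q value
    next_state = env_step(state, current_action)
    max_next_q = max(q_value(next_state, a) for a in actions)
    target_q = reward + GAMMA * max_next_q

    # Store the experience to speed up subsequent convergence
    replay_buffer.append((state, current_action, reward, next_state))
    return next_state, target_q
```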
  • preferably, the reinforcement learning network training device contains two processors, one of which is an AI chip whose architecture lies between an ASIC (Application Specific Integrated Circuit) and an FPGA (Field-Programmable Gate Array). This AI chip handles the part of the training process that makes decisions based on the current state and responds with the current action, thereby improving the training speed of the reinforcement learning network by increasing the memory access bandwidth.
  • In step S105, a loss function of the reinforcement learning network is generated according to the target Q value of the current state, and the network parameters are adjusted by a preset adjustment algorithm to continue training the reinforcement learning network until the loss function converges.
  • after the target Q value of the current state is obtained, the loss function of the reinforcement learning network is generated. Specifically, the loss function is L(θ) = E[(Target_Q(s, a; θ) - Q(s, a; θ))²], where Target_Q(s, a; θ) is the target Q value of the current state, E denotes the expectation (mean squared error), Q(s, a; θ) is the real-time Q value, s is the current state, a is the current action, and θ denotes the network parameters. The network parameters are then adjusted through the preset adjustment algorithm to continue training the network until the loss function converges, which completes the training of the reinforcement learning network. Specifically, the preset adjustment algorithm is the SGD (stochastic gradient descent) algorithm.
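  • A minimal sketch of the loss L(θ) = E[(Target_Q - Q(s, a; θ))²] minimized by stochastic gradient descent is given below; the tiny linear Q model, the feature representation, and the learning rate are assumptions chosen only to keep the example self-contained and runnable, not the patent's network.
```python
# Hypothetical sketch: mean-squared loss between target and real-time Q values,
# minimized with plain stochastic gradient descent on a toy linear Q model.
import numpy as np

rng = np.random.default_rng(0)
theta = rng.normal(size=4)                    # network parameters θ (toy linear model)

def q_value(features, theta):
    """Real-time Q(s, a; θ) for a concatenated state-action feature vector."""
    return float(features @ theta)

def sgd_update(batch, theta, lr=0.01):
    """One SGD step on L(θ) = mean over the batch of (Target_Q - Q(s, a; θ))^2."""
    grad = np.zeros_like(theta)
    for features, target_q in batch:
        error = q_value(features, theta) - target_q
        grad += 2.0 * error * features        # gradient of the squared error w.r.t. θ
    grad /= len(batch)
    return theta - lr * grad                  # gradient descent step

# Example: random (feature, target) pairs standing in for replayed transitions
batch = [(rng.normal(size=4), float(rng.normal())) for _ in range(8)]
for _ in range(100):
    theta = sgd_update(batch, theta)
```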
  • In this way, when a request for training a reinforcement learning network is received, the network parameters of the reinforcement learning network are set to perform weight configuration, and the current state of the reinforcement learning network is obtained together with its reward value and contribution value. The action combinations of the action library are traversed to obtain the maximum Q value of the action combinations in the current state, the current action is obtained from the maximum Q value of the current state and executed, and the target Q value of the current state is obtained by obtaining the maximum Q value of the next state. The loss function of the reinforcement learning network is then generated and the network parameters are adjusted through the preset adjustment algorithm to continue training the reinforcement learning network until the loss function converges, which reduces the amount of computation needed to train the reinforcement learning network, speeds up training, and improves training efficiency.
  • FIG. 4 shows the structure of a training device for a reinforcement learning network provided in Embodiment 2 of the present invention. For ease of description, only parts related to the embodiment of the present invention are shown, including:
  • a parameter setting unit 41 is configured to set network parameters of the reinforcement learning network when receiving a request for training the reinforcement learning network to perform weight configuration on the reinforcement learning network;
  • the matching obtaining unit 42 is configured to obtain the current state of the reinforcement learning network, match the current state in a pre-built state reward library, and obtain the reward value and contribution value of the current state;
  • the traversal acquisition unit 43 is configured to traverse the action combinations of a pre-built action library, obtain the contribution value of each action combination, and obtain the maximum Q value of the current state of the reinforcement learning network according to the contribution value of the current state and the contribution value of the action combination;
  • the execution obtaining unit 44 is configured to obtain and execute the current action of the reinforcement learning network according to the maximum Q value of the current state so as to obtain the next state of the reinforcement learning network, obtain the maximum Q value of the next state, and obtain the target Q value of the current state through the maximum Q value of the next state, the reward value of the current state, and a preset target value formula; and
  • the generating and adjusting unit 45 is configured to generate a loss function of the reinforcement learning network according to the target Q value of the current state, and to adjust the network parameters through a preset adjustment algorithm to continue training the reinforcement learning network until the loss function converges.
  • In the embodiment of the present invention, when a request for training a reinforcement learning network is received, the network parameters of the reinforcement learning network are set to perform weight configuration, and the current state of the reinforcement learning network is obtained together with its reward value and contribution value. The action combinations of the action library are traversed to obtain the maximum Q value of the action combinations in the current state, the current action is obtained from the maximum Q value of the current state and executed, and the target Q value of the current state is obtained by obtaining the maximum Q value of the next state. The loss function of the reinforcement learning network is then generated and the network parameters are adjusted through the preset adjustment algorithm to continue training the reinforcement learning network until the loss function converges, which reduces the amount of computation needed to train the reinforcement learning network, speeds up training, and improves training efficiency.
  • each unit of the training device of the reinforcement learning network may be implemented by a corresponding hardware or software unit.
  • Each unit may be an independent software or hardware unit, or the units may be integrated into a single software or hardware unit; this is not intended to limit the invention. For the specific implementation of each unit, reference may be made to the description in Embodiment 1, and details are not repeated here.
  • FIG. 5 shows the structure of a training device for a reinforcement learning network provided in Embodiment 3 of the present invention. For ease of description, only parts related to the embodiment of the present invention are shown, including:
  • a parameter setting unit 51 is configured to set network parameters of the reinforcement learning network when receiving a request for training the reinforcement learning network to perform weight configuration on the reinforcement learning network;
  • the matching acquisition unit 52 is configured to acquire the current state of the reinforcement learning network, match the current state in a pre-built state reward library, and obtain the reward value and contribution value of the current state;
  • the traversal acquisition unit 53 is configured to traverse the action combinations of a pre-built action library, obtain the contribution value of each action combination, and obtain the maximum Q value of the current state of the reinforcement learning network according to the contribution value of the current state and the contribution value of the action combination;
  • the execution obtaining unit 54 is configured to obtain and execute the current action of the reinforcement learning network according to the maximum Q value of the current state so as to obtain the next state of the reinforcement learning network, obtain the maximum Q value of the next state, and obtain the target Q value of the current state through the maximum Q value of the next state, the reward value of the current state, and a preset target value formula;
  • the experience storage unit 55 is configured to store the current state, the current action, the reward value of the current state, and the next state as training samples;
  • the generating and adjusting unit 56 is configured to generate a loss function of the reinforcement learning network according to the target Q value of the current state, and adjust the network parameters through a preset adjustment algorithm to continue training the learning network until the loss function converges.
  • the matching obtaining unit 52 includes:
  • a matching subunit 521 configured to match the current state with all state nodes corresponding to a preset number of reward groups in the state reward library
  • the state value setting unit 522 is configured to set the reward value of the preset state reward group as the reward value of the current state when the current state is located in a preset state node of the preset number of reward groups, and otherwise to set the reward value of the current state to the preset general-state reward value.
  • the traversal obtaining unit 53 includes:
  • a start value setting unit 531 configured to sequentially set a start value of a preset number of dimensional actions on a preset action list in the action library to a preset number of real-time action values on a preset real-time action table in the action library;
  • the first accumulation unit 532 is configured to obtain the step value of the preset first-dimensional action on the preset action list, and to successively accumulate the step value of the preset first-dimensional action onto the real-time action value corresponding to the preset first-dimensional action;
  • a second accumulation unit 533 configured to obtain the step value of the preset second-dimensional action on the preset action list when the corresponding real-time action value is sequentially accumulated outside the range corresponding to the preset first-dimensional action, and The step value of the preset second-dimensional action is successively accumulated to the real-time action value corresponding to the preset second-dimensional action.
  • In the embodiment of the present invention, when a request for training a reinforcement learning network is received, the network parameters of the reinforcement learning network are set to perform weight configuration, and the current state of the reinforcement learning network is obtained together with its reward value and contribution value. The action combinations of the action library are traversed to obtain the maximum Q value of the action combinations in the current state, the current action is obtained from the maximum Q value of the current state and executed, and the target Q value of the current state is obtained by obtaining the maximum Q value of the next state. The loss function of the reinforcement learning network is then generated and the network parameters are adjusted through the preset adjustment algorithm to continue training the reinforcement learning network until the loss function converges, which reduces the amount of computation needed to train the reinforcement learning network, speeds up training, and improves training efficiency.
  • each unit of the training device of the reinforcement learning network may be implemented by a corresponding hardware or software unit.
  • Each unit may be an independent software or hardware unit, or the units may be integrated into a single software or hardware unit; this is not intended to limit the invention. For the specific implementation of each unit, reference may be made to the description in Embodiment 1, and details are not repeated here.
  • Embodiment 4:
  • FIG. 6 shows the structure of a reinforcement learning network training device provided in Embodiment 4 of the present invention. For convenience of explanation, only parts related to the embodiment of the present invention are shown, including:
  • the reinforcement learning network training device 6 includes a processor 61, a memory 62, and a computer program 63 stored in the memory 62 and executable on the processor 61.
  • when the processor 61 executes the computer program 63, the steps in the above embodiment of the training method for a reinforcement learning network are implemented, for example steps S101 to S105 shown in FIG. 1.
  • Alternatively, when the processor 61 executes the computer program 63, the functions of the units in the above embodiments of the training device for a reinforcement learning network are realized, for example the functions of units 41 to 45 shown in FIG. 4 or units 51 to 56 shown in FIG. 5.
  • the reinforcement learning network training device 7 includes a first processor 711, a second processor 712, a first memory 721, a second memory 722, and a computer program 73 stored in the first memory 721 and the second memory 722; the computer program 73 can run on the first processor 711 and the second processor 712.
  • the first processor 711 is an ASIC (Application Specific Integrated Circuit) chip, thereby improving the efficiency of the learning network and reducing power consumption.
  • the first processor 711, when executing the computer program 73, implements part of the steps in the above embodiment of the training method for a reinforcement learning network, for example steps S101 to S103 shown in FIG. 1, and the second processor 712, when executing the computer program 73, implements the remaining steps of the training method embodiment, for example steps S104 to S105 shown in FIG. 1.
  • Alternatively, when the first processor 711 executes the computer program 73, the functions of some of the units in the above embodiments of the training device for a reinforcement learning network are realized, for example the functions of units 41 to 43 shown in FIG. 4 or units 51 to 53 shown in FIG. 5, and when the second processor 712 executes the computer program 73, the functions of the remaining units are realized, for example units 44 to 45 shown in FIG. 4 or units 54 to 56 shown in FIG. 5.
  • In the embodiment of the present invention, the network parameters of the reinforcement learning network are set to perform weight configuration, and the current state of the reinforcement learning network is obtained together with the reward value and contribution value of the current state. The maximum Q value of the action combinations in the current state is obtained by traversing the action combinations of the action library, the current action is obtained from the maximum Q value of the current state and executed, and the target Q value of the current state is obtained by obtaining the maximum Q value of the next state. The loss function of the reinforcement learning network is generated, and the network parameters are adjusted through the preset adjustment algorithm to continue training the reinforcement learning network until the loss function converges, which reduces the amount of computation needed to train the reinforcement learning network, speeds up training, and improves training efficiency.
  • Embodiment 5:
  • a computer-readable storage medium stores a computer program, and when the computer program is executed by a processor, the steps in the embodiment of the training method of the reinforcement learning network are implemented. For example, steps S101 to S105 shown in FIG. 1.
  • Alternatively, when the computer program is executed by a processor, the functions of the units in the above embodiments of the training device for a reinforcement learning network are implemented, for example the functions of units 41 to 45 shown in FIG. 4 or units 51 to 56 shown in FIG. 5.
  • In the embodiment of the present invention, the network parameters of the reinforcement learning network are set to perform weight configuration, and the current state of the reinforcement learning network is obtained together with the reward value and contribution value of the current state. The maximum Q value of the action combinations in the current state is obtained by traversing the action combinations of the action library, the current action is obtained from the maximum Q value of the current state and executed, and the target Q value of the current state is obtained by obtaining the maximum Q value of the next state. A loss function for the reinforcement learning network is generated, and the network parameters are adjusted through the preset adjustment algorithm to continue training the reinforcement learning network until the loss function converges, thereby reducing the amount of computation for training the reinforcement learning network, accelerating its training, and improving training efficiency.
  • the computer-readable storage medium of the embodiments of the present invention may include any entity or device capable of carrying computer program code, or a storage medium, for example a memory such as a ROM/RAM, a magnetic disk, an optical disk, or a flash memory.

Abstract

The present invention is applicable to the technical field of machine learning, and provides a reinforcement learning network training method, apparatus and device, and a storage medium. Said method comprises: upon receipt of a request for training of a reinforcement learning network, setting network parameters of the reinforcement learning network, so as to perform weight configuration; acquiring the current state of the reinforcement learning network, and the reward value and the contribution value of the current state; acquiring the maximum Q value of the action combination in the current state by traversing action combinations in an action library; acquiring the current action according to the maximum Q value of the current state and executing same, and acquiring a target Q value of the current state by obtaining the maximum Q value of a next state; and generating a loss function of the reinforcement learning network, and adjusting the network parameters by means of a preset adjustment algorithm, so as to continue to train the reinforcement learning network until the loss function converges. The present invention reduces the calculation amount for reinforcement learning network training, thereby increasing the training speed of a reinforcement learning network and improving the training efficiency.

Description

Training method, apparatus, training device, and storage medium for a reinforcement learning network
Technical Field
The invention belongs to the field of machine learning, and particularly relates to a training method, an apparatus, a training device, and a storage medium for a reinforcement learning network.
Background
Reinforcement learning, also known as re-incentive learning or evaluative learning, is an important machine learning method in which an agent learns a mapping from environment states to actions so as to maximize the value of the reward (reinforcement) signal function. Reinforcement learning differs from supervised learning in connectionist learning mainly in the teacher signal: the reinforcement signal provided by the environment evaluates how good the produced action is (usually as a scalar signal) rather than telling the reinforcement learning system (RLS) how to produce the correct action. Because the external environment provides little information, the RLS must learn from its own experience. In this way, the RLS gains knowledge in an action-evaluation environment and improves its action plan to suit the environment. Reinforcement learning has many applications in areas such as intelligent robot control and analysis and prediction.
In recent years, reinforcement learning has been widely used in robot control, computer vision, natural language processing, game theory, and autonomous driving. Training a reinforcement learning network is usually carried out on CPU and GPU devices and involves a very large amount of computation; in practical applications it occupies substantial resources, runs slowly, and is inefficient, and the limitation of memory access bandwidth prevents the computing capability from being improved further.
Summary of the Invention
The purpose of the present invention is to provide a training method, apparatus, training device, and storage medium for a reinforcement learning network, aiming to solve the problem that the prior art cannot provide an effective training method for reinforcement learning networks, which results in a large amount of training computation and low efficiency.
In one aspect, the present invention provides a method for training a reinforcement learning network, the method including the following steps:
when a request for training the reinforcement learning network is received, setting network parameters of the reinforcement learning network to perform weight configuration on the reinforcement learning network;
acquiring the current state of the reinforcement learning network, matching the current state in a pre-built state reward library, and obtaining the reward value and the contribution value of the current state;
traversing the action combinations of a pre-built action library, obtaining the contribution value of each action combination, and obtaining the maximum Q value of the current state of the reinforcement learning network according to the contribution value of the current state and the contribution value of the action combination;
obtaining and executing the current action of the reinforcement learning network according to the maximum Q value of the current state so that the reinforcement learning network enters the next state, obtaining the maximum Q value of the next state, and obtaining the target Q value of the current state through the maximum Q value of the next state, the reward value of the current state, and a preset target value formula; and
generating a loss function of the reinforcement learning network according to the target Q value of the current state, and adjusting the network parameters of the reinforcement learning network through a preset adjustment algorithm to continue training the reinforcement learning network until the loss function converges.
In another aspect, the present invention provides a training apparatus for a reinforcement learning network, the apparatus including:
a parameter setting unit, configured to set network parameters of the reinforcement learning network when a request for training the reinforcement learning network is received, so as to perform weight configuration on the reinforcement learning network;
a matching obtaining unit, configured to obtain the current state of the reinforcement learning network, match the current state in a pre-built state reward library, and obtain the reward value and the contribution value of the current state;
a traversal obtaining unit, configured to traverse the action combinations of a pre-built action library, obtain the contribution value of each action combination, and obtain the maximum Q value of the current state of the reinforcement learning network according to the contribution value of the current state and the contribution value of the action combination;
an execution obtaining unit, configured to obtain and execute the current action of the reinforcement learning network according to the maximum Q value of the current state so that the reinforcement learning network enters the next state, obtain the maximum Q value of the next state, and obtain the target Q value of the current state through the maximum Q value of the next state, the reward value of the current state, and a preset target value formula; and
a generation and adjustment unit, configured to generate a loss function of the reinforcement learning network according to the target Q value of the reinforcement learning network, and adjust the network parameters of the reinforcement learning network through a preset adjustment algorithm to continue training the reinforcement learning network until the loss function converges.
In another aspect, the present invention also provides a reinforcement learning network training device, including a memory, a processor, and a computer program stored in the memory and executable on the processor, where the processor, when executing the computer program, implements the steps of the training method for a reinforcement learning network described above.
In another aspect, the present invention also provides a computer-readable storage medium storing a computer program which, when executed by a processor, implements the steps of the training method for a reinforcement learning network described above.
When a request for training a reinforcement learning network is received, the present invention sets the network parameters of the reinforcement learning network to perform weight configuration, obtains the current state of the reinforcement learning network together with the reward value and contribution value of the current state, traverses the action combinations of the action library to obtain the maximum Q value of the action combinations in the current state, obtains and executes the current action according to the maximum Q value of the current state, obtains the target Q value of the current state by obtaining the maximum Q value of the next state, generates the loss function of the reinforcement learning network, and adjusts the network parameters through a preset adjustment algorithm to continue training the reinforcement learning network until the loss function converges. This reduces the amount of computation needed to train the reinforcement learning network, thereby speeding up training and improving training efficiency.
BRIEF DESCRIPTION OF THE DRAWINGS
FIG. 1 is an implementation flowchart of a training method for a reinforcement learning network according to Embodiment 1 of the present invention;
FIG. 2 is a schematic diagram of a preferred storage structure of the state reward library provided in Embodiment 1 of the present invention;
FIG. 3 is a schematic diagram of a preferred storage structure of the action library provided in Embodiment 1 of the present invention;
FIG. 4 is a schematic structural diagram of a training apparatus for a reinforcement learning network according to Embodiment 2 of the present invention;
FIG. 5 is a schematic structural diagram of a training apparatus for a reinforcement learning network according to Embodiment 3 of the present invention;
FIG. 6 is a schematic structural diagram of a reinforcement learning network training device according to Embodiment 4 of the present invention; and
FIG. 7 is a schematic diagram of a preferred structure of a reinforcement learning network training device according to Embodiment 4 of the present invention.
DETAILED DESCRIPTION
In order to make the objectives, technical solutions, and advantages of the present invention clearer, the present invention is further described in detail below with reference to the accompanying drawings and embodiments. It should be understood that the specific embodiments described here are only intended to explain the present invention and are not intended to limit it.
The specific implementation of the present invention is described in detail below with reference to specific embodiments.
Embodiment 1:
FIG. 1 shows the implementation flow of the training method for a reinforcement learning network provided in Embodiment 1 of the present invention. For ease of description, only the parts related to the embodiment of the present invention are shown, and the details are as follows:
In step S101, when a request for training a reinforcement learning network is received, network parameters of the reinforcement learning network are set to perform weight configuration on the reinforcement learning network.
The embodiment of the present invention is applicable to reinforcement learning network training equipment, for example training equipment such as MATLAB (Matrix Laboratory). In the embodiment of the present invention, when a request for training the reinforcement learning network is received, the network parameters of the network are set to configure its weights. Specifically, the network parameters are written first; when network operations are performed, the written parameters activate the computation mode of the corresponding neurons of the reinforcement learning network. In this way the parameters of every neuron in every layer of the network are configured, so that data is processed in parallel and data processing efficiency is improved.
In step S102, the current state of the reinforcement learning network is acquired and matched in a pre-built state reward library to obtain the reward value and contribution value of the current state.
In the embodiment of the present invention, the state reward library is a pre-built set storing state nodes and their corresponding reward values. After the training request is received, the current state of the reinforcement learning network is acquired and its feature data is extracted; the contribution value of the current state is computed from this feature data, and the current state is then matched in the state reward library to obtain its reward value.
As an example, FIG. 2 shows a preferred storage structure of the state reward library. The state reward library is divided into n reward groups, which correspond to the reward values of n special states. The number of reward groups, n, is stored at the beginning of the data, and the end of the library stores the general-state reward value, i.e. the (n + 1)-th reward value. Each reward group contains different state nodes, i.e. different state values, and different state nodes correspond to different ranges of state values.
Preferably, when the current state is matched in the pre-built state reward library, it is compared against all state nodes of the preset number of reward groups. When the current state falls within a preset state node of one of these reward groups, the reward value of that reward group is set as the reward value of the current state; otherwise the reward value of the current state is set to the preset general-state reward value, so that the immediate reward of the current state is obtained quickly. Specifically, since the current state can lie in at most one state node or outside all state nodes, the state nodes may be matched one by one: once the current state matches a preset state node, matching of the other state nodes stops and the reward value of that node is set as the reward value of the current state; if no node matches after all state nodes have been tried, the general-state reward value is set as the reward value of the current state.
In step S103, the action combinations of a pre-built action library are traversed to obtain the contribution value of each action combination, and the maximum Q value of the current state of the reinforcement learning network is obtained according to the contribution value of the current state and the contribution value of the action combination.
In the embodiment of the present invention, the action library is a pre-built set storing all actions that the reinforcement learning network can output, and the Q value represents the mapping from states to action values in the reinforcement learning network. All action combinations of the action library are traversed and the contribution value of each action combination (real-time action) is obtained; each time an action combination is obtained, its Q value is computed from the contribution value of the current state and the contribution value of the action combination, so that the Q value of every action combination can be calculated and the maximum Q value of the current state of the reinforcement learning network can be obtained.
As an example, FIG. 3 shows a preferred storage structure of the action library. The action library is divided into an action memory module and a real-time action memory module. The action memory module stores the information of all actions, specifically the number of action dimensions n and, for each action dimension, its step value, maximum value, and starting value. The real-time action memory module stores the action information to be output, specifically the action value of each of the n action dimensions. For example, in a reinforcement learning network for autonomous driving, the actions may include turning left (first dimension), turning right (second dimension), and braking (third dimension), with corresponding action values (1, a), (2, b), and (3, c), where 1, 2, and 3 denote the action dimensions (the first, second, and third dimensions) and a, b, and c are the measurement values of the first-, second-, and third-dimensional actions, respectively.
Preferably, when traversing the action combinations of the pre-built action library, the starting values of the preset number of dimensional actions on the preset action list in the action library are first set, in order, as the preset number of real-time action values on the preset real-time action table in the action library. The step value of the preset first-dimensional action is then obtained from the preset action list and successively accumulated onto the real-time action value corresponding to the preset first-dimensional action; when the accumulated real-time action value goes outside the range corresponding to the preset first-dimensional action, the step value of the preset second-dimensional action is obtained from the preset action list and successively accumulated onto the real-time action value corresponding to the preset second-dimensional action, so that the contribution value of every real-time action to the learning network is computed quickly and accurately. The preset first-dimensional action and the preset second-dimensional action are both single dimensions among the preset number of dimensional actions.
In step S104, the current action of the reinforcement learning network is obtained and executed according to the maximum Q value of the current state so as to obtain the next state of the reinforcement learning network; the maximum Q value of the next state is obtained, and the target Q value of the current state is obtained through the maximum Q value of the next state, the reward value of the current state, and a preset target value formula.
In the embodiment of the present invention, the current action is the action that the reinforcement learning network needs to perform in the current state. The preset target value formula is Target_Q(s, a; θ) = r(s) + γ·maxQ(s', a'; θ), where Target_Q(s, a; θ) is the target Q value of the current state, s is the current state, a is the current action, r(s) is the reward value of the current state, γ is the discount factor, θ denotes the network parameters, and maxQ(s', a'; θ) is the maximum Q value of the next state. Specifically, following a greedy policy, the current action of the reinforcement learning network is obtained from the maximum Q value of the current state and executed, and the network enters the next state; the methods of steps S102 and S103 are then repeated to obtain the maximum Q value of the next state, and the target Q value of the current state is obtained through the preset target value formula.
Preferably, after the target Q value of the current state is obtained, the current state, the current action, the reward value of the current state, and the next state are stored as a training sample, thereby speeding up the subsequent convergence process.
Preferably, the reinforcement learning network training device contains two processors, one of which is an AI chip whose architecture lies between an ASIC (Application Specific Integrated Circuit) and an FPGA (Field-Programmable Gate Array). This AI chip handles the part of the training process that makes decisions based on the current state and responds with the current action, thereby improving the training speed of the reinforcement learning network by increasing the memory access bandwidth.
In step S105, a loss function of the reinforcement learning network is generated according to the target Q value of the current state, and the network parameters are adjusted by a preset adjustment algorithm to continue training the reinforcement learning network until the loss function converges.
In the embodiment of the present invention, after the target Q value of the current state is obtained, the loss function of the reinforcement learning network is generated. Specifically, the loss function is L(θ) = E[(Target_Q(s, a; θ) - Q(s, a; θ))²], where Target_Q(s, a; θ) is the target Q value of the current state, E denotes the expectation (mean squared error), Q(s, a; θ) is the real-time Q value, s is the current state, a is the current action, and θ denotes the network parameters. The network parameters are then adjusted through the preset adjustment algorithm to continue training the network until the loss function converges, which finally completes the training of the reinforcement learning network. Specifically, the preset adjustment algorithm is the SGD (stochastic gradient descent) algorithm.
In the embodiment of the present invention, when a request for training a reinforcement learning network is received, the network parameters of the reinforcement learning network are set to perform weight configuration, and the current state of the reinforcement learning network is obtained together with the reward value and contribution value of the current state. The action combinations of the action library are traversed to obtain the maximum Q value of the action combinations in the current state, the current action is obtained from the maximum Q value of the current state and executed, and the target Q value of the current state is obtained by obtaining the maximum Q value of the next state. The loss function of the reinforcement learning network is then generated and the network parameters are adjusted through the preset adjustment algorithm to continue training the reinforcement learning network until the loss function converges, which reduces the amount of computation needed to train the reinforcement learning network, speeds up training, and improves training efficiency.
Embodiment 2:
FIG. 4 shows the structure of a training apparatus for a reinforcement learning network provided in Embodiment 2 of the present invention. For ease of description, only the parts related to the embodiment of the present invention are shown, including:
a parameter setting unit 41, configured to set the network parameters of the reinforcement learning network when a request to train the reinforcement learning network is received, so as to configure the weights of the reinforcement learning network;
a matching acquisition unit 42, configured to acquire the current state of the reinforcement learning network, match the current state in a pre-built state reward library, and obtain the reward value and contribution value of the current state;
a traversal acquisition unit 43, configured to traverse the action combinations of a pre-built action library, obtain the contribution values of the action combinations, and obtain the maximum Q value of the current state of the reinforcement learning network from the contribution value of the current state and the contribution values of the action combinations;
an execution acquisition unit 44, configured to obtain the current action of the reinforcement learning network from the maximum Q value of the current state and execute it so as to obtain the next state of the reinforcement learning network, obtain the maximum Q value of the next state, and obtain the target Q value of the current state from the maximum Q value of the next state, the reward value of the current state and a preset target value formula; and
a generation and adjustment unit 45, configured to generate the loss function of the reinforcement learning network according to the target Q value of the current state, and adjust the network parameters by a preset adjustment algorithm so as to continue training the network until the loss function converges.
In the embodiment of the present invention, when a request to train the reinforcement learning network is received, the network parameters of the reinforcement learning network are set to configure its weights; the current state of the reinforcement learning network is obtained, together with the reward value and contribution value of the current state; the action combinations of the action library are traversed to obtain the maximum Q value of the action combinations in the current state; the current action is obtained from the maximum Q value of the current state and executed; the target Q value of the current state is obtained from the maximum Q value of the next state; the loss function of the reinforcement learning network is generated, and the network parameters are adjusted by a preset adjustment algorithm so as to continue training the reinforcement learning network until the loss function converges. This reduces the amount of computation needed to train the reinforcement learning network, which in turn speeds up training and improves training efficiency.
In the embodiment of the present invention, each unit of the training apparatus for the reinforcement learning network may be implemented by a corresponding hardware or software unit. Each unit may be an independent software or hardware unit, or the units may be integrated into a single software or hardware unit, which is not intended to limit the present invention. For the specific implementation of each unit, reference may be made to the description of Embodiment 1, and details are not repeated here.
Embodiment 3:
FIG. 5 shows the structure of a training apparatus for a reinforcement learning network provided in Embodiment 3 of the present invention. For ease of description, only the parts related to the embodiment of the present invention are shown, including:
a parameter setting unit 51, configured to set the network parameters of the reinforcement learning network when a request to train the reinforcement learning network is received, so as to configure the weights of the reinforcement learning network;
a matching acquisition unit 52, configured to acquire the current state of the reinforcement learning network, match the current state in a pre-built state reward library, and obtain the reward value and contribution value of the current state;
a traversal acquisition unit 53, configured to traverse the action combinations of a pre-built action library, obtain the contribution values of the action combinations, and obtain the maximum Q value of the current state of the reinforcement learning network from the contribution value of the current state and the contribution values of the action combinations;
an execution acquisition unit 54, configured to obtain the current action of the reinforcement learning network from the maximum Q value of the current state and execute it so as to obtain the next state of the reinforcement learning network, obtain the maximum Q value of the next state, and obtain the target Q value of the current state from the maximum Q value of the next state, the reward value of the current state and a preset target value formula;
an experience storage unit 55, configured to store the current state, the current action, the reward value of the current state and the next state as a training sample; and
a generation and adjustment unit 56, configured to generate the loss function of the reinforcement learning network according to the target Q value of the current state, and adjust the network parameters by a preset adjustment algorithm so as to continue training the network until the loss function converges.
The matching acquisition unit 52 includes:
a matching subunit 521, configured to match the current state against all state nodes corresponding to a preset number of reward groups in the state reward library; and
a state value setting unit 522, configured to set the reward value of a preset state reward group as the reward value of the current state when the current state lies in a preset state node of the preset number of reward groups, and otherwise set the reward value of the current state to a preset general state reward value.
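As a rough illustration of how such a matching subunit could behave, the following sketch assumes the reward groups are given as pairs of state-node sets and reward values; this layout and the default value are assumptions made only for the example, not the patent's data structure.

DEFAULT_REWARD = 0.0  # stands in for the "preset general state reward value"

def match_state_reward(state, reward_groups):
    # reward_groups: iterable of (state_nodes, reward_value) pairs, one per reward group.
    for state_nodes, reward_value in reward_groups:
        if state in state_nodes:      # the current state lies in a preset state node
            return reward_value       # use that reward group's reward value
    return DEFAULT_REWARD             # otherwise fall back to the general state reward value

# Example: match_state_reward(3, [({1, 2, 3}, 10.0), ({7, 8}, -5.0)]) returns 10.0.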
The traversal acquisition unit 53 includes:
a start value setting unit 531, configured to set the start values of a preset number of dimensional actions in a preset action list of the action library, in turn, as the preset number of real-time action values in a preset real-time action table of the action library;
a first accumulation unit 532, configured to obtain the step value of a preset first-dimensional action in the preset action list, and successively accumulate the step value of the preset first-dimensional action onto the real-time action value corresponding to the preset first-dimensional action; and
a second accumulation unit 533, configured to, when the corresponding real-time action value has been accumulated beyond the range corresponding to the preset first-dimensional action, obtain the step value of a preset second-dimensional action in the preset action list, and successively accumulate the step value of the preset second-dimensional action onto the real-time action value corresponding to the preset second-dimensional action.
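A minimal sketch of this two-dimensional traversal is given below, assuming each dimension is described by a start value, a step value and an upper bound; these parameter names and the example numbers are illustrative only.

def traverse_actions(start, step, upper):
    # start, step, upper: (dim1, dim2) tuples taken from the preset action list.
    a1, a2 = start                # real-time action values initialised to the start values
    while a2 <= upper[1]:
        yield (a1, a2)
        a1 += step[0]             # accumulate the first-dimensional step value
        if a1 > upper[0]:         # accumulated beyond the first dimension's range
            a1 = start[0]
            a2 += step[1]         # accumulate the second-dimensional step value

# Example: list(traverse_actions((0, 0), (1, 1), (2, 1)))
# yields (0,0), (1,0), (2,0), (0,1), (1,1), (2,1).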
In the embodiment of the present invention, when a request to train the reinforcement learning network is received, the network parameters of the reinforcement learning network are set to configure its weights; the current state of the reinforcement learning network is obtained, together with the reward value and contribution value of the current state; the action combinations of the action library are traversed to obtain the maximum Q value of the action combinations in the current state; the current action is obtained from the maximum Q value of the current state and executed; the target Q value of the current state is obtained from the maximum Q value of the next state; the loss function of the reinforcement learning network is generated, and the network parameters are adjusted by a preset adjustment algorithm so as to continue training the reinforcement learning network until the loss function converges. This reduces the amount of computation needed to train the reinforcement learning network, which in turn speeds up training and improves training efficiency.
In the embodiment of the present invention, each unit of the training apparatus for the reinforcement learning network may be implemented by a corresponding hardware or software unit. Each unit may be an independent software or hardware unit, or the units may be integrated into a single software or hardware unit, which is not intended to limit the present invention. For the specific implementation of each unit, reference may be made to the description of Embodiment 1, and details are not repeated here.
Embodiment 4:
FIG. 6 shows the structure of a reinforcement learning network training device provided in Embodiment 4 of the present invention. For ease of description, only the parts related to the embodiment of the present invention are shown, including:
The reinforcement learning network training device 6 of the embodiment of the present invention includes a processor 61, a memory 62, and a computer program 63 stored in the memory 62 and executable on the processor 61. When the processor 61 executes the computer program 63, the steps in the above embodiments of the training method of the reinforcement learning network are implemented, for example, steps S101 to S105 shown in FIG. 1. Alternatively, when the processor 61 executes the computer program 63, the functions of the units in the above embodiments of the training apparatus of the reinforcement learning network are implemented, for example, the functions of units 41 to 45 shown in FIG. 4 and of units 51 to 56 shown in FIG. 5.
FIG. 7 is a schematic diagram of a preferred structure of the reinforcement learning network training device. Preferably, the reinforcement learning network training device 7 includes a first processor 711, a second processor 712, a first memory 721, a second memory 722, and a computer program 73 stored in the first memory 721 and the second memory 722; the computer program 73 can run on the first processor 711 and the second processor 712. Specifically, the first processor 711 is an ASIC (Application-Specific Integrated Circuit) chip, which improves the efficiency of the learning network and reduces power consumption. When the first processor 711 executes the computer program 73, it implements steps of the above embodiment of the training method of the reinforcement learning network, for example steps S101 to S103 shown in FIG. 1, and when the second processor 712 executes the computer program 73, it implements steps of the above embodiment of the training method, for example steps S104 to S105 shown in FIG. 1. Alternatively, when the first processor 711 executes the computer program 73, it implements the functions of units in the above embodiments of the training apparatus of the reinforcement learning network, for example units 41 to 43 shown in FIG. 4 and units 51 to 53 shown in FIG. 5, and when the second processor 712 executes the computer program 73, it implements the functions of units 44 to 45 shown in FIG. 4 and units 54 to 56 shown in FIG. 5.
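One possible way to picture this division of work between the two processors is the following sketch; the queue-based hand-off and the function names are assumptions made for illustration only and do not describe the device's actual bus or interface.

import queue

hand_off = queue.Queue()

def first_processor(decide_action, states):      # corresponds to steps S101 to S103
    for state in states:
        action, reward, max_q = decide_action(state)   # reward matching, action traversal, max Q
        hand_off.put((state, action, reward, max_q))
    hand_off.put(None)                                  # signal the end of the run

def second_processor(update_parameters):          # corresponds to steps S104 and S105
    while True:
        item = hand_off.get()
        if item is None:
            break
        update_parameters(*item)                        # target Q, loss function, SGD update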
In the embodiment of the present invention, when the processor executes the computer program and a request to train the reinforcement learning network is received, the network parameters of the reinforcement learning network are set to configure its weights; the current state of the reinforcement learning network is obtained, together with the reward value and contribution value of the current state; the action combinations of the action library are traversed to obtain the maximum Q value of the action combinations in the current state; the current action is obtained from the maximum Q value of the current state and executed; the target Q value of the current state is obtained from the maximum Q value of the next state; the loss function of the reinforcement learning network is generated, and the network parameters are adjusted by a preset adjustment algorithm so as to continue training the reinforcement learning network until the loss function converges. This reduces the amount of computation needed to train the reinforcement learning network, which in turn speeds up training and improves training efficiency.
For the steps of the above embodiments of the training method of the reinforcement learning network implemented when the processor executes the computer program, reference may be made to the description of Embodiment 1, and details are not repeated here.
Embodiment 5:
In the embodiment of the present invention, a computer-readable storage medium is provided. The computer-readable storage medium stores a computer program, and when the computer program is executed by a processor, the steps in the above embodiments of the training method of the reinforcement learning network are implemented, for example, steps S101 to S105 shown in FIG. 1. Alternatively, when the computer program is executed by a processor, the functions of the units in the above embodiments of the training apparatus of the reinforcement learning network are implemented, for example, the functions of units 41 to 45 shown in FIG. 4 and of units 51 to 56 shown in FIG. 5.
In the embodiment of the present invention, after the computer program is executed by the processor and a request to train the reinforcement learning network is received, the network parameters of the reinforcement learning network are set to configure its weights; the current state of the reinforcement learning network is obtained, together with the reward value and contribution value of the current state; the action combinations of the action library are traversed to obtain the maximum Q value of the action combinations in the current state; the current action is obtained from the maximum Q value of the current state and executed; the target Q value of the current state is obtained from the maximum Q value of the next state; the loss function of the reinforcement learning network is generated, and the network parameters are adjusted by a preset adjustment algorithm so as to continue training the reinforcement learning network until the loss function converges. This reduces the amount of computation needed to train the reinforcement learning network, which in turn speeds up training and improves training efficiency.
The computer-readable storage medium of the embodiment of the present invention may include any entity or apparatus capable of carrying computer program code, or a storage medium, for example, a memory such as a ROM/RAM, a magnetic disk, an optical disc or a flash memory.
The above description is only the preferred embodiments of the present invention and is not intended to limit the present invention. Any modification, equivalent replacement or improvement made within the spirit and principles of the present invention shall be included within the scope of protection of the present invention.

Claims (10)

  1. A training method for a reinforcement learning network, characterized in that the method comprises the following steps:
    when a request to train the reinforcement learning network is received, setting network parameters of the reinforcement learning network so as to configure weights of the reinforcement learning network;
    acquiring a current state of the reinforcement learning network, matching the current state in a pre-built state reward library, and obtaining a reward value and a contribution value of the current state;
    traversing action combinations of a pre-built action library, obtaining contribution values of the action combinations, and obtaining a maximum Q value of the current state of the reinforcement learning network according to the contribution value of the current state and the contribution values of the action combinations;
    obtaining a current action of the reinforcement learning network according to the maximum Q value of the current state and executing the current action so that the reinforcement learning network enters a next state, obtaining a maximum Q value of the next state, and obtaining a target Q value of the current state by means of the maximum Q value of the next state, the reward value of the current state and a preset target value formula;
    generating a loss function of the reinforcement learning network according to the target Q value of the current state, and adjusting the network parameters by a preset adjustment algorithm so as to continue training the reinforcement learning network until the loss function converges.
  2. The method according to claim 1, characterized in that the step of matching the current state of the reinforcement learning network in the pre-built state reward library comprises:
    matching the current state against all state nodes corresponding to a preset number of reward groups in the state reward library;
    when the current state lies in a preset state node of the preset number of reward groups, setting a reward value of the preset state reward group as the reward value of the current state, and otherwise setting the reward value of the current state to a preset general state reward value.
  3. The method according to claim 1, characterized in that the step of traversing the action combinations of the pre-built action library comprises:
    setting start values of a preset number of dimensional actions in a preset action list of the action library, in turn, as a preset number of real-time action values in a preset real-time action table of the action library;
    obtaining a step value of a preset first-dimensional action in the preset action list, and successively accumulating the step value of the preset first-dimensional action onto the real-time action value corresponding to the preset first-dimensional action;
    when the corresponding real-time action value has been accumulated beyond a range corresponding to the preset first-dimensional action, obtaining a step value of a preset second-dimensional action in the preset action list, and successively accumulating the step value of the preset second-dimensional action onto the real-time action value corresponding to the preset second-dimensional action.
  4. The method according to claim 1, characterized in that, after the step of obtaining the target Q value of the current state, the method further comprises:
    storing the current state, the current action, the reward value of the current state and the next state as a training sample.
  5. A training apparatus for a reinforcement learning network, characterized in that the apparatus comprises:
    a parameter setting unit, configured to set network parameters of the reinforcement learning network when a request to train the reinforcement learning network is received, so as to configure weights of the reinforcement learning network;
    a matching acquisition unit, configured to acquire a current state of the reinforcement learning network, match the current state in a pre-built state reward library, and obtain a reward value and a contribution value of the current state;
    a traversal acquisition unit, configured to traverse action combinations of a pre-built action library, obtain contribution values of the action combinations, and obtain a maximum Q value of the current state of the reinforcement learning network according to the contribution value of the current state and the contribution values of the action combinations;
    an execution acquisition unit, configured to obtain a current action of the reinforcement learning network according to the maximum Q value of the current state and execute the current action so that the reinforcement learning network enters a next state, obtain a maximum Q value of the next state, and obtain a target Q value of the current state by means of the maximum Q value of the next state, the reward value of the current state and a preset target value formula; and
    a generation and adjustment unit, configured to generate a loss function of the reinforcement learning network according to the target Q value of the reinforcement learning network, and adjust the network parameters of the reinforcement learning network by a preset adjustment algorithm so as to continue training the reinforcement learning network until the loss function converges.
  6. The apparatus according to claim 5, characterized in that the matching acquisition unit comprises:
    a matching subunit, configured to match the current state against all state nodes corresponding to a preset number of reward groups in the state reward library; and
    a state value setting unit, configured to set a reward value of a preset state reward group as the reward value of the current state when the current state lies in a preset state node of the preset number of reward groups, and otherwise set the reward value of the current state to a preset general state reward value.
  7. The apparatus according to claim 5, characterized in that the traversal acquisition unit comprises:
    a start value setting unit, configured to set start values of a preset number of dimensional actions in a preset action list of the action library, in turn, as a preset number of real-time action values in a preset real-time action table of the action library;
    a first accumulation unit, configured to obtain a step value of a preset first-dimensional action in the preset action list, and successively accumulate the step value of the preset first-dimensional action onto the real-time action value corresponding to the preset first-dimensional action; and
    a second accumulation unit, configured to, when the corresponding real-time action value has been accumulated beyond a range corresponding to the preset first-dimensional action, obtain a step value of a preset second-dimensional action in the preset action list, and successively accumulate the step value of the preset second-dimensional action onto the real-time action value corresponding to the preset second-dimensional action.
  8. The apparatus according to claim 5, characterized in that the apparatus further comprises:
    an experience storage unit, configured to store the current state, the current action, the reward value of the current state and the next state as a training sample.
  9. A reinforcement learning network training device, comprising a memory, a processor, and a computer program stored in the memory and executable on the processor, characterized in that, when the processor executes the computer program, the steps of the method according to any one of claims 1 to 4 are implemented.
  10. A computer-readable storage medium storing a computer program, characterized in that, when the computer program is executed by a processor, the steps of the method according to any one of claims 1 to 4 are implemented.
PCT/CN2018/099256 2018-08-07 2018-08-07 Reinforcement learning network training method, apparatus and device, and storage medium WO2020029095A1 (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
PCT/CN2018/099256 WO2020029095A1 (en) 2018-08-07 2018-08-07 Reinforcement learning network training method, apparatus and device, and storage medium

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
PCT/CN2018/099256 WO2020029095A1 (en) 2018-08-07 2018-08-07 Reinforcement learning network training method, apparatus and device, and storage medium

Publications (1)

Publication Number Publication Date
WO2020029095A1 true WO2020029095A1 (en) 2020-02-13

Family

ID=69415170

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/CN2018/099256 WO2020029095A1 (en) 2018-08-07 2018-08-07 Reinforcement learning network training method, apparatus and device, and storage medium

Country Status (1)

Country Link
WO (1) WO2020029095A1 (en)

Patent Citations (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN105637540A (en) * 2013-10-08 2016-06-01 谷歌公司 Methods and apparatus for reinforcement learning
CN105955930A (en) * 2016-05-06 2016-09-21 天津科技大学 Guidance-type policy search reinforcement learning algorithm
WO2018053187A1 (en) * 2016-09-15 2018-03-22 Google Inc. Deep reinforcement learning for robotic manipulation
CN108288094A (en) * 2018-01-31 2018-07-17 清华大学 Deeply learning method and device based on ambient condition prediction

Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
LIN, HONG ET AL.: "Local Semantic Concept-based Human Action Recognition", INFORMATION TECHNOLOGY, no. 12, 31 December 2015 (2015-12-31), XP055683164 *


Legal Events

Date Code Title Description
121 Ep: the epo has been informed by wipo that ep was designated in this application

Ref document number: 18929060

Country of ref document: EP

Kind code of ref document: A1

NENP Non-entry into the national phase

Ref country code: DE

32PN Ep: public notification in the ep bulletin as address of the adressee cannot be established

Free format text: NOTING OF LOSS OF RIGHTS PURSUANT TO RULE 112(1) EPC (EPO FORM 1205A DATED 12.07.2021)

122 Ep: pct application non-entry in european phase

Ref document number: 18929060

Country of ref document: EP

Kind code of ref document: A1