CN113112016A - Action output method for reinforcement learning process, network training method and device - Google Patents

Action output method for reinforcement learning process, network training method and device

Info

Publication number
CN113112016A
CN113112016A
Authority
CN
China
Prior art keywords
action
historical
state
candidate
agent
Prior art date
Legal status
Pending
Application number
CN202110376318.XA
Other languages
Chinese (zh)
Inventor
余昊男
徐伟
张海超
Current Assignee
Beijing Horizon Robotics Technology Research and Development Co Ltd
Original Assignee
Beijing Horizon Robotics Technology Research and Development Co Ltd
Priority date
Filing date
Publication date
Application filed by Beijing Horizon Robotics Technology Research and Development Co Ltd filed Critical Beijing Horizon Robotics Technology Research and Development Co Ltd
Priority to CN202110376318.XA priority Critical patent/CN113112016A/en
Publication of CN113112016A publication Critical patent/CN113112016A/en
Pending legal-status Critical Current

Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/02Neural networks
    • G06N3/08Learning methods
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/02Neural networks
    • G06N3/04Architecture, e.g. interconnection topology
    • G06N3/045Combinations of networks

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Data Mining & Analysis (AREA)
  • General Health & Medical Sciences (AREA)
  • Biomedical Technology (AREA)
  • Biophysics (AREA)
  • Computational Linguistics (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Evolutionary Computation (AREA)
  • Artificial Intelligence (AREA)
  • Molecular Biology (AREA)
  • Computing Systems (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Mathematical Physics (AREA)
  • Software Systems (AREA)
  • Health & Medical Sciences (AREA)
  • Feedback Control In General (AREA)

Abstract

Disclosed are an action output method for a reinforcement learning process, a network training method, and corresponding devices. The action output method for the reinforcement learning process comprises the following steps: determining a first state of an environment where an agent is located at a current time point; determining a first candidate action for the agent at the current time point based on the first state and a first historical action output to the environment by the agent at the previous time point; selecting a target action from the first candidate action and the first historical action; and controlling the agent to output the target action at the current time point. In the embodiments of the present disclosure, the binary switching decision can repeat the same action multiple times, so that an action effectively spans multiple time points. This shortens the task timeline and simplifies the credit assignment problem, which helps ensure the practical deployment of agents enabled by deep reinforcement learning in real-world application scenarios.

Description

Action output method for reinforcement learning process, network training method and device
Technical Field
The present disclosure relates to the field of reinforcement learning technologies, and in particular, to an action output method, a network training method, and an apparatus for reinforcement learning.
Background
At present, reinforcement learning is used more and more widely. When a reinforcement learning method faces a new control task, an agent such as a robot may need a large amount of trial and error and a large amount of time, during which it learns the task from reward signals while risking hardware damage. It should be noted that reinforcement learning methods need to solve the credit assignment problem (i.e., reward assignment). In general, the longer the timeline of a task, the harder the credit assignment problem is to solve, which may hinder the practical deployment of agents enabled by deep reinforcement learning in real-world application scenarios.
Disclosure of Invention
The present disclosure is proposed to solve the above technical problems. The embodiment of the disclosure provides an action output method, a network training method and a device for a reinforcement learning process.
According to an aspect of an embodiment of the present disclosure, there is provided an action output method for a reinforcement learning process, including:
determining a first state of an environment where an agent is located at a current time point;
determining a first candidate action for the agent at a current point in time based on the first state and a first historical action output by the agent to the environment at a previous point in time;
selecting a target action from the first candidate action and the first historical action;
and controlling the agent to output the target action at the current time point.
According to another aspect of the embodiments of the present disclosure, there is provided a network training method, including:
acquiring historical data, wherein a second state of an environment where an agent is located at a first time point and a second historical action output to the environment by the agent at a second time point are recorded in the acquired historical data, and the first time point is a next time point of the second time point;
determining, via a second network, a second candidate action for the agent at the first point in time based on the second state and the second historical action;
determining a probability of being selected for the second candidate action based on the second state, the second historical action, and the second candidate action;
determining a parameter gradient for the second network based on the second state, the second historical action, the second candidate action, and the probability of being selected;
training the second network based on the parameter gradient.
According to still another aspect of an embodiment of the present disclosure, there is provided an action output apparatus for reinforcement learning process, including:
a first determination module, configured to determine a first state of an environment where an agent is located at a current time point;
a second determination module, configured to determine a first candidate action for the agent at the current time point based on the first state determined by the first determination module and a first historical action output to the environment by the agent at a previous time point;
a selection module, configured to select a target action from the first candidate action and the first historical action determined by the second determination module;
and an output module, configured to control the agent to output, at the current time point, the target action selected by the selection module.
According to another aspect of the embodiments of the present disclosure, there is provided a network training apparatus including:
the second acquisition module is used for acquiring historical data, wherein a second state of an environment where the agent is located at a first time point and a second historical action output to the environment by the agent at the second time point are recorded in the acquired historical data, and the first time point is a next time point of the second time point;
a third determination module, configured to determine, via a second network, a second candidate action for the agent at the first time point based on the second state and the second historical action recorded in the historical data acquired by the second acquisition module;
a fourth determination module, configured to determine a selected probability of the second candidate action based on the second state and the second historical action recorded in the historical data acquired by the second acquisition module, and the second candidate action determined by the third determination module;
a fifth determination module, configured to determine a parameter gradient of the second network based on the second state and the second historical action recorded in the historical data acquired by the second acquisition module, the second candidate action determined by the third determination module, and the selected probability determined by the fourth determination module;
a second training module, configured to train the second network based on the parameter gradient determined by the fifth determining module.
According to still another aspect of an embodiment of the present disclosure, there is provided a computer-readable storage medium storing a computer program for executing the above action output method for reinforcement learning process or for executing the above network training method.
According to still another aspect of an embodiment of the present disclosure, there is provided an electronic device including:
a processor;
a memory for storing the processor-executable instructions;
the processor is used for reading the executable instructions from the memory and executing the instructions to realize the action output method for the reinforcement learning process or execute the network training method.
Based on the action output method for the reinforcement learning process, the network training method, the devices, the computer-readable storage medium, and the electronic device provided by the above embodiments of the present disclosure, a first candidate action for an agent at a current time point may be determined based on a first state of the environment where the agent is located at the current time point and a first historical action output to the environment by the agent at the previous time point; a target action may then be selected from the first candidate action and the first historical action, and the agent may be controlled to output the target action at the current time point. This is equivalent to selecting, through a binary switching decision, a suitable action from the first candidate action and the first historical action to output to the environment. In theory, the binary switching decision can repeat the same action multiple times, so that an action effectively spans multiple time points. This shortens the task timeline and simplifies the credit assignment problem, which helps ensure the practical deployment of agents enabled by deep reinforcement learning in real-world application scenarios and helps alleviate the low data utilization efficiency of deep reinforcement learning.
The technical solution of the present disclosure is further described in detail by the accompanying drawings and examples.
Drawings
The above and other objects, features and advantages of the present disclosure will become more apparent by describing in more detail embodiments of the present disclosure with reference to the attached drawings. The accompanying drawings are included to provide a further understanding of the embodiments of the disclosure and are incorporated in and constitute a part of this specification, illustrate embodiments of the disclosure and together with the description serve to explain the principles of the disclosure and not to limit the disclosure. In the drawings, like reference numbers generally represent like parts or steps.
Fig. 1 is a flowchart illustrating an action output method for a reinforcement learning process according to an exemplary embodiment of the present disclosure.
Fig. 2 is a schematic diagram of an action decision process in an embodiment of the present disclosure.
Fig. 3 is a flowchart illustrating an action output method for a reinforcement learning process according to another exemplary embodiment of the present disclosure.
Fig. 4 is a flowchart illustrating an action output method for a reinforcement learning process according to still another exemplary embodiment of the present disclosure.
Fig. 5 is a flowchart illustrating a network training method according to an exemplary embodiment of the present disclosure.
Fig. 6 is a flowchart illustrating a network training method according to another exemplary embodiment of the present disclosure.
Fig. 7 is a schematic structural diagram of an action output device for a reinforcement learning process according to an exemplary embodiment of the present disclosure.
Fig. 8 is a schematic structural diagram of an action output device for a reinforcement learning process according to another exemplary embodiment of the present disclosure.
Fig. 9 is a schematic structural diagram of a network training apparatus according to an exemplary embodiment of the present disclosure.
Fig. 10 is a schematic structural diagram of a network training apparatus according to another exemplary embodiment of the present disclosure.
Fig. 11 is a block diagram of an electronic device provided in an exemplary embodiment of the present disclosure.
Detailed Description
Hereinafter, example embodiments according to the present disclosure will be described in detail with reference to the accompanying drawings. It is to be understood that the described embodiments are merely a subset of the embodiments of the present disclosure and not all embodiments of the present disclosure, with the understanding that the present disclosure is not limited to the example embodiments described herein.
It should be noted that: the relative arrangement of the components and steps, the numerical expressions, and numerical values set forth in these embodiments do not limit the scope of the present disclosure unless specifically stated otherwise.
It will be understood by those skilled in the art that the terms "first", "second", etc. in the embodiments of the present disclosure are used only for distinguishing between different steps, devices, modules, etc., and denote neither any particular technical meaning nor any necessary logical order among them.
It is also understood that in embodiments of the present disclosure, "a plurality" may refer to two or more and "at least one" may refer to one, two or more.
It is also to be understood that any reference to any component, data, or structure in the embodiments of the disclosure, may be generally understood as one or more, unless explicitly defined otherwise or stated otherwise.
The term "and/or" in this disclosure is only one kind of association relationship describing the associated object, and means that there may be three kinds of relationships, for example, a and/or B, which may mean: a exists alone, A and B exist simultaneously, and B exists alone. In addition, the character "/" in the present disclosure generally indicates that the former and latter associated objects are in an "or" relationship.
It should also be understood that the description of the various embodiments of the present disclosure emphasizes the differences between the various embodiments, and the same or similar parts may be referred to each other, so that the descriptions thereof are omitted for brevity.
Meanwhile, it should be understood that the sizes of the respective portions shown in the drawings are not drawn in an actual proportional relationship for the convenience of description.
The following description of at least one exemplary embodiment is merely illustrative in nature and is in no way intended to limit the disclosure, its application, or uses.
Techniques, methods, and apparatus known to those of ordinary skill in the relevant art may not be discussed in detail but are intended to be part of the specification where appropriate.
It should be noted that: like reference numbers and letters refer to like items in the following figures, and thus, once an item is defined in one figure, further discussion thereof is not required in subsequent figures.
The disclosed embodiments may be applied to electronic devices such as terminal devices, computer systems, servers, etc., which are operational with numerous other general purpose or special purpose computing system environments or configurations. Examples of well known terminal devices, computing systems, environments, and/or configurations that may be suitable for use with electronic devices, such as terminal devices, computer systems, servers, and the like, include, but are not limited to: personal computer systems, server computer systems, thin clients, thick clients, hand-held or laptop devices, microprocessor-based systems, set top boxes, programmable consumer electronics, network pcs, minicomputer systems, mainframe computer systems, distributed cloud computing environments that include any of the above systems, and the like.
Electronic devices such as terminal devices, computer systems, servers, etc. may be described in the general context of computer system-executable instructions, such as program modules, being executed by a computer system. Generally, program modules may include routines, programs, objects, components, logic, data structures, etc. that perform particular tasks or implement particular abstract data types. The computer system/server may be practiced in distributed cloud computing environments where tasks are performed by remote processing devices that are linked through a communications network. In a distributed cloud computing environment, program modules may be located in both local and remote computer system storage media including memory storage devices.
Summary of the application
In some cases, there is a need for an agent that can efficiently and autonomously learn to complete different control tasks from different empirical data using the same set of algorithms, with as little human intervention and reprogramming as possible. Such agents include, but are not limited to, unmanned vehicles, robots, and the like; control tasks include, but are not limited to, autonomous driving control, robot locomotion, robotic arm control, and the like.
In response to the above need, one currently promising branch of research is Deep Reinforcement Learning (Deep RL), in which the agent continuously performs trial and error and adjusts its behavior according to a task reward signal, thereby finally achieving the goal of completing the task. Algorithms widely applied in this branch include: Deep Deterministic Policy Gradient (DDPG), Soft Actor-Critic (SAC), and Proximal Policy Optimization (PPO). Methods in this branch emphasize data-driven learning, require essentially no expert knowledge in the control field, and transfer well across different control tasks. However, because data-driven learning is data hungry, an agent facing a new control task may need a large amount of trial and error and a large amount of time, during which it learns the task from reward signals while risking hardware damage.
It should be noted that reinforcement learning methods often suffer from low data utilization efficiency (poor sample efficiency). Taking the currently popular SAC as an example, the amount of data it requires to learn a simple object-grasping operation is several orders of magnitude larger than that required by a human. One important reason for this low data utilization efficiency is that reinforcement learning methods need to solve the credit assignment problem: a current action may play a critical role in obtaining a reward/penalty at a time point far in the future on the timeline, the reward/penalty signal needs to be propagated backward step by step, and in this process the agent needs to correctly attribute the signal to the specific earlier action. Generally, the longer the timeline of the task, the more difficult the credit assignment problem is to solve, which can hinder the practical deployment of deep-reinforcement-learning-enabled agents in real-world application scenarios.
To alleviate the problem of low data utilization efficiency, one sub-field of deep reinforcement learning studies hierarchical control methods, and another studies action duration prediction methods. Hierarchical control methods abstract the actions of the agent by establishing a multi-level control timeline; the timeline at each level is greatly shortened, which simplifies the credit assignment problem. Action duration prediction methods output, together with each action, the number of times the action is to be executed; they belong to open-loop control and have low flexibility: if a long action execution span is predicted, the repeated execution of the action cannot be stopped immediately when an emergency occurs in the environment.
Exemplary System
In embodiments of the present disclosure, reinforcement learning may be considered a decision search problem in a Markov Decision Process (MDP). It will be appreciated that an MDP contains a set of interacting objects, namely the agent and the environment in which the agent is located. The agent is the machine learning entity in the MDP; it can perceive the state of the environment, make decisions, act on the environment, and adjust its decisions according to feedback from the environment. The environment is the collection of everything outside the agent in the MDP; its state changes under the influence of the agent's actions, the change can be fully or partially perceived by the agent, and the environment may feed back a corresponding reward to the agent after each decision.
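For illustration only, the interaction described above can be summarized by the following sketch of a generic agent-environment loop; the `env` and `agent` objects and their methods are hypothetical stand-ins, not interfaces defined by this disclosure:

```python
def run_episode(env, agent, max_steps=1000):
    state = env.reset()                   # agent perceives the initial state of the environment
    prev_action = agent.default_action()  # no action has been output yet
    total_reward = 0.0
    for _ in range(max_steps):
        action = agent.decide(state, prev_action)          # decision based on the state (and, here, the previous action)
        next_state, reward, done = env.step(action)         # environment transitions and feeds back a reward
        agent.observe(state, action, next_state, reward)    # e.g. store the transition for later learning
        total_reward += reward
        state, prev_action = next_state, action
        if done:
            break
    return total_reward
```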
In specific implementations, an electronic device such as the agent, a terminal device, or a server may execute the action output method for the reinforcement learning process provided in the following embodiments of the present disclosure, so as to shorten the task timeline, simplify the credit assignment problem, and thus help ensure the practical deployment of deep-reinforcement-learning-enabled agents in real-world application scenarios. In addition, such electronic devices may also execute the network training method provided in the following embodiments of the present disclosure, so as to train an accurate and reliable second network for use in the above action output method.
Exemplary method
Fig. 1 is a flowchart illustrating an action output method for a reinforcement learning process according to an exemplary embodiment of the present disclosure. The method shown in fig. 1 includes step 101, step 102, step 103, and step 104, and each step is described below.
Step 101, determining a first state of an environment where an agent is located at a current time point.
It should be noted that the time points involved in the embodiments of the present disclosure may also be referred to as time steps, and then, the current time point in step 101 may also be referred to as a current time step, and the previous time point in step 102 below may also be referred to as a previous time step. In addition, the intelligent agent involved in the embodiments of the present disclosure may be any device capable of performing deep reinforcement learning, including but not limited to an unmanned vehicle, a robot, and the like, and the robot may specifically include an intelligent robot such as a sweeping robot.
In step 101, the agent may sense its environment to determine a state of its environment at the current time point, which may be referred to as a first state. Optionally, in the case that the agent is an unmanned vehicle, the first state may be characterized by using camera image data of the unmanned vehicle; when the agent is a sweeping robot, the first state can be characterized by laser radar data of the sweeping robot.
Step 102, determining a first candidate action for the agent at the current time point based on the first state and the first historical action output to the environment by the agent at the previous time point.
In the deep reinforcement learning process, the actions that the agent has output to the environment at various time points may be recorded, for example, in a data replay cache; the data replay cache may be a data buffer arranged in the agent. Optionally, in the case where the agent is an unmanned vehicle, the actions that the agent has output to the environment may include motor torque change actions of the unmanned vehicle; in the case where the agent is a sweeping robot, the actions that the agent has output to the environment may include speed change actions and steering actions of the sweeping robot.
In step 102, according to the information recorded in the data replay cache, the action output by the agent to the environment at the previous time point can be determined conveniently and reliably, and this action can be used as the first historical action; thereafter, a first candidate action for the agent at the current time point may be determined based on the first state and the first historical action. Optionally, the specific manner of determining the first candidate action may refer to the manner in which the current SAC determines the action to be output to the environment at the current time point from the state at the current time point and the action output to the environment at the previous time point, which is not described herein again.
Step 103, selecting a target action from the first candidate action and the first historical action.
In step 103, the first candidate action and the first historical action may be compared to each other, so as to select an action more suitable for the current time point from the first candidate action and the first historical action according to the comparison result, and use the selected action as the target action.
And 104, controlling the intelligent agent to output the target action at the current time point.
In step 104, the agent may control itself to output the target action to the environment at the current time point. The state of the environment may then change under the influence of the target action; the change can be sensed by the agent, and the agent may perform step 101 again.
It should be noted that steps 101 to 103 in the embodiments of the present disclosure can be regarded as an action decision process, and the action decision process includes two stages of decision, i.e., a first-stage decision and a second-stage decision. Specifically, as shown in fig. 2, the first state of the environment where the agent is located at the current time point may be denoted by $s$, and the first historical action output to the environment by the agent at the previous time point may be denoted by $a^{-}$. The first-stage decision may determine, based on $s$ and $a^{-}$, a first candidate action for the agent at the current time point, and the first candidate action may be denoted by $\hat{a}$. In the second-stage decision, the relative quality of $a^{-}$ (the first historical action) and $\hat{a}$ (the first candidate action) may be compared, and one of $a^{-}$ and $\hat{a}$ is selected as the target action, which may be denoted by $a$.
Since the manner of determining the first candidate action $\hat{a}$ may refer to the corresponding manner in SAC, it can be considered that, on the basis of SAC, the embodiments of the present disclosure propose and use a Temporally Abstract Soft Actor-Critic (TASAC) algorithm, which adds a second-stage decision after the SAC decision (the SAC decision corresponds to the first-stage decision above). This second-stage decision may be called a binary switching decision (binary switch policy): by comparing the relative quality of the action generated by SAC (e.g., the first candidate action) and the action output to the environment at the previous time step (e.g., the first historical action), it determines which action to output to the environment at the current time step (e.g., as the target action), that is, whether to stop repeating the previous action (e.g., the first historical action).
Based on the action output method for the reinforcement learning process provided by the above embodiments of the present disclosure, a first candidate action for the agent at the current time point may be determined based on the first state of the environment where the agent is located at the current time point and the action output to the environment by the agent at the previous time point; a target action may be selected from the first candidate action and the first historical action, and the agent may then be controlled to output the target action at the current time point. This is equivalent to selecting, through a binary switching decision, a suitable action from the first candidate action and the first historical action to output to the environment. In theory, the binary switching decision can repeat the same action multiple times, achieving the effect that an action spans multiple time points, thereby shortening the task timeline and simplifying the credit assignment problem. This helps ensure the practical deployment of deep-reinforcement-learning-enabled agents in real-world application scenarios and helps alleviate the low data utilization efficiency of deep reinforcement learning. In addition, the binary switching decision in the embodiments of the present disclosure may be applied to various scenarios and is not limited to a specific scenario. Moreover, the binary switching decision is a form of closed-loop control: it can decide whether to stop repeating the previous action (e.g., the first historical action) according to the specific state of the environment (e.g., the first state), so it is highly flexible and can handle emergencies in the environment.
On the basis of the embodiment shown in fig. 1, as shown in fig. 3, step 103 includes:
Step 1031, based on the first state and the first candidate action, obtaining a future reward impact prediction value corresponding to the first candidate action via the first network.
Here, the first network may be a network obtained by training and capable of calculating and outputting a future reward impact prediction value according to a given state and action. Optionally, after the first network is trained, the first network may also be update-trained periodically or aperiodically.
In step 1031, the first state and the first candidate action only need to be provided to the first network, and the first network performs its own computation to obtain and output a future reward impact prediction value, which can be used as the future reward impact prediction value corresponding to the first candidate action. It should be noted that the future reward impact prediction value corresponding to the first candidate action may refer to: the expected sum of all rewards that the agent can obtain from the environment in the future if the first candidate action is selected and output to the environment at the current time point in the first state.
Step 1032, based on the first state and the first historical action, obtaining a future reward impact prediction value corresponding to the first historical action via the first network.
Similar to step 1031, in step 1032, the first state and the first historical action only need to be provided to the first network, and the first network performs its own computation to obtain and output a future reward impact prediction value, which can be used as the future reward impact prediction value corresponding to the first historical action. It should be noted that the future reward impact prediction value corresponding to the first historical action may refer to: the expected sum of all rewards that the agent can obtain from the environment in the future if the first historical action is selected and output to the environment at the current time point in the first state.
Step 1033, selecting, from the first candidate action and the first historical action, the action with the larger corresponding future reward impact prediction value as the target action.
In step 1033, the future reward impact prediction values corresponding to the first candidate action and the first historical action may be compared. When the future reward impact prediction value corresponding to the first candidate action is greater than that corresponding to the first historical action, the first candidate action may be considered better than the first historical action and better suited to the current time point, so the first candidate action may be selected as the target action; otherwise, the first historical action may be selected as the target action.
Therefore, in the embodiments of the present disclosure, the future reward impact prediction values corresponding to the first candidate action and the first historical action can be determined very conveniently and reliably through the first network, so that the relative quality of the first candidate action and the first historical action can be compared, and a suitable action can thus be conveniently and reliably selected from them as the target action and output to the environment.
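For illustration only, the two-stage decision of steps 101 to 104 and 1031 to 1033 can be sketched as follows, assuming that a `policy_network` (playing the role of the second network) and a `q_network` (playing the role of the first network) are available as callables; this is a sketch of the described procedure, not a definitive implementation:

```python
def output_action(state, prev_action, policy_network, q_network):
    """Two-stage decision: propose a candidate, then keep it or repeat the previous action."""
    # First stage: candidate action for the current time point,
    # conditioned on the state and the action output at the previous time point.
    candidate = policy_network(state, prev_action)

    # Second stage (binary switching decision): compare predicted future reward impact.
    q_candidate = q_network(state, candidate)
    q_previous = q_network(state, prev_action)

    # Select the action with the larger predicted future reward impact as the target action.
    target = candidate if q_candidate > q_previous else prev_action
    return target
```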
On the basis of the embodiment shown in fig. 3, as shown in fig. 4, the method further includes:
Step 111, acquiring multiple sets of historical data, wherein each acquired set of historical data records a state of the environment, the action output by the agent to the environment, the next state of the environment, and the reward generated by the state transition of the environment; the multiple sets of historical data are pairwise adjacent on the timeline, and the actions recorded in the multiple sets of historical data are the same.
It should be noted that, in the deep reinforcement learning process, for any past time point, the state of the environment (which may be denoted as $s_t$), the action output by the agent to the environment (which may be denoted as $a_t$), the next state of the environment (which may be denoted as $s_{t+1}$), and the reward generated by the transition of the environment to the next state (which may be denoted as $r(s_t, a_t, s_{t+1})$) may be determined, and the determined state, action, next state, and reward may be recorded as a set of historical data. In this way, each set of historical data corresponds to one historical time point and may be expressed as (state, action, next state, reward generated by this state transition), i.e., $(s_t, a_t, s_{t+1}, r(s_t, a_t, s_{t+1}))$. Optionally, all historical data may be stored in the data replay cache.
In step 111, the data replay cache may be sampled to obtain multiple sets of historical data that are pairwise adjacent on the timeline and record the same action $a_t$. It should be noted that two sets of historical data being adjacent on the timeline means that the historical time point corresponding to one set of historical data is the time point immediately after the historical time point corresponding to the other set.
Alternatively, when the multiple sets of historical data are acquired from the data replay cache, other conditions may be considered in addition to the condition that the sets are pairwise adjacent on the timeline and record the same action (for convenience of explanation, this condition is hereinafter referred to as the first condition, and the other conditions as the second condition). For example, suppose there are 10 sets of historical data in the data replay cache that are pairwise adjacent and record the same action (assume this action is action C), and the 10 sets of historical data are, in order of their corresponding historical time points from earliest to latest, the 1st set, the 2nd set, ..., and the 10th set. Then the action in the 1st set of historical data may be taken as the historical action and the state in the 2nd set of historical data as the state of the environment where the agent is located, and the two-stage decision described above may be used to determine the action to be output to the environment (equivalent to the process of determining the target action described above); similarly, the action in the 2nd set of historical data may be taken as the historical action and the state in the 3rd set of historical data as the state of the environment, and the two-stage decision may again be used to determine the action to be output to the environment. In a similar manner, a total of 9 actions to be output to the environment may be determined. If all 9 determined actions are action C, the 10 sets of historical data may be obtained from the data replay cache for the network training in the subsequent step 112; if only the first Y (Y being an integer not greater than 8) of the 9 determined actions are action C, the first Y+1 sets of the 10 sets of historical data may be obtained from the data replay cache for the network training in the subsequent step 112. It should be noted that, in this example, the second condition is used to ensure that the action to be output, determined in the same state based on the two-stage decision, is the same as the action recorded in the historical data, which helps ensure the accuracy and reliability of the first network trained in step 112. A sketch of gathering data under the first condition is given below.
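For illustration only, gathering sets of historical data that satisfy the first condition might look like the following sketch, assuming the data replay cache is a time-ordered list of (state, action, next state, reward) tuples and that actions can be compared for equality (both are simplifying assumptions):

```python
def sample_same_action_segment(replay_cache, start_index):
    """Collect consecutive transitions that repeat the same action, starting at start_index."""
    segment = [replay_cache[start_index]]
    action = replay_cache[start_index][1]
    i = start_index + 1
    while i < len(replay_cache) and replay_cache[i][1] == action:
        segment.append(replay_cache[i])   # adjacent transition recording the same action
        i += 1
    return segment                        # training data for the multi-step bootstrap update
```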
Step 112, training the first network by using the multiple sets of historical data as training data and using a multi-step bootstrapping method.
Optionally, when training the first network using the multi-step bootstrapping method, the first network may be trained using an objective function of the following form:

$$J_Q(\theta) = \mathbb{E}_{D}\left[\left(Q_{\theta}(s_t, a_t) - y\right)^2\right],$$

$$y = \sum_{k=t}^{T} \gamma^{\,k-t}\, r(s_k, a_k, s_{k+1}) + \gamma^{\,T-t+1}\, V_{\bar{\theta}}(s_{T+1}),$$

$$V_{\bar{\theta}}(s_{T+1}) = \mathbb{E}_{a \sim \pi^{ta}\left(\cdot \mid s_{T+1},\, a_T\right)}\left[Q_{\bar{\theta}}(s_{T+1}, a)\right],$$

where $\theta$ denotes the parameters of the algorithm used by the first network when calculating the future reward impact prediction value; $D$ denotes the set composed of the multiple sets of historical data; $s_t$, $a_t$, $s_{t+1}$ sequentially denote the state, action, and next state in the same set of historical data; $Q$ denotes the algorithm used by the first network when calculating the future reward impact prediction value; $\bar{\theta}$ denotes a historical version of $\theta$; $\gamma$ denotes a discount constant (representing the depreciation of future rewards over time); $t$ denotes the earliest time point corresponding to the multiple sets of historical data; $T$ denotes the latest time point corresponding to the multiple sets of historical data; $r$ denotes the reward; $V$ is obtained by taking the expectation of $Q$ over actions in the same state; $\pi^{ta}$ denotes the two-stage decision distribution (corresponding to the two-stage decision above), whose first stage is used for determining the candidate action (corresponding to the first-stage decision above) and whose second stage is used for selecting the target action (corresponding to the second-stage decision above); $s_{T+1}$ denotes the state of the environment where the agent is located at the time point immediately after the latest time point corresponding to the multiple sets of historical data; and $a_T$ denotes the action output to the environment at the latest time point corresponding to the multiple sets of historical data.
Assuming that the multiple sets of historical data are specifically 10 sets, and the 10 historical time points corresponding to the 10 sets are, from earliest to latest, historical time point 1, historical time point 2, ..., historical time point 10, then the earliest time point $t$ corresponding to the multiple sets of historical data is historical time point 1, and the latest time point $T$ is historical time point 10.
In specific implementations, after the multiple sets of historical data are obtained from the data replay cache, the state, action, next state, and reward recorded in each set may be substituted into the objective function, and $\theta$, the parameters of the algorithm used by the first network when calculating the future reward impact prediction value, may be obtained by minimizing the value of the objective function. Once $\theta$ is obtained, the first network can be considered successfully trained; at this point, given any state and action, the first network can calculate and output a future reward impact prediction value.
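For illustration only, the multi-step bootstrapped target implied by the above objective can be sketched as follows, where `value` stands for the state value $V$ computed with the historical parameters and is assumed to be available as a helper (not an interface defined by this disclosure):

```python
def multi_step_q_target(segment, value, gamma):
    """segment: consecutive (state, action, next_state, reward) tuples sharing one action."""
    returns = 0.0
    for k, (_, _, _, reward) in enumerate(segment):
        returns += (gamma ** k) * reward                 # discounted rewards from t to T
    last_next_state = segment[-1][2]                     # s_{T+1}
    returns += (gamma ** len(segment)) * value(last_next_state)   # bootstrap once at s_{T+1}
    return returns

# The first network is then trained to regress Q(s_t, a_t) toward this target,
# e.g. by minimizing the squared error over sampled segments.
```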
In the embodiments of the present disclosure, since TASAC adds a second-stage decision (i.e., the binary switching decision) after the SAC decision, the same action may be output repeatedly over several consecutive time steps. Then, based on the above first condition and second condition, multiple sets of historical data may be obtained, and the first network may be trained by the multi-step bootstrapping method with the multiple sets of historical data as training data. This is equivalent to bootstrapping between the first and last time steps among the multiple time steps corresponding to the same action, rather than bootstrapping only between two adjacent time steps as in SAC, which avoids applying bootstrapping many times and thereby effectively improves the training speed of the first network (i.e., improves the learning efficiency of the Q value).
It should be noted that, due to the two-stage decision, TASAC in the embodiments of the present disclosure can be regarded as a hierarchical control method. Owing to the temporal abstraction, TASAC in the embodiments of the present disclosure has two important features: first, persistent exploration along a certain action direction; second, multi-step TD learning. These two features can greatly improve data utilization efficiency, so that compared with SAC, TASAC can achieve the desired learning effect with less data on many control tasks.
Fig. 5 is a flowchart illustrating a network training method according to an exemplary embodiment of the present disclosure. The method shown in fig. 5 includes step 501, step 502, step 503, step 504, and step 505, and each step is explained below.
Step 501, historical data is obtained, wherein a second state of the environment where the agent is located at a first time point and a second historical action output to the environment by the agent at a second time point are recorded in the obtained historical data, and the first time point is a next time point of the second time point.
In the deep reinforcement learning process, for any past time point, the action output to the environment by the agent at that time point and the state of the environment where the agent is located at the next time point can be determined, and the determined state and action may be recorded as a set of historical data. Optionally, all historical data may be stored in the data replay cache; the data replay cache may be a data buffer arranged in the agent.
In step 501, the data replay cache may be sampled, for example with equal probability, to obtain historical data from the data replay cache; the state recorded in the obtained historical data may be used as the second state, and the action recorded in the obtained historical data may be used as the second historical action.
Alternatively, multiple sets of historical data may be acquired from the data replay cache; hereinafter, only the processing of one set of historical data is described as an example.
Step 502, determining a second candidate action for the agent at the first point in time via the second network based on the second state and the second historical action.
Here, the second network may be a network obtained by training and capable of determining candidate actions according to given states and actions. Optionally, steps 501 to 505 may be performed periodically, and the second network used in step 502 of the current cycle may be the second network trained in step 505 of the previous cycle.
In step 502, the second state and the second historical action only need to be provided to the second network, and the second network performs its own computation to determine a candidate action, which may be used as the second candidate action for the agent at the first time point.
Step 503, determining a probability of being selected for the second candidate action based on the second state, the second historical action, and the second candidate action.
Here, the selected probability of the second candidate action may refer to: in the second state, when the above TASAC is employed, the probability that the second candidate action is selected, from the second historical action and the second candidate action, as the action to be output to the environment at the first time point. Optionally, the selected probability of the second candidate action may be denoted as $\beta\!\left(1 \mid s, a^{-}, \hat{a}\right)$, where $s$ denotes the second state, $a^{-}$ denotes the second historical action, and $\hat{a}$ denotes the second candidate action.
Step 504, determining a parameter gradient of the second network based on the second state, the second historical action, the second candidate action, and the selected probability.
Here, a formula for calculating the parameter gradient may be predefined. In step 504, the second state, the second historical action, the second candidate action, and the selected probability only need to be substituted into the formula to conveniently and reliably calculate a parameter gradient, which may be used as the parameter gradient of the second network. Optionally, the parameter gradient of the second network may be denoted as $\Delta\phi$.
And 505, training the second network based on the parameter gradient.
Here, various parameters of the second network may be adjusted based on the parameter gradient to implement update training of the second network. It should be noted that a process of training a network based on parameter gradients also exists in SAC, and the specific implementation of step 505 may refer to the related training process in SAC, which is not described herein again.
Based on the network training method provided by the above embodiments of the present disclosure, after obtaining historical data recording the second state of the environment where the agent is located at the first time point (which is the time point immediately after the second time point) and the second historical action output to the environment by the agent at the second time point, a second candidate action for the agent at the first time point is determined via the second network based on the second state and the second historical action; the selected probability of the second candidate action may then be determined based on the second state, the second historical action, and the second candidate action; the parameter gradient of the second network may then be determined based on the second state, the second historical action, the second candidate action, and the selected probability, and the second network is trained based on the parameter gradient to implement its update training. The updated second network may be used to determine the first candidate action in the action output method for the reinforcement learning process provided by the above embodiments of the present disclosure, so as to ensure the accuracy and reliability of the determined first candidate action, thereby ensuring that the finally selected and output target action is reasonable and reliable, which in turn helps ensure the effect of deep reinforcement learning.
On the basis of the embodiment shown in fig. 5, as shown in fig. 6, step 503 includes:
step 5031, based on the second state and the second candidate action, obtaining a future reward impact prediction value corresponding to the second candidate action via the first network.
Here, the first network may be a network obtained through training and capable of calculating and outputting a future reward impact prediction value according to a given state and action; the training manner of the first network may refer to the description of the corresponding part in the action output method for the reinforcement learning process provided by the above embodiments of the present disclosure, and is not repeated herein.
In step 5031, the second state and the second candidate action only need to be provided to the first network, and the first network performs its own computation to obtain and output a future reward impact prediction value, which can be used as the future reward impact prediction value corresponding to the second candidate action. It should be noted that the future reward impact prediction value corresponding to the second candidate action may refer to: the expected sum of all rewards that the agent can obtain from the environment in the future if the second candidate action is selected and output to the environment at the first time point in the second state.
Step 5032, based on the second state and the second historical action, obtaining a future reward impact prediction value corresponding to the second historical action via the first network.
Similar to step 5031, in step 5032, the second state and the second historical action only need to be provided to the first network, and the first network performs its own computation to obtain and output a future reward impact prediction value, which can be used as the future reward impact prediction value corresponding to the second historical action. It should be noted that the future reward impact prediction value corresponding to the second historical action may refer to: the expected sum of all rewards that the agent can obtain from the environment in the future if the second historical action is selected and output to the environment at the first time point in the second state.
Step 5033, determining the selected probability of the second candidate action based on the future reward impact predicted values corresponding to the second candidate action and the second historical action.
In one embodiment, step 5033 comprises:
calculating a first ratio of the future reward influence predicted value corresponding to the second candidate action to a specified constant;
calculating a second ratio of the future reward influence predicted value corresponding to the second historical action to the designated constant;
determining a first processing result based on a preset constant and a first ratio;
determining a second processing result based on the preset constant and the second ratio;
determining a normalized score for the first processing result based on the first processing result and the second processing result;
based on the normalized scores, a probability of being selected for the second candidate action is determined.
Here, the future reward influence prediction value corresponding to the second candidate action may be represented as Q (1), the future reward influence prediction value corresponding to the second historical action may be represented as Q (0), the designated constant may be represented as α, and the preset constant may be represented as e.
In this embodiment, Q (1) may be divided by α to obtain z1 as the first ratio and Q (0) may be divided by α to obtain z2 as the second ratio.
Next, a first processing result may be determined based on e and z1, and a second processing result may be determined based on e and z 2. Alternatively, an exponential function arithmetic process may be performed with e as the base and z1 as the exponent to take the arithmetic processing result as the first processing result, and thus the first processing result may be expressed as exp (z 1); similarly, an exponential function arithmetic process may be performed with e as the base and z2 as the exponent to take the arithmetic processing result as the second processing result, and thus the second processing result may be expressed as exp (z 2).
Thereafter, a normalized score of exp (z1) as the first processing result may be determined based on exp (z1) as the first processing result and exp (z2) as the second processing result. Alternatively, the normalized fraction of exp (z1) may be expressed as the following equation:
Figure BDA0003010470340000161
similarly, the normalized fraction of exp (z2) can be expressed as the following equation:
Figure BDA0003010470340000162
thereafter, a probability of being selected for the second candidate action may be determined based on the normalized score of exp (z1), e.g., the normalized score of exp (z1) may be directly taken as the probability of being selected for the second candidate action.
In this embodiment, the selected probability of the second candidate action can be calculated very conveniently and reliably from the future reward impact prediction values corresponding to the second candidate action and the second historical action, the designated constant, and the preset constant, using only simple division, exponential-function, and normalization operations.
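For illustration only, the computation of steps 5031 to 5033 can be sketched numerically as follows (the function name and the example numbers are chosen for this sketch only):

```python
import math

def candidate_selected_probability(q_candidate, q_previous, alpha):
    """q_candidate = Q(1), q_previous = Q(0); alpha is the designated constant."""
    z1 = q_candidate / alpha              # first ratio
    z2 = q_previous / alpha               # second ratio
    e1, e2 = math.exp(z1), math.exp(z2)   # first and second processing results
    return e1 / (e1 + e2)                 # normalized score of exp(z1)

# Example: Q(1)=2.0, Q(0)=1.0, alpha=0.5 gives a selected probability of about 0.88.
print(candidate_selected_probability(2.0, 1.0, 0.5))
```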
It can be seen that, in the embodiment of the disclosure, through the first network, the future reward influence predicted values corresponding to the second candidate action and the second historical action can be very conveniently and reliably determined, and based on the determined future reward influence predicted values, the selection probability of the second candidate action can be reasonably determined, so that the selection probability of the second candidate action is used for training of the second network.
In one alternative example, the formula used in determining the parameter gradient of the second network may be:

Δφ = β(b=1 | s, a⁻, â) · ∇_φ Q_θ(s, â),  with  â = f_φ(s, a⁻)

where Δφ denotes the parameter gradient, β(b=1 | s, a⁻, â) represents the selected probability, Q represents the algorithm used by the first network in calculating the future reward influence prediction value, θ represents an algorithm parameter of the algorithm used by the first network in calculating the future reward influence prediction value, s represents the second state, â represents the second candidate action, α represents the specified constant (it appears inside the selected probability), π represents the algorithm used by the second network to determine the candidate action, φ represents an algorithm parameter of the algorithm used by the second network to determine the candidate action, a⁻ represents the second historical action, and f represents the algorithm used by a designated sub-network of the second network.

In the embodiment of the disclosure, if Q, θ, π, and φ are known, Δφ serving as the parameter gradient of the second network can be calculated conveniently and reliably simply by substituting the second state, the second candidate action, and the second historical action into the above formula. Compared with the parameter gradient ∇_φ Q_θ(s, f_φ(s)) in the current SAC, the parameter gradient calculated in the embodiments of the present disclosure is multiplied in front by the coefficient β(b=1 | s, a⁻, â).
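For illustration only, the following Python sketch shows one possible way to apply such a gradient in practice, assuming PyTorch-style networks; the interfaces q_net(s, a) and policy_net(s, a_prev), and the exact loss form, are assumptions rather than part of this disclosure. The selected probability is treated as a constant coefficient that scales the usual actor gradient:

import torch

def second_network_update(q_net, policy_net, s, a_prev, alpha, optimizer):
    # Hedged sketch: q_net(s, a) returns a Q value; policy_net(s, a_prev) returns the candidate action.
    a_hat = policy_net(s, a_prev)                       # second candidate action, differentiable w.r.t. phi
    q_cand = q_net(s, a_hat)                            # future reward influence prediction for the candidate
    with torch.no_grad():
        q_hist = q_net(s, a_prev)                       # prediction for repeating the second historical action
        # selected probability beta(b=1 | s, a-, a^), computed from detached Q values
        beta = torch.softmax(torch.stack([q_cand.detach(), q_hist]) / alpha, dim=0)[0]
    loss = -(beta * q_cand).mean()                      # minimizing this loss ascends beta * grad_phi Q
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()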
The following is a description of the origin of the formula used above in determining the parameter gradient of the second network.
Assume that the decision of the second stage described above is expressed as β(b | s, a⁻, â), where b = 0 indicates selecting a⁻ as the target action and b = 1 indicates selecting â as the target action. To simplify the expression, denote the action finally taken as ã^b, so that ã^0 = a⁻ and ã^1 = â. The decision distribution of the two stages above can then be expressed as:

π^ta(â, b | s, a⁻) = π_φ(â | s) · β(b | s, a⁻, â)
Suppose that
Figure BDA0003010470340000179
Then there are:
Figure BDA00030104703400001710
The following derivation procedure may follow:
Figure BDA00030104703400001711
where:
Figure BDA00030104703400001712
To learn π_φ and β, an objective function can be set as follows:

J(φ, β) = E_{(s, a⁻)∼D} E_{â∼π_φ(·|s), b∼β} [ Q_θ(s, ã^b) − α · log β(b | s, a⁻, â) ]

Then, for any (s, a⁻) ∼ D and â ∼ π_φ(·|s), an optimal β distribution can be derived, which can be expressed as:

β*(b | s, a⁻, â) = exp( Q_θ(s, ã^b) / α ) / ( exp( Q_θ(s, a⁻) / α ) + exp( Q_θ(s, â) / α ) )
The derivation process may specifically be as follows. Assume that there are N values, any of which can be expressed as x(i), i = 0, ..., N−1, and that a discrete distribution P is to be found through the following objective function:

maximize_P   Σ_{i=0}^{N−1} [ p(i) · x(i) − α · p(i) · log p(i) ]
subject to   Σ_{i=0}^{N−1} p(i) = 1

where α may be greater than 0.

By using the Lagrange multiplier λ, the objective function can be transformed into an unconstrained optimization problem, namely:

L(P, λ) = Σ_{i=0}^{N−1} [ p(i) · x(i) − α · p(i) · log p(i) ] + λ · ( Σ_{i=0}^{N−1} p(i) − 1 )

By taking the derivative of the above expression with respect to each p(i) and setting it to 0, the following equation can be obtained:

p(i) = exp( ( x(i) + λ − α ) / α )

where λ needs to ensure that the following equation is satisfied:

Σ_{i=0}^{N−1} exp( ( x(i) + λ − α ) / α ) = 1

which yields the optimal distribution P*, with p*(i) = exp( x(i) / α ) / Σ_{j=0}^{N−1} exp( x(j) / α ).
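The closed form obtained above can be checked numerically. The short Python snippet below (with purely illustrative values of x and α, which are assumptions for the sketch) verifies that the softmax distribution P* scores at least as high as randomly drawn distributions under the entropy-regularized objective:

import numpy as np

rng = np.random.default_rng(0)
x = np.array([1.2, 0.8])        # x(0), x(1): e.g. two Q values
alpha = 0.5                     # the specified constant, greater than 0

def objective(p):
    # sum_i p(i) x(i) - alpha * sum_i p(i) log p(i)
    return np.sum(p * x) - alpha * np.sum(p * np.log(p))

p_star = np.exp(x / alpha) / np.sum(np.exp(x / alpha))   # the closed-form optimum P*

for _ in range(5):
    q = rng.dirichlet(np.ones(2))                        # a random distribution on {0, 1}
    assert objective(q) <= objective(p_star) + 1e-9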
In order to derive the optimal β distribution for an arbitrary (s, a⁻) ∼ D, â ∼ π_φ(·|s), and b ∼ β, we can let

x(0) = Q_θ(s, a⁻)

and let

x(1) = Q_θ(s, â).

Thus, the optimal β distribution can be found in the same form as P* above.
It should be noted that, intuitively, the form of the optimal β distribution indicates that whether to select the action of the previous time point or the current candidate action is decided mainly by comparing the Q values corresponding to the two actions, and the probability of selecting an action is proportional to exp() of its Q value; therefore, in the specific implementation of step 5033, the selected probability of the second candidate action may be calculated by combining division, exponential function, and normalization operations. In addition, by substituting the form of this optimal β distribution into the objective function for learning π_φ and β, the objective function can be simplified into an objective function that depends only on π_φ:
J(φ) = E_{(s, a⁻)∼D} E_{â∼π_φ(·|s)} [ α · log( exp( Q_θ(s, a⁻) / α ) + exp( Q_θ(s, â) / α ) ) ]
By taking the derivative of this objective function with respect to φ, the above formula used in determining the parameter gradient of the second network is obtained.
It should be noted that, when the multi-step bootstrapping method is used to train the first network so as to increase the training speed of the first network, this is also beneficial for increasing the training speed of the second network (i.e., for promoting faster learning of π_φ), so that the TASAC in embodiments of the present disclosure may be more data efficient than the current SAC.
Any of the action output methods for reinforcement learning processes provided by embodiments of the present disclosure may be performed by any suitable device having data processing capabilities, including but not limited to: terminal equipment, servers, agents, etc. Alternatively, any of the action output methods for the reinforcement learning process provided by the embodiments of the present disclosure may be executed by a processor, for example, the processor may execute any of the action output methods for the reinforcement learning process mentioned by the embodiments of the present disclosure by calling a corresponding instruction stored in a memory. And will not be described in detail below.
Any of the network training methods provided by the embodiments of the present disclosure may be performed by any suitable device having data processing capabilities, including but not limited to: terminal equipment, servers, agents, etc. Alternatively, any of the network training methods provided by the embodiments of the present disclosure may be executed by a processor, such as the processor executing any of the network training methods mentioned by the embodiments of the present disclosure by calling corresponding instructions stored in a memory. And will not be described in detail below.
Exemplary devices
Fig. 7 is a schematic structural diagram of an action output device for a reinforcement learning process according to an exemplary embodiment of the present disclosure. The apparatus shown in fig. 7 comprises a first determining module 701, a second determining module 702, a selecting module 703 and an output module 704.
A first determining module 701, configured to determine a first state of an environment where an agent is located at a current time point;
a second determining module 702, configured to determine a first candidate action for the agent at the current time point based on the first state determined by the first determining module 701 and the first historical action output by the agent to the environment at the previous time point;
a selecting module 703, configured to select a target action from the first candidate action and the first historical action determined by the second determining module 702;
and an output module 704, configured to control the agent to output the target action selected by the selection module 703 at the current time point.
In an alternative example, as shown in fig. 8, the selection module 703 includes:
a first obtaining sub-module 7031, configured to obtain, via the first network, a future reward impact prediction value corresponding to the first candidate action determined by the second determining module 702 based on the first state determined by the first determining module 701 and the first candidate action determined by the second determining module 702;
a second obtaining sub-module 7032, configured to obtain, based on the first state and the first historical action determined by the first determining module 701, a future reward influence predicted value corresponding to the first historical action via the first network;
the selecting submodule 7032 is configured to select, from the first candidate action and the first historical action determined by the second determining module 702, an action with a larger future reward influence predicted value in the future reward influence predicted values acquired by the first acquiring submodule 7031 and the second acquiring submodule 7032 as the target action.
In an alternative example, as shown in fig. 8, the apparatus further comprises:
the first obtaining module 711 is configured to obtain multiple sets of historical data, where the obtained historical data of each set are recorded with an environment state, an action that the agent outputs to the environment, a next state of the environment, and a reward generated by the state transition of the environment, and the multiple sets of historical data are adjacent to each other on a time line, and actions recorded in the multiple sets of historical data are the same;
the first training module 712 is configured to train the first network by using the multiple sets of historical data acquired by the first acquiring module 711 as training data and using a multi-step self-help method.
Alternatively, the first obtaining module 711 may obtain multiple sets of historical data from the data return cache based on the above first condition; alternatively, the first obtaining module 711 may obtain multiple sets of historical data from the data return cache based on the first condition and the second condition.
In one alternative example, the first network is trained using an objective function of the following form:

minimize_θ   E_D [ ( Q_θ(s_t, a_t) − y )² ]

y = Σ_{k=t}^{T} γ^{k−t} · r_k + γ^{T+1−t} · V_{θ̄}(s_{T+1}, a_T),   with   V_{θ̄}(s_{T+1}, a_T) = E_{a ∼ π^ta(· | s_{T+1}, a_T)} [ Q_{θ̄}(s_{T+1}, a) ]

Here, θ denotes an algorithm parameter of the algorithm used when the first network calculates the future reward influence prediction value, D denotes a set composed of the multiple sets of historical data, s_t, a_t, and s_{t+1} sequentially represent the state, the action, and the next state recorded in the same set of historical data, Q represents the algorithm used by the first network in calculating future reward influence prediction values, θ̄ represents a historical version of θ, γ represents a discount constant, t represents the earliest time point corresponding to the multiple sets of historical data, T represents the latest time point corresponding to the multiple sets of historical data, r represents a reward, V is the expectation of Q over actions in the same state, π^ta represents the two-stage decision distribution, of which the first stage is used for candidate action determination and the second stage is used for target action selection, s_{T+1} represents the state of the environment where the agent is located at the time point next to the latest time point corresponding to the multiple sets of historical data, and a_T represents the action output to the environment at the latest time point corresponding to the multiple sets of historical data.
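To make the multi-step bootstrapped target concrete, the following minimal Python sketch builds the target y from the rewards of several adjacent history groups that share the same repeated action; the exact discounting and bootstrapping form shown here is an assumption consistent with the description above, and the numbers are purely illustrative:

def n_step_q_target(rewards, gamma, v_last):
    # rewards: r_t, ..., r_T recorded in the adjacent history groups
    # v_last:  bootstrap value V(s_{T+1}, a_T) estimated with the historical parameters theta-bar
    target = v_last
    for r in reversed(rewards):
        target = r + gamma * target                # discounted backward accumulation
    return target

# Example with three consecutive time points sharing the same action and gamma = 0.99:
y = n_step_q_target([1.0, 0.5, 0.2], gamma=0.99, v_last=3.0)
# the first network is then trained to reduce (Q_theta(s_t, a_t) - y) ** 2 over sampled data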
Fig. 9 is a schematic structural diagram of a network training apparatus according to an exemplary embodiment of the present disclosure. The apparatus shown in fig. 9 comprises a second obtaining module 901, a third determining module 902, a fourth determining module 903, a fifth determining module 904 and a second training module 905.
A second obtaining module 901, configured to obtain historical data, where a second state of an environment where the agent is located at a first time point and a second historical action output by the agent to the environment at the second time point are recorded in the obtained historical data, and the first time point is a next time point of the second time point;
a third determining module 902, configured to determine, via the second network, a second candidate action for the agent at the first time point based on the second state and the second historical action recorded in the historical data acquired by the second acquiring module 901;
a fourth determining module 903, configured to determine a selection probability of a second candidate action based on the second state and the second historical action recorded in the historical data acquired by the second acquiring module 901 and the second candidate action determined by the third determining module 902;
a fifth determining module 904, configured to determine a parameter gradient of the second network based on the second state and the second historical motion recorded in the historical data acquired by the second acquiring module 901, the second candidate motion determined by the third determining module 902, and the probability of being selected determined by the fourth determining module 903;
a second training module 905, configured to train the second network based on the parameter gradient determined by the fifth determining module 904.
In an alternative example, as shown in fig. 10, the fourth determining module 903 includes:
a third obtaining sub-module 9031, configured to obtain, via the first network, a future reward influence predicted value corresponding to the second candidate action determined by the third determining module 902, based on the second state recorded in the history data obtained by the second obtaining module 901 and the second candidate action determined by the third determining module 902;
a fourth obtaining sub-module 9032, configured to obtain, based on the second state and the second historical action recorded in the historical data obtained by the second obtaining module 901, a future reward influence prediction value corresponding to the second historical action recorded in the historical data obtained by the second obtaining module 901 via the first network;
the determining submodule 9033 is configured to determine, based on the future reward influence predicted value corresponding to the second candidate action acquired by the third acquiring submodule 9031 and the future reward influence predicted value corresponding to the second historical action acquired by the fourth acquiring submodule 9032, the selected probability of the second candidate action determined by the third determining module 902.
In an alternative example, as shown in fig. 10, determining sub-module 9033 includes:
the first calculating unit 90331 is configured to calculate a first ratio of the future reward influence predicted value corresponding to the second candidate action acquired by the third acquiring sub-module 9031 to the specific constant;
the second calculating unit 90332 is configured to calculate a second ratio of the future reward influence predicted value corresponding to the second historical action acquired by the fourth acquiring sub-module 9032 to the specific constant;
a first determining unit 90333 configured to determine a first processing result based on a preset constant and the first ratio calculated by the first calculating unit 90331;
a second determining unit 90334, configured to determine a second processing result based on a preset constant and a second ratio calculated by the second calculating unit 90332;
a third determining unit 90335 for determining a normalized score of the first processing result based on the first processing result determined by the first determining unit 90333 and the second processing result determined by the second determining unit 90334;
a fourth determining unit 90336, configured to determine the probability of being selected of the second candidate action based on the normalized score determined by the third determining unit 90335.
In an alternative example, the formula used in determining the parameter gradient is:

Δφ = β(b=1 | s, a⁻, â) · ∇_φ Q_θ(s, â),  with  â = f_φ(s, a⁻)

where Δφ denotes the parameter gradient, β(b=1 | s, a⁻, â) represents the selected probability, Q represents the algorithm used by the first network in calculating the future reward influence prediction value, θ represents an algorithm parameter of the algorithm used by the first network in calculating the future reward influence prediction value, s represents the second state, â represents the second candidate action, α represents the specified constant (it appears inside the selected probability), π represents the algorithm used by the second network to determine the candidate action, φ represents an algorithm parameter of the algorithm used by the second network to determine the candidate action, a⁻ represents the second historical action, and f represents the algorithm used by a designated sub-network of the second network.
Exemplary electronic device
Next, an electronic apparatus according to an embodiment of the present disclosure is described with reference to fig. 11. The electronic device may be either or both of the first device and the second device, or a stand-alone device separate from them, which stand-alone device may communicate with the first device and the second device to receive the acquired input signals therefrom.
FIG. 11 illustrates a block diagram of an electronic device in accordance with an embodiment of the disclosure.
As shown in fig. 11, an electronic device 1100 includes one or more processors 1101 and a memory 1102.
The processor 1101 may be a Central Processing Unit (CPU) or other form of processing unit having data processing capabilities and/or instruction execution capabilities, and may control other components in the electronic device 1100 to perform desired functions.
The memory 1102 may include one or more computer program products that may include various forms of computer-readable storage media, such as volatile memory and/or non-volatile memory. The volatile memory may include, for example, Random Access Memory (RAM), cache memory (cache), and/or the like. The non-volatile memory may include, for example, Read Only Memory (ROM), hard disk, flash memory, etc. One or more computer program instructions may be stored on the computer-readable storage medium and executed by the processor 1101 to implement the action output method or the network training method for the reinforcement learning process of the various embodiments of the present disclosure described above. Various contents such as an input signal, a signal component, a noise component, etc. may also be stored in the computer-readable storage medium.
In one example, the electronic device 1100 may further include: an input device 1103 and an output device 1104, which are interconnected by a bus system and/or other form of connection mechanism (not shown).
For example, when the electronic device is a first device or a second device, the input device 1103 may be a microphone or a microphone array. When the electronic device is a stand-alone device, the input device 1103 may be a communication network connector for receiving the acquired input signals from the first device and the second device.
The input device 1103 may also include, for example, a keyboard, a mouse, and the like.
The output device 1104 can output various information including the determined distance information, direction information, and the like to the outside. The output devices 1104 may include, for example, a display, speakers, a printer, and a communication network and its connected remote output devices, among others.
Of course, for simplicity, only some of the components of the electronic device 1100 relevant to the present disclosure are shown in fig. 11, omitting components such as buses, input/output interfaces, and the like. In addition, electronic device 1100 may include any other suitable components depending on the particular application.
Exemplary computer program product and computer-readable storage Medium
In addition to the above methods and apparatus, embodiments of the present disclosure may also be a computer program product comprising computer program instructions that, when executed by a processor, cause the processor to perform the steps in the action output method or network training method for reinforcement learning process according to various embodiments of the present disclosure described in the "exemplary methods" section above in this specification.
The computer program product may write program code for carrying out operations for embodiments of the present disclosure in any combination of one or more programming languages, including an object oriented programming language such as Java, C++ or the like and conventional procedural programming languages, such as the "C" programming language or similar programming languages. The program code may execute entirely on the user's computing device, partly on the user's device, as a stand-alone software package, partly on the user's computing device and partly on a remote computing device, or entirely on the remote computing device or server.
Furthermore, embodiments of the present disclosure may also be a computer-readable storage medium having stored thereon computer program instructions that, when executed by a processor, cause the processor to perform steps in a method of action output or network training for reinforcement learning processes according to various embodiments of the present disclosure described in the "exemplary methods" section above in this specification.
The computer-readable storage medium may take any combination of one or more readable media. The readable medium may be a readable signal medium or a readable storage medium. A readable storage medium may include, for example, but not limited to, an electronic, magnetic, optical, electromagnetic, infrared, or semiconductor system, apparatus, or device, or a combination of any of the foregoing. More specific examples (a non-exhaustive list) of the readable storage medium include: an electrical connection having one or more wires, a portable disk, a hard disk, a Random Access Memory (RAM), a read-only memory (ROM), an erasable programmable read-only memory (EPROM or flash memory), an optical fiber, a portable compact disc read-only memory (CD-ROM), an optical storage device, a magnetic storage device, or any suitable combination of the foregoing.
The foregoing describes the general principles of the present disclosure in conjunction with specific embodiments, however, it is noted that the advantages, effects, etc. mentioned in the present disclosure are merely examples and are not limiting, and they should not be considered essential to the various embodiments of the present disclosure. Furthermore, the foregoing disclosure of specific details is for the purpose of illustration and description and is not intended to be limiting, since the disclosure is not intended to be limited to the specific details so described.
In the present specification, the embodiments are described in a progressive manner, each embodiment focuses on differences from other embodiments, and the same or similar parts in the embodiments are referred to each other. For the system embodiment, since it basically corresponds to the method embodiment, the description is relatively simple, and for the relevant points, reference may be made to the partial description of the method embodiment.
The block diagrams of devices, apparatuses, and systems referred to in this disclosure are only given as illustrative examples and are not intended to require or imply that the connections, arrangements, and configurations must be made in the manner shown in the block diagrams. These devices, apparatuses, and systems may be connected, arranged, and configured in any manner, as will be appreciated by those skilled in the art. Words such as "including," "comprising," "having," and the like are open-ended words that mean "including, but not limited to," and are used interchangeably therewith. The word "or" as used herein means, and is used interchangeably with, the word "and/or," unless the context clearly dictates otherwise. The word "such as" is used herein to mean, and is used interchangeably with, the phrase "such as but not limited to".
The methods and apparatus of the present disclosure may be implemented in a number of ways. For example, the methods and apparatus of the present disclosure may be implemented by software, hardware, firmware, or any combination of software, hardware, and firmware. The above-described order for the steps of the method is for illustration only, and the steps of the method of the present disclosure are not limited to the order specifically described above unless specifically stated otherwise. Further, in some embodiments, the present disclosure may also be embodied as programs recorded in a recording medium, the programs including machine-readable instructions for implementing the methods according to the present disclosure. Thus, the present disclosure also covers a recording medium storing a program for executing the method according to the present disclosure.
It is also noted that in the devices, apparatuses, and methods of the present disclosure, each component or step can be decomposed and/or recombined. These decompositions and/or recombinations are to be considered equivalents of the present disclosure.
The previous description of the disclosed aspects is provided to enable any person skilled in the art to make or use the present disclosure. Various modifications to these aspects will be readily apparent to those skilled in the art, and the generic principles defined herein may be applied to other aspects without departing from the scope of the disclosure. Thus, the present disclosure is not intended to be limited to the aspects shown herein but is to be accorded the widest scope consistent with the principles and novel features disclosed herein.
The foregoing description has been presented for purposes of illustration and description. Furthermore, this description is not intended to limit embodiments of the disclosure to the form disclosed herein. While a number of example aspects and embodiments have been discussed above, those of skill in the art will recognize certain variations, modifications, alterations, additions and sub-combinations thereof.

Claims (10)

1. A motion output method for a reinforcement learning process, comprising:
determining a first state of an environment where an agent is located at a current time point;
determining a first candidate action for the agent at a current point in time based on the first state and a first historical action output by the agent to the environment at a previous point in time;
selecting a target action from the first candidate action and the first historical action;
and controlling the intelligent agent to output the target action at the current time point.
2. The method of claim 1, wherein said selecting a target action from said first candidate action and said first historical action comprises:
obtaining, via a first network, a future reward impact prediction value corresponding to the first candidate action based on the first state and the first candidate action;
obtaining, via the first network, a future reward impact prediction value corresponding to the first historical action based on the first state and the first historical action;
and selecting an action with a larger corresponding future reward influence predicted value from the first candidate action and the first historical action as a target action.
3. The method of claim 2, wherein the method further comprises:
acquiring multiple groups of historical data, wherein the state of the environment, the action output by the agent to the environment, the next state of the environment and the reward generated by the state transition of the environment are recorded in each group of acquired historical data, the multiple groups of historical data are adjacent to each other on a time line, and the actions recorded in the multiple groups of historical data are the same;
and training the first network by using the multiple groups of historical data as training data and utilizing a multi-step bootstrapping method.
4. A network training method, comprising:
acquiring historical data, wherein a second state of an environment where an agent is located at a first time point and a second historical action output to the environment by the agent at a second time point are recorded in the acquired historical data, and the first time point is a next time point of the second time point;
determining, via a second network, a second candidate action for the agent at the first point in time based on the second state and the second historical action;
determining a probability of being selected for the second candidate action based on the second state, the second historical action, and the second candidate action;
determining a parameter gradient for the second network based on the second state, the second historical action, the second candidate action, and the probability of being selected;
training the second network based on the parameter gradient.
5. The method of claim 4, wherein the determining the probability of being selected of the second candidate action based on the second state, the second historical action, and the second candidate action comprises:
obtaining, via a first network, a future reward impact prediction value corresponding to the second candidate action based on the second state and the second candidate action;
obtaining, via the first network, a future reward impact prediction value corresponding to the second historical action based on the second state and the second historical action;
determining a probability of being selected for the second candidate action based on future reward impact predictors for the second candidate action and the second historical action, respectively.
6. The method of claim 5, wherein the determining the selected probability of the second candidate action based on future reward impact predictors for the second candidate action and the second historical action, respectively, comprises:
calculating a first ratio of a future reward impact predicted value corresponding to the second candidate action to a specified constant;
calculating a second ratio of the future reward influence predicted value corresponding to the second historical action to a specified constant;
determining a first processing result based on a preset constant and the first ratio;
determining a second processing result based on the preset constant and the second ratio;
determining a normalized score for the first processing result based on the first processing result and the second processing result;
determining a probability of being selected of the second candidate action based on the normalized score.
7. A motion output device for a reinforcement learning process, comprising:
the intelligent agent monitoring system comprises a first determining module, a second determining module and a monitoring module, wherein the first determining module is used for determining a first state of an environment where an intelligent agent is located at the current time point;
a second determination module for determining a first candidate action for the agent at a current point in time based on the first state determined by the first determination module and a first historical action output by the agent to the environment at a previous point in time;
a selection module for selecting a target action from the first candidate action and the first historical action determined by the second determination module;
and the output module is used for controlling the intelligent agent to output the target action selected by the selection module at the current time point.
8. A network training apparatus comprising:
the second acquisition module is used for acquiring historical data, wherein a second state of an environment where the agent is located at a first time point and a second historical action output to the environment by the agent at the second time point are recorded in the acquired historical data, and the first time point is a next time point of the second time point;
a third determination module, configured to determine, via a second network, a second candidate action for the agent at the first time point based on the second state and the second historical action recorded in the historical data acquired by the acquisition module;
a fourth determining module, configured to determine, based on the second state and the second historical motion recorded in the historical data acquired by the acquiring module, and the second candidate motion determined by the third determining module, a selection probability of the second candidate motion;
a fifth determining module, configured to determine a parameter gradient of the second network based on the second state and the second historical motion recorded in the historical data acquired by the acquiring module, the second candidate motion determined by the third determining module, and the selected probability determined by the fourth determining module;
a second training module, configured to train the second network based on the parameter gradient determined by the fifth determining module.
9. A computer-readable storage medium storing a computer program for executing the action output method for reinforcement learning process of any one of claims 1 to 3 or the network training method of any one of claims 4 to 6.
10. An electronic device, comprising:
a processor;
a memory for storing the processor-executable instructions;
the processor is configured to read the executable instructions from the memory and execute the instructions to implement the action output method for reinforcement learning process of any one of claims 1 to 3, or to execute the network training method of any one of claims 4 to 6.
CN202110376318.XA 2021-04-07 2021-04-07 Action output method for reinforcement learning process, network training method and device Pending CN113112016A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202110376318.XA CN113112016A (en) 2021-04-07 2021-04-07 Action output method for reinforcement learning process, network training method and device

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202110376318.XA CN113112016A (en) 2021-04-07 2021-04-07 Action output method for reinforcement learning process, network training method and device

Publications (1)

Publication Number Publication Date
CN113112016A true CN113112016A (en) 2021-07-13

Family

ID=76714647

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202110376318.XA Pending CN113112016A (en) 2021-04-07 2021-04-07 Action output method for reinforcement learning process, network training method and device

Country Status (1)

Country Link
CN (1) CN113112016A (en)

Cited By (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN113807460A (en) * 2021-09-27 2021-12-17 北京地平线机器人技术研发有限公司 Method and device for determining intelligent body action, electronic equipment and medium
CN114141028A (en) * 2021-11-19 2022-03-04 哈尔滨工业大学(深圳) Intelligent traffic light traffic flow regulation and control system

Citations (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN110826725A (en) * 2019-11-07 2020-02-21 深圳大学 Intelligent agent reinforcement learning method, device and system based on cognition, computer equipment and storage medium
WO2020121494A1 (en) * 2018-12-13 2020-06-18 日本電気株式会社 Arithmetic device, action determination method, and non-transitory computer-readable medium storing control program
CN111612126A (en) * 2020-04-18 2020-09-01 华为技术有限公司 Method and device for reinforcement learning
US20200372366A1 (en) * 2019-05-23 2020-11-26 Deepmind Technologies Limited Jointly learning exploratory and non-exploratory action selection policies

Patent Citations (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
WO2020121494A1 (en) * 2018-12-13 2020-06-18 日本電気株式会社 Arithmetic device, action determination method, and non-transitory computer-readable medium storing control program
US20200372366A1 (en) * 2019-05-23 2020-11-26 Deepmind Technologies Limited Jointly learning exploratory and non-exploratory action selection policies
CN110826725A (en) * 2019-11-07 2020-02-21 深圳大学 Intelligent agent reinforcement learning method, device and system based on cognition, computer equipment and storage medium
CN111612126A (en) * 2020-04-18 2020-09-01 华为技术有限公司 Method and device for reinforcement learning

Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
ZHANG Wenzhi et al.: "Reinforcement learning for multi-agent robots based on an adaptive fuzzy RBF neural network", Computer Engineering and Applications, no. 32, pages 111-115 *

Cited By (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN113807460A (en) * 2021-09-27 2021-12-17 北京地平线机器人技术研发有限公司 Method and device for determining intelligent body action, electronic equipment and medium
CN113807460B (en) * 2021-09-27 2024-05-14 北京地平线机器人技术研发有限公司 Method and device for determining actions of intelligent agent, electronic equipment and medium
CN114141028A (en) * 2021-11-19 2022-03-04 哈尔滨工业大学(深圳) Intelligent traffic light traffic flow regulation and control system

Similar Documents

Publication Publication Date Title
US11842281B2 (en) Reinforcement learning with auxiliary tasks
CN109313722B (en) Memory efficient time-based back propagation
CN112119409B (en) Neural network with relational memory
US9687984B2 (en) Apparatus and methods for training of robots
US10460230B2 (en) Reducing computations in a neural network
US10679126B2 (en) Action selection for reinforcement learning using neural networks
US10887607B2 (en) Making object-level predictions of the future state of a physical system
CN113112016A (en) Action output method for reinforcement learning process, network training method and device
US10860895B2 (en) Imagination-based agent neural networks
CN110892420A (en) Imagination-based proxy neural network
JP7474446B2 (en) Projection Layer of Neural Network Suitable for Multi-Label Prediction
JP7460703B2 (en) Improved recommender system and method using shared neural item representations for cold-start recommendation
US11126190B2 (en) Learning systems and methods
CN116097277A (en) Method and system for training neural network models using progressive knowledge distillation
Lin et al. An ensemble method for inverse reinforcement learning
CN114648103A (en) Automatic multi-objective hardware optimization for processing deep learning networks
JP2020505672A (en) Learning apparatus, method and computer program for bidirectional learning of prediction model based on data sequence
CN111242162A (en) Training method and device of image classification model, medium and electronic equipment
US11989658B2 (en) Method and apparatus for reinforcement machine learning
CN116467594A (en) Training method of recommendation model and related device
WO2023086196A1 (en) Domain generalizable continual learning using covariances
US20230019364A1 (en) Selection method of learning data and computer system
JP7464115B2 (en) Learning device, learning method, and learning program
CN111159558B (en) Recommendation list generation method and device and electronic equipment
Kolb et al. Learning to Request Guidance in Emergent Communication

Legal Events

Date Code Title Description
PB01 Publication
PB01 Publication
SE01 Entry into force of request for substantive examination
SE01 Entry into force of request for substantive examination