CN114397817A - Network training method, robot control method, network training device, robot control device, equipment and storage medium


Info

Publication number
CN114397817A
Authority
CN
China
Prior art keywords
action
reward
reinforcement learning
value
learning network
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202111659123.2A
Other languages
Chinese (zh)
Inventor
李楚鸣
刘宇
王晓刚
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Shanghai Sensetime Technology Development Co Ltd
Original Assignee
Shanghai Sensetime Technology Development Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Shanghai Sensetime Technology Development Co Ltd filed Critical Shanghai Sensetime Technology Development Co Ltd
Priority to CN202111659123.2A priority Critical patent/CN114397817A/en
Publication of CN114397817A publication Critical patent/CN114397817A/en
Priority to PCT/CN2022/094863 priority patent/WO2023123838A1/en
Pending legal-status Critical Current

Classifications

    • GPHYSICS
    • G05CONTROLLING; REGULATING
    • G05BCONTROL OR REGULATING SYSTEMS IN GENERAL; FUNCTIONAL ELEMENTS OF SUCH SYSTEMS; MONITORING OR TESTING ARRANGEMENTS FOR SUCH SYSTEMS OR ELEMENTS
    • G05B13/00Adaptive control systems, i.e. systems automatically adjusting themselves to have a performance which is optimum according to some preassigned criterion
    • G05B13/02Adaptive control systems, i.e. systems automatically adjusting themselves to have a performance which is optimum according to some preassigned criterion electric
    • G05B13/04Adaptive control systems, i.e. systems automatically adjusting themselves to have a performance which is optimum according to some preassigned criterion electric involving the use of models or simulators
    • G05B13/042Adaptive control systems, i.e. systems automatically adjusting themselves to have a performance which is optimum according to some preassigned criterion electric involving the use of models or simulators in which a parameter or coefficient is automatically adjusted to optimise the performance
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/02Neural networks
    • G06N3/04Architecture, e.g. interconnection topology
    • G06N3/045Combinations of networks
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/02Neural networks
    • G06N3/08Learning methods

Landscapes

  • Engineering & Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Theoretical Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Artificial Intelligence (AREA)
  • Evolutionary Computation (AREA)
  • Health & Medical Sciences (AREA)
  • Software Systems (AREA)
  • General Health & Medical Sciences (AREA)
  • General Engineering & Computer Science (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Biomedical Technology (AREA)
  • Biophysics (AREA)
  • Computational Linguistics (AREA)
  • Data Mining & Analysis (AREA)
  • Mathematical Physics (AREA)
  • Molecular Biology (AREA)
  • Computing Systems (AREA)
  • Automation & Control Theory (AREA)
  • Medical Informatics (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Feedback Control In General (AREA)
  • Manipulator (AREA)

Abstract

The present disclosure provides a network training method, a robot control method, a device, an apparatus and a storage medium. The training method comprises: acquiring environmental state information in a target application scene; obtaining action sequence information according to the environmental state information and a pre-trained reinforcement learning network, and determining a total reward value corresponding to the action sequence information, wherein the action sequence information is used for indicating at least two continuous execution actions within a preset time length in the future; and adjusting the network parameter values of the reinforcement learning network based on the total reward value to obtain a trained reinforcement learning network. In the present disclosure, the reinforcement learning network adjusted by the total reward value is better adapted to the generation of action sequences, so that the generated action sequences improve as the reinforcement learning network is adjusted. The trained reinforcement learning network therefore also has better control performance when applied to complex scenes such as continuous control.

Description

Network training method, robot control method, network training device, robot control device, equipment and storage medium
Technical Field
The present disclosure relates to the field of computer technologies, and in particular, to a method and an apparatus for network training and robot control, a device and a storage medium.
Background
At present, deep reinforcement learning combines the strong comprehension ability of deep learning on perception problems such as vision with the decision-making ability of reinforcement learning, which makes it possible to solve some complex problems in real scenes.
Taking robot control as an example, the decision-making capability of reinforcement learning can help the robot adaptively adjust its output action according to the surrounding environment information: each environment interaction controls the robot to execute one corresponding action, and as the number of environment interactions increases, the executed actions become more accurate and better suited to the scene requirements.
However, in some complicated control scenarios, such as controlling a robot to perform a set of continuous actions, the reinforcement learning method described above has poor control performance.
Disclosure of Invention
The embodiment of the disclosure at least provides a network training method, a robot control method, a network training device, a robot control device, equipment and a storage medium.
In a first aspect, an embodiment of the present disclosure provides a network training method, including:
acquiring environmental state information in a target application scene;
obtaining action sequence information according to the environment state information and a pre-trained reinforcement learning network, and determining a return reward total value corresponding to the action sequence information, wherein the action sequence information is used for indicating at least two continuous execution actions within a future preset time length;
adjusting the network parameter value of the reinforcement learning network based on the total reward value to obtain a trained reinforcement learning network; the trained reinforcement learning network is used for acquiring a target action sequence for continuously controlling a target object, and the total return reward value corresponding to the target action sequence is greater than a preset threshold value.
By adopting the network training method, the network parameter values of the reinforcement learning network can be adjusted based on the total reward value. The total reward value comprehensively evaluates the relationship between the at least two continuous execution actions, so that the reinforcement learning network adjusted by it is better adapted to generating action sequences, and the generated action sequences improve as the network is adjusted. The trained reinforcement learning network therefore also has better control performance when applied to complex scenes such as continuous control.
In an optional embodiment, the determining a total reward value corresponding to the action sequence information includes:
the action sequence information is acted on the target application scene to obtain environment state sequence information and a return reward value set corresponding to the action sequence information; the reward value set comprises reward values corresponding to each execution action respectively under the condition that the at least two continuous execution actions are executed in sequence;
and obtaining the total reward value based on the environmental state sequence information and the reward value set.
In this embodiment, each time an action is executed, the corresponding environment state may change, and the changed environment state is in turn used to generate the next action to be executed, and so on in a loop. Under this interaction between environment and actions, the action sequence information formed by the at least two actions and the corresponding environment state sequence information formed by their environment states enable the generated total reward value to comprehensively consider the interaction between each action and each state, so that the subsequent network adjustment is more accurate.
In an optional embodiment, the deriving the reward total value based on the environmental status sequence information and the set of reward values includes:
for each execution action included in the at least two continuous execution actions, respectively determining, from the environment state sequence information and the reward value set, the environment state corresponding to the execution action and the reward value generated when the execution action is executed; and determining a reward sum value for the execution action based on the reward value and the environmental impact value of the environment state;

the total reward value is determined based on the reward sum values determined separately for each executed action.

In this embodiment, for each executed action, a corresponding reward sum value may be determined based on the corresponding reward value and environmental impact value, and this reward sum value evaluates, to some extent, the performance of that action. In addition, the quality of the current execution action further affects the next environment state and the next execution action corresponding to it; that is, the multiple continuous execution actions included in the whole action sequence influence one another. Taking this mutual influence into account, a more accurate next action sequence can be determined according to the total reward value obtained from the individual reward sum values, so that the control performance keeps improving.
In an alternative embodiment, the determining of the total reward value based on the reward sum values determined separately for each executed action comprises:

acquiring action weight values respectively given to each execution action;

and determining the total reward value based on the reward sum value determined for each executed action and the action weight value given to each executed action.

In this embodiment, the total reward value is determined using the action weight values. Therefore, when the total reward value is determined, the different degrees of influence of the executed actions are taken into account, and the determined total reward value better matches the requirements of the actual scene.
In an optional implementation manner, the action sequence information is used to indicate N consecutive execution actions within a preset time duration in the future, where N is an integer greater than or equal to 2, and obtaining the action sequence information according to the environmental state information and a pre-trained reinforcement learning network includes:
determining an n-th execution action according to the pre-trained reinforcement learning network and the environment state of the target application scene at the (n-1)-th moment, wherein n is an integer and 0 < n ≤ N;

when n = 1, determining the environment state of the target application scene at the (n-1)-th moment according to the environment state information; and when 1 < n ≤ N, determining the environment state of the target application scene at the (n-1)-th moment according to the environment state of the target application scene at the (n-2)-th moment and the (n-1)-th execution action.
In this embodiment, when determining the motion sequence information, the interaction between each environment state and each execution motion is taken into consideration, so that each determined execution motion can be better adapted to the environment.
In an optional embodiment, the adjusting the network parameter value of the reinforcement learning network based on the total reward value to obtain the trained reinforcement learning network includes:
circularly executing the following steps until the total reward value corresponding to the target action sequence output by the trained reinforcement learning network is greater than a preset threshold value;
adjusting the network parameter value of the reinforcement learning network based on the total reward value to obtain an adjusted reinforcement learning network; the action sequence information is acted on the target application scene to obtain environment state sequence information corresponding to the action sequence information;
and inputting the last environmental state information included in the environmental state sequence information into the adjusted reinforcement learning network to obtain action sequence information which is output by the reinforcement learning network and used for executing a plurality of continuous execution actions within a preset time length in the future and a return reward total value generated under the condition of executing the action sequence information.
In the embodiment, through multiple rounds of network training, the action sequence information output by the reinforcement learning network can be closer to the requirement of an actual scene, and thus the output action sequence provides a basis for ensuring accurate continuous control.
In a second aspect, an embodiment of the present disclosure provides a robot control method, including:
acquiring current environment state information of a target robot;
and inputting the current environment state information into the reinforcement learning network trained by using any one of the network training methods in the first aspect to obtain a target action sequence for continuously controlling the target robot.
By adopting the robot control method, the trained reinforcement learning network can be used for quickly and accurately controlling the robot continuously, and the control performance is better.
In an optional embodiment, the method further comprises:
and under the condition of receiving an execution success instruction which is sent by the target robot and aims at the current execution action included in the target action sequence, issuing an action instruction for executing the next execution action of the current execution action to the target robot.
In this embodiment, in order to reduce the adverse effect of the command confusion on the target robot as much as possible, when the previous execution operation is successfully executed, the next execution operation command may be issued to the robot, so as to ensure the stability of the robot control.
In a third aspect, an embodiment of the present disclosure further provides a network training apparatus, including:
the acquisition module is used for acquiring environmental state information in a target application scene;
the training module is used for obtaining action sequence information according to the environment state information and a pre-trained reinforcement learning network and determining a return reward total value corresponding to the action sequence information, wherein the action sequence information is used for indicating at least two continuous execution actions within a preset time length in the future;
the adjusting module is used for adjusting the network parameter values of the reinforcement learning network based on the return reward total value to obtain a trained reinforcement learning network; the trained reinforcement learning network is used for acquiring a target action sequence for continuously controlling a target object, and the total return reward value corresponding to the target action sequence is greater than a preset threshold value.
In an optional implementation manner, the training module is configured to determine a total reward value corresponding to the action sequence information according to the following steps:
the action sequence information is acted on the target application scene to obtain environment state sequence information and a return reward value set corresponding to the action sequence information; the reward value set comprises reward values corresponding to each execution action respectively under the condition that the at least two continuous execution actions are executed in sequence;
and obtaining the total reward value based on the environmental state sequence information and the reward value set.
In an optional implementation manner, the training module is configured to obtain the reward total value based on the environmental status sequence information and the set of reward values according to the following steps:
for each execution action included in the at least two continuous execution actions, respectively determining, from the environment state sequence information and the reward value set, the environment state corresponding to the execution action and the reward value generated when the execution action is executed; and determining a reward sum value for the execution action based on the reward value and the environmental impact value of the environment state;

the total reward value is determined based on the reward sum values determined separately for each executed action.
In an alternative embodiment, the training module is configured to determine the total reward value based on the reward sum values determined separately for each executed action according to the following steps:

acquiring action weight values respectively given to each execution action;

and determining the total reward value based on the reward sum value determined for each executed action and the action weight value given to each executed action.
In an optional implementation manner, the action sequence information is used to indicate N consecutive execution actions within a preset time duration in the future, where N is an integer greater than or equal to 2, and the training module is configured to obtain the action sequence information according to the environmental state information and a pre-trained reinforcement learning network, according to the following steps:
determining an n-th execution action according to the pre-trained reinforcement learning network and the environment state of the target application scene at the (n-1)-th moment, wherein n is an integer and 0 < n ≤ N;

when n = 1, determining the environment state of the target application scene at the (n-1)-th moment according to the environment state information; and when 1 < n ≤ N, determining the environment state of the target application scene at the (n-1)-th moment according to the environment state of the target application scene at the (n-2)-th moment and the (n-1)-th execution action.
In an optional implementation manner, the adjusting module is configured to adjust a network parameter value of the reinforcement learning network based on the total reward value according to the following steps to obtain a trained reinforcement learning network:
circularly executing the following steps until the total reward value corresponding to the target action sequence output by the trained reinforcement learning network is greater than a preset threshold value;
adjusting the network parameter value of the reinforcement learning network based on the total reward value to obtain an adjusted reinforcement learning network; the action sequence information is acted on the target application scene to obtain environment state sequence information corresponding to the action sequence information;
and inputting the last environmental state information included in the environmental state sequence information into the adjusted reinforcement learning network to obtain action sequence information which is output by the reinforcement learning network and used for executing a plurality of continuous execution actions within a preset time length in the future and a return reward total value generated under the condition of executing the action sequence information.
In a fourth aspect, an embodiment of the present disclosure further provides a robot control device, including:
the acquisition module is used for acquiring the current environment state information of the target robot;
and the control module is used for inputting the current environment state information into the reinforcement learning network trained by using any one of the network training methods in the first aspect to obtain a target action sequence for continuously controlling the target robot.
In an alternative embodiment, the apparatus further comprises:
and the sending module is used for sending an action instruction for executing the next execution action of the current execution action to the target robot under the condition of receiving the execution success instruction which is sent by the target robot and aims at the current execution action included in the target action sequence.
In a fifth aspect, an embodiment of the present disclosure further provides an electronic device, including: a processor, a memory and a bus, the memory storing machine-readable instructions executable by the processor, the processor and the memory communicating via the bus when the electronic device is operating, the machine-readable instructions when executed by the processor performing the steps of the first aspect, or any one of the possible implementations of the first aspect, or the second aspect, or any one of the possible implementations of the second aspect.
In a sixth aspect, this disclosed embodiment also provides a computer readable storage medium, on which a computer program is stored, where the computer program is executed by a processor to perform the steps in the first aspect, or any one of the possible embodiments of the first aspect, or the steps in the second aspect, or any one of the possible embodiments of the second aspect.
For the description of the effects of the network training apparatus, the robot control apparatus, the electronic device, and the computer-readable storage medium, reference is made to the description of the network training method and the robot control method, which is not repeated herein.
In order to make the aforementioned objects, features and advantages of the present disclosure more comprehensible, preferred embodiments accompanied with figures are described in detail below.
Drawings
In order to more clearly illustrate the technical solutions of the embodiments of the present disclosure, the drawings required in the embodiments are briefly described below. The drawings, which are incorporated in and form a part of the specification, illustrate embodiments consistent with the present disclosure and, together with the description, serve to explain the technical solutions of the present disclosure. It should be understood that the following drawings depict only certain embodiments of the disclosure and are therefore not to be considered limiting of its scope; those skilled in the art can derive other related drawings from them without creative effort.
Fig. 1 shows a flowchart of a network training method provided by an embodiment of the present disclosure;
FIG. 2 is a flow chart illustrating a specific method for determining a total reward value in the network training method provided by the embodiment of the disclosure;
FIG. 3 is a flow chart illustrating a method for controlling a robot according to an embodiment of the present disclosure;
FIG. 4 is a schematic diagram of a network training apparatus provided in an embodiment of the present disclosure;
FIG. 5 is a schematic diagram illustrating an architecture of a robot control device provided in an embodiment of the present disclosure;
fig. 6 shows a schematic structural diagram of an electronic device provided in an embodiment of the present disclosure.
Detailed Description
In order to make the objects, technical solutions and advantages of the embodiments of the present disclosure more clear, the technical solutions of the embodiments of the present disclosure will be described clearly and completely with reference to the drawings in the embodiments of the present disclosure, and it is obvious that the described embodiments are only a part of the embodiments of the present disclosure, not all of the embodiments. The components of the embodiments of the present disclosure, generally described and illustrated in the figures herein, can be arranged and designed in a wide variety of different configurations. Thus, the following detailed description of the embodiments of the present disclosure, presented in the figures, is not intended to limit the scope of the claimed disclosure, but is merely representative of selected embodiments of the disclosure. All other embodiments, which can be derived by a person skilled in the art from the embodiments of the disclosure without making creative efforts, shall fall within the protection scope of the disclosure.
Research shows that, at present, deep reinforcement learning combines the strong comprehension ability of deep learning on perception problems such as vision with the decision-making ability of reinforcement learning, which makes it possible to solve some complex problems in real scenes. Taking robot control as an example, the decision-making capability of reinforcement learning can help the robot adaptively adjust its output action according to the surrounding environment information: each environment interaction controls the robot to execute one corresponding action, and as the number of environment interactions increases, the executed actions become more accurate and better suited to the scene requirements.
However, in some complicated control scenarios, such as controlling a robot to perform a set of continuous actions, the reinforcement learning method described above has poor control performance.
Based on the research, the present disclosure provides a network training method, a robot control method, a device, an apparatus, and a storage medium, which have good network training performance, and thus can be well applied to complex scenes such as continuous control of a robot, and have good control performance.
The above-mentioned drawbacks were identified by the inventor only after practice and careful research; therefore, the process of discovering the above problems, and the solutions proposed below for them, should both be regarded as contributions made by the inventor to the present disclosure.
It should be noted that: like reference numbers and letters refer to like items in the following figures, and thus, once an item is defined in one figure, it need not be further defined and explained in subsequent figures.
To facilitate understanding of the present embodiment, first, a network training method disclosed in the embodiments of the present disclosure is described in detail, where an execution subject of the network training method provided in the embodiments of the present disclosure is generally an electronic device with certain computing capability, and the electronic device includes, for example: a terminal device, which may be a User Equipment (UE), a mobile device, a User terminal, or various robots such as an industrial robot, a mobile service robot, or a server or other processing device. In some possible implementations, the network training method may be implemented by a processor invoking computer readable instructions stored in a memory.
The network training method provided by the embodiments of the present disclosure is described below by taking a robot as the execution subject as an example.
Referring to fig. 1, a flowchart of a network training method provided in the embodiment of the present disclosure is shown, where the method includes steps S101 to S103, where:
s101, acquiring environmental state information in a target application scene;
s102, obtaining action sequence information according to the environment state information and a pre-trained reinforcement learning network, and determining a return reward total value corresponding to the action sequence information, wherein the action sequence information is used for indicating at least two continuous execution actions within a future preset time length;
s103, adjusting network parameter values of the reinforcement learning network based on the reward total value to obtain the well-trained reinforcement learning network; the trained reinforcement learning network is used for acquiring a target action sequence for continuously controlling a target object, and the total return reward value corresponding to the target action sequence is greater than a preset threshold value.
In order to facilitate understanding of the network training method provided by the embodiments of the present disclosure, first, a brief description is made on an application scenario of the method. The network training method can be mainly applied to the field of robot control, and is particularly suitable for control under complex scenes such as robot continuous control. Here, the trained reinforcement learning network can be used to continuously control the robot, for example, the robot can be controlled to perform a series of execution actions including moving to the position a, grabbing the target object, and moving the object to the position B, and the control performance is better.
It should be noted that the embodiment of the present disclosure may be applied not only to the field of robot control, but also to other decision intelligence technical fields that require continuous control, and is not limited specifically herein. In view of the wide application of robotics, a robot control application scenario is illustrated next.
The target application scenario in the embodiments of the present disclosure may be any of various application scenarios related to continuous control. Taking robot control as an example, different robots correspond to different application scenarios; for example, the scenario may relate to a carrying robot performing goods handling, or to a soothing robot performing actions such as hugging and patting, which is not specifically limited here.
In order to better train the reinforcement learning network, the environment state in the target application scene can be acquired in advance. The actions that the reinforcement learning network controls and executes change as the environment state is updated, and these actions in turn act on the whole environment and change the environment state; this cycle repeats, so the environment keeps changing as the reinforcement learning network is trained.
Here, in order to realize the relevant continuous control, the reinforcement learning network may be made to output action sequence information for a plurality of continuously executed actions within a preset time period in the future based on the acquired environment state information. As the environment state changes, these continuous actions influence one another, so the determined total reward value can evaluate the quality of the action sequence comprising the plurality of continuous actions. The higher the total reward value, the better the output action sequence meets the environment requirements to a certain extent, and its guiding effect can subsequently be strengthened; conversely, the lower the total reward value, the less the output action sequence meets the environment requirements, and its guiding effect can subsequently be weakened.
Here, the reinforcement learning network may be adjusted one or more times based on the total reward value. After multiple adjustments, the output target action sequence can obtain a larger total reward value, so that better continuous-control performance can be achieved when the trained reinforcement learning network is applied to robot control and other related control fields.
It should be noted that, in the process of adjusting the network parameter values of the reinforcement learning network based on the total reward value, the embodiments of the present disclosure aim at optimizing each action in the whole action sequence, with the actions influencing one another, so that the determined action sequence better meets the actual environment requirements and has stronger applicability.
The total reward value herein comprehensively considers the reward value corresponding to each executed action and the environment status related to the executed action, and can be further described specifically by the following steps:
the method comprises the steps that firstly, action sequence information acts on a target application scene, and environment state sequence information and a reward value set corresponding to the action sequence information are obtained; the reward value set comprises reward values corresponding to each execution action under the condition that at least two continuous execution actions are executed in sequence;
and step two, obtaining a total return reward value based on the environment state sequence information and the return reward value set.
The action sequence information in the embodiment of the present disclosure may be determined by performing continuous action prediction on initial environmental state information according to a pre-trained reinforcement learning network, and is used to indicate N continuous execution actions within a preset time duration in the future, where N is an integer greater than or equal to 2.
In practical applications, the execution action at the current moment may be determined by the pre-trained reinforcement learning network and the environment state of the target application scene at the previous moment; that is, the n-th execution action may be determined according to the reinforcement learning network and the environment state of the target application scene at the (n-1)-th moment, where n is an integer and 0 < n ≤ N.

When n = 1, the environment state of the target application scene at the (n-1)-th moment is determined according to the environment state information; and when 1 < n ≤ N, the environment state of the target application scene at the (n-1)-th moment is determined according to the environment state of the target application scene at the (n-2)-th moment and the (n-1)-th execution action.
In practical application, the current environment state can be input into the reinforcement learning network, and the reinforcement learning network outputs an execution action; after the current execution action is executed, the next environment state is obtained; this environment state is input into the reinforcement learning network, which outputs the next execution action. Repeating this process yields a plurality of continuous execution actions, which form the action sequence information.
Here, each execution action may change the environment state, and the changed environment state in turn acts on the generation of the next execution action; in other words, a plurality of corresponding environment states are obtained as the actions are executed, and together they form one piece of environment state sequence information.
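The interaction loop described here can be illustrated with a short sketch; `policy` (standing in for the reinforcement learning network) and `step` (standing in for the effect of acting on the target application scene) are assumed interfaces, not names from the disclosure.

```python
from typing import Callable

# Hedged sketch of the interaction loop; `policy` and `step` are assumed
# interfaces (not defined by this disclosure).
def rollout(policy: Callable, step: Callable, state, n_steps: int):
    actions, states, rewards = [], [state], []
    for _ in range(n_steps):
        action = policy(state)        # n-th execution action from the state at moment n-1
        state, reward = step(action)  # acting on the scene yields the next state and its reward
        actions.append(action)        # -> action sequence information
        states.append(state)          # -> environment state sequence information
        rewards.append(reward)        # -> reward value set
    return actions, states, rewards
```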
Different return reward values can be determined according to different execution actions, and the return reward values corresponding to the execution actions can be combined to obtain a return reward value set.
In practical applications, the embodiments of the present disclosure may implement the optimization of the action sequence by using the Soft Actor-Critic (SAC) reinforcement learning algorithm. The SAC setup here mainly includes an environment model and an action network used for action prediction. In practical application, the environment model can be periodically trained using real samples obtained by interaction between the actor and the environment. A sample is defined as a quadruple (s_t, a_t, r_{t+1}, s_{t+1}), whose elements correspond respectively to the current environment state information, the execution action determined under that environment state, the reward value brought by the execution action, and the next environment state; the environment model learns the mapping from (s_t, a_t) to (r_{t+1}, s_{t+1}).
In addition, hypothetical samples can be generated periodically using the environment model. That is, in the process of determining each output execution action, the state s_t of a sampled real sample is input into the current policy, and the action a_t is obtained by sampling according to the current policy. When (s_t, a_t) is input into the environment model to obtain (r_{t+1}, s_{t+1}), a quadruple virtual sample (s_t, a_t, r_{t+1}, s_{t+1}) is formed.
In the disclosed embodiments, as the environment model is trained, the action network may be trained accordingly. In the process of training the action network, the environment model is used to predict the environment state sequence s_{t+1}, …, s_{t+h} under the action sequence a_t, …, a_{t+h-1} sampled by the current policy for h future moments, together with the reward set r_{t+1}, …, r_{t+h}. An optimization objective (corresponding to the total reward value) can then be obtained based on a_t, …, a_{t+h-1}, s_{t+h} and r_{t+1}, …, r_{t+h}, so as to optimize each of the actions indicated by the action sequence a_t, …, a_{t+h-1}.
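As a rough illustration of this h-step expansion, the sketch below assumes a learned environment model that maps (s_t, a_t) to (r_{t+1}, s_{t+1}) and a stochastic policy; the function names are illustrative and are not part of the disclosure or of any particular SAC library.

```python
# Hedged sketch of h-step model-based expansion under the stated assumptions:
# model(s, a) -> (next_reward, next_state), policy(s) -> action.
def expand_with_model(model, policy, s_t, h: int):
    s = s_t
    actions, rewards, states = [], [], []
    for _ in range(h):
        a = policy(s)           # sample a_t, ..., a_{t+h-1} from the current policy
        r, s = model(s, a)      # predicted r_{t+1}, ..., r_{t+h} and s_{t+1}, ..., s_{t+h}
        actions.append(a)
        rewards.append(r)
        states.append(s)        # each (s, a, r, s') quadruple is one virtual sample
    return actions, rewards, states
```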
Here, after the environment state sequence information and the reward value set are obtained, the total reward value can be obtained according to the steps shown in fig. 2:

S201, for each execution action included in the at least two continuous execution actions, respectively determining, from the environment state sequence information and the reward value set, the environment state corresponding to the execution action and the reward value generated when the execution action is executed; and determining a reward sum value for the execution action based on the reward value and the environmental impact value of the environment state;

S202, determining the total reward value based on the reward sum values respectively determined for each execution action.

Here, for each execution action, the corresponding reward sum value may be determined based on the reward value corresponding to that execution action and the environmental impact value of the environment state, so as to better evaluate the quality of the execution action.
Here, the reward sum value can be calculated by the following formula:

R_{t+1} = r_{t+1} + e_t

where e_t = -α·log(p(a_t | s_t)) represents the environmental impact value at time t, R_{t+1} represents the reward sum value at time t+1, and r_{t+1} represents the reward value at time t+1.
As can be seen from the above formula, the reward sum value at time t+1 is obtained by adding the reward value at time t+1 and the environmental impact value at time t. Under the combined effect of the reward value and the environmental impact value, the determined reward sum value evaluates the corresponding execution action more faithfully, which further helps optimize the action sequence.
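A small numerical sketch of this formula follows; the probability, the coefficient α, and the reward value are made-up numbers used only to show how the environmental impact value enters the reward sum value.

```python
import math

alpha = 0.2          # assumed temperature-like coefficient (illustrative value)
p_a_given_s = 0.25   # assumed probability p(a_t | s_t) of the sampled action (illustrative)
r_next = 1.0         # assumed reward value r_{t+1} for the executed action (illustrative)

e_t = -alpha * math.log(p_a_given_s)    # environmental impact value at time t
R_next = r_next + e_t                   # reward sum value R_{t+1} = r_{t+1} + e_t
print(round(e_t, 4), round(R_next, 4))  # 0.2773 1.2773
```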
In practical application, different executed actions influence the whole action sequence to different degrees; that is, within the whole action sequence, inaccurate execution of some actions strongly affects the accuracy of the whole sequence, while inaccurate execution of other actions affects it only slightly. Therefore, in order to further optimize the action sequence, an action weight value can be given to each executed action, and the total reward value is obtained based on each reward sum value and its action weight value, so that the resulting total reward value reflects the degree to which each executed action influences the reward and better matches the requirements of the actual scene.
In the embodiments of the present disclosure, the total reward value corresponding to a plurality of continuously executed actions from the current time t over h future moments may be calculated according to the following formula:

R_total = w_0·R_{t+1} + w_1·R_{t+2} + w_2·R_{t+3} + … + w_{h-1}·R_{t+h-1} + V_{t+h}

In the above formula, R_total represents the total reward value corresponding to the plurality of continuously executed actions from the current time t over the h future moments; R_{t+1}, R_{t+2}, R_{t+3}, …, R_{t+h-1} respectively represent the reward sum values at times t+1, t+2, t+3, …, t+h-1; w_0, w_1, w_2, …, w_{h-1} respectively represent the weight values corresponding to each action; and V_{t+h} represents the value of the environment state at time t+h.
In the formula, the value of the environmental state at the time t + h is added on the basis of weighted summation of the reward sum values, and the larger the value of the environmental state is, the more accurate the execution action predicted based on the state is.
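As a rough illustration of this weighted combination, the sketch below computes a total reward value from made-up reward sum values, action weight values, and a final state value; none of the numbers come from the disclosure.

```python
# Illustrative computation of the total reward value: a weighted sum of the
# per-step reward sum values plus the value of the final environment state.
reward_sums = [1.28, 1.05, 0.93, 1.10]   # R_{t+1}, R_{t+2}, ... (assumed values)
weights     = [1.0, 0.95, 0.90, 0.86]    # w_0, w_1, ... action weight values (assumed)
v_terminal  = 2.40                       # V_{t+h}, value of the state at time t+h (assumed)

r_total = sum(w * r for w, r in zip(weights, reward_sums)) + v_terminal
print(round(r_total, 4))                 # 6.4605
```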
After the total reward value is obtained, the network parameter values of the reinforcement learning network can be adjusted to obtain the trained reinforcement learning network. In the process of adjusting the reinforcement learning network, the following steps can be executed in a loop until the total reward value corresponding to the target action sequence output by the trained reinforcement learning network is greater than the preset threshold value.
Step one, adjusting network parameter values of the reinforcement learning network based on the reward total value to obtain an adjusted reinforcement learning network; acting the action sequence information on a target application scene to obtain environment state sequence information corresponding to the action sequence information;
and step two, inputting the last environmental state information included by the environmental state sequence information into the adjusted reinforcement learning network to obtain action sequence information which is output by the reinforcement learning network and used for executing a plurality of continuous execution actions within a preset time length in the future and a return reward total value generated under the condition of executing the action sequence information.
Here, when the reinforcement learning network is adjusted, its network parameter values may be adjusted based on the total reward value so that it outputs action sequences with larger corresponding total reward values. When the current environment state information is input into the reinforcement learning network, the network can optimize each executed action in the output action sequence: if the currently output total reward value is small, the current action sequence performs poorly, and its generation probability can be reduced in the next adjustment; conversely, if the currently output total reward value is large, the current action sequence performs well, and its generation probability can be strengthened in the next adjustment. Training in this way eventually yields a target action sequence with good performance.
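The cycle described above can be summarized by the following sketch, which reuses the rollout sketch from earlier; `total_reward` and `policy.adjust` are hypothetical stand-ins for the total reward computation and the network parameter update and do not name any real API.

```python
# Hedged sketch of the cyclic adjustment, assuming the helpers named above.
def train(policy, env_step, initial_state, horizon, threshold, max_rounds=1000):
    state = initial_state
    for _ in range(max_rounds):
        actions, states, rewards = rollout(policy, env_step, state, horizon)
        r_total = total_reward(actions, states, rewards)
        if r_total > threshold:       # total reward value exceeds the preset threshold
            return actions            # target action sequence
        policy.adjust(r_total)        # adjust the network parameter values
        state = states[-1]            # last environment state feeds the next round
    return None
```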
In practical applications, in addition to cyclically adjusting the next round of network based on the last environmental state information included in the environmental state sequence information, the environmental state information may be directly selected from the real samples/hypothetical samples obtained from the environmental model to adjust the next round of network, which is not limited specifically herein.
In one adjustment process, each execution action in the action sequence corresponds to a reward value, meanwhile, as the execution action progresses, the environment state information is updated, the updated environment state information is input into the reinforcement learning network, the reinforcement learning network outputs a new execution action, and therefore action sequence information comprising a plurality of continuous actions can be obtained.
Here, a robot walking scene will be used as an example. In the walking process of the robot, the environment state at the current moment is s_1 (corresponding to the first environment state). Inputting s_1 into the network, the robot performs the action a_1 of lifting the left foot (corresponding to the first execution action), obtaining the reward value r_2 corresponding to this action (the reward corresponding to the first execution action) and the environment state s_2 (corresponding to the second environment state). Inputting s_2 into the network, the robot performs the action a_2 of putting down the left foot (corresponding to the second execution action), obtaining the reward value r_3 (the reward corresponding to the second execution action) and the environment state s_3 (corresponding to the third environment state). Inputting s_3 into the network, the robot performs the action a_3 of lifting the right foot (corresponding to the third execution action), obtaining the reward value r_4 (the reward corresponding to the third execution action) and the environment state s_4 (corresponding to the fourth environment state). Inputting s_4 into the network, the robot performs the action a_4 of putting down the right foot (corresponding to the fourth execution action), obtaining the reward value r_5 (the reward corresponding to the fourth execution action) and the environment state s_5 (corresponding to the fifth environment state).

Here a_1, a_2, a_3, a_4 form one piece of action sequence information, and s_1, s_2, s_3, s_4, s_5 form one piece of environment state sequence information, with corresponding environmental impact values e_1, e_2, e_3, e_4, e_5. Here s_1 influences a_1, a_1 influences s_2, s_2 influences a_2, a_2 influences s_3, s_3 influences a_3, a_3 influences s_4, s_4 influences a_4, and a_4 influences s_5.
It can be seen that the execution actions influence one another, and the environment state information is continuously updated through this mutual influence, so that the execution actions can be better adapted to the environment. Substituting the obtained reward values into the calculation formula of the total reward value yields the total reward value; if the calculated total reward value is greater than the preset threshold value, the action sequence information adapts well to the current environment, and the reinforcement learning network can be adjusted by strengthening the execution probability of each continuously executed action in the action sequence information.
Based on the network training method, an embodiment of the present disclosure further provides a robot control method, which is shown in fig. 3 and specifically includes the following steps:
s301, acquiring current environment state information of a target robot;
s302, inputting the current environment state information into the reinforcement learning network trained by the network training method to obtain a target action sequence for continuously controlling the target robot.
The robot control method is performed by using the reinforcement learning network trained by the network training method, that is, the current environment state information of the target robot is input into the trained reinforcement learning network, so that a whole target action sequence can be obtained.
The training process of the reinforcement learning network can refer to the above contents, and is not described herein.
In practical applications, in order to avoid the adverse effects that issuing a burst of continuous action instructions might have on the target robot, the robot control method provided in the embodiments of the present disclosure may issue the action instruction for the next execution action only after an execution success instruction for the current execution action is received; that is, the next execution action is issued only once the previous action has been executed. In this way, accurate execution of each action can be controlled while ensuring that the target robot executes the series of continuous actions according to the predicted target action sequence.
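One way to picture this handshake between the controller and the robot is the loop below; `send_action` and `wait_for_success` are assumed communication primitives for the target robot and are not defined by the disclosure.

```python
# Hedged sketch: issue each action instruction only after the previous one
# reports success, so the robot follows the predicted target action sequence.
def execute_sequence(robot, target_action_sequence, timeout_s: float = 5.0) -> bool:
    for action in target_action_sequence:
        robot.send_action(action)                           # issue the action instruction
        if not robot.wait_for_success(timeout=timeout_s):   # wait for the execution success instruction
            return False                                    # stop instead of piling up further commands
    return True
```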
It will be understood by those skilled in the art that, in the methods of the present disclosure, the order in which the steps are written does not imply a strict order of execution or constitute any limitation on the implementation process; the specific order of execution of the steps should be determined by their functions and possible inherent logic.
Based on the same inventive concept, a network training device corresponding to the network training method is also provided in the embodiments of the present disclosure, and as the principle of solving the problem of the device in the embodiments of the present disclosure is similar to the network training method in the embodiments of the present disclosure, the implementation of the device may refer to the implementation of the method, and repeated details are not repeated.
Referring to fig. 4, a schematic diagram of the architecture of a network training apparatus provided in an embodiment of the present disclosure is shown. The apparatus includes: an acquisition module 401, a training module 402, and an adjustment module 403; wherein:
an obtaining module 401, configured to obtain environment state information in a target application scene;
a training module 402, configured to obtain action sequence information according to the environment state information and a pre-trained reinforcement learning network, and determine a total reward value corresponding to the action sequence information, where the action sequence information is used to indicate at least two consecutive execution actions within a preset time duration in the future;
an adjusting module 403, configured to adjust a network parameter value of the reinforcement learning network based on the total reward value to obtain a trained reinforcement learning network; the trained reinforcement learning network is used for acquiring a target action sequence for continuously controlling a target object, and the total return reward value corresponding to the target action sequence is greater than a preset threshold value.
The disclosed embodiments may adjust the network parameter values of the reinforcement learning network based on the total reward value. The total reward value comprehensively evaluates the relationship between the at least two continuous execution actions, so that the reinforcement learning network adjusted by it is better adapted to generating action sequences, and the generated action sequences improve as the network is adjusted. The trained reinforcement learning network therefore also has better control performance when applied to complex scenes such as continuous control.
In an alternative embodiment, the training module 402 is configured to determine the total reward value corresponding to the action sequence information according to the following steps:
acting the action sequence information on a target application scene to obtain environment state sequence information and a reward value set corresponding to the action sequence information; the reward value set comprises reward values corresponding to each execution action under the condition that at least two continuous execution actions are executed in sequence;
and obtaining a total return reward value based on the environment state sequence information and the return reward value set.
In an alternative embodiment, the training module 402 is configured to obtain a total reward value based on the environmental status sequence information and the set of reward values according to the following steps:
for each execution action included in the at least two continuous execution actions, respectively determining, from the environment state sequence information and the reward value set, the environment state corresponding to the execution action and the reward value generated when the execution action is executed; and determining a reward sum value for the execution action based on the reward value and the environmental impact value of the environment state;

the total reward value is determined based on the reward sum values determined separately for each executed action.
In an alternative embodiment, the training module 402 is configured to determine the total reward value based on the reward sum values determined separately for each executed action according to the following steps:

acquiring action weight values respectively given to each execution action;

and determining the total reward value based on the reward sum value determined for each executed action and the action weight value given to each executed action.
In an alternative embodiment, the action sequence information is used to indicate N consecutive execution actions within a preset time duration in the future, where N is an integer greater than or equal to 2, and the training module 402 is configured to obtain the action sequence information according to the environmental state information and the pre-trained reinforcement learning network according to the following steps:
determining an n-th execution action according to the pre-trained reinforcement learning network and the environment state of the target application scene at the (n-1)-th moment, wherein n is an integer and 0 < n ≤ N;

when n = 1, determining the environment state of the target application scene at the (n-1)-th moment according to the environment state information; and when 1 < n ≤ N, determining the environment state of the target application scene at the (n-1)-th moment according to the environment state of the target application scene at the (n-2)-th moment and the (n-1)-th execution action.
In an alternative embodiment, the adjusting module 403 is configured to adjust the network parameter values of the reinforcement learning network based on the total reward value according to the following steps to obtain a trained reinforcement learning network:
circularly executing the following steps until the total reward value corresponding to the target action sequence output by the trained reinforcement learning network is greater than the preset threshold value:
adjusting the network parameter values of the reinforcement learning network based on the total reward value to obtain an adjusted reinforcement learning network; applying the action sequence information to the target application scene to obtain the environment state sequence information corresponding to the action sequence information;
and inputting the last environment state included in the environment state sequence information into the adjusted reinforcement learning network to obtain new action sequence information, output by the adjusted network, indicating a plurality of consecutive execution actions within the future preset time length, together with the total reward value generated when that action sequence information is executed.
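Putting the steps together, a hedged sketch of the outer training loop could look like the following, where update_fn stands in for whichever parameter adjustment is used (for example, the policy-gradient sketch given earlier) and the other helpers are the illustrative sketches from the preceding embodiments, not part of the disclosure itself.

```python
def train_until_threshold(policy_net, update_fn, env, transition_fn,
                          state_value_fn, initial_state, N, threshold):
    """Repeat adjustment until the total reward value exceeds the threshold."""
    total = float("-inf")
    while total <= threshold:
        actions = generate_action_sequence(policy_net, transition_fn,
                                           initial_state, N)
        state_sequence, reward_set = apply_action_sequence(env, actions,
                                                           initial_state)
        total = compute_total_reward(state_sequence, reward_set, state_value_fn)
        update_fn(policy_net, total)        # adjust the network parameter values
        initial_state = state_sequence[-1]  # last environment state seeds the next pass
    return policy_net
```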
The description of the processing flow of each module in the device and the interaction flow between the modules may refer to the related description in the above method embodiments, and will not be described in detail here.
Based on the same inventive concept, the embodiments of the present disclosure also provide a robot control device corresponding to the robot control method. Since the principle by which this device solves the problem is similar to that of the robot control method in the embodiments of the present disclosure, the implementation of the device may refer to the implementation of the method, and repeated details are omitted.
Referring to fig. 5, there is shown an architecture diagram of a robot control device according to an embodiment of the present disclosure. The device includes: an acquisition module 501 and a control module 502; wherein:
an obtaining module 501, configured to obtain current environment state information of a target robot;
and the control module 502 is configured to input the current environment state information into the reinforcement learning network trained by the network training method, so as to obtain a target action sequence for continuously controlling the target robot.
The execution actions in the target action sequence obtained in the embodiments of the present disclosure are consecutive, so the robot can be controlled more effectively.
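As a minimal sketch of the control module's inference step, assume the trained network can be called directly on the current environment state and that the target robot exposes that state through a get_state() method; both are illustrative assumptions.

```python
def control_step(robot, trained_net):
    """Obtain a target action sequence for continuously controlling the robot."""
    current_state = robot.get_state()              # current environment state information
    target_action_sequence = trained_net(current_state)
    return target_action_sequence
```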
In an alternative embodiment, the apparatus further comprises:
and a sending module 503, configured to, upon receiving an execution success indication sent by the target robot for the current execution action included in the target action sequence, issue to the target robot an action instruction for executing the next execution action after the current one.
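As an illustration of the sending module's behaviour, and not an API defined by the disclosure, assume the robot reports success for each executed action through a boolean returned by send(action):

```python
def dispatch_action_sequence(robot, target_action_sequence):
    """Issue actions one by one, advancing only after a success indication."""
    for index, action in enumerate(target_action_sequence):
        succeeded = robot.send(action)     # issue the action instruction (assumed API)
        if not succeeded:                  # no success indication: stop dispatching
            return index                   # index of the action that did not complete
    return len(target_action_sequence)     # all actions executed successfully
```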
The description of the processing flow of each module in the device and the interaction flow between the modules may refer to the related description in the above method embodiments, and will not be described in detail here.
Based on the same technical concept, an embodiment of the present disclosure further provides an electronic device. Referring to fig. 6, a schematic structural diagram of an electronic device 600 provided in an embodiment of the present disclosure includes a processor 61, a memory 62 and a bus 63. The memory 62 is used to store execution instructions and includes an internal memory 621 and an external memory 622. The internal memory 621 temporarily stores operation data for the processor 61 and data exchanged with the external memory 622, such as a hard disk; the processor 61 exchanges data with the external memory 622 through the internal memory 621. When the electronic device 600 runs, the processor 61 communicates with the memory 62 through the bus 63, so that the processor 61 executes the following instructions:
acquiring environmental state information in a target application scene;
obtaining action sequence information according to the environment state information and a pre-trained reinforcement learning network, and determining a total reward value corresponding to the action sequence information, wherein the action sequence information is used for indicating at least two consecutive execution actions within a future preset time length;
adjusting the network parameter values of the reinforcement learning network based on the total reward value to obtain a trained reinforcement learning network; the trained reinforcement learning network is used for acquiring a target action sequence for continuously controlling a target object, and the total reward value corresponding to the target action sequence is greater than a preset threshold value;
alternatively, the following instructions are executed:
acquiring current environment state information of a target robot;
and inputting the current environment state information into the reinforcement learning network trained by the network training method to obtain a target action sequence for continuously controlling the target robot.
The embodiments of the present disclosure also provide a computer-readable storage medium, where a computer program is stored on the computer-readable storage medium, and when the computer program is executed by a processor, the steps of the network training method or the steps of the robot control method in the foregoing method embodiments are executed. The storage medium may be a volatile or non-volatile computer-readable storage medium.
The embodiments of the present disclosure also provide a computer program product carrying program code, where the instructions included in the program code may be used to execute the steps of the network training method or of the robot control method in the foregoing method embodiments; reference may be made to the foregoing method embodiments for details, which are not repeated here.
The computer program product may be implemented by hardware, software or a combination thereof. In an alternative embodiment, the computer program product is embodied in a computer storage medium, and in another alternative embodiment, the computer program product is embodied in a Software product, such as a Software Development Kit (SDK), or the like.
It is clear to those skilled in the art that, for convenience and brevity of description, the specific working processes of the system and the apparatus described above may refer to the corresponding processes in the foregoing method embodiments and are not described again here. In the several embodiments provided in the present disclosure, it should be understood that the disclosed system, apparatus and method may be implemented in other ways. The above-described embodiments of the apparatus are merely illustrative. For example, the division into units is only one logical division, and other divisions are possible in actual implementation; for example, a plurality of units or components may be combined or integrated into another system, or some features may be omitted or not executed. In addition, the mutual coupling, direct coupling or communication connection shown or discussed may be an indirect coupling or communication connection of devices or units through communication interfaces, and may be electrical, mechanical or in another form.
The units described as separate parts may or may not be physically separate, and parts displayed as units may or may not be physical units, may be located in one place, or may be distributed on a plurality of network units. Some or all of the units can be selected according to actual needs to achieve the purpose of the solution of the embodiment.
In addition, functional units in the embodiments of the present disclosure may be integrated into one processing unit, or each unit may exist alone physically, or two or more units are integrated into one unit.
The functions, if implemented in the form of software functional units and sold or used as a stand-alone product, may be stored in a non-volatile computer-readable storage medium executable by a processor. Based on such understanding, the technical solution of the present disclosure may be embodied in the form of a software product, which is stored in a storage medium and includes several instructions for causing a computer device (which may be a personal computer, a server, or a network device) to execute all or part of the steps of the method according to the embodiments of the present disclosure. And the aforementioned storage medium includes: various media capable of storing program codes, such as a usb disk, a removable hard disk, a Read-Only Memory (ROM), a Random Access Memory (RAM), a magnetic disk, or an optical disk.
Finally, it should be noted that the above-mentioned embodiments are merely specific embodiments of the present disclosure, used to illustrate rather than limit its technical solutions, and the scope of the present disclosure is not limited thereto. Although the present disclosure has been described in detail with reference to the foregoing embodiments, those skilled in the art should understand that any person familiar with the technical field can still, within the technical scope of the present disclosure, modify the technical solutions described in the foregoing embodiments or readily conceive of changes, or substitute equivalents for some of their technical features; such modifications, changes or substitutions do not cause the corresponding technical solutions to depart from the spirit and scope of the embodiments of the present disclosure, and should be covered by the protection scope of the present disclosure. Therefore, the protection scope of the present disclosure shall be subject to the protection scope of the claims.

Claims (12)

1. A method of network training, comprising:
acquiring environmental state information in a target application scene;
obtaining action sequence information according to the environment state information and a pre-trained reinforcement learning network, and determining a total reward value corresponding to the action sequence information, wherein the action sequence information is used for indicating at least two consecutive execution actions within a future preset time length;
adjusting the network parameter values of the reinforcement learning network based on the total reward value to obtain a trained reinforcement learning network; the trained reinforcement learning network is used for acquiring a target action sequence for continuously controlling a target object, and the total reward value corresponding to the target action sequence is greater than a preset threshold value.
2. The method of claim 1, wherein the determining a total reward value corresponding to the action sequence information comprises:
applying the action sequence information to the target application scene to obtain environment state sequence information and a reward value set corresponding to the action sequence information; the reward value set comprises the reward value corresponding to each execution action when the at least two consecutive execution actions are executed in sequence;
and obtaining the total reward value based on the environmental state sequence information and the reward value set.
3. The method of claim 2, wherein obtaining the total reward value based on the environment state sequence information and the reward value set comprises:
for each execution action included in the at least two consecutive execution actions, determining, from the environment state sequence information and the reward value set, the environment state corresponding to that execution action and the reward value generated when it is executed; and determining a reward-and-value for the execution action based on that reward value and the environmental impact value of the environment state;
determining the total reward value based on the reward-and-value determined for each execution action.
4. The method of claim 3, wherein determining the total reward value based on the reward-and-value determined for each execution action comprises:
acquiring the action weight value assigned to each execution action;
and determining the total reward value based on the reward-and-value determined for each execution action and the action weight value assigned to it.
5. The method according to any one of claims 1 to 4, wherein the action sequence information is used to indicate N consecutive execution actions within a preset time period in the future, where N is an integer greater than or equal to 2, and the obtaining of the action sequence information according to the environmental status information and the pre-trained reinforcement learning network includes:
determining the n-th execution action according to the pre-trained reinforcement learning network and the environment state of the target application scene at the (n-1)-th moment, where n is an integer and 0 < n ≤ N;
when n = 1, the environment state of the target application scene at the (n-1)-th moment is determined from the environment state information; when 1 < n ≤ N, the environment state of the target application scene at the (n-1)-th moment is determined from the environment state of the target application scene at the (n-2)-th moment and the (n-1)-th execution action.
6. The method of any one of claims 1 to 5, wherein the adjusting the values of the network parameters of the reinforcement learning network based on the total reward value to obtain the trained reinforcement learning network comprises:
circularly executing the following steps until the total reward value corresponding to the target action sequence output by the trained reinforcement learning network is greater than the preset threshold value:
adjusting the network parameter values of the reinforcement learning network based on the total reward value to obtain an adjusted reinforcement learning network; applying the action sequence information to the target application scene to obtain the environment state sequence information corresponding to the action sequence information;
and inputting the last environment state included in the environment state sequence information into the adjusted reinforcement learning network to obtain new action sequence information, output by the adjusted network, indicating a plurality of consecutive execution actions within the future preset time length, together with the total reward value generated when that action sequence information is executed.
7. A robot control method, comprising:
acquiring current environment state information of a target robot;
inputting the current environment state information into a reinforcement learning network trained by the network training method of any one of claims 1 to 6 to obtain a target action sequence for continuously controlling the target robot.
8. The method of claim 7, further comprising:
and, upon receiving an execution success indication sent by the target robot for the current execution action included in the target action sequence, issuing to the target robot an action instruction for executing the next execution action after the current one.
9. A network training apparatus, comprising:
the acquisition module is used for acquiring environmental state information in a target application scene;
the training module is used for obtaining action sequence information according to the environment state information and a pre-trained reinforcement learning network, and determining a total reward value corresponding to the action sequence information, wherein the action sequence information is used for indicating at least two consecutive execution actions within a future preset time length;
the adjusting module is used for adjusting the network parameter values of the reinforcement learning network based on the total reward value to obtain a trained reinforcement learning network; the trained reinforcement learning network is used for acquiring a target action sequence for continuously controlling a target object, and the total reward value corresponding to the target action sequence is greater than a preset threshold value.
10. A robot control apparatus, comprising:
the acquisition module is used for acquiring the current environment state information of the target robot;
a control module, configured to input the current environment state information into the reinforcement learning network trained by the network training method according to any one of claims 1 to 6, so as to obtain a target action sequence for continuously controlling the target robot.
11. An electronic device, comprising: a processor, a memory and a bus, the memory storing machine-readable instructions executable by the processor, the processor and the memory communicating over the bus when the electronic device is running, the machine-readable instructions when executed by the processor performing the steps of the network training method of any one of claims 1 to 6 or the steps of the robot control method of claim 7 or 8.
12. A computer-readable storage medium, characterized in that a computer program is stored thereon, which computer program, when being executed by a processor, performs the steps of the network training method according to any one of claims 1 to 6 or the steps of the robot control method according to claim 7 or 8.
CN202111659123.2A 2021-12-31 2021-12-31 Network training method, robot control method, network training device, robot control device, equipment and storage medium Pending CN114397817A (en)

Priority Applications (2)

Application Number Priority Date Filing Date Title
CN202111659123.2A CN114397817A (en) 2021-12-31 2021-12-31 Network training method, robot control method, network training device, robot control device, equipment and storage medium
PCT/CN2022/094863 WO2023123838A1 (en) 2021-12-31 2022-05-25 Network training method and apparatus, robot control method and apparatus, device, storage medium, and program

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202111659123.2A CN114397817A (en) 2021-12-31 2021-12-31 Network training method, robot control method, network training device, robot control device, equipment and storage medium

Publications (1)

Publication Number Publication Date
CN114397817A true CN114397817A (en) 2022-04-26

Family

ID=81228860

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202111659123.2A Pending CN114397817A (en) 2021-12-31 2021-12-31 Network training method, robot control method, network training device, robot control device, equipment and storage medium

Country Status (2)

Country Link
CN (1) CN114397817A (en)
WO (1) WO2023123838A1 (en)


Family Cites Families (9)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
WO2019155052A1 (en) * 2018-02-09 2019-08-15 Deepmind Technologies Limited Generative neural network systems for generating instruction sequences to control an agent performing a task
US11914350B2 (en) * 2018-08-09 2024-02-27 Siemens Aktiengesellschaft Manufacturing process control using constrained reinforcement machine learning
KR102267316B1 (en) * 2019-03-05 2021-06-21 네이버랩스 주식회사 Method and system for learning automatic driving agent based on deep reinforcement learning
CN111612126A (en) * 2020-04-18 2020-09-01 华为技术有限公司 Method and device for reinforcement learning
CN112882469B (en) * 2021-01-14 2022-04-08 浙江大学 Deep reinforcement learning obstacle avoidance navigation method integrating global training
CN113156892B (en) * 2021-04-16 2022-04-08 西湖大学 Four-footed robot simulated motion control method based on deep reinforcement learning
CN113077052B (en) * 2021-04-28 2023-10-24 平安科技(深圳)有限公司 Reinforcement learning method, device, equipment and medium for sparse rewarding environment
CN113326872A (en) * 2021-05-19 2021-08-31 广州中国科学院先进技术研究所 Multi-robot trajectory planning method
CN114397817A (en) * 2021-12-31 2022-04-26 上海商汤科技开发有限公司 Network training method, robot control method, network training device, robot control device, equipment and storage medium

Cited By (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
WO2023123838A1 (en) * 2021-12-31 2023-07-06 上海商汤智能科技有限公司 Network training method and apparatus, robot control method and apparatus, device, storage medium, and program

Also Published As

Publication number Publication date
WO2023123838A1 (en) 2023-07-06

Similar Documents

Publication Publication Date Title
US9679258B2 (en) Methods and apparatus for reinforcement learning
WO2017091629A1 (en) Reinforcement learning using confidence scores
CN110956148A (en) Autonomous obstacle avoidance method and device for unmanned vehicle, electronic device and readable storage medium
CN110447041B (en) Noise neural network layer
CN111489365A (en) Neural network training method, image processing method and device
US20210158162A1 (en) Training reinforcement learning agents to learn farsighted behaviors by predicting in latent space
US11759947B2 (en) Method for controlling a robot device and robot device controller
JP2010287131A (en) System and method for controlling learning
CN111026272A (en) Training method and device for virtual object behavior strategy, electronic equipment and storage medium
CN114585424A (en) Providing automatic user input to an application during an interruption
JP2018124982A (en) Control device and control method
CN113467487A (en) Path planning model training method, path planning device and electronic equipment
CN113614743A (en) Method and apparatus for operating a robot
CN114397817A (en) Network training method, robot control method, network training device, robot control device, equipment and storage medium
CN114004149A (en) Intelligent agent training method and device, computer equipment and storage medium
CN110826695B (en) Data processing method, device and computer readable storage medium
WO2021186500A1 (en) Learning device, learning method, and recording medium
KR20220166716A (en) Demonstration-conditioned reinforcement learning for few-shot imitation
Morales Deep Reinforcement Learning
García et al. Incremental reinforcement learning for multi-objective robotic tasks
KR102159645B1 (en) Apparatus and method for simulating
WO2022076061A1 (en) Interactive agent
CN113240118A (en) Superiority estimation method, superiority estimation apparatus, electronic device, and storage medium
Chen et al. Modified PPO-RND method for solving sparse reward problem in ViZDoom
JP2020179438A (en) Computing system and machine learning method

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination