CN115793450A - Robot return function self-adaption method based on reinforcement learning - Google Patents

Robot return function self-adaption method based on reinforcement learning

Info

Publication number
CN115793450A
Authority
CN
China
Prior art keywords
network
function
state
robot
actor
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202211459853.2A
Other languages
Chinese (zh)
Inventor
杨智友
符明晟
张帆
屈鸿
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
University of Electronic Science and Technology of China
Original Assignee
University of Electronic Science and Technology of China
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by University of Electronic Science and Technology of China filed Critical University of Electronic Science and Technology of China
Priority to CN202211459853.2A priority Critical patent/CN115793450A/en
Publication of CN115793450A publication Critical patent/CN115793450A/en
Pending legal-status Critical Current


Landscapes

  • Feedback Control In General (AREA)

Abstract

The invention discloses a reinforcement-learning-based robot return function self-adaption method which can learn a return value from the interaction trajectory between the robot and the environment, thereby guiding the reinforcement learning algorithm to optimize the control strategy, avoiding manual design of the return model, and, through the adaptive return model, improving the efficiency of reinforcement-learning-based walking control in different scenarios.

Description

Robot return function self-adaption method based on reinforcement learning
Technical Field
The invention relates to the technical field of machine learning, in particular to a robot return function self-adaption method based on reinforcement learning.
Background
At present, robot walking control is mainly developed on the basis of traditional control technology, which suffers from problems such as single, inflexible planning of the robot's walking route and a lack of coping strategies in complex scenes. With the rapid development of deep learning and reinforcement learning, the strong feature-learning capability of deep neural networks makes it possible to learn a large number of walking-control-related features from the interaction data between the robot and the external environment, and by combining reinforcement learning with a model of the robot walking problem, obstacles can be avoided while the robot walks. However, the reward obtained during the interaction process still has to be designed manually.
Disclosure of Invention
In order to overcome at least the above-mentioned deficiencies in the prior art, it is an object of the present application to provide an adaptive method for a robot reward function based on reinforcement learning.
The embodiment of the application provides a robot reward function self-adaption method based on reinforcement learning, which comprises the following steps:
using an Actor strategy in an Actor-Critic network to control the robot to interact with an external environment to obtain trajectory data, and storing the trajectory data into an environment buffer pool;
updating a reward model and the Actor-Critic network through data in the environment buffer pool; the reward model is constructed based on the amount of information generated during state transitions when the robot interacts with the external environment;
and controlling the robot to interact with the external environment according to the updated Actor-Critic network and acquiring new trajectory data to update the reward model and the Actor-Critic network.
In the prior art, modeling robot walking problems such as obstacle avoidance through reinforcement learning depends heavily on a scheme for evaluating the robot's walking strategy, which is mainly embodied in calculating a reward value when the robot walks by executing the strategy; this reward value is usually designed manually, which requires a great deal of manpower and material resources to adjust the reward design scheme. In the embodiment of the application, the reward model operates independently of the Actor-Critic network that makes decisions: the Actor-Critic network is essentially a model that provides a policy, while the reward model provides guidance for optimizing that policy. When the robot actually runs, its interaction with the external environment under the Actor strategy generates information data, namely trajectory data; the data in the environment buffer pool are updated as the trajectory data are updated, and the reward model is trained through its neural network so that it can learn reward values from the interaction trajectory between the robot and the environment, thereby guiding the reinforcement learning algorithm to optimize the control strategy.
In one possible implementation, the reward function includes an encoder and a decoder;
updating the reward model with data in the context buffer pool comprises:
inputting the current state and action from the environment buffer pool into the encoder as first input data, compressing the information in the first input data while preserving its integrity through the fully connected layers and activation layers of the neural network configured in the reward model, and outputting the mean and variance of a multi-dimensional Gaussian distribution as first output data through the last layer of the encoder;
sampling second input data from the first output data by the reparameterization method, inputting the second input data into the decoder, and outputting the mean and variance of the next-moment state as second output data through the last fully connected layer of the decoder;
supervised learning training of the decoder and the encoder is performed using the next time instant state sampled from the environmental buffer pool and the second output data.
In one possible implementation, the compression of the information in the first input data by the fully-connected layer and the active layer of the neural network configured in the reward model to ensure information integrity is performed by the following formula:
Figure SMS_1
Figure SMS_2
wherein z is the compressed information, KL is the KL divergence, q(z) is the prior probability of the compressed information, s is the state value, a is the action value, p(z|s,a) is the posterior probability of compressing s and a into z, μ₁ is the mean of the encoder output, and σ₁ is the variance of the encoder output.
In one possible implementation, outputting the mean and variance of the state at the next time as the second output data by the last fully-connected layer of the decoder is performed by:
Figure SMS_3
wherein s' is the state at the next moment, s'_i is the next-moment state sampled from the buffer pool, μ₂ is the mean of the output of the last fully connected layer in the decoder, and σ₂ is the variance of the output of the last fully connected layer in the decoder.
In one possible implementation, the optimization function of the reward model is implemented based on the encoder and the decoder, and the optimization function adopts the following formula:
Figure SMS_4
in the formula, μ₁ is the mean of the encoder output and σ₁ is the variance of the encoder output; μ₂ is the mean of the next-moment state output by the decoder and σ₂ is the variance of the next-moment state output by the decoder.
In a possible implementation manner, using an Actor policy in an Actor-Critic network to control a robot to interact with an external environment to acquire trajectory data, and storing the trajectory data into an environment buffer pool includes:
transmitting the current state faced by the robot in the real environment to the Actor-Critic network, and outputting the mean value and the variance of the action after calculation through the Actor-Critic network; the mean and variance are multidimensional Gaussian distribution;
sampling action values from the multi-dimensional Gaussian distribution and sending the action values to the robot, which adjusts the relevant parameters and executes the corresponding instructions, thereby entering a new environment state;
simultaneously inputting the current state, the current action and the state at the next moment into the return model, and calculating to obtain a corresponding return value when the current state is transferred;
and storing the current state, the current action, the calculated return value and the state at the next moment into the environment buffer pool as the track data.
In a possible implementation manner, the calculation of the corresponding return value when the current state transition occurs is performed by using the following formula:
Figure SMS_5
where s is the state value input to the encoder, a is the action value input to the encoder, μ₁ is the mean of the encoder output, σ₁ is the variance of the encoder output, z is the value sampled from the encoder output and used as the decoder input, with z = μ₁ + ϵσ₁, μ₂ is the mean of the next-moment state output by the decoder, σ₂ is the variance of the next-moment state output by the decoder, and ϵ is a value sampled from a distribution with mean 0 and variance 1.
In a possible implementation manner, the Critic network of the Actor-Critic network includes one target Q-function network and at least two current Q-function networks;
updating the Actor-Critic network with data in the environment buffer pool includes:
when the action selected in the current state is evaluated, all current Q function networks are used for calculation, and the current Q function network with the minimum current Q function value is selected from the calculation results to update the Actor network in the Actor-Critic network;
forming an MSE loss function by the current Q function value in the calculation result and the target Q function network to update the current Q function network;
and when the target Q function network is updated, updating according to the updated parameters of the current Q function network and the parameters of the target Q function network by using a momentum principle.
In one possible implementation, the updated parameters of the current Q-function network and the target Q-function network are updated using the momentum principle according to the following formula:
Figure SMS_6
in the formula, Q_θ denotes the target Q-function network parameters after the update,
Figure SMS_7
denotes the target Q-function network parameters to be updated, Q_δ denotes the parameters of the current Q-function network, and ε denotes the momentum parameter value at the moment of the momentum update, with ε between 0 and 1.
Compared with the prior art, the invention has the following advantages and beneficial effects:
the robot return function self-adaption method based on reinforcement learning can learn the return value according to the interaction track of the robot and the environment, so that the reinforcement learning algorithm is guided to optimize the control strategy, manual design intervention of the return model is avoided, and walking control of reinforcement learning under different scenes can be improved more efficiently through the self-adaption return model.
Drawings
The accompanying drawings, which are included to provide a further understanding of the embodiments of the invention and are incorporated in and constitute a part of this application, illustrate embodiment(s) of the invention and together with the description serve to explain the principles of the invention. In the drawings:
FIG. 1 is a schematic illustration of the steps of a method according to an embodiment of the present application;
FIG. 2 is a diagram illustrating a reward model according to an embodiment of the present application;
FIG. 3 is a schematic diagram of a network structure of an encoder and a decoder according to an embodiment of the present application;
fig. 4 is a diagram of an Actor policy network structure according to an embodiment of the present application;
fig. 5 is a structural diagram of a Critic network according to an embodiment of the present application.
Detailed Description
In order to make the purpose, technical solutions and advantages of the embodiments of the present application clearer, the technical solutions in the embodiments of the present application will be clearly and completely described below with reference to the drawings in the embodiments of the present application, and it should be understood that the drawings in the present application are for illustrative and descriptive purposes only and are not used to limit the scope of protection of the present application. Additionally, it should be understood that the schematic drawings are not necessarily drawn to scale. The flowcharts used in this application illustrate operations implemented according to some of the embodiments of the present application. It should be understood that the operations of the flow diagrams may be performed out of order, and steps without logical context may be performed in reverse order or simultaneously. In addition, one skilled in the art, under the guidance of the present disclosure, may add one or more other operations to, or remove one or more operations from, the flowchart.
In addition, the described embodiments are only a part of the embodiments of the present application, and not all of the embodiments. The components of the embodiments of the present application, generally described and illustrated in the figures herein, can be arranged and designed in a wide variety of different configurations. Thus, the following detailed description of the embodiments of the present application, as presented in the figures, is not intended to limit the scope of the claimed application, but is merely representative of selected embodiments of the application. All other embodiments, which can be derived by a person skilled in the art from the embodiments of the present application without making any creative effort, shall fall within the protection scope of the present application.
Please refer to fig. 1, which is a flowchart illustrating an adaptive method for a robot reward function based on reinforcement learning according to an embodiment of the present invention, and further, the adaptive method for a robot reward function based on reinforcement learning may specifically include the following steps S1 to S3.
S1: using an Actor strategy in an Actor-Critic network to control the robot to interact with an external environment to obtain trajectory data, and storing the trajectory data into an environment buffer pool;
S2: updating a reward model and the Actor-Critic network through data in the environment buffer pool; the reward model is constructed based on the amount of information generated during state transitions when the robot interacts with the external environment;
S3: and controlling the robot to interact with the external environment according to the updated Actor-Critic network and acquiring new trajectory data to update the reward model and the Actor-Critic network.
In the prior art, modeling robot walking problems such as obstacle avoidance through reinforcement learning depends heavily on a scheme for evaluating the robot's walking strategy, which is mainly embodied in calculating a reward value when the robot walks by executing the strategy; this reward value is usually designed manually, which requires a great deal of manpower and material resources to adjust the reward design scheme. In the embodiment of the application, the reward model operates independently of the Actor-Critic network that makes decisions: the Actor-Critic network is essentially a model that provides a policy, while the reward model is used to guide the optimization of that policy. When the robot actually runs, its interaction with the external environment under the Actor strategy generates information data, namely trajectory data; the data in the environment buffer pool are updated as the trajectory data are updated, and the reward model is trained through its neural network so that it can learn reward values from the interaction trajectory between the robot and the environment, thereby guiding the reinforcement learning algorithm to optimize the control strategy.
In one possible implementation, the reward function includes an encoder and a decoder;
updating the reward model with data in the context buffer pool comprises:
inputting the current state and action from the environment buffer pool into the encoder as first input data, compressing the information in the first input data while preserving its integrity through the fully connected layers and activation layers of the neural network configured in the reward model, and outputting the mean and variance of a multi-dimensional Gaussian distribution as first output data through the last layer of the encoder;
sampling second input data from the first output data by the reparameterization method, inputting the second input data into the decoder, and outputting the mean and variance of the next-moment state as second output data through the last fully connected layer of the decoder;
supervised learning training of the decoder and the encoder is performed using the next time instant state sampled from the environmental buffer pool and the second output data.
In the embodiment of the application, the optimization function of the reward model is a distinctive design: it is derived from the amount of information generated during state transitions when the robot interacts with the external environment. Specifically, the optimization function of the reward model learns the amount of information present in the environment's state-transition process and then guides the optimization of the Actor-Critic model. In the reward model, a neural network is used to learn the amount of reward information carried by a state transition; the reward model comprises an encoder module and a decoder module arranged in sequence, and both the encoder and the decoder contain fully connected layers and activation layers.
Specifically, the encoder in the reward model takes the current state-action pair as input, maximally compresses the information it contains through the fully connected and activation layers of a neural network while ensuring the integrity of the information, and outputs a multi-dimensional mean and variance at the last layer of the encoder network. The decoder in the reward model restores the next-moment state from the information compressed by the encoder: the decoder input is sampled, using the reparameterization method, from the multi-dimensional Gaussian distribution output by the encoder, and the final output of the decoder is again the mean and variance of a multi-dimensional Gaussian distribution, namely the mean and variance of the next-moment state output by the last fully connected layer of the decoder. The multi-dimensional Gaussian distribution is adopted to increase the model's adaptability to complex environments.
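As an illustration of this encoder-decoder structure (and not the patented implementation itself), the following PyTorch sketch builds an encoder that maps a state-action pair to the mean and variance of a multi-dimensional Gaussian, and a decoder that maps a compressed code back to the mean and variance of the next-moment state. The layer widths (256), the swish activation, and the names state_dim, action_dim and latent_dim are assumptions drawn from the embodiment described later.

```python
import torch
import torch.nn as nn

class RewardModelEncoder(nn.Module):
    """Compresses a (state, action) pair into the mean/variance of a multi-dimensional Gaussian."""
    def __init__(self, state_dim, action_dim, latent_dim, hidden=256):
        super().__init__()
        self.body = nn.Sequential(
            nn.Linear(state_dim + action_dim, hidden), nn.SiLU(),  # SiLU == swish
            nn.Linear(hidden, hidden), nn.SiLU(),
        )
        self.mu = nn.Linear(hidden, latent_dim)         # mean of the compressed code
        self.log_sigma = nn.Linear(hidden, latent_dim)  # log std of the compressed code

    def forward(self, s, a):
        h = self.body(torch.cat([s, a], dim=-1))
        return self.mu(h), self.log_sigma(h).exp()

class RewardModelDecoder(nn.Module):
    """Restores the distribution of the next-moment state from the compressed code z."""
    def __init__(self, latent_dim, state_dim, hidden=256):
        super().__init__()
        self.body = nn.Sequential(
            nn.Linear(latent_dim, hidden), nn.SiLU(),
            nn.Linear(hidden, hidden), nn.SiLU(),
        )
        self.mu = nn.Linear(hidden, state_dim)          # mean of the next-moment state
        self.log_sigma = nn.Linear(hidden, state_dim)   # log std of the next-moment state

    def forward(self, z):
        h = self.body(z)
        return self.mu(h), self.log_sigma(h).exp()
```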
In one possible implementation, the compression of the information in the first input data by the fully-connected layer and the active layer of the neural network configured in the reward model to ensure information integrity is performed by the following formula:
Figure SMS_8
Figure SMS_9
where z is the compressed information, KL is the KL divergence, q(z) is the prior probability of the compressed information, s is the state value, a is the action value, p(z|s,a) is the posterior probability of compressing s and a into z, μ₁ is the mean of the encoder output, and σ₁ is the variance of the encoder output.
In the embodiment of the present application, the reward model must encode with the encoder and decode with the decoder before the reward value of a transition can be calculated. Therefore, in order to compress the information of the state-action pair, the encoder part of the reward model needs to estimate the information between the input and the compressed code using a variational inference method, and the compressed information is therefore calculated with the above expression.
In one possible implementation, outputting the mean and variance of the state at the next time instant as second output data by the last fully-connected layer of the decoder is performed using the following equation:
Figure SMS_10
wherein s' is the state at the next moment, s'_i is the next-moment state sampled from the buffer pool, μ₂ is the mean of the output of the last fully connected layer in the decoder, and σ₂ is the variance of the output of the last fully connected layer in the decoder.
In the embodiment of the present application, the decoder part of the reward model decodes the information compressed by the encoder into the next-moment state; therefore, for the decoder part, the next-moment state can be decoded from the compressed information z using log maximum-likelihood estimation.
In one possible implementation, the optimization function of the reward model is implemented based on the encoder and the decoder, and the optimization function adopts the following formula:
Figure SMS_11
in the formula, μ₁ is the mean of the encoder output and σ₁ is the variance of the encoder output; μ₂ is the mean of the next-moment state output by the decoder and σ₂ is the variance of the next-moment state output by the decoder.
In the embodiment of the present application, the optimization function given by the above formula is obtained by combining the encoder objective and the decoder objective described above.
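A minimal training-step sketch of such an optimization function is given below, assuming the standard variational treatment suggested by the description: a KL term pulls the encoder posterior N(μ₁, σ₁²) towards a standard-normal prior q(z), while a reconstruction term is the log-likelihood of the sampled next-moment state under the decoder's Gaussian N(μ₂, σ₂²). The exact weighting and form of the patented formula are not reproduced here.

```python
import torch
from torch.distributions import Normal, kl_divergence

def reward_model_loss(encoder, decoder, s, a, s_next):
    """Negative-ELBO-style objective: reconstruction log-likelihood minus KL compression cost."""
    mu1, sigma1 = encoder(s, a)                       # encoder posterior p(z | s, a)
    posterior = Normal(mu1, sigma1)
    z = posterior.rsample()                           # reparameterized sample z = mu1 + eps * sigma1
    mu2, sigma2 = decoder(z)                          # decoder Gaussian over the next-moment state
    log_likelihood = Normal(mu2, sigma2).log_prob(s_next).sum(dim=-1)
    prior = Normal(torch.zeros_like(mu1), torch.ones_like(sigma1))
    kl = kl_divergence(posterior, prior).sum(dim=-1)  # KL(p(z|s,a) || q(z))
    return (kl - log_likelihood).mean()               # minimize: compress while reconstructing s_next
```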
In a possible implementation manner, using an Actor policy in an Actor-Critic network to control a robot to interact with an external environment to acquire trajectory data, and storing the trajectory data into an environment buffer pool includes:
transmitting the current state faced by the robot in the real environment to the Actor-Critic network, and outputting the mean value and the variance of the action after calculation through the Actor-Critic network; the mean and variance are multidimensional Gaussian distribution;
sampling action values from the multi-dimensional Gaussian distribution and sending the action values to the robot, which adjusts the relevant parameters and executes the corresponding instructions, thereby entering a new environment state;
simultaneously inputting the current state, the current action and the state at the next moment into the return model, and calculating to obtain a corresponding return value when the current state is transferred;
and storing the current state, the current action, the calculated return value and the state at the next moment as the track data into the environment buffer pool.
In the embodiment of the application, the walking problem of the robot in the external environment is modeled as a Markov decision process: during walking the robot observes the state of the external environment and inputs it to the Actor network, the action value given by the Actor network is returned to the robot, the robot executes the action, continues to walk and enters a new external state, and a reward value is obtained during this transition; the reward value calculated by the reward model guides the optimization of the Critic network and the Actor network, thereby yielding a deep-reinforcement-learning-based algorithm for controlling the robot's walking.
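The interaction loop described above can be sketched as follows; env, actor, reward_model and buffer are placeholder objects standing in for the real robot environment, the Actor policy, the reward model and the environment buffer pool, and their method names (reset, step, sample_action, reward, store) are assumptions made for illustration only.

```python
def collect_trajectory(env, actor, reward_model, buffer, steps=1000):
    """Roll out the Actor policy, score each transition with the reward model, fill the buffer."""
    s = env.reset()
    for _ in range(steps):
        a = actor.sample_action(s)             # sample from the Actor's Gaussian output
        s_next = env.step(a)                   # robot executes the action, enters a new state
        r = reward_model.reward(s, a, s_next)  # reward value for this state transition
        buffer.store(s, a, r, s_next)          # trajectory data into the environment buffer pool
        s = s_next
```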
In a possible implementation manner, the calculation of the corresponding return value when the current state transition occurs is performed by using the following formula:
Figure SMS_12
where s is the state value input to the encoder, a is the action value input to the encoder, μ₁ is the mean of the encoder output, σ₁ is the variance of the encoder output, z is the value sampled from the encoder output and used as the decoder input, with z = μ₁ + ϵσ₁, μ₂ is the mean of the next-moment state output by the decoder, σ₂ is the variance of the next-moment state output by the decoder, and ϵ is a value sampled from a distribution with mean 0 and variance 1.
In the embodiment of the application, the optimization function of the reward model is a distinctive design derived from the amount of information generated during state transitions when the robot interacts with the external environment. Specifically, the optimization function of the reward model learns the amount of information present in the environment's state-transition process and then guides the optimization of the Actor-Critic model. In the reward model, a neural network is used to learn the amount of reward information carried by a state transition; the reward model comprises an encoder and a decoder arranged in sequence, each containing fully connected layers and activation layers. The encoder in the reward model takes the current state-action pair as input, maximally compresses the information it contains through the fully connected and activation layers of a neural network while ensuring the integrity of the information, and outputs a multi-dimensional mean and variance at the last layer of the encoder network. The decoder in the reward model restores the next-moment state from the information compressed by the encoder: the decoder input is sampled, using the reparameterization method, from the multi-dimensional Gaussian distribution output by the encoder, and the final output of the decoder is again the mean and variance of a multi-dimensional Gaussian distribution, namely the mean and variance of the next-moment state output by the last fully connected layer of the decoder; the multi-dimensional Gaussian distribution is adopted to increase the model's adaptability to complex environments. Through the joint operation of the encoder and the decoder, the reward value can be calculated after the robot transitions to the next moment.
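One plausible sketch of this reward computation is shown below, under the assumption that the reward of a transition is taken as the information it carries: the decoder log-likelihood of the actual next state offset by the KL compression cost of the encoder. This is a hedged reading of the description; the exact closed form of the patented formula is not reproduced.

```python
import torch
from torch.distributions import Normal, kl_divergence

@torch.no_grad()
def transition_reward(encoder, decoder, s, a, s_next):
    """Reward of a transition as the information the reward model attributes to it (assumed form)."""
    mu1, sigma1 = encoder(s, a)
    eps = torch.randn_like(sigma1)                 # eps sampled from N(0, 1)
    z = mu1 + eps * sigma1                         # z = mu1 + eps * sigma1
    mu2, sigma2 = decoder(z)
    log_likelihood = Normal(mu2, sigma2).log_prob(s_next).sum(dim=-1)
    kl = kl_divergence(Normal(mu1, sigma1),
                       Normal(torch.zeros_like(mu1), torch.ones_like(sigma1))).sum(dim=-1)
    return log_likelihood - kl
```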
In a possible implementation manner, the Critic network of the Actor-Critic network includes one target Q-function network and at least two current Q-function networks;
updating the Actor-Critic network with data in the environment buffer pool comprises:
when the action selected in the current state is evaluated, all current Q function networks are used for calculation, and the current Q function network with the minimum current Q function value is selected from the calculation results to update the Actor network in the Actor-Critic network;
forming an MSE loss function by the current Q function value in the calculation result and the target Q function network to update the current Q function network;
and when the target Q function network is updated, updating according to the updated parameters of the current Q function network and the parameters of the target Q function network by using a momentum principle.
In the embodiment of the application, the Actor-Critic network includes an Actor policy network and 2 Critic networks, each of which consists of fully connected layers and activation layers arranged in sequence. The Actor policy network determines the mean and variance of the action for the current state from the output of its last fully connected layer; specifically, the Actor policy network obtains the mean and variance of a multi-dimensional Gaussian distribution from the fully connected layer's output and samples from it to obtain the final action value. The amount of information present in the real environment's state-transition process is learned by a neural network model and used as the actual reward value; the essence of the reward model is its ability to capture the hidden amount of information in the dynamic transitions of the real environment, and using this as the reward value for the corresponding state transition allows the reward values to reflect, to the greatest extent, the various uncertainties in the state-transition process, which benefits the learning of the Actor-Critic network. When the Actor policy network outputs the mean and variance from its last fully connected layer, it samples from the multi-dimensional Gaussian distribution and then applies a nonlinear tanh mapping to ensure that the final action values all lie within the effective range. The Critic network comprises one target Q-function network and 2 current Q-function networks; when the action selected in the current state is evaluated, the current Q-function networks are used for calculation, and the network with the smaller Q-function value is selected from their results to update the Actor policy network; the two calculated current Q-function values each form an MSE loss function with the target Q-function network, and the 2 current Q-function networks are updated accordingly.
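A sketch of this clipped double-Q style update is given below, using assumed placeholder networks q1, q2, q_target and actor; the discount factor gamma, the entropy weight alpha, and the hypothetical sample_with_log_prob method are not stated in the text and are assumptions used for illustration.

```python
import torch
import torch.nn.functional as F

def update_critic_and_actor(q1, q2, q_target, actor, batch, gamma=0.99, alpha=0.2):
    """One update step for the two current Q-function networks and the Actor network (assumed form)."""
    s, a, r, s_next = batch
    # Bootstrap target from the single target Q-function network
    with torch.no_grad():
        a_next, _ = actor.sample_with_log_prob(s_next)
        td_target = r + gamma * q_target(s_next, a_next)
    # MSE (TD-error) losses update the two current Q-function networks
    critic_loss = F.mse_loss(q1(s, a), td_target) + F.mse_loss(q2(s, a), td_target)
    # The Actor is updated with the smaller of the two current Q values plus a policy-entropy term
    a_new, logp_new = actor.sample_with_log_prob(s)
    q_min = torch.min(q1(s, a_new), q2(s, a_new))
    actor_loss = (alpha * logp_new - q_min).mean()
    return critic_loss, actor_loss
```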
In one possible implementation, the updated parameters of the current Q-function network and the target Q-function network are updated using the momentum principle according to the following formula:
Figure SMS_13
in the formula, Q_θ denotes the target Q-function network parameters after the update,
Figure SMS_14
denotes the target Q-function network parameters to be updated, Q_δ denotes the parameters of the current Q-function network, and ε denotes the momentum parameter value at the moment of the momentum update, with ε between 0 and 1.
In the embodiment of the present application, the target Q-function network is updated using the momentum principle, that is, it is updated from the updated parameters of the current Q-function network and the parameters of the target Q-function network according to the momentum principle.
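A minimal sketch of such a momentum-style update is shown below, assuming the conventional Polyak form target ← ε·target + (1 − ε)·current with ε in (0, 1); the exact arrangement of the patented formula is not reproduced.

```python
import torch

@torch.no_grad()
def momentum_update(target_net, current_net, eps=0.995):
    """theta_target <- eps * theta_target + (1 - eps) * theta_current, with eps in (0, 1)."""
    for p_t, p_c in zip(target_net.parameters(), current_net.parameters()):
        p_t.mul_(eps).add_((1.0 - eps) * p_c)
```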
In one possible implementation manner, the loss function of the Actor network in the Actor-Critic network consists of two parts: the first part is the Critic network's evaluation value of the action selected by the policy in a given state, namely the state-action value function; the second part is the entropy of the policy's action selection. The TD-error is used as the loss function for updating the parameters of the Critic network.
In one possible implementation, please refer to fig. 2, which illustrates the neural network model structure of the reward model: the model consists of 4 fully connected layers and 3 activation functions arranged in sequence, each fully connected layer has 256 hidden units, and the activation function is swish. As shown in fig. 3, the encoder and the decoder of the reward model each form such a neural network model: the input of the encoder in the reward model is a randomly given state together with an action in that state, and its output is the mean and variance after information compression; the input of the decoder in the reward model is a value sampled from the mean and variance output by the encoder, and its output is the mean and variance of the state at the next moment. As shown in fig. 4, the Actor policy network of this embodiment comprises 3 fully connected layers and ReLU layers arranged in sequence, each fully connected layer containing 256 neurons; its input is the current or given state, and its output is the action to be taken in that state. As shown in fig. 5, the 2 Critic networks of this embodiment have exactly the same structure, comprising 3 fully connected layers and ReLU layers arranged in sequence, each fully connected layer containing 256 neurons; the input is a given state and the corresponding action, and the output is an evaluation value for this scenario.
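For illustration only, the Actor policy network described here might look like the sketch below: 256-unit fully connected layers with ReLU activations, a Gaussian head producing the action mean and variance, and a tanh squashing of the sampled action to keep it in the effective range. This is an assumed reading of the embodiment, not its exact implementation.

```python
import torch
import torch.nn as nn
from torch.distributions import Normal

class ActorPolicy(nn.Module):
    """Maps a state to the mean/variance of a Gaussian over actions; samples are squashed with tanh."""
    def __init__(self, state_dim, action_dim, hidden=256):
        super().__init__()
        self.body = nn.Sequential(
            nn.Linear(state_dim, hidden), nn.ReLU(),
            nn.Linear(hidden, hidden), nn.ReLU(),
            nn.Linear(hidden, hidden), nn.ReLU(),
        )
        self.mu = nn.Linear(hidden, action_dim)
        self.log_sigma = nn.Linear(hidden, action_dim)

    def forward(self, s):
        h = self.body(s)
        return self.mu(h), self.log_sigma(h).exp()

    def sample_action(self, s):
        mu, sigma = self.forward(s)
        a = Normal(mu, sigma).rsample()
        return torch.tanh(a)  # keep the final action values within the effective range
```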
Those of ordinary skill in the art will appreciate that the elements and algorithm steps of the examples described in connection with the embodiments disclosed herein may be implemented in electronic hardware, computer software, or a combination of both, and the components and steps of the examples have been described above generally in terms of their functions in order to clearly illustrate the interchangeability of hardware and software. Whether such functionality is implemented as hardware or software depends upon the particular application and the design constraints imposed on the technical solution. Skilled artisans may implement the described functionality in varying ways for each particular application, but such implementation decisions should not be interpreted as causing a departure from the scope of the present invention.
In the several embodiments provided in the present application, it should be understood that the disclosed apparatus and method may be implemented in other ways. For example, the above-described apparatus embodiments are merely illustrative, and for example, the division of the units is only one type of logical functional division, and other divisions may be realized in practice, for example, multiple units or components may be combined or integrated into another system, or some features may be omitted, or not executed. In addition, the shown or discussed mutual coupling or direct coupling or communication connection may be an indirect coupling or communication connection through some interfaces, devices or units, and may also be an electric, mechanical or other form of connection.
The elements described as separate components may or may not be physically separate.
In addition, functional units in the embodiments of the present invention may be integrated into one processing unit, or each unit may exist alone physically, or two or more units are integrated into one unit. The integrated unit may be implemented in the form of hardware, or may also be implemented in the form of a software functional unit.
The integrated unit, if implemented in the form of a software functional unit and sold or used as a stand-alone product, may be stored in a computer readable storage medium. Based on such understanding, the technical solution of the present invention essentially or partly contributes to the prior art, or all or part of the technical solution can be embodied in the form of a software product, which is stored in a storage medium and includes several instructions for causing a computer device (which may be a personal computer, a server, or a grid device) to execute all or part of the steps of the method according to the embodiments of the present invention. And the aforementioned storage medium includes: a U-disk, a removable hard disk, a Read-Only Memory (ROM), a Random Access Memory (RAM), a magnetic disk or an optical disk, and other various media capable of storing program codes.
The above-mentioned embodiments, objects, technical solutions and advantages of the present invention are further described in detail, it should be understood that the above-mentioned embodiments are only examples of the present invention, and are not intended to limit the scope of the present invention, and any modifications, equivalent substitutions, improvements and the like made within the spirit and principle of the present invention should be included in the scope of the present invention.

Claims (9)

1. The robot reward function self-adaption method based on reinforcement learning is characterized by comprising the following steps:
using an Actor strategy in an Actor-Critic network to control the robot to interact with an external environment to obtain trajectory data, and storing the trajectory data into an environment buffer pool;
updating a reward model and the Actor-Critic network through data in the environment buffer pool; the reward model is constructed based on the amount of information generated during state transitions when the robot interacts with the external environment;
and controlling the robot to interact with the external environment according to the updated Actor-Critic network and acquiring new trajectory data to update the reward model and the Actor-Critic network.
2. The adaptive method of reinforcement learning-based robot reward function of claim 1, wherein the reward function comprises an encoder and a decoder;
updating the reward model with data in the context buffer pool comprises:
inputting the current state and action from the environment buffer pool into the encoder as first input data, compressing the information in the first input data while preserving its integrity through the fully connected layers and activation layers of the neural network configured in the reward model, and outputting the mean and variance of a multi-dimensional Gaussian distribution as first output data through the last layer of the encoder;
sampling second input data from the first output data by the reparameterization method, inputting the second input data into the decoder, and outputting the mean and variance of the next-moment state as second output data through the last fully connected layer of the decoder;
supervised learning training of the decoder and the encoder is performed using the next time instant state sampled from the environmental buffer pool and the second output data.
3. The adaptive method for reinforcement learning-based robot reward function of claim 2, wherein the compression, preserving information integrity, of the information in the first input data through the fully connected layers and activation layers of the neural network configured in the reward model is performed according to the following formulas:
Figure QLYQS_1
Figure QLYQS_2
where z is the compressed information, KL is the KL divergence, q(z) is the prior probability of the compressed information, s is the state value, a is the action value, p(z|s,a) is the posterior probability of compressing s and a into z, μ₁ is the mean of the encoder output, and σ₁ is the variance of the encoder output.
4. The adaptive method for reinforcement-learning-based robot reward function of claim 3, wherein outputting the mean and variance of the state at the next moment as the second output data by the last fully-connected layer of the decoder is performed according to the following formula:
Figure QLYQS_3
wherein s' is the state at the next moment, s'_i is the next-moment state sampled from the buffer pool, μ₂ is the mean of the output of the last fully connected layer in the decoder, and σ₂ is the variance of the output of the last fully connected layer in the decoder.
5. The adaptive method for reinforcement-learning-based robot reward function of claim 4, wherein an optimization function of the reward model is implemented based on the encoder and the decoder, and the optimization function employs the following equation:
Figure QLYQS_4
in the formula, μ₁ is the mean of the encoder output and σ₁ is the variance of the encoder output; μ₂ is the mean of the next-moment state output by the decoder and σ₂ is the variance of the next-moment state output by the decoder.
6. The adaptive method for the reward function of the reinforcement-learning-based robot as claimed in claim 2, wherein the controlling the robot to interact with the external environment to obtain trajectory data using an Actor policy in an Actor-Critic network and storing the trajectory data into an environment buffer pool comprises:
transmitting the current state faced by the robot in the real environment to the Actor-Critic network, and outputting the mean value and the variance of the action after calculation through the Actor-Critic network; the mean and variance are multidimensional Gaussian distribution;
sampling action values from the multi-dimensional Gaussian distribution and sending the action values to the robot, which adjusts the relevant parameters and executes the corresponding instructions, thereby entering a new environment state;
simultaneously inputting the current state, the current action and the state at the next moment into the return model, and calculating to obtain a corresponding return value when the current state is transferred;
and storing the current state, the current action, the calculated return value and the state at the next moment into the environment buffer pool as the track data.
7. The adaptive method for the reward function of the robot based on reinforcement learning of claim 6, wherein the calculation of the corresponding reward value when the current state transition occurs is performed according to the following formula:
Figure QLYQS_5
where s is the state value input to the encoder, a is the action value input to the encoder, μ₁ is the mean of the encoder output, σ₁ is the variance of the encoder output, z is the value sampled from the encoder output and used as the decoder input, with z = μ₁ + ϵσ₁, μ₂ is the mean of the next-moment state output by the decoder, σ₂ is the variance of the next-moment state output by the decoder, and ϵ is a value obtained by random sampling from a distribution with mean 0 and variance 1.
8. The adaptive method for the reward function of the reinforcement-learning-based robot according to claim 1, wherein the Critic network of the Actor-Critic network comprises one target Q-function network and at least two current Q-function networks;
updating the Actor-Critic network with data in the environment buffer pool includes:
when the selected action in the current state is evaluated, all current Q function networks are used for calculation, and the current Q function network with the minimum current Q function value is selected from the calculation results to update the Actor network in the Actor-Critic network;
forming an MSE loss function by the current Q function value in the calculation result and the target Q function network to update the current Q function network;
and when the target Q function network is updated, updating by using a momentum principle according to the updated parameters of the current Q function network and the parameters of the target Q function network.
9. The adaptive method for robot reward function based on reinforcement learning of claim 8, wherein the updated parameters of the current Q-function network and the target Q-function network are updated using momentum principle according to the following formula:
Figure QLYQS_6
in the formula, Q_θ denotes the target Q-function network parameters after the update,
Figure QLYQS_7
denotes the target Q-function network parameters to be updated, Q_δ denotes the parameters of the current Q-function network, and ε denotes the momentum parameter value at the moment of the momentum update, with ε between 0 and 1.
CN202211459853.2A 2022-11-16 2022-11-16 Robot return function self-adaption method based on reinforcement learning Pending CN115793450A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202211459853.2A CN115793450A (en) 2022-11-16 2022-11-16 Robot return function self-adaption method based on reinforcement learning

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202211459853.2A CN115793450A (en) 2022-11-16 2022-11-16 Robot return function self-adaption method based on reinforcement learning

Publications (1)

Publication Number Publication Date
CN115793450A true CN115793450A (en) 2023-03-14

Family

ID=85439621

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202211459853.2A Pending CN115793450A (en) 2022-11-16 2022-11-16 Robot return function self-adaption method based on reinforcement learning

Country Status (1)

Country Link
CN (1) CN115793450A (en)

Cited By (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN116449716A (en) * 2023-06-13 2023-07-18 辰极智航(北京)科技有限公司 Intelligent servo stable control method, device, system, controller and storage medium
CN116449716B (en) * 2023-06-13 2023-09-29 辰极智航(北京)科技有限公司 Intelligent servo stable control method, device, system, controller and storage medium

Similar Documents

Publication Publication Date Title
Eysenbach et al. Contrastive learning as goal-conditioned reinforcement learning
WO2021208771A1 (en) Reinforced learning method and device
Finn et al. A connection between generative adversarial networks, inverse reinforcement learning, and energy-based models
CN110968866B (en) Defense method for resisting attack for deep reinforcement learning model
CN111600851A (en) Feature filtering defense method for deep reinforcement learning model
CN112884131A (en) Deep reinforcement learning strategy optimization defense method and device based on simulation learning
CN112491818B (en) Power grid transmission line defense method based on multi-agent deep reinforcement learning
CN114757351B (en) Defense method for resisting attack by deep reinforcement learning model
CN115793450A (en) Robot return function self-adaption method based on reinforcement learning
Andersen et al. Towards safe reinforcement-learning in industrial grid-warehousing
CN117077727B (en) Track prediction method based on space-time attention mechanism and neural ordinary differential equation
CN116382267B (en) Robot dynamic obstacle avoidance method based on multi-mode pulse neural network
CN115438856A (en) Pedestrian trajectory prediction method based on space-time interaction characteristics and end point information
CN113894780A (en) Multi-robot cooperative countermeasure method and device, electronic equipment and storage medium
CN112216341B (en) Group behavior logic optimization method and computer readable storage medium
Li et al. Mastering robot manipulation with multimodal prompts through pretraining and multi-task fine-tuning
CN115009291B (en) Automatic driving assistance decision making method and system based on network evolution replay buffer area
KR20220166716A (en) Demonstration-conditioned reinforcement learning for few-shot imitation
Zhao et al. Efficient online estimation of empowerment for reinforcement learning
CN113761148A (en) Conversation information acquisition method, device, equipment and storage medium
Kono et al. Convergence estimation utilizing fractal dimensional analysis for reinforcement learning
CN113032778A (en) Semi-supervised network abnormal behavior detection method based on behavior feature coding
Allday et al. Auto-perceptive reinforcement learning (APRIL)
CN117733874B (en) Robot state prediction method and device, electronic equipment and storage medium
CN117933349A (en) Visual reinforcement learning method based on safety mutual simulation measurement

Legal Events

Date Code Title Description
PB01 Publication
PB01 Publication
SE01 Entry into force of request for substantive examination
SE01 Entry into force of request for substantive examination