CN113268854A - Reinforced learning method and system for double evaluators and single actuator - Google Patents

Reinforced learning method and system for double evaluators and single actuator

Info

Publication number
CN113268854A
CN113268854A (application CN202110415953.4A)
Authority
CN
China
Prior art keywords
state
action
network
function
reward
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202110415953.4A
Other languages
Chinese (zh)
Inventor
任维雅
周仕扬
任小广
王彦臻
易晓东
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
National Defense Technology Innovation Institute PLA Academy of Military Science
Original Assignee
National Defense Technology Innovation Institute PLA Academy of Military Science
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by National Defense Technology Innovation Institute PLA Academy of Military Science filed Critical National Defense Technology Innovation Institute PLA Academy of Military Science
Priority to CN202110415953.4A priority Critical patent/CN113268854A/en
Publication of CN113268854A publication Critical patent/CN113268854A/en
Pending legal-status Critical Current


Classifications

    • G — PHYSICS
    • G06 — COMPUTING; CALCULATING OR COUNTING
    • G06F — ELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00 — Pattern recognition
    • G06F18/20 — Analysing
    • G06F18/21 — Design or setup of recognition systems or techniques; Extraction of features in feature space; Blind source separation
    • G06F18/214 — Generating training patterns; Bootstrap methods, e.g. bagging or boosting
    • G — PHYSICS
    • G06 — COMPUTING; CALCULATING OR COUNTING
    • G06F — ELECTRIC DIGITAL DATA PROCESSING
    • G06F30/00 — Computer-aided design [CAD]
    • G06F30/20 — Design optimisation, verification or simulation

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Evolutionary Computation (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Data Mining & Analysis (AREA)
  • Bioinformatics & Cheminformatics (AREA)
  • Bioinformatics & Computational Biology (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Evolutionary Biology (AREA)
  • Artificial Intelligence (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Computer Hardware Design (AREA)
  • Geometry (AREA)
  • Management, Administration, Business Operations System, And Electronic Commerce (AREA)

Abstract

The invention discloses a dual-evaluator single-actuator reinforcement learning method and system, comprising the following steps: S1, initializing the parameters of the dual-evaluator single-actuator agent, and setting the scaling coefficient of each evaluator in the loss function of the policy network; S2, initializing the noise function and obtaining an initial state from the initialized environment; S3, computing an action from the current state, the current policy and the noise function, executing the action, observing the reward and the next state, and storing the current state, action, reward and next state in a buffer as an experience; S4, updating the parameters of the dual-evaluator single-actuator agent according to N samples collected from the buffer and the loss function; the above steps are repeated according to the set iteration conditions to train the dual-evaluator single-actuator agent; wherein the evaluators comprise a reward-based evaluator and an evaluator based on an artificial potential field. The invention addresses problems of model-free reinforcement learning such as low sample utilization and slow training convergence.

Description

Reinforced learning method and system for double evaluators and single actuator
Technical Field
The invention relates to the field of intelligent agent path planning, and in particular to a dual-evaluator single-actuator reinforcement learning method and system.
Background
At present, most model-free reinforcement learning algorithms adopt generalized policy iteration, which alternates between policy evaluation and policy improvement: policy evaluation estimates the action-value function, and policy improvement updates the policy according to that action-value function. Based on generalized policy iteration and the policy gradient theorem, Actor-Critic (AC) has become a widely used architecture.
The deterministic policy gradient algorithm (DPG) further considers deterministic policies for continuous actions on the basis of the AC framework, which reduces the variance of policy evaluation compared with stochastic policies. The deep deterministic policy gradient (DDPG) further combines deep neural networks with DPG, improving the modeling capacity. However, both the model-free AC algorithm and the DDPG algorithm generate samples by interacting directly with the environment, and therefore suffer from low sampling efficiency, slow convergence and other problems.
Model-based planning methods accelerate learning, or obtain a better estimate of the state-action value, by carrying out simulated rollouts with a learned model. Although their computational efficiency is higher and their convergence faster, the accuracy of the planning is closely tied to the accuracy of the dynamic model of the environment. In real situations the environment is strongly affected by various random factors such as air temperature, frictional resistance, communication delay and material characteristics, and the dynamic model of the environment required for planning is often unavailable in practice. In addition, planning methods depend too strongly on the environment model and generalize poorly to new environments: once the environment changes, planning has to be redone. Yet the environment often changes over time, and it is impractical to obtain an accurate model that fully simulates a real environment.
Therefore, how to combine reinforcement learning with planning methods so that each compensates for the other's shortcomings is a problem that urgently needs to be solved.
Disclosure of Invention
In order to overcome the above-mentioned deficiencies in the prior art, the present invention provides a dual-evaluator single-actuator reinforcement learning method, comprising:
S1, initializing the parameters of the dual-evaluator single-actuator agent, and setting the scaling coefficient of each evaluator in the loss function of the policy network;
S2, initializing the noise function and obtaining an initial state from the initialized environment;
S3, computing an action from the current state, the current policy and the noise function, executing the action, observing the reward and the next state, and storing the current state, action, reward and next state in a buffer as an experience;
S4, updating the parameters of the dual-evaluator single-actuator agent according to the N samples collected from the buffer and the loss function;
S5, judging whether the number of training steps has reached the maximum number of steps per episode; if so, incrementing the episode count and executing S6, otherwise incrementing the step count and executing S3;
S6, judging whether the episode count has reached the set maximum number of episodes; if so, ending the training, otherwise resetting the step count and executing S2;
wherein the evaluators comprise a reward-based evaluator and an evaluator based on an artificial potential field.
Preferably, in step S1, initializing the parameters of the dual-evaluator single-actuator agent and setting the scaling coefficient of each evaluator in the loss function of the policy network comprises:
S101, randomly initializing the value-function network of the reward-based evaluator and the policy network of the actuator;
S102, initializing the weights of the target networks;
S103, initializing the experience replay buffer;
S104, setting the scaling coefficient of each evaluator in the loss function of the policy network.
Preferably, the loss function of the policy network is as follows:
J(μ_θ) = ∫_S ρ^μ(s, γ_1) β r(s, μ_θ(s)) ds + ∫_S ρ^μ(s, γ_2) (1 − β) q_PF(s, μ_θ(s)) ds
in the formula: J(μ_θ) is the loss function of the policy network; θ is the policy network parameter of the actuator; S is the state space; ρ^μ(s, γ_1) is the state distribution under discount γ_1; s is the current state; γ_1 is the discount coefficient of the reward; μ_θ(s) is the policy function; r(s, μ_θ(s)) is the reward obtained by taking policy μ_θ in state s; β is the scaling coefficient of the dual evaluators; ρ^μ(s, γ_2) is the state distribution under discount γ_2; γ_2 is the discount coefficient of the potential field value; q_PF(s, μ_θ(s)) is the potential-field-based state-action function when executing policy μ_θ in state s.
Preferably, the action is calculated from the current state, the current policy and the noise function as follows:
a_t = μ(s|θ) + N_t
in the formula: a_t is the action at time t; μ(s|θ) is the output of the current policy for the current state s; s is the current state; θ is the policy network parameter of the actuator; N_t is the noise at time t obtained from the noise function.
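For illustration only, the following Python sketch implements this action-selection rule, assuming a NumPy policy callable in place of μ(s|θ) and an Ornstein-Uhlenbeck process in place of the noise function N(·); the noise parameters and the [-1, 1] action bounds are assumptions, not values prescribed by the invention.

import numpy as np

class OUNoise:
    # Ornstein-Uhlenbeck exploration noise N_t (a common choice for DDPG-style methods).
    def __init__(self, dim, mu=0.0, theta=0.15, sigma=0.2):
        self.mu, self.theta, self.sigma = mu, theta, sigma
        self.state = np.full(dim, mu)

    def reset(self):
        self.state[:] = self.mu

    def sample(self):
        # dx = theta * (mu - x) + sigma * Gaussian noise
        self.state += self.theta * (self.mu - self.state) \
                      + self.sigma * np.random.randn(*self.state.shape)
        return self.state

def select_action(policy, state, noise):
    # a_t = mu(s | theta) + N_t, clipped to the assumed action bounds [-1, 1].
    return np.clip(policy(state) + noise.sample(), -1.0, 1.0)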
Preferably, in S4, updating the parameters of the dual-evaluator single-actuator agent according to the N samples collected from the buffer and the loss function comprises:
S401, updating the value-function network of the reward-based evaluator according to the N samples collected from the buffer and the value-function network update formula;
S402, calculating the value of the state-action value function of the evaluator based on the artificial potential field according to the N samples collected from the buffer and the preset state-action value function of the artificial potential field;
S403, updating the policy network of the actuator according to the N samples collected from the buffer, the loss function and the policy network parameter update formula;
S404, updating the target networks according to the policy network parameter θ and the policy network μ.
Preferably, the state-action value function of the artificial potential field is as follows:
Figure BDA0003024332670000031
in the formula: qPF(s, a) is a state-action value function of the artificial potential field; s is the current state; a is an action; u(s) is a potential field value in a state s; gamma ray2A discount coefficient for the potential field value; s'aIs the state after executing action a under state s; u (s'a) A potential field value which is a state after the action a is executed in the state s; e, averaging; k is an intermediate variable for calculation and represents the current step number; q. q.sPF(sk,ak) Is in a state skExecution policy muθA state-action function based on the potential field;
wherein q isPF(sk,ak) Calculated as follows:
Figure BDA0003024332670000032
in the formula: U(s) is the potential field value in state s; χ is the angle between the direction of motion after action a and f(s).
Preferably, the policy network parameter update formula is as follows:
θ_{t+1} = θ_t + α_θ [ β ∇_θ μ_θ(s_t) ∇_a Q(s_t, a | w)|_{a=a_t} + (1 − β) ∇_θ μ_θ(s_t) ∇_a q_PF(s_t, a)|_{a=a_t} ]
in the formula: θ_{t+1} is the parameter of the policy network at time t+1; θ_t is the parameter of the policy network at time t; α_θ is the learning rate of the policy network; ∇_θ denotes the gradient with respect to θ; μ_θ is the policy network with weight parameter θ; s_t is the state at time t; a_t is the action executed at time t; β is the scaling coefficient of the dual evaluators; ∇_a denotes the gradient with respect to the action a; Q(s_t, a_t | w) is the value of the reward-based state-action value function with weight parameter w when action a_t is executed in the state at time t; ∇_a q_PF(s_t, a_t) is the gradient of the potential-field-based state-action value function at time t.
Preferably, the value function network updates the formula as shown in the following formula:
δ_t = r_t + γ_1 Q(s_{t+1}, μ_{θ′}(s_{t+1}) | w′) − Q(s_t, a_t | w)
w_{t+1} = w_t + α_w δ_t ∇_w Q(s_t, a_t | w)
in the formula: δ_t is the TD error at time t; r_t is the instantaneous reward at time t; γ_1 is the discount coefficient of the reward; Q(s_{t+1}, μ_{θ′}(s_{t+1}) | w′) is the value, under the reward-based state-action value function, of the state at time t+1 when the value-function and policy weight parameters are w′ and θ′ respectively; s_{t+1} is the state at time t+1; μ_{θ′}(s_{t+1}) is the action given by the policy with weight parameter θ′ for the state at time t+1; Q(s_t, a_t | w) is the value, under the reward-based state-action value function with weight parameter w, of executing action a_t in the state at time t; s_t is the state at time t; a_t is the action at time t; w_{t+1} is the network weight at time t+1; w_t is the network weight at time t; α_w is the learning rate of the value-function network; ∇_w denotes the gradient with respect to the weight w.
Preferably, the target network is updated according to the following formula:
θ′←τθ+(1-τ)θ′
w′←τw+(1-τ)w′
in the formula: θ′ is the target policy network weight; τ is the 'temperature' coefficient of the soft update of the target networks; θ is the policy network weight; w′ is the target value-function network weight; w is the value-function network weight.
Based on the same inventive concept, the invention also provides a dual-evaluator single-actuator reinforcement learning system for implementing the dual-evaluator single-actuator reinforcement learning method of any of the above technical schemes, comprising:
an initialization module for initializing the parameters of the dual-evaluator single-actuator agent and setting the scaling coefficient of each evaluator in the loss function of the policy network;
an initial-state module for initializing the noise function and obtaining an initial state from the initialized environment;
a sample-generation module for computing an action from the current state, the current policy and the noise function, executing the action, observing the reward and the next state, and storing the current state, action, reward and next state in a buffer as an experience;
a parameter-update module for updating the parameters of the dual-evaluator single-actuator agent according to the N samples collected from the buffer and the loss function;
a step-count judging module for judging whether the number of training steps has reached the maximum number of steps per episode; if so, incrementing the episode count and executing the episode-count judging module, otherwise incrementing the step count and executing the sample-generation module;
an episode-count judging module for judging whether the episode count has reached the set maximum number of episodes; if so, ending the training, otherwise resetting the step count and executing the initial-state module;
wherein the evaluators comprise a reward-based evaluator and an evaluator based on an artificial potential field.
Compared with the prior art, the invention has the following beneficial effects:
In the technical scheme provided by the invention, S1 initializes the parameters of the dual-evaluator single-actuator agent and sets the scaling coefficient of each evaluator in the loss function of the policy network; S2 initializes the noise function and obtains an initial state from the initialized environment; S3 computes an action from the current state, the current policy and the noise function, executes the action, observes the reward and the next state, and stores the current state, action, reward and next state in a buffer as an experience; S4 updates the parameters of the dual-evaluator single-actuator agent according to the N samples collected from the buffer and the loss function; S5 judges whether the number of training steps has reached the maximum number of steps per episode, and if so increments the episode count and executes S6, otherwise increments the step count and executes S3; S6 judges whether the episode count has reached the set maximum number of episodes, and if so ends the training, otherwise resets the step count and executes S2; wherein the evaluators comprise a reward-based evaluator and an evaluator based on an artificial potential field. The method addresses problems of model-free reinforcement learning such as low sample utilization and slow training convergence; by combining the results of continuous interaction with the environment in a reinforcement learning manner with a planning method, the two acting together can accelerate the convergence of the algorithm and compensate for the deviation caused by an inaccurate environment model, so as to obtain the optimal solution of the problem.
Drawings
Fig. 1 is a flowchart of the dual-evaluator single-actuator reinforcement learning method provided in this embodiment;
Fig. 2 is a schematic diagram of the structure of the dual-evaluator single-actuator agent in this embodiment;
Fig. 3 is a schematic diagram of the angle between the direction of motion after action a is performed and f(s);
Fig. 4 is a schematic diagram of the next states reached after performing different actions in the current state, as provided in this embodiment;
Fig. 5 is a schematic diagram of the 3v1 predator-prey game provided in this embodiment;
Fig. 6 is a schematic diagram of the 1v1 predator-prey game provided in this embodiment;
Fig. 7 is a graph of the 1v1 reward averaged over every 500 episodes, as provided in this embodiment;
Fig. 8 is a graph of the 3v1 reward averaged over every 500 episodes, as provided in this embodiment;
Fig. 9 is a graph of the capture success rate over the last 200 steps for 1v1, as provided in this embodiment;
Fig. 10 is a graph of the capture success rate over the last 200 steps for 3v1, as provided in this embodiment;
Fig. 11 is a graph of the 3v1 reward averaged over every 500 episodes when the predators and the prey are trained together, as provided in this embodiment;
Fig. 12 is a graph of the capture success rate over the last 200 steps for 3v1 when the predators and the prey are trained together, as provided in this embodiment.
Detailed Description
For a better understanding of the present invention, reference is made to the following description taken in conjunction with the accompanying drawings and examples.
In order to make the objects, technical solutions and advantages of the embodiments of the present invention clearer, the technical solutions in the embodiments of the present invention will be clearly and completely described below with reference to the drawings in the embodiments of the present invention, and it is obvious that the described embodiments are some, but not all, embodiments of the present invention. All other embodiments, which can be derived by a person skilled in the art from the embodiments given herein without making any creative effort, shall fall within the protection scope of the present invention.
At present, the model-free AC algorithm and the DDPG algorithm generate samples by interacting with the environment directly, and therefore suffer from low sampling efficiency and slow convergence; meanwhile, model-based planning methods depend too strongly on the environment model and generalize poorly to new environments, so both need to be improved. For this reason, the model-based planning method is continuously combined with the results of interaction with the environment in a reinforcement learning manner; under their combined action, the convergence of the algorithm can be accelerated and the deviation caused by an inaccurate environment model can be compensated, so that the optimal solution of the problem is obtained.
As shown in fig. 1, the dual-evaluator single-actuator reinforcement learning method provided in this embodiment comprises:
S1, initializing the parameters of the dual-evaluator single-actuator agent, and setting the scaling coefficient of each evaluator in the loss function of the policy network;
S2, initializing the noise function and obtaining an initial state from the initialized environment;
S3, computing an action from the current state, the current policy and the noise function, executing the action, observing the reward and the next state, and storing the current state, action, reward and next state in a buffer as an experience;
S4, updating the parameters of the dual-evaluator single-actuator agent according to the N samples collected from the buffer and the loss function;
S5, judging whether the number of training steps has reached the maximum number of steps per episode; if so, incrementing the episode count and executing S6, otherwise incrementing the step count and executing S3;
S6, judging whether the episode count has reached the set maximum number of episodes; if so, ending the training, otherwise resetting the step count and executing S2;
wherein the evaluators comprise a reward-based evaluator and an evaluator based on an artificial potential field.
The technical scheme provided by this embodiment addresses problems of model-free reinforcement learning such as low sample utilization and slow training convergence; by combining the results of continuous interaction with the environment in a reinforcement learning manner with a planning method, the two acting together can accelerate the convergence of the algorithm and compensate for the deviation caused by an inaccurate environment model, so as to obtain the optimal solution of the problem.
The existing Actor-Critic structure uses one actor and one evaluator, while Asynchronous Advantage Actor-Critic (A3C), used for parallel schemes, uses one evaluator with multiple actors. This embodiment proposes a single actor guided by two evaluators: for each agent, multiple evaluators can be used simultaneously to jointly guide a single actor, so that model-based and model-free gradient updates can be combined by designing different evaluators.
In this embodiment, the specific implementation of S1 comprises:
S101, randomly initializing the value-function network of the reward-based evaluator and the policy network of the actuator;
S102, initializing the weights of the target networks;
S103, initializing the experience replay buffer;
S104, setting the scaling coefficient of each evaluator in the loss function of the policy network.
In this embodiment, the specific implementation of S4 comprises:
S401, updating the value-function network of the reward-based evaluator according to the N samples collected from the buffer and the value-function network update formula;
S402, calculating the value of the state-action value function of the evaluator based on the artificial potential field according to the N samples collected from the buffer and the preset state-action value function of the artificial potential field;
S403, updating the policy network of the actuator according to the N samples collected from the buffer, the loss function and the policy network parameter update formula;
S404, updating the target networks according to the policy network parameter θ and the policy network μ.
Referring to fig. 2, this embodiment introduces a multi-agent reinforcement learning method under the dual-evaluator single-actuator (Actor-Critic-2) framework. First, a potential-field-guided deep deterministic policy gradient method (PGDDPG) is proposed, which, based on the Actor-Critic-2 framework, combines an evaluator Critic2 based on an artificial potential field with the conventional reward-based evaluator Critic1.
In this embodiment, the construction process of the evaluator Critic2 based on the artificial potential field includes:
firstly, a design mode of an evaluator Critic2 based on an artificial potential field is introduced based on a traditional artificial potential field calculation method.
The traditional artificial potential field adopts:
U(s) = U_att(s) + U_rep(s). (1)
wherein the attractive potential is:
U_att(s) = (1/2) ξ d²(s, s_goal) (2)
where ξ is the attraction factor and d(s, s_goal) is the distance from the position occupied in state s to the target state s_goal.
The repulsive potential is:
U_rep(s) = (1/2) η (1/d(s) − 1/d_0)² if d(s) ≤ d_0, and U_rep(s) = 0 otherwise (3)
where η is the repulsion factor, d(s) is the distance to the obstacle in state s, and d_0 is the maximum range within which the repulsive force acts.
The force experienced by the agent in state s is the negative gradient of the potential field in that state, i.e.
f(s) = −∇U(s).
An evaluator Critic2 based on the artificial potential field is designed on the basis of the traditional artificial potential field method; that is, a state-action value function Q_PF(s, a) based on the artificial potential field is designed, represented by the following formula:
Figure BDA0003024332670000084
wherein χ is the angle between the direction after the action a and f(s), and s is the current state as shown in fig. 3.
Figure BDA0003024332670000085
wherein, as shown in fig. 4, s′_a is the state reached after action a is performed in state s, and 0 < γ_2 ≤ 1 is the discount coefficient. This embodiment then constructs the Actor-Critic-2-based, artificial-potential-field-guided deep deterministic policy gradient as follows:
Critic1 is based on the environmental reward r(s_k, a_k) and can be calculated from the original state-action value function of DDPG, as shown in the following formula:
Q(s, a) = E[ Σ_{k=0}^{∞} γ_1^k r(s_k, a_k) | s_0 = s, a_0 = a ] (6)
where γ_1 is the reward discount coefficient, with 0 < γ_1 ≤ 1.
DDPG is a model-free reinforcement learning method: it learns from the reward fed back by the environment and needs no environment model, but its sample utilization is low and its convergence is slow.
Critic2 is based on the artificial potential field and is calculated from the state-action value function of the artificial potential field designed in this embodiment, shown in the following formula:
Figure BDA0003024332670000091
where γ_2 is the discount coefficient of the potential field value, with 0 < γ_2 ≤ 1.
The artificial potential field is a model-based planning method; its computational efficiency is high, but the accuracy of the environment model affects the performance of the algorithm.
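As an illustration of how such a model-based evaluator can be computed, the following Python sketch evaluates the conventional potential field of equations (1)-(3), the force f(s) = −∇U(s), and a per-step potential-field score for an action; positions and actions are NumPy arrays. The attraction/repulsion parameters, the finite-difference gradient and, in particular, the projection form used for q_PF are assumptions made for illustration, since the patent's exact expressions (4) and (5) are reproduced above only as images.

import numpy as np

def u_att(pos, goal, xi=1.0):
    # Attractive potential U_att(s) = 0.5 * xi * d(s, s_goal)^2 (standard APF form).
    return 0.5 * xi * float(np.sum((pos - goal) ** 2))

def u_rep(pos, obstacle, eta=1.0, d0=0.5):
    # Repulsive potential, active only within range d0 of the obstacle (standard APF form).
    d = float(np.linalg.norm(pos - obstacle))
    if d >= d0 or d == 0.0:
        return 0.0
    return 0.5 * eta * (1.0 / d - 1.0 / d0) ** 2

def potential(pos, goal, obstacles, xi=1.0, eta=1.0, d0=0.5):
    # U(s) = U_att(s) + U_rep(s), summing the repulsion of every obstacle.
    return u_att(pos, goal, xi) + sum(u_rep(pos, o, eta, d0) for o in obstacles)

def force(pos, goal, obstacles, eps=1e-4, **kw):
    # f(s) = -grad U(s), approximated here by central finite differences.
    f = np.zeros_like(pos, dtype=float)
    for i in range(pos.size):
        dp = np.zeros_like(pos, dtype=float)
        dp[i] = eps
        f[i] = -(potential(pos + dp, goal, obstacles, **kw)
                 - potential(pos - dp, goal, obstacles, **kw)) / (2.0 * eps)
    return f

def q_pf(pos, action, goal, obstacles, **kw):
    # Hypothetical per-step potential-field score: the projection of the unit action
    # direction onto f(s), i.e. |f(s)| * cos(chi), where chi is the angle between the
    # motion direction after the action and f(s).  This is an illustrative stand-in
    # for the patent's q_PF, whose exact form is not reproduced here.
    f = force(pos, goal, obstacles, **kw)
    a = np.asarray(action, dtype=float)
    a = a / (np.linalg.norm(a) + 1e-8)
    return float(np.dot(f, a))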
In a specific implementation, according to the Actor-Critic-2 framework, Critic1 and Critic2 are combined to jointly guide the update of the Actor policy network; this combines the advantages of model-based planning and model-free reinforcement learning, so that the algorithm converges faster and more stably. In this embodiment the Adam optimizer is used for optimization.
According to equations (6) and (7), combined with the definition of the stochastic policy gradient loss function in reinforcement learning, the loss function of the policy network of the dual-evaluator single-actuator method can be expressed as:
J(μ_θ) = ∫_S ρ^μ(s, γ_1) β r(s, μ_θ(s)) ds + ∫_S ρ^μ(s, γ_2) (1 − β) q_PF(s, μ_θ(s)) ds (8)
where β is the scaling coefficient that balances Critic1 and Critic2: the larger β is, the larger the proportion of the reward-based Critic1 and the stronger the guidance of reinforcement learning on the actor; the smaller β is, the larger the proportion of Critic2 and the stronger the guidance of the artificial potential field planning.
In one embodiment, the scaling coefficient β can be set in two ways (a simple illustration of both is sketched below):
1) β is a fixed value: the influence of the artificial potential field gradually decreases as the target is approached, and local exploration is carried out by learning;
2) β is dynamically adjusted with the training episodes, so that the total action value function approaches the true state-action value function.
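The helper below returns either a fixed β or a β that is ramped up linearly over the training episodes; the linear schedule and its endpoint values are illustrative assumptions, not values fixed by the invention.

def beta_schedule(episode, max_episodes, dynamic=True,
                  beta_fixed=0.7, beta_start=0.3, beta_end=1.0):
    # Scaling coefficient beta between the reward-based Critic1 and the
    # potential-field-based Critic2.  Mode 1) returns a constant; mode 2) ramps
    # beta up with the training episode, so that the potential field dominates
    # early exploration and the learned reward critic dominates later.
    if not dynamic:
        return beta_fixed
    frac = min(max(episode / float(max_episodes), 0.0), 1.0)
    return beta_start + frac * (beta_end - beta_start)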
The policy network gradient derived from equation (8) is:
∇_θ J(μ_θ) = E_{s∼ρ^μ} [ β ∇_θ μ_θ(s) ∇_a Q(s, a | w)|_{a=μ_θ(s)} + (1 − β) ∇_θ μ_θ(s) ∇_a q_PF(s, a)|_{a=μ_θ(s)} ] (9)
The policy network parameter of the Actor is denoted θ and the value-function network parameter of Critic1 is denoted w; Critic2 is computed directly using equations (4) and (5). The policy network and the value-function network are updated iteratively.
The calculation is as follows. The policy network parameters are updated according to:
θ_{t+1} = θ_t + α_θ [ β ∇_θ μ_θ(s_t) ∇_a Q(s_t, a | w)|_{a=a_t} + (1 − β) ∇_θ μ_θ(s_t) ∇_a q_PF(s_t, a)|_{a=a_t} ] (10)
in the formula: θ_{t+1} is the parameter of the policy network at time t+1; θ_t is the parameter of the policy network at time t; α_θ is the learning rate of the policy network; ∇_θ denotes the gradient with respect to θ; μ_θ is the policy network with weight parameter θ; s_t is the state at time t; a_t is the action executed at time t; β is the scaling coefficient of the dual evaluators; ∇_a denotes the gradient with respect to the action a; Q(s_t, a_t | w) is the value of the reward-based state-action value function with weight parameter w when action a_t is executed in the state at time t; ∇_a q_PF(s_t, a_t) is the gradient of the potential-field-based state-action value function at time t.
The value-function network is updated as follows:
δ_t = r_t + γ_1 Q(s_{t+1}, μ_{θ′}(s_{t+1}) | w′) − Q(s_t, a_t | w) (11)
w_{t+1} = w_t + α_w δ_t ∇_w Q(s_t, a_t | w) (12)
in the formula: δ_t is the TD error at time t; r_t is the instantaneous reward at time t; γ_1 is the discount coefficient of the reward; Q(s_{t+1}, μ_{θ′}(s_{t+1}) | w′) is the value, under the reward-based state-action value function, of the state at time t+1 when the value-function and policy weight parameters are w′ and θ′ respectively; s_{t+1} is the state at time t+1; μ_{θ′}(s_{t+1}) is the action given by the policy with weight parameter θ′ for the state at time t+1; Q(s_t, a_t | w) is the value, under the reward-based state-action value function with weight parameter w, of executing action a_t in the state at time t; s_t is the state at time t; a_t is the action at time t; w_{t+1} is the network weight at time t+1; w_t is the network weight at time t; α_w is the learning rate of the value-function network; ∇_w denotes the gradient with respect to the weight w.
To make the algorithm converge faster and more stably, this embodiment adopts the same delayed double-network update as DDPG: the target value-function network and the target policy network are denoted Q′ and μ′ respectively, and the target network parameters are updated by soft updates:
θ′←τθ+(1-τ)θ′ (13)
w′←τw+(1-τ)w′ (14)
in the formula: θ′ is the target policy network weight; τ is the 'temperature' coefficient of the soft update of the target networks; θ is the policy network weight; w′ is the target value-function network weight; w is the value-function network weight.
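The following PyTorch sketch gathers the update rules (10)-(14) into a single function. It assumes actor/critic modules with the interfaces actor(s) → μ(s|θ) and critic1(s, a) → Q(s, a|w), a q_pf(s, a) implemented with torch operations so that its gradient with respect to the action exists, and minibatch tensors whose reward column has shape [N, 1]; these interfaces and hyperparameter values are assumptions made for illustration, not requirements of the invention.

import torch
import torch.nn.functional as F

def update(batch, actor, critic1, actor_target, critic1_target,
           actor_opt, critic1_opt, q_pf, beta, gamma1=0.99, tau=0.005):
    # One parameter update of the dual-evaluator single-actuator agent.
    s, a, r, s_next = batch  # tensors sampled from the replay buffer

    # Critic1 (reward-based evaluator): TD update, cf. equations (11)-(12).
    with torch.no_grad():
        target_q = r + gamma1 * critic1_target(s_next, actor_target(s_next))
    critic_loss = F.mse_loss(critic1(s, a), target_q)
    critic1_opt.zero_grad()
    critic_loss.backward()
    critic1_opt.step()

    # Actor: combined guidance of the two evaluators weighted by beta,
    # cf. equations (8)-(10); Critic2 enters through q_pf(s, a).
    a_pred = actor(s)
    actor_loss = -(beta * critic1(s, a_pred)
                   + (1.0 - beta) * q_pf(s, a_pred)).mean()
    actor_opt.zero_grad()
    actor_loss.backward()
    actor_opt.step()

    # Soft update of the target networks, cf. equations (13)-(14).
    for net, target in ((actor, actor_target), (critic1, critic1_target)):
        for p, p_t in zip(net.parameters(), target.parameters()):
            p_t.data.mul_(1.0 - tau).add_(tau * p.data)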
Based on the embodiments provided above, this embodiment describes the dual-evaluator single-actuator reinforcement learning method in detail; it comprises:
Step 1, randomly initializing the value-function network Q(s, a | w) of Critic1 and the policy network μ(s | θ) of the Actor with weights w and θ, and setting the scaling coefficient β of the dual evaluators;
Step 2, initializing the weights w′ and θ′ of the target networks Q′ and μ′ as w′ = w and θ′ = θ;
Step 3, initializing the experience replay buffer R;
Step 4, initializing the noise function N(·) used for action exploration;
Step 5, initializing the environment and obtaining the initial state s_1;
Step 6, computing the action from the current state s_t, the current policy μ_t and the exploration noise N_t:
a_t = μ(s | θ) + N_t;
Step 7, executing the action a_t, and observing the reward r_t and the next state s_{t+1};
Step 8, storing the experience (s_t, a_t, r_t, s_{t+1}) in the buffer R;
Step 9, randomly sampling N samples (s_i, a_i, r_i, s_{i+1}) from the buffer R;
Step 10, updating the value-function network Q(s, a | w) of Critic1 with the N sampled transitions through equations (11) and (12);
Step 11, calculating the value of the artificial potential field value function of Critic2 with the N sampled transitions through equations (4) and (5);
Step 12, updating the policy network μ(s | θ) of the Actor with the N sampled transitions through equation (10);
Step 13, updating the target networks through equations (13) and (14);
Step 14, when the episode ends or the maximum number of steps per episode is reached, executing step 15; otherwise returning to step 6;
Step 15, ending the training when the maximum number of episodes is reached; otherwise executing step 4.
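The following Python sketch strings steps 4-15 together as a training loop. It assumes a gym-style environment (reset() returning a state and step(action) returning next_state, reward, done, info) and the action-selection, update and noise helpers sketched earlier, suitably bound to the networks and noise process; the hyperparameter values are placeholders, not values prescribed by the invention.

import random
from collections import deque

def train(env, select_action, update, reset_noise,
          max_episodes=3000, max_steps=200, batch_size=64, buffer_size=100000):
    # Training loop of the dual-evaluator single-actuator method.
    buffer = deque(maxlen=buffer_size)            # experience replay buffer R (step 3)

    for episode in range(max_episodes):           # outer loop over episodes (step 15)
        reset_noise()                             # re-initialise exploration noise N(.) (step 4)
        state = env.reset()                       # initial state s_1 (step 5)

        for step in range(max_steps):             # inner loop over the steps of one episode (step 14)
            action = select_action(state)         # a_t = mu(s_t | theta) + N_t (step 6)
            next_state, reward, done, _ = env.step(action)       # observe r_t, s_{t+1} (step 7)
            buffer.append((state, action, reward, next_state))   # store the experience (step 8)

            if len(buffer) >= batch_size:
                batch = random.sample(buffer, batch_size)        # sample N transitions (step 9)
                update(batch)                     # Critic1 TD update, Critic2 potential-field
                                                  # values, actor update, soft target update
                                                  # (steps 10-13)
            state = next_state
            if done:                              # the episode may also end early, e.g. on capture
                break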
The dual-evaluator single-actuator reinforcement learning method provided by this embodiment addresses the low sample utilization and slow training convergence of model-free reinforcement learning. By using dual evaluators, not only can model-based planning (such as an artificial potential field) be combined with model-free reinforcement learning, but several reinforcement learning methods can also be combined directly. Combining model-based planning with model-free reinforcement learning accelerates the convergence of the algorithm while improving its generalization to new environments, which plays an important role in promoting the practical application of reinforcement learning.
It should be noted that, although the foregoing embodiments describe each step in a specific sequence, those skilled in the art will understand that, in order to achieve the effect of the present invention, different steps do not necessarily need to be executed in such a sequence, and they may be executed simultaneously (in parallel) or in other sequences, and these changes are all within the protection scope of the present invention.
Based on the above scheme, the invention provides an application scenario for an embodiment of the technical scheme. The effectiveness of the technical scheme is verified with experiments in multiagent-particle-environments (MPE), using the predator-prey model in MPE: the coordinates of the predators are limited to [-1, 1], the coordinates of the prey are limited to [-0.8, 0.8], and the predators and the prey have the same speed. Two situations are considered, as shown in fig. 5 and fig. 6: the 3v1 predator-prey game shown in fig. 5, in which 3 predators chase one prey, and the 1v1 predator-prey game shown in fig. 6, in which a single predator chases one prey; the triangles represent predators and the circles represent prey.
Consider first the 1v1 predator-prey game, where the environmental reward is sparse (+10 on a successful capture) and depends only on the terminal state of each episode. Then consider N predators chasing one prey in a randomly generated environment: each predator receives a reward of 10 only if all predators capture the prey at the same time, and no predator is rewarded as long as the prey is not caught. This leads to a difficult learning problem that requires good tacit cooperation.
The goal of the experiment is to learn to capture the prey independently, without knowledge of the opponent's strategy and actions. For this continuous-action control problem, the deterministic policy gradient algorithm DDPG is used as the base (containing the evaluator Critic1), a custom artificial potential field evaluator (Critic2) is added, and the Actor is updated jointly using the combined gradients of Critic1 and Critic2. This method is referred to as PGDDPG in this embodiment.
In this embodiment, two ways are adopted to verify the validity of the technical scheme, which specifically include:
first, using a pre-trained DDPG model as a predator strategy
For the 1v1 and 3v1 predator-prey games, the capture success rates and the predator reward curves of PGDDPG and DDPG are plotted, as shown in fig. 7, 8, 9 and 10 respectively.
To show the learning process more smoothly, the mean of the reward values over every 500 episodes is calculated, as shown in fig. 7 and 8.
It is apparent from fig. 7, 8, 9 and 10 that PGDDPG is superior to DDPG in convergence speed.
Second, the predators and the prey are trained together
The simultaneous training increases the difficulty of learning because it becomes a zero-sum game, with the environment being dynamically enhanced.
As shown in fig. 11 and 12, for PGDDPG the success rate drops from 0.2 to 0 at approximately episode 1000 (the prey's ability to escape exceeds the predators' ability to capture); however, the capture ability quickly overtakes the escape ability and takes the leading position.
It can also be observed from fig. 11 and 12 that DDPG fails in the 3v1 predator-prey zero-sum game.
In fig. 7-12, the abscissa (episode) represents the number of training episodes; the ordinate (reward) of fig. 7, 8 and 11 represents the reward, and the ordinate (rate of success) of fig. 9, 10 and 12 represents the capture success rate.
It will be understood by those skilled in the art that all or part of the flow of the method according to the above-described embodiment may be implemented by a computer program, which may be stored in a computer-readable storage medium and used to implement the steps of the above-described embodiments of the method when the computer program is executed by a processor. Wherein the computer program comprises computer program code, which may be in the form of source code, object code, an executable file or some intermediate form, etc. The computer-readable medium may include: any entity or device capable of carrying said computer program code, media, usb disk, removable hard disk, magnetic diskette, optical disk, computer memory, read-only memory, random access memory, electrical carrier wave signals, telecommunication signals, software distribution media, etc. It should be noted that the computer readable medium may contain content that is subject to appropriate increase or decrease as required by legislation and patent practice in jurisdictions, for example, in some jurisdictions, computer readable media does not include electrical carrier signals and telecommunications signals as is required by legislation and patent practice.
Furthermore, the invention also provides a storage device. In one embodiment of the storage device according to the present invention, the storage device may be configured to store a program for executing the double-evaluator single-actuator reinforcement learning method of the above-described method embodiment, and the program may be loaded and executed by a processor to implement the double-evaluator single-actuator reinforcement learning method described above. For convenience of explanation, only the parts related to the embodiments of the present invention are shown, and details of the specific techniques are not disclosed. The storage device may be a storage device apparatus formed by including various electronic devices, and optionally, a non-transitory computer-readable storage medium is stored in the embodiment of the present invention.
Furthermore, the invention also provides a control device. In an embodiment of the control device according to the present invention, the control device comprises a processor and a storage device, the storage device may be configured to store a program for executing the double-evaluator single-actuator reinforcement learning method of the above-mentioned method embodiment, and the processor may be configured to execute a program in the storage device, the program including but not limited to a program for executing the double-evaluator single-actuator reinforcement learning method of the above-mentioned method embodiment. For convenience of explanation, only the parts related to the embodiments of the present invention are shown, and details of the specific techniques are not disclosed. The control device may be a control device apparatus formed including various electronic apparatuses.
As will be appreciated by one skilled in the art, embodiments of the present application may be provided as a method, system, or computer program product. Accordingly, the present application may take the form of an entirely hardware embodiment, an entirely software embodiment or an embodiment combining software and hardware aspects. Furthermore, the present application may take the form of a computer program product embodied on one or more computer-usable storage media (including, but not limited to, disk storage, CD-ROM, optical storage, and the like) having computer-usable program code embodied therein.
The present application is described with reference to flowchart illustrations and/or block diagrams of methods, apparatus (systems), and computer program products according to embodiments of the application. It will be understood that each flow and/or block of the flow diagrams and/or block diagrams, and combinations of flows and/or blocks in the flow diagrams and/or block diagrams, can be implemented by computer program instructions. These computer program instructions may be provided to a processor of a general purpose computer, special purpose computer, embedded processor, or other programmable data processing apparatus to produce a machine, such that the instructions, which execute via the processor of the computer or other programmable data processing apparatus, create means for implementing the functions specified in the flowchart flow or flows and/or block diagram block or blocks.
These computer program instructions may also be stored in a computer-readable memory that can direct a computer or other programmable data processing apparatus to function in a particular manner, such that the instructions stored in the computer-readable memory produce an article of manufacture including instruction means which implement the function specified in the flowchart flow or flows and/or block diagram block or blocks.
These computer program instructions may also be loaded onto a computer or other programmable data processing apparatus to cause a series of operational steps to be performed on the computer or other programmable apparatus to produce a computer implemented process such that the instructions which execute on the computer or other programmable apparatus provide steps for implementing the functions specified in the flowchart flow or flows and/or block diagram block or blocks.
Finally, it should be noted that: the above embodiments are only for illustrating the technical solutions of the present invention and not for limiting the same, and although the present invention is described in detail with reference to the above embodiments, those of ordinary skill in the art should understand that: modifications and equivalents may be made to the embodiments of the invention without departing from the spirit and scope of the invention, which is to be covered by the claims.

Claims (10)

1. A dual-evaluator single-actuator reinforcement learning method, characterized by comprising the following steps:
S1, initializing the parameters of the dual-evaluator single-actuator agent, and setting the scaling coefficient of each evaluator in the loss function of the policy network;
S2, initializing the noise function and obtaining an initial state from the initialized environment;
S3, computing an action from the current state, the current policy and the noise function, executing the action, observing the reward and the next state, and storing the current state, action, reward and next state in a buffer as an experience;
S4, updating the parameters of the dual-evaluator single-actuator agent according to the N samples collected from the buffer and the loss function;
S5, judging whether the number of training steps has reached the maximum number of steps per episode; if so, incrementing the episode count and executing S6, otherwise incrementing the step count and executing S3;
S6, judging whether the episode count has reached the set maximum number of episodes; if so, ending the training, otherwise resetting the step count and executing S2;
wherein the evaluators comprise a reward-based evaluator and an evaluator based on an artificial potential field.
2. The reinforcement learning method of claim 1, wherein in S1, initializing the parameters of the dual-evaluator single-actuator agent and setting the scaling coefficient of each evaluator in the loss function of the policy network comprises:
S101, randomly initializing the value-function network of the reward-based evaluator and the policy network of the actuator;
S102, initializing the weights of the target networks;
S103, initializing the experience replay buffer;
S104, setting the scaling coefficient of each evaluator in the loss function of the policy network.
3. The reinforcement learning method of claim 1, wherein the loss function of the policy network is represented by the following equation:
J(μ_θ) = ∫_S ρ^μ(s, γ_1) β r(s, μ_θ(s)) ds + ∫_S ρ^μ(s, γ_2) (1 − β) q_PF(s, μ_θ(s)) ds
in the formula: J(μ_θ) is the loss function of the policy network; θ is the policy network parameter of the actuator; S is the state space; ρ^μ(s, γ_1) is the state distribution under discount γ_1; s is the current state; γ_1 is the discount coefficient of the reward; μ_θ(s) is the policy function; r(s, μ_θ(s)) is the reward obtained by taking policy μ_θ in state s; β is the scaling coefficient of the dual evaluators; ρ^μ(s, γ_2) is the state distribution under discount γ_2; γ_2 is the discount coefficient of the potential field value; q_PF(s, μ_θ(s)) is the potential-field-based state-action function when executing policy μ_θ in state s.
4. The reinforcement learning method of claim 1, wherein the action is calculated from the current state, the current policy and the noise function according to the following formula:
a_t = μ(s|θ) + N_t
in the formula: a_t is the action at time t; μ(s|θ) is the output of the current policy for the current state s; s is the current state; θ is the policy network parameter of the actuator; N_t is the noise at time t obtained from the noise function.
5. The reinforcement learning method of claim 1, wherein in S4, updating the parameters of the dual-evaluator single-actuator agent according to the N samples collected from the buffer and the loss function comprises:
S401, updating the value-function network of the reward-based evaluator according to the N samples collected from the buffer and the value-function network update formula;
S402, calculating the value of the state-action value function of the evaluator based on the artificial potential field according to the N samples collected from the buffer and the preset state-action value function of the artificial potential field;
S403, updating the policy network of the actuator according to the N samples collected from the buffer, the loss function and the policy network parameter update formula;
S404, updating the target networks according to the policy network parameter θ and the policy network μ.
6. The reinforcement learning method of claim 5, wherein the state-action value function of the artificial potential field is represented by the following formula:
Figure FDA0003024332660000021
in the formula: qPF(s, a) is a state-action value function of the artificial potential field; s is the current state; a is an action; u(s) is a potential field value in a state s; gamma ray2A discount coefficient for the potential field value; s'aIs the state after executing action a under state s; u (s'a) A potential field value which is a state after the action a is executed in the state s; e, averaging; k is the current step number; q. q.sPF(sk,ak) Is in a state skExecution policy muθA state-action function based on the potential field;
wherein q isPF(sk,ak) Calculated as follows:
Figure FDA0003024332660000022
in the formula: U(s) is the potential field value in state s; χ is the angle between the direction of motion after action a and f(s).
7. The reinforcement learning method of claim 5, wherein the policy network parameter update formula is as shown in the following equation:
θ_{t+1} = θ_t + α_θ [ β ∇_θ μ_θ(s_t) ∇_a Q(s_t, a | w)|_{a=a_t} + (1 − β) ∇_θ μ_θ(s_t) ∇_a q_PF(s_t, a)|_{a=a_t} ]
in the formula: θ_{t+1} is the parameter of the policy network at time t+1; θ_t is the parameter of the policy network at time t; α_θ is the learning rate of the policy network; ∇_θ denotes the gradient with respect to θ; μ_θ is the policy network with weight parameter θ; s_t is the state at time t; a_t is the action executed at time t; β is the scaling coefficient of the dual evaluators; ∇_a denotes the gradient with respect to the action a; Q(s_t, a_t | w) is the value of the reward-based state-action value function with weight parameter w when action a_t is executed in the state at time t; ∇_a q_PF(s_t, a_t) is the gradient of the potential-field-based state-action value function at time t.
8. The reinforcement learning method of claim 4, wherein the value-function network update formula is as shown in the following equations:
δ_t = r_t + γ_1 Q(s_{t+1}, μ_{θ′}(s_{t+1}) | w′) − Q(s_t, a_t | w)
w_{t+1} = w_t + α_w δ_t ∇_w Q(s_t, a_t | w)
in the formula: δ_t is the TD error at time t; r_t is the instantaneous reward at time t; γ_1 is the discount coefficient of the reward; Q(s_{t+1}, μ_{θ′}(s_{t+1}) | w′) is the value, under the reward-based state-action value function, of the state at time t+1 when the value-function and policy weight parameters are w′ and θ′ respectively; s_{t+1} is the state at time t+1; μ_{θ′}(s_{t+1}) is the action given by the policy with weight parameter θ′ for the state at time t+1; Q(s_t, a_t | w) is the value, under the reward-based state-action value function with weight parameter w, of executing action a_t in the state at time t; s_t is the state at time t; a_t is the action at time t; w_{t+1} is the network weight at time t+1; w_t is the network weight at time t; α_w is the learning rate of the value-function network; ∇_w denotes the gradient with respect to the weight w.
9. The reinforcement learning method of claim 4, wherein the target network is updated as follows:
θ′←τθ+(1-τ)θ′
w′←τw+(1-τ)w′
in the formula: θ′ is the target policy network weight; τ is the 'temperature' coefficient of the soft update of the target networks; θ is the policy network weight; w′ is the target value-function network weight; w is the value-function network weight.
10. A dual-evaluator single-actuator reinforcement learning system for implementing the dual-evaluator single-actuator reinforcement learning method according to any one of claims 1 to 9, comprising:
an initialization module for initializing the parameters of the dual-evaluator single-actuator agent and setting the scaling coefficient of each evaluator in the loss function of the policy network;
an initial-state module for initializing the noise function and obtaining an initial state from the initialized environment;
a sample-generation module for computing an action from the current state, the current policy and the noise function, executing the action, observing the reward and the next state, and storing the current state, action, reward and next state in a buffer as an experience;
a parameter-update module for updating the parameters of the dual-evaluator single-actuator agent according to the N samples collected from the buffer and the loss function;
a step-count judging module for judging whether the number of training steps has reached the maximum number of steps per episode; if so, incrementing the episode count and executing the episode-count judging module, otherwise incrementing the step count and executing the sample-generation module;
an episode-count judging module for judging whether the episode count has reached the set maximum number of episodes; if so, ending the training, otherwise resetting the step count and executing the initial-state module;
wherein the evaluators comprise a reward-based evaluator and an evaluator based on an artificial potential field.
CN202110415953.4A 2021-04-16 2021-04-16 Reinforced learning method and system for double evaluators and single actuator Pending CN113268854A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202110415953.4A CN113268854A (en) 2021-04-16 2021-04-16 Reinforced learning method and system for double evaluators and single actuator

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202110415953.4A CN113268854A (en) 2021-04-16 2021-04-16 Reinforced learning method and system for double evaluators and single actuator

Publications (1)

Publication Number Publication Date
CN113268854A true CN113268854A (en) 2021-08-17

Family

ID=77228844

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202110415953.4A Pending CN113268854A (en) 2021-04-16 2021-04-16 Reinforced learning method and system for double evaluators and single actuator

Country Status (1)

Country Link
CN (1) CN113268854A (en)

Cited By (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN114139472A (en) * 2021-11-04 2022-03-04 江阴市智行工控科技有限公司 Integrated circuit direct current analysis method and system based on reinforcement learning dual-model structure
CN115392144A (en) * 2022-10-31 2022-11-25 深圳飞骧科技股份有限公司 Method for automatic design of acoustic surface filter, related system and storage medium
CN115392144B (en) * 2022-10-31 2023-02-03 深圳飞骧科技股份有限公司 Method for automatic design of acoustic surface filter, related system and storage medium
CN115493597A (en) * 2022-11-15 2022-12-20 山东大学 AUV path planning control method based on SAC algorithm

Similar Documents

Publication Publication Date Title
CN113268854A (en) Reinforced learning method and system for double evaluators and single actuator
Lee et al. Sample-efficient deep reinforcement learning via episodic backward update
JP6824382B2 (en) Training machine learning models for multiple machine learning tasks
CN107209872B (en) Systems, methods, and storage media for training a reinforcement learning system
CN111026272B (en) Training method and device for virtual object behavior strategy, electronic equipment and storage medium
CN109284812B (en) Video game simulation method based on improved DQN
CN111008449A (en) Acceleration method for deep reinforcement learning deduction decision training in battlefield simulation environment
CN105637540A (en) Methods and apparatus for reinforcement learning
Cui et al. Using social emotional optimization algorithm to direct orbits of chaotic systems
CN113449458A (en) Multi-agent depth certainty strategy gradient method based on course learning
CN112052947B (en) Hierarchical reinforcement learning method and device based on strategy options
CN108465244A (en) AI method for parameter configuration, device, equipment and storage medium for racing class AI models
CN113962390B (en) Method for constructing diversified search strategy model based on deep reinforcement learning network
WO2020259504A1 (en) Efficient exploration method for reinforcement learning
CN112488826A (en) Method and device for optimizing bank risk pricing based on deep reinforcement learning
CN113487039A (en) Intelligent body self-adaptive decision generation method and system based on deep reinforcement learning
CN112104563B (en) Congestion control method and device
Mousavi et al. Applying q (λ)-learning in deep reinforcement learning to play atari games
CN114290339B (en) Robot realistic migration method based on reinforcement learning and residual modeling
CN107798384B (en) Iris florida classification method and device based on evolvable pulse neural network
CN113419424A (en) Modeling reinforcement learning robot control method and system capable of reducing over-estimation
CN113919475B (en) Robot skill learning method and device, electronic equipment and storage medium
CN115542912A (en) Mobile robot path planning method based on improved Q-learning algorithm
Nichols et al. Application of Newton's Method to action selection in continuous state-and action-space reinforcement learning
Lee et al. Convergent reinforcement learning control with neural networks and continuous action search

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination