CN113268854A - Reinforced learning method and system for double evaluators and single actuator - Google Patents

Reinforced learning method and system for double evaluators and single actuator

Info

Publication number
CN113268854A
CN113268854A (application CN202110415953.4A)
Authority
CN
China
Prior art keywords
state
action
network
function
reward
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202110415953.4A
Other languages
Chinese (zh)
Inventor
任维雅
周仕扬
任小广
王彦臻
易晓东
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
National Defense Technology Innovation Institute PLA Academy of Military Science
Original Assignee
National Defense Technology Innovation Institute PLA Academy of Military Science
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by National Defense Technology Innovation Institute PLA Academy of Military Science filed Critical National Defense Technology Innovation Institute PLA Academy of Military Science
Priority to CN202110415953.4A priority Critical patent/CN113268854A/en
Publication of CN113268854A publication Critical patent/CN113268854A/en
Pending legal-status Critical Current


Classifications

    • G — PHYSICS
    • G06 — COMPUTING; CALCULATING OR COUNTING
    • G06F — ELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00 — Pattern recognition
    • G06F18/20 — Analysing
    • G06F18/21 — Design or setup of recognition systems or techniques; Extraction of features in feature space; Blind source separation
    • G06F18/214 — Generating training patterns; Bootstrap methods, e.g. bagging or boosting
    • G — PHYSICS
    • G06 — COMPUTING; CALCULATING OR COUNTING
    • G06F — ELECTRIC DIGITAL DATA PROCESSING
    • G06F30/00 — Computer-aided design [CAD]
    • G06F30/20 — Design optimisation, verification or simulation

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Evolutionary Computation (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Data Mining & Analysis (AREA)
  • Bioinformatics & Cheminformatics (AREA)
  • Bioinformatics & Computational Biology (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Evolutionary Biology (AREA)
  • Artificial Intelligence (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Computer Hardware Design (AREA)
  • Geometry (AREA)
  • Management, Administration, Business Operations System, And Electronic Commerce (AREA)

Abstract

The invention discloses a dual-evaluator single-actuator reinforcement learning method and system, comprising the following steps: S1, initializing the parameters of the dual-evaluator single-actuator agent, and setting the scaling coefficient of each evaluator in the loss function of the policy network; S2, initializing the noise function and obtaining an initial state from the initialized environment; S3, computing an action from the current state, the current policy and the noise function, executing the action, observing the reward and the next state, and storing the current state, action, reward and next state in a buffer as an experience; S4, updating the parameters of the dual-evaluator single-actuator agent according to N samples collected from the buffer and the loss function; the above steps are repeated according to the set iteration conditions to train the dual-evaluator single-actuator agent; wherein the evaluators comprise a reward-based evaluator and an evaluator based on an artificial potential field. The invention addresses problems of model-free reinforcement learning such as low sample utilization and slow training convergence.

Description

Reinforced learning method and system for double evaluators and single actuator
Technical Field
The invention relates to the field of intelligent agent path planning, and in particular to a dual-evaluator single-actuator reinforcement learning method and system.
Background
At present, most model-free reinforcement learning algorithms adopt generalized policy iteration, which alternates between policy evaluation and policy improvement: policy evaluation estimates the action-value function, and policy improvement updates the policy according to that action-value function. Based on generalized policy iteration and the policy gradient theorem, Actor-Critic (AC) has become a widely used architecture.
The deterministic policy gradient algorithm (DPG) further considers deterministic policies for continuous actions on the basis of the AC framework, which reduces the variance of policy evaluation compared with stochastic policies. The deep deterministic policy gradient (DDPG) further combines deep neural networks with DPG, improving the modeling capacity. However, both the model-free AC algorithm and the DDPG algorithm generate samples by interacting directly with the environment, and therefore suffer from low sampling efficiency, slow convergence and other problems.
Model-based planning methods accelerate learning, or obtain a better estimate of the state-action value, by carrying out simulated rollouts with a learned model. Although their computational efficiency is higher and their convergence faster, the accuracy of the planning is closely tied to the accuracy of the dynamic model of the environment. In real situations the environment is strongly affected by various random factors such as air temperature, frictional resistance, communication delay and material characteristics, and the dynamic model of the environment required for planning is often unavailable in practice. In addition, planning methods depend too strongly on the environment model and generalize poorly to new environments: once the environment changes, planning has to be redone. Yet the environment often changes over time, and it is impractical to obtain an accurate model that fully simulates a real environment.
Therefore, how to combine reinforcement learning with planning methods so that each compensates for the other's shortcomings is a problem that urgently needs to be solved.
Disclosure of Invention
In order to overcome the above-mentioned deficiencies in the prior art, the present invention provides a dual-evaluator single-actuator reinforcement learning method, comprising:
S1, initializing the parameters of the dual-evaluator single-actuator agent, and setting the scaling coefficient of each evaluator in the loss function of the policy network;
S2, initializing the noise function and obtaining an initial state from the initialized environment;
S3, computing an action from the current state, the current policy and the noise function, executing the action, observing the reward and the next state, and storing the current state, action, reward and next state in a buffer as an experience;
S4, updating the parameters of the dual-evaluator single-actuator agent according to the N samples collected from the buffer and the loss function;
S5, judging whether the number of training steps has reached the maximum number of steps per episode; if so, incrementing the episode count and executing S6, otherwise incrementing the step count and executing S3;
S6, judging whether the episode count has reached the set maximum number of episodes; if so, ending the training, otherwise resetting the step count and executing S2;
wherein the evaluators comprise a reward-based evaluator and an evaluator based on an artificial potential field.
Preferably, in step S1, initializing the parameters of the dual-evaluator single-actuator agent and setting the scaling coefficient of each evaluator in the loss function of the policy network comprises:
S101, randomly initializing the value-function network of the reward-based evaluator and the policy network of the actuator;
S102, initializing the weights of the target networks;
S103, initializing the experience replay buffer;
S104, setting the scaling coefficient of each evaluator in the loss function of the policy network.
Preferably, the loss function of the policy network is as follows:
J(μ_θ) = ∫_S ρ^μ(s, γ_1) β r(s, μ_θ(s)) ds + ∫_S ρ^μ(s, γ_2) (1 − β) q_PF(s, μ_θ(s)) ds
in the formula: J(μ_θ) is the loss function of the policy network; θ is the policy network parameter of the actuator; S is the state space; ρ^μ(s, γ_1) is the state distribution under discount γ_1; s is the current state; γ_1 is the discount coefficient of the reward; μ_θ(s) is the policy function; r(s, μ_θ(s)) is the reward obtained by taking policy μ_θ in state s; β is the scaling coefficient of the dual evaluators; ρ^μ(s, γ_2) is the state distribution under discount γ_2; γ_2 is the discount coefficient of the potential field value; q_PF(s, μ_θ(s)) is the potential-field-based state-action function when executing policy μ_θ in state s.
Preferably, the action is calculated from the current state, the current policy and the noise function as follows:
a_t = μ(s|θ) + N_t
in the formula: a_t is the action at time t; μ(s|θ) is the output of the current policy for the current state s; s is the current state; θ is the policy network parameter of the actuator; N_t is the noise at time t obtained from the noise function.
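For illustration only, the following Python sketch implements this action-selection rule, assuming a NumPy policy callable in place of μ(s|θ) and an Ornstein-Uhlenbeck process in place of the noise function N(·); the noise parameters and the [-1, 1] action bounds are assumptions, not values prescribed by the invention.

import numpy as np

class OUNoise:
    # Ornstein-Uhlenbeck exploration noise N_t (a common choice for DDPG-style methods).
    def __init__(self, dim, mu=0.0, theta=0.15, sigma=0.2):
        self.mu, self.theta, self.sigma = mu, theta, sigma
        self.state = np.full(dim, mu)

    def reset(self):
        self.state[:] = self.mu

    def sample(self):
        # dx = theta * (mu - x) + sigma * Gaussian noise
        self.state += self.theta * (self.mu - self.state) \
                      + self.sigma * np.random.randn(*self.state.shape)
        return self.state

def select_action(policy, state, noise):
    # a_t = mu(s | theta) + N_t, clipped to the assumed action bounds [-1, 1].
    return np.clip(policy(state) + noise.sample(), -1.0, 1.0)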
Preferably, in S4, updating the parameters of the dual-evaluator single-actuator agent according to the N samples collected from the buffer and the loss function comprises:
S401, updating the value-function network of the reward-based evaluator according to the N samples collected from the buffer and the value-function network update formula;
S402, calculating the value of the state-action value function of the evaluator based on the artificial potential field according to the N samples collected from the buffer and the preset state-action value function of the artificial potential field;
S403, updating the policy network of the actuator according to the N samples collected from the buffer, the loss function and the policy network parameter update formula;
S404, updating the target networks according to the policy network parameter θ and the policy network μ.
Preferably, the state-action value function of the artificial potential field is as follows:
Figure BDA0003024332670000031
in the formula: qPF(s, a) is a state-action value function of the artificial potential field; s is the current state; a is an action; u(s) is a potential field value in a state s; gamma ray2A discount coefficient for the potential field value; s'aIs the state after executing action a under state s; u (s'a) A potential field value which is a state after the action a is executed in the state s; e, averaging; k is an intermediate variable for calculation and represents the current step number; q. q.sPF(sk,ak) Is in a state skExecution policy muθA state-action function based on the potential field;
wherein q isPF(sk,ak) Calculated as follows:
Figure BDA0003024332670000032
in the formula: U(s) is the potential field value in state s; χ is the angle between the direction of motion after action a and f(s).
Preferably, the policy network parameter update formula is as follows:
θ_{t+1} = θ_t + α_θ [ β ∇_θ μ_θ(s_t) ∇_a Q(s_t, a | w)|_{a=a_t} + (1 − β) ∇_θ μ_θ(s_t) ∇_a q_PF(s_t, a)|_{a=a_t} ]
in the formula: θ_{t+1} is the parameter of the policy network at time t+1; θ_t is the parameter of the policy network at time t; α_θ is the learning rate of the policy network; ∇_θ denotes the gradient with respect to θ; μ_θ is the policy network with weight parameter θ; s_t is the state at time t; a_t is the action executed at time t; β is the scaling coefficient of the dual evaluators; ∇_a denotes the gradient with respect to the action a; Q(s_t, a_t | w) is the value of the reward-based state-action value function with weight parameter w when action a_t is executed in the state at time t; ∇_a q_PF(s_t, a_t) is the gradient of the potential-field-based state-action value function at time t.
Preferably, the value function network updates the formula as shown in the following formula:
δ_t = r_t + γ_1 Q(s_{t+1}, μ_{θ′}(s_{t+1}) | w′) − Q(s_t, a_t | w)
w_{t+1} = w_t + α_w δ_t ∇_w Q(s_t, a_t | w)
in the formula: δ_t is the TD error at time t; r_t is the instantaneous reward at time t; γ_1 is the discount coefficient of the reward; Q(s_{t+1}, μ_{θ′}(s_{t+1}) | w′) is the value, under the reward-based state-action value function, of the state at time t+1 when the value-function and policy weight parameters are w′ and θ′ respectively; s_{t+1} is the state at time t+1; μ_{θ′}(s_{t+1}) is the action given by the policy with weight parameter θ′ for the state at time t+1; Q(s_t, a_t | w) is the value, under the reward-based state-action value function with weight parameter w, of executing action a_t in the state at time t; s_t is the state at time t; a_t is the action at time t; w_{t+1} is the network weight at time t+1; w_t is the network weight at time t; α_w is the learning rate of the value-function network; ∇_w denotes the gradient with respect to the weight w.
Preferably, the target network is updated according to the following formula:
θ′←τθ+(1-τ)θ′
w′←τw+(1-τ)w′
in the formula: θ′ is the target policy network weight; τ is the 'temperature' coefficient of the soft update of the target networks; θ is the policy network weight; w′ is the target value-function network weight; w is the value-function network weight.
Based on the same inventive concept, the invention also provides a dual-evaluator single-actuator reinforcement learning system for implementing the dual-evaluator single-actuator reinforcement learning method of any of the above technical schemes, comprising:
an initialization module for initializing the parameters of the dual-evaluator single-actuator agent and setting the scaling coefficient of each evaluator in the loss function of the policy network;
an initial-state module for initializing the noise function and obtaining an initial state from the initialized environment;
a sample-generation module for computing an action from the current state, the current policy and the noise function, executing the action, observing the reward and the next state, and storing the current state, action, reward and next state in a buffer as an experience;
a parameter-update module for updating the parameters of the dual-evaluator single-actuator agent according to the N samples collected from the buffer and the loss function;
a step-count judging module for judging whether the number of training steps has reached the maximum number of steps per episode; if so, incrementing the episode count and executing the episode-count judging module, otherwise incrementing the step count and executing the sample-generation module;
an episode-count judging module for judging whether the episode count has reached the set maximum number of episodes; if so, ending the training, otherwise resetting the step count and executing the initial-state module;
wherein the evaluators comprise a reward-based evaluator and an evaluator based on an artificial potential field.
Compared with the prior art, the invention has the following beneficial effects:
In the technical scheme provided by the invention, S1 initializes the parameters of the dual-evaluator single-actuator agent and sets the scaling coefficient of each evaluator in the loss function of the policy network; S2 initializes the noise function and obtains an initial state from the initialized environment; S3 computes an action from the current state, the current policy and the noise function, executes the action, observes the reward and the next state, and stores the current state, action, reward and next state in a buffer as an experience; S4 updates the parameters of the dual-evaluator single-actuator agent according to the N samples collected from the buffer and the loss function; S5 judges whether the number of training steps has reached the maximum number of steps per episode, and if so increments the episode count and executes S6, otherwise increments the step count and executes S3; S6 judges whether the episode count has reached the set maximum number of episodes, and if so ends the training, otherwise resets the step count and executes S2; wherein the evaluators comprise a reward-based evaluator and an evaluator based on an artificial potential field. The method addresses problems of model-free reinforcement learning such as low sample utilization and slow training convergence; by combining the results of continuous interaction with the environment in a reinforcement learning manner with a planning method, the two acting together can accelerate the convergence of the algorithm and compensate for the deviation caused by an inaccurate environment model, so as to obtain the optimal solution of the problem.
Drawings
Fig. 1 is a flowchart of the dual-evaluator single-actuator reinforcement learning method provided in this embodiment;
Fig. 2 is a schematic diagram of the structure of the dual-evaluator single-actuator agent in this embodiment;
Fig. 3 is a schematic diagram of the angle between the direction of motion after action a is performed and f(s);
Fig. 4 is a schematic diagram of the next states reached after performing different actions in the current state, as provided in this embodiment;
Fig. 5 is a schematic diagram of the 3v1 predator-prey game provided in this embodiment;
Fig. 6 is a schematic diagram of the 1v1 predator-prey game provided in this embodiment;
Fig. 7 is a graph of the 1v1 reward averaged over every 500 episodes, as provided in this embodiment;
Fig. 8 is a graph of the 3v1 reward averaged over every 500 episodes, as provided in this embodiment;
Fig. 9 is a graph of the capture success rate over the last 200 steps for 1v1, as provided in this embodiment;
Fig. 10 is a graph of the capture success rate over the last 200 steps for 3v1, as provided in this embodiment;
Fig. 11 is a graph of the 3v1 reward averaged over every 500 episodes when the predators and the prey are trained together, as provided in this embodiment;
Fig. 12 is a graph of the capture success rate over the last 200 steps for 3v1 when the predators and the prey are trained together, as provided in this embodiment.
Detailed Description
For a better understanding of the present invention, reference is made to the following description taken in conjunction with the accompanying drawings and examples.
In order to make the objects, technical solutions and advantages of the embodiments of the present invention clearer, the technical solutions in the embodiments of the present invention will be clearly and completely described below with reference to the drawings in the embodiments of the present invention, and it is obvious that the described embodiments are some, but not all, embodiments of the present invention. All other embodiments, which can be derived by a person skilled in the art from the embodiments given herein without making any creative effort, shall fall within the protection scope of the present invention.
At present, the model-free AC algorithm and the DDPG algorithm generate samples by interacting with the environment directly, and therefore suffer from low sampling efficiency and slow convergence; meanwhile, model-based planning methods depend too strongly on the environment model and generalize poorly to new environments, so both need to be improved. For this reason, the model-based planning method is continuously combined with the results of interaction with the environment in a reinforcement learning manner; under their combined action, the convergence of the algorithm can be accelerated and the deviation caused by an inaccurate environment model can be compensated, so that the optimal solution of the problem is obtained.
As shown in fig. 1, the dual-evaluator single-actuator reinforcement learning method provided in this embodiment comprises:
S1, initializing the parameters of the dual-evaluator single-actuator agent, and setting the scaling coefficient of each evaluator in the loss function of the policy network;
S2, initializing the noise function and obtaining an initial state from the initialized environment;
S3, computing an action from the current state, the current policy and the noise function, executing the action, observing the reward and the next state, and storing the current state, action, reward and next state in a buffer as an experience;
S4, updating the parameters of the dual-evaluator single-actuator agent according to the N samples collected from the buffer and the loss function;
S5, judging whether the number of training steps has reached the maximum number of steps per episode; if so, incrementing the episode count and executing S6, otherwise incrementing the step count and executing S3;
S6, judging whether the episode count has reached the set maximum number of episodes; if so, ending the training, otherwise resetting the step count and executing S2;
wherein the evaluators comprise a reward-based evaluator and an evaluator based on an artificial potential field.
The technical scheme provided by this embodiment addresses problems of model-free reinforcement learning such as low sample utilization and slow training convergence; by combining the results of continuous interaction with the environment in a reinforcement learning manner with a planning method, the two acting together can accelerate the convergence of the algorithm and compensate for the deviation caused by an inaccurate environment model, so as to obtain the optimal solution of the problem.
The existing Actor-Critic structure uses one actor and one evaluator, while Asynchronous Advantage Actor-Critic (A3C), used for parallel schemes, uses one evaluator with multiple actors. This embodiment proposes a single actor guided by two evaluators: for each agent, multiple evaluators can be used simultaneously to jointly guide a single actor, so that model-based and model-free gradient updates can be combined by designing different evaluators.
In this embodiment, the specific implementation of S1 comprises:
S101, randomly initializing the value-function network of the reward-based evaluator and the policy network of the actuator;
S102, initializing the weights of the target networks;
S103, initializing the experience replay buffer;
S104, setting the scaling coefficient of each evaluator in the loss function of the policy network.
In this embodiment, the specific implementation of S4 comprises:
S401, updating the value-function network of the reward-based evaluator according to the N samples collected from the buffer and the value-function network update formula;
S402, calculating the value of the state-action value function of the evaluator based on the artificial potential field according to the N samples collected from the buffer and the preset state-action value function of the artificial potential field;
S403, updating the policy network of the actuator according to the N samples collected from the buffer, the loss function and the policy network parameter update formula;
S404, updating the target networks according to the policy network parameter θ and the policy network μ.
Referring to fig. 2, this embodiment introduces a multi-agent reinforcement learning method under the dual-evaluator single-actuator (Actor-Critic-2) framework. First, a potential-field-guided deep deterministic policy gradient method (PGDDPG) is proposed, which, based on the Actor-Critic-2 framework, combines an evaluator Critic2 based on an artificial potential field with the conventional reward-based evaluator Critic1.
In this embodiment, the construction process of the evaluator Critic2 based on the artificial potential field includes:
firstly, a design mode of an evaluator Critic2 based on an artificial potential field is introduced based on a traditional artificial potential field calculation method.
The traditional artificial potential field adopts:
U(s) = U_att(s) + U_rep(s). (1)
wherein the attractive potential is:
U_att(s) = (1/2) ξ d²(s, s_goal) (2)
where ξ is the attraction factor and d(s, s_goal) is the distance from the position occupied in state s to the target state s_goal.
The repulsive potential is:
U_rep(s) = (1/2) η (1/d(s) − 1/d_0)² if d(s) ≤ d_0, and U_rep(s) = 0 otherwise (3)
where η is the repulsion factor, d(s) is the distance to the obstacle in state s, and d_0 is the maximum range within which the repulsive force acts.
The force experienced by the agent in state s is the negative gradient of the potential field in that state, i.e.
f(s) = −∇U(s).
An evaluator Critic2 based on the artificial potential field is designed on the basis of the traditional artificial potential field method; that is, a state-action value function Q_PF(s, a) based on the artificial potential field is designed, represented by the following formula:
Figure BDA0003024332670000084
wherein χ is the angle between the direction after the action a and f(s), and s is the current state as shown in fig. 3.
Figure BDA0003024332670000085
wherein, as shown in fig. 4, s′_a is the state reached after action a is performed in state s, and 0 < γ_2 ≤ 1 is the discount coefficient. This embodiment then constructs the Actor-Critic-2-based, artificial-potential-field-guided deep deterministic policy gradient as follows:
Critic1 is based on the environmental reward r(s_k, a_k) and can be calculated from the original state-action value function of DDPG, as shown in the following formula:
Q(s, a) = E[ Σ_{k=0}^{∞} γ_1^k r(s_k, a_k) | s_0 = s, a_0 = a ] (6)
where γ_1 is the reward discount coefficient, with 0 < γ_1 ≤ 1.
DDPG is a model-free reinforcement learning method: it learns from the reward fed back by the environment and needs no environment model, but its sample utilization is low and its convergence is slow.
Critic2 is based on the artificial potential field and is calculated from the state-action value function of the artificial potential field designed in this embodiment, shown in the following formula:
Figure BDA0003024332670000091
where γ_2 is the discount coefficient of the potential field value, with 0 < γ_2 ≤ 1.
The artificial potential field is a model-based planning method; its computational efficiency is high, but the accuracy of the environment model affects the performance of the algorithm.
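As an illustration of how such a model-based evaluator can be computed, the following Python sketch evaluates the conventional potential field of equations (1)-(3), the force f(s) = −∇U(s), and a per-step potential-field score for an action; positions and actions are NumPy arrays. The attraction/repulsion parameters, the finite-difference gradient and, in particular, the projection form used for q_PF are assumptions made for illustration, since the patent's exact expressions (4) and (5) are reproduced above only as images.

import numpy as np

def u_att(pos, goal, xi=1.0):
    # Attractive potential U_att(s) = 0.5 * xi * d(s, s_goal)^2 (standard APF form).
    return 0.5 * xi * float(np.sum((pos - goal) ** 2))

def u_rep(pos, obstacle, eta=1.0, d0=0.5):
    # Repulsive potential, active only within range d0 of the obstacle (standard APF form).
    d = float(np.linalg.norm(pos - obstacle))
    if d >= d0 or d == 0.0:
        return 0.0
    return 0.5 * eta * (1.0 / d - 1.0 / d0) ** 2

def potential(pos, goal, obstacles, xi=1.0, eta=1.0, d0=0.5):
    # U(s) = U_att(s) + U_rep(s), summing the repulsion of every obstacle.
    return u_att(pos, goal, xi) + sum(u_rep(pos, o, eta, d0) for o in obstacles)

def force(pos, goal, obstacles, eps=1e-4, **kw):
    # f(s) = -grad U(s), approximated here by central finite differences.
    f = np.zeros_like(pos, dtype=float)
    for i in range(pos.size):
        dp = np.zeros_like(pos, dtype=float)
        dp[i] = eps
        f[i] = -(potential(pos + dp, goal, obstacles, **kw)
                 - potential(pos - dp, goal, obstacles, **kw)) / (2.0 * eps)
    return f

def q_pf(pos, action, goal, obstacles, **kw):
    # Hypothetical per-step potential-field score: the projection of the unit action
    # direction onto f(s), i.e. |f(s)| * cos(chi), where chi is the angle between the
    # motion direction after the action and f(s).  This is an illustrative stand-in
    # for the patent's q_PF, whose exact form is not reproduced here.
    f = force(pos, goal, obstacles, **kw)
    a = np.asarray(action, dtype=float)
    a = a / (np.linalg.norm(a) + 1e-8)
    return float(np.dot(f, a))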
In a specific implementation, according to the Actor-Critic-2 framework, Critic1 and Critic2 are combined to jointly guide the update of the Actor policy network; this combines the advantages of model-based planning and model-free reinforcement learning, so that the algorithm converges faster and more stably. In this embodiment the Adam optimizer is used for optimization.
According to equations (6) and (7), combined with the definition of the stochastic policy gradient loss function in reinforcement learning, the loss function of the policy network of the dual-evaluator single-actuator method can be expressed as:
J(μ_θ) = ∫_S ρ^μ(s, γ_1) β r(s, μ_θ(s)) ds + ∫_S ρ^μ(s, γ_2) (1 − β) q_PF(s, μ_θ(s)) ds (8)
where β is the scaling coefficient that balances Critic1 and Critic2: the larger β is, the larger the proportion of the reward-based Critic1 and the stronger the guidance of reinforcement learning on the actor; the smaller β is, the larger the proportion of Critic2 and the stronger the guidance of the artificial potential field planning.
In one embodiment, the scaling coefficient β can be set in two ways (a simple illustration of both is sketched below):
1) β is a fixed value: the influence of the artificial potential field gradually decreases as the target is approached, and local exploration is carried out by learning;
2) β is dynamically adjusted with the training episodes, so that the total action value function approaches the true state-action value function.
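The helper below returns either a fixed β or a β that is ramped up linearly over the training episodes; the linear schedule and its endpoint values are illustrative assumptions, not values fixed by the invention.

def beta_schedule(episode, max_episodes, dynamic=True,
                  beta_fixed=0.7, beta_start=0.3, beta_end=1.0):
    # Scaling coefficient beta between the reward-based Critic1 and the
    # potential-field-based Critic2.  Mode 1) returns a constant; mode 2) ramps
    # beta up with the training episode, so that the potential field dominates
    # early exploration and the learned reward critic dominates later.
    if not dynamic:
        return beta_fixed
    frac = min(max(episode / float(max_episodes), 0.0), 1.0)
    return beta_start + frac * (beta_end - beta_start)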
The policy network gradient derived from equation (8) is:
∇_θ J(μ_θ) = E_{s∼ρ^μ} [ β ∇_θ μ_θ(s) ∇_a Q(s, a | w)|_{a=μ_θ(s)} + (1 − β) ∇_θ μ_θ(s) ∇_a q_PF(s, a)|_{a=μ_θ(s)} ] (9)
The policy network parameter of the Actor is denoted θ and the value-function network parameter of Critic1 is denoted w; Critic2 is computed directly using equations (4) and (5). The policy network and the value-function network are updated iteratively.
The calculation is as follows. The policy network parameters are updated according to:
θ_{t+1} = θ_t + α_θ [ β ∇_θ μ_θ(s_t) ∇_a Q(s_t, a | w)|_{a=a_t} + (1 − β) ∇_θ μ_θ(s_t) ∇_a q_PF(s_t, a)|_{a=a_t} ] (10)
in the formula: θ_{t+1} is the parameter of the policy network at time t+1; θ_t is the parameter of the policy network at time t; α_θ is the learning rate of the policy network; ∇_θ denotes the gradient with respect to θ; μ_θ is the policy network with weight parameter θ; s_t is the state at time t; a_t is the action executed at time t; β is the scaling coefficient of the dual evaluators; ∇_a denotes the gradient with respect to the action a; Q(s_t, a_t | w) is the value of the reward-based state-action value function with weight parameter w when action a_t is executed in the state at time t; ∇_a q_PF(s_t, a_t) is the gradient of the potential-field-based state-action value function at time t.
The value-function network is updated as follows:
δ_t = r_t + γ_1 Q(s_{t+1}, μ_{θ′}(s_{t+1}) | w′) − Q(s_t, a_t | w) (11)
w_{t+1} = w_t + α_w δ_t ∇_w Q(s_t, a_t | w) (12)
in the formula: δ_t is the TD error at time t; r_t is the instantaneous reward at time t; γ_1 is the discount coefficient of the reward; Q(s_{t+1}, μ_{θ′}(s_{t+1}) | w′) is the value, under the reward-based state-action value function, of the state at time t+1 when the value-function and policy weight parameters are w′ and θ′ respectively; s_{t+1} is the state at time t+1; μ_{θ′}(s_{t+1}) is the action given by the policy with weight parameter θ′ for the state at time t+1; Q(s_t, a_t | w) is the value, under the reward-based state-action value function with weight parameter w, of executing action a_t in the state at time t; s_t is the state at time t; a_t is the action at time t; w_{t+1} is the network weight at time t+1; w_t is the network weight at time t; α_w is the learning rate of the value-function network; ∇_w denotes the gradient with respect to the weight w.
To make the algorithm converge faster and more stably, this embodiment adopts the same delayed double-network update as DDPG: the target value-function network and the target policy network are denoted Q′ and μ′ respectively, and the target network parameters are updated by soft updates:
θ′←τθ+(1-τ)θ′ (13)
w′←τw+(1-τ)w′ (14)
in the formula: θ′ is the target policy network weight; τ is the 'temperature' coefficient of the soft update of the target networks; θ is the policy network weight; w′ is the target value-function network weight; w is the value-function network weight.
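The following PyTorch sketch gathers the update rules (10)-(14) into a single function. It assumes actor/critic modules with the interfaces actor(s) → μ(s|θ) and critic1(s, a) → Q(s, a|w), a q_pf(s, a) implemented with torch operations so that its gradient with respect to the action exists, and minibatch tensors whose reward column has shape [N, 1]; these interfaces and hyperparameter values are assumptions made for illustration, not requirements of the invention.

import torch
import torch.nn.functional as F

def update(batch, actor, critic1, actor_target, critic1_target,
           actor_opt, critic1_opt, q_pf, beta, gamma1=0.99, tau=0.005):
    # One parameter update of the dual-evaluator single-actuator agent.
    s, a, r, s_next = batch  # tensors sampled from the replay buffer

    # Critic1 (reward-based evaluator): TD update, cf. equations (11)-(12).
    with torch.no_grad():
        target_q = r + gamma1 * critic1_target(s_next, actor_target(s_next))
    critic_loss = F.mse_loss(critic1(s, a), target_q)
    critic1_opt.zero_grad()
    critic_loss.backward()
    critic1_opt.step()

    # Actor: combined guidance of the two evaluators weighted by beta,
    # cf. equations (8)-(10); Critic2 enters through q_pf(s, a).
    a_pred = actor(s)
    actor_loss = -(beta * critic1(s, a_pred)
                   + (1.0 - beta) * q_pf(s, a_pred)).mean()
    actor_opt.zero_grad()
    actor_loss.backward()
    actor_opt.step()

    # Soft update of the target networks, cf. equations (13)-(14).
    for net, target in ((actor, actor_target), (critic1, critic1_target)):
        for p, p_t in zip(net.parameters(), target.parameters()):
            p_t.data.mul_(1.0 - tau).add_(tau * p.data)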
Based on the embodiments provided above, this embodiment describes the dual-evaluator single-actuator reinforcement learning method in detail; it comprises:
Step 1, randomly initializing the value-function network Q(s, a | w) of Critic1 and the policy network μ(s | θ) of the Actor with weights w and θ, and setting the scaling coefficient β of the dual evaluators;
Step 2, initializing the weights w′ and θ′ of the target networks Q′ and μ′ as w′ = w and θ′ = θ;
Step 3, initializing the experience replay buffer R;
Step 4, initializing the noise function N(·) used for action exploration;
Step 5, initializing the environment and obtaining the initial state s_1;
Step 6, computing the action from the current state s_t, the current policy μ_t and the exploration noise N_t:
a_t = μ(s | θ) + N_t;
Step 7, executing the action a_t, and observing the reward r_t and the next state s_{t+1};
Step 8, storing the experience (s_t, a_t, r_t, s_{t+1}) in the buffer R;
Step 9, randomly sampling N samples (s_i, a_i, r_i, s_{i+1}) from the buffer R;
Step 10, updating the value-function network Q(s, a | w) of Critic1 with the N sampled transitions through equations (11) and (12);
Step 11, calculating the value of the artificial potential field value function of Critic2 with the N sampled transitions through equations (4) and (5);
Step 12, updating the policy network μ(s | θ) of the Actor with the N sampled transitions through equation (10);
Step 13, updating the target networks through equations (13) and (14);
Step 14, when the episode ends or the maximum number of steps per episode is reached, executing step 15; otherwise returning to step 6;
Step 15, ending the training when the maximum number of episodes is reached; otherwise executing step 4.
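The following Python sketch strings steps 4-15 together as a training loop. It assumes a gym-style environment (reset() returning a state and step(action) returning next_state, reward, done, info) and the action-selection, update and noise helpers sketched earlier, suitably bound to the networks and noise process; the hyperparameter values are placeholders, not values prescribed by the invention.

import random
from collections import deque

def train(env, select_action, update, reset_noise,
          max_episodes=3000, max_steps=200, batch_size=64, buffer_size=100000):
    # Training loop of the dual-evaluator single-actuator method.
    buffer = deque(maxlen=buffer_size)            # experience replay buffer R (step 3)

    for episode in range(max_episodes):           # outer loop over episodes (step 15)
        reset_noise()                             # re-initialise exploration noise N(.) (step 4)
        state = env.reset()                       # initial state s_1 (step 5)

        for step in range(max_steps):             # inner loop over the steps of one episode (step 14)
            action = select_action(state)         # a_t = mu(s_t | theta) + N_t (step 6)
            next_state, reward, done, _ = env.step(action)       # observe r_t, s_{t+1} (step 7)
            buffer.append((state, action, reward, next_state))   # store the experience (step 8)

            if len(buffer) >= batch_size:
                batch = random.sample(buffer, batch_size)        # sample N transitions (step 9)
                update(batch)                     # Critic1 TD update, Critic2 potential-field
                                                  # values, actor update, soft target update
                                                  # (steps 10-13)
            state = next_state
            if done:                              # the episode may also end early, e.g. on capture
                break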
The dual-evaluator single-actuator reinforcement learning method provided by this embodiment addresses the low sample utilization and slow training convergence of model-free reinforcement learning. By using dual evaluators, not only can model-based planning (such as an artificial potential field) be combined with model-free reinforcement learning, but several reinforcement learning methods can also be combined directly. Combining model-based planning with model-free reinforcement learning accelerates the convergence of the algorithm while improving its generalization to new environments, which plays an important role in promoting the practical application of reinforcement learning.
It should be noted that, although the foregoing embodiments describe each step in a specific sequence, those skilled in the art will understand that, in order to achieve the effect of the present invention, different steps do not necessarily need to be executed in such a sequence, and they may be executed simultaneously (in parallel) or in other sequences, and these changes are all within the protection scope of the present invention.
Based on the above scheme, the invention provides an application scenario for an embodiment of the technical scheme. The effectiveness of the technical scheme is verified with experiments in multiagent-particle-environments (MPE), using the predator-prey model in MPE: the coordinates of the predators are limited to [-1, 1], the coordinates of the prey are limited to [-0.8, 0.8], and the predators and the prey have the same speed. Two situations are considered, as shown in fig. 5 and fig. 6: the 3v1 predator-prey game shown in fig. 5, in which 3 predators chase one prey, and the 1v1 predator-prey game shown in fig. 6, in which a single predator chases one prey; the triangles represent predators and the circles represent prey.
Consider first the 1v1 predator-prey game, where the environmental reward is sparse (+10 on a successful capture) and depends only on the terminal state of each episode. Then consider N predators chasing one prey in a randomly generated environment: each predator receives a reward of 10 only if all predators capture the prey at the same time, and no predator is rewarded as long as the prey is not caught. This leads to a difficult learning problem that requires good tacit cooperation.
The goal of the experiment is to learn to capture the prey independently, without knowledge of the opponent's strategy and actions. For this continuous-action control problem, the deterministic policy gradient algorithm DDPG is used as the base (containing the evaluator Critic1), a custom artificial potential field evaluator (Critic2) is added, and the Actor is updated jointly using the combined gradients of Critic1 and Critic2. This method is referred to as PGDDPG in this embodiment.
In this embodiment, two ways are adopted to verify the validity of the technical scheme, which specifically include:
first, using a pre-trained DDPG model as a predator strategy
For the 1v1 and 3v1 predator-prey games, the capture success rates and the predator reward curves of PGDDPG and DDPG are plotted, as shown in fig. 7, 8, 9 and 10 respectively.
To show the learning process more smoothly, the mean of the reward values over every 500 episodes is calculated, as shown in fig. 7 and 8.
It is apparent from fig. 7, 8, 9 and 10 that PGDDPG is superior to DDPG in convergence speed.
Second, the predators and the prey are trained together
The simultaneous training increases the difficulty of learning because it becomes a zero-sum game, with the environment being dynamically enhanced.
As shown in fig. 11 and 12, for PGDDPG the success rate drops from 0.2 to 0 at approximately episode 1000 (the prey's ability to escape exceeds the predators' ability to capture); however, the capture ability quickly overtakes the escape ability and takes the leading position.
It can also be observed from fig. 11 and 12 that DDPG fails in the 3v1 predator-prey zero-sum game.
In fig. 7-12, the abscissa (episode) represents the number of training episodes; the ordinate (reward) of fig. 7, 8 and 11 represents the reward, and the ordinate (rate of success) of fig. 9, 10 and 12 represents the capture success rate.
It will be understood by those skilled in the art that all or part of the flow of the method according to the above-described embodiment may be implemented by a computer program, which may be stored in a computer-readable storage medium and used to implement the steps of the above-described embodiments of the method when the computer program is executed by a processor. Wherein the computer program comprises computer program code, which may be in the form of source code, object code, an executable file or some intermediate form, etc. The computer-readable medium may include: any entity or device capable of carrying said computer program code, media, usb disk, removable hard disk, magnetic diskette, optical disk, computer memory, read-only memory, random access memory, electrical carrier wave signals, telecommunication signals, software distribution media, etc. It should be noted that the computer readable medium may contain content that is subject to appropriate increase or decrease as required by legislation and patent practice in jurisdictions, for example, in some jurisdictions, computer readable media does not include electrical carrier signals and telecommunications signals as is required by legislation and patent practice.
Furthermore, the invention also provides a storage device. In one embodiment of the storage device according to the present invention, the storage device may be configured to store a program for executing the double-evaluator single-actuator reinforcement learning method of the above-described method embodiment, and the program may be loaded and executed by a processor to implement the double-evaluator single-actuator reinforcement learning method described above. For convenience of explanation, only the parts related to the embodiments of the present invention are shown, and details of the specific techniques are not disclosed. The storage device may be a storage device apparatus formed by including various electronic devices, and optionally, a non-transitory computer-readable storage medium is stored in the embodiment of the present invention.
Furthermore, the invention also provides a control device. In an embodiment of the control device according to the present invention, the control device comprises a processor and a storage device, the storage device may be configured to store a program for executing the double-evaluator single-actuator reinforcement learning method of the above-mentioned method embodiment, and the processor may be configured to execute a program in the storage device, the program including but not limited to a program for executing the double-evaluator single-actuator reinforcement learning method of the above-mentioned method embodiment. For convenience of explanation, only the parts related to the embodiments of the present invention are shown, and details of the specific techniques are not disclosed. The control device may be a control device apparatus formed including various electronic apparatuses.
As will be appreciated by one skilled in the art, embodiments of the present application may be provided as a method, system, or computer program product. Accordingly, the present application may take the form of an entirely hardware embodiment, an entirely software embodiment or an embodiment combining software and hardware aspects. Furthermore, the present application may take the form of a computer program product embodied on one or more computer-usable storage media (including, but not limited to, disk storage, CD-ROM, optical storage, and the like) having computer-usable program code embodied therein.
The present application is described with reference to flowchart illustrations and/or block diagrams of methods, apparatus (systems), and computer program products according to embodiments of the application. It will be understood that each flow and/or block of the flow diagrams and/or block diagrams, and combinations of flows and/or blocks in the flow diagrams and/or block diagrams, can be implemented by computer program instructions. These computer program instructions may be provided to a processor of a general purpose computer, special purpose computer, embedded processor, or other programmable data processing apparatus to produce a machine, such that the instructions, which execute via the processor of the computer or other programmable data processing apparatus, create means for implementing the functions specified in the flowchart flow or flows and/or block diagram block or blocks.
These computer program instructions may also be stored in a computer-readable memory that can direct a computer or other programmable data processing apparatus to function in a particular manner, such that the instructions stored in the computer-readable memory produce an article of manufacture including instruction means which implement the function specified in the flowchart flow or flows and/or block diagram block or blocks.
These computer program instructions may also be loaded onto a computer or other programmable data processing apparatus to cause a series of operational steps to be performed on the computer or other programmable apparatus to produce a computer implemented process such that the instructions which execute on the computer or other programmable apparatus provide steps for implementing the functions specified in the flowchart flow or flows and/or block diagram block or blocks.
Finally, it should be noted that: the above embodiments are only for illustrating the technical solutions of the present invention and not for limiting the same, and although the present invention is described in detail with reference to the above embodiments, those of ordinary skill in the art should understand that: modifications and equivalents may be made to the embodiments of the invention without departing from the spirit and scope of the invention, which is to be covered by the claims.

Claims (10)

1. A dual-evaluator single-actuator reinforcement learning method, characterized by comprising the following steps:
S1, initializing the parameters of the dual-evaluator single-actuator agent, and setting the scaling coefficient of each evaluator in the loss function of the policy network;
S2, initializing the noise function and obtaining an initial state from the initialized environment;
S3, computing an action from the current state, the current policy and the noise function, executing the action, observing the reward and the next state, and storing the current state, action, reward and next state in a buffer as an experience;
S4, updating the parameters of the dual-evaluator single-actuator agent according to the N samples collected from the buffer and the loss function;
S5, judging whether the number of training steps has reached the maximum number of steps per episode; if so, incrementing the episode count and executing S6, otherwise incrementing the step count and executing S3;
S6, judging whether the episode count has reached the set maximum number of episodes; if so, ending the training, otherwise resetting the step count and executing S2;
wherein the evaluators comprise a reward-based evaluator and an evaluator based on an artificial potential field.
2. The reinforcement learning method of claim 1, wherein in S1, initializing the parameters of the dual-evaluator single-actuator agent and setting the scaling coefficient of each evaluator in the loss function of the policy network comprises:
S101, randomly initializing the value-function network of the reward-based evaluator and the policy network of the actuator;
S102, initializing the weights of the target networks;
S103, initializing the experience replay buffer;
S104, setting the scaling coefficient of each evaluator in the loss function of the policy network.
3. The reinforcement learning method of claim 1, wherein the loss function of the policy network is represented by the following equation:
J(μ_θ) = ∫_S ρ^μ(s, γ_1) β r(s, μ_θ(s)) ds + ∫_S ρ^μ(s, γ_2) (1 − β) q_PF(s, μ_θ(s)) ds
in the formula: J(μ_θ) is the loss function of the policy network; θ is the policy network parameter of the actuator; S is the state space; ρ^μ(s, γ_1) is the state distribution under discount γ_1; s is the current state; γ_1 is the discount coefficient of the reward; μ_θ(s) is the policy function; r(s, μ_θ(s)) is the reward obtained by taking policy μ_θ in state s; β is the scaling coefficient of the dual evaluators; ρ^μ(s, γ_2) is the state distribution under discount γ_2; γ_2 is the discount coefficient of the potential field value; q_PF(s, μ_θ(s)) is the potential-field-based state-action function when executing policy μ_θ in state s.
4. The reinforcement learning method of claim 1, wherein the action is calculated from the current state, the current policy and the noise function according to the following formula:
a_t = μ(s|θ) + N_t
in the formula: a_t is the action at time t; μ(s|θ) is the output of the current policy for the current state s; s is the current state; θ is the policy network parameter of the actuator; N_t is the noise at time t obtained from the noise function.
5. The reinforcement learning method of claim 1, wherein in S4, updating the parameters of the dual-evaluator single-actuator agent according to the N samples collected from the buffer and the loss function comprises:
S401, updating the value-function network of the reward-based evaluator according to the N samples collected from the buffer and the value-function network update formula;
S402, calculating the value of the state-action value function of the evaluator based on the artificial potential field according to the N samples collected from the buffer and the preset state-action value function of the artificial potential field;
S403, updating the policy network of the actuator according to the N samples collected from the buffer, the loss function and the policy network parameter update formula;
S404, updating the target networks according to the policy network parameter θ and the policy network μ.
6. The reinforcement learning method of claim 5, wherein the state-action value function of the artificial potential field is represented by the following formula:
Figure FDA0003024332660000021
in the formula: qPF(s, a) is a state-action value function of the artificial potential field; s is the current state; a is an action; u(s) is a potential field value in a state s; gamma ray2A discount coefficient for the potential field value; s'aIs the state after executing action a under state s; u (s'a) A potential field value which is a state after the action a is executed in the state s; e, averaging; k is the current step number; q. q.sPF(sk,ak) Is in a state skExecution policy muθA state-action function based on the potential field;
wherein q isPF(sk,ak) Calculated as follows:
Figure FDA0003024332660000022
in the formula: U(s) is the potential field value in state s; χ is the angle between the direction of motion after action a and f(s).
7. The reinforcement learning method of claim 5, wherein the policy network parameter update formula is as shown in the following equation:
θ_{t+1} = θ_t + α_θ [ β ∇_θ μ_θ(s_t) ∇_a Q(s_t, a | w)|_{a=a_t} + (1 − β) ∇_θ μ_θ(s_t) ∇_a q_PF(s_t, a)|_{a=a_t} ]
in the formula: θ_{t+1} is the parameter of the policy network at time t+1; θ_t is the parameter of the policy network at time t; α_θ is the learning rate of the policy network; ∇_θ denotes the gradient with respect to θ; μ_θ is the policy network with weight parameter θ; s_t is the state at time t; a_t is the action executed at time t; β is the scaling coefficient of the dual evaluators; ∇_a denotes the gradient with respect to the action a; Q(s_t, a_t | w) is the value of the reward-based state-action value function with weight parameter w when action a_t is executed in the state at time t; ∇_a q_PF(s_t, a_t) is the gradient of the potential-field-based state-action value function at time t.
8. The reinforcement learning method of claim 4, wherein the value-function network update formula is as shown in the following equations:
δ_t = r_t + γ_1 Q(s_{t+1}, μ_{θ′}(s_{t+1}) | w′) − Q(s_t, a_t | w)
w_{t+1} = w_t + α_w δ_t ∇_w Q(s_t, a_t | w)
in the formula: δ_t is the TD error at time t; r_t is the instantaneous reward at time t; γ_1 is the discount coefficient of the reward; Q(s_{t+1}, μ_{θ′}(s_{t+1}) | w′) is the value, under the reward-based state-action value function, of the state at time t+1 when the value-function and policy weight parameters are w′ and θ′ respectively; s_{t+1} is the state at time t+1; μ_{θ′}(s_{t+1}) is the action given by the policy with weight parameter θ′ for the state at time t+1; Q(s_t, a_t | w) is the value, under the reward-based state-action value function with weight parameter w, of executing action a_t in the state at time t; s_t is the state at time t; a_t is the action at time t; w_{t+1} is the network weight at time t+1; w_t is the network weight at time t; α_w is the learning rate of the value-function network; ∇_w denotes the gradient with respect to the weight w.
9. The reinforcement learning method of claim 4, wherein the target network is updated as follows:
θ′←τθ+(1-τ)θ′
w′←τw+(1-τ)w′
in the formula: θ′ is the target policy network weight; τ is the 'temperature' coefficient of the soft update of the target networks; θ is the policy network weight; w′ is the target value-function network weight; w is the value-function network weight.
10. A dual-evaluator single-actuator reinforcement learning system for implementing the dual-evaluator single-actuator reinforcement learning method according to any one of claims 1 to 9, comprising:
an initialization module for initializing the parameters of the dual-evaluator single-actuator agent and setting the scaling coefficient of each evaluator in the loss function of the policy network;
an initial-state module for initializing the noise function and obtaining an initial state from the initialized environment;
a sample-generation module for computing an action from the current state, the current policy and the noise function, executing the action, observing the reward and the next state, and storing the current state, action, reward and next state in a buffer as an experience;
a parameter-update module for updating the parameters of the dual-evaluator single-actuator agent according to the N samples collected from the buffer and the loss function;
a step-count judging module for judging whether the number of training steps has reached the maximum number of steps per episode; if so, incrementing the episode count and executing the episode-count judging module, otherwise incrementing the step count and executing the sample-generation module;
an episode-count judging module for judging whether the episode count has reached the set maximum number of episodes; if so, ending the training, otherwise resetting the step count and executing the initial-state module;
wherein the evaluators comprise a reward-based evaluator and an evaluator based on an artificial potential field.
CN202110415953.4A 2021-04-16 2021-04-16 Reinforced learning method and system for double evaluators and single actuator Pending CN113268854A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202110415953.4A CN113268854A (en) 2021-04-16 2021-04-16 Reinforced learning method and system for double evaluators and single actuator

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202110415953.4A CN113268854A (en) 2021-04-16 2021-04-16 Reinforced learning method and system for double evaluators and single actuator

Publications (1)

Publication Number Publication Date
CN113268854A true CN113268854A (en) 2021-08-17

Family

ID=77228844

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202110415953.4A Pending CN113268854A (en) 2021-04-16 2021-04-16 Reinforced learning method and system for double evaluators and single actuator

Country Status (1)

Country Link
CN (1) CN113268854A (en)

Cited By (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN114139472A (en) * 2021-11-04 2022-03-04 江阴市智行工控科技有限公司 Integrated circuit direct current analysis method and system based on reinforcement learning dual-model structure
CN115392144A (en) * 2022-10-31 2022-11-25 深圳飞骧科技股份有限公司 Method for automatic design of acoustic surface filter, related system and storage medium
CN115392144B (en) * 2022-10-31 2023-02-03 深圳飞骧科技股份有限公司 Method for automatic design of acoustic surface filter, related system and storage medium
CN115493597A (en) * 2022-11-15 2022-12-20 山东大学 AUV path planning control method based on SAC algorithm

Similar Documents

Publication Publication Date Title
CN113268854A (en) Reinforced learning method and system for double evaluators and single actuator
Lee et al. Sample-efficient deep reinforcement learning via episodic backward update
JP6824382B2 (en) Training machine learning models for multiple machine learning tasks
CN107209872B (en) Systems, methods, and storage media for training a reinforcement learning system
CN111026272B (en) Training method and device for virtual object behavior strategy, electronic equipment and storage medium
CN109284812B (en) Video game simulation method based on improved DQN
CN111008449A (en) Acceleration method for deep reinforcement learning deduction decision training in battlefield simulation environment
CN105637540A (en) Methods and apparatus for reinforcement learning
Cui et al. Using social emotional optimization algorithm to direct orbits of chaotic systems
CN113449458A (en) Multi-agent depth certainty strategy gradient method based on course learning
CN112052947B (en) Hierarchical reinforcement learning method and device based on strategy options
CN108465244A (en) AI method for parameter configuration, device, equipment and storage medium for racing class AI models
CN113962390B (en) Method for constructing diversified search strategy model based on deep reinforcement learning network
WO2020259504A1 (en) Efficient exploration method for reinforcement learning
CN112488826A (en) Method and device for optimizing bank risk pricing based on deep reinforcement learning
CN113487039A (en) Intelligent body self-adaptive decision generation method and system based on deep reinforcement learning
CN112104563B (en) Congestion control method and device
Mousavi et al. Applying q (λ)-learning in deep reinforcement learning to play atari games
CN114290339B (en) Robot realistic migration method based on reinforcement learning and residual modeling
CN107798384B (en) Iris florida classification method and device based on evolvable pulse neural network
CN113419424A (en) Modeling reinforcement learning robot control method and system capable of reducing over-estimation
CN113919475B (en) Robot skill learning method and device, electronic equipment and storage medium
CN115542912A (en) Mobile robot path planning method based on improved Q-learning algorithm
Nichols et al. Application of Newton's Method to action selection in continuous state-and action-space reinforcement learning
Lee et al. Convergent reinforcement learning control with neural networks and continuous action search

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination