CN109511277B - Cooperative method and system for multi-state continuous action space


Info

Publication number: CN109511277B (application CN201880001580.2A; published application CN109511277A)
Authority: CN (China)
Prior art keywords: action, state, return, agent
Legal status: Active (granted)
Other languages: Chinese (zh)
Inventors: 侯韩旭, 郝建业, 张程伟
Applicant and assignee: Dongguan University of Technology

Classifications

    • A — HUMAN NECESSITIES
    • A63 — SPORTS; GAMES; AMUSEMENTS
    • A63F — CARD, BOARD, OR ROULETTE GAMES; INDOOR GAMES USING SMALL MOVING PLAYING BODIES; VIDEO GAMES; GAMES NOT OTHERWISE PROVIDED FOR
    • A63F 13/00 — Video games, i.e. games using an electronically generated display having two or more dimensions
    • A63F 13/80 — Special adaptations for executing a specific game genre or game mode
    • A63F 13/847 — Cooperative playing, e.g. requiring coordinated actions from several players to achieve a common goal
    • A63F 13/55 — Controlling game characters or game objects based on the game progress

Landscapes

  • Engineering & Computer Science (AREA)
  • Multimedia (AREA)
  • Human Computer Interaction (AREA)
  • Machine Translation (AREA)
  • Feedback Control In General (AREA)

Abstract

The invention provides a cooperative method and a cooperative system for a multi-state continuous action space, belonging to the field of reinforcement learning. The method comprises the following steps: initializing an action set for the states in any state set; initializing related parameters for the states in any state set and the actions in the action set; and constructing corresponding collaboration mechanisms at an action set correction layer and a policy evaluation and update layer respectively, until the return of agent i in state s converges. The invention also provides a system for realizing the cooperative method for the multi-state continuous action space. The beneficial effect of the invention is that the collaboration problem of multiple agents in a continuous action space can be well handled.

Description

Cooperative method and system for multi-state continuous action space
Technical Field
The invention relates to the field of reinforcement learning, in particular to a cooperative method and a cooperative system for a multi-state continuous action space.
Background
Many efforts in the reinforcement learning field aim to learn the optimal solution of continuous action space problems, but most of this work has focused on single-agent learning. Some of the problems encountered in multi-agent collaboration, such as non-stationarity and randomness, remain a significant challenge in continuous action spaces.
In reality, many research fields involve multi-agent cooperation problems in continuous action spaces, such as robot soccer [1] and multiplayer online competitive games [2]. In such problems, an agent not only needs to handle action selection over an infinite action set in a continuous action space, but also needs to cooperate effectively with the other agents to seek the optimal group return.
So far, there have been many studies addressing the collaboration problem in multi-agent environments. The most common are algorithms based on extensions of Q-learning, such as Distributed Q-learning [3], Hysteretic Q-learning [4], lenient learning [5], Lenient-FAQ [6], LMRL2 [7], Lenient-DQN [8] and rFMQ [9]. These algorithms can solve the collaboration problem of multi-agent systems to some extent, but they can only be applied in discrete action spaces.
On the other hand, some work has focused on control problems in continuous action spaces, such as value function approximation (Value Approximation) algorithms [10-14] and policy approximation (Policy Approximation) algorithms [15-18]. Value function approximation algorithms estimate the value function over the state-action space from training samples, while policy approximation algorithms define the policy as the probability density function of some distribution over the continuous space and then learn the policy directly. The learning performance of these two classes of algorithms depends on the properties of the estimated value function, whereas the state-action value functions of common problems often have complex structures, such as nonlinearities. Another class of algorithms for continuous action spaces is Monte-Carlo-based sampling methods [19,20], which use sampling to handle exploration in the continuous action space and can be conveniently combined with traditional discrete reinforcement learning algorithms. Both classes of algorithms above are designed for single-agent environments and cannot be directly applied to multi-agent collaboration problems, because in a multi-agent environment an agent's estimate of the return function no longer reflects the true quality of its current policy [21]. There is also some work studying continuous action spaces in multi-agent environments, but not multi-agent collaboration problems, such as a learning algorithm for fairness [22] and a theoretical steady-state analysis of algorithms using a continuous Boltzmann exploration strategy [23].
Disclosure of Invention
In order to solve the problems in the prior art, the invention provides a cooperative method and a cooperative system for a multi-state continuous action space.
The cooperative method for the multi-state continuous action space comprises the following steps:
(1): for any state s ∈ S, initialize the sampled action set A_i(s) as n actions randomly sampled from the continuous action space A_i of agent i, where S is the state set;
(2): for any state s ∈ S and any action a ∈ A_i(s), initialize the expected return Q_i(s, a) of agent i for action a in state s, the historical maximum return Q_i^max(s, a), the weighted average return E_i(s, a), the average expected return V_i(s) of agent i in state s, the probability π_i(s, a) of selecting action a in state s, the estimate F_i(s, a) of the frequency with which the maximum return occurs, and the exploration rate l_i(s) to set values;
(3): repeat the following steps until the return of agent i in state s converges:
(31): initialize the state s ← s_0;
(32): repeat the following steps until the state s reaches the terminal state:
(321): judge whether the action set needs to be updated; if not, execute step (322); if so, resample the action set, keeping the action with the maximum return and sampling new actions within a certain range of that action as the new action set, and then execute step (322);
(322): for any a ∈ A_i(s), update π_i(s, a) and Q_i(s, a) according to the selection principle that the action with the highest return is selected with the highest probability;
(323): update the state s ← s'.
In a further improvement of the invention, in step (1), the initial sampled action set A_i(s) of each state is set to n actions sampled at equal intervals from the continuous action space A_i.
In a further improvement of the invention, a piecewise bilinear interpolation algorithm is adopted to extend the n discrete actions to continuous actions in a continuous action space.
In a further improvement of the invention, in step (321) the resampling is performed by a collaborative sampling strategy to update the action set, and a variable exploration rate l_i(s) is employed to control the range within which the collaborative sampling strategy samples new actions around the maximum-return action.
In a further improvement of the invention, the processing method of the collaborative sampling strategy is as follows:
A1: update the exploration rate l_i(s): if the average expected return of the current action set is greater than or equal to the cumulative average expected return V_i(s) of the previous action sets, decrease the exploration rate l_i(s) to l_i(s)·δ_d; otherwise increase it to l_i(s)·δ_l, where δ_l is a positive real number greater than 1 and δ_d is a positive real number less than 1;
A2: update the cumulative average expected return V_i(s) ← (1 − α_s)·V_i(s) + α_s·(average expected return of the current action set), where α_s is the learning rate;
A3: resample the action set according to the exploration rate l_i(s): compute the action a_max with the maximum current return, keep the |A_i(s)|/3 actions with the largest expected return in the current set, and randomly select 2|A_i(s)|/3 new actions from the neighbourhood of a_max with radius l_i(s), together forming the new action set;
A4: initialize the policy π_i(s, a) and the corresponding expected return Q_i(s, a) of each new action to the initial set values.
In a further improvement of the invention, in step (32), agent i learns and updates by using a multi-state recursive frequency maximum Q-value learning algorithm.
In a further improvement of the invention, the processing method of the multi-state recursive frequency maximum Q-value learning algorithm is as follows:
B1: judge whether the current action set has been updated; if not, directly execute step B2; if so, initialize F_i(s, a), Q_i^max(s, a) and E_i(s, a) for all actions in the current state, and then execute step B2;
B2: select an action a ∈ A_i(s) in state s according to the policy π_i(s, a) with a certain exploration rate;
B3: observe the return r and the next state s' from the environment, and update the state-action value Q_i(s, a) corresponding to the current s and a: Q_i(s, a) ← (1 − α)Q_i(s, a) + α(r + γ max_{a'} Q_i(s', a')), where α is the learning rate, γ is the discount factor, and max_{a'} Q_i(s', a') is the maximum state-action value over the actions a' in the next state s';
B4: estimate E_i(s, a) according to the recursive maximum priority idea;
B5: update the policy π_i(s, a) according to E_i(s, a) using a policy hill-climbing algorithm, i.e. increase the probability of selecting the action with the maximum E_i(s, a) value while decreasing the probability of selecting the other actions.
The invention also provides a system for realizing the cooperative method for the multi-state continuous action space, which comprises:
an action set initialization module: for any state s ∈ S, initializing the sampled action set A_i(s) as n actions randomly sampled from the continuous action space A_i of agent i;
a parameter initialization module: for any state s ∈ S and any action a ∈ A_i(s), initializing the expected return Q_i(s, a) of agent i for action a in state s, the historical maximum return Q_i^max(s, a), the weighted average return E_i(s, a), the average expected return V_i(s) of agent i in state s, the probability π_i(s, a) of selecting action a in state s, the estimate F_i(s, a) of the frequency with which the maximum return occurs, and the exploration rate l_i(s) to set values;
a convergence module: for repeatedly executing the following units until the return of agent i in state s converges;
an action set correction unit: for judging whether the action set needs to be updated; if not, executing the policy evaluation and update unit; if so, resampling the action set, keeping the action with the maximum return and sampling new actions within a certain range of that action as the new action set, and then executing the policy evaluation and update unit;
a policy evaluation and update unit: for any a ∈ A_i(s), updating π_i(s, a) and Q_i(s, a) according to the selection principle that the action with the highest return is selected with the highest probability;
a state update unit: for updating the state s ← s'.
In a further improvement of the invention, the action set correction unit performs resampling through the collaborative sampling strategy to update the action set, and employs a variable exploration rate l_i(s) to control the range within which the collaborative sampling strategy samples new actions around the maximum-return action.
In a further improvement of the invention, the policy evaluation and update unit adopts the multi-state recursive frequency maximum Q-value learning algorithm for learning and updating.
Compared with the prior art, the beneficial effects of the invention are as follows: the method solves the collaboration problem of Markov games in a continuous action space; the collaborative resampling strategy handles the continuous action space by resampling the available action set, and the multi-state recursive frequency maximum Q-value learning algorithm evaluates the sampled action set and gives the corresponding collaboration policy. By designing corresponding collaboration mechanisms for these two parts respectively, the invention can well handle the collaboration problem of multiple agents in a continuous action space.
Drawings
FIG. 1 is a schematic diagram of the structure of the present invention;
FIG. 2 is a flow chart of the method of the present invention;
FIG. 3 is a schematic diagram of the continuous version of the CG game;
FIG. 4 is an experimental comparison of SMC, rFMQ, CALA and SCC-rFMQ in the continuous version of the CG game;
FIG. 5 is an experimental comparison of SMC, rFMQ, CALA and SCC-rFMQ in the continuous version of the PSCG game;
FIG. 6 is a schematic illustration of the multi-agent ship river-crossing game;
FIG. 7 is an experimental comparison of SCC-rFMQ and SMC in the ship river-crossing game.
Detailed Description
The invention will be described in further detail with reference to the drawings and examples.
Aiming at the multi-agent cooperation problem in a continuous action space, the invention provides an independent-learning reinforcement learning method for agents, called SCC-rFMQ (Sample Continuous Coordination with recursive Frequency Maximum Q-Value). SCC-rFMQ divides the coordination process in the continuous action space into two layers: an action set modification layer (Action Set Modification) and a policy evaluation and update layer (Policy Evaluation and Updating). The first layer is inspired by the sampling idea in [20]; a resampling mechanism is specifically designed so that the sampling process can better cope with changes in the other agents' strategies. The new resampling mechanism preserves the best action in each state of each agent, while using a variable exploration rate to control the probability distribution of resampling and the convergence time of the resampling mechanism. The variable exploration rate is adjusted using the Win or Learn More (WoLM) principle to cope with the non-stationarity and randomness problems arising from multi-agent collaboration during action set updates [9]. At the policy evaluation and update layer, the rFMQ strategy [9] is introduced into the PHC (Policy Hill-Climbing) algorithm [24] so that it can handle multi-agent collaboration problems in a multi-state environment. Finally, the performance of the SCC-rFMQ learning algorithm is analysed by comparison with other reinforcement learning methods.
Next, the technology and essential basic concepts to which the present invention is applied will be described:
1. Continuous-action collaborative Markov games
A Markov game is the basis of multi-agent reinforcement learning research and is a combination of a repeated game (Repeated Game) and a Markov decision process (Markov Decision Process). A Markov game can typically be represented by the five-tuple ⟨S, N, A_i, T, R_i⟩:
S: the state set;
N: the agent set;
A_i: the action space of agent i;
T: S × A × S → [0, 1]: the state transition function;
R_i: S × A → ℝ: the return function of agent i;
where A = A_1 × … × A_N and A_i = [0, 1] for any i ∈ N.
In a Markov game, all agents can observe the full state s. The state transition function and the return function depend on the joint action of the agents. A Markov game is called a collaborative Markov game (or coalition game) when the agents can be divided, according to the environment they are in, into several mutually competing groups, with each group of agents jointly completing the same goal. In particular, if the returns of all agents in a Markov game are always equal, the game is called a fully collaborative Markov game. The action space A_i of an agent can be either continuous or discrete. Most current reinforcement learning work focuses on discrete action spaces. However, in real environments, for example in some high-precision control problems, a slight change in the action may cause a significant loss. To solve such problems, a simple discretization of the continuous action space is often unsatisfactory, and one often needs to search for the optimal strategy directly in the whole continuous space.
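For concreteness, the five-tuple ⟨S, N, A_i, T, R_i⟩ can be written down as a data structure. The following is a minimal Python sketch under the conventions of this section (A_i = [0, 1] for every agent, and the transition represented as a sampling function rather than the probability T(s, a, s')); the class and field names are illustrative and not taken from the patent.

```python
from dataclasses import dataclass
from typing import Callable, Sequence, Tuple

Action = float                     # a_i lies in the continuous interval [0, 1]
JointAction = Tuple[Action, ...]   # one action per agent

@dataclass
class MarkovGame:
    """Minimal container for the five-tuple <S, N, A_i, T, R_i>."""
    states: Sequence[str]                                   # S
    n_agents: int                                           # |N|
    action_space: Tuple[float, float]                       # A_i = [0, 1] for every agent
    transition: Callable[[str, JointAction], str]           # samples s' given (s, joint a)
    returns: Sequence[Callable[[str, JointAction], float]]  # R_i: S x A -> R, one per agent

def is_fully_collaborative_at(game: MarkovGame, state: str, joint: JointAction) -> bool:
    """In a fully collaborative Markov game all agents receive the same return."""
    values = [R(state, joint) for R in game.returns]
    return max(values) - min(values) < 1e-12
```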
In a collaborative Markov game, a common goal of agent learning is to find a Pareto-optimal solution through independent learning. A Pareto-optimal solution means that under this strategy profile, no agent can obtain a higher expected return without the return of some other agent decreasing. Formally, a strategy profile (π_i, π_{-i}) is Pareto optimal if and only if there exists no strategy profile (π'_i, π'_{-i}) such that R_i(π'_i, π'_{-i}) ≥ R_i(π_i, π_{-i}) for every agent i, with the inequality strict for at least one agent. Unlike a Nash equilibrium strategy, Pareto optimality looks at the problem from a static, global perspective and is the optimal solution of the problem, while a Nash equilibrium is a temporary solution reached from a dynamic, local perspective during problem solving. The two concepts do not necessarily coincide; for example, in the prisoner's dilemma game the defection strategy (D, D) is a Nash equilibrium while the cooperation strategy (C, C) is Pareto optimal. It should be noted, however, that a Pareto optimum is not equivalent to the group-optimal solution, i.e. the solution that maximizes the sum of the individual agents' returns.
In a collaborative Markov game, the biggest challenge for reinforcement learning is how to let the agents learn the Pareto-optimal Nash equilibrium through independent learning. As already described above, when each agent learns with its own optimal interest as the target, the result will in most cases eventually converge to a Nash equilibrium. If a game contains multiple Nash equilibria, finding the Pareto-optimal Nash equilibrium is the more reasonable choice. Two other factors that may prevent multiple agents from finding the optimal collaboration policy are the non-stationarity problem and the stochasticity problem [9]. In single-agent learning, the Markov property and stationarity are necessary conditions for an algorithm to learn the optimal solution; in a multi-agent collaborative environment, however, neither condition holds any longer. Meanwhile, if the return function is stochastic, noise in the environment and policy changes of the other agents both affect the return, so that an agent cannot determine the source of a change in return. In conventional single-agent reinforcement learning algorithms, the non-stationarity and randomness problems may therefore prevent the algorithm from learning a good collaborative strategy.
The invention provides a collaborative algorithm for learning the Pareto-optimal Nash equilibrium of collaborative Markov games in a continuous action space.
2. Policy hill-climbing algorithm (Policy Hill Climbing, PHC)
The policy hill-climbing algorithm (PHC, Policy Hill Climbing) is a simple gradient-ascent extension of Q-learning for learning mixed strategies. Compared with Q-learning, a PHC agent maintains the state-action value estimate Q(s, a) together with a mixed strategy π(s, a) representing the probability of selecting action a in state s. The agent interacts with the environment and the other agents, obtaining a return r and a next state s'. The estimate Q(s, a) and the strategy π(s, a) are then updated as follows:
Q(s, a) ← (1 − α)Q(s, a) + α(r + γ max_{a'} Q(s', a')),
π(s, a) ← π(s, a) + Δ_{sa}, where Δ_{sa} = −δ_{sa} if a ≠ argmax_{a'} Q(s, a'), and Δ_{sa} = Σ_{a'≠a} δ_{sa'} otherwise, with δ_{sa} = min(π(s, a), δ/(|A_s| − 1)),
where α and δ are learning rates, γ is the discount factor, and |A_s| is the number of actions that the agent can choose in state s. The update of the estimate Q(s, a) is consistent with the Q-learning algorithm. As can be seen from the above equations, the mixed strategy is updated by gradually increasing the probability of selecting the action with the maximal Q value and decreasing the probability of selecting the other actions. Note that when the learning rate δ takes its maximum value of 1, PHC is equivalent to the Q-learning algorithm. In a multi-agent learning environment, a PHC agent can learn a mixed strategy that is a best response to the other agents.
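As an illustration of the PHC update just described, the following Python sketch performs the Q-learning update and the hill-climbing policy step for one observed transition, following the standard rule of [24]; the tabular dictionaries and parameter values are illustrative assumptions.

```python
def phc_update(Q, pi, s, a, r, s_next, actions, alpha=0.5, delta=0.1, gamma=0.9):
    """One PHC step: Q-learning update followed by a hill-climbing policy step.

    Q  : dict mapping (state, action) -> value estimate Q(s, a)
    pi : dict mapping (state, action) -> selection probability pi(s, a)
    """
    # Q-learning update: Q(s,a) <- (1-alpha)Q(s,a) + alpha(r + gamma * max_a' Q(s',a'))
    best_next = max(Q[(s_next, b)] for b in actions)
    Q[(s, a)] = (1 - alpha) * Q[(s, a)] + alpha * (r + gamma * best_next)

    # Policy hill-climbing: move probability mass toward the current greedy action.
    greedy = max(actions, key=lambda b: Q[(s, b)])
    step = delta / max(1, len(actions) - 1)
    for b in actions:
        if b == greedy:
            continue
        moved = min(pi[(s, b)], step)      # delta_sa = min(pi(s,b), delta/(|A_s|-1))
        pi[(s, b)] -= moved
        pi[(s, greedy)] += moved
    return Q, pi
```

For example, with two actions in a single state, repeatedly calling phc_update with rewards that favour one action drives that action's probability toward 1.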
3. SMC learning algorithm for continuous action spaces
At present, there has been some research on continuous action space reinforcement learning problems in real environments. SMC learning [20] is one of the representative algorithms. It is an Actor-Critic method that approximates the probability distribution over actions selected in a continuous space by a Sequential Monte Carlo (SMC) sampling method. We briefly describe the learning flow of the algorithm.
The SMC learning algorithm is a sample-based Actor-Critic approach. The main idea of the Actor-Critic architecture is that the estimation of the strategy and the update of the strategy are learned in separate processes: the part handling strategy estimation is called the Critic, and the part updating the strategy is called the Actor. In the SMC learning algorithm, the Actor is a random strategy based on Monte Carlo sampling. Specifically, each state s is associated with a discrete action set A(s) obtained by random sampling in the continuous action space. Each action sample a_i ∈ A(s) in the set corresponds to an importance weight ω_i. During action sampling, the Actor randomly selects an action from the action set A(s) according to the weights ω_i. The Critic then estimates the action value function Q of the corresponding state from the returns. Finally, the Actor updates the probability distribution of the strategy according to the importance sampling principle (Importance Sampling principle), based on the value estimate provided by the Critic. The importance sampling principle means that points near an action with a high expected return should also be sampled with relatively higher probability.
The weights corresponding to all actions a ∈ A(s) in state s are an estimate of the policy density function of the continuous action space in that state. When the weights corresponding to some actions in the action set A(s) become particularly small or particularly large, it means that the set A(s) contains some actions with particularly small returns, so the Actor needs to resample some new actions to replace these bad actions. The resampling rule here also follows the importance sampling principle, i.e. points near an action with a high expected return should be sampled with relatively higher probability. Because the weight of a sample is proportional to the expected return of the action, the probability density around points with high expected return should also be higher, and these actions need to be sampled and performed more frequently to increase the expected return.
The SMC learning algorithm is designed to learn the optimum of Markov decision problems over a continuous action space in a single-agent environment; as stated above, it cannot be directly applied in a multi-agent environment. In the work below, based on the resampling mechanism of SMC, we design a reinforcement learning algorithm for the multi-agent cooperation problem in continuous action spaces.
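The following sketch illustrates only the sample-and-reweight idea described in this section: the Actor selects an action in proportion to its importance weight and places new samples near high-weight actions. It is a schematic Python illustration of the importance sampling principle, not the exact SMC learning algorithm of [20]; all function names and parameters are assumptions.

```python
import random

def select_by_weight(actions, weights):
    """Actor step: pick one sampled action with probability proportional to its weight."""
    return random.choices(actions, weights=weights, k=1)[0]

def resample_near_good_actions(actions, weights, n, noise=0.05, low=0.0, high=1.0):
    """Resample n candidate actions, placing more samples near high-weight actions
    (importance sampling principle): the neighbourhood of a well-performing action
    should be sampled with higher probability."""
    new_actions = []
    for _ in range(n):
        centre = select_by_weight(actions, weights)
        new_actions.append(min(high, max(low, centre + random.gauss(0.0, noise))))
    return new_actions

# Toy usage: three candidate actions; the one with weight 0.7 attracts most new samples.
acts, w = [0.2, 0.5, 0.8], [0.1, 0.2, 0.7]
print(resample_near_good_actions(acts, w, n=5))
```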
4. rFMQ algorithm in collaborative gaming
There is currently a significant amount of research on the cooperation problem in multi-agent environments. Some of these efforts apply improved Q-learning algorithms to collaborative multi-agent systems, such as distributed Q-learning [3] and frequency maximum Q-value learning (FMQ) [25]. An improved version of FMQ, the recursive FMQ algorithm (rFMQ) [9], is introduced here to help solve the multi-agent collaboration problem.
rFMQ is a single-state multi-agent reinforcement learning algorithm designed for matrix games. In rFMQ, for each action a the algorithm computes an updated value function Q(a) and records the maximum return Q_max(a) ever received by the agent under that action (note that in a single-state environment the Q value is an estimate of the immediate return r). The frequency F(a) is an estimate of the proportion of times the maximum return is received among all returns obtained when selecting action a.
Specifically, F(a) is updated recursively during learning with learning rate α_f: when the received return r equals the recorded maximum Q_max(a), F(a) ← (1 − α_f)·F(a) + α_f; otherwise F(a) ← (1 − α_f)·F(a),
where r is the return received when selecting action a in the current state; note that in the multi-agent environment it is determined by the joint action of the agents. The key idea of rFMQ is to trade off the expected return Q(a) and the highest return Q_max(a) under this action according to the frequency F(a), i.e. E(a) = (1 − F(a))·Q(a) + F(a)·Q_max(a), and then to select the next action according to the weighted return E(a). Action selection in the original rFMQ follows the ε-greedy principle: with probability ε an action is selected uniformly at random from the action space, and with probability 1 − ε the action with the highest E value is selected.
In short, by selecting actions with the weighted value E(a) of Q(a) and Q_max(a), rFMQ increases the probability of selecting the action that has exhibited the greatest return. In collaborative games, the joint action of the other agents corresponding to that maximum return is often a good choice, so this helps the agent cope with the Pareto selection problem and the non-stationarity problem in multi-agent cooperation. In a stochastic environment it is possible that an action once produced a very high return although its expected return is not the best; the recursively decaying weight design in rFMQ can effectively prevent the algorithm from converging to such an action. Experiments in the work of Matignon et al. [9] verify that the algorithm can solve the non-stationarity and randomness problems encountered in collaborative learning of partially stochastic matrix games.
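The bookkeeping of rFMQ described above can be sketched as a small Python class that records Q(a), Q_max(a) and F(a), combines them into E(a) = (1 − F(a))Q(a) + F(a)Q_max(a), and selects actions ε-greedily on E. This is an illustrative sketch: the exact recursive form of the F(a) update, in particular how a newly observed maximum is handled, is an assumption, since the original formula appears only as a figure.

```python
import random

class RFMQ:
    """Single-state recursive FMQ bookkeeping (illustrative sketch)."""

    def __init__(self, actions, alpha=0.1, alpha_f=0.1, eps=0.1):
        self.alpha, self.alpha_f, self.eps = alpha, alpha_f, eps
        self.Q = {a: 0.0 for a in actions}                # expected return estimate
        self.Qmax = {a: float("-inf") for a in actions}   # best return seen under a
        self.F = {a: 0.0 for a in actions}                # frequency of receiving Qmax(a)

    def update(self, a, r):
        self.Q[a] = (1 - self.alpha) * self.Q[a] + self.alpha * r
        if r > self.Qmax[a]:
            self.Qmax[a] = r                              # new best return observed under a
        # recursive frequency estimate with learning rate alpha_f
        hit = 1.0 if r == self.Qmax[a] else 0.0
        self.F[a] = (1 - self.alpha_f) * self.F[a] + self.alpha_f * hit

    def E(self, a):
        return (1 - self.F[a]) * self.Q[a] + self.F[a] * self.Qmax[a]

    def select(self):
        acts = list(self.Q)
        if random.random() < self.eps:
            return random.choice(acts)                    # explore uniformly
        return max(acts, key=self.E)                      # exploit the weighted return E(a)
```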
The biggest limitation of this algorithm is that it can only be applied to pure strategies in matrix games. Therefore, the invention introduces the recursive maximum priority idea of rFMQ into the PHC algorithm to solve the collaborative learning problem of mixed strategies in multi-state settings.
5. Piecewise bilinear interpolation algorithm
The piecewise bilinear interpolation algorithm, also called the bilinear interpolation algorithm, is an extension of linear interpolation to interpolating functions of two variables; its core idea is to perform linear interpolation in the two directions separately. As an interpolation algorithm in numerical analysis, piecewise bilinear interpolation is widely applied in signal processing, digital image processing and video processing.
As shown in FIG. 1, for the target point C = (x, y) (the middle point in the figure), the four nearest known points adjacent to it are A_11 = (x_1, y_1), A_12 = (x_1, y_2), A_21 = (x_2, y_1) and A_22 = (x_2, y_2), with values f(A_11), f(A_12), f(A_21) and f(A_22). These four points form a rectangle with sides parallel to the coordinate axes, and the value f(C) is computed with the bilinear interpolation algorithm. First, two linear interpolations give the interpolated values at the points B_1 = (x, y_1) and B_2 = (x, y_2) (the points on either side of C in the vertical direction in the figure); a further linear interpolation between these two points then gives the interpolated value at C = (x, y). Specifically,
f(B_1) ≈ ((x_2 − x)/(x_2 − x_1))·f(A_11) + ((x − x_1)/(x_2 − x_1))·f(A_21),
f(B_2) ≈ ((x_2 − x)/(x_2 − x_1))·f(A_12) + ((x − x_1)/(x_2 − x_1))·f(A_22),
f(C) ≈ ((y_2 − y)/(y_2 − y_1))·f(B_1) + ((y − y_1)/(y_2 − y_1))·f(B_2).
The above steps can be written compactly in matrix form:
f(C) ≈ 1/((x_2 − x_1)(y_2 − y_1)) · [x_2 − x, x − x_1] · [[f(A_11), f(A_12)], [f(A_21), f(A_22)]] · [y_2 − y, y − y_1]^T.
Piecewise bilinear interpolation can generate continuous functions and is often used to scale images. The invention uses the piecewise bilinear interpolation algorithm to make two classical discrete matrix games continuous, and then uses the continuous games to examine the performance of the SCC-rFMQ method.
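The interpolation steps above translate directly into code. The following Python sketch computes f(C) for a point (x, y) inside the rectangle spanned by (x_1, y_1) and (x_2, y_2); it is a straightforward rendering of the formulas in this section, with illustrative argument names.

```python
def bilinear(x, y, x1, y1, x2, y2, f11, f12, f21, f22):
    """Bilinear interpolation of f at C=(x, y) from the four corner values
    f11=f(x1,y1), f12=f(x1,y2), f21=f(x2,y1), f22=f(x2,y2)."""
    # interpolate along x at heights y1 and y2 (points B1 and B2)
    fB1 = ((x2 - x) * f11 + (x - x1) * f21) / (x2 - x1)
    fB2 = ((x2 - x) * f12 + (x - x1) * f22) / (x2 - x1)
    # interpolate along y between B1 and B2
    return ((y2 - y) * fB1 + (y - y1) * fB2) / (y2 - y1)

# Example: interpolate at the centre of the unit square with corner values 0, 0, 1, 1.
print(bilinear(0.5, 0.5, 0.0, 0.0, 1.0, 1.0, f11=0.0, f12=0.0, f21=1.0, f22=1.0))  # 0.5
```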
The present invention is described in detail below:
Inspired by the sampling mechanism of SMC learning and the cooperation method of the rFMQ algorithm, the invention provides a reinforcement learning method, called SCC-rFMQ (Sample Continuous Coordination with recursive Frequency Maximum Q-Value), in which each agent learns independently in the multi-agent cooperation problem over a continuous action space. It should be noted that SCC-rFMQ is not a simple combination of SMC learning and the rFMQ algorithm. Specifically, SCC-rFMQ divides the collaboration problem over the continuous action space in a Markov game into two layers of learning: an action set correction layer (Action Set Modification layer) and a policy evaluation and update layer (Policy Evaluation and Updating layer). Aiming at the Pareto selection problem, the non-stationarity problem and the randomness problem that may be encountered in collaborative learning, the invention introduces a corresponding collaboration mechanism in each of the two layers: a variable exploration rate mechanism and a recursive maximum priority mechanism. At the action set correction layer, a new resampling strategy is proposed, the collaborative resampling strategy (Coordination Resample), which uses a variable exploration rate to solve the action selection problem in a continuous action space in a multi-agent environment. At the policy evaluation and update layer, the invention introduces the recursive maximum priority idea of the rFMQ algorithm into the policy hill-climbing algorithm (PHC) [24] to solve the policy evaluation problem in Markov games.
As an embodiment of the present invention, the specific algorithm framework of SCC-rFMQ in this example is shown in Algorithm 1.
Algorithm 1: learning dynamics of an n-sample SCC-rFMQ agent i
(The pseudocode of Algorithm 1 is provided as a figure in the original patent document.)
As shown in FIG. 2, in the SCC-rFMQ of the present invention, the sampled action set is first initialized (step (1)). In this example, the initial sampled action set A_i(s) of each state can be set to n actions sampled at equal intervals from the continuous action space A_i. Step (2): the other parameters are initialized according to the values set in step (2), where Q_i(s, a), Q_i^max(s, a) and E_i(s, a) are respectively the expected return, the historical maximum return and the weighted average return of action a for agent i in state s, V_i(s) is the average expected return of agent i in state s, π_i(s, a) is the probability of selecting a in state s, F_i(s, a) is an estimate of the frequency with which the maximum return occurs, and l_i(s) is the exploration rate. Step (3) is the main learning process of the method of the invention. In each round, learning in every non-terminal state comprises two key steps: an action set correction step (step 3.2.1) and a policy evaluation and update step (step 3.2.2). In the action set correction step, the algorithm first judges whether the action set needs to be updated; if so, the action set is updated, and if not, this layer is skipped. This example corrects the action set with the collaborative sampling strategy (see Algorithm 2). The judgment condition may depend on the environment; in the experimental part of this example, the invention uses a simple condition, namely the action set is updated once every fixed number of learning rounds (c = 200), since for PHC-type algorithms with a fixed policy update rate, c = 200 rounds is sufficient to learn a relatively accurate estimate. This is followed by the policy evaluation and update layer: for any action a ∈ A_i(s), Q_i(s, a) is evaluated and the policy π_i(s, a) is updated according to the modified PHC algorithm, i.e. the multi-state rFMQ (see Algorithm 3). The other steps of SCC-rFMQ are the same as in traditional multi-state multi-agent Markov game reinforcement learning algorithms and are therefore not described in detail. Algorithms 2 and 3 are described in detail below.
1. Collaborative resampling strategy
The first key step of SCC-rFMQ in this example is the collaborative resampling strategy (Coordination Resample strategy). At the action set correction layer, two problems need to be considered: one is how to find an action set better than the current one; the other is how to cooperate effectively so that the algorithm eventually learns the best action. For the first problem, the collaborative resampling strategy keeps the action with the largest return in each resampling and then resamples n − 1 new actions, according to a certain strategy, within a certain range around that action; in this way the invention increases the probability of sampling better actions while ensuring that the action set is not damaged. For the second problem, the invention uses a variable exploration rate to control the range within which the algorithm samples new actions around the maximum-return action. The size of the exploration rate is controlled according to the WoLM (Win or Learn More) principle, which is inspired by the WoLF (Win or Learn Fast) principle of Bowling et al. [24]. Intuitively, the WoLM principle selects a smaller exploration range when the average expected return of the current action set is larger than the historical average expected return of the previous action sets, and otherwise enlarges the exploration range. The original purpose of this design is to increase the probability of sampling around better actions, as in the importance sampling principle of SMC learning. The difference is that the WoLM strategy gives an opportunity for a larger exploration rate, which is critical in a multi-agent environment, because the environment is non-stationary and the action with the current maximum estimate may become bad in the future. In addition, the WoLM strategy can increase the probability that the algorithm finds the global optimum. Algorithm 2 gives the specific procedure of the collaborative resampling strategy.
Algorithm 2: collaborative resampling strategy
(The pseudocode of Algorithm 2 is provided as a figure in the original patent document.)
First, the exploration rate l_i(s) is updated using the WoLM principle (step 1): if the average expected return of the current action set is greater than or equal to the cumulative average expected return V_i(s) of the previous action sets, the exploration rate l_i(s) is decreased to l_i(s)·δ_d (δ_d < 1); otherwise it is increased to l_i(s)·δ_l (δ_l > 1). Here δ_l and δ_d are two positive real numbers. In Algorithm 1 the initial value of l_i(s) is set to 1/2 so that the algorithm has a larger exploration range in its early stage. Step 2 then updates the cumulative average expected return V_i(s), where α_s is the learning rate. The action set is then resampled according to the exploration rate (step 3): the |A_i(s)|/3 actions with the largest expected return in the current set are kept, 2|A_i(s)|/3 new actions are randomly selected from the neighbourhood of a_max with radius l_i(s) according to the uniform distribution U[a_max − l_i(s), a_max + l_i(s)], and, together with a_max, these form the new action set. Note that this step can also sample according to other probability distributions, for example a normal distribution with mean a_max and standard deviation l_i(s). The top 1/3 of actions with the largest returns are kept in order to increase the accuracy of the algorithm. Finally, the policy π_i(s, a) and the corresponding expected return Q_i(s, a) of each new action are initialized. Note that the Q value and π value of the original points are not directly inherited: on the one hand, in a multi-agent environment the return of an agent's policy changes as the opponents' policies change; on the other hand, this increases the number of times the other actions are sampled within the limited exploration, which ensures an accurate estimate of the action returns. In summary, with this policy mechanism, the collaborative algorithm can effectively handle the non-stationarity and randomness problems encountered in collaborative games at the sampling stage, and can avoid premature convergence to a local optimum (a local optimum from the individual perspective, i.e. the Pareto selection problem from the multi-agent perspective).
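The three steps of Algorithm 2 can be rendered as the following Python sketch for one state of one agent: the WoLM exploration-rate update, the cumulative-average update, and the keep-the-best-third-plus-uniform-resample step. It is an illustrative sketch assembled from the description above: the use of the mean of Q_i(s, a) as the current average expected return, the ranking of kept actions by Q versus E, the constant action-set size and the parameter values are all assumptions.

```python
import random

def coordination_resample(actions, Q, E, V, l, a_low, a_high,
                          delta_d=0.8, delta_l=1.25, alpha_s=0.1):
    """One collaborative resampling step for a single state (illustrative sketch).

    actions : current sampled action set A_i(s)
    Q, E    : dicts action -> expected return / weighted return
    V       : cumulative average expected return V_i(s)
    l       : exploration rate l_i(s)
    Returns the new action set together with the updated V and l.
    """
    # Step 1 (WoLM): shrink the exploration range when the current set is doing
    # at least as well as the historical average, otherwise enlarge it.
    current_avg = sum(Q[a] for a in actions) / len(actions)
    l = l * delta_d if current_avg >= V else l * delta_l

    # Step 2: update the cumulative average expected return.
    V = (1 - alpha_s) * V + alpha_s * current_avg

    # Step 3: keep the best third of the actions and draw the rest uniformly
    # from the l-neighbourhood of the current best action a_max.
    a_max = max(actions, key=lambda a: E[a])
    keep = sorted(actions, key=lambda a: Q[a], reverse=True)[:max(1, len(actions) // 3)]
    n_new = len(actions) - len(keep)
    fresh = [min(a_high, max(a_low, random.uniform(a_max - l, a_max + l)))
             for _ in range(n_new)]
    return keep + fresh, V, l
```

Step 4 of Algorithm 2 (re-initialising π_i(s, a) and Q_i(s, a) for the new actions) would follow the call.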
2. Multi-state rFMQ strategy
The second key part of SCC-rFMQ is the multi-state rFMQ strategy. Combining the recursive maximum priority idea of rFMQ, the invention extends the PHC algorithm [24] to multi-agent collaborative games. The PHC algorithm is a reinforcement learning algorithm for learning mixed strategies in a multi-state multi-agent environment; although it may not converge in a competitive environment, it does converge in collaborative games. Meanwhile, introducing the recursive maximum priority idea also helps the algorithm learn better strategies under independent learning of the agents. In addition, thanks to the resampling mechanism of SCC-rFMQ, this step does not in principle need to guarantee strict convergence; it only needs to select the action with the highest return with the highest probability. Algorithm 3 is a specific description of the multi-state rFMQ strategy.
Algorithm 3: multi-state rFMQ strategy for agent i
(The pseudocode of Algorithm 3 is provided as a figure in the original patent document.)
In the method, it is first judged whether the current action set has been updated; if so, F_i(s, a), Q_i^max(s, a) and E_i(s, a) corresponding to all actions in the current state are initialized, otherwise this is skipped (step 1). Then an action a is selected according to the mixed strategy π_i(s, a) with a certain exploration probability (step 2). The return r and the next state s' are then observed from the environment, and the state-action value Q_i(s, a) corresponding to the current s and a is updated according to the Q-learning method (step 3). These two steps are the same as in traditional Q-learning and are not described in detail here. Steps 4 and 5 estimate E_i based on the recursive maximum priority and update the policy π_i(s, a), where the parameters α_F, α_π ∈ (0, 1) are learning rates. Unlike rFMQ, the long-term maximum return Q_i^max(s, a) is used here to express the maximum long-term return that action a has obtained in state s; that is, Q_i^max(s, a) records the maximum value of r + γ max_{a'} Q_i(s', a') under action a. This is a natural extension of Q-learning to a multi-state environment. The updates of the other variables, such as F_i(s, a), Q_i^max(s, a) and E_i(s, a), are the same as in the original rFMQ. Finally, the policy π_i(s, a) is updated according to E_i(s, a) using the PHC policy (step 5), i.e. the probability of selecting the action with the maximum E_i(s, a) value is increased while the probabilities of selecting the other actions are decreased. By introducing the single-state rFMQ recursive maximum priority idea into PHC learning, the algorithm can effectively solve the multi-agent cooperation problem in complex environments.
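The per-transition bookkeeping of Algorithm 3 can be sketched in Python as follows: a Q-learning update of Q_i(s, a), a recursive update of Q_i^max(s, a) and F_i(s, a), the weighted return E_i(s, a), and a PHC-style policy step driven by E_i. This is an illustrative sketch of the update order described above; the exact recursive form of the F_i update and the handling of a newly observed maximum follow the original rFMQ only approximately and are assumptions, as are all container names and parameter values.

```python
def multi_state_rfmq_step(s, a, r, s_next, actions, Q, Qmax, F, E, pi,
                          alpha=0.5, alpha_f=0.1, alpha_pi=0.1, gamma=0.9):
    """One update of the multi-state rFMQ strategy for agent i (illustrative sketch)."""
    # Step 3: Q-learning update of Q_i(s, a).
    best_next = max(Q[(s_next, b)] for b in actions[s_next])
    target = r + gamma * best_next
    Q[(s, a)] = (1 - alpha) * Q[(s, a)] + alpha * target

    # Step 4: recursive maximum-priority bookkeeping.
    if target > Qmax[(s, a)]:
        Qmax[(s, a)] = target                      # long-term maximum return under a
    hit = 1.0 if target >= Qmax[(s, a)] else 0.0
    F[(s, a)] = (1 - alpha_f) * F[(s, a)] + alpha_f * hit
    E[(s, a)] = (1 - F[(s, a)]) * Q[(s, a)] + F[(s, a)] * Qmax[(s, a)]

    # Step 5: PHC-style policy step -- favour the action with the largest E_i(s, a).
    acts = actions[s]
    greedy = max(acts, key=lambda b: E[(s, b)])
    step = alpha_pi / max(1, len(acts) - 1)
    for b in acts:
        if b == greedy:
            continue
        moved = min(pi[(s, b)], step)
        pi[(s, b)] -= moved
        pi[(s, greedy)] += moved
```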
Finally, experiments and simulations are performed on the effects of the invention, thereby illustrating the performance and technical effects thereof.
1. Experiment and simulation
The performance of the SCC-rFMQ algorithm of the present invention is examined in this example by comparison with other related algorithms. Note that most related work is based on two-agent environments [7-9]. The invention constructs two representative multi-agent games for the single-state and multi-state settings respectively: for the single-state setting, two continuous action space games meeting the non-stationarity and randomness requirements are created from two classical matrix games; for the multi-state setting, the present invention uses a modified version of the ship river-crossing game [20,26] to verify the performance of the algorithm.
1.1 Single-state collaborative games: the Climbing Game
First consider a single-state game environment. The present invention converts the two classical matrix games in Tables 1 and 2 into continuous action space games using the piecewise bilinear interpolation technique. These two matrix games have received much attention in collaborative learning research on discrete actions because, despite their simplicity, they exhibit various properties that lead to collaboration failure, such as non-stationarity and randomness [27-29]. CG is commonly used to test an algorithm's ability to handle the non-stationarity problem, while PSCG can also be used to test the ability to handle the randomness problem.
Table 1: Climbing Game (CG)
(The payoff matrix of Table 1 is provided as a figure in the original patent document.)
Table 2: Partially Stochastic Climbing Game (PSCG)
(The payoff matrix of Table 2 is provided as a figure in the original patent document.)
1.1.1 game description:
The Partially Stochastic Climbing Game (PSCG, Partially Stochastic CG, Table 2) proposed by Kapetanakis et al. [25] is a variant of the CG game. Unlike CG, the joint action <B, B> in PSCG yields a return of 14 or 0 with equal probability. Statistically, the average returns of PSCG and CG are equal, because if the agents keep selecting <B, B> in either game, the average return is 7. Although the two games are simple, they have some interesting characteristics. First, CG and PSCG each have two Nash equilibrium points, namely <A, A> and <B, B>, of which <A, A> is Pareto optimal.
To verify the ability of the SCC-rFMQ of the present invention to solve the collaboration problem in a continuous action space, the two games first need to be made continuous. The actions in CG and PSCG are represented by a continuous variable a_i ∈ [0, 1], where a_i = 0, a_i = 0.5 and a_i = 1 represent actions A, B and C, respectively. Meanwhile, the return r of the agents is defined as a mapping R: [0, 1] × [0, 1] → ℝ. The mapping satisfies that for a_i ∈ {0, 0.5, 1}, r(a_1, a_2) is equal to the corresponding return in the original CG and PSCG; for the other joint actions, the invention uses the piecewise bilinear interpolation algorithm [30] to make the return function continuous. In the field of numerical analysis, piecewise bilinear interpolation is an extension of linear interpolation over the domain of a function of two variables. FIG. 3 shows the return function of the CG game over the action space after it has been made continuous with the piecewise bilinear interpolation algorithm. In FIG. 3, the coordinates a_1 and a_2 represent the continuous actions of agents 1 and 2, respectively, and the magnitude of the return corresponding to a joint action is indicated by the shade of the colour. As can be seen from the figure, like the original CG game, the continuous version of CG has two equilibrium points: the global optimum r(0, 0) = 11 and the local optimum r(0.5, 0.5) = 7, where r(0, 0) = 11 is also Pareto optimal. Note that the area occupied by all points whose gradient points toward the Pareto-optimal point (0, 0), i.e. the triangle enclosed by the points (0, 0), (0.5, 0) and (0, 0.5), is less than 1/8 of the total action space, which makes a traditional gradient-based algorithm more likely to learn the local optimum than the Pareto optimum. Meanwhile, the points whose return exceeds 7 form only a small region in the lower-left corner, occupying roughly 1/1000 of the total area; as a result, a traditional learning algorithm based on discrete sampling needs a particularly fine discretization to learn a higher return, which severely increases its computation time and space cost, and learning the optimum from such a large discrete action space is itself a challenge for discrete learning algorithms. Furthermore, the gradient near the local optimum (0.5, 0.5) is almost flat (the colour changes slowly), which also affects traditional gradient-based learning algorithms and may even prevent them from learning the local optimum. Finally, if only the average return is considered, the average return near the local optimum (0.5, 0.5) is higher than that near the global optimum, which also affects learning algorithms based on random sampling: if the opponents' strategies are not taken into account, they will easily converge to the local optimum.
To make the PSCG game continuous, two deterministic continuous-action games are constructed by the piecewise bilinear interpolation algorithm for r(0.5, 0.5) = 14 and r(0.5, 0.5) = 0 respectively, and then for any joint action the return is drawn from the two deterministic games with equal probability. This is not described in further detail here.
1.1.2 experimental simulations and results
This example compares SCC-rFMQ with several classical algorithms, namely SMC [20], rFMQ [9] and CALA [16,22], on the continuous CG and PSCG games, where SMC and CALA are two classical single-agent learning algorithms for continuous action spaces, and rFMQ is a commonly used multi-agent collaborative algorithm for discrete action spaces.
The parameter settings of each learning algorithm are shown in Table 3. For fairness, the parameters that SMC and rFMQ share with SCC-rFMQ (e.g. α_Q and γ) are set to the same values, and for the other parameters the best configuration found over multiple experiments is selected. For all algorithms, the parameter α_Q is set to α_Q = 0.5 in the continuous versions of CG and PSCG and to α_Q = 0.9 in the ship river-crossing game. The parameters σ and τ of the SMC algorithm are set as in the original paper (Lazaric et al. [20]). The parameter configuration of the CALA algorithm also follows the original paper (de Jong et al. [22]).
Table 3: parameter settings
(The contents of Table 3 are provided as a figure in the original patent document.)
Fig. 4 shows the average experimental results of the methods over 50 runs on the continuous version of the CG game, where the abscissa (Interaction Round) is the number of algorithm interactions and the ordinate (Reward) is the return. The return received by the agent in each round is used as the performance evaluation index. For the rFMQ algorithm, this example uniformly takes 5 and 10 actions in the continuous space, respectively; the rewards of these actions are given by the continuous versions of CG and PSCG. For fairness, SMC and SCC-rFMQ are also initialized with 5 and 10 samples, and the initial action set of each group of experiments avoids the globally optimal point. As can be seen from Fig. 4, in all cases the SCC-rFMQ algorithm is significantly better than the other three algorithms, followed by SMC learning, with CALA performing worst. It can also be observed from Fig. 4 that SCC-rFMQ, SMC and rFMQ perform better with more samples or actions.
Fig. 5 shows the experimental results of each method on the continuous PSCG game, where the abscissa (Interaction Round) is the number of algorithm interactions and the ordinate (Average Reward) is the average return. The experimental setup is the same as for the continuous CG game. In view of the randomness of the game, the cumulative average return (ordinate) is used as the evaluation index for the experimental results of each algorithm. Again, the average results of 50 runs are reported for each algorithm. As can be seen from Fig. 5, the results are similar to those on the continuous CG game, except that more time is needed to converge. In addition, the results of SMC, rFMQ and CALA on the continuous PSCG game are slightly better than on continuous CG: their returns are closer to 7, and for SMC and rFMQ the effect of different numbers of samples or actions on the results is relatively small. In summary, the SCC-rFMQ algorithm is superior to the other algorithms in dealing with the multi-agent cooperation problem in a single-state continuous action space.
1.2 Multi-state collaborative game: the ship river-crossing game
To further examine the performance of the invention, a classical multi-state continuous action space game, the ship river-crossing game, is considered. This problem was originally proposed by Jouffe [26] and Lazaric [20]; the present invention redefines it in a more general manner.
1.2.1 Game description
The goal of the ship river-crossing game is to control the speed and direction of a ship so that it moves from the dock on one side of the river to the dock on the opposite side (as shown in FIG. 6). The ship in the game is controlled by two engines, a forward engine and a steering engine, which control the forward acceleration and the steering acceleration of the ship respectively. Here the two engines are controlled by two independent agents, so the two agents need to learn how to cooperate toward a common goal. The state of the ship is represented by the four-tuple <x, y, θ, v>, where x ∈ [0, 50] and y ∈ [0, 100] are the position coordinates of the ship, θ ∈ [−π/3, π/3] is the heading angle of the ship (its direction of travel), and v ∈ [2, 5] is its speed. To increase the difficulty of the game, the flow rate of the river is defined as E(x) = f_c[x/50 − (x/50)²], where f_c is a random variable obeying the normal distribution N(4, 0.3²). This assumption is closer to reality: the water flows fast in the middle of the river and slowly at the banks, and it has a certain randomness; at the same time it poses a challenge for learning the optimum. The actions of the two agents (controllers) are defined as two continuous variables a ∈ [−1, 2] and ω ∈ [−1, 1], where a is the forward acceleration of the ship and ω is its angular velocity. The center coordinates of the docks on the two banks are (0, 50) and (50, 50), respectively. The state variables of the ship are updated at each time step according to equation (1) (provided as a figure in the original patent document), where Π_Δ is a projection mapping that maps values into the interval Δ to prevent the variables from exceeding their specified ranges.
The reward function is defined over two domains. The success domain Z_s corresponds to the dock area on the opposite bank, while the failure domain Z_f is all other locations. The reward function is formally defined by a formula provided as a figure in the original patent document, where D(x, y) = 20 − 2|y − 50| is a function whose value decreases from 20 to 0 as the distance of the position from the center of the dock increases. Here, Z_s = {(x, y) | x = 0, y ∈ (40, 60)}.
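To make the environment definition concrete, the sketch below implements the two parts of the game that are fully specified in this section: the random river-flow term E(x) = f_c[x/50 − (x/50)²] with f_c ~ N(4, 0.3²), and the reward D(x, y) = 20 − 2|y − 50| on the success domain Z_s. The reward formula itself appears only as a figure in the original document, so the value returned on the failure domain Z_f is a placeholder assumption, as are the function names.

```python
import random

F_C_MEAN, F_C_STD = 4.0, 0.3          # f_c ~ N(4, 0.3^2), from the game definition
FAIL_REWARD = -1.0                    # placeholder: the Z_f value is a figure in the patent

def river_flow(x):
    """Random flow term E(x) = f_c * [x/50 - (x/50)^2]: fast mid-river, slow at the banks."""
    f_c = random.gauss(F_C_MEAN, F_C_STD)
    return f_c * (x / 50.0 - (x / 50.0) ** 2)

def in_success_zone(x, y):
    """Z_s = {(x, y) | x = 0, y in (40, 60)}: the dock on the opposite bank."""
    return x == 0 and 40 < y < 60

def reward(x, y):
    """D(x, y) = 20 - 2|y - 50| decays from 20 at the dock centre to 0 at its edges."""
    if in_success_zone(x, y):
        return 20.0 - 2.0 * abs(y - 50.0)
    return FAIL_REWARD                # assumed penalty on the failure domain Z_f

# Example: flow is strongest mid-river (x = 25) and the dock centre pays the most.
print(round(river_flow(25.0), 2), reward(0, 50.0), reward(0, 35.0))
```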
1.2.2 experimental simulations and results
In the experiments, the invention uses two agents to control the variables a and ω respectively, so the game can be described as a collaborative game of two agents over a multi-state continuous action space. The goal is to train the two agents to cooperate with each other to obtain as high a return as possible. The position state variables x and y are discretized with step 1, and the other state variables θ and v are each divided into 10 equal intervals over their definition ranges, giving a total of about 500,000 states for this experiment. The initial state is defined as <0, 50, 0, 0>. The experimental configuration ensures that the state of the ship always changes when it is updated according to equation (1), so that the experiment does not fall into an infinite loop over states. Next, the invention compares SCC-rFMQ with SMC [20] under different numbers of samples. In fact, the invention also ran related experiments with other classical multi-agent algorithms, such as WoLF-PHC [24], distributed Q-learning [3] and Hysteretic learners [4], and found that, due to the large state space of the game environment, these traditional discrete-action-space algorithms converge poorly in this game. Other algorithms, such as rFMQ and CALA, are only suitable for single-state games. These algorithms are therefore not compared in detail here, and only the SMC algorithm is used to represent the related algorithms designed for single-agent continuous action space game environments. The relevant parameter settings of each algorithm are shown in Table 3.
FIG. 7 shows a comparison of the experimental results of the algorithms SCC-rFMQ and SMC under different numbers of samples, where the abscissa (Episodes) is the number of learning episodes (one cycle of the game from the initial state to a terminal state counts as one episode) and the ordinate (Total Reward) is the total return. For both methods, 4 and 6 equidistantly selected actions are used, and the total return obtained by each algorithm in each round of learning is compared. All results given in the experiments are averages over 100 repetitions. Note that 20 is the theoretical maximum return. As can be seen from FIG. 7, for all sample numbers SCC-rFMQ is superior to the SMC algorithm both in convergence speed and in the final learned return. In both the 4-sample and 6-sample settings, SCC-rFMQ eventually successfully learns the maximum return of 20, followed by SMC with 6 sampled actions (which learns a return of about 19); the worst is SMC with 4 sampled actions, which converges to around 13. Furthermore, as the number of samples increases, the convergence speed of SCC-rFMQ becomes slower, whereas SMC does not show this phenomenon. In summary, in collaborative games over a continuous action space, SCC-rFMQ is superior to SMC, a learning algorithm without a cooperation mechanism.
2. Analysis of experimental results
The convergence time of the SCC-rFMQ algorithm depends on two factors: the number of sampled actions and the number of learning steps between two resamplings. A larger number of sampled actions increases the exploratory capacity of the algorithm, and therefore the probability of learning a better return. At the same time, it reduces the algorithm's ability to cope with environmental changes, because more sampled actions means fewer observations per action, which increases the time the algorithm needs to converge. This is why, in FIGS. 4, 5 and 7, the SCC-rFMQ algorithm learns more slowly with more sampled actions than with fewer.
The poor performance of the other algorithms compared in FIGS. 4 and 5 is mainly due to the low gradient of the game around the local optimum <0.5,0.5> and the high average return in its neighborhood. SMC and CALA easily fall into this local-optimum trap because they are not designed for multi-agent collaboration problems and cannot cope with environmental changes. Analysis shows that in the continuous CG and PSCG games, the region in which the gradient points toward the global optimum accounts for only 1/8 of the total area, while points with a return greater than 7 account for less than 1/1000 of the total area. For the rFMQ algorithm, it is difficult for the sampled actions to cover the global optimum, so its performance is poor. Furthermore, comparing FIGS. 3 and 4, all algorithms obtain slightly better returns in the continuous PSCG game than in the continuous CG game, because the randomness of the continuous PSCG makes the gradient around the local optimum <0.5,0.5> less flat, so each algorithm can learn the local optimum with a higher probability and thus obtains a higher return than in the continuous CG game.
As in the experiments of FIGS. 4 and 5, SCC-rFMQ performs better than SMC in the ship-crossing-the-river game, learning the global optimum faster with both 4 and 6 sampled actions. The poor performance of SMC in FIG. 7 is mainly due to the randomness of the water flow in the game (governed by the parameter f_c), which prevents the algorithm from coping with changes in the environment.
3. Summary
This chapter proposes the SCC-rFMQ method to solve the collaboration problem of Markov games in a continuous action space. The SCC-rFMQ algorithm contains two key components: a collaborative resampling strategy and a multi-state rFMQ strategy. The collaborative resampling strategy handles the continuous action space by resampling the available action sets, while the multi-state rFMQ strategy evaluates the sampled action sets and gives the corresponding collaborative policy. By introducing a corresponding cooperation mechanism in each of these two parts, the SCC-rFMQ algorithm can well handle the collaboration problem of multiple agents in a continuous action space. Extensive simulation experiments also show that SCC-rFMQ is superior to other reinforcement learning methods.
The above embodiments are preferred embodiments of the present invention, and are not intended to limit the scope of the present invention, which includes but is not limited to the embodiments, and equivalent modifications according to the present invention are within the scope of the present invention.
References corresponding to the reference numerals referred to in the present invention are as follows:
[1]Riedmiller M,Gabel T,Hafner R,et al.Reinforcement Learning for Robot Soccer[J].Auton.Robots,2009,27(1):55–73.
[2]Meng J,Williams D,Shen C.Channels matter:Multimodal connectedness,types of co-players and social capital for Multiplayer Online Battle Arena gamers[J].Computers in Human Behavior,2015,52:190–199.
[3]Lauer M,Riedmiller M A.An Algorithm for Distributed Reinforcement Learning in Cooperative Multi-Agent Systems[C].In Proceedings of the Seventeenth International Conference on Machine Learning,San Francisco,CA,USA,2000:535–542.
[4]Matignon L,Laurent G J,Fort-Piat N L.Hysteretic q-learning:an algorithm for decentralized reinforcement learning in cooperative multi-agent teams[C].In IEEE/RSJ International Conference on Intelligent Robots and Systems IROS,2007:64–69.
[5]Panait L,Sullivan K,Luke S.Lenient learners in cooperative multiagent systems[C].In International Joint Conference on Autonomous Agents and Multiagent Systems,2006:801–803.
[6]Bloembergen D,Kaisers M,Tuyls K.Empirical and Theoretical Support for Lenient Learning[C].In The 10th International Conference on Autonomous Agents and Multiagent Systems Volume 3,2011:1105–1106.
[7]Wei E,Luke S.Lenient Learning in Independent-learner Stochastic Cooperative Games[J].J.Mach.Learn.Res.,2016,17(1):2914–2955.
[8]Palmer G,Tuyls K,Bloembergen D,et al.Lenient Multi-Agent Deep Reinforcement Learning[J].CoRR,2017,abs/1707.04402.
[9]Matignon L,Laurent G j,Le fort piat N.Review:Independent Reinforcement Learners in Cooperative Markov Games:A Survey Regarding Coordination Problems[J].Knowl.Eng.Rev.,2012,27(1):1–31.
[10]Pazis J,Lagoudakis M G.Binary Action Search for Learning Continuous-action Control Policies[C].In Proceedings of the 26th Annual International Conference on Machine Learning,New York,NY,USA,2009:793–800.
[11]Pazis J,Lagoudakis M G.Reinforcement learning in multidimensional continuous action spaces[C].In IEEE Symposium on Adaptive Dynamic Programming&Reinforcement Learning,2011:97–104.
[12]Sutton R S,Maei H R,Precup D,et al.Fast Gradient-descent Methods for Temporal-difference Learning with Linear Function Approximation[C].In Proceedings of the 26th Annual International Conference on Machine Learning,2009:993–1000.
[13]Pazis J,Parr R.Generalized Value Functions for Large Action Sets[C].In International Conference on Machine Learning,ICML 2011,Bellevue,Washington,USA,2011:1185–1192.
[14]Lillicrap T P,Hunt J J,Pritzel A,et al.Continuous control with deep reinforcement learning[J].Computer Science,2015,8(6):A187.
[15]Konda V R.Actor-critic algorithms[J].SIAM Journal on Control and Optimization,2003,42(4).
[16]Thathachar M A L,Sastry P S.Networks of Learning Automata:Techniques for Online Stochastic Optimization[J].Kluwer Academic Publishers,2004.
[17]Peters J,Schaal S.2008 Special Issue:Reinforcement Learning of Motor Skills with Policy Gradients[J].Neural Netw.,2008,21(4).
[18]van Hasselt H.Reinforcement Learning in Continuous State and Action Spaces[M].In Reinforcement Learning:State-of-the-Art.Berlin,Heidelberg:Springer Berlin Heidelberg,2012:207–251.
[19]Sallans B,Hinton G E.Reinforcement Learning with Factored States and Actions[J].J.Mach.Learn.Res.,2004,5:1063–1088.
[20]Lazaric A,Restelli M,Bonarini A.Reinforcement Learning in Continuous Action Spaces through Sequential Monte Carlo Methods[C].In Conference on Neural Information Processing Systems,Vancouver,British Columbia,Canada,2007:833–840
[21]Lowe R,Wu Y,Tamar A,et al.Multi-Agent Actor-Critic for Mixed Cooperative-Competitive Environments[J].CoRR,2017,abs/1706.02275.
[22]de Jong S,Tuyls K,Verbeeck K.Artificial Agents Learning Human Fairness[C].In Proceedings of the 7th International Joint Conference on Autonomous Agents and Multiagent Systems.Volume 2,2008:863–870.
[23]Galstyan A.Continuous Strategy Replicator Dynamics for Multi-agent Q-learning[J].Autonomous Agents and Multi-Agent Systems,2013,26(1):37–53.
[24]Bowling M,Veloso M.Multiagent learning using a variable learning rate[J].Artificial Intelligence,2002,136(2):215–250.
[25]Kapetanakis S,Kudenko D.Reinforcement learning of coordination in cooperative multi-agent systems[J].AAAI/IAAI,2002,2002:326–331.
[26]Jouffe L.Fuzzy Inference System Learning by Reinforcement Methods[J].Trans.Sys.Man Cyber Part C,1998,28(3):338–355.
[27]Claus C,Boutilier C.The Dynamics of Reinforcement Learning in Cooperative Multiagent Systems[C].In Proceedings of the Fifteenth National/Tenth Conference on Artificial Intelligence/Innovative Applications of Artificial Intelligence,Menlo Park,CA,USA,1998:746–752.
[28]Lauer M,Riedmiller M.Reinforcement learning for stochastic cooperative multi-agent-systems[C].In Proceedings of the Third International Joint Conference on Autonomous Agents and Multiagent Systems,AAMAS 2004,2004:1516–1517.
[29]Carpenter M,Kudenko D.Baselines for Joint-Action Reinforcement Learning of Coordination in Cooperative Multi-agent Systems[M].In Adaptive Agents and Multi-Agent Systems II:Adaptation and Multi-Agent Learning.Berlin Heidelberg:Springer,2005:55–72.
[30]Saha Ray S.Numerical analysis with algorithms and programming[M].Boca Raton:CRC Press,Taylor&Francis Group,2016.

Claims (8)

1. A cooperation method for a multi-state continuous action space, characterized by comprising the following steps:
(1): for any state s ∈ S, initializing the sampled action set A_i(s) as n actions randomly sampled from the continuous action space A_i(s) of agent i, wherein S is the state set;
(2): for any state s ∈ S and action a ∈ A_i(s), initializing the expected return Q_i(s, a) of agent i in state s with respect to action a, the historical maximum return Q_i^max(s, a), the weighted average return E_i(s, a), and the average expected return V_i(s) of agent i in state s, and initializing the probability π_i(s, a) of selecting action a in state s, the estimate F_i(s, a) of the frequency with which the maximum return occurs, and the exploration rate l_i(s), all to set values;
(3): the following steps are repeated until the return of agent i in state s converges,
(31): initializing the state s ← s_0;
(32): repeating the following steps until the state s reaches the end state
(321): judging whether the action set needs to be updated; if not, executing step (322); if so, resampling the action set by keeping the maximum-return action and sampling new actions within a certain range of that action to form a new action set, and then executing step (322);
(322): for any a ∈ A_i(s), updating π_i(s, a) and Q_i(s, a) according to the selection principle that the action with the highest return is most probable;
(323): updating the state s ← s',
wherein in step (1), the sampled action set A_i(s) at the beginning of each state is set to n actions sampled at equal distances from the continuous action space A_i(s), and the n discrete actions are converted into continuous actions in the continuous action space by means of a piecewise bilinear interpolation algorithm.
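Read as an algorithm, steps (1)–(3) of claim 1 correspond to the outer loop sketched below in Python; the agent and environment interfaces (sample_actions, init_parameters, needs_resampling, resample_action_set, select_action, update, step) are hypothetical placeholders for the procedures detailed in claims 3 and 5.

```python
def scc_rfmq_outer_loop(agent, env, n_actions, max_episodes):
    """Sketch of steps (1)-(3): sample initial action sets, then alternate
    action-set correction and policy evaluation/update until the returns
    converge (approximated here by an episode cap)."""
    for s in env.states:                                  # step (1)
        agent.A[s] = agent.sample_actions(s, n_actions)   # n actions from A_i(s)
        agent.init_parameters(s)                          # step (2): Q, E, V, pi, F, l

    for _ in range(max_episodes):                         # step (3)
        s = env.initial_state                             # step (31)
        while not env.is_terminal(s):                     # step (32)
            if agent.needs_resampling(s):                 # step (321)
                agent.resample_action_set(s)              # keep best action, draw new ones
            a = agent.select_action(s)                    # step (322): best action most probable
            r, s_next = env.step(s, a)
            agent.update(s, a, r, s_next)                 # update pi_i(s, a) and Q_i(s, a)
            s = s_next                                    # step (323)
```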
2. The collaborative method for a multi-state continuous action space of claim 1, wherein: in step (31), resampling is performed by a collaborative sampling strategy to update the action set, and a variable exploration rate l_i(s) is used to control the collaborative sampling strategy to sample a range of new actions around the maximum-return action.
3. The collaborative method for a multi-state continuous action space of claim 2, wherein: the processing method of the collaborative sampling strategy comprises the following steps:
A1: updating the exploration rate l_i(s): if the average expected return of the current action set is greater than or equal to the cumulative average expected return V_i(s) of the previous action sets, reducing the exploration rate l_i(s) to l_i(s)·δ_d, otherwise increasing l_i(s) to l_i(s)·δ_l, wherein δ_l is a positive real number greater than 1 and δ_d is a positive real number less than 1;
A2: updating the cumulative average expected return: V_i(s) ← (1 − α_s)·V_i(s) + α_s·(average expected return of the current action set), wherein α_s is the learning rate;
A3: resampling the action set according to the exploration rate l_i(s): computing the action with the maximum current return, a_max = argmax_{a∈A_i(s)} Q_i(s, a); keeping the |A_i(s)|/3 actions with the largest expected return in the current set, and randomly selecting 2|A_i(s)|/3 new actions from the neighborhood of a_max with radius l_i(s), which together form the new action set;
A4: initializing the policy π_i(s, a) and the corresponding expected return Q_i(s, a) of each new action to the initial set values.
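A possible reading of steps A1–A4 for a single state s is sketched below in Python; the constants δ_l, δ_d, α_s and the initial values are illustrative, and initializing the policy uniformly over the new set is an assumption.

```python
import random

def collaborative_resample(A, Q, V_s, l_s,
                           delta_l=2.0, delta_d=0.5, alpha_s=0.1, init_q=0.0):
    """Sketch of steps A1-A4 for one state s.

    A   : list of currently sampled actions (floats)
    Q   : dict mapping action -> expected return Q_i(s, a)
    V_s : cumulative average expected return of previous action sets, V_i(s)
    l_s : current exploration rate l_i(s)
    """
    avg_q = sum(Q[a] for a in A) / len(A)        # average expected return of current set
    # A1: shrink exploration if the current set did at least as well as before
    l_s = l_s * delta_d if avg_q >= V_s else l_s * delta_l
    # A2: update the cumulative average expected return with learning rate alpha_s
    V_s = (1 - alpha_s) * V_s + alpha_s * avg_q
    # A3: keep the best |A|/3 actions and draw 2|A|/3 new ones near the best action
    a_max = max(A, key=lambda a: Q[a])
    keep = sorted(A, key=lambda a: Q[a], reverse=True)[: len(A) // 3]
    new = [a_max + random.uniform(-l_s, l_s) for _ in range(len(A) - len(keep))]
    A_new = keep + new
    # A4: (re)initialize policy and expected return for the new action set
    Q_new = {a: (Q[a] if a in keep else init_q) for a in A_new}
    pi_new = {a: 1.0 / len(A_new) for a in A_new}
    return A_new, Q_new, pi_new, V_s, l_s
```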
4. The collaborative method for a multi-state continuous action space of claim 2, wherein: in step (32), agent i learns and updates using a multi-state recursive frequency maximum Q learning algorithm.
5. The collaborative method for multi-state continuous action space according to claim 4, wherein: the processing method of the multi-state recursion frequency maximum Q value learning algorithm comprises the following steps:
B1: judging whether the current action set has been updated; if not, directly executing step B2; if so, initializing F_i(s, a), Q_i^max(s, a) and E_i(s, a) for all actions in the current state, and then executing step B2;
B2: selecting an action a ∈ A_i(s) in state s according to the policy π_i(s, a) with a certain exploration rate;
B3: observing the return r and the next state s' from the environment, and updating the state action value Q corresponding to the current s and the current a i (s,a):
Figure FDA0004043643810000023
Wherein alpha is learning rate, gamma is discount factor,
Figure FDA0004043643810000024
the maximum state action value when the action a 'is performed when the next state s' is performed;
B4: estimating E_i(s, a) according to the recursive maximum priority idea;
B5: according to E i (s, a) updating the policy pi using a policy hill-climbing algorithm i (s, a), i.e. increasing the choice with maximum E i The probability of an action of the (s, a) value, while selecting a probability of decreasing the other actions.
6. A system for implementing the collaborative method for a multi-state continuous action space of any of claims 1-5, comprising:
an action set initialization module, for initializing, for any state s ∈ S, the sampled action set A_i(s) as n actions randomly sampled from the continuous action space A_i(s) of agent i;
a parameter initialization module, for initializing, for any state s ∈ S and action a ∈ A_i(s), the expected return Q_i(s, a) of agent i in state s with respect to action a, the historical maximum return Q_i^max(s, a), the weighted average return E_i(s, a), and the average expected return V_i(s) of agent i in state s, and for initializing the probability π_i(s, a) of selecting action a in state s, the estimate F_i(s, a) of the frequency with which the maximum return occurs, and the exploration rate l_i(s), all to set values;
and a convergence module: for repeatedly executing the following units until the return of agent i in state s converges,
an action set correction unit, for judging whether the action set needs to be updated; if not, executing the policy evaluation and update unit; if so, resampling the action set by keeping the maximum-return action and sampling new actions within a certain range of that action to form a new action set, and then executing the policy evaluation and update unit;
a policy evaluation and update unit, for updating, for any a ∈ A_i(s), π_i(s, a) and Q_i(s, a) according to the selection principle that the action with the highest return is most probable;
a state update unit, for updating the state s ← s'.
7. The system according to claim 6, wherein: the action set correction unit resamples through a collaborative sampling strategy to update the action set, and uses a variable exploration rate l_i(s) to control the collaborative sampling strategy to sample a range of new actions around the maximum-return action.
8. The system according to claim 7, wherein: the policy evaluation and update unit learns and updates using a multi-state recursive frequency maximum Q-value learning algorithm.
CN201880001580.2A 2018-08-01 2018-08-01 Cooperative method and system for multi-state continuous action space Active CN109511277B (en)

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
PCT/CN2018/098103 WO2020024172A1 (en) 2018-08-01 2018-08-01 Collaborative type method and system of multistate continuous action space

Publications (2)

Publication Number Publication Date
CN109511277A CN109511277A (en) 2019-03-22
CN109511277B true CN109511277B (en) 2023-06-13

Family

ID=65756509

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201880001580.2A Active CN109511277B (en) 2018-08-01 2018-08-01 Cooperative method and system for multi-state continuous action space

Country Status (2)

Country Link
CN (1) CN109511277B (en)
WO (1) WO2020024172A1 (en)

Families Citing this family (10)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
WO2020024170A1 (en) * 2018-08-01 2020-02-06 东莞理工学院 Nash equilibrium strategy and social network consensus evolution model in continuous action space
CN109617968B (en) * 2018-12-14 2019-10-29 启元世界(北京)信息技术服务有限公司 Communication means between Multi-Agent Cooperation system and its intelligent body, intelligent body
CN110471297B (en) * 2019-07-30 2020-08-11 清华大学 Multi-agent cooperative control method, system and equipment
CN110994620A (en) * 2019-11-16 2020-04-10 国网浙江省电力有限公司台州供电公司 Q-Learning algorithm-based power grid power flow intelligent adjustment method
CN110996398A (en) * 2019-12-16 2020-04-10 锐捷网络股份有限公司 Wireless network resource scheduling method and device
CN111294242A (en) * 2020-02-16 2020-06-16 湖南大学 Multi-hop learning method for improving cooperation level of multi-agent system
CN111530080B (en) * 2020-04-26 2021-03-26 苏州沁游网络科技有限公司 Behavior control method, device, equipment and storage medium of virtual object
CN112714165B (en) * 2020-12-22 2023-04-04 声耕智能科技(西安)研究院有限公司 Distributed network cooperation strategy optimization method and device based on combination mechanism
CN113689001B (en) * 2021-08-30 2023-12-05 浙江大学 Virtual self-playing method and device based on counter-facts regretation minimization
CN114845359A (en) * 2022-03-14 2022-08-02 中国人民解放军军事科学院战争研究院 Multi-intelligent heterogeneous network selection method based on Nash Q-Learning

Patent Citations (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US8948499B1 (en) * 2010-12-07 2015-02-03 Hrl Laboratories, Llc Method for online learning and recognition of visual behaviors
CN105959353A (en) * 2016-04-22 2016-09-21 广东石油化工学院 Cloud operation access control method based on average reinforcement learning and Gaussian process regression
CN107734579A (en) * 2017-10-16 2018-02-23 西北大学 A kind of mobile platform energy consumption optimization method based on Markovian decision process

Non-Patent Citations (3)

* Cited by examiner, † Cited by third party
Title
SCC-rFMQ Learning in Cooperative Markov Games with Continuous Actions; Zhang Chengwei et al.; In Proc. of the International Conference on Autonomous Agents and Multiagent Systems (AAMAS 2018); 2018-07-15; Section 2 *
Zhang Chengwei et al. SCC-rFMQ Learning in Cooperative Markov Games with Continuous Actions. In Proc. of the International Conference on Autonomous Agents and Multiagent Systems (AAMAS 2018). 2018. *
Cooperative learning of autonomous miniature mobile robots; Yang Yujun et al.; Computer Engineering; 2003-06-30; Vol. 29, No. 10; Section 2.1.1 *

Also Published As

Publication number Publication date
WO2020024172A1 (en) 2020-02-06
CN109511277A (en) 2019-03-22

Similar Documents

Publication Publication Date Title
CN109511277B (en) Cooperative method and system for multi-state continuous action space
Powell Perspectives of approximate dynamic programming
Zhang et al. Learning automata-based multiagent reinforcement learning for optimization of cooperative tasks
Shi et al. An adaptive strategy selection method with reinforcement learning for robotic soccer games
Ngo et al. Confidence-based progress-driven self-generated goals for skill acquisition in developmental robots
JP2019079227A (en) State transition rule acquisition device, action selection learning device, action selection device, state transition rule acquisition method, action selection method, and program
Zhang et al. Clique-based cooperative multiagent reinforcement learning using factor graphs
Mguni et al. Timing is Everything: Learning to act selectively with costly actions and budgetary constraints
Zhang et al. SCC-rFMQ: a multiagent reinforcement learning method in cooperative Markov games with continuous actions
KR20230079804A (en) Device based on reinforcement learning to linearize state transition and method thereof
Jones et al. Data Driven Control of Interacting Two Tank Hybrid System using Deep Reinforcement Learning
Asperti et al. MicroRacer: a didactic environment for Deep Reinforcement Learning
Yu et al. Adaptively shaping reinforcement learning agents via human reward
Han et al. Robot path planning in dynamic environments based on deep reinforcement learning
Gregor et al. Novelty detector for reinforcement learning based on forecasting
Gros Tracking the race: Analyzing racetrack agents trained with imitation learning and deep reinforcement learning
Yu Deep Q-learning on lunar lander game
Marochko et al. Pseudorehearsal in actor-critic agents with neural network function approximation
Marochko et al. Pseudorehearsal in value function approximation
Leng et al. Simulation and reinforcement learning with soccer agents
Gatti et al. Reinforcement learning
Zhang et al. Scc-rfmq learning in cooperative markov games with continuous actions
Yang et al. Online attentive kernel-based temporal difference learning
Barbara et al. On Robust Reinforcement Learning with Lipschitz-Bounded Policy Networks
Mohan A comparison of various reinforcement learning algorithms to solve racetrack problem

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant