CN108921298A

CN108921298A - Reinforcement learning multi-agent communication and decision-making method

Info

Publication number: CN108921298A
Application number: CN201810606662.1A
Authority: CN
Inventors: 查正军; 李厚强; 温忻; 李斌; 王子磊
Original assignee: University of Science and Technology of China USTC
Current assignee: University of Science and Technology of China USTC
Priority date: 2018-06-12
Filing date: 2018-06-12
Publication date: 2018-11-30
Anticipated expiration: 2038-06-12
Also published as: CN108921298B

Abstract

The invention discloses a reinforcement learning multi-agent communication and decision-making method, comprising: extracting a corresponding state feature through a neural network from the observed state information of each agent; feeding the state features of all agents, as communication information, into a VLAD layer for soft assignment and clustering to obtain clustered communication information; and distributing the clustered communication information to each agent, where each agent aggregates its own state feature with the received clustered communication information and makes an action decision through a fully connected neural network inside the agent. The method clusters the state information of the agents and lets each agent communicate with the others, thereby improving the agents' decision-making.

Description

Reinforcement learning multi-agent communication and decision-making method

Technical field

The present invention relates to the technical field of multi-agent deep reinforcement learning, and in particular to a reinforcement learning multi-agent communication and decision-making method.

Background art

Reinforcement learning is a class of algorithms that map directly from environment perception to actions. Perception information (such as visual or state information) is taken as input, and a mapping model is built to output actions, thereby realizing the decision process of an agent in an unknown environment. Deep reinforcement learning combines the advantages of deep neural networks and reinforcement learning, and can effectively solve perception and decision problems for agents in unfamiliar, high-dimensional, complex environments. Traditional supervised learning algorithms usually require a large amount of manually labeled training data, and the quality of the trained model is limited by the quality of that data. Reinforcement learning instead generates data by continually interacting with the environment and iteratively improves its own policy according to the environment's feedback. This alleviates, to some extent, supervised learning's dependence on manually labeled data and its limitation to human-level data. Deep reinforcement learning is therefore a frontier research direction in general artificial intelligence with broad application prospects.

Common deep reinforcement learning is mainly applied to the single-agent case, in which only one agent in the environment continually interacts with the environment to obtain samples, and one deep policy network is trained to control that one agent. In real environments, however, problems are more often multi-agent: multiple agents make decisions in the environment, influence one another, and jointly change the state of the environment. Different relationships (such as competition or cooperation) may also exist among the agents. When a single agent makes decisions in a multi-agent environment, it must also consider the states of its teammates and opponents and their policies. Many problems in the natural world and human society can be regarded as multi-agent games (for example, vehicle traffic or multiplayer games), so multi-agent reinforcement learning algorithms have broad application prospects and are also a necessary step toward strong artificial intelligence.

However, existing reinforcement learning algorithms typically work well only with lightweight neural network models and perform poorly with complex models. How to design an efficient, concise, and practical neural network model that fully describes the relationships among agents while keeping the network structure compact is therefore the key to multi-agent reinforcement learning methods.

Summary of the invention

The object of the present invention is to provide a reinforcement learning multi-agent communication and decision-making method that clusters the state information of each agent and communicates it to the other agents, thereby improving the agents' decision-making.

The object of the present invention is achieved through the following technical solution:

A reinforcement learning multi-agent communication and decision-making method, comprising:

extracting a corresponding state feature through a neural network from the observed state information of each agent;

feeding the state features of all agents, as communication information, into a VLAD layer for soft assignment and clustering to obtain clustered communication information;

distributing the clustered communication information to each agent, each agent aggregating its own state feature with the received clustered communication information and making an action decision through a fully connected neural network inside the agent.

As can be seen from the above technical solution, the invention provides a VLAD-based reinforcement learning multi-agent communication mechanism that supports gradient propagation and has learnable cluster centers. For cooperation problems among agents in a multi-agent environment, it enables effective communication and exchange of state information among agents, is highly robust to dynamic changes in the number of agents, and ultimately improves the performance of the neural network model.

Brief description of the drawings

In order to illustrate the technical solutions of the embodiments of the present invention more clearly, the drawings needed in the description of the embodiments are briefly introduced below. Obviously, the drawings described below show only some embodiments of the invention; those of ordinary skill in the art can obtain other drawings from them without creative effort.

Fig. 1 is a schematic diagram of the network architecture provided by an embodiment of the present invention;

Fig. 2 is a flow chart of a reinforcement learning multi-agent communication and decision-making method provided by an embodiment of the present invention;

Fig. 3 is a schematic diagram of the network architecture of the VLAD layer provided by an embodiment of the present invention.

Detailed description of the embodiments

The technical solutions in the embodiments of the present invention are described clearly and completely below with reference to the drawings. Obviously, the described embodiments are only some, not all, of the embodiments of the invention. All other embodiments obtained by those of ordinary skill in the art from these embodiments without creative effort fall within the protection scope of the present invention.

To let agents cooperate and compete more effectively, and to make the algorithm more robust to dynamic changes in the number of agents, an embodiment of the present invention provides a reinforcement learning multi-agent communication and decision-making method. During training and optimization of the multi-agent reinforcement learning policy network, a communication mechanism is established among the agents: the states of the agents are cluster-encoded, and each agent then makes decisions from its own state information together with the encoded state information of the other agents. The communication mechanism is simple and effective, is robust to dynamic changes in the number of agents, and realizes an end-to-end mapping from environment state to agent policy.

In the embodiment of the present invention, a multilayer neural network realizes the joint decision-making of multiple agents; the network model structure is shown in Fig. 1, and the implementation process of the method is shown in Fig. 2.

Referring to Fig. 1, assume there are N agents in the environment, and the state information they observe is s_1, s_2, ..., s_N respectively. At each time step t, the neural network module inside each agent, f^1, ..., f^m, generates a corresponding action from the agent's state; let the actions taken by the agents be a_1, a_2, ..., a_N. After all agents have executed their actions, each agent receives the reward r_t fed back by the environment. Here r_t depends on the actions selected by all agents in the environment; that is, in the embodiment of the present invention, all agents receive the same environment reward at each time step.

Referring to Fig. 2, the implementation process of the method mainly comprises:

Step 1: extract a corresponding state feature through a neural network from the observed state information of each agent.

In the embodiment of the present invention, the observed state information of each agent is manually encoded, realizing a mapping from the physical world to a mathematical space; the encoding result may be in vector form or image form. If the encoding result is a vector, the state feature is extracted by an MLP network; if the encoding result is an image, the state feature is extracted by a CNN.
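As an illustration of the vector branch of Step 1, the following is a minimal NumPy sketch of an MLP feature extractor. The layer sizes, the ReLU activation, and the function name `mlp_features` are assumptions made for the example; the text does not specify them, and the CNN branch for image encodings is not shown.

```python
import numpy as np

def mlp_features(obs, weights, biases):
    """Map encoded observations (N, obs_dim) to state features (N, D).

    Each layer is an affine map followed by ReLU (the activation is an
    assumption; the source only says an MLP is used).
    """
    h = obs
    for W, b in zip(weights, biases):
        h = np.maximum(h @ W + b, 0.0)
    return h

rng = np.random.default_rng(0)
obs = rng.normal(size=(4, 6))                  # 4 agents, 6-dim encoded observation
weights = [rng.normal(size=(6, 16)) * 0.1,     # hypothetical layer sizes
           rng.normal(size=(16, 8)) * 0.1]
biases = [np.zeros(16), np.zeros(8)]
features = mlp_features(obs, weights, biases)  # one 8-dim state feature per agent
```

The same weights are applied to every row, which matches the later statement that agents of the same type share all network parameters.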

Step 2: feed the state features of all agents, as communication information, into a VLAD layer for soft assignment and clustering to obtain the clustered communication information.

In the embodiment of the present invention, a VLAD (Vector of Locally Aggregated Descriptors) layer is used that supports gradient propagation and has learnable cluster centers; its structure is shown in Fig. 3.

In the embodiment of the present invention, VLAD clustering is applied to the state features of the agents using soft assignment. The weight with which a state feature is assigned to each cluster center is obtained by a weighted multiplication of the state feature followed by a softmax, expressed as:

In the above formula, w_k(X_i) denotes the weight with which the state feature X_i of the i-th agent is assigned to the k-th cluster center; a_k and b_k are the soft-assignment weights of the k-th cluster center, where a_k is a row vector and b_k is a scalar; x_i is the column vector representing the state feature X_i of the i-th agent; k' ranges over all cluster centers; and a_k' and b_k' are the soft-assignment weights of the k'-th cluster center, where a_k' is a row vector and b_k' is a scalar.
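The formula referred to above is not reproduced in this text. Judging from the symbol definitions alone, it is presumably the usual softmax over the cluster centers; the following is a reconstruction under that assumption, not the published formula:

```latex
w_k(X_i) = \frac{e^{a_k x_i + b_k}}{\sum_{k'} e^{a_{k'} x_i + b_{k'}}}
```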

In the embodiment of the present invention, a 1×1 convolution kernel can be used to compute the term a_k x_i + b_k of the soft assignment; a softmax layer of the neural network then yields the soft-assignment weight w_k(X_i).
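A minimal NumPy sketch of this soft-assignment step follows. A 1×1 convolution applied to a single feature vector is the same as an affine map, so the sketch uses a matrix product in its place; the array sizes and the name `soft_assign` are illustrative assumptions.

```python
import numpy as np

def soft_assign(X, A, b):
    """Soft-assignment weights w_k(X_i).

    X: (N, D) state features, one row per agent.
    A: (K, D) matrix whose k-th row is a_k.
    b: (K,) vector of the scalars b_k.
    Returns an (N, K) matrix whose i-th row sums to 1.
    """
    logits = X @ A.T + b                                 # a_k x_i + b_k for every i, k
    logits = logits - logits.max(axis=1, keepdims=True)  # numerical stability only
    e = np.exp(logits)
    return e / e.sum(axis=1, keepdims=True)              # softmax over cluster centers

rng = np.random.default_rng(0)
X = rng.normal(size=(5, 8))   # 5 agents, 8-dim features (illustrative sizes)
A = rng.normal(size=(3, 8))   # 3 cluster centers
b = np.zeros(3)
W = soft_assign(X, A, b)
```

Subtracting the row maximum before exponentiating does not change the softmax result; it only avoids overflow for large logits.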

Then, following the idea of VLAD clustering, the final clustering result is characterized by the distance in feature space between the feature vectors and the cluster centers; the clustering result for the k-th cluster center is as follows:

where V(j, k) is the j-th dimension of the clustering result for the k-th cluster center, i.e. the clustered communication information; x_i(j) is the j-th dimension of the column vector representing the state feature X_i of the i-th agent; c_k(j) is the j-th coordinate of the k-th cluster center; and N is the number of agents.
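The clustering formula itself is also missing from this text. Based on the symbols defined here and on the standard VLAD construction (a weighted sum of residuals to each cluster center), it is presumably of the following form; this is a reconstruction under that assumption:

```latex
V(j, k) = \sum_{i=1}^{N} w_k(X_i)\,\bigl(x_i(j) - c_k(j)\bigr)
```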

In the embodiment of the present invention, the VLAD core layer uses w_k(X_i) and X_i to perform the assignment to the specific cluster centers and to generate the final VLAD vector; this layer consists mainly of vector addition and subtraction modules.
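The core layer can be sketched in NumPy as below, assuming the aggregation is the soft-assignment-weighted sum of the residuals x_i - c_k, which is how the symbol definitions read (an assumption, since the formula is not shown in this text). The name `vlad_aggregate` and the small hand-picked arrays are illustrative only.

```python
import numpy as np

def vlad_aggregate(X, W, C):
    """Aggregate agent features into one residual vector per cluster center.

    X: (N, D) state features; W: (N, K) soft-assignment weights;
    C: (K, D) cluster centers (learnable in the described layer).
    Returns V of shape (K, D), where V[k, j] plays the role of V(j, k).
    """
    residuals = X[:, None, :] - C[None, :, :]       # (N, K, D): x_i - c_k
    return (W[:, :, None] * residuals).sum(axis=0)  # weighted sum over agents

X = np.array([[1.0, 2.0], [3.0, 4.0]])   # two agents, 2-dim features
C = np.array([[0.0, 0.0], [1.0, 1.0]])   # two cluster centers
W = np.array([[1.0, 0.0], [0.5, 0.5]])   # each row sums to 1
V = vlad_aggregate(X, W, C)
```

The output shape depends only on the number of cluster centers and the feature dimension, not on the number of agents, so the same message format can be used however many agents are present.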

Step 3: distribute the clustered communication information to each agent; each agent aggregates its own state feature with the received clustered communication information and makes an action decision through the fully connected neural network module inside the agent.

In the embodiment of the present invention, each agent aggregates its own state feature with the received clustered communication information by concatenation. The fully connected neural network module inside the agent then generates the probability distribution p_1, p_2, ..., p_n over the agent's n available actions a_1, a_2, ..., a_n. The fully connected neural network has one or more layers; its input dimension is the sum of the dimensions of the state feature and the clustered communication information, and its output dimension corresponds to the available actions a_1, a_2, ..., a_n and is therefore n. Once the probability distribution over the n actions has been generated, the final action can be sampled according to the probabilities, or the action with the highest probability can be chosen as the agent's final action. Because the agents' own states differ, agents combining them with the communication information may produce the same action or different actions.
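A minimal sketch of Step 3 for one agent follows, assuming a single fully connected layer (the text allows one or more) and a softmax output. The names `act`, `W_fc`, and `b_fc` and all sizes are hypothetical.

```python
import numpy as np

def softmax(z):
    z = z - z.max()          # numerical stability only
    e = np.exp(z)
    return e / e.sum()

def act(own_feature, message, W_fc, b_fc, rng=None):
    """Concatenate own feature with the clustered message and pick an action.

    own_feature: (D,) state feature; message: (K, D) clustered information.
    W_fc: (D + K*D, n) weights; b_fc: (n,) bias.
    With rng=None the most probable action is chosen; otherwise the action
    is sampled from the probability distribution.
    """
    x = np.concatenate([own_feature, message.ravel()])  # aggregation by concatenation
    probs = softmax(x @ W_fc + b_fc)                    # p_1, ..., p_n
    if rng is None:
        action = int(np.argmax(probs))
    else:
        action = int(rng.choice(len(probs), p=probs))
    return action, probs

rng = np.random.default_rng(0)
own = rng.normal(size=8)
msg = rng.normal(size=(3, 8))
n_actions = 5
W_fc = rng.normal(size=(8 + 3 * 8, n_actions)) * 0.1
b_fc = np.zeros(n_actions)
greedy_action, probs = act(own, msg, W_fc, b_fc)
sampled_action, _ = act(own, msg, W_fc, b_fc, rng=rng)
```

Sampling is the usual choice while exploring during training, and the greedy choice is common at evaluation time; the text permits either.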

In addition, in the embodiment of the present invention, each agent receives the reward fed back by the environment after executing an action. The agents share model parameters and the reward fed back by the environment; the magnitude of the reward measures the quality of the action previously taken, and the agent model is trained accordingly so that it adopts a better policy the next time it interacts with the environment. Curriculum transfer learning is also used: the complexity of the environment and the number of agents are gradually increased during training, which speeds up model training.

Curriculum transfer learning means gradually increasing the complexity of the environment during model training: the model is first trained in a relatively simple environment (for example, one with few agents), the trained parameters are then used for further training in a more complex environment, and training finally transitions gradually to the desired complex environment. During training, the parameters of all network models of agents of the same type (including the neural network that processes the observed state information, the VLAD layer, and the fully connected neural network that produces the final action decision) are shared, and the reward feedback each agent obtains from the environment is also the same; each agent iteratively updates the same model parameters according to its own state. Agents of different types have different model parameters but receive the same environment reward feedback. The model in the embodiment of the present invention is therefore highly robust to changes in the number of agents.
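The following sketch illustrates why shared parameters can be carried from one curriculum stage to the next: with the same VLAD parameters, the clustered message has the same shape regardless of how many agents are present. The agent counts in `curriculum`, the random stand-in features, and the name `vlad_message` are assumptions for illustration; no actual training update is performed here.

```python
import numpy as np

def vlad_message(X, A, b, C):
    """Soft-assign N agent features to K centers and aggregate residuals."""
    logits = X @ A.T + b
    logits = logits - logits.max(axis=1, keepdims=True)
    W = np.exp(logits)
    W = W / W.sum(axis=1, keepdims=True)            # (N, K) soft assignment
    return W.T @ X - W.sum(axis=0)[:, None] * C     # (K, D) weighted residual sums

rng = np.random.default_rng(0)
D, K = 8, 3
A, b, C = rng.normal(size=(K, D)), np.zeros(K), rng.normal(size=(K, D))

curriculum = [2, 4, 8]        # hypothetical agent counts, easy to hard
shapes = []
for n_agents in curriculum:
    X = rng.normal(size=(n_agents, D))   # stand-in for extracted state features
    shapes.append(vlad_message(X, A, b, C).shape)
    # In real training, A, b, C and the other shared networks would be
    # updated at this stage and reused unchanged in shape at the next one.
```

Because every downstream layer sees a (K, D) message at every stage, parameters trained with few agents remain usable when more agents are added.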

From the above description of the embodiments, those skilled in the art will clearly understand that the embodiments can be implemented in software, or in software together with a necessary general-purpose hardware platform. On this understanding, the technical solutions of the embodiments can be embodied as a software product, which can be stored in a non-volatile storage medium (such as a CD-ROM, USB flash drive, or removable hard disk) and includes instructions that cause a computer device (such as a personal computer, server, or network device) to execute the methods described in the embodiments of the present invention.

The above is only a preferred embodiment of the present invention, but the protection scope of the invention is not limited to it. Any change or substitution that a person skilled in the art could readily conceive within the technical scope disclosed by the invention shall fall within the protection scope of the invention. The protection scope of the present invention shall therefore be determined by the protection scope of the claims.

Claims

1. A reinforcement learning multi-agent communication and decision-making method, characterized by comprising:

feeding the state features of all agents, as communication information, into a VLAD layer for soft assignment and clustering to obtain clustered communication information;

distributing the clustered communication information to each agent, each agent aggregating its own state feature with the received clustered communication information and making an action decision through a fully connected neural network inside the agent.

2. The reinforcement learning multi-agent communication and decision-making method according to claim 1, characterized in that the process of extracting the state features comprises:

manually encoding the observed state information of each agent, realizing a mapping from the physical world to a mathematical space, the encoding result being in vector form or image form;

if the encoding result is in vector form, extracting the state feature through an MLP network;

if the encoding result is in image form, extracting the state feature through a CNN.

3. The reinforcement learning multi-agent communication and decision-making method according to claim 1, characterized in that the process of soft assignment and clustering in the VLAD layer comprises:

applying VLAD clustering to the state features of the agents using soft assignment, the weight assigned to each cluster center being obtained by a weighted multiplication of the state feature followed by a softmax, expressed as:

In the above formula, w_k(X_i) denotes the weight with which the state feature X_i of the i-th agent is assigned to the k-th cluster center; a_k and b_k are the soft-assignment weights of the k-th cluster center; x_i is the column vector representing the state feature X_i of the i-th agent; k' ranges over all cluster centers; and a_k' and b_k' are the soft-assignment weights of the k'-th cluster center;

the final clustering result is characterized by the distance in feature space between the feature vectors and the cluster centers, and the clustering result for the k-th cluster center is as follows:

where V(j, k) is the j-th dimension of the clustering result for the k-th cluster center, i.e. the clustered communication information; x_i(j) is the j-th dimension of the column vector representing the state feature X_i of the i-th agent; c_k(j) is the j-th coordinate of the k-th cluster center; and N is the number of agents.

4. The reinforcement learning multi-agent communication and decision-making method according to claim 1, characterized in that aggregating the agent's own state feature with the received clustered communication information and making an action decision through the fully connected neural network inside the agent comprises:

each agent aggregating its own state feature with the received clustered communication information by concatenation;

then generating, through the fully connected neural network inside the agent, the probability distribution p_1, p_2, ..., p_n over the agent's n available actions a_1, a_2, ..., a_n; and, after the probability distribution over the n actions has been generated, sampling the final action according to the probabilities, or choosing the action with the highest probability as the agent's final action;

wherein the fully connected neural network has one or more layers, its input dimension is the sum of the dimensions of the state feature and the clustered communication information, and its output dimension corresponds to the available actions a_1, a_2, ..., a_n and is n.

5. The reinforcement learning multi-agent communication and decision-making method according to claim 1, characterized in that each agent receives the reward fed back by the environment after executing an action; the agents share model parameters and the reward fed back by the environment; the magnitude of the reward measures the quality of the action previously taken, and the agents are trained accordingly to adopt a better policy the next time they interact with the environment; and curriculum transfer learning is used, gradually increasing the complexity of the environment and the number of agents during training.