WO2020024097A1 - Deep reinforcement learning-based adaptive game algorithm - Google Patents

Deep reinforcement learning-based adaptive game algorithm

Info

Publication number
WO2020024097A1
WO2020024097A1 (PCT/CN2018/097747, CN2018097747W)
Authority
WO
WIPO (PCT)
Prior art keywords
cooperation
att
strategy
degree
strategies
Prior art date
Application number
PCT/CN2018/097747
Other languages
French (fr)
Chinese (zh)
Inventor
侯韩旭
郝建业
王维勋
Original Assignee
东莞理工学院
Priority date
Filing date
Publication date
Application filed by 东莞理工学院
Priority to CN201880001586.XA (CN109496318A)
Priority to PCT/CN2018/097747 (WO2020024097A1)
Publication of WO2020024097A1

Classifications

    • G — PHYSICS
    • G06 — COMPUTING; CALCULATING OR COUNTING
    • G06N — COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 — Computing arrangements based on biological models
    • G06N3/02 — Neural networks
    • G06N3/04 — Architecture, e.g. interconnection topology
    • G06N3/042 — Knowledge-based neural networks; Logical representations of neural networks
    • G06N3/045 — Combinations of networks
    • G06N3/08 — Learning methods

Definitions

  • the invention relates to the field of data processing, in particular to an adaptive game algorithm based on deep reinforcement learning.
  • the present invention provides an adaptive game algorithm based on deep reinforcement learning to solve the problem of poor scalability of multi-agents in the prior art.
  • the present invention is realized through the following technical scheme: an adaptive game algorithm based on deep reinforcement learning is designed, comprising the following steps: (A) acquiring strategies with different degrees of cooperation; (B) generating strategies with different degrees of cooperation; (C) detecting the opponent's cooperation strategy; (D) formulating different coping strategies.
  • in step (A), training is performed using different network structures and/or different target reward forms to obtain strategies with different degrees of cooperation.
  • in step (A), strategies with different degrees of cooperation are obtained by modifying key factors in the environment that influence the degree of competition and cooperation, or by modifying the agent's learning goal.
  • in step (B), the strategies with different degrees of cooperation obtained in step (A) are set as expert networks, and the strategies with different degrees of cooperation in the expert networks are assigned weights; a strategy with a new degree of cooperation is then generated according to the degree of influence of each strategy.
  • the specific process of the algorithm for generating a strategy with a new degree of cooperation is as follows: among the expert networks, π^i_att_n denotes the strategy of agent i obtained by training with cooperation degree att_n in the setting for training strategies with different degrees of cooperation. Each expert network predicts the state-action value obtained in the current state by playing with the other agents using the strategy of its cooperation degree. Based on the existing strategies π^i_att_1, …, π^i_att_n, a strategy π^i_new with a new degree of cooperation att_new is obtained for agent i by assigning different weights w_1, w_2, …, w_n, where
  att_new = f_value(att_1, att_2, …, att_n, w_1, w_2, …, w_n),
  and f_value(att_1, att_2, …, att_n, w_1, w_2, …, w_n) is the degree of cooperation obtained by combining the basic strategies with the corresponding weights w_1, w_2, …, w_n; f_value is:
  f_value(att_1, att_2, …, att_n, w_1, w_2, …, w_n) = att_1*w_1 + att_2*w_2 + … + att_n*w_n
  • a neural network is used to determine the degree of cooperation.
  • the structure of the neural network adopts a multi-task mode, combining the structure of an autoencoder with the structure of a classifier.
  • the autoencoder and classifier share the underlying parameters of the neural network.
  • the opponent is assessed by detecting its degree of cooperation, and a strategy more favorable to the agent itself is generated.
  • the different network structures include the strategy network structure and the network structure of the cooperation degree detector.
  • the adaptive game algorithm based on deep reinforcement learning is applied in a multi-agent environment.
  • the beneficial effects of the present invention are: using the trained detector and the strategies with different degrees of cooperation, existing ideas such as Tit for tat are implemented in sequential social dilemmas; the scalability of the agent is improved; and competitive strategies superior to the agent's own are obtained more intuitively.
  • FIG. 1 is a schematic diagram of generating different cooperation strategies according to the present invention
  • FIG. 2 is a schematic diagram of a network structure of a cooperation degree detector according to the present invention.
  • FIG. 3 is a schematic diagram of formulating different coping strategies according to the present invention.
  • an adaptive game algorithm based on deep reinforcement learning includes the following steps: (A) acquiring strategies with different degrees of cooperation; (B) generating strategies with different degrees of cooperation; (C) detecting the opponent's cooperation strategy; (D) formulating different coping strategies.
  • in step (A), training is performed using different network structures and/or different target reward forms to obtain strategies with different degrees of cooperation.
  • in step (A), strategies with different degrees of cooperation are obtained by modifying key factors in the environment that affect the degree of competition and cooperation, or by modifying the agent's learning goal.
  • in step (B), the strategies with different degrees of cooperation obtained in step (A) are set as expert networks, and the strategies with different degrees of cooperation in the expert networks are assigned weights; a strategy with a new degree of cooperation is then generated according to the degree of influence of each strategy.
  • the algorithm for generating a strategy with a new degree of cooperation is as follows: among the expert networks, π^i_att_n denotes the strategy of agent i obtained by training with cooperation degree att_n in the setting for training strategies with different degrees of cooperation. Each expert network predicts the state-action value obtained in the current state by playing with the other agents using the strategy of its cooperation degree. Based on the existing strategies π^i_att_1, …, π^i_att_n, a strategy π^i_new with a new degree of cooperation att_new is obtained for agent i by assigning different weights w_1, w_2, …, w_n, where
  att_new = f_value(att_1, att_2, …, att_n, w_1, w_2, …, w_n),
  and f_value(att_1, att_2, …, att_n, w_1, w_2, …, w_n) is the degree of cooperation obtained by combining the basic strategies with the corresponding weights w_1, w_2, …, w_n; f_value is:
  f_value(att_1, att_2, …, att_n, w_1, w_2, …, w_n) = att_1*w_1 + att_2*w_2 + … + att_n*w_n
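For illustration only, the following Python sketch (not part of the patent) shows the value-based policy generation just described, assuming each expert network can return its Q-values for the current state; the function and variable names (`generate_policy`, `expert_q_values`) are hypothetical.

```python
import numpy as np

def generate_policy(expert_q_values, atts, weights):
    """Combine expert networks trained with cooperation degrees `atts`.

    expert_q_values: list of arrays, expert_q_values[n][a] = Q_att_n(s, a)
    atts:            cooperation degrees att_1..att_n of the experts
    weights:         mixture weights w_1..w_n
    Returns the greedy action under the mixed value function and the
    cooperation degree att_new = f_value(att, w) of the mixed strategy.
    """
    w = np.asarray(weights, dtype=float)
    q = np.asarray(expert_q_values, dtype=float)        # (n_experts, n_actions)
    q_new = (w[:, None] * q).sum(axis=0)                 # weighted state-action values
    att_new = float(np.dot(np.asarray(atts, dtype=float), w))   # linear superposition
    action = int(np.argmax(q_new))                       # pi_new(s) = argmax_a Q_new(s, a)
    return action, att_new

# Example with two experts: a cooperative one (att = 1.0) and a competitive one (att = 0.0).
q_coop = [0.2, 0.8, 0.1]   # hypothetical Q-values of the cooperative expert
q_comp = [0.9, 0.1, 0.3]   # hypothetical Q-values of the competitive expert
action, att_new = generate_policy([q_coop, q_comp], atts=[1.0, 0.0], weights=[0.7, 0.3])
```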
  • a neural network is used to judge the degree of cooperation.
  • the structure of the neural network adopts a multi-tasking mode.
  • the structure of the autoencoder is combined with the structure of the classifier.
  • the autoencoder and the classifier share the underlying parameters of the neural network.
  • the opponent is assessed by detecting its degree of cooperation, and a strategy more favorable to the agent itself is generated.
  • the different network structures include the strategy network structure and the network structure of the cooperation degree detector.
  • the strategy network structure is similar to the DQN structure and uses a five-layer structure.
  • the first hidden layer, the second hidden layer, and the third hidden layer are all convolutional layers, the fourth layer is a fully connected layer, and the last layer has the same number of nodes as the number of actions;
  • the network structure of the cooperation degree detector is a three-layer structure, and all three layers are convolution layers.
  • the autoencoder branch and the cooperation degree detection branch are both connected after the third layer.
  • the adaptive game algorithm based on deep reinforcement learning is applied to sequential social dilemmas in multi-agent environments.
  • the adaptive game algorithm based on deep reinforcement learning of the present invention, the Deep Reinforcement Learning Framework towards Mutual Cooperation (DRLFMC), first uses a weighted target reward to learn strategies with different degrees of cooperation, and the average reward of the agents is used to verify that the weighted target reward can adjust the agents' degree of cooperation and train strategies with different degrees of cooperation. Policy generation is then used to synthesize strategies with different degrees of cooperation from the strategies produced with the weighted target reward, and the average reward of the agents is again used to show that the synthesized strategies have different degrees of cooperation. Strategies with different degrees of cooperation are then used to generate data sets for training the cooperation degree detection network (cd detection network) to detect the opponent's degree of cooperation. Finally, policy generation and the cd detection network are used to adjust the agent's own degree of cooperation according to the opponent's.
  • DRLFMC: Deep Reinforcement Learning Framework towards Mutual Cooperation
  • training strategy networks with different degrees of cooperation; using strategies with existing degrees of cooperation to generate strategies with new degrees of cooperation; detecting the opponent's degree of cooperation; and designing deep game strategies that draw on existing game-theoretic strategies.
  • drawing on Steven Damer's idea of the attitude-modified game [Achieving Cooperation in a Minimally Constrained Environment, Learning to Cooperate in Normal Form Games], a method is designed that can be applied simply to existing deep reinforcement learning to obtain strategies with different degrees of cooperation. At the same time, because an agent may have infinitely many strategies with different degrees of cooperation and competition, which cannot be learned one by one, the idea of He He's Mixture-of-Experts architecture for deep reinforcement learning is borrowed to construct strategies with a new degree of cooperation from strategies with existing degrees [Opponent Modeling in Deep Reinforcement Learning].
  • One way to get strategies for different levels of cooperation is to modify the key factors that affect the level of competition and cooperation in the environment, such as the amount of resources, the way of interaction between agents, and so on.
  • Joel Z. Leibo investigated whether resource changes affect cooperation and competition and found that self-interested independent learning agents cooperate more readily when resources are sufficient and compete more readily when resources are scarce; however, they only investigated the impact of resources and other factors on the agents' final learning results and did not go further to generate strategies with different degrees of cooperation.
  • Another way to obtain strategies with different degrees of cooperation is to modify the learning goals of the agents to obtain strategies with different degrees of cooperation.
  • Agents with the goal of maximizing their own interests will naturally learn more competitive strategies.
  • an agent whose learning goal is to maximize the overall interest will naturally learn a more cooperative strategy.
  • to achieve cooperation in a minimally constrained environment, Steven Damer modified the agent's reward to be its own reward in the original environment plus an attitude multiplied by the opponent's reward in the original environment; by considering both its own reward and the opponent's reward while maximizing this reward, cooperation was eventually reached [Achieving Cooperation in a Minimally Constrained Environment, Learning to Cooperate in Normal Form Games]. This idea can be used not only to make agents finally cooperate, but also to obtain strategies with different degrees of cooperation.
  • by controlling the magnitude of the attitude, the agent can weigh the importance of its own reward against the rewards of other agents during learning: the larger the attitude, the more the other agents' rewards are considered and the more cooperative the learned strategy becomes; the smaller the attitude, the more competitive the strategy the agent eventually learns.
  • this method can directly use current single-agent deep reinforcement learning (single-agent DRL) algorithms to obtain strategies with different degrees of cooperation (here it is assumed that the Markov game and related notation have been defined previously, e.g. there are N agents i_1, i_2, …, i_N; the actions of the agents are a_1, a_2, …; after taking the joint action (a_1, a_2, …) they obtain rewards r_1, r_2, …; and their strategies are π_1, π_2, …).
  • the present invention can obtain strategies with different degrees of cooperation by controlling the degree to which the rewards of other agents are considered, letting each agent take the rewards of the other agents into account; att_ef denotes the attitude of agent i_e towards the reward of agent i_f, and r'_e denotes the reward of agent i_e after the other agents are taken into account:
  r'_1 = att_11*r_1 + att_12*r_2 + … + att_1n*r_n
  r'_2 = att_21*r_1 + att_22*r_2 + … + att_2n*r_n
  …
  r'_n = att_n1*r_1 + att_n2*r_2 + … + att_nn*r_n
  • by adjusting the values of the attitudes we can obtain strategies with different degrees of cooperation: if the attitudes towards other agents are larger, the agent pays more attention to the other agents' payoffs and obtains a more cooperative strategy; if the values are smaller, the agent considers its own payoff more and obtains a more competitive strategy.
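As a minimal illustration (an assumption, not the patent's code), the modified rewards r'_e can be computed from an attitude matrix as follows; the names are hypothetical.

```python
import numpy as np

def attitude_rewards(att, rewards):
    """Return r'_e = sum_f att[e, f] * r_f for every agent e.

    att:     (N, N) attitude matrix; att[e, f] is agent e's attitude towards
             the reward of agent f (att[e, e] is typically 1).
    rewards: length-N vector of environment rewards r_1..r_N.
    """
    return np.asarray(att, dtype=float) @ np.asarray(rewards, dtype=float)

r = [1.0, 0.5]
print(attitude_rewards(np.eye(2), r))        # self-interested agents: r' == r
print(attitude_rewards(np.ones((2, 2)), r))  # fully cooperative agents: each r'_e is the total reward
```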
  • if each agent uses its own deep reinforcement learning algorithm to learn, then although the number of parameters grows only linearly, the neural networks have many parameters, so this is difficult to apply in environments with a larger number of agents.
  • a single network can instead be used to learn the joint strategy π_joint(a_1, a_2, …, a_n | s); the joint reward of this network can be defined as r_total = att_1*r_1 + att_2*r_2 + … + att_n*r_n, and the deep reinforcement learning algorithms common in the single-agent setting can again be used directly to obtain strategies with different degrees of cooperation.
  • att_e represents the degree to which r_total emphasizes the reward of agent i_e: the larger att_e is relative to the others, the higher the degree of competition of agent i_e, and the smaller att_e, the higher its degree of cooperation; agents with equal att_e have the same degree of cooperation.
  • agents can have different degrees of cooperation; for opponents with different degrees of cooperation, we may not want simply to choose a cooperative or a competitive strategy in response, but rather to respond with a strategy that itself has a certain degree of cooperation.
  • using the attitude reward, we can train strategies with different degrees of cooperation.
  • however, agents can adopt infinitely many strategies with different degrees of cooperation, and we cannot train a corresponding strategy for every degree of cooperation; we therefore propose to compose strategies with different degrees of cooperation from the strategies with different degrees of cooperation that have already been trained. He He designed a network structure that, in a fully competitive environment, can secure the agent's own payoff against opponents with different strategies; the idea of that network structure is to use some existing strategies and, by judging the opponent's behavior, predict which of those strategies can best deal with the opponent, and then to obtain a strategy against the current opponent by combining the existing strategies.
  • this idea can be used not only to adjust one's own strategy against different opponents in a fully competitive environment, but also in sequential social dilemmas to generate strategies with different degrees of cooperation. If we use value-based deep reinforcement learning algorithms to learn the strategies with different degrees of cooperation, then we can treat each different strategy as an expert network, where π^i_att_n denotes the strategy of agent i obtained by training with cooperation degree att_n.
  • the synthesized strategy π_new can be regarded as having cooperation degree att_new = f_value(att_1, att_2, …, att_n, w_1, w_2, …, w_n), where f_value(att_1, att_2, …, att_n, w_1, w_2, …, w_n) is the degree of cooperation obtained by combining the basic strategies with the corresponding weights w_1, w_2, …, w_n; that is, we take into account both the cooperation degrees att_i of the existing strategies and their degrees of influence w_i on the new strategy when computing the degree of cooperation of the new strategy. If we assume that strategies with different degrees of cooperation superimpose linearly, and that their degrees of cooperation also superimpose linearly, then we write f_value as f_value(att_1, att_2, …, att_n, w_1, w_2, …, w_n) = att_1*w_1 + att_2*w_2 + … + att_n*w_n. This transfers naturally to strategies with different degrees of cooperation learned with policy-based deep reinforcement learning; agent i's strategy with a new degree of cooperation is then characterized by att_new = f_policy(att_1, att_2, …, att_n, w_1, w_2, …, w_n).
  • a common approach in game theory is to adjust one's own strategy based on the strategies of other agents. For example, Tit for tat: the first round chooses to cooperate, and whether the next round chooses to cooperate depends on whether the other party cooperated; if the other party defected last round, the agent also defects this round, and if the other party cooperated last round, this round continues to cooperate.
  • the strategy does not need to determine exactly which strategy the opponent uses; it can judge whether the opponent is cooperating or competing from the resulting situation.
  • Π_k is the set of strategies with various degrees of cooperation obtained by agent i_k through training strategies with different degrees of cooperation and generating strategies with different degrees of cooperation
  • f is the function giving the degree of cooperation with which agent i is labeled in the current data
  • An autoencoder contains two parts, an encoder and a decoder.
  • denote the encoder as φ and the decoder as ψ; the autoencoder is trained so that the reconstruction ψ(φ(x)) is as close as possible to the input x.
  • the features extracted by the first few layers are then effective representations of x, and with an effective representation of x the classification effect is better.
  • the training speed of the classifier is accelerated.
  • the underlying network becomes more stable, and oscillation during classifier training is reduced.
  • the classifier part of the network is trained by minimizing Σ_x L(f_c(x), label).
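The following PyTorch sketch (an illustration under stated assumptions, not the patent's implementation) shows the multi-task idea: an encoder shared by a reconstruction head and a classification head, trained with a combined loss so that the autoencoder stabilizes the shared layers.

```python
import torch
import torch.nn as nn

class SharedEncoderMultiTask(nn.Module):
    """Shared encoder with an autoencoder head and a classifier head."""
    def __init__(self, in_dim=64, feat_dim=16, n_classes=2):
        super().__init__()
        self.encoder = nn.Sequential(nn.Linear(in_dim, feat_dim), nn.ReLU())      # phi
        self.decoder = nn.Sequential(nn.Linear(feat_dim, in_dim), nn.Sigmoid())   # psi
        self.classifier = nn.Linear(feat_dim, n_classes)                           # f_c

    def forward(self, x):
        z = self.encoder(x)
        return self.decoder(z), self.classifier(z)

model = SharedEncoderMultiTask()
opt = torch.optim.Adam(model.parameters(), lr=1e-3)
x = torch.rand(8, 64)                     # toy batch
label = torch.randint(0, 2, (8,))
recon, logits = model(x)
# the reconstruction loss stabilizes the shared layers; the classification
# loss corresponds to minimizing sum_x L(f_c(x), label)
loss = nn.functional.mse_loss(recon, x) + nn.functional.cross_entropy(logits, label)
opt.zero_grad()
loss.backward()
opt.step()
```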
  • π_new(s) = argmax_a Q_new(s, a)
  • our algorithm is applied in two games: the Fruit Gathering game and the Apple-Pear game.
  • the Fruit Gathering game has two agents in the environment: red and blue.
  • the goal of the agent is to collect apples. Apples are represented by green pixels.
  • when an agent collects an apple it receives a reward r_apple, and the apple disappears from the environment, reappearing after N_apple time steps.
  • the agent can emit a beam of light from its own position toward its own front.
  • if an agent is hit twice by the beam emitted by the other agent, it disappears from the environment for N_tagged time steps and then reappears.
  • the agent can choose 8 actions in the environment: forward, backward, left, right, turn left, turn right, use light and stay in place.
  • an agent can tend to cooperate, collecting fruit together and emitting few beams, or it can choose to compete, emitting more beams while collecting fruit and trying to hit the opponent so as to remove it from the environment and obtain more fruit.
  • in the Apple-Pear game there are two agents in the environment, a red agent and a blue agent, and two fruits, a red apple and a green pear.
  • the agent moves to collect fruits in the environment.
  • the blue agent prefers the apple, so for agent1 the reward is r_prefer when it obtains the apple alone and r_common when it obtains the pear alone.
  • the red agent is the opposite; in both cases r_prefer > r_common.
  • the agent can perform 4 actions in the environment: forward, backward, left, and right, and each step incurs a cost c_step.
  • the state s observed by the agent in the environment is image information of size 84 × 84 × 3.
  • the Apple-Pear game is simpler than the Fruit Gathering game, so competition and cooperation are easier to analyze in it.
  • an agent that collects only its own fruit is cooperating, while an agent that collects all the fruit is competing, so the Apple-Pear game can be used to analyze the effectiveness of our algorithm more clearly.
  • the average rewards of agent1 and agent2 can also be used to verify that the degree of cooperation of the synthesized strategy lies between those of the basic strategies.
  • the agent's own reward increases as its own degree of cooperation is reduced, and is reduced as the opponent's degree of cooperation is increased.
  • together with the reward when both choose to cooperate, this shows that the Fruit Gathering and Apple-Pear games can be regarded as SPDs with respect to the degree of cooperation.
  • the actor and critic share the underlying network: the first hidden layer uses 32 8×8 convolution kernels with stride 4 and is activated by ReLU.
  • the second hidden layer uses 64 4×4 convolution kernels with stride 2 and is activated by ReLU.
  • the third hidden layer uses 64 3×3 convolution kernels with stride 1 and is activated by ReLU.
  • the actor head is a fully connected layer with 128 nodes activated by ReLU, and its last layer has the same number of nodes as actions and is activated by softmax.
  • the critic is similar in structure to the actor but ends with a single output.
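A minimal PyTorch sketch of this actor-critic network, assuming an 84 × 84 × 3 input and no padding in the convolutions (the standard DQN trunk); it is illustrative only, not the patent's implementation.

```python
import torch
import torch.nn as nn

class ActorCritic(nn.Module):
    """Actor and critic sharing the convolutional trunk described above."""
    def __init__(self, n_actions=8):
        super().__init__()
        self.trunk = nn.Sequential(                                  # shared underlying network
            nn.Conv2d(3, 32, kernel_size=8, stride=4), nn.ReLU(),    # 84 -> 20
            nn.Conv2d(32, 64, kernel_size=4, stride=2), nn.ReLU(),   # 20 -> 9
            nn.Conv2d(64, 64, kernel_size=3, stride=1), nn.ReLU(),   # 9 -> 7
            nn.Flatten(),
        )
        self.actor = nn.Sequential(
            nn.Linear(64 * 7 * 7, 128), nn.ReLU(),
            nn.Linear(128, n_actions), nn.Softmax(dim=-1),           # action probabilities
        )
        self.critic = nn.Sequential(
            nn.Linear(64 * 7 * 7, 128), nn.ReLU(),
            nn.Linear(128, 1),                                       # single value output
        )

    def forward(self, obs):                 # obs: (batch, 3, 84, 84) scaled to [0, 1]
        h = self.trunk(obs)
        return self.actor(h), self.critic(h)

pi, v = ActorCritic()(torch.rand(1, 3, 84, 84))
```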
  • the network structure of the cooperation degree detector is shown in Figure 2.
  • the structure of the common part is: the first layer is a convolution with 10 3×3 kernels and stride 2, activated by ReLU.
  • the second layer has the same structure as the first.
  • the third layer is a convolution with 10 3×3 kernels and stride 3, activated by ReLU.
  • the autoencoder branch connected to the third layer is: a deconvolution with 10 3×3 kernels, activated by sigmoid, with output shape 21 × 21 × 10; then a deconvolution with 10 3×3 kernels and stride 2, activated by sigmoid, with output shape 42 × 42 × 10; and finally a deconvolution with 10 3×3 kernels, activated by sigmoid, with output shape 84 × 84 × 10.
  • the cd detection branch also follows the third layer of the network and consists of two LSTM (long short-term memory) layers of 256 nodes each, followed by a single output node.
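A PyTorch sketch of the detector is given below. It assumes padding choices that produce the spatial sizes 84 → 42 → 21 → 7 implied by the deconvolution output shapes, and that the LSTM runs over the flattened convolutional features of successive frames; these details and the final sigmoid are assumptions, not taken from the patent.

```python
import torch
import torch.nn as nn

class CooperationDegreeDetector(nn.Module):
    """Shared conv encoder with an autoencoder branch and an LSTM cd branch."""
    def __init__(self):
        super().__init__()
        self.encoder = nn.Sequential(                                   # shared part
            nn.Conv2d(3, 10, 3, stride=2, padding=1), nn.ReLU(),        # 84 -> 42
            nn.Conv2d(10, 10, 3, stride=2, padding=1), nn.ReLU(),       # 42 -> 21
            nn.Conv2d(10, 10, 3, stride=3), nn.ReLU(),                  # 21 -> 7
        )
        self.decoder = nn.Sequential(                                   # autoencoder branch
            nn.ConvTranspose2d(10, 10, 3, stride=3), nn.Sigmoid(),                              # 7 -> 21
            nn.ConvTranspose2d(10, 10, 3, stride=2, padding=1, output_padding=1), nn.Sigmoid(), # 21 -> 42
            nn.ConvTranspose2d(10, 10, 3, stride=2, padding=1, output_padding=1), nn.Sigmoid(), # 42 -> 84
        )
        self.lstm = nn.LSTM(10 * 7 * 7, 256, num_layers=2, batch_first=True)  # cd detection branch
        self.head = nn.Linear(256, 1)                                   # single output node

    def forward(self, frames):              # frames: (batch, time, 3, 84, 84)
        b, t = frames.shape[:2]
        feats = self.encoder(frames.reshape(b * t, 3, 84, 84))
        recon = self.decoder(feats)         # 84 x 84 x 10 reconstruction, as stated in the text
        out, _ = self.lstm(feats.reshape(b, t, -1))
        cd = torch.sigmoid(self.head(out[:, -1]))   # estimated cooperation degree
        return recon, cd

recon, cd = CooperationDegreeDetector()(torch.rand(2, 5, 3, 84, 84))
```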
  • the exploration rate starts at 1 and gradually decreases to 0.1 over 20,000 steps.
  • the weight update uses a soft update ("soft" target updates) to avoid large fluctuations of the strategy: θ′ ← 0.05θ + 0.095θ′, where θ is a parameter of the target network and θ′ is a parameter of the strategy network used by the agent; the learning rate is 0.0001, the replay memory has a size of 25,000, and the batch size is 128.
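For reference, the exploration schedule and soft update can be written as in the following sketch (illustrative only; the update coefficients are quoted from the text as given).

```python
def epsilon(step, start=1.0, end=0.1, anneal_steps=20_000):
    """Exploration rate: starts at 1.0 and decays to 0.1 over 20,000 steps."""
    frac = min(step / anneal_steps, 1.0)
    return start + frac * (end - start)

def soft_update(theta, theta_prime, tau=0.05, keep=0.095):
    """theta' <- tau * theta + keep * theta'.

    The coefficients 0.05 and 0.095 are quoted from the text; a conventional
    soft update would use keep = 1 - tau.
    """
    return [tau * a + keep * b for a, b in zip(theta, theta_prime)]
```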
  • the range of w 1 (that is, att 1 in the table) is set from 0.1 to 0.9.
  • a larger w 1 means that the strategies learned by agent 1 will be more competitive, and the strategies learned by agent 2 will be more cooperative.
  • the final results of different w 1 trainings are shown in the following table.
  • the average reward of agent 1 and the average reward of agent 2 represent the average reward that only considers the fruit collected by the agent.
  • the changes of the agent1 average reward and the agent2 average reward with training episode are shown in Figure 5.
  • the average reward of agent2 gradually rises as w_1 decreases.
  • the agents can also learn with independent network structures; due to space limitations, we only show part of the cooperation degree learning results. As shown in the following table, the results with the joint-action network structure are comparable: the reward rises and falls with the agent's own degree of cooperation and rises as the opponent's degree of cooperation rises.
  • the strategy of agent1 is used as the basic competition strategy of agent1
  • the basic competition strategy determines the highest degree of competition for the agent, so we can limit the scope of agent cooperation and competition by choosing different strategies.
  • the strategy we selected satisfies: the degree of cooperation of the basic cooperation strategy is higher than that of the basic competition strategy.
  • Agent1 uses the Policy Generation algorithm.
  • π_new = w_c * π_c + (1 − w_c) * π_d
  • w_c ranges from 0 to 1 in intervals of 0.1. From the average reward it can be seen that as w_c rises, the average reward of agent1 gradually decreases, and that for a fixed w_c the average reward is larger against a more cooperative agent2. The table therefore shows that using different w_c in Policy Generation can synthesize strategies with different degrees of cooperation.
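The mixture π_new = w_c·π_c + (1 − w_c)·π_d can be sketched as follows (illustrative only; `pi_c` and `pi_d` stand for the action distributions of the basic cooperative and competitive strategies in a given state).

```python
import numpy as np

def mix_policies(pi_c, pi_d, w_c):
    """Return pi_new(a|s) = w_c * pi_c(a|s) + (1 - w_c) * pi_d(a|s)."""
    return w_c * np.asarray(pi_c, dtype=float) + (1.0 - w_c) * np.asarray(pi_d, dtype=float)

# Sweeping w_c from 0 to 1 in steps of 0.1 yields strategies whose degree of
# cooperation lies between the two basic strategies.
for w_c in np.arange(0.0, 1.01, 0.1):
    pi_new = mix_policies([0.1, 0.9], [0.8, 0.2], w_c)
```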
  • both agent1 and agent2 use Policy Generation to generate policies, and then use the generated policies to play with each other.
  • the Apple-Pear game can be regarded as an SPD with respect to the degree of cooperation between agents: if only its own interest is considered, an agent will choose a more competitive strategy, but if both agents choose competitive strategies, their respective rewards are lower than when both choose cooperative strategies, and from the total reward it can be seen that the total reward is largest in the cooperation-cooperation case.
  • agent1 plays strategies with cooperation degree (cd) from 0 to 1 in intervals of 0.1 against agent2 strategies with cd from 0 to 1 in intervals of 0.1.
  • the average value is shown in the following figure.
  • 1. agent1 initially uses a cooperative strategy and agent2 initially uses a competitive strategy;
  • 2. agent1 initially uses a competitive strategy and agent2 initially uses a cooperative strategy;
  • 3. both agents initially use cooperative strategies;
  • 4. both agents initially use competitive strategies. In self-play, both agents eventually converged to the cooperation-cooperation state, achieving a win-win situation.
  • all four initial settings eventually converged to cooperation-cooperation. If both parties cooperate initially, they remain cooperative. In the other three cases, the agents do not immediately adjust their strategies and first reach competition-competition; in that case, after one agent has tagged the other out of the environment, it begins collecting the remaining fruit and waiting for apples to appear. Since the other agent has been tagged out of the environment, the degree of cooperation given by the detector is slightly higher than that of a competing opponent, so the agent tends toward cooperating with the other agent. During this period the tagged agent may reappear; at that moment the detection part of the agent is still based on the other agent's behavior before it reappeared, so the two agents collect apples without actively tagging each other, and the two parties may converge to cooperation-cooperation.
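The adaptive behaviour observed in these experiments can be summarised by the following loop (a hedged sketch, not the patent's code): the detector estimates the opponent's cooperation degree from recent observations, and policy generation is used to respond with a matching degree, in the spirit of Tit for tat. `env`, `detector` and `policy_generator` are placeholders for the game environment, the trained cd detection network and the policy-generation routine described above.

```python
def adaptive_play(env, detector, policy_generator, episodes=100, initial_cd=1.0):
    """Tit-for-tat-style adaptation: mirror the opponent's detected cooperation degree."""
    own_cd = initial_cd                      # start cooperatively, as in Tit for tat
    for _ in range(episodes):
        policy = policy_generator(own_cd)    # synthesize a strategy with degree own_cd
        frames = env.play_episode(policy)    # observations gathered while playing the opponent
        own_cd = detector(frames)            # next episode: respond with the detected degree
    return own_cd
```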

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • General Health & Medical Sciences (AREA)
  • Computing Systems (AREA)
  • Biomedical Technology (AREA)
  • Biophysics (AREA)
  • Computational Linguistics (AREA)
  • Data Mining & Analysis (AREA)
  • Evolutionary Computation (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Molecular Biology (AREA)
  • Artificial Intelligence (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Mathematical Physics (AREA)
  • Software Systems (AREA)
  • Health & Medical Sciences (AREA)
  • Management, Administration, Business Operations System, And Electronic Commerce (AREA)
  • Electrically Operated Instructional Devices (AREA)

Abstract

The present invention relates to the field of data processing, and discloses a deep reinforcement learning-based adaptive game algorithm, comprising the following steps: (A) acquiring policies with different degrees of cooperation; (B) generating policies with different degrees of cooperation; (C) detecting the cooperation policy of an opponent; and (D) formulating different coping policies. The technical effects of the present invention are as follows: trained detectors and policies with different degrees of cooperation are used to implement existing concepts, such as Tit for tat, in sequential social dilemmas, improving the scalability of the agent and more intuitively acquiring competition policies superior to the agent's own.

Description

Adaptive game algorithm based on deep reinforcement learning
[Technical Field]
The invention relates to the field of data processing, and in particular to an adaptive game algorithm based on deep reinforcement learning.
[Background Art]
Reinforcement learning is applied in many fields, from games to robot control. Traditional reinforcement learning represents value functions or policies with tables or linear functions and is difficult to extend to complex problems; deep reinforcement learning, which combines reinforcement learning with deep learning, exploits the feature-extraction and function-approximation abilities of neural networks and has already seen some successful applications [DQN][Alpha Zero][PPO]. The prisoner's dilemma (PD game) has long been a research focus of matrix games. The PD game regards cooperation and competition as atomic actions, but in the real world a game consists of a series of actions; a temporally extended PD is called a sequential prisoner's dilemma (SPD). For the PD game, most multi-agent reinforcement learning (MARL) algorithms are based on traditional reinforcement learning and are difficult to extend directly to the SPD game. [SSD] observed the influence of resource changes on agents that only consider their own payoff, but did not propose a corresponding learning algorithm based on the characteristics of the SPD.
[Summary of the Invention]
In order to solve the problems in the prior art, the present invention provides an adaptive game algorithm based on deep reinforcement learning, to solve the problem of poor scalability of multi-agent systems in the prior art.
The present invention is realized through the following technical scheme: an adaptive game algorithm based on deep reinforcement learning is designed, comprising the following steps: (A) acquiring strategies with different degrees of cooperation; (B) generating strategies with different degrees of cooperation; (C) detecting the opponent's cooperation strategy; (D) formulating different coping strategies.
As a further improvement of the present invention: in step (A), training is performed using different network structures and/or different target reward forms to obtain strategies with different degrees of cooperation.
As a further improvement of the present invention: in step (A), strategies with different degrees of cooperation are obtained by modifying key factors in the environment that influence the degree of competition and cooperation, or by modifying the agent's learning goal.
As a further improvement of the present invention: in step (B), the strategies with different degrees of cooperation obtained in step (A) are set as expert networks, and the strategies with different degrees of cooperation in the expert networks are assigned weights; a strategy with a new degree of cooperation is then generated according to the degree of influence of each strategy.
As a further improvement of the present invention: the specific process of the algorithm for generating a strategy with a new degree of cooperation is as follows. Among the expert networks, π^i_att_n denotes the strategy of agent i obtained by training with cooperation degree att_n in the setting for training strategies with different degrees of cooperation. Each expert network predicts the state-action value obtained in the current state by playing with the other agents using the strategy of its cooperation degree. Based on the existing strategies π^i_att_1, …, π^i_att_n, a strategy π^i_new with a new degree of cooperation att_new is obtained for agent i by assigning different weights w_1, w_2, …, w_n:
att_new = f_value(att_1, att_2, …, att_n, w_1, w_2, …, w_n)
where f_value(att_1, att_2, …, att_n, w_1, w_2, …, w_n) is the degree of cooperation obtained by combining the basic strategies π^i_att_1, …, π^i_att_n with the corresponding weights w_1, w_2, …, w_n, and f_value is:
f_value(att_1, att_2, …, att_n, w_1, w_2, …, w_n) = att_1*w_1 + att_2*w_2 + … + att_n*w_n
Agent i's strategy with the new degree of cooperation is π^i_new, and att_new = f_policy(att_1, att_2, …, att_n, w_1, w_2, …, w_n) is defined to measure the degree of cooperation of the synthesized strategy.
As a further improvement of the present invention: in step (C), a neural network is used to determine the degree of cooperation; the structure of the neural network adopts a multi-task mode, combining an autoencoder structure with a classifier structure, and the autoencoder and classifier share the underlying parameters of the neural network.
As a further improvement of the present invention: in step (D), the opponent is assessed by detecting its degree of cooperation, and a strategy more favorable to the agent itself is generated.
As a further improvement of the present invention: the different network structures include the strategy network structure and the network structure of the cooperation degree detector.
As a further improvement of the present invention: the adaptive game algorithm based on deep reinforcement learning is applied in multi-agent environments.
The beneficial effects of the present invention are: using the trained detector and the strategies with different degrees of cooperation, existing ideas such as Tit for tat are implemented in sequential social dilemmas; the scalability of the agent is improved; and competitive strategies superior to the agent's own are obtained more intuitively.
[Brief Description of the Drawings]
FIG. 1 is a schematic diagram of generating strategies with different degrees of cooperation according to the present invention;
FIG. 2 is a schematic diagram of the network structure of the cooperation degree detector according to the present invention;
FIG. 3 is a schematic diagram of formulating different coping strategies according to the present invention.
[Detailed Description]
The present invention is further described below with reference to the accompanying drawings and specific embodiments.
An adaptive game algorithm based on deep reinforcement learning comprises the following steps: (A) acquiring strategies with different degrees of cooperation; (B) generating strategies with different degrees of cooperation; (C) detecting the opponent's cooperation strategy; (D) formulating different coping strategies.
In step (A), training is performed using different network structures and/or different target reward forms to obtain strategies with different degrees of cooperation.
In step (A), strategies with different degrees of cooperation are obtained by modifying key factors in the environment that influence the degree of competition and cooperation, or by modifying the agent's learning goal.
In step (B), the strategies with different degrees of cooperation obtained in step (A) are set as expert networks, and the strategies with different degrees of cooperation in the expert networks are assigned weights; a strategy with a new degree of cooperation is then generated according to the degree of influence of each strategy.
The specific process of the algorithm for generating a strategy with a new degree of cooperation is as follows. Among the expert networks, π^i_att_n denotes the strategy of agent i obtained by training with cooperation degree att_n in the setting for training strategies with different degrees of cooperation. Each expert network predicts the state-action value obtained in the current state by playing with the other agents using the strategy of its cooperation degree. Based on the existing strategies π^i_att_1, …, π^i_att_n, a strategy π^i_new with a new degree of cooperation att_new is obtained for agent i by assigning different weights w_1, w_2, …, w_n:
att_new = f_value(att_1, att_2, …, att_n, w_1, w_2, …, w_n)
where f_value(att_1, att_2, …, att_n, w_1, w_2, …, w_n) is the degree of cooperation obtained by combining the basic strategies π^i_att_1, …, π^i_att_n with the corresponding weights w_1, w_2, …, w_n, and f_value is:
f_value(att_1, att_2, …, att_n, w_1, w_2, …, w_n) = att_1*w_1 + att_2*w_2 + … + att_n*w_n
Agent i's strategy with the new degree of cooperation is π^i_new, and att_new = f_policy(att_1, att_2, …, att_n, w_1, w_2, …, w_n) is defined to measure the degree of cooperation of the synthesized strategy.
In step (C), a neural network is used to determine the degree of cooperation; the structure of the neural network adopts a multi-task mode, combining an autoencoder structure with a classifier structure, and the autoencoder and classifier share the underlying parameters of the neural network.
In step (D), the opponent is assessed by detecting its degree of cooperation, and a strategy more favorable to the agent itself is generated.
The different network structures include the strategy network structure and the network structure of the cooperation degree detector.
The strategy network structure is similar to the DQN structure and uses five layers: the first, second, and third hidden layers are all convolutional layers, the fourth layer is a fully connected layer, and the last layer has the same number of nodes as the number of actions. The network structure of the cooperation degree detector is a three-layer structure in which all three layers are convolutional layers; the autoencoder branch and the cooperation degree detection branch are both connected after the third layer.
The adaptive game algorithm based on deep reinforcement learning is applied to sequential social dilemmas in multi-agent environments.
The adaptive game algorithm based on deep reinforcement learning of the present invention, the Deep Reinforcement Learning Framework towards Mutual Cooperation (DRLFMC), first uses a weighted target reward to learn strategies with different degrees of cooperation, and the average reward of the agents is used to verify that the weighted target reward can adjust the agents' degree of cooperation and train strategies with different degrees of cooperation. Policy generation is then used to synthesize strategies with different degrees of cooperation from the strategies produced with the weighted target reward, and the average reward of the agents is again used to show that the synthesized strategies have different degrees of cooperation. Strategies with different degrees of cooperation are then used to generate data sets for training the cooperation degree detection network (cd detection network) to detect the opponent's degree of cooperation. Finally, policy generation and the cd detection network are used to adjust the agent's own degree of cooperation according to the opponent's: when the opponent leans toward cooperation, we also use a strategy leaning toward cooperation to achieve a win-win outcome; when the opponent uses a competitive strategy, we also adopt a competitive strategy to ensure our own minimum payoff.
Strategy networks with different degrees of cooperation are trained; strategies with existing degrees of cooperation are used to generate strategies with new degrees of cooperation; the opponent's degree of cooperation is detected; and deep game strategies are designed that draw on existing game-theoretic strategies. Joel Z. Leibo pointed out that in sequential social dilemmas the strategies of agents are not simply divided into cooperation and competition: some strategies lean toward cooperation but still contain competition, and some lean toward competition but still contain cooperation [Multi-agent Reinforcement Learning in Sequential Social Dilemmas]. Existing deep reinforcement learning algorithms ultimately learn only cooperative or competitive strategies and cannot learn a strategy with a specific degree of cooperation. Drawing on Steven Damer's idea of the attitude-modified game [Achieving Cooperation in a Minimally Constrained Environment, Learning to Cooperate in Normal Form Games], a method is designed that can be applied simply to existing deep reinforcement learning to obtain different degrees of cooperation. Because an agent may have infinitely many strategies with different degrees of cooperation and competition, which cannot be learned one by one, the idea of He He's Mixture-of-Experts architecture for deep reinforcement learning is borrowed to construct strategies with a new degree of cooperation from strategies with existing degrees [Opponent Modeling in Deep Reinforcement Learning]. For common problems in matrix games, such as social dilemmas, there already exist simple but effective heuristics such as Tit for tat, but these are hard to apply directly to sequential social dilemmas and deep reinforcement learning: on the one hand, agent strategies are not clear-cut cooperation or competition; on the other hand, implementing Tit for tat requires knowing whether the opponent cooperates. A method is therefore proposed for training a detector of the opponent's degree of cooperation; with the trained detector and strategies with different degrees of cooperation, existing ideas such as Tit for tat can be implemented in sequential social dilemmas.
The present invention is described below in four aspects.
I. Training strategies with different degrees of cooperation
One way to obtain strategies with different degrees of cooperation is to modify key factors in the environment that influence the degree of competition and cooperation, such as the amount of resources or the way agents interact. Joel Z. Leibo investigated whether resource changes affect cooperation and competition and found that self-interested independent learning agents cooperate more readily when resources are sufficient and compete more readily when resources are scarce; however, they only investigated the effect of resources and other factors on the agents' final learning results and did not go further to generate strategies with different degrees of cooperation. Therefore, in order to obtain strategies with different degrees of cooperation and competition in an environment e, the environment parameters can be set to obtain an environment e_scarcity that is scarcer in resources than e and more prone to competition; strategies obtained in e_scarcity will behave more competitively and less cooperatively in e. Conversely, an environment e_sufficient with more resources than e can be set, and strategies obtained in e_sufficient will behave more cooperatively and less competitively in e [Multi-agent Reinforcement Learning in Sequential Social Dilemmas].
Another way to obtain strategies with different degrees of cooperation is to modify the agent's learning goal. An agent whose learning goal is to maximize its own interest will naturally learn a more competitive strategy, and an agent whose learning goal is to maximize the overall interest will naturally learn a more cooperative strategy. To achieve cooperation in a minimally constrained environment, Steven Damer modified the agent's reward to be its own reward in the original environment plus an attitude multiplied by the opponent's reward in the original environment; by considering both its own reward and the opponent's reward while maximizing this reward, cooperation was eventually reached [Achieving Cooperation in a Minimally Constrained Environment, Learning to Cooperate in Normal Form Games]. This idea can be used not only to make agents finally cooperate, but also to obtain strategies with different degrees of cooperation. By controlling the magnitude of the attitude, the agent can weigh the importance of its own reward against the rewards of other agents during learning: the larger the attitude, the more the other agents' rewards are considered and the more cooperative the learned strategy becomes; the smaller the attitude, the more competitive the learned strategy becomes. At the same time, this approach can directly use current single-agent deep reinforcement learning (single-agent DRL) algorithms to obtain strategies with different degrees of cooperation (here it is assumed that the Markov game and related notation have been defined previously, e.g. there are N agents i_1, i_2, …, i_N; the actions of the agents are a_1, a_2, …; after taking the joint action (a_1, a_2, …) they obtain rewards r_1, r_2, …; and their strategies are π_1, π_2, …).
In order to obtain strategies with different degrees of cooperation, the present invention controls the degree to which the rewards of other agents are considered, letting each agent take the rewards obtained by the other agents into account. Here att_ef denotes the attitude of agent i_e towards the reward of agent i_f, r'_e denotes the reward of agent i_e after the other agents are taken into account, and the attitudes of agent i_e towards all agents can be written as attitude_e = {att_e1, att_e2, …, att_en}, n ∈ N; the deep reinforcement learning algorithms common in the single-agent setting can then be used directly to obtain strategies with different degrees of cooperation.
r'_1 = att_11*r_1 + att_12*r_2 + … + att_1n*r_n
r'_2 = att_21*r_1 + att_22*r_2 + … + att_2n*r_n
…
r'_n = att_n1*r_1 + att_n2*r_2 + … + att_nn*r_n
If an agent does not consider the rewards of other agents at all, i.e. {att_ef = 0 if e ≠ f}, this is the purely self-interested case: the agent maximizes its own cumulative payoff regardless of the payoffs of the other agents, ultimately leading to full competition. For {att_ef = 1 for all e, f ∈ N}, this is the cooperation case: the agents cooperate to maximize the global cumulative payoff. By adjusting the values of the attitudes we can obtain strategies with different degrees of cooperation: if the attitudes towards other agents are larger, the agent pays more attention to the other agents' payoffs and obtains a more cooperative strategy; if the values are smaller, the agent considers its own payoff more and obtains a more competitive strategy.
If each agent learns with its own deep reinforcement learning algorithm, then although the number of parameters grows only linearly, the neural networks have many parameters, so this is difficult to apply in environments with a larger number of agents. At the same time, because each agent is independent, more data is needed during learning. A single network can therefore be used to learn the joint strategy π_joint(a_1, a_2, …, a_n | s); sharing the underlying network alleviates the growth in parameters, and explicitly learning π_joint(a_1, a_2, …, a_n | s) for each a_1, a_2, …, a_n makes it possible to learn strategies with different degrees of cooperation more effectively. The joint reward of this network can be defined as follows, and the deep reinforcement learning algorithms common in the single-agent setting can again be used directly to obtain strategies with different degrees of cooperation.
r_total = att_1*r_1 + att_2*r_2 + … + att_n*r_n. This network obtains strategies with different degrees of cooperation by maximizing the accumulated r_total; att_e represents the degree to which r_total emphasizes the reward of agent i_e. Through the relative size of att_e, the larger att_e is, the higher the degree of competition of agent i_e, and the smaller att_e, the higher its degree of cooperation; agents with equal att_e have the same degree of cooperation. From the learned joint strategy π_joint(a_1, a_2, …, a_n | s), we can obtain the strategy for agent i:
[Equation: the strategy of agent i derived from the joint strategy π_joint(a_1, a_2, …, a_n | s)]
Generating strategies with different degrees of cooperation
We know that in Sequential Social Dilemmas agents can exhibit different degrees of cooperation. Against an opponent with a particular degree of cooperation, we may not want to respond simply with a purely cooperative or purely competitive strategy, but rather with a strategy that itself has a certain degree of cooperation. Using the attitude reward we can train strategies with different degrees of cooperation, but since an agent can in principle adopt infinitely many degrees of cooperation, we cannot train a separate strategy for each of them. We therefore propose a method that composes strategies with new degrees of cooperation from the strategies that have already been trained. He He designed a network structure that, in a fully competitive environment, can secure an agent's payoff against opponents with different strategies; the idea is to reuse a set of existing strategies, predict from the opponent's behaviour which of them respond well to it, and obtain a strategy against the current opponent by combining the existing ones. This idea can be used not only to adapt one's strategy to different opponents in a fully competitive environment, but also in Sequential Social Dilemmas to generate strategies with different degrees of cooperation. If value-based deep reinforcement learning is used to train the strategies with different degrees of cooperation, each trained strategy can be regarded as an expert network,

π_i^{att_n}

which denotes the strategy of agent i obtained with cooperation degree att_n in the training setting described above. Each expert network predicts the state-action value that playing with the other agents under its particular degree of cooperation would obtain in the current state. From the set of existing strategies with known degrees of cooperation

{π_i^{att_1}, π_i^{att_2}, ..., π_i^{att_n}}

we can obtain a strategy with a new degree of cooperation for agent i by assigning different weights w_1, w_2, ..., w_n:

(Equation image: the new strategy π_new for agent i composed from the expert networks with the weights w_1, w_2, ..., w_n.)
For π_new we can regard it as a strategy whose degree of cooperation is

att_new = f_value(att_1, att_2, ..., att_n, w_1, w_2, ..., w_n),

where f_value(att_1, att_2, ..., att_n, w_1, w_2, ..., w_n) is the degree of cooperation obtained when the basic strategies

{π_i^{att_1}, π_i^{att_2}, ..., π_i^{att_n}}

are combined with the corresponding weights w_1, w_2, ..., w_n. In other words, the degree of cooperation of the new strategy is computed from the degrees att_i of the existing strategies and their degrees of influence w_i on the new strategy. If we assume that when strategies with different degrees of cooperation are linearly superimposed their degrees of cooperation are also linearly superimposed, then f_value can be written as:
f_value(att_1, att_2, ..., att_n, w_1, w_2, ..., w_n) = att_1*w_1 + att_2*w_2 + … + att_n*w_n
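The sketch below illustrates the value-based composition under the linearity assumption above: the expert Q-values are mixed with the weights w, and the resulting degree of cooperation is the weighted sum of the experts' att values. The toy Q-functions and numbers are stand-ins for the trained expert networks, not results from the invention.

```python
import numpy as np

def compose_value_experts(q_values, weights):
    """Mix expert state-action values: Q_new(s, .) = sum_k w_k * Q_k(s, .).

    q_values: (n_experts, n_actions) array, row k is expert k's Q(s, a) for the current state.
    weights:  (n_experts,) mixing weights w_1, ..., w_n.
    Returns the mixed Q-values and the greedy action of the composed strategy.
    """
    q_values = np.asarray(q_values, dtype=float)
    weights = np.asarray(weights, dtype=float)
    q_new = weights @ q_values
    return q_new, int(np.argmax(q_new))

def composed_cooperation_degree(atts, weights):
    """att_new = att_1*w_1 + ... + att_n*w_n (the linear f_value above)."""
    return float(np.dot(atts, weights))

# Two toy experts for one state: a cooperative one (att = 1) and a competitive one
# (att = 0), mixed 70/30 towards cooperation (all numbers are assumptions).
q_values = [[0.2, 0.8, 0.1],   # cooperative expert
            [0.9, 0.1, 0.3]]   # competitive expert
weights = [0.7, 0.3]
q_new, action = compose_value_experts(q_values, weights)
print(q_new, action)                                      # [0.41 0.59 0.16] 1
print(composed_cooperation_degree([1.0, 0.0], weights))   # 0.7
```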
Similarly, this idea transfers naturally to strategies with different degrees of cooperation learned with policy-based deep reinforcement learning. The strategy with a new degree of cooperation for agent i is:

(Equation image: the new strategy for agent i obtained by weighting the policy-based strategies π_i^{att_1}, ..., π_i^{att_n} with w_1, w_2, ..., w_n.)

Similar to the value-based case, we can likewise define att_new to measure the degree of cooperation of the composed strategy:

att_new = f_policy(att_1, att_2, ..., att_n, w_1, w_2, ..., w_n)
Detecting the opponent's degree of cooperation
A common practice in game theory is to adjust one's own strategy according to the strategies of the other agents. Tit for tat, for example, cooperates in the first round and then chooses in each later round whatever the opponent did in the previous round: if the opponent defected last time, defect this time; if the opponent cooperated last time, keep cooperating. In this kind of game the strategy does not need to identify exactly which strategy the opponent uses; it only needs to judge from the resulting situation whether the opponent cooperates or competes. In a matrix game we can judge from the outcome alone whether an agent cooperated, but in a Markov game the degree of competition or cooperation of an agent depends on every action it takes over the course of the whole game, and it is hard to hand-craft a rule that directly scores that degree. Instead of judging by fixed rules, we can use supervised learning to learn directly from data how cooperative the sequence of actions taken by an agent within one episode is. This raises a problem: supervised learning needs data with corresponding labels, and normally it is hard to obtain data at different degrees of cooperation from which a discriminator could learn how cooperative an agent's behaviour is. However, using the methods above for training and generating strategies with different degrees of cooperation, for each agent i we can obtain a set of strategies of agent i with different degrees of cooperation

{π_i^{att_1}, π_i^{att_2}, ..., π_i^{att_n}}

So if we want to judge the degree of cooperation of agent i, we can generate a data set d:
(Equation images: the definition of the data set d, built from trajectories generated when agent i plays one of its strategies of known cooperation degree against the other agents.)

Here Π_k is the set of strategies of various cooperation degrees obtained for agent i_k by training and generating strategies with different degrees of cooperation, f is the function that assigns the cooperation-degree label of agent i in the current data, and X denotes the set of all possible trajectories <s_0, s_1, ..., s_n> produced when the agents interact in the environment with their own strategies.

(Equation images: the sampling notation, meaning that agent i adopts a specific strategy π_i while the other agents adopt strategies from their own strategy sets, the game is played in the environment, and the corresponding state sequence x = <s_1, s_2, ..., s_d> ∈ X is collected; the label assigned to x is the cooperation degree of the strategy π_i that generated it.)
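A sketch of how such a labelled data set might be assembled is given below; the environment interface (env.reset, env.step), the policy callables and the label convention (using the att value of the strategy that generated the trajectory) are assumptions made for illustration, not interfaces specified by the text.

```python
import random

def rollout(env, policies, horizon):
    """Play one episode with one policy per agent and return the state sequence."""
    s = env.reset()
    states = [s]
    for _ in range(horizon):
        actions = [pi(s) for pi in policies]   # each policy maps state -> action
        s, rewards, done = env.step(actions)
        states.append(s)
        if done:
            break
    return states

def build_cd_dataset(env, agent_idx, labelled_policies, other_policies, episodes, horizon):
    """Collect (trajectory, cooperation-degree label) pairs for one agent.

    labelled_policies: list of (att, policy) pairs for the agent whose degree we detect.
    other_policies:    list of policy lists for the remaining agents; one is sampled per episode.
    """
    data = []
    for _ in range(episodes):
        att, pi_i = random.choice(labelled_policies)
        opponents = [random.choice(cands) for cands in other_policies]
        policies = opponents[:agent_idx] + [pi_i] + opponents[agent_idx:]
        x = rollout(env, policies, horizon)
        data.append((x, att))   # label x with the att of the strategy that produced it
    return data
```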
Using the generated data set d, we can apply common supervised learning algorithms such as decision trees, SVMs or neural networks. Here we use a neural network to discriminate the degree of cooperation. The network we design follows a multi-task idea, combining the structure of an autoencoder with that of a classifier. An autoencoder consists of two parts, an encoder and a decoder; we define the encoder as φ and the decoder as ψ, as follows:
φ : X → F
ψ : F → X
φ, ψ = arg min_{φ,ψ} ||X − (ψ∘φ)(X)||²
By sharing the lower-layer parameters of the neural network with the autoencoder, the features extracted by the first few layers of the classifier become an effective representation of x, so the classification is performed on an effective representation of x and the classification result is better; in addition, the gradients from the autoencoder acting on the lower-layer parameters speed up the training of the classifier, and after a certain amount of training they keep the lower layers relatively stable, reducing oscillation while the classifier is trained. Suppose our classifier outputs the probability that x = <s_1, s_2, ..., s_d> belongs to label_1, label_2, ..., label_n, i.e.:

f_c(x) = f_c(<s_1, s_2, ..., s_d>) = <p_1, p_2, ..., p_n>

where <p_1, p_2, ..., p_n> represents the probabilities that x belongs to label_1, label_2, ..., label_n. The loss function of the classifier can then be defined as:

(Equation image: the classifier loss L(f_c(x), label).)

The network is trained by minimizing ∑_x L(f_c(x), label).
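The snippet below sketches how the two objectives might be combined during training, with the reconstruction loss of the autoencoder and the classification loss sharing the encoder. The simple linear layers, the cross-entropy choice for L and the weighting factor are assumptions for illustration; they deliberately simplify the convolutional/LSTM detector described later in the network structure section.

```python
import torch
import torch.nn as nn

class CooperationDegreeDetector(nn.Module):
    """Shared encoder with an autoencoder head and a classifier head (multi-task)."""
    def __init__(self, state_dim, feature_dim, n_labels):
        super().__init__()
        self.encoder = nn.Sequential(nn.Linear(state_dim, feature_dim), nn.ReLU())
        self.decoder = nn.Sequential(nn.Linear(feature_dim, state_dim), nn.Sigmoid())
        self.classifier = nn.Linear(feature_dim, n_labels)

    def forward(self, x):
        f = self.encoder(x)
        return self.decoder(f), self.classifier(f)

def training_step(model, optimizer, x, labels, recon_weight=1.0):
    """One optimisation step on the combined reconstruction + classification loss."""
    recon, logits = model(x)
    loss = recon_weight * nn.functional.mse_loss(recon, x) \
           + nn.functional.cross_entropy(logits, labels)
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()
```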
Play with different opponents
Using the methods for training strategies with different degrees of cooperation, generating strategies with different degrees of cooperation, and detecting the opponent's degree of cooperation, we can already obtain, in Sequential Social Dilemmas, strategies with different degrees of cooperation and a detector that judges how cooperative an agent is. On this basis we can borrow ideas from matrix games and apply them to Sequential Social Dilemmas. Take a class of ideas common in the matrix-game Prisoner's Dilemma as an example: if the opponent cooperates, we cooperate with the opponent and achieve a win-win outcome; if the opponent defects, we make sure of our own payoff. One classical embodiment of this idea is Tit for tat: cooperate in the first round, and in each later round do whatever the opponent did in the previous round; if the opponent defected last time, defect this time, and if the opponent cooperated last time, keep cooperating. We can borrow the core idea of Tit for tat: in Sequential Social Dilemmas, if the opponent adopted a strategy with attitude att_i over the last few episodes, then we adopt a strategy with attitude att_i in the current episode. That is, the opponent is judged by the classifier used for detecting its degree of cooperation. Suppose the opponent's behaviour over the last few episodes is represented by

x = <s_1, s_2, ..., s_d>.

Then through the classifier f_c(x) we obtain an estimate of the opponent's degree of cooperation over those episodes,

<p_1, p_2, ..., p_n>,

where the strategy whose degree of cooperation corresponds to p_i is π_i. Depending on whether our existing strategies come from value-based or policy-based deep reinforcement learning, a strategy with a degree of cooperation similar to the opponent's can then be adopted:
π_new = arg max_a Q_new(s, a)
      = arg max_a (p_1*Q_π1(s, a) + p_2*Q_π2(s, a) + … + p_n*Q_πn(s, a))
where Q_πi denotes the Q value under π_i, or

π_new(a|s) = p_1*π_1(a|s) + p_2*π_2(a|s) + … + p_n*π_n(a|s),

where π_i(a|s) denotes the probability of taking action a in state s under π_i. This extends the cooperation and defection of Tit for tat to different degrees of cooperation, as shown in Figure 3. On the one hand this approach secures the agent's own payoff; on the other hand it does not push the opponent towards a more competitive strategy because of lost payoff. Our framework is not limited to the Tit for tat idea: game-theoretic strategies that react to the opponent by judging information about it can all be transferred over quite simply.
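The following sketch ties the detector and the generated strategies together in a Tit-for-tat-like loop: the opponent's recent trajectory is classified, and the agent's composed action for the next episode mirrors the detected degree of cooperation. The detector and expert-Q interfaces, and all numbers, are assumptions for illustration.

```python
import numpy as np

def respond_in_kind(detector, expert_q, recent_trajectory, state):
    """Choose an action whose degree of cooperation mirrors the opponent's.

    detector:          callable mapping a trajectory to probabilities <p_1, ..., p_n>
                       over the known cooperation degrees.
    expert_q:          list of callables, expert_q[i](state) -> Q-values of strategy pi_i.
    recent_trajectory: the opponent's states <s_1, ..., s_d> from the last few episodes.
    """
    p = np.asarray(detector(recent_trajectory), dtype=float)
    q_stack = np.stack([q(state) for q in expert_q])      # (n_experts, n_actions)
    q_new = p @ q_stack                                    # p_1*Q_pi1 + ... + p_n*Q_pin
    return int(np.argmax(q_new))

# Toy illustration with two experts and a dummy detector (all values are assumptions):
detector = lambda traj: [0.8, 0.2]                         # opponent looks mostly cooperative
expert_q = [lambda s: np.array([0.1, 0.9]),                # cooperative expert
            lambda s: np.array([0.7, 0.2])]                # competitive expert
print(respond_in_kind(detector, expert_q, recent_trajectory=None, state=None))  # -> 1
```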
In a specific embodiment, our algorithm is applied to two games: the Fruit Gathering game and the Apple-Pear game. In the Fruit Gathering game there are two agents in the environment, one red and one blue, whose goal is to collect apples, represented by green pixels. When an agent collects an apple it receives a reward r_apple, and the apple disappears from the environment for a period N_apple before reappearing. While collecting, an agent can shoot a beam from its position towards its front; if an agent is hit twice by beams from the other agent, it is removed from the environment for a period N_tagged and then reappears. Each agent can choose among eight actions: move forward, move backward, move left, move right, turn left, turn right, use the beam, and stay in place. (In this environment an agent can tend towards cooperation, collecting fruit together and rarely using the beam; it can also choose to compete, emitting more beams while collecting fruit in order to hit the opponent and remove it from the environment, so as to collect more fruit itself.) In one embodiment of the Apple-Pear game there are two agents, a red one and a blue one, and two fruits, a red apple and a green pear. The agents move around the environment collecting fruit. The blue agent prefers the apple, so for agent 1 the reward is r_prefer when it obtains the apple alone and r_common when it obtains the pear alone; for the red agent it is the opposite, and r_prefer > r_common. When the two agents move to the same fruit at the same time, each obtains half of the fruit and therefore half of its reward. Each agent can perform four actions: move forward, backward, left and right, and every step costs c_step. In both environments the observation s received by an agent is an image of size 84*84*3. Although the Apple-Pear game is simpler than the Fruit Gathering game, competition and cooperation are easier to analyse in it: each agent collecting its own preferred fruit corresponds to cooperation, and an agent collecting every fruit corresponds to competition, so we use the Apple-Pear game to analyse the effectiveness of our algorithm more clearly.
To investigate the effectiveness of our algorithm, in both environments we first train with different network structures and different forms of target reward. For each target reward we use the Actor-Critic reinforcement learning algorithm and learn strategies with different degrees of cooperation by maximizing the accumulated discounted target reward. The degree of cooperation of a learned strategy can be analysed through the average rewards of agent 1 and agent 2. Second, we investigate the effectiveness of Policy Generation: based on the agents' average rewards and the actual behaviour of the strategies, we pick two strategies out of those trained with different target rewards as the basic cooperative strategy and the basic competitive strategy, and use them as benchmarks for cooperation and competition. We then linearly combine the basic cooperative strategy and the basic competitive strategy to obtain strategies with different degrees of cooperation. For the strategies synthesized at different degrees of cooperation, the average rewards of agent 1 and agent 2 again verify that the degree of cooperation of a synthesized strategy lies between those of the basic strategies. In the experiments we let the strategies synthesized at different degrees of cooperation play against each other and use the agents' rewards to verify the effectiveness of Policy Generation. Moreover, according to the results of play between strategies of different degrees of cooperation, an agent's own reward rises as its own degree of cooperation falls and falls as the opponent's degree of cooperation rises, and the reward when both choose to compete is smaller than when both choose to cooperate; this shows that the Fruit Gathering game and the Apple-Pear game can be regarded as sequential prisoner's dilemmas (SPD) with respect to the degree of cooperation. Third, we let strategies with different degrees of cooperation play against each other to collect data; because we know the agents' degrees of cooperation in these plays, we can give the data the corresponding labels. After generating the data set, we use a neural network combined with an autoencoder to learn the agents' degree of cooperation. Finally, we combine Policy Generation with attitude detection: first detect the opponent's degree of cooperation and then dynamically adjust the degree of cooperation of our own strategy accordingly. In the experiments we test self-play, opponents with fixed degrees of cooperation, and opponents that change their strategies.
Network structure
In both environments our policy network structure is similar to the DQN structure, and the actor and critic share the lower layers: the first hidden layer uses 32 convolution kernels of size 8x8 with stride 4 and ReLU activation; the second hidden layer uses 64 convolution kernels of size 4x4 with stride 2 and ReLU activation; the third hidden layer uses 64 convolution kernels of size 3x3 with stride 1 and ReLU activation. On top of the shared structure, the actor has a fully connected layer of 128 nodes with ReLU activation, and its last layer has as many nodes as there are actions, with softmax activation. The critic has a structure similar to the actor but ends with a single output.
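A minimal PyTorch sketch of an actor-critic network with these layer sizes might look as follows; the 84x84x3 input comes from the environment description above, while the flattened feature size is standard DQN-style shape arithmetic, so this should be read as an illustrative reconstruction rather than the exact implementation.

```python
import torch
import torch.nn as nn

class ActorCritic(nn.Module):
    """Shared conv trunk (DQN-style) with an actor head and a critic head."""
    def __init__(self, n_actions, in_channels=3):
        super().__init__()
        self.trunk = nn.Sequential(
            nn.Conv2d(in_channels, 32, kernel_size=8, stride=4), nn.ReLU(),  # 84 -> 20
            nn.Conv2d(32, 64, kernel_size=4, stride=2), nn.ReLU(),           # 20 -> 9
            nn.Conv2d(64, 64, kernel_size=3, stride=1), nn.ReLU(),           # 9  -> 7
            nn.Flatten(),
        )
        feat = 64 * 7 * 7
        self.actor = nn.Sequential(nn.Linear(feat, 128), nn.ReLU(),
                                   nn.Linear(128, n_actions), nn.Softmax(dim=-1))
        self.critic = nn.Sequential(nn.Linear(feat, 128), nn.ReLU(),
                                    nn.Linear(128, 1))

    def forward(self, x):
        f = self.trunk(x)
        return self.actor(f), self.critic(f)

# Illustrative forward pass with a batch of one 84x84x3 observation:
net = ActorCritic(n_actions=8)
probs, value = net(torch.zeros(1, 3, 84, 84))
print(probs.shape, value.shape)   # torch.Size([1, 8]) torch.Size([1, 1])
```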
The network structure of the cooperation degree detector is shown in Figure 2. The shared part is structured as follows: the first layer uses 10 convolution kernels of size 3x3 with stride 2 and ReLU activation; the second layer has the same structure as the first; the third layer uses 10 convolution kernels of size 3x3 with stride 3 and ReLU activation. The autoencoder branch attached to the third layer consists of 10 deconvolution kernels of size 3x3 with sigmoid activation and output shape 21*21*10, followed by a deconvolution with 10 kernels of size 3x3 and stride 2, sigmoid activation and output shape 42*42*10, and finally a deconvolution with 10 kernels of size 3x3, sigmoid activation and output shape 84*84*10. The cd detection branch attached to the third layer consists of two long short-term memory (LSTM) layers of 256 nodes, followed by a single output node.
Parameter settings
In the Apple-Pear game we limit each episode to at most 100 steps. The exploration rate starts at 1 and decays gradually to 0.1 over 20000 steps. Weights are updated with soft target updates to avoid large oscillations of the strategy: θ′ ← 0.05θ + 0.95θ′, where θ are the parameters of the target network and θ′ are the parameters of the policy network used by the agent. The learning rate is 0.0001, the memory size is 25000, and the batch size is 128. In the environment r_prefer = 1, r_common = 0.5 and c_step = 0.01.
In the Fruit Gathering game we use the same detector as in the Apple-Pear game for learning, while the actor-critic part learns independently for each agent. In the Fruit Gathering game we set N_apple = 40, N_tagged = 20 and r_apple = 1. During training we limit each episode to 1000 steps; the agent's exploration rate starts at 1 and decays gradually to 0.1 over 200000 steps; the memory size is 250000 and the learning rate is 0.0001. Soft target updates are used again: θ′ ← 0.001θ + 0.999θ′, the network weights are updated every 4 steps, and the batch size is 64.
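As an illustration of the soft target update used in both settings, a small helper might look like the following; which network plays the role of θ and which of θ′ follows whatever convention the implementation chooses, and the interpolation factor τ is passed in (0.05 above for the Apple-Pear setting, 0.001 for the Fruit Gathering setting).

```python
import torch

@torch.no_grad()
def soft_update(dst_net, src_net, tau):
    """Soft update applied parameter-wise: dst <- tau * src + (1 - tau) * dst."""
    for dst_param, src_param in zip(dst_net.parameters(), src_net.parameters()):
        dst_param.mul_(1.0 - tau).add_(tau * src_param)

# Usage sketch: soft_update(slow_net, fast_net, tau) with tau = 0.05 or 0.001.
```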
Weighted target reward
We train our network with the joint-action method mentioned in the section on training strategies with different degrees of cooperation. In the Apple-Pear game the weighted target reward can be set as:
r = w_1*r_1 + (1 − w_1)*r_2
Because we use a network structure based on joint actions, if agent 1 and agent 2 have the same attitude value the two agents have the same degree of cooperation; this is equivalent to scaling the reward up or down by the same factor, and the strategies finally learned are roughly the same, so the sum of the two agents' weights can be normalized to 1.
The range of w_1 (i.e. att_1 in the tables) is set from 0.1 to 0.9; a larger w_1 means that the strategy learned by agent 1 is more competitive and the strategy learned by agent 2 is more cooperative. The main reason for not setting w_1 to 0 or 1 is that when w_1 = 1 we have 1 − w_1 = 0, and the learned strategy of agent 2 then becomes completely meaningless: agent 2 would merely move around the environment, avoiding fruit and avoiding collisions with agent 1. We want each agent's strategy to be meaningful, differing only in degree of cooperation, so we exclude w_1 = 0 and w_1 = 1 and train the network only with w_1 ranging from 0.1 to 0.9.
The final results of training with different w_1 are shown in the table below; the agent 1 average reward and agent 2 average reward consider only the reward from the fruit collected by each agent. The evolution of the two average rewards over the training episodes is shown in Figure 5. From the table and figure, for agent 1, the larger w_1 is, the larger agent 1's average reward is, indicating that agent 1's strategy leans more towards competition; for example, when w_1 = 0.9 the agent 1 average reward is 1.104, meaning agent 1 not only collects its own fruit but also repeatedly tries to collect the opponent's fruit. When w_1 is 0.1 and 0.2, the agent 1 average reward is 0.4065 and 0.514 respectively, close to r_common = 0.5, indicating that agent 1 mostly collects its preferred fruit and rarely collects the opponent's fruit. Similarly, agent 2's average reward gradually rises as w_1 decreases.
w_1     agent 1 average reward     agent 2 average reward
0.1     0.4065                     1.07575
0.2     0.514                      0.9775
0.3     0.613                      0.88175
0.4     0.752                      0.734
0.5     0.789                      0.6945
0.6     0.8945                     0.6125
0.7     0.9735                     0.53325
0.8     1.0435                     0.4275
0.9     1.104                      0.363
Similarly, the agents can learn with an independent network structure. For reasons of space we only show part of the learning results for different degrees of cooperation, as in the table below; as with the joint-action network structure, an agent's average reward falls as its own degree of cooperation rises and rises as the opponent's degree of cooperation rises.

(Table images: average rewards of the independently trained agents at selected degrees of cooperation.)
In the Fruit Gathering game we use an independent weighted target reward. The main reason is that r_apple is the same for both agents, so if the joint r were maximized, the agent with the higher reward weight would try to collect all the apples while the agent with the lower weight would try not to hinder it; the final result could be that the highly weighted agent collects every apple and the lowly weighted agent learns a meaningless strategy. Using independent rewards alleviates this problem:
r′_1 = r_1 + att_12*r_2
r′_2 = r_2 + att_21*r_1
By restricting att_12 and att_21 to the range [0, 1], each agent's strategy considers the opponent's reward to a varying degree on top of collecting apples itself.
As shown in the table below (only part of the results is shown for reasons of space), when the opponent's att does not change, the more an agent considers the opponent's reward, the less often it uses the beam and the lower its own reward. This shows that the weighted target reward can also train strategies with different degrees of cooperation in the Fruit Gathering game.
Policy Generation
To generate strategies with different degrees of cooperation in the Apple-Pear game, we choose the strategies of agent 1 and agent 2 learned with w_1 = 0.5 as the basic cooperative strategies of agent 1 and agent 2, the strategy of agent 1 learned with w_1 = 0.75 as agent 1's basic competitive strategy, and the strategy of agent 2 learned with w_1 = 0.25 as agent 2's basic competitive strategy. In fact there is no particular restriction on how the basic competitive and cooperative strategies are chosen, as long as the degree of cooperation of the basic cooperative strategy is higher than that of the basic competitive strategy: the chosen basic cooperative strategy determines the agent's highest degree of cooperation and the chosen basic competitive strategy its highest degree of competition, so the range of cooperation and competition can be limited by choosing different strategies. The strategies we choose satisfy this basic constraint. Moreover, the w_1 = 0.75 chosen for agent 1's competitive strategy lies between 0.7 and 0.8, and from the table above (the w_1 table from 0.1 to 0.9) agent 1's strategy at this setting collects its own fruit while also collecting the opponent's fruit. Similarly, the competitive strategy of agent 2 can be shown to be correspondingly competitive.
After selecting the basic competitive and cooperative strategies, we can use the algorithm proposed in the section on generating strategies with different degrees of cooperation to synthesize strategies. To verify the effectiveness of the synthesized strategies, we let them play against fixed opponents and judge the effect through the final average reward, as shown in the table below, which gives the averages over 200000 episodes.
(Table 1 — table image: average rewards of the strategies generated with different w_c against the fixed opponent, averaged over 200000 episodes.)
We fix agent 2 to use only the benchmark cooperative strategy and the benchmark competitive strategy, while agent 1 uses the Policy Generation algorithm with π_new = w_c*π_c + (1 − w_c)*π_d, where w_c ranges from 0 to 1 in steps of 0.1. The average rewards show that as w_c rises, agent 1's average reward gradually falls, and for a fixed w_c agent 1's average reward is larger against a more cooperative agent 2. The table therefore shows that Policy Generation with different w_c can synthesize strategies with different degrees of cooperation.
Furthermore, when both agent 1 and agent 2 use Policy Generation to generate strategies and the generated strategies play against each other, it becomes clear that the Apple-Pear game can be regarded as an SPD with respect to the agents' degree of cooperation. If an agent considers only its own interest it will choose a more competitive strategy, but if both agents choose competitive strategies, each of their rewards is smaller than when both choose cooperative strategies; at the same time the total reward is largest in the cooperate-cooperate case.
Similarly to the Apple-Pear game, in the Fruit Gathering game we choose the strategies learned with att_12 = 0 and att_21 = 0 as the agents' basic competitive strategies and the strategies learned with att_12 = 0.5 and att_21 = 0.5 as their cooperative strategies. The strategies learned with att_12 = 1 and att_21 = 1 are not chosen as the cooperative strategies because in that setting the agents' rewards are identical, so one agent may end up learning to collect apples while the other, although it never learns that behaviour, still receives the corresponding reward and therefore fails to learn anything meaningful.
After selecting the strategies, we again use Policy Generation to generate strategies and let the generated strategies play against each other; the Fruit Gathering game can likewise be regarded as an SPD with respect to the degree of cooperation.
Cooperation Degree Detection
As described in the section on detecting the opponent's degree of cooperation, in order to detect agent 2's degree of cooperation we regard agent 2's basic cooperative strategy and basic competitive strategy as strategies with cooperation degree 1 and cooperation degree 0 respectively. Fixing agent 2's strategy to the basic cooperative strategy and letting it play against agent 1 using strategies of different degrees of cooperation, we collect data <S_1, S_2, ..., S_d> and label it 1; in the same way, data labelled 0 is collected by letting the basic competitive strategy play against strategies of different degrees of cooperation.
In the Apple-Pear game we collect 10000 data points for each of the lengths 3, 4, 5, 6, 7, 8 and 9 to train our cd detector. The training results are shown in the figure below; the cd detector identifies agent 2's degree of cooperation well on the data set. To test the detector on agent 2 strategies with different degrees of cooperation, agent 1 uses strategies with cd from 0 to 1 in steps of 0.1 and agent 2 likewise uses strategies with cd from 0 to 1 in steps of 0.1; each pairing plays 10000 episodes and the cd detector estimates agent 2's cd over these episodes, with the mean values shown in the figure below. Although we train the detector only with label 1 and label 0, for agent 2 strategies with cd between 0 and 1 the detector can distinguish the relative order of the degrees of cooperation, even though the values are not necessarily accurate. To obtain an accurate agent 2 cd, we observe that the detected value changes almost linearly, so we fit a straight line and use it to compute an approximation of agent 2's cd.
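A small sketch of this calibration step, assuming the raw detector outputs and the corresponding true cooperation degrees are available as paired arrays, could fit a least-squares line and invert it to map a new detector reading back to an estimated cd; the numbers below are assumptions, not measured results.

```python
import numpy as np

def fit_cd_calibration(true_cd, detector_out):
    """Fit detector_out ~= a * true_cd + b and return a function mapping
    a new detector reading back to an estimated cooperation degree."""
    a, b = np.polyfit(np.asarray(true_cd, float), np.asarray(detector_out, float), deg=1)
    return lambda reading: float(np.clip((reading - b) / a, 0.0, 1.0))

# Illustrative calibration data:
true_cd = [0.0, 0.25, 0.5, 0.75, 1.0]
detector_out = [0.10, 0.28, 0.52, 0.71, 0.93]
estimate_cd = fit_cd_calibration(true_cd, detector_out)
print(round(estimate_cd(0.60), 3))   # approximate cd for a detector reading of 0.60
```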
In the Fruit Gathering game we likewise use the generated strategies to collect data sets of different degrees of cooperation. The data are labelled in the same way as in the Apple-Pear game: we collect 2000 trajectories each of lengths 40, 50, 60 and 70 for cooperation degree 1 and for cooperation degree 0, and train the detector in the same way. We again let agents with different degrees of cooperation play against each other to test the detector. The detected value is roughly linear in the attitude, but for some degrees of cooperation, for example when the opponent cooperates fully, the detected value correlates strongly with the agent's own strategy; we therefore fit a separate straight line for each of the agent's own degrees of cooperation, and when detecting the opponent's degree of cooperation we pick the line closest to our own degree of cooperation to estimate the opponent's degree of cooperation.
Performance with different opponents
After training the cd detector, in order to test the effectiveness of the proposed algorithm we investigate its behaviour in self-play and against opponents that keep changing their strategies.
First, in the self-play setting of the Apple-Pear game, we set up four cases: 1. agent 1 starts with the cooperative strategy and agent 2 with the competitive strategy; 2. agent 1 starts with the competitive strategy and agent 2 with the cooperative strategy; 3. both agents start with the cooperative strategy; 4. both agents start with the competitive strategy. In self-play both agents eventually converge to the cooperate-cooperate state and reach a win-win outcome. Because the environment is initialized randomly, even if the opponent uses the competitive strategy, under this initialization the state finally reached by the agent and its opponent is consistent with the effect of a cooperative strategy, so the agent judges the opponent to have a certain degree of cooperation, adjusts its own degree of cooperation next time, and the two sides finally reach the cooperate-cooperate outcome.
When facing an opponent that keeps changing its strategy, we let the opponent switch between the competitive strategy and the cooperative strategy every T episodes. In the detection part we find that the detector tracks the opponent's strategy fairly well, but when the opponent adjusts its strategy quickly the detection contains some error; for example at T = 50 the detected curve is not as clean as at T = 110, 150, 190. The main reason is that detection uses the preceding episodes as data, so at the beginning of a strategy change the detection effect is not particularly obvious, and by the time the detector realizes the opponent has changed strategy and starts adjusting its own strategy, the opponent's strategy may have changed again. This is handled much better when T is large, where the detection works much better than under fast switching. The reward of our algorithm lies between that of the basic cooperative strategy and the basic competitive strategy, showing that after detection the strategy adjusted to a different degree of cooperation can secure its own payoff when facing the competitive strategy; in terms of total reward our curve is higher than that of the competitive strategy, showing that when facing a cooperative opponent our strategy can adjust to a cooperative strategy and achieve a win-win outcome with the opponent.
Self-play and play against changing opponents
In the Fruit Gathering game we likewise test self-play and opponents that keep changing their strategies. In one embodiment, the agents under all four initial settings eventually converge to cooperate-cooperate. If both sides start cooperative, they stay cooperative. In the other three cases, the agents at first do not adjust their own strategies and reach the compete-compete situation; after one agent tags the other out of the environment, it starts collecting the remaining fruit and waits for apples to appear. Because the other agent has been tagged out of the environment, the detected degree of cooperation is slightly higher than that of a competitive opponent, so the agent expects to reach cooperation with the other agent. During this period the tagged agent may reappear; at that moment its detection is based on the other agent's behaviour before it reappeared, so when both agents go to collect apples without actively tagging each other, the two sides may converge to cooperate-cooperate.
Against an opponent that keeps switching its strategy, the effect is not very good when the switching interval is small; when the interval is larger, the agent can better estimate the opponent's degree of cooperation and adjust its own strategy accordingly. From the agent's own reward we can see that our algorithm does better than the cooperative strategy, so when the opponent adopts the competitive strategy we also adopt a competitive strategy to secure our own payoff; in terms of total reward, our strategy's total reward is better than that of the competitive strategy, showing that when the opponent cooperates our strategy adopts the cooperative strategy and achieves a win-win outcome with the opponent.
The above is a further detailed description of the present invention in combination with specific preferred embodiments, and the specific implementation of the present invention should not be considered limited to these descriptions. For a person of ordinary skill in the technical field to which the present invention belongs, several simple deductions or substitutions may be made without departing from the concept of the present invention, and all of them shall be regarded as falling within the protection scope of the present invention.

Claims (9)

  1. An adaptive game algorithm based on deep reinforcement learning, characterized by comprising the following steps: (A) obtaining strategies with different degrees of cooperation; (B) generating strategies with different degrees of cooperation; (C) detecting the opponent's cooperation strategy; (D) formulating different coping strategies.
  2. The adaptive game algorithm based on deep reinforcement learning according to claim 1, characterized in that in step (A), strategies with different degrees of cooperation are trained and obtained by using different network structures and/or different forms of target reward.
  3. The adaptive game algorithm based on deep reinforcement learning according to claim 1, characterized in that in step (A), strategies with different degrees of cooperation are obtained by modifying key factors in the environment that influence the degree of competition and cooperation or by modifying the learning objectives of the agents.
  4. The adaptive game algorithm based on deep reinforcement learning according to claim 1, characterized in that in step (B), the strategies with different degrees of cooperation obtained in step (A) are set as expert networks, weights are assigned to the strategies with different degrees of cooperation in the expert networks, and a strategy with a new degree of cooperation is generated according to the degrees of influence of the strategies with different degrees of cooperation.
  5. The adaptive game algorithm based on deep reinforcement learning according to claim 4, characterized in that the specific process of the algorithm for generating a strategy with a new degree of cooperation is as follows: among the expert networks,

    π_i^{att_n}

    denotes the strategy of agent i obtained with cooperation degree att_n in the setting for training strategies with different degrees of cooperation; each expert network predicts the state-action value obtainable in the current state by playing with the other agents using a strategy of its degree of cooperation; from the existing strategies of known degrees of cooperation

    {π_i^{att_1}, π_i^{att_2}, ..., π_i^{att_n}}

    a strategy with a new degree of cooperation for agent i is obtained by assigning different weights w_1, w_2, ..., w_n:

    (Equation image: the new strategy π_new composed from the expert networks with the weights w_1, w_2, ..., w_n.)

    π_new is a strategy whose degree of cooperation is att_new,

    att_new = f_value(att_1, att_2, ..., att_n, w_1, w_2, ..., w_n),

    where f_value(att_1, att_2, ..., att_n, w_1, w_2, ..., w_n) is the degree of cooperation obtained when the basic strategies

    {π_i^{att_1}, π_i^{att_2}, ..., π_i^{att_n}}

    are combined with the corresponding weights w_1, w_2, ..., w_n, and f_value is:

    f_value(att_1, att_2, ..., att_n, w_1, w_2, ..., w_n) = att_1*w_1 + att_2*w_2 + … + att_n*w_n.

    The strategy with a new degree of cooperation for agent i is:

    (Equation image: the new strategy for agent i composed from the policy-based strategies with the weights w_1, w_2, ..., w_n.)

    att_new is defined to measure the degree of cooperation of the composed strategy: att_new = f_policy(att_1, att_2, ..., att_n, w_1, w_2, ..., w_n).
  6. The adaptive game algorithm based on deep reinforcement learning according to claim 1, characterized in that in step (C), a neural network is used to discriminate the degree of cooperation; the structure of the neural network adopts a multi-task mode, combining the structure of an autoencoder with the structure of a classifier, and the autoencoder and the classifier share the lower-layer parameters of the neural network.
  7. The adaptive game algorithm based on deep reinforcement learning according to claim 1, characterized in that in step (D), the opponent is judged by detecting the opponent's degree of cooperation, and a strategy more favourable to the agent itself is generated.
  8. The adaptive game algorithm based on deep reinforcement learning according to claim 2, characterized in that the different network structures include the policy network structure and the network structure of the cooperation degree detector.
  9. The adaptive game algorithm based on deep reinforcement learning according to claim 1, characterized in that the adaptive game algorithm based on deep reinforcement learning is applied in a multi-agent environment.
PCT/CN2018/097747 2018-07-30 2018-07-30 Deep reinforcement learning-based adaptive game algorithm WO2020024097A1 (en)

Priority Applications (2)

Application Number Priority Date Filing Date Title
CN201880001586.XA CN109496318A (en) 2018-07-30 2018-07-30 Adaptive game playing algorithm based on deeply study
PCT/CN2018/097747 WO2020024097A1 (en) 2018-07-30 2018-07-30 Deep reinforcement learning-based adaptive game algorithm

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
PCT/CN2018/097747 WO2020024097A1 (en) 2018-07-30 2018-07-30 Deep reinforcement learning-based adaptive game algorithm

Publications (1)

Publication Number Publication Date
WO2020024097A1 true WO2020024097A1 (en) 2020-02-06

Family

ID=65713819

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/CN2018/097747 WO2020024097A1 (en) 2018-07-30 2018-07-30 Deep reinforcement learning-based adaptive game algorithm

Country Status (2)

Country Link
CN (1) CN109496318A (en)
WO (1) WO2020024097A1 (en)

Cited By (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN113361594A (en) * 2021-06-03 2021-09-07 安徽理工大学 Countermeasure sample generation method based on generation model
CN113791634A (en) * 2021-08-22 2021-12-14 西北工业大学 Multi-aircraft air combat decision method based on multi-agent reinforcement learning
CN114154397A (en) * 2021-11-09 2022-03-08 大连理工大学 Implicit adversary modeling method based on deep reinforcement learning

Families Citing this family (14)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN110081893B (en) * 2019-04-01 2020-09-25 东莞理工学院 Navigation path planning method based on strategy reuse and reinforcement learning
CN110428057A (en) * 2019-05-06 2019-11-08 南京大学 A kind of intelligent game playing system based on multiple agent deeply learning algorithm
CN110399920B (en) * 2019-07-25 2021-07-27 哈尔滨工业大学(深圳) Non-complete information game method, device and system based on deep reinforcement learning and storage medium
CN111030764B (en) * 2019-10-31 2021-02-02 武汉大学 Crowdsourcing user information age management method based on random game online learning
CN110909465B (en) * 2019-11-20 2021-08-31 北京航空航天大学 Cooperative game cluster visual maintenance method based on intelligent learning
CN111026548B (en) * 2019-11-28 2023-05-09 国网甘肃省电力公司电力科学研究院 Power communication equipment test resource scheduling method for reverse deep reinforcement learning
CN111294242A (en) * 2020-02-16 2020-06-16 湖南大学 Multi-hop learning method for improving cooperation level of multi-agent system
CN111488992A (en) * 2020-03-03 2020-08-04 中国电子科技集团公司第五十二研究所 Simulator adversary reinforcing device based on artificial intelligence
CN111766782B (en) * 2020-06-28 2021-07-13 浙江大学 Strategy selection method based on Actor-Critic framework in deep reinforcement learning
CN114154614B (en) * 2020-09-08 2024-06-11 深圳市优智创芯科技有限公司 Multi-agent game method based on impulse neural network
CN112364500B (en) * 2020-11-09 2021-07-20 中国科学院自动化研究所 Multi-concurrency real-time countermeasure system oriented to reinforcement learning training and evaluation
CN112260733B (en) * 2020-11-10 2022-02-01 东南大学 Multi-agent deep reinforcement learning-based MU-MISO hybrid precoding design method
CN112838946B (en) * 2020-12-17 2023-04-28 国网江苏省电力有限公司信息通信分公司 Method for constructing intelligent sensing and early warning model based on communication network faults
CN113599832B (en) * 2021-07-20 2023-05-16 北京大学 Opponent modeling method, device, equipment and storage medium based on environment model

Patent Citations (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN103500361A (en) * 2013-08-27 2014-01-08 浙江工业大学 Micro-grid load control method based on game theory
CN106371318A (en) * 2016-10-31 2017-02-01 南京农业大学 Facility environment multi-objective optimization control method based on cooperative game
CN107479380A (en) * 2017-08-25 2017-12-15 东北大学 Multi-Agent coordination control method based on evolutionary game theory

Cited By (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN113361594A (en) * 2021-06-03 2021-09-07 安徽理工大学 Countermeasure sample generation method based on generation model
CN113361594B (en) * 2021-06-03 2023-10-20 安徽理工大学 Countermeasure sample generation method based on generation model
CN113791634A (en) * 2021-08-22 2021-12-14 西北工业大学 Multi-aircraft air combat decision method based on multi-agent reinforcement learning
CN113791634B (en) * 2021-08-22 2024-02-02 西北工业大学 Multi-agent reinforcement learning-based multi-machine air combat decision method
CN114154397A (en) * 2021-11-09 2022-03-08 大连理工大学 Implicit adversary modeling method based on deep reinforcement learning
CN114154397B (en) * 2021-11-09 2024-05-10 大连理工大学 Implicit opponent modeling method based on deep reinforcement learning

Also Published As

Publication number Publication date
CN109496318A (en) 2019-03-19

Similar Documents

Publication Publication Date Title
WO2020024097A1 (en) Deep reinforcement learning-based adaptive game algorithm
Khadka et al. Evolution-guided policy gradient in reinforcement learning
US20230116117A1 (en) Federated learning method and apparatus, and chip
CN107392312B (en) Dynamic adjustment method based on DCGAN performance
CN110166428B (en) Intelligent defense decision-making method and device based on reinforcement learning and attack and defense game
Dutt-Mazumder et al. Neural network modelling and dynamical system theory: are they relevant to study the governing dynamics of association football players?
CN107179077B (en) Self-adaptive visual navigation method based on ELM-LRF
Braik et al. Memory based hybrid crow search algorithm for solving numerical and constrained global optimization problems
CN113505855B (en) Training method for challenge model
Li et al. Generative attention networks for multi-agent behavioral modeling
Song et al. Diversity-driven extensible hierarchical reinforcement learning
Dao et al. Deep reinforcement learning monitor for snapshot recording
Klissarov et al. Deep laplacian-based options for temporally-extended exploration
Scott et al. How does AI play football? An analysis of RL and real-world football strategies
Xing et al. Learning with Generated Teammates to Achieve Type-Free Ad-Hoc Teamwork.
WO2019240047A1 (en) Behavior learning device, behavior learning method, behavior learning system, program, and recording medium
Xu et al. Deep reinforcement learning with part-aware exploration bonus in video games
Wang et al. Camp: Causal multi-policy planning for interactive navigation in multi-room scenes
Bezek et al. Multi-agent strategic modeling in a robotic soccer domain
CN116245009A (en) Man-machine strategy generation method
CN109636609A (en) Stock recommended method and system based on two-way length memory models in short-term
Sohn et al. Shortest-path constrained reinforcement learning for sparse reward tasks
Rafati et al. Efficient exploration through intrinsic motivation learning for unsupervised subgoal discovery in model-free hierarchical reinforcement learning
Jin et al. Deep deformable Q-Network: an extension of deep Q-Network
Handa Neuroevolution with manifold learning for playing Mario

Legal Events

Date Code Title Description
121 Ep: the epo has been informed by wipo that ep was designated in this application
Ref document number: 18928785
Country of ref document: EP
Kind code of ref document: A1

NENP Non-entry into the national phase
Ref country code: DE

122 Ep: pct application non-entry in european phase
Ref document number: 18928785
Country of ref document: EP
Kind code of ref document: A1