WO2020024097A1 - Deep reinforcement learning-based adaptive game algorithm - Google Patents

Deep reinforcement learning-based adaptive game algorithm

Info

Publication number
WO2020024097A1
WO2020024097A1 (PCT/CN2018/097747, CN2018097747W)
Authority
WO
WIPO (PCT)
Prior art keywords
cooperation
att
strategy
degree
strategies
Prior art date
Application number
PCT/CN2018/097747
Other languages
French (fr)
Chinese (zh)
Inventor
侯韩旭
郝建业
王维勋
Original Assignee
东莞理工学院
Priority date
Filing date
Publication date
Application filed by 东莞理工学院
Priority to CN201880001586.XA (CN109496318A)
Priority to PCT/CN2018/097747 (WO2020024097A1)
Publication of WO2020024097A1

Classifications

    • G — PHYSICS
    • G06 — COMPUTING; CALCULATING OR COUNTING
    • G06N — COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 — Computing arrangements based on biological models
    • G06N3/02 — Neural networks
    • G06N3/04 — Architecture, e.g. interconnection topology
    • G06N3/042 — Knowledge-based neural networks; Logical representations of neural networks
    • G06N3/045 — Combinations of networks
    • G06N3/08 — Learning methods

Definitions

  • the invention relates to the field of data processing, in particular to an adaptive game algorithm based on deep reinforcement learning.
  • the present invention provides an adaptive game algorithm based on deep reinforcement learning to solve the problem of poor scalability of multi-agents in the prior art.
  • the present invention is realized through the following technical scheme: an adaptive game algorithm based on deep reinforcement learning is designed, comprising the following steps: (A) acquiring strategies with different degrees of cooperation; (B) generating strategies with different degrees of cooperation; (C) detecting the opponent's cooperation strategy; (D) formulating different coping strategies.
  • in step (A), training is performed using different network structures and/or different target reward forms to obtain strategies with different degrees of cooperation.
  • in step (A), strategies with different degrees of cooperation are obtained by modifying key factors in the environment that influence the degree of competition and cooperation, or by modifying the agent's learning goal.
  • in step (B), the strategies with different degrees of cooperation obtained in step (A) are set as expert networks, and the strategies with different degrees of cooperation in the expert networks are assigned weights; a strategy with a new degree of cooperation is then generated according to the degree of influence of each strategy.
  • the specific process of the algorithm for generating a strategy with a new degree of cooperation is as follows: among the expert networks, π^i_att_n denotes the strategy of agent i obtained by training with cooperation degree att_n in the setting for training strategies with different degrees of cooperation. Each expert network predicts the state-action value obtained in the current state by playing with the other agents using the strategy of its cooperation degree. Based on the existing strategies π^i_att_1, …, π^i_att_n, a strategy π^i_new with a new degree of cooperation att_new is obtained for agent i by assigning different weights w_1, w_2, …, w_n, where
  att_new = f_value(att_1, att_2, …, att_n, w_1, w_2, …, w_n),
  and f_value(att_1, att_2, …, att_n, w_1, w_2, …, w_n) is the degree of cooperation obtained by combining the basic strategies with the corresponding weights w_1, w_2, …, w_n; f_value is:
  f_value(att_1, att_2, …, att_n, w_1, w_2, …, w_n) = att_1*w_1 + att_2*w_2 + … + att_n*w_n
  • a neural network is used to determine the degree of cooperation.
  • the structure of the neural network adopts a multi-task mode, combining the structure of an autoencoder with the structure of a classifier.
  • the autoencoder and classifier share the underlying parameters of the neural network.
  • the opponent is assessed by detecting its degree of cooperation, and a strategy more favorable to the agent itself is generated.
  • the different network structures include the strategy network structure and the network structure of the cooperation degree detector.
  • the adaptive game algorithm based on deep reinforcement learning is applied in a multi-agent environment.
  • the beneficial effects of the present invention are: using the trained detector and the strategies with different degrees of cooperation, existing ideas such as Tit for tat are implemented in sequential social dilemmas; the scalability of the agent is improved; and competitive strategies superior to the agent's own are obtained more intuitively.
  • FIG. 1 is a schematic diagram of generating different cooperation strategies according to the present invention
  • FIG. 2 is a schematic diagram of a network structure of a cooperation degree detector according to the present invention.
  • FIG. 3 is a schematic diagram of formulating different coping strategies according to the present invention.
  • an adaptive game algorithm based on deep reinforcement learning includes the following steps: (A) acquiring strategies with different degrees of cooperation; (B) generating strategies with different degrees of cooperation; (C) detecting the opponent's cooperation strategy; (D) formulating different coping strategies.
  • in step (A), training is performed using different network structures and/or different target reward forms to obtain strategies with different degrees of cooperation.
  • in step (A), strategies with different degrees of cooperation are obtained by modifying key factors in the environment that affect the degree of competition and cooperation, or by modifying the agent's learning goal.
  • in step (B), the strategies with different degrees of cooperation obtained in step (A) are set as expert networks, and the strategies with different degrees of cooperation in the expert networks are assigned weights; a strategy with a new degree of cooperation is then generated according to the degree of influence of each strategy.
  • the algorithm for generating a strategy with a new degree of cooperation is as follows: among the expert networks, π^i_att_n denotes the strategy of agent i obtained by training with cooperation degree att_n in the setting for training strategies with different degrees of cooperation. Each expert network predicts the state-action value obtained in the current state by playing with the other agents using the strategy of its cooperation degree. Based on the existing strategies π^i_att_1, …, π^i_att_n, a strategy π^i_new with a new degree of cooperation att_new is obtained for agent i by assigning different weights w_1, w_2, …, w_n, where
  att_new = f_value(att_1, att_2, …, att_n, w_1, w_2, …, w_n),
  and f_value(att_1, att_2, …, att_n, w_1, w_2, …, w_n) is the degree of cooperation obtained by combining the basic strategies with the corresponding weights w_1, w_2, …, w_n; f_value is:
  f_value(att_1, att_2, …, att_n, w_1, w_2, …, w_n) = att_1*w_1 + att_2*w_2 + … + att_n*w_n
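For illustration only, the following Python sketch (not part of the patent) shows the value-based policy generation just described, assuming each expert network can return its Q-values for the current state; the function and variable names (`generate_policy`, `expert_q_values`) are hypothetical.

```python
import numpy as np

def generate_policy(expert_q_values, atts, weights):
    """Combine expert networks trained with cooperation degrees `atts`.

    expert_q_values: list of arrays, expert_q_values[n][a] = Q_att_n(s, a)
    atts:            cooperation degrees att_1..att_n of the experts
    weights:         mixture weights w_1..w_n
    Returns the greedy action under the mixed value function and the
    cooperation degree att_new = f_value(att, w) of the mixed strategy.
    """
    w = np.asarray(weights, dtype=float)
    q = np.asarray(expert_q_values, dtype=float)        # (n_experts, n_actions)
    q_new = (w[:, None] * q).sum(axis=0)                 # weighted state-action values
    att_new = float(np.dot(np.asarray(atts, dtype=float), w))   # linear superposition
    action = int(np.argmax(q_new))                       # pi_new(s) = argmax_a Q_new(s, a)
    return action, att_new

# Example with two experts: a cooperative one (att = 1.0) and a competitive one (att = 0.0).
q_coop = [0.2, 0.8, 0.1]   # hypothetical Q-values of the cooperative expert
q_comp = [0.9, 0.1, 0.3]   # hypothetical Q-values of the competitive expert
action, att_new = generate_policy([q_coop, q_comp], atts=[1.0, 0.0], weights=[0.7, 0.3])
```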
  • a neural network is used to judge the degree of cooperation.
  • the structure of the neural network adopts a multi-tasking mode.
  • the structure of the autoencoder is combined with the structure of the classifier.
  • the autoencoder and the classifier share the underlying parameters of the neural network.
  • the opponent is assessed by detecting its degree of cooperation, and a strategy more favorable to the agent itself is generated.
  • the different network structures include the strategy network structure and the network structure of the cooperation degree detector.
  • the strategy network structure is similar to the DQN structure and uses a five-layer structure.
  • the first hidden layer, the second hidden layer, and the third hidden layer are all convolutional layers, the fourth layer is a fully connected layer, and the last layer has the same number of nodes as the number of actions;
  • the network structure of the cooperation degree detector is a three-layer structure, and all three layers are convolution layers.
  • the autoencoder branch and the cooperation degree detection branch are both connected after the third layer.
  • the adaptive game algorithm based on deep reinforcement learning is applied to sequential social dilemmas in multi-agent environments.
  • the adaptive game algorithm based on deep reinforcement learning of the present invention, the Deep Reinforcement Learning Framework towards Mutual Cooperation (DRLFMC), first uses a weighted target reward to learn strategies with different degrees of cooperation, and the average reward of the agents is used to verify that the weighted target reward can adjust the agents' degree of cooperation and train strategies with different degrees of cooperation. Policy generation is then used to synthesize strategies with different degrees of cooperation from the strategies produced with the weighted target reward, and the average reward of the agents is again used to show that the synthesized strategies have different degrees of cooperation. Strategies with different degrees of cooperation are then used to generate data sets for training the cooperation degree detection network (cd detection network) to detect the opponent's degree of cooperation. Finally, policy generation and the cd detection network are used to adjust the agent's own degree of cooperation according to the opponent's.
  • DRLFMC: Deep Reinforcement Learning Framework towards Mutual Cooperation
  • training strategy networks with different degrees of cooperation; using strategies with existing degrees of cooperation to generate strategies with new degrees of cooperation; detecting the opponent's degree of cooperation; and designing deep game strategies that draw on existing game-theoretic strategies.
  • drawing on Steven Damer's idea of the attitude-modified game [Achieving Cooperation in a Minimally Constrained Environment, Learning to Cooperate in Normal Form Games], a method is designed that can be applied simply to existing deep reinforcement learning to obtain strategies with different degrees of cooperation. At the same time, because an agent may have infinitely many strategies with different degrees of cooperation and competition, which cannot be learned one by one, the idea of He He's Mixture-of-Experts architecture for deep reinforcement learning is borrowed to construct strategies with a new degree of cooperation from strategies with existing degrees [Opponent Modeling in Deep Reinforcement Learning].
  • One way to get strategies for different levels of cooperation is to modify the key factors that affect the level of competition and cooperation in the environment, such as the amount of resources, the way of interaction between agents, and so on.
  • Joel Z. Leibo investigated whether resource changes affect cooperation and competition and found that self-interested independent learning agents cooperate more readily when resources are sufficient and compete more readily when resources are scarce; however, they only investigated the impact of resources and other factors on the agents' final learning results and did not go further to generate strategies with different degrees of cooperation.
  • Another way to obtain strategies with different degrees of cooperation is to modify the learning goals of the agents to obtain strategies with different degrees of cooperation.
  • Agents with the goal of maximizing their own interests will naturally learn more competitive strategies.
  • an agent whose learning goal is to maximize the overall interest will naturally learn a more cooperative strategy.
  • to achieve cooperation in a minimally constrained environment, Steven Damer modified the agent's reward to be its own reward in the original environment plus an attitude multiplied by the opponent's reward in the original environment; by considering both its own reward and the opponent's reward while maximizing this reward, cooperation was eventually reached [Achieving Cooperation in a Minimally Constrained Environment, Learning to Cooperate in Normal Form Games]. This idea can be used not only to make agents finally cooperate, but also to obtain strategies with different degrees of cooperation.
  • by controlling the magnitude of the attitude, the agent can weigh the importance of its own reward against the rewards of other agents during learning: the larger the attitude, the more the other agents' rewards are considered and the more cooperative the learned strategy becomes; the smaller the attitude, the more competitive the strategy the agent eventually learns.
  • this method can directly use current single-agent deep reinforcement learning (single-agent DRL) algorithms to obtain strategies with different degrees of cooperation (here it is assumed that the Markov game and related notation have been defined previously, e.g. there are N agents i_1, i_2, …, i_N; the actions of the agents are a_1, a_2, …; after taking the joint action (a_1, a_2, …) they obtain rewards r_1, r_2, …; and their strategies are π_1, π_2, …).
  • the present invention can obtain strategies with different degrees of cooperation by controlling the degree to which the rewards of other agents are considered, letting each agent take the rewards of the other agents into account; att_ef denotes the attitude of agent i_e towards the reward of agent i_f, and r'_e denotes the reward of agent i_e after the other agents are taken into account:
  r'_1 = att_11*r_1 + att_12*r_2 + … + att_1n*r_n
  r'_2 = att_21*r_1 + att_22*r_2 + … + att_2n*r_n
  …
  r'_n = att_n1*r_1 + att_n2*r_2 + … + att_nn*r_n
  • by adjusting the values of the attitudes we can obtain strategies with different degrees of cooperation: if the attitudes towards other agents are larger, the agent pays more attention to the other agents' payoffs and obtains a more cooperative strategy; if the values are smaller, the agent considers its own payoff more and obtains a more competitive strategy.
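As a minimal illustration (an assumption, not the patent's code), the modified rewards r'_e can be computed from an attitude matrix as follows; the names are hypothetical.

```python
import numpy as np

def attitude_rewards(att, rewards):
    """Return r'_e = sum_f att[e, f] * r_f for every agent e.

    att:     (N, N) attitude matrix; att[e, f] is agent e's attitude towards
             the reward of agent f (att[e, e] is typically 1).
    rewards: length-N vector of environment rewards r_1..r_N.
    """
    return np.asarray(att, dtype=float) @ np.asarray(rewards, dtype=float)

r = [1.0, 0.5]
print(attitude_rewards(np.eye(2), r))        # self-interested agents: r' == r
print(attitude_rewards(np.ones((2, 2)), r))  # fully cooperative agents: each r'_e is the total reward
```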
  • if each agent uses its own deep reinforcement learning algorithm to learn, then although the number of parameters grows only linearly, the neural networks have many parameters, so this is difficult to apply in environments with a larger number of agents.
  • a single network can instead be used to learn the joint strategy π_joint(a_1, a_2, …, a_n | s); the joint reward of this network can be defined as r_total = att_1*r_1 + att_2*r_2 + … + att_n*r_n, and the deep reinforcement learning algorithms common in the single-agent setting can again be used directly to obtain strategies with different degrees of cooperation.
  • att_e represents the degree to which r_total emphasizes the reward of agent i_e: the larger att_e is relative to the others, the higher the degree of competition of agent i_e, and the smaller att_e, the higher its degree of cooperation; agents with equal att_e have the same degree of cooperation.
  • agents can have different degrees of cooperation; for opponents with different degrees of cooperation, we may not want simply to choose a cooperative or a competitive strategy in response, but rather to respond with a strategy that itself has a certain degree of cooperation.
  • using the attitude reward, we can train strategies with different degrees of cooperation.
  • however, agents can adopt infinitely many strategies with different degrees of cooperation, and we cannot train a corresponding strategy for every degree of cooperation; we therefore propose to compose strategies with different degrees of cooperation from the strategies with different degrees of cooperation that have already been trained. He He designed a network structure that, in a fully competitive environment, can secure the agent's own payoff against opponents with different strategies; the idea of that network structure is to use some existing strategies and, by judging the opponent's behavior, predict which of those strategies can best deal with the opponent, and then to obtain a strategy against the current opponent by combining the existing strategies.
  • this idea can be used not only to adjust one's own strategy against different opponents in a fully competitive environment, but also in sequential social dilemmas to generate strategies with different degrees of cooperation. If we use value-based deep reinforcement learning algorithms to learn the strategies with different degrees of cooperation, then we can treat each different strategy as an expert network, where π^i_att_n denotes the strategy of agent i obtained by training with cooperation degree att_n.
  • the synthesized strategy π_new can be regarded as having cooperation degree att_new = f_value(att_1, att_2, …, att_n, w_1, w_2, …, w_n), where f_value(att_1, att_2, …, att_n, w_1, w_2, …, w_n) is the degree of cooperation obtained by combining the basic strategies with the corresponding weights w_1, w_2, …, w_n; that is, we take into account both the cooperation degrees att_i of the existing strategies and their degrees of influence w_i on the new strategy when computing the degree of cooperation of the new strategy. If we assume that strategies with different degrees of cooperation superimpose linearly, and that their degrees of cooperation also superimpose linearly, then we write f_value as f_value(att_1, att_2, …, att_n, w_1, w_2, …, w_n) = att_1*w_1 + att_2*w_2 + … + att_n*w_n. This transfers naturally to strategies with different degrees of cooperation learned with policy-based deep reinforcement learning; agent i's strategy with a new degree of cooperation is then characterized by att_new = f_policy(att_1, att_2, …, att_n, w_1, w_2, …, w_n).
  • a common approach in game theory is to adjust one's own strategy based on the strategies of other agents. For example, Tit for tat: the first round chooses to cooperate, and whether the next round chooses to cooperate depends on whether the other party cooperated; if the other party defected last round, the agent also defects this round, and if the other party cooperated last round, this round continues to cooperate.
  • the strategy does not need to determine exactly which strategy the opponent uses; it can judge whether the opponent is cooperating or competing from the resulting situation.
  • Π_k is the set of strategies with various degrees of cooperation obtained by agent i_k through training strategies with different degrees of cooperation and generating strategies with different degrees of cooperation
  • f is the function giving the degree of cooperation with which agent i is labeled in the current data
  • An autoencoder contains two parts, an encoder and a decoder.
  • denote the encoder as φ and the decoder as ψ; the autoencoder is trained so that the reconstruction ψ(φ(x)) is as close as possible to the input x.
  • the features extracted by the first few layers are then effective representations of x, and with an effective representation of x the classification effect is better.
  • the training speed of the classifier is accelerated.
  • the underlying network becomes more stable, and oscillation during classifier training is reduced.
  • the classifier part of the network is trained by minimizing Σ_x L(f_c(x), label).
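The following PyTorch sketch (an illustration under stated assumptions, not the patent's implementation) shows the multi-task idea: an encoder shared by a reconstruction head and a classification head, trained with a combined loss so that the autoencoder stabilizes the shared layers.

```python
import torch
import torch.nn as nn

class SharedEncoderMultiTask(nn.Module):
    """Shared encoder with an autoencoder head and a classifier head."""
    def __init__(self, in_dim=64, feat_dim=16, n_classes=2):
        super().__init__()
        self.encoder = nn.Sequential(nn.Linear(in_dim, feat_dim), nn.ReLU())      # phi
        self.decoder = nn.Sequential(nn.Linear(feat_dim, in_dim), nn.Sigmoid())   # psi
        self.classifier = nn.Linear(feat_dim, n_classes)                           # f_c

    def forward(self, x):
        z = self.encoder(x)
        return self.decoder(z), self.classifier(z)

model = SharedEncoderMultiTask()
opt = torch.optim.Adam(model.parameters(), lr=1e-3)
x = torch.rand(8, 64)                     # toy batch
label = torch.randint(0, 2, (8,))
recon, logits = model(x)
# the reconstruction loss stabilizes the shared layers; the classification
# loss corresponds to minimizing sum_x L(f_c(x), label)
loss = nn.functional.mse_loss(recon, x) + nn.functional.cross_entropy(logits, label)
opt.zero_grad()
loss.backward()
opt.step()
```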
  • π_new(s) = argmax_a Q_new(s, a)
  • our algorithm is applied in two games: the Fruit Gathering game and the Apple-Pear game.
  • the Fruit Gathering game has two agents in the environment: red and blue.
  • the goal of the agent is to collect apples. Apples are represented by green pixels.
  • when an agent collects an apple it receives a reward r_apple, and the apple disappears from the environment, reappearing after N_apple time steps.
  • the agent can emit a beam of light from its own position toward its own front.
  • if an agent is hit twice by the beam emitted by the other agent, it disappears from the environment for N_tagged time steps and then reappears.
  • the agent can choose 8 actions in the environment: forward, backward, left, right, turn left, turn right, use light and stay in place.
  • an agent can tend to cooperate, collecting fruit together and emitting few beams, or it can choose to compete, emitting more beams while collecting fruit and trying to hit the opponent so as to remove it from the environment and obtain more fruit.
  • in the Apple-Pear game there are two agents in the environment, a red agent and a blue agent, and two fruits, a red apple and a green pear.
  • the agent moves to collect fruits in the environment.
  • the blue agent prefers the apple, so for agent1 the reward is r_prefer when it obtains the apple alone and r_common when it obtains the pear alone.
  • the red agent is the opposite; in both cases r_prefer > r_common.
  • the agent can perform 4 actions in the environment: forward, backward, left, and right, and each step incurs a cost c_step.
  • the state s observed by the agent in the environment is image information of size 84 × 84 × 3.
  • the Apple-Pear game is simpler than the Fruit Gathering game, so competition and cooperation are easier to analyze in it.
  • an agent that collects only its own fruit is cooperating, while an agent that collects all the fruit is competing, so the Apple-Pear game can be used to analyze the effectiveness of our algorithm more clearly.
  • the average rewards of agent1 and agent2 can also be used to verify that the degree of cooperation of the synthesized strategy lies between those of the basic strategies.
  • the agent's own reward increases as its own degree of cooperation is reduced, and is reduced as the opponent's degree of cooperation is increased.
  • together with the reward when both choose to cooperate, this shows that the Fruit Gathering and Apple-Pear games can be regarded as SPDs with respect to the degree of cooperation.
  • the actor and critic share the underlying network: the first hidden layer uses 32 8×8 convolution kernels with stride 4 and is activated by ReLU.
  • the second hidden layer uses 64 4×4 convolution kernels with stride 2 and is activated by ReLU.
  • the third hidden layer uses 64 3×3 convolution kernels with stride 1 and is activated by ReLU.
  • the actor head is a fully connected layer with 128 nodes activated by ReLU, and its last layer has the same number of nodes as actions and is activated by softmax.
  • the critic is similar in structure to the actor but ends with a single output.
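A minimal PyTorch sketch of this actor-critic network, assuming an 84 × 84 × 3 input and no padding in the convolutions (the standard DQN trunk); it is illustrative only, not the patent's implementation.

```python
import torch
import torch.nn as nn

class ActorCritic(nn.Module):
    """Actor and critic sharing the convolutional trunk described above."""
    def __init__(self, n_actions=8):
        super().__init__()
        self.trunk = nn.Sequential(                                  # shared underlying network
            nn.Conv2d(3, 32, kernel_size=8, stride=4), nn.ReLU(),    # 84 -> 20
            nn.Conv2d(32, 64, kernel_size=4, stride=2), nn.ReLU(),   # 20 -> 9
            nn.Conv2d(64, 64, kernel_size=3, stride=1), nn.ReLU(),   # 9 -> 7
            nn.Flatten(),
        )
        self.actor = nn.Sequential(
            nn.Linear(64 * 7 * 7, 128), nn.ReLU(),
            nn.Linear(128, n_actions), nn.Softmax(dim=-1),           # action probabilities
        )
        self.critic = nn.Sequential(
            nn.Linear(64 * 7 * 7, 128), nn.ReLU(),
            nn.Linear(128, 1),                                       # single value output
        )

    def forward(self, obs):                 # obs: (batch, 3, 84, 84) scaled to [0, 1]
        h = self.trunk(obs)
        return self.actor(h), self.critic(h)

pi, v = ActorCritic()(torch.rand(1, 3, 84, 84))
```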
  • the network structure of the cooperation degree detector is shown in Figure 2.
  • the structure of the common part is: the first layer is a convolution with 10 3×3 kernels and stride 2, activated by ReLU.
  • the second layer has the same structure as the first.
  • the third layer is a convolution with 10 3×3 kernels and stride 3, activated by ReLU.
  • the autoencoder branch connected to the third layer is: a deconvolution with 10 3×3 kernels, activated by sigmoid, with output shape 21 × 21 × 10; then a deconvolution with 10 3×3 kernels and stride 2, activated by sigmoid, with output shape 42 × 42 × 10; and finally a deconvolution with 10 3×3 kernels, activated by sigmoid, with output shape 84 × 84 × 10.
  • the cd detection branch also follows the third layer of the network and consists of two LSTM (long short-term memory) layers of 256 nodes each, followed by a single output node.
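A PyTorch sketch of the detector is given below. It assumes padding choices that produce the spatial sizes 84 → 42 → 21 → 7 implied by the deconvolution output shapes, and that the LSTM runs over the flattened convolutional features of successive frames; these details and the final sigmoid are assumptions, not taken from the patent.

```python
import torch
import torch.nn as nn

class CooperationDegreeDetector(nn.Module):
    """Shared conv encoder with an autoencoder branch and an LSTM cd branch."""
    def __init__(self):
        super().__init__()
        self.encoder = nn.Sequential(                                   # shared part
            nn.Conv2d(3, 10, 3, stride=2, padding=1), nn.ReLU(),        # 84 -> 42
            nn.Conv2d(10, 10, 3, stride=2, padding=1), nn.ReLU(),       # 42 -> 21
            nn.Conv2d(10, 10, 3, stride=3), nn.ReLU(),                  # 21 -> 7
        )
        self.decoder = nn.Sequential(                                   # autoencoder branch
            nn.ConvTranspose2d(10, 10, 3, stride=3), nn.Sigmoid(),                              # 7 -> 21
            nn.ConvTranspose2d(10, 10, 3, stride=2, padding=1, output_padding=1), nn.Sigmoid(), # 21 -> 42
            nn.ConvTranspose2d(10, 10, 3, stride=2, padding=1, output_padding=1), nn.Sigmoid(), # 42 -> 84
        )
        self.lstm = nn.LSTM(10 * 7 * 7, 256, num_layers=2, batch_first=True)  # cd detection branch
        self.head = nn.Linear(256, 1)                                   # single output node

    def forward(self, frames):              # frames: (batch, time, 3, 84, 84)
        b, t = frames.shape[:2]
        feats = self.encoder(frames.reshape(b * t, 3, 84, 84))
        recon = self.decoder(feats)         # 84 x 84 x 10 reconstruction, as stated in the text
        out, _ = self.lstm(feats.reshape(b, t, -1))
        cd = torch.sigmoid(self.head(out[:, -1]))   # estimated cooperation degree
        return recon, cd

recon, cd = CooperationDegreeDetector()(torch.rand(2, 5, 3, 84, 84))
```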
  • the exploration rate starts at 1 and gradually decreases to 0.1 over 20,000 steps.
  • the weight update uses a soft update ("soft" target updates) to avoid large fluctuations of the strategy: θ′ ← 0.05θ + 0.095θ′, where θ is a parameter of the target network and θ′ is a parameter of the strategy network used by the agent; the learning rate is 0.0001, the replay memory has a size of 25,000, and the batch size is 128.
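For reference, the exploration schedule and soft update can be written as in the following sketch (illustrative only; the update coefficients are quoted from the text as given).

```python
def epsilon(step, start=1.0, end=0.1, anneal_steps=20_000):
    """Exploration rate: starts at 1.0 and decays to 0.1 over 20,000 steps."""
    frac = min(step / anneal_steps, 1.0)
    return start + frac * (end - start)

def soft_update(theta, theta_prime, tau=0.05, keep=0.095):
    """theta' <- tau * theta + keep * theta'.

    The coefficients 0.05 and 0.095 are quoted from the text; a conventional
    soft update would use keep = 1 - tau.
    """
    return [tau * a + keep * b for a, b in zip(theta, theta_prime)]
```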
  • the range of w 1 (that is, att 1 in the table) is set from 0.1 to 0.9.
  • a larger w 1 means that the strategies learned by agent 1 will be more competitive, and the strategies learned by agent 2 will be more cooperative.
  • the final results of different w 1 trainings are shown in the following table.
  • the average reward of agent 1 and the average reward of agent 2 represent the average reward that only considers the fruit collected by the agent.
  • the changes of the agent1 average reward and the agent2 average reward with training episode are shown in Figure 5.
  • the average reward of agent2 gradually rises as w_1 decreases.
  • the agents can also learn with independent network structures; due to space limitations, we only show part of the cooperation degree learning results. As shown in the following table, the results with the joint-action network structure are comparable: the reward rises and falls with the agent's own degree of cooperation and rises as the opponent's degree of cooperation rises.
  • the strategy of agent1 is used as the basic competition strategy of agent1
  • the basic competition strategy determines the highest degree of competition for the agent, so we can limit the scope of agent cooperation and competition by choosing different strategies.
  • the strategy we selected satisfies: the degree of cooperation of the basic cooperation strategy is higher than that of the basic competition strategy.
  • Agent1 uses the Policy Generation algorithm.
  • π_new = w_c * π_c + (1 − w_c) * π_d
  • w_c ranges from 0 to 1 in intervals of 0.1. From the average reward it can be seen that as w_c rises, the average reward of agent1 gradually decreases, and that for a fixed w_c the average reward is larger against a more cooperative agent2. The table therefore shows that using different w_c in Policy Generation can synthesize strategies with different degrees of cooperation.
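The mixture π_new = w_c·π_c + (1 − w_c)·π_d can be sketched as follows (illustrative only; `pi_c` and `pi_d` stand for the action distributions of the basic cooperative and competitive strategies in a given state).

```python
import numpy as np

def mix_policies(pi_c, pi_d, w_c):
    """Return pi_new(a|s) = w_c * pi_c(a|s) + (1 - w_c) * pi_d(a|s)."""
    return w_c * np.asarray(pi_c, dtype=float) + (1.0 - w_c) * np.asarray(pi_d, dtype=float)

# Sweeping w_c from 0 to 1 in steps of 0.1 yields strategies whose degree of
# cooperation lies between the two basic strategies.
for w_c in np.arange(0.0, 1.01, 0.1):
    pi_new = mix_policies([0.1, 0.9], [0.8, 0.2], w_c)
```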
  • both agent1 and agent2 use Policy Generation to generate policies, and then use the generated policies to play with each other.
  • the Apple-Pear game can be regarded as an SPD with respect to the degree of cooperation between agents: if only its own interest is considered, an agent will choose a more competitive strategy, but if both agents choose competitive strategies, their respective rewards are lower than when both choose cooperative strategies, and from the total reward it can be seen that the total reward is largest in the cooperation-cooperation case.
  • agent1 plays strategies with cooperation degree (cd) from 0 to 1 in intervals of 0.1 against agent2 strategies with cd from 0 to 1 in intervals of 0.1.
  • the average value is shown in the following figure.
  • 1. agent1 initially uses a cooperative strategy and agent2 initially uses a competitive strategy;
  • 2. agent1 initially uses a competitive strategy and agent2 initially uses a cooperative strategy;
  • 3. both agents initially use cooperative strategies;
  • 4. both agents initially use competitive strategies. In self-play, both agents eventually converged to the cooperation-cooperation state, achieving a win-win situation.
  • all four initial settings eventually converged to cooperation-cooperation. If both parties cooperate initially, they remain cooperative. In the other three cases, the agents do not immediately adjust their strategies and first reach competition-competition; in that case, after one agent has tagged the other out of the environment, it begins collecting the remaining fruit and waiting for apples to appear. Since the other agent has been tagged out of the environment, the degree of cooperation given by the detector is slightly higher than that of a competing opponent, so the agent tends toward cooperating with the other agent. During this period the tagged agent may reappear; at that moment the detection part of the agent is still based on the other agent's behavior before it reappeared, so the two agents collect apples without actively tagging each other, and the two parties may converge to cooperation-cooperation.
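The adaptive behaviour observed in these experiments can be summarised by the following loop (a hedged sketch, not the patent's code): the detector estimates the opponent's cooperation degree from recent observations, and policy generation is used to respond with a matching degree, in the spirit of Tit for tat. `env`, `detector` and `policy_generator` are placeholders for the game environment, the trained cd detection network and the policy-generation routine described above.

```python
def adaptive_play(env, detector, policy_generator, episodes=100, initial_cd=1.0):
    """Tit-for-tat-style adaptation: mirror the opponent's detected cooperation degree."""
    own_cd = initial_cd                      # start cooperatively, as in Tit for tat
    for _ in range(episodes):
        policy = policy_generator(own_cd)    # synthesize a strategy with degree own_cd
        frames = env.play_episode(policy)    # observations gathered while playing the opponent
        own_cd = detector(frames)            # next episode: respond with the detected degree
    return own_cd
```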

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • General Health & Medical Sciences (AREA)
  • Computing Systems (AREA)
  • Biomedical Technology (AREA)
  • Biophysics (AREA)
  • Computational Linguistics (AREA)
  • Data Mining & Analysis (AREA)
  • Evolutionary Computation (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Molecular Biology (AREA)
  • Artificial Intelligence (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Mathematical Physics (AREA)
  • Software Systems (AREA)
  • Health & Medical Sciences (AREA)
  • Management, Administration, Business Operations System, And Electronic Commerce (AREA)
  • Electrically Operated Instructional Devices (AREA)

Abstract

The present invention relates to the field of data processing, and discloses a deep reinforcement learning-based adaptive game algorithm, comprising the following steps: (A) acquiring policies with different degrees of cooperation; (B) generating policies with different degrees of cooperation; (C) detecting the cooperation policy of an opponent; and (D) formulating different coping policies. The technical effects of the present invention are as follows: trained detectors and policies with different degrees of cooperation are used to implement existing concepts, such as Tit for tat, in sequential social dilemmas, improving the scalability of the agent and more intuitively acquiring competition policies superior to the agent's own.

Description

Adaptive game algorithm based on deep reinforcement learning
[Technical Field]
The invention relates to the field of data processing, and in particular to an adaptive game algorithm based on deep reinforcement learning.
[Background Art]
Reinforcement learning is applied in many fields, from games to robot control. Traditional reinforcement learning represents value functions or policies with tables or linear functions and is difficult to extend to complex problems; deep reinforcement learning, which combines reinforcement learning with deep learning, exploits the feature-extraction and function-approximation abilities of neural networks and has already seen some successful applications [DQN][Alpha Zero][PPO]. The prisoner's dilemma (PD game) has long been a research focus of matrix games. The PD game regards cooperation and competition as atomic actions, but in the real world a game consists of a series of actions; a temporally extended PD is called a sequential prisoner's dilemma (SPD). For the PD game, most multi-agent reinforcement learning (MARL) algorithms are based on traditional reinforcement learning and are difficult to extend directly to the SPD game. [SSD] observed the influence of resource changes on agents that only consider their own payoff, but did not propose a corresponding learning algorithm based on the characteristics of the SPD.
[Summary of the Invention]
In order to solve the problems in the prior art, the present invention provides an adaptive game algorithm based on deep reinforcement learning, to solve the problem of poor scalability of multi-agent systems in the prior art.
The present invention is realized through the following technical scheme: an adaptive game algorithm based on deep reinforcement learning is designed, comprising the following steps: (A) acquiring strategies with different degrees of cooperation; (B) generating strategies with different degrees of cooperation; (C) detecting the opponent's cooperation strategy; (D) formulating different coping strategies.
As a further improvement of the present invention: in step (A), training is performed using different network structures and/or different target reward forms to obtain strategies with different degrees of cooperation.
As a further improvement of the present invention: in step (A), strategies with different degrees of cooperation are obtained by modifying key factors in the environment that influence the degree of competition and cooperation, or by modifying the agent's learning goal.
As a further improvement of the present invention: in step (B), the strategies with different degrees of cooperation obtained in step (A) are set as expert networks, and the strategies with different degrees of cooperation in the expert networks are assigned weights; a strategy with a new degree of cooperation is then generated according to the degree of influence of each strategy.
As a further improvement of the present invention: the specific process of the algorithm for generating a strategy with a new degree of cooperation is as follows. Among the expert networks, π^i_att_n denotes the strategy of agent i obtained by training with cooperation degree att_n in the setting for training strategies with different degrees of cooperation. Each expert network predicts the state-action value obtained in the current state by playing with the other agents using the strategy of its cooperation degree. Based on the existing strategies π^i_att_1, …, π^i_att_n, a strategy π^i_new with a new degree of cooperation att_new is obtained for agent i by assigning different weights w_1, w_2, …, w_n:
att_new = f_value(att_1, att_2, …, att_n, w_1, w_2, …, w_n)
where f_value(att_1, att_2, …, att_n, w_1, w_2, …, w_n) is the degree of cooperation obtained by combining the basic strategies π^i_att_1, …, π^i_att_n with the corresponding weights w_1, w_2, …, w_n, and f_value is:
f_value(att_1, att_2, …, att_n, w_1, w_2, …, w_n) = att_1*w_1 + att_2*w_2 + … + att_n*w_n
Agent i's strategy with the new degree of cooperation is π^i_new, and att_new = f_policy(att_1, att_2, …, att_n, w_1, w_2, …, w_n) is defined to measure the degree of cooperation of the synthesized strategy.
As a further improvement of the present invention: in step (C), a neural network is used to determine the degree of cooperation; the structure of the neural network adopts a multi-task mode, combining an autoencoder structure with a classifier structure, and the autoencoder and classifier share the underlying parameters of the neural network.
As a further improvement of the present invention: in step (D), the opponent is assessed by detecting its degree of cooperation, and a strategy more favorable to the agent itself is generated.
As a further improvement of the present invention: the different network structures include the strategy network structure and the network structure of the cooperation degree detector.
As a further improvement of the present invention: the adaptive game algorithm based on deep reinforcement learning is applied in multi-agent environments.
The beneficial effects of the present invention are: using the trained detector and the strategies with different degrees of cooperation, existing ideas such as Tit for tat are implemented in sequential social dilemmas; the scalability of the agent is improved; and competitive strategies superior to the agent's own are obtained more intuitively.
[Brief Description of the Drawings]
FIG. 1 is a schematic diagram of generating strategies with different degrees of cooperation according to the present invention;
FIG. 2 is a schematic diagram of the network structure of the cooperation degree detector according to the present invention;
FIG. 3 is a schematic diagram of formulating different coping strategies according to the present invention.
[Detailed Description]
The present invention is further described below with reference to the accompanying drawings and specific embodiments.
An adaptive game algorithm based on deep reinforcement learning comprises the following steps: (A) acquiring strategies with different degrees of cooperation; (B) generating strategies with different degrees of cooperation; (C) detecting the opponent's cooperation strategy; (D) formulating different coping strategies.
In step (A), training is performed using different network structures and/or different target reward forms to obtain strategies with different degrees of cooperation.
In step (A), strategies with different degrees of cooperation are obtained by modifying key factors in the environment that influence the degree of competition and cooperation, or by modifying the agent's learning goal.
In step (B), the strategies with different degrees of cooperation obtained in step (A) are set as expert networks, and the strategies with different degrees of cooperation in the expert networks are assigned weights; a strategy with a new degree of cooperation is then generated according to the degree of influence of each strategy.
The specific process of the algorithm for generating a strategy with a new degree of cooperation is as follows. Among the expert networks, π^i_att_n denotes the strategy of agent i obtained by training with cooperation degree att_n in the setting for training strategies with different degrees of cooperation. Each expert network predicts the state-action value obtained in the current state by playing with the other agents using the strategy of its cooperation degree. Based on the existing strategies π^i_att_1, …, π^i_att_n, a strategy π^i_new with a new degree of cooperation att_new is obtained for agent i by assigning different weights w_1, w_2, …, w_n:
att_new = f_value(att_1, att_2, …, att_n, w_1, w_2, …, w_n)
where f_value(att_1, att_2, …, att_n, w_1, w_2, …, w_n) is the degree of cooperation obtained by combining the basic strategies π^i_att_1, …, π^i_att_n with the corresponding weights w_1, w_2, …, w_n, and f_value is:
f_value(att_1, att_2, …, att_n, w_1, w_2, …, w_n) = att_1*w_1 + att_2*w_2 + … + att_n*w_n
Agent i's strategy with the new degree of cooperation is π^i_new, and att_new = f_policy(att_1, att_2, …, att_n, w_1, w_2, …, w_n) is defined to measure the degree of cooperation of the synthesized strategy.
In step (C), a neural network is used to determine the degree of cooperation; the structure of the neural network adopts a multi-task mode, combining an autoencoder structure with a classifier structure, and the autoencoder and classifier share the underlying parameters of the neural network.
In step (D), the opponent is assessed by detecting its degree of cooperation, and a strategy more favorable to the agent itself is generated.
The different network structures include the strategy network structure and the network structure of the cooperation degree detector.
The strategy network structure is similar to the DQN structure and uses five layers: the first, second, and third hidden layers are all convolutional layers, the fourth layer is a fully connected layer, and the last layer has the same number of nodes as the number of actions. The network structure of the cooperation degree detector is a three-layer structure in which all three layers are convolutional layers; the autoencoder branch and the cooperation degree detection branch are both connected after the third layer.
The adaptive game algorithm based on deep reinforcement learning is applied to sequential social dilemmas in multi-agent environments.
The adaptive game algorithm based on deep reinforcement learning of the present invention, the Deep Reinforcement Learning Framework towards Mutual Cooperation (DRLFMC), first uses a weighted target reward to learn strategies with different degrees of cooperation, and the average reward of the agents is used to verify that the weighted target reward can adjust the agents' degree of cooperation and train strategies with different degrees of cooperation. Policy generation is then used to synthesize strategies with different degrees of cooperation from the strategies produced with the weighted target reward, and the average reward of the agents is again used to show that the synthesized strategies have different degrees of cooperation. Strategies with different degrees of cooperation are then used to generate data sets for training the cooperation degree detection network (cd detection network) to detect the opponent's degree of cooperation. Finally, policy generation and the cd detection network are used to adjust the agent's own degree of cooperation according to the opponent's: when the opponent leans toward cooperation, we also use a strategy leaning toward cooperation to achieve a win-win outcome; when the opponent uses a competitive strategy, we also adopt a competitive strategy to ensure our own minimum payoff.
Strategy networks with different degrees of cooperation are trained; strategies with existing degrees of cooperation are used to generate strategies with new degrees of cooperation; the opponent's degree of cooperation is detected; and deep game strategies are designed that draw on existing game-theoretic strategies. Joel Z. Leibo pointed out that in sequential social dilemmas the strategies of agents are not simply divided into cooperation and competition: some strategies lean toward cooperation but still contain competition, and some lean toward competition but still contain cooperation [Multi-agent Reinforcement Learning in Sequential Social Dilemmas]. Existing deep reinforcement learning algorithms ultimately learn only cooperative or competitive strategies and cannot learn a strategy with a specific degree of cooperation. Drawing on Steven Damer's idea of the attitude-modified game [Achieving Cooperation in a Minimally Constrained Environment, Learning to Cooperate in Normal Form Games], a method is designed that can be applied simply to existing deep reinforcement learning to obtain different degrees of cooperation. Because an agent may have infinitely many strategies with different degrees of cooperation and competition, which cannot be learned one by one, the idea of He He's Mixture-of-Experts architecture for deep reinforcement learning is borrowed to construct strategies with a new degree of cooperation from strategies with existing degrees [Opponent Modeling in Deep Reinforcement Learning]. For common problems in matrix games, such as social dilemmas, there already exist simple but effective heuristics such as Tit for tat, but these are hard to apply directly to sequential social dilemmas and deep reinforcement learning: on the one hand, agent strategies are not clear-cut cooperation or competition; on the other hand, implementing Tit for tat requires knowing whether the opponent cooperates. A method is therefore proposed for training a detector of the opponent's degree of cooperation; with the trained detector and strategies with different degrees of cooperation, existing ideas such as Tit for tat can be implemented in sequential social dilemmas.
The present invention is described below in four aspects.
I. Training strategies with different degrees of cooperation
One way to obtain strategies with different degrees of cooperation is to modify key factors in the environment that influence the degree of competition and cooperation, such as the amount of resources or the way agents interact. Joel Z. Leibo investigated whether resource changes affect cooperation and competition and found that self-interested independent learning agents cooperate more readily when resources are sufficient and compete more readily when resources are scarce; however, they only investigated the effect of resources and other factors on the agents' final learning results and did not go further to generate strategies with different degrees of cooperation. Therefore, in order to obtain strategies with different degrees of cooperation and competition in an environment e, the environment parameters can be set to obtain an environment e_scarcity that is scarcer in resources than e and more prone to competition; strategies obtained in e_scarcity will behave more competitively and less cooperatively in e. Conversely, an environment e_sufficient with more resources than e can be set, and strategies obtained in e_sufficient will behave more cooperatively and less competitively in e [Multi-agent Reinforcement Learning in Sequential Social Dilemmas].
Another way to obtain strategies with different degrees of cooperation is to modify the agent's learning goal. An agent whose learning goal is to maximize its own interest will naturally learn a more competitive strategy, and an agent whose learning goal is to maximize the overall interest will naturally learn a more cooperative strategy. To achieve cooperation in a minimally constrained environment, Steven Damer modified the agent's reward to be its own reward in the original environment plus an attitude multiplied by the opponent's reward in the original environment; by considering both its own reward and the opponent's reward while maximizing this reward, cooperation was eventually reached [Achieving Cooperation in a Minimally Constrained Environment, Learning to Cooperate in Normal Form Games]. This idea can be used not only to make agents finally cooperate, but also to obtain strategies with different degrees of cooperation. By controlling the magnitude of the attitude, the agent can weigh the importance of its own reward against the rewards of other agents during learning: the larger the attitude, the more the other agents' rewards are considered and the more cooperative the learned strategy becomes; the smaller the attitude, the more competitive the learned strategy becomes. At the same time, this approach can directly use current single-agent deep reinforcement learning (single-agent DRL) algorithms to obtain strategies with different degrees of cooperation (here it is assumed that the Markov game and related notation have been defined previously, e.g. there are N agents i_1, i_2, …, i_N; the actions of the agents are a_1, a_2, …; after taking the joint action (a_1, a_2, …) they obtain rewards r_1, r_2, …; and their strategies are π_1, π_2, …).
In order to obtain strategies with different degrees of cooperation, the present invention controls the degree to which the rewards of other agents are considered, letting each agent take the rewards obtained by the other agents into account. Here att_ef denotes the attitude of agent i_e towards the reward of agent i_f, r'_e denotes the reward of agent i_e after the other agents are taken into account, and the attitudes of agent i_e towards all agents can be written as attitude_e = {att_e1, att_e2, …, att_en}, n ∈ N; the deep reinforcement learning algorithms common in the single-agent setting can then be used directly to obtain strategies with different degrees of cooperation.
r'_1 = att_11*r_1 + att_12*r_2 + … + att_1n*r_n
r'_2 = att_21*r_1 + att_22*r_2 + … + att_2n*r_n
…
r'_n = att_n1*r_1 + att_n2*r_2 + … + att_nn*r_n
If an agent does not consider the rewards of other agents at all, i.e. {att_ef = 0 if e ≠ f}, this is the purely self-interested case: the agent maximizes its own cumulative payoff regardless of the payoffs of the other agents, ultimately leading to full competition. For {att_ef = 1 for all e, f ∈ N}, this is the cooperation case: the agents cooperate to maximize the global cumulative payoff. By adjusting the values of the attitudes we can obtain strategies with different degrees of cooperation: if the attitudes towards other agents are larger, the agent pays more attention to the other agents' payoffs and obtains a more cooperative strategy; if the values are smaller, the agent considers its own payoff more and obtains a more competitive strategy.
If each agent learns with its own deep reinforcement learning algorithm, then although the number of parameters grows only linearly, the neural networks have many parameters, so this is difficult to apply in environments with a larger number of agents. At the same time, because each agent is independent, more data is needed during learning. A single network can therefore be used to learn the joint strategy π_joint(a_1, a_2, …, a_n | s); sharing the underlying network alleviates the growth in parameters, and explicitly learning π_joint(a_1, a_2, …, a_n | s) for each a_1, a_2, …, a_n makes it possible to learn strategies with different degrees of cooperation more effectively. The joint reward of this network can be defined as follows, and the deep reinforcement learning algorithms common in the single-agent setting can again be used directly to obtain strategies with different degrees of cooperation.
r_total = att_1*r_1 + att_2*r_2 + … + att_n*r_n. This network obtains strategies with different degrees of cooperation by maximizing the accumulated r_total; att_e represents the degree to which r_total emphasizes the reward of agent i_e. Through the relative size of att_e, the larger att_e is, the higher the degree of competition of agent i_e, and the smaller att_e, the higher its degree of cooperation; agents with equal att_e have the same degree of cooperation. From the learned joint strategy π_joint(a_1, a_2, …, a_n | s), we can obtain the strategy for agent i:
[Equation: the strategy of agent i derived from the joint strategy π_joint(a_1, a_2, …, a_n | s)]
Generating strategies with different degrees of cooperation
We know that in Sequential Social Dilemmas agents can exhibit different degrees of cooperation. Against an opponent with a particular degree of cooperation, we may not want to respond simply with a purely cooperative or purely competitive strategy, but rather with a strategy that itself has a certain degree of cooperation. Using the attitude reward we can train strategies with different degrees of cooperation, but since an agent can in principle adopt infinitely many degrees of cooperation, we cannot train a separate strategy for each of them. We therefore propose a method that composes strategies with new degrees of cooperation from the strategies that have already been trained. He He designed a network structure that, in a fully competitive environment, can secure an agent's payoff against opponents with different strategies; the idea is to reuse a set of existing strategies, predict from the opponent's behaviour which of them respond well to it, and obtain a strategy against the current opponent by combining the existing ones. This idea can be used not only to adapt one's strategy to different opponents in a fully competitive environment, but also in Sequential Social Dilemmas to generate strategies with different degrees of cooperation. If value-based deep reinforcement learning is used to train the strategies with different degrees of cooperation, each trained strategy can be regarded as an expert network,

π_i^{att_n}

which denotes the strategy of agent i obtained with cooperation degree att_n in the training setting described above. Each expert network predicts the state-action value that playing with the other agents under its particular degree of cooperation would obtain in the current state. From the set of existing strategies with known degrees of cooperation

{π_i^{att_1}, π_i^{att_2}, ..., π_i^{att_n}}

we can obtain a strategy with a new degree of cooperation for agent i by assigning different weights w_1, w_2, ..., w_n:

(Equation image: the new strategy π_new for agent i composed from the expert networks with the weights w_1, w_2, ..., w_n.)
For π_new we can regard it as a strategy whose degree of cooperation is

att_new = f_value(att_1, att_2, ..., att_n, w_1, w_2, ..., w_n),

where f_value(att_1, att_2, ..., att_n, w_1, w_2, ..., w_n) is the degree of cooperation obtained when the basic strategies

{π_i^{att_1}, π_i^{att_2}, ..., π_i^{att_n}}

are combined with the corresponding weights w_1, w_2, ..., w_n. In other words, the degree of cooperation of the new strategy is computed from the degrees att_i of the existing strategies and their degrees of influence w_i on the new strategy. If we assume that when strategies with different degrees of cooperation are linearly superimposed their degrees of cooperation are also linearly superimposed, then f_value can be written as:
f_value(att_1, att_2, ..., att_n, w_1, w_2, ..., w_n) = att_1*w_1 + att_2*w_2 + … + att_n*w_n
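The sketch below illustrates the value-based composition under the linearity assumption above: the expert Q-values are mixed with the weights w, and the resulting degree of cooperation is the weighted sum of the experts' att values. The toy Q-functions and numbers are stand-ins for the trained expert networks, not results from the invention.

```python
import numpy as np

def compose_value_experts(q_values, weights):
    """Mix expert state-action values: Q_new(s, .) = sum_k w_k * Q_k(s, .).

    q_values: (n_experts, n_actions) array, row k is expert k's Q(s, a) for the current state.
    weights:  (n_experts,) mixing weights w_1, ..., w_n.
    Returns the mixed Q-values and the greedy action of the composed strategy.
    """
    q_values = np.asarray(q_values, dtype=float)
    weights = np.asarray(weights, dtype=float)
    q_new = weights @ q_values
    return q_new, int(np.argmax(q_new))

def composed_cooperation_degree(atts, weights):
    """att_new = att_1*w_1 + ... + att_n*w_n (the linear f_value above)."""
    return float(np.dot(atts, weights))

# Two toy experts for one state: a cooperative one (att = 1) and a competitive one
# (att = 0), mixed 70/30 towards cooperation (all numbers are assumptions).
q_values = [[0.2, 0.8, 0.1],   # cooperative expert
            [0.9, 0.1, 0.3]]   # competitive expert
weights = [0.7, 0.3]
q_new, action = compose_value_experts(q_values, weights)
print(q_new, action)                                      # [0.41 0.59 0.16] 1
print(composed_cooperation_degree([1.0, 0.0], weights))   # 0.7
```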
Similarly, this idea transfers naturally to strategies with different degrees of cooperation learned with policy-based deep reinforcement learning. The strategy with a new degree of cooperation for agent i is:

(Equation image: the new strategy for agent i obtained by weighting the policy-based strategies π_i^{att_1}, ..., π_i^{att_n} with w_1, w_2, ..., w_n.)

Similar to the value-based case, we can likewise define att_new to measure the degree of cooperation of the composed strategy:

att_new = f_policy(att_1, att_2, ..., att_n, w_1, w_2, ..., w_n)
Detecting the opponent's degree of cooperation
A common practice in game theory is to adjust one's own strategy according to the strategies of the other agents. Tit for tat, for example, cooperates in the first round and then chooses in each later round whatever the opponent did in the previous round: if the opponent defected last time, defect this time; if the opponent cooperated last time, keep cooperating. In this kind of game the strategy does not need to identify exactly which strategy the opponent uses; it only needs to judge from the resulting situation whether the opponent cooperates or competes. In a matrix game we can judge from the outcome alone whether an agent cooperated, but in a Markov game the degree of competition or cooperation of an agent depends on every action it takes over the course of the whole game, and it is hard to hand-craft a rule that directly scores that degree. Instead of judging by fixed rules, we can use supervised learning to learn directly from data how cooperative the sequence of actions taken by an agent within one episode is. This raises a problem: supervised learning needs data with corresponding labels, and normally it is hard to obtain data at different degrees of cooperation from which a discriminator could learn how cooperative an agent's behaviour is. However, using the methods above for training and generating strategies with different degrees of cooperation, for each agent i we can obtain a set of strategies of agent i with different degrees of cooperation

{π_i^{att_1}, π_i^{att_2}, ..., π_i^{att_n}}

So if we want to judge the degree of cooperation of agent i, we can generate a data set d:
(Equation images: the definition of the data set d, built from trajectories generated when agent i plays one of its strategies of known cooperation degree against the other agents.)

Here Π_k is the set of strategies of various cooperation degrees obtained for agent i_k by training and generating strategies with different degrees of cooperation, f is the function that assigns the cooperation-degree label of agent i in the current data, and X denotes the set of all possible trajectories <s_0, s_1, ..., s_n> produced when the agents interact in the environment with their own strategies.

(Equation images: the sampling notation, meaning that agent i adopts a specific strategy π_i while the other agents adopt strategies from their own strategy sets, the game is played in the environment, and the corresponding state sequence x = <s_1, s_2, ..., s_d> ∈ X is collected; the label assigned to x is the cooperation degree of the strategy π_i that generated it.)
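A sketch of how such a labelled data set might be assembled is given below; the environment interface (env.reset, env.step), the policy callables and the label convention (using the att value of the strategy that generated the trajectory) are assumptions made for illustration, not interfaces specified by the text.

```python
import random

def rollout(env, policies, horizon):
    """Play one episode with one policy per agent and return the state sequence."""
    s = env.reset()
    states = [s]
    for _ in range(horizon):
        actions = [pi(s) for pi in policies]   # each policy maps state -> action
        s, rewards, done = env.step(actions)
        states.append(s)
        if done:
            break
    return states

def build_cd_dataset(env, agent_idx, labelled_policies, other_policies, episodes, horizon):
    """Collect (trajectory, cooperation-degree label) pairs for one agent.

    labelled_policies: list of (att, policy) pairs for the agent whose degree we detect.
    other_policies:    list of policy lists for the remaining agents; one is sampled per episode.
    """
    data = []
    for _ in range(episodes):
        att, pi_i = random.choice(labelled_policies)
        opponents = [random.choice(cands) for cands in other_policies]
        policies = opponents[:agent_idx] + [pi_i] + opponents[agent_idx:]
        x = rollout(env, policies, horizon)
        data.append((x, att))   # label x with the att of the strategy that produced it
    return data
```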
Using the generated data set d, we can apply common supervised learning algorithms such as decision trees, SVMs or neural networks. Here we use a neural network to discriminate the degree of cooperation. The network we design follows a multi-task idea, combining the structure of an autoencoder with that of a classifier. An autoencoder consists of two parts, an encoder and a decoder; we define the encoder as φ and the decoder as ψ, as follows:
φ : X → F
ψ : F → X
φ, ψ = arg min_{φ,ψ} ||X − (ψ∘φ)(X)||²
By sharing the lower-layer parameters of the neural network with the autoencoder, the features extracted by the first few layers of the classifier become an effective representation of x, so the classification is performed on an effective representation of x and the classification result is better; in addition, the gradients from the autoencoder acting on the lower-layer parameters speed up the training of the classifier, and after a certain amount of training they keep the lower layers relatively stable, reducing oscillation while the classifier is trained. Suppose our classifier outputs the probability that x = <s_1, s_2, ..., s_d> belongs to label_1, label_2, ..., label_n, i.e.:

f_c(x) = f_c(<s_1, s_2, ..., s_d>) = <p_1, p_2, ..., p_n>

where <p_1, p_2, ..., p_n> represents the probabilities that x belongs to label_1, label_2, ..., label_n. The loss function of the classifier can then be defined as:

(Equation image: the classifier loss L(f_c(x), label).)

The network is trained by minimizing ∑_x L(f_c(x), label).
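The snippet below sketches how the two objectives might be combined during training, with the reconstruction loss of the autoencoder and the classification loss sharing the encoder. The simple linear layers, the cross-entropy choice for L and the weighting factor are assumptions for illustration; they deliberately simplify the convolutional/LSTM detector described later in the network structure section.

```python
import torch
import torch.nn as nn

class CooperationDegreeDetector(nn.Module):
    """Shared encoder with an autoencoder head and a classifier head (multi-task)."""
    def __init__(self, state_dim, feature_dim, n_labels):
        super().__init__()
        self.encoder = nn.Sequential(nn.Linear(state_dim, feature_dim), nn.ReLU())
        self.decoder = nn.Sequential(nn.Linear(feature_dim, state_dim), nn.Sigmoid())
        self.classifier = nn.Linear(feature_dim, n_labels)

    def forward(self, x):
        f = self.encoder(x)
        return self.decoder(f), self.classifier(f)

def training_step(model, optimizer, x, labels, recon_weight=1.0):
    """One optimisation step on the combined reconstruction + classification loss."""
    recon, logits = model(x)
    loss = recon_weight * nn.functional.mse_loss(recon, x) \
           + nn.functional.cross_entropy(logits, labels)
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()
```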
Play with different opponents
Using the methods for training strategies with different degrees of cooperation, generating strategies with different degrees of cooperation, and detecting the opponent's degree of cooperation, we can already obtain, in Sequential Social Dilemmas, strategies with different degrees of cooperation and a detector that judges how cooperative an agent is. On this basis we can borrow ideas from matrix games and apply them to Sequential Social Dilemmas. Take a class of ideas common in the matrix-game Prisoner's Dilemma as an example: if the opponent cooperates, we cooperate with the opponent and achieve a win-win outcome; if the opponent defects, we make sure of our own payoff. One classical embodiment of this idea is Tit for tat: cooperate in the first round, and in each later round do whatever the opponent did in the previous round; if the opponent defected last time, defect this time, and if the opponent cooperated last time, keep cooperating. We can borrow the core idea of Tit for tat: in Sequential Social Dilemmas, if the opponent adopted a strategy with attitude att_i over the last few episodes, then we adopt a strategy with attitude att_i in the current episode. That is, the opponent is judged by the classifier used for detecting its degree of cooperation. Suppose the opponent's behaviour over the last few episodes is represented by

x = <s_1, s_2, ..., s_d>.

Then through the classifier f_c(x) we obtain an estimate of the opponent's degree of cooperation over those episodes,

<p_1, p_2, ..., p_n>,

where the strategy whose degree of cooperation corresponds to p_i is π_i. Depending on whether our existing strategies come from value-based or policy-based deep reinforcement learning, a strategy with a degree of cooperation similar to the opponent's can then be adopted:
π_new = arg max_a Q_new(s, a)
      = arg max_a (p_1*Q_π1(s, a) + p_2*Q_π2(s, a) + … + p_n*Q_πn(s, a))
where Q_πi denotes the Q value under π_i, or

π_new(a|s) = p_1*π_1(a|s) + p_2*π_2(a|s) + … + p_n*π_n(a|s),

where π_i(a|s) denotes the probability of taking action a in state s under π_i. This extends the cooperation and defection of Tit for tat to different degrees of cooperation, as shown in Figure 3. On the one hand this approach secures the agent's own payoff; on the other hand it does not push the opponent towards a more competitive strategy because of lost payoff. Our framework is not limited to the Tit for tat idea: game-theoretic strategies that react to the opponent by judging information about it can all be transferred over quite simply.
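The following sketch ties the detector and the generated strategies together in a Tit-for-tat-like loop: the opponent's recent trajectory is classified, and the agent's composed action for the next episode mirrors the detected degree of cooperation. The detector and expert-Q interfaces, and all numbers, are assumptions for illustration.

```python
import numpy as np

def respond_in_kind(detector, expert_q, recent_trajectory, state):
    """Choose an action whose degree of cooperation mirrors the opponent's.

    detector:          callable mapping a trajectory to probabilities <p_1, ..., p_n>
                       over the known cooperation degrees.
    expert_q:          list of callables, expert_q[i](state) -> Q-values of strategy pi_i.
    recent_trajectory: the opponent's states <s_1, ..., s_d> from the last few episodes.
    """
    p = np.asarray(detector(recent_trajectory), dtype=float)
    q_stack = np.stack([q(state) for q in expert_q])      # (n_experts, n_actions)
    q_new = p @ q_stack                                    # p_1*Q_pi1 + ... + p_n*Q_pin
    return int(np.argmax(q_new))

# Toy illustration with two experts and a dummy detector (all values are assumptions):
detector = lambda traj: [0.8, 0.2]                         # opponent looks mostly cooperative
expert_q = [lambda s: np.array([0.1, 0.9]),                # cooperative expert
            lambda s: np.array([0.7, 0.2])]                # competitive expert
print(respond_in_kind(detector, expert_q, recent_trajectory=None, state=None))  # -> 1
```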
In a specific embodiment, our algorithm is applied to two games: the Fruit Gathering game and the Apple-Pear game. In the Fruit Gathering game there are two agents in the environment, one red and one blue, whose goal is to collect apples, represented by green pixels. When an agent collects an apple it receives a reward r_apple, and the apple disappears from the environment for a period N_apple before reappearing. While collecting, an agent can shoot a beam from its position towards its front; if an agent is hit twice by beams from the other agent, it is removed from the environment for a period N_tagged and then reappears. Each agent can choose among eight actions: move forward, move backward, move left, move right, turn left, turn right, use the beam, and stay in place. (In this environment an agent can tend towards cooperation, collecting fruit together and rarely using the beam; it can also choose to compete, emitting more beams while collecting fruit in order to hit the opponent and remove it from the environment, so as to collect more fruit itself.) In one embodiment of the Apple-Pear game there are two agents, a red one and a blue one, and two fruits, a red apple and a green pear. The agents move around the environment collecting fruit. The blue agent prefers the apple, so for agent 1 the reward is r_prefer when it obtains the apple alone and r_common when it obtains the pear alone; for the red agent it is the opposite, and r_prefer > r_common. When the two agents move to the same fruit at the same time, each obtains half of the fruit and therefore half of its reward. Each agent can perform four actions: move forward, backward, left and right, and every step costs c_step. In both environments the observation s received by an agent is an image of size 84*84*3. Although the Apple-Pear game is simpler than the Fruit Gathering game, competition and cooperation are easier to analyse in it: each agent collecting its own preferred fruit corresponds to cooperation, and an agent collecting every fruit corresponds to competition, so we use the Apple-Pear game to analyse the effectiveness of our algorithm more clearly.
To investigate the effectiveness of our algorithm, in both environments we first train with different network structures and different forms of target reward. For each target reward we use the Actor-Critic reinforcement learning algorithm and learn strategies with different degrees of cooperation by maximizing the accumulated discounted target reward. The degree of cooperation of a learned strategy can be analysed through the average rewards of agent 1 and agent 2. Second, we investigate the effectiveness of Policy Generation: based on the agents' average rewards and the actual behaviour of the strategies, we pick two strategies out of those trained with different target rewards as the basic cooperative strategy and the basic competitive strategy, and use them as benchmarks for cooperation and competition. We then linearly combine the basic cooperative strategy and the basic competitive strategy to obtain strategies with different degrees of cooperation. For the strategies synthesized at different degrees of cooperation, the average rewards of agent 1 and agent 2 again verify that the degree of cooperation of a synthesized strategy lies between those of the basic strategies. In the experiments we let the strategies synthesized at different degrees of cooperation play against each other and use the agents' rewards to verify the effectiveness of Policy Generation. Moreover, according to the results of play between strategies of different degrees of cooperation, an agent's own reward rises as its own degree of cooperation falls and falls as the opponent's degree of cooperation rises, and the reward when both choose to compete is smaller than when both choose to cooperate; this shows that the Fruit Gathering game and the Apple-Pear game can be regarded as sequential prisoner's dilemmas (SPD) with respect to the degree of cooperation. Third, we let strategies with different degrees of cooperation play against each other to collect data; because we know the agents' degrees of cooperation in these plays, we can give the data the corresponding labels. After generating the data set, we use a neural network combined with an autoencoder to learn the agents' degree of cooperation. Finally, we combine Policy Generation with attitude detection: first detect the opponent's degree of cooperation and then dynamically adjust the degree of cooperation of our own strategy accordingly. In the experiments we test self-play, opponents with fixed degrees of cooperation, and opponents that change their strategies.
Network structure
In both environments our policy network structure is similar to the DQN structure, and the actor and critic share the lower layers: the first hidden layer uses 32 convolution kernels of size 8x8 with stride 4 and ReLU activation; the second hidden layer uses 64 convolution kernels of size 4x4 with stride 2 and ReLU activation; the third hidden layer uses 64 convolution kernels of size 3x3 with stride 1 and ReLU activation. On top of the shared structure, the actor has a fully connected layer of 128 nodes with ReLU activation, and its last layer has as many nodes as there are actions, with softmax activation. The critic has a structure similar to the actor but ends with a single output.
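A minimal PyTorch sketch of an actor-critic network with these layer sizes might look as follows; the 84x84x3 input comes from the environment description above, while the flattened feature size is standard DQN-style shape arithmetic, so this should be read as an illustrative reconstruction rather than the exact implementation.

```python
import torch
import torch.nn as nn

class ActorCritic(nn.Module):
    """Shared conv trunk (DQN-style) with an actor head and a critic head."""
    def __init__(self, n_actions, in_channels=3):
        super().__init__()
        self.trunk = nn.Sequential(
            nn.Conv2d(in_channels, 32, kernel_size=8, stride=4), nn.ReLU(),  # 84 -> 20
            nn.Conv2d(32, 64, kernel_size=4, stride=2), nn.ReLU(),           # 20 -> 9
            nn.Conv2d(64, 64, kernel_size=3, stride=1), nn.ReLU(),           # 9  -> 7
            nn.Flatten(),
        )
        feat = 64 * 7 * 7
        self.actor = nn.Sequential(nn.Linear(feat, 128), nn.ReLU(),
                                   nn.Linear(128, n_actions), nn.Softmax(dim=-1))
        self.critic = nn.Sequential(nn.Linear(feat, 128), nn.ReLU(),
                                    nn.Linear(128, 1))

    def forward(self, x):
        f = self.trunk(x)
        return self.actor(f), self.critic(f)

# Illustrative forward pass with a batch of one 84x84x3 observation:
net = ActorCritic(n_actions=8)
probs, value = net(torch.zeros(1, 3, 84, 84))
print(probs.shape, value.shape)   # torch.Size([1, 8]) torch.Size([1, 1])
```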
The network structure of the cooperation degree detector is shown in Figure 2. The shared part is structured as follows: the first layer uses 10 convolution kernels of size 3x3 with stride 2 and ReLU activation; the second layer has the same structure as the first; the third layer uses 10 convolution kernels of size 3x3 with stride 3 and ReLU activation. The autoencoder branch attached to the third layer consists of 10 deconvolution kernels of size 3x3 with sigmoid activation and output shape 21*21*10, followed by a deconvolution with 10 kernels of size 3x3 and stride 2, sigmoid activation and output shape 42*42*10, and finally a deconvolution with 10 kernels of size 3x3, sigmoid activation and output shape 84*84*10. The cd detection branch attached to the third layer consists of two long short-term memory (LSTM) layers of 256 nodes, followed by a single output node.
Parameter settings
In the Apple-Pear game we limit each episode to at most 100 steps. The exploration rate starts at 1 and decays gradually to 0.1 over 20000 steps. Weights are updated with soft target updates to avoid large oscillations of the strategy: θ′ ← 0.05θ + 0.95θ′, where θ are the parameters of the target network and θ′ are the parameters of the policy network used by the agent. The learning rate is 0.0001, the memory size is 25000, and the batch size is 128. In the environment r_prefer = 1, r_common = 0.5 and c_step = 0.01.
In the Fruit Gathering game we use the same detector as in the Apple-Pear game for learning, while the actor-critic part learns independently for each agent. In the Fruit Gathering game we set N_apple = 40, N_tagged = 20 and r_apple = 1. During training we limit each episode to 1000 steps; the agent's exploration rate starts at 1 and decays gradually to 0.1 over 200000 steps; the memory size is 250000 and the learning rate is 0.0001. Soft target updates are used again: θ′ ← 0.001θ + 0.999θ′, the network weights are updated every 4 steps, and the batch size is 64.
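As an illustration of the soft target update used in both settings, a small helper might look like the following; which network plays the role of θ and which of θ′ follows whatever convention the implementation chooses, and the interpolation factor τ is passed in (0.05 above for the Apple-Pear setting, 0.001 for the Fruit Gathering setting).

```python
import torch

@torch.no_grad()
def soft_update(dst_net, src_net, tau):
    """Soft update applied parameter-wise: dst <- tau * src + (1 - tau) * dst."""
    for dst_param, src_param in zip(dst_net.parameters(), src_net.parameters()):
        dst_param.mul_(1.0 - tau).add_(tau * src_param)

# Usage sketch: soft_update(slow_net, fast_net, tau) with tau = 0.05 or 0.001.
```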
Weighted target reward
We train our network with the joint-action method mentioned in the section on training strategies with different degrees of cooperation. In the Apple-Pear game the weighted target reward can be set as:
r = w_1*r_1 + (1 − w_1)*r_2
Because we use a network structure based on joint actions, if agent 1 and agent 2 have the same attitude value the two agents have the same degree of cooperation; this is equivalent to scaling the reward up or down by the same factor, and the strategies finally learned are roughly the same, so the sum of the two agents' weights can be normalized to 1.
The range of w_1 (i.e. att_1 in the tables) is set from 0.1 to 0.9; a larger w_1 means that the strategy learned by agent 1 is more competitive and the strategy learned by agent 2 is more cooperative. The main reason for not setting w_1 to 0 or 1 is that when w_1 = 1 we have 1 − w_1 = 0, and the learned strategy of agent 2 then becomes completely meaningless: agent 2 would merely move around the environment, avoiding fruit and avoiding collisions with agent 1. We want each agent's strategy to be meaningful, differing only in degree of cooperation, so we exclude w_1 = 0 and w_1 = 1 and train the network only with w_1 ranging from 0.1 to 0.9.
The final results of training with different w_1 are shown in the table below; the agent 1 average reward and agent 2 average reward consider only the reward from the fruit collected by each agent. The evolution of the two average rewards over the training episodes is shown in Figure 5. From the table and figure, for agent 1, the larger w_1 is, the larger agent 1's average reward is, indicating that agent 1's strategy leans more towards competition; for example, when w_1 = 0.9 the agent 1 average reward is 1.104, meaning agent 1 not only collects its own fruit but also repeatedly tries to collect the opponent's fruit. When w_1 is 0.1 and 0.2, the agent 1 average reward is 0.4065 and 0.514 respectively, close to r_common = 0.5, indicating that agent 1 mostly collects its preferred fruit and rarely collects the opponent's fruit. Similarly, agent 2's average reward gradually rises as w_1 decreases.
w_1     agent 1 average reward     agent 2 average reward
0.1     0.4065                     1.07575
0.2     0.514                      0.9775
0.3     0.613                      0.88175
0.4     0.752                      0.734
0.5     0.789                      0.6945
0.6     0.8945                     0.6125
0.7     0.9735                     0.53325
0.8     1.0435                     0.4275
0.9     1.104                      0.363
Similarly, the agents can learn with an independent network structure. For reasons of space we only show part of the learning results for different degrees of cooperation, as in the table below; as with the joint-action network structure, an agent's average reward falls as its own degree of cooperation rises and rises as the opponent's degree of cooperation rises.

(Table images: average rewards of the independently trained agents at selected degrees of cooperation.)
In the Fruit Gathering game we use an independent weighted target reward. The main reason is that r_apple is the same for both agents, so if the joint r were maximized, the agent with the higher reward weight would try to collect all the apples while the agent with the lower weight would try not to hinder it; the final result could be that the highly weighted agent collects every apple and the lowly weighted agent learns a meaningless strategy. Using independent rewards alleviates this problem:
r′_1 = r_1 + att_12*r_2
r′_2 = r_2 + att_21*r_1
By restricting att_12 and att_21 to the range [0, 1], each agent's strategy considers the opponent's reward to a varying degree on top of collecting apples itself.
As shown in the table below (only part of the results is shown for reasons of space), when the opponent's att does not change, the more an agent considers the opponent's reward, the less often it uses the beam and the lower its own reward. This shows that the weighted target reward can also train strategies with different degrees of cooperation in the Fruit Gathering game.
Policy Generation
To generate strategies with different degrees of cooperation in the Apple-Pear game, we choose the strategies of agent 1 and agent 2 learned with w_1 = 0.5 as the basic cooperative strategies of agent 1 and agent 2, the strategy of agent 1 learned with w_1 = 0.75 as agent 1's basic competitive strategy, and the strategy of agent 2 learned with w_1 = 0.25 as agent 2's basic competitive strategy. In fact there is no particular restriction on how the basic competitive and cooperative strategies are chosen, as long as the degree of cooperation of the basic cooperative strategy is higher than that of the basic competitive strategy: the chosen basic cooperative strategy determines the agent's highest degree of cooperation and the chosen basic competitive strategy its highest degree of competition, so the range of cooperation and competition can be limited by choosing different strategies. The strategies we choose satisfy this basic constraint. Moreover, the w_1 = 0.75 chosen for agent 1's competitive strategy lies between 0.7 and 0.8, and from the table above (the w_1 table from 0.1 to 0.9) agent 1's strategy at this setting collects its own fruit while also collecting the opponent's fruit. Similarly, the competitive strategy of agent 2 can be shown to be correspondingly competitive.
After selecting the basic competitive and cooperative strategies, we can use the algorithm proposed in the section on generating strategies with different degrees of cooperation to synthesize strategies. To verify the effectiveness of the synthesized strategies, we let them play against fixed opponents and judge the effect through the final average reward, as shown in the table below, which gives the averages over 200000 episodes.
(Table 1 — table image: average rewards of the strategies generated with different w_c against the fixed opponent, averaged over 200000 episodes.)
We fix agent 2 to use only the benchmark cooperative strategy and the benchmark competitive strategy, while agent 1 uses the Policy Generation algorithm with π_new = w_c*π_c + (1 − w_c)*π_d, where w_c ranges from 0 to 1 in steps of 0.1. The average rewards show that as w_c rises, agent 1's average reward gradually falls, and for a fixed w_c agent 1's average reward is larger against a more cooperative agent 2. The table therefore shows that Policy Generation with different w_c can synthesize strategies with different degrees of cooperation.
Furthermore, when both agent 1 and agent 2 use Policy Generation to generate strategies and the generated strategies play against each other, it becomes clear that the Apple-Pear game can be regarded as an SPD with respect to the agents' degree of cooperation. If an agent considers only its own interest it will choose a more competitive strategy, but if both agents choose competitive strategies, each of their rewards is smaller than when both choose cooperative strategies; at the same time the total reward is largest in the cooperate-cooperate case.
Similarly to the Apple-Pear game, in the Fruit Gathering game we choose the strategies learned with att_12 = 0 and att_21 = 0 as the agents' basic competitive strategies and the strategies learned with att_12 = 0.5 and att_21 = 0.5 as their cooperative strategies. The strategies learned with att_12 = 1 and att_21 = 1 are not chosen as the cooperative strategies because in that setting the agents' rewards are identical, so one agent may end up learning to collect apples while the other, although it never learns that behaviour, still receives the corresponding reward and therefore fails to learn anything meaningful.
After selecting the strategies, we again use Policy Generation to generate strategies and let the generated strategies play against each other; the Fruit Gathering game can likewise be regarded as an SPD with respect to the degree of cooperation.
Cooperation Degree Detection
As described in the section on detecting the opponent's degree of cooperation, in order to detect agent 2's degree of cooperation we regard agent 2's basic cooperative strategy and basic competitive strategy as strategies with cooperation degree 1 and cooperation degree 0 respectively. Fixing agent 2's strategy to the basic cooperative strategy and letting it play against agent 1 using strategies of different degrees of cooperation, we collect data <S_1, S_2, ..., S_d> and label it 1; in the same way, data labelled 0 is collected by letting the basic competitive strategy play against strategies of different degrees of cooperation.
In the Apple-Pear game we collect 10000 data points for each of the lengths 3, 4, 5, 6, 7, 8 and 9 to train our cd detector. The training results are shown in the figure below; the cd detector identifies agent 2's degree of cooperation well on the data set. To test the detector on agent 2 strategies with different degrees of cooperation, agent 1 uses strategies with cd from 0 to 1 in steps of 0.1 and agent 2 likewise uses strategies with cd from 0 to 1 in steps of 0.1; each pairing plays 10000 episodes and the cd detector estimates agent 2's cd over these episodes, with the mean values shown in the figure below. Although we train the detector only with label 1 and label 0, for agent 2 strategies with cd between 0 and 1 the detector can distinguish the relative order of the degrees of cooperation, even though the values are not necessarily accurate. To obtain an accurate agent 2 cd, we observe that the detected value changes almost linearly, so we fit a straight line and use it to compute an approximation of agent 2's cd.
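A small sketch of this calibration step, assuming the raw detector outputs and the corresponding true cooperation degrees are available as paired arrays, could fit a least-squares line and invert it to map a new detector reading back to an estimated cd; the numbers below are assumptions, not measured results.

```python
import numpy as np

def fit_cd_calibration(true_cd, detector_out):
    """Fit detector_out ~= a * true_cd + b and return a function mapping
    a new detector reading back to an estimated cooperation degree."""
    a, b = np.polyfit(np.asarray(true_cd, float), np.asarray(detector_out, float), deg=1)
    return lambda reading: float(np.clip((reading - b) / a, 0.0, 1.0))

# Illustrative calibration data:
true_cd = [0.0, 0.25, 0.5, 0.75, 1.0]
detector_out = [0.10, 0.28, 0.52, 0.71, 0.93]
estimate_cd = fit_cd_calibration(true_cd, detector_out)
print(round(estimate_cd(0.60), 3))   # approximate cd for a detector reading of 0.60
```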
In the Fruit Gathering game we likewise use the generated strategies to collect data sets of different degrees of cooperation. The data are labelled in the same way as in the Apple-Pear game: we collect 2000 trajectories each of lengths 40, 50, 60 and 70 for cooperation degree 1 and for cooperation degree 0, and train the detector in the same way. We again let agents with different degrees of cooperation play against each other to test the detector. The detected value is roughly linear in the attitude, but for some degrees of cooperation, for example when the opponent cooperates fully, the detected value correlates strongly with the agent's own strategy; we therefore fit a separate straight line for each of the agent's own degrees of cooperation, and when detecting the opponent's degree of cooperation we pick the line closest to our own degree of cooperation to estimate the opponent's degree of cooperation.
Performance with different opponents
After training the cd detector, in order to test the effectiveness of the proposed algorithm we investigate its behaviour in self-play and against opponents that keep changing their strategies.
First, in the self-play setting of the Apple-Pear game, we set up four cases: 1. agent 1 starts with the cooperative strategy and agent 2 with the competitive strategy; 2. agent 1 starts with the competitive strategy and agent 2 with the cooperative strategy; 3. both agents start with the cooperative strategy; 4. both agents start with the competitive strategy. In self-play both agents eventually converge to the cooperate-cooperate state and reach a win-win outcome. Because the environment is initialized randomly, even if the opponent uses the competitive strategy, under this initialization the state finally reached by the agent and its opponent is consistent with the effect of a cooperative strategy, so the agent judges the opponent to have a certain degree of cooperation, adjusts its own degree of cooperation next time, and the two sides finally reach the cooperate-cooperate outcome.
When facing an opponent that keeps changing its strategy, we let the opponent switch between the competitive strategy and the cooperative strategy every T episodes. In the detection part we find that the detector tracks the opponent's strategy fairly well, but when the opponent adjusts its strategy quickly the detection contains some error; for example at T = 50 the detected curve is not as clean as at T = 110, 150, 190. The main reason is that detection uses the preceding episodes as data, so at the beginning of a strategy change the detection effect is not particularly obvious, and by the time the detector realizes the opponent has changed strategy and starts adjusting its own strategy, the opponent's strategy may have changed again. This is handled much better when T is large, where the detection works much better than under fast switching. The reward of our algorithm lies between that of the basic cooperative strategy and the basic competitive strategy, showing that after detection the strategy adjusted to a different degree of cooperation can secure its own payoff when facing the competitive strategy; in terms of total reward our curve is higher than that of the competitive strategy, showing that when facing a cooperative opponent our strategy can adjust to a cooperative strategy and achieve a win-win outcome with the opponent.
Self-play and play against changing opponents
In the Fruit Gathering game we likewise test self-play and opponents that keep changing their strategies. In one embodiment, the agents under all four initial settings eventually converge to cooperate-cooperate. If both sides start cooperative, they stay cooperative. In the other three cases, the agents at first do not adjust their own strategies and reach the compete-compete situation; after one agent tags the other out of the environment, it starts collecting the remaining fruit and waits for apples to appear. Because the other agent has been tagged out of the environment, the detected degree of cooperation is slightly higher than that of a competitive opponent, so the agent expects to reach cooperation with the other agent. During this period the tagged agent may reappear; at that moment its detection is based on the other agent's behaviour before it reappeared, so when both agents go to collect apples without actively tagging each other, the two sides may converge to cooperate-cooperate.
Against an opponent that keeps switching its strategy, the effect is not very good when the switching interval is small; when the interval is larger, the agent can better estimate the opponent's degree of cooperation and adjust its own strategy accordingly. From the agent's own reward we can see that our algorithm does better than the cooperative strategy, so when the opponent adopts the competitive strategy we also adopt a competitive strategy to secure our own payoff; in terms of total reward, our strategy's total reward is better than that of the competitive strategy, showing that when the opponent cooperates our strategy adopts the cooperative strategy and achieves a win-win outcome with the opponent.
The above is a further detailed description of the present invention in combination with specific preferred embodiments, and the specific implementation of the present invention should not be considered limited to these descriptions. For a person of ordinary skill in the technical field to which the present invention belongs, several simple deductions or substitutions may be made without departing from the concept of the present invention, and all of them shall be regarded as falling within the protection scope of the present invention.

Claims (9)

  1. An adaptive game algorithm based on deep reinforcement learning, characterized by comprising the following steps: (A) obtaining strategies with different degrees of cooperation; (B) generating strategies with different degrees of cooperation; (C) detecting the opponent's cooperation strategy; (D) formulating different coping strategies.
  2. The adaptive game algorithm based on deep reinforcement learning according to claim 1, characterized in that in step (A), strategies with different degrees of cooperation are trained and obtained by using different network structures and/or different forms of target reward.
  3. The adaptive game algorithm based on deep reinforcement learning according to claim 1, characterized in that in step (A), strategies with different degrees of cooperation are obtained by modifying key factors in the environment that influence the degree of competition and cooperation or by modifying the learning objectives of the agents.
  4. The adaptive game algorithm based on deep reinforcement learning according to claim 1, characterized in that in step (B), the strategies with different degrees of cooperation obtained in step (A) are set as expert networks, weights are assigned to the strategies with different degrees of cooperation in the expert networks, and a strategy with a new degree of cooperation is generated according to the degrees of influence of the strategies with different degrees of cooperation.
  5. The adaptive game algorithm based on deep reinforcement learning according to claim 4, characterized in that the specific process of the algorithm for generating a strategy with a new degree of cooperation is as follows: among the expert networks,

    π_i^{att_n}

    denotes the strategy of agent i obtained with cooperation degree att_n in the setting for training strategies with different degrees of cooperation; each expert network predicts the state-action value obtainable in the current state by playing with the other agents using a strategy of its degree of cooperation; from the existing strategies of known degrees of cooperation

    {π_i^{att_1}, π_i^{att_2}, ..., π_i^{att_n}}

    a strategy with a new degree of cooperation for agent i is obtained by assigning different weights w_1, w_2, ..., w_n:

    (Equation image: the new strategy π_new composed from the expert networks with the weights w_1, w_2, ..., w_n.)

    π_new is a strategy whose degree of cooperation is att_new,

    att_new = f_value(att_1, att_2, ..., att_n, w_1, w_2, ..., w_n),

    where f_value(att_1, att_2, ..., att_n, w_1, w_2, ..., w_n) is the degree of cooperation obtained when the basic strategies

    {π_i^{att_1}, π_i^{att_2}, ..., π_i^{att_n}}

    are combined with the corresponding weights w_1, w_2, ..., w_n, and f_value is:

    f_value(att_1, att_2, ..., att_n, w_1, w_2, ..., w_n) = att_1*w_1 + att_2*w_2 + … + att_n*w_n.

    The strategy with a new degree of cooperation for agent i is:

    (Equation image: the new strategy for agent i composed from the policy-based strategies with the weights w_1, w_2, ..., w_n.)

    att_new is defined to measure the degree of cooperation of the composed strategy: att_new = f_policy(att_1, att_2, ..., att_n, w_1, w_2, ..., w_n).
  6. The adaptive game algorithm based on deep reinforcement learning according to claim 1, characterized in that in step (C), a neural network is used to discriminate the degree of cooperation; the structure of the neural network adopts a multi-task mode, combining the structure of an autoencoder with the structure of a classifier, and the autoencoder and the classifier share the lower-layer parameters of the neural network.
  7. The adaptive game algorithm based on deep reinforcement learning according to claim 1, characterized in that in step (D), the opponent is judged by detecting the opponent's degree of cooperation, and a strategy more favourable to the agent itself is generated.
  8. The adaptive game algorithm based on deep reinforcement learning according to claim 2, characterized in that the different network structures include the policy network structure and the network structure of the cooperation degree detector.
  9. The adaptive game algorithm based on deep reinforcement learning according to claim 1, characterized in that the adaptive game algorithm based on deep reinforcement learning is applied in a multi-agent environment.
PCT/CN2018/097747 2018-07-30 2018-07-30 Deep reinforcement learning-based adaptive game algorithm WO2020024097A1 (en)

Priority Applications (2)

Application Number Priority Date Filing Date Title
CN201880001586.XA CN109496318A (en) 2018-07-30 2018-07-30 Adaptive game playing algorithm based on deeply study
PCT/CN2018/097747 WO2020024097A1 (en) 2018-07-30 2018-07-30 Deep reinforcement learning-based adaptive game algorithm

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
PCT/CN2018/097747 WO2020024097A1 (en) 2018-07-30 2018-07-30 Deep reinforcement learning-based adaptive game algorithm

Publications (1)

Publication Number Publication Date
WO2020024097A1 true WO2020024097A1 (en) 2020-02-06

Family

ID=65713819

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/CN2018/097747 WO2020024097A1 (en) 2018-07-30 2018-07-30 Deep reinforcement learning-based adaptive game algorithm

Country Status (2)

Country Link
CN (1) CN109496318A (en)
WO (1) WO2020024097A1 (en)

Cited By (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN113361594A (en) * 2021-06-03 2021-09-07 安徽理工大学 Countermeasure sample generation method based on generation model
CN113791634A (en) * 2021-08-22 2021-12-14 西北工业大学 Multi-aircraft air combat decision method based on multi-agent reinforcement learning
CN114154397A (en) * 2021-11-09 2022-03-08 大连理工大学 Implicit adversary modeling method based on deep reinforcement learning

Families Citing this family (14)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN110081893B (en) * 2019-04-01 2020-09-25 东莞理工学院 Navigation path planning method based on strategy reuse and reinforcement learning
CN110428057A (en) * 2019-05-06 2019-11-08 南京大学 A kind of intelligent game playing system based on multiple agent deeply learning algorithm
CN110399920B (en) * 2019-07-25 2021-07-27 哈尔滨工业大学(深圳) Non-complete information game method, device and system based on deep reinforcement learning and storage medium
CN111030764B (en) * 2019-10-31 2021-02-02 武汉大学 Crowdsourcing user information age management method based on random game online learning
CN110909465B (en) * 2019-11-20 2021-08-31 北京航空航天大学 Cooperative game cluster visual maintenance method based on intelligent learning
CN111026548B (en) * 2019-11-28 2023-05-09 国网甘肃省电力公司电力科学研究院 Power communication equipment test resource scheduling method for reverse deep reinforcement learning
CN111294242A (en) * 2020-02-16 2020-06-16 湖南大学 Multi-hop learning method for improving cooperation level of multi-agent system
CN111488992A (en) * 2020-03-03 2020-08-04 中国电子科技集团公司第五十二研究所 Simulator adversary reinforcing device based on artificial intelligence
CN111766782B (en) * 2020-06-28 2021-07-13 浙江大学 Strategy selection method based on Actor-Critic framework in deep reinforcement learning
CN114154614B (en) * 2020-09-08 2024-06-11 深圳市优智创芯科技有限公司 Multi-agent game method based on impulse neural network
CN112364500B (en) * 2020-11-09 2021-07-20 中国科学院自动化研究所 Multi-concurrency real-time countermeasure system oriented to reinforcement learning training and evaluation
CN112260733B (en) * 2020-11-10 2022-02-01 东南大学 Multi-agent deep reinforcement learning-based MU-MISO hybrid precoding design method
CN112838946B (en) * 2020-12-17 2023-04-28 国网江苏省电力有限公司信息通信分公司 Method for constructing intelligent sensing and early warning model based on communication network faults
CN113599832B (en) * 2021-07-20 2023-05-16 北京大学 Opponent modeling method, device, equipment and storage medium based on environment model

Patent Citations (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN103500361A (en) * 2013-08-27 2014-01-08 浙江工业大学 Micro-grid load control method based on game theory
CN106371318A (en) * 2016-10-31 2017-02-01 南京农业大学 Facility environment multi-objective optimization control method based on cooperative game
CN107479380A (en) * 2017-08-25 2017-12-15 东北大学 Multi-Agent coordination control method based on evolutionary game theory

Cited By (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN113361594A (en) * 2021-06-03 2021-09-07 安徽理工大学 Countermeasure sample generation method based on generation model
CN113361594B (en) * 2021-06-03 2023-10-20 安徽理工大学 Countermeasure sample generation method based on generation model
CN113791634A (en) * 2021-08-22 2021-12-14 西北工业大学 Multi-aircraft air combat decision method based on multi-agent reinforcement learning
CN113791634B (en) * 2021-08-22 2024-02-02 西北工业大学 Multi-agent reinforcement learning-based multi-machine air combat decision method
CN114154397A (en) * 2021-11-09 2022-03-08 大连理工大学 Implicit adversary modeling method based on deep reinforcement learning
CN114154397B (en) * 2021-11-09 2024-05-10 大连理工大学 Implicit opponent modeling method based on deep reinforcement learning

Also Published As

Publication number Publication date
CN109496318A (en) 2019-03-19

Similar Documents

Publication Publication Date Title
WO2020024097A1 (en) Deep reinforcement learning-based adaptive game algorithm
Khadka et al. Evolution-guided policy gradient in reinforcement learning
US20230116117A1 (en) Federated learning method and apparatus, and chip
CN107392312B (en) Dynamic adjustment method based on DCGAN performance
CN110166428B (en) Intelligent defense decision-making method and device based on reinforcement learning and attack and defense game
Dutt-Mazumder et al. Neural network modelling and dynamical system theory: are they relevant to study the governing dynamics of association football players?
CN107179077B (en) Self-adaptive visual navigation method based on ELM-LRF
Braik et al. Memory based hybrid crow search algorithm for solving numerical and constrained global optimization problems
CN113505855B (en) Training method for challenge model
Li et al. Generative attention networks for multi-agent behavioral modeling
Song et al. Diversity-driven extensible hierarchical reinforcement learning
Dao et al. Deep reinforcement learning monitor for snapshot recording
Klissarov et al. Deep laplacian-based options for temporally-extended exploration
Scott et al. How does AI play football? An analysis of RL and real-world football strategies
Xing et al. Learning with Generated Teammates to Achieve Type-Free Ad-Hoc Teamwork.
WO2019240047A1 (en) Behavior learning device, behavior learning method, behavior learning system, program, and recording medium
Xu et al. Deep reinforcement learning with part-aware exploration bonus in video games
Wang et al. Camp: Causal multi-policy planning for interactive navigation in multi-room scenes
Bezek et al. Multi-agent strategic modeling in a robotic soccer domain
CN116245009A (en) Man-machine strategy generation method
CN109636609A (en) Stock recommended method and system based on two-way length memory models in short-term
Sohn et al. Shortest-path constrained reinforcement learning for sparse reward tasks
Rafati et al. Efficient exploration through intrinsic motivation learning for unsupervised subgoal discovery in model-free hierarchical reinforcement learning
Jin et al. Deep deformable Q-Network: an extension of deep Q-Network
Handa Neuroevolution with manifold learning for playing Mario

Legal Events

Date Code Title Description
121 Ep: the epo has been informed by wipo that ep was designated in this application
Ref document number: 18928785
Country of ref document: EP
Kind code of ref document: A1

NENP Non-entry into the national phase
Ref country code: DE

122 Ep: pct application non-entry in european phase
Ref document number: 18928785
Country of ref document: EP
Kind code of ref document: A1