CN114154397A - Implicit adversary modeling method based on deep reinforcement learning - Google Patents
- Publication number
- CN114154397A CN114154397A CN202111316717.3A CN202111316717A CN114154397A CN 114154397 A CN114154397 A CN 114154397A CN 202111316717 A CN202111316717 A CN 202111316717A CN 114154397 A CN114154397 A CN 114154397A
- Authority
- CN
- China
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Granted
Classifications
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F30/00—Computer-aided design [CAD]
- G06F30/20—Design optimisation, verification or simulation
- G06F30/27—Design optimisation, verification or simulation using machine learning, e.g. artificial intelligence, neural networks, support vector machines [SVM] or training a model
-
- A—HUMAN NECESSITIES
- A63—SPORTS; GAMES; AMUSEMENTS
- A63F—CARD, BOARD, OR ROULETTE GAMES; INDOOR GAMES USING SMALL MOVING PLAYING BODIES; VIDEO GAMES; GAMES NOT OTHERWISE PROVIDED FOR
- A63F13/00—Video games, i.e. games using an electronically generated display having two or more dimensions
- A63F13/80—Special adaptations for executing a specific game genre or game mode
- A63F13/822—Strategy games; Role-playing games
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F18/00—Pattern recognition
- G06F18/20—Analysing
- G06F18/21—Design or setup of recognition systems or techniques; Extraction of features in feature space; Blind source separation
- G06F18/214—Generating training patterns; Bootstrap methods, e.g. bagging or boosting
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06N—COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
- G06N3/00—Computing arrangements based on biological models
- G06N3/02—Neural networks
- G06N3/08—Learning methods
Abstract
The invention discloses an implicit opponent modeling method based on deep reinforcement learning, belonging to the field of opponent modeling in multi-agent reinforcement learning. Addressing the opponent modeling problem in dynamic game environments, the invention provides an improved implicit opponent modeling method built on deep reinforcement learning technology. The implicit modeling method does not depend on domain-specific knowledge, can adapt to dynamic changes in the opponent's strategy, alleviates the overestimation problem, and converges faster.
Description
Technical Field
The invention belongs to the field of opponent modeling in multi-agent reinforcement learning, and in particular relates to an implicit opponent modeling method based on deep reinforcement learning.
Background
Intelligent decision-making aims to let an agent make sound decisions in a game environment so as to maximize its own payoff. If the opponent's actions, preferences, and so on are modeled during this process, the opponent's behavior can be better predicted and the agent's decisions optimized accordingly. For example, in a board game, if one side can predict the opponent's next move, it can lay out a targeted strategy in advance; in autonomous driving, if a vehicle can anticipate the movements of other vehicles or pedestrians, it can take evasive action early. Modeling other agents in a game environment is therefore critical to decision optimization, and opponent modeling has become an important research direction in the field of artificial intelligence.
Most existing opponent modeling techniques assume that the opponent follows a fixed strategy. In most real game environments, however, the opponent dynamically changes its strategy to maximize its own payoff, and the payoff obtained by the main agent is then strongly affected by these strategy changes. In this case, dynamic opponent features must be modeled to accommodate changes in the opponent's policy. Some existing opponent modeling techniques do achieve dynamic modeling of opponent features under different constraints. For example, the AWESOME algorithm proposed by Vincent Conitzer et al. at Carnegie Mellon University guarantees that the main agent makes the optimal decision provided the opponent's policy eventually becomes stationary. The DriftER algorithm proposed by Pablo Hernandez-Leal of CWI and Yusen Zhan et al. of Washington State University assumes that the opponent switches among a set of fixed policies; the main agent monitors the moments at which the opponent changes policy through its prediction error, and readjusts its own policy accordingly.
The above opponent modeling techniques all belong to explicit modeling. In explicit modeling, the modeling of the opponent is separated from the environment-based planning process, so modeling usually requires a large amount of domain-specific knowledge. This makes explicit modeling difficult to apply in domains lacking such knowledge and difficult to transfer from one domain to another. Implicit modeling, in contrast, combines the modeling and planning processes, requires no domain-specific knowledge, and models the opponent purely from the historical interaction information of both sides, making the construction of a general opponent modeling framework possible.
Thanks to the rapid development of deep reinforcement learning in recent years, a series of new deep reinforcement learning methods have been proposed, providing new ideas for opponent modeling. A representative work is the DRON algorithm proposed by He et al., which performs implicit opponent modeling in dynamic game environments. Building on the DQN algorithm, DRON reads the opponent's historical interaction information and implicitly encodes the opponent features together with the environment features in a neural network, so that opponent modeling can be performed without any domain knowledge. The model performs well in soccer and question-answering games but, limited by the characteristics of the DQN algorithm, it still suffers from overestimation and slow convergence.
Disclosure of Invention
In order to solve the problems in the prior art, the invention provides an improved implicit opponent modeling method for the opponent modeling problem in dynamic game environments, built on deep reinforcement learning technology. The implicit modeling method does not depend on domain-specific knowledge, can adapt to dynamic changes in the opponent's strategy, alleviates the overestimation problem, and converges faster.
The technical scheme of the invention is as follows:
An implicit opponent modeling method based on deep reinforcement learning comprises two neural network models for implicit opponent modeling, DRON-DualFc2 (Deep Reinforcement Opponent Network - Dual and Fully Connected 2 Networks) and DRON-DualMOE (Deep Reinforcement Opponent Network - Dual and Mixture of Experts Networks), and a DecoupleDRON learning algorithm for alleviating the overestimation problem in the algorithm.
DRON-DualFc2 and DRON-DualMOE are two neural network models for opponent modeling. They enable the main agent to better understand the opponent's behavior by performing implicit opponent modeling from input opponent features. These features are based mainly on an assessment of the opponent's ability and on observations of the opponent's recent actions: in a question-answering game, for example, the opponent features may include the number of questions the opponent has answered and its average accuracy, while in a soccer game they may include the frequency with which the opponent intercepts the ball and its recent actions. Both DRON-DualFc2 and DRON-DualMOE consist of a policy learning network, which predicts the Q value, and an opponent model learning network, which performs the implicit opponent modeling. The two network models differ mainly in how the policy learning network and the opponent model learning network are fused. Specifically:
In DRON-DualFc2, the input of the policy learning network is the environment information s, and the input of the opponent model learning network is the opponent feature o. The two inputs pass through their respective hidden layers to produce two hidden-layer outputs h_s and h_o. DRON-DualFc2 fuses the environment information and the opponent features by concatenating h_s and h_o; after subsequent hidden layers the network outputs a state value estimate V^π(s, o) and action advantage estimates A^π(s, o, a), and finally the state value estimate and the mean-normalized action advantage estimate are added to obtain the action value estimate Q:

Q^π(s, o, a) = V^π(s, o) + ( A^π(s, o, a) − (1/|A|) · Σ_{a′} A^π(s, o, a′) )    (1)

In the formula, V^π(s, o), Q^π(s, o, a) and A^π(s, o, a) respectively denote the state value estimate, the Q value of action a, and the action advantage estimate when the environment information is s and the opponent feature is o; |A| denotes the number of all possible actions; Σ_{a′} A^π(s, o, a′) denotes the sum of the action advantage estimates over all possible actions.
By decomposing the Q value in this way, every update of a Q value also directly updates the shared state value estimate, which in turn updates all Q values in that state, so the DRON-DualFc2 network converges faster.
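As an illustration, the mean-normalized combination of the state value and the action advantages can be sketched in NumPy as follows; the toy numbers are invented for demonstration and are not taken from the patent:

```python
import numpy as np

def dueling_q(v, adv):
    # Combine the state value estimate V(s, o) with the action advantage
    # estimates A(s, o, a): the advantages are normalized by subtracting
    # their mean over all possible actions before being added to V.
    adv = np.asarray(adv, dtype=float)
    return float(v) + (adv - adv.mean())

# toy example: V(s, o) = 2 and three action advantages
q = dueling_q(2.0, [1.0, -1.0, 0.0])   # these advantages already average to zero
```

Because the mean advantage is subtracted, shifting all advantages by a constant leaves the Q values unchanged; only the state value moves them together.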
In DRON-DualMOE, the policy learning network can be regarded as an expert network whose input is the environment information s, and the opponent model learning network can be regarded as a weight network whose inputs are the opponent features o and the environment information s. The expert network comprises k expert subnetworks, each of which outputs an independent state value estimate V^π(s, o_i) and a normalized action advantage estimate A^π(s, o_i, a); the weight network outputs a corresponding k-dimensional weight vector w, which can be interpreted as the confidence that the opponent adopts each of k different strategies. The outputs of the k expert subnetworks are weighted and summed by w to obtain the final state value estimate and action advantage estimate, and the final Q value is obtained through the same normalization operation as in DRON-DualFc2:

Q^π(s, o, a) = Σ_{i=1}^{k} w_i · ( V^π(s, o_i) + ( A^π(s, o_i, a) − (1/|A|) · Σ_{a′} A^π(s, o_i, a′) ) )    (2)

In the formula, w_i denotes the i-th component of the k-dimensional weight vector, and V^π(s, o_i) and A^π(s, o_i, a) respectively denote the state value estimate and the action advantage estimate output by the i-th expert subnetwork. Like DRON-DualFc2, DRON-DualMOE accelerates convergence by decomposing the Q value into a state value estimate and an action advantage estimate.
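A minimal sketch of the mixture-of-experts fusion, assuming k = 2 experts and 2 actions; the softmax weight computation and the toy numbers stand in for the learned weight network and expert subnetworks:

```python
import numpy as np

def softmax(x):
    e = np.exp(x - np.max(x))
    return e / e.sum()

def moe_q(w_logits, expert_v, expert_adv):
    # w: confidence over the k expert subnetworks, from the weight network.
    w = softmax(np.asarray(w_logits, dtype=float))
    # Weighted sums of the experts' state values and action advantages,
    # followed by the same mean-normalization as in DRON-DualFc2.
    v = w @ np.asarray(expert_v, dtype=float)        # scalar state value
    adv = w @ np.asarray(expert_adv, dtype=float)    # (n_actions,) advantages
    return v + (adv - adv.mean())

# two experts held with equal confidence
q = moe_q([0.0, 0.0],
          [1.0, 3.0],                    # each expert's V(s, o_i)
          [[1.0, -1.0], [2.0, -2.0]])    # each expert's A(s, o_i, a)
```

With equal logits the weights are 0.5 each, so the result is simply the average of the two experts' fused estimates.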
Meanwhile, in order to alleviate the overestimation problem common in Q-learning, the invention also uses a DecoupleDRON learning algorithm. The algorithm improves the training process of DRON: the current value estimation network selects the action and the target value estimation network evaluates it, thereby decoupling action selection from action evaluation and alleviating overestimation. In the DecoupleDRON learning algorithm, the target value y is calculated as follows:

y = r_t + γ · Q( s_{t+1}, o_{t+1}, argmax_a Q(s_{t+1}, o_{t+1}, a; θ_t); θ′_t )    (3)

where r_t is the payoff obtained by the main agent at time t, γ is the discount factor, s_{t+1} is the environment information at time t+1, o_{t+1} is the opponent feature at time t+1, θ_t are the parameters of the current value estimation network at time t, and θ′_t are the parameters of the target value estimation network at time t (the current value estimation network and the target value estimation network have the same structure, namely DRON-DualFc2 or DRON-DualMOE).
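The decoupled target computation can be sketched as follows; the Q-value arrays stand in for the outputs of the current and target value estimation networks on (s_{t+1}, o_{t+1}):

```python
import numpy as np

def decoupled_target(r_t, q_current_next, q_target_next, gamma):
    # Action SELECTION uses the current value estimation network...
    a_star = int(np.argmax(q_current_next))
    # ...while action EVALUATION uses the target value estimation network,
    # decoupling the two and alleviating overestimation.
    return r_t + gamma * float(np.asarray(q_target_next)[a_star])

q_cur = [1.0, 5.0, 2.0]   # current network's Q(s_{t+1}, o_{t+1}, ·)
q_tgt = [0.5, 3.0, 4.0]   # target network's Q(s_{t+1}, o_{t+1}, ·)
y = decoupled_target(1.0, q_cur, q_tgt, gamma=0.9)   # selects a = 1, evaluates 3.0
```

A coupled DQN-style target would instead take the maximum over the target network's own values (here 4.0), which is exactly how the maximization bias that the decoupling removes would enter.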
The method comprises the following specific steps:
Step S1: initialize an experience pool with capacity N to store the interaction experience generated during training. Each experience is a tuple (s, o, a, r, s′, o′), where s is the environment information of the current time step; o is the opponent feature of the current time step; a is the action of the main agent; r is the payoff obtained after the main agent executes action a; s′ is the environment information of the next time step; and o′ is the opponent feature of the next time step.
Step S2: select DRON-DualFc2 or DRON-DualMOE as the network structure of both the current value estimation network and the target value estimation network, then randomly initialize the two networks to the same parameters. Repeat steps S3 to S9 M times, where M is the number of training episodes; M is a hyper-parameter whose value depends on the specific application scenario.
Step S3: initialize the game environment, including the environment information and the opponent features, and initialize the current time step t to 1.
Step S4: the main agent obtains the environment information s_t and opponent feature o_t of the current time step. With probability ε the main agent executes a random action a_t; otherwise it executes the action a_t = argmax_a Q(s_t, o_t, a; θ_t).
Step S5: after executing action a_t, the main agent obtains from the game environment the immediate payoff r_t, the environment information s_{t+1} of the next time step and the opponent feature o_{t+1}, and stores the experience (s_t, o_t, a_t, r_t, s_{t+1}, o_{t+1}) generated by the interaction with the environment into the experience pool.
Step S6: randomly sample a batch of experience from the experience pool. For each experience (s_j, o_j, a_j, r_j, s_{j+1}, o_{j+1}) in the batch, where j denotes the time step of the experience, the target value y_j is calculated according to the following formula:

y_j = r_j, if s_{j+1} is a terminal state;
y_j = r_j + γ · Q( s_{j+1}, o_{j+1}, argmax_a Q(s_{j+1}, o_{j+1}, a; θ_t); θ′_t ), otherwise.    (4)

Step S7: define the loss function L according to formula (5) and perform a gradient descent step on the parameters θ_t of the current value estimation network:

L(θ_t) = E[ ( y_j − Q(s_j, o_j, a_j; θ_t) )² ]    (5)

Step S8: every C time steps, update the parameters θ′_t of the target value estimation network to the parameters θ_t of the current value estimation network.
Step S9: if s_{t+1} is a non-terminal state, update the time step t to t+1 and repeat steps S4 to S8; otherwise end this training episode.
The invention has the beneficial effects that:
(1) The invention is an implicit opponent modeling method and constructs a general opponent modeling framework that does not depend on domain-specific knowledge.
(2) The method can be applied to dynamic game environments and better models dynamically changing opponent strategies.
(3) The invention better handles the overestimation problem during training and converges faster.
Drawings
FIG. 1 is a flow chart of the training process of the method of the present invention.
Fig. 2 is a flow chart of parameter updating of the neural network.
FIG. 3 is a data flow diagram of the training process of the method of the present invention.
FIG. 4 is a network structure diagram of DRON-DualFc 2.
Fig. 5 is a network structure diagram of DRON-dual moe.
Detailed Description
The following further describes embodiments of the present invention with reference to the drawings.
The training flow chart of the invention is shown in fig. 1, and the steps are described as follows:
the first step is as follows: and initializing an experience pool for storing interactive experiences generated by the main intelligent body in the algorithm training process.
The second step is that: DRON-DualFc2 is selected as the network structure of the current value estimation network and the target value estimation network, and then the current network and the target network are initialized randomly to the same parameters. Repeating the third step to the tenth step M times.
The third step: a gaming environment is initialized, including environmental information and opponent characteristics. The current time step is initialized to 1.
The fourth step: the master agent obtains environmental information and opponent features of the current time step.
The fifth step: the main agent selects the action to execute according to the ε-greedy algorithm. That is, with probability ε an action is selected at random; otherwise the acquired environment information and opponent features are fed to the current value estimation network, which outputs the action to execute.
And a sixth step: and the main intelligent agent executes the action obtained in the last step.
The seventh step: the main agent obtains from the game environment the immediate payoff r_t, the environment information s_{t+1} of the next time step, and the opponent feature o_{t+1}.
Eighth step: and storing the experience generated by the interaction of the main intelligent agent and the environment into an experience pool.
The ninth step: and updating parameters of the current value estimation network and the target value estimation network.
The tenth step: if the next moment is not in a termination state, updating the time step and repeatedly executing the processes from the fourth step to the ninth step; if the next moment is in a termination state, the training of the round is finished.
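The first through tenth steps above can be sketched as the following training skeleton; the environment and network objects (`env`, `q_net`, `target_net` and their methods) are hypothetical stand-ins for illustration, not part of the claimed method:

```python
import random
from collections import deque

def train(env, q_net, target_net, M, capacity, batch_size,
          epsilon, gamma, sync_every):
    buffer = deque(maxlen=capacity)        # first step: experience pool
    target_net.copy_from(q_net)            # second step: same initial parameters
    step = 0
    for _ in range(M):                     # repeat third-tenth steps M times
        s, o = env.reset()                 # third step: init environment
        done = False
        while not done:
            # fourth/fifth steps: observe, then choose an action epsilon-greedily
            if random.random() < epsilon:
                a = random.randrange(env.n_actions)
            else:
                a = q_net.best_action(s, o)
            # sixth-eighth steps: act, observe payoff, store the experience
            r, s2, o2, done = env.step(a)
            buffer.append((s, o, a, r, s2, o2, done))
            # ninth step: update parameters from a sampled batch
            if len(buffer) >= batch_size:
                batch = random.sample(list(buffer), batch_size)
                q_net.update(batch, target_net, gamma)
            step += 1
            if step % sync_every == 0:     # periodic target-network sync
                target_net.copy_from(q_net)
            s, o = s2, o2                  # tenth step: advance the time step
```

The `capacity`, `batch_size`, `epsilon`, `gamma`, and `sync_every` arguments correspond to the hyper-parameters N, batch size, ε, γ, and C of the method.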
The parameter updating process of the neural network is shown in fig. 2, and the specific steps are described as follows:
step 1: and randomly sampling a batch of interactive experiences from the experience pool, wherein the batch of interactive experiences mainly comprise the environmental state of the current time step, the opponent characteristics, the action taken by the main agent, the instant reward obtained by the main agent and the sequence of the environmental state and the opponent characteristics of the next time step, which are generated when the main agent interacts in the environment.
Step 2: the loss of the network is calculated according to the formula (4) and the formula (5).
Step 3: back-propagate the loss function and compute its gradient with respect to each parameter of the current value estimation network.
Step 4: update the parameters of the current value estimation network by gradient descent.
Step 5: every C time steps, update the parameters of the target value estimation network to the parameters of the current value estimation network.
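Steps 2 to 4 above amount to a mean-squared TD-error computation; a minimal sketch, with toy targets and predictions invented for illustration:

```python
import numpy as np

def td_loss(targets, q_pred):
    # Mean squared error between the targets y_j (formula (4)) and the
    # current network's predictions Q(s_j, o_j, a_j) (formula (5)).
    targets = np.asarray(targets, dtype=float)
    q_pred = np.asarray(q_pred, dtype=float)
    return float(np.mean((targets - q_pred) ** 2))

loss = td_loss([3.7, 1.0], [3.2, 1.5])   # two experiences, each off by 0.5
```

In a real implementation the gradient of this loss with respect to θ_t would be obtained by automatic differentiation and applied in the gradient descent step.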
The data flow of the training process is shown in FIG. 3: for the main agent, the current value estimation network takes the environment information and opponent features at time t−1 as input, and outputs the action the main agent should execute at time t together with its Q value.
The network structure of DRON-DualFc2 is shown in FIG. 4. All layers in the network are fully connected layers, and the activation function is the ReLU function. The network inputs are the current environment information s and the opponent features; the two inputs pass through their respective hidden layers to produce two hidden-layer outputs h_s and h_o, which are concatenated and fed into a subsequent hidden layer to obtain an output h. From h, two independent hidden layers produce the state value estimate and the action advantage estimates, and finally the state value estimate and the mean-normalized action advantage estimate are added to obtain the Q value.
FIG. 5 shows the network structure of DRON-DualMOE. All layers in the network are fully connected layers; except for the Softmax activation used when outputting the weight vector w, all other layers use the ReLU activation function. The DRON-DualMOE network consists of two parts, a weight network and an expert network. The weight network takes the current opponent features and the environment information s as input and, after two fully connected layers, outputs a k-dimensional weight vector w through a Softmax function. The expert network takes the current environment information s as input; the output obtained after two fully connected layers serves as the input of the k expert subnetworks, each of which outputs an independent state value estimate and a normalized action advantage estimate, and the Q value is obtained by weighted summation with the weight vector followed by normalization.
The above description is only illustrative of embodiments of the present invention; the appended claims are not limited thereto but encompass all modifications, equivalents, and improvements made within the spirit and scope of the present invention.
Claims (3)
1. An implicit opponent modeling method based on deep reinforcement learning, characterized by comprising the following steps:
step S1: initializing an experience pool with capacity N to store the interaction experience generated during training; each experience is a tuple (s, o, a, r, s′, o′), where s is the environment information of the current time step; o is the opponent feature of the current time step; a is the action of the main agent; r is the payoff obtained after the main agent executes action a; s′ is the environment information of the next time step; and o′ is the opponent feature of the next time step;
step S2: selecting DRON-DualFc2 or DRON-DualMOE as the network structure of both the current value estimation network and the target value estimation network, then randomly initializing the two networks to the same parameters; repeating steps S3 to S9 M times, where M is the number of training episodes;
step S3: initializing the game environment, including the environment information and the opponent features; initializing the current time step t to 1;
step S4: the main agent obtains the environment information s_t and opponent feature o_t of the current time step; with probability ε the main agent executes a random action a_t, otherwise it executes the action a_t = argmax_a Q(s_t, o_t, a; θ_t);
step S5: after executing action a_t, the main agent obtains from the game environment the immediate payoff r_t, the environment information s_{t+1} of the next time step and the opponent feature o_{t+1}, and stores the experience (s_t, o_t, a_t, r_t, s_{t+1}, o_{t+1}) generated by the interaction with the environment into the experience pool;
step S6: randomly sampling a batch of experience from the experience pool; for each experience (s_j, o_j, a_j, r_j, s_{j+1}, o_{j+1}) in the batch, where j denotes the time step of the experience, the target value y_j is calculated according to the following formula:

y_j = r_j, if s_{j+1} is a terminal state;
y_j = r_j + γ · Q( s_{j+1}, o_{j+1}, argmax_a Q(s_{j+1}, o_{j+1}, a; θ_t); θ′_t ), otherwise;    (1)

step S7: defining the loss function L according to formula (2) and performing gradient descent on the parameters θ_t of the current value estimation network:

L(θ_t) = E[ ( y_j − Q(s_j, o_j, a_j; θ_t) )² ];    (2)

step S8: every C time steps, updating the parameters θ′_t of the target value estimation network to the parameters θ_t of the current value estimation network;
step S9: if s_{t+1} is a non-terminal state, updating the time step t to t+1 and repeating steps S4 to S8; otherwise ending this training episode.
2. The implicit opponent modeling method based on deep reinforcement learning of claim 1, wherein the DRON-DualFc2 consists of a policy learning network and an opponent model learning network; the input of the policy learning network is the environment information s, and the input of the opponent model learning network is the opponent feature o; the two inputs pass through their respective hidden layers to produce two hidden-layer outputs h_s and h_o; DRON-DualFc2 fuses the environment information and the opponent features by concatenating h_s and h_o, then outputs, after subsequent hidden layers, a state value estimate V^π(s, o) and action advantage estimates A^π(s, o, a), and finally adds the state value estimate and the normalized action advantage estimate to obtain the action value estimate Q:

Q^π(s, o, a) = V^π(s, o) + ( A^π(s, o, a) − (1/|A|) · Σ_{a′} A^π(s, o, a′) )    (3)

in the formula, V^π(s, o), Q^π(s, o, a) and A^π(s, o, a) respectively denote the state value estimate, the Q value of action a, and the action advantage estimate when the environment information is s and the opponent feature is o; |A| denotes the number of all possible actions; Σ_{a′} A^π(s, o, a′) denotes the sum of the action advantage estimates over all possible actions.
3. The implicit opponent modeling method based on deep reinforcement learning of claim 1 or 2, wherein the DRON-DualMOE consists of a policy learning network and an opponent model learning network; the policy learning network is regarded as an expert network whose input is the environment information s; the opponent model learning network is regarded as a weight network whose inputs are the opponent feature o and the environment information s; the expert network comprises k expert subnetworks, each of which outputs an independent state value estimate V^π(s, o_i) and a normalized action advantage estimate A^π(s, o_i, a); the weight network outputs a corresponding k-dimensional weight vector w; the outputs of the k expert subnetworks are weighted and summed by w to obtain the final state value estimate and action advantage estimate, and the final Q value is obtained through normalization:

Q^π(s, o, a) = Σ_{i=1}^{k} w_i · ( V^π(s, o_i) + ( A^π(s, o_i, a) − (1/|A|) · Σ_{a′} A^π(s, o_i, a′) ) )    (4)
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202111316717.3A CN114154397B (en) | 2021-11-09 | Implicit opponent modeling method based on deep reinforcement learning |
Publications (2)
Publication Number | Publication Date |
---|---|
CN114154397A true CN114154397A (en) | 2022-03-08 |
CN114154397B CN114154397B (en) | 2024-05-10 |
Citations (4)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
WO2020024097A1 (en) * | 2018-07-30 | 2020-02-06 | 东莞理工学院 | Deep reinforcement learning-based adaptive game algorithm |
CA3060914A1 (en) * | 2018-11-05 | 2020-05-05 | Royal Bank Of Canada | Opponent modeling with asynchronous methods in deep rl |
CN113095488A (en) * | 2021-04-29 | 2021-07-09 | 电子科技大学 | Cooperative game method based on multi-agent maximum entropy reinforcement learning |
CN113326902A (en) * | 2021-07-08 | 2021-08-31 | 中国人民解放军国防科技大学 | Online learning-based strategy acquisition method, device and equipment |
Non-Patent Citations (3)
Title |
---|
LIU Qiang; JIANG Feng: "Research on group confrontation strategies based on deep reinforcement learning", Intelligent Computer and Applications, no. 05, 1 May 2020 (2020-05-01) *
CAO Lei; CHEN Xiliang; XU Zhixiong; LAI Jun: "A survey of multi-agent deep reinforcement learning", Computer Engineering and Applications, no. 05, 14 February 2020 (2020-02-14) *
SHI Wenhao; MENG Jun; ZHANG Peng; LIU Chanjuan: "A miRNA-lncRNA interaction prediction model fusing CNN and Bi-LSTM", Journal of Computer Research and Development, no. 008, 31 December 2019 (2019-12-31) *
Cited By (3)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN115018017A (en) * | 2022-08-03 | 2022-09-06 | 中国科学院自动化研究所 | Multi-agent credit allocation method, system and equipment based on ensemble learning |
CN117077553A (en) * | 2023-10-18 | 2023-11-17 | 崂山国家实验室 | Interaction strategy optimization method for underwater attack and defense rapid opponent modeling |
CN117077553B (en) * | 2023-10-18 | 2023-12-15 | 崂山国家实验室 | Interaction strategy optimization method for underwater attack and defense rapid opponent modeling |
Similar Documents
Publication | Publication Date | Title |
---|---|---|
CN109635917B (en) | Multi-agent cooperation decision and training method | |
CN108921298B (en) | Multi-agent communication and decision-making method for reinforcement learning | |
CN112132263A (en) | Multi-agent autonomous navigation method based on reinforcement learning | |
CN112488310A (en) | Multi-agent group cooperation strategy automatic generation method | |
CN113919482A (en) | Intelligent agent training method and device, computer equipment and storage medium | |
CN112069504A (en) | Model enhanced defense method for resisting attack by deep reinforcement learning | |
CN112613608A (en) | Reinforced learning method and related device | |
CN114881228A (en) | Average SAC deep reinforcement learning method and system based on Q learning | |
CN115409158A (en) | Robot behavior decision method and device based on layered deep reinforcement learning model | |
CN116147627A (en) | Mobile robot autonomous navigation method combining deep reinforcement learning and internal motivation | |
CN113276852B (en) | Unmanned lane keeping method based on maximum entropy reinforcement learning framework | |
CN114154397B (en) | Implicit opponent modeling method based on deep reinforcement learning | |
CN114154397A (en) | Implicit adversary modeling method based on deep reinforcement learning | |
CN115009291B (en) | Automatic driving assistance decision making method and system based on network evolution replay buffer area | |
CN114861368B (en) | Construction method of railway longitudinal section design learning model based on near-end strategy | |
CN113240118B (en) | Dominance estimation method, dominance estimation device, electronic device, and storage medium | |
CN115936058A (en) | Multi-agent migration reinforcement learning method based on graph attention network | |
Almalki et al. | Exploration of reinforcement learning to play snake game | |
JPH10340192A (en) | Fuzzy logic controller and its non-fuzzying method | |
Chen et al. | Modified PPO-RND method for solving sparse reward problem in ViZDoom | |
CN112906868A (en) | Behavior clone-oriented demonstration active sampling method | |
CN116757969B (en) | Image blind denoising method and system based on self-adaptive curvature feature fusion | |
CN112884129B (en) | Multi-step rule extraction method, device and storage medium based on teaching data | |
Awheda | On Multi-Agent Reinforcement Learning in Matrix, Stochastic and Differential Games | |
CN113869488A (en) | Game AI intelligent agent reinforcement learning method facing continuous-discrete mixed decision |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
PB01 | Publication | ||
PB01 | Publication | ||
SE01 | Entry into force of request for substantive examination | ||
SE01 | Entry into force of request for substantive examination | ||
GR01 | Patent grant |