CN116128060A - Chess game method based on opponent modeling and Monte Carlo reinforcement learning - Google Patents

Chess game method based on opponent modeling and Monte Carlo reinforcement learning

Info

Publication number
CN116128060A
Authority
CN
China
Prior art keywords
network
chess
opponent
reinforcement learning
strategy
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202310135726.5A
Other languages
Chinese (zh)
Inventor
王丹
黄刚
华炜
白雅璐
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Xidian University
Zhejiang Lab
Original Assignee
Xidian University
Zhejiang Lab
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Xidian University, Zhejiang Lab filed Critical Xidian University
Priority to CN202310135726.5A priority Critical patent/CN116128060A/en
Publication of CN116128060A publication Critical patent/CN116128060A/en
Pending legal-status Critical Current

Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N5/00Computing arrangements using knowledge-based models
    • G06N5/04Inference or reasoning models
    • G06N5/042Backward inferencing
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/02Neural networks
    • G06N3/08Learning methods
    • G06N3/084Backpropagation, e.g. using gradient descent
    • YGENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
    • Y02TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
    • Y02TCLIMATE CHANGE MITIGATION TECHNOLOGIES RELATED TO TRANSPORTATION
    • Y02T10/00Road transport of goods or passengers
    • Y02T10/10Internal combustion engine [ICE] based vehicles
    • Y02T10/40Engine management systems

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • General Physics & Mathematics (AREA)
  • Software Systems (AREA)
  • Data Mining & Analysis (AREA)
  • Computing Systems (AREA)
  • General Engineering & Computer Science (AREA)
  • Computational Linguistics (AREA)
  • Mathematical Physics (AREA)
  • Evolutionary Computation (AREA)
  • Artificial Intelligence (AREA)
  • Health & Medical Sciences (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Biomedical Technology (AREA)
  • Biophysics (AREA)
  • General Health & Medical Sciences (AREA)
  • Molecular Biology (AREA)
  • Complex Calculations (AREA)

Abstract

The invention provides a chess game method based on opponent modeling and Monte Carlo reinforcement learning, aimed at the low decision efficiency and poor generalization of existing chess game algorithms. The inputs and outputs of the chess algorithm are designed, a reinforcement learning network is constructed, and its parameters are randomly initialized; an opponent modeling network is constructed to predict and infer the opponent's strategy, and its output is fed into the reinforcement learning network to form a reinforcement learning network based on opponent modeling; a Monte Carlo search tree is constructed, and the opponent-modeling-based reinforcement learning network guides the search over the tree nodes; self-play with the Monte Carlo search tree generates game data that is stored as a training data set; the reinforcement learning network is trained on this data set until it converges, yielding the final move decision. The technique addresses the low decision efficiency and poor generalization of existing chess game algorithms and provides an efficient, stable and well-generalizing general chess game method.

Description

Chess game method based on opponent modeling and Monte Carlo reinforcement learning
Technical Field
The invention relates to the technical field of machine learning, in particular to a chess game method based on opponent modeling and Monte Carlo reinforcement learning.
Background
In recent years, with the continuous optimization and innovation of machine learning algorithms and the increase in computing power brought by computing resources and distributed computing systems, artificial intelligence technology has developed rapidly. Deep reinforcement learning, a key technology in the field of intelligent decision making, integrates the decision-making capability of reinforcement learning with the perception and representation capability of deep learning. By virtue of its ability to let an agent interact, perceive and learn, it has achieved remarkable results in fields such as robotics, autonomous driving, smart cities and smart healthcare, demonstrating its potential for solving complex problems and for real-world application. As artificial intelligence algorithms represented by AlphaGo have repeatedly defeated human players, chess game algorithms based on deep reinforcement learning have become a popular research direction. Traditional chess game methods rely on pure tree search and domain knowledge from human experts, and suffer from low success rates and slow training. Owing to the perception and decision-making capability of deep reinforcement learning, existing artificial-intelligence-based chess game algorithms can to some extent eliminate the dependence on expert knowledge and improve training efficiency by fully exploiting computing resources. The invention therefore takes deep reinforcement learning as its theoretical basis and designs a general chess game algorithm that combines a deep neural network with Monte Carlo tree search. However, how to further improve the stability and generalization of such game algorithms, and how to fully mine the feature information of each type of player to improve decision efficiency, remain problems worth studying.
The invention therefore introduces a reinforcement learning algorithm based on opponent modeling. Opponent modeling is a key technology in multi-agent game confrontation and a typical method for modeling the cognitive behaviour of agents; it has become an important research direction in reinforcement-learning decision optimization. The goal of intelligent decision making is to let the agent act in a complex game environment so as to maximize its own benefit. If the opponent's actions, preferences and so on can be modeled by mining the opponent's feature information, the opponent's policy can be predicted more accurately. For example, in a chess game, if one side can predict where the other side will move next, it can lay out a targeted strategy in advance. Introducing opponent-modeling-based reinforcement learning into the chess game scenario is therefore of great help for improving the playing strength of the agent, and it is both reasonable and necessary to take it as the optimization direction of a chess game decision algorithm. Concretely, historical game information is stored during training, future information about the opponent (behaviour, strategy, goals, type, etc.) is predicted from the opponent's history (states, observations, behaviour, etc.), and the prediction is used as prior knowledge to assist the reinforcement-learning decision. Thanks to the rapid development of deep learning in recent years, many methods that realize prediction with neural networks provide new ideas for opponent modeling. For example, Bayesian inference, as a time-series prediction method that takes a dynamic model as its research object, can exploit both model information and data information and can also incorporate prior information such as the experience and judgment of a decision maker. The long short-term memory (Long Short Term Memory, LSTM) model is a specific form of recurrent neural network that is well suited to processing and predicting long-range time series. In particular, important events in a long sequence may be separated by lags of unknown duration; the LSTM network copes well with exploding and vanishing gradients and has clear advantages over other sequence prediction methods such as the traditional recurrent neural network and the hidden Markov model.
In summary, existing chess game algorithms based on reinforcement learning have the following shortcomings: 1. traditional chess game methods based on pure tree search rely on domain knowledge from human experts, are inefficient, are limited to the level of human strategy, and perform poorly; 2. a chess game is usually a complex decision process with high-dimensional action and state spaces, and blindly applying game search algorithms usually leads to sparse rewards and high computational complexity; 3. for different chess rules and different opponent strategy preferences, the strategy experience of the agent itself is insufficient to generalize to the playing patterns of unknown agents, which results in low decision efficiency and poor generalization. A technical solution to these problems is therefore needed, and an efficient, stable and well-generalizing general chess game algorithm must be further explored.
Disclosure of Invention
In view of the low decision efficiency, high computational complexity and poor generalization of prior-art chess game algorithms that rely on human expert knowledge, the purpose of the invention is to provide a chess game method based on opponent modeling and Monte Carlo reinforcement learning whose decisions generalize well, are highly accurate and adapt rapidly to an opponent's playing style.
The technical solution of the invention is a chess game method based on opponent modeling and Monte Carlo reinforcement learning, comprising the following steps:
step 1, designing input and output of a chess algorithm, constructing a reinforcement learning network and randomly initializing parameters of the network;
step 2, constructing an opponent modeling network to predict and infer the opponent's strategy, and feeding the result into the reinforcement learning policy network to form a reinforcement learning network based on opponent modeling;
step 3, constructing a Monte Carlo search tree, and guiding a search process of Monte Carlo tree nodes by using the reinforcement learning network based on opponent modeling in the step 2;
step 4, performing self-play with the Monte Carlo search tree of step 3 to generate game data, which is stored as a training data set for training the reinforcement learning network;
and step 5, repeating steps 3 and 4 until the reinforcement learning network converges, obtaining the final move decision.
Preferably, the step 1 comprises the following sub-steps:
step 1.1, initializing the chessboard according to the rules of the chess variant and setting the sizes of the input and output layers according to the board size n; four 0-1 feature planes are designed to represent the board state, representing respectively our side's stone positions, the opponent's stone positions, the opponent's most recent move and whether our side moves first;
step 1.2, the input layer of the reinforcement learning network: the board state is discretized into a 4×n×n state matrix and fed into a three-layer fully convolutional network using 32, 64 and 128 convolution kernels of size 3×3, respectively, each followed by a ReLU activation;
step 1.3, constructing the action policy network (policy network): the current board state is fed into the policy network, which predicts the next move strategy from the current board through one convolutional layer and one fully connected layer;
step 1.4, constructing the state value network (value network): the board state is fed into the value network, which evaluates the value of the input board state through one convolutional layer and one fully connected layer;
step 1.5, the output layer of the reinforcement learning network: based on steps 1.3 and 1.4, the output layer is divided into two parts:
step 1.5.1, at the policy network output, 4 convolution kernels of size 1×1 perform dimension reduction; the result is sent to a fully connected layer whose softmax() nonlinearity outputs the probability of placing a stone at each position of the current board, giving a vector p of size 1×n² whose components lie in [0,1] and sum to 1;
step 1.5.2, at the value network output, 2 convolution kernels of size 1×1 perform dimension reduction; the result is sent to a fully connected layer whose tanh() nonlinearity outputs a value estimate of the current board state as a scalar v in [-1,1], where a larger v indicates a higher winning probability for our side on the current board and a smaller v a lower one;
step 1.6, randomly initializing the neural network parameters after constructing the reinforcement learning network according to the steps 1.1 to 1.5.
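To make steps 1.1-1.6 concrete, a minimal PyTorch sketch of such a policy-value network is given below. It is illustrative only: the layer sizes follow the text (three 3×3 convolution layers with 32/64/128 filters, a 4-channel 1×1 policy head with softmax, a 2-channel 1×1 value head with tanh), while the class and variable names are assumptions not taken from the original.

    import torch
    import torch.nn as nn
    import torch.nn.functional as F

    class PolicyValueNet(nn.Module):
        """Illustrative policy-value network following steps 1.1-1.6."""
        def __init__(self, n: int):
            super().__init__()
            self.n = n
            # shared body: three convolution layers with 32/64/128 filters of size 3x3, ReLU activations
            self.conv1 = nn.Conv2d(4, 32, kernel_size=3, padding=1)
            self.conv2 = nn.Conv2d(32, 64, kernel_size=3, padding=1)
            self.conv3 = nn.Conv2d(64, 128, kernel_size=3, padding=1)
            # policy head: 4 filters of size 1x1, then a fully connected layer over all board positions
            self.policy_conv = nn.Conv2d(128, 4, kernel_size=1)
            self.policy_fc = nn.Linear(4 * n * n, n * n)
            # value head: 2 filters of size 1x1, then a fully connected layer to one scalar
            self.value_conv = nn.Conv2d(128, 2, kernel_size=1)
            self.value_fc = nn.Linear(2 * n * n, 1)

        def forward(self, state):                        # state: (batch, 4, n, n)
            x = F.relu(self.conv1(state))
            x = F.relu(self.conv2(x))
            x = F.relu(self.conv3(x))
            p = F.relu(self.policy_conv(x)).flatten(1)
            p = F.softmax(self.policy_fc(p), dim=1)      # move probabilities over n^2 positions, sum to 1
            v = F.relu(self.value_conv(x)).flatten(1)
            v = torch.tanh(self.value_fc(v))             # scalar value estimate in [-1, 1]
            return p, v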
Preferably, the step 2 comprises the following sub-steps:
step 2.1, looking back over the past 10-50 historical board states, flattening each discretized state feature matrix into a 4·n² dimensional vector, and stacking these historical board-state vectors row by row to form a historical move matrix used as input;
step 2.2, constructing a Bayesian LSTM opponent modeling network that combines Bayesian inference with an LSTM network to predict the opponent's strategy:
step 2.2.1, the encoder of the opponent modeling network is a unidirectional LSTM with 2 stacked layers and 128 hidden units, and outputs an embedding vector of fixed dimension;
step 2.2.2, the decoder of the opponent modeling network is a unidirectional LSTM with 2 stacked layers and 32 hidden units, and generates the prediction of the opponent's future states;
step 2.3, training the Bayesian LSTM opponent modeling network: from the opponent's historical information, an opponent strategy model is obtained by training and inference through the two LSTM stages:
step 2.3.1, training uses the Adam optimizer and mini-batch gradient descent, for example batch_size=128; random Dropout makes each network prediction behave as a random sample from the posterior distribution of the target variable, so repeated predictions are used to estimate that posterior and to create a confidence interval for each prediction;
step 2.3.2, the predicted opponent strategies (the prediction results) are mixed by a Bayesian mixture method to estimate and infer the opponent's strategy;
step 2.4, connecting the output of the Bayesian LSTM opponent modeling network to the input of the policy network of step 1, so that the opponent features (the opponent modeling result) and the current actual state are implicitly encoded together in the neural network; that is, the predicted opponent strategy is concatenated with the current board-state vector and fed into the reinforcement learning network, yielding the reinforcement learning network based on opponent modeling.
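A minimal sketch of such an encoder-decoder opponent model, written in PyTorch under the assumption that the flattened history is fed in as a (batch, 10, 4·n²) tensor, is shown below; the class name and the prediction head (here a move distribution) are assumptions for illustration, not part of the original.

    import torch
    import torch.nn as nn

    class BayesianLSTMOpponentModel(nn.Module):
        """Illustrative encoder-decoder opponent model for step 2 (names are not from the original)."""
        def __init__(self, n: int, horizon: int = 1, p_drop: float = 0.5):
            super().__init__()
            feat = 4 * n * n                       # one flattened board state per time step
            # encoder: unidirectional LSTM, 2 stacked layers, 128 hidden units
            self.encoder = nn.LSTM(feat, 128, num_layers=2, dropout=p_drop, batch_first=True)
            # decoder: unidirectional LSTM, 2 stacked layers, 32 hidden units
            self.decoder = nn.LSTM(128, 32, num_layers=2, dropout=p_drop, batch_first=True)
            self.drop = nn.Dropout(p_drop)         # random Dropout, kept active later for MC sampling
            self.head = nn.Linear(32, n * n)       # placeholder head: predicted opponent move distribution
            self.horizon = horizon

        def forward(self, history):                # history: (batch, 10, 4*n*n)
            enc, _ = self.encoder(history)
            emb = self.drop(enc[:, -1:, :])        # fixed-dimension embedding of the history
            dec, _ = self.decoder(emb.repeat(1, self.horizon, 1))
            return torch.softmax(self.head(self.drop(dec)), dim=-1)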
Preferably, the step 3 comprises the following sub-steps:
step 3.1, computing the upper confidence bound Q(s,a)+U(s,a) of every possible move in the current board state and selecting the action with the largest value to traverse the tree, as in the following formula:
a = argmax_a (Q(s,a) + U(s,a))
where a denotes a move, s denotes the board state, Q(s,a) denotes the average value estimate of the states reachable under that move, and U(s,a) depends on the prior probability P(s,a) of the move and on the visit count N(s,a) of the edge, which is incremented by one on every visit:
U(s,a) = c_puct · P(s,a) · sqrt(Σ_b N(s,b)) / (1 + N(s,a))
where c_puct is a constant controlling the degree of exploration and b ranges over all actions currently available;
step 3.2, after the selected action a leads to the next state s, continuing to evaluate the policy and value of the current board with the reinforcement learning network, expressed as (P(s,·), Q(s,·)) = f_θ(s), where f_θ(s) denotes the reinforcement learning network constructed in step 1;
step 3.3, repeating steps 3.1 and 3.2; when a win/loss result is produced, the Monte Carlo tree search is complete and the search policy in state s is returned:
π(a|s) = N(s,a)^(1/τ) / Σ_b N(s,b)^(1/τ)
where τ is a preset search (temperature) constant;
step 3.4, back-propagating the search result to train the reinforcement learning network, the training objective being to minimize the error between the estimated value v and the actual outcome z while maximizing the similarity between the estimated policy p and the actual search policy π, as in the following loss:
loss = (z - v)² - π^T log p + c·||θ||²
where z and π are the label data in the data set, z being the board-state outcome and π the actual search policy; p and v are the outputs of the reinforcement learning network, v being the estimated value of the board state and p the action policy output by the policy network; and c is the L2 regularization coefficient.
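The three formulas above translate directly into code. The sketch below is an illustration under the assumption that, at a given node, the priors P, visit counts N and mean action values Q are held in NumPy arrays indexed by action; the function names are not taken from the original.

    import numpy as np

    def puct_select(P, N, Q, c_puct=5.0):
        """Step 3.1: pick the action maximising Q(s,a) + U(s,a)."""
        U = c_puct * P * np.sqrt(N.sum()) / (1.0 + N)
        return int(np.argmax(Q + U))

    def search_policy(N, tau=1.0):
        """Step 3.3: pi(a|s) proportional to N(s,a)^(1/tau)."""
        x = N ** (1.0 / tau)
        return x / x.sum()

    def combined_loss(z, v, pi, p, theta, c=1e-4):
        """Step 3.4: (z - v)^2 - pi^T log p + c * ||theta||^2, with theta a flattened parameter vector."""
        return (z - v) ** 2 - np.dot(pi, np.log(p + 1e-10)) + c * np.dot(theta, theta)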
Preferably, the step 4 comprises the following sub-steps:
step 4.1, collecting the training data set, given the policy-value network parameters f_θ and an initial board state s:
step 4.1.1, running the Monte Carlo search tree, simulating the possible moves from the current state within a limited number of search steps; each simulation starts from the current board state and repeatedly descends along the child node with the largest value;
step 4.1.2, when an unseen board state (a leaf node) is encountered, converting it into the input feature vector of the reinforcement learning network, evaluating its value with the value network and feeding it into the policy network to obtain the action probability distribution π = a_θ(s);
step 4.1.3, repeating steps 4.1.1 and 4.1.2 until the game ends; the final result assigns a value label z to each board position as the situation evaluation, +1 for a win, -1 for a loss and 0 for a draw, so the self-play data at each time step t can be represented as a tuple (s_t, π_t, z_t);
step 4.1.4, initializing a data set cache D with a capacity of 5000 and storing the self-play data tuples in it, finally obtaining the training data set;
step 4.2, maintaining a training data set:
step 4.2.1, expanding the historical state data of both players by mirror and flip transformations, applied as appropriate to the rules of the chess variant;
step 4.2.2, randomly sampling m samples to train the reinforcement learning network and deleting the corresponding samples at the tail of the data set cache D, so that the cache always holds the most recent data; when the buffer is full, the tuples that were stored first are discarded;
step 4.3, training the reinforcement learning network from the training data set:
step 4.3.1, randomly sampling m samples from the data set cache D to obtain one batch of training data (s_t, π_t, z_t); the input of the reinforcement learning network is the board state s, its outputs are the policy p and the value v, and the network is trained with (π, z) as the labels of supervised learning so that (p, v) approaches (π, z);
step 4.3.2, based on step 4.3.1, the loss function of the policy network is loss = -π^T log p + c·||θ||² and the loss function of the value network is loss = (z - v)² + c·||θ||², where the L2 regularization coefficient is c = 0.0001, the initial learning rate α is 0.01 and the learning rate is decayed exponentially.
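As an illustration of steps 4.1.3-4.3.2, the sketch below stores self-play tuples in a bounded buffer and runs one supervised update. It assumes the PolicyValueNet sketch above, uses a single combined loss equivalent to the two losses just given, and delegates the L2 term to the optimizer's weight decay; all names are illustrative.

    import random
    from collections import deque
    import torch
    import torch.nn.functional as F

    buffer = deque(maxlen=5000)                  # data set cache D of step 4.1.4; oldest tuples dropped first

    def store_game(states, mcts_probs, winners):
        """Step 4.1.3: append one self-play game as (s_t, pi_t, z_t) tuples."""
        for s, pi, z in zip(states, mcts_probs, winners):
            buffer.append((s, pi, z))

    def train_step(net, optimizer, m=128):
        """Steps 4.3.1-4.3.2: sample m tuples and fit (p, v) towards (pi, z)."""
        batch = random.sample(list(buffer), m)
        s = torch.stack([torch.as_tensor(b[0], dtype=torch.float32) for b in batch])
        pi = torch.stack([torch.as_tensor(b[1], dtype=torch.float32) for b in batch])
        z = torch.tensor([b[2] for b in batch], dtype=torch.float32).unsqueeze(1)
        p, v = net(s)
        loss = F.mse_loss(v, z) - (pi * torch.log(p + 1e-10)).sum(dim=1).mean()
        optimizer.zero_grad()
        loss.backward()                          # the L2 term is delegated to the optimizer's weight decay
        optimizer.step()
        return loss.item()

Pairing this with, for example, torch.optim.Adam(net.parameters(), lr=0.01, weight_decay=1e-4) and an exponential learning-rate scheduler would match the stated c = 0.0001 and exponentially decaying learning rate.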
Preferably, step 5, in which steps 3 and 4 are repeated until the reinforcement learning network converges to obtain the final move decision, comprises the following sub-steps:
step 5.1, initializing the algorithm parameters, including the input and output sizes designed for the chess variant and its rules, the board-state matrix size, the total number of training rounds, the initial learning rate, the batch size of training samples, the buffer size, the number of Monte Carlo simulation steps for our side and for the game opponent, and the hidden layer size of the Bayesian LSTM network;
and step 5.2, evaluating the experimental result: the intelligent chess agent is trained, the evaluation index being its winning rate against a pure Monte Carlo reinforcement learning algorithm; 100-1000 games are played per training round and the winning rate is finally counted.
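A small sketch of the evaluation loop of step 5.2 follows; play_one_game is a hypothetical callback, assumed to return +1, 0 or -1 from the first player's perspective, and is passed in rather than defined here. Draws are counted as half a win, which is one common convention rather than something stated in the original.

    def evaluate(play_one_game, trained_player, pure_mcts_player, n_games=100):
        """Step 5.2: winning rate of the trained agent against a pure Monte Carlo player."""
        wins = draws = 0
        for g in range(n_games):
            # alternate who moves first so neither side gets a systematic first-move advantage
            if g % 2 == 0:
                result = play_one_game(trained_player, pure_mcts_player)
            else:
                result = -play_one_game(pure_mcts_player, trained_player)
            wins += result == 1
            draws += result == 0
        return (wins + 0.5 * draws) / n_games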
Compared with the prior art, the chess game method based on opponent modeling and Monte Carlo reinforcement learning has the following advantages:
Firstly, the invention uses Monte Carlo reinforcement learning to generate its own data set, so it can learn from scratch without the domain knowledge of human experts and can handle high-dimensional decision spaces. A reinforcement learning network is introduced into the tree-traversal process (Monte Carlo reinforcement learning): on the one hand, the network provides policy and value estimates that strengthen the search; on the other hand, the Monte Carlo search tree simulates actual games and produces the training data set, acting as a policy booster that keeps the agent exploring. The reinforcement learning network consists of a policy network and a value network, which output respectively the probability of mapping the current board state to the best action and a value estimate of the current state; the Monte Carlo search tree aggregates these estimates to produce the next move, and the obtained rewards are used to train the network in reverse. In this way the algorithm can start from scratch, freeing it from dependence on human expert domain knowledge, while Monte Carlo reinforcement learning also reduces the computational complexity and improves decision efficiency;
Secondly, the method predicts the opponent's strategy by opponent modeling, so the decision model adapts quickly to the playing styles of different opponents. The invention combines a Bayesian inference framework with an LSTM network, predicts the opponent's future strategy from the opponent's states in historical situations, and then merges the opponent modeling result with the current actual state as input to the reinforcement learning network.
Finally, reinforcement learning based on opponent modeling requires no domain knowledge; using the predicted opponent strategy as prior knowledge greatly accelerates reinforcement-learning exploration, so the decision model quickly learns the playing styles of different opponents, improving the stability and generalization of the algorithm.
Drawings
FIG. 1 is a schematic flow diagram of an implementation of the present invention;
FIG. 2 is a schematic diagram of a Monte Carlo reinforcement learning algorithm based on opponent modeling in the present invention;
FIG. 3 is a graph showing the results of the simulation experiment of the present invention.
Detailed Description
In order that those skilled in the art may better understand the present invention, the technical solution in the embodiments of the present invention will be described clearly and completely below with reference to the accompanying drawings. It is apparent that the described embodiments are only some, not all, of the embodiments of the present invention. All other embodiments obtained by those skilled in the art based on the embodiments of the present invention without inventive effort shall fall within the scope of the present invention.
Existing chess game algorithms based on reinforcement learning have the following three problems:
1. Traditional chess game methods based on pure tree search rely on domain knowledge from human experts, are inefficient, are limited to the level of human strategy, and perform poorly.
2. A chess game is usually a complex decision process with high-dimensional action and state spaces, and blindly applying game search algorithms usually leads to sparse rewards and high computational complexity.
3. For different chess rules and different opponent strategy preferences, the strategy experience of the agent itself is insufficient to generalize to the playing patterns of unknown agents, which results in low decision efficiency and poor generalization.
The invention is further described below with reference to the drawings and the detailed embodiments. As shown in Fig. 1 and Fig. 2, the main technical idea of the invention is as follows: first, the board state is discretized into a 0-1 state matrix that combines the situation state and the move actions to form the input layer of the neural network; second, a reinforcement learning network consisting of a policy network and a value network is constructed, outputting respectively the probability of mapping the current board state to the best action and a value estimate of the current state. Then the Monte Carlo search tree simulates actual games, aggregating the policy and value estimates given by the reinforcement learning network to produce the next move, and the rewards obtained let the Monte Carlo search tree yield a training data set. At the same time a Bayesian LSTM network processes the matrix of historical states and predicts the opponent's strategy, combining Bayesian inference with the LSTM network to predict the opponent's future strategy from the opponent's states in historical situations; finally the opponent features and the environment features are implicitly encoded together in the neural network to train the reinforcement learning network.
As shown in Fig. 1, the specific implementation steps of the invention are as follows.
Step one, designing input and output of a chess algorithm, constructing a reinforcement learning network and randomly initializing parameters of the network;
step 1.1, initializing the chessboard according to the rules of the chess variant and setting the sizes of the input and output layers according to the board size n; four 0-1 feature planes are designed to represent the board state, representing respectively our side's stone positions, the opponent's stone positions, the opponent's most recent move and whether our side moves first;
step 1.2, the input layer of the reinforcement learning network: the board state is discretized into a 4×n×n state matrix and fed into a three-layer fully convolutional network using 32, 64 and 128 convolution kernels of size 3×3, respectively, each followed by a ReLU activation;
step 1.3, constructing the action policy network (policy network): the current board state is fed into the policy network, which predicts the next move strategy from the current board through one convolutional layer and one fully connected layer;
step 1.4, constructing the state value network (value network): the board state is fed into the value network, which evaluates the value of the input board state through one convolutional layer and one fully connected layer;
step 1.5, the output layer of the reinforcement learning network: based on steps 1.3 and 1.4, the output layer is divided into two parts:
step 1.5.1, at the policy network output, 4 convolution kernels of size 1×1 perform dimension reduction; the result is sent to a fully connected layer whose softmax() nonlinearity outputs the probability of placing a stone at each position of the current board, giving a vector p of size 1×n² whose components lie in [0,1] and sum to 1;
step 1.5.2, at the value network output, 2 convolution kernels of size 1×1 perform dimension reduction; the result is sent to a fully connected layer whose tanh() nonlinearity outputs a value estimate of the current board state as a scalar v in [-1,1], where a larger v indicates a higher winning probability for our side on the current board and a smaller v a lower one;
step 1.6, randomly initializing the neural network parameters after constructing the reinforcement learning network according to the steps 1.1 to 1.5.
Step two, constructing an opponent modeling network to predict and infer the opponent's strategy, and feeding the result into the reinforcement learning policy network to form a reinforcement learning network based on opponent modeling;
step 2.1, looking back over the previous 10 rounds of historical board states, flattening each discretized state feature matrix into a 4·n² dimensional vector, and stacking the 10 historical board-state vectors row by row to form a historical move matrix used as input;
step 2.2, constructing a Bayesian LSTM opponent modeling network that combines Bayesian inference with an LSTM network to predict the opponent's strategy:
step 2.2.1, the encoder of the opponent modeling network is a unidirectional LSTM with 2 stacked layers and 128 hidden units, and outputs an embedding vector of fixed dimension;
step 2.2.2, the decoder of the opponent modeling network is a unidirectional LSTM with 2 stacked layers and 32 hidden units, and is used to generate the prediction of the opponent's future states;
specifically, LSTM is used to process historical chess face status, thereby solving the problems of gradient extinction and gradient explosion during long sequence training. Each LSTM layer in the designed bayesian LSTM network is followed by a random Dropout interpreting the model output as random samples from the posterior distribution of the target variable. This means that parameters of the posterior distribution can be approximated by predicting multiple times to create a confidence interval for each prediction. And then, the predicted opponent strategy model is mixed according to a Bayesian mixing method, so that the estimation and the reasoning of the opponent strategy are realized. As such, the opponent features are implicitly encoded together with the environmental features in the neural network to train the reinforcement learning network using the Bayesian LSTM network to learn historical interaction information of the opponent.
Step 2.3, training the Bayesian LSTM opponent modeling network: from the opponent's historical information, an opponent strategy model is obtained by training and inference through the two LSTM stages:
step 2.3.1, training uses the Adam optimizer and mini-batch gradient descent (batch_size=128); random Dropout makes each network prediction behave as a random sample from the posterior distribution of the target variable, so the posterior of the target variable is estimated after multiple predictions and a confidence interval is created for each prediction accordingly;
step 2.3.2, the opponent strategies (the prediction results) are mixed by a Bayesian mixture method, realizing estimation and inference of the opponent's strategy, which as prior knowledge greatly accelerates reinforcement-learning exploration;
step 2.4, connecting the output of the Bayesian LSTM opponent modeling network to the input of the policy network of step one, implicitly encoding the opponent features (the opponent modeling result) and the environment features (the current actual state) together in the neural network; that is, the predicted opponent strategy is concatenated with the current board-state vector and fed into the reinforcement learning network, realizing the reinforcement learning network based on opponent modeling.
Step three, constructing a Monte Carlo search tree, and guiding a search process of Monte Carlo tree nodes by using the reinforcement learning network based on opponent modeling in the step two;
step 3.1, computing the upper confidence bound Q(s,a)+U(s,a) of every possible move in the current board state and selecting the action with the largest value to traverse the tree, as in the following formula:
a = argmax_a (Q(s,a) + U(s,a))
where a denotes a move, s denotes the board state, Q(s,a) denotes the average value estimate of the states reachable under that move, and U(s,a) depends on the prior probability P(s,a) of the move and on the visit count N(s,a) of the edge, which is incremented by one on every visit:
U(s,a) = c_puct · P(s,a) · sqrt(Σ_b N(s,b)) / (1 + N(s,a))
where c_puct is a constant controlling the degree of exploration and b ranges over all actions currently available;
step 3.2, after the selected action a leads to the next state s, continuing to evaluate the policy and value of the current board with the reinforcement learning network, expressed as (P(s,·), Q(s,·)) = f_θ(s), where f_θ(s) denotes the reinforcement learning network constructed in step one;
step 3.3, repeating steps 3.1 and 3.2; when a win/loss result is produced, the Monte Carlo tree search is complete and the search policy in state s is returned:
π(a|s) = N(s,a)^(1/τ) / Σ_b N(s,b)^(1/τ)
where τ is a preset search (temperature) constant;
step 3.4, back-propagating the search result to train the reinforcement learning network, the training objective being to minimize the error between the estimated value v and the actual outcome z while maximizing the similarity between the estimated policy p and the actual search policy π, as in the following loss:
loss = (z - v)² - π^T log p + c·||θ||²
where z and π are the label data in the data set, z being the board-state outcome and π the actual search policy; p and v are the outputs of the reinforcement learning network, v being the estimated value of the board state and p the action policy output by the policy network; and c is the L2 regularization coefficient.
In particular, the decision process of a chess game can be viewed as a tree containing a large number of possible continuations, each move being obtained by Monte Carlo tree search. Conventional Monte Carlo tree search (MCTS) is a search strategy that approximates a Nash equilibrium; in practice, however, limited computing power is not sufficient to search this vast tree exhaustively.
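As a complement to the formulas of step three, a minimal node structure for the network-guided search might look as follows; it is illustrative only, omits the opponent-model input and the game-specific move generation, and uses names that are not taken from the original.

    import numpy as np

    class TreeNode:
        """One board state in the Monte Carlo search tree (illustrative names)."""
        def __init__(self, prior):
            self.P = prior          # prior probability from the policy network
            self.N = 0              # visit count
            self.W = 0.0            # accumulated value
            self.children = {}      # action -> TreeNode

        def Q(self):
            return self.W / self.N if self.N else 0.0

        def select(self, c_puct=5.0):
            """Step 3.1: choose the child maximising Q + U (assumes the node has been expanded)."""
            total_n = sum(c.N for c in self.children.values())
            return max(self.children.items(),
                       key=lambda kv: kv[1].Q() + c_puct * kv[1].P * np.sqrt(total_n) / (1 + kv[1].N))

        def expand(self, action_priors):
            """Step 3.2: create children from the policy network priors."""
            for a, prob in action_priors:
                self.children[a] = TreeNode(prob)

        def backup(self, value):
            """Step 3.4: accumulate the evaluation of one visit on the path back to the root."""
            self.N += 1
            self.W += value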
Step four, performing self-play with the Monte Carlo search tree of step three to generate game data, which is stored as a training data set for training the reinforcement learning network;
step 4.1, collecting the training data set, given the policy-value network parameters f_θ and an initial board state s:
step 4.1.1, running the Monte Carlo search tree, simulating the possible moves from the current state within a limited number of search steps; each simulation starts from the current board state and repeatedly descends along the child node with the largest value;
step 4.1.2, when an unseen board state (a leaf node) is encountered, converting it into the input feature vector of the reinforcement learning network, evaluating its value with the value network and feeding it into the policy network to obtain the action probability distribution π = a_θ(s);
step 4.1.3, repeating steps 4.1.1 and 4.1.2 until the game ends with a win or a loss; the final result assigns a value label z to each board position as the situation evaluation, +1 for a win, -1 for a loss and 0 for a draw, so the self-play data at each time step t can be represented as a tuple (s_t, π_t, z_t);
step 4.1.4, initializing a data set cache D with a capacity of 5000 and storing the self-play data tuples in it, finally obtaining the training data set;
step 4.2, maintaining a training data set:
step 4.2.1, expanding the data sets by mirror and flip transformations, applied as appropriate to the rules of the chess variant; since the board of a chess game is generally symmetric, the historical state data of the two players can also be exchanged by mirroring to improve efficiency (an illustrative sketch of such augmentation follows step 4.3.2 below);
step 4.2.2, randomly sampling m samples to train the reinforcement learning network and deleting the corresponding samples at the tail of the data set cache D, so that the cache always holds the most recent data; when the buffer is full, the tuples that were stored first are discarded;
step 4.3, training the reinforcement learning network from the training data set:
step 4.3.1, randomly sampling m samples from the data set cache D to obtain one batch of training data (s_t, π_t, z_t); the input of the reinforcement learning network is the board state s, its outputs are the policy p and the value v, and the network is trained with (π, z) as the labels of supervised learning so that (p, v) approaches (π, z);
step 4.3.2, based on step 4.3.1, the loss function of the policy network is loss = -π^T log p + c·||θ||² and the loss function of the value network is loss = (z - v)² + c·||θ||², where the L2 regularization coefficient is c = 0.0001, the initial learning rate α is 0.01 and the learning rate is decayed exponentially.
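For rule sets that are symmetric under these transformations (such as the Gomoku example used in the experiments), the augmentation of step 4.2.1 can be applied directly to the 4×n×n state tensor and the corresponding n² move distribution. The NumPy sketch below is illustrative only.

    import numpy as np

    def augment(state, pi, n):
        """Step 4.2.1: expand one (state, pi) pair with rotations and mirror images."""
        samples = []
        pi_board = pi.reshape(n, n)
        for k in range(4):                                    # four rotations of the board
            rot_s = np.rot90(state, k, axes=(1, 2))           # rotate each of the 4 feature planes
            rot_pi = np.rot90(pi_board, k)
            samples.append((rot_s.copy(), rot_pi.flatten()))
            samples.append((np.flip(rot_s, axis=2).copy(),    # plus the left-right mirror image
                            np.fliplr(rot_pi).flatten()))
        return samples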
Step five, repeating steps three and four until the reinforcement learning network converges, obtaining the final move decision.
Step 5.1, initializing the algorithm parameters, including the input and output sizes designed for the chess variant and its rules, the board-state matrix size, the total number of training rounds, the initial learning rate, the batch size of training samples, the buffer size, the number of Monte Carlo simulation steps for our side and for the game opponent, the hidden layer size of the Bayesian LSTM network, and so on; the specific values used are listed in Table 1.
Step 5.2, evaluating the experimental result: taking the game of Gomoku (five-in-a-row) as an example, an intelligent chess agent is trained in a simulation experiment with the game algorithm based on opponent modeling and Monte Carlo reinforcement learning; the evaluation index is its winning rate against a pure Monte Carlo reinforcement learning algorithm, 100 games are played per training round, and the winning rate is finally counted.
The invention is further described in connection with simulation experiments as follows:
1. Simulation experiment conditions: the hardware environment of the simulation experiment is an Intel Core i7-10700K CPU @ 3.60GHz, 64GB of memory and an RTX 3090 GPU; the software environment is the Ubuntu 20.04 operating system, Python 3.7 and PyTorch 1.8.0.
2. Simulation content and result analysis: the simulation experiment takes Gomoku as an example, trains an agent chess player with the game algorithm based on opponent modeling and Monte Carlo reinforcement learning, and uses the winning rate against a pure Monte Carlo reinforcement learning algorithm as the evaluation index. The main experimental parameters of the simulation experiment are shown in Table 1.
Table 1 simulation experiment parameters
Chessboard size: 8×8
Board state matrix size: 4×8×8
Total number of training rounds: 500
Games played per round: 100
Monte Carlo simulation steps (our side): 200
Monte Carlo simulation steps (opponent): 500
Policy network output size: (4×64, 64)
Value network output size: (64, 1)
Bayesian LSTM network hidden layer sizes: 128, 64
Dropout probability factor: 0.5
Batch size of training samples: 128
Buffer size: 5000
Initial learning rate: 0.002
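For convenience, the main parameters of Table 1 could be collected in a single configuration object, for example as below; the field names are illustrative assumptions and not taken from the original.

    from dataclasses import dataclass

    @dataclass
    class TrainConfig:
        board_size: int = 8                  # 8 x 8 chessboard
        state_shape: tuple = (4, 8, 8)       # four 0-1 feature planes
        total_rounds: int = 500              # total number of training rounds
        games_per_round: int = 100           # games played per round
        mcts_playouts_self: int = 200        # Monte Carlo simulation steps (our side)
        mcts_playouts_opponent: int = 500    # Monte Carlo simulation steps (pure MCTS opponent)
        lstm_hidden: tuple = (128, 64)       # Bayesian LSTM hidden layer sizes
        dropout: float = 0.5                 # Dropout probability factor
        batch_size: int = 128                # batch size of training samples
        buffer_size: int = 5000              # buffer size
        learning_rate: float = 0.002         # initial learning rate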
The simulation results of the Gomoku game using the proposed algorithm based on opponent modeling and Monte Carlo reinforcement learning are shown in Fig. 3. As can be seen from Fig. 3, the winning rate gradually stabilizes after about 150 training rounds, with an average of up to 75%. The simulation results show that the proposed algorithm adapts to the opponent more quickly and improves its winning rate within fewer exploration episodes. In addition, the invention can finally complete the chess game task with a high winning rate, and exhibits generalization, stability and efficiency.
The foregoing is merely a preferred embodiment of the present invention. It should be noted that modifications and adaptations may be made by those skilled in the art without departing from the principles of the present invention, and such modifications shall also fall within the scope of the present invention.

Claims (6)

1. A chess game method based on opponent modeling and Monte Carlo reinforcement learning, characterized in that it comprises the following steps:
step 1, designing input and output of a chess algorithm, constructing a reinforcement learning network and randomly initializing parameters of the network;
step 2, constructing an opponent modeling network to predict and infer the opponent's strategy, and feeding the result into the reinforcement learning policy network to form a reinforcement learning network based on opponent modeling;
step 3, constructing a Monte Carlo search tree, and guiding a search process of Monte Carlo tree nodes by using the reinforcement learning network based on opponent modeling in the step 2;
step 4, performing self-play with the Monte Carlo search tree of step 3 to generate game data, which is stored as a training data set for training the reinforcement learning network;
and step 5, repeating steps 3 and 4 until the reinforcement learning network converges, obtaining the final move decision.
2. The chess game method based on opponent modeling and Monte Carlo reinforcement learning of claim 1, wherein step 1 comprises the following sub-steps:
step 1.1, initializing the chessboard according to the rules of the chess variant and setting the sizes of the input and output layers according to the board size n; four 0-1 feature planes are designed to represent the board state, representing respectively our side's stone positions, the opponent's stone positions, the opponent's most recent move and whether our side moves first;
step 1.2, the input layer of the reinforcement learning network: the board state is discretized into a 4×n×n state matrix and fed into a three-layer fully convolutional network using 32, 64 and 128 convolution kernels of size 3×3, respectively, each followed by a ReLU activation;
step 1.3, constructing the action policy network (policy network): the current board state is fed into the policy network, which predicts the next move strategy from the current board through one convolutional layer and one fully connected layer;
step 1.4, constructing the state value network (value network): the board state is fed into the value network, which evaluates the value of the input board state through one convolutional layer and one fully connected layer;
step 1.5, the output layer of the reinforcement learning network: based on steps 1.3 and 1.4, the output layer is divided into two parts:
step 1.5.1, at the policy network output, 4 convolution kernels of size 1×1 perform dimension reduction; the result is sent to a fully connected layer whose softmax() nonlinearity outputs the probability of placing a stone at each position of the current board, giving a vector p of size 1×n² whose components lie in [0,1] and sum to 1;
step 1.5.2, at the value network output, 2 convolution kernels of size 1×1 perform dimension reduction; the result is sent to a fully connected layer whose tanh() nonlinearity outputs a value estimate of the current board state as a scalar v in [-1,1], where a larger v indicates a higher winning probability for our side on the current board and a smaller v a lower one;
step 1.6, randomly initializing the neural network parameters after constructing the reinforcement learning network according to the steps 1.1 to 1.5.
3. The chess game method based on opponent modeling and Monte Carlo reinforcement learning of claim 1, wherein step 2 comprises the following sub-steps:
step 2.1, looking back over the past 10-50 historical board states, flattening each discretized state feature matrix into a 4·n² dimensional vector, and stacking these historical board-state vectors row by row to form a historical move matrix used as input;
step 2.2, constructing a Bayesian LSTM opponent modeling network that combines Bayesian inference with an LSTM network to predict the opponent's strategy:
step 2.2.1, the encoder of the opponent modeling network is a unidirectional LSTM with 2 stacked layers and 128 hidden units, and outputs an embedding vector of fixed dimension;
step 2.2.2, the decoder of the opponent modeling network is a unidirectional LSTM with 2 stacked layers and 32 hidden units, and generates the prediction of the opponent's future states;
step 2.3, training the Bayesian LSTM opponent modeling network: from the opponent's historical information, an opponent strategy model is obtained by training and inference through the two LSTM stages:
step 2.3.1, training uses the Adam optimizer and mini-batch gradient descent, for example batch_size=128; random Dropout makes each network prediction behave as a random sample from the posterior distribution of the target variable, so repeated predictions are used to estimate that posterior and to create a confidence interval for each prediction;
step 2.3.2, the predicted opponent strategies (the prediction results) are mixed by a Bayesian mixture method to estimate and infer the opponent's strategy;
step 2.4, connecting the output of the Bayesian LSTM opponent modeling network to the input of the policy network of step 1, so that the opponent features (the opponent modeling result) and the current actual state are implicitly encoded together in the neural network; that is, the predicted opponent strategy is concatenated with the current board-state vector and fed into the reinforcement learning network, yielding the reinforcement learning network based on opponent modeling.
4. The chess game method based on opponent modeling and Monte Carlo reinforcement learning of claim 1, wherein step 3 comprises the following sub-steps:
step 3.1, computing the upper confidence bound Q(s,a)+U(s,a) of every possible move in the current board state and selecting the action with the largest value to traverse the tree, as in the following formula:
a = argmax_a (Q(s,a) + U(s,a))
where a denotes a move, s denotes the board state, Q(s,a) denotes the average value estimate of the states reachable under that move, and U(s,a) depends on the prior probability P(s,a) of the move and on the visit count N(s,a) of the edge, which is incremented by one on every visit:
U(s,a) = c_puct · P(s,a) · sqrt(Σ_b N(s,b)) / (1 + N(s,a))
where c_puct is a constant controlling the degree of exploration and b ranges over all actions currently available;
step 3.2, after the selected action a leads to the next state s, continuing to evaluate the policy and value of the current board with the reinforcement learning network, expressed as (P(s,·), Q(s,·)) = f_θ(s), where f_θ(s) denotes the reinforcement learning network constructed in step 1;
step 3.3, repeating steps 3.1 and 3.2; when a win/loss result is produced, the Monte Carlo tree search is complete and the search policy in state s is returned:
π(a|s) = N(s,a)^(1/τ) / Σ_b N(s,b)^(1/τ)
where τ is a preset search (temperature) constant;
step 3.4, back-propagating the search result to train the reinforcement learning network, the training objective being to minimize the error between the estimated value v and the actual outcome z while maximizing the similarity between the estimated policy p and the actual search policy π, as in the following loss:
loss = (z - v)² - π^T log p + c·||θ||²
where z and π are the label data in the data set, z being the board-state outcome and π the actual search policy; p and v are the outputs of the reinforcement learning network, v being the estimated value of the board state and p the action policy output by the policy network; and c is the L2 regularization coefficient.
5. The chess game method based on opponent modeling and Monte Carlo reinforcement learning of claim 1, wherein step 4 comprises the following sub-steps:
step 4.1, collecting the training data set, given the policy-value network parameters f_θ and an initial board state s:
step 4.1.1, running the Monte Carlo search tree, simulating the possible moves from the current state within a limited number of search steps; each simulation starts from the current board state and repeatedly descends along the child node with the largest value;
step 4.1.2, when an unseen board state (a leaf node) is encountered, converting it into the input feature vector of the reinforcement learning network, evaluating its value with the value network and feeding it into the policy network to obtain the action probability distribution π = a_θ(s);
step 4.1.3, repeating steps 4.1.1 and 4.1.2 until the game ends; the final result assigns a value label z to each board position as the situation evaluation, +1 for a win, -1 for a loss and 0 for a draw, so the self-play data at each time step t can be represented as a tuple (s_t, π_t, z_t);
step 4.1.4, initializing a data set cache D with a capacity of 5000 and storing the self-play data tuples in it, finally obtaining the training data set;
step 4.2, maintaining a training data set:
step 4.2.1, expanding the historical state data of both players by mirror and flip transformations, applied as appropriate to the rules of the chess variant;
step 4.2.2, randomly sampling m samples to train the reinforcement learning network and deleting the corresponding samples at the tail of the data set cache D, so that the cache always holds the most recent data; when the buffer is full, the tuples that were stored first are discarded;
step 4.3, training the reinforcement learning network from the training data set:
step 4.3.1, randomly sampling m samples from the data set cache D to obtain one batch of training data (s_t, π_t, z_t); the input of the reinforcement learning network is the board state s, its outputs are the policy p and the value v, and the network is trained with (π, z) as the labels of supervised learning so that (p, v) approaches (π, z);
step 4.3.2, based on step 4.3.1, the loss function of the policy network is loss = -π^T log p + c·||θ||² and the loss function of the value network is loss = (z - v)² + c·||θ||², where the L2 regularization coefficient is c = 0.0001, the initial learning rate α is 0.01 and the learning rate is decayed exponentially.
6. The chess game method based on opponent modeling and Monte Carlo reinforcement learning of claim 1, wherein in step 5, steps 3 and 4 are repeated until the reinforcement learning network converges to obtain the final move decision, comprising the following sub-steps:
step 5.1, initializing the algorithm parameters, including the input and output sizes designed for the chess variant and its rules, the board-state matrix size, the total number of training rounds, the initial learning rate, the batch size of training samples, the buffer size, the number of Monte Carlo simulation steps for our side and for the game opponent, and the hidden layer size of the Bayesian LSTM network;
and step 5.2, evaluating the experimental result: the intelligent chess agent is trained, the evaluation index being its winning rate against a pure Monte Carlo reinforcement learning algorithm; 100-1000 games are played per training round and the winning rate is finally counted.
CN202310135726.5A 2023-02-18 2023-02-18 Chess game method based on opponent modeling and Monte Carlo reinforcement learning Pending CN116128060A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202310135726.5A CN116128060A (en) 2023-02-18 2023-02-18 Chess game method based on opponent modeling and Monte Carlo reinforcement learning

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202310135726.5A CN116128060A (en) 2023-02-18 2023-02-18 Chess game method based on opponent modeling and Monte Carlo reinforcement learning

Publications (1)

Publication Number Publication Date
CN116128060A true CN116128060A (en) 2023-05-16

Family

ID=86309914

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202310135726.5A Pending CN116128060A (en) 2023-02-18 2023-02-18 Chess game method based on opponent modeling and Monte Carlo reinforcement learning

Country Status (1)

Country Link
CN (1) CN116128060A (en)

Cited By (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN116881656A (en) * 2023-07-06 2023-10-13 南华大学 Reinforced learning military chess AI system based on deep Monte Carlo
CN116881656B (en) * 2023-07-06 2024-03-22 南华大学 Reinforced learning military chess AI system based on deep Monte Carlo
CN117033250A (en) * 2023-10-08 2023-11-10 腾讯科技(深圳)有限公司 Method, device, equipment and storage medium for testing office application
CN117033250B (en) * 2023-10-08 2024-01-23 腾讯科技(深圳)有限公司 Method, device, equipment and storage medium for testing office application

Similar Documents

Publication Publication Date Title
CN111282267B (en) Information processing method, information processing apparatus, information processing medium, and electronic device
CN109063823B (en) Batch A3C reinforcement learning method for exploring 3D maze by intelligent agent
CN112329348A (en) Intelligent decision-making method for military countermeasure game under incomplete information condition
CN116128060A (en) Chess game method based on opponent modeling and Monte Carlo reinforcement learning
CN113688977B (en) Human-computer symbiotic reinforcement learning method and device oriented to countermeasure task, computing equipment and storage medium
Chen et al. An improved bat algorithm hybridized with extremal optimization and Boltzmann selection
CN109740741B (en) Reinforced learning method combined with knowledge transfer and learning method applied to autonomous skills of unmanned vehicles
Knegt et al. Opponent modelling in the game of Tron using reinforcement learning
Hammoudeh A concise introduction to reinforcement learning
Schmidhuber Philosophers & futurists, catch up! Response to The Singularity
CN115238891A (en) Decision model training method, and target object strategy control method and device
WO2022247791A1 (en) Chess self-learning method and apparatus based on machine learning
Cahill Catastrophic forgetting in reinforcement-learning environments
CN113509726A (en) Interactive model training method and device, computer equipment and storage medium
CN116090549A (en) Knowledge-driven multi-agent reinforcement learning decision-making method, system and storage medium
Espinosa Leal et al. Reinforcement learning for extended reality: designing self-play scenarios
CN115906673A (en) Integrated modeling method and system for combat entity behavior model
Guo Deep learning and reward design for reinforcement learning
Olesen et al. Evolutionary planning in latent space
Nakashima et al. Designing high-level decision making systems based on fuzzy if–then rules for a point-to-point car racing game
Sun Performance of reinforcement learning on traditional video games
Hui et al. Balancing excitation and inhibition of spike neuron using deep q network (dqn)
Lamontagne et al. Acquisition of cases in sequential games using conditional entropy
Grim Philosophy for computers: some explorations in philosophical modeling
Yao et al. Cheat-FlipIt: An Approach to Modeling and Perception of a Deceptive Opponent

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination