CN116881656A - Reinforcement learning military chess AI system based on deep Monte Carlo - Google Patents

Reinforcement learning military chess AI system based on deep Monte Carlo

Info

Publication number
CN116881656A
Authority
CN
China
Prior art keywords
chess
current
module
player
military
Prior art date
Legal status
Granted
Application number
CN202310825710.7A
Other languages
Chinese (zh)
Other versions
CN116881656B (en)
Inventor
林文斌
吕航
王玮
杨雪晴
Current Assignee
University of South China
Original Assignee
University of South China
Priority date
Filing date
Publication date
Application filed by University of South China
Priority to CN202310825710.7A
Publication of CN116881656A
Application granted
Publication of CN116881656B
Current legal status: Active

Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00Pattern recognition
    • G06F18/20Analysing
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00Pattern recognition
    • G06F18/20Analysing
    • G06F18/22Matching criteria, e.g. proximity measures
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/02Neural networks
    • G06N3/04Architecture, e.g. interconnection topology
    • G06N3/044Recurrent networks, e.g. Hopfield networks
    • G06N3/0442Recurrent networks, e.g. Hopfield networks characterised by memory or gating, e.g. long short-term memory [LSTM] or gated recurrent units [GRU]
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/02Neural networks
    • G06N3/08Learning methods

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Data Mining & Analysis (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Artificial Intelligence (AREA)
  • General Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • Evolutionary Computation (AREA)
  • Bioinformatics & Computational Biology (AREA)
  • Computational Linguistics (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Bioinformatics & Cheminformatics (AREA)
  • Health & Medical Sciences (AREA)
  • Biomedical Technology (AREA)
  • Biophysics (AREA)
  • Evolutionary Biology (AREA)
  • General Health & Medical Sciences (AREA)
  • Molecular Biology (AREA)
  • Computing Systems (AREA)
  • Mathematical Physics (AREA)
  • Software Systems (AREA)
  • Electrically Operated Instructional Devices (AREA)

Abstract

The invention discloses a reinforcement learning military chess (Junqi) AI system based on deep Monte Carlo, belonging to the technical field of artificial intelligence game AI. The invention provides a deep-Monte-Carlo-based reinforcement learning military chess AI system, together with a matching training method and an actual-combat execution method. By using the deep Monte Carlo method and a piece upper- and lower-bound evaluation algorithm, the playing strength of the military chess AI is greatly improved, the application prospect is good, and a gap of military chess AI in the artificial intelligence field is filled. The invention addresses the difficulty of designing and training an AI for the imperfect-information game of military chess, and improves the performance of the military chess AI when playing against humans. It is of great significance for the training and study of military chess enthusiasts, and provides both human-machine play and military chess AI self-play training functions.

Description

Reinforcement learning military chess AI system based on deep Monte Carlo
Technical Field
The invention relates to the technical field of artificial intelligence game AI, in particular to a reinforcement learning military chess AI system based on deep Monte Carlo.
Background
In the field of artificial intelligence, combining deep learning with reinforcement learning has produced outstanding game-playing AI such as AlphaGo. Typical board games such as Go and chess are perfect-information games, in which both players can observe the entire board. Military chess, however, is an imperfect-information game: during play, a player cannot see the identity of the opponent's pieces, which makes designing a military chess AI very challenging.
No satisfactory solution to this problem currently exists at home or abroad. A deep-Monte-Carlo-based reinforcement learning military chess AI system and its training method are therefore designed. By using the deep Monte Carlo method and a piece upper- and lower-bound evaluation algorithm, the playing strength of the military chess AI is greatly improved; the system has practical significance and good application prospects, and fills a gap of military chess AI in the artificial intelligence field.
Disclosure of Invention
The purpose of the invention is to provide a deep-Monte-Carlo-based reinforcement learning military chess AI system that improves the performance of the military chess AI when playing against humans and provides a convenient and interesting artificial-intelligence opponent for military chess enthusiasts.
In order to achieve the above purpose, the present invention adopts the following technical scheme:
a deep monte carlo based reinforcement learning military chess AI system, the system comprising: the system comprises a military chess fight module, a military chess law generation module, a military chess characteristic acquisition module, a military chess law decision module and a decision evaluation module;
the army chess fight module is used for displaying the situation of both sides of the chess, executing and interacting decisions of both sides, and judging the fight result between the chesses;
the military chess method generating module is used for searching the current chess situation, giving out all feasible methods of the current player and sending the current player into the military chess characteristic acquisition module;
the army chess characteristic acquisition module acquires current own chess piece information, enemy chess piece information and latest two-step recruitment methods from the army chess fight module, acquires all feasible recruitment methods from the army chess recruitment method generation module, converts the data into a proper coding format as a state value and inputs the state value into the army chess recruitment method decision module;
the army chess method decision-making module is divided into a training stage and an actual fight stage;
wherein the training phase comprises: generating evaluation values of all feasible recruitment methods of the current player by adopting a deep Monte Carlo network decision technique according to the input state values, namely evaluating all the feasible recruitment methods of the player under the current situation, then selecting the recruitment method with the maximum evaluation value, continuously choosing until the game is ended to win or lose, training a decision network according to the final feedback information of a decision evaluation module, and optimizing parameters of the decision network;
the actual combat stage comprises the following steps: the fully trained decision model does not update network parameters any more, and an optimal recruitment method is selected by adopting a deep Monte Carlo decision technology according to the input state value;
the decision evaluation module evaluates the scores of all decisions in the whole game according to the final game result and generates feedback information to the military chess law-solicitation decision module.
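As an illustration only, the following Python sketch shows one way the five modules described above could be expressed as interfaces; every class and method name here (CombatModule, legal_moves, and so on) is a hypothetical choice of this sketch and is not specified by the patent.

```python
# Hypothetical interface sketch of the five-module architecture; names are illustrative.
from dataclasses import dataclass, field
from typing import Optional

@dataclass
class GameState:
    own_pieces: list            # own-side piece records (types known)
    enemy_pieces: list          # enemy piece records (types unknown)
    recent_moves: list = field(default_factory=list)   # most recent moves of both players
    finished: bool = False
    winner: Optional[int] = None

class CombatModule:             # 101: holds the board, executes moves, adjudicates engagements
    def initial_state(self) -> GameState: ...
    def execute(self, state: GameState, move) -> GameState: ...

class MoveGenerationModule:     # 102: enumerates all feasible moves for the current player
    def legal_moves(self, state: GameState) -> list: ...

class FeatureAcquisitionModule: # 103: encodes the visible information into a state value
    def encode(self, state: GameState, moves: list): ...

class MoveDecisionModule:       # 104: deep Monte Carlo network scoring every candidate move
    def choose(self, features, moves): ...

class DecisionEvaluationModule: # 105: scores a finished game and feeds back a training signal
    def feedback(self, trajectory, result) -> None: ...
```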
Preferably, the specific execution flow of the system comprises the following steps:
S1, the military chess combat module stores the current board information in real time, and executes and adjudicates the decisions of the two combatants;
S2, the military chess move generation module analyses and matches all possible moves of the current player according to the current board information; the structure of a move comprises: the coordinates of the own-side piece, the type of the own-side piece, the target coordinates, and the type of the target coordinates (a move encoding sketch is given after this list);
S3, the military chess feature acquisition module extracts all information visible from the current player's viewpoint according to the current board information, including own-side piece information, enemy piece information, the number of rounds in which no engagement has occurred, all moves of the current player, and the moves of both players over the most recent twenty rounds; this information is used as the input of the military chess move decision module;
S4, the military chess move decision module evaluates each move under the current position through the deep Monte Carlo network according to the input state value and selects a move; the military chess combat module executes the move, switches to the other player's viewpoint, and the flow returns to step S2, until the military chess combat module judges that the game has ended;
S5, the decision evaluation module optimizes the deep Monte Carlo network in the military chess move decision module by computing the mean square error (MSE) from the collected game data, the final result, and the state values recorded during the game.
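The move structure named in step S2 can be pictured with the following minimal sketch; the piece-type and square-type names are common Junqi terms used here as assumptions, and "type of the target coordinates" is read as the target square type, which the patent does not spell out.

```python
# Hypothetical encoding of a single move: (own piece coordinates, own piece type,
# target coordinates, type of the target coordinates).
from dataclasses import dataclass

PIECE_TYPES = ("flag", "landmine", "bomb", "engineer", "platoon", "company",
               "battalion", "regiment", "brigade", "division", "corps", "commander")
SQUARE_TYPES = ("normal", "camp", "railway", "headquarters")

@dataclass(frozen=True)
class Move:
    src: tuple          # coordinates of the moving own-side piece, e.g. (row, col)
    piece_type: str     # type of the moving piece, one of PIECE_TYPES
    dst: tuple          # target coordinates
    dst_type: str       # type of the target square, one of SQUARE_TYPES

# Example: an engineer moving from (0, 4) onto a railway square at (2, 4).
move = Move(src=(0, 4), piece_type="engineer", dst=(2, 4), dst_type="railway")
```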
Preferably, the piece information in S3 specifically includes: the current coordinates, piece type, survival state, whether the piece is on a railway, and whether the piece is in a camp, for every piece; according to the survival state, pieces are divided into surviving pieces and dead pieces; for dead pieces, the current coordinates, survival state, on-railway and in-camp features are all set to 0, distinguishing them from surviving pieces. Because the type of an enemy piece is unknown, it is represented by a probability distribution: the initial probabilities are set according to the initial coordinates of the pieces and the opening-setup rules of military chess, and during the game the probabilities are updated according to engagement results and movement behavior. A sketch of this encoding is given below.
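The following is a minimal sketch of the per-piece encoding just described, assuming twelve piece types and a simple probability update; the exact dimensions and the update rule are assumptions of this sketch, not taken from the patent.

```python
# Illustrative per-piece feature encoding; dead pieces are all zeros, and enemy piece types
# are a probability vector that is updated from engagement results.
import numpy as np

N_TYPES = 12  # assumed number of military chess piece types

def encode_own_piece(x, y, type_idx, alive, on_railway, in_camp):
    type_vec = np.zeros(N_TYPES)
    if not alive:
        # For dead pieces the coordinate, survival, railway and camp features are all 0.
        return np.concatenate(([0, 0, 0, 0, 0], type_vec))
    type_vec[type_idx] = 1.0   # own piece: type is known, encoded one-hot
    return np.concatenate(([x, y, 1, float(on_railway), float(in_camp)], type_vec))

def encode_enemy_piece(x, y, alive, on_railway, in_camp, type_probs):
    if not alive:
        return np.zeros(5 + N_TYPES)
    # Enemy piece: type unknown, represented by a probability vector over N_TYPES.
    return np.concatenate(([x, y, 1, float(on_railway), float(in_camp)], type_probs))

def update_after_capture(type_probs, captured_rank):
    """Assumed update rule: an enemy piece that captured one of ours and survived must
    outrank the captured piece, so lower-or-equal ranks are zeroed and renormalised."""
    p = np.asarray(type_probs, dtype=float).copy()
    p[:captured_rank + 1] = 0.0
    total = p.sum()
    return p / total if total > 0 else p
```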
Preferably, the deep Monte Carlo network in S4 is a value network that estimates the value of actions and states; the evaluation network generates the value function of a move decision. The network structure comprises an LSTM network that receives the features of the most recent twenty moves; the output of the LSTM is concatenated with the action state value and the board state value and fed into fully connected layers, producing a number of evaluation values equal to the number of candidate moves. The information fed into the fully connected layers specifically comprises own-side piece information, enemy piece information, the feature vector produced by the LSTM from the most recent twenty moves, and the currently feasible moves. A minimal network sketch follows.
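A minimal PyTorch sketch of the evaluation network structure just described, under the assumption of illustrative layer sizes (the patent gives no concrete dimensions): an LSTM summarises the most recent twenty moves, its output is concatenated with the board-state and move encodings, and fully connected layers produce one evaluation value per candidate move.

```python
import torch
import torch.nn as nn

class MoveEvaluationNet(nn.Module):
    def __init__(self, move_feat_dim=60, state_dim=512, action_dim=60, hidden=512):
        super().__init__()
        self.lstm = nn.LSTM(input_size=move_feat_dim, hidden_size=128, batch_first=True)
        self.mlp = nn.Sequential(
            nn.Linear(128 + state_dim + action_dim, hidden), nn.ReLU(),
            nn.Linear(hidden, hidden), nn.ReLU(),
            nn.Linear(hidden, 1),            # one evaluation value per (state, move) pair
        )

    def forward(self, move_history, board_state, candidate_moves):
        # move_history:    (batch, 20, move_feat_dim)   features of the last twenty moves
        # board_state:     (batch, state_dim)           own-side and enemy piece features
        # candidate_moves: (batch, n_moves, action_dim) encodings of every feasible move
        _, (h_n, _) = self.lstm(move_history)            # h_n: (1, batch, 128)
        history = h_n.squeeze(0)                         # (batch, 128)
        n_moves = candidate_moves.size(1)
        ctx = torch.cat([history, board_state], dim=-1)  # (batch, 128 + state_dim)
        ctx = ctx.unsqueeze(1).expand(-1, n_moves, -1)   # repeat the context for each move
        x = torch.cat([ctx, candidate_moves], dim=-1)
        return self.mlp(x).squeeze(-1)                   # (batch, n_moves) evaluation values
```

With this shape convention, `net(history, state, moves)` returns one score per feasible move, and the move with the maximum score is chosen, matching the argmax selection described for the training stage.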
Preferably, the training method of the training stage specifically comprises the following contents:
A1, establishing experience pools B_1 and B_2 for the two players; each experience pool is a buffer that stores the input features F of every round of the two players' games, its capacity is manually set to S, and when the stored data reaches S, training is performed, the experience pool is emptied, and storage restarts;
A2, establishing decision networks Q_1 and Q_2 for the two players; each decision network reads the input feature F and outputs an evaluation value, and the move action a_t in round t is selected as follows:
a_t = argmax_a Q(s_t, a), with probability p = (1 - ε)
a_t = random(s_t, a), with probability p = ε
the reward value of each round is determined by the final result:
r_t ← r_t + γ·r_{t+1}
where s_t is the board state from the current player's viewpoint in round t; r_t is the reward obtained by the current player's action in round t; argmax_a means selecting, from the evaluation values computed by the decision network Q over the board state s_t and the action set a of the current round, the action with the maximum evaluation value as the current-round action a_t; random means randomly selecting an action from the current-round action set a as the current-round action a_t; ε is the exploration probability of the current round; p is the probability, derived from the exploration probability, with which each selection rule is applied; γ is a decay factor, meaning that the current-round reward is determined jointly by the current-round reward and the decayed next-round reward;
A3, learning from the experience pool a total of T times: after each game ends, the round features accumulated during the game are put into the experience pool B, and one learning pass is performed whenever the experience pool is full, until T learning passes have finally been completed;
A4, in each learning pass, the feature F_t in the experience pool B is first fed into the network Q to obtain an evaluation value G_t; the loss function is the mean square error between G_t and the recorded reward r_t, and the Adam algorithm is used to mitigate problems of learning-rate vanishing, slow convergence, and abnormal parameter updates in training the network Q.
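The following sketch illustrates the A1-A4 procedure under the assumptions above: ε-greedy move selection, backing the final result up with decay factor γ, and fitting the network with a mean-squared-error loss optimised by Adam (e.g. torch.optim.Adam(q_net.parameters())). All names and hyperparameters are illustrative, not taken from the patent.

```python
import random
import torch
import torch.nn as nn

def select_move(q_net, features, n_candidates, epsilon):
    """A2: with probability 1 - epsilon take the argmax of the evaluation values,
    with probability epsilon pick a random feasible move."""
    if random.random() < epsilon:
        return random.randrange(n_candidates)
    with torch.no_grad():
        values = q_net(*features)          # assumed to return one value per candidate move
    return int(values.argmax().item())

def backed_up_rewards(rewards, gamma=0.99):
    """r_t <- r_t + gamma * r_{t+1}, computed backwards from the final game result."""
    returns, running = [], 0.0
    for r in reversed(rewards):
        running = r + gamma * running
        returns.append(running)
    return list(reversed(returns))

def train_on_pool(q_net, pool, optimizer):
    """A4: fit the evaluation G_t of each stored (feature, chosen move) pair to the
    backed-up reward r_t with a mean-squared-error loss; Adam performs the update."""
    loss_fn = nn.MSELoss()
    for features, r_t in pool:             # pool B holds (features of chosen move, reward)
        g_t = q_net(*features).squeeze()   # assumed scalar evaluation for the chosen move
        loss = loss_fn(g_t, torch.tensor(float(r_t)))
        optimizer.zero_grad()
        loss.backward()
        optimizer.step()
```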
Preferably, the actual-combat stage refers to human-machine play, where the "human" is a human military chess player and the "machine" is the military chess AI; the specific execution flow comprises the following steps:
B1, the human military chess player chooses to move first or second on the start interface and arranges his or her own military chess layout;
B2, the first-moving player plays; the military chess feature acquisition module extracts all information visible from the current player's viewpoint according to the current board information, specifically including own-side piece information, enemy piece information, the number of rounds in which no engagement has occurred, all moves of the current player, and the moves of both players over the most recent twenty rounds; the military chess move generation module checks whether the current player's move violates the game rules, and if so, the operation cannot be executed;
B3, the military chess combat module executes the first-moving player's move, adjudicates the engagement result according to the rules, and updates the board information of both sides;
B4, the second-moving player plays; the military chess feature acquisition module extracts all information visible from the current player's viewpoint according to the current board information, specifically including own-side piece information, enemy piece information, the number of rounds in which no engagement has occurred, all moves of the current player, and the moves of both players over the most recent twenty rounds; the military chess move generation module checks whether the current player's move violates the game rules, and if so, the operation cannot be executed;
B5, the military chess combat module executes the second-moving player's move, adjudicates the engagement result according to the rules, and updates the board information of both sides;
B6, if the game has not ended, the flow returns to step B2; otherwise, the winner is declared.
Preferably, the military chess AI also has the function of deciding whether to move first or second; if the military chess AI is the first-moving player, the first-moving player's move in B2 is evaluated and determined by the deep Monte Carlo network of the military chess AI; if the military chess AI is the second-moving player, the second-moving player's move in B4 is evaluated and determined by the deep Monte Carlo network of the military chess AI.
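The B1-B6 flow can be summarised by the following loop, a sketch that reuses the hypothetical module interfaces from the earlier architecture sketch; none of the names come from the patent.

```python
# Illustrative human-vs-AI game loop following steps B1-B6.
def play_match(combat, movegen, features, first_player, second_player):
    state = combat.initial_state()                      # B1: layouts arranged, move order decided
    while not state.finished:
        for player in (first_player, second_player):    # B2/B4: players move in turn
            legal = movegen.legal_moves(state)           # moves violating the rules are rejected
            feats = features.encode(state, legal)
            move = player.choose(feats, legal)           # human input or the AI's deep MC network
            state = combat.execute(state, move)          # B3/B5: adjudicate engagement, update board
            if state.finished:
                break
    return state.winner                                  # B6: declare the winner
```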
Compared with the prior art, the invention provides a deep-Monte-Carlo-based reinforcement learning military chess AI system with the following beneficial effects:
The invention provides a deep-Monte-Carlo-based reinforcement learning military chess AI system, together with a matching training method and an actual-combat execution method, solving the difficulty of designing and training an AI for the imperfect-information game of military chess and improving the performance of the military chess AI when playing against humans. It is of great significance for the training and study of military chess enthusiasts, and provides both human-machine play and military chess AI self-play training functions. With the invention, military chess enthusiasts can hone their skill easily, conveniently and efficiently, relax, and relieve the pressures of daily life.
Drawings
FIG. 1 is a flow chart of actual combat in embodiment 1 of the present invention;
FIG. 2 is a schematic diagram of the interaction between the modules of the military chess AI system in embodiment 1 of the present invention;
FIG. 3 is a schematic diagram of the deep-Monte-Carlo-based move evaluation network in embodiment 1 of the present invention.
The reference numerals in the figures illustrate:
101. military chess combat module; 102. military chess move generation module; 103. military chess feature acquisition module; 104. military chess move decision module; 105. decision evaluation module.
Detailed Description
The following is a clear and complete description of the embodiments of the present invention with reference to the accompanying drawings; it is apparent that the described embodiments are only some, not all, of the embodiments of the present invention.
Example 1:
referring to fig. 1-2, the present invention provides a deep Monte Carlo based reinforcement learning military chess AI system, and the normal game process is shown in fig. 1.
S1, initializing a chess bureau, wherein a human player of the military chess decides a first hand and a second hand in a starting interface, and arranges own military chess layout;
s2, performing real-time fight interaction, sequentially playing chess by players, planning a fight method by using a deep Monte Carlo algorithm, judging fight results by a military chess fight module 101, and updating a real-time situation;
and S3, feature processing is inserted in the fight, and after the fight module 101 updates the situation, new features including own chess piece information, enemy chess piece information, the number of rounds of the non-occurrence, all the recruitment methods of the current player and the latest twenty rounds of both the recruitment methods are transmitted to the player who is going to play chess. Wherein the known information comprises all information of the chess pieces of the chess player, the coordinates of the visible enemy chess pieces and the necessary information which can be obtained according to the rules of the military chess. The unknown information includes guesses for the enemy's surviving pieces and dead pieces. After processing these features, new features are generated.
The interactive behavior of each module of the military chess AI system provided by the invention is shown in fig. 2, a military chess fight module 101 stores current chess game information in real time, performs and judges decisions of fight parties, transmits the chess game information to a military chess fight method generation module 102 and a military chess feature acquisition module 103, and sends training data to a decision evaluation module 105 after meeting requirements;
the military chess method generating module 102 receives the current chess information and generates all feasible chess methods to be sent to the military chess characteristic collecting module 103 for further processing.
And the military chess feature acquisition module 103 receives the law and chess information, refines the information into features and sends the features to the military chess law decision module 104.
The military chess method decision module 104 accepts the features and adopts the deep Monte Carlo network decision technology to generate evaluation values of all feasible methods of the current player, namely, the evaluation of all the feasible methods of the player in the current situation, and then selects the method with the largest evaluation value.
The decision evaluation module 105 receives the training data packaged by the military chess fight module 101 to evaluate when the game data is generated to a certain amount, and calculates the gradient of the loss value generated after the evaluation for parameter updating.
Specifically, the deep Monte Carlo algorithm is used for the evaluation of the recruitment method, and the LSTM network is used for receiving the last few steps of the recruitment method to generate the characteristics about the history period. The characteristic network structure is shown in fig. 3, the input layer is a full-connection layer, the input state value is the current coordinate of the chess piece, the type of the chess piece, the survival state, whether the chess piece is positioned on a railway, whether the chess piece is positioned on camping and the recruitment method, and the characteristics generated by the LSTM in the last step are also shown, the middle layer of the network is the full-connection layer, the final output layer is a softmax layer, and the output value is the evaluation value of the current environment and the recruitment method.
Adam algorithm used for parameter updating, and loss function is used for calculating evaluation value G t And the final game result r t Is a mean square error of (c).
The foregoing is only a preferred embodiment of the present invention, but the scope of the present invention is not limited thereto, and any person skilled in the art, who is within the scope of the present invention, should make equivalent substitutions or modifications according to the technical scheme of the present invention and the inventive concept thereof, and should be covered by the scope of the present invention.

Claims (7)

1. A deep-Monte-Carlo-based reinforcement learning military chess AI system, the system comprising: a military chess combat module, a military chess move generation module, a military chess feature acquisition module, a military chess move decision module, and a decision evaluation module;
the military chess combat module is used for displaying the situation of both sides, executing and exchanging the decisions of both sides, and adjudicating the engagement results between pieces;
the military chess move generation module is used for searching the current board position, producing all feasible moves of the current player, and sending them to the military chess feature acquisition module;
the military chess feature acquisition module obtains the current own-side piece information, enemy piece information and the two most recent moves from the military chess combat module, obtains all feasible moves from the military chess move generation module, converts these data into a suitable encoded format as the state value, and inputs it into the military chess move decision module;
the military chess move decision module is divided into a training stage and an actual-combat stage;
wherein the training stage comprises: generating evaluation values for all feasible moves of the current player with a deep Monte Carlo network decision technique according to the input state value, i.e., evaluating all feasible moves of the player under the current situation, then selecting the move with the maximum evaluation value, and continuing to select moves in this way until the game ends in a win or loss; the decision network is then trained according to the final feedback information of the decision evaluation module and its parameters are optimized;
the actual-combat stage comprises: the fully trained decision model no longer updates its network parameters, and the optimal move is selected with the deep Monte Carlo decision technique according to the input state value;
the decision evaluation module evaluates the scores of all decisions in the whole game according to the final game result and generates feedback information for the military chess move decision module.
2. The deep-Monte-Carlo-based reinforcement learning military chess AI system of claim 1, wherein the specific execution flow of the system comprises the following steps:
S1, the military chess combat module stores the current board information in real time, and executes and adjudicates the decisions of the two combatants;
S2, the military chess move generation module analyses and matches all possible moves of the current player according to the current board information; the structure of a move comprises: the coordinates of the own-side piece, the type of the own-side piece, the target coordinates, and the type of the target coordinates;
S3, the military chess feature acquisition module extracts all information visible from the current player's viewpoint according to the current board information, including own-side piece information, enemy piece information, the number of rounds in which no engagement has occurred, all moves of the current player, and the moves of both players over the most recent twenty rounds; this information is used as the input of the military chess move decision module;
S4, the military chess move decision module evaluates each move under the current position through the deep Monte Carlo network according to the input state value and selects a move; the military chess combat module executes the move, switches to the other player's viewpoint, and the flow returns to step S2, until the military chess combat module judges that the game has ended;
S5, the decision evaluation module optimizes the deep Monte Carlo network in the military chess move decision module by computing the mean square error from the collected game data, the final result, and the state values recorded during the game.
3. The deep-Monte-Carlo-based reinforcement learning military chess AI system of claim 2, wherein the piece information in S3 specifically comprises: the current coordinates, piece type, survival state, whether the piece is on a railway, and whether the piece is in a camp, for every piece; according to the survival state, pieces are divided into surviving pieces and dead pieces; for dead pieces, the current coordinates, survival state, on-railway and in-camp features are all 0, distinguishing them from surviving pieces.
4. The deep-Monte-Carlo-based reinforcement learning military chess AI system of claim 2, wherein the deep Monte Carlo network in S4 is a value network that estimates the value of actions and states; the evaluation network generates the value function of a move decision; the network structure comprises an LSTM network that receives the features of the most recent twenty moves, the output of the LSTM network is concatenated with the action state value and the board state value and fed into fully connected layers, producing a number of evaluation values equal to the number of candidate moves; the information fed into the fully connected layers specifically comprises own-side piece information, enemy piece information, the feature vector produced by the LSTM network from the most recent twenty moves, and the currently feasible moves.
5. The deep-Monte-Carlo-based reinforcement learning military chess AI system of claim 1, wherein the training method of the training stage specifically comprises the following contents:
A1, establishing experience pools B_1 and B_2 for the two players; each experience pool is a buffer that stores the input features F of every round of the two players' games, its capacity is manually set to S, and when the stored data reaches S, training is performed, the experience pool is emptied, and storage restarts;
A2, establishing decision networks Q_1 and Q_2 for the two players; each decision network reads the input feature F and outputs an evaluation value, and the move action a_t in round t is selected as follows:
a_t = argmax_a Q(s_t, a), with probability p = (1 - ε)
a_t = random(s_t, a), with probability p = ε
the reward value of each round is determined by the final result:
r_t ← r_t + γ·r_{t+1}
where s_t is the board state from the current player's viewpoint in round t; r_t is the reward obtained by the current player's action in round t; argmax_a means selecting, from the evaluation values computed by the decision network Q over the board state s_t and the action set a of the current round, the action with the maximum evaluation value as the current-round action a_t; random means randomly selecting an action from the current-round action set a as the current-round action a_t; ε is the exploration probability of the current round; p is the probability, derived from the exploration probability, with which each selection rule is applied; γ is a decay factor, meaning that the current-round reward is determined jointly by the current-round reward and the decayed next-round reward;
A3, learning from the experience pool a total of T times: after each game ends, the round features accumulated during the game are put into the experience pool B, and one learning pass is performed whenever the experience pool is full, until T learning passes have finally been completed;
A4, in each learning pass, the feature F_t in the experience pool B is first fed into the network Q to obtain an evaluation value G_t; the loss function is the mean square error between G_t and the recorded reward r_t, and the Adam algorithm is used to mitigate problems of learning-rate vanishing, slow convergence, and abnormal parameter updates in training the network Q.
6. The deep-Monte-Carlo-based reinforcement learning military chess AI system of claim 1, wherein the actual-combat stage refers to human-machine play, where the "human" is a human military chess player and the "machine" is the military chess AI; the specific execution flow comprises the following steps:
B1, the human military chess player chooses to move first or second on the start interface and arranges his or her own military chess layout;
B2, the first-moving player plays; the military chess feature acquisition module extracts all information visible from the current player's viewpoint according to the current board information, specifically including own-side piece information, enemy piece information, the number of rounds in which no engagement has occurred, all moves of the current player, and the moves of both players over the most recent twenty rounds; the military chess move generation module checks whether the current player's move violates the game rules, and if so, the operation cannot be executed;
B3, the military chess combat module executes the first-moving player's move, adjudicates the engagement result according to the rules, and updates the board information of both sides;
B4, the second-moving player plays; the military chess feature acquisition module extracts all information visible from the current player's viewpoint according to the current board information, specifically including own-side piece information, enemy piece information, the number of rounds in which no engagement has occurred, all moves of the current player, and the moves of both players over the most recent twenty rounds; the military chess move generation module checks whether the current player's move violates the game rules, and if so, the operation cannot be executed;
B5, the military chess combat module executes the second-moving player's move, adjudicates the engagement result according to the rules, and updates the board information of both sides;
B6, if the game has not ended, the flow returns to step B2; otherwise, the winner is declared.
7. The deep-Monte-Carlo-based reinforcement learning military chess AI system of claim 6, wherein the military chess AI also has the function of deciding whether to move first or second; if the military chess AI is the first-moving player, the first-moving player's move in B2 is evaluated and determined by the deep Monte Carlo network of the military chess AI; if the military chess AI is the second-moving player, the second-moving player's move in B4 is evaluated and determined by the deep Monte Carlo network of the military chess AI.
CN202310825710.7A 2023-07-06 2023-07-06 Reinforced learning military chess AI system based on deep Monte Carlo Active CN116881656B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202310825710.7A CN116881656B (en) 2023-07-06 2023-07-06 Reinforced learning military chess AI system based on deep Monte Carlo

Publications (2)

Publication Number Publication Date
CN116881656A (en) 2023-10-13
CN116881656B (en) 2024-03-22

Family

ID=88263647

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202310825710.7A Active CN116881656B (en) 2023-07-06 2023-07-06 Reinforced learning military chess AI system based on deep Monte Carlo

Country Status (1)

Country Link
CN (1) CN116881656B (en)

Citations (10)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20160314580A1 (en) * 2013-11-06 2016-10-27 H. Lee Moffitt Cancer Center And Research Institute, Inc. Pathology case review, analysis and prediction
CN108985458A (en) * 2018-07-23 2018-12-11 东北大学 A kind of double tree monte carlo search algorithms of sequential synchronous game
CN110119804A (en) * 2019-05-07 2019-08-13 安徽大学 A kind of Ai Ensitan chess game playing algorithm based on intensified learning
US20190311220A1 (en) * 2018-04-09 2019-10-10 Diveplane Corporation Improvements To Computer Based Reasoning and Artificial Intellignence Systems
CN110555517A (en) * 2019-09-05 2019-12-10 中国石油大学(华东) Improved chess game method based on Alphago Zero
CN113599798A (en) * 2021-08-25 2021-11-05 上海交通大学 Chinese chess game learning method and system based on deep reinforcement learning method
CN114462566A (en) * 2022-02-25 2022-05-10 中国科学技术大学 Method for realizing real-time determination of optimal decision action by intelligent real-time decision system
CN114997054A (en) * 2022-05-31 2022-09-02 清华大学 Method and device for simulating chess playing of chess
CN115054906A (en) * 2022-06-30 2022-09-16 成都潜在人工智能科技有限公司 Chess and card reinforcement learning method, system and medium based on Monte Carlo sampling
CN116128060A (en) * 2023-02-18 2023-05-16 之江实验室 Chess game method based on opponent modeling and Monte Carlo reinforcement learning


Also Published As

Publication number Publication date
CN116881656B (en) 2024-03-22


Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant