AU2018101314A4 - A MCST and deep neural network based FIR battle platform - Google Patents

A MCST and deep neural network based FIR battle platform

Info

Publication number
AU2018101314A4
AU2018101314A4 (application AU2018101314A)
Authority
AU
Australia
Prior art keywords
model
monte carlo
neural network
mcts
player
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Ceased
Application number
AU2018101314A
Inventor
Kaidi Chen
Zeyuan Dai
Andi Liu
Lihang Liu
Lixian Zhao
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Individual
Original Assignee
Individual
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Individual filed Critical Individual
Priority to AU2018101314A priority Critical patent/AU2018101314A4/en
Application granted granted Critical
Publication of AU2018101314A4 publication Critical patent/AU2018101314A4/en
Ceased legal-status Critical Current
Anticipated expiration legal-status Critical


Classifications

    • A HUMAN NECESSITIES
    • A63 SPORTS; GAMES; AMUSEMENTS
    • A63F CARD, BOARD, OR ROULETTE GAMES; INDOOR GAMES USING SMALL MOVING PLAYING BODIES; VIDEO GAMES; GAMES NOT OTHERWISE PROVIDED FOR
    • A63F13/00 Video games, i.e. games using an electronically generated display having two or more dimensions
    • A63F13/60 Generating or modifying game content before or while executing the game program, e.g. authoring tools specially adapted for game development or game-integrated level editor
    • A63F13/67 Generating or modifying game content before or while executing the game program, e.g. authoring tools specially adapted for game development or game-integrated level editor adaptively or by learning from player actions, e.g. skill level adjustment or by storing successful combat sequences for re-use
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F17/00 Digital computing or data processing equipment or methods, specially adapted for specific functions
    • G06F17/10 Complex mathematical operations
    • G06F17/18 Complex mathematical operations for evaluating statistical data, e.g. average values, frequency distributions, probability functions, regression analysis
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 Computing arrangements based on biological models
    • G06N3/02 Neural networks
    • G06N3/08 Learning methods
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 Computing arrangements based on biological models
    • G06N3/02 Neural networks
    • G06N3/08 Learning methods
    • G06N3/088 Non-supervised learning, e.g. competitive learning

Landscapes

  • Engineering & Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Theoretical Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Data Mining & Analysis (AREA)
  • Mathematical Physics (AREA)
  • General Engineering & Computer Science (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Software Systems (AREA)
  • Computational Linguistics (AREA)
  • Computational Mathematics (AREA)
  • Molecular Biology (AREA)
  • Computing Systems (AREA)
  • Evolutionary Computation (AREA)
  • Biophysics (AREA)
  • Biomedical Technology (AREA)
  • Artificial Intelligence (AREA)
  • Health & Medical Sciences (AREA)
  • Multimedia (AREA)
  • Pure & Applied Mathematics (AREA)
  • General Health & Medical Sciences (AREA)
  • Mathematical Optimization (AREA)
  • Mathematical Analysis (AREA)
  • Evolutionary Biology (AREA)
  • Operations Research (AREA)
  • Probability & Statistics with Applications (AREA)
  • Bioinformatics & Computational Biology (AREA)
  • Algebra (AREA)
  • Databases & Information Systems (AREA)
  • Bioinformatics & Cheminformatics (AREA)
  • Management, Administration, Business Operations System, And Electronic Commerce (AREA)

Abstract

This invention is a self-play model that combines a neural network with Monte Carlo tree search (MCTS). The model plays against itself, simulating matches against a human opponent, and improves through this self-play. In the neural network, we adopt a combination of two outputs: a value network that analyses board states and a policy network that selects the position of the next piece. The role of the Monte Carlo tree search in this model is to use Monte Carlo simulation to assess the value of every state in the tree. The model dramatically reduces the time consumed because, when the neural network is combined with the Monte Carlo tree search, the network guides the search toward promising moves, so the model does not need to simulate the entire course of every game to reach a reliable result. After finishing all the steps in the loop, the model restarts the loop from the beginning and initiates the whole process again. As a result, each iteration differs from the previous one, since the model constantly improves itself by repeating the loop.

Description

TITLE
A MCST and deep neural network based FIR battle platform
FIELD OF THE INVENTION
This invention relates to an FIR battle platform that serves both entertainment and educational purposes. It improves the experience of players when they compete with AI robots, and players can learn tips and skills of FIR through these matches.
BACKGROUND OF THE INVENTION
With the rapid development of the Internet, the number of electronic products on the market keeps increasing, and Reinforcement Learning has become a popular research topic in the analysis and prediction fields. In this invention, we use Reinforcement Learning to play FIR (Five in a Row). Traditionally, many game-playing algorithms used exhaustive search to calculate the next move in human-machine play. However, this method becomes impractical when the board is relatively large or when more pieces are required in a line, since the computer needs far more time to calculate. The exhaustive method is therefore not efficient in such cases. Yet if the AI learns how to play the game and acquires the tactics needed to win, it becomes harder to defeat and its calculation time is shortened as well.
Reinforcement Learning is an important machine learning method. It uses a reward signal that lets the system make choices on its own; the system starts with no prior information. The system receives a reward when it completes a certain task, so in the next iteration it may adjust its behaviour to obtain a greater reward than the last time. The system thus becomes smarter and is eventually able to perform the task well, sometimes even better than a human, just like AlphaGo.
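The following minimal Python sketch (an illustration only, not part of the claimed system) shows this reward-driven adjustment: an agent repeatedly chooses an action, observes a reward, and nudges its value estimates so that better actions are chosen more often.

import random

values = {"a": 0.0, "b": 0.0}          # estimated value of each action
true_reward = {"a": 0.2, "b": 0.8}     # hidden reward probabilities (toy example)
alpha, epsilon = 0.1, 0.2              # learning rate, exploration rate

for step in range(1000):
    # explore occasionally, otherwise exploit the current best estimate
    action = random.choice(list(values)) if random.random() < epsilon \
        else max(values, key=values.get)
    reward = 1.0 if random.random() < true_reward[action] else 0.0
    # move the estimate toward the observed reward
    values[action] += alpha * (reward - values[action])

print(values)  # the estimate for "b" should end up higher than for "a"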
In this invention, we plan to use the Monte Carlo tree search and a Convolutional Neural Network to build a Reinforcement Learning system. The Monte Carlo tree search provides a set of rules for exploring the game, and with the tree we can use a computer to run FIR experiments, making the process faster and easier. In addition, we use the Convolutional Neural Network to analyse and evaluate every move, since a CNN can use numerous layers and the gradient descent method to approach accurate values. With these components, the computer is able to simulate a thinking process similar to the human brain.
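As a small illustration of the gradient descent idea mentioned above (purely illustrative, not the network used in this invention), the following snippet steps a single parameter against the gradient of a squared error until it approaches the target value.

def gradient_descent(target=3.0, lr=0.1, steps=100):
    w = 0.0
    for _ in range(steps):
        grad = 2 * (w - target)    # derivative of (w - target)**2
        w -= lr * grad             # move against the gradient
    return w

print(gradient_descent())          # converges to approximately 3.0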
SUMMARY OF THE INVENTION
Our invention is the design and implementation of an artificial intelligence program that plays the simple board game Gomoku based on Monte Carlo tree search and a convolutional neural network. We used no human expert knowledge to guide the AI, which means there is no supervised component in our program.
Self-play is the key to our entire design. The machine plays against itself and learns from self-play. We use the Monte Carlo tree search to simulate self-play, and whenever a new move changes the state of the board, we feed the board state into a CNN. The outputs of the CNN then influence the Monte Carlo tree search. The MCTS produces self-play data that trains the network, and the network produces information that guides the MCTS. These procedures compose the entire loop of our program, sketched below.
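The following Python sketch illustrates this closed loop under simplifying assumptions: the DummyNet class, the flat toy board and the random move sampling are stand-ins for the real policy-value CNN and game logic, and no actual learning takes place. Only the data flow is shown: self-play fills a buffer, the buffer trains the network, and the updated network guides the next games.

import random

class DummyNet:
    """Stand-in for the policy-value CNN (uniform priors, no learning)."""
    def evaluate(self, board):
        moves = [i for i, v in enumerate(board) if v == 0]
        return {m: 1.0 / len(moves) for m in moves}, 0.0   # (move priors, value)
    def train(self, batch):
        pass  # the real network would be fitted on the batch here

def self_play(net, board_size=9):
    """Play one toy game, sampling moves from the net's priors; record (state, move)."""
    board, data, player = [0] * board_size, [], 1
    while 0 in board:
        probs, _value = net.evaluate(board)
        move = random.choices(list(probs), weights=list(probs.values()))[0]
        data.append((tuple(board), move))
        board[move] = player
        player = -player                 # players alternate
    return data

net, buffer = DummyNet(), []
for iteration in range(3):               # outer training loop
    for _ in range(5):                   # several self-play games per iteration
        buffer.extend(self_play(net))
    net.train(buffer)                    # new data updates the net for the next round
print(f"collected {len(buffer)} training samples")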
Monte Carlo tree search is used to decide which position the machine should take on the next move. It provides very powerful search ability and is quite easy to implement. In our program, each node of the tree represents a certain state of the board. First we select a node that is a leaf node or represents the end of a game. If the selected node does not represent the end of a game, we try to expand it, which corresponds to deciding on the next move. When the expansion is done, we update the node and all of its parent nodes. This is a brief description of the Monte Carlo tree search used in our program.
Once the tree is updated, we feed the board state into our CNN. The CNN takes a certain board state as input and outputs the probabilities of every next move together with an evaluation of that board state. The move probabilities are then used in the Monte Carlo tree search to decide which node to select.
Our CNN is composed of three convolutional layers containing 32, 64 and 128 filters of size 3 by 3, respectively. A ReLU activation function follows each convolutional layer, after which the network splits into two output heads: a policy output and a value output. The policy head uses four 1 by 1 filters to reduce the dimension, applies one fully connected layer, and then uses a softmax function to output the probability of every position for the next move. The value head uses two 1 by 1 filters to reduce the dimension, applies one fully connected layer, and then uses a tanh function to output the evaluation of the board state.
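A sketch of this architecture in PyTorch is shown below. The three convolutional layers and the two heads follow the description above; the padding, the fully connected layer sizes, the class name PolicyValueNet and the four-plane 8 by 8 input are assumptions where the text does not fix them.

import torch
import torch.nn as nn
import torch.nn.functional as F

class PolicyValueNet(nn.Module):
    def __init__(self, board_size=8):
        super().__init__()
        # three convolutional layers with 32, 64 and 128 filters of size 3x3
        self.conv1 = nn.Conv2d(4, 32, kernel_size=3, padding=1)
        self.conv2 = nn.Conv2d(32, 64, kernel_size=3, padding=1)
        self.conv3 = nn.Conv2d(64, 128, kernel_size=3, padding=1)
        # policy head: four 1x1 filters -> fully connected -> softmax
        self.policy_conv = nn.Conv2d(128, 4, kernel_size=1)
        self.policy_fc = nn.Linear(4 * board_size * board_size,
                                   board_size * board_size)
        # value head: two 1x1 filters -> fully connected -> tanh
        self.value_conv = nn.Conv2d(128, 2, kernel_size=1)
        self.value_fc = nn.Linear(2 * board_size * board_size, 1)

    def forward(self, x):
        x = F.relu(self.conv1(x))
        x = F.relu(self.conv2(x))
        x = F.relu(self.conv3(x))
        p = F.relu(self.policy_conv(x)).flatten(start_dim=1)
        p = F.softmax(self.policy_fc(p), dim=1)        # probability of every position
        v = F.relu(self.value_conv(x)).flatten(start_dim=1)
        v = torch.tanh(self.value_fc(v))               # board evaluation in [-1, 1]
        return p, v

# example: a batch containing one 8x8 board with four feature planes
probs, value = PolicyValueNet()(torch.zeros(1, 4, 8, 8))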
The board size is 8 by 8, and our AI program performed very well when it played against a pure MCTS AI program. We then tried to expand the board size, for example to 10 by 10.
DESCRIPTION OF THE DRAWINGS
The attached drawings serve as explanation and description and consist of:
Fig. 1-5: Flow charts.
DESCRIPTION OF PREFERRED EMBODIMENT
As shown in Figure 1, before the training process begins, the system first initializes parameters, including the size of the chess board, the learning rate, the buffer size, the batch size, the training duration and the checking frequency. After initialization, the system enters a cycle and performs numerous games of self-play in order to collect board-state data. The board states are also rotated and mirrored by the system to generate equivalent states in different orientations; this process avoids redundant computation while enlarging the database, as sketched below.
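The rotation and mirroring step can be sketched as follows; the array shapes and the helper name augment are illustrative only. Each recorded board state and its move-probability map yield eight equivalent training samples.

import numpy as np

def augment(state, probs):
    """state, probs: 2-D arrays of identical shape (board layout / move-probability map)."""
    samples = []
    for k in range(4):                        # 0, 90, 180 and 270 degree rotations
        rot_s, rot_p = np.rot90(state, k), np.rot90(probs, k)
        samples.append((rot_s, rot_p))
        samples.append((np.fliplr(rot_s), np.fliplr(rot_p)))  # mirrored copy
    return samples                            # 8 equivalent training samples

board = np.zeros((8, 8)); board[3, 4] = 1
print(len(augment(board, np.full((8, 8), 1 / 64))))   # -> 8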
The data then passes to the buffer, where the system checks whether the amount of data collected has reached the batch size. If not, the system continues to collect self-play data. If it has, a batch is randomly sampled from the buffer and used to train the policy-value network several times. After training, the network returns the loss and entropy values, which are used to adjust the learning rate accordingly.
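A minimal sketch of this buffer check is given below; the buffer size, the batch size and the train_step callable are placeholders rather than the actual configuration.

import random
from collections import deque

buffer_size, batch_size = 10000, 512
buffer = deque(maxlen=buffer_size)            # bounded replay buffer of self-play data

def maybe_train(train_step):
    if len(buffer) < batch_size:
        return None                           # not enough data: keep collecting self-play
    batch = random.sample(list(buffer), batch_size)   # random selection from the buffer
    return train_step(batch)                  # in the real system this returns loss and entropy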
After the entire process is completed, we compare our model with a standalone Monte Carlo tree search by letting the two play against each other to test the result.
As shown in Figure 2, during each training update the system uses the previous data set to calculate the old move probabilities and winning rate, then performs a training cycle. After the update, the system uses the same data set to calculate the new probabilities and winning rate according to the policy network. The system then uses both the old and new results to calculate the KL divergence, which is used to assess whether the learning rate is appropriate and to adjust it accordingly.
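The following sketch shows one plausible form of this KL-based adjustment. The thresholds and the 1.5 multiplier are assumptions borrowed from common AlphaZero-style implementations, not values taken from this invention.

import numpy as np

def adjust_learning_rate(old_probs, new_probs, lr, lr_multiplier, kl_target=0.02):
    """old_probs, new_probs: (batch, moves) arrays of move probabilities."""
    kl = np.sum(old_probs * (np.log(old_probs + 1e-10)
                             - np.log(new_probs + 1e-10)), axis=1).mean()
    if kl > kl_target * 2 and lr_multiplier > 0.1:
        lr_multiplier /= 1.5            # the update moved the policy too far: slow down
    elif kl < kl_target / 2 and lr_multiplier < 10:
        lr_multiplier *= 1.5            # the update was too timid: speed up
    return lr * lr_multiplier, lr_multiplier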
Figure 3 represents the application of the Monte Carlo tree search. During each tree search, the Monte Carlo tree collects the current state and establishes a new root according to that state. The tree then repeatedly selects the child with the maximum Q plus U value until a leaf is reached. When a leaf is detected, the tree determines whether the game ends at that leaf. If not, the leaf is expanded, and the system recursively updates the visit counts and Q values of its parents using the value calculated by the policy-value network. If the leaf does represent the end of the game, a fixed terminal value is used to update the parents instead. After numerous tree searches, a single Monte Carlo tree is able to complete one round of a game.
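The tree-node behaviour described above can be sketched as follows. The exact form of the exploration term U is not specified in the text; the PUCT-style term used here, and the constant c_puct, are assumptions.

import math

class TreeNode:
    def __init__(self, parent=None, prior=1.0):
        self.parent, self.children = parent, {}     # children: move -> TreeNode
        self.n_visits, self.q, self.prior = 0, 0.0, prior

    def u(self, c_puct=5.0):
        # exploration bonus: large for rarely visited children with high prior
        return c_puct * self.prior * math.sqrt(self.parent.n_visits) / (1 + self.n_visits)

    def select(self):
        # pick the child with the maximum Q + U
        return max(self.children.items(), key=lambda kv: kv[1].q + kv[1].u())

    def expand(self, move_priors):
        # create one child per legal move, with priors from the policy-value net
        for move, p in move_priors:
            self.children.setdefault(move, TreeNode(self, p))

    def backup(self, value):
        # recursively update the visit counts and Q of every parent
        if self.parent:
            self.parent.backup(-value)      # the value flips sign between players
        self.n_visits += 1
        self.q += (value - self.q) / self.n_visits

root = TreeNode()
root.expand([(12, 0.6), (20, 0.4)])          # priors from the policy-value net
move, child = root.select()                   # child with the maximum Q + U
child.backup(1.0)                             # propagate a result back to the root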
Figure 4 shows the structure of the Convolutional Neural Network, which takes board states of size 8 x 8 x 4 as input. The CNN is composed of three convolutional layers, each with a 3 by 3 kernel and 32, 64 and 128 filters respectively, to extract features. The network then applies separate convolutional layers to produce the move probabilities and the winning rate, which allows the two outputs to be trained separately. Next, the two branches are flattened and fed into fully connected layers to obtain the new probabilities and winning rate. Each time the CNN completes this pass, it calculates the loss value and then updates its parameters.
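The loss calculation mentioned above is not spelled out in the text; the sketch below uses the common AlphaZero-style combination of a value mean-squared error and a policy cross-entropy, which should be read as an assumption rather than the exact loss of this invention.

import torch
import torch.nn.functional as F

def policy_value_loss(pred_probs, pred_value, mcts_probs, winner):
    """pred_probs: (N, 64), pred_value: (N, 1), mcts_probs: (N, 64), winner: (N,)"""
    value_loss = F.mse_loss(pred_value.view(-1), winner)          # winning-rate error
    policy_loss = -torch.mean(torch.sum(mcts_probs * torch.log(pred_probs + 1e-10), dim=1))
    entropy = -torch.mean(torch.sum(pred_probs * torch.log(pred_probs + 1e-10), dim=1))
    return value_loss + policy_loss, entropy   # entropy is reported for monitoring only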
THE FUNCTION OF THE PLATFORM
Figure 5 shows the interactive platform in our project, which is intended for both entertainment and learning. Users can select a difficulty level to optimize their experience. The platform also includes a hint system: when users are unsure of the next move, they can enable the hint system, which automatically recommends moves, indicated by pieces of different sizes and colours that show the level of recommendation. The platform also provides undo and redo functions, all dedicated to improving the users' learning and entertainment experience.
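A hypothetical sketch of the undo/redo behaviour is shown below: moves are kept on two stacks so that a player can step backwards and forwards through the game. This is an illustration, not the platform's actual implementation.

class MoveHistory:
    def __init__(self):
        self.undo_stack, self.redo_stack = [], []

    def play(self, move):
        self.undo_stack.append(move)
        self.redo_stack.clear()          # a fresh move invalidates the redo history

    def undo(self):
        if self.undo_stack:
            self.redo_stack.append(self.undo_stack.pop())

    def redo(self):
        if self.redo_stack:
            self.undo_stack.append(self.redo_stack.pop())

history = MoveHistory()
history.play((3, 4)); history.play((4, 4))
history.undo(); history.redo()
print(history.undo_stack)                # [(3, 4), (4, 4)]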
COMPARISON WITH ANOTHER MODEL: pure MCTS vs. neural network + MCTS
Number of game rounds: 20
Basic information of the pure MCTS player: 2000 Monte Carlo searches per move.
Basic information of the neural network + MCTS player: 800 Monte Carlo searches per move; this player's network had already been trained for almost 2000 iterations.
Per-game timing results (total time, number of moves, and the average time per move of each player):
average cost time: mcts player: 2.115824 s, pure mcts player: 13.560057 s
cost time: 277.254147 s, play_count: 33; average cost time: mcts player: 2.254340 s, pure mcts player: 14.932645 s
cost time: 77.882141 s, play_count: 10; average cost time: mcts player: 2.219113 s, pure mcts player: 13.356912 s
cost time: 60.968000 s, play_count: 9; average cost time: mcts player: 1.869358 s, pure mcts player: 12.905060 s
cost time: 442.511613 s, play_count: 64; average cost time: mcts player: 2.049253 s, pure mcts player: 11.778762 s
cost time: 65.496403 s, play_count: 9; average cost time: mcts player: 2.312566 s, pure mcts player: 13.483017 s
cost time: 81.285615 s, play_count: 10; average cost time: mcts player: 2.275838 s, pure mcts player: 13.980887 s
cost time: 64.253853 s, play_count: 9; average cost time: mcts player: 2.118642 s, pure mcts player: 13.414910 s
cost time: 113.065399 s, play_count: 14; average cost time: mcts player: 2.529724 s, pure mcts player: 13.622332 s
cost time: 65.616047 s, play_count: 9; average cost time: mcts player: 2.083739 s, pure mcts player: 13.799088 s
cost time: 269.548855 s, play_count: 34; average cost time: mcts player: 2.127301 s, pure mcts player: 13.728455 s
cost time: 67.489317 s, play_count: 9; average cost time: mcts player: 2.317580 s, pure mcts player: 13.975354 s
cost time: 85.224424 s, play_count: 10; average cost time: mcts player: 2.110235 s, pure mcts player: 14.934650 s
cost time: 66.184902 s, play_count: 9; average cost time: mcts player: 2.591482 s, pure mcts player: 13.306873 s
cost time: 166.066459 s, play_count: 20; average cost time: mcts player: 2.527148 s, pure mcts player: 14.079400 s
cost time: 65.603459 s, play_count: 9; average cost time: mcts player: 2.111826 s, pure mcts player: 13.760837 s
cost time: 282.713064 s, play_count: 33; average cost time: mcts player: 2.258475 s, pure mcts player: 14.504319 s
cost time: 279.071302 s, play_count: 32; average cost time: mcts player: 2.286256 s, pure mcts player: 15.155575 s
cost time: 146.410816 s, play_count: 18; average cost time: mcts player: 2.509988 s, pure mcts player: 13.757546 s
Total number of rounds: 20
Pure MCTS player: Win: 2, Lose: 17, Tie: 1
Neural network + MCTS player: Win: 17, Lose: 2, Tie: 1
According to the game results collected, we can conclude that the MCTS-plus-neural-network player prevailed in most matches. Combining the neural network with the MCTS therefore produces a much more intelligent AI. Besides the winning ratio, the average computation time of the player we generated is remarkably shorter than that of the standalone MCTS player. The reason is that the neural network collaborating with the MCTS is significantly more efficient than a standalone MCTS, which requires a complete simulation of the remaining game from every position. In contrast, our combined model enables the AI to compute precisely while also reducing the time consumed.

Claims (3)

  CLAIMS
  1. A method for human-computer battle, especially suited for children so that they can improve their intelligence from a very early age, wherein:
    1) human players can adjust the AI player's difficulty to suit different kinds of players; there are three difficulty levels to choose from: easy, middle and hard;
  2. 2) if a human player puts a piece on a wrong position, the player has the chance to undo the last step and select another position; if the player decides not to undo the move after all, clicking "redo" restores the position;
  3. 3) when a human player does not know how to make the next move, grey dots of different sizes appear; the player can choose these positions so as to have a much higher winning rate.
AU2018101314A 2018-09-07 2018-09-07 A MCST and deep neural network based FIR battle platform Ceased AU2018101314A4 (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
AU2018101314A AU2018101314A4 (en) 2018-09-07 2018-09-07 A MCST and deep neural network based FIR battle platform

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
AU2018101314A AU2018101314A4 (en) 2018-09-07 2018-09-07 A MCST and deep neural network based FIR battle platform

Publications (1)

Publication Number Publication Date
AU2018101314A4 true AU2018101314A4 (en) 2018-10-11

Family

ID=63719890

Family Applications (1)

Application Number Title Priority Date Filing Date
AU2018101314A Ceased AU2018101314A4 (en) 2018-09-07 2018-09-07 A MCST and deep neural network based FIR battle platform

Country Status (1)

Country Link
AU (1) AU2018101314A4 (en)


Legal Events

Date Code Title Description
FGI Letters patent sealed or granted (innovation patent)
MK22 Patent ceased section 143a(d), or expired - non payment of renewal fee or expiry