CN110119804A - An Einstein chess game-playing algorithm based on reinforcement learning - Google Patents

An Einstein chess game-playing algorithm based on reinforcement learning

Info

Publication number
CN110119804A
CN110119804A
Authority
CN
China
Prior art keywords
chess
network
value
movement
chessboard
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN201910375250.6A
Other languages
Chinese (zh)
Inventor
袁仪驰
吴蕾
姚超超
李学俊
陆梦宣
沈恒恒
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Anhui University
Original Assignee
Anhui University
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Anhui University filed Critical Anhui University
Priority to CN201910375250.6A priority Critical patent/CN110119804A/en
Publication of CN110119804A publication Critical patent/CN110119804A/en
Pending legal-status Critical Current


Classifications

    • A HUMAN NECESSITIES
    • A63 SPORTS; GAMES; AMUSEMENTS
    • A63F CARD, BOARD, OR ROULETTE GAMES; INDOOR GAMES USING SMALL MOVING PLAYING BODIES; VIDEO GAMES; GAMES NOT OTHERWISE PROVIDED FOR
    • A63F11/00 Game accessories of general use, e.g. score counters, boxes
    • A63F11/0074 Game concepts, rules or strategies
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 Computing arrangements based on biological models
    • G06N3/02 Neural networks
    • G06N3/04 Architecture, e.g. interconnection topology
    • G06N3/045 Combinations of networks
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 Computing arrangements based on biological models
    • G06N3/02 Neural networks
    • G06N3/04 Architecture, e.g. interconnection topology
    • G06N3/047 Probabilistic or stochastic networks
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 Computing arrangements based on biological models
    • G06N3/02 Neural networks
    • G06N3/08 Learning methods
    • A HUMAN NECESSITIES
    • A63 SPORTS; GAMES; AMUSEMENTS
    • A63F CARD, BOARD, OR ROULETTE GAMES; INDOOR GAMES USING SMALL MOVING PLAYING BODIES; VIDEO GAMES; GAMES NOT OTHERWISE PROVIDED FOR
    • A63F11/00 Game accessories of general use, e.g. score counters, boxes
    • A63F11/0074 Game concepts, rules or strategies
    • A63F2011/0086 Rules
    • A63F2011/0093 Rules characterised by the game theory or winning strategy

Landscapes

  • Engineering & Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Theoretical Computer Science (AREA)
  • General Health & Medical Sciences (AREA)
  • Computing Systems (AREA)
  • Biomedical Technology (AREA)
  • Biophysics (AREA)
  • Computational Linguistics (AREA)
  • Data Mining & Analysis (AREA)
  • Evolutionary Computation (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Molecular Biology (AREA)
  • Artificial Intelligence (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Mathematical Physics (AREA)
  • Software Systems (AREA)
  • Health & Medical Sciences (AREA)
  • Multimedia (AREA)
  • Probability & Statistics with Applications (AREA)
  • Management, Administration, Business Operations System, And Electronic Commerce (AREA)

Abstract

The invention discloses a self-learning game-playing algorithm based on deep reinforcement learning for Einstein chess (EinStein würfelt nicht). BP neural networks are applied to the valuation of the board and to the move-selection strategy of a Monte Carlo tree search algorithm. The rules are learned from self-play through reinforcement learning: the features of the board are learned and the network parameters are adjusted step by step, so that the BP neural networks' evaluation of board value and computation of the move policy become increasingly accurate, and the performance of the whole game-playing algorithm improves gradually. The invention uses two BP neural networks as the value evaluation function and the move policy function of Einstein chess, and uses a reinforcement learning algorithm as the evolution mechanism that adjusts the BP network parameters. This removes the limitation that the level of existing Einstein chess training sets is bounded by human skill, and raises the upper limit of Einstein chess playing strength.

Description

An Einstein chess game-playing algorithm based on reinforcement learning
Technical field
The invention belongs to the field of machine game playing for board games, and in particular relates to an Einstein chess game-playing algorithm based on reinforcement learning.
Background art
Artificial neural networks (Artificial Neural Network, ANN) have been a research hotspot in artificial intelligence since the 1980s. A neural network is a computational model composed of a large number of interconnected nodes (neurons). Each node implements a particular processing function, called its activation function. Each connection between two nodes carries a weight for the signal passing through it, which corresponds to the learned experience of the network. The output of the network depends on its connection pattern, its weight values and its activation functions. The network itself usually approximates some function or algorithm found in nature, or expresses a logical strategy.
Reinforcement learning is an important machine learning method with many applications in fields such as intelligent control, robotics and prediction. In reinforcement learning, an agent learns by trial and error and self-examination, guiding its behaviour through the rewards obtained from interacting with the environment, with the goal of maximizing the reward. Reinforcement learning differs from supervised learning mainly in the learning signal: the reinforcement signal provided by the environment evaluates the quality of the action taken (usually a scalar signal), rather than telling the reinforcement learning system (RLS) how to produce the correct action. Because the environment provides little information, the RLS must learn through repeated interaction with it; in this action-evaluation loop it acquires knowledge and improves its action policy to adapt to the environment.
The basic ways to raise the playing strength of a game system are to improve evaluation accuracy and search efficiency. On the search side, the Monte Carlo Tree Search (MCTS) algorithm handles games with strong randomness well. MCTS is a statistical method that uses previous search results to guide the search direction and selectively grows a tree, which expands more in the directions of higher value.
The rules of Einstein chess are as follows:
1. The board is a 5 × 5 grid; each cell is a square. The upper-left corner is Red's starting area and the lower-right corner is Blue's starting area, as shown in Fig. 1.
2. Red and Blue each have 6 square pieces, numbered 1 to 6. At the start, each side may place its pieces in any arrangement within its starting area.
3. The two sides roll a die in turn and then move the piece whose number matches the die. If that piece has already been removed from the board, the player may move the surviving piece whose number is nearest above or nearest below the rolled number (a sketch illustrating rules 3 to 5 follows this list).
4. Red pieces move right, down, or diagonally down-right, one square per move; Blue pieces move left, up, or diagonally up-left, one square per move.
5. If the destination square of a move is occupied, the piece on it is removed ("captured") from the board. Deliberately capturing one's own piece can itself be a strategy, because it increases the moving chances and flexibility of the remaining pieces.
6. A side wins by being the first to reach the corner square of the opponent's starting area, or by capturing all of the opponent's pieces.
7. A game ends only in a win or a loss; there is no draw.
8. Each side has 3 minutes per game and loses on timeout. In each round the two sides play at most 7 games, taking the first move in turn (Party A moves first in game 1, Party B moves first in games 2, 3, 6 and 7); there is no rest between games, and the first side to win 4 games wins the round.
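To make rules 3 to 5 concrete, the sketch below (not taken from the patent; the board encoding and function names are illustrative assumptions) shows how the movable pieces for a die roll and the candidate moves of a piece could be computed in Python.

```python
# Illustrative sketch of rules 3-5 (not taken from the patent text).
# Assumed board encoding: pieces maps (side, number) -> (row, col) for pieces still on the board,
# e.g. {("red", 5): (0, 1), ("blue", 2): (4, 3)}.

RED_DIRS = [(0, 1), (1, 0), (1, 1)]       # right, down, down-right (rule 4)
BLUE_DIRS = [(0, -1), (-1, 0), (-1, -1)]  # left, up, up-left (rule 4)

def movable_pieces(pieces, side, die):
    """Rule 3: the rolled piece moves; if it has been captured, the player may move
    the nearest surviving piece numbered above or below the rolled value."""
    numbers = sorted(n for (s, n) in pieces if s == side)
    if die in numbers:
        return [die]
    lower = [n for n in numbers if n < die]
    higher = [n for n in numbers if n > die]
    return ([max(lower)] if lower else []) + ([min(higher)] if higher else [])

def candidate_moves(pieces, side, number):
    """Rules 4-5: one step in one of three directions, staying on the 5x5 board;
    landing on an occupied square captures whatever piece stands there."""
    row, col = pieces[(side, number)]
    dirs = RED_DIRS if side == "red" else BLUE_DIRS
    moves = []
    for dr, dc in dirs:
        r, c = row + dr, col + dc
        if 0 <= r < 5 and 0 <= c < 5:
            moves.append(((side, number), (r, c)))
    return moves

# Example: with Blue's pieces 1, 2, 5, 6 surviving, a roll of 3 allows moving piece 2 or piece 5.
```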
Research on Einstein chess itself shows that a board is evaluated from three key factors: probability, distance and threat. Earlier researchers built static valuation formulas from these factors to compute the value of a board. This approach has two problems. First, numerical precision: the coefficients and parameter values used in the calculation are mostly integers, so the computed board values are not precise enough to distinguish between different board states. Second, the limitation of the formula itself: a single hand-crafted formula is never perfect, so the computed value cannot objectively reflect the quality of the board.
Summary of the invention
In view of the deficiencies of the prior art, the technical problem to be solved by the invention is to provide an Einstein chess game-playing algorithm based on reinforcement learning, which removes the limitation that the level of existing Einstein chess training sets is bounded by human skill and raises the upper limit of Einstein chess playing strength.
The solution adopted by the invention is as follows:
An Einstein chess game-playing algorithm based on reinforcement learning, comprising the following steps:
(1) Neural network initialization:
At system initialization the neural networks are created; their parameters are either randomly initialized or read from a file; the game tree is initialized; the sample collector is initialized. The neural networks consist of a policy network and a value network.
The policy network consists of an input layer, hidden layers and an output layer. The input layer contains 42 neurons; there are three hidden layers of 20 neurons each with a nonlinear activation function; the output layer contains 18 neurons with a Softmax activation.
The value network consists of an input layer, hidden layers and an output layer. The input layer contains 36 neurons; there are three hidden layers of 20 neurons each with a nonlinear activation function; the output layer contains 1 neuron with a nonlinear activation.
(2) Algorithm execution:
When the learning flag is true, the system performs computer self-play simulation; both players share the neural networks and select moves with the APV-MCTS algorithm. After a full game has been simulated, the sample collector gathers the move samples of the game and sends them to the training sets of the value network and the policy network; a certain amount of data is then sampled at random to train the networks, after which the next game is simulated.
Each Rollout of the APV-MCTS algorithm is divided into four steps: move selection, expansion, evaluation (assignment of a value) and update (feedback).
When the APV-MCTS algorithm is executed:
The move-selection formula is a = argmax(V(s') + U(s, a)), where V(s') is the value of the state s' reached by executing move a in state s, and U(s, a) = c * P(s, a) * sqrt(N(s)) / (1 + N(s')). Here c is the exploration coefficient, set to c = 5; P(s, a) is the probability factor output by the policy network, a being a move generated from the current state s; N(s) is the visit count of state s; and N(s') is the visit count of the state s' reached by executing move a in state s.
The expansion step computes, at a leaf node LeafNode of the game tree, all legal moves available from the board state represented by that leaf node and the board states obtained after executing those moves, and adds these states as new child nodes of LeafNode.
The evaluation step, after the game tree has been expanded, computes the value V(s) of the newly expanded node directly with the value network from the board features represented by the node, where s is the current board state.
The update (feedback) step uses V(s) = (V(s) * N(s) + V_leaf) / (N(s) + 1), where the visit count of every visited node is incremented, N(s) = N(s) + 1, and V_leaf is the value of the leaf node Node selected and expanded by the most recent Rollout.
After the Monte Carlo tree search, the move to be played is selected by a = argmax P(root, a), where P(root, a) = N_a^(1/t) / Σ_b N_b^(1/t). N_a is the visit count of the node representing the board state s' reached after executing move a; N_b ranges over the visit counts of the nodes of all board states reachable from the board state s before the move by executing a legal move; t is a temperature, with t → 0 during training to increase stability as the neural network converges, and t = 1 when playing. To ensure that every move can be tried during training, Dirichlet noise is added: P(root, a) = (1 - ε) P_a + ε η_a, with η ~ Dir(0.03) and ε = 0.25.
During training, every time the APV-MCTS algorithm selects and executes a move, the sample collector records the board state and the die roll before the move and converts them into the input feature vectors of the policy network and the value network; it also records the visit counts of the child nodes of each root node, converts them into the probability vector t, and stores it in the training set of the policy network as the output label vector of the sample.
After each game, every board is assigned a label value z according to the game result, 1 for the winner and -1 for the loser, which is stored in the training set of the value network as the output label vector of the corresponding sample. If the game lasted n moves, then after sample collection n samples are drawn at random from each of the training sets of the value network and the policy network to train the networks.
The loss function of the policy network is loss = -t * log p + c * ||W||², and the loss function of the value network is loss = (z - v)² + c * ||W||², with L2 regularization parameter c = 0.0001. The initial learning rate α is 0.01 and decays in steps, being reduced to one tenth every 1000 games.
When the playing strength of the algorithm or the training time reaches a manually set threshold, the system stops the self-play process, stops data collection and neural network training, and stores the neural network parameters to a file.
When the algorithm is to be used for an actual game, the computer player using this strategy loads the stored neural network parameters and uses the APV-MCTS algorithm to compute and choose its moves.
Compared with the prior art, the invention has the following advantages:
Existing machine learning algorithms for Einstein chess need external training data to train their neural network models, and that data is generated through play by other strong algorithms or by human players; as a result, the level of the machine learning algorithm cannot exceed that of the existing algorithms or of the human players. The characteristic of the invention is that reinforcement learning is used so that the algorithm generates its own training data during self-play, and the neural networks learn online from feedback on the game results; the level of the algorithm is therefore raised gradually without external training data and is not limited by existing algorithms or human skill.
In addition, the APV-MCTS algorithm used in the invention saves a large amount of simulated play-out time and makes full use of the valuation experience accumulated by the value network, improving the execution efficiency of MCTS. At the same time, the output of the policy network is used as a factor in the UCB formula that guides move selection, which improves the accuracy of the chosen move even when the number of Rollouts is small, keeping the algorithm efficient while ensuring its reliability.
Brief description of the drawings
Fig. 1 is a schematic diagram of the board used in the embodiment of the invention.
Fig. 2 shows the piece coordinate (distance) values for the positions of both sides' pieces: Fig. 2(a) shows Red and Fig. 2(b) shows Blue.
Fig. 3 is a schematic diagram of the input, hidden and output layers of the policy network and the value network: Fig. 3(a) shows the policy network and Fig. 3(b) shows the value network.
Fig. 4 is a flow chart of one Rollout of the APV-MCTS algorithm of the invention.
Fig. 5 is a schematic diagram of the steps of the Einstein chess game-playing algorithm based on reinforcement learning provided by the invention.
Detailed description of the embodiments
For a better understanding of the technical solution of the invention, the invention is described in more detail below with reference to specific examples and the drawings.
1. Board features
A feature is a variable needed to describe a characteristic of the board; it is useful information extracted from an Einstein chess board according to the rules of the game. The invention first extracts the piece features of every piece and then linearly combines them into a board feature vector. The piece features are the move probability (Probability), the piece coordinate (Coordinate) and the threat (Threat); they are described below with reference to the board in Fig. 1.
1) Move probability
Before moving, a player first determines which pieces may move according to the rolled die, so the probability distribution of the die rolls is one of the important factors affecting the position. The move probability of a piece p_i is the probability of rolling a die value that allows p_i to move. It is computed as probability(p_i) = |dice(p_i)| / 6 if p_i is on the board, and probability(p_i) = -1 if p_i has been removed, where p_i denotes a piece and dice(p_i) denotes the set of die values that allow p_i to move.
For the board in Fig. 1: Red's pieces 2, 3, 5 and 6 and Blue's pieces 1, 2, 5 and 6 are on the board. Blue's piece 5 can move when the die shows 3, 4 or 5, so dice(blue5) = {3, 4, 5}, |dice(blue5)| = 3 and probability(blue5) = 0.5. Blue's piece 3 is no longer on the board, so probability(blue3) = -1.
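The following is a minimal Python sketch of the move-probability feature as written above (probability(p_i) = |dice(p_i)| / 6 for surviving pieces, -1 for captured ones); representing a side's surviving pieces as a set of numbers is an assumption.

```python
# Sketch of the move-probability feature; the data layout is assumed, not taken from the patent.

def dice_set(on_board_numbers, number):
    """Die values that allow piece `number` to move (rule 3)."""
    result = set()
    for die in range(1, 7):
        if die in on_board_numbers:
            movable = {die}
        else:
            lower = [n for n in on_board_numbers if n < die]
            higher = [n for n in on_board_numbers if n > die]
            movable = set()
            if lower:
                movable.add(max(lower))
            if higher:
                movable.add(min(higher))
        if number in movable:
            result.add(die)
    return result

def probability_feature(on_board_numbers, number):
    """probability(p_i) = |dice(p_i)| / 6, or -1 if the piece has been captured."""
    if number not in on_board_numbers:
        return -1.0
    return len(dice_set(on_board_numbers, number)) / 6.0

# With Blue's surviving pieces {1, 2, 5, 6}, dice_set(..., 5) == {3, 4, 5},
# so probability_feature(..., 5) == 0.5, matching the worked example above.
print(probability_feature({1, 2, 5, 6}, 5))   # 0.5
print(probability_feature({1, 2, 5, 6}, 3))   # -1.0
```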
2) Piece coordinate
The Einstein chess board is a 5 × 5 square, and a side wins when one of its pieces is the first to reach the corner of the opponent's starting area. The number of steps a piece needs to reach that corner reflects its advantage to some extent and also encodes its position on the board, so it is used as a piece feature. The piece coordinate of p_i is derived from the number of steps from p_i to the corner of the opponent's starting area and is computed as coordinate(p_i) = distance(p_i) / 4 if p_i is on the board, and coordinate(p_i) = -1 if it has been removed, where distance(p_i) is the distance from p_i to the corner of the opponent's starting area, i.e. the minimum number of steps p_i needs to reach that corner. The distance(p_i) values are shown in Fig. 2: Fig. 2(a) gives the distance value of every position for Red and Fig. 2(b) for Blue. For example, for Blue's piece 5 in Fig. 1, coordinate(blue5) = distance(blue5) / 4 = 2/4 = 0.5.
3) Threat
If the destination square of a move is occupied, the piece standing on it is captured, whether it belongs to the moving side or to the opponent. After a piece is captured, the reduced piece count may expose its side to the danger of losing further pieces; on the other hand, it increases the move probabilities of the remaining pieces and makes them more flexible. Being captured therefore has both advantages and disadvantages, and their balance depends on the concrete board layout, which is exactly what makes designing an evaluation function difficult. The invention uses this property as a piece feature: the threat of a piece p_i is the possibility of p_i being captured by the opponent. It is computed as threat(p_i) = |dice(Q_i)| / 6 if p_i is on the board, and threat(p_i) = -1 if it has been removed, where Q_i denotes the set of opponent pieces standing on the (up to) three squares from which a piece can move onto p_i's square, and dice(Q_i) denotes the set of die values that allow at least one piece in Q_i to move. For Blue's piece 5 in Fig. 1, Red's pieces 5 and 6 can capture it; Red's pieces 5 and 6 can move when the die shows 4, 5 or 6, so dice(red5, red6) = {4, 5, 6} and threat(blue5) = 0.5.
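A corresponding sketch of the threat feature, under the assumed encoding of the board as a dict from (side, number) to (row, column); the dice_set helper repeats the rule-3 logic so the block stands on its own.

```python
# Sketch of the threat feature; the board encoding is an assumption, not taken from the patent.
# pieces: dict {(side, number): (row, col)} of pieces still on the board.

def dice_set(numbers, n):
    """Die values that let piece n move, given the side's surviving numbers (rule 3)."""
    s = set()
    for die in range(1, 7):
        if die in numbers:
            movable = {die}
        else:
            movable = {m for m in (max((x for x in numbers if x < die), default=None),
                                   min((x for x in numbers if x > die), default=None))
                       if m is not None}
        if n in movable:
            s.add(die)
    return s

def threat_feature(pieces, side, number):
    """threat(p_i) = |union of dice sets of the opponent pieces that can capture p_i| / 6."""
    if (side, number) not in pieces:
        return -1.0
    row, col = pieces[(side, number)]
    opp = "blue" if side == "red" else "red"
    step = 1 if opp == "red" else -1          # red attacks toward larger row/col indices
    attack_squares = {(row, col - step), (row - step, col), (row - step, col - step)}
    opp_numbers = {n for (s, n) in pieces if s == opp}
    attackers = {n for (s, n), pos in pieces.items() if s == opp and pos in attack_squares}
    dice = set()
    for n in attackers:
        dice |= dice_set(opp_numbers, n)
    return len(dice) / 6.0
```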
4) Board feature vector
The 3 piece features of all pieces are computed first. The piece features of all of our own pieces are sorted in descending order of coordinate(p_i) and concatenated into an 18-dimensional vector; the piece features of all of the opponent's pieces are likewise sorted in descending order of coordinate(p_i) and concatenated into an 18-dimensional vector; finally our feature vector and the opponent's feature vector are concatenated into a 36-dimensional board feature vector. For the board in Fig. 1, the Probability, Coordinate and Threat of Blue's pieces 1 to 6 are {1/6, 2/4, 2/6}, {3/6, 4/4, 0}, {-1, -1, -1}, {-1, -1, -1}, {3/6, 2/4, 3/6}, {1/6, 4/4, 0}; those of Red's pieces 1 to 6 are {-1, -1, -1}, {2/6, 4/4, 1/6}, {2/6, 2/4, 0}, {-1, -1, -1}, {2/6, 3/4, 3/6}, {1/6, 3/4, 3/6}. Assuming we play Blue, the board feature vector is {3/6, 2/4, 3/6, 1/6, 2/4, 2/6, 3/6, 4/4, 0, 1/6, 4/4, 0, -1, -1, -1, -1, -1, -1, 2/6, 2/4, 0, 2/6, 3/4, 3/6, 1/6, 4/4, 0, 2/6, 4/4, 1/6, -1, -1, -1, -1, -1, -1}.
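A sketch of assembling the 36-dimensional board feature vector from per-piece feature triples, following the ordering rule stated above; the per-piece values are copied from the Fig. 1 example, and the dict layout is an assumption.

```python
# Sketch of building the 36-d board feature vector from per-piece features.
# Per-piece (probability, coordinate, threat) triples are taken from the Fig. 1 example above.

blue = {1: (1/6, 2/4, 2/6), 2: (3/6, 4/4, 0), 3: (-1, -1, -1),
        4: (-1, -1, -1), 5: (3/6, 2/4, 3/6), 6: (1/6, 4/4, 0)}
red  = {1: (-1, -1, -1), 2: (2/6, 4/4, 1/6), 3: (2/6, 2/4, 0),
        4: (-1, -1, -1), 5: (2/6, 3/4, 3/6), 6: (1/6, 3/4, 3/6)}

def side_vector(features):
    """Concatenate the 6 pieces' (probability, coordinate, threat) triples,
    ordered by the coordinate feature as described in the text (18 values)."""
    ordered = sorted(features.values(), key=lambda f: f[1], reverse=True)
    return [x for triple in ordered for x in triple]

def board_vector(own, opponent):
    """Own side's 18 features followed by the opponent's 18 features (36 values)."""
    return side_vector(own) + side_vector(opponent)

vec = board_vector(blue, red)   # we play Blue in the Fig. 1 example
assert len(vec) == 36
```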
2. Neural network structure
The policy network and the value network disclosed by the invention both use a BP (back-propagation) network structure.
1) Policy network
As shown in Fig. 3(a), the policy network consists of an input layer, hidden layers and an output layer. The input layer contains 42 neurons; there are three hidden layers of 20 neurons each with a nonlinear activation function; the output layer contains 18 neurons with a softmax activation. The input feature vector is built as follows: the pieces of the side to move are arranged in order of piece number 1 to 6, with the features of each piece arranged as move probability, piece coordinate, threat, giving an 18-dimensional vector for the side to move; the features of the other side are treated in the same way; the die is then encoded as a 6-dimensional vector whose (n-1)-th element (n > 0) represents die value n, the element corresponding to the rolled value being 1 and the rest 0; finally the side-to-move features are placed first, the other side's features next and the die features last, giving a vector of 42 dimensions in total. The output feature vector is built as follows: the pieces of the side to move are arranged in order of number 1 to 6, and for each piece three scalars corresponding to its three move directions (left, forward, right relative to its direction of advance) are arranged in sequence, giving an 18-dimensional vector; a softmax is applied over the entries corresponding to legal moves, and the outputs are the probabilities of executing each move.
2) Value network
As shown in Fig. 3(b), the value network consists of an input layer, hidden layers and an output layer. The input layer contains 36 neurons; there are three hidden layers of 20 neurons each with a nonlinear activation function; the output layer contains 1 neuron with a nonlinear activation. The input vector is the input vector of the policy network with the last 6 die-feature dimensions removed, i.e. a 36-dimensional vector. The output is one-dimensional and represents the static value of the board from the perspective of the side to move.
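For concreteness, a minimal PyTorch sketch of the two fully connected networks with the stated layer sizes. The tanh hidden activations and the tanh value output are assumptions on my part, since the activation formulas appear only as figures in the original; masking of illegal moves before the softmax is omitted.

```python
import torch
import torch.nn as nn

# Sketch of the two BP networks with the stated sizes (42-20-20-20-18 and 36-20-20-20-1).
# tanh activations are an assumption; the original gives the activation formulas only in figures.

class PolicyNet(nn.Module):
    def __init__(self):
        super().__init__()
        self.body = nn.Sequential(
            nn.Linear(42, 20), nn.Tanh(),
            nn.Linear(20, 20), nn.Tanh(),
            nn.Linear(20, 20), nn.Tanh(),
            nn.Linear(20, 18),
        )

    def forward(self, x):
        # Softmax over the 18 move entries (restriction to legal moves omitted here).
        return torch.softmax(self.body(x), dim=-1)

class ValueNet(nn.Module):
    def __init__(self):
        super().__init__()
        self.body = nn.Sequential(
            nn.Linear(36, 20), nn.Tanh(),
            nn.Linear(20, 20), nn.Tanh(),
            nn.Linear(20, 20), nn.Tanh(),
            nn.Linear(20, 1), nn.Tanh(),   # value in [-1, 1], matching the ±1 labels
        )

    def forward(self, x):
        return self.body(x)
```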
3. APV-MCTS (Asynchronous Policy and Value MCTS) algorithm
Each Rollout of the traditional MCTS algorithm consists of four steps: selection, expansion, simulation and update. The simulation step consumes a long time and many hardware resources; the APV-MCTS algorithm solves this problem well by replacing simulation with an evaluation step, i.e. the static board value computed by the neural network directly replaces the board value that would otherwise be obtained by simulation, and the experience previously learned and stored in the neural network is reused. As shown in Fig. 4, the details of one Rollout of the APV-MCTS algorithm are as follows:
1) Selection
After initialization the game tree contains the root node and its child nodes, i.e. all board states s' reachable from the board state s represented by the root by executing a legal move. After a thread is initialized, it selects a path from the root to a leaf node to explore. When the thread reaches a non-leaf node Node, the board values V of all child nodes of Node are known; after a random die value has been obtained, the child is selected by a = argmax(V(s') + U(s, a)), where V(s') is the value of the board state s' reached by executing move a in board state s, and U(s, a) = c * P(s, a) * sqrt(N(s)) / (1 + N(s')). Here c is the exploration coefficient, set to c = 5; P(s, a) is the probability of executing move a in board state s, computed by calling the policy network (the composition of its input and output vectors was introduced above; for example, if the move is "piece 3 steps forward", the element with subscript 7 of the 18-dimensional output vector, subscripts 0 to 17, is taken as the move probability); N(s) is the visit count of the node representing board state s; and N(s') is the visit count of the node representing the board state s' reached by executing move a in state s. This formula balances the known value of boards that have already been explored against the potential value of boards that have not yet been explored.
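A short sketch of the selection rule a = argmax(V(s') + U(s, a)) with the exploration term written as above; the per-child node representation (value, visit count and prior) is an assumption.

```python
import math

C_PUCT = 5.0   # exploration coefficient c from the text

def select_child(children, n_parent):
    """children: list of dicts with keys 'V' (board value of s'), 'N' (visit count N(s'))
    and 'P' (policy prior P(s, a)); n_parent is N(s). Returns the child maximising
    V(s') + c * P(s, a) * sqrt(N(s)) / (1 + N(s'))."""
    def score(child):
        u = C_PUCT * child["P"] * math.sqrt(n_parent) / (1 + child["N"])
        return child["V"] + u
    return max(children, key=score)
```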
2) Expansion
When the thread reaches a leaf node LeafNode through successive selections, the successor states of this leaf must be expanded, i.e. the board states s' obtained by executing every legal move from the board state s represented by this node. Concretely, nodes representing all of these successor states are added to the set of child nodes of LeafNode.
3) Evaluation
After expansion, the thread selects a successor-state node and calls the value network to assign a value V to the board it represents, and increments the visit count of this node: N(s) = N(s) + 1.
4) Update
After evaluation, the thread propagates the obtained information upwards from the leaf node. It repeatedly selects the parent node, updates its value with V(s) = (V(s) * N(s) + V_leaf) / (N(s) + 1), and increments its visit count: N(s) = N(s) + 1.
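A sketch of the feedback step: the value obtained at the expanded leaf is propagated back along the visited path with the running-average update given above; the minimal Node class is an assumption.

```python
# Sketch of the update (feedback) step: propagate the leaf value back up the visited path
# using V(s) = (V(s) * N(s) + V_leaf) / (N(s) + 1) and N(s) = N(s) + 1.

class Node:
    def __init__(self, parent=None):
        self.parent = parent
        self.V = 0.0   # running mean value of the board state this node represents
        self.N = 0     # visit count

def backup(leaf, v_leaf):
    node = leaf
    while node is not None:
        node.V = (node.V * node.N + v_leaf) / (node.N + 1)
        node.N += 1
        node = node.parent
```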
When a Rollout finishes, the next Rollout begins. When the MCTS algorithm has run for the predetermined time, the four steps above stop and an optimal move, i.e. a child of the root node, is selected by a = argmax P(root, a), where P(root, a) = N_a^(1/t) / Σ_b N_b^(1/t). N_a is the visit count of the node representing the new board state s' reached after move a is executed; N_b ranges over the visit counts of the nodes of all board states reachable from the board state s before the move by executing a legal move; t is a temperature, with t → 0 during training and t = 1 when playing. During training, Dirichlet noise is added to ensure that every move can be tried: P(root, a) = (1 - ε) P_a + ε η_a, with η ~ Dir(0.03) and ε = 0.25.
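A numpy sketch of the final move choice: the root children's visit counts are turned into P(root, a) = N_a^(1/t) / Σ_b N_b^(1/t), with Dirichlet noise mixed in during training as stated above.

```python
import numpy as np

def move_distribution(visit_counts, t=1.0, train=False, eps=0.25, dir_alpha=0.03):
    """visit_counts: array of N_a over the root's children.
    Returns P(root, a) = N_a^(1/t) / sum_b N_b^(1/t), optionally mixed with Dirichlet noise."""
    counts = np.asarray(visit_counts, dtype=np.float64)
    p = counts ** (1.0 / t)
    p = p / p.sum()
    if train:
        noise = np.random.dirichlet([dir_alpha] * len(p))
        p = (1 - eps) * p + eps * noise
    return p

counts = [40, 25, 10, 3]                                    # illustrative visit counts
best = int(np.argmax(move_distribution(counts, t=1.0)))     # a = argmax P(root, a)
```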
4. Training
1) Training set data collection
During training, every time the APV-MCTS algorithm selects and executes a move, the sample collector records the board state and the die roll before the move and converts them into the input feature vectors of the policy network and the value network (the conversion was described in the neural network structure section above); these are sent to the training sets of the policy network and the value network respectively. The visit counts of the child nodes of each root node, divided by the visit count of the root node, form the probability vector t, which is stored in the training set of the policy network as the output label vector of the sample. After each game, every board in the game is assigned a label value z according to the game result, 1 for the winner and -1 for the loser, which is stored in the training set of the value network as the output label of the corresponding sample. If the game lasted n moves, then after sample collection n samples are drawn at random from each of the training sets of the value network and the policy network to train the networks.
2) Training set maintenance
The training sample queues of the value network and the policy network each have a size of 100. After a self-play game has finished and the newest training samples have been pushed into the sample queues, let n be the total number of moves played by both sides in this game; n samples are then drawn at random from each training sample queue to train the value network and the policy network, and the n samples at the tail of each queue are deleted, so that the samples in the queues are always the newest data and the queue size remains 100.
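A sketch of the sample-queue maintenance described above, with the queue kept as a plain Python list capped at 100 entries; newest samples are placed at the head and the tail is trimmed after each game.

```python
import random

QUEUE_SIZE = 100

def after_game(queue, new_samples):
    """Push this game's samples at the head, draw that many random samples for training,
    then trim the tail so the queue holds only the newest 100 entries."""
    n = len(new_samples)
    queue[:0] = new_samples                              # newest data at the head
    batch = random.sample(queue, min(n, len(queue)))     # random training batch of size n
    del queue[QUEUE_SIZE:]                               # drop the oldest samples at the tail
    return batch
```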
3) Neural network training
The loss function of the policy network is loss = -t * log p + c * ||W||², and the loss function of the value network is loss = (z - v)² + c * ||W||², with L2 regularization parameter c = 0.0001. The initial learning rate α is 0.01 and decays in steps, being reduced to one tenth every 1000 games.
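A PyTorch sketch of the two losses and the stepped learning-rate decay; realising the c * ||W||² term through the optimizer's weight decay, the choice of SGD, and the tanh activations are assumptions on my part.

```python
import torch
import torch.nn as nn
import torch.optim as optim

# Compact stand-ins for the policy and value networks (see the network sketch above).
policy_net = nn.Sequential(nn.Linear(42, 20), nn.Tanh(), nn.Linear(20, 20), nn.Tanh(),
                           nn.Linear(20, 20), nn.Tanh(), nn.Linear(20, 18), nn.Softmax(dim=-1))
value_net = nn.Sequential(nn.Linear(36, 20), nn.Tanh(), nn.Linear(20, 20), nn.Tanh(),
                          nn.Linear(20, 20), nn.Tanh(), nn.Linear(20, 1), nn.Tanh())

# weight_decay plays the role of the c * ||W||^2 term (c = 0.0001);
# the learning rate starts at 0.01 and is divided by 10 every 1000 games.
opt = optim.SGD(list(policy_net.parameters()) + list(value_net.parameters()),
                lr=0.01, weight_decay=1e-4)
scheduler = optim.lr_scheduler.StepLR(opt, step_size=1000, gamma=0.1)  # call scheduler.step() once per game

def train_step(x_policy, target_pi, x_value, target_z):
    """One gradient step on loss = -t * log p + (z - v)^2 (plus weight decay)."""
    p = policy_net(x_policy)                       # predicted move probabilities
    v = value_net(x_value).squeeze(-1)             # predicted board value in [-1, 1]
    policy_loss = -(target_pi * torch.log(p + 1e-8)).sum(dim=-1).mean()
    value_loss = ((target_z - v) ** 2).mean()
    loss = policy_loss + value_loss
    opt.zero_grad()
    loss.backward()
    opt.step()
    return loss.item()
```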
5. Execution steps of the Einstein chess game-playing algorithm based on reinforcement learning
As shown in Fig. 5, the invention provides an Einstein chess game-playing algorithm based on reinforcement learning, whose steps are as follows:
Step 1: create the neural networks and randomly initialize their parameters or read them from a file. Initialize the game tree. Initialize the sample collector. Set the learning flag. If the learning flag is true, go to step 2; if false, go to step 6.
Step 2: initialize a new game: initialize the starting board and the data of both players, set the playing strategy of both sides to the APV-MCTS algorithm, and go to step 3.
Step 3: the side to move computes and selects its move with the APV-MCTS algorithm and executes it once the computation is complete; the side to move is then switched, and the process goes to step 4.
Step 4: the sample collector records the board state and the die roll before the last move and converts them into the input feature vectors of the value network and the policy network, which are stored in the training sets. The visit counts of the child nodes of each root node are recorded, converted into the probability vector t and stored in the training set of the policy network as the output label vector of the sample. If the game has not ended, go to step 3; if it has, go to step 5.
Step 5: obtain the value V of every board state of this game from the result of the game just finished and store it in the training set as the output label feature vector of the corresponding training sample. Push the training samples of this game into the training sets and draw samples to train the neural networks. If the training time is over, go to step 6; otherwise go to step 2.
Step 6: initialize a new game: initialize the starting board and the data of both players, set the computer's playing strategy to the APV-MCTS algorithm and the other player's strategy to any strategy, and let the two sides play.

Claims (1)

1. An Einstein chess game-playing algorithm based on reinforcement learning, the game-playing algorithm comprising the following steps:
(1) Neural network initialization:
at system initialization the neural networks are created; their parameters are either randomly initialized or read from a file; the game tree is initialized; the sample collector is initialized; the neural networks consist of a policy network and a value network;
the policy network consists of an input layer, hidden layers and an output layer; the input layer contains 42 neurons; there are three hidden layers of 20 neurons each with a nonlinear activation function; the output layer contains 18 neurons with a Softmax activation;
the value network consists of an input layer, hidden layers and an output layer; the input layer contains 36 neurons; there are three hidden layers of 20 neurons each with a nonlinear activation function; the output layer contains 1 neuron with a nonlinear activation;
(2) Algorithm execution:
when the learning flag is true, the system performs computer self-play simulation; both players share the neural networks and select moves with the APV-MCTS algorithm; after a full game has been simulated, the sample collector gathers the move samples and sends them to the training sets of the value network and the policy network; a certain amount of data is randomly sampled to train the networks, and the next game is then simulated;
each Rollout of the APV-MCTS algorithm is divided into four steps: move selection, expansion, evaluation and update (feedback);
when the APV-MCTS algorithm is executed,
the move-selection formula is a = argmax(V(s') + U(s, a)), where V(s') is the value of the state s' reached by executing move a in state s, and U(s, a) = c * P(s, a) * sqrt(N(s)) / (1 + N(s')); c is the exploration coefficient, set to c = 5; P(s, a) is the probability factor output by the policy network, a being a move generated from the current state s; N(s) is the visit count of state s; N(s') is the visit count of the state s' reached by executing move a in state s;
the expansion step computes, at a leaf node LeafNode of the game tree, all legal moves available from the board state represented by that leaf node and the board states obtained after executing them, and adds these states as new child nodes of LeafNode;
the evaluation step, after the game tree has been expanded, computes the value V(s) of the newly expanded node directly with the value network from the board features represented by the node, s being the current board state;
the update (feedback) step uses V(s) = (V(s) * N(s) + V_leaf) / (N(s) + 1), where the visit count of every visited node is incremented, N(s) = N(s) + 1, and V_leaf is the value of the leaf node Node selected and expanded by the most recent Rollout;
after the Monte Carlo tree search, the move to be played is selected by a = argmax P(root, a), where P(root, a) = N_a^(1/t) / Σ_b N_b^(1/t); N_a is the visit count of the node representing the board state s' reached after executing move a; N_b ranges over the visit counts of the nodes of all board states reachable from the board state s before the move by executing a legal move; t is a temperature, with t → 0 during training to increase stability as the neural network converges and t = 1 when playing; to ensure that every move can be tried during training, Dirichlet noise is added: P(root, a) = (1 - ε) P_a + ε η_a, with η ~ Dir(0.03) and ε = 0.25;
during training, every time the APV-MCTS algorithm selects and executes a move, the sample collector records the board state and the die roll before the move and converts them into the input feature vectors of the policy network and the value network, and records the visit counts of the child nodes of each root node, converts them into the probability vector t and stores it in the training set of the policy network as the output label vector of the sample;
after each game, every board is assigned a label value z according to the game result, 1 for the winner and -1 for the loser, which is stored in the training set of the value network as the output label vector of the corresponding sample; if the game lasted n moves, then after sample collection n samples are drawn at random from each of the training sets of the value network and the policy network to train the networks;
the loss function of the policy network is loss = -t * log p + c * ||W||², the loss function of the value network is loss = (z - v)² + c * ||W||², and the L2 regularization parameter is c = 0.0001; the initial learning rate α is 0.01 and decays in steps, being reduced to one tenth every 1000 games;
when the playing strength of the algorithm or the training time reaches a manually set threshold, the system stops the self-play process, stops data collection and neural network training, and stores the neural network parameters to a file;
when the algorithm is to be used for an actual game, the computer player using this strategy loads the stored neural network parameters and uses the APV-MCTS algorithm to compute and choose its moves.
CN201910375250.6A 2019-05-07 2019-05-07 An Einstein chess game-playing algorithm based on reinforcement learning Pending CN110119804A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201910375250.6A CN110119804A (en) 2019-05-07 2019-05-07 An Einstein chess game-playing algorithm based on reinforcement learning

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201910375250.6A CN110119804A (en) 2019-05-07 2019-05-07 An Einstein chess game-playing algorithm based on reinforcement learning

Publications (1)

Publication Number Publication Date
CN110119804A true CN110119804A (en) 2019-08-13

Family

ID=67521850

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201910375250.6A Pending CN110119804A (en) 2019-05-07 2019-05-07 An Einstein chess game-playing algorithm based on reinforcement learning

Country Status (1)

Country Link
CN (1) CN110119804A (en)


Cited By (26)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN110659420B (en) * 2019-09-25 2022-05-20 广州西思数字科技有限公司 Personalized catering method based on deep neural network Monte Carlo search tree
CN110659420A (en) * 2019-09-25 2020-01-07 广州西思数字科技有限公司 Personalized catering method based on deep neural network Monte Carlo search tree
CN110717591A (en) * 2019-09-28 2020-01-21 复旦大学 Falling strategy and layout evaluation method suitable for various chess
CN110717591B (en) * 2019-09-28 2023-05-02 复旦大学 Drop strategy and local assessment method applicable to various chess types
CN110772794B (en) * 2019-10-12 2023-06-16 广州多益网络股份有限公司 Intelligent game processing method, device, equipment and storage medium
CN110772794A (en) * 2019-10-12 2020-02-11 广州多益网络股份有限公司 Intelligent game processing method, device, equipment and storage medium
CN111330255A (en) * 2020-01-16 2020-06-26 北京理工大学 Amazon chess-calling generation method based on deep convolutional neural network
CN111330255B (en) * 2020-01-16 2021-06-08 北京理工大学 Amazon chess-calling generation method based on deep convolutional neural network
CN111443806A (en) * 2020-03-26 2020-07-24 腾讯科技(深圳)有限公司 Interactive task control method and device, electronic equipment and storage medium
CN111443806B (en) * 2020-03-26 2023-08-11 腾讯科技(深圳)有限公司 Interactive task control method and device, electronic equipment and storage medium
CN111667043A (en) * 2020-05-20 2020-09-15 季华实验室 Chess game playing method, system, terminal and storage medium
CN111667043B (en) * 2020-05-20 2023-09-19 季华实验室 Chess game playing method, system, terminal and storage medium
CN111659124A (en) * 2020-05-27 2020-09-15 太原理工大学 Intelligent identification system for playing chess
CN111659124B (en) * 2020-05-27 2023-05-02 太原理工大学 Intelligent identification system for playing chess
CN113159313A (en) * 2021-03-02 2021-07-23 北京达佳互联信息技术有限公司 Data processing method and device of game model, electronic equipment and storage medium
CN113159313B (en) * 2021-03-02 2022-09-09 北京达佳互联信息技术有限公司 Data processing method and device of game model, electronic equipment and storage medium
CN113127704A (en) * 2021-03-11 2021-07-16 西安电子科技大学 Monte Carlo tree searching method, system and application
CN113127704B (en) * 2021-03-11 2022-11-01 西安电子科技大学 Monte Carlo tree searching method, system and application
CN113673672A (en) * 2021-07-08 2021-11-19 哈尔滨工业大学 Curling game strategy generation method based on Monte Carlo reinforcement learning
CN113673672B (en) * 2021-07-08 2024-03-29 哈尔滨工业大学 Curling competition strategy generation method based on Monte Carlo reinforcement learning
CN113946604B (en) * 2021-10-26 2023-01-20 网易有道信息技术(江苏)有限公司 Staged go teaching method and device, electronic equipment and storage medium
CN113946604A (en) * 2021-10-26 2022-01-18 网易有道信息技术(江苏)有限公司 Staged go teaching method and device, electronic equipment and storage medium
CN116881656A (en) * 2023-07-06 2023-10-13 南华大学 Reinforced learning military chess AI system based on deep Monte Carlo
CN116881656B (en) * 2023-07-06 2024-03-22 南华大学 Reinforced learning military chess AI system based on deep Monte Carlo
CN117033250A (en) * 2023-10-08 2023-11-10 腾讯科技(深圳)有限公司 Method, device, equipment and storage medium for testing office application
CN117033250B (en) * 2023-10-08 2024-01-23 腾讯科技(深圳)有限公司 Method, device, equipment and storage medium for testing office application


Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
WD01 Invention patent application deemed withdrawn after publication

Application publication date: 20190813