CN110555517A - Improved chess game method based on Alphago Zero - Google Patents

Improved chess game method based on Alphago Zero

Info

Publication number
CN110555517A
CN110555517A · CN201910837810.5A
Authority
CN
China
Prior art keywords
model
network
training
self
play
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN201910837810.5A
Other languages
Chinese (zh)
Inventor
郑秋梅
王璐璐
商振浩
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
China University of Petroleum East China
Original Assignee
China University of Petroleum East China
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by China University of Petroleum East China filed Critical China University of Petroleum East China
Priority to CN201910837810.5A priority Critical patent/CN110555517A/en
Publication of CN110555517A publication Critical patent/CN110555517A/en
Pending legal-status Critical Current

Classifications

    • AHUMAN NECESSITIES
    • A63SPORTS; GAMES; AMUSEMENTS
    • A63FCARD, BOARD, OR ROULETTE GAMES; INDOOR GAMES USING SMALL MOVING PLAYING BODIES; VIDEO GAMES; GAMES NOT OTHERWISE PROVIDED FOR
    • A63F3/00Board games; Raffle games
    • A63F3/00643Electric board games; Electric features of board games
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F11/00Error detection; Error correction; Monitoring
    • G06F11/36Prevention of errors by analysis, debugging or testing of software
    • G06F11/3668Testing of software
    • G06F11/3672Test management
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/02Neural networks
    • G06N3/04Architecture, e.g. interconnection topology
    • G06N3/045Combinations of networks
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/02Neural networks
    • G06N3/08Learning methods
    • G06N3/084Backpropagation, e.g. using gradient descent

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Biomedical Technology (AREA)
  • Mathematical Physics (AREA)
  • Health & Medical Sciences (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Artificial Intelligence (AREA)
  • Software Systems (AREA)
  • Biophysics (AREA)
  • Computational Linguistics (AREA)
  • Data Mining & Analysis (AREA)
  • Evolutionary Computation (AREA)
  • General Health & Medical Sciences (AREA)
  • Molecular Biology (AREA)
  • Computing Systems (AREA)
  • Quality & Reliability (AREA)
  • Computer Hardware Design (AREA)
  • Multimedia (AREA)
  • Electrically Operated Instructional Devices (AREA)

Abstract

The invention proposes an improved chess game method based on AlphaGo Zero, which extends the applicability of the AlphaGo Zero method in the field of human-computer game playing and belongs to the field of robotics and science-and-technology entertainment. The method comprises the following steps: 1) adopting a hybrid network structure comprising CNN, ResNet and fully connected layers, which effectively avoids gradient dispersion and converges to the optimal position, and using a single training network to output the policy and the value estimate simultaneously; 2) adopting a reinforcement learning strategy that trains on the data generated by self-play (Self-Play), which addresses the large training scale of sequential data and optimizes the model during play; 3) training and optimizing the model with the neural network, defining the loss functions and selecting a corresponding optimizer to update iteratively in the direction of decreasing loss; 4) evaluating the network model by letting the newly trained model play against the model before training and measuring the current model's performance from the numbers of wins and losses to decide whether to iterate the model; 5) using third-party software for visual interactive game testing and evaluation.

Description

Improved Chess Game Method Based on AlphaGo Zero

Technical Field

The invention proposes an improved chess game method based on AlphaGo Zero, extends the application scope of the AlphaGo Zero method in the field of human-computer game playing, and belongs to the technical field of robotics and entertainment technology.

Background Art

Research on human-computer game-playing mechanisms and algorithms has never stopped since the birth of the computer. Human-computer game playing is an important branch of artificial intelligence; in the course of studying it, researchers have developed many new methods and ideas in artificial intelligence, including machine learning, which have had a profound impact on both social life and academic research.

The present invention chooses chess as the research example for human-computer play not only because of the endless appeal of the game itself, but also because the large search space of chess is harder to handle with traditional methods, which better demonstrates the performance of deep learning algorithms on problems whose game trees are too large. The main work of the invention is to follow the construction ideas, network structure and training methods of AlphaGo Lee and AlphaGo Zero, and to attempt to build a neural network model capable of playing chess and to train it by machine learning without any human game experience. In addition, the invention describes the core neural network structure of the chess program and the training mode of the model, and finally tests the resulting ChessAi engine through black-versus-white self-play and through games against the Chess Titans program, which is based on traditional search algorithms, to explore the general applicability of deep learning algorithms in the field of human-computer game playing.

Summary of the Invention

AlphaGo Zero uses a single neural network to perform the functions of both the policy network and the value network of AlphaGo Lee, and the present invention adopts a similar network structure. The network built by the model is not a plain convolutional neural network (CNN) but a hybrid network structure. The present invention adopts the following technical scheme: an improved chess game method based on AlphaGo Zero comprising the following steps:

1) Adopting a hybrid network structure comprising CNN, ResNet and fully connected layers, which effectively avoids gradient dispersion and converges to the optimal position, and using a single training network to output the policy and the value estimate simultaneously;

2) Adopting a reinforcement learning strategy that trains on the data generated by self-play (Self-Play), which addresses the large training scale of sequential data and optimizes the model during play;

3) Training and optimizing the model with the neural network, defining the loss functions and selecting a corresponding optimizer to update iteratively in the direction of decreasing loss;

4) Evaluating the network model: the newly trained model plays against the model before training, and the performance of the current model is measured from the numbers of wins and losses to decide whether to iterate the model;

5) Using third-party software for visual interactive game testing and evaluation.

In step 1), the ResNet network uses a "shortcut" design, so that the mapping we would otherwise fit with F(x) is replaced by the mapping H(x) = F(x) + x, which introduces and preserves richer reference information and lets the network learn richer content. ResNet also yields a smooth forward pass and a smooth backward residual pass. For a ResNet, considering a shallower layer l (whose input is x_l) and a deeper layer L, the following can be derived:

① x_{l+1} = x_l + F(x_l)

From ① it follows that:

② x_{l+2} = x_{l+1} + F(x_{l+1}) = x_l + F(x_l) + F(x_{l+1})

and hence the following general form:

x_L = x_l + Σ_{i=l}^{L-1} F(x_i)

The vector x_L of any later layer therefore contains a direct linear contribution from the x_l of any earlier layer. The backward residual propagation is similarly smooth: let the loss ε measure, for a given sample and label, the difference between x_L and its ideal value x_label at layer L; by the chain rule,

∂ε/∂x_l = (∂ε/∂x_L)·(∂x_L/∂x_l) = (∂ε/∂x_L)·(1 + ∂(Σ_{i=l}^{L-1} F(x_i))/∂x_l)

It follows that the residual produced at the output x_L of any layer is propagated back directly to the x_l of any earlier layer, so ResNet shows no obvious efficiency problem even as the number of layers grows. At the same time, the factors in the equation show that ∂ε/∂x_L and (∂ε/∂x_L)·∂(Σ_{i=l}^{L-1} F(x_i))/∂x_l are combined by linear superposition rather than by repeated multiplication, so gradient dispersion hardly occurs in ResNet. The hybrid network trained by the invention comprises CNN, ResNet and fully connected layers, contains a 19-layer ResNet structure, and uses a single training network to output the policy and the value estimate simultaneously, which simplifies the network structure, reduces the amount of computation, improves training efficiency and shortens training time.
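As an illustration of this kind of architecture, the following is a minimal sketch of a residual block and a dual-head (policy/value) network written with the Keras functional API, which matches the Keras/TensorFlow tooling mentioned in the embodiment; the filter counts, the 8×8 board-plane encoding and the move-space size are illustrative assumptions, not the exact configuration of the invention.

```python
# Minimal sketch of the hybrid CNN + ResNet + fully connected network with a
# shared trunk and two heads (policy and value). Filter counts, the 8x8 input
# plane encoding and the move-space size are illustrative assumptions.
from tensorflow.keras import layers, models

def residual_block(x, filters=128):
    # H(x) = F(x) + x: the shortcut keeps the identity path alongside F(x)
    shortcut = x
    y = layers.Conv2D(filters, 3, padding="same")(x)
    y = layers.BatchNormalization()(y)
    y = layers.Activation("relu")(y)
    y = layers.Conv2D(filters, 3, padding="same")(y)
    y = layers.BatchNormalization()(y)
    y = layers.Add()([shortcut, y])          # the "shortcut" connection
    return layers.Activation("relu")(y)

def build_dual_head_model(input_planes=18, n_moves=4096, n_res_blocks=19):
    board = layers.Input(shape=(8, 8, input_planes))
    x = layers.Conv2D(128, 3, padding="same", activation="relu")(board)
    for _ in range(n_res_blocks):            # 19 residual blocks, as in the text
        x = residual_block(x)
    # Policy head: probability distribution over moves
    p = layers.Conv2D(2, 1, activation="relu")(x)
    p = layers.Flatten()(p)
    policy = layers.Dense(n_moves, activation="softmax", name="policy")(p)
    # Value head: scalar win-rate estimate in [-1, 1]
    v = layers.Conv2D(1, 1, activation="relu")(x)
    v = layers.Flatten()(v)
    v = layers.Dense(128, activation="relu")(v)
    value = layers.Dense(1, activation="tanh", name="value")(v)
    return models.Model(board, [policy, value])
```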

In step 2), the entire reinforcement learning process is divided into three asynchronous sub-processes: self-play (Self-Play), model optimization (Optimize, Opt) and network evaluation (Evaluate, Eval).

Self-Play is driven jointly by Monte Carlo tree search (MCTS) and the constructed network model, and uses the more efficient value network for fast judgments. Rule supervision runs continuously during this process, guiding move generation and deciding the final result; after each game the record is formatted, archived and the board is re-initialized for the next self-play game. This process generates a large number of game records through self-play and thus a large amount of training data, providing the data resources for network optimization;

Opt is the training and optimization process of the model network, which optimizes the model using the game-record data from the Self-Play stage;

Eval is the evaluation process of the new model: after optimization has run for a period of time, simulated games are played, and the performance of the current model is measured from the numbers of wins and losses to decide whether to iterate the model.
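The interplay of the three sub-processes can be summarized by the following schematic loop (a hedged sketch: self_play_game, optimize_model and evaluate_models are placeholder callables standing in for the stages described above, not components defined by the invention, and the sequential loop stands in for what the invention runs asynchronously):

```python
# Schematic orchestration of the Self-Play / Opt / Eval cycle described above.
# The three callables are placeholders for the stages named in the text.
def reinforcement_learning_cycle(best_model, self_play_game, optimize_model,
                                 evaluate_models, iterations=10,
                                 games_per_iteration=500):
    replay_buffer = []
    for _ in range(iterations):
        # Self-Play: MCTS guided by the current best model generates game records
        for _ in range(games_per_iteration):
            replay_buffer.extend(self_play_game(best_model))
        # Opt: train a candidate model on the accumulated self-play records
        candidate = optimize_model(best_model, replay_buffer)
        # Eval: the candidate replaces the best model only if it wins often enough
        if evaluate_models(candidate, best_model, n_games=100) >= 0.55:
            best_model = candidate
    return best_model
```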

In step 3), the neural network trains and optimizes the model: the loss functions are defined and a corresponding optimizer is selected to update iteratively in the direction of decreasing loss. The optimizer is specified and the loss functions are defined after the model definition is completed and before training begins.

The invention uses the Adam optimizer with the learning rate set to 0.001, β1 set to 0.9, β2 set to 0.999, ε set to 10^-8 and the learning-rate decay set to 0.0. The Adam optimizer uses accumulated gradients instead of single gradients and self-adjusts the learning rate with the root mean square of the accumulated gradients, which helps the model converge in the right direction and speeds up convergence.

The policy-probability output of the policy network is given a multi-class logarithmic loss, and one-hot encoding is applied to the model's input data to reduce the network's confusion with cross-entropy-type loss functions such as the multi-class logarithmic loss. The win-rate estimate output by the value network is given a mean-squared-error loss; for a neural network with a single-valued output, the mean squared error is a common and effective loss function.
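Under the stated settings, the optimizer and losses could be specified roughly as follows (a sketch assuming the build_dual_head_model function from the earlier illustration; a learning-rate decay of 0.0 is simply the Keras default, so it is not passed explicitly):

```python
# Optimizer and loss definitions matching the hyperparameters given in the text:
# "policy" uses categorical cross-entropy (multi-class log loss) on one-hot
# targets, "value" uses mean squared error on the scalar win-rate estimate.
from tensorflow.keras.optimizers import Adam

model = build_dual_head_model()   # dual-head network from the earlier sketch
model.compile(
    optimizer=Adam(learning_rate=0.001, beta_1=0.9, beta_2=0.999, epsilon=1e-8),
    loss={"policy": "categorical_crossentropy", "value": "mean_squared_error"},
)
```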

In step 4), the network model is evaluated: the newly trained model plays against the model before training, and the performance of the current model is measured from the numbers of wins and losses to decide whether to iterate the model;

The original model must be backed up before training so that it can be used in this step. The training time is set indirectly through the number of epochs: after each epoch of training the model is evaluated, and each evaluation consists of 100 simulated games. Each game proceeds much as in the Self-Play stage, except that in every round the MCTS of the two sides is judged and updated by different model networks. After the games, the new model replaces the original model as the current best model only if its win rate reaches the iteration win-rate threshold (a hyperparameter, set to 0.55). Once the model iteration is complete, the process returns to the Self-Play stage, where the new model is used for further self-play and continuous optimization, steadily improving the network's playing strength.
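The evaluation gate itself reduces to a simple win-rate check; the sketch below assumes a play_one_game callable (not defined by the invention) that returns 1 for a win of the new model, 0.5 for a draw and 0 for a loss:

```python
# Hedged sketch of the evaluation step: 100 games between the newly trained
# model and the backed-up model; promotion only when the win rate reaches the
# 0.55 threshold given as a hyperparameter in the text.
def evaluate_and_maybe_promote(new_model, old_model, play_one_game,
                               n_games=100, threshold=0.55):
    score = sum(play_one_game(new_model, old_model) for _ in range(n_games))
    win_rate = score / n_games
    best = new_model if win_rate >= threshold else old_model
    return best, win_rate
```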

In step 5), third-party software is used for visual interactive game testing and evaluation.

The invention focuses on the construction and training of the network model; for the front-end interface, the third-party chess GUI Arena is used for visual interactive game testing. The chess engine built from the model network is identified as ChessAi in the GUI front end, and the Arena GUI and the ChessAi engine exchange positions and connect front end to back end using UCI and FEN (Forsyth-Edwards Notation) strings.
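To illustrate the FEN-based exchange, the snippet below uses the python-chess library purely as an example of the notation; it is not a component of the invention.

```python
# FEN strings carry the full board state between a GUI such as Arena and an
# engine; python-chess is used here only to illustrate the notation.
import chess

board = chess.Board()        # standard initial position
print(board.fen())           # rnbqkbnr/pppppppp/8/8/8/8/PPPPPPPP/RNBQKBNR w KQkq - 0 1
board.push_uci("e2e4")       # apply a move given in UCI notation
print(board.fen())           # FEN after 1. e4, as it would be sent to the GUI
restored = chess.Board(board.fen())   # either side can rebuild the position from FEN
```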

The invention uses the Elometer online Elo-rating system from the University of Duesseldorf to estimate the Elo rating of the ChessAi engine. The Elo rating system is a way of estimating the relative skill levels of players in zero-sum games; Elometer produces an estimated Elo rating from the answers to 76 puzzles.
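For reference, the expected score that underlies Elo-style ratings is shown below (a generic formula, not Elometer's internal implementation).

```python
# Expected score of player A against player B under the Elo model.
def elo_expected_score(rating_a, rating_b):
    return 1.0 / (1.0 + 10 ** ((rating_b - rating_a) / 400.0))

print(round(elo_expected_score(1600, 1500), 2))   # ~0.64: a 100-point edge scores ~64%
```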

By adopting the above technical scheme, the present invention has the following advantages:

The neural network trained by the invention is a hybrid structure comprising CNN, ResNet and fully connected layers. Downsampling, which avoids overfitting and reduces some computation, also causes the feature maps to become increasingly "blurred" as the number of network layers grows; this in turn triggers gradient dispersion during training and ultimately makes the model converge in the wrong direction or away from the optimum. The hybrid structure avoids these problems;

The hybrid neural network of the invention contains a 19-layer ResNet structure and uses a single training network to output the policy and the value estimate simultaneously, which simplifies the network structure while reducing the amount of computation, improving training efficiency and shortening training time.

Brief Description of the Drawings

Figure 1: model network structure of the invention;

Figure 2: reinforcement learning mode of the model network;

Figure 3: schematic diagram of the MCTS simulation;

Figure 4: flow of a single self-play game;

Figure 5: training and optimization of the model network;

Figure 6: GUI and running mode;

Figure 7: Elometer test feedback;

Figure 8: game test results (10 games).

Detailed Description of the Embodiments

The technical solutions in the embodiments of the present invention will be described clearly and completely below with reference to the accompanying drawings. Obviously, the described embodiments are only some, not all, of the embodiments of the present invention. Based on the embodiments of the present invention, all other embodiments obtained by persons of ordinary skill in the art without creative effort fall within the protection scope of the present invention.

Step 1: adopt a hybrid network structure comprising CNN, ResNet and fully connected layers, which effectively avoids gradient dispersion and converges to the optimal position, and use a single training network to output the policy and the value estimate simultaneously;

The ResNet network uses a "shortcut" design, so that the mapping we would otherwise fit with F(x) is replaced by the mapping H(x) = F(x) + x, which introduces and preserves richer reference information and lets the network learn richer content. ResNet also yields a smooth forward pass and a smooth backward residual pass. For a ResNet, considering a shallower layer l (whose input is x_l) and a deeper layer L, the following can be derived:

① x_{l+1} = x_l + F(x_l)

From ① it follows that:

② x_{l+2} = x_{l+1} + F(x_{l+1}) = x_l + F(x_l) + F(x_{l+1})

and hence the following general form:

x_L = x_l + Σ_{i=l}^{L-1} F(x_i)

The vector x_L of any later layer therefore contains a direct linear contribution from the x_l of any earlier layer. The backward residual propagation is similarly smooth: let the loss ε measure, for a given sample and label, the difference between x_L and its ideal value x_label at layer L; by the chain rule,

∂ε/∂x_l = (∂ε/∂x_L)·(∂x_L/∂x_l) = (∂ε/∂x_L)·(1 + ∂(Σ_{i=l}^{L-1} F(x_i))/∂x_l)

It follows that the residual produced at the output x_L of any layer is propagated back directly to the x_l of any earlier layer, so ResNet shows no obvious efficiency problem even as the number of layers grows. At the same time, the factors in the equation show that ∂ε/∂x_L and (∂ε/∂x_L)·∂(Σ_{i=l}^{L-1} F(x_i))/∂x_l are combined by linear superposition rather than by repeated multiplication, so gradient dispersion hardly occurs in ResNet. The hybrid network trained by the invention comprises CNN, ResNet and fully connected layers, contains a 19-layer ResNet structure, and uses a single training network to output the policy and the value estimate simultaneously, which simplifies the network structure, reduces the amount of computation, improves training efficiency and shortens training time.

Step 2: as shown in Figure 2, the entire reinforcement learning process is divided into three asynchronous sub-processes: self-play (Self-Play), model optimization (Optimize, Opt) and network evaluation (Evaluate, Eval).

The reinforcement learning process in detail:

(1) Self-Play

Self-Play is driven jointly by Monte Carlo tree search (MCTS) and the constructed network model; the MCTS used during Self-Play relies on the more efficient value network for fast judgments. As shown in the MCTS simulation of Figure 3, MCTS and the model network drive the playing network, while rule supervision runs continuously during this process, guiding move generation and deciding the final result; after each game the record is formatted, archived and the board is re-initialized for the next self-play game. Figure 4 shows the flow of a single self-play game. This process generates a large number of game records through self-play and thus a large amount of training data, providing the data resources for network optimization;
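A single self-play game of the kind shown in Figure 4 can be sketched as follows (run_mcts and the board object with its methods are assumptions standing in for the MCTS and rule-supervision components described above, not interfaces defined by the invention):

```python
# Hedged sketch of one self-play game: MCTS guided by the network chooses each
# move, the rule layer detects the terminal state, and the finished game is
# formatted and archived as (state, search probabilities, outcome) records.
def self_play_one_game(model, run_mcts, board, n_simulations=800):
    history = []                                    # (encoded state, pi) per move
    while not board.is_game_over():                 # rule supervision: terminal check
        pi = run_mcts(model, board, n_simulations)  # move distribution from visit counts
        history.append((board.encoded_state(), pi))
        board.play(max(pi, key=pi.get))             # play the most-visited move
    z = board.result_value()                        # +1 / 0 / -1 outcome of the game
    records = [(state, pi, z) for state, pi in history]  # archive for training
    board.reset()                                   # re-initialize for the next game
    return records
```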

(2) Model Optimization (Optimize, Opt)

Opt is the training and optimization process of the model network, which optimizes the model using the game-record data from the Self-Play stage.

(3) Network Evaluation (Evaluate, Eval)

Eval is the evaluation process of the new model: after optimization has run for a period of time, simulated games are played, and the evaluation result decides whether to continue optimization or to iterate the model. In network evaluation, the newly trained model plays against the model before training, and the performance of the current model is measured from the numbers of wins and losses to decide whether to iterate the model.

Step 3: the invention trains and optimizes the model with the neural network, defines the loss functions and selects a corresponding optimizer to update iteratively in the direction of decreasing loss; the optimizer is specified and the loss functions are defined after the model definition is completed and before training begins.

This specifically comprises the following steps:

(1) The invention uses the Adam optimizer with the learning rate set to 0.001, β1 set to 0.9, β2 set to 0.999, ε set to 10^-8 and the learning-rate decay set to 0.0. The Adam optimizer uses accumulated gradients instead of single gradients and self-adjusts the learning rate with the root mean square of the accumulated gradients, which helps the model converge in the right direction and speeds up convergence.

(2) The policy-probability output of the policy network is given a multi-class logarithmic loss, and one-hot encoding is applied to the model's input data to reduce the network's confusion with cross-entropy-type loss functions such as the multi-class logarithmic loss. The win-rate estimate output by the value network is given a mean-squared-error loss; for a neural network with a single-valued output, the mean squared error is a common and effective loss function.

Step 4: evaluate the network model by letting the newly trained model play against the model before training and measuring the current model's performance from the numbers of wins and losses to decide whether to iterate the model. Figure 5 shows the training and optimization of the model network. The original model must be backed up before training so that it can be used in this step. The training time is set indirectly through the number of epochs: after each epoch of training the model is evaluated, and each evaluation consists of 100 simulated games. Each game proceeds much as in the Self-Play stage, except that in every round the MCTS of the two sides is judged and updated by different model networks. After the games, the new model replaces the original model as the current best model only if its win rate reaches the iteration win-rate threshold (a hyperparameter, set to 0.55). Once the model iteration is complete, the process returns to the Self-Play stage, where the new model is used for further self-play and continuous optimization, steadily improving the network's playing strength.

Step 5: the invention focuses on the implementation and training of the model network; for the front-end interface, the third-party chess GUI Arena is used for visual interactive game testing, and the Elometer online Elo rating from the University of Duesseldorf is used to estimate the Elo rating of the ChessAi engine.

The chess engine built from the model network is identified as ChessAi in the GUI front end; the Arena GUI and the ChessAi engine exchange positions and connect front end to back end using UCI and FEN (Forsyth-Edwards Notation) strings. Figure 6 shows the front-end presentation of the Arena GUI and the deployment and running structure of the whole project: the GPU versions of Keras and TensorFlow perform the distributed data computation, and the master CPU automatically allocates computing resources and handles threading. Elometer produces an estimated Elo rating from the answers to 76 puzzles; the test feedback is shown in Figure 7, with a final Elo rating of 1545 and a 95% confidence interval of [1392, 1698].

In the actual game-testing stage, the invention used the ChessAi engine to play ten test games, comprising six ChessAi self-play games and four games between ChessAi and Chess Titans; the final results are shown in Figure 8.

Claims (7)

1. An improved chess game method based on AlphaGo Zero, comprising the following steps:
1) adopting a hybrid network structure comprising CNN, ResNet and fully connected layers, which effectively avoids gradient dispersion and converges to the optimal position, and using a single training network to output the policy and the value estimate simultaneously;
2) adopting a reinforcement learning strategy that trains on the data generated by self-play (Self-Play), which addresses the large training scale of sequential data and optimizes the model during play;
3) training and optimizing the model with a neural network, defining the loss functions and selecting a corresponding optimizer to update iteratively in the direction of decreasing loss;
4) evaluating the network model, namely letting the newly trained model play against the model before training and measuring the performance of the current model from the numbers of wins and losses to decide whether to iterate the model;
5) using third-party software for visual interactive game testing and evaluation.
2. The improved chess game method based on AlphaGo Zero according to claim 1, wherein in step 1) the ResNet network uses a "shortcut" design, so that the mapping that would otherwise be fitted with F(x) is replaced by the mapping H(x) = F(x) + x, which introduces and preserves richer reference information and lets the network learn richer content; ResNet also yields a smooth forward pass and a smooth backward residual pass; for a ResNet, considering a shallower layer l (whose input is x_l) and a deeper layer L, the following can be derived:

① x_{l+1} = x_l + F(x_l)

from which it follows that:

② x_{l+2} = x_{l+1} + F(x_{l+1}) = x_l + F(x_l) + F(x_{l+1})

and hence the general form:

x_L = x_l + Σ_{i=l}^{L-1} F(x_i)

so that the vector x_L of any later layer contains a direct linear contribution from the x_l of any earlier layer; the backward residual propagation is similarly smooth: letting the loss ε measure, for a given sample and label, the difference between x_L and its ideal value x_label at layer L, the chain rule gives

∂ε/∂x_l = (∂ε/∂x_L)·(∂x_L/∂x_l) = (∂ε/∂x_L)·(1 + ∂(Σ_{i=l}^{L-1} F(x_i))/∂x_l),

which shows that the residual produced at the output x_L of any layer is propagated back directly to the x_l of any earlier layer, so ResNet shows no obvious efficiency problem as the number of layers grows; at the same time the factors of the equation combine by linear superposition rather than by repeated multiplication, so gradient dispersion hardly occurs in ResNet; the hybrid network trained by the invention comprises CNN, ResNet and fully connected layers, contains a 19-layer ResNet structure, and uses a single training network to output the policy and the value estimate simultaneously, which simplifies the network structure, reduces the amount of computation, improves training efficiency and shortens training time.
3. The improved chess game method based on AlphaGo Zero according to claim 1, wherein in step 2) the entire reinforcement learning process is divided into three asynchronous sub-processes, self-play (Self-Play), model optimization (Optimize, Opt) and network evaluation (Evaluate, Eval), with the following specific steps:
Step S21) Self-Play is driven jointly by Monte Carlo tree search (MCTS) and the constructed network model; this process generates a large number of game records through self-play and thus a large amount of training data, providing the data resources for network optimization;
Step S22) Opt is the training and optimization process of the model network, which optimizes the model using the game-record data from the Self-Play stage;
Step S23) Eval is the evaluation process of the new model: after optimization has run for a period of time, simulated games are played, and the evaluation result decides whether to continue optimization or to iterate the model.
4. The improved chess game method based on AlphaGo Zero according to claim 3, wherein the MCTS in the Self-Play process of step S21) uses the more efficient value network for fast judgments; MCTS and the model network drive the playing network, while rule supervision runs continuously during this process, guiding move generation and deciding the final result; after each game the record is formatted, archived and the board is re-initialized for the next self-play game.
5. The improved chess game method based on AlphaGo Zero according to claim 3, wherein in steps S22) and S23) network evaluation means that the newly trained model plays against the model before training, and the performance of the current model is measured from the numbers of wins and losses to decide whether to iterate the model.
6. The improved chess game method based on AlphaGo Zero according to claim 1, wherein in step 3) the optimizer is specified and the loss functions are defined after the model definition is completed and before training begins, comprising the following steps:
Step S31) the invention uses the Adam optimizer with the learning rate set to 0.001, β1 set to 0.9, β2 set to 0.999, ε set to 10^-8 and the learning-rate decay set to 0.0; the Adam optimizer uses accumulated gradients instead of single gradients and self-adjusts the learning rate with the root mean square of the accumulated gradients, which helps the model converge in the right direction and speeds up convergence;
Step S32) the policy-probability output of the policy network is given a multi-class logarithmic loss, and one-hot encoding is applied to the model's input data to reduce the network's confusion with cross-entropy-type loss functions such as the multi-class logarithmic loss; the win-rate estimate output by the value network is given a mean-squared-error loss, the mean squared error being a common and effective loss function for a neural network with a single-valued output.
7. The improved chess game method based on AlphaGo Zero according to claim 1, wherein in step 4) the original model is backed up before training for use in this step; the training time is set indirectly through the number of epochs, the model being evaluated after each epoch of training, each evaluation consisting of 100 simulated games; each game proceeds much as in the Self-Play stage, except that in every round the MCTS of the two sides is judged and updated by different model networks; after the games, the new model replaces the original model as the optimal model only if its win rate reaches the iteration win-rate threshold (a hyperparameter, set to 0.55); once the model iteration is complete, the process returns to the Self-Play stage, where the new model is used for further self-play and continuous optimization, steadily improving the network's playing strength.
CN201910837810.5A 2019-09-05 2019-09-05 Improved chess game method based on Alphago Zero Pending CN110555517A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201910837810.5A CN110555517A (en) 2019-09-05 2019-09-05 Improved chess game method based on Alphago Zero

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201910837810.5A CN110555517A (en) 2019-09-05 2019-09-05 Improved chess game method based on Alphago Zero

Publications (1)

Publication Number Publication Date
CN110555517A true CN110555517A (en) 2019-12-10

Family

ID=68739182

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201910837810.5A Pending CN110555517A (en) 2019-09-05 2019-09-05 Improved chess game method based on Alphago Zero

Country Status (1)

Country Link
CN (1) CN110555517A (en)


Cited By (13)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN113011575A (en) * 2019-12-19 2021-06-22 华为技术有限公司 Neural network model update method, image processing method and device
WO2021120719A1 (en) * 2019-12-19 2021-06-24 华为技术有限公司 Neural network model update method, and image processing method and device
CN111330255A (en) * 2020-01-16 2020-06-26 北京理工大学 Amazon chess-calling generation method based on deep convolutional neural network
CN111330255B (en) * 2020-01-16 2021-06-08 北京理工大学 A method for generating Amazon chess moves based on deep convolutional neural networks
CN111667075A (en) * 2020-06-12 2020-09-15 杭州浮云网络科技有限公司 Service execution method, device and related equipment
CN112016704A (en) * 2020-10-30 2020-12-01 超参数科技(深圳)有限公司 AI model training method, model using method, computer device and storage medium
CN113127704A (en) * 2021-03-11 2021-07-16 西安电子科技大学 Monte Carlo tree searching method, system and application
CN113127704B (en) * 2021-03-11 2022-11-01 西安电子科技大学 A Monte Carlo tree search method, system and application
CN114425773A (en) * 2021-11-18 2022-05-03 南京师范大学 Game playing robot system based on deep learning and godson group and game playing method thereof
CN114425773B (en) * 2021-11-18 2024-03-26 南京师范大学 Deep learning and Loongson group-based playing robot system and playing method thereof
CN116881656A (en) * 2023-07-06 2023-10-13 南华大学 Reinforced learning military chess AI system based on deep Monte Carlo
CN116881656B (en) * 2023-07-06 2024-03-22 南华大学 A reinforcement learning military chess AI system based on deep Monte Carlo
CN116808592A (en) * 2023-07-12 2023-09-29 浙江大学 A weighted loss-based ELO evaluation method and device

Similar Documents

Publication Publication Date Title
CN110555517A (en) Improved chess game method based on Alphago Zero
Justesen et al. Illuminating generalization in deep reinforcement learning through procedural level generation
CN108629422B (en) Intelligent learning method based on knowledge guidance-tactical perception
CN108921298B (en) Multi-agent communication and decision-making method for reinforcement learning
CN112044076B (en) Object control method and device and computer readable storage medium
CN109284812B (en) Video game simulation method based on improved DQN
CN109740741A (en) A reinforcement learning method combined with knowledge transfer and its application to autonomous skills learning method of unmanned vehicles
CN111729300A (en) Research method of fighting landlord strategy based on Monte Carlo tree search and convolutional neural network
CN114404975B (en) Training method, device, equipment, storage medium and program product of decision model
CN116128060A (en) A Chess Game Method Based on Opponent Modeling and Monte Carlo Reinforcement Learning
CN112016704A (en) AI model training method, model using method, computer device and storage medium
CN114048834A (en) Method and device for continuous reinforcement learning game with incomplete information based on retrospective and incremental expansion
CN111282272A (en) Information processing method, computer readable medium and electronic device
CN118014013A (en) Deep reinforcement learning quick search game method and system based on priori policy guidance
CN118761902B (en) Method and system for migrating arbitrary picture style based on deep reinforcement learning
CN116205298A (en) A method and system for modeling opponent's behavior strategy based on deep reinforcement learning
CN112274935A (en) AI model training method, use method, computer device and storage medium
CN111639756B (en) A multi-agent reinforcement learning method based on game reduction
CN116644666A (en) Path Planning and Guidance Method for Virtual Assembly Based on Strategy Gradient Optimization Algorithm
CN113689001B (en) A virtual self-playing method and device based on counterfactual regret minimization
CN114154397B (en) Implicit opponent modeling method based on deep reinforcement learning
CN115933717A (en) UAV intelligent air combat maneuver decision-making training system and method based on deep reinforcement learning
CN114404976A (en) Method and device for training decision model, computer equipment and storage medium
CN111001161B (en) Game strategy obtaining method based on second-order back propagation priority
Wang Artificial Intelligence Gamers Based on Deep Reinforcement Learning

Legal Events

PB01: Publication

WD01: Invention patent application deemed withdrawn after publication (application publication date: 20191210)