CN110555517A - Improved chess game method based on Alphago Zero - Google Patents
Improved chess game method based on Alphago Zero
- Publication number
- CN110555517A (application CN201910837810.5A)
- Authority
- CN
- China
- Prior art keywords
- model
- network
- training
- self
- play
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Pending
Classifications
-
- A—HUMAN NECESSITIES
- A63—SPORTS; GAMES; AMUSEMENTS
- A63F—CARD, BOARD, OR ROULETTE GAMES; INDOOR GAMES USING SMALL MOVING PLAYING BODIES; VIDEO GAMES; GAMES NOT OTHERWISE PROVIDED FOR
- A63F3/00—Board games; Raffle games
- A63F3/00643—Electric board games; Electric features of board games
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F11/00—Error detection; Error correction; Monitoring
- G06F11/36—Prevention of errors by analysis, debugging or testing of software
- G06F11/3668—Testing of software
- G06F11/3672—Test management
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06N—COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
- G06N3/00—Computing arrangements based on biological models
- G06N3/02—Neural networks
- G06N3/04—Architecture, e.g. interconnection topology
- G06N3/045—Combinations of networks
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06N—COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
- G06N3/00—Computing arrangements based on biological models
- G06N3/02—Neural networks
- G06N3/08—Learning methods
- G06N3/084—Backpropagation, e.g. using gradient descent
Landscapes
- Engineering & Computer Science (AREA)
- Theoretical Computer Science (AREA)
- Physics & Mathematics (AREA)
- General Engineering & Computer Science (AREA)
- General Physics & Mathematics (AREA)
- Biomedical Technology (AREA)
- Mathematical Physics (AREA)
- Health & Medical Sciences (AREA)
- Life Sciences & Earth Sciences (AREA)
- Artificial Intelligence (AREA)
- Software Systems (AREA)
- Biophysics (AREA)
- Computational Linguistics (AREA)
- Data Mining & Analysis (AREA)
- Evolutionary Computation (AREA)
- General Health & Medical Sciences (AREA)
- Molecular Biology (AREA)
- Computing Systems (AREA)
- Quality & Reliability (AREA)
- Computer Hardware Design (AREA)
- Multimedia (AREA)
- Electrically Operated Instructional Devices (AREA)
Abstract
The invention proposes an improved chess game method based on AlphaGo Zero, which extends the range of application of the AlphaGo Zero method in the field of human-computer game playing and belongs to the field of robotic science-and-technology entertainment. The method comprises the following steps: 1) adopt a hybrid network structure comprising CNN, ResNet and fully connected layers, which effectively avoids vanishing gradients and converges to the optimal position, and use a single training network to output the policy and the value estimate simultaneously; 2) adopt a reinforcement learning strategy, training on data produced by self-play to cope with the large training scale of sequential data, and optimize the model during play; 3) train and optimize the neural network model: define the loss function and select a suitable optimizer to update the parameters iteratively in the direction of decreasing loss; 4) evaluate the network model: play the newly trained model against the model before training, and use the win/loss record to measure the current model's performance and decide whether to iterate the model; 5) use third-party software for visual, interactive game testing and evaluation.
Description
Technical Field
The invention proposes an improved chess game method based on AlphaGo Zero, extends the range of application of the AlphaGo Zero method in the field of human-computer game playing, and belongs to the technical field of robotic science-and-technology entertainment.
Background Art
Research on human-computer game mechanisms and algorithms has continued without pause since the birth of the computer. Human-computer game playing is an important branch of artificial intelligence; in studying it, researchers have developed many new methods and ideas in artificial intelligence, including machine learning, with a profound impact on both social life and academic research.

The invention chooses chess as its research example of human-computer game playing not only because of the endless appeal of the game itself, but also because the large search space of chess is harder to handle with traditional methods, which better showcases the strength of deep learning algorithms on problems whose game trees are too large. The main work of the invention is to follow the design ideas, network structure and training method of AlphaGo Lee and AlphaGo Zero, attempting to construct a neural network model capable of playing chess and to train it by machine learning without human experience. In addition, the invention introduces the core neural network structure of the chess program and the training mode of the model, and finally carries out game tests through black-versus-white self-play of the ChessAi engine and games against the Chess Titans chess program, which is based on traditional search algorithms, to fully explore the general applicability of deep learning algorithms in the field of human-computer game playing.
Summary of the Invention
AlphaGo Zero uses a single neural network to perform the functions of the policy network and the value network of AlphaGo Lee, and the invention adopts a similar network structure. The network built by the model is not a plain convolutional neural network (CNN) but a hybrid network structure. The invention adopts the following technical scheme. An improved chess game method based on AlphaGo Zero comprises the following steps:

1) Adopt a hybrid network structure comprising CNN, ResNet and fully connected layers, which effectively avoids vanishing gradients and converges to the optimal position, and use a single training network to output the policy and the value estimate simultaneously;

2) Adopt a reinforcement learning strategy, training on data produced by self-play to cope with the large training scale of sequential data, and optimize the model during play;

3) Train and optimize the neural network model: define the loss function and select a suitable optimizer to update the parameters iteratively in the direction of decreasing loss;

4) Evaluate the network model: play the newly trained model against the model before training, and use the win/loss record to measure the current model's performance and decide whether to iterate the model;

5) Use third-party software for visual, interactive game testing and evaluation.
In step 1), the ResNet network uses a "shortcut" design: the mapping that we would otherwise expect F(x) to fit is replaced by the mapping H(x) = F(x) + x, which introduces and preserves richer reference information so that the network can learn richer content. Using ResNet also gives a smooth forward pass and a smooth backward residual pass. For a residual stack running from a lower layer l (whose input is x_l) to a deeper layer L, we can derive:

① x_{l+1} = x_l + F(x_l)

and then:

② x_{l+2} = x_{l+1} + F(x_{l+1}) = x_l + F(x_l) + F(x_{l+1})

so the following general form is obtained:

③ x_L = x_l + \sum_{i=l}^{L-1} F(x_i)

The vector x_L of any later layer therefore contains a linear contribution from the x_l of any earlier layer. The backward residual pass is similarly smooth. Define the loss from the residual between x_L and x_label, where x_label denotes the value that layer L should ideally produce for a given sample and label; by the chain rule:

④ \partial loss/\partial x_l = (\partial loss/\partial x_L)\cdot(\partial x_L/\partial x_l) = (\partial loss/\partial x_L)\cdot\left(1 + \partial\left(\sum_{i=l}^{L-1} F(x_i)\right)/\partial x_l\right)

It follows that the residual produced at the output x_L of any layer can be passed back directly to the x_l of any earlier layer; because the transfer is "direct", ResNet shows no obvious efficiency problem as the number of layers grows. At the same time, in the factor 1 + \partial\left(\sum_{i=l}^{L-1} F(x_i)\right)/\partial x_l the constant 1 and the gradient of the summed residual branches are added linearly rather than multiplied, so ResNet hardly suffers from vanishing gradients. The hybrid network trained by the invention comprises CNN, ResNet and fully connected layers, contains a 19-layer ResNet structure, and uses a single training network to output the policy and the value estimate simultaneously, which simplifies the network structure while also reducing the amount of computation, improving training efficiency and shortening training time.
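As an illustration of the dual-output residual network described above, a minimal Keras sketch follows; the layer widths, the 8×8 board-plane encoding, the policy-head size and the reading of "19-layer ResNet" as 19 residual blocks are illustrative assumptions rather than values specified by the invention:

```python
from tensorflow.keras import layers, Model

def residual_block(x, filters=256):
    """Two 3x3 convolutions plus the 'shortcut': H(x) = F(x) + x."""
    shortcut = x
    y = layers.Conv2D(filters, 3, padding="same")(x)
    y = layers.BatchNormalization()(y)
    y = layers.ReLU()(y)
    y = layers.Conv2D(filters, 3, padding="same")(y)
    y = layers.BatchNormalization()(y)
    y = layers.Add()([y, shortcut])      # identity connection carried forward
    return layers.ReLU()(y)

def build_model(planes=18, policy_size=4672, n_res_blocks=19):
    """Single network with two heads: move probabilities (policy) and win rate (value)."""
    board = layers.Input(shape=(8, 8, planes))            # assumed board encoding
    x = layers.Conv2D(256, 3, padding="same", activation="relu")(board)
    for _ in range(n_res_blocks):                         # one block per "ResNet layer"
        x = residual_block(x)
    p = layers.Conv2D(2, 1, activation="relu")(x)         # policy head
    policy = layers.Dense(policy_size, activation="softmax",
                          name="policy")(layers.Flatten()(p))
    v = layers.Conv2D(1, 1, activation="relu")(x)         # value head
    v = layers.Dense(256, activation="relu")(layers.Flatten()(v))
    value = layers.Dense(1, activation="tanh", name="value")(v)
    return Model(board, [policy, value])
```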
In step 2), the entire reinforcement learning process is divided into three asynchronous sub-processes: self-play (Self-Play), model optimization (Optimize, Opt) and network evaluation (Evaluate, Eval).

Self-Play is driven jointly by Monte Carlo tree search (MCTS) and the constructed network model, with the more efficient value network used for fast judgments, while rule supervision runs continuously throughout the process to guide the moves and decide the final result; after each game the record is formatted and archived and the board is reinitialized for the next self-play game. This process produces a large number of game records through self-play and thereby obtains a large amount of training data, providing the training-data resource for network optimization (a sketch of one such self-play game is given after the three sub-processes);

Opt is the training and optimization process of the model network, which optimizes the model using the game-record data from the Self-Play stage;

Eval is the evaluation process of the new model: after optimization has run for a period of time, simulated games are played, and the current model's performance is obtained from the win/loss record to decide whether to iterate the model.
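The following pseudocode-level sketch illustrates one self-play game as referenced above. The python-chess package stands in for the rules engine, and `mcts_search` and `encode` are hypothetical callables representing the MCTS+network move driver and the board encoder; none of these identifiers comes from the invention:

```python
import chess
import numpy as np

def self_play_game(mcts_search, encode, max_moves=512):
    """Play one self-play game and return (encoded_state, search_policy, outcome)
    training examples, with the rules engine supervising legality and the result."""
    board = chess.Board()                      # initialized position
    history = []
    while not board.is_game_over() and len(history) < max_moves:
        moves, probs = mcts_search(board)      # legal moves and their visit-count policy
        history.append((encode(board), (moves, probs)))
        board.push(np.random.choice(moves, p=probs))   # sample a move from the policy
    # terminal result from the rules engine, from White's point of view
    z = {"1-0": 1.0, "0-1": -1.0, "1/2-1/2": 0.0}.get(board.result(), 0.0)
    return [(state, pi, z) for state, pi in history]   # archived as training data
```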
In step 3), the neural network training optimizes the model: the loss function is defined and a suitable optimizer is selected to update the parameters iteratively in the direction of decreasing loss; the optimizer is specified and the loss function defined after the model definition is completed and before training starts.

The invention uses the Adam optimizer with the learning rate set to 0.001, β1 set to 0.9, β2 set to 0.999, ε set to 10^-8, and the learning-rate decay set to 0.0. The Adam optimizer uses accumulated gradients instead of single gradients and self-adjusts the learning rate with the root mean square of the accumulated gradients, which helps the model converge in the right direction and speeds up convergence.

The policy-probability output of the policy network is given a multi-class logarithmic loss, and one-hot encoding is used on the model's input data to reduce interference with cross-entropy-type loss functions such as the multi-class logarithmic loss. For the position win-rate estimate output by the value network, the loss function is defined as the mean squared error; for a neural network with a single-valued output, the mean squared error is a common and effective loss function.
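Assuming the model is built with Keras (the embodiment later mentions Keras and TensorFlow), the optimizer and loss configuration described above could be written as the following sketch; `build_model` refers to the earlier network sketch, and the exact keyword names depend on the Keras version:

```python
from tensorflow.keras.optimizers import Adam

model = build_model()   # dual-head network from the earlier sketch
model.compile(
    optimizer=Adam(learning_rate=0.001, beta_1=0.9, beta_2=0.999,
                   epsilon=1e-8),                # learning-rate decay left at its 0.0 default
    loss={"policy": "categorical_crossentropy",  # multi-class log loss on one-hot move targets
          "value": "mean_squared_error"})        # MSE on the scalar win-rate estimate
```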
In step 4), for network model evaluation, the newly trained model plays against the model before training, and the current model's performance is obtained from the win/loss record to decide whether to iterate the model;

Before model training, the original model is backed up for use in this step. The training time is set indirectly through the number of epochs: after each epoch of training the model is evaluated, and each evaluation consists of 100 simulated games. Each game proceeds as in the Self-Play stage, except that in every round the MCTS of the two sides is judged and updated by different model networks. When the games are finished, the new model replaces the original model as the best model if its win rate reaches the iteration win-rate threshold (a hyperparameter, set to 0.55). Once the model iteration is complete, the process returns to the Self-Play stage, where the new model is used for self-play and continued optimization, so that the network's playing strength keeps improving.
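A sketch of this evaluation gate is given below; `play_one_game(white, black)` is a hypothetical helper returning 1.0 if White wins, 0.5 for a draw and 0.0 if Black wins, and both the colour alternation and the half-point scoring of draws are assumptions rather than details stated by the invention:

```python
def evaluate_candidate(candidate, best, play_one_game, n_games=100, threshold=0.55):
    """Play the newly trained model against the current best model and decide
    whether the candidate replaces it (the model iteration in the text)."""
    score = 0.0
    for game in range(n_games):
        if game % 2 == 0:                                  # alternate colours
            score += play_one_game(white=candidate, black=best)
        else:
            score += 1.0 - play_one_game(white=best, black=candidate)
    win_rate = score / n_games
    return win_rate >= threshold       # True: candidate becomes the new best model
```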
In step 5), third-party software is used for visual, interactive game testing and evaluation.

The invention focuses on the construction and training of the network model; for the front-end interface, the third-party Chess GUI Arena is used for visual, interactive game testing. The chess engine built from the model network is identified as ChessAi in the GUI front end, and the Arena GUI and the ChessAi engine exchange UCI FEN (Forsyth-Edwards Notation) strings for front-end/back-end connection and position interaction.
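As an illustration of the FEN position strings exchanged between the GUI and the engine, a small sketch using the python-chess package follows (the choice of package is an assumption; the invention does not name a parsing library):

```python
import chess

# Starting position in Forsyth-Edwards Notation, as the GUI would send it
START_FEN = "rnbqkbnr/pppppppp/8/8/8/8/PPPPPPPP/RNBQKBNR w KQkq - 0 1"

board = chess.Board(START_FEN)   # rebuild the position from the FEN string
board.push_san("e4")             # engine plays a move
print(board.fen())               # send the updated position back as FEN
```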
The invention uses the Elometer online Elo rating system from the University of Duesseldorf to obtain an Elo rating for the ChessAi engine. The Elo rating system is a method for estimating the relative skill levels of players in zero-sum games; Elometer gives an estimate of the Elo rating after 76 puzzles have been solved.
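For reference, the Elo model underlying such a rating predicts a player's expected score from the rating difference; a minimal sketch of the standard formulas (not taken from the invention) is:

```python
def elo_expected(rating_a, rating_b):
    """Expected score of player A against player B under the Elo model."""
    return 1.0 / (1.0 + 10.0 ** ((rating_b - rating_a) / 400.0))

def elo_update(rating, expected, score, k=20.0):
    """Rating update after one game; score is 1 for a win, 0.5 for a draw, 0 for a loss."""
    return rating + k * (score - expected)
```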
Because it adopts the above technical scheme, the invention has the following advantages:

The neural network trained by the invention is a hybrid network structure comprising CNN, ResNet and fully connected layers. Downsampling, which avoids overfitting and saves some computation, also makes the feature maps increasingly "blurred" as the number of network layers grows; this in turn causes vanishing gradients during training and ultimately makes the model converge in the wrong direction or at a position away from the optimum, problems that the hybrid structure avoids;

The hybrid neural network of the invention contains a 19-layer ResNet structure and uses a single training network to output the policy and the value estimate simultaneously, which simplifies the network structure while also reducing the amount of computation, improving training efficiency and shortening training time.
Brief Description of the Drawings
Fig. 1 Model network structure of the invention;
Fig. 2 Reinforcement learning mode of the model network;
Fig. 3 Schematic diagram of the MCTS simulation;
Fig. 4 Flow of a single self-play game;
Fig. 5 Model network training and optimization;
Fig. 6 GUI and running mode;
Fig. 7 Elometer test feedback;
Fig. 8 Game test results (10 games).
Detailed Description of the Embodiments

The technical solutions in the embodiments of the invention are described below clearly and completely with reference to the accompanying drawings. Evidently, the described embodiments are only some, not all, of the embodiments of the invention. All other embodiments obtained by persons of ordinary skill in the art from the embodiments of the invention without creative effort fall within the protection scope of the invention.
Step 1: adopt a hybrid network structure comprising CNN, ResNet and fully connected layers, which effectively avoids vanishing gradients and converges to the optimal position, and use a single training network to output the policy and the value estimate simultaneously.
The ResNet network uses a "shortcut" design: the mapping that we would otherwise expect F(x) to fit is replaced by the mapping H(x) = F(x) + x, which introduces and preserves richer reference information so that the network can learn richer content. Using ResNet also gives a smooth forward pass and a smooth backward residual pass. For a residual stack running from a lower layer l (whose input is x_l) to a deeper layer L, we can derive:

① x_{l+1} = x_l + F(x_l)

and then:

② x_{l+2} = x_{l+1} + F(x_{l+1}) = x_l + F(x_l) + F(x_{l+1})

so the following general form is obtained:

③ x_L = x_l + \sum_{i=l}^{L-1} F(x_i)

The vector x_L of any later layer therefore contains a linear contribution from the x_l of any earlier layer. The backward residual pass is similarly smooth. Define the loss from the residual between x_L and x_label, where x_label denotes the value that layer L should ideally produce for a given sample and label; by the chain rule:

④ \partial loss/\partial x_l = (\partial loss/\partial x_L)\cdot(\partial x_L/\partial x_l) = (\partial loss/\partial x_L)\cdot\left(1 + \partial\left(\sum_{i=l}^{L-1} F(x_i)\right)/\partial x_l\right)

It follows that the residual produced at the output x_L of any layer can be passed back directly to the x_l of any earlier layer; because the transfer is "direct", ResNet shows no obvious efficiency problem as the number of layers grows. At the same time, in the factor 1 + \partial\left(\sum_{i=l}^{L-1} F(x_i)\right)/\partial x_l the constant 1 and the gradient of the summed residual branches are added linearly rather than multiplied, so ResNet hardly suffers from vanishing gradients. The hybrid network trained by the invention comprises CNN, ResNet and fully connected layers, contains a 19-layer ResNet structure, and uses a single training network to output the policy and the value estimate simultaneously, which simplifies the network structure while also reducing the amount of computation, improving training efficiency and shortening training time.
Step 2: as shown in Fig. 2, the entire reinforcement learning process is divided into three asynchronous sub-processes: self-play (Self-Play), model optimization (Optimize, Opt) and network evaluation (Evaluate, Eval).

The reinforcement learning process in detail:

(1) Self-play (Self-Play)

Self-Play is driven jointly by Monte Carlo tree search (MCTS) and the constructed network model, and the MCTS in the Self-Play process uses the more efficient value network for fast judgments. As shown in the MCTS simulation of Fig. 3, MCTS and the model network drive the playing network, while rule supervision runs continuously throughout the process to guide the moves and decide the final result; after each game the record is formatted and archived and the board is reinitialized for the next self-play game. Fig. 4 shows the flow of a single self-play game. This process produces a large number of game records through self-play and thereby obtains a large amount of training data, providing the training-data resource for network optimization (a sketch of the child-selection rule commonly used in such a search is given after the three sub-processes);
(2) Model optimization (Optimize, Opt)

Opt is the training and optimization process of the model network, which optimizes the model using the game-record data from the Self-Play stage.

(3) Network evaluation (Evaluate, Eval)

Eval is the evaluation process of the new model: after optimization has run for a period of time, simulated games are played, and the evaluation result determines whether to continue optimizing or to iterate and update the model. Network evaluation plays the model trained for a period of time against the model before training, and the current model's performance is obtained from the win/loss record to decide whether to iterate the model.
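The invention does not reproduce the MCTS child-selection rule referred to above, but an AlphaGo Zero-style search driven by policy and value outputs typically selects children by the PUCT score; the following sketch is written under that assumption, and the node representation and the c_puct constant are illustrative:

```python
import math

def puct_select(children, c_puct=1.5):
    """Select the move maximising Q + U, where U favours moves rated highly by the
    policy head (prior P) but visited few times (count N); Q is the mean value."""
    total_visits = sum(child["N"] for child in children.values())
    def score(child):
        u = c_puct * child["P"] * math.sqrt(total_visits) / (1.0 + child["N"])
        return child["Q"] + u
    return max(children, key=lambda move: score(children[move]))
```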
Step 3: the invention trains and optimizes the model with the neural network, defines the loss function and selects a suitable optimizer to update the parameters iteratively in the direction of decreasing loss; the optimizer is specified and the loss function defined after the model definition is completed and before training starts.

This specifically comprises the following steps:

(1) The invention uses the Adam optimizer with the learning rate set to 0.001, β1 set to 0.9, β2 set to 0.999, ε set to 10^-8, and the learning-rate decay set to 0.0. The Adam optimizer uses accumulated gradients instead of single gradients and self-adjusts the learning rate with the root mean square of the accumulated gradients, which helps the model converge in the right direction and speeds up convergence.

(2) The policy-probability output of the policy network is given a multi-class logarithmic loss, and one-hot encoding is used on the model's input data to reduce interference with cross-entropy-type loss functions such as the multi-class logarithmic loss. For the position win-rate estimate output by the value network, the loss function is defined as the mean squared error; for a neural network with a single-valued output, the mean squared error is a common and effective loss function.

Step 4: network model evaluation is used: the newly trained model plays against the model before training, and the current model's performance is obtained from the win/loss record to decide whether to iterate the model. Fig. 5 shows the model network training and optimization. Before model training, the original model is backed up for use in this step. The training time is set indirectly through the number of epochs: after each epoch of training the model is evaluated, and each evaluation consists of 100 simulated games. Each game proceeds as in the Self-Play stage, except that in every round the MCTS of the two sides is judged and updated by different model networks. When the games are finished, the new model replaces the original model as the best model if its win rate reaches the iteration win-rate threshold (a hyperparameter, set to 0.55). Once the model iteration is complete, the process returns to the Self-Play stage, where the new model is used for self-play and continued optimization, so that the network's playing strength keeps improving.
Step 5: the invention focuses on the implementation and training of the model network; for the front-end interface, the third-party Chess GUI Arena is used for visual, interactive game testing, and the Elometer online Elo rating from the University of Duesseldorf is used to obtain an Elo rating for the ChessAi engine.

The chess engine built from the model network is identified as ChessAi in the GUI front end, and the Arena GUI and the ChessAi engine exchange UCI FEN (Forsyth-Edwards Notation) strings for front-end/back-end connection and position interaction. Fig. 6 shows the front-end presentation of the Arena GUI and the deployment and running structure of the whole project; the GPU versions of Keras and TensorFlow are used for distributed data computation, and the master CPU automatically allocates computing resources and handles threading. Elometer gives an estimate of the Elo rating after 76 puzzles are solved; the test feedback is shown in Fig. 7, with a final Elo rating of 1545 and a 95% confidence interval of [1392, 1698].

In the actual game-testing stage, the invention used the ChessAi engine for ten test games, comprising six games of ChessAi self-play and four games between ChessAi and Chess Titans; the final results are shown in Fig. 8.
Claims (7)
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201910837810.5A CN110555517A (en) | 2019-09-05 | 2019-09-05 | Improved chess game method based on Alphago Zero |
Applications Claiming Priority (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201910837810.5A CN110555517A (en) | 2019-09-05 | 2019-09-05 | Improved chess game method based on Alphago Zero |
Publications (1)
Publication Number | Publication Date |
---|---|
CN110555517A true CN110555517A (en) | 2019-12-10 |
Family
ID=68739182
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN201910837810.5A Pending CN110555517A (en) | 2019-09-05 | 2019-09-05 | Improved chess game method based on Alphago Zero |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN110555517A (en) |
Cited By (8)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN111330255A (en) * | 2020-01-16 | 2020-06-26 | 北京理工大学 | Amazon chess-calling generation method based on deep convolutional neural network |
CN111667075A (en) * | 2020-06-12 | 2020-09-15 | 杭州浮云网络科技有限公司 | Service execution method, device and related equipment |
CN112016704A (en) * | 2020-10-30 | 2020-12-01 | 超参数科技(深圳)有限公司 | AI model training method, model using method, computer device and storage medium |
CN113011575A (en) * | 2019-12-19 | 2021-06-22 | 华为技术有限公司 | Neural network model update method, image processing method and device |
CN113127704A (en) * | 2021-03-11 | 2021-07-16 | 西安电子科技大学 | Monte Carlo tree searching method, system and application |
CN114425773A (en) * | 2021-11-18 | 2022-05-03 | 南京师范大学 | Game playing robot system based on deep learning and godson group and game playing method thereof |
CN116808592A (en) * | 2023-07-12 | 2023-09-29 | 浙江大学 | A weighted loss-based ELO evaluation method and device |
CN116881656A (en) * | 2023-07-06 | 2023-10-13 | 南华大学 | Reinforced learning military chess AI system based on deep Monte Carlo |
- 2019-09-05 CN CN201910837810.5A patent/CN110555517A/en active Pending
Cited By (13)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN113011575A (en) * | 2019-12-19 | 2021-06-22 | 华为技术有限公司 | Neural network model update method, image processing method and device |
WO2021120719A1 (en) * | 2019-12-19 | 2021-06-24 | 华为技术有限公司 | Neural network model update method, and image processing method and device |
CN111330255A (en) * | 2020-01-16 | 2020-06-26 | 北京理工大学 | Amazon chess-calling generation method based on deep convolutional neural network |
CN111330255B (en) * | 2020-01-16 | 2021-06-08 | 北京理工大学 | A method for generating Amazon chess moves based on deep convolutional neural networks |
CN111667075A (en) * | 2020-06-12 | 2020-09-15 | 杭州浮云网络科技有限公司 | Service execution method, device and related equipment |
CN112016704A (en) * | 2020-10-30 | 2020-12-01 | 超参数科技(深圳)有限公司 | AI model training method, model using method, computer device and storage medium |
CN113127704A (en) * | 2021-03-11 | 2021-07-16 | 西安电子科技大学 | Monte Carlo tree searching method, system and application |
CN113127704B (en) * | 2021-03-11 | 2022-11-01 | 西安电子科技大学 | A Monte Carlo tree search method, system and application |
CN114425773A (en) * | 2021-11-18 | 2022-05-03 | 南京师范大学 | Game playing robot system based on deep learning and godson group and game playing method thereof |
CN114425773B (en) * | 2021-11-18 | 2024-03-26 | 南京师范大学 | Deep learning and Loongson group-based playing robot system and playing method thereof |
CN116881656A (en) * | 2023-07-06 | 2023-10-13 | 南华大学 | Reinforced learning military chess AI system based on deep Monte Carlo |
CN116881656B (en) * | 2023-07-06 | 2024-03-22 | 南华大学 | A reinforcement learning military chess AI system based on deep Monte Carlo |
CN116808592A (en) * | 2023-07-12 | 2023-09-29 | 浙江大学 | A weighted loss-based ELO evaluation method and device |
Similar Documents
Publication | Publication Date | Title |
---|---|---|
CN110555517A (en) | Improved chess game method based on Alphago Zero | |
Justesen et al. | Illuminating generalization in deep reinforcement learning through procedural level generation | |
CN108629422B (en) | Intelligent learning method based on knowledge guidance-tactical perception | |
CN108921298B (en) | Multi-agent communication and decision-making method for reinforcement learning | |
CN112044076B (en) | Object control method and device and computer readable storage medium | |
CN109284812B (en) | Video game simulation method based on improved DQN | |
CN109740741A (en) | A reinforcement learning method combined with knowledge transfer and its application to autonomous skills learning method of unmanned vehicles | |
CN111729300A (en) | Research method of fighting landlord strategy based on Monte Carlo tree search and convolutional neural network | |
CN114404975B (en) | Training method, device, equipment, storage medium and program product of decision model | |
CN116128060A (en) | A Chess Game Method Based on Opponent Modeling and Monte Carlo Reinforcement Learning | |
CN112016704A (en) | AI model training method, model using method, computer device and storage medium | |
CN114048834A (en) | Method and device for continuous reinforcement learning game with incomplete information based on retrospective and incremental expansion | |
CN111282272A (en) | Information processing method, computer readable medium and electronic device | |
CN118014013A (en) | Deep reinforcement learning quick search game method and system based on priori policy guidance | |
CN118761902B (en) | Method and system for migrating arbitrary picture style based on deep reinforcement learning | |
CN116205298A (en) | A method and system for modeling opponent's behavior strategy based on deep reinforcement learning | |
CN112274935A (en) | AI model training method, use method, computer device and storage medium | |
CN111639756B (en) | A multi-agent reinforcement learning method based on game reduction | |
CN116644666A (en) | Path Planning and Guidance Method for Virtual Assembly Based on Strategy Gradient Optimization Algorithm | |
CN113689001B (en) | A virtual self-playing method and device based on counterfactual regret minimization | |
CN114154397B (en) | Implicit opponent modeling method based on deep reinforcement learning | |
CN115933717A (en) | UAV intelligent air combat maneuver decision-making training system and method based on deep reinforcement learning | |
CN114404976A (en) | Method and device for training decision model, computer equipment and storage medium | |
CN111001161B (en) | Game strategy obtaining method based on second-order back propagation priority | |
Wang | Artificial Intelligence Gamers Based on Deep Reinforcement Learning |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
PB01 | Publication | ||
WD01 | Invention patent application deemed withdrawn after publication | Application publication date: 20191210 |