JP2018063602A

JP2018063602A - Program, system, and method for adjusting weighting of neural network using q-learning

Info

Publication number: JP2018063602A
Application number: JP2016202021A
Authority: JP
Inventors: 英爾関谷; Eiji Sekiya
Original assignee: DeNA Co Ltd
Current assignee: DeNA Co Ltd
Priority date: 2016-10-13
Filing date: 2016-10-13
Publication date: 2018-04-19
Anticipated expiration: 2036-10-13
Also published as: JP6330008B2

Abstract

PROBLEM TO BE SOLVED: To apply Q-learning for improving weighting of neural networks (NN) having game parameters as input values and expected rewards Q corresponding to respective actions of a game character as output values.

SOLUTION: A method for adjusting parameters of a neural network using Q-learning executes a set of steps repeatedly for a plurality of times, the set of the steps including: a step for extracting a game parameter relating to a game character as a first input value; a step for extracting, on the basis of the first input value, a first expected reward (Q value) associated with respective actions of the game character from a neural network (NN) as an output value; a step for extracting the game parameter after selecting/executing one out of respective actions as a second input value; a step for calculating, on the basis of the second input value, a second expected reward (Q value) corresponding to respective actions of the game character from the neural network (NN); a step for updating, on the basis of the first expected reward (Q value) and the second expected reward (Q value), a parameter of the neural network (NN) consisting of one or a plurality of layers.

SELECTED DRAWING: Figure 3

Description

The present invention relates to a program, a system, and a method for adjusting the weighting of a neural network using Q-learning. More specifically, it relates to a program, a system, and a method for improving, by Q-learning, the weighting of a neural network (NN) that takes game environment parameters as input values and outputs the expected reward Q for each action of a game character.

A neural network is a model conceived in imitation of the neurons and synapses of the brain, and it operates in two stages: learning and processing. In the learning stage, features are learned from a large number of inputs and a neural network is constructed for later processing. In the processing stage, the neural network is used to extract an output value for a new input. In recent years the technology of the learning stage has advanced greatly; deep learning, for example, is making it possible to construct multilayer neural networks with high expressive power. The effectiveness of such multilayer neural networks has been confirmed in various fields, and the effectiveness of deep learning has become widely recognized.

Q-learning, on the other hand, is one form of reinforcement learning. It learns the appropriate action for each situation from the rewards given by the environment, without being given the correct action for the task. In recent years, attempts have been made to improve and update the structure and parameters of neural networks by reinforcement learning. For example, Non-Patent Document 1 proposes DQN (Deep Q-Network) as a technique that combines a multilayer neural network (CNN) with Q-learning and improves the parameters of the multilayer neural network by Q-learning.

Non-Patent Document 1: "Human-level Control Through Deep Reinforcement Learning", Volodymyr Mnih et al., February 26, 2015

However, using the DQN technique to learn the parameters of a neural network that selects the optimal action of a game character requires improvements specific to games. More specifically, consider a game in which the actions available to a user character are restricted depending on the progress of the game (for example, a special move that takes time to become selectable). If the DQN technique is applied as it is to such a game, learning of these restricted actions does not proceed as intended, the learning becomes biased, and improving and updating the parameters of the neural network inevitably becomes difficult.

One object of the embodiments of the present invention is to adjust the weighting of a neural network by Q-learning while taking into account situations specific to a game environment. Other objects of the embodiments of the present invention will become apparent by referring to the specification as a whole.

A method according to an embodiment of the present invention is configured to adjust, using Q-learning, the parameters of a neural network (NN) composed of one or more layers by causing, in response to being executed on one or more computers, the one or more computers to repeatedly execute a plurality of times the steps of: extracting game parameters relating to one or more game characters as a first input value; extracting, based on the first input value, a first expected reward (Q value) for each action of the game character from the neural network (NN) as an output value; extracting, as a second input value, the game parameters after one of the actions has been selected and executed; calculating, based on the second input value, a second expected reward (Q value) for each action of the game character from the neural network (NN); and updating, based on the first expected reward (Q value) and the second expected reward (Q value), the parameters of the neural network (NN) composed of one or more layers.

A system according to an embodiment of the present invention is configured to adjust the parameters of a neural network (NN) using Q-learning by causing, in response to being executed on one or more computers, the one or more computers to repeatedly execute a plurality of times the steps of: extracting game parameters relating to one or more game characters as a first input value; extracting, based on the first input value, a first expected reward (Q value) for each action of the game character from the neural network (NN) as an output value; extracting, as a second input value, the game parameters after one of the actions has been selected and executed; calculating, based on the second input value, a second expected reward (Q value) for each action of the game character from the neural network (NN); and updating, based on the first expected reward (Q value) and the second expected reward (Q value), the parameters of the neural network (NN) composed of one or more layers.

A program according to an embodiment of the present invention is configured to adjust the parameters of a neural network (NN) using Q-learning by causing, in response to being executed on one or more computers, the one or more computers to repeatedly execute a plurality of times the steps of: extracting game parameters of one or more game characters as a first input value; extracting, based on the first input value, a first expected reward (Q value) for each action of the game character from the neural network (NN) as an output value; extracting, as a second input value, the game parameters after one of the actions has been selected and executed; calculating, based on the second input value, a second expected reward (Q value) for each action of the game character from the neural network (NN); and updating, based on the first expected reward (Q value) and the second expected reward (Q value), the parameters of the neural network (NN) composed of one or more layers.

According to various embodiments of the present invention, in a game in which the battle state changes from moment to moment during one battle according to the actions of one or more game characters, and in particular in a game in which the actions available to a user character are restricted depending on the progress of the game, the restricted actions of the user character can be selected efficiently and reliably for learning, so that the parameters of the neural network can be improved and updated effectively.

FIG. 1: A configuration diagram schematically showing the configuration of a system 1 according to an embodiment of the present invention.
FIG. 2: A block diagram schematically showing the functions of the system 1 in one embodiment.
FIG. 3: A diagram showing an example of a flow for adjusting the parameters of a neural network (NN) in one embodiment.
FIG. 4: A diagram showing an example of the configuration of a neural network (NN) in one embodiment.
FIG. 5: A diagram showing an example of an action-specific history management table in one embodiment.

FIG. 1 is a configuration diagram schematically showing the configuration of a system 1 according to an embodiment of the present invention. As illustrated, the system 1 in one embodiment includes a server 10 and a plurality of terminal devices 30 connected to the server 10 via a communication network 20 such as the Internet, and provides electronic commerce services to users of the terminal devices 30. The system 1 in one embodiment can also provide users of the terminal devices 30 with various Internet services. These include services that provide games using characters and various digital content other than games, such as electronic books, video content, and music content, as well as communication platform (SNS platform) services that implement various communication functions between users, such as text chat (mini mail), circles, avatars, diaries, message boards, and greetings.

The server 10 in one embodiment is configured as a general-purpose computer and, as illustrated, includes a CPU (computer processor) 11, a main memory 12, a user I/F 13, a communication I/F 14, and a storage (storage device) 15, which are electrically connected to one another via a bus 17. The CPU 11 loads an operating system and various other programs from the storage 15 into the main memory 12 and executes the instructions contained in the loaded programs. The main memory 12 is used to store the programs executed by the CPU 11 and is composed of, for example, DRAM. The server 10 in one embodiment may be configured using a plurality of computers each having the hardware configuration described above. The CPU (computer processor) 11 is only an example, and a GPU (graphics processing unit) may of course be used instead. Whether to use a CPU and/or a GPU can be decided as appropriate in view of the desired cost, efficiency, and the like. The following description uses the CPU 11 as an example.

The user I/F 13 includes, for example, information input devices such as a keyboard and a mouse that accept operator input, and an information output device such as a liquid crystal display that outputs the results of computation by the CPU 11. The communication I/F 14 is implemented as hardware, firmware, communication software such as a TCP/IP driver or a PPP driver, or a combination of these, and is configured to be able to communicate with the terminal devices 30 via the communication network 20.

The storage 15 is composed of, for example, a magnetic disk drive, and stores various programs such as control programs for providing various services. The storage 15 can also store various data for providing those services. The data that can be stored in the storage 15 may instead be stored in a database server or the like that is physically separate from the server 10 and communicably connected to it.

In one embodiment, the server 10 also functions as a web server that manages a website composed of a plurality of hierarchically structured web pages, and can provide various services to users of the terminal devices 30 through that website. The storage 15 can also store the HTML data corresponding to these web pages. The HTML data may be associated with various image data and may have embedded in it various programs written in a script language such as JavaScript (registered trademark).

In one embodiment, the server 10 can also provide various services through an application (program) executed on the terminal device 30 in an execution environment other than a web browser. Such applications can also be stored in the storage 15. The application is created using a programming language such as Objective-C or Java (registered trademark). An application stored in the storage 15 is distributed to a terminal device 30 in response to a distribution request. The terminal device 30 can also download such an application from a server other than the server 10 (for example, a server that provides an application market).

In this way, the server 10 can manage a website for providing various services and can distribute the web pages (HTML data) that make up the website in response to requests from the terminal devices 30. As described above, the server 10 can also provide various services based on communication with an application executed on the terminal device 30, either instead of or in addition to providing services through such web pages (a web browser). Whichever way the services are provided, the server 10 can exchange with the terminal devices 30 the various data needed to provide them (including data needed for screen display). The server 10 can also store various data for each piece of identification information that identifies a user (for example, a user ID) and manage the status of service provision for each user. Although a detailed description is omitted, the server 10 may also have functions for performing user authentication processing, billing processing, and the like.

The terminal device 30 in one embodiment is any information processing device that displays, in a web browser, the web pages of the website provided by the server 10 and that implements an execution environment for running applications. Examples include, but are not limited to, smartphones, tablet terminals, wearable devices, personal computers, and dedicated game terminals.

The terminal device 30 is configured as a general-purpose computer and, as shown in FIG. 1, includes a CPU (computer processor) 31, a main memory 32, a user I/F 33, a communication I/F 34, and a storage (storage device) 35, which are electrically connected to one another via a bus 37.

The CPU 31 loads an operating system and various other programs from the storage 35 into the main memory 32 and executes the instructions contained in the loaded programs. The main memory 32 is used to store the programs executed by the CPU 31 and is composed of, for example, DRAM.

The user I/F 33 includes, for example, information input devices such as a touch panel, a keyboard, buttons, and a mouse that accept user input, and an information display device such as a liquid crystal display that outputs the results of computation by the CPU 31. The communication I/F 34 is implemented as hardware, firmware, communication software such as a TCP/IP driver or a PPP driver, or a combination of these, and is configured to be able to communicate with the server 10 via the communication network 20.

The storage 35 is composed of, for example, a magnetic disk drive or flash memory, and stores various programs such as an operating system. The storage 35 can also store various applications received from the server 10.

The terminal device 30 includes, for example, a web browser for interpreting files in HTML format (HTML data) and displaying them on screen. Using this web browser, the terminal device 30 can interpret the HTML data obtained from the server 10 and display the corresponding web page. The web browser of the terminal device 30 may also incorporate plug-in software capable of executing files of various formats associated with the HTML data.

When a user of the terminal device 30 uses a service provided by the server 10, animations, operation icons, and the like specified by the HTML data or the application are displayed on the screen of the terminal device 30. The user can enter various instructions using the touch panel or the like of the terminal device 30. Instructions entered by the user are transmitted to the server 10 through the web browser of the terminal device 30 or through the functions of an application execution environment such as NgCore (trademark).

Next, the functions of the system 1 in one embodiment configured as described above will be explained. As noted above, the system 1 in one embodiment can provide users with various Internet services, and in particular it can provide a game distribution service. The functions of the system 1 in one embodiment are described below using the function of providing a game distribution service as an example.

FIG. 2 is a block diagram schematically showing the functions of the system 1 (the server 10 and the terminal device 30). The functions of the server 10 in one embodiment are described first. As illustrated, the server 10 includes an information storage unit 41 that stores various information and a character action control unit 42 that determines the actions of game characters in one embodiment. These functions are realized by hardware such as the CPU 11 and the main memory 12 operating in cooperation with the various programs, tables, and the like stored in the storage 15, for example by the CPU 11 executing the instructions contained in a loaded program. Some or all of the functions of the server 10 illustrated in FIG. 2 may be realized by the terminal device 30, or by the server 10 and the terminal device 30 operating in cooperation.

The information storage unit 41 in one embodiment is realized by the storage 15 or the like and, as shown in FIG. 2, has three tables. A game parameter management table 41a manages game parameters such as the various statuses of the enemy and ally characters in the game (collectively referred to as game characters), the characters that can act, the skills that can be used, a flag for the character that acted most recently, and the skill used by that character. An action evaluation management table 41b manages each available action and the expected reward (Q value) obtained when that action is selected based on the game parameters during the game. An action history management table 41c manages the history of the actions selected by each game character during the game.
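The three tables described above can be pictured as plain data structures. The following is only a minimal sketch: the class names, field names, and types are hypothetical and are not taken from the specification, which does not define a concrete schema for tables 41a to 41c.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Tuple


@dataclass
class GameParameters:
    """Hypothetical row of the game parameter management table 41a."""
    statuses: Dict[str, Dict[str, int]]   # per-character status values, e.g. {"hero": {"hp": 30}}
    actionable_characters: List[str]      # characters that can act now
    actionable_skills: List[str]          # skills that can be chosen now
    last_actor: str = ""                  # character that acted most recently
    last_skill: str = ""                  # skill used by that character


@dataclass
class InformationStore:
    """Hypothetical stand-in for the information storage unit 41."""
    game_parameters: GameParameters                                        # table 41a
    action_evaluation: Dict[str, float] = field(default_factory=dict)     # table 41b: action -> Q value
    action_history: List[Tuple[str, str]] = field(default_factory=list)   # table 41c: (character, action)

    def record(self, character: str, action: str) -> None:
        """Append a selected action to the history and update the most-recent fields."""
        self.action_history.append((character, action))
        self.game_parameters.last_actor = character
        self.game_parameters.last_skill = action
```

Keeping the most-recent actor and skill inside the game parameters mirrors the text, where those flags are themselves part of the input fed to the neural network.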

Next, the function of the character action control unit 42, which determines the actions of game characters in one embodiment, will be described. The character action control unit 42 selects and determines the action of each game character based on the game parameters stored in the game parameter management table 41a of the information storage unit 41, such as the various statuses of the enemy and ally characters, the characters that can act, the skills that can be used, the flag for the character that acted most recently, and the skill used by that character. More specifically, the character action control unit 42 extracts the game parameters from the game parameter management table 41a, feeds them as input values into a neural network of one or more layers, and extracts as output values the expected reward (Q value) for each action of the characters that can act. Normally it then selects and determines the action with the highest output value. In the learning stage, however, it either selects an action of an actionable character at random, regardless of the expected reward (Q value) of each action, or selects the action with the highest expected reward (Q value), and adjusts the parameters of the neural network (NN) each time. To select and determine an action, the character action control unit 42 refers as needed to the action evaluation management table 41b and can obtain each available action and the expected reward (Q value) for the case where that action is selected based on the game parameters during the game. In this way the character action control unit 42 advances the learning by, for example, selecting and determining each action of the ally characters until one battle ends. Although this example describes the actions of the ally characters until one battle ends, it is not intended to exclude enemy characters, ally characters, or a subset of these. Likewise, the period need not run until one battle ends; it may run until any other break point at which the expected reward (Q value) can be calculated appropriately.
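The two selection modes of the learning stage (a random choice that ignores the Q values, or the action with the highest Q value) can be sketched as a single function. The mixing probability `explore_prob` is an assumption made for illustration; the specification only says that one mode or the other is used and does not state how the choice between them is made. The function and parameter names are likewise hypothetical.

```python
import random
from typing import Dict, List, Optional


def select_action(q_values: Dict[str, float],
                  actionable: List[str],
                  explore_prob: float,
                  rng: Optional[random.Random] = None) -> str:
    """Pick one action from those currently actionable.

    With probability explore_prob (an assumed parameter) the choice is
    uniformly random and ignores the Q values; otherwise the actionable
    action with the highest Q value is chosen.
    """
    if not actionable:
        raise ValueError("no actionable action")
    rng = rng or random.Random()
    if rng.random() < explore_prob:
        return rng.choice(actionable)
    return max(actionable, key=lambda a: q_values[a])
```

Restricting both branches to `actionable` reflects the game parameters listing which characters and skills can currently act, so an action that is not yet selectable is never chosen even if its Q value is the highest.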

Regardless of which of the above methods is used to select and determine an action, the character action control unit 42 refers as appropriate to the action evaluation management table 41b, which manages each available action and the expected reward (Q value) for the case where that action is selected based on the game parameters during the game. Using the action at taken in a state St and its expected reward Q(St, at), together with the maximum expected reward max Q(St+1, a) among the expected rewards Q(St+1, a) of the actions a in the state St+1 reached after the action at is selected, it calculates by the following formula the expected reward that should have been assigned, that is, a more appropriate expected reward. This formula is only an example and can of course be changed as appropriate. Here, the expected reward Q (Q value) is the sum of the reward obtained when the action a is taken and the rewards expected to be obtained thereafter; r is the actual reward after the action at is selected and executed, α is the learning rate, and γ is the discount rate (reward decay coefficient).

When it is thereby determined that the initial expected reward Q(St, at) should be corrected (that is, when the expected reward Q(St, at) estimated in the state St diverges from the expected reward Q(St, at) calculated in hindsight from the state St+1), the parameters of the neural network described later are updated so that this expected reward Q(St, at) takes a more appropriate value. The basic techniques of Q-learning are not described in further detail here, but those skilled in the art can apply them as appropriate to the system according to one embodiment.
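The formula referred to above is not reproduced in this text. The standard Q-learning update is consistent with every symbol the passage defines (α, γ, r, Q(St, at), and max Q(St+1, a)), so it is shown here as an assumption rather than as the formula of the specification:

$$Q(S_t, a_t) \leftarrow Q(S_t, a_t) + \alpha \left( r + \gamma \max_{a} Q(S_{t+1}, a) - Q(S_t, a_t) \right)$$

A minimal sketch of that update as a function, with hypothetical names:

```python
from typing import Sequence


def corrected_q(q_sa: float, reward: float, next_q_values: Sequence[float],
                alpha: float, gamma: float) -> float:
    """Return the corrected estimate of Q(S_t, a_t) under the standard Q-learning rule.

    Assumed form: Q + alpha * (r + gamma * max_a Q(S_{t+1}, a) - Q).
    """
    target = reward + gamma * max(next_q_values)   # value seen in hindsight from S_{t+1}
    return q_sa + alpha * (target - q_sa)          # move the old estimate toward the target
```

For example, with Q(St, at) = 1.0, r = 0.5, next-state Q values of 0.0, 2.0, and 1.0, α = 0.5, and γ = 0.5, the target is 0.5 + 0.5 × 2.0 = 1.5 and the corrected value is 1.0 + 0.5 × (1.5 − 1.0) = 1.25. The gap between the target and the old estimate is the divergence that triggers the parameter update described in the text.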

FIG. 3 shows an embodiment of the present invention as a flow. First, the game parameters (input values) are extracted (step 110) and input to the initial neural network as a first input value, so that a first expected reward (Q value) for each action of the game character is extracted as an output value (step 120). A specific action of the game character is then selected and determined from among them according to a predetermined rule (step 130). The game parameters are updated after that action, and the updated game parameters are extracted (step 140). The updated game parameters are input to the neural network as a second input value, and a second expected reward (Q value) for each action of the game character is extracted as an output value (step 150). Based on the first and second expected rewards (Q values), the parameters of the neural network are updated so that a more appropriate expected reward can be calculated (step 160). By repeating these steps, for example until one game battle ends or over a plurality of generations, the parameters of the neural network involved in selecting and determining the appropriate action of the game character are evolved. The following describes more specifically how, in the invention according to one embodiment, a neural network is trained by Q-learning in order to determine a better action for a game character.
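Steps 110 to 160 can be sketched as one loop. Everything in this sketch is hypothetical: the four callables stand in for the game and the neural network, `explore_prob` is an assumed way to realize the "predetermined rule" of step 130, and using the bare reward as the target when the battle has ended is a common convention that the specification does not state.

```python
import random
from typing import Callable, List, Optional, Tuple

State = Tuple[int, ...]


def run_battle(extract_params: Callable[[], State],
               q_values: Callable[[State], List[float]],
               act: Callable[[int], Tuple[float, bool]],
               update: Callable[[State, int, float], None],
               gamma: float = 0.9,
               explore_prob: float = 0.1,
               rng: Optional[random.Random] = None) -> int:
    """Repeat steps 110-160 until the battle ends; return the number of iterations."""
    rng = rng or random.Random(0)
    steps = 0
    done = False
    while not done:
        s1 = extract_params()                       # step 110: first input value
        q1 = q_values(s1)                           # step 120: first expected rewards
        if rng.random() < explore_prob:             # step 130: select one action
            a = rng.randrange(len(q1))
        else:
            a = max(range(len(q1)), key=q1.__getitem__)
        reward, done = act(a)                       # executing the action changes the game
        s2 = extract_params()                       # step 140: second input value
        q2 = q_values(s2)                           # step 150: second expected rewards
        # Assumption: no future reward is added once the battle is over.
        target = reward if done else reward + gamma * max(q2)
        update(s1, a, target)                       # step 160: move Q(s1, a) toward target
        steps += 1
    return steps
```

Passing the network as the `q_values` and `update` callables keeps the loop independent of how the Q values are represented, so the same loop works with a lookup table for testing or with a neural network for training.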

First, the neural network subjected to reinforcement learning may be any network having a neural network structure of one or more layers (including a CNN). Here, updating the parameters of the neural network is mainly intended to mean changing, of the structure of the network and the weights between its nodes, the weights between nodes. FIG. 4 shows an example of such a structure and its weights. As illustrated, it is a neural network (NN) with one hidden layer: five input nodes are connected through a hidden layer of three nodes to three output nodes. Connected nodes are linked by their own weights as illustrated (W11, W12, W21, W22, and so on; 13 weights in total in this example). The network usually has one or two hidden layers, and the hidden layer is given the same number of nodes as the output layer (roughly 30 to 70), but the structure is not limited to this and may be a CNN, an LSTM, or another neural network.
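A forward pass through a network of the shape described (five inputs, three hidden nodes, three outputs) might look like the following sketch. It assumes every node is connected to every node of the next layer and uses a tanh activation in the hidden layer; neither is stated in the text, and full connectivity need not match the connections drawn in FIG. 4. The names `forward`, `w_in_hidden`, and `w_hidden_out` are illustrative.

```python
import math


def forward(x, w_in_hidden, w_hidden_out):
    """Return one output (Q value) per action for input vector x.

    w_in_hidden:  one row of input weights per hidden node.
    w_hidden_out: one row of hidden weights per output node.
    """
    hidden = [math.tanh(sum(w * xi for w, xi in zip(row, x)))
              for row in w_in_hidden]
    return [sum(w * h for w, h in zip(row, hidden))
            for row in w_hidden_out]


# Example weights: each hidden node reads only the first input,
# and each output node copies one hidden node.
w_in_hidden = [[1.0, 0.0, 0.0, 0.0, 0.0] for _ in range(3)]
w_hidden_out = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0]]
outputs = forward([0.5, 0.0, 0.0, 0.0, 0.0], w_in_hidden, w_hidden_out)
```

Updating the parameters in the sense used above means changing the entries of `w_in_hidden` and `w_hidden_out` while leaving the layer sizes fixed.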

As described above, by inputting the game parameters at each point in time into such a neural network, the expected reward (Q value) of selecting each action (for example, attack, magic, or a special skill) is obtained as output. When this expected reward (Q value) is not calculated correctly, as is common in the early stage of learning, the parameters of the neural network (the weights between nodes) are changed, for example by applying error propagation to the network, so that an appropriate expected reward (Q value) is calculated for the same game parameters. Continuing this over a plurality of generations evolves and improves the neural network, which leads to the selection of more appropriate actions by game characters in the game.

However, some games include actions whose selection is restricted (actions that are not always selectable) or actions that are rarely selected. In games with such complex environments, it is known that simply adopting reinforcement learning based on Q-learning as is does not make the neural network learn as expected. As a result of examining countermeasures, improving the learning method as described below was found to be effective.

A role-playing game is described below as one embodiment, but any game may be used and no limitation to a specific game is intended. The role-playing game here is a command battle game in which a plurality of ally characters fight enemy characters. Ally and enemy characters each have an action gauge that rises over time, and a character whose action gauge is full becomes able to act. The game progresses as the player selects a command for a character that has become able to act. In a game trial, the state of the game at a given time is parameterized and input to an individual as input values, and the game is advanced by selecting a command according to the output values from the neural network. As noted earlier, the game parameters are defined as the various statuses of the enemy and ally characters (collectively, game characters), the game characters able to act, the techniques available, a flag for the game character that acted most recently, the technique used by that character, and the like.

Next, an example of these input values is described more specifically. The inputs to the neural network at a given time (the time at which any game character becomes able to act) are, for example, (1) the statuses of the game characters, (2) the command selectable flags of the game characters, and (3) normalized values indicating how recently each game character selected each command, although no limitation to these is intended. (1) The statuses may include the attack power, magic attack power, defense power, magic defense power, speed, and special move gauge of each game character (including enemy characters). (2) The command selectable flags are availability flags (0 or 1) for each command of every game character (ally characters and enemy characters).

In this game there are ally character 1, ally character 2, ally character 3, ally character 4, and an enemy character, and each character has five commands (attack, defense, skill 1, skill 2, and special move). If, at a given time, only ally character 1 can act and ally character 1 can select attack, defense, skill 1, and skill 2, the input is as follows:
Ally character 1_attack: 1
Ally character 1_defense: 1
Ally character 1_skill 1: 1
Ally character 1_skill 2: 1
Ally character 1_special move: 0
Ally character 2_attack: 0
Ally character 2_defense: 0
Ally character 2_skill 1: 0
Ally character 2_skill 2: 0
Ally character 2_special move: 0
... (omitted) ...
Enemy character_special move: 0
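The command selectable flags above can be assembled into a flat input vector as in the following sketch. The identifiers `CHARACTERS`, `COMMANDS`, and `selectable_flags` are hypothetical, and the ordering (all commands of one character, then the next character) is an assumption taken from the order of the listing.

```python
CHARACTERS = ["ally1", "ally2", "ally3", "ally4", "enemy"]
COMMANDS = ["attack", "defense", "skill1", "skill2", "special"]


def selectable_flags(selectable):
    """Return 0/1 flags, one per (character, command) pair.

    selectable: dict mapping a character to the set of commands it can
    currently select; characters that cannot act are simply absent.
    """
    return [1 if command in selectable.get(character, set()) else 0
            for character in CHARACTERS
            for command in COMMANDS]


# Only ally character 1 can act, and its special move is not available.
flags = selectable_flags({"ally1": {"attack", "defense", "skill1", "skill2"}})
```

With five characters and five commands each, the vector has 25 entries, of which the first five belong to ally character 1.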

Next, (3) the normalized value indicating how recently a game character selected a command is, for each command, a value showing how recently that command was selected. For example, when a command is selected, its normalized value is set to 0; at the next command selection timing it becomes 0.2, and it then increases by 0.2 at a time up to a maximum of 1. While this normalized value is low, the same command is harder to select.
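The behaviour of this normalized value can be sketched with two small helpers. The function names `tick` and `mark_selected` are hypothetical, and starting every command at the maximum value of 1 is an assumption; the text only specifies the reset to 0 and the increments of 0.2.

```python
STEP = 0.2       # increase per command selection timing
MAX_VALUE = 1.0  # upper bound of the normalized value


def mark_selected(recency, command):
    """Reset the value of the command that was just selected to 0."""
    return {**recency, command: 0.0}


def tick(recency):
    """Advance to the next command selection timing (+0.2, capped at 1)."""
    # round() only removes floating-point noise such as 0.6000000000000001.
    return {command: min(MAX_VALUE, round(value + STEP, 10))
            for command, value in recency.items()}


recency = {"attack": 1.0, "defense": 1.0}   # assumed starting values
recency = mark_selected(recency, "attack")  # attack -> 0.0
recency = tick(recency)                     # attack -> 0.2
```

Keeping the reset and the increment separate makes the value 0.2, rather than 0, at the selection timing that follows a selection, as in the example in the text.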

Next, an example of the output values is described more specifically. An expected reward (Q value) is extracted from the neural network as an output value for each command of the ally characters, and this determines the command of the character able to act at a given time.
For example, if the expected rewards (Q values) for ally character 1 are
Ally character 1_attack: 0.8
Ally character 1_defense: 0.5
Ally character 1_skill 1: 0.4
Ally character 1_skill 2: 0.1
Ally character 1_special move: 0
then under the normal command selection method, "attack", which has the largest expected reward (Q value), is selected.
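The normal command selection method amounts to taking the command with the largest Q value. The sketch below additionally skips commands whose selectable flag is 0; that masking step is an assumption made here for illustration, and `select_greedy` is a hypothetical name.

```python
def select_greedy(q_values, selectable):
    """Return the selectable command with the highest expected reward.

    q_values:   dict mapping command name to its Q value.
    selectable: dict mapping command name to its 0/1 selectable flag.
    """
    candidates = [c for c in q_values if selectable.get(c, 0) == 1]
    if not candidates:
        raise ValueError("no selectable command")
    return max(candidates, key=q_values.get)


q_ally1 = {"attack": 0.8, "defense": 0.5, "skill1": 0.4,
           "skill2": 0.1, "special": 0.0}
flags_ally1 = {"attack": 1, "defense": 1, "skill1": 1,
               "skill2": 1, "special": 0}
chosen = select_greedy(q_ally1, flags_ally1)
```

For the values in the example this selects "attack".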

In a game battle, each action of the game characters is determined by repeating this input and output. Normally the battle progresses through a combination of multiple actions by multiple game characters and continues until the battle ends. The end of the battle is normally intended to mean, for example, the HP of the enemy game character reaching 0 or the HP of all ally game characters reaching 0, but it may be any other case in which the game battle can be evaluated.

An example of the output values is described in more detail. The commands of the ally characters and their expected rewards (Q values) are extracted and calculated as follows.
Ally character 1_Defense: 0.89961839
Ally character 1_Skilled Fist: 1.16006923
Ally character 1_Shout: 0.6012032
Ally character 1_Fight: 0.53579712
Ally character 1_Armor Break: 0.52256131
Ally character 1_Straight Punch: 0.314955
Ally character 1_Wind Blade: 0.60766339

Ally character 2_Defense: 0.63610768
Ally character 2_Sacred Legendary Sword: 0.83349609
Ally character 2_Demon God's Protection: 0.29603601
Ally character 2_Shout: -0.20091677
Ally character 2_Lightning Sword: 0.4563098
Ally character 2_Flame Sword: 0.6400938
Ally character 2_Fight: 0.40863395

Ally character 3_Defense: 0.95442247
Ally character 3_Ruler of the Sea: 0.29018426
Ally character 3_Helm Splitter: 0.20885491
Ally character 3_Shout: -0.25541139
Ally character 3_Fight: 0.50958467
Ally character 3_Magic Break: 0.36382771
Ally character 3_Strength Break: 0.77368307

Ally character 4_Defense: 0.84505749
Ally character 4_Nemesis Change: 0.21710753
Ally character 4_Friend Combination: 0.77146935
Ally character 4_Shout: 0.18860745
Ally character 4_Fight: 0.68954039
Ally character 4_Sabotage: 0.8964808
Ally character 4_Sword Beam: 0.6781354

Ally character 5_Defense: 0.37057304
Ally character 5_Party Heal: -1.29269934
Ally character 5_Queen's Blessing: -1.28307414
Ally character 5_Shout: -1.33874702
Ally character 5_Heal: -0.21439075
Ally character 5_Armor Plus: -0.15570354
Ally character 5_Fight: -2.44153452
In the above example, each of the 5 characters has 7 commands, so a total of 35 commands and their expected rewards (Q values) are extracted. These expected rewards (Q values) are normalized to the range of -1 to 1, but because Q-learning adds future rewards, they may fall outside that range.

Once the expected reward (Q value) for each command has been calculated and extracted in this way, the action with the highest output value is normally selected. In the learning stage, however, it became clear that the range of learning becomes biased because the selection and execution of actions such as special moves are restricted (the special move gauge takes time to fill, and the special move cannot be selected in the meantime). This problem was particularly pronounced in games such as role-playing games. Therefore, in the learning stage, the parameters of the neural network (NN) are adjusted each time after either (a) selecting an action of the actionable character at random regardless of the expected reward (Q value) of each action, or (b) selecting the action with the highest expected reward (Q value) among the actions of the actionable character. Which of the two methods is used is governed by selection probabilities set in advance for each method, so that either method can be given priority as appropriate. If the set values are 50% and 50%, either method is chosen with equal probability; if they are 100% and 0%, the command to execute is always determined by random selection regardless of the expected rewards (Q values). These selection probabilities can be changed to desired set values according to, for example, the number of learning iterations.
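Choosing between the two selection methods according to a set probability can be sketched as follows. The function name `choose_command` and the parameter `p_random` (the set value for the random method, expressed as a fraction) are hypothetical; the greedy method is used with the remaining probability `1 - p_random`.

```python
import random


def choose_command(q_values, p_random, rng=random):
    """Pick a command for the actionable character.

    q_values: dict mapping each currently selectable command to its Q value.
    p_random: set value (0.0 to 1.0) for the random selection method.
    rng:      random source; pass random.Random(seed) for repeatable runs.
    """
    commands = list(q_values)
    if rng.random() < p_random:
        # Random method: ignore the expected rewards entirely.
        return rng.choice(commands)
    # Greedy method: take the highest expected reward.
    return max(commands, key=q_values.get)


q_ally1 = {"attack": 0.8, "defense": 0.5, "skill1": 0.4, "skill2": 0.1}
always_greedy = choose_command(q_ally1, p_random=0.0)
always_random = choose_command(q_ally1, p_random=1.0, rng=random.Random(0))
```

Setting `p_random` to 0.5 corresponds to the 50% and 50% case, and 1.0 to the 100% and 0% case described above.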

In one embodiment, to prevent the bias in the range of learning caused by, for example, the restricted selection and execution of actions such as the special move described above (the special move gauge takes time to fill, and the special move cannot be selected in the meantime), the set value of the selection probability of the method that selects an action of the actionable character at random regardless of the expected rewards (Q values) can be set to 100%, and the selection probability of the method that selects the action with the highest expected reward (Q value) can be set to 0%. Compared with determining the command based on the expected rewards (Q values), this causes rarely selected commands to be selected more often, so that learning proceeds effectively and efficiently. Once learning through the selection of such rarely selected commands has progressed sufficiently, the respective probabilities can be changed. For example, at each learning iteration, the set value of the selection probability of the random method may be reduced from 100% in steps of 0.0015%. As learning proceeds, the selection probability of the random method then falls and the selection probability of the method that selects the action with the highest expected reward (Q value) rises. The step size is not particularly limited but can be, for example, 0.0015% to 0.015%, and it can be changed at any time.
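The schedule in the example, starting at 100% and falling by 0.0015% per learning iteration, can be written as a simple linear decay. This sketch reads 0.0015% as 0.0015 percentage points, which is an interpretation, and clamps the result at 0; the names `DECAY_PER_LEARNING` and `random_selection_probability` are hypothetical.

```python
DECAY_PER_LEARNING = 0.0015 / 100  # 0.0015 percentage points as a fraction


def random_selection_probability(learning_count, start=1.0,
                                 decay=DECAY_PER_LEARNING):
    """Set value of the random selection method after N learning iterations."""
    return max(0.0, start - decay * learning_count)


def greedy_selection_probability(learning_count):
    """Complementary set value for the highest-Q selection method."""
    return 1.0 - random_selection_probability(learning_count)


p_after_1000 = random_selection_probability(1000)
```

Under this reading the random method would reach 0% after roughly 66,667 iterations (1 / 0.000015); with the larger step of 0.015% it would do so after roughly 6,667.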

In one embodiment, the set value of the selection probability of the method that selects an action of the actionable character at random regardless of the expected rewards (Q values) may be changed when a rarely selected command (for example, a special move) has been selected a predetermined number of times or more, or when a predetermined condition is satisfied. For example, the set value of the selection probability of the random method may be changed once a rarely selected command (for example, a special move) has been selected two or more times. Other predetermined conditions can be set as appropriate.
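A condition-based change of this kind could be sketched as below. The threshold of two selections comes from the example in the text; treating the change as a reduction, and the reduction amount of 0.1, are assumptions made only for illustration, as is the name `adjust_random_probability`.

```python
def adjust_random_probability(p_random, rare_selected_count,
                              threshold=2, reduction=0.1):
    """Lower the random-selection set value once a rarely selected
    command has been chosen at least `threshold` times.

    `reduction` is a hypothetical amount; the text leaves it open.
    """
    if rare_selected_count >= threshold:
        return max(0.0, p_random - reduction)
    return p_random


unchanged = adjust_random_probability(1.0, rare_selected_count=1)
lowered = adjust_random_probability(1.0, rare_selected_count=2)
```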

Here, in one embodiment, it has been found that, when calculating the reward for an action, the enemy character acts far less often than the ally characters, so that if the damage received from an enemy attack is attributed only to the single ally character that selected an action just before or after the attack, the action chosen by that character's command is corrected strongly in the negative direction. To avoid this, the enemy's attack is divided by the number of ally characters, and a character that acts pays one share: the reward is the actual damage dealt by its own action, adjusted downward by that share. For example, if the enemy character's attack is 1 and there are 5 ally characters, the damage-received memory holds -0.2 for each. When one ally's action (reward 1) fires, 0.8 is learned as the reward, and the damage-received memory becomes "-0.2, -0.2, -0.2, -0.2, 0". When another ally's action (reward 0.5) fires, 0.3 is learned as the reward, and the damage-received memory becomes "-0.2, -0.2, -0.2, 0, 0". If the enemy character then attacks with 0.5, the damage-received memory becomes "-0.3, -0.3, -0.3, -0.1, -0.1".
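The bookkeeping in this example can be sketched with one memory slot per ally character. The function names `spread_enemy_damage` and `settle_reward` are hypothetical, and the sketch assumes that each enemy attack is split evenly and added to whatever each slot already holds.

```python
def spread_enemy_damage(memory, damage):
    """Split an enemy attack evenly across all ally slots."""
    share = damage / len(memory)
    # round() only removes floating-point noise such as -0.30000000000000004.
    return [round(slot - share, 10) for slot in memory]


def settle_reward(memory, actor, raw_reward):
    """Return the reward learned for the acting ally and the new memory.

    The actor pays its own stored share, and its slot is reset to 0.
    """
    reward = round(raw_reward + memory[actor], 10)
    updated = list(memory)
    updated[actor] = 0.0
    return reward, updated


memory = spread_enemy_damage([0.0] * 5, 1.0)        # -0.2 in every slot
reward_a, memory = settle_reward(memory, 4, 1.0)    # learns 0.8
reward_b, memory = settle_reward(memory, 3, 0.5)    # learns 0.3
memory = spread_enemy_damage(memory, 0.5)           # next enemy attack
```

Because slots that were already settled start again from 0, a later enemy attack charges every ally the same new share while allies that have not yet acted keep their earlier debt.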

As described above, to avoid the bias in the range of learning of action selection that arises in games, particularly role-playing games, because the selection and execution of actions such as special moves are restricted, the parameters of the neural network (NN) can be adjusted in the learning stage each time after either selecting an action of the actionable character at random regardless of the expected rewards (Q values) or selecting the action with the highest expected reward (Q value). Even with random selection, however, actions whose selection and execution are restricted are not necessarily selected often enough. This is because, in the action history management table 41c described above, which manages the history of actions selected by each game character during the game, the other actions are selected more often than the restricted actions, so that the history of those other actions comes to occupy a relatively large share.

In one embodiment, to improve on this further, history information can be stored in a separate memory space for each action, or otherwise stored per action, and the same number of entries can be taken for every action, each chosen completely at random from the history of that action. One example is the action history management table shown in FIG. 5, in which the history is managed separately for each action (action 1, action 2, ..., action n). In this case, two history entries are kept for each action; when the same action has been selected more than twice, the older entries are deleted so that the number of entries is limited to two. It was found that this reliably prevents actions that inevitably accumulate a large history from being chosen disproportionately, even under random selection, and that drawing broadly from every action (both rarely selected and frequently selected ones) effectively prevents bias in learning. The example in FIG. 5 is only an example; the number of history entries per action is not limited to this and can be set as appropriate. The action history management table for each action may also be stored in its own memory space.
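A per-action history with a fixed number of entries per action can be sketched with bounded queues. The class name `PerActionHistory`, its methods, and the shape of the stored entries are hypothetical; the limit of two entries per action follows the FIG. 5 example, and `collections.deque` with `maxlen` discards the oldest entry automatically.

```python
import random
from collections import defaultdict, deque


class PerActionHistory:
    """Keeps a separate, size-limited history for every action."""

    def __init__(self, per_action_limit=2):
        self.buffers = defaultdict(lambda: deque(maxlen=per_action_limit))

    def add(self, action, entry):
        """Store an entry; the oldest one is dropped when the limit is hit."""
        self.buffers[action].append(entry)

    def sample(self, per_action=1, rng=random):
        """Draw the same number of entries at random from every action."""
        batch = []
        for action, buffer in self.buffers.items():
            count = min(per_action, len(buffer))
            batch.extend((action, entry)
                         for entry in rng.sample(list(buffer), count))
        return batch


history = PerActionHistory()
for turn in range(10):
    history.add("attack", turn)   # frequently selected action
history.add("special", 99)        # rarely selected action
batch = history.sample(per_action=1, rng=random.Random(0))
```

Although "attack" was recorded ten times and "special" once, each contributes one entry to the batch, so the frequent action does not dominate what is learned from.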

In this way, even in games whose game parameters change from moment to moment, such as game battles, the parameters of the neural network are trained by reinforcement learning using the Q-learning technique and the neural network is updated, making it possible to determine more appropriate actions for enemy and ally characters. The progress of the game can then be enjoyed more naturally without becoming tiresome, and the appeal of the game as a whole can be increased.

The functions of the server 10 have been described above. Next, the functions of the terminal device 30 in one embodiment are described. As shown in FIG. 2, the terminal device 30 has an information storage unit 51 that stores various information and a terminal-side control unit 52 that executes control for displaying the image information of one embodiment on the terminal side. These functions are realized by hardware such as the CPU 31 and the main memory 32 operating in cooperation with the various programs, tables, and the like stored in the storage 35, for example by the CPU 31 executing instructions contained in a loaded program. Some or all of the functions of the terminal device 30 illustrated in FIG. 2 may be realized by the server 10 and the terminal device 30 operating in cooperation, or by the server 10.

The information storage unit 51 in one embodiment is realized by the main memory 32, the storage 35, or the like. The terminal-side control unit 52 in one embodiment controls the execution of various terminal-side processes such as the selection of actions by the user character and the display of received game screen information. For example, when the user selects an action of the user character, the terminal-side control unit 52 can transmit it to the server 10, and when the game parameters change as a result of the actions of ally or enemy characters, it can receive the motions of those actions and the changed game parameters from the server 10 and display them.

The processes and procedures described in this specification may be realized by software, hardware, or any combination thereof, in addition to what is explicitly described in the embodiments. More specifically, the processes and procedures described in this specification are realized by implementing logic corresponding to them on a medium such as an integrated circuit, volatile memory, non-volatile memory, magnetic disk, or optical storage. The processes and procedures described in this specification can also be implemented as computer programs and executed by various computers.

Even where the processes and procedures described in this specification are described as being executed by a single device, piece of software, component, or module, they may be executed by multiple devices, multiple pieces of software, multiple components, and/or multiple modules. Likewise, even where the data, tables, or databases described in this specification are described as being stored in a single memory, they may be stored in a distributed manner across multiple memories in a single device or across multiple memories distributed over multiple devices. Furthermore, the software and hardware elements described in this specification may be realized by integrating them into fewer components or by decomposing them into more components.

In this specification, whether a component of the invention is described as singular or plural, or is described without being limited to either, the component may be singular or plural unless the context requires otherwise.

DESCRIPTION OF SYMBOLS
10 Server
20 Communication network
30 Terminal device
41 Information storage unit
42 Character action control unit
51 Information storage unit
52 Terminal-side control unit

Claims

1. A method for adjusting parameters of a neural network (NN) composed of one or more layers using Q-learning, the method, when executed on one or more computers, repeatedly executing a plurality of times a set of steps comprising:
extracting game parameters relating to one or more game characters as a first input value;
extracting from the neural network (NN), as output values, first expected rewards (Q values) for the respective actions of the game character based on the first input value;
extracting, as a second input value, the game parameters after one of the actions has been selected and executed;
calculating from the neural network (NN) second expected rewards (Q values) for the respective actions of the game character based on the second input value; and
updating the parameters of the neural network (NN) composed of one or more layers based on the first expected rewards (Q values) and the second expected rewards (Q values).

One selection from among the actions of the character selects an action having the highest expected reward (Q value) from the expected rewards (Q value) of the actions, or depends on the actions. The method according to claim 1, wherein the method is performed by determining at random regardless of the expected reward (Q value).

One selection from among the actions of the character includes a set value of a probability of selecting an action having the highest expected reward (Q value) from the expected rewards (Q value) of the actions and the actions. The method according to claim 2, wherein the method is performed based on a set value of a probability that is randomly determined regardless of the expected reward (Q value).

One selection from the actions of the character selects an action having the highest expected reward (Q value) from the expected rewards (Q value) of the actions as the number of selections increases. 4. The method according to claim 3, wherein the probability setting value is increased, and the probability setting value determined at random regardless of the expected reward (Q value) by each action is decreased.

One selection from the actions of the character selects an action having the highest expected reward (Q value) from the expected rewards (Q value) of the actions as the number of selections increases. The probability setting value is increased by about 0.0015%, and the probability setting value determined randomly regardless of the expected reward (Q value) by each action is decreased by about 0.0015%. The method of claim 4.

The method according to any one of claims 2 to 5, wherein, when one of the actions of the character is selected at random regardless of the expected rewards (Q values) of the actions, the action is determined at random from among the actions regardless of the amount of history information held in each of a plurality of pieces of action-specific history information, each of which stores an action history for one of the actions.

The method according to any one of claims 2 to 5, wherein, when one of the actions of the character is selected at random regardless of the expected rewards (Q values) of the actions, the action is determined at random from the action histories by referring to an action history management table in which the number of action histories that can be stored is limited to a predetermined number.

The method according to any one of claims 2 to 7, wherein each of the plurality of pieces of action-specific history information is stored in a different memory space, and the selection of one of the actions of the character is made from the action-specific history information stored in a memory space determined at random from among the plurality of memory spaces.
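
One way to picture the action-specific histories of claims 6 to 8 is a set of bounded buffers, one per action, where the buffer is chosen uniformly at random so that actions with many recorded entries are not favored. This is only a sketch: the class `ActionHistories`, its `capacity` parameter, and the use of `collections.deque` as the "memory space" are assumptions for illustration.

```python
import random
from collections import deque


class ActionHistories:
    """One bounded history buffer per action (hypothetical structure)."""

    def __init__(self, actions, capacity):
        # deque(maxlen=...) discards the oldest entry once capacity is reached.
        self.buffers = {a: deque(maxlen=capacity) for a in actions}

    def record(self, action, entry):
        self.buffers[action].append(entry)

    def sample(self, rng=random):
        """Pick a non-empty buffer uniformly, then one entry from it."""
        non_empty = [a for a, buf in self.buffers.items() if buf]
        if not non_empty:
            return None
        action = rng.choice(non_empty)  # independent of buffer sizes
        return action, rng.choice(list(self.buffers[action]))
```

Because the buffer is chosen before the entry, an action recorded a hundred times is sampled about as often as one recorded once.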

The method according to claim 2, wherein the selection of one of the actions of the character is determined at random from among actions to which a selection condition is attached.

The method according to any one of claims 3 to 9, wherein the set value of the probability of determining an action at random regardless of the expected rewards (Q values) of the actions is reduced when an action to which the selection condition is attached has been selected a predetermined number of times or more.
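
The two preceding claims can be read as follows, shown here as a rough sketch. It assumes that each action may carry an optional selection condition, that random selection draws only from actions that have one attached, and that the random probability drops by a fixed amount once the count reaches a threshold; `pick_conditioned_action`, `adjust_p_random`, `threshold`, and `reduction` are hypothetical and not specified in the claims.

```python
import random


def pick_conditioned_action(conditions, rng=random):
    """Pick uniformly among actions that have a selection condition attached.

    `conditions` maps each action name to a condition callable, or to None
    when no condition is attached. Returns None if no action qualifies.
    """
    candidates = [a for a, cond in conditions.items() if cond is not None]
    return rng.choice(candidates) if candidates else None


def adjust_p_random(p_random, conditioned_selections, threshold, reduction):
    """Reduce the random-selection probability once the threshold is reached."""
    if conditioned_selections >= threshold:
        return max(0.0, p_random - reduction)
    return p_random
```

For example, `adjust_p_random(0.5, 10, 10, 0.1)` lowers the random probability from 0.5 to about 0.4, while a count of 9 leaves it unchanged.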

A system for adjusting parameters of a neural network (NN) composed of one or more layers using Q-learning,
Wherein, in response to being executed on one or more computers, the system repeatedly executes the steps of:
Extracting game parameters relating to one or more game characters as a first input value;
Extracting, based on the first input value, a first expected reward (Q value) for each action of the game character from the neural network (NN) as an output value;
Extracting, as a second input value, the game parameter after one of the actions has been selected and executed;
Calculating, based on the second input value, a second expected reward (Q value) for each action of the game character from the neural network (NN); and
Updating a parameter of the neural network (NN) composed of one or more layers based on the first expected reward (Q value) and the second expected reward (Q value).

A program for adjusting parameters of a neural network (NN) composed of one or more layers using Q-learning,
Wherein, in response to being executed on one or more computers, the program causes the one or more computers to repeatedly execute the steps of:
Extracting game parameters relating to one or more game characters as a first input value;
Extracting, based on the first input value, a first expected reward (Q value) for each action of the game character from the neural network (NN) as an output value;
Extracting, as a second input value, the game parameter after one of the actions has been selected and executed;
Calculating, based on the second input value, a second expected reward (Q value) for each action of the game character from the neural network (NN); and
Updating a parameter of the neural network (NN) composed of one or more layers based on the first expected reward (Q value) and the second expected reward (Q value).