WO2024116387A1 - Information processing device, information processing method, and information processing program - Google Patents
Information processing device, information processing method, and information processing program
- Publication number
- WO2024116387A1 (PCT/JP2022/044452)
- Authority
- WO
- WIPO (PCT)
- Prior art keywords
- policy
- user
- game state
- information processing
- similarity
- Prior art date
Classifications
- A—HUMAN NECESSITIES
- A63—SPORTS; GAMES; AMUSEMENTS
- A63F—CARD, BOARD, OR ROULETTE GAMES; INDOOR GAMES USING SMALL MOVING PLAYING BODIES; VIDEO GAMES; GAMES NOT OTHERWISE PROVIDED FOR
- A63F13/00—Video games, i.e. games using an electronically generated display having two or more dimensions
- A63F13/60—Generating or modifying game content before or while executing the game program, e.g. authoring tools specially adapted for game development or game-integrated level editor
- A63F13/67—Generating or modifying game content before or while executing the game program, e.g. authoring tools specially adapted for game development or game-integrated level editor adaptively or by learning from player actions, e.g. skill level adjustment or by storing successful combat sequences for re-use
- A63F13/80—Special adaptations for executing a specific game genre or game mode
Definitions
- the present invention relates to an information processing device, an information processing method, and an information processing program for outputting strategies suited to the strength of a user in a competitive game.
- One possible method for creating an AI with the same level of strength as the user is to use machine learning based on the user's match data.
- the objective of the present invention is to create an AI with sufficient accuracy to match the user's strength without using a large amount of user match data.
- the present invention is characterized by comprising: an input unit that accepts input of the game state at a predetermined time; an effective move calculation unit that calculates the set of actions that the user can take, under the game rules, in the game state at the predetermined time; a reference policy calculation unit that calculates a reference policy, which is the policy that maximizes the similarity to the user's past match data, based on that set of actions and the user's past match data; a mixing unit that creates a policy by mixing the user's policy in the game state, as output by an AI model trained to take a game state as input and output the user's policy in that game state, with the reference policy in a ratio according to the magnitude of the similarity of the reference policy; and an output processing unit that outputs the created policy.
- FIG. 1 is a diagram for explaining the terminology used in this embodiment.
- FIG. 2 is a diagram for explaining an overview of the information processing device.
- FIG. 3 is a diagram illustrating an example of the configuration of an information processing device.
- FIG. 4 is a diagram showing an example of a matrix indicating the state of tic-tac-toe.
- FIG. 5 is a diagram showing an example of calculation of a set of effective moves by the effective move calculation unit in FIG. 3.
- FIG. 6 is a diagram showing an example of the database of the match data of FIG. 3.
- FIG. 7 is a diagram showing an example of an extended database created by the extended database creation unit in FIG. 3.
- FIG. 8 is a diagram showing an example of identification of a reference move by the reference move calculation unit in FIG. 3 and calculation of the similarity of the reference move.
- FIG. 9 is a diagram showing an example of a reference policy created by the reference policy creation unit in FIG. 3.
- FIG. 10 is a diagram showing an example of an AI policy created by the AI policy output unit in FIG. 3.
- FIG. 11 is a diagram showing an example of mixing the reference policy and the AI policy by the mixing unit in FIG. 3.
- FIG. 12 is a flowchart illustrating an example of a processing procedure of the information processing device.
- FIG. 13 is a diagram illustrating a computer that executes a program.
- a game state s is the state of the game at a given time (for example, the current time), and is also simply called a "state."
- an action (move) a is an action that a user can execute in the game.
- a set of effective moves E is the set of actions a that a user can execute in a certain game state s under the rules of the game.
- a policy p is information indicating, as probability values, which action a the user will execute in a certain game state s.
- the information processing device uses a small amount of the user's match data and an AI model to output a strategy that resembles the user's moves.
- the user's match data is defined as "a pair (s, a) of a game state (board) s and an action a in that game state s.”
- a database D is prepared, which is a collection of users' match data.
- the information processing device searches the database D for data on a situation close to that game state.
- the information processing device uses the user's actions shown in the searched data as a reference to output the probability (strategy) p that the user will choose each action in that game state.
- the information processing device adjusts the strategy for game state s0, for example, as follows:
- the information processing device first determines a set E of effective moves in the game state s0 (calculating effective moves). Then, the information processing device creates data (a0, (s, a)) for all combinations that combine action a0 in the set E of effective moves with each data (s, a) in database D. The created combinations are called the extended database D' (creating extended database, see formula (1)).
- the information processing device searches the extended database D' for the pair (a0*, (s, a)*) that maximizes the similarity between the data (s0, a0) and the data (s, a).
- the information processing device searches for a move that is similar to a move in the state indicated by the user's past match data from among the effective moves in the game state s0.
- the move (a0 * ) at this time is called the reference move.
- the similarity of the reference move at this time is defined as m (see formula (2)).
- the information processing device creates a reference policy p_a0* that sets the probability of selecting the reference move a0* in the game state s0 to the highest.
- the information processing device also inputs the game state s0 to the AI model and obtains the user's policy (AI policy p) output from the AI model.
- the information processing device obtains a policy p' by mixing the reference policy p_a0* with the AI policy p according to the magnitude of m (see formula (3)).
- the information processing device can output a strategy p' that resembles a move that the user could make in game state s0, without using a large amount of match data.
- the device that plays the match game with the user then determines the next move in game state s0 based on the output strategy p'. This allows the user to play a match game that matches their own strength.
- the information processing device 10 includes, for example, an input/output unit 11, a storage unit 12, and a control unit 13.
- the input/output unit 11 is an interface that handles the input and output of various data.
- the storage unit 12 stores data, programs, etc. that are referenced when the control unit 13 executes various processes.
- the storage unit 12 is realized by a semiconductor memory element such as a RAM (Random Access Memory) or a flash memory, or by a storage device such as a hard disk or an optical disk.
- the storage unit 12 stores the user's current game state s0 and the like received by the input/output unit 11.
- the storage unit 12 also stores the database D of the user's match data, the parameters of the AI model, and the like.
- the AI model is a model that is trained to take the game state as input and output the probability that the user will perform each action in that game state (the user's policy).
- the control unit 13 is responsible for controlling the entire information processing device 10.
- the functions of the control unit 13 are realized, for example, by a CPU (Central Processing Unit) executing a program stored in the storage unit 12.
- the control unit 13 includes, for example, a state input unit 130, an effective move calculation unit 131, a reference policy calculation unit 132, an AI policy output unit 136, a mixing unit 137, and an output processing unit 138.
- the various components of the control unit 13 will be described in detail below with reference to the drawings. Note that in the following, the game to be processed by the information processing device 10 will be described using tic-tac-toe, as shown in FIG. 4, as an example.
- Each square in this tic-tac-toe game is assigned a number as shown by reference numeral 401 in FIG. 4.
- the information processing device 10 represents the game state of tic-tac-toe, for example, as a combination of three matrices as shown by reference numeral 402.
- the first matrix is a matrix indicating squares in the tic-tac-toe game that contain a circle.
- the second matrix is a matrix indicating squares in the tic-tac-toe game that contain an x.
- the third matrix is a matrix indicating squares in the tic-tac-toe game that are empty.
- the state input unit 130 accepts input of the game state s0.
- the effective move calculation unit 131 calculates a set E of moves (effective moves) that the user can take in the game state s0.
- when the effective move calculation unit 131 receives an input of the game state (state) s0 indicated by reference numeral 501 in FIG. 5, it calculates and outputs the set E of effective moves indicated by reference numeral 502.
- the reference policy calculation unit 132 calculates a reference policy, which is the policy that maximizes the similarity to the user's past match data, based on the set E of effective moves calculated by the effective move calculation unit 131 and the user's past match data (the match data database D).
- the reference policy calculation unit 132 includes, for example, an extended database creation unit 133, a reference move calculation unit 134, and a reference policy creation unit 135.
- the extended database creation unit 133 creates an extended database D' that combines the action a0 of the set E of effective moves with each data (s, a) of the database D of the match data.
- the extended database creation unit 133 creates the extended database D' shown by reference numeral 701 by calculating combinations of the set E of effective moves and the match data in database D, as shown in FIG. 7.
- the extended database creation unit 133 then outputs the created extended database D'.
- the reference move calculation unit 134 uses the extended database D' to identify an effective move (reference move) from the set E of effective moves in the game state s0 that has the highest similarity to the move indicated by the user's match data. The reference move calculation unit 134 then outputs the identified reference move and the similarity m of that reference move.
- when the reference move calculation unit 134 receives input of the game state s0 and the extended database D', it exhaustively calculates the similarity between each combination of (game state s0, effective move) and each match-data (game state, move) pair.
- the reference move calculation unit 134 vectorizes (game state s0, valid move) and match data (game state, move) using a neural network, and calculates the similarity between these vectors.
- the reference policy creation unit 135 creates a reference policy from the reference move identified by the reference move calculation unit 134. For example, the reference policy creation unit 135 creates a policy in which the probability of selecting the reference move a0 in the game state s0 is set higher than the probability of selecting other moves. The created policy is then designated as the reference policy p_a0.
- the AI policy output unit 136 outputs the user's policy in the game state s0 output from the AI model.
- the AI policy output unit 136 inputs the game state s0 to the AI model, and outputs the user's policy (AI policy p shown with reference symbol 1001) output from the AI model.
- the mixing unit 137 creates a policy by mixing the reference policy created by the reference policy creation unit 135 and the AI policy p output by the AI policy output unit 136.
- the mixing unit 137 mixes the reference policy p_a0* and the AI policy p by weighting them with an exponential function of the similarity m (see equation (4)), and creates and outputs the policy p' shown by reference numeral 1101.
- when mixing the reference policy p_a0* and the AI policy p, the mixing unit 137 increases the weight of the reference policy p_a0* relative to the AI policy p the more similar the reference policy p_a0* is to a policy shown in the user's past match data. As a result, the mixing unit 137 can create a policy p' that resembles a past move of the user. After that, the output processing unit 138 outputs the policy p' created by the mixing unit 137 via the input/output unit 11.
- the state input unit 130 accepts input of a game state s0 (S1).
- the effective move calculation unit 131 calculates a set E of effective moves in the game state s0 input in S1 (S2).
- the extended database creation unit 133 creates an extended database D' by combining the action a0 of the set E of effective moves calculated in S2 with each data item (s, a) in the database D of the match data (S3).
- the reference move calculation unit 134 identifies, from among the (game state s0, effective move) pairs in the extended database D' created in S3, the pair that has the highest similarity to a (game state, move) pair in the user's match data.
- the reference move calculation unit 134 then outputs the effective move (reference move) for the identified pair and the similarity m of the pair (S4: Calculation of reference move).
- the reference policy creation unit 135 creates a reference policy from the reference move calculated in S4 (S5).
- the AI policy output unit 136 inputs the game state s0 to the AI model and outputs the policy (AI policy) output by the AI model (S6). Then, the mixing unit 137 mixes the reference policy created in S5 and the AI policy output in S6 in a ratio according to the similarity m (S7: mixing of reference policy and AI policy). After that, the output processing unit 138 outputs the policy mixed in S7 (S8).
- the information processing device 10 searches for data of similar situations from the user's past match data, and selects the user's action (reference move) by referring to the user's action in that similar situation.
- the information processing device 10 then mixes the selected action (reference move) with the strategy (AI strategy) output by the AI model, and determines and outputs the probability (strategy) that the user will select each action. This allows the information processing device 10 to output a strategy with sufficient accuracy corresponding to the user's strength, without using a large amount of the user's match data.
- each component of each part shown in the figure is a functional concept, and does not necessarily have to be physically configured as shown in the figure.
- the specific form of distribution and integration of each device is not limited to that shown in the figure, and all or a part of it can be functionally or physically distributed and integrated in any unit depending on various loads, usage conditions, etc.
- each processing function performed by each device can be realized in whole or in any part by a CPU and a program executed by the CPU, or can be realized as hardware using wired logic.
- the information processing device 10 can be implemented by installing a program (information processing program) as package software or online software on a desired computer.
- an information processing apparatus can be made to function as the information processing device 10 by causing it to execute the above program.
- the information processing device referred to here includes mobile communication terminals such as smartphones, mobile phones, and PHS (Personal Handyphone System), and further terminals such as PDAs (Personal Digital Assistants).
- FIG. 13 is a diagram showing an example of a computer that executes an information processing program.
- the computer 1000 has, for example, a memory 1010 and a CPU 1020.
- the computer 1000 also has a hard disk drive interface 1030, a disk drive interface 1040, a serial port interface 1050, a video adapter 1060, and a network interface 1070. Each of these components is connected by a bus 1080.
- the memory 1010 includes a ROM (Read Only Memory) 1011 and a RAM (Random Access Memory) 1012.
- the ROM 1011 stores a boot program such as a BIOS (Basic Input Output System).
- the hard disk drive interface 1030 is connected to a hard disk drive 1090.
- the disk drive interface 1040 is connected to a disk drive 1100.
- a removable storage medium such as a magnetic disk or optical disk is inserted into the disk drive 1100.
- the serial port interface 1050 is connected to a mouse 1110 and a keyboard 1120, for example.
- the video adapter 1060 is connected to a display 1130, for example.
- the hard disk drive 1090 stores, for example, an OS 1091, an application program 1092, a program module 1093, and program data 1094. That is, the programs that define each process executed by the information processing device 10 are implemented as program modules 1093 in which computer-executable code is written.
- the program modules 1093 are stored, for example, in the hard disk drive 1090.
- a program module 1093 for executing processes similar to the functional configuration of the information processing device 10 is stored in the hard disk drive 1090.
- the hard disk drive 1090 may be replaced by an SSD (Solid State Drive).
- the data used in the processing of the above-described embodiment is stored as program data 1094, for example, in memory 1010 or hard disk drive 1090. Then, the CPU 1020 reads the program module 1093 or program data 1094 stored in memory 1010 or hard disk drive 1090 into RAM 1012 as necessary and executes it.
- the program module 1093 and program data 1094 are not limited to being stored in the hard disk drive 1090, but may be stored in, for example, a removable storage medium and read by the CPU 1020 via the disk drive 1100 or the like. Alternatively, the program module 1093 and program data 1094 may be stored in another computer connected via a network (such as a LAN (Local Area Network), WAN (Wide Area Network)). The program module 1093 and program data 1094 may then be read by the CPU 1020 from the other computer via the network interface 1070.
Abstract
An information processing device calculates a set E of moves (effective moves) that a user can make in accordance with the rules of a game in a game state s0. The information processing device then refers to a database D of the user's past match data and, from the set E of effective moves, selects the move (reference move a0*) that has the maximum similarity to a move made by the user in a game state the user has experienced in the past. The information processing device then outputs a reference policy p_a0* indicating the reference move a0* and a similarity m of the reference policy p_a0*. The information processing device also acquires a policy (AI policy) p obtained by inputting the game state s0 into an AI model. The information processing device then creates and outputs a policy p' that resembles the moves of the user by mixing the reference policy p_a0* with the AI policy p in a proportion according to the magnitude of the similarity m.
Description
The present invention relates to an information processing device, an information processing method, and an information processing program for outputting strategies suited to the strength of a user in a competitive game.
In recent years, AI technology has been actively used in computer battles in games such as Go and Shogi, and research into strong AI that can beat professional players is particularly active. One practice method is for humans to observe the behavior (battle tendencies) of such strong AI and learn optimal actions during the game.
Practicing against a strong AI is not commonly done because the AI is too strong, and in actual practice matches, it is considered desirable to play against an AI that matches the user's level.
Here, to create an AI with the same level of strength as the user, it is necessary to properly understand the user's strength and incorporate it into the AI. One possible method for creating an AI with the same level of strength as the user is to use machine learning based on the user's match data.
However, there is a problem in that a large amount of user match data is required to create an AI with sufficient accuracy to match the user's strength through machine learning using the user's match data. Therefore, the objective of the present invention is to create an AI with sufficient accuracy to match the user's strength without using a large amount of user match data.
In order to solve the above-mentioned problems, the present invention is characterized by comprising: an input unit that accepts input of the game state at a predetermined time; an effective move calculation unit that calculates the set of actions that the user can take, under the game rules, in the game state at the predetermined time; a reference policy calculation unit that calculates a reference policy, which is the policy that maximizes the similarity to the user's past match data, based on that set of actions and the user's past match data; a mixing unit that creates a policy by mixing the user's policy in the game state, as output by an AI model trained to take a game state as input and output the user's policy in that game state, with the reference policy in a ratio according to the magnitude of the similarity of the reference policy; and an output processing unit that outputs the created policy.
According to the present invention, it is possible to construct an AI with sufficient accuracy corresponding to the user's strength without using a large amount of the user's match data.
Below, a form (embodiment) for carrying out the present invention will be described with reference to the drawings. The present invention is not limited to this embodiment.
[Explanation of Terminology]
First, the terms used in this embodiment will be explained with reference to FIG. 1. A game state s is the state of the game at a given time (for example, the current time), and is also simply called a "state." An action (move) a is an action that a user can execute in the game. A set of effective moves E is the set of actions a that a user can execute in a certain game state s under the rules of the game. A policy p is information indicating, as probability values, which action a the user will execute in a certain game state s.
[Overview]
Next, an overview of the information processing device of this embodiment will be described. The information processing device uses a small amount of the user's match data and an AI model to output a strategy that resembles the user's moves.
In this embodiment, the user's match data is defined as "a pair (s, a) of a game state (board) s and an action a in that game state s." A database D is prepared, which is a collection of users' match data. When outputting a strategy in a certain game state, the information processing device searches the database D for data on a situation close to that game state. The information processing device then uses the user's actions shown in the searched data as a reference to output the probability (strategy) p that the user will choose each action in that game state.
Here, the information processing device adjusts the strategy for game state s0, for example, as follows:
For example, as shown in FIG. 2, the information processing device first determines a set E of effective moves in the game state s0 (calculating effective moves). Then, the information processing device creates data (a0, (s, a)) for all combinations that combine action a0 in the set E of effective moves with each data (s, a) in database D. The created combinations are called the extended database D' (creating extended database, see formula (1)).
Next, the information processing device searches the extended database D' for the pair (a0*, (s, a)*) that maximizes the similarity between the data (s0, a0) and the data (s, a). In other words, the information processing device searches, among the effective moves in the game state s0, for a move that resembles a move made in a state recorded in the user's past match data. The move a0* found in this way is called the reference move, and its similarity is denoted m (see formula (2)).
Next, the information processing device creates a reference policy p_a0* that sets the probability of selecting the reference move a0* in the game state s0 to the highest. The information processing device also inputs the game state s0 to the AI model and obtains the user's policy (AI policy p) output from the AI model. Then, the information processing device obtains a policy p' by mixing the reference policy p_a0* with the AI policy p according to the magnitude of m (see formula (3)).
By doing this, the information processing device can output a strategy p' that resembles a move that the user could make in game state s0, without using a large amount of match data. The device that plays the match game with the user then determines the next move in game state s0 based on the output strategy p'. This allows the user to play a match game that matches their own strength.
[Configuration Example]
Next, a configuration example of the information processing device 10 will be described with reference to FIG. 3. The information processing device 10 includes, for example, an input/output unit 11, a storage unit 12, and a control unit 13.
The input/output unit 11 is an interface that handles the input and output of various data. The input/output unit 11, for example, accepts input of a game state s0 at a given point in time (e.g., the present). For example, it accepts input of a game state s0 immediately after a user takes an action in a competitive game.
The storage unit 12 stores data, programs, and the like that are referenced when the control unit 13 executes various processes. The storage unit 12 is realized by a semiconductor memory element such as a RAM (Random Access Memory) or a flash memory, or by a storage device such as a hard disk or an optical disk.
For example, the storage unit 12 stores the user's current game state s0 and the like received by the input/output unit 11. The storage unit 12 also stores the database D of the user's match data, the parameters of the AI model, and the like. The AI model is a model trained to take a game state as input and output the probability that the user will perform each action in that game state (the user's policy).
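To make the data flow concrete, the following is a minimal sketch of the AI model's interface; it is not part of the patent text, and the feed-forward architecture, layer sizes, and random weights are illustrative assumptions, since the patent only requires a model that maps a game state to per-action probabilities.

```python
# Sketch (not from the patent): a stand-in policy model mapping a
# 3x3x3 tic-tac-toe state encoding to a probability over nine squares.
import numpy as np

rng = np.random.default_rng(0)

class PolicyModel:
    def __init__(self, state_dim=27, hidden=32, n_actions=9):
        # Random, untrained weights: a real system would train these on
        # the user's match data.
        self.w1 = rng.normal(0.0, 0.1, (state_dim, hidden))
        self.w2 = rng.normal(0.0, 0.1, (hidden, n_actions))

    def __call__(self, state):
        h = np.tanh(state.reshape(-1) @ self.w1)
        logits = h @ self.w2
        e = np.exp(logits - logits.max())
        return e / e.sum()  # AI policy p: one probability per square
```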
The control unit 13 is responsible for controlling the entire information processing device 10. The functions of the control unit 13 are realized, for example, by a CPU (Central Processing Unit) executing a program stored in the storage unit 12.
The control unit 13 includes, for example, a state input unit 130, an effective move calculation unit 131, a reference policy calculation unit 132, an AI policy output unit 136, a mixing unit 137, and an output processing unit 138.
The various components of the control unit 13 will be described in detail below with reference to the drawings. Note that in the following, the game to be processed by the information processing device 10 will be described as an example of tic-tac-toe, as shown in FIG. 4.
Each square in this tic-tac-toe game is assigned a number as shown by reference numeral 401 in FIG. 4. Based on the numbers shown by reference numeral 401, the information processing device 10 represents the game state of tic-tac-toe, for example, as a combination of three matrices as shown by reference numeral 402. The first matrix indicates the squares that contain a circle, the second matrix indicates the squares that contain a cross, and the third matrix indicates the squares that are empty.
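As an illustration of this three-matrix representation, the sketch below encodes a board as circle, cross, and empty planes. The helper and the concrete board contents are hypothetical, not taken from the patent's figures.

```python
# Sketch: three-plane encoding of a tic-tac-toe board (squares 1-9,
# row-major, as in FIG. 4).
import numpy as np

def encode_state(circles, crosses):
    """Return a 3x3x3 array: planes for circle, cross, and empty squares."""
    circle_plane = np.zeros((3, 3))
    cross_plane = np.zeros((3, 3))
    for sq in circles:
        circle_plane[(sq - 1) // 3, (sq - 1) % 3] = 1.0
    for sq in crosses:
        cross_plane[(sq - 1) // 3, (sq - 1) % 3] = 1.0
    empty_plane = 1.0 - circle_plane - cross_plane
    return np.stack([circle_plane, cross_plane, empty_plane])

s0 = encode_state(circles={1, 5}, crosses={3})  # hypothetical game state s0
```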
Returning to the explanation of FIG. 3, the state input unit 130 accepts input of the game state s0. In addition, the effective move calculation unit 131 calculates a set E of moves (effective moves) that the user can take in the game state s0.
For example, when the effective move calculation unit 131 receives an input of the game state (state) s0 indicated by reference numeral 501 in FIG. 5, it calculates and outputs a set E of effective moves indicated by reference numeral 502.
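Since, under the rules of tic-tac-toe, the user may mark any empty square, the effective move calculation reduces to reading off the empty plane of the encoding above; a minimal sketch:

```python
# Sketch: the set E of effective moves, reusing encode_state/s0 from the
# previous sketch. Any empty square is a legal move in tic-tac-toe.
def effective_moves(state):
    """Return the set E of square numbers (1-9) that are empty."""
    empty_plane = state[2]
    return {r * 3 + c + 1
            for r in range(3) for c in range(3)
            if empty_plane[r, c] == 1.0}

E = effective_moves(s0)  # {2, 4, 6, 7, 8, 9} for the hypothetical s0
```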
Returning to the explanation of FIG. 3, the reference policy calculation unit 132 calculates a reference policy, which is the policy that maximizes the similarity to the user's past match data, based on the set E of effective moves calculated by the effective move calculation unit 131 and the user's past match data (the match data database D).
The reference policy calculation unit 132 includes, for example, an extended database creation unit 133, a reference move calculation unit 134, and a reference policy creation unit 135. The extended database creation unit 133 creates an extended database D' that combines each action a0 in the set E of effective moves with each data item (s, a) in the match data database D.
For example, consider the case where the set E of effective moves is the set of moves shown by reference numeral 502 in FIG. 5, and the match data stored in database D is the match data shown in FIG. 6. In this case, the extended database creation unit 133 creates the extended database D' shown by reference numeral 701 by computing the combinations of the set E of effective moves and the match data in database D, as shown in FIG. 7. The extended database creation unit 133 then outputs the created extended database D'.
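Formula (1) is not reproduced in this text, but the description pins the construction down as a Cartesian product, D' = {(a0, (s, a)) | a0 ∈ E, (s, a) ∈ D}; the sketch below uses hypothetical match data.

```python
# Sketch: extended database D' as the product of the effective moves E
# and the match data D. The two entries in D are made-up examples of
# (past game state, move the user actually played).
from itertools import product

D = [
    (encode_state(circles={1}, crosses=set()), 5),
    (encode_state(circles={1, 9}, crosses={5}), 3),
]

D_ext = [(a0, (s, a)) for a0, (s, a) in product(sorted(E), D)]
# len(D_ext) == len(E) * len(D): every effective move is paired with
# every match-data entry.
```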
Returning to the explanation of FIG. 3, the reference move calculation unit 134 uses the extended database D' to identify an effective move (reference move) from the set E of effective moves in the game state s0 that has the highest similarity to the move indicated by the user's match data. The reference move calculation unit 134 then outputs the identified reference move and the similarity m of that reference move.
For example, as shown in FIG. 8, when the reference move calculation unit 134 receives the game state s0 and the extended database D' as input, it exhaustively calculates the similarity between each combination of (game state s0, effective move) and each match-data (game state, move) pair.
For example, the reference move calculation unit 134 vectorizes (game state s0, effective move) pairs and match-data (game state, move) pairs using a neural network, and calculates the similarity between these vectors. For example, a mechanism is provided that uses a variational encoder to convert a (game state, action) pair (s, a) into a vector. Using this mechanism, the reference move calculation unit 134 converts (s, a) into a vector v1 and (s0, a0) into a vector v2. The reference move calculation unit 134 then calculates the cosine similarity between the vectors v1 and v2.
Then, the reference move calculation unit 134 identifies the effective move that maximizes the calculated similarity. For example, among the combinations of (game state s0, effective move) shown in FIG. 8 and (game state, move) shown in the match data, the combination that maximizes the similarity is the combination shown by reference symbol 801. Therefore, the reference move calculation unit 134 identifies the effective move a0=3 in the combination shown by reference symbol 801 as the reference move. Also, the similarity m of a0=3 is 0.94. Therefore, the reference move calculation unit 134 outputs a0=3 and similarity m=0.94.
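A sketch of this brute-force search follows. The patent vectorizes (game state, move) pairs with a trained neural network such as a variational encoder; the hand-rolled featurization below is a stand-in assumption so that the cosine-similarity argmax can be shown end to end.

```python
# Sketch: reference move search over D'. featurize() stands in for the
# patent's learned encoder and simply flattens the state planes and
# appends a one-hot move.
import numpy as np

def featurize(state, move):
    one_hot = np.zeros(9)
    one_hot[move - 1] = 1.0
    return np.concatenate([state.reshape(-1), one_hot])

def cosine(u, v):
    return float(u @ v / (np.linalg.norm(u) * np.linalg.norm(v)))

def reference_move(s0, D_ext):
    """Return (a0*, m): the effective move most similar to the match data."""
    best_move, best_sim = None, -1.0
    for a0, (s, a) in D_ext:
        sim = cosine(featurize(s0, a0), featurize(s, a))
        if sim > best_sim:
            best_move, best_sim = a0, sim
    return best_move, best_sim

a0_star, m = reference_move(s0, D_ext)
```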
Returning to the explanation of FIG. 3, the reference policy creation unit 135 creates a reference policy from the reference move identified by the reference move calculation unit 134. For example, the reference policy creation unit 135 creates a policy in which the probability of selecting the reference move a0 in the game state s0 is set higher than the probability of selecting other moves. The created policy is then designated as the reference policy p_a0.
For example, as shown in FIG. 9, the reference policy creation unit 135 converts the reference move a0=3 into a one-hot vector, creating the vector shown by reference numeral 901 (a vector in which, among squares 1 to 9, the probability of square 3 is set to 1.0 and the probabilities of all other squares are set to 0.0).
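The one-hot construction itself is small; a sketch consistent with FIG. 9:

```python
# Sketch: reference policy as a one-hot vector over the nine squares.
import numpy as np

def reference_policy(ref_move, n_actions=9):
    p_ref = np.zeros(n_actions)
    p_ref[ref_move - 1] = 1.0  # probability 1.0 on the reference move
    return p_ref

p_ref = reference_policy(a0_star)
```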
Returning to the explanation of FIG. 3, the AI policy output unit 136 outputs the user's policy in the game state s0 output from the AI model. For example, as shown in FIG. 10, the AI policy output unit 136 inputs the game state s0 to the AI model, and outputs the user's policy (AI policy p shown with reference symbol 1001) output from the AI model.
Returning to the explanation of FIG. 3, the mixing unit 137 creates a policy by mixing the reference policy created by the reference policy creation unit 135 and the AI policy p output by the AI policy output unit 136.
For example, as shown in FIG. 11, the mixing unit 137 mixes the reference policy p_a0* and the AI policy p by weighting them with an exponential function of the similarity m (see equation (4)), and creates and outputs the policy p' shown by reference numeral 1101.
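Equation (4) is likewise not reproduced here, so the exponential weight used below, w = exp(β(m − 1)), is an assumption chosen only to match the stated behavior: the weight of the reference policy rises toward 1 as the similarity m approaches 1.

```python
# Sketch: mixing the reference policy and the AI policy. The weight
# w = exp(beta * (m - 1)) is an assumed form of the patent's
# exponential weighting, not the actual equation (4).
import numpy as np

def mix_policies(p_ref, p_ai, m, beta=5.0):
    w = np.exp(beta * (m - 1.0))        # in (0, 1] for cosine m <= 1
    p_mixed = w * p_ref + (1.0 - w) * p_ai
    return p_mixed / p_mixed.sum()      # renormalize to a distribution

p_ai = PolicyModel()(s0)                # AI policy p from the earlier sketch
p_prime = mix_policies(p_ref, p_ai, m)  # mixed policy p'
```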
In this way, when mixing the reference policy p_a0* and the AI policy p, the mixing unit 137 increases the weight of the reference policy p_a0* relative to the AI policy p the more similar the reference policy is to a policy shown in the user's past match data. As a result, the mixing unit 137 can create a policy p' that resembles the user's past moves. The output processing unit 138 then outputs the policy p' created by the mixing unit 137 via the input/output unit 11.
[Example of Processing Procedure]
Next, an example of the processing procedure executed by the information processing device 10 will be described with reference to FIG. 12. First, the state input unit 130 accepts input of a game state s0 (S1). After that, the effective move calculation unit 131 calculates the set E of effective moves in the game state s0 input in S1 (S2).
Next, the extended database creation unit 133 creates the extended database D' by combining each action a0 in the set E of effective moves calculated in S2 with each data item (s, a) in the match data database D (S3).
After S3, the reference move calculation unit 134 identifies, from among the (game state s0, effective move) pairs in the extended database D' created in S3, the pair that has the highest similarity to a (game state, move) pair in the user's match data. The reference move calculation unit 134 then outputs the effective move (reference move) of the identified pair and the similarity m of that pair (S4: calculation of the reference move). After that, the reference policy creation unit 135 creates a reference policy from the reference move calculated in S4 (S5).
The AI policy output unit 136 inputs the game state s0 to the AI model and outputs the policy (AI policy) output by the AI model (S6). Then, the mixing unit 137 mixes the reference policy created in S5 and the AI policy output in S6 in a ratio according to the similarity m (S7: mixing of reference policy and AI policy). After that, the output processing unit 138 outputs the policy mixed in S7 (S8).
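Putting the sketches above together, the S1-S8 procedure could be exercised as follows; the final masking of p' to the effective moves E is an added safeguard, not a step stated in the procedure.

```python
# Sketch: end-to-end pipeline following steps S1-S8 of FIG. 12,
# reusing the helper sketches defined earlier.
import numpy as np

def output_policy(s0, D, model):
    E = effective_moves(s0)                                      # S2
    D_ext = [(a0, (s, a)) for a0 in sorted(E) for (s, a) in D]   # S3
    a0_star, m = reference_move(s0, D_ext)                       # S4
    p_ref = reference_policy(a0_star)                            # S5
    p_ai = model(s0)                                             # S6
    p_prime = mix_policies(p_ref, p_ai, m)                       # S7
    mask = np.zeros(9)
    for sq in E:                      # keep probability only on legal moves
        mask[sq - 1] = 1.0
    p_prime = p_prime * mask
    return p_prime / p_prime.sum()                               # S8

print(output_policy(s0, D, PolicyModel()))
```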
In this way, when selecting a user's action in game state s0, the information processing device 10 searches for data of similar situations from the user's past match data, and selects the user's action (reference move) by referring to the user's action in that similar situation. The information processing device 10 then mixes the selected action (reference move) with the strategy (AI strategy) output by the AI model, and determines and outputs the probability (strategy) that the user will select each action. This allows the information processing device 10 to output a strategy with sufficient accuracy corresponding to the user's strength, without using a large amount of the user's match data.
[System Configuration, etc.]
Each component of each device shown in the figures is functional and conceptual, and does not necessarily have to be physically configured as shown. In other words, the specific form of distribution and integration of each device is not limited to that shown in the figures, and all or part of each device can be functionally or physically distributed or integrated in any unit according to various loads, usage conditions, and the like. Furthermore, each processing function performed by each device can be realized, in whole or in any part, by a CPU and a program executed by the CPU, or as hardware using wired logic.
Furthermore, among the processes described in the above embodiments, all or part of the processes described as being performed automatically can be performed manually, or all or part of the processes described as being performed manually can be performed automatically using known methods. In addition, the information including the processing procedures, control procedures, specific names, various data and parameters shown in the above documents and drawings can be changed as desired unless otherwise specified.
[Program]
The information processing device 10 described above can be implemented by installing a program (information processing program) as packaged software or online software on a desired computer. For example, an information processing apparatus can be made to function as the information processing device 10 by causing it to execute the above program. The information processing apparatus referred to here includes mobile communication terminals such as smartphones, mobile phones, and PHS (Personal Handyphone System) devices, as well as terminals such as PDAs (Personal Digital Assistants).
FIG. 13 is a diagram showing an example of a computer that executes an information processing program. The computer 1000 has, for example, a memory 1010 and a CPU 1020. The computer 1000 also has a hard disk drive interface 1030, a disk drive interface 1040, a serial port interface 1050, a video adapter 1060, and a network interface 1070. Each of these components is connected by a bus 1080.
The memory 1010 includes a ROM (Read Only Memory) 1011 and a RAM (Random Access Memory) 1012. The ROM 1011 stores a boot program such as a BIOS (Basic Input Output System). The hard disk drive interface 1030 is connected to a hard disk drive 1090. The disk drive interface 1040 is connected to a disk drive 1100. A removable storage medium such as a magnetic disk or optical disk is inserted into the disk drive 1100. The serial port interface 1050 is connected to a mouse 1110 and a keyboard 1120, for example. The video adapter 1060 is connected to a display 1130, for example.
The hard disk drive 1090 stores, for example, an OS 1091, an application program 1092, a program module 1093, and program data 1094. That is, the programs that define each process executed by the information processing device 10 are implemented as program modules 1093 in which computer-executable code is written. The program modules 1093 are stored, for example, in the hard disk drive 1090. For example, a program module 1093 for executing processes similar to the functional configuration of the information processing device 10 is stored in the hard disk drive 1090. The hard disk drive 1090 may be replaced by an SSD (Solid State Drive).
The data used in the processing of the above-described embodiment is stored as program data 1094, for example, in memory 1010 or hard disk drive 1090. Then, the CPU 1020 reads the program module 1093 or program data 1094 stored in memory 1010 or hard disk drive 1090 into RAM 1012 as necessary and executes it.
The program module 1093 and program data 1094 are not limited to being stored in the hard disk drive 1090, but may be stored in, for example, a removable storage medium and read by the CPU 1020 via the disk drive 1100 or the like. Alternatively, the program module 1093 and program data 1094 may be stored in another computer connected via a network (such as a LAN (Local Area Network), WAN (Wide Area Network)). The program module 1093 and program data 1094 may then be read by the CPU 1020 from the other computer via the network interface 1070.
REFERENCE SIGNS LIST
10 Information processing device
11 Input/output unit
12 Storage unit
13 Control unit
130 State input unit
131 Effective move calculation unit
132 Reference policy calculation unit
133 Extended database creation unit
134 Reference move calculation unit
135 Reference policy creation unit
136 AI policy output unit
137 Mixing unit
138 Output processing unit
Claims (7)

1. An information processing device comprising:
an input unit that receives an input of a game state at a predetermined time point;
an effective move calculation unit that calculates a set of actions that a user can take under the rules of the game in the game state at the predetermined time point;
a reference policy calculation unit that calculates a reference policy, the reference policy being a policy that maximizes a similarity to the user's past match data, based on the set of actions that the user can take under the rules of the game in the game state and the user's past match data;
a mixing unit that creates a policy by mixing the user's policy in the game state, output by an AI model trained to take a game state as an input and output the user's policy in that game state, with the reference policy in a ratio according to the magnitude of the similarity of the reference policy; and
an output processing unit that outputs the created policy.

2. The information processing device according to claim 1, wherein the policy is information indicating a probability that the user selects each action in the game state.

3. The information processing device according to claim 1, wherein the reference policy calculation unit calculates, as the similarity, a similarity between a pair of the game state at the predetermined time point and an action that the user can take in that game state and a pair of a game state shown in the user's past match data and an action taken by the user in that game state.

4. The information processing device according to claim 3, wherein the reference policy calculation unit creates a policy in which, for the action having the greatest similarity among the actions that the user can take in the game state, the probability that the user selects that action is set higher than the probability of selecting any other action, and uses the created policy as the reference policy.

5. The information processing device according to claim 1, wherein the mixing unit mixes the user's policy output by the AI model with the reference policy such that the greater the similarity, the greater the ratio of the reference policy to the user's policy output by the AI model.

6. An information processing method executed by an information processing device, the method comprising the steps of:
receiving an input of a game state at a predetermined time point;
calculating a set of actions that the user can take under the rules of the game in the game state at the predetermined time point;
calculating a reference policy, the reference policy being a policy that maximizes a similarity to the user's past match data, based on the set of actions that the user can take under the rules of the game in the game state and the user's past match data;
creating a policy by mixing the user's policy in the game state, output by an AI model trained to take a game state as an input and output the user's policy in that game state, with the reference policy in a ratio according to the magnitude of the similarity of the reference policy; and
outputting the created policy.

7. An information processing program for causing a computer to execute the steps of:
receiving an input of a game state at a predetermined time point;
calculating a set of actions that the user can take under the rules of the game in the game state at the predetermined time point;
calculating a reference policy, the reference policy being a policy that maximizes a similarity to the user's past match data, based on the set of actions that the user can take under the rules of the game in the game state and the user's past match data;
creating a policy by mixing the user's policy in the game state, output by an AI model trained to take a game state as an input and output the user's policy in that game state, with the reference policy in a ratio according to the magnitude of the similarity of the reference policy; and
outputting the created policy.
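The reference policy of claims 3 and 4 can be illustrated concretely. Below is a minimal sketch, assuming that game states and actions are encoded as fixed-length NumPy feature vectors, that cosine similarity over concatenated (state, action) vectors is the similarity measure, and that a single illustrative `high` probability parameter is used; none of these choices is fixed by the publication.

```python
# Hypothetical sketch of the reference policy of claims 3 and 4.
# Assumptions (not fixed by the publication): states and actions are
# fixed-length NumPy vectors, and cosine similarity over concatenated
# (state, action) vectors is the similarity measure.
import numpy as np

def pair_similarity(state, action, past_state, past_action):
    """Similarity between a (state, action) pair and one pair from the
    user's past match data: cosine similarity over concatenated vectors."""
    q = np.concatenate([state, action])
    p = np.concatenate([past_state, past_action])
    return float(q @ p / (np.linalg.norm(q) * np.linalg.norm(p) + 1e-12))

def reference_policy(state, legal_actions, past_pairs, high=0.9):
    """For each legal action, take its best similarity against all past
    (state, action) pairs. The most similar action gets probability
    `high` (an illustrative parameter); the rest share the remainder
    uniformly. Returns the policy and the winning similarity score."""
    sims = [max(pair_similarity(state, a, ps, pa) for ps, pa in past_pairs)
            for a in legal_actions]
    best = int(np.argmax(sims))
    n = len(legal_actions)
    if n == 1:
        return np.ones(1), sims[0]
    policy = np.full(n, (1.0 - high) / (n - 1))
    policy[best] = high
    return policy, sims[best]
```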
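The mixing of claims 1 and 5 can be sketched in the same way. Here the similarity score itself, clipped to [0, 1], is assumed to serve as a linear mixing weight; this is one possible reading, not a formula given in the publication.

```python
# Hypothetical sketch of the mixing step of claims 1 and 5: the weight of
# the reference policy grows with its similarity score. The linear
# weighting and the clipping to [0, 1] are illustrative assumptions.
import numpy as np

def mix_policies(ai_policy: np.ndarray, ref_policy: np.ndarray,
                 similarity: float) -> np.ndarray:
    """Convex combination of the AI model's policy and the reference
    policy; a larger similarity shifts weight toward the reference policy,
    and since both inputs sum to 1, the output also sums to 1."""
    w = float(np.clip(similarity, 0.0, 1.0))
    return w * ref_policy + (1.0 - w) * ai_policy
```

Under these assumptions, the device of claim 1 would chain the two sketches: `policy, sim = reference_policy(state, legal_actions, past_pairs)` followed by `mix_policies(ai_model(state), policy, sim)`, where `ai_model` is a stand-in for the trained model of the claims.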
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
PCT/JP2022/044452 (WO2024116387A1) | 2022-12-01 | 2022-12-01 | Information processing device, information processing method, and information processing program |
Publications (1)
Publication Number | Publication Date |
---|---|
WO2024116387A1 (en) | 2024-06-06 |
Family
ID=91323148
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
PCT/JP2022/044452 | WO2024116387A1 (en) | 2022-12-01 | 2022-12-01 |
Country Status (1)
Country | Link |
---|---|
WO (1) | WO2024116387A1 (en) |
Citations (5)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
JP2004187860A (en) * | 2002-12-10 | 2004-07-08 | Konami Co Ltd | Apparatus and program for playing game |
US20170357893A1 (en) * | 2016-06-10 | 2017-12-14 | Apple Inc. | Artificial intelligence controller that procedurally tailors itself to an application |
JP2019191786A (en) * | 2018-04-21 | 2019-10-31 | Heroz株式会社 | Game program generation device and game program generation program |
JP2020115957A (en) * | 2019-01-21 | 2020-08-06 | 株式会社 ディー・エヌ・エー | Information processing device, information processing program, and information processing method |
JP2020191022A (en) * | 2019-05-23 | 2020-11-26 | 国立大学法人神戸大学 | Learning method, learning device, and learning program for ai agent behaving like human |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
| 121 | Ep: the epo has been informed by wipo that ep was designated in this application | Ref document number: 22967219; Country of ref document: EP; Kind code of ref document: A1 |