WO2024116387A1

WO2024116387A1 - Information processing device, information processing method, and information processing program

Info

Publication number: WO2024116387A1
Application number: PCT/JP2022/044452
Authority: WO
Inventors: 司吉田; 済央野本; 篤深山
Original assignee: 日本電信電話株式会社
Priority date: 2022-12-01
Filing date: 2022-12-01
Publication date: 2024-06-06

Abstract

An information processing device calculates a set E of moves (effective moves) that a user can make in accordance with the rules of a game in a game state s0. The information processing device then refers to a database D of past battle data of the user, and from the set E of effective moves, selects a move (reference move a0^*) that has maximum similarity to a move made by the user in a game state that the user has experienced in the past. The information processing device then outputs a reference policy p_a0^* indicating the reference move a0^* and a similarity m of the reference policy p_a0^*. The information processing device also acquires a policy (AI policy) p obtained by inputting the game state s0 into an AI model. The information processing device then creates and outputs a policy p' that resembles the moves of the user by mixing the reference policy p_a0^* with the AI policy p in a proportion according to the extent of the similarity m.

Description

Information processing device, information processing method, and information processing program

The present invention relates to an information processing device, an information processing method, and an information processing program for outputting strategies suited to the strength of a user in a competitive game.

In recent years, AI technology has been actively used in computer battles in games such as Go and Shogi, and research into strong AI that can beat professional players is particularly active. One practice method is for humans to observe the behavior (battle tendencies) of such strong AI and learn optimal actions during the game.

However, users do not commonly practice against such a strong AI, precisely because it is too strong. For actual practice matches, it is considered desirable to play against an AI that matches the user's level.

Here, to create an AI with the same level of strength as the user, it is necessary to properly understand the user's strength and incorporate it into the AI. One possible method for creating an AI with the same level of strength as the user is to use machine learning based on the user's match data.

However, there is a problem in that a large amount of user match data is required to create an AI with sufficient accuracy to match the user's strength through machine learning using the user's match data. Therefore, the objective of the present invention is to create an AI with sufficient accuracy to match the user's strength without using a large amount of user match data.

In order to solve the above-mentioned problems, the present invention is characterized by comprising: an input unit that accepts input of a game state at a predetermined time; an effective move calculation unit that calculates a set of actions that the user can take in the game state at the predetermined time according to the game rules; a reference policy calculation unit that calculates, based on that set of actions and the user's past match data, a reference policy, which is a policy that maximizes the similarity to the user's past match data; a mixing unit that creates a policy by mixing the user's policy in the game state, which is output by an AI model trained to take a game state as input and output the user's policy in that game state, with the reference policy in a ratio according to the magnitude of the similarity of the reference policy; and an output processing unit that outputs the created policy.

According to the present invention, it is possible to construct an AI with sufficient accuracy that corresponds to the user's strength without using a large amount of the user's match data.

FIG. 1 is a diagram for explaining the terminology used in this embodiment.
FIG. 2 is a diagram for explaining an overview of the information processing device.
FIG. 3 is a diagram illustrating an example of the configuration of the information processing device.
FIG. 4 is a diagram showing an example of matrices indicating the state of tic-tac-toe.
FIG. 5 is a diagram showing an example of calculation of a set of effective moves by the effective move calculation unit in FIG. 3.
FIG. 6 is a diagram showing an example of the database of match data in FIG. 3.
FIG. 7 is a diagram showing an example of an extended database created by the extended database creation unit in FIG. 3.
FIG. 8 is a diagram showing an example of identification of a reference move by the reference move calculation unit in FIG. 3 and calculation of the similarity of the reference move.
FIG. 9 is a diagram showing an example of a reference policy created by the reference policy creation unit in FIG. 3.
FIG. 10 is a diagram showing an example of an AI policy output by the AI policy output unit in FIG. 3.
FIG. 11 is a diagram showing an example of mixing the reference policy and the AI policy by the mixing unit in FIG. 3.
FIG. 12 is a flowchart illustrating an example of a processing procedure of the information processing device.
FIG. 13 is a diagram illustrating a computer that executes a program.

Below, a form (embodiment) for carrying out the present invention will be described with reference to the drawings. The present invention is not limited to this embodiment.

[Terminology explanation]
First, the terms used in this embodiment will be explained with reference to FIG. 1. A game state s is the state of the game at a given time (for example, the current time), and is also simply called a "state." An action (move) a is an action that a user can execute in the game. A set of effective moves E is a set of actions a that a user can execute in a certain game state s according to the rules of the game. A policy p is information indicating, by a probability value, which action a a user will execute in a certain game state s.

[Overview]
Next, an overview of the information processing device of this embodiment will be described. The information processing device uses a small amount of a user's match data and an AI model to output a policy similar to that of the user.

In this embodiment, the user's match data is defined as "a pair (s, a) of a game state (board) s and an action a in that game state s." A database D is prepared, which is a collection of the user's match data. When outputting a policy in a certain game state, the information processing device searches the database D for data on a situation close to that game state. The information processing device then uses the user's actions shown in the retrieved data as a reference to output the probability (policy) p that the user will choose each action in that game state.

Here, the information processing device determines the policy for game state s0, for example, as follows:

For example, as shown in FIG. 2, the information processing device first determines a set E of effective moves in the game state s0 (calculating effective moves). Then, the information processing device creates data (a0, (s, a)) for all combinations that combine action a0 in the set E of effective moves with each data (s, a) in database D. The created combinations are called the extended database D' (creating extended database, see formula (1)).

Next, the information processing device searches the extended database D' for the combination (a0^*, (s, a)^*) that maximizes the similarity between the data (s0, a0) and the data (s, a). In other words, the information processing device searches, from among the effective moves in the game state s0, for the move that is most similar to a move in a state indicated by the user's past match data. The move a0^* found in this way is called the reference move. The similarity of the reference move at this time is defined as m (see formula (2)).

Next, the information processing device creates a reference policy p_a0^* that assigns the highest probability to selecting the reference move a0^* in the game state s0. The information processing device also inputs the game state s0 to the AI model and obtains the user's policy (AI policy p) output from the AI model. Then, the information processing device obtains a policy p' by mixing the reference policy p_a0^* with the AI policy p according to the magnitude of m (see formula (3)).
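Formulas (1) to (3) referred to above are not reproduced in this text. Based only on the surrounding prose, they can plausibly be written as the following sketch. The symbols sim (the similarity function) and w (a weight that increases with m) are notational assumptions introduced here, and the convex-combination form of the third line is one natural reading of "mixing in a proportion according to m", not a form confirmed by the source.

```latex
\begin{align}
D' &= \{\, (a_0, (s, a)) \mid a_0 \in E,\ (s, a) \in D \,\} \\
(a_0^*, (s, a)^*) &= \operatorname*{arg\,max}_{(a_0, (s, a)) \in D'}
    \mathrm{sim}\bigl((s_0, a_0), (s, a)\bigr),
\qquad
m = \mathrm{sim}\bigl((s_0, a_0^*), (s, a)^*\bigr) \\
p' &= w(m)\, p_{a_0^*} + \bigl(1 - w(m)\bigr)\, p,
\qquad 0 \le w(m) \le 1
\end{align}
```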

By doing this, the information processing device can output a policy p' that resembles the moves the user would make in game state s0, without using a large amount of match data. The device that plays the competitive game against the user then determines its next move in game state s0 based on the output policy p'. This allows the user to play against an opponent that matches their own strength.

[Configuration example]
Next, a configuration example of the information processing device 10 will be described with reference to FIG. 3. The information processing device 10 includes, for example, an input/output unit 11, a storage unit 12, and a control unit 13.

The input/output unit 11 is an interface that handles the input and output of various data. The input/output unit 11, for example, accepts input of a game state s0 at a given point in time (e.g., the present). For example, it accepts input of a game state s0 immediately after a user takes an action in a competitive game.

The storage unit 12 stores data, programs, etc. that are referenced when the control unit 13 executes various processes. The storage unit 12 is realized by a semiconductor memory element such as a RAM (Random Access Memory) or a flash memory, or a storage device such as a hard disk or an optical disk.

For example, the storage unit 12 stores the user's current game state s0 and other data received by the input/output unit 11. The storage unit 12 also stores the database D of the user's match data, the parameters of the AI model, and the like. The AI model is a model that is trained to take the game state as input and output the probability that the user will perform each action in that game state (the user's policy).

The control unit 13 is responsible for controlling the entire information processing device 10. The functions of the control unit 13 are realized, for example, by a CPU (Central Processing Unit) executing a program stored in the storage unit 12.

The control unit 13 includes, for example, a state input unit 130, an effective move calculation unit 131, a reference policy calculation unit 132, an AI policy output unit 136, a mixing unit 137, and an output processing unit 138.

The various components of the control unit 13 will be described in detail below with reference to the drawings. Note that in the following, tic-tac-toe, shown in FIG. 4, is used as an example of the game to be processed by the information processing device 10.

Each square in this tic-tac-toe game is assigned a number as shown by reference numeral 401 in FIG. 4. Based on the numbers shown by reference numeral 401, the information processing device 10 represents the game state of tic-tac-toe, for example, as a combination of three matrices as shown by reference numeral 402. The first matrix indicates the squares that contain a circle. The second matrix indicates the squares that contain an x. The third matrix indicates the squares that are empty.
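The three-matrix representation can be sketched as follows. This is a minimal illustration, not the encoding of FIG. 4 itself: the function name `encode_board`, the use of the characters "o", "x", and " " for cell contents, and the row-major numbering of squares 1 to 9 are all assumptions made here.

```python
from typing import List, Sequence

Matrix = List[List[int]]


def encode_board(cells: Sequence[str]) -> List[Matrix]:
    """Encode 9 cells ('o', 'x' or ' ') as three 3x3 binary matrices.

    Assumption: cells are listed in row-major order, so cells[0] is
    square 1 and cells[8] is square 9.
    """
    if len(cells) != 9:
        raise ValueError("a tic-tac-toe board has exactly 9 squares")

    def plane(mark: str) -> Matrix:
        # 1 where the square holds `mark`, 0 elsewhere.
        return [[1 if cells[3 * r + c] == mark else 0 for c in range(3)]
                for r in range(3)]

    # Order: circles, crosses, empty squares.
    return [plane("o"), plane("x"), plane(" ")]


# Hypothetical position: a circle on square 1 and an x on square 5.
board = ["o", " ", " ", " ", "x", " ", " ", " ", " "]
circles, crosses, empty = encode_board(board)
```

Keeping one binary matrix per kind of content makes the state easy to feed to a neural network as three input channels.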

Returning to the explanation of FIG. 3, the state input unit 130 accepts input of the game state s0. In addition, the effective move calculation unit 131 calculates a set E of moves (effective moves) that the user can take in the game state s0.

For example, when the effective move calculation unit 131 receives an input of the game state (state) s0 indicated by reference numeral 501 in FIG. 5, it calculates and outputs a set E of effective moves indicated by reference numeral 502.
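For tic-tac-toe, the set E of effective moves is simply the set of empty squares. The sketch below assumes a flat list of nine cells in row-major order with " " marking an empty square; the function name `effective_moves` is hypothetical.

```python
from typing import Sequence, Set


def effective_moves(cells: Sequence[str]) -> Set[int]:
    """Return the set E of squares (numbered 1 to 9) that are still empty.

    Assumption: in tic-tac-toe the rules allow a mark on any empty square,
    so the effective moves are exactly the empty squares.
    """
    return {i + 1 for i, mark in enumerate(cells) if mark == " "}


# Hypothetical game state s0: a circle on square 1 and an x on square 5.
s0 = ["o", " ", " ", " ", "x", " ", " ", " ", " "]
E = effective_moves(s0)
```

For other games, this function would be replaced by that game's own legal-move generator.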

Returning to the explanation of FIG. 3, the reference policy calculation unit 132 calculates a reference policy, which is a policy that maximizes the similarity to the user's past match data, based on the set E of effective moves calculated by the effective move calculation unit 131 and the user's past match data (match data database D).

The reference policy calculation unit 132 includes, for example, an extended database creation unit 133, a reference move calculation unit 134, and a reference policy creation unit 135. The extended database creation unit 133 creates an extended database D' that combines each action a0 in the set E of effective moves with each data item (s, a) in the database D of the match data.

For example, consider the case where the set E of effective moves is the set of moves shown by reference numeral 502 in FIG. 5, and the match data stored in database D is the match data shown in FIG. 6. In this case, the extended database creation unit 133 creates the extended database D' shown by reference numeral 701 by enumerating all combinations of the set E of effective moves and the match data in database D, as shown in FIG. 7. The extended database creation unit 133 then outputs the created extended database D'.
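Building the extended database D' amounts to taking the Cartesian product of E and D. The following sketch assumes game states are stored as tuples of nine cell characters; the function name `build_extended_database` and the sample records are hypothetical and are not the contents of FIG. 6 or FIG. 7.

```python
from itertools import product
from typing import List, Sequence, Tuple

State = Tuple[str, ...]
Record = Tuple[State, int]          # one item of match data: (s, a)
ExtendedItem = Tuple[int, Record]   # one item of D': (a0, (s, a))


def build_extended_database(E: Sequence[int],
                            D: Sequence[Record]) -> List[ExtendedItem]:
    """Pair every effective move a0 in E with every record (s, a) in D."""
    return list(product(E, D))


# Hypothetical inputs: three effective moves and two past records.
E = [2, 3, 4]
D = [
    (("o", " ", " ", " ", " ", " ", " ", " ", " "), 5),
    ((" ", " ", " ", " ", "x", " ", " ", " ", " "), 1),
]
D_ext = build_extended_database(E, D)
```

The size of D' is |E| times |D|, so the later similarity search grows linearly with the amount of match data.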

Returning to the explanation of FIG. 3, the reference move calculation unit 134 uses the extended database D' to identify an effective move (reference move) from the set E of effective moves in the game state s0 that has the highest similarity to the move indicated by the user's match data. The reference move calculation unit 134 then outputs the identified reference move and the similarity m of that reference move.

For example, as shown in FIG. 8, when the reference move calculation unit 134 receives input of the game state s0 and the extended database D', it exhaustively calculates, for every combination, the similarity between the pair (game state s0, effective move) and the match data (game state, move).

For example, the reference move calculation unit 134 vectorizes (game state s0, effective move) and match data (game state, move) using a neural network, and calculates the similarity between these vectors. For example, a mechanism is provided that uses a variational encoder to convert a pair (s, a) of game state and action into a vector. Then, using this mechanism, the reference move calculation unit 134 converts (s, a) into vector v1 and (s0, a0) into vector v2. Next, the reference move calculation unit 134 calculates the cosine similarity between vector v1 and vector v2.

Then, the reference move calculation unit 134 identifies the effective move that maximizes the calculated similarity. For example, among the combinations of (game state s0, effective move) shown in FIG. 8 and (game state, move) shown in the match data, the combination that maximizes the similarity is the combination shown by reference symbol 801. Therefore, the reference move calculation unit 134 identifies the effective move a0=3 in the combination shown by reference symbol 801 as the reference move. Also, the similarity m of a0=3 is 0.94. Therefore, the reference move calculation unit 134 outputs a0=3 and similarity m=0.94.
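The exhaustive search for the reference move can be sketched as below. The source vectorizes each pair with a neural network; that model is not available here, so the `embed` function is a hand-made stand-in (board cells as +1, -1, 0 plus a one-hot action), clearly not the method of the source. The names `embed`, `cosine`, and `reference_move` and the sample data are hypothetical.

```python
import math
from typing import List, Sequence, Tuple

State = Sequence[str]
Record = Tuple[State, int]


def embed(state: State, action: int) -> List[float]:
    """Stand-in for the learned encoder: board features plus one-hot action."""
    vec = [1.0 if m == "o" else -1.0 if m == "x" else 0.0 for m in state]
    vec += [1.0 if i + 1 == action else 0.0 for i in range(len(state))]
    return vec


def cosine(u: Sequence[float], v: Sequence[float]) -> float:
    """Cosine similarity; defined as 0.0 if either vector is all zeros."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def reference_move(s0: State, E: Sequence[int],
                   D: Sequence[Record]) -> Tuple[int, float]:
    """Return (a0*, m): the effective move most similar to any past record."""
    best_move, best_sim = None, float("-inf")
    for a0 in sorted(E):
        v2 = embed(s0, a0)
        for s, a in D:
            sim = cosine(embed(s, a), v2)
            if sim > best_sim:
                best_move, best_sim = a0, sim
    if best_move is None:
        raise ValueError("E and D must both be non-empty")
    return best_move, best_sim


# Hypothetical data: the user once played square 3 in exactly this state.
s0 = ("o", " ", " ", " ", "x", " ", " ", " ", " ")
D = [(s0, 3), (("x", "o", " ", " ", " ", " ", " ", " ", " "), 7)]
ref_move, m = reference_move(s0, [2, 3, 4, 6, 7, 8, 9], D)
```

With a learned encoder in place of `embed`, positions that differ on the surface but are strategically alike could also score highly, which a raw cell comparison cannot capture.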

Returning to the explanation of FIG. 3, the reference policy creation unit 135 creates a reference policy from the reference move identified by the reference move calculation unit 134. For example, the reference policy creation unit 135 creates a policy in which the probability of selecting the reference move a0^* in the game state s0 is set higher than the probability of selecting other moves. The created policy is then designated as the reference policy p_a0^*.

For example, as shown in FIG. 9, the reference policy creation unit 135 encodes the reference move a0^*=3 as a one-hot vector, thereby creating the vector indicated by reference numeral 901 (a vector in which, of the squares 1 to 9, the probability of square 3 is set to 1.0, and the probabilities of the other squares are set to 0.0).
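The one-hot reference policy described here is straightforward to sketch. The function name `one_hot_policy` is hypothetical; squares are numbered from 1 as in the text.

```python
from typing import List


def one_hot_policy(move: int, n_squares: int = 9) -> List[float]:
    """Return a policy with probability 1.0 on `move` and 0.0 elsewhere."""
    if not 1 <= move <= n_squares:
        raise ValueError("move must be a square number between 1 and n_squares")
    return [1.0 if i + 1 == move else 0.0 for i in range(n_squares)]


# Reference move a0* = 3, as in the example in the text.
p_ref = one_hot_policy(3)
```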

Returning to the explanation of FIG. 3, the AI policy output unit 136 outputs the user's policy in the game state s0 output from the AI model. For example, as shown in FIG. 10, the AI policy output unit 136 inputs the game state s0 to the AI model, and outputs the user's policy (AI policy p shown with reference symbol 1001) output from the AI model.

Returning to the explanation of FIG. 3, the mixing unit 137 creates a policy by mixing the reference policy created by the reference policy creation unit 135 and the AI policy p output by the AI policy output unit 136.

For example, as shown in FIG. 11, the mixing unit 137 mixes the reference policy p_a0^* and the AI policy p by weighting them using an exponential function of the similarity m (see equation (4)), and creates and outputs the policy p' shown by reference numeral 1101.

As a result, when mixing the reference policy p_a0^* and the AI policy p, the mixing unit 137 can increase the ratio of the reference policy p_a0^* to the AI policy p the more similar the reference policy p_a0^* is to a move shown in the user's past match data. The mixing unit 137 can thereby create a policy p' that resembles the user's past moves. After that, the output processing unit 138 outputs the policy p' created by the mixing unit 137 via the input/output unit 11.
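The mixing step can be sketched as a convex combination whose weight grows with m. Equation (4) is not reproduced in this text, so the exponential weight below, w = (e^(km) - 1) / (e^k - 1) with a sharpness parameter k, is purely a hypothetical choice that satisfies the stated property (larger m gives the reference policy a larger share). The names `mixing_weight` and `mix_policies` and the sample AI policy are likewise assumptions.

```python
import math
from typing import List, Sequence


def mixing_weight(m: float, k: float = 5.0) -> float:
    """Hypothetical exponential weight in [0, 1], increasing in m.

    Not the source's equation (4); m is clamped to [0, 1] first.
    """
    m = min(max(m, 0.0), 1.0)
    return math.expm1(k * m) / math.expm1(k)


def mix_policies(p_ref: Sequence[float], p_ai: Sequence[float],
                 m: float, k: float = 5.0) -> List[float]:
    """Return p' = w * p_ref + (1 - w) * p_ai, element by element."""
    if len(p_ref) != len(p_ai):
        raise ValueError("policies must have the same length")
    w = mixing_weight(m, k)
    return [w * r + (1.0 - w) * q for r, q in zip(p_ref, p_ai)]


# Reference policy for a0* = 3 and a made-up AI policy over squares 1 to 9.
p_ref = [0.0, 0.0, 1.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0]
p_ai = [0.0, 0.1, 0.2, 0.1, 0.0, 0.2, 0.1, 0.2, 0.1]
p_mixed = mix_policies(p_ref, p_ai, m=0.94)
```

Because both inputs sum to 1 and the weights sum to 1, the mixed policy is again a valid probability distribution.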

[Example of processing procedure]
Next, an example of a processing procedure executed by the information processing device 10 will be described with reference to FIG. 12. First, the state input unit 130 accepts input of a game state s0 (S1). After that, the effective move calculation unit 131 calculates a set E of effective moves in the game state s0 input in S1 (S2).

Next, the extended database creation unit 133 creates an extended database D' by combining each action a0 in the set E of effective moves calculated in S2 with each data item (s, a) in the database D of the match data (S3).

After S3, the reference move calculation unit 134 identifies, from among the combinations in the extended database D' created in S3, the pair of game state s0 and effective move that has the highest similarity to a pair of game state and move indicated by the user's match data. The reference move calculation unit 134 then outputs the effective move (reference move) of the identified pair and the similarity m of the pair (S4: calculation of reference move). After that, the reference policy creation unit 135 creates a reference policy from the reference move calculated in S4 (S5).

The AI policy output unit 136 inputs the game state s0 to the AI model and outputs the policy (AI policy) output by the AI model (S6). Then, the mixing unit 137 mixes the reference policy created in S5 and the AI policy output in S6 in a ratio according to the similarity m (S7: mixing of reference policy and AI policy). After that, the output processing unit 138 outputs the policy mixed in S7 (S8).
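Steps S1 to S8 can be tied together in one function, as in the sketch below. The similarity function, the AI model, and the weight function are passed in as parameters because the source does not fix them in this text. The stand-ins used in the example (`toy_similarity`, `uniform_ai_model`, and an identity weight) are hypothetical, as is the fallback to the AI policy when D is empty.

```python
from typing import Callable, List, Sequence, Tuple

State = Tuple[str, ...]
Record = Tuple[State, int]
Pair = Tuple[State, int]


def imitate_policy(s0: State, D: Sequence[Record],
                   similarity: Callable[[Pair, Record], float],
                   ai_model: Callable[[State], List[float]],
                   weight: Callable[[float], float]) -> List[float]:
    """Return the mixed policy p' for game state s0 (S1 is the call itself)."""
    E = [i + 1 for i, mark in enumerate(s0) if mark == " "]          # S2
    if not E:
        raise ValueError("no effective moves in this state")
    p_ai = ai_model(s0)                                              # S6
    if not D:
        return list(p_ai)  # assumption: no match data, use the AI policy only
    D_ext = [(a0, rec) for a0 in E for rec in D]                     # S3
    scored = [(a0, similarity((s0, a0), rec)) for a0, rec in D_ext]
    a_ref, m = max(scored, key=lambda pair: pair[1])                 # S4
    p_ref = [1.0 if i + 1 == a_ref else 0.0 for i in range(len(s0))]  # S5
    w = weight(m)                                                    # S7
    return [w * r + (1.0 - w) * q for r, q in zip(p_ref, p_ai)]      # S8


def toy_similarity(query: Pair, record: Record) -> float:
    """Stand-in similarity: share of equal cells, halved if actions differ."""
    (s_q, a_q), (s_r, a_r) = query, record
    same = sum(x == y for x, y in zip(s_q, s_r)) / len(s_q)
    return same if a_q == a_r else 0.5 * same


def uniform_ai_model(state: State) -> List[float]:
    """Stand-in AI model: uniform over empty squares (assumes at least one)."""
    n = sum(1 for mark in state if mark == " ")
    return [1.0 / n if mark == " " else 0.0 for mark in state]


s0 = ("o", " ", " ", " ", "x", " ", " ", " ", " ")
D = [(("o", " ", " ", " ", " ", " ", " ", " ", " "), 3)]
policy = imitate_policy(s0, D, toy_similarity, uniform_ai_model, lambda m: m)
```

Injecting the three components keeps the control flow of FIG. 12 separate from the choice of encoder, model, and weighting.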

In this way, when selecting a user's action in game state s0, the information processing device 10 searches the user's past match data for similar situations, and selects the user's action (reference move) by referring to the user's action in the most similar situation. The information processing device 10 then mixes the reference policy based on the selected action (reference move) with the policy (AI policy) output by the AI model, and determines and outputs the probability (policy) that the user will select each action. This allows the information processing device 10 to output a policy with sufficient accuracy corresponding to the user's strength, without using a large amount of the user's match data.

[System configuration, etc.]
In addition, each component of each part shown in the figure is a functional concept, and does not necessarily have to be physically configured as shown in the figure. In other words, the specific form of distribution and integration of each device is not limited to that shown in the figure, and all or a part of it can be functionally or physically distributed and integrated in any unit depending on various loads, usage conditions, etc. Furthermore, each processing function performed by each device can be realized in whole or in any part by a CPU and a program executed by the CPU, or can be realized as hardware using wired logic.

Furthermore, among the processes described in the above embodiments, all or part of the processes described as being performed automatically can be performed manually, or all or part of the processes described as being performed manually can be performed automatically using known methods. In addition, the information including the processing procedures, control procedures, specific names, various data and parameters shown in the above documents and drawings can be changed as desired unless otherwise specified.

[Program]
The information processing device 10 can be implemented by installing a program (information processing program) as package software or online software on a desired computer. For example, by causing an information processing device to execute the above program, that device can be made to function as the information processing device 10. The information processing device referred to here includes mobile communication terminals such as smartphones, mobile phones, and PHS (Personal Handyphone System) terminals, as well as terminals such as PDAs (Personal Digital Assistants).

FIG. 13 is a diagram showing an example of a computer that executes an information processing program. The computer 1000 has, for example, a memory 1010 and a CPU 1020. The computer 1000 also has a hard disk drive interface 1030, a disk drive interface 1040, a serial port interface 1050, a video adapter 1060, and a network interface 1070. Each of these components is connected by a bus 1080.

The memory 1010 includes a ROM (Read Only Memory) 1011 and a RAM (Random Access Memory) 1012. The ROM 1011 stores a boot program such as a BIOS (Basic Input Output System). The hard disk drive interface 1030 is connected to a hard disk drive 1090. The disk drive interface 1040 is connected to a disk drive 1100. A removable storage medium such as a magnetic disk or optical disk is inserted into the disk drive 1100. The serial port interface 1050 is connected to a mouse 1110 and a keyboard 1120, for example. The video adapter 1060 is connected to a display 1130, for example.

The hard disk drive 1090 stores, for example, an OS 1091, an application program 1092, a program module 1093, and program data 1094. That is, the programs that define each process executed by the information processing device 10 are implemented as program modules 1093 in which computer-executable code is written. The program modules 1093 are stored, for example, in the hard disk drive 1090. For example, a program module 1093 for executing processes similar to the functional configuration of the information processing device 10 is stored in the hard disk drive 1090. The hard disk drive 1090 may be replaced by an SSD (Solid State Drive).

The data used in the processing of the above-described embodiment is stored as program data 1094, for example, in memory 1010 or hard disk drive 1090. Then, the CPU 1020 reads the program module 1093 or program data 1094 stored in memory 1010 or hard disk drive 1090 into RAM 1012 as necessary and executes it.

The program module 1093 and program data 1094 are not limited to being stored in the hard disk drive 1090, but may be stored in, for example, a removable storage medium and read by the CPU 1020 via the disk drive 1100 or the like. Alternatively, the program module 1093 and program data 1094 may be stored in another computer connected via a network (such as a LAN (Local Area Network), WAN (Wide Area Network)). The program module 1093 and program data 1094 may then be read by the CPU 1020 from the other computer via the network interface 1070.

REFERENCE SIGNS LIST
10 Information processing device
11 Input/output unit
12 Storage unit
13 Control unit
130 State input unit
131 Effective move calculation unit
132 Reference policy calculation unit
133 Extended database creation unit
134 Reference move calculation unit
135 Reference policy creation unit
136 AI policy output unit
137 Mixing unit
138 Output processing unit

Claims

1. An information processing device comprising:
an input unit that receives an input of a game state at a predetermined time point;
an effective move calculation unit that calculates a set of actions that a user can take according to the rules of the game in the game state at the predetermined time point;
a reference policy calculation unit that calculates a reference policy that is a policy that maximizes a similarity to the user's past match data, based on the set of actions that the user can take under the game rules in the game state and the user's past match data;
a mixing unit that creates a policy by mixing the user's policy in the game state, the policy being output by an AI model trained to take a game state as input and output the user's policy in that game state, with the reference policy in a ratio according to the magnitude of the similarity of the reference policy; and
an output processing unit that outputs the created policy.
2. The information processing device according to claim 1, wherein the policy is information indicating a probability that the user will select each action in the game state.
3. The information processing device according to claim 1, wherein the reference policy calculation unit calculates the similarity as a similarity between a pair of the game state at the predetermined time point and an action that the user can take in that game state, and a pair of a game state shown in the user's past match data and an action taken by the user in that game state.
4. The information processing device according to claim 3, wherein the reference policy calculation unit creates a policy in which the probability that the user will choose the action having the greatest similarity, among the actions that the user can take in the game state, is set higher than the probability that the user will choose any other action, and sets the created policy as the reference policy.
5. The information processing device according to claim 1, wherein the mixing unit mixes the user's policy output by the AI model and the reference policy such that the greater the similarity, the greater the ratio of the reference policy to the user's policy output by the AI model.
6. An information processing method executed by an information processing device, the method comprising:
a step of receiving an input of a game state at a predetermined time point;
a step of calculating a set of actions that the user can take according to the game rules in the game state at the predetermined time point;
a step of calculating a reference policy that maximizes a similarity to the user's past match data, based on the set of actions that the user can take in accordance with the game rules in the game state and the user's past match data;
a step of creating a policy by mixing the user's policy in the game state, which is output by an AI model trained to take a game state as input and output the user's policy in that game state, with the reference policy in a ratio according to the magnitude of the similarity of the reference policy; and
a step of outputting the created policy.
7. An information processing program for causing a computer to execute:
a step of receiving an input of a game state at a predetermined time point;
a step of calculating a set of actions that the user can take according to the game rules in the game state at the predetermined time point;
a step of calculating a reference policy that maximizes a similarity to the user's past match data, based on the set of actions that the user can take in accordance with the game rules in the game state and the user's past match data;
a step of creating a policy by mixing the user's policy in the game state, which is output by an AI model trained to take a game state as input and output the user's policy in that game state, with the reference policy in a ratio according to the magnitude of the similarity of the reference policy; and
a step of outputting the created policy.