CN110772794B - Intelligent game processing method, device, equipment and storage medium - Google Patents
- Publication number
- CN110772794B (application CN201910966855.2A)
- Authority
- CN
- China
- Prior art keywords
- game
- action
- game state
- state data
- network
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Active
Classifications
-
- A—HUMAN NECESSITIES
- A63—SPORTS; GAMES; AMUSEMENTS
- A63F—CARD, BOARD, OR ROULETTE GAMES; INDOOR GAMES USING SMALL MOVING PLAYING BODIES; VIDEO GAMES; GAMES NOT OTHERWISE PROVIDED FOR
- A63F13/00—Video games, i.e. games using an electronically generated display having two or more dimensions
- A63F13/60—Generating or modifying game content before or while executing the game program, e.g. authoring tools specially adapted for game development or game-integrated level editor
- A63F13/67—Generating or modifying game content before or while executing the game program, e.g. authoring tools specially adapted for game development or game-integrated level editor adaptively or by learning from player actions, e.g. skill level adjustment or by storing successful combat sequences for re-use
-
- A—HUMAN NECESSITIES
- A63—SPORTS; GAMES; AMUSEMENTS
- A63F—CARD, BOARD, OR ROULETTE GAMES; INDOOR GAMES USING SMALL MOVING PLAYING BODIES; VIDEO GAMES; GAMES NOT OTHERWISE PROVIDED FOR
- A63F2300/00—Features of games using an electronically generated display having two or more dimensions, e.g. on a television screen, showing representations related to the game
- A63F2300/60—Methods for processing data by generating or executing the game program
- A63F2300/6027—Methods for processing data by generating or executing the game program using adaptive systems learning from user actions, e.g. for skill level adjustment
-
- Y—GENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
- Y02—TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
- Y02T—CLIMATE CHANGE MITIGATION TECHNOLOGIES RELATED TO TRANSPORTATION
- Y02T10/00—Road transport of goods or passengers
- Y02T10/10—Internal combustion engine [ICE] based vehicles
- Y02T10/40—Engine management systems
Abstract
The invention discloses an intelligent game processing method, which comprises the following steps: acquiring game state data of the current player of an intelligent game and simulating the intelligent game; searching, according to a preset Monte Carlo search tree, for the action to be executed for the current player's game state data until the whole round of the game is finished, wherein the game data generated by each action found by the search is used as the current player's game state data in the next search; storing the game information generated after each action executed in the whole round of the game into a database; and optimizing the loss functions of a pre-constructed value network and a pre-constructed action probability network by a back propagation method, using sample data drawn from the database, to obtain an optimal value network and an optimal action probability network. Embodiments of the invention also disclose an intelligent game processing apparatus, device and storage medium. The embodiments address the problems of inaccurate action decisions of non-player characters and reduced game quality in games with incomplete information.
Description
Technical Field
The present invention relates to the field of computer technologies, and in particular, to a method, an apparatus, a device, and a storage medium for processing an intelligent game.
Background
In a turn-based game, players act in turns and can only operate when their own turn arrives. Such games are highly engaging and have become one of the main forms of casual entertainment. The action decisions of non-player characters in a turn-based game are often an important factor in game quality and user experience. Traditional game artificial intelligence is implemented with a state machine or a behavior tree, and the corresponding operations of the non-player character are triggered by different states.
With the success of AlphaGo at the game of Go, reinforcement learning has been applied to games more and more widely. However, board games such as Go have the characteristic of complete information: the states of both the opponent and one's own side are fully disclosed, with no hidden information or random factors. In a game with incomplete information, a player cannot fully know the opponent's information, the effects of actions are random, and many uncertain factors exist, which makes reinforcement learning difficult to train; as a result, the action decisions of non-player characters are inaccurate and game quality is reduced.
Disclosure of Invention
The embodiments of the present invention provide an intelligent game processing method, apparatus, device and storage medium, which can effectively solve the prior-art problems of inaccurate action decisions of non-player characters and reduced game quality when facing games with incomplete information.
An embodiment of the present invention provides an intelligent game processing method, including:
acquiring game state data of a current player of an intelligent game, and simulating the intelligent game;
searching actions correspondingly executed by the game state data of the current player according to a preset Monte Carlo search tree until the whole round of game is finished; the game data generated by the corresponding action obtained by each execution of the search are taken as the game state data of the current player in the next search;
storing game information generated correspondingly after each action executed by the whole round of game into a database;
optimizing a pre-constructed value network and a loss value function of a pre-constructed action probability network by adopting a back propagation method according to sampling data acquired from the database to obtain an optimal value network and an optimal action probability network;
wherein the game information includes: the game state data of the current player, the probability label of executing each action corresponding to the game state data of the current player, the value label of the current game state and the game result when the whole round of game is finished;
the loss value functions of the pre-constructed value network and the pre-constructed action probability network are specifically as follows:
L = αL_v + βL_p + ηwᵀw;
wherein L_v is the loss function of the value network, α is the coefficient of the loss function of the value network, L_p is the loss function of the action probability network, β is the coefficient of the loss function of the action probability network, wᵀw is the regularization penalty term, and η is the coefficient of the regularization penalty term.
As an improvement of the above solution, the searching, according to the preset Monte Carlo search tree, for the action to be executed for the current player's game state data until the whole round of the game is finished, wherein the game data generated by each found action is used as the current player's game state data in the next search, specifically comprises:
determining a node corresponding to the game state data of the current player;
searching at least one corresponding child node according to the node corresponding to the game state data of the current player;
and selecting the best child node from at least one corresponding child node as an action executed under the game state data of the current player.
As an improvement of the above solution, the selecting of the best child node from the at least one corresponding child node as the action executed under the current player's game state data specifically comprises:
selecting the child node with the largest number of visits as the optimal child node;
or,
selecting the optimal child node from the at least one corresponding child node according to a preset selection probability;
or,
selecting the child node with the maximum confidence value, wherein the confidence value is UCB = E(Q(v)) + c·√(2·ln N(v′) / N(v)), in which N(v) is the number of visits of the current child node, N(v′) is the number of visits of its parent node, and E(Q(v)) is the average benefit obtained by the current child node over the whole round of the game.
As an improvement of the above solution, the simulating the intelligent game specifically includes:
the visible information is kept unchanged, and the non-visible information is subjected to random generation operation.
As an improvement of the above solution, after searching for the action to be executed for the current player's game state data according to the preset Monte Carlo search tree until the whole round of the game is finished, and before saving the game information generated after each action executed in the whole round of the game into the database, the method further includes:
and carrying out back propagation on the nodes corresponding to the game state data of at least one current player to respectively calculate the profit value and the access times corresponding to each current game state.
As an improvement of the above solution, the method further includes:
if the action to be executed for the current player's game state data cannot be found according to the preset Monte Carlo search tree and the whole round of the game is not finished, the actions that may be executed for the current player's game state data are expanded through a preset neural network, and the probability of each possible action is recorded.
Another embodiment of the present invention correspondingly provides an intelligent game processing apparatus, including:
the acquisition module is used for acquiring game state data of a current player of the intelligent game and simulating the intelligent game;
the searching matching module is used for searching actions correspondingly executed by the game state data of the current player according to a preset Monte Carlo search tree until the whole round of game is finished; the game data generated by the corresponding action obtained by each execution of the search are taken as the game state data of the current player in the next search;
the storage module is used for storing game information generated correspondingly after each action executed by the whole round of game into the database;
the construction module is used for optimizing a pre-constructed value network and a loss value function of a pre-constructed action probability network by adopting a back propagation method according to the sampling data acquired from the database to acquire an optimal value network and an optimal action probability network;
wherein the game information includes: the game state data of the current player, the probability label of executing each action corresponding to the game state data of the current player, the value label of the current game state and the game result when the whole round of game is finished;
the loss value functions of the pre-constructed value network and the pre-constructed action probability network are specifically as follows:
L = αL_v + βL_p + ηwᵀw;
wherein L_v is the loss function of the value network, α is the coefficient of the loss function of the value network, L_p is the loss function of the action probability network, β is the coefficient of the loss function of the action probability network, wᵀw is the regularization penalty term, and η is the coefficient of the regularization penalty term.
Another embodiment of the present invention provides a smart game processing device, including a processor, a memory, and a computer program stored in the memory and configured to be executed by the processor, where the processor executes the computer program to implement the smart game processing method according to the embodiment of the present invention.
Another embodiment of the present invention provides a computer readable storage medium, where the computer readable storage medium includes a stored computer program, and when the computer program runs, the device where the computer readable storage medium is located is controlled to execute the intelligent game processing method described in the foregoing embodiment of the present invention.
Compared with the prior art, the intelligent game processing method, apparatus, device and storage medium disclosed by the embodiments of the invention acquire the game state data of the current player of an intelligent game and simulate the game; search, according to a preset Monte Carlo search tree, for the action to be executed for the current player's game state data, with the game data generated by each executed action used as the current game state when determining the action in the next search, until the whole round of the game is finished; save the game information generated after each action of the whole round of the game into a database; and optimize the loss functions of the preset value network and action probability network by a back propagation method using sample data obtained from the database, so as to obtain an optimal value network and an optimal action probability network. In a game with incomplete information, this simplifies the training process of reinforcement learning, makes the action decisions of non-player characters more accurate, improves game quality, enhances playability, and improves user experience.
Drawings
FIG. 1 is a flow chart of an intelligent game processing method according to an embodiment of the present invention;
FIG. 2 is a schematic diagram of a simulation flow of a preset Monte Carlo search tree according to an embodiment of the present invention;
FIG. 3 is a schematic diagram of an intelligent game processing device according to an embodiment of the present invention;
fig. 4 is a schematic structural diagram of an intelligent game processing device according to an embodiment of the present invention.
Detailed Description
The following clearly and completely describes the technical solutions in the embodiments of the present invention with reference to the accompanying drawings. Apparently, the described embodiments are only some rather than all of the embodiments of the present invention. All other embodiments obtained by those skilled in the art based on the embodiments of the present invention without creative effort shall fall within the protection scope of the present invention.
Referring to fig. 1, a flow chart of an intelligent game processing method according to an embodiment of the invention is shown.
The embodiment of the invention provides an intelligent game processing method, which comprises the following steps:
s10, acquiring game state data of a current player of the intelligent game, and simulating the intelligent game.
Wherein the game state data of the current player includes, for both players in the battle: class (sect) assignment, level, health points, magic points, equipment information, pet information, buff effects and debuff effects carried by the player, and the like. These data are combined into a vector as input.
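Purely for illustration (the patent does not prescribe a concrete encoding, so the field names and ordering below are assumptions), the combination of both players' state data into a single input vector can be sketched as:

```python
# Sketch: flatten the state of both players into one numeric input vector.
# Field names and their ordering are illustrative assumptions, not from the patent.
def state_to_vector(own: dict, opponent: dict) -> list:
    fields = ["sect", "level", "hp", "mp", "equipment_score", "pet_score",
              "buff_count", "debuff_count"]
    vec = []
    for player in (own, opponent):
        # missing fields default to 0 so the vector length stays fixed
        vec.extend(float(player.get(f, 0)) for f in fields)
    return vec

own = {"sect": 1, "level": 30, "hp": 850, "mp": 120}
opp = {"sect": 2, "level": 28, "hp": 790, "mp": 90}
v = state_to_vector(own, opp)  # 8 fields x 2 players = 16-dimensional vector
```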
In this embodiment, the game state data of the current player may be obtained from an established game data model. This is merely one implementation of this embodiment; the game state data of the current player may also be obtained in other ways.
S20, searching actions correspondingly executed by the game state data of the current player according to a preset Monte Carlo search tree until the whole round of game is finished; the game data generated by the corresponding action obtained by each time of searching is used as the game state data of the current player in the next searching.
Referring to fig. 2, the preset Monte Carlo search tree is constructed as follows: each node of the Monte Carlo search tree corresponds to a state in the game, and the child nodes of a node are the new states generated by taking each available action in that state, with one node per action. In this example, the actions available in each state include the player's different skills, the use of items, and the like. Because of the randomness of the game, taking the same action in the same state may produce different effects, while each child node corresponds only to the action taken.
Tree node attributes are defined in the Monte Carlo search tree; the node attributes include the index of the parent node, the indexes of all child nodes, the number of times the node has been visited, the action probability output by the pre-constructed action probability network for the current node, and the final benefit value of the game after this node, calculated according to the pre-constructed value network.
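As an illustrative sketch only, a tree node carrying the attributes listed above (parent reference, children, visit count, policy-network prior, and accumulated benefit) might look like this; the class and attribute names are assumptions, not from the patent:

```python
class TreeNode:
    """Monte Carlo tree node holding the attributes described above (sketch)."""
    def __init__(self, parent=None, prior=0.0):
        self.parent = parent      # reference to the parent node (None at the root)
        self.children = {}        # action -> child TreeNode
        self.visits = 0           # number of times this node has been visited
        self.prior = prior        # action probability from the policy network
        self.value_sum = 0.0      # accumulated benefit from the value network

    def mean_value(self) -> float:
        """Average benefit of this node; 0 before any visit."""
        return self.value_sum / self.visits if self.visits else 0.0

root = TreeNode()
root.children["attack"] = TreeNode(parent=root, prior=0.6)
```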
In this embodiment, within the simulation of one game, a new Monte Carlo tree may be created each time an action is to be performed, or the previously created Monte Carlo tree may be reused, in which case the game data generated by the action actually taken is regarded as the current player's game state data at the root node; this continues until the current game is finished, and a new Monte Carlo tree is created at the beginning of the next game.
Specifically, when the next action to be executed needs to be determined from the current player's game state data, the action is searched for through the preset Monte Carlo tree, and the game data generated by the found action is used as the current player's game state data for the next search, until the simulated intelligent game ends.
S30, saving the game information generated after each action executed in the whole round of the game into a database; wherein the game information includes: the game state data of the current player, the probability labels of each action executed for the current player, and the value labels of the current player's game state.
In this embodiment, the value label of the current game state records victory, defeat, and tie as 1, -1, and 0, respectively. The action probability labels may be determined, after the simulation ends, as the ratio of the number of times each child node was visited to the number of times its parent node was visited.
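A minimal sketch of how the labels described above could be computed: the win/lose/tie encoding follows the text (1, -1, 0), while the function and variable names are illustrative:

```python
def action_probability_labels(child_visits: dict) -> dict:
    """Probability label of each action = that child's visit count divided by
    the parent's total visits (i.e. the sum over all children), per S30."""
    total = sum(child_visits.values())
    return {action: n / total for action, n in child_visits.items()}

# Value labels for the final game result, as stated in the description.
OUTCOME_LABEL = {"win": 1, "lose": -1, "tie": 0}

labels = action_probability_labels({"attack": 60, "defend": 30, "item": 10})
```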
Specifically, the game information corresponding to each action is stored in the database; expanding the database increases the diversity of training samples, improves generalization capability, and accelerates training.
And S40, optimizing a loss value function of the pre-constructed value network and the pre-constructed action probability network by adopting a back propagation method according to the sampling data acquired from the database to obtain an optimal value network and an optimal action probability network.
The preset loss function of the value network and the action probability network is specifically: L = αL_v + βL_p + ηwᵀw, where L_v is the loss function of the value network with coefficient α, L_p is the loss function of the action probability network with coefficient β, and wᵀw is a regularization penalty term with coefficient η. The loss function of the value network is the mean square error between the value output by the network for a given game state and the actual value label of that state, L_v = (z − v)². The loss function of the action probability network is the cross entropy between the probabilities output by the network for a given game state and the action probability labels, L_p = −Σ_a π_a log p_a.
Specifically, gradient descent is performed through deep-learning back propagation until convergence, so as to obtain the optimal value network and the optimal action probability network.
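The combined loss L = αL_v + βL_p + ηwᵀw can be sketched in pure Python as below. The mean-square-error and cross-entropy forms follow the description above; the coefficient defaults and argument names are illustrative assumptions:

```python
import math

def combined_loss(z, v, pi, p, w, alpha=1.0, beta=1.0, eta=1e-4):
    """L = alpha*Lv + beta*Lp + eta*w^T w (sketch of the patent's loss).
    z: actual value label of the state, v: value-network output,
    pi: action probability labels, p: policy-network output probabilities,
    w: network weight vector."""
    lv = (z - v) ** 2                                            # value loss: MSE
    lp = -sum(pi_i * math.log(p_i) for pi_i, p_i in zip(pi, p))  # cross entropy
    reg = sum(wi * wi for wi in w)                               # L2 penalty w^T w
    return alpha * lv + beta * lp + eta * reg
```

In practice the gradient of this loss would be back-propagated through both networks until convergence; this sketch only shows the scalar value.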
In summary, game state data of the current player of an intelligent game is acquired and the game is simulated; the action to be executed for the current player's game state data is searched according to a preset Monte Carlo search tree, with the game data generated by each executed action used as the current game state for the next search, until the whole round of the game is finished; the game information generated after each action of the whole round is saved to a database; and the loss functions of the preset value network and action probability network are optimized by a back propagation method using sample data obtained from the database, thereby obtaining an optimal value network and an optimal action probability network.
As an improvement of the above solution, the searching, according to the preset Monte Carlo search tree, for the action to be executed for the current player's game state data until the whole round of the game is finished, wherein the game data generated by each found action is used as the current player's game state data in the next search, specifically comprises:
and determining a node corresponding to the game state data of the current player.
In this embodiment, the game state data of the current player is used as the root node of the preset monte carlo tree.
And searching at least one corresponding child node according to the node corresponding to the game state data of the current player.
In this embodiment, the child nodes are accessed layer by layer according to the root node.
And selecting the best child node from at least one corresponding child node as an action executed under the game state data of the current player.
And selecting the child node with the largest access times as the optimal child node.
Or selecting the best sub-node from at least one corresponding sub-node according to a preset selection probability.
Specifically, according to the preset selection probability, the optimal sub-node is selected from at least one corresponding sub-node, so that randomness is increased, and the diversity of training samples is increased.
Or, selecting the child node with the largest confidence value; wherein the confidence value is UCB = E(Q(v)) + c·√(2·ln N(v′) / N(v)), in which N(v) is the number of visits of the current child node, N(v′) is the number of visits of its parent node, and E(Q(v)) is the average benefit obtained by the current child node over the whole round of the game.
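A sketch of confidence-value selection, assuming the standard UCT form UCB = E(Q(v)) + c·√(2·ln N(v′)/N(v)); the exploration constant c and the dictionary-based node representation are assumptions for illustration:

```python
import math

def best_child(parent_visits: int, children: dict, c: float = 1.4) -> str:
    """Return the action whose child maximises the UCB confidence value."""
    def ucb(node):
        if node["visits"] == 0:
            return float("inf")   # unvisited children are explored first
        exploit = node["value_sum"] / node["visits"]      # E(Q(v)): average benefit
        explore = c * math.sqrt(2.0 * math.log(parent_visits) / node["visits"])
        return exploit + explore
    return max(children, key=lambda a: ucb(children[a]))

children = {"attack": {"visits": 10, "value_sum": 7.0},
            "defend": {"visits": 2,  "value_sum": 1.8}}
action = best_child(12, children)  # the rarely tried child gets an exploration bonus
```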
As an improvement of the above solution, the simulating the intelligent game specifically includes:
the visible information is kept unchanged, and the non-visible information is subjected to random generation operation.
Illustratively, each round of simulation requires regenerating the game environment: the player's own state and the parts of the opponent's state that the player can see remain unchanged, while the parts of the opponent's state that the player cannot see are randomly regenerated (as shown in fig. 2, the hidden part is re-randomized). In this example, the items the opponent uses, certain special buffs, and certain special skills that can be used are not visible to the player's side. Each new round of simulation re-randomizes this non-visible information, so that the unknown part of the state is generated afresh each time.
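The re-randomization of non-visible information described above (sometimes called determinization in incomplete-information search) might be sketched as follows; the hidden fields and their domains are invented purely for illustration:

```python
import random

def determinize(visible: dict, hidden_domains: dict, rng=None) -> dict:
    """One simulated 'deal' of the opponent's hidden information.
    The visible part is copied unchanged; each hidden field is drawn at
    random from its domain of plausible values."""
    rng = rng or random.Random()
    state = dict(visible)                      # visible information kept as-is
    for field, domain in hidden_domains.items():
        state[field] = rng.choice(domain)      # hidden information regenerated
    return state

visible = {"opp_level": 28, "opp_hp": 790}
hidden = {"opp_item": ["potion", "bomb", "none"],
          "opp_special_skill": ["heal", "stun"]}
deal = determinize(visible, hidden, random.Random(42))
```

Calling `determinize` once per simulation round gives each search a fresh random completion of the unknown state, as the description requires.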
As an improvement of the above solution, after searching for the action to be executed for the current player's game state data according to the preset Monte Carlo search tree until the whole round of the game is finished, and before saving the game information corresponding to each executed action into the database, the method further includes:
and carrying out back propagation on the nodes corresponding to the game state data of at least one current player to respectively calculate the profit value and the access times corresponding to each current game state.
In this embodiment, back propagation proceeds from the current child node to the root node: according to the final game result (win, loss, or tie) of the last player game state data, the benefit value and visit count stored in each node along the path are updated. When the benefit value is calculated, a moving-average method is used to update the mean benefit iteratively; the higher the calculated benefit value, the greater the potential benefit of that player game state.
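The moving-average back propagation described above can be sketched as below; the incremental-mean update is the standard way to maintain an average benefit without storing every reward. The dictionary-based node representation is an assumption:

```python
def backpropagate(node, reward: float) -> None:
    """Propagate the final result back to the root, updating the visit count
    and the running-average benefit of every node on the path (sketch)."""
    while node is not None:
        node["visits"] += 1
        # incremental (moving-average) update of the mean benefit:
        # mean <- mean + (reward - mean) / visits
        node["mean_value"] += (reward - node["mean_value"]) / node["visits"]
        node = node["parent"]

root = {"visits": 0, "mean_value": 0.0, "parent": None}
leaf = {"visits": 0, "mean_value": 0.0, "parent": root}
backpropagate(leaf, 1.0)   # a win (+1) is propagated up to the root
```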
As an improvement of the above solution, the method further includes:
if the action to be executed for the current player's game state data cannot be found according to the preset Monte Carlo search tree and the whole round of the game is not finished, the actions that may be executed for the current player's game state data are expanded through a preset neural network, and the probability of each possible action is recorded.
In this embodiment, the current player's game state data is input into the preset neural network to obtain the value of that state and the execution probability of each available action; the current node is then expanded into child nodes, one per available action, and the probability value is recorded in each child node. The child nodes cover all actions that can be taken in the current game state, including casting available skills and using items, and the probability corresponding to each action is recorded during expansion.
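A sketch of the expansion step described above: a stand-in policy callable (not a real network) proposes a probability for each legal action, and one child node is created per action with that prior recorded:

```python
def expand(node: dict, state, policy_net) -> None:
    """Expand a leaf: the policy network proposes a probability for each
    available action; one child node is created per action, storing the
    probability as its prior. `policy_net` is a stand-in callable here."""
    priors = policy_net(state)                 # {action: probability}
    for action, prob in priors.items():
        node["children"][action] = {"prior": prob, "visits": 0,
                                    "value_sum": 0.0, "children": {}}

# Illustrative stand-in for the pre-constructed action probability network.
fake_policy = lambda s: {"skill_a": 0.5, "skill_b": 0.3, "use_item": 0.2}

root = {"children": {}, "visits": 0, "value_sum": 0.0}
expand(root, state=None, policy_net=fake_policy)
```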
Referring to fig. 3, a schematic structural diagram of an intelligent game processing device according to an embodiment of the invention is shown.
The embodiment of the invention correspondingly provides an intelligent game processing device, which comprises:
and the acquisition module 10 is used for acquiring the game state data of the current player of the intelligent game and simulating the intelligent game.
The search matching module 20 is configured to search actions corresponding to the game status data of the current player according to a preset monte carlo search tree until the whole round of game is completed; the game data generated by the corresponding action obtained by each time of searching is used as the game state data of the current player in the next searching.
And the storage module 30 is used for storing the game information generated correspondingly after each action executed by the whole round of game into a database.
And the construction module 40 is configured to optimize the pre-constructed value network and the loss value function of the pre-constructed action probability network by adopting a back propagation method according to the sampling data acquired from the database, so as to obtain an optimal value network and an optimal action probability network.
In summary, the intelligent game processing apparatus provided by the embodiment of the invention acquires the game state data of the current player of an intelligent game and simulates the game; searches, according to a preset Monte Carlo search tree, for the action to be executed for the current player's game state data, with the game data generated by each executed action used as the current game state when determining the action in the next search, until the whole round of the game is finished; saves the game information generated after each action of the whole round into a database; and optimizes the loss functions of the preset value network and action probability network by a back propagation method using sample data obtained from the database, so as to obtain an optimal value network and an optimal action probability network. In a game with incomplete information, this simplifies the training process of reinforcement learning, makes the action decisions of non-player characters more accurate, improves game quality, enhances playability, and improves user experience.
Referring to fig. 4, a schematic diagram of an intelligent game processing device according to an embodiment of the present invention is shown. The intelligent game processing device of this embodiment includes: a processor 11, a memory 12, and a computer program stored in the memory and executable on the processor. When the processor executes the computer program, the steps of the above embodiments of the intelligent game processing method are implemented. Alternatively, when executing the computer program, the processor implements the functions of the modules/units in the above device embodiments.
The computer program may be divided into one or more modules/units, which are stored in the memory and executed by the processor to accomplish the present invention, for example. The one or more modules/units may be a series of computer program instruction segments capable of performing specific functions for describing the execution of the computer program in the intelligent game processing device.
The intelligent game processing device can be a computing device such as a desktop computer, a notebook computer, a palm computer, or a cloud server. The intelligent game processing device may include, but is not limited to, a processor and a memory. It will be appreciated by those skilled in the art that the schematic diagram is merely an example of an intelligent game processing device and is not meant to be limiting; the device may include more or fewer components than shown, certain components may be combined, or different components may be included. For example, the intelligent game processing device may also include input-output devices, network access devices, buses, etc.
The processor 11 may be a central processing unit (Central Processing Unit, CPU), or another general purpose processor, a digital signal processor (Digital Signal Processor, DSP), an application specific integrated circuit (Application Specific Integrated Circuit, ASIC), a field-programmable gate array (Field-Programmable Gate Array, FPGA) or other programmable logic device, a discrete gate or transistor logic device, a discrete hardware component, or the like. The general purpose processor may be a microprocessor, or the processor may be any conventional processor; it is the control center of the intelligent game processing device, connecting the various parts of the overall intelligent game processing device using various interfaces and lines.
The memory 12 may be used to store the computer programs and/or modules; the processor implements the various functions of the intelligent game processing device by running or executing the computer programs and/or modules stored in the memory and invoking the data stored in the memory. The memory may mainly include a program storage area and a data storage area, wherein the program storage area may store an operating system and an application program required for at least one function (such as a sound playing function, an image playing function, etc.); the data storage area may store data created according to the use of the device (such as audio data, a phonebook, etc.). In addition, the memory may include high-speed random access memory, and may also include non-volatile memory, such as a hard disk, a plug-in hard disk, a Smart Media Card (SMC), a Secure Digital (SD) Card, a Flash Card, at least one magnetic disk storage device, a flash memory device, or other non-volatile solid state storage device.
Wherein, if the integrated modules/units of the intelligent game processing device are implemented in the form of software functional units and sold or used as a stand-alone product, they may be stored in a computer readable storage medium. Based on this understanding, the present invention may implement all or part of the flow of the method of the above embodiments by instructing related hardware through a computer program; the computer program may be stored in a computer readable storage medium, and when executed by a processor, implements the steps of each of the method embodiments described above. The computer program comprises computer program code, which may be in source code form, object code form, an executable file, some intermediate form, etc. The computer readable medium may include: any entity or device capable of carrying the computer program code, a recording medium, a USB flash drive, a removable hard disk, a magnetic disk, an optical disk, a computer memory, a read-only memory (ROM, Read-Only Memory), a random access memory (RAM, Random Access Memory), an electrical carrier signal, a telecommunication signal, a software distribution medium, and so forth.
It should be noted that the above-described apparatus embodiments are merely illustrative; the units described as separate components may or may not be physically separate, and components shown as units may or may not be physical units, i.e., they may be located in one place or distributed over a plurality of network units. Some or all of the modules may be selected according to actual needs to achieve the purpose of the solution of this embodiment. In addition, in the drawings of the device embodiments provided by the invention, the connection relation between the modules indicates that they have a communication connection, which may be specifically implemented as one or more communication buses or signal lines. Those of ordinary skill in the art can understand and implement the present invention without undue burden.
While the foregoing is directed to the preferred embodiments of the present invention, it will be appreciated by those skilled in the art that changes and modifications may be made without departing from the principles of the invention; such changes and modifications are also intended to fall within the scope of the invention.
Claims (9)
1. An intelligent game processing method, comprising:
acquiring game state data of a current player of an intelligent game, and simulating the intelligent game;
searching, according to a preset Monte Carlo search tree, for the action to be executed corresponding to the game state data of the current player, until the whole round of the game is finished; wherein the game data generated by the action obtained in each search is taken as the game state data of the current player in the next search;
storing game information generated correspondingly after each action executed by the whole round of game into a database;
optimizing a pre-constructed value network and a loss value function of a pre-constructed action probability network by adopting a back propagation method according to sampling data acquired from the database to obtain an optimal value network and an optimal action probability network;
wherein the game information includes: the game state data of the current player, the probability label of executing each action corresponding to the game state data of the current player, the value label of the current game state and the game result when the whole round of game is finished;
the loss value functions of the pre-constructed value network and the pre-constructed action probability network are specifically as follows:
L = αLv + βLp + ηwᵀw;
wherein Lv is the loss value function of the value network, α is the coefficient of the loss value function of the value network, Lp is the loss value function of the action probability network, β is the coefficient of the loss value function of the action probability network, wᵀw is the regularization penalty term, and η is the coefficient of the regularization penalty term.
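By way of illustration only, and not as part of the claims, the combined loss above can be sketched in Python. The claim only names the terms Lv, Lp and ηwᵀw; taking Lv as a mean-squared error on the value head and Lp as a cross-entropy on the policy head is an assumption (common in AlphaZero-style training), and all function and parameter names are hypothetical:

```python
import numpy as np

def combined_loss(value_pred, value_label, prob_pred, prob_label, w,
                  alpha=1.0, beta=1.0, eta=1e-4):
    """Total loss L = alpha*Lv + beta*Lp + eta*w^T w.

    Lv is assumed to be a mean-squared error on the value network output,
    and Lp a cross-entropy between the action probability network output
    and the probability label; eta*w^T*w is an L2 penalty on the weights w.
    """
    Lv = np.mean((value_pred - value_label) ** 2)            # value loss
    Lp = -np.sum(prob_label * np.log(prob_pred + 1e-12))     # policy loss
    reg = eta * np.dot(w, w)                                 # w^T w penalty
    return alpha * Lv + beta * Lp + reg
```

With α = β = 1 and a small η, the penalty term ηwᵀw discourages large network weights without dominating the two prediction losses.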
2. The intelligent game processing method according to claim 1, wherein searching, according to the preset Monte Carlo search tree, for the action to be executed corresponding to the game state data of the current player until the whole round of the game is finished, with the game data generated by each searched action taken as the game state data of the current player in the next search, specifically comprises:
determining a node corresponding to the game state data of the current player;
searching at least one corresponding child node according to the node corresponding to the game state data of the current player;
and selecting the best child node from at least one corresponding child node as an action executed under the game state data of the current player.
3. The intelligent game processing method according to claim 2, wherein selecting the best child node from the at least one corresponding child node as the action executed under the game state data of the current player specifically comprises:
selecting the child node with the largest access times as the optimal child node;
or,
selecting an optimal child node from at least one corresponding child node according to preset selection probability;
or,
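The two selection alternatives recited in claim 3 can be sketched as follows; the `Node` class and its fields (`visits`, `prior`) are hypothetical names for illustration, not taken from the patent:

```python
import random

class Node:
    """Minimal search-tree node; field names are hypothetical."""
    def __init__(self, action, visits=0, prior=0.0):
        self.action = action      # action leading to this child
        self.visits = visits      # number of times this child was accessed
        self.prior = prior        # preset selection probability
        self.children = []

def best_child_by_visits(node):
    # First alternative: the child node with the largest access count.
    return max(node.children, key=lambda c: c.visits)

def best_child_by_probability(node, rng=random):
    # Second alternative: sample a child according to the preset
    # selection probabilities.
    return rng.choices(node.children,
                       weights=[c.prior for c in node.children])[0]
```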
4. The intelligent game processing method according to claim 1, wherein the simulating the intelligent game specifically comprises:
the visible information is kept unchanged, and the non-visible information is subjected to random generation operation.
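The simulation step of claim 4 amounts to what is often called determinization in imperfect-information game search. A minimal sketch, with illustrative names and a card-game flavor assumed (neither is specified by the patent):

```python
import random

def determinize(visible, full_set, rng=random):
    """Keep the visible information unchanged and randomly generate the
    non-visible information (here: the elements the current player cannot
    see) for one simulation of the game."""
    hidden = [x for x in full_set if x not in visible]
    rng.shuffle(hidden)          # random generation of non-visible info
    return list(visible), hidden
```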
5. The intelligent game processing method according to claim 1, wherein, after searching for the action corresponding to the game state data of the current player according to the preset Monte Carlo search tree until the whole round of the game is finished, and before saving the game information generated after each action of the whole round to the database, the method further comprises:
back-propagating through the node corresponding to the game state data of at least one current player, so as to respectively calculate the profit value and the access times corresponding to each current game state.
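The back-propagation step of claim 5 can be sketched as a walk from a leaf node back to the root, updating the profit value and access count of every game state on the path; the `Node` fields here are hypothetical names:

```python
class Node:
    """Minimal node with the fields needed for back propagation."""
    def __init__(self, parent=None):
        self.parent = parent
        self.visits = 0        # access times for this game state
        self.value_sum = 0.0   # accumulated profit value

def backpropagate(leaf, reward):
    # Walk from the leaf to the root, updating every node on the path.
    node = leaf
    while node is not None:
        node.visits += 1
        node.value_sum += reward
        node = node.parent
```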
6. The intelligent game processing method according to claim 1, wherein the method further comprises:
if no action to be executed corresponding to the game state data of the current player can be found according to the preset Monte Carlo search tree and the whole round of the game is not finished, expanding the possible actions corresponding to the game state data of the current player through a preset neural network, and recording the probability of each possible action.
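The expansion step of claim 6 can be sketched as follows; `policy_net` stands in for the preset neural network and is a hypothetical callable returning (action, probability) pairs, as are the `Node` field names:

```python
class Node:
    """Minimal node for tree expansion; field names are hypothetical."""
    def __init__(self, action=None, prior=0.0, parent=None):
        self.action, self.prior, self.parent = action, prior, parent
        self.children = []

def expand(node, policy_net, state):
    # When the tree has no child for the current state and the round is not
    # over, enumerate the possible actions via the policy network and record
    # each action's probability as the child's prior.
    for action, prob in policy_net(state):
        node.children.append(Node(action=action, prior=prob, parent=node))
    return node.children

# Usage with a stand-in policy network:
fake_policy = lambda state: [("fold", 0.2), ("call", 0.5), ("raise", 0.3)]
root = Node()
children = expand(root, fake_policy, state=None)
```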
7. An intelligent game processing device, comprising:
the acquisition module is used for acquiring game state data of a current player of the intelligent game and simulating the intelligent game;
the search matching module is used for searching, according to a preset Monte Carlo search tree, for the action to be executed corresponding to the game state data of the current player, until the whole round of the game is finished; wherein the game data generated by the action obtained in each search is taken as the game state data of the current player in the next search;
the storage module is used for storing game information generated correspondingly after each action executed by the whole round of game into the database;
the construction module is used for optimizing a pre-constructed value network and a loss value function of a pre-constructed action probability network by adopting a back propagation method according to the sampling data acquired from the database to acquire an optimal value network and an optimal action probability network;
wherein the game information includes: the game state data of the current player, the probability label of executing each action corresponding to the game state data of the current player, the value label of the current game state and the game result when the whole round of game is finished;
the loss value functions of the pre-constructed value network and the pre-constructed action probability network are specifically as follows:
L = αLv + βLp + ηwᵀw;
wherein Lv is the loss value function of the value network, α is the coefficient of the loss value function of the value network, Lp is the loss value function of the action probability network, β is the coefficient of the loss value function of the action probability network, wᵀw is the regularization penalty term, and η is the coefficient of the regularization penalty term.
8. A smart game processing device comprising a processor, a memory and a computer program stored in the memory and configured to be executed by the processor, the processor implementing the smart game processing method of any one of claims 1 to 6 when the computer program is executed.
9. A computer readable storage medium, characterized in that the computer readable storage medium comprises a stored computer program, wherein the computer program, when run, controls a device in which the computer readable storage medium is located to perform the intelligent game processing method according to any one of claims 1 to 6.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201910966855.2A CN110772794B (en) | 2019-10-12 | 2019-10-12 | Intelligent game processing method, device, equipment and storage medium |
Applications Claiming Priority (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201910966855.2A CN110772794B (en) | 2019-10-12 | 2019-10-12 | Intelligent game processing method, device, equipment and storage medium |
Publications (2)
Publication Number | Publication Date |
---|---|
CN110772794A CN110772794A (en) | 2020-02-11 |
CN110772794B true CN110772794B (en) | 2023-06-16 |
Family
ID=69386163
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN201910966855.2A Active CN110772794B (en) | 2019-10-12 | 2019-10-12 | Intelligent game processing method, device, equipment and storage medium |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN110772794B (en) |
Families Citing this family (1)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN114425166A (en) * | 2022-01-27 | 2022-05-03 | 北京字节跳动网络技术有限公司 | Data processing method, data processing device, storage medium and electronic equipment |
Citations (1)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN110119804A (en) * | 2019-05-07 | 2019-08-13 | Anhui University | An Ai Ensitan chess game-playing algorithm based on reinforcement learning
Family Cites Families (7)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
US8545332B2 (en) * | 2012-02-02 | 2013-10-01 | International Business Machines Corporation | Optimal policy determination using repeated stackelberg games with unknown player preferences |
JP6511333B2 (en) * | 2015-05-27 | 2019-05-15 | Hitachi, Ltd. | Decision support system and decision support method |
CN106390456B (en) * | 2016-09-30 | 2018-09-18 | Tencent Technology (Shenzhen) Co., Ltd. | Method and device for generating character actions in a game |
US10402723B1 (en) * | 2018-09-11 | 2019-09-03 | Cerebri AI Inc. | Multi-stage machine-learning models to control path-dependent processes |
CN107050839A (en) * | 2017-04-14 | 2017-08-18 | Anhui University | Machine game-playing system for Amazons chess based on the UCT algorithm |
CN109908584A (en) * | 2019-03-13 | 2019-06-21 | Beijing Dajia Internet Information Technology Co., Ltd. | Game information acquisition method, device and electronic equipment |
CN110083748A (en) * | 2019-04-30 | 2019-08-02 | Nanjing University of Posts and Telecommunications | A search method based on adaptive dynamic programming and Monte Carlo tree search |
-
2019
- 2019-10-12 CN CN201910966855.2A patent/CN110772794B/en active Active
Patent Citations (1)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN110119804A (en) * | 2019-05-07 | 2019-08-13 | Anhui University | An Ai Ensitan chess game-playing algorithm based on reinforcement learning
Also Published As
Publication number | Publication date |
---|---|
CN110772794A (en) | 2020-02-11 |
Similar Documents
Publication | Publication Date | Title |
---|---|---|
Perolat et al. | Mastering the game of stratego with model-free multiagent reinforcement learning | |
Torrado et al. | Deep reinforcement learning for general video game ai | |
CN109621422B (en) | Electronic chess and card decision model training method and device and strategy generation method and device | |
CN107423274B (en) | Artificial intelligence-based game comment content generation method and device and storage medium | |
US10272341B1 (en) | Procedural level generation for games | |
CN110443284B (en) | Artificial intelligence AI model training method, calling method, server and readable storage medium | |
Janusz et al. | Helping ai to play hearthstone: Aaia'17 data mining challenge | |
CN107526682B (en) | Method, device and equipment for generating AI (Artificial Intelligence) behavior tree of test robot | |
CN110841295B (en) | Data processing method based on artificial intelligence and related device | |
Cazenave | Monte carlo beam search | |
Zhang et al. | AlphaZero | |
Baier et al. | Evolutionary MCTS for multi-action adversarial games | |
US20230306787A1 (en) | Movement extraction method and apparatus for dance video, computer device, and storage medium | |
CN109011580B (en) | Incomplete game card face obtaining method and device, computer equipment and storage medium | |
JP7031811B2 (en) | A method and system for training player characters in sports games using spatial dualization | |
CN110598853B (en) | Model training method, information processing method and related device | |
CN110772794B (en) | Intelligent game processing method, device, equipment and storage medium | |
CN111589120A (en) | Object control method, computer device, and computer-readable storage medium | |
CN112274935A (en) | AI model training method, use method, computer device and storage medium | |
CN109948062A (en) | A kind of target matching method, device, server, system and storage medium | |
CN113946604B (en) | Staged go teaching method and device, electronic equipment and storage medium | |
Deng et al. | A study of prisoner's dilemma game model with incomplete information | |
Chaslot et al. | Meta monte-carlo tree search for automatic opening book generation | |
CN110598182A (en) | Information prediction method and related equipment | |
Fatemi et al. | Rating and generating Sudoku puzzles based on constraint satisfaction problems |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
PB01 | Publication | |
SE01 | Entry into force of request for substantive examination | |
GR01 | Patent grant | |