CN113159313B - Data processing method and device of game model, electronic equipment and storage medium - Google Patents


Info

Publication number
CN113159313B
CN113159313B (application CN202110228510.4A)
Authority
CN
China
Prior art keywords
game
data
neural network
playing
card
Prior art date
Legal status (assumption, not a legal conclusion)
Active
Application number
CN202110228510.4A
Other languages
Chinese (zh)
Other versions
CN113159313A (en
Inventor
查道琛
马文晔
谢静如
Current Assignee (the listed assignee may be inaccurate)
Beijing Dajia Internet Information Technology Co Ltd
Original Assignee
Beijing Dajia Internet Information Technology Co Ltd
Priority date (assumption, not a legal conclusion)
Filing date
Publication date
Application filed by Beijing Dajia Internet Information Technology Co Ltd filed Critical Beijing Dajia Internet Information Technology Co Ltd
Priority to CN202110228510.4A
Publication of CN113159313A
Application granted
Publication of CN113159313B
Legal status: Active
Anticipated expiration

Classifications

    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06N: COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N 3/00: Computing arrangements based on biological models
    • G06N 3/02: Neural networks
    • G06N 3/08: Learning methods
    • G06N 3/082: Learning methods modifying the architecture, e.g. adding, deleting or silencing nodes or connections
    • A: HUMAN NECESSITIES
    • A63: SPORTS; GAMES; AMUSEMENTS
    • A63F: CARD, BOARD, OR ROULETTE GAMES; INDOOR GAMES USING SMALL MOVING PLAYING BODIES; VIDEO GAMES; GAMES NOT OTHERWISE PROVIDED FOR
    • A63F 13/00: Video games, i.e. games using an electronically generated display having two or more dimensions
    • A63F 13/70: Game security or game management aspects
    • A63F 13/79: Game security or game management aspects involving player-related data, e.g. identities, accounts, preferences or play histories
    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06N: COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N 3/00: Computing arrangements based on biological models
    • G06N 3/02: Neural networks
    • G06N 3/04: Architecture, e.g. interconnection topology

Abstract

A game simulator arranged at a remote end self-plays based on a neural network of a first card-playing model to generate game-play data for each character object. The game-play data of each character object is input into a neural network of a second card-playing model whose parameters are the same as those of the neural network of the first card-playing model, and the neural network of the second card-playing model is trained with a reinforcement learning algorithm, so that a target card-playing model with updated neural network parameters is obtained. Because the remote game simulator self-plays to generate the corresponding game-play data used as training data, the neural network can be trained through the reinforcement learning algorithm without relying on the data and experience of human players, which improves both the card-playing accuracy of the trained target card-playing model and the training speed of the model.

Description

Game model data processing method and device, electronic equipment and storage medium
Technical Field
The present disclosure relates to the field of computer technologies, and in particular, to a data processing method and apparatus for a game model, an electronic device, and a storage medium.
Background
With the development of electronics and network technologies, online games, as a form of entertainment, have become an indispensable part of people's lives, for example Fight the Landlord (Dou Dizhu), mahjong and the like. In a Fight the Landlord game, playing cards is the most important part, and how well a player plays directly determines whether the player wins or loses. The playing experience of a human player is also affected by how well the in-game robots play: matching human players with robots of a comparably high level gives the players an interesting gaming experience. Therefore, the card-playing strategy is a crucial part of intelligent decision-making in Fight the Landlord.
In the related art, a card-playing strategy based on supervised learning is generally adopted; that is, human card playing is imitated by a supervised learning algorithm using the card-playing data of a large number of human players generated online. However, because this approach depends on the card-playing data of human players, its performance depends to a large extent on the quality of that data, and the trained strategy can hardly exceed the human level, so that the accuracy with which the trained robot plays cards is low.
Disclosure of Invention
The present disclosure provides a data processing method and apparatus for a game model, an electronic device, and a storage medium, so as to at least solve the problem in the related art that a robot plays cards with low accuracy. The technical solution of the present disclosure is as follows:
according to a first aspect of embodiments of the present disclosure, there is provided a data processing method of a game model, the method including:
obtaining game-play data generated through self-play by a game simulator arranged at a remote end, wherein the game-play data comprises state data of each character object, a target behavior corresponding to the state data, and a game result obtained when the game simulator self-plays based on a neural network of a first card-playing model;
inputting, based on the game result, the state data of each character object and the target behavior corresponding to the state data into a neural network of a second card-playing model whose parameters are the same as those of the neural network of the first card-playing model, wherein the neural network of the first card-playing model is obtained by synchronizing the neural network parameters of the second card-playing model;
and training the neural network of the second card-playing model by using a reinforcement learning algorithm to obtain a target card-playing model with updated neural network parameters.
In one embodiment, the obtaining of game-play data generated through self-play by a remotely arranged game simulator comprises: acquiring state data of a corresponding target character object and all candidate behaviors corresponding to the state data when the game simulator arranged at the remote end self-plays based on the neural network of the first card-playing model; obtaining, based on a game strategy and the state data, decision data of each candidate behavior corresponding to the state data; determining a target behavior corresponding to the state data according to the decision data of each candidate behavior; and obtaining an execution result after the target behavior is executed, until the game ends and a game result of the match is obtained.
In one embodiment, after the obtaining of the game-play data generated through self-play by the game simulator arranged at the remote end, the method further comprises: storing the game-play data in a buffer corresponding to each character object based on the different character objects in the game; and the training of the neural network of the second card-playing model by using the reinforcement learning algorithm to obtain the target card-playing model with updated parameters of the neural network comprises: training, based on the game-play data in the buffer corresponding to each character object, the neural network of the second card-playing model corresponding to each character object in parallel by using the reinforcement learning algorithm, to obtain the target card-playing model with updated parameters of the neural network corresponding to each character object.
In one embodiment, the training, based on the game data in the buffer corresponding to each character object, of the neural network of the second card-playing model corresponding to each character object in parallel by using a reinforcement learning algorithm to obtain the target card-playing model with updated parameters of the neural network corresponding to each character object includes: when the buffer area with the data volume reaching the set value exists, training a neural network of a second card-playing model corresponding to the role object in the buffer area by adopting a reinforcement learning algorithm based on the game data in the buffer area with the data volume reaching the set value, and obtaining a target card-playing model after the parameters of the neural network corresponding to the role object are updated.
In one embodiment, after obtaining the target card-playing model with updated parameters of the neural network, the method further comprises: transmitting the updated parameters of the neural network to a game simulator arranged at a remote end, wherein the updated parameters of the neural network are used for indicating to update the game simulator; the acquiring of game-play data generated by self-play of a game simulator arranged at a remote end comprises the following steps: and acquiring game-play data generated by self-game of the updated game simulator arranged at a remote end.
In one embodiment, the transmitting the updated parameters of the neural network to a remotely located game simulator includes: locking setting is carried out before parameters of the neural network are updated; and releasing the lock until the parameters of the neural network are updated, and transmitting the updated parameters of the neural network to a game simulator arranged at a remote end.
In one embodiment, after obtaining the updated target card-playing model of the parameters of the neural network corresponding to each character object, the method further includes: acquiring state data of a character object corresponding to the target card-playing model in an actual game scene and all candidate behaviors corresponding to the state data; inputting the state data and all candidate behaviors corresponding to the state data into the target card-playing model to obtain decision data of each candidate behavior corresponding to the state data and output by the target card-playing model; and determining the target behaviors meeting the conditions according to the decision data of each candidate behavior for playing cards.
In one embodiment, the determining, according to the decision data of each candidate behavior, a target behavior meeting a condition for playing the card includes: determining the candidate behavior with the highest decision data according to the decision data of each candidate behavior; and taking the candidate behavior with the highest decision data as a target behavior to play the card.
According to a second aspect of the embodiments of the present disclosure, there is provided a data processing apparatus of a game model, including:
a game-play data acquisition module configured to acquire game-play data generated through self-play by a game simulator arranged at a remote end, wherein the game-play data comprises state data of each character object, a target behavior corresponding to the state data, and a game result obtained when the game simulator self-plays based on a neural network of a first card-playing model;
a training module configured to perform inputting the state data of each character object and the target behavior corresponding to the state data into a neural network of a second card-playing model having the same neural network parameters as those of the first card-playing model based on the game result, wherein the neural network of the first card-playing model is obtained by synchronizing the neural network parameters of the second card-playing model; and training the neural network of the second card playing model by adopting a reinforcement learning algorithm to obtain a target card playing model with updated parameters of the neural network.
In one embodiment, the game-play data acquisition module is configured to perform: acquiring state data of a corresponding target character object and all candidate behaviors corresponding to the state data when the game simulator arranged at a remote end self-plays based on a neural network of a first card-playing model; obtaining, based on a game strategy and the state data, decision data of each candidate behavior corresponding to the state data; determining a target behavior corresponding to the state data according to the decision data of each candidate behavior; and obtaining an execution result after the target behavior is executed until the game is finished, and obtaining a game result of the match.
In one embodiment, the apparatus further comprises a storage module configured to perform: storing the game play data in a buffer corresponding to each character object based on different character objects in the game; the training module is configured to perform: and training the neural network of the second card-playing model corresponding to each role object in parallel by adopting a reinforcement learning algorithm based on the game data in the buffer zone corresponding to each role object to obtain the target card-playing model after the parameters of the neural network corresponding to each role object are updated.
In one embodiment, the training module is further configured to perform: when the buffer area with the data volume reaching the set value exists, training a neural network of a second card-playing model corresponding to the role object in the buffer area by adopting a reinforcement learning algorithm based on the game data in the buffer area with the data volume reaching the set value, and obtaining a target card-playing model after the parameters of the neural network corresponding to the role object are updated.
In one embodiment, the apparatus further includes an update module configured to perform: after a target card-playing model with updated parameters of the neural network is obtained, transmitting the updated parameters of the neural network to the game simulator arranged at the remote end, wherein the updated parameters of the neural network are used for indicating that the game simulator is to be updated; the game-play data acquisition module is further configured to perform: acquiring game-play data generated through self-play by the updated game simulator arranged at the remote end.
In one embodiment, the update module is further configured to perform: locking setting is carried out before parameters of the neural network are updated; and releasing the lock until the parameters of the neural network are updated, and transmitting the updated parameters of the neural network to a game simulator arranged at a remote end.
In one embodiment, the apparatus further comprises a model application module configured to perform: acquiring state data of a character object corresponding to the target card-playing model in an actual game scene and all candidate behaviors corresponding to the state data; inputting the state data and all candidate behaviors corresponding to the state data into the target card-playing model to obtain decision data, output by the target card-playing model, of each candidate behavior corresponding to the state data; and determining a target behavior meeting a condition according to the decision data of each candidate behavior to play a card.
In one embodiment, the model application module is further configured to perform: determining the candidate behavior with the highest decision data according to the decision data of each candidate behavior; and taking the candidate behavior with the highest decision data as a target behavior to play the card.
According to a third aspect of the embodiments of the present disclosure, there is provided an electronic apparatus including: a processor; a memory for storing the processor-executable instructions; wherein the processor is configured to execute the instructions to cause the electronic device to perform the data processing method of the game model described in any embodiment of the first aspect.
According to a fourth aspect of embodiments of the present disclosure, there is provided a storage medium, wherein when instructions in the storage medium are executed by a processor of an electronic device, the electronic device is enabled to perform the data processing method of the game model described in any embodiment of the first aspect.
According to a fifth aspect of embodiments of the present disclosure, there is provided a computer program product comprising a computer program stored in a readable storage medium, from which at least one processor of a device reads and executes the computer program, so that the device performs the data processing method of a game model described in any one of the embodiments of the first aspect.
The technical solution provided by the embodiments of the present disclosure brings at least the following beneficial effects. A game simulator arranged at a remote end self-plays based on a neural network of a first card-playing model to generate game-play data for each character object; based on the game result, the state data of each character object and the target behavior corresponding to the state data are input into a neural network of a second card-playing model whose parameters are the same as those of the neural network of the first card-playing model; and the neural network of the second card-playing model is trained with a reinforcement learning algorithm, so that a target card-playing model with updated neural network parameters is obtained. Because the remote game simulator self-plays to generate the corresponding game-play data used as training data, the neural network can be trained through the reinforcement learning algorithm without relying on the data and experience of human players, which improves the card-playing accuracy of the trained target card-playing model; and because the game simulator that generates the training data is arranged at the remote end, it can be run in multiple processes, which further improves the training speed of the model.
It is to be understood that both the foregoing general description and the following detailed description are exemplary and explanatory only and are not restrictive of the disclosure.
Drawings
The accompanying drawings, which are incorporated in and constitute a part of this specification, illustrate embodiments consistent with the present disclosure and, together with the description, serve to explain the principles of the disclosure and are not to be construed as limiting the disclosure.
FIG. 1 is a flow diagram illustrating a data processing method for a game model according to an exemplary embodiment.
Fig. 2 is a flowchart illustrating steps for obtaining game-play data according to an exemplary embodiment.
FIG. 3 is a flow chart illustrating a method of data processing for a game model according to another exemplary embodiment.
FIG. 4 is a block diagram illustrating a data processing arrangement of a game model according to an exemplary embodiment.
FIG. 5 is a block diagram illustrating a data processing device of a game model according to one exemplary embodiment.
FIG. 6 is a schematic diagram illustrating a game model according to an exemplary embodiment.
FIG. 7 is a block diagram illustrating an electronic device in accordance with an example embodiment.
Detailed Description
In order to make the technical solutions of the present disclosure better understood, the technical solutions in the embodiments of the present disclosure will be clearly and completely described below with reference to the accompanying drawings.
It should be noted that the terms "first," "second," and the like in the description and claims of the present disclosure and in the above-described drawings are used for distinguishing between similar elements and not necessarily for describing a particular sequential or chronological order. It is to be understood that the data so used is interchangeable under appropriate circumstances such that the embodiments of the disclosure described herein are capable of operation in sequences other than those illustrated or otherwise described herein. The implementations described in the exemplary embodiments below do not represent all implementations consistent with the present disclosure. Rather, they are merely examples of apparatus and methods consistent with certain aspects of the present disclosure, as detailed in the appended claims.
Card-playing schemes in the conventional art are generally implemented as follows: random card playing, i.e., uniformly sampling among the legal card types; rule-based card playing, i.e., playing according to hand-written rules whose basic principle is to play out the hand in the fewest moves; card playing based on supervised learning, i.e., imitating human play with a supervised learning algorithm using a large amount of human data generated online; card playing based on policy gradient or Q-learning, i.e., training a model to play cards through policy gradient or Q-learning; and card playing based on search, i.e., searching for an optimal play through Monte Carlo (MC) methods after guessing the cards of the other characters. However, random and rule-based card playing is less effective and prone to unreasonable plays, because coverage is difficult to guarantee with manually written rules. Card playing based on supervised learning depends on human data, so its performance depends to a great extent on the quality of that data, and the trained strategy can hardly exceed the human level. Card playing based on policy gradient or Q-learning has difficulty handling the complicated card types in Fight the Landlord: the policy gradient can hardly exploit the structure of the card types, and Q-learning can hardly handle the large number of card types, so the effect is not ideal. Search-based card playing requires a great deal of time and is not suitable for large-scale deployment.
Based on this, the present disclosure provides a data processing method for a game model, which can be applied to any electronic device that needs to obtain a card playing model, wherein the electronic device includes, but is not limited to, a personal computer, a notebook computer, a smart phone, a tablet computer, a portable wearable device, and the like. The card-playing model can be used for conducting self-gaming during model training to generate training data, and conducting gaming with a user during game playing after continuous iterative training based on the training data. As shown in fig. 1, the method comprises the steps of:
in step S110, game data generated by a game simulator provided at a remote location playing itself is acquired.
The game simulator is loaded with the neural network parameters of the first card-playing model, so that it can self-play to generate game-play data for each character object without relying on the data and experience of human players. Specifically, the game-play data includes the state data of each character object when the game simulator self-plays based on the neural network of the first card-playing model, the target behavior corresponding to the state data, and the game result. The state data of each character object is composed of environment features and includes the information visible to the corresponding character object at a certain moment, such as its own hand information and the historical card-playing information. The target behavior corresponding to the state data is the card-playing behavior of the character object in that state, that is, which cards are played. The game result refers to the win/loss or the score of the corresponding character when the game ends. A neural network is a mathematical model that processes information using a structure similar to the synaptic connections of the brain; depending on the complexity of the system, it processes information by adjusting the interconnections among a large number of internal nodes. A neural network typically includes an input layer, hidden layers, and an output layer, and each layer contains many neurons that make up the basic structure of that layer. In this embodiment, the game-play data of each character object generated when the game simulator arranged at the remote end self-plays based on the neural network of the first card-playing model is acquired, and the model is trained through the following steps, so that the data and experience of human players need not be relied on and the ability of each character object is iteratively improved, even beyond the human level. Because the game simulator that generates the game-play data is arranged at the remote end, it can be run in multiple processes, which improves the speed of generating the game-play data.
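As a non-limiting illustration (not part of the original disclosure), the following Python sketch shows one way a single step of such game-play data might be represented; the names Transition, Episode, role, hand, history, action and reward are assumptions introduced here for clarity, not terms defined by the patent.

```python
# Minimal sketch of one step of game-play data; all field names are assumptions.
from dataclasses import dataclass, field
from typing import List

@dataclass
class Transition:
    role: str                 # which character object produced this step
    hand: List[int]           # encoded cards currently held by this character
    history: List[List[int]]  # encoded historical card plays visible to it
    action: List[int]         # the target behavior: the cards actually played
    reward: float = 0.0       # filled in with the game result once the match ends

@dataclass
class Episode:
    transitions: List[Transition] = field(default_factory=list)

    def assign_result(self, result: float) -> None:
        # The reward is only known at the end of the game, so it is
        # propagated back onto every recorded step of the match.
        for t in self.transitions:
            t.reward = result
```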
In step S120, based on the game result, the state data of each character object and the target behavior corresponding to the state data are input into the neural network of the second card-playing model.
The neural network of the second card-playing model is arranged locally and used for carrying out model training to obtain a target card-playing model, parameters of the neural network of the second card-playing model are the same as those of the first card-playing model, and the neural network of the first card-playing model is obtained by synchronizing the parameters of the neural network of the second card-playing model. That is, before training the second card-playing model, in order to achieve strategy synchronization, the neural network of the first card-playing model for generating the game data can be obtained by copying the network parameters of the second card-playing model to a remote end. Therefore, on the basis of the game matching data of each character object generated when the neural network of the first card playing model plays games, the game matching data is input into the neural network of the second card playing model with the same network parameters, and the neural network of the second card playing model is trained through the subsequent steps. Specifically, in this embodiment, based on the game result in the game data generated by the remote game simulator in the self-game, the state data of each character object and the target behavior corresponding to the state data are input to the neural network of the second card-playing model having the same parameters as the neural network of the first card-playing model, and the neural network of the second card-playing model is trained through the subsequent steps.
In step S130, a neural network of the second card-playing model is trained by using a reinforcement learning algorithm, so as to obtain a target card-playing model with updated parameters of the neural network.
The reinforcement learning (RL) algorithm developed from theories such as animal learning and parameter-perturbation adaptive control. Training the neural network with a reinforcement learning algorithm allows it to continuously learn from past experience and acquire knowledge without a large number of labeled samples, so that the trained neural network can improve its behavior scheme to adapt to the environment and adopt the behavior that maximizes the return, thereby improving the accuracy of card playing and achieving a better card-playing effect. In this embodiment, the neural network of the second card-playing model is trained with a reinforcement learning algorithm on the game-play data generated through self-play by the remote neural network of the first card-playing model, whose network parameters are the same as those of the second card-playing model, so as to obtain a target card-playing model with updated neural network parameters. It can be understood that the target card-playing model can be applied in an actual game scene to play against human characters, and therefore the target card-playing model is also called a game model. Of course, in order to further improve the card-playing accuracy of the target card-playing model, the target card-playing model may be trained further. This embodiment does not limit this.
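The following Python sketch is an illustrative outline (not the patent's implementation) of how steps S110 to S130 could fit together; the helpers simulator.load_parameters, simulator.self_play, buffer.sample and learner.update are hypothetical names introduced only for this example.

```python
# Illustrative outline of the collect-and-train loop; all helper names are assumptions.
def train(learner, simulator, buffer, num_iterations: int, batch_size: int):
    for _ in range(num_iterations):
        # Keep the remote first card-playing model in sync with the local second model.
        simulator.load_parameters(learner.parameters())
        # The remote simulator self-plays one match and returns its game-play data.
        episode = simulator.self_play()
        buffer.extend(episode.transitions)
        if len(buffer) >= batch_size:
            batch = buffer.sample(batch_size)
            # Reinforcement-learning update of the second card-playing model.
            learner.update(batch)
    return learner
```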
According to the above data processing method of a game model, game-play data generated through self-play, based on the neural network of the first card-playing model, by the game simulator arranged at the remote end is acquired; based on the game result, the state data of each character object and the target behavior corresponding to the state data are input into the neural network of the second card-playing model, whose parameters are the same as those of the neural network of the first card-playing model; and the neural network of the second card-playing model is trained with a reinforcement learning algorithm, so that a target card-playing model with updated neural network parameters is obtained. Because the remote game simulator self-plays to generate the corresponding game-play data used as training data, the neural network can be trained through the reinforcement learning algorithm without relying on the data and experience of human players, which improves the card-playing accuracy of the trained target card-playing model; and because the game simulator that generates the training data is arranged at the remote end, it can be run in multiple processes, which further improves the training speed of the model.
In an exemplary embodiment, as shown in fig. 2, the obtaining of game-play data generated through self-play by the remotely arranged game simulator in step S110 may specifically be implemented by the following steps:
in step S111, state data of the corresponding target character object and all candidate behaviors corresponding to the state data are acquired when the game simulator provided at the remote end performs a self-game based on the neural network of the first card drawing model.
The target character object is the character object corresponding to the first card-playing model. The state data of the target character object includes the hand information, historical card-playing information, and so on, of the character object corresponding to the first card-playing model. All candidate behaviors corresponding to the state data refer to the card types that can currently be played. For example, in a Fight the Landlord game, if the card type played by the previous player is a trio with a single kicker ("three with one"), all candidate behaviors corresponding to the state data are all the "three with one" combinations that the character object currently required to play can put out; if the card type played by the previous player is a straight, all candidate behaviors corresponding to the state data are all the straights that the character object currently required to play can put out. In this embodiment, at each step of the game, the state data of the corresponding target character object and all candidate behaviors corresponding to the state data are acquired when the game simulator self-plays based on the neural network of the first card-playing model, and the game-play data generated during self-play is then obtained through the subsequent steps.
In step S112, based on the game strategy and the status data, decision data of each candidate behavior corresponding to the status data is acquired.
Wherein the game strategy is a game rule pre-packaged in the game simulator. The decision data is the expected value corresponding to the candidate behavior, namely the probability that the game wins when a card is played based on a certain candidate behavior. In the embodiment, based on the game rule and the state data packaged in advance, the decision data of each candidate behavior corresponding to the state data is acquired.
In step S113, a target behavior corresponding to the state data is determined according to the decision data of each candidate behavior.
Wherein the target behavior refers to a most favorable candidate behavior selected from all candidate behaviors in the state. In this embodiment, the target behavior corresponding to the state data is determined according to the decision data of each candidate behavior, and specifically, the candidate behavior with the highest decision data may be determined as the target behavior corresponding to the state data.
In step S114, the execution result after the target behavior is executed is acquired, and when the game ends, the game result is acquired.
The execution result refers to the result after the target behavior is executed, that is, the current win/loss outcome of executing the target behavior. The game result is the win/loss outcome when the game ends. In this embodiment, the target behavior obtained in the above steps is executed so as to obtain the execution result after the target behavior is executed; when the game is finished, the game result of the match is obtained, and thus the game-play data of each character object generated by the game simulator's self-play is obtained.
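As a non-limiting sketch (not part of the original disclosure) of how steps S111 to S114 could be combined in one self-play episode, the following Python example reuses the Transition and Episode classes sketched earlier and assumes a hypothetical game engine exposing reset, is_over, current_role, state_for, legal_actions, step and result, plus a value_net that returns a scalar score for a (state, action) pair.

```python
# Illustrative self-play episode; the engine and value_net interfaces are assumptions.
def self_play_episode(engine, value_net, episode):
    engine.reset()
    while not engine.is_over():
        role = engine.current_role()
        state = engine.state_for(role)           # hand + visible history for this role
        candidates = engine.legal_actions(role)  # all candidate behaviors in this state
        # Decision data: expected value of playing each candidate in this state.
        scores = [value_net(state, a) for a in candidates]
        target = candidates[scores.index(max(scores))]  # highest decision data wins
        episode.transitions.append(
            Transition(role=role, hand=state.hand, history=state.history,
                       action=target))
        engine.step(role, target)
    episode.assign_result(engine.result())       # game result known only at the end
    return episode
```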
In the above embodiment, the state data of the target character object and all candidate behaviors corresponding to the state data are acquired when the game simulator self-plays based on the neural network of the first card-playing model; the decision data of each candidate behavior corresponding to the state data is acquired based on the game strategy and the state data; the target behavior corresponding to the state data is determined according to the decision data of each candidate behavior; the execution result after the target behavior is executed is acquired; and the game result is acquired when the game is finished, so that all game-play data generated during the simulator's self-play is obtained. Model training therefore does not need to rely on the data and experience of human players, and because the game-play data used for training is generated by the model's own self-play, the selection of training data is more flexible.
In an exemplary embodiment, after acquiring the game play data generated by the game simulator playing itself at the remote location, the method further comprises: game play data is stored in a buffer corresponding to each character object based on the different character objects in the game. Training a neural network of the second card playing model by adopting a reinforcement learning algorithm to obtain a target card playing model with updated parameters of the neural network, and specifically comprising the following steps: and training the neural network of the second card-playing model corresponding to each role object in parallel by adopting a reinforcement learning algorithm based on the game data in the buffer zone corresponding to each role object to obtain the target card-playing model after the parameters of the neural network corresponding to each role object are updated. Specifically, because the corresponding card-playing strategies are different due to different character objects in the game, in order to enable the neural network to better learn the card-playing strategies of different character objects, the game-playing data of each character object can be stored in the corresponding buffer area based on different character objects in the game. And then training the neural network of the second card-playing model corresponding to the role object according to the game data in the buffer zone corresponding to the role object, so as to obtain the target card-playing model corresponding to the role object, and the learning capability of the model is improved.
In an exemplary embodiment, based on the game data in the buffer corresponding to each character object, a neural network of the second card-playing model corresponding to each character object is trained in parallel by using a reinforcement learning algorithm, so as to obtain a target card-playing model with updated parameters of the neural network corresponding to each character object, including: when the buffer area with the data volume reaching the set value exists, training the neural network of the second card-playing model corresponding to the role object in the buffer area by adopting a reinforcement learning algorithm based on the game data in the buffer area with the data volume reaching the set value, and thus obtaining the target card-playing model after the parameters of the neural network corresponding to the role object are updated. Specifically, in order to further improve the speed and learning efficiency of model training, the game data generated by the game simulator in the self-game process can be efficiently collected through the buffer areas corresponding to different character objects, the buffer areas storing the game data of different character objects are monitored, and only when the data amount in the buffer areas reaches a set value, the neural network corresponding to the character objects is trained based on the batch data in the buffer areas to obtain the target card-playing model of the corresponding character objects, so that not only is the speed of model training and the learning efficiency of the model improved, but also the robustness of the model is improved.
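The following Python sketch is an illustration (not the patent's implementation) of the per-character buffers described above, where a training step is triggered only once a buffer's data volume reaches the set value; the class and method names are assumptions.

```python
# Illustrative per-role buffers with a "set value" threshold; names are assumptions.
from collections import defaultdict

class RoleBuffers:
    def __init__(self, threshold: int):
        self.threshold = threshold
        self.buffers = defaultdict(list)   # one buffer per character object

    def add(self, transition) -> None:
        self.buffers[transition.role].append(transition)

    def ready_roles(self):
        # Roles whose buffer has reached the set value and can be trained on.
        return [r for r, buf in self.buffers.items() if len(buf) >= self.threshold]

    def drain(self, role):
        # Hand the accumulated batch to the trainer and clear the buffer.
        batch, self.buffers[role] = self.buffers[role], []
        return batch
```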
In an exemplary embodiment, in order to further improve the efficiency of model training, the neural networks of the second card-playing models corresponding to the character objects may be respectively trained in parallel in different threads, so that the learning efficiency of the models is accelerated through parallelization, and the training speed of the models is further improved.
In an exemplary embodiment, after obtaining the updated parameter of the neural network, the method further comprises: and transmitting the updated parameters of the neural network to a game simulator arranged at a far end, wherein the updated parameters of the neural network are used for indicating the game simulator to be updated so as to obtain the updated game simulator, and further acquiring game-play data generated by self-game of the updated game simulator arranged at the far end so as to continuously train the model, so that the model gradually learns how to win the opponent in the learning process so as to improve the performance of the model.
In an exemplary embodiment, to ensure strategy synchronization, the neural network parameters of the first card-playing model used to generate the game-play data should be consistent with the neural network parameters of the second card-playing model to be trained; therefore, the neural network parameters of the second card-playing model are generally copied to the remote game simulator to synchronize the neural network parameters of the first card-playing model with those of the second card-playing model. In order to ensure the correctness of the neural network parameters, this embodiment may apply a lock before the parameters of the neural network are updated, release the lock after the parameters have been updated, transmit the updated parameters of the neural network to the game simulator arranged at the remote end, and apply the lock again before the next update. That is, the lock is taken before the neural network is updated and released after the update, and only then are the updated network parameters synchronized to the remote game simulator, which prevents the remote game simulator from synchronizing to wrong network parameters and ensures the correctness of the synchronized network parameters.
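A minimal Python sketch of this locking idea is given below, assuming (as an illustration only) that the parameters are stored in a dictionary and that push_to_simulator stands for whatever transport carries the parameters to the remote simulator.

```python
# Illustrative lock around the parameter update; parameter layout and the
# push_to_simulator callable are assumptions, not the patent's API.
import threading

param_lock = threading.Lock()

def update_and_sync(learner_params, gradients, push_to_simulator, lr=1e-3):
    # learner_params / gradients: dicts mapping parameter names to numeric values.
    with param_lock:                           # lock before updating
        for name, grad in gradients.items():
            learner_params[name] = learner_params[name] - lr * grad
        snapshot = dict(learner_params)        # copied while still consistent
    push_to_simulator(snapshot)                # lock released; transmit new params
```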
In an exemplary embodiment, as shown in fig. 3, after the target card-playing model with updated parameters of the neural network corresponding to each character object is obtained, the method further includes the following steps:
in step S310, state data of a character object corresponding to the target card-playing model in the actual game scene and all candidate behaviors corresponding to the state data are acquired.
The actual game scene refers to a scene in which games are played against human characters through the target card-playing model. The state data of the character object refers to the hand information possessed by the character object corresponding to the target card-playing model, the historical card-playing information, the cards not yet played by other character objects, the cards played by the previous character object, the number of special card types already played, and the like. All candidate behaviors corresponding to the state data are the behaviors with which the character object corresponding to the target card-playing model can play cards in the current state, that is, the card types that can be played. In this embodiment, the state data of the character object corresponding to the target card-playing model in the actual game scene and all candidate behaviors corresponding to the state data are obtained, and the target behavior for playing cards is determined through the subsequent steps.
In step S320, the state data and all the candidate behaviors corresponding to the state data are input into the target card-playing model, so as to obtain decision data of each candidate behavior corresponding to the state data output by the target card-playing model.
Specifically, the acquired state data of the character object and all candidate behaviors corresponding to the state data are input into the target card-playing model of that character object, so as to obtain the decision data, output by the target card-playing model, of each candidate behavior corresponding to the state data. For example, in a Fight the Landlord game, if the card type played by the previous player is a trio with a single kicker ("three with one"), it is determined that the card type currently to be played by the target card-playing model is also "three with one"; each "three with one" that can currently be played is input into the target card-playing model as a candidate behavior, and the target card-playing model outputs the decision data of each candidate behavior based on the current state data.
In step S330, a target behavior meeting the condition is determined according to the decision data of each candidate behavior for dealing.
The decision data is the expected value corresponding to a candidate behavior, i.e., the probability of winning the game when a card is played based on that candidate behavior. Therefore, in this embodiment, the corresponding target behavior can be determined according to the decision data of each candidate behavior for playing cards. Specifically, the candidate behavior with the highest decision data can be determined according to the decision data of each candidate behavior and taken as the target behavior for playing, thereby improving the card-playing accuracy.
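As a non-limiting illustration of steps S310 to S330, the short Python sketch below scores every legal candidate with the trained model and plays the one with the highest decision data; target_model is assumed, purely for this example, to return a scalar expected value for a (state, action) pair.

```python
# Illustrative action selection with the trained target card-playing model.
def choose_play(target_model, state, candidate_actions):
    decision_data = [target_model(state, action) for action in candidate_actions]
    best_index = max(range(len(candidate_actions)), key=lambda i: decision_data[i])
    return candidate_actions[best_index]   # target behavior meeting the condition
```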
In the above embodiment, the state data of the character object corresponding to the target card playing model and all the candidate behaviors corresponding to the state data in the actual game scene are acquired, the state data and all the candidate behaviors corresponding to the state data are input into the target card playing model, the decision data of each candidate behavior output by the target card playing model and corresponding to the state data are obtained, and then the target behavior meeting the conditions is determined according to the decision data of each candidate behavior for playing cards, so that the card playing accuracy is improved, the level of the human character is matched, and interesting game experience is brought to the human character.
It should be understood that although the steps in the flowcharts of fig. 1-3 are shown in the order indicated by the arrows, these steps are not necessarily performed in that order. Unless explicitly stated otherwise herein, the steps are not strictly limited to the order shown and described and may be performed in other orders. Moreover, at least some of the steps in fig. 1-3 may include multiple sub-steps or stages, which are not necessarily performed at the same moment but may be performed at different moments, and which are not necessarily performed in sequence but may be performed in turn or alternately with other steps or with at least some of the sub-steps or stages of other steps.
FIG. 4 is a block diagram illustrating a data processing apparatus of a game model according to an exemplary embodiment. Referring to fig. 4, the apparatus includes a game-play data acquisition module 402 and a training module 404.
a game-play data acquisition module 402 configured to acquire game-play data generated through self-play by a game simulator arranged at a remote end, where the game-play data includes the state data of each character object, the target behavior corresponding to the state data, and the game result obtained when the game simulator self-plays based on a neural network of a first card-playing model;
a training module 404 configured to perform, based on the game result, inputting the state data of the role objects and the target behaviors corresponding to the state data into a neural network of a second card-playing model having the same neural network parameters as those of the first card-playing model, where the neural network of the first card-playing model is obtained by synchronizing the neural network parameters of the second card-playing model; and training the neural network of the second card playing model by adopting a reinforcement learning algorithm to obtain a target card playing model with updated parameters of the neural network.
In an exemplary embodiment, the game-play data acquisition module is configured to perform: acquiring state data of a corresponding target character object and all candidate behaviors corresponding to the state data when the game simulator arranged at a remote end self-plays based on a neural network of a first card-playing model; obtaining, based on the game strategy and the state data, decision data of each candidate behavior corresponding to the state data; determining a target behavior corresponding to the state data according to the decision data of each candidate behavior; and obtaining an execution result after the target behavior is executed until the game is finished, and obtaining a game result of the match.
In an exemplary embodiment, the apparatus further comprises a storage module configured to perform: storing the game play data in a buffer corresponding to each character object based on different character objects in the game; the training module is configured to perform: and training the neural network of the second card-playing model corresponding to each role object in parallel by adopting a reinforcement learning algorithm based on the game data in the buffer zone corresponding to each role object to obtain the target card-playing model after the parameters of the neural network corresponding to each role object are updated.
In an exemplary embodiment, the training module is further configured to perform: when the buffer area with the data volume reaching the set value exists, training the neural network of the second card-playing model corresponding to the role object in the buffer area by adopting a reinforcement learning algorithm based on the game data in the buffer area with the data volume reaching the set value, and obtaining the target card-playing model after the parameters of the neural network corresponding to the role object are updated.
In an exemplary embodiment, the apparatus further comprises an update module configured to perform: after a target card-playing model with updated parameters of the neural network is obtained, transmitting the updated parameters of the neural network to the game simulator arranged at the remote end, wherein the updated parameters of the neural network are used for indicating that the game simulator is to be updated; the game-play data acquisition module is further configured to perform: acquiring game-play data generated through self-play by the updated game simulator arranged at the remote end.
In an exemplary embodiment, the update module is further configured to perform: locking setting is carried out before parameters of the neural network are updated; and releasing the lock until the parameters of the neural network are updated, and transmitting the updated parameters of the neural network to a game simulator arranged at a remote end.
In an exemplary embodiment, the apparatus further comprises a model application module configured to perform: acquiring state data of a character object corresponding to the target card-playing model in an actual game scene and all candidate behaviors corresponding to the state data; inputting the state data and all candidate behaviors corresponding to the state data into the target card-playing model to obtain decision data, output by the target card-playing model, of each candidate behavior corresponding to the state data; and determining a target behavior meeting a condition according to the decision data of each candidate behavior for playing cards.
In an exemplary embodiment, the model application module is further configured to perform: determining the candidate behavior with the highest decision data according to the decision data of each candidate behavior; and taking the candidate behavior with the highest decision data as a target behavior to play the cards.
With regard to the apparatus in the above-described embodiment, the specific manner in which each module performs the operation has been described in detail in the embodiment related to the method, and will not be elaborated here.
The following further illustrates the solution of the present disclosure with a specific embodiment. As shown in fig. 5, the learner obtains the match data of each character object generated by the game simulator at the remote end self-playing based on the neural network of the first card-playing model, and then performs model training on the neural network of the local second card-playing model, so as to obtain the target card-playing model with updated neural network parameters. Specifically, the simulator and the learner each maintain a local model; for convenience of explanation, in this embodiment the model in the simulator is defined as the first card-playing model and the model in the learner is defined as the second card-playing model. The simulator self-plays with the first card-playing model and thereby generates game-play data. The learner receives the game-play data generated by the simulator and trains the second card-playing model through a reinforcement learning algorithm. Because the learner's learning speed is higher than the speed at which the simulator generates game-play data, the simulator can be run in multiple processes, which improves the simulation speed and accelerates the training process.
The generation of game-play data by the simulator and the training performed by the learner are mainly explained below. First, the simulator copies the parameters of the neural network in the learner to its local model to synchronize the two models. The simulator then self-plays with the synchronized model to generate game-play data. Specifically, taking a Fight the Landlord game as an example, a Fight the Landlord game engine is implemented in the simulator. After the simulator has synchronized the model parameters from the learner, the state data of the character corresponding to the model and all candidate behaviors corresponding to the state data are taken as the input of the neural network. The neural network outputs the decision data corresponding to each candidate behavior in that state, and the candidate behavior with the highest decision data is selected for playing. After the game engine is started, one character plays cards at each step, and the model corresponding to that character plays in the manner described above; after the game engine receives the cards played by that character, it advances to the next step, where the next character plays, and so on until the game ends. The state data, the play (target behavior), and the final win/loss outcome (game result) of the corresponding character at each step of the game are sent to the learner for updating the model. The state data consists of environment features, including hand information and historical card-playing information; the target behavior is the card type chosen by the strategy; the final result is the win/loss or score of the game. To speed up simulation, the simulator may be run in multiple processes. Each sub-process interacts with the learner and simulates independently, and the resulting data is ultimately gathered into the learner for training.
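The following Python sketch illustrates, under stated assumptions, one way several simulator processes could run in parallel and feed a single learner; run_one_self_play_game and the queue-based hand-off are hypothetical placeholders, not the patent's implementation.

```python
# Illustrative multi-process simulator pool; names and structure are assumptions.
import multiprocessing as mp

def simulator_worker(episode_queue, num_games: int):
    for _ in range(num_games):
        episode = run_one_self_play_game()   # hypothetical: plays one full match
        episode_queue.put(episode)           # hand the game-play data to the learner

def run_simulators(num_workers: int, num_games: int):
    queue = mp.Queue()
    workers = [mp.Process(target=simulator_worker, args=(queue, num_games))
               for _ in range(num_workers)]
    for w in workers:
        w.start()
    return queue, workers   # the learner consumes episodes from `queue`
```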
Based on the data collected by the simulator, the learner updates the model using a reinforcement learning algorithm. The learner maintains and updates a state-behavior value neural network. The input of the network is the state data and a candidate behavior, and the output is the value (i.e., the decision data) of that candidate behavior under the state data. The value represents the expected feedback of taking this behavior in that state. Specifically, as shown in fig. 6, the neural network is composed of an LSTM (Long Short-Term Memory) network and a multi-layer MLP (Multi-Layer Perceptron). The state features of the neural network include the current hand of the corresponding character, the cards that have already been played (i.e., the historical plays), the cards not yet played by the opponents, the cards played in the previous move, and the number of special card types already played.
The historical card-playing information is encoded by the LSTM network. Specifically, the historical card-playing information can be regarded as a sequence; the sequence is fed into the LSTM network element by element, and the LSTM outputs an encoded feature representation, i.e., a low-dimensional continuous vector. This vector and the hand information are fed into the multi-layer MLP, which finally outputs the value. The value is the expected return of playing a certain hand in that state, i.e., how likely that play is to win the game.
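A minimal PyTorch sketch of this architecture is shown below: an LSTM encodes the historical card-play sequence into a low-dimensional vector, which is concatenated with the hand/state features and a candidate action and passed through a multi-layer MLP that outputs a single value. The feature sizes and layer widths are illustrative assumptions, not values given in the patent.

```python
# Illustrative LSTM + MLP state-behavior value network; sizes are assumptions.
import torch
import torch.nn as nn

class CardValueNet(nn.Module):
    def __init__(self, history_dim=54, state_dim=162, hidden_dim=128):
        super().__init__()
        self.lstm = nn.LSTM(history_dim, hidden_dim, batch_first=True)
        self.mlp = nn.Sequential(
            nn.Linear(hidden_dim + state_dim, 512), nn.ReLU(),
            nn.Linear(512, 512), nn.ReLU(),
            nn.Linear(512, 1),               # value of the candidate behavior
        )

    def forward(self, history, state_and_action):
        # history: (batch, seq_len, history_dim) encoded past card plays
        # state_and_action: (batch, state_dim) hand features + candidate action encoding
        _, (h_n, _) = self.lstm(history)
        encoded = h_n[-1]                    # final hidden state of the last layer
        return self.mlp(torch.cat([encoded, state_and_action], dim=-1))
```

In use, the candidate behaviors for one state can be scored by repeating the state features for each candidate and taking the one with the highest output, matching the action-selection rule described above.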
The neural network is randomly initialized from a Gaussian distribution in the initial state. Based on the data generated by the simulator, the learner then uses gradient descent to update the parameters of the neural network. The algorithm can be implemented in a Monte Carlo (MC) fashion by directly fitting the corresponding value for each state and behavior. Specifically, for each match, a reward such as the win/loss or the score of the game is obtained at the end of the game. The algorithm takes this end-of-game reward as the target and updates the neural network with a mean squared error (MSE) loss function.
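The following PyTorch sketch illustrates this Monte-Carlo style update under the same assumptions as the CardValueNet sketch above: the value predicted for each recorded (state, action) pair of a finished match is regressed toward the end-of-game reward with an MSE loss. The batch tensor shapes are assumptions made for the example.

```python
# Illustrative MC update: fit predicted values to the end-of-game reward with MSE.
import torch
import torch.nn.functional as F

def mc_update(value_net, optimizer, history, state_and_action, final_reward):
    # final_reward: (batch, 1) win/loss or score obtained when the game ended
    predicted = value_net(history, state_and_action)
    loss = F.mse_loss(predicted, final_reward)   # regress value toward the MC return
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()
```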
Since a game generally includes a plurality of characters, and the card-playing strategy of each character may be different, the learner may deal with different characters by training a plurality of independent neural networks, respectively. To speed up training, a buffer mechanism may also be employed to efficiently collect data. Specifically, the data generated by the simulator is buffered in different buffers according to the role. When the amount of data in the buffer reaches a certain amount, the algorithm will use a batch of data in the buffer together for training. Thereby accelerating learning efficiency through parallelization. To further improve efficiency, the updates of each model may be performed in parallel under different threads.
In addition, there is a large amount of data interaction between the simulator and the learner: the simulator needs to copy the network parameters from the learner frequently, and the learner needs to obtain a large amount of game-play data from the simulator. Because training and simulation are performed concurrently by multiple processes, a locking mechanism and a queue mechanism can be used to ensure the correctness and efficiency of data transmission.
In particular, the replication of network parameters may employ a locking mechanism to ensure the correctness of the parameters. The locking mechanism guarantees that each process has exclusive use of a resource: when a process requests a resource that is already locked, it must wait until the lock is released, and when a process successfully acquires a resource, it locks the resource against use by other processes until it is done with it. In this embodiment, in order to prevent the simulator from synchronizing to incorrect network parameters, the learner locks the network parameters each time they are updated, releases the lock after the update, and only then transmits the updated parameters to the remote simulator, thereby ensuring the correctness of model parameter transmission.
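A minimal sketch of this locking scheme follows, assuming the networks are PyTorch modules, that `update_fn` is a hypothetical callable performing one training step (for example the MC update above), and that `simulators` are hypothetical remote handles exposing a `sync_parameters` method. In a real multi-process deployment a cross-process or distributed lock would play the role of the thread lock used here.

```python
import copy
import threading

param_lock = threading.Lock()   # stand-in for a cross-process/distributed lock

def learner_update_and_publish(net, update_fn, simulators):
    """Learner side: hold the lock while the parameters are updated, release it,
    then transmit the new parameters to the remote simulators."""
    with param_lock:                              # no simulator may copy half-updated weights
        update_fn(net)
        snapshot = copy.deepcopy(net.state_dict())
    for sim in simulators:                        # transmit only after the lock is released
        sim.sync_parameters(snapshot)

def simulator_copy_parameters(local_net, learner_net):
    """Simulator side: copy the synchronized parameters under the same lock."""
    with param_lock:
        local_net.load_state_dict(learner_net.state_dict())
```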
The transfer of game-play data employs a shared buffer and a queue mechanism to ensure that data is transferred between the simulator and the learner correctly and efficiently. In particular, a shared buffer is used; it may include multiple entries and may be read and written simultaneously by the simulator and the learner. In addition, a full queue and an empty queue are designed: when the simulator has filled an entry, it transmits the entry number to the learner through the full queue, and the learner receives the entry number from the full queue and reads the data from the entry with that number. After the learner successfully reads the data, it transmits the read entry number back to the simulator through the empty queue. After receiving an entry number from the empty queue, the simulator can fill that entry with further data, thereby ensuring the correctness of the game-play data transmission.
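This full-queue/empty-queue arrangement can be sketched with Python's multiprocessing primitives; the number of entries, the entry size and the flat-float encoding of the game-play data are illustrative assumptions.

```python
import multiprocessing as mp

NUM_ENTRIES, ENTRY_SIZE = 32, 4096          # illustrative sizes, not values from the disclosure

def make_channel():
    """Create the shared buffer plus the full/empty queues described above."""
    buffer = mp.Array('f', NUM_ENTRIES * ENTRY_SIZE)   # shared buffer with multiple entries
    full_q, empty_q = mp.Queue(), mp.Queue()
    for i in range(NUM_ENTRIES):                       # initially every entry is free
        empty_q.put(i)
    return buffer, full_q, empty_q

def simulator_send(buffer, full_q, empty_q, samples):
    """Simulator side: wait for a free entry, fill it, announce it on the full queue."""
    idx = empty_q.get()                                # blocks until an entry number is available
    start = idx * ENTRY_SIZE
    buffer[start:start + ENTRY_SIZE] = samples         # samples: a flat list of ENTRY_SIZE floats
    full_q.put(idx)                                    # tell the learner which entry is ready

def learner_receive(buffer, full_q, empty_q):
    """Learner side: read the announced entry, then hand its number back for reuse."""
    idx = full_q.get()                                 # entry number announced by the simulator
    start = idx * ENTRY_SIZE
    samples = list(buffer[start:start + ENTRY_SIZE])
    empty_q.put(idx)                                   # entry can now be refilled by the simulator
    return samples
```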
After repeated iterations, the card-playing level of the neural network maintained by the learner gradually improves; iteration stops once the stopping conditions are met, yielding a trained target card-playing model that can be applied to an actual game scene to play against human characters.
The present disclosure generates a large amount of game-play data with the simulator and iteratively updates the neural network model with the RL algorithm. Compared with rule-based or supervised-learning approaches, the present disclosure generates data through self-play simulation and therefore does not rely on human data or experience. Compared with algorithms based on policy gradient or Q-learning, the present disclosure adopts the MC algorithm, which can effectively handle the complex card combinations in a fight-the-landlord game, and the structures of the simulator and the learner are designed to improve training efficiency. During learning, the model gradually learns how to defeat its opponents, so that its strength is progressively enhanced until it matches the level of human characters, bringing an interesting game experience to the human characters.
FIG. 7 is a block diagram illustrating an apparatus Z00 for a data processing method for a game model according to an example embodiment. For example, device Z00 may be a mobile phone, computer, digital broadcast terminal, messaging device, game console, tablet device, medical device, fitness device, personal digital assistant, and the like.
Referring to fig. 7, device Z00 may include one or more of the following components: a processing component Z02, a memory Z04, a power component Z06, a multimedia component Z08, an audio component Z10, an interface for input/output (I/O) Z12, a sensor component Z14 and a communication component Z16.
The processing component Z02 generally controls the overall operation of the device Z00, such as operations associated with display, telephone calls, data communications, camera operations and recording operations. The processing component Z02 may include one or more processors Z20 to execute instructions to perform all or part of the steps of the method described above. Further, the processing component Z02 may include one or more modules that facilitate interaction between the processing component Z02 and other components. For example, the processing component Z02 may include a multimedia module to facilitate interaction between the multimedia component Z08 and the processing component Z02.
The memory Z04 is configured to store various types of data to support operations at device Z00. Examples of such data include instructions for any application or method operating on device Z00, contact data, phonebook data, messages, pictures, videos, etc. The memory Z04 may be implemented by any type or combination of volatile or non-volatile memory devices such as Static Random Access Memory (SRAM), electrically erasable programmable read-only memory (EEPROM), erasable programmable read-only memory (EPROM), programmable read-only memory (PROM), read-only memory (ROM), magnetic memory, flash memory, magnetic or optical disks.
The power supply component Z06 provides power to the various components of the device Z00. The power component Z06 may include a power management system, one or more power supplies, and other components associated with generating, managing, and distributing power for the device Z00.
The multimedia component Z08 comprises a screen providing an output interface between the device Z00 and the user. In some embodiments, the screen may include a Liquid Crystal Display (LCD) and a Touch Panel (TP). If the screen includes a touch panel, the screen may be implemented as a touch screen to receive an input signal from a user. The touch panel includes one or more touch sensors to sense touch, slide, and gestures on the touch panel. The touch sensor may not only sense the boundary of a touch or slide action, but also detect the duration and pressure associated with the touch or slide operation. In some embodiments, the multimedia component Z08 includes a front facing camera and/or a rear facing camera. When device Z00 is in an operating mode, such as a capture mode or a video mode, the front-facing camera and/or the rear-facing camera may receive external multimedia data. Each front camera and rear camera may be a fixed optical lens system or have a focal length and optical zoom capability.
The audio component Z10 is configured to output and/or input audio signals. For example, the audio component Z10 includes a microphone (MIC) configured to receive external audio signals when the device Z00 is in an operational mode, such as a call mode, a recording mode, or a voice recognition mode. The received audio signal may further be stored in the memory Z04 or transmitted via the communication component Z16. In some embodiments, the audio component Z10 further includes a speaker for outputting audio signals.
The I/O interface Z12 provides an interface between the processing component Z02 and peripheral interface modules, which may be keyboards, click wheels, buttons, etc. These buttons may include, but are not limited to: a home button, a volume button, a start button, and a lock button.
The sensor assembly Z14 includes one or more sensors for providing various aspects of status assessment for the device Z00. For example, the sensor assembly Z14 may detect the open/closed state of the device Z00 and the relative positioning of components, such as the display and keypad of the device Z00; the sensor assembly Z14 may also detect a change in the position of the device Z00 or of one of its components, the presence or absence of user contact with the device Z00, the orientation or acceleration/deceleration of the device Z00, and a change in the temperature of the device Z00. The sensor assembly Z14 may include a proximity sensor configured to detect the presence of a nearby object without any physical contact. The sensor assembly Z14 may also include a light sensor, such as a CMOS or CCD image sensor, for use in imaging applications. In some embodiments, the sensor assembly Z14 may also include an acceleration sensor, a gyroscope sensor, a magnetic sensor, a pressure sensor, or a temperature sensor.
The communication component Z16 is configured to facilitate wired or wireless communication between device Z00 and other devices. Device Z00 may access a wireless network based on a communication standard, such as WiFi, a carrier network (such as 2G, 3G, 4G, or 5G), or a combination thereof. In an exemplary embodiment, the communication component Z16 receives a broadcast signal or broadcast related information from an external broadcast management system via a broadcast channel. In an exemplary embodiment, the communication component Z16 further includes a Near Field Communication (NFC) module to facilitate short-range communications. For example, the NFC module may be implemented based on Radio Frequency Identification (RFID) technology, infrared data association (IrDA) technology, Ultra Wideband (UWB) technology, Bluetooth (BT) technology, and other technologies.
In an exemplary embodiment, the device Z00 may be implemented by one or more Application Specific Integrated Circuits (ASICs), Digital Signal Processors (DSPs), Digital Signal Processing Devices (DSPDs), Programmable Logic Devices (PLDs), Field Programmable Gate Arrays (FPGAs), controllers, micro-controllers, microprocessors or other electronic components for performing the above-described methods.
In an exemplary embodiment, there is also provided a computer readable storage medium, such as the memory Z04, comprising instructions executable by the processor Z20 of the device Z00 to perform the above method. For example, the computer readable storage medium may be a ROM, a Random Access Memory (RAM), a CD-ROM, a magnetic tape, a floppy disk, an optical data storage device, and the like.
In an exemplary embodiment, there is also provided a computer program product comprising a computer program which, when executed by a processor, implements the data processing method of the game model described above.
Other embodiments of the disclosure will be apparent to those skilled in the art from consideration of the specification and practice of the disclosure disclosed herein. This disclosure is intended to cover any variations, uses, or adaptations of the disclosure following, in general, the principles of the disclosure and including such departures from the present disclosure as come within known or customary practice within the art to which the disclosure pertains. It is intended that the specification and examples be considered as exemplary only, with a true scope and spirit of the disclosure being indicated by the following claims.
It will be understood that the present disclosure is not limited to the precise arrangements described above and shown in the drawings and that various modifications and changes may be made without departing from the scope thereof. The scope of the present disclosure is limited only by the appended claims.

Claims (19)

1. A data processing method for a game model, the method comprising:
obtaining game matching data generated by a game simulator which is arranged at a far end in a self-game mode, wherein the game matching data comprises state data of each character object, target behaviors corresponding to the state data and game matching results when the game simulator conducts the self-game based on a neural network of a first card playing model, and the state data at least comprises hand information held by the corresponding character object and historical card playing information;
on the basis of the game result, inputting the state data of each role object and the target behavior corresponding to the state data into a neural network of a second card-playing model with the same parameters as those of the neural network of the first card-playing model, wherein the neural network of the first card-playing model is obtained by synchronizing the neural network parameters of the second card-playing model;
and training the neural network of the second card playing model by adopting a reinforcement learning algorithm to obtain a target card playing model with updated parameters of the neural network.
2. The method of claim 1, wherein the obtaining of game-play data generated by self-game of a game simulator arranged at a remote end comprises:
acquiring state data of a corresponding target role object and all candidate behaviors corresponding to the state data when the game simulator arranged at a far end plays a game based on a neural network of a first card-playing model;
based on game strategy and state data, obtaining decision data of each candidate behavior corresponding to the state data;
determining a target behavior corresponding to the state data according to the decision data of each candidate behavior;
and obtaining an execution result after the target behavior is executed until the game is finished, and obtaining a game-playing result of the game.
3. The method of claim 1, wherein after the obtaining of game-play data generated by self-game of a game simulator arranged at a remote end, the method further comprises:
storing the game play data in a buffer corresponding to each character object based on different character objects in the game;
the training of the neural network of the second card-playing model by adopting the reinforcement learning algorithm to obtain the target card-playing model with updated parameters of the neural network comprises the following steps:
and training the neural network of the second card-playing model corresponding to each role object in parallel by adopting a reinforcement learning algorithm based on the game data in the buffer zone corresponding to each role object to obtain the target card-playing model after the parameters of the neural network corresponding to each role object are updated.
4. The method of claim 3, wherein the training the neural network of the second card-playing model corresponding to each character object in parallel by using a reinforcement learning algorithm based on the game data in the buffer corresponding to each character object to obtain the updated parameters of the neural network corresponding to each character object, comprises:
when the buffer area with the data volume reaching the set value exists, training a neural network of a second card-playing model corresponding to the role object in the buffer area by adopting a reinforcement learning algorithm based on the game data in the buffer area with the data volume reaching the set value, and obtaining a target card-playing model after the parameters of the neural network corresponding to the role object are updated.
5. The method of claim 1, wherein after the target card-playing model with updated parameters of the neural network is obtained, the method further comprises:
transmitting the updated parameters of the neural network to a game simulator arranged at a remote end, wherein the updated parameters of the neural network are used for indicating to update the game simulator;
the acquiring of game-play data generated by self-play of a game simulator arranged at a remote end comprises the following steps:
and acquiring game-play data generated by self-game of the updated game simulator arranged at a remote end.
6. The method of claim 5, wherein transmitting the updated parameters of the neural network to a remotely located game simulator comprises:
locking setting is carried out before parameters of the neural network are updated; and releasing the lock until the parameters of the neural network are updated, and transmitting the updated parameters of the neural network to a game simulator arranged at a remote end.
7. The method of any of claims 1 to 6, wherein after the target card-playing model with updated parameters of the neural network corresponding to each character object is obtained, the method further comprises:
acquiring state data of a character object corresponding to the target card-playing model in an actual game scene and all candidate behaviors corresponding to the state data;
inputting the state data and all candidate behaviors corresponding to the state data into the target card-playing model to obtain decision data of each candidate behavior corresponding to the state data and output by the target card-playing model;
and determining the target behaviors meeting the conditions according to the decision data of each candidate behavior for playing cards.
8. The method of claim 7, wherein determining the target behavior meeting the condition to play according to the decision data of each candidate behavior comprises:
determining the candidate behavior with the highest decision data according to the decision data of each candidate behavior;
and taking the candidate behavior with the highest decision data as a target behavior to play the card.
9. A data processing apparatus of a game model, comprising:
the game playing system comprises a game playing data acquisition module, a game playing data acquisition module and a game playing data processing module, wherein the game playing data acquisition module is configured to execute the game playing data generated by self-gaming of a game simulator arranged at a remote end, the game playing data comprises state data of each character object, target behaviors corresponding to the state data and a game playing result when the game simulator performs self-gaming based on a neural network of a first playing model, and the state data at least comprises hand information and historical playing information held by the corresponding character object;
a training module configured to perform inputting the state data of each character object and the target behavior corresponding to the state data into a neural network of a second card-playing model having the same neural network parameters as those of the first card-playing model based on the game result, wherein the neural network of the first card-playing model is obtained by synchronizing the neural network parameters of the second card-playing model; and training the neural network of the second card playing model by adopting a reinforcement learning algorithm to obtain a target card playing model with updated parameters of the neural network.
10. The apparatus of claim 9, wherein the game-play data acquisition module is configured to perform:
acquiring state data of a corresponding target role object and all candidate behaviors corresponding to the state data when the game simulator arranged at a far end plays a game based on a neural network of a first card-playing model;
based on game strategies and state data, obtaining decision data of each candidate behavior corresponding to the state data;
determining a target behavior corresponding to the state data according to the decision data of each candidate behavior;
and obtaining an execution result after the target behavior is executed until the game is finished, and obtaining a game-playing result of the game.
11. The apparatus of claim 9, further comprising a storage module configured to perform:
storing the game play data in a buffer corresponding to each character object based on different character objects in the game;
the training module is configured to perform: and training the neural network of the second card-playing model corresponding to each role object in parallel by adopting a reinforcement learning algorithm based on the game data in the buffer zone corresponding to each role object to obtain the target card-playing model after the parameters of the neural network corresponding to each role object are updated.
12. The apparatus of claim 11, wherein the training module is further configured to perform:
when the buffer area with the data volume reaching the set value exists, training the neural network of the second card-playing model corresponding to the role object in the buffer area by adopting a reinforcement learning algorithm based on the game data in the buffer area with the data volume reaching the set value, and obtaining the target card-playing model after the parameters of the neural network corresponding to the role object are updated.
13. The apparatus of claim 9, further comprising an update module configured to perform:
after a target card playing model with updated parameters of the neural network is obtained, transmitting the updated parameters of the neural network to a game simulator arranged at a far end, wherein the updated parameters of the neural network are used for indicating to update the game simulator;
the office data acquisition module is further configured to perform: and acquiring game-play data generated by self-game of the updated game simulator arranged at a remote end.
14. The apparatus of claim 13, wherein the update module is further configured to perform:
locking setting is carried out before parameters of the neural network are updated; and releasing the lock until the parameters of the neural network are updated, and transmitting the updated parameters of the neural network to a game simulator arranged at a remote end.
15. The apparatus according to any of claims 9 to 14, wherein the apparatus further comprises a model application module configured to perform:
acquiring state data of a role object corresponding to the target card playing model in an actual game scene and all candidate behaviors corresponding to the state data;
inputting the state data and all candidate behaviors corresponding to the state data into the target card-playing model to obtain decision data of each candidate behavior corresponding to the state data and output by the target card-playing model;
and determining the target behaviors meeting the conditions according to the decision data of each candidate behavior for playing cards.
16. The apparatus of claim 15, wherein the model application module is further configured to perform:
determining the candidate behavior with the highest decision data according to the decision data of each candidate behavior;
and taking the candidate behavior with the highest decision data as a target behavior to play the cards.
17. An electronic device, comprising:
a processor;
a memory for storing the processor-executable instructions;
wherein the processor is configured to execute the instructions to implement a data processing method of a game model as claimed in any one of claims 1 to 8.
18. A computer-readable storage medium, wherein instructions in the computer-readable storage medium, when executed by a processor of an electronic device, enable the electronic device to perform the data processing method of a game model of any one of claims 1 to 8.
19. A computer program product comprising a computer program, characterized in that the computer program, when being executed by a processor, implements a data processing method of a game model according to any one of claims 1 to 8.
CN202110228510.4A 2021-03-02 2021-03-02 Data processing method and device of game model, electronic equipment and storage medium Active CN113159313B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202110228510.4A CN113159313B (en) 2021-03-02 2021-03-02 Data processing method and device of game model, electronic equipment and storage medium

Publications (2)

Publication Number Publication Date
CN113159313A CN113159313A (en) 2021-07-23
CN113159313B true CN113159313B (en) 2022-09-09

Family

ID=76883781

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202110228510.4A Active CN113159313B (en) 2021-03-02 2021-03-02 Data processing method and device of game model, electronic equipment and storage medium

Country Status (1)

Country Link
CN (1) CN113159313B (en)

Citations (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN103530520A (en) * 2013-10-16 2014-01-22 腾讯科技(深圳)有限公司 Method and terminal for obtaining data
CN106055339A (en) * 2016-06-08 2016-10-26 天津联众逸动科技发展有限公司 Method for determining card playing strategy of computer player in two-against-one game
CN110119804A (en) * 2019-05-07 2019-08-13 安徽大学 A kind of Ai Ensitan chess game playing algorithm based on intensified learning
CN110404264A (en) * 2019-07-25 2019-11-05 哈尔滨工业大学(深圳) It is a kind of based on the virtually non-perfect information game strategy method for solving of more people, device, system and the storage medium self played a game
CN110841295A (en) * 2019-11-07 2020-02-28 腾讯科技(深圳)有限公司 Data processing method based on artificial intelligence and related device

Family Cites Families (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US7892080B1 (en) * 2006-10-24 2011-02-22 Fredrik Andreas Dahl System and method for conducting a game including a computer-controlled player
CA3100315A1 (en) * 2018-05-14 2019-11-21 Angel Playing Cards Co., Ltd. Table game management system and game management system
CN109621422B (en) * 2018-11-26 2021-09-17 腾讯科技(深圳)有限公司 Electronic chess and card decision model training method and device and strategy generation method and device
CN111729300A (en) * 2020-06-24 2020-10-02 贵州大学 Monte Carlo tree search and convolutional neural network based bucket owner strategy research method

Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
Research on multi-player computer game algorithms based on deep networks; Chai Huayun, Wang Fucheng; Intelligent Technology (《智能技术》); 2020-12-31; pp. 213-215 *

Also Published As

Publication number Publication date
CN113159313A (en) 2021-07-23

Similar Documents

Publication Publication Date Title
CN103657087B (en) Formula narration environment on the spot in person
CN109091869A (en) Method of controlling operation, device, computer equipment and the storage medium of virtual objects
CN101631164A (en) User device for gesture based exchange of information, related method, device and system
CN106030581A (en) Automatic context sensitive search for application assistance
CN110841295B (en) Data processing method based on artificial intelligence and related device
CN112221152A (en) Artificial intelligence AI model training method, device, equipment and medium
US11468738B2 (en) System and method for playing online game
CN112870721B (en) Game interaction method, device, equipment and storage medium
CN112221140B (en) Method, device, equipment and medium for training action determination model of virtual object
CN110826717A (en) Game service execution method, device, equipment and medium based on artificial intelligence
KR20100116970A (en) Game system of escaping from the game space
CN110598853B (en) Model training method, information processing method and related device
CN115581922A (en) Game character control method, device, storage medium and electronic equipment
CN114917586A (en) Model training method, object control method, device, medium, and apparatus
CN113509726B (en) Interaction model training method, device, computer equipment and storage medium
CN113159313B (en) Data processing method and device of game model, electronic equipment and storage medium
CA3087629C (en) System for managing user experience and method therefor
CN110263937B (en) Data processing method, device and storage medium
WO2023087912A1 (en) Data synchronization method and apparatus, and device and medium
CN114404977B (en) Training method of behavior model and training method of structure capacity expansion model
CN111840997B (en) Processing system, method, device, electronic equipment and storage medium for game
CN112995687A (en) Interaction method, device, equipment and medium based on Internet
CN110193192A (en) A kind of automated game method and apparatus
CN116883561B (en) Animation generation method, training method, device and equipment of action controller
CN113641273B (en) Knowledge propagation method, apparatus, device and computer readable storage medium

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant