CN109107161B - Game object control method, device, medium and equipment

Publication number
CN109107161B
Authority
CN
China
Prior art keywords
game
action
image
frame
model
Prior art date
Legal status
Active
Application number
CN201810942957.6A
Other languages
Chinese (zh)
Other versions
CN109107161A (en)
Inventor
黄盈
周大军
Current Assignee
Shenzhen Tencent Network Information Technology Co Ltd
Original Assignee
Shenzhen Tencent Network Information Technology Co Ltd
Priority date
Filing date
Publication date
Application filed by Shenzhen Tencent Network Information Technology Co Ltd filed Critical Shenzhen Tencent Network Information Technology Co Ltd
Priority to CN201810942957.6A priority Critical patent/CN109107161B/en
Publication of CN109107161A publication Critical patent/CN109107161A/en
Application granted granted Critical
Publication of CN109107161B publication Critical patent/CN109107161B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical

Classifications

    • A HUMAN NECESSITIES
    • A63 SPORTS; GAMES; AMUSEMENTS
    • A63F CARD, BOARD, OR ROULETTE GAMES; INDOOR GAMES USING SMALL MOVING PLAYING BODIES; VIDEO GAMES; GAMES NOT OTHERWISE PROVIDED FOR
    • A63F 13/00 Video games, i.e. games using an electronically generated display having two or more dimensions
    • A63F 13/60 Generating or modifying game content before or while executing the game program, e.g. authoring tools specially adapted for game development or game-integrated level editor
    • A63F 13/67 Generating or modifying game content adaptively or by learning from player actions, e.g. skill level adjustment or by storing successful combat sequences for re-use
    • A63F 13/80 Special adaptations for executing a specific game genre or game mode
    • A63F 13/837 Shooting of targets
    • A63F 2300/00 Features of games using an electronically generated display having two or more dimensions, e.g. on a television screen, showing representations related to the game
    • A63F 2300/60 Methods for processing data by generating or executing the game program
    • A63F 2300/6027 Methods for processing data by generating or executing the game program using adaptive systems learning from user actions, e.g. for skill level adjustment
    • A63F 2300/80 Features of games specially adapted for executing a specific type of game
    • A63F 2300/8076 Shooting

Abstract

The application discloses a control method of a game object, which comprises the following steps: acquiring a game image when a game object participates in a game, and judging whether an opponent object of the game object exists in the game image; if not, inputting the game image into a first strategy model, acquiring an action probability vector output by the first strategy model, selecting the action with the highest probability as a target action according to the action probability vector, and controlling the game object to execute the target action so as to realize in-game map exploration; if yes, inputting the game image into a second strategy model, acquiring an action value vector output by the second strategy model, selecting the action with the maximum value as a target action according to the action value vector, and controlling the game object to execute the target action so as to realize fighting with the opponent object. In the method, the control of the game object is divided into map exploration and fighting, which shortens training time and gives the models better performance. The application also discloses a corresponding device, equipment and medium.

Description

Game object control method, device, medium and equipment
Technical Field
The present application relates to the field of game artificial intelligence technologies, and in particular, to a method and an apparatus for controlling a game object, a computer storage medium, and a device.
Background
At present, artificial intelligence (AI), namely game AI, is used in many game development and application scenarios. For example, during game development, a game AI can take the role of a tester and play the game to collect test data for performance testing. During game play, a player may hand a game character over to a game AI to be played automatically when the player is disconnected or otherwise unable to continue. Game AIs may also be selected to play alongside real players so that a game can proceed normally when there are not enough players.
At present, game AI is mainly implemented in one of two ways. The first is script-based: it is simple to implement and cheap to run, but its behavior pattern is rigid, so it is only suitable for games with simple scenes and cannot be applied to strategy games with highly random scenes, such as first-person shooter (FPS) games. The second is based on deep reinforcement learning: a single neural network model is trained online to realize the game AI. Such a model only performs well when many training samples are available, yet sample data is generated very slowly during online training and enough samples are difficult to obtain in a short time, so the resulting game AI performs poorly.
Disclosure of Invention
The embodiment of the application provides a control method of a game object, which realizes the game AI based on models obtained by imitation learning and deep reinforcement learning that can be trained in parallel, greatly saves model training time, and gives the game AI good performance in games with strong scene randomness. In addition, the embodiment of the application also provides a corresponding device, equipment and computer storage medium.
In view of the above, an aspect of the present application provides a method for controlling a game object, the method including:
acquiring a game image when a game object participates in a game, and judging whether an opponent object of the game object exists in the game image or not;
if not, inputting the game image into a first strategy model, acquiring an action probability vector output by the first strategy model, selecting an action with the highest probability as a target action according to the action probability vector, and controlling the game object to execute the target action so as to realize in-game map exploration; the first strategy model is a deep neural network model obtained by utilizing an imitation learning algorithm to learn offline;
if yes, inputting the game image into a second strategy model, obtaining an action value vector output by the second strategy model, selecting an action with the maximum value as a target action according to the action value vector, and controlling the game object to execute the target action so as to realize fighting with an opponent object; the second strategy model is a deep neural network model obtained by utilizing a deep reinforcement learning algorithm for online learning.
An aspect of the present application provides a control apparatus for a game object, the apparatus including:
the judgment module is used for acquiring a game image when a game object participates in a game and judging whether an opponent object of the game object exists in the game image or not;
the exploration module is used for, if not, inputting the game image into a first strategy model, acquiring an action probability vector output by the first strategy model, selecting the action with the highest probability as a target action according to the action probability vector, and controlling the game object to execute the target action so as to realize in-game map exploration; the first strategy model is a deep neural network model obtained by offline learning using an imitation learning algorithm;
the fighting module is used for, if yes, inputting the game image into a second strategy model, acquiring an action value vector output by the second strategy model, selecting the action with the maximum value as a target action according to the action value vector, and controlling the game object to execute the target action so as to realize fighting with the opponent object; the second strategy model is a deep neural network model obtained by online learning using a deep reinforcement learning algorithm.
One aspect of the present application provides a control apparatus for a game object, the apparatus comprising a processor and a memory:
the memory is used for storing program codes and transmitting the program codes to the processor;
the processor is configured to execute the steps of the control method of the game object according to the instructions in the program code.
An aspect of the present application provides a computer-readable storage medium for storing a program code for executing the above-described control method of a game object.
According to the technical scheme, the embodiment of the application has the following advantages:
the embodiment of the application provides a control method of a game object, wherein the method realizes the game AI through a deep neural network model instead of a fixed script, so that the method can be suitable for strategy games with strong scene randomness. In addition, the control of the game object is divided into two parts, namely map exploration and fighting, specifically, a first strategy model is trained offline by simulating a learning algorithm, and an offline training sample can be obtained in advance, so that the training time of the first strategy model can be greatly saved, and the model can rapidly learn the strategy of exploring the map; the second strategy model is trained through a deep reinforcement learning algorithm and is responsible for action output during fighting, so that online training is only performed on the fighting process, the training time of the second strategy model is greatly shortened, and the model has better performance during fighting.
In addition, splitting a single neural network model into two neural network models reduces the training difficulty, and the two models can be trained in parallel, further reducing the training time overhead of the game AI. With this method, the second strategy model does not need to be retrained for different map scenes; only the first strategy model needs to be retrained with the imitation learning algorithm to learn the strategy for a specific map. Therefore, in the embodiment of the application, the game AI is divided into two stages, map exploration and fighting, trained respectively with imitation learning and deep reinforcement learning algorithms, which greatly reduces training difficulty and training time overhead; the game AI realized in this way performs well and can adapt to different game scenes.
Drawings
Fig. 1 is a scene architecture diagram of a control method of a game object in an embodiment of the present application;
FIG. 2 is a flow chart of a method for controlling a game object according to an embodiment of the present disclosure;
FIG. 3 is a flow chart of a method of training a first strategy model in an embodiment of the present application;
FIG. 4 is a flow chart of a method of training a second strategy model in an embodiment of the present application;
FIG. 5A is a schematic diagram of an application scenario of a method for controlling a game object in an embodiment of the present application;
FIG. 5B is a diagram illustrating a game character performing a map search according to an embodiment of the present disclosure;
FIG. 5C is a schematic diagram of a game character fighting an opponent character in an embodiment of the present application;
FIG. 6 is a schematic structural diagram of a control device for game objects according to an embodiment of the present application;
FIG. 7 is a schematic structural diagram of a control device for game objects according to an embodiment of the present application;
FIG. 8 is a schematic structural diagram of a control device for game objects according to an embodiment of the present application;
FIG. 9 is a schematic structural diagram of a control device for game objects according to an embodiment of the present application;
FIG. 10 is a schematic structural diagram of a control device for game objects according to an embodiment of the present application;
FIG. 11 is a schematic structural diagram of a control device for game objects according to an embodiment of the present application;
FIG. 12 is a schematic structural diagram of a control device for a game object according to an embodiment of the present application;
fig. 13 is a schematic structural diagram of a control device for a game object in an embodiment of the present application.
Detailed Description
In order to make the technical solutions of the present application better understood, the technical solutions in the embodiments of the present application will be clearly and completely described below with reference to the drawings in the embodiments of the present application, and it is obvious that the described embodiments are only a part of the embodiments of the present application, and not all of the embodiments. All other embodiments, which can be derived by a person skilled in the art from the embodiments given herein without making any creative effort, shall fall within the protection scope of the present application.
The terms "first," "second," "third," "fourth," and the like in the description and in the claims of the present application and in the drawings described above, if any, are used for distinguishing between similar elements and not necessarily for describing a particular sequential or chronological order. It is to be understood that the data so used is interchangeable under appropriate circumstances such that the embodiments of the application described herein are, for example, capable of operation in sequences other than those illustrated or otherwise described herein. Furthermore, the terms "comprises," "comprising," and "having," and any variations thereof, are intended to cover a non-exclusive inclusion, such that a process, method, system, article, or apparatus that comprises a list of steps or elements is not necessarily limited to those steps or elements expressly listed, but may include other steps or elements not expressly listed or inherent to such process, method, article, or apparatus.
The prior art has the technical problems that a script-based game AI has a rigid behavior pattern and is only suitable for games with simple scenes, while a game AI based on deep reinforcement learning performs poorly. To address this, in the present method the control of the game object is realized through a game AI, and the game AI is realized through deep neural network models rather than a fixed script, so the method is also suitable for strategy games with strong scene randomness. In addition, the control of the game object is divided into two parts, map exploration and fighting. Specifically, the first strategy model is trained offline with an imitation learning algorithm, and the offline training samples can be obtained in advance, so the training time of the first strategy model is greatly reduced and the model can quickly learn a map exploration strategy; the second strategy model is trained with a deep reinforcement learning algorithm and is responsible for action output during fighting, so online training is performed only for the fighting process, the training time of the second strategy model is greatly shortened, and the model performs well in combat.
In addition, splitting a single neural network model into two neural network models reduces the training difficulty, and the two models can be trained in parallel, further reducing the training time overhead of the game AI. With this method, the second strategy model does not need to be retrained for different map scenes; only the first strategy model needs to be retrained with the imitation learning algorithm to learn the strategy for a specific map. Therefore, in the embodiment of the application, the game AI is divided into two stages, map exploration and fighting, trained respectively with imitation learning and deep reinforcement learning algorithms, which greatly reduces training difficulty and training time overhead; the game AI trained in this way performs well and can adapt to different game scenes.
The game object described in this embodiment refers to an object that can participate in battle in game application. Specifically, the game object may be a game character such as a person, an animal, or another living body in the game, wherein the game character may be divided into a player game character and a non-player game character. For a player game character, a game player can control the game character to perform corresponding operation, or in some cases, the player game character is delivered to a game AI, and the game AI controls the game character to perform corresponding operation; for a non-player game character, the game character may be controlled to perform corresponding operations by a game AI built in the game application server or the terminal. In order to facilitate understanding of the technical solutions of the present application, the embodiments described below are exemplified by taking a game character as a game object, but do not limit the implementation of the present application.
It can be understood that the control method of the game object provided by the present application may be executed by a game agent, where the game agent is the game AI described above, and from a software level, the game AI may be an engine, a function module, or a plug-in a game application. The game agent can be deployed in a server or a terminal device. The server may be an independent server or a cluster formed by a plurality of servers, and the terminal device may be a computing device with data processing capability, including a desktop, a notebook computer, or a smart phone.
The control method of the game object provided by the embodiment of the application can be applied to various scenes. For example, in a game development phase, a game agent may play a game instead of a tester to implement game performance testing. For another example, in the game application stage, the player hosts his/her character according to his/her own needs, so that the game agent can control the player character to continue playing the game by executing the control method of the game object. In the multiplayer game, when the number of the players is insufficient, the game intelligent agent can select the game role, execute the control method of the game object and play the game together with the real person. Some games also include non-player characters, such as the assistors of the player game characters, and the game agent can also execute the control method of the game object in the identity of the non-player character to assist the player game character in playing the game.
For ease of understanding, the control method of the game object in the embodiment of the present application is briefly described below in a scenario where the game agent plays on behalf of a player who cannot continue playing.
Fig. 1 is a view of a scene structure of a method for controlling a game object in an embodiment of the present application, and referring to fig. 1, an application scene includes a terminal device 10 and a server 20, where a game AI is built in the server 20, and control of the game object can be achieved through the game AI. The control process of the game object will be described in detail below.
The player leaves during game play on the terminal device 10 and hosts the player game character, which is the game object in this embodiment, to the game AI. The game AI of the server 20 can then acquire from the terminal device a game image of the game character participating in the game, and determine whether an opponent character of the game character exists in the game image.
If there is no opponent character of the game character in the game image, the in-game map needs to be explored so as to find an opponent character. Specifically, the game agent may input the game image into a first strategy model, which is a deep neural network model obtained by offline learning using an imitation learning algorithm, acquire the action probability vector output by the first strategy model, select the action with the highest probability as the target action according to the action probability vector, and then control the game character to execute the target action, thereby realizing in-game map exploration.
If an opponent character of the game character exists in the game image, it is necessary to battle with the opponent character. Specifically, the game agent inputs the game image into a second strategy model, the second strategy model is a deep neural network model obtained by utilizing a deep reinforcement learning algorithm for on-line learning, then the game agent obtains an action value vector output by the second strategy model, selects an action with the maximum value as a target action according to the action value vector, and controls the game role to execute the target action so as to achieve fighting with the opponent role.
In this scene, the game agent controls the game character to execute target actions given by the first strategy model to explore the in-game map, and controls the game character to execute target actions given by the second strategy model to fight the opponent character, so the game agent controls the game character in place of the player. Because the offline-trained first strategy model and the online-trained second strategy model are both deep neural network models, the method is well suited to strategy games with strong scene randomness, and the combination of offline and online learning gives it good performance.
In order to make the technical solution of the present application clearer, the following will describe in detail a control method of a game object provided in the embodiments of the present application with reference to the accompanying drawings.
Fig. 2 is a flowchart of a method for controlling a game object according to an embodiment of the present application, please refer to fig. 2, where the method includes:
s201: the method comprises the steps of obtaining a game image when a game object participates in a game, and judging whether an opponent object of the game object exists in the game image. If not, executing S202; if yes, go to S203.
In a strategy game with a high scene randomness, such as a first-person shooter game, a game agent needs to determine an opponent object of a game object in order to control the game object to fight against the opponent object. Based on this, the game agent acquires a game image when the game object participates in the game, and determines whether or not an opponent object of the game object exists in the game image.
Taking a game character as the game object as an example, the game agent can acquire the current frame and recognize it through an image recognition technique, thereby obtaining the game image of the game character participating in the game; the game agent then determines, again by image recognition, whether an opponent character of the game character exists in that game image. If not, the game agent needs to explore the in-game map to find an opponent character of the game character, that is, S202 is executed; if yes, the game agent fights the opponent character, that is, S203 is executed.
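The embodiments above leave the image recognition step abstract. Purely as an illustration, and assuming the opponent character can be located by matching a stored image template against the current frame (an assumption not made in the embodiments, which could equally use a trained detector), the opponent check of S201 might be sketched as follows; the function name detect_opponent, the template file, and the matching threshold are all hypothetical.

```python
# Hedged sketch of the opponent check in S201, assuming template matching
# with OpenCV is enough to tell whether an opponent character is on screen.
import cv2
import numpy as np

OPPONENT_TEMPLATE = cv2.imread("opponent_template.png", cv2.IMREAD_GRAYSCALE)  # hypothetical asset
MATCH_THRESHOLD = 0.8  # hypothetical value, tuned per game

def detect_opponent(game_image_bgr: np.ndarray) -> bool:
    """Return True if an opponent character appears in the current game image."""
    gray = cv2.cvtColor(game_image_bgr, cv2.COLOR_BGR2GRAY)
    scores = cv2.matchTemplate(gray, OPPONENT_TEMPLATE, cv2.TM_CCOEFF_NORMED)
    _, max_score, _, _ = cv2.minMaxLoc(scores)
    return max_score >= MATCH_THRESHOLD
```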
S202: inputting the game image into a first strategy model, acquiring an action probability vector output by the first strategy model, selecting an action with the highest probability as a target action according to the action probability vector, and controlling the game object to execute the target action to realize the search of the map in the game.
The first strategy model is a deep neural network model obtained by offline learning using an imitation learning algorithm.
During the game, the game object often needs to explore the map to find an opponent object. To improve the efficiency of finding opponent objects, the game object can be controlled to move according to a map exploration strategy, which may be obtained by imitating human expert decision data. Based on this, the game server can use an imitation learning algorithm to learn offline and obtain a deep neural network model as the first strategy model.
Still taking a game character as an example, the game agent inputs the game image of the game character participating in the game into the first strategy model and obtains the action probability vector output by the first strategy model; the action probability vector represents the map exploration strategy given by the first strategy model, and the game agent controls the game character to explore the map according to this strategy. Specifically, the game agent selects the action with the highest probability as the target action according to the action probability vector and controls the game character to execute it, thereby realizing in-game map exploration.
It can be understood that, after the game agent controls the game character to execute the target action, the game image of the game character participating in the game will change, and the game agent may determine whether an opponent character of the target character exists in the new game image, that is, re-execute S201. In some possible implementations, the game agent may repeatedly perform S201 and S202, and perform S203 when an opponent character of the game character exists in the game image.
S203: inputting the game image into a second strategy model, obtaining an action value vector output by the second strategy model, selecting an action with the maximum value as a target action according to the action value vector, and controlling the game object to execute the target action so as to realize fighting with an opponent object.
The second strategy model is a deep neural network model obtained by utilizing a deep reinforcement learning algorithm for online learning.
During the game, when the game image is determined to contain an opponent object of the game object, a battle takes place between the game object and the opponent object. To improve the probability of hitting the opponent object and reduce the probability of being hit, the game object can be controlled to fight according to a combat strategy. The combat strategy can be obtained by performing deep reinforcement learning on combat data; specifically, the server learns online with a deep reinforcement learning algorithm to obtain a deep neural network model as the second strategy model.
Taking a game character as the game object as an example, the game agent inputs the game image into the second strategy model and obtains the action value vector output by it; the action value vector represents the combat strategy given by the second strategy model, namely the value generated by executing each action in the current scene, and the game agent acts according to this combat strategy. Specifically, the game agent selects the action with the maximum value as the target action according to the action value vector and controls the game character to execute it, thereby realizing the fight with the opponent character.
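Putting S201 to S203 together, the per-frame decision the game agent makes can be sketched as below. This is a minimal sketch, not the claimed implementation: the helper names (preprocess, detect_opponent, perform_action) are placeholders for components described elsewhere in the embodiments, and PyTorch is only one possible way to run the two strategy models.

```python
# Hedged sketch of the per-frame control loop (S201 -> S202 / S203).
# first_strategy_model outputs an action probability vector (map exploration);
# second_strategy_model outputs an action value vector (combat).
import torch

def control_step(game_image, first_strategy_model, second_strategy_model,
                 detect_opponent, perform_action, preprocess):
    state = preprocess(game_image)                     # e.g. resize + normalize to a tensor
    with torch.no_grad():
        if not detect_opponent(game_image):
            # S202: map exploration, pick the most probable action
            action_probs = first_strategy_model(state.unsqueeze(0)).squeeze(0)
            target_action = int(torch.argmax(action_probs))
        else:
            # S203: combat, pick the action with the largest estimated value
            action_values = second_strategy_model(state.unsqueeze(0)).squeeze(0)
            target_action = int(torch.argmax(action_values))
    perform_action(target_action)                      # send the control command to the game
    return target_action
```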
Therefore, the method for controlling a game object provided by the embodiment of the application realizes the game AI through deep neural network models rather than controlling the game object through a fixed script, so it can be applied to strategy games with strong scene randomness. In addition, the control of the game object is divided into two parts, map exploration and fighting. Specifically, the first strategy model is trained offline with an imitation learning algorithm, and the offline training samples can be obtained in advance, so the training time of the first strategy model is greatly reduced and the model can quickly learn a map exploration strategy; the second strategy model is trained with a deep reinforcement learning algorithm and is responsible for action output during fighting, so online training is performed only for the fighting process, the training time of the second strategy model is greatly shortened, and the model performs well in combat.
In addition, splitting a single neural network model into two neural network models reduces the training difficulty, and the two models can be trained in parallel, further reducing the training time overhead of the game AI. With this method, the second strategy model does not need to be retrained for different map scenes; only the first strategy model needs to be retrained with the imitation learning algorithm to learn the strategy for a specific map. Therefore, in the embodiment of the application, the game AI is divided into two stages, map exploration and fighting, trained respectively with imitation learning and deep reinforcement learning algorithms, which greatly reduces training difficulty and training time overhead; the game AI trained in this way performs well and can adapt to different game scenes.
In the above embodiments, the key point for implementing the control of the game object is to provide an accurate control strategy, and the game AI determines the control strategy based on the first strategy model and the second strategy model.
In this embodiment, the first strategy model is trained based on an imitation learning algorithm. Imitation learning algorithms include algorithms based on behavioral cloning, inverse reinforcement learning, or Generative Adversarial Networks (GAN). Specifically, in this embodiment, the game application server may perform offline learning using a direct imitation learning algorithm based on behavioral cloning to obtain the first strategy model.
Fig. 3 is a flowchart of a method for training a first strategy model in an embodiment of the present application, and referring to fig. 3, the method includes:
s301: samples are extracted from the game player operation video to generate a sample set.
Each sample in the set of samples includes a frame image and its corresponding action tag that identifies an action performed by a game object in a game image.
The server acquires a game player operation video, extracts a frame image including a game object when the game object participates in a game from the game player operation video, then identifies the frame image to obtain a corresponding action label, can generate a sample according to the frame image and the action label, and further forms a sample set. In some possible implementation manners, the server may identify frame images in the game player operation video by using an image identification algorithm to obtain an action tag corresponding to each frame image; then, taking each frame image in the game player operation video and the corresponding action label as a sample; a set of samples is generated from the sample.
Imitation learning uses the training samples as demonstrations, so that the action output by the model in a given scene matches the action in the corresponding sample; for this reason, the actions in the training samples should be kept consistent. For example, if a cabin has doors on both the left and right sides leading to the deck, the same door should be chosen every time the character goes from the cabin to the deck. Accordingly, when the sample set contains multiple samples of the cabin-to-deck scene, the frame images of the game character going from the cabin to the deck and their corresponding action labels should be consistent.
After the sample set is generated, there may be many frames in the video that have no action, or whose image is almost identical to the previous adjacent frame but whose action differs; this degrades the quality of the sample set and limits the achievable training accuracy. The server can therefore check the consistency of the sample actions in the sample set. Specifically, the server may compare each frame image in the sample set with the previous frame image, and compare the action tag of each frame image with that of the previous frame; if a frame image is similar to the previous frame image but the action labels are inconsistent, the action label corresponding to that frame in the sample set is changed to the action label corresponding to the previous frame. For different strategy games, the server can further verify the samples in the sample set according to the characteristics of the game.
Taking Cross Fire Mobile (CFM) as an example, the server may further process a frame image without an action according to its previous frame image. Specifically, if a frame image has no corresponding action and the action tag of the previous frame image identifies a specified type of action, the action tag corresponding to that frame in the sample set is changed to the action tag of the previous frame. In other possible implementations, if a frame image has no corresponding action and the action tag of the previous frame image identifies a non-specified type of action, the sample corresponding to that frame image is deleted from the sample set.
Wherein the specified type of action may be set empirically. Specifically, for any battle type game, such as a first-person shooter game, the association relationship of the action tags between adjacent frames in the operation video of the game player is analyzed, and the specified type action is set according to the association relationship. In some possible implementations, the specified type of action may be a left-right turn.
It should be noted that, in the above embodiment, the server may implement the comparison between the frame image and the previous frame image by means of a sliding window, so as to determine the consistency of the action in the sample.
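A minimal sketch of this sample cleanup follows, assuming frames are compared with a simple pixel-difference similarity measure over adjacent frames; the similarity metric, its threshold, and the concrete list of specified-type actions (left and right turns, as suggested above) are illustrative assumptions rather than values fixed by the embodiments.

```python
# Hedged sketch of the sample-set cleanup: propagate the previous frame's
# action label when two adjacent frames look alike but their labels differ,
# handle frames with no action, and drop samples that cannot be repaired.
import numpy as np

SPECIFIED_ACTIONS = {"turn_left", "turn_right"}   # assumed "specified type" actions
SIMILARITY_THRESHOLD = 0.95                       # assumed threshold

def frames_similar(img_a: np.ndarray, img_b: np.ndarray) -> bool:
    # Illustrative similarity measure: normalized mean pixel agreement.
    diff = np.abs(img_a.astype(np.float32) - img_b.astype(np.float32))
    return 1.0 - diff.mean() / 255.0 >= SIMILARITY_THRESHOLD

def clean_samples(samples):
    """samples: list of (frame_image, action_label); action_label may be None."""
    cleaned = [samples[0]]
    for (prev_img, prev_label), (img, label) in zip(samples, samples[1:]):
        if label is None:
            if prev_label in SPECIFIED_ACTIONS:
                cleaned.append((img, prev_label))     # inherit e.g. an ongoing turn
            # otherwise the sample is dropped entirely
            continue
        if frames_similar(img, prev_img) and label != prev_label:
            label = prev_label                        # enforce action consistency
        cleaned.append((img, label))
    return cleaned
```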
S302: and training the initial deep neural network model according to the sample set by adopting a simulated learning algorithm to obtain the deep neural network model meeting the training end condition as the first strategy model.
The server trains the initial deep neural network model according to the sample set by adopting a simulated learning algorithm, and iteratively updates parameters of the initial deep neural network model to obtain the deep neural network model meeting the training end condition as a first strategy model. Wherein, satisfying the training end condition may be that the model is in a converged state. In the model training process, the model parameters are updated according to the sample set by adopting a simulation learning algorithm, and when the model is in a convergence state, the finally learned model can be used as a first strategy model.
In this embodiment, when the server obtains the first strategy model by offline learning using a direct imitation learning algorithm based on behavioral cloning, the first strategy model may be implemented as a convolutional neural network. Specifically, the initial deep neural network model may be a convolutional neural network model comprising 6 convolutional layers, 3 fully-connected layers, and 1 softmax layer, where the softmax layer is used to solve the multi-classification problem and output the probability of each class. For the initial deep neural network model, the server can use an Adam optimizer to optimize the model parameters and a cross-entropy loss function as the loss function of the model, so as to judge whether the training end condition is met according to the cross-entropy loss.
When the server obtains the first strategy model through imitation learning training, the first strategy model has the same structure as the initial deep neural network model, since it is obtained by training that model. Specifically, the first strategy model includes 6 convolutional layers, 3 fully-connected layers, and 1 softmax layer.
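The embodiments fix the layer counts (6 convolutional, 3 fully-connected, 1 softmax), the Adam optimizer, and the cross-entropy loss, but not the kernel sizes, channel widths, input resolution, or number of actions; the PyTorch sketch below fills those in with assumed values for illustration only.

```python
# Hedged PyTorch sketch of the first strategy model: 6 convolutional layers,
# 3 fully-connected layers and a softmax output, trained by behavior cloning
# with Adam and a cross-entropy objective. Channel counts, kernel sizes, the
# 84x84 RGB input and the 8-action output are assumptions, not patent values.
import torch
import torch.nn as nn

class FirstStrategyModel(nn.Module):
    def __init__(self, num_actions: int = 8):
        super().__init__()
        self.features = nn.Sequential(
            nn.Conv2d(3, 32, kernel_size=5, stride=2), nn.ReLU(),
            nn.Conv2d(32, 32, kernel_size=3, stride=1), nn.ReLU(),
            nn.Conv2d(32, 64, kernel_size=3, stride=2), nn.ReLU(),
            nn.Conv2d(64, 64, kernel_size=3, stride=1), nn.ReLU(),
            nn.Conv2d(64, 128, kernel_size=3, stride=2), nn.ReLU(),
            nn.Conv2d(128, 128, kernel_size=3, stride=1), nn.ReLU(),
        )
        self.classifier = nn.Sequential(
            nn.Flatten(),
            nn.Linear(128 * 5 * 5, 512), nn.ReLU(),   # 5x5 feature map for an 84x84 input
            nn.Linear(512, 256), nn.ReLU(),
            nn.Linear(256, num_actions),
            nn.Softmax(dim=1),                        # action probability vector
        )

    def forward(self, x):
        return self.classifier(self.features(x))

model = FirstStrategyModel()
optimizer = torch.optim.Adam(model.parameters(), lr=1e-4)
# Cross-entropy over the softmax output, written as NLL of the log-probabilities.
criterion = nn.NLLLoss()

def behavior_cloning_step(frames: torch.Tensor, action_labels: torch.Tensor) -> float:
    """frames: (B, 3, 84, 84) float tensor; action_labels: (B,) long tensor."""
    probs = model(frames)
    loss = criterion(torch.log(probs + 1e-8), action_labels)
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return float(loss)
```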
As can be seen from the above, the embodiment of the present application provides a method for training the first strategy model: a player operation video is acquired in advance, samples are extracted from it to generate a sample set, and an imitation learning algorithm is used to train the initial neural network model offline on the sample set until a deep neural network model meeting the training end condition is obtained as the first strategy model. Because offline training is used, the training samples can be obtained in advance rather than generated online, that is, there is no need to wait for samples to be generated, which greatly reduces the training time overhead; and because offline training can use a large number of training samples, the resulting model performs well.
The embodiment shown in fig. 3 mainly describes a specific implementation manner of the method for training the first policy model, and next, a specific implementation manner of the method for training the second policy model provided by the embodiment of the present application will be described with reference to the drawings.
Fig. 4 is a flowchart of a method for training a second strategy model in an embodiment of the present application, and referring to fig. 4, the method includes:
s401: the method comprises the steps of collecting game images of game objects when the game objects participate in games, determining actions of the game objects in the game images and determining return values obtained by the game objects to implement the actions according to return functions.
The reward function represents that a first reward value is given if the game object is hit by the opponent object, a second reward value is given if the game object hits the opponent object, a third reward value is given if the game object kills the opponent object, the first reward value is a negative number, and the second reward value is a positive number and smaller than the third reward value.
In this embodiment, the second strategy model is used to provide a combat strategy. When the game object is a game character, the combat strategy indicates the action of the game character that generates the maximum value in the current scene, so that by executing actions according to the combat strategy during the game, the game character can generate the maximum value. The value is positively correlated with the reward; based on this, the server collects game images of the game character participating in the game, determines the action performed by the game character in each image, and determines, according to the reward function, the reward value obtained for performing that action.
The reward function may specify reward values corresponding to different states of the game character. Specifically, if the game character is hit, a first reward value is given; the first reward value is negative, which amounts to a penalty for being hit. If the game character hits the opponent character, a second reward value is given, and if it kills the opponent character, a third reward value is given; the second reward value is positive and smaller than the third reward value, which amounts to a reward for hitting or killing the opponent, with the reward for a hit smaller than the reward for a kill. The game character therefore needs to choose actions that raise the probability of hitting and killing the opponent character and lower the probability of being hit, so as to obtain a larger reward and maximize the value. In a specific implementation, the first reward value is -0.1, the second reward value is 0.1, and the third reward value is 1.0; this set of reward values mainly reinforces the network to learn to hit enemies while avoiding being hit.
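With the concrete values of this implementation (-0.1 for being hit, +0.1 for hitting the opponent, +1.0 for a kill), the reward function can be written out directly; the boolean event flags below are hypothetical names for whatever the game-state or image-recognition layer reports each frame.

```python
# Reward function from the description: penalize being hit, reward hitting,
# and reward killing the opponent most. The event-flag names are assumptions.
FIRST_REWARD = -0.1   # game object is hit by the opponent
SECOND_REWARD = 0.1   # game object hits the opponent
THIRD_REWARD = 1.0    # game object kills the opponent

def reward(was_hit: bool, hit_opponent: bool, killed_opponent: bool) -> float:
    r = 0.0
    if was_hit:
        r += FIRST_REWARD
    if hit_opponent:
        r += SECOND_REWARD
    if killed_opponent:
        r += THIRD_REWARD
    return r
```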
S402: a sample set is generated.
Each sample in the set of samples comprises: the game system comprises a frame of game image, an action label corresponding to the frame of game image, a return value corresponding to the frame of game image and a next frame of game image adjacent to the frame of game image.
In the game fighting process, the value generated by each selection in each state is related to the return value of the current action in the current state and the value generated by the selection in the next state, and therefore, when the sample set is collected and generated, each sample in the sample set not only comprises one frame of game image, the action label corresponding to the game image, the return value corresponding to the image, but also comprises the next frame of game image adjacent to the image.
S403: and training network parameters of a second strategy model according to the samples in the sample set.
The server trains the network parameters of the second strategy model on the samples in the sample set using a deep reinforcement learning algorithm, thereby obtaining the second strategy model that provides the combat strategy.
In a specific implementation, the server may initialize a Deep Q-Network (DQN) model, input samples from the sample set into the DQN model, and update the model parameters of the DQN model with a deep reinforcement learning algorithm, thereby training the model. When the trained model meets the training end condition, for example when the DQN model has converged, it is used as the second strategy model. The second strategy model can then be used in combat scenes to provide a combat strategy for the game object; when the game object executes the target action and fights the opponent object according to this strategy, the maximum value can be obtained.
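A compact sketch of such a DQN-style update follows, assuming the S402 samples are held in a replay buffer as (game image, action, reward, next game image) tuples and that the value network mirrors the convolutional architecture above but without the softmax output; the discount factor, buffer size, batch size, and use of a separate target network are common DQN choices assumed here rather than requirements stated in the embodiments.

```python
# Hedged sketch of the second strategy model trained with a DQN-style update:
# Q(s, a) is regressed toward r + gamma * max_a' Q_target(s', a').
import random
from collections import deque
import torch
import torch.nn as nn

GAMMA = 0.99                            # discount factor (assumed)
replay_buffer = deque(maxlen=50_000)    # (state, action, reward, next_state) tuples

def dqn_update(q_net, target_net, optimizer, batch_size=32):
    if len(replay_buffer) < batch_size:
        return None
    batch = random.sample(replay_buffer, batch_size)
    states, actions, rewards, next_states = map(list, zip(*batch))
    states = torch.stack(states)
    next_states = torch.stack(next_states)
    actions = torch.tensor(actions, dtype=torch.int64)
    rewards = torch.tensor(rewards, dtype=torch.float32)

    q_values = q_net(states).gather(1, actions.unsqueeze(1)).squeeze(1)
    with torch.no_grad():
        next_q = target_net(next_states).max(dim=1).values
        targets = rewards + GAMMA * next_q
    loss = nn.functional.smooth_l1_loss(q_values, targets)

    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return float(loss)
```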
As can be seen from the above, the embodiment of the present application provides a method for training the second strategy model based on a deep reinforcement learning algorithm. A reward function is first constructed, which specifies the reward values of the game object in different states; based on it, game images of the game object participating in the game are collected, the action performed by the game object in each game image is identified, and the reward value obtained for performing that action is determined according to the reward function. A sample set is then generated from each game image, its corresponding action tag, its corresponding reward value, and the adjacent next frame of game image, and the network parameters of the second strategy model are trained on this sample set to obtain the second strategy model. Because the second strategy model is responsible only for action output during combat, online training is performed only for the combat process, which greatly shortens its training time. In addition, since the model does not need to handle map exploration, the rate at which sample data is generated during online training is sufficient for model training, a large amount of sample data is not required, and the second strategy model can still achieve good performance.
In order to facilitate understanding of the technical solutions of the present application, the following describes in detail a control method for a game object in an embodiment of the present application with reference to specific scenes.
Fig. 5A is a schematic view of an application scenario of a method for controlling a game object in this embodiment, referring to fig. 5A, where the game scenario includes a terminal device 10 and a server 20, where the server 20 is embedded with a game AI, and the game AI is implemented based on a first policy model and a second policy model.
In this application scenario, the tester sets a game character to play in the AI game mode through the terminal device 10, and thus, the game agent in the server 20 can replace the tester to control the game character.
Specifically, the game agent obtains a game image of the game character participating in the game and judges, through an image recognition technique, whether an opponent character of the game character exists in the image. If not, the game agent inputs the game image into the first strategy model, obtains the action probability vector it outputs, selects the action with the highest probability as the target action, and controls the game character to execute that action so as to explore the in-game map. Referring to the map exploration diagram shown in fig. 5B, 501 is the game image of the game character currently participating in the game; the game agent does not recognize any opponent character in this image, so it inputs the image into the first strategy model, obtains the action probability vector, determines the target action from it, and controls the game character to execute the target action to explore the map. As shown at 502, the black dotted line marks the position of the game character at the previous moment: the game agent determined from the first strategy model that the target action is a right turn, and the game character turned right and moved to its current position, exploring the in-game map and searching for an opponent character.
During map exploration, if an opponent character of the game character is recognized in the game image, the game image is input into the second strategy model, the action value vector output by the second strategy model is obtained, the action with the maximum value is selected as the target action, and the game character is controlled to execute the target action so as to fight the opponent character. Referring to the combat diagram shown in fig. 5C, 503 is the game image of the game character currently participating in the game; the game agent recognizes, through image recognition, that an opponent character exists in the image, namely the object highlighted by the rectangular frame in 503. The game agent inputs the game image into the second strategy model, obtains the combat strategy output by the model, and controls the game character to execute the target action according to that strategy so as to aim at the opponent character, as shown at 504, thereby fighting the opponent character.
The server 20 may collect data of the game agent controlling the game character as game test data for helping game development and testing personnel to find out the defects of the game so as to improve the game in a targeted manner.
In this embodiment, the game agent recognizes the game image of the game character participating in the game. When no opponent character appears in the image, the game image is input into the first strategy model, and the game character is controlled to explore the map and search for opponent characters according to the strategy given by the first strategy model; when an opponent character appears in the image, the game image is input into the second strategy model, and the game character is controlled to fight the opponent character according to the strategy given by the second strategy model. The game agent thereby controls the game character's in-game operations in place of a tester, freeing testers, greatly relieving testing pressure, and facilitating game design and development.
Based on the above specific implementation manner of the method for controlling a game object provided in the embodiment of the present application, the embodiment of the present application further provides a control device for a game object, and the control device for a game object in the embodiment of the present application will be described in terms of function modularization.
Fig. 6 is a schematic structural diagram of a control device for a game object in an embodiment of the present application, and referring to fig. 6, the device 600 includes:
the determining module 610 is configured to obtain a game image when a game object participates in a game, and determine whether an opponent object of the game object exists in the game image;
if not, the searching module 620 is configured to input the game image into a first policy model, obtain an action probability vector output by the first policy model, select an action with the highest probability as a target action according to the action probability vector, and control the game object to execute the target action to realize in-game map searching; the first strategy model is a deep neural network model obtained by utilizing an imitation learning algorithm to learn offline;
a fighting module 630, configured to, if yes, input the game image into a second policy model, obtain an action value vector output by the second policy model, select an action with a largest value as a target action according to the action value vector, and control the game object to execute the target action to implement fighting with an opponent object; the second strategy model is a deep neural network model obtained by utilizing a deep reinforcement learning algorithm for online learning.
Optionally, referring to fig. 7, fig. 7 is a schematic structural diagram of a control device for a game object in an embodiment of the present application, and based on the structure shown in fig. 6, the device further includes:
an extracting module 640, configured to extract samples from the game player operation video to generate a sample set, where each sample in the sample set includes a frame image and an action tag corresponding to the frame image, and the action tag identifies an action implemented by a game object in the game image;
and the first training module 650 is configured to train the initial deep neural network model according to the sample set by using an imitation learning algorithm, so as to obtain a deep neural network model meeting the training end condition as the first policy model.
Optionally, referring to fig. 8, fig. 8 is a schematic structural diagram of a control device for a game object in an embodiment of the present application, and on the basis of the structure shown in fig. 7, the extracting module 640 includes:
the identifying submodule 641 is configured to identify frame images in the game player operation video by using an image identification algorithm, and obtain an action tag corresponding to each frame image;
the first generation submodule 642 is used for taking each frame of image in the game player operation video and the corresponding action tag as a sample;
a second generation submodule 643, configured to generate a sample set according to the samples.
Optionally, referring to fig. 9, fig. 9 is a schematic structural diagram of a control device for a game object in an embodiment of the present application, and on the basis of the structure shown in fig. 8, the extracting module 640 further includes:
a comparison sub-module 644 for comparing each frame image in the sample set with a frame image of a frame preceding the frame image, and comparing the action tag of each frame image in the sample set with a frame preceding the frame image;
and a changing sub-module 645, configured to change the action tag corresponding to the certain frame image in the sample set to the action tag corresponding to the previous frame image if the certain frame image is similar to the image of the previous frame image and the action tags are not consistent.
Optionally, referring to fig. 10, fig. 10 is a schematic structural diagram of a control device for a game object in the embodiment of the present application, and on the basis of the structure shown in fig. 9, the modification sub-module 645 is further configured to:
if a certain frame of image has no corresponding action and the action label corresponding to the previous frame of image identifies the action of the specified type, changing the action label corresponding to the certain frame of image in the sample set into the action label corresponding to the previous frame of image;
the extraction module 640 further includes:
the deleting submodule 646 is configured to delete the sample corresponding to the certain frame image from the sample set if the certain frame image does not have the corresponding action and the action tag corresponding to the previous frame image identifies a non-specified type of action.
Optionally, the first policy model includes 6 convolutional layers, 3 fully-connected layers, and 1 Softmax layer.
Optionally, the imitation learning algorithm comprises a direct imitation learning algorithm based on behavioral cloning.
Optionally, referring to fig. 11, fig. 11 is a schematic structural diagram of a control device for a game object in the embodiment of the present application, and on the basis of the structure shown in fig. 6, the device 600 further includes:
the determining module 660, configured to collect game images of the game object when the game object participates in the game, determine the action implemented by the game object in each game image, and determine, according to a return function, the return value obtained by the game object for implementing the action; the return function specifies that a first return value is given if the game object is hit by the opponent object, a second return value is given if the game object hits the opponent object, and a third return value is given if the game object kills the opponent object, where the first return value is a negative number and the second return value is a positive number smaller than the third return value;
a generating module 670 for generating a set of samples, each sample in the set of samples comprising: one frame of game image, an action label corresponding to the one frame of game image, a return value corresponding to the one frame of game image and a next frame of game image adjacent to the one frame of game image;
and the second training module 680 is configured to train the network parameters of the second policy model according to the samples in the sample set.
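One way such online training could proceed, sketched below, is a DQN-style temporal-difference update over (game image, action label, return value, next game image) samples; the concrete return values, the discount factor, and the use of a target network are assumptions, since the patent only requires that the first return value be negative and the second be positive and smaller than the third.

```python
# A hedged DQN-style sketch; the reward constants and gamma are illustrative only.
import torch
import torch.nn as nn

REWARD_HIT_BY_ENEMY = -1.0   # first return value (negative)
REWARD_HIT_ENEMY    = 1.0    # second return value (positive, smaller than the third)
REWARD_KILL_ENEMY   = 5.0    # third return value

def compute_return(was_hit: bool, hit_enemy: bool, killed_enemy: bool) -> float:
    """Return function: negative when hit, positive when hitting, largest when killing."""
    if killed_enemy:
        return REWARD_KILL_ENEMY
    if hit_enemy:
        return REWARD_HIT_ENEMY
    if was_hit:
        return REWARD_HIT_BY_ENEMY
    return 0.0

def dqn_update(q_net, target_net, batch, optimizer, gamma=0.99):
    """batch: (images, action_indices, rewards, next_images) as tensors."""
    images, actions, rewards, next_images = batch
    q_values = q_net(images).gather(1, actions.unsqueeze(1)).squeeze(1)
    with torch.no_grad():
        next_q = target_net(next_images).max(dim=1).values
        targets = rewards + gamma * next_q
    loss = nn.functional.smooth_l1_loss(q_values, targets)
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()
```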
Optionally, the game is a first person shooter game.
As described above, the present embodiment provides a game object control device in which the game object is controlled by a game AI implemented as a deep neural network model rather than by a fixed script, so the device is applicable to strategy games with high scene randomness. In addition, the control of the game object is divided into two parts, map exploration and fighting. Specifically, the first strategy model is trained offline by an imitation learning algorithm, and the offline training samples can be obtained in advance, which greatly reduces the training time of the first strategy model and allows it to quickly learn the map exploration strategy; the second strategy model is trained by a deep reinforcement learning algorithm and is responsible for action output during fighting, so online training is performed only on the fighting process, which greatly shortens the training time of the second strategy model and gives it better fighting performance.
In addition, splitting a single neural network model into two neural network models reduces the training difficulty, and the two models can be trained in parallel, further reducing the training time overhead of the game AI. Moreover, the second strategy model does not need to be retrained for different map scenes; only the first strategy model needs to be retrained with the imitation learning algorithm to learn the strategy of a specific map. Therefore, in the embodiment of the present application, the game AI is divided into two stages, map exploration and fighting, which are trained by the imitation learning algorithm and the deep reinforcement learning algorithm respectively, greatly reducing the training difficulty and training time overhead; a game AI trained in this way performs better and can adapt to different game scenes.
The embodiments shown in fig. 6 to 11 describe the control device of the game object provided in the embodiments of the present application from the perspective of functional modules. On this basis, the embodiments of the present application also provide a control apparatus for the game object. The following describes a specific implementation of the control apparatus for the game object in the embodiment of the present application in detail from the perspective of hardware implementation.
Fig. 12 is a schematic structural diagram of a control device for a game object according to an embodiment of the present application. The control device may be a server. The server 1200 may vary considerably in configuration or performance, and may include one or more Central Processing Units (CPUs) 1222 (e.g., one or more processors), a memory 1232, and one or more storage media 1230 (e.g., one or more mass storage devices) storing an application program 1242 or data 1244. The memory 1232 and the storage medium 1230 may be transient storage or persistent storage. The program stored in the storage medium 1230 may include one or more modules (not shown), and each module may include a series of instruction operations for the server. Furthermore, the central processor 1222 may be configured to communicate with the storage medium 1230 and execute, on the server 1200, the series of instruction operations stored in the storage medium 1230.
The server 1200 may also include one or more power supplies 1226, one or more wired or wireless network interfaces 1250, one or more input-output interfaces 1258, and/or one or more operating systems 1241, such as Windows Server™, Mac OS X™, Unix™, Linux™, FreeBSD™, etc.
The steps performed by the server in the above embodiment may be based on the server structure shown in fig. 12.
The CPU 1222 is configured to perform the following steps:
acquiring a game image when a game object participates in a game, and judging whether an opponent object of the game object exists in the game image or not;
if not, inputting the game image into a first strategy model, acquiring an action probability vector output by the first strategy model, selecting an action with the highest probability as a target action according to the action probability vector, and controlling the game object to execute the target action so as to realize in-game map exploration; the first strategy model is a deep neural network model obtained by utilizing an imitation learning algorithm to learn offline;
if yes, inputting the game image into a second strategy model, obtaining an action value vector output by the second strategy model, selecting an action with the maximum value as a target action according to the action value vector, and controlling the game object to execute the target action so as to realize fighting with an opponent object; the second strategy model is a deep neural network model obtained by utilizing a deep reinforcement learning algorithm for online learning.
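Read together, the two steps above amount to a per-frame routing loop. The sketch below shows one possible form of it, where capture_game_image, opponent_present, and perform_action are hypothetical game-side helpers supplied by the caller, and both models are assumed to be PyTorch modules.

```python
# A minimal sketch of the routing logic; all helper callables are hypothetical
# stand-ins for the game-side interfaces, not APIs defined by the patent.
import torch

def control_step(first_policy_model, second_policy_model, actions,
                 capture_game_image, opponent_present, perform_action):
    image = capture_game_image()                              # current game frame, tensor (C, H, W)
    if not opponent_present(image):
        probs = first_policy_model(image.unsqueeze(0))        # action probability vector
        target_action = actions[int(torch.argmax(probs))]     # highest probability: explore the map
    else:
        q_values = second_policy_model(image.unsqueeze(0))    # action value vector
        target_action = actions[int(torch.argmax(q_values))]  # highest value: fight the opponent
    perform_action(target_action)
    return target_action
```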
Optionally, the CPU 1222 is further configured to execute the steps of any one implementation manner of a control method of a game object in the embodiment of the present application.
The embodiment of the present application further provides another control device for a game object, which may be a terminal device. As shown in fig. 13, for convenience of description, only the parts related to the embodiment of the present application are shown; for specific technical details that are not disclosed, please refer to the method part of the embodiment of the present application. The terminal device may be any terminal device including a mobile phone, a tablet computer, a Personal Digital Assistant (PDA), a Point of Sales (POS) terminal, a vehicle-mounted computer, and the like. The following takes the terminal device being a mobile phone as an example:
fig. 13 is a block diagram illustrating a partial structure of a mobile phone related to a terminal device provided in an embodiment of the present application. Referring to fig. 13, the handset includes: radio Frequency (RF) circuit 1310, memory 1320, input unit 1330, display unit 1340, sensor 1350, audio circuit 1360, wireless fidelity (WiFi) module 1370, processor 1380, and power supply 1390. Those skilled in the art will appreciate that the handset configuration shown in fig. 13 is not intended to be limiting and may include more or fewer components than those shown, or some components may be combined, or a different arrangement of components.
The following describes each component of the mobile phone in detail with reference to fig. 13:
RF circuit 1310 may be used for receiving and transmitting signals during a message transmission or a call. In particular, downlink information received from a base station is forwarded to the processor 1380 for processing, and uplink data is transmitted to the base station. In general, RF circuit 1310 includes, but is not limited to, an antenna, at least one amplifier, a transceiver, a coupler, a Low Noise Amplifier (LNA), a duplexer, and the like. In addition, RF circuit 1310 may also communicate with networks and other devices via wireless communication. The wireless communication may use any communication standard or protocol, including but not limited to Global System for Mobile communications (GSM), General Packet Radio Service (GPRS), Code Division Multiple Access (CDMA), Wideband Code Division Multiple Access (WCDMA), Long Term Evolution (LTE), e-mail, Short Message Service (SMS), and so on.
The memory 1320 may be used to store software programs and modules, and the processor 1380 executes various functional applications and data processing of the mobile phone by running the software programs and modules stored in the memory 1320. The memory 1320 may mainly include a program storage area and a data storage area, where the program storage area may store an operating system, an application program required by at least one function (such as a sound playing function, an image playing function, etc.), and the like, and the data storage area may store data (such as audio data, a phonebook, etc.) created according to the use of the mobile phone. Further, the memory 1320 may include high-speed random access memory and may also include non-volatile memory, such as at least one magnetic disk storage device, flash memory device, or other non-volatile solid-state storage device.
The input unit 1330 may be used to receive input numeric or character information and generate key signal inputs related to user settings and function control of the mobile phone. Specifically, the input unit 1330 may include a touch panel 1331 and other input devices 1332. The touch panel 1331, also referred to as a touch screen, can collect touch operations by a user on or near it (for example, operations performed by the user on or near the touch panel 1331 using any suitable object or accessory such as a finger or a stylus) and drive the corresponding connection device according to a preset program. Optionally, the touch panel 1331 may include two parts: a touch detection device and a touch controller. The touch detection device detects the touch position of the user, detects the signal brought by the touch operation, and transmits the signal to the touch controller; the touch controller receives the touch information from the touch detection device, converts it into touch point coordinates, and sends the coordinates to the processor 1380, and can also receive and execute commands sent by the processor 1380. In addition, the touch panel 1331 may be implemented in various types, such as resistive, capacitive, infrared, and surface acoustic wave. Besides the touch panel 1331, the input unit 1330 may include other input devices 1332. In particular, other input devices 1332 may include, but are not limited to, one or more of a physical keyboard, function keys (such as volume control keys, switch keys, etc.), a trackball, a mouse, a joystick, and the like.
The display unit 1340 may be used to display information input by the user or information provided to the user, as well as various menus of the mobile phone. The display unit 1340 may include a display panel 1341, and optionally, the display panel 1341 may be configured in the form of a Liquid Crystal Display (LCD), an Organic Light-Emitting Diode (OLED), or the like. Further, the touch panel 1331 may cover the display panel 1341. When the touch panel 1331 detects a touch operation on or near it, the operation is transmitted to the processor 1380 to determine the type of touch event, and the processor 1380 then provides a corresponding visual output on the display panel 1341 according to the type of touch event. Although in fig. 13 the touch panel 1331 and the display panel 1341 are two independent components implementing the input and output functions of the mobile phone, in some embodiments the touch panel 1331 and the display panel 1341 may be integrated to implement the input and output functions of the mobile phone.
The handset may also include at least one sensor 1350, such as light sensors, motion sensors, and other sensors. Specifically, the light sensor may include an ambient light sensor that adjusts the brightness of the display panel 1341 according to the brightness of ambient light, and a proximity sensor that turns off the display panel 1341 and/or the backlight when the mobile phone is moved to the ear. As one of the motion sensors, the accelerometer sensor can detect the magnitude of acceleration in each direction (generally, three axes), can detect the magnitude and direction of gravity when stationary, and can be used for applications of recognizing the posture of a mobile phone (such as horizontal and vertical screen switching, related games, magnetometer posture calibration), vibration recognition related functions (such as pedometer and tapping), and the like; as for other sensors such as a gyroscope, a barometer, a hygrometer, a thermometer, and an infrared sensor, which can be configured on the mobile phone, further description is omitted here.
The audio circuit 1360, speaker 1361, and microphone 1362 may provide an audio interface between the user and the mobile phone. The audio circuit 1360 may transmit the electrical signal converted from the received audio data to the speaker 1361, where it is converted into a sound signal and output; on the other hand, the microphone 1362 converts the collected sound signal into an electrical signal, which is received by the audio circuit 1360 and converted into audio data; the audio data is then output to the processor 1380 for processing and sent via the RF circuit 1310 to, for example, another mobile phone, or output to the memory 1320 for further processing.
WiFi belongs to short-distance wireless transmission technology, and the mobile phone can help a user to receive and send e-mails, browse webpages, access streaming media and the like through the WiFi module 1370, and provides wireless broadband internet access for the user. Although fig. 13 shows the WiFi module 1370, it is understood that it does not belong to the essential constitution of the handset, and may be omitted entirely as needed within the scope not changing the essence of the invention.
The processor 1380 is a control center of the mobile phone, connects various parts of the entire mobile phone using various interfaces and lines, and performs various functions of the mobile phone and processes data by operating or executing software programs and/or modules stored in the memory 1320 and calling data stored in the memory 1320, thereby integrally monitoring the mobile phone. Optionally, processor 1380 may include one or more processing units; preferably, the processor 1380 may integrate an application processor, which handles primarily operating systems, user interfaces, application programs, etc., and a modem processor, which handles primarily wireless communications. It will be appreciated that the modem processor described above may not be integrated within processor 1380.
The mobile phone also includes a power supply 1390 (e.g., a battery) for supplying power to the various components. Preferably, the power supply is logically connected to the processor 1380 via a power management system, so that functions such as charging, discharging, and power consumption management are implemented through the power management system.
Although not shown, the mobile phone may further include a camera, a bluetooth module, etc., which are not described herein.
In the embodiment of the present application, the processor 1380 included in the terminal further has the following functions:
acquiring a game image when a game object participates in a game, and judging whether an opponent object of the game object exists in the game image or not;
if not, inputting the game image into a first strategy model, acquiring an action probability vector output by the first strategy model, selecting an action with the highest probability as a target action according to the action probability vector, and controlling the game object to execute the target action so as to realize in-game map exploration; the first strategy model is a deep neural network model obtained by utilizing an imitation learning algorithm to learn offline;
if yes, inputting the game image into a second strategy model, obtaining an action value vector output by the second strategy model, selecting an action with the maximum value as a target action according to the action value vector, and controlling the game object to execute the target action so as to realize fighting with an opponent object; the second strategy model is a deep neural network model obtained by utilizing a deep reinforcement learning algorithm for online learning.
Optionally, the processor 1380 is further configured to execute the steps of any one implementation of a method for controlling a game object in the embodiment of the present application.
An embodiment of the present application further provides a computer-readable storage medium for storing a program code, where the program code is configured to execute any one implementation of the control method for a game object described in the foregoing embodiments.
Embodiments of the present application further provide a computer program product including instructions, which when run on a computer, cause the computer to execute any one of the implementation manners of the control method for a game object described in the foregoing embodiments.
It is clear to those skilled in the art that, for convenience and brevity of description, the specific working processes of the above-described systems, apparatuses and units may refer to the corresponding processes in the foregoing method embodiments, and are not described herein again.
In the several embodiments provided in the present application, it should be understood that the disclosed system, apparatus and method may be implemented in other manners. For example, the above-described apparatus embodiments are merely illustrative, and for example, the division of the units is only one logical division, and other divisions may be realized in practice, for example, a plurality of units or components may be combined or integrated into another system, or some features may be omitted, or not executed. In addition, the shown or discussed mutual coupling or direct coupling or communication connection may be an indirect coupling or communication connection through some interfaces, devices or units, and may be in an electrical, mechanical or other form.
The units described as separate parts may or may not be physically separate, and parts displayed as units may or may not be physical units, may be located in one place, or may be distributed on a plurality of network units. Some or all of the units can be selected according to actual needs to achieve the purpose of the solution of the embodiment.
In addition, functional units in the embodiments of the present application may be integrated into one processing unit, or each unit may exist alone physically, or two or more units are integrated into one unit. The integrated unit can be realized in a form of hardware, and can also be realized in a form of a software functional unit.
The integrated unit, if implemented in the form of a software functional unit and sold or used as a stand-alone product, may be stored in a computer-readable storage medium. Based on such understanding, the technical solution of the present application, in essence, or the part contributing to the prior art, or all or part of the technical solution, may be embodied in the form of a software product. The computer software product is stored in a storage medium and includes instructions for causing a computer device (which may be a personal computer, a server, or a network device) to execute all or part of the steps of the methods described in the embodiments of the present application. The aforementioned storage medium includes: a USB flash drive, a removable hard disk, a Read-Only Memory (ROM), a Random Access Memory (RAM), a magnetic disk, an optical disk, or any other medium capable of storing program code.
It should be understood that in the present application, "at least one" means one or more, "a plurality" means two or more. "and/or" for describing an association relationship of associated objects, indicating that there may be three relationships, e.g., "a and/or B" may indicate: only A, only B and both A and B are present, wherein A and B may be singular or plural. The character "/" generally indicates that the former and latter associated objects are in an "or" relationship. "at least one of the following" or similar expressions refer to any combination of these items, including any combination of single item(s) or plural items. For example, at least one (one) of a, b, or c, may represent: a, b, c, "a and b", "a and c", "b and c", or "a and b and c", wherein a, b, c may be single or plural.
The above embodiments are only used for illustrating the technical solutions of the present application, and not for limiting the same; although the present application has been described in detail with reference to the foregoing embodiments, it should be understood by those of ordinary skill in the art that: the technical solutions described in the foregoing embodiments may still be modified, or some technical features may be equivalently replaced; and such modifications or substitutions do not depart from the spirit and scope of the corresponding technical solutions in the embodiments of the present application.

Claims (12)

1. A method of controlling a game object, comprising:
acquiring a game image when a game object participates in a game, and judging whether an opponent object of the game object exists in the game image or not;
if not, inputting the game image into a first strategy model, acquiring an action probability vector output by the first strategy model, selecting an action with the highest probability as a target action according to the action probability vector, and controlling the game object to execute the target action so as to realize in-game map exploration; the first strategy model is a deep neural network model obtained by utilizing an imitation learning algorithm to learn offline;
if yes, inputting the game image into a second strategy model, obtaining an action value vector output by the second strategy model, selecting an action with the maximum value as a target action according to the action value vector, and controlling the game object to execute the target action so as to realize fighting with an opponent object; the second strategy model is a deep neural network model obtained by utilizing a deep reinforcement learning algorithm for online learning.
2. The method of claim 1, further comprising:
extracting samples from a game player operation video to generate a sample set, each sample in the sample set including a frame image and an action tag corresponding thereto, the action tag identifying an action performed by a game object in the game image;
and training the initial deep neural network model according to the sample set by using an imitation learning algorithm to obtain the deep neural network model meeting the training end condition as the first strategy model.
3. The method of claim 2, wherein extracting samples from the video of game player operations to generate a sample set comprises:
identifying frame images in the operation video of the game player by using an image identification algorithm to obtain action labels corresponding to the frame images;
taking each frame of image in the game player operation video and the corresponding action label as a sample;
a sample set is generated from the samples.
4. The method of claim 3, further comprising:
comparing each frame image in the sample set with the frame image immediately preceding it, and comparing the action tag of each frame image in the sample set with the action tag of the preceding frame image;
and if a frame image is similar to its preceding frame image but their action tags are inconsistent, changing the action tag corresponding to that frame image in the sample set to the action tag corresponding to the preceding frame image.
5. The method of claim 4, further comprising:
if a certain frame of image has no corresponding action and the action label corresponding to the previous frame of image identifies the action of the specified type, changing the action label corresponding to the certain frame of image in the sample set into the action label corresponding to the previous frame of image;
and if a certain frame image does not have corresponding action and the action label corresponding to the previous frame image identifies a non-specified type action, deleting the sample corresponding to the certain frame image from the sample set.
6. The method of claim 1, wherein the first policy model comprises 6 convolutional layers, 3 fully-connected layers, and 1 Softmax layer.
7. The method of claim 1, wherein the imitation learning algorithm comprises a direct imitation learning algorithm based on behavioral cloning.
8. The method of claim 1, wherein the second strategy model is trained by:
collecting game images of the game object when the game object participates in the game, determining the action implemented by the game object in each game image, and determining, according to a return function, the return value obtained by the game object for implementing the action; the return function specifies that a first return value is given if the game object is hit by the opponent object, a second return value is given if the game object hits the opponent object, and a third return value is given if the game object kills the opponent object, where the first return value is a negative number and the second return value is a positive number smaller than the third return value;
generating a set of samples, each sample in the set of samples comprising: one frame of game image, an action label corresponding to the one frame of game image, a return value corresponding to the one frame of game image and a next frame of game image adjacent to the one frame of game image;
and training network parameters of a second strategy model according to the samples in the sample set.
9. The method of claim 1, wherein the game is a first person shooter game.
10. A control device for a game object, comprising:
the judgment module is used for acquiring a game image when a game object participates in a game and judging whether an opponent object of the game object exists in the game image or not;
the exploration module is used for, if not, inputting the game image into a first strategy model, acquiring an action probability vector output by the first strategy model, selecting the action with the highest probability as a target action according to the action probability vector, and controlling the game object to execute the target action so as to realize in-game map exploration; the first strategy model is a deep neural network model obtained by offline learning using an imitation learning algorithm;
and the fighting module is used for, if yes, inputting the game image into a second strategy model, obtaining an action value vector output by the second strategy model, selecting the action with the maximum value as a target action according to the action value vector, and controlling the game object to execute the target action so as to realize fighting with the opponent object; the second strategy model is a deep neural network model obtained by online learning using a deep reinforcement learning algorithm.
11. An apparatus for controlling a game object, the apparatus comprising a processor and a memory:
the memory is used for storing program codes and transmitting the program codes to the processor;
the processor is configured to execute the control method of a game object according to any one of claims 1 to 9 according to an instruction in the program code.
12. A computer-readable storage medium characterized in that the computer-readable storage medium stores a program code for executing the control method of a game object according to any one of claims 1 to 9.
CN201810942957.6A 2018-08-17 2018-08-17 Game object control method, device, medium and equipment Active CN109107161B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201810942957.6A CN109107161B (en) 2018-08-17 2018-08-17 Game object control method, device, medium and equipment

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201810942957.6A CN109107161B (en) 2018-08-17 2018-08-17 Game object control method, device, medium and equipment

Publications (2)

Publication Number Publication Date
CN109107161A CN109107161A (en) 2019-01-01
CN109107161B true CN109107161B (en) 2019-12-27

Family

ID=64853347

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201810942957.6A Active CN109107161B (en) 2018-08-17 2018-08-17 Game object control method, device, medium and equipment

Country Status (1)

Country Link
CN (1) CN109107161B (en)

Families Citing this family (18)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN110025959B (en) * 2019-01-25 2021-08-10 清华大学 Method and apparatus for controlling an agent
CN110032359B (en) * 2019-02-03 2021-06-11 清华大学 Model acquisition method and device for 3D application, medium and computing device
CN109731338B (en) * 2019-03-01 2022-06-21 网易(杭州)网络有限公司 Artificial intelligence training method and device in game, storage medium and electronic device
CN109960545B (en) * 2019-03-29 2022-09-13 网易(杭州)网络有限公司 Virtual object control method, system, device, medium and electronic equipment
CN110064205B (en) * 2019-04-24 2023-02-17 腾讯科技(深圳)有限公司 Data processing method, apparatus and medium for game
CN110119815B (en) * 2019-05-21 2021-08-13 深圳市腾讯网域计算机网络有限公司 Model training method, device, storage medium and equipment
CN110141860A (en) * 2019-05-31 2019-08-20 深圳市腾讯网域计算机网络有限公司 Sports behavior determines method, apparatus, computer equipment and storage medium
CN110193192A (en) * 2019-06-03 2019-09-03 深圳市腾讯网域计算机网络有限公司 A kind of automated game method and apparatus
CN110339569B (en) * 2019-07-08 2022-11-08 深圳市腾讯网域计算机网络有限公司 Method and device for controlling virtual role in game scene
CN110443284B (en) * 2019-07-15 2022-04-05 超参数科技(深圳)有限公司 Artificial intelligence AI model training method, calling method, server and readable storage medium
CN110465089B (en) * 2019-07-29 2021-10-22 腾讯科技(深圳)有限公司 Map exploration method, map exploration device, map exploration medium and electronic equipment based on image recognition
CN110659023B (en) * 2019-09-11 2020-10-23 腾讯科技(深圳)有限公司 Method for generating programming content and related device
CN111265881B (en) * 2020-01-21 2021-06-22 腾讯科技(深圳)有限公司 Model training method, content generation method and related device
CN111359212A (en) * 2020-02-20 2020-07-03 网易(杭州)网络有限公司 Game object control and model training method and device
CN111437607B (en) * 2020-03-20 2023-08-18 腾讯科技(深圳)有限公司 Image processing method and device, electronic equipment and computer readable storage medium
CN112843726A (en) * 2021-03-15 2021-05-28 网易(杭州)网络有限公司 Intelligent agent processing method and device
CN113438510A (en) * 2021-06-24 2021-09-24 湖南快乐阳光互动娱乐传媒有限公司 Method and playing system for realizing interactive video watching by multiple persons
CN113609446A (en) * 2021-07-30 2021-11-05 北京果仁互动科技有限公司 Motion parameter determination method and device

Family Cites Families (8)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US9672389B1 (en) * 2012-06-26 2017-06-06 The Mathworks, Inc. Generic human machine interface for a graphical model
EP4206870A1 (en) * 2014-06-14 2023-07-05 Magic Leap, Inc. Method for updating a virtual world
CN106023065B (en) * 2016-05-13 2019-02-19 中国矿业大学 A kind of tensor type high spectrum image spectral-spatial dimension reduction method based on depth convolutional neural networks
CN106250921A (en) * 2016-07-26 2016-12-21 北京小米移动软件有限公司 Image processing method and device
CN106446946B (en) * 2016-09-22 2020-07-21 北京小米移动软件有限公司 Image recognition method and device
CN106548169B (en) * 2016-11-02 2019-04-23 重庆中科云从科技有限公司 Fuzzy literal Enhancement Method and device based on deep neural network
CN107832836B (en) * 2017-11-27 2020-04-21 清华大学 Model-free deep reinforcement learning exploration method and device
CN108236784B (en) * 2018-01-22 2021-09-24 腾讯科技(深圳)有限公司 Model training method and device, storage medium and electronic device

Also Published As

Publication number Publication date
CN109107161A (en) 2019-01-01

Similar Documents

Publication Publication Date Title
CN109107161B (en) Game object control method, device, medium and equipment
CN108434740B (en) Method and device for determining policy information and storage medium
CN111773696B (en) Virtual object display method, related device and storage medium
CN109893857B (en) Operation information prediction method, model training method and related device
CN109499068B (en) Object control method and device, storage medium and electronic device
CN108236785B (en) Method and device for acquiring object information
CN111282279B (en) Model training method, and object control method and device based on interactive application
CN110738211A (en) object detection method, related device and equipment
CN111985640A (en) Model training method based on reinforcement learning and related device
CN113018848B (en) Game picture display method, related device, equipment and storage medium
CN111598169B (en) Model training method, game testing method, simulation operation method and simulation operation device
CN111672109B (en) Game map generation method, game testing method and related device
CN110841295B (en) Data processing method based on artificial intelligence and related device
CN109740738B (en) Neural network model training method, device, equipment and medium
CN108965989B (en) Processing method and device for interactive application scene and storage medium
CN109271038A (en) Candidate words recommending method, terminal and computer readable storage medium
CN109107159B (en) Method, device, equipment and medium for configuring application object attributes
CN107754316B (en) Information exchange processing method and mobile terminal
CN108815850B (en) Method and client for controlling path finding of analog object
CN106445710A (en) Method for determining interactive type object and equipment thereof
CN110448909B (en) Method and device for outputting result of target role in application and medium
CN109331465A (en) Game object control method, mobile terminal and computer readable storage medium
CN115171196A (en) Face image processing method, related device and storage medium
CN113599825A (en) Method and related device for updating virtual resources in game match
CN110193192A (en) A kind of automated game method and apparatus

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant