CN112791394B - Game model training method and device, electronic equipment and storage medium


Info

Publication number
CN112791394B
Authority
China (CN)
Prior art keywords
game, network, parameters, sub, determining
Legal status
Active
Application number
CN202110145344.1A
Other languages
Chinese (zh)
Other versions
CN112791394A
Inventor
杨敬文 (Yang Jingwen)
Current Assignee
Tencent Technology Shenzhen Co Ltd
Original Assignee
Tencent Technology Shenzhen Co Ltd
Application filed by Tencent Technology Shenzhen Co Ltd
Priority to CN202110145344.1A
Publication of CN112791394A
Application granted
Publication of CN112791394B


Classifications

    • A: HUMAN NECESSITIES
    • A63: SPORTS; GAMES; AMUSEMENTS
    • A63F: CARD, BOARD, OR ROULETTE GAMES; INDOOR GAMES USING SMALL MOVING PLAYING BODIES; VIDEO GAMES; GAMES NOT OTHERWISE PROVIDED FOR
    • A63F13/00: Video games, i.e. games using an electronically generated display having two or more dimensions
    • A63F13/45: Controlling the progress of the video game
    • A63F13/60: Generating or modifying game content before or while executing the game program, e.g. authoring tools specially adapted for game development or game-integrated level editor

Abstract

The invention provides a game model training method and apparatus, an electronic device, and a storage medium. The method includes: acquiring an action information set and a state information set in the game environment where a target object is located; determining reward parameters matching the initial parameters of a strategy generation sub-network by executing the action information in the action information set; updating the initial parameters of the strategy generation sub-network; determining, through a state evaluation sub-network, evaluation value signal parameters matching the state information; and updating the initial parameters of the strategy generation sub-network and of the state evaluation sub-network respectively according to the evaluation value signal parameters. In this way, the accuracy of the game model is effectively guaranteed, game strategies of complex dimensionality can be processed more quickly, and the game strategy can be adjusted in a timely and accurate manner, while the computational cost is reduced and the efficiency of game strategy generation is improved.

Description

Game model training method and device, electronic equipment and storage medium
Technical Field
The present invention relates to information processing technologies, and in particular, to a game model training method and apparatus, an electronic device, and a storage medium.
Background
Artificial Intelligence (AI) is a comprehensive technique in computer science: by studying the design principles and implementation methods of various intelligent machines, the machines acquire the functions of perception, reasoning, and decision-making. Artificial intelligence technology is a comprehensive discipline covering a wide range of fields, for example natural language processing and machine learning/deep learning, and it is believed that, with the development of the technology, artificial intelligence will be applied in more fields and play an increasingly important role.
Real-time games generally feature complex game rules, dynamic and variable scenes, uncertain behavior outcomes, incomplete information, and short decision times. Faced with such a huge decision space and real-time decision requirements, how to formulate, select, and execute strategies is the most important problem for game intelligence systems. For example, in Multiplayer Online Battle Arena (MOBA) games such as League of Legends and Honor of Kings, or racing games such as QQ Speed, the game mechanism is complex and close to real-world scenes. In a MOBA game, players must confront and cooperate with a large number of game units, so game scenes are diversified, and the abundance of game units increases the learning complexity of the game AI strategy module. How to determine an accurate game strategy in a game confrontation carrying such a large amount of information is therefore the key to improving game AI capability; in the related art, the time cost of simulating long game periods while searching for an optimal game strategy is high, which harms the experience of game users.
Disclosure of Invention
In view of this, embodiments of the present invention provide a game model training method and apparatus, an electronic device, and a storage medium. The technical solutions of the embodiments of the present invention are implemented as follows:
the embodiment of the invention provides a game model training method, which comprises the following steps:
acquiring an action information set and a state information set in a game environment where a target object is located;
determining initial parameters of a strategy generation sub-network and initial parameters of a state evaluation sub-network in the game model;
when the state information in the state information set changes, determining reward parameters matched with the initial parameters of the strategy generation sub-network by executing the action information in the action information set;
updating the initial parameters of the strategy generation sub-network based on the reward parameters matched with the initial parameters of the strategy generation sub-network;
determining, by the state evaluation sub-network, an evaluation value signal parameter matching the state information in response to the changed state information;
and updating the initial parameters of the strategy generation sub-network and the initial parameters of the state evaluation sub-network respectively according to the evaluation value signal parameters so as to determine the network parameters of the strategy generation sub-network and the network parameters of the state evaluation sub-network in the game model.
The embodiment of the invention also provides a game model training device, which comprises:
the information transmission module is used for acquiring an action information set and a state information set in a game environment where the target object is located;
the information processing module is used for determining initial parameters of a strategy generation sub-network and initial parameters of a state evaluation sub-network in the game model;
the information processing module is used for determining reward parameters matched with the initial parameters of the strategy generation sub-network by executing the action information in the action information set when the state information in the state information set changes;
the information processing module is used for updating the initial parameters of the strategy generation sub-network based on the reward parameters matched with the initial parameters of the strategy generation sub-network;
the information processing module is used for responding to the changed state information and determining an evaluation value signal parameter matched with the state information through the state evaluation sub-network;
and the information processing module is used for respectively updating the initial parameters of the strategy generation sub-network and the initial parameters of the state evaluation sub-network according to the evaluation value signal parameters so as to determine the network parameters of the strategy generation sub-network and the network parameters of the state evaluation sub-network in the game model.
In the above scheme,
the information processing module is used for determining a sample acquisition mode matched with the game environment according to the game environment where the target object is located;
the information processing module is used for determining a priority threshold value matched with the game environment according to the determined sample acquisition mode;
the information processing module is configured to sample the action information and the state information in the game environment where the target object is located respectively based on the priority threshold matched with the game environment, and determine an action information set and a state information set in the game environment where the target object is located.
In the above scheme,
the information processing module is used for determining an execution strategy matched with the changed state information through the strategy generation sub-network when the state information in the state information set changes;
the information processing module is used for determining corresponding action information in the action information set according to the determined execution strategy;
the information processing module is used for executing the action information in the game environment where the target object is located;
and the information processing module is used for determining reward parameters matched with the initial parameters of the strategy generation sub-network based on the execution result of the action information.
In the above scheme,
the information processing module is used for determining the consumed time of the racing game when the action information is executed when the game environment where the target object is located is the racing game;
and the information processing module is used for determining the reward parameters matched with the initial parameters of the strategy generation sub-network according to the consumption time of the racing game.
In the above scheme,
the information processing module is used for comparing the reward parameters matched with the initial parameters of the strategy generation sub-network with a first reward parameter threshold value;
and the information processing module is used for updating the initial parameters of the strategy generation sub-network when the reward parameters are determined to reach a first reward parameter threshold value.
In the above scheme,
the information processing module is used for determining the evaluation value signal parameter matched with the state information based on the initial parameter and the evaluation expression function of the state evaluation sub-network;
the information processing module is used for determining the update parameters of the strategy generation sub-network according to the strategy generation function of the strategy generation sub-network based on the evaluation value signal parameters,
the information processing module is used for updating the initial parameters of the strategy generation sub-network based on the updated parameters of the strategy generation sub-network;
the information processing module is used for determining an updating parameter of the state evaluation sub-network based on the evaluation value signal parameter;
and the information processing module is used for updating the initial parameters of the state evaluation sub-network according to the updated parameters of the state evaluation sub-network.
In the above scheme,
the information processing module is used for monitoring the action execution result of the game model in the game environment where the target object is located;
the information processing module is used for determining the consumed time of the target action execution result;
and the information processing module is used for adjusting the training samples of the game model when the consumption time of the target action execution result is less than a second reward parameter threshold value.
In the above scheme,
the information processing module is used for determining the historical parameters of the target object according to the type of the game environment where the target object is located;
the information processing module is used for determining a training sample set matched with the game model based on the historical parameters of the target object, wherein the training sample set comprises at least one group of training samples;
the information processing module is used for extracting a training sample set matched with the training sample through a noise threshold value matched with the game model;
and the information processing module is used for training the game model according to the training sample set matched with the training sample.
In the above scheme,
the information processing module is used for determining a multitask loss function matched with the game model;
the information processing module is used for adjusting parameters of a strategy generation sub-network and network parameters of a state evaluation sub-network in the game model based on the evaluation value signal parameters and the multitask loss function until loss functions of different dimensions corresponding to the game model reach corresponding convergence conditions; so as to realize that the parameters of the game model are matched with the game environment.
In the above scheme,
the information processing module is used for determining a dynamic noise threshold value matched with the use environment of the game model when the game environment of the target object is a role playing game;
the information processing module is used for carrying out noise removal processing on the first training sample set according to the dynamic noise threshold value so as to form a second training sample set matched with the dynamic noise threshold value;
and the information processing module is used for determining a fixed noise threshold corresponding to the game model when the game environment of the target object is a battle game, and performing denoising processing on the first training sample set according to the fixed noise threshold to form a second training sample set matched with the fixed noise threshold.
In the above scheme,
the information processing module is used for presenting a virtual target object in the user interface of the game environment where the target object is located when a control component in the game environment is triggered,
and the information processing module is used for presenting, in the user interface and by triggering the trained game model, the corresponding game interaction instruction executed by the virtual target object.
An embodiment of the present invention further provides an electronic device, where the electronic device includes:
a memory for storing executable instructions;
and the processor is used for realizing the game model training method when the executable instructions stored in the memory are run.
The embodiment of the invention also provides a computer-readable storage medium, which stores executable instructions, and the executable instructions are executed by a processor to realize the game model training method.
The embodiment of the invention has the following beneficial effects:
the method comprises the steps of acquiring an action information set and a state information set in a game environment where a target object is located; determining initial parameters of a strategy generation sub-network and initial parameters of a state evaluation sub-network in the game model; when the state information in the state information set changes, determining reward parameters matched with the initial parameters of the strategy generation sub-network by executing the action information in the action information set; updating the initial parameters of the strategy generation sub-network based on the reward parameters matched with the initial parameters of the strategy generation sub-network; determining, by the state evaluation sub-network, an evaluation value signal parameter matching the state information in response to the changed state information; and updating the initial parameters of the strategy generation sub-network and the initial parameters of the state evaluation sub-network respectively according to the evaluation value signal parameters so as to determine the network parameters of the strategy generation sub-network and the network parameters of the state evaluation sub-network in the game model. Therefore, the accuracy of the game model can be effectively guaranteed, the convergence speed of the game model can be improved, the efficiency of generating the game strategy can be improved, the game strategy with complex dimensionality can be processed more quickly, the game strategy can be adjusted timely and accurately, robustness and generalization performance are achieved for action spaces of different game environments, the calculation cost is reduced, the efficiency of generating the game strategy is improved, and the game strategy with complex dimensionality can be processed.
Drawings
FIG. 1 is a schematic view of a use scenario of a game model training method according to an embodiment of the present invention;
FIG. 2 is a schematic diagram of a structure of a game model training device according to an embodiment of the present invention;
FIG. 3 is a schematic diagram of an alternative process for generating a game strategy by AI technology in accordance with an embodiment of the invention;
FIG. 4 is a schematic flow chart of an alternative method for training a game model according to an embodiment of the present invention;
FIG. 5 is a schematic diagram of information acquisition of a game model training method according to an embodiment of the present invention;
FIG. 6 is a diagram illustrating the operation of a state evaluation sub-network according to an embodiment of the present invention;
FIG. 7 is a diagram illustrating a model structure of a game model according to an embodiment of the present invention;
FIG. 8 is a schematic view of a game model applied to a racing game according to an embodiment of the present invention;
fig. 9 is an alternative flowchart of a game model application process according to an embodiment of the present invention.
Detailed Description
In order to make the objects, technical solutions, and advantages of the present invention clearer, the present invention is further described in detail below with reference to the accompanying drawings. The described embodiments should not be construed as limiting the present invention, and all other embodiments obtained by a person of ordinary skill in the art without creative effort shall fall within the protection scope of the present invention.
In the following description, reference is made to "some embodiments" which describe a subset of all possible embodiments, but it is understood that "some embodiments" may be the same subset or different subsets of all possible embodiments, and may be combined with each other without conflict.
Before further describing the embodiments of the present invention in detail, the terms and expressions used in the embodiments of the present invention are explained; the following interpretations apply to these terms and expressions.
1) In response to: indicates the condition or state on which a performed operation depends. When the dependent condition or state is satisfied, the one or more operations performed may be carried out in real time or with a set delay; unless otherwise specified, there is no restriction on the order in which they are performed.
2) Based on: indicates the condition or state on which a performed operation depends. When that condition or state is satisfied, the one or more operations performed may be carried out in real time or with a set delay; unless otherwise specified, there is no restriction on the order in which they are performed.
3) Model training: multi-class learning is performed on an image data set. The model can be constructed with deep learning frameworks such as TensorFlow or Torch, combining multiple neural network layers (such as CNN layers) into a multi-classification model. The input of the model is a three-channel or original-channel matrix obtained by reading an image with tools such as OpenCV; the output of the model is a multi-class probability, and the final category is output through algorithms such as softmax. During training, the model is driven toward the correct result by an objective function such as cross entropy (a minimal sketch of such a training step is given after this terminology section).
4) Neural Network (NN): an Artificial Neural Network (ANN), neural network for short, is a mathematical or computational model in the fields of machine learning and cognitive science that imitates the structure and function of biological neural networks (the central nervous system of animals, especially the brain) and is used to estimate or approximate functions.
5) Game environment: the game environment displayed (or provided) when an application runs on a terminal. The game environment can be a simulation of the real world, a semi-simulated and semi-fictional three-dimensional environment, or a purely fictional three-dimensional environment. The game environment may be any one of a two-dimensional, 2.5-dimensional, or three-dimensional game environment; the following embodiments take a three-dimensional game environment as an example, without limitation. Optionally, the game environment is also used for a game battle between at least two virtual objects. Optionally, the game environment is further used for a battle between at least two virtual objects using virtual firearms. Alternatively, and without limitation, the game may be a gun-battle game, a parkour (running) game, a Multiplayer Online Battle Arena game (MOBA), a racing game (RCG), or a sports game (SPG). The trained game model provided by the application can be deployed in the game servers corresponding to various game scenes to generate real-time game strategies, execute the corresponding action information, simulate the operations of a virtual user, and complete different types of games in the game environment together with the users actually participating in the game.
6) Action information: taking as an example a game user participating in a speed competition from the first-person or third-person perspective (including racing games such as car racing and flying games), the action information refers to operation commands, such as direction keys, that control the racing car. For a role-playing game, the action information refers to, for example, operating a virtual weapon that attacks by shooting bullets in the game environment, or a virtual bow or virtual slingshot that shoots arrows; a virtual object can pick up a virtual firearm in the game environment and attack with it.
Alternatively, the virtual object may be a user virtual object controlled through operations on the client, an Artificial Intelligence (AI) trained and set in the game environment battle, or a Non-Player Character (NPC) set in the game environment interaction. Alternatively, the virtual object may be a virtual character competing in the game environment. Optionally, the number of virtual objects participating in the interaction in the game environment may be preset or dynamically determined according to the number of clients participating in the interaction.
Taking a shooting game as an example, the user may control the virtual object to execute corresponding action information at different times in the game environment: free-falling from the sky, gliding, or opening a parachute to land; running, jumping, crawling, or crouching forward on land; or swimming, floating, or diving in the sea. Naturally, the user may also control the virtual object to ride a virtual vehicle, such as a virtual car, a virtual aircraft, or a virtual yacht, to move in the game environment; the above scenarios are merely examples, and the present invention is not limited thereto. The user can also control the virtual object to interact with other virtual objects, for example in battle, through a virtual weapon, which may be a cold weapon or a hot weapon; the type of the virtual weapon is not specifically limited by the present invention.
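For illustration only, the model training pipeline described in term 3) above can be sketched in a few lines of PyTorch; the network shape, class count, and all names below are assumptions, not part of the patent:

    import torch
    import torch.nn as nn

    # a small CNN classifier of the kind described above (shape illustrative)
    model = nn.Sequential(
        nn.Conv2d(3, 16, kernel_size=3, padding=1),  # three-channel image input
        nn.ReLU(),
        nn.AdaptiveAvgPool2d(1),
        nn.Flatten(),
        nn.Linear(16, 10),                           # logits for 10 assumed classes
    )
    criterion = nn.CrossEntropyLoss()                # the cross-entropy objective
    optimizer = torch.optim.SGD(model.parameters(), lr=0.01)

    def train_step(images, labels):
        # images: (N, 3, H, W) tensor, e.g. frames read with OpenCV and stacked
        optimizer.zero_grad()
        loss = criterion(model(images), labels)      # drive the model toward the
        loss.backward()                              # correct classification
        optimizer.step()
        return loss.item()

    # multi-class probabilities at inference time, via softmax as described:
    # probs = torch.softmax(model(images), dim=1)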
The method provided by the invention can be applied to virtual reality applications, three-dimensional map programs, military simulation programs, First-Person Shooter games (FPS), Multiplayer Online Battle Arena games (MOBA), and the like; the following embodiments take the application in games as an example.
Fig. 1 is a schematic view of a usage scenario of the game model training method provided by an embodiment of the present invention. Referring to fig. 1, the terminals include a terminal 10-1 and a terminal 10-2. The terminal 10-1 is located on the developer side and is used to control the training and use of the game model; the terminal 10-2 is located on the user side and is used to request the execution of a game strategy so as to carry out the game process together with the user. Specifically, when it is detected that a human player needs a game companion, at least one game companion AI may be allocated to that player, and a connection is established between the virtual client of the game companion AI and the user client used by the human player to open a game match. During the match between the game companion AI and the user object controlled by the human player, any game client (such as the virtual client corresponding to the game companion AI or the user client used by the human player) can send the scene information of the current game, the state information of the user object, and the state information of the game companion AI to the game server in real time or periodically. Correspondingly, the game server, using the deployed trained game model together with the scene parameters, action information, and state sent by the game client, predicts the game strategy that controls the game companion AI in the next time period and executes the corresponding action information, so that the game client controls the game companion AI to execute the predicted game action. The terminals are connected to the server 200 through a network 300, which may be a wide area network or a local area network, or a combination of the two, using wireless or wired links for data transmission.
The terminal 10-2 is located at the user side and is used for sending out a game strategy generation request for obtaining a game strategy matched with the game environment where the target object is located, wherein the target object can be various types of game users.
As an example, the server 200 is used to deploy the game model training device that implements the game model training method provided by the present invention, and the device can deploy a trained game model to generate adaptive game strategies in different game environments (such as gun-battle games, parkour games, Multiplayer Online Battle Arena games (MOBA), racing games (RCG), and sports games (SPG)). Specifically, before being used, the game model needs to be trained, and the specific process includes: acquiring an action information set and a state information set in the game environment where the target object is located; determining initial parameters of a strategy generation sub-network and initial parameters of a state evaluation sub-network in the game model; when the state information in the state information set changes, determining reward parameters matching the initial parameters of the strategy generation sub-network by executing the action information in the action information set; updating the initial parameters of the strategy generation sub-network based on the matched reward parameters; determining, through the state evaluation sub-network and in response to the changed state information, an evaluation value signal parameter matching the state information; and updating the initial parameters of the strategy generation sub-network and of the state evaluation sub-network respectively according to the evaluation value signal parameter, so as to determine the network parameters of both sub-networks in the game model.
Of course, the game model training device provided by the present invention may train game models for the same target object in different game strategy generation environments, or train and adjust them according to different levels of the target object, finally presenting on the User Interface (UI) a game strategy adapted to the game environment determined by the game model. The game strategy obtained through the game model and adapted to the game environment may also be called by other application programs (such as a game simulator or a motion-sensing game device), and the game models adapted to different types of games may also be migrated to mini-program games running in an instant-messaging process, client games, and cloud games.
Of course, after the training of the game model is completed, the game model can be used to perform role simulation, generate different game strategies, and execute the corresponding action information to assist the game player. Specifically: when a control component in the game environment is triggered, a virtual target object is presented in the user interface of the game environment where the target object is located, and, by triggering the trained game model, the corresponding game interaction instruction executed by the virtual target object is presented in the user interface.
To explain in detail the structure of the game model training device according to the embodiment of the present invention, the device may be implemented in various forms, such as a dedicated terminal with the processing functions of the game model training device, or a server with those processing functions, such as the server 200 in fig. 1. Fig. 2 is a schematic diagram of the composition structure of the game model training device according to an embodiment of the present invention. It is understood that fig. 2 shows only an exemplary structure, not the whole structure, and that a part of or the whole structure shown in fig. 2 may be implemented as needed.
The game model training device provided by the embodiment of the invention comprises: at least one processor 201, memory 202, user interface 203, and at least one network interface 204. The various components of the game model training apparatus are coupled together by a bus system 205. It will be appreciated that the bus system 205 is used to enable communications among the components. The bus system 205 includes a power bus, a control bus, and a status signal bus in addition to a data bus. For clarity of illustration, however, the various buses are labeled as bus system 205 in FIG. 2.
The user interface 203 may include, among other things, a display, a keyboard, a mouse, a trackball, a click wheel, a key, a button, a touch pad, or a touch screen.
It will be appreciated that the memory 202 can be either volatile memory or nonvolatile memory, and can include both volatile and nonvolatile memory. The memory 202 in embodiments of the present invention is capable of storing data to support operation of the terminal (e.g., 10-1). Examples of such data include: any computer program, such as an operating system and application programs, for operating on a terminal (e.g., 10-1). The operating system includes various system programs, such as a framework layer, a core library layer, a driver layer, and the like, and is used for implementing various basic services and processing hardware-based tasks. The application program may include various application programs.
In some embodiments, the game model training apparatus provided in the embodiments of the present invention may be implemented by a combination of hardware and software, and as an example, the game model training apparatus provided in the embodiments of the present invention may be a processor in the form of a hardware decoding processor, which is programmed to execute the game model training method provided in the embodiments of the present invention. For example, a processor in the form of a hardware decoding processor may employ one or more Application Specific Integrated Circuits (ASICs), DSPs, Programmable Logic Devices (PLDs), Complex Programmable Logic Devices (CPLDs), Field Programmable Gate Arrays (FPGAs), or other electronic components.
As an example of the game model training apparatus provided by the embodiment of the present invention implemented by combining software and hardware, the game model training apparatus provided by the embodiment of the present invention may be directly embodied as a combination of software modules executed by the processor 201, where the software modules may be located in a storage medium, the storage medium is located in the memory 202, the processor 201 reads executable instructions included in the software modules in the memory 202, and the game model training method provided by the embodiment of the present invention is completed in combination with necessary hardware (for example, including the processor 201 and other components connected to the bus 205).
By way of example, the Processor 201 may be an integrated circuit chip having Signal processing capabilities, such as a general purpose Processor, a Digital Signal Processor (DSP), or other programmable logic device, discrete gate or transistor logic, discrete hardware components, or the like, wherein the general purpose Processor may be a microprocessor or any conventional Processor, or the like.
As an example of the game model training apparatus provided in the embodiment of the present invention implemented by hardware, the apparatus provided in the embodiment of the present invention may be implemented by directly using the processor 201 in the form of a hardware decoding processor, for example, by being executed by one or more Application Specific Integrated Circuits (ASICs), DSPs, Programmable Logic Devices (PLDs), Complex Programmable Logic Devices (CPLDs), Field Programmable Gate Arrays (FPGAs), or other electronic components, to implement the game model training method provided in the embodiment of the present invention.
Memory 202 in embodiments of the present invention is used to store various types of data to support the operation of the game model training apparatus. Examples of such data include any executable instructions for operating on the game model training apparatus; a program implementing the game model training method of an embodiment of the present invention may be embodied in the executable instructions.
In other embodiments, the game model training device provided by the embodiment of the present invention may be implemented in software. Fig. 2 shows the game model training device stored in the memory 202, which may be software in the form of programs, plug-ins, and the like, and includes a series of modules. As an example of the program stored in the memory 202, the game model training device may include the following software modules:
the information transmission module 2081 is configured to obtain an action information set and a state information set in a game environment where the target object is located.
The information processing module 2082 is used for determining initial parameters of a policy generation sub-network and initial parameters of a state evaluation sub-network in the game model.
The information processing module 2082 is configured to determine, when the state information in the state information set changes, a reward parameter that matches with the initial parameter of the policy generation sub-network by executing the action information in the action information set.
The information processing module 2082 is configured to update the initial parameters of the policy generation sub-network based on the incentive parameters matched with the initial parameters of the policy generation sub-network.
The information processing module 2082 is configured to determine, in response to the changed state information, an evaluation value signal parameter matching the state information through the state evaluation sub-network.
The information processing module 2082 is configured to update the initial parameter of the policy generation sub-network and the initial parameter of the state evaluation sub-network according to the evaluation value signal parameter, so as to determine the network parameter of the policy generation sub-network and the network parameter of the state evaluation sub-network in the game model.
In some embodiments, the server 200 may be an independent physical server, may also be a server cluster or a distributed system formed by a plurality of physical servers, and may also be a cloud server providing basic cloud computing services such as a cloud service, a cloud database, cloud computing, a cloud function, cloud storage, a Network service, cloud communication, a middleware service, a domain name service, a security service, a Content Delivery Network (CDN), a big data and artificial intelligence platform, and the like. The terminal (e.g., terminal 10-1) may be, but is not limited to, a smart phone, a tablet computer, a laptop computer, a desktop computer, etc. The terminal and the server may be directly or indirectly connected through wired or wireless communication, and the embodiment of the present application is not limited.
In practical applications, the game model provided by the embodiments of the present application can also be applied in fields such as structural biology and medicine, where strategy discovery, optimization, and combination can be realized through the game model.
According to the electronic device shown in fig. 2, in one aspect of the present application, the present application also provides a computer program product or a computer program comprising computer instructions stored in a computer-readable storage medium. A processor of a computer device reads the computer instructions from the computer-readable storage medium and executes them, causing the computer device to perform the various embodiments and combinations of embodiments provided in the various alternative implementations of the game model training method described above.
Before continuing to describe the game model training method provided by the embodiment of the present invention with reference to the game model training apparatus shown in fig. 2, the process of generating a game strategy by AI techniques in the related art is first described. Referring to fig. 3, fig. 3 is a schematic diagram of an alternative process of generating a game strategy by AI techniques. Taking the game strategy of a racing game as an example, the traditional scheme is to model the motion of the racing car and the track, and to establish the relationship between the control of the racing car and the leading route of the track through empirical rules and dynamic formulas: whether the racing car turns or drifts is decided according to factors such as the speed and orientation of the car and the width and curvature changes of the track ahead. Because the curves of a track are morphologically diverse, the modeling of cars and tracks requires sufficient generalization to accommodate different tracks. However, the training efficiency of the genetic algorithm used in this process is very low: because evolution must proceed generation by generation, more than 100,000 iterations are often needed to converge to the desired result, and the generalization brings the defects of numerous, mutually coupled parameters, increasing the time consumed by training. Meanwhile, such parameter optimization schemes are premised on manually built behavior trees, tuned by hand according to human experience rules, and the resulting behavior tree models are game models that can hardly match the usage habits of game users.
To overcome this drawback, the process of training a game model deployed in a server is now described. Fig. 4 is a schematic flow chart of an alternative game model training method provided in an embodiment of the present invention. It can be understood that the steps shown in fig. 4 can be executed by various electronic devices running the game model training apparatus, such as a dedicated terminal with a game model training apparatus, a game strategy database server, or a server cluster of a game operator, where the dedicated terminal with a game model training apparatus can be the electronic device with a game model training apparatus in the embodiment shown in fig. 2. In order to overcome the inaccurate generation and low efficiency of traditional game strategy generation, the technical solution provided by the present invention uses artificial intelligence technology. Artificial Intelligence (AI) is a theory, method, technique, and application system that uses a digital computer, or a machine controlled by a digital computer, to simulate, extend, and expand human intelligence, perceive the environment, acquire knowledge, and use the knowledge to obtain the best results. In other words, artificial intelligence is a comprehensive technique of computer science that attempts to understand the essence of intelligence and produce a new kind of intelligent machine that can react in a manner similar to human intelligence. Artificial intelligence studies the design principles and implementation methods of various intelligent machines, so that the machines have the functions of perception, reasoning, and decision-making.
Artificial intelligence technology is a comprehensive discipline covering a wide range of fields, involving both hardware-level and software-level techniques. The basic artificial intelligence infrastructure generally includes technologies such as sensors, dedicated artificial intelligence chips, cloud computing, distributed storage, big data processing, operation/interaction systems, and mechatronics. Artificial intelligence software technology mainly includes computer vision, speech processing, natural language processing, and machine learning/deep learning.
Machine Learning (ML) is a multi-domain interdisciplinary subject involving probability theory, statistics, approximation theory, convex analysis, algorithmic complexity theory, and other disciplines. It specially studies how a computer can simulate or realize human learning behavior to acquire new knowledge or skills and reorganize existing knowledge structures so as to continuously improve its own performance. Machine learning is the core of artificial intelligence and the fundamental way to make computers intelligent; it is applied in all fields of artificial intelligence. Machine learning and deep learning generally include techniques such as artificial neural networks, belief networks, reinforcement learning, transfer learning, inductive learning, and teaching-based learning. With the research and progress of artificial intelligence technology, it has been developed and applied in many fields, such as smart homes, smart wearable devices, virtual assistants, smart speakers, smart marketing, unmanned driving, autonomous driving, drones, robots, smart medical care, and smart customer service.
The following is a detailed description of the steps shown in fig. 4.
Step 401: the game model training device acquires an action information set and a state information set in a game environment where the target object is located.
Referring to fig. 5, fig. 5 is an information acquisition schematic diagram of the game model training method in an embodiment of the present invention. When acquiring action information and state information, because the strategy generation sub-network and the state evaluation sub-network are trained simultaneously, the acquired data information may become coupled, which affects the training accuracy of the two sub-networks in the game model. Therefore, when acquiring the action information and the state information, a sample acquisition mode matched with the game environment may be determined according to the game environment in which the target object is located; a priority threshold matched with the game environment is determined according to the determined sample acquisition mode; and, based on that priority threshold, the action information and the state information in the game environment of the target object are sampled separately to determine the action information set and the state information set. In this way, the priority thresholds of different game types allow effective sampling of different action information and state information in the game environment, improving the data fitting effect and the training accuracy of the strategy generation sub-network and the state evaluation sub-network, so that game users obtain a better experience.

The game complexity of a role-playing game differs from that of an action game, so different game types have different matched priority thresholds, and the priority threshold can represent the sampled time interval parameter. For example, when the game type is a Role-Playing Game (RPG), the user plays a role in a real or fictional world and develops that role through action commands under a structured rule; the user's success or failure in this process depends on a system of rules or action policies. Role-playing games also include, but are not limited to, strategy role-playing games, action role-playing games, and massively multiplayer online role-playing games (MMORPG). The user interface of a role-playing game has many buttons with different functions, updated in real time and appearing randomly, and the interface background and buttons change in real time. During the game, the user triggers buttons with different functions to generate the corresponding action information (for example, game characters can kill each other, with different actions such as diving under the tower for a kill or ambushing in the grass), at which point the virtual objects in the game user interface undergo different state changes (for example, states such as penta kill, quadra kill, and triple kill, where a category such as penta kill further contains sub-states such as a clutch reverse kill).
When this priority-threshold sampling is adopted, the action information and the state information in the game environment may each be sampled at a sampling time interval of 1 s.
For the action game (ACT) type, i.e., games with "action" as the main form of expression, the category also includes, but is not limited to, shooter games (STG) and fighting games (FTG). The internal branches of fighting games are generally distinguished by the linearity or non-linearity of the game maps and the range of motion of the game character, so there are multiple categories representing the character's range of motion. The user interface background of an action game changes in real time, and faster than in a role-playing game, so when priority-threshold sampling is adopted, the action information and the state information of the virtual character in the game may each be sampled at a sampling time interval of 0.5 s. It should be noted that, for different types of games, the priority threshold can be flexibly adjusted through user instruction information when sampling the action information and the state information, so as to adapt to the development of the game type and improve the data fitting effect; a sampling sketch follows.
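A minimal sketch of this interval-based sampling, assuming the 1 s and 0.5 s figures above; the environment accessors are hypothetical stand-ins for the game client:

    import time

    # priority thresholds expressed as sampling time intervals per game type
    SAMPLING_INTERVAL = {"RPG": 1.0, "ACT": 0.5}  # seconds, per the text above

    def collect_samples(env, game_type, duration):
        interval = SAMPLING_INTERVAL[game_type]   # priority threshold for this type
        actions, states = [], []
        elapsed = 0.0
        while elapsed < duration:
            actions.append(env.current_action())  # sample one piece of action info
            states.append(env.current_state())    # sample one piece of state info
            time.sleep(interval)                  # wait one priority interval
            elapsed += interval
        return actions, states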
In some embodiments of the present invention, when training the strategy generation sub-network and the state evaluation sub-network in the game model, noise may be removed from the acquired training samples in order to improve the data fitting effect and the training accuracy of the two sub-networks. The acquired training samples include different action information and state information; for example, in a racing game, the state information and the action information may be acquired at a priority-threshold acquisition interval of 0.5 seconds, where the state information includes the speed of the racing car, the distance to each waypoint, and the distance to the left and right track walls, and the action information includes the different direction keys controlling the car and the trigger component of the game props. When the game environment of the target object is a role-playing game, a dynamic noise threshold matched with the usage environment of the game model is determined, and the first training sample set is denoised according to the dynamic noise threshold to form a second training sample set matched with it. Because the usage environments of the game model differ, the matched dynamic noise thresholds also differ: for example, a role-playing mini-program game can be executed inside an instant-messaging client process, and the game complexity of a mini-program game is usually greater than that of a client game, so the dynamic noise threshold matched with this usage environment needs to be smaller than the dynamic noise threshold used when the game user runs the role-playing game through a client game process. Training samples exceeding the noise threshold are deleted according to the threshold. Different dynamic noise thresholds thus adapt to different types of games and effectively screen the training samples, so that the deployed trained game model generates better game strategies and the user obtains a better experience.
In some embodiments of the invention, when the game environment in which the target object is located is a battle game, a fixed noise threshold corresponding to the game model is determined, and the first training sample set is denoised according to the fixed noise threshold to form a second training sample set matched with it. For battle games deployed in fixed game terminals (such as motion-sensing game machines or AR game glasses), a fixed noise threshold effectively improves the acquisition speed and accuracy of training samples and reduces the waiting time of game users; when the version of the game progresses is updated, a new fixed noise threshold can be obtained and the training samples below it deleted, so as to improve the learning efficiency of the game terminal. A filtering sketch is given below.
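As a sketch under the scheme above; the per-sample noise score and all threshold values are assumptions for illustration:

    def denoise_training_samples(first_set, game_type, client="native"):
        # role-playing games use a dynamic threshold: mini-program games run
        # inside an instant-messaging client get a smaller threshold than
        # client games, per the description above (values illustrative)
        if game_type == "RPG":
            threshold = 0.2 if client == "mini_program" else 0.4
        else:
            # battle games on fixed game terminals use a fixed threshold
            threshold = 0.3
        # samples whose noise exceeds the threshold are deleted; the
        # survivors form the second training sample set
        return [s for s in first_set if s["noise"] <= threshold]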
Step 402: the game model training device determines initial parameters of a strategy generation sub-network and initial parameters of a state evaluation sub-network in the game model.
The strategy generation sub-network in the game model is used to generate, during the game, the AI-simulated operations of a game user on the virtual object and to trigger the operation instructions corresponding to different game strategies; the state evaluation sub-network represents and evaluates the state of the virtual object after a game instruction is executed, reflecting the execution effect of the operation instructions corresponding to the different game strategies. The strategy generation sub-network and the state evaluation sub-network in the game model may use deep neural networks. For example, the strategy generation sub-network can use a MobileNetV3 network: MobileNetV3 realizes a strategy generation sub-network suitable for terminal-device deployment with a small number of model parameters, its architecture is based on network architecture search, and it balances serving performance and classification precision during strategy generation, so the final architecture can meet the performance and precision requirements of terminal deployment. It can thus recognize elements such as scenes, characters, objects, and markers in the frames of different game videos (corresponding to different game processes), determine the health points, ammunition count, weapon types, and positions of the virtual objects during the game, and determine different game strategies from the recognition results of the different elements.
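For instance, using torchvision (an assumption; the patent names only the MobileNetV3 architecture), the strategy generation sub-network could be instantiated roughly as follows, with the action-set size assumed:

    import torch
    from torchvision.models import mobilenet_v3_small

    NUM_ACTIONS = 8  # size of the discrete action information set (assumed)

    # lightweight backbone suitable for terminal-device deployment
    policy_net = mobilenet_v3_small(num_classes=NUM_ACTIONS)

    frame = torch.randn(1, 3, 224, 224)       # one game frame (illustrative)
    action_logits = policy_net(frame)         # scores over candidate actions
    action = torch.argmax(action_logits, dim=1)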
Step 403: and when the state information in the state information set changes, the game model training device determines the reward parameters matched with the initial parameters of the strategy generation sub-network by executing the action information in the action information set.
Step 404: the game model training device updates the initial parameters of the strategy generation sub-network based on the reward parameters matched with the initial parameters of the strategy generation sub-network.
In some embodiments of the present invention, when the state information in the state information set changes, determining the reward parameter matching the initial parameter of the policy generation sub-network by executing the action information in the action information set may be implemented by:
when the state information in the state information set changes, determining, through the strategy generation sub-network, an execution strategy matching the changed state information; determining corresponding action information in the action information set according to the determined execution strategy; executing the action information in the game environment where the target object is located; and determining reward parameters matching the initial parameters of the strategy generation sub-network based on the execution result of the action information. Specifically, when the game environment where the target object is located is a racing game, the elapsed time of the racing game when the action information is executed may be determined, and the reward parameters matching the initial parameters of the strategy generation sub-network may be determined according to that elapsed time. Let the initial parameters of the strategy generation sub-network be θ, with strategy generation function π_θ, and let the initial parameters of the state evaluation sub-network be ω, with evaluation expression function V_ω. In the current state s_t, action information a_t is selected according to the current strategy π_θ(s_t) and executed in the game environment of the game client; when the state information transfers from s_t to s_{t+1}, the reward parameter r_t is obtained, where the reward may be 1/t, the reciprocal of the elapsed time t for the segment.
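A minimal sketch of this interaction step (assuming Python; the environment and policy interfaces are assumptions for illustration):

```python
import time

def rollout_step(env, policy, state):
    action = policy.select(state)       # a_t chosen by the current strategy π_θ(s_t)
    start = time.monotonic()
    next_state = env.execute(action)    # state transfers from s_t to s_{t+1}
    elapsed = time.monotonic() - start  # elapsed time t for this segment
    reward = 1.0 / max(elapsed, 1e-6)   # r_t = 1/t: faster execution, higher reward
    return next_state, reward
```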
In some embodiments of the invention, based on the reward parameter matching the initial parameters of the strategy generation sub-network, the reward parameter is compared with a first reward parameter threshold; when the reward parameter is determined to reach the first reward parameter threshold, the initial parameters of the strategy generation sub-network are updated. In connection with the preceding embodiments, the first reward parameter threshold may be t_max, and the second reward parameter threshold may be t_min. When t reaches t_max, the initial parameters of the strategy generation sub-network are updated: reaching t_max indicates that, after the game strategy generated by the game model has been executed, the state change has achieved its maximum effect, and updating the initial parameters of the strategy generation sub-network at this point drives the parameters of the sub-network toward the optimal game effect, so that the generated game strategy, executed by the AI game robot, gives the game user a better high-level game experience.
Step 405: the game model training device determines evaluation value signal parameters matching the state information through the state evaluation sub-network in response to the changed state information.
Step 406: and the game model training device respectively updates the initial parameters of the strategy generation sub-network and the initial parameters of the state evaluation sub-network according to the evaluation value signal parameters.
Thereby, determination of network parameters of a policy generation sub-network and network parameters of a state evaluation sub-network in the game model can be achieved.
With continuing reference to fig. 5, wherein fig. 5 is a schematic flow chart of an alternative game model training method provided in the embodiment of the present invention, it can be understood that the steps shown in fig. 5 can be executed by various electronic devices operating the game model training apparatus, such as a dedicated terminal with the game model training apparatus, a game strategy database server, or a server cluster of a game operator, wherein the dedicated terminal with the game model training apparatus can be the electronic device with the game model training apparatus in the embodiment shown in the foregoing fig. 2.
Step 501: the game model training device determines the evaluation value signal parameter matching the state information based on the initial parameters of the state evaluation sub-network and the evaluation expression function.
The evaluation expression function can evaluate the effect of the game actions produced by an executed game strategy in different game environments; for example, in a racing game it can determine the distance to the next waypoint on the track, while in a role-playing game it can determine the lasting damage effect after different skills are released.
When the AI simulates a game user in a real game process, the state s_t of the virtual object in the game changes in real time as different game strategies are executed. The evaluation value signal parameter matching the state information can thus be determined as

R = 0 if s_t is a terminal state (i.e., the state of the virtual object in the game no longer changes in response to any executed game strategy), and R = V_ω(s_t) otherwise.
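In code, this piecewise definition amounts to a one-line check (a sketch assuming Python and a callable value network):

```python
def evaluation_signal(state, is_terminal: bool, value_net) -> float:
    # R = 0 when s_t is a terminal state, otherwise R = V_ω(s_t).
    return 0.0 if is_terminal else float(value_net(state))
```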
Therefore, by tracking the real-time change of the state s_t of the virtual object in the game, the game strategy generated by the game model can be adjusted in real time, reducing the lag of the game, so that a game user playing with the AI obtains the experience of completing the game process together with a real game user.
Step 502: the game model training device determines the update parameters of the strategy generation sub-network according to the strategy generation function of the strategy generation sub-network based on the evaluation value signal parameters.
Step 503: the game model training device generates an update parameter of the sub-network based on the strategy, and updates an initial parameter of the strategy generation sub-network.
In some embodiments of the invention, a multitask loss function matching the game model may be determined; based on the evaluation value signal parameters and the multitask loss function, the parameters of the strategy generation sub-network and the network parameters of the state evaluation sub-network in the game model are adjusted until the loss functions of the different dimensions corresponding to the game strategy generation model reach their respective convergence conditions, so that the parameters of the game model match the game environment. Specifically, because the game model includes a strategy generation sub-network and a state evaluation sub-network that cooperate in game strategy generation, each sub-network has its own loss function. To adapt to different game scenes, the loss function of the state evaluation sub-network and the loss function of the strategy generation sub-network can be combined by linear weighting into an overall multitask loss; joint learning by gradient descent on this loss, with back-propagation of its gradients, then adjusts the parameters of the strategy generation sub-network and the network parameters of the state evaluation sub-network in coordination.
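A minimal sketch of this linear weighting (assuming PyTorch; the coefficients are illustrative assumptions, to be tuned per game scene):

```python
import torch

def multitask_loss(policy_loss: torch.Tensor,
                   value_loss: torch.Tensor,
                   w_policy: float = 1.0,
                   w_value: float = 0.5) -> torch.Tensor:
    # Overall loss as a linear weighting of the two sub-networks' losses;
    # one backward pass then adjusts both sets of parameters jointly.
    return w_policy * policy_loss + w_value * value_loss

# total = multitask_loss(policy_loss, value_loss); total.backward()
```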
The network parameters of the policy generation sub-network are updated through the evaluation value signal parameters, where the update of the policy generation sub-network parameters can be expressed as

θ ← θ + α_θ · ∇_θ log π_θ(a_t | s_t) · (R − V_ω(s_t)),

and the update of the network parameters of the state evaluation sub-network as

ω ← ω − α_ω · ∇_ω (R − V_ω(s_t))²,

with α_θ and α_ω the respective learning rates.
Then, S may be initialized as the first state of the current state sequence to obtain the corresponding feature vector φ(S). Using φ(S) as input to the strategy generation sub-network of the game model yields the Q-value outputs corresponding to all game strategies, and the corresponding game strategy A is selected from the current Q-value output (such as acceleration in a racing game, or releasing a skill in a role-playing game). Executing the current game strategy A in state S yields the feature vector φ(S′) corresponding to the new state S′ and the reward R. Using the mean square error loss function

L(w) = (R + γ · max_{a′} Q(φ(S′), a′; w) − Q(φ(S), A; w))²,

all network parameters of the strategy generation sub-network and the state evaluation sub-network of the game model are updated through gradient back-propagation of the neural network, thereby realizing iterative updating.
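A minimal sketch of one such iterative update (assuming PyTorch; the greedy target and the discount factor γ are assumptions of this sketch, following the common Q-learning recipe):

```python
import torch
import torch.nn.functional as F

def q_update(q_net, optimizer, phi_s, action, reward, phi_s_next, gamma=0.99):
    # phi_s, phi_s_next: (1, feature_dim) feature vectors φ(S) and φ(S').
    q_pred = q_net(phi_s)[0, action]  # Q value of the executed strategy A
    with torch.no_grad():
        # Target: reward plus discounted best Q value in the new state S'.
        target = reward + gamma * q_net(phi_s_next).max()
    loss = F.mse_loss(q_pred, target)  # mean square error loss
    optimizer.zero_grad()
    loss.backward()                    # gradient back-propagation
    optimizer.step()
    return loss.item()
```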
Step 504: the game model training device determines an update parameter of the state evaluation sub-network based on the evaluation value signal parameter.
Step 505: and the game model training device updates the initial parameters of the state evaluation sub-network according to the updated parameters of the state evaluation sub-network.
In some embodiments of the invention, the action execution results of the game model in the game environment where the target object is located can be monitored; the elapsed time of the target action execution result is determined; and the training samples of the game model are adjusted when the elapsed time of the target action execution result is less than the second reward parameter threshold. When t reaches t_min, it indicates that, after the game strategy generated by the game model has been executed, the state change has achieved only its minimum effect; updating the initial parameters of the strategy generation sub-network then prevents the game model from generating a low-level game strategy, so that the generated game strategy, executed by the AI game robot, gives the game user a better high-level game experience. Referring to fig. 6, fig. 6 is a schematic diagram of the working process of the state evaluation sub-network in an embodiment of the present invention. In practical applications, different game scenes that require a game strategy to be generated, such as role-playing games and action games, may be preset. For a real-time AR game, optionally, the strategy generation sub-network and the state evaluation sub-network in the game model may be a Convolutional Neural Network (CNN), a Deep Neural Network (DNN), a Recurrent Neural Network (RNN), or the like. Taking the deep neural network shown in fig. 6 as an example, the cooperative training of the network parameters of the strategy generation sub-network and the state evaluation sub-network in the game model not only effectively guarantees the accuracy of the game model but also improves its convergence rate, thereby improving the efficiency of game strategy generation, processing game strategies of complex dimensionality more quickly, and adjusting the game strategy in a timely and accurate manner.
FIG. 7 is a diagram illustrating the model structure of a game model according to an embodiment of the present invention, where the state information s_t can be used as an input sample of the game model. Taking a racing game as an example, the difference between the action-instruction execution time t′ consumed by the racing car over a fixed distance after the start and the optimal driving time t_min for this distance in the historical data, i.e., t′ − t_min, can be used as the output supervisory signal to train the strategy generation sub-network. The prediction value of the strategy generation sub-network plus the original reward value serves as the new reward, i.e.,

r̂_t = r_t + v_t,

where v_t denotes the sub-network's prediction value. Each time t_max is reached, the parameters of the strategy generation sub-network and the initial parameters of the state evaluation sub-network are updated respectively, and the state S_{t_max} is recorded; the game strategy determined by the strategy generation sub-network can then be executed continuously, and after n groups of t_max have been reached, the elapsed time t′ for this segment is recorded. If t′ < t_min, then t_min is updated to t′. Taking the collected action information and state information (S_{t_max}, t′ − t_min) as training samples of the game model can effectively improve the fitting effect of the training samples and the training precision of the strategy generation sub-network and the state evaluation sub-network in the game model, so that game users obtain a better use experience.
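A minimal sketch of this collection loop (assuming Python; the rollout helper and sample layout are assumptions):

```python
def collect_sample(run_segment, t_min: float, state_t_max):
    t_prime = run_segment()      # elapsed time t' after n groups of t_max
    if t_prime < t_min:          # a new best time updates t_min
        t_min = t_prime
    # Supervisory signal pairs the recorded state with the time difference.
    sample = (state_t_max, t_prime - t_min)
    return sample, t_min
```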
The embodiment of the present invention may be implemented in combination with cloud technology. Cloud technology is a hosting technology that unifies series of resources such as hardware, software, and networks in a wide area network or a local area network to realize the calculation, storage, processing, and sharing of data; it can also be understood as a general term for the network technology, information technology, integration technology, management platform technology, application technology, and the like applied on the basis of the cloud computing business model. Background services of technical network systems, such as video websites, photo websites, and other portal websites, require a large amount of computing and storage resources, so cloud technology needs the support of cloud computing.
It should be noted that cloud computing is a computing mode that distributes computing tasks over a resource pool formed by a large number of computers, enabling various application systems to obtain computing power, storage space, and information services as required. The network that provides the resources is referred to as the "cloud". To users, the resources in the "cloud" appear infinitely expandable, available at any time, used on demand, and paid for by use. As a basic capability provider of cloud computing, a cloud computing resource pool platform, referred to as Infrastructure as a Service (IaaS), is established; multiple types of virtual resources are deployed in the resource pool for external clients to select and use. The cloud computing resource pool mainly includes computing devices (which may be virtualized machines including an operating system), storage devices, and network devices.
After the trained game model is deployed in a corresponding server, a game strategy matching the game environment where the target object is located may be generated. The game model training method of the present application is described below taking a racing game as an example. Fig. 8 is a schematic diagram of the game model applied to a racing game provided by an embodiment of the present invention. When the game server detects that a human player intends to play a round of the racing game, it may match one or more virtual objects controlled by the game model (i.e., accompanying AI) with the human player; fig. 8 illustrates one virtual object controlled by the game model. After matching, the target object used by the human player and the matched virtual object controlled by the game model may be loaded into the target game scene, so that the two can race in the target game scene, as shown in fig. 8: "user-1" is the target object in the game environment, and "user-2" is the accessed virtual object controlled by the game model (i.e., the accompanying AI). It should be noted that, in an actual game scene, the virtual object controlled by the game model may be assigned the game ID (i.e., game account) of an online real player, so as to prevent the human player from recognizing the accompanying AI, thereby improving the use experience of game users.
With continued reference to fig. 9, fig. 9 is an optional flowchart of a game model application process provided in an embodiment of the present invention, where the steps shown in fig. 9 may be executed by various electronic devices running a game model, for example, a dedicated terminal with a game model (e.g., a motion sensing game machine) or a server cluster of a game operator, and specifically include the following steps:
Step 901: the game model training device obtains training samples.
The obtained training samples include action information and state information. Specifically, the state information may include: the game completion duration, the use frequency of each skill special effect (such as the use frequency of skill special effect A and of skill special effect B), the trigger frequency of each skill special effect (such as the trigger frequency of skill special effect A and of skill special effect B), the occurrence frequency of each error condition (such as the occurrence frequency of error A and of error B), the use proportion of the different game actions of the target object (such as the use frequency of game action A and of game action B), and the use frequency of each game item (such as the use frequency of the game item "bomb" and of the game item "nitrogen acceleration").
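One training sample's state information might be laid out as follows (a sketch; the keys and values are illustrative assumptions):

```python
state_info = {
    "finish_time_s": 95.4,                          # game completion duration
    "skill_effect_use": {"A": 12, "B": 7},          # use frequency per effect
    "skill_effect_trigger": {"A": 9, "B": 5},       # trigger frequency per effect
    "error_occurrence": {"A": 2, "B": 0},           # occurrences per error type
    "action_use_ratio": {"drift": 0.4, "boost": 0.6},
    "item_use": {"bomb": 3, "nitrogen_boost": 11},  # use frequency per game item
}
```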
Step 902: the game model training device updates the initial parameters of the strategy generation sub-network based on the reward parameters matched with the initial parameters of the strategy generation sub-network.
Step 903: the game model training device determines evaluation value signal parameters matching the state information through the state evaluation sub-network.
Step 904: and the game model training device respectively updates the initial parameters of the strategy generation sub-network and the initial parameters of the state evaluation sub-network according to the evaluation value signal parameters.
Step 905: deploying the trained game model.
Step 906: determining, through the game model, a game strategy matching the target object, and, when a control component in the game environment is triggered, presenting a virtual target object in the user interface of the game environment where the target object is located.
Step 907: presenting, by triggering the trained game model, corresponding game interaction instructions in the user interface through the virtual target object.
In some embodiments of the present invention, the interactive instruction may also be generated by detecting the gesture of a virtual object; for example, in a three-dimensional interactive scene, the interactive instruction may be generated according to a given gesture of the virtual object. A skill identification uniquely identifies a skill. A game scene usually offers many skills, including attack skills and evasion skills, each of which corresponds to a skill identification. The interactive instruction refers to an interactive operation initiated by a user and is used for controlling the controlled virtual object to execute the corresponding interactive action. The interaction includes attack interaction, evasion interaction, and the like, where attacks can be divided into short-distance attacks and long-distance attacks.
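A minimal sketch of such a skill-identification table and of the interactive instruction built from it (the names and identifiers are illustrative assumptions):

```python
SKILLS = {
    101: {"name": "fireball", "kind": "attack", "range": "long"},
    102: {"name": "slash",    "kind": "attack", "range": "short"},
    201: {"name": "dodge",    "kind": "evade",  "range": None},
}

def build_instruction(skill_id: int, target_id: int) -> dict:
    skill = SKILLS[skill_id]  # each skill is uniquely identified by its ID
    return {"skill": skill["name"], "kind": skill["kind"], "target": target_id}
```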
The beneficial technical effects are as follows:
acquiring an action information set and a state information set in the game environment where the target object is located; determining initial parameters of a strategy generation sub-network and initial parameters of a state evaluation sub-network in the game model; when the state information in the state information set changes, determining reward parameters matching the initial parameters of the strategy generation sub-network by executing the action information in the action information set; updating the initial parameters of the strategy generation sub-network based on those reward parameters; determining, by the state evaluation sub-network, an evaluation value signal parameter matching the changed state information; and updating the initial parameters of the strategy generation sub-network and the initial parameters of the state evaluation sub-network respectively according to the evaluation value signal parameters, so as to determine the network parameters of both sub-networks in the game model. This effectively guarantees the accuracy of the game model while improving its convergence speed, improves the efficiency of game strategy generation, processes game strategies of complex dimensionality more quickly, adjusts the game strategy in a timely and accurate manner, remains robust and general across the action spaces of different game environments, and reduces the computing cost.
The above description is only exemplary of the present invention and should not be taken as limiting the scope of the present invention, and any modifications, equivalents, improvements, etc. made within the spirit and principle of the present invention should be included in the scope of the present invention.

Claims (12)

1. A game model training method, the method comprising:
acquiring an action information set and a state information set in a game environment where a target object is located;
determining initial parameters of a strategy generation sub-network and initial parameters of a state evaluation sub-network in a game model;
determining reward parameters matching the initial parameters of the policy-generating sub-network by executing the action information in the action information set when the status information in the status information set changes,
wherein the game environment where the target object is located is a racing game, the consumption time of the racing game when the action information is executed is determined, and the reward parameter matched with the initial parameters of the strategy generation sub-network is determined according to the consumption time of the racing game;
comparing the reward parameter with a first reward parameter threshold based on the reward parameter matched with the initial parameter of the strategy generation sub-network, and updating the initial parameter of the strategy generation sub-network when the reward parameter is determined to reach the first reward parameter threshold;
determining, by the state evaluation sub-network, an evaluation value signal parameter matching the state information in response to the changed state information;
and updating the initial parameters of the strategy generation sub-network and the initial parameters of the state evaluation sub-network respectively according to the evaluation value signal parameters so as to determine the network parameters of the strategy generation sub-network and the network parameters of the state evaluation sub-network in the game model.
2. The method of claim 1, wherein the obtaining the set of action information and the set of state information in the game environment of the target object comprises:
determining a sample acquisition mode matched with the game environment according to the game environment of the target object;
determining a priority threshold value matched with the game environment according to the determined sample acquisition mode;
and respectively sampling the action information and the state information in the game environment of the target object based on the priority threshold matched with the game environment, and determining an action information set and a state information set in the game environment of the target object.
3. The method of claim 1, wherein determining the reward parameters matching the initial parameters of the policy generation sub-network by performing the action information of the action information set when the status information of the status information set changes comprises:
when the state information in the state information set changes, determining an execution strategy matched with the changed state information through the strategy generation sub-network;
determining corresponding action information in the action information set according to the determined execution strategy;
executing the action information in a game environment where the target object is located;
and determining reward parameters matched with the initial parameters of the strategy generation sub-network based on the execution result of the action information.
4. The method of claim 1, wherein the updating the initial parameters of the policy generation sub-network and the initial parameters of the state evaluation sub-network according to the evaluation value signal parameters comprises:
determining the evaluation value signal parameter matching the state information based on the initial parameter and the evaluation expression function of the state evaluation sub-network;
determining an update parameter of the policy generation sub-network according to a policy generation function of the policy generation sub-network based on the evaluation value signal parameter;
updating the initial parameters of the strategy generation sub-network based on the updated parameters of the strategy generation sub-network;
determining an update parameter of the state evaluation sub-network based on the evaluation value signal parameter;
and updating the initial parameters of the state evaluation sub-network according to the updated parameters of the state evaluation sub-network.
5. The method of claim 1, further comprising:
monitoring the action execution result of the game model in the game environment where the target object is located;
determining the consumed time of the target action execution result;
adjusting the training samples of the game model when the elapsed time of the target action execution result is less than a second reward parameter threshold.
6. The method of claim 5, further comprising:
determining historical parameters of the target object according to the type of the game environment where the target object is located;
determining a training sample set matched with the game model based on the historical parameters of the target object, wherein the training sample set comprises at least one group of training samples;
extracting a training sample set matched with the training sample through a noise threshold matched with the game model;
and training the game model according to the training sample set matched with the training samples.
7. The method of claim 6, further comprising:
determining a multitask loss function matching the game model;
based on the evaluation value signal parameters and the multitask loss function, adjusting parameters of a strategy generation sub-network and network parameters of a state evaluation sub-network in the game model until loss functions of different dimensions corresponding to the game strategy generation model reach corresponding convergence conditions; so as to realize that the parameters of the game model are matched with the game environment.
8. The method of claim 1, further comprising:
when the game environment of the target object is a role playing game, determining a dynamic noise threshold value matched with the use environment of the game model;
denoising the first training sample set according to the dynamic noise threshold value to form a second training sample set matched with the dynamic noise threshold value;
and when the game environment of the target object is a battle game, determining a fixed noise threshold corresponding to the game model, and performing denoising processing on the first training sample set according to the fixed noise threshold to form a second training sample set matched with the fixed noise threshold.
9. The method of claim 1, further comprising:
presenting a virtual target object in a user interface of the gaming environment through which the target object is located when a control component in the gaming environment is triggered,
and presenting corresponding game interaction instructions in the user interface by using the virtual target object through triggering the trained game model.
10. A game model training apparatus, the apparatus comprising:
the information transmission module is used for acquiring an action information set and a state information set in a game environment where the target object is located;
the information processing module is used for determining initial parameters of a strategy generation sub-network and initial parameters of a state evaluation sub-network in the game model;
the information processing module is used for determining reward parameters matched with the initial parameters of the strategy generation sub-network by executing the action information in the action information set when the state information in the state information set changes;
the information processing module is used for determining the consumption time of the racing game when the action information is executed when the game environment of the target object is the racing game, and determining the reward parameters matched with the initial parameters of the strategy generation sub-network according to the consumption time of the racing game;
the information processing module is used for comparing the reward parameter with a first reward parameter threshold value based on the reward parameter matched with the initial parameter of the strategy generation sub-network, and updating the initial parameter of the strategy generation sub-network when the reward parameter is determined to reach the first reward parameter threshold value;
the information processing module is used for responding to the changed state information and determining an evaluation value signal parameter matched with the state information through the state evaluation sub-network;
and the information processing module is used for respectively updating the initial parameters of the strategy generation sub-network and the initial parameters of the state evaluation sub-network according to the evaluation value signal parameters so as to determine the network parameters of the strategy generation sub-network and the network parameters of the state evaluation sub-network in the game model.
11. An electronic device, characterized in that the electronic device comprises:
a memory for storing executable instructions;
a processor for implementing the game model training method of any one of claims 1 to 9 when executing the executable instructions stored in the memory.
12. A computer readable storage medium storing executable instructions which, when executed by a processor, implement the game model training method of any one of claims 1 to 9.
CN202110145344.1A 2021-02-02 2021-02-02 Game model training method and device, electronic equipment and storage medium Active CN112791394B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202110145344.1A CN112791394B (en) 2021-02-02 2021-02-02 Game model training method and device, electronic equipment and storage medium

Publications (2)

Publication Number Publication Date
CN112791394A (en) 2021-05-14
CN112791394B (en) 2022-09-30

Family

ID=75813803


Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
REG Reference to a national code; Ref country code: HK; Ref legal event code: DE; Ref document number: 40043861; Country of ref document: HK
GR01 Patent grant