CN112221140B - Method, device, equipment and medium for training action determination model of virtual object

Method, device, equipment and medium for training action determination model of virtual object

Info

Publication number
CN112221140B
CN112221140B (Application CN202011217465.4A)
Authority
CN
China
Prior art keywords
environment state
state
model
determining
environment
Prior art date
Legal status
Active
Application number
CN202011217465.4A
Other languages
Chinese (zh)
Other versions
CN112221140A (en)
Inventor
杜雪莹
石贝
练振杰
高一鸣
陈光伟
王亮
付强
Current Assignee
Tencent Technology Shenzhen Co Ltd
Original Assignee
Tencent Technology Shenzhen Co Ltd
Priority date
Filing date
Publication date
Application filed by Tencent Technology (Shenzhen) Co., Ltd.
Priority to CN202011217465.4A
Publication of CN112221140A
Application granted
Publication of CN112221140B
Legal status: Active

Classifications

    • A HUMAN NECESSITIES
    • A63 SPORTS; GAMES; AMUSEMENTS
    • A63F CARD, BOARD, OR ROULETTE GAMES; INDOOR GAMES USING SMALL MOVING PLAYING BODIES; VIDEO GAMES; GAMES NOT OTHERWISE PROVIDED FOR
    • A63F13/00 Video games, i.e. games using an electronically generated display having two or more dimensions
    • A63F13/55 Controlling game characters or game objects based on the game progress

Landscapes

  • Engineering & Computer Science (AREA)
  • Multimedia (AREA)
  • Human Computer Interaction (AREA)
  • Processing Or Creating Images (AREA)

Abstract

The application provides a method, apparatus, device and medium for training an action determination model of a virtual object, belonging to the technical field of artificial intelligence. The method comprises the following steps: determining a computing environment state of the virtual scene after a target duration based on a first environment state of the virtual scene; determining intrinsic reward information according to the computing environment state and the actual environment state of the first environment state at the next moment; adjusting parameters of a current action determination model according to the intrinsic reward information; and in response to the current action determination model meeting a first target condition, determining the current action determination model as a trained action determination model. With this scheme, the actions output by the action determination model can correspond to different game strategies, which improves the adversarial capability and robustness of the virtual object with respect to game strategies.

Description

Method, device, equipment and medium for training action determination model of virtual object
Technical Field
The present application relates to the field of artificial intelligence technologies, and in particular, to a method, an apparatus, a device, and a medium for training an action determination model of a virtual object.
Background
With the development of artificial intelligence (Artificial Intelligence, AI) technology, AI has challenged top human players in a variety of fields and has approached or reached their competitive level. For example, in the field of Go, AlphaGo defeated the world Go champion; in the field of video games, AlphaStar defeated professional StarCraft II (a real-time strategy game) players; and so on. At present, research on game AI problems has become a testing ground for exploring real-world general artificial intelligence.
Currently, for multiplayer online tactical competition (Multiplayer Online Battle Arena, MOBA) games, because MOBA games involve a variety of complex factors such as lineup combinations, strategic objectives and tactical execution, a reinforcement learning (Reinforcement Learning, RL) method is generally adopted to train the AI, and the learning of the AI is guided by a reward signal (Reward) in RL.
The technical problem with this scheme is that, because the reward signal used for the AI in RL is a dense reward signal defined by technicians, the trained AI can only execute a single game strategy, so the AI has a weak adversarial capability with respect to game strategies and lacks robustness.
Disclosure of Invention
The embodiments of the present application provide a method, apparatus, device and medium for training an action determination model of a virtual object, so that the actions output by the action determination model can correspond to different game strategies, improving the adversarial capability and robustness of the virtual object with respect to game strategies. The technical solutions are as follows:
In one aspect, a method for training an action determination model of a virtual object is provided, the method comprising:
determining a computing environment state of the virtual scene after a target duration based on a first environment state of the virtual scene, wherein the first environment state and the computing environment state respectively represent the environment of the virtual scene;
determining intrinsic rewarding information according to the computing environment state and an actual environment state of the first environment state at the next moment, wherein the intrinsic rewarding information is used for indicating whether executing a target action is beneficial to a virtual object, and the actual environment state is used for indicating the environment state of the virtual scene after the virtual object executes the target action under the first environment state;
adjusting parameters of a current action determination model according to the intrinsic reward information;
and responding to the current action determining model conforming to a first target condition, determining the current action determining model as a trained action determining model, wherein the action determining model is used for outputting actions according to the input environment state.
In another aspect, there is provided an action determination model training apparatus for a virtual object, the apparatus comprising:
The state determining module is used for determining the computing environment state of the virtual scene after the target duration based on the first environment state of the virtual scene, wherein the first environment state and the computing environment state respectively represent the environment of the virtual scene;
the information determining module is used for determining intrinsic rewarding information according to the computing environment state and the actual environment state of the first environment state at the next moment, wherein the intrinsic rewarding information is used for indicating whether executing a target action is beneficial to a virtual object, and the actual environment state is used for indicating the environment state of the virtual scene after the virtual object executes the target action in the first environment state;
the parameter adjustment module is used for adjusting parameters of the current action determination model according to the intrinsic rewarding information;
the model determining module is used for responding to the current action determining model to accord with a first target condition, determining the current action determining model as a trained action determining model, and the action determining model is used for outputting actions according to the input environment state.
In a possible implementation manner, the information determining module is configured to obtain a first difference value between the first environmental state and the computing environmental state; acquiring a second difference value between the actual environment state and the computing environment state; determining a target difference between the first difference and the second difference as the intrinsic rewards information, the target difference being not negative indicating that performing the target action is beneficial to the virtual object.
In one possible implementation, the apparatus further includes:
and the state transformation module is used for transforming the first environment state and the actual environment state to the same dimension as the computing environment state.
In one possible implementation, the state determining module is configured to determine a first environment vector of a first environment state of the virtual scene; and inputting the first environment vector into an environment state determining model, and outputting the calculated environment state of the virtual scene after the target time length by the environment state determining model, wherein the environment state determining model is used for calculating the environment state after the target time length according to the known environment state.
In one possible implementation, the training step of the environmental state determination model includes:
acquiring a first sample environment state and a second sample environment state corresponding to the first sample environment state after a target duration;
training an environment state determination model in the current iteration process by taking the first sample environment state as input and the second sample environment state as label information;
and responding to the environment state determining model in the current iteration process to meet a second target condition, and taking the environment state determining model in the current iteration process as a trained environment state determining model.
In one possible implementation, the second sample environmental state is extracted based on a priori knowledge.
In one possible implementation manner, the environment state determining model is an environment state determining model corresponding to any one iteration process.
In one possible implementation, the apparatus further includes a self-playing module for self-playing training of the current action determination model.
In another aspect, a computer device is provided, comprising a processor and a memory, the memory being configured to store at least one piece of program code, and the at least one piece of program code being loaded and executed by the processor to implement the operations performed in the method for training an action determination model of a virtual object in the embodiments of the present application.
In another aspect, a computer-readable storage medium is provided, in which at least one piece of program code is stored, the at least one piece of program code being loaded and executed by a processor to implement the operations performed in the method for training an action determination model of a virtual object in the embodiments of the present application.
In another aspect, a computer program product or computer program is provided, the computer program product or computer program comprising computer program code stored in a computer-readable storage medium. A processor of a computer device reads the computer program code from the computer-readable storage medium and executes it, so that the computer device performs the method for training an action determination model of a virtual object provided in the above aspects or the optional implementations of the above aspects.
The beneficial effects brought by the technical solutions provided in the embodiments of the present application are as follows:
In the embodiments of the present application, a method for training an action determination model of a virtual object is provided. In the reinforcement learning process, the parameters of the current action determination model are adjusted based on intrinsic reward information, and the intrinsic reward information is determined according to the predicted computing environment state and the actual environment state reached after the virtual object performs an action. Because the intrinsic reward information changes dynamically with the predicted computing environment state, the actions output by the trained action determination model can correspond to different game strategies, which improves the adversarial capability and robustness of the virtual object with respect to game strategies.
Drawings
In order to more clearly illustrate the technical solutions of the embodiments of the present application, the drawings that are needed in the description of the embodiments will be briefly introduced below, and it is obvious that the drawings in the following description are only some embodiments of the present application, and that other drawings may be obtained according to these drawings without inventive effort for a person skilled in the art.
FIG. 1 is a schematic diagram of an implementation environment of a method for training an action determination model of a virtual object according to an embodiment of the present application;
FIG. 2 is a flowchart of a method for training an action determination model of a virtual object according to an embodiment of the present application;
FIG. 3 is a flowchart of another method for training an action determination model of a virtual object according to an embodiment of the present application;
FIG. 4 is a flowchart of training an environment state determination model according to an embodiment of the present application;
FIG. 5 is a flowchart of training an action determination model according to an embodiment of the present application;
FIG. 6 is a block diagram of an apparatus for training an action determination model of a virtual object according to an embodiment of the present application;
FIG. 7 is a block diagram of a terminal according to an embodiment of the present application;
FIG. 8 is a schematic structural diagram of a server according to an embodiment of the present application.
Detailed Description
For the purpose of making the objects, technical solutions and advantages of the present application more apparent, the embodiments of the present application will be described in further detail below with reference to the accompanying drawings.
Reference will now be made in detail to exemplary embodiments, examples of which are illustrated in the accompanying drawings. When the following description refers to the accompanying drawings, the same numbers in different drawings refer to the same or similar elements, unless otherwise indicated. The implementations described in the following exemplary examples are not representative of all implementations consistent with the present application. Rather, they are merely examples of apparatus and methods consistent with some aspects of the present application as detailed in the accompanying claims.
Techniques that may be used in embodiments of the present application are described below.
Artificial intelligence (Artificial Intelligence, AI) is the theory, method, technology and application system that uses a digital computer, or a machine controlled by a digital computer, to simulate, extend and expand human intelligence, perceive the environment, acquire knowledge and use the knowledge to obtain optimal results. In other words, artificial intelligence is a comprehensive technology of computer science that attempts to understand the essence of intelligence and to produce a new kind of intelligent machine that can react in a manner similar to human intelligence. Artificial intelligence is the study of the design principles and implementation methods of various intelligent machines, so that the machines have the functions of perception, reasoning and decision-making.
Artificial intelligence technology is a comprehensive discipline involving a wide range of fields, including both hardware-level and software-level technologies. Basic artificial intelligence technologies generally include technologies such as sensors, dedicated artificial intelligence chips, cloud computing, distributed storage, big data processing technologies, operation/interaction systems and mechatronics. Artificial intelligence software technologies mainly include computer vision, speech processing, natural language processing and machine learning/deep learning.
Machine Learning (ML) is a multi-field interdisciplinary subject involving probability theory, statistics, approximation theory, convex analysis, algorithm complexity theory and other disciplines. It studies how a computer simulates or implements human learning behavior to acquire new knowledge or skills, and how it reorganizes existing knowledge structures to continuously improve its own performance. Machine learning is the core of artificial intelligence and the fundamental way to make computers intelligent, and it is applied throughout all areas of artificial intelligence. Machine learning and deep learning typically include techniques such as artificial neural networks, belief networks, reinforcement learning, transfer learning, inductive learning and learning from instruction.
Reinforcement learning (Reinforcement Learning, RL), also known as reinforced learning or evaluative learning, is one of the paradigms and methodologies of machine learning. It is used to describe and solve the problem of an agent learning strategies to maximize returns or achieve specific goals during its interaction with an environment.
In reinforcement learning, an agent learns in a "trial and error" manner, and the rewards obtained by interacting with the environment guide its behavior, with the goal of maximizing the rewards obtained by the agent. Reinforcement learning differs from connectionist learning in supervised learning mainly in the reinforcement signal: the reinforcement signal provided by the environment in reinforcement learning is an evaluation of how good the generated action is (typically a scalar signal), rather than telling the reinforcement learning system (Reinforcement Learning System, RLS) how to generate the correct action. Since little information is provided by the external environment, the RLS must learn from its own experience. In this way, the RLS gains knowledge in the action-evaluation environment and improves its action plan to suit the environment.
Virtual scene: is a virtual scene that an application program displays (or provides) while running on a terminal. The virtual scene can be a simulation environment for the real world, a semi-simulation and semi-fictional virtual environment, or a pure fictional virtual environment. The virtual scene can be any one of a two-dimensional virtual scene, a 2.5-dimensional virtual scene or a three-dimensional virtual scene, and the dimension of the virtual scene is not limited in the embodiment of the application. For example, a virtual scene includes sky, land, sea, etc., the land including environmental elements of a desert, city, etc., and an end user can control a virtual object to move in the virtual scene. Optionally, the virtual scene can also be used for virtual scene fight between at least two virtual objects, in which virtual scene there are virtual resources available for the at least two virtual objects.
Virtual object: refers to movable objects in a virtual scene. The movable object is a virtual character, a virtual animal, a cartoon character, or the like, such as: characters, animals, plants, oil drums, walls, stones, etc. displayed in the virtual scene. The virtual object can be an avatar in the virtual scene for representing a user. A virtual scene can include a plurality of virtual objects, each virtual object having its own shape and volume in the virtual scene, occupying a portion of space in the virtual scene. Alternatively, when the virtual scene is a three-dimensional virtual scene, the virtual object can be a three-dimensional stereoscopic model, which can be a three-dimensional character constructed based on three-dimensional human skeleton technology, and the same virtual object can exhibit different external figures by wearing different skins. In some embodiments, the virtual object can also be implemented using a 2.5-dimensional or 2-dimensional model, which is not limited by embodiments of the present application.
Optionally, the virtual object is a player character controlled through operations on the client, an artificial intelligence (Artificial Intelligence, AI) object set up through training for the virtual scene battle, or a non-player character (Non-Player Character, NPC) set up for interaction in the virtual scene. Optionally, the virtual object is a virtual character performing adversarial interaction in the virtual scene. Optionally, the number of virtual objects participating in the interaction in the virtual scene can be preset, or can be dynamically determined according to the number of clients joining the interaction.
MOBA (Multiplayer Online Battle Arena) game: a game in which several strongholds are provided in a virtual scene, and users in different camps control virtual objects to battle in the virtual scene, occupy strongholds or destroy the strongholds of the hostile camp. For example, a MOBA game may divide users into at least two hostile camps, and different virtual teams belonging to the at least two hostile camps occupy their respective map areas and compete with a certain victory condition as the goal. Such victory conditions include, but are not limited to, at least one of: occupying strongholds or destroying strongholds of the hostile camp, killing virtual objects of the hostile camp, ensuring one's own survival in a specified scene and time, seizing a certain resource, or outscoring the opponent within a specified time. For example, a mobile MOBA game may divide users into two hostile camps and scatter the virtual objects controlled by the users in the virtual scene to compete with each other, with destroying or occupying all strongholds of the enemy as the victory condition.
The following describes an implementation environment of the method for training an action determination model of a virtual object provided in the embodiments of the present application. FIG. 1 is a schematic diagram of an implementation environment of a method for training an action determination model of a virtual object according to an embodiment of the present application. Referring to FIG. 1, the implementation environment includes a terminal 101 and a server 102.
The terminal 101 and the server 102 can be directly or indirectly connected through wired or wireless communication, and the present application is not limited herein.
Optionally, the terminal 101 is a smart phone, a tablet computer, a notebook computer, a desktop computer, a smart speaker, a smart watch, etc., but is not limited thereto. The terminal 101 installs and runs an application program supporting a virtual scene. The application may be any one of a First person shooter game (FPS), a third person shooter game, a multiplayer online tactical competition game (Multiplayer Online Battle Arena games, MOBA), a Real-time strategic game (Real-Time Strategy Game, RTS), a virtual reality application, a three-dimensional map program, or a multiplayer gunfight survival game. Illustratively, the terminal 101 is a terminal used by a user who uses the terminal 101 to operate a controlled virtual object located in a virtual scene to perform activities including, but not limited to: adjusting at least one of body posture, crawling, walking, running, riding, jumping, driving, picking up, shooting, attacking, throwing. Illustratively, the virtual object is a virtual character, such as an emulated persona or a cartoon persona.
Optionally, the server 102 is an independent physical server, or can be a server cluster or a distributed system formed by a plurality of physical servers, or can be a cloud server that provides basic cloud computing services such as cloud services, cloud databases, cloud computing, cloud functions, cloud storage, network services, cloud communication, middleware services, domain name services, security services, CDN (Content Delivery Network) services, and big data and artificial intelligence platforms. The server 102 is configured to provide background services for applications that support virtual scenes. Optionally, the server 102 can assume the primary computing work and the terminal 101 the secondary computing work; alternatively, the server 102 assumes the secondary computing work and the terminal 101 the primary computing work; alternatively, a distributed computing architecture is used for collaborative computing between the server 102 and the terminal 101.
Alternatively, the virtual object controlled by the terminal 101 (hereinafter referred to as a controlled virtual object) and the virtual object controlled by the server 102 (hereinafter referred to as an AI object) are in the same virtual scene, and at this time the controlled virtual object can interact with the AI object in the virtual scene. In some embodiments, the controlled virtual object and the AI object can be hostile, e.g., the controlled virtual object and the AI object can belong to different teams and organizations, and the hostile virtual objects can be interacted with in an antagonistic manner by releasing skills from each other.
Those skilled in the art will appreciate that the number of terminals described above can be greater or fewer. For example, there can be only one terminal, or there can be three, five, or more terminals. The number of terminals and the device types are not limited in the embodiments of the present application.
Alternatively, the wireless network or wired network described above uses standard communication techniques and/or protocols. The network is typically the internet, but can be any network including, but not limited to, a local area network (Local Area Network, LAN), metropolitan area network (Metropolitan Area Network, MAN), wide area network (Wide Area Network, WAN), mobile, wired or wireless network, private network, or any combination of virtual private networks. In some embodiments, data exchanged over the network is represented using techniques and/or formats including HyperText Mark-up Language (HTML), extensible markup Language (Extensible Markup Language, XML), and the like. In addition, all or some of the links can be encrypted using conventional encryption techniques such as secure socket layer (Secure Socket Layer, SSL), transport layer security (Transport Layer Security, TLS), virtual private network (Virtual Private Network, VPN), internet protocol security (Internet Protocol Security, IPsec), and the like. In other embodiments, custom and/or dedicated data communication techniques can also be used in place of or in addition to the data communication techniques described above.
FIG. 2 is a flowchart of a method for training an action determination model of a virtual object according to an embodiment of the present application. As shown in FIG. 2, the embodiment of the present application is described by taking application of the method to a computer device as an example. The method for training the action determination model of the virtual object comprises the following steps:
201. the computer device determines a computing environment state of the virtual scene after a target duration based on a first environment state of the virtual scene, the first environment state and the computing environment state respectively representing environments of the virtual scene.
In the embodiments of the present application, the virtual scene is a virtual scene in a MOBA game, and the virtual scene includes various virtual resources, such as monsters, defensive towers and minions. Optionally, the environment state of the virtual scene is the state in which these virtual resources are currently located. For example, monsters in the virtual scene exist at fixed positions; after the monster at any position is killed, that position is in an empty state, and after a period of time the monster at that position respawns, at which point the position is in a surviving state. For another example, the defensive towers in the virtual scene exist at fixed positions, cannot be regenerated after being destroyed, and need to be destroyed in a certain order, and the environment state includes the state of each defensive tower. The computer device is capable of acquiring a first environment state of the virtual scene at a certain moment. Optionally, the environment state may further include the state of each virtual object in the virtual scene, such as position and life value, which is not limited in the embodiments of the present application. The computer device is capable of calculating, based on the first environment state, the computing environment state after a target duration from the first environment state.
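To make the environment state concrete, the following minimal sketch shows one possible way to encode such a state as a flat feature vector; the specific fields (monster status, tower status, positions and life values of virtual objects) and the encoding itself are assumptions of this example, not the representation used by the application.

```python
from dataclasses import dataclass
from typing import List
import numpy as np

@dataclass
class EnvState:
    """Hypothetical snapshot of a MOBA virtual scene at one moment."""
    monster_alive: List[int]      # 1 if the monster at a fixed camp is alive, else 0
    tower_alive: List[int]        # 1 if a defensive tower still stands, else 0
    hero_positions: List[float]   # flattened (x, y) coordinates of each virtual object
    hero_health: List[float]      # normalized life values of each virtual object

def encode_state(state: EnvState) -> np.ndarray:
    """Concatenate all fields into the flat vector fed to the models."""
    return np.concatenate([
        np.asarray(state.monster_alive, dtype=np.float32),
        np.asarray(state.tower_alive, dtype=np.float32),
        np.asarray(state.hero_positions, dtype=np.float32),
        np.asarray(state.hero_health, dtype=np.float32),
    ])
```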
202. The computer equipment determines intrinsic rewarding information according to the computing environment state and the actual environment state of the first environment state at the next moment, wherein the intrinsic rewarding information is used for indicating whether the execution of the target action is beneficial to the virtual object, and the actual environment state is used for indicating the environment state of the virtual scene after the virtual object executes the target action under the first environment state.
In the embodiments of the present application, when the virtual scene is in the first environment state, the virtual object may perform the target action, thereby causing the environment state of the virtual scene to change; at this time, the computer device obtains the changed environment state, that is, the actual environment state of the first environment state at the next moment. The computer device can determine the intrinsic reward information according to the actual environment state after the target action is performed and the calculated computing environment state. If the intrinsic reward information is positive, it indicates that performing the target action benefits the virtual object; if the intrinsic reward information is negative or zero, it indicates that performing the target action does not benefit the virtual object. Benefits to the virtual object include, for example, the virtual object gaining gold coins, gaining experience points, or winning the game; detriments include, for example, the virtual object losing life value or losing the game. Of course, after the virtual object performs the target action, there may be both forward events that benefit the virtual object and reverse events that do not; the computer device can further give different weights to the forward events and the reverse events and determine the intrinsic reward information according to the weighted sum of the forward events and the reverse events, as illustrated in the sketch below.
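As a rough illustration of the weighting described above, the sketch below sums hypothetical forward and reverse events into a single value; the event names and weight values are assumptions made only for this example.

```python
from typing import List

# Hypothetical per-event weights: positive for forward events, negative for reverse events.
EVENT_WEIGHTS = {
    "gain_gold": 0.2,
    "gain_experience": 0.1,
    "win_game": 1.0,
    "lose_health": -0.3,
    "lose_game": -1.0,
}

def weighted_event_reward(events: List[str]) -> float:
    """Sum the weights of the forward and reverse events observed after the target action."""
    return sum(EVENT_WEIGHTS.get(name, 0.0) for name in events)

# A positive sum suggests the target action benefited the virtual object.
print(weighted_event_reward(["gain_gold", "lose_health"]))  # prints approximately -0.1
```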
203. The computer device adjusts parameters of the current action determination model based on the intrinsic reward information.
In the embodiment of the application, if the intrinsic reward information indicates that the virtual object performs the target action beneficial to the virtual object, adjusting parameters of the current action determination model so that the target action is performed when similar environment states are encountered again; if the intrinsic rewards information indicates that the virtual object cannot benefit the virtual object from performing the target action, then parameters of the current action determination model are adjusted so that the target action is not performed when similar environmental conditions are again encountered.
204. In response to the current action determination model conforming to the first target condition, the computer device determines the current action determination model as a trained action determination model for outputting an action in accordance with the input environmental state.
In the embodiments of the present application, the first target condition is that a preset number of training iterations is reached, that the error between the environment state of the virtual scene changed by the actions output by the action determination model and the predicted computing environment state is within a target range, or that the environment state of the virtual scene reaches a preset target environment state; of course, other conditions may also be used.
In the embodiments of the present application, a method for training an action determination model of a virtual object is provided. In the reinforcement learning process, the parameters of the current action determination model are adjusted based on intrinsic reward information, and the intrinsic reward information is determined according to the predicted computing environment state and the actual environment state reached after the virtual object performs an action. Because the intrinsic reward information changes dynamically with the predicted computing environment state, the actions output by the trained action determination model can correspond to different game strategies, which improves the adversarial capability and robustness of the virtual object with respect to game strategies.
The foregoing FIG. 2 shows the main flow of the method for training the action determination model of the virtual object according to the embodiments of the present application. The following describes the method based on an application scenario in which the environment state determination model is trained first, and the action determination model is then trained based on the trained environment state determination model.
FIG. 3 is a flowchart of another method for training an action determination model of a virtual object according to an embodiment of the present application. As shown in FIG. 3, the embodiment of the present application is described by taking application of the method to a terminal as an example. The method for training the action determination model of the virtual object comprises the following steps:
301. The terminal trains an environment state determining model, and the environment state determining model is used for calculating the environment state after the target time length according to the environment state known by the virtual scene.
In the embodiments of the present application, the virtual scene is a virtual scene in a MOBA game, and the virtual scene includes various virtual resources, such as monsters, defensive towers and minions. The quantities and states of these virtual resources constitute the environment state of the virtual scene. As the virtual object performs different actions, the environment state changes correspondingly. Optionally, the environment state can also include the state of each virtual object in the virtual scene, such as position and life value, which is not limited in the embodiments of the present application.
The terminal can train the environment state determination model in a supervised learning manner. The training steps of the environment state determination model are as follows. Taking one iteration as an example, the terminal first acquires a first sample environment state and a second sample environment state corresponding to the first sample environment state after a target duration. The terminal then takes the first sample environment state as input and the second sample environment state as label information to train the environment state determination model in the current iteration. In response to the environment state determination model in the current iteration meeting a second target condition, the terminal takes the environment state determination model in the current iteration as the trained environment state determination model; in response to the environment state determination model in the current iteration not meeting the second target condition, the terminal performs the next iteration. Optionally, the second sample environment state is input information extracted based on prior knowledge; for example, game execution strategies provided by expert data in the game field are used as the prior knowledge of the environment state determination model, and the environment state determination model is obtained through training. Optionally, the environment state determination model is trained based on a Meta-Controller network (phi).
For example, referring to FIG. 4, FIG. 4 is a schematic flowchart of training an environment state determination model according to an embodiment of the present application. As shown in FIG. 4, the process of training the environment state determination model is a supervised learning process. First, the terminal acquires expert data in the game field, such as player game data, execution policy data and resource acquisition data, as prior knowledge. Future targets (meta-goals) of the game are then defined; these are environment states that the virtual scene may reach during the game, such as pushing a tower, clearing a minion wave, or jungling. Because these future targets represent changes in the environment state over a long period of time and are therefore sparse, the meta-goal design is relatively simple. Then, in a supervised learning manner, the terminal extracts from the expert data serving as prior knowledge the environment state of the virtual scene at a certain moment, that is, the first sample environment state, which is used as the input feature of the environment state determination model corresponding to the current iteration, and the defined meta-goal is used as the label of the environment state determination model corresponding to the current iteration. The first sample environment state of the virtual scene is denoted s_t; the environment state determination model corresponding to the current iteration predicts from s_t and outputs the calculated sample computing environment state, denoted g_{t+c}, where t denotes the time and c denotes a period of time. Based on the difference between g_{t+c} and the label, the terminal adjusts the parameters of the environment state determination model corresponding to the current iteration until the second target condition is met.
It should be noted that, the second target condition is that the preset iteration number is reached, the error is smaller than the error threshold, or the loss is smaller than the loss threshold, which is not limited in the embodiment of the present application.
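A minimal sketch of the supervised training loop for the environment state determination model, assuming a simple fully connected PyTorch network; the network size, optimizer settings and stopping check are illustrative assumptions rather than the configuration used by the application.

```python
import torch
import torch.nn as nn

class EnvStateModel(nn.Module):
    """Predicts the environment state g_{t+c} reached after the target duration from s_t."""
    def __init__(self, state_dim: int, goal_dim: int, hidden: int = 256):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(state_dim, hidden), nn.ReLU(),
            nn.Linear(hidden, hidden), nn.ReLU(),
            nn.Linear(hidden, goal_dim),
        )

    def forward(self, s_t: torch.Tensor) -> torch.Tensor:
        return self.net(s_t)

def train_env_state_model(model, dataset, epochs: int = 10, error_threshold: float = 1e-3):
    """dataset yields (first sample environment state, meta-goal label) pairs from expert data."""
    optimizer = torch.optim.Adam(model.parameters(), lr=1e-4)
    loss_fn = nn.MSELoss()
    for _ in range(epochs):
        for s_t, g_label in dataset:
            pred = model(s_t)                       # predicted sample computing environment state
            loss = loss_fn(pred, g_label)
            optimizer.zero_grad()
            loss.backward()
            optimizer.step()
        if loss.item() < error_threshold:           # one assumed form of the second target condition
            break
    return model
```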
302. The terminal determines the computing environment state of the virtual scene after the target duration through an environment state determination model based on the first environment state of the virtual scene, wherein the first environment state and the computing environment state respectively represent the environment of the virtual scene.
In this embodiment of the present application, the first environmental state is a current environmental state of the virtual scene, and the terminal can determine, based on the first environmental state, a computing environmental state of the virtual scene after the target duration through the environmental state determination model. Correspondingly, the terminal can determine a first environment vector of a first environment state of the virtual scene, then input the first environment vector into an environment state determination model, and output a computing environment state of the virtual scene after a target duration by the environment state determination model. The environment state after the target time length is calculated based on the current environment state of the virtual scene through the environment state determining model, so that the action determining model can be optimized based on the environment state after the target time length, and actions corresponding to various game strategies can be output by the action determining model obtained through training.
After the terminal controls the virtual object to perform any action, the current environment state of the virtual scene changes and becomes the environment state at the next moment. The terminal calculates the intrinsic reward information based on the environment state at the next moment and the computing environment state output by the environment state determination model, and based on the intrinsic reward information the current action determination model outputs a new action. The new action again causes the environment state of the virtual scene to change, so that the parameters of the current action determination model are continuously updated.
303. The terminal determines intrinsic rewarding information according to the computing environment state and the actual environment state of the first environment state at the next moment, wherein the intrinsic rewarding information is used for indicating whether the execution of the target action is beneficial to the virtual object, and the actual environment state is used for indicating the environment state of the virtual scene after the virtual object executes the target action under the first environment state.
In this embodiment of the present application, after the virtual object executes the target action in the first environmental state of the virtual scene, the environmental state of the virtual scene is changed, and at this time, the terminal obtains the changed environmental state, that is, the actual environmental state of the first environmental state at the next moment. The terminal can determine the intrinsic rewarding information according to the actual environmental state after executing the target action and the calculated computing environmental state, refer to step 202, and will not be described herein.
Optionally, the closer the actual environment state is to the computing environment state compared with the first environment state, the more beneficial performing the target action is to the virtual object; conversely, if the actual environment state deviates from the progression from the first environment state toward the computing environment state, performing the target action is not beneficial to the virtual object. The terminal determining the intrinsic reward information according to the computing environment state and the actual environment state comprises the following steps. The terminal is capable of obtaining a first difference between the first environment state and the computing environment state. The terminal is then able to obtain a second difference between the actual environment state and the computing environment state. Finally, the terminal can determine a target difference between the first difference and the second difference as the intrinsic reward information; a target difference that is not negative indicates that performing the target action is beneficial to the virtual object. Optionally, in calculating the intrinsic reward information, the terminal can transform the first environment state and the actual environment state to the same dimension as the computing environment state.
Optionally, the terminal can calculate the above intrinsic reward information through the following formula (1):

reward_intrinsic = ||f(s_t) - g_{t+c}|| - ||f(s_{t+1}) - g_{t+c}||    (1)

where reward_intrinsic denotes the intrinsic reward information; f(·) denotes the mapping that transforms s_t and s_{t+1} to the same dimension as g_{t+c}; s_t denotes the first environment state; g_{t+c} denotes the computing environment state; and s_{t+1} denotes the actual environment state.
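The sketch below implements formula (1) directly; it assumes f is an already-available mapping that projects an environment state to the dimension of g_{t+c}, since the concrete form of f is not specified here.

```python
import numpy as np

def intrinsic_reward(f, s_t: np.ndarray, s_t1: np.ndarray, g: np.ndarray) -> float:
    """Formula (1): reward = ||f(s_t) - g_{t+c}|| - ||f(s_{t+1}) - g_{t+c}||."""
    return float(np.linalg.norm(f(s_t) - g) - np.linalg.norm(f(s_t1) - g))

# The reward is positive when the actual next state s_{t+1} ends up closer to the
# predicted computing environment state g_{t+c} than the first state s_t was.
```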
304. And the terminal adjusts parameters of the current action determining model according to the intrinsic reward information.
In the embodiment of the application, if the intrinsic reward information indicates that the virtual object performs the target action beneficial to the virtual object, adjusting parameters of the current action determination model so that the target action is performed when similar environment states are encountered again; if the intrinsic rewards information indicates that the virtual object cannot benefit the virtual object from performing the target action, then parameters of the current action determination model are adjusted so that the target action is not performed when similar environmental conditions are again encountered.
305. And responding to the current action determining model conforming to the first target condition, determining the current action determining model as a trained action determining model by the terminal, wherein the action determining model is used for outputting actions according to the input environment state.
In the embodiments of the present application, the first target condition is that a preset number of training iterations is reached, that the error between the environment state of the virtual scene changed by the actions output by the action determination model and the predicted computing environment state is within a target range, or that the environment state of the virtual scene reaches a preset target environment state, which is not limited in the embodiments of the present application.
It should be noted that the framework in which the terminal trains the action determination model is a hierarchical reinforcement learning framework with a two-layer structure. The upper layer calculates the environment state of the virtual scene after the target duration, that is, the above computing environment state, through the pre-trained environment state determination model, and then a reward calculation unit constructs an intrinsic reward signal (Intrinsic Reward) from that computing environment state as the above intrinsic reward information; this intrinsic reward signal replaces the manually defined dense reward signal. The lower layer adjusts the parameters of the current action determination model according to the intrinsic reward information to achieve policy optimization, and the action determination model is constructed based on the game AI policy network. Optionally, the terminal can optimize the policy of the current action determination model through the PPO (Proximal Policy Optimization) reinforcement learning algorithm, and then perform self-play training on the optimized current action determination model until an action determination model meeting the first target condition is obtained. In this scheme, the intrinsic reward information comes from the prediction of the environment state determination model, and the environment state determination model can predict the distribution of environment states under various execution strategies, so it can provide diverse inputs for the current action determination model. The trained AI object can therefore execute a variety of game execution strategies, which improves the adversarial capability of the AI object with respect to game strategies.
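The two-layer structure described above can be sketched as follows. For brevity, the policy update uses a plain REINFORCE-style step in place of the PPO algorithm mentioned in the text, and the environment helpers env.reset() and env.step() are assumed interfaces, so this is only an outline of the hierarchical loop, not the actual training system.

```python
import torch

def train_action_model(policy, env_state_model, f, env, steps: int = 10000, lr: float = 1e-4):
    """Hierarchical loop: the upper layer builds intrinsic rewards, the lower layer updates the policy."""
    optimizer = torch.optim.Adam(policy.parameters(), lr=lr)
    s_t = env.reset()                                  # first environment state (assumed API)
    for _ in range(steps):
        g = env_state_model(s_t).detach()              # upper layer: state predicted after the target duration
        logits = policy(s_t)
        dist = torch.distributions.Categorical(logits=logits)
        action = dist.sample()
        s_t1 = env.step(action.item())                 # actual environment state at the next moment
        reward = (torch.norm(f(s_t) - g) - torch.norm(f(s_t1) - g)).detach()  # formula (1)
        loss = -dist.log_prob(action) * reward         # REINFORCE-style stand-in for the PPO update
        optimizer.zero_grad()
        loss.backward()
        optimizer.step()
        s_t = s_t1
    return policy
```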
For example, referring to FIG. 5, FIG. 5 is a schematic flowchart of training an action determination model according to an embodiment of the present application. As shown in FIG. 5, the process of training the action determination model is a reinforcement learning process. First, the first environment state s_t of the virtual scene is obtained. The terminal then uses the pre-trained environment state determination model to predict, from the first environment state s_t, the computing environment state g_{t+c}, that is, the environment state the virtual scene may reach after the target duration. The terminal then determines the intrinsic reward information through the reward calculation unit, based on the computing environment state output by the environment state determination model and the actual environment state of the first environment state at the next moment. The terminal adjusts the parameters of the current action determination model according to the intrinsic reward information, and the current action determination model outputs a sample action. After the virtual object performs the sample action, the environment state of the virtual scene changes again, and the terminal repeats the above training steps until the trained action determination model is obtained. After the action determination model is trained, it can output a corresponding action for the current third environment state s'_t of the virtual scene, so that the terminal can control the virtual object to perform the corresponding action.
306. And the terminal acquires the current third environment state of the virtual scene.
In the embodiment of the application, the terminal can acquire the state of the environment of the virtual scene at the current moment as the third environment state of the virtual scene.
307. And the terminal processes the third environment state based on the action determining model to obtain an action to be executed.
In this embodiment of the present application, after obtaining the third environmental state, the terminal may process the third environmental state based on the trained action determination model, and determine an action to be performed by at least one virtual object at a next moment in the virtual scene. Optionally, the terminal can determine an environmental state vector according to the third environmental state, then the terminal inputs the environmental state vector into the action determining model, the action determining model outputs an action to be performed of at least one virtual object, and the terminal controls the at least one virtual object to perform the action to be performed. The at least one virtual object is a terminal-controlled AI object.
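A brief sketch of this inference step, assuming the trained action determination model maps an encoded environment state vector to one score per candidate action; the discrete action indexing is an assumption of the example.

```python
import torch

def choose_action(action_model: torch.nn.Module, state_vector: torch.Tensor) -> int:
    """Pick the action to be performed from the encoded third environment state."""
    with torch.no_grad():
        scores = action_model(state_vector.unsqueeze(0))   # shape (1, num_actions)
    return int(scores.argmax(dim=-1).item())               # index of the action the AI object will perform
```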
It should be noted that, the action determining model can be trained based on the intrinsic reward information, and in this embodiment of the present application, the intrinsic reward information is indirectly obtained based on the environmental state determining model, unlike the manner in which the reward information is manually defined in the conventional scheme. Correspondingly, firstly, predicting the current environmental state of the virtual scene by the environmental state determining model to obtain the computing environmental state after the current target time length, and then obtaining the intrinsic rewarding information based on the computing environmental state. Optionally, the environmental state determining model is trained based on prior knowledge of expert data, and is used for predicting the environmental state after the target duration according to the known environmental state of the virtual scene, which is detailed in step 301.
308. And the terminal controls the at least one virtual object to respectively perform the actions to be performed in the virtual scene.
In the embodiments of the present application, the at least one virtual object is at least one AI object, and the terminal can control the at least one AI object to perform adversarial interaction, in the virtual scene, with the controlled virtual object controlled by the user through the terminal. Optionally, the terminal can control the at least one AI object to respectively perform the actions to be performed in the virtual scene, so as to achieve adversarial interaction with at least one controlled virtual object. Alternatively, the terminal can control the at least one AI object to respectively perform the actions to be performed in the virtual scene so as to interact with the virtual resources, for example jungling, pushing a tower or clearing a minion wave. The embodiments of the present application do not limit the specific behavior action that is indicated. The action to be performed is a general concept and is not limited to a fixed action.
It should be noted that, when the terminal trains the action determination model using the hierarchical learning framework, the environment state determination model can be pre-trained and the action determination model then trained based on the trained environment state determination model, as in steps 301 to 305; alternatively, the action determination model and the environment state determination model can be trained simultaneously, in which case the environment state determination model is the environment state determination model corresponding to any one iteration. For example, the terminal first acquires a first sample environment state and a second sample environment state corresponding to the first sample environment state after a target duration. The terminal then takes the first sample environment state as input and the second sample environment state as label information to train the environment state determination model in the current iteration. The terminal obtains the sample computing environment state of the virtual scene after the target duration output by the environment state determination model in the current iteration, and adjusts the parameters of the environment state determination model in the current iteration according to the second sample environment state and the sample computing environment state. The terminal determines intrinsic reward information according to the sample computing environment state and the sample actual environment state of the first sample environment state at the next moment, and adjusts the parameters of the current action determination model according to the intrinsic reward information. In response to the current action determination model meeting the first target condition, the terminal determines the current action determination model as the trained action determination model. Compared with training the action determination model by means such as inverse reinforcement learning, training it with the hierarchical learning framework saves a large amount of computation and is suitable for more complex game scenes, so that the game strategies executable by the virtual object are diverse and the adversarial capability of the virtual object with respect to game strategies can be effectively improved.
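When both models are trained simultaneously as described above, the procedure can be sketched as a loop that alternates one supervised update of the environment state determination model with one policy update; the helper names sample_expert_pair and policy_update are assumptions of this sketch, not interfaces defined by the application.

```python
import torch
import torch.nn as nn

def joint_training(env_state_model, policy, f, env, sample_expert_pair, policy_update, iterations=1000):
    """Alternate one supervised step for the environment state model with one policy step."""
    env_optim = torch.optim.Adam(env_state_model.parameters(), lr=1e-4)
    mse = nn.MSELoss()
    s_t = env.reset()
    for _ in range(iterations):
        # Supervised step: fit the environment state determination model of the current iteration.
        sample_s, sample_g = sample_expert_pair()          # first / second sample environment states
        env_loss = mse(env_state_model(sample_s), sample_g)
        env_optim.zero_grad()
        env_loss.backward()
        env_optim.step()

        # Policy step: intrinsic reward computed from the still-improving environment state model.
        g = env_state_model(s_t).detach()
        action, s_t1 = policy_update(policy, s_t, g, f, env)  # e.g. the PPO-style update of step 304
        s_t = s_t1
    return env_state_model, policy
```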
In the embodiments of the present application, a method for training an action determination model of a virtual object is provided. In the reinforcement learning process, the parameters of the current action determination model are adjusted based on intrinsic reward information, and the intrinsic reward information is determined according to the predicted computing environment state and the actual environment state reached after the virtual object performs an action. Because the intrinsic reward information changes dynamically with the predicted computing environment state, the actions output by the trained action determination model can correspond to different game strategies, which improves the adversarial capability and robustness of the virtual object with respect to game strategies.
Fig. 6 is a block diagram of an action determination model training apparatus for a virtual object provided according to an embodiment of the present application. The apparatus is configured to perform the steps of the above method for training the action determination model of the virtual object. Referring to fig. 6, the apparatus includes: a state determination module 601, an information determination module 602, a parameter adjustment module 603, and a model determination module 604.
The state determining module 601 is configured to determine a computing environment state of the virtual scene after the target duration based on a first environment state of the virtual scene, where the first environment state and the computing environment state respectively represent environments of the virtual scene;
An information determining module 602, configured to determine intrinsic reward information according to the computing environment state and an actual environment state of the first environment state at a next moment, where the intrinsic reward information is used to indicate whether executing a target action is beneficial to a virtual object, and the actual environment state is used to indicate the environment state of the virtual scene after the virtual object executes the target action in the first environment state;
a parameter adjustment module 603, configured to adjust parameters of the current motion determination model according to the intrinsic reward information;
the model determining module 604 is configured to determine the current motion determining model as a trained motion determining model in response to the current motion determining model conforming to a first target condition, where the motion determining model is configured to output a motion according to an input environmental state.
In one possible implementation, the information determining module 602 is configured to acquire a first difference between the first environment state and the computing environment state; acquire a second difference between the actual environment state and the computing environment state; and determine a target difference between the first difference and the second difference as the intrinsic reward information, where the target difference not being negative indicates that executing the target action is beneficial to the virtual object.
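As a purely illustrative numeric example (the values are assumptions, not taken from the application): suppose the difference between the first environment state and the computing environment state is 0.8, and the difference between the actual environment state and the computing environment state is 0.3; the target difference is 0.8 - 0.3 = 0.5, which is not negative, consistent with the executed target action having moved the virtual scene toward the predicted computing environment state and therefore being beneficial to the virtual object. If instead the actual environment state were farther from the computing environment state than the first environment state was, the target difference would be negative, indicating that executing the target action is not beneficial.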
In one possible implementation, the apparatus further includes:
a state conversion module, configured to convert the first environment state and the actual environment state into the same dimension as the computing environment state.
In a possible implementation, the state determining module 601 is configured to determine a first environment vector of the first environment state of the virtual scene, and to input the first environment vector into an environment state determination model, which outputs the computing environment state of the virtual scene after the target duration; the environment state determination model is used for computing the environment state after the target duration according to a known environment state.
In one possible implementation, the training step of the environment state determination model includes the following (a minimal sketch follows this list):
acquiring a first sample environment state and a second sample environment state corresponding to the first sample environment state after a target duration;
training the environment state determination model in the current iteration process by taking the first sample environment state as input and the second sample environment state as label information;
and in response to the environment state determination model in the current iteration process meeting a second target condition, taking the environment state determination model in the current iteration process as the trained environment state determination model.
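A minimal sketch of the pre-training variant (training the environment state determination model alone, before the action determination model) follows. The dataset layout, dimensions, loss function, loss threshold, and epoch budget are assumptions introduced for illustration; in particular, the second target condition is expressed here as either a loss threshold or an iteration budget, which is only one possible form of that condition.

```python
import torch
import torch.nn as nn

STATE_DIM, REDUCED_DIM = 128, 16   # assumed toy feature sizes

env_state_model = nn.Sequential(nn.Linear(STATE_DIM, 64), nn.ReLU(), nn.Linear(64, REDUCED_DIM))
optimizer = torch.optim.Adam(env_state_model.parameters(), lr=1e-3)

def pretrain(first_states: torch.Tensor, second_states: torch.Tensor,
             loss_threshold: float = 1e-3, max_epochs: int = 100) -> nn.Module:
    """first_states: first sample environment states (inputs); second_states: the
    corresponding sample environment states after the target duration (labels,
    assumed here to already be in the reduced dimension, e.g. extracted based on
    prior knowledge)."""
    for _ in range(max_epochs):
        predicted = env_state_model(first_states)            # sample computing environment state
        loss = nn.functional.mse_loss(predicted, second_states)
        optimizer.zero_grad()
        loss.backward()
        optimizer.step()
        if loss.item() < loss_threshold:                     # second target condition (assumed form)
            break
    return env_state_model
```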
In one possible implementation, the second sample environmental state is extracted based on a priori knowledge.
In one possible implementation, the environment state determination model is the environment state determination model corresponding to any one iteration process.
In one possible implementation, the apparatus further includes a self-playing module for self-playing training the current motion determination model.
In the embodiment of the application, an apparatus for training an action determination model of a virtual object is provided. During reinforcement learning, the parameters of the current action determination model are adjusted based on intrinsic reward information, and the intrinsic reward information is determined according to the predicted computing environment state and the actual environment state after the virtual object executes the action. Because the intrinsic reward information changes dynamically with the virtual scene after the virtual object executes the action and with the predicted computing environment state, the actions output by the trained action determination model can correspond to different game strategies, which improves the countermeasure capability and robustness of the virtual object with respect to game strategies.
It should be noted that, in the apparatus for training the action determination model of the virtual object provided in the above embodiment, the division into the above functional modules is only used for illustration; in practical applications, the above functions can be allocated to different functional modules as required, that is, the internal structure of the apparatus is divided into different functional modules to complete all or part of the functions described above. In addition, the apparatus for training the action determination model of the virtual object provided in the above embodiment and the method embodiment for training the action determination model of the virtual object belong to the same concept; the detailed implementation process is described in the method embodiment and will not be repeated here.
In the embodiment of the present application, the computer device can be configured as a terminal or a server. When the computer device is configured as a terminal, the terminal serves as the execution body of the technical solution provided in the embodiment of the present application; when the computer device is configured as a server, the server serves as the execution body. The technical solution can also be implemented through interaction between the terminal and the server: for example, the terminal sends the environment state of the virtual scene to the server, the server processes the received environment state based on the trained action determination model to obtain an action to be executed and returns the action to the terminal, and the terminal controls at least one virtual object to execute the action. The embodiments of the present application are not limited in this regard.
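A purely hypothetical sketch of this terminal and server division of work follows: the direct function call stands in for the actual network round trip (for example an RPC or HTTP request), and the model, feature sizes, and function names are assumptions introduced for illustration only.

```python
import torch
import torch.nn as nn

STATE_DIM, ACTION_DIM = 128, 8   # assumed toy feature sizes

# Placeholder for the trained action determination model held on the server.
trained_action_model = nn.Sequential(nn.Linear(STATE_DIM, 64), nn.ReLU(), nn.Linear(64, ACTION_DIM))

def server_handle_request(env_state_vector: torch.Tensor) -> int:
    """Server side: process the received environment state with the trained action
    determination model and return the index of the action to be executed."""
    with torch.no_grad():
        scores = trained_action_model(env_state_vector)
    return int(scores.argmax())

def terminal_step(env_state_vector: torch.Tensor) -> int:
    """Terminal side: send the current environment state of the virtual scene to the
    server (a direct call stands in for the network round trip), then control the
    AI object to execute the returned action."""
    return server_handle_request(env_state_vector)

# Usage: one decision step for a single AI object.
action_to_execute = terminal_step(torch.randn(STATE_DIM))
```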
Fig. 7 is a block diagram of a terminal 700 according to an embodiment of the present application. The terminal 700 can be a portable mobile terminal such as a smart phone, a tablet computer, an MP3 (Moving Picture Experts Group Audio Layer III) player, an MP4 (Moving Picture Experts Group Audio Layer IV) player, a notebook computer, or a desktop computer. The terminal 700 may also be referred to by other names such as user device, portable terminal, laptop terminal, or desktop terminal.
In general, the terminal 700 includes: a processor 701 and a memory 702.
The processor 701 can include one or more processing cores, such as a 4-core processor or an 8-core processor. The processor 701 can be implemented in at least one hardware form of DSP (Digital Signal Processing), FPGA (Field-Programmable Gate Array), and PLA (Programmable Logic Array). The processor 701 can also include a main processor and a coprocessor; the main processor, also called a CPU (Central Processing Unit), is a processor for processing data in the awake state, and the coprocessor is a low-power processor for processing data in the standby state. In some embodiments, the processor 701 can integrate a GPU (Graphics Processing Unit) for rendering and drawing the content to be displayed on the display screen. In some embodiments, the processor 701 can also include an AI (Artificial Intelligence) processor for processing computing operations related to machine learning.
The memory 702 can include one or more computer-readable storage media, which can be non-transitory. The memory 702 may also include high-speed random access memory, as well as non-volatile memory, such as one or more magnetic disk storage devices, flash memory storage devices. In some embodiments, a non-transitory computer readable storage medium in memory 702 is used to store at least one program code for execution by processor 701 to implement the method of motion determination model training for virtual objects provided by method embodiments in the present application.
In some embodiments, the terminal 700 may further optionally include: a peripheral interface 703 and at least one peripheral. The processor 701, the memory 702, and the peripheral interface 703 can be connected by a bus or signal lines. The individual peripheral devices can be connected to the peripheral device interface 703 via buses, signal lines or a circuit board. Specifically, the peripheral device includes: at least one of radio frequency circuitry 704, a display 705, a camera assembly 706, audio circuitry 707, and a power supply 709.
The peripheral interface 703 can be used to connect at least one I/O (Input/Output) related peripheral device to the processor 701 and the memory 702. In some embodiments, the processor 701, the memory 702, and the peripheral interface 703 are integrated on the same chip or circuit board; in some other embodiments, any one or two of the processor 701, the memory 702, and the peripheral interface 703 can be implemented on a separate chip or circuit board, which is not limited in this embodiment.
The radio frequency circuit 704 is configured to receive and transmit RF (Radio Frequency) signals, also referred to as electromagnetic signals. The radio frequency circuit 704 communicates with a communication network and other communication devices via electromagnetic signals: it converts an electrical signal into an electromagnetic signal for transmission, or converts a received electromagnetic signal into an electrical signal. Optionally, the radio frequency circuit 704 includes: an antenna system, an RF transceiver, one or more amplifiers, a tuner, an oscillator, a digital signal processor, a codec chipset, a subscriber identity module card, and so on. The radio frequency circuit 704 is capable of communicating with other terminals via at least one wireless communication protocol, which includes, but is not limited to: the world wide web, metropolitan area networks, intranets, mobile communication networks of various generations (2G, 3G, 4G, and 5G), wireless local area networks, and/or WiFi (Wireless Fidelity) networks. In some embodiments, the radio frequency circuit 704 can also include NFC (Near Field Communication) related circuitry, which is not limited in this application.
The display screen 705 is used to display a UI (User Interface). The UI can include graphics, text, icons, video, and any combination thereof. When the display 705 is a touch display, the display 705 also has the ability to collect touch signals on or above its surface. The touch signal can be input to the processor 701 as a control signal for processing. At this time, the display screen 705 can also be used to provide virtual buttons and/or a virtual keyboard, also referred to as soft buttons and/or a soft keyboard. In some embodiments, there can be one display 705, disposed on the front panel of the terminal 700; in other embodiments, there can be at least two displays 705, respectively disposed on different surfaces of the terminal 700 or in a folded design; in still other embodiments, the display 705 can be a flexible display disposed on a curved surface or a folded surface of the terminal 700. The display 705 can even be arranged in a non-rectangular irregular pattern, that is, an irregularly shaped screen. The display 705 can be made of materials such as LCD (Liquid Crystal Display) or OLED (Organic Light-Emitting Diode).
The camera assembly 706 is used to capture images or video. Optionally, the camera assembly 706 includes a front camera and a rear camera. Typically, the front camera is disposed on the front panel of the terminal and the rear camera is disposed on the rear surface of the terminal. In some embodiments, there are at least two rear cameras, each being any one of a main camera, a depth-of-field camera, a wide-angle camera, and a telephoto camera, so that the main camera and the depth-of-field camera can be fused to realize a background blurring function, and the main camera and the wide-angle camera can be fused to realize panoramic shooting and virtual reality (VR) shooting functions or other fused shooting functions. In some embodiments, the camera assembly 706 can also include a flash. The flash can be a single-color-temperature flash or a dual-color-temperature flash. A dual-color-temperature flash refers to a combination of a warm-light flash and a cold-light flash, and can be used for light compensation under different color temperatures.
The audio circuit 707 can include a microphone and a speaker. The microphone is used for collecting sound waves of users and the environment, converting the sound waves into electrical signals, and inputting the electrical signals to the processor 701 for processing, or inputting them to the radio frequency circuit 704 for voice communication. For stereo acquisition or noise reduction purposes, a plurality of microphones can be respectively disposed at different portions of the terminal 700. The microphone can also be an array microphone or an omnidirectional pickup microphone. The speaker is used to convert electrical signals from the processor 701 or the radio frequency circuit 704 into sound waves. The speaker can be a conventional thin-film speaker or a piezoelectric ceramic speaker. When the speaker is a piezoelectric ceramic speaker, it can convert electrical signals not only into sound waves audible to humans but also into sound waves inaudible to humans for ranging and other purposes. In some embodiments, the audio circuit 707 can also include a headphone jack.
A power supply 709 is used to power the various components in the terminal 700. The power supply 709 can be alternating current, direct current, disposable battery, or rechargeable battery. When the power supply 709 includes a rechargeable battery, the rechargeable battery can be a wired rechargeable battery or a wireless rechargeable battery. The wired rechargeable battery is a battery charged through a wired line, and the wireless rechargeable battery is a battery charged through a wireless coil. The rechargeable battery can also be used to support fast charge technology.
In some embodiments, the terminal 700 further includes one or more sensors 710. The one or more sensors 710 include, but are not limited to: acceleration sensor 711, gyro sensor 712, pressure sensor 713, optical sensor 715, and proximity sensor 716.
The acceleration sensor 711 is capable of detecting the magnitudes of accelerations on three coordinate axes of the coordinate system established with the terminal 700. For example, the acceleration sensor 711 can be used to detect the components of the gravitational acceleration in three coordinate axes. The processor 701 can control the display screen 705 to display a user interface in a landscape view or a portrait view according to the gravitational acceleration signal acquired by the acceleration sensor 711. The acceleration sensor 711 can also be used for the acquisition of motion data of a game or a user.
The gyro sensor 712 can detect the body direction and the rotation angle of the terminal 700, and the gyro sensor 712 can collect the 3D motion of the user to the terminal 700 in cooperation with the acceleration sensor 711. The processor 701 can realize the following functions according to the data collected by the gyro sensor 712: motion sensing (e.g., changing UI according to a tilting operation by a user), image stabilization at shooting, game control, and inertial navigation.
The pressure sensor 713 can be disposed at a side frame of the terminal 700 and/or at a lower layer of the display screen 705. When the pressure sensor 713 is disposed at a side frame of the terminal 700, a grip signal of the user to the terminal 700 can be detected, and the processor 701 performs left-right hand recognition or quick operation according to the grip signal collected by the pressure sensor 713. When the pressure sensor 713 is disposed at the lower layer of the display screen 705, the processor 701 controls the operability control on the UI interface according to the pressure operation of the user on the display screen 705. The operability controls include at least one of a button control, a scroll bar control, an icon control, and a menu control.
The optical sensor 715 is used to collect the ambient light intensity. In one embodiment, the processor 701 is capable of controlling the display brightness of the display screen 705 based on the intensity of ambient light collected by the optical sensor 715. Specifically, when the intensity of the ambient light is high, the display brightness of the display screen 705 is turned up; when the ambient light intensity is low, the display brightness of the display screen 705 is turned down. In another embodiment, the processor 701 is further capable of dynamically adjusting the photographing parameters of the camera assembly 706 based on the intensity of ambient light collected by the optical sensor 715.
A proximity sensor 716, also referred to as a distance sensor, is typically provided on the front panel of the terminal 700. The proximity sensor 716 is used to collect the distance between the user and the front of the terminal 700. In one embodiment, when the proximity sensor 716 detects that the distance between the user and the front face of the terminal 700 gradually decreases, the processor 701 controls the display 705 to switch from the bright screen state to the off screen state; when the proximity sensor 716 detects that the distance between the user and the front surface of the terminal 700 gradually increases, the processor 701 controls the display screen 705 to switch from the off-screen state to the on-screen state.
Those skilled in the art will appreciate that the structure shown in fig. 7 is not limiting of the terminal 700 and can include more or fewer components than shown, or certain components may be combined, or a different arrangement of components may be employed.
Fig. 8 is a schematic structural diagram of a server provided according to an embodiment of the present application. The server 800 may vary considerably depending on configuration or performance, and includes one or more processors (Central Processing Units, CPU) 801 and one or more memories 802, where at least one program code is stored in the memories 802, and the at least one program code is loaded and executed by the processor 801 to implement the action determination model training method of the virtual object provided in the above method embodiments. Of course, the server can also have components such as a wired or wireless network interface, a keyboard, and an input/output interface for implementing other functions of the device, which are not described herein.
The embodiment of the application also provides a computer readable storage medium, which is applied to a computer device, wherein at least one section of program code is stored in the computer readable storage medium, and the at least one section of program code is loaded and executed by a processor to realize the operation executed by the computer device in the method for training the motion determination model of the virtual object in the embodiment.
Embodiments of the present application also provide a computer program product or computer program comprising computer program code stored in a computer readable storage medium. The computer program code is read from a computer readable storage medium by a processor of a computer device, and executed by the processor, to cause the computer device to perform the method of motion determination model training of virtual objects provided in the various alternative implementations described above.
It will be understood by those skilled in the art that all or part of the steps for implementing the above embodiments may be implemented by hardware, or may be implemented by a program for instructing relevant hardware, where the program may be stored in a computer readable storage medium, and the storage medium may be a read-only memory, a magnetic disk or an optical disk, etc.
The foregoing description covers only preferred embodiments of the present application and is not intended to limit the present application; any modifications, equivalent substitutions, improvements, and the like made within the spirit and principles of the present application shall be included within the protection scope of the present application.

Claims (17)

1. A method of motion determination model training for a virtual object, the method comprising:
determining a computing environment state of the virtual scene after a target duration based on a first environment state of the virtual scene, wherein the first environment state and the computing environment state respectively represent the environment of the virtual scene;
acquiring a first difference value between the first environment state and the computing environment state;
acquiring a second difference value between an actual environment state and the computing environment state, wherein the actual environment state is used for indicating an environment state of the virtual scene after the virtual object executes a target action in the first environment state;
determining a target difference between the first difference and the second difference as intrinsic rewards information, the target difference not being negative indicating that performing the target action is beneficial to the virtual object, the intrinsic rewards information being used to indicate whether performing the target action is beneficial to the virtual object;
Adjusting parameters of a current action determination model according to the intrinsic reward information;
and responding to the current action determining model conforming to a first target condition, determining the current action determining model as a trained action determining model, wherein the action determining model is used for outputting actions according to the input environment state.
2. The method of claim 1, wherein prior to the obtaining the first difference between the first environmental state and the computing environmental state, the method further comprises:
the first environmental state and the actual environmental state are transformed to the same dimension as the computing environmental state.
3. The method of claim 1, wherein determining the computing environment state of the virtual scene after the target duration based on the first environment state of the virtual scene comprises:
determining a first environment vector of a first environment state of the virtual scene;
and inputting the first environment vector into an environment state determining model, and outputting the calculated environment state of the virtual scene after the target time length by the environment state determining model, wherein the environment state determining model is used for calculating the environment state after the target time length according to the known environment state.
4. A method according to claim 3, wherein the training step of the environmental state determination model comprises:
acquiring a first sample environment state and a second sample environment state corresponding to the first sample environment state after a target duration;
training an environment state determination model in the current iteration process by taking the first sample environment state as input and the second sample environment state as label information;
and responding to the environment state determining model in the current iteration process to meet a second target condition, and taking the environment state determining model in the current iteration process as a trained environment state determining model.
5. The method of claim 4, wherein the second sample environmental state is extracted based on a priori knowledge.
6. The method according to claim 3, wherein the environment state determination model is the environment state determination model corresponding to any one iteration process.
7. The method of claim 1, wherein after adjusting parameters of a current action determination model based on the intrinsic reward information, the method further comprises:
And performing self-playing training on the current action determining model.
8. An action determination model training apparatus for a virtual object, the apparatus comprising:
the state determining module is used for determining the computing environment state of the virtual scene after the target duration based on the first environment state of the virtual scene, wherein the first environment state and the computing environment state respectively represent the environment of the virtual scene;
the information determining module is used for obtaining a first difference value between the first environment state and the computing environment state; acquiring a second difference value between an actual environment state and the computing environment state, wherein the actual environment state is used for indicating an environment state of the virtual scene after the virtual object executes a target action in the first environment state; determining a target difference between the first difference and the second difference as intrinsic rewards information, the target difference not being negative indicating that performing the target action is beneficial to the virtual object, the intrinsic rewards information being used to indicate whether performing the target action is beneficial to the virtual object;
the parameter adjustment module is used for adjusting parameters of the current action determination model according to the intrinsic rewarding information;
The model determining module is used for responding to the current action determining model to accord with a first target condition, determining the current action determining model as a trained action determining model, and the action determining model is used for outputting actions according to the input environment state.
9. The apparatus of claim 8, wherein the apparatus further comprises:
and the state transformation module is used for transforming the first environment state and the actual environment state to the same dimension as the computing environment state.
10. The apparatus of claim 8, wherein the status determination module is configured to:
determining a first environment vector of a first environment state of the virtual scene;
and inputting the first environment vector into an environment state determining model, and outputting the calculated environment state of the virtual scene after the target time length by the environment state determining model, wherein the environment state determining model is used for calculating the environment state after the target time length according to the known environment state.
11. The apparatus of claim 10, wherein the training of the environmental state determination model comprises:
acquiring a first sample environment state and a second sample environment state corresponding to the first sample environment state after a target duration;
Training an environment state determination model in the current iteration process by taking the first sample environment state as input and the second sample environment state as label information;
and responding to the environment state determining model in the current iteration process to meet a second target condition, and taking the environment state determining model in the current iteration process as a trained environment state determining model.
12. The apparatus of claim 11, wherein the second sample environmental state is extracted based on a priori knowledge.
13. The apparatus of claim 10, wherein the environment state determination model is the environment state determination model corresponding to any one iteration process.
14. The apparatus of claim 8, wherein the apparatus further comprises:
and the self-playing module is used for performing self-playing training on the current action determining model.
15. A computer device comprising a processor and a memory for storing at least one piece of program code that is loaded by the processor and that performs the method of motion determination model training of a virtual object according to any of claims 1 to 7.
16. A storage medium storing at least one piece of program code for performing the method of motion determination model training of a virtual object according to any of claims 1 to 7.
17. A computer program product, characterized in that the computer program product comprises computer program code stored in a computer readable storage medium, wherein a processor of a computer device reads the computer program code from the computer readable storage medium and executes the computer program code, so that the computer device performs the method of action determination model training of a virtual object according to any one of claims 1 to 7.
CN202011217465.4A 2020-11-04 2020-11-04 Method, device, equipment and medium for training action determination model of virtual object Active CN112221140B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202011217465.4A CN112221140B (en) 2020-11-04 2020-11-04 Method, device, equipment and medium for training action determination model of virtual object


Publications (2)

Publication Number Publication Date
CN112221140A CN112221140A (en) 2021-01-15
CN112221140B true CN112221140B (en) 2024-03-22

Family

ID=74121978

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202011217465.4A Active CN112221140B (en) 2020-11-04 2020-11-04 Method, device, equipment and medium for training action determination model of virtual object

Country Status (1)

Country Link
CN (1) CN112221140B (en)

Families Citing this family (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN113559500B (en) * 2021-01-18 2023-07-21 腾讯科技(深圳)有限公司 Method and device for generating action data, electronic equipment and storage medium
CN113521746A (en) * 2021-06-23 2021-10-22 广州三七极耀网络科技有限公司 AI model training method, device, system and equipment for FPS game
CN113663335A (en) * 2021-07-15 2021-11-19 广州三七极耀网络科技有限公司 AI model training method, device, equipment and storage medium for FPS game
CN117908683B (en) * 2024-03-19 2024-05-28 深圳市起立科技有限公司 Intelligent mobile AI digital human interaction method and system based on transparent display equipment


Family Cites Families (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US11369879B2 (en) * 2018-10-18 2022-06-28 Unity IPR ApS Method and system for interactive imitation learning in video games

Patent Citations (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN109091869A (en) * 2018-08-10 2018-12-28 腾讯科技(深圳)有限公司 Method of controlling operation, device, computer equipment and the storage medium of virtual objects
CN111026272A (en) * 2019-12-09 2020-04-17 网易(杭州)网络有限公司 Training method and device for virtual object behavior strategy, electronic equipment and storage medium
CN111260027A (en) * 2020-01-10 2020-06-09 电子科技大学 Intelligent agent automatic decision-making method based on reinforcement learning
CN111260040A (en) * 2020-05-06 2020-06-09 哈尔滨工业大学(深圳)(哈尔滨工业大学深圳科技创新研究院) Video game decision method based on intrinsic rewards

Non-Patent Citations (2)

* Cited by examiner, † Cited by third party
Title
A Survey of Planning and Learning in Games; Fernando Fradique Duarte et al.; Applied Sciences; full text *
Automatic Game Playing Method Based on Deep Reinforcement Learning (基于深度增强学习的自动游戏方法); Yuan Yue, Feng Tao, Ruan Qingqing, Zhao Yinming, Zou Jian; Journal of Yangtze University (Natural Science Edition), Issue 21; full text

Also Published As

Publication number Publication date
CN112221140A (en) 2021-01-15

Similar Documents

Publication Publication Date Title
CN112221140B (en) Method, device, equipment and medium for training action determination model of virtual object
US20210295099A1 (en) Model training method and apparatus, storage medium, and device
CN111589128B (en) Operation control display method and device based on virtual scene
CN111013142B (en) Interactive effect display method and device, computer equipment and storage medium
CN111589140B (en) Virtual object control method, device, terminal and storage medium
CN112221152A (en) Artificial intelligence AI model training method, device, equipment and medium
CN110801628B (en) Method, device, equipment and medium for controlling virtual object to restore life value
CN111672104B (en) Virtual scene display method, device, terminal and storage medium
CN112843679B (en) Skill release method, device, equipment and medium for virtual object
CN109091867B (en) Operation control method, device, equipment and storage medium
CN111325822B (en) Method, device and equipment for displaying hot spot diagram and readable storage medium
CN111589144B (en) Virtual character control method, device, equipment and medium
CN112569607A (en) Display method, device, equipment and medium for pre-purchased prop
CN113457173B (en) Remote teaching method, remote teaching device, computer equipment and storage medium
CN112755517B (en) Virtual object control method, device, terminal and storage medium
CN112494958B (en) Method, system, equipment and medium for converting words by voice
CN114130020A (en) Virtual scene display method, device, terminal and storage medium
CN111921200B (en) Virtual object control method and device, electronic equipment and storage medium
CN112604274B (en) Virtual object display method, device, terminal and storage medium
CN114042315B (en) Virtual scene-based graphic display method, device, equipment and medium
JP7504228B2 (en) Virtual scene display method, virtual scene display device, terminal, and computer program
CN112156463B (en) Role display method, device, equipment and medium
CN112717391B (en) Method, device, equipment and medium for displaying character names of virtual characters
CN111346370B (en) Method, device, equipment and medium for operating battle kernel
CN111589147B (en) User interface display method, device, equipment and storage medium

Legal Events

Date Code Title Description
PB01 Publication
REG Reference to a national code; Ref country code: HK; Ref legal event code: DE; Ref document number: 40037774; Country of ref document: HK

SE01 Entry into force of request for substantive examination
GR01 Patent grant