CN116821693B - Model training method and device for virtual scene, electronic equipment and storage medium - Google Patents

Model training method and device for virtual scene, electronic equipment and storage medium Download PDF

Info

Publication number
CN116821693B
CN116821693B (application CN202311094973.1A)
Authority
CN
China
Prior art keywords
model
virtual object
data
training
index
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN202311094973.1A
Other languages
Chinese (zh)
Other versions
CN116821693A (en)
Inventor
高一鸣
刘飞宇
王亮
付强
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Tencent Technology Shenzhen Co Ltd
Original Assignee
Tencent Technology Shenzhen Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Tencent Technology Shenzhen Co Ltd filed Critical Tencent Technology Shenzhen Co Ltd
Priority to CN202311094973.1A priority Critical patent/CN116821693B/en
Publication of CN116821693A publication Critical patent/CN116821693A/en
Application granted granted Critical
Publication of CN116821693B publication Critical patent/CN116821693B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical

Links

Classifications

    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06F - ELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00 - Pattern recognition
    • G06F18/20 - Analysing
    • G06F18/21 - Design or setup of recognition systems or techniques; Extraction of features in feature space; Blind source separation
    • G06F18/214 - Generating training patterns; Bootstrap methods, e.g. bagging or boosting
    • A - HUMAN NECESSITIES
    • A63 - SPORTS; GAMES; AMUSEMENTS
    • A63F - CARD, BOARD, OR ROULETTE GAMES; INDOOR GAMES USING SMALL MOVING PLAYING BODIES; VIDEO GAMES; GAMES NOT OTHERWISE PROVIDED FOR
    • A63F13/00 - Video games, i.e. games using an electronically generated display having two or more dimensions
    • A63F13/50 - Controlling the output signals based on the game progress
    • A63F13/52 - Controlling the output signals based on the game progress involving aspects of the displayed game scene
    • A - HUMAN NECESSITIES
    • A63 - SPORTS; GAMES; AMUSEMENTS
    • A63F - CARD, BOARD, OR ROULETTE GAMES; INDOOR GAMES USING SMALL MOVING PLAYING BODIES; VIDEO GAMES; GAMES NOT OTHERWISE PROVIDED FOR
    • A63F13/00 - Video games, i.e. games using an electronically generated display having two or more dimensions
    • A63F13/55 - Controlling game characters or game objects based on the game progress
    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06N - COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N20/00 - Machine learning

Landscapes

  • Engineering & Computer Science (AREA)
  • Multimedia (AREA)
  • Theoretical Computer Science (AREA)
  • Data Mining & Analysis (AREA)
  • General Physics & Mathematics (AREA)
  • Software Systems (AREA)
  • Artificial Intelligence (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Evolutionary Computation (AREA)
  • Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • Bioinformatics & Computational Biology (AREA)
  • Evolutionary Biology (AREA)
  • Bioinformatics & Cheminformatics (AREA)
  • Human Computer Interaction (AREA)
  • Medical Informatics (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Computing Systems (AREA)
  • Mathematical Physics (AREA)
  • Processing Or Creating Images (AREA)

Abstract

The application provides a model training method and device for a virtual scene, an electronic device and a storage medium; the method comprises the following steps: acquiring scene data of a virtual scene and a first model, wherein the first model is a machine learning model pre-trained based on real interaction data and the second model is a machine learning model to be trained; calling the first model and the second model in a collaboration mode to run multiple simulated games based on the scene data, obtaining first game data of each simulated game; calling the second model, based on the first game data, to determine an actual reward index and an expected reward index corresponding to the first model; screening a training sample set from the first game data based on the actual reward index and the expected reward index; and training the second model based on the training sample set. The application can improve collaboration between virtual objects controlled by artificial intelligence and virtual objects of real players.

Description

Model training method and device for virtual scene, electronic equipment and storage medium
Technical Field
The present application relates to computer technologies, and in particular, to a method and apparatus for training a model of a virtual scene, an electronic device, and a storage medium.
Background
Artificial intelligence (AI) is the theory, method, technique and application system that uses a digital computer, or a machine controlled by a digital computer, to simulate, extend and expand human intelligence, perceive the environment, acquire knowledge and use that knowledge to obtain optimal results. In other words, artificial intelligence is a comprehensive technology of computer science that attempts to understand the essence of intelligence and to produce new intelligent machines that can react in a way similar to human intelligence. Artificial intelligence studies the design principles and implementation methods of various intelligent machines, so that the machines have the functions of perception, reasoning and decision-making.
In a virtual scene, besides virtual objects controlled by real players, there are a large number of virtual objects controlled by computer programs. In some game modes, virtual objects controlled by real players and virtual objects controlled by artificial intelligence play in the same team; to win a game, the virtual objects in the team need both high-level individual operation and cooperation with teammates. In the related art, virtual objects controlled by artificial intelligence cannot yet cooperate well with real players, which affects the competitive experience of real players.
Disclosure of Invention
The embodiment of the application provides a model training method and device for a virtual scene, electronic equipment, a computer readable storage medium and a computer program product, which can improve the collaboration between a virtual object controlled by artificial intelligence and a virtual object of a real player in the virtual scene.
The technical scheme of the embodiment of the application is realized as follows:
the embodiment of the application provides a model training method of a virtual scene, which comprises the following steps:
acquiring scene data and a first model in a virtual scene, wherein the first model is a machine learning model obtained by pre-training based on real interaction data, and the second model is a machine learning model to be trained;
calling the first model and the second model in a collaboration mode to run multiple simulated games based on the scene data, obtaining first game data of each simulated game, wherein the second model in the collaboration mode is used for controlling a second virtual object to cooperate with a first virtual object controlled by the first model;
calling the second model, based on the first game data, to determine an actual reward index and an expected reward index corresponding to the first model;
screening a training sample set from the first game data based on the actual reward index and the expected reward index;
and training the second model based on the training sample set, wherein the trained second model is used for controlling the second virtual object in the game of the virtual scene.
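As a reading aid, the claimed steps can be arranged into the following minimal Python sketch. All parameter names and the callable arguments (simulate, estimate_rewards, screen, update) are hypothetical placeholders rather than terms used by the patent.

```python
# Hypothetical outline of the claimed training loop; the callables are placeholders.
def train_second_model(scene_data, first_model, second_model,
                       simulate, estimate_rewards, screen, update,
                       num_games=100, num_iterations=10):
    for _ in range(num_iterations):
        # Step 1: run simulated games with the first model as the human-like teammate
        # and the second model acting in collaboration mode.
        first_game_data = [simulate(scene_data, first_model, second_model, mode="collaboration")
                           for _ in range(num_games)]
        # Step 2: per game, determine the actual reward index of the first model and the
        # expected (unassisted) reward index predicted by the second model's value network.
        actual, expected = estimate_rewards(second_model, first_game_data)
        # Step 3: keep games where the assistance was beneficial as positive samples.
        training_samples = screen(first_game_data, actual, expected)
        # Step 4: update the second model on the screened samples.
        second_model = update(second_model, training_samples)
    return second_model
```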
The embodiment of the application provides a model training device of a virtual scene, which comprises the following components:
the data acquisition module is configured to acquire scene data in a virtual scene and a first model, wherein the first model is a machine learning model which is obtained by pre-training based on real interaction data, and the second model is a machine learning model to be trained;
the game simulation module is configured to call the first model and the second model in the collaboration mode to run multiple simulated games based on the scene data, obtaining first game data of each simulated game, wherein the second model in the collaboration mode is used for controlling a second virtual object to cooperate with the first virtual object controlled by the first model;
the training module is configured to call the second model, based on the first game data, to determine an actual reward index and an expected reward index corresponding to the first model;
the training module is configured to screen a training sample set from the first game data based on the actual reward index and the expected reward index;
the training module is configured to train the second model based on the training sample set, wherein the trained second model is used for controlling the second virtual object in the game of the virtual scene.
An embodiment of the present application provides an electronic device, including:
a memory for storing computer executable instructions;
and the processor is used for realizing the model training method of the virtual scene provided by the embodiment of the application when executing the computer executable instructions stored in the memory.
The embodiment of the application provides a computer readable storage medium which stores computer executable instructions for realizing the model training method of the virtual scene provided by the embodiment of the application when being executed by a processor.
The embodiment of the application provides a computer program product, which comprises a computer program or a computer executable instruction, and the computer program or the computer executable instruction realizes the model training method of the virtual scene provided by the embodiment of the application when being executed by a processor.
The embodiment of the application has the following beneficial effects:
a real player controlling a virtual object is simulated by the first model, so that the first model and the second model in the collaboration mode play simulated games whose game data serves as training data, saving the computing resources and time cost required to acquire training data; the actual reward index and the expected reward index of the first model in a game are obtained and used as conditions for screening samples, which improves the efficiency of sample acquisition and makes the samples more accurate, thereby improving the cooperation between virtual objects controlled by the trained second model and virtual objects controlled by real human players.
Drawings
Fig. 1 is an application mode schematic diagram of a model training method of a virtual scene provided by an embodiment of the present application;
fig. 2A is a schematic structural diagram of an electronic device according to an embodiment of the present application;
FIG. 2B is a schematic diagram of a second model according to an embodiment of the present application;
FIG. 3A is a first flow chart of a model training method for a virtual scene according to an embodiment of the present application;
FIG. 3B is a second flow chart of a model training method for a virtual scene according to an embodiment of the present application;
FIG. 3C is a third flow chart of a model training method for a virtual scene according to an embodiment of the present application;
fig. 4A is a fourth flowchart of a model training method of a virtual scene according to an embodiment of the present application;
fig. 4B is a fifth flowchart of a model training method of a virtual scene according to an embodiment of the present application;
FIG. 5A is a schematic diagram of a first principle of model training provided by an embodiment of the present application;
FIG. 5B is a second schematic diagram of model training provided by an embodiment of the present application;
FIG. 5C is a third principle schematic of model training provided by an embodiment of the present application;
FIG. 6 is a schematic diagram of a reward index provided by an embodiment of the present application;
FIG. 7 is a schematic diagram of a model structure of a competitive agent according to an embodiment of the present application;
FIG. 8A is a first experimental result provided by an embodiment of the present application;
FIG. 8B is a second experimental result provided by an embodiment of the present application;
FIG. 8C is a third experimental result provided by an embodiment of the present application;
fig. 8D is a fourth experimental result provided by the embodiment of the present application.
Detailed Description
For the purpose of making the objects, technical solutions and advantages of the present application more apparent, the present application will be further described in detail below with reference to the accompanying drawings. The described embodiments should not be construed as limiting the present application, and all other embodiments obtained by those skilled in the art without making any inventive effort fall within the scope of the present application.
In the following description, reference is made to "some embodiments" which describe a subset of all possible embodiments, but it is to be understood that "some embodiments" can be the same subset or different subsets of all possible embodiments and can be combined with one another without conflict.
In the following description, the terms "first", "second", "third" and the like are merely used to distinguish similar objects and do not represent a particular ordering of the objects, it being understood that the "first", "second", "third" may be interchanged with a particular order or sequence, as permitted, to enable embodiments of the application described herein to be practiced otherwise than as illustrated or described herein.
It should be noted that when the embodiments of the present application are applied, the collection of relevant data (for example, data of real players in a game) should strictly comply with the requirements of relevant national laws and regulations, the informed consent or separate consent of the personal information subject should be obtained, and the subsequent use and processing of the data should remain within the scope permitted by laws and regulations and authorized by the personal information subject.
Unless defined otherwise, all technical and scientific terms used herein have the same meaning as commonly understood by one of ordinary skill in the art to which this application belongs. The terminology used herein is for the purpose of describing embodiments of the application only and is not intended to be limiting of the application.
Before describing embodiments of the present application in further detail, the terms and terminology involved in the embodiments of the present application will be described, and the terms and terminology involved in the embodiments of the present application will be used in the following explanation.
1) Virtual scene: a scene, output by a device, that differs from the real world. Visual perception of the virtual scene can be formed with the naked eye or with the assistance of devices, for example through a two-dimensional image output by a display screen, or a three-dimensional image output by three-dimensional display technologies such as stereoscopic projection, virtual reality and augmented reality; in addition, various perceptions simulating the real world, such as auditory, tactile, olfactory and motion perception, can also be formed through various possible hardware.
2) In response to: indicates the condition or state on which a performed operation depends. When the condition or state is satisfied, the one or more operations performed may be executed in real time or with a set delay; unless otherwise specified, there is no limitation on the order in which multiple operations are performed.
3) Virtual object: an object that interacts in a virtual scene and is controlled by a user or by a robot program (e.g., an artificial-intelligence-based robot program); it can stay stationary, move and perform various actions in the virtual scene, such as the various characters in a game.
4) Agent: a software or hardware entity capable of autonomous activity; it is an important concept in the field of artificial intelligence, and any independent entity that can think and interact with its environment can be abstracted as an agent.
5) Machine Learning (ML): an interdisciplinary subject involving probability theory, statistics, approximation theory, convex analysis, algorithm complexity theory and other disciplines. It studies how a computer simulates or implements human learning behavior to acquire new knowledge or skills, and reorganizes existing knowledge structures to continuously improve its own performance. Machine learning is the core of artificial intelligence and the fundamental way to make computers intelligent, and it is applied throughout all areas of artificial intelligence. Machine learning and deep learning typically include techniques such as artificial neural networks, belief networks, reinforcement learning (RL), transfer learning, inductive learning and learning from demonstration.
6) Population-based training: a method for adaptively adjusting hyper-parameters. Based on population-style evolution, poor models are eliminated, good models are exploited, and random perturbations are added for further exploration, finally yielding an optimal model.
7) Self-Play: an unsupervised learning method; a reinforcement learning approach in which the machine learning model learns and explores by playing games against itself.
The embodiment of the application provides a model training method of a virtual scene, a model training device of the virtual scene, electronic equipment, a computer readable storage medium and a computer program product, which can promote the collaboration between a virtual object controlled by artificial intelligence and a virtual object of a real player.
The following describes exemplary applications of the electronic device provided by the embodiments of the present application. The electronic device provided by the embodiments of the present application may be implemented as various types of user terminals, such as a notebook computer, a tablet computer, a desktop computer, a set-top box, a smart television, a mobile device (for example, a mobile phone, a portable music player, a personal digital assistant, a dedicated messaging device, a portable game device), a vehicle-mounted terminal, a Virtual Reality (VR) device, an Augmented Reality (AR) device, and so on, and may also be implemented as a server. In the following, an exemplary application when the electronic device is implemented as a terminal device or a server will be described.
Referring to fig. 1, fig. 1 is an application mode schematic diagram of a model training method of a virtual scene according to an embodiment of the present application; for example, fig. 1 relates to a server 200, a network 300, a terminal device 400, and a database 500. The terminal device 400 is connected to the server 200 via the network 300, and the network 300 may be a wide area network or a local area network, or a combination of both.
In some embodiments, database 500 may be a game database and server 200 may be a server of a game platform. Server 200 is used to train the artificial intelligence (the second model) applied in a game, and the second model is used to control a second virtual object that acts as a teammate of real players in the virtual scene. The first model is a model that simulates how a real player controls a virtual object and is used to control the first virtual object. The real player controls a third virtual object in the virtual scene. The server 200 is used for transmitting pictures of the virtual scene to the terminal device 400. The scene data of the virtual scene may be game data.
For example, the server 200 retrieves the scene data of the virtual scene from the database 500, obtains a first model that simulates a real player, and invokes the first model and the second model to run multiple simulated games, obtaining game data that is screened to obtain training samples. The second model is trained based on the training samples so that the virtual object controlled by the second model cooperates with other virtual objects. When the user controls the third virtual object through the terminal device 400 to team up with the second virtual object and play a game, the server 200 invokes the second model to determine the behavior of the second virtual object so that the second virtual object cooperates with the third virtual object. Because the virtual object controlled by artificial intelligence cooperates with the virtual object controlled by the player, the player's game experience is improved.
The embodiments of the application can be implemented with database technology. A database can be regarded, in short, as an electronic filing cabinet, i.e., a place where electronic files are stored; a user can add, query, update and delete the data in these files. A database is a collection of data that is stored together in a way that can be shared with multiple users, has as little redundancy as possible and is independent of applications.
A database management system (DBMS) is a computer software system designed for managing databases and generally provides basic functions such as storage, retrieval, security and backup. Database management systems can be classified by the database model they support, e.g., relational or XML (Extensible Markup Language); by the type of computer they support, e.g., server cluster or mobile phone; by the query language they use, e.g., Structured Query Language (SQL) or XQuery; by their performance focus, e.g., maximum scale or maximum operating speed; or by other classification schemes. Regardless of the classification used, some DBMSs can span categories, for example supporting multiple query languages simultaneously.
The embodiments of the application can also be implemented with cloud technology. Cloud technology is a general term for the network technology, information technology, integration technology, management platform technology, application technology and the like applied under the cloud computing business model; it can form a resource pool that is used on demand, flexibly and conveniently, and cloud computing technology will become an important support. The background services of technical network systems, such as video websites, picture websites and other portals, require large amounts of computing and storage resources. With the rapid development and application of the internet industry and the growing demands of search services, social networks, mobile commerce, open collaboration and the like, each item of data may carry its own hash code identification mark that needs to be transmitted to the background system for logical processing; data of different levels are processed separately, and all kinds of industry data require strong backend system support, which can only be provided through cloud computing.
In some embodiments, server 200 may be implemented as a plurality of servers, including: the virtual scene management system comprises a server for running a virtual scene, a server for training a model and a server for controlling virtual objects in the virtual scene.
In some embodiments, the server may be an independent physical server, a server cluster or a distributed system formed by a plurality of physical servers, or a cloud server providing cloud services, cloud databases, cloud computing, cloud functions, cloud storage, network services, cloud communication, middleware services, domain name services, security services, CDNs, basic cloud computing services such as big data and artificial intelligence platforms, and the like. The electronic device may be, but is not limited to, a smart phone, a tablet computer, a notebook computer, a desktop computer, a smart speaker, a smart watch, etc. The terminal device and the server may be directly or indirectly connected through wired or wireless communication, which is not limited in the embodiment of the present application.
In one implementation scenario, the model training method for a virtual scene provided by the embodiment of the application is suitable for application modes in which the computation of data related to the virtual scene can be completed entirely by the graphics processing hardware of the terminal device 400, such as a game in stand-alone/offline mode, where the output of the virtual scene is completed by various types of terminal devices 400 such as smart phones, tablet computers and virtual reality/augmented reality devices.
By way of example, the types of graphics processing hardware include central processing units (CPU, central Processing Unit) and graphics processors (GPU, graphics Processing Unit).
When forming the visual perception of the virtual scene, the terminal device 400 calculates the data required for display through the graphic calculation hardware, and completes loading, analysis and rendering of the display data, and outputs a video frame capable of forming the visual perception for the virtual scene at the graphic output hardware, for example, a two-dimensional video frame is presented on the display screen of the smart phone, or a video frame for realizing the three-dimensional display effect is projected on the lens of the augmented reality/virtual reality glasses; in addition, to enrich the perceived effect, the terminal device 400 may also form one or more of auditory perception, tactile perception, motion perception and gustatory perception by means of different hardware.
As an example, the terminal device 400 has a client (e.g., a stand-alone game application) running on it, and a virtual scene including role playing is output during the running of the client. The virtual scene may be an environment for game characters to interact in, such as plains, streets or valleys for the game characters to fight in. The first virtual object may be a game character controlled by a user, that is, the first virtual object is controlled by a real user and moves in the virtual scene in response to the real user's operation of a controller (e.g., touch screen, voice-operated switch, keyboard, mouse, joystick, etc.); for example, when the real user moves the joystick to the right, the first virtual object moves to the right in the virtual scene; the first virtual object may also stay still in place, jump, or be controlled to perform shooting operations, etc.
For example, the virtual scene may be a game virtual scene, the user may be a player, the third virtual object is a virtual object controlled by the player, and a screen of playing a game between the third virtual object and a second virtual object controlled by the second model is displayed in a man-machine interaction interface in the terminal device 400, where the second virtual object and the third virtual object cooperate.
In one implementation scenario, the model training method for a virtual scene provided by the embodiment of the application is suitable for a game mode implemented cooperatively by the terminal device and the server. Two game modes are mainly involved, namely a local game mode and a cloud game mode. In the local game mode, the terminal device and the server cooperatively run the game processing logic: an operation instruction input by the player on the terminal device is partly processed by the game logic running on the terminal device and partly by the game logic running on the server, where the game logic run by the server is more complex and consumes more computing power. In the cloud game mode, the server runs the game logic processing, and a cloud server renders the game scene data into audio and video streams and transmits them to the terminal device for display; the terminal device only needs basic streaming media playback capability and the capability of acquiring the player's operation instructions and sending them to the server.
Taking the example of forming the visual perception of the virtual scene, the server 200 performs calculation of the virtual scene related display data (such as scene data) and sends the calculated display data to the terminal device 400 through the network 300, the terminal device 400 finishes loading, analyzing and rendering the calculated display data depending on the graphic calculation hardware, and outputs the virtual scene depending on the graphic output hardware to form the visual perception, for example, a two-dimensional video frame can be presented on a display screen of a smart phone, or a video frame for realizing a three-dimensional display effect can be projected on a lens of an augmented reality/virtual reality glasses; as regards the perception of the form of the virtual scene, it is understood that the auditory perception may be formed by means of the corresponding hardware output of the terminal device 400, for example using a microphone, the tactile perception may be formed using a vibrator, etc.
As an example, the terminal device 400 has a client (e.g., a network version of a game application) running on it, and a virtual scene including role playing is output during the running of the client. The virtual scene may be an environment for game characters to interact in, such as plains, streets or valleys for the game characters to fight in. The first virtual object may be a game character controlled by a user, that is, the first virtual object is controlled by a real user and moves in the virtual scene in response to the real user's operation of a controller (e.g., touch screen, voice-operated switch, keyboard, mouse, joystick, etc.); for example, when the real user moves the joystick to the right, the first virtual object moves to the right in the virtual scene; the first virtual object may also stay still in place, jump, or be controlled to perform shooting operations, etc.
For example, the virtual scene may be a game virtual scene, the server 200 may be a server of a game platform, the user may be a player, the third virtual object is a virtual object controlled by a real player, a screen of a game between the third virtual object and a second virtual object controlled by a second model is displayed in a man-machine interaction interface in the terminal device 400, and cooperation is performed between the second virtual object and the third virtual object.
Referring to fig. 2A, fig. 2A is a schematic structural diagram of an electronic device according to an embodiment of the present application. The electronic device may be the server 200 in fig. 1, and the server 200 shown in fig. 2A includes: at least one processor 410, a memory 450, and at least one network interface 420. The various components in the server 200 are coupled together by a bus system 440. It is understood that the bus system 440 is used to enable connected communication between these components. In addition to a data bus, the bus system 440 includes a power bus, a control bus and a status signal bus. However, for clarity of illustration, the various buses are all labeled as bus system 440 in fig. 2A.
The processor 410 may be an integrated circuit chip having signal processing capabilities such as a general purpose processor, such as a microprocessor or any conventional processor, or the like, a digital signal processor (DSP, digital Signal Processor), or other programmable logic device, discrete gate or transistor logic, discrete hardware components, or the like.
Memory 450 may be removable, non-removable, or a combination thereof. Exemplary hardware devices include solid state memory, hard drives, optical drives, and the like. Memory 450 optionally includes one or more storage devices physically remote from processor 410.
Memory 450 includes volatile memory or nonvolatile memory, and may also include both volatile and nonvolatile memory. The nonvolatile Memory may be a Read Only Memory (ROM), and the volatile Memory may be a random access Memory (RAM, random Access Memory). The memory 450 described in embodiments of the present application is intended to comprise any suitable type of memory.
In some embodiments, memory 450 is capable of storing data to support various operations, examples of which include programs, modules and data structures, or subsets or supersets thereof, as exemplified below.
An operating system 451 including system programs, e.g., framework layer, core library layer, driver layer, etc., for handling various basic system services and performing hardware-related tasks, for implementing various basic services and handling hardware-based tasks;
a network communication module 452 for accessing other electronic devices via one or more (wired or wireless) network interfaces 420, the exemplary network interface 420 comprising: bluetooth, wireless compatibility authentication (WiFi), and universal serial bus (USB, universal Serial Bus), etc.;
In some embodiments, the apparatus provided in the embodiments of the present application may be implemented in software, and fig. 2A shows a model training apparatus 455 of a virtual scene stored in a memory 450, which may be software in the form of a program and a plug-in, and includes the following software modules: the data acquisition module 4551, the game simulation module 4552, the training module 4553, and the game execution module 4554 are logical, and thus may be arbitrarily combined or further split according to the functions implemented. In fig. 2A, all the above modules are shown once for convenience of expression, but the model training apparatus 455 in the virtual scene should not be considered as excluding the implementation that may include only the data acquisition module 4551, the game simulation module 4552, and the training module 4553, the functions of each module will be described below.
In some embodiments, the terminal or the server may implement the model training method of the virtual scene provided by the embodiment of the present application by running a computer program. For example, the computer program may be a native program or a software module in an operating system; a local (Native) Application program (APP), i.e. a program that needs to be installed in an operating system to run, such as a live APP or an instant messaging APP; the method can also be an applet, namely a program which can be run only by being downloaded into a browser environment; but also an applet that can be embedded in any APP. In general, the computer programs described above may be any form of application, module or plug-in.
The model training method of the virtual scene provided by the embodiment of the application will be described in combination with the exemplary application and implementation of the terminal provided by the embodiment of the application.
In the following, the method for training a model of a virtual scene provided by the embodiment of the present application is described, and as described above, an electronic device implementing the method for training a model of a virtual scene of the embodiment of the present application may be a terminal or a server, or a combination of both. The execution subject of the respective steps will not be repeated hereinafter.
It should be noted that the following description takes the application of the competition model in a game virtual scene as an example; according to their understanding of the following description, those skilled in the art can apply the competition model obtained by the model training method for a virtual scene provided by the embodiment of the present application to other scenes.
Referring to fig. 3A, fig. 3A is a first schematic view of a model training method of a virtual scene according to an embodiment of the present application, and will be described with reference to the steps shown in fig. 3A.
In step 301, scene data and a first model in a virtual scene are acquired.
Illustratively, the first model is a machine learning model pre-trained based on real interaction data. The types of interaction data include real interaction data and simulated interaction data. Real interaction data is interaction data generated by a real player controlling a virtual object in a virtual scene. Simulated interaction data is interaction data generated by a virtual object controlled by a computer program. The real interaction data may be obtained as follows: game data is obtained, and the data corresponding to the virtual objects controlled by real players is extracted from the game data as real interaction data.
The first model is used to simulate a real user controlling a first virtual object in the virtual scene. The first model may be obtained as follows: a plurality of models capable of simulating how humans control virtual objects are trained from a large amount of real interaction data. There may be multiple first models controlling virtual objects in the virtual scene, and when the first models are used to simulate the control styles of real players, the control style corresponding to each first model may differ, for example an aggressive style that is good at attacking, a defensive style that is good at defending, and so on.
For example, the second model is a machine learning model to be trained, the second model is used to autonomously control a second virtual object in the virtual scene, and the first virtual object and the second virtual object belong to the same team. During the training of the second model, the first model population is introduced as teammates to play the game.
In the embodiment of the application, a first model that simulates how a human controls a virtual object is introduced, and training data is acquired from simulated games without needing to collect game data from real players, which saves the time cost and computing resource consumption of acquiring training data.
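For illustration only, a population of such first models with different control styles could be organized and sampled once per simulated game, as in the sketch below; the style names and checkpoint paths are hypothetical placeholders, not values from the patent.

```python
import random

# Hypothetical population of pre-trained first models, each imitating a different
# real-player control style; one style is drawn for each simulated game.
FIRST_MODEL_CHECKPOINTS = {
    "aggressive": "first_model_aggressive.pt",
    "defensive": "first_model_defensive.pt",
    "supportive": "first_model_supportive.pt",
}

def sample_teammate_style():
    """Pick one human-like control style at random for the next simulated game."""
    style = random.choice(list(FIRST_MODEL_CHECKPOINTS))
    return style, FIRST_MODEL_CHECKPOINTS[style]
```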
In some embodiments, the second model includes a value prediction network; referring to fig. 3B, fig. 3B is a second flowchart of a model training method for a virtual scene according to an embodiment of the present application, and before step 302, steps 3011 to 3013 of fig. 3B are performed, which is described in detail below.
In step 3011, based on the scene data, the first model and the second model in the competitive mode are called to run multiple simulated games, obtaining second game data of each simulated game.
For example, in the competitive mode, the second virtual object controlled by the second model does not cooperate with the first virtual object controlled by the first model. That is, in the competitive mode, when the virtual object controlled by the second model executes a behavior according to the game state information in the virtual scene, the behavior is executed with the reward index in the virtual scene as guidance.
In step 3012, the original reward index of the first model in each simulated game is determined based on the second game data.
By way of example, the first model is used to simulate the way a real player controls a virtual object, and the operational capability of a real player is limited; the rewards acquired in the virtual scene can therefore be used to characterize the original capability of the real player without the assistance of other virtual objects in the team.
By way of example, taking the 5V5 mode of a MOBA game as an example, the types of reward indicators include: the number of defeated enemies, the winning probability, the amount of resources, the participation or support rate, and the number of resources obtained in the game. Taking an FPS game as an example, the types of reward indicators include: the winning probability, the individual's ranking in the game, the number of defeated enemies, and the amount of supplies. The original reward index may be a weighted result of multiple reward indicators, or it may be characterized as a data set combining the rewards of each type.
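For instance, a weighted combination of individual reward indicators could be computed as in the short sketch below; the indicator names and weights are purely illustrative and are not values given by the patent.

```python
# Illustrative weighted combination of per-game reward indicators (MOBA-style example).
REWARD_WEIGHTS = {          # hypothetical weights
    "kills": 1.0,
    "win_probability": 2.0,
    "resources": 0.5,
    "assist_rate": 1.5,
}

def raw_reward_index(indicators: dict) -> float:
    """Weighted sum of the reward indicators observed for one game."""
    return sum(REWARD_WEIGHTS[name] * value for name, value in indicators.items())

# Usage:
# raw_reward_index({"kills": 4, "win_probability": 0.6, "resources": 1200, "assist_rate": 0.3})
```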
In step 3013, the value prediction network in the second model is trained based on the scene data and the original reward index of each simulated game.
Illustratively, the trained value prediction network is used to predict the reward index of the first model in a game, where the game may be a simulated game or a real game.
In some embodiments, the second model further comprises a feature extraction network; step 3013 may be implemented as follows: based on the scene data, the feature extraction network in the second model is called to perform feature extraction, obtaining the information features of the first model in each simulated game; based on the information features of each simulated game, the value prediction network in the second model is called to perform prediction, obtaining the predicted reward index of the first model in each simulated game; a first loss function of the value prediction network is determined based on the difference between the original reward index and the predicted reward index; and back propagation is performed on the value prediction network based on the first loss function, obtaining the trained value prediction network.
For example, the first loss function may be an information difference loss, characterizing the difference between the original reward index and the predicted reward index.
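A minimal PyTorch-style sketch of this training step is shown below, assuming the scene features and original reward indices have already been collected; the network sizes and the use of mean squared error as the "difference" loss are assumptions rather than details from the patent.

```python
import torch
import torch.nn as nn

# Assumed toy dimensions; the patent does not specify network sizes.
feature_net = nn.Sequential(nn.Linear(128, 64), nn.ReLU())                  # feature extraction network
value_net = nn.Sequential(nn.Linear(64, 32), nn.ReLU(), nn.Linear(32, 1))   # value prediction network
optimizer = torch.optim.Adam(value_net.parameters(), lr=1e-3)

def train_value_net(scene_features, original_rewards):
    """One update of the value prediction network against the original reward indices."""
    with torch.no_grad():
        info_features = feature_net(scene_features)        # information features per simulated game
    predicted_rewards = value_net(info_features).squeeze(-1)
    loss = nn.functional.mse_loss(predicted_rewards, original_rewards)  # first loss function
    optimizer.zero_grad()
    loss.backward()                                         # back propagation
    optimizer.step()
    return loss.item()
```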
In the embodiment of the application, by training the value prediction network, the rewards that the human player's virtual object could obtain and the rewards obtained by the virtual object controlled by the second model can be evaluated with the value prediction network; training samples are then determined with these rewards as evaluation parameters, which improves the accuracy of sample acquisition, so that a model with both collaborative ability and the ability to obtain rewards can be trained, improving the user's game experience.
With continued reference to fig. 3A, in step 302, based on the scene data, the first model and the second model in the collaboration mode are invoked to run multiple simulated games, obtaining first game data of each simulated game.
For example, in the collaboration mode, the second virtual object controlled by the second model cooperates with the first virtual object controlled by the first model. The simulated games may be run in parallel, and the control styles of the real players simulated by the first model may differ. The virtual scene of each simulated game may be different, and the first virtual objects serving as teammates of the second virtual object may be different, so that different first game data can be obtained, improving the richness of the training samples.
In step 303, based on the first game data, the second model is invoked to determine the actual reward index and the expected reward index corresponding to the first model.
In some embodiments, referring to fig. 2B, fig. 2B is a schematic structural diagram of the second model provided in an embodiment of the present application, where the second model 201 includes a feature extraction network 202, a value prediction network 204, a decision network 203, and a portal (gating) model 205. The feature extraction network 202 is used to extract features from the scene data, and the portal model 205 is used to determine the mode in which the decision network 203 currently operates. The value prediction network 204 is configured to determine the corresponding reward index based on the features output by the feature extraction network 202. The feature extraction network 202 may be a long short-term memory network or another neural network model.
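A structural sketch of the second model with its four components as sub-modules might look as follows; the layer sizes, the LSTM choice and the reduction of the portal model to a two-way mode selector are assumptions made only for illustration.

```python
import torch
import torch.nn as nn

class SecondModel(nn.Module):
    """Sketch of the second model 201: feature extraction 202, decision 203,
    value prediction 204 and portal (mode-gating) module 205. Sizes are illustrative."""
    def __init__(self, obs_dim=128, hidden=64, num_actions=16):
        super().__init__()
        self.feature_net = nn.LSTM(obs_dim, hidden, batch_first=True)  # long short-term memory extractor
        self.decision_net = nn.Linear(hidden, num_actions)             # behavior logits
        self.value_net = nn.Linear(hidden, 1)                          # reward index prediction
        self.portal_net = nn.Linear(hidden, 2)                         # competitive vs. collaboration mode

    def forward(self, obs_seq):
        features, _ = self.feature_net(obs_seq)
        last = features[:, -1]                        # feature of the last time step
        return (self.decision_net(last),              # action logits
                self.value_net(last).squeeze(-1),     # predicted reward index
                self.portal_net(last))                # mode logits
```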
In some embodiments, the second model includes a value prediction network and a feature extraction network; step 303 may be implemented as follows: based on the first game data, the value prediction network in the second model is called to determine the actual reward index of the first model in each simulated game; based on the first game data and the scene data, the feature extraction network in the second model is called to perform feature extraction, obtaining information features related to the first model; and the value prediction network in the second model is invoked based on the information features to determine the expected reward index of the first model in each simulated game.
For example, the actual reward index is calculated based on the actual first game data, and the expected reward index is the index corresponding to the rewards that the virtual object of the first model would acquire without the assistance of the virtual object controlled by the second model. The expected reward index is predicted by invoking the value prediction network in the second model; for the principle of obtaining the reward index and the training principle of the value prediction network, refer to steps 3011 to 3013 above, which are not repeated here.
In step 304, a training sample set is screened from the first game data based on the actual reward index and the expected reward index.
In some embodiments, step 304 may be implemented as follows: first game data whose actual reward index is greater than the expected reward index is marked as a positive sample; first game data whose actual reward index is less than or equal to the expected reward index is marked as a negative sample; and the positive samples and negative samples are combined into a training sample set.
For example, the expected reward index is the index corresponding to the rewards that the virtual object of the first model would obtain without the assistance of the virtual object controlled by the second model. If the actual reward index is lower than the expected reward index, the behavior predicted by the second model did not provide beneficial assistance to the virtual object controlled by the first model in the game, the corresponding game data is marked as a negative sample, and the sample label is 0; if the actual reward index is higher than the expected reward index, the behavior predicted by the second model provided beneficial assistance to the virtual object controlled by the first model in the game, the corresponding game data is marked as a positive sample, and the sample label is 1.
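A minimal sketch of this screening rule, assuming the per-game actual and expected reward indices have already been computed, is:

```python
def screen_training_samples(first_game_data, actual_rewards, expected_rewards):
    """Label each game: 1 if the second model's assistance was beneficial, else 0."""
    samples = []
    for game, actual, expected in zip(first_game_data, actual_rewards, expected_rewards):
        label = 1 if actual > expected else 0   # positive vs. negative sample
        samples.append((game, label))
    return samples
```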
In step 305, a second model is trained based on the set of training samples.
Here, the trained second model is used to autonomously control the corresponding behavior of the second virtual object in the game of the virtual scene. That is, the second model controls the second virtual object to perform the corresponding behavior without intervention of the user.
In some embodiments, referring to fig. 3C, fig. 3C is a third flow chart of a model training method of a virtual scene according to an embodiment of the present application; step 305 may perform a training process on the second model through step 3051 and step 3052 of fig. 3C, respectively, as described in detail below.
In step 3051, in response to the actual reward index of the first game data in the training sample set being greater than the expected reward index, a second loss function is determined based on the difference between the actual reward index and the expected reward index, and the parameters of the second model in the collaboration mode are updated based on the second loss function and the first game data.
Illustratively, the second model includes a network for predicting the gain obtained by the real player. If the actual reward index is smaller than the expected reward index, the parameters of the second model do not need to be updated according to the difference between the actual reward index and the expected reward index. The parameter update process is directed at the decision network in the second model.
In some embodiments, determining the second loss function based on the difference between the actual reward index and the expected reward index may be implemented as follows: for each actual reward index of the first model and its corresponding expected reward index, the following is performed: a first difference between the actual reward index and the expected reward index is acquired; a second difference between the first difference and an expected gain index (Gain) is obtained, and the norm of the second difference is taken as a loss value, where the expected gain index characterizes the gain effect that the assisting behavior has on the first virtual object when the second virtual object assists the first virtual object; and the average of the loss values is taken as the second loss function. The above process can be characterized as the following equation (1) and equation (2).
l = ||(R - V) - G|| (1)
L2 = (1/N) · Σ_{n=1..N} l_n (2)
where l is the loss value of the second loss function, V is the expected reward index and characterizes the original reward corresponding to the human player's behavior without assistance, R is the actual reward index, i.e., the reward actually obtained by the virtual object controlled by the human player, G is the expected gain index, and N is the number of loss values.
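Under the formulation of equations (1) and (2), the second loss can be sketched in PyTorch as below; treating each reward index as a scalar (so the norm reduces to an absolute value) and the tensor shapes are assumptions.

```python
import torch

def gain_loss(actual_rewards, expected_rewards, expected_gain):
    """Second loss sketch: mean of ||(R - V) - G|| over the sampled reward index pairs.
    Per step 3051, this is typically applied only to samples where actual > expected."""
    diff = (actual_rewards - expected_rewards) - expected_gain  # (R - V) - G
    return diff.abs().mean()  # with scalar indices the norm reduces to an absolute value
```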
In step 3052, third game data is extracted from the training sample set, and the second model in the competitive mode is trained based on the third game data.
Illustratively, the third game data is the first game data that is not associated with the first model. The parameter update process is directed at the decision network in the second model.
In some embodiments, the third game data comprises: the state information under which the second virtual object corresponding to the second model executes each actual behavior, and the actual value parameter of each actual behavior. Step 3052 may be implemented as follows: based on the scene data and the state information in the third game data, the second model is called to perform prediction, obtaining a plurality of candidate predicted behaviors of the second virtual object and the corresponding predicted value parameters; the candidate predicted behavior with the largest predicted value parameter is taken as the target predicted behavior; the cross entropy loss of the second model is determined based on the target predicted behavior and the actual behavior corresponding to the target predicted behavior; the mean square error loss of the second model is determined based on the predicted value parameter and the corresponding actual value parameter; and the parameters of the second model are updated based on the mean square error loss and the cross entropy loss, obtaining the trained second model.
For example, step 3052 is used to train the basic capability of the second model, i.e., its self-play capability. The input of the second model is the environment data in the virtual scene; feature extraction is performed on the environment data to obtain environment features, the decision network is called to predict the behaviors corresponding to the environment features, obtaining a probability for each behavior, and several candidate predicted behaviors with the highest probabilities are selected. The value prediction network is called to determine a predicted value parameter for each candidate predicted behavior, i.e., the gain that the candidate predicted behavior brings to the virtual object, including: increases in supplies, increases in attributes (health, speed, attack power, etc.), and increases in the number of defeated enemies. The mean square error loss is obtained from the difference between the value parameters of the actual behavior and the predicted behavior. The cross entropy loss is determined based on the difference between the actual behavior and the predicted behavior; the two losses are weighted and summed to obtain a total loss function, and back propagation is performed on the model based on the total loss function, obtaining the trained model.
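A sketch of this combined update is shown below, assuming the model returns (action_logits, predicted_values, mode_logits) as in the earlier structural sketch and that the two losses are combined with an assumed weight.

```python
import torch
import torch.nn as nn

def self_play_update(model, optimizer, states, actual_actions, actual_values, value_weight=0.5):
    """One competitive-mode (self-play) update: cross entropy on behaviors + MSE on value parameters."""
    action_logits, predicted_values, _ = model(states)
    ce_loss = nn.functional.cross_entropy(action_logits, actual_actions)   # behavior term
    mse_loss = nn.functional.mse_loss(predicted_values, actual_values)     # value regression term
    total_loss = ce_loss + value_weight * mse_loss                         # weighted sum (weight assumed)
    optimizer.zero_grad()
    total_loss.backward()
    optimizer.step()
    return total_loss.item()
```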
In some embodiments, referring to fig. 4A, fig. 4A is a fourth flowchart of a model training method of a virtual scene according to an embodiment of the present application; steps 306A to 308A of fig. 4A are performed after step 305, which is described in detail below.
In step 306A, current capability parameters of the third virtual object are obtained.
Here, the third virtual object belongs to the same team as the second virtual object, and the third virtual object is controlled by the real player. The types of capability parameters of the virtual object include: life value, attack force, speed, defense force and other parameters.
In step 307A, in response to the current capability parameter of the third virtual object being less than a capability threshold, the second model in the competitive mode is invoked based on the scene data, and the behavior of the second virtual object is controlled based on the second model in the competitive mode.
For example, if the capability of the third virtual object is insufficient, the reward obtained by having the second virtual object assist the third virtual object is small, so when the current capability parameter of the third virtual object is smaller than the capability threshold, the second virtual object is controlled by the second model in the competitive mode. For example, if the health of the third virtual object is smaller than a health threshold, the third virtual object is close to being defeated; if the second virtual object assisted it, the second virtual object might be defeated as well, so the behavior of the second virtual object is controlled based on the second model in the competitive mode, which improves the team's chance of winning.
Illustratively, the second model includes a decision network, a value prediction network and a feature extraction network; invoking the second model in the competitive mode based on the scene data and controlling the behavior of the second virtual object based on the second model in the competitive mode may be implemented as follows: the decision network is invoked to perform behavior prediction based on the scene data, obtaining a plurality of candidate predicted behaviors; the value prediction network is called to perform value prediction based on the scene data and each candidate predicted behavior, obtaining a first predicted value parameter of each candidate predicted behavior; the candidate predicted behavior with the highest first predicted value parameter is taken as the target predicted behavior; and the behavior of the second virtual object is controlled based on the configuration parameters corresponding to the target predicted behavior.
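This selection procedure can be sketched as follows, reusing the hypothetical model from the earlier sketch; the top-k size and the value_fn placeholder (the value prediction conditioned on scene data plus a candidate behavior) are assumptions.

```python
import torch

def competitive_mode_action(model, scene_obs, value_fn, top_k=3):
    """Pick, among the top-k behaviors proposed by the decision network, the one with the
    highest first predicted value parameter. value_fn(scene_obs, action) is a placeholder."""
    with torch.no_grad():
        action_logits, _, _ = model(scene_obs)
        candidates = torch.topk(action_logits, k=top_k, dim=-1).indices[0]   # candidate predicted behaviors
        values = torch.stack([value_fn(scene_obs, int(a)) for a in candidates])
        return int(candidates[values.argmax()])                              # target predicted behavior
```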
In step 308A, in response to the current capability parameter of the third virtual object being greater than the capability threshold, a second model in the collaborative mode is invoked based on the scene data and the game data of the third virtual object, and behavior of the second virtual object is controlled based on the second model in the collaborative mode.
For example, if the capability of the third virtual object is sufficient, assisting the third virtual object enables the team's winning probability to rise. For example: if the life value of the third virtual object is larger than the life value threshold, the third virtual object is not about to be defeated; the behavior of the second virtual object is therefore controlled based on the second model in the collaboration mode so that the second virtual object assists the third virtual object and prevents it from being defeated, thereby improving the winning possibility of the team.
In some embodiments, the second model includes a decision network, a value prediction network, and a feature extraction network; invoking the second model in the collaboration mode based on the scene data and the game data of the third virtual object, and controlling the behavior of the second virtual object based on the second model in the collaboration mode, may be achieved as follows: invoking the feature extraction network to perform feature extraction processing based on the game data of the third virtual object to obtain information features of the third virtual object; invoking the decision network to perform behavior prediction processing based on the scene data and the information features to obtain a plurality of candidate prediction behaviors; invoking the value prediction network to perform value prediction processing based on the scene data, each candidate prediction behavior and the information features to obtain a second prediction value parameter of each candidate prediction behavior for the third virtual object; taking the candidate prediction behavior with the highest second prediction value parameter as the target prediction behavior; and controlling the behavior of the second virtual object based on the configuration parameters corresponding to the target prediction behavior.
For example, the second model determines the behavior of the second virtual object based on the data of the third virtual object and the data of the virtual scene, so the predicted behavior is correlated with the third virtual object. The second virtual object can therefore cooperate with the third virtual object, which improves the game experience of the human player and the winning probability of the team.
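A hedged sketch of this collaboration-mode decision is given below; it conditions both the decision network and the value prediction network on features extracted from the third virtual object's game data. All names (`feature_net`, `decision_net`, `value_net`, `teammate_game_data`) are assumptions.

```python
import torch

@torch.no_grad()
def act_cooperative(feature_net, decision_net, value_net, scene_features,
                    teammate_game_data, top_k=5):
    """Hedged sketch: extract the third virtual object's information features,
    propose candidate behaviors, score them for the teammate, and pick the best."""
    teammate_features = feature_net(teammate_game_data)              # information features of the third virtual object
    joint = torch.cat([scene_features, teammate_features], dim=-1)
    probs = torch.softmax(decision_net(joint), dim=-1)
    candidates = torch.topk(probs, k=top_k).indices                  # candidate prediction behaviors
    values = torch.stack([value_net(joint, a) for a in candidates])  # second prediction value parameters
    return candidates[values.argmax()]                               # target prediction behavior
```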
In some embodiments, referring to fig. 4B, fig. 4B is a fifth flowchart of a model training method of a virtual scene according to an embodiment of the present application; after step 305, the following steps 306B to 308B are performed, which will be described in detail below.
In step 306B, scene data in the virtual scene and a current distance between the third virtual object and the second virtual object are acquired.
For example, a third virtual object belongs to the same team as the second virtual object, the third virtual object being controlled by a real player. The current distance may be a straight line distance between the third virtual object and the second virtual object, or a path distance determined based on scene data in the virtual scene.
In step 307B, in response to the current distance between the third virtual object and the second virtual object being greater than the distance threshold, a second model in the athletic mode is invoked based on the scene data and behavior of the second virtual object is controlled based on the second model in the athletic mode.
For example, the manner in which the behavior of the second virtual object is determined in step 307B is the same as that of step 307A and will not be described here again. The distance threshold may be determined according to the actual situation in the virtual scene. If the current distance between the third virtual object and the second virtual object is greater than the distance threshold, the two objects are too far apart and the third virtual object can hardly receive assistance from the second virtual object; the second virtual object therefore determines its behavior preferentially through the second model in the athletic mode and executes the corresponding behavior.
In step 308B, in response to the current distance between the third virtual object and the second virtual object being less than or equal to the distance threshold, a second model in the collaborative mode is invoked based on the scene data and the game data of the third virtual object, and behavior of the second virtual object is controlled based on the second model in the collaborative mode.
For example, the manner in which the behavior of the second virtual object is determined in step 308B is the same as that of step 308A, and will not be described here again. If the current distance between the third virtual object and the second virtual object is smaller than or equal to the distance threshold, the second virtual object is within the range in which the third virtual object can receive assistance; the second virtual object therefore determines its behavior through the second model in the collaboration mode and executes the corresponding behavior.
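The run-time switch described in steps 306A to 308B can be summarized in a small hedged helper; the threshold values and parameter names below are assumptions.

```python
def select_mode(teammate_life_value, life_threshold, distance, distance_threshold):
    """Hedged sketch: fall back to the athletic (competitive) mode when the teammate
    is nearly defeated or too far away to assist; otherwise keep the collaboration mode."""
    if teammate_life_value < life_threshold or distance > distance_threshold:
        return "athletic"
    return "collaborative"
```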
According to the embodiment of the application, the first model simulates a real player to control a virtual object, so that the first model and the second model in the collaboration mode can play simulated games to obtain game data as training data, which saves the computing resources and time cost required to obtain training data. The actual reward index and the expected reward index of the first model in a game are obtained and used as conditions for screening samples, which improves the efficiency of obtaining samples and makes the samples more accurate, improves the cooperativity between the virtual object controlled by the trained second model and the virtual object controlled by the real human player, and further improves the experience of the human player.
In the following, an exemplary application of the model training method of the virtual scene in an actual application scene according to the embodiment of the present application will be described.
Artificial intelligence techniques such as deep learning (Deep Learning, DL) and reinforcement learning (Reinforcement Learning, RL) are widely used in the game field. In recent years, various artificial intelligence systems have been developed that reach or even surpass the human level and defeat top human players in a variety of complex games. Developing AI systems for complex environments is very difficult: multiplayer competitive games have an enormous state and action space and rely on long-term planning and strategy. Taking the 5v5 mode of a multiplayer online tactical competition game (Multiplayer Online Battle Arena, MOBA) as an example, a game is played between two teams, each team having 5 virtual objects and each player manipulating one virtual object. The two teams play on the same symmetric map, and each team needs to acquire resources, upgrade equipment, destroy the opponent's defensive towers, and ultimately destroy the opponent's base. To win the game, a player not only needs a high level of personal operation, but also needs to consider cooperation with teammates, which further increases the difficulty of developing the AI system.
In the following, the multiplayer online tactical competition game is simply referred to as a MOBA game. Taking the MOBA game as an example, when a high-strength artificial intelligence teams up with lower-level human players, the behaviors of the human players are not optimal, which may lead to the failure of a battle or even of the whole game. In this case, the artificial intelligence may completely forgo supporting and assisting the human players, and defeat the opponent only through collaboration among artificial intelligence agents or through individual artificial intelligence ability.
In practical human-machine teaming applications, human players would prefer to remain the core of the team rather than be entirely replaced by artificial intelligence. For example, in a MOBA game, human players want to win and to obtain a better game experience, such as a higher MVP score, more highlight moments, more in-game resources, and more protection, so as to obtain a more pleasant game experience. In the prior art, when the behavior of the human does not contribute to completing the target, the agent chooses to ignore the human; or the agent excessively helps the human player, which impairs the agent's own ability, reduces the winning probability and resource acquisition, and instead reduces the human player's team experience.
The competitive artificial intelligence system for MOBA games cannot effectively and reasonably help human players when teamed with them. The design of existing artificial intelligence systems for MOBA games mainly aims at competitive targets without considering human feelings during human-machine teaming, so the human player experience is affected when the technology is deployed; mainstream human-machine collaboration methods need to rely on manual effort and a large amount of parameter tuning to constrain the artificial intelligence so that it reasonably helps human players.
The model training method of the virtual scene provided by the embodiment of the application can be applied to the human-machine collaboration scene of a multiplayer competitive game to train an agent (the second model above), so that the agent can more reasonably and effectively assist human players and improve the team experience of the players. The embodiment of the application fine-tunes the competitive agent so that the agent can more reasonably and effectively assist human players, improves the team experience of the players, and turns the agent into a collaborative agent with higher application value. An agent is a software or hardware entity capable of autonomous activity; any independent entity capable of thinking and interacting with the environment can be abstracted as an agent. Hereinafter, the artificial intelligence for autonomously controlling a virtual object is referred to as the agent.
The model training method of the virtual scene provided by the embodiment of the application mainly comprises two stages: (1) a human original value estimation stage; (2) a human enhancement training stage. In the human original value estimation stage, the agent teams up with the human player to estimate the original target completion ability of the human player under the existing conditions; in the game scene, the player's ability can be characterized by the reward index acquired by the virtual object controlled by the player. In the human enhancement training stage, the human's original ability to complete the target is subtracted from the return the human actually obtains, so as to calculate the gain the human obtains from the agent; the gain is represented by a reward index in the virtual scene, and the agent is trained based on the advantage value (Advantage) brought by the gain.
Referring to fig. 6, fig. 6 is a schematic diagram of a prize index provided by an embodiment of the present application.
The embodiment of the application aims to improve the game experience of a human player when the game artificial intelligence system teams up with the human player. Taking cooperative games, MOBA games and FPS games as examples, the game experience can be clearly defined. The game experience targeted by the embodiment of the application mainly includes two aspects: global experience and process experience. The global experience mainly includes the game result (the final goal of the game), personal performance, and the like. The process experience mainly includes the sense of participation, the sense of highlight, the sense of protection, the number of resources, and the like. The primary experience indexes may be quantified as secondary numeric indexes; for example, the winning rate may be used to measure the game result, and the most valuable player (Most Valuable Player, MVP) score or personal ranking may be used to measure personal performance.
Referring to fig. 6, the primary indexes, i.e., the index types, include the game result, personal performance, the sense of participation, the sense of highlight, the sense of protection, and the number of resources. The secondary indexes are the quantized content of the primary indexes. In a MOBA game, the game result and personal performance are measured by the winning rate (the ratio between the number of wins and the number of games); the sense of participation is the assist rate or support rate; the sense of highlight is the number of defeats; the sense of protection is the amount of damage taken or the number of deaths; the number of resources is the number of virtual props, and the like. In an FPS game, the game result is the winning rate; personal performance is the ranking in a game; the sense of participation is the rescue rate; the sense of highlight is the number of defeats, the amount of damage dealt, and the number of consecutive defeats; the sense of protection is the amount of damage taken or the number of times of being defeated; the number of resources is the level of virtual equipment, the number of supplies, and the like.
The quantized secondary indexes can be used as human-related rewards and provide an additional training signal for training the agent. However, some rewards in the secondary indexes are easy for the agent to acquire (positive rewards) or to avoid (negative rewards), which can cause the agent to over-assist or to avoid the player and thus reduce the player's experience. For example, the agent may excessively follow the human player in order to obtain the monetary reward earned by the human player; or it may stay far away from the human player in order to avoid the negative reward incurred when the human player is injured.
The following explains the model structure of the agent in the model training method of the virtual scene provided by the embodiment of the application. Referring to fig. 7, fig. 7 is a schematic diagram of a model structure of a competitive agent according to an embodiment of the present application. The model of the competitive agent includes an encoder 501A, a value estimation network 502A, a decision network 503A, a gate mechanism module 505A, a long short-term memory network 504A, and a gain prediction network 506A. The two value estimation networks 502A in fig. 7 may be the same network or different networks with the same function.
The game state data is input into the encoder 501A, and the encoder 501A outputs state information features of the virtual scene; the long short-term memory network 504A performs feature extraction based on human history information to obtain a human intention representation, which is a feature in the human history information that can represent the trends and intentions of the behaviors executed by the human player. The game state data is the data of all virtual objects included in a game of the virtual scene and the environment data of the virtual scene. The human history information is a subset of the game state data and includes the following parameters of the virtual object controlled by the human player: the number of defeats, the number of deaths, the amount of resources, the position, the equipment, the attack power, and the life value.
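A minimal PyTorch sketch of the structure in fig. 7 is given below for orientation only; all layer sizes, the choice of fully connected layers, and the gating convention are assumptions rather than the patented implementation.

```python
import torch
import torch.nn as nn

class CompetitiveAgent(nn.Module):
    """Hedged sketch of the Fig. 7 structure; dimensions are assumed."""
    def __init__(self, state_dim, human_dim, action_dim, hidden=256):
        super().__init__()
        self.encoder = nn.Sequential(nn.Linear(state_dim, hidden), nn.ReLU())  # encoder 501A
        self.human_lstm = nn.LSTM(human_dim, hidden, batch_first=True)         # LSTM 504A: human intention representation
        self.decision = nn.Linear(hidden * 2, action_dim)                      # decision network 503A
        self.value = nn.Linear(hidden * 2, 1)                                  # value estimation network 502A
        self.gain = nn.Linear(hidden, 1)                                       # gain prediction network 506A

    def forward(self, state, human_history, gate):
        env = self.encoder(state)                       # state information features
        _, (h, _) = self.human_lstm(human_history)
        intent = h[-1] * gate                           # gate mechanism 505A: zeroed when gate = 0 (competitive mode)
        x = torch.cat([env, intent], dim=-1)
        return self.decision(x), self.value(x), torch.abs(self.gain(intent))  # |.| keeps the predicted gain positive
```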
In some embodiments, the long short-term memory network 504A is used to obtain the human intention representation; the network used to obtain the human intention representation is not limited to the long short-term memory network 504A and may be modeled with any neural network model. The training of the human model is not limited to the supervised learning method and may be performed using any imitation learning or behavior cloning method. The specific structures of the networks in the framework can be combined in different ways, and the effect is similar under the same number of parameters.
The gate mechanism module 505A is used to determine whether the model of the agent currently needs to make decisions based on the human intention features. For example, in a MOBA game, when the agent is too far away from the human player, it cannot provide effective cooperation for the human player. When the distance between the virtual object controlled by the agent and the human player's virtual object is smaller than a set value, the gate is set to 1, which means that the agent needs to pay attention to the human player and switches to the collaboration mode; when the distance is greater than or equal to the set value, the gate is set to 0, which means that the agent does not need to pay attention to the human player and switches to the competitive mode. In the competitive mode, the human intention representation is set to all zeros so as not to cause any interference to the decision network of the competitive agent.
The value estimation network 502A estimates the original value of the human, i.e., the ability of the human to accomplish the goal on his or her own, from the human intention representation. The gain prediction network 506A estimates the desired gain based on the human intention representation. Taking the human original value as a baseline, the baseline is subtracted from the actual return obtained by the human player with the help of the agent, yielding the actual human gain. The behavior of the agent is meaningful only when the return obtained by the human is above the baseline, that is, only when the true gain of the human is above the desired gain.
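Expressed as a tiny hedged helper (the function name and signature are assumptions), the check on the agent's behavior reads:

```python
def is_behavior_encouraged(human_return, baseline_value, desired_gain):
    """Hedged sketch: the true gain is the actual human return minus the original
    value baseline; the agent's behavior is encouraged only when this gain is
    positive and exceeds the desired gain."""
    true_gain = human_return - baseline_value
    return true_gain > 0 and true_gain > desired_gain
```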
The model training method in the virtual scene according to the embodiment of the application is explained below.
Referring to fig. 5A, fig. 5A is a schematic diagram of a first principle of model training according to an embodiment of the present application. Fig. 5A shows the encoder 501A, the value estimation network 502A, the decision network 503A, the gate mechanism module 505A, and the long short-term memory network 504A of fig. 7. Fig. 5A illustrates the human original value estimation stage, which is mainly used to estimate the reward index (the human original value) that a human player can obtain based on the current degree of cooperation when teamed with the competitive agent. The human original value is set as the baseline; if the reward index obtained by the human player when teamed with the agent is lower than this baseline, the behavior of the agent controlling the virtual object is meaningless.
The human raw value estimation stage is described below.
It is readily appreciated that in the human-machine team of a MOBA game, human players inherently possess basic game capabilities. Even without the help of the agent, humans can still accomplish certain goals through their original capabilities, such as acquiring money, recovering health, and defeating enemies. However, in MOBA games there are many ways to obtain money, and the difficulty of obtaining it is far lower than that of defeating enemies or destroying defensive towers. Therefore, directly taking the human's monetary reward as an extra reward for the agent would cause the agent to give up its original target (winning) and choose to always follow the human to harvest the human's monetary reward, eventually making the agent lose its autonomy completely and seriously impairing its ability. In addition, human players inevitably suffer some negative effects, such as being injured or defeated. If the injury or defeat of the human is directly taken as an additional punishment for the agent, the agent will learn to stay away from the human in advance to avoid the punishment.
To prevent the above, the embodiment of the application evaluates the basic capability of the human in the human original value estimation stage. Referring to fig. 5A, the decision network that receives the game state data is frozen, and the gate of the gate mechanism module 505A is set to 0, so that the agent always remains in the competitive mode and the competitive agent makes decisions with priority on the game reward benefit. The agent and the human model are trained as a team and play a large number of games. During the team games, the returns obtained by the human model in the environment (i.e., the human history information, the experience-related secondary indexes) are collected, and the value estimation network is trained based on the human history information to predict the expected return that the human can obtain in the future when teamed with the agent in the competitive mode. The trained value estimation network can then be used to evaluate the basic capability of the human.
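A hedged sketch of this stage, reusing the assumed CompetitiveAgent class above, is shown below; `team_games` is assumed to yield tensors `(state, human_history, human_return)`, and `value_optimizer` is assumed to cover only the value estimation parameters.

```python
import torch.nn.functional as F

def train_value_estimation(agent, value_optimizer, team_games):
    """Hedged sketch of the human original value estimation stage: freeze the
    decision network, fix the gate to 0 (competitive mode), and regress the
    value estimation head onto the returns the human model actually obtained."""
    for p in agent.decision.parameters():
        p.requires_grad_(False)                       # freeze the decision network
    for state, human_history, human_return in team_games:
        _, value, _ = agent(state, human_history, gate=0.0)   # gate = 0: competitive mode
        loss = F.mse_loss(value.squeeze(-1), human_return)    # predict the expected human return
        value_optimizer.zero_grad()
        loss.backward()
        value_optimizer.step()
```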
Illustratively, the embodiment of the application utilizes a human model (the first model above) that simulates a human player controlling a virtual object as the teammate of the agent (the second model above). A population-based training method (Population Based Training, PBT) is used: a plurality of human models are trained from a large amount of human data by a supervised learning (Supervised Learning, SL) method, and the population of human models is introduced into the training process of the agent to play games as teammates, which implicitly improves the agent's understanding of human behavior decisions and intentions and increases the agent's generalization to unknown teammates.
Referring to fig. 5B, fig. 5B is a second schematic diagram of model training provided by an embodiment of the present application. Fig. 5B shows the encoder 501A, the value estimation network 502A, the decision network 503A, the gate mechanism module 505A, the gain prediction network 506A, and the long short-term memory network 504A of fig. 7. Fig. 5B illustrates the human enhancement training stage, which is mainly used for fine-tuning the competitive agent: the agent and the human model are trained as a team, the real returns obtained by the human with the help of the agent are collected and compared with the baseline, and the real gain brought by the agent's behavior to the human is calculated. When the human's gain is positive, the behavior of the agent is considered worth encouraging; otherwise it needs to be penalized. In addition, the true gain of the human is combined with the agent's personal environment value and used as an objective function to jointly optimize the decision network of the agent.
After the value estimation network is trained, human enhancement training can be performed. The key to effective enhancement training is to set reasonable baselines for the human-related rewards: the agent is encouraged for behaviors whose positive rewards are higher than the positive baseline, and penalized for behaviors whose negative rewards are lower than the negative baseline. In this way, the agent is prevented from excessively harvesting returns that the human could obtain anyway, i.e., from learning ineffective cooperation, and is also prevented from moving away from the human to avoid being penalized, i.e., from learning non-cooperation. That is, behaviors whose actual rewards are above the baseline (the expected reward index above) are taken as positive samples, and behaviors whose actual rewards are below the baseline are taken as negative samples.
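The baseline-based sample screening can be sketched as follows; the record keys are hypothetical names for the actual and expected reward indexes.

```python
def screen_samples(game_records):
    """Hedged sketch of baseline screening: behaviors whose actual reward index
    exceeds the expected reward index (baseline) become positive samples,
    the rest become negative samples."""
    positives, negatives = [], []
    for record in game_records:
        if record["actual_reward"] > record["expected_reward"]:
            positives.append(record)
        else:
            negatives.append(record)
    return positives + negatives   # combined training sample set
```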
Referring to fig. 5B, the long short-term memory network 504A and the value estimation network 502A are frozen, and the gate of the gate mechanism module 505A is set to 1, so that the agent remains in the collaboration mode, i.e., becomes a collaborative agent, which prioritizes cooperation with the player.
The agent and the human model are trained as a team and play a large number of games. Likewise, during the team games, the returns (the experience-related secondary indexes) obtained by the human model in the environment are collected. Differently from the previous stage, the value estimated by the value estimation network during the game also needs to be recorded and used as the baseline to calculate the true gain the human model obtains from the help of the agent. Using the true gain as the label, the gain prediction network of the agent is trained to predict, based on the human intention representation, the expected gain the human will obtain in the future when teamed with the collaborative agent.
In some embodiments, a proximal policy optimization (Proximal Policy Optimization, PPO) algorithm is used to fine-tune the decision network and the personal value network of the agent. When calculating the advantage value in the proximal policy optimization algorithm, not only the advantage the agent's action brings to its personal return but also the advantage the agent's action brings to the human gain is calculated. The advantage value represents the difference between the current return and the expected return; if the advantage value is positive, the action is encouraged, and if the advantage value is negative, the action is penalized.
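The combined advantage used during PPO fine-tuning can be sketched as below; the mixing weight `w_human` and the function signature are assumptions, not values from the patent.

```python
def combined_advantage(personal_return, personal_value, human_gain, expected_gain,
                       w_human=0.5):
    """Hedged sketch: mix the advantage on the agent's own return with the
    advantage its action brings to the human gain; positive values are
    encouraged, negative values penalized."""
    personal_adv = personal_return - personal_value   # advantage on the agent's personal return
    human_adv = human_gain - expected_gain            # advantage on the human player's gain
    return personal_adv + w_human * human_adv
```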
For example, in order to explore fully during reinforcement learning training, the agent introduces action sampling into the decision process, so sampling a wrong action may produce a negative gain (negative cooperation) and thereby reduce the desired gain. In order for the agent to explore effective collaboration, prior knowledge is introduced into the modeling: the desired human gain is estimated using only positive gains, which ensures that the agent's actions are encouraged only when they produce a positive gain that is higher than the desired gain. Therefore, an absolute-value activation function is added before the output of the agent's gain prediction network to ensure that only positive gains are predicted. The gain prediction network of the agent may be trained using the objective function shown in equation (1) and equation (2):
G = R_human − V  (1)

L = ||G − Ĝ||  (2)

where L is the loss value, V represents the original value of the human and serves as the baseline (the desired reward index), R_human is the actual reward index, i.e., the number of rewards earned by the virtual object controlled by the human player, G is the actual human gain, and Ĝ is the predicted gain output by the gain prediction network.
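A hedged implementation of the relation reconstructed above (using a squared-norm regression, an assumed choice) could look as follows; tensor names are hypothetical.

```python
import torch
import torch.nn.functional as F

def gain_prediction_loss(gain_head_output, human_reward, original_value):
    """Hedged sketch of equations (1)-(2) as reconstructed above: the true gain is
    the human reward minus the original-value baseline, the prediction is forced
    positive with an absolute-value activation, and the loss penalizes their gap."""
    true_gain = human_reward - original_value          # equation (1)
    predicted_gain = torch.abs(gain_head_output)       # absolute-value activation: only positive gains
    return F.mse_loss(predicted_gain, true_gain)       # equation (2), squared-norm form
```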
Referring to fig. 5C, fig. 5C is a third principle schematic diagram of model training according to an embodiment of the present application. Fig. 5C shows the encoder 501A, the value estimation network 502A, the decision network 503A, and the gate mechanism module 505A of fig. 7.
Illustratively, in addition to the resources used to train the agent's ability to cooperate with real players, a large proportion of the training resources (e.g., 70%) is used to maintain the agent's competitive ability. Since the artificial intelligence of a MOBA game is trained using self-play (Self-Play, SP), the training opponents must be high-level in order to maintain high strength, and introducing a human model may reduce the opponent's ability and thereby affect the competitive ability. Therefore, most of the resources are used for the original training, so that the agent maintains its own competitive ability while learning human enhancement. Referring to fig. 5C, to reduce the interference with the human intention representation, the gate of the gate mechanism module 505A is set to 0.
In some embodiments, for the trained agent, in the collaboration mode the agent will help the human teammate as much as possible; however, when the human teammate's ability is too low, helping the human teammate may impair the ability of the whole human-machine team. The gate mechanism may therefore be set according to the capability of the human player's virtual object. For example, the agent predicts the final target value and dynamically adjusts the team-mode gate based on it: when the predicted value is higher than a set threshold, the human-machine team is considered to be at an advantage and the collaboration mode is maintained; when the predicted value is lower than the threshold, the human-machine team is considered to be at a disadvantage, and the collaboration mode needs to be turned off to ensure that the human player can reach the final goal.
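This dynamic gating can be summarized by a small hedged helper; the threshold and names are assumptions.

```python
def adjust_team_gate(predicted_final_value, value_threshold):
    """Hedged sketch of the dynamic gate adjustment: keep the collaboration mode
    (gate = 1) while the predicted final target value indicates the human-machine
    team is at an advantage; otherwise fall back to the competitive mode (gate = 0)."""
    return 1 if predicted_final_value > value_threshold else 0
```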
The beneficial effects of the embodiment of the application can be summarized as follows:
(1) A collaborative agent training method based on human gain is provided to train the agent to assist human players more reasonably and effectively, improving the team experience of the players.
(2) A fine-tuning and extension method for the competitive agent is provided; the enhancement mode and the competitive mode are controlled through a gate mechanism, which improves the usability of the method.
(3) Dynamic adjustment realizes real-time mode switching, ensuring both the competitive ability of the agent and the enhancement of the human player.
(4) Besides being applicable to multiplayer competitive games, the model of the agent is itself universal, can be extended to other games, and has heuristic value for many real-world human-machine teaming applications.
By way of example, referring to fig. 8A to 8D, fig. 8A is a first experimental result provided by an embodiment of the present application; FIG. 8B is a second experimental result provided by an embodiment of the present application; FIG. 8C is a third experimental result provided by an embodiment of the present application; fig. 8D is a fourth experimental result provided by the embodiment of the present application.
The differences between the model trained by the embodiment of the application and existing models are compared through experiments. The human reward enhancement model is a model that prioritizes the rewards acquired by human players in the team, and the multiplayer competitive game model is a model that prioritizes the number of rewards acquired by the virtual objects it controls.
Referring to fig. 8A, the model trained by the embodiment of the present application is superior to the human rewards enhancement model in terms of the win rate, MVP score, highlight moment probability, and resource amount; the model of the embodiment of the application is superior to a multi-player competitive game model in the aspects of MVP score, highlight moment probability, participation rate, resource quantity and following rate.
Referring to FIG. 8B, the model trained by embodiments of the present application is superior to existing multiplayer competitive game models in terms of rationality of behavior, degree of gain, gaming experience of human players, overall preference of human players.
Referring to fig. 8C and fig. 8D, the model trained by the embodiment of the present application is superior to the existing multi-player competitive game model in MVP score, highlight moment probability, participation rate, resource amount, etc. The level of the winning rate is basically the same as that of the existing multi-player competitive game model.
Continuing with the description below of an exemplary architecture of the virtual scene model training apparatus 455 implemented as a software module provided by embodiments of the present application, in some embodiments, as shown in fig. 2A, the software modules stored in the virtual scene model training apparatus 455 of the memory 450 may include: the data acquisition module 4551 is configured to acquire scene data in a virtual scene and a first model, wherein the first model is a machine learning model pre-trained based on real interaction data, and the second model is a machine learning model to be trained; the game simulation module 4552 is configured to invoke the first model and the second model in the collaboration mode based on the scene data to perform multiple simulated games to obtain first game data of each simulated game, wherein the second model in the collaboration mode is used for controlling the second virtual object to cooperate with the first virtual object controlled by the first model; the training module 4553 is configured to invoke the second model to determine an actual reward index and an expected reward index corresponding to the first model based on the first game data; the training module 4553 is configured to screen a training sample set from the first game data based on the actual reward index and the expected reward index; the training module 4553 is configured to train the second model based on the training sample set, wherein the trained second model is used to autonomously control the corresponding behavior of the second virtual object in a game of the virtual scene.
In some embodiments, the second model includes a value prediction network; the training module 4553 is configured to, before performing multiple simulated game play based on the scene data and calling the first model and the second model in the cooperative mode to obtain first game play data of each simulated game play, call the first model and the second model in the competitive mode to perform multiple simulated game play based on the scene data to obtain second game play data of each simulated game play, where the second virtual object controlled by the second model in the competitive mode does not cooperate with the first virtual object controlled by the first model; determining an original rewarding index of the first model in each simulated game based on the second game data; and training a value prediction network in the second model based on the scene data and the original rewarding indexes of each simulated game, wherein the trained value prediction network is used for predicting the rewarding indexes of the first model in the game, and the game types comprise simulated games and real games.
In some embodiments, the second model further comprises a feature extraction network; the training module 4553 is configured to invoke the feature extraction network in the second model to perform feature extraction processing based on the scene data to obtain information features of each simulated game; call the value prediction network in the second model to perform prediction processing based on the information features of each simulated game to obtain a predicted reward index of the first model in each simulated game; determine a first loss function of the value prediction network based on a difference between the original reward index and the predicted reward index; and perform back propagation processing on the value prediction network based on the first loss function to obtain the trained value prediction network.
In some embodiments, the second model includes a value prediction network and a feature extraction network; the training module 4553 is configured to invoke the value prediction network in the second model to determine the actual reward index of the first model in each simulated game based on the first game data; call the feature extraction network in the second model to perform feature extraction processing based on the first game data and the scene data to obtain information features related to the first model; and call the value prediction network in the second model based on the information features to determine the expected reward index of the first model in each simulated game, wherein the expected reward index characterizes the reward index acquired by the first model when the second model does not cooperate.
In some embodiments, the training module 4553 is configured to mark first game data whose actual reward index is greater than the expected reward index as positive samples; mark first game data whose actual reward index is less than or equal to the expected reward index as negative samples; and combine the positive samples and the negative samples into a training sample set.
In some embodiments, the training module 4553 is configured to train the second model in the following ways, respectively: in response to the actual reward index of first game data in the training sample set being greater than the expected reward index, determining a second loss function based on a difference between the actual reward index and the expected reward index, and updating parameters of the second model in the cooperation mode based on the first game data and the second loss function; and extracting third game data from the training sample set and training the second model in the competitive mode based on the third game data, wherein the third game data is first game data that is not associated with the first model.
In some embodiments, training module 4553 is configured to, for each actual reward indicator of the first model, perform the following with the corresponding expected reward indicator:
acquiring a first difference value between an actual rewarding index and an expected rewarding index; acquiring a second difference value between the first difference value and an expected gain index, and taking a norm of the second difference value as a loss value, wherein when the second virtual object assists the first virtual object, the expected gain index characterizes a gain effect formed by the assisting action on the first virtual object; the average value of each loss value is taken as a second loss function.
In some embodiments, the third game data includes state information and actual value parameters when the second virtual object corresponding to the second model performs each actual behavior; the training module 4553 is configured to invoke the second model to perform prediction processing based on the scene data and the state information in the third game data, so as to obtain a plurality of candidate prediction behaviors of the second virtual object and corresponding prediction value parameters; taking the candidate prediction behavior with the maximum prediction value parameter as a target prediction behavior; determining cross entropy loss of the second model based on the target prediction behavior and the actual behavior corresponding to the target prediction behavior; determining a mean square error loss of the second model based on the actual value parameter corresponding to the prediction value parameter; and carrying out parameter updating processing on the second model based on the mean square error loss and the cross entropy loss to obtain the trained second model.
In some embodiments, the game execution module 4554 is configured to obtain, after training the second model based on the training sample set, a current capability parameter of a third virtual object, wherein the third virtual object and the second virtual object belong to the same team, and the third virtual object is controlled by a real player; responsive to the current capability parameter of the third virtual object being less than the capability threshold, invoking a second model in the athletic mode based on the scene data, and controlling behavior of the second virtual object based on the second model in the athletic mode; and in response to the current capability parameter of the third virtual object being greater than the capability threshold, invoking a second model in a collaboration mode based on the scene data and the game data of the third virtual object, and controlling behavior of the second virtual object based on the second model in the collaboration mode.
In some embodiments, the game execution module 4554 is configured to obtain, after training the second model based on the training sample set, scene data in the virtual scene and a current distance between a third virtual object and the second virtual object, wherein the third virtual object and the second virtual object belong to the same team, and the third virtual object is controlled by the real player; responsive to the current distance between the third virtual object and the second virtual object being greater than the distance threshold, invoking a second model in the athletic mode based on the scene data, and controlling behavior of the second virtual object based on the second model in the athletic mode; and in response to the current distance between the third virtual object and the second virtual object being less than or equal to the distance threshold, invoking a second model in the collaborative mode based on the scene data and the game data of the third virtual object, and controlling behavior of the second virtual object based on the second model in the collaborative mode.
In some embodiments, the second model includes a decision network and a value prediction network; the game execution module 4554 is configured to call the decision network to conduct behavior prediction processing based on the scene data to obtain a plurality of candidate prediction behaviors; call the value prediction network for value prediction processing based on the scene data and each candidate prediction behavior to obtain a first prediction value parameter of each candidate prediction behavior; take the candidate prediction behavior with the highest first prediction value parameter as the target prediction behavior; and control the behavior of the second virtual object based on the configuration parameters corresponding to the target prediction behavior.
In some embodiments, the second model includes a decision network, a value prediction network, and a feature extraction network; the game execution module 4554 is configured to invoke a feature extraction network to perform feature extraction processing based on the game data of the third virtual object, so as to obtain information features of the third virtual object; invoking a decision network to conduct behavior prediction processing based on scene data and information characteristics to obtain a plurality of candidate prediction behaviors; calling a value prediction network to conduct behavior prediction processing based on the scene data, each candidate prediction behavior and the information characteristic to obtain a second prediction value parameter of each candidate prediction behavior aiming at a third virtual object; taking the candidate prediction behavior with the highest second prediction value parameter as a target prediction behavior; and controlling the behavior of the second virtual object based on the configuration parameters corresponding to the target predicted behavior.
Embodiments of the present application provide a computer program product comprising a computer program or computer-executable instructions stored in a computer-readable storage medium. The processor of the electronic device reads the computer executable instructions from the computer readable storage medium, and the processor executes the computer executable instructions, so that the electronic device executes the model training method of the virtual scene according to the embodiment of the application.
Embodiments of the present application provide a computer-readable storage medium storing computer-executable instructions or a computer program stored therein, which when executed by a processor, cause the processor to perform a model training method of a virtual scene provided by the embodiments of the present application, for example, a model training method of a virtual scene as shown in fig. 3A.
In some embodiments, the computer readable storage medium may be FRAM, ROM, PROM, EPROM, EEPROM, flash memory, magnetic surface memory, optical disk, or CD-ROM; but may be a variety of devices including one or any combination of the above memories.
In some embodiments, computer-executable instructions may be written in any form of programming language, including compiled or interpreted languages, or declarative or procedural languages, in the form of programs, software modules, scripts, or code, and they may be deployed in any form, including as stand-alone programs or as modules, components, subroutines, or other units suitable for use in a computing environment.
As an example, computer-executable instructions may, but need not, correspond to files in a file system, may be stored as part of a file that holds other programs or data, such as in one or more scripts in a hypertext markup language (HTML, hyper Text Markup Language) document, in a single file dedicated to the program in question, or in multiple coordinated files (e.g., files that store one or more modules, sub-programs, or portions of code).
As an example, executable instructions may be deployed to be executed on one electronic device or on multiple electronic devices located at one site or, alternatively, on multiple electronic devices distributed across multiple sites and interconnected by a communication network.
In summary, according to the embodiment of the application, the real player is simulated to control the virtual object through the first model, so that the first model and the second model in the cooperative mode are simulated for the game to obtain the game data as the training data, and the calculation resources and the time cost required for obtaining the training data are saved; the method comprises the steps of obtaining actual rewarding indexes and expected rewarding indexes of a first model in a game, taking the actual rewarding indexes and the expected rewarding indexes as conditions for screening samples, improving the efficiency of obtaining the samples, enabling the samples to be more accurate, and improving the cooperativity between virtual objects controlled by a trained second model and virtual objects controlled by real human players.
The foregoing is merely exemplary embodiments of the present application and is not intended to limit the scope of the present application. Any modification, equivalent replacement, improvement, etc. made within the spirit and scope of the present application are included in the protection scope of the present application.

Claims (15)

1. A method for model training of a virtual scene, the method comprising:
acquiring scene data and a first model in a virtual scene, wherein the first model is a machine learning model obtained by pre-training based on real interaction data, and the second model is a machine learning model to be trained;
calling the first model and the second model in a cooperative mode to perform a simulated game a plurality of times based on the scene data to obtain first game data of each simulated game, wherein the second model in the cooperative mode is used for controlling a second virtual object to cooperate with a first virtual object controlled by the first model;
based on the first game data, calling the second model to determine an actual reward index and an expected reward index corresponding to the first model;
screening a training sample set from the first game data based on the actual reward index and the expected reward index;
and training the second model based on the training sample set, wherein the trained second model is used for controlling the second virtual object in the game of the virtual scene.
2. The method of claim 1, wherein the second model comprises a value prediction network;
before the first model and the second model in the collaboration mode are called to perform the simulated game a plurality of times based on the scene data to obtain the first game data of each simulated game, the method further comprises:
Calling the first model and the second model in the competitive mode to perform simulation game for a plurality of times based on the scene data to obtain second game data of each simulation game, wherein a second virtual object controlled by the second model in the competitive mode is not cooperated with a first virtual object controlled by the first model;
determining an original reward index of the first model in each simulated game based on the second game data;
and training a value prediction network in the second model based on the scene data and the original reward index of each simulated game, wherein the trained value prediction network is used for predicting the reward index of the first model in a game, and the game types comprise the simulated game and the real game.
3. The method of claim 2, wherein the second model further comprises a feature extraction network;
the training a value prediction network in the second model based on the scene data and the original reward index of each simulated game comprises:
invoking the feature extraction network in the second model to perform feature extraction processing based on the scene data to obtain information features of each simulated game;
calling the value prediction network in the second model to perform prediction processing based on the information features of each simulated game to obtain a predicted reward index of the first model in each simulated game;
determining a first loss function of the value prediction network based on a difference between the original rewards index and the predicted rewards index;
and carrying out back propagation processing on the value prediction network based on the first loss function to obtain a trained value prediction network.
4. The method of claim 1, wherein the second model comprises a value prediction network and a feature extraction network;
the calling the second model to determine the actual reward index and the expected reward index corresponding to the first model based on the first game data comprises:
invoking the value prediction network in the second model based on the first game data to determine the actual reward index of the first model in each simulated game;
calling the feature extraction network in the second model to perform feature extraction processing based on the first game data and the scene data to obtain information features related to the first model;
and calling the value prediction network in the second model based on the information features to determine the expected reward index of the first model in each simulated game, wherein the expected reward index represents the reward index acquired by the first model when the second model does not cooperate.
5. The method of claim 1, wherein the screening a training sample set from the first game data based on the actual reward index and the expected reward index comprises:
marking first game data whose actual reward index is greater than the expected reward index as positive samples;
marking first game data whose actual reward index is less than or equal to the expected reward index as negative samples;
combining each of the positive samples with each of the negative samples into a training sample set.
6. The method of claim 5, wherein the training the second model based on the set of training samples comprises:
the second model is respectively trained by:
in response to the actual reward index of first game data in the training sample set being greater than the expected reward index, determining a second loss function based on a difference between the actual reward index and the expected reward index, and updating parameters of the second model in the collaboration mode based on the first game data and the second loss function;
and extracting third game data from the training sample set, and training the second model in the competitive mode based on the third game data, wherein the third game data is first game data that is not associated with the first model.
7. The method of claim 6, wherein the determining a second loss function based on a difference between the actual rewards index and the expected rewards index comprises:
for each of the actual rewards indicators of the first model and the corresponding expected rewards indicators, performing the following:
acquiring a first difference value between the actual rewarding index and the expected rewarding index;
acquiring a second difference value between the first difference value and an expected gain index, and taking a norm of the second difference value as a loss value, wherein when the second virtual object assists the first virtual object, the expected gain index characterizes a gain effect formed by an assisting action on the first virtual object;
and taking the average value of each loss value as the second loss function.
8. The method of claim 6, wherein the third game data comprises: state information and actual value parameters when the second virtual object corresponding to the second model performs each actual behavior;
the training the second model in the competitive mode based on the third game data comprises:
invoking the second model to conduct prediction processing based on the scene data and the state information in the third game data to obtain a plurality of candidate prediction behaviors of the second virtual object and corresponding prediction value parameters;
taking the candidate prediction behavior with the maximum prediction value parameter as a target prediction behavior;
determining cross entropy loss of the second model based on the target predicted behavior and actual behavior corresponding to the target predicted behavior;
determining a mean square error loss of the second model based on the actual value parameter corresponding to the predicted value parameter;
and carrying out parameter updating processing on the second model based on the mean square error loss and the cross entropy loss to obtain the trained second model.
9. The method of claim 1, wherein after the training of the second model based on the set of training samples, the method further comprises:
acquiring current capacity parameters of a third virtual object, wherein the third virtual object and the second virtual object belong to the same team, and the third virtual object is controlled by a real player;
Responsive to the current capability parameter of the third virtual object being less than a capability threshold, invoking a second model in an athletic mode based on the scene data, and controlling behavior of the second virtual object based on the second model in the athletic mode;
and in response to the current capability parameter of the third virtual object being greater than a capability threshold, invoking a second model in a collaboration mode based on the scene data and the game data of the third virtual object, and controlling the behavior of the second virtual object based on the second model in the collaboration mode.
10. The method of claim 1, wherein after the training of the second model based on the set of training samples, the method further comprises:
acquiring scene data in a virtual scene and a current distance between a third virtual object and a second virtual object, wherein the third virtual object and the second virtual object belong to the same team, and the third virtual object is controlled by a real player;
responsive to a current distance between the third virtual object and the second virtual object being greater than a distance threshold, invoking a second model in an athletic mode based on the scene data, and controlling behavior of the second virtual object based on the second model in the athletic mode;
and in response to the current distance between the third virtual object and the second virtual object being less than or equal to a distance threshold, invoking a second model in a collaboration mode based on the scene data and the game data of the third virtual object, and controlling behavior of the second virtual object based on the second model in the collaboration mode.
11. The method according to claim 9 or 10, wherein the second model comprises a decision network and a value prediction network;
the invoking of the second model in the competitive mode based on the scene data and the controlling of the behavior of the second virtual object based on the second model in the competitive mode comprise:
invoking the decision network to perform behavior prediction processing based on the scene data, to obtain a plurality of candidate predicted behaviors;
invoking the value prediction network to perform value prediction processing based on the scene data and each candidate predicted behavior, to obtain a first predicted value parameter of each candidate predicted behavior;
taking the candidate predicted behavior with the highest first predicted value parameter as a target predicted behavior;
and controlling the behavior of the second virtual object based on configuration parameters corresponding to the target predicted behavior.
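A minimal sketch of the competitive-mode inference path of claim 11 follows. The decision_net is assumed to propose candidate behaviors and the value_net to score one candidate at a time; both interfaces and the tensor shapes are assumptions for illustration.

    import torch

    @torch.no_grad()
    def competitive_mode_inference(decision_net, value_net, scene_features):
        """Illustrative inference path; both network interfaces are assumed.

        decision_net(scene_features) is assumed to return a tensor of candidate
        behavior encodings of shape (num_candidates, behavior_dim), and
        value_net(scene_features, candidate) a scalar value estimate.
        """
        candidates = decision_net(scene_features)      # candidate predicted behaviors
        values = torch.stack([
            value_net(scene_features, candidate)       # first predicted value parameter
            for candidate in candidates
        ])
        best = torch.argmax(values)                    # candidate with the highest value
        return candidates[best]                        # target predicted behavior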
12. The method according to claim 9 or 10, wherein the second model comprises a decision network, a value prediction network, and a feature extraction network;
the invoking of the second model in the collaboration mode based on the scene data and the game data of the third virtual object, and the controlling of the behavior of the second virtual object based on the second model in the collaboration mode, comprise:
invoking the feature extraction network to perform feature extraction processing based on the game data of the third virtual object, to obtain an information feature of the third virtual object;
invoking the decision network to perform behavior prediction processing based on the scene data and the information feature, to obtain a plurality of candidate predicted behaviors;
invoking the value prediction network to perform value prediction processing based on the scene data, each candidate predicted behavior, and the information feature, to obtain a second predicted value parameter of each candidate predicted behavior with respect to the third virtual object;
taking the candidate predicted behavior with the highest second predicted value parameter as a target predicted behavior;
and controlling the behavior of the second virtual object based on configuration parameters corresponding to the target predicted behavior.
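The collaboration-mode path of claim 12 differs from claim 11 only in that the teammate's game data is first condensed into an information feature that conditions both networks. The sketch below is hedged in the same way: all three network interfaces are assumed.

    import torch

    @torch.no_grad()
    def collaboration_mode_inference(feature_net, decision_net, value_net,
                                     scene_features, teammate_game_data):
        """Illustrative inference path; all three network interfaces are assumed."""
        # Condense the human teammate's game data into an information feature.
        info_feature = feature_net(teammate_game_data)
        # Propose candidate behaviors conditioned on the scene and the teammate.
        candidates = decision_net(scene_features, info_feature)
        # Score each candidate with respect to the third virtual object.
        values = torch.stack([
            value_net(scene_features, candidate, info_feature)
            for candidate in candidates
        ])
        return candidates[torch.argmax(values)]        # target predicted behavior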
13. A model training apparatus for a virtual scene, the apparatus comprising:
a data acquisition module, configured to acquire scene data in a virtual scene, a first model, and a second model, wherein the first model is a machine learning model obtained by pre-training based on real interaction data, and the second model is a machine learning model to be trained;
a game simulation module, configured to invoke the first model and the second model in a collaboration mode to perform a plurality of simulated games based on the scene data, to obtain first game data of each simulated game, wherein the second model in the collaboration mode is configured to control a second virtual object to cooperate with a first virtual object controlled by the first model;
a training module, configured to invoke the second model to determine an actual reward index and an expected reward index corresponding to the first model based on the first game data;
the training module being further configured to screen a training sample set from the first game data based on the actual reward index and the expected reward index;
the training module being further configured to train the second model based on the training sample set, wherein the trained second model is configured to control the second virtual object in a game of the virtual scene.
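The claimed apparatus maps onto three cooperating components. The skeleton below is a hypothetical arrangement of those modules; method names such as acquire, simulate, reward_indices, and screen_samples are invented for illustration and only show how data would flow from acquisition through simulation to training.

    class VirtualSceneModelTrainer:
        """Hypothetical skeleton mirroring the claimed modules; not the actual apparatus."""

        def __init__(self, data_acquisition, game_simulation, training):
            self.data_acquisition = data_acquisition  # acquires scene data and the two models
            self.game_simulation = game_simulation    # runs simulated games in collaboration mode
            self.training = training                  # screens samples and trains the second model

        def run(self):
            scene_data, first_model, second_model = self.data_acquisition.acquire()
            first_game_data = self.game_simulation.simulate(scene_data, first_model, second_model)
            actual_index, expected_index = self.training.reward_indices(second_model, first_game_data)
            samples = self.training.screen_samples(first_game_data, actual_index, expected_index)
            return self.training.train(second_model, samples)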
14. An electronic device, the electronic device comprising:
a memory for storing computer executable instructions;
a processor for implementing the model training method of a virtual scene according to any of claims 1 to 12 when executing computer executable instructions or computer programs stored in said memory.
15. A computer readable storage medium storing computer executable instructions or a computer program, which when executed by a processor, implements the model training method of a virtual scene according to any of claims 1 to 12.
CN202311094973.1A 2023-08-29 2023-08-29 Model training method and device for virtual scene, electronic equipment and storage medium Active CN116821693B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202311094973.1A CN116821693B (en) 2023-08-29 2023-08-29 Model training method and device for virtual scene, electronic equipment and storage medium

Publications (2)

Publication Number Publication Date
CN116821693A (en) 2023-09-29
CN116821693B (en) 2023-11-03

Family

ID=88114853

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202311094973.1A Active CN116821693B (en) 2023-08-29 2023-08-29 Model training method and device for virtual scene, electronic equipment and storage medium

Country Status (1)

Country Link
CN (1) CN116821693B (en)

Families Citing this family (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN117928568B (en) * 2024-03-22 2024-06-04 腾讯科技(深圳)有限公司 Navigation method based on artificial intelligence, model training method and device

Family Cites Families (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN111967615B (en) * 2020-09-25 2024-05-28 北京百度网讯科技有限公司 Multi-model training method and device based on feature extraction, electronic equipment and medium

Patent Citations (9)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN111282279A (en) * 2020-02-05 2020-06-16 腾讯科技(深圳)有限公司 Model training method, and object control method and device based on interactive application
CN111367657A (en) * 2020-02-21 2020-07-03 重庆邮电大学 Computing resource collaborative cooperation method based on deep reinforcement learning
CN111589166A (en) * 2020-05-15 2020-08-28 深圳海普参数科技有限公司 Interactive task control, intelligent decision model training methods, apparatus, and media
CN112791394A (en) * 2021-02-02 2021-05-14 腾讯科技(深圳)有限公司 Game model training method and device, electronic equipment and storage medium
CN114344905A (en) * 2021-11-15 2022-04-15 腾讯科技(深圳)有限公司 Team interaction processing method, device, equipment, medium and program for virtual object
CN114154570A (en) * 2021-11-30 2022-03-08 深圳壹账通智能科技有限公司 Sample screening method and system and neural network model training method
CN114307160A (en) * 2021-12-10 2022-04-12 腾讯科技(深圳)有限公司 Method for training intelligent agent
CN114358141A (en) * 2021-12-14 2022-04-15 中国运载火箭技术研究院 Multi-agent reinforcement learning method oriented to multi-combat-unit cooperative decision
CN116050795A (en) * 2023-02-13 2023-05-02 上海大学 Unmanned ship cluster task scheduling and collaborative countermeasure method based on MADDPG

Non-Patent Citations (2)

* Cited by examiner, † Cited by third party
Title
Research on Action Prediction Technology Based on Deep Learning; Gao Yiming; China Master's Theses Full-text Database, Information Science and Technology (Issue 01); I138-1671 *
Research on Reinforcement Learning Algorithms for Complex Problems; Liu Feiyu; China Master's Theses Full-text Database, Information Science and Technology (Issue 08); I140-61 *

Also Published As

Publication number Publication date
CN116821693A (en) 2023-09-29

Similar Documents

Publication Publication Date Title
CN108888958B (en) Virtual object control method, device, equipment and storage medium in virtual scene
US11938403B2 (en) Game character behavior control method and apparatus, storage medium, and electronic device
US9908052B2 (en) Creating dynamic game activities for games
Ortega et al. Imitating human playing styles in super mario bros
Rohlfshagen et al. Pac-man conquers academia: Two decades of research using a classic arcade game
Nielsen et al. General video game evaluation using relative algorithm performance profiles
CN112402986B (en) Training method and device for reinforcement learning model in battle game
CN116821693B (en) Model training method and device for virtual scene, electronic equipment and storage medium
CN111111220A (en) Self-chess-playing model training method and device for multiplayer battle game and computer equipment
WO2020119737A1 (en) Information prediction method, model training method and server
US20240123348A1 (en) Game character control method and apparatus, storage medium and electronic device
CN112138394B (en) Image processing method, image processing device, electronic equipment and computer readable storage medium
CN115238891A (en) Decision model training method, and target object strategy control method and device
JP2024534104A (en) Method and device for interactive processing of virtual scenes, electronic device and computer program
US20120129595A1 (en) Apparatus and method for providing agent-based game service
CN113018862B (en) Virtual object control method and device, electronic equipment and storage medium
CN116966573A (en) Interaction model processing method, device, computer equipment and storage medium
CN110314379B (en) Learning method of action output deep training model and related equipment
CN115944921A (en) Game data processing method, device, equipment and medium
CN114404976B (en) Training method and device for decision model, computer equipment and storage medium
Lam et al. A novel real-time design for fighting game AI
US20240346755A1 (en) Interaction processing method and apparatus for virtual scene, electronic device, and storage medium
Wheeler Causation in a Virtual World: a Mechanistic Approach
US20240335748A1 (en) Virtual scene information processing method, apparatus, device, storage medium, and program product
CN112933600B (en) Virtual object control method, device, computer equipment and storage medium

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant
REG Reference to a national code

Ref country code: HK

Ref legal event code: DE

Ref document number: 40094957

Country of ref document: HK