CN111111204A - Interactive model training method and device, computer equipment and storage medium - Google Patents

Interactive model training method and device, computer equipment and storage medium

Info

Publication number
CN111111204A
CN111111204A (application CN202010247990.4A)
Authority
CN
China
Prior art keywords
interaction
model
target
action
strategy
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Granted
Application number
CN202010247990.4A
Other languages
Chinese (zh)
Other versions
CN111111204B (en)
Inventor
邱福浩
韩国安
李晓倩
王亮
付强
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Tencent Technology Shenzhen Co Ltd
Original Assignee
Tencent Technology Shenzhen Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Tencent Technology Shenzhen Co Ltd filed Critical Tencent Technology Shenzhen Co Ltd
Priority to CN202010247990.4A priority Critical patent/CN111111204B/en
Publication of CN111111204A publication Critical patent/CN111111204A/en
Application granted granted Critical
Publication of CN111111204B publication Critical patent/CN111111204B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical

Classifications

    • A: HUMAN NECESSITIES
    • A63: SPORTS; GAMES; AMUSEMENTS
    • A63F: CARD, BOARD, OR ROULETTE GAMES; INDOOR GAMES USING SMALL MOVING PLAYING BODIES; VIDEO GAMES; GAMES NOT OTHERWISE PROVIDED FOR
    • A63F 13/00: Video games, i.e. games using an electronically generated display having two or more dimensions
    • A63F 13/60: Generating or modifying game content before or while executing the game program, e.g. authoring tools specially adapted for game development or game-integrated level editor
    • A63F 13/67: Generating or modifying game content before or while executing the game program, adaptively or by learning from player actions, e.g. skill level adjustment or by storing successful combat sequences for re-use
    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06N: COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N 20/00: Machine learning
    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06N: COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N 3/00: Computing arrangements based on biological models
    • G06N 3/02: Neural networks
    • G06N 3/08: Learning methods
    • A: HUMAN NECESSITIES
    • A63: SPORTS; GAMES; AMUSEMENTS
    • A63F: CARD, BOARD, OR ROULETTE GAMES; INDOOR GAMES USING SMALL MOVING PLAYING BODIES; VIDEO GAMES; GAMES NOT OTHERWISE PROVIDED FOR
    • A63F 2300/00: Features of games using an electronically generated display having two or more dimensions, e.g. on a television screen, showing representations related to the game
    • A63F 2300/60: Methods for processing data by generating or executing the game program
    • A63F 2300/6027: Methods for processing data by generating or executing the game program using adaptive systems learning from user actions, e.g. for skill level adjustment

Abstract

The application relates to an interactive model training method, an interactive model training device, computer equipment and a storage medium, which relate to artificial intelligence. The interactive model training method comprises the following steps: acquiring a first interaction state feature corresponding to a virtual interaction environment, and acquiring a first interaction action, wherein the first interaction action is determined by inputting the first interaction state feature into a first interaction model to be trained; obtaining the benefit obtained by the target virtual object executing the first interaction action as a first benefit; inputting the first interaction state feature and the first interaction action into a target strategy discrimination model corresponding to a target interaction strategy to obtain a first strategy discrimination value; calculating a second benefit according to the first strategy discrimination value; calculating a target benefit according to the first benefit and the second benefit; and adjusting the model parameters of the first interaction model to be trained according to the target benefit to obtain an updated first interaction model. By adopting the method, the model training effect can be improved.

Description

Interactive model training method and device, computer equipment and storage medium
Technical Field
The present application relates to the field of artificial intelligence technologies, and in particular, to an interactive model training method, an interactive model training apparatus, a computer device, and a storage medium.
Background
With the development of internet technology, games have become a popular form of entertainment. For example, users can compete against other game players in a Multiplayer Online Battle Arena (MOBA) game.
At present, games can also be played by an artificial intelligence model. For example, when a game player drops offline, the game can be temporarily hosted, and the artificial intelligence model takes the place of the disconnected real player to fight against another real player. The artificial intelligence model needs to be trained in advance with training data, and current training of such models mostly relies on continuous battle training to evolve. However, the trained model often cannot meet practical needs, and the model training effect is poor.
Disclosure of Invention
In view of the above, it is necessary to provide an interactive model training method, apparatus, computer device and storage medium for solving the above technical problem of poor model training effect.
A method of interaction model training, the method comprising: acquiring a first interaction state feature corresponding to a virtual interaction environment, and acquiring a first interaction action, wherein the first interaction action is determined by inputting the first interaction state feature into a first interaction model to be trained; obtaining the benefit obtained by the target virtual object executing the first interaction action as a first benefit; inputting the first interaction state feature and the first interaction action into a target strategy discrimination model corresponding to a target interaction strategy to obtain a first strategy discrimination value; calculating a second benefit according to the first strategy discrimination value, wherein the first strategy discrimination value and the second benefit form a positive correlation; calculating a target benefit according to the first benefit and the second benefit; and adjusting the model parameters of the first interaction model to be trained according to the target benefit to obtain an updated first interaction model.
An interaction model training apparatus, the apparatus comprising: a first interaction data acquisition module, configured to acquire a first interaction state feature corresponding to a virtual interaction environment and acquire a first interaction action, wherein the first interaction action is determined by inputting the first interaction state feature into the first interaction model to be trained; a first benefit obtaining module, configured to obtain the benefit obtained by the target virtual object executing the first interaction action as a first benefit; a first policy discrimination value obtaining module, configured to input the first interaction state feature and the first interaction action into a target policy discrimination model corresponding to a target interaction policy, so as to obtain a first policy discrimination value; a second benefit obtaining module, configured to calculate a second benefit according to the first policy discrimination value, wherein the first policy discrimination value and the second benefit form a positive correlation; a target benefit obtaining module, configured to calculate a target benefit according to the first benefit and the second benefit; and a first interaction model parameter adjusting module, configured to adjust the model parameters of the first interaction model to be trained according to the target benefit to obtain an updated first interaction model.
In some embodiments, the target interaction policy is an interaction policy corresponding to a preset interaction user level, and the target interaction data obtaining module is configured to: acquiring an interactive action obtained according to the user operation of the preset interactive user level as a target interactive action; and acquiring the interaction state characteristic corresponding to the target interaction action as the target interaction state characteristic.
In some embodiments, the first interaction data acquisition module is to: acquiring a fight model corresponding to a first interaction model to be trained as a second interaction model; controlling the first interaction model to be trained and the second interaction model to interact in a virtual interaction environment to obtain interaction record data corresponding to the first interaction model; and acquiring a first interaction state characteristic and a first interaction action according to the interaction record data.
In some embodiments, the apparatus further comprises: and the entering module is used for taking the updated first interaction model as the first interaction model to be trained, entering a step of controlling the interaction between the first interaction model to be trained and the second interaction model in a virtual interaction environment to obtain interaction record data corresponding to the first interaction model until the updated first interaction model converges or the number of times of model training reaches a preset number of times.
In some embodiments, the first benefit obtaining module is configured to: acquire the state change corresponding to the virtual interaction environment before and after the target virtual object executes the first interaction action; and obtain the corresponding benefit according to the state change as the first benefit.
In some embodiments, the first interaction data acquisition module is to: acquiring interaction related data corresponding to a virtual interaction environment, wherein the interaction related data comprises object attribute data and object position data; obtaining attribute characteristics according to the object attribute data, and obtaining position characteristics according to the object position data; and combining the attribute characteristics and the position characteristics to obtain first interaction state characteristics.
A computer device comprising a memory and a processor, the memory storing a computer program, wherein the processor, when executing the computer program, implements the following steps: acquiring a first interaction state feature corresponding to a virtual interaction environment, and acquiring a first interaction action, wherein the first interaction action is determined by inputting the first interaction state feature into a first interaction model to be trained; obtaining the benefit obtained by the target virtual object executing the first interaction action as a first benefit; inputting the first interaction state feature and the first interaction action into a target strategy discrimination model corresponding to a target interaction strategy to obtain a first strategy discrimination value; calculating a second benefit according to the first strategy discrimination value, wherein the first strategy discrimination value and the second benefit form a positive correlation; calculating a target benefit according to the first benefit and the second benefit; and adjusting the model parameters of the first interaction model to be trained according to the target benefit to obtain an updated first interaction model.
A computer-readable storage medium, on which a computer program is stored, wherein the computer program, when executed by a processor, implements the following steps: acquiring a first interaction state feature corresponding to a virtual interaction environment, and acquiring a first interaction action, wherein the first interaction action is determined by inputting the first interaction state feature into a first interaction model to be trained; obtaining the benefit obtained by the target virtual object executing the first interaction action as a first benefit; inputting the first interaction state feature and the first interaction action into a target strategy discrimination model corresponding to a target interaction strategy to obtain a first strategy discrimination value; calculating a second benefit according to the first strategy discrimination value, wherein the first strategy discrimination value and the second benefit form a positive correlation; calculating a target benefit according to the first benefit and the second benefit; and adjusting the model parameters of the first interaction model to be trained according to the target benefit to obtain an updated first interaction model.
According to the interaction model training method and apparatus, the computer device, and the storage medium, a first interaction state feature and a first interaction action corresponding to the virtual interaction environment are acquired, wherein the first interaction action is determined by inputting the first interaction state feature into a first interaction model to be trained; the benefit obtained by the target virtual object executing the first interaction action is obtained as a first benefit; the first interaction state feature and the first interaction action are input into a target strategy discrimination model corresponding to a target interaction strategy to obtain a first strategy discrimination value; a second benefit is calculated according to the first strategy discrimination value, the first strategy discrimination value and the second benefit being in a positive correlation; a target benefit is calculated according to the first benefit and the second benefit; and the model parameters of the first interaction model to be trained are adjusted according to the target benefit to obtain an updated first interaction model. Since the first benefit is the benefit obtained by the target virtual object executing the first interaction action, it reflects the return brought by executing that action. The second benefit is calculated from the first strategy discrimination value and is positively correlated with it, and the first strategy discrimination value reflects whether executing the first interaction action conforms to the target interaction strategy in the state corresponding to the first interaction state feature. Therefore, by combining the first benefit and the second benefit into the target benefit and adjusting the model parameters according to the target benefit, the model parameters are adjusted in a direction that both conforms to the target interaction strategy and balances the return of the executed action, which improves the model training effect.
Drawings
FIG. 1 is a diagram of an application environment of a first interaction model trained in some embodiments;
FIG. 2 is a schematic flow chart diagram of a method of interactive model training in some embodiments;
FIG. 3 is a schematic interface diagram of game image frames in some embodiments;
FIG. 4 is a schematic diagram of interaction state features in some embodiments;
FIG. 5 is a schematic diagram of an interaction model training method in some embodiments;
FIG. 6 is a schematic diagram of a first interaction model playing against a second interaction model in some embodiments;
FIG. 7 is a schematic flow chart of a target strategy discrimination model obtained by training in some embodiments;
FIG. 8 is a schematic diagram of the training of a policy decision model in some embodiments;
FIG. 9 is a block diagram of an interaction model training apparatus in some embodiments;
FIG. 10 is a diagram of the internal structure of a computer device in some embodiments.
Detailed Description
In order to make the objects, technical solutions and advantages of the present application more apparent, the present application is described in further detail below with reference to the accompanying drawings and embodiments. It should be understood that the specific embodiments described herein are merely illustrative of the present application and are not intended to limit the present application.
Artificial Intelligence (AI) is a theory, method, technique and application system that uses a digital computer or a machine controlled by a digital computer to simulate, extend and expand human Intelligence, perceive the environment, acquire knowledge and use the knowledge to obtain the best results. In other words, artificial intelligence is a comprehensive technique of computer science that attempts to understand the essence of intelligence and produce a new intelligent machine that can react in a manner similar to human intelligence. Artificial intelligence is the research of the design principle and the realization method of various intelligent machines, so that the machines have the functions of perception, reasoning and decision making.
Artificial intelligence technology is a comprehensive discipline covering a wide range of fields, including both hardware-level and software-level technologies. Basic artificial intelligence technologies generally include technologies such as sensors, dedicated artificial intelligence chips, cloud computing, distributed storage, big data processing technologies, operation/interaction systems, and mechatronics. Artificial intelligence software technologies mainly include computer vision, speech processing, natural language processing, and machine learning/deep learning.

Machine Learning (ML) is a multi-disciplinary field involving probability theory, statistics, approximation theory, convex analysis, algorithmic complexity theory, and other disciplines. It studies how a computer can simulate or implement human learning behavior to acquire new knowledge or skills and reorganize existing knowledge structures so as to continuously improve its own performance. Machine learning is the core of artificial intelligence and the fundamental way to make computers intelligent, and it is applied in all fields of artificial intelligence. Machine learning and deep learning generally include techniques such as artificial neural networks, belief networks, reinforcement learning, transfer learning, inductive learning, and learning from demonstrations.
With the research and progress of artificial intelligence technology, the artificial intelligence technology is developed and applied in a plurality of fields, such as common smart homes, smart wearable devices, virtual assistants, smart speakers, smart marketing, unmanned driving, automatic driving, unmanned aerial vehicles, robots, smart medical care, smart customer service, and the like.
The solution provided by the embodiments of the present application relates to artificial intelligence technologies such as machine learning, and is described in detail through the following embodiments:
the model obtained through training by the interaction model training method provided in the embodiments of the present application can be applied to the application environment shown in FIG. 1 for interaction. The terminal 102 and the server 104 communicate via a network. The server 104 may obtain a trained first interaction model by using the interaction model training method provided in the embodiments of the present application; the trained first interaction model may be deployed in the server 104, and the server 104 may control the target virtual object to interact by using the deployed first interaction model. For example, the user can control a virtual object located in the virtual interaction environment to perform interaction actions by operating the terminal 102. When the user cannot control the virtual object, for example because the user needs to leave temporarily, a function of performing interaction by using the artificial intelligence model (which may also be referred to as a game hosting function) may be started. After detecting that this function is started, the server 104 may control the virtual object corresponding to the user to interact by using the deployed first interaction model. For example, the server 104 may obtain the current state feature corresponding to the virtual object, input the current state feature into the trained first interaction model, have the trained first interaction model output the current interaction action, and control the virtual object corresponding to the user to execute the current interaction action. It is to be understood that the trained first interaction model may also be deployed in any other computer device, such as the terminal 102.
In some embodiments, the server 104 may further receive a request, sent by the terminal 102, for fighting against the interaction model. The request may carry policy selection information, such as a policy identifier, corresponding to the target interaction policy; the server may obtain the first interaction model corresponding to the policy selection information and control the target virtual object to interact with the virtual object corresponding to the terminal 102 through that first interaction model. For example, when a game player wants to fight against a game AI, a novice-level ("rookie") play strategy can be selected on the terminal 102, a request for fighting against a game AI at the novice level is sent to the server 104 through the terminal 102, and the server 104 obtains the game model for the novice-level play strategy and interacts with the hero character controlled by the user. The terminal 102 may be, but is not limited to, various personal computers, notebook computers, smart phones, tablet computers, and portable wearable devices, and the server 104 may be implemented by an independent server or a server cluster formed by a plurality of servers.
In some embodiments, as shown in fig. 2, an interactive model training method is provided, which is described by taking the method as an example applied to the server in fig. 1, and includes the following steps:
step S202, a first interaction state characteristic corresponding to the virtual interaction environment is obtained, and a first interaction action is obtained, wherein the first interaction action is determined by inputting the first interaction state characteristic into a first interaction model to be trained.
The virtual interactive environment is an environment where virtual objects interact with each other, and may be a two-dimensional interactive environment or a three-dimensional interactive environment. For example, when an application program runs, a virtual interactive environment in which virtual objects interact can be displayed through a screen. As a practical example, when the game application runs, an image frame can be displayed, and the image frame is used for representing the environment where the hero character is located, so that the game player can know the current state of the hero character.
The virtual objects are active entities in the virtual interactive environment and can be controlled by an intelligent system or a person through computer equipment. For example, the virtual object may be a character virtualized in a game application, and the virtual object may be three-dimensional or two-dimensional, and may be a character virtual object or an animal virtual object. For example, the virtual object may be an hero character or a soldier in an MOBA game. The virtual objects can be divided into a plurality of types according to the groups, and the plurality refers to at least two. For example, the types of hero characters may include both my hero and enemy hero types. The hero of my party refers to hero cooperatively fighting with hero figures controlled by the first interactive model to be trained, and the fighting targets are the same. The enemy hero refers to hero which is confronted with hero figures controlled by the first interactive model to be trained, and the fighting target is opposite. For example, the fighting target of the hero of our party is to destroy the crystal of the hero of the enemy party, and the fighting target of the hero of the enemy party is to destroy the crystal of the hero of our party. In the embodiment of the present application, the target virtual object is a virtual object controlled by the first interaction model, that is, a virtual object that performs an interaction action output by the first interaction model.
The state features may be used to characterize the corresponding state. The state feature can be obtained by feature extraction according to the interaction related data. The interaction-related data may comprise, for example, at least one of attribute data or location data corresponding to the virtual object. For example, the attribute data of the virtual object may include one or more of a rating of the virtual object, a virtual object life value attribute such as blood volume of hero in the game, skill information of the virtual object, or an offensive power of the virtual object. The position data may include position data of each virtual object in an image frame where the target virtual object is located, or position data corresponding to each virtual object in a global map (also referred to as a minimap) corresponding to the virtual interactive environment. The target virtual object refers to a virtual object to be controlled by the first interaction model to be trained.
For example, for a game, FIG. 3 illustrates an interface diagram of a game image frame in some embodiments. One game image frame may include a minimap display area 302, a current environment display area 304, and an attribute information display area 306. The minimap display area 302 displays the global situation, the current environment display area 304 displays the situation within the field of view of the target virtual object, and the attribute information display area 306 can display the attributes of each virtual object, such as blood volume and attack power. The interaction related data corresponding to the game image frame, namely the situation and attribute information, are stored in the server, and the server can perform feature extraction on the interaction related data corresponding to the image frame to obtain the state features. By extracting features from the interaction related data, the game state space of a highly complex virtual interaction environment can be captured effectively while reducing the complexity of the state features. Since the minimap display area 302 and the current environment display area 304 display the environment and the positions where the virtual objects are located, the server may obtain a position feature from the position-related data corresponding to the image frame. The position feature is similar to an image feature and may also be referred to as an image-like feature; the size of the image-like feature map may be set according to the computing resources of the server and the required accuracy of the model, and may be, for example, 12 × 12 pixels. The attribute information displayed in the attribute information display area 306 can be represented by the server in the form of a one-hot encoded vector.
As a practical example, assume that the virtual interaction environment represented by the game image frame includes soldiers, monsters, defense towers, heroes, obstacles, bullets, and the like, and that heroes are divided into enemy heroes and allied ("my") heroes, with 5 heroes on each side. The positions corresponding to the soldiers, the monsters, the defense towers, the 5 enemy heroes, the 5 allied heroes, the obstacles and the bullets can be obtained respectively to derive position features. The position features corresponding to the enemy heroes can be combined, and the position features corresponding to the allied heroes can be combined; a feature obtained by combining the position features of one side represents the positional distribution of that side's heroes. For the attributes, vector features obtained from the attributes of the virtual objects can be concatenated. As shown in FIG. 4, the obtained interaction state features may include a minimap image-like feature extracted from the interaction related data corresponding to the minimap display area 302, a current-view image-like feature extracted from the interaction related data corresponding to the current environment display area 304, and a vector feature extracted from the interaction related data corresponding to the attribute information display area 306. The global information denotes a vector obtained from the attribute data of objects other than hero characters, such as moving units or obstacles.
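As an illustration of how the image-like position channels and attribute vectors described above might be assembled, the following is a minimal Python sketch. The 12 × 12 grid, the object groups, and all field names (e.g. `env_snapshot`, `ally_hero_positions`, `hp`) are illustrative assumptions, not data structures defined by the patent.

```python
import numpy as np

GRID = 12  # image-like feature resolution, e.g. 12 x 12

def rasterize(positions, grid=GRID):
    """Mark each (x, y) position, normalized to [0, 1), on one grid channel."""
    channel = np.zeros((grid, grid), dtype=np.float32)
    for x, y in positions:
        channel[min(int(y * grid), grid - 1), min(int(x * grid), grid - 1)] = 1.0
    return channel

def build_state_feature(env_snapshot):
    # One channel per object group; each side's heroes share a channel so the
    # channel reflects that side's positional distribution.
    position_channels = np.stack([
        rasterize(env_snapshot["ally_hero_positions"]),
        rasterize(env_snapshot["enemy_hero_positions"]),
        rasterize(env_snapshot["soldier_positions"]),
        rasterize(env_snapshot["tower_positions"]),
    ])
    # Attribute vector: per-hero scalar attributes concatenated with a one-hot
    # encoding of the hero type, then spliced across heroes.
    parts = []
    for hero in env_snapshot["heroes"]:
        one_hot = np.zeros(env_snapshot["num_hero_types"], dtype=np.float32)
        one_hot[hero["type_id"]] = 1.0
        scalars = np.array([hero["hp"], hero["attack"], hero["level"]],
                           dtype=np.float32)
        parts.append(np.concatenate([scalars, one_hot]))
    attribute_vector = np.concatenate(parts)
    return position_channels, attribute_vector
```

The two returned pieces correspond to the image-like position features and the vector features of FIG. 4 and can be combined as the first interaction state feature.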
An interaction model refers to a machine learning model used to determine the interaction actions of virtual objects. The first interaction model to be trained refers to a model whose parameters still need to be adjusted through model training; the initial model parameters of the first interaction model to be trained may be randomly selected, or may be parameters that have already been adjusted one or more times using the model training method provided in the embodiments of the present application or other training methods. The first interaction model may be used to determine the interaction actions of one or more virtual objects. The first interaction model may be, for example, a reinforcement learning model, and may be at least one of a state value model or an action value model; the state value model employs a state-value function, and the action value model employs an action-value function. For the action value model, the interaction state features are input into the action value model to obtain the probability of each action being selected, and the action with the highest probability may be selected as the action to be executed by the controlled virtual object, i.e., the best action. When the interaction state features are input into the state value model, the value of the state corresponding to the interaction state features under each candidate action can be obtained, and the action with the maximum state value may be selected as the action to be executed by the controlled virtual object. Reinforcement learning proceeds by "trial and error": the updating of the model parameters is guided by the reward obtained by interacting with the environment, with the goal of obtaining the best return after the action is performed, i.e., maximizing the reward of the controlled virtual object. The reinforcement learning model may be a DQN (Deep Q-Network) model, an A3C (Asynchronous Advantage Actor-Critic) model, or an UNREAL (UNsupervised REinforcement and Auxiliary Learning) model, and may be chosen as needed.
An interactive action refers to an action performed by a virtual object when interacting, and the interactive action may act on the virtual object itself that issued the action or other virtual objects, for example, may act on enemy hero. The interaction may include a movement, an attack action, or an evasion action, etc. For example, the interaction may be releasing anti-skill.
The server may let the first interaction model to be trained interact with a person or with other interaction models to obtain the first interaction state feature and the first interaction action corresponding to the virtual interaction environment. For example, before an action is taken, the interaction state feature representing the current interaction state is acquired and input into the first interaction model to be trained; according to the probability values of the actions output by the first interaction model, the action with the highest probability can be selected as the action to be executed currently. The server controls the target virtual object to execute the action, and the server can generate interaction record data in which the correspondence between the interaction related state and the interaction action is recorded. After each round of data is returned, the interaction related data and the corresponding interaction action in the interaction record data are obtained; the interaction state feature obtained from the interaction related data is used as the first interaction state feature, and the interaction action corresponding to the interaction related data is used as the first interaction action.
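A minimal sketch of this select-highest-probability-action-and-record step is given below; `model.predict` and the layout of the interaction record are assumed interfaces used only for illustration.

```python
import numpy as np

def choose_and_record(model, position_channels, attribute_vector, interaction_log):
    """Query the first interaction model to be trained for action probabilities,
    pick the highest-probability action, and log the (state, action) pair so it
    can later be read back as the first interaction state feature and the first
    interaction action."""
    action_probs = model.predict(position_channels, attribute_vector)  # assumed API
    action = int(np.argmax(action_probs))
    interaction_log.append({
        "position_channels": position_channels,
        "attribute_vector": attribute_vector,
        "action": action,
    })
    return action
```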
In some embodiments, obtaining the first interaction state feature may include: acquiring interaction related data corresponding to a virtual interaction environment, wherein the interaction related data comprises object attribute data and object position data; obtaining attribute characteristics according to the object attribute data, and obtaining position characteristics according to the object position data; and combining the attribute features and the position features to obtain first interaction state features.
Specifically, the first interaction state feature includes a feature obtained by combining the attribute feature and the position feature, for example, the attribute feature and the position feature may be combined to obtain a combined feature, and one combined feature is an independent feature. Attribute features are features derived from attributes. Location features are features derived from location. In the combination, the attribute features and the position features of different virtual objects may be combined, or the attribute features and the position features of the same virtual object may be combined. The first interactive model is trained through the combined features obtained by combining the attribute features and the position features, so that the trained first interactive model can synthesize the combination of specific positions and attributes, output interactive actions and improve the intelligent degree of the model. For example, the position feature may be a position distribution feature of my hero, the attribute feature is an attribute feature of my hero, and for a real professional-level game player, the position distribution of my hero in a global situation state, the blood volume and the attack power of my hero need to be comprehensively considered, and then an action to be executed by the hero controlled by the player is determined, so that the idea of determining the action by the real game player is simulated by combining the attribute feature and the position feature, so that a trained model is more intelligent.
Step S204, obtaining the benefit obtained by the target virtual object executing the first interaction action as the first benefit.
Specifically, the benefit (reward) is feedback for performing the interaction action in the state corresponding to the interaction state feature, and it may be positive or negative. The benefit can therefore be used to evaluate the effect of the action, i.e., how good or bad performing the action was, as the environment's feedback on the action. The first benefit may be derived from the instant benefit, which is the reward that can be obtained immediately after the action is performed. The instant benefit may be positive or negative, and how it is calculated can be set according to actual needs, for example based on at least one of the scoring logic or the upgrade logic of the game. For example, a correspondence between state changes and instant benefits may be set; the state change corresponding to the virtual interaction environment before and after the target virtual object executes the first interaction action is acquired, and the corresponding benefit obtained from that state change is used as the first benefit. A state change refers to the change of the state of the virtual interaction environment before and after the action is performed, that is, the change of the environment caused by executing the first interaction action. The state change may include, for example, at least one of a change in hero experience value, a change in money, a change in blood volume, a change in life state, or a change in the blood volume of a building. The benefit weights for the various state changes may be set as desired. For example, if the blood volume of the target virtual object is 12 before the first interaction action is performed and 20 afterwards, the state change is an increase of 8 in blood volume, and the corresponding instant benefit may be 12. For another example, if the blood volume of the target virtual object is 12 before the first interaction action is performed and 6 afterwards, the state change is a decrease of 8 in blood volume, and the corresponding instant benefit may be -6. For another example, if the first interaction action is an attack, and an enemy hero was within the field of view of the allied hero before the first interaction action was executed but is no longer within that field of view afterwards, i.e., the enemy hero has escaped, the corresponding instant benefit may be 30.
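The mapping from state changes to an instant benefit could be expressed as a simple weighted sum, as in the following sketch; the weights, field names, and the bonus for an enemy hero leaving the field of view are illustrative assumptions, not values prescribed by the patent.

```python
# Illustrative benefit weights for individual state changes.
REWARD_WEIGHTS = {
    "hp_delta": 1.5,        # per point of blood volume gained (negative if lost)
    "exp_delta": 0.1,       # per point of experience gained
    "gold_delta": 0.05,     # per unit of money gained
    "enemy_escaped": 30.0,  # enemy hero left the field of view after an attack
}

def immediate_benefit(state_before, state_after):
    """Instant benefit derived from the environment state change caused by
    executing the first interaction action."""
    reward = 0.0
    reward += REWARD_WEIGHTS["hp_delta"] * (state_after["hp"] - state_before["hp"])
    reward += REWARD_WEIGHTS["exp_delta"] * (state_after["exp"] - state_before["exp"])
    reward += REWARD_WEIGHTS["gold_delta"] * (state_after["gold"] - state_before["gold"])
    if state_before["enemy_in_view"] and not state_after["enemy_in_view"]:
        reward += REWARD_WEIGHTS["enemy_escaped"]
    return reward
```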
In some embodiments, the return obtained by performing the first interaction action depends on the instant benefits at all future times, as shown in equation (1), where G_t is the return obtained by executing the interaction action in the state at time t and can be used as the first benefit value, R_{t+1} denotes the instant benefit obtained by performing the interaction action in the state at time t+1, k denotes the distance between a future time and the current time (for example, k = 1 when the future time is the time step immediately after the current one), and λ is a discount factor, generally smaller than 1, which may be set as required. The discount factor expresses that the feedback at the current time is generally the most important: the farther a future time is from the current time, the smaller the influence of its instant benefit on the return obtained by performing the interaction action at the current time.

G_t = R_{t+1} + λ·R_{t+2} + λ²·R_{t+3} + … = Σ_{k=0}^{∞} λ^k · R_{t+k+1}    (1)

In some embodiments, since the instant benefits at all future times cannot be obtained unless the whole game has reached its end state, the return of performing the interaction action at the current time is calculated with the Bellman equation. The first benefit can thus be calculated from the Bellman equation, which relates the return to the instant benefit of the currently performed first interaction action and the value (return benefit) of the state at the next time computed with a value function. The value function may be a state value function. The Bellman equation is shown in formula (2), where v denotes the value function, s denotes a state, E denotes the expectation, t denotes time t, and S_t denotes the state at time t. With the Bellman equation, the return corresponding to a given situation state can be calculated without relying on the game kernel to play the game out to its end.

v(s) = E[ R_{t+1} + λ·v(S_{t+1}) | S_t = s ]    (2)
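Expressed as code, equation (1) (the discounted sum of instant benefits) and the one-step bootstrap form of equation (2) might be computed as in the following sketch; the discount value is an arbitrary illustrative choice.

```python
def discounted_return(instant_benefits, discount=0.99):
    """Equation (1): G_t = sum_k discount^k * R_{t+k+1}, computed backwards
    over a finished sequence of instant benefits."""
    returns, g = [], 0.0
    for r in reversed(instant_benefits):
        g = r + discount * g
        returns.append(g)
    return list(reversed(returns))

def bellman_target(instant_benefit, next_state_value, discount=0.99):
    """Equation (2) used as a one-step target: R_{t+1} + lambda * v(S_{t+1}),
    so the return can be estimated without playing the game to its end."""
    return instant_benefit + discount * next_state_value
```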
Step S206, inputting the first interaction state characteristic and the first interaction action into a target strategy discrimination model corresponding to the target interaction strategy to obtain a first strategy discrimination value.
Specifically, an interaction policy is a policy used to guide interaction, and different policies may have different characteristics. Interaction policies may be divided according to one or more of the level of the interacting users or the tendency of the interaction, where "a plurality of" means at least two. An interaction policy may be an abstract concept. For example, a game player A may not make an explicit plan when playing, yet player A's style of play has certain characteristics, and that style can be considered to be carried out under the guidance of a certain policy. For another example, professional-level game players generally play differently from novice-level game players, so the interaction strategy of professional-level players may be regarded as one policy and that of novice-level players as another. Likewise, the style of play generally differs between an aggressive game player and an evasive game player: the aggressive player tends to attack, whereas the evasive player is more inclined to avoid confrontation. The interaction policy of the aggressive type of game player can therefore be regarded as one policy, and the interaction policy of the evasive type of game player as another. A target interaction policy may refer to any such policy, for example the strategy of an expert player.
The discrimination model is used to judge the degree to which input information conforms to a predetermined condition, and it may be a deep neural network model. The output of the discrimination model may be a probability, ranging from 0 to 1; the greater the probability, the better the input information conforms to the predetermined condition. In the state represented by the first interaction state feature, performing the first interaction action may be considered to be determined under the guidance of a first interaction policy. The target policy discrimination model corresponding to the target interaction policy is used to judge whether the first interaction policy under which the first interaction action is executed conforms to the target interaction policy in the state represented by the first interaction state feature. That is, the first policy discrimination value indicates the degree of conformity between the executed first interaction action and the target interaction policy in the state indicated by the first interaction state feature, and it may be represented by a probability. The first policy discrimination value is positively correlated with the degree of conformity.
The target policy discrimination model can be obtained by training on target interaction data corresponding to the target interaction policy. The target interaction data comprise target interaction actions and corresponding target interaction states, and a target interaction action is an action that conforms to the target interaction policy in the state corresponding to (represented by) the target interaction state feature. For example, if the discrimination model is intended to determine whether input data conforms to the game strategy of an expert-level player, the actions input by expert-level players during game play and the state features corresponding to the game environment when those actions were input can be acquired as training data to train the policy discrimination model. When the policy discrimination model is trained using training data that conforms to the target interaction policy, the model parameters are adjusted in the direction that increases the discrimination value output by the policy discrimination model.
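A possible form of such a target policy discrimination model, as a deep neural network taking a state feature and a one-hot action and emitting a probability in (0, 1), is sketched below in PyTorch; the layer sizes and the flat-vector input format are assumptions for illustration.

```python
import torch
import torch.nn as nn

class PolicyDiscriminator(nn.Module):
    """Sketch of a target policy discrimination model: given an interaction
    state feature and a one-hot interaction action, output the probability
    that the pair conforms to the target interaction policy."""

    def __init__(self, state_dim, num_actions, hidden=256):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(state_dim + num_actions, hidden),
            nn.ReLU(),
            nn.Linear(hidden, hidden),
            nn.ReLU(),
            nn.Linear(hidden, 1),
            nn.Sigmoid(),  # first policy discrimination value in (0, 1)
        )

    def forward(self, state_feature, action_one_hot):
        return self.net(torch.cat([state_feature, action_one_hot], dim=-1))
```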
In some embodiments, the strategy judgment model may be trained in advance, or alternatively trained with the first interaction model.
In the embodiment of the present application, the negative correlation relationship means: the two variables have different changing directions, and when one variable changes from large to small, the other variable changes from small to large. The positive correlation relationship means that: the two variables have the same change direction, and when one variable changes from large to small, the other variable also changes from large to small. It is understood that a positive correlation herein means that the direction of change is consistent, but does not require that when one variable changes at all, another variable must also change. For example, it may be set that the variable b is 100 when the variable a is 10 to 20, and the variable b is 120 when the variable a is 20 to 30. Thus, the change directions of a and b are both such that when a is larger, b is also larger. But b may be unchanged in the range of 10 to 20 a.
And step S208, calculating to obtain a second benefit according to the first strategy judgment value.
Specifically, the second benefit is obtained according to the first policy decision value, and a relationship between the policy decision value and the benefit may be preset, where the second benefit and the first policy decision value have a positive correlation. The correspondence between the first policy discrimination value and the second benefit may be preset as needed. For example, when the first policy discrimination value is set to 0.9, the second benefit is 10. When the first policy discrimination value is 0.8, the second benefit is 6.
In some embodiments, when the first policy discrimination value is smaller than a first preset threshold, the second benefit is a negative benefit, i.e., a negative value; when the first policy discrimination value is greater than a second preset threshold, the second benefit is a positive benefit. The first preset threshold and the second preset threshold can be set as required, with the second preset threshold greater than or equal to the first preset threshold; for example, the first preset threshold may be 0.6 and the second preset threshold 0.8. For a reinforcement learning model, the training objective is to maximize the benefit, and a small first policy discrimination value means that, in the state corresponding to the first interaction state feature, executing the first interaction action is unlikely to conform to the target interaction policy. Therefore, by setting the second benefit to a negative value when the first policy discrimination value is smaller than the first preset threshold, the probability that the trained reinforcement learning model outputs the first interaction action as the action to be executed in the state corresponding to the first interaction state feature can be reduced.
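One way to realize this threshold-based, positively correlated mapping from the first policy discrimination value to the second benefit is sketched below; the thresholds are the illustrative 0.6 / 0.8 values mentioned above, and the scaling factor is an added assumption.

```python
def second_benefit(discrimination_value, low=0.6, high=0.8, scale=10.0):
    """Monotonically non-decreasing mapping: negative benefit below the first
    threshold, zero in between, positive benefit above the second threshold."""
    if discrimination_value < low:
        return -scale * (low - discrimination_value)
    if discrimination_value > high:
        return scale * (discrimination_value - high)
    return 0.0
```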
And step S210, calculating to obtain a target benefit according to the first benefit and the second benefit.
Specifically, the first benefit and the second benefit may be added to obtain the target benefit. Or obtaining weights corresponding to the first profit and the second profit respectively, and performing weighted summation to obtain the target profit. Of course, other profit values may be combined to obtain the target profit.
In some embodiments, when the first interaction model is trained, one training may be performed by using a batch of training samples, for example, a plurality of first interaction states and corresponding first interaction actions may be obtained. And calculating the average value of the target gains corresponding to the batch of training samples as the final target gain.
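Combining the two benefits by weighted summation and averaging over a training batch might then look like this minimal sketch (the weights are assumptions):

```python
def target_benefit(first_benefits, second_benefits, w1=1.0, w2=1.0):
    """Weighted combination of first and second benefits, averaged over a
    batch of training samples to give the final target benefit."""
    per_sample = [w1 * fb + w2 * sb
                  for fb, sb in zip(first_benefits, second_benefits)]
    return sum(per_sample) / len(per_sample)
```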
Step S212, adjusting the model parameters of the first interaction model to be trained according to the target benefit to obtain the updated first interaction model.
Specifically, the target benefit may serve as either positive or negative feedback for the adjustment of the model parameters. If it is positive feedback, the model parameters may be adjusted so that the tendency to select the first interaction action in the state corresponding to the first interaction state feature is strengthened. If it is negative feedback, the model parameters may be adjusted so that the tendency to select the first interaction action in that state is weakened. The method for adjusting the model parameters may be set as required; for example, the Proximal Policy Optimization (PPO) algorithm, A3C, or DDPG (Deep Deterministic Policy Gradient) may be used.
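As an illustration of the direction of this parameter adjustment, the following PyTorch sketch applies a plain policy-gradient update scaled by the target benefit; a production system would more likely use PPO's clipped surrogate objective, and the model interface (logits over discrete actions) is an assumption.

```python
import torch

def update_first_interaction_model(model, optimizer, state_features,
                                   actions, target_benefits):
    """Increase the log-probability of the executed first interaction action in
    proportion to the (possibly negative) target benefit: a positive target
    benefit strengthens the tendency to select the action, a negative one
    weakens it."""
    logits = model(state_features)                       # (batch, num_actions)
    log_probs = torch.log_softmax(logits, dim=-1)
    chosen = log_probs.gather(1, actions.unsqueeze(1)).squeeze(1)
    loss = -(chosen * target_benefits).mean()
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()
```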
FIG. 5 is a schematic diagram illustrating a method for interactive model training in some embodiments. The first interaction model may be a neural network model, the server may obtain game state data corresponding to the game image to obtain a first interaction state feature, and input the first interaction state feature into the first interaction model 502, where the first interaction model 502 includes an input layer, a hidden layer, and an output layer, and the output layer outputs the first interaction action. The first interaction acts on the game environment 506 causing a state change of the game environment 506, and the server may derive a first benefit value based on the state change. The server may further obtain the first interaction state feature and the first interaction action, input the first interaction state feature and the first interaction action into the policy decision model 504, the policy decision model 504 outputs a first policy decision value, and the server may return a second profit value according to the first policy decision value. The server obtains a target profit value according to the first profit value and the second profit value, and adjusts the model parameters of the first interaction model 502 by using the target profit value.
According to the interaction model training method and apparatus, the computer device, and the storage medium, a first interaction state feature and a first interaction action corresponding to the virtual interaction environment are acquired, wherein the first interaction action is determined by inputting the first interaction state feature into a first interaction model to be trained; the benefit obtained by the target virtual object executing the first interaction action is obtained as a first benefit; the first interaction state feature and the first interaction action are input into a target strategy discrimination model corresponding to a target interaction strategy to obtain a first strategy discrimination value; a second benefit is calculated according to the first strategy discrimination value, the first strategy discrimination value and the second benefit being in a positive correlation; a target benefit is calculated according to the first benefit and the second benefit; and the model parameters of the first interaction model to be trained are adjusted according to the target benefit to obtain an updated first interaction model. Since the first benefit is the benefit obtained by the target virtual object executing the first interaction action, it reflects the return brought by executing that action. The second benefit is calculated from the first strategy discrimination value and is positively correlated with it, and the first strategy discrimination value reflects whether executing the first interaction action conforms to the target interaction strategy in the state corresponding to the first interaction state feature. Therefore, by combining the first benefit and the second benefit into the target benefit and adjusting the model parameters according to the target benefit, the model parameters are adjusted in a direction that both conforms to the target interaction strategy and balances the return of the executed action, which improves the model training effect.
In some embodiments, obtaining the first interaction state feature and the first interaction action corresponding to the virtual interaction environment comprises: acquiring a fight model corresponding to a first interaction model to be trained as a second interaction model; controlling a first interaction model to be trained and a second interaction model to interact in a virtual interaction environment to obtain interaction record data corresponding to the first interaction model; and acquiring a first interaction state characteristic and a first interaction action according to the interaction record data.
Specifically, the second interaction model is a model that interacts with the first interaction model through the virtual objects they respectively control. For example, the first interaction model is a model that outputs the actions of the allied heroes, and the second interaction model is a model that outputs the actions of the enemy heroes. There may be one or more second interaction models. As a practical example, the first interaction model to be trained outputs an attack action, and the server controls the allied hero to execute that interaction action to attack an enemy hero; the second interaction model may then output a corresponding counterattack action to control the enemy hero to fight back. The interaction action output by the first interaction model to be trained (called the first interaction action) and the interaction state feature referred to when that action was determined (the first interaction state feature) are recorded in the interaction record data corresponding to the first interaction model, so the first interaction state feature and the first interaction action can be obtained from the interaction record data. The first interaction model to be trained and the second interaction model can interact multiple times in the virtual interaction environment, for example until one game is completed. In the embodiments of the present application, because a battle model can be selected to interact with the first interaction model to be trained, that is, the models play against each other (self-play), the first interaction state feature and the first interaction action can be acquired automatically. This improves the efficiency of acquiring training data, makes it possible to generate training data that does not depend on human players, and allows the neural network model to improve its fighting capability from scratch rapidly and efficiently through training of the first interaction model.
In some embodiments, a historical version corresponding to the first interaction model to be trained may be used as the second interaction model. For example, a historical version of the first interaction model may be randomly selected as the second interaction model. For example, when the first interaction model is trained in the third round, the first interaction model obtained from the first round of training can be used as the second interaction model. In this way, no additional training of the second interaction model is required.
In some embodiments, a multi-container (Docker) image approach can be used so that battles between the interaction models can be rapidly scaled out to a plurality of machines in parallel, which improves the efficiency of generating interaction record data and provides enough battle data to train the first interaction model. For example, a plurality of Docker images may be created, and within each container a first interaction model is paired against a second interaction model, as shown in FIG. 6, which is a schematic diagram of the first interaction model playing against the second interaction model in some embodiments. A battle model pool can be established, containing a plurality of historical-version models corresponding to the first interaction model to be trained. Each container may select one or more of the models from the battle model pool as the second interaction model. The first interaction model outputs a first interaction action according to the interaction state features in the virtual interaction environment, and the second interaction model outputs a second interaction action according to the interaction state features in the virtual interaction environment. As the state of the virtual interaction environment changes, the first interaction model and the second interaction model continuously output actions, and the server controls the corresponding virtual objects to execute those actions so as to carry out the battle.
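A single self-play match against an opponent sampled from the battle model pool might be driven as in the sketch below; the environment and model interfaces (`env.reset`, `env.step`, `act`) are assumed, and in practice many such matches would run in parallel, e.g. one per Docker container.

```python
import random

def run_selfplay_match(env, first_model, battle_model_pool):
    """Play one match: a historical version sampled from the battle model pool
    acts as the second interaction model, and the first interaction model's
    (state, action) pairs are recorded as interaction record data."""
    second_model = random.choice(battle_model_pool)
    interaction_log = []
    state, done = env.reset(), False
    while not done:
        a1 = first_model.act(state["ally_view"])    # first interaction action
        a2 = second_model.act(state["enemy_view"])  # second interaction action
        interaction_log.append({"state": state["ally_view"], "action": a1})
        state, done = env.step(a1, a2)
    return interaction_log
```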
In some embodiments, the updated first interaction model may be used as the first interaction model to be trained, and the step of controlling the first interaction model and the second interaction model to battle multiple times in the virtual interaction environment to obtain battle record data corresponding to the first interaction model is performed again, until the updated first interaction model converges or the number of model training iterations reaches a preset number.
Specifically, the model convergence condition may be at least one of the model loss value being smaller than a preset loss value or the change of the model parameters being smaller than a preset parameter change value. The first interaction model may be trained multiple times to improve the accuracy of the action output by the first interaction model. For example, the updated first interaction model may be used as a new first interaction model to be trained, and the first interaction model to be trained interacts with the second interaction model to obtain new training data, namely new first interaction state features and first interaction actions, so as to continuously adjust the model parameters of the first interaction model; training stops when the updated first interaction model converges or a preset number of iterations is reached. The preset number may be set as needed, for example 10,000 iterations.
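A short sketch of this outer training loop follows, with the stopping condition expressed as "loss below a preset value or a preset number of iterations reached". The helper names `generate_battle_data` and `update_model`, and the concrete convergence test, are assumptions standing in for the self-play and parameter-update steps described above.

```python
def train_first_interaction_model(model, generate_battle_data, update_model,
                                  max_iterations=10_000, loss_threshold=1e-3):
    """Iteratively train the first interaction model until it converges
    or the preset number of training iterations is reached."""
    for iteration in range(max_iterations):
        batch = generate_battle_data(model)   # interact with the second interaction model
        loss = update_model(model, batch)     # adjust parameters toward a lower loss
        if loss < loss_threshold:             # assumed form of the convergence condition
            break
    return model
```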
In some embodiments, the strategy discrimination model may also be trained, and the training of the strategy discrimination model and the training of the first interaction model may be performed alternately or simultaneously. For example, the first interaction model may be iteratively trained a first preset number of times, then the strategy discrimination model may be iteratively trained a second preset number of times, and then the step of iteratively training the first interaction model a first preset number of times is performed again, alternating in this way until the first interaction model converges.
In some embodiments, as shown in fig. 7, the step of training to obtain the target strategy discriminant model includes:
step S702, a target interaction action and a target interaction state characteristic corresponding to the target interaction action are obtained, and the target interaction action is taken as an interaction action which accords with a target interaction strategy under the state corresponding to the target interaction state characteristic.
Specifically, the target interaction state feature corresponding to the target interaction action means that the target interaction action is executed in the state represented by the target interaction state feature. That the target interaction action serves as the interaction action conforming to the target interaction strategy in the state corresponding to the target interaction state feature means that, in the state represented by the target interaction state feature, the action executed under the guidance of the target interaction strategy is the target interaction action.
The target interaction action and the target interaction state feature are used as training data to train the strategy discrimination model. The server may store training data conforming to the target interaction strategy in advance. As described above, the target interaction strategy may be an abstract concept, so whether the training data satisfies the target interaction strategy may be determined manually; for example, the training data may correspond to the play strategy of one class of real users or of one real user, or to a play strategy obtained by mixing the play strategies of multiple classes of real users. For example, if the training target of the first interaction model is to learn the play style of a certain game player, user A, the game operation data and corresponding state features of user A can be acquired as the training data, and the target interaction strategy is the play strategy of user A. If the training target of the first interaction model is the play style of competitive game players, the game operation data and corresponding state features of competitive game players can be acquired as the training data, and the target interaction strategy is the competitive play strategy.
In some embodiments, the corresponding state features and actions may be extracted from historical match data generated in normal games played by game players on the server, and used as the target interaction actions and the target interaction state features corresponding to the target interaction actions. The game players may be of a preset type, and there may be multiple preset types. When there are multiple types, the proportion of the training data corresponding to each type may be preset, and the training data acquired according to that proportion. For example, if the ratio of aggressive players to evasive players is 6:4 and 10,000 training samples are required, 6,000 match records of aggressive players and 4,000 match records of evasive players are acquired.
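As a small worked example of the proportion-based acquisition just described, the following sketch splits a required sample count across preset player types; the type names are illustrative.

```python
def split_sample_counts(total, ratios):
    """Split a required number of training samples across player types
    according to preset proportions, e.g. aggressive:evasive = 6:4."""
    s = sum(ratios.values())
    return {ptype: total * r // s for ptype, r in ratios.items()}

# Example from the text: 10,000 samples at a 6:4 ratio.
print(split_sample_counts(10000, {"aggressive": 6, "evasive": 4}))
# {'aggressive': 6000, 'evasive': 4000}
```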
In some embodiments, the target interaction strategy is an interaction strategy corresponding to a preset interactive user level, and obtaining the target interaction action and the target interaction state feature corresponding to the target interaction action includes: acquiring an interaction action executed by a target virtual object as the target interaction action, wherein the interaction action executed by the target virtual object is determined by a user operation at the preset interactive user level; and acquiring the interaction state feature corresponding to the target interaction action as the target interaction state feature.
Specifically, the interactive user level refers to the level of an interactive user. For example, the game level of a game user may be level one, level two, or level three. The game level is determined according to the level-up strategy of the particular game; for example, the level a user can reach may be determined according to the length of time spent playing and the win rate. The longer the play time and the higher the win rate, the higher the game level. The preset interactive user level can be set as needed and can be one or more levels, for example level one, or levels one and two. The target interaction action is obtained according to a user operation at the preset interactive user level, that is, it is controlled by a user through manual operation, and the user operation may be one or more of a voice operation, a mouse operation, a keyboard operation, or input through a game controller joystick.
The target interaction state feature corresponding to the target interaction action may be obtained according to the interaction-related data of the virtual interaction environment at the moment the target interaction action is performed, for example, according to at least one of the attribute data and the position data of each virtual object at that moment. Reference may be made to the step of obtaining the first interaction state feature.
In the embodiment of the application, the interactive action executed by the target virtual object is controlled by the user operation at the preset interactive user level, so that the interactive data of the player at the preset interactive user level is acquired to train the model, and the trained model can imitate the interactive strategy of the player at the preset interactive user level. For example, if the preset interactive user level is professional level, the game match action of a professional game player may be acquired, and the game state feature in the game when the professional game player outputs the game match action may be acquired as the state feature corresponding to the game match action. And forming a state feature pair by the fighting action and the state feature, and inputting the state feature pair into a strategy discrimination model as training data for training. Through multiple rounds of training, a game AI with a game playing strategy similar to that of a professional-level game player can be obtained.
In some embodiments, the target interaction state feature may also be a feature obtained by combining an attribute feature and a position feature: the attribute feature and the position feature are combined to obtain a combined feature for training. In this way, the correspondence between the combined state of specific positions and attributes and the interaction action conforming to the target interaction strategy can be mined better during training.
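The combination of attribute and position features could look like the following sketch, where the concrete attribute and position values are placeholders assumed for illustration.

```python
import numpy as np

def build_combined_state_feature(attribute_data, position_data):
    """Turn object attribute data and object position data into feature
    vectors and concatenate them into one combined state feature."""
    attribute_feature = np.asarray(attribute_data, dtype=np.float32)  # e.g. health ratio, level
    position_feature = np.asarray(position_data, dtype=np.float32)    # e.g. (x, y) coordinates
    return np.concatenate([attribute_feature, position_feature])

state_feature = build_combined_state_feature([0.8, 3.0], [12.5, 7.0])
```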
Step S704, model training is carried out according to the target interaction state characteristics and the target interaction actions, and a target strategy judgment model is obtained.
Specifically, during training, since the target interaction state feature and the target interaction action conform to the target interaction strategy, the training objective is for the strategy discrimination model to output a discrimination value that is as high as possible, for example as close to 1 as possible. During model training, a model loss value may be obtained, and the model parameters of the strategy discrimination model may be adjusted in the direction that decreases the model loss value, for example by stochastic gradient descent.
In some embodiments, the target interaction state feature and the target interaction action may be input into a policy decision model to be trained to obtain a second policy decision value; and obtaining a second model loss value according to the second strategy discrimination value, adjusting the model parameters of the strategy discrimination model to be trained according to the second model loss value to obtain a target strategy discrimination model, wherein the second strategy discrimination value and the second model loss value form a negative correlation relationship.
Specifically, the second strategy discrimination value is output by the strategy discrimination model to be trained and is used to measure the degree to which the input data conforms to the target interaction strategy. For example, the strategy discrimination value may be the probability, as judged by the strategy discrimination model, that the combination of the target interaction state feature and the target interaction action satisfies the target interaction strategy. The model loss value is derived from a loss function, which is a function representing the "risk" or "loss" of an event. Since the strategy discrimination model is meant to judge whether the input data satisfies the target interaction strategy, the second strategy discrimination value is negatively correlated with the second model loss value: the larger the second strategy discrimination value, the more accurate the prediction and the smaller the model loss value; the smaller the second strategy discrimination value, the less accurate the prediction and the larger the model loss value. When the model parameters of the strategy discrimination model are adjusted, they are adjusted in the direction that reduces the loss value, so that the strategy discrimination model becomes more and more accurate at recognizing real training data that conforms to the target interaction strategy.
In some embodiments, the strategy discrimination model may also be trained using training data that does not conform to the target interaction strategy. For such training data, the expectation is that the discrimination value output by the strategy discrimination model be as low as possible, for example as close to 0 as possible.
In some embodiments, a first model loss value can be obtained according to a first policy decision value, and the first policy decision value and the first model loss value form a positive correlation; and adjusting the model parameters of the target strategy discrimination model according to the first model loss value.
Specifically, since the first interaction action corresponding to the first strategy discrimination value is output by the first interaction model, it may be considered not to conform to the target interaction strategy. The first strategy discrimination value is positively correlated with the first model loss value: the larger the first strategy discrimination value, the less accurate the discrimination and the larger the model loss value; the smaller the first strategy discrimination value, the more accurate the discrimination and the smaller the model loss value. When the model parameters of the strategy discrimination model are adjusted, they are adjusted in the direction that reduces the loss value, so that the strategy discrimination model becomes more and more accurate at recognizing training data that does not conform to the target interaction strategy.
After the model parameters of the target strategy discrimination model have been further adjusted, the updated target strategy discrimination model can be used in the next round of model training as the discrimination model corresponding to the first interaction model, to judge whether the first interaction action output by the first interaction model to be trained conforms to the target interaction strategy in the state corresponding to the first interaction state feature, thereby further improving the efficiency of model training.
In some embodiments, the loss function corresponding to the strategy discrimination model may be as shown in equation (3). In equation (3), L_d1 denotes the model loss value, y denotes the target interaction state feature and target interaction action that conform to the target interaction strategy, y' denotes the first interaction state feature and first interaction action corresponding to the first interaction model, D(y) denotes the second strategy discrimination value obtained by inputting y into the strategy discrimination model D, and D(y') denotes the first strategy discrimination value obtained by inputting y' into the strategy discrimination model D:

L_d1 = log(D(y)) + log(1 - D(y'))    (3)
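A hedged sketch of a discriminator loss built from equation (3) is given below. The text gives L_d1 = log(D(y)) + log(1 - D(y')); the sketch assumes that, in practice, gradient descent is run on its negative (the usual binary cross-entropy form), which is an assumption rather than something the text states.

```python
import torch

def discriminator_loss(d_real, d_fake, eps=1e-8):
    """d_real: D(y), discrimination values for expert state-action pairs that
    conform to the target interaction strategy.
    d_fake: D(y'), discrimination values for pairs produced by the first
    interaction model.
    Returns the negative of the equation-(3) quantity, averaged over a batch,
    so that minimizing it pushes D(y) toward 1 and D(y') toward 0."""
    ld1 = torch.log(d_real + eps) + torch.log(1.0 - d_fake + eps)
    return -ld1.mean()
```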
Fig. 8 is a schematic diagram illustrating the training principle of the strategy discrimination model in some embodiments. Assuming that the target interaction strategy is the shooting strategy of expert-level game users, a sequence (y1, y2, …, yn) of target interaction state features and target interaction actions corresponding to expert-level players and a sequence (y'1, y'2, …, y't) of first interaction state features and first interaction actions corresponding to the first interaction model can be obtained, where yn refers to the state-action pair formed by the target interaction state feature and target interaction action at the n-th moment, and y't refers to the state-action pair formed by the first interaction state feature and first interaction action at the t-th moment. The state-action pairs may be input into the strategy discrimination model one at a time, or a whole sequence may be input. For the sequence (y1, y2, …, yn), or for each state-action pair in it, the probability output by the strategy discrimination model is expected to be 1, that is, the discrimination result is true, indicating conformity with the target interaction strategy. For the sequence (y'1, y'2, …, y't), or for each state-action pair in it, the probability output by the strategy discrimination model is expected to be 0, that is, the discrimination result is false, indicating non-conformity with the target interaction strategy. This approximately binary classification training amounts to using the discrimination model to judge how similar the action sequence output by the first interaction model is to an action sequence conforming to the target interaction strategy, so the strategy discrimination model can be used to encourage the actions output by the first interaction model to conform better to the target interaction strategy.
The principle of the adversarial learning between the first interaction model and the strategy discrimination model is as follows. During the training of the first interaction model, the first interaction action output by the first interaction model may be considered not to conform to the target interaction strategy. The goal of the strategy discrimination model is therefore to discriminate the first interaction state feature and the first interaction action as not conforming to the target interaction strategy, that is, to distinguish them as far as possible from the real target interaction state features and target interaction actions that do conform to the target interaction strategy. Meanwhile, the first interaction model determines a second benefit according to the first strategy discrimination value; this second benefit is feedback on whether the action output by the first interaction model conforms to the target interaction strategy in the state corresponding to the first interaction state feature. By adjusting the parameters of the first interaction model according to the second benefit, the actions output by the first interaction model are made to deceive the strategy discrimination model as much as possible, and the two models thus learn against each other while continuously adjusting their parameters.
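One way this interplay could be realized is sketched below. The patent only requires that the second benefit be positively correlated with the first strategy discrimination value and that the target benefit combine the first and second benefits; the concrete log mapping and the weights are assumptions introduced for illustration.

```python
import math

def second_benefit(first_strategy_discrimination_value, scale=1.0):
    """Map the first strategy discrimination value D(y') to a second benefit.
    A scaled log-probability is used here, which grows as D(y') grows,
    satisfying the positive-correlation requirement."""
    d = min(max(first_strategy_discrimination_value, 1e-8), 1.0 - 1e-8)
    return scale * math.log(d)

def target_benefit(first_benefit, second_benefit_value, w1=1.0, w2=1.0):
    # Combine the environment benefit and the strategy-discrimination benefit;
    # the weights are illustrative (the text describes a plain sum).
    return w1 * first_benefit + w2 * second_benefit_value
```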
The first interaction model obtained by training with the interaction model training method provided in the embodiments of the application can be applied to a game scene, where it can serve as a game AI for battles, for example in the battles of a MOBA game. In a Multiplayer Online Battle Arena (MOBA) game, players are divided into two opposing camps that spread out over the map and compete with each other, with destroying the enemy crystal as the final objective. Players mainly think and act on two levels in the game: one is macro scheduling, that is, the scheduling and coordination of hero units at the level of macro strategy and tactics; the other is micro operation, that is, the operation of a hero unit in a specific scene. For a MOBA game AI, micro operation mainly covers the concrete operations of a hero in the current scene, such as moving, attacking, or releasing skills. Macro strategy mainly covers cooperation among friendly heroes, large-scale transfer and scheduling, and the like, forming certain strategies and tactics so as to better acquire resources or gain a numerical advantage in local battles. The game AI provided by the embodiments of the present application can be used to output micro operations.
In the related art, a reinforcement learning model adjusts its parameters using self-play data and the benefit calculated by a preset value calculation model, so the direction of model training can only be guided by the preset target benefit. The resulting interaction model therefore usually differs greatly from the play strategies commonly used by human players, is monotonous, and lacks adaptability. The real-time, multi-player nature of a MOBA game means that the variations and combinations of opponent strategies become more complex and varied during the course of a game, so a MOBA game AI trained purely by reinforcement learning has a monotonous strategy and finds it difficult to adapt to the diverse adversarial strategies of human players. With the interaction model training method provided in the embodiments of the application, the strategy of human players can be taken as the target interaction strategy, and the discrimination model can be trained with the data of human game players. During the training of the interaction model, the strategy discrimination model judges, via the strategy discrimination value, whether the action output by the interaction model conforms to the human play strategy; a benefit is calculated from the strategy discrimination value, and the model parameters are adjusted with the calculated benefit so as to constrain and guide the training direction of the interaction model, fitting and adapting it to the human play strategy. This amounts to guiding the strategy improvement direction of the game AI with human expert interaction data, so the game AI can learn adversarial strategies close to, or even exceeding, the level of human experts while retaining high-level micro-operation capability in battle. The robustness and adaptability of the interaction model are therefore greatly improved, and the overall capability of the game AI is improved.
The interactive model training method provided by the embodiment of the application is described below by taking training game AI as an example, and comprises the following steps:
1. and acquiring the target interaction action and the target interaction state characteristic corresponding to the target interaction action, wherein the target interaction action is taken as the interaction action which accords with the target interaction strategy in the state corresponding to the target interaction state characteristic.
2. And carrying out model training according to the target interaction state characteristics and the target interaction actions to obtain a target strategy discrimination model.
Specifically, historical match data of high-level players, such as expert-level game players, or the match data of players with a specified play style, may be acquired, and state features and actions sampled from each game. The sampling process may select the interaction-related data corresponding to a certain number of image frames in each preset time period for state feature extraction, and obtain the action performed by the hero in the state corresponding to each selected frame; for example, if there are 30 image frames per second, the interaction-related data corresponding to 5 of them may be extracted. The state feature corresponding to the current moment in the game environment when the player outputs a certain action is extracted, and the target interaction actions together with the corresponding target interaction state features form a training sample set. Model training is then carried out with this training sample set to obtain the target strategy discrimination model.
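The fixed-rate frame sampling just described could be sketched as follows; the helper name and the list-of-frames representation are assumptions for illustration.

```python
def sample_frames(frames, frames_per_second=30, samples_per_second=5):
    """Keep a fixed number of frames out of each second of gameplay
    (e.g. 5 out of 30); state features and the hero's actions are then
    extracted from the interaction-related data of the kept frames."""
    step = frames_per_second // samples_per_second
    return frames[::step]

sampled = sample_frames(list(range(60)))  # two seconds of 30-fps frames -> 10 samples
```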
3. The method comprises the steps of obtaining a first interaction state characteristic and a first interaction action corresponding to a virtual interaction environment, wherein the first interaction action is determined by inputting the first interaction state characteristic into a first interaction model to be trained.
Specifically, the neural network model may be loaded, the network model parameters may be randomly initialized, and the game environment may be loaded when the first interaction model is trained for the first time.
The server can be provided with a self-playing training module, the self-playing training module can select an opponent model from an opponent model pool, and a self-playing script is started in parallel on a plurality of machines to obtain sample data of state characteristics and actions as first interactive state characteristics and first interactive actions.
4. And acquiring the income obtained by the target virtual object executing the first interactive action as the first income.
Specifically, the server may calculate a game return income corresponding to the execution of the first interactive action in the state corresponding to the first interactive state feature. For example, the game return profit may be the sum of the instant profit obtained by performing the first interactive action and the value (profit) corresponding to the next time state calculated by using the state value function.
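A minimal sketch of this first benefit follows: the instant benefit of executing the first interaction action plus the value of the next state given by a state value function. The discount factor is an assumption (set to 1.0 to match the plain sum described above).

```python
def game_return(instant_benefit, next_state_value, discount=1.0):
    """First benefit: instant benefit of the executed action plus the value
    of the next-time state from a state value function."""
    return instant_benefit + discount * next_state_value
```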
5. Inputting the first interaction state characteristic and the first interaction action into a target strategy discrimination model corresponding to a target interaction strategy to obtain a first strategy discrimination value; and calculating to obtain a second benefit according to the first strategy discrimination value.
Specifically, the < state feature, action > data obtained in step 3 may be input into the target interaction policy decision model to obtain an output probability, and a corresponding second benefit may be obtained according to the output probability.
6. And calculating to obtain the target profit according to the first profit and the second profit.
Specifically, the benefits obtained in step 4 and step 5 are summed to obtain the final benefit, namely the target benefit.
7. And adjusting the model parameters of the first interaction model to be trained according to the target income to obtain the updated first interaction model.
Specifically, the parameters of the first interaction model may be updated according to the PPO algorithm. Steps 2-7 can be performed iteratively. The step in which the self-play training module selects an opponent model from the opponent model pool and starts the self-play script in parallel on multiple machines to obtain <state feature, action> sample data can be executed once every preset number of training iterations. After the model has been trained iteratively for a preset number of iterations or a preset duration, the model parameters of the first interaction model are evaluated; if the model has converged, training is stopped and the final first interaction model is saved.
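Since the text names PPO for the parameter update, the following sketch shows the generic clipped PPO surrogate objective; the advantages are assumed to be derived from the target benefit, and this is a standard formulation rather than the patent's exact implementation.

```python
import torch

def ppo_policy_loss(new_log_probs, old_log_probs, advantages, clip_eps=0.2):
    """Clipped PPO surrogate loss: minimizing it maximizes the expected
    (clipped) advantage-weighted probability ratio."""
    ratio = torch.exp(new_log_probs - old_log_probs)
    unclipped = ratio * advantages
    clipped = torch.clamp(ratio, 1.0 - clip_eps, 1.0 + clip_eps) * advantages
    return -torch.min(unclipped, clipped).mean()
```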
In some embodiments, a self-play training module and an expert data assistance module may be deployed on the server. The self-play training module can be the core module, responsible for generating the self-play data required by the first interaction model and iteratively training the first interaction model. The self-play training module may include the following sub-modules: a feature extraction module, a return benefit calculation module, a game self-play module, and a neural network training module. The expert data assistance module may include the following sub-modules: an expert feature extraction module and a discriminator module.
The feature extraction module is used to extract features. In a MOBA game, the situation state is no longer simple board-state information; because of the large map, the multiple virtual objects, incomplete information, and the like, the situation state features are considerably more complex. The feature extraction module can extract the state features with reference to the main state information that a real player considers during the game.
The return benefit calculation module is used to calculate return benefits. During the training of the first interaction model, the action predicted by the first interaction model needs a concrete value to evaluate how good or bad it is for the virtual object, such as a hero, to execute that action. The return benefit indicates the return that the state at a certain time t will yield, and may be the accumulation of the instant benefits at all subsequent times.
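The accumulation of subsequent instant benefits could be computed as in the sketch below; the discount factor is an assumption, not something stated in the text.

```python
def cumulative_return(instant_benefits, discount=0.99):
    """Return benefit for each time step t: the (optionally discounted)
    accumulation of the instant benefits at all subsequent times."""
    returns = []
    g = 0.0
    for r in reversed(instant_benefits):
        g = r + discount * g
        returns.append(g)
    return list(reversed(returns))
```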
The neural network training module is used to train the first interaction model. Data samples are obtained from the game, and the target benefit is obtained according to the return reward given by the game environment and the discrimination value output by the discriminator. By training the first interaction model multiple times with the PPO reinforcement learning algorithm, with maximizing the expected reward (benefit) as the objective, the accuracy of the actions the game AI outputs according to the state of the environment can be improved.
The expert feature extraction module can be used for extracting action track data of an expert and corresponding state feature data. For example, corresponding features and actions can be extracted from normal game-play of a large number of players on the game server, and an expert operation track sample pool is formed for training the strategy judgment model.
The discriminator module may be used to perform strategy discrimination. A discriminator model can be constructed with a deep neural network, and for an input state-action pair it outputs the probability that the pair conforms to the target interaction strategy.
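An illustrative form of such a discriminator is a small fully connected network over the concatenated state-action pair, as sketched below; the layer sizes and architecture are assumptions.

```python
import torch
import torch.nn as nn

class StrategyDiscriminator(nn.Module):
    """Map a concatenated state-action pair to the probability that it
    conforms to the target interaction strategy."""

    def __init__(self, state_dim, action_dim, hidden_dim=128):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(state_dim + action_dim, hidden_dim),
            nn.ReLU(),
            nn.Linear(hidden_dim, hidden_dim),
            nn.ReLU(),
            nn.Linear(hidden_dim, 1),
            nn.Sigmoid(),  # output a probability in (0, 1)
        )

    def forward(self, state, action):
        return self.net(torch.cat([state, action], dim=-1))
```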
It should be understood that, although the steps in the above flowcharts are shown in the order indicated by the arrows, they are not necessarily performed in that order. Unless explicitly stated otherwise herein, the execution of these steps is not strictly limited to this order, and they may be performed in other orders. Moreover, at least a part of the steps in the above flowcharts may include multiple sub-steps or stages, which are not necessarily performed at the same time but may be performed at different times, and whose order of execution is not necessarily sequential; they may be performed in turn or alternately with other steps or with at least a part of the sub-steps or stages of other steps.
In some embodiments, as shown in fig. 9, there is provided an interaction model training apparatus, which may be a part of a computer device using a software module or a hardware module, or a combination of the two, and specifically includes: a first interactive data obtaining module 902, a first profit obtaining module 904, a first policy discrimination value obtaining module 906, a second profit obtaining module 908, a target profit obtaining module 910, and a first interactive model parameter adjusting module 912, wherein:
the first interaction data obtaining module 902 is configured to obtain a first interaction state feature corresponding to the virtual interaction environment, and obtain a first interaction action, where the first interaction action is determined by inputting the first interaction state feature into a first interaction model to be trained.
A first benefit obtaining module 904, configured to obtain a benefit obtained by the target virtual object performing the first interaction as the first benefit.
A first policy decision value obtaining module 906, configured to input the first interaction state feature and the first interaction action into a target policy decision model corresponding to the target interaction policy, so as to obtain a first policy decision value.
A second benefit obtaining module 908, configured to obtain a second benefit by calculating according to the first policy decision value, where the first policy decision value and the second benefit form a positive correlation.
And a target profit obtaining module 910, configured to calculate a target profit according to the first profit and the second profit.
A first interaction model parameter adjusting module 912, configured to adjust a model parameter of the first interaction model to be trained according to the target benefit, to obtain an updated first interaction model.
In some embodiments, the interaction model training apparatus further comprises: the target interaction data acquisition module is used for acquiring a target interaction action and a target interaction state characteristic corresponding to the target interaction action, and the target interaction action is taken as an interaction action which accords with a target interaction strategy under the state corresponding to the target interaction state characteristic; and the strategy discrimination model training module is used for carrying out model training according to the target interaction state characteristics and the target interaction actions to obtain a target strategy discrimination model.
In some embodiments, the policy discriminant model training module comprises: a second strategy judgment value obtaining unit, configured to input the target interaction state feature and the target interaction action into a strategy judgment model to be trained, so as to obtain a second strategy judgment value; and the second model loss value obtaining unit is used for obtaining a second model loss value according to the second strategy judgment value, adjusting the model parameters of the strategy judgment model to be trained according to the second model loss value to obtain a target strategy judgment model, and the second strategy judgment value and the second model loss value form a negative correlation relation.
In some embodiments, the target interaction policy is an interaction policy corresponding to a preset interaction user level, and the target interaction data obtaining module is configured to: acquiring an interactive action obtained according to user operation at a preset interactive user level as a target interactive action; and acquiring the interaction state characteristic corresponding to the target interaction action as the target interaction state characteristic.
In some embodiments, the interaction model training apparatus further comprises: the first model loss value obtaining module is used for obtaining a first model loss value according to a first strategy judgment value, and the first strategy judgment value and the first model loss value form a positive correlation; and the target strategy discrimination model parameter adjusting module is used for adjusting the model parameters of the target strategy discrimination model according to the first model loss value.
In some embodiments, the first interaction data acquisition module 902 is configured to: acquiring a fight model corresponding to a first interaction model to be trained as a second interaction model; controlling a first interaction model to be trained and a second interaction model to interact in a virtual interaction environment to obtain interaction record data corresponding to the first interaction model; and acquiring a first interaction state characteristic and a first interaction action according to the interaction record data.
In some embodiments, the interaction model training apparatus further comprises: and the entering module is used for taking the updated first interaction model as the first interaction model to be trained, entering a step of controlling the first interaction model to be trained and the second interaction model to interact in the virtual interaction environment to obtain interaction record data corresponding to the first interaction model until the updated first interaction model converges or the number of times of model training reaches a preset number of times.
In some embodiments, the first revenue acquisition module 904 is operable to: acquiring a state change corresponding to a virtual interaction environment when a target virtual object executes a first interaction action; and obtaining corresponding benefit according to the state change as the first benefit.
In some embodiments, the first interaction data acquisition module 902 is configured to: acquiring interaction related data corresponding to a virtual interaction environment, wherein the interaction related data comprises object attribute data and object position data; obtaining attribute characteristics according to the object attribute data, and obtaining position characteristics according to the object position data; and combining the attribute features and the position features to obtain first interaction state features.
For the specific definition of the interactive model training device, reference may be made to the above definition of the interactive model training method, which is not described herein again. The modules in the interactive model training device can be wholly or partially realized by software, hardware and a combination thereof. The modules can be embedded in a hardware form or independent from a processor in the computer device, and can also be stored in a memory in the computer device in a software form, so that the processor can call and execute operations corresponding to the modules.
In some embodiments, a computer device is provided, which may be a server, the internal structure of which may be as shown in fig. 10. The computer device includes a processor, a memory, and a network interface connected by a system bus. Wherein the processor of the computer device is configured to provide computing and control capabilities. The memory of the computer device comprises a nonvolatile storage medium and an internal memory. The non-volatile storage medium stores an operating system, a computer program, and a database. The internal memory provides an environment for the operation of an operating system and computer programs in the non-volatile storage medium. The database of the computer device is used for storing interaction model training data. The network interface of the computer device is used for communicating with an external terminal through a network connection. The computer program is executed by a processor to implement an interactive model training method.
Those skilled in the art will appreciate that the architecture shown in fig. 10 is merely a block diagram of some of the structures associated with the disclosed aspects and is not intended to limit the computing devices to which the disclosed aspects apply, as particular computing devices may include more or fewer components than those shown, or may combine certain components, or have a different arrangement of components.
In some embodiments, there is further provided a computer device comprising a memory and a processor, the memory having stored therein a computer program, the processor implementing the steps of the above method embodiments when executing the computer program.
In some embodiments, a computer-readable storage medium is provided, in which a computer program is stored, which computer program, when being executed by a processor, carries out the steps of the above-mentioned method embodiments.
It will be understood by those skilled in the art that all or part of the processes of the methods of the embodiments described above can be implemented by hardware instructions of a computer program, which can be stored in a non-volatile computer-readable storage medium, and when executed, can include the processes of the embodiments of the methods described above. Any reference to memory, storage, database or other medium used in the embodiments provided herein can include at least one of non-volatile and volatile memory. Non-volatile Memory may include Read-Only Memory (ROM), magnetic tape, floppy disk, flash Memory, optical storage, or the like. Volatile Memory can include Random Access Memory (RAM) or external cache Memory. By way of illustration and not limitation, RAM can take many forms, such as Static Random Access Memory (SRAM) or Dynamic Random Access Memory (DRAM), among others.
The technical features of the above embodiments can be arbitrarily combined, and for the sake of brevity, all possible combinations of the technical features in the above embodiments are not described, but should be considered as the scope of the present specification as long as there is no contradiction between the combinations of the technical features.
The above-mentioned embodiments only express several embodiments of the present application, and the description thereof is more specific and detailed, but not construed as limiting the scope of the invention. It should be noted that, for a person skilled in the art, several variations and modifications can be made without departing from the concept of the present application, which falls within the scope of protection of the present application. Therefore, the protection scope of the present patent shall be subject to the appended claims.

Claims (15)

1. A method of interaction model training, the method comprising:
acquiring a first interaction state characteristic corresponding to a virtual interaction environment, and acquiring a first interaction action, wherein the first interaction action is determined by inputting the first interaction state characteristic into a first interaction model to be trained;
obtaining the income obtained by the target virtual object executing the first interactive action as the first income;
inputting the first interaction state feature and the first interaction action into a target strategy discrimination model corresponding to a target interaction strategy to obtain a first strategy discrimination value;
calculating to obtain a second benefit according to the first strategy judgment value, wherein the first strategy judgment value and the second benefit form a positive correlation;
calculating to obtain a target benefit according to the first benefit and the second benefit;
and adjusting the model parameters of the first interaction model to be trained according to the target income to obtain an updated first interaction model.
2. The method of claim 1, further comprising:
acquiring a target interaction action and a target interaction state characteristic corresponding to the target interaction action, wherein the target interaction action is taken as an interaction action which accords with a target interaction strategy under the state corresponding to the target interaction state characteristic;
and performing model training according to the target interaction state characteristics and the target interaction actions to obtain the target strategy discrimination model.
3. The method of claim 2, wherein the performing model training according to the target interaction state features and the target interaction actions to obtain the target policy decision model comprises:
inputting the target interaction state characteristics and the target interaction actions into a strategy discrimination model to be trained to obtain a second strategy discrimination value;
and obtaining a second model loss value according to the second strategy discrimination value, and adjusting the model parameters of the strategy discrimination model to be trained according to the second model loss value to obtain the target strategy discrimination model, wherein the second strategy discrimination value and the second model loss value form a negative correlation relationship.
4. The method according to claim 2, wherein the target interaction policy is an interaction policy corresponding to a preset interaction user level, and the obtaining of the target interaction action and the target interaction state feature corresponding to the target interaction action comprises:
acquiring an interactive action obtained according to the user operation of the preset interactive user level as a target interactive action;
and acquiring the interaction state characteristic corresponding to the target interaction action as the target interaction state characteristic.
5. The method of claim 1, further comprising:
obtaining a first model loss value according to the first strategy judgment value, wherein the first strategy judgment value and the first model loss value form a positive correlation;
and adjusting the model parameters of the target strategy discrimination model according to the first model loss value.
6. The method of claim 1, wherein obtaining a first interaction state feature corresponding to the virtual interaction environment and obtaining a first interaction comprises:
acquiring a fight model corresponding to a first interaction model to be trained as a second interaction model;
controlling the first interaction model to be trained and the second interaction model to interact in a virtual interaction environment to obtain interaction record data corresponding to the first interaction model;
and acquiring a first interaction state characteristic and a first interaction action according to the interaction record data.
7. The method of claim 6, further comprising:
and taking the updated first interaction model as the first interaction model to be trained, and entering a step of controlling the first interaction model to be trained and the second interaction model to interact in a virtual interaction environment to obtain interaction record data corresponding to the first interaction model until the updated first interaction model converges or the number of times of model training reaches a preset number of times.
8. The method of claim 1, wherein obtaining the benefit of the target virtual object from performing the first interactive action comprises, as the first benefit:
acquiring the state change corresponding to the virtual interaction environment before and after the target virtual object executes the first interaction action;
and obtaining corresponding benefit according to the state change as the first benefit.
9. The method of claim 1, wherein obtaining the first interaction state feature corresponding to the virtual interaction environment comprises:
acquiring interaction related data corresponding to a virtual interaction environment, wherein the interaction related data comprises object attribute data and object position data;
obtaining attribute characteristics according to the object attribute data, and obtaining position characteristics according to the object position data;
and combining the attribute characteristics and the position characteristics to obtain first interaction state characteristics.
10. An interaction model training apparatus, the apparatus comprising:
the system comprises a first interaction data acquisition module, a first interaction model training module and a second interaction data acquisition module, wherein the first interaction data acquisition module is used for acquiring first interaction state characteristics corresponding to a virtual interaction environment and acquiring first interaction actions, and the first interaction actions are determined by inputting the first interaction state characteristics into the first interaction model to be trained;
the first profit obtaining module is used for obtaining profits obtained by the target virtual object executing the first interactive action as first profits;
a first policy judgment value obtaining module, configured to input the first interaction state feature and the first interaction action into a target policy judgment model corresponding to a target interaction policy, so as to obtain a first policy judgment value;
a second profit obtaining module, configured to obtain a second profit through calculation according to the first policy decision value, where the first policy decision value and the second profit form a positive correlation;
the target income obtaining module is used for calculating and obtaining target income according to the first income and the second income;
and the first interaction model parameter adjusting module is used for adjusting the model parameters of the first interaction model to be trained according to the target income to obtain an updated first interaction model.
11. The apparatus of claim 10, further comprising:
the target interaction data acquisition module is used for acquiring a target interaction action and a target interaction state characteristic corresponding to the target interaction action, and the target interaction action is taken as an interaction action which accords with the target interaction strategy under the state corresponding to the target interaction state characteristic;
and the strategy judgment model training module is used for carrying out model training according to the target interaction state characteristics and the target interaction actions to obtain the target strategy judgment model.
12. The apparatus of claim 11, wherein the policy discriminant model training module comprises:
a second policy decision value obtaining unit, configured to input the target interaction state feature and the target interaction action into a policy decision model to be trained, so as to obtain a second policy decision value;
and the second model loss value obtaining unit is used for obtaining a second model loss value according to the second strategy judgment value, adjusting the model parameters of the strategy judgment model to be trained according to the second model loss value to obtain the target strategy judgment model, and the second strategy judgment value and the second model loss value form a negative correlation relationship.
13. The apparatus of claim 10, further comprising:
a first model loss value obtaining module, configured to obtain a first model loss value according to the first policy decision value, where the first policy decision value and the first model loss value form a positive correlation;
and the target strategy discrimination model parameter adjusting module is used for adjusting the model parameters of the target strategy discrimination model according to the first model loss value.
14. A computer device comprising a memory and a processor, the memory storing a computer program, characterized in that the processor realizes the steps of the method of any one of claims 1 to 9 when executing the computer program.
15. A computer-readable storage medium, in which a computer program is stored which, when being executed by a processor, carries out the steps of the method according to any one of claims 1 to 9.
CN202010247990.4A 2020-04-01 2020-04-01 Interactive model training method and device, computer equipment and storage medium Active CN111111204B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202010247990.4A CN111111204B (en) 2020-04-01 2020-04-01 Interactive model training method and device, computer equipment and storage medium

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202010247990.4A CN111111204B (en) 2020-04-01 2020-04-01 Interactive model training method and device, computer equipment and storage medium

Publications (2)

Publication Number Publication Date
CN111111204A true CN111111204A (en) 2020-05-08
CN111111204B CN111111204B (en) 2020-07-03

Family

ID=70493944

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202010247990.4A Active CN111111204B (en) 2020-04-01 2020-04-01 Interactive model training method and device, computer equipment and storage medium

Country Status (1)

Country Link
CN (1) CN111111204B (en)

Patent Citations (8)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20190272465A1 (en) * 2018-03-01 2019-09-05 International Business Machines Corporation Reward estimation via state prediction using expert demonstrations
CN108724182A (en) * 2018-05-23 2018-11-02 苏州大学 End-to-end game robot generation method based on multi-class learning by imitation and system
CN109091867A (en) * 2018-07-26 2018-12-28 深圳市腾讯网络信息技术有限公司 Method of controlling operation thereof, device, equipment and storage medium
CN109471712A (en) * 2018-11-21 2019-03-15 腾讯科技(深圳)有限公司 Dispatching method, device and the equipment of virtual objects in virtual environment
CN109847366A (en) * 2019-01-29 2019-06-07 腾讯科技(深圳)有限公司 Data for games treating method and apparatus
CN109893857A (en) * 2019-03-14 2019-06-18 腾讯科技(深圳)有限公司 A kind of method, the method for model training and the relevant apparatus of operation information prediction
CN110052031A (en) * 2019-04-11 2019-07-26 网易(杭州)网络有限公司 The imitation method, apparatus and readable storage medium storing program for executing of player
CN110489340A (en) * 2019-07-29 2019-11-22 腾讯科技(深圳)有限公司 A kind of map balance test method, device, equipment and storage medium

Non-Patent Citations (2)

* Cited by examiner, † Cited by third party
Title
JIADONG XIAO et al.: "Reinforcement Learning for Robotic Time-optimal Path Tracking", arXiv.org *
付强 et al.: "Design and Implementation of a Self-Learning Game Program Based on the RL Algorithm", Journal of Changsha University of Science and Technology (Natural Science) *

Cited By (16)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN111589120A (en) * 2020-05-14 2020-08-28 深圳海普参数科技有限公司 Object control method, computer device, and computer-readable storage medium
CN111589157A (en) * 2020-05-14 2020-08-28 超参数科技(深圳)有限公司 AI model training method, AI model using method, equipment and storage medium
CN111589157B (en) * 2020-05-14 2023-10-31 超参数科技(深圳)有限公司 AI model using method, apparatus and storage medium
CN111760291A (en) * 2020-07-06 2020-10-13 腾讯科技(深圳)有限公司 Game interaction behavior model generation method and device, server and storage medium
CN112076475A (en) * 2020-09-02 2020-12-15 深圳市火元素网络技术有限公司 Interaction control method and device, computer equipment and storage medium
CN112274925A (en) * 2020-10-28 2021-01-29 超参数科技(深圳)有限公司 AI model training method, calling method, server and storage medium
CN112274925B (en) * 2020-10-28 2024-02-27 超参数科技(深圳)有限公司 AI model training method, calling method, server and storage medium
CN112364500B (en) * 2020-11-09 2021-07-20 中国科学院自动化研究所 Multi-concurrency real-time countermeasure system oriented to reinforcement learning training and evaluation
CN112364500A (en) * 2020-11-09 2021-02-12 中国科学院自动化研究所 Multi-concurrency real-time countermeasure system oriented to reinforcement learning training and evaluation
CN112862108A (en) * 2021-02-07 2021-05-28 超参数科技(深圳)有限公司 Componentized reinforcement learning model processing method, system, equipment and storage medium
CN112933600A (en) * 2021-03-09 2021-06-11 超参数科技(深圳)有限公司 Virtual object control method and device, computer equipment and storage medium
CN113318449A (en) * 2021-06-17 2021-08-31 上海幻电信息科技有限公司 Game element interaction numeralization method and system
CN115212576A (en) * 2022-09-20 2022-10-21 腾讯科技(深圳)有限公司 Game data processing method, device, equipment and storage medium
CN115212576B (en) * 2022-09-20 2022-12-02 腾讯科技(深圳)有限公司 Game data processing method, device, equipment and storage medium
CN116363452A (en) * 2023-03-07 2023-06-30 阿里巴巴(中国)有限公司 Task model training method and device
CN116363452B (en) * 2023-03-07 2024-01-09 阿里巴巴(中国)有限公司 Task model training method and device

Also Published As

Publication number Publication date
CN111111204B (en) 2020-07-03

Similar Documents

Publication Publication Date Title
CN111111204B (en) Interactive model training method and device, computer equipment and storage medium
Vinyals et al. Starcraft ii: A new challenge for reinforcement learning
JP7159458B2 (en) Method, apparatus, device and computer program for scheduling virtual objects in a virtual environment
US11938403B2 (en) Game character behavior control method and apparatus, storage medium, and electronic device
CN111111220B (en) Self-chess-playing model training method and device for multiplayer battle game and computer equipment
US7636701B2 (en) Query controlled behavior models as components of intelligent agents
CN111282267B (en) Information processing method, information processing apparatus, information processing medium, and electronic device
CN112791394B (en) Game model training method and device, electronic equipment and storage medium
CN109925717B (en) Game victory rate prediction method, model generation method and device
CN113688977A (en) Confrontation task oriented man-machine symbiosis reinforcement learning method and device, computing equipment and storage medium
CN112215328A (en) Training of intelligent agent, and action control method and device based on intelligent agent
Mendonça et al. Simulating human behavior in fighting games using reinforcement learning and artificial neural networks
CN114404975A (en) Method, device, equipment, storage medium and program product for training decision model
CN113509726A (en) Interactive model training method and device, computer equipment and storage medium
CN113230650A (en) Data processing method and device and computer readable storage medium
Ilya et al. Imitation of human behavior in 3d-shooter game
CN114344889B (en) Game strategy model generation method and control method of intelligent agent in game
CN111437605B (en) Method for determining virtual object behaviors and hosting virtual object behaviors
CN114404976A (en) Method and device for training decision model, computer equipment and storage medium
CN116747521B (en) Method, device, equipment and storage medium for controlling intelligent agent to conduct office
Liu et al. Increasing physics realism when evolving micro behaviors for 3D RTS games
CN114254722B (en) Multi-intelligent-model fusion method for game confrontation
KR102343354B1 (en) Energry charging apparatus and method for game
CN116966573A (en) Interaction model processing method, device, computer equipment and storage medium
CN116956007A (en) Pre-training method, device and equipment for artificial intelligent model and storage medium

Legal Events

Date Code Title Description
PB01 Publication
PB01 Publication
SE01 Entry into force of request for substantive examination
SE01 Entry into force of request for substantive examination
GR01 Patent grant
GR01 Patent grant